## Anime recommender using Jaccard similarity

Recommendation systems use data to predict, or at least narrow down, what a user is likely to enjoy, based on user history, item attributes or contextual features. There are generally three types of recommenders:
- Item-based
- User-based
- Hybrid

In this notebook, an item-based recommender is built from item attributes by calculating the Jaccard similarity between items.
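
As a quick illustration of the measure itself, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below is a minimal, hand-rolled version; the helper `jaccard_similarity` and the two genre sets are hypothetical and are not part of the dataset or of the cells that follow.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity |A & B| / |A | B| between two collections of labels."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # convention chosen for this sketch: two empty sets have similarity 0
        return 0.0
    return len(a & b) / len(union)


# hypothetical genre sets, for illustration only
anime_a = {"Action", "Adventure", "Comedy"}
anime_b = {"Action", "Comedy", "Drama"}

# 2 shared genres out of 4 distinct genres
print(jaccard_similarity(anime_a, anime_b))  # 0.5
```

The rest of the notebook computes the same quantity for every pair of anime at once, using one-hot encoded attributes instead of Python sets.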

In [None]:
# importing packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial.distance import pdist, squareform

In [None]:
# loading data
df = pd.read_csv("../data/processed/clean_data.csv")
df.head(2)

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x=df['type'])
plt.title("Count of Anime Type", fontsize=15, weight='bold')
plt.show()

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(data=df, x='type', y='Feedback', estimator='mean', errorbar='sd')
plt.title("Average Feedback by Anime Type", fontsize=15, weight='bold')
plt.show()

In [None]:
# function to clean the genre column

def clean_genre(genre):
    """Strip stray whitespace around each comma-separated genre label."""
    genre_list = genre.split(',')
    return ",".join(x.strip() for x in genre_list)

In [None]:
df['genre'] = df['genre'].apply(clean_genre)

## Jaccard Similarity

In [None]:
# since Feedback will not be used as a feature, we only need one entry per anime, because each anime has the same genres across its entries
df_subset = df.drop_duplicates(subset="name")
df_subset.head(2)

In [None]:
# creating dummy variables for genre, Audience and type
df_dummies = df_subset['genre'].str.get_dummies(sep=',')
df_dummies = pd.concat([df_dummies, df_subset['Audience'].str.get_dummies(), df_subset['type'].str.get_dummies()], axis=1)
df_dummies.head()
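
To make the encoding step concrete, here is a small sketch of what `str.get_dummies` produces on a toy frame. The rows and the `toy` names are hypothetical and are not drawn from `clean_data.csv`; only the `genre` and `type` column names mirror the notebook.

```python
import pandas as pd

# hypothetical toy rows, not taken from the dataset
toy = pd.DataFrame({
    "name": ["A", "B"],
    "genre": ["Action,Comedy", "Comedy,Drama"],
    "type": ["TV", "Movie"],
})

# one 0/1 column per distinct genre label
genre_dummies = toy["genre"].str.get_dummies(sep=",")
# one 0/1 column per distinct type value
type_dummies = toy["type"].str.get_dummies()

toy_dummies = pd.concat([genre_dummies, type_dummies], axis=1)
toy_dummies.index = toy["name"]
print(toy_dummies)
```

Each row is now a binary vector, which is the form the Jaccard metric works on: a 1 means the anime has that attribute, a 0 means it does not.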

#### Finding the distances between all anime using their genre, audience and type

In [None]:
# calculating the Jaccard distance between every pair of anime
jaccard_distance = pdist(df_dummies.values, metric="jaccard")

# converting it to squareform
squared_jaccard_distance = squareform(jaccard_distance)
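
The shapes involved here can be easy to lose track of, so the following sketch runs the same two calls on a tiny, hypothetical feature matrix (three items, four binary attributes). `pdist` returns a condensed vector with one entry per unordered pair, and `squareform` expands it into a symmetric matrix with zeros on the diagonal.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# hypothetical 0/1 feature rows for three items
features = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
], dtype=bool)

# condensed distances: n * (n - 1) / 2 = 3 entries for n = 3 items
condensed = pdist(features, metric="jaccard")

# symmetric 3 x 3 matrix with zeros on the diagonal
square = squareform(condensed)

# similarity is the complement of the distance
similarity = 1 - square
print(similarity)
```

Items 0 and 2 have identical features, so their similarity is 1; items 0 and 1 share one attribute out of three distinct ones, giving 1/3.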

In [None]:
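# Jaccard similarity = 1 - Jaccard distance
jaccard_array = 1 - squared_jaccard_distance

# converting to a dataframe (note: despite its name, distance_df holds similarities)
distance_df = pd.DataFrame(jaccard_array, index=df_subset['name'], columns=df_subset['name'])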

In [None]:
distance_df.head()

In [None]:
def get_jaccard_recommendation(name):
    # similarity of every anime to `name`; drop the anime itself explicitly,
    # since ties at similarity 1.0 mean it is not guaranteed to sort first
    similarities = distance_df[name].drop(labels=name)
    top_10 = similarities.sort_values(ascending=False).head(10)
    return top_10.rename_axis('Anime').reset_index(name='jaccard_similarity')

In [None]:
get_jaccard_recommendation('Naruto')