# Content Based Filtering

One of the simplest recommender models is a content-based filtering model. In this model, we build a profile for every user from the items they have already rated, and represent both the user and the items as vectors in a shared feature space.

Once we have a user profile, we can take its dot product with each item representation and recommend the highest-scoring items, which amounts to a nearest-neighbour search in that shared space.
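
A minimal sketch of that scoring step, using made-up three-dimensional vectors rather than the anime data. The names `rank_items`, `item_matrix` and `top_k` are illustrative assumptions and do not appear elsewhere in this notebook.

```python
import numpy as np

def rank_items(user_profile, item_matrix, top_k=2):
    """Return the indices of the top_k items by dot-product score, plus all scores."""
    scores = item_matrix @ user_profile          # one score per item
    order = np.argsort(-scores, kind="stable")   # highest score first
    return order[:top_k], scores

# Toy data (hypothetical): 4 items described by 3 features each
item_matrix = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.6, 0.8, 0.0],
])
user_profile = np.array([3.0, 1.0, 0.0])

top_items, scores = rank_items(user_profile, item_matrix)
print(top_items, scores)
```

Items whose feature vectors point in a similar direction to the user profile receive the highest scores, so here the first and last items are ranked ahead of the other two.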

In [76]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
%matplotlib inline

In [48]:
anime_data = pd.read_csv("../data/animes.csv")
profile_data = pd.read_csv("../data/profiles.csv")
reviews_data = pd.read_csv("../data/ModifiedReviewsData.csv")

In [3]:
anime_data = anime_data.drop_duplicates(subset = ["uid"]).reset_index(drop = True)
anime_data.shape

(16216, 12)

In [7]:
anime_data.head(2)

Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,ranked,score,img_url,link
0,28891,Haikyuu!! Second Season,Following their participation at the Inter-Hig...,"['Comedy', 'Sports', 'Drama', 'School', 'Shoun...","Oct 4, 2015 to Mar 27, 2016",25.0,489888,141,25.0,8.82,https://cdn.myanimelist.net/images/anime/9/766...,https://myanimelist.net/anime/28891/Haikyuu_Se...
1,23273,Shigatsu wa Kimi no Uso,Music accompanies the path of the human metron...,"['Drama', 'Music', 'Romance', 'School', 'Shoun...","Oct 10, 2014 to Mar 20, 2015",22.0,995473,28,24.0,8.83,https://cdn.myanimelist.net/images/anime/3/671...,https://myanimelist.net/anime/23273/Shigatsu_w...


As the dataframe above shows, each anime has several features, both categorical (such as genre) and numerical (such as episodes, members and score). How we build the user profiles from these features is up to us, but the choice determines how useful the recommendations will be.

The drawback of content-based filtering is that we cannot serve recommendations to a person who has no user profile, and a profile can only be built once the user has interacted with at least some of the items.

Let's approach this problem using only the genre of an anime first, and later introduce numerical features such as episodes, members and score to build the user profile.

# Anime-genre mapping matrix

In [14]:
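# Get a list of all the unique genres across all anime
import ast

all_genre = []
for genre in anime_data.genre:
    # Each entry is a string such as "['Comedy', 'Sports']"; parse it without executing code
    all_genre.extend(ast.literal_eval(genre))
all_genre = sorted(set(all_genre))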

In [28]:
print(f"There are a total of {len(all_genre)} genres:")
print(all_genre)

There are a total of 43 genres:
['Action', 'Adventure', 'Cars', 'Comedy', 'Dementia', 'Demons', 'Drama', 'Ecchi', 'Fantasy', 'Game', 'Harem', 'Hentai', 'Historical', 'Horror', 'Josei', 'Kids', 'Magic', 'Martial Arts', 'Mecha', 'Military', 'Music', 'Mystery', 'Parody', 'Police', 'Psychological', 'Romance', 'Samurai', 'School', 'Sci-Fi', 'Seinen', 'Shoujo', 'Shoujo Ai', 'Shounen', 'Shounen Ai', 'Slice of Life', 'Space', 'Sports', 'Super Power', 'Supernatural', 'Thriller', 'Vampire', 'Yaoi', 'Yuri']


In [35]:
# Create a container to hold all the anime -> genre mapping
entries = []
for row in anime_data[["uid", "title", "genre"]].itertuples():
    # Get the uid, title of anime and the list of genres it belongs to
    uid = row[1]
    title = row[2]
    genres = eval(row[3])
    
    # Start from an all-zero vector with one slot per genre
    genre_map = [0] * len(all_genre)
    
    for genre in genres:
        # Weight each genre by 1/sqrt(n_genres) so that every row has unit L2 norm
        genre_map[all_genre.index(genre)] = np.sqrt(1 / len(genres))
    
    entry = [uid, title] + genre_map + [len(genres)]
    entries.append(entry)
    
# Create a dataframe of all the anime and their respective genres
anime_genre_map_df = pd.DataFrame(entries, columns = ["uid", "title"] + all_genre + ["n_genres"])
anime_genre_map_df.head(2)

Unnamed: 0,uid,title,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,...,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri,n_genres
0,28891,Haikyuu!! Second Season,0.0,0.0,0.0,0.447214,0.0,0.0,0.447214,0.0,...,0.0,0.0,0.447214,0.0,0.0,0.0,0.0,0.0,0.0,5
1,23273,Shigatsu wa Kimi no Uso,0.0,0.0,0.0,0.0,0.0,0.0,0.447214,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
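
The weight `np.sqrt(1 / len(genres))` gives every anime a genre vector of unit Euclidean length, so an anime tagged with many genres does not dominate a dot product simply by having more non-zero entries. For the two rows above, five genres give 1/sqrt(5) ≈ 0.447214 per genre. A small self-contained check of that property, using a hypothetical helper `genre_vector` and a four-genre toy vocabulary:

```python
import numpy as np

def genre_vector(genres, vocabulary):
    """Encode a list of genres as a unit-length vector over the vocabulary."""
    vec = np.zeros(len(vocabulary))
    for g in genres:
        vec[vocabulary.index(g)] = np.sqrt(1 / len(genres))
    return vec

vocabulary = ["Action", "Comedy", "Drama", "Sports"]  # toy subset, not the full genre list
one_genre = genre_vector(["Comedy"], vocabulary)
three_genres = genre_vector(["Comedy", "Drama", "Sports"], vocabulary)

# Both norms should be 1 up to floating-point rounding
print(np.linalg.norm(one_genre), np.linalg.norm(three_genres))
```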


In [34]:
# Have a look at the features which represent Haikyuu!! Second Season
anime_genre_map_df.iloc[0, 2:-1]

Action                  0
Adventure               0
Cars                    0
Comedy           0.447214
Dementia                0
Demons                  0
Drama            0.447214
Ecchi                   0
Fantasy                 0
Game                    0
Harem                   0
Hentai                  0
Historical              0
Horror                  0
Josei                   0
Kids                    0
Magic                   0
Martial Arts            0
Mecha                   0
Military                0
Music                   0
Mystery                 0
Parody                  0
Police                  0
Psychological           0
Romance                 0
Samurai                 0
School           0.447214
Sci-Fi                  0
Seinen                  0
Shoujo                  0
Shoujo Ai               0
Shounen          0.447214
Shounen Ai              0
Slice of Life           0
Space                   0
Sports           0.447214
Super Power             0
Supernatural

# Create a User Profile for every user

In [73]:
def create_user_profile(user_df):
    
    # Iterate over all anime which the user has rated and sum their genre representations,
    # each weighted by the score the user gave, to build the user profile.
    
    user_representation = np.zeros(len(all_genre))

    for row in user_df.itertuples():
        # row[2] is used as the anime uid and row[3] as the score the user gave it
        anime_mapping = anime_genre_map_df[anime_genre_map_df.uid == row[2]].reset_index(drop = True).iloc[0, 2:-1].values
        score = row[3]
        # np.float was removed in NumPy 1.24; the builtin float is equivalent
        user_representation += np.array(anime_mapping, dtype = float) * score
        
    return user_representation
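
The same weighted sum can also be written as a single vector-matrix product, which avoids a dataframe lookup per rated anime. The sketch below runs on toy arrays. `build_profile`, `genre_matrix` and `scores` are assumed names, and it presumes the genre vectors of the anime a user has rated are already stacked row-wise in the same order as the scores.

```python
import numpy as np

def build_profile(genre_matrix, scores):
    """Sum each rated item's genre vector, weighted by the user's score."""
    genre_matrix = np.asarray(genre_matrix, dtype=float)
    scores = np.asarray(scores, dtype=float)
    return scores @ genre_matrix   # shape: (n_genres,)

# Toy data (hypothetical): 3 rated items over 3 genres, unnormalized for readability
genre_matrix = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
])
scores = np.array([8, 5, 10])

profile = build_profile(genre_matrix, scores)

# Reference result from an explicit loop, mirroring create_user_profile
looped = np.zeros(3)
for vec, score in zip(genre_matrix, scores):
    looped += vec * score

print(profile, looped)
```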

In [None]:
users = reviews_data.uid.unique()
user_profiles = {}

for user in tqdm(users, desc = "Extracting user profiles..."):
    user_df = reviews_data[reviews_data.uid == user].reset_index(drop = True)
    user_profiles[user] = create_user_profile(user_df)

Extracting user profiles...:   0%|          | 0/130519 [00:00<?, ?it/s]

# Recommend anime to any given user

In [None]:
def recommend_anime(user_profile):
    