# Content Based Filtering

One of the simplest recommender models is a Content Based Filtering model. Unlike Collaborative Filtering, it relies on the features of the items rather than on the ratings of other users. In this model, we create a profile for every user based on the items that they have already rated and project both the user and the item into a shared space.

Once we have a user profile, we can take its dot product with the different item representations and recommend the items with the highest scores, which amounts to a nearest neighbour search.
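
As a minimal sketch of that scoring step, the snippet below ranks three items against one user profile. The three-feature space and every number in it are made up for illustration; they are not taken from the dataset.

```python
import numpy as np

# Hypothetical 3-feature space (e.g. three genres); all values are illustrative
item_vectors = np.array([
    [1.0, 0.0, 0.0],   # item 0
    [0.0, 1.0, 0.0],   # item 1
    [0.8, 0.6, 0.0],   # item 2
])
user_profile = np.array([2.0, 1.0, 0.0])

# One dot product per item: a higher score means a closer match to the profile
scores = item_vectors @ user_profile

# Item indices ordered from best to worst match
ranking = np.argsort(scores)[::-1]
print(ranking, scores[ranking])
```

Item 2 ranks first here because it overlaps with both features the user has weight on.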

In [1]:
import pandas as pd
import numpy as np
import random

import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

In [2]:
anime_data = pd.read_csv("../data/animes.csv")
profile_data = pd.read_csv("../data/profiles.csv")
reviews_data = pd.read_csv("../data/ModifiedReviewsData.csv")

In [3]:
anime_data = anime_data.drop_duplicates(subset = ["uid"]).reset_index(drop = True)
anime_data.shape

(16216, 12)

In [4]:
anime_data.head(2)

Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,ranked,score,img_url,link
0,28891,Haikyuu!! Second Season,Following their participation at the Inter-Hig...,"['Comedy', 'Sports', 'Drama', 'School', 'Shoun...","Oct 4, 2015 to Mar 27, 2016",25.0,489888,141,25.0,8.82,https://cdn.myanimelist.net/images/anime/9/766...,https://myanimelist.net/anime/28891/Haikyuu_Se...
1,23273,Shigatsu wa Kimi no Uso,Music accompanies the path of the human metron...,"['Drama', 'Music', 'Romance', 'School', 'Shoun...","Oct 10, 2014 to Mar 20, 2015",22.0,995473,28,24.0,8.83,https://cdn.myanimelist.net/images/anime/3/671...,https://myanimelist.net/anime/23273/Shigatsu_w...


As we see from the above dataframe, an anime has many different features, both categorical and numerical. It is up to us how to create the user profiles, but they must be created in a smart way.

The drawback with CB Filtering is that if we don't have a user profile for a given person, then it is not possible to serve them recommendations. And a user profile can only be created after the user has interacted with at least some items.

Let's first attempt this problem using only the genre of an anime, and later introduce numerical features like episodes, members, score etc. to build a user profile.

# Anime-genre mapping matrix

In [5]:
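# Get a list of all the unique genres across all anime
import ast

all_genre = []
for genre in anime_data.genre:
    # Each genre cell is a stringified Python list; literal_eval parses it without executing code
    all_genre.extend(ast.literal_eval(genre))
all_genre = sorted(set(all_genre))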

In [6]:
print(f"There are a total of {len(all_genre)} genres:")
print(all_genre)

There are a total of 43 genres:
['Action', 'Adventure', 'Cars', 'Comedy', 'Dementia', 'Demons', 'Drama', 'Ecchi', 'Fantasy', 'Game', 'Harem', 'Hentai', 'Historical', 'Horror', 'Josei', 'Kids', 'Magic', 'Martial Arts', 'Mecha', 'Military', 'Music', 'Mystery', 'Parody', 'Police', 'Psychological', 'Romance', 'Samurai', 'School', 'Sci-Fi', 'Seinen', 'Shoujo', 'Shoujo Ai', 'Shounen', 'Shounen Ai', 'Slice of Life', 'Space', 'Sports', 'Super Power', 'Supernatural', 'Thriller', 'Vampire', 'Yaoi', 'Yuri']


In [7]:
import ast

# Create a container to hold all the anime -> genre mapping
entries = []
for row in anime_data[["uid", "title", "genre"]].itertuples():
    # Get the uid, title of anime and the list of genres it belongs to
    uid = row[1]
    title = row[2]
    genres = ast.literal_eval(row[3])

    # One slot per genre; slots for genres the anime does not have stay at 0
    genre_map = [0] * len(all_genre)

    for genre in genres:
        # Weight each genre by sqrt(1 / n_genres) so that every row has unit L2 norm
        genre_map[all_genre.index(genre)] = np.sqrt(1 / len(genres))

    entry = [uid, title] + [len(genres)] + genre_map
    entries.append(entry)

# Create a dataframe of all the anime and its respective genres
anime_genre_map_df = pd.DataFrame(entries, columns = ["uid", "title", "n_genres"] + all_genre)
anime_genre_map_df.head(2)

Unnamed: 0,uid,title,n_genres,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
0,28891,Haikyuu!! Second Season,5,0.0,0.0,0.0,0.447214,0.0,0.0,0.447214,...,0.0,0.0,0.0,0.447214,0.0,0.0,0.0,0.0,0.0,0.0
1,23273,Shigatsu wa Kimi no Uso,5,0.0,0.0,0.0,0.0,0.0,0.0,0.447214,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
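
The value `0.447214` above is `sqrt(1 / 5)` for an anime with five genres. The sketch below repeats the same encoding on a toy three-genre vocabulary to show that every encoded row ends up with unit L2 norm, whatever its number of genres. The helper `encode_genres` and the toy vocabulary are hypothetical names used only for this illustration.

```python
import numpy as np

toy_vocabulary = ["Comedy", "Drama", "Sports"]  # illustrative, not the full genre list

def encode_genres(genres, vocabulary):
    """Encode a list of genres as a vector whose non-zero entries are sqrt(1 / n_genres)."""
    vec = np.zeros(len(vocabulary))
    for g in genres:
        vec[vocabulary.index(g)] = np.sqrt(1 / len(genres))
    return vec

two_genres = encode_genres(["Comedy", "Drama"], toy_vocabulary)
one_genre = encode_genres(["Sports"], toy_vocabulary)

# Both norms are 1 (up to floating point error)
print(np.linalg.norm(two_genres), np.linalg.norm(one_genre))
```

Because every anime vector has the same length, the dot product between two anime vectors is their cosine similarity.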


In [8]:
# Have a look at the features which represent Haikyuu!! Second Season
anime_genre_map_df.iloc[0, 3:]

Action                  0
Adventure               0
Cars                    0
Comedy           0.447214
Dementia                0
Demons                  0
Drama            0.447214
Ecchi                   0
Fantasy                 0
Game                    0
Harem                   0
Hentai                  0
Historical              0
Horror                  0
Josei                   0
Kids                    0
Magic                   0
Martial Arts            0
Mecha                   0
Military                0
Music                   0
Mystery                 0
Parody                  0
Police                  0
Psychological           0
Romance                 0
Samurai                 0
School           0.447214
Sci-Fi                  0
Seinen                  0
Shoujo                  0
Shoujo Ai               0
Shounen          0.447214
Shounen Ai              0
Slice of Life           0
Space                   0
Sports           0.447214
Super Power             0
Supernatural

# Create a User Profile for every user

In [9]:
def create_user_profile(user_df, anime_df, binary = False):
    
    # Iterate over all anime which the user has rated and add up their representations,
    # weighted by the score the user gave, in order to build a user profile of sorts.
    
    features_length = len(anime_df.columns[3:])
    user_representation = np.zeros(features_length)

    for row in user_df.itertuples():
        # row[2] is the anime uid and row[3] is the score this user gave it
        anime_mapping = anime_df[anime_df.uid == row[2]].reset_index(drop = True).iloc[0, 3:].values
        mean_score = row[3]
        # If you want to binarize the output, you could do it by taking -1 for anime
        # rated 5 or below and 1 for anime rated above 5 (very simplistic)
        if binary:
            sc = 1 
            if mean_score <= 5: sc = -1
            # Use the builtin float: the np.float alias has been removed from recent NumPy releases
            user_representation += np.array(anime_mapping, dtype = float) * sc
        else:
            user_representation += np.array(anime_mapping, dtype = float) * mean_score
        
    return user_representation

In [10]:
# Extract the user profiles using the function defined above
users = reviews_data.uid.unique()
user_profiles = {}

for user in tqdm(users, desc = "Extracting user profiles..."):
    user_df = reviews_data[reviews_data.uid == user].reset_index(drop = True)
    user_profiles[user] = create_user_profile(user_df, anime_genre_map_df)

Extracting user profiles...:   0%|          | 0/130519 [00:00<?, ?it/s]
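
The loop above filters the anime dataframe once per review, which can get slow with this many users. One possible alternative, sketched below on toy data, looks up all of a user's anime at once and computes the weighted sum as a single matrix product. The function name `create_user_profile_vectorized` and the toy frames are hypothetical, and the sketch assumes `uid` is unique in the anime dataframe and that every reviewed `anime_uid` is present in it.

```python
import numpy as np
import pandas as pd

# Toy stand-ins with the same column layout as anime_genre_map_df and reviews_data
toy_anime = pd.DataFrame({
    "uid": [1, 2, 3],
    "title": ["A", "B", "C"],
    "n_genres": [1, 1, 2],
    "Comedy": [1.0, 0.0, np.sqrt(0.5)],
    "Drama": [0.0, 1.0, np.sqrt(0.5)],
})
toy_reviews = pd.DataFrame({"uid": [7, 7], "anime_uid": [1, 2], "Mean_score": [8, 4]})

def create_user_profile_vectorized(user_df, anime_df):
    # Index by uid and drop title / n_genres so that only feature columns remain
    features = anime_df.set_index("uid").iloc[:, 2:]
    # Rows of feature vectors for the anime this user rated (KeyError if a uid is missing)
    watched = features.loc[user_df["anime_uid"].tolist()].to_numpy(dtype=float)
    weights = user_df["Mean_score"].to_numpy(dtype=float)
    # Score-weighted sum of the feature vectors
    return weights @ watched

profile = create_user_profile_vectorized(toy_reviews, toy_anime)
print(profile)
```

This covers only the non-binary branch of `create_user_profile`; the binary variant would swap the weights for `+1` / `-1`.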

# Recommend anime to any given user

In [11]:
def recommend_anime(user_profile, anime_df):
    # Get a list of all the anime representations
    anime_representations = anime_df.iloc[:, 3:].values
    anime_names = list(anime_df.title)
    
    # Compute the dot products for every user-item combination (single user but all items)
    similarities = []
    for anime in tqdm(anime_representations, desc = "Getting similarity measures"):
        similarities.append(np.dot(user_profile, anime))
    
    # Sort by similarity and get top 10 anime recommendations for this user
    ordered_sims = np.array(similarities).argsort()[-10:][::-1]
    
    similar_animes = [anime_names[x] for x in ordered_sims]
    return (similar_animes, [similarities[x] for x in ordered_sims])
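
The per-anime loop in `recommend_anime` can also be written as one matrix-vector product. The sketch below is an illustrative variant on toy data, not a replacement tested against the full dataset; `top_k_by_dot_product` is a hypothetical helper name.

```python
import numpy as np

def top_k_by_dot_product(user_profile, item_matrix, k=10):
    """Return the indices and scores of the k items with the highest dot product."""
    scores = item_matrix @ user_profile          # one score per item
    top = np.argsort(scores)[::-1][:k]           # best score first
    return top, scores[top]

# Toy data: three items in a two-feature space
toy_items = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [np.sqrt(0.5), np.sqrt(0.5)],
])
toy_profile = np.array([8.0, 4.0])

top_idx, top_scores = top_k_by_dot_product(toy_profile, toy_items, k=2)
print(top_idx, top_scores)
```

With the real data, `item_matrix` would be `anime_df.iloc[:, 3:].to_numpy(dtype=float)` and the returned indices would be mapped back to titles as in the function above.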

In [12]:
all_users = list(user_profiles.keys())

In [47]:
# Sample a random user and look at the anime which they have rated
random_user = random.sample(all_users, 1)[0]

random_user_review = reviews_data[reviews_data.uid == random_user].reset_index(drop = True)
random_user_review

Unnamed: 0,uid,anime_uid,Mean_score
0,300971,37779,10


In [48]:
# Look at the recommendations which our engine is building
anime_recos = recommend_anime(user_profiles[random_user], anime_genre_map_df)
print(anime_recos)

Getting similarity measures:   0%|          | 0/16216 [00:00<?, ?it/s]

(['Yakusoku no Neverland', 'Yakusoku no Neverland 2nd Season', 'Higurashi no Naku Koro ni Kaku: Outbreak', 'Darkside Blues', 'Karakurizoushi Ayatsuri Sakon', 'Duan Nao 2', 'Paprika', 'Zankyou no Terror', 'Boogiepop wa Warawanai (2019)', 'Duan Nao'], [10.0, 9.128709291752768, 8.16496580927726, 8.16496580927726, 8.16496580927726, 8.16496580927726, 7.715167498104595, 7.0710678118654755, 7.0710678118654755, 7.0710678118654755])


A very important point to note here is that the dataset contains users who have rated only a single anime, like the user sampled above. For such a user the profile is just that one anime's genre vector scaled by the rating, so we can't really draw conclusions about what they like, especially when that single rating is average or below.

Since we know so little about such a user's taste, this simplistic model cannot recommend anything well suited to them. It also does not capture relations between concepts, e.g. that `comedy` might be the opposite of `serious/thriller`.
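
To make the single-rating case concrete: with unit-norm genre vectors, a user who rated one anime gets, for every candidate, a score equal to the rating times the cosine similarity between the two genre sets. The sketch below computes that directly from genre lists; the function name and the genre lists are hypothetical and chosen only for illustration.

```python
import numpy as np

def single_rating_score(rating, rated_genres, candidate_genres):
    """Dot-product score for a user whose profile comes from a single rated anime."""
    if not rated_genres or not candidate_genres:
        return 0.0
    shared = len(set(rated_genres) & set(candidate_genres))
    # Each shared genre contributes sqrt(1/n) * sqrt(1/m) to the dot product
    return rating * shared / np.sqrt(len(rated_genres) * len(candidate_genres))

rated = ["Mystery", "Horror", "Psychological", "Thriller", "Shounen"]  # illustrative genres

same_genres = single_rating_score(10, rated, rated)
one_extra = single_rating_score(10, rated, rated + ["Sci-Fi"])
print(same_genres, one_extra)
```

The ranking therefore depends only on genre overlap with that one anime, and the rating merely rescales every score by the same factor.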

# Adding numerical features into the mix

Every anime has some numerical features like episodes and average score for that anime. We can also factor in these as features in the representation of an anime feature vector.

But we shall normalize these before we do any kind of analysis, because our genre features are confined to the `[0, 1]` range. If we don't follow that convention, features with larger magnitudes will inadvertently get more weight in the dot product; so we will normalize them to restrict them to the 0-1 range.

Also, note that these could be good features, since `episodes` gives an overview of how long or short a series is (or whether an anime is a movie or a series) and `score` gives an idea of the overall majority vote for the anime.

Around `341 anime` have unknown scores and close to `492 anime` have an unknown number of episodes. Considering that this is a very small fraction of our dataset `(<5%)`, we can mean-impute these values since they're needed for the analysis.
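
The two preprocessing steps, mean imputation followed by scaling each column by its own maximum, can be sketched on a toy frame as follows. The values are made up, and the sketch assumes all values are non-negative so that dividing by the maximum lands in the 0-1 range.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "episodes": [12.0, np.nan, 24.0],
    "score": [8.0, 6.0, np.nan],
})

for col in ["episodes", "score"]:
    # Replace missing values with the column mean (NaN is skipped when computing the mean)
    toy[col] = toy[col].fillna(toy[col].mean())
    # Scale by the column's own maximum so values fall in [0, 1]
    toy[col] = toy[col] / toy[col].max()

print(toy)
```

Each column is divided by its own maximum, so the largest value in every column becomes exactly 1.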

In [36]:
anime_data["episodes"] = anime_data["episodes"].fillna(np.mean(anime_data["episodes"]))
anime_data["score"] = anime_data["score"].fillna(np.mean(anime_data["score"]))

In [37]:
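# Work on a copy so the assignments below do not write into a slice of anime_data
numerical_df = anime_data[["title", "episodes", "score"]].copy()
numerical_df["episodes"] = numerical_df.episodes / np.max(numerical_df.episodes)
# Scale score by its own maximum. The recorded output below came from dividing
# episodes by the max score, which is why its score column is close to zero.
numerical_df["score"] = numerical_df.score / np.max(numerical_df.score)
numerical_df.head(2)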

Unnamed: 0,title,episodes,score
0,Haikyuu!! Second Season,0.008178,0.000886
1,Shigatsu wa Kimi no Uso,0.007197,0.00078


In [38]:
anime_genre_numerical_map_df = pd.merge(anime_genre_map_df, numerical_df, on = "title", how = "inner")
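
One thing to watch with this merge: the progress bar further down counts 16220 rows, four more than the 16216 anime we started with. A likely explanation is that a few titles are shared by more than one `uid`, so joining on `title` multiplies those rows. The toy sketch below shows the effect and one possible guard, joining on `uid` with pandas' `validate` argument; the frames are made up, and the real `numerical_df` would need to carry a `uid` column for this to apply.

```python
import pandas as pd

# Two distinct anime (uid 2 and 3) share the title "B"
left = pd.DataFrame({"uid": [1, 2, 3], "title": ["A", "B", "B"], "Comedy": [1.0, 0.0, 1.0]})
right = pd.DataFrame({"uid": [1, 2, 3], "title": ["A", "B", "B"], "score": [0.9, 0.5, 0.7]})

# Joining on title pairs every "B" on the left with every "B" on the right
by_title = pd.merge(left, right[["title", "score"]], on="title", how="inner")

# Joining on uid keeps one row per anime; validate raises if a key is duplicated
by_uid = pd.merge(left, right[["uid", "score"]], on="uid", how="inner", validate="one_to_one")

print(len(by_title), len(by_uid))
```

Duplicated rows matter here because the same anime can then appear more than once in the recommendation list.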

In [39]:
users = reviews_data.uid.unique()
num_genre_user_profiles = {}

for user in tqdm(users, desc = "Extracting user profiles..."):
    user_df = reviews_data[reviews_data.uid == user].reset_index(drop = True)
    num_genre_user_profiles[user] = create_user_profile(user_df, anime_genre_numerical_map_df)

Extracting user profiles...:   0%|          | 0/130519 [00:00<?, ?it/s]

In [49]:
# Look at the recommendations which our engine is building
anime_recos = recommend_anime(num_genre_user_profiles[random_user], anime_genre_numerical_map_df)
print(anime_recos)

Getting similarity measures:   0%|          | 0/16220 [00:00<?, ?it/s]

(['Yakusoku no Neverland', 'Yakusoku no Neverland 2nd Season', 'Karakurizoushi Ayatsuri Sakon', 'Duan Nao 2', 'Higurashi no Naku Koro ni Kaku: Outbreak', 'Darkside Blues', 'Paprika', 'Ergo Proxy', 'Boogiepop wa Warawanai (2019)', 'Duan Nao'], [10.000155897699143, 9.128860746043609, 8.165303587625404, 8.1651172635681, 8.16497880075219, 8.16497880075219, 7.715180489579523, 7.071366615788833, 7.07130165841419, 7.071275675464333])
