# Anime Alchemy: Unveiling Your Next Favorite through Machine Learning

Anime, a realm where imagination knows no bounds, enthralls viewers with its diverse worlds and compelling narratives. Yet, the vast expanse of anime choices can leave enthusiasts overwhelmed when selecting their next watch. This is where the fusion of machine learning and anime enters the stage. By harnessing the capabilities of algorithms and data analysis, machine learning offers a transformative solution to the eternal question: "What anime should I watch next?"

Gone are the days of aimlessly scrolling through recommendations or relying on random selections. Machine learning algorithms, fueled by the wealth of data from fans and streaming platforms, now provide tailored suggestions that resonate with individual preferences. In this exploration of anime and machine learning, we unravel the mechanics behind recommendation systems, the data they rely on, and the techniques that empower them to accurately predict your next anime obsession.

From collaborative and content-based filtering to hybrid models and deep learning, we'll demystify these algorithms and their ability to capture the essence of what makes a show captivating. However, beyond the technical underpinnings, we'll also delve into the intricate connection between data-driven insights and the artistry that underpins every anime tale.

Whether you're an anime veteran seeking fresh adventures or a newcomer ready to embark on an animated journey, join us as we navigate the world of anime recommendations powered by machine learning. Together, we'll uncover how this fusion of technology and creativity is reshaping the way we indulge in the enchanting realm of anime. - Thanks ChatGPT

I will cover the following things, probably in separate pages:
- Collaborative Filtering using Singular Value Decomposition
- Collaborative Filtering using Neural Nets
- Hybrid Collaborative Filtering.

Data used for this project contains 300k unique users. Thank you Azathoth for saving me from needing to scrape the data myself. (It took me 5-10 minutes per user to scrape all the review data, and I quickly realised that it was going to take too long.) 

Link to the data is here on Kaggle: https://www.kaggle.com/datasets/azathoth42/myanimelist?select=users_cleaned.csv

The aim is to get a clean dataset where user id and anime id are the features (X) and the rating is the label (Y).

In [7]:
import pandas as pd

In [2]:
anime_raw = pd.read_csv('./data/full/anime_cleaned.csv')
scores_raw = pd.read_csv('./data/full/animelists_cleaned.csv')
users_raw = pd.read_csv('./data/full/users_cleaned.csv')

In [3]:
# Need user_id and user_name
users_raw.head()

Unnamed: 0,username,user_id,user_watching,user_completed,user_onhold,user_dropped,user_plantowatch,user_days_spent_watching,gender,location,birth_date,access_rank,join_date,last_online,stats_mean_score,stats_rewatched,stats_episodes
0,karthiga,2255153,3,49,1,0,0,55.091667,Female,"Chennai, India",1990-04-29 00:00:00,,2013-03-03 00:00:00,2014-02-04 01:32:00,7.43,0.0,3391
1,Damonashu,37326,45,195,27,25,59,82.574306,Male,"Detroit,Michigan",1991-08-01 00:00:00,,2008-02-13 00:00:00,2017-07-10 06:52:54,6.15,6.0,4903
2,bskai,228342,25,414,2,5,11,159.483333,Male,"Nayarit, Mexico",1990-12-14 00:00:00,,2009-08-31 00:00:00,2014-05-12 16:35:00,8.27,1.0,9701
3,terune_uzumaki,327311,5,5,0,0,0,11.394444,Female,"Malaysia, Kuantan",1998-08-24 00:00:00,,2010-05-10 00:00:00,2012-10-18 19:06:00,9.7,6.0,697
4,Bas_G,5015094,35,114,6,20,175,30.458333,Male,"Nijmegen, Nederland",1999-10-24 00:00:00,,2015-11-26 00:00:00,2018-05-10 20:53:37,7.86,0.0,1847


`username` and `user_id` will be useful here.

Interestingly, `stats_mean_score` suggests that each person's ratings cluster around a different value. I considered renormalising each user's ratings by their mean score, but decided against it: some users may only review anime they liked, and others only anime they disliked, and I cannot distinguish these from optimistic and pessimistic reviewers respectively.
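For reference, this is roughly what the per-user renormalisation would have looked like. It is a minimal sketch on made-up data: the two users and the `centred_score` column are hypothetical and not part of the dataset.

```python
import pandas as pd

# Made-up ratings for two hypothetical users: one generous, one harsh.
toy = pd.DataFrame({
    "username": ["generous", "generous", "generous", "harsh", "harsh", "harsh"],
    "anime_id": [21, 59, 74, 21, 59, 74],
    "my_score": [10, 9, 8, 6, 5, 4],
})

# Subtract each user's own mean so a score reads as "above or below my usual".
user_mean = toy.groupby("username")["my_score"].transform("mean")
toy["centred_score"] = toy["my_score"] - user_mean
print(toy)
```

After centring, the two users look identical, which is both the appeal and the risk: the information about whether the harsh user genuinely disliked these shows is gone.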

In [4]:
scores_raw.head()

Unnamed: 0,username,anime_id,my_watched_episodes,my_start_date,my_finish_date,my_score,my_status,my_rewatching,my_rewatching_ep,my_last_updated,my_tags
0,karthiga,21,586,0000-00-00,0000-00-00,9,1,,0,2013-03-03 10:52:53,
1,karthiga,59,26,0000-00-00,0000-00-00,7,2,,0,2013-03-10 13:54:51,
2,karthiga,74,26,0000-00-00,0000-00-00,7,2,,0,2013-04-27 16:43:35,
3,karthiga,120,26,0000-00-00,0000-00-00,7,2,,0,2013-03-03 10:53:57,
4,karthiga,178,26,0000-00-00,0000-00-00,7,2,0.0,0,2013-03-27 15:59:13,


`my_score` is the label, and it can be joined to `users_raw` on `username`.

In [5]:
reviews_df = scores_raw[['username', 'anime_id', 'my_score', 'my_status']]
usernames_and_ids = users_raw[['username', 'user_id']]

reviews_df = reviews_df.merge(anime_raw[['anime_id', 'title']], left_on='anime_id', right_on='anime_id')
reviews_df = reviews_df.merge(usernames_and_ids[['username', 'user_id']], left_on='username', right_on='username')
# Remove scores of 0 - it's not a possible rating value according to MyAnimeList
reviews_df = reviews_df[reviews_df['my_score'] > 0]

In [6]:
reviews_df.head()

Unnamed: 0,username,anime_id,my_score,my_status,title,user_id
0,karthiga,21,9,1,One Piece,2255153
1,karthiga,59,7,2,Chobits,2255153
2,karthiga,74,7,2,Gakuen Alice,2255153
3,karthiga,120,7,2,Fruits Basket,2255153
4,karthiga,178,7,2,Ultra Maniac,2255153


Let's use a crude way to check for sparsity of the data we have.

In [7]:
user_per_anime = reviews_df['anime_id'].value_counts().reset_index().rename(columns={"count": "number_of_users"})
anime_per_user = reviews_df['username'].value_counts().reset_index().rename(columns={"count": "number_of_animes"})

There are a bunch of users who have reviewed only a few anime, which makes their rows very sparse. This is not good for the collaborative filtering algorithm. Let's keep them in for now: when we get to the more advanced techniques that bring in content-based filtering and hybrid models, these "users who review little" cases will still be useful.

In [8]:
user_per_anime[user_per_anime['number_of_users'] < 10].shape

(265, 2)

In [9]:
anime_per_user[anime_per_user['number_of_animes'] < 10].shape

(3926, 2)
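The counts above show the sparse tails. A single-number view of sparsity is the fraction of the user-anime matrix that is filled in. The sketch below shows it on a toy frame; `rating_density` is a hypothetical helper, and it assumes one row per (user, anime) pair.

```python
import pandas as pd


def rating_density(df, user_col="username", item_col="anime_id"):
    """Fraction of all possible (user, item) pairs that actually have a rating."""
    n_users = df[user_col].nunique()
    n_items = df[item_col].nunique()
    return len(df) / (n_users * n_items)


# Toy data: 3 users x 3 anime = 9 possible pairs, of which 4 are rated.
toy = pd.DataFrame({
    "username": ["a", "a", "b", "c"],
    "anime_id": [21, 59, 21, 74],
})
print(rating_density(toy))
```

On the real data the call would be `rating_density(reviews_df)`. The closer the result is to 0, the sparser the matrix.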

Keeping only users with more than 10 ratings (and anime with more than 10 raters) is a bit arbitrary, but it's something to work with for now. The threshold can be tweaked if the model does not perform well.
I can filter on the `value_counts()` results and then inner join them back in on `username` and `anime_id`.

In [10]:
user_per_anime = user_per_anime[user_per_anime['number_of_users'] > 10]
anime_per_user = anime_per_user[anime_per_user['number_of_animes'] > 10]

reviews_df = pd.merge(reviews_df, anime_per_user, left_on = 'username', right_on = 'username', how = 'inner')
reviews_df = pd.merge(reviews_df, user_per_anime, left_on = 'anime_id', right_on = 'anime_id', how = 'inner')
reviews_df

Unnamed: 0,username,anime_id,my_score,my_status,title,user_id,number_of_animes,number_of_users
0,karthiga,21,9,1,One Piece,2255153,53,31030
1,Damonashu,21,10,1,One Piece,37326,194,31030
2,bskai,21,8,1,One Piece,228342,402,31030
3,Slimak,21,10,1,One Piece,61677,223,31030
4,kioniel,21,9,4,One Piece,144049,296,31030
...,...,...,...,...,...,...,...,...
19125480,Benku,35909,7,6,Guruguru Petit Anime Gekijou,3781121,1029,11
19125481,Oyakatasama4ever,35909,10,2,Guruguru Petit Anime Gekijou,39242,648,11
19125482,Hancock92,35909,2,2,Guruguru Petit Anime Gekijou,5082848,954,11
19125483,Madison_Brown,35909,6,2,Guruguru Petit Anime Gekijou,4719207,788,11


Perfect. Drop the columns which we don't need and save to csv so it can be used for the next step.

In [13]:
reviews_df = reviews_df[['anime_id', 'user_id', 'my_score']]

In [22]:
reviews_df

Unnamed: 0,anime_id,user_id,my_score
0,21,2255153,9
1,21,37326,10
2,21,228342,8
3,21,61677,10
4,21,144049,9
...,...,...,...
19125480,35909,3781121,7
19125481,35909,39242,10
19125482,35909,5082848,2
19125483,35909,4719207,6


In [15]:
# Note: to_csv also writes the index as an unnamed first column; pass index=False to skip it.
reviews_df.to_csv('processed_reviews.csv')

## Collaborative Filtering using Singular Value Decomposition

Collaborative filtering is based on the premise that users who shared similar tastes in the past will likely have similar preferences in the future. The method I implement here is Singular Value Decomposition (SVD), a mathematical technique that uncovers patterns in a user-item interaction matrix. By reducing the dimensionality of that matrix, it extracts latent factors that explain user preferences and item characteristics. In the context of anime, these latent factors might represent attributes like genres, themes, animation styles, or narrative structures.

If people are interested in the mathematical theory behind this, I will write it in another article.
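In the meantime, here is a toy illustration of the dimensionality-reduction idea using NumPy. The 4x4 rating matrix is made up, and this is a plain truncated SVD on a fully observed matrix. The `SVD` algorithm in Surprise is a related but different procedure: a Funk-style factorization with bias terms, fitted by gradient descent on the observed ratings only.

```python
import numpy as np

# Made-up ratings (rows: users, columns: anime). Users 0-1 and users 2-3
# have opposite tastes, so two latent factors should capture most of it.
ratings = np.array([
    [9.0, 8.0, 2.0, 1.0],
    [8.0, 9.0, 1.0, 2.0],
    [2.0, 1.0, 9.0, 8.0],
    [1.0, 2.0, 8.0, 9.0],
])

# Full decomposition: ratings == U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

# Keep only the k largest singular values (the "latent factors").
k = 2
user_factors = U[:, :k] * s[:k]   # one k-dimensional vector per user
item_factors = Vt[:k, :]          # one k-dimensional vector per anime
approx = user_factors @ item_factors

max_err = np.abs(ratings - approx).max()
print("singular values:", np.round(s, 3))
print("largest reconstruction error:", round(float(max_err), 3))
```

With two factors, each user and each anime is described by just two numbers, yet the reconstruction stays within half a rating point of the original matrix.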

In [1]:
import surprise as sp

In [6]:
reader = sp.Reader(rating_scale=(1, 10))
# sp.Dataset.load_from_df expects the columns in a specific order: user id, item id, rating.
reviews_df = pd.read_csv('processed_reviews.csv')
data = sp.Dataset.load_from_df(reviews_df[['user_id', 'anime_id', 'my_score']], reader)

# Alternatively, load straight from the file. The reader would first need a line_format,
# sep and skip_lines that match the CSV layout, so this is left commented out.
#data = sp.Dataset.load_from_file('processed_reviews.csv', reader=reader)

# By default the split is 0.8 to train and 0.2 to test.
trainset, testset = sp.model_selection.split.train_test_split(data)

In [7]:
algo = sp.SVD()
# Run 5-fold cross-validation and print results using RMSE (Root mean square error) and MAE (mean absolute error)
results = sp.model_selection.cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.1811  1.1798  1.1797  1.1810  1.1807  1.1805  0.0006  
MAE (testset)     0.8708  0.8704  0.8703  0.8708  0.8707  0.8706  0.0002  
Fit time          250.25  256.27  258.06  259.64  263.60  257.56  4.39    
Test time         114.16  121.74  98.07   115.30  102.41  110.33  8.75    


In [8]:
from surprise import dump
# Dump algorithm and reload it.
dump.dump('algo.pickle', algo=algo)


In [36]:
_, loaded_algo = dump.load('algo.pickle')

To avoid Out of Memory errors, I'm going to test it with a single user rather than trying to backfill with rating estimates for each user who hasn't rated shows.

In [40]:
user_id = 61677
reviews_for_user = pd.read_csv('processed_reviews.csv')
reviews_for_user = reviews_for_user[reviews_for_user['user_id'] == user_id]
reviews_for_user


Unnamed: 0.1,Unnamed: 0,anime_id,user_id,my_score
3,3,21,61677,10
30477,30477,59,61677,9
134373,134373,269,61677,9
219810,219810,857,61677,7
277896,277896,1735,61677,10
...,...,...,...,...
7495287,7495287,11837,61677,7
7501421,7501421,17389,61677,9
7503764,7503764,22777,61677,7
7506745,7506745,31904,61677,7


In [44]:
from surprise import Dataset, Reader
user_to_test = Dataset.load_from_df(reviews_for_user[['user_id', 'anime_id', 'my_score']], Reader(rating_scale=(1,10)))
trainset, testset = sp.model_selection.split.train_test_split(user_to_test)
predictions = loaded_algo.test(testset)


In [47]:
predictions

[Prediction(uid=61677, iid=2476, r_ui=8.0, est=7.4204642943772825, details={'was_impossible': False}),
 Prediction(uid=61677, iid=2386, r_ui=5.0, est=6.325046801984353, details={'was_impossible': False}),
 Prediction(uid=61677, iid=355, r_ui=8.0, est=7.810006611181714, details={'was_impossible': False}),
 Prediction(uid=61677, iid=7593, r_ui=7.0, est=7.360856065907273, details={'was_impossible': False}),
 Prediction(uid=61677, iid=24439, r_ui=8.0, est=7.595897803514228, details={'was_impossible': False}),
 Prediction(uid=61677, iid=4744, r_ui=5.0, est=6.964902337648262, details={'was_impossible': False}),
 Prediction(uid=61677, iid=411, r_ui=8.0, est=7.721308039301372, details={'was_impossible': False}),
 Prediction(uid=61677, iid=464, r_ui=6.0, est=6.662791103120198, details={'was_impossible': False}),
 Prediction(uid=61677, iid=6895, r_ui=7.0, est=8.051689671562059, details={'was_impossible': False}),
 Prediction(uid=61677, iid=4186, r_ui=8.0, est=7.793201152883065, details={'was_imp

The loaded `SVD` algorithm holds the fitted model, which we can use to predict scores.

In [50]:
from collections import defaultdict


def get_top_n(predictions, n=10):
    """Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [52]:
top_n = get_top_n(predictions, n=10)
anime_raw = pd.read_csv('./data/full/anime_cleaned.csv')


def get_title_by_id(anime_id, anime_raw):
    title = anime_raw.loc[anime_raw['anime_id'] == anime_id, 'title'].values
    if len(title) > 0:
        return title[0]
    else:
        return "Anime not found"
# Print the recommended titles for each user in top_n (only one user here)
for uid, user_ratings in top_n.items():
    print(uid, [get_title_by_id(iid, anime_raw) for (iid, _) in user_ratings])


61677 ['Naruto: Shippuuden', 'Shingeki no Kyojin', 'One Piece', 'Kuroshitsuji', 'Claymore', 'Kuroshitsuji II', 'Rosario to Vampire', 'Green Green', 'Zero no Tsukaima: Princesses no Rondo', 'Yuu☆Yuu☆Hakusho']
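Because `predictions` came from a test split of this user's own ratings, every title in that list is one they have already scored. A more useful recommendation ranks only unseen titles. The sketch below is one way to do that: `recommend_unseen` is a hypothetical helper, and `FakeAlgo` is a stand-in so the snippet runs without the trained model. In the notebook, `algo` would be `loaded_algo`, whose `predict(uid, iid)` method returns a prediction with an `est` field.

```python
from collections import namedtuple


def recommend_unseen(algo, user_id, rated_ids, candidate_ids, n=10):
    """Return the n highest (item id, estimate) pairs among items not yet rated."""
    rated = set(rated_ids)
    scored = [
        (iid, algo.predict(user_id, iid).est)
        for iid in candidate_ids
        if iid not in rated
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Stand-in for a fitted model (hypothetical): the "estimate" is just iid % 10.
FakePrediction = namedtuple("FakePrediction", ["est"])


class FakeAlgo:
    def predict(self, uid, iid):
        return FakePrediction(est=float(iid % 10))


result = recommend_unseen(
    FakeAlgo(), 61677, rated_ids=[21, 59], candidate_ids=[21, 59, 74, 120, 178], n=2
)
print(result)
```

With the real model, `rated_ids` would be `reviews_for_user['anime_id']` and `candidate_ids` would be the anime ids the model was trained on. Scoring one user against all candidates stays well within memory, unlike backfilling the whole matrix.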


## Collaborative Filtering using Neural Networks
Collaborative filtering can be done with techniques such as matrix factorization (SVD, Singular Value Decomposition, using the Surprise library).
Neural networks do mathematically much the same thing through embedding layers: both are effectively matrix dimensionality reduction. We can one-hot encode the user ids and anime ids and draw out latent features from the output of the embedding layers.
The reason for using a neural network over matrix factorization is that the model can then be extended by joining the output of the embedding layers with further features, such as details of the anime or demographic data of the user.
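To make the link between one-hot encoding and embedding layers concrete, here is a small NumPy sketch. The sizes and random weights are made up for illustration. It shows that multiplying a one-hot vector by a weight matrix is the same as looking up one row of that matrix, which is all an embedding layer does.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_anime, n_factors = 5, 4, 3

# In a real model these weights are learned; here they are random placeholders.
user_embedding = rng.normal(size=(n_users, n_factors))
anime_embedding = rng.normal(size=(n_anime, n_factors))

user_index, anime_index = 2, 1

# One-hot vector for the user, multiplied by the weight matrix...
one_hot = np.zeros(n_users)
one_hot[user_index] = 1.0
via_matmul = one_hot @ user_embedding

# ...gives exactly the row that a direct lookup returns.
via_lookup = user_embedding[user_index]
print(np.allclose(via_matmul, via_lookup))

# As in matrix factorization, a score can be the dot product of the two vectors.
score = float(user_embedding[user_index] @ anime_embedding[anime_index])
print(score)
```

The dot product is the matrix factorization view. A neural network can instead concatenate the two vectors with extra features and pass them through further layers.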

The SVD model in the previous section used the Surprise library (https://surpriselib.com/); this section uses PyTorch instead.

### PyTorch
I'm going to use PyTorch because it is widely used and well documented.

In [None]:
import torch
# Get cpu, gpu or mps device for training.
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

Using cpu device


I saw somewhere that I should probably scale the input space so it is within the bounds of 0 to 1.
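If the values being scaled are the 1 to 10 scores, a min-max scaling is one simple option. The sketch below is written under that assumption. `scale_score` and `unscale_score` are hypothetical helper names, and the second function maps a model output back to the original scale.

```python
MIN_SCORE, MAX_SCORE = 1, 10  # rating scale used earlier in this notebook


def scale_score(score):
    """Map a score from [1, 10] to [0, 1]."""
    return (score - MIN_SCORE) / (MAX_SCORE - MIN_SCORE)


def unscale_score(scaled):
    """Map a value from [0, 1] back to [1, 10]."""
    return scaled * (MAX_SCORE - MIN_SCORE) + MIN_SCORE


print(scale_score(1), scale_score(10))
print(unscale_score(scale_score(7)))
```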

In [None]:
import pandas as pd
from torch.utils.data import Dataset


class ReviewDataSet(Dataset):
    """Serves (user_id, anime_id, my_score) rows from the processed reviews CSV."""

    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data.iloc[idx]
        # processed_reviews.csv holds user_id rather than username.
        return sample['user_id'], sample['anime_id'], sample['my_score']

In [None]:
data = ReviewDataSet('processed_reviews.csv')
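One thing to watch before feeding these rows to embedding layers: an embedding is indexed by position (0 to N-1), while the raw MyAnimeList ids are large and have gaps. A possible preprocessing step, sketched here on a toy frame with hypothetical `user_idx` and `anime_idx` columns, is to remap the ids with `pd.factorize`.

```python
import pandas as pd

# Toy rows using ids in the same style as the dataset.
toy = pd.DataFrame({
    "user_id": [2255153, 37326, 2255153, 228342],
    "anime_id": [21, 21, 59, 74],
})

# factorize returns contiguous integer codes plus the unique original values,
# so a code can always be mapped back to the original id.
toy["user_idx"], user_ids = pd.factorize(toy["user_id"])
toy["anime_idx"], anime_ids = pd.factorize(toy["anime_id"])

print(toy)
print(len(user_ids), len(anime_ids))  # sizes needed for the embedding tables
```

`user_ids[code]` recovers the original id for a given index, which is needed when turning model output back into titles.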