# Anime Recommender System

For this project, we will taking a look at two dataset. The first dataset contains all Anime (up to 2023) and lists their features such as its genre, airing status, popularity, rank and members. The second dataset contains user reviews of Anime in a rating system of 0 to 10. Both of these datasets were originally from MyAnimeList and was published by a user on Kaggle. (https://www.kaggle.com/datasets/dbdmobile/myanimelist-dataset)

The purpose of this project is to build a recommender system that gives recommendations to users based on their previous scores of Anime using collaborative filtering and content-based filtering. We will also create a pipeline to streamline our work and make it more efficient for future use. 

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import itertools
import random
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from ast import literal_eval
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from surprise import Reader, KNNBasic, Dataset, SVD
from surprise.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from surprise import accuracy

## Inspect and Load Dataset

In [2]:
anime_df = pd.read_csv('anime-dataset-2023.csv')
anime_df.info()
anime_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24905 entries, 0 to 24904
Data columns (total 24 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   anime_id      24905 non-null  int64 
 1   Name          24905 non-null  object
 2   English name  24905 non-null  object
 3   Other name    24905 non-null  object
 4   Score         24905 non-null  object
 5   Genres        24905 non-null  object
 6   Synopsis      24905 non-null  object
 7   Type          24905 non-null  object
 8   Episodes      24905 non-null  object
 9   Aired         24905 non-null  object
 10  Premiered     24905 non-null  object
 11  Status        24905 non-null  object
 12  Producers     24905 non-null  object
 13  Licensors     24905 non-null  object
 14  Studios       24905 non-null  object
 15  Source        24905 non-null  object
 16  Duration      24905 non-null  object
 17  Rating        24905 non-null  object
 18  Rank          24905 non-null  object
 19  Popu

Unnamed: 0,anime_id,Name,English name,Other name,Score,Genres,Synopsis,Type,Episodes,Aired,...,Studios,Source,Duration,Rating,Rank,Popularity,Favorites,Scored By,Members,Image URL
0,1,Cowboy Bebop,Cowboy Bebop,カウボーイビバップ,8.75,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26.0,"Apr 3, 1998 to Apr 24, 1999",...,Sunrise,Original,24 min per ep,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505,https://cdn.myanimelist.net/images/anime/4/196...
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop: The Movie,カウボーイビバップ 天国の扉,8.38,"Action, Sci-Fi","Another day, another bounty—such is the life o...",Movie,1.0,"Sep 1, 2001",...,Bones,Original,1 hr 55 min,R - 17+ (violence & profanity),189.0,602,1448,206248.0,360978,https://cdn.myanimelist.net/images/anime/1439/...
2,6,Trigun,Trigun,トライガン,8.22,"Action, Adventure, Sci-Fi","Vash the Stampede is the man with a $$60,000,0...",TV,26.0,"Apr 1, 1998 to Sep 30, 1998",...,Madhouse,Manga,24 min per ep,PG-13 - Teens 13 or older,328.0,246,15035,356739.0,727252,https://cdn.myanimelist.net/images/anime/7/203...
3,7,Witch Hunter Robin,Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),7.25,"Action, Drama, Mystery, Supernatural",Robin Sena is a powerful craft user drafted in...,TV,26.0,"Jul 3, 2002 to Dec 25, 2002",...,Sunrise,Original,25 min per ep,PG-13 - Teens 13 or older,2764.0,1795,613,42829.0,111931,https://cdn.myanimelist.net/images/anime/10/19...
4,8,Bouken Ou Beet,Beet the Vandel Buster,冒険王ビィト,6.94,"Adventure, Fantasy, Supernatural",It is the dark century and the people are suff...,TV,52.0,"Sep 30, 2004 to Sep 29, 2005",...,Toei Animation,Manga,23 min per ep,PG - Children,4240.0,5126,14,6413.0,15001,https://cdn.myanimelist.net/images/anime/7/215...


In [3]:
reviews_df = pd.read_csv('users-score-2023.csv')
reviews_df.info()
reviews_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24325191 entries, 0 to 24325190
Data columns (total 5 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   user_id      int64 
 1   Username     object
 2   anime_id     int64 
 3   Anime Title  object
 4   rating       int64 
dtypes: int64(3), object(2)
memory usage: 927.9+ MB


Unnamed: 0,user_id,Username,anime_id,Anime Title,rating
0,1,Xinil,21,One Piece,9
1,1,Xinil,48,.hack//Sign,7
2,1,Xinil,320,A Kite,5
3,1,Xinil,49,Aa! Megami-sama!,8
4,1,Xinil,304,Aa! Megami-sama! Movie,8


In [4]:
reviews_df.describe(include = 'all')

Unnamed: 0,user_id,Username,anime_id,Anime Title,rating
count,24325190.0,24324959,24325190.0,24325191,24325190.0
unique,,270032,,16611,
top,,trafagibr,,Death Note,
freq,,2986,,126492,
mean,440384.3,,9754.686,,7.62293
std,366946.9,,12061.96,,1.66151
min,1.0,,1.0,,1.0
25%,97188.0,,873.0,,7.0
50%,387978.0,,4726.0,,8.0
75%,528043.0,,13161.0,,9.0


## Data Cleaning and Preparation

### Anime Dataset

#### Removing Hentai and Music Genres

In [5]:
# Finding ids of hentai 
hentai_id = anime_df.anime_id[anime_df['Genres'].apply(lambda tags: 'Hentai' in tags)].tolist()

# Removing hentai ids from dataset
anime_df = anime_df[~anime_df['Genres'].apply(lambda tags: 'Hentai' in tags)]

There are two more animes rated as Hentai so we will remove these as well.

In [6]:
# Initial Rating value count
anime_df.Rating.value_counts()

Rating
PG-13 - Teens 13 or older         8502
G - All Ages                      7660
PG - Children                     4050
R - 17+ (violence & profanity)    1414
R+ - Mild Nudity                  1123
UNKNOWN                            668
Rx - Hentai                          2
Name: count, dtype: int64

In [7]:
# Removing hentai from Rating
anime_df = anime_df[anime_df.Rating != 'Rx - Hentai']

In [8]:
# Initial Type value count
anime_df.Type.value_counts()

Type
TV         7597
Movie      4374
ONA        3476
Music      2686
OVA        2681
Special    2529
UNKNOWN      74
Name: count, dtype: int64

In [9]:
anime_df = anime_df[anime_df.Type != 'Music']

#### Adding Column for Release Year

In [10]:
# Replacing 'Not available' with null values
anime_df.replace('Not available', np.nan, inplace=True)

# Split to release year and finished year
anime_duration = anime_df.Aired.str.split(' to ')
anime_duration

0         [Apr 3, 1998, Apr 24, 1999]
1                       [Sep 1, 2001]
2         [Apr 1, 1998, Sep 30, 1998]
3         [Jul 3, 2002, Dec 25, 2002]
4        [Sep 30, 2004, Sep 29, 2005]
                     ...             
24895               [May 31, 2023, ?]
24896                       [2024, ?]
24900                [Jul 4, 2023, ?]
24901               [Jul 27, 2023, ?]
24902               [Jul 19, 2023, ?]
Name: Aired, Length: 20731, dtype: object

In [11]:
anime_df['release_date'] = anime_duration.str.get(0)
release_year = anime_df['release_date'].str.split(' ')
anime_df['release_year'] = release_year.str.get(-1)
anime_df['release_year'].unique()

array(['1998', '2001', '2002', '2004', '2005', '1999', '2003', '1995',
       '1997', '1996', '1988', '1993', '2000', '1979', '1989', '1991',
       '1985', '1986', '1994', '1992', '1990', '1978', '1973', '2006',
       '1987', '1984', '1982', '1977', '1983', '1980', '1976', '1968',
       '1981', '2007', '1971', '1967', '1975', '1962', '1965', '1969',
       '1974', '1964', '2008', '1972', '1970', '1966', '1963', '1945',
       '2009', '2012', '2021', '1933', '1929', '1943', '2010', '1931',
       '1930', '1932', '1934', '1936', '1928', '1960', '1958', '2011',
       '1959', '1947', '1917', '1935', '1938', '1939', '1941', '1942',
       '1948', '1950', '1957', '1961', '1918', '1924', '1925', '1926',
       '1927', '1937', '1940', '1944', '1946', '1949', '1951', '1952',
       '1953', '1954', '1955', '1956', '2016', '2013', '2019', '2018',
       '2014', '2015', '2017', nan, '2022', '2020', '2023', '2024',
       '2025'], dtype=object)

#### Removing duplicate rows

In [12]:
anime_df[anime_df.Name.duplicated() == True].head()

Unnamed: 0,anime_id,Name,English name,Other name,Score,Genres,Synopsis,Type,Episodes,Aired,...,Duration,Rating,Rank,Popularity,Favorites,Scored By,Members,Image URL,release_date,release_year
24586,55351,Azur Lane,Azur Lane,アズールレーン,UNKNOWN,"Action, Slice of Life",Assorted commercials for the Azur Lane Mobile ...,Special,UNKNOWN,"Apr 17, 2020 to ?",...,30 sec,PG-13 - Teens 13 or older,0.0,0,0,UNKNOWN,0,https://cdn.myanimelist.net/images/anime/1421/...,"Apr 17, 2020",2020.0
24807,55610,Souseiki,UNKNOWN,UNKNOWN,UNKNOWN,Fantasy,As Shoko Asahara is depicted in a less embelli...,OVA,1.0,,...,Unknown,G - All Ages,0.0,0,0,UNKNOWN,0,https://cdn.myanimelist.net/images/anime/1704/...,,


In [13]:
# Removing all duplicated anime titles
anime_df = anime_df.drop_duplicates(subset=['Name'], keep='first')

#### Dealing with null values

In [14]:
# Removing 'UNKNOWN' values
anime_df.replace('UNKNOWN', np.nan, inplace=True)

# Converting columns to float data type
anime_df = anime_df.astype({'Score': 'float64', 'Rank': 'float64', 'Scored By': 'float64'})

# Replacing null values with mean value
anime_df.Score = anime_df.Score.fillna(anime_df.Score.mean())
anime_df.Rank = anime_df.Rank.fillna(anime_df.Rank.mean())
anime_df['Scored By'] = anime_df['Scored By'].fillna(anime_df['Scored By'].mean())

# Using forward-fill to replace null values in 'Rating'
anime_df['Rating'].fillna(method = 'ffill', inplace = True)

# Replacing null value with mode value in 'release_year'
anime_df['release_year'] = anime_df['release_year'].fillna(anime_df['release_year'].mode()[0])
anime_df = anime_df.astype({'release_year': 'int'})

#### Splitting genres

In [15]:
# Replacing null values 
anime_df['Genres'] = anime_df.Genres.fillna('Unknown')

# Create a list of all possible genres
all_genres = sorted(set([genre for sublist in anime_df['Genres'] for genre in sublist]))

# Convert genres to binary vectors
def genre_to_binary(genre_list):
    return [1 if genre in genre_list else 0 for genre in all_genres]

anime_df['genre_binary'] = anime_df['Genres'].apply(genre_to_binary)

In [16]:
# Filtered anime dataset
anime_filtered = anime_df.drop(columns = ['Other name', 'Name', 'English name', 'Image URL', 'Episodes', 'Duration', 
                                    'Premiered', 'Status', 'Producers', 'Licensors', 'Studios', 
                                   'Aired', 'Synopsis', 'Type', 'release_date', 'Genres'])
anime_filtered.head()

Unnamed: 0,anime_id,Score,Source,Rating,Rank,Popularity,Favorites,Scored By,Members,release_year,genre_binary
0,1,8.75,Original,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505,1998,"[1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, ..."
1,5,8.38,Original,R - 17+ (violence & profanity),189.0,602,1448,206248.0,360978,2001,"[1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, ..."
2,6,8.22,Manga,PG-13 - Teens 13 or older,328.0,246,15035,356739.0,727252,1998,"[1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, ..."
3,7,7.25,Original,PG-13 - Teens 13 or older,2764.0,1795,613,42829.0,111931,2002,"[1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, ..."
4,8,6.94,Manga,PG - Children,4240.0,5126,14,6413.0,15001,2004,"[1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, ..."


## Testing Machine Learning Algorithm

### Combining Datasets

In [17]:
# Picking a random user 
users = reviews_df.user_id.unique()
user = random.choice(users)
user_review = reviews_df[reviews_df.user_id == user]

In [18]:
# # Prepare the training data
train_shows = anime_filtered.merge(user_review, on = 'anime_id')
train_shows.head()

Unnamed: 0,anime_id,Score,Source,Rating,Rank,Popularity,Favorites,Scored By,Members,release_year,genre_binary,user_id,Username,Anime Title,rating
0,1,8.75,Original,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505,1998,"[1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, ...",7782,elsherl,Cowboy Bebop,9
1,7,7.25,Original,PG-13 - Teens 13 or older,2764.0,1795,613,42829.0,111931,2002,"[1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, ...",7782,elsherl,Witch Hunter Robin,8
2,15,7.92,Manga,PG-13 - Teens 13 or older,688.0,1252,1997,86524.0,177688,2005,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...",7782,elsherl,Eyeshield 21,10
3,16,8.0,Manga,PG-13 - Teens 13 or older,589.0,862,4136,81747.0,260166,2005,"[1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, ...",7782,elsherl,Hachimitsu to Clover,10
4,19,8.87,Manga,R+ - Mild Nudity,26.0,142,47235,368569.0,1013100,2004,"[1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, ...",7782,elsherl,Monster,10


### Surprise

In [19]:
# Initialising reader and load user review dataset
reader = Reader(rating_scale=(1,10))
rec_data = Dataset.load_from_df(train_shows[['user_id', 'anime_id', 'rating']], reader)

# Train-test split 
trainset, testset = train_test_split(rec_data, test_size=.3, random_state=42)

# Use SVD from Surprise to train a collaborative filter
anime_reco = SVD()
anime_reco.fit(trainset)

# Make predictions on the test set
predictions = anime_reco.test(testset)

# Evaluate model
rmse = accuracy.rmse(predictions)
mae = accuracy.mae(predictions)

RMSE: 1.3067
MAE:  1.0480


In [20]:
# Normalize features
num_cols = ['Score', 'Popularity', 'Favorites', 'Members', 'release_year', 'Scored By']
scaler = MinMaxScaler()
anime_filtered[num_cols] = scaler.fit_transform(anime_filtered[num_cols])

# Combine numerical columns and genre binary columns
feature_matrix = np.concatenate((anime_filtered[num_cols].values, pd.DataFrame(anime_df['genre_binary'].tolist())), axis=1)

# Compute similarity
similarity_matrix = cosine_similarity(feature_matrix)

In [36]:
# Define a function to get similar anime
def get_similar_anime(anime_id, anime_df, similarity_matrix, top_n=5):
    # Find the index of the given anime in the DataFrame
    index = anime_df[anime_df['anime_id'] == anime_id].index[0]

    # Get indices of the most similar animes
    similar_indices = similarity_matrix[index].argsort()[-top_n-1:-1][::-1]
    
    # Return the corresponding anime rows from the DataFrame
    return anime_df.iloc[similar_indices]

# Function for collaborative filtering and content based filtering
def hybrid_recommend(user_id, anime_df, algo, similarity_matrix, top_n=3):
    # Collaborative filtering: Predict ratings for all animes
    anime_ids = anime_df['anime_id'].values
    predictions = [algo.predict(user_id, anime_id) for anime_id in anime_ids]
    predictions = sorted(predictions, key=lambda x: x.est, reverse=True)[:top_n]
    
    # Content-based filtering: Get similar animes for top recommendations
    recommendations = []
    for pred in predictions:
        similar_anime = get_similar_anime(pred.iid, anime_df, similarity_matrix, top_n=2)
        recommendations.extend(similar_anime[['anime_id','Name']].to_dict(orient='records'))
    
    return recommendations

In [37]:
# Recommend anime for user
recommendations = hybrid_recommend(user_id=user, anime_df=anime_df, algo=anime_reco, similarity_matrix=similarity_matrix)
print(f"Based on your previous scores, we recommend {recommendations[0]['Name']}, {recommendations[1]['Name']}, {recommendations[2]['Name']}, {recommendations[3]['Name']} and {recommendations[4]['Name']}.")

Based on your previous scores, we recommend Fushigi no Umi no Nadia Omake Gekijou, Sakura Taisen: Le Nouveau Paris, Weiß Survive R, Princess Lover! Picture Drama and Steamboy.


## Conclusion

The goal of this project is to develop a simple recommender system using Surprise and recommend Anime to users. The dataset was first cleaned and processed by removing columns, dealing with null values and encoding categorical columns. Next, we combined our datasets and trained a Singular Value Decomposition (SVD) system to our dataset. We then created two functions that uses collaborative filtering and content-based filtering to select Anime that are similar to a Anime that the user has reviewed. 