# Anime Recommendations Project

This project analyses data taken from MyAnimeList and uses a machine learning algorithm to recommend anime to people based on their previous ratings.

We begin with an overview of the data, which gives us the oversight needed to plan the analysis accurately before making any changes.

First, let's collect all the libraries we will use into a single code cell at the beginning, so that the first cell executed brings in everything needed for the entire project. We begin with <font color=orange> **Pandas** </font>, as we need to read in both of the CSV files provided.

In [1]:
import pandas as pd
import numpy as np
import scipy as sp

from scipy.sparse import csr_matrix

from sklearn.metrics.pairwise import cosine_similarity

## Exploratory Data Analysis

Next we can use pandas to read in the raw CSV files for further editing.

There are two files to work with. The Ratings file holds the rating each person gave an anime. It is stored entirely in numerical form: the user_id, the anime_id and the rating itself are all numbers rather than words.

The other file is the anime file. It holds the name, the anime ID used to link to the Ratings dataframe, and other information such as genre, type, episode count, average rating and member count.

In [2]:
anime_filepath = 'C:/Users/User/Data Science Courses OR Projects/Projects/Anime Recommendations/Data/anime.csv'
anime_df = pd.read_csv(anime_filepath)

anime_df.head()

,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [3]:
ratings_filepath = "C:/Users/User/Data Science Courses OR Projects/Projects/Anime Recommendations/Data/rating.csv"
ratings_df = pd.read_csv(ratings_filepath)


ratings_df.head()

,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1


In [4]:
print(f"Number of rows and columns in Anime Database: {anime_df.shape}")
print(f"Number of rows and columns in Ratings Database: {ratings_df.shape}")

Number of rows and columns in Anime Database: (12294, 7)
Number of rows and columns in Ratings Database: (7813737, 3)


Typically, the first step is to deal with null values in the dataset. When a dataset is small, you can replace them with average values; with the abundance of data available here, that isn't necessary.
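To make the two options concrete, here is a minimal sketch on a made-up four-row frame. The names `toy_df`, `dropped` and `imputed` are illustrative assumptions and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing rating
toy_df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "rating": [8.0, np.nan, 6.0, 7.0],
})

# Option 1: drop rows with a missing rating (reasonable when data is plentiful)
dropped = toy_df.dropna(subset=["rating"])

# Option 2: fill the gap with the column mean (useful when rows are scarce)
imputed = toy_df.assign(rating=toy_df["rating"].fillna(toy_df["rating"].mean()))

print(len(dropped))              # 3 rows remain
print(imputed.loc[1, "rating"])  # 7.0, the mean of 8, 6 and 7
```

Dropping loses a row but invents nothing, while imputing keeps the row at the cost of a made-up value.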

In [5]:
"""
The commented-out print statement shows the number of null values in each column of the dataframe.

The active print statement shows the null values as a percentage of the total number of rows.
"""
#print(anime_df.isnull().sum())

print(round(anime_df.isnull().sum() / len(anime_df.index), 4) * 100)

anime_id    0.00
name        0.00
genre       0.50
type        0.20
episodes    0.00
rating      1.87
members     0.00
dtype: float64


In [6]:
"""
The commented-out print statement shows the number of null values in each column of the dataframe.

The active print statement shows the null values as a percentage of the total number of rows.
"""
#print(ratings_df.isnull().sum())

print(round(ratings_df.isnull().sum() / len(ratings_df.index), 4) * 100)

user_id     0.0
anime_id    0.0
rating      0.0
dtype: float64


Currently, genre, type and rating have empty values; we can either remove the affected rows or fill them with mode values.

In [7]:
# Keep only the anime that have a rating, since unrated entries won't be useful.
# The result is stored in anime_df2; anime_df itself is left unchanged.

anime_df2 = anime_df[anime_df['rating'].notna()]

In [8]:
# Fill any empty genre values with the mode (the most common genre).

anime_df['genre'] = anime_df['genre'].fillna(anime_df['genre'].mode()[0])

# Same for `type`

anime_df['type'] = anime_df['type'].fillna(anime_df['type'].mode()[0])

In [9]:
anime_df.isnull().sum()

anime_id      0
name          0
genre         0
type          0
episodes      0
rating      230
members       0
dtype: int64

## Feature Engineering Section

Numerical values are the easiest to deal with when setting up the dataframe for ML algorithms. In the ratings file, a rating of -1 is a placeholder rather than a real score, so we convert it to NaN, which keeps it out of later calculations.

In [10]:
# Replace the -1 placeholder with NaN so it is not treated as a real score
ratings_df['rating'] = ratings_df['rating'].replace(-1, np.nan)
ratings_df.head()

,user_id,anime_id,rating
0,1,20,
1,1,24,
2,1,79,
3,1,226,
4,1,241,
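As a small illustration of why the placeholder matters, the sketch below compares the two treatments on a made-up series of scores (the values and the names `scores`, `as_nan` and `as_zero` are assumptions for the example only).

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for one anime; -1 marks "no score given"
scores = pd.Series([8, -1, 10, -1])

as_nan = scores.replace(-1, np.nan)
as_zero = scores.replace(-1, 0)

print(as_nan.mean())   # 9.0: NaN entries are skipped
print(as_zero.mean())  # 4.5: zeros are counted as real scores
```

Using NaN leaves the average of the real scores untouched, whereas zeros would drag it down.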


Now we focus specifically on the information we need; this will change depending on the task at hand.

In [11]:
# Here we are recommending TV series only, so we drop movies and any other types
anime_df = anime_df[anime_df['type'] == 'TV']

# Combine the anime and ratings dataframes on anime_id.
# With these suffixes the user's score becomes 'rating_user' and the anime's average keeps the name 'rating'.
comb_df = ratings_df.merge(anime_df, on='anime_id', suffixes=('_user', ''))
#comb_df.head()

# Select the important columns (note: 'rating' here is the anime's average, not the user's own score)
comb_df = comb_df[['user_id', 'name', 'rating']]
comb_df.head()

# To make the dataframe smaller and easier to run
comb_df = comb_df[comb_df.user_id <= 7500]
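Because both input tables have a column called `rating`, it is worth checking which one survives the merge. The sketch below reproduces the same `suffixes` choice on two made-up miniature tables (`toy_ratings` and `toy_anime` are hypothetical stand-ins, not the real data).

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
toy_ratings = pd.DataFrame({"user_id": [1, 2], "anime_id": [10, 10], "rating": [9, 4]})
toy_anime = pd.DataFrame({"anime_id": [10], "name": ["Example Show"], "rating": [7.5]})

merged = toy_ratings.merge(toy_anime, on="anime_id", suffixes=("_user", ""))

print(merged.columns.tolist())
# ['user_id', 'anime_id', 'rating_user', 'name', 'rating']
print(merged["rating_user"].tolist())  # [9, 4]: each user's own score
print(merged["rating"].tolist())       # [7.5, 7.5]: the anime's average score
```

If the aim is to model individual taste, selecting `rating_user` instead of `rating` would be the column to try; that would change every number shown further down.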


Create a table with the anime names along the x axis (columns), the user IDs along the y axis (rows), and the ratings as the values.

In [12]:

pivot = comb_df.pivot_table(index=['user_id'], columns=['name'], values='rating')
pivot.head()

name,.hack//Roots,.hack//Sign,.hack//Tasogare no Udewa Densetsu,009-1,07-Ghost,11eyes,12-sai.: Chicchana Mune no Tokimeki,3 Choume no Tama: Uchi no Tama Shirimasenka?,30-sai no Hoken Taiiku,91 Days,...,"Zone of the Enders: Dolores, I",Zukkoke Knight: Don De La Mancha,ef: A Tale of Melodies.,ef: A Tale of Memories.,gdgd Fairies,gdgd Fairies 2,iDOLM@STER Xenoglossia,s.CRY.ed,xxxHOLiC,xxxHOLiC Kei
user_id,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,6.49,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,8.11,


In [13]:
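# Normalise each user's row (axis=1): subtract the user's mean, then divide by the ratio max/min.
# Note that the results are not confined to the range 0 to 1 and can be negative.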

pivot_norm = pivot.apply(lambda x : (x-np.mean(x)) / (np.max(x) / np.min(x)), axis=1)
pivot_norm.head()

name,.hack//Roots,.hack//Sign,.hack//Tasogare no Udewa Densetsu,009-1,07-Ghost,11eyes,12-sai.: Chicchana Mune no Tokimeki,3 Choume no Tama: Uchi no Tama Shirimasenka?,30-sai no Hoken Taiiku,91 Days,...,"Zone of the Enders: Dolores, I",Zukkoke Knight: Don De La Mancha,ef: A Tale of Melodies.,ef: A Tale of Memories.,gdgd Fairies,gdgd Fairies 2,iDOLM@STER Xenoglossia,s.CRY.ed,xxxHOLiC,xxxHOLiC Kei
user_id,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,-0.709943,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,0.281382,
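For comparison, a more conventional form of mean normalisation divides by the range (max minus min) rather than the ratio. The sketch below applies it to a made-up two-user table; `toy_pivot` and `mean_normalise` are hypothetical names, and this is an alternative to consider rather than the formula used above.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-anime ratings; NaN means "not rated"
toy_pivot = pd.DataFrame(
    {"Show A": [10.0, 6.0], "Show B": [5.0, np.nan], "Show C": [np.nan, 8.0]},
    index=pd.Index([1, 2], name="user_id"),
)

def mean_normalise(row):
    # Centre on the user's mean, then divide by the user's rating range
    return (row - row.mean()) / (row.max() - row.min())

toy_norm = toy_pivot.apply(mean_normalise, axis=1)
print(toy_norm)
```

With this variant every value lands between -1 and 1. A user whose ratings are all identical has a range of zero and ends up with NaN, which a later `fillna(0)` would turn into zeros.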


In [14]:
pivot_norm.fillna(0 , inplace=True)
pivot_norm.head()

name,.hack//Roots,.hack//Sign,.hack//Tasogare no Udewa Densetsu,009-1,07-Ghost,11eyes,12-sai.: Chicchana Mune no Tokimeki,3 Choume no Tama: Uchi no Tama Shirimasenka?,30-sai no Hoken Taiiku,91 Days,...,"Zone of the Enders: Dolores, I",Zukkoke Knight: Don De La Mancha,ef: A Tale of Melodies.,ef: A Tale of Memories.,gdgd Fairies,gdgd Fairies 2,iDOLM@STER Xenoglossia,s.CRY.ed,xxxHOLiC,xxxHOLiC Kei
user_id,,,,,,,,,,,...,,,,,,,,,,
1,0.0,0.0,0.0,0.0,0.0,-0.709943,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.281382,0.0


Next we transpose the table so that each row represents an anime and each column a user. The similarity model further down compares rows, so this layout lets it compare anime with one another.

In [15]:
pivot_norm = pivot_norm.T
pivot_norm.head()

user_id,1,2,3,4,5,6,7,8,9,10,...,7491,7492,7493,7494,7495,7496,7497,7498,7499,7500
name,,,,,,,,,,,...,,,,,,,,,,
.hack//Roots,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
.hack//Sign,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
.hack//Tasogare no Udewa Densetsu,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
009-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
07-Ghost,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The current column count is 7450.

Now we will remove all of the user columns that contain only zeros.

In [16]:
pivot_norm = pivot_norm.loc[:, (pivot_norm != 0).any(axis=0)]
pivot_norm.shape

(2734, 7189)
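The column filter can be checked on a tiny made-up table. In this sketch `toy` and `kept` are hypothetical names; user 2 has no non-zero entry and is therefore dropped.

```python
import pandas as pd

# Hypothetical anime-by-user table; column 2 contains only zeros
toy = pd.DataFrame(
    {1: [0.0, 0.3], 2: [0.0, 0.0], 3: [-0.2, 0.0]},
    index=["Show A", "Show B"],
)

# Keep a column if at least one of its entries is non-zero
kept = toy.loc[:, (toy != 0).any(axis=0)]
print(kept.columns.tolist())  # [1, 3]
```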

Now we can convert the table to a sparse matrix, which stores only the non-zero values; we do not need to spend computing power on a value if it is a zero.

In [17]:
# Store only the non-zero values while keeping the shape of the matrix intact.
# CSR (Compressed Sparse Row) lays the stored values out row by row.
pivot_sparse = csr_matrix(pivot_norm.values)
pivot_sparse

<2734x7189 sparse matrix of type '<class 'numpy.float64'>'
	with 549454 stored elements in Compressed Sparse Row format>
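To see what the CSR format actually keeps, here is a minimal sketch on a made-up 3x4 array (the values are assumptions chosen for illustration).

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical dense matrix with only two non-zero cells
dense = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, -0.2],
    [0.0, 0.0, 0.0, 0.0],
])
sparse = csr_matrix(dense)

print(sparse.nnz)      # 2 stored values out of 12 cells
print(sparse.data)     # the non-zero values, row by row
print(sparse.indices)  # the column index of each stored value
print(sparse.indptr)   # where each row starts within data/indices
```

For the matrix above, 549454 stored elements out of 2734 x 7189 cells works out to roughly 2.8% of the table, which is why the sparse form is worthwhile here.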

## Machine Learning Model

In [18]:
# Model based on anime similarity: cosine similarity between every pair of anime rows
anime_similarity = cosine_similarity(pivot_sparse)

# Dataframe of anime similarities, labelled by anime name on both axes
ani_sim_df = pd.DataFrame(anime_similarity, index = pivot_norm.index, columns = pivot_norm.index)
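Cosine similarity measures the angle between two rows: the dot product divided by the product of the two lengths. The sketch below computes it by hand on a made-up matrix and compares it with scikit-learn's result; `toy_matrix` and the helper `cosine` are hypothetical names for this example.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical rows: one anime per row, one user per column
toy_matrix = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])

def cosine(u, v):
    # Dot product divided by the product of the vector lengths
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

manual = cosine(toy_matrix[0], toy_matrix[2])
library = cosine_similarity(toy_matrix)

print(manual)         # 0.0: rows 0 and 2 share no users
print(library[0, 1])  # 1.0 (up to floating point): rows 0 and 1 are identical
```

The result is a square matrix with one row and one column per anime, which is why `ani_sim_df` uses the anime names for both its index and its columns.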

In [22]:
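def anime_recommendation(ani_name):
    """Print the five anime most similar to ani_name."""
    print(f'Recommended because you watched {ani_name}:\n')
    # Sort by similarity to ani_name; the top entry is assumed to be ani_name itself, so skip it
    top_matches = ani_sim_df[ani_name].sort_values(ascending=False).iloc[1:6]
    for number, (anime, score) in enumerate(top_matches.items(), start=1):
        print(f'#{number}: {anime}, {round(score * 100, 2)}% match')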

In [26]:
anime_recommendation("Monster")

Recommended because you watched Monster:

#1: Great Teacher Onizuka, 32.69% match
#2: Baccano!, 31.77% match
#3: Cowboy Bebop, 30.71% match
#4: Gyakkyou Burai Kaiji: Ultimate Survivor, 28.34% match
#5: Code Geass: Hangyaku no Lelouch, 28.19% match
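If the recommendations need to be reused rather than printed, one option is a variant that returns them. This is a sketch only: `top_similar` is a hypothetical helper that is not part of the notebook above, and `toy_sim` is a made-up similarity table standing in for `ani_sim_df`.

```python
import pandas as pd

def top_similar(sim_df, title, n=5):
    """Return the n titles most similar to `title` as a Series (hypothetical helper)."""
    if title not in sim_df.columns:
        raise KeyError(f"{title!r} is not in the similarity table")
    # Drop the title itself explicitly rather than relying on its sort position
    return sim_df[title].drop(labels=title).sort_values(ascending=False).head(n)

# Made-up symmetric similarity table for three titles
toy_sim = pd.DataFrame(
    [[1.0, 0.2, 0.7], [0.2, 1.0, 0.1], [0.7, 0.1, 1.0]],
    index=["A", "B", "C"],
    columns=["A", "B", "C"],
)
print(top_similar(toy_sim, "A", n=2))
```

Returning a Series keeps the scores available for filtering or plotting, and the explicit check gives a clearer error when a title is misspelled or was filtered out earlier.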
