This data set contains preference data from 73,516 users on 12,294 anime. Each user can add anime to their completed list and rate them; this data set is a compilation of those ratings.

Content

Anime.csv

anime_id - myanimelist.net's unique id identifying an anime.

name - full name of anime.

genre - comma separated list of genres for this anime.

type - movie, TV, OVA, etc.

episodes - how many episodes in this show. (1 if movie).

rating - average rating out of 10 for this anime.

members - number of community members that are in this anime's "group".

Rating.csv

user_id - non identifiable randomly generated user id.

anime_id - the anime that this user has rated.

rating - rating out of 10 this user has assigned (-1 if the user watched it but didn't assign a rating).


## Preprocessing

In [1]:
import pandas as pd 
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import MultiLabelBinarizer

In [2]:
anime_df = pd.read_csv('/Users/Ladi/Desktop/anime.csv')
ratings_df = pd.read_csv('/Users/Ladi/Desktop/rating.csv')

In [3]:
anime_df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama',"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


### Dropping the columns that won't be needed for this particular recommendation system.

In [4]:
# Keep only the identifier and the name; the other columns are not used here
anime_df = anime_df.drop(columns=['rating', 'genre', 'type', 'episodes', 'members'])
anime_df.head()


Unnamed: 0,anime_id,name
0,32281,Kimi no Na wa.
1,5114,Fullmetal Alchemist: Brotherhood
2,28977,Gintama°
3,9253,Steins;Gate
4,9969,Gintama'


### Looking at the ratings dataframe

In [5]:
ratings_df.head()

Unnamed: 0,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1
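
Every rating in these first rows is -1, the sentinel for "watched but not rated" described in the Content section. One option, sketched here on a toy frame rather than on `ratings_df` itself, is to convert the sentinel to NaN (or to drop those rows) so that it is not mistaken for a very low score. The names `toy_ratings`, `toy_clean` and `toy_rated_only` are illustrative, and the later cells of this notebook do not apply this step.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ratings_df (illustrative values, not taken from the data set)
toy_ratings = pd.DataFrame({
    'user_id':  [1, 1, 3, 3, 5],
    'anime_id': [20, 24, 20, 30276, 20],
    'rating':   [-1, -1, 8, -1, 6],
})

# Option 1: replace the -1 sentinel with NaN so mean(), sum(), etc. skip it
toy_clean = toy_ratings.assign(rating=toy_ratings['rating'].replace(-1, np.nan))

# Option 2: drop the unrated rows altogether
toy_rated_only = toy_ratings[toy_ratings['rating'] != -1]

print(toy_clean['rating'].mean())  # 7.0, the mean of the two real ratings
print(len(toy_rated_only))         # 2
```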


## Collaborative Filtering

The technique we're going to use is called collaborative filtering. It uses the ratings of other users to recommend items to the input user: it finds users whose preferences are similar to the input user's and then recommends items that those users liked. There are several ways of finding similar users (some of them based on machine learning); the one used here is based on the Pearson correlation coefficient.
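
To make the idea concrete, here is a minimal sketch on made-up ratings. It arranges the ratings as a user-by-anime table and lets pandas compute the pairwise Pearson correlations between users. The frame `toy`, the user ids 0 to 2 and all rating values are assumptions for illustration; the notebook itself computes the coefficient by hand further down.

```python
import pandas as pd

# Three hypothetical users (0 plays the role of the input user) rating the same five titles
toy = pd.DataFrame({
    'user_id':  [0] * 5 + [1] * 5 + [2] * 5,
    'anime_id': [20, 457, 918, 11061, 30276] * 3,
    'rating':   [9, 7, 8, 8, 9,   10, 6, 8, 7, 9,   5, 9, 7, 8, 6],
})

# Rows = users, columns = anime, cells = ratings
user_item = toy.pivot(index='user_id', columns='anime_id', values='rating')

# DataFrame.corr correlates columns, so transpose to correlate users with each other
user_similarity = user_item.T.corr(method='pearson')

# The neighbour most similar to user 0 (excluding user 0 itself)
most_similar = user_similarity.loc[0].drop(0).idxmax()
print(most_similar)
```

In this toy table user 1 rates high where user 0 rates high, while user 2 does the opposite, so user 1 comes out as the closest neighbour.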

### Creating an Input User 

In [6]:
userInput = [
            {'name':'Hunter x Hunter (2011)', 'rating':9},
            {'name':'One Punch Man', 'rating':7},
            {'name':'Naruto', 'rating':8},
            {'name':"Gintama", 'rating':8},
            {'name':'Mushishi', 'rating':9}
         ] 
inputAnime = pd.DataFrame(userInput)
inputAnime

Unnamed: 0,name,rating
0,Hunter x Hunter (2011),9
1,One Punch Man,7
2,Naruto,8
3,Gintama,8
4,Mushishi,9


### Adding anime_id to input user

In [7]:
# Look up the catalogue rows whose names match the input titles
inputId = anime_df[anime_df['name'].isin(inputAnime['name'].tolist())]
inputAnime = pd.merge(inputId,inputAnime, on=['name'],how='left')
inputAnime

Unnamed: 0,anime_id,name,rating
0,11061,Hunter x Hunter (2011),9
1,918,Gintama,8
2,30276,One Punch Man,7
3,457,Mushishi,9
4,20,Naruto,8
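
The lookup above depends on exact name matches, so a misspelled title would be dropped without any warning. A small sketch of one way to detect that, using the `indicator` option of `DataFrame.merge` on toy frames (`toy_catalog`, `toy_input` and the deliberate typo are assumptions for illustration):

```python
import pandas as pd

toy_catalog = pd.DataFrame({
    'anime_id': [11061, 918, 20],
    'name': ['Hunter x Hunter (2011)', 'Gintama', 'Naruto'],
})
toy_input = pd.DataFrame({
    'name': ['Hunter x Hunter (2011)', 'Naruto', 'One Punch Mann'],  # typo on purpose
    'rating': [9, 8, 7],
})

# A left merge from the input keeps every input row; _merge records whether it matched
checked = toy_input.merge(toy_catalog, on='name', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'name'].tolist()
print(unmatched)  # ['One Punch Mann']
```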


### Creating a subset of users that have watched the same anime as the input user

In [8]:
userSubset = ratings_df[ratings_df['anime_id'].isin(inputAnime['anime_id'].tolist())]
userSubset.head()

Unnamed: 0,user_id,anime_id,rating
0,1,20,-1
156,3,20,8
246,3,30276,-1
306,5,20,6
382,5,918,9


### Grouping by userid

In [9]:
# Group by a single column name (not a one-element list) so each group key is a plain user id
userSubsetGroup = userSubset.groupby('user_id')

Sorting the groups so that users who share the most anime with the input user come first. Because only the first groups are used later on, this prioritises the users with the most overlap instead of going through every single user.

In [11]:
userSubsetGroup = sorted(userSubsetGroup,  key=lambda x: len(x[1]), reverse=True)
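
If only the ranking of users by overlap is needed, the same ordering can be obtained without materialising every group. The sketch below counts rows per user on a toy subset; `toy_subset` and its values are illustrative, and `groupby(...).size()` is offered as an alternative, not as what the notebook does.

```python
import pandas as pd

# Toy stand-in for userSubset: one row per (user, shared anime)
toy_subset = pd.DataFrame({
    'user_id':  [39, 39, 39, 3, 3, 5],
    'anime_id': [20, 457, 918, 20, 30276, 20],
    'rating':   [10, -1, -1, 8, -1, 6],
})

# Number of anime each user shares with the input, largest first
shared_counts = toy_subset.groupby('user_id').size().sort_values(ascending=False)

# Ids of the two users with the most overlap
top_user_ids = shared_counts.head(2).index.tolist()
print(top_user_ids)  # [39, 3]
```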

### Let's look at the first 3

In [12]:
userSubsetGroup[0:3]

[(39,
        user_id  anime_id  rating
  3636       39        20      10
  3649       39       457      -1
  3652       39       918      -1
  3693       39     11061      -1
  3808       39     30276      -1),
 (567,
         user_id  anime_id  rating
  54624      567        20       9
  54655      567       457       9
  54680      567       918      10
  54821      567     11061      10
  54931      567     30276      10),
 (784,
         user_id  anime_id  rating
  75630      784        20       7
  75655      784       457       8
  75663      784       918       8
  75758      784     11061       9
  75980      784     30276       9)]

Next, we compare the first 100 users in this sorted list with our input user to find those who are most similar.
We measure how similar each user is to the input with the Pearson correlation coefficient, which measures the strength of the linear association between two variables.
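
For reference, the computational form of the coefficient that the loop further down evaluates can be written as follows. Here $x_i$ are the input user's ratings, $y_i$ the other user's ratings for the same anime, and $n$ the number of anime they share; the symbols $S_{xx}$, $S_{yy}$ and $S_{xy}$ mirror the variable names in the code.

```latex
S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}, \qquad
S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}, \qquad
S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}

r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}
```

The coefficient lies between -1 and 1 and is undefined when $S_{xx}$ or $S_{yy}$ is zero, that is, when one of the two users gave the same rating to every shared anime; the code records 0 in that case.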

In [13]:
userSubsetGroup = userSubsetGroup[0:100]

Calculating the Pearson Correlation between input user and subset group, and storing it in a dictionary, where the key is the user Id and the value is the coefficient.

In [14]:
# Store the Pearson correlation in a dictionary: key = user id, value = coefficient
pearsonCorrelationDict = {}

# Sort the input once by anime_id so its ratings line up with each sorted group
inputAnime = inputAnime.sort_values(by='anime_id')

# For every user group in our subset
for name, group in userSubsetGroup:
    # Sort the current user's ratings by anime_id to align them with the input
    group = group.sort_values(by='anime_id')
    # N in the formula: the number of anime this user shares with the input
    nRatings = len(group)
    # Input ratings for the anime that both users have in common
    temp_df = inputAnime[inputAnime['anime_id'].isin(group['anime_id'].tolist())]
    tempRatingList = temp_df['rating'].tolist()
    # The current user's ratings for the same anime (note: -1 entries are kept as-is)
    tempGroupList = group['rating'].tolist()
    # Pearson correlation between the two users, x (input) and y (current user)
    Sxx = sum(i**2 for i in tempRatingList) - pow(sum(tempRatingList), 2)/float(nRatings)
    Syy = sum(i**2 for i in tempGroupList) - pow(sum(tempGroupList), 2)/float(nRatings)
    Sxy = sum(i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)

    # If the denominator is non-zero, divide; otherwise record a correlation of 0
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
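
As a sanity check, the sums-based formula in the loop above can be wrapped in a small helper (the name `pearson_from_sums` is hypothetical) and compared with `numpy.corrcoef`. The ratings below are those of users 784 and 39 from the groups printed earlier, aligned by anime_id with the input user's ratings. The last part is a hypothetical variant, not something the notebook does: it drops user 39's -1 "watched but not rated" entries before correlating.

```python
from math import sqrt
import numpy as np

def pearson_from_sums(x, y):
    """Pearson r via the same sums as the loop; 0.0 when the denominator is zero."""
    n = len(x)
    sxx = sum(i**2 for i in x) - sum(x)**2 / n
    syy = sum(j**2 for j in y) - sum(y)**2 / n
    sxy = sum(i*j for i, j in zip(x, y)) - sum(x)*sum(y) / n
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

# Ratings ordered by anime_id: 20, 457, 918, 11061, 30276
input_ratings = [8, 9, 8, 9, 7]
user_784 = [7, 8, 8, 9, 9]
user_39 = [10, -1, -1, -1, -1]

r_784 = pearson_from_sums(input_ratings, user_784)
r_numpy = np.corrcoef(input_ratings, user_784)[0, 1]
print(round(r_784, 6), round(r_numpy, 6))  # both -0.071429

# With the -1 entries treated as ratings, as in the loop above
r_39 = pearson_from_sums(input_ratings, user_39)

# Hypothetical variant: keep only the pairs where user 39 gave a real rating
rated_pairs = [(x, y) for x, y in zip(input_ratings, user_39) if y != -1]
xs, ys = zip(*rated_pairs)
r_39_rated_only = pearson_from_sums(xs, ys)
print(r_39_rated_only)  # 0.0, a single rated pair gives no usable correlation
```

The first result matches the value stored for user 784, and `r_39` matches the -0.1336 stored for user 39, which confirms that the loop counts the four -1 entries as if they were ratings.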



In [15]:
pearsonCorrelationDict.items()

dict_items([(39, -0.1336306209562121), (567, -0.32732683535400187), (784, -0.07142857142856708), (1114, 0.2112885636821287), (1176, 0.3668996928526651), (1237, 0.13363062095621192), (1344, 0.04310416013535741), (1435, 0.2439750182371328), (1501, 0.5976143046671957), (1530, 0.07142857142856726), (1576, 0.5345224838248562), (1889, 0), (2143, 0.8012455796764165), (2264, 0), (2555, 0.7637626158259629), (3278, 0.15724272550829355), (3518, -0.21946557273221337), (3592, -0.13363062095621375), (3660, 0.32732683535398766), (4161, 0.21274790016365555), (4251, 0.4285714285714228), (4437, 0.8685990362153867), (4468, -0.7559289460184528), (4512, -0.045834924851407825), (4658, 0.6428571428571443), (4759, 0.6910233190806429), (5073, -0.3273268353539844), (5423, 0.013464426851070375), (5526, 0), (5598, 0), (5701, 0.763762615825972), (6118, 0), (6166, 0.46770717334673495), (6184, 0), (6265, -0.21821789023598057), (6416, 0.25475508554262843), (6638, 0), (7345, 0.2452557357939894), (7519, 0.5976143046671

In [16]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['user_id'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,similarityIndex,user_id
0,-0.133631,39
1,-0.327327,567
2,-0.071429,784
3,0.211289,1114
4,0.3669,1176


### Getting the top 50 users that are most similar to the input.

In [17]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
topUsers.head()

Unnamed: 0,similarityIndex,user_id
53,0.906327,11867
91,0.869657,21068
21,0.868599,4437
12,0.801246,2143
69,0.785714,16601


### Ratings of the selected users for all anime

This is done by taking the weighted average of the anime ratings, using the Pearson correlation as the weight. To do this, we first get the anime watched by the users in topUsers from the ratings dataframe, keeping each user's correlation in the similarityIndex column. This is achieved below by merging the two tables.

In [18]:
topUsersRating=topUsers.merge(ratings_df, left_on='user_id', right_on='user_id', how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,user_id,anime_id,rating
0,0.906327,11867,1,8
1,0.906327,11867,6,-1
2,0.906327,11867,19,9
3,0.906327,11867,20,8
4,0.906327,11867,26,7


Next, multiply each rating by its weight (the similarity index), then sum the weighted ratings per anime and divide by the sum of the weights.

In [19]:
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,user_id,anime_id,rating,weightedRating
0,0.906327,11867,1,8,7.250616
1,0.906327,11867,6,-1,-0.906327
2,0.906327,11867,19,9,8.156943
3,0.906327,11867,20,8,7.250616
4,0.906327,11867,26,7,6.344289
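
The whole weighted-average step can be followed on a handful of made-up rows. In this sketch the frame `toy`, the anime ids 100 and 200, and all weights and ratings are assumptions for illustration.

```python
import pandas as pd

# Three hypothetical neighbours with their similarity to the input user
toy = pd.DataFrame({
    'user_id':         [1, 2, 3, 1, 2],
    'anime_id':        [100, 100, 100, 200, 200],
    'similarityIndex': [0.9, 0.5, 0.1, 0.9, 0.5],
    'rating':          [8, 6, 10, 4, 10],
})

# Weight each rating by the similarity of the user who gave it
toy['weightedRating'] = toy['similarityIndex'] * toy['rating']

# Per anime: sum of weights and sum of weighted ratings
sums = toy.groupby('anime_id')[['similarityIndex', 'weightedRating']].sum()

# Weighted average = sum(weight * rating) / sum(weight)
score = sums['weightedRating'] / sums['similarityIndex']
print(score.round(3))
```

For anime 100 the plain mean of the ratings is 8.0, but the weighted score is about 7.47, because the 10 comes from the least similar neighbour and therefore counts least.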


In [20]:
# Sum the similarity weights and the weighted ratings per anime (grouping by anime_id)
tempTopUsersRating = topUsersRating.groupby('anime_id').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

Unnamed: 0_level_0,sum_similarityIndex,sum_weightedRating
anime_id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,12.080512,94.918125
5,5.171021,39.176237
6,8.700718,57.766454
7,1.412665,9.811406
8,0.037547,0.300376


In [21]:
#Creating an empty dataframe
recommendation_df = pd.DataFrame()
#Now we take the weighted average
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['anime_id'] = tempTopUsersRating.index
recommendation_df.head()

Unnamed: 0_level_0,weighted average recommendation score,anime_id
anime_id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,7.857127,1
5,7.576113,5
6,6.639275,6
7,6.945319,7
8,8.0,8


### Sorting it and seeing the top 10 animes that the algorithm recommended

In [22]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head(10)

Unnamed: 0_level_0,weighted average recommendation score,anime_id
anime_id,Unnamed: 1_level_1,Unnamed: 2_level_1
31592,10.0,31592
2174,10.0,2174
2009,10.0,2009
4312,10.0,4312
5291,10.0,5291
3027,10.0,3027
8278,10.0,8278
2252,10.0,2252
1773,10.0,1773
12001,9.650913,12001
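
A score of exactly 10.0 is what the formula yields whenever every neighbour who rated a title gave it a 10, and that includes the case where only one neighbour rated it. One possible refinement, not part of this notebook, is to require a minimum number of raters before a title can be recommended. The sketch below uses toy data; `MIN_RATERS` and the column name `n_raters` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'anime_id':        [1, 1, 1, 2, 3, 3],
    'similarityIndex': [0.9, 0.8, 0.7, 0.6, 0.9, 0.5],
    'rating':          [8, 9, 7, 10, 9, 8],
})
toy['weightedRating'] = toy['similarityIndex'] * toy['rating']

# Same sums as before, plus the number of neighbours who rated each anime
agg = toy.groupby('anime_id').agg(
    sum_similarityIndex=('similarityIndex', 'sum'),
    sum_weightedRating=('weightedRating', 'sum'),
    n_raters=('rating', 'count'),
)
agg['score'] = agg['sum_weightedRating'] / agg['sum_similarityIndex']

MIN_RATERS = 2  # hypothetical threshold; tune it to the data
ranked = agg[agg['n_raters'] >= MIN_RATERS].sort_values('score', ascending=False)
print(ranked[['n_raters', 'score']])
```

Without the threshold, anime 2 would rank first on the strength of a single rating; with it, the ranking is based only on titles that at least two neighbours rated.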


In [23]:
anime_df.loc[anime_df['anime_id'].isin(recommendation_df.head(10)['anime_id'].tolist())]

Unnamed: 0,anime_id,name
666,31592,Pokemon XY&Z
1203,2174,Hokuto no Ken: Raoh Gaiden Gekitou-hen
1417,5291,Hokuto no Ken Zero: Kenshirou Den
1448,1773,Hokuto no Ken: Raoh Gaiden Junai-hen
1732,2009,Yawara! Special: Zutto Kimi no Koto ga... .
1752,4312,Hokuto no Ken: Toki-den
2245,12001,One Piece 3D: Gekisou! Trap Coaster
2709,3027,Hokuto no Ken: Yuria-den
5095,2252,Devilman
5448,8278,Biohazard 4: Incubate
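
Filtering `anime_df` with `isin` returns the rows in catalogue order and without their scores, so the ranking is no longer visible in the table above. A sketch of one alternative is to merge the names onto the sorted recommendations instead; the frames `toy_recs` and `toy_names` and their values are made up for illustration.

```python
import pandas as pd

# Recommendations already sorted by score, best first
toy_recs = pd.DataFrame({
    'anime_id': [3, 2, 1],
    'score':    [10.0, 9.9, 9.65],
})
# Catalogue in its own, unrelated order
toy_names = pd.DataFrame({
    'anime_id': [1, 2, 3],
    'name': ['Title A', 'Title B', 'Title C'],
})

# A left merge keeps the row order of the left frame, so the ranking is preserved
named = toy_recs.merge(toy_names, on='anime_id', how='left')
print(named[['name', 'score']])
```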


## Advantages and Disadvantages of Collaborative Filtering

### Advantages

Takes other users' ratings into consideration

Doesn't need to study or extract information from the recommended item

Adapts to the user's interests which might change over time

### Disadvantages

Computing similarities against many users can be slow

There may be too few users with overlapping ratings to estimate similarity reliably

Privacy issues when trying to learn the user's preferences