# Anime Recommender (pt 2 - CLF)

Welcome to part 2 of my anime recommender. Here I will use collaborative filtering through a library called Surprise. To recap from [part 1](https://github.com/Mayank-Bhatia/Anime-Recommender/blob/master/Part_1_KNN.ipynb), we have a dataset of user preference data from 73,516 users on 12,294 anime, collected from [myanimelist.net](https://myanimelist.net/). In part 1, a Nearest Neighbors approach returned some interesting results: anime that were fairly similar to the queried anime. However, it did not match the deeper similarity captured by the user recommendations on myanimelist.net, which are made and voted on by users themselves.

This is good motivation to try a more powerful technique for building recommendation systems: collaborative filtering, which recommends items to users based on the preferences other users have expressed for those items.

Before we start, let's acknowledge the fact that myanimelist.net has a recommendation section for each anime where users can recommend similar anime. Here is [Death Note's](https://myanimelist.net/anime/1535/Death_Note/userrecs). Can we build a model that generates similar anime that the viewer base would approve of?

Let's begin by importing the necessary libraries.

In [10]:
import numpy as np
import pandas as pd
from surprise import KNNBasic, KNNWithMeans, KNNBaseline
from surprise import Dataset
from surprise import GridSearch, print_perf
from surprise import Reader

ImportError: cannot import name 'KNNBasic' from 'surprise' (unknown location)

In [9]:
pip install scikit-surprise


## The Data

In [2]:
anime = pd.read_csv('anime.csv')
anime.head()

| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
| 4 | 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 |


anime_id - myanimelist.net's unique id identifying an anime <br>
name - full name of anime <br>
genre - comma separated list of genres for this anime <br>
type - movie, TV, OVA, etc <br>
episodes - how many episodes in this show. (1 if movie) <br>
rating - average rating out of 10 for this anime <br>
members - number of community members that are in this anime's "group" 

In [3]:
anime_rating = pd.read_csv('rating.csv')
anime_rating.tail()

| | user_id | anime_id | rating |
|---|---|---|---|
| 7813732 | 73515 | 16512 | 7 |
| 7813733 | 73515 | 17187 | 9 |
| 7813734 | 73515 | 22145 | 10 |
| 7813735 | 73516 | 790 | 9 |
| 7813736 | 73516 | 8074 | 9 |


user_id - non identifiable randomly generated user id <br> 
anime_id - the anime that this user has rated <br>
rating - rating out of 10 this user has assigned (-1 if the user watched it but didn't assign a rating) <br>

### Data Cleaning

Much like in part 1, we deal with missing values.

In [4]:
anime.isnull().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [5]:
anime_rating.isnull().sum()

user_id     0
anime_id    0
rating      0
dtype: int64

For now, rating.csv looks to be better behaved than anime.csv. This allows us to simply follow the same process of data cleaning as in [part 1](https://github.com/Mayank-Bhatia/Anime-Recommender/blob/master/Part_1_KNN.ipynb).

In [6]:
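# fill missing values so that every anime can still be displayed later
anime['type'] = anime['type'].fillna('None')
anime['genre'] = anime['genre'].fillna('None')
anime['rating'] = anime['rating'].fillna('None') # note: the string fill turns this numeric column into object dtype
anime.isnull().sum()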

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

Now, rating.csv contains user-preference data, so we will train our model only on this data. Then why did I bother cleaning the anime.csv data? That's because our model will retrieve information about the similar anime from anime.csv in order to present it to the user. Here is a general view:

1) Give the model an anime name <br>
2) Convert that name to its corresponding anime_id using anime.csv <br>
3) Find the 10 most similar anime_id using the model trained on rating.csv <br>
4) Take these anime_id back to anime.csv to retrieve information about the recommended anime, such as genre, rating, etc. <br>
5) Finally, display a list of the 10 similar anime along with some relevant information about each one
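
The five steps above can be sketched in plain Python. This is only an illustration: the titles and values are made up, and `similar_ids` is a hypothetical stand-in for whatever a trained model would return in step 3.

```python
# Toy stand-in for anime.csv: made-up titles and values, real column names.
anime_rows = [
    {"anime_id": 101, "name": "Show A", "genre": "Mystery", "episodes": 37, "rating": 8.7},
    {"anime_id": 102, "name": "Show B", "genre": "Thriller", "episodes": 25, "rating": 8.2},
    {"anime_id": 103, "name": "Show C", "genre": "Comedy", "episodes": 12, "rating": 7.1},
]

# Hypothetical model output: for each anime_id, similar anime_ids, best match first.
similar_ids = {101: [102, 103], 102: [101, 103], 103: [102, 101]}

# Lookup tables between names, ids and full rows.
name_to_id = {row["name"]: row["anime_id"] for row in anime_rows}
id_to_row = {row["anime_id"]: row for row in anime_rows}

def recommend(name, k=10):
    """name -> anime_id -> similar anime_ids -> rows of anime information."""
    anime_id = name_to_id[name]                    # step 2
    neighbour_ids = similar_ids[anime_id][:k]      # step 3
    return [id_to_row[i] for i in neighbour_ids]   # step 4

for row in recommend("Show A", k=2):               # step 5
    print(row["name"], "|", row["genre"], "|", row["rating"])
```

Note that every lookup here goes through anime_id rather than through row positions, so the order of rows in either file does not matter.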

With that said, there are two small issues with the anime_rating dataset. First, some users have watched a show without rating it afterwards, and the dataset records that missing rating as -1. Let's remove those entries. Second, some users have seen very few anime (user_id 73516 has only 2 anime on their list). Because we are seeking to build a model that will recommend 10 similar anime, I feel it's necessary to drop rows belonging to users with a low anime viewing count. Let's keep only user_id with 20 or more anime.

In [7]:
anime_rating.rating.unique()

array([-1, 10,  8,  6,  9,  7,  3,  5,  4,  1,  2], dtype=int64)

In [8]:
anime_rating = anime_rating[anime_rating.rating > 0].copy() # only keep ratings between 1-10; copy so a column can be added safely later
anime_rating.rating.unique()

array([10,  8,  6,  9,  7,  3,  5,  4,  1,  2], dtype=int64)

In [9]:
anime_rating['user_anime_count'] = anime_rating.groupby('user_id')['user_id'].transform('size') # number of rated anime per user
anime_rating.tail() 

| | user_id | anime_id | rating | user_anime_count |
|---|---|---|---|---|
| 7813732 | 73515 | 16512 | 7 | 179 |
| 7813733 | 73515 | 17187 | 9 | 179 |
| 7813734 | 73515 | 22145 | 10 | 179 |
| 7813735 | 73516 | 790 | 9 | 2 |
| 7813736 | 73516 | 8074 | 9 | 2 |


In [10]:
anime_rating = anime_rating[anime_rating.user_anime_count > 19] # only keep users that have seen at least 20 anime
anime_rating.tail()

| | user_id | anime_id | rating | user_anime_count |
|---|---|---|---|---|
| 7813730 | 73515 | 13659 | 8 | 179 |
| 7813731 | 73515 | 14345 | 7 | 179 |
| 7813732 | 73515 | 16512 | 7 | 179 |
| 7813733 | 73515 | 17187 | 9 | 179 |
| 7813734 | 73515 | 22145 | 10 | 179 |


In [11]:
anime_rating = anime_rating.drop('user_anime_count', axis=1) # having served its purpose, we can drop the user count column

At this stage, the rating dataset contains users with enough anime under their belt to influence our recommender system. We are now ready to train on this data.

## Picking hyperparameters

Let's tune hyperparameters using grid search. But before that, we load the pandas dataframe anime_rating into Surprise as rating_data via a dummy reader.

In [13]:
dummy_reader = Reader(line_format='user item rating', rating_scale=(1, 10))
rating_data = Dataset.load_from_df(anime_rating[['user_id', 'anime_id', 'rating']], dummy_reader)
rating_data.split(n_folds=3)

Next, we define the similarity measures. Included in the Surprise package are:

cosine - Compute the cosine similarity between all pairs of users (or items) <br>
msd - Compute the Mean Squared Difference similarity between all pairs of users (or items) <br>
pearson_baseline - Compute the (shrunk) Pearson correlation coefficient between all pairs of users (or items) using baselines for centering instead of means <br>
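
To make the first two measures concrete, here is a minimal sketch in plain Python rather than Surprise. It assumes the definitions given in Surprise's documentation as I understand them: both measures use only the users who rated both items, and the MSD similarity is 1 / (msd + 1). The ratings are made up, and pearson_baseline is left out because it also needs fitted baselines.

```python
import math

# Toy ratings for two items, keyed by user id (made-up numbers).
item_a = {1: 10, 2: 8, 3: 6, 4: 9}
item_b = {2: 7, 3: 5, 4: 10, 5: 3}

def common_pairs(x, y):
    """Rating pairs from the users who rated both items."""
    return [(x[u], y[u]) for u in sorted(x.keys() & y.keys())]

def cosine_sim(x, y):
    """Cosine of the angle between the two rating vectors over common users."""
    pairs = common_pairs(x, y)
    if not pairs:
        return 0.0
    dot = sum(a * b for a, b in pairs)
    norm_x = math.sqrt(sum(a * a for a, _ in pairs))
    norm_y = math.sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_x * norm_y)

def msd_sim(x, y):
    """Inverse of (1 + mean squared difference) over common users."""
    pairs = common_pairs(x, y)
    if not pairs:
        return 0.0
    msd = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
    return 1.0 / (msd + 1.0)

print('cosine:', cosine_sim(item_a, item_b))
print('msd:   ', msd_sim(item_a, item_b))
```

Because ratings are all positive, cosine similarity between two items tends to sit close to 1, while the MSD similarity reacts more strongly to disagreements between raters.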

In [14]:
sim_options1 = {'name': 'pearson_baseline', 'user_based': False}
sim_options2 = {'name': 'msd', 'user_based': False}
sim_options3 = {'name': 'cosine', 'user_based': False}

bsl_options1 = {'method': 'als', 'learning_rate': .001}
bsl_options2 = {'method': 'sgd', 'learning_rate': .001}

param_grid = {'sim_options': [sim_options1,sim_options2,sim_options3]}

param_grid_bsl = {'sim_options': [sim_options1,sim_options2,sim_options3],
                  'bsl_options': [bsl_options1,bsl_options2]}
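
A grid search tries every combination of the listed options, so it helps to know how many models that means. The sketch below is plain Python with a hypothetical helper `expand_grid`; it is not Surprise's implementation, only a way to count the combinations for the two grids defined above with three folds.

```python
from itertools import product

def expand_grid(grid):
    """Every combination of one value per key, as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

# Same options as in the cell above, rebuilt here so the sketch is self-contained.
sim_names = ['pearson_baseline', 'msd', 'cosine']
sim_grid = {'sim_options': [{'name': n, 'user_based': False} for n in sim_names]}
bsl_grid = {
    'sim_options': sim_grid['sim_options'],
    'bsl_options': [{'method': m, 'learning_rate': .001} for m in ('als', 'sgd')],
}

n_folds = 3
for label, grid in [('param_grid', sim_grid), ('param_grid_bsl', bsl_grid)]:
    combos = expand_grid(grid)
    print(label, len(combos), 'combinations,', len(combos) * n_folds, 'model fits')
```

Under these assumptions the first grid needs 9 fits and the second 18, each of which computes a full item-item similarity matrix.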

We are now ready to set up our grid search using algorithms that are directly derived from a basic nearest neighbors approach. Included in the Surprise package are:

KNNBasic - A basic collaborative filtering algorithm using nearest neighbors <br>
KNNWithMeans - A basic collaborative filtering algorithm, taking into account the mean ratings of each user <br>
KNNBaseline - A basic collaborative filtering algorithm taking into account a baseline rating
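
As a rough illustration of the basic variant in its item-based form, a predicted rating can be computed as a similarity-weighted average of the ratings the user gave to the most similar items. The function below is a simplified sketch in plain Python with made-up numbers and a hypothetical name, `predict_knn_basic`; Surprise's own implementation handles more cases (for example, it ignores non-positive similarities).

```python
def predict_knn_basic(user_ratings, sims_to_target, k=10):
    """Weighted average of the user's ratings on the k most similar items.

    user_ratings: {item_id: rating} for one user
    sims_to_target: {item_id: similarity to the item being predicted}
    """
    # Pair each rated item's similarity with its rating, most similar first.
    neighbours = sorted(
        ((sims_to_target[i], r) for i, r in user_ratings.items() if i in sims_to_target),
        reverse=True,
    )[:k]
    total_sim = sum(s for s, _ in neighbours)
    if total_sim == 0:
        return None  # no usable neighbours in this toy version
    return sum(s * r for s, r in neighbours) / total_sim

# Made-up numbers: the user rated three items; similarities are to a fourth, unseen item.
user_ratings = {'A': 9, 'B': 6, 'C': 3}
sims = {'A': 0.9, 'B': 0.5, 'C': 0.1}
print(predict_knn_basic(user_ratings, sims, k=2))
```

The other two variants follow the same pattern but average deviations from a mean or from a baseline rating instead of the raw ratings.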

In [15]:
grid_search_basic = GridSearch(KNNBasic, param_grid, measures=['RMSE', 'FCP'], verbose=0)
grid_search_means = GridSearch(KNNWithMeans, param_grid, measures=['RMSE', 'FCP'], verbose=0)
grid_search_bsl = GridSearch(KNNBaseline, param_grid_bsl, measures=['RMSE', 'FCP'], verbose=0)

In [16]:
grid_search_basic.evaluate(rating_data)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...


  sim = construction_func[name](*args)


Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.


In [17]:
grid_search_means.evaluate(rating_data)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...


  sim = construction_func[name](*args)


Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.


In [18]:
grid_search_bsl.evaluate(rating_data)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using sgd...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using sgd...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using sgd...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing simil

  sim = construction_func[name](*args)


Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using sgd...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using sgd...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using sgd...
Computing the cosine similarity matrix...
Done computing similarity matrix.


We get the best RMSE scores and parameters as follows:

In [22]:
print(grid_search_basic.best_score['RMSE'])
print(grid_search_basic.best_params['RMSE'])

1.18593231147
{'sim_options': {'user_based': False, 'name': 'pearson_baseline'}}


In [23]:
print(grid_search_means.best_score['RMSE'])
print(grid_search_means.best_params['RMSE'])

1.11228556969
{'sim_options': {'user_based': False, 'name': 'pearson_baseline'}}


In [24]:
print(grid_search_bsl.best_score['RMSE'])
print(grid_search_bsl.best_params['RMSE'])

1.10980073966
{'sim_options': {'user_based': False, 'name': 'pearson_baseline'}, 'bsl_options': {'learning_rate': 0.001, 'method': 'als'}}


With the smallest mean RMSE over three folds of rating_data (1.10980073966), the best-performing model seems to be KNNBaseline with the parameters displayed above, although its margin over KNNWithMeans (1.11228556969) is small. We are now ready to train the model on the entire dataset.
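
For reference, this is how a mean RMSE over folds is computed. The sketch is plain Python with made-up (true, predicted) pairs; it is not the notebook's data and does not call Surprise.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, predicted_rating) pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))

# Made-up (true, predicted) ratings for three held-out folds.
folds = [
    [(10, 9), (7, 8), (5, 5)],
    [(8, 6), (9, 9), (3, 4)],
    [(6, 7), (10, 10), (4, 2)],
]

fold_scores = [rmse(fold) for fold in folds]      # one RMSE per fold
mean_rmse = sum(fold_scores) / len(fold_scores)   # the number a grid search compares
print(fold_scores, mean_rmse)
```

RMSE is in the same units as the ratings, so a value near 1.1 means predictions are typically off by about one point on the 1-10 scale. It measures how well ratings are predicted, which is not the same thing as how sensible the nearest neighbors look.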

In [39]:
anime_algo = KNNBaseline(sim_options=sim_options1, bsl_options=bsl_options1)
rating_trainset = rating_data.build_full_trainset()
testing_model = anime_algo.train(rating_trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


Finally, we define functions to help display the results of our KNNBaseline model. Note that the recommend_me function only accepts exact names of anime as seen on [myanimelist.net](https://myanimelist.net/).

In [42]:
def get_index(x):
    # returns the row index in the anime dataframe of the anime named x
    return anime[anime['name']==x].index.tolist()[0]

In [43]:
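def recommend_me(a):
    print('Here are 10 anime similar to', a, ':' '\n')
    index = get_index(a) # row index of the queried anime in the anime dataframe
    anime_nbrs = anime_algo.get_neighbors(index, k=10) # ids of the 10 nearest neighbors according to the model

    # use each returned id as a row position in the anime dataframe and print its details
    for i in anime_nbrs:
        print(anime.iloc[i]['name'],
              '\n' 'Genre: ', anime.iloc[i]['genre'],
              '\n' 'Episode count: ', anime.iloc[i]['episodes'],
              '\n' 'Rating out of 10:', anime.iloc[i]['rating'], '\n')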

In [44]:
recommend_me('Death Note')

Here are 10 anime similar to Death Note :

Kimi ni Todoke 2nd Season 
Genre:  Romance, School, Shoujo, Slice of Life 
Episode count:  12 
Rating out of 10: 8.17 

Tales of Zestiria the X: Saiyaku no Jidai 
Genre:  Action, Fantasy 
Episode count:  1 
Rating out of 10: 7.44 

3 Choume no Tama: Onegai! Momo-chan wo Sagashite!! 
Genre:  Adventure, Kids 
Episode count:  1 
Rating out of 10: 6.24 

Byousoku 5 Centimeter 
Genre:  Drama, Romance, Slice of Life 
Episode count:  3 
Rating out of 10: 8.1 

Mirai Shounen Conan: Tokubetsu-hen - Kyodaiki Gigant no Fukkatsu 
Genre:  Adventure, Drama, Sci-Fi 
Episode count:  1 
Rating out of 10: 6.93 

Chihayafuru 2 
Genre:  Drama, Game, Josei, Slice of Life, Sports 
Episode count:  25 
Rating out of 10: 8.52 

Gakkatsu! 2nd Season 
Genre:  Comedy, School 
Episode count:  25 
Rating out of 10: 6.51 

Utawarerumono 
Genre:  Action, Drama, Fantasy, Sci-Fi 
Episode count:  26 
Rating out of 10: 7.78 

Trinity Seven OVA 
Genre:  Action, Comedy, Ecchi, Fan

## Failure

These anime aren't at all like Death Note. In fact, these results are worse than those from part 1 :(

This was my first run with the Surprise package. I will have to come back to this problem to find a better approach to collaborative filtering - with or without using Surprise.
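
One thing worth checking when revisiting this, offered as an assumption rather than a confirmed diagnosis: Surprise's documentation describes `get_neighbors` as taking and returning *inner* ids, the library's own internal numbering of items, with `trainset.to_inner_iid` and `trainset.to_raw_iid` converting to and from raw ids such as anime_id. The `recommend_me` function passes row positions from the anime dataframe instead. The toy sketch below is plain Python, not Surprise. It assumes inner ids are handed out in order of first appearance in the ratings, and only shows that the two numberings need not agree.

```python
# Toy anime table in display order (row position = list index), made-up titles.
anime_names = ['Show A', 'Show B', 'Show C', 'Show D']
anime_ids = [101, 102, 103, 104]   # raw ids, playing the role of anime_id

# Toy ratings in file order: (user_id, anime_id, rating).
ratings = [(1, 103, 9), (1, 101, 7), (2, 104, 8), (2, 103, 6), (3, 102, 10)]

# Assumed numbering scheme: each raw id seen for the first time gets the next inner id.
raw_to_inner = {}
for _, raw_id, _ in ratings:
    raw_to_inner.setdefault(raw_id, len(raw_to_inner))
inner_to_raw = {inner: raw for raw, inner in raw_to_inner.items()}

# Compare the row position of each anime with its inner id.
for row, (name, raw_id) in enumerate(zip(anime_names, anime_ids)):
    print(name, '| row position:', row, '| inner id:', raw_to_inner[raw_id])
```

If this is what happened, the fix would be to convert anime_id to an inner id before asking for neighbors, convert the returned inner ids back to anime_id, and then look the results up in anime.csv by anime_id rather than by row position.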