In [1]:
%pylab inline
import pandas as pd
import random
import time

from surprise import Reader
from surprise import KNNBasic, KNNBaseline
from surprise import Dataset
from surprise import SVD, NMF
from surprise import accuracy
from surprise.model_selection import train_test_split

Populating the interactive namespace from numpy and matplotlib


# Association Recommendations

> Contributors: Eric Keränen, Samuel Aitamaa & Teemu Luhtanen

This assignment is exploratory in nature. Your task is to explore the applicability of scikit-surprise in building a recommendation engine for the filthy Anime dataset.

## Part I 

What kind of preprocessing is necessary for the ratings dataset?

In [2]:
df_anime = pd.read_csv("anime.csv")
df_anime.head(10)

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
5,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"Comedy, Drama, School, Shounen, Sports",TV,10,9.15,93351
6,11061,Hunter x Hunter (2011),"Action, Adventure, Shounen, Super Power",TV,148,9.13,425855
7,820,Ginga Eiyuu Densetsu,"Drama, Military, Sci-Fi, Space",OVA,110,9.11,80679
8,15335,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...,"Action, Comedy, Historical, Parody, Samurai, S...",Movie,1,9.1,72534
9,15417,Gintama&#039;: Enchousen,"Action, Comedy, Historical, Parody, Samurai, S...",TV,13,9.11,81109


In [3]:
df_ratings = pd.read_csv("rating.csv")
df_ratings.head(10)

Unnamed: 0,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1
5,1,355,-1
6,1,356,-1
7,1,442,-1
8,1,487,-1
9,1,846,-1


As we can see, the ratings dataset contains many "-1" values. According to the dataset description, a -1 means that the user has watched the anime but hasn't rated it.

> "Rating - rating out of 10 this user has assigned (-1 if the user watched it but didn't assign a rating)."  
> - CooperUnion, [Kaggle](https://www.kaggle.com/CooperUnion/anime-recommendations-database/version/1)

Because of this, we kept the full data under the name <code>df_watched</code> and dropped the rows with a rating of -1 from <code>df_ratings</code>.

In [4]:
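# Keep a reference to the full frame (rated and unrated rows) for Part III
df_watched = df_ratings
# Boolean indexing returns a new frame, so df_watched is left unchanged
df_ratings = df_ratings[df_ratings.rating != -1]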

In [5]:
df_watched.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
user_id,7813737.0,36727.956745,20997.946119,1.0,18974.0,36791.0,54757.0,73516.0
anime_id,7813737.0,8909.072104,8883.949636,1.0,1240.0,6213.0,14093.0,34519.0
rating,7813737.0,6.14403,3.7278,-1.0,6.0,7.0,9.0,10.0


Now <code>df_watched</code> includes both rated anime and anime that were watched but not rated. This data frame is used and explained in **Part III**.

In [6]:
df_ratings.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
user_id,6337241.0,36747.914434,21013.403087,1.0,18984.0,36815.0,54873.0,73516.0
anime_id,6337241.0,8902.866383,8881.999647,1.0,1239.0,6213.0,14075.0,34475.0
rating,6337241.0,7.808497,1.572496,1.0,7.0,8.0,9.0,10.0


And <code>df_ratings</code> includes only rated animes.
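
As a quick sanity check on the filtering step, the sketch below repeats it on a small made-up frame (the values are illustrative and not taken from <code>rating.csv</code>) and counts how many rows the filter removes. On the real data, the two <code>count</code> values above imply 7,813,737 − 6,337,241 = 1,476,496 unrated rows, roughly 19% of the file.

```python
import pandas as pd

# Toy stand-in for rating.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "anime_id": [20, 24, 20, 79, 24],
    "rating":   [-1, 8, 10, -1, 7],
})

watched = toy.copy()               # everything the users have watched
rated = toy[toy["rating"] != -1]   # only rows with an explicit 1-10 rating

n_unrated = len(watched) - len(rated)
print(f"{n_unrated} of {len(watched)} rows are unrated")
```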

## Part II

How do the recommendation algorithms (e.g. KNN and SVD) perform with a data set of
this magnitude? Do you encounter hardware limitations? If yes, how can you
circumvent some of the limitations to be able to carry on with the experiment?

In [7]:
reader = Reader(rating_scale=(1, 10))

data = Dataset.load_from_df(df_ratings[['user_id', 'anime_id', 'rating']], reader)

# Sample random trainset and testset
# Test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

sim_options = {
    'user_based': False,  # This parameter had to be changed because of hardware limitations
}

# We'll test which algorithm performs best
algos = [SVD(), KNNBasic(sim_options = sim_options), NMF(), KNNBaseline(sim_options = sim_options)]
algoNames = ["SVD", "KNNBasic", "NMF", "KNNBaseline"]


for index, algo in enumerate(algos):
    t0 = time.time()
    
    # Train the algorithm on the trainset, and predict ratings for the testset
    algo.fit(trainset)
    predictions = algo.test(testset)
    
    t1 = time.time()
    etime = t1 - t0
    
    # Then compute RMSE
    print (f"RMSE for {algoNames[index]} is {accuracy.rmse(predictions)}\n")
    print (f"Elapsed time for {algoNames[index]} was {etime:.2f} s\n\n")

RMSE: 1.1405
RMSE for SVD is 1.1405407890691563

Elapsed time for SVD was 216.16 s


Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.2171
RMSE for KNNBasic is 1.2171330455503544

Elapsed time for KNNBasic was 318.01 s


RMSE: 2.2255
RMSE for NMF is 2.225512323581884

Elapsed time for NMF was 241.94 s


Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.1686
RMSE for KNNBaseline is 1.1685677875177922

Elapsed time for KNNBaseline was 379.63 s




We decided to try <code>SVD</code>, <code>KNNBasic</code>, <code>NMF</code> and <code>KNNBaseline</code> from the <code>Surprise</code> library. As the computing times show, each of these algorithms was quite heavy on a data set of this magnitude, taking between roughly 216 and 380 seconds to fit and test. The best performing algorithm turned out to be <code>SVD</code>, in terms of both computing time (216 s) and RMSE (1.14).
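
For reference, RMSE is the square root of the mean squared difference between the predicted and the true ratings, so lower is better and the unit is rating points. The sketch below computes it by hand on made-up numbers. It is only meant to illustrate the metric, not to replace <code>accuracy.rmse</code>.

```python
import math

def rmse(true_ratings, predicted_ratings):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up ratings on the 1-10 scale, for illustration only.
true_ratings = [8, 7, 10, 6]
predicted_ratings = [7, 7, 9, 8]
print(rmse(true_ratings, predicted_ratings))
```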

For <code>KNNBasic</code> and <code>KNNBaseline</code> we ran into hardware problems because of the huge size of the dataset. We circumvented these problems by setting the <code>sim_options</code> parameter <code>user_based</code> to <code>False</code>. This switches the computation from similarities between users to similarities between items. The computing times still remained high compared to <code>SVD</code>.
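
A rough back-of-the-envelope estimate suggests why the switch helps. The sketch assumes that the similarity matrix is stored as a dense n × n array of 8-byte floats, which is our understanding of how the KNN algorithms in <code>Surprise</code> work but is not verified here. The user count is an upper bound taken from the largest <code>user_id</code> in the tables above. The item count of 11,000 is a hypothetical round figure, since the number of distinct anime in the training set is not shown in this notebook.

```python
def dense_matrix_gib(n, bytes_per_value=8):
    """Memory in GiB for a dense n x n matrix of fixed-size values."""
    return n * n * bytes_per_value / 2**30

n_users_upper_bound = 73_516   # largest user_id in the describe() output
n_items_assumed = 11_000       # hypothetical; not measured in this notebook

print(f"user-based: up to {dense_matrix_gib(n_users_upper_bound):.1f} GiB")
print(f"item-based: about {dense_matrix_gib(n_items_assumed):.1f} GiB")
```

Under these assumptions the user-based matrix would need roughly 40 GiB, while the item-based one stays below 1 GiB.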

## Part III

Can you combine the information in the two files in a meaningful way to have the
recommender display the titles of the recommended movies?

In [8]:
df_merged = pd.merge(df_watched, df_anime, left_on="anime_id", right_on="anime_id")

We combine the anime list and the watched list on their shared <code>anime_id</code> column. We used the watched data frame from **Part I**, which also contains the anime that a user has watched but not rated. We decided to include these shows so that anything the user has already seen can be excluded from the recommendations, even if it was never rated.
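
One detail worth keeping in mind is that <code>pd.merge</code> performs an inner join by default, so watched rows whose <code>anime_id</code> does not appear in the anime list are dropped. Both frames also have a <code>rating</code> column, which pandas disambiguates with the suffixes <code>_x</code> and <code>_y</code>. The sketch below shows both effects on made-up miniature frames.

```python
import pandas as pd

# Made-up miniature versions of the two frames, for illustration only.
watched = pd.DataFrame({
    "user_id":  [1, 1, 2],
    "anime_id": [20, 999, 20],   # 999 has no entry in the anime frame
    "rating":   [-1, 8, 9],      # the user's own rating
})
anime = pd.DataFrame({
    "anime_id": [20, 24],
    "name":     ["Title A", "Title B"],
    "rating":   [7.8, 8.1],      # the average rating of the anime
})

merged = pd.merge(watched, anime, on="anime_id")  # how="inner" is the default

print(merged.columns.tolist())
print(len(merged))
```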

After this we train and use the SVD algorithm since it had the best performance in **Part II**.

In [9]:
# SVD had the lowest RMSE, so it performed the best out of the four recommendation algorithms
algo = SVD()
algo.fit(trainset)
pred = algo.test(testset)
print (f"RMSE for SVD is {accuracy.rmse(pred)}")

RMSE: 1.1405
RMSE for SVD is 1.1404871542262143


<code>user_id_to_predict</code> is the user for whom we want to recommend anime.

First we get all the unique anime ids, then we get the ids of the anime that the user has watched (including those that haven't been rated). Once we have both sets of ids, we drop the common ones with <code>np.setdiff1d(unique_ids, iids1001)</code>, so that only the ids of the anime that the user hasn't watched remain.

After this we predict in a for loop which shows the user might like.
We then put the predictions into a new data frame and sort them by predicted score. The predictions with the highest scores are the ones recommended most strongly to the user.

Finally, we drop some unnecessary columns and keep only the name of the recommended anime, its genre, its average rating, and the predicted rating that the algorithm thinks the user might give the show.

In [12]:
user_id_to_predict = 10000

# Get a unique list of the anime ids
unique_ids = df_merged['anime_id'].unique()
# Get the list of anime ids that the user_id_to_predict has watched
iids1001 = df_merged.loc[df_merged['user_id'] == user_id_to_predict, 'anime_id']
# Remove the movies the user has watched (even if they are not rated)
movies_to_predict = np.setdiff1d(unique_ids, iids1001)

my_recs = []
for iid in movies_to_predict:
    my_recs.append((iid, algo.predict(uid = user_id_to_predict, iid = iid).est))
predictions = pd.DataFrame(my_recs, columns=['iid', 'prediction']).sort_values('prediction', ascending=False)

recommendations = pd.merge(df_anime, predictions, left_on = "anime_id", right_on = "iid").sort_values('prediction', ascending=False)
recommendations = recommendations.drop(['type', 'episodes', 'members', 'iid', 'anime_id'], axis = 1)
# Print top 10 recommendation for user_id_to_predict
recommendations.head(10).style.hide_index()

name,genre,rating,prediction
Katekyo Hitman Reborn!,"Action, Comedy, Shounen, Super Power",8.37,9.511117
Saint Seiya: The Lost Canvas - Meiou Shinwa 2,"Action, Adventure, Martial Arts, Shounen, Super Power, Supernatural",8.36,9.438869
Kuroshitsuji,"Action, Comedy, Demons, Fantasy, Historical, Shounen, Supernatural",8.06,9.427915
Kamisama Hajimemashita◎,"Comedy, Demons, Fantasy, Romance, Shoujo, Supernatural",8.28,9.416489
Steins;Gate,"Sci-Fi, Thriller",9.17,9.411822
Saint Seiya: The Lost Canvas - Meiou Shinwa,"Action, Adventure, Martial Arts, Shounen, Super Power, Supernatural",8.24,9.377605
Koe no Katachi,"Drama, School, Shounen",9.05,9.351185
Bleach,"Action, Comedy, Shounen, Super Power, Supernatural",7.95,9.335158
Gintama°,"Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen",9.25,9.309967
Kamisama Hajimemashita: Kako-hen,"Comedy, Demons, Fantasy, Shoujo, Supernatural",8.64,9.307089


From the list we can see that "Katekyo Hitman Reborn!" is the top recommendation for the user with id 10000.