# Anime Recommendation Model

In this notebook, an anime recommender model is trained using content-based filtering with a neural network. An anime dataset with millions of ratings on thousands of anime is preprocessed with feature engineering to train a model that can recommend anime to a new or existing user based on their watch preferences.


# Outline
- [ 1 - Packages ](#1)
- [ 2 - Preprocessing Data](#2)
  - [ 2.1 Loading and Visualizing the Data](#2.1)
  - [ 2.2 Anime Data Processing](#2.2)
  - [ 2.3 User Ratings Processing](#2.3)
  - [ 2.4 Features and Labels](#2.4)
  - [ 2.5 Scaling](#2.5)
- [ 3 - Recommendation Model](#3)
  - [ 3.1 Model](#3.1)
  - [ 3.2 Training](#3.2)
  - [ 3.3 New User Recommendation](#3.3)
  - [ 3.4 Existing User Recommendation](#3.4)
- [ 4 - Results](#4)

<a name="1"></a>
## 1 - Packages 

Below are all the needed packages for this notebook.
- [numpy](https://www.numpy.org) is the fundamental package for scientific computing with Python.
- [pandas](https://pandas.pydata.org) is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.
- [tensorflow](https://www.tensorflow.org) is an end-to-end machine learning platform.
- [scikit-learn](https://scikit-learn.org/stable/) is a library of simple and efficient tools for predictive data analysis.

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

<a name="2"></a>
## 2 - Preprocessing Data

The dataset for the model we'll build contains 57 million ratings of 17,562 anime from 325,772 different users on myanimelist.net. We will be using two dataframes from the data: an anime dataset that includes information about each anime, such as name, id, overall rating, genre, etc., and a user ratings dataset that contains the ratings given by the users.

The dataset can be found here: [Anime Database 2020](https://www.kaggle.com/datasets/hernan4444/anime-recommendation-database-2020)
<br/><br/>
<a name="2.1"></a>
### 2.1 Loading and Visualizing the Data

In [2]:
#Load data
anime_data = pd.read_csv("./Data/anime.csv")
user_data = pd.read_csv("./Data/rating_complete.csv")

In [3]:
print(f"anime_data: {anime_data.shape}")
print(f"user_data: {user_data.shape}")

anime_data: (17562, 35)
user_data: (57633278, 3)


In [4]:
anime_data.head()

Unnamed: 0,MAL_ID,Name,Score,Genres,English name,Japanese name,Type,Episodes,Aired,Premiered,...,Score-10,Score-9,Score-8,Score-7,Score-6,Score-5,Score-4,Score-3,Score-2,Score-1
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,カウボーイビバップ,TV,26,"Apr 3, 1998 to Apr 24, 1999",Spring 1998,...,229170.0,182126.0,131625.0,62330.0,20688.0,8904.0,3184.0,1357.0,741.0,1580.0
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space",Cowboy Bebop:The Movie,カウボーイビバップ 天国の扉,Movie,1,"Sep 1, 2001",Unknown,...,30043.0,49201.0,49505.0,22632.0,5805.0,1877.0,577.0,221.0,109.0,379.0
2,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen",Trigun,トライガン,TV,26,"Apr 1, 1998 to Sep 30, 1998",Spring 1998,...,50229.0,75651.0,86142.0,49432.0,15376.0,5838.0,1965.0,664.0,316.0,533.0
3,7,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),TV,26,"Jul 2, 2002 to Dec 24, 2002",Summer 2002,...,2182.0,4806.0,10128.0,11618.0,5709.0,2920.0,1083.0,353.0,164.0,131.0
4,8,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural",Beet the Vandel Buster,冒険王ビィト,TV,52,"Sep 30, 2004 to Sep 29, 2005",Fall 2004,...,312.0,529.0,1242.0,1713.0,1068.0,634.0,265.0,83.0,50.0,27.0


In [5]:
user_data.head()

Unnamed: 0,user_id,anime_id,rating
0,0,430,9
1,0,1004,5
2,0,3010,7
3,0,570,7
4,0,2762,9


We can see the anime dataset has 35 columns, most of which we will not use. As for the user dataset, we have the user id, the anime id that is being rated, along with its rating, so we can leave that one as it is for now.

In order to train our model, we'll have to transform our datasets. The model will take into account the year and genres of each anime, and the ratings of the users. For the anime data, there should be a column for every genre, and if an anime belongs to a genre, the corresponding column should be set to 1. For the user data, we need to aggregate the ratings of every user and group them by genre, so we will also end up with a column for every genre, holding the average rating per genre for each user.

<a name="2.2"></a>
### 2.2 Anime Data Processing

Let's start with the anime data, and keep only the columns we need.

In [6]:
proc_ad = anime_data[['MAL_ID', 'Name', 'English name', 'Aired', 'Score', 'Genres']]

In [7]:
proc_ad.head()

Unnamed: 0,MAL_ID,Name,English name,Aired,Score,Genres
0,1,Cowboy Bebop,Cowboy Bebop,"Apr 3, 1998 to Apr 24, 1999",8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space"
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop:The Movie,"Sep 1, 2001",8.39,"Action, Drama, Mystery, Sci-Fi, Space"
2,6,Trigun,Trigun,"Apr 1, 1998 to Sep 30, 1998",8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen"
3,7,Witch Hunter Robin,Witch Hunter Robin,"Jul 2, 2002 to Dec 24, 2002",7.27,"Action, Mystery, Police, Supernatural, Drama, ..."
4,8,Bouken Ou Beet,Beet the Vandel Buster,"Sep 30, 2004 to Sep 29, 2005",6.98,"Adventure, Fantasy, Shounen, Supernatural"


In [8]:
print("Unknown values in 'Aired' column: ", len(proc_ad.loc[proc_ad['Aired'] == "Unknown"]))
print("Unknown values in 'Genres' column: ", len(proc_ad.loc[proc_ad['Genres'] == "Unknown"]))

Unknown values in 'Aired' column:  309
Unknown values in 'Genres' column:  63


There are some unknown values in both columns, and we need both of them filled, so let's get rid of the rows that contain these unknown values.

Also, the Aired column holds the start date and, for most anime, the end date as well. We only need the year of release, so let's trim it down to the first year. Finally, we'll replace the unknown values in the Score column with 0.

In [9]:
#Get rid of unknown values
proc_ad = proc_ad[proc_ad.Aired != "Unknown"]
proc_ad = proc_ad[proc_ad.Genres != "Unknown"]

In [10]:
#Get year
proc_ad['Aired'] = proc_ad.Aired.str.extract(r"(\d{4})")

#Replace missing values
proc_ad.loc[proc_ad['Score'] == "Unknown", 'Score'] = 0.0

#Reset index
proc_ad = proc_ad.reset_index(drop=True)
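
The year extraction can be checked in isolation on a few sample strings. This is a minimal sketch: the sample values are hypothetical (the last one is made up to show a different format), and `first_year` is a name introduced only for this illustration.

```python
import pandas as pd

# Hypothetical sample values in formats similar to the 'Aired' column
aired = pd.Series(["Apr 3, 1998 to Apr 24, 1999", "Sep 1, 2001", "2004 to ?"])

# str.extract returns the first match of the capture group for each row;
# expand=False gives back a Series instead of a one-column DataFrame
first_year = aired.str.extract(r"(\d{4})", expand=False)

print(first_year.tolist())  # ['1998', '2001', '2004']
```

Note that the extracted years are still strings; they only become numbers when the features are converted to a float array later on.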

In [11]:
proc_ad.head()

Unnamed: 0,MAL_ID,Name,English name,Aired,Score,Genres
0,1,Cowboy Bebop,Cowboy Bebop,1998,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space"
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop:The Movie,2001,8.39,"Action, Drama, Mystery, Sci-Fi, Space"
2,6,Trigun,Trigun,1998,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen"
3,7,Witch Hunter Robin,Witch Hunter Robin,2002,7.27,"Action, Mystery, Police, Supernatural, Drama, ..."
4,8,Bouken Ou Beet,Beet the Vandel Buster,2004,6.98,"Adventure, Fantasy, Shounen, Supernatural"


Now we should retrieve all of the genres and create their columns.

In [12]:
#Get genres
genres = proc_ad.Genres.str.split(', ', expand=True).stack().unique()
print(f"Genres: {genres}")

Genres: ['Action' 'Adventure' 'Comedy' 'Drama' 'Sci-Fi' 'Space' 'Mystery'
 'Shounen' 'Police' 'Supernatural' 'Magic' 'Fantasy' 'Sports' 'Josei'
 'Romance' 'Slice of Life' 'Cars' 'Seinen' 'Horror' 'Psychological'
 'Thriller' 'Super Power' 'Martial Arts' 'School' 'Ecchi' 'Vampire'
 'Military' 'Historical' 'Dementia' 'Mecha' 'Demons' 'Samurai' 'Game'
 'Shoujo' 'Harem' 'Music' 'Shoujo Ai' 'Shounen Ai' 'Kids' 'Hentai'
 'Parody' 'Yuri' 'Yaoi']


In [13]:
#Set initial values
for col in genres:
    proc_ad[col] = 0

In [14]:
proc_ad.head()

Unnamed: 0,MAL_ID,Name,English name,Aired,Score,Genres,Action,Adventure,Comedy,Drama,...,Shoujo,Harem,Music,Shoujo Ai,Shounen Ai,Kids,Hentai,Parody,Yuri,Yaoi
0,1,Cowboy Bebop,Cowboy Bebop,1998,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop:The Movie,2001,8.39,"Action, Drama, Mystery, Sci-Fi, Space",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,6,Trigun,Trigun,1998,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,7,Witch Hunter Robin,Witch Hunter Robin,2002,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,8,Bouken Ou Beet,Beet the Vandel Buster,2004,6.98,"Adventure, Fantasy, Shounen, Supernatural",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now that we have the columns for the genres, we can add a value of 1 to a column, to indicate that the anime is of that genre.

In [15]:
#Set the genre column to 1 for every genre an anime belongs to
for index, row in proc_ad.iterrows():
    for genre in genres:
        if genre in row['Genres']:
            proc_ad.loc[index, genre] = 1
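
One thing to keep in mind with the loop above: `genre in row['Genres']` is a substring test, so a genre such as 'Shoujo' may also be flagged for an anime that is only tagged 'Shoujo Ai'. As a hedged alternative, the sketch below uses `Series.str.get_dummies`, which splits on the separator and matches whole genre names. The toy titles and genre strings are made up for illustration, and this is not the method used to produce the outputs shown in this notebook.

```python
import pandas as pd

# Toy frame; the titles and genre strings are made up for illustration
toy = pd.DataFrame({
    "Name": ["A", "B"],
    "Genres": ["Action, Shoujo Ai", "Shoujo, Comedy"],
})

# Substring test, as in the loop above: 'Shoujo' is found inside 'Shoujo Ai'
substring_hit = "Shoujo" in toy.loc[0, "Genres"]

# Exact matching: split on ', ' and build one 0/1 column per genre
multi_hot = toy["Genres"].str.get_dummies(sep=", ")

print(substring_hit)               # True
print(multi_hot.loc[0, "Shoujo"])  # 0
```

Besides matching exactly, this approach is vectorized, so it avoids iterating over every row with `iterrows`.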

In [16]:
proc_ad = proc_ad.drop(['Genres'], axis=1)

In [17]:
proc_ad.head()

Unnamed: 0,MAL_ID,Name,English name,Aired,Score,Action,Adventure,Comedy,Drama,Sci-Fi,...,Shoujo,Harem,Music,Shoujo Ai,Shounen Ai,Kids,Hentai,Parody,Yuri,Yaoi
0,1,Cowboy Bebop,Cowboy Bebop,1998,8.78,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop:The Movie,2001,8.39,1,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
2,6,Trigun,Trigun,1998,8.24,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
3,7,Witch Hunter Robin,Witch Hunter Robin,2002,7.27,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,8,Bouken Ou Beet,Beet the Vandel Buster,2004,6.98,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
print(f"Current shape: {proc_ad.shape}")

Current shape: (17192, 48)


We now have an anime dataset we can use to create our training set. We created new features and got rid of the unnecessary rows and columns.

<a name="2.3"></a>
### 2.3 User Ratings Processing

For our user ratings data, we have 57 million ratings. Unfortunately, my computer does not have enough RAM to process and handle multiple dataframes of 57 million items, so we'll start by reducing our ratings dataset to 15 million.

In [19]:
print(f"User Data shape: {user_data.shape}")

User Data shape: (57633278, 3)


In [20]:
#Reduce Data
proc_ud = user_data[:15000000]

In [21]:
proc_ud.head()

Unnamed: 0,user_id,anime_id,rating
0,0,430,9
1,0,1004,5
2,0,3010,7
3,0,570,7
4,0,2762,9


In [22]:
print(f"New shape: {proc_ud.shape}")

New shape: (15000000, 3)


In [23]:
print(f"User ids: {proc_ud.user_id.unique()}")
print(f"\n# of users: {len(proc_ud.user_id.unique())}")

User ids: [    0     1     2 ... 92083 92084 92085]

# of users: 80692


To get the average rating per genre for each user, we need to know which genres each rated anime belongs to. We can achieve this by joining our user dataframe with our anime dataset.

In [24]:
#Join user and anime data
proc_ud = proc_ud.join(proc_ad.set_index('MAL_ID'), on='anime_id')
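
To illustrate what this join does, here is a small sketch on made-up data (the ids and values are hypothetical). `DataFrame.join` performs a left join by default, so a rating whose anime id is missing from the anime table is kept, with NaN in the joined columns.

```python
import pandas as pd

# Hypothetical ratings; anime_id 999 does not exist in the anime table
ratings = pd.DataFrame({"user_id": [0, 0, 1],
                        "anime_id": [1, 5, 999],
                        "rating": [9, 5, 7]})
anime = pd.DataFrame({"MAL_ID": [1, 5], "Action": [1, 0], "Drama": [0, 1]})

# Left join: match the 'anime_id' column against the MAL_ID index
joined = ratings.join(anime.set_index("MAL_ID"), on="anime_id")

# Unmatched rows carry NaN in every joined column
n_unmatched = int(joined["Action"].isnull().sum())
print(n_unmatched)  # 1
```

The NaN values also explain why the 0/1 genre flags show up as 0.0/1.0 after the join: integer columns are converted to floats so that they can hold NaN.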

In [25]:
proc_ud.head()

Unnamed: 0,user_id,anime_id,rating,Name,English name,Aired,Score,Action,Adventure,Comedy,...,Shoujo,Harem,Music,Shoujo Ai,Shounen Ai,Kids,Hentai,Parody,Yuri,Yaoi
0,0,430,9,Fullmetal Alchemist: The Conqueror of Shamballa,Fullmetal Alchemist:The Movie - Conqueror of S...,2005,7.57,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,1004,5,Kanojo to Kanojo no Neko,She and Her Cat:Their Standing Points,2002,7.33,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,3010,7,Kaiketsu Zorro,The Magnificent Zorro,1996,7.23,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,570,7,Jin-Rou,Jin-Roh:The Wolf Brigade,2000,7.79,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,2762,9,Igano Kabamaru,Unknown,1983,7.87,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
print("Users who rated anime 430 (row, user id):")
print(proc_ud.loc[proc_ud['anime_id'] == 430, 'user_id'])

Users who rated anime 430 (row, user id):
0               0
929             6
2396           18
2752           19
4310           33
            ...  
14991509    92018
14992860    92026
14994550    92040
14995188    92046
14999336    92076
Name: user_id, Length: 11681, dtype: int64


We check for and remove null values (rated anime ids that were not found in our anime dataset).

In [27]:
#Count null values per column
proc_ud.isnull().sum()

user_id             0
anime_id            0
rating              0
Name             1255
English name     1255
Aired            1255
Score            1255
Action           1255
Adventure        1255
Comedy           1255
Drama            1255
Sci-Fi           1255
Space            1255
Mystery          1255
Shounen          1255
Police           1255
Supernatural     1255
Magic            1255
Fantasy          1255
Sports           1255
Josei            1255
Romance          1255
Slice of Life    1255
Cars             1255
Seinen           1255
Horror           1255
Psychological    1255
Thriller         1255
Super Power      1255
Martial Arts     1255
School           1255
Ecchi            1255
Vampire          1255
Military         1255
Historical       1255
Dementia         1255
Mecha            1255
Demons           1255
Samurai          1255
Game             1255
Shoujo           1255
Harem            1255
Music            1255
Shoujo Ai        1255
Shounen Ai       1255
Kids      

In [28]:
#Get rid of null values
proc_ud = proc_ud.dropna()
proc_ud = proc_ud.reset_index(drop=True)

We now have a dataframe with the features for our anime training set and the ratings. Let's save these for later.

In [29]:
#Get anime vector features and ratings vector
x_items = proc_ud.drop(['user_id', 'rating'], axis=1)
y_ratings = proc_ud['rating']

Now that we have joined our user and anime data, we know which genres a user has rated. We assign the anime rating to the corresponding genre columns, and drop the columns we no longer need.

In [30]:
#Set ratings to matching genres
for genre in genres:
    proc_ud[genre] = proc_ud[genre].multiply(proc_ud['rating'], axis="index")

In [31]:
proc_ud.head()

Unnamed: 0,user_id,anime_id,rating,Name,English name,Aired,Score,Action,Adventure,Comedy,...,Shoujo,Harem,Music,Shoujo Ai,Shounen Ai,Kids,Hentai,Parody,Yuri,Yaoi
0,0,430,9,Fullmetal Alchemist: The Conqueror of Shamballa,Fullmetal Alchemist:The Movie - Conqueror of S...,2005,7.57,0.0,0.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,1004,5,Kanojo to Kanojo no Neko,She and Her Cat:Their Standing Points,2002,7.33,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,3010,7,Kaiketsu Zorro,The Magnificent Zorro,1996,7.23,0.0,7.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,570,7,Jin-Rou,Jin-Roh:The Wolf Brigade,2000,7.79,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,2762,9,Igano Kabamaru,Unknown,1983,7.87,9.0,9.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [32]:
user_ratings = proc_ud.drop(['anime_id', 'rating', 'Name', 'English name', 'Aired', 'Score'], axis=1)

In [33]:
user_ratings.head()

Unnamed: 0,user_id,Action,Adventure,Comedy,Drama,Sci-Fi,Space,Mystery,Shounen,Police,...,Shoujo,Harem,Music,Shoujo Ai,Shounen Ai,Kids,Hentai,Parody,Yuri,Yaoi
0,0,0.0,0.0,9.0,9.0,0.0,0.0,0.0,9.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,9.0,9.0,9.0,0.0,0.0,0.0,0.0,9.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


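By grouping our rows by user and taking the mean of every genre column, we get the average rating per genre for each user. Genres that an anime does not belong to are first replaced with NaN so that they do not count towards the mean.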

In [34]:
#Mark unrated genres as NaN so they are excluded from the mean
#(restricting the replacement to the genre columns leaves user_id 0 untouched)
user_ratings[genres] = user_ratings[genres].replace(0.0, np.nan)

#Get each user's average rating per genre
user_ratings = user_ratings.groupby('user_id').mean()
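
Why the zeros are swapped for NaN before averaging can be seen on a toy example. The sketch below uses made-up ratings for a single user and two hypothetical genre columns; `naive` and `masked` are names introduced only here.

```python
import numpy as np
import pandas as pd

# Three rated anime for user 0; 0.0 means "this anime is not in that genre"
toy = pd.DataFrame({
    "user_id": [0, 0, 0],
    "Action":  [9.0, 0.0, 7.0],
    "Drama":   [0.0, 5.0, 0.0],
})
genre_cols = ["Action", "Drama"]

# Averaging with the zeros in place drags the genre means down
naive = toy.groupby("user_id").mean()

# Replacing zeros with NaN makes mean() skip them
toy[genre_cols] = toy[genre_cols].replace(0.0, np.nan)
masked = toy.groupby("user_id").mean()

print(masked.loc[0, "Action"])  # 8.0
print(masked.loc[0, "Drama"])   # 5.0
```

With the zeros left in, the Action average would be 16/3 (about 5.33) instead of 8.0, because the non-Action anime would be counted as a rating of 0.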

In [35]:
user_ratings.head()

Unnamed: 0_level_0,Action,Adventure,Comedy,Drama,Sci-Fi,Space,Mystery,Shounen,Police,Supernatural,...,Shoujo,Harem,Music,Shoujo Ai,Shounen Ai,Kids,Hentai,Parody,Yuri,Yaoi
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,7.7,7.444444,7.75,7.2,7.0,,8.5,7.076923,7.0,7.571429,...,8.333333,,6.0,,6.0,,,9.0,,
1,7.886364,7.75,7.86,8.447368,8.642857,,8.818182,8.12963,9.0,7.954545,...,,7.0,9.0,,,6.0,,7.25,,
2,8.551724,8.875,8.5,8.4,8.4,,8.111111,8.608696,10.0,8.375,...,,7.833333,7.0,,,,7.5,9.0,,
3,7.571429,7.84,7.492228,7.956522,7.630435,9.0,7.658537,7.645161,8.5,7.655914,...,7.75,7.121212,8.0,6.0,,,,7.8125,,
4,7.53125,7.769231,7.685185,7.724638,7.133333,,7.952381,7.545455,10.0,7.676471,...,7.785714,7.166667,8.2,5.5,7.0,,,6.5,,


In [36]:
user_ratings.fillna(0, inplace=True)

#Round averages
user_ratings[genres] = user_ratings[genres].round(2)

In [37]:
user_ratings.head()

Unnamed: 0_level_0,Action,Adventure,Comedy,Drama,Sci-Fi,Space,Mystery,Shounen,Police,Supernatural,...,Shoujo,Harem,Music,Shoujo Ai,Shounen Ai,Kids,Hentai,Parody,Yuri,Yaoi
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,7.7,7.44,7.75,7.2,7.0,0.0,8.5,7.08,7.0,7.57,...,8.33,0.0,6.0,0.0,6.0,0.0,0.0,9.0,0.0,0.0
1,7.89,7.75,7.86,8.45,8.64,0.0,8.82,8.13,9.0,7.95,...,0.0,7.0,9.0,0.0,0.0,6.0,0.0,7.25,0.0,0.0
2,8.55,8.88,8.5,8.4,8.4,0.0,8.11,8.61,10.0,8.38,...,0.0,7.83,7.0,0.0,0.0,0.0,7.5,9.0,0.0,0.0
3,7.57,7.84,7.49,7.96,7.63,9.0,7.66,7.65,8.5,7.66,...,7.75,7.12,8.0,6.0,0.0,0.0,0.0,7.81,0.0,0.0
4,7.53,7.77,7.69,7.72,7.13,0.0,7.95,7.55,10.0,7.68,...,7.79,7.17,8.2,5.5,7.0,0.0,0.0,6.5,0.0,0.0


In [38]:
print(f"Current shape: {user_ratings.shape}")

Current shape: (80692, 43)


With the user ratings, we can now finish our training set. We were able to create new features by using our other set and then processing the ratings.

<a name="2.4"></a>
### 2.4 Features and Labels

To train the model, we will take the dot product of two vectors: one computed from the user content and one from the anime content. We already saved the anime features and the ratings earlier, so we only need the features for our user vector. By joining the per-user genre ratings from above with the processed user data, we get one row of user features for every rating, aligned with the anime that was rated.

In [39]:
#Get user vectors
x_users = proc_ud['user_id'].to_frame().join(user_ratings, on='user_id')

In [40]:
x_users.head()

Unnamed: 0,user_id,Action,Adventure,Comedy,Drama,Sci-Fi,Space,Mystery,Shounen,Police,...,Shoujo,Harem,Music,Shoujo Ai,Shounen Ai,Kids,Hentai,Parody,Yuri,Yaoi
0,0,7.7,7.44,7.75,7.2,7.0,0.0,8.5,7.08,7.0,...,8.33,0.0,6.0,0.0,6.0,0.0,0.0,9.0,0.0,0.0
1,0,7.7,7.44,7.75,7.2,7.0,0.0,8.5,7.08,7.0,...,8.33,0.0,6.0,0.0,6.0,0.0,0.0,9.0,0.0,0.0
2,0,7.7,7.44,7.75,7.2,7.0,0.0,8.5,7.08,7.0,...,8.33,0.0,6.0,0.0,6.0,0.0,0.0,9.0,0.0,0.0
3,0,7.7,7.44,7.75,7.2,7.0,0.0,8.5,7.08,7.0,...,8.33,0.0,6.0,0.0,6.0,0.0,0.0,9.0,0.0,0.0
4,0,7.7,7.44,7.75,7.2,7.0,0.0,8.5,7.08,7.0,...,8.33,0.0,6.0,0.0,6.0,0.0,0.0,9.0,0.0,0.0


In [41]:
x_items.head()

Unnamed: 0,anime_id,Name,English name,Aired,Score,Action,Adventure,Comedy,Drama,Sci-Fi,...,Shoujo,Harem,Music,Shoujo Ai,Shounen Ai,Kids,Hentai,Parody,Yuri,Yaoi
0,430,Fullmetal Alchemist: The Conqueror of Shamballa,Fullmetal Alchemist:The Movie - Conqueror of S...,2005,7.57,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1004,Kanojo to Kanojo no Neko,She and Her Cat:Their Standing Points,2002,7.33,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3010,Kaiketsu Zorro,The Magnificent Zorro,1996,7.23,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,570,Jin-Rou,Jin-Roh:The Wolf Brigade,2000,7.79,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2762,Igano Kabamaru,Unknown,1983,7.87,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [42]:
y_ratings.head()

0    9
1    5
2    7
3    7
4    9
Name: rating, dtype: int64

In [43]:
#Drop columns that are no longer needed
x_users = x_users.drop(['user_id'], axis=1)
x_items = x_items.drop(['anime_id', 'Name', 'English name'], axis=1)

In [44]:
print(f"x_users: {x_users.shape}")
print(f"x_items: {x_items.shape}")
print(f"y_ratings: {y_ratings.shape}")

x_users: (14998745, 43)
x_items: (14998745, 45)
y_ratings: (14998745,)


In [45]:
#To numpy array
x_users = x_users.to_numpy(dtype=float)
x_items = x_items.to_numpy(dtype=float)
y_ratings = y_ratings.to_numpy(dtype=float)

We now have our feature vectors and labels, with the vectors and ratings having the same number of rows.

<a name="2.5"></a>
### 2.5 Scaling

Finally, before we get to the model, let's scale and split our data.

In [46]:
# scale training data
item_train_unscaled = x_items
user_train_unscaled = x_users
y_train_unscaled = y_ratings

scalerItem = StandardScaler()
scalerItem.fit(x_items)
x_items = scalerItem.transform(x_items)

scalerUser = StandardScaler()
scalerUser.fit(x_users)
x_users = scalerUser.transform(x_users)

scalerTarget = MinMaxScaler((-1, 1))
scalerTarget.fit(y_ratings.reshape(-1, 1))
y_ratings = scalerTarget.transform(y_ratings.reshape(-1, 1))

print(np.allclose(item_train_unscaled, scalerItem.inverse_transform(x_items)))
print(np.allclose(user_train_unscaled, scalerUser.inverse_transform(x_users)))

True
True
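
As a quick illustration of the target scaling, the sketch below applies `MinMaxScaler((-1, 1))` to three sample ratings. It assumes the raw ratings range from 1 to 10; the sample values are chosen for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Sample ratings covering the assumed range 1..10, as a column vector
ratings = np.array([1.0, 5.5, 10.0]).reshape(-1, 1)

# Map the smallest rating to -1 and the largest to 1
scaler = MinMaxScaler((-1, 1))
scaled = scaler.fit_transform(ratings)

# inverse_transform recovers the original rating scale
restored = scaler.inverse_transform(scaled)

print(scaled.ravel())    # [-1.  0.  1.]
print(restored.ravel())  # [ 1.   5.5 10. ]
```

The same `inverse_transform` call is what turns the model's predictions back into ratings later in the notebook.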


In [47]:
#Split data
#Split all three arrays in one call so that their rows stay aligned
item_train, item_test, user_train, user_test, y_train, y_test = train_test_split(
    x_items, x_users, y_ratings, test_size=0.3, shuffle=True, random_state=1)
print(f"movie/item training data shape: {item_train.shape}")
print(f"movie/item test data shape: {item_test.shape}")

movie/item training data shape: (10499121, 45)
movie/item test data shape: (4499624, 45)


<a name="3"></a>
## 3 - Recommendation Model

<a name="3.1"></a>
### 3.1 Model

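Our model consists of two networks, one for the user features and one for the item (anime) features, whose outputs are combined with a dot product. Both networks have the same architecture: two hidden Dense layers with 256 and 128 units, followed by a linear output layer with 32 units. Each network is built as a Sequential model, and the two are joined into a single model with the Keras functional API.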
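
In symbols, writing $v_u$ and $v_a$ for the 32-dimensional outputs of the user and item networks (the names `vu` and `va` in the code below), the prediction is the dot product of the two L2-normalized vectors, which is their cosine similarity:

$$
\hat{y} = \frac{v_u}{\lVert v_u \rVert_2} \cdot \frac{v_a}{\lVert v_a \rVert_2}, \qquad -1 \le \hat{y} \le 1
$$

Because the output is bounded in this way, it is consistent with the ratings having been scaled to the range $[-1, 1]$ in the previous section.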

In [48]:
#Feature shapes
num_user_features = user_train.shape[1]
num_item_features = item_train.shape[1]

In [49]:
#Create networks
num_outputs = 32

user_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_outputs, activation='linear')
])

item_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_outputs, activation='linear')
])

#Create the user input and point to the base network
input_user = tf.keras.layers.Input(shape=(num_user_features,))
vu = user_NN(input_user)
vu = tf.linalg.l2_normalize(vu, axis=1)

#Create the item input and point to the base network
input_item = tf.keras.layers.Input(shape=(num_item_features,))
va = item_NN(input_item)
va = tf.linalg.l2_normalize(va, axis=1)

#Compute the dot product of the two vectors vu and va
output = tf.keras.layers.Dot(axes=1)([vu, va])

#Specify the inputs and output of the model
model = tf.keras.Model([input_user, input_item], output)

model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 43)]         0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, 45)]         0           []                               
                                                                                                  
 sequential (Sequential)        (None, 32)           48288       ['input_1[0][0]']                
                                                                                                  
 sequential_1 (Sequential)      (None, 32)           48800       ['input_2[0][0]']                
                                                                                              

<a name="3.2"></a>
### 3.2 Training

In [50]:
model.compile(optimizer = keras.optimizers.Adam(learning_rate=0.01),
              loss = tf.keras.losses.MeanSquaredError())

In [51]:
model.fit([user_train, item_train], y_train,
          validation_data = ([user_test, item_test], y_test),
          epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x222fabfb5b0>

In [53]:
#Evaluate with test data
model.evaluate([user_test, item_test], y_test)



0.06458355486392975

The loss on the test set is very similar to the loss on the training set, which suggests that the model did not overfit the data.
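
Since the loss is a mean squared error on ratings scaled to [-1, 1], it can help to translate it back into rating points. The sketch below does this under the assumption that the raw ratings in the training data span 1 to 10, so that one unit on the scaled axis equals 4.5 rating points; `rmse_points` is a name introduced for this illustration.

```python
import math

test_mse = 0.06458355486392975       # value returned by model.evaluate above
rating_min, rating_max = 1.0, 10.0   # assumed range of the raw ratings
scaled_min, scaled_max = -1.0, 1.0   # feature_range passed to MinMaxScaler

# One unit on the scaled axis corresponds to this many rating points
points_per_unit = (rating_max - rating_min) / (scaled_max - scaled_min)

# Root mean squared error expressed on the original rating scale
rmse_points = math.sqrt(test_mse) * points_per_unit
print(round(rmse_points, 2))  # 1.14
```

Under that assumption, the predictions are off by roughly 1.1 rating points on average in the root-mean-square sense.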

<a name="3.3"></a>
### 3.3 New User Recommendation

To test our model, we create a new user vector. Let's say our new user likes action, adventure, magic, psychological, and thriller anime.

In [54]:
#Create new user vector
new_Action = 9.0
new_Adventure = 8.0
new_Comedy = 0.0
new_Drama = 0.0
new_Sci_Fi = 0.0
new_Space = 0.0
new_Mystery = 0.0
new_Shounen = 0.0
new_Police = 0.0
new_Supernatural = 0.0
new_Magic = 6.0
new_Fantasy = 0.0
new_Sports = 0.0
new_Josei = 0.0
new_Romance = 0.0
new_Slice_of_Life = 0.0
new_Cars = 0.0
new_Seinen = 0.0
new_Horror = 0.0
new_Psychological = 9.0
new_Thriller = 7.0
new_Super_Power = 0.0
new_Martial_Arts = 0.0
new_School = 0.0
new_Ecchi = 0.0
new_Vampire = 0.0
new_Military = 0.0
new_Historical = 0.0
new_Dementia = 0.0
new_Mecha = 0.0
new_Demons = 0.0
new_Samurai = 0.0
new_Game = 0.0
new_Shoujo = 0.0
new_Harem = 0.0
new_Music = 0.0
new_Shoujo_Ai = 0.0
new_Shounen_Ai = 0.0
new_Kids = 0.0
new_Hentai = 0.0
new_Parody = 0.0
new_Yuri = 0.0
new_Yaoi = 0.0

user_vec = np.array([[new_Action, new_Adventure, new_Comedy, new_Drama, new_Sci_Fi, new_Space,
                    new_Mystery, new_Shounen, new_Police, new_Supernatural, new_Magic, new_Fantasy,
                    new_Sports, new_Josei, new_Romance, new_Slice_of_Life, new_Cars, new_Seinen, new_Horror,
                    new_Psychological, new_Thriller, new_Super_Power, new_Martial_Arts, new_School,
                    new_Ecchi, new_Vampire, new_Military, new_Historical, new_Dementia, new_Mecha,
                    new_Demons, new_Samurai, new_Game, new_Shoujo, new_Harem, new_Music, new_Shoujo_Ai,
                    new_Shounen_Ai, new_Kids, new_Hentai, new_Parody, new_Yuri, new_Yaoi]])
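
A more compact way to build such a vector is to start from a dictionary of preferences and fill in 0.0 for every other genre. This is only a sketch: `genre_order` lists just the first six genres to keep the example short, and `preferences` and `user_vec_sketch` are hypothetical names. In the notebook, the full `genres` array would take the place of `genre_order`, since the order must match the columns the user scaler was fitted on.

```python
import numpy as np

# Shortened genre list for illustration; the order must match the `genres`
# array printed earlier in the notebook
genre_order = ["Action", "Adventure", "Comedy", "Drama", "Sci-Fi", "Space"]

# Only the genres the new user cares about; everything else defaults to 0.0
preferences = {"Action": 9.0, "Adventure": 8.0}

# One row with one value per genre, in the required order
user_vec_sketch = np.array([[preferences.get(g, 0.0) for g in genre_order]])

print(user_vec_sketch)        # [[9. 8. 0. 0. 0. 0.]]
print(user_vec_sketch.shape)  # (1, 6)
```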

We replicate our vector to match the number of anime in our set, scale it, and make a prediction with our model to get the highest ratings predicted for the user.

In [55]:
#Replicate the user vector to match the number of anime in the data set
user_vecs = np.tile(user_vec, (len(proc_ad), 1))
item_vecs = proc_ad.to_numpy()

#Scale our user and item vectors
suser_vecs = scalerUser.transform(user_vecs)
sitem_vecs = scalerItem.transform(item_vecs[:, 3:])

#Make a prediction
y_p = model.predict([suser_vecs, sitem_vecs])

#Unscale y prediction 
y_pu = scalerTarget.inverse_transform(y_p)

#Sort the results, highest prediction first
sorted_index = np.argsort(-y_pu,axis=0).reshape(-1).tolist()  #negate to get largest rating first
sorted_ypu   = y_pu[sorted_index]
sorted_items = item_vecs[sorted_index]  #using unscaled vectors for display
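
The sorting step can be illustrated on a tiny example. The predictions and titles below are made up; the point is only to show how negating the predictions before `np.argsort` yields a highest-first ordering.

```python
import numpy as np

# Made-up predictions with shape (n, 1), like the output of model.predict
y_pred = np.array([[7.1], [9.3], [8.0]])
titles = np.array(["A", "B", "C"])

# argsort sorts ascending, so negate the values to rank the largest first
order = np.argsort(-y_pred, axis=0).reshape(-1)

# Reorder the titles with the same index and keep the best two
top2 = titles[order][:2]

print(order.tolist())  # [1, 2, 0]
print(top2.tolist())   # ['B', 'C']
```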



In [56]:
#Create list of recommendations
columns = {0: 'ID', 1: 'Name', 2: 'English Name', 3: 'Year', 4: 'Rating'}
new_user_recommendations = pd.DataFrame(sorted_items)
new_user_recommendations = new_user_recommendations.rename(columns=columns)
new_user_recommendations = new_user_recommendations[['ID', 'Name', 'English Name', 'Year', 'Rating']]
new_user_recommendations = new_user_recommendations.join(anime_data[['MAL_ID', 'Genres']].set_index('MAL_ID'), on='ID')
new_user_recommendations.head(10)

Unnamed: 0,ID,Name,English Name,Year,Rating,Genres
0,40682,Kingdom 3rd Season,Unknown,2020,8.4,"Action, Historical, Military, Seinen"
1,25537,Fate/stay night Movie: Heaven's Feel - I. Pres...,Fate/stay night:Heaven's Feel - I. Presage Flower,2017,8.26,"Action, Fantasy, Magic, Supernatural"
2,17389,Kingdom 2nd Season,Kingdom:Season 2,2013,8.39,"Action, Military, Historical, Seinen"
3,28701,Fate/stay night: Unlimited Blade Works 2nd Season,Fate/stay night [Unlimited Blade Works] Season 2,2015,8.33,"Action, Fantasy, Magic, Supernatural"
4,33049,Fate/stay night Movie: Heaven's Feel - II. Los...,Fate/stay night:Heaven's Feel - II. Lost Butte...,2019,8.59,"Action, Fantasy, Magic, Supernatural"
5,34440,Code Geass: Hangyaku no Lelouch III - Oudou,Code Geass:Lelouch of the Rebellion III - Glor...,2018,8.04,"Action, Mecha, Military, School, Sci-Fi, Super..."
6,22297,Fate/stay night: Unlimited Blade Works,Fate/stay night [Unlimited Blade Works],2014,8.22,"Action, Fantasy, Magic, Supernatural"
7,25777,Shingeki no Kyojin Season 2,Attack on Titan Season 2,2017,8.45,"Action, Military, Mystery, Super Power, Drama,..."
8,35760,Shingeki no Kyojin Season 3,Attack on Titan Season 3,2018,8.59,"Action, Military, Mystery, Super Power, Drama,..."
9,12031,Kingdom,Kingdom,2012,8.04,"Action, Historical, Military, Seinen"


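We now have a list of recommended anime that match the user's preferences. Note that the Rating column is the anime's overall score from the anime dataset, not the rating predicted by the model.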

<a name="3.4"></a>
### 3.4 Existing User Recommendation

For an existing user, we follow the same process, using the genre-rating vector of one of the users in our data.

In [57]:
#Existing user vector
user_ratings.head(1)

Unnamed: 0_level_0,Action,Adventure,Comedy,Drama,Sci-Fi,Space,Mystery,Shounen,Police,Supernatural,...,Shoujo,Harem,Music,Shoujo Ai,Shounen Ai,Kids,Hentai,Parody,Yuri,Yaoi
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,7.7,7.44,7.75,7.2,7.0,0.0,8.5,7.08,7.0,7.57,...,8.33,0.0,6.0,0.0,6.0,0.0,0.0,9.0,0.0,0.0


In [58]:
uid = 1
watched_animes = proc_ud.loc[proc_ud['user_id'] == uid, 'anime_id'].tolist()

In [59]:
#Row position in proc_ad of every anime the user has rated
indices = []
for i in watched_animes:
    indices.append(proc_ad.index[proc_ad['MAL_ID'] == i][0])

In [60]:
#Get user anime ratings matching anime data
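y_vecs = np.zeros(len(proc_ad))
user_rows = proc_ud[proc_ud['user_id'] == uid]
for i, anime_id in zip(indices, watched_animes):
    #.iloc[0] extracts the scalar rating from the one-row selection
    y_vecs[i] = user_rows.loc[user_rows['anime_id'] == anime_id, 'rating'].iloc[0]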

In [61]:
user_vec = user_ratings.loc[user_ratings.index == uid].values

#Form a set of user vectors. This is the same vector, transformed and repeated.
user_vecs = np.tile(user_vec, (len(proc_ad), 1))
item_vecs = proc_ad.to_numpy()

#Scale our user and item vectors
suser_vecs = scalerUser.transform(user_vecs)
sitem_vecs = scalerItem.transform(item_vecs[:, 3:])

#Make a prediction
y_p = model.predict([suser_vecs, sitem_vecs])

#Unscale y prediction 
y_pu = scalerTarget.inverse_transform(y_p)

#Sort the results, highest prediction first
sorted_index = np.argsort(-y_pu,axis=0).reshape(-1).tolist()  #negate to get largest rating first
sorted_ypu   = y_pu[sorted_index]
sorted_items = item_vecs[sorted_index]  #using unscaled vectors for display
sorted_user  = user_vecs[sorted_index]
sorted_y     = y_vecs[sorted_index]
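
The notebook computes `sorted_y` but does not display it. One possible use, sketched below on made-up data, is to hide anime the user has already rated so that only unseen titles are recommended. This assumes, as in `y_vecs` above, that a value of 0 means the user has no rating for that anime; `sorted_titles` and `unseen` are hypothetical names.

```python
import numpy as np

# Hypothetical titles, already ranked by predicted rating
sorted_titles = np.array(["A", "B", "C", "D"])

# The user's own ratings in the same order; 0 means "not rated"
sorted_y_toy = np.array([0.0, 9.0, 0.0, 7.0])

# Keep only the anime the user has not rated yet
unseen = sorted_titles[sorted_y_toy == 0]

print(unseen.tolist())  # ['A', 'C']
```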



In [62]:
#Create list of recommendations
columns = {0: 'ID', 1: 'Name', 2: 'English Name', 3: 'Year', 4: 'Rating'}
existing_user_recommendations = pd.DataFrame(sorted_items)
existing_user_recommendations = existing_user_recommendations.rename(columns=columns)
existing_user_recommendations = existing_user_recommendations[['ID', 'Name', 'English Name', 'Year', 'Rating']]
existing_user_recommendations = existing_user_recommendations.join(anime_data[['MAL_ID', 'Genres']].set_index('MAL_ID'), on='ID')
existing_user_recommendations.head(10)

Unnamed: 0,ID,Name,English Name,Year,Rating,Genres
0,19,Monster,Monster,2004,8.76,"Drama, Horror, Mystery, Police, Psychological,..."
1,2904,Code Geass: Hangyaku no Lelouch R2,Code Geass:Lelouch of the Rebellion R2,2008,8.91,"Action, Military, Sci-Fi, Super Power, Drama, ..."
2,1575,Code Geass: Hangyaku no Lelouch,Code Geass:Lelouch of the Rebellion,2006,8.72,"Action, Military, Sci-Fi, Super Power, Drama, ..."
3,44,Rurouni Kenshin: Meiji Kenkaku Romantan - Tsui...,Samurai X:Trust and Betrayal,1999,8.73,"Action, Historical, Drama, Romance, Martial Ar..."
4,38524,Shingeki no Kyojin Season 3 Part 2,Attack on Titan Season 3 Part 2,2019,9.1,"Action, Drama, Fantasy, Military, Mystery, Sho..."
5,5114,Fullmetal Alchemist: Brotherhood,Fullmetal Alchemist:Brotherhood,2009,9.19,"Action, Military, Adventure, Comedy, Drama, Ma..."
6,40028,Shingeki no Kyojin: The Final Season,Attack on Titan Final Season,2020,9.17,"Action, Military, Mystery, Super Power, Drama,..."
7,9253,Steins;Gate,Steins;Gate,2011,9.11,"Thriller, Sci-Fi"
8,11061,Hunter x Hunter (2011),Hunter x Hunter,2011,9.1,"Action, Adventure, Fantasy, Shounen, Super Power"
9,245,Great Teacher Onizuka,Great Teacher Onizuka,1999,8.7,"Slice of Life, Comedy, Drama, School, Shounen"


<a name="4"></a>
## 4 - Results

Our trained model reached a loss (mean squared error on the scaled ratings) of about 0.06, which also held for our test data, and it produced plausible recommendations for both a new and an existing user. The recommendation model could be expanded further by including more features.