# I. Introduction 
Recommendation systems are widely used across media platforms. The anime community is a distinct group with its own culture and a user distribution that differs from that of mainstream movie audiences, which makes it an interesting target both for analysis and for a recommender tailored to it. In this work, I implemented a data-driven recommendation model on anime datasets.

In [2]:
# Import Package
import pandas as pd
import os
import numpy as np
import glob
from sklearn.utils import shuffle
import seaborn as sns 
from sklearn.metrics import jaccard_score  # replaces jaccard_similarity_score, removed in scikit-learn 0.23
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import manhattan_distances
from scipy.sparse.linalg import svds


# II. Dataset 

The datasets used in this work were downloaded from [Kaggle](https://www.kaggle.com/azathoth42/myanimelist). The original data were collected from MyAnimeList.net, an anime and manga social networking and cataloging website. The dataset aims to be a representative sample of the internet otaku community for analyzing demographics and trends within this group. It contains users' gender, location, and birth date; each anime's airing date, genres, and producer; and the relationships between users and anime.

In [3]:
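# Data and output folders live next to the notebook's parent directory
obs = os.path.dirname(os.getcwd())
Datpath = os.path.join(obs, "Data", "")
Output = os.path.join(obs, "output", "")
# Note: glob does not guarantee an ordering, so the indices below depend on the local file listing
Dataset = list(glob.glob(Datpath + "*.csv"))

# Read the datasets
Anime = pd.read_csv(Dataset[0])
User = pd.read_csv(Dataset[5])
UserAnime = pd.read_csv(Dataset[3])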

In [4]:
print("Before: The shape of Anime",Anime.shape)
print("Before: The shape of User ",User.shape)
print("Before: The shape of UserAnime",UserAnime.shape)

Before: The shape of Anime (14478, 31)
Before: The shape of User  (108711, 17)
Before: The shape of UserAnime (80076112, 11)


# III. Data Preprocessing

#### Anime Datasets

In the anime dataset, I dropped features that are not relevant to this work, such as the Japanese title, title_synonyms, image_url, opening theme, and ending theme. I also dropped anime that have not yet aired, since any rating scores for them would be unreliable.

In [5]:
Anime=Anime[Anime["status"]!="Not yet aired"]
Anime=Anime[['anime_id', 'title','source','studio', 'genre','episodes','rating',
             'score', 'scored_by', 'rank', 'popularity','producer']]
print("Cleaned: The shape of Anime",Anime.shape)
print("\n show Anime dataset: \n")
shuffle(Anime).head(10)

Cleaned: The shape of Anime (14142, 12)

 show Anime dataset: 



Unnamed: 0,anime_id,title,source,studio,genre,episodes,rating,score,scored_by,rank,popularity,producer
4358,1459,Uchuu no Kishi Tekkaman Blade,Original,Tatsunoko Production,"Action, Adventure, Space, Mecha, Romance, Mili...",49,R - 17+ (violence & profanity),7.55,3831,1512.0,3637,"Sotsu, Studio Jack"
6057,31753,Ganbare! Lulu Lolo 3rd Season,Picture book,Fanworks,"Kids, Slice of Life",10,PG - Children,6.3,10,10174.0,11868,
6080,696,Tenchi Muyou!,Original,AIC,"Sci-Fi, Harem, Space, Comedy, Romance, Shounen",26,PG-13 - Teens 13 or older,7.52,23188,1591.0,1590,"TV Tokyo, Pioneer LDC"
12019,31044,Kamisama Minarai: Himitsu no Cocotama,Original,OLM,"Slice of Life, Kids, Magic, Fantasy",0,PG - Children,6.11,219,7057.0,7088,"TV Tokyo, Lantis"
5088,6137,Tsuki no Waltz,Unknown,Madhouse,Music,1,G - All Ages,6.63,1054,5087.0,7102,NHK
12826,35198,Free!: Take Your Marks,Original,Kyoto Animation,"Slice of Life, Sports, School",1,PG-13 - Teens 13 or older,7.88,1242,734.0,2817,
13628,31551,Yurumates 3D Plus: Natsuyasumi Maison du Wish ...,Unknown,,"Slice of Life, Comedy, Seinen",1,PG-13 - Teens 13 or older,6.12,197,7042.0,9755,
13835,32074,Sowa Sowa Calendar,Music,,"Kids, Music",1,G - All Ages,5.41,22,12253.0,13526,NHK
178,1587,Tenpou Ibun: Ayakashi Ayashi,Original,Bones,"Supernatural, Demons, Historical",25,R - 17+ (violence & profanity),7.0,3288,3582.0,3394,"Aniplex, Mainichi Broadcasting System"
7758,7721,Meat or Die,Unknown,,"Adventure, Comedy",12,PG - Children,5.0,75,9042.0,11078,Toho


#### Users Dataset

<p>For the user dataset, I used the cleaned version, which excludes users with implausible information: users whose last online date precedes their join date; users who claim to have watched an implausibly large number of episodes (over 100 million); users who watched more episodes of an anime than it has; users whose last online time is earlier than 1950; and users younger than 5 or older than 90.</p>
<ol>
  <li>I filtered out users whose mean score is 0 and users who spent one day or less watching. I also dropped features that are not relevant to this work, such as location, access_rank, join_date, last_online, stats_rewatched, and stats_episodes.</li>

  <li>I converted the "birth_date" feature to "age" (relative to 2018), turning a date into an integer that is easier to analyze.</li>
</ol>

    
    

In [6]:
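# Keep users with a non-null, non-zero mean score (parentheses are required: & binds tighter than !=)
User = User[(User["stats_mean_score"] != 0) & User["stats_mean_score"].notnull()]
# Keep users who spent more than one day watching; copy to avoid chained-assignment warnings
User = User[User["user_days_spent_watching"] > 1].copy()
# Age in 2018, computed from the leading year of the birth date string
User['age'] = User['birth_date'].apply(lambda row: 2018 - int(row.split(sep="-")[0]))
User = User.drop(["user_onhold","user_dropped","user_plantowatch","access_rank","join_date","last_online","stats_rewatched","stats_episodes","location","birth_date"], axis=1)
User = User.reset_index(drop=True)
User['username'] = User['username'].apply(str)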
print("The shape of User ",User.shape)
print("\n show User dataset: \n")
shuffle(User).head(10)

The shape of User  (105403, 8)

 show User dataset: 



Unnamed: 0,username,user_id,user_watching,user_completed,user_days_spent_watching,gender,stats_mean_score,age
1582,AnimeGURU475,174689,20,655,181.588889,Male,7.59,31
50092,Ankoku78,80772,15,79,32.183333,Female,7.93,24
45590,Nameo,15686,94,568,147.850694,Male,6.77,30
88555,sasky_96,4208521,33,167,41.705903,Male,7.87,23
23533,Promithius,37186,4,57,36.120139,Male,7.28,30
78050,Batman8,4605131,98,138,39.198495,Male,9.38,18
24631,komic,238353,83,1547,345.535417,Male,7.65,33
72866,Laokoon,266567,20,214,56.184028,Male,8.69,25
21901,Skippyloo2,165644,0,102,30.493056,Female,7.64,20
58343,annie2021,274963,5,219,91.335417,Female,7.04,24


#### User-Anime Dataset
The User-Anime dataset records, for each user, the anime they watched (anime_id), the number of episodes watched, and the score they gave.

In [7]:
def UserAnimefilter(Anime, User, UserAnime):
    # Sets make the isin() membership tests fast
    animelist = set(Anime["anime_id"].values)
    userlist = set(User["username"].values)
    drop_cols = ['my_start_date', 'my_finish_date', 'my_status', 'my_rewatching_ep',
                 'my_rewatching', 'my_last_updated', 'my_tags']
    chunks = []
    # Read the large CSV in chunks to keep memory usage bounded
    for df in pd.read_csv(UserAnime, chunksize=1000000):
        df = df.drop(drop_cols, axis=1)
        df = df[df["anime_id"].isin(animelist)]  # keep retained anime only
        df = df[df["username"].isin(userlist)]   # keep retained users only
        chunks.append(df)
    # DataFrame.append was removed in pandas 2.0; concatenate once at the end
    return pd.concat(chunks)
UserAnime=UserAnimefilter(Anime, User, Dataset[3])
print("The shape of UserAnime",UserAnime.shape)
print("\n show UserAnime dataset: \n")
shuffle(UserAnime).head(10)

The shape of UserAnime (32797285, 4)

 show UserAnime dataset: 



Unnamed: 0,username,anime_id,my_watched_episodes,my_score
7742606,Agk,18229,12,8
44527659,Flancake,144,13,8
20676363,Bocchan,5420,0,0
16699409,SideWalking,1048,13,6
73734724,Franation,28299,1,8
20315606,CreamyVelvet,14293,1,8
27299070,kkukaa,223,153,4
30609672,Nyarumi,8311,25,0
30310806,Foga-Kun,1852,1,0
48477995,AniaxMega,10532,2,8


# IV. Architecture of the data-driven recommendation model

### Mode 1: Recommendation based on Top Anime
To find the most popular and highly scored anime, I used IMDB's (Internet Movie Database) weighted rating formula to construct the top anime chart. The raw score of an anime is the average score over all users who rated it. A highly scored anime may have been rated by only a few people, so it would be statistically unsound to classify it as highly scored on that basis alone. I therefore set a threshold for the minimum number of ratings, which dampens the influence of anime with high scores but few ratings. Mathematically, IMDB's weighted rating is
    
$$\text{Weighted Rating} = \frac{v}{v+m}\,R + \frac{m}{v+m}\,C$$

where $v$ is the number of scores for the anime, $m$ is the minimum number of scores required to be listed in the chart (can be set or changed by the user), $R$ is the average score of the anime, and $C$ is the mean score across the whole dataset.
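
As a small worked example of the weighted rating, the sketch below applies the formula to two hypothetical anime with the same raw average but very different numbers of scores. The helper name `weighted_rating` and all numbers are assumptions chosen for illustration; they are not taken from the dataset.

```python
def weighted_rating(v, R, m, C):
    """IMDB-style weighted rating: shrink the raw mean R toward the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values, for illustration only.
m, C = 1000, 6.5

popular = weighted_rating(v=100_000, R=9.0, m=m, C=C)  # many scores: stays close to R
obscure = weighted_rating(v=10, R=9.0, m=m, C=C)       # few scores: pulled toward C

print(round(popular, 3), round(obscure, 3))  # 8.975 6.525
```

With only 10 scores, the second anime ends up close to the global mean even though its raw average is 9.0, which is the effect the threshold $m$ is meant to produce.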


In [8]:
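def Bestanime(file, perc, imdb, top):
    # m: minimum number of scores, taken as the perc-quantile of 'scored_by'
    m = file['scored_by'].quantile(perc)
    # C: mean score across the whole dataset (nulls excluded)
    C = file[file['score'].notnull()]['score'].mean()
    v = file['scored_by']  # number of scores per anime
    R = file['score']      # average score per anime
    # Weighted rating, stored in a new 'IMDB' column on a copy of the input
    result = file.assign(IMDB=(v / (v + m) * R) + (m / (v + m) * C))
    # Keep anime above the score threshold, sorted from best to worst
    result = result[result["IMDB"] > imdb].sort_values("IMDB", ascending=False)
    return result[:top]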

Here, I use the 85th percentile of scored_by as the threshold m and list the top 100 anime with a weighted (IMDB) score greater than 7.

In [9]:
Topanime = Bestanime(Anime,0.85,7,100) 
Topanime

Unnamed: 0,anime_id,title,source,studio,genre,episodes,rating,score,scored_by,rank,popularity,producer,IMDB
2555,5114,Fullmetal Alchemist: Brotherhood,Manga,Bones,"Action, Military, Adventure, Comedy, Drama, Ma...",64,R - 17+ (violence & profanity),9.25,733592,1.0,4,"Aniplex, Square Enix, Mainichi Broadcasting Sy...",9.197759
1028,32281,Kimi no Na wa.,Original,CoMix Wave Films,"Supernatural, Drama, Romance, School",1,PG-13 - Teens 13 or older,9.19,471398,2.0,33,"Kadokawa Shoten, Toho, Sound Team Don Juan, La...",9.111124
2999,9253,Steins;Gate,Visual novel,White Fox,"Thriller, Sci-Fi",24,PG-13 - Teens 13 or older,9.14,563857,5.0,8,"Frontier Works, Media Factory, Movic, AT-X, Ka...",9.074904
418,11061,Hunter x Hunter (2011),Manga,Madhouse,"Action, Adventure, Super Power, Shounen",148,PG-13 - Teens 13 or older,9.11,403377,8.0,35,"VAP, Nippon Television Network, Shueisha",9.020774
6572,28851,Koe no Katachi,Manga,Kyoto Animation,"Drama, School, Shounen",1,PG-13 - Teens 13 or older,9.04,296261,12.0,77,"Shochiku, Pony Canyon, Kodansha, ABC Animation...",8.922866
2783,4181,Clannad: After Story,Visual novel,Kyoto Animation,"Slice of Life, Comedy, Supernatural, Drama, Ro...",24,PG-13 - Teens 13 or older,9.01,363239,14.0,55,"Pony Canyon, TBS, Rakuonsha, Animation Do",8.914758
7403,2904,Code Geass: Hangyaku no Lelouch R2,Original,Sunrise,"Action, Military, Sci-Fi, Super Power, Drama, ...",25,R - 17+ (violence & profanity),8.95,543904,18.0,22,"Bandai Visual, Mainichi Broadcasting System, B...",8.887065
7105,199,Sen to Chihiro no Kamikakushi,Original,Studio Ghibli,"Adventure, Supernatural, Drama",1,PG - Children,8.92,498602,19.0,39,Toho,8.852266
11640,23273,Shigatsu wa Kimi no Uso,Manga,A-1 Pictures,"Drama, Music, Romance, School, Shounen",22,PG-13 - Teens 13 or older,8.88,413527,21.0,37,"Aniplex, Dentsu, Kodansha, Fuji TV, Kyoraku In...",8.799996
6511,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,Manga,Production I.G,"Comedy, Sports, Drama, School, Shounen",10,PG-13 - Teens 13 or older,8.99,157934,17.0,274,"TOHO animation, Shueisha",8.782021


### Mode 2: User-based Collaborative Filtering
Collaborative filtering (CF) is a common recommendation method based on users' behavior. In this work I used user-based collaborative filtering. I built an $m \times n$ matrix $U$ of users and the anime they watched, where $m$ is the number of users and $n$ is the number of anime. If user $i$ has watched and scored anime $j$, the element $u_{ij}$ of $U_{m\times n}$ holds that score; otherwise it is 0. The following table shows the structure of the matrix.


In [10]:
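# Collect each user's scores into a list, then expand the lists into columns
UserMatrix = pd.DataFrame(UserAnime.groupby('username')['my_score'].apply(list))
# Note: column k holds the k-th score in a user's list, so columns follow list position rather than a fixed anime_id
UserMatrix = UserMatrix["my_score"].apply(pd.Series)
# Entries missing because a user's list is shorter become 0
UserMatrix = UserMatrix.fillna(0).reset_index()
UserMatrix['username'] = UserMatrix['username'].apply(str)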
print("The shape of User-Anime Matrix:",UserMatrix.shape)
UserMatrix.head(20)

The shape of User-Anime Matrix: (105402, 12975)


Unnamed: 0,username,0,1,2,3,4,5,6,7,8,...,12964,12965,12966,12967,12968,12969,12970,12971,12972,12973
0,----phoebelyn,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,---L-AND-AME-4EV,8.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,--AnimeBoy--,0.0,6.0,6.0,6.0,6.0,0.0,0.0,0.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,--Etsuko--,9.0,9.0,0.0,8.0,8.0,8.0,8.0,8.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,--FallenAngel--,10.0,0.0,8.0,9.0,9.0,9.0,9.0,8.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,--Kin--,7.0,10.0,8.0,6.0,8.0,9.0,10.0,10.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,--N--,0.0,9.0,9.0,8.0,9.0,10.0,9.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,--PHOENIX--,9.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,--RIE--,0.0,6.0,0.0,0.0,0.0,5.0,5.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,--Sc--,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
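
One way to build a user-by-anime matrix in which every column corresponds to a fixed anime, as in the definition of $u_{ij}$ above, is to map usernames and anime IDs to row and column indices and store the scores in a sparse matrix. The sketch below is an illustration under stated assumptions, not the notebook's own code: the toy `ratings` frame is made up, and only its column names mirror UserAnime.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy ratings in the same long format as UserAnime (values are made up).
ratings = pd.DataFrame({
    "username": ["ann", "ann", "bob", "cy", "cy"],
    "anime_id": [5114, 9253, 5114, 199, 9253],
    "my_score": [10, 9, 8, 7, 0],
})

# Map users and anime to contiguous row / column indices.
users = ratings["username"].astype("category")
items = ratings["anime_id"].astype("category")

# Sparse m x n matrix: entry (i, j) is the score user i gave anime j.
U = csr_matrix(
    (ratings["my_score"].to_numpy(dtype=float),
     (users.cat.codes.to_numpy(), items.cat.codes.to_numpy())),
    shape=(users.cat.categories.size, items.cat.categories.size),
)

print(U.shape)  # (3, 3)
print(U.toarray())
```

Row `i` corresponds to `users.cat.categories[i]` and column `j` to `items.cat.categories[j]`, so scores for the same anime always land in the same column, and the sparse format avoids storing the zeros explicitly.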


In [11]:
def SelectedUser(UserMatrix,Username):
    # Row of the input user, and the matrix of all other users
    targetUser = UserMatrix[UserMatrix["username"]==Username]
    otherUsers = UserMatrix[UserMatrix["username"]!=Username].reset_index(drop=True)
    # Manhattan distance from the input user to every other user (column 0 is the username)
    distance = dict()
    for i in range(0,len(otherUsers)):
        dist = manhattan_distances(targetUser.iloc[:,1:], otherUsers.iloc[i:i+1,1:])
        distance.update({otherUsers.at[i,"username"]:dist[0][0]})
    # The nearest neighbour is the user with the smallest distance
    nearestUser = min(distance, key=distance.get)
    print("The user who is most similar to input user is :", nearestUser)
    # Recommend anime the neighbour scored 9 or 10 (uses the global UserAnime and Anime frames)
    selectedAnime = UserAnime[UserAnime["username"]==nearestUser]
    Animelist = selectedAnime[selectedAnime["my_score"]>8.5]["anime_id"].values.tolist()
    LikedAnime = list(map(int, Animelist))
    LikedAnimeName = Anime[Anime['anime_id'].isin(LikedAnime)]
    return LikedAnimeName[:30]

Here, we select one user (index 25 in the User table) and recommend anime for them.

In [19]:
Name=User['username'][25]
print("The input user is",Name)
print("The following animes are recommended to",Name)
SelectedUser(UserMatrix,Name)

The input user is Tomoki-sama
The following animes are recommended to Tomoki-sama
The user who is most similar to input user is : Gid_Lucion


Unnamed: 0,anime_id,title,source,studio,genre,episodes,rating,score,scored_by,rank,popularity,producer
1,2104,Seto no Hanayome,Manga,Gonzo,"Comedy, Parody, Romance, School, Shounen",26,PG-13 - Teens 13 or older,7.89,91206,727.0,366,"TV Tokyo, AIC, Square Enix, Sotsu"
4,12365,Bakuman. 3rd Season,Manga,J.C.Staff,"Comedy, Drama, Romance, Shounen",25,PG-13 - Teens 13 or older,8.67,107767,50.0,426,"NHK, Shueisha"
13,14227,Tonari no Kaibutsu-kun,Manga,Brain&#039;s Base,"Slice of Life, Comedy, Romance, School, Shoujo",13,PG-13 - Teens 13 or older,7.71,281448,1093.0,88,"TV Tokyo, Aniplex, Nihon Ad Systems, Kodansha"
19,4224,Toradora!,Light novel,J.C.Staff,"Slice of Life, Comedy, Romance, School",25,PG-13 - Teens 13 or older,8.39,557898,174.0,13,"Genco, Starchild Records, Magic Capsule, Yomiu..."
20,10030,Bakuman. 2nd Season,Manga,J.C.Staff,"Comedy, Drama, Romance, Shounen",25,PG-13 - Teens 13 or older,8.49,122090,108.0,374,"NHK, Shueisha"
29,9863,SKET Dance,Manga,Tatsunoko Production,"Comedy, School, Shounen",77,PG-13 - Teens 13 or older,8.33,44132,213.0,646,"TV Tokyo, Avex Entertainment, Dentsu"
43,7674,Bakuman.,Manga,J.C.Staff,"Comedy, Drama, Romance, Shounen",25,PG-13 - Teens 13 or older,8.32,167429,216.0,172,"NHK, Shueisha"
99,7791,K-On!!,4-koma manga,Kyoto Animation,"Comedy, Music, School, Slice of Life",26,PG-13 - Teens 13 or older,8.14,178001,383.0,183,"Pony Canyon, TBS, Movic, Rakuonsha, Animation Do"
218,10521,Working&#039;!!,4-koma manga,A-1 Pictures,"Slice of Life, Comedy, Romance",13,PG-13 - Teens 13 or older,7.96,93040,609.0,516,"Aniplex, Yomiuri Telecasting, Rakuonsha"
297,15809,Hataraku Maou-sama!,Light novel,White Fox,"Comedy, Demons, Supernatural, Romance, Fantasy",13,PG-13 - Teens 13 or older,7.98,389033,578.0,51,"Sotsu, Lantis, Pony Canyon, Magic Capsule, Inf..."
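
The nearest-neighbour step can also be sketched on a toy matrix with a single call to `manhattan_distances`, instead of one call per user. The example below is only an illustration: the usernames and scores are made up, and filtering out anime the input user has already scored is an extra step that the function above does not perform.

```python
import numpy as np
from sklearn.metrics.pairwise import manhattan_distances

# Toy score matrix: rows are users, columns are anime (made-up values).
usernames = ["ann", "bob", "cy", "dee"]
scores = np.array([
    [10, 9, 0, 0],
    [ 9, 9, 0, 10],
    [ 0, 0, 8, 9],
    [ 0, 1, 9, 0],
], dtype=float)

target = 0  # row index of the input user ("ann")

# One call gives the distance from the target to every user.
dist = manhattan_distances(scores[[target]], scores)[0]
dist[target] = np.inf  # never pick the target as their own neighbour
nearest = int(np.argmin(dist))

# Columns the neighbour scored 9 or 10 and the target has not scored yet.
recommend = np.flatnonzero((scores[nearest] >= 9) & (scores[target] == 0))

print(usernames[nearest], recommend.tolist())  # bob [3]
```

Computing all distances in one call is usually much faster than a Python loop, at the cost of holding both operands in memory at once.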


### Mode 3: Genre-based Collaborative Filtering

Item-based CF requires a measure of similarity between anime. In this work I used the Jaccard similarity score. Let $u_{i,k}$ denote the similarity between user $i$ and user $k$, and let $v_{i,j}$ denote the rating that user $i$ gives to item $j$, with $v_{i,j} = ?$ if the user has not rated that item. For two sets $A$ and $B$, the Jaccard similarity is

$$J(A,B) = \frac{|A\cap B|}{|A\cup B|} = \frac{|A\cap B|}{|A| + |B| - |A\cap B|}$$
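
As a minimal sketch of the Jaccard formula, the example below assumes that $A$ and $B$ are the genre sets of two anime, which is how the section title suggests it is applied. The helper names `jaccard` and `genre_set` are hypothetical; the genre strings are copied from the tables shown earlier.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def genre_set(genre_string):
    """Split a comma-separated genre string into a set of genre names."""
    return {g.strip() for g in genre_string.split(",") if g.strip()}

# Genre strings as they appear in the anime tables above.
bakuman = genre_set("Comedy, Drama, Romance, Shounen")
sket_dance = genre_set("Comedy, School, Shounen")
steins_gate = genre_set("Thriller, Sci-Fi")

print(jaccard(bakuman, sket_dance))   # 2 shared genres out of 5 -> 0.4
print(jaccard(bakuman, steins_gate))  # no shared genres -> 0.0
```

Under this assumption, recommending by genre amounts to ranking other anime by their Jaccard score against the titles a user rated highly.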

# Singular Value Decomposition
The UserMatrix is very sparse. One way to handle sparsity is a latent factor model that captures the similarity between users and items. In this work, I decomposed the user-item interaction matrix into the product of two lower-dimensional rectangular matrices. One can be seen as the user matrix, where rows represent users and columns are latent factors. The other is the item matrix, where rows are latent factors and columns represent items. A latent factor is a hidden characteristic of a user or an anime; for example, a latent factor may correspond to the genre an anime belongs to.


![alt text][logo]

[logo]: https://wikimedia.org/api/rest_v1/media/math/render/svg/38aad41a3e44ded68732715012a793256eae76fe "Logo Title Text 2"

SVD reduces the dimension of the UserMatrix by extracting its latent factors. It helps us better understand the relationship between users and items, since both are mapped into the same latent space and become directly comparable. However, the main drawback of SVD is that it gives no explanation of why an item is recommended to a user. This is a serious problem if users want to know why a specific anime was recommended to them.
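
The decomposition can be sketched with `scipy.sparse.linalg.svds`, which is already imported at the top of the notebook. This is a minimal illustration on a made-up 4-user by 5-anime matrix, not the computation run on the full dataset; the choice of `k = 2` latent factors is an assumption.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy 4-user x 5-anime score matrix (made-up values; 0 means "not rated").
R = csr_matrix(np.array([
    [10, 9, 0, 0, 0],
    [ 9, 8, 0, 1, 0],
    [ 0, 0, 8, 9, 7],
    [ 0, 1, 9, 8, 0],
], dtype=float))

k = 2  # number of latent factors (must be smaller than min(R.shape))
U, s, Vt = svds(R, k=k)

# svds does not guarantee descending order, so sort factors by singular value.
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]

# Rank-k reconstruction: an estimated score for every (user, anime) pair.
R_hat = U @ np.diag(s) @ Vt
print(R_hat.round(1))
```

Here `U` plays the role of the user matrix (users by latent factors) and `Vt` the item matrix (latent factors by anime); unrated entries of `R_hat` with high values are candidates for recommendation.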