#Content-based recommendations
Although deep learning seems to be leading the trend in recommendation engines, classic content-based methods are still very effective for recommending items to users. In this notebook, we explore how different similarity metrics and NLP techniques can be used for content-based recommendations.

In [1]:
#mount Google Drive, which holds the data
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Install the nltk package for NLP and upgrade pandas (pinned here to version 0.23.4)

In [2]:
!pip install nltk
!pip install pandas==0.23.4

Collecting pandas==0.23.4
  Downloading https://files.pythonhosted.org/packages/e1/d8/feeb346d41f181e83fba45224ab14a8d8af019b48af742e047f3845d8cff/pandas-0.23.4-cp36-cp36m-manylinux1_x86_64.whl (8.9MB)
    100% |████████████████████████████████| 8.9MB 3.8MB/s
Installing collected packages: pandas
  Found existing installation: pandas 0.22.0
    Uninstalling pandas-0.22.0:
      Successfully uninstalled pandas-0.22.0
Successfully installed pandas-0.23.4


In [3]:
import json
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances, pairwise_distances
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import nltk
#download nltk stopwords and corpus
nltk.download('stopwords')
nltk.download('wordnet')
#read item data
animes = pd.read_csv("/content/gdrive/My Drive/Colab Notebooks/aniRec/data/processed_anime.csv")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [4]:
animes

Unnamed: 0.1,Unnamed: 0,anime_id,title,type,source,episodes,status,duration,rating,score,...,related_animes,related_manga,Staffs,Characters,Description,favorites_ratio,time_bucket,length,rank_bucket,duration_bucket
0,0,11013,Inu x Boku SS,TV,Manga,12,Finished Airing,24,PG-13 - Teens 13 or older,7.63,...,[],[],"['Nakamura, Yuuichi', 'Hidaka, Rina', 'Hanazaw...",[],Ririchiyo Shirakiin is the sheltered daughter ...,0.009895,shinkai,77.0,mediocre,normal
1,1,2104,Seto no Hanayome,TV,Manga,26,Finished Airing,24,PG-13 - Teens 13 or older,7.89,...,[],[],"['Momoi, Haruko', 'Mizushima, Takahiro', 'Noga...","['Seto, Sun', 'Michishio, Nagasumi', 'Edomae, ...",Michishio Nagasumi's life couldn't be any more...,0.012642,haruhism,182.0,normal_show,normal
2,2,5262,Shugo Chara!! Doki,TV,Manga,51,Finished Airing,24,PG - Children,7.55,...,[],[],"['Sawashiro, Miyuki', 'Itou, Kanae', 'Chiba, S...","['Hinamori, Amu', 'Fujisaki, Nagihiko', 'Hotor...",Now Utau has left Easter and restarted her sin...,0.011436,haruhism,356.0,mediocre,normal
3,3,721,Princess Tutu,TV,Original,38,Finished Airing,16,PG-13 - Teens 13 or older,8.21,...,[],[],"['Katou, Nanae', 'Sakurai, Takahiro', 'Mizuki,...",[],"In a fairy tale come to life, the clumsy, swee...",0.035837,new_milenium,280.0,normal_show,normal
4,4,12365,Bakuman. 3rd Season,TV,Manga,25,Finished Airing,24,PG-13 - Teens 13 or older,8.67,...,[],[],"['Abe, Atsushi', 'Hino, Satoshi', 'Okamoto, No...","['Mashiro, Moritaka', 'Takagi, Akito', 'Niizum...","Onto their third serialization, manga duo Mori...",0.011392,shinkai,175.0,quality_show,normal
5,5,6586,Yume-iro Pâtissière,TV,Manga,50,Finished Airing,24,G - All Ages,8.03,...,[],[],"['Okamoto, Nobuhiko', 'Yuuki, Aoi', 'Yonaga, T...","['Kashino, Makoto', 'Amano, Ichigo', 'Hanabusa...","Aside from her deep passion for eating cakes, ...",0.018104,monogatari,357.0,normal_show,normal
6,6,178,Ultra Maniac,TV,Manga,26,Finished Airing,24,G - All Ages,7.26,...,[],[],"['Kanda, Akemi', 'Horie, Yui', 'Chiba, Susumu'...","['Rio', 'Luna']",Hideo Middle School eighth-grader Ayu Tateishi...,0.005356,new_milenium,175.0,mediocre,normal
7,7,2787,Shakugan no Shana II (Second),TV,Light novel,24,Finished Airing,24,PG-13 - Teens 13 or older,7.72,...,[355],[3074],"['Kugimiya, Rie', 'Hino, Satoshi', 'Kawasumi, ...","['Shana', 'Sakai, Yuuji', 'Yoshida, Kazumi', '...",The heated bond between Shana and Yuji is test...,0.004873,haruhism,175.0,mediocre,normal
8,8,4477,Nodame Cantabile: Paris-hen,TV,Manga,11,Finished Airing,23,PG-13 - Teens 13 or older,8.24,...,[],[],"['Kawasumi, Ayako', 'Seki, Tomokazu', 'Ogawa, ...","['Noda, Megumi', 'Chiaki, Shinichi', 'Stresema...",Having been given the opportunity to study in ...,0.003292,haruhism,70.0,fine_show,normal
9,9,853,Ouran Koukou Host Club,TV,Manga,26,Finished Airing,23,PG-13 - Teens 13 or older,8.34,...,[],[],"['Miyano, Mamoru', 'Sakamoto, Maaya', 'Suzumur...",[],Haruhi Fujioka is a bright scholarship candida...,0.044673,haruhism,175.0,fine_show,normal


We can see that an extra index was saved as the `Unnamed: 0` column. Let's drop it.

In [0]:
animes = animes.drop(columns=['Unnamed: 0'])

##Feature engineering for list-like features

Although most of the EDA and feature engineering was already done in our **anime-EDA** notebook, some customizations are still needed. For instance, list-valued columns should be converted to delimited strings so that we can use the `get_dummies` method for one-hot encoding.

In [6]:
import ast
list_type_features = ['producer','licensor','studio','genre','musicians','related_animes','related_manga','Staffs','Characters']

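def process_list(raw):
  #parse the stringified list (e.g. "['Comedy', 'Romance']") into a Python list
  items = ast.literal_eval(raw)
  #join into one comma-separated string; note that entries containing commas
  #(e.g. 'Nakamura, Yuuichi') will later be split apart by get_dummies(",")
  return ",".join(str(item).strip() for item in items)

for column in list_type_features:
  animes[column] = animes[column].apply(process_list)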
animes[list_type_features]

Unnamed: 0,producer,licensor,studio,genre,musicians,related_animes,related_manga,Staffs,Characters
0,"Aniplex,Square Enix,Mainichi Broadcasting Syst...",Sentai Filmworks,David Production,"Comedy,Supernatural,Romance,Shounen",MUCC,,,"Nakamura, Yuuichi,Hidaka, Rina,Hanazawa, Kana,...",
1,"TV Tokyo,AIC,Square Enix,Sotsu",Funimation,Gonzo,"Comedy,Parody,Romance,School,Shounen",,,,"Momoi, Haruko,Mizushima, Takahiro,Nogawa, Saku...","Seto, Sun,Michishio, Nagasumi,Edomae, Lunar,Ma..."
2,"TV Tokyo,Sotsu",,Satelight,"Comedy,Magic,School,Shoujo",,,,"Sawashiro, Miyuki,Itou, Kanae,Chiba, Saeko,Tak...","Hinamori, Amu,Fujisaki, Nagihiko,Hotori, Tadas..."
3,"Memory-Tech,GANSIS,Marvelous AQL",ADV Films,Hal Film Maker,"Comedy,Drama,Magic,Romance,Fantasy",Ritsuko Okazaki,,,"Katou, Nanae,Sakurai, Takahiro,Mizuki, Nana,Ya...",
4,"NHK,Shueisha",,J.C.Staff,"Comedy,Drama,Romance,Shounen","nano.RIPE,Sphere,Hyadain,JAM Project",,,"Abe, Atsushi,Hino, Satoshi,Okamoto, Nobuhiko,M...","Mashiro, Moritaka,Takagi, Akito,Niizuma, Eiji,..."
5,"Yomiuri Telecasting,DAX Production,Shueisha",,"Studio Pierrot,Studio Hibari","Kids,School,Shoujo",Mayumi Gojou,,,"Okamoto, Nobuhiko,Yuuki, Aoi,Yonaga, Tsubasa,T...","Kashino, Makoto,Amano, Ichigo,Hanabusa, Satsuk..."
6,Studio Jack,"Discotek Media,Geneon Entertainment USA",Production Reed,"Magic,Comedy,Romance,School,Shoujo",can/goo,,,"Kanda, Akemi,Horie, Yui,Chiba, Susumu,Kamiya, ...","Rio,Luna"
7,"Geneon Universal Entertainment,ASCII Media Works",Funimation,J.C.Staff,"Action,Drama,Fantasy,Romance,School,Supernatural","Mami Kawada,KOTOKO",355,3074,"Kugimiya, Rie,Hino, Satoshi,Kawasumi, Ayako,It...","Shana,Sakai, Yuuji,Yoshida, Kazumi,Carmel, Wil..."
8,"Fuji TV,Asmik Ace Entertainment,Sony Music Ent...",,J.C.Staff,"Music,Slice of Life,Comedy,Romance,Josei",,,,"Kawasumi, Ayako,Seki, Tomokazu,Ogawa, Shinji,M...","Noda, Megumi,Chiaki, Shinichi,Stresemann, Fran..."
9,"VAP,Hakusensha,Nippon Television Network",Funimation,Bones,"Comedy,Harem,Romance,School,Shoujo",Chieko Kawabe,,,"Miyano, Mamoru,Sakamoto, Maaya,Suzumura, Kenic...",
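The joined strings above use "," as the separator, yet staff and character names such as `Hino, Satoshi` contain commas themselves, so `str.get_dummies(",")` would split one name into two columns. The following is a minimal sketch of one possible workaround, not the notebook's method: it assumes a separator (`"|"`, chosen here only for illustration) that never occurs in the data. The toy frame and the helper `join_list` are hypothetical.

```python
import ast
import pandas as pd

# Toy frame mimicking the stringified lists in the dataset (values are illustrative)
toy = pd.DataFrame({"Staffs": ["['Hino, Satoshi', 'Kawasumi, Ayako']",
                               "['Hino, Satoshi']",
                               "[]"]})

def join_list(raw, sep="|"):
    # sep="|" is an assumption: any character absent from the data would work
    return sep.join(str(item).strip() for item in ast.literal_eval(raw))

# Same data, joined and one-hot encoded with two different separators
comma_dummies = toy["Staffs"].apply(join_list, sep=",").str.get_dummies(",")
pipe_dummies = toy["Staffs"].apply(join_list, sep="|").str.get_dummies("|")

print(comma_dummies.columns.tolist())  # name fragments become separate columns
print(pipe_dummies.columns.tolist())   # one column per full name
```

Rows with an empty list end up as all-zero rows in either encoding, which is the situation the sparsity check below is counting.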


Now, let's check the sparsity of each list-type feature to decide which ones we can use. The counts below are the number of rows with an empty value.

In [7]:
for column in list_type_features:
  print(column+" "+str(animes[animes[column]==''].shape[0]))

producer 7015
licensor 10272
studio 5853
genre 64
musicians 9925
related_animes 11547
related_manga 12538
Staffs 3898
Characters 6823


We can see that a lot of data is missing. We will only consider the Staffs, genre, and studio features, since these are relatively less sparse (especially genre), and since staff and studio can affect a show's quality.

##Similarity matrix

We will mainly use Euclidean distance and cosine similarity as similarity metrics. Jaccard and Pearson coefficients are also provided as options for later customization. Euclidean distance will be used for features with a lot of missing values, or when the magnitude of the feature matters (e.g. time/score).

In [0]:
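#define similarity function (each row of `matrix` is one item)
def similarity_matrix(matrix,measure='euclidean'):
  if measure == 'euclidean':
    return 1./(1. + euclidean_distances(matrix)) #map distance to a similarity in (0,1]
  elif measure == 'cosine':
    return cosine_similarity(matrix)
  elif measure == 'pearson':
    return np.corrcoef(matrix) #correlation between rows (items)
  elif measure =='jacc':
    #pairwise_distances returns the Jaccard distance, so convert it to a similarity
    return 1. - pairwise_distances(np.asarray(matrix,dtype=bool),metric='jaccard')
  else:
    raise ValueError("unknown measure: "+measure)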
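To make the choice between the two main metrics more concrete, here is a small sketch on a toy one-hot matrix (the values are made up for illustration). It shows how scikit-learn's `cosine_similarity` treats all-zero rows, which matters because the per-feature matrices are later multiplied together, and how the reciprocal transform maps distances to scores.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Toy one-hot matrix: rows 0 and 1 share a studio, rows 2 and 3 have no studio at all
one_hot = np.array([[1, 0],
                    [1, 0],
                    [0, 0],
                    [0, 0]])

cos_sim = cosine_similarity(one_hot)
euc_sim = 1.0 / (1.0 + euclidean_distances(one_hot))

print(cos_sim[2, 3])  # all-zero rows get cosine similarity 0
print(euc_sim[2, 3])  # identical (empty) rows get similarity 1
print(euc_sim[0, 2])  # distance 1 maps to similarity 0.5
```

With cosine similarity, a row with a missing value would zero out the whole product for that item, whereas the distance-based score leaves the other features in control. Whether two items with missing values should count as fully similar is a modelling decision in its own right.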

In [0]:
#use cosine since magnitude doesn't matter and data is sparse (a lot of columns)
genre_sim = similarity_matrix(animes['genre'].str.get_dummies(","),measure='cosine')

In [0]:
#producer similarity is an option but we are skipping it to save RAM

#use euclidean because there are too many zero vectors (missing values) 
studio_sim = similarity_matrix(animes['studio'].str.get_dummies(","))
similarity = np.multiply(genre_sim,studio_sim)
#producer_sim = similarity_matrix(animes['producer'].str.get_dummies(","))

## NLP with description data

Many of the entries include a text description of the anime. We will use a TF-IDF vectorizer to turn these descriptions into vectors and compute similarity between them.

In [0]:
#first make sure all data are populated with something
animes['Description'] = animes['Description'].fillna("")

In [0]:
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
        #use stopwords to remove meaningless words (e.g. is, was, and...)
        self.stop = set(stopwords.words('english'))
        #use regex to keep only tokens with at least 3 word characters (this also drops punctuation)
        self.tokenizer = RegexpTokenizer(r'\w{3,}')
    def __call__(self, articles):
        #tokenize, drop stopwords, and lemmatize the remaining tokens
        return [self.wnl.lemmatize(t) for t in self.tokenizer.tokenize(articles) if t not in self.stop]
#use TfidfVectorizer to get a TF-IDF representation of each description
tfidf_vec = TfidfVectorizer(min_df=15,tokenizer=LemmaTokenizer(),stop_words='english')
#use vectorizer on anime descriptions
tfidf_matrix = tfidf_vec.fit_transform(animes['Description']).toarray()
#note: scikit-learn 1.0+ provides get_feature_names_out(), and later releases remove get_feature_names()
tfidf_feature_names = tfidf_vec.get_feature_names()

#multiply synopsis similarity into the total similarity matrix
similarity = np.multiply(similarity,similarity_matrix(tfidf_matrix,measure='cosine'))
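As a rough illustration of what the TF-IDF step produces, the sketch below runs the vectorizer on three made-up descriptions (not taken from the dataset) with scikit-learn's default tokenizer instead of the `LemmaTokenizer` above. Descriptions that share vocabulary get a higher cosine similarity than those that do not.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative descriptions, not taken from the dataset
docs = [
    "a shy girl starts high school and falls in love",
    "a high school girl falls in love with her classmate",
    "pilots fight giant robots in a space war",
]

vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs)   # sparse matrix, one row per description
sim = cosine_similarity(tfidf)    # accepts sparse input directly

print(sim.round(2))
```

Since `cosine_similarity` accepts the sparse matrix as is, calling `.toarray()` first is optional, which may help when RAM is tight.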

In [13]:
#check tokens
tfidf_feature_names 

['000',
 '100',
 '10th',
 '11th',
 '12th',
 '13th',
 '14th',
 '15th',
 '16th',
 '17th',
 '18th',
 '1945',
 '1989',
 '1999',
 '19th',
 '1st',
 '2000',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015',
 '2016',
 '2017',
 '2018',
 '20th',
 '21st',
 '22nd',
 '25th',
 '2nd',
 '300',
 '3rd',
 '4th',
 '500',
 '5th',
 '6th',
 '7th',
 '8th',
 '9th',
 'abandon',
 'abandoned',
 'abducted',
 'ability',
 'able',
 'abnormal',
 'aboard',
 'abroad',
 'absolute',
 'absolutely',
 'abuse',
 'academic',
 'academy',
 'accept',
 'accepted',
 'accepting',
 'accepts',
 'access',
 'accident',
 'accidentally',
 'acclaimed',
 'accompanied',
 'accompany',
 'accompanying',
 'accomplish',
 'according',
 'ace',
 'achieve',
 'achievement',
 'achieving',
 'acquaintance',
 'acquire',
 'act',
 'acting',
 'action',
 'active',
 'activity',
 'actor',
 'actress',
 'actual',
 'actually',
 'adapt',
 'adaptation',
 'adapted',
 'adapts',
 'add',
 'added',
 'addition',
 'additiona
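The vocabulary above starts with many numeric or partly numeric tokens (`'000'`, `'10th'`, `'2014'`). One possible way to drop them, sketched here with the standard `re` module, is a pattern that only accepts runs of letters. The pattern name `LETTERS_ONLY` and the sample sentence are illustrative; the same pattern string should be usable with `RegexpTokenizer` in place of `\w{3,}`, though that is an assumption that has not been checked against this dataset.

```python
import re

# Letters only (Unicode-aware), at least 3 characters, bounded by word boundaries
LETTERS_ONLY = r"\b[^\W\d_]{3,}\b"

sample = "in 1945 the 10th academy opened for 300 students"
tokens = re.findall(LETTERS_ONLY, sample.lower())
print(tokens)
```

Tokens such as `10th` are rejected as a whole because there is no word boundary between the digits and the letters.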

In [0]:
#skip characters_sim to save RAM

#use euclidean distance because feature has a lot of zero-vectors
similarity = np.multiply(similarity,similarity_matrix(animes['Staffs'].str.get_dummies(",")))
#characters_sim = similarity_matrix(animes['Characters'].str.get_dummies(","))

In [0]:
#get years from aired dates
animes['aired_from'] = animes['aired_from'].apply(lambda x: int(x[0:4]))

In [0]:
#use euclidean since all three features are numeric values with magnitude
similarity = np.multiply(similarity,similarity_matrix(animes['rank'].values.reshape(-1,1)))
similarity = np.multiply(similarity,similarity_matrix(animes['popularity'].values.reshape(-1,1)))
similarity = np.multiply(similarity,similarity_matrix(animes['aired_from'].values.reshape(-1,1))) 
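Because the reciprocal transform `1/(1+d)` is applied to raw values here, the score depends on the unit of each feature: a gap of ten years already gives `1/11`, and rank gaps in the thousands give scores close to zero. A possible variation, shown only as a sketch and not part of the original pipeline, is to min-max scale each numeric column before computing distances. The helper `scaled_similarity` is hypothetical.

```python
import numpy as np

def scaled_similarity(values):
    """Similarity in [0.5, 1] from a 1-D numeric feature, after min-max scaling."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    scaled = (values - values.min()) / span if span > 0 else np.zeros_like(values)
    # For a single column, the pairwise absolute difference is the euclidean distance
    distance = np.abs(scaled[:, None] - scaled[None, :])
    return 1.0 / (1.0 + distance)

years = [1998, 2008, 2018]  # illustrative values
raw = 1.0 / (1.0 + np.abs(np.subtract.outer(years, years)))
print(raw[0, 1])                       # 1 / (1 + 10)
print(scaled_similarity(years)[0, 1])  # 1 / (1 + 0.5)
```

After scaling, every numeric feature contributes a factor in the same range, so no single feature dominates the product purely because of its units.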

In [17]:
#use similarity matrix to get the top 7 entries for the chosen show.
#The top entry is the item itself, since an item is most similar to itself
def recommend_by_id(idx):
    #idx is the row index in `animes`, not the anime_id column
    #sort by similarity for item
    similar_animes = sorted(list(enumerate(similarity[idx])),key=lambda x:x[1],reverse=True)
    #print the top 7 entries
    for anime in similar_animes[0:7]:
        print(animes["title"][anime[0]])
recommend_by_id(16)

Kimi ni Todoke
Kimi ni Todoke 2nd Season
Gekkan Shoujo Nozaki-kun
ReLIFE
K-On!!
Toki wo Kakeru Shoujo
Hyouka
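Since the first entry of the list is always the query item itself, a variant that excludes it can be sketched as follows. This is an illustrative alternative on a toy similarity matrix, with a hypothetical helper `top_n_similar`; it uses `np.argsort` instead of sorting a Python list.

```python
import numpy as np

def top_n_similar(sim_matrix, item_idx, n=3):
    """Return row indices of the n most similar items, excluding the item itself."""
    scores = np.asarray(sim_matrix[item_idx], dtype=float).copy()
    scores[item_idx] = -np.inf                  # never recommend the query item
    order = np.argsort(-scores, kind="stable")  # descending; ties keep index order
    return order[:n].tolist()

# Toy symmetric similarity matrix (values are made up)
toy_sim = np.array([[1.0, 0.8, 0.1, 0.4],
                    [0.8, 1.0, 0.3, 0.2],
                    [0.1, 0.3, 1.0, 0.9],
                    [0.4, 0.2, 0.9, 1.0]])
print(top_n_similar(toy_sim, 0, n=2))
```

The returned indices can then be used to look up titles, for example with `animes["title"].iloc[...]`.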


##Evaluation of recommendation model

From the result above, the content-based recommendation seems to be working quite well! Kimi ni Todoke is a high-school romance series, and so are the other shows being recommended. All of them have school or romance elements, and the top recommendation is its sequel, which is a very natural recommendation. However, we cannot generalize from just a few results.

### Ideas for improvements
1. **Use a hit rate metric to measure the performance of the model:**
    Using a hit rate is better than taking the average rating of the items and calculating the error against the real ratings,
    because an accurate rating does not imply that the user will like that item. Hit rate also fits because the recommendations are
    already in a top-N list format. A ranking loss might work even better as a metric.
2. **Use other features**: For this model, we only used a few features that we think are important characteristics of the show. Trying out other features may improve its performance as well.
3. **Use a different method for combining similarity scores**: For now, we simply multiply the similarity matrices together to obtain the overall similarity. Similarity in one feature may well be more important than in another, so feeding each similarity measure as a feature into a regression model may help assign appropriate weights to each feature.
4. **Use better NLP techniques**: As seen in the list of feature names, some of the tokens are still characters' names, which are irrelevant to the plot. Removing these names manually may help extract better features. Many of the tokens are also numbers, which are likewise irrelevant to the plot.
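The hit rate idea from the first item can be sketched as a leave-one-out evaluation. The notebook loads no user data, so the user histories below are hypothetical lists of row indices, and `leave_one_out_hit_rate` is an illustrative helper rather than an established API. For each user, the last liked item is held out, the remaining items are used to score all candidates by summed similarity, and a hit is counted when the held-out item appears in the top-N list.

```python
import numpy as np

def leave_one_out_hit_rate(sim_matrix, user_items, n=2):
    """Fraction of users whose held-out item appears in their top-n list."""
    hits = 0
    for items in user_items:
        held_out, history = items[-1], items[:-1]
        # Score every item by its summed similarity to the user's remaining history
        scores = sim_matrix[history].sum(axis=0).astype(float)
        scores[history] = -np.inf  # do not recommend items already seen
        top_n = np.argsort(-scores, kind="stable")[:n]
        hits += int(held_out in top_n)
    return hits / len(user_items)

# Toy symmetric similarity matrix (values are made up)
toy_sim = np.array([[1.0, 0.8, 0.1, 0.4],
                    [0.8, 1.0, 0.3, 0.2],
                    [0.1, 0.3, 1.0, 0.9],
                    [0.4, 0.2, 0.9, 1.0]])
# Hypothetical liked-item lists (row indices); the last entry is held out
users = [[0, 1], [2, 3], [1, 2]]
print(leave_one_out_hit_rate(toy_sim, users, n=1))
```

A larger `n` can only keep or raise the hit rate, so comparisons between models should use the same `n`.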