### Content Based with new Anime Dataset

Importing neccessary libraries

In [1]:
import pandas as pd
import numpy as np

In [2]:
anime_new = pd.read_csv('DATA/anime_cleaned.csv', index_col = 0)
anime_new.head(3)

Unnamed: 0,title,genre,synopsis,producer,studio,rating,scoredby,members,source,aired
1,Cowboy Bebop,"Action, Adventure, Comedy, Drama, Sci-Fi, Space","In the year 2071, humanity has colonized sever...",Bandai Visual,Sunrise,8.81,363889,704490,Original,"Apr 3, 1998 to Apr 24, 1999"
2,Cowboy Bebop Tengoku no Tobira,"Action, Space, Drama, Mystery, Sci-Fi","Another day, another bounty—such is the life o...","Sunrise, Bandai Visual",Bones,8.41,111187,179899,Original,"Sep 1, 2001"
3,Trigun,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen","Vash the Stampede is the man with a $$60,000,0...",Victor Entertainment,Madhouse,8.31,197451,372709,Manga,"Apr 1, 1998 to Sep 30, 1998"


In [3]:
len(anime_new)

3185

Feature columns have already been split into a list containing them

In [4]:
anime_new['synopsis'].head()

1    In the year 2071, humanity has colonized sever...
2    Another day, another bounty—such is the life o...
3    Vash the Stampede is the man with a $$60,000,0...
4    Witches are individuals with special powers li...
5    Sena is like any other shy kid starting high s...
Name: synopsis, dtype: object

Applying vectorization to the plot summary to get other similar animes, using tf-idf vectorizer

In [5]:
#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
anime_new['synopsis'] = anime_new['synopsis'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(anime_new['synopsis'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(3185, 23415)

With this matrix in hand, we compute similarity scores using cosine similarity

In [6]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

We define a function that takes in an anime title as an input and outputs a list of the 10 most similar animes. Firstly, for this, you need a reverse mapping of anime titles and DataFrame 

In [7]:
#Construct a reverse map of indices and anime titles
indices = pd.Series(anime_new.index, index=anime_new['title']).drop_duplicates()

In [8]:
indices.head()

title
Cowboy Bebop                      1
Cowboy Bebop Tengoku no Tobira    2
Trigun                            3
Witch Hunter Robin                4
Eyeshield 21                      5
dtype: int64

In [9]:
# Function that takes in anime title as input and outputs most similar anime
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the anime that matches the title
    idx = indices[title]

    # Get the pairwsie similarity scores of all anime with that anime
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the anime based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar anime
    sim_scores = sim_scores[1:11]

    # Get the anime indices
    anime_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar anime
    return anime_new['title'].iloc[anime_indices]

In [10]:
get_recommendations('Death Note')

1303    Hokuto no Ken Raoh Gaiden Ten no Haoh
869       Hokuto no Ken Raoh Gaiden Junai-hen
25                                      Akira
1221                               Soul Eater
991                           Tekkon Kinkreet
996                                 Skull Man
1479                               Durarara!!
2462                        K Return of Kings
256                          UGUltimate Girls
405                         Juubee Ninpuuchou
Name: title, dtype: object

Similarity using genre, producer, studio, source 




In [11]:
anime_new.dtypes

title        object
genre        object
synopsis     object
producer     object
studio       object
rating      float64
scoredby      int64
members       int64
source       object
aired        object
dtype: object

In [12]:
#converting necessary columns to strings for processing
anime_new['genre'] = anime_new["genre"].astype('str')
anime_new['producer'] = anime_new["producer"].astype('str')
anime_new['studio'] = anime_new["studio"].astype('str')

In [13]:
#creating a function to split the particular columns into lists again
features = ['genre', 'producer', 'studio', 'source']
def split_columns(x):
  return x.split(',')

In [14]:
#applying the function to the same set of columns
for feature in features:
    anime_new[feature] = anime_new[feature].apply(split_columns)

In [15]:
anime_new.head(3)

Unnamed: 0,title,genre,synopsis,producer,studio,rating,scoredby,members,source,aired
1,Cowboy Bebop,"[Action, Adventure, Comedy, Drama, Sci-Fi,...","In the year 2071, humanity has colonized sever...",[Bandai Visual],[Sunrise],8.81,363889,704490,[Original],"Apr 3, 1998 to Apr 24, 1999"
2,Cowboy Bebop Tengoku no Tobira,"[Action, Space, Drama, Mystery, Sci-Fi]","Another day, another bounty—such is the life o...","[Sunrise, Bandai Visual]",[Bones],8.41,111187,179899,[Original],"Sep 1, 2001"
3,Trigun,"[Action, Sci-Fi, Adventure, Comedy, Drama,...","Vash the Stampede is the man with a $$60,000,0...",[Victor Entertainment],[Madhouse],8.31,197451,372709,[Manga],"Apr 1, 1998 to Sep 30, 1998"


In [16]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if source exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [17]:
#applying the clean_data function
for feature in features:
    anime_new[feature] = anime_new[feature].apply(clean_data)

In [18]:
def create_soup(x):
    return ' '.join(x['genre']) + ' ' + ' '.join(x['producer']) + ' ' + ' '.join(x['studio']) + ' ' + ' '.join(x['source'])

In [19]:
# Create a new soup feature
anime_new['soup'] = anime_new.apply(create_soup, axis=1)

In [20]:
anime_new['soup']

1       action adventure comedy drama sci-fi space ban...
2       action space drama mystery sci-fi sunrise band...
3       action sci-fi adventure comedy drama shounen v...
4       action magic police supernatural drama mystery...
5       action sports comedy shounen tvtokyo nihonadsy...
                              ...                        
3181    comedy demons supernatural shounen yomikoadver...
3182    action adventure comedy fantasy magic eggfirm ...
3183    comedy school tohoanimation asahiproduction we...
3184    comedy fantasy shounen tvtokyo avexpictures sh...
3185    action adventure shounen warnerbros.japan shue...
Name: soup, Length: 3185, dtype: object

In [21]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(anime_new['soup'])

In [22]:
count_matrix.shape

(3185, 938)

In [23]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

Using the get_recommendations function for the new cosine similarity

In [24]:
get_recommendations('Death Note', cosine_sim2) #based on features

630                                           Samurai Gun
1354                               Kurokami The Animation
2102                    Gifuu Doudou!! Kanetsugu to Keiji
22      Rurouni Kenshin Meiji Kenkaku Romantan - Tsuio...
1312                                        Tekken Chinmi
125                                  Peace Maker Kurogane
136                                          Tenjou Tenge
1368                                         Souten Kouro
247                 Mutsu Enmei Ryuu Gaiden Shura no Toki
784                Sumomomo Momomo Chijou Saikyou no Yome
Name: title, dtype: object

In [25]:
get_recommendations('Death Note', cosine_sim) #based on storyline

1303    Hokuto no Ken Raoh Gaiden Ten no Haoh
869       Hokuto no Ken Raoh Gaiden Junai-hen
25                                      Akira
1221                               Soul Eater
991                           Tekkon Kinkreet
996                                 Skull Man
1479                               Durarara!!
2462                        K Return of Kings
256                          UGUltimate Girls
405                         Juubee Ninpuuchou
Name: title, dtype: object

In [26]:
#writing a function to get recommendations using both storyline and features based on 50% for both
def get_recommendations_both(title, cos_sim1, cos_sim2):
  # Get the index of the movie that matches the title
  idx = indices[title]

  # Get the pairwsie similarity scores of all anime using similarity1
  sim_scores1 = list(enumerate(cos_sim1[idx]))

  # Get the pairwsie similarity scores of all anime using similarity1
  sim_scores2 = list(enumerate(cos_sim2[idx]))

  #Getting the average of both similarity scores
  sim_scores_avg = [(sim_scores1[i][0],(sim_scores1[i][1] + sim_scores2[i][1])/2) for i in range(len(sim_scores1))]

  # Sort the movies based on the similarity scores
  sim_scores_avg = sorted(sim_scores_avg, key=lambda x: x[1], reverse=True)

  #Get the scores of the 10 most similar movies
  sim_scores_avg = sim_scores_avg[1:11]

  #Get the movie indices
  anime_indices = [i[0] for i in sim_scores_avg]

  #Return the top 10 most similar movies
  return anime_new['title'].iloc[anime_indices]


In [27]:
get_recommendations_both('Death Note',cosine_sim,cosine_sim2)

1303                Hokuto no Ken Raoh Gaiden Ten no Haoh
630                                           Samurai Gun
22      Rurouni Kenshin Meiji Kenkaku Romantan - Tsuio...
2102                    Gifuu Doudou!! Kanetsugu to Keiji
1354                               Kurokami The Animation
1312                                        Tekken Chinmi
136                                          Tenjou Tenge
247                 Mutsu Enmei Ryuu Gaiden Shura no Toki
784                Sumomomo Momomo Chijou Saikyou no Yome
125                                  Peace Maker Kurogane
Name: title, dtype: object