#Importing and Analyzing the data

In [1]:
import pandas as pd

##Getting Dataset and info##

In [2]:
# Importing dataset
anime = pd.read_csv("anime.csv", encoding = 'utf8')

In [3]:
anime.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Toy Story (1995),"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Jumanji (1995),"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Grumpier Old Men (1995),"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Waiting to Exhale (1995),"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Father of the Bride Part II (1995),"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [4]:
anime.shape # Getting shape

(12294, 7)

In [5]:
anime.columns #Getting Columns

Index(['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members'], dtype='object')

In [6]:
anime.genre # Look at the genre column to see which genres appear

0                     Drama, Romance, School, Supernatural
1        Action, Adventure, Drama, Fantasy, Magic, Mili...
2        Action, Comedy, Historical, Parody, Samurai, S...
3                                         Sci-Fi, Thriller
4        Action, Comedy, Historical, Parody, Samurai, S...
                               ...                        
12289                                               Hentai
12290                                               Hentai
12291                                               Hentai
12292                                               Hentai
12293                                               Hentai
Name: genre, Length: 12294, dtype: object

##Normalization of the data using TF-IDF (Term Frequency and Inverse Document Frequency)##

Each genre string is converted to a row of a term matrix with one column per genre token, and the rows are then weighted and normalized with TF-IDF.

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer 
# Term frequency-inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus

In [8]:
# The vectorizer is configured to drop English stop words before computing TF-IDF
tfidf = TfidfVectorizer(stop_words = "english")    # use scikit-learn's built-in English stop-word list

##Dealing with the Missing values##

In [9]:
anime["genre"].isnull().sum() # getting the total sum of the missing values in genre
# So there are 62 missing values

62

In [10]:
anime["genre"] = anime["genre"].fillna("general") # fill the missing genre values with the placeholder string 'general'

In [11]:
anime["genre"].isnull().sum() # after filling the NaN values, the number of missing values is 0

0
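
The same check can be run across every column at once. A minimal sketch on a small made-up frame (the `toy` DataFrame and its values are assumptions for illustration): `isnull().sum()` on the whole DataFrame returns one count per column, and `fillna` with a placeholder string keeps the rows, so they can still be vectorized later.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the real dataset
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre": ["Action, Comedy", np.nan, "Drama"],
    "rating": [7.5, np.nan, np.nan],
})

missing_before = toy.isnull().sum()   # one count per column
toy["genre"] = toy["genre"].fillna("general")
missing_after = toy.isnull().sum()    # only the genre column was filled

print(missing_before)
print(missing_after)
```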

In [12]:
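anime.info # Without parentheses this shows the bound method and the DataFrame repr; anime.info() would print the column summary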

<bound method DataFrame.info of        anime_id                                               name  \
0         32281                                   Toy Story (1995)   
1          5114                                     Jumanji (1995)   
2         28977                            Grumpier Old Men (1995)   
3          9253                           Waiting to Exhale (1995)   
4          9969                 Father of the Bride Part II (1995)   
...         ...                                                ...   
12289      9316       Toushindai My Lover: Minami tai Mecha-Minami   
12290      5543                                        Under World   
12291      5621                     Violence Gekiga David no Hoshi   
12292      6133  Violence Gekiga Shin David no Hoshi: Inma Dens...   
12293     26081                   Yasuji no Pornorama: Yacchimae!!   

                                                   genre   type episodes  \
0                   Drama, Romance, School, Superna

In [13]:
anime.describe()

Unnamed: 0,anime_id,rating,members
count,12294.0,12064.0,12294.0
mean,14058.221653,6.473902,18071.34
std,11455.294701,1.026746,54820.68
min,1.0,1.67,5.0
25%,3484.25,5.88,225.0
50%,10260.5,6.57,1550.0
75%,24794.5,7.18,9437.0
max,34527.0,10.0,1013917.0


##CREATING TFIDF MATRIX##

In [14]:
# Build the TF-IDF matrix by fitting the vectorizer on the genre column and transforming it
tfidf_matrix = tfidf.fit_transform(anime.genre)   # Converts the raw genre strings into a sparse matrix of TF-IDF features
tfidf_matrix.shape # one row per title, one column per genre token (similar to dummy variables, but weighted)

(12294, 47)

We need to measure how closely two genre vectors are related.
For this we use the concept of cosine similarity.
Since cos(90°) = 0, two vectors at 90 degrees to each other share no direction, which means they have no similarity.
Since cos(0°) = 1,
two vectors pointing in the same direction are as similar as possible. This is how we compare the genres of two movies.


1. From the above matrix we need to find a similarity score.
2. Several metrics are available for this, such as the Euclidean, the Pearson, and the cosine similarity scores.

3. The score is a numeric quantity that represents the similarity between 2 movies.
4. Cosine similarity is independent of magnitude and easy to calculate.
5. cosine(x, y) = (x · y⊺) / (||x|| · ||y||)
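
The formula in item 5 can be checked by hand with a few small vectors. This is a minimal sketch in plain Python; the function name `cosine` and the example vectors are assumptions for illustration. It shows the two boundary cases described above (perpendicular vectors and vectors in the same direction) and that scaling a vector does not change the score.

```python
import math

def cosine(x, y):
    """Cosine similarity of two equal-length vectors: (x . y) / (||x|| ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

perpendicular = cosine([1, 0], [0, 1])    # vectors at 90 degrees
same_direction = cosine([1, 2], [2, 4])   # same direction, different length
diagonal = cosine([1, 1], [1, 0])         # vectors at 45 degrees

print(perpendicular, same_direction, diagonal)
```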

##Calculating the Dot product using linear_kernel()##

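With linear_kernel() we compute the dot product between every pair of genre vectors, so there will be 12294 x 12294 values. Because TfidfVectorizer L2-normalizes each row by default, this dot product equals the cosine similarity.
That many pairwise values are hard to handle individually, so we store them in a matrix.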

In [15]:
from sklearn.metrics.pairwise import linear_kernel

In [16]:
#Now we are computing the cosine similarity on the TF-IDF matrix
cosine_sim_matrix = linear_kernel(tfidf_matrix, tfidf_matrix)
# The pairwise results are stored in cosine_sim_matrix

The cosine similarity matrix above is computed from the 47 attributes, which are the genre tokens.
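
Using a plain dot product as a cosine similarity relies on the TF-IDF rows having unit length. The following sketch, on made-up genre strings (`sample_genres` is an assumption for illustration), checks that assumption and compares `linear_kernel` with scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical genre strings
sample_genres = ["Action, Comedy", "Action, Drama", "Romance, School"]

matrix = TfidfVectorizer(stop_words="english").fit_transform(sample_genres)

# Euclidean length of each row; expected to be 1 with the default norm="l2"
row_norms = np.sqrt(matrix.multiply(matrix).sum(axis=1))

dot_products = linear_kernel(matrix, matrix)
cosines = cosine_similarity(matrix, matrix)

print(np.allclose(row_norms, 1.0))
print(np.allclose(dot_products, cosines))
```

If the vectorizer were created with `norm=None`, the rows would no longer have unit length and `cosine_similarity` would be the safer call.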

In [17]:
cosine_sim_matrix # a value close to 1 means the two titles are similar, and a value close to 0 means they are
                    # not similar.

array([[1.        , 0.14784981, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.14784981, 1.        , 0.1786367 , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.1786367 , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
        1.        ],
       [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
        1.        ],
       [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
        1.        ]])

In [18]:
cosine_sim_matrix.shape #Matrix is created 

(12294, 12294)
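
A dense 12294 x 12294 matrix of 64-bit floats takes 12294 × 12294 × 8 bytes, which is roughly 1.2 GB. If that is too much memory, one alternative (an assumption on my part, not what this notebook does) is to compute a single row of similarities on demand. A minimal sketch on made-up data, where `similarity_row` is a hypothetical helper:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical genre strings
sample_genres = ["Action, Comedy", "Action, Drama", "Romance, School", "Action, Comedy"]
matrix = TfidfVectorizer(stop_words="english").fit_transform(sample_genres)

def similarity_row(tfidf, i):
    """Similarities between item i and every item, without building the full matrix."""
    return linear_kernel(tfidf[i], tfidf).ravel()

row0 = similarity_row(matrix, 0)

# For comparison only: the same row taken from the full pairwise matrix
full = linear_kernel(matrix, matrix)
print(np.allclose(row0, full[0]))
```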

##Creating the anime index##

Now we create an index of movie names, which works like an index page: given a particular movie name, we can quickly look up its row number.

In [19]:
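# Now we are creating a mapping of anime name to index number
# Note: drop_duplicates() compares the values (row numbers), which are already unique, so repeated names are kept
anime_index = pd.Series(anime.index, index = anime['name']).drop_duplicates()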

In [20]:
anime_index

name
Toy Story (1995)                                          0
Jumanji (1995)                                            1
Grumpier Old Men (1995)                                   2
Waiting to Exhale (1995)                                  3
Father of the Bride Part II (1995)                        4
                                                      ...  
Toushindai My Lover: Minami tai Mecha-Minami          12289
Under World                                           12290
Violence Gekiga David no Hoshi                        12291
Violence Gekiga Shin David no Hoshi: Inma Densetsu    12292
Yasuji no Pornorama: Yacchimae!!                      12293
Length: 12294, dtype: int64
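
The length is still 12294 because `drop_duplicates()` on a Series compares its values, and here the values are row numbers. If the intent is to keep one row per title, a possible approach is to filter on the index instead. A minimal sketch on a made-up frame (`toy` and `name_to_index` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical titles; "B" appears twice
toy = pd.DataFrame({"name": ["A", "B", "C", "B"]})

name_to_index = pd.Series(toy.index, index=toy["name"])

# drop_duplicates() compares the values (0, 1, 2, 3), so nothing is removed
unchanged = name_to_index.drop_duplicates()

# Keep only the first row for each repeated name
first_only = name_to_index[~name_to_index.index.duplicated(keep="first")]

print(len(unchanged), len(first_only))
print(first_only["B"])
```

Without this filtering, looking up a repeated name returns a Series of row numbers rather than a single integer.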

In [21]:
# Get the index number of a given movie.
anime_id1 = anime_index["Assassins (1995)"]
anime_id1

22

In [22]:
anime_id2 = anime_index["Under World"]
anime_id2

12290

##Creating function to get similar movies for a given movie##

In [23]:
def get_recommendations(Name, topN):  # Two inputs: the name of the movie and the number of recommendations.

    anime_id = anime_index[Name] # Look up the row index for the movie name
                                 # and assign it to anime_id.


    cosine_scores = list(enumerate(cosine_sim_matrix[anime_id]))
    # Row anime_id of the 12294x12294 cosine similarity matrix holds
    # this movie's 12294 similarity scores against every movie (itself included).
    # enumerate() pairs each score with its row index,
    # and the list of pairs is assigned to cosine_scores.

    # Sort the (index, score) pairs by score in descending order, using a lambda as the key,
    # and reassign the result to cosine_scores.
    cosine_scores = sorted(cosine_scores, key=lambda x:x[1], reverse = True)

    # Keep only as many scores as the number of recommendations requested.
    cosine_scores_N = cosine_scores[0: topN+1] # topN+1 because the row also contains the movie's
    # comparison with itself (score 1), so one extra entry is taken
    # to leave room for that self-match.

    # Split the selected pairs into row indices and scores.
    anime_idx  =  [i[0] for i in cosine_scores_N]
    anime_scores =  [i[1] for i in cosine_scores_N]

    # Similar movies and scores
    anime_similar_show = pd.DataFrame(columns=["name", "Score"]) # Create a DataFrame with name and Score columns

    anime_similar_show["name"] = anime.loc[anime_idx, "name"] # Fill in the names of the selected movies.
    anime_similar_show["Score"] = anime_scores # Same for the scores.
    anime_similar_show.reset_index(inplace = True)  # Resetting the index
    print (anime_similar_show) # Print the DataFrame.
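
The function above prints its result and can include the queried title itself among the results. As a possible variant (a sketch under stated assumptions, not the notebook's method), the hypothetical `recommend` below returns a DataFrame and drops the query row before ranking. The `toy` catalogue, `toy_sim`, and `toy_index` are made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical catalogue standing in for the real dataset
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "genre": ["Action, Comedy", "Action, Comedy", "Action, Drama", "Romance"],
})
toy_tfidf = TfidfVectorizer().fit_transform(toy["genre"])
toy_sim = linear_kernel(toy_tfidf, toy_tfidf)
toy_index = pd.Series(toy.index, index=toy["name"])

def recommend(name, top_n, frame, sim, name_to_index):
    """Return the top_n most similar rows, never including the query row."""
    query = name_to_index[name]
    # Attach row labels to the scores, then remove the query row itself
    scores = pd.Series(sim[query], index=frame.index).drop(query)
    top = scores.sort_values(ascending=False, kind="stable").head(top_n)
    return pd.DataFrame({"name": frame.loc[top.index, "name"], "Score": top})

result = recommend("A", 2, toy, toy_sim, toy_index)
print(result)
```

Returning the DataFrame instead of printing it lets the caller filter or join the recommendations further.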
    

##Checking##

In [24]:
# Firstly, we have to get the index id of the movie.
anime_index["Bad Boys (1995)"]

118

##Passing the movie name to the function##

In [25]:
get_recommendations("Father of the Bride Part II (1995)", topN = 10)

    index                                name     Score
0       2             Grumpier Old Men (1995)  1.000000
1       4  Father of the Bride Part II (1995)  1.000000
2       8                 Sudden Death (1995)  1.000000
3       9                    GoldenEye (1995)  1.000000
4      12                        Balto (1995)  1.000000
5      63                    Fair Game (1995)  1.000000
6      65              Misérables, Les (1995)  1.000000
7     216                         I.Q. (1994)  1.000000
8     306        Bullets Over Broadway (1994)  1.000000
9   10896                      Gintama (2017)  1.000000
10    380               Color of Night (1994)  0.940044
