#### Using the IMDB 5000 Movie Dataset, let's predict the rating of a movie based on its genre. We will use KNN to find movies with similar genres, on the assumption that similar movies also have similar ratings.

#### Dataset Source : https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset

In [1]:
#packages import section
import pandas as pd

#### I have stored the Kaggle file locally and am reading from it. Let's name our dataframe 'ratings' and start working on it.

In [2]:
ratings = pd.read_csv("../IMDB/movie_metadata.csv")
ratings.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


#### The data looks good, but let's check whether any of the movies have duplicate entries.

In [3]:
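#Index.get_duplicates() was removed in pandas 1.0; duplicated() finds the same titles
sorted(ratings.loc[ratings['movie_title'].duplicated(), 'movie_title'].dropna().unique())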

['20,000 Leagues Under the Sea\xa0',
 "A Dog's Breakfast\xa0",
 'A Nightmare on Elm Street\xa0',
 'A Woman, a Gun and a Noodle Shop\xa0',
 'Across the Universe\xa0',
 'Alice in Wonderland\xa0',
 'Aloha\xa0',
 'Around the World in 80 Days\xa0',
 'Bad Moms\xa0',
 'Ben-Hur\xa0',
 'Big Fat Liar\xa0',
 'Brothers\xa0',
 'Carrie\xa0',
 'Casino Royale\xa0',
 'Cat People\xa0',
 'Chasing Liberty\xa0',
 'Cinderella\xa0',
 'Clash of the Titans\xa0',
 'Conan the Barbarian\xa0',
 'Crash\xa0',
 'Creepshow\xa0',
 'Crossroads\xa0',
 'Dangerous Liaisons\xa0',
 'Dawn of the Dead\xa0',
 'Day of the Dead\xa0',
 'Death at a Funeral\xa0',
 'Dekalog\xa0            ',
 'Disturbia\xa0',
 'Dodgeball: A True Underdog Story\xa0',
 'Dredd\xa0',
 'Eddie the Eagle\xa0',
 'Exodus: Gods and Kings\xa0',
 'Fantastic Four\xa0',
 'First Blood\xa0',
 'Footloose\xa0',
 'Forsaken\xa0',
 'From Hell\xa0',
 'Ghostbusters\xa0',
 'Glory\xa0',
 'Godzilla Resurgence\xa0',
 'Goosebumps\xa0',
 'Halloween II\xa0',
 'Halloween\xa0',
 'H

#### Oh boy! Lots of duplicate data. Let's remove the duplicate rows, drop the NA values and view the data again. Let's also limit the analysis to the movie title, genres and IMDB score only.

In [4]:
#remove duplicate
ratings = ratings.drop_duplicates(['movie_title'])

#drop rows with na values from columns which are going to be analysed
ratings = ratings[['movie_title', 'genres', 'imdb_score']].dropna()
ratings.head()

Unnamed: 0,movie_title,genres,imdb_score
0,Avatar,Action|Adventure|Fantasy|Sci-Fi,7.9
1,Pirates of the Caribbean: At World's End,Action|Adventure|Fantasy,7.1
2,Spectre,Action|Adventure|Thriller,6.8
3,The Dark Knight Rises,Action|Thriller,8.5
4,Star Wars: Episode VII - The Force Awakens ...,Documentary,7.1
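#### The duplicate titles listed earlier all end in a non-breaking space ('\xa0'). The sketch below is an optional cleanup, not part of the original analysis: it strips that whitespace and then repeats the duplicate check and removal. The frame 'toy' and its titles and scores are made up for illustration.

```python
import pandas as pd

# Hypothetical mini-frame that mimics the trailing '\xa0' seen in the real titles
toy = pd.DataFrame({
    'movie_title': ['Movie A\xa0', 'Movie B\xa0', 'Movie A\xa0', 'Movie C\xa0'],
    'genres': ['Comedy|Drama', 'Action', 'Comedy|Drama', 'Documentary'],
    'imdb_score': [6.0, 7.0, 6.0, 8.0],
})

# str.strip() removes Unicode whitespace, including the non-breaking space
toy['movie_title'] = toy['movie_title'].str.strip()

# titles that occur more than once, as a sorted list
dupes = sorted(toy.loc[toy['movie_title'].duplicated(), 'movie_title'].unique())

# keep the first occurrence of every title
deduped = toy.drop_duplicates(['movie_title'])

print(dupes)
print(deduped['movie_title'].tolist())
```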


#### The data is now easier to interpret.
#### Now let's generate a list 'genreList' with all the genres mentioned in the dataset.

In [5]:
#generating a list which will hold all the possible genres for comparison
genreList = []
for index, row in ratings.iterrows():
    genres = row["genres"].split('|')
    
    for genre in genres:
        if genre not in genreList:
            genreList.append(genre)
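#### The same list can be built more compactly. This is only a sketch of an alternative: 'sample_genres' holds the genre strings of the first five rows shown above, and 'genre_list_sketch' is a hypothetical name used here so that it does not clash with 'genreList'.

```python
# Genre strings of the first five rows of the dataframe
sample_genres = [
    'Action|Adventure|Fantasy|Sci-Fi',
    'Action|Adventure|Fantasy',
    'Action|Adventure|Thriller',
    'Action|Thriller',
    'Documentary',
]

# dict.fromkeys drops repeats while keeping the order in which genres first appear
genre_list_sketch = list(dict.fromkeys(
    genre for row in sample_genres for genre in row.split('|')
))

print(genre_list_sketch)
```

#### Keeping the first-seen order matters because the position of each genre in the list later becomes its position in the binary vector.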

#### 'genreList' now holds all the genres. But how do we know which genres each movie belongs to? Certainly not all of them. Let's find out!
#### Let's create a new column in the dataframe that holds a binary value for every genre, marking whether the movie has that genre or not. First, let's create a method that returns the list of binary values for one movie. 'genreList' is what we compare each movie's genres against.

In [6]:
#method to assign binary values to genres present in the df
def getValue(genre_list):
    binaryList = []
    
    for genre in genreList:
        if genre in genre_list:
            binaryList.append(1)
        else:
            binaryList.append(0)
    
    return binaryList
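#### To make the encoding concrete, here is a small self-contained sketch of the same idea. The function 'encode_genres' and the six-genre list are illustrative stand-ins for 'getValue' and the full 'genreList'.

```python
# First six genres in first-seen order (a shortened stand-in for genreList)
genre_list_sketch = ['Action', 'Adventure', 'Fantasy', 'Sci-Fi', 'Thriller', 'Documentary']

def encode_genres(genre_string, all_genres):
    """Return 1 for every genre in all_genres that the movie has, else 0."""
    movie_genres = set(genre_string.split('|'))
    return [int(genre in movie_genres) for genre in all_genres]

avatar_vector = encode_genres('Action|Adventure|Fantasy|Sci-Fi', genre_list_sketch)
spectre_vector = encode_genres('Action|Adventure|Thriller', genre_list_sketch)

print(avatar_vector)
print(spectre_vector)
```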

#### Now that the method is created, let's put it to use and add the new column we mentioned earlier. This column will be critical in the KNN calculation later on.

In [7]:
#sGenres will hold the binary values of genres which will be helpful for KNN calculation
ratings['sGenres'] = ratings['genres'].apply(lambda x: getValue(x.split('|')))
ratings['sGenres'].head()

0    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2    [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3    [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Name: sGenres, dtype: object

#### The 0s and 1s look good, and hopefully they are easier for the computer to interpret this way. Well, who am I kidding, they have to be!
#### Our dataframe does not have a uniform identity column yet. Let's make one.

In [8]:
# new column = movieId
ratings['movieId'] = range(1, len(ratings) + 1)
ratings.head()

Unnamed: 0,movie_title,genres,imdb_score,sGenres,movieId
0,Avatar,Action|Adventure|Fantasy|Sci-Fi,7.9,"[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1
1,Pirates of the Caribbean: At World's End,Action|Adventure|Fantasy,7.1,"[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",2
2,Spectre,Action|Adventure|Thriller,6.8,"[1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",3
3,The Dark Knight Rises,Action|Thriller,8.5,"[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",4
4,Star Wars: Episode VII - The Force Awakens ...,Documentary,7.1,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",5


#### Now let's use the scipy package to create a method that calculates the cosine distance between the genres of two movies and adds the absolute difference between their IMDB scores. Various other metrics can be used in place of cosine. Check out the following link: https://docs.scipy.org/doc/scipy-0.18.1/reference/spatial.distance.html

In [9]:
#method to compute spatial distance between two movies
from scipy import spatial

def ComputeDistance(movieId1, movieId2):
    #both arguments are 0-based row positions (movieId - 1), not movieId values
    a = ratings.iloc[movieId1]
    b = ratings.iloc[movieId2]
    
    #cosine distance between the binary genre vectors
    genreDistance = spatial.distance.cosine(a['sGenres'], b['sGenres'])
    
    #absolute difference between the IMDB scores
    scoreDistance = abs(a['imdb_score'] - b['imdb_score'])
    
    return genreDistance + scoreDistance

#### Let's use the ComputeDistance method to calculate the distance between two movies and see the result.

In [10]:
#distance between the rows at positions 1 and 2 (movieId 2 and movieId 3)
ComputeDistance(1, 2)

0.63333333333333308

#### The greater the distance, the less similar the movies are, and this distance is fairly large. Let's see which movies these actually were.

In [11]:
print(ratings.iloc[1,])
print(ratings.iloc[2,])

movie_title            Pirates of the Caribbean: At World's End 
genres                                  Action|Adventure|Fantasy
imdb_score                                                   7.1
sGenres        [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
movieId                                                        2
Name: 1, dtype: object
movie_title                                             Spectre 
genres                                 Action|Adventure|Thriller
imdb_score                                                   6.8
sGenres        [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
movieId                                                        3
Name: 2, dtype: object
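#### The distance of roughly 0.633 can be reproduced by hand from the two rows above. The sketch below uses plain Python instead of scipy so that each term is visible. The helper 'cosine_distance' is a hypothetical name, and the vectors keep only the first five genre positions, since trailing zeros do not change the cosine.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

# Positions: Action, Adventure, Fantasy, Sci-Fi, Thriller
pirates = [1, 1, 1, 0, 0]   # Action|Adventure|Fantasy
spectre = [1, 1, 0, 0, 1]   # Action|Adventure|Thriller

genre_part = cosine_distance(pirates, spectre)   # two shared genres out of three each
score_part = abs(7.1 - 6.8)                      # difference between the IMDB scores
total = genre_part + score_part

print(round(genre_part, 4), round(score_part, 4), round(total, 4))
```

#### About a third of the distance comes from the genres and the rest from the 0.3 gap in ratings.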


#### Hide the rum. Captain Sparrow is here, mate. So is Bond. James Bond.
#### Well Jack Sparrow and James Bond won't agree ever. Arrgh!

#### Let's find Nemo now. Actually, let's find Nemo's rating from the movies with similar genres.

In [12]:
nemo = ratings.loc[ratings['movie_title'].str.contains("Nemo")]
nemo

Unnamed: 0,movie_title,genres,imdb_score,sGenres,movieId
338,Finding Nemo,Adventure|Animation|Comedy|Family,8.2,"[0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, ...",335


#### Now that everything is ready, let's finally use the methods to find the movie's rating. I am using the 10 nearest neighbors of Nemo and averaging their ratings. After the calculation, let's also display the nearest movies along with their data.

In [13]:
import operator

def getNeighbors(baseMovie, K):
    #baseMovie is a one-row dataframe, so pull out its movieId once
    baseId = baseMovie['movieId'].values[0]
    distances = []
    
    for index, movie in ratings.iterrows():
        if movie['movieId'] != baseId:
            #movieId is 1-based, ComputeDistance expects 0-based positions
            dist = ComputeDistance(baseId - 1, movie['movieId'] - 1)
            distances.append((movie['movieId'], dist))
    
    #keep the K movies with the smallest distance
    distances.sort(key=operator.itemgetter(1))
    return distances[:K]

K = 10
avgRating = 0
neighbors = getNeighbors(nemo, K)

for movieId, dist in neighbors:
    neighbor = ratings.iloc[movieId - 1]
    avgRating += neighbor['imdb_score']
    print(neighbor['movie_title'], neighbor['genres'], str(dist), str(neighbor['imdb_score']))
    
print('\n')

avgRating /= K
print('The predicted rating of the movie is: ', avgRating)

Up  Adventure|Animation|Comedy|Family 0.1 8.3
Monsters, Inc.  Adventure|Animation|Comedy|Family|Fantasy 0.205572809 8.1
Toy Story 3  Adventure|Animation|Comedy|Family|Fantasy 0.205572809 8.3
Toy Story  Adventure|Animation|Comedy|Family|Fantasy 0.205572809 8.3
Shaun the Sheep              Animation|Comedy|Family 0.233974596216 8.3
How to Train Your Dragon  Adventure|Animation|Family|Fantasy 0.25 8.2
Howl's Moving Castle  Adventure|Animation|Family|Fantasy 0.25 8.2
Inside Out  Adventure|Animation|Comedy|Drama|Family|Fantasy 0.283503419072 8.3
A Charlie Brown Christmas  Animation|Comedy|Family 0.333974596216 8.4
A Christmas Story  Comedy|Family 0.392893218813 8.1


The predicted rating of the movie is:  8.25


#### So the model predicts a rating of 8.25. No doubt Finding Nemo was a good movie. Notice how the similar movies are mostly animation, adventure, comedy or family in their genres.
#### Now let's check the actual genres of Finding Nemo and its rating on IMDB.

In [14]:
#Finding Nemo's movieId is 335, so its positional index is 334
nemoRow = ratings.iloc[334]
print(nemoRow['movie_title'], nemoRow['genres'], nemoRow['imdb_score'])

Finding Nemo  Adventure|Animation|Comedy|Family 8.2


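#### The model worked out pretty well: the neighbors' genres match closely, and the predicted rating of 8.25 is almost the same as the actual 8.2. Keep in mind that ComputeDistance also uses Nemo's known IMDB score, so this is an optimistic check rather than a prediction for an unseen movie. Finding Nemo is one of the best Pixar movies released by Disney, and we predicted its rating almost correctly. Hurray!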

### K Nearest Neighbors (KNN) is a method used to classify new data points based on their distance to known data. We just need to define some distance metric between items in the dataset and use it to find the K closest neighbors. Those neighbors can then be used to predict some property of a test item, either by letting them all vote on the final classification or, as we did here for a numeric rating, by averaging their values.
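#### As a compact illustration of that idea, here is a sketch of KNN that predicts a rating from genres alone, which is what would be available for a movie with no rating yet. It is a simplified variant and not the method used above: the names 'genre_distance' and 'predict_rating' are hypothetical, K is set to 3 to fit the small sample, and the titles, genres and scores are copied from the outputs shown earlier in this notebook.

```python
import math

def genre_distance(genres_a, genres_b):
    """Cosine distance between two binary genre vectors, computed from sets.

    For 0/1 vectors the dot product is the number of shared genres and
    each norm is the square root of the number of genres.
    """
    a = set(genres_a.split('|'))
    b = set(genres_b.split('|'))
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))

def predict_rating(target_genres, known_movies, k):
    """Average the scores of the k movies whose genres are closest."""
    ranked = sorted(known_movies, key=lambda m: genre_distance(target_genres, m[1]))
    nearest = ranked[:k]
    return sum(score for _, _, score in nearest) / k, nearest

# (title, genres, imdb_score) rows taken from the outputs above
known_movies = [
    ('Avatar', 'Action|Adventure|Fantasy|Sci-Fi', 7.9),
    ("Pirates of the Caribbean: At World's End", 'Action|Adventure|Fantasy', 7.1),
    ('Spectre', 'Action|Adventure|Thriller', 6.8),
    ('The Dark Knight Rises', 'Action|Thriller', 8.5),
    ('Up', 'Adventure|Animation|Comedy|Family', 8.3),
    ('Monsters, Inc.', 'Adventure|Animation|Comedy|Family|Fantasy', 8.1),
    ('Toy Story 3', 'Adventure|Animation|Comedy|Family|Fantasy', 8.3),
    ('A Christmas Story', 'Comedy|Family', 8.1),
]

predicted, nearest = predict_rating('Adventure|Animation|Comedy|Family', known_movies, k=3)

for title, genres, score in nearest:
    print(title, genres, score)
print('Predicted rating from genres only:', round(predicted, 2))
```

#### With such a small sample the result is only illustrative, but it shows the whole loop: define a distance, rank the known items, and combine the K closest ones.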

### - Data beats Emotions! :D