## Predicting a Movie Rating Based Only on Genre and Rating Information

### Concept: define a distance metric between the items in your dataset and find the K closest items.
### You can then predict some property of a test item by letting those K items vote on it.

### Here we will try to guess the rating of a movie by looking at the K movies that are closest to it in terms of genre and popularity (given by the number of people who rated it)
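
As a minimal sketch of the idea, assuming plain Euclidean distance and a simple average as the "vote" (the notebook itself uses a different distance), the toy data and the helper name `knn_predict` below are hypothetical and not part of the MovieLens data:

```python
import math

def knn_predict(items, query, k):
    """Average the target value of the k items whose features are closest to `query`."""
    # Rank all items by Euclidean distance between their features and the query.
    ranked = sorted(items, key=lambda item: math.dist(item[0], query))
    nearest = ranked[:k]
    # Let the k nearest items "vote" by averaging their target values.
    return sum(target for _, target in nearest) / len(nearest)

# Hypothetical (features, rating) pairs, not taken from MovieLens.
toy_items = [
    ((0.0, 0.0), 2.0),
    ((0.1, 0.0), 3.0),
    ((0.0, 0.2), 4.0),
    ((5.0, 5.0), 1.0),
]
prediction = knn_predict(toy_items, (0.0, 0.1), k=3)
print(prediction)
```

The far-away item at (5.0, 5.0) never enters the vote, which is the whole point of restricting the prediction to the K closest items.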

### We compute the distance between movies based on their genres and popularity, then use it to predict a movie rating

### Let's load our data

### The actual movie ratings are stored in the file u.data

In [1]:
import pandas as pd
import numpy as np

In [6]:
ratings = pd.read_csv("D:/DataScience/DataScience-Python3/ml-100k/u.data")
len(ratings)
ratings

      0\t50\t5\t881250949
0    0\t172\t5\t881250949
1    0\t133\t1\t881250949
2  196\t242\t3\t881250949
3  186\t302\t3\t891717742
4   22\t377\t1\t878887116
5   244\t51\t2\t880606923
6  166\t346\t1\t886397596
7  298\t474\t4\t884182806
8  115\t265\t2\t881171488
9  253\t465\t5\t891628467


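### The file is tab-separated, so here we read it with a tab separator and keep only the first 3 columns (user ID, movie ID and rating), which are all we need for the prediction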

In [13]:
# Name the three columns we keep: user ID, movie ID and rating
r_cols = ["User_ID", "Movie_ID", "Rating"]
ratings = pd.read_csv("D:/DataScience/DataScience-Python3/ml-100k/u.data", sep='\t', usecols=range(3), names=r_cols)
ratings

   User_ID  Movie_ID  Rating
0        0        50       5
1        0       172       5
2        0       133       1
3      196       242       3
4      186       302       3
5       22       377       1
6      244        51       2
7      166       346       1
8      298       474       4
9      115       265       2


In [16]:
ratings.head()

   User_ID  Movie_ID  Rating
0        0        50       5
1        0       172       5
2        0       133       1
3      196       242       3
4      186       302       3
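
To see why the separator and column names matter, here is a small sketch that reads the same kind of data from an in-memory string instead of the real file. The three sample lines reuse values visible in the output above; the variable names `sample`, `naive` and `parsed` are only for illustration.

```python
import io
import pandas as pd

# Three tab-separated lines in the same layout as u.data:
# user id, movie id, rating, timestamp.
sample = "0\t50\t5\t881250949\n0\t172\t5\t881250949\n196\t242\t3\t881250949\n"

# Default settings: comma separator, first line consumed as the header,
# so everything lands in a single oddly named column.
naive = pd.read_csv(io.StringIO(sample))

# Explicit tab separator, explicit column names, first three columns only.
r_cols = ["User_ID", "Movie_ID", "Rating"]
parsed = pd.read_csv(io.StringIO(sample), sep="\t", usecols=range(3), names=r_cols)

print(naive.shape)
print(parsed)
```

Passing `names` also tells pandas that the file has no header row, so the first rating is kept as data instead of being turned into column labels.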


### Now we will group everything by Movie_ID and
### compute, for every movie, the total number of ratings (its popularity) and the average rating

In [23]:
# "size" counts the ratings per movie, "mean" averages them
MovieProperties = ratings.groupby('Movie_ID').agg({"Rating": ["size", "mean"]})
MovieProperties.head()

         Rating
           size      mean
Movie_ID
1           452  3.878319
2           131  3.206107
3            90  3.033333
4           209  3.550239
5            86  3.302326


### As we can see above, Movie_ID 1 has 452 ratings (a measure of its popularity: it shows how many people watched and rated it)
### and an average rating of about 3.88
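
The same grouping step can be checked by hand on a tiny frame. This is only a sketch with hypothetical ratings (not taken from MovieLens); `toy` and `props` are illustrative names.

```python
import pandas as pd

# Hypothetical ratings, not taken from MovieLens.
toy = pd.DataFrame({
    "Movie_ID": [1, 1, 1, 2, 2],
    "Rating":   [5, 4, 3, 2, 4],
})

# One row per movie: how many ratings it received and their mean.
props = toy.groupby("Movie_ID")["Rating"].agg(["size", "mean"])
print(props)
```

Movie 1 has three ratings averaging 4.0 and movie 2 has two ratings averaging 3.0, which is easy to confirm by eye.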

### These raw rating counts are not directly useful for computing distances between movies
### so we will create a new data frame that holds the min-max normalized number of ratings:
### a value of 0 means the movie has the fewest ratings in the data set, and a value of 1 means it is the most popular movie there.

In [24]:
MovieNumRatings = pd.DataFrame(MovieProperties["Rating"]["size"])
MovieNormalizedRatings = MovieNumRatings.apply(lambda x: (x - np.min(x))/(np.max(x) - np.min(x)))
MovieNormalizedRatings.head()

              size
Movie_ID
1         0.773585
2         0.222985
3         0.152659
4         0.356775
5         0.145798
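
As a small worked example of min-max scaling, the sketch below normalizes only the five rating counts shown earlier. Because the minimum and maximum are taken over these five movies rather than the whole data set, the resulting values differ from the notebook output above; the names `counts` and `normalized` are illustrative.

```python
import pandas as pd

# Rating counts for movies 1-5, copied from the MovieProperties output.
counts = pd.Series([452, 131, 90, 209, 86], index=[1, 2, 3, 4, 5], name="size")

# Min-max scaling: the smallest count maps to 0 and the largest to 1.
normalized = (counts - counts.min()) / (counts.max() - counts.min())
print(normalized)
```

Within this subset, movie 1 (452 ratings) maps to 1 and movie 5 (86 ratings) maps to 0, with the others spread proportionally in between.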


### Now let's extract the genre information from the u.item file
### There are 19 genre fields in that file, and each field corresponds to a specific genre.
### A value of 0 means the movie is not in that genre, 1 means it is.
### Note: a movie may have more than one genre associated with it
### Now we will put everything together in one Python dictionary (each entry contains the movie name, the list of genre values, the normalized popularity score, and the average rating of that movie)

In [68]:
movieDict = {}
# u.item is ISO-8859-1 (Latin-1) encoded, so state the encoding explicitly
with open(r"D:/DataScience/DataScience-Python3/ml-100k/u.item", encoding="ISO-8859-1") as f:
    for line in f:
        fields = line.rstrip('\n').split('|')
        movieID = int(fields[0])
        name = fields[1]
        # fields 5 onward hold the 19 genre flags (0 or 1)
        genres = list(map(int, fields[5:25]))

        # (name, genre flags, normalized popularity, average rating)
        movieDict[movieID] = (name, genres, MovieNormalizedRatings.loc[movieID].get('size'),
                              MovieProperties.loc[movieID].Rating.get('mean'))


### Let's check the entry for movieID 7

In [69]:
movieDict[7]

('Twelve Monkeys (1995)',
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
 0.6706689536878216,
 3.798469387755102)

### So each entry holds the movie name, the genre list, the popularity score and the average rating
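
The genre list is easier to read once the flags are mapped back to names. The sketch below assumes the genre order listed in ml-100k's u.genre file; `GENRE_NAMES` and `genre_names` are hypothetical helpers, not part of the notebook.

```python
# Genre names in the order assumed from ml-100k's u.genre file.
GENRE_NAMES = [
    "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]

def genre_names(flags):
    """Return the names of the genres whose flag is 1."""
    return [name for name, flag in zip(GENRE_NAMES, flags) if flag == 1]

# Genre flags for Twelve Monkeys, copied from the movieDict[7] output.
twelve_monkeys_flags = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
decoded = genre_names(twelve_monkeys_flags)
print(decoded)
```

Under that assumed ordering, the two set flags for Twelve Monkeys correspond to Drama and Sci-Fi.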

### Now let's define a function that computes the distance between 2 movies: the cosine distance between their genre vectors plus the absolute difference between their popularity scores
### Just to make sure that it works, here we compute the distance between movie IDs 1 and 2

In [70]:
from scipy import spatial

def ComputeDistance(a,b):
    genresA = a[1]
    genresB = b[1]
    genreDistance = spatial.distance.cosine(genresA, genresB)
    popularityA = a[2]
    popularityB = b[2]
    popularityDistance = abs(popularityA - popularityB)
    return genreDistance + popularityDistance

ComputeDistance(movieDict[1], movieDict[2])


1.5506003430531732

### Note: the higher the distance, the less similar the movies are. Let's look at what movies 1 and 2 actually are and confirm that they are not very similar

In [71]:
print(movieDict[1])
print(movieDict[2])

('Toy Story (1995)', [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 0.7735849056603774, 3.8783185840707963)
('GoldenEye (1995)', [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0], 0.22298456260720412, 3.2061068702290076)


### As we can see above, movies 1 and 2 are not similar at all: their genre lists share no genre and their popularity scores are far apart
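
The distance of about 1.55 can be reproduced by hand. The sketch below recomputes it with NumPy instead of `scipy.spatial.distance.cosine`, using the two tuples printed above; the variable names are illustrative.

```python
import numpy as np

# Tuples copied from the printed output: (name, genres, popularity, mean rating).
toy_story = ("Toy Story (1995)",
             [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
             0.7735849056603774, 3.8783185840707963)
goldeneye = ("GoldenEye (1995)",
             [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
             0.22298456260720412, 3.2061068702290076)

a = np.array(toy_story[1], dtype=float)
b = np.array(goldeneye[1], dtype=float)

# Cosine distance = 1 - (a . b) / (|a| * |b|).
# The two movies share no genre, so the dot product is 0.
genre_distance = 1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
popularity_distance = abs(toy_story[2] - goldeneye[2])
total = genre_distance + popularity_distance
print(genre_distance, popularity_distance, total)
```

The genre part contributes the maximum cosine distance of 1 for non-negative vectors, and the popularity gap adds roughly 0.55, matching the value returned by `ComputeDistance`.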

### Let's write some code that takes a given movieID and finds its K nearest neighbours
### We compute the distance between a given test movie (Twelve Monkeys in this example) and all other movies in our data set, sort them by distance, and print out the K nearest neighbours

In [116]:
import operator

def getNeighbors(movieID, K):
    distances = []
    for movie in movieDict:
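        # Skip the test movie itself. The original check compared the ID against
        # movieDict and was always true, so the recorded output below still lists
        # the test movie as its own nearest neighbour.
        if movie != movieID: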
            dist = ComputeDistance(movieDict[movieID], movieDict[movie])
            distances.append((movie, dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(K):
        neighbors.append(distances[x][0])
    return neighbors

K = 35
avgRating = 0
neighbors = getNeighbors(7, K) #movieID 7 is Twelve Monkeys which is my test movie
for neighbor in neighbors:
    avgRating += movieDict[neighbor][3]
    print(movieDict[neighbor][0] + " ==> " + str(movieDict[neighbor][3]))
    
avgRating /= K
        

Twelve Monkeys (1995) ==> 3.798469387755102
Contact (1997) ==> 3.8035363457760316
E.T. the Extra-Terrestrial (1982) ==> 3.8333333333333335
Dead Man Walking (1995) ==> 3.8963210702341136
Mr. Holland's Opus (1995) ==> 3.7781569965870307
Empire Strikes Back, The (1980) ==> 4.206521739130435
Shawshank Redemption, The (1994) ==> 4.445229681978798
Pulp Fiction (1994) ==> 4.060913705583756
Silence of the Lambs, The (1991) ==> 4.28974358974359
Day the Earth Stood Still, The (1951) ==> 3.9381443298969074
One Flew Over the Cuckoo's Nest (1975) ==> 4.291666666666667
Jerry Maguire (1996) ==> 3.7109375
2001: A Space Odyssey (1968) ==> 3.969111969111969
Dead Poets Society (1989) ==> 3.9163346613545817
Trainspotting (1996) ==> 3.884
Time to Kill, A (1996) ==> 3.685344827586207
It's a Wonderful Life (1946) ==> 4.121212121212121
Back to the Future (1985) ==> 3.834285714285714
Clockwork Orange, A (1971) ==> 3.909502262443439
To Kill a Mockingbird (1962) ==> 4.292237442922374
People vs. Larry Flynt, The 

### The results look pretty good
### Now let's look at the average rating of the K = 35 nearest movies to Twelve Monkeys (our test movie)

In [113]:
avgRating

3.8977250425836067

### Let's see how it compares to the actual average rating of Twelve Monkeys

In [104]:
movieDict[7]

('Twelve Monkeys (1995)',
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
 0.6706689536878216,
 3.798469387755102)

### The results are pretty good: the average rating of the 35 nearest movies (about 3.90) is close to the actual average rating of Twelve Monkeys (about 3.80), our test movie
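
The neighbour search itself can also be written without sorting the full list. This is a sketch under stated assumptions, not the notebook's method: it uses `heapq.nsmallest` from the standard library, a hypothetical helper `nearest_ids`, and a toy dictionary of made-up popularity scores.

```python
import heapq

def nearest_ids(target_id, items, distance, k):
    """Return the ids of the k items closest to items[target_id], excluding itself."""
    candidates = (
        (distance(items[target_id], items[other]), other)
        for other in items
        if other != target_id  # never count the test item as its own neighbour
    )
    # nsmallest keeps only the k smallest (distance, id) pairs.
    return [other for _, other in heapq.nsmallest(k, candidates)]

# Hypothetical one-dimensional "movies": id -> popularity score.
toy = {1: 0.10, 2: 0.15, 3: 0.90, 4: 0.12}
closest = nearest_ids(1, toy, lambda a, b: abs(a - b), k=2)
print(closest)
```

Excluding the test item matters when the goal is to predict its rating: if it is allowed to vote, its own rating leaks into the average.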