In [1]:
import sys
print(sys.version)
import numpy as np  
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split as tts

3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)]


In [2]:
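# Column names for the first three columns of ratings.csv
r_cols = ["userid", "id", "rating"]

# Because names= is passed without header=0, the file's own header line is read as data row 0,
# so every column is loaded as strings.
ratings_data = pd.read_csv("D:/DataScience/EarlyBirds_DataScience_Test/ratings.csv",
                          names=r_cols, usecols=range(3), low_memory=False, encoding='latin-1')

ratings_data = ratings_data.drop(0)   # drop the header line that was read as data
ratings_data = ratings_data[:50000]   # keep only the first 50,000 ratings
ratings_data.head()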

,userid,id,rating
1,1,81834,5.0
2,1,112552,5.0
3,1,98809,0.5
4,1,99114,4.0
5,1,858,5.0


In [3]:
len(ratings_data) 


50000

In [4]:
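m_cols = ["id", "original_title"]
# encoding='latin-1' never raises a decode error, but it garbles non-ASCII titles
# if the file is actually UTF-8; encoding='utf-8' may be the better choice.
movies_info = pd.read_csv("D:/DataScience/EarlyBirds_DataScience_Test/movies_metadata.csv",
                          low_memory=False, encoding='latin-1')
movies = movies_info.loc[:, m_cols]   # keep only the id and title columns
movies = movies[:50000]               # no effect here: the file has fewer than 50,000 rows
movies.head()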


,id,original_title
0,862,Toy Story
1,8844,Jumanji
2,15602,Grumpier Old Men
3,31357,Waiting to Exhale
4,11862,Father of the Bride Part II


In [5]:
len(movies)

45466

In [6]:

# Inner join on the shared "id" column; this assumes both files use the same id scheme
movie_data = pd.merge(ratings_data, movies, on="id")
movie_data.head()


,userid,id,rating,original_title
0,1,858,5.0,Sleepless in Seattle
1,3,858,4.0,Sleepless in Seattle
2,5,858,5.0,Sleepless in Seattle
3,20,858,4.5,Sleepless in Seattle
4,24,858,5.0,Sleepless in Seattle


In [7]:
movie_data["rating"] = movie_data["rating"].astype(float)  # ratings were read as strings; convert them to float

In [8]:
movie_data.head()

,userid,id,rating,original_title
0,1,858,5.0,Sleepless in Seattle
1,3,858,4.0,Sleepless in Seattle
2,5,858,5.0,Sleepless in Seattle
3,20,858,4.5,Sleepless in Seattle
4,24,858,5.0,Sleepless in Seattle


In [9]:
movie_data.groupby('original_title').rating.mean().head(10)

original_title
...Più forte ragazzi!           3.226027
10 Items or Less                4.166667
10 Things I Hate About You      2.833333
10,000 BC                       4.000000
12 + 1                          3.000000
15 Minutes                      3.500000
16 Blocks                       3.000000
1984                            1.625000
2 Days in Paris                 3.100000
20,000 Leagues Under the Sea    2.707547
Name: rating, dtype: float64

In [10]:
movie_data.groupby('original_title')['rating'].mean().sort_values(ascending=False).head() 

original_title
Badlands                         5.0
Gosford Park                     5.0
El asaltante                     5.0
Santa and the Ice Cream Bunny    5.0
Fados                            5.0
Name: rating, dtype: float64

In [11]:
# List the ten movies with the largest number of ratings:
movie_data.groupby('original_title')['rating'].count().sort_values(ascending=False).head(10)

original_title
The Million Dollar Hotel              191
Terminator 3: Rise of the Machines    189
Ð¡Ð¾Ð»ÑÑÐ¸Ñ                        177
The 39 Steps                          157
Monsoon Wedding                       143
Once Were Warriors                    140
5 Card Stud                           139
License to Wed                        127
Sleepless in Seattle                  124
Sissi                                 119
Name: rating, dtype: int64

In [12]:
# Create the ratings_mean_count dataframe, starting with the average rating of each movie:
ratings_mean_count = pd.DataFrame(movie_data.groupby('original_title')['rating'].mean())  

In [13]:
# Next, add the number of ratings each movie received to ratings_mean_count.
ratings_mean_count['rating_counts'] = movie_data.groupby('original_title')['rating'].count()

In [14]:
#Now let's take a look at our newly created dataframe.
ratings_mean_count.head() 

original_title,rating,rating_counts
...Più forte ragazzi!,3.226027,73
10 Items or Less,4.166667,3
10 Things I Hate About You,2.833333,3
"10,000 BC",4.0,1
12 + 1,3.0,4


### Above, we can see each movie title along with its average rating and the number of ratings it received.
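
As a possible alternative to building the two columns in separate steps, the mean and the count can be computed in a single groupby pass. The sketch below uses a small made-up frame in place of movie_data (the name toy and its values are assumptions for illustration) and relies on named aggregation, which needs pandas 0.25 or later.

```python
import pandas as pd

# Toy stand-in for movie_data; titles and ratings are made up for illustration.
toy = pd.DataFrame({
    "original_title": ["A", "A", "A", "B", "B", "C"],
    "rating": [5.0, 4.0, 3.0, 2.0, 4.0, 5.0],
})

# Named aggregation: one pass computes both the mean and the count per title.
toy_mean_count = toy.groupby("original_title")["rating"].agg(
    rating="mean", rating_counts="count"
)
print(toy_mean_count)
```

The resulting frame has the same two columns as ratings_mean_count, indexed by title.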

## Finding Similarities Between Movies

### We will use the correlation between the ratings of two movies as the similarity metric.
### To compute it, we need a matrix in which each column is a movie title, each row is a user, and each cell holds the rating that user gave to that movie.
### Note: the matrix will contain many NaN (null) values, since not every user has rated every movie.

### Let's create the matrix of movie titles and corresponding user ratings.

In [15]:
user_movie_rating = movie_data.pivot_table(index='userid', 
                                           columns='original_title', 
                                           values='rating')  
user_movie_rating.head(10)  

original_title,...PiÃ¹ forte ragazzi!,10 Items or Less,10 Things I Hate About You,"10,000 BC",12 + 1,15 Minutes,16 Blocks,1984,2 Days in Paris,"20,000 Leagues Under the Sea",...,ë¹ì§,ì¬ë§ë¦¬ì,ì¼ì,ìììë ê²ë¤,ì¬ëë³´ì´,"ì¥í, íë ¨",ìµì¢ë³ê¸° í,í´ìì,í¬ë¡ì° ê³ ì¤í¸,í
userid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
10,,,,,,,,,,,...,,,,,,,,,,
100,,,,,,,,,,,...,,,,,,,,,,
101,,,,,,,,,,,...,,,,,,,,,,
102,,,,,,,,,,,...,,,,,,,,,,
103,,,,,,,,,,,...,,,,,,,,,,
104,,,,,,,,,,,...,,,,,,,,,,
105,,,,,,,,,,,...,,,,,,,,,,
106,,,,,,,,,,,...,,,,,,,,2.5,,
107,,,,,,,,,,,...,,,,,,,,,,


### As seen above, each column contains all the user ratings for one particular movie.

### Let's take all the user ratings for the movie "The 39 Steps" and find the movies similar to it.
### We chose this movie because it has one of the highest numbers of ratings, and
### we want to find correlations between movies that have many ratings.

In [16]:
the_39_steps_ratings = user_movie_rating["The 39 Steps"]  
the_39_steps_ratings.head(15)

userid
1      NaN
10     NaN
100    NaN
101    NaN
102    NaN
103    NaN
104    NaN
105    NaN
106    NaN
107    NaN
108    4.0
109    NaN
11     NaN
110    5.0
111    NaN
Name: The 39 Steps, dtype: float64

### Now let's retrieve all the movies that are similar to "The 39 Steps".

### We can find the correlation between the user ratings for "The 39 Steps" and those of every other movie using the corrwith() function.
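
To make the behaviour of corrwith() with missing ratings more concrete, here is a minimal sketch on a made-up user-by-movie matrix (the frame toy, its column names, and its ratings are assumptions for illustration). Each column is correlated with the target column using only the users who rated both movies, and a column with no raters in common gets NaN.

```python
import numpy as np
import pandas as pd

# Toy user-by-movie matrix; NaN means the user did not rate the movie.
toy = pd.DataFrame(
    {
        "Target": [5.0, 4.0, 1.0, 2.0, np.nan],
        "Similar": [4.0, 5.0, 2.0, 1.0, 3.0],
        "Opposite": [1.0, 2.0, 5.0, 4.0, np.nan],
        "NoOverlap": [np.nan, np.nan, np.nan, np.nan, 4.0],
    },
    index=pd.Index(["u1", "u2", "u3", "u4", "u5"], name="userid"),
)

# Pearson correlation of every column with the "Target" column,
# computed only over users who rated both movies.
corr = toy.corrwith(toy["Target"])
print(corr)
```

Calling dropna() on the result, as the notebook does, removes movies such as "NoOverlap" for which no correlation can be computed.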

In [17]:
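# Pearson correlation of every movie's ratings column with the ratings for "The 39 Steps"
movies_like_the_39_steps = user_movie_rating.corrwith(the_39_steps_ratings)

corr_the_39_steps = pd.DataFrame(movies_like_the_39_steps, columns=['Correlation'])
corr_the_39_steps.dropna(inplace=True)   # drop movies for which no correlation could be computed
corr_the_39_steps.head()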



original_title,Correlation
...Più forte ragazzi!,-0.049827
10 Items or Less,1.0
10 Things I Hate About You,-1.0
2 Days in Paris,0.5
"20,000 Leagues Under the Sea",0.222056


### Let's sort the movies in descending order of correlation to see highly correlated movies at the top.

In [18]:
corr_the_39_steps.sort_values('Correlation', ascending=False).head(10)

original_title,Correlation
L'Ours,1.0
The Hi-Lo Country,1.0
The Glass House,1.0
Gyakufunsha kazoku,1.0
รถไฟฟ้า มาหานะเธอ,1.0
The Ghost of Frankenstein,1.0
The Fearless Vampire Killers,1.0
My Own Private Idaho,1.0
Garde à vue,1.0
Tropa de Elite,1.0


### From the output we can see that the movies most highly correlated with "The 39 Steps" are not very well known.
### This shows that correlation alone is not a good similarity metric: when only a couple of users have rated both "The 39 Steps" and another movie, the correlation is computed from those few ratings and can easily come out as a perfect 1.0.

### One way to solve this problem is to keep only those correlated movies that have a minimum number of ratings (the filter further down uses more than 90).
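
The effect can be reproduced on a made-up matrix (the frame toy, its column names, and the threshold of three common raters are assumptions for illustration). With only two users in common, any defined Pearson correlation is exactly +1 or -1. The sketch also shows one possible guard that differs from the notebook's approach: Series.corr() accepts a min_periods argument and returns NaN when fewer users than that rated both movies.

```python
import numpy as np
import pandas as pd

# Toy matrix: "Niche" shares only two raters with "Target".
toy = pd.DataFrame(
    {
        "Target": [5.0, 3.0, 4.0, 2.0, 1.0],
        "Popular": [4.0, 3.0, 5.0, 1.0, 2.0],
        "Niche": [4.5, 4.0, np.nan, np.nan, np.nan],
    },
    index=pd.Index(["u1", "u2", "u3", "u4", "u5"], name="userid"),
)
target = toy["Target"]

# Number of users who rated both the target and each movie.
overlap = toy[target.notna()].count()

# Plain correlation: two common raters force the result to +1 or -1.
plain = toy.corrwith(target)

# Same correlation, but left undefined (NaN) below three common raters.
guarded = toy.apply(lambda col: col.corr(target, min_periods=3))

print(pd.DataFrame({"overlap": overlap, "plain": plain, "guarded": guarded}))
```

The notebook filters on each movie's total number of ratings instead, which is a coarser but simpler proxy for the number of users in common.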

In [19]:
corr_the_39_steps = corr_the_39_steps.join(ratings_mean_count['rating_counts'])  
corr_the_39_steps.head()

original_title,Correlation,rating_counts
...Più forte ragazzi!,-0.049827,73
10 Items or Less,1.0,3
10 Things I Hate About You,-1.0,3
2 Days in Paris,0.5,5
"20,000 Leagues Under the Sea",0.222056,53


### From the table above, we can see that the movie "10 Items or Less", which has a perfect correlation of 1.0, has only 3 ratings.
### This means that at most 3 users rated both "The 39 Steps" and "10 Items or Less".
### A movie cannot be declared similar to another movie on the basis of just 3 ratings, which is why we added the "rating_counts" column. Let's now filter the movies correlated with "The 39 Steps", keeping only those that have more than 90 ratings.

In [20]:
corr_the_39_steps[corr_the_39_steps['rating_counts'] > 90].sort_values('Correlation', ascending=False).head(15)

original_title,Correlation,rating_counts
The 39 Steps,1.0,157
48 Hrs.,0.499982,108
Titanic,0.464545,99
The Hours,0.440348,93
Terminator 3: Rise of the Machines,0.437133,189
The Conversation,0.384461,109
La passion de Jeanne d'Arc,0.35605,119
Dawn of the Dead,0.337794,111
5 Card Stud,0.320698,139
Monsoon Wedding,0.311808,143


### The output now shows the frequently rated movies that are most highly correlated with "The 39 Steps".
### "The 39 Steps" itself tops the list with a correlation of 1.0, because every movie is perfectly correlated with itself, so that row can be ignored.
### The remaining movies are mostly well-known titles with a moderate positive correlation, so they are reasonable candidates to recommend to people who liked "The 39 Steps".
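
The whole procedure (pivot, correlate, attach rating counts, filter, sort) could be wrapped in one function so it can be reused for other titles. The sketch below is one way to do that; the helper name similar_movies, its parameters, and the toy ratings are hypothetical and not part of the original notebook. Unlike the cell above, it also drops the queried movie's own row.

```python
import pandas as pd


def similar_movies(ratings, title, min_ratings):
    """Rank movies by rating correlation with `title` (hypothetical helper).

    `ratings` needs the columns userid, original_title and rating.
    Only movies with more than `min_ratings` ratings are returned.
    """
    matrix = ratings.pivot_table(index="userid", columns="original_title", values="rating")
    corr = matrix.corrwith(matrix[title]).rename("Correlation").to_frame()
    corr["rating_counts"] = ratings.groupby("original_title")["rating"].count()
    corr = corr.dropna(subset=["Correlation"])
    # Keep well-rated movies and drop the queried movie itself.
    keep = (corr["rating_counts"] > min_ratings) & (corr.index != title)
    return corr[keep].sort_values("Correlation", ascending=False)


# Made-up ratings: movie D has only two ratings and is filtered out.
toy = pd.DataFrame({
    "userid": ["u1", "u2", "u3", "u4"] * 3 + ["u1", "u2"],
    "original_title": ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 2,
    "rating": [5.0, 4.0, 1.0, 2.0,
               4.0, 5.0, 2.0, 1.0,
               1.0, 2.0, 5.0, 4.0,
               5.0, 3.0],
})

result = similar_movies(toy, "A", min_ratings=2)
print(result)
```

With the notebook's data, a call such as similar_movies(movie_data, "The 39 Steps", 90) would be expected to mirror the filtered table above, minus the self-correlation row.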