The dataset we are going to use is the MovieLens dataset, which contains 100k ratings of approximately 9000 movies by 700 users.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [2]:
ratings = pd.read_csv('ratings.csv')
movies = pd.read_csv('movies.csv',encoding="Latin1")
df_r = ratings.copy()
df_m = movies.copy()

In [3]:
ratings.head()

,userId,movieId,rating,timestamp
0,12882,1,4.0,1147195252
1,12882,32,3.5,1147195307
2,12882,47,5.0,1147195343
3,12882,50,5.0,1147185499
4,12882,110,4.5,1147195239


In [4]:
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
df_combined = pd.merge(ratings, movies, on = 'movieId')

In [6]:
df_combined.shape

(264505, 6)

# Collaborative Filtering (CF)

1) **Memory-Based CF** - This approach computes the similarity between users or between items and uses it to recommend similar items. Examples include item-based and user-based top-N recommendations.

2) **Model-Based CF** - This approach uses data mining and machine learning algorithms to predict users' ratings of unrated items. Examples include Singular Value Decomposition (SVD) and Principal Component Analysis (PCA).

## Create User-Item Matrix

In [7]:
user_item_matrix = df_combined.pivot_table(index = 'userId', columns = 'title', values = 'rating')
user_item_matrix.head()

title,"'burbs, The (1989)",(500) Days of Summer (2009),*batteries not included (1987),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),102 Dalmatians (2000),12 Angry Men (1957),127 Hours (2010),...,Young Guns II (1990),Young Sherlock Holmes (1985),Zack and Miri Make a Porno (2008),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
userId,,,,,,,,,,,,,,,,,,,,,
316,,,,,,,,,,,...,,,,,,,,,,
320,,,,,,,,,,,...,,,,,,,,,,
359,,,,,2.0,4.0,4.0,,,,...,,,,,,,,,,
370,,,,,,,,,,5.0,...,,,3.5,,4.0,5.0,,5.0,,
910,,1.5,,,,,,,,,...,,,,,,4.0,,,,3.5
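
To make the reshaping concrete, here is a minimal sketch on a made-up ratings frame. The `toy` data and the `sparsity` variable are illustrative assumptions and are not part of the MovieLens files; the point is only to show how `pivot_table` leaves NaN wherever a user has not rated a title.

```python
import pandas as pd

# Toy long-format ratings (hypothetical values, not taken from the dataset).
toy = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'title':  ['A', 'B', 'A', 'B'],
    'rating': [4.0, 3.0, 5.0, 2.0],
})

# One row per user, one column per title; unrated pairs become NaN.
toy_matrix = toy.pivot_table(index='userId', columns='title', values='rating')
print(toy_matrix)

# Fraction of empty cells in the toy user-item matrix.
sparsity = toy_matrix.isna().to_numpy().mean()
print(sparsity)
```

On the real data the same fraction would show how sparse the user-item matrix is, which is why the later cells have to decide how to fill the gaps.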


## Memory Based Collaborative Filtering

There are many measures for computing the similarity matrix. Some of them are:

1) **Jaccard Similarity** - It is a statistic used for comparing the similarity and diversity of sample sets. It is defined as the size of the intersection divided by the size of the union of the sample sets.

2) **Cosine Similarity** - It measures the cosine of the angle between two rating vectors. An angle of 0° means the vectors have the same orientation (similarity 1), while an angle of 180° means they point in opposite directions (similarity -1), i.e. they are highly dissimilar.

3) **Pearson Similarity** - It is equivalent to centered cosine similarity. We subtract each vector's mean rating from its ratings, so that the mean is centered at 0, and then calculate the cosine similarity.
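
The three measures can be sketched in a few lines of NumPy. The function names `jaccard`, `cosine` and `pearson` and the vectors `u` and `v` are assumptions made for this illustration; the notebook itself relies on `DataFrame.corr()` for the Pearson case.

```python
import numpy as np

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Cosine of the angle between two rating vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson(u, v):
    """Centered cosine: subtract each vector's mean, then take the cosine."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return cosine(u - u.mean(), v - v.mean())

# Hypothetical ratings of two users on the same four movies.
u = [5.0, 3.0, 4.0, 4.0]
v = [3.0, 1.0, 2.0, 3.0]

print(jaccard({1, 2, 3}, {2, 3, 4}))   # sets of rated movie ids
print(cosine(u, v))
print(pearson(u, v))
```

The Pearson value computed this way should agree with `np.corrcoef(u, v)[0, 1]`, which is a quick way to check the "centered cosine" description.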

### User based Collaborative Filtering

In [8]:
user_matrix = user_item_matrix.copy()

# Fill each row's NaNs with that user's mean rating, so that we can compute the Pearson correlation.

user_matrix = user_matrix.apply(lambda row: row.fillna(row.mean()), axis=1)
user_matrix.head()

title,"'burbs, The (1989)",(500) Days of Summer (2009),*batteries not included (1987),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),102 Dalmatians (2000),12 Angry Men (1957),127 Hours (2010),...,Young Guns II (1990),Young Sherlock Holmes (1985),Zack and Miri Make a Porno (2008),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
userId,,,,,,,,,,,,,,,,,,,,,
316,3.329457,3.329457,3.329457,3.329457,3.329457,3.329457,3.329457,3.329457,3.329457,3.329457,...,3.329457,3.329457,3.329457,3.329457,3.329457,3.329457,3.329457,3.329457,3.329457,3.329457
320,3.701613,3.701613,3.701613,3.701613,3.701613,3.701613,3.701613,3.701613,3.701613,3.701613,...,3.701613,3.701613,3.701613,3.701613,3.701613,3.701613,3.701613,3.701613,3.701613,3.701613
359,3.685474,3.685474,3.685474,3.685474,2.0,4.0,4.0,3.685474,3.685474,3.685474,...,3.685474,3.685474,3.685474,3.685474,3.685474,3.685474,3.685474,3.685474,3.685474,3.685474
370,3.794404,3.794404,3.794404,3.794404,3.794404,3.794404,3.794404,3.794404,3.794404,5.0,...,3.794404,3.794404,3.5,3.794404,4.0,5.0,3.794404,5.0,3.794404,3.794404
910,3.89808,1.5,3.89808,3.89808,3.89808,3.89808,3.89808,3.89808,3.89808,3.89808,...,3.89808,3.89808,3.89808,3.89808,3.89808,4.0,3.89808,3.89808,3.89808,3.5
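
The row-wise fill can be checked on a tiny frame. The sketch below uses made-up values; the `alt` variant is an assumed vectorized alternative (transpose, fill by column label, transpose back) shown only for comparison, not something the notebook uses.

```python
import numpy as np
import pandas as pd

# Two hypothetical users, two movies; user 2 has not rated movie 'A'.
toy = pd.DataFrame({'A': [4.0, np.nan], 'B': [2.0, 5.0]}, index=[1, 2])

# Same pattern as the cell above: replace each NaN with that row's mean.
filled = toy.apply(lambda row: row.fillna(row.mean()), axis=1)

# Assumed vectorized alternative: fillna with a Series matches on column
# labels, so we transpose to make users the columns, fill, and transpose back.
alt = toy.T.fillna(toy.mean(axis=1)).T

print(filled)
print(np.allclose(filled.to_numpy(), alt.to_numpy()))
```

Note that after this fill a user's unrated movies all carry the same value, so they contribute nothing to that user's deviation from the mean.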


In [9]:
# Pearson correlation between every pair of users (corr() works column-wise, hence the transpose)
corr_mat = user_matrix.T.corr()
user_316_corr = corr_mat.iloc[0]

In [21]:
corr_mat

userId,316,320,359,370,910,975,1015,1387,1447,1588,...,137118,137209,137227,137446,137559,137609,137805,138072,138176,138200
316,1.000000,0.060063,0.072075,0.043266,0.039305,0.045616,0.035341,0.038068,-1.248514e-02,0.050183,...,0.052632,1.048638e-01,1.135832e-02,0.029674,0.092552,0.017876,0.051371,0.077377,0.026924,-0.022727
320,0.060063,1.000000,0.063054,0.027315,0.006811,0.075620,0.011910,0.042509,-2.873860e-24,0.067389,...,0.115325,6.512991e-02,7.199638e-02,0.097554,0.064769,-0.006251,0.077256,0.098845,0.038752,0.056639
359,0.072075,0.063054,1.000000,0.135836,0.076131,0.036757,0.046418,0.066544,4.287659e-02,0.109726,...,0.120191,2.067214e-02,3.216562e-02,0.039599,0.108502,0.026371,0.075492,0.102698,0.099307,0.003147
370,0.043266,0.027315,0.135836,1.000000,0.108404,0.071655,0.070893,-0.003139,5.223516e-02,0.090241,...,0.091218,4.959445e-02,4.344263e-03,0.040692,0.110434,0.019767,-0.001364,0.052187,0.050997,0.009950
910,0.039305,0.006811,0.076131,0.108404,1.000000,0.021814,0.027339,-0.032211,-6.301121e-03,-0.007491,...,0.039464,-1.762007e-02,2.005766e-02,-0.004581,0.040866,-0.001438,-0.026082,0.073272,-0.012058,0.007610
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137609,0.017876,-0.006251,0.026371,0.019767,-0.001438,0.021046,0.017830,0.007626,4.808024e-02,0.022029,...,0.017844,2.388397e-02,-6.717645e-26,0.016852,0.133128,1.000000,0.019354,0.037478,0.056937,0.026599
137805,0.051371,0.077256,0.075492,-0.001364,-0.026082,0.030725,0.032425,0.039565,7.483176e-03,0.103785,...,0.064964,6.872135e-02,5.938730e-02,0.027496,0.038884,0.019354,1.000000,0.077038,0.016916,0.066554
138072,0.077377,0.098845,0.102698,0.052187,0.073272,0.124565,0.023265,0.073955,7.120523e-02,0.044739,...,0.122660,3.546430e-02,2.772109e-02,0.136839,0.080100,0.037478,0.077038,1.000000,0.094920,0.064754
138176,0.026924,0.038752,0.099307,0.050997,-0.012058,0.049599,0.075441,0.029366,1.996078e-01,0.042615,...,0.022104,2.102216e-24,1.804627e-03,0.127924,0.035111,0.056937,0.016916,0.094920,1.000000,0.000900


In [10]:
# Rank all users by their correlation with user 316 (the first user in the matrix)
user_316_corr = user_316_corr.sort_values(ascending=False)

In [11]:
user_316_corr.head()

userId
316       1.000000
113673    0.216770
117918    0.202073
9050      0.180958
12882     0.178995
Name: 316, dtype: float64

In [98]:
# Skip the first value: it is user 316's correlation with itself
top50_corr_users = user_316_corr[1:51]

Below is a list of all movies that user 316 has ever rated.

In [86]:
df_combined[ df_combined['userId'] == 316]

,userId,movieId,rating,timestamp,title,genres
15,316,1,2.5,1150538725,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
506,316,32,3.0,1150546651,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller
2314,316,150,2.5,1150538707,Apollo 13 (1995),Adventure|Drama|IMAX
2824,316,165,3.5,1150538753,Die Hard: With a Vengeance (1995),Action|Crime|Thriller
3106,316,260,4.0,1150538711,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi
3639,316,296,4.0,1150538691,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
4818,316,356,4.5,1150538695,Forrest Gump (1994),Comedy|Drama|Romance|War
5835,316,380,3.0,1150538720,True Lies (1994),Action|Adventure|Comedy|Romance|Thriller
6518,316,480,2.5,1150538700,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller
7124,316,527,4.0,1150538737,Schindler's List (1993),Drama|War


In [87]:
# User 316 has not rated the movie with movieId 2
df_combined[ (df_combined['userId'] == 316) & (df_combined['movieId'] == 2) ] 

,userId,movieId,rating,timestamp,title,genres


In [88]:
print('2nd Movie : ', movies['title'][ movies['movieId'] == 2 ].values)

2nd Movie :  ['Jumanji (1995)']


So, let's estimate the rating user 316 would give to this movie with the help of the similarity vector. We can then compare the predicted rating with a threshold rating: if it is higher, the movie will appear in the active user's recommendation list.

In [89]:
df_n_ratings = pd.DataFrame(df_combined.groupby('title')['rating'].mean())
df_n_ratings['total ratings'] = pd.DataFrame(df_combined.groupby('title')['rating'].count())
df_n_ratings.rename(columns = {'rating': 'mean ratings'}, inplace=True)

df_n_ratings.sort_values('total ratings', ascending=False).head()

title,mean ratings,total ratings
"Matrix, The (1999)",4.195359,668
"Lord of the Rings: The Fellowship of the Ring, The (2001)",4.091561,628
Forrest Gump (1994),3.91868,621
Pulp Fiction (1994),4.217781,613
"Lord of the Rings: The Two Towers, The (2002)",4.035176,597


In [90]:
# the average rating of this movie

df_n_ratings.loc[['Jumanji (1995)']]

title,mean ratings,total ratings
Jumanji (1995),3.069892,279


In [99]:
top50_users = top50_corr_users.keys()

# Collect the similar users who have actually rated movieId 2 (Jumanji)
count = 0
users = list()
for user in top50_users:
    rated = df_combined[ (df_combined['userId'] == user) & (df_combined['movieId'] == 2) ]
    if not rated.empty:
        count +=1
        users.append(user)

print(count)

23


There are 23 similar users among the Top-50 similar users that have rated the movie "Jumanji (1995)".
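
The same selection can also be written without an explicit loop. The sketch below runs on a made-up frame; `toy`, `neighbours` and `raters` are hypothetical names, and `isin` is used as an assumed alternative to the per-user filter in the cell above.

```python
import pandas as pd

# Hypothetical ratings: users 1, 2 and 4 rated movieId 2, user 3 did not.
toy = pd.DataFrame({
    'userId':  [1, 2, 3, 4],
    'movieId': [2, 2, 5, 2],
    'rating':  [3.0, 4.5, 2.0, 1.0],
})

# Hypothetical list of most similar users (user 2 is not among them).
neighbours = [1, 3, 4]

# Keep rows for movieId 2 that were rated by one of the neighbours.
mask = (toy['movieId'] == 2) & toy['userId'].isin(neighbours)
raters = toy.loc[mask, 'userId'].tolist()

print(raters)
print(len(raters))
```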

* Now calculate the rating user 316 would give to the movie:

* **Predicted rating** = sum of [ (weights) * (ratings) ]  **/** sum of  (weights)

Here, *weights* are the correlations of the corresponding similar users with user 316, and *ratings* are the ratings those users gave to the movie.
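
As a worked example of the weighted average, here is a small sketch with made-up numbers. The helper name `weighted_rating` and the series `sims` and `rates` are assumptions for illustration; it also assumes the weights are positive, as they are for the most similar users here.

```python
import pandas as pd

def weighted_rating(similarities, ratings):
    """Similarity-weighted mean of the neighbours' ratings.

    Only neighbours present in both series (i.e. who rated the movie)
    contribute to the numerator and the denominator.
    """
    common = similarities.index.intersection(ratings.index)
    w = similarities.loc[common]
    r = ratings.loc[common]
    return float((w * r).sum() / w.sum())

# Hypothetical correlations of three neighbours with the active user.
sims = pd.Series({10: 0.2, 11: 0.1, 12: 0.3})
# Hypothetical ratings of the movie; neighbour 12 has not rated it.
rates = pd.Series({10: 4.0, 11: 2.0})

# (0.2 * 4.0 + 0.1 * 2.0) / (0.2 + 0.1)
print(weighted_rating(sims, rates))
```

Neighbours with a higher correlation pull the prediction towards their own rating, which is the intent of the formula.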



In [100]:
# Use the weighted average of the k similar users who rated the movie

def predict_rating():
    sum_similarity = 0
    weighted_ratings = 0
    for user in users:
        # .ix was removed in pandas 1.0; .loc performs the label-based lookup
        similarity = top50_corr_users.loc[user]
        rating = df_combined[ (df_combined['userId'] == user) & 
                              (df_combined['movieId'] == 2) ]['rating'].sum()
        weighted_ratings += similarity * rating
        sum_similarity += similarity

    print(weighted_ratings / sum_similarity)
    
    
predict_rating()

2.607220246818444


### Item Based Collaborative Filtering


* Instead of finding users who resemble each other, we look for movies that resemble each other.

In [44]:
# Find movies similar to Jurassic Park
df_n_ratings.loc[['Jurassic Park (1993)']]

title,mean ratings,total ratings
Jurassic Park (1993),3.555344,524


In [45]:
item_matrix = user_item_matrix.copy()
item_matrix.head()

title,"'burbs, The (1989)",(500) Days of Summer (2009),*batteries not included (1987),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),102 Dalmatians (2000),12 Angry Men (1957),127 Hours (2010),...,Young Guns II (1990),Young Sherlock Holmes (1985),Zack and Miri Make a Porno (2008),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
userId,,,,,,,,,,,,,,,,,,,,,
316,,,,,,,,,,,...,,,,,,,,,,
320,,,,,,,,,,,...,,,,,,,,,,
359,,,,,2.0,4.0,4.0,,,,...,,,,,,,,,,
370,,,,,,,,,,5.0,...,,,3.5,,4.0,5.0,,5.0,,
910,,1.5,,,,,,,,,...,,,,,,4.0,,,,3.5


In [46]:
# Fill each column's NaNs with that movie's mean rating, so that we can compute the Pearson correlation.
# In other words, a user who has not rated a movie is assumed to give it the movie's average rating.

item_matrix = item_matrix.apply(lambda col : col.fillna(col.mean()), axis=0)
item_matrix.head(5)

title,"'burbs, The (1989)",(500) Days of Summer (2009),*batteries not included (1987),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),102 Dalmatians (2000),12 Angry Men (1957),127 Hours (2010),...,Young Guns II (1990),Young Sherlock Holmes (1985),Zack and Miri Make a Porno (2008),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
userId,,,,,,,,,,,,,,,,,,,,,
316,3.241667,3.798742,3.552083,3.399441,2.49,2.606796,3.409091,2.064516,4.246032,3.743119,...,3.071429,3.390625,3.201299,3.235294,3.620915,3.759259,3.265258,3.336735,2.645038,3.218085
320,3.241667,3.798742,3.552083,3.399441,2.49,2.606796,3.409091,2.064516,4.246032,3.743119,...,3.071429,3.390625,3.201299,3.235294,3.620915,3.759259,3.265258,3.336735,2.645038,3.218085
359,3.241667,3.798742,3.552083,3.399441,2.0,4.0,4.0,2.064516,4.246032,3.743119,...,3.071429,3.390625,3.201299,3.235294,3.620915,3.759259,3.265258,3.336735,2.645038,3.218085
370,3.241667,3.798742,3.552083,3.399441,2.49,2.606796,3.409091,2.064516,4.246032,5.0,...,3.071429,3.390625,3.5,3.235294,4.0,5.0,3.265258,5.0,2.645038,3.218085
910,3.241667,1.5,3.552083,3.399441,2.49,2.606796,3.409091,2.064516,4.246032,3.743119,...,3.071429,3.390625,3.201299,3.235294,3.620915,4.0,3.265258,3.336735,2.645038,3.5


In [51]:
item_matrix.isna().sum()

title
'burbs, The (1989)                                        0
(500) Days of Summer (2009)                               0
*batteries not included (1987)                            0
10 Things I Hate About You (1999)                         0
10,000 BC (2008)                                          0
101 Dalmatians (1996)                                     0
101 Dalmatians (One Hundred and One Dalmatians) (1961)    0
102 Dalmatians (2000)                                     0
12 Angry Men (1957)                                       0
127 Hours (2010)                                          0
13 Going on 30 (2004)                                     0
13th Warrior, The (1999)                                  0
1408 (2007)                                               0
15 Minutes (2001)                                         0
16 Blocks (2006)                                          0
1984 (Nineteen Eighty-Four) (1984)                        0
2 Days in the Valley (1996)       

No NaN values remain after the fill, which signifies that every movie is rated by at least 1 user (otherwise its column mean would itself be NaN).

In [50]:
item_matrix.corr()

title,"'burbs, The (1989)",(500) Days of Summer (2009),*batteries not included (1987),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),102 Dalmatians (2000),12 Angry Men (1957),127 Hours (2010),...,Young Guns II (1990),Young Sherlock Holmes (1985),Zack and Miri Make a Porno (2008),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
"'burbs, The (1989)",1.000000,-0.002131,0.029159,0.070699,-0.034320,0.017691,-0.081674,-0.015144,0.019162,0.055528,...,0.225548,0.083517,0.038010,-0.031515,-0.000689,-0.006790,-0.005013,0.032961,0.001660,0.157162
(500) Days of Summer (2009),-0.002131,1.000000,-0.020139,0.038540,-0.062317,-0.027464,-0.021535,-0.008178,0.009544,0.107425,...,0.027124,0.016189,0.075278,-0.013114,0.005344,0.218782,0.063563,0.033974,-0.045866,-0.002570
*batteries not included (1987),0.029159,-0.020139,1.000000,0.070977,-0.080494,0.089110,0.087145,0.040255,0.039247,-0.015162,...,-0.038468,-0.016874,-0.023074,0.043757,0.006963,-0.017833,-0.019881,-0.018150,0.029581,0.010719
10 Things I Hate About You (1999),0.070699,0.038540,0.070977,1.000000,0.028655,0.285990,0.178250,0.326591,0.039670,-0.013990,...,0.059804,0.089462,0.097053,0.160172,0.107668,0.073599,0.044219,0.007151,0.182743,0.104335
"10,000 BC (2008)",-0.034320,-0.062317,-0.080494,0.028655,1.000000,0.070560,0.077944,-0.030526,0.022602,-0.002199,...,0.021505,0.029439,0.041334,0.056101,0.014226,0.035407,-0.045118,0.003611,0.130858,0.060192
101 Dalmatians (1996),0.017691,-0.027464,0.089110,0.285990,0.070560,1.000000,0.293391,0.376731,0.001802,-0.007482,...,0.056133,0.041930,0.025253,0.054210,-0.006358,0.015439,0.026865,0.056195,0.059852,0.021647
101 Dalmatians (One Hundred and One Dalmatians) (1961),-0.081674,-0.021535,0.087145,0.178250,0.077944,0.293391,1.000000,0.244193,0.083038,0.058711,...,-0.027025,0.047355,0.056733,0.067539,0.004090,0.039485,-0.006764,0.115944,0.119060,0.055753
102 Dalmatians (2000),-0.015144,-0.008178,0.040255,0.326591,-0.030526,0.376731,0.244193,1.000000,0.023061,-0.001349,...,0.107248,0.028667,0.017663,0.044130,-0.003916,0.040150,-0.012099,0.049578,0.104964,0.018395
12 Angry Men (1957),0.019162,0.009544,0.039247,0.039670,0.022602,0.001802,0.083038,0.023061,1.000000,0.032557,...,0.093274,0.073642,0.064513,0.038543,0.058670,0.050556,0.014668,0.043194,0.091707,0.073991
127 Hours (2010),0.055528,0.107425,-0.015162,-0.013990,-0.002199,-0.007482,0.058711,-0.001349,0.032557,1.000000,...,-0.039043,-0.009292,-0.013980,-0.002516,-0.021216,0.121463,0.052340,0.049514,-0.002676,0.011752


* There are a lot of NaN values in the correlation matrix. When we calculate the Pearson correlation, if a rating vector has all values equal, e.g. [3.0, 3.0, 3.0, 3.0, ...], then its **standard deviation** is zero. Division by zero is undefined, so its correlation with any other rating vector is NaN.

* Many movies are rated by only 1 user. For such a movie the whole column is filled with that user's rating, and therefore its Pearson correlation with any other column is NaN.
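
The zero-variance case is easy to reproduce. The sketch below uses made-up columns (`varied`, `constant`) and a made-up single-rater column (`sparse`); it only illustrates the mechanism described above and does not use the MovieLens data.

```python
import numpy as np
import pandas as pd

# A column with spread and a column where every value is identical.
toy = pd.DataFrame({
    'varied':   [1.0, 2.0, 3.0, 4.0],
    'constant': [3.0, 3.0, 3.0, 3.0],
})

# The constant column has zero standard deviation, so Pearson is undefined.
corr = toy.corr()
print(corr)

# A movie rated by a single user becomes constant after the mean fill.
sparse = pd.Series([np.nan, 4.0, np.nan])
filled = sparse.fillna(sparse.mean())
print(filled.nunique())
```

This is why the later cell calls `dropna()` on the Jurassic Park correlations before ranking them.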

In [52]:
item_corr_matrix = item_matrix.corr()

In [53]:
jurassic_park_corr = item_corr_matrix['Jurassic Park (1993)']
jurassic_park_corr = jurassic_park_corr.sort_values(ascending=False)
jurassic_park_corr.dropna(inplace=True)

In [54]:
movies_similar_to_jurassic_park = pd.DataFrame(data=jurassic_park_corr.values, columns=['Correlation'], 
                                               index = jurassic_park_corr.index)
movies_similar_to_jurassic_park = movies_similar_to_jurassic_park.join(df_n_ratings['total ratings'])
movies_similar_to_jurassic_park.head(10)

title,Correlation,total ratings
Jurassic Park (1993),1.0,524
Star Wars: Episode VI - Return of the Jedi (1983),0.36843,474
Speed (1994),0.357916,373
Star Wars: Episode V - The Empire Strikes Back (1980),0.357008,510
E.T. the Extra-Terrestrial (1982),0.348051,402
Independence Day (a.k.a. ID4) (1996),0.342534,459
Indiana Jones and the Last Crusade (1989),0.341462,414
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981),0.337147,505
Men in Black (a.k.a. MIB) (1997),0.336483,476
Mrs. Doubtfire (1993),0.334781,380


In [56]:
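# Drop the first row (Jurassic Park itself). Note that this positional slice
# removes one more row every time the cell is re-run.
movies_similar_to_jurassic_park = movies_similar_to_jurassic_park[1:]
# Keep only movies with more than 400 ratings, ranked by correlation
movies_similar_to_jurassic_park[ movies_similar_to_jurassic_park['total ratings'] > 400 ].sort_values(ascending=False,
                                                                                          by=['Correlation']).head(10)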

title,Correlation,total ratings
Star Wars: Episode V - The Empire Strikes Back (1980),0.357008,510
E.T. the Extra-Terrestrial (1982),0.348051,402
Independence Day (a.k.a. ID4) (1996),0.342534,459
Indiana Jones and the Last Crusade (1989),0.341462,414
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981),0.337147,505
Men in Black (a.k.a. MIB) (1997),0.336483,476
Toy Story (1995),0.327286,496
Terminator 2: Judgment Day (1991),0.324536,462
Back to the Future (1985),0.319201,513
Titanic (1997),0.306548,467
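
The item-based steps above (take one column of the correlation matrix, drop the movie itself and any NaN, attach rating counts, filter by a minimum count, sort) can be wrapped in a helper. This is a minimal sketch under assumptions: `top_similar`, `toy_corr` and `toy_counts` are hypothetical names, and dropping the movie by label instead of by position is a choice made here so that the function gives the same result on every call.

```python
import pandas as pd

def top_similar(corr_matrix, counts, title, min_ratings=0, n=10):
    """Rank movies by correlation with `title`, keeping popular ones only."""
    # Drop the movie itself by label, then discard undefined correlations.
    sims = corr_matrix[title].drop(labels=title).dropna()
    out = pd.DataFrame({
        'Correlation': sims,
        'total ratings': counts.reindex(sims.index),
    })
    out = out[out['total ratings'] > min_ratings]
    return out.sort_values('Correlation', ascending=False).head(n)

# Hypothetical 3x3 correlation matrix and rating counts.
toy_corr = pd.DataFrame(
    [[1.0, 0.8, 0.3],
     [0.8, 1.0, 0.5],
     [0.3, 0.5, 1.0]],
    index=list('ABC'), columns=list('ABC'),
)
toy_counts = pd.Series({'A': 500, 'B': 50, 'C': 450})

# 'B' correlates most with 'A' but has too few ratings to pass the filter.
result = top_similar(toy_corr, toy_counts, 'A', min_ratings=400)
print(result)
```

With the notebook's objects, a call of the form `top_similar(item_corr_matrix, df_n_ratings['total ratings'], 'Jurassic Park (1993)', min_ratings=400)` would be the analogous usage.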
