# Making Recommendations Based on Correlation

In [32]:
import numpy as np
import pandas as pd

In [33]:
df_links = pd.read_csv(r'links.csv')
df_movies = pd.read_csv(r'movies.csv')
df_ratings = pd.read_csv(r'ratings.csv')
df_tags = pd.read_csv(r'tags.csv')

### Preparing Data For Correlation

We will look for movies that are similar to the most popular movie from the last notebook, "Forrest Gump". "Similarity" is defined by how well other movies correlate with "Forrest Gump" in the user-item matrix. This matrix has one row per user and one column per movie. Most of its cells are NaN because each user has rated only a small fraction of the movies; such a matrix is called sparse.
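To make the idea of a sparse user-item matrix concrete, here is a minimal sketch on invented data. The `toy_ratings` frame and its values are hypothetical; only the column names mirror `ratings.csv`.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings.csv
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.5],
})

# Rows are users, columns are movies; user-movie pairs without a rating become NaN
toy_matrix = toy_ratings.pivot_table(values="rating", index="userId", columns="movieId")

# Fraction of empty cells, a simple measure of how sparse the matrix is
toy_sparsity = toy_matrix.isna().to_numpy().mean()

print(toy_matrix)
print(f"sparsity: {toy_sparsity:.2f}")
```

Even in this tiny example almost half of the cells are empty. In the real data the share of empty cells is far higher.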

In [34]:
df_ratings


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [47]:
df_movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


### Creating User Item Interaction Matrix

In [35]:
movrating_crosstab = pd.pivot_table(data=df_ratings, values='rating', index='userId', columns='movieId')
movrating_crosstab.head(10)

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,
6,,4.0,5.0,3.0,5.0,4.0,4.0,3.0,,3.0,...,,,,,,,,,,
7,4.5,,,,,,,,,,...,,,,,,,,,,
8,,4.0,,,,,,,,2.0,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,
10,,,,,,,,,,,...,,,,,,,,,,


Let's look at the users who have rated "Forrest Gump", the most popular movie from the previous notebook (movie_with popularity recommendations):

In [36]:
# Forrest Gump has movieId 356
top_popular_movieId = 356

In [37]:
Forestgump_ratings = movrating_crosstab[top_popular_movieId]
Forestgump_ratings.dropna() # keep only the users who rated the movie

userId
1      4.0
6      5.0
7      5.0
8      3.0
10     3.5
      ... 
605    3.0
606    4.0
608    3.0
609    4.0
610    3.0
Name: 356, Length: 329, dtype: float64

## Evaluating Similarity Based on Correlation

Now we will look at how well other movies correlate with "Forrest Gump". A strong positive correlation between two movies indicates that users who rated one movie highly also tended to rate the other highly. A negative correlation means that users who liked one movie tended to dislike the other. So we will look for strong, positive correlations to find similar movies.
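As a small illustration of positive and negative correlation between movies, the sketch below uses three invented rating columns. The frame `toy` and the movie labels `A`, `B` and `C` are hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings from five users for three movies (NaN = not rated)
toy = pd.DataFrame({
    "A": [5.0, 4.0, 2.0, 1.0, np.nan],
    "B": [4.5, 4.0, 2.5, 1.5, 3.0],  # rises and falls together with A
    "C": [1.0, 2.0, 4.0, 5.0, 3.0],  # moves in the opposite direction to A
})

# Series.corr computes Pearson's r using only the users who rated both movies
r_ab = toy["A"].corr(toy["B"])
r_ac = toy["A"].corr(toy["C"])

print(f"A vs B: {r_ab:.3f}")
print(f"A vs C: {r_ac:.3f}")
```

Movie `B` would be a candidate recommendation for someone who liked `A`, while `C` would not.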

In [38]:
# NumPy emits RuntimeWarnings for movies that share too few ratings with Forrest Gump; the correlations that are returned are still valid
similar_to_Forestgump = movrating_crosstab.corrwith(Forestgump_ratings)
similar_to_Forestgump.head(40)

  c = cov(x, y, rowvar, dtype=dtype)
  c *= np.true_divide(1, fact)


movieId
1     0.303465
2     0.367247
3     0.534682
4     0.388514
5     0.349541
6     0.137421
7     0.106567
8     0.656020
9     0.000000
10    0.217441
11    0.190299
12    0.376395
13    0.000000
14    0.187500
15    0.617191
16   -0.032398
17    0.151531
18    0.440560
19    0.421868
20    0.442330
21   -0.018817
22    0.451930
23    0.600756
24    0.601555
25    0.247030
26    0.441777
27    0.316228
28   -0.396226
29    0.187306
30         NaN
31    0.279782
32    0.177380
34    0.094434
36    0.353062
38    0.260360
39    0.165583
40         NaN
41   -0.007661
42   -0.408248
43   -0.167571
dtype: float64

Many movies get a NaN because too few users rated both that movie and "Forrest Gump" for a correlation to be computed. The rest get a correlation score. Let's drop the NaNs and look at the valid results:
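The sketch below shows, on invented data, the situations in which `corrwith` returns NaN: no shared raters, a single shared rater, or shared raters whose ratings do not vary. The frame `toy` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical user-item matrix: rows are users, columns are movies
toy = pd.DataFrame({
    "target":     [5.0, 3.0, 4.0, np.nan, np.nan],
    "no_overlap": [np.nan, np.nan, np.nan, 4.0, 2.0],  # no shared raters
    "one_shared": [4.0, np.nan, np.nan, 3.0, np.nan],  # a single shared rater
    "constant":   [3.0, 3.0, 3.0, np.nan, np.nan],     # shared raters, zero variance
    "valid":      [4.0, 2.0, 3.0, 5.0, np.nan],        # three shared raters
})

# Pearson correlation of every column with the target column
toy_corr = toy.corrwith(toy["target"])

# Number of users who rated both the target and each other movie
rated = toy.notna().astype(int)
shared = rated.mul(rated["target"], axis=0).sum()

print(pd.DataFrame({"PearsonR": toy_corr, "shared_users": shared}))
```

Only the `valid` column (and the target itself) gets a number. NumPy may print RuntimeWarnings for the other columns, like the ones shown in the output above.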

In [39]:
corr_Forestgump = pd.DataFrame(similar_to_Forestgump, columns=['PearsonR'])
corr_Forestgump.dropna(inplace=True)
corr_Forestgump.head(12)

movieId,PearsonR
1,0.303465
2,0.367247
3,0.534682
4,0.388514
5,0.349541
6,0.137421
7,0.106567
8,0.65602
9,0.0
10,0.217441


In [40]:
rating = pd.DataFrame(df_ratings.groupby('movieId')['rating'].mean())
rating['rating_count'] = df_ratings.groupby('movieId')['rating'].count()

In [41]:
rating.head(5)

movieId,rating,rating_count
1,3.92093,215
2,3.431818,110
3,3.259615,52
4,2.357143,7
5,3.071429,49


In [42]:
Forestgump_corr_summary = corr_Forestgump.join(rating['rating_count'])
Forestgump_corr_summary.drop(top_popular_movieId, inplace=True) # drop Forrest Gump (movieId=356) itself
Forestgump_corr_summary.head(5)

movieId,PearsonR,rating_count
1,0.303465,215
2,0.367247,110
3,0.534682,52
4,0.388514,7
5,0.349541,49


Let's keep only movies with more than 20 ratings, since a correlation based on a handful of ratings is unreliable.

Then take the top 10 movies in terms of similarity to "Forrest Gump":

In [59]:
top10 = Forestgump_corr_summary[Forestgump_corr_summary['rating_count']>20].sort_values('PearsonR', ascending=False).head(10)
top10

movieId,PearsonR,rating_count
65,0.723238,31
106072,0.715809,21
3101,0.701856,36
111362,0.682284,30
2795,0.677043,26
80549,0.670081,27
86833,0.663176,21
93840,0.653015,22
2431,0.652302,27
62,0.652144,80


In [45]:
top10 = top10.merge(df_movies, left_index=True, right_on="movieId")
top10

Unnamed: 0,PearsonR,rating_count,movieId,title,genres
58,0.723238,31,65,Bio-Dome (1996),Comedy
8289,0.715809,21,106072,Thor: The Dark World (2013),Action|Adventure|Fantasy|IMAX
8137,0.708827,20,101864,Oblivion (2013),Action|Adventure|Sci-Fi|IMAX
2343,0.701856,36,3101,Fatal Attraction (1987),Drama|Thriller
8425,0.682284,30,111362,X-Men: Days of Future Past (2014),Action|Adventure|Sci-Fi
2101,0.677043,26,2795,National Lampoon's Vacation (1983),Comedy
7416,0.670081,27,80549,Easy A (2010),Comedy|Romance
6327,0.663947,20,48738,"Last King of Scotland, The (2006)",Drama|Thriller
7605,0.663176,21,86833,Bridesmaids (2011),Comedy
7859,0.653015,22,93840,"Cabin in the Woods, The (2012)",Comedy|Horror|Sci-Fi|Thriller


### Create a function:

Create a function that takes a movieId and a number (n) as input and returns the titles of the n movies most similar to the given one.

You can assume that the user-item matrix (movrating_crosstab) is already created.

In [58]:
def itembased_moviesrecom(n, movieId=None, movies=df_movies, ratings=df_ratings):
    # Prompt for a movie only when the caller did not pass one
    if movieId is None:
        movieId = int(input("Input the Id of the movie you liked"))
    # User-item matrix: users in rows, movies in columns
    movrating_crosstab = pd.pivot_table(data=ratings, values='rating', index='userId', columns='movieId')
    chosen_movie_ratings = movrating_crosstab[movieId]
    # Pearson correlation of every movie with the chosen one
    corr_with_chosen = movrating_crosstab.corrwith(chosen_movie_ratings)
    corr_with_chosen.dropna(inplace=True)
    corr_with_chosen.drop(movieId, inplace=True)  # drop the chosen movie itself
    # Mean rating and number of ratings per movie, computed from the `ratings` argument
    rating = pd.DataFrame(ratings.groupby('movieId')['rating'].mean())
    rating['rating_count'] = ratings.groupby('movieId')['rating'].count()
    corr_with_chosen = pd.DataFrame(corr_with_chosen).merge(rating, how='left', on='movieId')
    # Keep only movies with at least 20 ratings
    corr_with_chosen = corr_with_chosen[corr_with_chosen['rating_count'] >= 20]
    # Column 0 holds the correlation; attach titles and genres to the top n
    return corr_with_chosen.sort_values(0, ascending=False).head(n).merge(movies, how='left', on='movieId')


In [57]:
itembased_moviesrecom(10)

Input the Id of the movie you liked1


  c = cov(x, y, rowvar, dtype=dtype)
  c *= np.true_divide(1, fact)


Unnamed: 0,movieId,0,rating,rating_count,title,genres
0,55269,0.86881,3.428571,21,"Darjeeling Limited, The (2007)",Adventure|Comedy|Drama
1,122916,0.818393,4.025,20,Thor: Ragnarok (2017),Action|Adventure|Sci-Fi
2,371,0.805735,3.309524,21,"Paper, The (1994)",Comedy|Drama
3,6953,0.793611,3.3,25,21 Grams (2003),Crime|Drama|Mystery|Romance|Thriller
4,4235,0.783188,3.934783,23,Amores Perros (Love's a Bitch) (2000),Drama|Thriller
5,4855,0.770451,3.775,20,Dirty Harry (1971),Action|Crime|Thriller
6,1125,0.7698,3.6875,24,"Return of the Pink Panther, The (1975)",Comedy|Crime
7,63992,0.769261,2.409091,22,Twilight (2008),Drama|Fantasy|Romance|Thriller
8,88405,0.768301,3.05,20,Friends with Benefits (2011),Comedy|Romance
9,531,0.763257,3.25,24,"Secret Garden, The (1993)",Children|Drama


### BONUS (Next iteration)
Instead of filtering by total rating count (at least 20), let's consider a movie X similar to a movie Y only if at least 3 users have rated both X and Y.

For example, users 143, 153, and 168 all rated both movies. It is not enough for 3 users to have rated X and 3 different users to have rated Y.
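One possible way to approach this is to count, for every movie, the users who rated both it and the chosen movie, and to filter on that count instead of the total rating count. The sketch below is only an illustration on invented data: the helper `corr_with_min_overlap`, its parameters and the `toy` matrix are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def corr_with_min_overlap(user_item, movie_id, min_shared=3):
    """Correlate every movie with movie_id, keeping only movies that share
    at least min_shared raters with it (hypothetical helper)."""
    target = user_item[movie_id]
    # 1 where a user rated a movie, 0 otherwise
    rated = user_item.notna().astype(int)
    # Per movie: number of users who rated both that movie and the target
    shared = rated.mul(rated[movie_id], axis=0).sum()
    corr = user_item.corrwith(target)
    result = pd.DataFrame({"PearsonR": corr, "shared_users": shared})
    result = result[result["shared_users"] >= min_shared]
    # Remove the chosen movie itself and any undefined correlations
    result = result.drop(index=movie_id, errors="ignore").dropna()
    return result.sort_values("PearsonR", ascending=False)

# Toy matrix: rows are users, columns are movie ids (illustrative values only)
toy = pd.DataFrame(
    {
        1: [5.0, 4.0, 2.0, 1.0, np.nan],
        2: [4.0, 5.0, 1.0, 2.0, 3.0],        # four raters shared with movie 1
        3: [5.0, 1.0, np.nan, np.nan, 4.0],  # only two raters shared with movie 1
    },
    index=pd.Index([143, 153, 168, 170, 171], name="userId"),
)

bonus = corr_with_min_overlap(toy, movie_id=1)
print(bonus)
```

Movie 3 is excluded because only two users rated both it and movie 1. As an alternative, `Series.corr` accepts a `min_periods` argument that returns NaN when fewer than that many paired observations are available.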

In [15]:
# your code here