# Create the similarity matrix
In three simple steps:

1. Create the big users-items table
2. Replace NaNs with zeros
3. Compute pairwise cosine similarities

### 1. Create the big users-items table
We are just reshaping (pivoting) the data so that users are rows and movies are columns. We need the data in this shape to compute the similarities between users later on.
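To make the reshaping concrete, here is a minimal sketch on a tiny, made-up ratings table. The frame `toy_ratings` and its values are hypothetical and only mirror the column names of `ratings.csv`.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings.csv
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 5.0, 3.0, 2.0],
})

# One row per user, one column per movie; unrated pairs become NaN
toy_users_items = pd.pivot_table(data=toy_ratings,
                                 values="rating",
                                 index="userId",
                                 columns="movieId")
print(toy_users_items)
```

Every (user, movie) pair without a rating shows up as NaN, which is why the real table is mostly empty.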

In [2]:
import numpy as np
import pandas as pd

In [3]:
df_links = pd.read_csv(r'links.csv')
df_movies = pd.read_csv(r'movies.csv')
df_ratings = pd.read_csv(r'ratings.csv')
df_tags = pd.read_csv(r'tags.csv')

In [4]:
df_ratings


,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [10]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [5]:
users_items = pd.pivot_table(data=df_ratings, 
                                 values='rating', 
                                 index='userId', 
                                 columns='movieId')

In [6]:

users_items.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


### 2. Replace NaNs with zeros
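Cosine similarity can't be computed on data that contains NaNs, so we fill them with zeros. Note that this treats a movie a user has not rated the same as a rating of 0.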
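The short sketch below illustrates why the fill is needed. It assumes scikit-learn's usual input validation, which rejects NaN values, and uses a hypothetical 2x3 array rather than the real table.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 2-user, 3-movie matrix with one missing rating
toy = np.array([[4.0, np.nan, 5.0],
                [3.0, 2.0, 0.0]])

try:
    cosine_similarity(toy)
    raised = False
except ValueError as err:
    # scikit-learn validates its input and rejects NaN values
    raised = True
    print("ValueError:", err)

# After filling the gap with 0, the call succeeds and returns a 2x2 matrix
filled = np.nan_to_num(toy, nan=0.0)
sims = cosine_similarity(filled)
print(sims.shape)
```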

In [7]:
users_items.fillna(0, inplace=True)
users_items.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 3. Compute cosine similarities

In [8]:
from sklearn.metrics.pairwise import cosine_similarity

user_similarities = pd.DataFrame(cosine_similarity(users_items),
                                 columns=users_items.index, 
                                 index=users_items.index)
user_similarities.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
userId
1,1.0,0.027283,0.05972,0.194395,0.12908,0.128152,0.158744,0.136968,0.064263,0.016875,...,0.080554,0.164455,0.221486,0.070669,0.153625,0.164191,0.269389,0.291097,0.093572,0.145321
2,0.027283,1.0,0.0,0.003726,0.016614,0.025333,0.027585,0.027257,0.0,0.067445,...,0.202671,0.016866,0.011997,0.0,0.0,0.028429,0.012948,0.046211,0.027565,0.102427
3,0.05972,0.0,1.0,0.002251,0.00502,0.003936,0.0,0.004941,0.0,0.0,...,0.005048,0.004892,0.024992,0.0,0.010694,0.012993,0.019247,0.021128,0.0,0.032119
4,0.194395,0.003726,0.002251,1.0,0.128659,0.088491,0.11512,0.062969,0.011361,0.031163,...,0.085938,0.128273,0.307973,0.052985,0.084584,0.200395,0.131746,0.149858,0.032198,0.107683
5,0.12908,0.016614,0.00502,0.128659,1.0,0.300349,0.108342,0.429075,0.0,0.030611,...,0.068048,0.418747,0.110148,0.258773,0.148758,0.106435,0.152866,0.135535,0.261232,0.060792
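Each cell of this matrix is the cosine of the angle between two users' rating vectors: the dot product divided by the product of the vector lengths. As a rough check, the sketch below computes it by hand for two hypothetical vectors `a` and `b` and compares the result with scikit-learn.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical rating vectors for two users over four movies (0 = not rated)
a = np.array([4.0, 0.0, 5.0, 0.0])
b = np.array([5.0, 3.0, 0.0, 0.0])

# Dot product divided by the product of the vector lengths
manual = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Same value from scikit-learn, which expects 2-D input
from_sklearn = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(manual, from_sklearn)
```

Because ratings are never negative, the similarities here always fall between 0 and 1, and every user has similarity 1 with themselves (the diagonal).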


### Building the recommender step by step:
Let's focus on a single user (user 1) and compute the recommendations only for this user, as an example. Then we will build a function that can compute recommendations for any user. We will follow these steps:

1. Compute the weights.
2. Find movies user 1 has not rated.
3. Compute the ratings user 1 would give to those unrated movies.
4. Find the top 5 movies from the rating predictions.

### 1. Compute the weights
Here we exclude user 1 using .query(), then divide the remaining similarities by their sum so that the weights add up to 1.

In [57]:
# compute the weights for one user
userId = 1

weights = (
    user_similarities.query("userId!=@userId")[userId] / sum(user_similarities.query("userId!=@userId")[userId])
          )
weights.head(6)

userId
2    0.000336
3    0.000736
4    0.002395
5    0.001590
6    0.001579
7    0.001956
Name: 1, dtype: float64

In [51]:
weights.sum()

1.000000000000001
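The same normalization on a tiny example may make the idea easier to follow. The matrix `toy_sims` is hypothetical, and the sketch selects the other users with a boolean mask in `.loc` instead of `.query()`, which should give the same result.

```python
import pandas as pd

# Hypothetical 3x3 similarity matrix, indexed like user_similarities
toy_sims = pd.DataFrame(
    [[1.0, 0.2, 0.6],
     [0.2, 1.0, 0.1],
     [0.6, 0.1, 1.0]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=pd.Index([1, 2, 3], name="userId"),
)

target = 1
# Similarities of every other user to the target user
others = toy_sims.loc[toy_sims.index != target, target]
# Divide by the total so the weights sum to 1
toy_weights = others / others.sum()
print(toy_weights)
```

User 3 is three times as similar to user 1 as user 2 is, so user 3 ends up with three times the weight.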

### 2. Find movies user 1 has not rated.
We select the movies user 1 has not rated (a 0 in their row) and drop user 1's own row, since the weights only cover the other users.

In [52]:
users_items.loc[userId,:]==0

movieId
1         False
2          True
3         False
4          True
5          True
          ...  
193581     True
193583     True
193585     True
193587     True
193609     True
Name: 1, Length: 9724, dtype: bool

In [53]:
# select the movies that the given user has not rated, keeping only the other users' rows
not_visited_movies = users_items.loc[users_items.index!=userId, users_items.loc[userId,:]==0]
not_visited_movies.T

userId,2,3,4,5,6,7,8,9,10,11,...,601,602,603,604,605,606,607,608,609,610
movieId
2,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193583,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193585,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193587,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
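The selection combines a row mask and a column mask in a single `.loc` call. A minimal sketch on a hypothetical zero-filled table (`toy_users_items` is made up for illustration):

```python
import pandas as pd

# Hypothetical zero-filled users-items table (0 = not rated)
toy_users_items = pd.DataFrame(
    [[4.0, 0.0, 0.0],
     [3.0, 5.0, 0.0],
     [0.0, 2.0, 1.0]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)

target = 1
# Row mask: every user except the target
other_users = toy_users_items.index != target
# Column mask: movies for which the target user has a 0
unrated = toy_users_items.loc[target, :] == 0

toy_not_rated = toy_users_items.loc[other_users, unrated]
print(toy_not_rated)
```

The result keeps users 2 and 3 and only movies 20 and 30, the two that user 1 has not rated.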


### 3. Compute the ratings user 1 would give to those unrated movies.

In [19]:
not_visited_movies.T.dot(weights)

movieId
2         0.842127
4         0.027652
5         0.275115
7         0.321403
8         0.046876
            ...   
193581    0.000286
193583    0.000250
193585    0.000250
193587    0.000250
193609    0.002973
Length: 9492, dtype: float64
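The dot product computes, for each movie, the sum over the other users of weight times rating. The sketch below works through it with hypothetical ratings and weights.

```python
import pandas as pd

# Hypothetical ratings by two other users for two movies the target user has not rated
toy_ratings = pd.DataFrame(
    {10: [4.0, 0.0], 20: [2.0, 5.0]},
    index=pd.Index([2, 3], name="userId"),
)
toy_ratings.columns.name = "movieId"

# Hypothetical weights for users 2 and 3 (they sum to 1)
toy_weights = pd.Series([0.25, 0.75], index=pd.Index([2, 3], name="userId"))

# For each movie: sum over users of weight * rating
predicted = toy_ratings.T.dot(toy_weights)
print(predicted)
```

Movie 10 gets 0.25 * 4.0 + 0.75 * 0.0 = 1.0 and movie 20 gets 0.25 * 2.0 + 0.75 * 5.0 = 4.25. Since unrated movies count as 0, a movie rated by few users is pulled toward 0, which is consistent with the low predicted values in the output above.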

In [20]:
# dot product between the other users' ratings for the unrated movies and the weights
weighted_averages = pd.DataFrame(not_visited_movies.T.dot(weights), columns=["predicted_rating"])
weighted_averages

movieId,predicted_rating
2,0.842127
4,0.027652
5,0.275115
7,0.321403
8,0.046876
...,...
193581,0.000286
193583,0.000250
193585,0.000250
193587,0.000250


### 4. Find the top 5 movies from the rating predictions

In [21]:
recommendations = weighted_averages.merge(df_movies, left_index=True, right_on="movieId")
recommendations.sort_values("predicted_rating", ascending=False).head()

,predicted_rating,movieId,title,genres
277,2.654727,318,"Shawshank Redemption, The (1994)",Crime|Drama
507,2.087327,589,Terminator 2: Judgment Day (1991),Action|Sci-Fi
659,1.859548,858,"Godfather, The (1972)",Crime|Drama
2078,1.663564,2762,"Sixth Sense, The (1999)",Drama|Horror|Mystery
3638,1.62482,4993,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy
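The merge attaches titles by matching the index of the predictions to the `movieId` column of the movies table. Here is a small sketch with hypothetical data; it uses `nlargest`, which should be equivalent to sorting in descending order and taking the head.

```python
import pandas as pd

# Hypothetical predictions indexed by movieId, shaped like weighted_averages
toy_predictions = pd.DataFrame(
    {"predicted_rating": [1.0, 4.25, 2.5]},
    index=pd.Index([10, 20, 30], name="movieId"),
)

# Hypothetical titles table with the same columns as movies.csv
toy_movies = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Drama", "Comedy", "Action"],
})

# Attach titles by matching the prediction index to the movieId column
toy_recs = toy_predictions.merge(toy_movies, left_index=True, right_on="movieId")

# Keep the two highest predicted ratings
top2 = toy_recs.nlargest(2, "predicted_rating")
print(top2[["title", "predicted_rating"]])
```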


### Function:
Wrap the steps above in a function that recommends the top n movies for a given userId.

In [45]:
def user_movie_similarity(userId=None, n=None, user_movie=users_items, movie_names=df_movies):
    # prompt only for the values that were not passed as arguments
    if userId is None:
        userId = int(input("What is your userId "))
    if n is None:
        n = int(input("How many movies do you want to get "))
    # pairwise cosine similarities between all users
    user_similarities = pd.DataFrame(cosine_similarity(user_movie),
                                     columns=user_movie.index,
                                     index=user_movie.index)
    # similarities of every other user to userId, normalized to sum to 1
    other_users = user_similarities.query("userId!=@userId")[userId]
    weights = other_users / other_users.sum()
    # other users' ratings for the movies userId has not rated
    not_visited_movies = user_movie.loc[user_movie.index!=userId, user_movie.loc[userId,:]==0]
    weighted_averages = pd.DataFrame(not_visited_movies.T.dot(weights), columns=["predicted_rating"])
    recommendations = weighted_averages.merge(movie_names, left_index=True, right_on="movieId")
    return recommendations.sort_values("predicted_rating", ascending=False).head(n)
  

In [56]:
user_movie_similarity()

What is your userId 200
How many movies do you want to get 10


,predicted_rating,movieId,title,genres
510,2.241383,593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller
507,1.878633,589,Terminator 2: Judgment Day (1991),Action|Sci-Fi
461,1.823776,527,Schindler's List (1993),Drama|War
46,1.800264,50,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
1503,1.765395,2028,Saving Private Ryan (1998),Action|Drama|War
659,1.7316,858,"Godfather, The (1972)",Crime|Drama
2078,1.652478,2762,"Sixth Sense, The (1999)",Drama|Horror|Mystery
322,1.535623,364,"Lion King, The (1994)",Adventure|Animation|Children|Drama|Musical|IMAX
520,1.516728,608,Fargo (1996),Comedy|Crime|Drama|Thriller
398,1.481389,457,"Fugitive, The (1993)",Thriller
