## Create the similarity matrix

In 3 simple steps:

1. Create the big users-items table

2. Replace NaNs with zeros

3. Compute pairwise cosine similarities

### 1. Create the big users-items table

We are just reshaping (pivoting) the data, so that we have users as rows and restaurants as columns. We need the data to be in this shape to compute similarities between users in the next step.
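
To see the reshaping on a small scale first, here is a minimal sketch with a made-up ratings table. The names `toy_ratings` and `toy_users_items` and all values are hypothetical and are not part of the restaurant dataset.

```python
import pandas as pd

# Hypothetical toy ratings in "long" format: one row per (user, restaurant) rating
toy_ratings = pd.DataFrame({
    "userID":  ["A", "A", "B", "C"],
    "placeID": [10, 20, 10, 30],
    "rating":  [2, 1, 0, 2],
})

# Pivot to "wide" format: users as rows, restaurants as columns
toy_users_items = pd.pivot_table(data=toy_ratings,
                                 values="rating",
                                 index="userID",
                                 columns="placeID")

# User/restaurant combinations without a rating show up as NaN
print(toy_users_items)
```

Each user becomes one row vector of ratings, which is the shape needed to compare users with each other.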

In [34]:
import pandas as pd

# rating_final.csv
url = 'https://drive.google.com/file/d/1ptu4AlEXO4qQ8GytxKHoeuS1y4l_zWkC/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
frame = pd.read_csv(path)

# 'geoplaces2.csv'
url = 'https://drive.google.com/file/d/1ee3ib7LqGsMUksY68SD9yBItRvTFELxo/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
geodata = pd.read_csv(path, encoding = 'CP1252') # change encoding to 'mbcs' in Windows

places =  geodata[['placeID', 'name']]

users_items = pd.pivot_table(data=frame, 
                                 values='rating', 
                                 index='userID', 
                                 columns='placeID')

users_items.head()

placeID,132560,132561,132564,132572,132583,132584,132594,132608,132609,132613,...,135080,135081,135082,135085,135086,135088,135104,135106,135108,135109
userID
U1001,,,,,,,,,,,...,,,,0.0,,,,,,
U1002,,,,,,,,,,,...,,,,1.0,,,,1.0,,
U1003,,,,,,,,,,,...,2.0,,,,,,,,,
U1004,,,,,,,,,,,...,,,,,,,,2.0,,
U1005,,,,,,,,,,,...,,,,,,,,,,


### 2. Replace NaNs with zeros
Cosine similarity can't be computed on NaNs, so we replace every missing rating with 0.
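
One side effect is worth keeping in mind: this dataset contains real ratings of 0 (for example, `U1001` gave restaurant `135085` a 0.0 in the table above), so after filling, "rated 0" and "not rated" look the same. The sketch below uses a hypothetical toy table to show the effect, and one possible way (an assumption, not something the notebook does) to remember which cells held real ratings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: NaN means "not rated", 0.0 is a real (low) rating
toy = pd.DataFrame(
    {10: [2.0, np.nan], 20: [0.0, 1.0], 30: [np.nan, np.nan]},
    index=["A", "B"],
)

# Remember which cells held a real rating before filling
was_rated = toy.notna()

# Same operation as in the notebook, but returning a new frame instead of inplace=True
toy_filled = toy.fillna(0)

# User A's real 0.0 rating (column 20) and A's missing rating (column 30) now look identical ...
print(toy_filled.loc["A"])
# ... but the mask still tells them apart
print(was_rated.loc["A"])
```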

In [35]:
users_items.fillna(0, inplace=True)
users_items.head()

placeID,132560,132561,132564,132572,132583,132584,132594,132608,132609,132613,...,135080,135081,135082,135085,135086,135088,135104,135106,135108,135109
userID
U1001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U1002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
U1003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U1004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
U1005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 3. Compute cosine similarities

In [36]:
from sklearn.metrics.pairwise import cosine_similarity

user_similarities = pd.DataFrame(cosine_similarity(users_items),
                                 columns=users_items.index, 
                                 index=users_items.index)
user_similarities.head()

userID,U1001,U1002,U1003,U1004,U1005,U1006,U1007,U1008,U1009,U1010,...,U1129,U1130,U1131,U1132,U1133,U1134,U1135,U1136,U1137,U1138
userID
U1001,1.0,0.227921,0.166957,0.0,0.059761,0.111456,0.188982,0.0,0.106904,0.0,...,0.0,0.0,0.0,0.353553,0.0,0.083478,0.0,0.0,0.14825,0.0
U1002,0.227921,1.0,0.266371,0.158362,0.095346,0.088911,0.075378,0.0,0.426401,0.0,...,0.0,0.0,0.0,0.402911,0.0,0.199778,0.0,0.322329,0.413919,0.355335
U1003,0.166957,0.266371,1.0,0.0,0.0,0.325645,0.0,0.0,0.374817,0.0,...,0.0,0.0,0.0,0.118056,0.0,0.439024,0.0,0.059028,0.476463,0.208232
U1004,0.0,0.158362,0.0,1.0,0.166091,0.07744,0.131306,0.0,0.037139,0.0,...,0.0,0.0,0.0,0.350931,0.0,0.0,0.0,0.280745,0.103005,0.0
U1005,0.059761,0.095346,0.0,0.166091,1.0,0.0,0.237171,0.0,0.0,0.447214,...,0.0,0.0,0.0,0.084515,0.0,0.0,0.0,0.0,0.124035,0.0
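
For reference, the cosine similarity of two rating vectors is their dot product divided by the product of their lengths. The sketch below computes it by hand with NumPy on hypothetical vectors. The helper `cosine` is written for illustration only, and returning 0.0 for an all-zero vector is an assumption chosen to be consistent with the 0.0 entries in the table above.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    norm_product = np.linalg.norm(u) * np.linalg.norm(v)
    # An all-zero vector has length 0; return 0.0 instead of dividing by zero
    # (assumed convention for this sketch)
    if norm_product == 0:
        return 0.0
    return float(np.dot(u, v) / norm_product)

# Hypothetical rating vectors for three users over four restaurants
a = np.array([2.0, 0.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 2.0, 0.0])
c = np.array([0.0, 2.0, 0.0, 0.0])

print(cosine(a, a))  # a user compared with themselves
print(cosine(a, b))  # two users with overlapping ratings
print(cosine(a, c))  # two users with no restaurant in common
```

This is why the diagonal of `user_similarities` is 1.0 for users who have at least one non-zero rating, and why pairs of users with no restaurant in common get 0.0.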


## Building the recommender step by step:

Let's focus on one user (user `U1001`) as an example and compute the recommendations only for them. Then we will build a function that can compute recommendations for any user. We will follow these steps:

1. Compute the weights.

2. Find restaurants user `U1001` has not rated.

3. Compute the ratings user `U1001` would give to those unrated restaurants.

4. Find the top 5 restaurants from the rating predictions.

### 1. Compute the weights

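The weights are the other users' similarities to `U1001`, rescaled so that they sum to 1. We exclude `U1001`'s own row using `.query()`, because their similarity to themselves (1.0) should not count.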

In [37]:
userID = "U1001"
user_similarities[userID] 

userID
U1001    1.000000
U1002    0.227921
U1003    0.166957
U1004    0.000000
U1005    0.059761
           ...   
U1134    0.083478
U1135    0.000000
U1136    0.000000
U1137    0.148250
U1138    0.000000
Name: U1001, Length: 138, dtype: float64

In [38]:
user_id = "U1001"  # the user we are computing recommendations for

# two equivalent ways to drop this user's own row (only the last expression is displayed)
user_similarities.query("userID!=@user_id")
user_similarities[user_similarities.index != user_id]


userID,U1001,U1002,U1003,U1004,U1005,U1006,U1007,U1008,U1009,U1010,...,U1129,U1130,U1131,U1132,U1133,U1134,U1135,U1136,U1137,U1138
userID
U1002,0.227921,1.000000,0.266371,0.158362,0.095346,0.088911,0.075378,0.0,0.426401,0.000000,...,0.0,0.0,0.0,0.402911,0.0,0.199778,0.0,0.322329,0.413919,0.355335
U1003,0.166957,0.266371,1.000000,0.000000,0.000000,0.325645,0.000000,0.0,0.374817,0.000000,...,0.0,0.0,0.0,0.118056,0.0,0.439024,0.0,0.059028,0.476463,0.208232
U1004,0.000000,0.158362,0.000000,1.000000,0.166091,0.077440,0.131306,0.0,0.037139,0.000000,...,0.0,0.0,0.0,0.350931,0.0,0.000000,0.0,0.280745,0.103005,0.000000
U1005,0.059761,0.095346,0.000000,0.166091,1.000000,0.000000,0.237171,0.0,0.000000,0.447214,...,0.0,0.0,0.0,0.084515,0.0,0.000000,0.0,0.000000,0.124035,0.000000
U1006,0.111456,0.088911,0.325645,0.077440,0.000000,1.000000,0.073721,0.0,0.083406,0.000000,...,0.0,0.0,0.0,0.078811,0.0,0.130258,0.0,0.039406,0.231326,0.278019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
U1134,0.083478,0.199778,0.439024,0.000000,0.000000,0.130258,0.110432,0.0,0.249878,0.156174,...,0.0,0.0,0.0,0.177084,0.0,1.000000,0.0,0.354169,0.303204,0.000000
U1135,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000
U1136,0.000000,0.322329,0.059028,0.280745,0.000000,0.039406,0.000000,0.0,0.151186,0.000000,...,0.0,0.0,0.0,0.142857,0.0,0.354169,0.0,1.000000,0.104828,0.251976
U1137,0.148250,0.413919,0.476463,0.103005,0.124035,0.231326,0.098058,0.0,0.388290,0.184900,...,0.0,0.0,0.0,0.314485,0.0,0.303204,0.0,0.104828,1.000000,0.000000


In [39]:
sum(user_similarities.query("userID!=@user_id")[user_id])

9.769831925163341

In [40]:
# compute the weights for one user
user_id = "U1001"

weights = (
    user_similarities.query("userID!=@user_id")[user_id] / sum(user_similarities.query("userID!=@user_id")[user_id])
          )
weights.head(6)

userID
U1002    0.023329
U1003    0.017089
U1004    0.000000
U1005    0.006117
U1006    0.011408
U1007    0.019343
Name: U1001, dtype: float64

In [41]:
weights.sum()

1.0
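
The normalisation is easy to check by hand on a few numbers. This is a minimal sketch with hypothetical similarities (`toy_sims` is not taken from the dataset):

```python
import pandas as pd

# Hypothetical similarities of four users to user "A" (A's own entry is 1.0)
toy_sims = pd.Series({"A": 1.0, "B": 0.5, "C": 0.0, "D": 0.3})

# Drop the user themselves, then divide by the total so the weights sum to 1
others = toy_sims.drop("A")
toy_weights = others / others.sum()

print(toy_weights)
print(toy_weights.sum())
```

Users with similarity 0 get weight 0, so they contribute nothing to the predictions later on.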

### 2. Find restaurants user `U1001` has not rated

We keep only the restaurants (columns) where `U1001` has a 0, and we drop `U1001`'s own row, since the weights cover every other user but not `U1001`.
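
The selection relies on passing two boolean masks to `.loc`, one for rows and one for columns. A small sketch with a hypothetical filled table (the names `toy_users_items`, `target`, `row_mask` and `col_mask` are illustrative):

```python
import pandas as pd

# Hypothetical filled users-items table (0 = no rating recorded)
toy_users_items = pd.DataFrame(
    [[2.0, 0.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 1.0, 2.0]],
    index=["A", "B", "C"],
    columns=[10, 20, 30],
)
target = "A"

row_mask = toy_users_items.index != target      # keep every user except A
col_mask = toy_users_items.loc[target, :] == 0  # keep restaurants where A has a 0

toy_not_visited = toy_users_items.loc[row_mask, col_mask]
print(toy_not_visited)
```

Only users B and C and restaurants 20 and 30 remain, because A already has a non-zero rating for restaurant 10.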

In [42]:
users_items.loc[user_id,:]==0

placeID
132560    True
132561    True
132564    True
132572    True
132583    True
          ... 
135088    True
135104    True
135106    True
135108    True
135109    True
Name: U1001, Length: 130, dtype: bool

In [43]:
users_items

placeID,132560,132561,132564,132572,132583,132584,132594,132608,132609,132613,...,135080,135081,135082,135085,135086,135088,135104,135106,135108,135109
userID
U1001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U1002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
U1003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U1004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
U1005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
U1134,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
U1135,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U1136,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U1137,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0


In [44]:
# select the restaurants that the input user has not rated
# row selector (before the comma): every user except user_id
# column selector (after the comma): restaurants where user_id's rating is 0
not_visited_restaurants = users_items.loc[users_items.index!=user_id, users_items.loc[user_id,:]==0]
not_visited_restaurants.T

userID,U1002,U1003,U1004,U1005,U1006,U1007,U1008,U1009,U1010,U1011,...,U1129,U1130,U1131,U1132,U1133,U1134,U1135,U1136,U1137,U1138
placeID
132560,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
132561,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
132564,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
132572,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
132583,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
135088,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
135104,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
135106,1.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
135108,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 3. Compute the ratings user `U1001` would give to those unrated restaurants

In [45]:
not_visited_restaurants.T.dot(weights)

placeID
132560    0.000000
132561    0.000000
132564    0.000000
132572    0.193427
132583    0.000000
            ...   
135088    0.000000
135104    0.000000
135106    0.174631
135108    0.102436
135109    0.000000
Length: 122, dtype: float64

In [47]:
# dot product between the not-visited-restaurants and the weights
weighted_averages = pd.DataFrame(not_visited_restaurants.T.dot(weights), columns=["predicted_rating"])
weighted_averages.head()

placeID,predicted_rating
132560,0.0
132561,0.0
132564,0.0
132572,0.193427
132583,0.0
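
The dot product is a compact way of writing one weighted sum per restaurant. The sketch below, on hypothetical numbers, computes the same predictions once with `.dot()` and once term by term. Note that a 0 from a user who never rated a restaurant enters the sum like any other rating, which pulls the predicted values down.

```python
import pandas as pd

# Hypothetical ratings by users B and C for two restaurants the target user has not rated
toy_not_visited = pd.DataFrame(
    [[2.0, 0.0],
     [1.0, 2.0]],
    index=["B", "C"],
    columns=[20, 30],
)
# Hypothetical normalised weights for B and C (they sum to 1)
toy_weights = pd.Series({"B": 0.75, "C": 0.25})

# Matrix form, as in the notebook
predicted = toy_not_visited.T.dot(toy_weights)

# The same numbers written out restaurant by restaurant
by_hand = {
    place: sum(toy_not_visited.loc[user, place] * toy_weights[user]
               for user in toy_weights.index)
    for place in toy_not_visited.columns
}

print(predicted)
print(by_hand)
```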


### 4. Find the top 5 restaurants from the rating predictions

In [48]:
# attach the restaurant names by joining on the shared "placeID" column
recommendations = weighted_averages.reset_index().merge(places, on="placeID")
# calling .merge(places) without arguments gives the same result, because merge joins on the columns both frames share

recommendations.sort_values("predicted_rating", ascending=False).head()

,placeID,predicted_rating,name
115,135085,0.878773,Tortas Locas Hipocampo
90,135052,0.742529,La Cantina Restaurante
77,135032,0.622755,Cafeteria y Restaurant El Pacifico
80,135038,0.549689,Restaurant la Chalita
98,135062,0.495248,Restaurante El Cielo Potosino
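
The join-then-sort pattern can be tried in isolation. This is a minimal sketch with hypothetical predictions and a hypothetical name lookup (`toy_predictions`, `toy_places` and the restaurant names are made up):

```python
import pandas as pd

# Hypothetical predictions, indexed by placeID like weighted_averages
toy_predictions = pd.DataFrame(
    {"predicted_rating": [0.2, 0.9, 0.5]},
    index=pd.Index([10, 20, 30], name="placeID"),
)
# Hypothetical lookup table of restaurant names
toy_places = pd.DataFrame({"placeID": [10, 20, 30],
                           "name": ["Cafe X", "Taqueria Y", "Bar Z"]})

# reset_index() turns the placeID index into a column so it can serve as the join key
toy_recommendations = toy_predictions.reset_index().merge(toy_places, on="placeID")

# Keep the two highest predicted ratings
top_2 = toy_recommendations.sort_values("predicted_rating", ascending=False).head(2)
print(top_2)
```

The numbers in the leftmost column of the output are just the row positions left over from the merge, not restaurant IDs.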


### Challenge:

1. Make a function that recommends the top `n` restaurants to an inputted `userID`

2. Make this function for the movies dataset.

In [49]:
def top_n_rest(user_id, n):
  # similarities of every other user to user_id, rescaled to sum to 1
  others = user_similarities.query("userID!=@user_id")[user_id]
  weights = others / others.sum()
  # other users' ratings for the restaurants user_id has not rated (0 after fillna)
  not_visited_restaurants = users_items.loc[users_items.index!=user_id, users_items.loc[user_id,:]==0]
  # weighted average of the other users' ratings for each restaurant
  weighted_averages = pd.DataFrame(not_visited_restaurants.T.dot(weights), columns=["predicted_rating"])
  # attach the restaurant names
  recommendations = weighted_averages.reset_index().merge(places, on="placeID")
  return recommendations.sort_values("predicted_rating", ascending=False).head(n)


In [50]:
top_n_rest('U1002', 10)

,placeID,predicted_rating,name
32,132834,0.469536,Gorditas Doa Gloria
98,135060,0.43632,Restaurante Marisco Sam
76,135032,0.416535,Cafeteria y Restaurant El Pacifico
80,135038,0.405008,Restaurant la Chalita
71,135025,0.397823,El Rincon de San Francisco
74,135028,0.367957,La Virreina
92,135051,0.362631,Restaurante Versalles
83,135042,0.338743,Restaurant Oriental Express
75,135030,0.320859,Preambulo Wifi Zone Cafe
25,132754,0.311895,Cabana Huasteca


In [51]:
top_n_rest('U1007', 10)

,placeID,predicted_rating,name
30,132825,0.567865,puesto de tacos
94,135052,0.530938,La Cantina Restaurante
100,135062,0.382909,Restaurante El Cielo Potosino
99,135060,0.353399,Restaurante Marisco Sam
93,135051,0.339689,Restaurante Versalles
84,135042,0.335671,Restaurant Oriental Express
32,132834,0.33555,Gorditas Doa Gloria
76,135028,0.318393,La Virreina
77,135030,0.28544,Preambulo Wifi Zone Cafe
87,135045,0.28375,Restaurante la Gran Via


In [52]:
top_n_rest('U1134', 10)

,placeID,predicted_rating,name
31,132834,0.461764,Gorditas Doa Gloria
76,135028,0.458739,La Virreina
40,132862,0.440893,La Posada del Virrey
77,135030,0.420492,Preambulo Wifi Zone Cafe
49,132921,0.387625,crudalia
82,135038,0.374341,Restaurant la Chalita
21,132723,0.369708,Gordas de morales
97,135058,0.357894,Restaurante Tiberius
73,135025,0.351048,El Rincon de San Francisco
93,135052,0.338433,La Cantina Restaurante
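
`top_n_rest` reads `user_similarities`, `users_items` and `places` from the notebook's global scope, and it divides by the sum of the similarities, which is 0 for a user who is similar to nobody (in the similarity table above, `U1135` has 0.0 everywhere). The sketch below is one possible variant, not the notebook's method: the function name `top_n_items`, its parameters and the toy data are all hypothetical. It takes its inputs as arguments and returns an empty table when there is nothing to weight by.

```python
import pandas as pd

def top_n_items(user_id, n, similarities, ratings, names, key):
    """Sketch of the recommender with all inputs passed in explicitly.

    similarities: square user-by-user DataFrame
    ratings:      filled users-items DataFrame (0 = no rating)
    names:        lookup DataFrame containing the `key` column
    key:          name of the item ID column, e.g. "placeID"
    """
    others = similarities.loc[similarities.index != user_id, user_id]
    total = others.sum()
    if total == 0:
        # No similar users: the weights would be 0/0, so return an empty result
        return pd.DataFrame(columns=[key, "predicted_rating"])
    weights = others / total
    unrated = ratings.loc[ratings.index != user_id, ratings.loc[user_id, :] == 0]
    predicted = unrated.T.dot(weights).rename("predicted_rating").rename_axis(key)
    recommendations = predicted.reset_index().merge(names, on=key)
    return recommendations.sort_values("predicted_rating", ascending=False).head(n)

# Hypothetical toy inputs: user C has no ratings and is similar to nobody
toy_ratings = pd.DataFrame(
    [[2.0, 0.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 0.0]],
    index=["A", "B", "C"], columns=[10, 20, 30],
)
toy_similarities = pd.DataFrame(
    [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 0.0]],
    index=["A", "B", "C"], columns=["A", "B", "C"],
)
toy_names = pd.DataFrame({"placeID": [10, 20, 30],
                          "name": ["Cafe X", "Taqueria Y", "Bar Z"]})

print(top_n_items("A", 2, toy_similarities, toy_ratings, toy_names, "placeID"))
print(top_n_items("C", 2, toy_similarities, toy_ratings, toy_names, "placeID"))
```

Passing the tables in explicitly also means the same function could be reused for another dataset by changing only the arguments, for example a different `key` and lookup table.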


### 2nd challenge

Make this function for the movies dataset.

In [17]:
import pandas as pd

url = 'https://drive.google.com/file/d/1S0CtDB8NYUs94KgO0VDv6b2R1CShQcLF/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
links = pd.read_csv(path)


url = 'https://drive.google.com/file/d/1sW3zww6gMzoln0-U0Zs7HW_bKYjtH99i/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
movies = pd.read_csv(path)

url = 'https://drive.google.com/file/d/1nUpoWkhzhnYtUFvGYTR317RHiq7XtTx9/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
ratings = pd.read_csv(path)

url = 'https://drive.google.com/file/d/1F9szBIzHvE9sk-p89sk1zpxVEG_gJezg/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
tags = pd.read_csv(path)

In [24]:
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [27]:
movies_1 =  movies[['movieId', 'title']]
movies_1.head()


,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [18]:
ratings_movies = movies.merge(ratings)
ratings_movies

,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483
...,...,...,...,...,...,...
100831,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184,4.0,1537109082
100832,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184,3.5,1537109545
100833,193585,Flint (2017),Drama,184,3.5,1537109805
100834,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184,3.5,1537110021


In [19]:
users_items = pd.pivot_table(data=ratings_movies, 
                                 values='rating', 
                                 index='userId', 
                                 columns='movieId')

users_items.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


In [20]:
users_items.fillna(0, inplace=True)
users_items.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Compute cosine similarities

In [21]:
from sklearn.metrics.pairwise import cosine_similarity

user_similarities = pd.DataFrame(cosine_similarity(users_items),
                                 columns=users_items.index, 
                                 index=users_items.index)
user_similarities.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
userId
1,1.0,0.027283,0.05972,0.194395,0.12908,0.128152,0.158744,0.136968,0.064263,0.016875,...,0.080554,0.164455,0.221486,0.070669,0.153625,0.164191,0.269389,0.291097,0.093572,0.145321
2,0.027283,1.0,0.0,0.003726,0.016614,0.025333,0.027585,0.027257,0.0,0.067445,...,0.202671,0.016866,0.011997,0.0,0.0,0.028429,0.012948,0.046211,0.027565,0.102427
3,0.05972,0.0,1.0,0.002251,0.00502,0.003936,0.0,0.004941,0.0,0.0,...,0.005048,0.004892,0.024992,0.0,0.010694,0.012993,0.019247,0.021128,0.0,0.032119
4,0.194395,0.003726,0.002251,1.0,0.128659,0.088491,0.11512,0.062969,0.011361,0.031163,...,0.085938,0.128273,0.307973,0.052985,0.084584,0.200395,0.131746,0.149858,0.032198,0.107683
5,0.12908,0.016614,0.00502,0.128659,1.0,0.300349,0.108342,0.429075,0.0,0.030611,...,0.068048,0.418747,0.110148,0.258773,0.148758,0.106435,0.152866,0.135535,0.261232,0.060792


In [28]:
def top_n_movies(user_id, n):
  # similarities of every other user to user_id, rescaled to sum to 1
  others = user_similarities.query("userId!=@user_id")[user_id]
  weights = others / others.sum()
  # other users' ratings for the movies user_id has not rated (0 after fillna)
  not_watched_movies = users_items.loc[users_items.index!=user_id, users_items.loc[user_id,:]==0]
  # weighted average of the other users' ratings for each movie
  weighted_averages = pd.DataFrame(not_watched_movies.T.dot(weights), columns=["predicted_rating"])
  # attach the movie titles
  recommendations = weighted_averages.reset_index().merge(movies_1, on="movieId")
  return recommendations.sort_values("predicted_rating", ascending=False).head(n)

In [29]:
top_n_movies(2, 10)

,movieId,predicted_rating,title
1935,2571,2.901554,"Matrix, The (1999)"
312,356,2.84119,Forrest Gump (1994)
2221,2959,2.773393,Fight Club (1999)
257,296,2.533774,Pulp Fiction (1994)
508,593,2.289741,"Silence of the Lambs, The (1991)"
4786,7153,2.285982,"Lord of the Rings: The Return of the King, The..."
3629,4993,2.267806,"Lord of the Rings: The Fellowship of the Ring,..."
4127,5952,2.079658,"Lord of the Rings: The Two Towers, The (2002)"
224,260,2.070271,Star Wars: Episode IV - A New Hope (1977)
657,858,1.922954,"Godfather, The (1972)"


In [30]:
top_n_movies(4, 10)

,movieId,predicted_rating,title
256,318,2.63693,"Shawshank Redemption, The (1994)"
288,356,2.593046,Forrest Gump (1994)
42,50,1.976829,"Usual Suspects, The (1995)"
612,858,1.970388,"Godfather, The (1972)"
426,527,1.949681,Schindler's List (1993)
829,1210,1.934653,Star Wars: Episode VI - Return of the Jedi (1983)
385,480,1.901938,Jurassic Park (1993)
468,589,1.870437,Terminator 2: Judgment Day (1991)
90,110,1.859478,Braveheart (1995)
0,1,1.799827,Toy Story (1995)
