# Using Cosine Similarity to make movie recommendations
- The similarity is the cosine of the angle between the 2 vectors of the item vectors of A and B
- The closer the vectors, the smaller will be the angle and the larger the cosine
- Cosine Similarity will be used to determine the degree of similarity between the two movies. What is cosine similarity and how does it work?

*Assume we have two vectors. If the vectors are almost parallel, i.e. the angle between them is zero, we can conclude that they are “similar,” because cos(0)=1. If the vectors are orthogonal, we can say they are independent or NOT “similar,” because cos(90)=0.*

References:
- https://github.com/rposhala/Recommender-System-on-MovieLens-dataset/blob/main/Item_based_Collaborative_Recommender_System_using_KNN.ipynb

In [73]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix # for sparse matrices
from sklearn.neighbors import NearestNeighbors # for nearest neighbors models
from pprint import pprint

In [74]:
items_df = pd.read_csv('./data/items.csv')
ratings_df = pd.read_csv('./data/ratings.csv')
df = pd.merge(ratings_df, items_df, on='movie_id')
# keep only required columns
df = df[['user_id', 'movie_id', 'rating', 'title']]
print(df.shape)
df.head()

(102295, 4)


Unnamed: 0,user_id,movie_id,rating,title
0,196,242,3,Kolya (1996)
1,186,302,3,L.A. Confidential (1997)
2,22,377,1,Heavyweights (1994)
3,244,51,2,Legends of the Fall (1994)
4,166,346,1,Jackie Brown (1997)


In [75]:
# In some cases a user might have rated the same movie multiple times.
refined_dataset = df.groupby(by=['user_id','title'], as_index=False).agg({"rating":"mean"})
print(refined_dataset.shape)
refined_dataset.head()

(99693, 3)


Unnamed: 0,user_id,title,rating
0,1,101 Dalmatians (1996),2.0
1,1,12 Angry Men (1957),5.0
2,1,"20,000 Leagues Under the Sea (1954)",3.0
3,1,2001: A Space Odyssey (1968),4.0
4,1,"Abyss, The (1989)",3.0


## Recommending movies for a movie

In [76]:
# Create pivot table
movies_df= refined_dataset.pivot_table(index="title",columns='user_id',values='rating').fillna(0)
movies_df.head()

user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'Til There Was You (1997),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1-900 (1994),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
101 Dalmatians (1996),2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,2.0,0.0,0.0,2.0,4.0,0.0,0.0,0.0,0.0,0.0
12 Angry Men (1957),5.0,0.0,0.0,0.0,0.0,4.0,4.0,0.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
187 (1997),0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [77]:
# Create sparse matrix
movies_sparse = csr_matrix(movies_df.values)

KNN calculates the distance from all points in the proximity of the unknown data and filters out the ones with the shortest distances to it. As a result, it’s often referred to as a distance-based algorithm.

In [78]:
# Building the model
model_knn= NearestNeighbors(metric= 'cosine', algorithm='brute')

# Fitting the model 
model_knn.fit(movies_sparse)

In [79]:
# Shape[0]---> selecting from rows
random_movie= np.random.choice(movies_df.shape[0])
print(random_movie)
# Now we will try to find the movies related to random_movie
distances, indices = model_knn.kneighbors(movies_df.iloc[random_movie,:].values.reshape(1,-1), n_neighbors= 6)

983


In [80]:
#Here 0---> recomendation for same movies

for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(movies_df.index[random_movie])) # For which movies it selected
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, movies_df.index[indices.flatten()[i]], distances.flatten()[i]))

Recommendations for Mina Tannenbaum (1994):

1: I Can't Sleep (J'ai pas sommeil) (1994), with distance of 0.2794233078771078:
2: Story of Xinghua, The (1993), with distance of 0.4597775663879018:
3: Hana-bi (1997), with distance of 0.4896896369201712:
4: Silence of the Palace, The (Saimt el Qusur) (1994), with distance of 0.4896896369201712:
5: Girls Town (1996), with distance of 0.4896896369201712:


## Recommending movies for a user

In [81]:
users_df = refined_dataset.pivot_table(index="user_id",columns='title',values='rating').fillna(0)
users_df.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,2.0,5.0,0.0,0.0,3.0,4.0,0.0,0.0,...,0.0,0.0,0.0,5.0,3.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,2.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,4.0,0.0


In [82]:
users_sparse = csr_matrix(users_df.values)

In [83]:
model_knn= NearestNeighbors(metric= 'cosine', algorithm='brute')
model_knn.fit(users_sparse)

In [84]:
## function to find top n similar users of the given input user 
def get_similar_users(user, n = 5):
  ## input to this function is the user and number of top similar users you want.

  knn_input = np.asarray([users_df.values[user-1]])  #.reshape(1,-1)
  # knn_input = user_to_movie_df.iloc[0,:].values.reshape(1,-1)
  distances, indices = model_knn.kneighbors(knn_input, n_neighbors=n+1)
  
  print("Top",n,"users who are very much similar to the User-",user, "are: ")
  print(" ")
  for i in range(1,len(distances[0])):
    print(i,". User:", indices[0][i]+1, "separated by distance of",distances[0][i])
  return indices.flatten()[1:] + 1, distances.flatten()[1:]


In [85]:
user_id = 778
print(" Few of movies seen by the User:")
pprint(list(refined_dataset[refined_dataset['user_id'] == user_id]['title'])[:10])
similar_user_list, distance_list = get_similar_users(user_id,5)

 Few of movies seen by the User:
['Amityville Horror, The (1979)',
 'Angels in the Outfield (1994)',
 'Apocalypse Now (1979)',
 'Apollo 13 (1995)',
 'Austin Powers: International Man of Mystery (1997)',
 'Babe (1995)',
 'Back to the Future (1985)',
 'Blues Brothers, The (1980)',
 'Chasing Amy (1997)',
 'Clerks (1994)']
Top 5 users who are very much similar to the User- 778 are: 
 
1 . User: 124 separated by distance of 0.4586649429539592
2 . User: 933 separated by distance of 0.5581959868865324
3 . User: 56 separated by distance of 0.5858413112292744
4 . User: 738 separated by distance of 0.5916272517988691
5 . User: 653 separated by distance of 0.5991479757406326


Now we will try to recommend movies to the user based on the similar users

In [86]:
# Calculating weightage of each user
weightage_list = distance_list/np.sum(distance_list)
print(weightage_list)

[0.16419139 0.19982119 0.20971757 0.2117888  0.21448105]


In [87]:
# Retrieve the movie ratings of the similar users
mov_rtngs_sim_users = users_df.values[similar_user_list]
print(mov_rtngs_sim_users)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 2. ... 0. 0. 0.]
 [0. 0. 3. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [88]:
# Movie name list
movies_list = users_df.columns
print(movies_list)

Index([''Til There Was You (1997)', '1-900 (1994)', '101 Dalmatians (1996)',
       '12 Angry Men (1957)', '187 (1997)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '3 Ninjas: High Noon At Mega Mountain (1998)', '39 Steps, The (1935)',
       ...
       'Yankee Zulu (1994)', 'Year of the Horse (1997)', 'You So Crazy (1994)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Poisoner's Handbook, The (1995)',
       'Zeus and Roxanne (1997)', 'unknown',
       'Á köldum klaka (Cold Fever) (1994)'],
      dtype='object', name='title', length=1664)


In [89]:
print("Weightage list shape:", len(weightage_list))
print("mov_rtngs_sim_users shape:", mov_rtngs_sim_users.shape)
print("Number of movies:", len(movies_list))

Weightage list shape: 5
mov_rtngs_sim_users shape: (5, 1664)
Number of movies: 1664


In [90]:
# Reshaping the weightage list to match the shape of mov_rtngs_sim_users
weightage_list = weightage_list[:,np.newaxis] + np.zeros(len(movies_list))
weightage_list.shape

(5, 1664)

In [91]:
# Calculating the new rating matrix with the weightage of each user
new_rating_matrix = weightage_list*mov_rtngs_sim_users
mean_rating_list = new_rating_matrix.sum(axis =0) # Summing the ratings of all similar users
print(mean_rating_list)

[0.         0.         1.02879509 ... 0.         0.         0.        ]


In [92]:
new_rating_matrix

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.39964238, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.62915272, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [93]:
def recommend_movies(n):
  n = min(len(mean_rating_list),n) # In case n is greater than the number of movies
  pprint(list(movies_list[np.argsort(mean_rating_list)[::-1][:n]]))

In [94]:
print("Movies recommended based on similar users are: ")
recommend_movies(5)

Movies recommended based on similar users are: 
['Star Wars (1977)',
 'Terminator, The (1984)',
 "Schindler's List (1993)",
 'Fugitive, The (1993)',
 'Forrest Gump (1994)']


### Drawbacks:

1. It also recommends movies which are already seen by the given input User

2. There is a possibility of recommending the movies which are not at all seen by any of the similar users


Let's address these drawbacks!

In [95]:
def filtered_movie_recommendations(n):
  
  first_zero_index = np.where(mean_rating_list == 0)[0][-1]
  sortd_index = np.argsort(mean_rating_list)[::-1]
  sortd_index = sortd_index[:list(sortd_index).index(first_zero_index)]
  n = min(len(sortd_index),n)
  movies_watched = list(refined_dataset[refined_dataset['user_id'] == user_id]['title'])
  filtered_movie_list = list(movies_list[sortd_index])
  count = 0
  final_movie_list = []
  for i in filtered_movie_list:
    if i not in movies_watched:
      count+=1
      final_movie_list.append(i)
    if count == n:
      break
  if count == 0:
    print("There are no movies left which are not seen by the input users and seen by similar users. May be increasing the number of similar users who are to be considered may give a chance of suggesting an unseen good movie.")
  else:
    pprint(final_movie_list)

In [96]:
filtered_movie_recommendations(5)

['Star Wars (1977)',
 "Schindler's List (1993)",
 'Princess Bride, The (1987)',
 'Empire Strikes Back, The (1980)',
 'Return of the Jedi (1983)']


In [97]:
def recommender_system(user_id, n_similar_users, n_movies): #, user_to_movie_df, knn_model):
  
  print("Movies seen by the User:")
  pprint(list(refined_dataset[refined_dataset['user_id'] == user_id]['title']))
  print("")

  def get_similar_users(user, n = 5):
    
    knn_input = np.asarray([users_df.values[user-1]])
    
    distances, indices = model_knn.kneighbors(knn_input, n_neighbors=n+1)
    
    print("Top",n,"users who are very much similar to the User-",user, "are: ")
    print(" ")

    for i in range(1,len(distances[0])):
      print(i,". User:", indices[0][i]+1, "separated by distance of",distances[0][i])
    print("")
    return indices.flatten()[1:] + 1, distances.flatten()[1:]


  def filtered_movie_recommendations(n = 10):
  
    first_zero_index = np.where(mean_rating_list == 0)[0][-1]
    sortd_index = np.argsort(mean_rating_list)[::-1]
    sortd_index = sortd_index[:list(sortd_index).index(first_zero_index)]
    n = min(len(sortd_index),n)
    movies_watched = list(refined_dataset[refined_dataset['user_id'] == user_id]['title'])
    filtered_movie_list = list(movies_list[sortd_index])
    count = 0
    final_movie_list = []
    for i in filtered_movie_list:
      if i not in movies_watched:
        count+=1
        final_movie_list.append(i)
      if count == n:
        break
    if count == 0:
      print("There are no movies left which are not seen by the input users and seen by similar users. May be increasing the number of similar users who are to be considered may give a chance of suggesting an unseen good movie.")
    else:
      pprint(final_movie_list)

  similar_user_list, distance_list = get_similar_users(user_id,n_similar_users)
  weightage_list = distance_list/np.sum(distance_list)
  mov_rtngs_sim_users = users_df.values[similar_user_list]
  movies_list = users_df.columns
  weightage_list = weightage_list[:,np.newaxis] + np.zeros(len(movies_list))
  new_rating_matrix = weightage_list*mov_rtngs_sim_users
  mean_rating_list = new_rating_matrix.sum(axis =0)
  print("")
  print("Movies recommended based on similar users are: ")
  print("")
  filtered_movie_recommendations(n_movies)

In [98]:
recommender_system(543, 15,15)

Movies seen by the User:
['2001: A Space Odyssey (1968)',
 'Aladdin (1992)',
 'Alien (1979)',
 'Aliens (1986)',
 'All Things Fair (1996)',
 'Amadeus (1984)',
 'American President, The (1995)',
 'Apocalypse Now (1979)',
 'Apollo 13 (1995)',
 'Aristocats, The (1970)',
 'Austin Powers: International Man of Mystery (1997)',
 'Babe (1995)',
 'Back to the Future (1985)',
 'Barcelona (1994)',
 'Batman (1989)',
 'Batman Forever (1995)',
 'Batman Returns (1992)',
 'Being There (1979)',
 'Big Blue, The (Grand bleu, Le) (1988)',
 'Birds, The (1963)',
 'Blade Runner (1982)',
 'Blues Brothers, The (1980)',
 'Boot, Das (1981)',
 'Bound (1996)',
 'Brassed Off (1996)',
 'Braveheart (1995)',
 'Brazil (1985)',
 'Bridge on the River Kwai, The (1957)',
 'Bridges of Madison County, The (1995)',
 'Bullets Over Broadway (1994)',
 'Cape Fear (1991)',
 'Caught (1996)',
 'Cemetery Man (Dellamorte Dellamore) (1994)',
 'Cinema Paradiso (1988)',
 'Citizen Kane (1941)',
 'City of Lost Children, The (1995)',
 'Clear