#  Collaborative filtering
Collaborative filtering is based on the assumption that people who agreed in the past will agree in the future, and that they will like similar kinds of items as they liked in the past.

Collaborative filtering is a technique that can filter out items that a user might like on the basis of reactions by similar users.

It works by searching a large group of people and finding a smaller set of users with tastes similar to a particular user.  
It looks at the items they like and combines them to create a ranked list of suggestions.

1. User-User collaborative filtering
This algorithm first finds the similarity score between users. Based on this similarity score, it then picks out the most similar users and recommends products which these similar users have liked or bought previously.  
-- 
--
2. Item-Item collaborative filtering
In this algorithm, we compute the similarity between each pair of items.
we  find the similarity between each movie pair and based on that, we  recommend similar movies which are liked by the users in the past.



In [131]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors


In [132]:
overall_stats = pd.read_csv('ml-100k/u.info', header=None)
print("Details of users, items and ratings involved in the loaded movielens dataset: ",list(overall_stats[0]))

## same item id is same as movie id, item id column is renamed as movie id
column_names1 = ['user id','movie id','rating','timestamp']
dataset = pd.read_csv('ml-100k/u.data', sep='\t',header=None,names=column_names1)

print("|| Length: ", len(dataset),"|| Max Movie ID: " , max(dataset['movie id']), "|| Min Movie ID:", min(dataset['movie id']) )

d = 'movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western'
column_names2 = d.split(' | ')

# Load Data
items_dataset = pd.read_csv('ml-100k/u.item', sep='|',header=None,names=column_names2,encoding='latin-1')

movie_dataset = items_dataset[['movie id','movie title']]

## looking at length of original items_dataset and length of unique combination of rows in items_dataset after removing movie id column
print( "|| All items: ",len(items_dataset) ,"|| Unique: "  , len(items_dataset.groupby(by=column_names2[1:])) )
# We can see there are 18 extra movie id's for already mapped movie title
# and the same duplicate movie id is assigned to the user in the user-item dataset.

# Merging required datasets

merged_dataset = pd.merge(dataset, movie_dataset, how='inner', on='movie id')

print("\n________\n")
print("Merged Dataset")
#Merged Dataset
print(merged_dataset[(merged_dataset['movie title'] == 'Chasing Amy (1997)') & (merged_dataset['user id'] == 894)])
print("\n________\n")
print("Refined Dataset")

# Refined Dataset
refined_dataset = merged_dataset.groupby(by=['user id','movie title'], as_index=False).agg({"rating":"mean"})
print(refined_dataset.head(3))





Details of users, items and ratings involved in the loaded movielens dataset:  ['943 users', '1682 items', '100000 ratings']
|| Length:  100000 || Max Movie ID:  1682 || Min Movie ID: 1
|| All items:  1682 || Unique:  1664

________

Merged Dataset
       user id  movie id  rating  timestamp         movie title
4800       894       246       4  882404137  Chasing Amy (1997)
22340      894       268       3  879896041  Chasing Amy (1997)

________

Refined Dataset
   user id                          movie title  rating
0        1                101 Dalmatians (1996)     2.0
1        1                  12 Angry Men (1957)     5.0
2        1  20,000 Leagues Under the Sea (1954)     3.0


# Training KNN model to build item-based collaborative Recommender System

### Movie Recommendation using KNN with Input as User id, Number of similar users should the model pick and Number of movies you want to get recommended

In [133]:
# pivot and create movie-user matrix
user_to_movie_df = refined_dataset.pivot(
    index='user id',
     columns='movie title',
      values='rating').fillna(0)

user_to_movie_df.head()

movie title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,2.0,5.0,0.0,0.0,3.0,4.0,0.0,0.0,...,0.0,0.0,0.0,5.0,3.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,2.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,4.0,0.0


In [134]:
# transform matrix to scipy sparse matrix 
user_to_movie_sparse_df = csr_matrix(user_to_movie_df.values) 
# Sparse data structures allow us to store only non-zero values assuming the rest of them are zeros
user_to_movie_sparse_df

<943x1664 sparse matrix of type '<class 'numpy.float64'>'
	with 99693 stored elements in Compressed Sparse Row format>

Fitting K-Nearest Neighbours model to the scipy sparse matrix

In [135]:
knn_model = NearestNeighbors(metric='cosine', algorithm='brute')
knn_model.fit(user_to_movie_sparse_df)


NearestNeighbors(algorithm='brute', metric='cosine')

In [136]:
## function to find top n similar users of the given input user 
def get_similar_users(user, n = 5):
  ## input to this function is the user and number of top similar users you want.

  knn_input = np.asarray([user_to_movie_df.values[user-1]])  #.reshape(1,-1) # convert to array

  distances, indices = knn_model.kneighbors(knn_input, n_neighbors=n+1)
  
  print("Top",n,"users who are very much similar to the User-",user, "are: ")
  print(" ")
  for i in range(1,len(distances[0])):
    print(i,". User:", indices[0][i]+1, "separated by distance of",distances[0][i])
  return indices.flatten()[1:] + 1, distances.flatten()[1:]
# flatten a matrix to one dimension.

Specify User id and Number of similar users we want to consider here

In [137]:
# All this will be currently done for only one User
from pprint import pprint

user_id = 728

print(" Some movies seen by the User:")
pprint(list(refined_dataset[refined_dataset['user id'] == user_id]['movie title'])[:5])
similar_user_list, distance_list = get_similar_users(user_id,4)


 Some movies seen by the User:
['Birdcage, The (1996)',
 'Broken Arrow (1996)',
 'Cold Comfort Farm (1995)',
 'Courage Under Fire (1996)',
 "Dante's Peak (1997)"]
Top 4 users who are very much similar to the User- 728 are: 
 
1 . User: 277 separated by distance of 0.5756316747111707
2 . User: 722 separated by distance of 0.5936102633052809
3 . User: 678 separated by distance of 0.5988291429628039
4 . User: 905 separated by distance of 0.6065307840088783


With the help of the KNN model built, we could get desired number of top similar users.

Now we will have to pick the top movies to recommend.

One way would be by taking the average of the existing ratings given by the similar users and picking the top 10 or 15 movies to recommend to our current user.

But I feel recommendation would be more effective if we define weights to ratings by each similar user based on the thier distance from the input user. Defining these weights would give us the accurate recommendations by eliminating the chance of decision manipulation by the users who are relatively very far from the input user.



In [138]:
similar_user_list, distance_list


(array([277, 722, 678, 905], dtype=int64),
 array([0.57563167, 0.59361026, 0.59882914, 0.60653078]))

In [139]:
# for weightage = individual / sum 
weightage_list = distance_list/np.sum(distance_list)
weightage_list


array([0.24241187, 0.24998307, 0.25218086, 0.2554242 ])

In [140]:
# get the movie ratings given by  similar user from the our movie-user dataframe 
mov_rtngs_sim_users = user_to_movie_df.values[similar_user_list]
mov_rtngs_sim_users


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [141]:
# get list of movies 
movies_list = user_to_movie_df.columns
movies_list


Index([''Til There Was You (1997)', '1-900 (1994)', '101 Dalmatians (1996)',
       '12 Angry Men (1957)', '187 (1997)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '3 Ninjas: High Noon At Mega Mountain (1998)', '39 Steps, The (1935)',
       ...
       'Yankee Zulu (1994)', 'Year of the Horse (1997)', 'You So Crazy (1994)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Poisoner's Handbook, The (1995)',
       'Zeus and Roxanne (1997)', 'unknown',
       'Á köldum klaka (Cold Fever) (1994)'],
      dtype='object', name='movie title', length=1664)

In [142]:
# check values 
print("Weightage list shape:", len(weightage_list))
print("mov_rtngs_sim_users shape:", mov_rtngs_sim_users.shape)
print("Number of movies:", len(movies_list))


Weightage list shape: 4
mov_rtngs_sim_users shape: (4, 1664)
Number of movies: 1664


Broadcasting weightage matrix to similar user rating matrix. so that it gets compatible for matrix operations

In [143]:
weightage_list = weightage_list[:,np.newaxis] + np.zeros(len(movies_list))
weightage_list.shape



(4, 1664)

In [144]:

new_rating_matrix = weightage_list*mov_rtngs_sim_users
mean_rating_list = new_rating_matrix.sum(axis =0)
mean_rating_list.shape


(1664,)

In [145]:
from pprint import pprint
# Pretty print
def recommend_movies(n):
  n = min(len(mean_rating_list),n)

  pprint(list(movies_list[np.argsort(mean_rating_list)[::-1][:n]]))


In [146]:
print("Movies recommended based on similar users are: ")
recommend_movies(10)


Movies recommended based on similar users are: 
['English Patient, The (1996)',
 'Monty Python and the Holy Grail (1974)',
 'Princess Bride, The (1987)',
 'Star Wars (1977)',
 'Empire Strikes Back, The (1980)',
 'Saint, The (1997)',
 'Scream (1996)',
 'Apollo 13 (1995)',
 'Fargo (1996)',
 'Dead Man Walking (1995)']


Drawbacks:

1. But this recommendation system has a drawback, it also recommends movies which are already seen by the given input User.

2. And also there is a possibility of recommending the movies which are not at all seen by any of the similar users.

Above drawbacks are addressed and a new recommender system with modification is built



Below function is defined to remove the movies which are already seen the current user and not at all seen by any of the similar users

In [147]:
def filtered_movie_recommendations(n):
  
  # Sorting the movies and storing them so we can create list of movies user has not seen 

  first_zero_index = np.where(mean_rating_list == 0)[0][-1]
  sortd_index = np.argsort(mean_rating_list)[::-1]
  sortd_index = sortd_index[:list(sortd_index).index(first_zero_index)]
  
  # creating a filtered similar movies based on similar users
  n = min(len(sortd_index),n) 
  movies_watched = list(refined_dataset[refined_dataset['user id'] == user_id]['movie title'])
  filtered_movie_list = list(movies_list[sortd_index])
  
  count = 0
  final_movie_list = []
  
  for i in filtered_movie_list:
    if i not in movies_watched:
      count+=1
      final_movie_list.append(i)
    if count == n:
      break
  if count == 0:
    print("There are no movies left which are not seen by the input users and seen by similar users. May be increasing the number of similar users who are to be considered may give a chance of suggesting an unseen good movie.")
  else:
    pprint(final_movie_list)


In [148]:
filtered_movie_recommendations(10)


['Monty Python and the Holy Grail (1974)',
 'Princess Bride, The (1987)',
 'Star Wars (1977)',
 'Empire Strikes Back, The (1980)',
 'Scream (1996)',
 'Apollo 13 (1995)',
 'Dead Man Walking (1995)',
 'Raiders of the Lost Ark (1981)',
 'Wrong Trousers, The (1993)',
 'Contact (1997)']


Giving Input as User id, Number of similar Users to be considered, Number of top movie we want to recommend

In [164]:
from pprint import pprint

def recommender_system(user_id, n_similar_users, n_movies): #, user_to_movie_df, knn_model):
  
  #print("Movie seen by the User:")
  #pprint(list(refined_dataset[refined_dataset['user id'] == user_id]['movie title'])) # Movies Seen by the User
  print("")

  # def get_similar_users(user, user_to_movie_df, knn_model, n = 5):
  def get_similar_users(user, n = 5):
    
    knn_input = np.asarray([user_to_movie_df.values[user-1]])
    
    distances, indices = knn_model.kneighbors(knn_input, n_neighbors=n+1)
    
    print("Top",n,"users who are very much similar to the User-",user, "are: ")
    print(" ")

    for i in range(1,len(distances[0])):
      print(i,". User:", indices[0][i]+1, "separated by distance of",distances[0][i])
    print("")
    return indices.flatten()[1:] + 1, distances.flatten()[1:]


  def filtered_movie_recommendations(n = 10):
  
    first_zero_index = np.where(mean_rating_list == 0)[0][-1]
    sortd_index = np.argsort(mean_rating_list)[::-1]
    sortd_index = sortd_index[:list(sortd_index).index(first_zero_index)]
    n = min(len(sortd_index),n)
    movies_watched = list(refined_dataset[refined_dataset['user id'] == user_id]['movie title'])
    filtered_movie_list = list(movies_list[sortd_index])
    count = 0
    final_movie_list = []
    for i in filtered_movie_list:
      if i not in movies_watched:
        count+=1
        final_movie_list.append(i)
      if count == n:
        break
    if count == 0:
      print("There are no movies left which are not seen by the input users and seen by similar users. May be increasing the number of similar users who are to be considered may give a chance of suggesting an unseen good movie.")
    else:
      pprint(final_movie_list)

  similar_user_list, distance_list = get_similar_users(user_id,n_similar_users)
  
  weightage_list = distance_list/np.sum(distance_list)
  mov_rtngs_sim_users = user_to_movie_df.values[similar_user_list]
  movies_list = user_to_movie_df.columns
  weightage_list = weightage_list[:,np.newaxis] + np.zeros(len(movies_list))
  new_rating_matrix = weightage_list*mov_rtngs_sim_users
  mean_rating_list = new_rating_matrix.sum(axis =0)
  
  print("")
  print("Movies recommended based on similar users are: ")
  print("")

  filtered_movie_recommendations(n_movies)


In [166]:
print("Enter user id")
user_id= 672
print("number of similar users to be considered")
sim_users = 5
print("Enter number of movies to be recommended:")
n_movies = 10
recommender_system(user_id,sim_users,n_movies)
# recommender_system(300, 15,15)


Enter user id
number of similar users to be considered
Enter number of movies to be recommended:

Top 5 users who are very much similar to the User- 672 are: 
 
1 . User: 438 separated by distance of 0.5351461971750443
2 . User: 413 separated by distance of 0.6129753242002371
3 . User: 192 separated by distance of 0.6434125874861286
4 . User: 735 separated by distance of 0.6489220797772721
5 . User: 869 separated by distance of 0.6529331361687787


Movies recommended based on similar users are: 

['Fargo (1996)',
 'Chasing Amy (1997)',
 'Scream (1996)',
 'Leaving Las Vegas (1995)',
 'English Patient, The (1996)',
 'Titanic (1997)',
 'Contact (1997)',
 'Air Force One (1997)',
 'Seven Years in Tibet (1997)',
 'Pillow Book, The (1995)']


### Movie Recommendation using KNN with Input as Movie Name and Number of movies you want to get recommended:

In [151]:
# pivot and create movie-user matrix
movie_to_user_df = refined_dataset.pivot(
     index='movie title',
   columns='user id',
      values='rating').fillna(0)

movie_to_user_df.head()

user id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
movie title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'Til There Was You (1997),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1-900 (1994),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
101 Dalmatians (1996),2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,2.0,0.0,0.0,2.0,4.0,0.0,0.0,0.0,0.0,0.0
12 Angry Men (1957),5.0,0.0,0.0,0.0,0.0,4.0,4.0,0.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
187 (1997),0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [152]:
# transform matrix to scipy sparse matrix
movie_to_user_sparse_df = csr_matrix(movie_to_user_df.values)
movie_to_user_sparse_df

<1664x943 sparse matrix of type '<class 'numpy.float64'>'
	with 99693 stored elements in Compressed Sparse Row format>

In [153]:
movies_list = list(movie_to_user_df.index)
movies_list[:10]

["'Til There Was You (1997)",
 '1-900 (1994)',
 '101 Dalmatians (1996)',
 '12 Angry Men (1957)',
 '187 (1997)',
 '2 Days in the Valley (1996)',
 '20,000 Leagues Under the Sea (1954)',
 '2001: A Space Odyssey (1968)',
 '3 Ninjas: High Noon At Mega Mountain (1998)',
 '39 Steps, The (1935)']

In [154]:
movie_dict = {movie : index for index, movie in enumerate(movies_list)}
#print(movie_dict)

In [155]:
case_insensitive_movies_list = [i.lower() for i in movies_list]

In [156]:
knn_movie_model = NearestNeighbors(metric='cosine', algorithm='brute')
knn_movie_model.fit(movie_to_user_sparse_df)

NearestNeighbors(algorithm='brute', metric='cosine')

In [157]:
## function to find top n similar users of the given input user 
def get_similar_movies(movie, n = 10):
  ## input to this function is the movie and number of top similar movies you want.
  index = movie_dict[movie]
  knn_input = np.asarray([movie_to_user_df.values[index]])
  n = min(len(movies_list)-1,n)
  distances, indices = knn_movie_model.kneighbors(knn_input, n_neighbors=n+1)
  
  print("Top",n,"movies which are very much similar to the Movie-",movie, "are: ")
  print(" ")
  for i in range(1,len(distances[0])):
    print(movies_list[indices[0][i]])
  

In [158]:
from pprint import pprint
movie_name = '101 Dalmatians (1996)'

get_similar_movies(movie_name,15)

Top 15 movies which are very much similar to the Movie- 101 Dalmatians (1996) are: 
 
Jack (1996)
Twister (1996)
Willy Wonka and the Chocolate Factory (1971)
Independence Day (ID4) (1996)
Toy Story (1995)
Father of the Bride Part II (1995)
Hunchback of Notre Dame, The (1996)
Lion King, The (1994)
Mrs. Doubtfire (1993)
Jungle Book, The (1994)
Grumpier Old Men (1995)
Mission: Impossible (1996)
Mr. Holland's Opus (1995)
Homeward Bound II: Lost in San Francisco (1996)
Dragonheart (1996)


Defining a function which outputs movie names as suggestion when the user mis spells the movie name. User might have intended to type any of these movie names.



In [159]:
# function which takes input and returns suggestions for the user

def get_possible_movies(movie):

    temp = ''
    possible_movies = case_insensitive_movies_list.copy()
    for i in movie :
      out = []
      temp += i
      for j in possible_movies:
        if temp in j:
          out.append(j)
      if len(out) == 0:
          return possible_movies
      out.sort()
      possible_movies = out.copy()

    return possible_movies

This function provides user with movie name suggestions if movie name is mis-spelled or Recommends similar movies to the input movie if the movie name is valid

In [160]:
class invalid(Exception):
    pass

def spell_correction():
    
    try:

      movie_name = input("Enter the Movie name: ")
      movie_name_lower = movie_name.lower()
      if movie_name_lower not in case_insensitive_movies_list :
        raise invalid
      else :
        # movies_list[case_insensitive_country_names.index(movie_name_lower)]
        num_recom = int(input("Enter Number of movie recommendations needed: "))
        get_similar_movies(movies_list[case_insensitive_movies_list.index(movie_name_lower)],num_recom)

    except invalid:

      possible_movies = get_possible_movies(movie_name_lower)

      if len(possible_movies) == len(movies_list) :
        print("Movie name entered is does not exist in the list ")
      else :
        indices = [case_insensitive_movies_list.index(i) for i in possible_movies]
        print("Entered Movie name is not matching with any movie from the dataset . Please check the below suggestions :\n",[movies_list[i] for i in indices])
        spell_correction()


In [161]:
spell_correction()


Movie name entered is does not exist in the list 


Let's now look at how sparse the movie-user matrix is by calculating percentage of zero values in the data.


In [162]:

# calcuate total number of entries in the movie-user matrix
num_entries = movie_to_user_df.shape[0] * movie_to_user_df.shape[1]
# calculate total number of entries with zero values
num_zeros = (movie_to_user_df==0).sum(axis=1).sum()
# calculate ratio of number of zeros to number of entries
ratio_zeros = num_zeros / num_entries
print('There is about {:.2%} of ratings in our data is missing'.format(ratio_zeros))


There is about 93.65% of ratings in our data is missing
