
In [1]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
cd /content/drive/My\ Drive/Recommender_systems/Collaborative_based_f

/content/drive/My Drive/Recommender_systems/Collaborative_based_f


Importing the required libraries.

In [3]:
import os
import sys
import re
import csv
import math
import random
import numpy as np
import pandas as pd
from collections import defaultdict

Here we will use the Surprise library to build the recommender system.
Surprise is a Python scikit for recommender systems.

The Surprise documentation is available at http://surpriselib.com/ and it is good practice to explore a library's documentation before using it.

In [4]:
!pip install surprise



In [5]:
from surprise import Dataset
from surprise import Reader

The data we will use to build the recommender system consists of two files: a ratings file and a movies file.

In [6]:
df_ratings = pd.read_csv("data/ratings.csv")
df_movies = pd.read_csv("data/movies.csv")

In [7]:
df_ratings

,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
...,...,...,...,...
99999,671,6268,2.5,1065579370
100000,671,6269,4.0,1065149201
100001,671,6365,4.0,1070940363
100002,671,6385,2.5,1070979663


The ratings file contains the user IDs, the movie IDs, the rating each user gave to a movie, and a timestamp. It holds roughly 100,000 ratings from 671 unique users.

In [8]:
df_movies

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9120,162672,Mohenjo Daro (2016),Adventure|Drama|Romance
9121,163056,Shin Godzilla (2016),Action|Adventure|Fantasy|Sci-Fi
9122,163949,The Beatles: Eight Days a Week - The Touring Y...,Documentary
9123,164977,The Gay Desperado (1936),Comedy


The movies file contains the movie IDs, the movie titles, and the genres of each movie. This is the data we will use to build our recommender system and generate recommendations.

In user-based collaborative filtering we recommend movies based on the similarity between users. For example, suppose user A likes movies x and y, and some similarity measure finds user A to be similar to user B. The recommender system then treats x and y as good candidates to recommend to user B. (This is the principle of user-based collaborative filtering: recommendations are driven by the rating behaviour of similar users.)

There are many ways to measure similarity, such as Euclidean distance, Pearson correlation, and Jaccard similarity. We will use cosine similarity, which is a common choice for this task.

We will use the data available for each user to compute the similarity between pairs of users: the ratings a user gave to movies form a multidimensional feature vector for that user.
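As a rough illustration of the idea, the sketch below computes the cosine similarity between two users from their rating vectors, restricted to the movies both users rated (Surprise documents its cosine measure as using common items only). The toy ratings and the helper name `cosine_on_common_items` are assumptions made up for this example, not part of the notebook.

```python
import numpy as np

def cosine_on_common_items(ratings_u, ratings_v):
    """Cosine similarity between two users over the items both have rated.

    ratings_u, ratings_v: dicts mapping item id -> rating (hypothetical toy data).
    Returns 0.0 when the users share no rated items.
    """
    common = sorted(set(ratings_u) & set(ratings_v))
    if not common:
        return 0.0
    u = np.array([ratings_u[i] for i in common], dtype=float)
    v = np.array([ratings_v[i] for i in common], dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up ratings for three users
user_a = {"x": 4.0, "y": 5.0, "z": 1.0}
user_b = {"x": 4.0, "y": 5.0, "w": 3.0}
user_c = {"x": 1.0, "y": 5.0}

sim_ab = cosine_on_common_items(user_a, user_b)  # identical ratings on x and y
sim_ac = cosine_on_common_items(user_a, user_c)  # disagreement on x lowers the score
print(sim_ab, sim_ac)
```

Users A and B agree on every co-rated movie, so their similarity is at its maximum, while A and C differ on movie x and score lower.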


In [9]:
#Path to data file
ratings_file_path = "data/ratings.csv"
movies_file_path = "data/movies.csv"

Generating the movies_id_to_name and name_to_movies_id dictionaries.

In [10]:
#Creating two dictionaries: one maps movie id to movie name, the other is the reverse mapping
movies_id_to_name = {}
name_to_movies_id = {}

#Initializing the reader
reader = Reader(line_format="user item rating timestamp", sep=",", skip_lines=1)

with open(movies_file_path, newline="", encoding="ISO-8859-1") as csvfile:
    movieReader = csv.reader(csvfile)
    #Skipping the header line(containing label)
    next(movieReader)
    for row in movieReader:
        movieid = int(row[0])
        moviename = row[1]
        movies_id_to_name[movieid] = moviename
        name_to_movies_id[moviename] = movieid

In [11]:
def load_data(reader):
  #Loading the ratings file(using surprise to load data)
  ratingsDataset = Dataset.load_from_file(ratings_file_path, reader=reader)
  return ratingsDataset

In [12]:
#Getting the movie name using movie id(if the movie id is found return the movie name, otherwise return an empty string)
def getMovieName(movieID):
  if movieID in movies_id_to_name:
    return movies_id_to_name[movieID]
  else:
    return ""


#Getting the movie id using movie name(if the movie name is found return the movie id, otherwise return zero)
def getMovieID(movieName):
  if movieName in name_to_movies_id:
    return name_to_movies_id[movieName]
  else:
    return 0

# User-Based Collaborative Filtering

In [13]:
from surprise import KNNBasic

In [14]:
#Taking the test subject to be the user with id 90
testSubject = "90"
#Will be taking similar users based on this k value
k = 10

#Loading the dataset
data = load_data(reader)
trainSet = data.build_full_trainset()

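#Similarity will be user based(user based collaborative filtering)
#Will be using cosine similarity for similarity evaluation
sim_options = {"name": "cosine","user_based": True}
model = KNNBasic(sim_options=sim_options)
#fit() already computes the similarity matrix; compute_similarities() computes it again
#and returns it, which is why the log message below is printed twice
model.fit(trainSet)
simsMatrix = model.compute_similarities()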

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.


In [15]:
simsMatrix.shape

(671, 671)

This gives a 671 x 671 matrix that contains the similarity between every pair of users in our data.

In [16]:
import heapq
from operator import itemgetter

In [17]:
#Getting the top N users most similar to the test subject, for whom we want to create recommendations
#Alternatively, we could select users above some similarity threshold
testUserInnerID = trainSet.to_inner_uid(testSubject)
similarityRow = simsMatrix[testUserInnerID]

#Collect every other user with their similarity score, then keep the k most similar users(here k = 10)
similarUsers = []
for innerID, score in enumerate(similarityRow):
  if (innerID != testUserInnerID):
    #Appending the user id and similarity score
    similarUsers.append((innerID, score))

kNeighbors = heapq.nlargest(k, similarUsers, key=lambda t: t[1])

In [18]:
kNeighbors

[(0, 1.0),
 (1, 1.0),
 (4, 1.0),
 (10, 1.0),
 (12, 1.0),
 (26, 1.0),
 (32, 1.0),
 (34, 1.0),
 (36, 1.0),
 (37, 1.0)]
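Every neighbour above has a similarity of exactly 1.0. One plausible explanation is that Surprise's cosine measure only looks at co-rated items, and two users who share a single co-rated movie always get a cosine of 1, whatever their ratings. The sketch below illustrates that effect with made-up ratings; the helper `cosine` is hypothetical and not part of the notebook.

```python
import numpy as np

def cosine(u, v):
    """Plain cosine similarity between two equal-length rating vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two users who share exactly one co-rated movie: the ratings disagree strongly,
# yet the cosine is 1.0 because two positive one-dimensional vectors are parallel.
single_overlap = cosine([5.0], [0.5])

# With more co-rated movies the measure starts to discriminate.
three_overlap = cosine([5.0, 1.0, 4.0], [0.5, 5.0, 2.0])

print(single_overlap, three_overlap)
```

If this is what happens here, one option worth trying is the `min_support` key of `sim_options`, which Surprise documents as the minimum number of common items required for the similarity not to be zero.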

In [19]:
#Collecting the movies the similar users rated and adding up the ratings for each item, weighted by user similarity
candidates = defaultdict(float)
for similarUser in kNeighbors:
  innerID = similarUser[0]
  userSimilarityScore = similarUser[1]
  theirRatings = trainSet.ur[innerID]
  #Calculating the candidate scores
  for rating in theirRatings:
    candidates[rating[0]] += (rating[1] / 5.0) * userSimilarityScore

In [20]:
#Building a dictionary of stuff the user has already seen
watched = {}
for itemID, rating in trainSet.ur[testUserInnerID]:
  watched[itemID] = 1
    
#Getting top rated items(movies)
pos = 0
for itemID, ratingSum in sorted(candidates.items(), key=itemgetter(1), reverse=True):
  if not itemID in watched:
    movieID = trainSet.to_raw_iid(itemID)
    print(getMovieName(int(movieID)), ratingSum)
    pos += 1
    if (pos > 10):
        break

Pulp Fiction (1994) 4.3
Schindler's List (1993) 3.5
Forrest Gump (1994) 3.3
Lion King, The (1994) 3.2
Sixth Sense, The (1999) 2.9
Usual Suspects, The (1995) 2.6
American Beauty (1999) 2.6
Dances with Wolves (1990) 2.5
Office Space (1999) 2.5
Braveheart (1995) 2.4000000000000004
Silence of the Lambs, The (1991) 2.3


To summarise, user-based collaborative filtering recommends the top items to a user as follows. We start from a 2D ratings matrix with users along one axis and movies along the other. Treating each user's movie ratings as that user's feature vector, we compute the cosine similarity between every pair of users, which gives a user similarity matrix. To recommend movies to a given user, we select the k users most similar to that user (k limits how many neighbours are used). We then build a candidate dictionary from the movies those similar users rated, where each movie's score is the sum of their normalised ratings weighted by similarity. Finally, we recommend the top n movies by candidate score that the user has not already rated (n is our choice).
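The candidate-scoring step can be sketched on toy data as follows. The neighbours, similarities, and ratings are invented for illustration, and `score_candidates` is a hypothetical helper that mirrors the loop used above rather than any Surprise API.

```python
from collections import defaultdict

# Hypothetical toy data: neighbour -> (similarity to target user, {movie: rating})
neighbours = {
    "u1": (0.9, {"A": 5.0, "B": 3.0}),
    "u2": (0.5, {"A": 4.0, "C": 5.0}),
}
# Movies the target user has already rated
watched_by_target = {"B"}

def score_candidates(neighbours, watched, max_rating=5.0):
    """Sum normalised neighbour ratings weighted by similarity, skipping watched movies."""
    scores = defaultdict(float)
    for similarity, ratings in neighbours.values():
        for movie, rating in ratings.items():
            if movie not in watched:
                scores[movie] += (rating / max_rating) * similarity
    # Highest score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = score_candidates(neighbours, watched_by_target)
print(ranked)
```

Movie A ranks first because both neighbours rated it highly, while B is excluded because the target user has already seen it.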

There are some variations we can apply to get more reasonable results.

In [21]:
#One option is to apply a condition on the candidate scores, for example a minimum score
#Getting top rated items(movies)
for itemID, ratingSum in sorted(candidates.items(), key=itemgetter(1), reverse=True):
  if not itemID in watched and ratingSum > 3:
    movieID = trainSet.to_raw_iid(itemID)
    print(getMovieName(int(movieID)), ratingSum)

Pulp Fiction (1994) 4.3
Schindler's List (1993) 3.5
Forrest Gump (1994) 3.3
Lion King, The (1994) 3.2


These are the movies recommended based on the rating sum. We can choose the rating-sum threshold above which movies are recommended.

# Item-Based Collaborative Filtering

In [22]:
#Taking the test subject to be the user with id 90
testSubject = "90"
#Will be taking similar items based on this k value
k = 10

#Loading the dataset
data = load_data(reader)
trainSet = data.build_full_trainset()

#Similarity will be item based(item based collaborative filtering)
#Will be using cosine similarity for similarity evaluation
sim_options = {"name": "cosine","user_based": False}
model = KNNBasic(sim_options=sim_options)
#fit() already computes the similarity matrix; compute_similarities() computes it again
#and returns it, which is why the log message below is printed twice
model.fit(trainSet)
simsMatrix = model.compute_similarities()

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.


In [23]:
simsMatrix.shape

(9066, 9066)

As we can see, the shape of our similarity matrix has changed because we are now comparing items (movies) instead of users.

One thing to keep in mind with collaborative filtering is that the cost depends on what is being compared. Item-based filtering is often cheaper than user-based filtering in systems where users greatly outnumber the items in the catalog, because the item similarity matrix is then the smaller one. In this dataset the opposite holds: there are 9,066 movies but only 671 users, so the item similarity matrix is far larger than the user similarity matrix.
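For the two shapes reported in this notebook, the size of each similarity matrix can be checked with a little arithmetic. The sketch assumes a dense matrix of 8-byte floats, which is an assumption about storage rather than something the notebook states; `similarity_matrix_bytes` is a hypothetical helper.

```python
def similarity_matrix_bytes(n, bytes_per_entry=8):
    """Memory for a dense n x n similarity matrix (assumes 8-byte floats)."""
    return n * n * bytes_per_entry

# Shapes reported by simsMatrix.shape in this notebook
n_users, n_items = 671, 9066

for label, n in [("user-based", n_users), ("item-based", n_items)]:
    size_mib = similarity_matrix_bytes(n) / 2**20
    print(f"{label}: {n * n:,} entries, about {size_mib:.1f} MiB")
```

Which approach is cheaper therefore depends on whether the dataset has more users or more items.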

In [24]:
#Looking up the test subject, for whom we want to create recommendations
testUserInnerID = trainSet.to_inner_uid(testSubject)

#Get the top K items the test user rated highest
#(alternatively, we could keep every item rated above some threshold)
testUserRatings = trainSet.ur[testUserInnerID]
kNeighbors = heapq.nlargest(k, testUserRatings, key=lambda t: t[1])

In [25]:
#Getting the items similar to the ones the user liked, weighted by the user's rating
candidates = defaultdict(float)
for itemID, rating in kNeighbors:
  similarityRow = simsMatrix[itemID]
  for innerID, score in enumerate(similarityRow):
    candidates[innerID] += score * (rating / 5.0)

In [26]:
#Building a dictionary of stuff the user has already seen
watched = {}
for itemID, rating in trainSet.ur[testUserInnerID]:
  watched[itemID] = 1
    
#Getting top rated items(movies)
pos = 0
for itemID, ratingSum in sorted(candidates.items(), key=itemgetter(1), reverse=True):
  if not itemID in watched:
    movieID = trainSet.to_raw_iid(itemID)
    print(getMovieName(int(movieID)), ratingSum)
    pos += 1
    if (pos > 10):
        break

Decline of Western Civilization, The (1981) 9.18210632723149
Ride with the Devil (1999) 9.179529369863763
Edge of Seventeen (1998) 9.178433446602895
Casualties of War (1989) 9.17310807840759
Devil's Backbone, The (Espinazo del diablo, El) (2001) 9.166616704501587
Shadow, The (1994) 9.166114595042242
Palm Beach Story, The (1942) 9.164458216017712
Concorde: Airport '79, The (1979) 9.1598196284343
Lost World, The (1925) 9.15934984270229
I Stand Alone (Seul contre tous) (1998) 9.158111846647614
Great Mouse Detective, The (1986) 9.157821320052


What we do in item-based collaborative filtering to recommend the top items to a user is somewhat different. We again start from a 2D ratings matrix, now viewed with movies along one axis and users along the other. Treating the ratings each movie received from users as that movie's feature vector, we compute the cosine similarity between every pair of movies, which gives a movie similarity matrix. To recommend movies to a given user, we take the k movies that user rated highest, find the movies similar to them, and build a candidate dictionary in which each similarity is weighted by the user's rating. Based on the candidate scores we then recommend the top n movies the user has not already rated (again, n is our choice).
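The item-based scoring step can be sketched on a toy example. The similarity matrix and ratings below are invented, and `item_based_candidates` is a hypothetical helper that follows the same idea as the notebook's loop, not a Surprise function.

```python
from collections import defaultdict
import numpy as np

movies = ["A", "B", "C", "D"]
# Hypothetical symmetric item-item similarity matrix (rows/columns follow `movies`)
item_sims = np.array([
    [1.0, 0.2, 0.8, 0.1],
    [0.2, 1.0, 0.3, 0.9],
    [0.8, 0.3, 1.0, 0.4],
    [0.1, 0.9, 0.4, 1.0],
])
# Ratings of the target user, keyed by index into `movies`
user_ratings = {0: 5.0, 1: 2.0}

def item_based_candidates(user_ratings, item_sims, top_k=2, max_rating=5.0):
    """Score unseen items by similarity to the user's top-rated items."""
    top_rated = sorted(user_ratings.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    scores = defaultdict(float)
    for item, rating in top_rated:
        for other, sim in enumerate(item_sims[item]):
            if other not in user_ratings:  # skip movies the user already rated
                scores[other] += float(sim) * (rating / max_rating)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = [(movies[i], s) for i, s in item_based_candidates(user_ratings, item_sims)]
print(ranked)
```

Movie C comes out on top because it is very similar to A, the movie the user rated highest.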

Again, there are some variations we can apply to get more reasonable results.

In [27]:
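#Collecting all candidate scores in descending order
lst = []
for _, ratingSum in sorted(candidates.items(), key=itemgetter(1), reverse=True):
  lst.append(ratingSum)
#Position 2% of the way down the ranking; the score found there will be used as a cutoff
pos = 0.02*len(lst)
print(lst[int(pos)])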

9.032537167199374


We can choose pos based on the number of recommendations we want to give to our users.
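A minimal sketch of this cutoff idea on made-up scores is shown below. The helper `score_cutoff` and the toy scores are assumptions for illustration; the helper expects a non-empty list.

```python
def score_cutoff(scores, fraction):
    """Return the score found `fraction` of the way down the descending ranking."""
    ranked = sorted(scores, reverse=True)
    # Clamp the position so that fraction = 1.0 still yields a valid index
    position = min(int(fraction * len(ranked)), len(ranked) - 1)
    return ranked[position]

# Hypothetical candidate scores: 0.0, 0.1, ..., 9.9
toy_scores = [0.1 * i for i in range(100)]

cutoff = score_cutoff(toy_scores, 0.02)
kept = [s for s in toy_scores if s > cutoff]
print(cutoff, len(kept))
```

A smaller fraction keeps fewer candidates, so the fraction acts as a knob for how many recommendations are produced.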

In [28]:
#One option is to keep only the candidates whose rating sum is above the cutoff computed earlier
#Getting top rated items(movies)
for itemID, ratingSum in sorted(candidates.items(), key=itemgetter(1), reverse=True):
  if not itemID in watched and ratingSum > lst[int(pos)]:
    movieID = trainSet.to_raw_iid(itemID)
    print(getMovieName(int(movieID)), ratingSum)

Decline of Western Civilization, The (1981) 9.18210632723149
Ride with the Devil (1999) 9.179529369863763
Edge of Seventeen (1998) 9.178433446602895
Casualties of War (1989) 9.17310807840759
Devil's Backbone, The (Espinazo del diablo, El) (2001) 9.166616704501587
Shadow, The (1994) 9.166114595042242
Palm Beach Story, The (1942) 9.164458216017712
Concorde: Airport '79, The (1979) 9.1598196284343
Lost World, The (1925) 9.15934984270229
I Stand Alone (Seul contre tous) (1998) 9.158111846647614
Great Mouse Detective, The (1986) 9.157821320052
Riding in Cars with Boys (2001) 9.157180223388375
Shopgirl (2005) 9.156001123941994
Notes on a Scandal (2006) 9.156001123941994
Laurel Canyon (2002) 9.155864778192655
Adventures of Pinocchio, The (1996) 9.15485465049468
Suspicion (1941) 9.153171850733909
Mouse That Roared, The (1959) 9.151310535902622
Taking of Pelham One Two Three, The (1974) 9.149544953316555
Three Caballeros, The (1945) 9.148265749590161
Gate of Heavenly Peace, The (1995) 9.1469940

# Using KNN Recommenders for Generating Predictions

With user-based KNN, we will try to predict the rating that a user would give to a movie they have not yet seen.

The procedure is as follows. First, among the users most similar to the user under subject, we find those who have rated the movie under subject, that is, the movie we are considering recommending. We then predict the rating as the average of their ratings weighted by similarity score.

Generating the similarity matrix.

In [29]:
#Taking the test subject to be the user with id 90
testSubject = "90"
testMovieName = "Toy Story (1995)"
#Will be taking similar users based on this k value
k = 30

#Loading the dataset
data = load_data(reader)
trainSet = data.build_full_trainset()

#Similarity will be user based(user based collaborative filtering)
#Will be using cosine similarity for similarity evaluation
sim_options = {"name": "cosine","user_based": True}
model = KNNBasic(sim_options=sim_options)
model.fit(trainSet)
simsMatrix = model.compute_similarities()

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.


In [30]:
testMovieID = getMovieID(testMovieName)
testMovieID

1

The next step is to find the similar users who have rated the movie under subject.

In [31]:
#Getting the top N users most similar to the test subject, for whom we want to predict a rating
#Alternatively, we could select users above some similarity threshold
testUserInnerID = trainSet.to_inner_uid(testSubject)
similarityRow = simsMatrix[testUserInnerID]

#Collect every other user with their similarity score, then keep the k most similar users(here k = 30)

similarUsers = []
for innerID, score in enumerate(similarityRow):
  if (innerID != testUserInnerID):
    #Appending the user id and similarity score
    similarUsers.append((innerID, score))

#Selecting top 30 similar users
kNeighbors = heapq.nlargest(k, similarUsers, key=lambda t: t[1])

In [32]:
testUserInnerID

89

In [33]:
kNeighbors

[(0, 1.0),
 (1, 1.0),
 (4, 1.0),
 (10, 1.0),
 (12, 1.0),
 (26, 1.0),
 (32, 1.0),
 (34, 1.0),
 (36, 1.0),
 (37, 1.0),
 (38, 1.0),
 (39, 1.0),
 (45, 1.0),
 (49, 1.0),
 (51, 1.0),
 (52, 1.0),
 (59, 1.0),
 (61, 1.0),
 (64, 1.0),
 (84, 1.0),
 (95, 1.0),
 (97, 1.0),
 (100, 1.0),
 (108, 1.0),
 (121, 1.0),
 (123, 1.0),
 (128, 1.0),
 (132, 1.0),
 (139, 1.0),
 (140, 1.0)]

In [35]:
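#Getting all the similar users who rated the movie under subject
UsersRated = []
def getUsersRated(neighborsList,df_rating,testMovieID):
  for record in neighborsList:
    #Convert Surprise's inner user id back to the raw userId used in the ratings file
    #(adding 1 to the inner id only works when users appear in the file in id order)
    userid = int(trainSet.to_raw_uid(record[0]))
    df_new = df_rating[df_rating["userId"] == userid]
    if not df_new[df_new["movieId"] == testMovieID].empty:
      UsersRated.append(record)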

In [36]:
getUsersRated(kNeighbors,df_ratings,testMovieID)

In [37]:
UsersRated

[(12, 1.0), (36, 1.0), (121, 1.0), (123, 1.0)]

Now that we have our similar users, we calculate the rating prediction for the movie under subject by the user under subject.

In [38]:
def getRatingUserID(user_id, movie_id, df_rating):
  Allratings = df_rating[df_rating["userId"] == user_id]
  selectedRatings = Allratings[Allratings["movieId"] == movie_id]
  rating = selectedRatings["rating"].values
  rating = rating.tolist()
  return rating[0]

In [40]:
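weightedRatingTotal = 0
similarityTotal = 0
for rec in UsersRated:
  #Convert Surprise's inner user id back to the raw userId used in df_ratings
  uID = int(trainSet.to_raw_uid(rec[0]))
  score = rec[1]
  rating = getRatingUserID(uID,testMovieID,df_ratings)
  weightedRatingTotal += rating*score
  similarityTotal += score

#Similarity-weighted average: divide by the sum of the weights, not by the number of neighbours
#(both give the same result here because every similarity is 1.0)
ratingPrediction = weightedRatingTotal/similarityTotal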

In [41]:
print("The rating prediction by user under subject for movie under subject is: ",np.floor(ratingPrediction))
#Taking the floor or ceil of a rating prediction is not good practice; it is used here for simplicity and should be revisited in future

The rating prediction by user under subject for movie under subject is:  4.0


So the rating prediction for Toy Story by user 90 is 4.125, or approximately 4 after taking the floor (which, again, is not good practice). We can therefore recommend the movie to the user. We can also set a threshold on the rating prediction, recommending a movie only when the prediction is above that value.
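The prediction step can be written as a similarity-weighted average, sketched below on made-up neighbours. `predict_rating` is a hypothetical helper, and the (similarity, rating) pairs are not taken from the dataset.

```python
def predict_rating(neighbour_ratings):
    """Similarity-weighted average of neighbour ratings.

    neighbour_ratings: list of (similarity, rating) pairs for the neighbours
    who rated the target movie. Returns None if no usable neighbour exists.
    """
    total_weight = sum(sim for sim, _ in neighbour_ratings)
    if total_weight <= 0:
        return None
    return sum(sim * rating for sim, rating in neighbour_ratings) / total_weight

# Hypothetical neighbours as (similarity, rating) pairs
equal_weights = [(1.0, 4.0), (1.0, 5.0), (1.0, 3.0)]
mixed_weights = [(0.9, 5.0), (0.1, 1.0)]

equal_prediction = predict_rating(equal_weights)
mixed_prediction = predict_rating(mixed_weights)
print(equal_prediction, mixed_prediction)
```

With equal similarities the result is a plain mean. With unequal similarities the more similar neighbour dominates, which is why the sum of the weights, rather than the neighbour count, belongs in the denominator.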