## Name: Kandarp Chaudhary
## Roll No.: D21016

In [1]:
#importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics import jaccard_score

# >> Content Based Recommendation:

## 1) Data preparation:

In [2]:
#importing ratings dataset
df = pd.read_csv("ratings.csv")
df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [3]:
#loading the dataset
data = pd.read_csv("movies.csv")
data.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


* In this dataset the movie title ends with the release year in parentheses. This information can be useful for our recommendations, so we first extract the year and store it in a new column named 'year'.

In [4]:
#Extracting the year from movie title
d = []
for m in data["title"]:
    d.append(m[-5:-1])   # Extracting 4 characters from 5th last character of movie title for each row and appending it in a list

In [5]:
#Adding a new column 'year' containing the release year of each movie
data["year"] = d
data.tail()

Unnamed: 0,movieId,title,genres,year
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,2017
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,2017
9739,193585,Flint (2017),Drama,2017
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,2018
9741,193609,Andrew Dice Clay: Dice Rules (1991),Comedy,1991


In [6]:
len(d), type(data["year"][0])

(9742, str)

* Currently the elements of the year column are strings. Converting the column to integer raised an error because some values cannot be parsed as integers. So the next step checks how many elements of the year column are non-numeric, which happens when the movie title does not contain the release year.

In [7]:
# Extracting index for elements not having numeric value in list d
e = []
for i in range(len(d)):
    try:                   #trying to convert string to integer
        d[i] = int(d[i])
    except:                #if failed to convert to an integer then appending the index in a list
        e.append(i)
len(e)

12

So 12 elements in our year list have non-numeric values.

In [8]:
#Checking the movie titles of these 12 elements
a = []
for i in e:
    a.append(data.iloc[i,1])
a

['Babylon 5',
 'Ready Player One',
 'Hyena Road',
 'The Adventures of Sherlock Holmes and Doctor Watson',
 'Nocturnal Animals',
 'Paterson',
 'Moonlight',
 'The OA',
 'Cosmos',
 'Maria Bamford: Old Baby',
 'Generation Iron 2',
 'Black Mirror']

In [9]:
# Computing the median year over the movies that do have a year in the title
x = []                 #initializing list for storing integer movie years
for i in d:
    if type(i) is int: #if the value is an integer then append it to the list
        x.append(i)
np.median(x)

1999.0

As these 12 movies don't have a year in their title, we will assign the median of the remaining years, 1999, to the year column for these movies.

In [10]:
# Median imputation for year
for i in e:
    d[i] = np.median(x)

In [11]:
data['year'] = d #updating year column in dataframe
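The extraction and imputation steps above can also be sketched in a vectorized form. This is only an illustration on a toy table: the `movies` frame and its four titles are made up for the example, and the regular expression assumes the year appears as four digits in parentheses at the end of the title.

```python
import pandas as pd

# Toy stand-in for the movies table (hypothetical rows, not the full dataset).
movies = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Babylon 5", "Flint (2017)"]
})

# Capture a four-digit year in parentheses at the end of the title, if present.
year = movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)

# Titles without a year become NaN instead of raising during conversion.
year = pd.to_numeric(year, errors="coerce")

# Impute the missing years with the median of the years that were found.
movies["year"] = year.fillna(year.median())

print(movies)
```

With this approach the titles that lack a year never need to be located by hand, since they show up as missing values before imputation.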

In [12]:
data.describe()

Unnamed: 0,movieId,year
count,9742.0,9742.0
mean,42200.353623,1994.620304
std,52160.494854,18.52391
min,1.0,1902.0
25%,3248.25,1988.0
50%,7300.0,1999.0
75%,76232.0,2008.0
max,193609.0,2018.0


In [13]:
data.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995.0
2,3,Grumpier Old Men (1995),Comedy|Romance,1995.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995.0
4,5,Father of the Bride Part II (1995),Comedy,1995.0


* Now, we will make a separate column for each distinct genre and assign a value of 1 or 0 in that genre column based on whether the movie belongs to that genre or not.

In [14]:
#Every genre is separated by a | so we simply have to call the split function on |
data['genres'] = data.genres.str.split('|')
data.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",1995.0
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]",1995.0
2,3,Grumpier Old Men (1995),"[Comedy, Romance]",1995.0
3,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",1995.0
4,5,Father of the Bride Part II (1995),[Comedy],1995.0


In [15]:
#Copying the movie dataframe into a new one so that the original dataframe stays unchanged.
df1 = data.copy()

#For every row in the dataframe, iterate through the list of genres and place a 1 into the corresponding column
for index, row in df1.iterrows():    #Iterate over DataFrame rows as (index, Series) pairs.
    for genre in row['genres']:
        df1.at[index, genre] = 1     #DataFrame.at provides access to a single value for a row,column label pair.
        
#Filling in the NaN values with 0 to show that a movie doesn't have that column's genre
df1 = df1.fillna(0)
df1.head()

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",1995.0,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]",1995.0,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),"[Comedy, Romance]",1995.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",1995.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),[Comedy],1995.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
df1.shape

(9742, 24)
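As a hedged alternative to the row-by-row loop, pandas can build the same kind of 0/1 genre columns in one call. The sketch below uses a made-up two-row `movies` frame and assumes the genres are still pipe-delimited strings, i.e. it would apply before the `split('|')` step.

```python
import pandas as pd

# Hypothetical two-row table with pipe-delimited genre strings.
movies = pd.DataFrame({
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Comedy", "Comedy|Romance"],
})

# One indicator column per distinct genre (1 if the movie has it, else 0).
genre_flags = movies["genres"].str.get_dummies(sep="|")

# Attach the indicator columns to the titles.
encoded = pd.concat([movies[["title"]], genre_flags], axis=1)
print(encoded)
```

The indicator columns come out in alphabetical order rather than in order of first appearance, which does not affect the similarity calculations.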

* Now we create a sub-dataframe containing only the year and the newly created genre columns. We will use these columns to find the similarity between movies.

In [17]:
df2 = df1.iloc[:,3:]
df2.sample(2)

Unnamed: 0,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
6971,2007.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8053,2012.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


## 2) Similarity Matrix:

### i. Cosine similarity

I will use the year and genre dataframe to compute cosine similarities between movies.

In [18]:
# Making the cosine similarity matrix as a DataFrame and labelling its rows & columns with movie titles.
csm = pd.DataFrame(cosine_similarity(df2), index = df1["title"], columns = df1["title"])

In [19]:
csm.head()

title,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995),Heat (1995),Sabrina (1995),Tom and Huck (1995),Sudden Death (1995),GoldenEye (1995),...,Gintama: The Movie (2010),anohana: The Flower We Saw That Day - The Movie (2013),Silver Spoon (2014),Love Live! The School Idol Movie (2015),Jon Stewart Has Left the Building (2015),Black Butler: Book of the Atlantic (2017),No Game No Life: Zero (2017),Flint (2017),Bungo Stray Dogs: Dead Apple (2018),Andrew Dice Clay: Dice Rules (1991)
Toy Story (1995),1.0,1.0,0.999999,0.999999,0.999999,0.999999,0.999999,1.0,0.999999,0.999999,...,0.999999,0.999999,0.999999,0.999999,0.999999,1.0,1.0,0.999999,0.999999,0.999999
Jumanji (1995),1.0,1.0,0.999999,0.999999,0.999999,0.999999,0.999999,1.0,0.999999,0.999999,...,0.999999,0.999999,0.999999,0.999999,0.999999,0.999999,1.0,1.0,0.999999,0.999999
Grumpier Old Men (1995),0.999999,0.999999,1.0,1.0,1.0,0.999999,1.0,0.999999,1.0,0.999999,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
Waiting to Exhale (1995),0.999999,0.999999,1.0,1.0,1.0,0.999999,1.0,0.999999,0.999999,0.999999,...,0.999999,1.0,1.0,0.999999,0.999999,0.999999,1.0,1.0,0.999999,1.0
Father of the Bride Part II (1995),0.999999,0.999999,1.0,1.0,1.0,0.999999,1.0,1.0,1.0,0.999999,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


* As you can see in the above matrix, every value is very close to 1. This happens because the year column holds values between 1902 and 2018 while all other columns hold only 0 or 1, so the year dominates every vector. To resolve this issue I use a matching-based similarity (labelled Jaccard similarity in this notebook), which only checks whether two movies have equal values in each column and therefore ignores the magnitude within each column.
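A small numeric check makes the saturation effect concrete. The two feature vectors below are hypothetical (one year value followed by three genre flags); they share no genre at all, yet their cosine similarity is almost exactly 1 once the year is included.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical feature vectors: [year, genre flag 1, genre flag 2, genre flag 3]
movie_a = np.array([1995.0, 1.0, 1.0, 0.0])
movie_b = np.array([2017.0, 0.0, 0.0, 1.0])

with_year = cosine(movie_a, movie_b)            # dominated by the year values
genres_only = cosine(movie_a[1:], movie_b[1:])  # no shared genre

print(with_year, genres_only)
```

Dropping or rescaling the year column would be another way to avoid this, though that is not the route taken in this notebook.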

### ii. Jaccard similarity

In [20]:
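# Calculating pairwise similarity as 1 - Hamming distance.
jsm = 1 - pairwise_distances(df2, metric = "hamming") # 1 - Hamming distance = share of columns with equal values (simple matching coefficient), used here as the "Jaccard" similarity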
# Converting it to a DataFrame and renaming column & row names with movie title.
jsm = pd.DataFrame(jsm, index=df1["title"], columns=df1["title"])
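For reference, 1 - Hamming distance is usually called the simple matching coefficient: it counts positions where both vectors are 0 as agreement, whereas the textbook Jaccard similarity ignores them. The sketch below compares the two on hypothetical genre flags; the helper names `simple_matching` and `jaccard` are introduced only for this illustration.

```python
import numpy as np

def simple_matching(a, b):
    """Share of positions where the two vectors agree (1 - Hamming distance)."""
    return float(np.mean(a == b))

def jaccard(a, b):
    """Shared 1s divided by the positions where either vector has a 1."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0

# Hypothetical genre flags over six genres; the movies share one genre.
movie_a = np.array([1, 1, 0, 0, 0, 0])
movie_b = np.array([1, 0, 1, 0, 0, 0])

print(simple_matching(movie_a, movie_b))  # 4 of 6 positions agree
print(jaccard(movie_a, movie_b))          # 1 shared genre out of 3 present
```

Because most movies carry only a few of the genres, the matching-based score tends to be high for any pair, which is worth keeping in mind when choosing the `sim` threshold later.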

In [21]:
jsm.head()

title,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995),Heat (1995),Sabrina (1995),Tom and Huck (1995),Sudden Death (1995),GoldenEye (1995),...,Gintama: The Movie (2010),anohana: The Flower We Saw That Day - The Movie (2013),Silver Spoon (2014),Love Live! The School Idol Movie (2015),Jon Stewart Has Left the Building (2015),Black Butler: Book of the Atlantic (2017),No Game No Life: Zero (2017),Flint (2017),Bungo Stray Dogs: Dead Apple (2018),Andrew Dice Clay: Dice Rules (1991)
Toy Story (1995),1.0,0.904762,0.761905,0.714286,0.809524,0.619048,0.761905,0.857143,0.714286,0.714286,...,0.714286,0.714286,0.714286,0.761905,0.666667,0.809524,0.857143,0.666667,0.714286,0.761905
Jumanji (1995),0.904762,1.0,0.761905,0.714286,0.809524,0.714286,0.761905,0.952381,0.809524,0.809524,...,0.619048,0.714286,0.714286,0.761905,0.761905,0.714286,0.761905,0.761905,0.714286,0.761905
Grumpier Old Men (1995),0.761905,0.761905,1.0,0.952381,0.952381,0.761905,1.0,0.809524,0.857143,0.761905,...,0.761905,0.761905,0.857143,0.809524,0.809524,0.761905,0.809524,0.809524,0.761905,0.904762
Waiting to Exhale (1995),0.714286,0.714286,0.952381,1.0,0.904762,0.714286,0.952381,0.761905,0.809524,0.714286,...,0.714286,0.809524,0.904762,0.761905,0.761905,0.714286,0.761905,0.857143,0.714286,0.857143
Father of the Bride Part II (1995),0.809524,0.809524,0.952381,0.904762,1.0,0.809524,0.952381,0.857143,0.904762,0.809524,...,0.809524,0.809524,0.904762,0.857143,0.857143,0.809524,0.857143,0.857143,0.809524,0.952381


In [22]:
# Checking distribution of jaccard similarity matrix
jsm.iloc[:,:10].describe()

title,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995),Heat (1995),Sabrina (1995),Tom and Huck (1995),Sudden Death (1995),GoldenEye (1995)
count,9742.0,9742.0,9742.0,9742.0,9742.0,9742.0,9742.0,9742.0,9742.0,9742.0
mean,0.676751,0.729297,0.802783,0.797797,0.8348,0.750951,0.802783,0.7693,0.815952,0.751576
std,0.077435,0.056148,0.076955,0.083646,0.069775,0.064946,0.076955,0.052402,0.051346,0.057706
min,0.380952,0.428571,0.47619,0.52381,0.52381,0.47619,0.47619,0.47619,0.52381,0.47619
25%,0.619048,0.714286,0.761905,0.761905,0.809524,0.714286,0.761905,0.714286,0.809524,0.714286
50%,0.666667,0.714286,0.809524,0.809524,0.857143,0.761905,0.809524,0.761905,0.809524,0.761905
75%,0.714286,0.761905,0.857143,0.857143,0.857143,0.761905,0.857143,0.809524,0.857143,0.761905
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


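* As seen in the distribution, the Jaccard similarity matrix spreads its values over a much wider range (from about 0.38 to 1.0 in the columns shown) than the cosine similarity matrix, where almost every value is about 1. So we will use the Jaccard similarity matrix to obtain the final recommendations.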

## 3) Movie Recommendation Function:

I am creating a 'recommend_movies' function which takes the user id (u_id), the number of movie recommendations required (n) and the minimum movie similarity required (sim) as arguments and returns the recommended movie titles as output.

The function can be divided into three parts:
1) Identifying movies which the user liked,
2) Finding movies similar to the (up to) 5 movies the user liked, and
3) Removing the movies already watched by the user from the similar movies and returning a random subset of the rest as output.

In [23]:
def recommend_movies (u_id, n=5, sim=0.9):
    
    #1) Identifying the movies which the user liked
    r = max(df.loc [(df.userId==u_id), "rating"])  #extracting maximum rating given by a user from ratings dataset
    #if maximum rating given by user is <4, then take watched movieIDs as liked movieIDs
    if r < 4:
        print("None liked so far!")
        l = list(df.loc[(df.userId==u_id), "movieId"])
    #Making list of movie ids which were 5 rated and 4 rated by user
    else :
        s5 = df.loc[(df.userId==u_id) & (df.rating==5), "movieId"]  #list of movie id having ratings=5
        s4 = df.loc[(df.userId==u_id) & (df.rating==4), "movieId"]  #list of movie id having ratings=4
        np.random.seed(16)                                             #setting a seed value for reproducibility
        #If there are more than 5 movies rated 5, pick 5 of them at random
        if len(s5) > 5:
            l = list (np.random.choice(s5,5,replace = False))
        #If there are fewer than 5 movies rated 4 or 5 in total, take all of them
        elif len(s5) + len(s4) < 5:
            l = list(s5)+list(s4)
        #Otherwise take all the 5-rated movies and fill up to 5 with randomly selected 4-rated movies
        else: 
            l = list (s5) + list (np.random.choice (s4, 5-len(s5),replace = False))
            
    #2) Finding similar movies for each of the 5 liked movies
    sm = []
    for i in l:  
        #Get the movie title for the given movie id
        t = df1.loc[df1.movieId==i, "title"]
        t = np.array(t)[0]
        #Extracting titles of movies whose similarity to t exceeds the threshold sim
        re = list(jsm.loc[t, :][jsm.loc[t, :] > sim].index)
        for j in re:
            sm.append(j)
    
    #3) Finding user watchlist and subtracting it from found similar movies list
    id1=list(df.loc[df.userId==u_id, "movieId"]) #Extracting watched movie ids from ratings data
    watched_user = []
    for j in id1:
        watched_user.append((df1.loc[df1.movieId == j, ["title"]].iloc[0,0])) #appending movie title corresponding to the movieId
    
    r = list(set(sm)-set(watched_user))      #Keeping the similar movies which the given user has not watched yet
    r = np.random.choice(r,n, replace=False) #Randomly picking n movies from r without replacement (raises ValueError if r holds fewer than n movies)
    return(r)
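The final sampling step fails when fewer than `n` unwatched similar movies remain, for example with a high `sim` threshold. A minimal sketch of a guarded version is shown below; `sample_up_to` is a hypothetical helper, not part of the notebook, and sorting the pool is a choice made here so that a fixed seed picks the same titles across sessions (set iteration order for strings can differ between Python runs).

```python
import numpy as np

def sample_up_to(candidates, n, seed=16):
    """Return at most n distinct items from candidates, chosen at random."""
    pool = sorted(candidates)            # fixed order so the seed is reproducible
    if not pool:
        return []                        # nothing left to recommend
    rng = np.random.default_rng(seed)
    k = min(n, len(pool))                # never ask for more than is available
    return rng.choice(pool, size=k, replace=False).tolist()

# Hypothetical candidate set with fewer titles than requested.
picked = sample_up_to({"Movie A", "Movie B", "Movie C"}, n=5)
print(picked)
```

Returning a shorter list is one possible design; raising a clearer error or lowering the threshold automatically would be equally reasonable.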

Let's see the recommendations provided by the function for user 7 with different argument values.

In [24]:
recommend_movies(u_id = 7)

array(['Outlander (2008)', 'Battlestar Galactica: Razor (2007)',
       'Crow: City of Angels, The (1996)', 'Monkey Business (1952)',
       'Screamers (1995)'], dtype='<U158')

In [25]:
recommend_movies(u_id = 7,n = 10)

array(['Outlander (2008)', 'Battlestar Galactica: Razor (2007)',
       'Crow: City of Angels, The (1996)', 'Monkey Business (1952)',
       'Screamers (1995)', 'Ballistic: Ecks vs. Sever (2002)',
       'Re-Animator (1985)',
       'Trip to the Moon, A (Voyage dans la lune, Le) (1902)',
       'New Kids Nitro (2011)', 'Thunderbirds (2004)'], dtype='<U158')

In [26]:
recommend_movies(u_id = 7,n = 8,sim = 0.95)

array(["Kelly's Heroes (1970)",
       'Star Trek III: The Search for Spock (1984)',
       'Thor: Ragnarok (2017)', 'Black Panther (2017)',
       'Star Trek: First Contact (1996)', 'Flash Gordon (1980)',
       'Jupiter Ascending (2015)', 'Star Trek Beyond (2016)'],
      dtype='<U88')

# >> Collaborative Recommendations:

## 1) Data Preparation:

In [27]:
#loading the ratings dataset
df = pd.read_csv("ratings.csv")
df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


Checking how many unique users and movies are there in the dataset

In [28]:
len(np.unique(df[['userId']])), len(np.unique(df[['movieId']]))

(610, 9724)

In [29]:
#loading the movies dataset
data = pd.read_csv("movies.csv")
data.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Checking how many unique movies are there in the dataset.

In [30]:
len(np.unique(data.movieId))

9742

* As seen from the movies and ratings datasets, only 9724 of the 9742 movies have ratings available. This means that collaborative recommendation will never recommend the 18 movies which are not present in the ratings dataset.
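The movies without any rating can be listed directly rather than only counted. The sketch below uses small made-up `movies` and `ratings` tables with the same column names as the real files.

```python
import pandas as pd

# Hypothetical miniature versions of movies.csv and ratings.csv.
movies = pd.DataFrame({"movieId": [1, 2, 3, 4], "title": ["A", "B", "C", "D"]})
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 3],
    "rating": [4.0, 5.0, 3.5],
})

# Keep the movies whose id never appears in the ratings table.
unrated = movies[~movies["movieId"].isin(ratings["movieId"])]
print(unrated)
```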

Creating a userID x movieID matrix such that we can access ratings against each user-movie pair

In [31]:
df1 = df.pivot(index = 'userId', columns ='movieId', values = 'rating')
df1.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


Creating a copy of userID x movieID matrix that would be used for another missing value imputation approach

In [32]:
df2 = df1.copy()

Replacing null values in the userID x movieID matrix with 0, so that we can build a user-user cosine similarity matrix from it. I replace them with 0 rather than with the user's average rating because I want to compare the recommendations obtained this way with those from the mean-centered userID x movieID matrix in later steps.

In [33]:
df1[df1.isnull()] = 0
df1.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 2) Similarity Matrix:

Creating a user-user cosine similarity matrix. I will use this matrix to identify the 5 users most similar to my client user.

In [34]:
csm = pd.DataFrame(cosine_similarity(df1), index = df1.index, columns = df1.index)
csm.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,1.0,0.027283,0.05972,0.194395,0.12908,0.128152,0.158744,0.136968,0.064263,0.016875,...,0.080554,0.164455,0.221486,0.070669,0.153625,0.164191,0.269389,0.291097,0.093572,0.145321
2,0.027283,1.0,0.0,0.003726,0.016614,0.025333,0.027585,0.027257,0.0,0.067445,...,0.202671,0.016866,0.011997,0.0,0.0,0.028429,0.012948,0.046211,0.027565,0.102427
3,0.05972,0.0,1.0,0.002251,0.00502,0.003936,0.0,0.004941,0.0,0.0,...,0.005048,0.004892,0.024992,0.0,0.010694,0.012993,0.019247,0.021128,0.0,0.032119
4,0.194395,0.003726,0.002251,1.0,0.128659,0.088491,0.11512,0.062969,0.011361,0.031163,...,0.085938,0.128273,0.307973,0.052985,0.084584,0.200395,0.131746,0.149858,0.032198,0.107683
5,0.12908,0.016614,0.00502,0.128659,1.0,0.300349,0.108342,0.429075,0.0,0.030611,...,0.068048,0.418747,0.110148,0.258773,0.148758,0.106435,0.152866,0.135535,0.261232,0.060792


Checking how data is distributed in user-user cosine similarity matrix

In [35]:
csm.iloc[:,:10].describe()

userId,1,2,3,4,5,6,7,8,9,10
count,610.0,610.0,610.0,610.0,610.0,610.0,610.0,610.0,610.0,610.0
mean,0.134684,0.052725,0.010084,0.092815,0.124011,0.123589,0.136654,0.154505,0.04409,0.066918
std,0.08387,0.070423,0.042391,0.070298,0.11965,0.126842,0.083581,0.155218,0.055585,0.068791
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.080573,0.0,0.0,0.048826,0.044071,0.039298,0.07795,0.050781,0.0,0.021018
50%,0.12264,0.029018,0.003531,0.077624,0.091391,0.076712,0.127143,0.108616,0.036904,0.04381
75%,0.172671,0.07787,0.008579,0.126982,0.151973,0.159861,0.185984,0.188414,0.072279,0.103943
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


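Finding the 5 users most similar to user 7. After sorting in descending order, the first entry is user 7 itself (similarity 1), so it is skipped.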

In [36]:
u = csm.sort_values(by = [7],ascending = False).index[1:6]

Let's see the cosine similarity between these top 5 similar users and user 7.

In [37]:
csm.loc[u,7]

userId
239    0.357103
399    0.350654
220    0.340868
354    0.334286
438    0.333561
Name: 7, dtype: float64
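Slicing `index[1:6]` relies on the user itself landing in first position after sorting, which may not hold if another user ties at similarity 1. A hedged alternative is to drop the user explicitly before taking the largest values. The helper `top_similar_users` and the 3 x 3 matrix below are made up for illustration.

```python
import pandas as pd

def top_similar_users(sim, user, k=5):
    """Return the k largest similarities to `user`, excluding the user itself."""
    return sim[user].drop(user).nlargest(k)

# Hypothetical symmetric user-user similarity matrix.
sim = pd.DataFrame(
    [[1.0, 0.2, 0.9],
     [0.2, 1.0, 0.4],
     [0.9, 0.4, 1.0]],
    index=[1, 2, 3],
    columns=[1, 2, 3],
)

neighbours = top_similar_users(sim, user=1, k=2)
print(neighbours)
```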

Mean-centering the userID x movieID matrix for each userID, i.e. subtracting each user's average rating from that user's ratings.

In [38]:
df2 = df2.sub(df2.mean(axis=1), axis=0)   #subtracting each user's mean rating (NaNs are skipped) from that user's row
df2.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,-0.366379,,-0.366379,,,-0.366379,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,0.363636,,,,,,,,,,...,,,,,,,,,,


Replacing the remaining null values with 0, which after mean centering corresponds to each user's own mean rating, so that we can build a user-user cosine similarity matrix and a user-user correlation matrix from it.

In [39]:
df2[df2.isnull()] = 0
df2.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,-0.366379,0.0,-0.366379,0.0,0.0,-0.366379,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.363636,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Checking how data is distributed in the mean-centered userID x movieID matrix

In [40]:
df2.iloc[:,:10].describe()

movieId,1,2,3,4,5,6,7,8,9,10
count,610.0,610.0,610.0,610.0,610.0,610.0,610.0,610.0,610.0,610.0
mean,0.110026,-0.009586,-0.020016,-0.012578,-0.041981,0.063284,-0.035474,-0.008197,-0.011946,-0.012189
std,0.488668,0.312009,0.261603,0.131636,0.259795,0.332932,0.286947,0.123147,0.145925,0.354453
min,-2.584034,-2.190283,-2.119681,-2.105263,-2.659292,-1.426087,-2.977427,-2.590909,-1.62069,-1.942122
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,2.152629,1.52,1.8,0.0,1.506369,2.402778,1.098214,0.447368,1.132231,1.739278


Checking the sum of each column and each row of the userID x movieID matrix. As we have mean-centered the ratings per user, the row sums (one per userID) should be very close to 0, while the column sums (one per movieID) need not be.

In [41]:
df2.sum(axis = 0)

movieId
1         67.115851
2         -5.847412
3        -12.209480
4         -7.672312
5        -25.608672
            ...    
193581     0.294776
193583    -0.205224
193585    -0.205224
193587    -0.205224
193609     0.372024
Length: 9724, dtype: float64

In [42]:
df2.sum(axis = 1)

userId
1      4.263256e-14
2      6.217249e-15
3      1.421085e-14
4      4.263256e-14
5      4.440892e-15
           ...     
606   -1.847411e-13
607    6.394885e-14
608   -2.403411e-12
609    3.552714e-15
610    1.312284e-12
Length: 610, dtype: float64
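The mean-centering and zero-filling steps can be checked on a tiny example. The 3 x 3 `ratings` matrix below is hypothetical; it only mirrors the shape of the real userID x movieID matrix, with NaN marking unrated movies.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie ratings; NaN means the user has not rated the movie.
ratings = pd.DataFrame(
    {10: [4.0, np.nan, 5.0], 20: [2.0, 3.0, np.nan], 30: [np.nan, 5.0, 3.0]},
    index=pd.Index([1, 2, 3], name="userId"),
)

user_means = ratings.mean(axis=1)              # per-user mean, NaNs skipped
centered = ratings.sub(user_means, axis=0)     # subtract each user's mean
centered = centered.fillna(0)                  # unrated -> 0, i.e. the user's mean

row_sums = centered.sum(axis=1)                # one value per user
print(centered)
print(row_sums)
```

In this toy case the row sums are exactly 0; with real ratings they are only close to 0 because of floating-point rounding.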

Creating a user-user cosine similarity matrix from the mean-centered ratings. I will use this matrix to identify the 5 users most similar to my client user.

In [43]:
csm1 = pd.DataFrame(cosine_similarity(df2), index = df2.index, columns = df2.index)
csm1.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,1.0,0.001265,0.000553,0.048419,0.021847,-0.045497,-0.0062,0.047013,0.01951,-0.008754,...,0.018127,-0.017172,-0.015221,-0.037059,-0.029121,0.012016,0.055261,0.075224,-0.025713,0.010932
2,0.001265,1.0,0.0,-0.017164,0.021796,-0.021051,-0.011114,-0.048085,0.0,0.003012,...,-0.050551,-0.031581,-0.001688,0.0,0.0,0.006226,-0.020504,-0.006001,-0.060091,0.024999
3,0.000553,0.0,1.0,-0.01126,-0.031539,0.0048,0.0,-0.032471,0.0,0.0,...,-0.004904,-0.016117,0.017749,0.0,-0.001431,-0.037289,-0.007789,-0.013001,0.0,0.01955
4,0.048419,-0.017164,-0.01126,1.0,-0.02962,0.013956,0.058091,0.002065,-0.005874,0.05159,...,-0.037687,0.063122,0.02764,-0.013782,0.040037,0.02059,0.014628,-0.037569,-0.017884,-0.000995
5,0.021847,0.021796,-0.031539,-0.02962,1.0,0.009111,0.010117,-0.012284,0.0,-0.033165,...,0.015964,0.012427,0.027076,0.012461,-0.036272,0.026319,0.031896,-0.001751,0.093829,-0.000278


Checking how data is distributed in user-user cosine similarity matrix

In [44]:
csm1.iloc[:,:10].describe()

userId,1,2,3,4,5,6,7,8,9,10
count,610.0,610.0,610.0,610.0,610.0,610.0,610.0,610.0,610.0,610.0
mean,0.015605,-0.005365,-0.002438,0.002604,0.017619,0.01427,0.024487,0.028104,0.011973,-0.009937
std,0.052842,0.047632,0.044891,0.050042,0.064098,0.059536,0.054187,0.065923,0.04941,0.053461
min,-0.105003,-0.166806,-0.070858,-0.108871,-0.190382,-0.107733,-0.14314,-0.153665,-0.157403,-0.138436
25%,-0.00588,-0.017924,-0.01367,-0.016809,-0.006918,-0.010556,0.000557,-0.000663,0.0,-0.032654
50%,0.012211,0.0,0.0,2.6e-05,0.006508,0.004331,0.021885,0.020416,0.000487,-0.008209
75%,0.035697,0.000552,0.0,0.020005,0.03106,0.025372,0.04514,0.048283,0.023638,0.006355
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Finding the 5 users most similar to user 7.

In [45]:
u = csm1.sort_values(by = [7],ascending = False).index[1:6]
u

Int64Index([296, 434, 590, 370, 75], dtype='int64', name='userId')

Let's see the cosine similarity between these top 5 similar users and user 7.

In [46]:
csm1.loc[u,7]

userId
296    0.212454
434    0.137777
590    0.129029
370    0.128685
75     0.127325
Name: 7, dtype: float64

Creating a user-user correlation matrix. I will use this matrix to identify 5 most similar users to my client user.

In [47]:
csm2 = df2.T.corr()
csm2.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,1.0,0.001264516,0.0005525772,0.048419,0.021847,-0.045497,-0.006199672,0.047013,0.01950985,-0.008754088,...,0.018127,-0.017172,-0.015221,-0.03705875,-0.02912138,0.012016,0.055261,0.075224,-0.02571255,0.010932
2,0.001265,1.0,-1.137081e-18,-0.017164,0.021796,-0.021051,-0.01111357,-0.048085,4.635301e-19,0.003011629,...,-0.050551,-0.031581,-0.001688,1.419593e-20,3.481475e-20,0.006226,-0.020504,-0.006001,-0.060091,0.024999
3,0.000553,-1.137081e-18,1.0,-0.01126,-0.031539,0.0048,2.771264e-18,-0.032471,3.0947370000000002e-18,-1.726421e-18,...,-0.004904,-0.016117,0.017749,-1.2082599999999998e-19,-0.001430628,-0.037289,-0.007789,-0.013001,3.7946319999999995e-19,0.01955
4,0.048419,-0.01716402,-0.01125978,1.0,-0.02962,0.013956,0.05809139,0.002065,-0.005873603,0.05159032,...,-0.037687,0.063122,0.02764,-0.01378212,0.04003747,0.02059,0.014628,-0.037569,-0.01788358,-0.000995
5,0.021847,0.02179571,-0.03153892,-0.02962,1.0,0.009111,0.01011715,-0.012284,3.751729e-20,-0.03316512,...,0.015964,0.012427,0.027076,0.01246135,-0.03627206,0.026319,0.031896,-0.001751,0.09382892,-0.000278


Checking how data is distributed in user-user correlation similarity matrix

In [48]:
csm2.iloc[:,:10].describe()

userId,1,2,3,4,5,6,7,8,9,10
count,609.0,609.0,609.0,609.0,609.0,609.0,609.0,609.0,609.0,609.0
mean,0.015631,-0.005373652,-0.002441949,0.002608,0.017648,0.014293,0.024527,0.028151,0.01199264,-0.009954
std,0.052882,0.04767083,0.04492787,0.050083,0.064147,0.059582,0.054222,0.065967,0.04944822,0.053503
min,-0.105003,-0.1668059,-0.07085816,-0.108871,-0.190382,-0.107733,-0.14314,-0.153665,-0.1574033,-0.138436
25%,-0.005893,-0.01801331,-0.01369111,-0.016824,-0.006971,-0.010566,0.000639,-0.000711,-2.2074639999999996e-19,-0.032659
50%,0.012407,-3.816315e-19,-6.609400999999999e-19,5.3e-05,0.006594,0.004422,0.021941,0.020577,0.0005419576,-0.00825
75%,0.035705,0.0005644992,3.650694e-18,0.020013,0.031084,0.025458,0.04517,0.048414,0.02366397,0.006356
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Finding the 5 users most similar to user 7. The first entry of the sorted index is user 7 itself (similarity 1.0), so it is skipped.

In [49]:
u = csm2.sort_values(by = [7],ascending = False).index[1:6]
u

Int64Index([296, 434, 590, 370, 75], dtype='int64', name='userId')

Let's see the correlation of each of these top 5 similar users with user 7.

In [50]:
csm2.loc[u,7]

userId
296    0.212454
434    0.137777
590    0.129029
370    0.128685
75     0.127325
Name: 7, dtype: float64
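
The slice `index[1:6]` above relies on the user itself sorting into first place. A minimal sketch of an alternative that drops the user explicitly before taking the top k is shown below; the helper name `top_similar_users` and the toy matrix are hypothetical, used only for illustration, and are not part of the notebook.

```python
import pandas as pd

def top_similar_users(sim, user, k=5):
    """Return the k largest similarities to `user`, excluding `user` itself."""
    # sim is assumed to be a square user-user DataFrame with the same
    # user IDs as index and columns (like csm2 in this notebook).
    return sim[user].drop(index=user).nlargest(k)

# Toy 4-user similarity matrix (made-up numbers, not from ratings.csv)
toy = pd.DataFrame(
    [[1.0, 0.2, -0.1, 0.4],
     [0.2, 1.0, 0.3, 0.0],
     [-0.1, 0.3, 1.0, 0.5],
     [0.4, 0.0, 0.5, 1.0]],
    index=[1, 2, 3, 4], columns=[1, 2, 3, 4],
)

neighbours = top_similar_users(toy, user=1, k=2)
print(neighbours)
```

This returns the neighbour IDs and their similarity values in one step, so the separate `.loc[u, 7]` lookup would not be needed.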

* As seen from the outputs of the cosine similarity matrix and the correlation similarity matrix (both computed on the normalized userID x movieID matrix), the similarity values are identical. I didn't know this initially, but if the vectors a and b are mean centered (i.e. have zero mean), then their cosine similarity is the same as their Pearson correlation coefficient.
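
The equivalence can be checked numerically on a small example. The sketch below uses randomly generated vectors (not the ratings data) and a hypothetical helper `cosine`; it compares the cosine similarity of the mean-centered vectors with the Pearson correlation of the original vectors, and also computes the cosine similarity of shifted, non-centered vectors for contrast.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = 0.5 * a + rng.normal(size=50)   # b is partly correlated with a

def cosine(x, y):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Mean-center both vectors, then take the cosine similarity
cos_centered = cosine(a - a.mean(), b - b.mean())

# Pearson correlation of the original vectors
pearson = float(np.corrcoef(a, b)[0, 1])

# Cosine similarity of non-centered (shifted) vectors, for comparison
cos_raw = cosine(a + 3.0, b + 3.0)

print(cos_centered, pearson, cos_raw)
```

The first two printed numbers agree up to floating-point error, while the third generally does not, which is why the choice between cosine and correlation only matters for the matrix that was not normalized.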

## 3) Recommendation Function:

Now I will create a function that takes a user ID (user), the number of movie recommendations required (top) and the type of similarity matrix (sm) as arguments, and returns an array of recommended movie titles based on user-user similarity.

In [51]:
def recommend_movies_2(user, top=10, sm=None):

    # Select the similarity matrix based on the value of the "sm" argument
    if sm == 'standardized_cosine':
        sm = csm1
    elif sm == 'standardized_correlation':
        sm = csm2
    else:
        sm = csm   # default: cosine similarity on the matrix that was not normalized

    # Get the 5 users most similar to the given user; position 0 is skipped because it is the user itself
    su = list(sm.sort_values(by=[user], ascending=False).index[1:6])

    # Build the combined watchlist of the 5 similar users, from which movies will be recommended
    watched_similar = set()     # titles watched by at least one similar user
    for i in su:
        id1 = list(df.loc[df.userId == i, "movieId"])  # movie IDs rated by this similar user in the ratings dataset
        e = list()              # titles for this similar user
        for j in id1:
            e.append(data.loc[data.movieId == j, ["title"]].iloc[0, 0])  # look up the title of each movie ID in the movies dataset
        watched_similar = set(watched_similar).union(set(e))  # add this user's titles to the combined set

    # Build the watchlist of the given user, so that only movies he/she has not seen are recommended
    id1 = list(df.loc[df.userId == user, "movieId"])  # movie IDs rated by the given user in the ratings dataset
    e = list()                  # titles for the given user
    for j in id1:
        e.append(data.loc[data.movieId == j, ["title"]].iloc[0, 0])  # look up the title of each movie ID in the movies dataset
    watched_user = set(e)       # set of titles watched by the given user

    r = list(set(watched_similar) - set(watched_user))  # movies watched by similar users but not by the given user
    np.random.seed(16)
    # Randomly pick "top" movies from r without replacement (raises ValueError if r has fewer than "top" entries)
    r = np.random.choice(r, top, replace=False)
    return r
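
The candidate-building step of the function can also be written without the per-movie lookup loops. The following is only a sketch under assumptions: the helper names `unseen_titles` and `sample_titles` and the toy frames are hypothetical, and it deliberately differs from the function above in two ways. It compares movie IDs instead of titles, and it sorts the candidates and caps the sample size, so it will not reproduce the exact arrays shown in this notebook.

```python
import numpy as np
import pandas as pd

def unseen_titles(ratings, movies, user, similar_users):
    """Titles rated by any similar user but not by `user`, sorted alphabetically."""
    seen_by_similar = set(ratings.loc[ratings["userId"].isin(similar_users), "movieId"])
    seen_by_user = set(ratings.loc[ratings["userId"] == user, "movieId"])
    candidate_ids = list(seen_by_similar - seen_by_user)
    titles = movies.loc[movies["movieId"].isin(candidate_ids), "title"]
    return sorted(titles)   # fixed order, independent of set iteration order

def sample_titles(titles, top=10, seed=16):
    """Randomly pick at most `top` titles without replacement."""
    n = min(top, len(titles))   # avoid asking for more titles than exist
    if n == 0:
        return []
    rng = np.random.default_rng(seed)
    return [str(t) for t in rng.choice(titles, size=n, replace=False)]

# Toy data (made up), shaped like ratings.csv and movies.csv
ratings = pd.DataFrame({"userId": [1, 1, 2, 2, 3],
                        "movieId": [10, 20, 20, 30, 40]})
movies = pd.DataFrame({"movieId": [10, 20, 30, 40],
                       "title": ["A (1990)", "B (1991)", "C (1992)", "D (1993)"]})

candidates = unseen_titles(ratings, movies, user=1, similar_users=[2, 3])
picked = sample_titles(candidates, top=5)
print(candidates, picked)
```

Sorting the candidates before sampling matters because the iteration order of a set of strings can change between Python sessions, in which case a fixed random seed alone does not make the recommendations reproducible.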

15 movie recommendations for user 7 based on the cosine similarity matrix obtained from the normalized userID x movieID matrix

In [52]:
recommend_movies_2(user = 7, top = 15, sm = 'standardized_cosine') 

array(['Mars Attacks! (1996)', 'Clerks (1994)', 'Taxi Driver (1976)',
       'Up (2009)', 'Green Mile, The (1999)', 'Old School (2003)',
       'Lost World: Jurassic Park, The (1997)',
       'Star Trek: Generations (1994)', 'Christmas Story, A (1983)',
       'Tomorrow Never Dies (1997)', 'Quantum of Solace (2008)',
       'Major League (1989)', 'Insomnia (1997)',
       'Lady and the Tramp (1955)', 'Superman II (1980)'], dtype='<U90')

15 movie recommendations for user 7 based on the correlation matrix obtained from the normalized userID x movieID matrix

In [53]:
recommend_movies_2(user = 7, top = 15, sm = 'standardized_correlation') 

array(['Mars Attacks! (1996)', 'Clerks (1994)', 'Taxi Driver (1976)',
       'Up (2009)', 'Green Mile, The (1999)', 'Old School (2003)',
       'Lost World: Jurassic Park, The (1997)',
       'Star Trek: Generations (1994)', 'Christmas Story, A (1983)',
       'Tomorrow Never Dies (1997)', 'Quantum of Solace (2008)',
       'Major League (1989)', 'Insomnia (1997)',
       'Lady and the Tramp (1955)', 'Superman II (1980)'], dtype='<U90')

15 movie recommendations for user 7 based on the cosine similarity matrix obtained from the userID x movieID matrix that was not normalized

In [54]:
recommend_movies_2(user = 7, top = 15) 

array(['Sleepers (1996)', 'No Country for Old Men (2007)',
       'Crocodile Dundee (1986)', 'Replacements, The (2000)',
       'Eternal Sunshine of the Spotless Mind (2004)',
       'Cutting Edge, The (1992)', 'Pursuit of Happyness, The (2006)',
       'Terminal, The (2004)',
       "Pirates of the Caribbean: At World's End (2007)", 'Kinsey (2004)',
       'Fever Pitch (2005)', 'Josie and the Pussycats (2001)',
       'Seven (a.k.a. Se7en) (1995)', 'Guarding Tess (1994)',
       'Tron (1982)'], dtype='<U93')

10 movie recommendations (the default value of top) for user 7 based on the cosine similarity matrix obtained from the userID x movieID matrix that was not normalized

In [55]:
recommend_movies_2(user = 7)

array(['Sleepers (1996)', 'No Country for Old Men (2007)',
       'Crocodile Dundee (1986)', 'Replacements, The (2000)',
       'Eternal Sunshine of the Spotless Mind (2004)',
       'Cutting Edge, The (1992)', 'Pursuit of Happyness, The (2006)',
       'Terminal, The (2004)',
       "Pirates of the Caribbean: At World's End (2007)", 'Kinsey (2004)'],
      dtype='<U93')

### Future scope:

* In the function recommend_movies_2, I have not used the similar users' movie ratings when choosing recommendations. Taking them into account could improve the quality of the recommendations.

* I could also recommend movies that a dissimilar user disliked (i.e. rated below 3). This may or may not work, since some movies are simply bad and would be disliked by almost all users regardless of taste.

* Other types of similarity matrix can be added to recommend_movies_2 to get recommendations based on the new similarity measure.
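
As an illustration of the first point, one possible way to use the similar users' ratings is a similarity-weighted average rating per unseen movie. This is only a sketch, not something implemented in this notebook: the function name `weighted_scores`, the toy ratings and the toy similarity values are all hypothetical, and it assumes the neighbours' similarity weights are positive.

```python
import pandas as pd

def weighted_scores(ratings, similarities, user):
    """Similarity-weighted mean rating of each movie the user has not rated.

    similarities: Series indexed by neighbour userId with positive weights.
    """
    seen = list(set(ratings.loc[ratings["userId"] == user, "movieId"]))
    nb = ratings[ratings["userId"].isin(similarities.index)]
    nb = nb[~nb["movieId"].isin(seen)]            # keep only movies new to the user
    weights = nb["userId"].map(similarities)      # weight of each rating = neighbour similarity
    nb = nb.assign(w=weights, wr=weights * nb["rating"])
    totals = nb.groupby("movieId")[["wr", "w"]].sum()
    return (totals["wr"] / totals["w"]).sort_values(ascending=False)

# Toy data (made up): user 1 has seen movie 10; users 2 and 3 are the neighbours
ratings = pd.DataFrame({"userId":  [1, 2, 2, 3, 3],
                        "movieId": [10, 10, 20, 20, 30],
                        "rating":  [5.0, 4.0, 4.0, 2.0, 5.0]})
similarities = pd.Series({2: 0.6, 3: 0.2})

scores = weighted_scores(ratings, similarities, user=1)
print(scores)
```

In this toy example a movie rated by a single neighbour ends up on top, so in practice a minimum number of neighbour ratings per movie would probably be needed before trusting the score.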