# Recommendation System by Content-Based Filtering
This method uses item features to recommend items similar to those the user already likes, based on the user's previous actions or explicit feedback. For example, given a user's ratings of different movies, the system recommends films whose genres (e.g. comedy or thriller) match the genres the user rated most highly.

In [1]:
#Dataframe manipulation library
import pandas as pd
#Math functions, we'll only need the sqrt function so let's import only that
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
#Storing the movie information into a pandas dataframe
movies_df = pd.read_csv('Movie.csv')

#Storing the user information into a pandas dataframe
ratings_df = pd.read_csv('ratings.csv')

#Displaying the dataframe; Jupyter shows the first and last rows of a long dataframe.
movies_df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [3]:
#Number of ratings, followed by the first five rows
print(len(ratings_df))
ratings_df.head()

100836


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


# Preprocessing

In [4]:
#Using regular expressions to find a year stored between parentheses
#We specify the parentheses so we don't conflict with movies that have years in their titles
"""
Here we extract the release year of each movie from the 'title' column using title.str.extract(), which returns
the first match only (one match per title is all we need). The pattern matches four consecutive digits between
parentheses, e.g. (1995). The result is stored in a new column named 'year'.
"""
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)

#Removing the parentheses with a similar call on the 'year' column
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)

"""So far, we have extracted the years from the movie titles and copied them into a new column named 'year'. Now we remove
the years from the 'title' column by finding them and replacing them with the empty string ''."""
#Removing the years from the 'title' column
#regex=True is required because recent pandas versions treat the pattern as a literal string by default
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)

"""Whitespace counts as part of a Python string, so we remove it from the beginning and end of every string in the
'title' column with strip(). apply() calls the lambda once for every entry of the column, much like a 'for' loop."""
#Applying the strip function to get rid of any ending whitespace characters that may have appeared
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())

movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
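
The year extraction and title clean-up can be checked on a small hand-made frame before running them on the full dataset. The sketch below is only an illustration: the three sample titles are made up, and it uses a single capture group instead of the two-step extraction in the cell above.

```python
import pandas as pd

# Hypothetical sample; the notebook itself reads Movie.csv instead.
sample = pd.DataFrame(
    {"title": ["Toy Story (1995)", "Blade Runner 2049 (2017)", "Untitled Short"]}
)

# Capture only the four digits that sit inside parentheses.
sample["year"] = sample["title"].str.extract(r"\((\d{4})\)", expand=False)

# regex=True is passed explicitly because the default of str.replace
# differs between pandas versions.
sample["title"] = (
    sample["title"].str.replace(r"\(\d{4}\)", "", regex=True).str.strip()
)

print(sample)
```

A title without a parenthesised year gets a missing value in `year`, and a number that is part of the title (such as 2049) is left untouched because it is not wrapped in parentheses.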


In [5]:
"""
We need the genres of each movie to find similar movies, so each genre must become a separate feature.
As a first step, we split the 'genres' string of each movie with str.split('|'), since '|' is the separator.
"""
#Every genre is separated by a | so we simply have to call the split function on |
movies_df['genres'] = movies_df.genres.str.split('|')
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995


In [6]:
#Copying the movie dataframe into a new one since we won't need to use the genre information in our first case.
moviesWithGenres_df = movies_df.copy()

"""
We create one column per genre, so that every genre becomes a feature. This is a form of one-hot encoding:
if a movie has only two genres, e.g. Action and Comedy, it gets 1 in those two columns and 0 in all other genre columns.
We use the pandas iterrows() method, which returns an iterator over the DataFrame that yields an (index, row) pair
for each row.
For each movie, we look at its 'genres' list and, for each genre in it, write 1 into the column named after that genre
(the column is created the first time the genre is seen). Finally, we fill the remaining empty cells with 0.
"""
#For every row in the dataframe, iterate through the list of genres and place a 1 into the corresponding column
"""for each index (i.e. movie) in our dataframe:"""
for index, row in movies_df.iterrows():
    
    """for each genre in the 'genres' column of this index (movie)"""
    for genre in row['genres']:
        
        """write 1 into the column named after the genre at this index (row/movie); the column is created
        if it does not exist yet, and the 'genres' column itself is left unchanged"""
        moviesWithGenres_df.at[index, genre] = 1
"""Cells that were never set are empty (NaN) because the movie does not have that genre. E.g. Toy Story gets
1 in the Adventure, Animation, Children, Comedy and Fantasy columns and NaN in every other genre column.
So, we fill in the NaN values with 0 to show that a movie doesn't have that column's genre"""
moviesWithGenres_df = moviesWithGenres_df.fillna(0)
moviesWithGenres_df.head()

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
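
The row-by-row loop above is easy to follow but slow on large tables. As a possible alternative, the same 0/1 layout can be built with pandas string methods. This is a sketch on made-up data, not the notebook's method; note that it yields integer columns in alphabetical order rather than float columns in order of first appearance.

```python
import pandas as pd

# Hypothetical sample with genres already split into lists, as in the cell above.
sample = pd.DataFrame(
    {
        "movieId": [1, 2, 3],
        "genres": [["Adventure", "Comedy"], ["Comedy"], ["Drama", "Comedy"]],
    }
)

# Join each list back into a '|'-separated string, then expand it
# into one 0/1 indicator column per genre.
one_hot = sample["genres"].str.join("|").str.get_dummies(sep="|")

# Attach the indicator columns to the original columns.
encoded = pd.concat([sample, one_hot], axis=1)
print(encoded)
```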


In [7]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [8]:
#We do not use the timestamp column here, so we remove it with drop(columns='timestamp').
ratings_df = ratings_df.drop(columns='timestamp')
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [9]:
"""
Now we need to know which movies the user has watched and how he rated them, so that the system can see which genres
he likes and which he does not (based on the genres of the movies he rated).
Here is an example of a user profile.
"""
userInput = [
            {'title':'V for Vendetta', 'rating':4.5},
            {'title':'Immortals', 'rating':5},
            {'title':'The Grand Budapest Hotel', 'rating':2.5},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Interstellar', 'rating':3.0},
            {'title':'Flight', 'rating':4.5},
            {'title':'Ip Man', 'rating':4.0},
            {'title':'Troy', 'rating':4.0} 
         ] 
#dataframe of user profile
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,title,rating
0,V for Vendetta,4.5
1,Immortals,5.0
2,The Grand Budapest Hotel,2.5
3,Pulp Fiction,5.0
4,Interstellar,3.0
5,Flight,4.5
6,Ip Man,4.0
7,Troy,4.0


In [10]:
"""First, we check whether the movies the user watched are in our dataset and, if they are, look up their movieId.
We build a sub-dataframe named inputId that keeps the rows of the full movie dataframe whose title appears
(via isin) in the user's dataframe (inputMovies)."""
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]
print(inputId)

"""We merge inputId with the user's dataframe inputMovies. With no key given, pandas merges on the columns the two
dataframes have in common (here only 'title') and keeps the remaining columns of both."""
#Then merging it so we can get the movieId. It's implicitly merging it by title.
inputMovies = pd.merge(inputId, inputMovies)

"""The merge brings in two columns from the movie dataframe that we do not need here: 'year' and 'genres'"""
#Dropping information we won't use from the input dataframe
inputMovies = inputMovies.drop(columns=['genres', 'year'])

#Final input dataframe
#If a movie you added above isn't here, then it might not be in the original
#dataframe or it might be spelled differently; please check capitalisation.
inputMovies

      movieId           title                            genres  year
257       296    Pulp Fiction  [Comedy, Crime, Drama, Thriller]  1994
4948     7458            Troy   [Action, Adventure, Drama, War]  2004
6151    44191  V for Vendetta  [Action, Sci-Fi, Thriller, IMAX]  2006
6947    65514          Ip Man              [Action, Drama, War]  2008
7742    90888       Immortals          [Action, Drama, Fantasy]  2011
8024    97923          Flight                           [Drama]  2012
8376   109487    Interstellar                    [Sci-Fi, IMAX]  2014


Unnamed: 0,movieId,title,rating
0,296,Pulp Fiction,5.0
1,7458,Troy,4.0
2,44191,V for Vendetta,4.5
3,65514,Ip Man,4.0
4,90888,Immortals,5.0
5,97923,Flight,4.5
6,109487,Interstellar,3.0
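
Only seven of the eight input titles survive the merge: 'The Grand Budapest Hotel' is missing, presumably because the dataset writes leading articles at the end of a title (as in 'Day After Tomorrow, The'). One way to spot such silent drops is a left merge with `indicator=True`. The sketch below uses a made-up three-row catalogue with placeholder movieId values.

```python
import pandas as pd

# Hypothetical catalogue; the movieId values are placeholders.
catalogue = pd.DataFrame(
    {
        "movieId": [1, 2, 3],
        "title": ["Pulp Fiction", "Troy", "Grand Budapest Hotel, The"],
    }
)
user_input = pd.DataFrame(
    {
        "title": ["Pulp Fiction", "Troy", "The Grand Budapest Hotel"],
        "rating": [5.0, 4.0, 2.5],
    }
)

# A left merge keeps every user row; the _merge column reports
# whether a matching catalogue row was found.
checked = user_input.merge(catalogue, on="title", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "title"].tolist()
print(unmatched)
```

Any title listed in `unmatched` can then be corrected by hand before the profile is built.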


In [11]:
"""To build the user profile, we need the genre features of the movies the user watched.
inputMovies now contains the 'movieId' of each input movie, which lets us look up its genres.
We build a sub-dataframe of moviesWithGenres_df that keeps only the movies whose movieId appears in
the user's dataframe (inputMovies)."""
#Selecting the user's movies from the genre dataframe
userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]
userMovies

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
257,296,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4948,7458,Troy,"[Action, Adventure, Drama, War]",2004,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
6151,44191,V for Vendetta,"[Action, Sci-Fi, Thriller, IMAX]",2006,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
6947,65514,Ip Man,"[Action, Drama, War]",2008,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
7742,90888,Immortals,"[Action, Drama, Fantasy]",2011,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8024,97923,Flight,[Drama],2012,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8376,109487,Interstellar,"[Sci-Fi, IMAX]",2014,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [12]:
#Resetting the index to make it easier to work
userMovies = userMovies.reset_index(drop=True)

#Dropping the columns we do not need, to save memory and avoid issues.
#Now we have a dataframe with the genre indicators of every movie the user has watched.
userGenreTable = userMovies.drop(columns=['movieId', 'title', 'genres', 'year'])
print(userGenreTable.shape)
userGenreTable

(7, 20)


Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [13]:
"""The ratings the user gave to the watched movies. These numbers are multiplied with userGenreTable to compute the score
of each genre for this user"""
inputMovies['rating']

0    5.0
1    4.0
2    4.5
3    4.0
4    5.0
5    4.5
6    3.0
Name: rating, dtype: float64

In [14]:
"""We multiply each movie's rating by its genre row (e.g. 5 * [1, 0, 1, 0, 0..., 0]) and then sum the scores for
each genre. E.g. if two movies are rated 5 and both are fantasy (and they are the only fantasy movies in the list), then
the fantasy score is 10. A matrix product does this in one step: transpose userGenreTable (so it becomes a 20*7
matrix) and multiply it by the rating vector of size 7*1. The result is a 20*1 vector holding each genre's score.
"""
#Dot product to get weights
userProfile = userGenreTable.transpose().dot(inputMovies['rating'])
#The user profile
"""Now we know which genres our user likes and how much he likes each of them"""
userProfile


Adventure              4.0
Animation              0.0
Children               0.0
Comedy                 5.0
Fantasy                5.0
Romance                0.0
Drama                 22.5
Action                17.5
Crime                  5.0
Thriller               9.5
Horror                 0.0
Mystery                0.0
Sci-Fi                 7.5
War                    8.0
Musical                0.0
Documentary            0.0
IMAX                   7.5
Western                0.0
Film-Noir              0.0
(no genres listed)     0.0
dtype: float64
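
The matrix product behind the user profile is easier to verify on a table small enough to compute by hand. The following sketch assumes a made-up set of three movies and three genres in the same 0/1 layout as `userGenreTable`.

```python
import pandas as pd

# Hypothetical 3-movie, 3-genre table (rows are movies, columns are genres).
genre_table = pd.DataFrame(
    {"Action": [1, 0, 1], "Drama": [1, 1, 0], "Sci-Fi": [0, 0, 1]}
)
# Hypothetical ratings for the same three movies, in the same row order.
ratings = pd.Series([4.0, 5.0, 3.0])

# (3 genres x 3 movies) dot (3 movies) gives one score per genre.
profile = genre_table.T.dot(ratings)
print(profile)
```

Action collects the ratings of the first and third movie (4.0 + 3.0), Drama those of the first and second (4.0 + 5.0), and Sci-Fi only that of the third (3.0). The product relies on the row labels of the genre table matching the index of the ratings, which is why the notebook resets the index of `userMovies` first.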

In [15]:
"""Now we build a genre-only table for all of our movies. We index it by movieId with .set_index(moviesWithGenres_df['movieId'])
and then remove every column that is not a genre. The movieId index lets us look up the movie titles later."""

#get the genres of every movie in our original dataframe
genreTable = moviesWithGenres_df.set_index(moviesWithGenres_df['movieId'])

#And drop the unnecessary information
genreTable = genreTable.drop(columns=['movieId', 'title', 'genres', 'year'])
genreTable.head()

Unnamed: 0_level_0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
#shape of our table. So, we have 9742 movies (based on movieId)
genreTable.shape

(9742, 20)

In [17]:
"""
Now that we have the genre table for all movies (genreTable) and the weight of each genre (userProfile),
we multiply the two and normalize by the total weight to estimate each movie's score from our user's perspective.
"""
#Multiply the genres by the weights and then take the weighted average
recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
recommendationTable_df.head()

movieId
1    0.153005
2    0.098361
3    0.054645
4    0.300546
5    0.054645
dtype: float64
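
The weighted average can be reproduced by hand on a tiny example. The sketch below uses a made-up profile and two made-up movies; a movie's score is the share of the total profile weight that falls on its genres.

```python
import pandas as pd

# Hypothetical user profile (genre weights).
profile = pd.Series({"Action": 7.0, "Drama": 9.0, "Sci-Fi": 4.0})
# Hypothetical two-movie genre table indexed by movieId.
genres = pd.DataFrame(
    {"Action": [1, 0], "Drama": [1, 1], "Sci-Fi": [0, 0]},
    index=pd.Index([10, 20], name="movieId"),
)

# Broadcasting multiplies each genre column by its weight. Dividing by the
# total weight keeps scores between 0 and 1 when all weights are non-negative.
scores = (genres * profile).sum(axis=1) / profile.sum()
print(scores)
```

Movie 10 covers Action and Drama, so it scores (7 + 9) / 20, while movie 20 covers only Drama and scores 9 / 20.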

In [18]:
#Sort our recommendations in descending order so that we could see best movies first by sort_values(ascending=False)
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
#Just a peek at top 10 values
recommendationTable_df.head(10)

movieId
79132     0.759563
49530     0.726776
71999     0.721311
60684     0.704918
81132     0.693989
117646    0.693989
36529     0.683060
519       0.677596
43932     0.677596
2985      0.677596
dtype: float64

In [19]:
"""Finally, we look up the titles and other information of the selected movies by their movieId.
These are the recommended movies; note that isin() returns them in dataset order, not in order of score."""
movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(20).keys())]

Unnamed: 0,movieId,title,genres,year
167,198,Strange Days,"[Action, Crime, Drama, Mystery, Sci-Fi, Thriller]",1995
454,519,RoboCop 3,"[Action, Crime, Drama, Sci-Fi, Thriller]",1993
2248,2985,RoboCop,"[Action, Crime, Drama, Sci-Fi, Thriller]",1987
5161,8361,"Day After Tomorrow, The","[Action, Adventure, Drama, Sci-Fi, Thriller]",2004
5556,26701,Patlabor: The Movie (Kidô keisatsu patorebâ: T...,"[Action, Animation, Crime, Drama, Film-Noir, M...",1989
5665,27618,"Sound of Thunder, A","[Action, Adventure, Drama, Sci-Fi, Thriller]",2005
5985,36529,Lord of War,"[Action, Crime, Drama, Thriller, War]",2005
6145,43932,Pulse,"[Action, Drama, Fantasy, Horror, Mystery, Sci-...",2006
6330,48774,Children of Men,"[Action, Adventure, Drama, Sci-Fi, Thriller]",2006
6358,49530,Blood Diamond,"[Action, Adventure, Crime, Drama, Thriller, War]",2006
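
Because `isin()` only filters rows, the table above is not ordered by recommendation score. If the ranking matters, one option is to start from the sorted scores and attach the titles with a merge. This is a sketch on made-up data; the names `scores`, `movies` and `top` are illustrative only.

```python
import pandas as pd

# Hypothetical scores, already sorted from best to worst.
scores = pd.Series({30: 0.9, 10: 0.7, 20: 0.4}, name="score")
scores.index.name = "movieId"
# Hypothetical movie table in dataset order.
movies = pd.DataFrame({"movieId": [10, 20, 30], "title": ["A", "B", "C"]})

# A left merge starting from the ranked scores keeps the ranking order
# and attaches the title to each score.
top = scores.head(2).reset_index().merge(movies, on="movieId", how="left")
print(top)
```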


# Recommendation System based on Collaborative Filtering
This method can be done in two ways: user-based and item-based.

In the **user-based** method, the system finds users whose interests are similar to those of the query user. It then recommends items that those users watched or bought and that the query user has not yet seen. The reasoning is: people with similar interests to yours also bought these items, so you may like them too.

In the **item-based** method, the system finds similar items based on how often different users bought them together. For example, a tablet and a magnetic pen are commonly bought together (though they are completely different in nature). Thus, when someone buys a tablet, the system suggests a magnetic pen as well.
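
The notebook implements only the user-based variant. To make the item-based idea concrete, here is a minimal sketch that counts how often items appear together. The baskets and the helper `co_bought` are made up for illustration; real item-based systems usually use a similarity measure rather than raw counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase history: each set lists the items one user bought.
baskets = [
    {"tablet", "pen"},
    {"tablet", "pen", "case"},
    {"tablet", "case"},
    {"laptop", "mouse"},
]

# Count how often each unordered pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1


def co_bought(item):
    """Return (other_item, count) pairs for items bought together with `item`."""
    related = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            related[b] += n
        elif b == item:
            related[a] += n
    return related.most_common()


print(co_bought("tablet"))
```

Items that were never bought together with the query item, such as the mouse here, do not appear in the result at all.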

In [20]:
movies_df

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995
...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic,"[Action, Animation, Comedy, Fantasy]",2017
9738,193583,No Game No Life: Zero,"[Animation, Comedy, Fantasy]",2017
9739,193585,Flint,[Drama],2017
9740,193587,Bungo Stray Dogs: Dead Apple,"[Action, Animation]",2018


In [21]:
#Dropping the genres column as we do not use genres in this method
movies_df = movies_df.drop(columns='genres')
movies_df.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995


In [22]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [25]:
userInput = [
            {'title':'V for Vendetta', 'rating':4.5},
            {'title':'Immortals', 'rating':5},
            {'title':'The Grand Budapest Hotel', 'rating':2.5},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Interstellar', 'rating':3.0},
            {'title':'Flight', 'rating':4.5},
            {'title':'Ip Man', 'rating':4.0},
            {'title':'Troy', 'rating':4.0} 
         ] 
#dataframe of user profile
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,title,rating
0,V for Vendetta,4.5
1,Immortals,5.0
2,The Grand Budapest Hotel,2.5
3,Pulp Fiction,5.0
4,Interstellar,3.0
5,Flight,4.5
6,Ip Man,4.0
7,Troy,4.0


In [26]:
"""This cell works as explained above: select the movies watched by the user from the full movie dataframe,
merge the result with the inputMovies dataframe, and finally drop the redundant column."""
#Filtering out the movies by title as explained above
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]
#Then merging it so we can get the movieId. It's implicitly merging it by title.
inputMovies = pd.merge(inputId, inputMovies)
#Dropping information we won't use from the input dataframe
inputMovies = inputMovies.drop(columns='year')
#Final input dataframe
#If a movie you added above isn't here, then it might not be in the original
#dataframe or it might be spelled differently; please check capitalisation.
inputMovies

Unnamed: 0,movieId,title,rating
0,296,Pulp Fiction,5.0
1,7458,Troy,4.0
2,44191,V for Vendetta,4.5
3,65514,Ip Man,4.0
4,90888,Immortals,5.0
5,97923,Flight,4.5
6,109487,Interstellar,3.0


In [27]:
"""Here we build the sub-dataframe of ratings_df that contains the ratings other users gave to the movies in our
query user's list, using the same isin filter as before"""
userSubset = ratings_df[ratings_df['movieId'].isin(inputMovies['movieId'].tolist())]
print(userSubset.shape) #So, the query user's movies have been rated 550 times in total by other users
userSubset.head()

(550, 3)


Unnamed: 0,userId,movieId,rating
16,1,296,3.0
255,2,109487,3.0
320,4,296,1.0
533,5,296,5.0
692,6,296,2.0


In [28]:
"""pandas groupby splits the data into groups according to a key and lets us apply a function to each group,
which makes splitting and aggregating a dataframe by some criterion easy and efficient.
Here we group by 'userId': for every user, the group holds all of the query user's movies that this user rated.
"""
#Groupby creates several sub dataframes where they all have the same value in the column specified as the parameter
#A plain string key (not a list) keeps the group names as scalars such as 110
userSubsetGroup = userSubset.groupby('userId')
print(len(userSubsetGroup)) #there are 363 different users that have watched at least one of the query user's movies

363


In [29]:
#Let's take an example: which of the query user's movies has user 110 rated, and with what rating?
userSubsetGroup.get_group(110)

Unnamed: 0,userId,movieId,rating
17212,110,296,4.5


In [30]:
"""Sorting so that users who share the most movies with the input have priority:
sorted() iterates over userSubsetGroup, which yields (userId, group) pairs. The key function len(x[1]) counts the rows
of each group, i.e. the number of the query user's movies that this user rated. With reverse=True, the user with the most
shared movies comes first"""

userSubsetGroup = sorted(userSubsetGroup,  key=lambda x: len(x[1]), reverse=True)

In [31]:
"""Let's see the top 3 users by number of shared movies (movies that were also watched by the query user).
Here, user 298 has watched 6 of the query user's movies, and user 105 has watched 5"""
userSubsetGroup[0:3]

[(298,        userId  movieId  rating
  44555     298      296     4.5
  44970     298     7458     0.5
  45051     298    44191     3.5
  45187     298    65514     2.0
  45322     298    90888     0.5
  45400     298   109487     3.0), (105,        userId  movieId  rating
  16226     105      296     5.0
  16499     105     7458     3.5
  16589     105    44191     4.0
  16671     105    65514     4.5
  16812     105   109487     4.0), (339,        userId  movieId  rating
  52016     339      296     2.5
  52167     339     7458     3.0
  52197     339    44191     4.0
  52323     339    97923     3.5
  52343     339   109487     5.0)]
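
Ranking users by overlap can also be done without leaving pandas, by counting the rows per user with `groupby(...).size()`. The sketch below runs on a made-up ratings subset and is an alternative to the `sorted()` call, not the notebook's own approach.

```python
import pandas as pd

# Hypothetical ratings, already restricted to the query user's movies.
subset = pd.DataFrame(
    {
        "userId": [1, 2, 2, 3, 3, 3],
        "movieId": [296, 296, 7458, 296, 7458, 44191],
        "rating": [3.0, 4.0, 2.5, 5.0, 3.5, 4.0],
    }
)

# Number of shared movies per user, largest first.
overlap = subset.groupby("userId").size().sort_values(ascending=False)
top_ids = overlap.head(2).index.tolist()

# Keep only the rows that belong to the users with the largest overlap.
top_rows = subset[subset["userId"].isin(top_ids)]
print(overlap)
```

The result stays a regular dataframe, so later steps can keep using `groupby` instead of indexing into a list of tuples.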

In [32]:
"""We keep the first 50 users for computing similarities"""
userSubsetGroup = userSubsetGroup[0:50]
print(userSubsetGroup[20]) #the entry at position 20 (the 21st user by overlap) is user 18, who has watched 3 of the query user's movies

(18,       userId  movieId  rating
1796      18      296     4.0
2074      18    44191     4.5
2217      18   109487     4.5)


In [44]:
"""So far, we have found the 50 users who share the most watched movies with our user. Next, we measure how similarly
the query user and each of these 50 users rated those movies. To do this, we use the Pearson correlation
between the ratings of the query user and the ratings of each of the other users on their common movies.
The correlation indicates similarity in taste, e.g. 0.9 shows very similar taste and -0.9 shows nearly opposite taste."""

#Store the Pearson Correlation in a dictionary, where the key is the user Id and the value is the coefficient
pearsonCorrelationDict = {}

"""sort our query movies by movieId once, before the loop, so the sort is not repeated for each of the 50 users"""
inputMovies = inputMovies.sort_values(by='movieId')

"""here we iterate over pairs of name (the userId, e.g. 183) and group (that user's table of movieIds and ratings)"""
#For every user group in our subset
for name, group in userSubsetGroup:
    #print(name, group)
    
    #Sort the current user group by movieId as well so the values aren't mixed up later on
    group = group.sort_values(by='movieId')
    
    """here we count how many of the query user's movies the current user has rated,
    e.g. if the user has rated 5 of them, len() returns 5 (the number of rows of the group)"""
    #Get the N for the formula
    nRatings = len(group)
    
    """Both dataframes are now sorted by movieId, so next we restrict the query user's ratings to the common movies.
    isin(group['movieId'].tolist()) keeps the rows of inputMovies (query user) whose movieId also appears in group
    (current user); the result is a sub-dataframe (named temp_df) containing the movies both have seen."""
    #Get the review scores for the movies that they both have in common
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
    
    """Now we have two dataframes with the same movieIds in the same order. They differ only in the
    ratings given by the query user and by the current user. To measure how similar their opinions are,
    we take the two rating vectors, which now have the same length."""
    
    """take the query user's ratings of the common movies from the 'rating' column and save them as a list"""
    #And then store them in a temporary buffer variable in a list format to facilitate future calculations
    tempRatingList = temp_df['rating'].tolist()
    #print(tempRatingList)
    
    """taking vector (list) of ratings of our current user on common movies"""
    #Let's also put the current user group reviews in a list format
    tempGroupList = group['rating'].tolist()
    #print(tempGroupList )
    #print('.................')
    
    """This part computes the Pearson correlation of the two vectors tempRatingList and tempGroupList.
    Sxx and Syy are the sums of squared deviations of each vector and Sxy is the sum of cross-deviations;
    the correlation is Sxy / sqrt(Sxx * Syy)."""
    #Now let's calculate the pearson correlation between two users, so called, x and y
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    
    #If the denominator is different than zero, then divide, else, 0 correlation.
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
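
The hand-written Sxx, Syy and Sxy sums can be cross-checked against NumPy's `corrcoef`. The sketch below wraps the same formula in a hypothetical helper, `pearson_manual`, and compares both results on two made-up rating lists.

```python
import numpy as np
from math import sqrt


def pearson_manual(x, y):
    """Pearson correlation using the same Sxx/Syy/Sxy sums as the loop above."""
    n = len(x)
    sxx = sum(i**2 for i in x) - sum(x) ** 2 / n
    syy = sum(j**2 for j in y) - sum(y) ** 2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    # A constant rating vector has zero variance, so the correlation is
    # undefined; the notebook's convention of returning 0 is kept here.
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


# Hypothetical ratings by two users for the same four movies, in the same order.
query = [5.0, 4.0, 4.5, 3.0]
other = [4.5, 3.5, 4.0, 2.0]

manual = pearson_manual(query, other)
reference = np.corrcoef(query, other)[0, 1]
print(manual, reference)
```

The two values should agree up to floating-point rounding. With very few common movies the coefficient is unstable: two users with only two shared ratings always get exactly 1, -1 or 0.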


In [34]:
"""Let's take a look at this dictionary. It holds 50 (a, b) pairs, where a is the userId and b is
the similarity of that user's ratings to those of our query user"""
pearsonCorrelationDict.items()

dict_items([(298, 0.04037864265436241), (105, 0.5321811563901779), (339, -0.8055696249744949), (414, 0.14744195615489958), (448, 0.6662136141735213), (62, -0.2548235957188128), (125, 0.6625413488689132), (153, 0.7677718959499145), (233, 0.6831300510639733), (249, -0.659231724180059), (305, 0.23904572186687875), (318, -0.42289003161103106), (332, 0.6363636363636364), (380, 0.45454545454545453), (483, 0.09759000729485331), (509, 0.5291502622129182), (534, 0.6882472016116852), (561, 0.6180700462007377), (610, 0.6900655593423543), (10, 0.1013606067599229), (18, -0.6933752452815377), (21, 0.24019223070763082), (28, 0.3273268353539889), (50, 0.6933752452815368), (65, -0.18898223650461146), (68, -1.0), (119, -0.8660254037844448), (123, -0.9707253433941623), (141, -0.6933752452815363), (212, -0.6933752452815377), (219, 0.6546536707079778), (222, 0.4999999999999867), (232, 0), (246, -0.5000000000000053), (247, 0), (274, 0.6546536707079778), (279, -0.2773500981126157), (352, -0.2773500981126157)

In [35]:
"""Now that we have a similarity score for each user, we put them in a dataframe whose columns
are the similarity index and the userId"""
#a dataframe with one (unnamed) column holding the similarities, indexed by userId
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
#naming the similarity column
pearsonDF.columns = ['similarityIndex']
#creating the userId column from the index
pearsonDF['userId'] = pearsonDF.index
#replacing the index with consecutive integers
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,similarityIndex,userId
0,0.040379,298
1,0.532181,105
2,-0.80557,339
3,0.147442,414
4,0.666214,448
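
The four assignment steps used to build `pearsonDF` can be condensed into one chained expression. The following sketch uses three made-up similarity scores; the resulting column order (userId first) differs from the notebook's, which does not affect the later merge on `userId`.

```python
import pandas as pd

# Hypothetical similarity scores keyed by userId.
similarity = {298: 0.04, 105: 0.53, 339: -0.81}

# Build a Series from the dict, name its index, and turn the index
# into a regular column next to the similarity values.
pearson_df = (
    pd.Series(similarity, name="similarityIndex")
    .rename_axis("userId")
    .reset_index()
)

# Most similar users first.
top = pearson_df.sort_values("similarityIndex", ascending=False)
print(top)
```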


In [36]:
"""Now we sort the dataframe by similarityIndex with sort_values(by='similarityIndex', ascending=False)
to rank the users by how similar their taste is to our query user's. For example,
user 590 has a similarity of 0.98 with the query user, i.e. he/she rated the common movies very similarly"""
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)
topUsers.head()

Unnamed: 0,similarityIndex,userId
45,0.981981,590
38,0.981981,354
40,0.866025,420
44,0.838628,560
7,0.767772,153


In [37]:
"""Now we build a dataframe that contains all movies (and their ratings) watched by the selected users (those with the most
similar taste to our query user). We do this by merging the topUsers dataframe with ratings_df. The merge key
is userId in both dataframes (left_on='userId', right_on='userId'), and only userIds present in both are kept
(how='inner')"""
topUsersRating=topUsers.merge(ratings_df, left_on='userId', right_on='userId', how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating
0,0.981981,590,1,4.0
1,0.981981,590,2,2.5
2,0.981981,590,3,3.0
3,0.981981,590,5,2.0
4,0.981981,590,6,3.5
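
To make the effect of the inner merge concrete, here is a small sketch on invented data. The frames `toy_top` and `toy_ratings` are assumptions for illustration only; user 7 has ratings but no similarity score, so the inner join drops that user.

```python
import pandas as pd

# Two users with a similarity score (illustrative values)
toy_top = pd.DataFrame({'similarityIndex': [0.98, 0.70], 'userId': [590, 153]})

# Ratings from three users; user 7 is absent from toy_top
toy_ratings = pd.DataFrame({
    'userId': [590, 590, 153, 7],
    'movieId': [1, 2, 1, 1],
    'rating': [4.0, 2.5, 3.0, 5.0],
})

# on='userId' is shorthand for left_on='userId', right_on='userId'
merged = toy_top.merge(toy_ratings, on='userId', how='inner')
print(merged)
```

Each similarity score is repeated once per rating of that user, which is what the weighting step relies on.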


In [38]:
topUsersRating.shape

(29438, 4)

In [39]:
"""Now we weight each rating by how similar its author is to the query user. For example, if user 590 (similarity 0.98) gave
4.0 to movie 1 and another user with a lower similarity, e.g. 0.7, gave 3.0 to the same movie, the system pays more attention
to the first user's rating. Here the first opinion contributes 4.0*0.98 and the second 3.0*0.7.
We store this in a weightedRating column, the element-wise product of the 'similarityIndex' and 'rating' columns."""
#Multiply each rating by the similarity of the user who gave it
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,weightedRating
0,0.981981,590,1,4.0,3.927922
1,0.981981,590,2,2.5,2.454951
2,0.981981,590,3,3.0,2.945942
3,0.981981,590,5,2.0,1.963961
4,0.981981,590,6,3.5,3.436932


In [40]:
"""Now we sum the similarities and the weighted ratings for each movie. We group the rows by movieId with groupby('movieId')
(so each movie gets its own group) and then sum the similarity and weighted-rating columns of each group with
.sum()[['similarityIndex','weightedRating']]."""
#Sum the similarity and weighted rating per movie after grouping topUsersRating by movieId
tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]
#rename the columns
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

movieId,sum_similarityIndex,sum_weightedRating
1,4.647006,19.282705
2,3.265444,9.435236
3,1.489011,4.574413
5,2.040446,5.149396
6,2.816681,11.381667
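
The grouping, summing, and renaming can also be written as one step with pandas named aggregation. This is a sketch on toy numbers, not the notebook's code; `toy` is an invented stand-in for `topUsersRating`.

```python
import pandas as pd

# Toy stand-in for topUsersRating: movie 1 was rated by two users, movie 2 by one
toy = pd.DataFrame({
    'movieId': [1, 1, 2],
    'similarityIndex': [0.98, 0.70, 0.98],
    'weightedRating': [3.92, 2.10, 2.45],
})

# Named aggregation: new_column=(source_column, aggregation function)
toy_sums = toy.groupby('movieId').agg(
    sum_similarityIndex=('similarityIndex', 'sum'),
    sum_weightedRating=('weightedRating', 'sum'),
)
print(toy_sums)
```

Only the two listed columns are aggregated, so unrelated columns such as userId are never summed.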


In [41]:
"""The sums above are biased because they are plain totals: a movie rated by many users accumulates a large score even if
its ratings are low. We therefore take a weighted average, so that the score reflects both similarity and rating rather than
the number of ratings. We do this by dividing 'sum_weightedRating' by 'sum_similarityIndex'."""
#Creates an empty dataframe
recommendation_df = pd.DataFrame()
#Now we take the weighted average
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
#Keep the movieId (currently the index) as a regular column
recommendation_df['movieId'] = tempTopUsersRating.index
recommendation_df.head()

movieId,weighted average recommendation score,movieId
1,4.14949,1
2,2.889419,2
3,3.072114,3
5,2.523662,5
6,4.040807,6
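
A small worked example shows why the division matters. The data are invented: movie 1 reuses the two opinions from the weighting example (4.0 at similarity 0.98, 3.0 at similarity 0.7), while movie 2 is a hypothetical popular but poorly rated film.

```python
import pandas as pd

toy = pd.DataFrame({
    'movieId': [1, 1, 2, 2, 2, 2, 2],
    'similarityIndex': [0.98, 0.70, 0.8, 0.8, 0.8, 0.8, 0.8],
    'rating': [4.0, 3.0, 2.0, 2.0, 2.0, 2.0, 2.0],
})

# Weight each rating by the similarity of the user who gave it
toy['weightedRating'] = toy['similarityIndex'] * toy['rating']

# Per-movie totals of the weights and of the weighted ratings
sums = toy.groupby('movieId')[['similarityIndex', 'weightedRating']].sum()

# Weighted average: total weighted rating divided by total weight
score = sums['weightedRating'] / sums['similarityIndex']
print(sums)
print(score)
```

Movie 2 has the larger raw sum because five users rated it, yet its weighted average stays at 2.0, below movie 1's (3.92 + 2.10) / (0.98 + 0.70).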


In [42]:
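"""Now we sort the dataframe by this final score ('weighted average recommendation score') to see the movies that are
recommended most strongly to the query user. Note that several scores below are far above the maximum rating of 5.0: the
similarity weights include negative values, so 'sum_similarityIndex' can be close to zero for some movies, which inflates
the quotient."""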
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head(10)

movieId,weighted average recommendation score,movieId
96655,1171007000000000.0,96655
27482,303.2794,27482
51037,217.9974,51037
236,186.5814,236
183897,105.0171,183897
55278,104.0931,55278
162478,103.5931,162478
142422,103.5931,142422
2676,101.6402,2676
51935,88.04523,51935
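
Scores this large cannot come from a weighted average with non-negative weights, which always stays within the range of the ratings. The sketch below reproduces the effect on two invented ratings and shows one possible remedy, keeping only positively correlated users. The helper `weighted_average` and the positive-only filter are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Two users rated the same movie; their similarities nearly cancel out
toy = pd.DataFrame({
    'similarityIndex': [0.500, -0.499],
    'rating': [4.0, 3.0],
})

def weighted_average(df):
    """Similarity-weighted mean rating: sum(sim * rating) / sum(sim)."""
    return (df['similarityIndex'] * df['rating']).sum() / df['similarityIndex'].sum()

# With mixed signs the denominator is about 0.001, so the score leaves the rating scale
mixed_score = weighted_average(toy)

# Keeping only positive similarities brings the score back inside the rating range
positive_score = weighted_average(toy[toy['similarityIndex'] > 0])

print(mixed_score, positive_score)
```

Restricting the neighbours to the most similar users before merging would have a comparable effect, since the strongly negative correlations would be excluded.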


In [43]:
"""Now we look up the titles and features of the ten top-scoring movies in movies_df by movieId. The rows below appear in
movies_df order, not in score order; by score, the top suggestion is 'Robot & Frank' (movieId 96655), followed by
'Cube 2: Hypercube' (27482), 'Unknown' (51037) and 'French Kiss' (236)."""
movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(10)['movieId'].tolist())]

Unnamed: 0,movieId,title,year
202,236,French Kiss,1995
2010,2676,Instinct,1999
5651,27482,Cube 2: Hypercube,2002
6412,51037,Unknown,2006
6446,51935,Shooter,2007
6590,55278,Sleuth,2007
7981,96655,Robot & Frank,2012
9066,142422,The Night Before,2015
9368,162478,Masterminds,2016
9683,183897,Isle of Dogs,2018
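
Filtering with `isin` returns rows in `movies_df` order, so the ranking is lost. One way to keep it is a left merge from the ranked table into the movie table. This is a sketch on toy frames: `toy_recs` and `toy_movies` are stand-ins, and the `score` values are invented.

```python
import pandas as pd

# Ranked recommendations, best first (scores are illustrative)
toy_recs = pd.DataFrame({
    'movieId': [96655, 27482, 236],
    'score': [9.0, 7.5, 6.0],
})

# Movie table in its own, unrelated order
toy_movies = pd.DataFrame({
    'movieId': [236, 27482, 96655],
    'title': ['French Kiss', 'Cube 2: Hypercube', 'Robot & Frank'],
})

# A left merge keeps the row order of the left (ranked) frame
ranked = toy_recs.merge(toy_movies, on='movieId', how='left')
print(ranked)
```

With the notebook's `recommendation_df`, where movieId is both the index name and a column, calling `reset_index(drop=True)` first would likely be needed to avoid an ambiguous merge key.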
