# Recommender system

In this project we build a recommender system from the collaborative filtering perspective. Collaborative filtering techniques find groups of similar users and provide recommendations based on the shared tastes within each group. Our measure of similarity is the **Pearson correlation coefficient**, and our dataset consists of a list of movies and a corresponding list of user ratings retrieved from the site [MovieLens](https://grouplens.org/datasets/movielens).

## Loading and preprocessing

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [4]:
movies_df = pd.read_csv('movies.csv')
ratings_df = pd.read_csv('ratings.csv')
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [6]:
# Extract the year (with its parentheses) from the title and store it in a new column
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
# Remove the parentheses around the year
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)
# Remove the year from the 'title' column (regex=True is needed in recent pandas versions)
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
# Strip any leading or trailing whitespace left behind
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())
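
As a quick sanity check on the regular expressions above, the following sketch applies the same kind of extraction to a tiny hand-made frame. The name `toy_df` and its rows are illustrative assumptions, not part of the MovieLens data, and the two extraction steps are merged into one pattern for brevity.

```python
import pandas as pd

# Hypothetical toy data: two titles with a year, one without
toy_df = pd.DataFrame({'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Untitled Film']})

# Capture only the four digits inside the parentheses
toy_df['year'] = toy_df['title'].str.extract(r'\((\d{4})\)', expand=False)
# Remove the parenthesised year and trim the leftover whitespace
toy_df['title'] = toy_df['title'].str.replace(r'\(\d{4}\)', '', regex=True).str.strip()

print(toy_df)
```

Titles without a year in parentheses end up with a missing value in the `year` column, which is worth keeping in mind if the column is later converted to a numeric type.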

In [7]:
# We won't need the genres column, so we drop it
movies_df.drop('genres', axis=1, inplace = True)
movies_df.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995


In [8]:
# Let us now take a look into the ratings dataset
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [9]:
# We won't need the information contained in the timestamp column, so we drop it
ratings_df.drop('timestamp', axis=1, inplace=True)
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


## Collaborative filtering 

Our first aim is to construct the recommender system using collaborative filtering in a **memory-based** approach, in which statistical tools (here, the Pearson correlation coefficient) are used to find similar users or items based on historical data. Afterwards we are going to dive into a **model-based** analysis, where predictions of users' preferences are made by training a machine learning model.
Thus, our goal is to find users whose preferences are similar to those of the input user and then recommend to the input user the items those users liked. Let's begin by creating an input user to recommend movies to.

In [10]:
user_input = [
            {'title':'Breakfast Club, The', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Akira', 'rating':4.5}
         ] 
input_movies = pd.DataFrame(user_input)
input_movies

Unnamed: 0,title,rating
0,"Breakfast Club, The",5.0
1,Toy Story,3.5
2,Jumanji,2.0
3,Pulp Fiction,5.0
4,Akira,4.5


In [11]:
# Adding the movie ID to the input_movies dataframe

inputId = movies_df[movies_df['title'].isin(input_movies['title'].tolist())]
input_movies = pd.merge(inputId, input_movies)
input_movies

Unnamed: 0,movieId,title,year,rating
0,1,Toy Story,1995,3.5
1,2,Jumanji,1995,2.0
2,296,Pulp Fiction,1994,5.0
3,1274,Akira,1988,4.5
4,1968,"Breakfast Club, The",1985,5.0


In [12]:
# Selecting the users who have rated at least one of the movies rated by the input user

user_subset = ratings_df[ratings_df['movieId'].isin(input_movies['movieId'].tolist())]
user_subset_grouped = user_subset.groupby('userId')
user_subset_grouped.head(20)

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
16,1,296,3.0
320,4,296,1.0
422,4,1968,4.0
516,5,1,4.0
...,...,...,...
99510,609,296,4.0
99534,610,1,5.0
99552,610,296,5.0
99636,610,1274,5.0
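
To see how many of the input movies each user has rated, one option is to count the rows per user in the filtered subset. The sketch below does this on a small made-up ratings table. The names `toy_ratings`, `input_ids` and `overlap` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical ratings: user 3 rated three of the input movies, user 1 two, user 2 one
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3, 3],
    'movieId': [1, 296, 1, 1, 2, 296],
    'rating':  [4.0, 3.0, 5.0, 3.5, 2.0, 4.5],
})
input_ids = [1, 2, 296]  # assumed movie IDs rated by the input user

# Keep only ratings of the input movies, then count them per user
subset = toy_ratings[toy_ratings['movieId'].isin(input_ids)]
overlap = subset.groupby('userId').size().sort_values(ascending=False)

print(overlap)
```

Users at the top of `overlap` share the most movies with the input user, so their correlation with the input is computed from more data points.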


In [14]:
#Sorting the dataframe so users with most movies in common with the input will have priority

user_subset_grouped = sorted(user_subset_grouped, key=lambda x: len(x[1]), reverse=True)
user_subset_grouped

[(91,
         userId  movieId  rating
  14121      91        1     4.0
  14122      91        2     3.0
  14173      91      296     4.5
  14316      91     1274     5.0
  14383      91     1968     3.0),
 (177,
         userId  movieId  rating
  24900     177        1     5.0
  24901     177        2     3.5
  24930     177      296     5.0
  25069     177     1274     2.0
  25129     177     1968     3.5),
 (219,
         userId  movieId  rating
  31524     219        1     3.5
  31525     219        2     2.5
  31554     219      296     4.0
  31628     219     1274     2.5
  31680     219     1968     3.0),
 (274,
         userId  movieId  rating
  39229     274        1     4.0
  39230     274        2     3.5
  39288     274      296     5.0
  39448     274     1274     4.0
  39549     274     1968     4.0),
 (298,
         userId  movieId  rating
  44535     298        1     2.0
  44536     298        2     0.5
  44555     298      296     4.5
  44620     298     1274     4.0
 

### Pearson correlation function and similarity between instances and the input user

Next, we are going to compare (almost) all users to our specified user and find the ones that are most similar.
We measure how similar each user is to the input through the Pearson correlation coefficient, which quantifies the strength of the linear association between two variables. It is written as:

$$\large r=\frac{\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_{i=1}^{n}(x_i-\bar x)^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar y)^2}}$$

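It has the useful property of being invariant under a change of scale: multiplying all of one user's ratings by a positive constant, or adding a constant to them, leaves $r$ unchanged. It ranges from -1 to 1 such that, in our case, a value close to 1 means that the two users have similar tastes, while a value close to -1 means the opposite.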
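
The formula and its scale invariance can be checked numerically. The following is a minimal sketch, assuming two made-up rating vectors `a` and `b`. The helper `pearson` is a hypothetical name introduced here and implements the definition of $r$ directly with NumPy.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient computed from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical ratings given by two users to the same five movies
a = [5.0, 3.5, 2.0, 5.0, 4.5]
b = [3.0, 4.0, 3.0, 4.5, 5.0]

r = pearson(a, b)
# Rescaling and shifting one user's ratings should not change the coefficient
r_rescaled = pearson(a, [2 * v + 1 for v in b])

print(r, r_rescaled)
```

Note that the helper divides by zero when one of the vectors is constant, since the corresponding sum of squared deviations vanishes. That case has to be handled separately in practice.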

We will select a subset of users to iterate through as we don't want to go through all the users:

In [17]:
user_subset_grouped = user_subset_grouped[0:200]

Now we calculate the Pearson correlation between the input user and each user in the subset, and store the results in a dictionary whose keys are the user IDs and whose values are the coefficients.

In [20]:
from math import sqrt

pearson_correlation_dict={}
for name, group in user_subset_grouped:
    #Let's start by sorting the input and current user group so the values aren't mixed up later on
    group = group.sort_values(by='movieId')
    input_movies = input_movies.sort_values(by='movieId')
    #Get the N for the formula
    nRatings = len(group)
    #Get the review scores for the movies that they both have in common
    temp_df = input_movies[input_movies['movieId'].isin(group['movieId'].tolist())]
    #And then store them in a temporary buffer variable in a list format 
    temp_rating_list = temp_df['rating'].tolist()
    #Let's also put the current user group reviews in a list format
    temp_group_list = group['rating'].tolist()
    #Now let's calculate the pearson correlation between two users, so called, x and y
    Sxx = sum([i**2 for i in temp_rating_list]) - pow(sum(temp_rating_list),2)/float(nRatings)
    Syy = sum([i**2 for i in temp_group_list]) - pow(sum(temp_group_list),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(temp_rating_list, temp_group_list)) - sum(temp_rating_list)*sum(temp_group_list)/float(nRatings)
    
    #If the denominator is different from zero, then divide, else, 0 correlation.
    if Sxx != 0 and Syy != 0:
        pearson_correlation_dict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearson_correlation_dict[name] = 0
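
The quantities `Sxx`, `Syy` and `Sxy` in the loop above are the expanded forms of the sums that appear in the definition of $r$. Writing $n$ for the number of ratings being compared and using $\bar x = \frac{1}{n}\sum_{i=1}^{n} x_i$, the sums expand as follows:

$$\large S_{xx}=\sum_{i=1}^{n}(x_i-\bar x)^2=\sum_{i=1}^{n}x_i^2-\frac{\left(\sum_{i=1}^{n}x_i\right)^2}{n}$$

$$\large S_{xy}=\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)=\sum_{i=1}^{n}x_iy_i-\frac{\left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)}{n}$$

with $S_{yy}$ defined analogously to $S_{xx}$, so that

$$\large r=\frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$

When either $S_{xx}$ or $S_{yy}$ is zero (for instance, when a user gave the same rating to every movie in common), the coefficient is undefined, which is why the code falls back to 0 in that case.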

In [22]:
pearsonDF = pd.DataFrame.from_dict(pearson_correlation_dict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head(10)

Unnamed: 0,similarityIndex,userId
0,0.438529,91
1,0.0,177
2,0.451243,219
3,0.716115,274
4,0.959271,298
5,0.937614,414
6,0.117202,474
7,0.438529,477
8,0.784465,480
9,0.080064,483


Now let's get the top 50 users that are most similar to the input.

In [23]:
top_users=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
top_users.head()

Unnamed: 0,similarityIndex,userId
43,1.0,132
34,1.0,18
63,1.0,305
82,1.0,489
86,1.0,525


Now we are at the point where we can recommend movies to the input user.