# A user-based collaborative filtering recommendation system based on Pearson's correlation


<img align="right" src='../img/recommend.jpg' width="200">


<br>


<br>

<br>



*We are going to use Pearson's R correlation to recommend films to customers, based on their similarity to other films which customers have rated.*

*The system recommends films which are most similar to a film the customer has already chosen to watch. This is a user-user filtering method, because films are recommended based on similarities in user reviews.*


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as pp

The datasets are hosted on: https://drive.google.com/drive/folders/0B33wKgIl5ZZzT1pLQldveTBmbE0

**They were originally published by Ankur Tomar. User-Based Collaborative Filtering Recommender System in Python, August 25th 2017. Data Science | Machine Learning | MS Business Analytics @ University of Minnesota**


In [2]:
# Read in raw datasets.
films = pd.read_csv('../data/movies.csv')
ratings = pd.read_csv('../data/ratings.csv')

In [3]:
# Look at first 5 lines of each dataframe
films.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


###### Observations:
- Each row of the 'ratings' dataframe is one customer's rating of one film (films are listed in the 'films' dataframe).
- A rating is given out of 5.
- A rating near the bottom of the scale means the customer did not like the film very much and a 5 means they loved it.
- Both datasets share a column, movieId.
- In the 'ratings' dataframe a userId appears more than once when the customer has reviewed more than one film.

## Grouping and Ranking Data

In [5]:
# To study the ratings these films are getting I am going to calculate a mean rating value for each film.
film_popularity = pd.DataFrame(ratings.groupby('movieId')['rating'].mean())
film_popularity.head()

movieId,rating
1,3.92093
2,3.431818
3,3.259615
4,2.357143
5,3.071429


In [6]:
# Evaluate the popularity of each film: 'rating_count' holds how many ratings each film received.
film_popularity['rating_count'] = ratings.groupby('movieId')['rating'].count()
film_popularity.head()

movieId,rating,rating_count
1,3.92093,215
2,3.431818,110
3,3.259615,52
4,2.357143,7
5,3.071429,49


In [7]:
# I observed a statistical description of the dataframe.
film_popularity.describe()

Unnamed: 0,rating,rating_count
count,9724.0,9724.0
mean,3.262448,10.369807
std,0.869874,22.401005
min,0.5,1.0
25%,2.8,1.0
50%,3.416667,3.0
75%,3.911765,9.0
max,5.0,329.0


###### Observations:
- The 'count' row in the 'rating_count' field shows there are 9724 unique films which have been reviewed in this dataset.
- The 'max' row in the 'rating_count' field shows the most popular film in this dataset was rated 329 times.

In [8]:
# To find the most popular (most frequently rated) film.
# The most frequently rated film has 'movieId: 356'
film_popularity.sort_values('rating_count', ascending=False).head()

movieId,rating,rating_count
356,4.164134,329
318,4.429022,317
296,4.197068,307
593,4.16129,279
2571,4.192446,278


In [9]:
# MovieID - Name. Filter the 'films' dataframe where movieId == 356
# Forrest Gump is the most frequently rated film in our data.
films[films['movieId']==356]

Unnamed: 0,movieId,title,genres
314,356,Forrest Gump (1994),Comedy|Drama|Romance|War


### Preparing Data for Analysis

- I've created a user-by-item utility matrix.
- Calling the pivot table function cross-tabulates each user against each film and outputs a matrix.
- The utility matrix will be full of NaN, as each customer has only reviewed a small share of the films.
- No customer rates anywhere near all of the films, which is why the matrix is sparse.
- Where a customer did provide a rating for a particular film, a number out of 5 is shown.
- I want to find films which correlate in terms of customer rating.
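As a small illustration of the points above, here is a sketch of a utility matrix built from a handful of made-up ratings. The `toy_ratings` frame and its values are hypothetical and only mirror the column names of the real `ratings` data.

```python
import pandas as pd

# Hypothetical toy ratings: three users, three films, not every pair rated.
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3],
    'movieId': [10, 20, 20, 10, 30],
    'rating':  [4.0, 3.0, 5.0, 2.0, 4.5],
})

# Rows are users, columns are films; user/film pairs with no rating become NaN.
toy_matrix = pd.pivot_table(data=toy_ratings, values='rating',
                            index='userId', columns='movieId')
print(toy_matrix)

# Fraction of empty cells, a simple measure of how sparse the matrix is.
sparsity = toy_matrix.isna().to_numpy().mean()
print(f"sparsity: {sparsity:.2f}")
```

Even in this tiny example almost half of the cells are empty; with thousands of films the share of NaN values is far higher.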

In [10]:
# Use the pivot_table function to create the user utility matrix.
films_crosstab = pd.pivot_table(data=ratings, values='rating', index='userId', columns='movieId')
films_crosstab.head()

userId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


In [11]:
# Isolate the user ratings for the most popular film, 'Forrest Gump' - filter so only non-null ratings are shown.
# Remember: Forrest Gump is the most popular film with 329 ratings; let's look at what those ratings are.
ForestGump_ratings = films_crosstab[356]
ForestGump_ratings[ForestGump_ratings>=0]

userId
1      4.0
6      5.0
7      5.0
8      3.0
10     3.5
11     5.0
14     4.0
15     5.0
16     3.5
17     5.0
18     4.5
19     2.0
21     4.5
22     5.0
24     4.5
26     3.0
27     5.0
28     4.0
29     4.5
33     5.0
34     4.0
37     4.0
38     3.0
41     2.0
42     5.0
43     5.0
45     5.0
47     4.5
49     4.0
50     3.0
      ... 
567    3.0
568    5.0
569    3.0
570    4.0
572    4.0
573    4.5
577    5.0
579    5.0
580    4.0
581    4.5
583    4.0
584    5.0
587    4.0
588    3.0
589    5.0
590    5.0
591    4.0
592    5.0
593    4.0
596    3.5
597    5.0
599    3.5
600    4.0
602    3.0
603    3.0
605    3.0
606    4.0
608    3.0
609    4.0
610    3.0
Name: 356, Length: 329, dtype: float64

## Evaluating Similarity Based on Correlation

To find the correlation between Forrest Gump (the most popular film) and each of the other thousands of films:

 1. Call the corrwith() method of our films_crosstab.
 2. Pass in the ForestGump_ratings series.
 3. This generates a Pearson's R correlation coefficient between Forrest Gump and every other film which has been reviewed in the dataset.
 4. This correlation is based on similarities in the customer reviews given to each film.
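The steps above can be sketched on a tiny, made-up matrix. The frame `toy` and its film labels 'A', 'B' and 'C' are hypothetical; the only library call relied on is pandas' `DataFrame.corrwith`, which correlates each column with the series passed in and ignores missing ratings pair by pair.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-film matrix (rows: users, columns: films).
toy = pd.DataFrame({
    'A': [5.0, 4.0, 1.0, 2.0],
    'B': [4.5, 4.0, 1.5, 2.0],      # rated much like A
    'C': [1.0, 2.0, 5.0, np.nan],   # rated opposite to A, one rating missing
}, index=[1, 2, 3, 4])

# Correlate every column with column A; users missing either rating are skipped.
similar_to_A = toy.corrwith(toy['A'])
print(similar_to_A.sort_values(ascending=False))
```

Film B ends up close to 1 and film C at -1, which is the behaviour the real notebook relies on when comparing every film against Forrest Gump.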

In [15]:
# Named 'similar_to_ForestGump' because we are looking for films similar to Forrest Gump.
similar_to_ForestGump = films_crosstab.corrwith(ForestGump_ratings)




#### Pearson's R Correlation Coefficient
- The Pearson's R correlation coefficient is a measure of the linear correlation between two variables (in this case the 'ratings' of two films).

- As shown below, if the R value is close to 1 or -1, there is a strong linear relationship between the two variables/items (in this case an item = film): close to 1 means the films tend to be rated alike, close to -1 means they tend to be rated oppositely. The closer the R value gets to 0, the weaker the linear relationship, and the less similar the two films are in terms of ratings.



<img src='../img/pearson.png' width="400">
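To make the coefficient concrete, the sketch below computes Pearson's R by hand as the sum of products of deviations from the mean, divided by the square root of the product of the summed squared deviations, and checks it against `np.corrcoef`. The helper name `pearson_r` and the two rating lists are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's R for two equal-length 1-D arrays without missing values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations of x from its mean
    dy = y - y.mean()   # deviations of y from its mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical ratings given by the same four users to two films.
film_a = [5.0, 4.0, 1.0, 2.0]
film_b = [4.5, 4.0, 1.5, 2.0]

r = pearson_r(film_a, film_b)
print(round(r, 3))
# The hand-rolled value should agree with NumPy's implementation.
print(np.isclose(r, np.corrcoef(film_a, film_b)[0, 1]))
```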

In [19]:
# corrwith returns a Series - convert it to a DataFrame with a named column.
similar_to_ForestGump = pd.DataFrame(similar_to_ForestGump, columns=['PearsonR'])

# Drop Null values
similar_to_ForestGump.dropna(inplace=True)
similar_to_ForestGump.head()

movieId,PearsonR
1,0.303465
2,0.367247
3,0.534682
4,0.388514
5,0.349541


- The dataframe above shows each unique film and a Pearson's R correlation coefficient, which indicates how well each film correlates with Forrest Gump based on user ratings.
- However, films may show a high correlation to Forrest Gump with only a low number of customer reviews.
- A film with a low rating count (2 or 3 ratings) may show a misleadingly high correlation simply because those few ratings happen to line up with the ratings the same customers gave Forrest Gump.
- In such cases the correlation is not statistically meaningful.
- Therefore I am going to include an attribute for film popularity (based on rating count) in addition to how well the review scores correlate.
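A short sketch of why a low rating count is a problem: with only two shared ratings, Pearson's R is always exactly 1 or -1 (provided the ratings are not identical), whatever the films are. The frame `toy`, its column names and the threshold of 5 are all hypothetical. Note that this sketch counts ratings shared with the reference film, which is a slightly different measure from the total `rating_count` used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings from six users for a reference film and two others.
toy = pd.DataFrame({
    'reference':    [5.0, 4.0, 3.0, 4.5, 2.0, 3.5],
    'well_rated':   [4.5, 4.0, 2.5, 5.0, 2.5, 3.0],
    'rarely_rated': [np.nan, 3.0, 2.5, np.nan, np.nan, np.nan],
})

# Correlation of every film with the reference film.
r = toy.corrwith(toy['reference'])

# Number of users who rated both the film and the reference film.
shared = toy[toy['reference'].notna()].notna().sum()

summary = pd.DataFrame({'PearsonR': r, 'shared_ratings': shared})
print(summary)

# Keep only films with enough shared ratings to trust the correlation.
trusted = summary[summary['shared_ratings'] >= 5]
print(trusted.index.tolist())
```

The rarely rated film scores a perfect correlation from just two data points, and the count filter is what removes it.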

In [20]:
# Join ratings df with similar_to_ForestGump df.
similarity_summary = similar_to_ForestGump.join(film_popularity['rating_count'])

In [21]:
# Keep only films which have >= 50 reviews and display the R correlation sorted in descending order.
similarity_summary[similarity_summary['rating_count']>=50].sort_values('PearsonR', ascending=False).head(10)

# This shows the 10 films with at least 50 reviews that are most similar to Forrest Gump.

movieId,PearsonR,rating_count
356,1.0,329
62,0.652144,80
48,0.550118,68
3,0.534682,52
3552,0.520328,52
2268,0.517146,57
1302,0.503845,56
2797,0.492351,91
3489,0.484676,53
1704,0.484042,141


In [22]:
# Take five of the top 10 correlated results and see what genre of film they are.
# See if they are the same/similar genre to Forrest Gump.

# Create a new df called genre_corr_ForestGump,
# indexed 0-4, with a single column named 'movieId'.
genre_corr_ForestGump = pd.DataFrame([356, 2797, 3, 1704, 48], index=np.arange(5), columns=['movieId'])

# Now create a summary table called summary,
# based on a merge between genre_corr_ForestGump and films,
# to show each of the selected movieIds alongside its title and genres.
summary = pd.merge(genre_corr_ForestGump, films, on='movieId')


In [23]:
summary

Unnamed: 0,movieId,title,genres
0,356,Forrest Gump (1994),Comedy|Drama|Romance|War
1,2797,Big (1988),Comedy|Drama|Fantasy|Romance
2,3,Grumpier Old Men (1995),Comedy|Romance
3,1704,Good Will Hunting (1997),Drama|Romance
4,48,Pocahontas (1995),Animation|Children|Drama|Musical|Romance


In [24]:
# Let's look at how many distinct genre labels there are in this dataset. There are 951 unique values (each value is a pipe-separated combination of genre tags).
films['genres'].describe()

count      9742
unique      951
top       Drama
freq       1053
Name: genres, dtype: object


<br>
<font color=green>What we're seeing here is that the films selected from the top correlations with Forrest Gump all share at least two genre tags with it. Brilliant!


The dataset contains 951 unique genre combinations.
So considering our film recommendations picked up movies with the same genre tags as Forrest Gump... our correlation-based recommendation system is on track!</font>

# Truncated Singular Value Decomposition

I am building a model-based collaborative filtering system using matrix factorization with singular value decomposition.
We are going to use the truncated SVD algorithm from scikit-learn.
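Before applying it to the real data, here is a minimal sketch of what truncated SVD does to a matrix's shape. The matrix `toy_X` is randomly generated and purely hypothetical; `n_components=3` is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Hypothetical film-by-user matrix: 8 films, 20 users, ratings 0-5.
toy_X = rng.integers(0, 6, size=(8, 20)).astype(float)
# Blank out roughly 60% of the cells to mimic unrated films (stored as 0).
toy_X[rng.random(toy_X.shape) < 0.6] = 0.0

# Compress the 20 user columns into 3 latent components per film.
svd = TruncatedSVD(n_components=3, random_state=0)
reduced = svd.fit_transform(toy_X)

print(reduced.shape)                         # (8, 3)
print(svd.explained_variance_ratio_.sum())   # share of variance retained
```

Each film is now described by 3 numbers instead of 20, which is the same kind of compression used later in this notebook with 12 components.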

In [26]:
import sklearn
from sklearn.decomposition import TruncatedSVD

In [27]:
# Create DataFrame of relevant columns from existing raw data.
films1 = pd.read_csv('../data/movies.csv')
ratings1 = pd.read_csv('../data/ratings.csv')

In [29]:
ratings1.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [30]:
# Join df's on movieId.
combined_movies_data = pd.merge(films1, ratings1, on='movieId')

In [31]:
combined_movies_data.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483


In [32]:
combined_film_data = combined_movies_data.drop(['genres', 'timestamp'], axis=1)

In [33]:
combined_film_data.head()

Unnamed: 0,movieId,title,userId,rating
0,1,Toy Story (1995),1,4.0
1,1,Toy Story (1995),5,4.0
2,1,Toy Story (1995),7,4.5
3,1,Toy Story (1995),15,2.5
4,1,Toy Story (1995),17,4.5


In [34]:
# Again we want to see the most popular films in this df: count the ratings per film and sort in descending order.
combined_film_data.groupby('movieId')['rating'].count().sort_values(ascending=False).head()

movieId
356     329
318     317
296     307
593     279
2571    278
Name: rating, dtype: int64

In [36]:
# Film '356' was rated 329 times. Let's see what the name of this movie is. Filter with a boolean mask.
is_film_356 = combined_film_data['movieId'] == 356
combined_film_data[is_film_356]['title'].unique()

array(['Forrest Gump (1994)'], dtype=object)

## Building a Utility Matrix

In [None]:
# Create a user-utility matrix using the pivot table method.
# scikit-learn's truncated SVD does not accept NaN, so missing ratings have to be replaced with 0.

In [37]:
ratings_crosstab = combined_film_data.pivot_table(values='rating', index='userId', columns='title', fill_value=0)

In [38]:
ratings_crosstab.head()

userId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,0,0,0.0,0,0,0.0,0.0,0,0.0,0.0,...,0.0,0.0,0.0,0,0,0.0,0.0,0.0,4.0,0
2,0,0,0.0,0,0,0.0,0.0,0,0.0,0.0,...,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0
3,0,0,0.0,0,0,0.0,0.0,0,0.0,0.0,...,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0
4,0,0,0.0,0,0,0.0,0.0,0,0.0,0.0,...,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0
5,0,0,0.0,0,0,0.0,0.0,0,0.0,0.0,...,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0


In [39]:
# I am going to take this utility matrix, transpose it, then use SVD to decompose it.
# Check the shape: 610 users x 9719 film titles.
ratings_crosstab.shape

(610, 9719)

In [40]:
# Use the '.T' attribute to transpose 'ratings_crosstab', so films=rows and customers=columns.
X = ratings_crosstab.values.T

# Check to see if matrix was successfully transposed. 
X.shape

(9719, 610)

In [41]:
# Instantiate an SVD object. n_components is the number of latent factors.
SVD = TruncatedSVD(n_components=12, random_state=17)

# Fit the SVD model to matrix X and return the reduced matrix, which compresses the number of columns to 12.
resultant_matirx = SVD.fit_transform(X)

# Check shape.
resultant_matirx.shape

(9719, 12)

## Generating a Correlation Matrix

The next thing I am going to do is generate a correlation matrix. I will calculate the Pearson's R correlation coefficient for every pair of films in the resultant matrix, with correlations based on similarities between user preferences.
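As a sketch of this step, `np.corrcoef` treats each row of its input as one variable, so passing a films-by-components matrix yields a films-by-films correlation matrix. The array `latent` below is a made-up stand-in for the reduced matrix.

```python
import numpy as np

# Hypothetical latent factors: 4 films (rows), 3 components (columns).
latent = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.5],   # nearly proportional to film 0
    [3.0, 2.0, 1.0],   # mirror image of film 0
    [0.5, 1.5, 0.0],
])

# Each row is treated as a variable, so the result is 4 x 4.
corr = np.corrcoef(latent)
print(corr.shape)
print(np.round(corr, 3))
```

The matrix is symmetric with ones on the diagonal, because every film correlates perfectly with itself.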

In [46]:
# To generate a correlation matrix, use NumPy's corrcoef function.
corr_mat = np.corrcoef(resultant_matirx)
corr_mat.shape

(9719, 9719)

## Isolating 'Forrest Gump' From The Correlation Matrix

In [47]:
# Generate a 'film names' index - this is equal to the columns of the ratings_crosstab matrix.
film_names = ratings_crosstab.columns

# This gives a pandas Index (one entry per film name), so convert it to a list.
film_list = list(film_names)

In [48]:
# Find the index value of the film of interest, 'Forrest Gump'. Every film is now compared to Forrest Gump.
forrest_gump = film_list.index('Forrest Gump (1994)')

# Print to see the index value of Forrest Gump = 3158.
print(forrest_gump)

3158


In [49]:
# Isolate the array which represents Forrest Gump
corr_forrest_gump = corr_mat[forrest_gump]

# Should be a 1-D array of 9719 values. Each value represents how well a film correlates with Forrest Gump.
corr_forrest_gump.shape

(9719,)

## Recommending a Highly Correlated Movie.

I am going to recommend a list of films which exhibit a high degree of correlation with Forrest Gump.

In [50]:
# We want to list films which have a Pearson R close to 1. Select films which have a correlation of <1 but >0.9.
recommendations = list(film_names[(corr_forrest_gump < 1.0) & (corr_forrest_gump > 0.9)])
recommendations

['Apollo 13 (1995)',
 'Braveheart (1995)',
 'Dances with Wolves (1990)',
 'Jurassic Park (1993)',
 'Philadelphia (1993)',
 'Pulp Fiction (1994)',
 "Schindler's List (1993)",
 'Seven (a.k.a. Se7en) (1995)',
 'Shawshank Redemption, The (1994)',
 'Silence of the Lambs, The (1991)',
 'Toy Story (1995)',
 "What's Eating Gilbert Grape (1993)"]

In [52]:
# To generate a list which correlates with Forrest Gump a little more closely, require a Pearson's R greater than 0.95.
closer_recommendations = list(film_names[(corr_forrest_gump < 1.0) & (corr_forrest_gump > 0.95)])
closer_recommendations

['Shawshank Redemption, The (1994)']
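A fixed threshold can return a dozen films or just one, as seen above. One possible alternative, sketched here, is to rank by correlation and keep the top N while excluding the target film by its index. The helper `top_n_similar` and the toy names and values are hypothetical and not part of the original notebook.

```python
import numpy as np

def top_n_similar(corr_row, names, target_index, n=5):
    """Return the n (name, correlation) pairs most correlated with the target, excluding it."""
    corr_row = np.asarray(corr_row, dtype=float)
    order = np.argsort(-corr_row)                       # indices from highest to lowest
    order = [i for i in order if i != target_index]     # drop the target film itself
    return [(names[i], float(corr_row[i])) for i in order[:n]]

# Hypothetical correlation row for 'Film A' against four films.
toy_names = ['Film A', 'Film B', 'Film C', 'Film D']
toy_corr = np.array([1.0, 0.97, 0.42, 0.91])

print(top_n_similar(toy_corr, toy_names, target_index=0, n=2))
# [('Film B', 0.97), ('Film D', 0.91)]
```

With the notebook's own variables, a call might look like `top_n_similar(corr_forrest_gump, film_list, forrest_gump, n=10)`.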

**A person who has watched and liked Forrest Gump is also likely to want to watch The Shawshank Redemption!**