# Correlation Based Recommender

## Imports

In [1]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import math
%matplotlib inline

In [2]:
import os
print(os.listdir("./data"))

['movie.csv', 'rating.csv']


## Data Preprocessing

In [3]:
movies = pd.read_csv('./data/movie.csv')
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
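# Pull the four-digit release year out of the trailing "(YYYY)" in each title.
movies['year'] = movies.title.str.extract(r'\((\d{4})\)', expand=False)
# Remove the "(YYYY)" part from the title; regex=True is required in recent pandas versions.
movies['title'] = movies.title.str.replace(r'\(\d{4}\)', '', regex=True).str.strip()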
movies['genres'] = movies.genres.str.split('|')

movies.head()

  


,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995


In [5]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  27278 non-null  int64 
 1   title    27278 non-null  object
 2   genres   27278 non-null  object
 3   year     27256 non-null  object
dtypes: int64(1), object(3)
memory usage: 852.6+ KB


In [6]:
ratings = pd.read_csv('./data/rating.csv', usecols=['userId', 'movieId', 'rating'],
                     dtype={'userId':np.int32, 'movieId':np.int32, 'rating':np.float32})
ratings.head()

,userId,movieId,rating
0,1,2,3.5
1,1,29,3.5
2,1,32,3.5
3,1,47,3.5
4,1,50,3.5


The ratings table is large, so memory usage matters. Ratings come in half-star steps, so we can reduce memory further by multiplying the rating column by 2, which makes every value an integer, and then casting it to np.int8.

In [7]:
ratings['rating'] = ratings['rating'] * 2
ratings['rating'] = ratings['rating'].astype(np.int8)
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000263 entries, 0 to 20000262
Data columns (total 3 columns):
 #   Column   Dtype
---  ------   -----
 0   userId   int32
 1   movieId  int32
 2   rating   int8 
dtypes: int32(2), int8(1)
memory usage: 171.7 MB


In [8]:
ratings.head()

,userId,movieId,rating
0,1,2,7
1,1,29,7
2,1,32,7
3,1,47,7
4,1,50,7
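
The doubling step can be checked on a toy series. This is a minimal sketch with made-up values, not part of the original notebook; it shows that half-star ratings survive the round trip through int8 and that the integer column is smaller.

```python
import numpy as np
import pandas as pd

# Toy half-star ratings (illustrative values, not taken from rating.csv).
toy = pd.Series([0.5, 3.5, 4.0, 5.0], dtype=np.float32)

# Doubling turns half-star steps into whole numbers, which fit in int8.
encoded = (toy * 2).astype(np.int8)

# Dividing by 2 recovers the original scale.
decoded = encoded / 2.0

print(encoded.tolist())
print(decoded.tolist())
# Bytes used by the values alone (index excluded): float32 vs int8.
print(toy.memory_usage(index=False), encoded.memory_usage(index=False))
```

Pearson correlation does not change when every rating is multiplied by the same positive constant, so working on the doubled scale should not affect the correlations computed later.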


# 1. Item Based

Correlation-based recommenders offer a basic form of collaborative filtering: items are recommended based on how similarly users have rated them, so user preferences are taken into account. These systems use Pearson's r to find the items whose ratings correlate most strongly with the ratings of an item the user has already chosen, and then recommend those items.
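
As a rough illustration of what this correlation measures, the sketch below computes Pearson's r by hand on a tiny user-by-movie matrix and compares it with `DataFrame.corrwith`. The matrix, the movie ids 10, 20 and 30, and the helper `pearson_r` are all hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

# Toy user x movie matrix; rows are users, columns are hypothetical movie ids.
toy_pivot = pd.DataFrame(
    {10: [8, 6, 10, 4, 2],
     20: [7, 6, 9, 5, 3],
     30: [2, 9, 3, 8, 10]},
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)

def pearson_r(x, y):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

target = toy_pivot[10]
by_hand = {m: pearson_r(target, toy_pivot[m]) for m in toy_pivot.columns}
by_pandas = toy_pivot.corrwith(target)

# Movies sorted from most to least correlated with movie 10.
print(by_pandas.sort_values(ascending=False))
```

In this toy data, movie 20 is rated much like movie 10 and gets a correlation close to 1, while movie 30 is rated in the opposite direction and gets a negative correlation.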

In [9]:
# pivot_table on the full ratings table raises a MemoryError,
# so for this part we work with a random sample.
sample_ratings = ratings.sample(n=100000, random_state=20)

# Build the user x movie matrix. Missing ratings are filled with 0 so the matrix can be cast
# to int8 and use less memory; note that this makes an unrated movie look like a rating of 0.
pivot = pd.pivot_table(sample_ratings, values='rating', index='userId', columns='movieId', fill_value=0)
pivot.head()

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,120610,120819,121235,123947,125916,126420,127622,128151,129659,130490
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
pivot = pivot.astype(np.int8)
pivot.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52242 entries, 1 to 138493
Columns: 8433 entries, 1 to 130490
dtypes: int8(8433)
memory usage: 420.5 MB


In [11]:
# Let's look for movies similar to Pulp Fiction (movieId 296)
rand_movie = 296

similar = pivot.corrwith(pivot[rand_movie], drop=True).to_frame(name='PearsonR')
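
Because the pivot was built with `fill_value=0`, every unrated movie enters the correlation as a rating of 0. The sketch below, on made-up data with hypothetical ids, compares that with a pivot that keeps missing ratings as NaN, in which case pandas only uses users who rated both movies. It is an illustration of the effect under these assumptions, not a change to the notebook's approach.

```python
import numpy as np
import pandas as pd

# Toy long-format ratings: users 1-3 rated both movies,
# users 4 and 5 rated only one of them.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4, 5],
    "movieId": [10, 20, 10, 20, 10, 20, 10, 20],
    "rating":  [10, 9, 6, 5, 2, 1, 8, 8],
})

# Zero-filled pivot: an unrated movie is treated as a rating of 0.
filled = toy.pivot_table(values="rating", index="userId",
                         columns="movieId", fill_value=0)

# NaN-preserving pivot: Series.corr skips users missing either rating.
with_nan = toy.pivot_table(values="rating", index="userId", columns="movieId")

r_filled = filled[10].corr(filled[20])
r_nan = with_nan[10].corr(with_nan[20])
print(round(r_filled, 3), round(r_nan, 3))
```

Here the three users who rated both movies agree perfectly, yet the zero-filled correlation drops close to zero. With a sample as sparse as the one used here, this may be one reason the PearsonR values in this notebook are small.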

In [12]:
rating_count = (ratings.groupby('movieId')[['rating']]
                       .count()
                       .sort_values('rating', ascending=False)
                       .reset_index())
rating_count = pd.merge(rating_count, movies, on='movieId')
rating_count.head()

,movieId,rating,title,genres,year
0,296,67310,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994
1,356,66172,Forrest Gump,"[Comedy, Drama, Romance, War]",1994
2,318,63366,"Shawshank Redemption, The","[Crime, Drama]",1994
3,593,63299,"Silence of the Lambs, The","[Crime, Horror, Thriller]",1991
4,480,59715,Jurassic Park,"[Action, Adventure, Sci-Fi, Thriller]",1993


But let's think about this for a minute. Suppose some movies correlate well with Pulp Fiction but have only, say, ten ratings in total. A correlation computed from so few ratings is unreliable, so those movies are probably not all that similar to Pulp Fiction, and they are not very popular either. We therefore need to take into account how popular each movie is, in addition to how well its ratings correlate with those of the chosen movie. To do that, we join the correlation data frame with the rating-count data frame.

In [13]:
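# Attach the number of ratings to each correlation score.
# Note: join() aligns on the index. `similar` is indexed by movieId, while rating_count has a
# positional index after reset_index(), so `rating` is matched by row position, not by movieId.
similar_sum = similar.join(rating_count['rating'])
# Keep movies with at least 500 ratings and take the 11 best-correlated ones.
similar_top10 = similar_sum[similar_sum['rating']>=500].sort_values(['PearsonR', 'rating'],
                                                            ascending=[False, False]).head(11)
# Drop the first row (expected to be Pulp Fiction itself) and add movie names
similar_top10 = pd.merge(similar_top10[1:11], movies[['title', 'movieId']], on='movieId')
similar_top10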

,movieId,PearsonR,rating,title
0,129,0.064756,22020.0,Pie in the Sky
1,3287,0.039553,930.0,"Tigger Movie, The"
2,3727,0.032425,732.0,Near Dark
3,1900,0.028727,2220.0,"Children of Heaven, The (Bacheha-Ye Aseman)"
4,2280,0.027202,1677.0,Clay Pigeons
5,3834,0.026947,693.0,Bronco Billy
6,558,0.026128,8554.0,"Pagemaster, The"
7,888,0.024686,5581.0,Land Before Time III: The Time of the Great Gi...
8,3576,0.024289,789.0,"Hidden, The"
9,3439,0.022755,864.0,Teenage Mutant Ninja Turtles II: The Secret of...
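
One detail worth checking in the join above: `DataFrame.join` aligns on the index, and `rating_count` carries a positional index after `reset_index()`, whereas `similar` is indexed by movieId. The sketch below uses made-up frames (`toy_similar`, `toy_counts`, and the threshold of 500 applied to them are all illustrative) to show how a positional join differs from one keyed on movieId.

```python
import pandas as pd

# Hypothetical correlation scores, indexed by movieId like `similar`.
toy_similar = pd.DataFrame(
    {"PearsonR": [1.0, 0.4, 0.3]},
    index=pd.Index([1, 2, 3], name="movieId"),
)

# Hypothetical rating counts, sorted by count with a positional 0..3 index,
# which mirrors the shape of `rating_count`.
toy_counts = pd.DataFrame({"movieId": [3, 2, 1, 7],
                           "rating":  [900, 600, 20, 5]})

# Positional join: movieId 1 picks up the count stored in row 1, and so on.
by_position = toy_similar.join(toy_counts["rating"])

# Keyed join: index the counts by movieId first so both sides align on it.
by_movie = toy_similar.join(toy_counts.set_index("movieId")["rating"])

# Popularity filter applied to the keyed result.
popular = by_movie[by_movie["rating"] >= 500].sort_values("PearsonR", ascending=False)

print(by_position)
print(by_movie)
print(popular)
```

In the toy data the positional join assigns movie 1 a count of 600 although it has only 20 ratings, so it would wrongly pass the popularity filter. Under this assumption, applying `set_index('movieId')` to `rating_count` before the join would be expected to change the recommendation table above.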


# 2. User Based