## Evaluating Similarity Based on Correlation
#### A Pearson's r correlation coefficient based recommender.

In [40]:
import numpy as np
import pandas as pd

The datasets are hosted on: https://drive.google.com/drive/folders/0B33wKgIl5ZZzT1pLQldveTBmbE0

They were originally published by Ankur Tomar in "USER-USER Collaborative filtering Recommender System in Python", August 25th 2017.

In [41]:
# Read datasets into Jupyter notebook.
films = pd.read_csv('../data/movies.csv', encoding='ISO-8859-1')
ratings = pd.read_csv('../data/ratings.csv', encoding='ISO-8859-1')

In [42]:
# Inspect the first 5 records of the ratings data.
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,12882,1,4.0,1147195252
1,12882,32,3.5,1147195307
2,12882,47,5.0,1147195343
3,12882,50,5.0,1147185499
4,12882,110,4.5,1147195239


In [43]:
films.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Observations 

- Each row of the ratings dataset is one user's rating of one film; the films themselves are listed in the 'films' dataset.
- Ratings are given out of 5. A low rating means the user did not like the film very much and 5 means they loved it.
- Both datasets share a column, movieId, which links each rating to its film.
- In the ratings dataset, userId values repeat, which means a user has rated more than one film.
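
The observations above can be checked in code rather than read off the first five rows. The following is a minimal sketch on a small made-up frame; `toy_ratings` and `toy_films` are hypothetical stand-ins for `ratings` and `films`, and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the real `ratings` and `films` frames.
toy_ratings = pd.DataFrame({
    'userId':  [12882, 12882, 12882, 99, 99],
    'movieId': [1, 32, 47, 1, 47],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5],
})
toy_films = pd.DataFrame({
    'movieId': [1, 32, 47],
    'title':   ['Film A', 'Film B', 'Film C'],
})

# Range of the rating scale actually present in the data.
rating_min = toy_ratings['rating'].min()
rating_max = toy_ratings['rating'].max()

# Columns shared by both frames (the candidate join key).
shared_columns = sorted(set(toy_ratings.columns) & set(toy_films.columns))

# Number of ratings per user; a value above 1 means the userId repeats.
ratings_per_user = toy_ratings['userId'].value_counts()

print(rating_min, rating_max)
print(shared_columns)
print(ratings_per_user)
```

Running the same three checks on the real `ratings` and `films` frames would confirm the scale, the shared key, and the repeated users.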

## Grouping and Ranking Data

In [44]:
# We're going to look at the ratings these films are getting.
# To do this, we take the mean of all the ratings given to each film.

In [45]:
# A new df, 'rating', will be generated from the 'ratings' df.
# Group the ratings df by movieId, then for each movieId take the rating column
# and compute the mean of all the ratings given to that film.

In [46]:
rating = pd.DataFrame(ratings.groupby('movieId')['rating'].mean())
rating.head()

movieId,rating
1,3.793347
2,3.069892
3,2.923077
4,2.576923
5,2.848684


In [47]:
# In addition to the mean, we want to see how popular each film is.
# To do this, we add a 'rating_count' column holding the number of ratings each film received.

rating['rating_count'] = ratings.groupby('movieId')['rating'].count()
rating.head()

# Each movieId now has its average rating and its rating count (the number of ratings the film received).

movieId,rating,rating_count
1,3.793347,496
2,3.069892,279
3,2.923077,78
4,2.576923,13
5,2.848684,76
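
The mean and the count above are computed with two separate `groupby` calls. As an alternative, both can be produced in one pass with pandas named aggregation. This is a sketch on made-up data; `toy_ratings` and `toy_rating` are hypothetical names and not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the real `ratings` frame.
toy_ratings = pd.DataFrame({
    'movieId': [1, 1, 2, 2, 2, 3],
    'rating':  [4.0, 3.0, 5.0, 4.0, 3.0, 2.5],
})

# One groupby, two aggregations: the keyword names become the column names.
toy_rating = toy_ratings.groupby('movieId')['rating'].agg(
    rating='mean',
    rating_count='count',
)

print(toy_rating)
```

The result has the same shape as the `rating` frame built above: one row per movieId, with `rating` and `rating_count` columns.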


In [50]:
# Look at the summary statistics of the rating df.
rating.describe()

Unnamed: 0,rating,rating_count
count,2500.0,2500.0
mean,3.36574,105.802
std,0.488708,97.794701
min,1.153846,3.0
25%,3.058269,41.0
50%,3.460561,75.0
75%,3.73749,132.0
max,4.364362,668.0


- The count row shows that 2500 unique films have been rated in the ratings df.
- The max value of rating_count is 668, which means the most popular film in the dataset has a total of 668 ratings.

In [51]:
# Let's see which film is the most popular one.
# Sort by rating_count in descending order.
rating.sort_values('rating_count', ascending=False).head()


movieId,rating,rating_count
2571,4.195359,668
4993,4.091561,628
356,3.91868,621
296,4.217781,613
5952,4.035176,597
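
If only the single most-rated film is needed, a full sort is not required. The sketch below rebuilds the five rows shown above as a hypothetical `toy_rating` frame (in shuffled order) and uses `idxmax` and `nlargest`; it is an alternative to `sort_values`, not the notebook's own approach.

```python
import pandas as pd

# The five rows from the output above, deliberately out of order.
toy_rating = pd.DataFrame(
    {
        'rating': [3.91868, 4.035176, 4.195359, 4.217781, 4.091561],
        'rating_count': [621, 597, 668, 613, 628],
    },
    index=pd.Index([356, 5952, 2571, 296, 4993], name='movieId'),
)

# idxmax returns the index label (here the movieId) of the largest value.
most_popular_id = toy_rating['rating_count'].idxmax()

# nlargest returns the top-n rows by a column without sorting the whole frame.
top3 = toy_rating.nlargest(3, 'rating_count')

print(most_popular_id)
print(top3)
```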


In [53]:
# So what film is this? Let's find the name.
# Build a boolean mask that is True where movieId == 2571,
# then use it to filter the 'films' df down to the matching record.

films[films['movieId']==2571]

Unnamed: 0,movieId,title,genres
1210,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller
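
Looking up one movieId at a time works for a single film. To attach titles to a whole ranked list, the ranking can be merged with the films frame on movieId. The following is a sketch under stated assumptions: `toy_rating` and `toy_films` are hypothetical frames holding three rows copied from the outputs above, and `top_with_titles` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-ins built from rows shown earlier in the notebook.
toy_rating = pd.DataFrame(
    {
        'rating': [3.793347, 2.576923, 4.195359],
        'rating_count': [496, 13, 668],
    },
    index=pd.Index([1, 4, 2571], name='movieId'),
)
toy_films = pd.DataFrame({
    'movieId': [1, 4, 2571],
    'title': ['Toy Story (1995)', 'Waiting to Exhale (1995)', 'Matrix, The (1999)'],
})

# Rank by popularity, keep the top 2, then pull in titles via the shared key.
# reset_index turns the movieId index into a column so it can be merged on;
# how='left' keeps the ranking order of the left frame.
top_with_titles = (
    toy_rating.sort_values('rating_count', ascending=False)
    .head(2)
    .reset_index()
    .merge(toy_films[['movieId', 'title']], on='movieId', how='left')
)

print(top_with_titles)
```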
