# RECOMMENDATION SYSTEM
Recommender systems typically produce a list of recommendations through collaborative filtering, through content-based filtering (also known as the personality-based approach), or through a combination of the two.

There are three main classes of recommendation systems. Those are:
1. Collaborative filtering systems
2. Content-based filtering systems
3. Hybrid recommendation systems

1. Collaborative filtering systems:
Collaborative filtering approaches build a model from a user's past behaviour (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. This model is then used to predict items (or ratings for items) that the user may be interested in.
2. Content-based filtering systems:
Content-based filtering approaches utilize a series of discrete characteristics of an item in order to recommend additional items with similar properties.
3. Hybrid recommendation systems:
Hybrid recommendation systems combine both collaborative and content-based approaches. They help improve recommendations that are derived from sparse datasets. Netflix is a prime example of a hybrid recommender.
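
To make the content-based idea concrete, here is a minimal sketch that compares movies by the genres they share (Jaccard similarity). The genre strings are copied from the first rows of movies.dat shown later in this notebook; the helper names `jaccard` and `most_similar` are made up for this illustration and are not part of the original workflow.

```python
# Genre strings taken from the first rows of movies.dat (see movies.head() later on).
genres = {
    "Toy Story (1995)": "Animation|Children's|Comedy",
    "Jumanji (1995)": "Adventure|Children's|Fantasy",
    "Grumpier Old Men (1995)": "Comedy|Romance",
    "Waiting to Exhale (1995)": "Comedy|Drama",
    "Father of the Bride Part II (1995)": "Comedy",
}

def jaccard(a, b):
    """Share of genres two movies have in common (0 = none, 1 = identical)."""
    set_a, set_b = set(a.split("|")), set(b.split("|"))
    return len(set_a & set_b) / len(set_a | set_b)

def most_similar(title, top_n=2):
    """Rank the other movies by genre overlap with `title` (hypothetical helper)."""
    scores = {
        other: jaccard(genres[title], g)
        for other, g in genres.items()
        if other != title
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(most_similar("Grumpier Old Men (1995)"))
```

No ratings from other users are involved here: the recommendation depends only on item properties, which is what separates this approach from collaborative filtering.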

# Exploring collaborative filtering approaches

Collaborative filtering systems come in two main flavors. Those are:
    a. Memory-based systems – These systems memorize training data. They often deploy correlation analysis, cosine similarity calculations, and k-nearest neighbor classification (shown in the demo coming up) to make recommendations.
    b. Model-based systems – These systems use (machine learning) models to uncover patterns and trends in training data. They often deploy Naive Bayes classifiers, clustering algorithms, or Singular Value Decomposition (SVD) methods.
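
The difference between the two flavors can be sketched on a tiny example. The rating matrix below is made-up toy data (not taken from MovieLens), and the choice of `k = 2` latent factors is an arbitrary assumption. The memory-based part compares raw rating rows directly; the model-based part first compresses the matrix with a truncated SVD.

```python
import numpy as np

# Made-up toy ratings (rows = users, columns = items, 0 = not rated).
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Memory-based: compare the stored rating rows directly with cosine similarity.
norms = np.linalg.norm(R, axis=1, keepdims=True)
user_sim = (R @ R.T) / (norms @ norms.T)

# Model-based: keep only the k strongest latent factors of R (truncated SVD)
# and rebuild a smoothed, low-rank approximation of the rating matrix.
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(user_sim, 2))
print(np.round(R_approx, 2))
```

In the low-rank reconstruction, cells that were 0 (unrated) receive non-zero values, which a model-based system can read as predicted preferences.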

# ----------- 1. Prepare Problem ---------------
We use the MovieLens dataset for this purpose. It was collected by the GroupLens Research Project at the University of Minnesota. The files used here are from the MovieLens 1M dataset.

These files contain 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000.

RATINGS FILE DESCRIPTION  - UserID::MovieID::Rating::Timestamp
USERS FILE DESCRIPTION    - UserID::Gender::Age::Occupation::Zip-code
MOVIES FILE DESCRIPTION   - MovieID::Title::Genres

- Some MovieIDs do not correspond to a movie due to accidental duplicate entries and/or test entries
- Movies are mostly entered by hand, so errors and inconsistencies may exist

The goal of this project is to predict the rating that a given user would give to a given movie.

# Problem: Recommend new movies to users.

In [2]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [5]:
# b) Load dataset
#Reading Ratings files:
rnames = ['UserID', 'MovieID', 'Rating', 'Timestamp']
ratings = pd.read_table('dataset/ratings.dat', sep = '::', header = None, names = rnames, engine='python')

In [6]:
#Reading users files:
unames = ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code']
users = pd.read_table('dataset/users.dat', sep = '::', header = None, names = unames, engine='python')

In [5]:
#Dimensions of Dataset
# shape (note: `movies` is loaded in the "Load Movies data" section further down; run that cell first)
print("ratings :")
print(ratings.shape)
print("users :")
print(users.shape)
print("movies :")
print(movies.shape)

ratings :
(1000209, 4)
users :
(6040, 5)
movies :
(3883, 3)


In [9]:
# 2. Summarize Data
ratings.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


Number of unique users

In [6]:
len(ratings.UserID.unique())

6040

Number of unique movies

In [7]:
len(ratings.MovieID.unique())

3706

- So the ratings data covers a total of 3,706 movies and 6,040 users.

Let's drop the Timestamp column. We do not need it.

In [8]:
ratings.drop( "Timestamp", inplace = True, axis = 1 )

In [9]:
ratings.head()

Unnamed: 0,UserID,MovieID,Rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5


# Load Movies data

In [7]:
#Reading movies files:
mnames = ['MovieID', 'Title', 'Genres']
movies = pd.read_table('dataset/movies.dat', sep = '::', header = None, names = mnames, engine='python')

In [20]:
movies.head()

Unnamed: 0,MovieID,Title,Genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [8]:
movies = movies.iloc[:, :2]
movies.columns = ['MovieID', 'Title']

Finding User Similarities

In [12]:
#Finding User Similarities
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine, correlation

# Create the pivot table

In [9]:
user_movies_df = ratings.pivot( index='UserID', columns='MovieID', values = "Rating" ).reset_index(drop=True)

In [21]:
# Before filling: movies a user has not rated appear as missing values
user_movies_df.head()

MovieID,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
0,5.0,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,2.0,,,,,...,,,,,,,,,,


# Fill '0' for ratings not given by users

In [10]:
user_movies_df.fillna(0, inplace = True)

In [23]:
user_movies_df.head()

MovieID,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
user_movies_df.shape

(6040, 3706)
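
The shape above also tells us how sparse the user-movie matrix is. The short calculation below uses only the counts already reported in this notebook (6,040 users, 3,706 rated movies, 1,000,209 ratings) and relies on each user rating a given movie at most once, which the pivot step requires.

```python
n_users, n_movies = 6040, 3706      # shape of user_movies_df
n_ratings = 1_000_209               # number of rows in the ratings table

n_cells = n_users * n_movies        # every possible (user, movie) pair
density = n_ratings / n_cells       # fraction of cells holding a real rating

print(f"cells:  {n_cells:,}")
print(f"filled: {density:.2%}, empty: {1 - density:.2%}")
```

Only about 4.5% of the cells hold an actual rating, so the vast majority of the zeros in the matrix mean "not rated" rather than "rated badly".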

In [25]:
user_movies_df.iloc[10:20, 20:30]

MovieID,21,22,23,24,25,26,27,28,29,30
10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,5.0
17,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,3.0,0.0
18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Calculating Cosine Similarity Score
    Based on the ratings users have given to different items, we can calculate the distances between them. The smaller the distance, the more similar the users are.
    Many similarity measures can be used for this. The most widely used are Euclidean distance, cosine similarity, and Pearson correlation. The code below uses cosine similarity, computed as 1 minus the cosine distance.
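
As a small worked example of these three measures, the sketch below computes each one by hand with NumPy. The two rating vectors `u` and `v` are made-up values for two hypothetical users who rated the same four movies.

```python
import numpy as np

# Made-up rating vectors for two users over the same four movies.
u = np.array([5.0, 3.0, 4.0, 1.0])
v = np.array([4.0, 2.0, 5.0, 1.0])

# Euclidean distance: straight-line distance, 0 means identical ratings.
euclid_dist = np.linalg.norm(u - v)

# Cosine similarity: cosine of the angle between the vectors, 1 means same direction.
cosine_sim = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Pearson correlation: cosine similarity after subtracting each user's mean rating.
uc, vc = u - u.mean(), v - v.mean()
pearson_sim = uc @ vc / (np.linalg.norm(uc) * np.linalg.norm(vc))

print(f"Euclidean distance : {euclid_dist:.3f}")
print(f"Cosine similarity  : {cosine_sim:.3f}")
print(f"Pearson correlation: {pearson_sim:.3f}")
```

Pearson correlation removes each user's average rating first, so it is less affected by users who rate everything high or everything low.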

In [13]:
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() returns the same array
user_sim = 1 - pairwise_distances( user_movies_df.to_numpy(), metric="cosine" )

In [14]:
user_sim_df = pd.DataFrame( user_sim )

In [15]:
user_sim_df[0:5]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,6030,6031,6032,6033,6034,6035,6036,6037,6038,6039
0,1.0,0.096382,0.12061,0.132455,0.090158,0.179222,0.059678,0.138241,0.226148,0.255288,...,0.170588,0.082006,0.069807,0.033663,0.114877,0.186329,0.135979,0.0,0.174604,0.13359
1,0.096382,1.0,0.151479,0.171176,0.114394,0.100865,0.305787,0.203337,0.190198,0.226861,...,0.112503,0.091222,0.268565,0.014286,0.183384,0.228241,0.206274,0.066118,0.066457,0.218276
2,0.12061,0.151479,1.0,0.151227,0.062907,0.074603,0.138332,0.077656,0.126457,0.213655,...,0.09296,0.125864,0.161507,0.0,0.097308,0.143264,0.107744,0.120234,0.094675,0.133144
3,0.132455,0.171176,0.151227,1.0,0.045094,0.013529,0.130339,0.100856,0.093651,0.120738,...,0.163629,0.093041,0.382803,0.0,0.082097,0.170583,0.127464,0.062907,0.064634,0.137968
4,0.090158,0.114394,0.062907,0.045094,1.0,0.047449,0.126257,0.220817,0.26133,0.117052,...,0.100652,0.035732,0.061806,0.054151,0.179083,0.293365,0.172686,0.020459,0.027689,0.241437


• Users with highest similarity values can be treated as similar users

In [16]:
user_sim_df.idxmax(axis=1)[0:5]

0    0
1    1
2    2
3    3
4    4
dtype: int64

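# Excluding Self-Similarity
Every user has a similarity of 1.0 with themselves, so idxmax above simply returns each user's own index. Setting the diagonal to 0 removes these self-matches.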

In [19]:
np.fill_diagonal( user_sim, 0 )
user_sim_df = pd.DataFrame( user_sim )
user_sim_df[0:5]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,6030,6031,6032,6033,6034,6035,6036,6037,6038,6039
0,0.0,0.096382,0.12061,0.132455,0.090158,0.179222,0.059678,0.138241,0.226148,0.255288,...,0.170588,0.082006,0.069807,0.033663,0.114877,0.186329,0.135979,0.0,0.174604,0.13359
1,0.096382,0.0,0.151479,0.171176,0.114394,0.100865,0.305787,0.203337,0.190198,0.226861,...,0.112503,0.091222,0.268565,0.014286,0.183384,0.228241,0.206274,0.066118,0.066457,0.218276
2,0.12061,0.151479,0.0,0.151227,0.062907,0.074603,0.138332,0.077656,0.126457,0.213655,...,0.09296,0.125864,0.161507,0.0,0.097308,0.143264,0.107744,0.120234,0.094675,0.133144
3,0.132455,0.171176,0.151227,0.0,0.045094,0.013529,0.130339,0.100856,0.093651,0.120738,...,0.163629,0.093041,0.382803,0.0,0.082097,0.170583,0.127464,0.062907,0.064634,0.137968
4,0.090158,0.114394,0.062907,0.045094,0.0,0.047449,0.126257,0.220817,0.26133,0.117052,...,0.100652,0.035732,0.061806,0.054151,0.179083,0.293365,0.172686,0.020459,0.027689,0.241437


# Finding user similarities

In [20]:
user_sim_df.idxmax(axis=1).sample( 10, random_state = 10 )

5720    5776
6011    5966
290     3750
1985    1255
1065    2744
413      436
3600    3270
4635    4845
1969    1904
5698    4883
dtype: int64

# To be continued...