## Creating a recommendation system

This notebook uses several Python-based machine learning and data science libraries to build a recommendation system capable of recommending movies based on user input.

### 1. Problem definition

> Given the ratings customers have already given to movies, can we predict how a customer would rate the other movies and recommend the highest-scoring ones?

### 2. Data 
The data comes from the Netflix Prize dataset on Kaggle:
> https://www.kaggle.com/netflix-inc/netflix-prize-data

## Loading our data

In [42]:
### Let's import the packages we are going to use

# Regular EDA
import pandas as pd
import numpy as np

# Surprise (scikit-surprise) for recommender algorithms
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

### Let's load our data

In [26]:
# Function to load our data
def readFile(file_path, rows=100000):
    """
    Read a Netflix Prize ratings file and return it as a pandas DataFrame.

    The file is made of blocks: a header line "<movie id>:" followed by one
    "<customer id>,<rating>,<date>" line per rating of that movie.
    Columns of the returned DataFrame:
    > Cust_Id: customer ID, the first field of a rating line
    > Movie_Id: movie ID, taken from the most recent header line
    > Rating: the second field of a rating line
    > Date: the last field of a rating line
    Only the first `rows` lines of the file (header lines included) are read.
    """
    # Create a dictionary for data
    data_dict = {'Cust_Id': [], 'Movie_Id': [], 'Rating': [], 'Date': []}
    with open(file_path, 'r') as f:
        for count, line in enumerate(f, start=1):
            if count > rows:
                break
            line = line.strip()
            if line.endswith(':'):
                # Header line: drop the trailing ':' to get the movie ID
                movie_id = int(line[:-1])
            else:
                customer_id, rating, date = line.split(',')
                data_dict['Cust_Id'].append(customer_id)
                data_dict['Movie_Id'].append(movie_id)
                data_dict['Rating'].append(rating)
                data_dict['Date'].append(date)
    return pd.DataFrame(data_dict)
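
To make the file layout concrete, here is a minimal, self-contained sketch of the same parsing logic applied to a few in-memory lines instead of a real file. The helper name `parse_ratings` is hypothetical. The two rows under `1:` are copied from the first rows of the data shown further down in this notebook, while the block for movie `2` is invented for illustration.

```python
import io

# Sample in the Netflix Prize layout: a "<movie id>:" header line followed by
# "<customer id>,<rating>,<date>" lines. The row under "2:" is invented.
SAMPLE = """1:
1488844,3,2005-09-06
822109,5,2005-05-13
2:
111111,4,2005-01-01
"""

def parse_ratings(lines, max_lines=None):
    """Return a list of (customer_id, movie_id, rating, date) tuples."""
    records = []
    movie_id = None
    for count, line in enumerate(lines, start=1):
        if max_lines is not None and count > max_lines:
            break
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith(':'):
            movie_id = int(line[:-1])  # header line: remember the current movie
        else:
            customer_id, rating, date = line.split(',')
            records.append((customer_id, movie_id, float(rating), date))
    return records

records = parse_ratings(io.StringIO(SAMPLE))
for record in records:
    print(record)
```

Note that the customer IDs stay strings here, just as they do in `readFile`, while the movie IDs are converted to integers.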

In [28]:
# Load each file into its own DataFrame
df1 = readFile('combined_data_1.txt', rows=100000)
df2 = readFile('combined_data_2.txt', rows=100000)
df3 = readFile('combined_data_3.txt', rows=100000)
df4 = readFile('combined_data_4.txt', rows=100000)

df1['Rating'] = df1['Rating'].astype(float)
df2['Rating'] = df2['Rating'].astype(float)
df3['Rating'] = df3['Rating'].astype(float)
df4['Rating'] = df4['Rating'].astype(float)

In [30]:
df1.dtypes

Cust_Id      object
Movie_Id      int64
Rating      float64
Date         object
dtype: object

Now let's combine our four DataFrames into a single one. Because they all share the same columns, we can simply stack them one after another and rebuild the index.

In [36]:
# DataFrame.append was removed in pandas 2.0, so stack the frames with pd.concat;
# ignore_index=True renumbers the rows from 0 to len(df) - 1
df = pd.concat([df1, df2, df3, df4], ignore_index=True)
df.head(10)

|   | Cust_Id | Movie_Id | Rating | Date |
|---|---|---|---|---|
| 0 | 1488844 | 1 | 3.0 | 2005-09-06 |
| 1 | 822109 | 1 | 5.0 | 2005-05-13 |
| 2 | 885013 | 1 | 4.0 | 2005-10-19 |
| 3 | 30878 | 1 | 4.0 | 2005-12-26 |
| 4 | 823519 | 1 | 3.0 | 2004-05-03 |
| 5 | 893988 | 1 | 3.0 | 2005-11-17 |
| 6 | 124105 | 1 | 4.0 | 2004-08-05 |
| 7 | 1248029 | 1 | 3.0 | 2004-04-22 |
| 8 | 1842128 | 1 | 4.0 | 2004-05-09 |
| 9 | 2238063 | 1 | 3.0 | 2005-05-11 |


Next, the movie titles file can be loaded into memory as a pandas DataFrame.

In [39]:
df_title = pd.read_csv('movie_titles.csv', encoding = "ISO-8859-1", header = None, names = ['Movie_Id', 'Year', 'Name'])
df_title.head(10)

|   | Movie_Id | Year | Name |
|---|---|---|---|
| 0 | 1 | 2003.0 | Dinosaur Planet |
| 1 | 2 | 2004.0 | Isle of Man TT 2004 Review |
| 2 | 3 | 1997.0 | Character |
| 3 | 4 | 1994.0 | Paula Abdul's Get Up & Dance |
| 4 | 5 | 2004.0 | The Rise and Fall of ECW |
| 5 | 6 | 1997.0 | Sick |
| 6 | 7 | 1992.0 | 8 Man |
| 7 | 8 | 2004.0 | What the #$*! Do We Know!? |
| 8 | 9 | 1991.0 | Class of Nuke 'Em High 2 |
| 9 | 10 | 2001.0 | Fighter |


## 2. Training our model

The `Reader` class in Surprise describes how ratings are parsed: the line format when reading from a file, and the rating scale (1 to 5 by default). We are going to use the `SVD` (singular value decomposition) algorithm to train our model.
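
As a rough illustration of what the trained model computes, Surprise's `SVD` estimates a rating as the global mean plus a user bias, an item bias and the dot product of a user and an item factor vector. For an unknown user or item, the corresponding bias and factors are taken to be zero. The sketch below reproduces that rule in plain Python. The function name `predict_rating` and all numbers are made up for illustration and are not taken from the trained model.

```python
def predict_rating(mu, user_bias, item_bias, user_factors, item_factors,
                   rating_scale=(1.0, 5.0)):
    """Estimate mu + b_u + b_i + q_i . p_u, clipped to the rating scale.

    Pass None for the user (or item) bias and factors to mimic an unknown
    user (or item), whose terms are dropped from the sum.
    """
    estimate = mu
    if user_bias is not None:
        estimate += user_bias
    if item_bias is not None:
        estimate += item_bias
    if user_factors is not None and item_factors is not None:
        estimate += sum(p * q for p, q in zip(user_factors, item_factors))
    low, high = rating_scale
    return min(high, max(low, estimate))

# Hypothetical parameters for one user and one movie
mu = 3.5
known = predict_rating(mu, 0.25, 0.5, [0.5, -1.0], [1.0, 0.25])
unknown_user = predict_rating(mu, None, 0.5, None, [1.0, 0.25])
print(known, unknown_user)
```

With these numbers the known user gets 4.5 and the unknown user gets 4.0: without user-specific terms, the estimate depends only on the movie.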

In [45]:
# Create an instance from Reader class
reader = Reader()

# Load our data using Dataset. The Dataset module in Surprise provides 
# different methods for loading data from files, Pandas DataFrames, or 
# built-in datasets
data = Dataset.load_from_df(df[['Cust_Id', 'Movie_Id', 'Rating']], reader)

# As mentioned, we're going to use SVD for our recommendation system
svd = SVD()

## 3. Evaluation

We're going to use 5-fold cross-validation to evaluate our model, reporting the RMSE and MAE on each held-out fold.

In [46]:
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0103  1.0152  1.0159  1.0165  1.0169  1.0149  0.0024  
MAE (testset)     0.8065  0.8084  0.8111  0.8098  0.8084  0.8088  0.0015  
Fit time          18.64   18.45   20.74   18.93   19.58   19.27   0.83    
Test time         0.62    0.62    0.57    0.69    0.57    0.61    0.04    


{'test_rmse': array([1.01026731, 1.01523398, 1.01592167, 1.01645601, 1.01685702]),
 'test_mae': array([0.80651244, 0.80840073, 0.81105676, 0.80978981, 0.80842716]),
 'fit_time': (18.636523723602295,
  18.448293924331665,
  20.74327254295349,
  18.933855533599854,
  19.580331563949585),
 'test_time': (0.6229898929595947,
  0.6233022212982178,
  0.5690250396728516,
  0.6871788501739502,
  0.5704398155212402)}
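
For reference, the two error measures and the summary columns can be reproduced with the standard library alone. This is a minimal sketch: the helper names `rmse` and `mae` and the toy ratings are hypothetical, while the per-fold RMSE values are copied from the output above.

```python
import math
import statistics

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Toy ratings, invented for illustration
actual = [3.0, 4.0, 5.0]
predicted = [3.0, 5.0, 3.0]
print(rmse(actual, predicted), mae(actual, predicted))

# Per-fold RMSE values copied from the cross-validation output
fold_rmse = [1.01026731, 1.01523398, 1.01592167, 1.01645601, 1.01685702]
mean_rmse = statistics.mean(fold_rmse)
std_rmse = statistics.pstdev(fold_rmse)  # population standard deviation
print(round(mean_rmse, 4), round(std_rmse, 4))
```

Rounded to four decimals, the mean and population standard deviation come out as 1.0149 and 0.0024, matching the `Mean` and `Std` columns of the table. RMSE penalises large errors more heavily than MAE, which is why it is the larger of the two here.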

Once the model has been evaluated to our satisfaction, we can retrain it on the entire dataset:

In [47]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1d7d7ba30a0>

## 4. Testing our model

After a recommendation model has been trained appropriately, it can be used for prediction.
> For example, given a user (e.g., Customer Id 785314), we can use the trained model to predict the ratings given by the user on different products (i.e., Movie titles):

In [48]:
titles = df_title.copy()
titles['Estimate_Score'] = titles['Movie_Id'].apply(lambda x: svd.predict(785314, x).est)
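
One detail to watch in calls like the one above: Surprise keeps raw user and item IDs exactly as they appear in the training data and looks them up by equality, so an ID stored as a string is not matched by an integer with the same digits. The sketch below mimics that lookup with a plain dictionary. The names `raw_to_inner` and `is_known_user` are hypothetical stand-ins, not Surprise's API, and the two IDs are copied from the first rows of `df`.

```python
# Raw customer IDs as they are stored in the DataFrame: strings, because
# they were read from a text file and never converted
raw_to_inner = {"1488844": 0, "822109": 1}

def is_known_user(raw_id):
    """True if raw_id matches a stored ID exactly (same type and value)."""
    return raw_id in raw_to_inner

print(is_known_user("822109"))      # string ID
print(is_known_user(822109))        # integer ID with the same digits
print(is_known_user(str(822109)))   # converting to str first
```

Only the string forms match. It is therefore worth checking `df['Cust_Id'].dtype` and passing IDs of the same type to `svd.predict`.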

To recommend products (i.e., movies) to the given user, we can sort the list of movies in decreasing order of predicted ratings and take the top N movies as recommendations:

In [49]:
titles = titles.sort_values(by=['Estimate_Score'], ascending=False)
titles.head(10)

|   | Movie_Id | Year | Name | Estimate_Score |
|---|---|---|---|---|
| 12 | 13 | 2003.0 | Lord of the Rings: The Return of the King: Ext... | 4.494058 |
| 9235 | 9236 | 1998.0 | South Park: Season 2 | 4.092507 |
| 4505 | 4506 | 1961.0 | Breakfast at Tiffany's | 4.033714 |
| 24 | 25 | 1997.0 | Inspector Morse 31: Death Is Now My Neighbour | 4.021449 |
| 13379 | 13380 | 1949.0 | Stray Dog | 4.016029 |
| 4508 | 4509 | 1977.0 | Little House on the Prairie: Season 4 | 3.979105 |
| 4520 | 4521 | 2002.0 | Wire in the Blood: Justice Painted Blind | 3.969216 |
| 4 | 5 | 2004.0 | The Rise and Fall of ECW | 3.968313 |
| 13374 | 13375 | 1963.0 | Andy Griffith Show: Classic Favorites | 3.940749 |
| 13377 | 13378 | 1940.0 | His Girl Friday | 3.919252 |
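
The ranking step does not depend on pandas. As a small sketch, the same top-N selection can be done with `heapq.nlargest` from the standard library, which avoids sorting the full list. The helper name `top_n` is hypothetical; the titles and scores are copied from the table above, with the scores rounded to two decimals.

```python
import heapq

def top_n(scored_titles, n=10):
    """Return the n (title, score) pairs with the highest scores, best first."""
    return heapq.nlargest(n, scored_titles, key=lambda pair: pair[1])

scored_titles = [
    ("The Rise and Fall of ECW", 3.97),
    ("South Park: Season 2", 4.09),
    ("Stray Dog", 4.02),
    ("Breakfast at Tiffany's", 4.03),
]
best = top_n(scored_titles, n=2)
print(best)
```

For a catalogue of this size, sorting the whole DataFrame as above is perfectly adequate; the heap-based version mainly helps when only a handful of items are needed from a very long list.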


In [51]:
# Let's try another customer
titles2 = df_title.copy()
titles2['Estimate_Score'] = titles2['Movie_Id'].apply(lambda x: svd.predict(1248029, x).est)
titles2 = titles2.sort_values(by=['Estimate_Score'], ascending=False)
titles2.head(10)

|   | Movie_Id | Year | Name | Estimate_Score |
|---|---|---|---|---|
| 12 | 13 | 2003.0 | Lord of the Rings: The Return of the King: Ext... | 4.494058 |
| 9235 | 9236 | 1998.0 | South Park: Season 2 | 4.092507 |
| 4505 | 4506 | 1961.0 | Breakfast at Tiffany's | 4.033714 |
| 24 | 25 | 1997.0 | Inspector Morse 31: Death Is Now My Neighbour | 4.021449 |
| 13379 | 13380 | 1949.0 | Stray Dog | 4.016029 |
| 4508 | 4509 | 1977.0 | Little House on the Prairie: Season 4 | 3.979105 |
| 4520 | 4521 | 2002.0 | Wire in the Blood: Justice Painted Blind | 3.969216 |
| 4 | 5 | 2004.0 | The Rise and Fall of ECW | 3.968313 |
| 13374 | 13375 | 1963.0 | Andy Griffith Show: Classic Favorites | 3.940749 |
| 13377 | 13378 | 1940.0 | His Girl Friday | 3.919252 |

Note that the two lists are identical, down to the last digit of every score. The most likely cause is that `Cust_Id` is stored as a string in `df` while both customer IDs are passed to `svd.predict` as integers, so neither is found in the training set and both are treated as unknown users. The estimate then reduces to the global mean plus the movie's bias, which is the same for every customer. Passing the ID as a string (for example `'1248029'`) lets the model use that customer's learned bias and factors.
