## 5) Modeling

In this notebook, we'll begin evaluating how we can model our data and see how well some basic prediction methods work as a recommendation system! To begin with, we'll consider the framework we should work within.

#### Baseline Performance Criteria

We'll be using the Root Mean Squared Error (RMSE) metric to evaluate how good (or bad) our model is as a simple baseline:

RMSE: $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat y_i - y_i)^2}$

In addition, we should consider what the "absolute" worst score would be. Since ratings can only take integer values from 1 to 5, the most any single prediction within that range can be "off" is 4. The RMSE can therefore be at most 4, and only if every prediction is off by the full 4, which gives us a frame of reference for an absolute disaster of a model!

However, another thing to consider is that we could just guess the midpoint of the rating scale ("3") for everything and see how well that does in terms of RMSE relative to our model.
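To make these two reference points concrete, here is a minimal sketch on a made-up set of four ratings (the arrays and the helper name `rmse` are hypothetical and are not taken from the dataset). It shows the worst case of 4 and what a constant guess of 3 gives when every true rating sits at an end of the scale.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical toy ratings, all at the ends of the 1-5 scale
toy_true = np.array([1, 1, 5, 5])

# Worst case: every prediction sits at the opposite end of the scale
worst_pred = np.array([5, 5, 1, 1])
worst_rmse = rmse(toy_true, worst_pred)

# Constant guess of 3: no single error can exceed 2
const_rmse = rmse(toy_true, np.full(toy_true.shape, 3))

print(worst_rmse, const_rmse)
```

On real ratings most values are not at the extremes, so the constant guess should land well below 2.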

In [1]:
## import relevant baseline packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Let's create a RMSE function
def compute_rmse(y_true, y_pred):
    """Compute Root Mean Squared Error."""
    
    return np.sqrt(np.mean(np.power(y_true - y_pred, 2)))

In [3]:
# Let's load in our testing and training sets
data_TRAINING = pd.read_csv('DF_TRAINING.csv')
data_TESTING = pd.read_csv('DF_TESTING.csv')

In [4]:
# Drop the redundant column "Unnamed: 0"
data_TRAINING.drop(['Unnamed: 0'],axis= 1,inplace = True)
data_TESTING.drop(['Unnamed: 0'],axis= 1,inplace = True)

In [5]:
data_TRAINING.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79223 entries, 0 to 79222
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      79223 non-null  int64 
 1   movie_id     79223 non-null  int64 
 2   rating       79223 non-null  int64 
 3   timestamp    79223 non-null  int64 
 4   movie_title  79223 non-null  object
 5   testing      79223 non-null  int64 
dtypes: int64(5), object(1)
memory usage: 3.6+ MB


In [6]:
data_TESTING.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20502 entries, 0 to 20501
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      20502 non-null  int64 
 1   movie_id     20502 non-null  int64 
 2   rating       20502 non-null  int64 
 3   timestamp    20502 non-null  int64 
 4   movie_title  20502 non-null  object
 5   testing      20502 non-null  int64 
dtypes: int64(5), object(1)
memory usage: 961.2+ KB


In [7]:
data_TESTING.head()

,user_id,movie_id,rating,timestamp,movie_title,testing
0,308,1,4,887736532,Toy Story (1995),1
1,280,1,4,891700426,Toy Story (1995),1
2,181,1,3,878962392,Toy Story (1995),1
3,145,1,3,882181396,Toy Story (1995),1
4,67,1,3,875379445,Toy Story (1995),1


In [8]:
## Create a holder DataFrame for our y_true values and all the different y_preds
## we will make throughout this capstone. This lets us store the values and pull them out later
## without having to re-run the models
results_table = pd.DataFrame(columns=['user_id','movie_id','y_true','y_pred_naive','y_pred_average','y_pred_euc_sim','y_pred_cosine_sim'])
results_table['user_id'] = data_TESTING['user_id']
results_table['movie_id'] = data_TESTING['movie_id']

In [9]:
# Let's make a 'complete' data table as well in case it comes in handy
data_COMPLETE = pd.concat([data_TRAINING,data_TESTING],axis=0)
data_COMPLETE.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 99725 entries, 0 to 20501
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      99725 non-null  int64 
 1   movie_id     99725 non-null  int64 
 2   rating       99725 non-null  int64 
 3   timestamp    99725 non-null  int64 
 4   movie_title  99725 non-null  object
 5   testing      99725 non-null  int64 
dtypes: int64(5), object(1)
memory usage: 5.3+ MB


In [10]:
# Let's do a quick check of the "median" rating of 3 against our Testing Data Set
results_table['y_true'] = np.array(data_TESTING['rating'])
results_table['y_pred_naive'] = np.full(shape=len(results_table['y_true']),fill_value=3)
RMSE_naive = compute_rmse(results_table['y_true'], results_table['y_pred_naive'])
print(f'The naive guess of median value "3" RMSE is: {RMSE_naive}')

The naive guess of median value "3" RMSE is: 1.2401007885881654


We can now move on to trying out some methods of developing a rating prediction for our movies. First, we'll apply an unweighted average across all the ratings in our training set for a given movie.

#### Averaging across users for a movie

The first prediction we can make is to take the average rating across all the users in our training set for a given movie and use that as the prediction for every test rating of that movie.
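As a rough illustration of the idea, the sketch below computes per-movie means from training rows only and looks them up for each test row. The toy frames and the names `toy_train`, `toy_test`, and `movie_means` are hypothetical; the notebook's own cells below use a row-by-row lookup instead of `Series.map`.

```python
import pandas as pd

# Hypothetical toy frames standing in for the training and testing sets
toy_train = pd.DataFrame({
    "user_id":  [1, 2, 3, 1, 2],
    "movie_id": [10, 10, 10, 20, 20],
    "rating":   [4, 5, 3, 2, 3],
})
toy_test = pd.DataFrame({
    "user_id":  [4, 4],
    "movie_id": [10, 20],
    "rating":   [5, 2],
})

# Mean rating per movie, computed from training rows only
movie_means = toy_train.groupby("movie_id")["rating"].mean()

# Look up each test row's movie in the training means
toy_test["predict"] = toy_test["movie_id"].map(movie_means)
print(toy_test)
```

A movie that never appears in the training rows would get `NaN` here, so a fallback (for example the overall training mean) may be needed in that case.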

In [11]:
# First, let's make a table that calculates the mean rating for each movie in the training set
mean_table_TRAINING = data_TRAINING[['movie_id','rating']].groupby('movie_id').mean()
mean_table_TRAINING['movie_id'] = mean_table_TRAINING.index
mean_table_TRAINING = mean_table_TRAINING.reset_index(drop=True) # reset_index is not in-place, so assign the result
mean_table_TRAINING

,rating,movie_id
0,3.894737,1
1,3.180952,2
2,3.027778,3
3,3.572289,4
4,3.318841,5
...,...,...
1468,4.500000,1639
1469,3.333333,1643
1470,2.000000,1652
1471,3.000000,1658


In [12]:
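# Second, we'll need a holder for the predictions
# Note: this is a reference to data_TESTING rather than a copy, so the 'predict' column is added to data_TESTING as well
meanPredictHolder = data_TESTING
meanPredictHolder['predict'] = 0.0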

In [13]:
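# Third, we need a way to assign each test row its movie's mean rating, which we can wrap in a function:
def vlookup_assign(df_test, mean_table):
    """Emulate a spreadsheet VLOOKUP: fill 'predict' with each movie's mean rating from mean_table."""
    movie_col = df_test.columns.get_loc('movie_id')
    predict_col = df_test.columns.get_loc('predict')
    for counter in range(0,len(df_test)):
        movieID = df_test.iloc[counter,movie_col]
        ## Select the scalar mean rating for this movie (assumes the movie appears in mean_table)
        meanPredict = mean_table.loc[mean_table['movie_id'] == movieID, 'rating'].iloc[0]
        df_test.iloc[counter,predict_col] = meanPredict
    return df_test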

In [14]:
# Fourth, let's run our model
results = vlookup_assign(meanPredictHolder,mean_table_TRAINING)

In [15]:
# Fifth, let's evaluate the results of our model
#y_true = np.array(results['rating'])
results_table['y_pred_average'] = np.array(meanPredictHolder['predict'])
RMSE_means = compute_rmse(results_table['y_true'], results_table['y_pred_average'])
print(f'The mean guess across training set users for any given movie RMSE is: {RMSE_means}')

The mean guess across training set users for any given movie RMSE is: 1.0299732829945232


Averaging across users gives an RMSE of approximately 1.03, whereas our naive guess of "3" for everything gave an RMSE of approximately 1.24. That is a decrease of about 16.94%, which is a fairly good improvement considering how straightforward the approach is. Let's see if we can make the prediction even more accurate by accounting for how similar or dissimilar a user is to the other users.
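The percentage quoted above is the relative drop in RMSE. As a quick check, the arithmetic can be reproduced from the two values printed earlier (the variable names here are only illustrative):

```python
rmse_naive = 1.2401007885881654   # printed above for the constant guess of 3
rmse_means = 1.0299732829945232   # printed above for the per-movie average

# Relative decrease = (old - new) / old
relative_drop = (rmse_naive - rmse_means) / rmse_naive
print(f"{relative_drop:.2%}")
```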

#### Prediction using similarity measures

For this method, we will consider not only the movie whose rating needs to be predicted but also the specific user we are making the prediction for. The logic here is that every person is different, and the better we can account for these differences, the more accurate our predictions will be. This methodology is more in-depth than the straightforward approaches we just used, so we'll need to think through how we go about it.

There are some considerations we need to make when trying to use similarities. First, we need to avoid data leakage (letting testing data affect the training data or training process), which means that if we create data structures to support the similarity calculations, none of the testing data can be inside them. Second, we'll need to decide how to handle "blank" values (i.e., what happens when two users are being compared for a specific movie but one or the other hasn't rated some other movie). For the time being, we'll treat any blank value as a "zero" to facilitate the calculations. Third, we need a structure that allows us to compare users, which suggests a pivot table as a place to start, since it gives us a straightforward table of users and their ratings for any given movie.
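A small sketch of the second and third points, using a made-up long-format table (the frame `toy_train` and its values are hypothetical): `pivot_table` turns training rows into a user-by-movie matrix, and `fill_value=0` fills in the unrated cells.

```python
import pandas as pd

# Hypothetical toy training ratings in long format (training rows only, no test rows)
toy_train = pd.DataFrame({
    "user_id":  [0, 0, 1, 1, 2],
    "movie_id": [10, 20, 10, 30, 20],
    "rating":   [5, 3, 4, 2, 1],
})

# Rows are users, columns are movies, unrated cells become 0
toy_mtx = toy_train.pivot_table(
    values="rating", index="user_id", columns="movie_id", fill_value=0
)
print(toy_mtx)
```

One consequence of this choice is that a 0 behaves like a rating below the lowest possible score, so two users who simply rated different movies can look far apart.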

In [16]:
# Let's check our training data
data_TRAINING.head()

,user_id,movie_id,rating,timestamp,movie_title,testing
0,287,1,5,875334088,Toy Story (1995),0
1,148,1,4,877019411,Toy Story (1995),0
2,66,1,3,883601324,Toy Story (1995),0
3,5,1,4,875635748,Toy Story (1995),0
4,109,1,4,880563619,Toy Story (1995),0


In [17]:
# In terms of a pivot table, we'd in essence have rows be user_id and each column would be a rating for each movie
train_user_movie_rating_mtx = data_TRAINING.pivot_table(values='rating', index='user_id', columns='movie_id', fill_value=0)
test_user_movie_rating_mtx = data_TESTING.pivot_table(values='rating', index='user_id', columns='movie_id', fill_value=0)

In [18]:
train_user_movie_rating_mtx.head(5)

user_id (rows) / movie_id (columns),1,2,3,4,5,6,7,8,9,10,...,1615,1620,1622,1623,1628,1639,1643,1652,1658,1664
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,5,3,4,0,0,0,4,1,5,3,...,0,0,0,0,0,0,0,0,0,0
2,4,0,0,0,0,0,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
# We need a list that allows us to identify the correct column position for each movie_id.
# We don't need this for the user_id as we checked that it is a complete sequence from 1 to the nth user
columnPlaceKey = list(train_user_movie_rating_mtx.columns)

In [20]:
data_TESTING.head()

,user_id,movie_id,rating,timestamp,movie_title,testing,predict
0,308,1,4,887736532,Toy Story (1995),1,3.894737
1,280,1,4,891700426,Toy Story (1995),1,3.894737
2,181,1,3,878962392,Toy Story (1995),1,3.894737
3,145,1,3,882181396,Toy Story (1995),1,3.894737
4,67,1,3,875379445,Toy Story (1995),1,3.894737


#### Similarity Measure: Euclidean

We can use a variety of similarity measures, but the first one we'll use is a typical "baseline": Euclidean similarity, which has its foundation in the standard definition of distance from geometry.

$$ sim(x,y) = \frac{1}{1 + \sqrt{\sum_i (x_i - y_i)^2}}$$

This measure approaches 0 asymptotically as the distance grows (without ever reaching it) and has a maximum value of 1 (1/1) when the two rating vectors are identical, which lets us use it directly as a weight. As a note, the 1 added in the denominator takes care of the edge case of zero distance, which would otherwise mean dividing by zero.
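To see how the measure behaves at its two ends, here is a minimal sketch with three made-up rating vectors (the helper `euclid_sim` and the vectors are hypothetical and only mirror the formula above):

```python
import numpy as np

def euclid_sim(x, y):
    """Similarity in (0, 1]: 1 / (1 + Euclidean distance)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return 1.0 / (1.0 + np.sqrt(np.sum((x - y) ** 2)))

# Hypothetical rating vectors for three users over the same three movies
user_a = [5, 3, 4]
user_b = [5, 3, 4]   # identical to user_a
user_c = [2, 3, 0]   # differs by 3 on the first movie and 4 on the third

sim_same = euclid_sim(user_a, user_b)   # distance 0
sim_far = euclid_sim(user_a, user_c)    # distance sqrt(9 + 16) = 5
print(sim_same, sim_far)
```

Identical users get a weight of exactly 1, while a distance of 5 already shrinks the weight to 1/6.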

#### Process Steps

The next portion of the analysis requires us to create a framework for putting these measures together. In terms of an algorithm, we'll consider the following steps:

1) Identify the movie to be predicted and the user associated with it, which we'll do by pulling a user_id and a movie_id from the testing data

2) Subset the training data by extracting all users who have rated the movie in question, and pull the training-set ratings of the specific user we are making the prediction for

3) Create an array framework that allows us to calculate the Euclidean similarity between that user and each of the other users, which gives us a weighting scheme

4) Calculate the similarity-weighted average of the other users' ratings for the movie, which becomes our rating prediction
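The steps above can be sketched on a tiny user-by-movie matrix. This is only an illustration under assumptions: the function `predict_rating`, the helper `euclid_sim`, and the toy matrix are hypothetical, it looks rows up by label, and it drops only the predicted movie before comparing users, so it is not a copy of the loop in the cells below.

```python
import numpy as np
import pandas as pd

def euclid_sim(x, y):
    """1 / (1 + Euclidean distance) between two aligned rating vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return 1.0 / (1.0 + np.sqrt(np.sum((x - y) ** 2)))

def predict_rating(train_mtx, user_id, movie_id, sim_fn=euclid_sim):
    """Similarity-weighted average of other users' ratings for movie_id."""
    # Step 2: users (other than the target) with a non-zero rating for the movie
    mask = (train_mtx[movie_id] > 0).to_numpy() & (train_mtx.index != user_id)
    raters = train_mtx.index[mask]
    # Step 3: compare users on every movie except the one being predicted
    others = train_mtx.drop(columns=movie_id)
    target = others.loc[user_id]
    weights = np.array([sim_fn(target, others.loc[u]) for u in raters])
    # Step 4: weighted average of the raters' ratings for the movie
    return np.average(train_mtx.loc[raters, movie_id], weights=weights)

# Hypothetical user x movie matrix built from training data (0 = not rated)
toy_mtx = pd.DataFrame(
    [[5, 3, 0], [5, 3, 4], [1, 1, 2]],
    index=pd.Index([0, 1, 2], name="user_id"),
    columns=pd.Index([10, 20, 30], name="movie_id"),
)

# Step 1: the (user, movie) pair to predict
pred = predict_rating(toy_mtx, user_id=0, movie_id=30)
print(round(pred, 3))
```

User 1 rates the other movies exactly like user 0 and user 2 does not, so the prediction is pulled toward user 1's rating of 4 rather than sitting at the plain average of 3.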

In [21]:
def euclidean_sim(s1,s2):
    "Gives us the similarity in terms of euclidean distance of two series"
    diff = s1 - s2
    return (1 / (1+np.sqrt(np.sum(diff ** 2))))

In [22]:
# This is our main prediction loop; we can swap in a different similarity function to change the weighting
# !!! Warning - This takes a lot of time if you run the entire testing set !!!
test_user_movie_holder = data_TESTING[['movie_id','user_id','rating']].copy() ## Copy the items to predict (an explicit copy avoids SettingWithCopyWarning)
test_user_movie_holder['prediction'] = 0.0 ## Make a float column for our predictions
for testCounter in range(0,len(test_user_movie_holder)): ## Loop over every row of the testing set
    test_movie = test_user_movie_holder.iloc[testCounter,0] ## Pull out the testing movie_id
    test_user = test_user_movie_holder.iloc[testCounter,1] ## Pull out the testing user_id
    test_movie_index = columnPlaceKey.index(test_movie) ## Column position of the test movie in the training matrix
    train_set_holder = data_TRAINING.loc[((data_TRAINING['movie_id'] == test_movie) & (data_TRAINING['user_id'] != test_user))] ## Subset the training set for the movie
    train_user_list = list(train_set_holder['user_id'].sort_values()) ## Get the list of users who have rated the movie in the training set
    train_mtx_holder = train_user_movie_rating_mtx.iloc[train_user_list,:] ## Rows for those users (relies on user_id matching row position)
    test_user_array = train_user_movie_rating_mtx.iloc[test_user,:] ## Gives us the ratings array for the test user only using training data
    train_mtx_holder_FINAL = train_mtx_holder.drop(train_mtx_holder.columns[test_movie_index], axis=1) ## Creates the final user matrix without the movie in question
    test_user_array_FINAL = test_user_array.truncate(test_movie_index) ## NOTE: truncate keeps every label >= test_movie_index; it does not remove only the movie in question
    simValueList = [] ## Instantiate our list to hold similarities
    for simCounter in range(0,len(train_mtx_holder_FINAL)): ## Go through each row
        simValueList.append(euclidean_sim(test_user_array_FINAL,train_mtx_holder_FINAL.iloc[simCounter,:])) ## Calculate similarity
    train_movie_ratings = train_mtx_holder.iloc[:,test_movie_index] ## Get the array of ratings for the movie in question
    predictionHolder = np.average(train_movie_ratings, weights=simValueList) ## Calculate our prediction
    test_user_movie_holder.iloc[testCounter,3] = predictionHolder ## Put it into our holder


In [23]:
# Let's calculate our RMSE
results_table['y_pred_euc_sim'] = np.array(test_user_movie_holder['prediction'])
RMSE_euc = compute_rmse(results_table['y_true'], results_table['y_pred_euc_sim'])
print(f'The prediction for movies using Euclidean similarity on any given movie and user RMSE is: {RMSE_euc}')

The prediction for movies using Euclidean similarity on any given movie and user RMSE is: 1.029511781937744
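One detail in the prediction loop above may be worth checking. It calls `test_user_array.truncate(test_movie_index)` where the intent appears to be removing the movie in question, but as far as the pandas documentation describes it, `Series.truncate(before=...)` discards every index label before the given value rather than a single entry. The sketch below, on a hypothetical Series, contrasts that with `Series.drop`:

```python
import pandas as pd

# Hypothetical ratings for one user, indexed by movie_id
ratings = pd.Series([5, 3, 4, 1], index=pd.Index([1, 2, 3, 4], name="movie_id"))

# truncate(before=2) keeps only the index labels >= 2
kept_by_truncate = ratings.truncate(before=2)

# drop(3) removes just the one label
kept_by_drop = ratings.drop(3)

print(list(kept_by_truncate.index))
print(list(kept_by_drop.index))
```

If the loop were changed to use `drop`, the similarities would be computed over more movies, so the RMSE values reported in this notebook would likely change and would need to be re-run.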


If we compare the RMSE of simply taking the average across all users (about 1.02997) with the RMSE of using similarity weighting by user (about 1.02951), the decrease is nowhere near as large as the one we saw against the naive guess of "3" for every prediction (about 1.24010). However, the naive guess was already fairly accurate, which doesn't leave much room to improve, and we made substantial gains from the unweighted average across all users who rated a movie, which leaves even less room. That said, let's explore one more similarity measure to see if it gives us better predictions.

#### Different Similarity Measures

Euclidean similarity is not the only measure of similarity we can consider. For the purpose of this capstone, we'll consider just one more: the cosine similarity measure.

__Cosine Similarity__

Cosine similarity is the cosine of the angle between two vectors in a multi-dimensional space. In plain speak, it looks at the "trend direction" of two observations rather than where they actually end up (which is what Euclidean similarity does). This is useful because it is resistant to "extreme magnitude" differences, which can occur in problems like comparing documents whose word counts are extremely different from each other.

$$ sim(x,y) = \frac{x \cdot y}{\sqrt{(x \cdot x)(y \cdot y)}} $$
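A short sketch of the "direction versus magnitude" point, using two made-up vectors where one is exactly double the other (the helper names `cos_sim` and `euclid_sim` and the vectors are hypothetical):

```python
import numpy as np

def cos_sim(x, y):
    """Cosine of the angle between two non-zero vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y))

def euclid_sim(x, y):
    """1 / (1 + Euclidean distance), for comparison."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return 1.0 / (1.0 + np.sqrt(np.sum((x - y) ** 2)))

# Same "direction", different magnitude: heavy is exactly 2 * light
light = [1, 2, 1]
heavy = [2, 4, 2]

cos_value = cos_sim(light, heavy)
euc_value = euclid_sim(light, heavy)
print(cos_value, euc_value)
```

Cosine similarity treats the two vectors as pointing the same way, while Euclidean similarity penalizes the gap between them. For ratings on a fixed 1 to 5 scale the magnitude carries real information, which may be part of why cosine does not help here.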

In [24]:
def cosine_sim(s1, s2):
    """Take two pd.Series objects and return their cosine similarity."""
    if (s1.sum() == 0) or (s2.sum() == 0): ## Account for edge cases to avoid errors
        answer = 0
    else:
        answer = np.sum(s1 * s2) / np.sqrt(np.sum(s1 ** 2) * np.sum(s2 ** 2))
    return answer 

In [25]:
# This is the same prediction loop as before, now using cosine similarity for the weighting
# !!! Warning - This takes a lot of time if you run the entire testing set !!!
test_user_movie_holder = data_TESTING[['movie_id','user_id','rating']].copy() ## Copy the items to predict (an explicit copy avoids SettingWithCopyWarning)
test_user_movie_holder['prediction'] = 0.0 ## Make a float column for our predictions
for testCounter in range(0,len(test_user_movie_holder)): ## Loop over every row of the testing set
    test_movie = test_user_movie_holder.iloc[testCounter,0] ## Pull out the testing movie_id
    test_user = test_user_movie_holder.iloc[testCounter,1] ## Pull out the testing user_id
    test_movie_index = columnPlaceKey.index(test_movie) ## Column position of the test movie in the training matrix
    train_set_holder = data_TRAINING.loc[((data_TRAINING['movie_id'] == test_movie) & (data_TRAINING['user_id'] != test_user))] ## Subset the training set for the movie
    train_user_list = list(train_set_holder['user_id'].sort_values()) ## Get the list of users who have rated the movie in the training set
    train_mtx_holder = train_user_movie_rating_mtx.iloc[train_user_list,:] ## Rows for those users (relies on user_id matching row position)
    test_user_array = train_user_movie_rating_mtx.iloc[test_user,:] ## Gives us the ratings array for the test user only using training data
    train_mtx_holder_FINAL = train_mtx_holder.drop(train_mtx_holder.columns[test_movie_index], axis=1) ## Creates the final user matrix without the movie in question
    test_user_array_FINAL = test_user_array.truncate(test_movie_index) ## NOTE: truncate keeps every label >= test_movie_index; it does not remove only the movie in question
    simValueList = [] ## Instantiate our list to hold similarities
    for simCounter in range(0,len(train_mtx_holder_FINAL)): ## Go through each row
        simValueList.append(cosine_sim(test_user_array_FINAL,train_mtx_holder_FINAL.iloc[simCounter,:])) ## Calculate similarity
    train_movie_ratings = train_mtx_holder.iloc[:,test_movie_index] ## Get the array of ratings for the movie in question
    if sum(simValueList) == 0: ## If we have no similarity with cosine, then we can't use weighting and we'll default to the average rating
        predictionHolder = np.average(train_movie_ratings) ## Calculate our prediction
    else:
        predictionHolder = np.average(train_movie_ratings, weights=simValueList) ## Calculate our prediction
    test_user_movie_holder.iloc[testCounter,3] = predictionHolder ## Put it into our holder


In [26]:
# Let's calculate our RMSE
results_table['y_pred_cosine_sim'] = np.array(test_user_movie_holder['prediction'])
RMSE_cosine = compute_rmse(results_table['y_true'], results_table['y_pred_cosine_sim'])
print(f'The prediction for movies using cosine similarity on any given movie and user RMSE is: {RMSE_cosine}')

The prediction for movies using cosine similarity on any given movie and user RMSE is: 1.03549395179855


The end result of using cosine similarity is an RMSE (about 1.0355) that is worse than both Euclidean similarity (about 1.0295) and the plain per-movie average (about 1.0300), so if we strictly use RMSE as our evaluation metric, we'll stick with Euclidean similarity for now. To make later documentation easier, we'll save the results to a separate file so we can access them without re-running the models.

In [28]:
# Save our results so we don't need to replicate this process over and over again
results_table.to_csv("results_table_FINAL.csv")
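When the saved file is read back later, something like the following sketch could be used. It is written against an in-memory stand-in with made-up values rather than the real `results_table_FINAL.csv`, and the names `csv_text`, `reloaded`, and `rmse_by_model` are hypothetical. Passing `index_col=0` reads the saved index back as the index, which avoids the extra "Unnamed: 0" column we had to drop at the start of this notebook.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical stand-in for the saved CSV (values are made up)
csv_text = io.StringIO(
    ",user_id,movie_id,y_true,y_pred_naive,y_pred_average\n"
    "0,308,1,4,3,3.5\n"
    "1,280,1,2,3,2.5\n"
)

# index_col=0 restores the saved index instead of creating "Unnamed: 0"
reloaded = pd.read_csv(csv_text, index_col=0)

# Recompute the RMSE for every stored prediction column
pred_cols = [c for c in reloaded.columns if c.startswith("y_pred_")]
rmse_by_model = {
    c: float(np.sqrt(np.mean((reloaded["y_true"] - reloaded[c]) ** 2)))
    for c in pred_cols
}
print(rmse_by_model)
```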

#### Conclusion

In this notebook, we modeled a few ways to predict the rating a specific user would give a specific movie, including:

* A naive estimate using the midpoint of the range of possible ratings (i.e., everyone gets a "3"), which results in an RMSE of about 1.2401
* An overall average of all the ratings for a movie in the training set (i.e., take the average of all the ratings that users supplied for a given movie), which results in an RMSE of about 1.0300
* An average of all ratings for a movie, weighted by how closely each user "relates" to the test user (i.e., weight more heavily the ratings of users who are more similar to the test user). We used two different similarity metrics, Euclidean similarity and cosine similarity, which result in RMSEs of about 1.0295 and 1.0355 respectively

We used RMSE as our primary metric for evaluation in this capstone project. Based on this metric, we'd use the Euclidean similarity measure, as it did the best of the methods we tried (granted, only slightly better than the plain average rating).

This was a lot of fun to learn about and I'm definitely interested in further improving my skill set related to recommendation systems! Cheers! Emre