## 4) Preprocessing and Data Set Creation

For our recommender model, we'll need to preprocess our data and split it into a training set and a testing set so that we can evaluate the model. First, let's take a quick look at the data to determine whether any preprocessing is necessary and, if so, what it should be.

In [1]:
#import relevant base packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#import our prepped data set
raw_df = pd.read_csv('Ratings_Table_COMPLETE.csv')

In [3]:
#Quick peek of data set
raw_df.head()

| | Unnamed: 0 | user_id | movie_id | rating | timestamp | movie_title |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 172 | 5 | 881250949 | Empire Strikes Back, The (1980) |
| 1 | 1 | 0 | 133 | 1 | 881250949 | Gone with the Wind (1939) |
| 2 | 2 | 196 | 242 | 3 | 881250949 | Kolya (1996) |
| 3 | 3 | 186 | 302 | 3 | 891717742 | L.A. Confidential (1997) |
| 4 | 4 | 22 | 377 | 1 | 878887116 | Heavyweights (1994) |


In [4]:
#summary stats
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100002 entries, 0 to 100001
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Unnamed: 0   100002 non-null  int64 
 1   user_id      100002 non-null  int64 
 2   movie_id     100002 non-null  int64 
 3   rating       100002 non-null  int64 
 4   timestamp    100002 non-null  int64 
 5   movie_title  100002 non-null  object
dtypes: int64(5), object(1)
memory usage: 4.6+ MB


In [5]:
#We can drop the 'Unnamed: 0' column, as it just repeats the index
raw_df.drop(['Unnamed: 0'],axis=1,inplace=True)

In [6]:
raw_df.head()

| | user_id | movie_id | rating | timestamp | movie_title |
|---|---|---|---|---|---|
| 0 | 0 | 172 | 5 | 881250949 | Empire Strikes Back, The (1980) |
| 1 | 0 | 133 | 1 | 881250949 | Gone with the Wind (1939) |
| 2 | 196 | 242 | 3 | 881250949 | Kolya (1996) |
| 3 | 186 | 302 | 3 | 891717742 | L.A. Confidential (1997) |
| 4 | 22 | 377 | 1 | 878887116 | Heavyweights (1994) |


#### Scaling & Normalization

Since our ratings are integers ranging from 1 to 5, they already share a common scale. No further scaling is needed, so we can move forward with creating our training and testing sets.
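A quick check can back up the claim that the ratings need no rescaling. The sketch below is only illustrative: `toy_df` is a small hypothetical frame standing in for `raw_df`, and only the `rating` column name comes from the data set itself.

```python
import pandas as pd

# Hypothetical stand-in for raw_df; only the 'rating' column matters here.
toy_df = pd.DataFrame({
    "movie_id": [172, 133, 242, 302, 377],
    "rating": [5, 1, 3, 3, 1],
})

# Ratings should be whole numbers on the 1-5 scale with no missing values.
has_missing = toy_df["rating"].isna().any()
in_range = toy_df["rating"].between(1, 5).all()
is_integer = pd.api.types.is_integer_dtype(toy_df["rating"])

print(f"missing: {has_missing}, in range: {in_range}, integer dtype: {is_integer}")
```

Running the same three checks against the real ratings column would confirm whether the assumption holds for the full data set.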

#### Data Set Creation Strategy for Training and Testing

We need to consider a few things when creating our data set that differ from a more traditional data set project. The most important is the number of ratings available for any given movie, and the reason is relatively straightforward. If a movie has only a single rating, there is no way to split it between training and testing. If a movie has only two ratings, there can only be one training observation and one testing observation. This is not a typical consideration, as most data sets have enough observations per category to allow something like an 80/20 split between training and testing. With a recommender system like this, we'll need a more explicit strategy, starting with which movies to drop based on their rating counts.

* If a movie has only a single rating, there is no way to have a meaningful train/test split, so we'll drop these movies from our train/test data set creation.
* If a movie has only two ratings, the train/test split leaves a single training rating, so any prediction would simply replicate that rating. There isn't much of a model that can be built from this situation, so we'll drop these movies as well.

This means a movie needs at least three observations to be feasible for our train/test split. Whether movies with such small rating counts are worth keeping from a model efficiency and effectiveness perspective may (or may not) merit further consideration.

__Note on Training / Testing Split__

The difficulty of this step is tied directly to the number of ratings you have per movie and to the sponsor's ability to obtain more ratings if necessary. For example, a large organization like Netflix may have so many ratings per movie that a traditional 80/20 split can be applied without much further consideration. A newer organization may not yet have much data (movie ratings, product sales, product reviews, etc.), so it will need a more detailed strategy for items that have only a sparse set of ratings. The sponsor's ability to gather more ratings also needs to be considered: an organization like Netflix can potentially push to collect ratings quickly enough for any movie to meet a minimum threshold, but a different organization may not have that luxury.

This needs to be considered in light of our sponsor, where (as we learned from EDA) 50% of the movies have 27 ratings or fewer. If we required 30 ratings for a movie to be "model valid", we'd be throwing out at least half of the movies and creating a model that wouldn't be particularly useful given the current state of our sponsor's data. Therefore, we'll implement the following strategy for our train/test split.
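To make the cost of a minimum-ratings threshold concrete, here is a minimal sketch on made-up data. The frame `toy_ratings` and the helper `share_of_movies_kept` are hypothetical and only illustrate the idea; they are not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical ratings table: movie 1 has 6 ratings, movie 2 has 3,
# movie 3 has 2, and movie 4 has 1.
toy_ratings = pd.DataFrame({
    "movie_id": [1] * 6 + [2] * 3 + [3] * 2 + [4],
    "rating":   [5, 4, 4, 3, 5, 2, 3, 3, 4, 1, 2, 5],
})

ratings_per_movie = toy_ratings.groupby("movie_id")["rating"].count()

def share_of_movies_kept(counts, min_ratings):
    """Fraction of movies with at least `min_ratings` ratings."""
    return (counts >= min_ratings).mean()

for threshold in (3, 5):
    kept = share_of_movies_kept(ratings_per_movie, threshold)
    print(f"min_ratings={threshold}: {kept:.0%} of movies kept")
```

In this toy case, raising the threshold from 3 to 5 cuts the share of movies kept from half to a quarter. The same calculation on the real rating counts would show how much of the catalog a given threshold discards.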

* We'll drop the movies with only one or two ratings, as mentioned above, since we have no way to effectively model a relationship with such sparse data
* For every four ratings of a movie placed in the training set, a fifth rating will go to the testing set (to approximate an 80/20 split)
* If a movie has more than two but fewer than five ratings:

    * If the movie has only three ratings, the training set gets two observations and the testing set gets the last observation
    * If the movie has only four ratings, the training set gets three observations and the testing set gets the last observation

* Once a movie has five or more ratings, we'll split whatever remains after each full group of five ratings as follows:

    * If the rating count mod 5 == 1, we'll put that remainder in the training set
    * If the rating count mod 5 == 2, we'll split them one for training and one for testing
    * If the rating count mod 5 == 3, we'll split them two for training and one for testing
    * If the rating count mod 5 == 4, we'll split them two for training and two for testing

While this strategy will not always produce an exact 80/20 split, it should give us enough of a spread for training and testing to create a model that is relevant to our sponsor given their own data spread.

In [7]:
#First, let's make a list of the number of ratings per movie
movieId_RatingsCount = raw_df[['movie_id','rating']].groupby('movie_id').count()
movieId_RatingsCount_Sorted = movieId_RatingsCount.sort_values('rating',ascending=False)
movieId_RatingsCount_Sorted['rating_count'] = movieId_RatingsCount_Sorted['rating']

In [8]:
#Second, let's pick out the movies that have more than two ratings
movieId_KeepList = movieId_RatingsCount_Sorted.loc[movieId_RatingsCount_Sorted['rating_count'] > 2]

In [9]:
#Third, let's create a list of the movie IDs
movieId_KeepList = movieId_KeepList.index.tolist()

In [10]:
#Now let's drop the movies that do not have enough ratings
data_table_complete = raw_df.loc[raw_df['movie_id'].isin(movieId_KeepList)]

To make our labels, we need two steps. The first is a loop over each movie ID. The second, within each movie ID, randomly assigns each observation to either the training set or the testing set. We'll create a feature called 'testing' that is '1' if the observation is assigned to testing and '0' if it is assigned to training. It is also easiest to concatenate each movie's labeled subset into a new DataFrame rather than modify the original complete data table.
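The labeling idea for a single movie can be sketched as follows. Everything here is illustrative: `one_movie`, movie ID 99, and the seed value are made up, and the seeded `random.Random` instance is an assumption added so the shuffle is repeatable.

```python
import random
import pandas as pd

# Hypothetical ratings for a single movie (movie_id 99 is made up).
one_movie = pd.DataFrame({
    "user_id": [10, 11, 12, 13, 14],
    "movie_id": [99] * 5,
    "rating": [4, 5, 3, 4, 2],
})

# Four training labels (0) and one testing label (1) for five ratings.
labels = [0, 0, 0, 0, 1]

# A seeded generator (seed chosen arbitrarily) keeps the shuffle repeatable.
rng = random.Random(42)
rng.shuffle(labels)

# Attach the shuffled labels as the 'testing' feature.
one_movie = one_movie.assign(testing=labels)
print(one_movie)
```

Shuffling the label list, rather than the rows, keeps the number of testing observations fixed while leaving which rating gets held out to chance.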

In [11]:
#First, let's add a feature for assignment
#(we work on an explicit copy so pandas does not raise a SettingWithCopyWarning)
data_table_complete = data_table_complete.copy()
data_table_complete['testing'] = 0


In [12]:
#Now let's build our sorting algorithm components
#First, let's make a movie list
movieID_List = set(data_table_complete['movie_id'].tolist())

In [13]:
#import relevant packages
import random

In [14]:
#Let's make an empty DataFrame that will be the final holder of all our processing
data_table_ASSIGNED_COMPLETE = pd.DataFrame(columns = data_table_complete.columns)

In [15]:
#Second, we need to create a tool to randomly assign observations based on the number of observations we have. 
# Remember that a '1' goes to testing and a '0' goes to training
def assign_to_set(df):
    obs_num = len(df)  ## Get the number of observations we have
    assign_list = [] ## Instantiate our list
    if obs_num == 3: ## Handle only 3 observations
        assign_list = [0,0,1]
    elif obs_num == 4: ## Handle only 4 observations
        assign_list = [0,0,0,1]
    else:
        quot = obs_num // 5  ## Get the number of full groups of 5
        remn = obs_num % 5  ## Get the remainder
        assign_list = [0,0,0,0,1] * quot ## Build based on multiples
        if remn == 1:
            assign_list = assign_list + [0]
        elif remn == 2:
            assign_list = assign_list + [0,1]
        elif remn == 3:
            assign_list = assign_list + [0,0,1]
        elif remn == 4:
            assign_list = assign_list + [0,0,1,1]
    random.shuffle(assign_list) ## Shuffle the list
    return assign_list ## Return the list

In [16]:
## Now let's build our complete algorithm to assign movies
for movie in movieID_List:
    df_subset = data_table_complete.loc[data_table_complete['movie_id'] == movie].copy() ## Make a subset dataframe
    assign_list = assign_to_set(df_subset) ## Get our new assignment list
    df_subset['testing'] = assign_list ## Update the testing list
    data_table_ASSIGNED_COMPLETE = pd.concat([data_table_ASSIGNED_COMPLETE,df_subset],axis=0) ## Concatenate the new DataFrame

In [17]:
#Now let's give it a cursory check to make sure the algorithm worked properly by checking a random movie
len(data_table_ASSIGNED_COMPLETE.loc[data_table_ASSIGNED_COMPLETE['movie_id'] == 56])

394

In [18]:
len(data_table_complete.loc[data_table_complete['movie_id'] == 56])

394

In [19]:
## Okay, at least the lengths make sense, so let's see the counts of testing
data_table_ASSIGNED_COMPLETE.loc[data_table_ASSIGNED_COMPLETE['movie_id'] == 56].value_counts('testing')

testing
0    314
1     80
dtype: int64

Alright, let's check that the function worked properly. 394 divided by 5 is 78 with a remainder of 4. The 78 full groups of five contribute 78 * 4 = 312 zeroes, and the remainder of four adds two more, for 314 zeroes. For the ones, we get 78 + 2 = 80. Both match the counts above, which gives us confidence that the algorithm is working properly.
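The same arithmetic can be generalized to any rating count. The helper below, `expected_split`, is a hypothetical addition that mirrors the branches of `assign_to_set` as written in this notebook and assumes at least three ratings.

```python
def expected_split(n_ratings):
    """Return (train_count, test_count) implied by assign_to_set.

    Assumes n_ratings >= 3, since smaller movies were dropped earlier.
    """
    if n_ratings == 3:
        return 2, 1
    if n_ratings == 4:
        return 3, 1
    full_blocks, remainder = divmod(n_ratings, 5)
    # (train, test) contributed by the leftover ratings after full blocks of five
    leftover = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (2, 1), 4: (2, 2)}[remainder]
    return 4 * full_blocks + leftover[0], full_blocks + leftover[1]

print(expected_split(394))  # movie 56 has 394 ratings
```

Comparing these expected counts with the observed `value_counts('testing')` for each movie would extend the spot check on movie 56 to the whole table.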

In [20]:
#Now we can split our new DataFrame into a training set and a testing set
data_table_TRAINING = data_table_ASSIGNED_COMPLETE.loc[data_table_ASSIGNED_COMPLETE['testing'] == 0]
data_table_TESTING = data_table_ASSIGNED_COMPLETE.loc[data_table_ASSIGNED_COMPLETE['testing'] == 1]

In [21]:
#Let's now save these data sets
data_table_TRAINING.to_csv('DF_TRAINING.csv')
data_table_TESTING.to_csv('DF_TESTING.csv')

#### Summary

Alright, we've preprocessed our data (which was relatively simple, since our ratings were already on a common 1 to 5 scale) and separated our data set so that every movie we keep has training data to go with its testing data. In a more realistic setting, you'd probably have to think carefully about the minimum number of ratings a movie needs to justify modeling it. In addition, the randomization may need to be considered in light of how many ratings each movie has. If you're randomizing over just 5 ratings, for example, whichever ratings are selected for the training and testing sets can have a significant impact on accuracy. Imagine an 'odd-ball' rating becoming the single testing observation against four more sensible training ratings, versus the odd-ball landing in the training set, where it can be at least somewhat corrected by the other training observations. For this capstone, we won't get into it, but it is something to consider going forward.
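The odd-ball effect can be illustrated with a small worked example. The ratings below are made up, and predicting the held-out rating with the mean of the training ratings is an assumption chosen for simplicity, not the model used in this project.

```python
# Hypothetical ratings for one movie: four similar ratings and one outlier.
ratings = [4, 4, 4, 4, 1]

def holdout_error(ratings, test_index):
    """Absolute error when the held-out rating is predicted by the training mean."""
    train = [r for i, r in enumerate(ratings) if i != test_index]
    prediction = sum(train) / len(train)
    return abs(ratings[test_index] - prediction)

error_outlier_tested = holdout_error(ratings, 4)   # outlier held out for testing
error_outlier_trained = holdout_error(ratings, 0)  # outlier stays in training

print(error_outlier_tested, error_outlier_trained)
```

With the outlier held out, the training mean is 4 and the error is 3. With the outlier in training, the mean drops to 3.25 and the error on a held-out 4 is only 0.75, so the measured accuracy depends heavily on which rating the shuffle happens to pick.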