# Movie Recommendation Systems

# Notebook 4: Data Preparation

## 5. Data Preparation

This notebook prepares the data needed to model the recommendation systems, including:
* One-hot-encoding genres for the content-based recommender
* Dataframe of unique movies for reference
* Train-test splitting data
* Setting up user-item matrices for both training and testing data 
* Dataframes to determine relevant movies ('truths') for evaluation 

In [1]:
#Run initial set up first
%run ./02_Initial_Setup.ipynb

Number of nulls in "movies" dataframe: 
 movieId    0
title      0
genres     0
dtype: int64

 Number of duplicate rows in "movies" dataframe : 0
Number of duplicates:  5


Unnamed: 0,movieId,title,genres
5601,26958,Emma (1996),Romance
6932,64997,War of the Worlds (2005),Action|Sci-Fi
9106,144606,Confessions of a Dangerous Mind (2002),Comedy|Crime|Drama|Romance|Thriller
9135,147002,Eros (2004),Drama|Romance
9468,168358,Saturn 3 (1980),Sci-Fi|Thriller


Unnamed: 0,movieId,title,genres
650,838,Emma (1996),Comedy|Drama|Romance
2141,2851,Saturn 3 (1980),Adventure|Sci-Fi|Thriller
4169,6003,Confessions of a Dangerous Mind (2002),Comedy|Crime|Drama|Thriller
5854,32600,Eros (2004),Drama
5931,34048,War of the Worlds (2005),Action|Adventure|Sci-Fi|Thriller


Number of nulls in "ratings" dataframe: 
 userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

 Number of duplicates rows in "ratings" dataframe : 0
List of original movieIds:  [838, 2851, 6003, 32600, 34048]
List of duplicate movieIds:  [26958, 64997, 144606, 147002, 168358]


Unnamed: 0,original_id,duplicate_id
0,838,26958
1,2851,64997
2,6003,144606
3,32600,147002
4,34048,168358


Number of movies in "ratings":  9719
Number of movies in "movies":  9737


**5.1.1 Matrix of one-hot-encoded genres**

In [66]:
#Dataframe of one-hot-encoded genres
genres = pd.get_dummies(movies['genres'])

In [67]:
#Concat 'genres' and 'movies'
movies_ = pd.concat([movies, genres], axis=1)

In [68]:
#Drop the original 'genres' column
movies_.drop('genres', axis=1, inplace=True)

In [69]:
#Combine one-hot-encoded genres for each movie, unique by movieId
genres_ = movies_.drop(['title','year'], axis=1).groupby(['movieId']).sum().reset_index()
genres_ = genres_.set_index('movieId')
len(genres_)

9737

In [70]:
#Preview
genres_

movieId,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
5,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,0,1,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
193583,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
193585,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
193587,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
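
The encoding and aggregation steps above can be sketched on a toy table. This assumes, as the single-genre column names in the preview suggest, that `movies` holds one row per movie-genre pair after the initial setup. The names `toy_movies` and `toy_genres` and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical long-format table: one row per (movie, genre) pair.
toy_movies = pd.DataFrame({
    "movieId": [1, 1, 2, 3, 3, 3],
    "genres": ["Adventure", "Comedy", "Comedy", "Drama", "Comedy", "Romance"],
})

# One indicator column per genre, still one row per (movie, genre) pair.
dummies = pd.get_dummies(toy_movies["genres"], dtype=int)

# Summing within each movieId collapses the pairs into one row per movie.
toy_genres = (
    pd.concat([toy_movies[["movieId"]], dummies], axis=1)
    .groupby("movieId")
    .sum()
)
print(toy_genres)
```

Each row of `toy_genres` is then a genre vector for one movie, which is the shape a content-based similarity measure would typically consume.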


**5.1.2 Dataframe of unique movies**

In [71]:
#Dataframe of unique movies
unique_movies = movies[['movieId','title','year']].drop_duplicates()

In [72]:
#Keep only movieId and title for each unique movie (year is dropped)
unique_movies = unique_movies[['movieId', 'title']]

In [73]:
#Preview
unique_movies

Unnamed: 0,movieId,title
0,1,Toy Story
1,2,Jumanji
2,3,Grumpier Old Men
3,4,Waiting to Exhale
4,5,Father of the Bride Part II
...,...,...
9732,193581,Black Butler: Book of the Atlantic
9733,193583,No Game No Life: Zero
9734,193585,Flint
9735,193587,Bungo Stray Dogs: Dead Apple


**5.1.3 Train test split 'ratings'**

* A single train-test split was used to evaluate the recommenders.
* Cross-validation was not used because of the lengthy execution time (except for model-based filtering).

*5.1.3.1 Create train and test set dataframes*

In [74]:
# A reader is still needed but only the rating_scale param is required
reader = Reader(rating_scale=(0.5, 5))

In [75]:
#Train-test split
trainset, testset = train_test_split(ratings, test_size=0.25, stratify=ratings['userId'])

In [76]:
#Check that the same users appear in both the train and test sets
set(trainset['userId']) == set(testset['userId'])

True
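
Comparing the sets of users only confirms that every user appears on both sides of the split. A stricter check looks at each user's share of held-out ratings. The sketch below uses made-up data and pandas' `groupby(...).sample` as a stand-in for the stratified split, so that it runs without the splitting function loaded in the setup notebook. It is not the notebook's own method.

```python
import pandas as pd

# Hypothetical ratings: user 1 has 4 ratings, user 2 has 8.
toy_ratings = pd.DataFrame({
    "userId": [1] * 4 + [2] * 8,
    "movieId": list(range(10, 14)) + list(range(20, 28)),
    "rating": [4.0, 3.0, 5.0, 2.5, 3.5, 4.0, 1.0, 2.0, 4.5, 5.0, 3.0, 0.5],
})

# Hold out 25% of each user's ratings (stand-in for a split stratified on userId).
toy_test = toy_ratings.groupby("userId").sample(frac=0.25, random_state=0)
toy_train = toy_ratings.drop(toy_test.index)

# Same users on both sides of the split.
same_users = set(toy_train["userId"]) == set(toy_test["userId"])

# Share of each user's ratings that ended up in the test set.
test_share = toy_test["userId"].value_counts() / toy_ratings["userId"].value_counts()
print(same_users)
print(test_share)
```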

In [77]:
# A reader required to read rating scale between 0.5 and 5
reader = Reader(rating_scale=(0.5, 5))

*5.1.3.2 Create train and test sets in the format Surprise (SVD) expects*

In [78]:
#Train set for surprise (SVD) 
# The columns must correspond to user id, item id and ratings (in that order).
trainset_surprise = Dataset.load_from_df(trainset[['userId', 'movieId', 'rating']], reader)
trainset_surprise = trainset_surprise.build_full_trainset()

In [79]:
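#Test set for surprise (SVD): list of (userId, movieId, rating) tuples
#Select the three columns explicitly so that any extra column (e.g. timestamp) is left out
testset_surprise = list(testset[['userId', 'movieId', 'rating']].to_records(index=False))
testset_surprise = [tuple(i) for i in testset_surprise]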
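
As a minimal sketch of the target format, assuming a test dataframe that still carries a `timestamp` column: Surprise's `test()` method iterates over (user, item, rating) triples, so only those three columns should reach it. The data and the names `toy_test` and `testset_tuples` are hypothetical.

```python
import pandas as pd

# Hypothetical test-set rows, including a timestamp column that must be left out.
toy_test = pd.DataFrame({
    "userId": [1, 2],
    "movieId": [10, 20],
    "rating": [4.0, 3.5],
    "timestamp": [964982703, 964981247],
})

# Plain (userId, movieId, rating) tuples, in that order.
testset_tuples = list(
    toy_test[["userId", "movieId", "rating"]].itertuples(index=False, name=None)
)
print(testset_tuples)
```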

**5.1.4 Relevant movies - truths (threshold)**

In [80]:
#Dataframe of relevant test-set movies only: 'truths' are ratings > 3.5 (above the overall average)
threshold = testset[testset['rating']>3.5]
threshold = threshold.sort_values(['userId','rating'], ascending=[True,False])

In [81]:
#Convert into a list of truths: one list of relevant movieIds per user, in the order of userIds
truths = []
userIds = list(set(ratings['userId']))

for userId in userIds: 
    relevant_ = threshold[threshold['userId']==userId]['movieId'].tolist()
    truths.append(relevant_)
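
The same per-user lists can be built with a single `groupby` rather than one filter per user. This is a sketch on made-up data, not the notebook's code. `toy_test`, `user_ids` and `toy_truths` are hypothetical names, and users with no relevant movie receive an empty list, as in the loop above.

```python
import pandas as pd

# Hypothetical test-set ratings for three users.
toy_test = pd.DataFrame({
    "userId": [1, 1, 1, 2, 3, 3],
    "movieId": [10, 11, 12, 10, 11, 13],
    "rating": [5.0, 3.0, 4.0, 2.0, 4.5, 4.0],
})
user_ids = [1, 2, 3]

# Keep ratings above the 3.5 threshold, best-rated first within each user.
relevant = toy_test[toy_test["rating"] > 3.5].sort_values(
    ["userId", "rating"], ascending=[True, False]
)

# One list of relevant movieIds per user; missing users fall back to [].
by_user = relevant.groupby("userId")["movieId"].apply(list)
toy_truths = [by_user.get(u, []) for u in user_ids]
print(toy_truths)
```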

**5.1.5 User-item matrix**

* Two matrices are built: one for the train set [0] and one for the test set [1]
* Not all movies were rated; the movieIds of these unseen movies are held in 'missing_movies'
* Unrated movies are still included as columns of the user-item matrices

In [82]:
#Store train and test sets in a list
datasets = [trainset, testset]

In [83]:
#Lists of all userIds and movieIds
userIds = list(set(ratings['userId']))
movieIds = list(set(movies['movieId']))

#Set up user-item matrices and store copies
ui_matrix_ = pd.DataFrame(np.nan, index=userIds, columns=movieIds)

train_ui_ = ui_matrix_.copy()
test_ui_ = ui_matrix_.copy()

ui_matrices = [train_ui_, test_ui_]

#Preview
ui_matrix_

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,98239,98243,131013,131023,32728,163809,32743,98279,65514,98296
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,,,,,,,,,,,...,,,,,,,,,,
607,,,,,,,,,,,...,,,,,,,,,,
608,,,,,,,,,,,...,,,,,,,,,,
609,,,,,,,,,,,...,,,,,,,,,,


In [84]:
#Update user-item matrices
#0 = training set
#1 = test set

for i in (0, 1):
    dataset_ui_ = pd.pivot_table(datasets[i], values='rating', index=['userId'], columns=['movieId'])
    ui_matrices[i].update(dataset_ui_)
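
An alternative way to reach the same shape, sketched here on made-up data, is to pivot first and then `reindex` to the full user and movie lists. Users and movies without ratings become all-NaN rows and columns. The names `toy_train`, `all_users`, `all_movies` and `toy_ui` are hypothetical.

```python
import pandas as pd

# Hypothetical training ratings: user 2 and movie 40 have no ratings at all.
toy_train = pd.DataFrame({
    "userId": [1, 1, 3],
    "movieId": [10, 30, 20],
    "rating": [4.0, 2.5, 5.0],
})
all_users = [1, 2, 3]
all_movies = [10, 20, 30, 40]

# Pivot to users x movies, then expand to the full set of users and movies.
pivot = toy_train.pivot_table(values="rating", index="userId", columns="movieId")
toy_ui = pivot.reindex(index=all_users, columns=all_movies)
print(toy_ui)
```

The blank-matrix-plus-`update` approach used in this notebook modifies the stored matrices in place, whereas `reindex` returns a new dataframe.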

In [85]:
#Preview of training set user-item matrix
train_ui = ui_matrices[0]
train_ui

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,98239,98243,131013,131023,32728,163809,32743,98279,65514,98296
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,,,,,,,,,,...,,,,,4.0,,,,,
607,4.0,,,,,,,,,,...,,,,,,,,,,
608,,2.0,2.0,,,,,,,,...,,,,,,,,,,
609,,,,,,,,,,4.0,...,,,,,,,,,,


In [86]:
#Preview of test set user-item matrix
test_ui = ui_matrices[1]
test_ui

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,98239,98243,131013,131023,32728,163809,32743,98279,65514,98296
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,,,,,,,2.5,,,,...,,3.0,,,,,,,,
607,,,,,,,,,,,...,,,,,,,,,,
608,2.5,,,,,,,,,4.0,...,,,,,,,,,,
609,3.0,,,,,,,,,,...,,,,,,,,,,


In [87]:
#Check that the non-null counts of the train and test matrices add up to the total number of ratings
print('Number of not nulls in blank user-item matrix: ', ui_matrix_.notnull().sum().sum())

print('Number of not nulls in training set user-item matrix: ', ui_matrices[0].notnull().sum().sum())
print('Number of not nulls in test set user-item matrix: ', ui_matrices[1].notnull().sum().sum())

print('Number of ratings in total: ', len(ratings))

Number of not nulls in blank user-item matrix:  0
Number of not nulls in training set user-item matrix:  75625
Number of not nulls in test set user-item matrix:  25209
Number of ratings in total:  100834


*5.1.5.1 Boolean matrices*

Boolean masks that identify which ratings are present in the training set and which are to be predicted:
1. **'true_ui_bool'**: rated = 1, not rated = 0
2. **'pred_ui_bool'**: rated = 0, not rated = 1

In [88]:
#Boolean masks built from the training set user-item matrix
pred_ui_bool = train_ui_.isnull().astype(float) 
true_ui_bool = 1-pred_ui_bool

In [89]:
#Preview
pred_ui_bool

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,98239,98243,131013,131023,32728,163809,32743,98279,65514,98296
1,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
607,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
608,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
609,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [90]:
#Preview
true_ui_bool

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,98239,98243,131013,131023,32728,163809,32743,98279,65514,98296
1,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
607,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
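
One plausible use of these masks, assumed here rather than taken from the notebook, is to zero out scores for movies a user has already rated so that only unseen movies can be recommended. The score matrix `toy_scores` and all other names and values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical training user-item matrix for two users and two movies.
toy_train_ui = pd.DataFrame(
    [[4.0, np.nan], [np.nan, 2.0]], index=[1, 2], columns=[10, 20]
)

# 1.0 where a rating is missing (to be predicted), 0.0 where it exists.
pred_mask = toy_train_ui.isnull().astype(float)
# The complement: 1.0 where a rating exists.
true_mask = 1 - pred_mask

# Hypothetical predicted scores for every user-movie pair.
toy_scores = pd.DataFrame(
    [[3.9, 4.2], [1.5, 2.1]], index=[1, 2], columns=[10, 20]
)

# Element-wise product keeps scores only for movies not yet rated.
unseen_scores = toy_scores * pred_mask
print(unseen_scores)
```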


**5.1.6 Store user history, relevant movies and recommendations**

In [91]:
#Collect each user's history (movies seen and their ratings); recommendations (predictions) are added in later sections
user_movies = []
user_ratings = []

for userId in userIds:
    user_ = pd.DataFrame(train_ui_.loc[userId][train_ui_.loc[userId].notnull()])
    user_ = user_.reset_index()
    user_.columns=['movieId', 'rating']
    movies_ = list(user_['movieId'])
    ratings_ = list(user_['rating'])
    
    user_movies.append(movies_)
    user_ratings.append(ratings_)

In [92]:
#Convert lists to dataframe
user_items = pd.DataFrame([userIds, user_movies, user_ratings, truths]).transpose()

In [93]:
#Name columns
user_items.columns=['userId','movieId','rating','actuals']
#Set index
user_items = user_items.set_index('userId')

In [94]:
#Preview
user_items

userId,movieId,rating,actuals
1,"[1, 3, 6, 47, 50, 70, 101, 151, 157, 163, 223,...","[4.0, 4.0, 4.0, 5.0, 5.0, 3.0, 5.0, 5.0, 5.0, ...","[2991, 3053, 2459, 1198, 1025, 2502, 608, 5060..."
2,"[318, 333, 131724, 1704, 68157, 71535, 6874, 1...","[3.0, 4.0, 5.0, 4.5, 4.5, 3.0, 4.0, 5.0, 3.5, ...","[89774, 58559, 80489, 74458, 3578]"
3,"[31, 647, 688, 720, 849, 1124, 1263, 1272, 130...","[0.5, 0.5, 0.5, 0.5, 5.0, 0.5, 0.5, 0.5, 0.5, ...","[5919, 5746, 1587, 26409]"
4,"[21, 32, 47, 52, 58, 106, 125, 126, 162, 171, ...","[3.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 1.0, 5.0, ...","[920, 2599, 1197, 1080, 3044, 1947, 910, 265, ..."
5,"[34, 36, 58, 153, 232, 247, 261, 266, 290, 296...","[4.0, 4.0, 5.0, 3.0, 4.0, 5.0, 4.0, 1.0, 5.0, ...","[21, 367, 474, 1, 50, 110]"
...,...,...,...
606,"[1, 11, 15, 18, 19, 29, 32, 36, 47, 58, 73, 80...","[2.5, 2.5, 3.5, 4.0, 2.0, 4.5, 4.0, 3.5, 3.0, ...","[1089, 2997, 910, 2360, 2959, 1682, 1193, 3855..."
607,"[1, 11, 25, 112, 153, 165, 188, 204, 208, 241,...","[4.0, 3.0, 3.0, 2.0, 3.0, 4.0, 5.0, 3.0, 3.0, ...","[1370, 2762, 110, 2571, 150, 3347, 1974, 1407,..."
608,"[2, 3, 16, 21, 24, 31, 32, 34, 39, 44, 47, 48,...","[2.0, 2.0, 4.5, 3.5, 2.0, 3.0, 3.5, 3.5, 3.0, ...","[7373, 296, 53996, 3949, 2502, 54503, 6373, 65..."
609,"[10, 110, 116, 137, 161, 185, 208, 253, 288, 2...","[4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 3.0, ...","[590, 1150]"
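
The history-building loop above can be illustrated on a small matrix. This is a sketch under the assumption that a user's history is simply the non-null entries of that user's row in the training matrix. The names `toy_train_ui`, `history` and `toy_user_items` and all values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical training user-item matrix: rows are users, columns are movies.
toy_train_ui = pd.DataFrame(
    [[4.0, np.nan, 2.5], [np.nan, 5.0, np.nan]],
    index=[1, 2],
    columns=[10, 20, 30],
)

# For each user, keep only the movies that have a rating.
history = {u: toy_train_ui.loc[u].dropna() for u in toy_train_ui.index}

# One row per user, holding parallel lists of movieIds and ratings.
toy_user_items = pd.DataFrame(
    {
        "movieId": [list(s.index) for s in history.values()],
        "rating": [list(s.values) for s in history.values()],
    },
    index=pd.Index(list(history), name="userId"),
)
print(toy_user_items)
```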
