<h2>Movie Recommendations Dataset</h2>
<h4>Hands-on: Classification with SageMaker</h4>
Input Features: [userId, movieId] <br>
Target Feature: rating <br>
Objective: Predict how a user would rate a particular movie<br>
<h4>Movie Lens Overview: https://grouplens.org/datasets/movielens/</h4>
<h4>Dataset: http://files.grouplens.org/datasets/movielens/ml-latest-small.zip</h4>
<h4>F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. </h4>

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.datasets import dump_svmlight_file


import boto3
import sagemaker.amazon.common as smac



<h3>Load Movies and Parse Genre</h3>

#### Download Movie Dataset from grouplens

In [2]:
!wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

--2023-08-16 14:18:45--  http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2023-08-16 14:18:46 (3.61 MB/s) - ‘ml-latest-small.zip’ saved [978202/978202]



In [3]:
ls

fm_cloud_prediction_template.ipynb  ml-latest-small.zip           ReadMe.md
fm_cloud_training_template.ipynb    movie_data_preparation.ipynb  sdk1.7/


In [4]:
!unzip ml-latest-small.zip

Archive:  ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


In [5]:
df_movies = pd.read_csv(r'ml-latest-small/movies.csv')

In [6]:
df_movies.shape

(9742, 3)

In [7]:
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [8]:
genre_list = df_movies.genres.map(lambda value: value.split('|'))

In [9]:
genre_list[:10]

0    [Adventure, Animation, Children, Comedy, Fantasy]
1                       [Adventure, Children, Fantasy]
2                                    [Comedy, Romance]
3                             [Comedy, Drama, Romance]
4                                             [Comedy]
5                            [Action, Crime, Thriller]
6                                    [Comedy, Romance]
7                                [Adventure, Children]
8                                             [Action]
9                        [Action, Adventure, Thriller]
Name: genres, dtype: object

In [10]:
def get_unique_genres (genre_list):
    unique_list = set()
    
    for items in genre_list:
        for item in items:
            unique_list.add(item)
    
    return sorted(unique_list)

In [11]:
genre = get_unique_genres(genre_list)

In [12]:
genre, len(genre)

(['(no genres listed)',
  'Action',
  'Adventure',
  'Animation',
  'Children',
  'Comedy',
  'Crime',
  'Documentary',
  'Drama',
  'Fantasy',
  'Film-Noir',
  'Horror',
  'IMAX',
  'Musical',
  'Mystery',
  'Romance',
  'Sci-Fi',
  'Thriller',
  'War',
  'Western'],
 20)

In [13]:
# Table of genre for each movie
df_genre = pd.DataFrame(index=range(df_movies.shape[0]),columns=genre)

In [14]:
df_genre = df_genre.fillna(0)

In [15]:
df_genre.shape

(9742, 20)

In [16]:
df_genre.head()

Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [17]:
genre_list [:11]

0     [Adventure, Animation, Children, Comedy, Fantasy]
1                        [Adventure, Children, Fantasy]
2                                     [Comedy, Romance]
3                              [Comedy, Drama, Romance]
4                                              [Comedy]
5                             [Action, Crime, Thriller]
6                                     [Comedy, Romance]
7                                 [Adventure, Children]
8                                              [Action]
9                         [Action, Adventure, Thriller]
10                             [Comedy, Drama, Romance]
Name: genres, dtype: object

In [18]:
# Fill genre for each movie
for row, movie_genre in enumerate(genre_list):
    df_genre.loc[row,movie_genre] = 1

In [19]:
df_genre.head()

Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
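
The genre table built above can also be produced in a single step. The following is a minimal sketch on a toy frame, not the notebook's own method; `toy_movies` and its rows are made up for illustration. It relies on pandas `Series.str.get_dummies`, which splits each string on a separator and returns one indicator column per distinct token.

```python
import pandas as pd

# Toy stand-in for movies.csv (hypothetical rows, for illustration only)
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Adventure|Children", "Comedy", "Comedy|Romance"],
})

# Split on '|' and one-hot encode the resulting tokens in one call
toy_genre = toy_movies["genres"].str.get_dummies(sep="|")

# Attach the indicator columns to the original frame, as df_movies.join(df_genre) does
toy_joined = toy_movies.join(toy_genre)
print(toy_joined)
```

On the full dataset this would avoid the explicit Python loop over rows, at the cost of hiding how the indicator columns are filled.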


In [20]:
# Some movies don't have genre listed
df_genre[df_genre['(no genres listed)'] > 0].head()

Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
8517,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8684,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8687,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8782,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8836,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [21]:
# Merge with movie description
df_movies = df_movies.join(df_genre)

In [22]:
df_movies.head()

Unnamed: 0,movieId,title,genres,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0,0,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),Adventure|Children|Fantasy,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II (1995),Comedy,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
df_movies.to_csv(r'ml-latest-small/movies_genre.csv', index=False)

<h3>Load Ratings given by each user for a movie</h3>

In [24]:
df_ratings = pd.read_csv(r'ml-latest-small/ratings.csv')

In [25]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [26]:
df_ratings.userId.unique().shape

(610,)

In [27]:
df_ratings.movieId.unique().shape

(9724,)

In [28]:
df_ratings.drop(columns=['timestamp'], inplace=True)

In [29]:
# Merge rating and movie description
df_movie_ratings = pd.merge(df_ratings,df_movies,on='movieId')

In [30]:
df_movie_ratings.head(2)

Unnamed: 0,userId,movieId,rating,title,genres,(no genres listed),Action,Adventure,Animation,Children,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,5,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0


In [31]:
df_movie_ratings.tail(2)

Unnamed: 0,userId,movieId,rating,title,genres,(no genres listed),Action,Adventure,Animation,Children,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
100834,610,163937,3.5,Blair Witch (2016),Horror|Thriller,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
100835,610,163981,3.5,31 (2016),Horror,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


## Training and Validation Set
### Target Variable as first column followed by input features:

### Training, Validation files do not have a column header

In [32]:
# Training = 70% of the data
# Validation = 30% of the data
# Randomize the dataset
np.random.seed(5)
l = list(df_movie_ratings.index)
np.random.shuffle(l)
df = df_movie_ratings.iloc[l]

In [33]:
rows = df.shape[0]
train = int(.7 * rows)
test = rows-train

In [34]:
rows,train,test

(100836, 70585, 30251)
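
The shuffle-then-slice split can be checked on a small frame. This is a sketch under assumptions: `toy` is made-up data, and `DataFrame.sample(frac=1, random_state=...)` is used as an alternative way to shuffle, so it will not reproduce the exact row order that `np.random.seed(5)` with `np.random.shuffle` gives in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with ten rows (hypothetical data)
toy = pd.DataFrame({"rating": np.arange(10, dtype=np.float32)})

# Shuffle all rows; random_state makes the order repeatable
shuffled = toy.sample(frac=1, random_state=5)

# First 70% of the shuffled rows for training, the rest for validation
n_train = int(0.7 * len(shuffled))
train_part = shuffled.iloc[:n_train]
test_part = shuffled.iloc[n_train:]

# The two parts should not share any row and should cover the whole frame
overlap = set(train_part.index) & set(test_part.index)
print(len(train_part), len(test_part), len(overlap))
```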

In [35]:
df.shape

(100836, 25)

In [36]:
df.head(2)

Unnamed: 0,userId,movieId,rating,title,genres,(no genres listed),Action,Adventure,Animation,Children,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
92163,298,42011,1.0,Fun with Dick and Jane (2005),Comedy|Crime,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
71427,28,428,2.5,"Bronx Tale, A (1993)",Drama,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
# SageMaker Factorization Machine expects all columns to be of float32
# Let's get the target variable as float32
# to_numpy() returns the same 1-D array; Series.ravel() is deprecated in recent pandas
y = df['rating'].astype(np.float32).to_numpy()

In [38]:
len(y)

100836

In [39]:
y.dtype

dtype('float32')

In [40]:
# We will create two different training datasets.
# Training 1: rating, user id, movie id
# Training 2: rating, user id, movie id, and movie genre attributes
columns_user_movie = ['userId','movieId']
columns_all = columns_user_movie + genre

In [41]:
columns_user_movie

['userId', 'movieId']

In [42]:
columns_all

['userId',
 'movieId',
 '(no genres listed)',
 'Action',
 'Adventure',
 'Animation',
 'Children',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Fantasy',
 'Film-Noir',
 'Horror',
 'IMAX',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western']

In [43]:
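# Store a reference copy of rating, user id and movie id for the train and test splits
# (to_csv writes a header row by default, so these CSV copies do include column names)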
df[['rating','userId','movieId']][:train].to_csv(r'ml-latest-small/user_movie_train.csv', index=False)
df[['rating','userId','movieId']][train:].to_csv(r'ml-latest-small/user_movie_test.csv',index=False)

In [44]:
# One Hot Encode
# Training 1: user id, movie id
# Training 2: user id, movie id, and movie genre attributes
encoder = preprocessing.OneHotEncoder(dtype=np.float32)

In [45]:
X = encoder.fit_transform(df[columns_user_movie])

In [46]:
df.userId.unique().shape, df.movieId.unique().shape

((610,), (9724,))

In [47]:
# Write Dimensions - we need it for training and prediction
# Number of unique users and movies
dim_movie = df.userId.unique().shape[0] + df.movieId.unique().shape[0]
with open(r'ml-latest-small/movie_dimension.txt','w') as f:
    f.write(str(dim_movie))

In [48]:
X

<100836x10334 sparse matrix of type '<class 'numpy.float32'>'
	with 201672 stored elements in Compressed Sparse Row format>

In [49]:
X.shape[1]

10334
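
The matrix shape follows from the encoding: 610 user columns plus 9724 movie columns give 10334 features, and every rating row holds exactly two non-zero entries (one user, one movie), hence 2 × 100836 = 201672 stored elements. The sketch below shows the same layout on a toy frame; `toy`, `enc` and `X_toy` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy ratings: two distinct users and two distinct movies (made-up ids)
toy = pd.DataFrame({"userId": [10, 20, 10], "movieId": [7, 7, 9]})

enc = OneHotEncoder(dtype=np.float32)
X_toy = enc.fit_transform(toy[["userId", "movieId"]])

# categories_ holds the sorted unique values of each input column;
# user columns come first, movie columns follow
n_users = len(enc.categories_[0])
n_movies = len(enc.categories_[1])

print(X_toy.shape, n_users + n_movies, X_toy.nnz)
print(X_toy.toarray())
```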

In [50]:
# Create a sparse matrix recordio file
def write_sparse_recordio_file (filename, x, y=None):
    with open(filename, 'wb') as f:
        smac.write_spmatrix_to_sparse_tensor (f, x, y)

In [51]:
# Training recordIO file
write_sparse_recordio_file(r'ml-latest-small/user_movie_train.recordio',X[:train],y[:train])

In [52]:
# Test recordIO file
write_sparse_recordio_file(r'ml-latest-small/user_movie_test.recordio',X[train:],y[train:])

In [53]:
# Create libSVM formatted file. Convenient text format
# Output is stored as rating, user_index:value, movie_index:value
#  For example: 5.0 314:1 215:1  (user with index 314 and movie with index 215 in the one hot encoded table has a rating of 5 )

# This file can be used for two purposes: 
#   1. Train directly with the libFM binary in local mode
#   2. It is easy to run inference with this format against sagemaker cloud as we need to
#      send only sparse input to sagemaker prediction service

# 
# Store in libSVM format as well for directly testing with libFM
dump_svmlight_file(X[:train],y[:train],r'ml-latest-small/user_movie_train.svm')
dump_svmlight_file(X[train:],y[train:],r'ml-latest-small/user_movie_test.svm')
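
One way to confirm that a libSVM file was written as intended is to read it back and compare it with the in-memory matrix. The following is a minimal round-trip sketch on toy data; the file name `toy.svm` and the arrays are assumptions for illustration. It uses `load_svmlight_file` from scikit-learn with `zero_based=True`, matching the default of `dump_svmlight_file`.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Toy one-hot rows and ratings (hypothetical values)
X_toy = csr_matrix(np.array([[1, 0, 1, 0],
                             [0, 1, 0, 1]], dtype=np.float32))
y_toy = np.array([4.0, 2.5], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.svm")
    dump_svmlight_file(X_toy, y_toy, path)

    # Peek at the text format: label followed by index:value pairs
    with open(path) as f:
        first_line = f.readline().strip()

    # n_features keeps the width fixed even if trailing columns are all zero
    X_back, y_back = load_svmlight_file(path, n_features=X_toy.shape[1],
                                        zero_based=True)

print(first_line)
print(np.array_equal(X_back.toarray(), X_toy.toarray()), y_back)
```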

In [54]:
# Create two lookup files
# File 1: Categorical Movie ID and corresponding Movie Index in One Hot Encoded Table
# File 2: Categorical User ID and corresponding User Index in One Hot Encoded Table

# This is useful for predicting how a particular user would rate all the movies
# or all users rating one particular movie

list_of_movies = df.movieId.unique()
# user 1 and all movies
df_user_movie = pd.DataFrame({'userId': np.full(len(list_of_movies),1), 'movieId' : list_of_movies})

In [55]:
df_user_movie[columns_user_movie].head()

Unnamed: 0,userId,movieId
0,1,42011
1,1,428
2,1,110
3,1,1097
4,1,1073


In [56]:
list_of_movies

array([ 42011,    428,    110, ..., 191005, 117572,   4434])

In [57]:
# Transform to one hot encoding (with existing encoder)
X = encoder.transform(df_user_movie[columns_user_movie])

In [58]:
# Store movieId and corresponding one hot encoded entries
dump_svmlight_file(X,list_of_movies,r'ml-latest-small/one_hot_enc_movies.svm')

In [59]:
# Now create 
# File 2: Categorical User ID and corresponding User Index in One Hot Encoded Table
list_of_users = df.userId.unique()

In [60]:
list_of_users.shape

(610,)

In [61]:
list_of_users[:10]

array([298,  28, 372, 303,  19, 487, 332, 165,  89, 288])

In [62]:
# All users and movie 1
df_user_movie = pd.DataFrame({'userId': list_of_users, 'movieId' : np.full(len(list_of_users),1)})

In [63]:
df_user_movie.head()

Unnamed: 0,userId,movieId
0,298,1
1,28,1
2,372,1
3,303,1
4,19,1


In [64]:
# Transform to one hot encoding (with existing encoder)
X = encoder.transform(df_user_movie[columns_user_movie])

In [65]:
# Store userId and corresponding one hot encoded entries
dump_svmlight_file(X,list_of_users,r'ml-latest-small/one_hot_enc_users.svm')
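
The same id-to-column mapping stored in the two lookup files can also be read from the fitted encoder. This is a sketch on a toy encoder, assuming the encoder was fitted on the columns in the order `['userId', 'movieId']` as in the notebook; `user_to_col` and `movie_to_col` are hypothetical helper names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data with made-up ids
toy = pd.DataFrame({"userId": [10, 20, 10], "movieId": [7, 7, 9]})
enc = OneHotEncoder(dtype=np.float32).fit(toy[["userId", "movieId"]])

user_ids, movie_ids = enc.categories_

# User columns start at 0; movie columns are offset by the number of users
user_to_col = {int(u): i for i, u in enumerate(user_ids)}
movie_to_col = {int(m): len(user_ids) + i for i, m in enumerate(movie_ids)}

# Cross-check against the encoder for the pair (user 20, movie 9)
row = enc.transform(pd.DataFrame({"userId": [20], "movieId": [9]}))
active_cols = sorted(row.indices.tolist())
print(user_to_col, movie_to_col, active_cols)
```

Keeping the mapping in dictionaries is convenient for building sparse inference requests for a single user or movie without re-running the encoder.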