# Unsupervised Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

I {**Team RR-1#MLTech**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: EA - Movies Recommender Systems

The objective is to build a recommender system that uses content-based or collaborative filtering to accurately predict the rating a user would give a movie they have not yet watched, based on their past preferences. A robust solution holds substantial economic potential: it helps users discover and engage with content aligned with their interests, which in turn drives revenue and fosters loyalty to the platform.

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Data Dependencies</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Preprocessing</a>

<a href=#five>5. Feature Engineering</a>

<a href=#six>6. Model and Model Evaluation</a>


<a id="one"></a>

# Importing Data Dependencies

In [2]:
import numpy as np
import pandas as pd
import zipfile
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')



<a id="two"></a>

# Loading the Data

**Loading the train set**

In [3]:
# Specify the path to the zip file
# zip_file = '..input/data-file'

# with zipfile.ZipFile(zip_file, mode="r") as archive:
#     archive.printdir()

In [4]:
# # open zipped dataset
# with zipfile.ZipFile("data.zip") as z:
#    # open the csv file in the dataset
#    with z.open("train.csv") as f:
#         # read the dataset
#         train = pd.read_csv(f)
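
When the data is only available as a zip archive, the commented-out cells above can be folded into one reusable helper. The following is a minimal sketch, assuming the archive holds plain CSV members; `read_csv_from_zip` is a hypothetical helper name, and the demo archive is built in memory with made-up rows rather than the competition data.

```python
import io
import zipfile

import pandas as pd


def read_csv_from_zip(zip_source, member):
    """Read one CSV member of a zip archive into a DataFrame."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as handle:
            return pd.read_csv(handle)


# Build a tiny in-memory archive for illustration (made-up rows).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("train.csv", "userId,movieId,rating\n1,10,4.0\n2,20,3.5\n")
buffer.seek(0)

demo_train = read_csv_from_zip(buffer, "train.csv")
print(demo_train.shape)
```

The same helper could then be called once per file (`train.csv`, `test.csv`, `movies.csv`, `imdb_data.csv`) instead of repeating the nested `with` blocks.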

In [5]:
# to work on the Kaggle cloud server, load the data directly without going through the zip archive

train = pd.read_csv('../input/edsa-movie-recommendation-predict/train.csv')

**Loading the test set**

In [6]:
# with zipfile.ZipFile("data.zip") as test_file:
#    # open the csv file in the dataset
#    with test_file.open("test.csv") as test_csv:
#         # read the dataset
#         test = pd.read_csv(test_csv)

In [7]:
test = pd.read_csv('../input/edsa-movie-recommendation-predict/test.csv')

**Loading the movie set**

In [8]:
# with zipfile.ZipFile("data.zip") as movie_file:
#    # open the csv file in the dataset
#    with movie_file.open("movies.csv") as movie_csv:
#         # read the dataset
#         movies = pd.read_csv(movie_csv)

In [9]:
movies = pd.read_csv('../input/edsa-movie-recommendation-predict/movies.csv')

**Loading the imdb set**

In [10]:
# with zipfile.ZipFile("data.zip") as imdb_file:
#    # open the csv file in the dataset
#    with imdb_file.open("imdb_data.csv") as imdb_csv:
#         # read the dataset
#         imdb = pd.read_csv(imdb_csv)

In [11]:
imdb = pd.read_csv('../input/edsa-movie-recommendation-predict/imdb_data.csv')

<a id="three"></a>

# Exploratory Data Analysis

The bulk of the exploratory data analysis was performed in a separate notebook, but for the purposes of the model we are going to build, let's explore our data.

### Training Data

In [12]:
# let's look at the head of our data set

In [13]:
train.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,5163,57669,4.0,1518349992
1,106343,5,4.5,1206238739
2,146790,5459,5.0,1076215539
3,106362,32296,2.0,1423042565
4,9041,366,3.0,833375837


In [14]:
train.rating.unique() # to check the unique ratings in the data

array([4. , 4.5, 5. , 2. , 3. , 1. , 3.5, 2.5, 0.5, 1.5])

In [15]:
# get the info of our columns

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000038 entries, 0 to 10000037
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 305.2 MB


In [16]:
# checking for null values

train.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [17]:
# check the number of rows and columns

print(f'Total number of rows is {train.shape[0]}')
print(f'Total number of columns is {train.shape[1]}')

Total number of rows is 10000038
Total number of columns is 4


In [18]:
train['rating'].describe() # Get the statistical summary of the rating column

count    1.000004e+07
mean     3.533395e+00
std      1.061124e+00
min      5.000000e-01
25%      3.000000e+00
50%      3.500000e+00
75%      4.000000e+00
max      5.000000e+00
Name: rating, dtype: float64

### Test data

In [19]:
test.head() # view the first 5 rows

Unnamed: 0,userId,movieId
0,1,2011
1,1,4144
2,1,5767
3,1,6711
4,1,7318


In [20]:
test.info() # get the info of the type of data in each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000019 entries, 0 to 5000018
Data columns (total 2 columns):
 #   Column   Dtype
---  ------   -----
 0   userId   int64
 1   movieId  int64
dtypes: int64(2)
memory usage: 76.3 MB


In [21]:
test.isnull().sum() # number of missing values

userId     0
movieId    0
dtype: int64

In [22]:
# The total number of rows and columns in the data set

print(f'Total number of rows is {test.shape[0]}')
print(f'Total number of columns {test.shape[1]}')

Total number of rows is 5000019
Total number of columns 2


### Movies data

In [23]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [24]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB


In [25]:
# check for null values in the movies df

movies.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [26]:
print(f'Total number of columns is {movies.shape[1]}')
print(f'Total number of rows is {movies.shape[0]}')

Total number of columns is 3
Total number of rows is 62423


### imdb data

In [27]:
imdb.head()

Unnamed: 0,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation
1,2,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,104.0,"$65,000,000",board game|adventurer|fight|game
2,3,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,101.0,"$25,000,000",boat|lake|neighbor|rivalry
3,4,Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,124.0,"$16,000,000",black american|husband wife relationship|betra...
4,5,Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,106.0,"$30,000,000",fatherhood|doberman|dog|mansion


In [28]:
# get the information about the data in each columns

imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movieId        27278 non-null  int64  
 1   title_cast     17210 non-null  object 
 2   director       17404 non-null  object 
 3   runtime        15189 non-null  float64
 4   budget         7906 non-null   object 
 5   plot_keywords  16200 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 1.2+ MB


In [29]:
# check for null values

imdb.isnull().sum()

movieId              0
title_cast       10068
director          9874
runtime          12089
budget           19372
plot_keywords    11078
dtype: int64
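
The raw null counts are easier to compare as a share of all rows; for example, `budget` is missing for 19372 of 27278 rows, roughly 71%. A minimal sketch of that calculation follows, using a small made-up frame (`demo`) in place of `imdb`.

```python
import pandas as pd

# Made-up stand-in for the imdb frame.
demo = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "director": ["A", None, "C", None],
    "runtime": [81.0, None, None, None],
})

# isnull() gives booleans; their mean is the fraction of missing values.
missing_pct = (demo.isnull().mean() * 100).round(1)
print(missing_pct.sort_values(ascending=False))
```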

<a id="four"></a>

# Preprocessing

**Merge the train data and the movie data**

In [30]:
# merge the dataframes using the pandas merge function

df = pd.merge(train, movies, how = 'inner')

df = pd.merge(df, imdb, how = 'inner', on = 'movieId')

In [31]:
df.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres,title_cast,director,runtime,budget,plot_keywords
0,5163,57669,4.0,1518349992,In Bruges (2008),Comedy|Crime|Drama|Thriller,Elizabeth Berrington|Rudy Blomme|Olivier Bonjo...,Martin McDonagh,107.0,"$15,000,000",dwarf|bruges|irish|hitman
1,87388,57669,3.5,1237455297,In Bruges (2008),Comedy|Crime|Drama|Thriller,Elizabeth Berrington|Rudy Blomme|Olivier Bonjo...,Martin McDonagh,107.0,"$15,000,000",dwarf|bruges|irish|hitman
2,137050,57669,4.0,1425631854,In Bruges (2008),Comedy|Crime|Drama|Thriller,Elizabeth Berrington|Rudy Blomme|Olivier Bonjo...,Martin McDonagh,107.0,"$15,000,000",dwarf|bruges|irish|hitman
3,120490,57669,4.5,1408228517,In Bruges (2008),Comedy|Crime|Drama|Thriller,Elizabeth Berrington|Rudy Blomme|Olivier Bonjo...,Martin McDonagh,107.0,"$15,000,000",dwarf|bruges|irish|hitman
4,50616,57669,4.5,1446941640,In Bruges (2008),Comedy|Crime|Drama|Thriller,Elizabeth Berrington|Rudy Blomme|Olivier Bonjo...,Martin McDonagh,107.0,"$15,000,000",dwarf|bruges|irish|hitman
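
Because both merges use `how = 'inner'`, any rating whose `movieId` is missing from `movies` or `imdb` is dropped without warning. One way to see how many rows would be lost is a left merge with `indicator=True`. The sketch below uses made-up frames (`ratings`, `metadata`) rather than the real data.

```python
import pandas as pd

# Made-up ratings and metadata; movieId 30 has no metadata row.
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 30],
    "rating": [4.0, 3.5, 5.0],
})
metadata = pd.DataFrame({"movieId": [10, 20], "director": ["A", "B"]})

# A left merge keeps every rating; the _merge column records where each row matched.
checked = ratings.merge(metadata, how="left", on="movieId", indicator=True)
dropped = int((checked["_merge"] == "left_only").sum())
print(f"{dropped} of {len(ratings)} ratings have no metadata row")
```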


In [32]:
df.duplicated().sum()

0


In [34]:
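# join all the metadata of the movies together into one text field
# note: the fields are concatenated without a separator and title_cast is included twice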

df['tags'] = df.title + df.genres + df.title_cast + df.title_cast + df.director + df.plot_keywords

In [35]:
# rename columns to follow the snake_case naming convention

df = df.rename(columns = {'userId':'user_id','movieId': 'movie_id'})

In [36]:
df.columns

Index(['user_id', 'movie_id', 'rating', 'timestamp', 'title', 'genres',
       'title_cast', 'director', 'runtime', 'budget', 'plot_keywords', 'tags'],
      dtype='object')

In [37]:
df.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,title,genres,title_cast,director,runtime,budget,plot_keywords,tags
0,5163,57669,4.0,1518349992,In Bruges (2008),Comedy|Crime|Drama|Thriller,Elizabeth Berrington|Rudy Blomme|Olivier Bonjo...,Martin McDonagh,107.0,"$15,000,000",dwarf|bruges|irish|hitman,In Bruges (2008)Comedy|Crime|Drama|ThrillerEli...
1,87388,57669,3.5,1237455297,In Bruges (2008),Comedy|Crime|Drama|Thriller,Elizabeth Berrington|Rudy Blomme|Olivier Bonjo...,Martin McDonagh,107.0,"$15,000,000",dwarf|bruges|irish|hitman,In Bruges (2008)Comedy|Crime|Drama|ThrillerEli...
2,137050,57669,4.0,1425631854,In Bruges (2008),Comedy|Crime|Drama|Thriller,Elizabeth Berrington|Rudy Blomme|Olivier Bonjo...,Martin McDonagh,107.0,"$15,000,000",dwarf|bruges|irish|hitman,In Bruges (2008)Comedy|Crime|Drama|ThrillerEli...
3,120490,57669,4.5,1408228517,In Bruges (2008),Comedy|Crime|Drama|Thriller,Elizabeth Berrington|Rudy Blomme|Olivier Bonjo...,Martin McDonagh,107.0,"$15,000,000",dwarf|bruges|irish|hitman,In Bruges (2008)Comedy|Crime|Drama|ThrillerEli...
4,50616,57669,4.5,1446941640,In Bruges (2008),Comedy|Crime|Drama|Thriller,Elizabeth Berrington|Rudy Blomme|Olivier Bonjo...,Martin McDonagh,107.0,"$15,000,000",dwarf|bruges|irish|hitman,In Bruges (2008)Comedy|Crime|Drama|ThrillerEli...


In [38]:
df['tags'] = df['tags'].str.replace('|', ' ', regex=False)  # treat '|' as a literal character, not a regex

In [39]:
df.tags[0]

'In Bruges (2008)Comedy Crime Drama ThrillerElizabeth Berrington Rudy Blomme Olivier Bonjour Mark Donovan Ann Elsley Colin Farrell Jean-Marc Favorin Ralph Fiennes Brendan Gleeson Eric Godon Zeljko Ivanek Sachi Kimura Anna Madeley Louis Nummy Clémence PoésyElizabeth Berrington Rudy Blomme Olivier Bonjour Mark Donovan Ann Elsley Colin Farrell Jean-Marc Favorin Ralph Fiennes Brendan Gleeson Eric Godon Zeljko Ivanek Sachi Kimura Anna Madeley Louis Nummy Clémence PoésyMartin McDonaghdwarf bruges irish hitman'
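
In the output above, adjacent fields run together (for example `ThrillerElizabeth`), and with plain `+` any row that has a missing field ends up with a missing `tags` value. A minimal sketch of one alternative follows, assuming each field should appear once and be separated by a space. The `demo` frame and `text_columns` list are illustrative, not part of the original notebook.

```python
import pandas as pd

# Made-up rows; the second one has missing cast and director.
demo = pd.DataFrame({
    "title": ["In Bruges (2008)", "Untitled"],
    "genres": ["Comedy|Crime", "Drama"],
    "title_cast": ["Colin Farrell|Brendan Gleeson", None],
    "director": ["Martin McDonagh", None],
    "plot_keywords": ["bruges|hitman", "lake"],
})

text_columns = ["title", "genres", "title_cast", "director", "plot_keywords"]

demo["tags"] = (
    demo[text_columns]
    .fillna("")                          # missing fields become empty strings
    .agg(" ".join, axis=1)               # join the fields with a space
    .str.replace("|", " ", regex=False)  # split pipe-delimited values
    .str.split()                         # collapse repeated whitespace
    .str.join(" ")
)
print(demo["tags"].tolist())
```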

**Let's look at the total number of users and the total number of movies**

In [40]:
num_users = df['user_id'].nunique()

num_movies = df['movie_id'].nunique()

print(f'Number of users in the data set is {num_users}')
print(f'Number of movies in the data set is {num_movies}')

Number of users in the data set is 162522
Number of movies in the data set is 23544


<a id="five"></a>

# Feature Engineering

In [41]:
# import dependencies for feature engineering and model building
import surprise
import time # to measure the time taken to train each model
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split

from surprise import accuracy # Evaluate the model using appropriate metrics (e.g., RMSE)


In [42]:
# Create a Surprise Dataset object

# the rating scale runs from 0.5 to 5 because the lowest rating in the dataset is 0.5
unique_ratings = (0.5, 5) 

reader = Reader(rating_scale = unique_ratings)
dataset = Dataset.load_from_df(df[['user_id', 'movie_id', 'rating']], reader)

# Split the dataset into training and testing sets
trainset, validationset = train_test_split(dataset, test_size=0.2, random_state=42)

<a id="six"></a>

# Model and Model Evaluation

### First Model

Collaborative filtering using Singular Value Decomposition
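
To make the idea concrete, here is a minimal NumPy sketch of biased matrix factorization trained with stochastic gradient descent, the model form that the Surprise documentation describes for `SVD`: a prediction is the global mean plus a user bias, an item bias, and the dot product of user and item factor vectors. The toy ratings, learning rate, regularization, and epoch count are assumptions chosen for illustration, not the settings used in this notebook.

```python
import numpy as np

# Toy (user, item, rating) triples with made-up values.
ratings = [(0, 0, 4.0), (0, 1, 3.0), (1, 0, 5.0),
           (1, 2, 2.0), (2, 1, 4.5), (2, 2, 1.5)]
n_users, n_items, n_factors = 3, 3, 2
lr, reg, n_epochs = 0.01, 0.02, 200  # illustrative hyperparameters

rng = np.random.default_rng(42)
mu = np.mean([r for _, _, r in ratings])      # global mean rating
b_u = np.zeros(n_users)                       # user biases
b_i = np.zeros(n_items)                       # item biases
P = rng.normal(0, 0.1, (n_users, n_factors))  # user factors
Q = rng.normal(0, 0.1, (n_items, n_factors))  # item factors


def predict(u, i):
    """Estimated rating of item i by user u."""
    return mu + b_u[u] + b_i[i] + Q[i] @ P[u]


def rmse():
    """Training RMSE over the toy ratings."""
    return np.sqrt(np.mean([(r - predict(u, i)) ** 2 for u, i, r in ratings]))


rmse_before = rmse()
for _ in range(n_epochs):
    for u, i, r in ratings:
        err = r - predict(u, i)
        # Gradient steps on the regularized squared error.
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        p_old = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_old - reg * Q[i])
rmse_after = rmse()
print(f"training RMSE before: {rmse_before:.3f}, after: {rmse_after:.3f}")
```

In this picture, `n_factors` sets the length of each factor vector and `n_epochs` the number of passes over the ratings.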

In [43]:
# Initialize the model

model = SVD(n_epochs = 30, verbose = True, random_state = 42)

In [44]:
# Train the model 

start = time.time() # initialize the timer
model.fit(trainset) # train the model
end = time.time()
print(f'Total time for training the model is: {round((end - start),2)}s')

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
Processing epoch 20
Processing epoch 21
Processing epoch 22
Processing epoch 23
Processing epoch 24
Processing epoch 25
Processing epoch 26
Processing epoch 27
Processing epoch 28
Processing epoch 29
Total time for training the model is: 251.17s


In [45]:
prediction = model.test(validationset)  # make prediction to evaluate the model

In [46]:
# Evaluate the model

rmse_value = accuracy.rmse(prediction)
print(f'The Root Mean Square Error is {rmse_value}')

RMSE: 0.8359
The Root Mean Square Error is 0.8358581938956068
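
For reference, RMSE is the square root of the mean squared difference between the actual and predicted ratings. The sketch below computes it by hand on made-up numbers; the `rmse` helper is illustrative and separate from `surprise.accuracy.rmse`.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(round(rmse(actual, predicted), 4))  # squared errors 0.25, 0, 1, 1 -> 0.75
```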


In [47]:
# Retrain the model on the full dataset with 300 latent factors (default number of epochs)

model = SVD(verbose = True, n_factors=300, random_state = 42)

start = time.time()
model.fit(dataset.build_full_trainset())
end = time.time()
print(f'Total time for training the model is: {round((end - start),2)}s')

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
Total time for training the model is: 387.5s


### Model Evaluation with RMSE

In [48]:
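# caution: validationset is part of the full trainset the model was just refit on,
# so this RMSE is measured on data the model has already seen and is optimistic
prediction_2 = model.test(validationset)  # make predictions to evaluate the model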

In [49]:
rmse_2 = accuracy.rmse(prediction_2)

RMSE: 0.5592


In [50]:
# let's define a function to calculate the R-squared value

def calculate_r2(predictions):
    actual_ratings = np.array([pred.r_ui for pred in predictions])
    predicted_ratings = np.array([pred.est for pred in predictions])

    mean_rating = np.mean(actual_ratings)
    total_variance = np.sum((actual_ratings - mean_rating) ** 2)
    residual_variance = np.sum((actual_ratings - predicted_ratings) ** 2)
    r2 = 1 - (residual_variance / total_variance)

    return r2

In [52]:
rsquared = calculate_r2(prediction_2)
print(f'The R-squared (coefficient of determination) is {round((rsquared * 100), 2)}%')

The R-squared (coefficient of determination) is 72.19%


### SVD with Implicit Feedback (SVD++)

In [None]:
from surprise import SVDpp

In [None]:
# Initialize the model

# svdpp = SVDpp(verbose = True, n_factors=20)

In [None]:
# Train the model

# start = time.time()
# svdpp.fit(trainset)
# end = time.time()
# print(f'Total time for training the model is: {round((end - start),2)}s')

In [None]:
# make predictions
# accuracy.rmse(svdpp.test(validationset))

### Collaborative Filtering using Matrix Factorization

In [None]:
from sklearn.model_selection import train_test_split as sk_train_test_split  # alias avoids shadowing surprise's train_test_split
train_data, val_data = sk_train_test_split(df, test_size=0.20)  # hold out 20% of the rows for validation

In [None]:
n_users = df.user_id.nunique() # get the number users
n_items = df.movie_id.nunique() # get the number of movies

print('Num. of Users: '+ str(n_users))
print('Num of Movies: '+str(n_items))

In [None]:
# Calculate the sparsity of the data

sparsity=round(1.0-len(df)/float(n_users*n_items),3)
print('The sparsity level of Data is ' +  str(sparsity*100) + '%')

## Evaluation on Unseen Users through Kaggle submission

In [None]:
test.head()

In [None]:
# prepare the test data for prediction: append a placeholder rating (None) to each (userId, movieId) pair,
# since model.test expects (user, item, rating) tuples

new_testset = [tuple(row) + (None, ) for row in test[['userId', 'movieId']].values]

In [None]:
new_testset[:5] # view the first 5 rows of new_testset

In [None]:
# let's test the model on unseen data
pred = model.test(new_testset)

In [None]:
pred[:5]  # check the predicted data

In [None]:
# pred_ratings = [round((p.est),1) for p in pred] # extract the ratings from the predicted ratings
pred_ratings = [round((p.est),4) for p in pred]

In [None]:
pred_ratings[:5] # view the first 5 rows

In [None]:
# define a function to convert the ratings to the nearest unique rating in the data set

# def map_ratings(ratings):
#     unique_ratings = [4.0, 4.5, 5.0, 2.0, 3.0, 1.0, 3.5, 2.5, 0.5, 1.5]
#     mapped_ratings = []

#     for rating in ratings:
#         # Find the closest rating in the unique_ratings list
#         closest_rating = min(unique_ratings, key=lambda x: abs(x - rating))
#         mapped_ratings.append(closest_rating)

#     return mapped_ratings
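
The commented-out mapping above searches a list of valid ratings for the closest one. Since the valid ratings are the multiples of 0.5 between 0.5 and 5, the same result can be obtained arithmetically. A minimal sketch follows, with `snap_to_half_star` as a hypothetical helper name. Exact ties between two valid ratings are rounded up here, which is an arbitrary choice.

```python
import math


def snap_to_half_star(rating, low=0.5, high=5.0):
    """Round to the nearest multiple of 0.5, then clip to [low, high]."""
    snapped = math.floor(rating * 2 + 0.5) / 2
    return min(max(snapped, low), high)


ratings_to_map = [3.6, 4.1, 3.3, 4.3, 1.8]
print([snap_to_half_star(r) for r in ratings_to_map])  # [3.5, 4.0, 3.5, 4.5, 2.0]
```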

In [None]:
# Test the mapping function with your list of ratings

# ratings_to_map = [3.6, 4.1, 3.3, 4.3, 1.8]
# mapped_ratings = map_ratings(ratings_to_map)

# print(mapped_ratings)

In [None]:
# pred_mapped_ratings = map_ratings(pred_ratings)

In [None]:
# pred_mapped_ratings[:5]

In [None]:
# test['rating'] = pred_mapped_ratings  # assign the predictions to a column in the test data
test['rating'] = pred_ratings

In [None]:
# merge the userid and the movieid

test['Id'] = test['userId'].astype(str) + '_' + test['movieId'].astype(str)

In [None]:
# create a new data frame for the submission

sub_df = test[['Id', 'rating']]
sub_df.head() 

In [None]:
sub_df.to_csv('sub.csv', mode = 'w', index=False)

In [None]:
sub_df.shape

In [None]:
sub_df.head()