<a href="https://colab.research.google.com/github/kwanda2426/unsupervised-predict-streamlit-template/blob/master/Team_14%20notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://explore-datascience.net/images/images_admissions2/main-logo.jpg">

<img src="https://github.com/Explore-AI/Pictures/blob/master/sql_tmdb.jpg?raw=true" width=90%/>

# Streamlit-based Movie Recommender System

## Team 14 : 

## Table of contents
1. [Introduction](#intro)
2. [Data Collection](#data)
3. [Data Preprocessing](#cleaning)
4. [Exploratory Data Analysis](#EDA)
5. [Feature Engineering And Selection](#features)
6. [Model Building And Evaluation](#model)
7. [Model Hyperparameter Tuning](#tuning)
8. [Conclusion](#conclusion)
9. [References](#references)
 

<a id="intro"></a>
# 1. **Introduction**

In our daily life when we are shopping online, or looking for a movie to watch, we normally ask our friends or search for it. And when they recommend something that we do not like yet they enjoyed it. what a waste of time right. So what about if there is a system that can understand you, and recommend for you based on your interests, that would be so cool.

The growth of the internet has resulted in an enormous amount of online data and information available to us. Tools like a recommender system allow us to filter the information which we want or need. Recommender systems can be utilized in many contexts, one of which is a playlist generator for video, movie or music services. 
Recommendation systems are becoming increasingly important in today’s extremely busy world. People are always short on time with the myriad tasks they need to accomplish in the limited 24 hours. Therefore, the recommendation systems are important as they help them make the right choices, without having to expend their cognitive resources.

### **Problem Statement**
In today’s technology driven world, recommender systems are socially and economically critical for ensuring that individuals are exposed to the content that is relevant to them in one way or another. One application where this is especially true surrounds movie content recommendations; where intelligent algorithms can help viewers find great titles from tens of thousands of options. If customers are not exposed to a content relevant to them, may decide to look for alternatives which may provide better content.

### **Objectives**

The key objective is to construct a recommendation algorithm based on content or collaborative filtering, capable of accurately predicting how a user will rate a movie they have not yet viewed based on their historical preferences.

### **Literature Review**

**What are recommender systems?**

Simply put, recommender systems are the systems that are designed to recommend things to the user based on many different factors. These systems predict the most likely product that the users are most likely to purchase and are of interest to. Companies like Netflix, Amazon, etc. use recommender systems to help their users to identify the correct product or movies for them. 

The purpose of a recommendation system basically is to search for content that would be interesting to an individual. Moreover, it involves a number of factors to create personalised lists of useful and interesting content specific to each user. Recommendation systems are Artificial Intelligence based algorithms that skim through all possible options and create a customized list of items that are interesting and relevant to an individual. These results are based on their profile, search/browsing history, what other people with similar traits/demographics are watching, and how likely are you to watch those movies. This is achieved through predictive modeling and heuristics with the data available.

#### **Content-Based Filtering**

Content-based filtering is a type of recommender system that attempts to guess what a user may like based on that user's activity. Content-based filtering makes recommendations by using keywords and attributes assigned to objects in a database (e.g., items in an online marketplace) and matching them to a user profile.

**Why use content-based filtering?**
- No data from other users is required to start making recommendations.
- Recommendations are highly relevant to the user.
- You avoid the “cold start” problem.
- Recommendations are transparent to the user. Highly relevant recommendations send a message of openness to the user, bolstering their trust level in offered recommendations.


**Challenges of content-based filtering**
- There’s a lack of novelty and diversity.
- Scalability is a challenge. Every time a new product or service or new content is added, its attributes must be defined and tagged.
- Attributes may be incorrect or inconsistent. Content-based recommendations are only as good as the subject-matter experts tagging items.


#### **Collaborative Filtering**
The idea behind collaborative filtering is to consider users’ opinions on different videos and recommend the best video to each user based on the user’s previous rankings and the opinion of other similar types of users.

**Why use collaborative filtering?**
- It does not need a movie’s side knowledge like genres.
- It uses information collected from other users to recommend new items to the current user.
- Even when no information on an item is available, we still can predict the item rating without waiting for a user to purchase it.
- Captures the change in user interests over time: Focusing solely on content does not provide any flexibility on the user's perspective and their preferences.


**Challenges of collaborative filtering**
- Cannot handle well fresh items with no ratings.
- Hard to include side features for query/item.
- Cannot handle well fresh users with no relations to other users.




<a id="data"></a>
# 2. **Data Collection**

## **Import Libraries**

In [None]:
!pip install comet_ml
!pip install surprise

In [None]:
# import comet_ml at the top of your file
from comet_ml import Experiment

# Create an experiment with your api key
experiment = Experiment(
    api_key="cDBGt9YOCyyinNTUvxRUB3hxd",
    project_name="streamlit-based-movie-recommender-system",
    workspace="kwanda2426",
)

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/kwanda2426/streamlit-based-movie-recommender-system/d69f8ff97ef5447fa556f633a134fb40



We use comet to run different experiments while saving the .

In [1]:

# Data manipulation
import pandas as pd
import numpy as np

# datetime
import datetime

# Libraries for data preparation and model building
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel
from surprise import Reader
from surprise.model_selection import train_test_split
from surprise import Reader
from surprise import Dataset
from surprise import SVD
import recmetrics
from sklearn.preprocessing import MaxAbsScaler
from surprise.accuracy import rmse
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
# saving model
import pickle

#ignoring warnings
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

#making sure that we can see all rows and cols
pd.set_option('display.max_columns', None)

pd.set_option('display.max_rows', None)

ModuleNotFoundError: ignored

### **Loading Data**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

ModuleNotFoundError: No module named 'google.colab'

The basic process of loading data from a CSV file into a Pandas DataFrame (with all going well) is achieved using the “read_csv” function in Pandas.

In [None]:
# imdb
imdb_df = pd.read_csv('C:/Users/Tshegofatso/Downloads/edsa-movie-recommendation-wilderness/imdb_data.csv')

# movies
movies_df = pd.read_csv('C:/Users/Tshegofatso/Downloads/edsa-movie-recommendation-wilderness/movies.csv')

# movies
tags_df = pd.read_csv('C:/Users/Tshegofatso/Downloads/edsa-movie-recommendation-wilderness/tags.csv')

# train 
train = pd.read_csv('C:/Users/Tshegofatso/Downloads/edsa-movie-recommendation-wilderness/train.csv')

# test
test = pd.read_csv('C:/Users/Tshegofatso/Downloads/edsa-movie-recommendation-wilderness/test.csv')

<a id="cleaning"></a>
## 3. **Data Preprocessing**

Data preprocessing is a technique that involves taking in raw data and transforming it into a understandable format and useful. The technique includes data cleaning, intergration, transformation, reduction and discretization. The data preprocessing plan will include the following processes:

- **Data cleaning**

- **Table merging process**

- **Dealing with missing values**

### Data cleaning

Data cleaning is important because it improves your data quality and in doing so, increases overall productivity. When you clean your data, all outdated or incorrect information is gone – leaving you with the highest quality information. We aim to determine inaccurate, incomplete, or unreasonable data and then improve quality by correcting detected errors and omissions.

In [None]:
# create copies of the dataframes

imdb_df = imdb_df.copy()
movies_df = movies_df.copy()
train_df = train.copy()
test_df = test.copy()

In [None]:
# merging dataframe

train_df = pd.merge(movies_df, imdb_df, on = 'movieId')

### Checking for missing values

The problem of missing value is quite common in many real-life datasets. Missing value can bias the results of the machine learning models and/or reduce the predictive accuracy of the model, hence it is crucial to know how much is missing and what to do with that.

In [None]:
# Percentage of missing values
(train_df.isnull().sum()/len(train_df))*100

movieId           0.000000
title             0.000000
genres            0.000000
title_cast       38.868334
director         38.281187
runtime          45.624548
budget           70.711011
plot_keywords    42.153945
dtype: float64

We can see that **title_cast** is missing about **36.9%**, the **director** column is missing **36.2%**, **runtime** is missing **44.3%**, **budget** is missing **71.0%**, **plot_keywords** is missing **40.6%**.
The **budget** column since we missing a lot of data we will **drop** the column since **we can't make reliable analysis on it** and the others we can't impute reliable like the cast. The function that removes noise deals with the missing values.

#### Removing noise

Data that can not be processed/interpreted by a machine is classified as noisy data. Text data contain a lot of noise, this comes in a form of special characters such as hashtags, punctuation and numbers.

- We start by changing the datatype of text data to string for better handling and manipulation.

In [None]:
# change data types
train_df['genres'] = train_df.genres.astype(str)
train_df['title_cast'] = train_df.title_cast.astype(str)
train_df['director'] = train_df.director.astype(str)
train_df['plot_keywords'] = train_df.plot_keywords.astype(str)

- Change the text to lower case.

- Replace the vertical bar with a comma.

In [None]:
# Every genre is separated by a | 
train_df['genres'] = train_df['genres'].map(lambda x: x.lower().split('|'))

# Every title cast is separated by a | so we simply have to call the split function on | and separate them by ,
train_df['title_cast'] = train_df['title_cast'].str.split('|')

# And we will do the same thing for the plot keywords
train_df['plot_keywords'] = train_df['plot_keywords'].str.split('|')

Combine the name and surname in the title_cast and director columns, hence creating one word for the uniqueness of a person's name. If no name exists, the function will leave a space.

In [None]:
def string_function(x):
    """combines name and surname into one name
    and return results as one name.
    
    if no name exists returns a space"""
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [None]:
cols = ['title_cast','director']

for col in cols:
    train_df[col] = train_df[col].apply(string_function)

The resulting data has every text column in lower case, separated by a comma. The name and surname combined for title_cast and director columns.

<a id="EDA"></a>
## 4. **Exploratory Data Analysis**

#### Data overview

This gives an overview of the dataset that is more interesting than the others, i.e imdb, movies, train and test datasets.

#### IMDB dataset

In [2]:
# Checking how our imdb dataset looks like
print("Rows    : ", imdb_df.shape[0])

print("Columns : ", imdb_df.shape[1])

print("\nMissing values: ", imdb_df.isnull().sum().values.sum())

print("\nInformation about the data: ")
print("  \n", imdb_df.info())
 
print("\nAbout the data: \n")

# Check how many unique items are in each column of the dateframe
for col_name in imdb_df.columns:
    unique_out = len(imdb_df[col_name].unique())
    print(f"Feature '{col_name}' has {unique_out} unique categories") 

NameError: ignored

**Movies dataset**

In [3]:
# Checking how our movies dataset looks like
print("Rows    : ", movies_df.shape[0])

print("Columns : ", movies_df.shape[1])

print("\nMissing values: ", movies_df.isnull().sum())

print("\nInformation about the data: ")
print("  \n", movies_df.info())
 
print("\nAbout the data: \n")

# Check how many unique items are in each column of the dateframe
for col_name in movies_df.columns:
    unique_out = len(movies_df[col_name].unique())
    print(f"Feature '{col_name}' has {unique_out} unique categories") 

NameError: ignored

**Tags dataset**

In [4]:
# Checking how our tags dataset looks like
print("Rows    : ", tags_df.shape[0])

print("Columns : ", tags_df.shape[1])

print("\nMissing values: ", tags_df.isnull().sum())

print("\nInformation about the data: ")
print("  \n", tags_df.info())
 
print("\nAbout the data: \n")

# Check how many unique items are in each column of the dateframe
for col_name in tags_df.columns:
    unique_out = len(tags_df[col_name].unique())
    print(f"Feature '{col_name}' has {unique_out} unique categories") 

NameError: ignored

**Train dataset**

In [None]:
# Checking how our train dataset looks like
print("Rows    : ", train_df.shape[0])

print("Columns : ", train_df.shape[1])

print("\nMissing values: ", train_df.isnull().sum())

print("\nInformation about the data: ")
print("  \n", train_df.info())
 
print("\nAbout the data: \n")

# Check how many unique items are in each column of the dateframe
for col_name in train_df.columns:
    unique_out = len(train_df[col_name].unique())
    print(f"Feature '{col_name}' has {unique_out} unique categories") 

**Test dataset**

In [None]:
# Checking how our test dataset looks like
print("Rows    : ", test_df.shape[0])

print("Columns : ", test_df.shape[1])

print("\nMissing values: ", test_df.isnull().sum().values.sum())

print("\nInformation about the data: ")
print("  \n", test_df.info())
 
print("\nAbout the data: \n")

# Check how many unique items are in each column of the dateframe
for col_name in test_df.columns:
    unique_out = len(test_df[col_name].unique())
    print(f"Feature '{col_name}' has {unique_out} unique categories") 

### Splitting the genres and title casts into lists

In [None]:
#extracting released year
movies = movies_df.copy()
movies['release_year']=movies['title'].str[-5:-1] 
#spliting the genres into a list
movies['genres']=movies['genres'].str.split('|') 
#concatinate ratings with movies dataframe
movies.dropna() 
movies.head(3)

In [None]:
#spliting the title cast into a list
imdb = imdb_df.copy()
imdb['title_cast']=imdb['title_cast'].str.split('|') 
imdb.head(3)

### Merging datasets

In [None]:
train_eda = train_df.copy()
con = pd.concat([train_df[:1000],movies], axis=1)
con.head()

In [None]:
df= pd.concat([imdb,con], axis=1)
df.dropna(inplace=True)
df.head(3)

In [None]:
# Merging the tarin  and movies data
data = pd.merge(train, movies, on='movieId')
data.head()

In [None]:
#creating mean ratings data
ratings = pd.DataFrame(data.groupby('title')['rating'].mean())
ratings.head()

In [None]:
#creating number of ratings data
ratings['number_of_ratings'] = data.groupby('title')['rating'].count()
ratings.head()

### Data Visualisation 

In [5]:
fig = plt.figure(figsize=(12, 8))
recmetrics.long_tail_plot(df=data, 
             item_id_column="movieId", 
             interaction_type="movie ratings", 
             percentage=0.5,
             x_labels=False)

NameError: ignored

<Figure size 864x576 with 0 Axes>

The plot plot shows the distribution of ratings/movie popularity with 653 polpular movies and 45760 unpopular movies.

**Movie Ratings from the User**

In [6]:
# Distplot of ratings 
sns.distplot(df["rating"], color='blue');

NameError: ignored

**Exploring Movie Genres**

In [7]:
# Ploting top genres in the Dataset
plt.figure(figsize=(20, 10))
gen = df['genres'].explode()
ax=sns.countplot(x=gen, order=gen.value_counts().index[:30],color='blue')
ax.set_title('Popular Genres', fontsize=15)
plt.xticks(rotation =90)
plt.show()

NameError: ignored

<Figure size 1440x720 with 0 Axes>

Drama, Comedy and Thriller are top 3 most common movie genres.

#### Movies made per year 

In [None]:
# Plot movies released per year
plt.figure(figsize=(15,10))
sns.set(style="darkgrid")
ax = sns.countplot(y=movies['release_year'], data=df, order=df['release_year'].value_counts().index[0:30],color='blue')
ax.set_title('Total Movies Released per Year', fontsize= 20)

from 1955 The number of movies released each year increased, whereas it was previously fluctuating.

#### Popular Cast Members 

In [None]:
# Plot popular cast
plt.figure(figsize = (20,5))
cast=imdb['title_cast'].explode()
ax=sns.countplot(x=cast, order = cast.value_counts().index[:30],color='red')
ax.set_title('Popular Cast',fontsize=15)
plt.xticks(rotation=90)
plt.show()

The most well-known cast members are Samuel L. Jackson and Steve Buscemi, with the remaining members having a slight variation in recognition.

In [None]:
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
cast = df['title_cast'].explode()
text = list(set(cast))
plt.rcParams['figure.figsize'] = (13, 13)
wordcloud = WordCloud(max_font_size=50, max_words=100,background_color="white").generate(str(text))

plt.imshow(wordcloud,interpolation="bilinear")
plt.axis("off")
plt.show()

### Movie Runtime

In [None]:
# Describe the runtime 
df['runtime'].describe()

In [None]:
#Plot the Runtime
sns.set(style="darkgrid", )
sns.kdeplot(data=df['runtime'], shade=True, color='red')

#### Long Movies

In [None]:
#Show movies with long lengths 
df[df['runtime'] > 0][['runtime', 'title', 'release_year']].sort_values('runtime', ascending=False).head(10)

#### Short Movies 

In [None]:
# Show movies with short lengths
df[df['runtime'] > 0][['runtime', 'title', 'release_year']].sort_values('runtime').head(10)

### Tags 

In [None]:
tags = tags_df['tag']
tags.dropna(inplace=True)

In [None]:
#Plot tags 
plt.figure(figsize=(15, 5))
ax = sns.countplot(x=tags, order = tags.value_counts().index[:20],color = 'blue')
ax.set_title('Top Tags', fontsize=15)
plt.xticks(rotation=90)
plt.show

<a id="features"></a>
## 5. **Feature engineering And Selection**

In this section, we extract features from raw data. The motivation is to use these extra features to improve the quality of results from a machine learning process, compared with supplying only the raw data to the machine learning process.

The features engineered for content-based and collaborative filtering are different because methods do not use the same dataset.

### **Content-based Filtering**

#### Feature Engineering

In [None]:
cols = ['title','genres','title_cast','director','plot_keywords']

#create new dataframe with useful data
data_df = train_df[cols]

#set index to movie titles
data_df.set_index('title', inplace = True)

data_df.head()

Unnamed: 0_level_0,genres,title_cast,director,plot_keywords
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Toy Story (1995),"[adventure, animation, children, comedy, fantasy]","[tomhanks, timallen, donrickles, jimvarney, wa...",johnlasseter,"[toy, rivalry, cowboy, cgi animation]"
Jumanji (1995),"[adventure, children, fantasy]","[robinwilliams, jonathanhyde, kirstendunst, br...",jonathanhensleigh,"[board game, adventurer, fight, game]"
Grumpier Old Men (1995),"[comedy, romance]","[waltermatthau, jacklemmon, sophialoren, ann-m...",markstevenjohnson,"[boat, lake, neighbor, rivalry]"
Waiting to Exhale (1995),"[comedy, drama, romance]","[whitneyhouston, angelabassett, lorettadevine,...",terrymcmillan,"[black american, husband wife relationship, be..."
Father of the Bride Part II (1995),[comedy],"[stevemartin, dianekeaton, martinshort, kimber...",alberthackett,"[fatherhood, doberman, dog, mansion]"


Now we create the bag of words from the genres, title_cast,director and plot keywords.

In [None]:
data_df['bag_of_words'] = ''
columns = data_df.columns
for index, row in data_df.iterrows():
    words = ''
    for col in columns:
        if col != 'director':
            words = words + ' '.join(row[col])+ ' '
        else:
            words = words + row[col]+ ' '
    row['bag_of_words'] = words
    
data_df.drop(columns = [col for col in data_df.columns if col!= 'bag_of_words'], inplace = True)

**Vectorization**

The data we have is text, but machine learning algorithms operate on a numeric feature space, expecting input as a two-dimensional array where rows are instances and columns are features. In order to perform machine learning on text, we need to transform our documents into vector representations such that we can apply numeric machine learning. We make use of two vectorization techniques:

- CountVectorizer creates a matrix in which each unique word is represented by a column of the matrix, and each text sample from the document is a row in the matrix. The value of each cell is nothing but the count of the word in that particular text sample.

From the  bag of words, we generate numerical features.

In [None]:
# instantiating and generating the count matrix
count = CountVectorizer()
count_matrix = count.fit_transform(data_df['bag_of_words'])
new_matrix = count_matrix
# creating a Series for the movie titles.
indices = pd.Series(data_df.index)
indices[:10]

0                      Toy Story (1995)
1                        Jumanji (1995)
2               Grumpier Old Men (1995)
3              Waiting to Exhale (1995)
4    Father of the Bride Part II (1995)
5                           Heat (1995)
6                        Sabrina (1995)
7                   Tom and Huck (1995)
8                   Sudden Death (1995)
9                      GoldenEye (1995)
Name: title, dtype: object

**Feature scaling**

It is possible for features to have different scales, there is a chance that higher weightage is given to features with higher magnitude. This will impact the performance of the machine learning algorithm and obviously, we do not want our algorithm to be biassed towards certain features. 

MaxAbsScaler estimator scales and translates each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.


In [None]:
# initialise a scaler

scaler = MaxAbsScaler() 

scaled_new_matrix = scaler.fit_transform(new_matrix) # scaled new_matrix


### **Collaborative Filtering**

<a id="model"></a>
## 6. **Model Building And Evaluation**

The method of learning is unsupervised, hence this type of algorithm learns patterns from untagged data. The hope is that through mimicry, which is an important mode of learning in people, the machine is forced to build a compact internal representation of its world and then generate imaginative content from it. 

We use two forms of recommender system algorithms: content-based and collaborative filtering.

### **Content-based Filtering**

From the features engineered, we find the similarities within the data. This is done by computing the cosine similarity.

Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.


In [None]:
# generating the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)

With our content similarity matrix computed,     

We now do recommendations by: 

  1. Select an initial item (movie) to generate recommendations from. 
  2. Extract all the similarity values between the initial item and each other item in the similarity matrix.
  3. Sort the resulting values in descending order. 
  4. Select the top N similarity values, and return the corresponding item details to the user. This is now our simple top-N list.  
  
We implement this algorithmic process in the function below:

In [None]:
def recommendations(title, cosine_sim = cosine_sim):
    
    recommended_movies = []
    
    # gettin the index of the movie that matches the title
    idx = indices[indices == title].index[0]

    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)

    # getting the indexes of the 10 most similar movies
    top_10_indexes = list(score_series.iloc[1:11].index)
    
    # populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(list(data_df.index)[i])
        
    return recommended_movies

In [None]:
# recommendations for the movie
recommendations('Hard Target (1993)')

### **Collaborative Filtering**

Collaborative filtering is a technique that can filter out items that a user might like on the basis of reactions by similar users.

Because it’s based on historical data, the core assumption here is that the users who have agreed in the past tend to also agree in the future. In terms of user preference, it usually expressed by two categories, Explicit and Implicit rating. 

**Explicit Rating**, is a rate given by a user to an item on a sliding scale, 
like 5 stars for Titanic. This is the most direct feedback from users to show how much they like an item.

**Implicit Rating**, suggests users preference indirectly, such as page views, clicks, purchase records, whether or not listen to a music track, and so on.

In this predict explicit data rating will be used. **Surprise** is a Python scikit for building and analyzing recommender systems that deal with explicit rating data.

**Surprise:**

Provides various ready-to-use prediction algorithms such as baseline algorithms, neighborhood methods, matrix factorization-based ( SVD, PMF, SVDpp, NMF), and many others.
Provides tools to evaluate, analyse and compare the algorithms’ performance.
From the Suprise library, the follwoing algorithms were used:

Basic algorithms


**NormalPredictor:** this algorithm predicts a random rating based on the distribution of the training set, which is assumed to be normal.

**BaselineOnly:** this algorithm predicts the baseline estimate for given user and item.

**k-NN algorithms**

**KNNBasic:** this is a basic collaborative filtering algorithm.

**KNNWithMeans:** this is a basic collaborative filtering algorithm, taking into account the mean ratings of each user.

**KNNWithZScore:** this is a basic collaborative filtering algorithm, taking into account the z-score normalization of each user.

**KNNBaseline:** is a basic collaborative filtering algorithm taking into account a baseline rating.

**Matrix Factorization-based algorithms**

**SVD:** this algorithm is equivalent to Probabilistic Matrix Factorization ( which makes use of data provided by users with similar preferences to offer recommendations to a particular user).

**SVDpp:** this algorithm is an extension of SVD that takes into account implicit ratings.

**NMF:** this is a collaborative filtering algorithm based on Non-negative Matrix Factorization. It is very similar with SVD.

**SlopeOne:** this is a straightforward implementation of the SlopeOne algorithm.

**Coclustering:** is a collaborative filtering algorithm based on co-clustering.

In [None]:
#Loading 10000 dataset
data = Dataset.load_from_df(train_df[['userId', 'movieId', 'rating']].head(10000), Reader)

In [None]:
#Implement an algorithm
algo = [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), 
                  KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]

#Read 10000 dataset
data2 = Dataset.load_from_df(train_df[['userId', 'movieId', 'rating']].head(10000), Reader())

#Implementing algorithm for RMSE
algo_rmse=[]
for a in algo:
    
    cross_valid=cross_validate(a, data2, measures=['RMSE'], cv = 3)
    output=pd.DataFrame.from_dict(cross_valid).mean(axis=0)
    output=output.append(pd.Series([str(a).split(' ')[0].split('.')[-1]], index=['Algorithm']))
    algo_rmse.append(output)

algo_rmse
surprise_results = pd.DataFrame(algo_rmse).set_index('Algorithm').sort_values('test_rmse')
surprise_results

Based on the table above containing test_rmse, fit_time, test_time values for the algorithms, we notice that the SVDpp, SVD and BaselineOnly algorithms are top three best performing algorithms. Therefore the best performing algorithm will be used for prediction and to find the Root Mean Squared Error (RMSE) values.

**Predicting with SVDpp Algorithm**

In [None]:
#Loading 100000 dataset
data3 = Dataset.load_from_df(train_df[['userId', 'movieId', 'rating']].head(100000), Reader()) 

In [None]:
trainset, testset = train_test_split(data3, test_size=0.05)

In [None]:
from surprise import accuracy
#SVDpp model
svdpp=SVDpp(n_epochs = 30, n_factors = 200, init_std_dev = 0.05, random_state=42)

#Fitting the model
svdpp.fit(trainset)

# Making prediction on the validation dataset
test_pred= svdpp.test(testset)

#Evaluating model performance
rsme_collabo = accuracy.rmse(test_pred,
                             verbose=True)

In [None]:
#Predicting the rating for each user and movie
ratings=[]
for x,y in test_df.itertuples(index=False):
    output=svdpp.predict(x,y)
    ratings.append(output)
    
output_df=pd.DataFrame(ratings)[['uid','iid','est']]
output_df['ID']=output_df['uid'].astype(str) + '_' + output_df['iid'].astype(str)
output_df=output_df[['ID','est']]
output_df.head()

**Predicting with SVD Algorithm**

In [None]:
#Loading 1000000 dataset
data4 = Dataset.load_from_df(train_df[['userId', 'movieId', 'rating']].head(1000000), Reader()) 

In [None]:
trainset, testset = train_test_split(data4, test_size=0.05)

In [None]:
from surprise import accuracy
#SVD model
svd=SVD(n_epochs = 30, n_factors = 200, init_std_dev = 0.05, random_state=42)

#Fitting the model
svd.fit(trainset)

# Making prediction on the validation dataset
test_pred= svd.test(testset)

#Evaluating model performance
rsme_collabo = accuracy.rmse(test_pred,
                             verbose=True)

In [None]:
#Predicting the rating for each user and movie
ratings=[]
for x,y in test_df.itertuples(index=False):
    output=svd.predict(x,y)
    ratings.append(output)
    
output_df=pd.DataFrame(ratings)[['uid','iid','est']]
output_df['ID']=output_df['uid'].astype(str) + '_' + output_df['iid'].astype(str)
output_df=output_df[['ID','est']]
output_df.head()

**Predicting with BaselineOnly algorithm**

In [None]:
#Loading 1000000 dataset
data5 = Dataset.load_from_df(train_df[['userId', 'movieId', 'rating']].head(1000000), Reader()) 

In [None]:
trainset, testset = train_test_split(data5, test_size=0.05)

In [None]:
from surprise import accuracy
#BaselineOnly model
bsl_options = {'method': 'sgd','n_epochs': 40}
blo=BaselineOnly(bsl_options=bsl_options)

#Fitting the model
blo.fit(trainset)

# Making prediction on the validation dataset
test_pred= blo.test(testset)

#Evaluating model performance
rsme_collabo = accuracy.rmse(test_pred,
                             verbose=True)

In [None]:
#Predicting the rating for each user and movie
ratings=[]
for x,y in test_df.itertuples(index=False):
    output=blo.predict(x,y)
    ratings.append(output)
    
output_df=pd.DataFrame(ratings)[['uid','iid','est']]
output_df['ID']=output_df['uid'].astype(str) + '_' + output_df['iid'].astype(str)
output_df=output_df[['ID','est']]
output_df.head()

<a id="evaluation"></a>
## 7. **Model Parameter Tuning**

In [None]:

params = {'n_epochs' : 12,
           'init_std_dev': 0.01,
           'n_factors' : 160,
          'model_name' : 'SVDpp'}

RMSE = 0.84443
metrics = RMSE

In [None]:
# log our parameters and results

experiment.log_parameters(params)

experiment.log_parameters(metrics)

In [None]:
# ending the experiment

experiment.end()

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/kwanda2426/streamlit-based-movie-recommender-system/d69f8ff97ef5447fa556f633a134fb40
COMET INFO:   Parameters:
COMET INFO:     imag         : 0.0
COMET INFO:     init_std_dev : 0.01
COMET INFO:     model_name   : SVDpp
COMET INFO:     n_epochs     : 12
COMET INFO:     n_factors    : 160
COMET INFO:     real         : 0.84443
COMET INFO:   Uploads:
COMET INFO:     environment details      : 1
COMET INFO:     filename                 : 1
COMET INFO:     git metadata             : 1
COMET INFO:     git-patch (uncompressed) : 1 (10.63 MB)
COMET INFO:     installed packages       : 1
COMET INFO:     notebook                 : 1
COMET INFO:     source_code              : 1
COMET INFO: ---------------------------
COMET INFO: Uploading 1 metrics, params a

<a id="conclusion"></a>
## 8. **Conclusion**

<a id="references"></a>
## 9. **References**

1. Hakami, A., 2022. Movie Recommendation system. [online] Medium. Available at: <https://medium.com/mlearning-ai/movie-recommendation-system-f2f57290b1b8> [Accessed 24 January 2022].

2. abramovsky, O., 2022. How to generate recommendations using TF-IDF. [online] Medium. Available at: <https://medium.com/codex/how-to-generate-recommendations-using-tf-idf-52d46eca606f> [Accessed 27 January 2022].

3. Youtube.com. 2022. Overview of recommender systems. [online] Available at: <https://www.youtube.com/watch?v=1JRrCEgiyHM> [Accessed 16 January 2022].

4. Youtube.com. 2022. Content-based Filtering. [online] Available at: <https://www.youtube.com/watch?v=2uxXPzm-7FY> [Accessed 16 January 2022].

5. Youtube.com. 2022. Collaborating Filtering. [online] Available at: <https://www.youtube.com/watch?v=h9gpufJFF-0> [Accessed 16 January 2022].