# Unsupervised Learning Predict Notebook - EDSA Movie Recommendation


In [2]:
!pip install comet_ml

In [3]:
from comet_ml import Experiment

## Introduction

Have you ever used an online streaming platform such as Netflix, Showmax or YouTube? After you watch a movie, the platform soon starts recommending other movies and TV shows, and many of those suggestions turn out to be content you actually like. This is the work of a recommendation system: a system that learns a user's viewing patterns and offers relevant suggestions. With artificial intelligence and related technologies now central to the fourth industrial revolution, you have almost certainly come across a recommendation system in your everyday life. Recommender systems matter commercially as well: the financial benefits are substantial, and almost every major tech company has applied them in some form. In this Machine Learning Predict, we walk you through building your own recommendation system.

<br></br>

<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://3.bp.blogspot.com/-tRH4a36gEOc/VlJcXFoY9bI/AAAAAAAAADo/fRu2BNRW7W4/s1600/Film%2BReel.jpg"
     style="float: center; padding-bottom=0.5em"
     width=600px/>

</div>


## 1.1 Problem Statement

Develop a recommendation algorithm based on content or collaborative filtering capable of accurately recommending movies for users based on historical preferences. 

### Datasets
* genome_scores.csv - A score mapping the strength between movies and tag-related properties.
* genome_tags.csv - User-assigned tags for genome-related scores.
* imdb_data.csv - Additional movie metadata scraped from IMDB using the links.csv file.
* links.csv - File providing a mapping between a MovieLens ID and associated IMDB and TMDB IDs.
* sample_submission.csv - Sample of the submission format for the hackathon.
* tags.csv - User-assigned tags for the movies within the dataset.
* test.csv - The test split of the dataset. Contains user and movie IDs with no rating data.
* train.csv - The training split of the dataset. Contains user and movie IDs with associated rating data.


In [5]:
!pip install squarify


## 2. Loading Required Libraries:

In [6]:
# Ignore warnings
import warnings
warnings.simplefilter(action='ignore')

#Install Prerequisites
import sys
#!{sys.executable} -m pip install scikit-learn scikit-surprise
#!pip install git+https://github.com/gbolmier/funk-svd

# Exploratory Data Analysis
import pickle
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Data Preprocessing
import random
from time import time

from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import NullFormatter
from sklearn.preprocessing import StandardScaler
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# Models
from surprise import Reader, Dataset
from surprise import SVD, NormalPredictor, BaselineOnly, NMF, SlopeOne, CoClustering
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

# Performance Evaluation
from surprise import accuracy
from sklearn.metrics import mean_squared_error
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split

# Display
%matplotlib inline
sns.set(font_scale=1)
sns.set_style("white")
pd.set_option('display.max_columns', 37)
from wordcloud import WordCloud 
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display_html 
from IPython.core.display import HTML
from collections import defaultdict
import datetime
import re
import squarify
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
# Visualisation (matplotlib, seaborn and the plotly notebook mode are already set up above)
sns.set_style("darkgrid")
import plotly.offline as py
import plotly.express as px
import plotly.graph_objects as go
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator


## 3. Loading the dataset


In [8]:
# Reading all the given data
train_df = pd.read_csv('/kaggle/input/movierecommendation/train.csv') 
test_df = pd.read_csv('/kaggle/input/movierecommendation/test.csv')
links_df = pd.read_csv('/kaggle/input/movierecommendation/links.csv')
movies_df = pd.read_csv('/kaggle/input/movierecommendation/movies.csv')
imdb_df = pd.read_csv('/kaggle/input/movierecommendation/imdb_data.csv')
tags_df = pd.read_csv('/kaggle/input/movierecommendation/tags.csv')
genome_tags_df = pd.read_csv('/kaggle/input/movierecommendation/genome_tags.csv')
genome_scores_df = pd.read_csv('/kaggle/input/movierecommendation/genome_scores.csv')


dataset = pd.read_csv('/kaggle/input/movierecommendation/train.csv')
df =  dataset.head(100000)
movies = pd.read_csv('/kaggle/input/movierecommendation/movies.csv')
df = pd.merge(df,movies,on="movieId")


It is important to note that the target variable in this project is the `rating` column of the `train_df` dataframe loaded above.

# 4. Data Preprocessing
## 4.1 Dataset Overview
Below we will be familiarizing ourselves with the attributes of the data given to us.

In [10]:
#A function that displays multiple dataframes side by side in one cell
def data_overview_display(dataframe_list,column_names=()):
    """Render each dataframe as an HTML table, captioned by the matching entry of column_names."""
    #Header row with one caption per dataframe
    html_string = ''
    html_string += ('<tr>' + ''.join(f'<td style="text-align:center">{name}</td>' for name in column_names) + '</tr>')
    #Body row with one rendered dataframe per cell
    html_string += ('<tr>' + ''.join(f'<td style="vertical-align:top"> {df.to_html(index=True)}</td>' for df in dataframe_list) + '</tr>')
    html_string = f'<table>{html_string}</table>'
    #Display the tables inline so they sit next to each other
    html_string = html_string.replace('table','table style="display:inline"')
    display_html(html_string, raw=True)  
    

In [11]:
#Show "imdb_df" dataframe
imdb_df.head()

In [12]:
#Displays the overview of the train, test, links and tags dataframes
data_overview_display([train_df.head(),test_df.head(),links_df.head(),tags_df.head()], column_names=['Train','Test','Links','Tags'])

In [13]:
#Displays the overview of Movies,Genome tags, and genome scores dataframes
data_overview_display([movies_df.head(),genome_tags_df.head(),genome_scores_df.head()], column_names=['Movies','Genome Tags','Genome Scores'])

# 4.2 Data Cleaning

## 4.2.1 Null values

In [14]:
#Create the null-value dataframes of all the given data
train_null = pd.DataFrame({"Null Values":train_df.isnull().sum()})
test_null = pd.DataFrame({"Null Values":test_df.isnull().sum()})
movies_null = pd.DataFrame({"Null Values":movies_df.isnull().sum()})
links_null = pd.DataFrame({"Null Values":links_df.isnull().sum()})
imdb_null = pd.DataFrame({"Null Values":imdb_df.isnull().sum()})
tags_null = pd.DataFrame({"Null Values":tags_df.isnull().sum()})
genome_tags_null = pd.DataFrame({"Null Values":genome_tags_df.isnull().sum()})
genome_scores_null = pd.DataFrame({"Null Values":genome_scores_df.isnull().sum()})

In [15]:
#Display overview of null values of dataframes
data_overview_display([train_null,test_null,movies_null,links_null,genome_scores_null,tags_null,genome_tags_null], column_names=['Train df','Test df','Movies df', 'Links df','genome scores df','tags df','genome tags'])

In [16]:
#Display overview null value of "imdb_df"
data_overview_display([imdb_null], column_names=['imdb df'])

In [17]:
missing_budget_pct = round(imdb_null.loc['budget', 'Null Values'] / len(imdb_df) * 100, 2) #share of rows with no budget value
print("The budget column of imdb_df is missing", missing_budget_pct, "% of its values, so dropping the rows with nulls is not advisable")

# 4.3 Normalising and combining the Data

In [18]:
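movies_df['release_year']=movies_df['title'].str[-5:-1] #extracting the release year, assuming each title ends with "(YYYY)"
movies_df['genres']=movies_df['genres'].str.split('|') #splitting the genres into a list
#Concatenate ratings with the movies dataframe.
#NOTE: axis=1 aligns on row index, not movieId, so each movie is paired with the rating at the same row position in train_df
movies_df=pd.concat([movies_df,train_df['rating']],axis=1).dropna()
movies_df.head()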
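
The slice `str[-5:-1]` only yields a clean year when every title ends exactly with `(YYYY)`. As a minimal sketch of a more defensive alternative, the snippet below uses a regular expression on plain strings. The helper name `extract_year` and the sample titles are assumptions made up for illustration; they are not part of the dataset or the notebook.

```python
import re

# Hypothetical helper: pull a trailing "(YYYY)" year out of a title string.
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")

def extract_year(title):
    """Return the 4-digit release year as a string, or None if it is absent."""
    match = YEAR_PATTERN.search(title)
    return match.group(1) if match else None

# Illustrative titles (made up for this sketch); the second has a trailing space
titles = ["Toy Story (1995)", "Heat (1995) ", "Untitled Movie"]
years = [extract_year(t) for t in titles]
print(years)
```

Titles without a year come back as `None`, which makes them easy to filter out before plotting release years.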

In [19]:
imdb_df['title_cast']=imdb_df['title_cast'].str.split('|') #splitting the title cast into a list
imdb_df['plot_keywords']=imdb_df['plot_keywords'].str.split('|') #splitting the keywords into a list
imdb_df.head()

In [20]:
movies_df.shape

In [21]:
movies_df.info()

In [22]:
imdb_df.shape

In [23]:
imdb_df.info()

In [24]:
movies_df.groupby('title')['rating'].mean().sort_values(ascending=False ).head() #Mean rating per title, highest first

In [25]:
movies_df.groupby('title')['rating'].count().sort_values(ascending=False ).head()

In [26]:
df2 = pd.DataFrame(movies_df.groupby('title')['rating'].mean())

In [27]:
df2.head()

In [28]:
df2['counts'] = pd.DataFrame(movies_df.groupby('title')['rating'].count())

In [29]:
df2.head()

<a id="16"></a>
***
## **Exploratory Data Analysis (EDA)**
***
<div align="center">
<img src="https://luminousmen.com/media/exploratory-data-analysis.jpg"
     style="float: center; padding-bottom=0.5em"
     width=400px/>
</div>



In [32]:
plt.figure(figsize=(8,6))
plt.title('Distribution of ratings')
plt.xlabel('Ratings')
plt.ylabel('Number of ratings')
plt.rcParams['patch.force_edgecolor'] = True
df2['rating'].hist() 

As the histogram above shows, the ratings are left-skewed. We expected a roughly normal distribution centred on an average rating of 3. Instead, users tend to rate movies favourably and avoid low ratings. One likely explanation is selection: users mostly rate movies they enjoyed, and a user who dislikes a movie is unlikely to watch it to the end and rate it.

In [33]:
def extract_popular_movies(df1,df2):
    """
    Returns popular movies based on the average rating
    and the total ratings count.

    Parameters:
    df1: DataFrame from the train_df
    df2: DataFrame from the movies_df

    Returns a dataframe of popular movies.
    """
    #Calculating the average rating per movie and storing the result as a DataFrame
    rating = pd.DataFrame(df1.groupby('movieId')['rating'].mean())
    #Calculating the total ratings count per movie
    rating['ratings_count'] = df1.groupby('movieId')['rating'].count()
    #Sorting by ratings count, using the average rating to break ties
    rating=rating.sort_values(by=['ratings_count','rating'],ascending=False).reset_index()

    #Joining both DataFrames on movieId
    inner_join = pd.merge(rating,df2,on ='movieId',how ='inner')
    popular_movies=inner_join[['title','rating_x','ratings_count','release_year']].rename(columns={"rating_x": "rating"})

    return popular_movies

In [34]:
extract_popular_movies(train_df,movies_df).head() #display the extracted popular movies

Above we have created a dataframe of movies ranked by the number of ratings they received; the top 15 are shown in the bar plot below. The most popular movie of all time is The Shawshank Redemption, released in 1994, with an average rating of approximately 4.42.

In [35]:
df=extract_popular_movies(train_df,movies_df)


plt.figure(figsize = (10,6)) #Bar plot of the most popular movies by number of ratings
ax=sns.barplot(y='title', x='ratings_count', data=df.head(15),color='blue') #top 15, matching the plot title
ax.set_title('All-Time Popular Movies by Number of Ratings (Top 15)',fontsize=15)
plt.xticks(rotation=90)
plt.show()

According to the dataset, The Shawshank Redemption is the most popular movie of all time, with The Lord of the Rings also making the top 15.

In [36]:
#Plotting total amount of movies released in each year using a count plot.
figure= plt.subplots(figsize=(15, 5))
axes=sns.countplot(x=movies_df['release_year'], order = movies_df['release_year'].value_counts()[0:50].index,color='blue')
axes.set_title('Total movies released per year',fontsize=19)
plt.xticks(rotation=90)
plt.show()

Above, we observe that 2015 and 2016 are the years in which the highest number of movies were released. The plot also shows that the number of movies released per year has increased significantly over time, rising sharply since the year 2000.

In [37]:
#Count how many movies fall into each genre
#(a Series is used directly because value_counts().reset_index() names its columns differently across pandas versions)
genre_counts = movies_df['genres'].explode().value_counts()

#Plotting popular genres using a Treemap
sizes=genre_counts.values
labels=genre_counts.index
colors = [plt.cm.Paired(i/float(len(labels))) for i in range(len(labels))]
plt.figure(figsize=(12,8), dpi= 100)
squarify.plot(sizes=sizes, label=labels, color = colors, alpha=.5, edgecolor="black", linewidth=3, text_kwargs={'fontsize':10})
plt.title('Treemap of Movie Genres', fontsize = 15)
plt.axis('off')
plt.show()

The treemap above indicates that comedy and drama are the most common genres, followed by others such as action, horror, thriller and romance. Because the tile sizes count movies per genre, the plot shows which genres are produced most often; it suggests, but does not prove, that movies combining these genres have a better chance of being popular.

In [38]:
#Count how often each rating value occurs; column names are set explicitly
#because value_counts().reset_index() names them differently across pandas versions
sub_df= train_df['rating'].value_counts().sort_index().rename_axis('rating').reset_index(name='count')
fig, ax = plt.subplots(figsize=(10,6))
sns.barplot(data=sub_df, x='rating', y='count', palette="PuBu", edgecolor="black", ax=ax)
ax.set_xlabel("Rating")
ax.set_ylabel('Number of Ratings')
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()])
total = float(sub_df['count'].sum())
#Annotate each bar with its share of all ratings
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2., height+350, '{0:.2%}'.format(height/total), fontsize=11, ha="center", va='bottom')
plt.title('Percentage of Ratings Per Rating Value', fontsize=14)
plt.show()

As the plot above shows, the most common ratings are 4.0 and 3.0, while a rating of 0.5 is hardly ever given.

In [39]:
#Plotting popular cast using a count-plot
plt.figure(figsize = (10,6))
title_cast=imdb_df['title_cast'].explode()
ax=sns.countplot(x=title_cast, order = title_cast.value_counts().index[:20],color='blue')
ax.set_title('Top 20 Popular Actors',fontsize=15)
plt.xticks(rotation=90)
plt.show()

As the plot above shows, Samuel L. Jackson is the most frequently credited cast member, appearing in over 80 movies in our dataset.

In [40]:
#Plotting the distribution of ratings using a histogram with a density curve
plt.figure(figsize = (10,6))
axes=sns.histplot(movies_df['rating'],kde=True,color='blue') #histplot replaces the deprecated distplot
axes.set_title('Ratings Distribution',fontsize=15)
plt.show()

The most common rating is 4.

In [41]:
#Plotting the distribution of movie runtimes using a histogram with a density curve
plt.figure(figsize = (10,6))
axes=sns.histplot(imdb_df['runtime'].dropna(),kde=True,color='blue') #histplot replaces the deprecated distplot
axes.set_title('Runtime Distribution',fontsize=15)
plt.show()

In [42]:
# Creating a wordcloud of the movie titles to view the most frequent words within the titles
movies_df['title'] = movies_df['title'].fillna("").astype('str')
title_corpus = ' '.join(movies_df['title'])
title_wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', height=2000, width=4000).generate(title_corpus)

# Plotting the wordcloud
plt.figure(figsize=(16,8))
plt.imshow(title_wordcloud)
plt.axis('off')
plt.show()

We observe that the most frequent words in movie titles include 'Love', 'Night', 'Life', 'Girl' and 'Man'.

In [43]:
#Plotting top 10 movie directors using a count-plot
plt.figure(figsize = (10,6))
director=imdb_df['director']#.explode()
axes=sns.countplot(y=director, order = director.value_counts().index[1:11],color='blue')
axes.set_title('Top 10 Most Popular Movie Directors',fontsize=15)
plt.xticks(rotation=90)
plt.show()

The graph above ranks directors by the number of movies they have directed in the dataset. The most prolific director shown is Luc Besson, with Robert Rodriguez just making the top 10.

In [44]:
movies_imdb= pd.merge(imdb_df,movies_df,on ='movieId',how ='inner') #Merging the imdb_df and movies_df

In [45]:
#Create variable "runtime_per_genre"
runtime_per_genre=movies_imdb[['genres','runtime']].explode('genres')

In [46]:
runtime_per_genre.head(100)


In [47]:
#Plotting the average runtime per genre using a bar plot
plt.figure(figsize=(10,6))
ax = sns.barplot(data=runtime_per_genre, x='genres', y='runtime')
plt.title('Average Runtime Per Genre',fontsize=15)
plt.xlabel('genres', fontsize=15)
plt.ylabel('runtime', fontsize=15)
plt.xticks(rotation=90)
plt.show()

It can be observed that Western movies have the longest average runtime, whilst animation movies have the shortest.

In [48]:
#Create variable "budget_per_genre"
budget_per_genre=movies_imdb[['genres','budget']].explode('genres')
budget_per_genre['budget']=budget_per_genre['budget'].str.replace(',', "").str.extract(r'(\d+)', expand=False).astype('float') #raw string avoids an invalid-escape warning

In [49]:
#Plotting the average budget per genre using a bar plot
plt.figure(figsize=(10,6))
axes=sns.barplot(data=budget_per_genre, x="genres", y="budget")
plt.title('Average Budget Per Genre',fontsize=15)
plt.xlabel('genres', fontsize=15)
plt.ylabel('budget', fontsize=15)
plt.xticks(rotation=90)
plt.show()

The War genre has the largest average budget, and documentaries are the least expensive.

In [50]:
#Plotting the distribution of ratings per genre using a box plot
plt.figure(figsize=(10,6))
genre_rating=movies_df[['rating','genres']].explode('genres')
sns.boxplot(data=genre_rating, x="genres", y="rating",palette="PuBu")
plt.xlabel('genres', fontsize=15)
plt.ylabel('rating', fontsize=15)
plt.xticks(rotation=90)
plt.show()

Movies in the **Film-Noir** genre have the highest ratings among all genres.

In [51]:
def extract_popular_movies(df1,df2):
    """
    Returns popular movies based on the average rating and the total ratings count.

    Parameters:
    
    df1: DataFrame from train_df
    df2: DataFrame from movies_df
    
    Returns a dataframe of popular movies
   
      
    """
    #Calculating the average rating and storing the results as a DataFrame
    avg_rating= pd.DataFrame(df1.groupby('movieId')['rating'].mean())
    #Calculating the total ratings count per movie
    avg_rating['ratings_count'] = df1.groupby('movieId')['rating'].count()
    #Sorting by ratings count, using the average rating to break ties
    avg_rating=avg_rating.sort_values(by=['ratings_count','rating'],ascending=False).reset_index()
    
    #Joining the Two DataFrames
    join = pd.merge(avg_rating,df2,on ='movieId',how ='inner')
    popular_movies=join[['title','rating_x','ratings_count','release_year']].rename(columns={"rating_x": "rating"})
    
    return popular_movies

In [52]:
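#Create the dataframe "df"
df=extract_popular_movies(train_df,movies_df)

#Extracting latest movies from 2010 to date
#(release_year is stored as text, so it is converted to a number before comparing)
release_year=pd.to_numeric(df['release_year'],errors='coerce')
latest_movies=df[release_year>=2010][['rating','ratings_count','title']]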

In [53]:
#Plotting the latest movies from 2010 to date using a bar-plot
plt.figure(figsize = (10,6))
axes=sns.barplot(y='title', x='ratings_count', data=latest_movies.head(20),color='blue')
axes.set_title('Top 20 popular movies by ratings from 2010',fontsize=15)
plt.xticks(rotation=90)
plt.show()

Interstellar and Django are the most popular movies released since 2010.

In [54]:
#Plotting popular cast using count-plot
plt.figure(figsize = (10,6))
title_cast=imdb_df['title_cast'].explode()
axes=sns.countplot(x=title_cast, order = title_cast.value_counts().index[:20],color='blue')
axes.set_title('Popular Cast',fontsize=15)
plt.xticks(rotation=90)
plt.show()

### Wordcloud of Popular Actors

In [55]:
# Creating a wordcloud of cast names to view the most frequently credited actors
# title_cast holds lists of names (split earlier), so the lists are flattened before joining
cast_corpus = ' '.join(imdb_df['title_cast'].explode().dropna().astype(str))
title_wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', height=2000, width=4000).generate(cast_corpus)

# Plotting the wordcloud
plt.figure(figsize=(16,8))
plt.imshow(title_wordcloud)
plt.axis('off')
plt.show()

In [56]:
test_df.info()

<a id='8'></a>
# 6. Modelling: Recommendation System

## Collaborative Filtering

Collaborative Filtering is the most common technique used when it comes to building intelligent recommender systems that can learn to give better recommendations as more information about users is collected.

Collaborative filtering is a technique that can filter out items that a user might like on the basis of reactions by similar users. It works by searching a large group of people and finding a smaller set of users with tastes similar to a particular user. It looks at the items they like and combines them to create a ranked list of suggestions.
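
To make the idea concrete, here is a minimal sketch of user-based collaborative filtering in plain Python. It is an illustration of the general technique rather than the method used later in this notebook: the toy `ratings` dictionary, the user names and the helper names `cosine_similarity` and `recommend` are all assumptions made up for this example.

```python
import math

# Toy ratings: user -> {movie: rating}. All values are made up for illustration.
ratings = {
    "alice": {"A": 5.0, "B": 4.0, "C": 1.0},
    "bob":   {"A": 4.5, "B": 4.0, "D": 5.0},
    "carol": {"A": 1.0, "C": 5.0, "D": 2.0},
}

def cosine_similarity(u, v):
    """Cosine similarity computed over the movies both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    norm_u = math.sqrt(sum(u[m] ** 2 for m in common))
    norm_v = math.sqrt(sum(v[m] ** 2 for m in common))
    return dot / (norm_u * norm_v)

def recommend(target, ratings, top_n=3):
    """Rank movies the target has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        sim = cosine_similarity(ratings[target], other_ratings)
        for movie, r in other_ratings.items():
            if movie in ratings[target]:
                continue  # skip movies the target has already rated
            scores[movie] = scores.get(movie, 0.0) + sim * r
            weights[movie] = weights.get(movie, 0.0) + sim
    ranked = {m: scores[m] / weights[m] for m in scores if weights[m] > 0}
    return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(recommend("alice", ratings))
```

In this toy data Bob's tastes are much closer to Alice's than Carol's are, so Bob's high rating of movie `D` dominates the weighted score for Alice.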


From the Surprise library, the following algorithms were used:

# Basic algorithms
***NormalPredictor:*** this algorithm predicts a random rating based on the distribution of the training set, which is assumed to be normal.

***BaselineOnly:*** this algorithm predicts the baseline estimate for given user and item.

# Matrix Factorization-based algorithms
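***SVD:*** this matrix factorization algorithm was popularized during the Netflix Prize. When baselines are not used, it is equivalent to Probabilistic Matrix Factorization. It learns latent factors from the ratings of users with similar preferences and uses them to offer recommendations to a particular user.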

***NMF:*** this is a collaborative filtering algorithm based on Non-negative Matrix Factorization. It is very similar to SVD.

***Coclustering:*** is a collaborative filtering algorithm based on co-clustering.

## Correlation Score

In [57]:
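# `mm` is not defined in the cells shown; it is assumed here to be a user x title
# rating matrix built from the first 100,000 training rows (as in the loading cell).
mm = (train_df.head(100000)
      .merge(movies_df[['movieId', 'title']], on='movieId')
      .pivot_table(index='userId', columns='title', values='rating'))
visit_rating = mm['Forrest Gump (1994)']
visit_rating.head()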

In [58]:
visit_similar = mm.corrwith(visit_rating)
visit_corr = pd.DataFrame(visit_similar,columns=['Correlation'])
visit_corr.dropna(inplace=True)
visit_corr.head()

The DataFrame above shows each movie's **correlation score** with Forrest Gump; the movies with the highest scores are the most similar to it.
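
The correlation score can be illustrated with a small, self-contained sketch. The function below computes a Pearson correlation between two movies' rating columns using only the users who rated both, which is the behaviour this sketch assumes for a column-wise correlation on a matrix with missing values. The helper name `pearson` and the toy columns are assumptions made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation over positions where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None  # not enough co-ratings to define a correlation
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return None  # undefined when one movie has constant ratings
    return cov / math.sqrt(var_x * var_y)

# Columns of a toy user x movie matrix (None = not rated); values are made up.
target   = [5.0, 4.0, None, 2.0, 1.0]
similar  = [4.5, 4.0, 3.0, 2.5, None]
opposite = [1.0, 2.0, 5.0, 4.0, 5.0]

print(pearson(target, similar), pearson(target, opposite))
```

A correlation based on only a handful of shared raters can be high by chance, which is why filtering on a minimum number of ratings before ranking is a sensible safeguard.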

In [59]:
visit_corr.sort_values('Correlation',ascending=False).head(10)

In [60]:
visit_corr_count = visit_corr.join(df2['counts'])
visit_corr_count.head()

In [61]:
visit_corr_count[visit_corr_count['counts']>10].sort_values('Correlation',ascending=False).head(10)

In [62]:
train_df.head()

In [63]:
# Loading our trainset

reader = Reader(rating_scale=(train_df['rating'].min(), train_df['rating'].max()))
data = Dataset.load_from_df(train_df[['userId', 'movieId', 'rating']], reader)
trainset, testset = train_test_split(data, test_size=.30, random_state=42)


## SVD

The Singular Value Decomposition algorithm is a matrix factorization technique which reduces the number of features of a dataset. In the matrix structure, each row represents a user and each column represents a movie. The matrix elements are ratings that are given to movies by users.
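
As a rough sketch of what such a factorization learns, the snippet below fits the commonly used model `r_ui ≈ mu + b_u + b_i + p_u · q_i` with stochastic gradient descent in plain Python. It is a simplified illustration under stated assumptions, not the Surprise implementation: the function name `train_mf`, the hyperparameter values and the toy ratings are all made up for this example.

```python
import random

def train_mf(ratings, n_users, n_items, n_factors=2, n_epochs=200,
             lr=0.01, reg=0.02, seed=0):
    """Fit r_ui ~ mu + b_u + b_i + p_u . q_i by stochastic gradient descent."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu = [0.0] * n_users  # user biases
    bi = [0.0] * n_items  # item biases
    # Small random latent factors for users (p) and items (q)
    p = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Regularized gradient steps on the biases
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            # Regularized gradient steps on the latent factors
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return predict

# Toy (user, item, rating) triples; values are made up for illustration.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.5)]
predict = train_mf(ratings, n_users=3, n_items=3)
print(round(predict(0, 2), 2))  # estimate for a user-item pair with no rating
```

The returned `predict` function can score any user and item seen during training, including pairs that have no observed rating.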

In [64]:
# Base algorithm
algo = SVD(n_epochs= 50, init_std_dev=0.02, n_factors=250)

# Fitting our trainset
algo.fit(trainset)
 
# Using the 30% testset to make predictions
predictions = algo.test(testset) 

test = pd.DataFrame(predictions)

svd_rmse = accuracy.rmse(predictions)

In [65]:
#params = {'n_epochs': 50, 'init_std_dev':0.02, 'n_factors':250}
#metrics = {'accuracy' : svd_rmse}
#experiment.log_parameters(params)
#experiment.log_metrics(metrics)

In [66]:
test.head()

In [67]:
pred=pd.DataFrame(predictions)
pred

In [68]:
pred=pred.rename(columns={'uid':'userId', 'iid':'movieId','est':'rating'})
pred.drop(['r_ui','details'],axis=1,inplace=True)

In [69]:
pred.info()

#### NormalPredictor  
The Normal Predictor algorithm predicts a random rating for each movie based on the distribution of the training set, which is assumed to be normal.
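
A minimal sketch of this idea in plain Python: estimate the mean and standard deviation of the training ratings, then draw each prediction from the corresponding normal distribution and clip it to the rating scale. The function names, the use of the population standard deviation and the 0.5 to 5.0 scale are assumptions for illustration, not a copy of the Surprise implementation.

```python
import random
import statistics

def fit_normal_predictor(train_ratings):
    """Estimate the mean and (population) standard deviation of the ratings."""
    mu = statistics.mean(train_ratings)
    sigma = statistics.pstdev(train_ratings)
    return mu, sigma

def predict_random(mu, sigma, rng, low=0.5, high=5.0):
    """Draw a rating from N(mu, sigma^2) and clip it to the rating scale."""
    return min(high, max(low, rng.gauss(mu, sigma)))

# Toy training ratings, made up for illustration
mu, sigma = fit_normal_predictor([3.0, 4.0, 4.0, 5.0])
rng = random.Random(0)
print(mu, [round(predict_random(mu, sigma, rng), 2) for _ in range(3)])
```

Because the prediction ignores both the user and the movie, this kind of model is mainly useful as a lower bound against which the other algorithms are compared.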

In [70]:
np_test = NormalPredictor()
np_test.fit(trainset)
predictions = np_test.test(testset)
# Calculate RMSE
np_rmse = accuracy.rmse(predictions)

In [71]:
#params = {'Parameters': 'default'}
#metrics = {'accuracy' : np_rmse}
#experiment.log_parameters(params)
#experiment.log_metrics(metrics)

#### NMF  
NMF is a collaborative filtering algorithm based on Non-negative Matrix Factorization. The optimization procedure is a (regularized) stochastic gradient descent with a specific choice of step size that ensures non-negativity of factors, provided that their initial values are also positive.

In [72]:
nmf = NMF()
nmf.fit(trainset)
predictions = nmf.test(testset)
# Calculate RMSE
nmf_rmse = accuracy.rmse(predictions)

In [73]:
#params = {'n_components': 'None', 'init': 'None', 'solver':'cd', 'beta_loss':'frobenius', 'tol': 0.0001, 'max_iter':200, 'random_state':'None', 'alpha':'deprecated', 'alpha_W':0.0, 'alpha_H':'same', 'l1_ratio':0.0, 'verbose':0, 'shuffle':'False', 'regularization':'deprecated'}
#metrics = {'accuracy' : nmf_rmse}
#experiment.log_parameters(params)
#experiment.log_metrics(metrics)

#### BaselineOnly  
The Baseline Only algorithm predicts the baseline estimate for a given user and movie. A baseline is calculated using either Stochastic Gradient Descent (SGD) or Alternating Least Squares (ALS).
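
The baseline estimate is commonly written as `b_ui = mu + b_u + b_i`: the global mean rating plus a user bias and a movie bias. The sketch below learns those biases with a simple SGD loop in plain Python. The function names, learning rate, regularization value and toy ratings are assumptions for illustration only.

```python
from collections import defaultdict

def fit_baselines(ratings, n_epochs=40, lr=0.005, reg=0.02):
    """Learn user and item biases around the global mean with SGD."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = defaultdict(float)  # user biases, default 0.0
    bi = defaultdict(float)  # item biases, default 0.0
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
    return mu, bu, bi

def baseline_estimate(mu, bu, bi, user, item):
    """Unknown users or items fall back to a bias of zero."""
    return mu + bu.get(user, 0.0) + bi.get(item, 0.0)

# Toy (user, movie, rating) triples, made up for illustration:
# u1 rates generously, u2 rates harshly.
ratings = [("u1", "m1", 5.0), ("u1", "m2", 4.5),
           ("u2", "m1", 2.0), ("u2", "m2", 1.5)]
mu, bu, bi = fit_baselines(ratings)
print(mu, round(bu["u1"], 3), round(bu["u2"], 3))
```

A generous rater ends up with a positive bias and a harsh rater with a negative one, so the same movie receives different baseline estimates for different users.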

In [74]:
bsl_choice= {'method': 'sgd','n_epochs': 40}
blo = BaselineOnly(bsl_options=bsl_choice)
blo.fit(trainset)
predictions = blo.test(testset)
# Calculate RMSE
blo_rmse = accuracy.rmse(predictions)

In [75]:
#params = {'method': 'sgd','n_epochs': 40}
#metrics = {'accuracy' : blo_rmse}
#experiment.log_parameters(params)
#experiment.log_metrics(metrics)

In [76]:
test.info()

#### Co-Clustering  
The Co-clustering algorithm assigns clusters using a straightforward optimization method, much like k-means.

In [77]:
coc= CoClustering(random_state=42)
coc.fit(trainset)
predictions = coc.test(testset)
# Calculate RMSE
coc_rmse = accuracy.rmse(predictions)

In [78]:
#params = {'random_state': 42}
#metrics = {'accuracy' : coc_rmse}
#experiment.log_parameters(params)
#experiment.log_metrics(metrics)

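In [79]:
# `experiment` is never created in the cells shown (only the Experiment class is imported),
# so these calls are left commented out like the logging cells above.
#experiment.end()

In [80]:
#experiment.display()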

##  Performance Evaluation


We built and tested five different collaborative filtering models and compared their performance using the root mean squared error (RMSE), which is the square root of the average squared difference between the predicted and actual ratings. A lower RMSE indicates a more accurate model.

### Root Mean Squared Error (RMSE):
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{\big(d_i - f_i\big)^2}}$$

where $d_i$ is the actual rating, $f_i$ is the predicted rating and $n$ is the number of predictions.
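
As a small worked example of the metric, the sketch below computes the RMSE by hand for four predictions, taking the plain (unweighted) difference between each actual rating `d_i` and predicted rating `f_i`. The helper name `rmse` and the toy values are assumptions made up for illustration.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean squared difference between two equal-length lists."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(d - f) ** 2 for d, f in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [4.0, 3.0, 5.0, 2.0]      # d_i: true ratings (made up)
predicted = [3.5, 3.0, 4.0, 3.0]   # f_i: model estimates (made up)

# Errors are 0.5, 0, 1 and -1, so the mean squared error is 2.25 / 4 = 0.5625
print(rmse(actual, predicted))  # 0.75
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.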

In [81]:
# Comparing RMSE values between models
fig,axis = plt.subplots(figsize=(8, 5))
rmse_x = ['SVD','NormalPredictor','NMF','BaselineOnly','Co-Clustering']
rmse_y = [svd_rmse,np_rmse,nmf_rmse,blo_rmse,coc_rmse]
ax = sns.barplot(x=rmse_x, y=rmse_y,palette="PuBu",edgecolor='black')
plt.title('RMSE Value Per Collaborative-based Filtering Model',fontsize=14)
plt.xticks(rotation=90)
plt.ylabel('RMSE')
for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2, p.get_y() + p.get_height(), round(p.get_height(),2), fontsize=12, ha="center", va='bottom')
    
plt.show()

## Using the Surprise Library 

In [82]:

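# Here we use Surprise, a library designed for recommender systems, to cross-validate our model.
reader = Reader(rating_scale=(0.5, 5))

# Cross-validate on the observed ratings in train_df; building the dataset from `pred`
# would score the model against its own estimates instead of real ratings.
data = Dataset.load_from_df(train_df[['userId', 'movieId' ,'rating']], reader)

cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)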
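
The cross-validation call above relies on splitting the data into folds. As a minimal sketch of that idea, the generator below shuffles row indices and yields a train and test split for each of `k` folds. The function name `k_fold_indices` and the seed are assumptions for illustration; this is not how the Surprise library implements its splits.

```python
import random

def k_fold_indices(n_samples, k=3, seed=42):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k roughly equal folds
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [idx for j, fold in enumerate(folds)
                     if j != held_out for idx in fold]
        yield train_idx, test_idx

for fold_number, (train_idx, test_idx) in enumerate(k_fold_indices(10, k=3), start=1):
    print(fold_number, len(train_idx), len(test_idx))
```

Each row is held out exactly once, so every rating contributes to both training and evaluation across the folds.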

In [83]:
pred.info()


<a id='10.2'></a>
## Hyperparameter Tuning
 
Hyperparameter tuning is the process of finding the combination of hyperparameters that maximizes model performance. A model's default settings are not guaranteed to be the best choice for a given dataset, so a systematic search over candidate values is usually needed to get the most out of a model.

We decided to tune the SVD model, which was the best performing of the algorithms tested (it had the lowest RMSE value).


In [84]:
# Define search grid
#param_grid = {'n_epochs': [20,30,40,45,50], 'init_std_dev' : [0.01,0.02,0.05], 'n_factors' : [100,150,200,250,300]}

# Instantiate gridsearch instance
#gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=2)

# Run gridsearch
#gs.fit(data)

# best RMSE score
#print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
#print(gs.best_params['rmse'])
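
To give a sense of the cost of the search grid above, this short sketch enumerates every parameter combination with `itertools.product` and counts the model fits implied by 2-fold cross-validation. It only counts combinations and does not train anything; the variable names `combinations` and `n_fits` are assumptions for illustration.

```python
from itertools import product

# Same grid as in the commented-out cell above
param_grid = {
    'n_epochs': [20, 30, 40, 45, 50],
    'init_std_dev': [0.01, 0.02, 0.05],
    'n_factors': [100, 150, 200, 250, 300],
}
cv = 2  # number of cross-validation folds

# One dict per combination of parameter values
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]
n_fits = len(combinations) * cv

print(len(combinations), n_fits)  # 75 150
```

With 5 x 3 x 5 = 75 combinations and 2 folds each, the search trains the model 150 times, which helps explain why an exhaustive grid search on a large ratings dataset is slow.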

In [85]:
# We are converting the userId and movieId into integers as our algorithm converted them into floats
pred.userId=pred['userId'].astype(int)
pred.movieId=pred['movieId'].astype(int)

In [86]:
# Concatenating userId and movieId into a single Id column.
# Building the string column-wise keeps the integer ids intact; a row-wise apply upcasts them
# to floats (e.g. "1.0_2.0") because each row also holds a float rating.

pred['Id']=pred['userId'].astype(str) + '_' + pred['movieId'].astype(str)

In [87]:
predict=pred[['Id','rating']]

<a id='15'></a>
# Submission file

In [92]:
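# Predict a rating for every (userId, movieId) pair in the test set
test_df["rating"] = test_df.apply(
    lambda x: algo.predict(x["userId"], x["movieId"]).est, axis=1
)
# Build the Id column (userId_movieId), assuming test.csv holds only userId and movieId
test_df["Id"] = test_df["userId"].astype(str) + "_" + test_df["movieId"].astype(str)
submission = test_df[["Id", "rating"]]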

submission.to_csv("submissionkaggle.csv", index=False)

In [93]:
submission.info()