# Unsupervised Learning Predict - Movie Recommender System Challenge
© Explore Data Science Academy

---
### Honour Code

We, **InfinityAI** {**#Team_NM3**}, confirm - by submitting this document - that the solutions in this notebook are a result of our own work and that we abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

<a id="cont"></a>

## Table of Contents

#### Section 1: Environment Setup

<a href=#one>1.1 Python Package Setup</a>

<a href=#two>1.2 Comet  Initialization</a>

<a href=#three>1.3 Package Imports</a>

#### Section 2: Data

<a href=#five>2.1 Download of dataset</a>

<a href=#six>2.2 Basic Data Analysis</a>

#### Section 3: Exploratory Data Analysis

<a href=#seven>3.1 Explore user data</a>

<a href=#seven>3.2 Explore movie genre</a>

<a href=#seven>3.3 Explore movie data</a>

<a href=#seven>3.4 Explore imdb data</a>

<a href=#seven>3.5 Explore tags data</a>

<a href=#seven>3.6 Explore publishing years</a>


#### Section 4: Base Model Testing

<a href=#eight>4.1 Cross-Validation Testing</a>

<a href=#nine>4.2  Train-Test-Split</a>

<a href=#ten>4.3   Grid Search</a>

#### Section 5: Model Building

<a href=#twelve>5.1 Fit model to whole dataset</a>

<a href=#thirteen>5.2 Download CSV for Kaggle Competition</a>

<a href=#fourteen>5.3 Pickle model for use in Streamlit</a>

#### Section 6: Collaborative & Content Based Model

<a href=#fifteen>6.1 Collaborative Filtering - Approach I</a>

<a href=#sixteen>6.2 Content Based - Approach II</a>

#### Section 7: Conclusion


# Introduction
Movies have enthralled audiences ever since one-second clips of racing horses emerged in the 1890s, through the introduction of sound in the 1920s and the birth of color in the 1930s, to mainstream 3D movies in the early 2010s. Around the world, movie industries have been blessed with creative geniuses in the form of directors, screenwriters, actors, sound designers and cinematographers. Together with the rise in popularity of portable devices capable of hosting streaming services, movies have ensured that people can stay glued to their favourites whether in transit or in the corners of their homes.

However, the spread into a plethora of genres ranging from romance to comedy to science fiction to horror has created a new problem of information overload, where choice and decision-making for individuals has become quite challenging. 

In today’s technology-driven world, there have been several attempts to solve this problem using recommender systems. These systems are a subclass of information filtering systems that suggest the items most pertinent to a particular user.

Providing an accurate and robust solution to this challenge has immense economic potential for industry clients, with users of the system being exposed to content they would like to view or purchase - generating revenue and platform affinity. 

![IMG_3532.png](attachment:1293c550-39df-42d0-bd59-6aba283d8dc9.png)

In this notebook, the **InfinityAI** team draws insights from the data that can be used to develop a few recommender systems. The team explores eight datasets covering more than 48000 movies and over 160000 users, with up to 15 million data points containing movie ratings, genres, keywords and so on, collected from the Explore Data Science Academy (EDSA) and the MovieLens datasets. Using these datasets, the team attempts to answer various questions about movies. We are:

 > Josiah Aramide <br>
 > Bongani Mavuso <br>
 > Ndinannyi Mukwevho <br>
 > Aniedi Oboho-Etuk <br>
 > Manoko Langa <br>
 > Tshepiso Padi <br>
 > Nsika Masondo <br>
 

### Problem Statement

EXPLORE AI (the client) is determined to improve its recommender system service for targeted consumer categories based on their movie content ratings.

Historical viewing data available to the company contains movie preferences (ratings) and similarity characteristics (movie attributes) that can support accurate prediction of consumer behaviour.

By constructing a recommendation algorithm based on content or collaborative filtering, the InfinityAI team can develop a solution capable of accurately predicting how a user will rate a movie they have not yet viewed, based on their historical preferences. This solution will give the company access to immense economic opportunities and support customer retention, with users of the system being exposed to content they would like to view or purchase. Additionally, this easy-to-deploy solution will deliver an increase in customer click-through rate, generating revenue and platform affinity.

![movie_recommender_system.jpg](attachment:65060f6b-9d0c-4d92-bcfa-a43baca1acd6.jpg)


### Objectives

**InfinityAI** seeks to achieve the following objectives for the project brief:

1. analyse the supplied data for interesting insights on movies;
2. identify underlying patterns and standardize the data;
3. determine if additional features can be added to enrich the data set;
4. build a recommendation algorithm based on content or collaborative filtering that is capable of accurately predicting how a user will rate a movie they have not yet viewed;
5. evaluate the accuracy of the best machine learning model; and
6. explain the inner workings of the model to a non-technical audience.

# Section 1: Environment Setup

Let's start by installing and importing required packages/libraries.

<a id="one"></a>
## 1.1 Python Package Installation
<a href=#cont>Back to Table of Contents</a>

To run this notebook, install the following root packages on your local machine:

- matplotlib
- nltk
- numpy
- pandas
- plotly
- scikit-learn
- seaborn
- surprise
- comet_ml

In [6]:
# --> uncomment these lines below if the dependent code cells do not run

#!pip install comet_ml # Comet installation for Jupyter Notebook/Colab
#!pip install git+https://github.com/microsoft/recommenders.git
#!pip install kneed # knee (/elbow) point detection for cluster optimization
#!pip install tf_slim
#!pip install downcast
#!pip install scikit-surprise
#!pip install ipython-autotime
#!pip install surprise

 <a id="one"></a>
## 1.2 Comet Initialization
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

Because Comet hooks into other machine learning packages automatically to track certain features, it must be one of the first packages imported at the top of the notebook.

In [None]:
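# Comet must be imported before the other machine learning packages
from comet_ml import Experiment

# Create an instance of Comet experiment with TeamNM3's API key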
experiment = Experiment(
    api_key="RpnzF8DcMSor3mXqAfEQqsXjv",
    project_name="unsupervised-learning-predict",
    workspace="teamnm3",
)

 <a id="one"></a>
## 1.3 Importing Packages
<a class="anchor" id="1.2"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| Below are the libraries and tools imported for use in this project. |

The libraries include:
- **numpy**: for working with arrays,
- **pandas**: for transforming and manipulating data in tables,
- **matplotlib**: for creating interactive visualisations,
- **seaborn**: for making statistical graphs and plots,
- **scikit-learn**: for machine learning and statistical modeling, and
- **math**: for algebraic notations and calculations.

---

In [4]:
# Libraries for data loading, data manipulation and data visualisation
# Import our regular old heroes 
import numpy as np
import pandas as pd
import datetime
import time
import re
import scipy as sp # <-- The sister of Numpy, used in our code for numerical efficiency.
import matplotlib.pyplot as plt
import seaborn as sns
from plotly.offline import init_notebook_mode, iplot, plot
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.offline as pyo

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler # for standardization 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA
from sklearn import (manifold, datasets, decomposition, ensemble,
                     discriminant_analysis, random_projection, preprocessing)

from sklearn import preprocessing as pp
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import roc_curve, auc, roc_auc_score, mean_squared_error

# Additional packages
import warnings
from collections import OrderedDict
from datetime import date
#from comet_ml import Experiment

# Entity featurization and similarity computation
# (cosine_similarity and TfidfVectorizer are already imported above)
from wordcloud import WordCloud, STOPWORDS

# Libraries used during sorting procedures.
import operator # <-- Convenient item retrieval during iteration 
import heapq # <-- Efficient sorting of large lists

from surprise import (
    NMF,
    SVD,
    BaselineOnly,
    CoClustering,
    Dataset,
    KNNBasic,
    NormalPredictor,
    Reader,
    SlopeOne,
    SVDpp,
)
from surprise import accuracy
# Note: this train_test_split replaces the sklearn function of the same name imported above
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split

# Suppress warnings for our sanity (warnings is already imported above)
warnings.filterwarnings('ignore')

# Section 2: Data

<a id="two"></a>
## 2.1 Download of dataset
<a class="anchor" id="1.2"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section, the data made available to **TeamNM3** by the client, **Explore-AI**, is loaded from **Kaggle**. This involves reading the data from the `.csv` file format into a Pandas dataframe. The Pandas dataframe allows for easy viewing and manipulation of the data in tabular form and can be combined with other Python libraries like numpy. |

---

In [5]:
# Import Data

# Kaggle base path
#base_path = "../input/edsa-recommender-system-predict/"
base_path = "/kaggle/input/edsa-movie-recommendation-predict/"

# # Local base path
# base_path = "../../edsa-recommender-predict/"

df_ratings = pd.read_csv(base_path + "train.csv")
df_movies = pd.read_csv(base_path + "movies.csv")
df_imdb = pd.read_csv(base_path + "imdb_data.csv")

df_genome_scores = pd.read_csv(base_path + "genome_scores.csv")
df_genome_tags = pd.read_csv(base_path + "genome_tags.csv")
df_links = pd.read_csv(base_path + "links.csv")
df_tags = pd.read_csv(base_path + "tags.csv")
df_rating = pd.read_csv('/kaggle/input/movie-lens-small-latest-dataset/ratings.csv')

df_test = pd.read_csv(base_path + "test.csv")
sample_submission = pd.read_csv(base_path + "sample_submission.csv")


<a id="three"></a>
## 2.2 Basic Data Analysis 
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

In this section we perform a basic analysis of the data in the various CSVs to develop an understanding of the data we're able to work with. We conclude this basic data analysis by combining all the data into one dataframe and then continuing into a more in-depth analysis.

### Ratings DataFrame

We begin this basic data analysis by examining the ratings dataframe.

In [None]:
# Display top 5 rows of dataframe
df_ratings.head()

In [None]:
# Gather information about the dataframe
df_ratings.info()

In [None]:
# Check if dataframe has any null values
df_ratings.isnull().sum()

We then create a dataframe to display the count and percentage of each rating value in the dataset.

In [None]:
# Determining number of rows for each rating value
rows_rating = df_ratings["rating"].value_counts()
rows_df_rating = pd.DataFrame({"rating": rows_rating.index, "Rows": rows_rating.values})

# Determining percentage of rows for each rating value
percentage_rating = round(df_ratings["rating"].value_counts(normalize=True) * 100, 2)
percentage_df_rating = pd.DataFrame(
    {"rating": percentage_rating.index, "Percentage": percentage_rating.values}
)

# Joining row and percentage information
ratings_df_distribution = pd.merge(
    rows_df_rating, percentage_df_rating, on="rating", how="outer"
)
ratings_df_distribution.set_index("rating", inplace=True)
ratings_df_distribution.sort_index(axis=0)

In the dataframe above we can see that 4.0 is the most common score, accounting for 26.53% of the ratings in the dataframe.
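
The same count-and-percentage table can also be derived from a single `value_counts()` call. The snippet below is a minimal sketch on made-up ratings; `toy_ratings` and `toy_distribution` are hypothetical names standing in for `df_ratings["rating"]` and the table above, not part of the notebook.

```python
import pandas as pd

# Toy ratings standing in for df_ratings["rating"] (illustrative values only)
toy_ratings = pd.Series([4.0, 4.0, 3.0, 5.0, 3.5, 4.0, 3.0, 0.5])

# Count each rating value once, then derive the percentage from the counts
counts = toy_ratings.value_counts().sort_index()
toy_distribution = pd.DataFrame(
    {"Rows": counts, "Percentage": (counts / counts.sum() * 100).round(2)}
)
toy_distribution.index.name = "rating"
print(toy_distribution)
```

Deriving both columns from one counting pass avoids the merge step, which may matter on a dataframe with millions of rows.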

We visualize the data below:

In [None]:
pyo.init_notebook_mode()
init_notebook_mode(connected=True)
data = df_ratings["rating"].value_counts().sort_index(ascending=False)

# Plot data
trace = go.Bar(
    x=data.index,
    text=["{:.1f} %".format(val) for val in (data.values / df_ratings.shape[0] * 100)],
    textposition="auto",
    textfont=dict(color="#000000"),
    y=data.values,
)

# Create layout
layout = dict(
    title="Distribution Of {} ratings".format(df_ratings.shape[0]),
    xaxis=dict(title="Rating"),
    yaxis=dict(title="Count"),
)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
pyo.iplot(fig)

We see that half scores (0.5, 1.5, 2.5, 3.5 and 4.5) are less commonly used than integer score values. We don't know whether this is because users prefer to rate movies with integer values or because half scores were introduced after the original scoring system was already in use, leading to a lower volume in a dataset with ratings from 1995. We investigate this further by checking which years recorded half-score ratings:

In [None]:
# Create list of date objects
rating_date_list = [
    date.fromtimestamp(timestamp) for timestamp in list(df_ratings["timestamp"])
]

# Create year column
df_ratings["review_year"] = [date_item.year for date_item in rating_date_list]
years_with_half_scores = df_ratings[
    df_ratings["rating"].isin([0.5, 1.5, 2.5, 3.5, 4.5])
]["review_year"]
unique_years_with_half_scores = set(years_with_half_scores)
print(
    "There are {} years with half scores. \nThey are: {}.".format(
        len(unique_years_with_half_scores), sorted(list(unique_years_with_half_scores))
    )
)

We can see that no half-score ratings were recorded before 2003.
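
The year extraction can also be done on the whole column at once rather than through a Python list. The following is a sketch on made-up timestamps (the `toy` dataframe is hypothetical). One assumption to keep in mind: `pd.to_datetime(..., unit="s")` interprets the values as UTC, whereas `date.fromtimestamp` uses the local time zone, so ratings made close to New Year could land in a different year.

```python
import pandas as pd

# Toy Unix timestamps in seconds (illustrative, not taken from train.csv)
toy = pd.DataFrame(
    {
        "rating": [4.0, 3.5, 5.0],
        "timestamp": [1_000_000_000, 1_100_000_000, 1_200_000_000],
    }
)

# Convert the whole column at once; the values are interpreted as UTC
toy["review_year"] = pd.to_datetime(toy["timestamp"], unit="s").dt.year

# A rating is a half score when it is not a whole number
is_half = toy["rating"] % 1 != 0
half_score_years = sorted(toy.loc[is_half, "review_year"].unique())
print(half_score_years)
```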

We'll check what share of the ratings from 2003 onwards are half scores:

In [None]:
# Restrict the denominator to reviews from 2003 onwards, when half scores first appear
all_scores_after_2003 = (df_ratings["review_year"] >= 2003).sum()
number_of_half_scores = len(years_with_half_scores)
print(
    "The percentage of reviews with half scores in the data from 2003 onwards is {:.2%}".format(
        number_of_half_scores / all_scores_after_2003
    )
)

We can see that half scores are not as popular as integer scores.


Now we examine the user data:

In [None]:
# Find the total number of users and movies and the number of unique users 
# and movies
unique_users = df_ratings["userId"].nunique()
total_users = len(df_ratings["userId"])
unique_movies = df_ratings["movieId"].nunique()
total_movies = len(df_ratings["movieId"])

# Display these values
print(
    "Total number of unique users: \t{} \n"
    "Total number of unique movies: \t{}\n"
    "Percentage of unique users: \t{:.2%}\n"
    "Percentage of unique movies: \t{:.2%}".format(
        unique_users,
        unique_movies,
        unique_users / total_users,
        unique_movies / total_movies,
    )
)

Both percentages are low because they are measured against the total number of ratings: each user and each movie appears in many rows.
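
A related quantity is the share of all possible user-movie pairs that actually carry a rating. The sketch below assumes the row, user and movie counts reported in the summary of this dataframe; the name `density` is ours and does not appear elsewhere in the notebook.

```python
# Figures quoted in the summary of the ratings dataframe
n_ratings = 10_000_038
n_users = 162_541
n_movies = 48_213

# Share of all possible user-movie pairs that actually carry a rating
density = n_ratings / (n_users * n_movies)
print(f"Density of the user-movie matrix: {density:.4%}")
```

A small value here means most users have rated only a small fraction of the catalogue, which is the situation collaborative filtering models have to cope with.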


We now summarise what we have learned about the ratings dataframe.

**Summary of the basic analysis of the Ratings DataFrame**

The ratings dataframe consists of 10'000'038 rows and 4 columns (userId, movieId, rating and timestamp).

Ratings range from 0.5 to 5 in increments of 0.5. The most common rating is 4.0, which accounts for close to 27% of all ratings.

The data contains 162'541 unique users, and 48'213 unique movies were rated.

Integer scores appear favoured over half scores, which were only introduced in 2003.

**Movies DataFrame**

Here we perform a basic analysis of the movies dataframe. We begin this analysis by generating the head of the dataframe below:

In [None]:
# Display top 5 rows of dataframe
df_movies.head()

In [None]:
# Gather information about the dataframe
df_movies.info()

Let's examine the number of unique values in the dataset:

In [None]:
# Information regarding number of unique values in each column:
print(
    "Total number of unique movie IDs: \t{}\n"
    "Total number of unique movie titles: \t{}\n"
    "Total number of unique movie genres: \t{}\n".format(
        df_movies["movieId"].nunique(),
        df_movies["title"].nunique(),
        df_movies["genres"].nunique(),
    )
)
print(
    "There are {} movies with the same name.".format(
        df_movies["movieId"].nunique() - df_movies["title"].nunique()
    )
)

There are fewer unique movie titles than unique IDs. We know that the difference between these two numbers is equal to the number of movies with the exact same name, which we see is 98. We will keep this in mind while building the recommender system.
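
To see which titles are affected, the repeated rows can be listed directly. This is a minimal sketch on a made-up dataframe; `toy_movies` and its contents are hypothetical and only mirror the `movieId` and `title` columns of `df_movies`.

```python
import pandas as pd

# Toy stand-in for df_movies (titles and IDs are illustrative)
toy_movies = pd.DataFrame(
    {
        "movieId": [1, 2, 3, 4],
        "title": ["Hamlet (1996)", "Heat (1995)", "Hamlet (1996)", "Emma (1996)"],
    }
)

# keep=False flags every row whose title occurs more than once
repeated = toy_movies[toy_movies["title"].duplicated(keep=False)]
print(repeated.sort_values("title"))

# Same quantity as the notebook cell: unique IDs minus unique titles
n_extra = toy_movies["movieId"].nunique() - toy_movies["title"].nunique()
```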

We explore the genres column, first by finding the top 20 and lowest 20 genres by volume:

In [None]:
# Top 20 genres by volume:
df_genres = df_movies["genres"].value_counts()
df_genres.head(20)

In [None]:
# Bottom 20 genres by volume:
df_genres = df_movies["genres"].value_counts()
df_genres.tail(20)

We can see that the top genres by volume only have one or two genre types, whereas the bottom genres consist of multiple genres.

**Summary of the basic analysis of the Movies DataFrame**

The movies dataframe contains 62'423 rows and 3 columns (movieId, title and genres). 98 of the rows have duplicate titles. 1'639 unique genres are listed, which includes combination genres. 5'062 movies do not have a genre listed, and the 3 most popular genres are Drama, Comedy and Documentary.

Next we move to the IMDB dataset.

**IMDB Data**

Here we explore the IMDB data to learn more about the content of the movies and the people who worked on them.

We begin by examining the dataframe.

In [None]:
# Display top 5 rows of dataframe
df_imdb.head()

In [None]:
# Gather information about the dataframe
df_imdb.info()

Here we see that there are only 27278 movies in this dataframe, fewer than the 48213 movies in the ratings dataframe.

Next we check for missing data:

In [None]:
# Find percentage of missing values in each column
columns = df_imdb.columns
percent_missing_values = df_imdb.isnull().sum() / len(df_imdb.index) * 100
df_missing_values = pd.DataFrame(
    {"column_name": columns, "percent_missing": percent_missing_values}
)
df_missing_values

Here we see many columns are missing data.
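
The same percentages can be obtained more compactly, because the mean of a boolean mask is the fraction of `True` values. The sketch below uses a made-up dataframe (`toy_imdb` is hypothetical and only imitates a few `df_imdb` columns).

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_imdb with deliberately missing entries
toy_imdb = pd.DataFrame(
    {
        "movieId": [1, 2, 3, 4],
        "director": ["A", None, "B", None],
        "budget": [np.nan, np.nan, np.nan, "$1,000"],
    }
)

# The mean of a boolean mask is the fraction of True values
percent_missing = (toy_imdb.isnull().mean() * 100).round(1)
print(percent_missing.sort_values(ascending=False))
```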

Next we examine the number of unique items in each column, keeping in mind that all columns except for the movie ID column are missing data:

In [None]:
# Information regarding number of unique values in certain column:
print(
    "Total number of unique movie IDs: \t{}\n"
    "Total number of unique title casts: \t{}\n"
    "Total number of unique directors: \t{}\n"
    "Total number of unique plot keywords: \t{}".format(
        df_imdb["movieId"].nunique(),
        df_imdb["title_cast"].nunique(),
        df_imdb["director"].nunique(),
        df_imdb["plot_keywords"].nunique(),
    )
)

Next we check the dataframe for duplicated movies:

In [None]:
df_imdb[df_imdb["movieId"].duplicated()]

Let's examine the most common cast members, directors, and plot keywords by volume in the dataset.

In [None]:
# Top 5 title cast members by volume:
df_cast = df_imdb["title_cast"].value_counts()
df_cast.head()

In [None]:
# Top 5 directors by volume:
df_directors = df_imdb["director"].value_counts()
df_directors.head()

In [None]:
# Top 5 plot keywords by volume:
df_keywords = df_imdb["plot_keywords"].value_counts()
df_keywords.head()

**Summary of the basic analysis of the IMDB DataFrame**

The IMDB Dataframe has 27'278 rows and 6 columns (movieId, title_cast, director, runtime, budget and plot_keywords). MovieId is the only column that doesn't have any null values. All the other columns have at least 36% missing values, with the budget column having the highest percentage of null values at 71%. No movieIds are duplicated.

Luc Besson, Woody Allen and Stephen King are the 3 directors that appear most often in this dataset.

The most popular plot keywords are "documentary", "action", and "f rated".

**Genome Scores DataFrame**

Here we explore the genome scores data. This dataset contains scores that measure the relevance of a tag to a movie.

In [None]:
# Display top 5 rows of dataframe
df_genome_scores.head()

In [None]:
# Gather information about the dataframe
df_genome_scores.info()

In [None]:
# Information regarding number of unique values in each column:
print(
    "Total number of unique movie IDs: \t" + str(df_genome_scores["movieId"].nunique())
)
print("Total number of unique tag IDs: \t" + str(df_genome_scores["tagId"].nunique()))

**Summary of the basic analysis of the Genome Scores DataFrame**

The Genome Scores Dataframe has 15'584'448 rows and 3 columns (movieId, tagId and relevance). There are 13'816 unique movie ids and 1'128 unique tag ids.

**Genome Tags DataFrame**

Here we explore the tag data. These tags are assigned by a user.

In [None]:
# Display top 5 rows of dataframe
df_genome_tags.head()

In [None]:
# Gather information about the dataframe
df_genome_tags.info()

In [None]:
# Information regarding number of unique values in each column:
print("Total number of unique Tags: \t\t" + str(df_genome_tags["tag"].nunique()))
print("Total number of unique Tag IDs: \t" + str(df_genome_tags["tagId"].nunique()))

**Summary of the basic analysis of the Genome Tags DataFrame**

The Genome Tags Dataframe has 1'128 rows and 2 columns (tagId and tag). All values are unique.

**Links DataFrame**

Here we explore the links dataframe. According to Kaggle, this data links each MovieLens ID to the IMDB and TMDB IDs associated with it.

In [None]:
# Display top 5 rows of dataframe
df_links.head()

In [None]:
# Gather information about the dataframe
df_links.info()

In [None]:
print("Total number of unique Movie Ids: \t" + str(df_links["movieId"].nunique()))
print("Total number of unique IMDB Ids: \t" + str(df_links["imdbId"].nunique()))
print("Total number of unique TMDB Ids: \t" + str(df_links["tmdbId"].nunique()))

In [None]:
df_links.isnull().sum()

In [None]:
# Check how many tmdb IDs are duplicated:
df_tmdb = df_links[df_links["tmdbId"].duplicated()]
tmdb_total = df_tmdb["tmdbId"].value_counts().sum()
print("Total number of tmdb Ids duplicated: \t" + str(tmdb_total))
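
One detail worth knowing when reading this count: `duplicated()` treats repeated `NaN` values as duplicates of each other, while `value_counts()` drops `NaN` by default, so the total above counts only real IDs. The sketch below illustrates this on a made-up column; `toy_links` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy tmdbId column with one repeated ID and two missing values
toy_links = pd.DataFrame({"tmdbId": [10.0, 20.0, 20.0, np.nan, np.nan]})

# duplicated() treats repeated NaN values as duplicates of each other
flagged = toy_links["tmdbId"].duplicated()
n_flagged = int(flagged.sum())

# value_counts() drops NaN by default, so only real IDs are counted here
n_real_duplicates = int(toy_links.loc[flagged, "tmdbId"].value_counts().sum())
print(n_flagged, n_real_duplicates)
```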

**Summary of the Links DataFrame**

The links dataframe has 62'423 rows and 3 columns (movieId, imdbId and tmdbId). There are 62'423 unique movie and IMDB IDs, which corresponds to the number of rows in the dataframe. 107 of the tmdbId values are null and 35 are duplicates.

**Test and sample submission Data**

Here we briefly explore the data provided by Kaggle that users use for submissions.

In [None]:
# Display top 5 rows of dataframe
df_test.head()

In [None]:
# Gather information about the dataframe
df_test.info()

In [None]:
# Display top 5 rows of dataframe
sample_submission.head()

In [None]:
# Gather information about the dataframe
sample_submission.info()

In [None]:
sample_submission[sample_submission["Id"].duplicated()]

**Summary of the test and sample submission Data**

Both dataframes have 5'000'019 rows and no IDs are duplicated.

# Section 3: Exploratory Data Analysis

<a id="three"></a>
## 3.1 Exploring user data  
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

In this section we aim to explore the data specific to the users who contributed ratings.

We start this EDA by generating summary statistics for the rating values:

In [None]:
# Generate summary statistics
summary_statistics = df_ratings[["rating"]].describe().round(2)
summary_statistics

We see that the average is 3.53, which seems sensible for movie reviews with a maximum score of 5 and a minimum score of 0.5.

We find the number of ratings submitted by each user:

In [None]:
# To find the number of ratings per user, we create a dataframe with the count by userId.
# rename_axis / reset_index(name=...) yields the same columns on pandas 1.x and 2.x.
df_user = (
    df_ratings['userId']
    .value_counts()
    .rename_axis('userId')
    .reset_index(name='count')
)
df_user.head()

We now aggregate user data by user IDs to get the average ratings.

In [None]:
df_aggregated = (
    df_ratings[["userId", "rating"]].groupby("userId").agg(["count", "mean"])
)
df_aggregated.head(5)

We now group users in ranges and visualize the results

In [None]:
# Grouping the users within a certain range aided us in determining the common userId's and the new ones.
group_one = df_user.loc[(df_user['count'] > 0) & 
            (df_user['count'] < 50),
            'userId'].value_counts().sum()
group_two = df_user.loc[(df_user['count'] >= 50) & 
            (df_user['count'] < 500),
            'userId'].value_counts().sum()
group_three = df_user.loc[(df_user['count'] >= 500) & 
            (df_user['count'] < 1000),
            'userId'].value_counts().sum()
group_four = df_user.loc[(df_user['count'] >= 1000) & 
            (df_user['count'] < 1500),
            'userId'].value_counts().sum()
group_five = df_user.loc[(df_user['count'] >= 1500),
            'userId'].value_counts().sum()
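
The same grouping can be expressed with `pd.cut`, which assigns each count to a bin in one step. This is a sketch on made-up counts, assuming the same bin edges as the cell above; `toy_counts` is hypothetical and stands in for `df_user['count']`.

```python
import numpy as np
import pandas as pd

# Toy per-user rating counts (illustrative values only)
toy_counts = pd.Series([1, 49, 50, 499, 500, 1000, 1500, 2000])

# Left-closed bins reproduce the ">= lower and < upper" conditions
bins = [1, 50, 500, 1000, 1500, np.inf]
labels = ["group_one", "group_two", "group_three", "group_four", "group_five"]
groups = pd.cut(toy_counts, bins=bins, labels=labels, right=False)

# Number of users that fall into each group
group_sizes = groups.value_counts(sort=False)
print(group_sizes)
```

Keeping the bin edges in a single list makes it easier to experiment with different ranges than editing five separate filters.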

We then visualize the size of each group:

In [None]:
# To give us insight into the spread, we collect the group sizes in a dataframe.
# Building it from a dict (rather than a mixed-type array) keeps the counts numeric.
df_trial_error = pd.DataFrame({
    'group': ['group_one', 'group_two', 'group_three',
              'group_four', 'group_five'],
    'userId_grouping': [group_one, group_two, group_three,
                        group_four, group_five],
    'explanation': ['between 1 and 50', 'between 50 and 500',
                    'between 500 and 1000', 'between 1000 and 1500',
                    'greater than 1500']})
fig = px.bar(df_trial_error,
             x="group",
             y="userId_grouping",
             color="group",
             title='Grouped Rating Distribution')
fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=1.02,
    xanchor="right",
    x=1
))
fig.show()
df_trial_error

The user IDs are grouped by rating count into the ranges shown in the DataFrame above. The Grouped Rating Distribution bar graph shows that the distribution is highly unequal: it is concentrated on the left (right-skewed), with the majority of the user IDs in the rating count range between 1 and 50. The last group has a value count of only 61, far below group one with a value count of 110'010.

In [None]:
def user_ratings_count(df, n):
    plt.figure(figsize=(14,7))
    data = df['userId'].value_counts().head(n)
    ax = sns.barplot(x = data.index, y = data, order= data.index, palette='CMRmap', edgecolor="black")
    for p in ax.patches:
        ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()), fontsize=11, ha='center', va='bottom')
    plt.title(f'Top {n} Users by Number of Ratings', fontsize=14)
    plt.xlabel('User ID')
    plt.ylabel('Number of Ratings')
    plt.show()

In [None]:
user_ratings_count(df_ratings,10)

From the graph above we can see there is one outlier with close to thirteen thousand ratings.

We filter out user 72315, whose number of ratings is extreme enough to make them an outlier:

In [None]:
user_ratings_count(df_ratings[df_ratings['userId'] !=72315],10)

<a id="three"></a>
## 3.2 Exploring movie genre  
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

In [None]:
genres = pd.DataFrame(df_movies['genres'].
                      str.split("|").
                      tolist(),
                      index=df_movies['movieId']).stack()
genres = genres.reset_index([0, 'movieId'])
genres.columns = ['movieId', 'Genre']
genres.head()
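
An alternative way to get one row per movie-genre pair is `DataFrame.explode`. The snippet is a sketch on made-up data, not the approach used in the cell above; `toy_movies` and `toy_genres` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for df_movies with pipe-separated genres
toy_movies = pd.DataFrame(
    {
        "movieId": [1, 2, 3],
        "genres": ["Adventure|Comedy", "Drama", "Comedy|Drama|Romance"],
    }
)

# Split the pipe-separated string into a list, then emit one row per element
toy_genres = (
    toy_movies.assign(Genre=toy_movies["genres"].str.split("|"))
    .explode("Genre")[["movieId", "Genre"]]
    .reset_index(drop=True)
)
print(toy_genres)
```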

In [None]:
fig, ax = plt.subplots(figsize=(14, 7))
sns.countplot(x='Genre',
              data=genres,
              palette='CMRmap',
              order=genres['Genre'].
              value_counts().index)
plt.xticks(rotation=90)
plt.xlabel('Genre', size=20)
plt.ylabel('Count', size=20)
plt.title('Distribution of Movie Genres', size=25)
plt.show()

We can see that the top 3 most popular movie genres are Drama, Comedy and Thriller.

<a id="three"></a>
## 3.3 Exploring movie data  
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

In [None]:
# Merging ratings with movies data

movies=pd.merge(df_ratings, df_movies,on='movieId',how='inner')
movies.head()

In [None]:
# Merging movies and imdb data

full_movies = pd.merge(movies,df_imdb,on='movieId',how='inner')
full_movies.head()
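
Because both merges use `how='inner'`, any rating whose `movieId` has no match in the other table is dropped silently. The sketch below, on made-up data, shows one way to measure how many rows would be lost before committing to the inner join; the toy dataframes are hypothetical.

```python
import pandas as pd

# Toy stand-ins for df_ratings and df_movies; movieId 30 has no metadata
toy_ratings = pd.DataFrame(
    {"userId": [1, 1, 2], "movieId": [10, 20, 30], "rating": [4.0, 3.5, 5.0]}
)
toy_movies = pd.DataFrame({"movieId": [10, 20], "title": ["A (1995)", "B (1999)"]})

# A left join with indicator=True labels rows that found no match;
# validate checks that each movieId appears only once on the right side
checked = pd.merge(
    toy_ratings, toy_movies, on="movieId", how="left",
    validate="many_to_one", indicator=True,
)
n_unmatched = int((checked["_merge"] == "left_only").sum())

# The inner join used in the notebook keeps only the matched rows
inner = pd.merge(toy_ratings, toy_movies, on="movieId", how="inner")
print(n_unmatched, len(toy_ratings) - len(inner))
```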

In [None]:
def top_n_plot_by_ratings(df,column, n):
    plt.figure(figsize=(14,7))
    data = df[str(column)].value_counts().head(n)
    ax = sns.barplot(x = data.index, y = data, order= data.index, palette='CMRmap', edgecolor="black")
    for p in ax.patches:
        ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()), fontsize=11, ha='center', va='bottom')
    plt.title(f'Top {n} {column.title()} by Number of Ratings', fontsize=14)
    plt.xlabel(column.title())
    plt.ylabel('Number of Ratings')
    plt.xticks(rotation=90)
    plt.show()

In [None]:
# top 15 most rated movies

top_n_plot_by_ratings(movies,'title',15)

In the Top 15 Title by Number of Ratings bar graph, all the movies were released prior to 2001, with 14 of them released in the 1900s.

The top three are The Shawshank Redemption (1994), Forrest Gump (1994) and Pulp Fiction (1994). All three movies fall under the popular drama genre and are American.

In [None]:
# Wordcloud of movie titles
movies_word = df_movies['title'] = df_movies['title'].astype('str')
movies_wordcloud = ' '.join(movies_word)
title_wordcloud = WordCloud(stopwords = STOPWORDS,
                            background_color = 'White',
                            height = 1200,
                            width = 900).generate(movies_wordcloud)
plt.figure(figsize = (14,7), facecolor=None)
plt.imshow(title_wordcloud)
plt.axis('off')
plt.title('Distribution of words from movie titles')
plt.tight_layout(pad=0)
plt.show()

In [None]:
# Show the most common rating values

top_n_plot_by_ratings(movies,'rating',10)

The most common rating score is 4.0, followed by 3.0. The least common score given by users is 0.5.

In [None]:
# show director's ratings

top_n_plot_by_ratings(full_movies,'director',15)

Quentin Tarantino is the top director by number of ratings.

In [None]:
# Name the columns explicitly so the cell behaves the same on pandas 1.x and 2.x
movieRatingDistGroup = (
    df_ratings['rating'].value_counts().sort_index()
    .rename_axis('rating').reset_index(name='count')
)
fig, ax = plt.subplots(figsize=(14,7))
sns.barplot(data=movieRatingDistGroup, x='rating', y='count', palette="CMRmap", edgecolor="black", ax=ax)
ax.set_xlabel("Rating")
ax.set_ylabel('Number of Ratings')
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()])
total = float(movieRatingDistGroup['count'].sum())
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2., height+350, '{0:.2%}'.format(height/total), fontsize=11, ha="center", va='bottom')
plt.title('Number of Ratings Per Rating Value', fontsize=14)
plt.show()

Most ratings fall within the score range of 3.0 - 5.0, with 4.0 being the most common score, accounting for 26.53% of all ratings.

<a id="three"></a>
## 3.4 Exploring imdb data  
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

In [None]:
def count_directors(df, count = 10):
    """
    Function to count the most common directors in a DataFrame:
    Parameters
    ----------
        df (DataFrame): input dataframe containing imdb metadata
        count (int): filter directors with fewer than count films
        
    Returns
    -------
        directors (DataFrame): output DataFrame
    Examples
    --------
        >>> df = pd.DataFrame({'imdbid':[0,1,2,3,4], 'director': ['A','B','A','C','B']})
        >>> count_directors(df, count = 1)
            |index|director|count|
            |0|A|2|
            |1|B|2|
            |2|C|1|
    """
    directors = pd.DataFrame(df['director'].value_counts()).reset_index()
    directors.columns = ['director', 'count']
    # Keep only directors with at least `count` movies, otherwise we would have to analyze 11000 directors
    directors = directors[directors['count']>=count]
    return directors.sort_values('count', ascending = False)

In [None]:
def feature_count(df, column):
    plt.figure(figsize=(14,7))
    ax = sns.barplot(x = df[f'{column}'], y= df['count'], palette='brg')
    for p in ax.patches:
        ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()), fontsize=11, ha='center', va='bottom')
    plt.title(f'Number of Movies Per {column}', fontsize=14)
    plt.xlabel(f'{column}')
    plt.ylabel('Count')
    plt.xticks(rotation=90)
    plt.show()

In [None]:
# shows number of movies per director

directors = count_directors(df_imdb)
feature_count(directors.head(15), 'director')

In the Number of Movies Per Director bar graph, the leading director has 28 movies. Luc Besson and Woody Allen are tied at 26 movies each, followed by Stephen King with 24. They are the only directors in the dataset with more than 20 movies.

<a id="three"></a>
## 3.5 Exploring tags  
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

First we create a word cloud of tags:

In [None]:
comment_words = ''
stopwords = set(STOPWORDS)

# iterate through the csv file
for val in df_tags['tag']:

    # typecast each val to string
    val = str(val)

    # split the value
    tokens = val.split()

    # Converts each token into lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()

    comment_words += " ".join(tokens)+" "
  
wordcloud = WordCloud(width=1200, height=900,
                      colormap='winter',
                      background_color='white',
                      stopwords=stopwords,collocations=False,
                      min_font_size=10).generate(comment_words)

# plot the WordCloud image
plt.figure(figsize=(14, 7), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.title('Distribution of words in the tags data frame by Tags')
plt.tight_layout(pad=0)

plt.show()

We can see from the word cloud that the most common words in tags were 'Comedy','book','War' and 'Dark'.

In [None]:
# Create a dataframe of tags and their counts

value_count = (
    df_tags['tag'].value_counts()
    .rename_axis('genre').reset_index(name='count')
)

In [None]:
value_count.head()

In [None]:
genre_count = value_count.head(20)
plt.figure(figsize=(14,7))
ax = sns.barplot(x = genre_count['genre'], y= genre_count['count'], palette='CMRmap')
for p in ax.patches:
        ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()), fontsize=11, ha='center', va='bottom')
plt.title('Number of times a genre tag appears', fontsize=14)
plt.xlabel('Genre tag')
plt.ylabel('Genre tag Count')
plt.xticks(rotation=90)
plt.show()

The most popular words in the word cloud include book, comedy, ending, based, dark and sci-fi.
The three most popular tags that appear in df_tags['tag'] are sci-fi, atmospheric, and action.

<a id="three"></a>
## 3.6 Publishing years 
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

In [None]:
dates = []
for title in df_movies['title']:
    if title[-1] == " ":
        year = title[-6: -2]
        try:
            dates.append(int(year))
        except:
            dates.append(9999)
    else:
        year = title[-5: -1]
        try:
            dates.append(int(year))
        except:
            dates.append(9999)

df_movies['Publish Year'] = dates
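As an alternative to slicing by position, the same year extraction can be sketched with a regular expression. This is an illustrative sketch only: the helper name `extract_year` and the sample titles are assumptions, and it relies on titles following the "Title (YYYY)" pattern, optionally with trailing whitespace.

```python
import re

def extract_year(title, missing=9999):
    """Return the four-digit year in trailing parentheses, else `missing`."""
    match = re.search(r"\((\d{4})\)\s*$", title)
    return int(match.group(1)) if match else missing

# Hypothetical titles in the "Title (YYYY)" style
examples = ["Toy Story (1995)", "Heat (1995) ", "Untitled Project"]
print([extract_year(t) for t in examples])
```

Anchoring the pattern to the end of the string avoids picking up digits that are part of the title itself.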

In [None]:

len(df_movies)

In [None]:
len(df_movies[df_movies['Publish Year'] == 9999])

In [None]:
df_movies[(df_movies['Publish Year'] > 1888) &
          (df_movies['Publish Year'] < 2021)]

In [None]:
# Count movies per year; explicit column names work across pandas versions
dataset = (
    df_movies['Publish Year'].value_counts()
    .rename_axis('year').reset_index(name='count')
)
dataset.head()

In [None]:
year_dataset = dataset[(dataset['year']>1888) & (dataset['year']<2021)].sort_values(by='count',ascending=False).head(50)
plt.figure(figsize=(14,7))
ax = sns.barplot(x = year_dataset['year'], y= year_dataset['count'], order=year_dataset['year'], palette='CMRmap')
#for p in ax.patches:
#       ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()), fontsize=11, ha='center', va='bottom')
plt.title('Number of Movies Released Per year', fontsize=14)
plt.xlabel('year')
plt.ylabel('Released Movie Count')
plt.xticks(rotation=90)
plt.show()

In the Number of Movies Released Per year graph, we can see a major increase in movie releases in the 21st century.

# Section 4: Base Model testing

In [None]:
%load_ext autotime

With the Surprise library, the following algorithms will be used. RMSE is used as the accuracy metric for the predictions:

**NormalPredictor**

Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.

**BaselineOnly**

Algorithm predicting the baseline estimate for given user and item.

**KNNBasic**

A basic collaborative filtering algorithm.

**SVD**

The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.

**SVDpp**

The SVD++ algorithm, an extension of SVD taking into account implicit ratings.

**Nonnegative Matrix Factorization (NMF)**

A collaborative filtering algorithm based on Non-negative Matrix Factorization.

**SlopeOne**

A simple yet accurate collaborative filtering algorithm.

**CoClustering**

A collaborative filtering algorithm based on co-clustering.
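To make the accuracy metric concrete, the following minimal sketch computes RMSE by hand for a handful of made-up ratings. The function name `rmse` and the sample values are illustrative assumptions; in the notebook itself the metric is reported by Surprise's `accuracy.rmse`.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical true ratings and model estimates (not taken from the dataset)
true_ratings = [4.0, 3.0, 5.0, 2.0]
estimates = [3.5, 3.0, 4.0, 3.0]
print(rmse(true_ratings, estimates))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones; lower values are better.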

<a id="three"></a>
## 4.1 Cross-Validation Testing 
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

**Cross-Validation Testing**

Here we perform cross-validation testing on six algorithms: SVD, NMF, NormalPredictor, BaselineOnly, SlopeOne and CoClustering.
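As a rough illustration of what k-fold cross-validation does, the sketch below splits a list of row indices into folds and uses each fold once as the test set. The helper `k_fold_indices` is hypothetical and written in plain Python; Surprise's `cross_validate` handles the splitting internally.

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_rows=10, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Every row is used for testing exactly once, and the reported score is the mean over the folds.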

In [None]:
# Select sample size of 500 000 to test base models
df_model_testing = df_ratings.sample(n=500000)

reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(df_model_testing[["userId", "movieId", "rating"]], reader)

We then calculate the RMSEs for the six algorithms and display the scores in a dataframe.

In [None]:
benchmark = []

# Iterate over all algorithms
for algorithm in [
    SVD(),
    NMF(),
    NormalPredictor(),
    BaselineOnly(),
    SlopeOne(),
    CoClustering(),
]:

    # Perform cross validation
    results = cross_validate(algorithm, data, cv=5)

    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0, so pd.concat is used instead
    tmp = pd.concat(
        [tmp, pd.Series([str(algorithm).split(" ")[0].split(".")[-1]], index=["Algorithm"])]
    )
    benchmark.append(tmp)

In [None]:
# Show summary dataframe
summary_results = (
    pd.DataFrame(benchmark).set_index("Algorithm").sort_values("test_rmse")
)
summary_results

With a dataset of 500 000 rows, the BaselineOnly model outperforms the SVD model. The BaselineOnly model also fits much faster than the SVD model.

Next we perform split testing.

<a id="three"></a>
## 4.2 Train-Test-Split
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

**Train-Test-Split Testing of Top 2 Models**

Here we perform a train-test-split test on the BaselineOnly model and SVD model, the top two performers.

In [None]:
# Use all rows in Ratings dataframe
data_1 = Dataset.load_from_df(df_ratings[["userId", "movieId", "rating"]], reader)

# Test set is made of 25% of the ratings.
trainset, testset = train_test_split(data_1, test_size=0.25)

**BaselineOnly Model**

Test the BaselineOnly model.

In [None]:
BaselineOnly_1 = BaselineOnly()

# Train the algorithm on the train set, and predict ratings for the test set
BaselineOnly_1.fit(trainset)
pred_Baseline = BaselineOnly_1.test(testset)

# Then compute RMSE
accuracy.rmse(pred_Baseline)
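The BaselineOnly model tested above predicts a rating as the global mean plus a user bias and an item bias. The sketch below illustrates that idea on made-up data with a deliberate simplification: the biases are plain average deviations from the global mean, whereas Surprise fits regularised biases with ALS or SGD. All names and values here are hypothetical.

```python
from collections import defaultdict

# Hypothetical (user, movie, rating) triples
ratings = [
    ("u1", "m1", 5.0), ("u1", "m2", 3.0),
    ("u2", "m1", 4.0), ("u2", "m2", 2.0),
    ("u3", "m1", 3.0),
]

# Global mean rating
mu = sum(r for _, _, r in ratings) / len(ratings)

def mean_deviation(key_index):
    """Average deviation from the global mean, grouped by user (0) or item (1)."""
    grouped = defaultdict(list)
    for row in ratings:
        grouped[row[key_index]].append(row[2] - mu)
    return {key: sum(devs) / len(devs) for key, devs in grouped.items()}

user_bias = mean_deviation(0)
item_bias = mean_deviation(1)

def baseline_estimate(user, item):
    """mu + b_u + b_i, with a bias of 0 for unseen users or items."""
    return mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)

print(round(baseline_estimate("u3", "m2"), 3))
```

Since no latent factors are learned, this kind of model is cheap to fit, which is consistent with the short fit time seen in the benchmark.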

**SVD Model**

Test the SVD model.

In [None]:
SVD_1 = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset
SVD_1.fit(trainset)
pred_SVD_1 = SVD_1.test(testset)

# Then compute RMSE
accuracy.rmse(pred_SVD_1)

<a id="three"></a>
## 4.3 GridSearch
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

**GridSearch**

Next we attempt to improve our model's performance by conducting a grid search on both the SVD and BaselineOnly models.

SVD Model

In [None]:
# API key to run experiment in Comet
experiment = Experiment(
    api_key="RpnzF8DcMSor3mXqAfEQqsXjv",
    project_name="unsupervised-learning-predict",
    workspace="teamnm3",
)

reader = Reader(rating_scale=(0.5, 5))
df_model_testing_3 = df_ratings.sample(n=10000)
data_3 = Dataset.load_from_df(
     df_model_testing_3[["userId", "movieId", "rating"]], reader
 )
param_grid = {
      "n_factors": [10, 20, 50, 100, 150, 200],
      "n_epochs": [15, 20, 25, 50, 75, 100],
      "lr_all": [0.005, 0.008, 0.001],
      "reg_all": [0.1, 0.3, 0.5],
 }

gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data_3)
algo = gs.best_estimator["rmse"]

# # best RMSE score
print(gs.best_score["rmse"])

# # combination of parameters that gave the best RMSE score
print(gs.best_params["rmse"])

experiment.log_dataset_hash(data_3)
experiment.log_parameters({"model_type": "SVD", "param_grid": param_grid})
experiment.log_metrics({"RMSE": gs.best_score["rmse"]})
experiment.end()
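To get a feel for the cost of the SVD search above, the sketch below enumerates the same grid with `itertools.product` and counts the model fits implied by 3-fold cross-validation. It only counts combinations and trains nothing; the names `svd_grid` and `combinations` are illustrative.

```python
from itertools import product

# Same values as the SVD param_grid above
svd_grid = {
    "n_factors": [10, 20, 50, 100, 150, 200],
    "n_epochs": [15, 20, 25, 50, 75, 100],
    "lr_all": [0.005, 0.008, 0.001],
    "reg_all": [0.1, 0.3, 0.5],
}

# Every combination of one value per hyperparameter
combinations = [
    dict(zip(svd_grid, values)) for values in product(*svd_grid.values())
]
n_folds = 3
print(len(combinations), "combinations")
print(len(combinations) * n_folds, "model fits")
```

The number of fits grows multiplicatively with each added hyperparameter value, which is likely why the search runs on a 10 000-row sample rather than on the full ratings data.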

**Model Testing with Hyper Parameters Tuned - SVD**

Here we use the set of tuned hyperparameters obtained from the experiment to train and test our second SVD model.

In [None]:
# RMSE test whole dataset 1st set of tuned parameters - logged in Comet_ml
SVD_2 = SVD(n_factors=200, n_epochs=100, lr_all=0.005, reg_all=0.1, random_state=27)

# Train the algorithm on the trainset, and predict ratings for the testset
SVD_2.fit(trainset)
pred_SVD_2 = SVD_2.test(testset)

# Then compute RMSE
accuracy.rmse(pred_SVD_2)

**BaselineOnly Model**

We repeat this process for the BaselineOnly model by running another GridSearchCV.

In [None]:


# API key to run experiment in Comet
experiment = Experiment(
    api_key="RpnzF8DcMSor3mXqAfEQqsXjv",
    project_name="unsupervised-learning-predict",
    workspace="teamnm3",
)

param_grid = {
     "bsl_options": {
         "method": ["sgd"],
         "learning_rate": [0.004, 0.006, 0.008, 0.010],  # gamma
         "reg": [0.015, 0.020, 0.025],  # lambda 1 and 5
     }
 }
gs_baseline = GridSearchCV(
     BaselineOnly,
     param_grid,
     measures=["rmse"],
     cv=3,
     return_train_measures=True,
     n_jobs=1,
 )
gs_baseline.fit(data_3)

algo_baseline = gs_baseline.best_estimator["rmse"]

# # best RMSE score
print(gs_baseline.best_score["rmse"])

# # combination of parameters that gave the best RMSE score
print(gs_baseline.best_params["rmse"])

experiment.log_dataset_hash(data_3)
experiment.log_parameters({"model_type": "BaselineOnly", "param_grid": param_grid})
experiment.log_metrics({"RMSE": gs_baseline.best_score["rmse"]})

experiment.end()

# Section 5: Model Building

<a id="three"></a>
## 5.1 Fit model to whole dataset
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

We now switch to training our model on the entire train dataset.

In [None]:
# Use all rows in Ratings dataframe
data = Dataset.load_from_df(df_ratings[["userId", "movieId", "rating"]], reader)

# Final Model Building: fit on the full trainset, so no hold-out split is needed
SVD_model = SVD(random_state=27)
trainset = data.build_full_trainset()
SVD_model.fit(trainset)

<a id="three"></a>
## 5.2 Download CSV for Kaggle Competition
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

In [None]:
df_test["rating"] = df_test.apply(
    lambda x: SVD_model.predict(x["userId"], x["movieId"]).est, axis=1
)
df_test["Id"] = df_test.apply(lambda x: f"{x['userId']:.0f}_{x['movieId']:.0f}", axis=1)
submission = df_test[["Id", "rating"]]

In [None]:
submission.to_csv("SVD_model.csv", index=False)

<a id="three"></a>
## 5.3 Pickle model for use in Streamlit
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

In [None]:
# Uncomment to pickle and download the final model
# pickle.dump(SVD_model, open("./SVD_model.pkl",'wb'))

# Section 6: Collaborative & Content Based Models

**Filtration Strategies for Movie Recommendation Systems**

Movie recommendation systems use a set of different filtration strategies and algorithms to help users find the most relevant films. The most popular categories of the ML algorithms used for movie recommendations include content-based filtering and collaborative filtering systems.

![content-based_vs_collaborative_light.png](attachment:4d15547f-29ec-4040-9af9-5425213778b0.png)

<a id="three"></a>
## 6.1 Collaborative Filtering: Approach I
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

— **Collaborative Filtering**

As the name suggests, this filtering strategy is based on the combination of the relevant user’s and other users’ behaviors. The system compares and contrasts these behaviors for the most optimal results. It’s a collaboration of the multiple users’ film preferences and behaviors.

What’s the mechanism behind this strategy? The core element in this movie recommendation system and the ML algorithm it’s built on is the history of all users in the database. Basically, collaborative filtering is based on the interaction of all users in the system with the items (movies). Thus, every user impacts the final outcome of this ML-based recommendation system, while content-based filtering depends strictly on the data from one user for its modeling.

Collaborative filtering algorithms are divided into two categories:

- **User-based collaborative filtering**. The idea is to look for similar patterns in movie preferences in the target user and other users in the database.
- **Item-based collaborative filtering**. The basic concept here is to look for similar items (movies) that target users rate or interact with.

A modern approach to movie recommendation systems mixes both strategies to get the best of each.
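A minimal sketch of the user-based idea, using a made-up ratings dictionary and plain Python: users are compared by the cosine similarity of their ratings on co-rated movies, and the most similar users become the neighbours whose tastes drive the recommendation. The names `ratings` and `cosine_sim` are hypothetical, and this is not the exact implementation used in the cells that follow.

```python
import math

# Hypothetical ratings: user -> {movie: rating}
ratings = {
    "alice": {"Heat": 5.0, "Goodfellas": 4.0, "Toy Story": 1.0},
    "bob": {"Heat": 4.0, "Goodfellas": 5.0, "Toy Story": 2.0},
    "carol": {"Heat": 1.0, "Goodfellas": 2.0, "Toy Story": 5.0},
}

def cosine_sim(u, v):
    """Cosine similarity over the movies both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][m] * ratings[v][m] for m in common)
    norm_u = math.sqrt(sum(ratings[u][m] ** 2 for m in common))
    norm_v = math.sqrt(sum(ratings[v][m] ** 2 for m in common))
    return dot / (norm_u * norm_v)

# Rank the other users by similarity to "alice"
neighbours = sorted(
    (u for u in ratings if u != "alice"),
    key=lambda u: cosine_sim("alice", u),
    reverse=True,
)
print(neighbours)
```

Item-based filtering applies the same comparison to the columns (movies) instead of the rows (users).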

In [None]:
# removing years in title
df_movies['title'] = df_movies.title.str.replace(r'\(\d{4}\)', '', regex=True)
df_movies['title'] = df_movies['title'].apply(lambda x: x.strip())
df_movies.head()

We start this process by merging the ratings and movie data into a single dataframe.

In [None]:
# Convert IDs to int. Required for merging
df_ratings['movieId'] = df_ratings['movieId'].astype('int')
df_movies['movieId'] = df_movies['movieId'].astype('int')

# Merge df_rating and df_movies into your main dataframe
train_dat = df_ratings.merge(df_movies, on='movieId')

In [None]:
# Keep only the columns we need from the merged data (copy to avoid chained-assignment warnings)
df_train = train_dat[['userId','movieId','title','rating']].copy()
df_train.head()

In [None]:
# Convert rating into appropriate data types
df_train.rating = df_train.rating.astype(str).astype(float)

In [None]:
# show data
df_train.head()

In [None]:
# confirm the number of unique users, unique movies, and total ratings, and we will also calculate the average number of ratings provided by users:

n_users = df_train.userId.unique().shape[0]
n_movies = df_train.movieId.unique().shape[0]
n_ratings = len(df_train)
avg_ratings_per_user = n_ratings/n_users
print('Number of unique users: ', n_users)
print('Number of unique movies: ', n_movies)
print('Number of total ratings: ', n_ratings)
print('Average number of ratings per user: ', avg_ratings_per_user)

In [None]:
# To reduce the complexity and size of this dataset, we focus on the top one thousand most rated movies.

movieIndex = df_train.groupby("movieId").count().sort_values(by= \
"rating",ascending=False)[0:1000].index
rating2 = df_train[df_train.movieId.isin(movieIndex)]
rating2.count()

In [None]:
# We will also take a sample of one thousand users at random and filter the dataset for just these users.

userIndex = rating2.groupby("userId").count().sort_values(by= \
"rating",ascending=False).sample(n=1000, random_state=2018).index
rating3 = rating2[rating2.userId.isin(userIndex)]
rating3.count()

In [None]:
# we also reindex movieID and userID to a range of 1 to 1,000 for our reduced dataset

movies = rating3.movieId.unique()
movies_df = pd.DataFrame(data=movies,columns=['originalMovieId'])
movies_df['newMovieId'] = movies_df.index+1
users = rating3.userId.unique()
users_df = pd.DataFrame(data=users,columns=['originalUserId'])
users_df['newUserId'] = users_df.index+1
rating3 = rating3.merge(movies_df,left_on='movieId', \
right_on='originalMovieId')
rating3.drop(labels='originalMovieId', axis=1, inplace=True)
rating3 = rating3.merge(users_df,left_on='userId', \
right_on='originalUserId')
rating3.drop(labels='originalUserId', axis=1, inplace=True)

In [None]:
# Let’s calculate the number of unique users, unique movies, total ratings, and average number of ratings per user for our reduced dataset.

n_users = rating3.userId.unique().shape[0]
n_movies = rating3.movieId.unique().shape[0]
n_ratings = len(rating3)
avg_ratings_per_user = n_ratings/n_users
print('Number of unique users: ', n_users)
print('Number of unique movies: ', n_movies)
print('Number of total ratings: ', n_ratings)
print('Average number of ratings per user: ', avg_ratings_per_user)

In [None]:
rating3.head()

In [None]:
# we construct our utility matrix easily by using the pivot_table function

util_matrix = rating3.pivot_table(index=['newUserId'], 
                                       columns=['title'],
                                       values='rating') 
util_matrix.shape

In [None]:
# Create a neat version of the utility matrix to assist with plotting movie titles 
rating3['neat_title'] = rating3['title'].apply(lambda x: x[:20])
util_matrix_neat = rating3.pivot_table(index=['newUserId'], 
                                            columns=['neat_title'],
                                            values='rating')

fig, ax = plt.subplots(figsize=(15,5))
# We select only the first 100 users for ease of computation and visualisation. 
# You can play around with this value to see more of the utility matrix. 
_ = sns.heatmap(util_matrix_neat[:100], annot=False, ax=ax).set_title('Movies Utility Matrix')

In [None]:
# Normalize each row (a given user's ratings) of the utility matrix
util_matrix_norm = util_matrix.apply(lambda x: (x-np.mean(x))/(np.max(x)-np.min(x)), axis=1)
# Fill Nan values with 0's, transpose matrix, and drop users with no ratings
util_matrix_norm.fillna(0, inplace=True)
util_matrix_norm = util_matrix_norm.T
util_matrix_norm = util_matrix_norm.loc[:, (util_matrix_norm != 0).any(axis=0)]
# Save the utility matrix in scipy's sparse matrix format
util_matrix_sparse = sp.sparse.csr_matrix(util_matrix_norm.values)

In [None]:
# Compute the similarity matrix using the cosine similarity metric
user_similarity = cosine_similarity(util_matrix_sparse.T)
# Save the matrix as a dataframe to allow for easier indexing  
user_sim_df = pd.DataFrame(user_similarity, 
                           index = util_matrix_norm.columns, 
                           columns = util_matrix_norm.columns)

# Review a small portion of the constructed similarity matrix  
user_sim_df[:5]

In [None]:
def collab_generate_top_N_recommendations(user, N=10, k=20):
    # Cold-start problem - no ratings given by the reference user. 
    # With no further user data, we solve this by simply recommending
    # the top-N highest-rated movies in the item catalog. 
    if user not in user_sim_df.columns:
        return rating3.groupby('title')['rating'].mean().sort_values(
            ascending=False).index[:N].to_list()
    
    # Gather the k users which are most similar to the reference user 
    sim_users = user_sim_df.sort_values(by=user, ascending=False).index[1:k+1]
    favorite_user_items = [] # <-- List of highest rated items gathered from the k users  
    most_common_favorites = {} # <-- Dictionary of highest rated items in common for the k users
    
    for i in sim_users:
        # Maximum rating given by the current user to an item 
        max_score = util_matrix_norm.loc[:, i].max()
        # Save the names of items maximally rated by the current user   
        favorite_user_items.append(util_matrix_norm[util_matrix_norm.loc[:, i]==max_score].index.tolist())
        
    # Loop over each user's favorite items and tally which ones are 
    # most popular overall.
    for item_collection in range(len(favorite_user_items)):
        for item in favorite_user_items[item_collection]: 
            if item in most_common_favorites:
                most_common_favorites[item] += 1
            else:
                most_common_favorites[item] = 1
    # Sort the overall most popular items and return the top-N instances
    sorted_list = sorted(most_common_favorites.items(), key=operator.itemgetter(1), reverse=True)[:N]
    top_N = [x[0] for x in sorted_list]
    return top_N

In [None]:
# Our recommended list for user 41
collab_generate_top_N_recommendations(41)

In [None]:
# User 41's historical ratings. only 10 shown
rating3[rating3['newUserId'] == 41][:][['title','rating']].sort_values(by='rating', ascending=False)[:10]

In [None]:
def collab_generate_rating_estimate(movie_title, user, k=20, threshold=0.0):
    # Gather the k users which are most similar to the reference user 
    sim_users = user_sim_df.sort_values(by=user, ascending=False).index[1:k+1]
    # Store the corresponding user's similarity values 
    user_values = user_sim_df.sort_values(by=user, ascending=False).loc[:,user].tolist()[1:k+1]
    rating_list = [] # <-- List of k user's ratings for the reference item
    weight_list = [] # <-- List of k user's similarities to the reference user
    
    # Create a weighted sum for each of the k users who have rated the 
    # reference item (movie).
    for sim_idx, user_id in enumerate(sim_users):
        # User's rating of the item
        rating = util_matrix.loc[user_id, movie_title]
        # User's similarity to the reference user 
        similarity = user_values[sim_idx]
        # Skip the user if they have not rated the item, or are too dissimilar to 
        # the reference user
        if (np.isnan(rating)) or (similarity < threshold):
            continue
        elif not np.isnan(rating):
            rating_list.append(rating*similarity)
            weight_list.append(similarity)
    try:
        # Return the weighted sum as the predicted rating for the reference item
        predicted_rating = sum(rating_list)/sum(weight_list) 
    except ZeroDivisionError:
        # If no ratings for the reference item can be collected, return the average 
        # rating given by all users for the item.  
        predicted_rating = np.mean(util_matrix[movie_title])
    return predicted_rating
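The estimate computed by `collab_generate_rating_estimate` can be summarised as a similarity-weighted average. The notation is introduced here for illustration and does not appear in the code: $N_k(u, i)$ is the subset of the $k$ users most similar to user $u$ who have rated movie $i$ and whose similarity $s_{uv}$ is at least the threshold, and $r_{v,i}$ is the rating user $v$ gave to movie $i$.

```latex
\hat{r}_{u,i} = \frac{\sum_{v \in N_k(u,i)} s_{uv}\, r_{v,i}}{\sum_{v \in N_k(u,i)} s_{uv}}
```

When $N_k(u, i)$ is empty, the function falls back to the mean rating of movie $i$ across all users.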

In [None]:
# Once again we can use our newly formed function to generate rating predictions for user 41
# pick a movie title 'Heat'
title = "Heat"
actual_rating = rating3[(rating3['newUserId'] == 41) & (rating3['title'] == title)]['rating'].values[0]
pred_rating = collab_generate_rating_estimate(movie_title = title, user = 41)
print (f"Title - {title}")
print ("---")
print (f"Actual rating: \t\t {actual_rating}")
print (f"Predicted rating: \t {pred_rating}")

In [None]:
# we picked a movie title "Goodfellas"  and compared ratings

title = "Goodfellas"
actual_rating = rating3[(rating3['newUserId'] == 41) & (rating3['title'] == title)]['rating'].values[0]
pred_rating = collab_generate_rating_estimate(movie_title = title, user = 41)
print (f"Title - {title}")
print ("---")
print (f"Actual rating: \t\t {actual_rating}")
print (f"Predicted rating: \t {pred_rating}")

In [None]:
# we picked a movie titled "Reservoir Dogs"  and compared ratings

title = "Reservoir Dogs"
actual_rating = rating3[(rating3['newUserId'] == 41) & (rating3['title'] == title)]['rating'].values[0]
pred_rating = collab_generate_rating_estimate(movie_title = title, user = 41)
print (f"Title - {title}")
print ("---")
print (f"Actual rating: \t\t {actual_rating}")
print (f"Predicted rating: \t {pred_rating}")

<a id="three"></a>
## 6.2 Content Based Filtering: Approach II
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

— **Content-Based Filtering**

Content-based filtering is a filtration strategy for movie recommendation systems that uses the data provided about the items (movies). This data plays a crucial role here and is matched against the history of only one user. An ML algorithm used for this strategy recommends movies that are similar to those the user preferred in the past. Therefore, similarity in content-based filtering is derived from the past film selections and likes of that one user only.

How does it work? The recommendation system analyzes the past preferences of the user concerned and then uses this information to find similar movies, based on attributes available in the database (e.g., lead actors, director, genre, etc.). After that, the system provides movie recommendations for the user. In short, the core element in content-based filtering is that only one user's data is used to make predictions.

**Join Data Sets**

Here we join all the datasets together into one combined dataframe.

In [None]:
# Join Ratings and Movies Data Sets
df_combined = pd.merge(df_ratings, df_movies, on="movieId", how="left")

# Join IMDB Data Set as well
df_combined = pd.merge(df_combined, df_imdb, on="movieId", how="left")

# Display top 2 rows of new combined dataframe
df_combined.head(2)

We start this process by filtering the ratings data down to users that rated at least 60 movies.

In [None]:
# Keep only ratings from users with at least 60 ratings to be able to pivot 
# the table
filtered_ratings = df_ratings.groupby("userId").filter(lambda x: len(x) >= 60)

# List the movie IDs after filtering
movie_list_rating = filtered_ratings.movieId.unique().tolist()

# View shape of filtered ratings
print(filtered_ratings.shape)

Next we examine the percentage of unique movies and users retained in the filtered_ratings dataframe that was created above:

In [None]:
# Calculate percentage of movies and users in filtered dataframe
unique_movies_f = (
    len(filtered_ratings.movieId.unique()) / len(df_ratings.movieId.unique()) * 100
)
unique_users_f = (
    len(filtered_ratings.userId.unique()) / len(df_ratings.userId.unique()) * 100
)

print(
    round(unique_movies_f, 2), "% of original movie titles in the filtered dataframe."
)
print(round(unique_users_f, 2), "% of original users in the filtered dataframe.")

Now that we understand the content of this dataframe, we proceed to filter the movies dataframe to exclude items not in the movie_list_rating and process the genres column:

In [None]:
# Filter the movies dataframe with the movie titles from the filtered list
df_movies = df_movies[df_movies.movieId.isin(movie_list_rating)]

We then process the genres column to remove the pipe symbol:

In [None]:
# Replace | in genres with a space and make lowercase to use as metadata later
df_movies["genres"] = [item.replace("|", " ").lower() for item in df_movies["genres"]]
df_movies.head()

We continue to process this dataframe by mapping the movie title to its ID and dropping the timestamp column.

In [None]:
# Create movie dictionary to map title to id
movie_dict = dict(zip(df_movies.title.tolist(), df_movies.movieId.tolist()))

# Drop timestamp column from filtered dataframe
filtered_ratings.drop(["timestamp"], axis=1, inplace=True)
filtered_ratings.head()

Next we add tags with a high relevance score (above 0.7)

In [None]:
# Create combined dataframe with genres, titles and tags with relevance above 0.7
combined = pd.merge(df_movies, df_genome_scores, on="movieId", how="left").merge(
    df_genome_tags, on="tagId", how="left"
)

filter_combined = combined[combined["relevance"] > 0.7]
filter_combined.drop(["tagId", "relevance"], axis=1, inplace=True)

# Replace NaN with empty string
filter_combined.fillna("", inplace=True)
filter_combined.head()

We can now create a new metadata column in the dataframe by concatenating the tags with the genres.

In [None]:
# Create metadata column from movie tags and genres
filter_combined = pd.DataFrame(
    filter_combined.groupby("movieId")["tag"].apply(lambda x: " ".join(x))
)

movies_meta = pd.merge(df_movies, filter_combined, on="movieId", how="left").fillna("")
movies_meta["metadata"] = movies_meta["tag"] + " " + movies_meta["genres"]

movies_meta[["movieId", "title", "metadata"]].head()

In [None]:
# Convenient indexes to map between movie titles and indexes of 
# the movies dataframe
titles = movies_meta['title']
indices = pd.Series(movies_meta.index, index=movies_meta['title'])

In [None]:
# show indices

indices.head()

We now convert these textual features into a format which enables us to compute their relative similarities to one another.

In [None]:
# min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,2),
                     min_df=1, stop_words='english')

# Produce a feature matrix, where each row corresponds to a movie,
# with TF-IDF features as columns 
tf_authTags_matrix = tf.fit_transform(movies_meta['metadata'])

df_tfidf = pd.DataFrame(tf_authTags_matrix.toarray(), index=movies_meta.index.tolist())

print(df_tfidf.shape)
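To make the TF-IDF weighting concrete, here is a small hand-rolled sketch on three made-up metadata strings. It uses the textbook formula (term frequency times the log of inverse document frequency); scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalisation by default, so its numbers will differ. The titles and strings are hypothetical.

```python
import math
from collections import Counter

# Hypothetical metadata strings in the style of the `metadata` column
docs = {
    "Movie A": "space sci-fi adventure",
    "Movie B": "space sci-fi thriller",
    "Movie C": "romance comedy",
}

tokenised = {title: text.split() for title, text in docs.items()}
n_docs = len(tokenised)
# Document frequency: number of documents containing each term
doc_freq = Counter(term for tokens in tokenised.values() for term in set(tokens))

def tfidf(tokens):
    """Term frequency times log inverse document frequency (textbook form)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

weights = {title: tfidf(tokens) for title, tokens in tokenised.items()}
print(weights["Movie A"])
```

Terms shared by many movies (here "space" and "sci-fi") receive lower weights than terms that single out one movie (here "adventure").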

In [None]:
# Select top features with Truncated SVD
svd = TruncatedSVD(n_components=200)
latent_matrix = svd.fit_transform(df_tfidf)

# Plot variance explained to see what dimensions to use
explained = svd.explained_variance_ratio_.cumsum()
plt.plot(explained, ".-", ms=10, color="blue")
plt.xlabel("Singular value components", fontsize=12)
plt.ylabel("Cumulative percent of variance", fontsize=12)
plt.show()

# Print percentage of variance explained
print(
    "200 components explain",
    round(svd.explained_variance_ratio_.sum() * 100, 2),
    "% of the variance",
)

In [None]:
# Number of dimensions to keep
n = 200
df_latent_matrix_1 = pd.DataFrame(
    latent_matrix[:, 0:n], index=movies_meta.title.tolist()
)

# Content latent matrix shape
latent_matrix.shape

We now can compute the similarity between each vector within our matrix. This is done by making use of the cosine_similarity function provided to us by sklearn.

In [None]:
cosine_sim_authTags = cosine_similarity(latent_matrix, 
                                        latent_matrix)
print (cosine_sim_authTags.shape)

In [None]:
cosine_sim_authTags[:5]
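One way a similarity matrix like this could be turned into recommendations is to look up a title's row and return the highest-scoring other titles. The sketch below shows that lookup on a tiny made-up matrix; the function `most_similar`, the titles and the similarity values are all hypothetical and are not part of the notebook's pipeline.

```python
# Hypothetical titles and a matching symmetric similarity matrix
titles = ["Heat", "Goodfellas", "Toy Story", "Casino"]
similarity = [
    [1.00, 0.80, 0.10, 0.75],
    [0.80, 1.00, 0.05, 0.90],
    [0.10, 0.05, 1.00, 0.02],
    [0.75, 0.90, 0.02, 1.00],
]

def most_similar(title, top_n=2):
    """Return the top_n titles most similar to `title`, excluding itself."""
    idx = titles.index(title)
    scored = [
        (other, similarity[idx][j])
        for j, other in enumerate(titles)
        if j != idx
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:top_n]]

print(most_similar("Goodfellas"))
```

With the real data, the `indices` series defined earlier could play the role of `titles.index`, mapping a movie title to its row in `cosine_sim_authTags`.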

### Infinity One
Building a model with more Attributes for Streamlit (Infinity One)

In [None]:
# Convert 'movieIds' to int. This is necessary for merging datasets
df_movies['movieId'] = df_movies['movieId'].astype('int')
df_imdb['movieId'] = df_imdb['movieId'].astype('int')

# Merge df_imdb and df_movies into the main items_dataset
items_dataset = df_imdb.merge(df_movies, on='movieId')
#items_dataset.head(5)

In [None]:
# prepare the text columns of items_dataset (title_cast, plot_keywords, genres):
# fill missing values, remove the spaces inside names so that each full name
# becomes a single token, then turn the '|' separators into spaces

elements = ['title_cast', 'plot_keywords', 'genres'] 
for item in elements:
    items_dataset[item] = items_dataset[item].fillna('')
    items_dataset[item] = items_dataset[item].str.replace(' ', '', regex=False)
    items_dataset[item] = items_dataset[item].str.replace('|', ' ', regex=False)
    
# create a year column from the four-digit year in parentheses in the title
items_dataset['year'] = items_dataset.title.str.extract(r'\((\d{4})\)', expand=False)
# titles without a year and missing runtimes are left as NaN

# then remove year from title column
items_dataset['title'] = items_dataset.title.str.replace(r'\(\d{4}\)', '', regex=True)
items_dataset['title'] = items_dataset.title.str.strip()

In [None]:
# drop unwanted columns
items_dataset = items_dataset.drop(['budget','movieId'], axis=1)

In [None]:
# clean up the dataset by converting all to lower case 
# prepared for vectorizer

def clean_items(x):
    # lower-case strings; anything else (e.g. NaN) becomes an empty string
    if isinstance(x, str):
        return x.lower()
    return ''

In [None]:
# apply function to items_dataset columns

elements2 = ['title_cast', 'director', 'plot_keywords', 'genres'] 
for item in elements2:
    items_dataset[item] = items_dataset[item].apply(clean_items)

In [None]:
# cast runtime and year to strings so they can be concatenated into the text pool
# (missing values become the token 'nan')
items_dataset['runtime'] = items_dataset['runtime'].astype(str)
items_dataset['year'] = items_dataset['year'].astype(str)

In [None]:
# elements = ['title_cast', 'director', 'plot_keywords', 'genres'] 
# from elements, create final preprocessing of items_pool, 
# get ready for vectorizer
def items_pool(x):
    return x['plot_keywords']           \
                + ' ' + x['title_cast'] \
                + ' ' + x['director']   \
                + ' ' + x['genres']     \
                + ' ' + x['runtime']    \
                + ' ' + x['year']


In [None]:
# create a new column into our items_dataset for the items_pool
items_dataset['pool'] = items_dataset.apply(items_pool, axis=1)

#### RECOMMENDER STEPS
1. Using the index mapping, obtain the index of the movie given its title.

2. Create a list of tuples containing the index number of the movies and the cosine similarity scores for that particular movie compared with all movies. 

3. Sort the list of tuples based on the similarity scores; that is, the second element of the tuple.

4. Get the top N elements of this list. Ignore the first element as it refers to self (the movie most similar to a particular movie is the movie itself).

5. Return the titles corresponding to the indices of the top N elements.
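
The five steps above can be sketched on a small, self-contained example. The titles, the similarity values and the names `toy_titles`, `toy_sim`, `toy_indices` and `toy_recs` are all hypothetical and only illustrate the procedure; they are not part of the notebook's data or model.

```python
# Hypothetical toy data: four titles and a symmetric similarity matrix
toy_titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
toy_sim = [
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
]

# Index mapping from title to row number
toy_indices = {title: i for i, title in enumerate(toy_titles)}


def toy_recs(title, n):
    # Step 1: look up the row index of the movie from its title
    idx = toy_indices[title]
    # Step 2: pair every movie index with its similarity to this movie
    scores = list(enumerate(toy_sim[idx]))
    # Step 3: sort by the similarity score (second element), highest first
    scores = sorted(scores, key=lambda pair: pair[1], reverse=True)
    # Step 4: skip the first entry (the movie itself) and keep the next n
    top = scores[1:n + 1]
    # Step 5: map the indices back to titles
    return [toy_titles[i] for i, _ in top]


print(toy_recs("Movie A", 2))  # ['Movie C', 'Movie D']
```

Note that step 4 assumes the movie itself sorts first. If another movie also has a similarity of exactly 1 and appears earlier in the data, it would take that first slot instead, so a more defensive version could filter out `idx` explicitly.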

In [None]:
# make a CountVectorizer object and create the count matrix
# raw counts are used here (rather than TF-IDF) so that frequently
# occurring tokens are not down-weighted
count_vect = CountVectorizer(stop_words='english')
count_matrix = count_vect.fit_transform(items_dataset['pool'])

#### Dimensionality Reduction to get the most important feature sets

In [None]:
# define the scaler object
ss = StandardScaler()

# scale the count features; StandardScaler cannot centre a sparse matrix,
# so the count matrix is converted to a dense array first
count_matrix_scaled = ss.fit_transform(count_matrix.toarray())

In [None]:
# carry out dimensionality reduction:
# apply PCA and keep the first 20 principal components
print("Computing PCA projection")
t0 = time()
count_matrix_pca = decomposition.PCA(n_components=20).fit_transform(count_matrix_scaled)
t1 = time()
print("Finished PCA projection in " + str(t1-t0) + "s.")

In [None]:
# Compute the cosine similarity matrix based on the PCA-reduced count matrix
cosine_sim = cosine_similarity(count_matrix_pca, count_matrix_pca)

In [None]:
# Construct a mapping from each movie title to its
# row index in items_dataset
indices = pd.Series(
    items_dataset.index, index=items_dataset['title'])

In [None]:
# Function that takes in movie title as input and outputs most similar movies
def get_movie_recs(N, sample, cosine_sim=cosine_sim):
    '''
    N: int, number of recommendations to return
    sample: str, title of the movie to find recommendations for
    cosine_sim: 2-D array of pairwise cosine similarity scores
    '''
    # Get the index of the movie that matches the title
    idx = indices[sample]
    # If several movies share the title, use the first match
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]

    # Get the pairwise similarity scores of all movies with that movie
    # as a list of (index, score) tuples
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the N most similar movies, skipping the movie itself
    sim_scores = sim_scores[1:N+1]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top N most similar movies
    res = items_dataset['title'].iloc[movie_indices]
    return res

In [None]:
# get a sample of the movie recommender capabilities of Infinity One
User1_likes_Toy_Story = get_movie_recs(10, 'Toy Story', cosine_sim)
User1_likes_Toy_Story

# Section 7: Conclusion

The bigger the choice, the harder it is to make the final decision. This is especially true for modern movie fans, who have thousands of movies to pick from. Recommendation systems built on machine learning techniques help narrow that choice.

Today, movie recommendation systems are widely used by the most popular streaming services, enabling a more personalized experience and increased user satisfaction across the platforms. Why do we need them? It is estimated that world cinema has released more than 500,000 movies, far more than any one person could ever browse. With such an enormous number of motion pictures to choose from, developing and improving ML-based recommendation systems was a crucial step in making this choice manageable.

Once again, ML proves to be a valuable technology that makes our lives easier. As these systems evolve, more advanced ML techniques become available for surfacing the content that users are actually looking for.

In this notebook we built such a system: a recommendation algorithm based on content-based and collaborative filtering that predicts how a user will rate a movie they have not yet watched, based on their historical preferences.