# Unsupervised Learning Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

{**2110ACDS_T7**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.


  

<h2><center>Movie Recommendation System</center></h2>
<figure>
<center><img src ="https://drive.google.com/uc?id=194N0bzcjgy-D5GjvN2ofYic9kd5BAxKz" width = "800" height = '500'/>

**Introduction**
    
In this modern age, people are exposed to a myriad of entertainment options but have very limited time to go through them all. Recommendation systems are therefore important: they help users make good choices without having to expend their cognitive resources. Recommendation systems are artificial-intelligence-based algorithms that skim through all possible options and create a customized list of items that are interesting and relevant to an individual.

**Problem**
    
Our client has a database with a huge number of movies, which can be overwhelming for their users to choose from. There is therefore a need to filter, prioritize and efficiently deliver relevant movies in order to alleviate the movie overload that has become a problem for many of their users.

**Objective**
    
Our team has been tasked with creating a recommender system that will solve this problem by searching through a large volume of dynamically generated movie data to provide users with personalized movie recommendations. Other benefits of this system include:
1. Increased user satisfaction
2. Increased sales/conversion
3. Increased loyalty/ share of mind
4. Reduced churn

**Process**
    
In order to achieve this objective the team will follow the process below:-
1. Explore the supplied data, identify potential errors in the data and clean the existing data set;
2. Build a model that is capable of rating a user's unseen movies;
3. Evaluate the accuracy of the best machine learning model; and
4. Explain the inner workings of the model to a non-technical audience.

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Preprocessing of test data</a>

<a href=#six>6. Modeling</a>

<a href=#seven>7. Model Performance</a>

<a href=#eight>8. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

### 1.1 Loading experiments to Comet ML  

>Comet is a great tool for model versioning and experimentation, as it records the parameters and conditions of each experiment. This makes results reproducible and allows us to go back to a previous version of an experiment.

>Record of the experiments will be stored in the Advanced-classification project


In [1]:
#!pip install comet_ml

In [2]:
# import comet_ml at the top of your file


# Create an experiment with your api key


### 1.2 Brief Description of Libraries 
> The following libraries will be used to aid the creation of the recommender system.

>* NumPy:- NumPy (short for Numerical Python) is “the fundamental package for scientific computing with Python”, and it is the library that Pandas, Matplotlib and Scikit-learn build on top of.
>* Pandas:- a software library for data manipulation and analysis.
>* Sklearn:- this library contains a lot of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction.
>* Plotly:- this library is an interactive, open-source plotting library that supports over 40 unique chart types covering a wide range of statistical, financial, geographic, scientific, and 3-dimensional use-cases. It enables Python users to create beautiful interactive web-based visualizations that can be displayed in Jupyter notebooks, saved to standalone HTML files, or served as part of pure Python-built web applications using Dash.
>* Surprise:- this is a Python scikit for building and analyzing recommender systems that deal with explicit rating data. It provides various ready-to-use prediction algorithms such as baseline algorithms, neighborhood methods, matrix factorization-based methods (SVD, PMF, SVD++, NMF), and many others. Various similarity measures (cosine, MSD, Pearson, …) are also built in.
>* Matplotlib:-  a library for creating static, animated, and interactive visualizations in Python.
>* Seaborn:- a library for making statistical graphics in Python. It builds on top of matplotlib and integrates closely with pandas data structures.



In [3]:
# Libraries for loading and manipulating data
import sys
import numpy as np
import pandas as pd

# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go

# Libraries for building and evaluating recommender models
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

# Libraries for preprocessing, dimensionality reduction and similarity
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA as RandomizedPCA
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

# Setting global constants to ensure notebook results are reproducible
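
The cell above leaves the reproducibility constants unset. A minimal sketch of what such a setup could look like is shown below; the name `RANDOM_STATE` and the value `42` are assumptions for illustration, not values taken from this notebook.

```python
import random

import numpy as np

# Hypothetical seed value; any fixed integer would serve the same purpose
RANDOM_STATE = 42

# Seed Python's and NumPy's global random number generators
random.seed(RANDOM_STATE)
np.random.seed(RANDOM_STATE)
```

The same constant could then be passed wherever a library accepts a seed, for example the `random_state` argument of `train_test_split` in Scikit-learn or of `SVD` in Surprise.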





<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

### 2.1 Brief description of the data
**Data Overview**
>This dataset consists of several million 5-star ratings obtained from users of the online MovieLens movie recommendation service. 

**Source**
>The data for the MovieLens dataset is maintained by the GroupLens research group in the Department of Computer Science and Engineering at the University of Minnesota. Additional movie content data was legally scraped from IMDB.

**Supplied Files**
1. genome_scores.csv - a score mapping the strength between movies and tag-related properties.
2. genome_tags.csv - user assigned tags for genome-related scores
3. imdb_data.csv - Additional movie metadata scraped from IMDB using the links.csv file.
4. links.csv - File providing a mapping between a MovieLens ID and associated IMDB and TMDB IDs.
5. sample_submission.csv - Sample of the submission format for the hackathon.
6. tags.csv - User-assigned tags for the movies within the dataset.
7. test.csv - Contains user and movie IDs with no rating data.
8. train.csv - Contains user and movie IDs with associated rating data and timestamp.


In [4]:
df_train = pd.read_csv('../input/edsa-movie-recommendation-2022/train.csv')
df_test = pd.read_csv('../input/edsa-movie-recommendation-2022/test.csv')
genome_scores = pd.read_csv('../input/edsa-movie-recommendation-2022/genome_scores.csv')
genome_tags = pd.read_csv('../input/edsa-movie-recommendation-2022/genome_tags.csv')
df_imdb = pd.read_csv('../input/edsa-movie-recommendation-2022/imdb_data.csv')
df_links = pd.read_csv('../input/edsa-movie-recommendation-2022/links.csv')
df_movies = pd.read_csv('../input/edsa-movie-recommendation-2022/movies.csv')
df_tags = pd.read_csv('../input/edsa-movie-recommendation-2022/tags.csv')
sample_submission = pd.read_csv('../input/edsa-movie-recommendation-2022/sample_submission.csv')                      

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


**What is EDA?** 

EDA is an unavoidable and major step in which the given data set(s) are examined through different forms of analysis to understand the key characteristics of their entities, such as columns and rows, by applying Pandas, NumPy, statistical methods and data visualization packages.

**Outcomes of this phase**

1. Understanding the given dataset, which helps to clean it up.
2. Getting a clear picture of the features and the relationships between them.
3. Providing guidelines for keeping essential variables and removing non-essential ones.
4. Handling missing values and human error.
5. Identifying outliers.
6. Maximizing the insights gained from the dataset.


In [5]:
df_train.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,5163,57669,4.0,1518349992
1,106343,5,4.5,1206238739
2,146790,5459,5.0,1076215539
3,106362,32296,2.0,1423042565
4,9041,366,3.0,833375837


In [6]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000038 entries, 0 to 10000037
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 305.2 MB


In [7]:
df_train.duplicated().sum()

0

The train data has 10000038 observations and 4 features: userId, movieId, rating and timestamp. The userId, movieId and timestamp columns have an integer datatype and the rating column has a float datatype. There are neither duplicate nor null values in the features.

In [8]:
df_test.head()

Unnamed: 0,userId,movieId
0,1,2011
1,1,4144
2,1,5767
3,1,6711
4,1,7318


In [9]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000019 entries, 0 to 5000018
Data columns (total 2 columns):
 #   Column   Dtype
---  ------   -----
 0   userId   int64
 1   movieId  int64
dtypes: int64(2)
memory usage: 76.3 MB


The test data has 5000019 observations and 2 features, userId and movieId, both with an integer datatype. There are no null values.

In [10]:
genome_scores.head()

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02875
1,1,2,0.02375
2,1,3,0.0625
3,1,4,0.07575
4,1,5,0.14075


In [11]:
genome_scores.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15584448 entries, 0 to 15584447
Data columns (total 3 columns):
 #   Column     Dtype  
---  ------     -----  
 0   movieId    int64  
 1   tagId      int64  
 2   relevance  float64
dtypes: float64(1), int64(2)
memory usage: 356.7 MB


In [14]:
genome_scores.duplicated().sum()

0

genome_scores has 15584448 observations and 3 columns:- movieId, tagId and relevance. The movieId and tagId columns have an integer datatype while the relevance column has a float datatype. There are neither duplicate nor null values.

In [15]:
genome_tags.head()

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s


In [16]:
genome_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1128 entries, 0 to 1127
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   tagId   1128 non-null   int64 
 1   tag     1128 non-null   object
dtypes: int64(1), object(1)
memory usage: 17.8+ KB


In [17]:
genome_tags.duplicated().sum()

0

genome_tags has 1128 observations and 2 columns:- tagId and tag. The tagId column has an integer datatype while the tag column has an object datatype. There are neither duplicate nor null values.

In [18]:
df_imdb.head()

Unnamed: 0,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation
1,2,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,104.0,"$65,000,000",board game|adventurer|fight|game
2,3,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,101.0,"$25,000,000",boat|lake|neighbor|rivalry
3,4,Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,124.0,"$16,000,000",black american|husband wife relationship|betra...
4,5,Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,106.0,"$30,000,000",fatherhood|doberman|dog|mansion


In [19]:
df_imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movieId        27278 non-null  int64  
 1   title_cast     17210 non-null  object 
 2   director       17404 non-null  object 
 3   runtime        15189 non-null  float64
 4   budget         7906 non-null   object 
 5   plot_keywords  16200 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 1.2+ MB


In [20]:
df_imdb.duplicated().sum()

0

The imdb DataFrame has 27278 entries and 6 columns:- movieId, title_cast, director, runtime, budget and plot_keywords. The title_cast, director, budget and plot_keywords columns are of the object datatype, movieId is of the integer datatype and runtime is of the float datatype. Only the movieId column has no null values, and there are no duplicate rows.

In [21]:
df_links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [22]:
df_links.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  62423 non-null  int64  
 1   imdbId   62423 non-null  int64  
 2   tmdbId   62316 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 1.4 MB


In [23]:
df_links.duplicated().sum()

0

df_links has 62423 observations and 3 columns:- movieId, imdbId and tmdbId. The movieId and imdbId columns have an integer datatype while the tmdbId column has a float datatype. Only the tmdbId column has null values. There are no duplicate rows.

In [24]:
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [25]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB


In [26]:
df_movies.duplicated().sum()

0

df_movies has 62423 observations and 3 columns:- movieId, title and genres. The title and genres columns have the object datatype while the movieId column has an integer datatype. There are neither duplicate nor null values.

In [27]:
df_tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598
3,4,1732,great dialogue,1573943604
4,4,7569,so bad it's good,1573943455


In [28]:
df_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1093360 entries, 0 to 1093359
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   userId     1093360 non-null  int64 
 1   movieId    1093360 non-null  int64 
 2   tag        1093344 non-null  object
 3   timestamp  1093360 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 33.4+ MB


In [29]:
df_tags.duplicated().sum()

0

df_tags has 1093360 observations and 4 columns:- userId, movieId, tag and timestamp. The userId, movieId and timestamp columns have an integer datatype while the tag column has an object datatype. Only the tag column has null values. There are no duplicate rows.

#### Memory Reduction
---
Memory usage can be reduced by changing each column to the smallest data type that can hold the range of values it contains.

**What happens without memory reduction?**

Every value is stored in a wider type than it needs, so more memory is reserved for each observation than necessary. With data sets of this size, that can exhaust the available memory and leave your machine in an unstable state.

**Columns that will be observed:**
* All numeric columns use the `int64` or `float64` data types.
* `timestamp` in the Train DataFrame is stored as `int64`, although its values fit in a smaller unsigned integer type.

Firstly, we need to get the maximum value in each column to see which data type suits it best.

In [30]:
print('---Movies Data Set---')
print(df_movies.max(numeric_only = True))
print('\n---IMDB Data Set---')
print(df_imdb.max(numeric_only = True))
print('\n---Train Data Set---')
print(df_train.max(numeric_only = True))
print('\n---Test Data Set---')
print(df_test.max(numeric_only = True))
print('\n---Links Data Set---')
print(df_links.max(numeric_only = True))

---Movies Data Set---
movieId    209171
dtype: int64

---IMDB Data Set---
movieId    131262.0
runtime       877.0
dtype: float64

---Train Data Set---
userId       1.625410e+05
movieId      2.091710e+05
rating       5.000000e+00
timestamp    1.574328e+09
dtype: float64

---Test Data Set---
userId     162541
movieId    209163
dtype: int64

---Links Data Set---
movieId      209171.0
imdbId     11170942.0
tmdbId       646282.0
dtype: float64


### <center>**Data Types**</center>

---
<table>
  <tr>
    <th>Data Type</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>int8</td>
    <td>Byte (-128 to 127)</td>
  </tr>
  <tr>
    <td>int16</td>
    <td>Integer (-32768 to 32767)</td>
  </tr>
  <tr>
    <td>int32</td>
    <td>Integer (-2147483648 to 2147483647)</td>
  </tr>
  <tr>
    <td>int64</td>
    <td>Integer (-9223372036854775808 to 9223372036854775807)</td>
  </tr>
  <tr>
    <td>uint8</td>
    <td>Unsigned integer (0 to 255)</td>
  </tr>
  <tr>
    <td>uint16</td>
    <td>Unsigned integer (0 to 65535)</td>
  </tr>
  <tr>
    <td>uint32</td>
    <td>Unsigned integer (0 to 4294967295)</td>
  </tr>
  <tr>
    <td>uint64</td>
    <td>Unsigned integer (0 to 18446744073709551615)</td>
  </tr>
  <tr>
    <td>float16</td>
    <td>Half precision float: sign bit, 5 bits exponent</td>
  </tr>
  <tr>
    <td>float32</td>
    <td>Single precision float: sign bit, 8 bits exponent</td>
  </tr>
  <tr>
    <td>float64</td>
    <td>Double precision float: sign bit, 11 bits exponent</td>
  </tr>
</table>

Based on the maximum values observed in each data frame and the ranges of the different data types, we can assign each column a new datatype.
#### **Data types to be assigned**
**Movies Data Set:**     movieId --> `uint32`

**IMDB Data Set:** movieId --> `uint32`  |  runtime --> `float16`

**Train Data Set:** userId --> `uint32`  |  movieId --> `uint32`  |  rating --> `float16`  |  timestamp --> `uint32`

**Test Data Set:** userId --> `uint32`  |  movieId --> `uint32`

**Links Data Set:** movieId --> `uint32`  |  imdbId --> `uint32`  |  tmdbId --> will be removed
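
The choice of unsigned integer types above can be double-checked programmatically. The sketch below is an illustration only: `smallest_unsigned_dtype` is a hypothetical helper, not part of this notebook, and the maximum values are copied from the output printed earlier (the timestamp maximum is rounded, as it is in that output).

```python
import numpy as np


def smallest_unsigned_dtype(max_value):
    """Return the narrowest NumPy unsigned integer dtype that can hold max_value."""
    if max_value < 0:
        raise ValueError("unsigned types cannot hold negative values")
    for dtype in (np.uint8, np.uint16, np.uint32, np.uint64):
        # np.iinfo reports the largest value each integer type can store
        if max_value <= np.iinfo(dtype).max:
            return np.dtype(dtype)
    raise ValueError(f"{max_value} does not fit in an unsigned 64-bit integer")


# Maximum values taken from the printed column maxima (timestamp is rounded)
column_maxima = {
    "movieId": 209171,
    "userId": 162541,
    "timestamp": 1574328000,
    "imdbId": 11170942,
}

for name, max_value in column_maxima.items():
    print(name, "-->", smallest_unsigned_dtype(max_value))
```

Every one of these maxima exceeds the `uint16` limit of 65535 but stays below the `uint32` limit, which is consistent with the assignments listed above.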

In [31]:
# Data Frame sizes before conversion
before_convert = (sys.getsizeof(df_movies) + 
                  sys.getsizeof(df_imdb) + 
                  sys.getsizeof(df_train) + 
                  sys.getsizeof(df_test) + 
                  sys.getsizeof(df_links)) / 1000000

In [32]:
# Movies DF
df_movies['movieId'] = df_movies['movieId'].astype('uint32')

# IMDB DF
df_imdb['movieId'] = df_imdb['movieId'].astype('uint32')
df_imdb['runtime'] = df_imdb['runtime'].astype('float16')

# Train DF
df_train['movieId'] = df_train['movieId'].astype('uint32')
df_train['userId'] = df_train['userId'].astype('uint32')
df_train['timestamp'] = df_train['timestamp'].astype('uint32')
df_train['rating'] = df_train['rating'].astype('float16')

# Test DF
df_test['movieId'] = df_test['movieId'].astype('uint32')
df_test['userId'] = df_test['userId'].astype('uint32')

# Links DF
df_links['movieId'] = df_links['movieId'].astype('uint32')
df_links['imdbId'] = df_links['imdbId'].astype('uint32')

In [33]:
# Data Frame sizes after conversion
after_convert = (sys.getsizeof(df_movies) + 
                  sys.getsizeof(df_imdb) + 
                  sys.getsizeof(df_train) + 
                  sys.getsizeof(df_test) + 
                  sys.getsizeof(df_links)) / 1000000

In [34]:
# Plotly template being used
template = 'plotly_dark'

# Colour being used for all plots
color = '#4590b8'

In [35]:
fig = go.Figure()

fig.add_trace(go.Indicator(
    value = after_convert,
    delta = {'reference': before_convert},
    gauge = {
        'axis': {
            'range': [None, 1000]
        },
        'threshold' : {
            'line': {
                'color': "red", 'width': 4
            }, 
            'thickness': 0.75, 
            'value': before_convert
        }
    },
    mode = "number+delta+gauge",
    title = {'text': "Memory Usage in MB"}))


fig.update_layout(
    template = template)

fig.update_traces(gauge_bar_color = color)

The combined memory usage has been reduced by approximately 221MB. This will make the transfer of data much faster and will reduce the amount of resources needed to process the data.

If the data frames are merged then it will increase the memory usage because of the increase in dimensions / columns. To avoid increasing the amount of resources being used, remove variables that you are no longer using by using `del <variable_name>`.
For example, if you merge the data frames and store it in a new variable then there is no need to keep the individual data frame variables.
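
As a small illustration of this advice, the sketch below merges two toy data frames and then releases the originals. The frames `ratings` and `titles` are made-up stand-ins, not the notebook's data, and the explicit `gc.collect()` call is optional.

```python
import gc

import pandas as pd

# Toy stand-ins for two of the larger data frames
ratings = pd.DataFrame({"movieId": [1, 2], "rating": [4.0, 3.5]})
titles = pd.DataFrame({"movieId": [1, 2],
                       "title": ["Toy Story (1995)", "Jumanji (1995)"]})

# Keep only the merged result
merged = ratings.merge(titles, how="left", on="movieId")

# Drop the references to the inputs so their memory can be reclaimed
del ratings, titles
gc.collect()
```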

### Most Common Genres
---
We will be observing the frequency of genres. As we saw above, the genres column holds multiple genres separated by a vertical bar `|`, so we first have to split the genres and store them in a list.

In [36]:
# Splitting the genres
movie_genres = df_movies['genres'].apply(lambda x: x.split('|'))

list_genres = []
for genre_list in movie_genres:
    for genre in genre_list:
        list_genres.append(genre)

# Convert the list into a Series to get value count
list_genres = pd.Series(list_genres)

In [37]:
def series_to_df(series, column1, column2 = 'total'):
    """
        * Converts series into a count DataFrame
    """
    series_count = series.value_counts()
    series_df = pd.DataFrame(columns = [column1, column2])
    series_df[column1] = list(series_count.index)
    series_df[column2] = series_count.values
    
    return series_df

In [38]:
genre_df = series_to_df(list_genres, 'genre')
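
The split-and-count steps above can also be sketched without explicit loops. This is an alternative under the assumption that the installed pandas version provides `Series.explode`; the frame `toy_movies` is made-up example data.

```python
import pandas as pd

# Made-up data in the same pipe-separated format as the genres column
toy_movies = pd.DataFrame({
    "genres": ["Adventure|Comedy", "Comedy|Romance", "Comedy"]
})

genre_counts = (
    toy_movies["genres"]
    .str.split("|")          # one list of genres per movie
    .explode()               # one row per (movie, genre) pair
    .value_counts()          # frequency of each genre
    .rename_axis("genre")
    .reset_index(name="total")
)
```

The result has the same two columns, `genre` and `total`, that `series_to_df` produces.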

In [39]:
fig = go.Figure()

fig.add_trace(go.Bar(
    x = genre_df['genre'],
    y = genre_df['total'],
    text = ['{:.1f} %'.format((val / genre_df['total'].sum() * 100)) for val in (genre_df['total'])],
    textposition = 'auto',
    textfont = dict(color = '#FFFFFF')
))

fig.update_layout(
    title = {
        'text': 'Most Common Genres',
        'font': {
            'size': 25
        }
    },
    title_x = 0.5,
    template = template
)

fig.update_xaxes(
    title = {
        'text': 'Genres'
    }
)

fig.update_yaxes(
    title = {
        'text': 'Frequency'
    }
)

fig.update_traces(marker_color = color)

<li> <code> Drama </code> and <code> Comedy </code> have the highest occurrence in the dataset.

<li> The <code> Musical </code>, <code> Film-Noir </code> and <code> IMAX </code> genres have the lowest occurrence; each accounts for under 1% of all genre labels.

### Movies analysis
---
**We will be displaying the following figures:**
* Total Number of Movies
* Total Number of Users
* Average Rating for Movies
* Average Runtime for Movies


In [40]:
fig = go.Figure()

fig.add_trace(go.Indicator(
    mode = "number",
    value = df_movies['movieId'].nunique(),
    domain = {'row': 0, 'column': 0}, 
    title = 'Total Movies'))

fig.add_trace(go.Indicator(
    mode = "number",
    value = df_train['userId'].nunique(),
    domain = {'row': 1, 'column': 0}, 
    title = 'Total Users'))


fig.add_trace(go.Indicator(
    mode = "number",
    value = np.mean(np.array(df_train['rating'])),
    domain = {'row': 0, 'column': 1}, 
    title = 'Average Rating for Movies'))

fig.add_trace(go.Indicator(
    mode = "number",
    value = np.mean(np.array(df_imdb['runtime'].dropna())),
    domain = {'row': 1, 'column': 1}, 
    title = 'Average Runtime for Movies'))

fig.update_layout(
    grid = {'rows': 2, 'columns': 2, 'pattern': "independent"}, 
    template = template)

<li> There are about <code> 62400 movies </code> and about <code> 162500 users </code>.

<li> The average rating for movies is about <code> 3.53 </code> and the average runtime for movies is approximately <code> 100.3 </code> minutes.

### Rating Distribution
---
We will be looking at how ratings are distributed

In [41]:
rating_df = series_to_df(df_train['rating'], 'rating')

In [42]:
fig = go.Figure()

fig.add_trace(go.Bar(
    x = rating_df['rating'],
    y = rating_df['total'],
    text = ['{:.1f} %'.format((val / rating_df['total'].sum() * 100)) for val in (rating_df['total'])],
    textposition = 'auto',
    textfont = dict(color = '#FFFFFF')
))

fig.update_layout(
    title = {
        'text': 'Rating Distribution',
        'font': {
            'size': 25
        }
    },
    title_x = 0.5,
    template = template
)

fig.update_xaxes(
    title = {
        'text': 'Ratings'
    }
)

fig.update_yaxes(
    title = {
        'text': 'Count Per Rating'
    }
)

fig.update_traces(marker_color = color)

* `26.5%` of all ratings are `4`, which makes it the most common rating.
* The distribution is skewed towards the higher ratings.
* The movies in the dataset are generally well received, as most ratings fall in the range from 3 to 5.
* Rating alone is not the right measure of popularity, as one movie may be rated 5 by only a few users while another is rated 4.8 by many users.
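
One common way to account for both the average rating and the number of ratings is a weighted average that pulls movies with few ratings towards the global mean. The sketch below is an illustration, not a method used in this notebook; the function name, the threshold of 20 ratings and the two example movies are assumptions.

```python
def weighted_rating(avg_rating, n_ratings, global_mean, min_ratings):
    """Shrink a movie's average rating towards the global mean.

    The fewer ratings a movie has relative to min_ratings, the closer
    the result stays to global_mean.
    """
    weight = n_ratings / (n_ratings + min_ratings)
    return weight * avg_rating + (1 - weight) * global_mean


GLOBAL_MEAN = 3.53   # average rating reported earlier in this notebook
MIN_RATINGS = 20     # hypothetical threshold

few_votes = weighted_rating(5.0, 3, GLOBAL_MEAN, MIN_RATINGS)
many_votes = weighted_rating(4.8, 5000, GLOBAL_MEAN, MIN_RATINGS)
print(round(few_votes, 2), round(many_votes, 2))
```

With these example numbers, the movie rated 4.8 by many users scores higher than the movie rated 5 by only three users.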

### Distribution of the number of ratings per user
---
Because there are so many users, we cannot look at each of them individually. Instead, we will visualise the distribution of the total number of ratings per user, with every user who gave more than 50 ratings counted at 50.

In [43]:
user_p = df_train.groupby('userId')['rating'].count().clip(upper = 50)
fig = go.Figure()

fig.add_trace(go.Histogram(x = user_p.values,
                     name = 'rating',
                     xbins = dict(start = 0,
                                  end = 50,
                                  size = 1)))

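fig.update_layout(
    title = {
        'text': 'Number of Ratings Per User (Clipped at 50)',
        'font': {
            'size': 25
        }
    },
    title_x = 0.5,
    template = template,
    bargap = 0.2
)

# The histogram bins the per-user rating counts, so the x-axis is the
# number of ratings and the y-axis is the number of users in each bin
fig.update_xaxes(
    title = {
        'text': 'Number of Ratings'
    }
)

fig.update_yaxes(
    title = {
        'text': 'Number of Users'
    }
)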

fig.update_traces(marker_color = color)

It appears that some users have rated only a few movies, which implies that not all users contribute equally when suggesting movie recommendations to other users. This is only one perspective; the same pattern also applies to movies, some of which received only one or very few ratings. This can be caused by a lack of popularity, or by a movie receiving a bad rating from its first user and then being overlooked by other users.
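
To make the effect of `clip(upper = 50)` in the plotting cell concrete, the sketch below applies the same steps to made-up data with a smaller limit of 2. The frame `toy_train` is an assumption for illustration.

```python
import pandas as pd

# Made-up ratings: user 1 rated three movies, user 2 one, user 3 two
toy_train = pd.DataFrame({
    "userId": [1, 1, 1, 2, 3, 3],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

# Number of ratings given by each user
ratings_per_user = toy_train.groupby("userId")["rating"].count()

# Counts above the limit are replaced by the limit itself, so heavy
# raters all land in the last histogram bin
clipped = ratings_per_user.clip(upper=2)
```

Here `ratings_per_user` holds the counts 3, 1 and 2, while `clipped` holds 2, 1 and 2.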

### Most Ratings per Movie
---
We will be looking at the top 10 movies with the most total ratings (this is only the total number of ratings given, not the average ratings).

In [44]:
# Merge the Movies and Train DataFrame to get names of movies
movies_df = df_train.merge(df_movies, how = 'left', on = 'movieId')

In [45]:
# Get top 10 movies with the most ratings
total_ratings = movies_df.groupby('title')['rating'].count().sort_values(ascending = False)[:10]

# Create a DataFrame for the total ratings
total_ratings_df= pd.DataFrame(columns = ['movies', 'total'])
total_ratings_df['movies'] = list(total_ratings.index)
total_ratings_df['total'] = total_ratings.values

In [46]:
"""
    * Plotly adds data to plots in a stack manner
    * Data must be in ascending order with the lowest total first
    * This will allow the highest total to appear at the top
"""

movie_p = total_ratings_df.sort_values('total', ascending = True)
fig = go.Figure()

fig.add_trace(go.Bar(x = movie_p['total'],
                     y = movie_p['movies'],
                     orientation = 'h'
                    
                    ))
fig.update_layout(
    title = {
        'text': 'Top 10 Most Rated Movies',
        'font': {
            'size': 25
        }
    },
    title_x = 0.5,
    template = template,
    bargap = 0.2
)

fig.update_xaxes(
    title = {
        'text': 'Total Ratings'
    }
)

fig.update_yaxes(
    title = {
        'text': 'Movies'
    }
)

fig.update_traces(marker_color = color)

* The movie with the most ratings is **Shawshank Redemption, The (1994)**. The movie is about a banker who is convicted of the murder of his wife and her lover and is sentenced to two consecutive life sentences at the Shawshank State Prison. It is based on **Rita Hayworth and Shawshank Redemption** by Stephen King, it is claimed to be amongst the best movies ever made in world cinema and it has been applauded by many film critics.
* **Forrest Gump (1994)** is about a man with a low IQ who recounts the early years of his life, when he found himself in the middle of key historical events. Its title character has been voted the greatest film character of all time, beating James Bond and Scarlett O'Hara in the process.

### Highest Average Rated Movies
---
We will be looking at the top 10 movies with the highest average rating.

In [47]:
"""
    * There are movies with only one rating which is impractical to include
    * The threshold being used is 20
    * All movies with a rating total of less than or equal to 20 will be filtered out
"""

min_total_ratings = 20
filter_ = movies_df['movieId'].value_counts() > min_total_ratings
filter_ = filter_[filter_].index.tolist()

df_filtered = movies_df[movies_df['movieId'].isin(filter_)]

In [48]:
# Get top 10 movies with the most average ratings
avg_ratings = df_filtered.groupby('title')['rating'].mean().sort_values(ascending = False)[:10]

# Create a DataFrame for the avg ratings
avg_ratings_df= pd.DataFrame(columns = ['movies', 'average'])
avg_ratings_df['movies'] = list(avg_ratings.index)
avg_ratings_df['average'] = avg_ratings.values
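
The filter-then-average steps above can also be sketched in a single pass with named aggregation. This is an alternative on made-up data, and it differs from the cells above in one respect: it counts ratings per `title` rather than per `movieId`.

```python
import pandas as pd

# Made-up ratings for three movies
toy_ratings = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "rating": [4.0, 5.0, 4.5, 5.0, 3.0, 4.0],
})

# The notebook uses a threshold of 20; it is lowered here for the toy data
min_total_ratings = 1

# Compute the rating count and the mean rating of every movie together
stats = toy_ratings.groupby("title")["rating"].agg(total="count", average="mean")

# Keep movies above the threshold and rank them by average rating
top_rated = (
    stats[stats["total"] > min_total_ratings]
    .sort_values("average", ascending=False)
    .head(10)
)
```

Movie `B` has a perfect average but only one rating, so it is excluded from `top_rated`.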

In [49]:
avg_df = avg_ratings_df.sort_values('average', ascending = True)
fig = go.Figure()

fig.add_trace(go.Bar(x = avg_df['average'],
                     y = avg_df['movies'],
                     orientation = 'h'
                    
                    ))
fig.update_layout(
    title = {
        'text': 'Top 10 Most Average Rated Movies',
        'font': {
            'size': 25
        }
    },
    title_x = 0.5,
    template = template,
    bargap = 0.2
)

fig.update_xaxes(
    title = {
        'text': 'Average Ratings'
    }
)

fig.update_yaxes(
    title = {
        'text': 'Movies'
    }
)

fig.update_traces(marker_color = color)

TypeError: unsupported datatype in numpy array


Figure({
    'data': [{'marker': {'color': '#4590b8'},
              'orientation': 'h',
              'type': 'bar',
              'x': array([4.293, 4.297, 4.312, 4.34 , 4.34 , 4.4  , 4.418, 4.457, 4.473, 4.516],
                         dtype=float16),
              'y': array(['Blue Planet II (2017)', 'Human Planet (2011)', 'Godfather, The (1972)',
                          'The Blue Planet (2001)', 'The Dawn Wall (2018)',
                          'Band of Brothers (2001)', 'Shawshank Redemption, The (1994)', 'Cosmos',
                          'Planet Earth (2006)', 'Planet Earth II (2016)'], dtype=object)}],
    'layout': {'bargap': 0.2,
               'template': '...',
               'title': {'font': {'size': 25}, 'text': 'Top 10 Most Average Rated Movies', 'x': 0.5},
               'xaxis': {'title': {'text': 'Average Ratings'}},
               'yaxis': {'title': {'text': 'Movies'}}}
})

### Number of ratings per year
---
We will look at the total number of ratings made in each year.

**Note:** Each count is the total of all ratings made in that year.

In [50]:
def timestamp_to_date(timestamps):
    """
        * Convert timestamps to dates 
        * Only the year is extracted
        * Get the count for each year
        
        Parameters:
        ===========================
        * timestamps: Series or list containing timestamps
        
        Returns:
        ===========================
        * DataFrame sorted by year
    """
    # Convert the Unix timestamps (in seconds) to datetimes and keep only the year
    years = pd.to_datetime(pd.Series(timestamps), unit = 's').dt.year

    # Count the ratings per year and store the counts in a DataFrame
    year_counts = years.value_counts()
    df_years = pd.DataFrame(columns = ['years', 'total'])
    df_years['years'] = list(year_counts.index)
    df_years['total'] = year_counts.values
    
    return df_years.sort_values(by = 'years')

In [51]:
year_counts = timestamp_to_date(df_train['timestamp'])

In [52]:
fig = go.Figure()

fig.add_trace(
    go.Scatter(x = year_counts['years'], 
               y = year_counts['total'], 
               mode='lines+markers'))

fig.update_layout(
    title = {
        'text': 'Number of Ratings per year',
        'font': {
            'size': 25
        }
    },
    title_x = 0.5,
    template = template,
)

fig.update_xaxes(
    title = {
        'text': 'Years'
    }
)

fig.update_yaxes(
    title = {
        'text': 'Number of Ratings'
    }
)

fig.update_traces(marker_color = color)

<blockquote>The most ratings were received in 2016, with a total of <b>702 962</b>. The year 1998 received the fewest ratings, with a total of <b>108 811</b>. The number of ratings increased sharply after 2014 and started declining after 2016.</blockquote>

### Top 10 Actors With Most Movie Appearances
---
We will extract the top 10 actors with the most movie appearances.

In [53]:
# Extract all actors
actors_ = []

for actors in df_imdb['title_cast'].dropna():
    for actor in actors.split('|'):
        actors_.append(actor)

In [54]:
actors_ = pd.Series(actors_)

actors_df= series_to_df(actors_, 'actors', 'total movies')

In [55]:
actors_df = actors_df[:10].sort_values('total movies', ascending = True)

In [56]:
fig = go.Figure()

fig.add_trace(go.Bar(x = actors_df['total movies'],
                     y = actors_df['actors'],
                     orientation = 'h'
                    
                    ))
fig.update_layout(
    title = {
        'text': 'Top 10 Actors With Most Movie Appearances',
        'font': {
            'size': 25
        }
    },
    title_x = 0.5,
    template = template,
    bargap = 0.2
)

fig.update_xaxes(
    title = {
        'text': 'Total Movies'
    }
)

fig.update_yaxes(
    title = {
        'text': 'Actors'
    }
)

fig.update_traces(marker_color = color)

* **Samuel L. Jackson** shows up, nails his scenes, and then takes off, leaving less in-demand collaborators and co-stars to finish the film while he moves on to the next one. This may explain why he appears in so many movies.
* **Steve Buscemi** appears in many comedies, including Adam Sandler films, where he is a regular cast member.

### Top 10 Directors with Most Directed Movies
---
We will extract the top 10 directors who directed the most movies.

In [57]:
directors_ = df_imdb['director'].dropna()

directors_df = series_to_df(directors_, 'directors', 'total movies')

In [58]:
directors_df = directors_df[1:11].sort_values('total movies', ascending = True)

In [59]:
fig = go.Figure()

fig.add_trace(go.Bar(x = directors_df['total movies'],
                     y = directors_df['directors'],
                     orientation = 'h'
                    
                    ))
fig.update_layout(
    title = {
        'text': 'Top 10 Directors with Most Directed Movies',
        'font': {
            'size': 25
        }
    },
    title_x = 0.5,
    template = template,
    bargap = 0.2
)

fig.update_xaxes(
    title = {
        'text': 'Total Movies'
    }
)

fig.update_yaxes(
    title = {
        'text': 'Directors'
    }
)

fig.update_traces(marker_color = color)

* **Luc Besson** is a French film director, screenwriter, and producer. He won Best Director and Best French Director for his sci-fi action film The Fifth Element (1997). He wrote and directed the 2014 sci-fi action film Lucy and the 2017 space opera film Valerian and the City of a Thousand Planets. As writer, director, or producer, Besson has so far been involved in the creation of more than 50 films.
* **Stephen King** is the author of many books that have been adapted for the screen. He is best known for his book and movie It.
* **William Shakespeare**'s plays have been credited on 1,500 movies, including those in production but not yet released.

### Top 5 genres (movies) produced per year 
---
Using the 5 most frequent genres, we will see how many movies were released in each of those genres over time.

In [60]:
def get_years(df, genre):
    """
        Gets the years of when movies were released in a specific genre
        
        Parameters:
        ===============================================================
        * df: DataFrame 
        * genre: genre to be used as a filter
        
        Return:
        ===============================================================
        * Series containing the years for all movies in the specified genre
        
    """
    genre_filter = df[df['genres'].str.contains(genre)]['title']
    
    # Find every four-digit year in parentheses, e.g. "Toy Story (1995)".
    # Taking the last match also covers titles that carry an alternative
    # name in parentheses before the year; titles without a year are dropped.
    years = genre_filter.str.findall(r'\((\d{4})\)').str[-1].dropna()
    return years.reset_index(drop = True)

In [61]:
# Get the years for each genre
drama = get_years(df_movies, 'Drama')
comedy = get_years(df_movies, 'Comedy')
thriller = get_years(df_movies, 'Thriller')
romance = get_years(df_movies, 'Romance')
action = get_years(df_movies, 'Action')

In [62]:
# Convert into DataFrame with years counts
drama = series_to_df(drama, 'years', 'count').sort_values(by = 'years')
comedy = series_to_df(comedy, 'years', 'count').sort_values(by = 'years')
thriller = series_to_df(thriller, 'years', 'count').sort_values(by = 'years')
romance = series_to_df(romance, 'years', 'count').sort_values(by = 'years')
action = series_to_df(action, 'years', 'count').sort_values(by = 'years')

In [63]:
fig = go.Figure()

fig.add_trace(
    go.Scatter(x = drama['years'].astype('int32'), 
               y = drama['count'], 
               mode='lines', 
               name = 'Drama'))

fig.add_trace(
    go.Scatter(x = comedy['years'].astype('int32'), 
               y = comedy['count'], 
               mode='lines', 
               name = 'Comedy'))


fig.add_trace(
    go.Scatter(x = thriller['years'].astype('int32'), 
               y = thriller['count'], 
               mode='lines', 
               name = 'Thriller'))

fig.add_trace(
    go.Scatter(x = romance['years'].astype('int32'), 
               y = romance['count'], 
               mode='lines', 
               name = 'Romance'))

fig.add_trace(
    go.Scatter(x = action['years'].astype('int32'), 
               y = action['count'], 
               mode='lines', 
               name = 'Action'))

fig.update_layout(
    title = {
        'text': 'Number of Movies produced per year for the top 5 genres',
        'font': {
            'size': 25
        }
    },
    title_x = 0.5,
    template = template
)

fig.update_xaxes(
    title = {
        'text': 'Years'
    }
)

fig.update_yaxes(
    title = {
        'text': 'Number of Movies'
    }
)


<a id="four"></a>
## 4. Feature Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

<a id="five"></a>
## 5. Preprocessing the test data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Preprocessing Test data ⚡ |
| :--------------------------- |
| In this section, you are required to ensure that all the data transformations applied to the train dataset have also been applied to the test dataset. |

---

<a id="six"></a>
## 6. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more regression models that are able to accurately predict the recommended movie. |

---

<a id="model"></a>
### Make a Movie Recommendation Model
---
* Collaborative Filtering

Collaborative Filtering is the most common technique used when it comes to building intelligent recommender systems that can learn to give better recommendations as more information about users is collected. This is a technique that can filter out items that a user might like on the basis of reactions by similar users.

It works by searching a large group of people and finding a smaller set of users with tastes similar to a particular user. It looks at the items they like and combines them to create a ranked list of suggestions.

There are different types of algorithms in the family of collaborative filtering.

    * Model Based
    
Model-based approaches involve a step to reduce or compress the large but sparse user-item matrix. A basic understanding of dimensionality reduction helps in following this step.

**Dimensionality Reduction**

In the user-item matrix, there are two dimensions:

>1. The number of users
>2. The number of items

If the matrix is mostly empty, reducing dimensions can improve the performance of the algorithm in terms of both space and time. You can use various methods like matrix factorization or autoencoders to do this.

Matrix factorization can be seen as breaking down a large matrix into a product of smaller ones. This is similar to the factorization of integers, where 12 can be written as 6 x 2 or 4 x 3. In the case of matrices, a matrix A with dimensions m x n can be reduced to a product of two matrices X and Y with dimensions m x p and p x n respectively.
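The factorization described above can be sketched with a truncated singular value decomposition in NumPy. This is only an illustration under stated assumptions: the toy matrix `A`, its values, and the choice of `p = 2` are made up, and the matrix is treated as fully observed. It is not the procedure used by the `SVD` model later in this notebook, which learns its factors from the observed ratings only.

```python
import numpy as np

# Toy rating matrix A: m = 4 users (rows) by n = 5 movies (columns).
# The values are invented for illustration.
A = np.array([
    [5.0, 4.0, 1.0, 1.0, 2.0],
    [4.0, 5.0, 1.0, 2.0, 1.0],
    [1.0, 1.0, 5.0, 4.0, 5.0],
    [2.0, 1.0, 4.0, 5.0, 4.0],
])

p = 2  # number of latent factors to keep (an arbitrary choice here)

# Singular value decomposition: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the p largest singular values and split them between the two factors
X = U[:, :p] * np.sqrt(s[:p])             # shape (m, p): user factors
Y = np.sqrt(s[:p])[:, None] * Vt[:p, :]   # shape (p, n): movie factors

# The product of the two small matrices is a rank-p approximation of A
A_approx = X @ Y

print(X.shape, Y.shape, A_approx.shape)
print(np.round(A_approx, 2))
```

Here `X` plays the role of the m x p matrix and `Y` the p x n matrix. The approximation stores m x p + p x n numbers instead of m x n, which is where the savings in space come from when p is small.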

    * Memory Based
    
Memory-based algorithms apply statistical techniques to the entire dataset to calculate the predictions.

To find the rating R that a user U would give to an item I, the approach includes:

> 1. Finding users similar to U who have rated the item I
> 2. Calculating the rating R based on the ratings of the users found in the previous step

Memory-based approaches can be further classified as **user-based** and **item-based**. When the rating matrix is used to find similar users based on the ratings they give, the approach is called user-based or user-user collaborative filtering. When the rating matrix is used to find similar items based on the ratings given to them by users, the approach is called item-based or item-item collaborative filtering.

The two approaches are mathematically quite similar, but there is a conceptual difference between the two. Here’s how the two compare:

*User-based*: For a user U, with a set of similar users determined based on rating vectors consisting of given item ratings, the rating for an item I, which hasn’t been rated, is found by picking out N users from the similarity list who have rated the item I and calculating the rating based on these N ratings.

*Item-based*: For an item I, with a set of similar items determined based on rating vectors consisting of received user ratings, the rating by a user U, who hasn’t rated it, is found by picking out N items from the similarity list that have been rated by U and calculating the rating based on these N ratings.
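The user-based procedure can be sketched in a few lines of NumPy. This is a minimal illustration, not the model used in this notebook: the toy `ratings` matrix, the helper names `cosine_similarity` and `predict_user_based`, the use of cosine similarity, and the similarity-weighted average are all assumptions chosen for the example. A value of 0 stands for "not rated".

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: movies); 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # user 0 has not rated movie 2
    [4.0, 5.0, 2.0, 1.0],   # user 1
    [1.0, 2.0, 5.0, 4.0],   # user 2
    [5.0, 5.0, 1.0, 0.0],   # user 3 has not rated movie 3
])

def cosine_similarity(a, b):
    """Cosine similarity over the entries that are rated in both vectors."""
    both = (a > 0) & (b > 0)
    if not both.any():
        return 0.0
    a, b = a[both], b[both]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_user_based(matrix, user, item, n_neighbours=2):
    """Predict matrix[user, item] from the N most similar rows that have a value for item."""
    # Step 1: other users who have rated the item
    candidates = [u for u in range(matrix.shape[0])
                  if u != user and matrix[u, item] > 0]
    if not candidates:
        return float('nan')
    sims = np.array([cosine_similarity(matrix[user], matrix[u]) for u in candidates])

    # Step 2: keep the N most similar users and average their ratings,
    # weighting each rating by the similarity of the user who gave it
    top = np.argsort(sims)[::-1][:n_neighbours]
    weights = sims[top]
    neighbour_ratings = np.array([matrix[candidates[i], item] for i in top])
    if weights.sum() == 0:
        return float(neighbour_ratings.mean())
    return float(weights @ neighbour_ratings / weights.sum())

# User-based: rating of user 0 for movie 2
user_based = predict_user_based(ratings, user=0, item=2)

# Item-based: the same routine on the transposed matrix compares movies instead of users
item_based = predict_user_based(ratings.T, user=2, item=0)

print(round(user_based, 2), round(item_based, 2))
```

Transposing the matrix swaps the roles of users and items, which is why the two approaches are mathematically so close: the item-based call finds the movies most similar to movie 2 among those user 0 has rated and averages user 0's ratings for them.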



#### Model based approach

In this approach, CF models are developed using machine learning algorithms to predict a user's rating of unrated items.

In [None]:
# Keep only movies that were rated more than 100 times
min_movie_ratings = 100
filter_movies = df_train['movieId'].value_counts() > min_movie_ratings
filter_movies = filter_movies[filter_movies].index.tolist()

df_new = df_train[(df_train['movieId'].isin(filter_movies))]
print('The original data frame shape:\t{}'.format(df_train.shape))
print('The new data frame shape:\t{}'.format(df_new.shape))

In [None]:
# Sort by timestamp to get the most recent first
df_demo = df_new.sort_values(by = 'timestamp', ascending = False)

In [None]:
reader = Reader(rating_scale = (0.5, 5.0))

data = Dataset.load_from_df(df_demo[['userId', 'movieId', 'rating']], reader)

In [None]:
trainset = data.build_full_trainset()

In [None]:
svd_ = SVD()

In [None]:
svd_.fit(trainset)

In [None]:
testset = trainset.build_testset()

In [None]:
predictions = svd_.test(testset)

In [None]:
cross_validate(svd_, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

In [13]:
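# SVD.predict scores one (user, item) pair at a time, so loop over the test rows
# (assumes df_test has 'userId' and 'movieId' columns, like df_train)
predictions = [svd_.predict(row.userId, row.movieId).est
               for row in df_test.itertuples(index = False)]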

<a id="seven"></a>
## 7. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

<a id="eight"></a>
## 8. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---