# **Overview: Movie Recommendation** 

![](https://about.netflix.com/images/meta/netflix-symbol-black.png)

#### In today's technology-driven world, recommender systems are socially and economically critical for helping individuals make appropriate choices about the content they engage with daily. This is especially true of movie recommendations, where intelligent algorithms help viewers find great titles among tens of thousands of options. Ever wondered how Netflix, Amazon Prime, Showmax, Disney and the like somehow know what to recommend to you?

#### It's not just a guess drawn out of a hat. There is an algorithm behind it.


# **Problem Statement**

#### With this context, EDSA is challenging you to construct a recommendation algorithm based on content or collaborative filtering, capable of accurately predicting how a user will rate a movie they have not yet viewed based on their historical preferences.

#### What value is achieved through building a functional recommender system? Providing an accurate and robust solution to this challenge has immense economic potential: users of the system are exposed to content they would like to view or purchase, which generates revenue and platform affinity.


#### This dataset consists of several million 5-star ratings obtained from users of the online MovieLens movie recommendation service. The MovieLens dataset has long been used by industry and academic researchers to improve the performance of explicitly-based recommender systems, and now you get to as well!

#### For this Predict, we'll be using a special version of the MovieLens dataset which has been enriched with additional data and resampled for fair evaluation purposes.

<a id="cont"></a>

# **Table of Contents**

<details>
<summary><a href=#one>1. Importing Packages</a></summary>
<br>
<a href=#one.one>1.1 Importing python packages that will be used in the notebook </a>
</details>

<br>

<details>
<summary><a href=#two>2. Loading Data</a></summary>
<br>
<a href=#two.one>2.1 Loading the Train and Test datasets</a>
</details>

<br>

<details>
<summary><a href=#three>3. Exploratory Data Analysis (EDA)</a></summary>
<br>
<a href=#three.one>3.1 Why is EDA important?</a>
<br>
<a href=#three.two>3.2 Looking at the shape of the datasets</a>
<br>
<a href=#three.three>3.3 Looking at the data types in the dataframes</a>
<br>
<a href=#three.four>3.4 Looking for null values in the datasets</a>
</details>

<br>

<details>
<summary><a href=#four>4. Data Engineering</a></summary>
<br>
<a href=#four.one>4.1 A copy of each dataset </a>
<br>
<a href=#four.two>4.2 Function to make all text lowercase </a>
<br>
<a href=#four.three>4.3 Function to remove URLs </a>
<br>
<a href=#four.four>4.4 Removing special characters </a>
<br>
<a href=#four.five>4.5 Removing punctuation </a>
<br>
<a href=#four.six>4.6 Removing digits</a>
<br>
<a href=#four.seven>4.7 Removing stopwords </a>
<br>
<a href=#four.eight>4.8 Tokenization </a>
<br>
<a href=#four.nine>4.9 Lemmatization </a>
<br>
<a href=#four.ten>4.10 Datasets after cleaning </a>
<br>
<a href=#four.eleven>4.11 Analysis of data after cleaning </a>
</details>

<br>

<details>
<summary><a href=#five>5. Modelling</a></summary>
<br>
<a href=#five.one>5.1 Splitting the x variable from the target variable </a>
<br>
<a href=#five.two>5.2 Turning text into something the model can read </a>
<br>
<a href=#five.three>5.3 Splitting the data into Train and validation set </a>
<br>
<a href=#five.four>5.4 Training the model and evaluating the model with the validation set </a>
<br>
<a href=#five.five>5.5 Logistic Regression model </a>
<br>
<a href=#five.six>5.6 Random Forest model </a>
<br>
<a href=#five.seven>5.7 Naive model</a>
<br>
<a href=#five.eight>5.8 SVC model </a>
<br>
<a href=#five.nine>5.9 KNN model </a>
<br>
<a href=#five.ten>5.10 Test set preparation and saving the best model </a>
<br>
<a href=#five.eleven>5.11 Test predictions </a>
<br>
<a href=#five.twelve>5.12 CSV conversion </a>
</details>

<br>

<details>
<summary><a href=#six>6. Model performance</a></summary>
<br>
<a href=#six.one>6.1 What is performance analysis in machine learning</a>
<br>
<a href=#six.two>6.2 Evaluation of model</a>
<br>
<a href=#six.three>6.3 Assessment of the F-1 score according to both Train and Test sets </a>
<br>
<a href=#six.four>6.4 Analysing the dataframe</a>
<br>
<a href=#six.five>6.5 Plotting the F-1 Test performance from the Test data </a>
<br>
<a href=#six.six>6.6 Confusion matrix of the various models </a>
</details>

<br>

<details>
<summary><a href=#seven>7. Model Explanations</a></summary>
<br>
<a href=#seven.one>7.1 Best performing model</a>
<br>
<a href=#seven.two>7.2 Conclusion</a>
</details>

 
 <a id="one"></a>
 
# **1. Importing Packages**
<a href=#cont>Back to Table of Contents</a>

---
    
| *Description: Importing Packages* |
| :--------------------------- |
| In this section, all the packages and libraries needed for the analysis and modelling are imported. |

---

### <a id="one.one"></a>1.1 *Importing python packages that will be used in the notebook.*

In [2]:
# Libraries for data loading, data manipulation and data visualisation
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from pandas_profiling import ProfileReport


# Libraries for data preparation and model building
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import (f1_score, precision_score, recall_score,
                             accuracy_score, confusion_matrix,
                             classification_report)


# Libraries for text processing
import re
import string
import urllib

import nltk
from nltk import TreebankWordTokenizer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
import spellchecker
import autocorrect
from textblob import TextBlob

# Download the NLTK resources before they are first used
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

# Build the stop word set only after the 'stopwords' corpus is available
STOPWORDS = set(stopwords.words('english'))

# Display options for long and wide dataframes
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_colwidth', 400)

# suppress cell warnings
warnings.filterwarnings("ignore")

# Setting global constants to ensure notebook results are reproducible
# PARAMETER_CONSTANT = ###

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/caitlinmclaren/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/caitlinmclaren/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/caitlinmclaren/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


<a id="two"></a>

 # **2. Loading the Data**
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| *Description: Loading the data* |
| :--------------------------- |
| In this section `train.csv` and `test.csv` are loaded into the notebook, together with the supporting genome scores, genome tags, IMDb, links, movies and tags datasets. |

### <a id="two.one"></a> 2.1 *Loading all the data sets.*

In [3]:
# Loading the Train dataset
df_train = pd.read_csv('train.csv')
df_train.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,5163,57669,4.0,1518349992
1,106343,5,4.5,1206238739
2,146790,5459,5.0,1076215539
3,106362,32296,2.0,1423042565
4,9041,366,3.0,833375837


In [4]:
# Loading Test dataset
df_test = pd.read_csv('test.csv')
df_test.head()

Unnamed: 0,userId,movieId
0,1,2011
1,1,4144
2,1,5767
3,1,6711
4,1,7318


In [6]:
# Loading Genome_Scores dataset
df_genome_score = pd.read_csv('genome_scores.csv')
df_genome_score.head()

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02875
1,1,2,0.02375
2,1,3,0.0625
3,1,4,0.07575
4,1,5,0.14075


In [7]:
# Loading Genome_Tags dataset
df_genome_tags = pd.read_csv('genome_tags.csv')
df_genome_tags.head()

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s


In [8]:
# Loading imdb_data dataset
df_imdb_data = pd.read_csv('imdb_data.csv')
df_imdb_data.head()

Unnamed: 0,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wallace Shawn|John Ratzenberger|Annie Potts|John Morris|Erik von Detten|Laurie Metcalf|R. Lee Ermey|Sarah Freeman|Penn Jillette|Jack Angel|Spencer Aste,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation
1,2,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bradley Pierce|Bonnie Hunt|Bebe Neuwirth|David Alan Grier|Patricia Clarkson|Adam Hann-Byrd|Laura Bell Bundy|James Handy|Gillian Barber|Brandon Obray|Cyrus Thiedeke|Gary Joseph Thorup,Jonathan Hensleigh,104.0,"$65,000,000",board game|adventurer|fight|game
2,3,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Margret|Burgess Meredith|Daryl Hannah|Kevin Pollak|Katie Sagona|Ann Morgan Guilbert|James Andelin|Marcus Klemp|Max Wright|Cheryl Hawker|Wayne A. Evenson|Allison Levine,Mark Steven Johnson,101.0,"$25,000,000",boat|lake|neighbor|rivalry
3,4,Whitney Houston|Angela Bassett|Loretta Devine|Lela Rochon|Gregory Hines|Dennis Haysbert|Mykelti Williamson|Michael Beach|Leon|Wendell Pierce|Donald Faison|Jeffrey D. Sams|Jazz Raycole|Brandon Hammond|Kenya Moore,Terry McMillan,124.0,"$16,000,000",black american|husband wife relationship|betrayal|mother son relationship
4,5,Steve Martin|Diane Keaton|Martin Short|Kimberly Williams-Paisley|George Newbern|Kieran Culkin|BD Wong|Peter Michael Goetz|Kate McGregor-Stewart|Jane Adams|Eugene Levy|Rebecca Chambers|April Ortiz|Dulcy Rogers|Kathy Anthony,Albert Hackett,106.0,"$30,000,000",fatherhood|doberman|dog|mansion


In [9]:
# Loading Links dataset
df_links = pd.read_csv('links.csv')
df_links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [10]:
# Loading Movies dataset
df_movies = pd.read_csv('movies.csv')
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [11]:
# Loading Tags dataset
df_tags = pd.read_csv('tags.csv')
df_tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598
3,4,1732,great dialogue,1573943604
4,4,7569,so bad it's good,1573943455


<a id="three"></a>

# **3. Exploratory Data Analysis (EDA)**
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| *Description: Exploratory data analysis* |
| :--------------------------- |
| In this section, there will be an in-depth analysis of all the variables in the dataframes. |

---

### <a id="three.one"></a> 3.1 *Why is EDA important?* 

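&#10148; It helps to prepare the dataset for analysis by revealing its structure, data types and missing values. <br>
&#10148; It guides the cleaning and feature engineering that help a machine learning model make better predictions. <br>
&#10148; It reduces the risk of drawing conclusions from flawed or incomplete data. <br>
&#10148; It also helps with choosing a suitable machine learning model. <br>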

### <a id="three.two"></a> 3.2 *Looking at the shape of the datasets.*
>*The shape shows how many rows and columns each dataset contains.*

In [12]:
# Train data set
df_train.shape

(10000038, 4)

In [13]:
# Test data set
df_test.shape

(5000019, 2)

In [14]:
# Genome Score data set
df_genome_score.shape

(15584448, 3)

In [15]:
# Genome Tag data set
df_genome_tags.shape

(1128, 2)

In [16]:
# imdb data set
df_imdb_data.shape

(27278, 6)

In [17]:
# Links data set
df_links.shape

(62423, 3)

In [18]:
# Movies data set
df_movies.shape

(62423, 3)

In [19]:
# Tags data set
df_tags.shape

(1093360, 4)
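The row counts above say how many ratings were observed, but not how much of the full user-by-movie grid they cover. The sketch below is one way to estimate that coverage. The function name `rating_density` and the toy frame are illustrative assumptions, and the calculation assumes each (userId, movieId) pair appears at most once.

```python
import pandas as pd


def rating_density(ratings: pd.DataFrame) -> float:
    """Fraction of the user-by-movie grid that holds an observed rating.

    Assumes each (userId, movieId) pair appears at most once.
    """
    n_users = ratings["userId"].nunique()
    n_movies = ratings["movieId"].nunique()
    return len(ratings) / (n_users * n_movies)


# Toy stand-in for df_train so the sketch runs without the CSV files
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, 3.5, 5.0, 2.0],
})

# 4 ratings spread over 3 users x 3 movies
density = rating_density(toy_ratings)
print(f"density: {density:.3f}")
```

On the real data the same call would be `rating_density(df_train)`. A low density is typical for rating data and is one reason recommender models have to generalise from few observations per user.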

### <a id="three.three"></a> 3.3  *Looking at the data types that are in the dataframes.* 
>*The dataframes contain int64, float64 and object data types.*

In [20]:
# Train data set
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000038 entries, 0 to 10000037
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 305.2 MB


In [21]:
# Test data set
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000019 entries, 0 to 5000018
Data columns (total 2 columns):
 #   Column   Dtype
---  ------   -----
 0   userId   int64
 1   movieId  int64
dtypes: int64(2)
memory usage: 76.3 MB


In [22]:
# Genome Score data set
df_genome_score.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15584448 entries, 0 to 15584447
Data columns (total 3 columns):
 #   Column     Dtype  
---  ------     -----  
 0   movieId    int64  
 1   tagId      int64  
 2   relevance  float64
dtypes: float64(1), int64(2)
memory usage: 356.7 MB


In [23]:
# Genome Tags data set
df_genome_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1128 entries, 0 to 1127
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   tagId   1128 non-null   int64 
 1   tag     1128 non-null   object
dtypes: int64(1), object(1)
memory usage: 17.8+ KB


In [24]:
# imdb data set
df_imdb_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movieId        27278 non-null  int64  
 1   title_cast     17210 non-null  object 
 2   director       17404 non-null  object 
 3   runtime        15189 non-null  float64
 4   budget         7906 non-null   object 
 5   plot_keywords  16200 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 1.2+ MB


In [25]:
# Links data set
df_links.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  62423 non-null  int64  
 1   imdbId   62423 non-null  int64  
 2   tmdbId   62316 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 1.4 MB


In [26]:
# Movies data set
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB


In [27]:
# Tags data set
df_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1093360 entries, 0 to 1093359
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   userId     1093360 non-null  int64 
 1   movieId    1093360 non-null  int64 
 2   tag        1093344 non-null  object
 3   timestamp  1093360 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 33.4+ MB
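The `info()` output above reports 305.2 MB for the train set and 356.7 MB for the genome scores, all stored as 64-bit values. If memory becomes a constraint, one option is to downcast the numeric columns. The helper `downcast_numeric` below is a hypothetical name, not part of the original notebook, and the sketch assumes that float32 precision is acceptable for columns such as `rating` and `relevance`.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with integer and float columns stored in smaller dtypes."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        # Picks the smallest signed integer type that holds every value
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        # Downcasts to float32 where the values survive the conversion
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Toy rows in the same layout as df_train
toy_train = pd.DataFrame({
    "userId": [5163, 106343],
    "movieId": [57669, 5],
    "rating": [4.0, 4.5],
    "timestamp": [1518349992, 1206238739],
})

small_train = downcast_numeric(toy_train)
print(small_train.dtypes)
```

Half-star ratings are exactly representable in float32, so the `rating` values are unchanged by the conversion. Columns with missing values, such as `tmdbId`, stay floating point.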


### <a id="three.four"></a> 3.4 *Looking for null values in the datasets.*
>*There are null values in the imdb, links and tags datasets.*

In [28]:
# Train data set
df_train.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [29]:
df_test.isnull().sum()

userId     0
movieId    0
dtype: int64

In [30]:
df_genome_score.isnull().sum()

movieId      0
tagId        0
relevance    0
dtype: int64

In [31]:
df_genome_tags.isnull().sum()

tagId    0
tag      0
dtype: int64

In [32]:
df_imdb_data.isnull().sum()

movieId              0
title_cast       10068
director          9874
runtime          12089
budget           19372
plot_keywords    11078
dtype: int64

In [33]:
df_links.isnull().sum()

movieId      0
imdbId       0
tmdbId     107
dtype: int64

In [34]:
df_movies.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [35]:
df_tags.isnull().sum()

userId        0
movieId       0
tag          16
timestamp     0
dtype: int64
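The raw counts above are easier to compare as percentages. For example, `budget` is missing in 19372 of the 27278 imdb rows, which is roughly 71%. The sketch below collects such figures into one table. The function name `missing_summary` and the toy frames are illustrative assumptions.

```python
import pandas as pd


def missing_summary(frames: dict) -> pd.DataFrame:
    """List every column with nulls, with its count and percentage per dataset."""
    rows = []
    for name, df in frames.items():
        counts = df.isnull().sum()
        for col, n in counts[counts > 0].items():
            rows.append({
                "dataset": name,
                "column": col,
                "missing": int(n),
                "pct_missing": round(100 * int(n) / len(df), 2),
            })
    return pd.DataFrame(rows, columns=["dataset", "column", "missing", "pct_missing"])


# Toy stand-ins for df_links and df_movies
toy_frames = {
    "links": pd.DataFrame({"movieId": [1, 2, 3, 4],
                           "tmdbId": [862.0, None, 15602.0, 31357.0]}),
    "movies": pd.DataFrame({"movieId": [1, 2], "title": ["A", "B"]}),
}

summary = missing_summary(toy_frames)
print(summary)
```

With the real data the dictionary would map names to the loaded dataframes, for example `{"imdb": df_imdb_data, "links": df_links, "tags": df_tags}`.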

<a id="four"></a>

# **4. Data Engineering**
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| *Description: Data engineering* |
| :--------------------------- |
| In this section the dataset will be cleaned and possible new features created, as identified in the EDA phase. |

---
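As a starting point, the sketch below shows three cleaning steps suggested by the loaded data: `budget` is stored as text such as `$30,000,000`, `genres` is pipe-delimited, and `timestamp` is in Unix seconds. The new column names (`budget_num`, `genre_list`, `rated_at`) and the helper `parse_budget` are assumptions for illustration. The budget parsing also assumes every value is in the same currency, which has not been checked here.

```python
import pandas as pd


def parse_budget(budget: pd.Series) -> pd.Series:
    """Turn strings such as '$30,000,000' into numbers; missing values stay missing."""
    digits = budget.str.replace(r"[^0-9]", "", regex=True)
    return pd.to_numeric(digits, errors="coerce")


# Toy rows copied from the head() outputs above, plus one missing budget
imdb_toy = pd.DataFrame({"movieId": [1, 6], "budget": ["$30,000,000", None]})
movies_toy = pd.DataFrame({
    "movieId": [1, 3],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Comedy|Romance"],
})
ratings_toy = pd.DataFrame({"userId": [5163], "movieId": [57669],
                            "rating": [4.0], "timestamp": [1518349992]})

# Numeric budget, with NaN where the budget is unknown
imdb_toy["budget_num"] = parse_budget(imdb_toy["budget"])

# One list of genres per movie instead of a single delimited string
movies_toy["genre_list"] = movies_toy["genres"].str.split("|")

# Unix seconds to a proper datetime column
ratings_toy["rated_at"] = pd.to_datetime(ratings_toy["timestamp"], unit="s")

print(imdb_toy, movies_toy, ratings_toy, sep="\n\n")
```

The same pattern applies to the other pipe-delimited columns, `title_cast` and `plot_keywords`.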


<a id="five"></a>

# **5. Modelling**
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| *Description: Modelling* |
| :--------------------------- |
| In this section models will be built, namely: |

---
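Before any content-based or collaborative filtering model is built, a simple reference point is useful. The sketch below is a bias baseline: it predicts the global mean rating plus an average offset for the movie and for the user. It is not the content-based or collaborative filtering model the challenge asks for, only a yardstick such models should beat. The function names are hypothetical, and the clipping range of 0.5 to 5.0 assumes the usual MovieLens half-star scale.

```python
import pandas as pd


def fit_baseline(train: pd.DataFrame):
    """Estimate the global mean, per-movie offsets and per-user offsets."""
    mu = train["rating"].mean()
    # Movie offset: how far a movie's ratings sit from the global mean
    movie_bias = (train["rating"] - mu).groupby(train["movieId"]).mean()
    # User offset: what is left after removing the mean and the movie offset
    residual = train["rating"] - mu - train["movieId"].map(movie_bias)
    user_bias = residual.groupby(train["userId"]).mean()
    return mu, user_bias, movie_bias


def predict_baseline(pairs: pd.DataFrame, mu, user_bias, movie_bias) -> pd.Series:
    """Predict ratings for (userId, movieId) pairs; unseen ids get a zero offset."""
    b_user = pairs["userId"].map(user_bias).fillna(0.0)
    b_movie = pairs["movieId"].map(movie_bias).fillna(0.0)
    # Assumed rating scale: 0.5 to 5.0
    return (mu + b_user + b_movie).clip(0.5, 5.0)


# Toy stand-ins for df_train and df_test
toy_train = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 2.0, 5.0],
})
toy_test = pd.DataFrame({"userId": [2, 99], "movieId": [20, 999]})

mu, user_bias, movie_bias = fit_baseline(toy_train)
preds = predict_baseline(toy_test, mu, user_bias, movie_bias)
print(preds)
```

Users or movies that never appear in the training data fall back to the global mean, which keeps the baseline usable for every row of the test set.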

<a id="six"></a>

# **6. Model Performance**
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| *Description: Model performance* |
| :--------------------------- |
| In this section the models that were built will be compared relative to their performance and the best model will be selected. |

---
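The notebook does not yet state which metric is used for the comparison. Since the task is to predict a numeric rating, the sketch below assumes root mean squared error (RMSE) on a held-out validation set, where lower is better. The model names and predictions are made-up placeholders, not results.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between true and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Placeholder validation ratings and predictions from two hypothetical models
y_val = [4.0, 3.0, 5.0]
candidate_preds = {
    "model_a": [3.5, 3.0, 4.0],
    "model_b": [4.0, 2.0, 3.0],
}

# Score every candidate and pick the one with the lowest error
scores = {name: rmse(y_val, preds) for name, preds in candidate_preds.items()}
best_model = min(scores, key=scores.get)
print(scores, best_model)
```

The validation ratings must come from data the models did not see during training, otherwise the comparison favours models that overfit.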

<a id="seven"></a>

# **7. Model Explanation**
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| *Description: Model explanation* |
| :--------------------------- |
| A brief explanation is given of which model performed the best. |

---

![](https://imageio.forbes.com/specials-images/dam/imageserve/966248982/660x0.jpg?format=jpg&width=960)