# MOVIE RECOMMENDATION 2022
© Explore Data Science Academy

<br></br>

<div align="center" style="width: 700px; font-size: 80%; text-align: center; margin-left: 100px">
<img src="resources/imgs/Image_header.png"
     alt="Collaborative-based Filtering - Utility Matrix"
     style="float: center; padding-bottom=0.5em"
     width=700px/>
</div>

#### Development Team

1. Mercy Milkah Gathoni
2. Linda Kelida
3. Samuel Mijan
4. Sipho Lukhele
5. Jessica Njuguna

<a id="one"></a>
## 1. INTRODUCTION

### Problem Statement

In today’s technology-driven world, recommender systems are socially and economically critical to ensure that individuals can make optimised choices surrounding the content they engage with on a daily basis. One application where this is especially true is movie recommendations, where intelligent algorithms can help viewers find great titles from tens of thousands of options.

With this context, EDSA is challenging you to construct a recommendation algorithm based on content or collaborative filtering, capable of accurately predicting how a user will rate a movie they have not yet viewed, based on their historical preferences.

Providing an accurate and robust solution to this challenge has immense economic potential, with users of the system receiving personalised recommendations, generating platform affinity for the streaming services which best facilitate their audience's viewing.

<a id="cont"></a>
### Table of Contents

**<a href=#one>1. Introduction</a>**
- Problem Statement
- Table of Contents
- Summary
- Preliminary Activities


**<a href=#two>2. Exploratory Data Analysis</a>**


**<a href=#three>3. Feature Engineering</a>**
- Dealing with Null Values
- Data Scaling
- Dimension Reduction


**<a href=#four>4. Modelling</a>**
- Logging Comet Experiments


**<a href=#five>5. Model Performance Comparison</a>**


**<a href=#six>6. Model Explanations</a>**


**<a href=#seven>7. Conclusion</a>**


**<a href=#eight>8. Appendix</a>**
- Kaggle Submissions

### Summary

**Agenda:**

**Deliverables:**

**Results:**

### Preliminary Activities

#### Comet set up

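from comet_ml import Experiment
import os

# Read the API key from the COMET_API_KEY environment variable
# rather than hard-coding a secret in the notebook
experiment = Experiment(api_key=os.environ["COMET_API_KEY"],
    project_name="movie-recommender-2022",
    workspace="jessica-njuguna")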

#### Importing Packages

In [1]:
# Libraries for data loading, data manipulation
import pandas as pd

# Libraries for mathematical analyses
import numpy as np

#Libraries for Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale = 1)
# from wordcloud import WordCloud
# from statsmodels.graphics.correlation import plot_corr
# from PIL import Image, ImageOps, ImageEnhance, PILLOW_VERSION
# from wordcloud import STOPWORDS

# #Libraries to clean the text
# import contractions #This expands contraction such as 'don't' to 'do not'
# import regex as re
# import string
# import nltk
# from nltk.tokenize import TreebankWordTokenizer
# from nltk.stem import WordNetLemmatizer
# from nltk.corpus import stopwords
# import emoji #allows us to manipulate with emojis
# import itertools

# #Libraries for text pre-processing
# from sklearn.feature_extraction.text import CountVectorizer
# from sklearn.feature_extraction.text import TfidfVectorizer

# #Libraries for data balancing
# from imblearn.under_sampling import RandomUnderSampler
# from imblearn.over_sampling import SMOTE


# # Libraries for model building
# from sklearn.model_selection import train_test_split
# from sklearn.naive_bayes import BernoulliNB
# from sklearn.naive_bayes import MultinomialNB
# from sklearn.linear_model import LogisticRegression
# from sklearn.svm import SVC
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.model_selection import GridSearchCV

# #Libraries for Model Performance
# from sklearn.metrics import classification_report
# from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, confusion_matrix
# from sklearn.tree import plot_tree

#Library for creating pickle files of the models
import pickle

#### Importing Data

###### Let’s load the datasets using pandas.

In [2]:
# Directory that holds all of the CSV files
data_dir = '/home/explore-student/unsupervised_data/unsupervised_movie_data/'

raw_train_df = pd.read_csv(data_dir + 'train.csv')
raw_test_df = pd.read_csv(data_dir + 'test.csv')
raw_movies_df = pd.read_csv(data_dir + 'movies.csv')
raw_tags_df = pd.read_csv(data_dir + 'tags.csv')
raw_links_df = pd.read_csv(data_dir + 'links.csv')
raw_imdb_df = pd.read_csv(data_dir + 'imdb_data.csv')
raw_getags_df = pd.read_csv(data_dir + 'genome-tags.csv')
raw_gescores_df = pd.read_csv(data_dir + 'genome-scores.csv')

###### Getting to know our data frames

###### Head() prints the first 5 rows of our dataset including column header and the content of each row.

In [3]:
#function that displays the first five rows of a data frame
def display_df(df):
    '''This function takes in a dataframe and returns the first five rows of it'''
    return df.head()

We can call the function on any of our data frames.
Here we will call it on the movies set and have a look at its columns and values.

In [11]:
#what does our movies set have?
display_df(raw_movies_df)

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


###### Info() prints the column headers and the data type stored in each column. This function is extremely useful when we are trying to understand which values need to change type before we can apply functions to them. Integers that are stored as strings will not be added together until we transform them into integers.

We will create a function that displays the info about a data frame when called. It will tell us a couple of things:
1. The type of columns we have
2. Whether our dataframes have missing values
3. The number of entries and columns in the dataframe
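
As a small illustration of the point about integers stored as strings, the sketch below uses a made-up data frame (the `toy_df` name and its values are assumptions, not part of the competition data). It shows how adding a column of strings to itself joins the text rather than adding the numbers, and how `pd.to_numeric` converts the column so that arithmetic works.

```python
import pandas as pd

# Hypothetical toy data: the 'runtime' values were read in as strings
toy_df = pd.DataFrame({"movieId": [1, 2, 3],
                       "runtime": ["81", "104", "101"]})

# Adding a string column to itself concatenates the text instead of adding numbers
doubled = toy_df["runtime"] + toy_df["runtime"]

# Convert the column to a numeric dtype so arithmetic behaves as expected
toy_df["runtime"] = pd.to_numeric(toy_df["runtime"])
total = toy_df["runtime"].sum()

print(doubled.tolist())
print(toy_df.dtypes)
print(total)
```

Checking the dtypes with `info()` before doing any arithmetic is what tells us whether a conversion like this is needed.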

In [5]:
#define a function that displays the information of a df
def display_info(df):
    '''This function takes in a dataframe and prints information about it (df.info() itself returns None)'''
    return df.info()
    

In [10]:
#display movie set information
display_info(raw_movies_df)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB


Calling it on the movies set shows that we have 62423 rows and 3 columns: 1 column of data type int64 and 2 of data type object.

We can also call this function on the other datasets.

###### Describe() gives the count, mean, standard deviation, minimum, maximum and percentiles (including the median) of all the numerical columns in our dataset.

So we create a function that takes in a dataframe and returns the summary statistics.

In [14]:
#define a data frame summary statistic function
def summary_stat(df):
    '''This function takes a dataframe and returns the summary statistics of all numerical columns'''
    return df.describe()

We will look at the imdb dataset, since its runtime column makes this function more informative.

In [23]:
summary_stat(raw_imdb_df)

| | movieId | runtime |
|---|---|---|
| count | 27278.0 | 15189.0 |
| mean | 59855.48057 | 100.312331 |
| std | 44429.314697 | 31.061707 |
| min | 1.0 | 1.0 |
| 25% | 6931.25 | 89.0 |
| 50% | 68068.0 | 98.0 |
| 75% | 100293.25 | 109.0 |
| max | 131262.0 | 877.0 |


The average runtime for movies is about 100 minutes, and we can also see that the maximum runtime is 877 minutes (such a long time to spend watching a movie).
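
One possible way to follow up on the 877-minute maximum is to flag unusually long runtimes with the interquartile-range (IQR) rule. The sketch below applies it to the quartiles reported in the summary statistics (89 and 109 minutes). The 1.5 multiplier is the conventional choice and an assumption here, and the `toy_runtimes` values are made up for illustration rather than drawn from the real data.

```python
import pandas as pd

# Quartiles of runtime taken from the summary statistics above
q1, q3 = 89.0, 109.0
iqr = q3 - q1                   # spread of the middle 50% of runtimes
upper_fence = q3 + 1.5 * iqr    # conventional IQR rule (assumed threshold)

# Hypothetical runtimes, including the reported maximum of 877 minutes
toy_runtimes = pd.Series([1.0, 89.0, 98.0, 109.0, 877.0], name="runtime")

# Keep only the values that sit above the upper fence
long_runtimes = toy_runtimes[toy_runtimes > upper_fence]

print(upper_fence)
print(long_runtimes.tolist())
```

Under these assumptions, any runtime above 139 minutes would be flagged for a closer look, not automatically dropped.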


###### Do we have null values in our datasets? We will find out by calling the function below.

In [26]:
# define a function for checking missing values
def missing_val(df):
    '''A function that takes in a dataframe and prints the count of
    missing values in each column (it returns None)'''
    # Count total NaN at each column in a DataFrame
    print(" \nCount total NaN at each column in a DataFrame : \n\n",
          df.isnull().sum())

We will call missing_val() on all dataframes.

In [36]:
missing_val(raw_train_df),  missing_val(raw_movies_df)

 
Count total NaN at each column in a DataFrame : 

 userId       0
movieId      0
rating       0
timestamp    0
dtype: int64
 
Count total NaN at each column in a DataFrame : 

 movieId    0
title      0
genres     0
dtype: int64


(None, None)

The movies and train datasets do not have any missing values.

In [37]:
missing_val(raw_tags_df), missing_val(raw_links_df)

 
Count total NaN at each column in a DataFrame : 

 userId        0
movieId       0
tag          16
timestamp     0
dtype: int64
 
Count total NaN at each column in a DataFrame : 

 movieId      0
imdbId       0
tmdbId     107
dtype: int64


(None, None)

The tags dataset has 16 missing values in the tag column,
while the links dataset has 107 missing values in the tmdbId column.

In [38]:
missing_val(raw_gescores_df), missing_val(raw_getags_df), missing_val(raw_imdb_df)

 
Count total NaN at each column in a DataFrame : 

 movieId      0
tagId        0
relevance    0
dtype: int64
 
Count total NaN at each column in a DataFrame : 

 tagId    0
tag      0
dtype: int64
 
Count total NaN at each column in a DataFrame : 

 movieId              0
title_cast       10068
director          9874
runtime          12089
budget           19372
plot_keywords    11078
dtype: int64


(None, None, None)

The gescores and getags datasets have no missing values, while the imdb dataset has missing values in every column except movieId. Phew!
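
Raw counts are hard to compare across data frames of different sizes, so a possible next step is to express missing values as a share of the rows. The sketch below is illustrative only: `missing_pct` is a hypothetical helper and `toy_df` is made-up data standing in for the real data frames.

```python
import numpy as np
import pandas as pd

def missing_pct(df):
    '''Return the percentage of missing values per column, largest first'''
    # isnull().mean() gives the fraction of NaN entries in each column
    return (df.isnull().mean() * 100).sort_values(ascending=False)

# Hypothetical toy data with gaps in two columns
toy_df = pd.DataFrame({"movieId": [1, 2, 3, 4],
                       "runtime": [81.0, np.nan, 101.0, np.nan],
                       "budget": [np.nan, np.nan, np.nan, "$60,000,000"]})

pct = missing_pct(toy_df)
print(pct)
```

Applied to the imdb counts reported above, the same calculation gives roughly 12089 / 27278 ≈ 44% missing for runtime and 19372 / 27278 ≈ 71% missing for budget.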

<a id="two"></a>
## 2. EXPLORATORY DATA ANALYSIS
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

<a id="three"></a>
## 3. FEATURE ENGINEERING
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

#### Dealing with Null Values

#### Data Scaling

#### Dimension Reduction

<a id="four"></a>
## 4. MODELLING
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

### Logging Experiments on Comet

In [8]:
# params = {"test_size": 0.3,
#           "model_type": "Bernoulli-Naive_Bayes",
#           "vectorizer": "tfidf vectorizer",
#           "param_grid": "None" ,
#           "stratify": True
#           }
# metrics = {"F1 score:": bnb_f1,
#            "Recall:": bnb_rec,
#            "Precision:": bnb_prec,
#            'Accuracy': bnb_acc
#            }
experiment_name = 'Comet Set Up'

In [9]:
experiment.set_name(experiment_name)
# experiment.log_parameters(params)
# experiment.log_metrics(metrics)
experiment.end()

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/jessica-njuguna/movie-recommender-2022/be36105dd4534d9e87cb030e25209e76
COMET INFO:   Others:
COMET INFO:     Name : Comet Set Up
COMET INFO:   Uploads:
COMET INFO:     conda-environment-definition : 1
COMET INFO:     conda-info                   : 1
COMET INFO:     conda-specification          : 1
COMET INFO:     environment details          : 1
COMET INFO:     filename                     : 1
COMET INFO:     git metadata                 : 1
COMET INFO:     git-patch (uncompressed)     : 1 (10.63 MB)
COMET INFO:     installed packages           : 1
COMET INFO:     notebook                     : 1
COMET INFO:     source_code                  : 1
COMET INFO: ---------------------------
COMET INFO: Uploading metrics, params, and assets to Comet befo

<a id="five"></a>
## 5. MODEL PERFORMANCE COMPARISON
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

<a id="six"></a>
## 6. MODEL EXPLANATIONS
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

<a id="seven"></a>
## 7. CONCLUSION
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

<a id="eight"></a>
## 8. APPENDIX
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>