# Unsupervised Learning Predict - Movie Recommender System Challenge
© Explore Data Science Academy

---
### Honour Code

We, **XXX** {**#Team_NM3**}, confirm - by submitting this document - that the solutions in this notebook are a result of our own work and that we abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

<a id="cont"></a>

## Table of Contents

#### Section 1: Data Pre-processing

<a href=#one>1.1 Importing Packages</a>

<a href=#two>1.2 Loading Data</a>

<a href=#three>1.3 Exploratory Data Analysis (EDA)</a>

<a href=#four>1.4 Data Engineering</a>

#### Section 2: Model Development and Analysis

<a href=#five>2.1 Modeling</a>

<a href=#six>2.2 Model Performance</a>

#### Section 3: Model Explanation and Conclusions

<a href=#seven>3.1 Model Explanations</a>

<a href=#seven>3.2 Conclusions</a>

# Introduction
In today’s technology-driven world, recommender systems are socially and economically critical for ensuring that individuals can make appropriate choices about the content they engage with on a daily basis. This is especially true of movie recommendations, where intelligent algorithms can help viewers find great titles from tens of thousands of options.

Providing an accurate and robust solution to this challenge has immense economic potential, with users of the system being exposed to content they would like to view or purchase - generating revenue and platform affinity. 

This notebook has been adapted and developed by **XXX**, a group of seven students from the July 2022 cohort of the Explore AI Academy **Data Science** course. We are:

 > Josiah Aramide <br>
 > Bongani Mavuso <br>
 > Ndinannyi Mukwevho <br>
 > Aniedi Oboho-Etuk <br>
 > Manoko Langa <br>
 > Tshepiso Padi <br>
 > Nsika Masondo <br>
 

### Problem Statement

The client is determined to improve its recommender system service offering to targeted consumer categories based on their movie content rating. 

Historical viewing data available to the company contains preference and similarity characteristics that can support accurate prediction of consumer behaviour.

By constructing a recommendation algorithm based on content or collaborative filtering, the **XXX** team can develop a solution capable of accurately predicting how a user will rate a movie they have not yet viewed, based on their historical preferences. This solution can give the company access to immense economic potential, with users of the system being exposed to content they would like to view or purchase, generating revenue and platform affinity.


### Objectives

**XXX** seeks to achieve the following objectives for the project brief:

- 1. analyse the supplied data;
- 2. identify potential errors in the data and clean the existing data set;
- 3. determine if additional features can be added to enrich the data set;
- 4. build a recommendation algorithm based on content or collaborative filtering that is capable of accurately predicting how a user will rate a movie they have not yet viewed;
- 5. evaluate the accuracy of the best machine learning model; and
- 6. explain the inner working of the model to a non-technical audience.

# Section 1: Data Pre-processing

This section describes the steps for installing dependencies and requirements, initializing the experiment on Comet, importing packages, loading the train and test datasets, conducting the exploratory data analysis (EDA) and implementing data engineering.

 <a id="one"></a>
## 1.1 Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| Below are the libraries and tools imported for use in this project. The libraries include:
   - **numpy**: for working with arrays,
   - **pandas**: for transforming and manipulating data in tables,
   - **matplotlib**: for creating interactive visualisations,
   - **seaborn**: for making statistical graphs and plots,
   - **scikit-learn**: for machine learning and statistical modeling, and
   - **math**: for algebraic notations and calculations.

---

In [None]:
# Comet installation for Jupyter Notebook/Colab
!pip install comet_ml

In [227]:
# Libraries for data loading, data manipulation and data visualisation
import numpy as np   # for working with arrays
import pandas as pd  # for data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt  # for making visualisations and plots
import seaborn as sns
import pickle
%matplotlib inline

# Libraries for collecting experiment parameters
import warnings
warnings.filterwarnings("ignore")
import comet_ml
from comet_ml import Experiment

# Libraries for data engineering and model building
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler # for standardization
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from imblearn.pipeline import Pipeline

# Libraries for Building Models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import mutual_info_regression #determine mutual info
from sklearn.datasets import make_blobs

# Libraries for model performance (metrics)
from sklearn import metrics
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
import math
import time
import datetime as dt
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# Create an experiment with your api key
experiment = Experiment(
    api_key="xxxxxxxx",
    project_name="xxxxxxxxxxxx",
    workspace="teamnm3",
)

<a id="two"></a>
## 1.2 Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section, data is loaded from the **xxxxx** made available to **TeamNM3** by the client, **Explore-AI**. This involves reading the data from the `.csv` file format into a Pandas dataframe. The Pandas dataframe makes it easy to view and manipulate the data in tabular form and can be combined with other Python libraries such as NumPy. |

---

In [8]:
# Store datasets in a Pandas Dataframe
df = pd.read_csv('xxx.csv')
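As a minimal sketch of the loading step, the cell below reads a train and a test table into separate dataframes. The column names (`userId`, `movieId`, `rating`), the `_demo` variable names and the in-memory CSV text are assumptions made purely for illustration; the real file names are not given in this notebook.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for the real files (assumed schema)
train_csv = io.StringIO("userId,movieId,rating\n1,10,4.0\n1,20,3.5\n2,10,5.0\n")
test_csv = io.StringIO("userId,movieId\n2,20\n3,10\n")

# pd.read_csv accepts a file path or any file-like object
df_train_demo = pd.read_csv(train_csv)
df_test_demo = pd.read_csv(test_csv)

print(df_train_demo.shape)  # (rows, columns)
print(df_test_demo.shape)
```

With real data, the `io.StringIO` objects would simply be replaced by the paths to the supplied `.csv` files.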

<a id="three"></a>
## 1.3 Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, an in-depth analysis (graphical and non-graphical) of the supplied data is conducted. This includes: 
 - viewing the matrix to determine the dimensions of the data;
 - identifying the features and target;
 - investigating the formatting of the data (types, nulls etc.);
 - viewing the xxx;
 - identifying the xxx; and
 - analysing the xxx.|

---

### 1.3.1 Viewing the matrix (dimensions) of the data
First, it is necessary to view the matrix of the train and test datasets to see the total number of rows, the number of columns, and the content and format (datatypes) of the features and labels that both datasets contain.

In [None]:
# Train dataset Matrix
df_train.shape

In [None]:
# Test dataset Matrix
df_test.shape

### OBSERVATION
- As the results show, the train dataset contains **15,819 rows of observations** in 3 columns of features and the target (or response) variable.
- The test dataset contains a much lower number of observations (**10,546**) with only 2 columns i.e. not containing the target variable. 

Next, it is useful to peek at some of the rows in each dataset. This can be done with the `DataFrame.head()` method, as seen in the code cells below. The method takes an optional argument specifying the number of rows to view (15 in the first example); otherwise it returns the first 5 rows by default.

In [None]:
# View top of datasets, train set

df_train.head(15)

In [None]:
# looking at the test set
df_test.head(10)

### OBSERVATION
- The output indicates that the `xxx` column (features) contains xxx. These will need to be addressed during the feature engineering phase in order to derive any usefulness from them.


In [None]:
# Data Types and Non-null count 
df_train.info()

In [None]:
# Confirm the Non-null count
df_train.isnull().sum()

In [None]:
# Summary Statistics of our train dataset
df_train.groupby(" ").describe().T

### OBSERVATION
- From above, it can be observed that the dataset appears to have no missing values. That is, the count of non-null rows equals the expected count of entries in the columns. 

### 1.3.2 Visualisation: Histogram of ... showing outliers

### OBSERVATION
 - One immediately obvious fact from the unclean dataset 

### 1.3.3 Visualisation: Distribution (density) plot of 

In [None]:
# checking the distribution of text lengths in the selected column

length_train = df_train[' '].str.len().plot.hist(color = 'green', figsize = (6, 4))

### OBSERVATION
- The barplots above confirm the 

For now, this seems to be as much insight as can be drawn from the raw dataset. In the next stage, Data Engineering, the observations highlighted will be acted upon, particularly those concerning cleaning the data, removing outliers and noise, and transforming the result into a format appropriate for use in machine learning models.

The following section details how to achieve just that!

<a id="four"></a>
## 1.4 Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section we conduct our feature engineering to: 
- clean identified errors from the dataset;
- enrich the dataset by creating new features;
- split the dataset into training and validation sets for use by selected models;
- standardize the dataset;
- 
- 

These steps follow the insights that were gathered earlier during the EDA phase.|

---

### 1.4.1 Preprocessing 1: Cleaning
The first step is to begin organising the data cleaning exercise by building smart functions so that these can be recalled for cleaning both the training and testing datasets. Without this logical flow of cleaning the data, the exercise can quickly get very messy. However, with a couple of functions, it can be decided where the code lines will be inserted for repetitive tasks such as ... Then, the cleaning exercise can logically progress ... as shown below. 

Then, these functions are called to clean 

Later, in Step 2, ... ready for the modeling phase.

In [None]:
# create a function to do some preprocessing on the data
def xxx(yyy):
    '''
    :parameter
        :
    :return
        
    '''
    return xx

### 1.4.2 Preprocessing 2 - Split and Standardization
In this step, the task is to complete preprocessing on the train and test datasets ahead of modeling. First, a function is created to split the datasets into train and validation sets to support the performance measurement during the modeling stage. Next, another function is created to standardize the dataset. 

#### 1.4.2.1 Splitting
Create a `preprocess_train_split` function to complete the task of splitting features from the train dataset.

In [None]:
# create a function to preprocess data for our models
def preprocess_train_split(xxx):
    '''
    :parameter
        :
    :return
        split of 
    '''
    
    # split train data into train and validation datasets
        
    return (X_train, y_train), (X_valid, y_valid)
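The split described above could be sketched as follows. This is only an illustration under stated assumptions: the function name `split_features_target`, the target column name `rating`, the 80/20 split and the toy data are all hypothetical and not taken from the project data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_features_target(df, target_col="rating", valid_size=0.2, seed=42):
    """Split a dataframe into (X_train, y_train), (X_valid, y_valid)."""
    X = df.drop(columns=[target_col])   # everything except the target
    y = df[target_col]                  # the value to be predicted
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=valid_size, random_state=seed
    )
    return (X_tr, y_tr), (X_va, y_va)

# Toy ratings table (hypothetical data)
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "movieId": [10, 20, 10, 30, 20, 30, 10, 20, 30, 10],
    "rating": [4.0, 3.5, 5.0, 2.0, 3.0, 4.5, 1.0, 4.0, 3.5, 2.5],
})
(X_tr, y_tr), (X_va, y_va) = split_features_target(toy)
print(X_tr.shape, y_tr.shape, X_va.shape, y_va.shape)
```

Fixing `random_state` keeps the split reproducible between runs, which makes model scores comparable.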

#### 1.4.2.2 Extract the features
Call the `preprocess_train_split` function to split/extract the features into actual variables for training, validation and test datasets.

In [None]:
# split the train dataset into training and validation sets
(X_train, y_train), (X_valid, y_valid) = preprocess_train_split(xxx)
print(X_train.shape)
print(y_train.shape)
print(X_valid.shape)
print(y_valid.shape) 

#### 1.4.2.3 Standardizing the features
Create a `stand` function to complete the task of standardization.

In [234]:
# defining global scalers
rs = RobustScaler()
mm = MinMaxScaler()

In [233]:
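def stand(X_train, X_valid, X_test):
    '''
    :parameter
        X_train, X_valid, X_test: feature matrices for the training,
        validation and test sets
    :return
        the three feature matrices after robust and min-max scaling
    '''
    # standardize the features to be in comparable scale;
    # the scalers are fitted on the training set only, so that no
    # validation or test statistics leak into training
    rs = RobustScaler()
    mm = MinMaxScaler()
    X_train = mm.fit_transform(rs.fit_transform(X_train))
    X_valid = mm.transform(rs.transform(X_valid))
    X_test = mm.transform(rs.transform(X_test))
    
    return X_train, X_valid, X_test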

With this level of cleaning concluded, the model building and development stage follows next.

# Section 2: Model Development and Analysis

This section describes the steps and processes involved in building models for the project, as well as the analysis of the models' performance in terms of their accuracy in accomplishing the rating prediction task.

<a id="five"></a>
## 2.1 Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, the **TeamNM3** team explores the strengths of the following models, which were considered during model development. The models include:

- 1. L
- 2. S 
---
The initial task is to build a base model using ...

### 2.1.1 Overview of the Selected Models
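To make the idea of collaborative filtering concrete, the sketch below predicts a missing rating as a similarity-weighted average of the ratings the same user gave to other movies (an item-based approach). It is an illustration only, not the model used in this project: the rating matrix `R`, the function `predict_rating` and the use of cosine similarity are all assumptions.

```python
import numpy as np

# Rows = users, columns = movies; 0 marks "not yet rated" (toy data)
R = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [1.0, 2.0, 5.0],
])

def predict_rating(R, user, item):
    """Similarity-weighted average of the user's ratings of other movies."""
    norms = np.linalg.norm(R, axis=0)
    # cosine similarity between every movie column and the target movie;
    # unrated entries count as 0 here, which is a simplification
    sims = (R.T @ R[:, item]) / (norms * norms[item])
    rated = R[user] > 0          # movies this user has already rated
    rated[item] = False          # never use the target movie itself
    weights = sims[rated]
    if weights.sum() == 0:
        return float(R[R > 0].mean())   # fall back to the global mean
    return float(weights @ R[user, rated] / weights.sum())

pred = predict_rating(R, user=0, item=2)
print(round(pred, 2))
```

Because the prediction is a weighted average with positive weights, it always lies between the lowest and highest rating the user has already given.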

### 2.1.2 Fit, Train and Predict with a base model
The first step of modeling involved fitting, training and predicting a base model of ...

### DISCUSSION
The two outputs above are 

### 2.1.3 Building other models 
With the base model fully operational, it is now reasonable to develop other models that can strengthen the recommendation system task. As with all the earlier stages of the data science process, functions are built to help enhance the functionality of training and testing the datasets.

#### 2.1.3.1 Create model objects for all models

#### 2.1.3.2 Create functions for training and testing all models
Two functions `train_model` and `test_model` are created to optimize the process of training and testing all selected models.

In [63]:
# create a function to train our models
def train_model(model, X, y):

    ''' returns a model trained on the training dataset
        parameters:
            model:   a machine learning model
            X:       the training features
            y:       the training target
    '''    
    return model.fit(X, y)

### 2.1.4 Model Fitting, Training and Predictions

The models are fitted and trained on the balanced datasets and then used to make predictions on the unseen dataset. The process involves using the trained models by calling the functions built earlier.

First, predictions are made on the validation dataset, which has labels but has not been resampled. These prediction results are used in the next sub-section for evaluating the model performance. A second set of predictions is then made on the blind test dataset, which has no labels. These predictions are used for the Kaggle submission to obtain external scores on the performance of the models.

#### 2.1.4.1 Model 1: 

In [65]:
# training the 

#### 2.1.4.2 Model 2: 

In [70]:
# training the support vector machine 


### 2.1.5 Extract Results for Submission
With the model fitting, training and prediction tasks completed, it is now possible to extract results from some of the models for submission on Kaggle as well as for use in Streamlit web app development.

#### Extracting Results for Submission - Kaggle

In [None]:
#create a Kaggle submission file for the model
results_dict = pd.DataFrame({'tweetid':df_test['tweetid'],
                'sentiment': nbc_org_pred_test})

results_dict.to_csv('submission_nbc_org.csv', index = False)

#### Extracting pkl file for web app development

In [None]:
# pickle/save base model for Streamlit web deployment
model_save_path = "lgr_base.pkl"
with open(model_save_path,'wb') as file:
    pickle.dump(lgr_train,file)
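For completeness, here is a small sketch of the full round trip: saving a fitted model with `pickle` and loading it back, as a web app would do at start-up. The model, file name and temporary directory are assumptions chosen so the example is self-contained.

```python
import pickle
import tempfile
from pathlib import Path
from sklearn.linear_model import LinearRegression

# A tiny stand-in model (hypothetical, not the project's base model)
model = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_model.pkl"
    with open(path, "wb") as f:      # write the fitted model to disk
        pickle.dump(model, f)
    with open(path, "rb") as f:      # read it back, e.g. inside the app
        restored = pickle.load(f)

pred_restored = restored.predict([[3.0]])
print(pred_restored)
```

A pickled model should be loaded with the same library versions that were used to save it, since pickle files are not guaranteed to be compatible across scikit-learn releases.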

<a id="six"></a>
## 2.2 Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section, the relative performance of the selected models is compared against some common metrics. The following metrics are used to check model performance, using functions as previously established:
-  |

---
**xxx**

.


### 2.2.1 Model Scores, Matrices and Heatmaps
A function is built to take care of the `roc_auc_score` calculation.

In [102]:
# define a function for calculating roc scores
def roc_score(model, X_valid, y_valid):    
    # with the model previously instantiated, 

    return res
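One way such a function could be completed is sketched below. It assumes a fitted binary classifier that exposes `predict_proba`; the name `roc_score_sketch` and the toy data are hypothetical, and ROC AUC only applies if the task is framed as classification.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def roc_score_sketch(model, X_valid, y_valid):
    """ROC AUC of a fitted binary classifier on the validation set."""
    # probability of the positive class (second column of predict_proba)
    proba = model.predict_proba(X_valid)[:, 1]
    return roc_auc_score(y_valid, proba)

# Toy, clearly separable data (hypothetical)
X_clf = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]])
y_clf = np.array([0, 0, 0, 0, 1, 1, 1, 1])
clf = LogisticRegression().fit(X_clf, y_clf)
auc_demo = roc_score_sketch(clf, X_clf, y_clf)
print(auc_demo)
```

For a multi-class target, `roc_auc_score` would additionally need the full probability matrix and a `multi_class` setting.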

#### 2.2.1.1 Scores and Matrices of models trained on the balanced training dataset
The scores of models trained on the resampled datasets are first verified and then tabulated and plotted for easy comparison.

#### Model 1: 

In [None]:
# print roc_score for xxx model

#### Model 2: Support Vector

In [None]:
# plot bar of roc
roc_factsheet.plot(kind='bar', title='ROC scores across selected xxx')

### DISCUSSION
In the simple barplot of the ROC scores above, 

### DISCUSSION
In this instance,

### 2.2.2 Improving model performance

The results above ...

#### 2.2.2.1 Implementing Hyperparameter tuning to improve model performance
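Hyperparameter tuning can be sketched with `GridSearchCV`, which is already imported at the top of this notebook. The estimator, parameter grid, scoring metric and synthetic data below are assumptions chosen for illustration, not the settings used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (hypothetical)
rng = np.random.default_rng(0)
X_tune = rng.normal(size=(60, 3))
y_tune = X_tune[:, 0] * 2.0 + rng.normal(scale=0.1, size=60)

# Every combination in the grid is evaluated with 3-fold cross-validation
param_grid = {"n_estimators": [10, 20], "max_depth": [2, 4]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    cv=3,
)
search.fit(X_tune, y_tune)
print(search.best_params_)
```

For larger grids, `RandomizedSearchCV` (also imported above) samples a fixed number of combinations instead of trying them all.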

# Section 3: Model Explanations and Conclusions

This section describes

<a id="seven"></a>
## 3.1 Model Explanation
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, we discuss the inner workings of some of the selected models in an attempt to understand how the models have performed the task. We discuss the following models:
- 
- Support Vector Machines,
- Random Forest.|

---

### 3.1.1 Understanding the inner workings of select models

### 3.1.2 Characteristics and Advantages of the Best Performing Models

<a id="seven"></a>
## 3.2 Conclusions
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Conclusions ⚡ |
| :--------------------------- |
| In this section, we draw conclusions and consider a few recommendations based on the discussions and investigations conducted for this movie recommender project.|

---

In conclusion, it can be said that:
- the data available from the 

Finally, it is evident that deploying machine learning solutions that are well-tuned to 

Thus, thorough consideration of the strategic objectives and direction of the company with regards to interventions to be supported by insights from the ... can improve the choice of the machine learning model that best delivers on the recommendation system task.

### 3.2.1 Logging and extracting parameters for Comet experiments

In [354]:
# create dictionaries for the data we want to log

# metrics
metrics_nbc_smt = {"f1": nbc_smt_f1, "recall": nbc_smt_r, 
                  "precision": nbc_smt_p, "roc": nbc_roc}

# parameters
params_nbc_smt = {"vectorizer": tf_vect, "model_type": "naive bayes", 
                 "model": nbc_smt, "robust scaler": rs, "Min Max": mm}

#params_abc_sm = {"random_state": 42, "vectorizer": tf_vect, 
 #                "model_type": "ada boost", "model": abc_sm, 
 #                "robust scaler": rs, "Min Max": mm, 
 #                 "base_estimator": rfc}

In [355]:
# Log our parameters and results
experiment.log_parameters(params_nbc_smt)
experiment.log_metrics(metrics_nbc_smt)

In [None]:
# end the experiment on Comet
experiment.end()

Running `experiment.display()` shows the experiment's Comet page.

In [357]:
# display the experiment parameters on Comet
experiment.display()