<a href="https://colab.research.google.com/github/endiesworld/2110ACDS_T7_C_Predict/blob/main/2110ACDS_T7_starter_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EDSA Movie Recommendation 2022

© Explore Data Science Academy

---
### Honour Code

**2110ACDS_T6**, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.


  

<center><h2>EDSA Movie Recommendation 2022</h2></center>
<figure>
<center><img src="./assets/movies.png" width="800" height="500"/></center>
</figure>

*Introduction*
<p align = "justify">A recommender system seeks to predict or filter items according to a user's preferences. Recommender systems are used in a variety of areas; in this project we use one to recommend movies to movie lovers.


*About the problem*
<p align = "justify">PUT PROBLEM STATEMENT HERE.

*Objective*
<p align = "justify"> We aim to provide an accurate and robust solution to this problem by giving users of this product personalised recommendations, and by building platform affinity for the streaming services that best serve their audience's viewing.

*Process*
<p align = "justify"> In order to achieve this objective, the team will follow the process below:

1. analyse the supplied data, identify potential errors in the data and clean the existing data set;

2. determine if additional features can be added to enrich the data set;

3. build a model that is capable of predicting how a user will rate a movie;

4. evaluate the accuracy of the best machine learning model;

5. accurately predict how a user will rate a movie they have not yet viewed, based on their historical preferences; and

6. explain the inner working of the model to a non-technical audience.

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modeling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [1]:

# Import comet_ml at the top of your file
# from comet_ml import Experiment

# # Create an experiment with your api key
# experiment = Experiment(
#     api_key="emBEBYBp72gW5tfeZBSGftD0Y",
#     project_name="movie-recommendation",
#     workspace="emmanuelokoro",
#     log_code = True
# )

In [2]:
# Libraries for importing and loading data
import numpy as np
import pandas as pd
import scipy as sp # <-- The sister of NumPy, used in our code for numerical efficiency.
import matplotlib.pyplot as plt
import seaborn as sns
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

# Entity featurization and similarity computation
from sklearn.metrics.pairwise import cosine_similarity 
from sklearn.feature_extraction.text import TfidfVectorizer

# Libraries used during sorting procedures.
import operator # <-- Convenient item retrieval during iteration
import heapq # <-- Efficient sorting of large lists


# Setting global constants to ensure notebook results are reproducible

RANDOM_STATE = 42


import warnings
warnings.filterwarnings('ignore')

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

### 2.1 Brief description of the data



In [3]:
# load the data
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')
genome_scores = pd.read_csv('./data/genome_scores.csv')
# NOTE: this reads tags.csv (user-applied tags), so genome_tags holds userId/movieId/tag rows
genome_tags = pd.read_csv('./data/tags.csv')
imdb_data = pd.read_csv('./data/imdb_data.csv')
links = pd.read_csv('./data/links.csv')
movies = pd.read_csv('./data/movies.csv')
# tags = pd.read_csv('./data/tags.csv')

In [4]:
# Preview train dataset
print('The Shape of the data is: ', train.shape)
train.head()

The Shape of the data is:  (10000038, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,5163,57669,4.0,1518349992
1,106343,5,4.5,1206238739
2,146790,5459,5.0,1076215539
3,106362,32296,2.0,1423042565
4,9041,366,3.0,833375837


In [5]:
train['userId'].nunique()

162541

In [6]:
# Preview test dataset
print('The Shape of the data is: ', test.shape)
test.head()

The Shape of the data is:  (5000019, 2)


Unnamed: 0,userId,movieId
0,1,2011
1,1,4144
2,1,5767
3,1,6711
4,1,7318


In [7]:
test['movieId'].nunique()

39643

In [8]:
test.tail()

Unnamed: 0,userId,movieId
5000014,162541,4079
5000015,162541,4467
5000016,162541,4980
5000017,162541,5689
5000018,162541,7153


In [9]:
# Preview genome_scores dataset
print('The Shape of the data is: ', genome_scores.shape)
genome_scores.head()

The Shape of the data is:  (15584448, 3)


Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02875
1,1,2,0.02375
2,1,3,0.0625
3,1,4,0.07575
4,1,5,0.14075


In [10]:
# Preview genome_tags dataset (loaded from tags.csv above)
print('The Shape of the data is: ', genome_tags.shape)
genome_tags.head()

The Shape of the data is:  (1093360, 4)


Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598
3,4,1732,great dialogue,1573943604
4,4,7569,so bad it's good,1573943455


In [11]:
# Preview imdb_data dataset
print('The Shape of the data is: ', imdb_data.shape)
imdb_data.head()

The Shape of the data is:  (27278, 6)


Unnamed: 0,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation
1,2,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,104.0,"$65,000,000",board game|adventurer|fight|game
2,3,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,101.0,"$25,000,000",boat|lake|neighbor|rivalry
3,4,Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,124.0,"$16,000,000",black american|husband wife relationship|betra...
4,5,Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,106.0,"$30,000,000",fatherhood|doberman|dog|mansion


In [12]:
imdb_data.isna().sum()

movieId              0
title_cast       10068
director          9874
runtime          12089
budget           19372
plot_keywords    11078
dtype: int64

In [13]:
imdb_data = imdb_data.dropna()
imdb_data.isna().sum()

movieId          0
title_cast       0
director         0
runtime          0
budget           0
plot_keywords    0
dtype: int64

In [14]:
# Preview links dataset
print('The Shape of the data is: ', links.shape)
links.head()

The Shape of the data is:  (62423, 3)


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [15]:
# Preview movies dataset
print('The Shape of the data is: ', movies.shape)
movies.head()

The Shape of the data is:  (62423, 3)


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [16]:
movies.isna().sum()

movieId    0
title      0
genres     0
dtype: int64

In [16]:
# Preview tags dataset
# print('The Shape of the data is: ', tags.shape)
# tags.head()

#### Dataset summary


<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


### 3.1 Exploratory Data Analysis
*What is exploratory data analysis?*
    Exploratory data analysis (EDA) is the process of analysing and investigating data sets and summarising their main characteristics, often using both non-graphical and graphical methods.

*Why is conducting EDA important?*
    It helps determine how best to manipulate the data to get the required answers, and it exposes trends, patterns, and relationships that are not readily apparent, i.e. it gives insight into the dataset.

*How is EDA conducted?*
    EDA can be conducted in the following ways:
- **Univariate**:- \
    i. **non-graphical**:- This is the simplest form of data analysis, where the data being analysed consists of just one variable. Since it is a single variable, it does not deal with causes or relationships.\
    ii. **graphical**:- Non-graphical methods do not provide a full picture of the data, so graphical methods are also required. These involve visual exploratory analysis of the data.
- **Multivariate**:-  \
    i. **non-graphical**:- Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics. \
    ii. **graphical**:- Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. A commonly used graphic is a grouped bar plot or bar chart, with each group representing one level of one of the variables and each bar within a group representing the levels of the other variable.
    
To achieve the above given the volume of data in this project, we use the Python module **pandas_profiling**.

#### 3.1.1 pandas_profiling
pandas_profiling is an open-source Python module that produces an exploratory data analysis report in just a few lines of code. The generated report covers many aspects of the dataset and can be customised.

In [17]:
# Generate EDA report of train dataset
# from pandas_profiling import ProfileReport
# profile = ProfileReport(train, title="Report")
# profile


In [18]:
# Generate report for genome_scores
# profile = ProfileReport(genome_scores, title="genome_scores report")
# profile


#### summarize the above.

 **Descriptive Statistics**

>Descriptive statistics summarise the data by computing quantities such as the mean, median, mode, and standard deviation. They describe the dataset in a simpler manner through:

*   The measure of central tendency 
>*  Mean:- The average value 
>*  Median:- The mid point value 
>*  Mode:- The most common value

*   Measure of spread  
>* Percentiles:- A percentile is the value below which a given percentage of the observations fall.
>* Standard deviation:- A number that describes how spread out the values are.
*  Measure of symmetry 
>* Skewness:- a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
>>* If skewness is less than -1 or greater than 1, the distribution is highly skewed.
>>* If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.
>>* If skewness is between -0.5 and 0.5, the distribution is approximately symmetric. 
*  Measure of Peakedness 
>* Kurtosis:- a measure of the relative peakedness of a probability distribution, or alternatively of how heavy or light its tails are. A normal distribution has a kurtosis of 3 and is called mesokurtic. Kurtosis above 3 (leptokurtic) corresponds to a sharper peak with heavier tails, whereas kurtosis below 3 (platykurtic) corresponds to a broader, flatter peak with lighter tails (lepto = thin; platy = broad).
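
The measures listed above can be computed directly with pandas. The sketch below is only an illustration: the `ratings` values are made up, and in this notebook the series would more likely be `train['rating']`. Note that pandas reports excess kurtosis (a normal distribution scores 0), so 3 is added to match the convention used above.

```python
import pandas as pd

# Hypothetical sample of ratings (made-up values, for illustration only)
ratings = pd.Series([4.0, 4.5, 5.0, 2.0, 3.0, 4.0, 3.5, 4.0, 1.0, 5.0])

summary = {
    # Measures of central tendency
    "mean": ratings.mean(),
    "median": ratings.median(),
    "mode": ratings.mode().iloc[0],
    # Measures of spread
    "std": ratings.std(),
    "p25": ratings.quantile(0.25),
    "p75": ratings.quantile(0.75),
    # Measure of symmetry
    "skewness": ratings.skew(),
    # pandas returns *excess* kurtosis, so add 3 to compare with the
    # "normal distribution has kurtosis 3" convention
    "kurtosis": ratings.kurt() + 3,
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A negative skewness here would indicate a longer tail towards low ratings, which is common for rating data where users mostly rate movies they liked.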








In [19]:
# look at data statistics


### 3.2 Key Insights from EDA 


<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

### 4.1 Content-based Filtering 
Content-based filtering makes recommendations based on how similar the properties or features of an item are to those of other items.

Given the large volume of data we have, we restrict this work to the userIds present in the test dataset.

**Unique userId**
We want to evaluate the difference between the number of unique userIds in the train and test datasets.

In [17]:
test['userId'].nunique()

162350

In [18]:
test_case = test['userId'].nunique()
train_case = train['userId'].nunique()
print('The difference in unique userID count between train and test data set is:', (train_case - test_case))

The difference in unique userID count between train and test data set is: 191


From the above, we proceed to drop the training rows that belong to these 191 userIds, since they are not required for prediction.

In [19]:
test_userids = test['userId'].unique().tolist()
test_userids[:20]

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

In [20]:
# Extract rows with userId present in test userId
# (.copy() so that the later in-place sort does not act on a slice of train)
useful_train = train[train['userId'].isin(test_userids)].copy()
useful_train.shape

(9997845, 4)
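
As a small illustration of the filtering step above, the sketch below uses toy frames (`train_demo` and `test_demo` are hypothetical stand-ins with made-up values) to show how the train-only userIds can be listed and how the `isin` filter removes their rows.

```python
import pandas as pd

# Toy stand-ins for the notebook's train and test frames (made-up values)
train_demo = pd.DataFrame({"userId": [1, 1, 2, 3, 4], "movieId": [10, 11, 10, 12, 13]})
test_demo = pd.DataFrame({"userId": [1, 2, 4], "movieId": [20, 21, 22]})

# userIds that appear in train but never in test
train_only_ids = set(train_demo["userId"]) - set(test_demo["userId"])

# Keep only the training rows whose user we actually have to predict for
useful_demo = train_demo[train_demo["userId"].isin(test_demo["userId"].unique())].copy()

print(train_only_ids)
print(useful_demo)
```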

#### Sorting of Tables

We proceed to sort both tables (train and test) by userId.

In [32]:
# Sort train dataset by userId
useful_train.sort_values(by=['userId'], inplace= True)
useful_train.head()

Unnamed: 0,userId,movieId,rating,timestamp
5122500,1,3949,5.0,1147868678
9258011,1,2351,4.5,1147877957
2858468,1,2068,2.5,1147869044
2625067,1,27266,4.5,1147879365
6239556,1,7939,2.5,1147869183


In [33]:
# Sort test dataset by userId
test.sort_values(by=['userId'], inplace= True)
test.head()

Unnamed: 0,userId,movieId
0,1,2011
1,1,4144
2,1,5767
3,1,6711
4,1,7318


#### Merging of relevant tables

At this stage, we merge both tables with other tables considered to be useful for the task at hand.
The tables we merge with are listed below:
- imdb_data
- movies


In [34]:
# Merge train table with imdb_data table 
useful_train = useful_train.merge(imdb_data, on = 'movieId', how= 'left')
useful_train.head()

Unnamed: 0,userId,movieId,rating,timestamp,title_cast,director,runtime,budget,plot_keywords
0,1,3949,5.0,1147868678,Ellen Burstyn|Jared Leto|Jennifer Connelly|Mar...,Hubert Selby Jr.,102.0,"$4,500,000",drug addiction|heroin|sex show|sex scene
1,1,2351,4.5,1147877957,,,,,
2,1,2068,2.5,1147869044,,,,,
3,1,27266,4.5,1147879365,Tony Chiu-Wai Leung|Li Gong|Faye Wong|Takuya K...,Kar-Wai Wong,129.0,"$12,000,000",nostalgia|loneliness|room 2046|1960s
4,1,7939,2.5,1147869183,,,,,


In [35]:
# Merge test table with imdb_data table 
test = test.merge(imdb_data, on = 'movieId', how= 'left')
test.head()

Unnamed: 0,userId,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,2011,,,,,
1,1,4144,,,,,
2,1,5767,,,,,
3,1,6711,Scarlett Johansson|Bill Murray|Akiko Takeshita...,Sofia Coppola,102.0,"$4,000,000",older man younger woman relationship|lonelines...
4,1,7318,Jim Caviezel|Maia Morgenstern|Christo Jivkov|F...,Benedict Fitzgerald,127.0,"$30,000,000",suffering|torture|brutality|whipping


In [36]:
# Merge train table with movies table 
useful_train = useful_train.merge(movies, on = 'movieId', how= 'left')
useful_train.head()

Unnamed: 0,userId,movieId,rating,timestamp,title_cast,director,runtime,budget,plot_keywords,title,genres
0,1,3949,5.0,1147868678,Ellen Burstyn|Jared Leto|Jennifer Connelly|Mar...,Hubert Selby Jr.,102.0,"$4,500,000",drug addiction|heroin|sex show|sex scene,Requiem for a Dream (2000),Drama
1,1,2351,4.5,1147877957,,,,,,"Nights of Cabiria (Notti di Cabiria, Le) (1957)",Drama
2,1,2068,2.5,1147869044,,,,,,Fanny and Alexander (Fanny och Alexander) (1982),Drama|Fantasy|Mystery
3,1,27266,4.5,1147879365,Tony Chiu-Wai Leung|Li Gong|Faye Wong|Takuya K...,Kar-Wai Wong,129.0,"$12,000,000",nostalgia|loneliness|room 2046|1960s,2046 (2004),Drama|Fantasy|Romance|Sci-Fi
4,1,7939,2.5,1147869183,,,,,,Through a Glass Darkly (Såsom i en spegel) (1961),Drama


In [37]:
# Merge test table with movies table 
test = test.merge(movies, on = 'movieId', how= 'left')
test.head()

Unnamed: 0,userId,movieId,title_cast,director,runtime,budget,plot_keywords,title,genres
0,1,2011,,,,,,Back to the Future Part II (1989),Adventure|Comedy|Sci-Fi
1,1,4144,,,,,,In the Mood For Love (Fa yeung nin wa) (2000),Drama|Romance
2,1,5767,,,,,,Teddy Bear (Mis) (1981),Comedy|Crime
3,1,6711,Scarlett Johansson|Bill Murray|Akiko Takeshita...,Sofia Coppola,102.0,"$4,000,000",older man younger woman relationship|lonelines...,Lost in Translation (2003),Comedy|Drama|Romance
4,1,7318,Jim Caviezel|Maia Morgenstern|Christo Jivkov|F...,Benedict Fitzgerald,127.0,"$30,000,000",suffering|torture|brutality|whipping,"Passion of the Christ, The (2004)",Drama
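
The left merges above keep every rating row even when a movie has no IMDb metadata, which is why many rows show empty values. One way to quantify this, sketched below on toy frames (`ratings_demo` and `imdb_demo` are hypothetical), is the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column recording where each row was found.

```python
import pandas as pd

# Toy ratings and metadata tables (made-up rows, for illustration only)
ratings_demo = pd.DataFrame({"userId": [1, 1, 2], "movieId": [1, 2, 3]})
imdb_demo = pd.DataFrame(
    {"movieId": [1, 3], "director": ["John Lasseter", "Mark Steven Johnson"]}
)

# indicator=True adds a "_merge" column: "both" or "left_only" for a left merge
merged = ratings_demo.merge(imdb_demo, on="movieId", how="left", indicator=True)

unmatched = int((merged["_merge"] == "left_only").sum())
coverage = 1 - unmatched / len(merged)

print(f"{unmatched} of {len(merged)} rows have no metadata (coverage {coverage:.0%})")
```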


#### Merging vital columns

At this stage, we merge the columns we consider important in describing the content of a movie into a new column called key_words. The columns are listed below:
- title_cast
- director
- plot_keywords
- genres

In [38]:
# Merge the columns listed above into a new column named key_words for the train data
useful_train['key_words'] = (pd.Series(useful_train[['title_cast', 'director', 'plot_keywords', 'genres']].fillna('')
                      .values.tolist()).str.join(' '))
useful_train.head()

Unnamed: 0,userId,movieId,rating,timestamp,title_cast,director,runtime,budget,plot_keywords,title,genres,key_words
0,1,3949,5.0,1147868678,Ellen Burstyn|Jared Leto|Jennifer Connelly|Mar...,Hubert Selby Jr.,102.0,"$4,500,000",drug addiction|heroin|sex show|sex scene,Requiem for a Dream (2000),Drama,Ellen Burstyn|Jared Leto|Jennifer Connelly|Mar...
1,1,2351,4.5,1147877957,,,,,,"Nights of Cabiria (Notti di Cabiria, Le) (1957)",Drama,Drama
2,1,2068,2.5,1147869044,,,,,,Fanny and Alexander (Fanny och Alexander) (1982),Drama|Fantasy|Mystery,Drama|Fantasy|Mystery
3,1,27266,4.5,1147879365,Tony Chiu-Wai Leung|Li Gong|Faye Wong|Takuya K...,Kar-Wai Wong,129.0,"$12,000,000",nostalgia|loneliness|room 2046|1960s,2046 (2004),Drama|Fantasy|Romance|Sci-Fi,Tony Chiu-Wai Leung|Li Gong|Faye Wong|Takuya K...
4,1,7939,2.5,1147869183,,,,,,Through a Glass Darkly (Såsom i en spegel) (1961),Drama,Drama


In [41]:
# Confirm the absence of NaN values in the key_words column for the train data
nan = useful_train['key_words'].isna().sum()
print(f' There are {nan} numbers of NaN values in the train keywords column')

 There are 0 numbers of NaN values in the train keywords column


In [42]:
# Merge the columns listed above into a new column named key_words for the test data
test['key_words'] = (pd.Series(test[['title_cast', 'director', 'plot_keywords', 'genres']].fillna('')
                      .values.tolist()).str.join(' '))
test.head()

Unnamed: 0,userId,movieId,title_cast,director,runtime,budget,plot_keywords,title,genres,key_words
0,1,2011,,,,,,Back to the Future Part II (1989),Adventure|Comedy|Sci-Fi,Adventure|Comedy|Sci-Fi
1,1,4144,,,,,,In the Mood For Love (Fa yeung nin wa) (2000),Drama|Romance,Drama|Romance
2,1,5767,,,,,,Teddy Bear (Mis) (1981),Comedy|Crime,Comedy|Crime
3,1,6711,Scarlett Johansson|Bill Murray|Akiko Takeshita...,Sofia Coppola,102.0,"$4,000,000",older man younger woman relationship|lonelines...,Lost in Translation (2003),Comedy|Drama|Romance,Scarlett Johansson|Bill Murray|Akiko Takeshita...
4,1,7318,Jim Caviezel|Maia Morgenstern|Christo Jivkov|F...,Benedict Fitzgerald,127.0,"$30,000,000",suffering|torture|brutality|whipping,"Passion of the Christ, The (2004)",Drama,Jim Caviezel|Maia Morgenstern|Christo Jivkov|F...


In [43]:
# Confirm the absence of NaN values in the key_words column for the test data
nan = test['key_words'].isna().sum()
print(f' There are {nan} numbers of NaN values in the test keywords column')

 There are 0 numbers of NaN values in the test keywords column
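
The key_words construction above builds a new Series with a fresh 0..n-1 index, so it relies on the frame having that same default index (which is the case right after a merge). A variant that stays aligned for any index is sketched below; `demo` and `cols` are hypothetical names and the rows are made up.

```python
import pandas as pd

# Toy frame with a deliberately non-default index (made-up rows)
demo = pd.DataFrame(
    {
        "title_cast": ["Tom Hanks|Tim Allen", None],
        "director": ["John Lasseter", None],
        "plot_keywords": ["toy|rivalry", None],
        "genres": ["Adventure|Animation", "Drama"],
    },
    index=[10, 20],
)

cols = ["title_cast", "director", "plot_keywords", "genres"]

# Row-wise join keeps the original index; strip() removes the padding
# left behind by empty (previously missing) fields
demo["key_words"] = demo[cols].fillna("").apply(" ".join, axis=1).str.strip()

print(demo["key_words"])
```

The row-wise `apply` is slower than the list-based join on millions of rows, so this is a trade of speed for index safety rather than a drop-in improvement.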


#### Dropping columns that are not needed

Going forward, we drop the columns we consider not really important for the task at hand. The columns are listed below:
- runtime
- budget
- timestamp
- title
- title_cast
- director
- plot_keywords 
- genres

In [44]:
# Drop the above listed columns in the train data
useful_train.drop(columns=['timestamp', 'runtime', 'budget','title', 'title_cast', 'director', 
                           'plot_keywords','genres'], inplace= True)
useful_train.head()

Unnamed: 0,userId,movieId,rating,key_words
0,1,3949,5.0,Ellen Burstyn|Jared Leto|Jennifer Connelly|Mar...
1,1,2351,4.5,Drama
2,1,2068,2.5,Drama|Fantasy|Mystery
3,1,27266,4.5,Tony Chiu-Wai Leung|Li Gong|Faye Wong|Takuya K...
4,1,7939,2.5,Drama


In [46]:
# Drop the above listed columns in the test data
test.drop(columns=['runtime', 'budget','title', 'title_cast', 'director', 
                           'plot_keywords','genres'], inplace= True)
test.head()

Unnamed: 0,userId,movieId,key_words
0,1,2011,Adventure|Comedy|Sci-Fi
1,1,4144,Drama|Romance
2,1,5767,Comedy|Crime
3,1,6711,Scarlett Johansson|Bill Murray|Akiko Takeshita...
4,1,7318,Jim Caviezel|Maia Morgenstern|Christo Jivkov|F...


#### Data Formatting

As can be seen in the key_words column, the entities are separated by a '|'. This character is only a separator (delimiter); if left in the data it can affect the accuracy of our model, so it needs to be removed.

To achieve this, we write a function called splitter, which operates on both the train and test datasets.

In [48]:
# Remove delimiters (separators) from string data
def splitter(df, col_list, delim):
    """
        This function accepts a dataframe (df), a list of columns (col_list) that contain the delimiter,
        and the delimiter (delim) to be removed. It returns a copy of df with the delimiter replaced by spaces.
    """
    new_df = df.copy()
    
    for col in col_list:
        new_df[col] = new_df[col].str.split(delim).str.join(' ')
    
    return new_df
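
For a single-character delimiter, the split-then-join used in splitter can also be written as a literal string replacement. The sketch below compares the two on a made-up Series (`key_words_demo` is a hypothetical name); it is an alternative formulation, not a change to the function above.

```python
import pandas as pd

# Made-up key_words values in the same '|'-separated shape as the notebook's data
key_words_demo = pd.Series(["Drama|Fantasy|Mystery", "Comedy|Crime", "Drama"])

# regex=False treats '|' as a literal character rather than regex alternation
cleaned = key_words_demo.str.replace("|", " ", regex=False)

# The split/join approach used by splitter gives the same result
via_split = key_words_demo.str.split("|").str.join(" ")

print(cleaned.tolist())
```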

In [49]:
# Remove delimiter from key_words column in train data
useful_train = splitter(useful_train, ['key_words'], '|')
useful_train.head()

Unnamed: 0,userId,movieId,rating,key_words
0,1,3949,5.0,Ellen Burstyn Jared Leto Jennifer Connelly Mar...
1,1,2351,4.5,Drama
2,1,2068,2.5,Drama Fantasy Mystery
3,1,27266,4.5,Tony Chiu-Wai Leung Li Gong Faye Wong Takuya K...
4,1,7939,2.5,Drama


In [50]:
# Remove delimiter from key_words column in test data
test = splitter(test, ['key_words'], '|')
test.head()

Unnamed: 0,userId,movieId,key_words
0,1,2011,Adventure Comedy Sci-Fi
1,1,4144,Drama Romance
2,1,5767,Comedy Crime
3,1,6711,Scarlett Johansson Bill Murray Akiko Takeshita...
4,1,7318,Jim Caviezel Maia Morgenstern Christo Jivkov F...


#### Dividing dataset into chunks

We now proceed to divide both datasets into chunks. The number of chunks we have chosen is 162350, which is the number of unique userIds in both datasets. This lets us save the chunks to local storage, load them back individually, process each chunk separately, and make predictions afterwards. This is necessary because our machines have limited capacity, which prevents us from processing these datasets all at once.
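
A minimal sketch of the save-and-reload idea described above, assuming one CSV file per userId. The frame `ratings_demo`, the scratch directory `out_dir`, and the file naming are all hypothetical; with 162350 users, one file per user may be impractical, and grouping several users per file is a reasonable variation.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy ratings table (made-up values)
ratings_demo = pd.DataFrame(
    {
        "userId": [1, 1, 2, 3, 3],
        "movieId": [10, 11, 10, 12, 13],
        "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
    }
)

# Hypothetical scratch directory for the chunk files
out_dir = Path(tempfile.mkdtemp())

# groupby yields one (userId, sub-frame) pair per user in a single pass,
# avoiding a full-table filter for every user
for user_id, chunk in ratings_demo.groupby("userId"):
    chunk.to_csv(out_dir / f"train_chunk_{user_id}.csv", index=False)

# Later, load a single user's chunk on demand
chunk_3 = pd.read_csv(out_dir / "train_chunk_3.csv")
print(chunk_3)
```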



In [None]:
# A function that returns the rows belonging to a single chunk
def create_chunk_list(df, col_ref, col_val):
    """
        This function accepts a dataframe, a reference column and the column value to filter by.
        It returns a new dataframe containing only the rows where the reference column matches the passed value.
    """
    new_df = df[df[col_ref] == col_val]
        
    return new_df
            

In [None]:
# Create one chunk per userId. We iterate over the userIds actually present,
# because they are not a contiguous 1..162350 range (the largest is 162541).
for user_id in useful_train['userId'].unique():
    chunk_name = "train_chunk_{0}".format(user_id)
    # Filter the full train table; the chunk variable does not exist yet, so it cannot be the input
    globals()[chunk_name] = create_chunk_list(useful_train, 'userId', user_id)
    

In [22]:
# Order train dataset by userId
useful_train.sort_values(by=['userId'], inplace= True)
useful_train.head()

Unnamed: 0,userId,movieId,rating,timestamp
5122500,1,3949,5.0,1147868678
9153002,1,1175,3.5,1147868826
6923102,1,6016,5.0,1147869090
724395,1,7323,3.5,1147869119
2805472,1,4973,4.5,1147869080


In [23]:
# View the last five rows of the train data
useful_train.tail()

Unnamed: 0,userId,movieId,rating,timestamp
9103441,162541,2396,4.0,1240952712
547504,162541,4973,4.5,1240950790
7991803,162541,2539,1.0,1240950911
1861237,162541,1201,3.0,1240953800
9435687,162541,1230,3.5,1240951041


In [24]:
# Get the values of all userId into a list 
train_userids = useful_train['userId'].unique().tolist()
print(f'There are {len(train_userids)} of different userIds in the useful train dataset')

There are 162350 of different userIds in the useful train dataset


#### Order the test dataset by userId

In [25]:
# Order test dataset 
useful_test = test.sort_values(by=['userId'])
useful_test.head()

Unnamed: 0,userId,movieId
0,1,2011
1,1,4144
2,1,5767
3,1,6711
4,1,7318


In [26]:
# View the last five rows of the test dataset
useful_test.tail()

Unnamed: 0,userId,movieId
4999993,162541,345
4999992,162541,150
5000017,162541,5689
5000004,162541,2324
5000018,162541,7153


From the result above, dividing 162350 by 40 gives approximately 4059 (4058.75, rounded up). This means that if we divide the useful train dataset into 40 chunks, the first 39 chunks hold 4059 unique userIds each and the last chunk holds 4049.
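
The arithmetic above can be checked in a few lines. The variable names below are illustrative only; the counts come from the cell outputs earlier in this section.

```python
import math

n_users = 162350  # unique userIds in the useful train dataset
n_chunks = 40

# Round up so that 40 chunks are enough to cover every userId
chunk_size = math.ceil(n_users / n_chunks)

# The last chunk holds whatever remains after 39 full chunks
last_chunk_size = n_users - chunk_size * (n_chunks - 1)

print(chunk_size, last_chunk_size)
```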

### Compare userId positions in both datasets

In [28]:
twotables = useful_test.copy()
# useful_test['userId'].compare(useful_train['userId'])

In [29]:
# A function that generate a list of chunks
def create_chunk_list(obj, limit, cycle):
    """
        This function accepts a list of data as the obj argument, a limit value which is the maximum number of items
        in each chunk, and a cycle, which is the number of chunks to be created
    """
    chunks = []
    start = 0
    new_limit = limit
    for i in range(cycle):
        chunks.append(obj[start:new_limit])
        start = start + limit
        new_limit = new_limit + limit
        
    return chunks
            

In [30]:
# Create 1624 chunks of userId, with a chunk size of 100
userId_chunks = create_chunk_list(train_userids, 100, 1624)

# Random evaluation of the userId chunk size
print(f'We have {len(userId_chunks)} chunks of dataset')
print(f'The length of the first chunk is: {len(userId_chunks[0])}')
print(f'The length of the 1623 chunk is: {len(userId_chunks[1622])}')
print(f'The length of the last chunk is: {len(userId_chunks[-1])}')

We have 1624 chunks of dataset
The length of the first chunk is: 100
The length of the 1623 chunk is: 100
The length of the last chunk is: 50


We separated the unique userIds into 1624 chunks, which gives 100 unique userIds in each of the first 1623 chunks and
50 unique userIds in the last chunk.
For a better understanding, let us view the first 10 userIds in three of the chunks.

In [31]:
# Random evaluation of the first 10 userids in the chunks created above

print(f'The first 10 userIds in the first chunk are: {userId_chunks[0][:10]}')
print(f'The first 10 userIds in the 1623 chunk is: {userId_chunks[1622][:10]}')
print(f'The first 10 userIds in the last chunk is: {userId_chunks[-1][:10]}')

The first 10 userIds in the first chunk are: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The first 10 userIds in the 1623 chunk is: [162392, 162393, 162394, 162395, 162396, 162397, 162398, 162399, 162400, 162401]
The first 10 userIds in the last chunk is: [162492, 162493, 162494, 162495, 162496, 162497, 162498, 162499, 162500, 162501]
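
The same fixed-size chunking can be written without pre-computing the number of chunks, by stepping through the list in strides. This is a sketch of an alternative, not the function used above; `chunk_list` and `user_ids_demo` are hypothetical names.

```python
def chunk_list(items, size):
    """Split items into consecutive slices of at most `size` elements."""
    return [items[start:start + size] for start in range(0, len(items), size)]


# 250 made-up userIds split into chunks of 100
user_ids_demo = list(range(1, 251))
chunks_demo = chunk_list(user_ids_demo, 100)

print(len(chunks_demo), [len(c) for c in chunks_demo])
```

With 162350 items and a chunk size of 100, this yields the same 1624 chunks, the last one holding 50 items.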


#### Divide dataset

For simplicity and ease of explanation, we create one named variable per chunk. A tidier approach would be to build the chunks dynamically and store them in a list. This operation generates **train_chunk_1** to **train_chunk_1624**.

In [32]:
# Divide useful train into chunks, to create train_chunk_1 to train_chunk_1624
train_chunks_name = []
for index, chunkId in enumerate(userId_chunks):
    chunk_name = "train_chunk_{0}".format(index + 1)
    globals()[chunk_name] = useful_train[useful_train['userId'].isin(chunkId)]
    train_chunks_name.append(chunk_name)
    
# THE ABOVE AS A FUNCTION
# def chunk_dataframe(df,col , chunk_ref):
#     """
#         This function accepts a dataframe and a chunk reference, which it uses to create smaller pieces of dataframe
#         as a chunk to the inputed dataframe. It returns a list of chunked dataframe.
#     """
#     df_chunks = []
#     for index, chunkId in enumerate(chunk_ref):
#         chunk_name = "train_chunk_{0}".format(index + 1)
#         globals()[chunk_name] = df[df[col].isin(chunkId)]
#         df_chunks.append(chunk_name)
        
#     return df_chunks
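
As an alternative to writing into `globals()`, the chunks can be kept in a dictionary keyed by chunk name, which keeps the namespace clean and makes the chunks easy to iterate over or delete. The sketch below uses toy data; `useful_demo`, `user_chunks_demo`, and `train_chunks` are hypothetical names.

```python
import pandas as pd

# Toy train table and two made-up userId chunks
useful_demo = pd.DataFrame({"userId": [1, 1, 2, 3, 4], "movieId": [10, 11, 10, 12, 13]})
user_chunks_demo = [[1, 2], [3, 4]]

# One dictionary entry per chunk instead of one global variable per chunk
train_chunks = {
    f"train_chunk_{i}": useful_demo[useful_demo["userId"].isin(ids)]
    for i, ids in enumerate(user_chunks_demo, start=1)
}

print(list(train_chunks))
print(train_chunks["train_chunk_1"])
```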

In [52]:
# Divide useful test dataset into chunks, to create test_chunk_1 to test_chunk_1624
test_chunks_name = []
for index, chunkId in enumerate(userId_chunks):
    chunk_name = "test_chunk_{0}".format(index + 1)
    globals()[chunk_name] = useful_test[useful_test['userId'].isin(chunkId)]
    test_chunks_name.append(chunk_name)

In [33]:
#Test the above operation by printing the first five rows of the first chunk and the last five rows of the last chunk
train_chunk_1.head()

Unnamed: 0,userId,movieId,rating,timestamp
5122500,1,3949,5.0,1147868678
9153002,1,1175,3.5,1147868826
6923102,1,6016,5.0,1147869090
724395,1,7323,3.5,1147869119
2805472,1,4973,4.5,1147869080


In [34]:
# View the last five rows of the last chunk of the train data
train_chunk_1624.tail()

Unnamed: 0,userId,movieId,rating,timestamp
9103441,162541,2396,4.0,1240952712
547504,162541,4973,4.5,1240950790
7991803,162541,2539,1.0,1240950911
1861237,162541,1201,3.0,1240953800
9435687,162541,1230,3.5,1240951041


In [54]:
# View the last five rows of the last chunk of the test data
test_chunk_1624.tail()

Unnamed: 0,userId,movieId
4999993,162541,345
4999992,162541,150
5000017,162541,5689
5000004,162541,2324
5000018,162541,7153


In [35]:
# View the first and last chunk names created above
print(f'The first chunk name created above is: {train_chunks_name[0]}')
print(f'The last chunk name created above is: {train_chunks_name[-1]}')

The first chunk name created above is: train_chunk_1
The last chunk name created above is: train_chunk_1624


In [55]:
# View the first and last test dataset chunk names created above
print(f'The first chunk name created above is: {test_chunks_name[0]}')
print(f'The last chunk name created above is: {test_chunks_name[-1]}')

The first chunk name created above is: test_chunk_1
The last chunk name created above is: test_chunk_1624


#### Merging of tables

We proceed to merge the imdb_data table with every train and test chunk created above, using movieId as the key. This operation generates the variables **merge_chunk_1** to **merge_chunk_1624** for the train data and **test_merge_chunk_1** to **test_merge_chunk_1624** for the test data.

In [36]:
# Merge chunk with imdb_data 
merge_chunks_name = []
for index, chunk_name in enumerate(train_chunks_name):
    merge_name = "merge_chunk_{0}".format(index + 1)
    globals()[merge_name] = globals()[chunk_name].merge(imdb_data, on = 'movieId', how= 'left')
    merge_chunks_name.append(merge_name)
    
# View the result of the first five rows in the first merged chunks
merge_chunk_1.head()

Unnamed: 0,userId,movieId,rating,timestamp,title_cast,director,runtime,budget,plot_keywords
0,1,3949,5.0,1147868678,Ellen Burstyn|Jared Leto|Jennifer Connelly|Mar...,Hubert Selby Jr.,102.0,"$4,500,000",drug addiction|heroin|sex show|sex scene
1,1,1175,3.5,1147868826,Pascal Benezech|Dominique Pinon|Marie-Laure Do...,Jean-Pierre Jeunet,99.0,"FRF24,000,000",black comedy|absurd comedy|surrealist|bed
2,1,6016,5.0,1147869090,Alexandre Rodrigues|Leandro Firmino|Phellipe H...,Kátia Lund,130.0,"$3,300,000",photographer|slum|gang|brazil
3,1,7323,3.5,1147869119,Daniel Brühl|Katrin Saß|Chulpan Khamatova|Mari...,Bernd Lichtenberg,121.0,"EUR4,800,000",coma|german democratic republic|capitalism|pol...
4,1,4973,4.5,1147869080,Audrey Tautou|Mathieu Kassovitz|Rufus|Lorella ...,Guillaume Laurant,122.0,"$10,000,000",female protagonist|paris france|france|montmar...


In [56]:
# Merge test chunk with imdb_data 
test_merge_chunks_name = []
for index, chunk_name in enumerate(test_chunks_name):
    merge_name = "test_merge_chunk_{0}".format(index + 1)
    globals()[merge_name] = globals()[chunk_name].merge(imdb_data, on = 'movieId', how= 'left')
    test_merge_chunks_name.append(merge_name)
    
# View the result of the first five rows in the first merged chunks
test_merge_chunk_1.head()

Unnamed: 0,userId,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,2011,,,,,
1,1,4144,,,,,
2,1,5767,,,,,
3,1,6711,Scarlett Johansson|Bill Murray|Akiko Takeshita...,Sofia Coppola,102.0,"$4,000,000",older man younger woman relationship|lonelines...
4,1,7318,Jim Caviezel|Maia Morgenstern|Christo Jivkov|F...,Benedict Fitzgerald,127.0,"$30,000,000",suffering|torture|brutality|whipping
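
The empty metadata fields in the first rows above follow from the left join: a `movieId` with no match in `imdb_data` keeps its row and receives missing values. A minimal sketch with made-up toy frames (the names `chunk` and `imdb_toy` and all values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-ins for a ratings chunk and the imdb_data table (illustrative values only)
chunk = pd.DataFrame({"userId": [1, 1, 1], "movieId": [10, 20, 30]})
imdb_toy = pd.DataFrame({"movieId": [20, 30], "director": ["Director A", "Director B"]})

# how='left' keeps every row of the chunk; unmatched movieIds get missing metadata
merged = chunk.merge(imdb_toy, on="movieId", how="left")
print(merged)

# indicator=True adds a _merge column that flags rows without a match
flagged = chunk.merge(imdb_toy, on="movieId", how="left", indicator=True)
n_unmatched = (flagged["_merge"] == "left_only").sum()
print(f"Rows without metadata: {n_unmatched}")
```

Counting the `left_only` rows per chunk is one way to gauge how much of the data will rely on genres alone later on.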


In [37]:
# View the result of the last five rows in the last merged chunk 
merge_chunk_1624.tail()

Unnamed: 0,userId,movieId,rating,timestamp,title_cast,director,runtime,budget,plot_keywords
5336,162541,2396,4.0,1240952712,Geoffrey Rush|Tom Wilkinson|Steven O'Donnell|T...,John Madden,123.0,"$25,000,000",william shakespeare character|shakespeare play...
5337,162541,4973,4.5,1240950790,Audrey Tautou|Mathieu Kassovitz|Rufus|Lorella ...,Guillaume Laurant,122.0,"$10,000,000",female protagonist|paris france|france|montmar...
5338,162541,2539,1.0,1240950911,Robert De Niro|Billy Crystal|Lisa Kudrow|Chazz...,Kenneth Lonergan,103.0,"$80,000,000",sex scene|mafia boss|mob boss|sexual intercourse
5339,162541,1201,3.0,1240953800,,,,,
5340,162541,1230,3.5,1240951041,,,,,


In [38]:
# Drop columns that are considered not necessary from the result of the merge operation above

for  merge_name in merge_chunks_name:
    globals()[merge_name] = globals()[merge_name].drop(columns=['timestamp', 'runtime', 'budget'])
    
# View the result of the first five rows in the first merged chunks
merge_chunk_1.head()

Unnamed: 0,userId,movieId,rating,title_cast,director,plot_keywords
0,1,3949,5.0,Ellen Burstyn|Jared Leto|Jennifer Connelly|Mar...,Hubert Selby Jr.,drug addiction|heroin|sex show|sex scene
1,1,1175,3.5,Pascal Benezech|Dominique Pinon|Marie-Laure Do...,Jean-Pierre Jeunet,black comedy|absurd comedy|surrealist|bed
2,1,6016,5.0,Alexandre Rodrigues|Leandro Firmino|Phellipe H...,Kátia Lund,photographer|slum|gang|brazil
3,1,7323,3.5,Daniel Brühl|Katrin Saß|Chulpan Khamatova|Mari...,Bernd Lichtenberg,coma|german democratic republic|capitalism|pol...
4,1,4973,4.5,Audrey Tautou|Mathieu Kassovitz|Rufus|Lorella ...,Guillaume Laurant,female protagonist|paris france|france|montmar...


In [57]:
# Drop test columns that are considered not necessary from the result of the merge operation above

for  merge_name in test_merge_chunks_name:
    globals()[merge_name] = globals()[merge_name].drop(columns=['runtime', 'budget'])
    
# View the result of the first five rows in the first merged chunks
test_merge_chunk_1.head()

Unnamed: 0,userId,movieId,title_cast,director,plot_keywords
0,1,2011,,,
1,1,4144,,,
2,1,5767,,,
3,1,6711,Scarlett Johansson|Bill Murray|Akiko Takeshita...,Sofia Coppola,older man younger woman relationship|lonelines...
4,1,7318,Jim Caviezel|Maia Morgenstern|Christo Jivkov|F...,Benedict Fitzgerald,suffering|torture|brutality|whipping


In [39]:
# Merge chunks with movies table 

for  merge_name in merge_chunks_name:
    globals()[merge_name] = globals()[merge_name].merge(movies, on = 'movieId', how= 'left')
    
# View the result of the first five rows in the first merged chunks
merge_chunk_1.head()

Unnamed: 0,userId,movieId,rating,title_cast,director,plot_keywords,title,genres
0,1,3949,5.0,Ellen Burstyn|Jared Leto|Jennifer Connelly|Mar...,Hubert Selby Jr.,drug addiction|heroin|sex show|sex scene,Requiem for a Dream (2000),Drama
1,1,1175,3.5,Pascal Benezech|Dominique Pinon|Marie-Laure Do...,Jean-Pierre Jeunet,black comedy|absurd comedy|surrealist|bed,Delicatessen (1991),Comedy|Drama|Romance
2,1,6016,5.0,Alexandre Rodrigues|Leandro Firmino|Phellipe H...,Kátia Lund,photographer|slum|gang|brazil,City of God (Cidade de Deus) (2002),Action|Adventure|Crime|Drama|Thriller
3,1,7323,3.5,Daniel Brühl|Katrin Saß|Chulpan Khamatova|Mari...,Bernd Lichtenberg,coma|german democratic republic|capitalism|pol...,"Good bye, Lenin! (2003)",Comedy|Drama
4,1,4973,4.5,Audrey Tautou|Mathieu Kassovitz|Rufus|Lorella ...,Guillaume Laurant,female protagonist|paris france|france|montmar...,"Amelie (Fabuleux destin d'Amélie Poulain, Le) ...",Comedy|Romance


In [58]:
# Merge test chunks with movies table 

for  merge_name in test_merge_chunks_name:
    globals()[merge_name] = globals()[merge_name].merge(movies, on = 'movieId', how= 'left')
    
# View the result of the first five rows in the first merged chunks
test_merge_chunk_1.head()

Unnamed: 0,userId,movieId,title_cast,director,plot_keywords,title,genres
0,1,2011,,,,Back to the Future Part II (1989),Adventure|Comedy|Sci-Fi
1,1,4144,,,,In the Mood For Love (Fa yeung nin wa) (2000),Drama|Romance
2,1,5767,,,,Teddy Bear (Mis) (1981),Comedy|Crime
3,1,6711,Scarlett Johansson|Bill Murray|Akiko Takeshita...,Sofia Coppola,older man younger woman relationship|lonelines...,Lost in Translation (2003),Comedy|Drama|Romance
4,1,7318,Jim Caviezel|Maia Morgenstern|Christo Jivkov|F...,Benedict Fitzgerald,suffering|torture|brutality|whipping,"Passion of the Christ, The (2004)",Drama


#### Free up memory held by global variables

In [61]:
# Delete the train and test chunks stored as global variables
for data in train_chunks_name:
    del globals()[data]
for data in test_chunks_name:
    del globals()[data]

#### Data formatting

Before we can use any string vectorizer on our data, we need to properly format the data.

In [41]:
# Remove delimiters (separators) from string data
def splitter(df, col_list, delim):
    """
        Accept a dataframe (df), a list of columns (col_list) whose values contain
        the delimiter (delim), and return a copy of the dataframe in which the
        delimiter in those columns is replaced by a single space.
    """
    new_df = df.copy()
    
    for col in col_list:
        new_df[col] = new_df[col].str.split(delim).str.join(' ')
    
    return new_df
        

In [42]:
# Remove separators from string data
for  merge_name in merge_chunks_name:
    globals()[merge_name] = splitter(globals()[merge_name], ['title_cast', 'plot_keywords', 'genres'], '|')
    
    
# View the result of the first five rows in the first merged chunks
merge_chunk_1.head()

Unnamed: 0,userId,movieId,rating,title_cast,director,plot_keywords,title,genres
0,1,3949,5.0,Ellen Burstyn Jared Leto Jennifer Connelly Mar...,Hubert Selby Jr.,drug addiction heroin sex show sex scene,Requiem for a Dream (2000),Drama
1,1,1175,3.5,Pascal Benezech Dominique Pinon Marie-Laure Do...,Jean-Pierre Jeunet,black comedy absurd comedy surrealist bed,Delicatessen (1991),Comedy Drama Romance
2,1,6016,5.0,Alexandre Rodrigues Leandro Firmino Phellipe H...,Kátia Lund,photographer slum gang brazil,City of God (Cidade de Deus) (2002),Action Adventure Crime Drama Thriller
3,1,7323,3.5,Daniel Brühl Katrin Saß Chulpan Khamatova Mari...,Bernd Lichtenberg,coma german democratic republic capitalism pol...,"Good bye, Lenin! (2003)",Comedy Drama
4,1,4973,4.5,Audrey Tautou Mathieu Kassovitz Rufus Lorella ...,Guillaume Laurant,female protagonist paris france france montmar...,"Amelie (Fabuleux destin d'Amélie Poulain, Le) ...",Comedy Romance
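
To make the effect of `splitter` concrete, the sketch below applies the same split-then-join step to a toy column (the frame `toy` and its values are illustrative). It also shows that a missing value stays missing rather than raising an error, which matters because the left joins above leave gaps in the metadata columns.

```python
import pandas as pd

# Illustrative toy column; the second value is missing on purpose
toy = pd.DataFrame({"genres": ["Comedy|Drama|Romance", None, "Drama"]})

# Same split-then-join step that splitter applies to each listed column
cleaned = toy["genres"].str.split("|").str.join(" ")
print(cleaned.tolist())
```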


In [62]:
for  merge_name in test_merge_chunks_name:
    globals()[merge_name] = splitter(globals()[merge_name], ['title_cast', 'plot_keywords', 'genres'], '|')
    
    
# View the result of the first five rows in the first merged chunks
test_merge_chunk_1.head()

Unnamed: 0,userId,movieId,title_cast,director,plot_keywords,title,genres
0,1,2011,,,,Back to the Future Part II (1989),Adventure Comedy Sci-Fi
1,1,4144,,,,In the Mood For Love (Fa yeung nin wa) (2000),Drama Romance
2,1,5767,,,,Teddy Bear (Mis) (1981),Comedy Crime
3,1,6711,Scarlett Johansson Bill Murray Akiko Takeshita...,Sofia Coppola,older man younger woman relationship lonelines...,Lost in Translation (2003),Comedy Drama Romance
4,1,7318,Jim Caviezel Maia Morgenstern Christo Jivkov F...,Benedict Fitzgerald,suffering torture brutality whipping,"Passion of the Christ, The (2004)",Drama


In [43]:
# View the last merged chunk
merge_chunk_1624.tail()

Unnamed: 0,userId,movieId,rating,title_cast,director,plot_keywords,title,genres
5336,162541,2396,4.0,Geoffrey Rush Tom Wilkinson Steven O'Donnell T...,John Madden,william shakespeare character shakespeare play...,Shakespeare in Love (1998),Comedy Drama Romance
5337,162541,4973,4.5,Audrey Tautou Mathieu Kassovitz Rufus Lorella ...,Guillaume Laurant,female protagonist paris france france montmar...,"Amelie (Fabuleux destin d'Amélie Poulain, Le) ...",Comedy Romance
5338,162541,2539,1.0,Robert De Niro Billy Crystal Lisa Kudrow Chazz...,Kenneth Lonergan,sex scene mafia boss mob boss sexual intercourse,Analyze This (1999),Comedy
5339,162541,1201,3.0,,,,"Good, the Bad and the Ugly, The (Buono, il bru...",Action Adventure Western
5340,162541,1230,3.5,,,,Annie Hall (1977),Comedy Romance


In [44]:
# Combine the values of the columns of interest into a single string for vectorization
title_list = []
indices_list = []

for index, merge_name in enumerate(merge_chunks_name):
    globals()[merge_name]['key_words'] = (pd.Series(globals()[merge_name][['title_cast', 'director', 'plot_keywords', 
                                                                           'genres']].fillna('')
                      .values.tolist()).str.join(' '))
    
    titles = "titles_{0}".format(index + 1)
    globals()[titles] = globals()[merge_name]['title']
    title_list.append(titles)
    
    indices = "indices_{0}".format(index + 1)
    globals()[indices] = pd.Series(globals()[merge_name].index, index=globals()[merge_name]['title'])
    indices_list.append(indices)
    
    
# View the result of the first five rows in the first merged chunks
merge_chunk_1.head()

Unnamed: 0,userId,movieId,rating,title_cast,director,plot_keywords,title,genres,key_words
0,1,3949,5.0,Ellen Burstyn Jared Leto Jennifer Connelly Mar...,Hubert Selby Jr.,drug addiction heroin sex show sex scene,Requiem for a Dream (2000),Drama,Ellen Burstyn Jared Leto Jennifer Connelly Mar...
1,1,1175,3.5,Pascal Benezech Dominique Pinon Marie-Laure Do...,Jean-Pierre Jeunet,black comedy absurd comedy surrealist bed,Delicatessen (1991),Comedy Drama Romance,Pascal Benezech Dominique Pinon Marie-Laure Do...
2,1,6016,5.0,Alexandre Rodrigues Leandro Firmino Phellipe H...,Kátia Lund,photographer slum gang brazil,City of God (Cidade de Deus) (2002),Action Adventure Crime Drama Thriller,Alexandre Rodrigues Leandro Firmino Phellipe H...
3,1,7323,3.5,Daniel Brühl Katrin Saß Chulpan Khamatova Mari...,Bernd Lichtenberg,coma german democratic republic capitalism pol...,"Good bye, Lenin! (2003)",Comedy Drama,Daniel Brühl Katrin Saß Chulpan Khamatova Mari...
4,1,4973,4.5,Audrey Tautou Mathieu Kassovitz Rufus Lorella ...,Guillaume Laurant,female protagonist paris france france montmar...,"Amelie (Fabuleux destin d'Amélie Poulain, Le) ...",Comedy Romance,Audrey Tautou Mathieu Kassovitz Rufus Lorella ...
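
Because missing fields are filled with empty strings before joining, rows without IMDb metadata end up with leading or repeated spaces in `key_words`. This is harmless for a word-based vectorizer, but a hedged alternative sketch is shown below; it uses toy data, and the whitespace-normalising step is an assumption that is not part of the original pipeline.

```python
import pandas as pd

# Toy rows: one with full metadata, one where only genres is known (illustrative values)
df = pd.DataFrame({
    "title_cast": ["Scarlett Johansson Bill Murray", None],
    "director": ["Sofia Coppola", None],
    "plot_keywords": ["loneliness tokyo", None],
    "genres": ["Comedy Drama Romance", "Drama Romance"],
})
cols = ["title_cast", "director", "plot_keywords", "genres"]

# Join the columns row by row, then collapse any run of whitespace into a single space
key_words = df[cols].fillna("").agg(" ".join, axis=1).str.split().str.join(" ")
print(key_words.tolist())
```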


In [63]:
# Combine the values of the columns of interest into a single string for vectorization
test_title_list = []
test_indices_list = []

for index, merge_name in enumerate(test_merge_chunks_name):
    globals()[merge_name]['key_words'] = (pd.Series(globals()[merge_name][['title_cast', 'director', 'plot_keywords', 
                                                                           'genres']].fillna('')
                      .values.tolist()).str.join(' '))
    
    # Use test-specific names so the train titles_* and indices_* variables are not overwritten
    titles = "test_titles_{0}".format(index + 1)
    globals()[titles] = globals()[merge_name]['title']
    test_title_list.append(titles)
    
    indices = "test_indices_{0}".format(index + 1)
    globals()[indices] = pd.Series(globals()[merge_name].index, index=globals()[merge_name]['title'])
    test_indices_list.append(indices)
    
    
# View the result of the first five rows in the first merged chunks
test_merge_chunk_1.head()

Unnamed: 0,userId,movieId,title_cast,director,plot_keywords,title,genres,key_words
0,1,2011,,,,Back to the Future Part II (1989),Adventure Comedy Sci-Fi,Adventure Comedy Sci-Fi
1,1,4144,,,,In the Mood For Love (Fa yeung nin wa) (2000),Drama Romance,Drama Romance
2,1,5767,,,,Teddy Bear (Mis) (1981),Comedy Crime,Comedy Crime
3,1,6711,Scarlett Johansson Bill Murray Akiko Takeshita...,Sofia Coppola,older man younger woman relationship lonelines...,Lost in Translation (2003),Comedy Drama Romance,Scarlett Johansson Bill Murray Akiko Takeshita...
4,1,7318,Jim Caviezel Maia Morgenstern Christo Jivkov F...,Benedict Fitzgerald,suffering torture brutality whipping,"Passion of the Christ, The (2004)",Drama,Jim Caviezel Maia Morgenstern Christo Jivkov F...


In [45]:
# Drop unwanted columns in the train data set
for index, merge_name in enumerate(merge_chunks_name):
    globals()[merge_name].drop(columns= ['title_cast','director','plot_keywords', 'genres'], inplace=True)
merge_chunk_1.head()

Unnamed: 0,userId,movieId,rating,title,key_words
0,1,3949,5.0,Requiem for a Dream (2000),Ellen Burstyn Jared Leto Jennifer Connelly Mar...
1,1,1175,3.5,Delicatessen (1991),Pascal Benezech Dominique Pinon Marie-Laure Do...
2,1,6016,5.0,City of God (Cidade de Deus) (2002),Alexandre Rodrigues Leandro Firmino Phellipe H...
3,1,7323,3.5,"Good bye, Lenin! (2003)",Daniel Brühl Katrin Saß Chulpan Khamatova Mari...
4,1,4973,4.5,"Amelie (Fabuleux destin d'Amélie Poulain, Le) ...",Audrey Tautou Mathieu Kassovitz Rufus Lorella ...


In [64]:
# Drop unwanted columns in the test data set
for index, merge_name in enumerate(test_merge_chunks_name):
    globals()[merge_name].drop(columns= ['title_cast','director','plot_keywords', 'genres'], inplace=True)
test_merge_chunk_1.head()

Unnamed: 0,userId,movieId,title,key_words
0,1,2011,Back to the Future Part II (1989),Adventure Comedy Sci-Fi
1,1,4144,In the Mood For Love (Fa yeung nin wa) (2000),Drama Romance
2,1,5767,Teddy Bear (Mis) (1981),Comedy Crime
3,1,6711,Lost in Translation (2003),Scarlett Johansson Bill Murray Akiko Takeshita...
4,1,7318,"Passion of the Christ, The (2004)",Jim Caviezel Maia Morgenstern Christo Jivkov F...


In [50]:
# View the shape of the first chunk in train dataset
merge_chunk_1.shape

(5260, 5)

In [65]:
# View the shape of the first chunk in test dataset
test_merge_chunk_1.shape

(2748, 4)

In [66]:
# View the shape of chunk 1623 (the second-to-last chunk) in the train dataset
merge_chunk_1623.shape

(5915, 5)

In [67]:
# View the shape of chunk 1623 (the second-to-last chunk) in the test dataset
test_merge_chunk_1623.shape

(3031, 4)

In [49]:
# Save the merge table for the train dataset
for index, merge_name in enumerate(merge_chunks_name):
    directory = './data/chunked_train_data/'+merge_name+'.csv'
    globals()[merge_name].to_csv(directory,index=False)

In [68]:
# Save the merge table for the test dataset
for index, merge_name in enumerate(test_merge_chunks_name):
    directory = './data/chunked_test_data/'+merge_name+'.csv'
    globals()[merge_name].to_csv(directory,index=False)

In [69]:
merge_chunk_1.loc[0,'key_words']

"Ellen Burstyn Jared Leto Jennifer Connelly Marlon Wayans Christopher McDonald Louise Lasser Marcia Jean Kurtz Janet Sarno Suzanne Shepherd Joanne Gordon Charlotte Aronofsky Mark Margolis Michael Kaycheck Jack O'Connell Chas Mastin Hubert Selby Jr. drug addiction heroin sex show sex scene Drama"

#### Free up memory space

Before we proceed to the next step, which is computationally intensive, we free up some memory space.

In [70]:
# Free up memory space 

train = None
genome_scores = None
genome_tags = None
links = None

print(train)

None
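
Rebinding a name to `None`, as in the cell above, releases the underlying object only when no other reference to it remains. The sketch below illustrates this with a hypothetical stand-in class (`Table`) and a weak reference used purely as a probe; in a notebook, cached cell outputs may also hold references, so this is an illustration rather than a guarantee.

```python
import gc
import weakref


class Table:
    """Hypothetical stand-in for a large DataFrame."""


train_toy = Table()
probe = weakref.ref(train_toy)  # lets us observe whether the object is still alive

del train_toy   # removes the name; `train_toy = None` drops the reference equally well
gc.collect()    # optional: also reclaims objects caught in reference cycles

print(probe() is None)
```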


**TfidfVectorizer**
We now need a mechanism to convert these textual features into a format that lets us compute their similarity to one another.
To achieve this, we use **TfidfVectorizer** to translate the `key_words` column, our string-based combination of title_cast, director, plot_keywords and genres, into numerical vectors.
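
As a small illustration of the idea, the sketch below vectorizes three made-up `key_words` strings and compares them. The strings are invented for the example; the vectorizer settings mirror the ones used in this notebook apart from `min_df`, which is left at its default.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up key_words strings (illustrative only)
docs = [
    "drama heroin addiction",
    "comedy drama romance",
    "comedy romance",
]

vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), stop_words="english")
tfidf = vectorizer.fit_transform(docs)   # sparse matrix: one row per document

sim = cosine_similarity(tfidf)           # pairwise similarities between the rows
print(tfidf.shape[0], sim.shape)
print(sim.round(2))
```

The first and third strings share no terms, so their similarity is zero, while the second and third share two terms and score highest among the off-diagonal pairs.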

In [42]:
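# min_df=1 (the default) keeps every term; recent scikit-learn versions reject an integer min_df of 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,2), min_df=1, stop_words='english')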

vector_list = []

for index, merge_name in enumerate(merge_chunks_name):
    tf_matrix = "tf_matrix_{0}".format(index + 1)
    
    globals()[tf_matrix] = tf.fit_transform(globals()[merge_name]['key_words'])
    vector_list.append(tf_matrix)

In [59]:
tf_matrix_1.shape

(5260, 57409)

In [60]:
for data in merge_chunks_name:
    del globals()[data]

In [None]:
# Create the first 100 cosine similarities 
cosine_sim_list = []

real_index = 0
stop_count = 100
real_vector_list = vector_list[real_index:stop_count]


for index, tf_matrix in enumerate(real_vector_list):
    cosine_sim = "cosine_sim_{0}".format(real_index + 1)
    
    globals()[cosine_sim] = cosine_similarity(globals()[tf_matrix], 
                                        globals()[tf_matrix])
    cosine_sim_list.append(cosine_sim)
    
    real_index = real_index + 1
    if real_index == stop_count :
        break

print (cosine_sim_1.shape)

In [62]:
# Print the shape of the cosine similarity matrix for the 100th chunk
cosine_sim_100.shape

(6094, 6094)
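
Each similarity matrix is dense and square, so its memory footprint grows with the square of the chunk size, which is presumably why the matrices are built in batches. A back-of-the-envelope sketch, assuming 8-byte float64 values (the helper name `dense_similarity_megabytes` is hypothetical):

```python
def dense_similarity_megabytes(n_rows, bytes_per_value=8):
    """Approximate size in MB of a dense n_rows x n_rows matrix (float64 assumed)."""
    return n_rows * n_rows * bytes_per_value / 1e6


# Shape reported above for cosine_sim_100
print(round(dense_similarity_megabytes(6094)))  # 297
```

At roughly 300 MB per matrix, 100 of them already approach 30 GB. If that becomes a bottleneck, `cosine_similarity` accepts `dense_output=False`, which keeps the result sparse when the inputs are sparse; whether this saves memory depends on how many similarities are exactly zero.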

In [63]:
# View the next start point
real_index

100

In [64]:
# Create the next 50 cosine similarities (chunks 101 to 150)

stop_count = 150
real_vector_list = vector_list[real_index:stop_count]


for index, tf_matrix in enumerate(real_vector_list):
    cosine_sim = "cosine_sim_{0}".format(real_index + 1)
    
    globals()[cosine_sim] = cosine_similarity(globals()[tf_matrix], 
                                        globals()[tf_matrix])
    cosine_sim_list.append(cosine_sim)
    
    real_index = real_index + 1
    if real_index == stop_count :
        break

# View the last cosine similarity value generated 
print (cosine_sim_150.shape)


(6023, 6023)


In [65]:
# View the next start point
real_index

150

In [66]:
# Create the next 50 cosine similarities (chunks 151 to 200)

stop_count = 200
real_vector_list = vector_list[real_index:stop_count]


for index, tf_matrix in enumerate(real_vector_list):
    cosine_sim = "cosine_sim_{0}".format(real_index + 1)
    
    globals()[cosine_sim] = cosine_similarity(globals()[tf_matrix], 
                                        globals()[tf_matrix])
    cosine_sim_list.append(cosine_sim)
    
    real_index = real_index + 1
    if real_index == stop_count :
        break

# View the last cosine similarity value generated 
print (cosine_sim_200.shape)

(5638, 5638)


In [67]:
# View the next start point
real_index

200

In [68]:
# Create the next 50 cosine similarities (chunks 201 to 250)

stop_count = 250
real_vector_list = vector_list[real_index:stop_count]


for index, tf_matrix in enumerate(real_vector_list):
    cosine_sim = "cosine_sim_{0}".format(real_index + 1)
    
    globals()[cosine_sim] = cosine_similarity(globals()[tf_matrix], 
                                        globals()[tf_matrix])
    cosine_sim_list.append(cosine_sim)
    
    real_index = real_index + 1
    if real_index == stop_count :
        break

# View the last cosine similarity value generated 
print (cosine_sim_250.shape)

(5638, 5638)


In [70]:
# View the next start point
real_index

250

In [None]:
# Create the next 50 cosine similarities (chunks 251 to 300)

stop_count = 300
real_vector_list = vector_list[real_index:stop_count]


for index, tf_matrix in enumerate(real_vector_list):
    cosine_sim = "cosine_sim_{0}".format(real_index + 1)
    
    globals()[cosine_sim] = cosine_similarity(globals()[tf_matrix], 
                                        globals()[tf_matrix])
    cosine_sim_list.append(cosine_sim)
    
    real_index = real_index + 1
    if real_index == stop_count :
        break

# View the last cosine similarity value generated 
print (cosine_sim_300.shape)

#### Splitting of dataset
We split our transformed dataset into 50 parts, convert each part into a dataframe, and save it to local storage. This process
gives 199957 rows for each of the first 49 chunks and 199952 rows for the last chunk.

In [None]:
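def create_chunk_list(obj, limit):
    """Split obj into consecutive lists of at most limit items each."""
    chunk = []
    chunks = []
    obj_len = len(obj)
    
    for index, value in enumerate(obj):
        chunk.append(value)
        # Close the current chunk when it is full or when the last item is reached
        if len(chunk) == limit or (index + 1) == obj_len:
            chunks.append(chunk)
            chunk = []
    
    return chunks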
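
For comparison, the same chunking behaviour can be sketched with list slicing. The function name `chunk_by_slicing` is hypothetical and is offered only as an equivalent illustration of what `create_chunk_list` returns:

```python
def chunk_by_slicing(items, limit):
    """Split a sequence into consecutive lists of at most `limit` items."""
    items = list(items)
    return [items[start:start + limit] for start in range(0, len(items), limit)]


print(chunk_by_slicing(range(7), 3))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Only the final chunk can be shorter than `limit`, which matches the row counts quoted above, where the last chunk is slightly smaller than the others.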
            

#### Convert vectorised dataset back to dataframe form

In [None]:
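# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
vectorised_df = pd.DataFrame(data=tf_authTags_matrix.toarray(),columns = vector.get_feature_names_out())
vectorised_df.head()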

We now can compute the similarity between each vector within our matrix. This is done by making use of the `cosine_similarity` function provided to us by `sklearn`.

In [None]:
cosine_sim_authTags = cosine_similarity(tf_authTags_matrix, 
                                        tf_authTags_matrix)
print (cosine_sim_authTags.shape)

### 4.2 Collaborative filtering

### 4.3 Rating Prediction

As motivated previously, in some cases we may wish to directly calculate what rating a user _would_ give a movie that they haven't watched yet. 

We can modify our content-based filtering algorithm to do this in the following manner: 

   1. Select a reference user from the database and a reference item (movie) they have _not_ rated. 
   2. For the user, gather the similarity values between the reference item and each item the user _has_ rated. 
   3. Sort the gathered similarity values in descending order. 
   4. Select the $k$ highest similarity values which are above a given threshold value, creating a collection $K$. 
   5. Compute a weighted average rating from these values, which is the sum of the similarity values of each item multiplied by its assigned user-rating, divided by the sum of the similarity values. This can be expressed in formula as:
   
   $$ \hat{R}_{ju} = \frac{\sum_{i \in K} s_{ij} \times r_{iu}}{\sum_{i \in K} s_{ij}}   $$
   
   where $\hat{R}_{ju}$ is the weighted average computed for the reference item $j$ and reference user $u$, $K$ is the collection of items, $s_{ij}$ is the similarity computed between items $i$ and $j$, and $r_{iu}$ is the known rating user $u$ has given item $i$.
   6. We return the weighted average $\hat{R}_{ju}$ as the prediction for our reference item.
   
   
We implement this algorithmic process in the function below:
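
A minimal sketch of the steps above, under stated assumptions: the function name `predict_rating`, its parameters, the toy similarity matrix and the fallback to the user's mean rating when $K$ is empty are all illustrative choices, not taken from the notebook. The user is assumed to have rated at least one item.

```python
def predict_rating(item_j, user_ratings, sim_matrix, k=20, threshold=0.0):
    """Predict the rating a user would give item_j.

    user_ratings: dict mapping item index -> rating the user has given.
    sim_matrix:   square item-item similarity matrix (nested lists or array rows).
    """
    # Step 2: similarity between the reference item and every item the user has rated
    neighbours = [(sim_matrix[item_j][i], r)
                  for i, r in user_ratings.items() if i != item_j]
    # Step 3: sort by similarity, highest first
    neighbours.sort(key=lambda pair: pair[0], reverse=True)
    # Step 4: keep at most k neighbours whose similarity exceeds the threshold
    selected = [(s, r) for s, r in neighbours[:k] if s > threshold]
    sim_sum = sum(s for s, _ in selected)
    if sim_sum == 0:
        # Assumption: with no usable neighbours, fall back to the user's mean rating
        return sum(user_ratings.values()) / len(user_ratings)
    # Step 5: similarity-weighted average of the user's known ratings
    return sum(s * r for s, r in selected) / sim_sum


# Toy example: three items, the user has rated items 1 and 2 but not item 0
sim = [[1.0, 0.8, 0.2],
       [0.8, 1.0, 0.5],
       [0.2, 0.5, 1.0]]
ratings = {1: 4.0, 2: 2.0}
prediction = predict_rating(0, ratings, sim, k=2)
print(round(prediction, 2))  # 3.6
```

In the toy example the prediction is $(0.8 \times 4.0 + 0.2 \times 2.0) / (0.8 + 0.2) = 3.6$, so the more similar item pulls the estimate towards its rating.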

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more models that are able to accurately predict how a user will rate a movie. |

---

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

In [None]:
# Compare model performance

In [None]:
# Choose best model and motivate why it is the best choice

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# Discuss the logic of the chosen method

Run the next cell to make sure the experiment has ended. It notifies Comet.

In [None]:
experiment.end()