<a id="id_0"></a>
# Make Predictions
## Context
**Project Name:** Predict Netflix titles that Mum will watch       
**Written by:** Claudia Wallis         
**Created:** 27/02/2022           
**Last modified:** 27/02/2022           

## Objective
**Predict which Netflix titles Mum will want to watch in England**           

## Input Data 
1. [netflix_dataset_latest_2021_kaggle.xlsx](https://www.kaggle.com/syedmubarak/netflix-dataset-latest-2021)
2.  model_mum_os.pkl from 2.build_model.ipynb

## Table of Contents
1. [Set Up](#id_1)
    - 1a. Import packages
    - 1b. Update variables
2. [Load Data](#id_2)
    - 2a. Load data
    - 2b. Checks
3. [Select Titles Available in England](#id_3)
    - 3a. Clean data
    - 3b. Checks
4. [Prepare Features](#id_4)
5. [Make Predictions](#id_5)

<a id="id_1"></a>
## 1. Set Up
#### 1a) Import packages

In [1]:
# base packages
import pandas as pd
import numpy as np

# etc
import time
import pickle
import os

# NLP related
import re
from nltk.probability import FreqDist
from nltk import word_tokenize
from nltk.corpus import stopwords
from textblob import TextBlob
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

# Model related
from xgboost import XGBClassifier

# pipeline functions
os.chdir('C:\\Users\\claud\\Documents\\code\\')
import pipeline; import importlib; importlib.reload(pipeline)
from pipeline.fns import data_summary

# python related
import warnings
from pandas.core.common import SettingWithCopyWarning
warnings.filterwarnings("ignore")
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

#### 1b) Update variables

In [2]:
# update paths
path = 'C:\\Users\\claud\\Documents\\data\\'

# set options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows',None)
pd.set_option('float_format', '{:f}'.format)

<a id="id_2"></a>
## 2. Load Data
#### 2a) Load data

In [3]:
%%time
# load kaggle dataset
kaggle_netflix = pd.read_excel(path + 'input_data\\netflix_dataset_latest_2021_kaggle.xlsx')

# load model
with open(path + 'output_data//model_mum_os.pkl', 'rb') as file:
    model = pickle.load(file)

# load key values
with open(path + 'output_data//actors_key_values.pkl', 'rb') as file:
    actors_key_values = pickle.load(file)
with open(path + 'output_data//director_key_values.pkl', 'rb') as file:
    director_key_values = pickle.load(file)
with open(path + 'output_data//genre_key_values.pkl', 'rb') as file:
    genre_key_values = pickle.load(file)

Wall time: 10.8 s


#### 2b) Checks

In [4]:
print('Kaggle Netflix Data:\nShape = ' + str(kaggle_netflix.shape))
display(kaggle_netflix.head())
kaggle_netflix['Netflix Release Date'] = pd.to_datetime(kaggle_netflix['Netflix Release Date'])
display(kaggle_netflix['Netflix Release Date'].describe())

Kaggle Netflix Data:
Shape = (9425, 29)


Unnamed: 0,Title,Genre,Tags,Languages,Series or Movie,Hidden Gem Score,Country Availability,Runtime,Director,Writer,Actors,View Rating,IMDb Score,Rotten Tomatoes Score,Metacritic Score,Awards Received,Awards Nominated For,Boxoffice,Release Date,Netflix Release Date,Production House,Netflix Link,IMDb Link,Summary,IMDb Votes,Image,Poster,TMDb Trailer,Trailer Site
0,Lets Fight Ghost,"Crime, Drama, Fantasy, Horror, Romance","Comedy Programmes,Romantic TV Comedies,Horror Programmes,Thai TV Programmes","Swedish, Spanish",Series,4.3,Thailand,< 30 minutes,Tomas Alfredson,John Ajvide Lindqvist,"Lina Leandersson, Kåre Hedebrant, Per Ragnar, Henrik Dahl",R,7.9,98.0,82.0,74.0,57.0,2122065.0,2008-12-12,2021-03-04,"Canal+, Sandrew Metronome",https://www.netflix.com/watch/81415947,https://www.imdb.com/title/tt1139797,"A med student with a supernatural gift tries to cash in on his abilities by facing off against ghosts, till a wandering spirit brings romance instead.",205926.0,https://occ-0-4708-64.1.nflxso.net/dnm/api/v6/evlCitJPPCVCry0BZlEFb5-QjKc/AAAABcmgLCxN8dNahdY2kgd1hhcL2a6XrE92x24Bx5h6JFUvH5zMrv6lFWl_aWMt33b6DHvkgsUeDx_8Q1rmopwT3fuF8Rq3S1hrkvFf3uzVv2sb3zrtU-LM1Zy1FfrAKD3nKNyA_RQWrmw.jpg?r=cd0,https://m.media-amazon.com/images/M/MV5BOWM4NTY2NTMtZDZlZS00NTgyLWEzZDMtODE3ZGI1MzI3ZmU5XkEyXkFqcGdeQXVyNzI1NzMxNzM@._V1_SX300.jpg,https://www.youtube.com/watch?v=LqB6XJix-dM,YouTube
1,HOW TO BUILD A GIRL,Comedy,"Dramas,Comedies,Films Based on Books,British",English,Movie,7.0,Canada,1-2 hour,Coky Giedroyc,Caitlin Moran,"Cleo, Paddy Considine, Beanie Feldstein, Dónal Finn",R,5.8,79.0,69.0,1.0,,70632.0,2020-05-08,2021-03-04,"Film 4, Monumental Pictures, Lionsgate",https://www.netflix.com/watch/81041267,https://www.imdb.com/title/tt4193072,"When nerdy Johanna moves to London, things get out of hand when she reinvents herself as a bad-mouthed music critic to save her poverty-stricken family.",2838.0,https://occ-0-1081-999.1.nflxso.net/dnm/api/v6/evlCitJPPCVCry0BZlEFb5-QjKc/AAAABe_fxMSBM1E-sSoszr12SmkI-498sqBWrEyhkchdn4UklQVjdoPS_Hj-NhvgbePvwlDSzMTcrIE0kgiy-zTEU_EaGg.jpg?r=35a,https://m.media-amazon.com/images/M/MV5BZGUyN2ZlMjYtZTk2Yy00MWZiLWIyMDktMzFlMmEzOWVlMGNiXkEyXkFqcGdeQXVyMTE1MzI2NzIz._V1_SX300.jpg,https://www.youtube.com/watch?v=eIbcxPy4okQ,YouTube
2,The Con-Heartist,"Comedy, Romance","Romantic Comedies,Comedies,Romantic Films,Thai Comedies,Thai Films",Thai,Movie,8.6,Thailand,> 2 hrs,Mez Tharatorn,"Pattaranad Bhiboonsawade, Mez Tharatorn, Thodsapon Thiptinnakorn","Kathaleeya McIntosh, Nadech Kugimiya, Pimchanok Leuwisetpaiboon, Thiti Mahayotaruk",,7.4,,,,,,2020-12-03,2021-03-03,,https://www.netflix.com/watch/81306155,https://www.imdb.com/title/tt13393728,"After her ex-boyfriend cons her out of a large sum of money, a former bank employee tricks a scam artist into helping her swindle him in retaliation.",131.0,https://occ-0-2188-64.1.nflxso.net/dnm/api/v6/evlCitJPPCVCry0BZlEFb5-QjKc/AAAABSj6td_whxb4en62Ax5EKSKMl2lTzEK5CcBhwBdjRgF6SOJb4RtVoLhPAUWEskuOxPiaafxU1qauZDTJguwNQ9GstA.jpg?r=e76,https://m.media-amazon.com/images/M/MV5BODAzOGZmNjUtMTIyMC00NGU1LTg5MTMtZWY4MDdiZjI0NGEwXkEyXkFqcGdeQXVyNzEyMTA5MTU@._V1_SX300.jpg,https://www.youtube.com/watch?v=md3CmFLGK6Y,YouTube
3,Gleboka woda,Drama,"TV Dramas,Polish TV Shows,Social Issue TV Dramas",Polish,Series,8.7,Poland,< 30 minutes,,,"Katarzyna Maciag, Piotr Nowak, Marcin Dorocinski, Julia Kijowska",,7.5,,,2.0,4.0,,2011-06-14,2021-03-03,,https://www.netflix.com/watch/81307527,https://www.imdb.com/title/tt2300049,A group of social welfare workers led by their new director tries to provide necessary aid to people struggling with various problems.,47.0,https://occ-0-2508-2706.1.nflxso.net/dnm/api/v6/evlCitJPPCVCry0BZlEFb5-QjKc/AAAABSxWH_aWvJrqXWANpOp86kFpU3kdpqx9RsdYZZGHfpIalSig2QHKaZXm8vhKWr89-OLh5XqzIHj_5UzwNriADy19NQ.jpg?r=561,https://m.media-amazon.com/images/M/MV5BMTc0NzZiYTYtMTQyNy00Mjg0LTk1NzMtMTljMjI4ZmM4ZjFmXkEyXkFqcGdeQXVyMTc4MzI2NQ@@._V1_SX300.jpg,https://www.youtube.com/watch?v=5kyF2vy63r0,YouTube
4,Only a Mother,Drama,"Social Issue Dramas,Dramas,Movies Based on Books,Period Pieces,Swedish Movies",Swedish,Movie,8.3,"Lithuania,Poland,France,Italy,Spain,Greece,Belgium,Portugal,Netherlands,Germany,Switzerland,United Kingdom,Iceland,Czech Republic",1-2 hour,Alf Sjöberg,Ivar Lo-Johansson,"Hugo Björne, Eva Dahlbeck, Ulf Palme, Ragnar Falck",,6.7,,,2.0,1.0,,1949-10-31,2021-03-03,,https://www.netflix.com/watch/81382068,https://www.imdb.com/title/tt0041155,An unhappily married farm worker struggling to care for her children reflects on her lost youth and the scandalous moment that cost her true love.,88.0,https://occ-0-2851-41.1.nflxso.net/dnm/api/v6/evlCitJPPCVCry0BZlEFb5-QjKc/AAAABdpOFktQ4Z3klQEU2XQc9NWompf70CHEGLPIeBdCGGLDhvy1Mqly5552DUYR5-5M77STCj8rPvCbXltOcTj53olEzA.jpg?r=c84,https://m.media-amazon.com/images/M/MV5BMjVmMzA5OWYtNTFlMy00ZDBlLTg4NDUtM2NjYjFhMGYwZjBkXkEyXkFqcGdeQXVyNzQxNDExNTU@._V1_SX300.jpg,https://www.youtube.com/watch?v=H0itWKFwMpQ,YouTube


count                    9425
unique                   1642
top       2015-04-14 00:00:00
freq                     2121
first     2015-04-14 00:00:00
last      2021-03-04 00:00:00
Name: Netflix Release Date, dtype: object

<a id="id_3"></a>
## 3. Select Titles Available in England
#### 3a) Clean data

In [5]:
# flag movies/series available in the UK so that all other titles can be removed
def id_available_country(available_countries, target_available_country = 'United Kingdom'):
    
    available_countries_ls = available_countries.split(",")
    
    if target_available_country in available_countries_ls:
        return 1
    else:
        return 0
    
kaggle_netflix['Available UK Flag'] = kaggle_netflix['Country Availability'].astype(str).apply(id_available_country)
print('Volume available in UK = ' + str(kaggle_netflix['Available UK Flag'].sum()))
# take an explicit copy so later column assignments do not act on a view of kaggle_netflix
uk_kaggle_netflix = kaggle_netflix[kaggle_netflix['Available UK Flag'] == 1].copy()

# create id column
uk_kaggle_netflix.reset_index(drop=True, inplace = True)
uk_kaggle_netflix['Title id'] = uk_kaggle_netflix.index

# only keep most recent row per title that has been released onto Netflix multiple times
uk_kaggle_netflix = uk_kaggle_netflix.sort_values(by = 'Netflix Release Date',
                                                    ascending = False).drop_duplicates(subset = ['Title', 'Release Date'])
print('Volume available in UK, post duplicate removal = ' + str(uk_kaggle_netflix.shape[0]))

# cull unnecessary features
uk_kaggle_netflix = uk_kaggle_netflix[['Title id', 'Title', 'Netflix Release Date', 
                                       'IMDb Score', 'Rotten Tomatoes Score', 'Hidden Gem Score',  'Metacritic Score', 
                                       'Awards Received', 'Awards Nominated For', 'IMDb Votes',
                                       'View Rating', 'Genre', 'Series or Movie', 'Runtime',
                                       'Director', 'Actors', 'Summary']]


Volume available in UK = 3908
Volume available in UK, post duplicate removal = 3888
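
The flag above depends on an exact match against each comma-separated country name, which avoids the false positives that a plain substring test could give. A minimal standalone sketch of that behaviour follows. The helper name `is_available_in` and the sample strings are made up for illustration, and stripping whitespace around each name is an added assumption, not something the notebook does.

```python
def is_available_in(available_countries, target="United Kingdom"):
    """Return 1 if target is one of the comma-separated country names, else 0."""
    # hypothetical helper: whitespace around each name is stripped (an assumption)
    names = [name.strip() for name in str(available_countries).split(",")]
    return int(target in names)


samples = [
    "Lithuania,Poland,United Kingdom,Iceland",
    "Thailand",
    "Nigeria",
    float("nan"),  # a missing value becomes the string "nan" and is flagged 0
]
flags = [is_available_in(s) for s in samples]

# exact matching: "Niger" is a substring of "Nigeria" but not one of its list items
niger_flag = is_available_in("Nigeria", target="Niger")
print(flags, niger_flag)
```

Splitting into a list before testing membership is what keeps a short country name from matching inside a longer one.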


#### 3b) Checks

In [6]:
print('Kaggle Netflix Data:\nShape = ' + str(uk_kaggle_netflix.shape))
display(uk_kaggle_netflix.head())
display(uk_kaggle_netflix['Netflix Release Date'].describe())

Kaggle Netflix Data:
Shape = (3888, 17)


Unnamed: 0,Title id,Title,Netflix Release Date,IMDb Score,Rotten Tomatoes Score,Hidden Gem Score,Metacritic Score,Awards Received,Awards Nominated For,IMDb Votes,View Rating,Genre,Series or Movie,Runtime,Director,Actors,Summary
0,0,Only a Mother,2021-03-03,6.7,,8.3,,2.0,1.0,88.0,,Drama,Movie,1-2 hour,Alf Sjöberg,"Hugo Björne, Eva Dahlbeck, Ulf Palme, Ragnar Falck",An unhappily married farm worker struggling to care for her children reflects on her lost youth and the scandalous moment that cost her true love.
2,2,The Simple Minded Murderer,2021-03-03,7.6,92.0,7.8,,7.0,2.0,2870.0,,Drama,Movie,1-2 hour,Hans Alfredson,"Maria Johansson, Hans Alfredson, Stellan Skarsgård, Per Myrberg","A good-natured farmhand, perpetually terrorized by his cruel boss for his disability, finds love and acceptance when a poor family takes him in."
3,3,To Kill a Child,2021-03-03,7.7,,8.8,,2.0,5.0,78.0,,"Short, Drama",Movie,< 30 minutes,"José Esteban Alenda, César Esteban Alenda","Cristina Marcos, Manolo Solo, Roger Príncep, Roger Álvarez",A car accident involving a young child takes a devastating toll in this 9-minute film based on the 1948 short story by writer Stig Dagerman.
4,4,Harrys Daughters,2021-03-03,8.1,96.0,4.4,85.0,46.0,94.0,766594.0,PG-13,"Adventure, Drama, Fantasy, Mystery",Movie,1-2 hour,David Yates,"Daniel Radcliffe, Ralph Fiennes, Alan Rickman, Michael Gambon","As two sisters both experience pregnancy, tragedy rattles their bond by bringing secrets, jealousy and sorrow to the forefront of their relationship."
5,5,Gyllene Tider,2021-03-03,7.7,,8.8,,,,19.0,,Music,Movie,30-60 mins,Lasse Hallström,"Anders Herrlin, Per Gessle, Micke Andersson, Göran Fritzson",This music documentary offers backstage interviews and concert footage of the Swedish pop group at the peak of their popularity in the early 80s.


count                    3888
unique                   1172
top       2015-04-14 00:00:00
freq                      665
first     2015-04-14 00:00:00
last      2021-03-03 00:00:00
Name: Netflix Release Date, dtype: object

<a id="id_4"></a>
## 4. Prepare Features
#### 4a) One Hot Encoding categorical data with minimal unique values

In [7]:
# group view rating col
child = ['E', 'TV-Y7', 'TV-Y', 'TV-Y7-FV', 'TV-G', 'G', ]
teen = ['PG-13', 'TV-14', 'TV-PG', 'PG', 'Passed']
adult = ['R', 'TV-MA', 'X', 'Approved']
unrated = ['Unrated', 'Not Rated']

uk_kaggle_netflix['Grouped Rating'] = 'unrated'
uk_kaggle_netflix.loc[uk_kaggle_netflix['View Rating'].isin(child), 'Grouped Rating'] = 'child'
uk_kaggle_netflix.loc[uk_kaggle_netflix['View Rating'].isin(teen), 'Grouped Rating'] = 'teen'
uk_kaggle_netflix.loc[uk_kaggle_netflix['View Rating'].isin(adult), 'Grouped Rating'] = 'adult'

uk_kaggle_netflix.drop('View Rating', axis = 1, inplace = True)

cols_to_one_hot_encode = ['Series or Movie', 'Runtime', 'Grouped Rating']

# generate binary values using get_dummies
for col in cols_to_one_hot_encode:
    uk_kaggle_netflix = pd.get_dummies(uk_kaggle_netflix, columns = [col], prefix = [col])

uk_kaggle_netflix.drop(['Series or Movie_Series'], axis = 1, inplace = True)

display(uk_kaggle_netflix.head())

Unnamed: 0,Title id,Title,Netflix Release Date,IMDb Score,Rotten Tomatoes Score,Hidden Gem Score,Metacritic Score,Awards Received,Awards Nominated For,IMDb Votes,Genre,Director,Actors,Summary,Series or Movie_Movie,Runtime_1-2 hour,Runtime_30-60 mins,Runtime_< 30 minutes,Runtime_> 2 hrs,Grouped Rating_adult,Grouped Rating_child,Grouped Rating_teen,Grouped Rating_unrated
0,0,Only a Mother,2021-03-03,6.7,,8.3,,2.0,1.0,88.0,Drama,Alf Sjöberg,"Hugo Björne, Eva Dahlbeck, Ulf Palme, Ragnar Falck",An unhappily married farm worker struggling to care for her children reflects on her lost youth and the scandalous moment that cost her true love.,1,1,0,0,0,0,0,0,1
2,2,The Simple Minded Murderer,2021-03-03,7.6,92.0,7.8,,7.0,2.0,2870.0,Drama,Hans Alfredson,"Maria Johansson, Hans Alfredson, Stellan Skarsgård, Per Myrberg","A good-natured farmhand, perpetually terrorized by his cruel boss for his disability, finds love and acceptance when a poor family takes him in.",1,1,0,0,0,0,0,0,1
3,3,To Kill a Child,2021-03-03,7.7,,8.8,,2.0,5.0,78.0,"Short, Drama","José Esteban Alenda, César Esteban Alenda","Cristina Marcos, Manolo Solo, Roger Príncep, Roger Álvarez",A car accident involving a young child takes a devastating toll in this 9-minute film based on the 1948 short story by writer Stig Dagerman.,1,0,0,1,0,0,0,0,1
4,4,Harrys Daughters,2021-03-03,8.1,96.0,4.4,85.0,46.0,94.0,766594.0,"Adventure, Drama, Fantasy, Mystery",David Yates,"Daniel Radcliffe, Ralph Fiennes, Alan Rickman, Michael Gambon","As two sisters both experience pregnancy, tragedy rattles their bond by bringing secrets, jealousy and sorrow to the forefront of their relationship.",1,1,0,0,0,0,0,1,0
5,5,Gyllene Tider,2021-03-03,7.7,,8.8,,,,19.0,Music,Lasse Hallström,"Anders Herrlin, Per Gessle, Micke Andersson, Göran Fritzson",This music documentary offers backstage interviews and concert footage of the Swedish pop group at the peak of their popularity in the early 80s.,1,0,1,0,0,0,0,0,1
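
When scoring new data, `pd.get_dummies` only creates columns for the categories present in that data, so the encoded columns can differ from those the model saw during training. One possible safeguard, sketched here on a toy frame, is to reindex onto the expected columns. The `expected_cols` list and the toy data are hypothetical; in practice the list would have to come from the training notebook.

```python
import pandas as pd

# toy data: no title longer than two hours, so 'Runtime_> 2 hrs' would be missing
new_titles = pd.DataFrame({
    "Title": ["A", "B"],
    "Runtime": ["1-2 hour", "< 30 minutes"],
})

# hypothetical list of columns used at training time (an assumption)
expected_cols = [
    "Title",
    "Runtime_1-2 hour",
    "Runtime_30-60 mins",
    "Runtime_< 30 minutes",
    "Runtime_> 2 hrs",
]

encoded = pd.get_dummies(new_titles, columns=["Runtime"], prefix=["Runtime"], dtype=int)

# add any missing dummy columns as all-zero and enforce the training column order
aligned = encoded.reindex(columns=expected_cols, fill_value=0)
print(aligned)
```

Reindexing also drops any dummy column that is not in the expected list, which may or may not be desirable depending on how the model was trained.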


#### 4b) Handle Categorical columns with multiple values per cell

In [8]:
def key_value_f(text, key_value):
    # return 1 if key_value is an exact item in the comma-separated text, else 0
    # note: `~` on a Python bool is bitwise (~True == -2, which is truthy), so use `not`
    if not pd.isna(text):
        values = text.split(', ')
        return int(key_value in values)
    else:
        return 0

    
# Define dictionary of values required for each feature
key_value_feat_dict = {'feature': ['Genre', 'Director', 'Actors'], 
                       'key_values': [genre_key_values, director_key_values, actors_key_values]}

for feat in range(len(key_value_feat_dict['feature'])):
    print(key_value_feat_dict['feature'][feat] + ':')
    
    # id most occurring values for the feature to create value flags for
    key_values = key_value_feat_dict['key_values'][feat]
    print('Adding ' + str(len(key_values)) + ' ' +
          str(key_value_feat_dict['feature'][feat].lower()) + ' flag features\n\n')
    
    # update any missing values in the feature to be a blank space to avoid errors
    uk_kaggle_netflix.loc[pd.isna(uk_kaggle_netflix[key_value_feat_dict['feature'][feat]]),
                          key_value_feat_dict['feature'][feat]] = ' '
    
    # create a value flag for chosen values
    for value in key_values:
        uk_kaggle_netflix[value + '_f'] =\
        uk_kaggle_netflix[key_value_feat_dict['feature'][feat]].apply(lambda x: key_value_f(x, value))
        
    # remove original feature column
    uk_kaggle_netflix.drop(key_value_feat_dict['feature'][feat],
                           axis = 1, inplace = True)

display(uk_kaggle_netflix.shape)
display(uk_kaggle_netflix.head())

Genre:
Adding 22 genre flag features


Director:
Adding 3 director flag features


Actors:
Adding 15 actors flag features




(3888, 60)

Unnamed: 0,Title id,Title,Netflix Release Date,IMDb Score,Rotten Tomatoes Score,Hidden Gem Score,Metacritic Score,Awards Received,Awards Nominated For,IMDb Votes,Summary,Series or Movie_Movie,Runtime_1-2 hour,Runtime_30-60 mins,Runtime_< 30 minutes,Runtime_> 2 hrs,Grouped Rating_adult,Grouped Rating_child,Grouped Rating_teen,Grouped Rating_unrated,Drama_f,Comedy_f,Action_f,Thriller_f,Romance_f,Crime_f,Documentary_f,Adventure_f,Fantasy_f,Animation_f,Family_f,Mystery_f,Sci-Fi_f,Horror_f,Biography_f,History_f,Music_f,Sport_f,Short_f,War_f,Musical_f,Reality-TV_f,Cathy Garcia-Molina_f,Hayao Miyazaki_f,Jay Karas_f,Adam Sandler_f,Shah Rukh Khan_f,Priyanka Chopra_f,Mark Wahlberg_f,Liam Neeson_f,Eric Idle_f,John Cleese_f,Kevin James_f,Akshay Kumar_f,Ashleigh Ball_f,Nawazuddin Siddiqui_f,Terry Gilliam_f,Paresh Rawal_f,Russell Crowe_f,John Abraham_f
0,0,Only a Mother,2021-03-03,6.7,,8.3,,2.0,1.0,88.0,An unhappily married farm worker struggling to care for her children reflects on her lost youth and the scandalous moment that cost her true love.,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,2,The Simple Minded Murderer,2021-03-03,7.6,92.0,7.8,,7.0,2.0,2870.0,"A good-natured farmhand, perpetually terrorized by his cruel boss for his disability, finds love and acceptance when a poor family takes him in.",1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,3,To Kill a Child,2021-03-03,7.7,,8.8,,2.0,5.0,78.0,A car accident involving a young child takes a devastating toll in this 9-minute film based on the 1948 short story by writer Stig Dagerman.,1,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,4,Harrys Daughters,2021-03-03,8.1,96.0,4.4,85.0,46.0,94.0,766594.0,"As two sisters both experience pregnancy, tragedy rattles their bond by bringing secrets, jealousy and sorrow to the forefront of their relationship.",1,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,5,Gyllene Tider,2021-03-03,7.7,,8.8,,,,19.0,This music documentary offers backstage interviews and concert footage of the Swedish pop group at the peak of their popularity in the early 80s.,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
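
The flag columns above amount to a multi-hot encoding of a comma-separated cell: one 0/1 column per key value. A small standalone sketch of the same idea in plain Python follows. The function name `multi_hot` and the sample key values are illustrative only, and the real key values come from the pickled lists loaded in section 2.

```python
def multi_hot(cells, key_values, sep=", "):
    """Build one dict of 0/1 flags per cell, with one flag per key value."""
    rows = []
    for cell in cells:
        # missing cells (None, NaN) contribute no items, so every flag is 0
        items = set(cell.split(sep)) if isinstance(cell, str) else set()
        rows.append({value + "_f": int(value in items) for value in key_values})
    return rows


genres = ["Short, Drama", "Adventure, Drama, Fantasy, Mystery", None]
flags = multi_hot(genres, ["Drama", "Short", "Music"])
for row in flags:
    print(row)
```

Splitting each cell once and testing set membership avoids re-splitting the same string for every key value.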


#### 4c) NLP feature engineering of text feature = Summary
**4c) i. Clean Summary feature**

In [9]:
def preprocess_text(text):
    # convert all text to lowercase & remove punctuation & characters & strip
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())
    
    ## tokenize
    words = word_tokenize(text)
    
    ## get stop word list
    sws = stopwords.words("english")
    
    ## remove stop words
    no_stopwords = [word for word in words if word not in sws]
    
    # transform words back into string
    text = ' '.join(no_stopwords)
    
    return text    

# clean summary feature
uk_kaggle_netflix['Clean Summary'] = uk_kaggle_netflix['Summary'].apply(lambda x: preprocess_text(x))
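
One side effect of the punctuation step is that hyphenated words are joined rather than split, which is why tokens such as `goodnatured` and `9minute` appear in the cleaned summaries. The sketch below reproduces only the regex step (no tokenising or stop-word removal) and contrasts it with a variant that replaces punctuation with a space. The variant is an alternative offered for comparison, not what the notebook does.

```python
import re


def strip_punctuation(text):
    # same regex as preprocess_text: delete every non-word, non-space character
    return re.sub(r"[^\w\s]", "", str(text).lower().strip())


def punctuation_to_space(text):
    # alternative (an assumption, not the notebook's method): keep word boundaries
    spaced = re.sub(r"[^\w\s]", " ", str(text).lower().strip())
    return " ".join(spaced.split())


sentence = "A good-natured farmhand stars in this 9-minute film."
joined = strip_punctuation(sentence)
split_apart = punctuation_to_space(sentence)
print(joined)
print(split_apart)
```

Which behaviour is preferable depends on the embedding model, and switching at prediction time would only make sense if the training notebook made the same change.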

**4c) ii. Get sentiment scores**

In [10]:
# calculate sentiment score
uk_kaggle_netflix['Summary Sentiment Score'] =\
uk_kaggle_netflix['Clean Summary'].apply(lambda x: TextBlob(x).sentiment.polarity)

display(uk_kaggle_netflix['Summary Sentiment Score'].describe())

count   3888.000000
mean       0.046916
std        0.266831
min       -1.000000
25%       -0.066667
50%        0.000000
75%        0.200000
max        1.000000
Name: Summary Sentiment Score, dtype: float64

**4c) iii. Get embeddings**

In [11]:
%%time
# create array from text column
x = uk_kaggle_netflix.loc[:, 'Clean Summary'].values
print(len(x))
print(x)

# create embeddings
os.chdir('C:\\Users\\claud\\Documents\\')
emb_model = SentenceTransformer('paraphrase-distilroberta-base-v1')
roberta = emb_model.encode(x)
display(roberta.shape)
display(roberta)

3888
['unhappily married farm worker struggling care children reflects lost youth scandalous moment cost true love'
 'goodnatured farmhand perpetually terrorized cruel boss disability finds love acceptance poor family takes'
 'car accident involving young child takes devastating toll 9minute film based 1948 short story writer stig dagerman'
 ...
 'standup comedy star kevin hart delivers unique perspective work race family friends laughriot comedy show'
 'engaging documentary series shares surprising backstories familiar institutions like pentagon west point playboy mansion'
 'madagascar goes wild holiday spirit set valentines day christmasthemed tales featuring everyones favorite animal characters']


(3888, 768)

array([[ 9.5230341e-02,  5.0791520e-01,  3.2454786e-01, ...,
        -9.0709075e-02,  8.4763728e-02,  2.6724419e-01],
       [ 9.9013969e-02,  1.2575524e-01,  2.3845463e-01, ...,
         2.8332448e-01,  4.1070423e-04,  3.2410502e-01],
       [ 2.7075472e-01,  4.0111753e-01,  2.1703014e-01, ...,
         2.9310319e-01, -1.2523268e-01, -3.9373729e-02],
       ...,
       [ 2.0687398e-01,  1.6644548e-01,  4.9040145e-03, ...,
         5.5140424e-01, -1.0279551e-01, -8.7507620e-02],
       [-6.6251233e-03,  1.1539538e-01,  8.4907509e-02, ...,
        -3.8986835e-01,  3.0589399e-01, -4.2014364e-02],
       [-3.4092093e-01, -9.7056879e-03,  3.9992809e-01, ...,
         1.8125887e-01, -4.2065147e-02,  2.9027084e-02]], dtype=float32)

Wall time: 1min 40s


In [13]:
def explained_variance(embeddings, n_components, print_f = False, return_f = True):
    pca = PCA(n_components = n_components)
    principal_components = pca.fit_transform(embeddings)
    ev = round(pca.explained_variance_ratio_.sum() * 100, 2)
    if print_f:
        print('Together the ' + str(n_components) + ' components contain ' +
              str(ev) + '% of the information.')
    
    if return_f:
        return ev, pca.explained_variance_ratio_
    
# reduce to 50 components, which retain roughly 50% of the variance: there is no clear elbow,
# and beyond 50 each extra component adds less than 1% of explained variance
n_components = 50
pca = PCA(n_components = n_components)
principal_components = pca.fit_transform(roberta)
explained_variance(embeddings = roberta, n_components = n_components, print_f = True, return_f = False)

# transfer components to uk_kaggle_netflix
components = list(range(1, n_components + 1))
component_features = ['PC ' + str(item) for item in components]
pca_df = pd.DataFrame(data = principal_components,
                      columns = component_features)
pca_df.reset_index(drop = True, inplace = True)

# join on entire uk_kaggle_netflix
uk_kaggle_netflix.reset_index(drop = True, inplace = True)
uk_kaggle_netflix = uk_kaggle_netflix.join(pca_df)

# drop unnecessary text columns
uk_kaggle_netflix.drop(['Summary', 'Clean Summary'], axis = 1, inplace = True)

display(uk_kaggle_netflix.shape)
display(uk_kaggle_netflix.head())

Together the 50 components contain 50.65% of the information.


(3888, 110)

Unnamed: 0,Title id,Title,Netflix Release Date,IMDb Score,Rotten Tomatoes Score,Hidden Gem Score,Metacritic Score,Awards Received,Awards Nominated For,IMDb Votes,Series or Movie_Movie,Runtime_1-2 hour,Runtime_30-60 mins,Runtime_< 30 minutes,Runtime_> 2 hrs,Grouped Rating_adult,Grouped Rating_child,Grouped Rating_teen,Grouped Rating_unrated,Drama_f,Comedy_f,Action_f,Thriller_f,Romance_f,Crime_f,Documentary_f,Adventure_f,Fantasy_f,Animation_f,Family_f,Mystery_f,Sci-Fi_f,Horror_f,Biography_f,History_f,Music_f,Sport_f,Short_f,War_f,Musical_f,Reality-TV_f,Cathy Garcia-Molina_f,Hayao Miyazaki_f,Jay Karas_f,Adam Sandler_f,Shah Rukh Khan_f,Priyanka Chopra_f,Mark Wahlberg_f,Liam Neeson_f,Eric Idle_f,John Cleese_f,Kevin James_f,Akshay Kumar_f,Ashleigh Ball_f,Nawazuddin Siddiqui_f,Terry Gilliam_f,Paresh Rawal_f,Russell Crowe_f,John Abraham_f,Summary Sentiment Score,PC 1,PC 2,PC 3,PC 4,PC 5,PC 6,PC 7,PC 8,PC 9,PC 10,PC 11,PC 12,PC 13,PC 14,PC 15,PC 16,PC 17,PC 18,PC 19,PC 20,PC 21,PC 22,PC 23,PC 24,PC 25,PC 26,PC 27,PC 28,PC 29,PC 30,PC 31,PC 32,PC 33,PC 34,PC 35,PC 36,PC 37,PC 38,PC 39,PC 40,PC 41,PC 42,PC 43,PC 44,PC 45,PC 46,PC 47,PC 48,PC 49,PC 50
0,0,Only a Mother,2021-03-03,6.7,,8.3,,2.0,1.0,88.0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.366667,-0.892209,1.277649,-0.455781,-0.12874,-0.453335,-0.090014,0.252693,1.124929,-0.66109,-0.031115,-0.085253,0.144837,-1.742766,0.742718,-1.065015,-0.611184,-0.170975,-1.173843,-0.411335,-0.018112,-0.660896,0.110633,0.251613,-0.001655,-0.753747,0.409876,-0.59442,-0.160213,0.40274,-0.572196,0.543147,0.446064,-0.443287,-0.893419,0.047273,0.031848,-0.621761,-0.041992,-0.262035,0.268677,-0.01205,-0.379724,0.245288,0.052732,0.259024,0.054006,-0.09621,0.368036,-0.456575,0.79854
1,2,The Simple Minded Murderer,2021-03-03,7.6,92.0,7.8,,7.0,2.0,2870.0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-0.2375,-0.766107,-0.04171,-0.844389,0.590763,-0.220203,-0.000832,-0.200618,0.542435,-0.413336,-0.448982,-0.890622,0.238509,-1.159853,0.243853,-1.48875,-0.443726,-0.34595,-0.586453,-0.231497,-0.205329,0.002023,0.183736,0.498179,0.61789,-0.088227,0.0697,-0.388221,-0.573973,0.462404,0.180467,0.621688,-0.234719,-5e-06,-1.277524,1.183436,0.33558,0.431005,-0.157356,0.780195,-0.184203,0.354196,-0.297512,-0.063243,-0.702202,0.291091,0.448143,0.074302,0.355624,0.115976,0.192676
2,3,To Kill a Child,2021-03-03,7.7,,8.8,,2.0,5.0,78.0,1,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-0.3,-0.567756,-0.559739,-0.202694,0.028394,0.189353,-0.930537,0.634488,0.425921,0.071775,-0.73419,0.12283,-0.277336,0.271897,0.410455,0.780681,0.028732,-0.282636,-0.8183,0.459188,0.770815,-0.499905,0.02895,-0.674097,1.074189,0.476387,0.653173,0.33652,1.196788,0.051526,-0.051508,-0.294885,0.767992,1.202314,-0.460541,0.700479,-0.660286,-0.96957,-0.834016,-0.487416,0.740782,-0.58437,-1.087455,0.126163,0.664941,0.622627,0.214075,0.16687,0.433892,0.153734,0.01814
3,4,Harrys Daughters,2021-03-03,8.1,96.0,4.4,85.0,46.0,94.0,766594.0,1,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,-1.413474,1.485164,0.111595,-0.587126,-0.129776,-1.570893,0.204426,0.374226,-0.719971,-0.82402,-0.486428,-0.521229,0.619354,-0.728844,-0.258715,-0.17556,0.077802,-0.122505,0.368115,-0.102252,0.368981,-0.090678,-0.126189,-0.368553,-0.109327,0.724354,-0.655321,0.676775,-0.86886,0.552446,0.346698,0.634756,-0.119028,-0.068368,0.494161,-0.179647,0.098453,-1.194075,-0.086413,-0.857209,-0.787125,-0.023906,0.495273,0.062687,-0.157745,0.215707,0.04393,-0.326338,0.007177,-0.068989
4,5,Gyllene Tider,2021-03-03,7.7,,8.8,,,,19.0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.066667,2.142846,0.680901,1.187192,-0.905365,-0.62521,-0.703216,0.49046,0.06632,-0.354669,-0.05778,0.523041,-0.256934,-0.423015,0.322345,-0.020691,0.219795,-0.647,-0.192547,0.366327,-0.008912,-1.689752,-0.393795,-0.54266,0.024756,0.013548,-0.166197,0.494158,1.000062,-0.153908,-0.969641,0.414622,-0.281657,-0.354082,0.913375,0.059547,-0.291582,-0.253158,0.166844,-0.102931,0.844367,0.665774,0.503254,-1.019394,-0.392418,-1.397359,0.735276,0.828651,-0.562697,-0.677264,-0.005107
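
The choice of 50 components rests on how the explained variance accumulates as components are added. A minimal sketch of that reasoning, using made-up ratios rather than the notebook's actual values, finds the smallest number of leading components whose cumulative ratio reaches a target. The function name `components_for_threshold` is hypothetical.

```python
from itertools import accumulate


def components_for_threshold(variance_ratios, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold.

    Returns None if the threshold is never reached.
    """
    for n, total in enumerate(accumulate(variance_ratios), start=1):
        if total >= threshold:
            return n
    return None


# made-up ratios for illustration, sorted from largest to smallest
ratios = [0.30, 0.20, 0.10, 0.05, 0.05]
n_needed = components_for_threshold(ratios, 0.55)
unreachable = components_for_threshold(ratios, 0.95)
print(n_needed, unreachable)
```

With a fitted scikit-learn `PCA`, the `explained_variance_ratio_` array could be passed in place of `ratios`.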


#### 4d) Handling Nulls

In [14]:
display(uk_kaggle_netflix.describe())
display(uk_kaggle_netflix.isna().sum()[:10])

# replace null values with 0 for columns related to awards
uk_kaggle_netflix['Awards Received'] = uk_kaggle_netflix['Awards Received'].fillna(0)
uk_kaggle_netflix['Awards Nominated For'] = uk_kaggle_netflix['Awards Nominated For'].fillna(0)

# impute other columns' missing values with mean
cols_to_impute = ['IMDb Score', 'Rotten Tomatoes Score', 'Hidden Gem Score', 'Metacritic Score', 'IMDb Votes']
def impute_mean(df_col):
    # always return a Series, even when the column has no nulls
    # (returning only inside an `if` would overwrite complete columns with None)
    return df_col.fillna(df_col.mean())

for col in cols_to_impute:
    uk_kaggle_netflix[col] = impute_mean(uk_kaggle_netflix[col])

display(uk_kaggle_netflix.describe())
display(uk_kaggle_netflix.isna().sum())

Unnamed: 0,Title id,IMDb Score,Rotten Tomatoes Score,Hidden Gem Score,Metacritic Score,Awards Received,Awards Nominated For,IMDb Votes,Series or Movie_Movie,Runtime_1-2 hour,Runtime_30-60 mins,Runtime_< 30 minutes,Runtime_> 2 hrs,Grouped Rating_adult,Grouped Rating_child,Grouped Rating_teen,Grouped Rating_unrated,Drama_f,Comedy_f,Action_f,Thriller_f,Romance_f,Crime_f,Documentary_f,Adventure_f,Fantasy_f,Animation_f,Family_f,Mystery_f,Sci-Fi_f,Horror_f,Biography_f,History_f,Music_f,Sport_f,Short_f,War_f,Musical_f,Reality-TV_f,Cathy Garcia-Molina_f,Hayao Miyazaki_f,Jay Karas_f,Adam Sandler_f,Shah Rukh Khan_f,Priyanka Chopra_f,Mark Wahlberg_f,Liam Neeson_f,Eric Idle_f,John Cleese_f,Kevin James_f,Akshay Kumar_f,Ashleigh Ball_f,Nawazuddin Siddiqui_f,Terry Gilliam_f,Paresh Rawal_f,Russell Crowe_f,John Abraham_f,Summary Sentiment Score,PC 1,PC 2,PC 3,PC 4,PC 5,PC 6,PC 7,PC 8,PC 9,PC 10,PC 11,PC 12,PC 13,PC 14,PC 15,PC 16,PC 17,PC 18,PC 19,PC 20,PC 21,PC 22,PC 23,PC 24,PC 25,PC 26,PC 27,PC 28,PC 29,PC 30,PC 31,PC 32,PC 33,PC 34,PC 35,PC 36,PC 37,PC 38,PC 39,PC 40,PC 41,PC 42,PC 43,PC 44,PC 45,PC 46,PC 47,PC 48,PC 49,PC 50
count,3888.0,3881.0,1734.0,3881.0,1197.0,1907.0,2403.0,3881.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0
mean,1950.917438,7.069595,67.199539,5.941587,59.057644,8.609858,14.826883,43770.165164,0.627315,0.453704,0.023663,0.385031,0.137603,0.291152,0.050154,0.262603,0.396091,0.471193,0.374486,0.180298,0.181327,0.173868,0.146348,0.141204,0.110597,0.109568,0.105453,0.096965,0.087449,0.079733,0.064815,0.054527,0.042695,0.035494,0.029064,0.025977,0.023663,0.020833,0.01929,0.002829,0.003086,0.002572,0.00463,0.006173,0.00463,0.0018,0.002572,0.003344,0.003858,0.002572,0.002829,0.002829,0.003086,0.003086,0.002315,0.002058,0.002572,0.046916,0.0,0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,-0.0,-0.0,-0.0
std,1128.000113,0.88537,24.840506,2.357274,16.387271,18.15281,31.806895,119177.274839,0.483582,0.497916,0.152015,0.486665,0.344527,0.454352,0.218291,0.440105,0.489147,0.499234,0.484052,0.384485,0.385339,0.379045,0.3535,0.348276,0.313672,0.312391,0.307175,0.295948,0.282528,0.270913,0.24623,0.227083,0.202196,0.185048,0.168007,0.159088,0.152015,0.142845,0.13756,0.053122,0.055477,0.050656,0.067892,0.078335,0.067892,0.042398,0.050656,0.057735,0.062001,0.050656,0.053122,0.053122,0.055477,0.055477,0.048063,0.04532,0.050656,0.266831,1.159775,1.017493,0.917648,0.883797,0.869772,0.859326,0.816043,0.804683,0.800265,0.786134,0.757629,0.740627,0.734066,0.728775,0.708396,0.703111,0.697744,0.688463,0.681132,0.677461,0.663435,0.649358,0.644927,0.639964,0.635143,0.630101,0.621086,0.615562,0.605045,0.60176,0.596746,0.592881,0.584806,0.583636,0.577894,0.569346,0.566099,0.562789,0.554383,0.55211,0.542333,0.532995,0.528454,0.526254,0.522708,0.51969,0.519374,0.515886,0.513145,0.506226
min,0.0,2.8,0.0,0.7,12.0,1.0,1.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,-3.538242,-2.808469,-3.162184,-2.489024,-2.821504,-2.884916,-2.801313,-2.943148,-2.704649,-2.714811,-2.585814,-2.139142,-2.371309,-2.514258,-2.597761,-2.396797,-2.347164,-2.11468,-2.339656,-2.17475,-2.054649,-2.092079,-2.289614,-2.2465,-2.113136,-2.171414,-2.142651,-2.179903,-1.99549,-2.02719,-2.174943,-1.829919,-2.115733,-2.004313,-1.779395,-1.900601,-1.88246,-1.956226,-2.0222,-1.990758,-1.857387,-1.643732,-1.964351,-1.954381,-1.759194,-1.70997,-1.746902,-1.668863,-1.899349,-1.80339
25%,974.75,6.6,53.0,3.8,47.0,1.0,2.0,791.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.066667,-0.808654,-0.732023,-0.629719,-0.63236,-0.592748,-0.581358,-0.558424,-0.545439,-0.531386,-0.532373,-0.532065,-0.505833,-0.497737,-0.501449,-0.495558,-0.473789,-0.472465,-0.473525,-0.443405,-0.458938,-0.457318,-0.434961,-0.430497,-0.426396,-0.42967,-0.416713,-0.422277,-0.404025,-0.412382,-0.402126,-0.396907,-0.397784,-0.390729,-0.386022,-0.394972,-0.390873,-0.379285,-0.381332,-0.363506,-0.364392,-0.360135,-0.350964,-0.356955,-0.36476,-0.343282,-0.354129,-0.365429,-0.348314,-0.34607,-0.333691
50%,1949.5,7.1,73.0,6.7,60.0,3.0,5.0,3866.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.127017,-0.04138,-0.027232,-0.046764,-0.011328,-0.047199,-0.002495,-0.003194,0.004017,0.01129,-0.014052,-0.005711,-0.015952,-0.010358,-0.017313,-0.005884,-0.004978,-0.013378,0.000972,-0.016748,-0.01284,-0.000371,-0.002528,0.001665,-0.008667,0.005504,-0.006394,-0.001866,-0.000806,0.001204,0.00685,-0.018938,-0.000583,0.003804,0.021636,0.00311,-0.005559,-0.009495,-0.006437,0.005092,0.005487,-0.004521,0.005472,0.008906,0.014598,0.004404,-0.010439,-0.005329,0.00058,0.00834
75%,2926.25,7.7,87.0,8.2,71.0,8.0,13.0,26666.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.731302,0.684692,0.602749,0.588546,0.576912,0.556054,0.557565,0.558158,0.546461,0.539465,0.511948,0.48709,0.494939,0.489123,0.474775,0.477998,0.473238,0.478903,0.449286,0.472456,0.428578,0.439547,0.427915,0.43163,0.431928,0.423873,0.4257,0.414844,0.415225,0.391539,0.393887,0.401135,0.373161,0.395814,0.381728,0.385843,0.37736,0.375435,0.366877,0.355113,0.367027,0.366222,0.35723,0.347529,0.365848,0.332471,0.349076,0.345877,0.336494,0.335994
max,3907.0,9.5,100.0,9.7,97.0,251.0,386.0,2072912.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.926368,3.746886,3.505433,3.208619,3.387001,3.548229,2.830118,3.231982,2.826298,2.782582,2.649444,2.738232,2.446271,2.992143,2.624331,2.494262,2.51703,2.402648,2.618083,2.207033,2.582064,2.155555,2.520522,3.092753,2.22668,2.621655,2.549426,2.192027,2.271013,2.242856,2.223186,1.966992,2.32581,2.136377,2.090688,2.003605,2.038731,2.238583,2.296252,1.99689,2.173954,1.829952,2.019415,2.215926,1.801669,1.862742,1.763329,1.97866,2.136901,1.810029


Title id                    0
Title                       0
Netflix Release Date        0
IMDb Score                  7
Rotten Tomatoes Score    2154
Hidden Gem Score            7
Metacritic Score         2691
Awards Received          1981
Awards Nominated For     1485
IMDb Votes                  7
dtype: int64

Unnamed: 0,Title id,IMDb Score,Rotten Tomatoes Score,Hidden Gem Score,Metacritic Score,Awards Received,Awards Nominated For,IMDb Votes,Series or Movie_Movie,Runtime_1-2 hour,Runtime_30-60 mins,Runtime_< 30 minutes,Runtime_> 2 hrs,Grouped Rating_adult,Grouped Rating_child,Grouped Rating_teen,Grouped Rating_unrated,Drama_f,Comedy_f,Action_f,Thriller_f,Romance_f,Crime_f,Documentary_f,Adventure_f,Fantasy_f,Animation_f,Family_f,Mystery_f,Sci-Fi_f,Horror_f,Biography_f,History_f,Music_f,Sport_f,Short_f,War_f,Musical_f,Reality-TV_f,Cathy Garcia-Molina_f,Hayao Miyazaki_f,Jay Karas_f,Adam Sandler_f,Shah Rukh Khan_f,Priyanka Chopra_f,Mark Wahlberg_f,Liam Neeson_f,Eric Idle_f,John Cleese_f,Kevin James_f,Akshay Kumar_f,Ashleigh Ball_f,Nawazuddin Siddiqui_f,Terry Gilliam_f,Paresh Rawal_f,Russell Crowe_f,John Abraham_f,Summary Sentiment Score,PC 1,PC 2,PC 3,PC 4,PC 5,PC 6,PC 7,PC 8,PC 9,PC 10,PC 11,PC 12,PC 13,PC 14,PC 15,PC 16,PC 17,PC 18,PC 19,PC 20,PC 21,PC 22,PC 23,PC 24,PC 25,PC 26,PC 27,PC 28,PC 29,PC 30,PC 31,PC 32,PC 33,PC 34,PC 35,PC 36,PC 37,PC 38,PC 39,PC 40,PC 41,PC 42,PC 43,PC 44,PC 45,PC 46,PC 47,PC 48,PC 49,PC 50
count,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0,3888.0
mean,1950.917438,7.069595,67.199539,5.941587,59.057644,4.222994,9.163837,43770.165164,0.627315,0.453704,0.023663,0.385031,0.137603,0.291152,0.050154,0.262603,0.396091,0.471193,0.374486,0.180298,0.181327,0.173868,0.146348,0.141204,0.110597,0.109568,0.105453,0.096965,0.087449,0.079733,0.064815,0.054527,0.042695,0.035494,0.029064,0.025977,0.023663,0.020833,0.01929,0.002829,0.003086,0.002572,0.00463,0.006173,0.00463,0.0018,0.002572,0.003344,0.003858,0.002572,0.002829,0.002829,0.003086,0.003086,0.002315,0.002058,0.002572,0.046916,0.0,0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,-0.0,-0.0,-0.0
std,1128.000113,0.884572,16.586412,2.35515,9.090023,13.420642,26.020814,119069.914812,0.483582,0.497916,0.152015,0.486665,0.344527,0.454352,0.218291,0.440105,0.489147,0.499234,0.484052,0.384485,0.385339,0.379045,0.3535,0.348276,0.313672,0.312391,0.307175,0.295948,0.282528,0.270913,0.24623,0.227083,0.202196,0.185048,0.168007,0.159088,0.152015,0.142845,0.13756,0.053122,0.055477,0.050656,0.067892,0.078335,0.067892,0.042398,0.050656,0.057735,0.062001,0.050656,0.053122,0.053122,0.055477,0.055477,0.048063,0.04532,0.050656,0.266831,1.159775,1.017493,0.917648,0.883797,0.869772,0.859326,0.816043,0.804683,0.800265,0.786134,0.757629,0.740627,0.734066,0.728775,0.708396,0.703111,0.697744,0.688463,0.681132,0.677461,0.663435,0.649358,0.644927,0.639964,0.635143,0.630101,0.621086,0.615562,0.605045,0.60176,0.596746,0.592881,0.584806,0.583636,0.577894,0.569346,0.566099,0.562789,0.554383,0.55211,0.542333,0.532995,0.528454,0.526254,0.522708,0.51969,0.519374,0.515886,0.513145,0.506226
min,0.0,2.8,0.0,0.7,12.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,-3.538242,-2.808469,-3.162184,-2.489024,-2.821504,-2.884916,-2.801313,-2.943148,-2.704649,-2.714811,-2.585814,-2.139142,-2.371309,-2.514258,-2.597761,-2.396797,-2.347164,-2.11468,-2.339656,-2.17475,-2.054649,-2.092079,-2.289614,-2.2465,-2.113136,-2.171414,-2.142651,-2.179903,-1.99549,-2.02719,-2.174943,-1.829919,-2.115733,-2.004313,-1.779395,-1.900601,-1.88246,-1.956226,-2.0222,-1.990758,-1.857387,-1.643732,-1.964351,-1.954381,-1.759194,-1.70997,-1.746902,-1.668863,-1.899349,-1.80339
25%,974.75,6.6,67.199539,3.8,59.057644,0.0,0.0,791.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.066667,-0.808654,-0.732023,-0.629719,-0.63236,-0.592748,-0.581358,-0.558424,-0.545439,-0.531386,-0.532373,-0.532065,-0.505833,-0.497737,-0.501449,-0.495558,-0.473789,-0.472465,-0.473525,-0.443405,-0.458938,-0.457318,-0.434961,-0.430497,-0.426396,-0.42967,-0.416713,-0.422277,-0.404025,-0.412382,-0.402126,-0.396907,-0.397784,-0.390729,-0.386022,-0.394972,-0.390873,-0.379285,-0.381332,-0.363506,-0.364392,-0.360135,-0.350964,-0.356955,-0.36476,-0.343282,-0.354129,-0.365429,-0.348314,-0.34607,-0.333691
50%,1949.5,7.1,67.199539,6.7,59.057644,0.0,2.0,3882.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.127017,-0.04138,-0.027232,-0.046764,-0.011328,-0.047199,-0.002495,-0.003194,0.004017,0.01129,-0.014052,-0.005711,-0.015952,-0.010358,-0.017313,-0.005884,-0.004978,-0.013378,0.000972,-0.016748,-0.01284,-0.000371,-0.002528,0.001665,-0.008667,0.005504,-0.006394,-0.001866,-0.000806,0.001204,0.00685,-0.018938,-0.000583,0.003804,0.021636,0.00311,-0.005559,-0.009495,-0.006437,0.005092,0.005487,-0.004521,0.005472,0.008906,0.014598,0.004404,-0.010439,-0.005329,0.00058,0.00834
75%,2926.25,7.7,68.0,8.2,59.057644,3.0,7.0,26804.75,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.731302,0.684692,0.602749,0.588546,0.576912,0.556054,0.557565,0.558158,0.546461,0.539465,0.511948,0.48709,0.494939,0.489123,0.474775,0.477998,0.473238,0.478903,0.449286,0.472456,0.428578,0.439547,0.427915,0.43163,0.431928,0.423873,0.4257,0.414844,0.415225,0.391539,0.393887,0.401135,0.373161,0.395814,0.381728,0.385843,0.37736,0.375435,0.366877,0.355113,0.367027,0.366222,0.35723,0.347529,0.365848,0.332471,0.349076,0.345877,0.336494,0.335994
max,3907.0,9.5,100.0,9.7,97.0,251.0,386.0,2072912.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.926368,3.746886,3.505433,3.208619,3.387001,3.548229,2.830118,3.231982,2.826298,2.782582,2.649444,2.738232,2.446271,2.992143,2.624331,2.494262,2.51703,2.402648,2.618083,2.207033,2.582064,2.155555,2.520522,3.092753,2.22668,2.621655,2.549426,2.192027,2.271013,2.242856,2.223186,1.966992,2.32581,2.136377,2.090688,2.003605,2.038731,2.238583,2.296252,1.99689,2.173954,1.829952,2.019415,2.215926,1.801669,1.862742,1.763329,1.97866,2.136901,1.810029


Title id                   0
Title                      0
Netflix Release Date       0
IMDb Score                 0
Rotten Tomatoes Score      0
Hidden Gem Score           0
Metacritic Score           0
Awards Received            0
Awards Nominated For       0
IMDb Votes                 0
Series or Movie_Movie      0
Runtime_1-2 hour           0
Runtime_30-60 mins         0
Runtime_< 30 minutes       0
Runtime_> 2 hrs            0
Grouped Rating_adult       0
Grouped Rating_child       0
Grouped Rating_teen        0
Grouped Rating_unrated     0
Drama_f                    0
Comedy_f                   0
Action_f                   0
Thriller_f                 0
Romance_f                  0
Crime_f                    0
Documentary_f              0
Adventure_f                0
Fantasy_f                  0
Animation_f                0
Family_f                   0
Mystery_f                  0
Sci-Fi_f                   0
Horror_f                   0
Biography_f                0
History_f     
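
The two summaries above are consistent with missing values having been filled in two ways. The score columns keep the same mean before and after (Rotten Tomatoes Score stays at 67.199539, and its quartiles collapse onto that value), which is what mean imputation produces. The awards columns gain a minimum of 0 and a lower mean, which is what a zero fill produces. A minimal sketch of that pattern on a toy frame follows; the toy values and the `mean_cols` / `zero_cols` names are illustrative assumptions, not necessarily the notebook's actual code.

```python
import numpy as np
import pandas as pd

# toy data standing in for the real feature table (values are made up)
toy = pd.DataFrame({
    'IMDb Score': [7.0, np.nan, 8.0],
    'Rotten Tomatoes Score': [60.0, 80.0, np.nan],
    'Awards Received': [2.0, np.nan, np.nan],
})

mean_cols = ['IMDb Score', 'Rotten Tomatoes Score']  # assumed: filled with the column mean
zero_cols = ['Awards Received']                      # assumed: missing means no awards

filled = toy.copy()
# fillna with a Series fills each column with the value under its own label
filled[mean_cols] = filled[mean_cols].fillna(filled[mean_cols].mean())
filled[zero_cols] = filled[zero_cols].fillna(0)

# every column should now report zero nulls
print(filled.isnull().sum())
```

Mean imputation leaves each column's mean unchanged but shrinks its standard deviation, which matches the drop in `std` for the score columns between the two tables.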

<a id="id_5"></a>
## 5. Make Predictions

In [15]:
# reset the index so the predictions line up row-for-row with the titles
uk_kaggle_netflix.reset_index(drop = True, inplace = True)

# drop columns that are not model features
X_uk = uk_kaggle_netflix.drop(['Title', 'Title id', 'Netflix Release Date'], axis = 1)

# get model predictions
uk_probs = model.predict_proba(X_uk)

# join probs onto the England titles, then add back the descriptive columns
df_uk_preds = pd.DataFrame({'label_0_prob' : uk_probs[:, 0], 'label_1_prob' : uk_probs[:, 1]})
df_uk_preds = uk_kaggle_netflix[['Title', 'Netflix Release Date']].merge(df_uk_preds, left_index = True, right_index = True)
df_uk_preds = pd.merge(df_uk_preds, kaggle_netflix, how = 'left', on = ['Title', 'Netflix Release Date'])
df_uk_preds = df_uk_preds[['Title', 'Genre', 'Series or Movie', 'Summary', 'label_1_prob']]

print('When in England, Mum should consider watching the following 30 Netflix titles, ' +
      'which have the highest predicted probability of being something she would want to watch:')
df_uk_preds.sort_values(by = 'label_1_prob', ascending = False).head(30)

When in England, Mum should consider watching the following 30 Netflix titles, which have the highest predicted probability of being something she would want to watch:


Unnamed: 0,Title,Genre,Series or Movie,Summary,label_1_prob
2176,One of Us,"Drama, Mystery, Thriller",Series,A dark web of secrets and lies emerges when a newlywed couple is killed and detectives question their feuding families.,0.950171
1014,The I-Land,"Adventure, Drama, Mystery, Sci-Fi",Series,"Wiped clean of memories and thrown together, a group of strangers fight to survive harsh realities -- and the island that traps them.",0.946659
3421,Bitten,"Drama, Fantasy, Horror, Mystery",Series,"Elena Michaels tries to stray from the pack of werewolves that turned her into a monster, but her efforts are thwarted by a string of grisly murders.",0.943834
1038,Falling Inn Love,"Comedy, Romance",Movie,"When a San Francisco exec wins a New Zealand inn, she ditches city life to remodel and flip the rustic property with help from a handsome contractor.",0.940321
2026,The Chalet,"Drama, Mystery, Thriller",Series,Friends gathered at a remote chalet in the French Alps for a summer getaway are caught in a deadly trap as a dark secret from the past comes to light.,0.926252
3192,Sense8,"Drama, Mystery, Sci-Fi, Thriller",Series,"One moment links 8 minds in disparate parts of the world, putting 8 strangers in each others lives, each others secrets, and in terrible danger.",0.920979
2573,13 Reasons Why,"Drama, Mystery, Thriller",Series,"After a teenage girls perplexing suicide, a classmate receives a series of tapes that unravel the mystery of her tragic choice.",0.904931
246,To the Lake,"Drama, Sci-Fi, Thriller",Series,"Facing the end of civilization when a terrifying plague strikes, a group risks their lives, loves — and humanity — in a brutal struggle to survive.",0.88382
722,Ragnarok,"Drama, Fantasy, Mystery",Series,"In a Norwegian town poisoned by pollution and rattled by melting glaciers, the End Times feel all too real. It’ll take a legend to battle an old evil.",0.856676
3498,Hemlock Grove,"Drama, Horror, Mystery, Thriller",Series,"Secrets are just a part of daily life in the small Pennsylvania town of Hemlock Grove, where the darkest evils hide in plain sight.",0.845822
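
Because the final merge joins on `Title` and `Netflix Release Date`, any title that appears more than once under the same date in `kaggle_netflix` would duplicate rows in the ranking. A minimal sketch of one way to guard against that uses the `validate` argument of pandas' `merge` on toy frames. The frames and values here are made up for illustration, and whether the real data contains such duplicates is not shown in this notebook.

```python
import pandas as pd

# toy predictions, one row per title (values are made up)
preds = pd.DataFrame({
    'Title': ['A', 'B'],
    'Netflix Release Date': ['2020-01-01', '2021-01-01'],
    'label_1_prob': [0.9, 0.2],
})

# toy lookup table holding the descriptive columns
lookup = pd.DataFrame({
    'Title': ['A', 'B'],
    'Netflix Release Date': ['2020-01-01', '2021-01-01'],
    'Genre': ['Drama', 'Comedy'],
})

keys = ['Title', 'Netflix Release Date']

# validate='many_to_one' raises MergeError if the keys are not unique in lookup
ranked = preds.merge(lookup, how='left', on=keys, validate='many_to_one')
ranked = ranked.sort_values(by='label_1_prob', ascending=False)

# a left join onto unique keys should not change the number of rows
assert len(ranked) == len(preds)
print(ranked[['Title', 'Genre', 'label_1_prob']])
```

If the check fails on the real data, dropping duplicate key rows from the lookup table before merging is one option, at the cost of choosing which duplicate to keep.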


[back to the top](#id_0)