## **Netflix Movie Classification**
In this notebook, we fit an **LSTM** model that approximates the previously fitted clustering model: given a show's description, it predicts the cluster that show was assigned to. The notebook performs the following steps:

1.   Importing libraries
2.   Loading, cleaning and subsetting the dataset
3.   Preprocessing text data
4.   Balancing the dataset
5.   Preparing the dataset for modelling (word embedding, train/test split)
6.   Creating the LSTM model
7.   Training the LSTM model
8.   Validating results
9.   Hyperparameter tuning of the LSTM model
10.  Validating results of the tuned model


### Importing Libraries

In [1]:
import numpy as np
import pandas as pd
from plotly.offline import iplot
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D, Flatten, Dropout
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk import word_tokenize
STOPWORDS = set(stopwords.words('english'))
from keras_tuner.tuners import RandomSearch
import keras_tuner

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Loading the dataset

In [2]:
df = pd.read_csv("netflix_titles_labled.csv")
print(df.shape)
df.tail()

(8807, 14)


Unnamed: 0.1,Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,cluster
8802,8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a...",2
8803,8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g...",1
8804,8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...,1
8805,8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero...",5
8806,8806,s8807,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...,0


In [3]:
df_final = df[['show_id','title','description','cluster']]
df_final.head()


Unnamed: 0,show_id,title,description,cluster
0,s1,Dick Johnson Is Dead,"As her father nears the end of his life, filmm...",1
1,s2,Blood & Water,"After crossing paths at a party, a Cape Town t...",1
2,s3,Ganglands,To protect his family from a powerful drug lor...,0
3,s4,Jailbirds New Orleans,"Feuds, flirtations and toilet talk go down amo...",4
4,s5,Kota Factory,In a city of coaching centers known to train I...,0


In [4]:
df_final['cluster'].value_counts()

0    3657
1    3303
2     733
5     398
4     396
3     320
Name: cluster, dtype: int64

### **Preprocessing Text Data**

In [5]:
# Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning.
df_final = df_final.copy()
# Trim whitespace and lowercase the descriptions.
df_final['description'] = df_final['description'].str.strip().str.lower()
# Replace every non-letter with a space; regex=True is required for the pattern to be treated as a regex.
df_final['description'] = df_final['description'].str.replace('[^a-zA-Z]', ' ', regex=True)
# Drop stopwords and lemmatize the remaining words.
df_final['description'] = df_final['description'].apply(lambda x: ' '.join(lemmatizer.lemmatize(word) for word in x.split() if word not in STOPWORDS))

### Balancing classes

In [6]:
classes = list(df_final['cluster'].unique())
min_class_val = min(df_final['cluster'].value_counts())
df_balanced = pd.DataFrame()
for c in classes:
    class_df = df_final[df_final['cluster'] == c]
    class_df = class_df.sample(min_class_val)
    df_balanced = pd.concat([df_balanced,class_df])

print(df_balanced.shape)
df_balanced.head()

(1920, 4)


Unnamed: 0,show_id,title,description,cluster
2459,s2460,Space Force,four star general begrudgingly team eccentric ...,1
3078,s3079,Albert Pinto Ko Gussa Kyun Aata Hai?,police investigate disappearance young man hea...,1
1631,s1632,Rust Creek,wrong turn wood becomes fight life career seek...,1
4785,s4786,Sommore: Chandelier Status,luminous funnywoman sommore wow miami unique t...,1
1626,s1627,The Happytime Murders,la puppet human coexist luck detective team ex...,1


In [7]:
df_balanced['cluster'].value_counts()

1    320
0    320
4    320
5    320
2    320
3    320
Name: cluster, dtype: int64
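
The undersampling loop above draws the same number of rows from every cluster, using the size of the rarest cluster. As a rough sketch, the same idea can be written with `groupby(...).sample(...)`, assuming pandas 1.1 or later. The toy frame and the `random_state` value are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for df_final; the real notebook has six clusters.
toy = pd.DataFrame({
    "description": [f"text {i}" for i in range(11)],
    "cluster": [0] * 6 + [1] * 3 + [2] * 2,
})

# The rarest class sets the per-class sample size.
min_count = toy["cluster"].value_counts().min()

# Draw min_count rows from every cluster; random_state makes the draw repeatable.
balanced = toy.groupby("cluster").sample(n=min_count, random_state=0)

counts = balanced["cluster"].value_counts().to_dict()
print(counts)
```

Fixing `random_state` would also make the balanced subset, and therefore the downstream results, reproducible between runs.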

### Preparing dataset for modelling

Word Embedding

In [8]:
# Vocabulary size: keep only the most frequent words.
max_words = 10000
# Padded sequence length. len(x) counts characters, not words, so this is a loose upper bound on tokens per description.
max_seq = max(df_balanced['description'].apply(lambda x:len(x)))
# Dimensionality of the word-embedding vectors.
embedding_dim = 300

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(df_balanced['description'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 7574 unique tokens.


In [9]:
X = tokenizer.texts_to_sequences(df_balanced['description'].values)
X = pad_sequences(X, maxlen=max_seq)
print('Shape of data tensor:', X.shape)

Shape of data tensor: (1920, 176)
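
Note that `len(x)` applied to a string returns its length in characters, whereas `pad_sequences` pads to a length measured in tokens. The sketch below contrasts the two on made-up descriptions (not taken from the dataset), under the assumption that tokens are whitespace-separated words.

```python
# Illustrative cleaned descriptions (hypothetical examples).
texts = [
    "four star general begrudgingly team",
    "police investigate disappearance young man",
]

# Length in characters: what len(x) returns for a string.
max_chars = max(len(t) for t in texts)

# Length in whitespace-separated tokens: the unit that pad_sequences pads to.
max_tokens = max(len(t.split()) for t in texts)

print(max_chars, max_tokens)
```

A character-based `maxlen` still works, since it is never smaller than the token count, but it left-pads every sequence with many extra zeros.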


In [10]:
Y = pd.get_dummies(df_balanced['cluster']).values
print('Shape of label tensor:', Y.shape)

Shape of label tensor: (1920, 6)


Train/Test Split

In [11]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.3, random_state = 100)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(1344, 176) (1344, 6)
(576, 176) (576, 6)


### **Building LSTM Model**

In [12]:
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=X.shape[1]))
model.add(Dropout(0.5))
model.add(LSTM(embedding_dim, dropout=0.5, recurrent_dropout=0.5,return_sequences=True))
model.add(Flatten())
model.add(Dropout(0.5))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(6, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['Precision'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 176, 300)          3000000   
                                                                 
 dropout (Dropout)           (None, 176, 300)          0         
                                                                 
 lstm (LSTM)                 (None, 176, 300)          721200    
                                                                 
 flatten (Flatten)           (None, 52800)             0         
                                                                 
 dropout_1 (Dropout)         (None, 52800)             0         
                                                                 
 dense (Dense)               (None, 32)                1689632   
                                                                 
 dropout_2 (Dropout)         (None, 32)                0
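
The parameter counts in the summary can be checked by hand. The sketch below reproduces them from the layer shapes, assuming the standard Keras LSTM layout of four gates, each with input weights, recurrent weights and a bias.

```python
# Shapes taken from the summary above.
vocab, emb_dim, units, seq_len, dense_units = 10000, 300, 300, 176, 32

# Embedding: one emb_dim vector per vocabulary slot.
embedding_params = vocab * emb_dim

# LSTM: four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (units * (emb_dim + units) + units)

# Dense on the flattened (seq_len * units) LSTM output, plus biases.
dense_params = seq_len * units * dense_units + dense_units

print(embedding_params, lstm_params, dense_params)
# → 3000000 721200 1689632
```

Most of the non-embedding capacity sits in the dense layer that follows `Flatten`, which is large relative to the 1,344 training samples.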

### **Training LSTM Model**

In [13]:
epochs = 10
batch_size = 64

history = model.fit(X_train, Y_train, epochs=epochs, batch_size = batch_size,
                    validation_split=0.1,
                    callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.001)])



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10


### **Validating Results**

In [14]:
# evaluate() returns [loss, precision], in the order given to compile().
results = model.evaluate(X_test,Y_test)
print('Test set\n  Loss: {:0.3f}\n  Precision: {:0.3f}'.format(results[0],results[1]))

Test set
  Loss: 1.792
  Precision: 0.000
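
One plausible reading of these numbers: a loss of 1.792 is almost exactly ln 6, the cross-entropy of a uniform guess over six classes, and the Keras `Precision` metric counts a prediction as positive only when its probability exceeds a threshold (0.5 by default). The sketch below illustrates this with NumPy on made-up targets; it is an interpretation, not a confirmed diagnosis of the trained model.

```python
import numpy as np

n_classes = 6
# A model that has learned nothing outputs roughly uniform probabilities.
probs = np.full((4, n_classes), 1.0 / n_classes)
# One-hot targets for four illustrative samples.
y_true = np.eye(n_classes)[[0, 2, 3, 5]]

# Categorical cross-entropy averaged over samples.
loss = float(-np.mean(np.sum(y_true * np.log(probs), axis=1)))

# Number of class probabilities that clear the 0.5 threshold (assumed metric default).
positives = int(np.sum(probs > 0.5))

print(round(loss, 3), positives)
# → 1.792 0
```

With no positive predictions, the thresholded precision is reported as 0, so it says little about which class the model would pick with `argmax`.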


### **Hyperparameter Tuning of LSTM Model**

In [16]:


def build_model(hp):
    # Build one candidate model; `hp` supplies the hyperparameter values for the current trial.
    model = Sequential()
    model.add(Embedding(max_words, embedding_dim, input_length=X.shape[1]))
    # 'Dropout_rate' is requested under the same name three times, so the Dropout layers share one value.
    model.add(Dropout(hp.Float('Dropout_rate',min_value=0,max_value=0.5,step=0.1)))
    # Without return_sequences, the LSTM emits only its final output of shape (batch, units).
    model.add(LSTM(hp.Int('layer_2_neurons',min_value=32,max_value=128,step=32)))
    # Flatten leaves the already 2-D LSTM output unchanged.
    model.add(Flatten())
    model.add(Dropout(hp.Float('Dropout_rate',min_value=0,max_value=0.5,step=0.1)))
    model.add(Dense(hp.Int('hidden_size', 30, 100, step=10, default=50),activation= hp.Choice('dense_activation',values=['relu', 'sigmoid'],default='relu')))
    model.add(Dropout(hp.Float('Dropout_rate',min_value=0,max_value=0.5,step=0.1)))
    model.add(Dense(6, activation='softmax'))
    # The learning rate is sampled on a log scale between 1e-4 and 1e-2.
    model.compile(optimizer=tf.keras.optimizers.Adam(
      hp.Float('learning_rate', 1e-4, 1e-2, sampling='log')),loss='categorical_crossentropy',metrics = ['Precision'])
    return model

tuner= RandomSearch(
        build_model,
        keras_tuner.Objective('val_precision','max'),
        max_trials=10,
        executions_per_trial=2,
        overwrite=True
        )

tuner.search(
        x=X_train,
        y=Y_train,
        epochs=10,
        batch_size=128,
        validation_data=(X_test,Y_test),
)

Trial 10 Complete [00h 04m 10s]
val_precision: 0.0

Best val_precision So Far: 0.9904761910438538
Total elapsed time: 00h 34m 29s
INFO:tensorflow:Oracle triggered exit
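
The search above scores every trial on `(X_test, Y_test)`, so the test set also drives model selection. A common alternative is to hold out a validation split from the training data. The sketch below shows that split on random stand-in arrays with the notebook's training shapes; the names `X_tr`, `X_val`, `Y_tr`, `Y_val` and the 20% fraction are hypothetical choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(100)
# Stand-ins with the notebook's training shapes: (1344, 176) inputs, 6 one-hot classes.
X_train = rng.integers(0, 10000, size=(1344, 176))
Y_train = np.eye(6, dtype=int)[rng.integers(0, 6, size=1344)]

# Hold out 20% of the training data for tuning; stratify keeps the class mix.
X_tr, X_val, Y_tr, Y_val = train_test_split(
    X_train, Y_train, test_size=0.2, random_state=100,
    stratify=Y_train.argmax(axis=1),
)

print(X_tr.shape, X_val.shape)
```

The tuner could then receive `x=X_tr, y=Y_tr, validation_data=(X_val, Y_val)`, leaving `X_test` untouched for the final evaluation.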


In [17]:
best_model = tuner.get_best_models(num_models=1)[0]

In [18]:
best_model.evaluate(X_test,Y_test)



[1.552685022354126, 1.0]

In [19]:
Y_lables = [i for a in Y_test for i, e in enumerate(a) if e != 0]

In [20]:
pred_lables = np.argmax(best_model.predict(X_test),axis = 1)

In [23]:
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
print(precision_score(Y_lables,pred_lables,average='micro'))
print(classification_report(Y_lables,pred_lables))

0.4583333333333333
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       100
           1       0.17      0.03      0.05       101
           2       0.52      0.27      0.36        89
           3       0.38      0.86      0.53        97
           4       0.70      0.73      0.71       100
           5       0.43      0.91      0.59        89

    accuracy                           0.46       576
   macro avg       0.37      0.47      0.37       576
weighted avg       0.36      0.46      0.37       576
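
For single-label multiclass predictions, micro-averaged precision equals accuracy, which is why the first printed value (0.4583) matches the accuracy row of the report. The NumPy sketch below illustrates this on made-up labels; it also shows that `argmax` recovers integer labels from one-hot rows, the same result as the list comprehension used for `Y_lables`.

```python
import numpy as np

# Illustrative one-hot targets and predicted class indices (not the notebook's data).
Y_onehot = np.eye(3, dtype=int)[[0, 1, 2, 2, 1, 0]]
pred = np.array([0, 2, 2, 2, 1, 1])

# argmax over the class axis recovers the integer label of each one-hot row.
true = Y_onehot.argmax(axis=1)

accuracy = float(np.mean(true == pred))

# Micro-averaged precision: total true positives over total positive predictions.
tp = sum(int(np.sum((pred == c) & (true == c))) for c in range(3))
predicted = sum(int(np.sum(pred == c)) for c in range(3))
micro_precision = tp / predicted

print(accuracy, micro_precision)
```

The per-class rows of the report are therefore the more informative view here: clusters 0 and 1 are almost never predicted correctly, while clusters 3 and 5 have high recall but modest precision.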

