# The Office Character Predictor

Our objective in this notebook is to predict, given a quote from the script of The Office, which of the four main characters (Michael, Dwight, Jim and Pam) is most likely to have said it.  
We'll build a neural network based on LSTM layers in TensorFlow to tackle this supervised classification task.  
We'll be using The Office Quote Dataset, which consists of two .csv files: one containing the characters' "Talking Head" moments (a character talking directly to the camera) and another containing a character's reply to another character's line.  
Let's start by loading the data and some helpful modules.

## Load Modules and Data

In [None]:
import pandas as pd
import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

import seaborn as sns
from matplotlib import pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
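# constants
NUM_WORDS = 1000000  # upper bound on the tokenizer vocabulary; also the Embedding input size
MAX_LEN = 140        # every quote is padded or truncated to this many tokens
NUM_CLASSES = 4      # Michael, Dwight, Jim, Pam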

In [None]:
df1 = pd.read_csv('/kaggle/input/the-office-quotes-dataset/talking_head.csv')
df2 = pd.read_csv('/kaggle/input/the-office-quotes-dataset/parent_reply.csv')

## Clean Data

Let's combine the two datasets into one with two columns: the quote and the name of the character who said it.

In [None]:
df1.head()

In [None]:
df1 = df1.drop(columns=['quote_id'])
df1.head()

In [None]:
df2.head()

In [None]:
df2 = df2.drop(columns=["parent_id", "parent"])
df2 = df2.rename(columns={'reply': 'quote'})
df2.head()

In [None]:
df = pd.concat([df1, df2]).reset_index(drop=True)
df.head()

### Encode character names to integer

We encode each character's name as an integer, so that the labels can be fed into the TensorFlow model.

In [None]:
print(list(df['character'].unique()))

In [None]:
char_to_int = {
    "Michael": 0,
    "Dwight": 1,
    "Jim": 2,
    "Pam": 3
}
int_to_char = ["Michael", "Dwight", "Jim", "Pam"]
# map() looks each name up in char_to_int; any name missing from the dict becomes NaN
df['character'] = df['character'].map(char_to_int).astype('int8')
df.head()
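
As a small illustration of how the two lookups relate, the sketch below builds the name-to-integer dictionary from the list and round-trips a few labels. The `labels` list is made up for the example and is not taken from the dataset.

```python
# Stand-alone illustration; the names mirror the notebook's variables.
int_to_char = ["Michael", "Dwight", "Jim", "Pam"]

# Deriving the dict from the list keeps the two mappings in sync.
char_to_int = {name: idx for idx, name in enumerate(int_to_char)}

# Hypothetical labels, only for demonstration.
labels = ["Jim", "Pam", "Michael", "Jim"]

encoded = [char_to_int[name] for name in labels]   # names -> integers
decoded = [int_to_char[idx] for idx in encoded]    # integers -> names

print(encoded)
print(decoded)
```

The same `int_to_char` list is what turns a predicted class index back into a readable name later on.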

## Train Test Split

In [None]:
X = df['quote'].values
y = df['character'].values
del df

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    train_size=0.8,
    random_state=42,
    shuffle=True
)
del X, y

## Encode Sentences to Sequences

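We tokenize the words into integer tokens and map each sentence to a fixed-length sequence of tokens, so that it can be fed into the model. The tokenizer is fit on the training sentences only, so any word that appears only in the test set is mapped to the `<OOV>` token.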

In [None]:
def tokenize_and_sequence(
    train_sentences, 
    test_sentences, 
    num_words=NUM_WORDS, 
    maxlen=MAX_LEN
    ):
    print(f"num_words: {num_words}")
    tok = Tokenizer(num_words=num_words, oov_token='<OOV>')
    tok.fit_on_texts(train_sentences)
    
    train_sequences = tok.texts_to_sequences(train_sentences)
    train_sequences = pad_sequences(
        train_sequences, 
        padding='post', maxlen=maxlen, truncating='post'
    )
    
    test_sequences = tok.texts_to_sequences(test_sentences)
    test_sequences = pad_sequences(
        test_sequences,
        padding='post', maxlen=maxlen, truncating='post'
    )
    
    return train_sequences, test_sequences, tok

In [None]:
%%time
X_train, X_test, tok = tokenize_and_sequence(X_train, X_test)
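
To make the encoding step concrete, here is a minimal pure-Python sketch of the same idea on toy sentences. It is a simplified stand-in, not the Keras implementation: it only lowercases and splits on whitespace, whereas `Tokenizer` also filters punctuation. The helper names `fit_vocab` and `to_padded` are hypothetical.

```python
from collections import Counter

def fit_vocab(sentences, oov_token="<OOV>"):
    """Assign an integer to every word; 0 is kept for padding, 1 for OOV."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    vocab = {oov_token: 1}
    # More frequent words get smaller indices; ties keep first-seen order.
    for idx, (word, _) in enumerate(counts.most_common(), start=2):
        vocab[word] = idx
    return vocab

def to_padded(sentences, vocab, maxlen):
    """Turn sentences into equal-length lists of token ids."""
    rows = []
    for s in sentences:
        ids = [vocab.get(w, 1) for w in s.lower().split()]  # unknown word -> 1
        ids = ids[:maxlen]                                  # like truncating='post'
        ids = ids + [0] * (maxlen - len(ids))               # like padding='post'
        rows.append(ids)
    return rows

# Toy sentences, made up for the example.
train = ["the dog barks", "the cat sleeps all day long today"]
test = ["the bird sleeps"]

vocab = fit_vocab(train)                       # fit on training text only
train_ids = to_padded(train, vocab, maxlen=5)
test_ids = to_padded(test, vocab, maxlen=5)

print(train_ids)
print(test_ids)
```

The short sentence is padded with zeros at the end, the long one is cut off after five tokens, and "bird", which never appears in the training text, becomes the OOV id.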

## Build Model

We build a neural network with an Embedding layer, two Bidirectional Long Short-Term Memory (LSTM) layers, a Dense hidden layer, a Dropout layer for regularization during training, and a softmax Dense layer that outputs one probability per character.

In [None]:
class OfficeModel(tf.keras.Model):
    def __init__(self, vocab_dim=NUM_WORDS, max_len=MAX_LEN, num_classes=NUM_CLASSES):
        super(OfficeModel, self).__init__()
        self.embedding = layers.Embedding(vocab_dim, 32, input_length=max_len)
        self.lstm1 = layers.Bidirectional(layers.LSTM(32, return_sequences=True))
        self.lstm2 = layers.Bidirectional(layers.LSTM(16))
        self.dense = layers.Dense(64, activation='relu')
        self.dropout = layers.Dropout(0.5)
        self.classifier = layers.Dense(num_classes, activation='softmax')
    
    def call(self, inputs, training=False):
        x = self.embedding(inputs)
        x = self.lstm1(x)
        x = self.lstm2(x)
        x = self.dense(x)
        # Dropout is only active when training=True, so no explicit branch is needed
        x = self.dropout(x, training=training)
        return self.classifier(x)

In [None]:
model = OfficeModel()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=['accuracy']
)
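
To get a feel for the size of this architecture, the sketch below counts trainable parameters by hand. It assumes the usual Keras parameterisation (four gates per LSTM, each with input weights, recurrent weights and one bias vector); the helper functions are hypothetical and written only for this estimate.

```python
def embedding_params(vocab_dim, embed_dim):
    # One embed_dim-sized vector per vocabulary slot.
    return vocab_dim * embed_dim

def lstm_params(input_dim, units, bidirectional=True):
    # 4 gates, each with input weights, recurrent weights and a bias.
    one_direction = 4 * (units * (input_dim + units) + units)
    return 2 * one_direction if bidirectional else one_direction

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

NUM_WORDS, EMBED_DIM, NUM_CLASSES = 1_000_000, 32, 4

counts = {
    "embedding": embedding_params(NUM_WORDS, EMBED_DIM),
    "lstm1": lstm_params(EMBED_DIM, 32),        # outputs 2 * 32 = 64 features
    "lstm2": lstm_params(64, 16),               # outputs 2 * 16 = 32 features
    "dense": dense_params(32, 64),
    "classifier": dense_params(64, NUM_CLASSES),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:>10}: {n:,}")
print(f"{'total':>10}: {total:,}")
```

Under these assumptions the embedding table accounts for almost all of the parameters, because its size follows `NUM_WORDS` rather than the number of distinct words that actually occur in the quotes.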

## Train Model

In [None]:
%%time
model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    batch_size=512,
    epochs=20
)

## Evaluate Model

In [None]:
%%time
y_pred = model.predict(X_test, batch_size=512, verbose=1)
y_pred = y_pred.argmax(axis=1)

In [None]:
%%time
acc = accuracy_score(y_test, y_pred)
conf_mat = confusion_matrix(y_test, y_pred)
conf_mat_recall = confusion_matrix(y_test, y_pred, normalize='true')
conf_mat_precision = confusion_matrix(y_test, y_pred, normalize='pred')

### Accuracy

In [None]:
print(f"Accuracy = {acc:.2%}")

### Confusion Matrix

The confusion matrix is rather confusing. So how does it work?  
Each row (a label on the y axis) holds the quotes that **actually** belong to a class.  
For example, all the values along the 1st row are quotes that are **actually** Michael's.  
Each column (a label on the x axis) holds the quotes that were **predicted** to belong to a class by our model.  
For example, all the values along the 1st column are quotes that were **predicted** to be Michael's by our model.  
At the intersection of a row and a column, we find the number of quotes that **actually** belong to the class on the y label and were **predicted** to belong to the class on the x label.

In [None]:
plt.figure(figsize=(6, 6))
# Passing the labels to heatmap places them at the cell centres
sns.heatmap(
    conf_mat, annot=True, fmt="d", cbar=False,
    xticklabels=int_to_char, yticklabels=int_to_char
)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
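
The row and column convention can be checked on a tiny hand-made example. The sketch below builds a confusion matrix in plain Python from toy labels (invented for the illustration, not model output), following the same layout as `confusion_matrix`: rows are actual classes, columns are predicted classes.

```python
def confusion_counts(y_true, y_pred, num_classes):
    """matrix[actual][predicted] counts how often that pair occurs."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Toy labels: 0 = Michael, 1 = Dwight, 2 = Jim, 3 = Pam.
y_true_toy = [0, 0, 0, 1, 1, 2, 3, 3]
y_pred_toy = [0, 0, 1, 1, 0, 2, 3, 2]

toy = confusion_counts(y_true_toy, y_pred_toy, num_classes=4)
for row in toy:
    print(row)

actual_michael = sum(toy[0])                    # 1st row: quotes that are Michael's
predicted_michael = sum(row[0] for row in toy)  # 1st column: quotes predicted as Michael's
correct = sum(toy[i][i] for i in range(4))      # diagonal: correct predictions
accuracy = correct / len(y_true_toy)
print(actual_michael, predicted_michael, accuracy)
```

The diagonal holds the correct predictions, so dividing its sum by the total number of quotes gives the accuracy.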

### Confusion Matrix: normalized along columns

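Each column is divided by its sum, so the diagonal values show the model's precision for each class: of all the quotes predicted as a given character, the fraction that are actually theirs.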

In [None]:
plt.figure(figsize=(6, 6))
# Passing the labels to heatmap places them at the cell centres
sns.heatmap(
    conf_mat_precision, annot=True, fmt='.2%', cbar=False,
    xticklabels=int_to_char, yticklabels=int_to_char
)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

### Confusion Matrix: normalized along rows

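Each row is divided by its sum, so the diagonal values show the model's recall for each class: of all the quotes actually said by a given character, the fraction the model attributed to them.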

In [None]:
plt.figure(figsize=(6, 6))
# Passing the labels to heatmap places them at the cell centres
sns.heatmap(
    conf_mat_recall, annot=True, fmt='.2%', cbar=False,
    xticklabels=int_to_char, yticklabels=int_to_char
)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
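
The two normalizations can also be reproduced by hand. The sketch below uses a small made-up count matrix (rows actual, columns predicted) and assumes every class appears at least once as an actual and as a predicted label, so no division by zero occurs.

```python
# Toy confusion matrix: rows = actual class, columns = predicted class.
toy = [
    [2, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
]
n = len(toy)

# Recall: diagonal entry divided by its row sum (all quotes actually in the class).
recall = [toy[i][i] / sum(toy[i]) for i in range(n)]

# Precision: diagonal entry divided by its column sum (all quotes predicted as the class).
precision = [toy[i][i] / sum(row[i] for row in toy) for i in range(n)]

for i in range(n):
    print(f"class {i}: precision={precision[i]:.2f} recall={recall[i]:.2f}")
```

In this toy matrix class 2 has perfect recall but only 50% precision: its single actual quote is found, but one quote from class 3 is also assigned to it.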