<a href="https://colab.research.google.com/github/Anuj040/NLP/blob/master/NLP_speaker_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The content of this notebook was inspired by https://www.kaggle.com/johnwdata/eda-and-nlp-on-democratic-debate-transcripts

## Prepare the data

* Before starting, download the dataset from https://www.kaggle.com/brandenciranni/democratic-debate-transcripts-2020
* You can then upload the dataset to Colab as follows:

In [None]:
from google.colab import files

In [None]:
# Upload the data file
files.upload()

In [None]:
import pandas as pd

# Prepare dataframe and check for null values
df = pd.read_csv("/content/debate_transcripts_v3_2020-02-26.csv", encoding="cp1252", usecols= ["speaker", "speech"])
df.isna().sum()

In [None]:
# The selected columns contain no null values, so we can proceed as is.
# Count the samples available for each speaker;
# this is useful for spotting an imbalanced dataset.
value_counts = df.speaker.value_counts()

In [None]:
# Remove speakers with very few samples
# Identify the speakers with 100 or fewer speeches
to_remove = value_counts[value_counts <= 100].index

# Keep rows whose speaker is not in to_remove
df = df[~df.speaker.isin(to_remove)]
# Get the list of unique speakers
speakers = list(df.speaker.unique())

## One hot encode the speaker labels

In [None]:
# Get the one-hot encoding of the speaker column
one_hot = pd.get_dummies(df["speaker"])
# Join the encoded df
df = df.join(one_hot)

## Clean the speech, removing filler/stop words

In [None]:
import nltk
import re
nltk.download('stopwords')

# add a column for the speech with stop words and punctuation removed
stop_words = set(nltk.corpus.stopwords.words('english'))
add_words = {"its", "would", "us", "then", "so", "it", "thats", "going", "also", "crosstalk"}
stop_words =  stop_words.union(add_words)
df["speech_cleaned"] = df["speech"].apply(lambda x: " ".join([re.sub(r'[^\w\d]','', item.lower()) for item in x.split() if re.sub(r'[^\w\d]','', item.lower()) not in stop_words]))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
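
The one-line lambda above packs three steps into a single expression: lower-casing each token, stripping non-word characters, and dropping stop words. As a rough sketch of the same logic in a more readable form, the snippet below uses a hypothetical helper name (`clean_speech`) and a tiny made-up stop-word set that stands in for the NLTK list used in the notebook.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook uses NLTK's English list plus a few extra words.
DEMO_STOP_WORDS = {"the", "is", "and", "we", "to", "so", "going"}

_NON_WORD = re.compile(r"[^\w\d]")


def clean_speech(text, stop_words):
    """Lower-case each token, strip punctuation, and drop stop words."""
    kept = []
    for item in text.split():
        token = _NON_WORD.sub("", item.lower())
        if token not in stop_words:
            kept.append(token)
    return " ".join(kept)


print(clean_speech("So, we're going to fix the economy!", DEMO_STOP_WORDS))
# were fix economy
```

Note that "we're" becomes "were" once the apostrophe is removed, which is presumably why apostrophe-free forms such as "thats" appear in `add_words`. Like the original lambda, this sketch keeps tokens that become empty after stripping, so a stray punctuation mark leaves an extra space in the result.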


### Stem the cleaned words

In [None]:
from nltk.stem import PorterStemmer
# Create stemmer
stemmer = PorterStemmer()
# Stem cleaned up speech in the debate data
df["speech_cleaned"] = df["speech_cleaned"].apply(lambda x: " ".join([stemmer.stem(item) for item in x.split()]))

## Prepare the train/test splits

In [None]:
train=df.sample(frac=0.8,random_state=101) #random state is a seed value
test=df.drop(train.index).sample(frac=1.0,random_state=45)

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
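speech_tokenize = Tokenizer()
# Build the vocabulary from the training text; an unfitted Tokenizer has an
# empty word_index, so every sequence would be padded to all zeros
speech_tokenize.fit_on_texts(train["speech_cleaned"])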
# set max sequence length
max_len = 150
X_train = pad_sequences(speech_tokenize.texts_to_sequences(train["speech_cleaned"]), maxlen=max_len, padding="post")
Y_train = train[speakers]

X_test = pad_sequences(speech_tokenize.texts_to_sequences(test["speech_cleaned"]), maxlen=max_len, padding="post")
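
To make the tokenize-and-pad step concrete, here is a simplified pure-Python stand-in for what the Keras `Tokenizer` and `pad_sequences` calls do with `padding="post"`. It is a sketch under stated assumptions, not the Keras implementation: the helper names `fit_word_index` and `to_padded` are hypothetical, and details such as filtering and out-of-vocabulary tokens are left out.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency; id 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts, key=lambda w: -counts[w])  # stable for ties
    return {word: i + 1 for i, word in enumerate(ranked)}


def to_padded(texts, word_index, max_len):
    """Map words to ids (unknown words are skipped), then post-pad with 0."""
    rows = []
    for text in texts:
        ids = [word_index[w] for w in text.split() if w in word_index]
        ids = ids[-max_len:]  # mirrors the default truncating="pre": keep the tail
        rows.append(ids + [0] * (max_len - len(ids)))
    return rows


texts = ["health care plan", "care tax"]
word_index = fit_word_index(texts)
print(to_padded(texts, word_index, max_len=4))  # [[2, 1, 3, 0], [1, 4, 0, 0]]
print(to_padded(texts, {}, max_len=4))          # [[0, 0, 0, 0], [0, 0, 0, 0]]
```

The second call shows why the vocabulary must be built before encoding: with an empty word index every speech maps to the same all-zero row, and the model has nothing to distinguish speakers by.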

## Define the model

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Dense, Dropout, BatchNormalization, Activation, Bidirectional, LSTM
# build a model for speech analysis
speaker_model = Sequential([
    Embedding(len(speech_tokenize.word_index) + 1, 200),
    Bidirectional(LSTM(32, return_sequences=True)),
    Bidirectional(LSTM(16)),
    Dense(64),
    BatchNormalization(),
    Activation("relu"),
    Dropout(.25),
    Dense(16),
    BatchNormalization(),
    Activation("relu"),
    Dropout(.25),
    Dense(len(speakers
    ), activation="softmax")
])

## Compile and train the model

In [None]:
speaker_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["acc"])
# Test if the model is working as expected
# speech = df["speech_cleaned"][0]
# predict = speaker_model.predict(pad_sequences(speech_tokenize.texts_to_sequences([speech]), maxlen=max_len, padding="post"))
# predict.shape

speaker_model.fit(X_train, Y_train, validation_split=.1, epochs=100, verbose=2)



Epoch 1/100
86/86 - 9s - loss: 2.7933 - acc: 0.0988 - val_loss: 2.7542 - val_acc: 0.1176
Epoch 2/100
86/86 - 2s - loss: 2.7201 - acc: 0.1112 - val_loss: 2.6987 - val_acc: 0.0915
Epoch 3/100
86/86 - 2s - loss: 2.6998 - acc: 0.1152 - val_loss: 3.1526 - val_acc: 0.0915
Epoch 4/100
86/86 - 2s - loss: 2.6959 - acc: 0.1148 - val_loss: 4.8746 - val_acc: 0.0915
Epoch 5/100
86/86 - 2s - loss: 2.6915 - acc: 0.1217 - val_loss: 9.9763 - val_acc: 0.0915
Epoch 6/100
86/86 - 2s - loss: 2.6911 - acc: 0.1119 - val_loss: 42.4488 - val_acc: 0.0915
Epoch 7/100
86/86 - 2s - loss: 2.6888 - acc: 0.1137 - val_loss: 54.2911 - val_acc: 0.0915
Epoch 8/100
86/86 - 2s - loss: 2.6837 - acc: 0.1210 - val_loss: 107.4386 - val_acc: 0.1536
Epoch 9/100
86/86 - 2s - loss: 2.6829 - acc: 0.1050 - val_loss: 79.4706 - val_acc: 0.1176
Epoch 10/100
86/86 - 2s - loss: 2.6814 - acc: 0.1159 - val_loss: 98.6460 - val_acc: 0.1536
Epoch 11/100
86/86 - 2s - loss: 2.6866 - acc: 0.1181 - val_loss: 64.5547 - val_acc: 0.0915
Epoch 12/100

<keras.callbacks.History at 0x7fe072404dd0>

* With this naive approach the model does not appear to learn: training accuracy stays between roughly 0.10 and 0.12, and the validation loss diverges. One possible cause is the data imbalance, which we can try to remedy in a few different ways.
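
One way to judge whether a model is learning anything is to compare its accuracy with the accuracy of always predicting the most frequent speaker. The sketch below computes that baseline; `majority_baseline` is a hypothetical helper and the label list is made up, not taken from the debate data.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Hypothetical labels for illustration only
demo_labels = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
print(majority_baseline(demo_labels))  # ('A', 0.5)
```

In the notebook this could be called as `majority_baseline(list(train.speaker))`. A validation accuracy close to that number suggests the network has only picked up the class frequencies.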

## Experimenting with weighting the loss for each label by its frequency

In [None]:
import numpy as np
value_counts = train.speaker.value_counts()
# Higher weight for labels with lower occurrence
loss_weights = np.array(value_counts.sum()/value_counts.sort_index())
loss_weights = loss_weights/max(loss_weights)
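
A point worth keeping in mind: Keras documents the `loss_weights` argument of `compile()` as coefficients for the loss contributions of different model outputs, whereas per-class weights are normally passed to `fit()` through `class_weight`, a dictionary keyed by class index. The sketch below builds such a dictionary with the same inverse-frequency idea, indexed in the column order of the one-hot targets. The helper name `make_class_weight` and the demo labels are assumptions made for illustration.

```python
from collections import Counter


def make_class_weight(labels, class_order):
    """Inverse-frequency weights, scaled so the rarest class has weight 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    raw = [total / counts[name] for name in class_order]
    largest = max(raw)
    # Key i refers to the i-th column of the one-hot target matrix
    return {i: w / largest for i, w in enumerate(raw)}


demo_labels = ["A"] * 4 + ["B"] * 2 + ["C"] * 1  # hypothetical labels
demo_order = ["B", "A", "C"]                      # hypothetical column order
print(make_class_weight(demo_labels, demo_order))
# {0: 0.5, 1: 0.25, 2: 1.0}
```

In this notebook the call would look like `make_class_weight(list(train.speaker), speakers)`, with the result passed as `class_weight=...` to `fit()`, assuming the installed Keras version accepts `class_weight` together with one-hot targets. Using `speakers` as the order matters because `Y_train = train[speakers]` fixes the column order, which is not necessarily alphabetical.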

### Recompile the model with the additional loss-weights parameter and retrain it

In [None]:
speaker_model_2 = Sequential([
    Embedding(len(speech_tokenize.word_index) + 1, 64),
    Bidirectional(LSTM(32, return_sequences=True)),
    Bidirectional(LSTM(16)),
    Dense(64),
    BatchNormalization(),
    Activation("relu"),
    Dropout(.25),
    Dense(16),
    BatchNormalization(),
    Activation("relu"),
    Dropout(.25),
    Dense(len(speakers
    ), activation="softmax")
])
speaker_model_2.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["acc"], loss_weights=loss_weights)
speaker_model_2.fit(X_train, Y_train, validation_split=.1, epochs=100, verbose=2)

Epoch 1/100
86/86 - 7s - loss: 1.6437 - acc: 0.0970 - val_loss: 1.6324 - val_acc: 0.0915
Epoch 2/100
86/86 - 2s - loss: 1.6241 - acc: 0.1130 - val_loss: 1.6111 - val_acc: 0.1176
Epoch 3/100
86/86 - 2s - loss: 1.5979 - acc: 0.1188 - val_loss: 1.5863 - val_acc: 0.1176
Epoch 4/100
86/86 - 2s - loss: 1.5805 - acc: 0.1181 - val_loss: 1.9737 - val_acc: 0.0915
Epoch 5/100
86/86 - 2s - loss: 1.5755 - acc: 0.1145 - val_loss: 1.9779 - val_acc: 0.1536
Epoch 6/100
86/86 - 2s - loss: 1.5698 - acc: 0.1221 - val_loss: 5.4193 - val_acc: 0.1536
Epoch 7/100
86/86 - 2s - loss: 1.5692 - acc: 0.1076 - val_loss: 41.5574 - val_acc: 0.0882
Epoch 8/100
86/86 - 2s - loss: 1.5686 - acc: 0.1130 - val_loss: 53.8889 - val_acc: 0.1176
Epoch 9/100
86/86 - 2s - loss: 1.5670 - acc: 0.1145 - val_loss: 56.7676 - val_acc: 0.0915
Epoch 10/100
86/86 - 2s - loss: 1.5655 - acc: 0.1239 - val_loss: 18.9681 - val_acc: 0.0915
Epoch 11/100
86/86 - 2s - loss: 1.5644 - acc: 0.1174 - val_loss: 66.6400 - val_acc: 0.0915
Epoch 12/100
8

<keras.callbacks.History at 0x7fe0a145cc10>

## Using pretrained embeddings

As is clear from the above, weighting the losses did not help much either. Another possible approach is to use pretrained embeddings instead of learning the embeddings from scratch, as we have done so far. The motivation is that these embeddings were trained on a much larger corpus, so they carry much richer relational information between words.

The GloVe vectors come in several sizes, including 50, 100, 200 and 300 dimensions. Here, we will work with the 200-dimensional vectors.

In [None]:
embedding_dimension = 200

In [None]:
# Download GloVe pretrained embeddings
import os
if not os.path.isfile(f'glove.6B.{embedding_dimension}d.txt'):
  ! wget https://nlp.stanford.edu/data/glove.6B.zip
  ! unzip glove.6B.zip
  ! rm glove.6B.zip

In [None]:
# load the whole embedding into memory
embeddings_index = dict()
# each line holds a word followed by its vector components
with open(f'glove.6B.{embedding_dimension}d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 400000 word vectors.


In [None]:
# GloVe is keyed by whole words, so rebuild the cleaned text without stemming
# before fitting the tokenizer that will encode it
df["speech_cleaned"] = df["speech"].apply(lambda x: " ".join([re.sub(r'[^\w\d]','', item.lower()) for item in x.split() if re.sub(r'[^\w\d]','', item.lower()) not in stop_words]))
train=df.sample(frac=0.8,random_state=101) #random state is a seed value
test=df.drop(train.index).sample(frac=1.0,random_state=45)

In [None]:
# prepare tokenizer on the unstemmed text
embed_tokenize = Tokenizer()
embed_tokenize.fit_on_texts(df["speech_cleaned"])
vocab_size = len(embed_tokenize.word_index) + 1

# create a weight matrix for words in dataset
embedding_matrix = np.zeros((vocab_size, embedding_dimension))
for word, i in embed_tokenize.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words missing from GloVe keep an all-zero row
        embedding_matrix[i] = embedding_vector

# set max sequence length
max_len = 150
# integer encode the speech
X_train = pad_sequences(embed_tokenize.texts_to_sequences(train["speech_cleaned"]), maxlen=max_len, padding="post")
Y_train = train[speakers]
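
Because words without a GloVe entry keep an all-zero row in `embedding_matrix`, it can be useful to check how much of the vocabulary is actually covered. The following is a minimal sketch with a hypothetical helper (`embedding_coverage`) and made-up dictionaries. The demo deliberately includes stem-like forms such as "economi", which are not dictionary words and so are less likely to have a pretrained vector.

```python
def embedding_coverage(word_index, embeddings_index):
    """Return the share of vocabulary words that have a pretrained vector."""
    if not word_index:
        return 0.0, []
    missing = [w for w in word_index if w not in embeddings_index]
    covered = 1 - len(missing) / len(word_index)
    return covered, missing


# Hypothetical vocabulary and embedding table for illustration only
demo_vocab = {"economy": 1, "economi": 2, "healthcare": 3, "polici": 4}
demo_glove = {"economy": [0.1], "healthcare": [0.2], "policy": [0.3]}
coverage, missing = embedding_coverage(demo_vocab, demo_glove)
print(coverage, missing)  # 0.5 ['economi', 'polici']
```

With the notebook's objects the call would be `embedding_coverage(embed_tokenize.word_index, embeddings_index)`. Low coverage would suggest that the tokenizer vocabulary and the text fed to GloVe are not cleaned in the same way.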


In [None]:
# build a model with GloVe embeddings
glove_model = Sequential([
    Embedding(vocab_size, embedding_dimension, weights=[embedding_matrix], input_length=max_len, trainable=False),
    Bidirectional(LSTM(32, return_sequences=True)),
    Bidirectional(LSTM(16)),
    Dense(64),
    BatchNormalization(),
    Activation("relu"),
    Dropout(.25),
    Dense(16),
    BatchNormalization(),
    Activation("relu"),
    Dropout(.25),
    Dense(len(speakers
    ), activation="softmax")
])

glove_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["acc"], loss_weights=loss_weights)
glove_model.fit(X_train, Y_train, validation_split=.1, epochs=100, verbose=2)

Epoch 1/100
86/86 - 7s - loss: 1.7418 - acc: 0.0629 - val_loss: 1.6144 - val_acc: 0.1046
Epoch 2/100
86/86 - 2s - loss: 1.5694 - acc: 0.1388 - val_loss: 1.5581 - val_acc: 0.1242
Epoch 3/100
86/86 - 2s - loss: 1.4853 - acc: 0.1868 - val_loss: 1.5011 - val_acc: 0.1438
Epoch 4/100
86/86 - 2s - loss: 1.4127 - acc: 0.2053 - val_loss: 1.4569 - val_acc: 0.1601
Epoch 5/100
86/86 - 2s - loss: 1.3778 - acc: 0.2242 - val_loss: 1.3941 - val_acc: 0.2092
Epoch 6/100
86/86 - 2s - loss: 1.3254 - acc: 0.2504 - val_loss: 1.4003 - val_acc: 0.1895
Epoch 7/100
86/86 - 2s - loss: 1.2926 - acc: 0.2714 - val_loss: 1.3594 - val_acc: 0.1895
Epoch 8/100
86/86 - 2s - loss: 1.2341 - acc: 0.3070 - val_loss: 1.3193 - val_acc: 0.2516
Epoch 9/100
86/86 - 2s - loss: 1.2006 - acc: 0.3161 - val_loss: 1.2954 - val_acc: 0.2712
Epoch 10/100
86/86 - 2s - loss: 1.1620 - acc: 0.3423 - val_loss: 1.2770 - val_acc: 0.2451
Epoch 11/100
86/86 - 2s - loss: 1.1195 - acc: 0.3757 - val_loss: 1.2964 - val_acc: 0.2418
Epoch 12/100
86/86 

<keras.callbacks.History at 0x7fe08d3b5d50>

* It is immediately clear that using the pretrained embeddings helps considerably: validation accuracy climbs from about 0.10 to roughly 0.25 within the first eleven epochs.
* When the model is meant for a very specific task, we can also fine-tune these embeddings so that they adapt better to that task.

In [None]:
# build a model with GloVe embeddings (trainable = True)
glove_model_2 = Sequential([
    Embedding(vocab_size, embedding_dimension, weights=[embedding_matrix], input_length=max_len, trainable=True),
    Bidirectional(LSTM(32, return_sequences=True)),
    Bidirectional(LSTM(16)),
    Dense(64),
    BatchNormalization(),
    Activation("relu"),
    Dropout(.25),
    Dense(16),
    BatchNormalization(),
    Activation("relu"),
    Dropout(.25),
    Dense(len(speakers
    ), activation="softmax")
])

glove_model_2.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["acc"], loss_weights=loss_weights)
glove_model_2.fit(X_train, Y_train, validation_split=.1, epochs=100, verbose=2)

Epoch 1/100
86/86 - 8s - loss: 1.7149 - acc: 0.0974 - val_loss: 1.6177 - val_acc: 0.1536
Epoch 2/100
86/86 - 2s - loss: 1.5209 - acc: 0.1657 - val_loss: 1.5746 - val_acc: 0.1503
Epoch 3/100
86/86 - 2s - loss: 1.4038 - acc: 0.2340 - val_loss: 1.5089 - val_acc: 0.1863
Epoch 4/100
86/86 - 2s - loss: 1.3145 - acc: 0.2856 - val_loss: 1.4385 - val_acc: 0.2582
Epoch 5/100
86/86 - 2s - loss: 1.2162 - acc: 0.3528 - val_loss: 1.3542 - val_acc: 0.2451
Epoch 6/100
86/86 - 2s - loss: 1.1161 - acc: 0.4095 - val_loss: 1.3327 - val_acc: 0.2549
Epoch 7/100
86/86 - 2s - loss: 1.0516 - acc: 0.4379 - val_loss: 1.2890 - val_acc: 0.3301
Epoch 8/100
86/86 - 2s - loss: 0.9779 - acc: 0.4713 - val_loss: 1.2138 - val_acc: 0.3562
Epoch 9/100
86/86 - 2s - loss: 0.8838 - acc: 0.5265 - val_loss: 1.2312 - val_acc: 0.3170
Epoch 10/100
86/86 - 2s - loss: 0.8241 - acc: 0.5552 - val_loss: 1.1923 - val_acc: 0.3399
Epoch 11/100
86/86 - 2s - loss: 0.7581 - acc: 0.6032 - val_loss: 1.2557 - val_acc: 0.3791
Epoch 12/100
86/86 

<keras.callbacks.History at 0x7fe08b828990>

## Deal with overfitting

* As expected, retraining/fine-tuning the embeddings results in a further performance gain.
* We now have a clear case of overfitting (train acc. >> val. acc.). Some strategies that can help are
  * increasing the model capacity
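
To put a number on the gap between training and validation accuracy, the Keras log lines can be parsed directly. The sketch below does this for epochs 9 to 11 of the `glove_model_2` run, with the three lines copied from the output above. The helper names `parse_epoch` and `accuracy_gaps` are hypothetical.

```python
import re

# Epochs 9-11 of the glove_model_2 run, copied from the training output
LOG_LINES = [
    "86/86 - 2s - loss: 0.8838 - acc: 0.5265 - val_loss: 1.2312 - val_acc: 0.3170",
    "86/86 - 2s - loss: 0.8241 - acc: 0.5552 - val_loss: 1.1923 - val_acc: 0.3399",
    "86/86 - 2s - loss: 0.7581 - acc: 0.6032 - val_loss: 1.2557 - val_acc: 0.3791",
]


def parse_epoch(line):
    """Turn one 'name: value' log line into a dict of floats."""
    return {k: float(v) for k, v in re.findall(r"(\w+): ([\d.]+)", line)}


def accuracy_gaps(lines):
    """Training accuracy minus validation accuracy for each epoch."""
    return [round(m["acc"] - m["val_acc"], 4) for m in map(parse_epoch, lines)]


print(accuracy_gaps(LOG_LINES))
```

In these three epochs the gap stays above 0.2 and widens from one epoch to the next, which is the pattern the overfitting remark refers to.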

### Increasing the model capacity

In [None]:
# build a model with GloVe embeddings (trainable = True)
glove_model_3 = Sequential([
    Embedding(vocab_size, embedding_dimension, weights=[embedding_matrix], input_length=max_len, trainable=True),
    Bidirectional(LSTM(128, return_sequences=True)),
    Bidirectional(LSTM(64)),
    Dense(256),
    BatchNormalization(),
    Activation("relu"),
    Dropout(.25),
    Dense(64),
    BatchNormalization(),
    Activation("relu"),
    Dropout(.25),
    Dense(len(speakers
    ), activation="softmax")
])

glove_model_3.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["acc"], loss_weights=loss_weights)
glove_model_3.fit(X_train, Y_train, validation_split=.1, epochs=100, verbose=2)

Epoch 1/100
86/86 - 8s - loss: 1.6550 - acc: 0.1243 - val_loss: 1.5818 - val_acc: 0.1601
Epoch 2/100
86/86 - 2s - loss: 1.4136 - acc: 0.2406 - val_loss: 1.5329 - val_acc: 0.2059
Epoch 3/100
86/86 - 2s - loss: 1.2767 - acc: 0.3172 - val_loss: 1.4374 - val_acc: 0.2647
Epoch 4/100
86/86 - 2s - loss: 1.1475 - acc: 0.3819 - val_loss: 1.3503 - val_acc: 0.3039
Epoch 5/100
86/86 - 2s - loss: 1.0310 - acc: 0.4433 - val_loss: 1.4671 - val_acc: 0.2810
Epoch 6/100
86/86 - 2s - loss: 0.9271 - acc: 0.4815 - val_loss: 1.4655 - val_acc: 0.2582
Epoch 7/100
86/86 - 2s - loss: 0.8318 - acc: 0.5461 - val_loss: 1.2361 - val_acc: 0.3039
Epoch 8/100
86/86 - 2s - loss: 0.7406 - acc: 0.6047 - val_loss: 1.1746 - val_acc: 0.3660
Epoch 9/100
86/86 - 2s - loss: 0.6672 - acc: 0.6319 - val_loss: 1.2056 - val_acc: 0.3595
Epoch 10/100
86/86 - 2s - loss: 0.6005 - acc: 0.6860 - val_loss: 1.3624 - val_acc: 0.3464
Epoch 11/100
86/86 - 2s - loss: 0.5124 - acc: 0.7420 - val_loss: 1.2036 - val_acc: 0.3824
Epoch 12/100
86/86 

<keras.callbacks.History at 0x7fe088c90e10>

A slight improvement in model performance is observed.

### Label smoothing
Since the model's accuracy on the training set is now relatively high, the model may be becoming overconfident. One strategy for such situations is label smoothing: instead of pushing the model to predict '1' as the truth value, we have it predict a slightly lower value such as 0.9.
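
As a small worked example, the snippet below applies the smoothing rule that Keras documents for `label_smoothing`: each target becomes `y * (1 - s) + s / K` for `K` classes. The helper name `smooth_labels` and the four-class target are assumptions made for illustration.

```python
def smooth_labels(one_hot, label_smoothing=0.1):
    """Soften a one-hot target: y * (1 - s) + s / K for K classes."""
    k = len(one_hot)
    return [y * (1.0 - label_smoothing) + label_smoothing / k for y in one_hot]


# Hypothetical four-class target whose true class is index 1
smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0], label_smoothing=0.1)
print(smoothed)       # roughly [0.025, 0.925, 0.025, 0.025]
print(sum(smoothed))  # still sums to (approximately) 1
```

With `label_smoothing=0.1` the true class therefore ends up slightly above 0.9, because the smoothing mass `0.1` is spread over all `K` classes, including the true one.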

In [None]:
import tensorflow as tf
# build a model with GloVe embeddings (trainable = True)
glove_model_4 = Sequential([
    Embedding(vocab_size, embedding_dimension, weights=[embedding_matrix], input_length=max_len, trainable=True),
    Bidirectional(LSTM(128, return_sequences=True)),
    Bidirectional(LSTM(64)),
    Dense(256),
    BatchNormalization(),
    Activation("relu"),
    Dropout(.25),
    Dense(64),
    BatchNormalization(),
    Activation("relu"),
    Dropout(.25),
    Dense(len(speakers
    ), activation="softmax")
])
cce = tf.keras.losses.CategoricalCrossentropy(from_logits=False, label_smoothing=0.1)
optm = tf.keras.optimizers.Adam()
glove_model_4.compile(optimizer=optm, loss=cce, metrics=["acc"], loss_weights=loss_weights)
glove_model_4.fit(X_train, Y_train, validation_split=.1, epochs=100, verbose=2)


Epoch 1/100
86/86 - 8s - loss: 1.6279 - acc: 0.1461 - val_loss: 1.5796 - val_acc: 0.1209
Epoch 2/100
86/86 - 3s - loss: 1.4280 - acc: 0.2315 - val_loss: 1.5189 - val_acc: 0.2026
Epoch 3/100
86/86 - 3s - loss: 1.3154 - acc: 0.3060 - val_loss: 1.5494 - val_acc: 0.1471
Epoch 4/100
86/86 - 3s - loss: 1.1862 - acc: 0.4095 - val_loss: 1.5064 - val_acc: 0.1732
Epoch 5/100
86/86 - 3s - loss: 1.0571 - acc: 0.5011 - val_loss: 1.3562 - val_acc: 0.2614
Epoch 6/100
86/86 - 3s - loss: 0.9313 - acc: 0.6094 - val_loss: 1.2623 - val_acc: 0.3660
Epoch 7/100
86/86 - 3s - loss: 0.8489 - acc: 0.6679 - val_loss: 1.2600 - val_acc: 0.3889
Epoch 8/100
86/86 - 2s - loss: 0.7565 - acc: 0.7402 - val_loss: 1.2247 - val_acc: 0.4150
Epoch 9/100
86/86 - 3s - loss: 0.6942 - acc: 0.7976 - val_loss: 1.2651 - val_acc: 0.3987
Epoch 10/100
86/86 - 3s - loss: 0.6601 - acc: 0.8092 - val_loss: 1.2468 - val_acc: 0.4118
Epoch 11/100
86/86 - 3s - loss: 0.6295 - acc: 0.8376 - val_loss: 1.2710 - val_acc: 0.3856
Epoch 12/100
86/86 

<keras.callbacks.History at 0x7fe075f64cd0>