# To be, or not to be

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
from tensorflow import keras

## Preparing the data

In [2]:
df = pd.read_csv("../data/Shakespeare_data.csv")
df

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
...,...,...,...,...,...,...
111391,111392,A Winters Tale,38.0,5.3.180,LEONTES,"Lead us from hence, where we may leisurely"
111392,111393,A Winters Tale,38.0,5.3.181,LEONTES,Each one demand an answer to his part
111393,111394,A Winters Tale,38.0,5.3.182,LEONTES,Perform'd in this wide gap of time since first
111394,111395,A Winters Tale,38.0,5.3.183,LEONTES,We were dissever'd: hastily lead away.


Let's begin with some general cleanup, starting by pruning off the redudant dataline column.

In [3]:
del df['Dataline']

In addition, we see a number of cells without data, represented by `NaN`. Clearly the lines without a player aren't useful, since that is the value we are trying to predict. The rows players but missing other data may be useful, but don't appear to make up a large portion of the dataset, so let's drop them as well.

In [4]:
df = df.dropna()
df = df.reset_index(drop=True)
df

Unnamed: 0,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
1,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
2,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils
3,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.
4,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil
...,...,...,...,...,...
105147,A Winters Tale,38.0,5.3.179,LEONTES,"Is troth-plight to your daughter. Good Paulina,"
105148,A Winters Tale,38.0,5.3.180,LEONTES,"Lead us from hence, where we may leisurely"
105149,A Winters Tale,38.0,5.3.181,LEONTES,Each one demand an answer to his part
105150,A Winters Tale,38.0,5.3.182,LEONTES,Perform'd in this wide gap of time since first


It looks like the "Player line number" column consists of floating point numbers rather than integers. This is because of the missing values we used to have in the column, which Pandas denoted with `NaN`. `NaN` has type `float`, so all the values were cast to `float` as well.

In [5]:
# Check that all values are whole numbers
all(df['PlayerLinenumber'].apply(lambda x : x % 1 == 0))

True

In [6]:
df = df.copy()
df['PlayerLinenumber'] = df['PlayerLinenumber'].apply(int)
df

Unnamed: 0,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,Henry IV,1,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
1,Henry IV,1,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
2,Henry IV,1,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils
3,Henry IV,1,1.1.4,KING HENRY IV,To be commenced in strands afar remote.
4,Henry IV,1,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil
...,...,...,...,...,...
105147,A Winters Tale,38,5.3.179,LEONTES,"Is troth-plight to your daughter. Good Paulina,"
105148,A Winters Tale,38,5.3.180,LEONTES,"Lead us from hence, where we may leisurely"
105149,A Winters Tale,38,5.3.181,LEONTES,Each one demand an answer to his part
105150,A Winters Tale,38,5.3.182,LEONTES,Perform'd in this wide gap of time since first


Now, I want to break up the column "ActSceneLine" into three. The current string format is intractable for analysis.

In [7]:
parseAsl = lambda asl : [int(s) for s in asl.split('.')]

parsedAsl = pd.DataFrame(df['ActSceneLine'].apply(parseAsl).tolist(), columns=['A','S','L'])
df[['A','S','L']] = parsedAsl[['A','S','L']]
del df['ActSceneLine']
df

Unnamed: 0,Play,PlayerLinenumber,Player,PlayerLine,A,S,L
0,Henry IV,1,KING HENRY IV,"So shaken as we are, so wan with care,",1,1,1
1,Henry IV,1,KING HENRY IV,"Find we a time for frighted peace to pant,",1,1,2
2,Henry IV,1,KING HENRY IV,And breathe short-winded accents of new broils,1,1,3
3,Henry IV,1,KING HENRY IV,To be commenced in strands afar remote.,1,1,4
4,Henry IV,1,KING HENRY IV,No more the thirsty entrance of this soil,1,1,5
...,...,...,...,...,...,...,...
105147,A Winters Tale,38,LEONTES,"Is troth-plight to your daughter. Good Paulina,",5,3,179
105148,A Winters Tale,38,LEONTES,"Lead us from hence, where we may leisurely",5,3,180
105149,A Winters Tale,38,LEONTES,Each one demand an answer to his part,5,3,181
105150,A Winters Tale,38,LEONTES,Perform'd in this wide gap of time since first,5,3,182


Now our data looks pretty, but the "Play" and "Player" columns still aren't model-friendly. Let's break them apart into a one-hot encoding schema. At the same time, let's break our labels, the "Player" column, off from the rest of the dataframe so that we can distinguish it for the model. But before we do that, we need to shuffle our data.

In [8]:
shuffled = df.sample(frac=1)

oneHot = shuffled.join(pd.get_dummies(shuffled['Play']))
labels = pd.get_dummies(shuffled['Player'])
del oneHot['Play']
del oneHot['Player']

It is not immediately clear how the player line might be useful to the model, which expects numerical input. Let's start without it, try a to train a neural network, and then reassess after we see the success rate of the model.

In [9]:
noText = oneHot.copy()
del noText['PlayerLine']

Now we prepare our data for the neural network, and split the training data from the testing data, in a 9:1 split.

In [10]:
lenTotal = len(noText)
lenTrain = int(.9*lenTotal)
lenTest  = lenTotal - lenTrain

trainingInput  = noText.head(lenTrain).astype('float').to_numpy()
trainingLabels = labels.head(lenTrain).astype('float').to_numpy()

testingInput  = noText.head(lenTest).astype('float').to_numpy()
testingLabels = labels.head(lenTest).astype('float').to_numpy()

## Modeling

Now I create neural networks with Keras. The initial parameters are somewhat arbitrary. I tweak them in an exploratory manner to see what most positively affects the model.

The training takes some time, so I generate the models in another file (`src/modelGen.py`), save the weights to a file, and then load these weights below. The code to train the neural networks is commented out.

In [13]:
lenInput  = len(noText.columns)
lenLabels = len(labels.columns)
lenHidden = int((lenInput+lenLabels)/2)

# Model 1: 1 hidden layer, many epochs
model = tf.keras.Sequential([
    keras.layers.Dense(lenHidden, activation='relu'),
    keras.layers.Dense(lenLabels, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='mse',
              metrics=['accuracy'])
# callback = keras.callbacks.ModelCheckpoint(filepath='../models/noText.ckpt',
#                                            save_weights_only=True,
#                                            verbose=1)
# model.fit(trainingInput, trainingLabels, epochs=100, callbacks=[callback], verbose=1)
model.load_weights('../models/noText.ckpt')

# Model 2: 3 hidden layers, fewer epochs
model2 = tf.keras.Sequential([
    keras.layers.Dense(lenHidden, activation='relu'),
    keras.layers.Dense(lenHidden, activation='relu'),
    keras.layers.Dense(lenHidden, activation='relu'),
    keras.layers.Dense(lenLabels, activation='softmax'),
])
model2.compile(optimizer='adam',
              loss='mse',
              metrics=['accuracy'])
# callback = keras.callbacks.ModelCheckpoint(filepath='../models/noText_deep.ckpt',
#                                            save_weights_only=True,
#                                            verbose=1)
# model2.fit(trainingInput, trainingLabels, epochs=50, callbacks=[callback], verbose=1)
model2.load_weights('../models/noText_deep.ckpt')

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f04b6405668>

In [14]:
model.evaluate(testingInput, testingLabels)



[0.0006758045590156388, 0.533663]

In [15]:
model2.evaluate(testingInput, testingLabels)



[0.0005540509235346601, 0.6269494]

## Conclusion

While experimenting with the neural networks, I found an appropriate width for the hidden layers to be somewhere between the size of the input and the size of the output. I started with just a single hidden layer and 100 epochs. I found it beneficial to deepen the network, and to make up for the increased training time by decreasing the epochs, since the neural network appeared to plateau around 50.

The first model achieved 53% accuracy over the test data (a random 10% of the initial data, disjoint from the training data), while the second manage nearly 63%.