# To be, or not to be

## Classy Shakespeare plays and players

#### Author: Ruturaj Kiran Vaidya

References:

* One hot encoding: https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
* One hot encoding in action: http://www.insightsbot.com/blog/McTKK/python-one-hot-encoding-with-scikit-learn
* Random forest classifier: https://www.datacamp.com/community/tutorials/random-forests-classifier-python


In [1]:
# imports
import pandas as pd
import numpy as np

# One hot encoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# classification models
# Train test split function
from sklearn.model_selection import train_test_split
# Decision tree model
from sklearn.tree import DecisionTreeClassifier
# Random forest model
from sklearn.ensemble import RandomForestClassifier

# For accuracy
from sklearn import metrics

In [2]:
# Reading dataset
shakespeare = pd.read_csv("../data/external/Shakespeare_data.csv")

In [3]:
shakespeare.head()

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"


In [4]:
shakespeare.tail()

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
111391,111392,A Winters Tale,38.0,5.3.180,LEONTES,"Lead us from hence, where we may leisurely"
111392,111393,A Winters Tale,38.0,5.3.181,LEONTES,Each one demand an answer to his part
111393,111394,A Winters Tale,38.0,5.3.182,LEONTES,Perform'd in this wide gap of time since first
111394,111395,A Winters Tale,38.0,5.3.183,LEONTES,We were dissever'd: hastily lead away.
111395,111396,A Winters Tale,38.0,,LEONTES,Exeunt


In [5]:
# Let's see how many features and total records are in the dataset

print(shakespeare.shape)

(111396, 6)


In [6]:
# Notice that the dataset has NA values. Dropping the rows that contain NAs.
shakespeare = shakespeare.dropna()
# Also dropping Dataline as every value is unique and hence it will not help us in our analysis
del shakespeare["Dataline"]

In [7]:
shakespeare = shakespeare.reset_index(drop=True)
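Before dropping rows, it can help to see which columns actually contain the missing values. The following is a minimal sketch on a made-up three-row frame (the `demo` data and the variable names are assumptions for illustration, not part of the dataset); it counts NAs per column and then applies the same drop-and-reindex steps used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics stage directions without a player or line id
demo = pd.DataFrame({
    "Play": ["Henry IV", "Henry IV", "A Winters Tale"],
    "Player": [np.nan, "KING HENRY IV", "LEONTES"],
    "ActSceneLine": [np.nan, "1.1.1", np.nan],
})

# Number of missing values in each column
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Drop every row with at least one NA, then rebuild a contiguous 0..n-1 index
cleaned = demo.dropna().reset_index(drop=True)
print(cleaned)
```

Resetting the index matters later in the notebook: `pd.concat(..., axis=1)` aligns rows by index, so gaps left behind by `dropna()` would otherwise misalign the one hot columns.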

In [8]:
shakespeare.head()

Unnamed: 0,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
1,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
2,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils
3,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.
4,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil


In [9]:
print(shakespeare.shape)

(105152, 5)


In [10]:
# Checking the number of unique players and plays
print(f"Number of players: {len(shakespeare['Player'].unique())}\nNumber of plays: {len(shakespeare['Play'].unique())}")

Number of players: 934
Number of plays: 36


-------------
We want to determine the player from the other "features".
To apply learning algorithms, we first have to convert the categorical data into numerical data.

Two techniques can be used:
* Integer encoding
* <b>One hot encoding</b>

We will use <b>one hot encoding</b>.
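
To make the difference between the two techniques concrete, here is a small sketch on a made-up list of play names (the `plays` series is an assumption for illustration). Integer encoding assigns one integer per category, which implies an ordering that does not really exist between plays; one hot encoding creates one 0/1 column per category instead.

```python
import pandas as pd

# Hypothetical categorical column
plays = pd.Series(["Hamlet", "Macbeth", "Hamlet", "Othello"], name="Play")

# Integer encoding: one integer code per category (categories sorted alphabetically)
integer_codes = plays.astype("category").cat.codes
print(integer_codes.tolist())

# One hot encoding: one 0/1 indicator column per category
one_hot = pd.get_dummies(plays, prefix="Play", dtype=int)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, so no play is treated as "larger" than another.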

-------------

In [11]:
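# As seen from the table, there is already a numeric column, PlayerLinenumber, which this notebook uses as the numeric label for the player.
# Note: this assumes PlayerLinenumber identifies the player; it may not be unique per player.
# We will use one hot encoding for the Play column.
# First, we can split ActSceneLine to derive more features from it: Act, Scene and Line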

lam = lambda x: pd.Series([x['ActSceneLine'].split(".")[0], x['ActSceneLine'].split(".")[1], x['ActSceneLine'].split(".")[-1]])
shakespeare[["Act", "Scene", "Line"]] = shakespeare.apply(lam, axis=1)
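
The row-wise `apply` above calls a Python function once per row, which can be slow on roughly 105,000 rows. A possible vectorized alternative, sketched here on a made-up frame (the `demo` data is an assumption), uses `Series.str.split` with `expand=True`. Unlike the cell above, this sketch also casts the three parts to integers; it assumes every value has exactly three dot-separated parts.

```python
import pandas as pd

# Hypothetical sample of ActSceneLine values
demo = pd.DataFrame({"ActSceneLine": ["1.1.1", "1.1.2", "5.3.183"]})

# Split on "." into three columns in one vectorized call
parts = demo["ActSceneLine"].str.split(".", expand=True)
parts.columns = ["Act", "Scene", "Line"]

# Cast to int so the new features are numeric rather than strings
demo = pd.concat([demo, parts.astype(int)], axis=1)
print(demo)
```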

In [12]:
shakespeare.tail()

Unnamed: 0,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine,Act,Scene,Line
105147,A Winters Tale,38.0,5.3.179,LEONTES,"Is troth-plight to your daughter. Good Paulina,",5,3,179
105148,A Winters Tale,38.0,5.3.180,LEONTES,"Lead us from hence, where we may leisurely",5,3,180
105149,A Winters Tale,38.0,5.3.181,LEONTES,Each one demand an answer to his part,5,3,181
105150,A Winters Tale,38.0,5.3.182,LEONTES,Perform'd in this wide gap of time since first,5,3,182
105151,A Winters Tale,38.0,5.3.183,LEONTES,We were dissever'd: hastily lead away.,5,3,183


In [13]:
# Dropping ActSceneLine as it is no longer needed
del shakespeare["ActSceneLine"]

In [14]:
shakespeare.tail()

Unnamed: 0,Play,PlayerLinenumber,Player,PlayerLine,Act,Scene,Line
105147,A Winters Tale,38.0,LEONTES,"Is troth-plight to your daughter. Good Paulina,",5,3,179
105148,A Winters Tale,38.0,LEONTES,"Lead us from hence, where we may leisurely",5,3,180
105149,A Winters Tale,38.0,LEONTES,Each one demand an answer to his part,5,3,181
105150,A Winters Tale,38.0,LEONTES,Perform'd in this wide gap of time since first,5,3,182
105151,A Winters Tale,38.0,LEONTES,We were dissever'd: hastily lead away.,5,3,183


In [15]:
# Similarly, PlayerLine can also be dropped, as almost every value is unique
print(len(shakespeare['PlayerLine'].unique()))
del shakespeare["PlayerLine"]

103715


In [16]:
# Now, using one hot encoding on Play data
# First converting Play values into numerical representation
le_play = LabelEncoder()
shakespeare["Play_Encoded"] = le_play.fit_transform(shakespeare.Play)

In [17]:
play_ohe = OneHotEncoder()
X = play_ohe.fit_transform(shakespeare.Play_Encoded.values.reshape(-1,1)).toarray()
dfOneHot = pd.DataFrame(X, columns = ["Play_"+str(int(i)) for i in range(X.shape[1])])
shakespeare = pd.concat([shakespeare, dfOneHot], axis=1)

Note: OneHotEncoder can encode string categories directly, so the LabelEncoder step used before it to convert the categories to integers is not strictly required.


In [18]:
shakespeare.tail()

Unnamed: 0,Play,PlayerLinenumber,Player,Act,Scene,Line,Play_Encoded,Play_0,Play_1,Play_2,...,Play_26,Play_27,Play_28,Play_29,Play_30,Play_31,Play_32,Play_33,Play_34,Play_35
105147,A Winters Tale,38.0,LEONTES,5,3,179,2,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
105148,A Winters Tale,38.0,LEONTES,5,3,180,2,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
105149,A Winters Tale,38.0,LEONTES,5,3,181,2,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
105150,A Winters Tale,38.0,LEONTES,5,3,182,2,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
105151,A Winters Tale,38.0,LEONTES,5,3,183,2,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
# Play, Play_Encoded, Player can also be dropped
shakespeare = shakespeare.drop(["Play", "Player", "Play_Encoded"], axis=1)

In [20]:
shakespeare.tail()

Unnamed: 0,PlayerLinenumber,Act,Scene,Line,Play_0,Play_1,Play_2,Play_3,Play_4,Play_5,...,Play_26,Play_27,Play_28,Play_29,Play_30,Play_31,Play_32,Play_33,Play_34,Play_35
105147,38.0,5,3,179,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
105148,38.0,5,3,180,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
105149,38.0,5,3,181,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
105150,38.0,5,3,182,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
105151,38.0,5,3,183,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Classification Models
* Decision Tree
* Random Forest

In [25]:
# input: every column except the first one (PlayerLinenumber)
features = shakespeare[shakespeare.columns[1:]].to_numpy()
# output: PlayerLinenumber, used as the label
labels = shakespeare["PlayerLinenumber"].to_numpy()
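
The cell above uses PlayerLinenumber as the label. If one instead wanted the player name itself as the target, one option would be to integer encode the Player column before it is dropped. The following is only a sketch of that alternative on a made-up frame (`demo`, `le_player` and the other names are assumptions); it is not what this notebook does, and the accuracies reported below do not apply to it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample with a Player column still present
demo = pd.DataFrame({
    "Player": ["KING HENRY IV", "LEONTES", "KING HENRY IV", "PAULINA"],
    "Act": [1, 5, 1, 5],
    "Line": [1, 180, 2, 120],
})

# Integer encoding is appropriate for the target: classifiers treat labels as class ids
le_player = LabelEncoder()
labels_demo = le_player.fit_transform(demo["Player"])

# Everything except the target becomes the feature matrix
features_demo = demo.drop(columns=["Player"]).to_numpy()

# Predictions can be mapped back to names with inverse_transform
decoded = le_player.inverse_transform(labels_demo)
print(labels_demo, decoded)
```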

In [26]:
# Let's select the test size as 0.2 (i.e. 20%)

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size = 0.2)
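
Because the split above is random, the accuracies further down will vary slightly from run to run. A common way to make a split repeatable is to pass `random_state`. The sketch below demonstrates this on made-up arrays (the data and the seed value 42 are assumptions, not settings used in this notebook).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
features_demo = np.arange(20).reshape(10, 2)
labels_demo = np.arange(10)

# Two splits with the same seed produce identical partitions
split_a = train_test_split(features_demo, labels_demo, test_size=0.2, random_state=42)
split_b = train_test_split(features_demo, labels_demo, test_size=0.2, random_state=42)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

X_train_demo, X_test_demo, y_train_demo, y_test_demo = split_a
print(same_split, X_train_demo.shape, X_test_demo.shape)
```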

In [27]:
# Decision Tree

dtc = DecisionTreeClassifier()
dtc.fit(X_train,y_train)
dtc_y_pred=dtc.predict(X_test)
print(f"Decision Tree model accuracy: {metrics.accuracy_score(y_test, dtc_y_pred)}")

Decision Tree model accuracy: 0.6782844372592839
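
For reference, `metrics.accuracy_score` is the fraction of test samples whose predicted label equals the true label. The sketch below checks this by hand on made-up label arrays (the values are assumptions for illustration only).

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels
y_true_demo = np.array([1, 2, 2, 3, 3])
y_pred_demo = np.array([1, 2, 3, 3, 1])

# Manual accuracy: share of positions where prediction and truth agree (3 of 5 here)
manual_accuracy = (y_true_demo == y_pred_demo).mean()
library_accuracy = metrics.accuracy_score(y_true_demo, y_pred_demo)
print(manual_accuracy, library_accuracy)
```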


In [28]:
# Random Forest
# n_estimators is the number of trees used in the forest
# Selecting number of trees as 40

rfc = RandomForestClassifier(n_estimators = 40)
rfc.fit(X_train,y_train)
rfc_y_pred = rfc.predict(X_test)
print(f"Random Forest model accuracy: {metrics.accuracy_score(y_test, rfc_y_pred)}")

Random Forest model accuracy: 0.7037230754600352


### Discussion:

The accuracy of the Random Forest (0.7037230754600352) is higher than that of the Decision Tree (0.6782844372592839) with the above configuration. Based on these tests, my assumption is that the random forest model is the better of the two (this has also been discussed in the class slides). I also observed that the decision tree model ran faster than the random forest model (considering the above configuration).
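
The speed observation could be checked by timing each `fit` call. The sketch below does so on a small synthetic dataset from `make_classification` (the dataset, its size and the seeds are assumptions, so the numbers it prints say nothing about the Shakespeare data); it only shows one way to record accuracy and training time side by side.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for the real feature matrix
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models = [
    ("decision tree", DecisionTreeClassifier(random_state=0)),
    ("random forest", RandomForestClassifier(n_estimators=40, random_state=0)),
]

results = {}
for name, model in models:
    start = time.perf_counter()
    model.fit(X_tr, y_tr)                      # time only the training step
    fit_seconds = time.perf_counter() - start
    accuracy = model.score(X_te, y_te)         # mean accuracy on the held-out set
    results[name] = (accuracy, fit_seconds)
    print(f"{name}: accuracy={accuracy:.3f}, fit time={fit_seconds:.4f}s")
```

A single train/test split gives a noisy estimate, so repeating the comparison over several seeds would make any conclusion about the two models more robust.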