In [18]:
# Workaround: add the site-packages directory to the path so sklearn can be imported
import sys
sys.path.append('/usr/local/lib/python3.8/site-packages')
# -------------------------
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
from scipy.sparse import hstack

In [2]:
df = pd.read_csv('data/Shakespeare_data.csv')

## Cleaning Data

In [3]:
df.head(10)

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
5,6,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils
6,7,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.
7,8,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil
8,9,Henry IV,1.0,1.1.6,KING HENRY IV,Shall daub her lips with her own children's bl...
9,10,Henry IV,1.0,1.1.7,KING HENRY IV,"Nor more shall trenching war channel her fields,"


Some things to note:
- There are NaN values present for PlayerLinenumber, ActSceneLine and Player. In the rows shown above, these are act headings, scene headings and stage directions rather than spoken lines.
- ActSceneLine can be split into Act, Scene and Line numbers.

Because we intend to train our models to predict which player speaks a given line, we should remove the rows that have no player and split ActSceneLine into separate columns so the models can use each part on its own.

### Steps to Clean
1. Drop rows with a missing Player or ActSceneLine value, as they will not benefit our models.
2. Split ActSceneLine into three separate columns (Act, Scene and Line), then delete the original ActSceneLine column.

In [4]:
df.drop(columns = ["Dataline"], inplace = True)

In [5]:
df = df.query('not Player.isnull()')

In [6]:
df = df.query('not ActSceneLine.isnull()')

To give the models more to work with, we can split ActSceneLine into three separate columns: Act, Scene and Line. This also makes the data easier to query. Once the split is done, the original ActSceneLine column is no longer needed and can be dropped.

In [7]:
temp = df["ActSceneLine"].str.split(".", n = 2, expand = True)
df["Act"]= temp[0] 
df["Scene"]= temp[1] 
df["Line"]= temp[2] 
df.drop(columns =["ActSceneLine"], inplace = True)

In [8]:
df.head(10)

Unnamed: 0,Play,PlayerLinenumber,Player,PlayerLine,Act,Scene,Line
3,Henry IV,1.0,KING HENRY IV,"So shaken as we are, so wan with care,",1,1,1
4,Henry IV,1.0,KING HENRY IV,"Find we a time for frighted peace to pant,",1,1,2
5,Henry IV,1.0,KING HENRY IV,And breathe short-winded accents of new broils,1,1,3
6,Henry IV,1.0,KING HENRY IV,To be commenced in strands afar remote.,1,1,4
7,Henry IV,1.0,KING HENRY IV,No more the thirsty entrance of this soil,1,1,5
8,Henry IV,1.0,KING HENRY IV,Shall daub her lips with her own children's bl...,1,1,6
9,Henry IV,1.0,KING HENRY IV,"Nor more shall trenching war channel her fields,",1,1,7
10,Henry IV,1.0,KING HENRY IV,Nor bruise her flowerets with the armed hoofs,1,1,8
11,Henry IV,1.0,KING HENRY IV,"Of hostile paces: those opposed eyes,",1,1,9
12,Henry IV,1.0,KING HENRY IV,"Which, like the meteors of a troubled heaven,",1,1,10


**Note:** The DecisionTree and RandomForest classifiers need numeric input, so the string columns have to be converted to numbers first. This can be accomplished using LabelEncoder, which maps each distinct value in a column to an integer.

In [9]:
le = preprocessing.LabelEncoder()
df['player_le'] = le.fit_transform(df['Player'])
df['play_le'] = le.fit_transform(df['Play'])
df['player_line_le'] = le.fit_transform(df['PlayerLine'])
df['act_le'] = le.fit_transform(df["Act"])
df['scene_le'] = le.fit_transform(df["Scene"])
df['line_le'] = le.fit_transform(df["Line"])
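
Reusing one `le` object works for encoding, because `fit_transform` refits it on every call, but afterwards it only remembers the classes of the last column. If we later want to turn predicted `player_le` values back into player names, one option is to keep a separate encoder per column. This is a minimal sketch on a toy frame; the `toy` DataFrame and the `encoders` dictionary are illustrative names, not part of the notebook above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the cleaned DataFrame (values taken from the rows shown above)
toy = pd.DataFrame({
    "Player": ["KING HENRY IV", "LEONTES", "KING HENRY IV"],
    "Play": ["Henry IV", "A Winters Tale", "Henry IV"],
})

# One encoder per column, so each mapping can be reversed later
encoders = {}
for col in ["Player", "Play"]:
    encoders[col] = LabelEncoder()
    toy[col.lower() + "_le"] = encoders[col].fit_transform(toy[col])

# Map the integer labels back to the original player names
decoded_players = encoders["Player"].inverse_transform(toy["player_le"])
print(list(decoded_players))
```

With this layout, `encoders["Player"].inverse_transform(...)` could also be applied to a model's predictions to read them as names.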

In [10]:
df

Unnamed: 0,Play,PlayerLinenumber,Player,PlayerLine,Act,Scene,Line,player_le,play_le,player_line_le,act_le,scene_le,line_le
3,Henry IV,1.0,KING HENRY IV,"So shaken as we are, so wan with care,",1,1,1,457,9,60240,1,1,0
4,Henry IV,1.0,KING HENRY IV,"Find we a time for frighted peace to pant,",1,1,2,457,9,23568,1,1,111
5,Henry IV,1.0,KING HENRY IV,And breathe short-winded accents of new broils,1,1,3,457,9,4998,1,1,222
6,Henry IV,1.0,KING HENRY IV,To be commenced in strands afar remote.,1,1,4,457,9,73793,1,1,333
7,Henry IV,1.0,KING HENRY IV,No more the thirsty entrance of this soil,1,1,5,457,9,48893,1,1,444
...,...,...,...,...,...,...,...,...,...,...,...,...,...
111390,A Winters Tale,38.0,LEONTES,"Is troth-plight to your daughter. Good Paulina,",5,3,179,494,2,41329,5,9,88
111391,A Winters Tale,38.0,LEONTES,"Lead us from hence, where we may leisurely",5,3,180,494,2,42772,5,9,90
111392,A Winters Tale,38.0,LEONTES,Each one demand an answer to his part,5,3,181,494,2,22110,5,9,91
111393,A Winters Tale,38.0,LEONTES,Perform'd in this wide gap of time since first,5,3,182,494,2,55480,5,9,92
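
Notice that line_le jumps from 0 to 111 to 222 for lines 1, 2 and 3. The columns produced by `str.split` hold strings, and LabelEncoder sorts string classes alphabetically, so "10" and "100" come before "2". The sketch below reproduces the effect on a small hand-made Series and shows one possible alternative, casting to integers first. It assumes every value parses as an integer, which I have not verified for the full dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hand-made stand-in for the "Line" column after str.split (values are strings)
lines = pd.Series(["1", "2", "3", "10", "11", "100"])

# String classes are sorted alphabetically: "1", "10", "100", "11", "2", "3"
encoded_as_str = LabelEncoder().fit_transform(lines)

# Casting to int first keeps the numeric order of the line numbers
# (assumption: every value is a valid integer string)
lines_int = lines.astype(int)
encoded_as_int = LabelEncoder().fit_transform(lines_int)

print(list(encoded_as_str))  # order follows the alphabetical sort
print(list(encoded_as_int))  # order follows the numeric sort
```

For a tree-based model, which splits on thresholds, keeping the numeric order may make splits on Act, Scene and Line more meaningful. Once the columns are integers they could also be used directly, without label encoding.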


# Feature Engineering

### Decision Tree

The first model I will use is a decision tree. Decision trees are quick to train. The tree is fit on the training set and then asked to predict the player for each row of the held-out test set. Comparing those predictions with the true players gives the accuracy.

In [11]:
x = df[['play_le', 'act_le', 'scene_le', 'line_le']]
y = df['player_le']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=1)

In [12]:
tree = DecisionTreeClassifier()
tree_classifier = tree.fit(x_train, y_train)
tree_pred = tree_classifier.predict(x_test)
print("Prediction Accuracy:", metrics.accuracy_score(y_test, tree_pred))

Prediction Accuracy: 0.6287561810574362


Using the DecisionTree gave me an accuracy of around 63% (0.629). This is not exceptional, but it's alright. Looking forward, I might be able to attain stronger accuracy by changing which columns are used as features in x.
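
One way to judge whether an accuracy score is good is to compare it with a baseline that always predicts the most frequent class. The sketch below does this on synthetic data, since the Shakespeare CSV is not reproduced here. `DummyClassifier` is not used anywhere in the notebook above, and the synthetic `X` and `y` are assumptions made only for illustration.

```python
import numpy as np
from sklearn import metrics
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: four integer features, label derived from the first one
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(500, 4))
y = X[:, 0] // 2  # five classes, 0 to 4

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=1
)

# Baseline: always predict the most common class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = metrics.accuracy_score(y_test, baseline.predict(X_test))

# Decision tree with a fixed seed so the run is repeatable
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
tree_acc = metrics.accuracy_score(y_test, tree.predict(X_test))

print("Baseline accuracy:", baseline_acc)
print("Tree accuracy:", tree_acc)
```

Running the same comparison with the notebook's `x_train`, `x_test`, `y_train` and `y_test` would show how far the tree is above simply guessing the most common player.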

### Random Forest

Let's use another classifier, RandomForestClassifier, to test our data more. I will use the same training and test split as for the DecisionTreeClassifier. By keeping the data the same, we can see whether the DecisionTree or the RandomForest gives more accurate results.

In [13]:
forest = RandomForestClassifier()
forest.fit(x_train, y_train)
forest_pred = forest.predict(x_test)
print("Prediction Accuracy:", metrics.accuracy_score(y_test, forest_pred))

Prediction Accuracy: 0.6296120197793837


The RandomForestClassifier was MUCH slower to train than the DecisionTree. The results were very slightly stronger than the DecisionTree's, at about 63.0% accuracy against 62.9%. That is a small gain for a computationally heavier model.
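
To put a number on how much slower the forest is, the fit of each model can be timed. This is a rough sketch on synthetic data. The `fit_seconds` helper is a hypothetical name introduced here, and the timings will depend on the machine and on the data size.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(1000, 4))
y = X[:, 0] // 2


def fit_seconds(model, X, y):
    """Fit the model and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


tree_seconds = fit_seconds(DecisionTreeClassifier(random_state=1), X, y)
forest_seconds = fit_seconds(RandomForestClassifier(random_state=1), X, y)

print(f"Tree fit:   {tree_seconds:.4f} s")
print(f"Forest fit: {forest_seconds:.4f} s")
```

RandomForestClassifier also accepts an `n_jobs` parameter for building trees in parallel, which may shorten the fit on a multi-core machine.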

## Conclusion

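The fact that our RandomForest was better than the DecisionTree makes sense. Although the difference was minuscule (under 0.1 percentage points), RandomForests are generally more accurate because they combine the votes of many decision trees, each trained on a different random sample of the data, which reduces overfitting. They do take much longer to train. Neither classifier was given a fixed random_state, so a gap this small could change from one run to the next.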
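
The two accuracy scores above differ by less than 0.001 and come from a single train/test split. Cross-validation gives several scores per model, which makes it easier to see whether a gap is larger than the spread between folds. The sketch below uses synthetic data and fixed seeds. `cross_val_score` is not used in the notebook above, so treat this as one possible way to check the comparison rather than the method used here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for x and y
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(300, 4))
y = X[:, 0] // 2

models = {
    "tree": DecisionTreeClassifier(random_state=1),
    "forest": RandomForestClassifier(random_state=1),
}

# Five accuracy scores per model, one for each held-out fold
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy")
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(name, "mean:", fold_scores.mean(), "std:", fold_scores.std())
```

If the difference between the two means is smaller than the standard deviation across folds, it is hard to say that one model is really better than the other.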

Going forward, using other classifiers and features might yield higher accuracy in our results.
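
CountVectorizer and hstack are imported at the top of this notebook but never used. One feature worth trying is the text of PlayerLine itself, turned into word counts and joined with the Act, Scene and Line numbers. The sketch below shows the mechanics on three lines from the tables above. It is an assumption about how such features could be built, and I have not measured its effect on accuracy.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Three spoken lines and their (Act, Scene, Line) positions from the tables above
player_lines = [
    "So shaken as we are, so wan with care,",
    "Find we a time for frighted peace to pant,",
    "Each one demand an answer to his part",
]
positions = np.array([[1, 1, 1], [1, 1, 2], [5, 3, 181]])

# Bag-of-words counts: one column per distinct word, stored as a sparse matrix
vectorizer = CountVectorizer()
text_features = vectorizer.fit_transform(player_lines)

# Join the word counts and the numeric columns side by side
combined = hstack([text_features, csr_matrix(positions)]).tocsr()

print(text_features.shape, combined.shape)
```

The resulting `combined` matrix could be passed to `train_test_split` and to either classifier in place of `x`.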