# Classification of classy Shakespeare plays

__Problem Statement__: Predict the player (speaker) of each line in Shakespeare's plays from the other columns of the dataset

__Data Source__: https://www.kaggle.com/kingburrito666/shakespeare-plays

__References__:

https://towardsdatascience.com/random-forest-in-python-24d0893d51c0
<br>
https://towardsdatascience.com/solving-a-simple-classification-problem-with-python-fruits-lovers-edition-d20ab6b071d2
<br>
https://towardsdatascience.com/understanding-data-science-classification-metrics-in-scikit-learn-in-python-3bc336865019

#### Step 1: Data Extraction

In [1]:
import pandas as pd

In [2]:
#Load data into data frame
players_df = pd.read_csv("../data/external/Shakespeare_data.csv")
players_df.head(3)

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."


In [3]:
#Remove rows with null values in any of the columns used below
players_df = players_df.dropna(axis=0, subset=['Player', 'ActSceneLine', 'Play', 'PlayerLine'])

players_df.head()

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
5,6,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils
6,7,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.
7,8,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil


In [4]:
#Number of null values after removing NaN
print(players_df.isnull().values.sum())

0


In [5]:
#Check data types in dataframe
players_df.dtypes

Dataline              int64
Play                 object
PlayerLinenumber    float64
ActSceneLine         object
Player               object
PlayerLine           object
dtype: object

In [6]:
#Find unique players to convert symbolic data to numerical
players = list(players_df.Player.unique())

In [7]:
def playerIndex(x):
    i = players.index(x)
    return i

In [8]:
players_df['PlayerIndex'] = players_df['Player'].apply(playerIndex)

In [9]:
#Find unique plays to convert symbolic data to numerical
plays = list(players_df.Play.unique())

In [10]:
def playIndex(x):
    i = plays.index(x)    
    return i 

In [11]:
players_df['PlayIndex'] = players_df['Play'].apply(playIndex)
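
The two lookup helpers above assign each name an integer in order of first appearance. As a minimal sketch of an alternative, assuming a small hypothetical frame `toy_df` rather than the real data, `pd.factorize` produces the same kind of first-appearance codes in one call and avoids a list search per row:

```python
import pandas as pd

# Hypothetical toy data; the real notebook uses players_df loaded from the CSV.
toy_df = pd.DataFrame(
    {"Player": ["KING HENRY IV", "WESTMORELAND", "KING HENRY IV", "FALSTAFF"]}
)

# factorize returns integer codes (in order of first appearance)
# and the array of unique values those codes refer to.
codes, uniques = pd.factorize(toy_df["Player"])
toy_df["PlayerIndex"] = codes

print(toy_df)
print(list(uniques))
```

Keeping `uniques` around makes it possible to map a predicted index back to a player name later, in the same way the `players` list does above.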

In [12]:
players_df.head(5)

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine,PlayerIndex,PlayIndex
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,",0,0
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,",0,0
5,6,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils,0,0
6,7,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.,0,0
7,8,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil,0,0


#### Step 2: Data Transformation

In [13]:
#One hot encoding for transforming PlayIndex column
players_df = pd.get_dummies(players_df, columns=['PlayIndex'])

In [14]:
players_df.head(3)

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine,PlayerIndex,PlayIndex_0,PlayIndex_1,PlayIndex_2,...,PlayIndex_26,PlayIndex_27,PlayIndex_28,PlayIndex_29,PlayIndex_30,PlayIndex_31,PlayIndex_32,PlayIndex_33,PlayIndex_34,PlayIndex_35
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,",0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,",0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
5,6,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
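
To make the effect of the one-hot encoding easier to see, here is a small sketch on a hypothetical two-play frame (`toy_df` is not part of the notebook). Passing `dtype=int` is an assumption made here so the output shows 0/1 as in the table above; recent pandas versions may otherwise return boolean columns.

```python
import pandas as pd

# Hypothetical toy data with two distinct play indices.
toy_df = pd.DataFrame({"Dataline": [4, 5, 6], "PlayIndex": [0, 1, 0]})

# Each distinct PlayIndex value becomes its own 0/1 column,
# and the original PlayIndex column is removed.
encoded = pd.get_dummies(toy_df, columns=["PlayIndex"], dtype=int)

print(encoded)
```

With the real data this yields one column per play, which is why the frame above gains the columns `PlayIndex_0` through `PlayIndex_35`.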


#### Step 3: Training and Testing Data Sets

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
#Drop the symbolic (text) columns
players_df = players_df.drop(['Play', 'Player', 'PlayerLine', 'ActSceneLine'], axis=1)

In [17]:
players_df.head()

Unnamed: 0,Dataline,PlayerLinenumber,PlayerIndex,PlayIndex_0,PlayIndex_1,PlayIndex_2,PlayIndex_3,PlayIndex_4,PlayIndex_5,PlayIndex_6,...,PlayIndex_26,PlayIndex_27,PlayIndex_28,PlayIndex_29,PlayIndex_30,PlayIndex_31,PlayIndex_32,PlayIndex_33,PlayIndex_34,PlayIndex_35
3,4,1.0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,1.0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,6,1.0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,7,1.0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,8,1.0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
#Define features
X=players_df.drop(['PlayerIndex'], axis=1)

In [19]:
#Define target
y=players_df['PlayerIndex']

In [20]:
#Define testing and training datasets
train_features, test_features , train_labels, test_labels = train_test_split(X, y, random_state=0)

In [21]:
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (78864, 38)
Training Labels Shape: (78864,)
Testing Features Shape: (26288, 38)
Testing Labels Shape: (26288,)
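
The shapes above reflect the default behaviour of `train_test_split`: when neither `test_size` nor `train_size` is given, 25% of the rows go to the test set. A minimal sketch on hypothetical synthetic arrays (not the Shakespeare data) shows the same 75/25 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# No test_size given, so the default of 0.25 applies.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print("Train:", X_train.shape, "Test:", X_test.shape)
```

Fixing `random_state`, as the notebook does, makes the split reproducible between runs.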


#### Step 4: Building Models

In [22]:
#Use a Decision Tree classifier
from sklearn.tree import DecisionTreeClassifier

In [23]:
#Training the decision tree classifier
clf = DecisionTreeClassifier().fit(train_features, train_labels)

In [24]:
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(train_features, train_labels)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf.score(test_features, test_labels)))

Accuracy of Decision Tree classifier on training set: 1.00
Accuracy of Decision Tree classifier on test set: 0.78
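
A training accuracy of 1.00 next to a test accuracy of 0.78 suggests that the unrestricted tree memorizes the training rows. One common way to examine this is to compare it with a depth-limited tree. The sketch below does that on synthetic data from `make_classification`; the data set, the depth of 5 and the `results` dictionary are all assumptions for illustration, so the numbers it prints say nothing about the Shakespeare data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem with some label noise (flip_y).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    n_classes=3, flip_y=0.1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare an unrestricted tree (max_depth=None) with a shallow one.
results = {}
for depth in (None, 5):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

for depth, (train_acc, test_acc) in results.items():
    print("max_depth={}: train {:.2f}, test {:.2f}".format(depth, train_acc, test_acc))
```

Applied to `train_features` and `train_labels`, the same loop would show whether a smaller tree narrows the gap between training and test accuracy.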


### Summary

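1. Tried various models, such as Linear Regression, Gaussian Naive Bayes, K-Nearest Neighbors and a Decision Tree classifier, and compared their accuracy.
2. The Decision Tree classifier gave the best accuracy, 78% on the test data set. Its 100% accuracy on the training set indicates that it overfits the training data.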
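
Only the Decision Tree cells are shown in this notebook. As a rough sketch of how such a comparison could be run, the loop below scores three of the classifiers named in the summary on a synthetic data set. The data, the `models` dictionary and the default hyperparameters are assumptions, not the settings used for the results above. Linear Regression is left out because it predicts continuous values rather than class labels.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and labels.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical model line-up with default hyperparameters.
models = {
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Fit each model and record its accuracy on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, acc in scores.items():
    print("Accuracy of {} on test set: {:.2f}".format(name, acc))
```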