# Classification of classy Shakespeare plays

#### Data Source:

https://www.kaggle.com/kingburrito666/shakespeare-plays

#### Step 1: Data Extraction

In [1]:
import pandas as pd

Loading data into data frame

In [2]:
players_df = pd.read_csv("../data/external/Shakespeare_data.csv")

In [3]:
players_df.head(3)

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."


Removing columns that are not required

In [4]:
players_df = players_df.drop(['PlayerLinenumber'], axis=1)

In [5]:
players_df = players_df.drop(['ActSceneLine'], axis=1)

In [6]:
players_df.head(3)

Unnamed: 0,Dataline,Play,Player,PlayerLine
0,1,Henry IV,,ACT I
1,2,Henry IV,,SCENE I. London. The palace.
2,3,Henry IV,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
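The two `drop` calls above can also be written as one call. The sketch below is a minimal illustration on a made-up two-row frame (`toy_df` and its values are assumptions for the example, not the real dataset); only the column names are taken from the output above.

```python
import pandas as pd

# Toy frame reusing the dataset's column names; the values are illustrative only.
toy_df = pd.DataFrame({
    "Dataline": [1, 2],
    "Play": ["Henry IV", "Henry IV"],
    "PlayerLinenumber": [None, 1.0],
    "ActSceneLine": [None, "1.1.1"],
    "Player": [None, "KING HENRY IV"],
    "PlayerLine": ["ACT I", "So shaken as we are, so wan with care,"],
})

# Drop both unneeded columns in a single call instead of two separate ones.
toy_df = toy_df.drop(columns=["PlayerLinenumber", "ActSceneLine"])
print(list(toy_df.columns))
```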


Removing null values from Player column

In [7]:
players_df = players_df.dropna(axis=0, subset=['Player']) 
players_df.head()

Unnamed: 0,Dataline,Play,Player,PlayerLine
3,4,Henry IV,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,KING HENRY IV,"Find we a time for frighted peace to pant,"
5,6,Henry IV,KING HENRY IV,And breathe short-winded accents of new broils
6,7,Henry IV,KING HENRY IV,To be commenced in strands afar remote.
7,8,Henry IV,KING HENRY IV,No more the thirsty entrance of this soil
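Rows without a `Player` are stage directions and act or scene headings, so dropping them shrinks the frame. A small sketch, on a made-up frame (`toy_df` is hypothetical), of how one might check how many rows `dropna` removes:

```python
import pandas as pd

# Illustrative frame: the first row has no speaker, like the "ACT I" row above.
toy_df = pd.DataFrame({
    "Player": [None, "KING HENRY IV", "KING HENRY IV"],
    "PlayerLine": [
        "ACT I",
        "So shaken as we are, so wan with care,",
        "Find we a time for frighted peace to pant,",
    ],
})

rows_before = len(toy_df)
# Keep only rows where the Player column is present.
toy_df = toy_df.dropna(subset=["Player"])
rows_removed = rows_before - len(toy_df)
print(f"Removed {rows_removed} of {rows_before} rows")
```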


Checking data types in dataframe

In [8]:
players_df.dtypes

Dataline       int64
Play          object
Player        object
PlayerLine    object
dtype: object

#### Step 2: Data Transformation

Using Encoders for transforming the object data types

#### One hot encoding for transforming Play column

In [9]:
players_df = pd.get_dummies(players_df, columns=['Play'])
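To make the effect of one-hot encoding concrete, here is a minimal sketch on a made-up three-row frame (`toy_df` is an assumption for the example). Passing `dtype=int` is also an assumption: it requests 0/1 columns explicitly, since recent pandas versions produce boolean dummy columns by default.

```python
import pandas as pd

# Illustrative frame with two distinct plays.
toy_df = pd.DataFrame({
    "Dataline": [1, 2, 3],
    "Play": ["Henry IV", "macbeth", "Henry IV"],
})

# Each distinct play becomes its own 0/1 indicator column named Play_<value>.
encoded = pd.get_dummies(toy_df, columns=["Play"], dtype=int)
print(encoded)
```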

#### Label Encoding for transforming Player and PlayerLine columns

In [10]:
from sklearn.preprocessing import LabelEncoder

In [11]:
le = LabelEncoder()

In [12]:
players_df['Player'] = le.fit_transform(players_df['Player'].astype('str'))

In [13]:
players_df['PlayerLine'] = le.fit_transform(players_df['PlayerLine'].astype('str'))

In [14]:
players_df.head(3)

Unnamed: 0,Dataline,Player,PlayerLine,Play_A Comedy of Errors,Play_A Midsummer nights dream,Play_A Winters Tale,Play_Alls well that ends well,Play_Antony and Cleopatra,Play_As you like it,Play_Coriolanus,...,Play_Richard III,Play_Romeo and Juliet,Play_Taming of the Shrew,Play_The Tempest,Play_Timon of Athens,Play_Titus Andronicus,Play_Troilus and Cressida,Play_Twelfth Night,Play_Two Gentlemen of Verona,Play_macbeth
3,4,457,63730,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,457,25777,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,6,457,5119,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
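Reusing one `LabelEncoder` for both columns means the second `fit_transform` overwrites the classes learned from the first, so the `Player` codes can no longer be mapped back to names. A possible alternative, sketched on a made-up frame (`toy_df` and the `encoders` dictionary are hypothetical names), keeps one encoder per column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative frame; the speaker names and lines are placeholders.
toy_df = pd.DataFrame({
    "Player": ["KING HENRY IV", "WESTMORELAND", "KING HENRY IV"],
    "PlayerLine": ["line a", "line b", "line c"],
})

# Fit a separate encoder per column and keep each one for later decoding.
encoders = {}
for column in ["Player", "PlayerLine"]:
    encoder = LabelEncoder()
    toy_df[column] = encoder.fit_transform(toy_df[column].astype(str))
    encoders[column] = encoder

# The Player codes can still be turned back into names.
player_names = encoders["Player"].inverse_transform(toy_df["Player"])
print(list(player_names))
```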


#### Step 3: Model Training and Testing data sets

In [15]:
from sklearn.model_selection import train_test_split


Defining Features

In [16]:
X = players_df.drop(['Player'], axis=1)

Defining Targets

In [17]:
y = players_df['Player']

Splitting the training and testing data sets

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
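When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows for testing. The sketch below makes that split explicit on synthetic arrays (`X_toy` and `y_toy` are made up for the example) and checks the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples with 2 features each, and alternating labels.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# test_size=0.25 matches the default used when no size is specified.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)
print(len(X_tr), len(X_te))
```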

#### Step 4: Building Models

1. Using Decision Tree to find the accuracy

In [19]:
from sklearn.tree import DecisionTreeClassifier

In [20]:
clf = DecisionTreeClassifier().fit(X_train, y_train)

In [21]:
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Accuracy of Decision Tree classifier on training set: 1.00
Accuracy of Decision Tree classifier on test set: 0.69
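A training accuracy of 1.00 next to a test accuracy of 0.69 suggests the fully grown tree memorises the training data. One common way to probe this is to cap the tree depth and compare scores. The sketch below does so on synthetic data from `make_classification`, not on the Shakespeare data, and `max_depth=3` is an arbitrary value chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real features and targets differ.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# An unconstrained tree versus one limited to three levels.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, model in [("full", full_tree), ("max_depth=3", shallow_tree)]:
    print('{}: train {:.2f}, test {:.2f}'.format(
        name, model.score(X_tr, y_tr), model.score(X_te, y_te)))
```

A smaller gap between training and test scores for the shallow tree would point to overfitting in the unconstrained one; whether that holds for this dataset would need to be checked on the real data.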


2. Using Logistic Regression to find the accuracy 

In [22]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
# Fit on the training split only so the test rows stay unseen by the model
logreg.fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(logreg.score(X_test, y_test)))

Accuracy of Logistic regression classifier on training set: 0.03
Accuracy of Logistic regression classifier on test set: 0.03
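One plausible contributor to such a low score is that the features live on very different scales (large integer codes next to 0/1 indicators), which can keep the solver from converging. This is an assumption, not something the notebook verifies. The sketch below shows how scaling could be added with a pipeline, using synthetic data with one deliberately inflated column; `max_iter=1000` is likewise an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one column is inflated to mimic a large-valued feature.
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
X_syn[:, 0] *= 100_000
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# Standardise the features, then fit logistic regression on the training split.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_tr, y_tr)

print('Accuracy of scaled Logistic regression on test set: {:.2f}'
      .format(scaled_logreg.score(X_te, y_te)))
```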


3. Using K-Nearest Neighbours to find the accuracy

In [23]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print('Accuracy of K-NN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'
     .format(knn.score(X_test, y_test)))

Accuracy of K-NN classifier on training set: 0.46
Accuracy of K-NN classifier on test set: 0.23


4. Using Gaussian Naive Bayes to find the accuracy

In [24]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
print('Accuracy of GNB classifier on training set: {:.2f}'
     .format(gnb.score(X_train, y_train)))
print('Accuracy of GNB classifier on test set: {:.2f}'
     .format(gnb.score(X_test, y_test)))

Accuracy of GNB classifier on training set: 0.26
Accuracy of GNB classifier on test set: 0.26


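__Conclusion__: The Decision Tree classifier reaches 69% accuracy on the test set, well ahead of Gaussian Naive Bayes (26%), K-NN (23%) and Logistic Regression (3%). Its training accuracy of 100% against 69% on the test set suggests that the tree overfits the training data.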
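The four model cells repeat the same fit-and-score pattern, so the comparison could also be run in a single loop. A minimal sketch on synthetic data follows; the `models` and `scores` dictionaries are hypothetical names, and the numbers it prints are unrelated to the results reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data shared by every model.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-NN": KNeighborsClassifier(),
    "Gaussian NB": GaussianNB(),
}

# Fit each model on the training split and record (train, test) accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print('{}: train {:.2f}, test {:.2f}'.format(name, *scores[name]))
```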