Changing to the directory that contains the dataset

In [2]:
cd Desktop/Shakespear/

/Users/danaalmansour/Desktop/Shakespear


Listing the files in the dataset directory

In [3]:
ls

Shakespeare_data.csv*
alllines.txt*
william-shakespeare-black-silhouette.jpg*


In [4]:
# csv - comma separated values
# pandas, numpy, matplotlib, seaborn

Importing the libraries needed for the analysis

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

loading the dataset

In [6]:
data_csv = pd.read_csv('Shakespeare_data.csv') 

Taking a quick look at a random sample of rows

In [7]:
data_csv.sample(5)

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
108681,108682,A Winters Tale,46.0,2.1.173,ANTIGONUS,"If this prove true, they'll pay for't:"
25697,25698,Coriolanus,3.0,2.2.5,First Officer,"That's a brave fellow, but he's vengeance prou..."
807,808,Henry IV,21.0,2.2.58,PRINCE HENRY,"from your encounter, then they light on us."
38253,38254,Henry V,28.0,3.6.100,FLUELLEN,"think the duke hath lost never a man, but one ..."
60577,60578,Measure for measure,30.0,4.3.80,Provost,But Barnardine must die this afternoon:


In [8]:
# feature engineering - building new features by applying operations to existing ones, or adding a new feature
# exploratory data analysis (EDA) - studying the dataset, for example by plotting charts and
#  drawing inferences about the data

Checking the data types of the dataset's columns

In [9]:
data_csv.dtypes

Dataline              int64
Play                 object
PlayerLinenumber    float64
ActSceneLine         object
Player               object
PlayerLine           object
dtype: object

Checking the null value count in the dataset

In [10]:
data_csv.isna().sum()

Dataline               0
Play                   0
PlayerLinenumber       3
ActSceneLine        6243
Player                 7
PlayerLine             0
dtype: int64

Filling the null values

In [11]:
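# Assign the result back: np.NaN was removed in NumPy 2.0 and in-place edits on a selected column are deprecated in pandas
data_csv['Player'] = data_csv['Player'].fillna('Other')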

In [12]:
# .bfill() replaces the deprecated fillna(method='bfill'); each gap takes the next valid value
data_csv['PlayerLinenumber'] = data_csv['PlayerLinenumber'].bfill()

The missing ActSceneLine values are filled with 0. The column holds strings such as 1.1.1, so imputing with the mean, median, or mode, or training a model to predict the missing entries, is not meaningful and would distort the later analysis. The affected rows include act headings, scene headings, and stage directions, as in the first three rows of the output below.

In [13]:
# Assign the result back instead of using inplace=True on a selected column
data_csv['ActSceneLine'] = data_csv['ActSceneLine'].fillna(0)
data_csv.head()

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,1.0,0,Other,ACT I
1,2,Henry IV,1.0,0,Other,SCENE I. London. The palace.
2,3,Henry IV,1.0,0,Other,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"


Rechecking the null values

In [14]:
data_csv.isna().sum()

Dataline            0
Play                0
PlayerLinenumber    0
ActSceneLine        0
Player              0
PlayerLine          0
dtype: int64

Now the dataset is clean: no null values remain.
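
The three fills used above behave differently. The following is a minimal sketch on an invented three-row frame (the values are illustrative and not taken from the dataset) that shows a constant fill, a backward fill, and the side effect of putting the integer 0 into a column of strings. The `dtype=object` on the toy `ActSceneLine` column is an assumption made so that the sketch behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Toy frame reusing three column names from the dataset; all values are invented.
toy = pd.DataFrame({
    "Player": ["HAMLET", np.nan, "OPHELIA"],
    "PlayerLinenumber": [np.nan, 2.0, 3.0],
})
toy["ActSceneLine"] = pd.Series([np.nan, "1.1.1", "1.1.2"], dtype=object)

# Constant fill for a missing label.
toy["Player"] = toy["Player"].fillna("Other")

# Backward fill: each gap takes the next valid value below it.
toy["PlayerLinenumber"] = toy["PlayerLinenumber"].bfill()

# Filling a string column with the integer 0 leaves mixed value types behind.
toy["ActSceneLine"] = toy["ActSceneLine"].fillna(0)

null_total = int(toy.isna().sum().sum())
has_non_string = any(not isinstance(v, str) for v in toy["ActSceneLine"])
print(null_total, has_non_string)
```

If the mixed types become a problem later (for example when splitting ActSceneLine on the dots), filling with the string '0' instead is one option.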

Feature engineering: combining Play and PlayerLine into a new column for analysis

In [15]:
data_csv['Play&PlayerLine'] = data_csv['Play'] + " " + data_csv['PlayerLine'] 

In [16]:
data_csv.head()

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine,Play&PlayerLine
0,1,Henry IV,1.0,0,Other,ACT I,Henry IV ACT I
1,2,Henry IV,1.0,0,Other,SCENE I. London. The palace.,Henry IV SCENE I. London. The palace.
2,3,Henry IV,1.0,0,Other,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ...","Henry IV Enter KING HENRY, LORD JOHN OF LANCAS..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,","Henry IV So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,",Henry IV Find we a time for frighted peace to ...


In [17]:
len(data_csv['Player'].unique())

935

In [18]:
len(data_csv['Play'].unique())

36

In [19]:
len(data_csv)

111396

There are 935 distinct players, so a classifier would have to choose among 935 classes, while the dataset has only about 111k rows (roughly 119 lines per player on average). Classifying into that many categories is unlikely to work well.

If we still want to predict the target variable (Player) from the features, the categorical columns have to be encoded first: one-hot encoding for the Play feature, and one-hot or label encoding for the Player target. After that we can train a neural network for classification. This accomplishes the task, but it is tedious.
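
To make the two encodings concrete, here is a small sketch on four invented labels (not rows from the dataset). It uses `pd.get_dummies` for one-hot encoding and scikit-learn's `LabelEncoder` for label encoding; the variable names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Four invented labels standing in for the Player column.
players = pd.Series(["HAMLET", "OPHELIA", "HAMLET", "HORATIO"])

# One-hot encoding: one indicator column per distinct label.
one_hot = pd.get_dummies(players)

# Label encoding: one integer per distinct label, assigned in sorted order.
encoder = LabelEncoder()
codes = encoder.fit_transform(players)

print(one_hot.shape)
print(list(encoder.classes_), list(codes))
```

With 935 players, the one-hot target has 935 columns, while the label-encoded target stays a single column of integers.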

We start by splitting the data into training and test sets for the neural network we plan to build.

The split is 80/20, with Play and PlayerLinenumber as X and Player as y.

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_csv.loc[:,'Play':'PlayerLinenumber'], data_csv['Player'], test_size = 0.2, random_state = 42)

checking the shape of X_train and X_test

In [21]:
X_train.shape

(89116, 2)

In [22]:
X_test.shape

(22280, 2)
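
The two shapes above follow from the row count and `test_size`. As a rough check, and assuming scikit-learn rounds the test count up and gives the remainder to the training set (which is consistent with the shapes printed above), the arithmetic is:

```python
import math

n_rows = 111396    # len(data_csv), as reported earlier in the notebook
test_size = 0.2

# Assumption: the test count is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```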

Performing OneHot Encoding

In [23]:
y_train = pd.get_dummies(y_train, prefix = 'Player_')


In [24]:
y_train.head()

Unnamed: 0,Player__A Patrician,Player__A Player,Player__AARON,Player__ABERGAVENNY,Player__ABHORSON,Player__ABRAHAM,Player__ACHILLES,Player__ADAM,Player__ADRIAN,Player__ADRIANA,...,Player__Widow,Player__Wife,Player__YORK,Player__YOUNG CLIFFORD,Player__YOUNG SIWARD,Player__Young LUCIUS,Player__of BUCKINGHAM,Player__of King Henry VI,Player__of Prince Edward,Player__of young Princes
73101,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
86651,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23541,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
25691,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19922,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
len(X_train['Play'].unique())

36

In [26]:
df_train = pd.get_dummies(X_train['Play'])

In [27]:
df_train.head()

Unnamed: 0,A Comedy of Errors,A Midsummer nights dream,A Winters Tale,Alls well that ends well,Antony and Cleopatra,As you like it,Coriolanus,Cymbeline,Hamlet,Henry IV,...,Richard III,Romeo and Juliet,Taming of the Shrew,The Tempest,Timon of Athens,Titus Andronicus,Troilus and Cressida,Twelfth Night,Two Gentlemen of Verona,macbeth
73101,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
86651,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
23541,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
25691,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19922,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
df_train['PlayerLinenumber'] = X_train['PlayerLinenumber']

In [29]:
df_train.head()

Unnamed: 0,A Comedy of Errors,A Midsummer nights dream,A Winters Tale,Alls well that ends well,Antony and Cleopatra,As you like it,Coriolanus,Cymbeline,Hamlet,Henry IV,...,Romeo and Juliet,Taming of the Shrew,The Tempest,Timon of Athens,Titus Andronicus,Troilus and Cressida,Twelfth Night,Two Gentlemen of Verona,macbeth,PlayerLinenumber
73101,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,85.0
86651,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,87.0
23541,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,14.0
25691,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,93.0
19922,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,68.0
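
Only the training features are encoded above. If the test features are encoded separately later, `pd.get_dummies` may produce a different set of columns when a play is missing from one split. A hedged sketch of one way to keep the columns aligned, on invented play names, is to reindex the test dummies against the training columns:

```python
import pandas as pd

# Invented stand-ins for X_train['Play'] and X_test['Play'].
train_play = pd.Series(["Hamlet", "macbeth", "Hamlet"], name="Play")
test_play = pd.Series(["macbeth", "Othello"], name="Play")

train_dummies = pd.get_dummies(train_play, dtype=int)

# Keep exactly the training columns: plays unseen in training are dropped,
# plays absent from the test split become all-zero columns.
test_dummies = pd.get_dummies(test_play, dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(list(train_dummies.columns))
print(test_dummies)
```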


We need to install keras and tensorflow for building a neural network and then we will compute accuracy of our model.
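
The notebook does not build the network. As a placeholder for that step, the sketch below trains a small neural network and computes its accuracy, but it swaps in scikit-learn's `MLPClassifier` instead of Keras so that it runs without TensorFlow. The data is random and invented, so the accuracy value says nothing about the Shakespeare dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Invented stand-in data: 200 rows, 5 numeric features, 2 made-up classes.
X_toy = rng.normal(size=(200, 5))
y_toy = np.where(X_toy[:, 0] > 0, "HAMLET", "OPHELIA")

# Hold out the last 40 rows as a test set.
X_fit, X_eval = X_toy[:160], X_toy[160:]
y_fit, y_eval = y_toy[:160], y_toy[160:]

# One hidden layer of 16 units; these settings are arbitrary choices.
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
model.fit(X_fit, y_fit)

predictions = model.predict(X_eval)
accuracy = accuracy_score(y_eval, predictions)
print(accuracy)
```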

This completes part 6 of the project.

As another approach, the data was preprocessed differently and an SVM classifier was used.

In [78]:
df = data_csv.copy()

In [79]:
df['Play'] = df['Play'].astype('category') 

In [80]:
df['Player'] = df['Player'].astype('category')

In [81]:
df['PlayerLineCount'] = df['PlayerLine'].apply(lambda x:len(x.split(' ')))

In [82]:
df['Play&PlayerLineCount'] = df['Play&PlayerLine'].apply(lambda x:len(x.split(' ')))
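
One detail of the two word-count features above: `split(' ')` splits on every single space, so consecutive spaces produce empty strings that are counted as words, while `split()` with no argument splits on runs of whitespace. A small sketch on one line of text (the double space is inserted deliberately for illustration):

```python
# A line from the play with a double space added after "are," for illustration.
line = "So shaken as we are,  so wan with care,"

# Splitting on a single space counts the empty string between the two spaces.
count_single_space = len(line.split(' '))

# Splitting on any whitespace ignores the extra space.
count_whitespace = len(line.split())

print(count_single_space, count_whitespace)
```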

In [83]:
df.drop(['PlayerLine','Play&PlayerLine','ActSceneLine','Dataline'], axis=1, inplace=True)

In [84]:
df.reset_index()

Unnamed: 0,index,Play,PlayerLinenumber,Player,PlayerLineCount,Play&PlayerLineCount
0,0,Henry IV,1.0,,2,4
1,1,Henry IV,1.0,,5,7
2,2,Henry IV,1.0,,16,18
3,3,Henry IV,1.0,KING HENRY IV,9,11
4,4,Henry IV,1.0,KING HENRY IV,9,11
...,...,...,...,...,...,...
111391,111391,A Winters Tale,38.0,LEONTES,8,11
111392,111392,A Winters Tale,38.0,LEONTES,8,11
111393,111393,A Winters Tale,38.0,LEONTES,9,12
111394,111394,A Winters Tale,38.0,LEONTES,6,9


In [85]:
df = df.dropna()

In [86]:
df

Unnamed: 0,Play,PlayerLinenumber,Player,PlayerLineCount,Play&PlayerLineCount
3,Henry IV,1.0,KING HENRY IV,9,11
4,Henry IV,1.0,KING HENRY IV,9,11
5,Henry IV,1.0,KING HENRY IV,7,9
6,Henry IV,1.0,KING HENRY IV,7,9
7,Henry IV,1.0,KING HENRY IV,8,10
...,...,...,...,...,...
111391,A Winters Tale,38.0,LEONTES,8,11
111392,A Winters Tale,38.0,LEONTES,8,11
111393,A Winters Tale,38.0,LEONTES,9,12
111394,A Winters Tale,38.0,LEONTES,6,9


In [95]:
labels = df['Play'].astype('category').cat.categories.tolist()

In [96]:
mapper = {'Play' : {k: v for k,v in zip(labels,list(range(1,len(labels)+1)))}}

In [97]:
df.replace(mapper,inplace=True)
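
The mapper above gives each play an integer starting at 1, in the order of the sorted categories. As a sketch on invented play names, the same numbering can be read off the categorical codes, which start at 0:

```python
import pandas as pd

# Invented play names standing in for df['Play'].
plays = pd.Series(["Hamlet", "macbeth", "Hamlet", "Othello"]).astype("category")

# Same construction as the mapper above: sorted categories numbered from 1.
labels = plays.cat.categories.tolist()
mapping = {name: code for code, name in enumerate(labels, start=1)}
mapped = [mapping[name] for name in plays]

# Equivalent shortcut: category codes are 0-based, so add 1.
codes_plus_one = (plays.cat.codes + 1).tolist()

print(mapping)
print(mapped, codes_plus_one)
```

These integers impose an arbitrary order on the plays, which a linear model will treat as meaningful; one-hot encoding avoids that at the cost of more columns.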

In [98]:
y = df['Player']

In [99]:
X = df.drop('Player',axis=1)

In [100]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [101]:
from sklearn.svm import LinearSVC
clf = LinearSVC(random_state=0, tol=1e-5)

In [None]:
clf.fit(X_train,y_train)
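
The last cell has no output recorded. One possible reason, stated here as an assumption, is training time: for more than two classes `LinearSVC` uses a one-vs-rest scheme by default, so it learns one weight vector per class. The sketch below shows this on invented three-class data; with over 900 players the same fit would learn one weight vector for each of them.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Invented data: 90 rows, 4 numeric features, 3 made-up classes.
X_toy = rng.normal(size=(90, 4))
y_toy = np.repeat(["HAMLET", "OPHELIA", "HORATIO"], 30)

clf_toy = LinearSVC(random_state=0, tol=1e-5)
clf_toy.fit(X_toy, y_toy)

# One row of weights per class, one column per feature.
coef_shape = clf_toy.coef_.shape
train_accuracy = clf_toy.score(X_toy, y_toy)
print(coef_shape, train_accuracy)
```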