# Authorship Identification with Shakespeare

## Introduction

My objective with this project is to explore various NLP methods and to use Elizabethan-era plays for a supervised classification task. 

## Text Cleaning
- Split: hold out 5 texts from each class for each of the test and validation splits
- Import the words from each text
- Create a bag of words (BoW)
- Cleaning: delete all text that comes before the dramatis personae
- Tokenization: convert to lowercase and remove punctuation
- Lemmatization: reduce each word to its lemma
- Stopword removal: using an Elizabethan stopword list
- Vectorize with TF-IDF
- Create bigrams and trigrams
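
The tokenization, stopword, n-gram, and TF-IDF steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual pipeline: `STOPWORDS` is a tiny hypothetical stand-in for the Elizabethan stopword list, the two sample sentences are invented, lemmatization is left out, and `tf_idf` uses the plain unsmoothed `tf * log(N / df)` weighting, which differs from the smoothed variants that libraries typically apply.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for the Elizabethan stopword list (assumption, not the real list).
STOPWORDS = {"the", "and", "thou", "thy", "thee", "hath", "doth", "a", "to", "of"}

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes (drops punctuation)."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Weight each term in each document by tf * log(N / df), with no smoothing."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return weights

# Invented sample sentences, used only to exercise the functions.
docs = [remove_stopwords(tokenize(text)) for text in (
    "Thou art a villain, and the villain doth know it.",
    "The lady hath a villain to her husband.",
)]
bigrams = ngrams(docs[0], 2)
trigrams = ngrams(docs[0], 3)
weights = tf_idf(docs)
```

A term that occurs in every document (here `villain`) receives a weight of zero, which is the property that makes TF-IDF useful for telling authors apart: only vocabulary that varies between texts contributes.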

## Modeling
- Topic modeling for Shakespeare: Latent Dirichlet Allocation (an unsupervised learning bonus)
- Naive Bayes classification
- Recurrent neural networks
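
To make the Naive Bayes step concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written from scratch in plain Python. The class name `NaiveBayes`, the toy token lists, and the labels are all assumptions for illustration; in practice a library implementation would probably be used on the TF-IDF or count features.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # number of documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {t for counts in self.word_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[t] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

# Toy, invented token lists: not drawn from the actual plays.
train_docs = [
    ["thou", "art", "fair"],
    ["thy", "love", "fair"],
    ["humour", "knave", "sir"],
    ["sir", "knave", "gull"],
]
train_labels = ["shakespeare", "shakespeare", "other", "other"]

model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict(["fair", "love"])
```

Working in log space avoids numerical underflow, which matters here because a full play contributes tens of thousands of tokens to each product of probabilities.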

Note: I added The Two Noble Kinsmen, Pericles, and Edward III to the Shakespeare class, since Shakespearean scholars generally agree that they can be at least partly attributed to him. The same is not true of Locrine, Cromwell, and Sir John Oldcastle, each of which has at some point been published under Shakespeare's name. Although their authorship is still occasionally debated, it is often thought that publishers intentionally used Shakespeare's name to sell the plays under false pretenses.

Elizabethan stopword list: https://github.com/Kaguilar1222/gutenburg_nlp/blob/master/stopwords_elizabethan

In [1]:
import os, shutil

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline

import pandas as pd
import numpy as np

from sklearn.manifold import TSNE

from nltk.tokenize import word_tokenize

np.random.seed(0)

In [7]:
non_shakespeare_directory = 'data/Other'
shakespeare_directory = 'data/Shakespeare'
non_shakespeare_filenames = os.listdir(non_shakespeare_directory)
shakespeare_filenames = os.listdir(shakespeare_directory)

# checking for unfavorable class imbalance
print(f'There are {len(non_shakespeare_filenames)} non-Shakespeare plays')
print(f'There are {len(shakespeare_filenames)} Shakespeare plays')

There are 52 non-Shakespeare plays
There are 37 Shakespeare plays
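
With these counts, the planned hold-out can be sketched as follows. This assumes the plan means 5 plays per class in each of the test and validation splits; `split_filenames` is a hypothetical helper, and the placeholder file names stand in for the real directory listings.

```python
import random

def split_filenames(filenames, n_holdout=5, seed=0):
    """Shuffle one class's files and hold out n_holdout each for test and validation."""
    shuffled = sorted(filenames)            # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_holdout]
    validation = shuffled[n_holdout:2 * n_holdout]
    train = shuffled[2 * n_holdout:]
    return train, validation, test

# Placeholder names (assumption) matching the class sizes printed above.
other_files = [f"other_{i}.txt" for i in range(52)]
shakespeare_files = [f"shakespeare_{i}.txt" for i in range(37)]

other_train, other_val, other_test = split_filenames(other_files)
shk_train, shk_val, shk_test = split_filenames(shakespeare_files)

print(len(other_train), len(shk_train))  # 42 27
```

Splitting each class separately keeps both held-out sets balanced at 5 versus 5, while the training set stays moderately imbalanced at 42 versus 27.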


In [10]:
with open('data/Shakespeare/1508-0.txt', encoding='utf-8') as f:  # read as UTF-8 so curly quotes are not garbled
    taming_of_the_shrew = f.readlines()
    print(taming_of_the_shrew)



In [20]:
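# Lowercase markers for the end of the front matter (lines are lowercased before matching).
MARKERS = ('dramatis person', 'persons represented', 'the actors names',
           'the puritaine widdow', 'persons of the', 'the fatall dowry:',
           'start of this project gutenberg ebook')

def remove_notes(play):
    """Return the lines that follow the last front-matter marker in `play`."""
    removed_notes = []
    for i, line in enumerate(play):
        line_lower = line.lower()
        if line_lower == 'actors\n' or any(m in line_lower for m in MARKERS):
            # No break: a later marker (e.g. the cast list) overrides an earlier one.
            removed_notes = play[i + 1:]
    return removed_notes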

In [21]:
remove_notes(taming_of_the_shrew)

['\n',
 'Persons in the Induction\n',
 'A LORD\n',
 'CHRISTOPHER SLY, a tinker\n',
 'HOSTESS\n',
 'PAGE\n',
 'PLAYERS\n',
 'HUNTSMEN\n',
 'SERVANTS\n',
 '\n',
 'BAPTISTA MINOLA, a rich gentleman of Padua\n',
 'VINCENTIO, an old gentleman of Pisa\n',
 'LUCENTIO, son to Vincentio; in love with Bianca\n',
 'PETRUCHIO, a gentleman of Verona; suitor to Katherina\n',
 '\n',
 'Suitors to Bianca\n',
 'GREMIO\n',
 'HORTENSIO\n',
 '\n',
 'Servants to Lucentio\n',
 'TRANIO\n',
 'BIONDELLO\n',
 '\n',
 'Servants to Petruchio\n',
 'GRUMIO\n',
 'CURTIS\n',
 '\n',
 'PEDANT, set up to personate Vincentio\n',
 '\n',
 'Daughters to Baptista\n',
 'KATHERINA, the shrew\n',
 'BIANCA\n',
 '\n',
 'WIDOW\n',
 '\n',
 'Tailor, Haberdasher, and Servants attending on Baptista and Petruchio\n',
 '\n',
 'SCENE: Sometimes in Padua, and sometimes in PETRUCHIO’S house in\n',
 'the country.\n',
 '\n',
 '\n',
 '\n',
 '\n',
 'INDUCTION\n',
 '\n',
 'SCENE I. Before an alehouse on a heath.\n',
 '\n',
 'Enter Hostess and S

In [23]:
with open('data/Other/4011.txt') as f:
    epicoene = f.readlines()
    print(epicoene)

['The Project Gutenberg EBook of Epicoene, by Ben Jonson\n', '\n', 'This eBook is for the use of anyone anywhere at no cost and with\n', 'almost no restrictions whatsoever.  You may copy it, give it away or\n', 're-use it under the terms of the Project Gutenberg License included\n', 'with this eBook or online at www.gutenberg.org\n', '\n', '\n', 'Title: Epicoene\n', '       Or, The Silent Woman\n', '\n', 'Author: Ben Jonson\n', '\n', 'Release Date: May, 2003  [Etext #4011]\n', 'Posting Date: December 10, 2009\n', '\n', 'Language: English\n', '\n', 'Character set encoding: ASCII\n', '\n', '*** START OF THIS PROJECT GUTENBERG EBOOK EPICOENE ***\n', '\n', '\n', '\n', '\n', 'Produced by Amy E Zelmer, Robert Prince, and Sue Asscher\n', '\n', '\n', '\n', '\n', '\n', 'EPICOENE; OR, THE SILENT WOMAN\n', '\n', '\n', 'By Ben Jonson\n', '\n', '\n', '\n', '\n', '\n', 'INTRODUCTION\n', '\n', 'THE greatest of English dramatists except Shakespeare, the first\n', 'literary dictator and poet-laureate, 

In [24]:
remove_notes(epicoene)

['\n',
 'MOROSE, a Gentleman that loves no noise.\n',
 '\n',
 'SIR DAUPHINE EUGENIE, a Knight, his Nephew.\n',
 '\n',
 'NED CLERIMONT, a Gentleman, his Friend.\n',
 '\n',
 'TRUEWIT, another Friend.\n',
 '\n',
 'SIR JOHN DAW, a Knight.\n',
 '\n',
 'SIR AMOROUS LA-FOOLE, a Knight also.\n',
 '\n',
 'THOMAS OTTER, a Land and Sea Captain.\n',
 '\n',
 'CUTBEARD, a Barber.\n',
 '\n',
 "MUTE, one of MOROSE's Servants.\n",
 '\n',
 'PARSON.\n',
 '\n',
 'Page to CLERIMONT.\n',
 '\n',
 'EPICOENE, supposed the Silent Woman.\n',
 '\n',
 'LADY HAUGHTY, LADY CENTAURE, MISTRESS DOL MAVIS,\n',
 'Ladies Collegiates.\n',
 '\n',
 "MISTRESS OTTER, the Captain's Wife, MISTRESS TRUSTY,\n",
 "LADY HAUGHTY'S Woman, Pretenders.\n",
 '\n',
 'Pages, Servants, etc.\n',
 '\n',
 '\n',
 'SCENE -- LONDON.\n',
 '\n',
 '\n',
 '\n',
 '\n',
 'PROLOGUE\n',
 '\n',
 '   Truth says, of old the art of making plays\n',
 '   Was to content the people; and their praise\n',
 '   Was to the poet money, wine, and bays.\n',
 '\n',
 ' 

In [None]:
def clean_lines(lines):
    """Drop lines containing brackets or underscores, then strip punctuation and lowercase the rest."""
    cleaned_lines = []
    for line in lines:
        # skip lines with [bracketed] notes or _underscore_ markup
        if '[' in line or ']' in line or '_' in line:
            continue
        for symbol in ",.?!'\n":
            line = line.replace(symbol, '')
        cleaned_lines.append(line.lower())
    return cleaned_lines

In [None]:
# TODO: replace the long s (ſ) with a regular s
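
The long-s replacement noted above could look something like this. `normalize_long_s` is a hypothetical helper name and the sample word is invented; the only assumption about the texts is that the long s appears as the Unicode character U+017F.

```python
def normalize_long_s(text):
    """Replace the archaic long s (U+017F) with a modern lowercase s."""
    return text.replace("\u017f", "s")

sample = "Mi\u017ftre\u017f\u017fe"   # invented example spelling with long s characters
print(normalize_long_s(sample))       # Mistresse
```

Running this before tokenization means spellings with and without the long s end up as the same token.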