# Introduction

One of the most fascinating debates in literature is the authorship of Shakespeare's plays. My objective with this project is to explore various NLP methods and to use Elizabethan-era plays for a supervised classification task to determine if a play is or is not written by Shakespeare. I have 39 .txt files for plays that were written by Shakespeare (two of which are only partial, more recent attributions) and 50 .txt files that are attributed to Shakespeare's contemporaries.

As I am only dealing with a slight class imbalance, I am not making any changes to the data to correct for this when building my authorship attribution models.

In [1]:
# importing libraries for file directories, arrays, tf-idf, and tokenization
import os, shutil

import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords, wordnet
from nltk.collocations import *
from nltk import FreqDist
from nltk import word_tokenize
import string
import re

import random
np.random.seed(0)

import warnings 
warnings.filterwarnings('ignore')

# Creating Directories

Please note: I added Pericles, Two Noble Kinsmen and Edward III to the Shakespeare data as they are generally recognized as partial or full-attributions by Shakespearean scholars and the Folger Shakespeare Library. The main Shakespearean corpus of 37 does not include these 3 plays, and often only the original 37 are recognized.

The same is not true of Lucrice, Cromwell, and Sir John Oldcastle (among other plays which Project Gutenberg attributes to Shakespeare) which had various turns being published under Shakespeare's name. Although these are occasionally still debated, it is often thought that publishers intentionally used Shakespeare's name to sell the plays under false pretenses and are generally accepted outside of Project Gutenberg not to be written by him.

## Manual Train/Test Split

Below in comments are the lines of code I used to divide out my train/test datasets.

In [2]:
# # saving filepaths for original folders
# non_shakespeare_directory = 'data/Other'
# shakespeare_directory = 'data/Shakespeare'
# non_shakespeare_filenames = os.listdir(non_shakespeare_directory)
# shakespeare_filenames = os.listdir(shakespeare_directory)

In [3]:
# saving filepaths for new train/test folders 
train_folder = os.path.join('data/', 'train')
train_shakespeare = os.path.join(train_folder, 'shakespeare')
train_not_shakespeare = os.path.join(train_folder, 'not_shakespeare')

test_folder = os.path.join('data/', 'test')
test_shakespeare = os.path.join(test_folder, 'shakespeare')
test_not_shakespeare = os.path.join(test_folder, 'not_shakespeare')

In [4]:
# # implementing train/test split
# os.mkdir(test_folder)
# os.mkdir(test_shakespeare)
# os.mkdir(test_not_shakespeare)

# os.mkdir(train_folder)
# os.mkdir(train_shakespeare)
# os.mkdir(train_not_shakespeare)

In [5]:
# # creating test/train split in folders 
# import random

# # moving 9 random Shakespeare test plays
# for play in random.sample(os.listdir(shakespeare_directory), k=9):
#     origin = os.path.join(shakespeare_directory, play)
#     destination = os.path.join(test_shakespeare, play)
#     shutil.move(origin, destination)
    
# # train shakespeare
# for play in os.listdir(shakespeare_directory):
#     origin = os.path.join(shakespeare_directory, play)
#     destination = os.path.join(train_shakespeare, play)
#     shutil.move(origin, destination)
    
# # moving 9 random non-Shakespeare test plays
# for play in random.sample(os.listdir(non_shakespeare_directory), k=9):
#     origin = os.path.join(non_shakespeare_directory, play)
#     destination = os.path.join(test_not_shakespeare, play)
#     shutil.move(origin, destination)
    
# # train non-shakespeare
# for play in os.listdir(non_shakespeare_directory):
#     origin = os.path.join(non_shakespeare_directory, play)
#     destination = os.path.join(train_not_shakespeare, play)
#     shutil.move(origin, destination)

# Data Cleaning

Now that my files for each class are split out into train/test data, I am setting up this data to be cleaned and vectorized. I will be saving my data in 3 different forms for use in other notebooks:
* TF-IDF of word vectors
* TF-IDF of word vectors with dimensionality reduction
* TF-IDF of bigrams and trigrams together
* Lemmatized plays (still containing stopwords and punctuation)
* Tokenized plays (stopwords removed)

## Removing Notes

There are a lot of editors notes in each of the plays - not only introductions from scholars and historians, but also general usage legal language inserted by Project Gutenberg. At this stage I am also removing stopwords and punctuation. The stopwords list is supplemented by an Elizabethan-era stopwords list made available by Bryan Bumgardner here: https://bryanbumgardner.com/elizabethan-stop-words-for-nlp/.

In [6]:
# defining stopwords list
stopwords_list = stopwords.words('english')
stopwords_list += ['art', 'doth', 'dost', 'ere', 'ere', 'hast', 'hath',
                   'hence', 'hither', 'nigh', 'oft', 'shouldst', 'thither',
                   'thee', 'thou', 'thine', 'thy', 'tis', 'twas', 'wast',
                   'whence', 'wherefore', 'whereto', 'withal', 'wouldst',
                   'ye', 'yon', 'yonder', 'th']
stopwords_list += list(string.punctuation)

# defining regex pattern
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"

In [7]:
# defining function to clean play data and remove editors notes
def remove_header(play):
    play_lower = [line.lower() for line in play]
    removed_notes = []
# deleting editors notes at beginning of text 
    for i, line in enumerate(play_lower):
        if ('dramatis person' in line)\
        or ('drammatis person' in line)\
        or ('persons represented' in line)\
        or ('the actors names' in line)\
        or ('Persons of the' in line)\
        or ('printed by iohn norton:' in line):
            removed_notes = play_lower[i+1:]
            break
        else:
            removed_notes = play_lower
    return removed_notes


def remove_appendix(prologue_removed):
    removed_appendix = []
    for i, line in enumerate(prologue_removed):
        if ('end of this project gutenberg ebook' in line):
            removed_appendix = prologue_removed[:i]
            break
        else:
            removed_appendix = prologue_removed
    return removed_appendix

# deleting empty lines 
def remove_blank_lines(removed_notes):
    for i, line in enumerate(removed_notes):
        if (line == '\n'):
            del removed_notes[i]
    return removed_notes

bad_words = ['gutenberg', 'commercial', 'copyright', 'shakespeare',\
             'full license', 'united states', 'carnegie mellon',\
            'donations', 'ebook', 'legal', 'ascii', 'electronic', 'download'\
            'online', 'restrictions', '@']
bad_words += [str(i) for i in list(range(10))]

      
# deleting line breaks from text, replacing archaic 's', removing footnotes where remaining
def remove_notes(removed_notes):
    edited_play = []
    for line in removed_notes:
        line = line.replace('ſ', 's').replace('\n', '')\
        .replace('\\', '')
        if not any (word in line for word in bad_words):
            edited_play.append(line)
    return edited_play

In [8]:
# creating training/test corpus and target arrays
def main(shakes_directory, not_shakes_directory):
    corpus = []
    list_target = []
    # Iterate through list of filenames and read each it
    for file in os.listdir(shakes_directory):
        with open(shakes_directory + '/' + file, encoding='utf8', errors='ignore') as f:
            raw_data = f.readlines()
        # cleaning raw text
        header_removed = remove_header(raw_data)
        appendix_removed = remove_appendix(header_removed)
        blanks_removed = remove_blank_lines(appendix_removed)
        cleaned = remove_notes(blanks_removed)
        corpus.append(' '.join(cleaned))
        list_target.append(1)
    for file in os.listdir(not_shakes_directory):
        with open(not_shakes_directory + '/' + file, encoding='utf8', errors='ignore') as f:
            raw_data = f.readlines()
        # cleaning raw text
        header_removed = remove_header(raw_data)
        appendix_removed = remove_appendix(header_removed)
        blanks_removed = remove_blank_lines(appendix_removed)
        cleaned = remove_notes(blanks_removed)
        corpus.append(' '.join(cleaned))
        list_target.append(0)
    return corpus, np.array(list_target)

        
train_corpus, y_train = main(train_shakespeare, train_not_shakespeare)
test_corpus, y_test = main(test_shakespeare, test_not_shakespeare)

train_corpus[0][2000:2200]

'  and humbly now upon my bended knee,     in sight of england and her lordly peers,     deliver up my title in the queen     to your most gracious hands, that are the substance     of that great shado'

## Creating Lemmatized Plays

In [9]:
# preparing plays by converting lines into sentences that can be POS tagged and lemmatized
train_corpus_sentences = [re.split('[._!?]', play) for play in train_corpus]
test_corpus_sentences = [re.split('[._!?]', play) for play in test_corpus]

In [10]:
# initializing lemmatizer
lemmatizer = WordNetLemmatizer()

# getting POS tags from wordnet (we only need the first letter of the tag)
def get_pos(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:                    
        return None

# tokenizing sentence, getting POS tag, creating tuple of word and tag generated above
def lemmatize_plays(corpus):
    corpus_to_return = []
    for play in corpus:
        lemmatized_sentences = [] 
        for sentence in play:
            nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))    
            wn_tagged = map(lambda x: (x[0], get_pos(x[1])), nltk_tagged)
            res_words = []
            # returning lemmatized sentence and returning list of sentences
            for word, tag in wn_tagged:
                if tag is None:                        
                    res_words.append(word)
                else:
                    res_words.append(lemmatizer.lemmatize(word, tag))
            lemmatized_sentences.append(' '.join(res_words))
        corpus_to_return.append(' '.join(lemmatized_sentences))
    return corpus_to_return

In [11]:
# tokenizing, lemmatizing 
train_lemmatized = lemmatize_plays(train_corpus_sentences)
test_lemmatized = lemmatize_plays(test_corpus_sentences)

## Creating Play Tokens

In [12]:
# stripping stopwords, tokenizing
def tokenize_stop(lemmatized_plays):
    tokens = []
    for play in lemmatized_plays:
        tokens_raw = nltk.regexp_tokenize(play, pattern)
        stopped_play = [word for word in tokens_raw if word not in stopwords_list]
        tokens.append(stopped_play)
    return tokens
train_tokens = tokenize_stop(train_lemmatized)
test_tokens = tokenize_stop(test_lemmatized)

## Creating TF-IDFs

Below I am creating TF-IDF vector representations of the texts, and saving both vectors with all words in the corpus, and a dimensionality reduced representation that still has 80% of explained variance (this should hopefully address any overfitting).

In [13]:
# initializing TF-IDF vectorizer
tfidf = TfidfVectorizer(stop_words=stopwords_list, token_pattern=pattern)
X_train = tfidf.fit_transform(train_lemmatized)
X_test = tfidf.transform(test_lemmatized)

In [14]:
# importing library to lower dimensions of TF-IDF vector in a way that keeps 80% of explained variance
from sklearn.decomposition import TruncatedSVD

# creating lower-dimension TF-IDF vectors 
svd = TruncatedSVD(n_components=50)
X_train_50 = svd.fit_transform(X_train)
X_test_50 = svd.transform(X_test)
print(np.sum(svd.explained_variance_ratio_))

0.8057528066646362


In [15]:
# initializing TF-IDF vectorizer for bigrams
tfidf_ngrams = TfidfVectorizer(stop_words=stopwords_list,
                               token_pattern=pattern,
                               ngram_range=(2, 2))

X_train_ngrams = tfidf_ngrams.fit_transform(train_lemmatized)
X_test_ngrams = tfidf_ngrams.transform(test_lemmatized)

# Saving to Pickle

I am saving all 5 different representations of my text for use in other notebooks; as well, of course, as the true class labels. 

In [16]:
# saving files as pickle objects for use in other notebooks
import pickle

pickle_jar = [X_train, X_test, 
              y_train, y_test,
              X_train_50, X_test_50,
              X_train_ngrams, X_test_ngrams,
              train_lemmatized, test_lemmatized,
              train_tokens, test_tokens]

with open('data/pickle_jar.pickle', 'wb') as f:
    pickle.dump(pickle_jar, f, pickle.HIGHEST_PROTOCOL)

In the next notebook I will be exploring methods for visualizing our data, to identify some common words, see how separable our classes are, and how we can use vectors to determine word similarities.