##  Generative Model  

This notebook implements a generative model for authorship attribution based on the paper of Fox et al. (2012) and compares a Naive Bayes classifier with a Support Vector Machine. A pandas data frame is used for data management.
The model contains the following features:
- Word frequency of stop words
- Frequency of POS tags
- Bigrams of POS tags and stop words

Because the stop word list used by Fox et al. is not available, we explore different sets of stop words to determine which is most helpful.
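
The three feature groups can be illustrated on a toy example. This is only a sketch: the stop word list, the sentence, and the hand-written POS strings below are made up for illustration, whereas in the notebook the POS tags come from spaCy.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy inputs (not taken from the corpus).
toy_stops = ["it", "was", "upon", "this"]
raw_text = ["It was upon this fashion"]
pos_only = ["NOUN"]                      # POS tags of the non-stop words
pos_or_stop = ["it was upon this NOUN"]  # stop words kept, other words replaced by their POS tag

# 1) frequency of each stop word (fixed vocabulary)
stop_counts = CountVectorizer(vocabulary=toy_stops).fit_transform(raw_text).toarray()

# 2) frequency of each POS tag
pos_counts = CountVectorizer().fit_transform(pos_only).toarray()

# 3) bigrams over the mixed POS / stop word sequence
bigram_vec = CountVectorizer(ngram_range=(2, 2))
bigram_counts = bigram_vec.fit_transform(pos_or_stop).toarray()
bigrams = sorted(bigram_vec.vocabulary_)
```

With the default settings, `CountVectorizer` lowercases its input and ignores tokens shorter than two characters, so POS tags show up in lowercase in the vocabulary and one-letter stop words are never counted.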

In [78]:
#NOTE

#Results can differ depending on how the CountVectorizers in the get_features functions are built. Two versions 
#have been tried; their effect on the results is inconsistent, sometimes better and sometimes worse 
#for each feature.
#To compare SVM and NB, version 1 is used.

#Version 1 
#cvec1 = CountVectorizer(vocabulary = stopwords_507, strip_accents="ascii")#word freq of stop words
#cvec2 = CountVectorizer() #freq of POS tags
#cvec3 = CountVectorizer(ngram_range=(2,2), strip_accents="ascii") #ngrams of POS/stops

#Version 2
#cvec1 = CountVectorizer(vocabulary= stopwords_507, strip_accents="ascii")#word freq of stop words
#cvec2 = CountVectorizer(max_features=1000, strip_accents="ascii") #freq of POS tags
#cvec3 = CountVectorizer(ngram_range=(2,2),max_features=1000, strip_accents="ascii") #ngrams of POS/stops 
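
The difference between the two versions is the `max_features` cap, which keeps only the most frequent terms of the fitted corpus. A minimal sketch on made-up POS strings (not the corpus data) shows the effect:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical POS strings, not taken from the corpus.
toy_pos = ["NOUN VERB NOUN", "NOUN ADJ VERB"]

full = CountVectorizer().fit(toy_pos)
capped = CountVectorizer(max_features=2).fit(toy_pos)

full_vocab = sorted(full.vocabulary_)      # all three tags (lowercased)
capped_vocab = sorted(capped.vocabulary_)  # only the two most frequent tags
```

Because the cap is applied to whatever the training fold contains, the retained vocabulary can change from fold to fold, which may be one reason the two versions behave differently.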

In [1]:
import os
import re
import spacy
import pandas as pd
from glob import glob
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split
from sklearn import linear_model

import nltk
from nltk.corpus import stopwords
from stop_words import get_stop_words

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

import matplotlib.pyplot as plt

import numpy as np

## Creating Dataframe

### Creating lists of stop words of different lengths to explore this as a feature
Lengths of the lists explored:
- 305, the stop word list from spaCy
- 710, the same length as the list used by Fox et al. (2012), taken from a combination of different packages
- 747, the union of all stop word lists, to get the maximum number of words

In [2]:
nlp = spacy.load("en")
stopwrd1 = list(nlp.Defaults.stop_words)


stopwords2 = stopwords.words('english')


stopwords3 = get_stop_words('english')

#from https://www.ranks.nl/stopwords
stopwords4 = ['a ', 'able', 'about', 'above', 'abst', 'accordance', 'according', 'accordingly', 'across', 'act', 'actually', 'added', 'adj', 'affected', 'affecting', 'affects', 'after', 'afterwards', 'again', 'against', 'ah', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'announce', 'another', 'any', 'anybody', 'anyhow', 'anymore', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apparently', 'approximately', 'are', 'aren', 'arent', 'arise', 'around', 'as', 'aside', 'ask', 'asking', 'at', 'auth', 'available', 'away', 'awfully', 'b', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'begin', 'beginning', 'beginnings', 'begins', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'between', 'beyond', 'biol', 'both', 'brief', 'briefly', 'but', 'by', 'c', 'ca', 'came', 'can', 'cannot', "can't", 'cause', 'causes', 'certain', 'certainly', 'co', 'com', 'come', 'comes', 'contain', 'containing', 'contains', 'could', 'couldnt', 'd', 'date', 'did', "didn't", 'different', 'do', 'does', "doesn't", 'doing', 'done', "don't", 'down', 'downwards', 'due', 'during', 'e', 'each', 'ed', 'edu', 'effect', 'eg', 'eight', 'eighty', 'either', 'else', 'elsewhere', 'end', 'ending', 'enough', 'especially', 'et', 'et-al', 'etc', 'even', 'ever', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'ex', 'except', 'f', 'far', 'few', 'ff', 'fifth', 'first', 'five', 'fix', 'followed', 'following', 'follows', 'for', 'former', 'formerly', 'forth', 'found', 'four', 'from', 'further', 'furthermore', 'g', 'gave', 'get', 'gets', 'getting', 'give', 'given', 'gives', 'giving', 'go', 'goes', 'gone', 'got', 'gotten', 'h', 'had', 'happens', 'hardly', 'has', "hasn't", 'have', "haven't", 'having', 'he', 'hed', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'heres', 'hereupon', 'hers', 'herself', 'hes', 'hi', 'hid', 'him', 'himself', 'his', 'hither', 'home', 
'how', 'howbeit', 'however', 'hundred', 'i', 'id', 'ie', 'if', "i'll", 'im', 'immediate', 'immediately', 'importance', 'important', 'in', 'inc', 'indeed', 'index', 'information', 'instead', 'into', 'invention', 'inward', 'is', "isn't", 'it', 'itd', "it'll", 'its', 'itself', "i've", 'j', 'just', 'k', 'keep\tkeeps', 'kept', 'kg', 'km', 'know', 'known', 'knows', 'l', 'largely', 'last', 'lately', 'later', 'latter', 'latterly', 'least', 'less', 'lest', 'let', 'lets', 'like', 'liked', 'likely', 'line', 'little', "'ll", 'look', 'looking', 'looks', 'ltd', 'm', 'made', 'mainly', 'make', 'makes', 'many', 'may', 'maybe', 'me', 'mean', 'means', 'meantime', 'meanwhile', 'merely', 'mg', 'might', 'million', 'miss', 'ml', 'more', 'moreover', 'most', 'mostly', 'mr', 'mrs', 'much', 'mug', 'must', 'my', 'myself', 'n', 'na', 'name', 'namely', 'nay', 'nd', 'near', 'nearly', 'necessarily', 'necessary', 'need', 'needs', 'neither', 'never', 'nevertheless', 'new', 'next', 'nine', 'ninety', 'no', 'nobody', 'non', 'none', 'nonetheless', 'noone', 'nor', 'normally', 'nos', 'not', 'noted', 'nothing', 'now', 'nowhere', 'o', 'obtain', 'obtained', 'obviously', 'of', 'off', 'often', 'oh', 'ok', 'okay', 'old', 'omitted', 'on', 'once', 'one', 'ones', 'only', 'onto', 'or', 'ord', 'other', 'others', 'otherwise', 'ought', 'our', 'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 'owing', 'own', 'p', 'page', 'pages', 'part', 'particular', 'particularly', 'past', 'per', 'perhaps', 'placed', 'please', 'plus', 'poorly', 'possible', 'possibly', 'potentially', 'pp', 'predominantly', 'present', 'previously', 'primarily', 'probably', 'promptly', 'proud', 'provides', 'put', 'q', 'que', 'quickly', 'quite', 'qv', 'r', 'ran', 'rather', 'rd', 're', 'readily', 'really', 'recent', 'recently', 'ref', 'refs', 'regarding', 'regardless', 'regards', 'related', 'relatively', 'research', 'respectively', 'resulted', 'resulting', 'results', 'right', 'run', 's', 'said', 'same', 'saw', 'say', 'saying', 'says', 'sec', 
'section', 'see', 'seeing', 'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 'selves', 'sent', 'seven', 'several', 'shall', 'she', 'shed', "she'll", 'shes', 'should', "shouldn't", 'show', 'showed', 'shown', 'showns', 'shows', 'significant', 'significantly', 'similar', 'similarly', 'since', 'six', 'slightly', 'so', 'some', 'somebody', 'somehow', 'someone', 'somethan', 'something', 'sometime', 'sometimes', 'somewhat', 'somewhere', 'soon', 'sorry', 'specifically', 'specified', 'specify', 'specifying', 'still', 'stop', 'strongly', 'sub', 'substantially', 'successfully', 'such', 'sufficiently', 'suggest', 'sup', 'sure\tt', 'take', 'taken', 'taking', 'tell', 'tends', 'th', 'than', 'thank', 'thanks', 'thanx', 'that', "that'll", 'thats', "that've", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 
             'thered', 'therefore', 'therein', "there'll", 'thereof', 'therere', 'theres', 'thereto', 'thereupon', "there've", 'these', 'they', 'theyd', "they'll", 'theyre', "they've", 'think', 'this', 'those', 'thou', 'though', 'though', 'thousand', 'throug', 'through', 'throughout', 'thru', 'thus', 'til', 'tip', 'to', 'together', 'too', 'took', 'toward', 'towards', 'tried', 'tries', 'truly', 'try', 'trying', 'ts', 'twice', 'two', 'u', 'un', 'under', 'unfortunately', 'unless', 'unlike', 'unlikely', 'until', 'unto', 'up', 'upon', 'ups', 'us', 'use', 'used', 'useful', 'usefully', 'usefulness', 'uses', 'using', 'usually', 'v', 'value', 'various', "'ve", 'very', 'via', 'viz', 'vol', 'vols', 'vs', 'w', 'want', 'wants', 'was', 'wasnt', 'way', 'we', 'wed', 'welcome', "we'll", 'went', 'were', 'werent', "we've", 'what', 'whatever', "what'll", 'whats', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'wheres', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whim', 'whither', 'who', 'whod', 'whoever', 'whole', "who'll", 'whom', 'whomever', 'whos', 'whose', 'why', 'widely', 'willing', 'wish', 'with', 'within', 'without', 'wont', 'words', 'world', 'would', 'wouldnt', 'www', 'x', 'y', 'yes', 'yet', 'you', 'youd', "you'll", 'your', 'youre', 'yours', 'yourself', 'yourselves', "you've", 'z', 'zero']

        

stopwords_305 = stopwrd1
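#NOTE: set() has no stable order across Python sessions, so slices taken from stopwords_747 can differ between runs
stopwords_747= list(set(stopwrd1 + stopwords2+ stopwords3 +stopwords4))
stopwords_710= stopwords_747[:710]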
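
Because `stopwords_747` is built with `list(set(...))`, the order of its elements, and therefore which 710 words the slice keeps, is not guaranteed to be the same in every Python session. One possible way to make the subset reproducible, sketched here on hypothetical toy lists rather than the real stop word lists, is to sort before slicing:

```python
# Hypothetical toy lists standing in for the four stop word sources.
list_a = ["the", "and", "of"]
list_b = ["of", "to", "the"]

combined = sorted(set(list_a + list_b))  # deterministic, alphabetical order
subset = combined[:3]                    # always the same three words
```

Sorting means the slice drops the alphabetically last words, which is an arbitrary but repeatable choice.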


In [3]:
def get_ordered_lists(authors, corpus):
    """ 
    input: corpus as a list of file paths, and a list of authors
    
    output: list of lists in the same order as the author list, each containing all plays of that author
    """
    dramas_per_author=[]
    for author in authors:
        authorList=[]
        for drama in corpus:
            if author in drama:
                authorList.append(drama)
        dramas_per_author.append(authorList)
    
    return dramas_per_author

In [4]:
"""Don't run this! DF is saved and you can open it"""

def create_dataframe_oneText(corpus, authors, stopwrds710, stopwrds747):
    
    """
    input: all of the texts to be analysed, all authors, and two lists of stop words
    
    runtime: approx 14 min for 76 files 
    
    output: pandas DF with each txt file as one instance: 
    - Play ID to identify each text, and name of play
    - Raw text
    - POS tags if word not in respective stop word list
    - POS tags or stop words, dependent on the stop word list
    - lemmas
    - author name and label for each author
    """
    nlp = spacy.load("en")
    data=[]   
    corpus_per_author = get_ordered_lists(authors, corpus)
    
    play_id = -1
    for idx, corpus in enumerate(corpus_per_author):
        author=authors[idx]
        label=idx

        for play in corpus:
            play_id += 1

            drama=play[3:]
            
            with open(play) as f: #close the file again after reading
                file = f.read()
            doc = nlp(file)
            
            POS305= " "
            POS710= " "
            POS747= " "
            
            POS_stwrd305 = " "
            POSstopwrds710 = " "
            POSstopwrds747 = " "

            lemmas= " "

            for word in doc:
                
                #getting the lemmas
                lemmas += str(word.lemma_)
                lemmas += str(" ")
                
                #getting either POS or stpwrd with 305 stops
                if word.is_stop:
                    POS_stwrd305 += str(word) 
                    POS_stwrd305 += str(" ") 
                else:
                    POS_stwrd305  += str(word.pos_)
                    POS_stwrd305 += str(" ")
                    
                #getting either POS or stpwrd with 710 stops                         
                if str(word) in stopwrds710:
                    POSstopwrds710 += str(word) 
                    POSstopwrds710 += str(" ")
                else:
                    POSstopwrds710  += str(word.pos_)
                    POSstopwrds710  += str(" ")
                    
                #getting either POS or stpwrd with 747 stops                         
                if str(word) in stopwrds747:
                    POSstopwrds747 += str(word) 
                    POSstopwrds747 += str(" ")
                else:
                    POSstopwrds747  += str(word.pos_)
                    POSstopwrds747  += str(" ")
                    
                #getting POS if not stop with 305 stops     
                if not word.is_stop:
                    POS305 += str(word.pos_)
                    POS305 += " "
                    
                #getting POS if not stop with 710 stops     
                if str(word) not in stopwrds710:
                    POS710 += str(word.pos_)
                    POS710 += " "

                #getting POS if not stop with 747 stops     
                if str(word) not in stopwrds747:
                    POS747 += str(word.pos_)
                    POS747 += " "
                        
                        
                    
            data.append((play_id, file, POS305, POS710, POS747,  POS_stwrd305, POSstopwrds710, POSstopwrds747,  lemmas, label, author, drama))
            
    df=pd.DataFrame(data,
                    columns=
                    ["Play_id","Raw Text", "POS_305", "POS_710", "POS_747", "POSstops_305", "POSstops_710", "POSstops_747", "Lemmas", "Label", "Author", "Play"] )
    return df



corpus= glob("El/*tok*")
y_authors = ['E-Shakespeare', 'L-Shakespeare', 'Marlowe', 'Middleton','Jonson', 'Chapman']
df = create_dataframe_oneText(corpus, y_authors, stopwords_710, stopwords_747)



In [6]:
#saves the data frame as DataFrame.xlsx in the working directory
df.to_excel("DataFrame.xlsx")
df.shape #df has 77 instances (one per text), with 12 columns

(77, 12)

In [4]:
"""Read in DF here"""
#this is how you read an excel file as a DF  
df= pd.read_excel("DataFrame.xlsx")

In [8]:
#Test
df.head()

Unnamed: 0,Play_id,Raw Text,POS_305,POS_710,POS_747,POSstops_305,POSstops_710,POSstops_747,Lemmas,Label,Author,Play
0,0,"As I remember , Adam , it was upon this fashio...",ADP PRON VERB PUNCT PROPN PUNCT NOUN SPACE VE...,ADP PRON VERB PUNCT PROPN PUNCT NOUN SPACE VE...,ADP PRON VERB PUNCT PROPN PUNCT NOUN SPACE VE...,ADP PRON VERB PUNCT PROPN PUNCT it was upon t...,ADP PRON VERB PUNCT PROPN PUNCT it was upon t...,ADP PRON VERB PUNCT PROPN PUNCT it was upon t...,"as -PRON- remember , adam , -PRON- be upon th...",0,E-Shakespeare,asyoulikeit.txt.E-Shakespeare.tok
1,1,"Proceed , Solinus , to procure my fall\nAnd by...",PROPN PUNCT PROPN PUNCT VERB NOUN SPACE CCONJ...,PROPN PUNCT PROPN PUNCT VERB NOUN SPACE CCONJ...,PROPN PUNCT PROPN PUNCT VERB NOUN SPACE CCONJ...,PROPN PUNCT PROPN PUNCT to VERB my NOUN SPACE...,PROPN PUNCT PROPN PUNCT to VERB my NOUN SPACE...,PROPN PUNCT PROPN PUNCT to VERB my NOUN SPACE...,"proceed , solinus , to procure -PRON- fall \n...",0,E-Shakespeare,comedy_errors.txt.E-Shakespeare.tok
2,2,"Who 's there ?\nNay , answer me : stand , and ...",NOUN VERB PUNCT SPACE PROPN PUNCT VERB PUNCT ...,NOUN VERB PUNCT SPACE PROPN PUNCT VERB PUNCT ...,NOUN VERB PUNCT SPACE PROPN PUNCT VERB PUNCT ...,NOUN VERB there PUNCT SPACE PROPN PUNCT VERB ...,NOUN VERB there PUNCT SPACE PROPN PUNCT VERB ...,NOUN VERB there PUNCT SPACE PROPN PUNCT VERB ...,"who be there ? \n nay , answer -PRON- : stand...",0,E-Shakespeare,Hamlet.txt.E-Shakespeare.tok
3,3,"So shaken as we are , so wan with care ,\nFind...",ADV VERB PUNCT NOUN NOUN PUNCT SPACE VERB NOU...,ADV VERB PUNCT NOUN NOUN PUNCT SPACE VERB NOU...,ADV VERB PUNCT NOUN NOUN PUNCT SPACE VERB NOU...,ADV VERB as we are PUNCT so NOUN with NOUN PU...,ADV VERB as we are PUNCT so NOUN with NOUN PU...,ADV VERB as we are PUNCT so NOUN with NOUN PU...,"so shake as -PRON- be , so wan with care , \n...",0,E-Shakespeare,henryivPart1.txt.E-Shakespeare.tok
4,4,Open your ears ; for which of you will stop\nT...,VERB NOUN PUNCT VERB SPACE DET NOUN VERB ADJ ...,VERB NOUN PUNCT SPACE DET NOUN VERB ADJ PROPN...,VERB NOUN PUNCT SPACE DET NOUN VERB ADJ PROPN...,VERB your NOUN PUNCT for which of you will VE...,VERB your NOUN PUNCT for which of you will st...,VERB your NOUN PUNCT for which of you will st...,open -PRON- ear ; for which of -PRON- will st...,0,E-Shakespeare,henryivPart2.txt.E-Shakespeare.tok


## Classifiers 

In [5]:
def  predict_NB(train_features, train_label, test_features): 
    """
    Naive Bayes Classifier
    
    input: features for train and test instances and label for train instances
    
    returns prediction for instance and coefficient matrix and classifier
    
    """
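    # tried several smoothing values; alpha=0.1 seemed to work best
    clf = MultinomialNB(alpha=0.1) #using the same classifier as Fox et al.
    clf.fit(train_features, train_label)
    pred=clf.predict(test_features)
    coef = clf.feature_log_prob_ #per-class feature log probabilities; coef_ is no longer available on MultinomialNB in newer scikit-learn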
    return pred, coef, clf

In [6]:
def  predict_SVM(train_features, train_label, test_features): 
    """
    Support Vector Machine Classifier
    
    input: features for train and test instances and label for train instances
    
    returns prediction for instance and coefficient matrix and classifier
    
    """
    
    clf = LinearSVC() 
    clf.fit(train_features, train_label)
    pred=clf.predict(test_features)
    coef = clf.coef_
    return pred, coef, clf

In [None]:
#NOTE

#using the SVM may create a non convergence warning, which does not affect the output 
#and cannot be fixed by an increase of iterations

## Leave-one-out Cross Validation

In [7]:
def get_LOO_instances(dataframe):
    """
    
    generates instances for LOO
    
    input: df
    
    output: list with instances as tuples: [(train_row, test_row)]
    
    """
    loo_instances=[]
    for instance in range(dataframe.shape[0]):
        inst= list(range(dataframe.shape[0]))
        inst.remove(instance)
        test=dataframe.drop(inst)
        train= dataframe.drop([instance])
        loo_instances.append((train, test))
    return loo_instances

instances = get_LOO_instances(df)

In [8]:
#Test
instances[1][1]["Label"]

1    0
Name: Label, dtype: int64
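
The same train/test pairs can also be produced with scikit-learn's `LeaveOneOut` splitter. The following is a sketch on a hypothetical toy frame, not a replacement that has been run on the real data:

```python
import pandas as pd
from sklearn.model_selection import LeaveOneOut

# Hypothetical toy frame standing in for the real data frame.
toy_df = pd.DataFrame({"Raw Text": ["a b", "c d", "e f"], "Label": [0, 0, 1]})

loo_instances = []
for train_idx, test_idx in LeaveOneOut().split(toy_df):
    # iloc selects rows by position, so this also works with a non-default index
    loo_instances.append((toy_df.iloc[train_idx], toy_df.iloc[test_idx]))
```

Each tuple holds a training frame with all rows but one and a one-row test frame, matching the `[(train_row, test_row)]` structure returned by `get_LOO_instances`.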

## Implementing Features Used by Fox et al. (2012) as a Basis with 710 Stop Words

In [79]:

def get_features_GM_710(X_train, X_test):
    
    """
    Input: data used to get model, both for testing and training
    output: features for train and test instances 
    """
    
    #getting the columns I want to use for my features from the instances passed
    X_txt = X_train["Raw Text"]
    X_POS = X_train["POS_710"]
    X_POSstops = X_train[ "POSstops_710"]
    
    X_test_txt= X_test["Raw Text"]
    X_test_POS = X_test["POS_710"]
    X_test_POSstops = X_test[ "POSstops_710"]
    
    
    # using countvectorizors to get freqs 
    cvec1 = CountVectorizer(vocabulary= stopwords_710, strip_accents="ascii")#word freq of stop words
    cvec2 = CountVectorizer() #freq of POS tags
    cvec3 = CountVectorizer(ngram_range=(2,2), strip_accents="ascii") #ngrams of POS/stops 

   # fitting  and transforming of train data, and stack vectors at the same time to get X for model, 
    train_features= np.hstack((
        cvec1.fit_transform(X_txt).toarray(),
        cvec2.fit_transform(X_POS).toarray(),
        cvec3.fit_transform(X_POSstops).toarray(),
        ))
    
    #only transform X_test data, using the vectorizers fitted on the training data
    test_features = np.hstack((
            cvec1.transform(X_test_txt).toarray(),
            cvec2.transform(X_test_POS).toarray(),
            cvec3.transform(X_test_POSstops).toarray(),
        ))
    



    
    return train_features, test_features
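
The feature functions for the different stop word lists differ only in the column suffix and the vocabulary. A single parameterised function could cover all of them; the sketch below uses a hypothetical name (`get_features_GM`), hypothetical parameters (`suffix`, `stop_list`), and a made-up toy frame:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def get_features_GM(X_train, X_test, suffix, stop_list):
    """Hypothetical generalisation: `suffix` picks the POS columns, `stop_list` the vocabulary."""
    columns_and_vectorizers = [
        ("Raw Text", CountVectorizer(vocabulary=stop_list, strip_accents="ascii")),
        ("POS_" + suffix, CountVectorizer()),
        ("POSstops_" + suffix, CountVectorizer(ngram_range=(2, 2), strip_accents="ascii")),
    ]
    train_parts, test_parts = [], []
    for column, cvec in columns_and_vectorizers:
        train_parts.append(cvec.fit_transform(X_train[column]).toarray())  # fit on train only
        test_parts.append(cvec.transform(X_test[column]).toarray())        # reuse train vocabulary
    return np.hstack(train_parts), np.hstack(test_parts)

# Made-up toy data with the same column naming scheme as the notebook.
toy = pd.DataFrame({
    "Raw Text": ["it was here", "so it is", "was it so"],
    "POS_toy": ["NOUN VERB", "VERB", "NOUN"],
    "POSstops_toy": ["it was NOUN VERB", "so it VERB", "was it NOUN"],
})
train_features, test_features = get_features_GM(toy.iloc[:2], toy.iloc[2:], "toy", ["it", "was", "so"])
```

Bigrams that only occur in the held-out row get no column, because the bigram vocabulary is learned from the training rows alone.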


In [80]:
def get_preds_710(dataframe):
    
    """
    swap in a different get_features_GM_* function below for feature engineering!
    input: Dataframe
    
    output: predictions for test instances
    """
    
    instances= get_LOO_instances(dataframe)
    predictions_NB=[]
    predictions_SVM=[]
    for inst in instances:
        train_data= inst[0]
        test_data = inst[1]
        train_label=train_data["Label"]
        train_features, test_features= get_features_GM_710(train_data, test_data)
        pred_NB, coeff_NB, cls_NB = predict_NB(train_features, train_label, test_features)
        pred_SVM, coeff_SVM, cls_SVM = predict_SVM(train_features, train_label, test_features)
        predictions_NB.append(pred_NB)
        predictions_SVM.append(pred_SVM)
        
        
    return np.array(predictions_NB), coeff_NB, cls_NB, np.array(predictions_SVM), coeff_SVM, cls_SVM 

In [81]:
predictions_NB_710, coeff_NB_710, cls_NB_710, predictions_SVM_710, coeff_SVM_710, cls_SVM_710  =  get_preds_710(df) 





In [82]:
df["Pred_NB_710"]=predictions_NB_710 #I add my predictions to the data frame as a new column
df["Pred_SVM_710"]=predictions_SVM_710 #I add my predictions to the data frame as a new column

In [83]:
#get the acc by comparing pred to the labels 
df["acc_NB_710"] =df["Label"]==df[ "Pred_NB_710"]
acc_NB_710 = df.loc[df.acc_NB_710 == True, 'acc_NB_710'].count()/df.shape[0]
print(acc_NB_710)

0.8181818181818182


In [84]:
#get the acc by comparing pred to the labels 
df["acc_SVM_710"] =df["Label"]==df[ "Pred_SVM_710"]
acc_SVM_710 = df.loc[df.acc_SVM_710 == True, 'acc_SVM_710'].count()/df.shape[0]
print(acc_SVM_710)

0.8051948051948052
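
Accuracy here is the share of plays whose predicted label equals the true label. As a cross-check, the same number can be obtained from the mean of the boolean comparison or from scikit-learn's `accuracy_score`. The arrays below are hypothetical and only illustrate the calculation:

```python
import numpy as np
from sklearn.metrics import accuracy_score

labels = np.array([0, 0, 1, 2])         # hypothetical gold labels
preds = np.array([[0], [1], [1], [2]])  # one length-1 prediction array per fold

flat_preds = preds.ravel()                        # flatten to one label per play
acc_mean = (labels == flat_preds).mean()          # share of correct predictions
acc_sklearn = accuracy_score(labels, flat_preds)  # same value via scikit-learn
```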


## Other Sets of Stop Words 

### Use List of Stop Words of Length 305 (Minimum Amount of Stop Words)

In [40]:

def get_features_GM_305(X_train, X_test):
    
    """
    Input: data used to get model, both for testing and training
    output: features for train and test instances 
    """
    
    #getting the columns I want to use for my features from the instances passed
    X_txt = X_train["Raw Text"]
    X_POS = X_train["POS_305"]
    X_POSstops = X_train[ "POSstops_305"]
    
    X_test_txt= X_test["Raw Text"]
    X_test_POS = X_test["POS_305"]
    X_test_POSstops = X_test[ "POSstops_305"]
    
    # using countvectorizors to get freqs
    cvec1 = CountVectorizer(vocabulary = stopwords_305, strip_accents="ascii")#word freq of stop words
    cvec2 = CountVectorizer() #freq of POS tags
    cvec3 = CountVectorizer(ngram_range=(2,2), strip_accents="ascii") #ngrams of POS/stops
     


   # fitting  and transforming of train data, and stack vectors at the same time to get X for model, 
    train_features= np.hstack((
        cvec1.fit_transform(X_txt).toarray(),
        cvec2.fit_transform(X_POS).toarray(),
        cvec3.fit_transform(X_POSstops).toarray(),
        ))
    
    #only transform X_test data, using the vectorizers fitted on the training data
    test_features = np.hstack((
            cvec1.transform(X_test_txt).toarray(),
            cvec2.transform(X_test_POS).toarray(),
            cvec3.transform(X_test_POSstops).toarray(),
        ))

    
    return train_features, test_features


In [41]:
def get_preds_305(dataframe):
    
    """
    swap in a different get_features_GM_* function below for feature engineering!
    input: Dataframe
    
    output: predictions for test instances
    """
    
    instances= get_LOO_instances(dataframe)
    predictions_NB=[]
    predictions_SVM=[]
    for inst in instances:
        train_data= inst[0]
        test_data = inst[1]
        train_label=train_data["Label"]
        train_features, test_features= get_features_GM_305(train_data, test_data)
        pred_NB, coeff_NB, cls_NB = predict_NB(train_features, train_label, test_features)
        pred_SVM, coeff_SVM, cls_SVM = predict_SVM(train_features, train_label, test_features)
        predictions_NB.append(pred_NB)
        predictions_SVM.append(pred_SVM)
        
        
    return np.array(predictions_NB), coeff_NB, cls_NB, np.array(predictions_SVM), coeff_SVM, cls_SVM 

In [42]:
predictions_NB_305, coeff_NB_305, cls_NB_305, predictions_SVM_305, coeff_SVM_305, cls_SVM_305  =  get_preds_305(df) 





In [43]:
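#NOTE: the accuracies recorded below were produced while these columns were filled from the 710 predictions
df["Pred_NB_305"]=predictions_NB_305 #I add my predictions to the data frame as a new column
df["Pred_SVM_305"]=predictions_SVM_305 #I add my predictions to the data frame as a new column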

In [44]:
#get the acc by comparing pred to the labels 
df["acc_NB_305"] =df["Label"]==df[ "Pred_NB_305"]
acc_NB_305 = df.loc[df.acc_NB_305 == True, 'acc_NB_305'].count()/df.shape[0]
print(acc_NB_305)

0.8181818181818182


In [45]:
#get the acc by comparing pred to the labels 
df["acc_SVM_305"] =df["Label"]==df[ "Pred_SVM_305"]
acc_SVM_305 = df.loc[df.acc_SVM_305 == True, 'acc_SVM_305'].count()/df.shape[0]
print(acc_SVM_305)

0.8181818181818182


### Use List of Stop Words of Length 747 (Maximum Amount of Stop Words)

In [46]:

def get_features_GM_747(X_train, X_test):
    
    """
    Input: data used to get model, both for testing and training
    output: features for train and test instances 
    """
    
    #getting the columns I want to use for my features from the instances passed
    X_txt = X_train["Raw Text"]
    X_POS = X_train["POS_747"]
    X_POSstops = X_train[ "POSstops_747"]
    
    X_test_txt= X_test["Raw Text"]
    X_test_POS = X_test["POS_747"]
    X_test_POSstops = X_test[ "POSstops_747"]
    
    
    # using countvectorizors to get freqs 
    cvec1 = CountVectorizer(vocabulary = stopwords_747, strip_accents="ascii")#word freq of stop words
    cvec2 = CountVectorizer() #freq of POS tags
    cvec3 = CountVectorizer(ngram_range=(2,2), strip_accents="ascii") #ngrams of POS/stops

   # fitting  and transforming of train data, and stack vectors at the same time to get X for model, 
    train_features= np.hstack((
        cvec1.fit_transform(X_txt).toarray(),
        cvec2.fit_transform(X_POS).toarray(),
        cvec3.fit_transform(X_POSstops).toarray(),
        ))
    
    #only transform X_test data, using the vectorizers fitted on the training data
    test_features = np.hstack((
            cvec1.transform(X_test_txt).toarray(),
            cvec2.transform(X_test_POS).toarray(),
            cvec3.transform(X_test_POSstops).toarray(),
        ))

    
    return train_features, test_features


In [47]:
def get_preds_747(dataframe):
    
    """
    swap in a different get_features_GM_* function below for feature engineering!
    input: Dataframe
    
    output: predictions for test instances
    """
    
    instances= get_LOO_instances(dataframe)
    predictions_NB=[]
    predictions_SVM=[]
    for inst in instances:
        train_data= inst[0]
        test_data = inst[1]
        train_label=train_data["Label"]
        train_features, test_features= get_features_GM_747(train_data, test_data)
        pred_NB, coeff_NB, cls_NB = predict_NB(train_features, train_label, test_features)
        pred_SVM, coeff_SVM, cls_SVM = predict_SVM(train_features, train_label, test_features)
        predictions_NB.append(pred_NB)
        predictions_SVM.append(pred_SVM)
        
        
    return np.array(predictions_NB), coeff_NB, cls_NB, np.array(predictions_SVM), coeff_SVM, cls_SVM 

In [48]:
predictions_NB_747, coeff_NB_747, cls_NB_747, predictions_SVM_747, coeff_SVM_747, cls_SVM_747  =  get_preds_747(df)





In [49]:
df["Pred_NB_747"]=predictions_NB_747 #I add my predictions to the data frame as a new column
df["Pred_SVM_747"]=predictions_SVM_747 #I add my predictions to the data frame as a new column

In [50]:
#get the acc by comparing pred to the labels 
df["acc_NB_747"] =df["Label"]==df[ "Pred_NB_747"]
acc_NB_747 = df.loc[df.acc_NB_747 == True, 'acc_NB_747'].count()/df.shape[0]
print(acc_NB_747)

0.7922077922077922


In [51]:
#get the acc by comparing pred to the labels 
df["acc_SVM_747"] =df["Label"]==df[ "Pred_SVM_747"]
acc_SVM_747 = df.loc[df.acc_SVM_747 == True, 'acc_SVM_747'].count()/df.shape[0]
print(acc_SVM_747)

0.8051948051948052


## Looking At Results

In [52]:
#collect labels and predictions; the last row holds the accuracies and is marked with Label 100
pred_cols = ["Pred_NB_305", "Pred_NB_710", "Pred_NB_747", "Pred_SVM_305", "Pred_SVM_710", "Pred_SVM_747"]
acc_row = pd.DataFrame([{"Label": 100, "Pred_NB_305": acc_NB_305, "Pred_NB_710":acc_NB_710, "Pred_NB_747": acc_NB_747, "Pred_SVM_305": acc_SVM_305, "Pred_SVM_710":acc_SVM_710, "Pred_SVM_747": acc_SVM_747}])
#DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
df_stpw_results = pd.concat([df[["Label"] + pred_cols], acc_row], ignore_index=True)
df_stpw_results

Unnamed: 0,Label,Pred_NB_305,Pred_NB_710,Pred_NB_747,Pred_SVM_305,Pred_SVM_710,Pred_SVM_747
0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,0.0,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000
3,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
5,0.0,0.000000,0.000000,0.000000,2.000000,2.000000,2.000000
6,0.0,0.000000,0.000000,0.000000,2.000000,2.000000,2.000000
7,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
8,0.0,2.000000,2.000000,2.000000,2.000000,2.000000,2.000000
9,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000


__Results__  


| n_stop_words | NB | SVM |
|--------------|--------|--------|
| *305* | __0.818__ | __0.818__ |
| *710* | __0.818__ | __0.818__ |
| *747* | 0.792 | 0.805 |


The maximum might lie somewhere between 305 and 747 stop words, so we explore an intermediate list length in the following.

In [53]:
stopwords_507=stopwords_747[:int(305+(710-305)/2)]
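
The slice length in the cell above is the midpoint between 305 and 710, truncated to an integer. Spelled out as a small worked example (the variable names are only illustrative):

```python
lower, upper = 305, 710
midpoint = lower + (upper - lower) / 2  # 305 + 202.5 = 507.5
n_stops = int(midpoint)                 # int() truncates towards zero, giving 507
```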

In [33]:
"""Don't run this. DataFrame_imp is saved and can be read in"""

def create_dataframe_improve(corpus, authors, new_stops):
    
    """
    input: all of the texts to be analysed, all authors, and a list of stop words
    
    
    output: pandas DF with all txt files as one instance: 
    - Play ID to identify each text, and name of play
    - Raw text
    - POS tags if word not in respective stop word list
    - POS tags or stop words, dependent on the stop word list
    - author name and label for each author
    """
    nlp = spacy.load("en")
    data=[]   
    corpus_per_author = get_ordered_lists(authors, corpus)
    
    play_id = -1
    for idx, corpus in enumerate(corpus_per_author):
        author=authors[idx]
        label=idx

        for play in corpus:
            play_id += 1

            drama=play[3:]
            
            with open(play) as f: #close the file again after reading
                file = f.read()
            doc = nlp(file)
            
            POS_new= " "
            POSstopwrds_new = " "

            for word in doc:
                
                if str(word) in new_stops:
                    POSstopwrds_new += str(word) 
                    POSstopwrds_new += str(" ")
                else:
                    POSstopwrds_new  += str(word.pos_)
                    POSstopwrds_new  += str(" ")
                    
                if str(word) not in new_stops:
                    POS_new += str(word.pos_)
                    POS_new += " "
                        
                        
                    
            data.append((play_id, file, POS_new, POSstopwrds_new, label, author, drama))
            
    df=pd.DataFrame(data,
                    columns=
                    ["Play_id","Raw Text", "POS_new", "POSstops_new", "Label", "Author", "Play"] )
    return df



corpus= glob("El/*tok*")
y_authors = ['E-Shakespeare', 'L-Shakespeare', 'Marlowe', 'Middleton','Jonson', 'Chapman']
df_imp = create_dataframe_improve(corpus, y_authors, stopwords_507)



In [35]:
df_imp.to_excel("DataFrame_imp.xlsx")


In [54]:
"""Read in new df here"""
df_imp = pd.read_excel("DataFrame_imp.xlsx")

In [73]:

def get_features_GM_imp(X_train, X_test):
    
    """
    Input: data used to get model, both for testing and training
    output: features for train and test instances 
    """
    
    #getting the columns I want to use for my features from the instances passed
    X_txt = X_train["Raw Text"]
    X_POS = X_train["POS_new"]
    X_POSstops = X_train[ "POSstops_new"]
    
    X_test_txt= X_test["Raw Text"]
    X_test_POS = X_test["POS_new"]
    X_test_POSstops = X_test[ "POSstops_new"]
    
    
    # using countvectorizors to get freqs 
    cvec1 = CountVectorizer(vocabulary = stopwords_507, strip_accents="ascii")#word freq of stop words
    cvec2 = CountVectorizer() #freq of POS tags
    cvec3 = CountVectorizer(ngram_range=(2,2), strip_accents="ascii") #ngrams of POS/stops

   # fitting  and transforming of train data, and stack vectors at the same time to get X for model, 
    train_features= np.hstack((
        cvec1.fit_transform(X_txt).toarray(),
        cvec2.fit_transform(X_POS).toarray(),
        cvec3.fit_transform(X_POSstops).toarray(),
        ))
    
    #only transform X_test data, using the vectorizers fitted on the training data
    test_features = np.hstack((
            cvec1.transform(X_test_txt).toarray(),
            cvec2.transform(X_test_POS).toarray(),
            cvec3.transform(X_test_POSstops).toarray(),
        ))

    
    return train_features, test_features


In [74]:
def get_preds_imp(dataframe):
    
    """
    swap in a different get_features_GM_* function below for feature engineering!
    input: Dataframe
    
    output: predictions for test instances
    """
    
    instances= get_LOO_instances(dataframe)
    predictions_NB=[]
    predictions_SVM=[]
    for inst in instances:
        train_data= inst[0]
        test_data = inst[1]
        train_label=train_data["Label"]
        train_features, test_features= get_features_GM_imp(train_data, test_data)
        pred_NB, coeff_NB, cls_NB = predict_NB(train_features, train_label, test_features)
        pred_SVM, coeff_SVM, cls_SVM = predict_SVM(train_features, train_label, test_features)
        predictions_NB.append(pred_NB)
        predictions_SVM.append(pred_SVM)
        
        
    return np.array(predictions_NB), coeff_NB, cls_NB, np.array(predictions_SVM), coeff_SVM, cls_SVM 

In [75]:
predictions_NB_imp, coeff_NB_imp, cls_NB_imp, predictions_SVM_imp, coeff_SVM_imp, cls_SVM_imp  =  get_preds_imp(df_imp)





In [76]:
df["Pred_NB_imp"]=predictions_NB_imp #I add my predictions to the data frame as a new column
df["Pred_SVM_imp"]=predictions_SVM_imp #I add my predictions to the data frame as a new column

In [77]:
#get the acc by comparing pred to the labels 
df["acc_NB_imp"] =df["Label"]==df[ "Pred_NB_imp"]
acc_NB_imp = df.loc[df.acc_NB_imp == True, 'acc_NB_imp'].count()/df.shape[0]
print(acc_NB_imp)

0.7922077922077922


In [78]:
#get the acc by comparing pred to the labels 
df["acc_SVM_imp"] =df["Label"]==df[ "Pred_SVM_imp"]
acc_SVM_imp = df.loc[df.acc_SVM_imp == True, 'acc_SVM_imp'].count()/df.shape[0]
print(acc_SVM_imp)

0.8051948051948052


In [62]:
imp_NB = np.append(predictions_NB_imp, acc_NB_imp)
imp_SVM = np.append(predictions_SVM_imp, acc_SVM_imp)

In [63]:
df_stpw_results["Pred_NB_imp"] =imp_NB
df_stpw_results["Pred_SVM_imp"] =imp_SVM

In [64]:
"""stop words result df is already saved"""
df_stpw_results
df_stpw_results.to_excel("Stopword_results.xlsx")

In [66]:
df_stpw_results.transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,68,69,70,71,72,73,74,75,76,77
Label,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,100.0
Pred_NB_305,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,...,5.0,5.0,5.0,0.0,5.0,5.0,0.0,5.0,5.0,0.818182
Pred_NB_710,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,...,5.0,5.0,5.0,0.0,5.0,5.0,0.0,5.0,5.0,0.818182
Pred_NB_747,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,...,5.0,5.0,5.0,0.0,5.0,5.0,0.0,5.0,5.0,0.792208
Pred_SVM_305,0.0,0.0,1.0,0.0,0.0,2.0,2.0,0.0,2.0,0.0,...,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,0.818182
Pred_SVM_710,0.0,0.0,1.0,0.0,0.0,2.0,2.0,0.0,2.0,0.0,...,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,0.818182
Pred_SVM_747,0.0,0.0,1.0,0.0,0.0,2.0,2.0,0.0,2.0,1.0,...,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,0.805195
Pred_NB_imp,0.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,2.0,0.0,...,5.0,5.0,5.0,0.0,5.0,5.0,0.0,5.0,5.0,0.792208
Pred_SVM_imp,0.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,2.0,0.0,...,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,0.818182


__Results__

The choice of stop word set does not seem to have a substantial impact on the outcome.


| n_stop_words | NB | SVM |
|--------------|--------|--------|
| *305* | __0.818__ | __0.818__ |
| *507* | 0.792 | __0.818__ |
| *710* | __0.818__ | __0.818__ |
| *747* | 0.792 | 0.805 |
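
To put these accuracies in perspective, they can be converted back into counts of correctly attributed plays. The sketch below assumes the 77 plays reported by `df.shape` and uses the rounded values from the table:

```python
n_plays = 77  # rows in the data frame, per the df.shape output
reported = [0.818, 0.805, 0.792]

# number of correctly attributed plays behind each reported accuracy
correct_counts = {acc: round(acc * n_plays) for acc in reported}
gap_in_plays = max(correct_counts.values()) - min(correct_counts.values())
```

Under this assumption, the best and worst configurations differ by only two plays, which fits the observation that the stop word set has little effect.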


