# Generative Model

In the following the approach of using a multi support vector machine classifier together with each sentence of a play as an instance will be explored. This is done to increase the number of training data passed to the classifier. The features implemented will be based on those used by Fox et al. (2012), which are: 
-	Frequency of stop words
-	Frequency of POS tags of words which are not stop words
-   Frequency of bigrams of stop words and POS tags 

In the end, each sentence will be assigned to one author, to determine the writer of an entire play the majority class of all sentence of one play will be taken.  
To evaluate the accuracy, a three-fold cross validation will be employed.


In [1]:

import os
import spacy
import numpy as np
import pandas as pd
from glob import glob

from scipy.stats import mode

from sklearn import linear_model
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import make_classification
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split


import nltk
from nltk.corpus import stopwords
from stop_words import get_stop_words

## Preparing Data

We prepare the data such that each sentence of a play is an instance.

In [2]:
corpus= glob("El/*")
corpus_tagged = glob('Tagged_or_Stopwords/*')

In [3]:
def get_ordered_lists(auhtors, corpus):
    """ 
    input: corpus as a list with path of files and a list of authors
    
    output: list of list, with the same order of the author list, with a list containing all plays for each author
    """
    dramas_per_author=[]
    for author in auhtors:
        authorList=[]
        for drama in corpus:
            if author in drama:
                authorList.append(drama)
        dramas_per_author.append(authorList)
    
    return dramas_per_author

In [4]:
def get_lists(corpus, auhtors):
    
    """
    input: list of authors, list of all files as paths for all plays
    
    output: list with  all files ordered according to author list 
    """
    corpus_per_author = get_ordered_lists(auhtors, corpus)
    
    data=[]
    labels=[]
    for idx, corpus in enumerate(corpus_per_author):
        for idx2, play in enumerate(corpus):
            instance= play
            data.append(instance)

            labels.append(idx)


    
    return data, labels


y_authors = ['E-Shakespeare', 'L-Shakespeare', 'Marlowe', 'Middleton','Jonson', 'Chapman']

data, labels = get_lists(corpus, y_authors)


### Preparing stop word lists

In [5]:
nlp = spacy.load("en")
stopwrd1= []
for word in nlp.Defaults.stop_words:
    stopwrd1.append(word)


stopwords2 = stopwords.words('english')


stopwords3 = get_stop_words('english')

#from https://www.ranks.nl/stopwords
stopwords4 = ['a ', 'able', 'about', 'above', 'abst', 'accordance', 'according', 'accordingly', 'across', 'act', 'actually', 'added', 'adj', 'affected', 'affecting', 'affects', 'after', 'afterwards', 'again', 'against', 'ah', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'announce', 'another', 'any', 'anybody', 'anyhow', 'anymore', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apparently', 'approximately', 'are', 'aren', 'arent', 'arise', 'around', 'as', 'aside', 'ask', 'asking', 'at', 'auth', 'available', 'away', 'awfully', 'b', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'begin', 'beginning', 'beginnings', 'begins', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'between', 'beyond', 'biol', 'both', 'brief', 'briefly', 'but', 'by', 'c', 'ca', 'came', 'can', 'cannot', "can't", 'cause', 'causes', 'certain', 'certainly', 'co', 'com', 'come', 'comes', 'contain', 'containing', 'contains', 'could', 'couldnt', 'd', 'date', 'did', "didn't", 'different', 'do', 'does', "doesn't", 'doing', 'done', "don't", 'down', 'downwards', 'due', 'during', 'e', 'each', 'ed', 'edu', 'effect', 'eg', 'eight', 'eighty', 'either', 'else', 'elsewhere', 'end', 'ending', 'enough', 'especially', 'et', 'et-al', 'etc', 'even', 'ever', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'ex', 'except', 'f', 'far', 'few', 'ff', 'fifth', 'first', 'five', 'fix', 'followed', 'following', 'follows', 'for', 'former', 'formerly', 'forth', 'found', 'four', 'from', 'further', 'furthermore', 'g', 'gave', 'get', 'gets', 'getting', 'give', 'given', 'gives', 'giving', 'go', 'goes', 'gone', 'got', 'gotten', 'h', 'had', 'happens', 'hardly', 'has', "hasn't", 'have', "haven't", 'having', 'he', 'hed', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'heres', 'hereupon', 'hers', 'herself', 'hes', 'hi', 'hid', 'him', 'himself', 'his', 'hither', 'home', 'how', 'howbeit', 'however', 'hundred', 'i', 'id', 'ie', 'if', "i'll", 'im', 'immediate', 'immediately', 'importance', 'important', 'in', 'inc', 'indeed', 'index', 'information', 'instead', 'into', 'invention', 'inward', 'is', "isn't", 'it', 'itd', "it'll", 'its', 'itself', "i've", 'j', 'just', 'k', 'keep\tkeeps', 'kept', 'kg', 'km', 'know', 'known', 'knows', 'l', 'largely', 'last', 'lately', 'later', 'latter', 'latterly', 'least', 'less', 'lest', 'let', 'lets', 'like', 'liked', 'likely', 'line', 'little', "'ll", 'look', 'looking', 'looks', 'ltd', 'm', 'made', 'mainly', 'make', 'makes', 'many', 'may', 'maybe', 'me', 'mean', 'means', 'meantime', 'meanwhile', 'merely', 'mg', 'might', 'million', 'miss', 'ml', 'more', 'moreover', 'most', 'mostly', 'mr', 'mrs', 'much', 'mug', 'must', 'my', 'myself', 'n', 'na', 'name', 'namely', 'nay', 'nd', 'near', 'nearly', 'necessarily', 'necessary', 'need', 'needs', 'neither', 'never', 'nevertheless', 'new', 'next', 'nine', 'ninety', 'no', 'nobody', 'non', 'none', 'nonetheless', 'noone', 'nor', 'normally', 'nos', 'not', 'noted', 'nothing', 'now', 'nowhere', 'o', 'obtain', 'obtained', 'obviously', 'of', 'off', 'often', 'oh', 'ok', 'okay', 'old', 'omitted', 'on', 'once', 'one', 'ones', 'only', 'onto', 'or', 'ord', 'other', 'others', 'otherwise', 'ought', 'our', 'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 'owing', 'own', 'p', 'page', 'pages', 'part', 'particular', 'particularly', 'past', 'per', 'perhaps', 'placed', 'please', 'plus', 'poorly', 'possible', 'possibly', 'potentially', 'pp', 'predominantly', 'present', 'previously', 'primarily', 'probably', 'promptly', 'proud', 'provides', 'put', 'q', 'que', 'quickly', 'quite', 'qv', 'r', 'ran', 'rather', 'rd', 're', 'readily', 'really', 'recent', 'recently', 'ref', 'refs', 'regarding', 'regardless', 'regards', 'related', 'relatively', 'research', 'respectively', 'resulted', 'resulting', 'results', 'right', 'run', 's', 'said', 'same', 'saw', 'say', 'saying', 'says', 'sec', 'section', 'see', 'seeing', 'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 'selves', 'sent', 'seven', 'several', 'shall', 'she', 'shed', "she'll", 'shes', 'should', "shouldn't", 'show', 'showed', 'shown', 'showns', 'shows', 'significant', 'significantly', 'similar', 'similarly', 'since', 'six', 'slightly', 'so', 'some', 'somebody', 'somehow', 'someone', 'somethan', 'something', 'sometime', 'sometimes', 'somewhat', 'somewhere', 'soon', 'sorry', 'specifically', 'specified', 'specify', 'specifying', 'still', 'stop', 'strongly', 'sub', 'substantially', 'successfully', 'such', 'sufficiently', 'suggest', 'sup', 'sure\tt', 'take', 'taken', 'taking', 'tell', 'tends', 'th', 'than', 'thank', 'thanks', 'thanx', 'that', "that'll", 'thats', "that've", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 
             'thered', 'therefore', 'therein', "there'll", 'thereof', 'therere', 'theres', 'thereto', 'thereupon', "there've", 'these', 'they', 'theyd', "they'll", 'theyre', "they've", 'think', 'this', 'those', 'thou', 'though', 'though', 'thousand', 'throug', 'through', 'throughout', 'thru', 'thus', 'til', 'tip', 'to', 'together', 'too', 'took', 'toward', 'towards', 'tried', 'tries', 'truly', 'try', 'trying', 'ts', 'twice', 'two', 'u', 'un', 'under', 'unfortunately', 'unless', 'unlike', 'unlikely', 'until', 'unto', 'up', 'upon', 'ups', 'us', 'use', 'used', 'useful', 'usefully', 'usefulness', 'uses', 'using', 'usually', 'v', 'value', 'various', "'ve", 'very', 'via', 'viz', 'vol', 'vols', 'vs', 'w', 'want', 'wants', 'was', 'wasnt', 'way', 'we', 'wed', 'welcome', "we'll", 'went', 'were', 'werent', "we've", 'what', 'whatever', "what'll", 'whats', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'wheres', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whim', 'whither', 'who', 'whod', 'whoever', 'whole', "who'll", 'whom', 'whomever', 'whos', 'whose', 'why', 'widely', 'willing', 'wish', 'with', 'within', 'without', 'wont', 'words', 'world', 'would', 'wouldnt', 'www', 'x', 'y', 'yes', 'yet', 'you', 'youd', "you'll", 'your', 'youre', 'yours', 'yourself', 'yourselves', "you've", 'z', 'zero']

        

stopwords_305 = stopwrd1
stopwords_747= list(set(stopwrd1 + stopwords2+ stopwords3 +stopwords4))
stopwords_710= stopwords_747[:710]
print(len(stopwords_305))
print(len(stopwords_710))

305
710


### Creating Dataframe with an instance being a sentence, each instance will have the following information:


- All POS taggs if word is not a stopword for default stop words loaded with SpaCy and the additioanl passed one
- POS taggs or stopwords for each stop word list
- Lemmas for each word in stantance
- Sentence Length 
- Label indicating author
- Author name 
- Name of play
- ID for each play    


In [6]:
def create_dataframe(corpus, authors, stopwrds2):
    
    """
    Input: corpus with all texts, authors, and additional list with stopwords
    
    runtime: 14m 27s for 76 files 
    
    output: Pandas dataframe with one sentence is an instance, which has the following information:
            - All POS taggs if word is not a stopword for default stop words loaded with SpaCy and the additional passed one
            - POS taggs or stopwords for each stop word list
            - Lemmas for each word in stantance
            - Sentence Length 
            - Label indicating author
            - Author name 
            - Name of play
            - ID for each play    
    """
    nlp = spacy.load("en")
    data=[]   
    corpus_per_author = get_ordered_lists(authors, corpus)
    
    play_id = -1
    for idx, corpus in enumerate(corpus_per_author):
        author=authors[idx]
        label=idx

        for play in corpus:
            play_id += 1

            drama=play[3:]
            doc = nlp(open(play).read())
            for sent in doc.sents:
                sentlen= len(sent)
                POS_stwrd = " "
                sentence= " "
                lemmas= " "
                POSstopwrds2 = " "
                POS= " "
                POS2 = " "
                for word in sent:
                    sentence += str(word)
                    sentence += str(" ")
                    lemmas += str(word.lemma_)
                    lemmas += str(" ")

                    if word.is_stop:
                        POS_stwrd += str(word) 
                        POS_stwrd += str(" ") 
                        
                    else:
                        POS_stwrd  += str(word.pos_)
                        POS_stwrd  += str(" ")
                       
                    if str(word) in stopwrds2:
                        POSstopwrds2 += str(word) 
                        POSstopwrds2 += str(" ")
                    else:
                        POSstopwrds2  += str(word.pos_)
                        POSstopwrds2  += str(" ")
                        
                    if not word.is_stop:
                        POS += str(word.pos_)
                        POS += " "
                    if str(word) not in stopwrds2:
                        POS2 += str(word.pos_)
                        POS2 += " "
                        
                        
                    
                data.append((sentence, POS, POS2, POS_stwrd, POSstopwrds2,  lemmas, sentlen, label, author, drama,  play_id))
            
    df=pd.DataFrame(data,
                    columns=
                    ["Sentence", "POSnotstopwrds", "POSnotstopwrds2", "POSstopwrds", "POSstopwrds2", "Lemmas", "Sentencelen","Label", "Author", "Play", "Play_id" ] )
    return df




y_authors = ['E-Shakespeare', 'L-Shakespeare', 'Marlowe', 'Middleton','Jonson', 'Chapman']
df = create_dataframe(corpus, y_authors, stopwords_710)

#saving df in directory as excel file called "DataFrame"
df.to_excel("DataFrame_SVM.xlsx") 



In [25]:
#read saved df
df=pd.read_excel('DataFrame_SVM.xlsx', index_col=0) 
df.head(5)

Unnamed: 0,Sentence,POSnotstopwrds,POSnotstopwrds2,POSstopwrds,POSstopwrds2,Lemmas,Sentencelen,Label,Author,Play,Play_id
0,"As I remember , Adam , it was upon this fashi...",ADP PRON VERB PUNCT PROPN PUNCT NOUN SPACE VE...,ADP PRON VERB PUNCT PROPN PUNCT NOUN SPACE VE...,ADP PRON VERB PUNCT PROPN PUNCT it was upon t...,ADP PRON VERB PUNCT PROPN PUNCT it was upon t...,"as -PRON- remember , adam , -PRON- be upon th...",50,0,E-Shakespeare,asyoulikeit.txt.E-Shakespeare.tok,0
1,"My brother Jaques he keeps at school , and \n...",ADJ NOUN PROPN VERB NOUN PUNCT SPACE NOUN VER...,ADJ NOUN PROPN VERB NOUN PUNCT SPACE NOUN VER...,ADJ NOUN PROPN he VERB at NOUN PUNCT and SPAC...,ADJ NOUN PROPN he VERB at NOUN PUNCT and SPAC...,"-PRON- brother jaques -PRON- keep at school ,...",68,0,E-Shakespeare,asyoulikeit.txt.E-Shakespeare.tok,0
2,"His horses \n are bred better ; for , besides...",ADJ NOUN SPACE VERB ADV PUNCT PUNCT ADJ SPACE...,ADJ NOUN SPACE VERB VERB ADV PUNCT PUNCT VERB...,ADJ NOUN SPACE are VERB ADV PUNCT for PUNCT b...,ADJ NOUN SPACE VERB VERB ADV PUNCT for PUNCT ...,"-PRON- horse \n be breed better ; for , besid...",66,0,E-Shakespeare,asyoulikeit.txt.E-Shakespeare.tok,0
3,Besides this nothing that he so \n plentifull...,ADP SPACE ADV VERB PUNCT NOUN VERB SPACE NOUN...,ADP SPACE ADV PRON PUNCT NOUN SPACE PRON NOUN...,ADP this nothing that he so SPACE ADV VERB me...,ADP this nothing that he so SPACE ADV gives P...,besides this nothing that -PRON- so \n plenti...,61,0,E-Shakespeare,asyoulikeit.txt.E-Shakespeare.tok,0
4,"This is it , Adam , that \n grieves me ; and ...",DET PUNCT PROPN PUNCT SPACE VERB PUNCT NOUN N...,DET PUNCT PROPN PUNCT SPACE VERB PRON PUNCT N...,DET is it PUNCT PROPN PUNCT that SPACE VERB m...,DET is it PUNCT PROPN PUNCT that SPACE VERB P...,"this be -PRON- , adam , that \n grieve -PRON-...",55,0,E-Shakespeare,asyoulikeit.txt.E-Shakespeare.tok,0


## Implementing the same features as done by Fox et al. (2012)
### using each sentence as instance 

In [22]:
#split the dataframe into train and test data
train_data, test_data= train_test_split(df, test_size=0.33)

Exploring which size of stop word sets has the better outcome.

#### 1. Using set of stop words of size 710 (as done by Fox and colleagues)

In [9]:

X_train_sent = train_data["Sentence"]
X_train_POS_710 = train_data["POSnotstopwrds2"]
X_train_POSstp_710 = train_data["POSstopwrds2"]

y_train= train_data["Label"]

#do the same thing for test data
X_test_sent = test_data["Sentence"]
X_test_POS_710 = test_data["POSnotstopwrds2"]
X_test_POSstp_710 = test_data["POSstopwrds2"]

y_test= test_data["Label"]


In [10]:


cvec1 = CountVectorizer(vocabulary= stopwords_710, max_features=1000) 
cvec2 = CountVectorizer(max_features=1000)
cvec3 = CountVectorizer(ngram_range=(2,2), max_features=1000)


training_features_710 = np.hstack((
    cvec1.fit_transform(X_train_sent).toarray(),
    cvec2.fit_transform(X_train_POS_710).toarray(),
    cvec3.fit_transform(X_train_POSstp_710).toarray(),
    ))


test_features_710 = np.hstack((
        cvec1.transform(X_test_sent).toarray(),
        cvec2.transform(X_test_POS_710).toarray(),
        cvec3.transform(X_test_POSstp_710).toarray(),
    ))


# cvec1.vocabulary_
# cvec2.vocabulary_
# cvec3.vocabulary_

In [11]:
svm = LinearSVC()
svm.fit(training_features_710,y_train )
pred_svm=svm.score(test_features_710, y_test)

nb = MultinomialNB(alpha=0.1)
nb.fit(training_features_710,y_train, y_train )
pred_MB=nb.score(test_features_710, y_test)

  self.class_log_prior_ = (np.log(self.class_count_) -


In [12]:
print("Prediction of Naive Bayes has score of " + str(pred_MB))
print("Prediction of multi SVM has score of " + str(pred_svm))

Prediction of Naive Bayes has score of 0.3453691109187585
Prediction of multi SVM has score of 0.46442850665677426


#### 2. Using smaller set of stopwords (305)

In [13]:


X_train_sent = train_data["Sentence"]
X_train_POS_305 = train_data["POSnotstopwrds"]
X_train_POSstp_305 = train_data["POSstopwrds"]

y_train= train_data["Label"]

#do the same thing for test data
X_test_sent= test_data["Sentence"]
X_test_POS_305= test_data["POSnotstopwrds"]
X_test_POSstp_305 = test_data["POSstopwrds"]

y_test= test_data["Label"]


In [14]:
#then define what columns I want

cvec1 = CountVectorizer(vocabulary= stopwords_305, max_features=1000) 
cvec2 = CountVectorizer(max_features=1000)
cvec3 = CountVectorizer(ngram_range=(2,2), max_features=1000)


training_features_305 = np.hstack((
    cvec1.fit_transform(X_train_sent).toarray(),
    cvec2.fit_transform(X_train_POS_305).toarray(),
    cvec3.fit_transform(X_train_POSstp_305).toarray(),
    ))


test_features_305 = np.hstack((
        cvec1.transform(X_test_sent).toarray(),
        cvec2.transform(X_test_POS_305).toarray(),
        cvec3.transform(X_test_POS_305).toarray(),
    ))
# cvec1.vocabulary_
# cvec2.vocabulary_
# cvec3.vocabulary_

In [15]:
svm = LinearSVC()
svm.fit(training_features_305, y_train)
pred_305_svm= svm.score(test_features_305, y_test)

nb = MultinomialNB(alpha=0.1)
nb.fit(training_features_305, y_train)
pred_305_nb= nb.score(test_features_305, y_test)

In [16]:
print("Prediction of Naive Bayes has score of " + str(pred_305_nb))
print("Prediction of multi SVM has score of " + str(pred_305_svm))

Prediction of Naive Bayes has score of 0.39584106178640616
Prediction of multi SVM has score of 0.4166357528543753


__Results__


Overall, the results are low, nevertheless, the multi SVM  performs better across all set of stop words.  
Moreover, the list with 710 stop words delivered a higher score.


|     Stops  | NB    | SVM  |  
|--------|--------|--------|
| 305  | 0.3958 | 0.4166 |
| 710 | 0.3454 | __0.4644__ |



Thus, in the following I will use the bigger list of stop words with a multiclass SVM classifier.

## Predictions for each play

To get the prediction for each play, I will firstly use a 3-fold cross validation to get the prediction for each sentence.
Following this I will look at each sentence of a play and take the majority class to get the final prediction of the author.

In [17]:

X_sent = df["Sentence"]
X_POS_710 = df["POSnotstopwrds2"] #do this with the bigger stop words set
X_POSstp_710= df["POSstopwrds2"]
y = df["Label"]

cvec1 = CountVectorizer(vocabulary= stopwrd1, max_features=1000, strip_accents="ascii") 
cvec2 = CountVectorizer(max_features=1000)
cvec3 = CountVectorizer(ngram_range=(2,2), max_features=1000)


X = np.hstack((
    cvec1.fit_transform(X_sent).toarray(),
    cvec2.fit_transform(X_POS_710).toarray(),
    cvec3.fit_transform(X_POSstp_710).toarray(),
    ))

clf = LinearSVC()
y_pred = cross_val_predict(clf, X, y, cv=3)

In [19]:
df["Pred"] = y_pred
df.head()
df.to_excel("DataFrame_SVM_Preds.xlsx") 

In [20]:
df.head()

Unnamed: 0,Sentence,POSnotstopwrds,POSnotstopwrds2,POSstopwrds,POSstopwrds2,Lemmas,Sentencelen,Label,Author,Play,Play_id,Pred
0,"As I remember , Adam , it was upon this fashi...",ADP PRON VERB PUNCT PROPN PUNCT NOUN SPACE VE...,ADP PRON VERB PUNCT PROPN PUNCT NOUN SPACE VE...,ADP PRON VERB PUNCT PROPN PUNCT it was upon t...,ADP PRON VERB PUNCT PROPN PUNCT it was upon t...,"as -PRON- remember , adam , -PRON- be upon th...",50,0,E-Shakespeare,asyoulikeit.txt.E-Shakespeare.tok,0,0
1,"My brother Jaques he keeps at school , and \n...",ADJ NOUN PROPN VERB NOUN PUNCT SPACE NOUN VER...,ADJ NOUN PROPN VERB NOUN PUNCT SPACE NOUN VER...,ADJ NOUN PROPN he VERB at NOUN PUNCT and SPAC...,ADJ NOUN PROPN he VERB at NOUN PUNCT and SPAC...,"-PRON- brother jaques -PRON- keep at school ,...",68,0,E-Shakespeare,asyoulikeit.txt.E-Shakespeare.tok,0,0
2,"His horses \n are bred better ; for , besides...",ADJ NOUN SPACE VERB ADV PUNCT PUNCT ADJ SPACE...,ADJ NOUN SPACE VERB VERB ADV PUNCT PUNCT VERB...,ADJ NOUN SPACE are VERB ADV PUNCT for PUNCT b...,ADJ NOUN SPACE VERB VERB ADV PUNCT for PUNCT ...,"-PRON- horse \n be breed better ; for , besid...",66,0,E-Shakespeare,asyoulikeit.txt.E-Shakespeare.tok,0,1
3,Besides this nothing that he so \n plentifull...,ADP SPACE ADV VERB PUNCT NOUN VERB SPACE NOUN...,ADP SPACE ADV PRON PUNCT NOUN SPACE PRON NOUN...,ADP this nothing that he so SPACE ADV VERB me...,ADP this nothing that he so SPACE ADV gives P...,besides this nothing that -PRON- so \n plenti...,61,0,E-Shakespeare,asyoulikeit.txt.E-Shakespeare.tok,0,1
4,"This is it , Adam , that \n grieves me ; and ...",DET PUNCT PROPN PUNCT SPACE VERB PUNCT NOUN N...,DET PUNCT PROPN PUNCT SPACE VERB PRON PUNCT N...,DET is it PUNCT PROPN PUNCT that SPACE VERB m...,DET is it PUNCT PROPN PUNCT that SPACE VERB P...,"this be -PRON- , adam , that \n grieve -PRON-...",55,0,E-Shakespeare,asyoulikeit.txt.E-Shakespeare.tok,0,4


In [23]:
corpus_per_author, y = get_lists(corpus, y_authors)

majority_class= {}
for idx, play in enumerate(corpus_per_author):
    mjc= mode(list(df.loc[lambda df: df['Play_id'] == idx]["Pred"]))[0]
    majority_class[play]=mjc
    
    
get_data= []

for idx, play in enumerate(corpus_per_author):
    canonical= y[idx]
    pred= int(majority_class[play])
    get_data.append((play, canonical, pred))
    
dfpred= pd.DataFrame(get_data, columns=["play", "canonical author", "predicted author"])  
# dfpred.head(77)
dfpred["acc"] =dfpred["canonical author"]==dfpred[ "predicted author"]
acc_total = dfpred.loc[dfpred.acc == True, 'acc'].count()/len(corpus_per_author)

In [24]:
print("The overall accuracy is: ", acc_total )

The overall accuracy is:  0.5454545454545454


__Results__

The overall accuracy is very low with 54, 54%, and thus, this might be due the imbalanced data set. 
Thus, the approach with using a sentence as an instance will not be further explored, instead each play will be used as an instance.