# STYLOMETRY THROUGH PLAYS

## Motivation of the project

The project belongs to the field of literary analysis, although, given its nature, it also explores some aspects of authorship attribution.

Although authorship attribution techniques have been tested in several domains, in literature the typical stylometric analysis takes the author's style as its target. Here the analysis goes beyond the author and treats every character as a single style unit. The goal is to explore the relations between the styles of different characters, as well as how far the author's style survives across his characters and plays. At a higher conceptual level, the original purpose was to research the creative process itself, in which an author gives birth to a set of characters that have their own identity but are somehow still related to their creator and to the world of the play they come from.

A previous paper related to this purpose is "Computational Stylometry: Who's in a play?" by Carl Vogel and Gerard Lynch (January 2007), in which the researchers explore plays by four English authors using a cross-validated model, with letter unigram analysis and chi-squared as the classification method. The authors run five different experiments that combine different outputs and classification targets, such as character to character, play to author, etc.

In the present case, the analysis has been performed with the same four English authors, Oscar Wilde, George Bernard Shaw, Ben Jonson, and William Shakespeare, plus two big names of Germanic theater, Friedrich Schiller and Johann Wolfgang von Goethe. Both the methods and the classification approaches used in this project are more numerous and more up to date than those of the earlier paper, with the goal of reaching a deeper and wider understanding of the phenomenon. Further details are given in the sections below.


## Raw data 

The data consist of text files containing the corpora. For every author there is one file per whole play, with the exception of the Shakespeare plays, which were built through an alternative method for computational convenience.

The files were taken from the open source site Project Gutenberg, which offers access to the Plain Text version of each work; these texts were copied and pasted into separate text files named by play and author. The Shakespeare plays available at Project Gutenberg presented two big inconveniences for the analysis: they were extremely inconsistent in format, and they kept some old English mannerisms, such as the use of 'v' instead of 'u'. As a consequence, his corpus was composed from Open Source Shakespeare, where the speeches can be filtered by play and character at will. For this reason, the Shakespeare corpora are split by play, but they do not contain the play itself, only the speech lines of the characters that pass the 1500 words threshold.

### English plays

#### Oscar Wilde
    - An Ideal Husband
    - A Woman of no Importance
    - Lady Windermere's Fan
    - The Importance of Being Earnest 
#### George Bernard Shaw
    - Pygmalion
    - Androcles and the Lion
    - Caesar and Cleopatra
    - Candida
    - Man And Superman
#### Ben Jonson
    - Cynthia's Revels
    - Every Man in His Humour
    - Volpone, Or The Fox
    - The Alchemist
#### William Shakespeare
    - Macbeth 
    - Romeo And Juliet
    - Othello
    - Hamlet
    - King Lear
    
### German plays

#### Friedrich Schiller
    - Kabale und Liebe 
    - Die Verschwoerung des Fiesco zu Genua
    - Die Räuber
    - Die Jungfrau von Orleans 
#### Johann Wolfgang von Goethe
    - Faust
    - Faust II
    - Egmont
    - Iphigenie auf Tauris
    - Die Laune des Verliebten 
    
    
    
    
    

## Description and instructions

#### Requirements

The main requirement to run the code is the environment provided by the Anaconda distribution. Besides that, only two libraries may need to be installed separately, nltk and ipywidgets:
    - pip install nltk
    - pip install ipywidgets

#### Data Cleaning

This is the name of the first notebook to open and of the folder that contains it. The raw data files are also included in this folder.

#### Data Analysis

This folder contains three notebooks and a single text file with the German stop words. The notebooks should be read in the following order:
    - burrows_delta_method
    - svm_models
    - clustering_and_visualitation

** Given the repetitive nature of some processes, the explanations inside the notebooks are limited to the first time a new process is presented, or to the cases where a significant change occurs.

** Under the same logic, some outputs are displayed in full when they first appear, but not throughout the notebook. This was done to make the document easier to read; nevertheless, I invite the reader to explore the different results in detail.



## Methodology

#### Data Cleaning

Several problems were faced at this initial point, mainly because of the lack of unified conventions across editions, plus certain irregularities in the way the text of some plays is laid out. Specific functions were created to extract the data from the text files in a general way, but because of these irregularities every cleaning step had to be done under careful supervision, and new functions or adaptations of the existing ones had to be written along the way. A short piece of research was also done for every play to find particular aspects that might affect the analysis, such as characters addressed by several names or language features specific to the period. After this research, the subcorpus of every character is created as a list of paragraphs, which is cleaned to obtain the raw speeches tokenized by sentences. Then a dictionary is built from the subcorpora of every play and, after analyzing the length of words and sentences, the characters with fewer than 1500 words, considered insufficient for the analysis, are deleted. Finally, every play dictionary is stored for later analysis.
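The last steps of this pipeline can be sketched as follows. This is a minimal illustration, not the code of the notebook: the function name `build_play_dictionary`, the `(character, speech)` input format, and the regex-based sentence split are all assumptions made for the example (the notebooks may well rely on nltk for tokenization).

```python
import re
from collections import defaultdict

MIN_WORDS = 1500  # threshold described in the text


def build_play_dictionary(paragraphs, min_words=MIN_WORDS):
    """Group speeches by character and drop characters below the threshold.

    `paragraphs` is assumed to be a list of (character, speech) pairs that
    were already extracted from the play file (hypothetical input format).
    """
    speeches = defaultdict(list)
    for character, speech in paragraphs:
        # Naive sentence split after '.', '!' or '?' (illustrative only).
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", speech)]
        speeches[character].extend(s for s in sentences if s)

    def word_count(sentences):
        return sum(len(s.split()) for s in sentences)

    # Keep only the characters with enough words for the analysis.
    return {c: s for c, s in speeches.items() if word_count(s) >= min_words}
```

Calling the function with a lower `min_words` value is a convenient way to try it on a short excerpt before running it on a whole play.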

#### Burrows' Delta Method

Burrows' Delta is a widely validated method in authorship attribution that offers a fairly good fit for the current question. Its normalization attenuates the effect of different sample sizes, it lets us compare one test sample against several candidate corpora, and it has been shown to produce better results than chi-squared methods. Since the vectors are built from stop words, the discrimination is made mainly through a lexical approach.

The method is performed five times, once per partition. The partitions were made by splitting each corpus into five parts balanced by number of sentences, assuming that the style remains relatively constant throughout all the speeches of a character. The method consists of taking the n most common words, essentially stop words, from the whole train corpus. In all cases n was 50, except when all the characters over 4000 words were taken together, where n went up to 100. Then the normalized vectors of z scores are created, one z score per stop word, for the train and test samples. After that, the absolute differences between the z scores of a test sample and those of each train sample are summed into a single distance. The classification assigns every test sample to the train sample with the lowest distance. Finally, the accuracy of every character is the number of correct predictions out of five, and the global accuracy is computed from these.
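The steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the notebook's implementation: the function names are hypothetical, the partitions are taken as contiguous chunks, the population standard deviation is used, and the distance is the mean (instead of the sum) of the absolute z-score differences, which produces the same ranking of candidates.

```python
from collections import Counter
from statistics import mean, pstdev


def split_into_partitions(sentences, k=5):
    """Split a list of sentences into k contiguous parts of near-equal size."""
    size, extra = divmod(len(sentences), k)
    parts, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        parts.append(sentences[start:end])
        start = end
    return parts


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]


def burrows_delta_predict(train, test_tokens, n=50):
    """Return (closest training label, delta distance to every label).

    `train` maps a character name to the list of tokens of its train text.
    """
    # 1. The n most common words of the whole train corpus.
    all_tokens = [t for tokens in train.values() for t in tokens]
    vocab = [w for w, _ in Counter(all_tokens).most_common(n)]

    # 2. Relative frequencies per train character.
    freqs = {label: relative_freqs(tokens, vocab)
             for label, tokens in train.items()}

    # 3. Mean and standard deviation of each word across train characters.
    columns = list(zip(*freqs.values()))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) or 1.0 for col in columns]  # avoid division by zero

    def z_scores(vec):
        return [(f - m) / s for f, m, s in zip(vec, mus, sigmas)]

    # 4. Delta: mean absolute difference between the z-score vectors.
    test_z = z_scores(relative_freqs(test_tokens, vocab))
    distances = {
        label: mean(abs(a - b) for a, b in zip(z_scores(vec), test_z))
        for label, vec in freqs.items()
    }
    return min(distances, key=distances.get), distances
```

In a five-partition setup, each partition of a character would in turn play the role of `test_tokens` while the remaining four are merged into `train`, and the accuracy of the character is the share of the five predictions that return its own name.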

An extra analysis was run on the characters with more than 4000 words of speech, which lets us evaluate their accuracy scores among the rest of the characters. This gives us information about how cohesively the principal characters are constructed when the characters under 4000 words are taken as noise.

A second Burrows' Delta analysis was performed at the word bigram level instead of single words, in order to include more contextual information. Although the accuracy was expected to drop dramatically because of the relatively small size of the corpus, I found it interesting to check how it would affect the different authors. In this case, the detailed step-by-step process followed in the first analysis was considered unnecessary, and a specific function, model predictions, was created and used in the delta distances part.
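As a small illustration of the change of unit, word bigrams can be obtained by pairing every token with the one that follows it. The helper name `word_bigrams` is hypothetical and is not the function used in the notebook.

```python
from collections import Counter


def word_bigrams(tokens):
    """Return the adjacent word pairs of a token list."""
    return list(zip(tokens, tokens[1:]))


tokens = "to be or not to be".split()
bigram_counts = Counter(word_bigrams(tokens))
print(bigram_counts.most_common(1))  # most frequent bigram and its count
```

The resulting pairs can then be counted and ranked exactly like single words, which is why the number of distinct features grows quickly on a small corpus.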

** Further analyses at the 3 and 4 word gram level were executed, but their results were not considered useful for the purposes of the project.

Finally, a third Burrows' Delta analysis was performed taking all the characters over 4000 words together, regardless of the author. This allows us to explore the cohesion of the principal characters when they are placed among other large corpora, although, unlike the former analysis, we lose the noise that forms part of each author's style.

** The Germanic authors were added after all the English ones had been analyzed, so their analysis was done straightforwardly, without the detailed partitions presented for the English ones.

#### SVM Models

A well known technique in document classification is the Term frequency - Inverse document frequency vectorizer, Tf-idf, which gives more weight to the less common terms at the expense of stop words, and is therefore more sensitive to key words and other vocabulary content features. In this sense, it was convenient to run these experiments to cover the part of the lexical information ignored by the Burrows' Delta method. In addition, I used the linear support vector machine classifier to build different models, since it seems to achieve high accuracy in the recent literature. All of them were executed under five-fold cross validation.

    
 * SVM classifier
     
 * SVM classifier without stop words
 
 * SVM classifier with word bi-grams
 
 * SVM classifier without stop words and with character-grams in range 1 to 3
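A minimal sketch of these four models with scikit-learn could look as follows. The vectorizer parameters are assumptions: the notebook may use other settings (for example a German stop word list loaded from the text file, or bigrams combined with unigrams), and since scikit-learn only applies `stop_words` to word analyzers, the character-gram model here is defined by its analyzer alone. The function name `evaluate_models` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tf-idf settings for the four models (assumed values, for illustration).
VECTORIZERS = {
    "SVM": dict(),                                        # plain Tf-idf
    "SVM_wsw": dict(stop_words="english"),                # without stop words
    "SVM_bg": dict(ngram_range=(2, 2)),                   # word bigrams
    "SVM_cg": dict(analyzer="char", ngram_range=(1, 3)),  # character 1-3 grams
}


def evaluate_models(texts, labels, folds=5):
    """Mean cross-validated accuracy of a linear SVM for every vectorizer."""
    scores = {}
    for name, params in VECTORIZERS.items():
        model = make_pipeline(TfidfVectorizer(**params), LinearSVC())
        scores[name] = cross_val_score(model, texts, labels, cv=folds).mean()
    return scores
```

Here `texts` would be the list of character partitions and `labels` the matching character names; with five partitions per character, `folds=5` leaves one partition of each character out per fold.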

#### Clustering and Visualization

Finally, an interactive hierarchical clustering was developed, offering the possibility to change some values of interest at will. Both the distance metric and the vectorizer type can be tuned for every author. The distance metric can be set to euclidean or cosine distance. The available vectorizers appear as 'Lexical Style' for the 'raw' one and 'Content Style' for the tf-idf one, under the previously discussed settings. A new object was also created inside the plotting function to apply filters to the character name labels, so the user can keep track of specific aspects of the characters, such as which author or play they come from, or how large their contribution is based on the 4000 words threshold.
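The core of such a clustering, without the interactive widgets, can be sketched with SciPy. The average linkage method and the function name `cluster_characters` are assumptions made for this example; the notebook may use a different linkage criterion.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_characters(vectors, metric="euclidean", n_clusters=2):
    """Average-linkage clustering of character vectors.

    `metric` can be 'euclidean' or 'cosine', the two options described above.
    Returns the linkage tree and a flat cluster label per character.
    """
    data = np.asarray(vectors, dtype=float)
    tree = linkage(data, method="average", metric=metric)
    return tree, fcluster(tree, t=n_clusters, criterion="maxclust")
```

The returned tree can be passed to `scipy.cluster.hierarchy.dendrogram` together with the character names as `labels` to draw the dendrogram.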


## Results

The results are presented by language, considering that there are too many linguistic differences to take into account in such a language-dependent task. As said, the main purpose of the project is merely descriptive, and the conclusions reach as far as the reader's expertise in the field does. That is why the interactive dendrograms were created. Nevertheless, some aspects can be extracted as general results, such as the efficiency of the techniques and the average per author, which are not exempt from discussion either.

#### Columns
* Characters, number of characters
* BDM, Burrows' delta method
* BDM_top7 or BDM_top6, Burrows' delta method filtered by +4000 words characters
* BDM_bg, Burrows' delta method vectorizing by word bigrams
* BDM_top, Burrows' delta method performed with the 28 characters over 4000 words from the four authors
* BDM_all_bg, Burrows' delta method performed with those 28 characters and with word bigrams
* SVM, Support Vector Machine classifier, using Term frequency - Inverse document frequency vectorizer
* SVM_wsw, Support Vector Machine classifier, using Tf-idf vectorizer without stop words
* SVM_bg, Support Vector Machine classifier, using Tf-idf with word bigrams
* SVM_cg, Support Vector Machine classifier, using Tf-idf with character grams in range 1 to 3

### English authors

    import pandas as pd

    # One row per author: number of characters, then the accuracy of every model
    columns = ['Characters', 'BDM', 'BDM_top7', 'BDM_bg', 'BDM_top',
               'BDM_all_bg', 'SVM', 'SVM_wsw', 'SVM_bg', 'SVM_cg']
    data_eng = [[21, 0.5047, 0.6571, 0.4095, 0.8, 0.4571, 0.9428, 0.9047, 0.9333, 0.8857],
                [23, 0.6695, 0.9714, 0.3217, 0.9428, 0.6857, 0.9826, 0.9217, 1, 0.9739],
                [22, 0.4909, 0.8857, 0.2181, 0.8571, 0.4285, 0.9, 0.9, 0.9909, 0.9454],
                [21, 0.5142, 0.8, 0.219, 0.7428, 0.3128, 0.819, 0.87, 0.9428, 0.9142]]
    df_eng = pd.DataFrame(data_eng, index=['Wilde', 'Shaw', 'Jonson', 'Shakespeare'],
                          columns=columns)

    # Per-author averages over the BDM models and over the SVM models
    bdm_cols = ['BDM', 'BDM_top7', 'BDM_bg', 'BDM_top', 'BDM_all_bg']
    svm_cols = ['SVM', 'SVM_wsw', 'SVM_bg', 'SVM_cg']
    df_eng['BDM_avg'] = df_eng[bdm_cols].mean(axis=1)
    df_eng['SVM_avg'] = df_eng[svm_cols].mean(axis=1)

    # Totals row: sum of the characters, mean of every score column
    totals = df_eng.mean()
    totals['Characters'] = df_eng['Characters'].sum()
    df_eng.loc['Totals'] = totals
    df_eng['Characters'] = df_eng['Characters'].astype(int)
    df_eng

| Author | Characters | BDM | BDM_top7 | BDM_bg | BDM_top | BDM_all_bg | SVM | SVM_wsw | SVM_bg | SVM_cg | BDM_avg | SVM_avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Wilde | 21 | 0.5047 | 0.6571 | 0.4095 | 0.8 | 0.4571 | 0.9428 | 0.9047 | 0.9333 | 0.8857 | 0.56568 | 0.916625 |
| Shaw | 23 | 0.6695 | 0.9714 | 0.3217 | 0.9428 | 0.6857 | 0.9826 | 0.9217 | 1.0 | 0.9739 | 0.71822 | 0.96955 |
| Jonson | 22 | 0.4909 | 0.8857 | 0.2181 | 0.8571 | 0.4285 | 0.9 | 0.9 | 0.9909 | 0.9454 | 0.57606 | 0.934075 |
| Shakespeare | 21 | 0.5142 | 0.8 | 0.219 | 0.7428 | 0.3128 | 0.819 | 0.87 | 0.9428 | 0.9142 | 0.51776 | 0.8865 |
| Totals | 87 | 0.544825 | 0.82855 | 0.292075 | 0.835675 | 0.471025 | 0.9111 | 0.8991 | 0.96675 | 0.9298 | 0.59443 | 0.926688 |


### Comments
* The Burrows' delta method shows an accuracy close to 0.5 for Wilde, Jonson, and Shakespeare, while Shaw, with almost 0.67, clearly stands out from them.
* When the accuracy is filtered through the top characters, Wilde shows the smallest increase, Jonson and Shakespeare reach 0.89 and 0.8 respectively, and Shaw goes up to 0.97, leading the group once more.
* The bigram Burrows' delta method affects all authors almost proportionally, with the exception of Wilde, who loses just 0.1 points.
* When the 28 top characters are gathered together, Shaw is once more the one with the highest accuracy for both the classical BDM and the word bigram BDM. Shakespeare shows the worst score, with Jonson and Wilde sharing the middle positions in both models.
* Considering the potential of the BDM to detect writing style regardless of content, and looking at the total averages, Shaw shows the most distinct characters, 0.72, followed by Jonson, 0.58, and Wilde, 0.57, leaving Shakespeare at the bottom with 0.52. Some differences could be especially significant here because of the 300 year gap between authors, but Shakespeare and Jonson on one side, and Wilde and Shaw on the other, were contemporaries, so this factor is compensated at least within each pair.
* The Support Vector Machine classifier, with and without stop words, leads to the same ranking, with Shaw once more at the top, followed by Wilde, Jonson, and finally Shakespeare, with a difference of 0.16 points between first and fourth position when stop words are kept. This difference goes down to 0.05 when stop words are removed, with a technical tie between Wilde and Jonson in the middle positions. This narrowing comes from a decrease of around 0.06 for Shaw and 0.04 for Wilde, while Jonson stays at 0.9 and Shakespeare is the only one who gains, by around 0.05 points. In addition, the total accuracy of the model with stop words is still higher, by about 0.01.
* When applying the word bigram and character gram models, the total accuracy improves, with word bigrams the highest of all at 0.97, followed by 1 to 3 character grams at 0.93. The ranking is the same for both models, with Shaw leading with 1 and 0.97, followed by Jonson, 0.99 and 0.95, Shakespeare, 0.94 and 0.91, and closing with Wilde, who gets 0.93 and 0.89, respectively.
* The overall average for all SVM models, although these differ from the BDM models in content sensitivity, results in the same ranking, with Shaw at the top, 0.97, followed by Jonson, 0.93, then Wilde, 0.92, and Shakespeare with 0.89.





### German authors

    import pandas as pd

    # One row per author: number of characters, then the accuracy of every model
    columns = ['Characters', 'BDM', 'BDM_top6', 'BDM_bg', 'BDM_top',
               'BDM_all_bg', 'SVM', 'SVM_wsw', 'SVM_bg', 'SVM_cg']
    data_ger = [[17, 0.5411, 0.8333, 0.1529, 0.9667, 0.2, 0.8941, 0.9411, 0.9882, 0.9411],
                [15, 0.64, 0.8, 0.1466, 0.97, 0.3333, 0.6933, 0.68, 0.92, 0.6933]]
    df_ger = pd.DataFrame(data_ger, index=['Schiller', 'Goethe'], columns=columns)

    # Per-author averages over the BDM models and over the SVM models
    bdm_cols = ['BDM', 'BDM_top6', 'BDM_bg', 'BDM_top', 'BDM_all_bg']
    svm_cols = ['SVM', 'SVM_wsw', 'SVM_bg', 'SVM_cg']
    df_ger['BDM_avg'] = df_ger[bdm_cols].mean(axis=1)
    df_ger['SVM_avg'] = df_ger[svm_cols].mean(axis=1)

    # Totals row: sum of the characters, mean of every score column
    totals = df_ger.mean()
    totals['Characters'] = df_ger['Characters'].sum()
    df_ger.loc['Totals'] = totals
    df_ger['Characters'] = df_ger['Characters'].astype(int)
    df_ger

| Author | Characters | BDM | BDM_top6 | BDM_bg | BDM_top | BDM_all_bg | SVM | SVM_wsw | SVM_bg | SVM_cg | BDM_avg | SVM_avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Schiller | 17 | 0.5411 | 0.8333 | 0.1529 | 0.9667 | 0.2 | 0.8941 | 0.9411 | 0.9882 | 0.9411 | 0.5388 | 0.941125 |
| Goethe | 15 | 0.64 | 0.8 | 0.1466 | 0.97 | 0.3333 | 0.6933 | 0.68 | 0.92 | 0.6933 | 0.57798 | 0.74665 |
| Totals | 32 | 0.59055 | 0.81665 | 0.14975 | 0.96835 | 0.26665 | 0.7937 | 0.81055 | 0.9541 | 0.8172 | 0.55839 | 0.843888 |


### Comments

* Regarding the Burrows' delta method, Goethe gets the best result with 0.64, 0.1 points above Schiller's, although when looking at the top six results Schiller gets the highest score. Goethe's score increases by just 0.16 points, which reveals a relatively small difference between the scores of bigger and smaller characters, even if we take into account his slightly smaller total number of characters.

* The bigram BDM scores are very poor, even when looking at the filtered top characters, so they do not give us much to work with, besides the astounding difference when compared with the English results. Further analysis is needed, but this could be explained by the inflectional richness of German grammar, which would lead to a sharp increase in the variety of stop word bigrams for such a small amount of text.

* For every Support Vector Machine model, Schiller's scores are significantly higher than Goethe's. Even though Goethe has the lowest number of characters, he shows the lowest score in the SVM models average, 0.14 lower than Shakespeare. This cannot be attributed to language or period, at least not to a high degree, since Schiller's score goes up to 0.94. At this point it is convenient to clarify some aspects of Goethe's corpus that might be involved in this phenomenon. Goethe is the only author who has second-part characters, as is the case of Faust and Mephistopheles, two top characters in both parts of Faust, which have been treated as different characters for several reasons. The fact that 24 years separate the two publications, the importance that each part has separately in Germanic culture, and the fact that literary criticism often points out significant differences between them, especially in the thematic dimension, made this decision easier. In addition, such a topic obviously arouses curiosity about how this relation behaves, although its possible effects on the experiments should not be ignored.

* The SVM model with word bigrams, as it did with the English authors, shows the highest accuracy scores, and even Goethe gets a 0.92 value.

* With just two authors, further comparisons of model scores between languages are difficult to make, especially with the huge gap between the German authors in the SVM models.



### Comments on Hierarchical Clustering

Depending on the approach and on the previous knowledge of the characters, plays, and authors, the different feature combinations will offer different conclusions and lines of discussion. However, and despite my lack of expertise, I felt it pertinent to make some superficial notes on the topic:

* Generally, the use of Content Style (Tf-idf vectorizer) seems to gather characters from the same play together. Although the magnitude of this effect varies by author, it shows itself most clearly in Oscar Wilde. It is also clearly seen in the German authors, especially in Schiller, and it is visible in Goethe too if we keep in mind the dual nature of Faust and Mephistopheles.
* When filtering by top characters and Lexical Style, the lexical richness of the authors around their main characters could be interpreted as their degree of dispersion. Once more, Oscar Wilde has the closest ones, gathering six out of seven within a distance of just four nodes. On the contrary, Shakespeare, Schiller, and particularly Shaw show quite dispersed top characters.
* With both the lexical and the content vectorizers, the two periods of the English authors are fairly well defined in the dendrogram with all authors together, most obviously under the lexical style condition, where just a few of Shaw's characters break into the clusters of Shakespeare and Jonson.
* The German authors seem to build more author-dependent clusters when Lexical Style is applied.
* The use of either euclidean or cosine distances does not reveal big changes.
* Regarding the duality of Faust and Mephistopheles, both seem highly related to their second-part counterparts and to each other. Under lexical analysis each character shares a cluster with his own second part, and the two clusters are just one node apart. When the emphasis is put on content, Mephistopheles stays with himself in one cluster, while Faust splits into two consecutive nodes, each one node away from Mephistopheles. Something similar happens in the dendrogram with both German authors: under lexical style Mephistopheles remains consistent, two nodes away from Faust II and three nodes away from Faust I. Under content style, Mephistopheles I and Faust I cluster together, both closely related to Mephistopheles II, with Faust II three nodes away. This would agree with the difference in content between the two parts, while keeping the consistency previously shown by Mephistopheles and revealing Faust as the one suffering the biggest 'identity problems'. It is relevant to mention that in the Burrows' delta method Faust I and Mephistopheles II got five out of five correct predictions, Mephistopheles I got just one Mephistopheles II as a result, and Faust II is the only one with intruders from another identity, with two Mephistopheles II predictions. In summary, Mephistopheles seems in general more consistent than Faust, and Faust II is the most permeable, showing a close dialectical relation with Mephistopheles that gets stronger in the second part of the play.

### Final comments

Despite the multidimensional and divergent nature of the creative process, and the resulting difficulty of dividing it into analyzable and measurable parts, the possibilities offered by new computational techniques open new ways of describing and understanding how it works. The creative ability of writers of all times has been widely discussed, and the present work does not try to put an end to these discussions through a quantitative method, but to put its quantitative conclusions at the service of a higher order analysis. In this sense, every possible contribution that this work can make to the state of the art has been a source of gratification, both in its development and in its conclusion.

