Authorship attribution of a text corpus
=======================================

Here we will repeat a famous experiment in authorship attribution, and try to discover who wrote the Federalist Papers!

We have the corpus from last week's lesson with NLTK; this week, we are going to use a library called `gensim`, which supports much of the large-scale, distant-reading text analysis done where the digital humanities (DH) intersect with literary study.

Reading in the data
-------------------

Now let's load up the Papers. They are in a folder called 'federalist', one file per paper, and each file is named by the paper's number, e.g. 'federalist_7.txt'. We can use NLTK to make a corpus out of the folder, as we did last week.

In [None]:
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import read_regexp_block

# Define how paragraphs look in our text files.
def read_hanging_block( stream ):
    return read_regexp_block( stream, "^[A-Za-z]" )

corpus_root = '../textfiles/federalist'
file_pattern = r'federalist_.*\.txt'   # raw string, so that \. reaches the regex engine intact
federalist = PlaintextCorpusReader( corpus_root, file_pattern, 
                                para_block_reader=read_hanging_block )
print("List of texts in corpus:", federalist.fileids())

Authorship attribution is done by comparing different *features* of the texts we are looking at. Examples include:

* lexical features (average sentence length, variation in sentence length, range of words used)
* punctuation features (average number of different marks per sentence)
* word count features (e.g. frequency of the different common 'function words')
* syntactic features (e.g. frequency of noun use, verb use, adjective use, etc.)

Essentially there are a whole lot of approaches to take, and usually you want to take as many approaches as possible to arrive at some sort of consensus answer. Today we will try three approaches: looking at use of function words, at lexical diversity, and at relative frequency of parts of speech.
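To make the lexical features concrete, here is a minimal sketch of two of them: average sentence length and lexical diversity (distinct words divided by total words). The function name `lexical_features` and the naive regular-expression splitting are assumptions for illustration only; a proper tokenizer such as NLTK's would handle real text better.

```python
import re

def lexical_features(text):
    """Return two simple lexical features for one text."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    # Naive word split: runs of letters and apostrophes, lowercased.
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words or not sentences:
        return {'avg_sentence_length': 0.0, 'lexical_diversity': 0.0}
    return {
        'avg_sentence_length': len(words) / len(sentences),
        'lexical_diversity': len(set(words)) / len(words),
    }

sample = "The people govern. The people decide, and the people vote."
features = lexical_features(sample)
print(features)
```

A text that keeps reusing the same words scores low on lexical diversity; one with a wide vocabulary scores closer to 1.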

Getting the word count feature - the frequency of "function words"
------------------------------------------------------------------

These are the words that we would normally leave out of any vocabulary analysis because they are so common - 'the', 'a', 'and', 'of', 'to', and so on. Lists of them can be had for different languages, and indeed we know that NLTK provides us with such a list. Let's use it.

Okay! Now, for each text, we have to count how often each of these words occurs. This is called making a "feature vector" - each text will be reduced to a data structure that has a count for each of the function words.
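The first half of that step, reducing a text to nothing but its function words, might look like the sketch below. The short `FUNCTION_WORDS` list is a toy stand-in for the NLTK stopword list, and `keep_function_words` is a hypothetical helper name.

```python
# Toy stand-in for the NLTK stopword list; the real list is much longer.
FUNCTION_WORDS = ['the', 'a', 'and', 'of', 'to']

def keep_function_words(tokens, function_words=FUNCTION_WORDS):
    """Lowercase the tokens and keep only those that are function words."""
    wanted = set(function_words)
    return [t.lower() for t in tokens if t.lower() in wanted]

tokens = "To the People of the State of New York".split()
print(keep_function_words(tokens))
```

Note that the order and the repetitions are kept: we need every occurrence in order to count them afterwards.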

**PAY ATTENTION HERE!** This step, the conversion of text files to feature vectors, is where you will make or break any of these text analysis techniques. As we will see, when we are doing authorship attribution we want to count the stopwords, but when we do topic modeling we want to count everything *BUT* the stopwords! Think carefully about the theory and ideas behind what you are doing, when you use these tools.

So now we have our "texts", which are lists of stopwords, and we have our dictionary, which assigns a unique ID to each word. We put these things together to make a vector of each text, which will be a series of `(dictionaryID, count)` tuples. Anytime the count is zero, the dictionary ID will simply be left out of that text's vector. We will use the `doc2bow` method to do this; the result looks something like this.
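To show the shape of the data, here is a plain-Python sketch of what a dictionary and a bag-of-words vector contain. It is not gensim itself: `build_dictionary` and `to_bow` are hypothetical stand-ins, and assigning IDs in alphabetical order is an assumption (gensim may number the words differently).

```python
from collections import Counter

def build_dictionary(texts):
    """Give each distinct word a unique integer ID (alphabetical order here)."""
    vocabulary = sorted({word for text in texts for word in text})
    return {word: word_id for word_id, word in enumerate(vocabulary)}

def to_bow(text, word_to_id):
    """Turn a list of words into sorted (dictionaryID, count) tuples.

    Words with a count of zero never appear, so the vector stays sparse.
    """
    counts = Counter(text)
    return sorted((word_to_id[w], n) for w, n in counts.items() if w in word_to_id)

texts = [['to', 'the', 'of', 'the', 'of'], ['and', 'the', 'a']]
word_to_id = build_dictionary(texts)
bows = [to_bow(text, word_to_id) for text in texts]
print(word_to_id)
print(bows)
```

The first text never uses 'a' or 'and', so IDs 0 and 1 are simply absent from its vector.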

This was the first text, and now we want this sort of "bag of words" (bow) for all of the texts! We use a list comprehension again to get that.

Let's do something similar to get the distribution of parts of speech. We will POS-tag all the texts, choose the most common parts of speech throughout the corpus excluding punctuation (we will settle on the top 15), and then make a similar vector for each text counting the instances of each part of speech.
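As a rough sketch of that plan, assume each text has already been tagged into `(word, tag)` pairs, which is the shape a tagger such as NLTK's `pos_tag` returns. The helper names and the hand-tagged sample are hypothetical, and treating any tag that does not start with a letter as punctuation is a simplifying assumption.

```python
from collections import Counter

def pos_tags_only(tagged_tokens):
    """Drop the words and keep the tags, skipping punctuation tags."""
    return [tag for _word, tag in tagged_tokens if tag and tag[0].isalpha()]

def most_common_tags(tagged_texts, top_n):
    """Return the top_n most frequent tags across all texts."""
    counts = Counter(tag for text in tagged_texts for tag in pos_tags_only(text))
    return [tag for tag, _count in counts.most_common(top_n)]

# Hand-tagged toy data standing in for the tagged Federalist Papers.
tagged = [
    [('The', 'DT'), ('people', 'NNS'), ('govern', 'VBP'), ('.', '.')],
    [('The', 'DT'), ('states', 'NNS'), (',', ','), ('the', 'DT'), ('union', 'NN')],
]
print(pos_tags_only(tagged[0]))
print(most_common_tags(tagged, top_n=2))
```

Each list of tags can then be treated exactly like a list of words: build a dictionary from them and count.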

Now, just as before, make a dictionary out of these "texts".

Hm, let's filter out the punctuation, and limit ourselves to the top 15 parts of speech. We can filter the dictionary like this:

Okay! We have our dictionary the way we want it, so we can make a second gensim corpus out of our texts.

Now we have made two corpora from our texts; one represents the frequency of function words, and the other represents the frequency of common parts of speech.

But now we will want to normalize our vectors a little bit - some texts are a lot longer than others, so will have many more function words overall, and we don't want this fact to affect our results. So we need to scale the values in each tuple, keeping them in proportion with each other but always between 0 and 1.
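A small worked example shows why this matters. The two vectors below are made up: they have the same proportions of three function words, but the second text is ten times longer. Dividing by the largest count is one way to scale, assumed here for illustration, and `divide_by_largest` is a hypothetical helper name.

```python
# Vectors are lists of (dictionaryID, count) tuples.
short_text = [(0, 2), (1, 4), (2, 1)]
long_text = [(0, 20), (1, 40), (2, 10)]

def divide_by_largest(vector):
    """Divide every count by the largest count, so values fall in (0, 1]."""
    largest = max(count for _word_id, count in vector)
    return [(word_id, count / largest) for word_id, count in vector]

print(divide_by_largest(short_text))
print(divide_by_largest(long_text))
```

After scaling, the two texts are indistinguishable, which is what we want: they use the function words in the same proportions.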

In [None]:
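# Let's do some math: scale a vector of (dictionaryID, count) tuples so that
# the largest count becomes 1.0 and the rest keep their proportion to it.
def scale(vector):
    # Find the largest count in this text's vector.
    maximum = 0.0
    for t in vector:
        if t[1] > maximum:
            maximum = t[1]
    # An empty or all-zero vector has nothing to scale.
    if maximum == 0:
        return list(vector)
    # Divide every count by the maximum.
    scaled = []
    for t in vector:
        scaled.append((t[0], float(t[1]) / maximum))
    return scaled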

# Now let's apply this to scale the vectors


Getting the results
-------------------
Okay! We have a set of criteria - the frequency of our function words - and a corresponding set of values for each text. It's time to crunch the numbers and see which papers resemble each other.

We know that there were three authors, so we want to see if we can make the 85 different papers cluster into three groups. There is a clustering algorithm for this called k-means, available as `KMeans` in the "scikit-learn" library, which has a lot of tools for machine learning. (Dividing data into clusters of similar things is a pretty common thing to have to do in machine learning. Lucky for us.)

First we define a function to do the clustering for each data set:

In [None]:
from sklearn.cluster import KMeans

def PredictAuthors(feature_vector_set):
    km = KMeans(n_clusters=3)
    km.fit(feature_vector_set)
    return km

In order to use this, we need to convert our gensim corpus into a matrix that SciPy recognizes. Gensim gives us a utility to do this. In order to get our matrix the right way around, we will also have to transpose it.
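What the conversion and the transpose do can be sketched in plain Python. This is a stand-in, not the gensim utility (which, as far as I know, is `matutils.corpus2dense`); the assumption is that the utility produces one row per term and one column per document, while scikit-learn wants one row per document. `to_dense` and `transpose` are hypothetical names.

```python
def to_dense(corpus, num_terms):
    """Build a terms-by-documents table from sparse (id, value) vectors."""
    matrix = [[0.0] * len(corpus) for _ in range(num_terms)]
    for doc_index, vector in enumerate(corpus):
        for term_id, value in vector:
            matrix[term_id][doc_index] = value
    return matrix

def transpose(matrix):
    """Swap rows and columns."""
    return [list(row) for row in zip(*matrix)]

# Two toy documents over a three-term dictionary.
corpus = [[(0, 0.5), (2, 1.0)], [(1, 1.0), (2, 0.25)]]
terms_by_docs = to_dense(corpus, num_terms=3)
docs_by_terms = transpose(terms_by_docs)
print(terms_by_docs)
print(docs_by_terms)
```

After the transpose, each row is one paper and each column is one feature, with the missing IDs filled in as zeros.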

And then we run this on our data table of the function word frequencies and get a complicated result. We ask for the labels of that result and get something that looks like this:
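For a feel of what the labels look like, here is a small sketch on made-up data: six "texts" with two features each, arranged as three obvious pairs. The `n_init` and `random_state` arguments are additions for reproducibility, not something this lesson prescribes.

```python
from sklearn.cluster import KMeans

# Hypothetical feature vectors forming three well-separated pairs.
features = [[0.0, 0.1], [0.1, 0.0],
            [5.0, 5.1], [5.1, 5.0],
            [9.0, 0.1], [9.1, 0.0]]

km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(features)

# labels_ holds one cluster number per row of the input.
labels = list(km.labels_)
print(labels)
```

Which pair gets which number is arbitrary; what matters is that rows in the same cluster share a number.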

Each of these numbers (0, 1, 2) is a cluster label, which we take to stand for an author; the numbering itself is arbitrary. We know that Hamilton was responsible for most of the papers, Madison for most of the rest, and Jay for the fewest. So let's assign the authors on that assumption.
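One way to make that assignment, sketched here with a hypothetical helper `assign_authors`, is to rank the cluster labels by how many papers each one covers and pair them with the authors in order of output.

```python
from collections import Counter

def assign_authors(labels, authors_by_output=('Hamilton', 'Madison', 'Jay')):
    """Map the biggest cluster to the first author, the next to the second, etc."""
    ranked = [label for label, _count in Counter(labels).most_common()]
    label_to_author = dict(zip(ranked, authors_by_output))
    return [label_to_author[label] for label in labels]

# Made-up labels for seven papers: cluster 2 is largest, then 0, then 1.
labels = [2, 2, 0, 2, 1, 0, 2]
print(assign_authors(labels))
```

This rests entirely on the assumption that cluster size tracks each author's output; if two clusters are nearly the same size, the mapping can easily flip.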

Now we can put this into a function definition, since we'll have to do it twice.

So how did that do against reality? Let's read in the real answers and add them to the table.

Let's make an HTML table for comparison.
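The table-building cell calls a helper named `colorcode` that is not defined in this section. A minimal sketch, assuming it takes a predicted author and the real author and returns a CSS color name (the choice of green and red is a guess):

```python
def colorcode(predicted, actual):
    """Return a CSS color: green for a correct guess, red for a wrong one."""
    return 'green' if predicted == actual else 'red'

print(colorcode('Hamilton', 'Hamilton'))
print(colorcode('Jay', 'Madison'))
```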

In [None]:
from IPython.display import HTML

# prepare our data


# make our table
answer_table = '<table><tr><th>Paper</th><th>Stopwords</th><th>Parts of speech</th><th>Real</th></tr>'
for i in range(len(real_author)):
    ra = real_author[i]
    sa = stopword_authors[i]
    pa = pos_authors[i]
    answer_table += '<tr><td>%d</td>' % (i+1)     # Add the paper number
    answer_table += '<td style="color: %s;">%s</td>' % (colorcode(sa, ra), sa)
    answer_table += '<td style="color: %s;">%s</td>' % (colorcode(pa, ra), pa)
    answer_table += '<td>%s</td></tr>' % ra
answer_table += '</table>'

HTML(answer_table)

...As you can see, the method is not perfect. 😄 A better method for the Federalist Papers problem would be a *trained* one: fit the model on the papers whose authorship is known, so that it takes that knowledge into account when judging the rest.

Probably the most commonly-used method for authorship attribution today is known as Burrows' Delta, named after John Burrows who came up with it. The Delta algorithms are available in [a statistical package](https://sites.google.com/site/computationalstylistics/) called `stylo`, written for the R programming language for statistical computing. If this is something you anticipate wanting to use, that is a very good place to start.