Authorship attribution of a text corpus
=======================================

Here we will repeat a famous experiment in authorship attribution, and try to discover who wrote the Federalist Papers!

We have the corpus from our lesson on NLTK, and we have the `gensim` library that we used in our topic modeling experiments. We'll put these together to get what we need for authorship attribution.

Reading in the data
-------------------

Now let's load up the Papers. They are in a folder called 'federalist' and each paper is numbered, e.g. 'federalist_7.txt'. We can just as easily do this using NLTK to make a corpus out of the folder, as we did last week.

In [None]:
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import read_regexp_block

# Define how paragraphs look in our text files.
def read_hanging_block( stream ):
    return read_regexp_block( stream, "^[A-Za-z]" )

corpus_root = '../lessondata/federalist'
file_pattern = r'federalist_.*\.txt'
federalist = PlaintextCorpusReader( corpus_root, file_pattern, 
                                para_block_reader=read_hanging_block )
print("List of texts in corpus:", federalist.fileids())

Authorship attribution is done by comparing different *features* of the texts we are looking at. Examples include:

* lexical features (average sentence length, variation in sentence length, range of words used)
* punctuation features (average number of different marks per sentence)
* word count features (e.g. frequency of the different common 'function words')
* syntactic features (e.g. frequency of noun use, verb use, adjective use, etc.)

Essentially there are a whole lot of approaches to take, and usually you want to take as many approaches as possible to arrive at some sort of consensus answer. Today we will try two approaches: looking at use of function words, and at relative frequency of parts of speech.
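
The lexical features in the list above are not computed anywhere else in this lesson, so here is a rough sketch of what two of them look like in practice. The `lexical_features` function and its naive regular-expression sentence splitter are our own assumptions for illustration; a proper tokenizer (such as the ones in NLTK) would cope better with abbreviations and quotations.

```python
import re

def lexical_features(text):
    """Return two simple lexical features for a string of English text."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    # Lowercased alphabetic tokens (apostrophes allowed).
    words = re.findall(r"[a-z']+", text.lower())
    sentence_lengths = [len(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    return {
        'mean_sentence_length': sum(sentence_lengths) / len(sentence_lengths),
        # Type-token ratio: number of distinct words divided by total words.
        'type_token_ratio': len(set(words)) / len(words),
    }

sample = "The people are the judges. The people must decide!"
features = lexical_features(sample)
print(features)
```

A higher type-token ratio means a more varied vocabulary, but it also shrinks as texts get longer, so it is only comparable between texts of similar length.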

Getting the word count feature - the frequency of "function words"
------------------------------------------------------------------

These are the words that we would normally leave out of any vocabulary analysis because they are so common - 'the', 'a', 'and', 'of', 'to', and so on. Indeed we left them out of our topic modeling trial last week for this very reason, but for authorship attribution, conversely, they might be very relevant! Let's retrieve them from NLTK.

In [None]:
from nltk.corpus import stopwords
print(" : ".join(stopwords.words("english")))
print(len(stopwords.words("english")))

# Make the stopword list into a Python set. That will make our work much faster below.
swset = set(stopwords.words("english"))

Okay! Now we have, for each text, to count up the frequency of each of these words. This is called making a "feature vector" - each text will be reduced to a data structure that has a count for each of the function words.

**PAY ATTENTION HERE!** This step, the conversion of text files to feature vectors, is where you will make or break any of these text analysis techniques. As we will see, when we are doing authorship attribution we want to count the stopwords, but when we do topic modeling we want to count everything *BUT* the stopwords! Think carefully about the theory and ideas behind what you are doing, when you use these tools.
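
Before we do this on the real corpus, here is the same idea in miniature. The five-word list and the sentence are made up for illustration, and `function_word_vector` is a hypothetical helper, not part of NLTK.

```python
from collections import Counter

# A tiny, made-up list of function words (the real list comes from NLTK).
function_words = ['the', 'of', 'and', 'to', 'upon']

def function_word_vector(text, vocabulary):
    """Count how often each word in `vocabulary` occurs in `text`."""
    counts = Counter(text.lower().split())
    # A Counter returns 0 for missing keys, so every word gets an entry.
    return {w: counts[w] for w in vocabulary}

vec = function_word_vector(
    "The power of the Union depends upon the consent of the people",
    function_words)
print(vec)
```

Every text ends up with a dictionary that has exactly the same keys, which is what lets us line the texts up against each other later.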

In [None]:
from nltk.text import Text
# Make a dictionary for each text that gives the frequency of each stopword
stopword_vectors = []
for paper in federalist.fileids():
    # Get all the words in the paper as lowercase
    pwords = [x.lower() for x in federalist.words(paper)]
    # Generate a vocabulary from these words
    pvocab = Text(pwords).vocab()
    # Keep the stopword entries in the vocabulary
    pstopword_freq = {k: pvocab[k] for k in pvocab.keys() if k in swset}
    # Add in all the stopwords that are *not* in the vocabulary
    for sw in swset:
        if pvocab.get(sw) is None:
            pstopword_freq[sw] = 0
    # Now save this dictionary in our list.
    stopword_vectors.append(pstopword_freq)
    
stopword_vectors[3]

Let's do something similar to get the distribution of parts of speech. We will POS-tag all the texts, choose the fifteen most common parts of speech throughout the corpus excluding punctuation, and then make a similar vector for each text counting the instances of each part of speech.

Here is how to tag a single text:

In [None]:
from nltk import pos_tag

pos_tag(federalist.words('federalist_1.txt'))

...so let's do this for all the texts, and put the resulting lists into an outer list.

In [None]:
# Convert sequences of words into sequences of part-of-speech tags
pos_texts = []
for fed in federalist.fileids():
    pos_tagged = pos_tag(federalist.words(fed))
    pos_texts.append(pos_tagged)
print(len(pos_texts))
pos_texts[15]

In [None]:
# Figure out what our top 15 parts of speech are by making a single long "sequence" and getting its vocabulary
pos_corpus = []
for pt in pos_texts:
    pos_corpus.extend([x[1] for x in pt if x[1].isalpha()])
len(pos_corpus)

In [None]:
most_frequent_pos = set([x[0] for x in Text(pos_corpus).vocab().most_common(15)])
most_frequent_pos

In [None]:
# Now make the feature vector 
pos_vectors = []
for pt in pos_texts:
    tagsonly = [x[1] for x in pt]
    pos_vocab = Text(tagsonly).vocab()
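    # Look up every common tag, so that a tag missing from this text gets a 0
    # rather than a gap in the table (the vocabulary returns 0 for unseen keys).
    pos_freq = {k: pos_vocab[k] for k in most_frequent_pos}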
    pos_vectors.append(pos_freq)

print(len(pos_vectors))
pos_vectors[3]

Now we have extracted two sets of features from our texts; one represents the frequency of function words, and the other represents the frequency of common parts of speech.

But now we will want to normalize our vectors a little bit - some texts are a lot longer than others, so will have many more function words overall, and we don't want this fact to affect our results. So we need to scale the values in each dictionary, so that the word / POS count becomes a fraction of the text length.
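
To see why this matters, consider two hypothetical texts that use 'the' and 'upon' at exactly the same rate, but where one text is ten times longer than the other. The counts and lengths below are invented for the example.

```python
short_text = {'the': 10, 'upon': 2}     # counts from a made-up 100-word text
long_text = {'the': 100, 'upon': 20}    # counts from a made-up 1000-word text

def relative_frequencies(counts, text_length):
    """Turn raw counts into fractions of the text length."""
    return {word: n / text_length for word, n in counts.items()}

short_scaled = relative_frequencies(short_text, 100)
long_scaled = relative_frequencies(long_text, 1000)
print(short_scaled)
print(long_scaled)
```

The raw counts differ by a factor of ten, but after scaling the two vectors are identical, which is what we want for texts written in the same style.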

In [None]:
textlengths = [len(federalist.words(x)) for x in federalist.fileids()]

def scale(vector, textlength):
    # The vector is a dictionary of 'thing': 'count'. We need to scale the count
    # in each case by the overall text length.
    scaled = {}
    for k in vector.keys():
        scaled[k] = vector[k] / textlength
    return scaled

stopword_scaled = []
pos_scaled = []
for i in range(len(textlengths)):
    stopword_scaled.append(scale(stopword_vectors[i], textlengths[i]))
    pos_scaled.append(scale(pos_vectors[i], textlengths[i]))
        
print(len(pos_scaled))
print(len(stopword_scaled))
stopword_scaled[0]

Making a dataframe
----

Now we need to put these dictionaries into a big table, called a 'dataframe'. This is done with a library called pandas, which is used a lot for all sorts of scientific computing in Python. If you are in the Digital Data class you can think of a dataframe as an SQL-like table. First let's try it with just the stopwords.

In [None]:
import pandas as pd

labels = ['Paper #%s' % n.replace('federalist_', '').replace('.txt', '') for n in federalist.fileids()]

stopword_features = pd.DataFrame(stopword_scaled, index=labels)
pos_features = pd.DataFrame(pos_scaled, index=labels)
stopword_features

Getting the results
-------------------
Okay! We have a set of criteria - the frequency of our function words - and a corresponding set of values for each text. It's time to crunch the numbers and see which papers resemble each other.

We know that there were three authors, so we want to see if we can make the 85 different papers cluster into three groups. There is a statistical function for this called KMeans, from the "scikit-learn" module which has a lot of things for machine learning. (Dividing data into clusters of similar things is a pretty common thing to have to do in machine learning. Lucky for us.)

First let's see what happens if we ask the computer to divide our papers into four groups (one for each author, plus one for the jointly-authored Hamilton/Madison ones.)
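
If you are curious what a k-means clustering does internally, the sketch below shows the basic loop on six made-up two-dimensional points. This is a bare-bones illustration with hand-picked starting centroids, not how scikit-learn implements it; the function `kmeans` and its arguments are our own.

```python
import math

def kmeans(points, centroids, iterations=10):
    """A minimal k-means loop; returns (labels, centroids)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster alone
        if new_centroids == centroids:
            break  # nothing moved, so we are done
        centroids = new_centroids
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
```

The labels are just cluster numbers, with no meaning of their own. scikit-learn chooses its starting centroids with some randomness by default, so the same cluster can come out as 0 in one run and 2 in the next.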

In [None]:
from sklearn.cluster import KMeans

def PredictAuthors(fvs):
    km = KMeans(n_clusters=4)
    km.fit(fvs)
    return km

And then we run this on our data table of the function word frequencies and get a complicated result. We ask for the labels of that result and get something that looks like this:

In [None]:
stopword_result = PredictAuthors( stopword_features ).labels_ 
print(stopword_result)
pos_result = PredictAuthors( pos_features ).labels_
print(pos_result)

Each of these numbers (0, 1, 2, 3) represents an author. We know that Hamilton was responsible for most of the papers, Madison for most of the rest, Jay for a few, and the Hamilton/Madison collaboration for the fewest. So let's assign the authors on that assumption.

(...How else might we model this?)

In [None]:
from nltk.probability import FreqDist
author_order = ["Hamilton", "Madison", "Jay", "Hamilton/Madison"]

freq_order = FreqDist(stopword_result).most_common(4)
print(freq_order)

mapping = {}
for i in range(4):
    mapping[freq_order[i][0]] = author_order[i]
mapping

Now we can put this into a function definition, since we'll have to do it twice.

In [None]:
def assign_author(result):
    author_order = ["Hamilton", "Madison", "Jay", "Hamilton/Madison"]
    freq_order = FreqDist(result).most_common(4)
    mapping = {}
    for i in range(4):
        mapping[freq_order[i][0]] = author_order[i]
        
    return [mapping.get(x) for x in result]

assign_author(stopword_result)

So how did that do against reality? Let's read in the real answers and add them to an HTML table for comparison.

In [None]:
with open('../lessondata/federalist/metadata.txt', encoding='utf-8') as f:
    answers = f.readlines()
answers

We can also see how this file looks: the first 3 characters have a number, then the next few (up to #30; I counted for you) have the author(s), then the rest is the title of the paper. We can use this to make a dictionary.
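
Here is the slicing idea on a single made-up line. The layout (three columns for the number, a space, 25 columns for the author, then the title) is our assumption based on the description above, and the title is a placeholder rather than a real entry from the file.

```python
# Build a line in the assumed fixed-width layout.
line = "%3d %-25s%s" % (7, "HAMILTON OR MADISON", "A Placeholder Title")

paperno = int(line[0:3].lstrip())   # columns 0-2: the paper number
author = line[4:29].rstrip()        # columns 4-28: the author(s), space-padded
title = line[29:].strip()           # the rest: the title
print(paperno, author, title)
```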


In [None]:
known_authors = {}
for a in answers:
    paperno = int(a[0:3].lstrip())
    author = a[4:29].rstrip()
    known_authors[paperno] = author
known_authors

Let's try that again, keeping the ones that are by a single author (e.g. JAY), discarding the ones that are by two authors (e.g. HAMILTON AND MADISON), and setting aside the ones that are uncertain (i.e. HAMILTON OR MADISON).

In [None]:
disputed_papers = []
known_authors = {}
for a in answers:
    paperno = int(a[0:3].lstrip())
    author = a[4:29].rstrip()
    if ' OR ' in author:
        disputed_papers.append(paperno)
    elif ' ' not in author:
        known_authors[paperno] = author
print(disputed_papers)
known_authors

Now that we have our "real" answers, we can make our table.

In [None]:
from IPython.display import HTML

stopword_authors = assign_author(stopword_result)
pos_authors = assign_author(pos_result)

def colorcode(assigned, real):
    cellcolor = 'red'
    if assigned.lower() == real.lower():
        cellcolor = 'green'
    elif real.lower().find(assigned.lower()) > -1:
        cellcolor = 'orange'
    return cellcolor

answer_table = '<table><tr><th>Paper</th><th>Stopwords</th><th>Parts of speech</th><th>Real</th></tr>'
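# The predictions are in the order of federalist.fileids(), which is not
# necessarily numeric, so recover each paper number from its file name.
paper_numbers = [int(n.replace('federalist_', '').replace('.txt', '')) for n in federalist.fileids()]
for paperno, sa, pa in sorted(zip(paper_numbers, stopword_authors, pos_authors)):
    if paperno in known_authors:
        ra = known_authors[paperno]
    elif paperno in disputed_papers:
        ra = 'UNKNOWN'
    else:
        ra = 'Hamilton and Madison'
    answer_table += '<tr><td>%d</td>' % paperno     # Print the paper number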
    answer_table += '<td style="color: %s;">%s</td>' % (colorcode(sa, ra), sa)
    answer_table += '<td style="color: %s;">%s</td>' % (colorcode(pa, ra), pa)
    answer_table += '<td>%s</td></tr>' % ra
answer_table += '</table>'

HTML(answer_table)

...Pretty terrible. 😄 This is the difference between unsupervised and supervised learning. What happens if we tell the algorithm what we know, and then let it test the others?

Splitting our dataframes
----

Now that we have made these feature vectors for all our texts, we need to set aside the ones whose authorship is unknown. We have already worked out who wrote what from the `metadata.txt` file in the `federalist` directory, so let's keep only the papers that are in `known_authors` to make the training set, and put the disputed papers into a test set.

In [None]:
stopword_training = stopword_features.filter(items=["Paper #%d" % x for x in known_authors.keys()], axis=0)
pos_training = pos_features.filter(items=["Paper #%d" % x for x in known_authors.keys()], axis=0)

stopword_training

In [None]:
stopword_test = stopword_features.filter(items=["Paper #%d" % x for x in disputed_papers], axis=0)
pos_test = pos_features.filter(items=["Paper #%d" % x for x in disputed_papers], axis=0)
pos_test

Supervising the learning
----

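We will use a different technique for this: rather than KMeans clustering, it will be a KNeighbors classifier. A k-nearest-neighbors classifier looks at the k training texts whose feature vectors are closest to the text in question, and assigns whichever author is most common among them.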
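
To see what a nearest-neighbors classifier does under the hood, here is a minimal pure-Python sketch with made-up numbers. The function `knn_predict` is our own illustration, not scikit-learn's implementation, and the two "features" stand in for the relative frequencies of two function words.

```python
import math
from collections import Counter

def knn_predict(training_points, training_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    by_distance = sorted(zip(training_points, training_labels),
                         key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Invented feature vectors for two authors, 'A' and 'B'.
training_points = [(0.010, 0.002), (0.011, 0.003), (0.012, 0.002),
                   (0.002, 0.010), (0.003, 0.011), (0.002, 0.012)]
training_labels = ['A', 'A', 'A', 'B', 'B', 'B']

guess = knn_predict(training_points, training_labels, query=(0.004, 0.009))
print(guess)
```

Unlike KMeans, this needs to be told the right answers for the training texts, which is why we had to split off the papers of known authorship first.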

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Get our list of answers, ordered by paper number.
training_answers = [known_authors.get(p) for p in sorted(known_authors.keys())]
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(stopword_training.values, training_answers)

In [None]:
prediction = knn.predict(stopword_test.values)
prediction

Let's turn all this into a function, so we can try it with both sets of testing data.

In [None]:
def cluster_predict(trainingset, answers, testset):
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(trainingset.values, answers)
    # Pass plain arrays to both fit and predict, as we did above.
    return knn.predict(testset.values)

cluster_predict(pos_training, training_answers, pos_test)

So now what if we try to take into account both sets of data? We can combine the two training tables, and the two test tables, like so:

In [None]:
training_all = stopword_training.merge(pos_training, left_index=True, right_index=True)
testing_all = stopword_test.merge(pos_test, left_index=True, right_index=True)

cluster_predict(training_all, training_answers, testing_all)

We're almost onto something!

Probably the most commonly-used method for authorship attribution today is known as Burrows' Delta, named after John Burrows who came up with it. The Delta algorithms are available in [a statistical package](https://sites.google.com/site/computationalstylistics/) called `stylo`, written for the R programming language for statistical computing. If this is something you anticipate wanting to use, that is a very good place to start.
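
Roughly speaking, Delta takes the relative frequencies of the most common words, converts each one to a z-score (how far a text sits from the corpus average for that word, in standard deviations), and then measures two texts by the mean absolute difference of their z-scores. The sketch below illustrates that idea on invented numbers; it is not the `stylo` implementation, and the use of the population standard deviation and the handling of zero-variance words are our own assumptions.

```python
import statistics

def burrows_delta(freqs, text_a, text_b):
    """Mean absolute difference of z-scored word frequencies for two texts.

    `freqs` maps each text name to a dict of word -> relative frequency,
    with the same words present for every text.
    """
    words = sorted(next(iter(freqs.values())).keys())
    total = 0.0
    for w in words:
        column = [freqs[t][w] for t in freqs]
        mean = statistics.mean(column)
        sd = statistics.pstdev(column)   # population std dev (an assumption)
        if sd == 0:
            continue   # a word used identically everywhere tells us nothing
        z_a = (freqs[text_a][w] - mean) / sd
        z_b = (freqs[text_b][w] - mean) / sd
        total += abs(z_a - z_b)
    return total / len(words)

# Made-up relative frequencies for three texts.
freqs = {
    'known_1':  {'the': 0.10, 'upon': 0.02},
    'known_2':  {'the': 0.06, 'upon': 0.00},
    'disputed': {'the': 0.09, 'upon': 0.02},
}
d1 = burrows_delta(freqs, 'disputed', 'known_1')
d2 = burrows_delta(freqs, 'disputed', 'known_2')
print(d1, d2)
```

The smaller the Delta, the more alike the two texts are in their use of common words; here the disputed text comes out closer to `known_1`.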