Authorship attribution of a text corpus
=======================================

Here we will repeat a famous experiment in authorship attribution, and try to discover who wrote the Federalist Papers!

We have the corpus from our lesson on NLTK, and we have the `gensim` library that we used in our topic modeling experiments. We'll put these together to get what we need for authorship attribution.

Reading in the data
-------------------

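Now let's load up the Papers. They are in a folder called 'federalist' and each paper is numbered, e.g. 'federalist_7.txt'. The easiest way to read them is to have NLTK make a corpus out of the folder, as we did last week. Note that the file IDs come back sorted as strings rather than numerically, so 'federalist_10.txt' comes straight after 'federalist_1.txt'; a text's position in the corpus is therefore not its paper number.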

In [1]:
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import read_regexp_block

# Define how paragraphs look in our text files.
def read_hanging_block( stream ):
    return read_regexp_block( stream, "^[A-Za-z]" )

corpus_root = '../textfiles/federalist'
file_pattern = r'federalist_.*\.txt'   # raw string, so that \. reaches the regex engine intact
federalist = PlaintextCorpusReader( corpus_root, file_pattern, 
                                para_block_reader=read_hanging_block )
print("List of texts in corpus:", federalist.fileids())

List of texts in corpus: ['federalist_1.txt', 'federalist_10.txt', 'federalist_11.txt', 'federalist_12.txt', 'federalist_13.txt', 'federalist_14.txt', 'federalist_15.txt', 'federalist_16.txt', 'federalist_17.txt', 'federalist_18.txt', 'federalist_19.txt', 'federalist_2.txt', 'federalist_20.txt', 'federalist_21.txt', 'federalist_22.txt', 'federalist_23.txt', 'federalist_24.txt', 'federalist_25.txt', 'federalist_26.txt', 'federalist_27.txt', 'federalist_28.txt', 'federalist_29.txt', 'federalist_3.txt', 'federalist_30.txt', 'federalist_31.txt', 'federalist_32.txt', 'federalist_33.txt', 'federalist_34.txt', 'federalist_35.txt', 'federalist_36.txt', 'federalist_37.txt', 'federalist_38.txt', 'federalist_39.txt', 'federalist_4.txt', 'federalist_40.txt', 'federalist_41.txt', 'federalist_42.txt', 'federalist_43.txt', 'federalist_44.txt', 'federalist_45.txt', 'federalist_46.txt', 'federalist_47.txt', 'federalist_48.txt', 'federalist_49.txt', 'federalist_5.txt', 'federalist_50.txt', 'federalist_5

Authorship attribution is done by comparing different *features* of the texts we are looking at. Examples include:

* lexical features (average sentence length, variation in sentence length, range of words used)
* punctuation features (average number of different marks per sentence)
* word count features (e.g. frequency of the different common 'function words')
* syntactic features (e.g. frequency of noun use, verb use, adjective use, etc.)

Essentially there are a whole lot of approaches to take, and usually you want to take as many approaches as possible to arrive at some sort of consensus answer. Today we will try two of them: looking at the use of function words, and at the relative frequency of parts of speech.
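The lexical features in the list above can be computed with nothing but the standard library. Here is a minimal sketch, assuming a very naive sentence splitter and word pattern; the function name `lexical_features` and the sample text are made up for illustration, and a real tokenizer (such as NLTK's) would handle abbreviations and the like much better.

```python
import re
from statistics import mean, pstdev

WORD = re.compile(r"[A-Za-z']+")

def lexical_features(text):
    # Naive split: a sentence ends at . ! or ? followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    lengths = [len(WORD.findall(s)) for s in sentences]
    words = [w.lower() for w in WORD.findall(text)]
    return {
        'avg_sentence_length': mean(lengths),
        'sentence_length_spread': pstdev(lengths),
        # Type-token ratio: distinct words divided by total words.
        'type_token_ratio': len(set(words)) / len(words),
    }

sample = ("The subject speaks its own importance. "
          "The union is the subject. Nothing is more certain.")
features = lexical_features(sample)
print(features)
```

The type-token ratio is one simple measure of lexical diversity; it is sensitive to text length, so it is best compared across samples of similar size.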

Getting the word count feature - the frequency of "function words"
------------------------------------------------------------------

These are the words that we would normally leave out of any vocabulary analysis because they are so common - 'the', 'a', 'and', 'of', 'to', and so on. Indeed we left them out of our topic modeling trial last week for this very reason, but for authorship attribution, conversely, they might be very relevant! Let's retrieve them from NLTK.

In [2]:
from nltk.corpus import stopwords
print(" : ".join(stopwords.words("english")))
print(len(stopwords.words("english")))

# Make the stopword list into a Python set. That will make our work much faster below.
swset = set(stopwords.words("english"))

i : me : my : myself : we : our : ours : ourselves : you : your : yours : yourself : yourselves : he : him : his : himself : she : her : hers : herself : it : its : itself : they : them : their : theirs : themselves : what : which : who : whom : this : that : these : those : am : is : are : was : were : be : been : being : have : has : had : having : do : does : did : doing : a : an : the : and : but : if : or : because : as : until : while : of : at : by : for : with : about : against : between : into : through : during : before : after : above : below : to : from : up : down : in : out : on : off : over : under : again : further : then : once : here : there : when : where : why : how : all : any : both : each : few : more : most : other : some : such : no : nor : not : only : own : same : so : than : too : very : s : t : can : will : just : don : should : now : d : ll : m : o : re : ve : y : ain : aren : couldn : didn : doesn : hadn : hasn : haven : isn : ma : mightn : mustn : needn 

Okay! Now, for each text, we have to count up the frequency of each of these words. This is called making a "feature vector" - each text will be reduced to a data structure that has a count for each of the function words.

**PAY ATTENTION HERE!** This step, the conversion of text files to feature vectors, is where you will make or break any of these text analysis techniques. As we will see, when we are doing authorship attribution we want to count the stopwords, but when we do topic modeling we want to count everything *BUT* the stopwords! Think carefully about the theory and ideas behind what you are doing, when you use these tools.
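To make that point concrete, here is a small library-free sketch in which the same tokens are filtered in the two opposite directions. The tiny `function_words` set is a hand-picked stand-in for NLTK's stopword list, and both helper names are invented for this illustration.

```python
from collections import Counter

# A tiny hand-picked list, standing in for the full NLTK stopword list.
function_words = {'the', 'of', 'to', 'and', 'a', 'in', 'upon', 'on', 'for'}

def function_word_counts(tokens):
    # Authorship attribution: keep ONLY the function words.
    return Counter(t for t in tokens if t in function_words)

def content_words(tokens):
    # Topic modeling: keep everything BUT the function words.
    return [t for t in tokens if t not in function_words]

sample = ("you are called upon to deliberate on a new constitution "
          "for the united states of america").split()
counts = function_word_counts(sample)
content = content_words(sample)
print(counts)
print(content)
```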

In [3]:
from gensim import corpora

# Reduce each text to its sequence of stopwords (a list of lists, one per paper)
stopword_texts = [[w.lower() for w in federalist.words(paper) 
                   if w.lower() in swset] 
                  for paper in federalist.fileids()]

# Now make a feature vector set - a gensim corpus - from these texts.
swdictionary = corpora.Dictionary(stopword_texts)
swdictionary.token2id

{'a': 24,
 'about': 106,
 'above': 123,
 'after': 55,
 'again': 83,
 'against': 45,
 'all': 32,
 'am': 51,
 'an': 26,
 'and': 8,
 'any': 14,
 'are': 43,
 'as': 39,
 'at': 48,
 'be': 1,
 'because': 72,
 'been': 3,
 'before': 41,
 'being': 15,
 'below': 110,
 'between': 88,
 'both': 89,
 'but': 20,
 'by': 11,
 'can': 7,
 'd': 122,
 'did': 99,
 'do': 50,
 'does': 95,
 'doing': 119,
 'down': 112,
 'during': 111,
 'each': 81,
 'few': 78,
 'for': 12,
 'from': 37,
 'further': 63,
 'had': 17,
 'has': 54,
 'have': 57,
 'having': 6,
 'he': 74,
 'her': 96,
 'here': 94,
 'hers': 124,
 'herself': 100,
 'him': 91,
 'himself': 79,
 'his': 92,
 'how': 107,
 'i': 31,
 'if': 9,
 'in': 59,
 'into': 42,
 'is': 28,
 'it': 33,
 'its': 29,
 'itself': 76,
 'just': 18,
 'me': 75,
 'more': 13,
 'most': 62,
 'my': 44,
 'myself': 116,
 'no': 47,
 'nor': 85,
 'not': 40,
 'now': 98,
 'of': 52,
 'off': 114,
 'on': 10,
 'once': 105,
 'only': 93,
 'or': 35,
 'other': 70,
 'our': 64,
 'ours': 115,
 'ourselves': 97,
 'o

So now we have our "texts", which are lists of stopwords, and we have our dictionary, which assigns a unique ID to each word. We put these things together to make a vector of each text, which will be a series of `(dictionaryID, count)` tuples. Anytime the count is zero, the dictionary ID will simply be left out of that text's vector. We will use the `doc2bow` method to do this; the result looks something like this.

In [4]:
swdictionary.doc2bow(stopword_texts[0])

[(0, 4),
 (1, 34),
 (2, 71),
 (3, 3),
 (4, 6),
 (5, 8),
 (6, 1),
 (7, 3),
 (8, 40),
 (9, 4),
 (10, 9),
 (11, 14),
 (12, 12),
 (13, 7),
 (14, 6),
 (15, 1),
 (16, 2),
 (17, 1),
 (18, 1),
 (19, 14),
 (20, 2),
 (21, 13),
 (22, 25),
 (23, 7),
 (24, 25),
 (25, 28),
 (26, 11),
 (27, 1),
 (28, 12),
 (29, 10),
 (30, 18),
 (31, 14),
 (32, 9),
 (33, 20),
 (34, 3),
 (35, 6),
 (36, 9),
 (37, 11),
 (38, 1),
 (39, 10),
 (40, 14),
 (41, 1),
 (42, 2),
 (43, 12),
 (44, 8),
 (45, 1),
 (46, 3),
 (47, 3),
 (48, 8),
 (49, 11),
 (50, 1),
 (51, 3),
 (52, 105),
 (53, 6),
 (54, 6),
 (55, 2),
 (56, 2),
 (57, 10),
 (58, 129),
 (59, 26),
 (60, 1),
 (61, 8),
 (62, 2),
 (63, 1),
 (64, 3),
 (65, 1),
 (66, 10),
 (67, 3),
 (68, 1),
 (69, 3),
 (70, 3),
 (71, 2),
 (72, 1),
 (73, 1)]

This was the first text, and now we want this sort of "bag of words" (bow) for all of the texts! We use a list comprehension again to get that.

In [5]:
stopword_corpus = [swdictionary.doc2bow(text) 
                   for text in stopword_texts]
len(stopword_corpus)

85
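If the `(dictionaryID, count)` representation still feels abstract, the following plain-Python imitation may help. It is only a rough sketch of the idea behind a dictionary plus `doc2bow`, not gensim's actual implementation; `build_token2id` and `to_bow` are hypothetical helper names, and gensim hands out its IDs in its own order, so the numbers will not match.

```python
from collections import Counter

def build_token2id(texts):
    # Give every distinct token the next free integer ID.
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(text, token2id):
    # Count tokens by ID; IDs with a zero count simply never appear.
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

texts = [['the', 'of', 'the', 'to'], ['of', 'and', 'of']]
token2id = build_token2id(texts)
bows = [to_bow(t, token2id) for t in texts]
print(token2id)
print(bows)
```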

Let's do something similar to get the distribution of parts of speech. We will POS-tag all the texts, choose the fifteen most common parts of speech throughout the corpus excluding punctuation, and then make a similar vector for each text counting the instances of each part of speech.

Here is how to tag a single text:

In [6]:
from nltk import pos_tag

pos_tag(federalist.words('federalist_1.txt'))

[('To', 'TO'),
 ('the', 'DT'),
 ('People', 'NNP'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('State', 'NNP'),
 ('of', 'IN'),
 ('New', 'NNP'),
 ('York', 'NNP'),
 (':', ':'),
 ('AFTER', 'NNP'),
 ('an', 'DT'),
 ('unequivocal', 'JJ'),
 ('experience', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('inefficiency', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('subsisting', 'VBG'),
 ('federal', 'JJ'),
 ('government', 'NN'),
 (',', ','),
 ('you', 'PRP'),
 ('are', 'VBP'),
 ('called', 'VBN'),
 ('upon', 'IN'),
 ('to', 'TO'),
 ('deliberate', 'VB'),
 ('on', 'IN'),
 ('a', 'DT'),
 ('new', 'JJ'),
 ('Constitution', 'NN'),
 ('for', 'IN'),
 ('the', 'DT'),
 ('United', 'NNP'),
 ('States', 'NNPS'),
 ('of', 'IN'),
 ('America', 'NNP'),
 ('.', '.'),
 ('The', 'DT'),
 ('subject', 'JJ'),
 ('speaks', 'VBZ'),
 ('its', 'PRP$'),
 ('own', 'JJ'),
 ('importance', 'NN'),
 (';', ':'),
 ('comprehending', 'VBG'),
 ('in', 'IN'),
 ('its', 'PRP$'),
 ('consequences', 'NNS'),
 ('nothing', 'NN'),
 ('less', 'JJR'),
 ('than', 'IN'),
 ('the', 'DT'),
 ('e

...so let's do this for all the texts, and put the resulting arrays into an outer array.

In [7]:
# Convert sequences of words into sequences of part-of-speech tags
pos_texts = []
for paper in federalist.fileids():
    tagged = pos_tag(federalist.words(paper))
    tagsonly = [x[1] for x in tagged]
    pos_texts.append(tagsonly)
    
pos_texts[17]

['TO',
 'DT',
 'NNP',
 'IN',
 'DT',
 'NNP',
 'IN',
 'NNP',
 'NNP',
 ':',
 'NNP',
 'NNP',
 'RB',
 'VB',
 'VBN',
 'IN',
 'DT',
 'NNS',
 'VBN',
 'IN',
 'DT',
 'VBG',
 'NN',
 'MD',
 'TO',
 'VB',
 'VBN',
 'IN',
 'IN',
 'DT',
 'NNP',
 'NNS',
 ',',
 'IN',
 'DT',
 'NN',
 'IN',
 'DT',
 'NNP',
 '.',
 'CC',
 'DT',
 'MD',
 'VB',
 ',',
 'IN',
 'NN',
 ',',
 'DT',
 'NN',
 'IN',
 'DT',
 'JJ',
 'NN',
 'IN',
 'PRP$',
 'JJ',
 'NN',
 ',',
 'IN',
 'PRP',
 'MD',
 'IN',
 'NN',
 'NN',
 'DT',
 'NN',
 'IN',
 'DT',
 'JJ',
 'NN',
 'IN',
 'DT',
 'JJ',
 'NN',
 'TO',
 'DT',
 'JJ',
 'NNS',
 ':',
 'DT',
 'NN',
 'NN',
 'TO',
 'DT',
 'NNS',
 ',',
 'JJ',
 'TO',
 'DT',
 ',',
 'CC',
 'NN',
 'TO',
 'DT',
 'NNP',
 '.',
 'DT',
 'NNS',
 'IN',
 'NNP',
 ',',
 'NNP',
 ',',
 'CC',
 'IN',
 'DT',
 'JJ',
 'NNS',
 'IN',
 'PRP$',
 'NN',
 'VB',
 'RB',
 'VB',
 'IN',
 'JJ',
 'NNS',
 ',',
 'CC',
 'VBZ',
 'DT',
 'NNP',
 'IN',
 'NNP',
 'TO',
 'NNP',
 '.',
 'DT',
 'NN',
 ',',
 'RB',
 'IN',
 'JJ',
 'NNS',
 ',',
 'VBZ',
 'RB',
 'JJ',
 '.',
 'CC

Now, just as before, make a dictionary out of these "texts".

In [8]:
posdict = corpora.Dictionary(pos_texts)
posdict.token2id

{'$': 41,
 "''": 39,
 '(': 19,
 ')': 2,
 ',': 16,
 '.': 27,
 ':': 18,
 'CC': 13,
 'CD': 34,
 'DT': 29,
 'EX': 25,
 'FW': 42,
 'IN': 10,
 'JJ': 3,
 'JJR': 7,
 'JJS': 28,
 'MD': 12,
 'NN': 5,
 'NNP': 17,
 'NNPS': 9,
 'NNS': 22,
 'PDT': 32,
 'POS': 40,
 'PRP': 24,
 'PRP$': 6,
 'RB': 26,
 'RBR': 4,
 'RBS': 33,
 'RP': 36,
 'SYM': 37,
 'TO': 11,
 'UH': 23,
 'VB': 1,
 'VBD': 21,
 'VBG': 30,
 'VBN': 0,
 'VBP': 31,
 'VBZ': 15,
 'WDT': 8,
 'WP': 14,
 'WP$': 35,
 'WRB': 20,
 '``': 38}

Hm, let's filter out the punctuation, and limit ourselves to the top 15 parts of speech. We can filter the dictionary like this. First let's see what the list of items looks like...

In [9]:
[x for x in posdict.iteritems()]

[(0, 'VBN'),
 (1, 'VB'),
 (35, 'WP$'),
 (2, ')'),
 (3, 'JJ'),
 (28, 'JJS'),
 (5, 'NN'),
 (6, 'PRP$'),
 (7, 'JJR'),
 (8, 'WDT'),
 (40, 'POS'),
 (38, '``'),
 (9, 'NNPS'),
 (10, 'IN'),
 (11, 'TO'),
 (12, 'MD'),
 (13, 'CC'),
 (14, 'WP'),
 (15, 'VBZ'),
 (42, 'FW'),
 (16, ','),
 (18, ':'),
 (19, '('),
 (20, 'WRB'),
 (21, 'VBD'),
 (41, '$'),
 (22, 'NNS'),
 (39, "''"),
 (23, 'UH'),
 (32, 'PDT'),
 (37, 'SYM'),
 (24, 'PRP'),
 (25, 'EX'),
 (26, 'RB'),
 (27, '.'),
 (4, 'RBR'),
 (29, 'DT'),
 (30, 'VBG'),
 (31, 'VBP'),
 (36, 'RP'),
 (17, 'NNP'),
 (33, 'RBS'),
 (34, 'CD')]

...and then let's make a note of the IDs of the punctuation ones...

In [10]:
punct_ids = [x[0] for x in posdict.iteritems() if not x[1].isalpha()]
punct_ids

[35, 2, 6, 38, 16, 18, 19, 41, 39, 27]

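...and finally use this list in our filtering logic. (Note that the `isalpha()` test also catches the possessive-pronoun tags 'PRP$' and 'WP$', IDs 6 and 35, so they are dropped along with the punctuation.)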

In [11]:
posdict.filter_tokens(bad_ids=punct_ids)
posdict.filter_extremes(no_above=1, keep_n=15)
posdict.token2id

{'CC': 12,
 'DT': 7,
 'IN': 9,
 'JJ': 3,
 'MD': 11,
 'NN': 5,
 'NNS': 4,
 'PRP': 14,
 'RB': 6,
 'TO': 10,
 'VB': 1,
 'VBD': 2,
 'VBN': 0,
 'VBZ': 13,
 'WDT': 8}
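The `isalpha()` test used above treats any tag containing a non-letter as punctuation, which includes the possessive tags 'PRP$' and 'WP$'. Whether to keep those is a judgment call; if you do want them, a slightly stricter test along these lines would work. This is a sketch only, and `is_punctuation_tag` is a name invented here.

```python
def is_punctuation_tag(tag):
    # Penn Treebank word-class tags are letters, optionally followed by
    # a single '$' (PRP$, WP$). Anything else is punctuation or a symbol.
    core = tag[:-1] if len(tag) > 1 and tag.endswith('$') else tag
    return not core.isalpha()

tags = ['NN', 'PRP$', 'WP$', ',', '.', ':', '(', ')', '``', "''", '$', 'VBZ']
kept = [t for t in tags if not is_punctuation_tag(t)]
print(kept)
```

With the dictionary from this lesson, the same function could replace the `isalpha()` check when building `punct_ids`; the resulting top fifteen tags might then differ from the ones shown above.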

Okay! We have our dictionary the way we want it, so we can make a second gensim corpus out of our texts.

In [12]:
pos_corpus = [posdict.doc2bow(text) for text in pos_texts]
pos_corpus[17]

[(0, 68),
 (1, 125),
 (2, 22),
 (3, 170),
 (4, 111),
 (5, 298),
 (6, 75),
 (7, 278),
 (8, 20),
 (9, 320),
 (10, 90),
 (11, 68),
 (12, 67),
 (13, 32),
 (14, 56)]

Now we have made two corpora from our texts; one represents the frequency of function words, and the other represents the frequency of common parts of speech.

But now we will want to normalize our vectors a little bit - some texts are a lot longer than others, so will have many more function words overall, and we don't want this fact to affect our results. So we need to scale the values in each tuple, keeping them in proportion with each other but always between 0 and 1. We do this by dividing every count in a vector by the largest count in that vector.

In [13]:
def scale(vector):
    # Divide every count by the largest count in the vector: the most
    # frequent item becomes 1.0 and the others keep their proportions.
    if not vector:
        return []
    maximum = float(max(count for _, count in vector))
    return [(item_id, count / maximum) for item_id, count in vector]

stopword_corpus = [scale(v) for v in stopword_corpus]
pos_corpus = [scale(v) for v in pos_corpus]
pos_corpus[17]

[(0, 0.2125),
 (1, 0.390625),
 (2, 0.06875),
 (3, 0.53125),
 (4, 0.346875),
 (5, 0.93125),
 (6, 0.234375),
 (7, 0.86875),
 (8, 0.0625),
 (9, 1.0),
 (10, 0.28125),
 (11, 0.2125),
 (12, 0.209375),
 (13, 0.1),
 (14, 0.175)]
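The `scale` function above ends up dividing each count by the largest count in the vector. Another common choice is to divide by the total, which turns the counts into relative frequencies that sum to 1. The sketch below puts the two side by side on made-up vectors; `scale_by_max` and `scale_by_total` are illustrative names, not part of gensim.

```python
def scale_by_max(vector):
    # Largest value becomes 1.0.
    top = max(count for _, count in vector)
    return [(i, count / top) for i, count in vector]

def scale_by_total(vector):
    # Values become shares of the whole and sum to 1.0.
    total = sum(count for _, count in vector)
    return [(i, count / total) for i, count in vector]

# Two invented texts with the same proportions but very different lengths.
short_text = [(0, 2), (1, 8), (2, 10)]
long_text = [(0, 20), (1, 80), (2, 100)]

print(scale_by_max(short_text), scale_by_max(long_text))
print(scale_by_total(short_text), scale_by_total(long_text))
```

Either way, the short and the long text end up with the same vector, which is exactly the effect we are after.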

Getting the results
-------------------
Okay! We have two sets of criteria - the frequency of our function words and the frequency of our parts of speech - and a corresponding set of values for each text. It's time to crunch the numbers and see which papers resemble each other.

We know that there were three authors, so we want to see if we can make the 85 different papers cluster into three groups. There is a statistical function for this called KMeans, from the "scikit-learn" module which has a lot of things for machine learning. (Dividing data into clusters of similar things is a pretty common thing to have to do in machine learning. Lucky for us.)
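For intuition about what k-means clustering does, here is a bare-bones version of the basic iteration (often called Lloyd's algorithm) in plain Python. It is a teaching sketch under simplifying assumptions: the starting centroids are passed in by hand and the loop runs a fixed number of times. scikit-learn's `KMeans` is considerably more sophisticated, for instance in how it picks its starting points.

```python
def kmeans(points, start_centroids, iterations=10):
    centroids = list(start_centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2
                                  for p, c in zip(point, centroids[k])))
            for point in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return labels, centroids

# Six invented two-feature "texts" forming three obvious groups.
points = [(0.1, 0.2), (0.0, 0.1), (0.9, 1.0), (1.0, 0.8), (0.5, 0.1), (0.6, 0.0)]
labels, centroids = kmeans(points, [points[0], points[2], points[4]])
print(labels)
```

The cluster numbers carry no meaning of their own: different starting centroids can give the same grouping with the numbers swapped around.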

First we define a function to do the clustering for each data set:

In [14]:
from sklearn.cluster import KMeans

def PredictAuthors(fvs):
    km = KMeans(n_clusters=3)
    km.fit(fvs)
    return km

In order to use this, we need to convert our gensim corpus into a matrix that SciPy recognizes. Gensim gives us a utility to do this. In order to get our matrix the right way around, we will also have to transpose it.

In [15]:
from numpy import transpose
from gensim.matutils import corpus2csc

stopword_matrix = transpose(corpus2csc(stopword_corpus))
pos_matrix = transpose(corpus2csc(pos_corpus))
pos_matrix

<85x15 sparse matrix of type '<class 'numpy.float64'>'
	with 1275 stored elements in Compressed Sparse Row format>
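The reason for the transpose is that `corpus2csc` lays the matrix out with one column per document, while the clustering wants one row per document. The toy function below shows the target shape using ordinary lists; `bows_to_rows` is a hypothetical helper written for this illustration and builds a dense table rather than a sparse matrix.

```python
def bows_to_rows(corpus, num_features):
    # One row per document, one column per feature ID; missing IDs stay 0.
    rows = []
    for bow in corpus:
        row = [0.0] * num_features
        for feature_id, value in bow:
            row[feature_id] = value
        rows.append(row)
    return rows

corpus = [[(0, 1.0), (2, 0.5)], [(1, 0.25)]]
rows = bows_to_rows(corpus, num_features=3)
for row in rows:
    print(row)
```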

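And then we run this on our matrices of function word and part-of-speech frequencies. The fitted model holds several things; we ask for its `labels_`, the cluster number assigned to each text. (K-means starts from a random initialization, so the numbering, and sometimes the grouping itself, can differ from run to run.) The result looks something like this: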

In [16]:
stopword_result = PredictAuthors( stopword_matrix ).labels_ 
print(stopword_result)
pos_result = PredictAuthors( pos_matrix ).labels_ 
print(pos_result)

[0 2 0 0 0 2 0 0 2 2 2 1 2 0 0 2 0 0 0 2 2 0 1 0 0 2 2 0 0 0 0 0 2 1 2 2 2
 2 2 2 2 2 2 2 1 2 2 2 2 2 0 0 2 2 0 0 0 0 0 2 1 2 2 2 0 2 0 0 2 0 0 0 0 0
 0 2 0 0 2 0 2 0 0 0 0]
[1 0 1 0 1 0 0 1 0 0 0 2 0 0 0 1 1 1 1 1 1 1 2 1 1 1 1 1 0 1 0 1 0 2 1 0 0
 1 0 0 0 0 0 0 2 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 2 1 1 1 1 1 0 1 1 1 1 1 1 1
 1 1 1 0 0 1 0 1 1 1 0]


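Each of these numbers (0, 1, 2) stands for a cluster, which we hope corresponds to an author. The numbers themselves are arbitrary, but we know that Hamilton was responsible for most of the papers, Madison for most of the rest, and Jay for the fewest. So let's assign the authors on that assumption: the biggest cluster is Hamilton, the next biggest Madison, and the smallest Jay.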

In [17]:
from nltk.probability import FreqDist
author_order = ["Hamilton", "Madison", "Jay"]

freq_order = FreqDist(stopword_result).most_common(3)
print(freq_order)

mapping = {}
for i in range(3):
    mapping[freq_order[i][0]] = author_order[i]
mapping

[(0, 42), (2, 38), (1, 5)]


{0: 'Hamilton', 1: 'Jay', 2: 'Madison'}

Now we can put this into a function definition, since we'll have to do it twice.

In [18]:
def assign_author(result):
    author_order = ["Hamilton", "Madison", "Jay"]
    freq_order = FreqDist(result).most_common(3)
    mapping = {}
    for i in range(3):
        mapping[freq_order[i][0]] = author_order[i]
        
    return [mapping.get(x) for x in result]

assign_author(stopword_result)

['Hamilton',
 'Madison',
 'Hamilton',
 'Hamilton',
 'Hamilton',
 'Madison',
 'Hamilton',
 'Hamilton',
 'Madison',
 'Madison',
 'Madison',
 'Jay',
 'Madison',
 'Hamilton',
 'Hamilton',
 'Madison',
 'Hamilton',
 'Hamilton',
 'Hamilton',
 'Madison',
 'Madison',
 'Hamilton',
 'Jay',
 'Hamilton',
 'Hamilton',
 'Madison',
 'Madison',
 'Hamilton',
 'Hamilton',
 'Hamilton',
 'Hamilton',
 'Hamilton',
 'Madison',
 'Jay',
 'Madison',
 'Madison',
 'Madison',
 'Madison',
 'Madison',
 'Madison',
 'Madison',
 'Madison',
 'Madison',
 'Madison',
 'Jay',
 'Madison',
 'Madison',
 'Madison',
 'Madison',
 'Madison',
 'Hamilton',
 'Hamilton',
 'Madison',
 'Madison',
 'Hamilton',
 'Hamilton',
 'Hamilton',
 'Hamilton',
 'Hamilton',
 'Madison',
 'Jay',
 'Madison',
 'Madison',
 'Madison',
 'Hamilton',
 'Madison',
 'Hamilton',
 'Hamilton',
 'Madison',
 'Hamilton',
 'Hamilton',
 'Hamilton',
 'Hamilton',
 'Hamilton',
 'Hamilton',
 'Madison',
 'Hamilton',
 'Hamilton',
 'Madison',
 'Hamilton',
 'Madison',
 'Hamilton

So how did that do against reality? Let's read in the real answers and add them to the table.

In [19]:
real_author = []
with open("../textfiles/federalist/metadata.txt") as ff:
    for line in ff:
        data = line.split()
        our_author = data[1].title()
        # Joint or disputed papers have 'AND' / 'OR' plus a second name
        if data[2] in ('AND', 'OR'):
            our_author = " ".join([our_author, data[2].lower(),
                                   data[3].title()])
        real_author.append(our_author)
print(real_author)

['Hamilton', 'Jay', 'Jay', 'Jay', 'Jay', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Madison', 'Hamilton', 'Hamilton', 'Hamilton', 'Madison', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton and Madison', 'Hamilton and Madison', 'Hamilton and Madison', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Madison', 'Madison', 'Madison', 'Madison', 'Madison', 'Madison', 'Madison', 'Madison', 'Madison', 'Madison', 'Madison', 'Madison', 'Hamilton or Madison', 'Hamilton or Madison', 'Hamilton or Madison', 'Hamilton or Madison', 'Hamilton or Madison', 'Hamilton or Madison', 'Hamilton or Madison', 'Hamilton or Madison', 'Hamilton or Madison', 'Madison', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton or Madison', 'Hamilton or Madison', 'Jay', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamil

Let's make an HTML table for comparison.

In [20]:
from IPython.display import HTML

stopword_authors = assign_author(stopword_result)
pos_authors = assign_author(pos_result)

def colorcode(assigned, real):
    cellcolor = 'red'
    if assigned == real:
        cellcolor = 'green'
    elif real.find(assigned) > -1:
        cellcolor = 'orange'
    return cellcolor

print(len(stopword_authors))
print(len(pos_authors))
print(len(real_author))
    
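# fileids() returns the files sorted as strings ('federalist_1.txt',
# 'federalist_10.txt', ...), so position i in our results is NOT paper i+1.
# Look each result up by the paper number taken from its file name instead.
import re
paper_numbers = [int(re.search(r'\d+', f).group()) for f in federalist.fileids()]
stopword_by_paper = dict(zip(paper_numbers, stopword_authors))
pos_by_paper = dict(zip(paper_numbers, pos_authors))

answer_table = '<table><tr><th>Paper</th><th>Stopwords</th><th>Parts of speech</th><th>Real</th></tr>'
for i, ra in enumerate(real_author, start=1):   # metadata.txt is in paper order
    sa = stopword_by_paper[i]
    pa = pos_by_paper[i]
    answer_table += '<tr><td>%d</td>' % i     # Print the paper number
    answer_table += '<td style="color: %s;">%s</td>' % (colorcode(sa, ra), sa)
    answer_table += '<td style="color: %s;">%s</td>' % (colorcode(pa, ra), pa)
    answer_table += '<td>%s</td></tr>' % ra
answer_table += '</table>'

HTML(answer_table)

85
85
85


Paper,Stopwords,Parts of speech,Real
1,Hamilton,Hamilton,Hamilton
2,Jay,Jay,Jay
3,Jay,Jay,Jay
4,Jay,Jay,Jay
5,Jay,Jay,Jay
6,Hamilton,Madison,Hamilton
7,Hamilton,Madison,Hamilton
8,Hamilton,Madison,Hamilton
9,Hamilton,Madison,Hamilton
10,Madison,Madison,Madison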

...As you can see, the method is not perfect. 😄 A better method for the Federalist Papers problem would be a *supervised* one: train a model on the papers whose authorship is known, so that it can take that knowledge into account when it judges the disputed ones.

Probably the most commonly-used method for authorship attribution today is known as Burrows' Delta, named after John Burrows who came up with it. The Delta algorithms are available in [a statistical package](https://sites.google.com/site/computationalstylistics/) called `stylo`, written for the R programming language for statistical computing. If this is something you anticipate wanting to use, that is a very good place to start.
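To give a feel for the idea behind Delta, here is a rough Python sketch of its classic form: take the relative frequencies of a set of common words, convert each word's frequency to a z-score across the texts, and measure the distance between two texts as the mean absolute difference of their z-scores. This is a simplified illustration on invented data, not the `stylo` implementation; the choice of population standard deviation and the hand-picked `vocab` are assumptions made here. In practice the vocabulary would be the most frequent words of the whole corpus.

```python
from collections import Counter
from statistics import mean, pstdev

def relative_freqs(tokens, vocab):
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]

def burrows_delta(texts, vocab):
    """Return {(name_a, name_b): delta} for every pair of texts."""
    names = list(texts)
    freqs = {n: relative_freqs(texts[n], vocab) for n in names}
    # z-score each word's frequency across all the texts
    zscores = {n: [] for n in names}
    for j in range(len(vocab)):
        column = [freqs[n][j] for n in names]
        mu, sigma = mean(column), pstdev(column)
        for n in names:
            zscores[n].append((freqs[n][j] - mu) / sigma if sigma else 0.0)
    deltas = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            deltas[(a, b)] = mean(abs(x - y)
                                  for x, y in zip(zscores[a], zscores[b]))
    return deltas

# Three tiny invented "texts"; A1 and A2 share the same word habits.
texts = {
    'A1': 'the of the of the to'.split(),
    'A2': 'the of the of to the'.split(),
    'B1': 'to to and to of and'.split(),
}
deltas = burrows_delta(texts, vocab=['the', 'of', 'to', 'and'])
print(deltas)
```

A smaller Delta means more similar usage, so a disputed text would be attributed to the candidate whose known writing gives the smallest value.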