
# Authorship of the Federalist Papers

The _Federalist Papers_ were a set of 85 essays published between 1787 and 1788 to promote the ratification of the United States Constitution. They were originally published under the pseudonym "Publius". Although the identity of the authors was a closely guarded secret at the time, most of the papers have since been conclusively attributed to one of Hamilton, Jay, or Madison. The known authorships can be found in `https://dlsun.github.io/pods/data/federalist/authorship.csv`.

For 15 of the papers, however, the authorship remains disputed. (These papers can be identified from the `authorship.csv` file because the "Author" field is blank.) In this analysis, you will use the papers with known authorship to predict the authorship of the disputed papers. The text of each paper is available at `https://dlsun.github.io/pods/data/federalist/x.txt`, where `x` is the number of the paper (i.e., a number from 1 to 85).

## Question 1

When analyzing an author's style, common words like "the" and "on" are actually more useful than rare words like "hostilities". That is because rare words typically signify context. Context is useful if you are trying to find documents about similar topics, but not so useful if you are trying to identify an author's style because different authors can write about the same topic. For example, both Dr. Seuss and Charles Dickens used rare words like "chimney" and "stockings" in _How the Grinch Stole Christmas_ and _A Christmas Carol_, respectively. But they used common words very differently: Dickens used the word "upon" over 100 times, while Dr. Seuss did not use "upon" even once.
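To make the point about common words concrete, here is a small sketch that measures how often one function word appears per 1,000 tokens. The helper name `rate_per_thousand`, the tokenizer regex, and both snippets are assumptions invented for this illustration; neither snippet is a quotation from either book.

```python
import re

def rate_per_thousand(text, word):
    """Occurrences of `word` per 1,000 tokens of `text` (case-insensitive)."""
    # Simplifying assumption: a token is a run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return 1000 * tokens.count(word) / len(tokens)

# Two invented snippets about the same topic, written in different styles.
snippet_a = "Upon the roof the snow lay deep, and upon the chimney sat a bird."
snippet_b = "The snow lay deep on the roof, and a bird sat on the chimney."

rate_a = rate_per_thousand(snippet_a, "upon")
rate_b = rate_per_thousand(snippet_b, "upon")
print(rate_a, rate_b)
```

Both snippets share the topical words "chimney" and "snow", yet the rate of "upon" separates them, which is the kind of signal the term-frequency vectors below are meant to capture.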

Read in the Federalist Papers. Convert each one into a vector of term frequencies. In order to restrict to common words, include only the top 50 words across the corpus. (Because we are restricting to the most common words already, there is no reason to reweight them using TF-IDF.)
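The steps described above can be sketched on a toy corpus using only the standard library. The function name `top_k_term_counts`, the tokenizer regex, and the three toy sentences are assumptions made for this example and are not part of the lab data.

```python
import re
from collections import Counter

def top_k_term_counts(docs, k):
    """Return (vocab, rows): the k most common words and per-document counts."""
    # Simplifying assumption: lowercase the text and treat runs of letters as words.
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    corpus_counts = Counter(w for toks in tokenized for w in toks)
    # Rank by descending corpus count; break ties alphabetically for determinism.
    ranked = sorted(corpus_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [w for w, _ in ranked[:k]]
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # One count per vocabulary word, in vocabulary order.
        rows.append([counts[w] for w in vocab])
    return vocab, rows

toy_docs = [
    "the power of the people",
    "the states and the people of the union",
    "upon the whole",
]
vocab, rows = top_k_term_counts(toy_docs, k=3)
print(vocab)
print(rows)
```

In scikit-learn, passing `max_features=50` to `CountVectorizer` restricts the vocabulary to the most frequent terms across the corpus in a similar way, although its tokenization and tie-breaking may differ from this sketch.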

In [159]:
import pandas as pd
import requests
import numpy as np

auth = pd.read_csv("https://dlsun.github.io/pods/data/federalist/authorship.csv")

paperdir = "https://dlsun.github.io/pods/data/federalist/"

# Download the text of every paper, keyed by its paper number (as a string)
papers = pd.Series({f"{p}": requests.get(paperdir + f"{p}.txt").text for p in auth.Paper})


In [160]:
from sklearn.feature_extraction.text import CountVectorizer

extractor = CountVectorizer()
vect = extractor.fit_transform(papers)

In [161]:
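# Column sums give each word's total count across the corpus; argsort is ascending,
# so the last 50 positions hold the 50 most common words (most common last)
top50 = vect.sum(axis=0).argsort()[0,-50:]
top50_words = extractor.get_feature_names_out()[top50]
# Keep only those 50 columns: one row per paper, one column per word
top50_vect = vect.toarray()[:,top50[0]].reshape((85,50))
top50_vect, top50_words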

(array([[  8,   8,   3, ...,  72, 106, 133],
        [  5,   0,   0, ...,  53,  83, 107],
        [  0,   0,   3, ...,  56,  62,  93],
        ...,
        [  5,  13,   6, ..., 219, 331, 485],
        [ 11,  28,  11, ..., 140, 293, 390],
        [  4,  13,  10, ..., 115, 172, 246]]),
 array([['we', 'constitution', 'can', 'those', 'no', 'any', 'one', 'them',
         'than', 'if', 'at', 'more', 'has', 'power', 'people', 'other',
         'its', 'but', 'all', 'state', 'been', 'may', 'government',
         'states', 'an', 'they', 'on', 'are', 'from', 'with', 'their',
         'not', 'or', 'for', 'will', 'have', 'would', 'this', 'as', 'by',
         'which', 'is', 'it', 'that', 'be', 'in', 'and', 'to', 'of',
         'the']], dtype=object))

## Question 2
Make a visualization that summarizes the most common words used by Hamilton, Madison, and Jay.

In [162]:
top50_ham = top50_vect[auth[auth.Author == "Hamilton"].index].sum(axis = 0)
top50_mad = top50_vect[auth[auth.Author == "Madison"].index].sum(axis = 0)
top50_jay = top50_vect[auth[auth.Author == "Jay"].index].sum(axis = 0)
topwordsdf = pd.DataFrame({"Hamilton": top50_ham, "Madison": top50_mad, "Jay": top50_jay})

In [163]:
import altair as alt
toplot = pd.melt(topwordsdf.reset_index(), id_vars='index', value_vars=['Hamilton', 'Madison', 'Jay'])
# One line per author: total count (y) of each top-50 word, by its position in top50_words (x)
alt.Chart(toplot.reset_index()).mark_line().encode(
    x="index:Q",
    y="value:Q",
    color="variable"
)

In [164]:
top50_words

array([['we', 'constitution', 'can', 'those', 'no', 'any', 'one', 'them',
        'than', 'if', 'at', 'more', 'has', 'power', 'people', 'other',
        'its', 'but', 'all', 'state', 'been', 'may', 'government',
        'states', 'an', 'they', 'on', 'are', 'from', 'with', 'their',
        'not', 'or', 'for', 'will', 'have', 'would', 'this', 'as', 'by',
        'which', 'is', 'it', 'that', 'be', 'in', 'and', 'to', 'of',
        'the']], dtype=object)

In the visualization above, the x-axis is the index into the array above, so the most common words are on the right. Below, the same counts are plotted as a proportion of each author's total count of these 50 words.
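Raw totals mostly reflect how much each author wrote, so dividing each author's counts by that author's own total puts the lines on a common scale. A minimal sketch with made-up numbers (the counts and the helper name `to_proportions` are illustrative assumptions, not values from the papers):

```python
def to_proportions(counts):
    """Divide each count by the sum so the values add up to 1."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical counts of three words for two authors with very different output.
toy_counts = {"A": [10, 30, 60], "B": [1, 3, 6]}
toy_props = {name: to_proportions(c) for name, c in toy_counts.items()}
print(toy_props)
```

Author "A" wrote ten times as much as author "B" in this toy data, but their proportions are identical, so any remaining gap between normalized lines reflects word preference rather than volume.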

In [165]:
toplot["val_p"] = toplot["value"]/toplot.groupby("variable")["value"].transform("sum")
alt.Chart(toplot.reset_index()).mark_line().encode(
    x="index:Q",
    y="val_p:Q",
    color="variable"
)

## Question 3

For each of the documents with disputed authorships, find the 5 most similar documents with _known_ authorships, using cosine distance on the term frequencies. Use the authors of these 5 most similar documents to predict the author of each disputed document. (For example, if 3 of the 5 closest documents were written by Hamilton, 1 by Madison, and 1 by Jay, then we would predict that the disputed document was written by Hamilton.)
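The voting procedure described above can be sketched in plain Python. The smallest cosine distance corresponds to the largest cosine similarity, so the neighbors must be ranked from most to least similar before taking the first `k`. The function names `cosine_sim` and `knn_predict`, the two-dimensional vectors, and the labels "X" and "Y" are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def cosine_sim(u, v):
    """Cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_predict(query, known_vectors, known_labels, k=5):
    """Majority label among the k known vectors most similar to the query."""
    sims = [cosine_sim(query, v) for v in known_vectors]
    # Sort indices by similarity in descending order: most similar first.
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    votes = Counter(known_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-word count vectors with known labels.
known_vectors = [[9, 1], [8, 2], [7, 3], [1, 9], [2, 8], [5, 5]]
known_labels = ["X", "X", "X", "Y", "Y", "Y"]

prediction = knn_predict([10, 2], known_vectors, known_labels, k=3)
print(prediction)
```

Because cosine similarity depends only on direction, a long document and a short one with the same word proportions count as identical, which is why no length normalization is applied first.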

In [166]:
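from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity

auth = auth.fillna('unknown')
unknowns = auth[auth.Author == "unknown"].index
# argsort is ascending, so reverse each row to rank papers from most to least similar
ranked = cosine_similarity(top50_vect).argsort(axis=1)[:, ::-1]
auths = []
for doc in unknowns:
  # the 5 most similar papers with a known author (this also skips the paper itself)
  nearest = [j for j in ranked[doc] if j not in unknowns][:5]
  votes = Counter(auth.Author[j] for j in nearest)
  auths.append(votes.most_common(1)[0][0])
pd.Series(auths, index=unknowns)

17    Jay
18    Jay
19    Jay
48    Jay
49    Jay
50    Jay
51    Jay
52    Jay
53    Jay
54    Jay
55    Jay
56    Jay
57    Jay
61    Jay
62    Jay
dtype: object

The output recorded above attributes all 15 disputed papers to Jay. It was produced with the similarities sorted in ascending order, so each vote was taken over the five least similar known papers rather than the five most similar. The cell above ranks neighbors from most to least similar; rerun it to get the k-nearest-neighbor predictions.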

## Submission Instructions

- Copy this notebook to your own Drive, if you have not already.
- Restart this notebook and run the cells from beginning to end. 
  - Go to Runtime > Restart and Run All.
- Rename this notebook by clicking on "DATA 301 Lab 5 - YOUR NAMES HERE" at the very top of this page. Replace "YOUR NAMES HERE" with the first and last names of you (and your partners, for Phase 2).
- Get the link to your notebook:
  - Click on "Share" at the top-right. 
  - Change the settings to "Anyone with the link can view". 
  - Copy the sharing link into Canvas.