<a href="https://colab.research.google.com/github/erikluu/lab-5-erikluu/blob/main/Copy_of_DATA_301_Lab_5_Erik_Luu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Authorship of the Federalist Papers

The _Federalist Papers_ were a set of 85 essays published between 1787 and 1788 to promote the ratification of the United States Constitution. They were originally published under the pseudonym "Publius". Although the identity of the authors was a closely guarded secret at the time, most of the papers have since been conclusively attributed to one of Hamilton, Jay, or Madison. The known authorships can be found in `https://dlsun.github.io/pods/data/federalist/authorship.csv`.

For 15 of the papers, however, the authorships remain disputed. (These papers can be identified from the `authorship.csv` file because the "Author" field is blank.) In this analysis, you will use the papers with known authorship to predict the authorships of the disputed papers. The text of each paper is available at `https://dlsun.github.io/pods/data/federalist/x.txt`, where `x` is the number of the paper (i.e., a number from 1 to 85). The name of the file indicates the number of the paper.

In [310]:
import pandas as pd

papers = pd.read_csv("https://dlsun.github.io/pods/data/federalist/authorship.csv")

a = papers.fillna("Unknown")
unknown = a[a["Author"] == "Unknown"]
known = a[a["Author"] != "Unknown"]

## Question 1

When analyzing an author's style, common words like "the" and "on" are actually more useful than rare words like "hostilities". That is because rare words typically signify context. Context is useful if you are trying to find documents about similar topics, but not so useful if you are trying to identify an author's style because different authors can write about the same topic. For example, both Dr. Seuss and Charles Dickens used rare words like "chimney" and "stockings" in _How the Grinch Stole Christmas_ and _A Christmas Carol_, respectively. But they used common words very differently: Dickens used the word "upon" over 100 times, while Dr. Seuss did not use "upon" even once.

Read in the Federalist Papers. Convert each one into a vector of term frequencies. In order to restrict to common words, include only the top 50 words across the corpus. (Because we are restricting to the most common words already, there is no reason to reweight them using TF-IDF.)

In [311]:
known_files = []
for i in known['Paper'].to_list():
  known_files.append(str(i) + ".txt")

unknown_files = []
for i in unknown['Paper'].to_list():
  unknown_files.append(str(i) + ".txt")

In [312]:
from toolz.itertoolz import get
import requests
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

papers_dir = "https://dlsun.github.io/pods/data/federalist/"

# known
docs_known = pd.Series()
for file in known_files:
  response = requests.get(papers_dir + file, "r")
  docs_known[file[:-4]] = response.text

vec = CountVectorizer(ngram_range=(1, 3), max_features=50)
vec.fit(docs_known)
tf_sparse = vec.transform(docs_known)
tf_sparse_known = tf_sparse.todense()

# unknown
docs_unknown = pd.Series()
for file in unknown_files:
  response = requests.get(papers_dir + file, "r")
  docs_unknown[file[:-4]] = response.text

vec = CountVectorizer(ngram_range=(1, 3), max_features=50)
vec.fit(docs_unknown)
tf_sparse = vec.transform(docs_unknown)
tf_sparse_unknown = tf_sparse.todense()

  if __name__ == '__main__':


## Question 2
Make a visualization that summarizes the most common words used by Hamilton, Madison, and Jay.

In [313]:
def getBagOfWords(docs):
  bag_of_words = (
    docs.
    str.lower().                  # convert all letters to lowercase
    str.replace("[^\w\s]", " ").  # replace non-alphanumeric characters by whitespace
    str.split()                   # split on whitespace
  ).apply(Counter)

  return bag_of_words

def getVectorFreq(papers, docs):
  papers_dir = "https://dlsun.github.io/pods/data/federalist/{x}.txt"
  for p in papers:
    p = str(p)
    response = requests.get(papers_dir.format(x=p), "r")
    docs[p[:-4]] = response.text

  return docs

# Hamilton
hamilton = papers[papers["Author"] == "Hamilton"]
hamilton_papers = hamilton["Paper"].tolist()

docs_hamilton = pd.Series()
docs_hamilton = getVectorFreq(hamilton_papers, docs_hamilton)
docs_hamilton = getBagOfWords(docs_hamilton)[0].most_common(50)


# Jay
jay = papers[papers["Author"] == "Jay"]
jay_papers = jay["Paper"].tolist()

docs_jay = pd.Series()
docs_jay = getVectorFreq(jay_papers, docs_jay)
docs_jay = getBagOfWords(docs_jay)[0].most_common(50)

# Madison
madison = papers[papers["Author"] == "Madison"]
madison_papers = madison["Paper"].tolist()

docs_madison = pd.Series()
docs_madison = getVectorFreq(madison_papers, docs_madison)
docs_madison = getBagOfWords(docs_madison)[0].most_common(50)

  """


In [314]:
from altair import *

def makeChart(source):
  bars = Chart(source).mark_bar().encode(
      x='Count',
      y="Word"
  )

  text = bars.mark_text(
      align='left',
      baseline='middle',
      dx=3  # Nudges text to right so it doesn't appear on top of the bar
  )

  return (bars + text).properties(height=900)

h = pd.DataFrame(list(docs_hamilton), columns=["Word", "Count"])
h["Author"] = "Hamilton"
j = pd.DataFrame(list(docs_jay), columns=["Word", "Count"])
j["Author"] = "Jay"
m = pd.DataFrame(list(docs_madison), columns=["Word", "Count"])
m["Author"] = "Madison"

df = pd.concat([h, j, m])

bars = Chart(df).mark_bar(opacity=0.7).encode(
    x=X("Count", stack=None),
    y="Word",
    color="Author",
)

# I want to unstack them... this works for now.

bars.properties(height=900)

## Question 3

For each of the documents with disputed authorships, find the 5 most similar documents with _known_ authorships, using cosine distance on the term frequencies. Use the authors of these 5 most similar documents to predict the author of each disputed document. (For example, if 3 of the 5 closest documents were written by Hamilton, 1 by Madison, and 1 by Jay, then we would predict that the disputed document was written by Hamilton.)

In [316]:
from sklearn.metrics.pairwise import cosine_distances
import numpy as np
import copy

n = unknown["Paper"].to_list()

sims = []

# for d in tf_sparse_unknown:
#   t = np.append(tf_sparse_known, d, axis=0)
#   dists = cosine_distances(tf_sparse_known)[-1]
#   index = known["Paper"].to_list()
#   index = index.append(n.pop(0))
#   matches = pd.Series(dists, index)
#   sims.append(matches.sort_values()[:6])

for paper in unknown_files:
  t_files = None
  t_files = copy.deepcopy(known_files)
  t_files.append(paper)

  docs_temp = pd.Series()
  for f in t_files:
    response = requests.get(papers_dir + f, "r")
    docs_temp[f[:-4]] = response.text
    
  vec = CountVectorizer(ngram_range=(1, 3), max_features=50)
  vec.fit(docs_temp)
  tf_sparse_x = vec.transform(docs_temp)
  tf_sparse_x = tf_sparse_x.todense()

  dists = cosine_distances(tf_sparse_x)[-1]
  index = known["Paper"].to_list()
  index.append(int(paper.split(".")[0]))
  matches = pd.Series(dists, index)
  sims.append(matches.sort_values()[:6])

print(sims)




[18    0.000000
47    0.015769
17    0.025336
44    0.027136
69    0.027148
45    0.030364
dtype: float64, 19    0.000000
47    0.015078
17    0.019531
42    0.020488
37    0.021384
44    0.022504
dtype: float64, 20    0.000000
48    0.023105
42    0.024386
37    0.025430
70    0.025674
10    0.025758
dtype: float64, 49    0.000000
71    0.014155
78    0.015418
41    0.015492
28    0.016115
39    0.016337
dtype: float64, 50    0.000000
48    0.022905
37    0.022993
14    0.023615
65    0.024072
44    0.024075
dtype: float64, 51    0.000000
43    0.017123
39    0.018676
41    0.018902
46    0.019509
65    0.019579
dtype: float64, 52    0.000000
41    0.010484
43    0.010962
78    0.012090
71    0.012523
14    0.013527
dtype: float64, 53    0.000000
41    0.011182
43    0.013071
14    0.013155
48    0.014094
36    0.014110
dtype: float64, 54    0.000000
39    0.014590
81    0.015302
48    0.016196
43    0.017949
83    0.020538
dtype: float64, 55    0.000000
14    0.011634
35    0.013539




In [332]:
papers = pd.read_csv("https://dlsun.github.io/pods/data/federalist/authorship.csv")

def most_frequent(List):
    occurence_count = Counter(List)
    return occurence_count.most_common(1)[0][0]

predicted_authors = [] 
for s in sims:
  p = s.index.tolist()[1:]
  temp = []
  for x in p:
    temp.append(papers.iloc[x]["Author"])
  predicted_authors.append(most_frequent(temp))

df = pd.DataFrame()
df["Paper"] = unknown["Paper"]
df["Author"] = predicted_authors
df

Unnamed: 0,Paper,Author
17,18,Madison
18,19,Madison
19,20,Madison
48,49,Hamilton
49,50,Madison
50,51,Madison
51,52,Hamilton
52,53,Madison
53,54,Madison
54,55,Hamilton


Of these papers, most are Hamilton and Madison. Hamilton makes sense as the amjority fo the known papers were written by Hamilton. I was surprised to see so many Madisons. This was a cool lab.

In [335]:
papers["Author"].value_counts()

Hamilton    51
Madison     14
Jay          5
Name: Author, dtype: int64

## Submission Instructions

- Copy this notebook to your own Drive, if you have not already.
- Restart this notebook and run the cells from beginning to end. 
  - Go to Runtime > Restart and Run All.
- Rename this notebook by clicking on "DATA 301 Lab 5 - YOUR NAMES HERE" at the very top of this page. Replace "YOUR NAMES HERE" with the first and last names of you (and your partners, for Phase 2).
- Get the link to your notebook:
  - Click on "Share" at the top-right. 
  - Change the settings to "Anyone with the link can view". 
  - Copy the sharing link into Canvas.