
# Authorship of the Federalist Papers

The _Federalist Papers_ were a set of 85 essays published between 1787 and 1788 to promote the ratification of the United States Constitution. They were originally published under the pseudonym "Publius". Although the identity of the authors was a closely guarded secret at the time, most of the papers have since been conclusively attributed to one of Hamilton, Jay, or Madison. The known authorships can be found in `https://dlsun.github.io/pods/data/federalist/authorship.csv`.

For 15 of the papers, however, the authorship remains disputed. (These papers can be identified in `authorship.csv` by their blank "Author" field.) In this analysis, you will use the papers with known authorship to predict the authors of the disputed papers. The text of each paper is available at `https://dlsun.github.io/pods/data/federalist/x.txt`, where `x` is the number of the paper (i.e., a number from 1 to 85).

## Question 1

When analyzing an author's style, common words like "the" and "on" are actually more useful than rare words like "hostilities". That is because rare words typically signify context. Context is useful if you are trying to find documents about similar topics, but not so useful if you are trying to identify an author's style because different authors can write about the same topic. For example, both Dr. Seuss and Charles Dickens used rare words like "chimney" and "stockings" in _How the Grinch Stole Christmas_ and _A Christmas Carol_, respectively. But they used common words very differently: Dickens used the word "upon" over 100 times, while Dr. Seuss did not use "upon" even once.

Read in the Federalist Papers. Convert each one into a vector of term frequencies. In order to restrict to common words, include only the top 50 words across the corpus. (Because we are restricting to the most common words already, there is no reason to reweight them using TF-IDF.)
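As a rough illustration of the task above, here is a minimal sketch that uses only the standard library. The three toy sentences, the `tokenize` helper, and `top_k = 2` are assumptions made for illustration; the lab itself uses the 85 papers and the top 50 words.

```python
from collections import Counter
import re

# Toy corpus standing in for the 85 papers (hypothetical text).
docs = [
    "the people of the state",
    "the power of the people and of the government",
    "upon the whole, the state",
]

def tokenize(text):
    # Lowercase the text and keep runs of letters as words.
    return re.findall(r"[a-z]+", text.lower())

# Term counts for each document.
counts = [Counter(tokenize(doc)) for doc in docs]

# Corpus-wide counts decide which words count as "common".
corpus_counts = Counter()
for c in counts:
    corpus_counts.update(c)

top_k = 2  # the lab uses 50
vocab = [word for word, _ in corpus_counts.most_common(top_k)]

# One term-frequency vector per document, restricted to the common words.
tf_vectors = [[c[word] for word in vocab] for c in counts]
print(vocab)
print(tf_vectors)
```

Each document becomes a short vector whose entries line up with the same list of common words, which is what makes the documents comparable later on.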

In [1]:
import pandas as pd
import requests

fed_dir = "https://dlsun.github.io/pods/data/federalist/"
# File names 1.txt through 85.txt, one per paper.
fed_files = [f"{i}.txt" for i in range(1, 86)]

# Download each paper; index the Series by paper number (as a string).
docs_fed = pd.Series(dtype=object)
for file in fed_files:
    response = requests.get(fed_dir + file)
    response.raise_for_status()  # fail loudly on a bad download
    docs_fed[file[:-4]] = response.text





In [2]:
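from collections import Counter

# Lowercase, replace punctuation with spaces, split on whitespace, and count words.
words = (
    docs_fed
    .str.lower()
    .str.replace(r"[^\w\s]", " ", regex=True)
    .str.split()
).apply(Counter)

# One row per paper, one column per word; words absent from a paper count as 0.
tf = pd.DataFrame(list(words))
tf = tf.fillna(0)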

  


In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# Count every word across the corpus to find the 50 most common ones.
vectorizer = CountVectorizer().fit(docs_fed)
bag_of_words = vectorizer.transform(docs_fed)
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

# Table of the top 50 words and their corpus-wide counts.
data_frame_total = pd.DataFrame(words_freq[:50], columns=["Words", "Values"])

In [5]:
# Restrict the term-frequency table to the 50 most common words.
data_total = tf[data_frame_total['Words'].tolist()].copy()
data_total

Unnamed: 0,the,of,to,and,in,be,that,it,is,which,by,as,this,would,have,will,for,or,not,their,with,from,are,on,they,an,states,government,may,been,state,all,but,its,other,people,has,power,more,at,if,than,them,one,any,no,those,can,constitution,we
0,133,106,72,40,27,34,28,20,13,18,14,10,14,2,10,25,12,6,14,14,6,11,12,9.0,6,11,2.0,9,11.0,3.0,6,9,2,10.0,3,6,6.0,2.0,7.0,8,4.0,11.0,2.0,4,6.0,3.0,9.0,3.0,8.0,8.0
1,107,83,53,83,34,15,44,38,16,11,10,16,14,5,17,2,13,10,10,21,13,4,6,8.0,22,1,2.0,9,4.0,8.0,1,4,8,5.0,4,23,6.0,1.0,5.0,10,3.0,5.0,4.0,10,1.0,1.0,2.0,0.0,0.0,5.0
2,93,62,56,60,25,31,20,21,7,11,18,24,6,2,7,24,11,32,13,11,10,15,8,6.0,5,3,11.0,16,6.0,2.0,8,4,7,1.0,7,8,5.0,3.0,13.0,1,7.0,8.0,8.0,8,5.0,2.0,6.0,3.0,0.0,0.0
3,86,72,51,90,24,26,17,28,10,10,14,20,1,17,9,15,12,24,14,19,12,8,11,11.0,17,3,1.0,16,10.0,2.0,6,4,10,9.0,11,8,1.0,2.0,13.0,2,14.0,9.0,12.0,13,5.0,1.0,4.0,8.0,0.0,10.0
4,66,53,45,72,28,31,23,21,7,10,10,3,6,37,1,7,7,10,8,11,11,11,3,5.0,11,4,1.0,2,2.0,0.0,2,4,4,4.0,4,3,0.0,1.0,11.0,4,3.0,9.0,11.0,10,3.0,2.0,9.0,1.0,0.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80,389,248,163,88,135,85,49,65,61,41,32,46,44,21,23,30,21,21,26,11,14,20,15,16.0,11,22,18.0,7,25.0,22.0,26,13,20,10.0,6,1,12.0,20.0,4.0,7,6.0,6.0,16.0,10,15.0,8.0,12.0,6.0,13.0,2.0
81,168,94,83,41,38,36,19,13,15,18,4,15,14,11,7,7,4,13,13,6,6,14,11,0.0,10,10,8.0,3,14.0,1.0,18,7,4,4.0,2,1,0.0,10.0,1.0,2,2.0,3.0,4.0,9,1.0,4.0,6.0,5.0,4.0,2.0
82,485,331,219,121,213,105,121,102,116,79,82,54,60,48,54,24,39,26,36,16,30,22,43,18.0,29,20,23.0,16,16.0,36.0,33,28,25,21.0,21,3,22.0,10.0,23.0,19,24.0,16.0,21.0,18,16.0,22.0,13.0,6.0,13.0,5.0
83,390,293,140,89,91,94,84,64,68,53,31,35,36,18,27,38,27,21,30,29,14,20,33,21.0,22,15,19.0,25,26.0,18.0,27,14,18,7.0,10,11,19.0,13.0,7.0,10,8.0,16.0,5.0,5,32.0,27.0,9.0,11.0,28.0,11.0


## Question 2
Make a visualization that summarizes the most common words used by Hamilton, Madison, and Jay.

In [7]:
import pandas as pd

data_dir = "https://dlsun.github.io/pods/data/"
df = pd.read_csv(data_dir + "federalist/authorship.csv")
df["Author"]

0     Hamilton
1          Jay
2          Jay
3          Jay
4          Jay
        ...   
80    Hamilton
81    Hamilton
82    Hamilton
83    Hamilton
84    Hamilton
Name: Author, Length: 85, dtype: object

In [8]:
# Split the term-frequency table by known author. The .copy() calls avoid
# SettingWithCopyWarning when the 'Total' row is added below.
data_hamilton = data_total.loc[df["Author"] == "Hamilton"].copy()
data_jay = data_total.loc[df["Author"] == "Jay"].copy()
data_madison = data_total.loc[df["Author"] == "Madison"].copy()

# Append a 'Total' row: each word's count summed over the author's papers.
data_hamilton.loc['Total'] = data_hamilton.sum()
data_jay.loc['Total'] = data_jay.sum()
data_madison.loc['Total'] = data_madison.sum()



In [178]:
# Collect the per-author totals side by side (one row per word) without
# adding extra rows to data_total.
compare_df = pd.DataFrame({
    'Total_Ham': data_hamilton.loc['Total'],
    'Total_Jay': data_jay.loc['Total'],
    'Total_Madison': data_madison.loc['Total'],
})
compare_df

| | Total_Ham | Total_Jay | Total_Madison |
|---|---|---|---|
| the | 10450.0 | 526.0 | 3904.0 |
| of | 7329.0 | 369.0 | 2335.0 |
| to | 4598.0 | 293.0 | 1261.0 |
| and | 2720.0 | 408.0 | 1164.0 |
| in | 2829.0 | 164.0 | 808.0 |
| be | 2300.0 | 160.0 | 754.0 |
| that | 1717.0 | 150.0 | 542.0 |
| it | 1549.0 | 138.0 | 497.0 |
| is | 1329.0 | 57.0 | 481.0 |
| which | 1245.0 | 56.0 | 424.0 |
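The raw totals above mostly reflect how much each author wrote, so they are hard to compare directly. One possible way to put the authors on a common scale, sketched here as an assumption rather than as part of the original analysis, is to express one word's count relative to another's. The numbers are copied from the table above; the choice of "and" relative to "the" is only illustrative.

```python
# Totals copied from the table above for two of the words.
totals = {
    "Hamilton": {"the": 10450, "and": 2720},
    "Jay": {"the": 526, "and": 408},
    "Madison": {"the": 3904, "and": 1164},
}

# Ratio of "and" to "the" for each author. Unlike a raw total, a ratio
# does not grow just because an author wrote more papers.
and_per_the = {
    author: counts["and"] / counts["the"]
    for author, counts in totals.items()
}

for author, ratio in and_per_the.items():
    print(f"{author}: {ratio:.3f}")
```

This prints roughly 0.260 for Hamilton, 0.776 for Jay, and 0.298 for Madison, which suggests that Jay used "and" far more often relative to "the" than the other two authors did.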


## Question 3

For each of the documents with disputed authorships, find the 5 most similar documents with _known_ authorships, using cosine distance on the term frequencies. Use the authors of these 5 most similar documents to predict the author of each disputed document. (For example, if 3 of the 5 closest documents were written by Hamilton, 1 by Madison, and 1 by Jay, then we would predict that the disputed document was written by Hamilton.)
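The procedure described above can be sketched with the standard library alone. The helper names `cosine_distance` and `predict_author`, and the tiny 3-word vectors, are hypothetical and exist only to illustrate the idea; they are not the lab's data.

```python
import math
from collections import Counter

def cosine_distance(u, v):
    # 1 minus the cosine of the angle between two term-frequency vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def predict_author(query, known_vectors, known_authors, k=5):
    # Rank the known documents by distance to the query, nearest first.
    ranked = sorted(
        range(len(known_vectors)),
        key=lambda i: cosine_distance(query, known_vectors[i]),
    )
    # Majority vote among the k nearest documents.
    votes = Counter(known_authors[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical term-frequency vectors for six papers with known authors.
known_vectors = [
    [10, 2, 1], [9, 3, 1], [11, 2, 2],  # Hamilton-like
    [4, 8, 1], [5, 9, 2],               # Madison-like
    [1, 1, 9],                          # Jay-like
]
known_authors = ["Hamilton", "Hamilton", "Hamilton", "Madison", "Madison", "Jay"]

disputed = [10, 3, 1]
predicted = predict_author(disputed, known_vectors, known_authors, k=5)
print(predicted)
```

Because cosine distance is one minus cosine similarity, taking the 5 smallest distances is the same as taking the 5 largest similarities. In this sketch a tied vote goes to the author whose paper appears earliest among the nearest neighbors, which is only one of several reasonable tie-breaking rules.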

In [103]:
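# Work on a copy of the 85 paper rows so that adding a column does not
# trigger SettingWithCopyWarning.
data_work = data_total.loc[:84].copy()
df["Author"] = df["Author"].fillna('Unknown')
data_work["Author"] = df["Author"]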




In [104]:
index = data_work.index
condition = data_work["Author"]=="Unknown"
papers_indices = index[condition]
papers_indices_list = papers_indices.tolist()
papers_indices_list

[17, 18, 19, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 61, 62]

In [107]:
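from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarities between all 85 papers, computed once.
similarities = cosine_similarity(data_total.loc[:84])

# One column per disputed paper, one row per paper it is compared against.
cosine_df = pd.DataFrame(similarities[papers_indices_list].T, columns=papers_indices_list)

# Keep only the rows for papers with known authorship.
cosine_df = cosine_df.drop(papers_indices_list)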

In [150]:
large_df = pd.DataFrame()
for paper_idx in papers_indices_list:
    # Indices of the 5 known-authorship papers most similar to this disputed paper.
    nearest = cosine_df.nlargest(5, [paper_idx]).index
    # Store the neighbors' authors as a single Series in one cell.
    large_df[paper_idx] = [data_work.iloc[nearest].Author]
large_df

Unnamed: 0,17,18,19,48,49,50,51,52,53,54,55,56,57,61,62
0,46 Madison 16 Hamilton 68 Hamilton 4...,46 Madison 16 Hamilton 41 Madison 3...,47 Madison 39 Madison 41 Madison 6...,70 Hamilton 40 Madison 45 Madison 2...,47 Madison 40 Madison 36 Madison 43 ...,42 Madison 40 Madison 38 Madison 4...,40 Madison 42 Madison 70 Hamilton 7...,40 Madison 13 Madison 35 Hamilton 4...,38 Madison 80 Hamilton 47 Madison 4...,13 Madison 34 Hamilton 83 Hamilton 3...,11 Hamilton 83 Hamilton 16 Hamilton 2...,83 Hamilton 13 Madison 40 Madison 3...,13 Madison 40 Madison 64 Hamilton 4...,37 Madison 35 Hamilton 83 Hamilton 1...,47 Madison 40 Madison 13 Madison 7...


In [176]:
# Predicted authors, read off from the 5 nearest neighbors listed in large_df.
predictions = {
    17: 'Madison', 18: 'Madison', 19: 'Madison',
    48: 'Hamilton', 49: 'Madison', 50: 'Madison',
    51: 'Hamilton', 52: 'Madison', 53: 'Madison',
    54: 'Hamilton', 55: 'Hamilton', 56: 'Madison',
    57: 'Madison', 61: 'Hamilton', 62: 'Madison',
}
for paper_idx, author in predictions.items():
    data_work.at[paper_idx, 'Author'] = author
data_work['Author']

0     Hamilton
1          Jay
2          Jay
3          Jay
4          Jay
        ...   
80    Hamilton
81    Hamilton
82    Hamilton
83    Hamilton
84    Hamilton
Name: Author, Length: 85, dtype: object

## Submission Instructions

- Copy this notebook to your own Drive, if you have not already.
- Restart this notebook and run the cells from beginning to end. 
  - Go to Runtime > Restart and Run All.
- Rename this notebook by clicking on "DATA 301 Lab 5 - YOUR NAMES HERE" at the very top of this page. Replace "YOUR NAMES HERE" with your first and last name (and those of your partners, for Phase 2).
- Get the link to your notebook:
  - Click on "Share" at the top-right. 
  - Change the settings to "Anyone with the link can view". 
  - Copy the sharing link into Canvas.