# DAT 301 - Final Exam
## Sebastian FSV
## 4/28/2021


Data scraped from WikiLeaks include all 107 emails that contain either the words (favor & confidential) or (favor & classified). Millions of emails belonging to the presidential candidate were made public prior to the 2016 election; they contain many of Hillary Clinton's personal and professional communications during and leading up to her presidential campaign.

The emails were made public under the allegation that evidence of nefarious activity could be inferred from, or directly proved by, their contents. All of the text contained within each email is collected, including email addresses and times. The text is then stripped, split, and turned into a list.

An equal-sized set of the most common words from this email query is compared with the most common words used in George Orwell's 1984. The two sets are then tested for a difference in the mean usage of their most common words.

In [None]:
# !pip install bs4
# !pip install requests
# !pip install pandas
# !pip install wordcloud
# !pip install seaborn

# import sys
# from os import path

import numpy as np
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
from wordcloud import WordCloud, STOPWORDS
import re
import string
import matplotlib.pyplot as plt
import seaborn as sns
from random import sample
import scipy.stats as ss

# Hillary Clinton's Emails

In [None]:
url = "https://wikileaks.org/clinton-emails/?q=%28favor+%26+confidential%29+%7C+%28favor+%26+classified%29&mfrom=&mto=&title=&notitle=&date_from=&date_to=&nofrom=&noto=&count=200&sort=0&page=1&#searchresult"
r = requests.get(url)
soup = bs(r.content, 'html.parser')

### Locate and Create Table of Target Emails

The table's unique class had to be identified in order to pull the correct table rows, whose text was then coerced into a data frame.

In [None]:
table = soup.find("table",  class_='table table-striped search-result')
tab_rows = table.find_all('tr')

In [None]:
l = []
for tr in tab_rows:
    # Collapse each table row to its text; the fields are split on newlines in the next step
    row = tr.get_text().strip()
    l.append(row)

### Turn Table into Data Frame

In [None]:
df = pd.DataFrame(l)
df[['DocID', 'Date', 'Subject', 'From', 'To']] = df[0].str.split('\n', expand=True)
df = df.iloc[1:,1:]
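As a minimal sketch of the split used above, assuming each row's text holds five newline-separated fields. The sample row is made up for illustration and is not taken from the archive.

```python
# Hypothetical row text, shaped like the output of tr.get_text().strip()
row_text = "12345\n2011-01-01\nExample subject\nSender Name\nRecipient Name"

columns = ['DocID', 'Date', 'Subject', 'From', 'To']

# Split on newlines and pair each piece with its column name
record = dict(zip(columns, row_text.split('\n')))
print(record['DocID'], record['Subject'])
```

If a row ever contained an extra newline inside a field, the split would yield more than five pieces, so checking the piece count before building the data frame may be worthwhile.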

### Create Giant Text List

#### !!!! THIS CELL TAKES THE LONGEST TO RUN !!!!

This `for` loop pulls the entire contents of the email page associated with each email ID. Raising the row limit from the first 40 emails to more than 100 noticeably increases the run time, but 100+ pages can still be scraped in under 2 minutes. The result is a long list in which each element is the contents of one email as a string.

In [None]:
url1 = 'https://wikileaks.org/clinton-emails/emailid/'

li = []
for i in df.iloc[1:40,0]:
    one = url1 + i
    one1 = requests.get(one)
    two = bs(one1.content, 'html.parser')
    three = two.find(id='uniquer')
    four = three.get_text().strip()
    li.append(four)
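Because one failed request would stop the loop above, a more defensive variant may be useful. The sketch below is only an illustration: `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real HTTP call so the example runs offline. In the notebook, the `fetch` argument would presumably wrap `requests.get(url, timeout=...)` plus the BeautifulSoup parsing.

```python
import time

def fetch_all(doc_ids, fetch, base_url, delay=0.0):
    """Fetch one page per document ID, skipping IDs whose request fails."""
    pages = []
    for doc_id in doc_ids:
        try:
            pages.append(fetch(base_url + str(doc_id)))
        except Exception as err:  # keep going if a single page fails
            print(f"skipped {doc_id}: {err}")
        time.sleep(delay)  # pause between requests to go easy on the server
    return pages

# Stand-in for a real HTTP call (hypothetical); fails for one ID on purpose
def fake_fetch(url):
    if url.endswith('2'):
        raise ValueError("simulated failure")
    return "contents of " + url

pages = fetch_all(['1', '2', '3'], fake_fetch, 'https://example.org/emailid/')
print(len(pages))
```

Passing the fetch function as a parameter keeps the loop testable without network access.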

### Create Word Usage Data Frame

Punctuation is removed using the `punctuation` attribute of the `string` module, and the text is lowercased with `lower()` so that differently capitalized copies of a word are counted together. The string is then split into individual words and the frequency of each word is counted. A dictionary built from the word and frequency lists collapses repeated words into a single entry and is turned into a pandas data frame, which is sorted in descending order of Count.

In [None]:
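# Join every scraped email (not only the last one, `four`) before counting
text = ' '.join(li)
wordlist = re.sub('[' + re.escape(string.punctuation) + ']', '', text).lower().split()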
freq = [wordlist.count(w) for w in wordlist]
five = dict(list(zip(wordlist, freq)))
usage = pd.DataFrame(list(five.items()),columns=['Word', 'Count'])
usage.sort_values(by='Count', ascending=False, inplace=True)
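The counting step can also be sketched with `collections.Counter`, which tallies each word in a single pass instead of calling `list.count` once per word. The sample text is made up, and this is an alternative illustration rather than the method used in the cell above.

```python
import re
import string
from collections import Counter

text = "Favor, favor! A confidential favor."  # made-up sample text

# Remove punctuation, lowercase, and split into words
cleaned = re.sub('[' + re.escape(string.punctuation) + ']', '', text).lower()
counts = Counter(cleaned.split())

# most_common() returns (word, count) pairs sorted by descending count
print(counts.most_common(2))
```

A `Counter` can be passed to `pd.DataFrame(list(counts.items()), columns=['Word', 'Count'])` in the same way as the dictionary above.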

# George Orwell's 1984

In [None]:
url2 = 'http://www.george-orwell.org/1984/'

### Create Giant Text List

In [None]:
story = []
for i in range(22,23):
    myurl = url2 + str(i) + '.html'
    s = requests.get(myurl)
    ch = bs(s.content, 'html.parser')
    six = [sib.get_text() for sib in ch.find('h2').next_siblings]
    story.append(six)


In [None]:
eight = [str(i).split() for i in story]
nine = []
for phrase in eight:
    for word in phrase:
        nine.append(word)


### Create Word Usage Data Frame

In [None]:
words2 = [str(w).lower().strip() for w in nine]
table = str.maketrans(dict.fromkeys(string.punctuation))
# Strip punctuation from the lowercased words so they match the lowercase stop list
words1 = [i.translate(table) for i in words2]
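A small sketch of the same normalization on a few made-up tokens may make the effect easier to see. It uses the three-argument form of `str.maketrans`, whose third argument lists the characters to delete.

```python
import string

# Translation table that deletes every ASCII punctuation character
strip_punct = str.maketrans('', '', string.punctuation)

tokens = ["Winston,", "THERE!", "o'brien"]  # made-up tokens
normalised = [t.lower().translate(strip_punct) for t in tokens]
print(normalised)
```

Lowercasing before counting matters because the `STOPWORDS` set from `wordcloud` is lowercase, so a capitalized token would otherwise slip past the filter.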

In [None]:
freq1 = [words1.count(w) for w in words1]
sev = dict(list(zip(words1, freq1)))

In [None]:
usage1 = pd.DataFrame(list(sev.items()),columns=['Word', 'Count'])
usage1.sort_values(by='Count', ascending=False, inplace=True)


## STOPWORDS and Other Filters

In [None]:
stopwords = list(set(STOPWORDS))
# Both capitalizations of the name are listed so the filter matches either form
stopwords += ['>','from:','to:', 'no.', 'date:','sent:','subject:', 're:',
             'original', 'message', 'cameron', 'robinson', 'shaun',
             'c05774510', '11302015', 'f201420439', "o\\'brien", "Winston", 'winston', 'There']

In [None]:
filt = (usage.Word.isin(stopwords))
filt1 = (usage1.Word.isin(stopwords))
unique = usage[~filt]
unique1 = usage1[~filt1]
length = [len(w) for w in unique.Word]
length1 = [len(w) for w in unique1.Word]
unique.insert(2, 'Length', length, True)
unique1.insert(2, 'Length', length1, True)
filt2 = (unique['Length'] > 4)
filt3 = (unique1['Length'] > 4)
#---------------------------------------------------
usage2 = usage.sort_values(by='Count', ascending=True)
usage3 = usage1.sort_values(by='Count', ascending=True)
filt6 = (usage2.Word.isin(stopwords))
filt7 = (usage3.Word.isin(stopwords))
unique2 = usage2[~filt6]
unique3 = usage3[~filt7]

length2 = [len(w) for w in unique2.Word]
length3 = [len(w) for w in unique3.Word]
# Use the lengths computed from the ascending-sorted frames so rows stay aligned
unique2.insert(2, 'Length', length2, True)
unique3.insert(2, 'Length', length3, True)
filt4 = (unique2['Length'] > 4)
filt5 = (unique3['Length'] > 4)
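The two filters applied in this cell (drop stop words, keep words longer than four characters) can be sketched without pandas. The stop list and counts below are small made-up stand-ins, not values from either text.

```python
stop = {'the', 'from:', 'and'}  # tiny stand-in stop list
counts = {'the': 9, 'government': 4, 'and': 7, 'aid': 3, 'support': 2}

# Keep words that are not stop words and are longer than four characters
kept = {w: c for w, c in counts.items() if w not in stop and len(w) > 4}

# Sort by count, most frequent first
ranked = sorted(kept.items(), key=lambda pair: pair[1], reverse=True)
print(ranked)
```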

### Final Data Frames and Lists

Being a deliberate novel, 1984 supplied more than four times as many words for analysis as Mrs. Clinton's emails. Therefore, a length check had to be added to ensure that a similar number of words is considered from both groups.

Depending on the number of words imported from Hillary's emails, a randomly selected set of the same size is chosen from the larger set of 1984 words.

Nothing is hard coded and this entire program is fully scalable. 

In [None]:
a=unique[filt2].iloc[1:11,0:2]
b=unique1[filt3].iloc[1:11,0:2]
#--------------
c=unique2[filt4].iloc[1:11,0:2]
d=unique3[filt5].iloc[1:11,0:2]

hrc_lst = unique[filt2]
_1984_lst = unique1[filt3]

In [None]:
e = unique[filt2].copy()  # copy so the new column does not trigger SettingWithCopyWarning
e['Type'] = 'HRC'
ff = unique1[filt3].copy()
ff['Type'] = '1984'
# Randomly pick as many 1984 rows (without replacement) as there are HRC rows
fff = sample(range(len(ff)), len(e))

f = ff.iloc[fff]
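The equal-size draw can be sketched on plain lists. The word lists are made up, and the fixed seed is an assumption added only so the draw is repeatable; the notebook cell above does not set one.

```python
import random

hrc_words = ['government', 'support', 'leaders']  # made-up
orwell_words = ['winston', 'always', 'never', 'party', 'brother', 'telescreen']

random.seed(0)  # fixed seed so the draw is repeatable
# Draw without replacement; never ask for more items than are available
k = min(len(hrc_words), len(orwell_words))
orwell_sample = random.sample(orwell_words, k)
print(len(orwell_sample))
```

The `min` guard matters because `random.sample` raises `ValueError` when asked for more items than the population holds, which would happen if the email set ever outgrew the novel's set.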

### Length Checks

In [None]:
mydict = {
    'HRC'  : [len(hrc_lst['Word'])],
    '1984' : [len(_1984_lst['Word'])]}

df2 = pd.DataFrame(mydict, index=['words'])
df2

In [None]:
hrc = e['Word'].to_string()
_1984 = f['Word'].to_string()

mydict1 = {
    'HRC'  : [len(e['Word'])],
    '1984' : [len(f['Word'])]}

df3 = pd.DataFrame(mydict1, index=['words'])
df3

## Most Used Words

In [None]:
colors = ['lightcoral', 'brown', 'firebrick', 'darkred', 'red', 'tomato', 'coral', 'orangered','sienna','sandybrown']
colors1 = ['navy', 'deepskyblue', 'teal','aqua', 'mediumblue', 'cadetblue', 'blue', 'mediumpurple', 'royalblue', 'dodgerblue']

fig, axs = plt.subplots(1,2, figsize=(15,9))
plt.rcParams.update({'font.size' : 20})
fig.suptitle('Top 10 Used Words')
axs[0].pie(a['Count'], labels=a['Word'], colors=colors)
axs[1].pie(b['Count'], labels=b['Word'], colors=colors1)
axs[0].set_title("HRC")
axs[1].set_title("1984")

 

Both sets of words reflect the role played by each group.

The high-profile position of power is visible in the common use of words like government, relationships, support, and leaders.

Meanwhile, the main character of Orwell's novel used the opposing words ALWAYS and NEVER exactly the same number of times, possibly in resistance to the onslaught of doublespeak that plagues his world.

The pie charts above and the pandas data frames below summarize the words used most frequently by each group in the study.

One interesting note is that the frequency counts for the most common words in 1984 are almost double those of the former presidential candidate, possibly implying a more limited selection of words among the residents of the Orwellian universe.

In [None]:
a['Type'] = np.repeat('HRC', len(a['Word']))
b['Type'] = np.repeat('1984', len(b['Word']))

a = a.set_index([np.arange(0,10,step=1)])
b = b.set_index([np.arange(0,10,step=1)])

ab = pd.concat([y.reset_index(drop=True) for y in [a, b]], axis=1)
ab

In [None]:
fig5, axs5 = plt.subplots(1,2, figsize=(15,9))
plt.rcParams.update({'font.size' : 20})
fig5.suptitle('Bottom 10 Used Words')
axs5[1].pie(c['Count'], labels=c['Word'], colors=colors)
axs5[0].pie(d['Count'], labels=d['Word'], colors=colors1)
axs5[1].set_title("HRC")
axs5[0].set_title("1984")

In [None]:
c['Type'] = np.repeat('HRC', len(c['Word']))
d['Type'] = np.repeat('1984', len(d['Word']))

c = c.set_index([np.arange(0,10,step=1)])
d = d.set_index([np.arange(0,10,step=1)])

cd = pd.concat([y.reset_index(drop=True) for y in [c, d]], axis=1)
cd

In [None]:
wordcloud = WordCloud(max_font_size = 80, background_color = 'white',
                     collocations = True, colormap='magma').generate(hrc)
plt.figure()
plt.imshow(wordcloud)
plt.axis('off')
plt.suptitle("HRC's Diction")

wordcloud1 = WordCloud(max_font_size = 80, background_color = "white", 
                      collocations = True, colormap = "ocean").generate(_1984)
plt.figure()
plt.imshow(wordcloud1)
plt.axis("off")
plt.suptitle("1984's Diction")
plt.show()

The word clouds above give a visual summary of each group's diction.

### Distribution of Word Use

In [None]:
g = pd.concat([e, f], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
fac = sns.FacetGrid(g, col='Type', height=4.5, aspect=1.8)
fac.map_dataframe(sns.histplot, x='Count', kde=True, binwidth=1)
fac.set_axis_labels('Word Frequency', 'Count')
fac.set(xticks=[x for x in np.arange(start=1, stop=11, step=1)])
fac.fig.suptitle('How Often Was Each Word Used?')

Although the data were collected from different points in history, and although each group had a very different outlook on the world around it, the word frequency distributions show something interesting: the individual words differ, but the overall usage pattern is very similar. With the same number of observations contributed by each group, about 150 of 199 words (not including STOPWORDS) were used only once in the text. Words used twice account for only about 12-13% of the total, and the progression falls off roughly like a geometric distribution.
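One way to probe the geometric comparison is to estimate the parameter as 1 / mean (the maximum-likelihood estimate for a geometric distribution on 1, 2, 3, ...) and set the observed shares beside the fitted probabilities. The counts below are made up for illustration, and the sketch makes no claim about how well the real data fit.

```python
from collections import Counter

def geometric_pmf(k, p):
    """P(X = k) for a geometric distribution on k = 1, 2, 3, ..."""
    return p * (1 - p) ** (k - 1)

# Made-up word counts: one entry per distinct word
word_counts = [1] * 15 + [2] * 3 + [3] * 2

p_hat = len(word_counts) / sum(word_counts)  # maximum-likelihood estimate, 1 / mean
observed = Counter(word_counts)

# Print k, the observed share of words used k times, and the fitted probability
for k in sorted(observed):
    share = observed[k] / len(word_counts)
    print(k, round(share, 3), round(geometric_pmf(k, p_hat), 3))
```

Large gaps between the two printed columns would suggest the geometric description is only a rough one.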

### Outliers

In [None]:
sns.set_theme(style='darkgrid')
box = sns.boxplot(x='Type', y='Count', data=g, hue='Type', 
                  palette='Set3').set_title('Boxplot of Word Usage')
plt.yticks(np.arange(start=0, stop=8, step=2))
plt.show()

With 75% of the observations occurring only once in each dataset, these boxplots show that even words used only twice are considered outliers.
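The reason a count of 2 is flagged follows from the usual 1.5 × IQR whisker rule: when at least three quarters of the values equal 1, both quartiles are 1, the interquartile range is 0, and the upper fence collapses to 1. The sketch below uses made-up counts and assumes that rule, which is the default for matplotlib and seaborn boxplots.

```python
import statistics

counts = [1] * 160 + [2] * 25 + [3] * 10 + [5] * 5  # made-up, 80% ones

# 'inclusive' interpolates between data points, like numpy's default percentile
q1, _, q3 = statistics.quantiles(counts, n=4, method='inclusive')
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # the usual boxplot whisker rule

outliers = [c for c in counts if c > upper_fence]
print(q1, q3, upper_fence, len(outliers))
```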

## Summary Statistics

In [None]:
var = 'Count'
# Select the numeric column first so the aggregations skip the text columns
type_grp = g.groupby('Type')[var]

xbar_hrc = type_grp.mean().loc['HRC']
xbar_1984 = type_grp.mean().loc['1984']
s_hrc = type_grp.std().loc['HRC']
s_1984 = type_grp.std().loc['1984']
n_hrc = type_grp.count().loc['HRC']
n_1984 = type_grp.count().loc['1984']
var_hrc = type_grp.var().loc['HRC']
var_1984 = type_grp.var().loc['1984']

mydict2 = {
    'HRC'  : [xbar_hrc, s_hrc, n_hrc, var_hrc],
    '1984' : [xbar_1984, s_1984, n_1984, var_1984]
}

df4 = pd.DataFrame(mydict2, index=['mean', 'std', 'n', 'var']) 
df5 = df4.round(decimals=3)
df5

### t-Test to check for difference in mean from two independent distributions of word usage

In [None]:
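# Welch's t statistic and Welch-Satterthwaite degrees of freedom
tobs = (xbar_hrc - xbar_1984) / ( s_hrc**2/n_hrc + s_1984**2/n_1984 )**(1/2)
deg_free = (s_hrc**2/n_hrc + s_1984**2/n_1984)**2 / ( (s_hrc**2/n_hrc)**2/(n_hrc-1) + (s_1984**2/n_1984)**2/(n_1984-1) ) 
t_dist = ss.t(deg_free)
# Two-sided p-value, since the test is for any difference in means
pval = 2 * t_dist.sf(abs(tobs))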

mydict3 = {
    't' : tobs,
    'df' : deg_free,
    'pval' : pval
}

df6 = pd.DataFrame(mydict3, index=['Count'])
df6
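The statistic and degrees of freedom computed above are those of Welch's unequal-variance t-test. As a hedged sketch, the same arithmetic can be wrapped in a small function; `welch_t` is a hypothetical helper name, and the summary statistics passed to it are made up rather than results from this notebook.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2  # squared standard errors
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Made-up summary statistics, not results from this notebook
t, df = welch_t(1.5, 1.0, 100, 1.5, 1.0, 100)
print(t, df)
```

With the raw samples at hand, `scipy.stats.ttest_ind(x, y, equal_var=False)` runs the same test and can serve as a cross-check on the hand computation.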


In [None]:
# df6.append(df7).set_index([pd.Index(['Count', 'Length'])])