## 2016 Election Project 

This notebook documents my data processing throughout this project. I'll be exploring and modifying my data in this file. The data I am starting with are transcripts of the presidential debates from the 2016 US election between Hillary Clinton and Donald Trump. The transcripts were taken from UCSB's American Presidency Project, and the citation for each transcript is included both below and as a part of the data I process.

To begin, I have three transcripts from the three presidential debates.

Presidential Candidates Debates: "Presidential Debate at the University of Nevada in Las Vegas," October 19, 2016. Online by Gerhard Peters and John T. Woolley, The American Presidency Project. http://www.presidency.ucsb.edu/ws/?pid=119039.

Presidential Candidates Debates: "Presidential Debate at Washington University in St. Louis, Missouri," October 9, 2016. Online by Gerhard Peters and John T. Woolley, The American Presidency Project. http://www.presidency.ucsb.edu/ws/?pid=119038.

Presidential Candidates Debates: "Presidential Debate at Hofstra University in Hempstead, New York," September 26, 2016. Online by Gerhard Peters and John T. Woolley, The American Presidency Project. http://www.presidency.ucsb.edu/ws/?pid=118971.


**I might use other speeches as well, but I think I will only use individual speeches given AFTER both of them had been chosen as their party's candidates. Since I'll be annotating referring expressions (REs) by hand, I worry about using too many files.**

In [1]:
import nltk
from nltk.corpus import PlaintextCorpusReader
import pandas as pd
import glob
import os


In [7]:
#Build full paths so the files can be opened from any working directory
base = '/Users/Paige/Documents/Data_Science/2016-Election-Project/data/Debates/'
files = []
for folder in ['general', 'dem', 'rep']:
    files.extend(sorted(glob.glob(os.path.join(base, folder, '*.txt'))))

#Show just the file names
[os.path.basename(f) for f in files]

['10-19-16.txt',
 '10-9-16.txt',
 '9-26-16.txt',
 '1-17-16_dem.txt',
 '1-25-16_dem.txt',
 '10-13-15_dem.txt',
 '11-14-15_dem.txt',
 '12-19-15_dem.txt',
 '2-11-16_dem.txt',
 '2-4-16_dem.txt',
 '3-6-16_dem.txt',
 '3-9-16_dem.txt',
 '4-14-16_dem.txt',
 '1-14-16_rep.txt',
 '1-28-16_rep.txt',
 '10-28-15_rep.txt',
 '11-10-15_rep.txt',
 '12-15-15_rep.txt',
 '2-13-16_rep.txt',
 '2-25-16_rep.txt',
 '2-6-16_rep.txt',
 '3-10-16_rep.txt',
 '3-3-16_rep.txt',
 '8-6-15_rep.txt',
 '9-16-15_rep.txt']

In [8]:
#I'm creating a list where each entry in the list is a transcript
transcripts = []
for f in files:
    #The with block closes each file automatically
    with open(f, 'r') as fi:
        transcripts.append(fi.read())

In [4]:
print(transcripts[0][:200])

In [None]:
words0 = nltk.word_tokenize(transcripts[0])
len(words0)

In [None]:
sents0 = nltk.sent_tokenize(transcripts[0])
len(sents0)

In [None]:
print(transcripts[1][:200])

In [None]:
words1 = nltk.word_tokenize(transcripts[1])
len(words1)

In [None]:
sents1 = nltk.sent_tokenize(transcripts[1])
len(sents1)

In [None]:
print(transcripts[2][:500])

In [None]:
words2 = nltk.word_tokenize(transcripts[2])
len(words2)

In [None]:
sents2 = nltk.sent_tokenize(transcripts[2])
len(sents2)

**The first debate has 20719 words and 1292 sentences, the second debate has 19613 words and 1246 sentences, and the third debate has 20372 words and 1236 sentences. All of these counts include the markers for who is speaking.**

**We can see that we need to do some cleanup. What I would eventually like to end up with is a dataframe whose columns are Debate, Date, Source, Speaker, and Sents, with the Sents in the order in which they were spoken. For now, I will keep the speech and questions of the moderators, because it might be interesting to compare the referring expressions *they* use for the candidates with what the candidates use for each other.**

In [None]:
#I want to split large chunks of the transcript based on who is speaking.
#Since the transcript data has a pretty standardized format (the speaker is in all caps followed by a colon),
#I can add a marker to each of these sections, and split the data on that marker

speakers = ["CLINTON", "TRUMP", "WALLACE", "COOPER", "RADDATZ", "HOLT",
            "PARTICIPANTS", "MODERATOR", "MODERATORS"]

speaker_split = []

for txt in transcripts:
    #Put "#$&" before each speaker label and "*" between the label and its colon
    for name in speakers:
        txt = txt.replace(name + ":", "#$&" + name + "*:")
    speaker_split.append(txt.replace("\n", " "))

speaker_split = [txt.strip().split("#$&") for txt in speaker_split]

In [None]:
speaker_split[0][:4]
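
The marker-based split above depends on listing every speaker by hand. As a rough alternative sketch, a regular expression can pick up any all-caps label followed by a colon. The pattern, the function name `split_by_speaker`, and the sample text are assumptions made for illustration, not part of the original processing. On a real transcript the pattern would also match any other all-caps word followed by a colon, and it would miss labels containing punctuation.

```python
import re

# Hypothetical pattern: a run of two or more capital letters followed by a colon, e.g. "TRUMP:"
SPEAKER_LABEL = re.compile(r"\b([A-Z]{2,}):")

def split_by_speaker(text):
    """Return [speaker, speech] pairs in the order they appear in text."""
    flat = text.replace("\n", " ")
    # With one capture group, re.split returns [preamble, label, speech, label, speech, ...]
    parts = SPEAKER_LABEL.split(flat)
    labels, speeches = parts[1::2], parts[2::2]
    return [[label, speech.strip()] for label, speech in zip(labels, speeches)]

# Made-up sample text in the same general shape as the transcripts
sample = "Debate transcript\nHOLT: Good evening.\nCLINTON: Thank you, Lester.\nTRUMP: Thank you."
turns = split_by_speaker(sample)
print(turns)
```

Anything before the first label (here, the title line) is discarded, which plays the same role as removing the empty first chunk in the marker-based approach.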

In [None]:
#Creating three separate lists of split speech by speaker for each debate
debate3 = speaker_split[0]
debate2 = speaker_split[1]
debate1 = speaker_split[2]
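
The indexing above relies on the order in which the files were listed, so the third debate happens to come first. One way to make the numbering explicit, sketched here under the assumption that every general-election file name is a month-day-year date, is to parse the date from each file name and sort chronologically. The helper name `debate_date` is hypothetical.

```python
from datetime import datetime
import os

def debate_date(path):
    """Parse a date from a file name such as '10-19-16.txt' (month-day-year)."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return datetime.strptime(stem, "%m-%d-%y")

# The three general-election file names from the listing earlier in this notebook
general = ["10-19-16.txt", "10-9-16.txt", "9-26-16.txt"]

# Number the debates 1, 2, 3 in the order they were held
debate_number = {name: i for i, name in enumerate(sorted(general, key=debate_date), start=1)}
print(debate_number)
```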

In [None]:
#Splitting the SPEAKER: from the speech
debate3 = [txt.split("*:") for txt in debate3]
debate2 = [txt.split("*:") for txt in debate2]
debate1 = [txt.split("*:") for txt in debate1]

In [None]:
#Each debate's list starts with an empty chunk, [''], that needs to be removed
debate3.remove([''])
debate3[:4]
#We'll strip the entries when they're in the data frame

In [None]:
debate2.remove([''])
debate1.remove([''])

In [None]:
debate3df = pd.DataFrame(debate3)
#I want to add a column of the source of the transcript for each dataframe
#I'm adding these columns with all the same value because I will eventually combine the 
#dataframes from all three debates, and this information will be important then

debate3df['Source'] = 'http://www.presidency.ucsb.edu/ws/?pid=119039'
debate3df['Debate'] = '3' 
debate3df['Location'] = 'University of Nevada in Las Vegas'
debate3df['Date'] = '10/19/16'

In [None]:
debate3df.head(10)

In [None]:
#Renaming the first two columns
debate3df.columns = ['Speaker', 'Speech', 'Source', 'Debate', 'Location', 'Date']

#Stripping the text in the Speaker and Speech columns
debate3df['Speaker'] = debate3df['Speaker'].apply(lambda x: x.strip())
debate3df['Speech'] = debate3df['Speech'].apply(lambda x: x.strip())

In [None]:
debate3df.head(10)

In [None]:
#Reorganize the order of the columns
debate3df = debate3df[['Location', 'Date', 'Debate', 'Source', 'Speaker', 'Speech']]

#Drop these first two rows, because they are not speech information
debate3df = debate3df.drop([0, 1])

debate3df.head()

In [None]:
#This is the same processing for debate2df
debate2df = pd.DataFrame(debate2)
debate2df['Source'] = 'http://www.presidency.ucsb.edu/ws/?pid=119038'
debate2df['Debate'] = '2' 
debate2df['Location'] = 'Washington University in St. Louis, Missouri'
debate2df['Date'] = '10/9/16'

#Renaming the first two columns
debate2df.columns = ['Speaker', 'Speech', 'Source', 'Debate', 'Location', 'Date']

#Stripping the text in the Speaker and Speech columns
debate2df['Speaker'] = debate2df['Speaker'].apply(lambda x: x.strip())
debate2df['Speech'] = debate2df['Speech'].apply(lambda x: x.strip())

#Reorganize the order of the columns
debate2df = debate2df[['Location', 'Date', 'Debate', 'Source', 'Speaker', 'Speech']]

#Drop these first two rows, because they are not speech information
debate2df = debate2df.drop([0, 1])

In [None]:
#This is the same processing for debate1df
debate1df = pd.DataFrame(debate1)
debate1df['Source'] = 'http://www.presidency.ucsb.edu/ws/?pid=118971'
debate1df['Debate'] = '1' 
debate1df['Location'] = 'Hofstra University in Hempstead, New York'
debate1df['Date'] = '9/26/16'

#Renaming the first two columns
debate1df.columns = ['Speaker', 'Speech', 'Source', 'Debate', 'Location', 'Date']

#Stripping the text in the Speaker and Speech columns
debate1df['Speaker'] = debate1df['Speaker'].apply(lambda x: x.strip())
debate1df['Speech'] = debate1df['Speech'].apply(lambda x: x.strip())

#Reorganize the order of the columns
debate1df = debate1df[['Location', 'Date', 'Debate', 'Source', 'Speaker', 'Speech']]

#Drop these first two rows, because they are not speech information
debate1df = debate1df.drop([0, 1])

In [None]:
debate1df.head()

In [None]:
debate2df.head()
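
The per-debate processing above repeats the same steps three times, so it could be wrapped in one function. This is a minimal sketch, not the code used to produce the results in this notebook. The name `build_debate_df` is hypothetical and the sample chunks are made up. Instead of dropping the first two rows by position, it filters out the PARTICIPANTS and MODERATOR(S) header rows by label, which is an assumption about which rows are not speech.

```python
import pandas as pd

# Assumed labels for rows that are header information rather than speech
HEADER_LABELS = ['PARTICIPANTS', 'MODERATOR', 'MODERATORS']

def build_debate_df(chunks, source, debate, location, date):
    """Turn a list of [speaker, speech] pairs into a data frame with metadata columns."""
    df = pd.DataFrame(chunks, columns=['Speaker', 'Speech'])
    df['Speaker'] = df['Speaker'].str.strip()
    df['Speech'] = df['Speech'].str.strip()
    df['Source'] = source
    df['Debate'] = debate
    df['Location'] = location
    df['Date'] = date
    df = df[['Location', 'Date', 'Debate', 'Source', 'Speaker', 'Speech']]
    # Keep only rows that are actual speech
    return df[~df['Speaker'].isin(HEADER_LABELS)].reset_index(drop=True)

# Made-up chunks in the same shape as the debate1 list
sample_chunks = [['PARTICIPANTS', ' Clinton and Trump '],
                 ['MODERATOR', ' Lester Holt '],
                 ['HOLT', ' Good evening. '],
                 ['CLINTON', ' Thank you. ']]

sample_df = build_debate_df(sample_chunks,
                            'http://www.presidency.ucsb.edu/ws/?pid=118971',
                            '1', 'Hofstra University in Hempstead, New York', '9/26/16')
print(sample_df[['Speaker', 'Speech']])
```

Filtering by label does not depend on how many header rows a transcript has, which matters more for the sentence-level frames, where a header chunk could in principle be split into more than one row.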

**Currently, the entries in this data frame are split into chunks of who is speaking. I think I might want each entry in the data frame to be a sentence instead. I'm going to make another dataframe where each row is information about one sentence. I'm going to keep both dataframes in case I decide one would be more helpful than the other later.**
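
As a small illustration of going from one row per chunk to one row per sentence, the sketch below expands a chunk-level frame with `DataFrame.explode`. The function names and sample data are assumptions, and `naive_sent_split` is only a rough regex stand-in so the example runs without NLTK data; passing `tokenizer=nltk.sent_tokenize` would use the same tokenizer as the rest of this notebook.

```python
import re
import pandas as pd

def naive_sent_split(text):
    """Rough stand-in for nltk.sent_tokenize: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def to_sentence_rows(chunk_df, tokenizer=naive_sent_split):
    """Expand a chunk-level frame into one row per sentence, keeping spoken order."""
    out = chunk_df.copy()
    out['Speech'] = out['Speech'].apply(tokenizer)
    # explode repeats the other columns once for each sentence in the list
    return out.explode('Speech').reset_index(drop=True)

# Made-up chunk-level data
chunks = pd.DataFrame({'Speaker': ['HOLT', 'CLINTON'],
                       'Speech': ['Good evening. Welcome to the debate.', 'Thank you, Lester.']})
sentences = to_sentence_rows(chunks)
print(sentences)
```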

In [None]:
debate3[:3]

In [None]:
debate3_sent = []
for chunk in debate3:
    sents = nltk.sent_tokenize(chunk[1])
    for sent in sents:
        debate3_sent.append([chunk[0], sent])

In [None]:
df3_sents = pd.DataFrame(debate3_sent)
df3_sents['Source'] = 'http://www.presidency.ucsb.edu/ws/?pid=119039'
df3_sents['Debate'] = '3' 
df3_sents['Location'] = 'University of Nevada in Las Vegas'
df3_sents['Date'] = '10/19/16'

#Renaming the first two columns
df3_sents.columns = ['Speaker', 'Speech', 'Source', 'Debate', 'Location', 'Date']

#Stripping the text in the Speaker and Speech columns
df3_sents['Speaker'] = df3_sents['Speaker'].apply(lambda x: x.strip())
df3_sents['Speech'] = df3_sents['Speech'].apply(lambda x: x.strip())

#Reorganize the order of the columns
df3_sents = df3_sents[['Location', 'Date', 'Debate', 'Source', 'Speaker', 'Speech']]

#Drop these first two rows, because they are not speech information
df3_sents = df3_sents.drop([0, 1])

In [None]:
df3_sents.head()

In [None]:
#The same for debate 1
debate1_sent = []
for chunk in debate1:
    sents = nltk.sent_tokenize(chunk[1])
    for sent in sents:
        debate1_sent.append([chunk[0], sent])

In [None]:
df1_sents = pd.DataFrame(debate1_sent)
df1_sents['Source'] = 'http://www.presidency.ucsb.edu/ws/?pid=118971'
df1_sents['Debate'] = '1' 
df1_sents['Location'] = 'Hofstra University in Hempstead, New York'
df1_sents['Date'] = '9/26/16'

#Renaming the first two columns
df1_sents.columns = ['Speaker', 'Speech', 'Source', 'Debate', 'Location', 'Date']

#Stripping the text in the Speaker and Speech columns
df1_sents['Speaker'] = df1_sents['Speaker'].apply(lambda x: x.strip())
df1_sents['Speech'] = df1_sents['Speech'].apply(lambda x: x.strip())

#Reorganize the order of the columns
df1_sents = df1_sents[['Location', 'Date', 'Debate', 'Source', 'Speaker', 'Speech']]

#Drop these first two rows, because they are not speech information
df1_sents = df1_sents.drop([0, 1])
df1_sents.head()

In [None]:
#The same for debate 2
debate2_sent = []
for chunk in debate2:
    sents = nltk.sent_tokenize(chunk[1])
    for sent in sents:
        debate2_sent.append([chunk[0], sent])

In [None]:
df2_sents = pd.DataFrame(debate2_sent)
df2_sents['Source'] = 'http://www.presidency.ucsb.edu/ws/?pid=119038'
df2_sents['Debate'] = '2' 
df2_sents['Location'] = 'Washington University in St. Louis, Missouri'
df2_sents['Date'] = '10/9/16'

#Renaming the first two columns
df2_sents.columns = ['Speaker', 'Speech', 'Source', 'Debate', 'Location', 'Date']

#Stripping the text in the Speaker and Speech columns
df2_sents['Speaker'] = df2_sents['Speaker'].apply(lambda x: x.strip())
df2_sents['Speech'] = df2_sents['Speech'].apply(lambda x: x.strip())

#Reorganize the order of the columns
df2_sents = df2_sents[['Location', 'Date', 'Debate', 'Source', 'Speaker', 'Speech']]

#Drop these first two rows, because they are not speech information
df2_sents = df2_sents.drop([0, 1])
df2_sents.head()

**Now I have 2 dataframes for each debate. One is a dataframe where each row is information on a chunk of speech, and the other is a dataframe where each row is information on a particular sentence. Both the chunks and sentences are in the order in which they were spoken. Now I'm going to export these dataframes to CSV files and annotate them for referring expressions manually.**

In [None]:
df1_sents.to_csv('debate1_sents.csv')
df2_sents.to_csv('debate2_sents.csv')
df3_sents.to_csv('debate3_sents.csv')
debate1df.to_csv('debate1.csv')
debate2df.to_csv('debate2.csv')
debate3df.to_csv('debate3.csv')
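
A note on the export step: by default `to_csv` also writes the row index, and that index comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference on a made-up two-row frame, using in-memory buffers instead of files.

```python
import io
import pandas as pd

# Made-up data standing in for one of the debate data frames
df = pd.DataFrame({'Speaker': ['HOLT', 'CLINTON'], 'Speech': ['Good evening.', 'Thank you.']})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False: only the real columns are written
no_index = io.StringIO()
df.to_csv(no_index, index=False)
no_index.seek(0)
cols_no_index = list(pd.read_csv(no_index).columns)

print(cols_default)
print(cols_no_index)
```

Keeping the index can still be useful if the original row numbers are needed to line annotated rows back up with the unannotated data.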

**I have manually annotated the three debateX_sents.csv files, marking each name or expression that Donald Trump, Hillary Clinton, and the moderators used to refer to one of the candidates or to another person.**

In [None]:
df1_re = pd.read_csv('debate1_sents_first.csv', encoding='latin1')
#Drop the old row index that was saved as a column in the CSV
df1_re = df1_re.drop(columns='Unnamed: 0')

In [None]:
df1_re.head()

In [None]:
df2_re = pd.read_csv('debate2_sents_first.csv', encoding='latin1')
#Drop the old row index that was saved as a column in the CSV
df2_re = df2_re.drop(columns='Unnamed: 0')

In [None]:
df2_re.head()

In [None]:
df3_re = pd.read_csv('debate3_sents_first.csv', encoding='latin1')
#Drop the old row index that was saved as a column in the CSV
df3_re = df3_re.drop(columns='Unnamed: 0')

In [None]:
df3_re.head()

In [None]:
df1_re_only = df1_re.loc[df1_re.RE.notnull(), :]
df1_re_only.head()

In [None]:
len(df1_re_only) #How many RE's/names were there?

In [None]:
#How many times did each speaker refer to someone else?
df1_re_only.groupby('Speaker')['RE'].count()

In [None]:
df2_re_only = df2_re.loc[df2_re.RE.notnull(), :]
df2_re_only.head()

In [None]:
len(df2_re_only) #How many RE's/names were there?

In [None]:
#How many times did each speaker refer to someone else?
df2_re_only.groupby('Speaker')['RE'].count()

In [None]:
df3_re_only = df3_re.loc[df3_re.RE.notnull(), :]
df3_re_only.head()

In [None]:
len(df3_re_only) #How many RE's/names were there?

In [None]:
#How many times did each speaker refer to someone else?
df3_re_only.groupby('Speaker')['RE'].count()

**In the first debate, I found 149 names and referring expressions. In the second debate, I found 191 names and referring expressions. In the third debate, I found 223 names and referring expressions.**
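
The per-debate counts could also be laid out in a single table. The sketch below is one way to do that with `pd.concat` and `groupby`, assuming each frame has `Debate`, `Speaker`, and `RE` columns like the annotated files. The two small frames and their RE values are made up for illustration and are not taken from the annotated data.

```python
import pandas as pd

# Made-up stand-ins for two annotated frames; None means no referring expression in that sentence
d1 = pd.DataFrame({'Debate': 1, 'Speaker': ['CLINTON', 'TRUMP', 'TRUMP'],
                   'RE': ['Donald', None, 'Secretary Clinton']})
d2 = pd.DataFrame({'Debate': 2, 'Speaker': ['CLINTON', 'TRUMP', 'COOPER'],
                   'RE': ['Donald', 'Hillary', None]})

all_re = pd.concat([d1, d2], ignore_index=True)
re_only = all_re[all_re['RE'].notnull()]

# Rows are speakers, columns are debates, cells are counts of annotated REs
counts = re_only.groupby(['Speaker', 'Debate']).size().unstack(fill_value=0)
print(counts)
```

Speakers with no annotated REs in any debate (COOPER in this made-up sample) do not appear in the table, because their rows are filtered out before counting.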

In [None]:
#As a sample of my data, I will be uploading csvs that only contain rows with RE's.
df1_re_only.to_csv('/Users/Paige/Documents/Data_Science/2016-Election-Project/data_samples/df1_re_only.csv')
df2_re_only.to_csv('/Users/Paige/Documents/Data_Science/2016-Election-Project/data_samples/df2_re_only.csv')
df3_re_only.to_csv('/Users/Paige/Documents/Data_Science/2016-Election-Project/data_samples/df3_re_only.csv')

### Sharing Plan

The only copyright information I could find from the American Presidency Project was a citation. Because I modified the data, included a citation for each debate, and include the source URL of the transcript with every sentence in all of my data frames, I believe it is OK for me to share my modified version of the data. I would like to make my code completely available to the world and to the members of this class. I still need to look for more information about sharing my data. For now, I will keep sharing only samples of the data frames that I create, although for the reasons above I believe I am able to share all of my data. With my added linguistic annotation, I would share my data under the Creative Commons Attribution-ShareAlike 4.0 license. If I find more licensing information and learn that the members of the American Presidency Project disapprove of my distributing a modification of their data even though it is cited, I will keep my data private and share only my analysis of the data. I have contacted them and hope to hear back soon.