## 2016 Election Project 

This notebook is intended to document my data processing throughout this project. I'll be poking around and modifying my data in this file. The data I am starting out with are transcripts of the presidential debates from the 2016 US Election between Hillary Clinton and Donald Trump. The transcripts were taken from UCSB's American Presidency Project, and the citation for each of the transcripts will be included both above, and as a part of the data I process.

To begin, I have three transcripts from the three presidential debates.

Presidential Candidates Debates: "Presidential Debate at the University of Nevada in Las Vegas," October 19, 2016. Online by Gerhard Peters and John T. Woolley, The American Presidency Project. http://www.presidency.ucsb.edu/ws/?pid=119039.

Presidential Candidates Debates: "Presidential Debate at Washington University in St. Louis, Missouri," October 9, 2016. Online by Gerhard Peters and John T. Woolley, The American Presidency Project. http://www.presidency.ucsb.edu/ws/?pid=119038.

Presidential Candidates Debates: "Presidential Debate at Hofstra University in Hempstead, New York," September 26, 2016. Online by Gerhard Peters and John T. Woolley, The American Presidency Project. http://www.presidency.ucsb.edu/ws/?pid=118971.


**I might use other speeches as well, but I think I will use their individual speeches AFTER they have both been chosen as their party's candidates. Since I'll be adding manual RE annotation, I worry about using too many files** 

In [1]:
import nltk
from nltk.corpus import PlaintextCorpusReader
import pandas as pd
import glob
import os


In [2]:
os.chdir('/Users/Paige/Documents/Data_Science/2016-Election-Project/data/Debates')
files = glob.glob("*.txt")
files

['10-19-16.txt', '10-9-16.txt', '9-26-16.txt']

In [3]:
transcripts = []
for f in files:
    fi = open(f, 'r')
    txt = fi.read()
    fi.close
    transcripts.append(txt)

In [4]:
print(transcripts[0][:200])

PARTICIPANTS:
Former Secretary of State Hillary Clinton (D) and
Businessman Donald Trump (R)
MODERATOR:
Chris Wallace (Fox News)

WALLACE: Good evening from the Thomas and Mack Center at the Universit


In [5]:
print(transcripts[1][:200])

PARTICIPANTS:
Former Secretary of State Hillary Clinton (D) and
Businessman Donald Trump (R)
MODERATORS:
Anderson Cooper (CNN) and
Martha Raddatz (ABC News)

RADDATZ: Ladies and gentlemen the Republic


In [6]:
print(transcripts[2][:500])

PARTICIPANTS:
Former Secretary of State Hillary Clinton (D) and
Businessman Donald Trump (R)
MODERATOR:
Lester Holt (NBC News)

HOLT: Good evening from Hofstra University in Hempstead, New York. I'm Lester Holt, anchor of "NBC Nightly News." I want to welcome you to the first presidential debate.

The participants tonight are Donald Trump and Hillary Clinton. This debate is sponsored by the Commission on Presidential Debates, a nonpartisan, nonprofit organization. The commission drafted tonight'


**We can see that we need to do some clean up. What I would eventually like to end up with is a dataframe where the columns are Debate, Date, Source, Speaker, Sents, where the Sents are in the order of their speech. For now, I will keep the speech/questions of the moderators, becuase it might be interesting to compare the referring expressions *they* use for the candidates vs what the cadidates use for each other.**

In [7]:
#I want to split large chunks of the transcript based on who is speaking.
#Since the transcript data has a pretty standardized fomat (The speaker is in all caps followed by a colon)
#I can add a marker to each of these sections, and split the data on that marker

speaker_split = []

for txt in transcripts:
    speaker_split.append(txt.replace("CLINTON:", "#$&CLINTON*:").replace("TRUMP:", "#$&TRUMP*:").replace("WALLACE:", "#$&WALLACE*:").replace("COOPER:", "#$&COOPER*:").replace("RADDATZ:", "#$&RADDATZ*:").replace("HOLT:", "#$&HOLT*:").replace("PARTICIPANTS:", "#$&PARTICIPANTS*:").replace("MODERATOR:", "#$&MODERATOR*:").replace("MODERATORS:", "#$&MODERATORS*:").replace("\n", " "))

speaker_split = [txt.strip().split("#$&") for txt in speaker_split]

In [8]:
speaker_split[0][:4]

['',
 'PARTICIPANTS*: Former Secretary of State Hillary Clinton (D) and Businessman Donald Trump (R) ',
 'MODERATOR*: Chris Wallace (Fox News)  ',
 "WALLACE*: Good evening from the Thomas and Mack Center at the University of Nevada, Las Vegas. I'm Chris Wallace of Fox News, and I welcome you to the third and final of the 2016 presidential debates between Secretary of State Hillary Clinton and Donald J. Trump.  This debate is sponsored by the Commission on Presidential Debates. The commission has designed the format: Six roughly 15-minute segments with two-minute answers to the first question, then open discussion for the rest of each segment. Both campaigns have agreed to those rules.  For the record, I decided the topics and the questions in each topic. None of those questions has been shared with the commission or the two candidates. The audience here in the hall has promised to remain silent. No cheers, boos, or other interruptions so we and you can focus on what the candidates have

In [9]:
#Creating three separate lists of split speech by speaker for each debate
debate3 = speaker_split[0]
debate2 = speaker_split[1]
debate1 = speaker_split[2]

In [10]:
#Splitting the SPEAKER: from the speech
debate3 = [txt.split("*:") for txt in debate3]
debate2 = [txt.split("*:") for txt in debate2]
debate1 = [txt.split("*:") for txt in debate1]

In [11]:
debate3 #We can see that we need to remove the empty list at the beginning, and strip all of the entries
debate3.remove([''])
debate3[:4]
#We'll strip the entries when they're in the data frame

[['PARTICIPANTS',
  ' Former Secretary of State Hillary Clinton (D) and Businessman Donald Trump (R) '],
 ['MODERATOR', ' Chris Wallace (Fox News)  '],
 ['WALLACE',
  " Good evening from the Thomas and Mack Center at the University of Nevada, Las Vegas. I'm Chris Wallace of Fox News, and I welcome you to the third and final of the 2016 presidential debates between Secretary of State Hillary Clinton and Donald J. Trump.  This debate is sponsored by the Commission on Presidential Debates. The commission has designed the format: Six roughly 15-minute segments with two-minute answers to the first question, then open discussion for the rest of each segment. Both campaigns have agreed to those rules.  For the record, I decided the topics and the questions in each topic. None of those questions has been shared with the commission or the two candidates. The audience here in the hall has promised to remain silent. No cheers, boos, or other interruptions so we and you can focus on what the candi

In [12]:
debate2.remove([''])
debate1.remove([''])

In [19]:
debate3df = pd.DataFrame(debate3)
#I want to add a column of the source of the transcript for each dataframe
debate3df['Source'] = 'http://www.presidency.ucsb.edu/ws/?pid=119039'

In [21]:
debate3df.head(10)

Unnamed: 0,0,1,Source
0,PARTICIPANTS,Former Secretary of State Hillary Clinton (D)...,http://www.presidency.ucsb.edu/ws/?pid=119039
1,MODERATOR,Chris Wallace (Fox News),http://www.presidency.ucsb.edu/ws/?pid=119039
2,WALLACE,Good evening from the Thomas and Mack Center ...,http://www.presidency.ucsb.edu/ws/?pid=119039
3,CLINTON,"Thank you very much, Chris. And thanks to UNL...",http://www.presidency.ucsb.edu/ws/?pid=119039
4,WALLACE,"Secretary Clinton, thank you. Mr. Trump, sam...",http://www.presidency.ucsb.edu/ws/?pid=119039
5,TRUMP,"Well, first of all, it's great to be with you...",http://www.presidency.ucsb.edu/ws/?pid=119039
6,WALLACE,"Mr. Trump, thank you. We now have about 10 m...",http://www.presidency.ucsb.edu/ws/?pid=119039
7,CLINTON,"Well, first of all, I support the Second Amen...",http://www.presidency.ucsb.edu/ws/?pid=119039
8,WALLACE,Let me bring Mr. Trump in here. The bipartisa...,http://www.presidency.ucsb.edu/ws/?pid=119039
9,TRUMP,"Well, the D.C. vs. Heller decision was very s...",http://www.presidency.ucsb.edu/ws/?pid=119039


**Currently, the entries in this data frame are split into chunks of who is speaking. I think I might want each entry in the data frame to be a sentence instead.**

In [15]:
debate3df.columns = ['Speaker', 'Speech']
debate3df['Speaker'] = debate3df['Speaker'].apply(lambda x: x.strip())
debate3df['Speech'] = debate3df['Speech'].apply(lambda x: x.strip())

In [22]:
debate3df.head(10)

Unnamed: 0,0,1,Source
0,PARTICIPANTS,Former Secretary of State Hillary Clinton (D)...,http://www.presidency.ucsb.edu/ws/?pid=119039
1,MODERATOR,Chris Wallace (Fox News),http://www.presidency.ucsb.edu/ws/?pid=119039
2,WALLACE,Good evening from the Thomas and Mack Center ...,http://www.presidency.ucsb.edu/ws/?pid=119039
3,CLINTON,"Thank you very much, Chris. And thanks to UNL...",http://www.presidency.ucsb.edu/ws/?pid=119039
4,WALLACE,"Secretary Clinton, thank you. Mr. Trump, sam...",http://www.presidency.ucsb.edu/ws/?pid=119039
5,TRUMP,"Well, first of all, it's great to be with you...",http://www.presidency.ucsb.edu/ws/?pid=119039
6,WALLACE,"Mr. Trump, thank you. We now have about 10 m...",http://www.presidency.ucsb.edu/ws/?pid=119039
7,CLINTON,"Well, first of all, I support the Second Amen...",http://www.presidency.ucsb.edu/ws/?pid=119039
8,WALLACE,Let me bring Mr. Trump in here. The bipartisa...,http://www.presidency.ucsb.edu/ws/?pid=119039
9,TRUMP,"Well, the D.C. vs. Heller decision was very s...",http://www.presidency.ucsb.edu/ws/?pid=119039
