## 2016 Election Project 

This notebook is intended to document my data processing throughout this project. I'll be poking around and modifying my data in this file. The data I am starting out with are transcripts of the presidential debates from the 2016 US Election. I am processing the Democratic and Republican primary debates, and the debates of the general election between Hillary Clinton and Donald Trump. The transcripts were taken from UCSB's American Presidency Project, and the citation for each of the transcripts can be found in the README.

In [1]:
%pprint
import nltk
from nltk.corpus import PlaintextCorpusReader
import pandas as pd
import glob
import os
import re

Pretty printing has been turned OFF


In [2]:
#Reading in files
os.chdir('/Users/Paige/Documents/Data_Science/2016-Election-Project/data/Debates/transcripts/')
files = glob.glob("*.txt")
files

['1-14-16_rep.txt', '1-17-16_dem.txt', '1-25-16_dem.txt', '1-28-16_rep.txt', '10-13-15_dem.txt', '10-19-16.txt', '10-28-15_rep.txt', '10-9-16.txt', '11-10-15_rep.txt', '11-14-15_dem.txt', '12-15-15_rep.txt', '12-19-15_dem.txt', '2-11-16_dem.txt', '2-13-16_rep.txt', '2-25-16_rep.txt', '2-4-16_dem.txt', '2-6-16_rep.txt', '3-10-16_rep.txt', '3-3-16_rep.txt', '3-6-16_dem.txt', '3-9-16_dem.txt', '4-14-16_dem.txt', '8-6-15_rep.txt', '9-16-15_rep.txt', '9-26-16.txt']

In [3]:
len(files)

25

In [4]:
#I'm creating a list where each entry in the list is a transcript
transcripts = []
for f in files:
    #The with block closes the file automatically once it has been read
    with open(f, 'r') as fi:
        transcripts.append(fi.read())

In [5]:
print(transcripts[0][:200])

PARTICIPANTS:
Former Governor Jeb Bush (FL);
Ben Carson;
Governor Chris Christie (NJ);
Senator Ted Cruz (TX);
Governor John Kasich (OH);
Senator Marco Rubio (FL);
Donald Trump;
MODERATORS:
Maria Barti


In [6]:
words0= nltk.word_tokenize(transcripts[0])
len(words0)

27658

In [7]:
sents0= nltk.sent_tokenize(transcripts[0])
len(sents0)

1498

In [8]:
print(transcripts[1][:200])

PARTICIPANTS:
Former Secretary of State Hillary Clinton;
Former Governor Martin O'Malley (MD);
Senator Bernie Sanders (VT);
MODERATORS:
Lester Holt (NBC News)
Andrea Mitchell (NBC News)

HOLT: Good ev


**Cleaning up: I would eventually like to end up with a dataframe where the columns are Date, Type (primary or general), Speaker, Sents, where the Sents are in the order that they are said.**

In [9]:
#I want to split large chunks of the transcript based on who is speaking.
#Since the transcript data has a pretty standardized format (the speaker is in all caps followed by a colon),
#I can add a marker to each of these sections, and split the data on that marker

speaker_split = []

for txt in transcripts:
    #To take care of the first one where there is no newline preceding the label..
    txt = txt.replace("PARTICIPANTS:", 'PARTICIPANTS%:')
    #Take care of all other speakers, labels
    txt = re.sub(r"\n(\w+):", r"#$&\1%:", txt)
    speaker_split.append(txt)

#Split each chunk by the special marker
speaker_split = [txt.strip().split("#$&") for txt in speaker_split]
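
To see what the two substitutions do, here is a minimal sketch on a made-up sample laid out like the transcripts (the names and lines are invented for illustration, not taken from the data). It also shows one limitation: a label such as `SALINAS [through translator]:` is not a single all-word token before the colon, so `\n(\w+):` does not match it and that turn stays attached to the previous speaker.

```python
import re

# Made-up sample in the same layout as the transcripts (not real data)
sample = (
    "PARTICIPANTS:\n"
    "Candidate One;\n"
    "MODERATORS:\n"
    "Host One (Network)\n"
    "\n"
    "HOST: Good evening. It is 9:00 p.m. here.\n"
    "\n"
    "ONE: Thank you.\n"
    "\n"
    "TWO [through translator]: Welcome."
)

# Same two substitutions as in the cell above
marked = sample.replace("PARTICIPANTS:", "PARTICIPANTS%:")
marked = re.sub(r"\n(\w+):", r"#$&\1%:", marked)

# Split into chunks, then into [speaker, speech] pairs
chunks = marked.strip().split("#$&")
pairs = [chunk.split("%:") for chunk in chunks]
speakers = [pair[0] for pair in pairs]
print(speakers)
```

The time "9:00" is left alone because the pattern only fires on a word that directly follows a newline. The "TWO" turn ends up inside ONE's speech, which would need a looser pattern or a separate cleanup pass if those turns matter.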

In [10]:
speaker_split[0][:4]

['PARTICIPANTS%:\nFormer Governor Jeb Bush (FL);\nBen Carson;\nGovernor Chris Christie (NJ);\nSenator Ted Cruz (TX);\nGovernor John Kasich (OH);\nSenator Marco Rubio (FL);\nDonald Trump;', 'MODERATORS%:\nMaria Bartiromo (Fox Business Network); and\nNeil Cavuto (Fox Business Network)\n', "CAVUTO%: It is 9:00 p.m. here at the North Charleston Coliseum and Performing Arts Center in South Carolina. Welcome to the sixth Republican presidential of the 2016 campaign, here on the Fox Business Network. I'm Neil Cavuto, alongside my friend and co-moderator Maria Bartiromo.\n", 'BARTIROMO%: Tonight we are working with Facebook to ask the candidates the questions voters want answered. And according to Facebook, the U.S. election has dominated the global conversation, with 131 million people talking about the 2016 race. That makes it the number one issue talked about on Facebook last year worldwide.\n']

In [11]:
#Creating a giant list so I don't have to handle things one at a time
#Splitting each chunk into two elements: speaker, speech
debates = [[txt.split("%:") for txt in split] for split in speaker_split]
debates[0][:4]

[['PARTICIPANTS', '\nFormer Governor Jeb Bush (FL);\nBen Carson;\nGovernor Chris Christie (NJ);\nSenator Ted Cruz (TX);\nGovernor John Kasich (OH);\nSenator Marco Rubio (FL);\nDonald Trump;'], ['MODERATORS', '\nMaria Bartiromo (Fox Business Network); and\nNeil Cavuto (Fox Business Network)\n'], ['CAVUTO', " It is 9:00 p.m. here at the North Charleston Coliseum and Performing Arts Center in South Carolina. Welcome to the sixth Republican presidential of the 2016 campaign, here on the Fox Business Network. I'm Neil Cavuto, alongside my friend and co-moderator Maria Bartiromo.\n"], ['BARTIROMO', ' Tonight we are working with Facebook to ask the candidates the questions voters want answered. And according to Facebook, the U.S. election has dominated the global conversation, with 131 million people talking about the 2016 race. That makes it the number one issue talked about on Facebook last year worldwide.\n']]

In [12]:
debate_sents = []
#For each debate, then for each [speaker, speech] chunk in that debate, get a list of tokenized sents to replace the speech
for debate in debates:
    sents_toks = []
    for chunk in debate:
        sents = nltk.sent_tokenize(chunk[1])
        for sent in sents:
            sents_toks.append([chunk[0], sent])
    debate_sents.append(sents_toks)

In [13]:
#I am creating a list of 25 dataframes, one for each debate
# Adding a column specifying the type of debate, the date, the speaker, and sent

dataframes = []
for index, f in enumerate(files):
    df = pd.DataFrame(debate_sents[index])
    if f.endswith('_dem.txt'):
        df['Type'] = 'primary_dem' 
        df['Date'] = f[:-8]
    elif f.endswith('_rep.txt'):
        df['Type'] = 'primary_rep' 
        df['Date'] = f[:-8]
    else:
        df['Type'] = 'general' 
        df['Date'] = f[:-4]
    dataframes.append(df)
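
The Date column holds strings such as '1-14-16', and the file list from glob is in string order ('1-14-16_rep.txt' comes before '10-13-15_dem.txt'), so the debates are not in chronological order. A minimal sketch of one way to get real dates, assuming the file names follow a month-day-two-digit-year pattern:

```python
import pandas as pd

# A few of the dates from the file listing above
dates = pd.Series(['1-14-16', '10-13-15', '8-6-15', '9-26-16'])

# Assumed format: month-day-year with a two-digit year
parsed = pd.to_datetime(dates, format='%m-%d-%y')

# Sorting on the parsed values gives chronological order
ordered = parsed.sort_values().dt.strftime('%Y-%m-%d').tolist()
print(ordered)
```

The same call could be applied to a dataframe's Date column if the debates ever need to be ordered or filtered by time.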

In [14]:
# Every returned Out[] is displayed, not just the last one. 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [15]:
for df in dataframes:
    df.head()

Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Governor Jeb Bush (FL);\nBen Carson;\...,primary_rep,1-14-16
1,MODERATORS,\nMaria Bartiromo (Fox Business Network); and\...,primary_rep,1-14-16
2,CAVUTO,It is 9:00 p.m. here at the North Charleston ...,primary_rep,1-14-16
3,CAVUTO,Welcome to the sixth Republican presidential o...,primary_rep,1-14-16
4,CAVUTO,"I'm Neil Cavuto, alongside my friend and co-mo...",primary_rep,1-14-16


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Secretary of State Hillary Clinton;\n...,primary_dem,1-17-16
1,MODERATORS,\nLester Holt (NBC News)\nAndrea Mitchell (NBC...,primary_dem,1-17-16
2,HOLT,Good evening and welcome to the NBC News Yout...,primary_dem,1-17-16
3,HOLT,"After all the campaigning, soon, Americans wil...",primary_dem,1-17-16
4,HOLT,And New Hampshire not far behind.,primary_dem,1-17-16


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Secretary of State Hillary Clinton;\n...,primary_dem,1-25-16
1,MODERATOR,"\nChris Cuomo, CNN",primary_dem,1-25-16
2,CUOMO,All right.,primary_dem,1-25-16
3,CUOMO,"We are live at Drake University in Des Moines,...",primary_dem,1-25-16
4,CUOMO,Welcome to our viewers in the United States an...,primary_dem,1-25-16


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Governor Jeb Bush (FL);\nBen Carson;\...,primary_rep,1-28-16
1,MODERATORS,\nBret Baier (Fox News);\nMegyn Kelly (Fox New...,primary_rep,1-28-16
2,BAIER,Nine p.m. on the East Coast.,primary_rep,1-28-16
3,BAIER,"Eight o'clock here in Des Moines, Iowa.",primary_rep,1-28-16
4,BAIER,Welcome to the seventh Republican presidential...,primary_rep,1-28-16


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Governor Lincoln Chafee (RI);\nFormer...,primary_dem,10-13-15
1,MODERATORS,\nAnderson Cooper (CNN);\nDana Bash (CNN);\nDo...,primary_dem,10-13-15
2,COOPER,I'm Anderson Cooper.,primary_dem,10-13-15
3,COOPER,Thanks for joining us.,primary_dem,10-13-15
4,COOPER,We've already welcomed the candidates on stage.,primary_dem,10-13-15


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Secretary of State Hillary Clinton (D...,general,10-19-16
1,MODERATOR,\nChris Wallace (Fox News),general,10-19-16
2,WALLACE,Good evening from the Thomas and Mack Center ...,general,10-19-16
3,WALLACE,"I'm Chris Wallace of Fox News, and I welcome y...",general,10-19-16
4,WALLACE,This debate is sponsored by the Commission on ...,general,10-19-16


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Governor Jeb Bush (FL);\nBen Carson;\...,primary_rep,10-28-15
1,MODERATORS,\nJohn Harwood (CNBC);\nBecky Quick (CNBC); an...,primary_rep,10-28-15
2,QUINTANILLA,"Good evening, I'm Carl Quintanilla, with my c...",primary_rep,10-28-15
3,QUINTANILLA,We'll be joined tonight by some of CNBC's top ...,primary_rep,10-28-15
4,QUINTANILLA,Let's get through the rules of the road.,primary_rep,10-28-15


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Secretary of State Hillary Clinton (D...,general,10-9-16
1,MODERATORS,\nAnderson Cooper (CNN) and\nMartha Raddatz (A...,general,10-9-16
2,RADDATZ,Ladies and gentlemen the Republican nominee f...,general,10-9-16
3,RADDATZ,[applause],general,10-9-16
4,COOPER,Thank you very much for being here.,general,10-9-16


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Governor Jeb Bush (FL);\nBen Carson;\...,primary_rep,11-10-15
1,MODERATORS,\nGerard Baker (The Wall Street Journal);\nMar...,primary_rep,11-10-15
2,CAVUTO,"It is 9:00 p.m. on the East Coast, 8:00 p.m. ...",primary_rep,11-10-15
3,CAVUTO,Welcome to the Republican presidential debate ...,primary_rep,11-10-15
4,CAVUTO,"I'm Neil Cavuto, alongside my co-moderators, M...",primary_rep,11-10-15


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Secretary of State Hillary Clinton;\n...,primary_dem,11-14-15
1,MODERATORS,\nNancy Cordes (CBS News);\nKevin Cooney (CBS ...,primary_dem,11-14-15
2,DICKERSON,Before we start the debate here are the rules.,primary_dem,11-14-15
3,DICKERSON,The candidates have one minute to respond to o...,primary_dem,11-14-15
4,DICKERSON,Any candidate who is attacked by another candi...,primary_dem,11-14-15


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Governor Jeb Bush (FL);\nBen Carson;\...,primary_rep,12-15-15
1,MODERATORS,\nWolf Blitzer (CNN);\nDana Bash (CNN); and\nH...,primary_rep,12-15-15
2,BLITZER,Welcome to the CNN-Facebook Republican presid...,primary_rep,12-15-15
3,BLITZER,We have a very enthusiastic audience.,primary_rep,12-15-15
4,BLITZER,Everyone is here.,primary_rep,12-15-15


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Secretary of State Hillary Clinton;\n...,primary_dem,12-19-15
1,MODERATORS,\nMartha Raddatz (ABC News)\nDavid Muir (ABC N...,primary_dem,12-19-15
2,RADDATZ,Good evening to you all.,primary_dem,12-19-15
3,RADDATZ,The rules for tonight are very basic and have ...,primary_dem,12-19-15
4,RADDATZ,Candidates can take up to a minute-and-a-half ...,primary_dem,12-19-15


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Secretary of State Hillary Clinton;\n...,primary_dem,2-11-16
1,MODERATORS,\nGwen Ifill (PBS);\nJudy Woodruff (PBS),primary_dem,2-11-16
2,WOODRUFF,"Good evening, and thank you.",primary_dem,2-11-16
3,WOODRUFF,We are happy to welcome you to Milwaukee for t...,primary_dem,2-11-16
4,WOODRUFF,We are especially pleased to thank our partner...,primary_dem,2-11-16


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Governor Jeb Bush (FL);\nBen Carson;\...,primary_rep,2-13-16
1,MODERATOR,\nJohn Dickerson (CBS News); with,primary_rep,2-13-16
2,PANELISTS,\nMajor Garrett (CBS News); and\nKimberly Stra...,primary_rep,2-13-16
3,DICKERSON,Good evening.,primary_rep,2-13-16
4,DICKERSON,I'm John Dickerson.,primary_rep,2-13-16


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nBen Carson;\nSenator Ted Cruz (TX);\nGoverno...,primary_rep,2-25-16
1,MODERATOR,\nWolf Blitzer (CNN); with,primary_rep,2-25-16
2,PANELISTS,\nMaria Celeste Arrarás (Telemundo);\nDana Bas...,primary_rep,2-25-16
3,BLITZER,We're live here at the University of Houston ...,primary_rep,2-25-16
4,BLITZER,[applause]\n\nAn enthusiastic crowd is on hand...,primary_rep,2-25-16


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Secretary of State Hillary Clinton;\n...,primary_dem,2-4-16
1,MODERATORS,\nChuck Todd (MSNBC);\nRachel Maddow (MSNBC),primary_dem,2-4-16
2,TODD,"Good evening, and welcome to the MSNBC Democr...",primary_dem,2-4-16
3,MADDOW,We are super excited to be here at the Univer...,primary_dem,2-4-16
4,MADDOW,"Tonight, this is the first time that Hillary C...",primary_dem,2-4-16


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Governor Jeb Bush (FL);\nBen Carson;\...,primary_rep,2-6-16
1,MODERATORS,\nDavid Muir (ABC News); and\nMartha Raddatz (...,primary_rep,2-6-16
2,MUIR,"Good evening, again, everyone.",primary_rep,2-6-16
3,MUIR,This is the first time since Iowa and the only...,primary_rep,2-6-16
4,MUIR,The people of Iowa have been heard.,primary_rep,2-6-16


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nSenator Ted Cruz (TX);\nGovernor John Kasich...,primary_rep,3-10-16
1,MODERATORS,\nJake Tapper (CNN);\nDana Bash (CNN);\nHugh H...,primary_rep,3-10-16
2,TAPPER,Live from the Bank United Center on the campu...,primary_rep,3-10-16
3,TAPPER,For our viewers in the United States and aroun...,primary_rep,3-10-16
4,TAPPER,In just five days voters will go to the polls ...,primary_rep,3-10-16


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nSenator Ted Cruz (TX);\nGovernor John Kasich...,primary_rep,3-3-16
1,MODERATORS,\nBret Baier (Fox News);\nMegyn Kelly (Fox New...,primary_rep,3-3-16
2,KELLY,"Good evening, and welcome to the fabulous FOX...",primary_rep,3-3-16
3,KELLY,"I'm Megyn Kelly, along with my co-moderators, ...",primary_rep,3-3-16
4,BAIER,59 Republican delegates are at stake here in ...,primary_rep,3-3-16


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Secretary of State Hillary Clinton;\n...,primary_dem,3-6-16
1,MODERATORS,\nAnderson Cooper (CNN);\nDon Lemon (CNN),primary_dem,3-6-16
2,COOPER,And welcome to The Whiting Auditorium on the ...,primary_dem,3-6-16
3,COOPER,I'm Anderson Cooper.,primary_dem,3-6-16
4,COOPER,I want to welcome our viewers in the United St...,primary_dem,3-6-16


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Secretary of State Hillary Clinton;\n...,primary_dem,3-9-16
1,MODERATORS,\nJorge Ramos (Univision);\nMaría Elena Salina...,primary_dem,3-9-16
2,RAMOS,[Speaking in Spanish]\n\nSALINAS [through tra...,primary_dem,3-9-16
3,RAMOS,RAMOS [through translator]: Here with us tonig...,primary_dem,3-9-16
4,SALINAS,"Welcome, Karen.",primary_dem,3-9-16


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Secretary of State Hillary Clinton;\n...,primary_dem,4-14-16
1,MODERATOR,\nWolf Blitzer (CNN);,primary_dem,4-14-16
2,PANELISTS,\nDana Bash (CNN); and\nErrol Louis (NY1),primary_dem,4-14-16
3,BLITZER,"Secretary Clinton and Senator Sanders, you ca...",primary_dem,4-14-16
4,BLITZER,"As moderator, I'll guide the discussion, askin...",primary_dem,4-14-16


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Governor Jeb Bush (FL);\nBen Carson;\...,primary_rep,8-6-15
1,MODERATORS,\nBret Baier (Fox News);\nMegyn Kelly (Fox New...,primary_rep,8-6-15
2,KELLY,Welcome to the first debate night of the 2016...,primary_rep,8-6-15
3,KELLY,I'm Megyn Kelly... [applause]... along with my...,primary_rep,8-6-15
4,KELLY,Tonight... [applause] Nice.,primary_rep,8-6-15


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Governor Jeb Bush (FL);\nBen Carson;\...,primary_rep,9-16-15
1,MODERATORS,\nJake Tapper (CNN);\nDana Bash (CNN); and\nHu...,primary_rep,9-16-15
2,TAPPER,I'm Jake Tapper.,primary_rep,9-16-15
3,TAPPER,We're live at the Ronald Reagan Library in Sim...,primary_rep,9-16-15
4,TAPPER,Round 2 of CNN's presidential debate starts now.,primary_rep,9-16-15


Unnamed: 0,0,1,Type,Date
0,PARTICIPANTS,\nFormer Secretary of State Hillary Clinton (D...,general,9-26-16
1,MODERATOR,\nLester Holt (NBC News),general,9-26-16
2,HOLT,Good evening from Hofstra University in Hemps...,general,9-26-16
3,HOLT,"I'm Lester Holt, anchor of ""NBC Nightly News.""",general,9-26-16
4,HOLT,I want to welcome you to the first presidentia...,general,9-26-16


In [16]:
#Creating a new list of cleaned dataframes where the columns are renamed and reordered
dataframes_clean = []
for df in dataframes:
    #Drop the first two rows (PARTICIPANTS and MODERATOR/MODERATORS); they list names, not speech.
    #drop() returns a new dataframe, so the originals stay intact and this cell can be re-run.
    #Note: debates with a third header row (PANELISTS, e.g. 2-13-16) still keep that row.
    df = df.drop([0, 1])
    #Renaming the columns
    df.columns = ['Speaker', 'Sents', 'Debate Type', 'Date']
    #Strip newlines from Speaker and Sents columns
    df['Speaker'] = df['Speaker'].apply(lambda x: x.strip('\n'))
    df['Sents'] = df['Sents'].apply(lambda x: x.strip('\n'))
    #Reorder columns
    dataframes_clean.append(df[['Date','Debate Type', 'Speaker', 'Sents']])
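
The head() output above shows that three debates (2-13-16, 2-25-16, 4-14-16) have a third header row, PANELISTS, so dropping rows 0 and 1 leaves it in. A minimal sketch of one alternative: count the header rows at the top of each dataframe and slice them off. The toy frame and the `header_labels` set are assumptions for illustration; only leading rows are removed, in case one of these labels also shows up as a speaker later in a transcript.

```python
import pandas as pd

# Toy frame mimicking a debate with a third header row (text is made up)
df = pd.DataFrame({
    'Speaker': ['PARTICIPANTS', 'MODERATOR', 'PANELISTS', 'BLITZER', 'BLITZER'],
    'Sents': ['\nName A;', '\nName B; with', '\nName C', 'Good evening.', 'Welcome.'],
})

# Header labels seen in the head() output above
header_labels = {'PARTICIPANTS', 'MODERATOR', 'MODERATORS', 'PANELISTS'}

# Count how many rows at the top carry a header label
n_header = 0
for speaker in df['Speaker']:
    if speaker not in header_labels:
        break
    n_header += 1

# Keep everything after the header rows
speech = df.iloc[n_header:]
print(speech)
```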

In [17]:
dataframes_clean[0].head()

Unnamed: 0,Date,Debate Type,Speaker,Sents
2,1-14-16,primary_rep,CAVUTO,It is 9:00 p.m. here at the North Charleston ...
3,1-14-16,primary_rep,CAVUTO,Welcome to the sixth Republican presidential o...
4,1-14-16,primary_rep,CAVUTO,"I'm Neil Cavuto, alongside my friend and co-mo..."
5,1-14-16,primary_rep,BARTIROMO,Tonight we are working with Facebook to ask t...
6,1-14-16,primary_rep,BARTIROMO,"And according to Facebook, the U.S. election h..."


In [18]:
dataframes_clean[-1].head()

Unnamed: 0,Date,Debate Type,Speaker,Sents
2,9-26-16,general,HOLT,Good evening from Hofstra University in Hemps...
3,9-26-16,general,HOLT,"I'm Lester Holt, anchor of ""NBC Nightly News."""
4,9-26-16,general,HOLT,I want to welcome you to the first presidentia...
5,9-26-16,general,HOLT,The participants tonight are Donald Trump and ...
6,9-26-16,general,HOLT,This debate is sponsored by the Commission on ...


**Now I have a nice dataframe for each debate. For every sentence in a debate, it records who said it, what kind of debate it was, and when the debate took place. Next, I'm going to export these dataframes to CSV files and process them with NER annotation in a different notebook.**

In [19]:
#Write one CSV per debate, named after its transcript file
for f, df in zip(files, dataframes_clean):
    df.to_csv('../csv/' + f[:-4] + '.csv')
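
Since the goal stated earlier is a single dataframe, the per-debate frames could also be stacked before or after export. A minimal sketch with two toy frames standing in for entries of `dataframes_clean` (the speaker labels and sentences are made up):

```python
import pandas as pd

columns = ['Date', 'Debate Type', 'Speaker', 'Sents']

# Two toy frames standing in for entries of dataframes_clean (rows are made up)
first = pd.DataFrame(
    [['1-14-16', 'primary_rep', 'SPEAKER_A', 'Good evening.'],
     ['1-14-16', 'primary_rep', 'SPEAKER_B', 'Thank you.']],
    columns=columns, index=[2, 3])
second = pd.DataFrame(
    [['9-26-16', 'general', 'SPEAKER_C', 'Welcome.']],
    columns=columns, index=[2])

# ignore_index=True replaces the per-debate row labels with one running index
all_debates = pd.concat([first, second], ignore_index=True)
print(all_debates)
```

Note that `to_csv` writes the row index by default, so reading the exported files back with `pd.read_csv` yields an extra unnamed first column unless `index=False` is passed on export or `index_col=0` on import.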