## Overview of data 

In [2]:
import pandas as pd
import numpy as np

### RNC and DNC Speeches 
I found these folders of transcripts on Kaggle:
https://www.kaggle.com/christianlillelund/2020-republican-convention-speeches and https://www.kaggle.com/christianlillelund/2020-democratic-convention-speeches. They contain speeches from the 2020 Republican and Democratic National Conventions.
#### Goals:
* make a csv file with speaker, affiliation and text information
* this will be the main file for analysis

In [3]:
# RNC transcripts
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'data/RNC/'
reps = PlaintextCorpusReader(corpus_root, '.*txt')

In [4]:
# DNC transcripts (nltk and PlaintextCorpusReader are already imported in the RNC cell above)
corpus_root = 'data/DNC/'
dems = PlaintextCorpusReader(corpus_root, '.*txt')

In [5]:
transRNC = [reps.raw(x) for x in reps.fileids()]
transDNC = [dems.raw(x) for x in dems.fileids()]
demnames = [x.replace('_',' ').strip('.') for x in dems.fileids()]
repnames = [x.replace('_',' ').strip('.') for x in reps.fileids()]
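
Note that `str.strip('.')` only removes periods at the very start or end of a string, so the `.txt` extension survives in the speaker names (it is visible in the index of the table further down). A minimal sketch of one way to drop the extension, assuming the file IDs look like `barack_obama.txt`; the helper name `fileid_to_name` is hypothetical and not part of the original notebook:

```python
import os

def fileid_to_name(fileid):
    """Turn a corpus file ID such as 'barack_obama.txt' into 'barack obama'."""
    stem, _ext = os.path.splitext(fileid)  # split off the final extension
    return stem.replace('_', ' ')

example_ids = ['barack_obama.txt', 'dr_jill_biden.txt']  # illustrative IDs only
names = [fileid_to_name(x) for x in example_ids]
print(names)
```

With the notebook's readers this could be applied as `[fileid_to_name(x) for x in dems.fileids()]`.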

In [6]:
DNCspeakeraff = pd.DataFrame({'Speakers': demnames, 
                       'Aff': 'D',
                       'transcript' : transDNC}) 
RNCspeakeraff = pd.DataFrame({'Speakers': repnames, 
                       'Aff': 'R',
                        'transcript' :transRNC})

In [7]:
speakeraff = pd.concat([DNCspeakeraff,RNCspeakeraff])
speakeraff = speakeraff.set_index('Speakers') #the contents in the data files with their political affiliation
speakeraff

Speakers,Aff,transcript
alexandria ocasio-cortex.txt,D,Good evening and thank you to everyone here to...
andrew cuomo.txt,D,"We climbed the impossible mountain, and right ..."
andrew yang.txt,D,"Hello, America. I'm Andrew Yang. You might kno..."
barack obama.txt,D,Good evening everybody. As you've seen by now ...
bernie sanders.txt,D,Good evening. Our great nation is now living i...
bill clinton.txt,D,"We have a leader to help us solve problems, cr..."
chuck schumer.txt,D,"Brooklyn, New York. Behind me is a sight eye s..."
colin powell.txt,D,"Hi, I'm former secretary of state Colin Powell..."
cory booker.txt,D,Union job lifted my family out of poverty and ...
dr jill biden.txt,D,Quiet that sparks with possibility just before...


Because the sources were plain text files, the content was already fairly clean. All I had to do was put them in a DataFrame and add the political affiliation.

In [8]:
len(speakeraff) #There were 51 speakers in total in the two data sets

51

In [9]:
print(len(speakeraff.index[speakeraff['Aff'] == 'D']),
    len(speakeraff.index[speakeraff['Aff'] == 'R']))

21 30


In [10]:
#speakeraff.to_csv(r'data/convspeeches.csv') #made it a csv file

### Vice Presidential Debate
Scraped from https://www.rev.com/blog/transcripts/kamala-harris-mike-pence-2020-vice-presidential-debate-transcript. This debate occurred on October 7th, 2020. The moderator was Susan Page. In my initial scrape, I left her in the csv, and later in this notebook I note her affiliation as "none".
#### Goals:
* make a list of text for all of Harris's words and all of Pence's words to add to the above dataframe for word classification and analysis
* make a new csv with speaker political affiliation to have on hand

In [11]:
pagedebate = pd.read_csv("data/debate2020.csv")
pagedebate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 327 entries, 0 to 326
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   speaker     327 non-null    object
 1   time_stamp  327 non-null    object
 2   transcript  327 non-null    object
dtypes: object(3)
memory usage: 7.8+ KB


There are 327 lines spoken in this debate. All of the columns and rows have the right amount of info (no NaN).

In [12]:
text = [x.strip('\n</p>') for x in pagedebate.transcript.values]
speaker = [x.strip('>') for x in pagedebate.speaker.values]
pagedebate = pd.DataFrame({'transcript': text, 'Speakers' :speaker})
pagedebate

Unnamed: 0,transcript,Speakers
0,Good evening. From the University of Utah in S...,Susan Page
1,"These are tumultuous times, but we can and wil...",Susan Page
2,"Thank you, Susan. Well, the American people ha...",Kamala Harris
3,"Can you imagine if you knew on January 28th, a...",Kamala Harris
4,"Thank you, Senator Harris-",Susan Page
...,...,...
322,"And brings me to Joe, Joe Biden. One of the re...",Kamala Harris
323,Joe has a longstanding reputation of working a...,Kamala Harris
324,"Brecklin, when you think about the future, I d...",Kamala Harris
325,"Thank you, Senator Harris. Thank you, vice pre...",Susan Page


I took out the timestamp column and cleaned up the text and speaker strings. (The leftover characters are a side effect of scraping: I was able to clean most of them up in that process with regular expressions, but I was still left with the characters handled by the x.strip calls above.)
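
One caveat about the `x.strip('\n</p>')` cleanup: `str.strip` treats its argument as a set of characters, not a literal suffix, so it will also eat a legitimate leading or trailing `p`. A small sketch of the difference, using a made-up line and a hypothetical helper `clean_line` that removes only a literal `</p>` suffix with a regular expression:

```python
import re

def clean_line(raw):
    """Remove one trailing '</p>' tag (plus surrounding whitespace) and nothing else."""
    return re.sub(r'\s*</p>\s*$', '', raw).strip()

raw = 'We need more help</p>\n'   # made-up example line
print(raw.strip('\n</p>'))        # character-set strip also removes the final 'p' of 'help'
print(clean_line(raw))            # literal-suffix removal keeps the word intact
```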

In [13]:
kamala = pagedebate[pagedebate.Speakers == 'Kamala Harris'].transcript
pence = pagedebate[pagedebate.Speakers == 'Mike Pence'].transcript
pence # just pence's lines

7      Susan, thank you. And I want to thank the Comm...
8      And I believe it saved hundreds of thousands o...
10     … of America first. And the American people, I...
12     … of the sacrifices they have made. It’s saved...
17                       Susan, I have to weigh in here-
                             ...                        
314    Well, Susan, first and foremost, I think we’re...
315    But when you talk about accepting the outcome ...
316    So let me just say, I think we’re going to win...
318    Brecklin, it’s a wonderful question. And let m...
319    I look at the relationship between Justice Rut...
Name: transcript, Length: 113, dtype: object

In [14]:
#adding Pence and Harris full text into the main csv(conv csv)
pencefull = str(list(pence.values)).replace("',", '.').replace(" '"," ").strip('[]').strip("''")
pencelist = [('Mike Pence', pencefull)] #list of Mike Pence and his entire words spoken
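
The `str(list(...))` plus `replace` chain works on the printed form of the list, which can leave stray quote characters behind whenever a line contains a straight apostrophe, and it turns every separator into an extra period. A minimal alternative sketch, assuming each turn is already a clean string, is to join the turns directly; `join_turns` is a hypothetical helper name:

```python
def join_turns(turns):
    """Concatenate a speaker's turns into one string separated by single spaces."""
    return ' '.join(t.strip() for t in turns)

turns = ["Susan, thank you.", "It's a wonderful question."]  # made-up lines
print(join_turns(turns))
```

With the notebook's data this would be called as `join_turns(pence.values)`.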

In [15]:
#pencelist shows the 'Mike Pence' and his entire spoken lines

In [16]:
pencedebate = pd.DataFrame(pencelist, columns=['Speakers', 'transcript'])
pencedebate['Aff'] = 'R'
pencedebate = pencedebate.set_index('Speakers')
#adding to main file
maincsv = pd.concat([speakeraff,pencedebate])

In [17]:
harrisfull = str(list(kamala.values)).replace("',", '.').replace(" '"," ").strip('[]').strip("''")
harrislist = [('Kamala Harris', harrisfull)]
harrisdebate = pd.DataFrame(harrislist, columns=['Speakers', 'transcript'])
harrisdebate = harrisdebate.set_index('Speakers')
harrisdebate['Aff'] = 'D'
maincsv = pd.concat([maincsv,harrisdebate])
maincsv

Speakers,Aff,transcript
alexandria ocasio-cortex.txt,D,Good evening and thank you to everyone here to...
andrew cuomo.txt,D,"We climbed the impossible mountain, and right ..."
andrew yang.txt,D,"Hello, America. I'm Andrew Yang. You might kno..."
barack obama.txt,D,Good evening everybody. As you've seen by now ...
bernie sanders.txt,D,Good evening. Our great nation is now living i...
bill clinton.txt,D,"We have a leader to help us solve problems, cr..."
chuck schumer.txt,D,"Brooklyn, New York. Behind me is a sight eye s..."
colin powell.txt,D,"Hi, I'm former secretary of state Colin Powell..."
cory booker.txt,D,Union job lifted my family out of poverty and ...
dr jill biden.txt,D,Quiet that sparks with possibility just before...


In [18]:
#Putting political affiliations on the debate speakers
#This is in the local csv file of just the debate


penceaff = pd.DataFrame({'Speakers': pagedebate[pagedebate.Speakers == 'Mike Pence'].Speakers, 'Aff':'R'})
harrisaff = pd.DataFrame({'Speakers': pagedebate[pagedebate.Speakers == 'Kamala Harris'].Speakers, 'Aff':'D'})
#pageaff = pd.DataFrame({'Speakers' : pagedebate[pagedebate.Speakers == 'Susan Page'].Speakers, 'Aff':'None'})

#dropping duplicates
penceaff = penceaff.drop_duplicates(keep='first')
harrisaff = harrisaff.drop_duplicates(keep='first')
#pageaff = pageaff.drop_duplicates(keep='first')

#all of the affiliations (keeping susan neutral)
allaff = pd.merge(penceaff, harrisaff, how='outer')
#allaff1 =allaff.merge(pageaff, how='outer')
allaff

Unnamed: 0,Speakers,Aff
0,Mike Pence,R
1,Kamala Harris,D


In [20]:
vpdebatepage = pd.merge(pagedebate, allaff, how='left')
vpdebatepage = vpdebatepage.set_index('Speakers')
vpdebatepage
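
The same speaker-to-party lookup can also be expressed with a plain dictionary and `Series.map`, which leaves anyone not in the dictionary (here, the moderator) as `NaN`. This is a sketch on toy rows, not the notebook's data; the `party` dictionary is an assumption introduced for illustration:

```python
import pandas as pd

party = {'Mike Pence': 'R', 'Kamala Harris': 'D'}  # illustrative lookup table

toy = pd.DataFrame({
    'transcript': ['Good evening.', 'Thank you, Susan.', 'Susan, thank you.'],
    'Speakers': ['Susan Page', 'Kamala Harris', 'Mike Pence'],
})
toy['Aff'] = toy['Speakers'].map(party)        # unmapped speakers become NaN
candidates_only = toy.dropna(subset=['Aff'])   # keep only rows with a party label
print(candidates_only)
```

This avoids building and merging one small frame per speaker, at the cost of maintaining the dictionary by hand.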

In [19]:
#vpdebatepage.to_csv(r'data/pagedebate2020.csv') #original csv file of every speaker (page has AFF of 'none')

In [24]:
vpdebatepage2 = vpdebatepage.dropna() #drop where one value is missing AKA susan page cause we really only need D and R
vpdebatepage2

Speakers,transcript,Aff
Kamala Harris,"Thank you, Susan. Well, the American people ha...",D
Kamala Harris,"Can you imagine if you knew on January 28th, a...",D
Kamala Harris,… right to reelection based on this.,D
Mike Pence,"Susan, thank you. And I want to thank the Comm...",R
Mike Pence,And I believe it saved hundreds of thousands o...,R
...,...,...
Mike Pence,I look at the relationship between Justice Rut...,R
Kamala Harris,"First of all, I love hearing from our young le...",D
Kamala Harris,"And brings me to Joe, Joe Biden. One of the re...",D
Kamala Harris,Joe has a longstanding reputation of working a...,D


In [25]:
#vpdebatepage2.to_csv(r'data/pageVPdebate.csv')

### Welker Presidential Debate
Scraped from https://www.rev.com/blog/transcripts/donald-trump-joe-biden-final-presidential-debate-transcript-2020. This debate occurred on October 22, 2020. I did the same with Kristen Welker as I did with Susan Page.
#### Goals:
* Same as the vp debate

In [26]:
welkerdebate = pd.read_csv("data/presdebate2020.csv")
welkerdebate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 512 entries, 0 to 511
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   speaker     512 non-null    object
 1   time_stamp  512 non-null    object
 2   transcript  512 non-null    object
dtypes: object(3)
memory usage: 12.1+ KB


There are 512 spoken lines in this debate.

In [27]:
text = [x.strip('\n</p>') for x in welkerdebate.transcript.values]
speaker = [x.strip('>') for x in welkerdebate.speaker.values]
welkerdebate = pd.DataFrame({'transcript': text, 'Speakers' :speaker})
welkerdebate

Unnamed: 0,transcript,Speakers
0,"Good evening, everyone. Good evening. Thank yo...",Kristen Welker
1,How are you doing? How are you?,Donald Trump
2,And I do want to say a very good evening to bo...,Kristen Welker
3,The goal is for you to hear each other and for...,Kristen Welker
4,… during this next stage of the coronavirus cr...,Kristen Welker
...,...,...
507,"All right. Vice President Biden, same question...",Kristen Welker
508,"I will say, I’m an American President. I repre...",Joe Biden
509,"We can grow this economy, we can deal with the...",Joe Biden
510,"All right, I want to thank you both for a very...",Kristen Welker


In [28]:
biden = welkerdebate[welkerdebate.Speakers == 'Joe Biden'].transcript
trump = welkerdebate[welkerdebate.Speakers == 'Donald Trump'].transcript
biden.head() #head of biden's lines

9     220,000 Americans dead. You hear nothing else ...
10    The expectation is we’ll have another 200,000 ...
11    What I would do is make sure we have everyone ...
12    We’re in a situation now where the New England...
20    Make sure it’s totally transparent. Have the s...
Name: transcript, dtype: object

In [29]:
#adding to main csv

bidenfull = str(list(biden.values)).replace("',", '.').replace(" '"," ").strip('[]').strip("''")
bidenlist = [('Joe Biden', bidenfull)]
bidendebate = pd.DataFrame(bidenlist, columns=['Speakers', 'transcript'])
bidendebate = bidendebate.set_index('Speakers')
bidendebate['Aff'] = 'D'

trumpfull = str(list(trump.values)).replace("',", '.').replace(" '"," ").strip('[]').strip("''")
trumplist = [('Donald Trump', trumpfull)]
trumpdebate = pd.DataFrame(trumplist, columns=['Speakers', 'transcript'])
trumpdebate = trumpdebate.set_index('Speakers')
trumpdebate['Aff'] = 'R'

maincsv = pd.concat([maincsv,bidendebate,trumpdebate])
maincsv

Speakers,Aff,transcript
alexandria ocasio-cortex.txt,D,Good evening and thank you to everyone here to...
andrew cuomo.txt,D,"We climbed the impossible mountain, and right ..."
andrew yang.txt,D,"Hello, America. I'm Andrew Yang. You might kno..."
barack obama.txt,D,Good evening everybody. As you've seen by now ...
bernie sanders.txt,D,Good evening. Our great nation is now living i...
bill clinton.txt,D,"We have a leader to help us solve problems, cr..."
chuck schumer.txt,D,"Brooklyn, New York. Behind me is a sight eye s..."
colin powell.txt,D,"Hi, I'm former secretary of state Colin Powell..."
cory booker.txt,D,Union job lifted my family out of poverty and ...
dr jill biden.txt,D,Quiet that sparks with possibility just before...


In [30]:
#Putting political affiliations on the debate speakers


trumpaff = pd.DataFrame({'Speakers': welkerdebate[welkerdebate.Speakers == 'Donald Trump'].Speakers, 'Aff':'R'})
bidenaff = pd.DataFrame({'Speakers': welkerdebate[welkerdebate.Speakers == 'Joe Biden'].Speakers, 'Aff':'D'})
#welkeraff = pd.DataFrame({'Speakers' : welkerdebate[welkerdebate.Speakers == 'Kristen Welker'].Speakers, 'Aff':'None'})

#dropping duplicates
trumpaff = trumpaff.drop_duplicates(keep='first')
bidenaff = bidenaff.drop_duplicates(keep='first')
#welkeraff = welkeraff.drop_duplicates(keep='first')

#all of the affiliations (keeping Welker neutral)
allaff = pd.merge(trumpaff, bidenaff, how='outer')
#allaff1 = allaff.merge(welkeraff, how='outer')
allaff

Unnamed: 0,Speakers,Aff
0,Donald Trump,R
1,Joe Biden,D


In [31]:
debatewelker = pd.merge(welkerdebate, allaff, how='left')
debatewelker = debatewelker.set_index('Speakers')

In [27]:
#debatewelker.to_csv(r'data/presdebatewelker.csv') #original csv of every speaker (welker AFF of 'none')

In [33]:
debatewelker2 = debatewelker.dropna()
debatewelker2

Speakers,transcript,Aff
Donald Trump,How are you doing? How are you?,R
Donald Trump,"So as you know, 2.2 million people modeled out...",R
Donald Trump,There was a very big spike in Texas. It’s now ...,R
Donald Trump,"I can tell you from personal experience, I was...",R
Joe Biden,"220,000 Americans dead. You hear nothing else ...",D
...,...,...
Donald Trump,"Before the plague came in, just before, I was ...",R
Donald Trump,Success is going to bring us together. We are ...,R
Joe Biden,"I will say, I’m an American President. I repre...",D
Joe Biden,"We can grow this economy, we can deal with the...",D


In [34]:
#debatewelker2.to_csv(r'data/welkerPRESdebate.csv')

### Data info

In [28]:
%pprint
maincsv.info()
debatewelker.info()
vpdebatepage.info()

Pretty printing has been turned OFF
<class 'pandas.core.frame.DataFrame'>
Index: 55 entries, alexandria ocasio-cortex.txt to Donald Trump
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Aff         55 non-null     object
 1   transcript  55 non-null     object
dtypes: object(2)
memory usage: 1.3+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 512 entries, Kristen Welker to Joe Biden
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   transcript  512 non-null    object
 1   Aff         512 non-null    object
dtypes: object(2)
memory usage: 12.0+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 327 entries, Susan Page to Susan Page
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   transcript  327 non-null    object
 1   Aff         327 non-null    object
dtypes: object(2)
memory usage: 7.7+

In [29]:
#how many words in total in the maincsv
maintok = nltk.RegexpTokenizer(r"\w+")
mainwords = maintok.tokenize(str(list(maincsv.transcript.values)).lower())
len(mainwords) 

94099

In [30]:
#words in the welker debate
welktok = nltk.RegexpTokenizer(r"\w+")
welkwords = welktok.tokenize(str(list(debatewelker.transcript.values)).lower())
len(welkwords) 

19419

In [31]:
#words in the Page (VP) debate
pagetok = nltk.RegexpTokenizer(r"\w+")
pagewords = pagetok.tokenize(str(list(vpdebatepage.transcript.values)).lower())
len(pagewords)

15355
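
Tokenizing `str(list(...))` works on the printed form of the list, where a real newline becomes the two literal characters `\n`; the `n` can then surface as a stray token or get glued onto the next word. A sketch of a small helper that joins the transcripts first; it uses `re.findall` with the same `\w+` pattern as a stand-in for `nltk.RegexpTokenizer`, and `count_words` is a hypothetical name:

```python
import re

def count_words(texts):
    """Count runs of word characters across an iterable of strings, lowercased."""
    joined = ' '.join(texts).lower()
    return len(re.findall(r'\w+', joined))

texts = ['Preamble\n\nWith this platform', 'We believe.']  # made-up snippets

direct = count_words(texts)
# Same pattern applied to the printed form of the list, as in the cells above.
via_repr = len(re.findall(r'\w+', str(list(texts)).lower()))
print(direct, via_repr)
```

The totals reported above may therefore be slightly inflated wherever a transcript contains line breaks.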

## Last data collection: Party Platforms
* I got the Republican Party Platform from https://www.presidency.ucsb.edu/documents/2016-republican-party-platform
    * Interestingly, they didn't write a platform for 2020, so the latest was from 2016

Republican Party Platforms, 2016 Republican Party Platform Online by Gerhard Peters and John T. Woolley, The American Presidency Project.

* And similarly, the Democratic Party Platform from https://www.presidency.ucsb.edu/documents/2020-democratic-party-platform

Democratic Party Platforms, 2020 Democratic Party Platform Online by Gerhard Peters and John T. Woolley, The American Presidency Project.
     

In [66]:
# importing the platform to use as training data

# context managers close the file handles once the text has been read
with open('data/platforms/rep_platform_clean.txt','r') as rep_platform:
    rep_plat = rep_platform.read()
with open('data/platforms/dem_platform_clean.txt','r') as dem_platform:
    dem_plat = dem_platform.read()
#rep_platform = PlaintextCorpusReader(corpus_root, 'rep_platform_clean.*txt')
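
The Democratic platform text appears to start with an invisible byte-order mark (it shows up just before `DEMOCRATIC` in the table below). Assuming the files are UTF-8, one way to drop it is to read with the `utf-8-sig` codec; the sketch writes a throwaway file so it can run on its own, and `read_text` is a hypothetical helper name:

```python
import os
import tempfile

def read_text(path):
    """Read a UTF-8 text file, discarding a leading byte-order mark if present."""
    with open(path, 'r', encoding='utf-8-sig') as f:
        return f.read()

# Write a small file that starts with a BOM, standing in for a platform file.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write('\ufeffDEMOCRATIC NATIONAL CONVENTION'.encode('utf-8'))
    tmp_path = tmp.name

text = read_text(tmp_path)
os.remove(tmp_path)  # clean up the throwaway file
print(text[:10])
```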

In [60]:
rep_plat_df = pd.DataFrame({'Speakers': 'Republican Party Platform',
                         'Aff':'R',
                         'transcript': rep_plat}, index=[0])
dem_plat_df = pd.DataFrame({'Speakers':'Democratic Party Platform',
                         'Aff':'D',
                         'transcript': dem_plat}, index=[1])
platforms = pd.concat([rep_plat_df,dem_plat_df])

In [61]:
platform = platforms.set_index('Speakers')
platform

Speakers,Aff,transcript
Republican Party Platform,R,We dedicate this platform with admiration and ...
Democratic Party Platform,D,﻿DEMOCRATIC NATIONAL CONVENTION LAND ACKNOWLED...


In [50]:
pd.concat([maincsv,platform]) #just to keep everything together, this is not the main csv for analysis

Speakers,Aff,transcript
alexandria ocasio-cortex.txt,D,Good evening and thank you to everyone here to...
andrew cuomo.txt,D,"We climbed the impossible mountain, and right ..."
andrew yang.txt,D,"Hello, America. I'm Andrew Yang. You might kno..."
barack obama.txt,D,Good evening everybody. As you've seen by now ...
bernie sanders.txt,D,Good evening. Our great nation is now living i...
bill clinton.txt,D,"We have a leader to help us solve problems, cr..."
chuck schumer.txt,D,"Brooklyn, New York. Behind me is a sight eye s..."
colin powell.txt,D,"Hi, I'm former secretary of state Colin Powell..."
cory booker.txt,D,Union job lifted my family out of poverty and ...
dr jill biden.txt,D,Quiet that sparks with possibility just before...


In [74]:
# making a dataframe of sentences from each platform

rep_plat2 = nltk.sent_tokenize(rep_plat)
rep_plat_df2 = pd.DataFrame({'Speakers': 'Republican Party Platform',
                         'Aff':'R',
                         'sentences': rep_plat2})
dem_plat2 = nltk.sent_tokenize(dem_plat)
dem_plat_df2 = pd.DataFrame({'Speakers': 'Democratic Party Platform',
                         'Aff':'D',
                         'sentences': dem_plat2})
plat_sent = pd.concat([rep_plat_df2,dem_plat_df2])
plat_sent

Unnamed: 0,Speakers,Aff,sentences
0,Republican Party Platform,R,We dedicate this platform with admiration and ...
1,Republican Party Platform,R,"Preamble\nWith this platform, we the Republica..."
2,Republican Party Platform,R,We believe in American exceptionalism.
3,Republican Party Platform,R,We believe the United States of America is unl...
4,Republican Party Platform,R,We believe America is exceptional because of o...
...,...,...,...
1522,Democratic Party Platform,D,Democrats will continue to stand against incit...
1523,Democratic Party Platform,D,We oppose settlement expansion.
1524,Democratic Party Platform,D,We believe that while Jerusalem is a matter fo...
1525,Democratic Party Platform,D,Democrats will restore U.S.-Palestinian diplom...


In [75]:
#plat_sent.to_csv(r'data/platformsents.csv')