# July 2019 Democratic Debate: NLP Part 1

Natural Language Processing (NLP) is a skill I've wanted to learn for a while. It has many applications in healthcare, from summarizing medical notes and extracting their themes to translating medical terminology into lay language for patients. Given how widely it is used in healthcare, I wanted hands-on experience with these tools.

As someone who loves politics, I thought a good starter project would be to analyze the Democratic debates that occurred on July 30-31, 2019, and look for new insights.

## a. Creating My Text Corpus

The first thing I needed was a transcript of each of the two debate nights. Unfortunately, I could not find a single source for both, so I took the transcripts from NBC News and The Washington Post. The first task was to make sure they were in a consistent format and included only text from the actual debate.

I saved each transcript as a .txt file and skimmed the text to get a sense of how the two were formatted. Next, I deleted any text that did not belong to the actual debate, such as stray text from the website or commentary from network pundits after the debate. I kept the two .txt files separate because it didn't make sense to lump the text together, given the different stage dynamics and questions on each night.

The next step was to get them into the same format. My ideal format was: "[Name]: [Text]". I copied the text into Microsoft Word and quickly realized how messy the formatting was. I used Word's built-in find-and-replace function with wildcards to delete extra spaces, time marks, and links; to make sure each name was in "First Last" format; and to put each speaker's text on a single line (i.e., not split by paragraph breaks). I won't go into all the specifics, but Word's find-and-replace is great for large-scale text manipulation and saved me a lot of time.
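
The same kind of cleanup could also be scripted. Below is a minimal sketch with Python's `re` module. It assumes time marks look like `(00:12:34)` and that a speaker label is one to four capitalized words followed by a colon. Both patterns, the `normalize_transcript` helper, and the sample text are hypothetical, since the actual Word wildcard patterns are not listed here.

```python
import re

def normalize_transcript(raw: str) -> list[str]:
    """Return one '[Name]: [Text]' line per speaker turn (illustrative only)."""
    # Assumed time-mark format, e.g. "(00:12:34)"; adjust to the real transcript.
    text = re.sub(r"\(\d{2}:\d{2}:\d{2}\)", "", raw)
    # Drop bare links.
    text = re.sub(r"https?://\S+", "", text)
    turns = []
    for line in text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()  # collapse runs of whitespace
        if not line:
            continue
        # Assumed speaker label: 1-4 capitalized words followed by a colon.
        if re.match(r"^[A-Z][\w.'’-]*(?: [A-Z][\w.'’-]*){0,3}:", line):
            turns.append(line)
        elif turns:
            # Continuation paragraph: fold it into the previous speaker's turn.
            turns[-1] += " " + line
    return turns

sample = """Jake Tapper: (00:01:05)  Welcome   back.
You will each receive one minute.

Dana Bash: Time now for opening statements. https://example.com/video
"""
for turn in normalize_transcript(sample):
    print(turn)
```

A pattern like this will misread any spoken line that happens to start with a few capitalized words and a colon, so the output would still need a manual skim.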

Now that my two .txt files are in the format I want, it's time to create a corpus dataset that separates the names and the speech into two columns.

In [1]:
import os


'C:\\Users\\Steven S-C\\Desktop\\Data Science Projects\\NLPDemD'

In [2]:
import numpy as np
import pandas as pd
import nltk
import re
import pprint
import string
from sklearn.feature_extraction.text import CountVectorizer
import pickle

In [5]:
# Read the night 1 transcript; each line has the form "[Name]: [Text]"
with open('.\\raw text\\Democratic Debate July 30 Night 1.txt', encoding="utf8") as f:
    DemDebate1raw = f.read().splitlines()
str_DemDebate1raw = {'Name':DemDebate1raw}
DemDebate1raw_df = pd.DataFrame(str_DemDebate1raw, columns = ['Name'])
# Split on the first colon only, so colons inside the speech are left alone
df1 = DemDebate1raw_df['Name'].str.split(":", n=1, expand=True)
DemDebate1raw_df['Name'] = df1[0]
DemDebate1raw_df['Text'] = df1[1]

In [22]:
# Same steps for the night 2 transcript
with open('.\\raw text\\Democratic Debate July 31 Night 2.txt', encoding="utf8") as f:
    DemDebate2raw = f.read().splitlines()
str_DemDebate2raw = {'Name':DemDebate2raw}
DemDebate2raw_df = pd.DataFrame(str_DemDebate2raw, columns = ['Name'])
# Split on the first colon only, so colons inside the speech are left alone
df2 = DemDebate2raw_df['Name'].str.split(":", n=1, expand=True)
DemDebate2raw_df['Name'] = df2[0]
DemDebate2raw_df['Text'] = df2[1]

In [18]:
DemDebate1raw_df.head()

| | Name | Text |
|---|---|---|
| 0 | Part 1 | Opening Statements, Taxes, Health Insurance, ... |
| 1 | Jake Tapper | Welcome back to the CNN Democratic Presidenti... |
| 2 | Jake Tapper | You will each receive one minute to answer qu... |
| 3 | Jake Tapper | A candidate infringing on another candidate’s... |
| 4 | Dana Bash | Time now for opening statements, you’ll each ... |
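
To see why `n=1` matters in the split above: without it, any colon inside the spoken text would produce extra columns. Here is a small sketch on invented lines (the sample sentences and the `demo_df` name are made up for illustration):

```python
import pandas as pd

lines = pd.Series([
    "Jake Tapper: Here are the rules: one minute per answer.",
    "Dana Bash: Time now for opening statements.",
])

# n=1 splits only on the first colon, so colons inside the speech survive.
parts = lines.str.split(":", n=1, expand=True)
demo_df = pd.DataFrame({
    "Name": parts[0].str.strip(),
    "Text": parts[1].str.strip(),  # strip the space left after the colon
})
print(demo_df)
```

The `.str.strip()` calls are an optional extra; the cells above keep whatever whitespace follows the colon.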


## b. Cleaning up the Corpus
For easier manipulation and analysis, I'm now going to join all the strings that came from the same person into one cell.

In [19]:
DemDebate1_df = DemDebate1raw_df.groupby(['Name'])['Text'].apply(', '.join).reset_index()

In [23]:
DemDebate2_df = DemDebate2raw_df.groupby(['Name'])['Text'].apply(', '.join).reset_index()

In [None]:
## Deleting the non-named rows
DemDebate1_df = DemDebate1_df.drop([6,10,12,13,14,15]).reset_index()
DemDebate1_df.set_index('Name', inplace=True)
DemDebate1_df = DemDebate1_df.drop('index', axis=1)
#DemDebate1_df

In [24]:
## Deleting the non-named rows
DemDebate2_df = DemDebate2_df.drop([0,13]).reset_index()
DemDebate2_df.set_index('Name', inplace=True)
DemDebate2_df = DemDebate2_df.drop('index', axis=1)
DemDebate2_df

| Name | Text |
|---|---|
| ANDREW YANG | If you've heard anything about me and my camp... |
| BILL DE BLASIO | To the working people of America, tonight I b... |
| CORY BOOKER | Thank you, Dana. Last week the president of t... |
| DANA BASH | Let's start with opening statements. You will... |
| DON LEMON | Stand by, Senator. , Please stand by. , Ple... |
| JAKE TAPPER | Welcome back to the CNN Democratic presidenti... |
| JAY INSLEE | Good evening. I'm Jay Inslee. I am running fo... |
| JOE BIDEN | Tonight, I think Democrats are expecting some... |
| JULIAN CASTRO | Thank you, Dana, and good evening. You know, ... |
| KAMALA HARRIS | This is an inflection moment in the history o... |
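
Dropping rows by position, as in `drop([6,10,12,13,14,15])`, depends on the exact sort order of the grouped names, so it has to be rechecked whenever the transcript changes. One possible alternative, sketched here on invented rows, is to keep only the names in an explicit list before grouping. The `speakers` list and the sample data are assumptions, not the real transcript.

```python
import pandas as pd

raw_df = pd.DataFrame({
    "Name": ["Part 1", "Jake Tapper", "Bernie Sanders", "Jake Tapper", "(APPLAUSE)"],
    "Text": ["Opening Statements", "Welcome back.", "Thank you.", "Senator Warren?", None],
})

# Hypothetical list of people whose speech should be kept.
speakers = ["Jake Tapper", "Bernie Sanders"]

kept = raw_df[raw_df["Name"].isin(speakers)]
# Join every turn by the same speaker into one string, as in the cells above.
grouped = kept.groupby("Name")["Text"].apply(", ".join).to_frame()
print(grouped)
```

Filtering first also removes rows whose `Text` is missing (lines with no colon), which would otherwise make `', '.join` raise a `TypeError`.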


## c. More Cleaning

The next step is to standardize and further clean the text. This involves removing unneeded characters and numbers, and then exploring the text as we continue to tag and break down text.

In [26]:
def clean_text1(text):
    '''Make lowercase, remove punctuation, and remove any token that contains a digit.'''
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Raw string so that \w and \d are not treated as invalid string escapes
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = clean_text1

In [None]:
DemDebate1_clean = pd.DataFrame(DemDebate1_df.Text.apply(round1))
DemDebate2_clean = pd.DataFrame(DemDebate2_df.Text.apply(round1))
#DemDebate2_clean.head()
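
Here is a quick check of what `clean_text1` does to a sentence. One thing worth noticing is that `string.punctuation` only covers ASCII punctuation, so the typographic apostrophe (`’`) that appears in the night 1 transcript is not removed. The sample sentence is invented, and the function is repeated so the snippet runs on its own.

```python
import re
import string

def clean_text1(text):
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

sample = "In 2019, you'll pay $15 an hour; you’ll see a 10% raise!"
cleaned = clean_text1(sample)
# The straight apostrophe is stripped ("youll"); the curly one is kept ("you’ll").
print(cleaned.split())
```

If the curly apostrophe should be treated the same way, one option would be to replace `’` with `'` before cleaning. That is a suggestion rather than something the notebook does.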

## d. Tokenization and Document-Term Matrix

In this step, I split the text into individual words (tokenization) and remove "stop words" like "a" and "the". Each remaining word becomes its own column, and the dataframe is restructured so that each row is a speaker and each cell is the number of times that speaker used the word.

In [30]:
cv = CountVectorizer(stop_words='english')
DD1_cv = cv.fit_transform(DemDebate1_clean.Text)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
DD1_dtm = pd.DataFrame(DD1_cv.toarray(), columns=cv.get_feature_names_out())
DD1_dtm.index = DemDebate1_clean.index
DD1_dtm.head()

| Name | abide | ability | able | abolition | abroad | absolutely | abusers | abusing | aca | accelerated | ... | yes | yesterday | york | young | younger | youngest | youngstown | zealand | zero | zone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Amy Klobuchar | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Bernie Sanders | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| Beto O’Rourke | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Dana Bash | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Don Lemon | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Elizabeth Warren | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Jake Tapper | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| John Delaney | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 3 | 0 |
| John Hickenlooper | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Marianne Williamson | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |


In [31]:
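# Refitting the same vectorizer replaces the night 1 vocabulary with night 2's
DD2_cv = cv.fit_transform(DemDebate2_clean.Text)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
DD2_dtm = pd.DataFrame(DD2_cv.toarray(), columns=cv.get_feature_names_out())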
DD2_dtm.index = DemDebate2_clean.index
DD2_dtm.head()

| Name | aberration | abhorrent | ability | able | abroad | absolutely | abused | abusing | abusive | accept | ... | yesterday | york | youll | young | youngstown | youre | youve | zero | zip | zone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ANDREW YANG | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 3 | 1 | 0 |
| BILL DE BLASIO | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | ... | 0 | 3 | 0 | 0 | 0 | 4 | 0 | 1 | 0 | 0 |
| CORY BOOKER | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| DANA BASH | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| DON LEMON | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


## e. Next Steps

In this part, I compiled my text from its sources, cleaned up what I could in Microsoft Word before bringing it into Python, restructured the information into dataframes, did another round of cleaning, and created a Document-Term Matrix.

Now that we've done an initial clean and have two corpora that are mostly ready to use, I'm going to "pickle" both the clean dataframes and the DTMs so that I can use them in my next notebook.

In [34]:
DemDebate1_clean.to_pickle("DemDebate1_clean.pkl")
DemDebate2_clean.to_pickle("DemDebate2_clean.pkl")
DD1_dtm.to_pickle("DD1_dtm.pkl")
DD2_dtm.to_pickle("DD2_dtm.pkl")
# cv was last fitted on night 2, so this saves the night 2 vocabulary
with open("cv.pkl", "wb") as f:
    pickle.dump(cv, f)
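
Loading these files back in the next notebook is the mirror image of saving them. Here is a minimal round-trip sketch using a temporary directory and a toy dataframe. The file names and the `toy_obj` dictionary are placeholders standing in for the `.pkl` files and the fitted vectorizer above.

```python
import os
import pickle
import tempfile

import pandas as pd

toy_dtm = pd.DataFrame({"health": [2, 0], "jobs": [1, 3]}, index=["Speaker A", "Speaker B"])
toy_obj = {"stop_words": "english"}  # stands in for the fitted vectorizer

with tempfile.TemporaryDirectory() as tmp:
    dtm_path = os.path.join(tmp, "toy_dtm.pkl")
    obj_path = os.path.join(tmp, "toy_obj.pkl")

    toy_dtm.to_pickle(dtm_path)
    with open(obj_path, "wb") as f:  # the context manager closes the file
        pickle.dump(toy_obj, f)

    loaded_dtm = pd.read_pickle(dtm_path)
    with open(obj_path, "rb") as f:
        loaded_obj = pickle.load(f)

print(loaded_dtm.equals(toy_dtm))
```

Pickle files can execute code when loaded, so only unpickle files you created yourself or otherwise trust.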

## f. Acknowledgements
Thanks to NBC News and The Washington Post for making transcripts of the debates publicly available. Special thanks to [Alice Zhao](https://www.youtube.com/channel/UCyv-PL-QgkAXEfDRcKrYMeA) for creating very thorough and clear video tutorials on NLP.