## Statistics in Presidential Debates

In this notebook, I scrape the presidential debate transcripts from the Commission on Presidential Debates website: https://www.debates.org/voter-education/debate-transcripts/

I analyze counts of specific values in each debate and score each candidate's sentiment using TextBlob, an NLP toolkit with built-in sentiment analysis.

I use `requests` and `BeautifulSoup` to find all the links (URLs) on that page, then follow those links to get the text of each presidential debate.
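
To make the link-collection step concrete without touching the network, here is a minimal sketch that uses only the standard library (`html.parser` and `urllib.parse`) as a stand-in for `BeautifulSoup`. The sample markup, the `TranscriptLinkParser` class, and the `BASE_URL` constant are assumptions made up for illustration; the real page's markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.debates.org"  # site root used to resolve relative links

# Hypothetical sample markup, for illustration only (not the real page).
SAMPLE_HTML = """
<a href="/voter-education/debate-transcripts/example-debate-transcript/">Example Debate</a>
<a href="/voter-education/debate-transcripts/example-debate-transcript/">Example Debate (duplicate)</a>
<a href="/about-cpd/">About</a>
<a>Anchor without an href</a>
"""


class TranscriptLinkParser(HTMLParser):
    """Collects unique links whose href contains 'debate-transcript'."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.titles = []
        self._pending_url = None  # URL of the qualifying <a> tag currently open

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without an href and links that are not transcripts
        if href and "debate-transcript" in href:
            full_url = urljoin(BASE_URL, href)
            if full_url not in self.urls:
                self._pending_url = full_url

    def handle_data(self, data):
        # The first text inside a qualifying link becomes its title
        if self._pending_url is not None:
            self.urls.append(self._pending_url)
            self.titles.append(data.strip())
            self._pending_url = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._pending_url = None


parser = TranscriptLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.urls)
print(parser.titles)
```

The duplicate link, the unrelated link, and the anchor without an `href` are all skipped, so only one URL and one title are collected.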

This project was inspired by a project I did for my DataX class.



In [None]:
import requests
import nltk
from textblob import TextBlob
import numpy as np
import bs4 as bs
from collections import Counter
import re
import pandas as pd
source = requests.get("https://www.debates.org/voter-education/debate-transcripts/") 
soup = bs.BeautifulSoup(source.content, features='html.parser')

In [None]:
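from urllib.parse import urljoin

urllist = []
titlelist = []
for a in soup.find_all('a'):
    href = a.get('href')
    # skip anchors without an href and links that are not transcripts
    if not href or 'debate-transcript' not in href:
        continue
    # urljoin handles both relative and absolute hrefs
    stringurl = urljoin("https://www.debates.org", href)
    if stringurl not in urllist:
        urllist.append(stringurl)
        titlelist.append(a.text)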

titlelist

#deal with first val/last val
#will have to combine
#print(titlelist)

In [None]:
#removing first and last speeches manually (scraping isn't perfect)
urllist = urllist[1:-1]
titlelist = titlelist[1:-1]


In [None]:
# Checks whether two strings are similar, to handle typos and inconsistencies in how a speaker's name is labeled.
# This function is used below.
from difflib import SequenceMatcher
def isSimilar(x, y):
    simthreshold = 0.5
    # a name contained in the other (e.g. with or without a title) counts as a match
    if x in y or y in x:
        return True
    return SequenceMatcher(None, x, y).ratio() > simthreshold
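
As a quick illustration of how this kind of check behaves, here is a small self-contained sketch. The function is restated under the hypothetical name `is_similar` so the block runs on its own, and the speaker labels are made-up examples rather than values taken from the transcripts.

```python
from difflib import SequenceMatcher


def is_similar(x, y, threshold=0.5):
    """Return True if one string contains the other or their similarity ratio exceeds threshold."""
    if x in y or y in x:
        return True
    return SequenceMatcher(None, x, y).ratio() > threshold


# Hypothetical speaker labels, for illustration only
print(is_similar("OBAMA", "PRESIDENT OBAMA"))  # True: one label contains the other
print(is_similar("SCHIEFFER", "SCHIEFER"))     # True: a one-letter typo still matches
print(is_similar("OBAMA", "LEHRER"))           # False: no characters in common
```

A threshold of 0.5 is fairly permissive, so it is worth spot-checking that two different speakers with similar names are not merged into one column.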

#### Okay, so here's where the data scraping gets tricky...
In the following block of code, I go through the URL of every presidential debate, find each paragraph break, and use heuristics such as the formatting of the paragraph to figure out who is speaking. The code can seem intimidating, but that's because the site has quite a few edge cases (typos, odd formatting, and other inconsistencies), and I needed a processing scheme robust enough to capture those data points too.
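
Stripped of the scraping and DataFrame details, the speaker heuristic can be sketched roughly as follows. This is a simplified illustration on made-up paragraphs: the function name `group_by_speaker` and the sample text are assumptions, and the fuzzy name matching is left out.

```python
from collections import defaultdict


def group_by_speaker(paragraphs):
    """Assign each paragraph to a speaker using a 'NAME: text' heuristic."""
    speeches = defaultdict(list)
    current_speaker = None  # persists across paragraphs
    for text in paragraphs:
        name, sep, rest = text.partition(":")
        if sep and name.isupper():
            # An all-caps label before the first colon starts a new turn
            current_speaker = name.strip()
            speeches[current_speaker].append(rest.strip())
        elif current_speaker is not None:
            # No label: the paragraph continues the current speaker's turn
            speeches[current_speaker].append(text.strip())
        # Paragraphs before the first labeled speaker are ignored
    return dict(speeches)


# Hypothetical paragraphs, for illustration only
sample = [
    "Introductory text before anyone speaks.",
    "MODERATOR: Welcome to the debate.",
    "CANDIDATE A: Thank you.",
    "Let me add one thing: I am glad to be here.",
    "MODERATOR: Next question.",
]

result = group_by_speaker(sample)
for speaker, parts in result.items():
    print(speaker, len(parts))
```

Note that the fourth paragraph contains a colon but no all-caps label, so it is treated as a continuation of the previous speaker rather than a new turn.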

In [None]:
speechlist = []
for i in urllist:
    # For each URL in the list of presidential debates
    source1 = requests.get(i)
    soup1 = bs.BeautifulSoup(source1.content, features='html.parser')

    speechdf = pd.DataFrame()
    # The current speaker must persist across paragraphs so that continuation
    # paragraphs (those without a "NAME:" prefix) are credited to that speaker.
    currspeaker = ''

    for p in soup1.find_all('p'):
        splittext = p.text.split(':', 1)

        # A colon preceded by an all-caps label marks a new speaker turn
        if len(splittext) == 2 and splittext[0].isupper():
            name = splittext[0]
            speechtext = splittext[1]
            currspeaker = name
            # Reuse an existing column if the name matches a known speaker
            for col in speechdf.columns:
                if isSimilar(name, col):
                    currspeaker = col
            if currspeaker not in speechdf.columns:
                # Create a new column for a new speaker
                speechdf[currspeaker] = np.nan
            speechdf.loc[len(speechdf.index), currspeaker] = speechtext
        # Otherwise the paragraph continues the current speaker's turn
        # (this includes paragraphs that contain a colon but no all-caps label)
        elif currspeaker:
            speechdf.loc[len(speechdf.index), currspeaker] = p.text

    for col in speechdf.columns:
        # Drop speakers with fewer than 5 paragraphs; set this value higher if you only want the candidates and not the reporters
        if speechdf[col].notna().sum() < 5:
            speechdf = speechdf.drop(col, axis=1)
    speechlist.append(speechdf)

    
    

In [None]:
#speechlist contains a list of dataframes, with each DF representing a presidential debate
len(speechlist)

In [None]:
for speech in speechlist:
    sentdf = pd.DataFrame()
    for col in speech.columns:
        # Join this speaker's paragraphs; the space keeps words at paragraph boundaries from running together
        speechstring = ' '.join(t for t in speech[col] if isinstance(t, str))

        blob = TextBlob(speechstring)
        # polarity ranges from -1 (most negative) to +1 (most positive)
        polarity = blob.sentiment.polarity
        sentdf[col] = np.nan
        sentdf.loc[0, col] = polarity
    display(sentdf)
        
    

    
        

In [None]:
#example of how sentiment analysis works
string = 'hello happy person'
blob = TextBlob(string)
print(blob.sentiment.polarity)
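
For intuition about what a polarity score represents, here is a toy lexicon-based scorer. It is not TextBlob's algorithm, which is more involved; the word scores in `TOY_LEXICON` and the function `toy_polarity` are invented for this illustration.

```python
import re

# Hypothetical word scores, invented for this illustration
TOY_LEXICON = {"happy": 0.8, "great": 0.8, "good": 0.7, "bad": -0.7, "terrible": -1.0}


def toy_polarity(text, lexicon=TOY_LEXICON):
    """Average the scores of the words found in the lexicon; 0.0 if none are found."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


print(toy_polarity("hello happy person"))      # positive: only "happy" is scored
print(toy_polarity("a terrible, bad answer"))  # negative: two negative words
print(toy_polarity("no scored words here"))    # 0.0: nothing in the lexicon
```

Words missing from the lexicon contribute nothing, which is why a long speech full of neutral words can still end up with a polarity close to that of its few scored words.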


# End of Data Mining of Presidential Debates
If you look at the data frames above, you'll see a few other names, such as reporters, sprinkled in. As noted in a comment in the cleaning step, anyone who wants to remove the reporters and other questioners from a data frame can set the threshold for the number of paragraphs spoken to a higher number. I left them in for now in case someone wants to go through the sentiment of the questions as well. Thanks for reading through this notebook.