<h1>Shibboleth:
NLP and the presidential debates</h1>
<em>
Jo Anna Capp
Kevin O’Gallagher
William Sankey
</em>

<h2>Introduction</h2>
<p>
Our Fall 2016 Incubator project asked whether we could identify party affiliation from presidential candidates’ own words using current Natural Language Processing (NLP) methods. NLP is the computational analysis of the language humans use to communicate. Using Python and the Natural Language Toolkit (NLTK), we were able to classify political speech by party with roughly 80% accuracy.
</p>
<p>
The origin of the word ‘shibboleth’ makes the case that language can be used to distinguish groups of individuals from one another. To identify party affiliation through speech, we needed language used by Republicans and Democrats; we set aside political independents for this analysis. The most readily available data with identified Republicans and Democrats came from presidential debates published by The American Presidency Project (http://www.presidency.ucsb.edu/index.php). To account for shifting party lines and tone over time, we focused on debates since the year 2000.
</p>
<p>
The following provides snippets of the most critical parts of our code; for the full code, please see our repository here: LINK</p>

<h2>Data Munging</h2>

Although the data is publicly available, a significant amount of processing is needed to make it usable for our NLP purposes. These tasks fall broadly into two groups: getting the data, and turning it into features that can be passed to a learner.

In [4]:
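# Imports needed by the parser (Python 2, as in the rest of this code)
import os
import re
import urllib

import pandas
from bs4 import BeautifulSoup

#Create a function that will parse the data from the website
def parse_website_b(url):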
    """
    Grabs data from website (url) and parses it.
    Data format: in class_='displaytext', each debate speaker is marked by text inside a <b> tag.
    
    Args:
        url: the webpage to parse
        
    Returns:
        None. Saves a json file to the local data directory with debate info
        parsed by paragraph: title, date, speaker, text
        
    """
    fetched = urllib.urlopen(url).read()
    soup = BeautifulSoup(fetched, "lxml")

    #Parsing
    titles = unicode(soup.title.string)
    dates = unicode(soup.find("span", class_="docdate").string)
    body = soup.find("span", class_="displaytext")
    paragraphs = soup.find("span", class_="displaytext").findChildren("p")

    #Creating a dataframe
    text_list = []
    speaker_list = []
    child_list = []

    #pull text and speaker from html
    for paragraph in paragraphs:
        text = unicode(paragraph.find(text=True, recursive=False))
        text_list.append(text)
        children = paragraph.findChildren('b')
        for child in children:
            child_list.append(child)
        if child_list == []:
            # No <b> tag in this paragraph: leave a None placeholder so the
            # fill step below assigns the most recent speaker.
            speaker_list.append(None)
        else:
            speakers = unicode(paragraph.b.get_text())
            speaker_list.append(speakers)
        child_list[:] = []

    #replace 'None' in speaker list
    start = next(element for element in speaker_list if element is not None)
    for i, element in enumerate(speaker_list):
        if element is None:
            speaker_list[i] = start
        else:
            start = element

    # Pandas dataframe
    columns = {'text': text_list, 'speaker': speaker_list, 'title': titles, 'date': dates}
    debates = pandas.DataFrame(columns)

    # Exporting to JSON
    directory_name = 'your/local/data/path'
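    # Join the digits found in the URL; str() on the list would put brackets and quotes in the filename
    base_filename = "_".join(re.findall(r'\d+', url))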
    suffix = '.json'
    save_path = os.path.join(directory_name, base_filename + suffix)

    debates.to_json(save_path, orient='index')

The function above takes a URL, parses the debate page, and saves the data as a JSON file. Importantly, the saved data is separated by paragraph, so the classifier will ultimately take as input individual paragraphs labelled with the party affiliation of the speaker.
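The speaker-filling step inside `parse_website_b` is easy to miss, so here is a minimal standalone sketch of it. The helper name `fill_speakers` and the sample labels are hypothetical and used only for illustration; the logic mirrors the loop that replaces `None` entries with the most recent speaker.

```python
def fill_speakers(speakers):
    """Replace None entries with the most recent non-None speaker.

    Leading None entries (before anyone has been named) take the first
    named speaker, matching the loop in parse_website_b.
    """
    # Hypothetical helper for illustration; not part of the original code.
    named = [s for s in speakers if s is not None]
    if not named:
        return list(speakers)  # nothing to fill from
    current = named[0]
    filled = []
    for s in speakers:
        if s is not None:
            current = s
        filled.append(current)
    return filled


# Made-up labels: None marks a paragraph with no <b> speaker tag.
sample = [None, "MODERATOR:", None, "CANDIDATE A:", None, None]
filled_sample = fill_speakers(sample)
print(filled_sample)
```

Every paragraph ends up with a speaker label, which is what later allows each paragraph to be tagged with a party.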

<h3>Cleaning</h3>

Two core preprocessing steps in NLP are removing stop words (common words like 'the', 'and', 'it', etc.) and stemming words to find their root.

In [7]:
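# Imports needed for cleaning (the stop word list requires nltk.download('stopwords'))
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import RegexpTokenizer

#Function to tokenize and stem
def clean_text(text):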
    """
    Removes punctuation, converts all characters to lowercase, removes stop words, stems
    
    Args:
        text: a single string of text
        
    Returns:
        processed text string
        
    """
    tokenizer = RegexpTokenizer(r'\w+')  # keeps word characters, drops punctuation
    stops = set(stopwords.words('english'))
    stemmer = SnowballStemmer('english')

    # Lowercase first so capitalized stop words such as 'The' are filtered too
    tokens = tokenizer.tokenize(text.lower())
    filtered_words = [word for word in tokens if word not in stops]
    stems = [stemmer.stem(t) for t in filtered_words]
    return " ".join(stems)
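To see what these cleaning steps do to a paragraph without installing NLTK, here is a rough, dependency-free sketch. Everything in it is an assumption made for illustration: `TOY_STOPS` is a tiny hand-picked stop list standing in for NLTK's English list, and `toy_stem` is a crude suffix stripper, not the Snowball algorithm, so its stems will differ from what `clean_text` produces.

```python
import re

# Assumption: a tiny hand-picked stop list standing in for NLTK's English list.
TOY_STOPS = {"the", "and", "it", "a", "is", "are", "of", "to", "we"}


def toy_stem(word):
    """Crude suffix stripper; a stand-in for a real stemmer such as Snowball."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when a reasonably long root is left behind.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_clean_text(text):
    """Lowercase, tokenize on word characters, drop stop words, stem."""
    tokens = re.findall(r"\w+", text.lower())
    kept = [t for t in tokens if t not in TOY_STOPS]
    return " ".join(toy_stem(t) for t in kept)


cleaned = toy_clean_text("We are cutting taxes and creating jobs.")
print(cleaned)
```

The output keeps only the content-bearing words, reduced to rough roots. This is the kind of normalized text that the feature step can then count.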

<h3>Feature Engineering</h3>