<h1>Shibboleth:
NLP and the presidential debates</h1>
<em>
Jo Anna Capp,
Kevin O’Gallagher,
William Sankey
</em>

<h2>Introduction</h2>
<p>
Our Fall 2016 Incubator project sought to identify political party affiliation through speech. To accomplish this task we used presidential candidate debates and the latest Natural Language Processing (NLP) methods. NLP is the analysis of language, from the plain-spoken English of children to the polished rhetoric of politicians. Using Python and the Natural Language Toolkit (NLTK) we were able to process political speech for classification purposes with roughly 80% accuracy. Essentially, given a paragraph of text we could estimate with 80% accuracy whether that paragraph leaned left or right on the American political spectrum.
</p>
<p>
The data most readily available for analysis came from presidential debates made publicly available through The American Presidency Project (http://www.presidency.ucsb.edu/index.php). This data is valuable in several ways:
<ol>
<li><strong>It is labelled:</strong> We know the candidate who made the response and their party affiliation. If we were to pull data from other venues, such as chatrooms or comment sections, we might infer party affiliation but we wouldn't <em>know</em> unless we asked that person, and even if we did ask we might not know how mainstream (in terms of the party represented) their opinion was.</li>
<li><strong>It is relatively clean:</strong> These are prepared debates. The website hosting this data has clearly marked the speaker and organized their responses in structured paragraphs using clear HTML conventions. This makes data retrieval easy.</li>
<li><strong>It can be verified:</strong> These debates were broadcast to millions of individuals. If there is some concern over the quality of our data at a future point in time, any critic can review our data against several independent sources to verify we have it right: the candidates actually said these words.</li>
</ol>
As a final caveat, to account for shifting party lines and tone across time we focused on debates since the year 2000.
</p>
<p>
The following provides snippets of the most critical parts of our code. For the full code, please see our repository: https://github.com/DistrictDataLabs/political_history/branches</p>

<h2>Data Munging</h2>

Although the data is publicly available, a significant amount of processing is needed to make it usable for our NLP purposes. These tasks fall broadly into two groups: getting the data, and turning it into features that can be passed to a learner.

In [4]:
#Create a function that will parse the data from the website
import os
import re
import urllib.request

import pandas
from bs4 import BeautifulSoup

def parse_website_b(url):
    """
    Grabs data from website (url) and parses it.
    Data format: in class_='displaytext', each debate speaker is marked by text in a <b> tag.
    
    Args:
        url: the webpage to parse
        
    Returns:
        None. Saves a JSON file to the data directory with the debate parsed by paragraph:
        title, date, speaker, text
        
    """
    fetched = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(fetched, "lxml")

    #Parsing
    titles = str(soup.title.string)
    dates = str(soup.find("span", class_="docdate").string)
    body = soup.find("span", class_="displaytext")
    paragraphs = body.findChildren("p")

    #Lists that will become dataframe columns
    text_list = []
    speaker_list = []
    child_list = []

    #pull text and speaker from html
    for paragraph in paragraphs:
        text = str(paragraph.find(text=True, recursive=False))
        text_list.append(text)
        children = paragraph.findChildren('b')
        for child in children:
            child_list.append(child)
        if child_list == []:
            #no speaker tag in this paragraph; a missing value is filled in below
            prevchild = body.find_previous_sibling('b')
            speaker_list.append(prevchild)
        else:
            speakers = str(paragraph.b.get_text())
            speaker_list.append(speakers)
        child_list[:] = []

    #replace 'None' in speaker list with the most recent speaker
    start = next(element for element in speaker_list if element is not None)
    for i, element in enumerate(speaker_list):
        if element is None:
            speaker_list[i] = start
        else:
            start = element

    # Pandas dataframe
    columns = {'text': text_list, 'speaker': speaker_list, 'title': titles, 'date': dates}
    debates = pandas.DataFrame(columns)

    # Exporting to JSON
    directory_name = 'your/local/data/path'
    #name the file after the digits in the url (joined, rather than the list's string form)
    base_filename = '_'.join(re.findall(r'\d+', url))
    suffix = '.json'
    save_path = os.path.join(directory_name, base_filename + suffix)

    debates.to_json(save_path, orient='index')

The above function takes a URL and saves the parsed page data to disk as a JSON file.

Importantly, the data returned here is separated out by paragraph -- this means the data input into the classifier will ultimately come from paragraphs labelled with the party affiliation of the speaker. It also means that our classifier will take paragraphs as input. At the end of this process we have a set of JSON files providing the party affiliation and text for each paragraph of every presidential debate since 2000.
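
As a rough sketch of how these files might be read back in, the snippet below loads every JSON file in a folder and attaches a party label to each paragraph. It assumes the layout written by `to_json(orient='index')` (one record per row index, with `text`, `speaker`, `title`, and `date` keys). The function name `load_debate_paragraphs` and the `SPEAKER_PARTY` lookup are hypothetical, and the speaker strings in the real data may be formatted differently.

```python
import json
import os

# Hypothetical lookup from speaker tag to party; the real speaker strings
# in the scraped data may be formatted differently.
SPEAKER_PARTY = {
    "GORE:": "D",
    "BUSH:": "R",
}

def load_debate_paragraphs(directory, speaker_party=SPEAKER_PARTY):
    """Read every .json file in `directory` and return labelled paragraphs.

    Each file is assumed to hold {row_index: {"text": ..., "speaker": ...,
    "title": ..., "date": ...}}, the layout written by
    DataFrame.to_json(orient='index').
    """
    rows = []
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".json"):
            continue
        with open(os.path.join(directory, filename)) as handle:
            records = json.load(handle)
        # Row indices are stored as strings, so sort them numerically.
        for index in sorted(records, key=int):
            record = records[index]
            party = speaker_party.get(record.get("speaker"))
            if party is None:
                # Skip moderators and anyone missing from the lookup.
                continue
            rows.append({"text": record["text"], "party": party})
    return rows
```

Paragraphs from speakers who are not in the lookup are dropped, so only text with a known party label would reach the classifier.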

With the data saved locally, we can move on to cleaning it, a critical step that ensures the classifier has the information it needs to learn from the data.

<h3>Cleaning</h3>

Two core preprocessing steps in NLP are removing stop words (common words like 'the', 'and', 'it', etc.) and stemming words to find their root. In stemming, words like 'thinking', 'acting', and 'approving' become 'think', 'act', and 'approv' respectively. Let's see how these steps work on an excerpt of a paragraph from a 2000 debate.

In [7]:
#Function to tokenize and stem
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

def clean_text(text):
    """
    Removes punctuation, removes stop words, stems (the stemmer also lowercases each word)
    
    Args:
        text: a single string of text 
        
    Returns:
        processed text string
        
    """
    tokenizer = RegexpTokenizer(r'\w+')
    stops = set(stopwords.words('english'))
    stemmer = SnowballStemmer('english')
    
    tokens = tokenizer.tokenize(text)
    #stop words are matched case-sensitively, so capitalized ones ('We', 'But') are kept
    filtered_words = [word for word in tokens if word not in stops]
    stems = [stemmer.stem(t) for t in filtered_words]
    return " ".join(stems)

In [11]:
gore_text = "We've got a bumper crop this year. But that's the good news. You know what the bad news is that follows on that. The prices are low. In the last several years, the so-called Freedom To Farm Law has, in my view, been mostly a failure."

In [12]:
print(clean_text(gore_text))

we got bumper crop year but good news you know bad news follow the price low in last sever year call freedom to farm law view most failur


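The process isn't always perfect. Capitalized stop words such as 'We' and 'But' slip through because the stop word list is lowercase and the filter runs before the stemmer lowercases each word. Importantly, though, it is isolating the roots of the words in the paragraph and standardizing them for the learner.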
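
To see why words like 'we' and 'but' survive in the output above, here is a small sketch comparing case-sensitive stop word filtering (what `clean_text` does) with lowercasing first. It uses a tiny hand-written stop list rather than NLTK's, and the helper name `filter_stops` is hypothetical; it only illustrates the ordering issue and is not a replacement for the function above.

```python
import re

# Tiny hand-written stop list for illustration only; NLTK's English list is much longer.
STOPS = {"we", "ve", "a", "this", "but", "that", "s", "the"}

def filter_stops(text, lowercase_first):
    """Tokenize on word characters and drop stop words.

    When lowercase_first is False the comparison is case-sensitive, as in
    clean_text; when True, tokens are lowercased before filtering.
    """
    tokens = re.findall(r"\w+", text)
    if lowercase_first:
        tokens = [token.lower() for token in tokens]
    return [token for token in tokens if token not in STOPS]

sample = "We've got a bumper crop this year. But that's the good news."
print(filter_stops(sample, lowercase_first=False))
print(filter_stops(sample, lowercase_first=True))
```

With the case-sensitive filter, 'We' and 'But' remain in the token list; lowercasing first removes them along with the other stop words.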

At the end of this cleaning process we have all the paragraphs from 2000 to the present looking very similar to the excerpt from Gore in 2000. This labelled text is what the classifier will learn from in our supervised learning exercise.

<h3>Testing out models</h3>

Machine learning starts with separating the data into training and testing samples (many also advocate for separating out a validation sample). A twenty percent test set is fairly standard.

In [13]:
#train/test split of data (randomized)
#text holds the cleaned paragraphs and labels the party of each speaker
from sklearn.model_selection import train_test_split
text_train, text_test, labels_train, labels_test = train_test_split(text, labels, test_size=0.2, random_state=42)

The learner cannot take raw language as input. We need to transform these paragraphs (1) into numpy arrays and (2) into lists of words with some importance attached to each word for the classifier to learn from. The TF-IDF vectorizer handles the weighting and returns a sparse matrix; calling `toarray()` on that matrix converts it into a dense numpy array.

Here's an aside on the TF-IDF process:

<em>tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.</em>
Source: Wikipedia
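
To make the definition concrete, here is a minimal pure-Python sketch of the textbook weighting, tf(t, d) × log(N / df(t)), on three made-up documents. It is written for Python 3, and the function name `tf_idf` is hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed idf, normalizes each row, and (with `sublinear_tf=True`) replaces tf with 1 + log(tf), so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Uses the textbook form tf(t, d) * log(N / df(t)), where tf is the
    term's share of the document's tokens and df is the number of
    documents that contain the term.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Toy documents, made up for illustration.
docs = ["taxes taxes jobs", "jobs healthcare", "jobs security"]
scores = tf_idf(docs)
print(scores[0])
```

Because 'jobs' appears in every document, its idf is log(1) = 0 and it carries no weight, while 'taxes', which appears in only one document, gets a positive weight.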

In [None]:
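#tfidf vectorizer and numpy array
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True)
#fit the vocabulary on the training text only, then reuse it for the test text
text_train_transformed = vectorizer.fit_transform(text_train).toarray()
text_test_transformed  = vectorizer.transform(text_test).toarray()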

There's a lot going on in the following lines of code but we are essentially finding the best features, fitting the model, and reviewing the results.

In [None]:
#build classifier pipeline
import sklearn.metrics
import sklearn.model_selection
import sklearn.pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import FeatureUnion

select = SelectPercentile(f_classif)
pca = PCA()
feature_selection = FeatureUnion([('select', select), ('pca', pca)],
                    transformer_weights={'pca': 10})
clfNB = GaussianNB()

steps1 = [('feature_selection', feature_selection),
        ('naive_bayes', clfNB)]

pipeline1 = sklearn.pipeline.Pipeline(steps1)

#search for best parameters
#SelectPercentile expects a percentage between 0 and 100 (5 keeps the top 5% of features)
parameters1 = dict(feature_selection__select__percentile=[5, 10, 25], 
              feature_selection__pca__n_components=[10, 50, 100])

cv = sklearn.model_selection.GridSearchCV(pipeline1, param_grid=parameters1)

#because tf-idf vectorizer isn't in this pipeline, fit/predict on transformed data
cv.fit(text_train_transformed, labels_train)
pred = cv.predict(text_test_transformed)

print(cv.best_params_)

report = sklearn.metrics.classification_report(labels_test, pred)
print(report)


The efficacy of a classifier is measured in terms of accuracy, precision, recall, and area under the ROC curve (AUC), which summarizes the tradeoff between the true positive rate and the false positive rate.
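
As a quick illustration of the first three measures, the sketch below computes accuracy, precision, and recall by hand from made-up labels. It is written for Python 3, and the helper name `binary_metrics` is hypothetical. It is only meant to show what the numbers mean, not to replace the scikit-learn functions used in this post.

```python
def binary_metrics(true_labels, predicted_labels, positive=1):
    """Compute accuracy, precision, and recall from paired label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    # Count the cells of the confusion matrix.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Toy labels: 1 for one party, 0 for the other (made-up values).
truth = [1, 1, 1, 0, 0, 0, 0, 1]
preds = [1, 0, 1, 0, 1, 0, 0, 1]
print(binary_metrics(truth, preds))
```

Precision asks how many of the paragraphs predicted as the positive party really were, while recall asks how many of the truly positive paragraphs were found.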

In [None]:
#set up scoring function and table
import sklearn.metrics
from prettytable import PrettyTable

scoring_table = PrettyTable(['pipeline_name', 'accuracy', 'precision', 'recall', 'auc'])

def scoring_function(pipeline_name, test_labels, prediction):
    """
    Runs evaluation metrics on the prediction from a classifier
    Args:
        pipeline_name: name used to label the row in the scoring table
        test_labels: labels from the test data set
        prediction: prediction from the classifier
    Returns:
        the scoring table with the new scores appended (the scores are also printed)
    """
    accuracy = sklearn.metrics.accuracy_score(test_labels, prediction)
    precision = sklearn.metrics.precision_score(test_labels, prediction)
    recall = sklearn.metrics.recall_score(test_labels, prediction)
    auc = sklearn.metrics.roc_auc_score(test_labels, prediction)
    print("Validation Metrics for %s: accuracy: %s, precision: %s, recall: %s, auc: %s" % (pipeline_name, accuracy, precision, recall, auc))
    
    scoring_table.add_row([pipeline_name, accuracy, precision, recall, auc])
    return scoring_table

The following lines demonstrate how the essential components of a classifier are put together: vectorizing the data, selecting the features, and specifying and fitting the classifier.

In [None]:
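#Put together pieces of classifier
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import FeatureUnion

#tf-idf vectorizer
vectorizer1 = TfidfVectorizer(sublinear_tf=True)
#note: an integer max_df is an absolute document count, so max_df = 1 keeps only
#terms found in a single paragraph; min_df = 1 (originally 0) keeps all of those
vectorizer2 = TfidfVectorizer(max_df = 1, min_df = 1, sublinear_tf=True)
vectorizer3 = TfidfVectorizer(ngram_range = (1,3), sublinear_tf=True)
vectorizer4 = TfidfVectorizer(max_df = 0.8, min_df = 0.2, ngram_range = (1,3), sublinear_tf=True)

#feature selection
select = SelectPercentile(f_classif)
pca = PCA()
feature_selection = FeatureUnion([('select', select), ('pca', pca)],
                    transformer_weights={'pca': 10})

#classifier
clfNB = GaussianNB()
clfAdaBoost = AdaBoostClassifier(random_state = 42)
clfLR = LogisticRegression(random_state=42, solver='sag')
#max_iter replaces the n_iter argument used by older scikit-learn releases
clfSVM = SGDClassifier(loss='modified_huber', penalty='l2', max_iter=200, random_state=42)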

We tested a set of different models (18 in total, nearly all of which looked like the following lines of code). We choose the set of features that flow into the pipeline, we specify the type of classifier to use, and then we review the cross-validated results of the models. It is a tedious task that puts the 'science' in data science.

In [None]:
#test2 - GaussianNB, simple vectorizer, PCA
steps = [
         ('feature_pick', pca),
         ('classifier', clfNB)]

params = dict(feature_pick__n_components=[100, 200, 500])

#gridsearch_pipeline is a helper that is not shown in this post
prediction = gridsearch_pipeline('test2', text_train_transformed, labels_train, text_test_transformed, steps, params)
scoring_function('test2', labels_test, prediction)
print(scoring_table)
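
The `gridsearch_pipeline` helper called above is not shown in this post. Below is a minimal sketch of what such a helper might look like. It is inferred only from how the helper is called (a name, training features, training labels, test features, pipeline steps, and a parameter grid) and from the fact that it returns predictions. The version in the repository may differ, and the use of Python 3 and scikit-learn's `model_selection` module is an assumption about the environment.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

def gridsearch_pipeline(name, features_train, labels_train, features_test, steps, params):
    """Hypothetical helper: grid-search a pipeline and predict on the test set.

    Builds a Pipeline from `steps`, searches `params` with cross-validation
    on the training data, prints the best parameters, and returns the
    predictions of the best model for `features_test`.
    """
    pipeline = Pipeline(steps)
    search = GridSearchCV(pipeline, param_grid=params)
    search.fit(features_train, labels_train)
    print("%s best parameters: %s" % (name, search.best_params_))
    # GridSearchCV refits the best pipeline on all training data by default.
    return search.predict(features_test)
```

Wrapping the search in one function keeps each of the model tests down to a list of steps and a parameter grid, which matches the shape of the test cell above.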

Our final model showed a cross-validated 80% accuracy.

