<h1>Shibboleth:
NLP and the presidential debates</h1>
<em>
Jo Anna Capp
Kevin O’Gallagher
William Sankey
</em>

<h2>Introduction</h2>
<p>
Our Fall 2016 Incubator project asked whether we could identify party affiliation from presidential candidates’ own words using current Natural Language Processing (NLP) methods. NLP is the computational analysis of the language humans use to communicate. Using Python and the Natural Language Toolkit (NLTK), we were able to classify political speech by party with roughly 80% accuracy.
</p>
<p>
The origin of the word ‘shibboleth’ makes the case that language can be used to distinguish groups of individuals from one another. To identify party affiliation through speech we needed language used by Republicans and Democrats; we set aside political independents for this analysis. The most readily available data with identified Republicans and Democrats came from presidential debates, made publicly available through The American Presidency Project (http://www.presidency.ucsb.edu/index.php). To account for shifting party lines and tone over time, we focused on debates since the year 2000.
</p>
<p>
The following provides snippets of the most critical parts of our code. For the full code, please see our repository here: LINK</p>

<h2>Data Munging</h2>

Although the data is publicly available, a significant amount of processing is needed to make it usable for our NLP purposes. These tasks fall into two general groups: getting the data, and turning it into features that can be passed to a learner.

In [4]:
#Imports used by the parsing function (Python 2)
import os
import re
import urllib

import pandas
from bs4 import BeautifulSoup

#Create a function that will parse the data from the website
def parse_website_b(url):
    """
    Grabs data from website (url) and parses it.
    Data format: in class_='displaytext', each speaker's name is marked with a <b> tag.
    
    Args:
        url: the webpage to parse
        
    Returns:
        None. Saves a JSON file to the data directory with the debate parsed
        by paragraph: title, date, speaker, text
        
    """
    fetched = urllib.urlopen(url).read()
    soup = BeautifulSoup(fetched, "lxml")

    #Parsing
    titles = unicode(soup.title.string)
    dates = unicode(soup.find("span", class_="docdate").string)
    body = soup.find("span", class_="displaytext")
    paragraphs = soup.find("span", class_="displaytext").findChildren("p")

    #Lists that will become dataframe columns
    text_list = []
    speaker_list = []

    #pull text and speaker from html
    for paragraph in paragraphs:
        text = unicode(paragraph.find(text=True, recursive=False))
        text_list.append(text)
        if paragraph.find('b') is None:
            #no speaker name in this paragraph; filled in from the previous speaker below
            speaker_list.append(None)
        else:
            speakers = unicode(paragraph.b.get_text())
            speaker_list.append(speakers)

    #replace 'None' in speaker list
    start = next(element for element in speaker_list if element is not None)
    for i, element in enumerate(speaker_list):
        if element is None:
            speaker_list[i] = start
        else:
            start = element

    # Pandas dataframe
    columns = {'text': text_list, 'speaker': speaker_list, 'title': titles, 'date': dates}
    debates = pandas.DataFrame(columns)

    # Exporting to JSON
    directory_name = 'your/local/data/path'
    #join the digits found in the url (str() on the list would give a name like "['123']")
    base_filename = '_'.join(re.findall(r'\d+', url))
    suffix = '.json'
    save_path = os.path.join(directory_name, base_filename + suffix)

    debates.to_json(save_path, orient='index')

The above function takes a URL, parses the debate transcript on that page, and saves it as a JSON file. Importantly, the data is split out by paragraph, so our classifier will ultimately be trained on, and take as input, individual paragraphs labelled with the party affiliation of the speaker. At the end of this process we have a set of JSON files providing the party affiliation and text for each paragraph of every presidential debate since 2000.
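The "replace 'None' in speaker list" step is what makes paragraph-level labelling possible: a speaker's name appears only in the first paragraph of a turn, so later paragraphs have to inherit it. A minimal standalone sketch of that step follows; the function name and the speaker labels are hypothetical and only illustrate the logic.

```python
def forward_fill_speakers(speakers):
    """Replace each None with the most recent named speaker.

    Leading None entries (before anyone has been named) take the first
    named speaker, which is how parse_website_b treats them as well.
    """
    # Seed with the first named speaker so leading gaps are covered
    current = next((s for s in speakers if s is not None), None)
    filled = []
    for speaker in speakers:
        if speaker is not None:
            current = speaker
        filled.append(current)
    return filled


# Hypothetical speaker column for six consecutive paragraphs
raw = [None, "MODERATOR:", None, "CANDIDATE A:", None, "CANDIDATE B:"]
filled = forward_fill_speakers(raw)
print(filled)
```

Every paragraph now carries a speaker, so each row can later be matched to a party.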

<h3>Cleaning</h3>

Two core preprocessing steps in NLP are removing stop words (common words like 'the', 'and', 'it', etc.) and stemming words to find their root. In stemming, words like 'thinking', 'acting', and 'approving' become 'think', 'act', and 'approv' respectively (the process isn't always perfect, and stems are not always real words).

In [7]:
#Function to tokenize and stem
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import RegexpTokenizer

def clean_text(text):
    """
    Removes punctuation, converts all characters to lowercase, removes stop words, stems
    
    Args:
        text: a single string of text
        
    Returns:
        processed text string
        
    """
    tokenizer = RegexpTokenizer(r'\w+')
    stops = set(stopwords.words('english'))
    stemmer = SnowballStemmer('english')
    
    #lowercase first so capitalized stop words ('The', 'And') are also removed
    tokens = tokenizer.tokenize(text.lower())
    filtered_words = [word for word in tokens if word not in stops]
    stems = [stemmer.stem(t) for t in filtered_words]
    return " ".join(stems)
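To see the two cleaning steps in isolation, here is a toy version that needs no NLTK data. It is only a sketch: the stop list is a hypothetical handful of words, and the "stemmer" is a single crude rule (strip a trailing 'ing'), not the Snowball algorithm used above.

```python
import re

# Hypothetical, deliberately tiny stop list; NLTK's English list is much longer
TOY_STOPS = {"the", "and", "it", "is", "a", "of", "to"}


def toy_stem(word):
    """Strip a trailing 'ing' if at least three letters remain (toy rule, not Snowball)."""
    if word.endswith("ing") and len(word) - 3 >= 3:
        return word[:-3]
    return word


def toy_clean(text):
    """Lowercase, drop punctuation and stop words, then apply the toy stemmer."""
    words = re.findall(r"\w+", text.lower())
    kept = [w for w in words if w not in TOY_STOPS]
    return " ".join(toy_stem(w) for w in kept)


cleaned = toy_clean("The senator is thinking, acting, and approving it.")
print(cleaned)  # senator think act approv
```

Even this crude rule produces 'approv', which shows why stems are treated as feature keys rather than readable words.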

<h3>Feature Engineering</h3>

<h3>Testing out models</h3>

In [None]:
#train/test split of data (randomized)
from sklearn import cross_validation

text_train, text_test, labels_train, labels_test = cross_validation.train_test_split(text, labels, test_size=0.2, random_state=42)

In [None]:
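#tfidf vectorizer and numpy array
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True)
#fit the vocabulary on the training text only, then reuse it for the test text
text_train_transformed = vectorizer.fit_transform(text_train).toarray()
text_test_transformed  = vectorizer.transform(text_test).toarray()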
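For readers unfamiliar with what sublinear_tf=True changes, the sketch below computes the weights by hand in plain Python. It assumes scikit-learn's documented defaults (smoothed idf, L2 normalization) and uses naive whitespace tokenization, so it illustrates the formula rather than reproducing TfidfVectorizer exactly. The function name and the example sentences are hypothetical.

```python
import math
from collections import Counter


def sublinear_tfidf(docs):
    """Return one {term: weight} dict per document.

    tf is dampened to 1 + ln(tf); idf is ln((1 + n) / (1 + df)) + 1;
    each document vector is then scaled to unit L2 length.
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        row = {}
        for term, tf in Counter(tokens).items():
            idf = math.log((1.0 + n_docs) / (1.0 + df[term])) + 1.0
            row[term] = (1.0 + math.log(tf)) * idf
        # Guard against empty documents when normalizing
        norm = math.sqrt(sum(w * w for w in row.values())) or 1.0
        rows.append({term: w / norm for term, w in row.items()})
    return rows


# Hypothetical two-paragraph corpus
rows = sublinear_tfidf(["tax tax tax cut", "tax plan"])
print(rows[0])
```

With raw counts, 'tax' would count three times as much as a word used once; the log dampening shrinks that gap, which limits the influence of words a speaker repeats within a single paragraph.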

In [None]:
#build classifier pipeline
import sklearn.grid_search
import sklearn.metrics
import sklearn.pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import FeatureUnion

select = SelectPercentile(f_classif)
pca = PCA()
feature_selection = FeatureUnion([('select', select), ('pca', pca)],
                    transformer_weights={'pca': 10})
clfNB = GaussianNB()

steps1 = [('feature_selection', feature_selection),
        ('naive_bayes', clfNB)]

pipeline1 = sklearn.pipeline.Pipeline(steps1)

#search for best parameters
#note: percentile is given in percent (0-100), so .05 keeps 0.05% of the features
parameters1 = dict(feature_selection__select__percentile=[.05, .1, .25], 
              feature_selection__pca__n_components=[10, 50, 100])

cv = sklearn.grid_search.GridSearchCV(pipeline1, param_grid=parameters1)

#because tf-idf vectorizer isn't in this pipeline, fit/predict on transformed data
cv.fit(text_train_transformed, labels_train)
pred = cv.predict(text_test_transformed)

print cv.best_params_

report = sklearn.metrics.classification_report(labels_test, pred)
print report


In [None]:
#set up scoring function and table
from prettytable import PrettyTable

scoring_table = PrettyTable(['pipeline_name', 'accuracy', 'precision', 'recall', 'auc'])

def scoring_function(pipeline_name, test_labels, prediction):
    """
    runs evaluation metrics on prediction from classifier
    Args:
        pipeline_name: label for this run in the scoring table
        test_labels: labels from the test data set
        prediction: prediction from classifier
    Returns:
        the scoring table with this run's scores appended (also prints the scores)
    """
    accuracy = sklearn.metrics.accuracy_score(test_labels, prediction)
    precision = sklearn.metrics.precision_score(test_labels, prediction)
    recall = sklearn.metrics.recall_score(test_labels, prediction)
    auc = sklearn.metrics.roc_auc_score(test_labels, prediction)
    print "Validation Metrics for %s: accuracy: %s, precision: %s, recall: %s, auc: %s"%(pipeline_name, accuracy, precision, recall, auc)
    
    scoring_table.add_row([pipeline_name, accuracy, precision, recall, auc])
    return scoring_table

In [None]:
# set-up generic grid-search cv function
def gridsearch_pipeline(pipeline_name, train_data, train_labels, test_data, pipeline_steps, parameters):
    """
    generic function to run gridsearchcv on an input dataset, pipeline, and parameters
    Args:
        data separated into features/labels and train/test
        steps of the pipeline
        parameters for gridsearchcv
    Returns:
        best parameters from gridsearch, prediction for test features
    """
    #pipeline
    pipe = sklearn.pipeline.Pipeline(pipeline_steps)
    
    #gridsearch
    cv = sklearn.grid_search.GridSearchCV(pipe, param_grid=parameters)
    cv.fit(train_data, train_labels)
    pred = cv.predict(test_data)
    print cv.best_params_
    return pred

In [None]:
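#Put together pieces of classifier
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import FeatureUnion

#tf-idf vectorizer
vectorizer1 = TfidfVectorizer(sublinear_tf=True)
#floats are proportions of documents; an integer max_df=1 would drop every term seen in more than one document
vectorizer2 = TfidfVectorizer(max_df = 1.0, min_df = 0.0, sublinear_tf=True)
vectorizer3 = TfidfVectorizer(ngram_range = (1,3), sublinear_tf=True)
vectorizer4 = TfidfVectorizer(max_df = 0.8, min_df = 0.2, ngram_range = (1,3), sublinear_tf=True)

#feature selection
select = SelectPercentile(f_classif)
pca = PCA()
feature_selection = FeatureUnion([('select', select), ('pca', pca)],
                    transformer_weights={'pca': 10})

#classifier
clfNB = GaussianNB()
clfAdaBoost = AdaBoostClassifier(random_state = 42)
clfLR = LogisticRegression(random_state=42, solver='sag')
#linear classifier trained with SGD (modified_huber is a smooth loss; loss='hinge' would give a linear SVM)
clfSVM = SGDClassifier(loss='modified_huber', penalty='l2', n_iter=200, random_state=42)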

In [None]:
#test2 - GaussianNB, simple vectorizer, PCA
steps = [
         ('feature_pick', pca),
         ('classifier', clfNB)]

params = dict(feature_pick__n_components=[100, 200, 500])

prediction = gridsearch_pipeline('test2', text_train_transformed, labels_train, text_test_transformed, steps, params)
scoring_function('test2', labels_test, prediction)
print scoring_table

<h3>Final Model and Discussion</h3>

<h3>References</h3>