# Introduction

Having experimented with the NLTK library and rudimentary generative models in Part 1, I was able to shed more light on the Presidential Primary debates. Part 2 takes the investigation further, with more experimentation with Natural Language Processing techniques. The final aim is to see whether the candidates' language in the debates gives predictive insight into their language on Twitter. This leap across data boundaries will then define the direction of the final Capstone Report.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from textblob.classifiers import NaiveBayesClassifier
%matplotlib notebook

# The encoding is passed explicitly when reading the debate transcript
df = pd.read_csv('data/debate.csv', encoding="ISO-8859-1")

In [2]:
df.head()

Unnamed: 0,Line,Speaker,Text,Date
0,1,Holt,Good evening from Hofstra University in Hempst...,9/26/16
1,2,Audience,(APPLAUSE),9/26/16
2,3,Clinton,"How are you, Donald?",9/26/16
3,4,Audience,(APPLAUSE),9/26/16
4,5,Holt,Good luck to you.,9/26/16


In [4]:
cl_list = []

def remove_words(text):
    """Drop candidate names and gendered pronouns so they cannot act as giveaway features."""
    words = {'donald', 'clinton', 'trump', 'hillary', 'his', 'her', "she's", 'she', 'he', "he's"}
    return ' '.join(word for word in text.split() if word.lower() not in words)

# Build (text, speaker) pairs for the two candidates only,
# reading the columns by name rather than by position
for index, row in df.iterrows():
    if row['Speaker'] in ('Clinton', 'Trump'):
        text = remove_words(row['Text'])
        cl_list.append((text.lower(), row['Speaker']))

In [4]:
length = len(cl_list)
split = int(length*4/5)

In [5]:
train=cl_list[:split]
test=cl_list[split:]
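The slice above trains on the first 80% of lines in transcript order, so the test set is simply the tail end of the debates. A minimal sketch of a shuffled split is given here for comparison. The helper name `shuffled_split`, the seed and the toy data are illustrative assumptions, not part of the original notebook.

```python
import random

def shuffled_split(pairs, train_fraction=0.8, seed=42):
    """Return (train, test) after shuffling a copy of `pairs` (hypothetical helper)."""
    shuffled = list(pairs)                  # copy, so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy stand-in for cl_list: (text, speaker) tuples
toy = [("text %d" % i, "Clinton" if i % 2 else "Trump") for i in range(10)]
train_toy, test_toy = shuffled_split(toy)
print(len(train_toy), len(test_toy))
```

In the notebook this would be called as `shuffled_split(cl_list)`. Whether the accuracy changes with a shuffled split is something to check rather than assume.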

In [6]:
cl = NaiveBayesClassifier(train)

In [7]:
print(cl.accuracy(test))

0.775


In [8]:
cl.show_informative_features(25)

Most Informative Features
        contains(worked) = True           Clinto : Trump  =     11.8 : 1.0
          contains(part) = True           Clinto : Trump  =     10.8 : 1.0
      contains(disaster) = True            Trump : Clinto =     10.2 : 1.0
           contains(bad) = True            Trump : Clinto =     10.2 : 1.0
      contains(everyone) = True           Clinto : Trump  =      9.9 : 1.0
      contains(security) = True           Clinto : Trump  =      8.9 : 1.0
         contains(still) = True           Clinto : Trump  =      8.9 : 1.0
       contains(support) = True           Clinto : Trump  =      8.2 : 1.0
          contains(vote) = True           Clinto : Trump  =      8.0 : 1.0
      contains(military) = True           Clinto : Trump  =      8.0 : 1.0
          contains(paid) = True           Clinto : Trump  =      8.0 : 1.0
         contains(tried) = True           Clinto : Trump  =      7.1 : 1.0
       contains(working) = True           Clinto : Trump  =      7.1 : 1.0

Running TextBlob's Naive Bayes classifier on the candidates' debate text, with trivial words such as names and pronouns removed, gives even greater insight into how vocabulary differentiates the two candidates. Clinton's greater lexical diversity, noted in the Data Story, is most evident here. Given a much larger volume of text and a better understanding of the technology, a vectorised model could perhaps be trained to pick up a greater depth of features, and a generative model could perhaps be trained as well. Given my current limitations I will stop here, but I think this brief, experimental approach has already revealed a good deal of previously unrecognised insight into this dataset.
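Each row of the feature table above compares how likely a line from one speaker is to contain a word with how likely a line from the other speaker is to contain it. The sketch below shows that kind of ratio on toy data in plain Python. It assumes two labels and add-0.5 smoothing; the function `feature_ratios` and the toy sentences are illustrative, and the library's own estimator may differ in detail.

```python
from collections import Counter

def feature_ratios(labeled_texts, smoothing=0.5):
    """Map each word to (more likely label, less likely label, likelihood ratio)."""
    doc_counts = Counter()     # number of documents per label
    docs_with_word = {}        # label -> Counter of documents containing each word
    for text, label in labeled_texts:
        doc_counts[label] += 1
        docs_with_word.setdefault(label, Counter()).update(set(text.split()))

    labels = sorted(doc_counts)
    vocab = set().union(*(docs_with_word[label] for label in labels))
    ratios = {}
    for word in vocab:
        # Smoothed estimate of P(contains(word) = True | label); the feature has two outcomes
        probs = {
            label: (docs_with_word[label][word] + smoothing)
                   / (doc_counts[label] + 2 * smoothing)
            for label in labels
        }
        high = max(labels, key=probs.get)
        low = min(labels, key=probs.get)
        ratios[word] = (high, low, probs[high] / probs[low])
    return ratios

# Toy stand-in for the debate lines
toy_lines = [
    ("a disaster bad deal", "Trump"),
    ("bad bad trade", "Trump"),
    ("we worked together", "Clinton"),
    ("i worked on security", "Clinton"),
]
ratios = feature_ratios(toy_lines)
print(ratios["worked"])   # Clinton over Trump, roughly 5 : 1
print(ratios["bad"])      # Trump over Clinton, roughly 5 : 1
```

With so few lines per word, ratios like these are sensitive to the smoothing constant, which is worth keeping in mind when reading the larger values in the table.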

In [5]:
df_tweets = pd.read_csv('data/tweets.csv')
tweet_list = []

for index, row in df_tweets.iterrows():
    # .iloc makes the positional lookup explicit: 1 = account handle, 2 = tweet text
    if row.iloc[1] == 'HillaryClinton':
        text = remove_words(row.iloc[2])
        tweet_list.append((text.lower(), 'Clinton'))
    elif row.iloc[1] == 'realDonaldTrump':
        text = remove_words(row.iloc[2])
        tweet_list.append((text.lower(), 'Trump'))


In [10]:
print(cl.accuracy(tweet_list))


0.5040347610180013


With the tweets wrangled, it seems that the classifier trained on the debates gives no more insight into the language used in tweets than a coin flip! Surprised? Well... let's train a classifier on the tweets themselves and see which vocabulary is more predictive.
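Whether 0.504 really amounts to a coin flip depends on how evenly the two accounts are represented in `tweet_list`. A minimal sketch of a majority-class baseline to compare against is given below; the helper `majority_baseline` and the toy data are illustrative assumptions.

```python
from collections import Counter

def majority_baseline(labeled_texts):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(label for _, label in labeled_texts)
    label, n = counts.most_common(1)[0]
    return label, n / len(labeled_texts)

# Toy stand-in for tweet_list: three Trump tweets and one Clinton tweet
toy_tweets = [("t1", "Trump"), ("t2", "Trump"), ("t3", "Trump"), ("c1", "Clinton")]
baseline_label, baseline_acc = majority_baseline(toy_tweets)
print(baseline_label, baseline_acc)   # Trump 0.75
```

If `majority_baseline(tweet_list)` came out well above 0.5, an accuracy of 0.504 would be worse than always guessing the more prolific account, not merely on a par with chance.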

In [14]:
length = len(tweet_list)
split = int(length*4/5)
train_tweets=tweet_list[:split]
test_tweets=tweet_list[split:]

In [13]:
tweet_cl = NaiveBayesClassifier(train_tweets)

In [15]:
print(tweet_cl.accuracy(test_tweets))

0.8828549262994569


In [6]:
tweet_cl.show_informative_features(50)

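NameError: name 'tweet_cl' is not defined

(The execution counts show that this cell, In [6], was run before `tweet_cl` was created in In [13]; it needs to be re-run after the classifier has been trained.)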

We can see from the most predictive features in the tweets that Donald Trump's choice of words is much more radical. He repeatedly calls out his opponents, Republican and Democrat, by name, and uses much more divisive, colloquial and extreme language. Trump's shrewd use of social media, and the involvement of Cambridge Analytica in particular, has been widely credited with helping to propel him in the polls. Dissociative Identity Disorder or crazy like a fox? Probably the latter!

In [28]:
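import re

df_tweets = pd.read_csv('data/tweets.csv')
h_tweet_str = " "
t_tweet_str = " "

def remove_link(text):
    """Strip URLs so that link fragments do not show up in the word clouds."""
    return re.sub(r'http\S+', '', text)

for index, row in df_tweets.iterrows():
    # .iloc makes the positional lookup explicit: 1 = account handle, 2 = tweet text
    if row.iloc[1] == 'HillaryClinton':
        text = remove_link(remove_words(row.iloc[2]))
        # The trailing space keeps the last word of one tweet apart from the first word of the next
        h_tweet_str += text.lower() + " "
    elif row.iloc[1] == 'realDonaldTrump':
        text = remove_link(remove_words(row.iloc[2]))
        t_tweet_str += text.lower() + " "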

In [26]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Cap the font size so that the most frequent words do not crowd out the rest
wordcloud = WordCloud(max_font_size=70).generate(t_tweet_str)
plt.figure()  # start a fresh figure for this cloud
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

[Figure: word cloud of Trump's tweets]

We can see that Trump's tweet word cloud is significantly different from the word cloud produced from the Presidential debates. Phrases we all remember, like 'Make America Great Again' and 'Crooked Hillary', feature alongside 'Fake News', Obama, CNN and Ted Cruz. This is much closer to what we associate with the 'saga' of the debate, with Trump as the angry populist championing the cause of the downtrodden masses against the establishment. Maybe it has become easier for meaningful views to be expressed via social media than on TV.
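A word cloud sizes individual words, so a phrase can only be inferred from its words appearing side by side. One way to check the phrases directly is to count adjacent word pairs. The sketch below does this with the standard library only; the helper `top_ngrams` and the toy string are illustrative, and in the notebook the input would be `t_tweet_str` or `h_tweet_str`.

```python
from collections import Counter

def top_ngrams(text, n=2, k=5):
    """Return the k most common runs of n consecutive words in `text`."""
    tokens = text.split()
    # zip over n staggered copies of the token list to form the n-grams
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams).most_common(k)

# Toy stand-in for t_tweet_str
toy_str = "make america great again fake news make america great again"
print(top_ngrams(toy_str, n=2, k=3))
```

Setting `n=4` would surface whole slogans, although counts fall quickly as `n` grows on a corpus of this size.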

In [27]:
# Same font-size cap, this time for Clinton's tweets
wordcloud = WordCloud(max_font_size=70).generate(h_tweet_str)
plt.figure()  # start a fresh figure for this cloud
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

[Figure: word cloud of Clinton's tweets]

In contrast, we can see that Hillary's tweets are more measured. Nothing here conveys enmity or radical change; it is all very much what one would expect from a centre-left politician. Trump's word cloud is much more 'remarkable' than Hillary's, which again reflects the rise of social media and digital marketing in politics. Seth Godin would be much more likely to approve of Trump's tweets than Hillary's from a 'Purple Cow' perspective, regardless of the issues.

From this preliminary analysis it would seem that Donald Trump was more effective at adapting his communication style according to the platform and medium he was using.

Being unable to find significant features that link the debate classifier to Twitter has derailed the planned direction of the Capstone project, but it has yielded insight into the impact of social media on politics. I have also learned that a Data Scientist needs to draw on their own experience, contextual understanding and insight in order to draw conclusions from data. The tools are useful, but a Data Scientist should not be overly reliant on them. A good Data Scientist should have a good understanding of the subject matter in hand and be prepared for their conclusions to be challenged.

I have also learned that many simple steps can yield more insight than one complicated step. If we work through data manipulation and classification sequentially, it is easier for the reader to understand the logical steps taken to arrive at the conclusions. In trying to find ways to attack the problem, I also spent weeks in 'analysis paralysis' when perhaps I should have started experimenting sooner. Data Science is not so much a science as a set of tools for building a plausible narrative. A Data Scientist is perhaps more of a data storyteller or an insight advocate. It's important to recognise the limitations of technology.

As for next steps? This will require further thought. I feel that, for this set of data, the story ends here; going further would need a larger data set. It would be very interesting to study the impact of social media on politics. Kaggle has a collection of political memes from the Presidential election that could be interesting. Perhaps I could repeat all of the above analysis on the full set of Presidential Primary debates.