## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details.

In [2]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict

In [3]:
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

### Exploratory Naive Bayes

We'll first build a Naive Bayes (NB) model on the convention data itself, as a way to understand which words distinguish the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text for each party and prepare it for use in Naive Bayes.

In [4]:
import re
convention_data = []

query_results = convention_cur.execute(
                            '''
                            SELECT text, party
                            FROM conventions
                            WHERE speaker != 'Unknown'
                            ''')

for row in query_results:
    text,party = row 

    text = [w.lower() for w in text.split()]  # case fold and split on whitespace
    text = [re.sub(r'[^\w\s]','',w) for w in text]  # strip punctuation (letters, digits and underscores are kept)
    
    convention_data.append([text,party])
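
A quick way to sanity-check the cleaning step is to run it on a made-up sentence. The sketch below repeats the same two list comprehensions on a hypothetical string (`sample` is not from the database) and shows one side effect worth knowing about: a token made up entirely of punctuation becomes an empty string rather than being dropped. The final filtering step is an optional extra, not something the cell above does.

```python
import re

# Hypothetical sample text, not taken from the conventions table.
sample = "We're HERE -- together, for America!"

tokens = [w.lower() for w in sample.split()]          # case fold
tokens = [re.sub(r'[^\w\s]', '', w) for w in tokens]  # strip punctuation

print(tokens)

# Optional extra step (an assumption, not part of the cell above):
# drop tokens that ended up empty after punctuation was removed.
non_empty = [w for w in tokens if w]
print(non_empty)
```

Empty tokens are harmless for the classifier as long as the empty string never becomes a feature word, but they do inflate token counts slightly.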


Let's look at some random entries and see if they look right. 

In [None]:
random.choices(convention_data,k=10)

If that looks good, we now need to write a function that turns these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times (the code uses a strict `>` comparison). Here's the code to test that if you want it. 

In [6]:
from nltk.corpus import stopwords
sw = stopwords.words('english')


word_cutoff = 5

tokens = [w for t, p in convention_data for w in t]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() : 
    if word in sw: #removing stop words
        continue
    elif count > word_cutoff: #ensuring it meets word cutoff
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2350 as features in the model.


In [7]:
def conv_features(text,fw) :
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text, either as one continuous string or
            as a sequence of tokens. Assumes text has been cleaned and
            case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
        
    """
    ret_dict = dict()
    if isinstance(text, str):  # allow a raw string as well as a token sequence
        text = text.split()
    for word in text:
        if word in fw:
            ret_dict[word] = True  # mark each feature word as present
    
    return ret_dict

In [8]:
assert(len(feature_words)>0)
assert(conv_features(tuple("donald is the president".split()),feature_words)==
       {'donald':True,'president':True})
assert(conv_features(tuple("people are american in america".split()),feature_words)==
                     {'america':True,'american':True,"people":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [9]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data] #feature set

In [10]:
random.seed(20201013)
random.shuffle(featuresets)

test_size = 500

In [11]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set) # fitting the model
print(nltk.classify.accuracy(classifier, test_set)) # accuracy on the held-out test set (the model is not very accurate)

0.528
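
An accuracy figure is easier to interpret next to a baseline. One common choice is the majority-class baseline: the accuracy you would get by always predicting the most frequent label. The sketch below computes it for a list of labels. `majority_baseline` and the toy labels are hypothetical, introduced here for illustration only.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always guessing the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Toy labels for illustration only, not the notebook's data.
toy_labels = ["Democratic"] * 6 + ["Republican"] * 4
baseline = majority_baseline(toy_labels)
print(baseline)
```

In this notebook the same function could be applied to `[party for _, party in test_set]` to see how far the reported accuracy sits from always guessing one party.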


In [12]:
classifier.show_most_informative_features(25) 

Most Informative Features
                 radical = True           Republ : Democr =     43.0 : 1.0
             enforcement = True           Republ : Democr =     33.4 : 1.0
                   votes = True           Democr : Republ =     26.2 : 1.0
                    mike = True           Republ : Democr =     23.9 : 1.0
                   media = True           Republ : Democr =     20.7 : 1.0
                   china = True           Republ : Democr =     17.5 : 1.0
                 destroy = True           Republ : Democr =     16.5 : 1.0
                  defund = True           Republ : Democr =     14.3 : 1.0
                    flag = True           Republ : Democr =     14.3 : 1.0
                    isis = True           Republ : Democr =     14.3 : 1.0
                patriots = True           Republ : Democr =     13.3 : 1.0
                 chinese = True           Republ : Democr =     12.2 : 1.0
               countries = True           Republ : Democr =     12.2 : 1.0

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

Nearly all of the most informative features point toward the Republican party. Of the 25 most informative features, `votes` is the only one that favors Democrats; the other 24 favor Republicans. This suggests that Republican speakers relied on a set of distinctive words (such as `radical`, `enforcement`, and `media`) that Democratic speakers rarely used.
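
The ratios in the table are likelihood ratios: for each word, the estimated probability of seeing it in a speech from one party divided by the estimated probability of seeing it in a speech from the other. The sketch below works through that arithmetic on made-up counts. The add-0.5 smoothing is meant to resemble the expected likelihood estimate that NLTK uses by default, but the bin count and helper names here are simplifying assumptions, so treat this as an illustration rather than a reproduction of NLTK's numbers.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Add-gamma estimate of P(word present | party)."""
    return (count + gamma) / (total + bins * gamma)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """How many times more likely the word is under party A than party B."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Made-up counts: the word appears in 20 of 100 speeches from party A
# and in 0 of 100 speeches from party B.
ratio = likelihood_ratio(20, 100, 0, 100)
print(f"{ratio:.1f} : 1.0")
```

The smoothing is why a word that one party never uses still gets a large but finite ratio: the denominator never reaches zero.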



## Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.

In [13]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [14]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')
results = list(results) # Just to store it, since the query is time consuming
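
One thing to be aware of with the `NOT LIKE '%RT%'` filter: SQLite's `LIKE` is case-insensitive for ASCII letters by default, so the pattern matches any text containing the letters "rt", not only retweets. The sketch below checks this against an in-memory database with made-up tweets. The `demo` table and the stricter `'RT @%'` pattern are assumptions for illustration and are not part of the original query.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (tweet_text TEXT)")
con.executemany(
    "INSERT INTO demo VALUES (?)",
    [("RT @someone: big news",),          # a retweet
     ("Proud to support our veterans",),  # contains 'rt' inside 'support'
     ("Great day in the district",)],     # no 'rt' anywhere
)

# Same pattern as the query above: drops both of the first two rows.
kept_loose = [r[0] for r in con.execute(
    "SELECT tweet_text FROM demo "
    "WHERE tweet_text NOT LIKE '%RT%' ORDER BY rowid")]

# Hypothetical stricter pattern: only drop tweets that start with 'RT @'.
kept_strict = [r[0] for r in con.execute(
    "SELECT tweet_text FROM demo "
    "WHERE tweet_text NOT LIKE 'RT @%' ORDER BY rowid")]

print(kept_loose)
print(kept_strict)
con.close()
```

If the loose pattern behaves the same way on the real table, words such as "support" or "party" would cause ordinary tweets to be excluded, which may matter when interpreting the results.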

In [15]:
tweet_data = []
from nltk.corpus import stopwords
sw = stopwords.words('english')



for name, party, tweet in results:
    tweet = tweet.decode('utf-8') # tweets are stored as bytes; decode to str
    tweet = tweet.lower() # case fold
    tweet = tweet.split() # split on whitespace into tokens
    tweet = [word for word in tweet if word.isalpha()] # keep purely alphabetic tokens
    tweet = [word for word in tweet if word not in sw] # remove stop words
    tweet_data.append([tweet,party]) # store the token list with its party label
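
The `isalpha()` filter is stricter than it may look: it drops any token containing a digit, hashtag, mention, or attached punctuation, which helps explain why some cleaned tweets end up only a word or two long. Here is a small sketch on a made-up tweet, using a tiny hypothetical stopword set in place of NLTK's list:

```python
# Hypothetical stand-in for NLTK's English stopword list.
toy_stopwords = {"to", "the", "for", "in", "our"}

# Made-up tweet text for illustration.
raw = "Proud to fight for #healthcare in 2018, thanks @volunteers!"

tokens = raw.lower().split()
tokens = [w for w in tokens if w.isalpha()]             # letters only
tokens = [w for w in tokens if w not in toy_stopwords]  # remove stop words

print(tokens)
```

Only three of the nine original tokens survive, and the hashtag, which is arguably the most informative token, is among those lost.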


There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [21]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

In [18]:
for tweet, party in tweet_data_sample :
    estimated_party = classifier.classify(conv_features(tweet,feature_words)) #get the classification for tweets
    
    print(f"Here's our (cleaned) tweet: {' '.join(tweet)}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")
    

Here's our (cleaned) tweet: earlier spoke house floor abt protecting health care women praised work central
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: go
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: trump thinks easy students overwhelmed crushing burden debt pay student loans
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: grateful first rescue volunteers working tirelessly keep people provide putting lives
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: make even greater
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: tie series
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: congrats new gig sd city glad continue
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: really raised toward

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [19]:
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))  # note: reuses the name `results`, replacing the stored query rows

for p in parties:
    for p1 in parties:
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(tweet_data)

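for idx, tp in enumerate(tweet_data):
    tweet, party = tp    

    estimated_party = classifier.classify(conv_features(tweet,feature_words))  # predicted party for this tweet
    
    results[party][estimated_party] += 1  # outer key: actual party, inner key: predicted party
    
    # The check runs after scoring and uses >, so num_to_score + 2 tweets are scored.
    if idx > num_to_score: 
        break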

In [20]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 3495, 'Democratic': 783}),
             'Democratic': defaultdict(int,
                         {'Republican': 4561, 'Democratic': 1163})})
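
To read the nested counts more easily, it helps to reduce them to a few summary numbers. The sketch below copies the counts from the output above into a plain dictionary (outer key is the actual party, inner key is the predicted party) and computes overall accuracy, per-party recall, and the share of tweets labeled Republican. The variable names are illustrative and not part of the notebook.

```python
# Counts copied from the output above: confusion[actual][predicted].
confusion = {
    "Republican": {"Republican": 3495, "Democratic": 783},
    "Democratic": {"Republican": 4561, "Democratic": 1163},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[p][p] for p in confusion)
accuracy = correct / total

# Recall: share of each party's tweets that were labeled correctly.
recall = {p: confusion[p][p] / sum(confusion[p].values()) for p in confusion}

# Share of all scored tweets that the classifier labeled Republican.
predicted_rep = sum(confusion[p]["Republican"] for p in confusion) / total

print(f"tweets scored: {total}")
print(f"accuracy: {accuracy:.3f}")
for party, value in recall.items():
    print(f"recall for {party}: {value:.3f}")
print(f"share predicted Republican: {predicted_rep:.3f}")
```

Reporting recall per party alongside accuracy makes it clear when a classifier is mostly predicting a single label.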

### Reflections

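Our classification model leans heavily toward the Republican label. Of the 5,724 Democratic tweets scored, 4,561 were classified as Republican and only 1,163 as Democratic, while 3,495 of the 4,278 Republican tweets were classified correctly. Overall accuracy is about 47% (4,658 of 10,002), so the model is not very good on tweets.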

There are several possible reasons for this. The feature words come from convention speeches. Although Democrats and Republicans may each use words that signal party affiliation, the words that stand out in a convention speech are likely different from those that stand out in tweets. A convention speaker is not limited to a fixed number of characters the way a candidate is on Twitter, and this is only one of many differences between the two kinds of text. A better model would draw its feature words from a similar type of corpus, or preferably from the tweets themselves.

