Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details.

# Notebook Setup

In [1]:
import nltk
import numpy as np
import random
import sqlite3

from collections import Counter, defaultdict

In [2]:
def conv_features(text,fw, include_false=False) :
    """Given some text, this returns a dictionary holding the
       feature words.

       Args:
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word
            in `text` must be in fw in order to be returned. This
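            prevents us from considering very rarely occurring words.
            * include_false: if True, feature words missing from `text`
            are returned with a value of False instead of being omitted.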

       Returns:
            A dictionary with the words in `text` that appear in `fw`.
            Words are only counted once.
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of
            {'quick' : True,
             'fox' :    True}

    """
    ret_dict = dict()

    present_tokens = set(text.split())

    for token in fw:  # use the argument, not the global `feature_words`
        if token in present_tokens:
            ret_dict[token] = True
        else:
            if include_false: # include false
                ret_dict[token] = False

    return(ret_dict)
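
A quick sanity check of the behaviour described in the docstring. The function is restated in compact form so the snippet runs on its own, and `demo_fw` is a made-up feature set rather than anything from the convention data.

```python
def conv_features(text, fw, include_false=False):
    """Compact restatement of the function above, for a standalone demo."""
    present_tokens = set(text.split())
    features = {}
    for token in fw:
        if token in present_tokens:
            features[token] = True
        elif include_false:
            features[token] = False
    return features

demo_fw = {"quick", "fox", "jumps"}  # hypothetical feature words

only_true = conv_features("quick quick brown fox", demo_fw)
with_false = conv_features("quick quick brown fox", demo_fw, include_false=True)

print(only_true)   # 'quick' and 'fox' map to True; 'brown' is not a feature word
print(with_false)  # additionally maps 'jumps' to False
```

With `include_false=True` every feature word becomes a key, so each dictionary is as large as the feature set itself.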

In [3]:
# added libraries
import os
import pandas as pd
import string
import re

from nltk.corpus import stopwords
from tqdm import tqdm

# --- functions from past homework assignments for this course ---
# Some punctuation variations
punctuation = set(string.punctuation) # speeds up comparison
# somehow, to_add[1] != to_add[2]
to_add = ['`','’','’','•','›','«','×']
punctuation.update(to_add)
punctuation.remove('#')

# Stopwords
sw = set(stopwords.words("english")) # set speeds up membership tests

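def contains_emoji(s):
    # Imported here because `emoji` is a third-party package that this
    # notebook never imports at the top level.
    import emoji
    emoji_count = emoji.emoji_count(s)
    return(emoji_count > 0)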

def prepare(text, pipeline) :
    '''
        Chandler, John
        August 22, 2022
        ADS 509 Module 3: Group Comparison
        Code Version: Git commit 0405f0f35f67edf62f95bba5052cc11efbda26c9
        NLP Pipeline Transformer
        https://github.com/37chandler/ads-tm-group-comp/blob/main/Group%20Comparison.ipynb
    '''
    tokens = str(text)
    for transform in pipeline :
        tokens = transform(tokens)

    return(tokens)

def remove_punctuation(text, punct_set=punctuation):
    '''
        Chandler, John
        August 22, 2022
        ADS 509 Module 3: Group Comparison
        Code Version: Git commit 0405f0f35f67edf62f95bba5052cc11efbda26c9
        NLP Punctuation Remover
        https://github.com/37chandler/ads-tm-group-comp/blob/main/Group%20Comparison.ipynb
    '''
    return("".join([ch for ch in text if ch not in punct_set]))

def remove_stop(text) :
    tokens = text.split()
    tokens = [token for token in tokens if token not in sw]
    string_ =  ' '.join(tokens)
    return(string_)

def tokenize(text) :
    """ Splitting on whitespace rather than the book's tokenize function. That
        function will drop tokens like '#hashtag' or '2A', which we need for Twitter. """

    tokens = text.split()
    return(tokens)
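
To see how the helpers chain together, here is a minimal standalone sketch of the same pipeline idea. The names `demo_punct`, `demo_stop`, `remove_punct`, `remove_stop_demo` and `prepare_demo` are stand-ins invented for this example, and the tiny stopword set is an assumption. The notebook itself uses the NLTK English list.

```python
import string

# Stand-ins for the notebook's globals; '#' is kept so hashtags survive.
demo_punct = set(string.punctuation) - {"#"}
demo_stop = {"we", "the", "to", "a", "is"}  # tiny hypothetical stopword set

def remove_punct(text):
    return "".join(ch for ch in text if ch not in demo_punct)

def remove_stop_demo(text):
    return " ".join(tok for tok in text.split() if tok not in demo_stop)

def prepare_demo(text, pipeline):
    # Apply each transform in order, feeding the result to the next one.
    for transform in pipeline:
        text = transform(text)
    return text

cleaned = prepare_demo("We MUST protect the #2A, today!",
                       [str.lower, remove_punct, remove_stop_demo])
print(cleaned)  # must protect #2a today
```

The order of the transforms matters: case folding has to come before stopword removal, because the stopword list is lowercase.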

# Part 1: Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text 
for each party and prepare it for use in Naive Bayes.

In [4]:
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

In [6]:
convention_data = []

query = '''
            SELECT text, party
            FROM conventions
        '''

query_results = convention_cur.execute(query)

text_prep_pipeline = [str.lower,
                      remove_punctuation,
                      remove_stop]

for row in query_results :
    text = prepare(text=row[0], pipeline=text_prep_pipeline)
    party = row[1]
    convention_data.append([text,party])

Let's look at some random entries and see if they look right. 

In [7]:
random.choices(convention_data,k=2)

[['must come together defeat donald trump elect joe biden kamala harris next president vice president',
  'Democratic'],
 ['joe always cared military families theyve much went iraq one generals said ” want share story you” daughters class christmas program playing ave maria one little girls burst tears teacher ran said “whats matter whats matter” said “thats song played daddys funeral died war” teacher idea little girls father fought war died night said staff im teacher better weve got better help military kids',
  'Democratic']]

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times. Here's the code to test that if you want it.

In [8]:
word_cutoff = 5

tokens = [w for t, p in convention_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2383 as features in the model.
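
Note that the cell keeps a word only when its count is strictly greater than `word_cutoff`. The sketch below uses made-up texts and a made-up cutoff to show how that differs from a `>=` comparison. It uses `collections.Counter` in place of `nltk.FreqDist`, which is a substitution made only to keep the example free of dependencies.

```python
from collections import Counter

# Hypothetical mini-corpus: vote appears 4 times, climate 3, jobs 2.
demo_texts = ["vote vote vote climate", "vote climate jobs", "climate jobs"]
demo_cutoff = 2

counts = Counter(w for t in demo_texts for w in t.split())

strict = {w for w, c in counts.items() if c > demo_cutoff}     # what the cell does
at_least = {w for w, c in counts.items() if c >= demo_cutoff}  # "at least" reading

print(sorted(strict))    # ['climate', 'vote']
print(sorted(at_least))  # ['climate', 'jobs', 'vote']
```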


In [9]:
assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald':True,'president':True})
assert(conv_features("people are american in america",feature_words)==
                     {'america':True,'american':True,"people":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory.

## Classifier 1 W/Out False Values

In [12]:
featuresets = []
for (text, party) in tqdm(convention_data):
    featuresets.append((conv_features(text,feature_words), party))

100%|██████████| 2541/2541 [00:00<00:00, 6484.91it/s]


In [13]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in tqdm(convention_data)]

100%|██████████| 2541/2541 [00:00<00:00, 7300.99it/s]


In [14]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [15]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.498


In [16]:
classifier.show_most_informative_features(50)

Most Informative Features
                   china = True           Republ : Democr =     25.8 : 1.0
                   votes = True           Democr : Republ =     23.8 : 1.0
             enforcement = True           Republ : Democr =     21.5 : 1.0
                 destroy = True           Republ : Democr =     19.2 : 1.0
                freedoms = True           Republ : Democr =     18.2 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                supports = True           Republ : Democr =     17.1 : 1.0
                   crime = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     14.9 : 1.0
                 beliefs = True           Republ : Democr =     13.0 : 1.0
               countries = True           Republ : Democr =     13.0 : 1.0
                 defense = True           Republ : Democr =     13.0 : 1.0
                    isis = True           Republ : Democr =     13.0 : 1.0

## Classifier 2 W/ False Values

In [18]:
featuresets = []
for (text, party) in tqdm(convention_data):
    featuresets.append((conv_features(text,
                                      feature_words,
                                      include_false=True),
                        party))

100%|██████████| 2541/2541 [00:00<00:00, 3357.68it/s]


In [17]:
featuresets = [(conv_features(text,
                              feature_words,
                              include_false=True),
                party) for (text, party) in tqdm(convention_data)]


100%|██████████| 2541/2541 [00:00<00:00, 3186.81it/s]


In [None]:
featuresets = [(conv_features(text,
                              feature_words,
                              include_false=True),
                party) for (text, party) in tqdm(convention_data)]

random.seed(20220507)
random.shuffle(featuresets)

test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier_wFalse = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier_wFalse, test_set))

In [None]:
classifier_wFalse.show_most_informative_features(5)

## My Observations
Strangely enough, even though including False values increases test accuracy by roughly 20%, the most informative features are exactly the same for both classifiers. The only difference is that the False values of classifier 2 appear as None in classifier 1.

Another interesting find is that only two of the top 50 informative features, votes and climate, are Democratic-dominant words. Thus, it appears that the classifier is geared toward recognizing whether a text is Republican or not Republican.

# Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and
is unindexed, so the query takes a minute or two to run on my machine.

## Data Pull

In [4]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [5]:
%%time
query = '''
           SELECT DISTINCT
                  cd.candidate,
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle
               AND cd.candidate == tw.candidate
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic')
               AND tw.tweet_text NOT LIKE '%RT%'
        '''

results = cong_db.execute(query)

results = list(results) # Just to store it, since the query is time consuming

CPU times: user 4.29 s, sys: 1.43 s, total: 5.73 s
Wall time: 14.2 s
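
One property of the retweet filter is worth keeping in mind: SQLite's `LIKE` is case-insensitive for ASCII letters by default, so `NOT LIKE '%RT%'` also drops tweets that merely contain "rt" inside a word. The sketch below illustrates this on an in-memory table with made-up tweets stored as TEXT, which is an assumption, since the real column appears to hold bytes. The narrower pattern `'RT @%'` is a hypothetical alternative, not the query used in this notebook.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tweets (tweet_text TEXT)")
con.executemany(
    "INSERT INTO tweets VALUES (?)",
    [("RT @someone: big news",),          # an actual retweet
     ("Proud to support our veterans",),  # contains "rt" inside "support"
     ("Vote on Tuesday",)],               # no "rt" anywhere
)

def kept_rows(pattern):
    """Return tweets that survive a NOT LIKE filter with the given pattern."""
    sql = "SELECT tweet_text FROM tweets WHERE tweet_text NOT LIKE ? ORDER BY rowid"
    return [row[0] for row in con.execute(sql, (pattern,))]

broad = kept_rows("%RT%")    # the pattern used in the query above
narrow = kept_rows("RT @%")  # hypothetical alternative: only a leading "RT @"

print(broad)   # ['Vote on Tuesday']
print(narrow)  # ['Proud to support our veterans', 'Vote on Tuesday']
```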


In [6]:
# FROM https://stackoverflow.com/questions/24399820/expression-to-remove-url-links-from-twitter-tweet
def html_remover(str_):
    pattern = r'(http)\S+'
    text = re.sub(pattern, '', str_)

    return(text)
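
A small check of what the pattern does, using a made-up tweet. `strip_urls` is a name chosen for this example and simply wraps the same regular expression.

```python
import re

def strip_urls(text):
    """Same pattern as html_remover: drop 'http' plus the non-space run after it."""
    return re.sub(r'(http)\S+', '', text)

stripped = strip_urls("read more at https://example.com/a?b=1 today")
print(repr(stripped))    # 'read more at  today' (note the doubled space)
print(stripped.split())  # ['read', 'more', 'at', 'today']
```

The leftover double space is harmless here, because the later steps split on whitespace.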

## Text Prep

In [7]:
tweet_data = []

# Build the pipeline once rather than on every iteration.
text_prep_pipeline = [str.lower,
                      html_remover,
                      remove_punctuation,
                      remove_stop]

for row in tqdm(results):

    text = prepare(text=row[2].decode('utf-8'), pipeline=text_prep_pipeline)
    party = row[1]

    tweet_data.append([text, party])

100%|████████████████████████████████| 664656/664656 [00:25<00:00, 26509.38it/s]


There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [8]:
random.seed(20201014)
tweet_data_sample = random.sample(tweet_data,k=10)

## Word Cutoff

In [9]:
%%time
word_cutoff = 10
tokens = [w for t, p in tweet_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()
for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)

print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 10, we have 33151 as features in the model.
CPU times: user 4.55 s, sys: 184 ms, total: 4.73 s
Wall time: 4.78 s


-----------
## Feature Building w/out false values

In [12]:
featuresets = [(conv_features(text,
                              feature_words,
                              include_false=False),
                party) for (text, party) in tqdm(tweet_data)]

100%|██████████████████████████████████| 664656/664656 [54:42<00:00, 202.51it/s]
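
The long run time above is most likely because `conv_features` loops over every feature word for every tweet. When `include_false=False`, the same dictionary can be built from the tweet's own tokens instead. The sketch below compares the two approaches on made-up data. `conv_features_fast` is a hypothetical variant, not part of the original notebook, and `demo_fw` only mimics the size of `feature_words`.

```python
def conv_features_loop(text, fw):
    """Reference: the loop from conv_features with include_false=False."""
    present_tokens = set(text.split())
    return {token: True for token in fw if token in present_tokens}

def conv_features_fast(text, fw):
    """Hypothetical variant: walk the tweet's tokens instead of all of `fw`."""
    return {token: True for token in set(text.split()).intersection(fw)}

demo_fw = {f"word{i}" for i in range(33151)}  # stand-in for feature_words
demo_tweet = "word7 word42 unseen word7"

# Both approaches should produce the same feature dictionary.
assert conv_features_fast(demo_tweet, demo_fw) == conv_features_loop(demo_tweet, demo_fw)
print(conv_features_fast(demo_tweet, demo_fw))
```

With a set of feature words, the work in the fast variant grows with the length of the tweet rather than with the size of the feature set.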


In [None]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [None]:
%%time
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
twitter_clf = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(twitter_clf, test_set))

In [None]:

for tweet, party in tweet_data_sample:

    estimated_party = twitter_clf.classify(conv_features(tweet,
                                                         feature_words,
                                                         False))
    print(f"Clean tweet: {tweet}\n\n")
    print(f"Actual party: {party}\n",
          f'Classifier prediction {estimated_party}')
    print("--"*10)


Now that we've looked at it some, let's score a bunch and see how we're doing.

In [None]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0

num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data) :
    tweet, party = tp
   
    # get the estimated party
    estimated_party = twitter_clf.classify(conv_features(tweet,
                                                         feature_words,
                                                         False))
    
    results[party][estimated_party] += 1
    
    if idx > num_to_score : 
        break

results

### Reflections

_Write a little about what you see in the results._

It appears that our classifier does better on documents written by Democrats. It correctly classified 91.4% (= 5174/(5174 + 487)) of all Democratic documents, whereas the 61.53% (= 2671/(2671 + 1670)) accuracy on Republican documents clearly shows that the classifier struggled with them.
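
The percentages above can be reproduced from the four counts quoted in the paragraph. The layout of the `confusion` dictionary (actual party first, estimated party second) is inferred from the comment in the scoring cell, so treat it as an assumption.

```python
# Counts quoted in the reflection: confusion[actual][estimated]
confusion = {
    "Democratic": {"Democratic": 5174, "Republican": 487},
    "Republican": {"Republican": 2671, "Democratic": 1670},
}

# Share of each party's tweets that were labelled correctly.
recall = {party: row[party] / sum(row.values())
          for party, row in confusion.items()}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[party][party] for party in confusion)

for party, value in recall.items():
    print(f"{party}: {value:.2%} classified correctly")
print(f"Overall: {correct / total:.2%} of {total} tweets")
```

The counts sum to 10,002 rather than 10,000, which is consistent with the `idx > num_to_score` check letting two extra tweets through before the loop breaks.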

# Projekt X
(project experiment)

I tried training a classifier with both True and False values on our Twitter data, but for some reason my computer was highly inconsistent when building the feature set with False values included via a list comprehension. Thus, the internals of the function `conv_features` were copied below so that I could print certain things while training. *Sidenote: sometimes the following list comprehension took ~6 seconds, other attempts lasted beyond 20 minutes, and, worst of all, some killed my kernel! No background software was running during any attempt other than OS essentials.

In [None]:
%%time
featuresets = [(conv_features(text,
                              feature_words,
                              include_false=True),
                party) for (text, party) in tqdm(tweet_data)]

 15%|████▉                           | 101986/664656 [13:55<1:15:11, 124.73it/s]
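
A rough back-of-the-envelope check suggests that memory, rather than background load, may explain the killed kernel. With `include_false=True`, every tweet gets a dictionary holding one entry per feature word. The sketch below measures one such dictionary and scales it by the number of tweets. The key names are placeholders, and `sys.getsizeof` reports only the dictionary's own table on CPython, so this is an estimate under those assumptions.

```python
import sys

n_features = 33151  # feature count reported in the Word Cutoff cell
n_tweets = 664656   # number of tweets reported by the progress bars

# One feature dict as built with include_false=True: every feature word is a key.
one_dict = {f"w{i}": False for i in range(n_features)}

bytes_per_dict = sys.getsizeof(one_dict)  # dict table only; keys/values are shared
total_gb = bytes_per_dict * n_tweets / 1e9

print(f"{bytes_per_dict / 1e6:.1f} MB per dict, roughly {total_gb:,.0f} GB in total")
```

Even allowing for a generous margin of error, the total is far beyond the RAM of a typical laptop. That would fit the pattern of very slow progress followed by a crash.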

In [None]:
random.seed(20220507)
random.shuffle(featuresets)

In [None]:
%%time
test_size = 500
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
twitter_clf = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(twitter_clf, test_set))

In [None]:
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0

random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data) :
    tweet, party = tp

    # get the estimated party
    estimated_party = twitter_clf.classify(conv_features(tweet,
                                                         feature_words,
                                                         True))
    results[party][estimated_party] += 1

    if idx > num_to_score :
        break

results

Talk about overfitting!