# Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details. You can download the required DB from the shared Dropbox or from Blackboard.

In [1]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict

from string import punctuation

# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns

In [5]:
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

In [6]:
import pandas as pd

pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", convention_db)


| | name |
|---|---|
| 0 | conventions |


In [7]:
pd.read_sql("PRAGMA table_info(conventions);", convention_db)


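| | cid | name | type | notnull | dflt_value | pk |
|---|---|---|---|---|---|---|
| 0 | 0 | party | TEXT | 0 | | 0 |
| 1 | 1 | night | INTEGER | 0 | | 0 |
| 2 | 2 | speaker | TEXT | 0 | | 0 |
| 3 | 3 | speaker_count | INTEGER | 0 | | 0 |
| 4 | 4 | time | TEXT | 0 | | 0 |
| 5 | 5 | text | TEXT | 0 | | 0 |
| 6 | 6 | text_len | TEXT | 0 | | 0 |
| 7 | 7 | file | TEXT | 0 | | 0 |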


## 1. Exploratory Naive Bayes

We'll first build an NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" exercise. First, we'll pull in the text for each party and prepare it for use in Naive Bayes.

In [8]:
convention_data = []

query_results = convention_cur.execute(
    '''
    SELECT text, party
    FROM conventions
    WHERE party != 'Other';
    '''
)

for row in query_results:
    convention_data.append([row[0], row[1]])

print(len(convention_data))
print(convention_data[:2])
 

2541
[['Skip to content The Company Careers Press Freelancers Blog × Services Transcription Captions Foreign Subtitles Translation Freelancers About Contact Login « Return to Transcript Library home  Transcript Categories  All Transcripts 2020 Election Transcripts Classic Speech Transcripts Congressional Testimony & Hearing Transcripts Debate Transcripts Donald Trump Transcripts Entertainment Transcripts Financial Transcripts Interview Transcripts Political Transcripts Press Conference Transcripts Speech Transcripts Sports Transcripts Technology Transcripts Aug 21, 2020 2020 Democratic National Convention (DNC) Night 4 Transcript Rev  ›  Blog  ›  Transcripts  › 2020 Election Transcripts  ›  2020 Democratic National Convention (DNC) Night 4 Transcript Night 4 of the 2020 Democratic National Convention (DNC) on August 20. Read the full transcript of the event here. Transcribe Your Own Content  Try Rev for free  and save time transcribing, captioning, and subtitling.', 'Democratic'], ['I’

In [9]:
# it's a best practice to close up your DB connection when you're done
convention_db.close()
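As a hedged aside, the connection can also be closed automatically. The sketch below uses `contextlib.closing` with an in-memory database; the table contents are made-up stand-ins for the real `conventions` table, not data from `2020_Conventions.db`.

```python
import sqlite3
from contextlib import closing

# Toy rows standing in for the real table (hypothetical data).
toy_rows = [
    ("We the people", "Democratic"),
    ("Law and order", "Republican"),
    ("Station identification", "Other"),
]

# closing() calls db.close() when the block exits, even if an error is raised.
with closing(sqlite3.connect(":memory:")) as db:
    db.execute("CREATE TABLE conventions (text TEXT, party TEXT)")
    db.executemany("INSERT INTO conventions VALUES (?, ?)", toy_rows)
    rows = db.execute(
        "SELECT text, party FROM conventions WHERE party != 'Other' ORDER BY rowid"
    ).fetchall()

print(rows)
```

Note that using the connection object itself as a context manager (`with sqlite3.connect(...) as db:`) manages transactions but does not close the connection, which is why `closing` is used here.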

Let's look at some random entries and see if they look right. 

In [10]:
random.choices(convention_data,k=5)

[['Every generation before us has had to fight for what they believe in and it’s just our turn now. Jack G.: ( 01:02:03 )  I was proud when I saw the demonstrations that were going on across the country.',
  'Democratic'],
 ['They’re not bad folks, folks. But guess what? They’re not competition for us.',
  'Republican'],
 ['And when I finally did, he was my rock, getting me through those hours, weeks, months of unspeakable pain and unending surgeries. He was my anchor as I relearned to walk, helping me through every step and every stumble. Our military spouses hold their families together, praying for their loved ones safety, wherever they’re deployed and serving as caregivers to our disabled service members. And then picking up the pieces and starting again, whenever the next tour or the next war arises, Joe Biden understands the sacrifices because he’s made them himself. When his son Beau deployed to Iraq, his burden was also shouldered by his family. Joe knows the fear military fami

It'll be useful for us to have a larger sample size than 2020 affords, since those speeches tend to be long and contiguous. Let's make a new list-of-lists called `conv_sent_data`. Instead of each first entry in the sublists being an entire speech, make each first entry just a sentence from the speech. Feel free to use NLTK's `sent_tokenize` [function](https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html). 

In [14]:
import nltk

nltk.download("punkt")
nltk.download("punkt_tab")


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/mammajamma/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/mammajamma/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [15]:
from nltk.tokenize import sent_tokenize

conv_sent_data = []

for speech, party in convention_data:
    sentences = sent_tokenize(speech)
    for sent in sentences:
        conv_sent_data.append([sent, party])

print(len(conv_sent_data))
print(conv_sent_data[:5])

10740
[['Skip to content The Company Careers Press Freelancers Blog × Services Transcription Captions Foreign Subtitles Translation Freelancers About Contact Login « Return to Transcript Library home  Transcript Categories  All Transcripts 2020 Election Transcripts Classic Speech Transcripts Congressional Testimony & Hearing Transcripts Debate Transcripts Donald Trump Transcripts Entertainment Transcripts Financial Transcripts Interview Transcripts Political Transcripts Press Conference Transcripts Speech Transcripts Sports Transcripts Technology Transcripts Aug 21, 2020 2020 Democratic National Convention (DNC) Night 4 Transcript Rev  ›  Blog  ›  Transcripts  › 2020 Election Transcripts  ›  2020 Democratic National Convention (DNC) Night 4 Transcript Night 4 of the 2020 Democratic National Convention (DNC) on August 20.', 'Democratic'], ['Read the full transcript of the event here.', 'Democratic'], ['Transcribe Your Own Content  Try Rev for free  and save time transcribing, captioning

Again, let's look at some random entries. 

In [16]:
random.choices(conv_sent_data,k=5)

[['I grew up in Grapevine, Texas, a town that my great grandfather was the first black man to settle as a sharecropper in 1896.',
  'Republican'],
 ['And he’ll solve them in a way that puts working people first.',
  'Democratic'],
 ['Our agenda is based on freedom.', 'Republican'],
 ['You are making America safe again.', 'Republican'],
 ['We worshiped our mother.', 'Democratic']]

Now it's time for our final cleaning before modeling. Go through `conv_sent_data` and take the following steps: 

1. Tokenize on whitespace
1. Remove punctuation
1. Remove tokens that fail the `isalpha` test
1. Remove stopwords
1. Casefold to lowercase
1. Join the remaining tokens into a string


In [17]:
from nltk.corpus import stopwords
import string

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

clean_conv_sent_data = []

for sent, party in conv_sent_data:
    # lowercase + tokenize on whitespace
    tokens = sent.lower().split()
    
    # keep purely alphabetic, non-stopword tokens; note that isalpha() also
    # drops any token with attached punctuation (e.g. "streets.")
    tokens = [w for w in tokens if w.isalpha() and w not in stop_words]
    
    # join tokens back to string
    cleaned_sentence = " ".join(tokens)
    
    if cleaned_sentence:  # keep only non-empty
        clean_conv_sent_data.append((cleaned_sentence, party))

print(len(clean_conv_sent_data))
print(random.choices(clean_conv_sent_data, k=5))


10081
[('millions women men flooded streets', 'Democratic'), ('marie trust trust know word', 'Democratic'), ('remember scared', 'Democratic'), ('put opportunity zones trump tax bill would drive investment communities decades', 'Republican'), ('storm', 'Democratic')]


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mammajamma/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
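The cleaning cell relies on `isalpha` to deal with punctuation, which discards a whole token when punctuation is attached to it (the last word of most sentences, for instance). A minimal sketch of a variant that strips punctuation first is shown here. The function name `clean_sentence` and the tiny stopword set are assumptions for illustration; the notebook itself uses NLTK's English stopword list. Case folding happens before the stopword check because the stopword list is lowercase.

```python
from string import punctuation

# Tiny illustrative stopword set (an assumption; the notebook uses NLTK's list).
TOY_STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is"}

def clean_sentence(sentence, stop_words=TOY_STOP_WORDS):
    """Hypothetical cleaner following the listed steps, stripping punctuation first."""
    tokens = sentence.split()                                # tokenize on whitespace
    tokens = [t.strip(punctuation) for t in tokens]          # trim leading/trailing punctuation
    tokens = [t for t in tokens if t.isalpha()]              # drop non-alphabetic tokens
    tokens = [t.casefold() for t in tokens]                  # case fold
    tokens = [t for t in tokens if t not in stop_words]      # remove stopwords
    return " ".join(tokens)

cleaned = clean_sentence("Millions of women and men flooded the streets.")
print(cleaned)
```

`string.punctuation` only covers ASCII, so a token containing a curly apostrophe such as `it’s` would still fail `isalpha` and be dropped. Switching to this variant would change the sentence and feature counts reported in this notebook.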


If that looks good, let's make our function to turn these into features. First we need to build our list of candidate words. I started my exploration at a cutoff of 5. 

In [18]:
word_cutoff = 5

tokens = [w for t, p in clean_conv_sent_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 1776 as features in the model.
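To get a feel for how the cutoff controls the vocabulary size, here is a small sketch on toy data. The helper name `feature_vocab` and the toy sentences are assumptions; the comparison is strict (`count > cutoff`), matching the cell above.

```python
from collections import Counter

def feature_vocab(cleaned_texts, cutoff):
    """Return the words that occur more than `cutoff` times across all texts."""
    counts = Counter(word for text in cleaned_texts for word in text.split())
    return {word for word, count in counts.items() if count > cutoff}

# Toy cleaned sentences (hypothetical, not from the convention data).
toy_texts = ["freedom america", "america first", "climate america freedom"]

for cutoff in (0, 1, 2):
    vocab = feature_vocab(toy_texts, cutoff)
    print(cutoff, sorted(vocab))
```

Raising the cutoff shrinks the feature set, which trades coverage of rare words for more stable probability estimates.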


In [20]:
def conv_features(text,fw) :
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
        
    """
    
    # Collect the unique words in this text, then keep those that are feature words
    words = set(text.split())
    ret_dict = dict()

    for w in fw:
        if w in words:
            ret_dict[w] = True

    return ret_dict


In [21]:
test_text = "freedom and america first"
print(conv_features(test_text, feature_words))


{'america': True, 'first': True, 'freedom': True}


In [22]:
assert(len(feature_words)>0)
assert(conv_features("obama was the president",feature_words)==
       {'obama':True,'president':True})
assert(conv_features("some people in america are citizens",feature_words)==
                     {'people':True,'america':True,"citizens":True})
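One possible alternative, sketched under the assumption that `fw` is a set: loop over the words in the text rather than over every feature word. Each sentence has only a handful of words, while the feature set has well over a thousand, so this does less work per call. The name `conv_features_by_text` is hypothetical.

```python
def conv_features_by_text(text, fw):
    """Hypothetical variant of conv_features that loops over the text's words, not over fw."""
    return {word: True for word in set(text.split()) if word in fw}

# Toy stand-in for feature_words, taken from the docstring example.
feature_words_demo = {"quick", "fox", "jumps"}

features = conv_features_by_text("quick quick brown fox", feature_words_demo)
print(features)
```

The returned dictionary has the same contents as the original function's, although the key order may differ.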

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [24]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in clean_conv_sent_data]

In [25]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [26]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.668


In [27]:
classifier.show_most_informative_features(25)

Most Informative Features
                   votes = True           Democr : Republ =     44.5 : 1.0
                 radical = True           Republ : Democr =     29.2 : 1.0
             enforcement = True           Republ : Democr =     20.0 : 1.0
                 climate = True           Democr : Republ =     15.6 : 1.0
                 sanders = True           Democr : Republ =     15.3 : 1.0
               committee = True           Democr : Republ =     14.1 : 1.0
                   media = True           Republ : Democr =     13.5 : 1.0
              affordable = True           Democr : Republ =     12.4 : 1.0
                 current = True           Democr : Republ =     11.9 : 1.0
                freedoms = True           Republ : Democr =     11.4 : 1.0
                  bernie = True           Democr : Republ =     10.9 : 1.0
                 chinese = True           Republ : Democr =     10.8 : 1.0
                 destroy = True           Republ : Democr =     10.8 : 1.0
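The ratios in this listing compare how likely a feature is under one party versus the other. The sketch below shows the arithmetic on made-up counts. It assumes add-0.5 smoothing with two possible values per feature (present or absent), which, as far as I understand, is close to NLTK's default estimator; the exact numbers NLTK reports depend on its internals, so treat this as an illustration rather than a reimplementation.

```python
def smoothed_prob(count, total, gamma=0.5, bins=2):
    """Smoothed estimate of P(feature present | party); gamma and bins are assumptions."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """How many times more likely the feature is under party A than under party B."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts: a word seen in 44 of 5000 sentences for one party
# and in 0 of 5000 sentences for the other.
ratio = likelihood_ratio(44, 5000, 0, 5000)
print(f"{ratio:.1f} : 1.0")
```

The smoothing is what keeps the ratio finite when a word never appears for one party, and it is also why words with very few occurrences on one side can produce large ratios.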

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

It is a little strange to me that proper nouns such as people's names (for example, `sanders` and `bernie`) appear here. I don't know that we would want to include them, because they are fairly self-explanatory: a mention of Kamala most likely marks a Democratic speaker, because she is a Democrat. It is interesting that Naive Bayes delineates the words that characterize each party in such a seemingly clear and accurate way. The Republican words (`radical`, `enforcement`, `destroy`) read as considerably more aggressive than the Democratic ones. I think removing names from the feature set would give a clearer picture of the language that separates the parties.

## 2. Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.

In [29]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [30]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming
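One thing worth checking about the retweet filter: SQLite's `LIKE` is case-insensitive for ASCII letters by default, so `NOT LIKE '%RT%'` drops every tweet containing the letters "rt" anywhere (as in "support" or "party"), not only retweets. The sketch below demonstrates this on a toy in-memory table. The toy tweets, and the narrower `'RT %'` pattern, are assumptions about how retweets are marked and are not part of the provided query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (tweet_text TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?)",
    [
        ("RT @someone: big news",),       # an actual retweet
        ("Proud to support our party",),  # contains "rt" inside ordinary words
        ("Watch live",),
    ],
)

# The pattern used in the query above: matches "rt" anywhere, in any case.
broad = [row[0] for row in conn.execute(
    "SELECT tweet_text FROM tweets WHERE tweet_text NOT LIKE '%RT%' ORDER BY rowid"
)]

# A narrower pattern (assumption): only tweets that start with "RT ".
narrow = [row[0] for row in conn.execute(
    "SELECT tweet_text FROM tweets WHERE tweet_text NOT LIKE 'RT %' ORDER BY rowid"
)]
conn.close()

print(broad)
print(narrow)
```

Changing the filter would change the number of tweets returned, so the counts reported below apply to the query as given.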

In [31]:
print(len(results))

print(results[:5])


664656
[('Mo Brooks', 'Republican', b'"Brooks Joins Alabama Delegation in Voting Against Flawed Funding Bill" http://t.co/3CwjIWYsNq'), ('Mo Brooks', 'Republican', b'"Brooks: Senate Democrats Allowing President to Give Americans\xe2\x80\x99 Jobs to Illegals" #securetheborder https://t.co/mZtEaX8xS6'), ('Mo Brooks', 'Republican', b'"NASA on the Square" event this Sat. 11AM \xe2\x80\x93 4PM. Stop by &amp; hear about the incredible work done in #AL05! @DowntownHSV http://t.co/R9zY8WMEpA'), ('Mo Brooks', 'Republican', b'"The trouble with Socialism is that eventually you run out of other people\'s money." - Margaret Thatcher https://t.co/X97g7wzQwJ'), ('Mo Brooks', 'Republican', b'"The trouble with socialism is eventually you run out of other people\'s money" \xe2\x80\x93 Thatcher. She\'ll be sorely missed. http://t.co/Z8gBnDQUh8')]


In [32]:
tweet_data = []

for candidate, party, tweet in results:
    # the tweets come back as bytes (note the b'' prefix in the output above), so decode them
    if isinstance(tweet, bytes):
        tweet = tweet.decode("utf-8", errors="ignore")
    
    # clean as in part 1, but tokenize with nltk.word_tokenize instead of splitting on whitespace
    tokens = [w.lower() for w in nltk.word_tokenize(tweet) if w.isalpha()]
    # reuse the `stop_words` set from part 1; calling stopwords.words() for every tweet is slow
    tokens = [w for w in tokens if w not in stop_words]
    cleaned_tweet = " ".join(tokens)
    
    if cleaned_tweet:  
        tweet_data.append([cleaned_tweet, party])

print(len(tweet_data))
print(tweet_data[:5])



664155
[['brooks joins alabama delegation voting flawed funding bill http', 'Republican'], ['brooks senate democrats allowing president give americans jobs illegals securetheborder https', 'Republican'], ['nasa square event sat stop amp hear incredible work done downtownhsv http', 'Republican'], ['trouble socialism eventually run people money margaret thatcher https', 'Republican'], ['trouble socialism eventually run people money thatcher sorely missed http', 'Republican']]
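Many of the cleaned tweets end in a stray `http` or `https` token, which appears to be what survives of a link after tokenizing and the `isalpha` filter. A minimal sketch of removing URLs before tokenizing is shown here; the regular expression and the helper name `strip_urls` are assumptions, and the pattern only targets `http://` and `https://` links.

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")

def strip_urls(tweet):
    """Replace each URL with a space so neighbouring words stay separated."""
    return URL_PATTERN.sub(" ", tweet)

example = "Watch live https://t.co/abc123"  # hypothetical tweet
print(strip_urls(example).split())
```

Applying this before `nltk.word_tokenize` would change the cleaned tweets and therefore the scores reported below.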


There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [33]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

In [34]:

for tweet, party in tweet_data_sample :
    # build the feature dictionary for this tweet
    features = conv_features(tweet, feature_words)
    
    # estimate the party with the Naive Bayes classifier from part 1
    estimated_party = classifier.classify(features)
    
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifer says {estimated_party}.")
    print("")
    

Here's our (cleaned) tweet: despite president says people know always american values https
Actual party is Democratic and our classifer says Republican.

Here's our (cleaned) tweet: goldman sachs invested pension funds russian banks amp iraqi bonds amp single bonds fishy https
Actual party is Democratic and our classifer says Republican.

Here's our (cleaned) tweet: another pretty terrible week field woman teacher worker lgbt parent read http
Actual party is Democratic and our classifer says Republican.

Here's our (cleaned) tweet: going posting brat facts election quiz brat pack http
Actual party is Republican and our classifer says Democratic.

Here's our (cleaned) tweet: judge dismisses charges muslim new mexico compound suspects https
Actual party is Republican and our classifer says Republican.

Here's our (cleaned) tweet: watch live https
Actual party is Democratic and our classifer says Republican.

Here's our (cleaned) tweet: congrats rafa enjoy please reach want jaw vent http

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [35]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data) :
    tweet, party = tp    
    # Now do the same thing as above, but we store the results rather
    # than printing them. 
    features = conv_features(tweet, feature_words)
    # get the estimated party
    estimated_party = classifier.classify(features)
    
    results[party][estimated_party] += 1
    
    # this check runs after scoring, so the loop scores num_to_score + 2 tweets
    if idx > num_to_score : 
        break

In [36]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 2824, 'Democratic': 1551}),
             'Democratic': defaultdict(int,
                         {'Republican': 3142, 'Democratic': 2485})})

### Reflections

_Write a little about what you see in the results_ 

This shows that the classifier often miscategorizes Democrats as Republicans. The true positive rate for Republicans (2,824 of 4,375, about 65%) is considerably better than for Democrats (2,485 of 5,627, about 44%), and overall accuracy is only about 53%, so the model is for some reason favoring the Republican classification. Other metrics, including precision and recall, could be computed from these counts to examine the output more closely, but it is already clear that the classifier has difficulty identifying Democratic tweets and tends to classify them as Republican, which points to a bias toward the Republican class. In the speeches, there were key words that were clearly designated as more Republican or more Democratic, with obvious ratios. The results on tweets suggest that a classifier trained on speeches does not transfer well, and that a model trained on the Twitter data itself would have been more appropriate. This makes sense, as the way candidates tweet differs from the way they deliver formal speeches.
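The counts in the results table can be turned into accuracy, precision, and recall directly. The sketch below hard-codes the four counts from the output above; the helper name `summarize` is an assumption.

```python
# Counts copied from the results cell: first key is actual party, second is estimated.
confusion = {
    "Republican": {"Republican": 2824, "Democratic": 1551},
    "Democratic": {"Republican": 3142, "Democratic": 2485},
}

def summarize(confusion):
    """Compute overall accuracy plus per-party precision and recall."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[party][party] for party in confusion)
    metrics = {"accuracy": correct / total}
    for party in confusion:
        actual = sum(confusion[party].values())                         # tweets truly from this party
        predicted = sum(confusion[other][party] for other in confusion)  # tweets labelled as this party
        metrics[party] = {
            "recall": confusion[party][party] / actual,
            "precision": confusion[party][party] / predicted,
        }
    return metrics

metrics = summarize(confusion)
print(f"accuracy: {metrics['accuracy']:.3f}")
for party in confusion:
    print(party, {name: round(value, 3) for name, value in metrics[party].items()})
```

Recall for a party is the share of its tweets that were labelled correctly, while precision is the share of tweets given that label that truly belong to the party.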