## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details. You can download the required DB from the shared dropbox or from blackboard

In [1]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [2]:
# creating functions
def clean_and_tokenize(text):
    # Tokenize the text
    tokens = word_tokenize(text.lower())
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    cleaned_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
    # Join tokens back to a single string
    return ' '.join(cleaned_tokens)

In [3]:
# database filepath
db_path = 'C:/Users/jessh/Documents/ADS509/2020_Conventions.db'
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

### Part 1: Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text 
for each party and prepare it for use in Naive Bayes.  

In [4]:
# Example table creation (adjust according to your actual schema)
query_results = convention_cur.execute('SELECT text, party FROM conventions')

In [5]:
# List to store the results
convention_data = []

# Process query results
for row in query_results:
    cleaned_text = clean_and_tokenize(row[0])
    party = row[1]
    convention_data.append([cleaned_text, party])

# Close the cursor and connection
convention_cur.close()
convention_db.close()

Let's look at some random entries and see if they look right. 

In [6]:
random.choices(convention_data,k=5)

[['line streets philadelphia bring communities life built country',
  'Democratic'],
 ['unlike joe biden president trump choose woman chose best person job four years ago president trump started movement unlike next four days hear millions hardworking everyday americans benefited leadership watched dnc last week probably noticed democrats spent lot time talking much despise president heard little actual policies policies would unthinkable decade ago policies like banning fossil fuels eliminating private health insurance taxpayer funded healthcare people come illegally defunding police',
  'Republican'],
 ['family complete', 'Democratic'],
 ['congresswoman lisa blunt rochester history class future children learning moment learning pain grief worry also learning man named joe biden restored decency government integrity democracy',
  'Democratic'],
 ['resident council president washington houses spanish harlem', 'Republican']]

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur at least `word_cutoff` times. Here's the code to test that if you want it. 

In [7]:
word_cutoff = 5

tokens = [w for t, p in convention_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2281 as features in the model.


In [8]:
def conv_features(text,fw) :
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
        
    """
    
    # tokenize the text
    tokens = text.split()
    
    # storing feature words
    ret_dict = {}
    
    # go through the tokens to check if they are in feature words set (fw)
    
    for token in tokens:
        if token in fw:
            ret_dict[token] = True
    
    return(ret_dict)

In [9]:
assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald':True,'president':True})
assert(conv_features("some people in america are citizens",feature_words)==
                     {'people':True,'america':True,"citizens":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [10]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [11]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [12]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.498


In [13]:
classifier.show_most_informative_features(25)

Most Informative Features
                   china = True           Republ : Democr =     27.1 : 1.0
                   votes = True           Democr : Republ =     23.8 : 1.0
             enforcement = True           Republ : Democr =     21.5 : 1.0
                 destroy = True           Republ : Democr =     19.2 : 1.0
                freedoms = True           Republ : Democr =     18.2 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                supports = True           Republ : Democr =     17.1 : 1.0
                   crime = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     15.8 : 1.0
                 beliefs = True           Republ : Democr =     13.0 : 1.0
               countries = True           Republ : Democr =     13.0 : 1.0
                 defense = True           Republ : Democr =     13.0 : 1.0
                  defund = True           Republ : Democr =     13.0 : 1.0

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

The above code tells us the top 25 features that the classifier used to determine which party the text would belong to. On the left hand side are the features and on the left are the ratio of republican to democratic or vice versa. So, for the first row where the word "china" is in the text, the odds are that the text belongs to a Republican party member 27.1 to 1. Interestingly enough, all the words I would associate with a republican party post/speech is reflected in the results. It is also interesting that most of the results, the classifier assigned to Republican. It makes me think that this is a classifier that's 50/50, where it just randomly assigns the words with the party.  



## Part 2: Classifying Congressional Tweets

In this part we apply the classifer we just built to a set of tweets by people running for congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.

In [14]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [15]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming

In [18]:
def clean_and_tokenize(text):
    # Ensure text is a string and decode if necessary
    if isinstance(text, bytes):
        text = text.decode('utf-8')
    # Tokenize the text
    tokens = word_tokenize(text.lower())
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    cleaned_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
    # Join tokens back to a single string
    return ' '.join(cleaned_tokens)

In [19]:
# list to store the results
tweet_data = []

# Now fill up tweet_data with sublists like we did on the convention speeches.
# Note that this may take a bit of time, since we have a lot of tweets.

# Process query results
for candidate, party, tweet_text in results:
    # Clean and tokenize the tweet text
    cleaned_text = clean_and_tokenize(tweet_text)
    # Append the cleaned text and party to tweet_data
    tweet_data.append([cleaned_text, party])

# Example of printing some processed tweet data
for data in tweet_data[:5]:  # Print the first 5 entries as an example
    print(data)

# Close the cursor and connection
cong_cur.close()
cong_db.close()

['brooks joins alabama delegation voting flawed funding bill http', 'Republican']
['brooks senate democrats allowing president give americans jobs illegals securetheborder https', 'Republican']
['nasa square event sat 11am 4pm stop amp hear incredible work done al05 downtownhsv http', 'Republican']
['trouble socialism eventually run people money margaret thatcher https', 'Republican']
['trouble socialism eventually run people money thatcher sorely missed http', 'Republican']


There are a lot of tweets here. Let's take a random sample and see how our classifer does. I'm guessing it won't be too great given the performance on the convention speeches...

In [31]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

# Prepare the test set from the randomly chosen samples
test_set = [(conv_features(tweet, feature_words), party) for tweet, party in tweet_data_sample]

In [34]:
for tweet, party in tweet_data_sample :
    estimated_party = classifier.classify(conv_features(tweet,test_set))
    # Fill in the right-hand side above with code that estimates the actual party
    
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifer says {estimated_party}.")
    print("")
    

Here's our (cleaned) tweet: earlier today spoke house floor abt protecting health care women praised ppmarmonte work central coast https
Actual party is Democratic and our classifer says Democratic.

Here's our (cleaned) tweet: go tribe rallytogether https
Actual party is Democratic and our classifer says Democratic.

Here's our (cleaned) tweet: apparently trump thinks easy students overwhelmed crushing burden debt pay student loans trumpbudget https
Actual party is Democratic and our classifer says Democratic.

Here's our (cleaned) tweet: grateful first responders rescue personnel firefighters police volunteers working tirelessly keep people safe provide help putting lives line https
Actual party is Republican and our classifer says Democratic.

Here's our (cleaned) tweet: let make even greater kag https
Actual party is Republican and our classifer says Democratic.

Here's our (cleaned) tweet: 1hr cavs tie series allin216 repbarbaralee scared roadtovictory
Actual party is Democratic a

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [35]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data) :
    tweet, party = tp    
    # Now do the same thing as above, but we store the results rather
    # than printing them. 
   
    # get the estimated party
    estimated_party = classifier.classify(conv_features(tweet,test_set))
    
    results[party][estimated_party] += 1
    
    if idx > num_to_score : 
        break

In [36]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 0, 'Democratic': 4278}),
             'Democratic': defaultdict(int,
                         {'Republican': 0, 'Democratic': 5724})})

### Reflections

It seems that based on the naive bayes classifer, it classified everything as Democratic. Because the text varies so much and some tokens can be used for both parties, the classifier made the decision to lean heavily on democrativ as the party. I've also noticed that the token "http" or "https" was in each excerpt regardless of whether the person speaking is Republican or Democratic. The next step to improving this model would be to remove these tokens from each text.