## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details. You can download the required DB from the shared Dropbox or from Blackboard.

In [67]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score



In [68]:
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

In [69]:
convention_cur

<sqlite3.Cursor at 0x2e2e6edff40>

### Part 1: Exploratory Naive Bayes

We'll first build an NB model on the convention data itself, as a way to understand which words distinguish the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text for each party and prepare it for use in Naive Bayes.

In [70]:
# The SQLite database
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

# Query to list all table names
convention_cur.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = convention_cur.fetchall()

# Table names
print("Tables in the database:")
for table in tables:
    print(table[0])


Tables in the database:
conventions


In [71]:
table_name = 'conventions'

# Query to get the column names of the specific table
convention_cur.execute(f"PRAGMA table_info({table_name});")
columns = convention_cur.fetchall()

# Display the column names
print(f"\nColumns in the '{table_name}' table:")
for column in columns:
    print(column[1])  # Column names are in the second index (index 1)



Columns in the 'conventions' table:
party
night
speaker
speaker_count
time
text
text_len
file


In [72]:

# Query to see the column names in the 'conventions' table
convention_cur.execute("PRAGMA table_info(conventions)").fetchall()


[(0, 'party', 'TEXT', 0, None, 0),
 (1, 'night', 'INTEGER', 0, None, 0),
 (2, 'speaker', 'TEXT', 0, None, 0),
 (3, 'speaker_count', 'INTEGER', 0, None, 0),
 (4, 'time', 'TEXT', 0, None, 0),
 (5, 'text', 'TEXT', 0, None, 0),
 (6, 'text_len', 'TEXT', 0, None, 0),
 (7, 'file', 'TEXT', 0, None, 0)]

In [73]:
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# Connect to the database
#convention_db = sqlite3.connect("2020_Conventions.db")
#convention_cur = convention_db.cursor()

# Initialize an empty list to hold the processed data
convention_data = []

# Query the 'conventions' table to get text and party columns
query_results = convention_cur.execute(
    '''
    SELECT text, party FROM conventions
    '''
)

# Set of stopwords for filtering
stop_words = set(stopwords.words('english'))

# Function to clean and tokenize text
# Convert to lowercase and tokenize
# Remove stopwords and non-alphabetic tokens
def clean_tokenize(text):
    tokens = word_tokenize(text.lower())  
    cleaned_tokens = [word for word in tokens if word.isalpha() 
                      and word not in stop_words]  
    return ' '.join(cleaned_tokens)

# Process the query results
for row in query_results:
    speech_text = row[0]  # text
    party = row[1]  # party affiliation
    
    # Clean and tokenize the speech text
    cleaned_speech = clean_tokenize(speech_text)
    
    # Append the cleaned text and party to the convention_data list
    convention_data.append([cleaned_speech, party])

# Print 5 randomly chosen speeches (random.choices samples with replacement)
random_sample = random.choices(convention_data, k=5)
for speech in random_sample:
    print(speech)


[nltk_data] Downloading package punkt to C:\Users\bista/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bista/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['foreign prince', 'Republican']
['reproductive justice', 'Democratic']
['mission fight future equal ideals founders hopes children sacrifices veterans brave men women uniform families', 'Democratic']
['black americans standing native land probably represent oregon dual viruses racism laid bare equal healthcare access deaths communities color', 'Democratic']
['joe purpose always driven forward strength unstoppable faith unshakable politicians political parties even providence god faith us yes many classrooms quiet right playgrounds still listen closely hear sparks change air across country educators parents first responders americans walks life putting shoulders back fighting given need leadership worthy nation worthy honest leadership bring us back together recover pandemic prepare whatever else next dr', 'Democratic']


Let's look at some random entries and see if they look right. 

In [74]:
random.choices(convention_data,k=5)

[['love heart', 'Democratic'],
 ['rhode island ocean state restaurant fishing industry decimated pandemic lucky governor gina raimondo whose program lets fishermen sell catches directly public state appetizer calamari available states calamari comeback state rhode island casts vote bernie sanders votes next president joe biden',
  'Democratic'],
 ['knows like send child war', 'Democratic'],
 ['america', 'Democratic'],
 ['trillions dollars repatriated back united states sitting foreign lands far long america became envy world renewed strength came leverage president demanded allies pay fair share defense western world father rebuilt mighty american military adding new jets aircraft carriers increased wages incredible men women uniform expanded military defense budget billion per year america longer weak eye enemy moment president trump ordered special forces kill deadliest terrorists planet day mighty moab dropped insurgent camps day america took stance never defeated enemy soleimani de

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times. Here's the code to test that if you want it.

In [75]:

word_cutoff = 5

tokens = [w for t, p in convention_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, "
      f"we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2236 as features in the model.
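The cutoff comparison decides whether a word that occurs exactly `word_cutoff` times is kept. A minimal sketch on made-up tokens (the words and counts here are hypothetical, not drawn from the convention data) shows the difference between a strict and an inclusive comparison:

```python
from collections import Counter

# Hypothetical toy tokens, not drawn from the convention data.
tokens = ["vote"] * 6 + ["china"] * 5 + ["calamari"] * 2

word_cutoff = 5
counts = Counter(tokens)

# Strictly more than the cutoff: a word seen exactly 5 times is dropped.
more_than = {w for w, c in counts.items() if c > word_cutoff}

# At least the cutoff: a word seen exactly 5 times is kept.
at_least = {w for w, c in counts.items() if c >= word_cutoff}

print(sorted(more_than))  # ['vote']
print(sorted(at_least))   # ['china', 'vote']
```

On a real corpus the two choices usually differ by only a handful of words, but it is worth knowing which one the feature count refers to.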


In [76]:

def conv_features(text, fw):
    """Given some text, this returns a dictionary holding the feature words.
    
    Args: 
        * text: a piece of text in a continuous string. Assumes
        text has been cleaned and case folded.
        * fw: the *feature words* that we're considering. A word 
        in `text` must be in fw in order to be returned. This 
        prevents us from considering very rarely occurring words.
    
    Returns: 
        A dictionary with the words in `text` that appear in `fw`. 
        Words are only counted once. 
        If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
        then this would return a dictionary of 
        {'quick' : True,
         'fox' :    True}
    """
    
    # Split the text into words
    words = text.split()
    
    # Create a dictionary for features found in the text
    ret_dict = {word: True for word in words if word in fw}
    
    return ret_dict

# test cases
assert(len(feature_words) > 0)
assert(conv_features("donald is the president", feature_words) == {'donald': True, 'president': True})
assert(conv_features("some people in america are citizens", 
                     feature_words) == {'people': True, 'america': True, 'citizens': True})


Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [77]:
# Create the feature set for each piece of text
featuresets = [(conv_features(text, feature_words), party) 
               for (text, party) in convention_data]

# Shuffle the feature sets and split into training and test sets
random.seed(20220507)
random.shuffle(featuresets)

# Define the test size 
test_size = 500

# Split the data into test and training sets
test_set, train_set = featuresets[:test_size], featuresets[test_size:]

# Train a Naive Bayes classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Evaluate the classifier on the test set
accuracy = nltk.classify.accuracy(classifier, test_set)
print(f"Accuracy of the Naive Bayes Classifier: {accuracy * 100:.2f}%")

# Display the most informative features
print("\nMost Informative Features:")
classifier.show_most_informative_features(25)


Accuracy of the Naive Bayes Classifier: 49.40%

Most Informative Features:
Most Informative Features
                   china = True           Republ : Democr =     27.1 : 1.0
                   votes = True           Democr : Republ =     23.8 : 1.0
             enforcement = True           Republ : Democr =     21.5 : 1.0
                 destroy = True           Republ : Democr =     19.2 : 1.0
                freedoms = True           Republ : Democr =     18.2 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                supports = True           Republ : Democr =     17.1 : 1.0
                   crime = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     15.8 : 1.0
                 beliefs = True           Republ : Democr =     13.0 : 1.0
               countries = True           Republ : Democr =     13.0 : 1.0
                 defense = True           Republ : Democr =     13.0 : 1.0

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

The Naive Bayes classifier reaches an accuracy of 49.40% on the held-out test set, which is roughly what random guessing between two parties would achieve. This low accuracy indicates that the model struggles to separate Democratic from Republican speeches using the features derived from the text. The model needs to be improved before its predictions can be relied on.

Some of the interesting observations are the association of the word "china" with the Republican party and of the word "votes" with Democratic speeches. Other words that point strongly toward Republican speeches include "enforcement", "destroy", "freedoms", and "crime". On the Democratic side, "climate" stands out alongside "votes". The words that distinguish each party's speeches reflect the political opinions and ideology of the respective parties. For example, climate change and voting rights are major policy priorities for the Democrats.

There are also some odd findings in this analysis. "Votes" should be a general term, but it tilts heavily toward the Democrats. The word "media" is strongly associated with Republican speeches, even though Republicans are often criticized for being conservative on various issues. Similarly, it is somewhat odd to see words such as "crime", "defund", and "destroy" alongside "freedom" and "greatness", all tied to one party.
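Each row of the most-informative-features output compares how often a feature appears under one party versus the other. A rough sketch of that idea, assuming add-0.5 smoothing and made-up counts (the helper names are hypothetical, and NLTK's internal estimator may differ in detail):

```python
def smoothed_rate(docs_with_word, total_docs, smoothing=0.5):
    """Estimate P(word present | party) with additive smoothing over two outcomes."""
    return (docs_with_word + smoothing) / (total_docs + 2 * smoothing)


def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of the two smoothed rates, with the larger rate on top."""
    rate_a = smoothed_rate(count_a, total_a)
    rate_b = smoothed_rate(count_b, total_b)
    return max(rate_a, rate_b) / min(rate_a, rate_b)


# Hypothetical counts: a word appears in 40 of 1,000 speeches from party A
# and in 2 of 1,000 speeches from party B.
ratio = likelihood_ratio(40, 1000, 2, 1000)
print(f"{ratio:.1f} : 1.0")
```

A large ratio can come from a word that is rare overall, so a high-ranking feature is not necessarily a frequent one.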

## Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.
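One detail of the query used later in this notebook is worth a sanity check: it filters retweets with `NOT LIKE '%RT%'`. In SQLite, `LIKE` ignores case for ASCII letters by default, so this pattern also matches tweets that merely contain "rt" inside a word. A small sketch on an in-memory table with made-up tweets (the narrower `'RT @%'` pattern is an assumption about how retweets are formatted, not something the source data confirms):

```python
import sqlite3

# In-memory database with hypothetical tweets; not the real congressional data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tweets (tweet_text TEXT)")
db.executemany(
    "INSERT INTO tweets VALUES (?)",
    [
        ("RT @someone: big news today",),    # a retweet
        ("Proud to support our teachers",),  # "support" contains "rt"
        ("Go team",),                        # no "rt" anywhere
    ],
)

# The broad pattern drops any tweet containing "rt" in any letter case.
broad = db.execute(
    "SELECT tweet_text FROM tweets WHERE tweet_text NOT LIKE '%RT%'"
).fetchall()

# A narrower pattern, assuming retweets start with "RT @".
narrow = db.execute(
    "SELECT tweet_text FROM tweets WHERE tweet_text NOT LIKE 'RT @%'"
).fetchall()

print(len(broad), len(narrow))
db.close()
```

If the broad filter removes many ordinary tweets, the sample the classifier sees may be smaller than expected, though it should not bias the party labels in any obvious direction.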

In [62]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [64]:

# NLTK data 
#nltk.download('punkt')
#nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

# Clean and tokenize function
def clean_tokenize(text):
    if isinstance(text, bytes):  
        text = text.decode('utf-8')  
    tokens = word_tokenize(text.lower())  
    cleaned_tokens = [word for word in tokens 
                      if word.isalpha() and word not in stop_words]  
    return ' '.join(cleaned_tokens)

# Connect to the congressional tweets database
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

# Execute the query to get the tweets and party affiliation
results = cong_cur.execute('''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results)  

# Close the connection after retrieving the data
cong_db.close()

# Clean and tokenize the tweet data
tweet_data = []
for row in results:
    tweet_text = row[2]  
    party = row[1]  

    cleaned_tweet = clean_tokenize(tweet_text)  
    tweet_data.append([cleaned_tweet, party])  

# Split the data into features and labels
X = [tweet[0] for tweet in tweet_data]  
y = [tweet[1] for tweet in tweet_data]  

# Vectorize the tweets using CountVectorizer
vectorizer = CountVectorizer()
X_vec = vectorizer.fit_transform(X)

# Train a Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_vec, y)

# Test the classifier on a random sample of 10 tweets
random.seed(20201014)
tweet_data_sample = random.choices(tweet_data, k=10)

for tweet, party in tweet_data_sample:
    tweet_vec = vectorizer.transform([tweet])  
    estimated_party = clf.predict(tweet_vec)[0]  
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")


Here's our (cleaned) tweet: earlier today spoke house floor abt protecting health care women praised ppmarmonte work central coast https
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: go tribe rallytogether https
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: apparently trump thinks easy students overwhelmed crushing burden debt pay student loans trumpbudget https
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: grateful first responders rescue personnel firefighters police volunteers working tirelessly keep people safe provide help putting lives line https
Actual party is Republican and our classifier says Democratic.

Here's our (cleaned) tweet: let make even greater kag https
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: cavs tie series repbarbaralee scared roadtovictory
Actual party is Democratic and our c

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [65]:
# Split the data into features and labels
X = [tweet[0] for tweet in tweet_data]  
y = [tweet[1] for tweet in tweet_data]  

# Vectorize the tweets using CountVectorizer
vectorizer = CountVectorizer()
X_vec = vectorizer.fit_transform(X)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=0.3, 
                                                    random_state=42)

# Train a Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Test classifier on a larger dataset and store results
parties = ['Republican', 'Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties:
    for p1 in parties:
        results[p][p1] = 0

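# Number of tweets to score
# Note: tweet_data includes the tweets used to train clf, so the counts
# below mix training and held-out tweets rather than using only X_test.
num_to_score = 10000
random.shuffle(tweet_data)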

for idx, tp in enumerate(tweet_data):
    tweet, actual_party = tp

    # Vectorize the tweet for classification
    tweet_vec = vectorizer.transform([tweet])

    # Estimate the party using the classifier
    estimated_party = clf.predict(tweet_vec)[0]

    # Update the results dictionary with the actual and estimated party
    results[actual_party][estimated_party] += 1

    # Break the loop after scoring the specified number of tweets.
    # The check runs after scoring, so num_to_score + 1 tweets are counted.
    if idx >= num_to_score:
        break

# Display the classification results
results


defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 3337, 'Democratic': 941}),
             'Democratic': defaultdict(int,
                         {'Republican': 943, 'Democratic': 4780})})

### Reflections

There are some interesting results from the Naive Bayes classification model. One key observation is the overall accuracy: of the 10,001 tweets scored, 8,117 (3,337 Republican and 4,780 Democratic) were classified correctly, for an accuracy of about 81%. There were 4,278 Republican tweets, of which 3,337 were classified correctly and 941 were wrongly classified as Democratic, a misclassification rate of about 22%. Similarly, there were 5,723 Democratic tweets, of which 4,780 were classified correctly and 943 were classified as Republican, a misclassification rate of about 16%. This shows that the model is more likely to misclassify Republican tweets as Democratic than the reverse.

Several factors could explain this. The large number of terms common to both parties' tweets, such as words about elections and parties, could confuse the model. There is also some class imbalance in the tweet dataset, with more Democratic tweets than Republican tweets, so the model may favor the Democratic majority class. A class-balanced dataset might produce more even results.
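The rates discussed above can be recomputed directly from the confusion counts in the output. A short sketch, with the counts copied by hand from that output:

```python
# Counts copied from the results output above: results[actual][predicted].
results = {
    "Republican": {"Republican": 3337, "Democratic": 941},
    "Democratic": {"Republican": 943, "Democratic": 4780},
}

total = sum(sum(row.values()) for row in results.values())
correct = sum(results[p][p] for p in results)
accuracy = correct / total

# Share of each party's tweets that were assigned to the other party.
error_rate = {
    p: 1 - results[p][p] / sum(results[p].values()) for p in results
}

print(f"Scored {total} tweets, accuracy {accuracy:.1%}")
for party, rate in error_rate.items():
    print(f"{party}: {rate:.1%} misclassified")
```

Reading the rows as "actual" and the inner keys as "predicted" keeps the denominators straight: each error rate divides by the number of tweets that truly belong to that party.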

There is also room for improvement in this classification model. Balancing the tweet dataset so that the model performs well on both the minority and majority classes could enhance its performance. Using other algorithms, such as SVM, or tuning the hyperparameters of Naive Bayes could also improve the accuracy of the tweet predictions.
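One way to try the balancing idea is to downsample the majority party before training. A minimal sketch, assuming `[text, party]` pairs like those in `tweet_data`; the `downsample` helper and its `seed` parameter are hypothetical, and whether balancing actually helps here would need to be checked on held-out tweets:

```python
import random
from collections import defaultdict


def downsample(labeled_texts, seed=0):
    """Return a shuffled list in which every label has as many items as the rarest label."""
    by_label = defaultdict(list)
    for text, label in labeled_texts:
        by_label[label].append((text, label))

    smallest = min(len(items) for items in by_label.values())
    rng = random.Random(seed)

    balanced = []
    for items in by_label.values():
        # Sample without replacement so no tweet is duplicated.
        balanced.extend(rng.sample(items, smallest))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data in the same [text, party] shape as tweet_data.
toy = [["a", "Democratic"], ["b", "Democratic"], ["c", "Democratic"],
       ["d", "Republican"], ["e", "Republican"]]
balanced = downsample(toy)
print(len(balanced))
```

Downsampling discards data from the larger class, so it trades some training signal for balance; reweighting the classes is another option with the opposite trade-off.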

## References

OpenAI. (2024). ChatGPT (September 29 version) [Large language model]. https://chat.openai.com/