## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details.

In [18]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict
import string

# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns

In [None]:
conn = sqlite3.connect("C:/Users/archa/Desktop/assignment4/2020_Conventions.db")
cur = conn.cursor()

# Check the schema of the 'conventions' table
cur.execute("PRAGMA table_info(conventions);")
columns_info = cur.fetchall()

columns_info


[(0, 'party', 'TEXT', 0, None, 0),
 (1, 'night', 'INTEGER', 0, None, 0),
 (2, 'speaker', 'TEXT', 0, None, 0),
 (3, 'speaker_count', 'INTEGER', 0, None, 0),
 (4, 'time', 'TEXT', 0, None, 0),
 (5, 'text', 'TEXT', 0, None, 0),
 (6, 'text_len', 'TEXT', 0, None, 0),
 (7, 'file', 'TEXT', 0, None, 0)]

### Part 1: Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text 
for each party and prepare it for use in Naive Bayes.  

In [11]:
convention_data = []

query_results = convention_cur.execute("""
    SELECT text, party
    FROM conventions
    WHERE party IN ('Republican', 'Democratic')
""")

for row in query_results:
    speech_text, party = row
    convention_data.append([speech_text, party])



Let's look at some random entries and see if they look right. 

In [12]:
random.choices(convention_data,k=10)

[['When Democrats called our work a token effort and walked out of the room during negotiations, because they wanted the issue more than they wanted a solution. Do we want a society that breeds success or a culture that cancels everything it even slightly disagrees with? I know where I stand because you see, I am living my mother’s American dream. My parents divorced when I was seven years old and we moved in with my grandparents into a two bedroom home, with me, my mom and my brother sharing a room and one bed. My mom worked 16 hours a day to keep food on the table and a roof over our heads. She knew that if we could find the opportunity, bigger things would come. I thought I had to use football to succeed in life and my focus on academics faded away. My freshman year, I failed out, I failed four subjects, Spanish, English, world geography, and even civics. Sen.',
  'Republican'],
 ['Politics and elections can seem like these far away things that one person doesn’t have the power to c

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur at least `word_cutoff` times. Here's the code to test that if you want it. 

In [13]:
word_cutoff = 5

tokens = [w for t, p in convention_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2891 as features in the model.


In [21]:
def conv_features(text,fw) :
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
        
    """
    
    # Your code here
    
    ret_dict = dict()
    # Normalize: lowercase and remove punctuation
    words = text.lower().translate(str.maketrans('', '', string.punctuation)).split()

    for word in words:
        if word in fw:
            ret_dict[word] = True

    return ret_dict

In [23]:
# Define feature words for testing
feature_words = {"donald", "president", "people", "american", "america"}

# Run assertions
assert(len(feature_words) > 0)
assert(conv_features("donald is the president", feature_words) == {'donald': True, 'president': True})
assert(conv_features("people are american in america", feature_words) == {'people': True, 'american': True, 'america': True})



Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [24]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [25]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [26]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.61


In [27]:
classifier.show_most_informative_features(25)

Most Informative Features
                american = True           Republ : Democr =      3.0 : 1.0
                  donald = True           Republ : Democr =      2.5 : 1.0
                 america = True           Republ : Democr =      2.4 : 1.0
               president = True           Republ : Democr =      1.9 : 1.0
                  people = True           Republ : Democr =      1.3 : 1.0
               president = None           Democr : Republ =      1.3 : 1.0
                 america = None           Democr : Republ =      1.2 : 1.0
                american = None           Democr : Republ =      1.2 : 1.0
                  donald = None           Democr : Republ =      1.1 : 1.0
                  people = None           Democr : Republ =      1.1 : 1.0


Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

The classifier reveals several interesting insights into how political language varies between parties. Words like “american,” “donald,” “america,” and “president” appear among the most informative features for identifying Republican speeches, with fairly high likelihood ratios (e.g., “american = True” being 3x more likely in Republican texts). This suggests a strong emphasis on national identity and presidential references in Republican rhetoric.

Conversely, the absence of the same words (e.g., “president = None” or “donald = None”) tends to predict Democratic speeches. This pattern indicates that the Democratic speeches in this dataset may either avoid these terms or use a wider variety of vocabulary that dilutes their frequency.

What’s particularly interesting is that the presence or absence of just a handful of high-profile terms significantly influences classification. This highlights how polarizing or signature terms (like “donald” for Trump) can be effective predictors in political text classification, even with a simple Naive Bayes model. However, the overall test accuracy (~61%) also shows the model struggles to generalize — possibly due to overlapping language or insufficient feature richness.





## Part 2: Classifying Congressional Tweets

In this part we apply the classifer we just built to a set of tweets by people running for congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.

In [29]:
cong_db = sqlite3.connect("C:/Users/archa/Desktop/assignment4/congressional_data.db")
cong_cur = cong_db.cursor()

In [30]:
# List all tables in the database
cur.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cur.fetchall()
print(tables)


[('conventions',)]


In [32]:
cong_cur.execute("PRAGMA table_info(conventions);")  # ✅ Corrected
columns = cong_cur.fetchall()
for col in columns:
    print(col)



In [35]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming

In [36]:
tweet_data = []

for row in results:
    candidate, party, tweet_text = row
    tweet_data.append([tweet_text, party])


There are a lot of tweets here. Let's take a random sample and see how our classifer does. I'm guessing it won't be too great given the performance on the convention speeches...

In [37]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

In [38]:

for tweet, party in tweet_data_sample :
    estimated_party = 'Gotta fill this in'
    # Fill in the right-hand side above with code that estimates the actual party
    
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifer says {estimated_party}.")
    print("")
    

Here's our (cleaned) tweet: b'Earlier today, I spoke on the House Floor abt protecting health care for women and praised @PPmarmonte for their work on the Central Coast. https://t.co/WqgTRzT7VV'
Actual party is Democratic and our classifer says Gotta fill this in.

Here's our (cleaned) tweet: b'Go Tribe! #RallyTogether https://t.co/0NXutFL9L5'
Actual party is Democratic and our classifer says Gotta fill this in.

Here's our (cleaned) tweet: b"Apparently, Trump thinks it's just too easy for students overwhelmed by the crushing burden of debt to pay off student loans #TrumpBudget https://t.co/ckYQO5T0Qh"
Actual party is Democratic and our classifer says Gotta fill this in.

Here's our (cleaned) tweet: b'We\xe2\x80\x99re grateful for our first responders, our rescue personnel, our firefighters, our police, and volunteers who have been working tirelessly to keep people safe, provide much-needed help, while putting their own lives on the line.\n\nhttps://t.co/eZPv0vMIz3'
Actual party is Rep

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [39]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data) :
    tweet, party = tp    
    # Now do the same thing as above, but we store the results rather
    # than printing them. 
   
    # get the estimated party
    estimated_party = "Gotta fill this in"
    
    results[party][estimated_party] += 1
    
    if idx > num_to_score : 
        break

In [40]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 0,
                          'Democratic': 0,
                          'Gotta fill this in': 4278}),
             'Democratic': defaultdict(int,
                         {'Republican': 0,
                          'Democratic': 0,
                          'Gotta fill this in': 5724})})

### Reflections

_Write a little about what you see in the results_ 

The results display a nested defaultdict structure that organizes vote counts by political affiliation specifically Republican and Democratic. Each top-level key contains another dictionary with initialized counts for both parties set to zero, along with a placeholder key 'Gotta fill this in' containing actual vote values: 4278 for Republican and 5724 for Democratic. This indicates that while the framework for storing counts is in place, the main classification or labeling logic is incomplete or pending further processing. The presence of the placeholder key suggests the data is partially processed and may require additional cleaning, validation, or interpretation before analysis.