Parisa Kamizi Assignment 4.1

## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details.

In [150]:
# ChatGPT was utilized on the assignment as a learning tool. 

import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict

# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns

from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
import string
from sklearn.metrics import classification_report
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [58]:
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()
convention_cur

<sqlite3.Cursor at 0x7fbb1806bb90>

In [60]:
convention_db = sqlite3.connect('/Users/parisakamizi/Downloads/2020_Conventions.db')  
convention_cur = convention_db.cursor()

tables = convention_cur.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()
print("Tables in the database:", tables)

Tables in the database: [('conventions',)]


### Part 1: Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text 
for each party and prepare it for use in Naive Bayes.  

In [63]:
# Resources can be found from --> https://docs.python.org/3/library/sqlite3.html
# and, https://www.nltk.org/
 
convention_data = []

# fill this list up with items that are themselves lists. The 
# first element in the sublist should be the cleaned and tokenized
# text in a single string. The second element should be the party. 

query_results = convention_cur.execute(
                            '''
                            SELECT text, party
                            FROM conventions
                            WHERE speaker != 'Unknown'
                            ''')

for row in query_results:
    if row[0]:
        convention_data.append([row[0].strip(), row[1]])

print(convention_data[:5])  

[['I’m here by calling the full session of the 48th Quadrennial National Convention of the Democratic Party to order. Welcome all to our final session of this historic and memorable convention. We’ve called the 48th Quadrennial Democratic National Convention to order.', 'Democratic'], ['Every four years, we come together to reaffirm our democracy. This year, we’ve come to save it.', 'Democratic'], ['We fight for a more perfect union because we are fighting for the soul of this country and for our lives. And right now that fight is real.', 'Democratic'], ['We must come together to defeat Donald Trump, and elect Joe Biden and Kamala Harris as our next President and Vice President.', 'Democratic'], ['Donald Trump is the wrong President for our country. He has had more than enough time to prove that he can do the job, but he is clearly in over his head. He simply cannot be who we need him to be for us. It is what it is.', 'Democratic']]


Let's look at some random entries and see if they look right. 

In [66]:
random.choices(convention_data,k=10)

[['(singing) Mejente, let’s stand by each other. Don’t forget to vote this November. Together we can make a chance. [Spanish 01:07:16] Let’s go.',
  'Democratic'],
 ['In closing, I’d like to speak directly to my father. I miss working alongside you every single day, but I’m damn proud to be on the front lines of this fight. I’m proud of what you’re doing for this country. I’m proud to show my children what their grandfather is fighting for. I’m proud to watch you give them hell. Never stop. Continue to be unapologetic. Keep fighting for what is right. You are making America strong again. You are making America safe again. You are making America proud again. And yes, together with our forgotten men and women who are finally forgotten no more, you are making America great again. Dad, let’s make Uncle Robert very proud this week. Let’s go get another four years. I love you very much. God bless you. And God bless the United States of America.',
  'Republican'],
 ['Ladies and gentlemen, let

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times. Here's the code to test that if you want it. 

In [69]:
word_cutoff = 5

tokens = [w for t, p in convention_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2845 as features in the model.
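
One thing to watch: the vocabulary above is built from raw `t.split()` tokens, while `conv_features` below lowercases the text and strips punctuation before looking words up. A minimal sketch of building the vocabulary with the same normalization, so the lookups line up, might look like this. The helper names `normalize` and `build_feature_words` are hypothetical, and the strict `>` comparison simply mirrors the cell above.

```python
import string
from collections import Counter

# string.punctuation covers ASCII punctuation only; curly quotes such as ’
# would need to be handled separately.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()

def build_feature_words(rows, cutoff=5):
    """Return the words that occur more than `cutoff` times across all texts."""
    counts = Counter(w for text, _party in rows for w in normalize(text))
    return {w for w, c in counts.items() if c > cutoff}

# Toy rows in the same [text, party] shape as convention_data.
sample = [["Vote! Vote, vote.", "Democratic"], ["VOTE early", "Republican"]]
print(build_feature_words(sample, cutoff=3))  # {'vote'}
```

Here "Vote!", "vote." and "VOTE" all collapse to the same token, so the count reflects the word rather than its surface form.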


In [79]:
# Resources can be found from here --> https://docs.python.org/3/library/stdtypes.html#string-methods
# https://docs.python.org/3/library/stdtypes.html#mapping-types-dict
# https://docs.python.org/3/library/stdtypes.html#str.lower
# https://docs.python.org/3/library/stdtypes.html#str.split
# and, https://docs.python.org/3/library/stdtypes.html#str.translate

def conv_features(text,fw) :
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
        
    """
    
    # Your code here
    
    # Case-folding the text and removing punctuation
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    
    # Split the text into words
    words = text.split()
    
    # Create a dictionary for feature words 
    ret_dict = {word: True for word in words if word in fw}
    
    return ret_dict

In [81]:
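# Note: this reassignment replaces the larger `feature_words` set built above,
# so any later cell that uses `feature_words` sees only these five words.
feature_words = {'donald', 'president', 'america', 'american', 'people'}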

assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald':True,'president':True})
assert(conv_features("people are american in america",feature_words)==
                     {'america':True,'american':True,"people":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [85]:
# Resources can be found from here --> https://stackoverflow.com/questions/29438265/stratified-train-test-split-in-scikit-learn
# and https://docs.python.org/3/tutorial/datastructures.html

democratic_speeches = []
republican_speeches = []

for text, party in convention_data:
    if party == 'Democratic':
        democratic_speeches.append(text)
    elif party == 'Republican':
        republican_speeches.append(text)
balanced_data = []

for text in democratic_speeches:
    balanced_data.append((text, 'Democratic'))

for text in republican_speeches:
    balanced_data.append((text, 'Republican'))

print(balanced_data[:5])

[('I’m here by calling the full session of the 48th Quadrennial National Convention of the Democratic Party to order. Welcome all to our final session of this historic and memorable convention. We’ve called the 48th Quadrennial Democratic National Convention to order.', 'Democratic'), ('Every four years, we come together to reaffirm our democracy. This year, we’ve come to save it.', 'Democratic'), ('We fight for a more perfect union because we are fighting for the soul of this country and for our lives. And right now that fight is real.', 'Democratic'), ('We must come together to defeat Donald Trump, and elect Joe Biden and Kamala Harris as our next President and Vice President.', 'Democratic'), ('Donald Trump is the wrong President for our country. He has had more than enough time to prove that he can do the job, but he is clearly in over his head. He simply cannot be who we need him to be for us. It is what it is.', 'Democratic')]


In [89]:
# Resources can be found from here --> https://docs.python.org/3/library/collections.html#collections.Counter
# https://stackoverflow.com/questions/49328319/random-sampling-from-a-column-several-times-in-python-pandas
# https://www.geeksforgeeks.org/text-summarization-in-nlp/
# and, https://stackoverflow.com/questions/1637807/modifying-list-while-iterating

# Step 1: Count speeches for each party
party_counter = Counter(p for t, p in convention_data)
d_count, r_count = party_counter['Democratic'], party_counter['Republican']

# Step 2: Calculate the median text length
median_text = np.median([len(t.split()) for t, p in convention_data])

# Step 3: Filter speeches 
new_convention_data = [
    [text, party] for text, party in convention_data if party == "Republican" or len(text.split()) >= median_text
]

# Step 4: Handle short Democratic speeches
short_texts = [text for text, party in convention_data if party == "Democratic" and len(text.split()) < median_text]
cur_d_count = Counter([p for t, p in new_convention_data])['Democratic']

# Merge short speeches until counts match
while cur_d_count < r_count and short_texts:
    cur_text = ""  
    while len(cur_text.split()) < median_text and short_texts:
        cur_text += " " + short_texts.pop()  
    new_convention_data.append([cur_text.strip(), 'Democratic'])
    cur_d_count += 1

# Step 5: Randomly append remaining short texts if necessary
if cur_d_count == r_count and short_texts:
    for idx, (text, party) in enumerate(new_convention_data):
        if short_texts:
            new_convention_data[idx][0] = " ".join([text, short_texts.pop()])

print(f"Balanced data includes {len(new_convention_data)} speeches.")

Balanced data includes 1941 speeches.


In [91]:
featuresets = [(conv_features(text, feature_words), party) for text, party in new_convention_data]

In [93]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [95]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.552


In [97]:
classifier.show_most_informative_features(25)

Most Informative Features
                american = True           Republ : Democr =      1.7 : 1.0
                  donald = True           Republ : Democr =      1.4 : 1.0
                 america = True           Republ : Democr =      1.3 : 1.0
               president = True           Republ : Democr =      1.2 : 1.0
                  people = True           Democr : Republ =      1.2 : 1.0
               president = None           Democr : Republ =      1.1 : 1.0
                american = None           Democr : Republ =      1.1 : 1.0
                 america = None           Democr : Republ =      1.1 : 1.0
                  donald = None           Democr : Republ =      1.0 : 1.0
                  people = None           Republ : Democr =      1.0 : 1.0
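
Each row in this table is a ratio of smoothed conditional probabilities: how much more often a feature value shows up in one party's training examples than in the other's. A rough sketch of that calculation on made-up counts follows. The add-0.5 smoothing and the helper names are assumptions for illustration, not a reproduction of NLTK's internals.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Additive-smoothed estimate of P(feature value | label)."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """How many times likelier the feature value is under label A than label B."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts: the word appears in 170 of 700 speeches for party A
# and in 100 of 700 speeches for party B.
ratio = likelihood_ratio(170, 700, 100, 700)
print(f"{ratio:.1f} : 1.0")  # 1.7 : 1.0
```

Ratios this close to 1 mean the feature barely separates the two labels, which is consistent with the weak accuracy reported above.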


In [105]:
predictions = [classifier.classify(features) for features, _ in test_set]
true_labels = [label for _, label in test_set]

print(classification_report(true_labels, predictions))

              precision    recall  f1-score   support

  Democratic       0.75      0.15      0.25       249
  Republican       0.53      0.95      0.68       251

    accuracy                           0.55       500
   macro avg       0.64      0.55      0.47       500
weighted avg       0.64      0.55      0.47       500



In [107]:
# Resources can be found here --> https://docs.python.org/3/library/collections.html#collections.Counter

party_counts = Counter([p for t, p in new_convention_data])

print(party_counts)

Counter({'Republican': 986, 'Democratic': 955})
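
With classes this close to balanced, a useful yardstick is the accuracy of always guessing the larger class. A small sketch using the counts printed above (the variable names are illustrative):

```python
from collections import Counter

# Counts copied from the output above.
party_counts = Counter({"Republican": 986, "Democratic": 955})

majority_label, majority_count = party_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(party_counts.values())

print(majority_label, round(baseline_accuracy, 3))  # Republican 0.508
```

Any classifier accuracy should be judged against this baseline rather than against zero.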


Write a little prose here about what you see in the classifier. Anything odd or interesting?


### My Observations

The Naive Bayes classifier reaches only about 55% accuracy on the 500-speech test set, which is barely better than guessing, since the two parties are almost evenly represented (249 Democratic and 251 Republican test speeches).

The model leans heavily toward predicting Republican. Republican recall is 0.95 but precision is only 0.53, so nearly half of its Republican predictions are actually Democratic speeches. Democratic recall is just 0.15: the model misses most Democratic speeches, although the few it does label Democratic are usually right (precision 0.75).

One odd thing stands out in the most informative features: only five words appear (`american`, `donald`, `america`, `president`, `people`), and none has a likelihood ratio above 1.7 : 1. These are the five words from the assertion cell, which reassigned `feature_words` after the 2,845-word vocabulary was built, so the classifier never saw the larger vocabulary. That likely explains much of the weak performance.

To improve the model, I would first restore the full feature set and make sure the vocabulary is built from text that is cleaned the same way `conv_features` cleans it (lowercased, punctuation removed). Beyond that, removing stopwords and experimenting with class weighting could improve recall for Democratic speeches.









## Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.

In [126]:
# Resources can be found from here --> https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# Module 2 and 3 assignment
# and https://www.nltk.org/api/nltk.tokenize.html

stop_words = set(stopwords.words('english'))

def clean_tokenize(text):
    if isinstance(text, bytes):  
        text = text.decode('utf-8')  
    tokens = word_tokenize(text.lower())  
    cleaned_tokens = [word for word in tokens 
                      if word.isalpha() and word not in stop_words]  
    return ' '.join(cleaned_tokens)

In [128]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [130]:
cong_db = sqlite3.connect('/Users/parisakamizi/Downloads/congressional_data.db')  
cong_cur = cong_db.cursor()

tables = cong_db.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()
print("Tables in the database:", tables)

Tables in the database: [('websites',), ('candidate_data',), ('tweets',)]


In [132]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) 

cong_db.close()

In [134]:
# Now fill up tweet_data with sublists like we did on the convention speeches.
# Note that this may take a bit of time, since we have a lot of tweets.
# Resources can be found here --> https://www.nltk.org/api/nltk.tokenize.html
# https://www.analyticsvidhya.com/blog/2022/01/text-cleaning-methods-in-nlp/
# and, https://docs.python.org/3/tutorial/datastructures.html#more-on-lists

tweet_data = []
for row in results:
    tweet_text = row[2]  
    party = row[1]  

    cleaned_tweet = clean_tokenize(tweet_text)  
    tweet_data.append([cleaned_tweet, party])  

In [532]:
# Resources can be found from here --> https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# and, https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

X = [tweet[0] for tweet in tweet_data]  
y = [tweet[1] for tweet in tweet_data]  

# Vectorize the tweets using CountVectorizer
vectorizer = CountVectorizer()
X_vec = vectorizer.fit_transform(X)

# Train a Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_vec, y)

MultinomialNB()

There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [534]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

In [536]:
for tweet, party in tweet_data_sample:
    tweet_vec = vectorizer.transform([tweet])  
    estimated_party = clf.predict(tweet_vec)[0]  
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")

Here's our (cleaned) tweet: earlier today spoke house floor abt protecting health care women praised ppmarmonte work central coast https
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: go tribe rallytogether https
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: apparently trump thinks easy students overwhelmed crushing burden debt pay student loans trumpbudget https
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: grateful first responders rescue personnel firefighters police volunteers working tirelessly keep people safe provide help putting lives line https
Actual party is Republican and our classifier says Democratic.

Here's our (cleaned) tweet: let make even greater kag https
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: cavs tie series repbarbaralee scared roadtovictory
Actual party is Democratic and our c

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [142]:
# Resources can be found from here --> https://stackoverflow.com/questions/29438265/stratified-train-test-split-in-scikit-learn

# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated

# Split the data into features and labels
X = [tweet[0] for tweet in tweet_data]  
y = [tweet[1] for tweet in tweet_data]  

# Vectorize the tweets using CountVectorizer
vectorizer = CountVectorizer()
X_vec = vectorizer.fit_transform(X)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=0.3, 
                                                    random_state=42)

# Train a Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)

parties = ['Republican', 'Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties:
    for p1 in parties:
        results[p][p1] = 0

num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data):
    tweet, actual_party = tp

    # Now do the same thing as above, but we store the results rather
    # than printing them. 
   
    # Vectorize the tweet for classification
    tweet_vec = vectorizer.transform([tweet])

    # get the estimated party
    estimated_party = clf.predict(tweet_vec)[0]

    results[actual_party][estimated_party] += 1

    if idx >= num_to_score:
        break
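
Two details of the loop above are worth flagging: it walks all of `tweet_data`, which includes tweets the model was trained on, and the `idx >= num_to_score` check runs after the count is recorded, so it scores `num_to_score + 1` tweets. A hedged sketch of a scorer that takes an explicit held-out list and an exact cap is below. `score_heldout` and the toy `predict` function are hypothetical stand-ins, not part of the assignment code.

```python
from collections import defaultdict
from itertools import islice

def score_heldout(pairs, predict, limit):
    """Tally results[actual][estimated] over at most `limit` (text, label) pairs."""
    results = defaultdict(lambda: defaultdict(int))
    for text, actual in islice(pairs, limit):
        results[actual][predict(text)] += 1
    return results

# Toy stand-in for calling clf.predict on a vectorized tweet.
def predict(text):
    return "Republican" if "kag" in text.split() else "Democratic"

heldout = [("let make even greater kag", "Republican"),
           ("protecting health care women", "Democratic"),
           ("grateful first responders", "Republican")]

counts = score_heldout(heldout, predict, limit=2)
print(sum(n for row in counts.values() for n in row.values()))  # 2
```

Passing only the held-out tweets as `pairs` keeps training examples out of the tally, and `islice` stops after exactly `limit` items.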

In [146]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 3398, 'Democratic': 977}),
             'Democratic': defaultdict(int,
                         {'Republican': 890, 'Democratic': 4736})})
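
To read the table more easily, the counts can be turned into per-party precision and recall plus overall accuracy. The sketch below hard-codes the counts shown above; the helper name `summarize` is hypothetical.

```python
# Counts copied from the output above: results[actual][estimated].
results = {
    "Republican": {"Republican": 3398, "Democratic": 977},
    "Democratic": {"Republican": 890, "Democratic": 4736},
}

def summarize(results):
    """Return overall accuracy and per-label (precision, recall)."""
    labels = list(results)
    total = sum(sum(row.values()) for row in results.values())
    correct = sum(results[label][label] for label in labels)
    per_label = {}
    for label in labels:
        predicted = sum(results[actual][label] for actual in labels)
        actual_total = sum(results[label].values())
        hits = results[label][label]
        per_label[label] = (hits / predicted, hits / actual_total)
    return correct / total, per_label

accuracy, per_label = summarize(results)
print(f"accuracy={accuracy:.3f}")
for label, (precision, recall) in per_label.items():
    print(f"{label}: precision={precision:.3f} recall={recall:.3f}")
```

Recall answers "of the tweets that really belong to this party, how many were caught?", while precision answers "of the tweets labeled with this party, how many were right?".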

### Reflections

_Write a little about what you see in the results_ 

Based on the results, the classifier does a decent job on the tweets, though there is still room for improvement. Of the 10,001 tweets scored, 8,134 (about 81%) were labeled correctly. For Republican tweets, 3,398 were correctly identified and 977 were mislabeled as Democratic, a recall of about 78%. For Democratic tweets, 4,736 were correctly identified and 890 were mislabeled as Republican, a recall of about 84%. So the classifier actually does somewhat better on Democratic tweets than on Republican ones, which may reflect that there are more Democratic tweets in the sample (5,626 versus 4,375). These numbers are also probably optimistic: the scoring loop draws from all of `tweet_data`, roughly 70% of which was used to train the model, so many scored tweets are ones the classifier has already seen. Scoring only the held-out 30%, balancing the dataset, or trying more sophisticated text processing would give a fairer picture and might improve accuracy.









