
# Assignment 4: Political Naive Bayes

Name: Gabi Rivera \
Course: ADS509-01 \
Date: 29Sep2024

Code Reference: https://chatgpt.com/ and https://colab.research.google.com/

## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details.

In [1]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict

# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns

from nltk.tokenize import word_tokenize
import string
import re
from nltk.corpus import stopwords

In [2]:
# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# Connect to the SQLite database
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

In [4]:
# Check available tables
table_query = convention_cur.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = table_query.fetchall()
print(tables)

[('conventions',)]


In [5]:
# Preview the first 5 rows from the conventions table
query_results = convention_cur.execute("SELECT * FROM conventions LIMIT 5;")
rows = query_results.fetchall()
for row in rows:
    print(row)

('Democratic', 4, 'Unknown', 1, '00:00', 'Skip to content The Company Careers Press Freelancers Blog × Services Transcription Captions Foreign Subtitles Translation Freelancers About Contact Login « Return to Transcript Library home  Transcript Categories  All Transcripts 2020 Election Transcripts Classic Speech Transcripts Congressional Testimony & Hearing Transcripts Debate Transcripts Donald Trump Transcripts Entertainment Transcripts Financial Transcripts Interview Transcripts Political Transcripts Press Conference Transcripts Speech Transcripts Sports Transcripts Technology Transcripts Aug 21, 2020 2020 Democratic National Convention (DNC) Night 4 Transcript Rev  ›  Blog  ›  Transcripts  › 2020 Election Transcripts  ›  2020 Democratic National Convention (DNC) Night 4 Transcript Night 4 of the 2020 Democratic National Convention (DNC) on August 20. Read the full transcript of the event here. Transcribe Your Own Content  Try Rev for free  and save time transcribing, captioning, and

In [6]:
# Fetch one row to inspect column names
query_results = convention_cur.execute("SELECT * FROM conventions LIMIT 1;")
columns = [description[0] for description in query_results.description]
print(columns)

['party', 'night', 'speaker', 'speaker_count', 'time', 'text', 'text_len', 'file']


### Part 1: Exploratory Naive Bayes

We'll first build an NB model on the convention data itself, as a way to understand which words distinguish the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text
for each party and prepare it for use in Naive Bayes.

In [7]:
# Create convention data list
convention_data = []

# SQL query to pull party, speaker, and speech text from Democratic and Republican parties
query_results = convention_cur.execute(
    '''
    SELECT party, speaker, text FROM conventions WHERE party IN ('Democratic', 'Republican');
    '''
)

# Populate convention_data list
for row in query_results:
    party = row[0]
    speaker = row[1]
    speech_text = row[2]
    convention_data.append([party, speaker, speech_text])

In [8]:
# Close the database connection
convention_db.close()

Let's look at some random entries and see if they look right.

In [9]:
# Display some random entries to check the data
random_entries = random.choices(convention_data, k=10)
for entry in random_entries:
    print(entry)

['Republican', 'Rudy Giuliani', 'Although an agreement on action against police brutality would be very valuable for the country, it would also make President Trump appear to be an effective leader. They could have none of that. So, Black Lives Matter and Antifa sprang into action. In a flash, they hijacked the peaceful protests into vicious, brutal riots. Soon, protests turned into riots in many other American cities, almost all Democrat. Businesses were burned and crushed. People beaten, shot, and killed. Police officers routinely assaulted, badly beaten, and occasionally murdered. And, the police handcuffed by progressive Democrat mayors from doing anything, but observe the crimes and absorb the blows.']
['Democratic', 'Kamala Harris', 'We will speak truths and we will act with the same faith in you that we ask you to place in us. We believe that our country, all of us will stand together for a better future. And we already are, we see it in the doctors, the nurses, the home healthc

In [10]:
# Get the list of stopwords
stop_words = set(stopwords.words('english'))

# Create a function to clean and tokenize the speech text
def clean_and_tokenize(text):
    # Tokenize by splitting on whitespace
    tokens = text.split()

    # Keep only purely alphabetic tokens that are not stopwords; tokens with
    # attached punctuation (e.g. "country,") fail isalpha() and are dropped
    cleaned_tokens = [
        token.casefold() for token in tokens
        if token.isalpha() and token.casefold() not in stop_words
    ]

    # Join remaining tokens into a single string
    return ' '.join(cleaned_tokens)

# Clean and format the data
cleaned_data = []
for entry in convention_data:
    party = entry[0]
    speech_text = clean_and_tokenize(entry[2])
    cleaned_data.append(f"{speech_text}, {party}")

# Display some random entries to check the cleaned data
random_entries = random.choices(cleaned_data, k=10)
for entry in random_entries:
    print(entry)

joe always cared military went one generals want share story christmas program playing ave one little girls burst tears teacher ran song played died teacher idea little father fought war night said got better help military, Democratic
watched middle big one kids dropped everything take taught family top watched treat security waiters way would treat made feel special knew understands people make country told make effort get know remember stuck, Republican
vice mike pence held tightly threads freedom woven leading principles alongside president nation experienced prosperity like never, Republican
, Democratic
started tea years american civil civil unrest division separated countrymen two opposing one determined keep people determined see people elizabeth cady stanton lucretia mott felt call fight selected delegates upon told could speak vote july three women met end formed coalition sole purpose gaining right women turn would free fight freedoms women across america united formed activi
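One thing to watch in the cleaning step: because the text is split on whitespace, a token such as `country,` keeps its comma, fails `isalpha()`, and is dropped entirely. The sketch below compares that behavior with a variant that strips punctuation first. `DEMO_STOPWORDS` and both function names are hypothetical stand-ins for illustration, not part of the notebook.

```python
import string

# Hypothetical mini stopword list, standing in for NLTK's English stopwords.
DEMO_STOPWORDS = {"we", "the", "for", "a", "in", "our"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokens_isalpha_only(text):
    """Mirror the cell above: drop any token that is not purely alphabetic."""
    return [
        token.casefold()
        for token in text.split()
        if token.isalpha() and token.casefold() not in DEMO_STOPWORDS
    ]


def tokens_strip_punct(text):
    """Variant: strip punctuation first, then apply the same filters."""
    stripped = (token.translate(_PUNCT_TABLE).casefold() for token in text.split())
    return [t for t in stripped if t.isalpha() and t not in DEMO_STOPWORDS]


sample = "We believe in our country, for a better future."
print(tokens_isalpha_only(sample))  # words next to punctuation are lost
print(tokens_strip_punct(sample))   # those words are kept
```

Whether the second variant is better depends on the analysis; it keeps more words, so the feature count and the classifier results would change.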

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur at least `word_cutoff` times. Here's the code to test that if you want it.

**Build our list of candidate words.**

In [11]:
# Keep words that occur more than `word_cutoff` times (note the strict `>` below)
word_cutoff = 5

# Tokenize and create frequency distribution from cleaned_data
tokens = [w for entry in cleaned_data for w in entry.split(',')[0].split()]
word_dist = nltk.FreqDist(tokens)

feature_words = set()
for word, count in word_dist.items():
    if count > word_cutoff:
        feature_words.add(word)

print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 1776 as features in the model.
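The text says "at least `word_cutoff` times", while the cell uses a strict `count > word_cutoff`, so words that occur exactly 5 times are excluded. A minimal sketch of the difference, using a hypothetical `select_features` helper and made-up tokens:

```python
from collections import Counter


def select_features(tokens, cutoff, inclusive=True):
    """Return the words whose count passes the cutoff.

    inclusive=True keeps words seen at least `cutoff` times (>=);
    inclusive=False keeps words seen more than `cutoff` times (>).
    """
    counts = Counter(tokens)
    if inclusive:
        return {word for word, count in counts.items() if count >= cutoff}
    return {word for word, count in counts.items() if count > cutoff}


# Made-up token list: "tax" occurs exactly 5 times, "vote" 6, "rare" 2.
demo_tokens = ["tax"] * 5 + ["vote"] * 6 + ["rare"] * 2

at_least = select_features(demo_tokens, cutoff=5, inclusive=True)
more_than = select_features(demo_tokens, cutoff=5, inclusive=False)
print(sorted(at_least))
print(sorted(more_than))
```

Switching the notebook to `>=` would add every word with a count of exactly 5, so the 1776 figure above would grow.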


In [12]:
# Define convention features
def conv_features(text, fw):
    """Given some text, this returns a dictionary holding the
       feature words.

       Args:
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word
            in `text` must be in fw in order to be returned. This
            prevents us from considering very rarely occurring words.

       Returns:
            A dictionary with the words in `text` that appear in `fw`.
            Words are only counted once.
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of
            {'quick' : True,
             'fox' :    True}
    """

    # Split the text into words
    tokens = text.split()

    # Create a dictionary to hold the feature words found in text
    ret_dict = {word: True for word in set(tokens) if word in fw}

    return ret_dict

In [13]:
# Assertions to test the function
assert len(feature_words) > 0
assert conv_features("donald is the president", feature_words) == {'donald': True, 'president': True}
assert conv_features("people are american in america", feature_words) == {'america': True, 'american': True, 'people': True}

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory.

In [14]:
# Create feature sets using cleaned_data
featuresets = [
    (conv_features(text.split(',')[0].strip(), feature_words), text.split(',')[1].strip())
    for text in cleaned_data
]

# Display some random entries to check the feature sets
random_entries = random.choices(featuresets, k=10)
for entry in random_entries:
    print(entry)

({'always': True}, 'Democratic')
({}, 'Democratic')
({'treat': True, 'matter': True, 'everyone': True}, 'Democratic')
({'vice': True}, 'Republican')
({'creating': True, 'understands': True, 'biden': True, 'joe': True, 'rule': True, 'really': True, 'move': True, 'best': True, 'us': True, 'toward': True, 'respects': True, 'wants': True, 'perfect': True}, 'Democratic')
({'hope': True, 'american': True, 'ever': True, 'left': True}, 'Republican')
({'august': True, 'moved': True, 'special': True, 'obama': True, 'went': True, 'kayla': True, 'army': True, 'learned': True, 'rescue': True, 'white': True, 'brutal': True, 'administration': True, 'everything': True, 'prepared': True, 'house': True, 'telling': True, 'president': True, 'something': True, 'operation': True, 'named': True, 'looking': True, 'months': True, 'mission': True, 'thank': True, 'time': True, 'military': True, 'difference': True, 'another': True, 'forward': True, 'kept': True, 'us': True, 'isis': True}, 'Republican')
({'part': 
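The feature sets above are rebuilt by splitting a `"text, party"` string on the comma, which works only because the cleaned text contains no commas. A sketch of an alternative that keeps text and label as a tuple throughout is shown here; `presence_features` and `build_featuresets` are hypothetical names, and the pairs are made up.

```python
def presence_features(text, feature_words):
    """Map each feature word present in `text` to True (counted once)."""
    return {word: True for word in set(text.split()) if word in feature_words}


def build_featuresets(pairs, feature_words):
    """Turn (cleaned_text, label) pairs into (feature_dict, label) pairs."""
    return [(presence_features(text, feature_words), label) for text, label in pairs]


# Made-up examples, including an empty speech like the ", Democratic" row above.
demo_pairs = [
    ("law enforcement china", "Republican"),
    ("climate votes", "Democratic"),
    ("", "Democratic"),
]
demo_feature_words = {"enforcement", "china", "climate", "votes"}

demo_sets = build_featuresets(demo_pairs, demo_feature_words)
for features, label in demo_sets:
    print(features, label)
```

Empty texts yield empty feature dictionaries, as in the `({}, 'Democratic')` entry above; filtering them out before training is an option, though the notebook keeps them.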

In [15]:
# Shuffle and split into training and test sets
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500
test_set, train_set = featuresets[:test_size], featuresets[test_size:]

In [16]:
# Train the Naive Bayes classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [17]:
# Evaluate accuracy
accuracy = nltk.classify.accuracy(classifier, test_set)
print(f"Accuracy: {accuracy*100:.1f}%")

Accuracy: 50.2%
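To judge whether 50.2% is good or bad, it helps to compare it with a majority-class baseline: always predict the most common training label. The sketch below uses a hypothetical helper and made-up labels, since the actual class counts in the split are not shown here.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy from always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)


# Made-up labels for illustration only.
demo_train = ["Republican"] * 6 + ["Democratic"] * 4
demo_test = ["Republican", "Democratic", "Republican", "Democratic", "Democratic"]

baseline_label, baseline_acc = majority_baseline_accuracy(demo_train, demo_test)
print(f"Always predicting {baseline_label} gives {baseline_acc:.1%} accuracy")
```

In the notebook, the labels could be read off with `[label for _, label in train_set]` and `[label for _, label in test_set]`.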


In [18]:
# Show the most informative features
classifier.show_most_informative_features(25)

Most Informative Features
             enforcement = True           Republ : Democr =     27.5 : 1.0
                   china = True           Republ : Democr =     26.5 : 1.0
                   votes = True           Democr : Republ =     21.6 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                 destroy = True           Republ : Democr =     17.1 : 1.0
                supports = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     15.9 : 1.0
                preserve = True           Republ : Democr =     15.1 : 1.0
                  signed = True           Republ : Democr =     15.1 : 1.0
                freedoms = True           Republ : Democr =     14.0 : 1.0
                 abraham = True           Republ : Democr =     11.9 : 1.0
                 private = True           Republ : Democr =     11.9 : 1.0
                    drug = True           Republ : Democr =     10.9 : 1.0

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

It's odd to me that the most informative features skew so heavily Republican: only 4 of the 25 features are led by Democrats. This might mean that Republican text accounts for a larger share of the distinctive word counts in the dataset. Also, the accuracy score is about 50%, so the classifier did not perform any better than guessing between the two parties at random. \
When it comes to patterns, the words most indicative of Republican speakers align with the party's rhetoric, which focused on law enforcement, concern over China, celebration of veterans, and similar themes. Although there are noticeably fewer key Democratic features, the words "climate" and "votes" also align with the party's political focus: a progressive agenda on climate change and encouragement of voter participation both reflect the Democratic Party's perspective.
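For context on how to read ratios such as `27.5 : 1.0`, here is a rough sketch of a smoothed likelihood ratio. It assumes add-0.5 (expected likelihood) smoothing with two possible feature values, present or absent, which I believe matches NLTK's default estimator but have not verified against this notebook's run. The function names and counts are hypothetical.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Lidstone-smoothed estimate of P(feature present | label).

    Assumption: gamma=0.5 and bins=2 (feature present or absent).
    """
    return (count + gamma) / (total + gamma * bins)


def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of the larger smoothed probability to the smaller one."""
    p_a = smoothed_prob(count_a, total_a)
    p_b = smoothed_prob(count_b, total_b)
    return max(p_a, p_b) / min(p_a, p_b)


# Made-up counts: a word used in 27 of 1000 speeches by one party
# and in none of 1000 speeches by the other.
ratio = likelihood_ratio(count_a=27, total_a=1000, count_b=0, total_b=1000)
print(f"{ratio:.1f} : 1.0")
```

Under these assumptions, a word that one party never uses still gets a finite ratio, and a few dozen occurrences are enough to produce a large one. High ratios therefore flag words that are distinctive, not necessarily words that are frequent.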



### Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and
is unindexed, so the query takes a minute or two to run on my machine.

In [19]:
# Connect to the SQLite database
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [20]:
# Check available tables
cong_cur.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cong_cur.fetchall()
print("Available tables:", tables)

Available tables: [('websites',), ('candidate_data',), ('tweets',)]


In [21]:
# Function to preview rows from a specified table
def preview_table(table_name):
    cong_cur.execute(f"SELECT * FROM {table_name} LIMIT 5;")
    sample_rows = cong_cur.fetchall()
    print(f"Sample Rows from {table_name}:", sample_rows)

# Preview rows from each table
preview_table('websites')
preview_table('candidate_data')
preview_table('tweets')

Sample Rows from websites: []
Sample Rows from candidate_data: [(0, 'alex', 'Alabama', '5', 5, 'AL', 'AL05', 'Mo Brooks', 'Republican', 'https://brooks.house.gov/', 'RepMoBrooks', 'T', 64.0, 'M', 'Married', 'T', 'F', 'F', 'R+18', 'T', '65.53', 51960.0, 'S'), (1, 'alex', 'Alabama', '5', 5, 'AL', 'AL05', 'Peter Joffrion', 'Democratic', 'https://www.peterjoffrion.com/', 'peter_joffrion', 'F', 60.0, 'M', 'Single', 'T', 'F', 'F', 'R+18', 'T', '65.53', 51960.0, 'S'), (2, 'alex', 'California', '10', 10, 'CA', 'CA10', 'Jeff Denham', 'Republican', 'https://denham.house.gov/', 'RepJeffDenham', 'T', 51.0, 'M', 'Married', 'T', 'F', 'F', 'EVEN', 'T', None, 80817.0, 'W'), (3, 'alex', 'California', '10', 10, 'CA', 'CA10', 'Josh Harder', 'Democratic', 'https://www.harderforcongress.com/', 'joshua_harder', 'F', 30.0, 'M', 'Married', 'T', 'F', 'F', 'EVEN', 'T', None, 80817.0, 'W'), (4, 'alex', 'California', '27', 27, 'CA', 'CA27', 'Judy Chu', 'Democratic', 'https://chu.house.gov/', 'RepJudyChu', 'T', 65

In [22]:
# Function to preview column names for a specified table
def preview_columns(table_name):
    cong_cur.execute(f"PRAGMA table_info({table_name});")
    columns = cong_cur.fetchall()
    column_names = [column[1] for column in columns]
    print(f"Column Names from {table_name}:", column_names)

# Preview columns from each table
preview_columns('websites')
preview_columns('candidate_data')
preview_columns('tweets')

Column Names from websites: ['district', 'candidate', 'pull_time', 'url', 'site_text']
Column Names from candidate_data: ['index', 'student', 'state', 'district_num', 'formatted_dist_num', 'abbrev', 'district', 'candidate', 'party', 'website', 'twitter_handle', 'incumbent', 'age', 'gender', 'marital_status', 'white_non_hispanic', 'hispanic', 'black', 'partisian_lean_pvi', 'opposed', 'pct_urban', 'income', 'region']
Column Names from tweets: ['district', 'candidate', 'pull_time', 'tweet_time', 'handle', 'is_retweet', 'tweet_id', 'tweet_text', 'likes', 'replies', 'retweets', 'tweet_ratio']


In [23]:
# Execute the query
results = cong_cur.execute(
    '''
       SELECT DISTINCT
              cd.candidate,
              cd.party,
              tw.tweet_text
       FROM candidate_data cd
       INNER JOIN tweets tw ON cd.twitter_handle = tw.handle
           AND cd.candidate = tw.candidate
           AND cd.district = tw.district
       WHERE cd.party IN ('Republican','Democratic')
           AND tw.tweet_text NOT LIKE '%RT%'
    '''
)

# Store the results in a list
results = list(results)

# Display some sample results
for row in results[:5]:
    print(row)

('Mo Brooks', 'Republican', b'"Brooks Joins Alabama Delegation in Voting Against Flawed Funding Bill" http://t.co/3CwjIWYsNq')
('Mo Brooks', 'Republican', b'"Brooks: I Do Not Support America Raising, Training, and Arming a \nRebel Army to Fight in Syria\xe2\x80\x99s Civil War" http://t.co/f2QFErMkD4')
('Mo Brooks', 'Republican', b'"Brooks: Senate Democrats Allowing President to Give Americans\xe2\x80\x99 Jobs to Illegals" #securetheborder https://t.co/mZtEaX8xS6')
('Mo Brooks', 'Republican', b'"NASA on the Square" event this Sat. 11AM \xe2\x80\x93 4PM. Stop by &amp; hear about the incredible work done in #AL05! @DowntownHSV http://t.co/R9zY8WMEpA')
('Mo Brooks', 'Republican', b'"Rep. Mo Brooks: NDAA Amnesty Amendment \xe2\x80\x98Betrays Americans\xe2\x80\x99" via @BreitbartNews http://t.co/aflHYdUkuF')


In [24]:
# Define preprocessing step for tweet comments
def preprocess_tweet(tweet):
    # Remove HTTP and HTTPS links using regex
    tweet = re.sub(r'https?://\S+', '', tweet)

    # Tokenize on whitespace
    tokens = tweet.split()

    # Remove punctuation and lowercase the tokens
    tokens = [w.translate(str.maketrans('', '', string.punctuation)).lower() for w in tokens]

    # Remove tokens that fail the isalpha test
    tokens = [w for w in tokens if w.isalpha()]

    # Remove stopwords (reuses the module-level `stop_words` set built earlier
    # instead of rebuilding it for every tweet)
    tokens = [w for w in tokens if w not in stop_words]

    # Join the remaining tokens into a string
    cleaned_text = ' '.join(tokens)

    return cleaned_text

# Initialize the list to hold tweet data
tweet_data = []

# Fill up tweet_data with cleaned and tokenized sublists
for row in results:
    candidate = row[0]
    party = row[1]
    tweet_text_bytes = row[2]

    # Decode the tweet text
    tweet_text = tweet_text_bytes.decode('utf-8')

    # Preprocess the tweet text
    cleaned_tweet = preprocess_tweet(tweet_text)

    # Append a sublist to tweet_data
    tweet_data.append([cleaned_tweet, party])

# Display some random entries to check the cleaned data
random_entries = random.choices(tweet_data, k=10)
for entry in random_entries:
    print(entry)

['good morning amjalexjohnson', 'Democratic']
['rt alexckaufman top democrats repdonbeyer rep gerryconnolly ask epas busy inspector general open investigation', 'Democratic']
['rt youngest canvasser day nico joshuaharder swingleft', 'Democratic']
['patients ask federal government permission save lives great see righttotry signed law congratulations one biggest champions state rep nickzerwas able white house signing ceremony', 'Republican']
['join markreardonkmox yesterday discuss taxreform important economy amp middleincome families also collected friendly bet goraiders', 'Republican']
['focus kind america want friends neighbors children grandchildren world applaud rep joe kennedy focusing positive vision future', 'Democratic']
['im happy push muchneeded work beltrami island state forest public use areas outdoors huge part minnesota life atv snowmobile access important recreation economy seventh', 'Democratic']
['live cbakershow kfabnews', 'Republican']
['support supporting grandparent

There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [25]:
random.seed(20201014)

# Sample 10 random tweets from the tweet_data
tweet_data_sample = random.choices(tweet_data, k=10)

In [26]:
for tweet, party in tweet_data_sample:
    # Estimate the party from the feature words present in the tweet
    estimated_party = classifier.classify(conv_features(tweet, feature_words))

    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")

Here's our (cleaned) tweet: mass shooting las vegas horrific act violence victims families thoughts prayers
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: early morning traveltuesday leaving dc
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: moderates iraq amp syria civilians weve enemies sides conflict assist either
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: rt natsecaction national security veterans demanding answers release confidential national security
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: buildthatwall
Actual party is Republican and our classifier says Democratic.

Here's our (cleaned) tweet: glad attend assure everyone could majority americans still stand traditional allies
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: cnn everyone wraps flag patrioti

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [27]:
# Initialize the dictionary of counts by actual party and estimated party
parties = ['Republican', 'Democratic']
results = defaultdict(lambda: defaultdict(int))

# Set the number of tweets to score
num_to_score = 10000
random.shuffle(tweet_data)

# Score the tweets
for idx, tp in enumerate(tweet_data):
    tweet, party = tp

    # Get the estimated party
    estimated_party = classifier.classify(conv_features(tweet, feature_words))

    # Store the results in the dictionary
    results[party][estimated_party] += 1

    # Stop after scoring the requested number of tweets. Note that idx is
    # zero-based, so this condition actually scores num_to_score + 1 tweets.
    if idx >= num_to_score:
        break

In [28]:
# Display the results
for actual_party in parties:
    print(f"Actual Party: {actual_party}")
    for estimated_party in parties:
        print(f"  Estimated {estimated_party}: {results[actual_party][estimated_party]}")

Actual Party: Republican
  Estimated Republican: 3580
  Estimated Democratic: 560
Actual Party: Democratic
  Estimated Republican: 5028
  Estimated Democratic: 833
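The four counts above sum to 10,001, not 10,000, because the loop breaks only after `idx` reaches `num_to_score`. A sketch of one way to score exactly `n` items is shown here using `itertools.islice`; `score_first_n` and `toy_classify` are hypothetical stand-ins, and the pairs are made up.

```python
from collections import defaultdict
from itertools import islice


def score_first_n(pairs, classify, n):
    """Tally actual-vs-estimated labels for exactly the first n pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for text, actual in islice(pairs, n):
        counts[actual][classify(text)] += 1
    return counts


def toy_classify(text):
    """Stand-in classifier for illustration only."""
    return "Republican" if "wall" in text else "Democratic"


demo_pairs = [
    ("build wall", "Republican"),
    ("climate vote", "Democratic"),
    ("tax cuts", "Republican"),
    ("wall street", "Democratic"),
]

demo_counts = score_first_n(demo_pairs, toy_classify, n=3)
total_scored = sum(sum(row.values()) for row in demo_counts.values())
print(total_scored)
```

In the notebook, `classify` could be `lambda t: classifier.classify(conv_features(t, feature_words))` and `pairs` the shuffled `tweet_data`.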


### Reflections

Accuracy on the tweets is about 44% (3,580 + 833 = 4,413 correct out of 10,001 scored), slightly worse than the 50.2% on the convention speeches. The classifier performed poorly at identifying Democratic text: it labeled about 86% of all tweets Republican, so only 833 of the 5,861 Democratic tweets (about 14%) were classified correctly. There appears to be a class imbalance in both corpora, so it would be better to balance the dataset before training the classification model and then check whether accuracy improves. As it stands, the Naive Bayes classification model struggles to distinguish the two parties.
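The confusion counts printed above can be turned into accuracy and per-party recall and precision directly. The sketch below hard-codes those four counts; the helper names are my own.

```python
# Counts copied from the output above: confusion[actual][estimated].
confusion = {
    "Republican": {"Republican": 3580, "Democratic": 560},
    "Democratic": {"Republican": 5028, "Democratic": 833},
}


def accuracy(cm):
    """Share of all scored tweets whose estimated party matches the actual one."""
    correct = sum(cm[label][label] for label in cm)
    total = sum(sum(row.values()) for row in cm.values())
    return correct / total


def recall(cm, label):
    """Share of tweets from `label` that were classified as `label`."""
    return cm[label][label] / sum(cm[label].values())


def precision(cm, label):
    """Share of tweets classified as `label` that actually belong to it."""
    predicted = sum(cm[actual][label] for actual in cm)
    return cm[label][label] / predicted


print(f"Accuracy: {accuracy(confusion):.1%}")
for party in confusion:
    print(f"{party}: recall {recall(confusion, party):.1%}, "
          f"precision {precision(confusion, party):.1%}")
```

Looking at recall per party makes the skew toward Republican predictions visible in a way a single accuracy number does not.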