Katie Mears

# Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details. You can download the required DB from the shared Dropbox or from Blackboard.

In [1]:
import sqlite3
import nltk
from nltk.tokenize import sent_tokenize
import random
import numpy as np
from collections import Counter, defaultdict
import os
from string import punctuation
import pandas as pd
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns
nltk.download('stopwords')
nltk.download('punkt')

# Load English stopwords
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\katie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\katie\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Change the working directory and confirm that the DB is in it.

In [2]:
os.chdir('C:/Users/katie/OneDrive/Documents/ADS509_Assignment4_Repo')
print("Current Working Directory:", os.getcwd())

Current Working Directory: C:\Users\katie\OneDrive\Documents\ADS509_Assignment4_Repo


In [3]:
# Verify if the database file exists
db_file = "2020_Conventions.db"
print("Does the database file exist?", os.path.isfile(db_file))

Does the database file exist? True


In [4]:
convention_db = sqlite3.connect(db_file)  # db_file was verified in the working directory above
convention_cur = convention_db.cursor()

## Part 1: Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" exercise. First, we'll pull in the text
for each party and prepare it for use in Naive Bayes.

In [5]:
# List all tables in the database
tables = convention_cur.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()
print(tables)

[('conventions',)]


In [6]:
# View schema of the conventions table
schema = convention_cur.execute("PRAGMA table_info(conventions);").fetchall()
for column in schema:
    print(column)

(0, 'party', 'TEXT', 0, None, 0)
(1, 'night', 'INTEGER', 0, None, 0)
(2, 'speaker', 'TEXT', 0, None, 0)
(3, 'speaker_count', 'INTEGER', 0, None, 0)
(4, 'time', 'TEXT', 0, None, 0)
(5, 'text', 'TEXT', 0, None, 0)
(6, 'text_len', 'TEXT', 0, None, 0)
(7, 'file', 'TEXT', 0, None, 0)


In [7]:
# View Sample of the data 
convention_cur.execute("SELECT * FROM conventions LIMIT 5;")
rows = convention_cur.fetchall()
for row in rows:
    print(row)

('Democratic', 4, 'Unknown', 1, '00:00', 'Skip to content The Company Careers Press Freelancers Blog × Services Transcription Captions Foreign Subtitles Translation Freelancers About Contact Login « Return to Transcript Library home  Transcript Categories  All Transcripts 2020 Election Transcripts Classic Speech Transcripts Congressional Testimony & Hearing Transcripts Debate Transcripts Donald Trump Transcripts Entertainment Transcripts Financial Transcripts Interview Transcripts Political Transcripts Press Conference Transcripts Speech Transcripts Sports Transcripts Technology Transcripts Aug 21, 2020 2020 Democratic National Convention (DNC) Night 4 Transcript Rev  ›  Blog  ›  Transcripts  › 2020 Election Transcripts  ›  2020 Democratic National Convention (DNC) Night 4 Transcript Night 4 of the 2020 Democratic National Convention (DNC) on August 20. Read the full transcript of the event here. Transcribe Your Own Content  Try Rev for free  and save time transcribing, captioning, and

In [8]:
convention_data = []

# fill the above list up with items that are themselves lists. The
# sublists will have two elements. The first element in the sublist
# should be the speech in a single string. The second element
# of the sublist should be the party.

query_results = convention_cur.execute(
                            '''
                            SELECT text, party 
                            FROM conventions
                            WHERE party != 'Other';
                            ''')

for row in query_results :
    # store the results in convention_data
    speech, party = row
    convention_data.append([speech, party])



In [9]:
# Convert to DataFrame
df_convention = pd.DataFrame(convention_data, columns=['text', 'party'])

# Display the first few rows of the DataFrame
print(df_convention.head())

                                                text       party
0  Skip to content The Company Careers Press Free...  Democratic
1  I’m here by calling the full session of the 48...  Democratic
2  Every four years, we come together to reaffirm...  Democratic
3  We fight for a more perfect union because we a...  Democratic
4  We must come together to defeat Donald Trump, ...  Democratic


In [10]:
# it's a best practice to close up your DB connection when you're done
convention_db.close()

Let's look at some random entries and see if they look right.

In [11]:
random.choices(convention_data,k=5)

[['We need a president who stands up for America, not one who takes a knee. A strong and proud America is a safe America, safe from our enemies and safe from war. No one who’s seen the face of war desires to see it again. Too many of our fellow Americans or already honored at the hallowed grounds of Arlington. But if we want peace, we must be strong. Weakness is provocative. President Trump’s strength has kept us out of war. Joe Biden won’t stand up for America. Donald Trump will. So this November let’s stand with the President and vote to keep America great.',
  'Republican'],
 ['Our military is now better equipped, better resourced and better manned than any military in the world. President Trump demolished the terrorist ISIS caliphate in the Middle East and eliminated its leader, al-Baghdadi, one of the world’s most brutal terrorists. President Trump took decisive action against Iranian terrorist mastermind, Qasem Soleimani, a man responsible for deaths of hundreds of American servi

It'll be useful for us to have a larger sample size than 2024 affords, since those speeches tend to be long and contiguous. Let's make a new list-of-lists called `conv_sent_data`. Instead of each first entry in the sublists being an entire speech, make each first entry just a sentence from the speech. Feel free to use NLTK's `sent_tokenize` [function](https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html).

In [12]:
conv_sent_data = []

# Regex stand-in for sent_tokenize: split on whitespace that follows ., ! or ?,
# but not after abbreviation-like text such as "e.g." or "Mr."
sentence_pattern = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\!|\?)\s'

for speech, party in convention_data :
    sentences = re.split(sentence_pattern, speech)
    for sentence in sentences:
        sentence = sentence.strip()
        if sentence:  # skip empty pieces
            conv_sent_data.append([sentence, party])


Again, let's look at some random entries.

In [13]:
random.choices(conv_sent_data,k=5)

[['I’ve said from the outset of this election that we’re in the battle for the soul of this nation.',
  'Democratic'],
 ['When the field was clear for him to run for the Senate, he chose to finish his job as AG instead.',
  'Democratic'],
 ['In North Korea, the president lowered the temperature and against all odds got the North Korean leadership to the table.',
  'Republican'],
 ['Providing unyielding support for our troops, combating crime and violence against women, leading our quest to cure cancer and safeguarding the landmark American recovery and Reinvestment Act from corruption.',
  'Democratic'],
 ['Imagine what we could achieve.', 'Democratic']]

Now it's time for our final cleaning before modeling. Go through `conv_sent_data` and take the following steps:

1. Tokenize on whitespace
1. Remove punctuation
1. Remove tokens that fail the `isalpha` test
1. Remove stopwords
1. Casefold to lowercase
1. Join the remaining tokens into a string


In [14]:
clean_conv_sent_data = [] # list of tuples (sentence, party), with sentence cleaned

for sentence, party in conv_sent_data :
    # tokenize on whitespace and strip leading/trailing punctuation
    tokens = [token.strip(string.punctuation) for token in sentence.split()]
    # keep alphabetic tokens only and lowercase them
    tokens = [token.lower() for token in tokens if token.isalpha()]
    # remove stopwords
    tokens = [token for token in tokens if token not in stop_words]
    cleaned_sentence = ' '.join(tokens)
    clean_conv_sent_data.append((cleaned_sentence, party))

random.choices(clean_conv_sent_data,k=5)

[('work believe goodness america promise men women created equal watching tonight betting',
  'Republican'),
 ('yet every often pace various generations compelled resurrect give rebirth providential beginning renew present days exuberance founding days',
  'Republican'),
 ('make america safe', 'Republican'),
 ('going talk something close heart', 'Democratic'),
 ('result seen smallest economic contraction major western nation recovering much faster rate anybody',
  'Republican')]

If that looks good, let's make our function to turn these into features. First we need to build our list of candidate words. I started my exploration at a cutoff of 5.

In [15]:
word_cutoff = 5

tokens = [w for t, p in clean_conv_sent_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)

print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2239 as features in the model.


In [16]:

def conv_features(text,fw) :
    """Given some text, this returns a dictionary holding the
       feature words.

       Args:
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word
            in `text` must be in fw in order to be returned. This
            prevents us from considering very rarely occurring words.

       Returns:
            A dictionary with the words in `text` that appear in `fw`.
            Words are only counted once.
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of
            {'quick' : True,
             'fox' :    True}

    """

    ret_dict = {}

    for word in set(text.split()):
        if word in fw:
            ret_dict[word] = True

    return(ret_dict)

In [17]:
assert(len(feature_words)>0)
assert(conv_features("obama was the president",feature_words)==
       {'obama':True,'president':True})
assert(conv_features("some people in america are citizens",feature_words)==
                     {'people':True,'america':True,"citizens":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory.

In [18]:
# (features, party) pairs; the source here is convention_data (full, uncleaned speeches)
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]
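
The cell above draws on `convention_data`, so tokens that still carry capitals or punctuation cannot match the lowercase feature words. For comparison, here is a sketch with toy data of applying the same feature function to cleaned, sentence-level pairs such as those in `clean_conv_sent_data`. The names `toy_clean_sents` and `toy_feature_words` are made up for illustration and are not part of the notebook.

```python
def conv_features(text, fw):
    """Return {word: True} for each distinct word in `text` that is in `fw`."""
    return {word: True for word in set(text.split()) if word in fw}

# Hypothetical stand-ins for clean_conv_sent_data and feature_words
toy_clean_sents = [
    ("make america safe", "Republican"),
    ("going talk something close heart", "Democratic"),
]
toy_feature_words = {"america", "safe", "heart"}

# One (features, party) pair per cleaned sentence
toy_featuresets = [
    (conv_features(text, toy_feature_words), party)
    for text, party in toy_clean_sents
]

for features, party in toy_featuresets:
    print(party, sorted(features))
```

In the notebook this would amount to iterating over `clean_conv_sent_data` instead of `convention_data`. The recorded outputs below were not produced that way.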

In [19]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [20]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.498
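
An accuracy near 0.5 on a two-class problem is hard to interpret without a baseline. Here is a minimal sketch of the majority-class baseline, using made-up labels rather than the notebook's test set. The helper `majority_baseline` and the list `toy_labels` are illustrative names.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical labels: 6 Democratic and 4 Republican examples
toy_labels = ["Democratic"] * 6 + ["Republican"] * 4
print(majority_baseline(toy_labels))
```

In the notebook, the labels could be taken from the test set with `[party for _, party in test_set]`. A classifier that scores below this baseline is doing worse than always guessing the larger class.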


In [21]:
classifier.show_most_informative_features(25)

Most Informative Features
             enforcement = True           Republ : Democr =     27.5 : 1.0
                   votes = True           Democr : Republ =     21.6 : 1.0
                 climate = True           Democr : Republ =     17.3 : 1.0
                 destroy = True           Republ : Democr =     17.1 : 1.0
                supports = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     15.9 : 1.0
                preserve = True           Republ : Democr =     15.1 : 1.0
                  signed = True           Republ : Democr =     15.1 : 1.0
              appreciate = True           Republ : Democr =     14.0 : 1.0
                freedoms = True           Republ : Democr =     14.0 : 1.0
                 private = True           Republ : Democr =     11.9 : 1.0
                  defund = True           Republ : Democr =     10.9 : 1.0
                    drug = True           Republ : Democr =     10.3 : 1.0

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

It is interesting to see which words each party is more likely to use. Words like "enforcement", "destroy" and "media" appear to be far more prevalent in Republican speeches than in Democratic ones. On the other hand, "climate" and "votes" were more prevalent in Democratic speeches than in Republican ones. It is also interesting that most of the top features lean Republican: 22 of the 25 most informative features were more prevalent in Republican speeches than in Democratic speeches.
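
The ratios printed by `show_most_informative_features` compare how likely a feature is under one party versus the other. Below is a rough sketch of that calculation with hypothetical counts. It assumes add-0.5 smoothing over two outcomes (word present or absent); that choice is meant to resemble NLTK's default estimator and is not a reproduction of its internals.

```python
def smoothed_prob(count, n_docs, gamma=0.5, bins=2):
    """Smoothed estimate of P(word present | party); gamma and bins are assumptions."""
    return (count + gamma) / (n_docs + gamma * bins)

def likelihood_ratio(count_a, n_a, count_b, n_b):
    """How many times more likely the word is under party A than under party B."""
    return smoothed_prob(count_a, n_a) / smoothed_prob(count_b, n_b)

# Hypothetical counts: the word appears in 30 of 1000 Republican documents
# and in 1 of 1000 Democratic documents
ratio = likelihood_ratio(30, 1000, 1, 1000)
print(f"{ratio:.1f} : 1.0")
```

Because the ratio is relative, a word that is rare in both parties can still rank highly if one party almost never uses it. A high ratio therefore does not by itself mean the word is frequent.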



## Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and
is unindexed, so the query takes a minute or two to run on my machine.

In [22]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [23]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT
                  cd.candidate,
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle
               AND cd.candidate == tw.candidate
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic')
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming

In [24]:
tweet_data = []

for candidate, party, tweet_text in results:
    tweet_data.append([tweet_text, party])  

print(tweet_data[:5])

[[b'"Brooks Joins Alabama Delegation in Voting Against Flawed Funding Bill" http://t.co/3CwjIWYsNq', 'Republican'], [b'"Brooks: Senate Democrats Allowing President to Give Americans\xe2\x80\x99 Jobs to Illegals" #securetheborder https://t.co/mZtEaX8xS6', 'Republican'], [b'"NASA on the Square" event this Sat. 11AM \xe2\x80\x93 4PM. Stop by &amp; hear about the incredible work done in #AL05! @DowntownHSV http://t.co/R9zY8WMEpA', 'Republican'], [b'"The trouble with Socialism is that eventually you run out of other people\'s money." - Margaret Thatcher https://t.co/X97g7wzQwJ', 'Republican'], [b'"The trouble with socialism is eventually you run out of other people\'s money" \xe2\x80\x93 Thatcher. She\'ll be sorely missed. http://t.co/Z8gBnDQUh8', 'Republican']]


There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [25]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

In [26]:
# Define stopwords
stop_words = set(stopwords.words('english'))

# Function to clean and tokenize text
def clean_tokenize(text):
    if isinstance(text, bytes):  
        text = text.decode('utf-8')  
    text = text.lower()  
    text = text.translate(str.maketrans('', '', string.punctuation)) 
    tokens = text.split()  
    cleaned_tokens = [word for word in tokens if word.isalpha() and word not in stop_words]  
    return ' '.join(cleaned_tokens) 

# Iterate over tweet data sample
for tweet, party in tweet_data_sample:
    cleaned_tweet = clean_tokenize(tweet)  
    features = conv_features(cleaned_tweet, feature_words) 
    estimated_party = classifier.classify(features)  
    
    # Output the result
    print(f"Here's our (cleaned) tweet: {cleaned_tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")


Here's our (cleaned) tweet: earlier today spoke house floor abt protecting health care women praised ppmarmonte work central coast
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: go tribe rallytogether
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: apparently trump thinks easy students overwhelmed crushing burden debt pay student loans trumpbudget
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: grateful first responders rescue personnel firefighters police volunteers working tirelessly keep people safe provide muchneeded help putting lives line
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: make even greater kag
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: cavs tie series im repbarbaralee scared roadtovictory
Actual party is Democratic and our classifier says Democ

In [29]:
# dictionary of counts by actual party and estimated party.
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10
random.shuffle(tweet_data_sample)

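for idx, tp in enumerate(tweet_data_sample) :
    tweet, party = tp
    # `tweet` is still raw bytes here (no clean_tokenize step as in the cell above)
    features = conv_features(tweet.lower(), feature_words) 

    # get the estimated party
    estimated_party = classifier.classify(features)

    results[party][estimated_party] += 1

    # stops once idx exceeds num_to_score, i.e. after num_to_score + 2 tweets
    if idx > num_to_score :
        break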

In [30]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 0, 'Democratic': 2}),
             'Democratic': defaultdict(int,
                         {'Republican': 0, 'Democratic': 8})})

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [27]:
# dictionary of counts by actual party and estimated party.
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(tweet_data)

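for idx, tp in enumerate(tweet_data) :
    tweet, party = tp
    # `tweet` is still raw bytes here (no clean_tokenize step as in the sample loop)
    features = conv_features(tweet.lower(), feature_words) 

    # get the estimated party
    estimated_party = classifier.classify(features)

    results[party][estimated_party] += 1

    # stops once idx exceeds num_to_score, i.e. after num_to_score + 2 tweets
    if idx > num_to_score :
        break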

In [28]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 0, 'Democratic': 4278}),
             'Democratic': defaultdict(int,
                         {'Republican': 0, 'Democratic': 5724})})
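
The counts above can be summarized as overall accuracy and per-party recall. The sketch below copies the recorded counts into a plain dictionary; the name `recorded` is for illustration only.

```python
# Counts copied from the output above: first key is actual, second is estimated
recorded = {
    "Republican": {"Republican": 0, "Democratic": 4278},
    "Democratic": {"Republican": 0, "Democratic": 5724},
}

total = sum(sum(row.values()) for row in recorded.values())
correct = sum(recorded[party][party] for party in recorded)
accuracy = correct / total

# Recall: share of each party's tweets that were labeled correctly
recall = {
    party: recorded[party][party] / sum(recorded[party].values())
    for party in recorded
}

print(f"scored: {total}, accuracy: {accuracy:.3f}")
print(recall)
```

The total is 10,002 rather than 10,000 because the loop only breaks after `idx` exceeds `num_to_score`. The accuracy here equals the share of Democratic tweets in the scored set, which is what a classifier that always answers "Democratic" would achieve.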

### Reflections

The classification model labeled every tweet as Democratic, which misclassified all 4278 Republican tweets. On the other hand, it correctly identified all 5724 Democratic tweets. The model also had a low accuracy of 49.8% on the convention test set, which indicates that performance is poor and possibly biased towards one class, as observed in the Twitter data. This could be due to a class imbalance. I looked through the code for errors and couldn't find anything that jumped out at me as to why it would be classifying so poorly. Perhaps there is something wrong with how I cleaned and tokenized the data? I could not get the Punkt package to work even though I have it installed and NLTK reports it as up to date. I tried uninstalling and reinstalling and still had the same errors. To work around this, I used the `string` package instead.
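
One possible explanation, offered as a hypothesis rather than a confirmed diagnosis: the scoring loops pass `tweet.lower()` straight to `conv_features`, and the tweets are stored as `bytes` (note the `b'...'` prefix in the printed sample). Splitting bytes yields bytes tokens, which never equal the `str` entries in `feature_words`. Every tweet would then get an empty feature dictionary, and the classifier would return the same label each time. The sketch below reproduces that mismatch on a toy tweet. `toy_feature_words` and the simplified `clean_tokenize` (no stopword removal) are illustrative stand-ins.

```python
import string

def conv_features(text, fw):
    """Return {word: True} for each distinct token in `text` that is in `fw`."""
    return {word: True for word in set(text.split()) if word in fw}

def clean_tokenize(text):
    """Simplified cleaner: decode bytes, lowercase, drop punctuation and non-alpha tokens."""
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word.isalpha())

toy_feature_words = {"senate", "jobs"}   # hypothetical feature words (str)
raw_tweet = b"Senate Democrats and Jobs"  # tweets come back from the DB as bytes

# Path used in the scoring loops: bytes in, bytes tokens out
raw_features = conv_features(raw_tweet.lower(), toy_feature_words)

# Path used in the earlier sample loop: decode and clean first
clean_features = conv_features(clean_tokenize(raw_tweet), toy_feature_words)

print(raw_features)
print(clean_features)
```

If this is the cause, calling `clean_tokenize(tweet)` before `conv_features` in the scoring loops, as the sample loop already does, would be the change to try.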