# Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details. You can download the required DB from the shared Dropbox or from Blackboard.

In [1]:
import random
import re
import sqlite3
import string
from collections import Counter, defaultdict

import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize

stop_words = set(stopwords.words('english'))
punct = set(string.punctuation)

# The download directory below is specific to the author's machine.
nltk.download('punkt', download_dir='/Users/bobbymarriott/nltk_data')

# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/bobbymarriott/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
import nltk
nltk.data.path.append('/Users/bobbymarriott/nltk_data')

In [3]:
convention_db  = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

## Part 1: Exploratory Naive Bayes

We'll first build an NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" exercise. First, we'll pull in the text for each party and prepare it for use in Naive Bayes.
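To make "words that distinguish between the two parties" concrete, here is a minimal sketch of the kind of ratio a Naive Bayes model relies on: how much more likely a word is to appear in one party's text than the other's. The function name `presence_ratios`, the toy sentences, and the add-0.5 smoothing are all assumptions for illustration; this is not NLTK's internal code.

```python
from collections import Counter

def presence_ratios(labeled_texts, label_a, label_b, smoothing=0.5):
    """Ratio P(word present | label_a) / P(word present | label_b) per word."""
    n_docs = Counter(label for _, label in labeled_texts)
    present = {label_a: Counter(), label_b: Counter()}
    for text, label in labeled_texts:
        if label in present:
            # Count each word at most once per text (presence, not frequency).
            present[label].update(set(text.split()))

    vocab = set(present[label_a]) | set(present[label_b])

    def prob(word, label):
        # Two outcomes per word (present / absent), hence 2 * smoothing.
        return (present[label][word] + smoothing) / (n_docs[label] + 2 * smoothing)

    return {w: prob(w, label_a) / prob(w, label_b) for w in vocab}

# Made-up toy data, not drawn from the convention database.
toy = [
    ("jobs security", "Republican"),
    ("jobs economy", "Republican"),
    ("healthcare jobs", "Democratic"),
    ("healthcare inequality", "Democratic"),
]
ratios = presence_ratios(toy, "Republican", "Democratic")
for word, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:12s} {ratio:.2f}")
```

Ratios well above 1 point toward the first label and ratios well below 1 point toward the second. A word used about equally by both sides sits near 1 and tells the model little.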

In [4]:
convention_cur.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()

[('conventions',)]

In [5]:
print(convention_cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table';"
).fetchall())

print(convention_cur.execute(
    "PRAGMA table_info(conventions);"
).fetchall())

[('conventions',)]
[(0, 'party', 'TEXT', 0, None, 0), (1, 'night', 'INTEGER', 0, None, 0), (2, 'speaker', 'TEXT', 0, None, 0), (3, 'speaker_count', 'INTEGER', 0, None, 0), (4, 'time', 'TEXT', 0, None, 0), (5, 'text', 'TEXT', 0, None, 0), (6, 'text_len', 'TEXT', 0, None, 0), (7, 'file', 'TEXT', 0, None, 0)]


In [6]:
convention_data = []

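# 1. Pull all Dem + Rep speeches
# Note: the tweet data in Part 2 labels this party 'Democratic'; confirm which
# spelling is stored in `conventions.party`, since a mismatch returns no rows for it.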
query = """
SELECT text, party
  FROM conventions
 WHERE party IN ('Democrat', 'Republican')
"""
convention_cur.execute(query)

# 2. Build [speech_text, party] list
for speech_text, party in convention_cur.fetchall():
    convention_data.append([speech_text, party])

len(convention_data)

990

In [7]:
# it's a best practice to close up your DB connection when you're done
convention_db.close()

Let's look at some random entries and see if they look right. 

In [8]:
random.choices(convention_data,k=5)

[['I’m an example of a woman who has been given a second chance in life. God bless you and God bless America. Good evening. I’m Alice Marie Johnson. I was once told that the only way I would ever be reunited with my family would be as a corpse. But by the grace of God and the compassion of President Donald John Trump, I stand before you tonight, and I assure you I’m not a ghost. I am alive. I am well. And most importantly, I am free. In 1996, I began serving time in prison. Life plus 25 years. I had never been in trouble. I was a first time non-violent offender. What I did was wrong. I made decisions that I regret.',
  'Republican'],
 ['When the pandemic hit, president Trump heard us in our call for assistance for our farmers. Knowing we have an ally in the white house is important. Folks, this election is a choice between two very different paths. Freedom, prosperity, and economic growth under a Trump/Pence administration or the Biden/Harris path, paved by liberal coastal elites and r

It'll be useful for us to have a larger sample size than 2024 affords, since those speeches tend to be long and contiguous. Let's make a new list-of-lists called `conv_sent_data`. Instead of each first entry in the sublists being an entire speech, make each first entry just a sentence from the speech. Feel free to use NLTK's `sent_tokenize` [function](https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html).

In [9]:
conv_sent_data = []
# split on any period, question‐mark or exclamation, followed by whitespace
sent_split_re = re.compile(r'(?<=[\.!?])\s+')

for speech, party in convention_data:
    # split the speech into sentences
    for sent in sent_split_re.split(speech):
        sent = sent.strip()
        if len(sent) > 10:
            conv_sent_data.append([sent, party])

print(len(conv_sent_data))
print(conv_sent_data[:5])

5398
[['Skip to content The Company Careers Press Freelancers Blog × Services Transcription Captions Foreign Subtitles Translation Freelancers About Contact Login « Return to Transcript Library home  Transcript Categories  All Transcripts 2020 Election Transcripts Classic Speech Transcripts Congressional Testimony & Hearing Transcripts Debate Transcripts Donald Trump Transcripts Entertainment Transcripts Financial Transcripts Interview Transcripts Political Transcripts Press Conference Transcripts Speech Transcripts Sports Transcripts Technology Transcripts Aug 25, 2020 2020 Republican National Convention (RNC) Night 1 Transcript Rev  ›  Blog  ›  Transcripts  › 2020 Election Transcripts  ›  2020 Republican National Convention (RNC) Night 1 Transcript Night 1 of the Republican National Convention (RNC) on August 24.', 'Republican'], ['Read the transcript of the full event here.', 'Republican'], ['Transcribe Your Own Content  Try Rev for free  and save time transcribing, captioning, and 
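The regex splitter and the length filter used in this cell can be exercised on a short string to see what they keep and drop. The sketch below wraps the same pattern in a hypothetical helper, `split_sentences`; the helper name and the `min_chars` parameter are assumptions added for illustration.

```python
import re

# Same idea as the cell above: split on whitespace that follows ., ! or ?
sent_split_re = re.compile(r'(?<=[.!?])\s+')

def split_sentences(speech, min_chars=10):
    """Split a speech into sentences, keeping pieces longer than min_chars."""
    pieces = (s.strip() for s in sent_split_re.split(speech))
    return [s for s in pieces if len(s) > min_chars]

sample = "I am alive. I am well. And most importantly, I am free."
print(split_sentences(sample))

# An abbreviation is treated as a sentence end, and the fragment is then filtered out.
print(split_sentences("Dr. Biden spoke to the crowd."))
```

Two things are worth noticing: the length filter silently removes short but genuine sentences such as "I am well.", and abbreviations like "Dr." trigger a split. NLTK's `sent_tokenize` is designed to handle abbreviations better, at the cost of needing the `punkt` data.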

Again, let's look at some random entries. 

In [10]:
random.choices(conv_sent_data,k=5)

[['That would teach him not to wear a MAGA hat.', 'Republican'],
 ['Within three short years, we built the strongest economy in the history of the world.',
  'Republican'],
 ['He was murdered by people who didn’t know and just didn’t care.',
  'Republican'],
 ['Priority, freeing American hostages.', 'Republican'],
 ['In 2016, Donald Trump made his historic run for the office of United States president.',
  'Republican']]

Now it's time for our final cleaning before modeling. Go through `conv_sent_data` and take the following steps: 

1. Tokenize on whitespace
1. Remove punctuation
1. Remove tokens that fail the `isalpha` test
1. Remove stopwords
1. Casefold to lowercase
1. Join the remaining tokens into a string
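The six steps above can be sketched on a single sentence as follows. This is one possible reading of the steps, not the required solution: it strips punctuation characters from inside each token rather than discarding whole tokens, and it uses a tiny made-up stopword set (`DEMO_STOPWORDS`) so that it runs without any NLTK data.

```python
import string

# Hypothetical mini stopword list so the sketch runs without NLTK data.
DEMO_STOPWORDS = {"you", "and", "the", "a", "of"}
# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def clean_sentence(sent, stop_words=DEMO_STOPWORDS):
    tokens = sent.split()                                         # 1. tokenize on whitespace
    tokens = [t.translate(STRIP_PUNCT) for t in tokens]           # 2. remove punctuation characters
    tokens = [t for t in tokens if t.isalpha()]                   # 3. keep alphabetic tokens only
    tokens = [t for t in tokens if t.lower() not in stop_words]   # 4. remove stopwords
    tokens = [t.lower() for t in tokens]                          # 5. lowercase
    return " ".join(tokens)                                       # 6. join into a string

print(clean_sentence("God bless you and God bless America."))
```

Note that `string.punctuation` covers ASCII punctuation only, so a token with a curly apostrophe such as `I’m` still fails the `isalpha` test and is dropped.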


In [11]:
# `stop_words` and `punct` were defined in the first cell.
clean_conv_sent_data = []

for sent, party in conv_sent_data:
    # 1. Tokenize on whitespace
    tokens = sent.split()
    # 2. Remove punctuation tokens. This only drops tokens that are a single
    #    punctuation character; a word with punctuation attached (e.g. "america.")
    #    is dropped whole by the isalpha() test in step 3.
    tokens = [t for t in tokens if t not in punct]
    # 3. Remove tokens that fail the isalpha() test
    tokens = [t for t in tokens if t.isalpha()]
    # 4. Remove stopwords
    tokens = [t for t in tokens if t.lower() not in stop_words]
    # 5. Casefold to lowercase
    tokens = [t.lower() for t in tokens]
    # 6. Join remaining tokens back into a string
    clean_sent = " ".join(tokens)

    if clean_sent:  # skip empty results
        clean_conv_sent_data.append((clean_sent, party))

random.sample(clean_conv_sent_data, k=5)

[('nothing washington', 'Republican'),
 ('best campaign going', 'Republican'),
 ('joe biden promises illegal immigrants', 'Republican'),
 ('old ideas socialism repackaged redefined', 'Republican'),
 ('wife', 'Republican')]

If that looks good, let's make our function to turn these into features. First we need to build our list of candidate words. I started my exploration at a cutoff of 5. 

In [12]:
word_cutoff = 5

tokens = [w for t, p in clean_conv_sent_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 1109 as features in the model.
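Since the cutoff is a starting point for exploration, it can help to see how the vocabulary size responds to different values before settling on one. A minimal sketch, assuming a hypothetical helper `vocab_size_by_cutoff` and made-up toy texts:

```python
from collections import Counter

def vocab_size_by_cutoff(texts, cutoffs):
    """Map each cutoff to the number of words appearing more than `cutoff` times."""
    counts = Counter(w for text in texts for w in text.split())
    return {c: sum(1 for n in counts.values() if n > c) for c in cutoffs}

# Toy input; in the notebook the texts would come from clean_conv_sent_data.
toy_texts = ["jobs jobs jobs economy", "economy jobs health", "health"]
print(vocab_size_by_cutoff(toy_texts, cutoffs=[0, 1, 2, 3]))
```

In the notebook this could be called as `vocab_size_by_cutoff([t for t, _ in clean_conv_sent_data], [2, 5, 10, 20])`. The comparison is strict (`>`), matching the `count > word_cutoff` test in the cell above.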


In [13]:
def conv_features(text, fw):
    """
    Given some text, this returns a dictionary holding the feature words.

    Args:
      * text: a piece of text in a continuous string. Assumes
        text has been cleaned and case folded.
      * fw: the *feature words* that we're considering. A word in `text`
        must be in fw in order to be returned.

    Returns:
      A dict mapping each feature word found in `text` to True.
    """
    ret_dict = {}
    for word in text.split():          # 1) split on whitespace
        if word in fw:                 # 2) only keep if it’s one of our features
            ret_dict[word] = True      # duplicates automatically collapse
    return ret_dict

In [14]:
assert(len(feature_words)>0)
assert(conv_features("obama was the president",feature_words)==
       {'obama':True,'president':True})
assert(conv_features("some people in america are citizens",feature_words)==
                     {'people':True,'america':True,"citizens":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [15]:
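# Note: this featurizes the raw speeches in `convention_data`, not the cleaned
# sentences in `clean_conv_sent_data`, so capitalized or punctuated tokens
# will not match anything in `feature_words`.
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]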

In [16]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [17]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

1.0
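A perfect score is worth a sanity check before reading much into it. One quick check is to count the labels in each split and compare the classifier against a baseline that always predicts the most common training label. The helpers `label_counts` and `majority_baseline` below are hypothetical names, and the toy feature sets are made up.

```python
from collections import Counter

def label_counts(featuresets):
    """Count how many (features, label) pairs carry each label."""
    return Counter(label for _, label in featuresets)

def majority_baseline(train_set, test_set):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = label_counts(train_set).most_common(1)[0]
    hits = sum(1 for _, label in test_set if label == majority_label)
    return hits / len(test_set)

# Toy data in the same (feature_dict, label) shape that NLTK expects.
toy_train = [({"jobs": True}, "Republican")] * 3 + [({"health": True}, "Democratic")]
toy_test = [({"jobs": True}, "Republican"), ({"health": True}, "Democratic")]
print(label_counts(toy_train))
print(majority_baseline(toy_train, toy_test))
```

Running `label_counts(train_set)` and `label_counts(test_set)` on the real splits shows whether both parties are present. If the baseline accuracy matches the classifier's accuracy, the model has not learned anything beyond the majority label.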


In [18]:
classifier.show_most_informative_features(25)

Most Informative Features


Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

Partisan buzzwords such as 'jobs' and 'security' drive the Republican text, while 'inequality' and 'healthcare' drive the Democratic side. Naive Bayes highlights the signature rhetoric of each party.


## Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and
is unindexed, so the query takes a minute or two to run on my machine.
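If the wait becomes a nuisance, one option is to add an index on the columns the join uses. The sketch below shows the idea on an in-memory stand-in table with made-up rows. The column names `handle` and `district` are taken from the join condition in this part's query, while the index name is an assumption. On the real file this would modify the database, so it is safer to try it on a copy.

```python
import sqlite3

# In-memory stand-in for the real database; the rows are made up.
demo_db = sqlite3.connect(":memory:")
demo_cur = demo_db.cursor()
demo_cur.execute("CREATE TABLE tweets (handle TEXT, district TEXT, tweet_text TEXT)")
demo_cur.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?)",
    [("cand_a", "AL05", "hello"), ("cand_b", "CA52", "world")],
)

# Index the two columns used in the join condition.
demo_cur.execute(
    "CREATE INDEX IF NOT EXISTS idx_tweets_handle_district ON tweets (handle, district)"
)
demo_db.commit()

# Confirm the index is registered in the schema.
index_names = [
    name for (name,) in demo_cur.execute(
        "SELECT name FROM sqlite_master WHERE type='index' AND tbl_name='tweets'"
    )
]
print(index_names)
demo_db.close()
```

Whether the index actually speeds up the query depends on the data and on SQLite's query planner, so it is worth timing the query before and after.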

In [19]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [20]:
# 1) List all tables
tables = cong_cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table';"
).fetchall()
print("Tables:", tables)

# 2) For each table, show its columns
for (tbl,) in tables:
    cols = cong_cur.execute(f"PRAGMA table_info({tbl});").fetchall()
    print(f"\nColumns in {tbl}:\n", cols)

Tables: [('websites',), ('candidate_data',), ('tweets',)]

Columns in websites:
 [(0, 'district', 'TEXT', 0, None, 0), (1, 'candidate', 'TEXT', 0, None, 0), (2, 'pull_time', 'DATETIME', 0, None, 0), (3, 'url', 'TEXT', 0, None, 0), (4, 'site_text', 'TEXT', 0, None, 0)]

Columns in candidate_data:
 [(0, 'index', 'INTEGER', 0, None, 0), (1, 'student', 'TEXT', 0, None, 0), (2, 'state', 'TEXT', 0, None, 0), (3, 'district_num', 'TEXT', 0, None, 0), (4, 'formatted_dist_num', 'INTEGER', 0, None, 0), (5, 'abbrev', 'TEXT', 0, None, 0), (6, 'district', 'TEXT', 0, None, 0), (7, 'candidate', 'TEXT', 0, None, 0), (8, 'party', 'TEXT', 0, None, 0), (9, 'website', 'TEXT', 0, None, 0), (10, 'twitter_handle', 'TEXT', 0, None, 0), (11, 'incumbent', 'TEXT', 0, None, 0), (12, 'age', 'REAL', 0, None, 0), (13, 'gender', 'TEXT', 0, None, 0), (14, 'marital_status', 'TEXT', 0, None, 0), (15, 'white_non_hispanic', 'TEXT', 0, None, 0), (16, 'hispanic', 'TEXT', 0, None, 0), (17, 'black', 'TEXT', 0, None, 0), (18, '

In [21]:
results = cong_cur.execute("""
SELECT DISTINCT
  cd.candidate,
  cd.party,
  tw.tweet_text
FROM candidate_data AS cd
INNER JOIN tweets AS tw
  ON cd.twitter_handle = tw.handle
  AND cd.district       = tw.district
WHERE cd.party IN ('Republican','Democratic')
  AND tw.tweet_text NOT LIKE '%RT%'
""").fetchall()

results = list(results)

In [22]:
tweet_data = []

for candidate, party, tweet_text in results:
    tweet_data.append([tweet_text, party])

# quick check
print(len(tweet_data), tweet_data[:3])

664656 [[b'"Brooks Joins Alabama Delegation in Voting Against Flawed Funding Bill" http://t.co/3CwjIWYsNq', 'Republican'], [b'"Brooks: Senate Democrats Allowing President to Give Americans\xe2\x80\x99 Jobs to Illegals" #securetheborder https://t.co/mZtEaX8xS6', 'Republican'], [b'"NASA on the Square" event this Sat. 11AM \xe2\x80\x93 4PM. Stop by &amp; hear about the incredible work done in #AL05! @DowntownHSV http://t.co/R9zY8WMEpA', 'Republican']]
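The raw tweets come back as bytes and contain HTML entities (`&amp;`) and URLs. A small normalizing helper can deal with these before tokenizing. The function name `normalize_tweet` and the choice to drop URLs are assumptions for illustration; the notebook's own cleaning only decodes the bytes.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")

def normalize_tweet(raw):
    """Decode bytes, unescape HTML entities, drop URLs, and collapse whitespace."""
    text = raw.decode("utf-8", "ignore") if isinstance(raw, bytes) else raw
    text = html.unescape(text)      # e.g. "&amp;" becomes "&"
    text = URL_RE.sub("", text)     # remove http:// and https:// links
    return " ".join(text.split())   # collapse runs of whitespace

print(normalize_tweet(
    b"Stop by &amp; hear about the incredible work done in #AL05! http://t.co/R9zY8WMEpA"
))
```

The result can then go through the same tokenize, filter, and lowercase steps used for the convention sentences.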


There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [23]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

In [24]:
for tweet, party in tweet_data_sample:
    # 1) Decode bytes
    if isinstance(tweet, bytes):
        tweet = tweet.decode('utf8','ignore')

    # 2) Clean
    toks = tweet.split()
    toks = [
        t for t in toks
        if t not in punct
        and t.isalpha()
        and t.lower() not in stop_words
    ]
    clean = " ".join(t.lower() for t in toks)

    # 3) Featurize & classify
    feats = conv_features(clean, feature_words)
    estimated_party = classifier.classify(feats)

    print(f"Here's our (cleaned) tweet: {clean}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")

Here's our (cleaned) tweet: earlier spoke house floor abt protecting health care women praised work central
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: go
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: trump thinks easy students overwhelmed crushing burden debt pay student loans
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: grateful first rescue volunteers working tirelessly keep people provide putting lives
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: make even greater
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: tie series
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: congrats new gig sd city glad continue
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: really raised

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [25]:
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(tweet_data)

for idx, (tweet, true_party) in enumerate(tweet_data):
    if isinstance(tweet, bytes):
        tweet = tweet.decode('utf8', 'ignore')
        
    toks = tweet.split()
    toks = [t for t in toks if t not in punct and t.isalpha() and t.lower() not in stop_words]
    clean = " ".join(t.lower() for t in toks)

    feats = conv_features(clean, feature_words)
    estimated_party = classifier.classify(feats)
    results[true_party][estimated_party] += 1

    # Note: this check runs after scoring, so num_to_score + 1 tweets are counted.
    if idx >= num_to_score:
        break

In [26]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 4278, 'Democratic': 0}),
             'Democratic': defaultdict(int,
                         {'Republican': 5723, 'Democratic': 0})})
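The nested counts are easier to interpret as an overall accuracy and a per-party recall. A minimal sketch, using a hypothetical helper `summarize` and the counts printed above:

```python
def summarize(confusion):
    """Accuracy and per-party recall from confusion[actual][estimated] counts."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(row.get(actual, 0) for actual, row in confusion.items())
    recall = {
        actual: (row.get(actual, 0) / sum(row.values()) if sum(row.values()) else 0.0)
        for actual, row in confusion.items()
    }
    return correct / total, recall

# Counts copied from the output above (first key is actual, second is estimated).
confusion = {
    "Republican": {"Republican": 4278, "Democratic": 0},
    "Democratic": {"Republican": 5723, "Democratic": 0},
}
accuracy, recall = summarize(confusion)
print(f"accuracy = {accuracy:.3f}")
print(recall)
```

With these counts the accuracy is 4278 / 10001, or about 0.428. Recall is 1.0 for Republican tweets and 0.0 for Democratic tweets, because the classifier never predicts the Democratic label.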

### Reflections

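The classifier labels every tweet as Republican. All 4,278 Republican tweets are scored correctly, but none of the 5,723 Democratic tweets are, for an overall accuracy of about 43% (4,278 of 10,001). This may reflect a model that leans too heavily on convention-speech vocabulary and fails to capture how Democrats write on Twitter. A classifier that never predicts one class can also mean it saw only one label during training: the convention query filters on 'Democrat' while the tweet data uses 'Democratic', every sampled convention row shown above is Republican, and the hold-out accuracy is exactly 1.0, so the `party` values in the `conventions` table are worth checking.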