## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details.

In [51]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict

# Feel free to include your text patterns functions
import nltk
import pandas as pd
import os
import re

from collections import Counter, defaultdict
from nltk.corpus import stopwords
from string import punctuation
from wordcloud import WordCloud 
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

In [52]:
#some punctuation variations
punctuation = set(punctuation) # speeds up comparison
tw_punct = punctuation - {"#"}

#create the stopwords variable
sw = stopwords.words("english")

#two useful regex
whitespace_pattern = re.compile(r"\s+")
hashtag_pattern = re.compile(r"^#[0-9a-zA-Z]+")

# It's handy to have a full set of emojis
all_language_emojis = set()

In [167]:
#call the conventions.db
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

convention_cur

<sqlite3.Cursor at 0x7fc448179f10>

### Part 1: Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text 
for each party and prepare it for use in Naive Bayes.  

In [168]:
#create a list to story the query results
convention_data = []

# fill this list up with items that are themselves lists. The 
# first element in the sublist should be the cleaned and tokenized
# text in a single string. The second element should be the party. 
query_results = convention_cur.execute(
                            '''
                            select  c.[text], c.party
                            from conventions as c
                            ''')
for row in query_results :
    convention_data.append(row)
#convention_data

In [169]:
#create the convetion_df with text and party columns
convention_df = pd.DataFrame(convention_data, columns = ["text","party"])
convention_df.head()

Unnamed: 0,text,party
0,Skip to content The Company Careers Press Free...,Democratic
1,I’m here by calling the full session of the 48...,Democratic
2,"Every four years, we come together to reaffirm...",Democratic
3,We fight for a more perfect union because we a...,Democratic
4,"We must come together to defeat Donald Trump, ...",Democratic


Let's look at some random entries and see if they look right. 

In [170]:
#view a random choice of the convetion data
random.choices(convention_data,k=5)

[('Thanks for that, Bernie. I want to thank you all for joining us for this segment. I mean this sincerely, it was an honor to run against you, and then there’s even a greater honor to stand with you and support of Joe Biden and Kamala Harris.',
  'Democratic'),
 ('(singing.', 'Democratic'),
 ('Thank you, Mr. President. The honor is all mine.', 'Republican'),
 ('After my daughter’s murder, the media didn’t seem interested in the facts. So, I found them myself. I learned that gun control laws didn’t fail my daughter, people did. The gunman had threatened to kill his classmates before. He had threatened to rape them. He had threatened to shoot up the school. Every red flag you could imagine, but the school didn’t just miss these red flags, they knowingly ignored them. Far left Democrats in our school district made this shooting possible because they implemented something they called restorative justice. This policy, which really just blames teachers for student’s failures puts kids and t

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur at least `word_cutoff` times. Here's the code to test that if you want it. 

In [171]:
#define the word cutoff limit
word_cutoff = 5

#initially split all tokens for analysis of feature_words
tokens = [w for t, p in convention_data for w in t.split()]

#create the tokens by splitting on whitespace
def tokenize(text) : 
    return re.split(whitespace_pattern, text)

#remove stop words
def remove_stop(tokens) :
    stop_words = set(stopwords.words("english"))
    return [word for word in tokens if word.lower() not in stop_words]

tokens = remove_stop(tokens)

#apply the .lower() function to all words
def lower_words(tokens) :
    return [word.lower() for word in tokens]

tokens = lower_words(tokens)

#define the function to remove punctuation from each word where it exists
def remove_punctuation_from_words(tokens):
    punctuations = tw_punct #from the 
    result = []
    for word in tokens:
        word_without_punctuation = "" #empty string
        for char in word:
            if char not in punctuations:
                word_without_punctuation += char
        result.append(word_without_punctuation)
    return result #return the result of the word_without_punctuation string

#remove the punctuation from the dataframe
#this expects text to be single string - 
def remove_punctuation(text, punct_set=tw_punct) : 
    return("".join([ch for ch in text if ch not in punct_set]))
 #.join.text -- test this
    
#defind the function for preparing the data for the pipeline
def prepare(text, pipeline) : 
    tokens = str(text)
    
    for transform in pipeline : 
        tokens = transform(tokens)
        
    return(tokens)

tokens = remove_punctuation_from_words(tokens)

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2462 as features in the model.


In [172]:
#view the feature words (if needed)
#print(feature_words)

In [173]:
#apply the functions to the individual rows within the conventions dataframe
#create pipeline - try it once a time
my_pipeline = [str.lower, remove_punctuation, tokenize, remove_stop]

#assign token variable by applying my_pipeline
convention_df["tokens"] = convention_df["text"].apply(prepare,pipeline=my_pipeline)

In [174]:
#iterate through the convetion_df.token column and assign to list
for tokenlist in convention_df.tokens.tolist():
    break

In [175]:
def conv_features(text,fw) :
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
        
    """
    result = {word: True for word in text if word in fw}
              #word in text.split() for word in fw}
    return result

conv_features(tokenlist, feature_words)

{'skip': True,
 'content': True,
 'company': True,
 'careers': True,
 'press': True,
 'freelancers': True,
 'blog': True,
 '×': True,
 'services': True,
 'transcription': True,
 'captions': True,
 'foreign': True,
 'subtitles': True,
 'translation': True,
 'contact': True,
 'login': True,
 '«': True,
 'return': True,
 'transcript': True,
 'library': True,
 'home': True,
 'categories': True,
 'transcripts': True,
 '2020': True,
 'election': True,
 'classic': True,
 'speech': True,
 'congressional': True,
 'testimony': True,
 'hearing': True,
 'debate': True,
 'donald': True,
 'trump': True,
 'entertainment': True,
 'financial': True,
 'interview': True,
 'political': True,
 'conference': True,
 'sports': True,
 'technology': True,
 'aug': True,
 'democratic': True,
 'national': True,
 'convention': True,
 'dnc': True,
 'night': True,
 '4': True,
 'rev': True,
 '›': True,
 'august': True,
 '20': True,
 'read': True,
 'full': True,
 'event': True,
 'transcribe': True,
 'try': True,
 'free

In [176]:
assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald':True,'president':True})
assert(conv_features("people are american in america",feature_words)==
                     {'america':True,'american':True,"people":True})

AssertionError: 

In [177]:
#create the features column 
convention_df['features'] = convention_df["tokens"].apply(conv_features,fw=feature_words)

In [178]:
convention_df.head()

Unnamed: 0,text,party,tokens,features
0,Skip to content The Company Careers Press Free...,Democratic,"[skip, content, company, careers, press, freel...","{'skip': True, 'content': True, 'company': Tru..."
1,I’m here by calling the full session of the 48...,Democratic,"[i’m, calling, full, session, 48th, quadrennia...","{'i’m': True, 'calling': True, 'full': True, '..."
2,"Every four years, we come together to reaffirm...",Democratic,"[every, four, years, come, together, reaffirm,...","{'every': True, 'four': True, 'years': True, '..."
3,We fight for a more perfect union because we a...,Democratic,"[fight, perfect, union, fighting, soul, countr...","{'fight': True, 'perfect': True, 'union': True..."
4,"We must come together to defeat Donald Trump, ...",Democratic,"[must, come, together, defeat, donald, trump, ...","{'must': True, 'come': True, 'together': True,..."


In [179]:
#create the features sets list
feature_sets = []

#iterate through the convention_df and collect the features and party of each 
for i in range(convention_df.shape[0]):
    feature_sets.append((convention_df.iloc[i,3], convention_df.iloc[i,1]))

In [180]:
#view the first 5 elements in the feature_sets list
feature_sets[:5]

[({'skip': True,
   'content': True,
   'company': True,
   'careers': True,
   'press': True,
   'freelancers': True,
   'blog': True,
   '×': True,
   'services': True,
   'transcription': True,
   'captions': True,
   'foreign': True,
   'subtitles': True,
   'translation': True,
   'contact': True,
   'login': True,
   '«': True,
   'return': True,
   'transcript': True,
   'library': True,
   'home': True,
   'categories': True,
   'transcripts': True,
   '2020': True,
   'election': True,
   'classic': True,
   'speech': True,
   'congressional': True,
   'testimony': True,
   'hearing': True,
   'debate': True,
   'donald': True,
   'trump': True,
   'entertainment': True,
   'financial': True,
   'interview': True,
   'political': True,
   'conference': True,
   'sports': True,
   'technology': True,
   'aug': True,
   'democratic': True,
   'national': True,
   'convention': True,
   'dnc': True,
   'night': True,
   '4': True,
   'rev': True,
   '›': True,
   'august': True,


Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [183]:
#shuffle the feature_sets and establish test size for test/training
random.seed(20220507)
random.shuffle(feature_sets)

test_size = 500

In [184]:
#create the test and train set
test_set, train_set = feature_sets[:test_size], feature_sets[test_size:]

#create the classifier for the Naive Bayes Classifier from the train set
classifier = nltk.NaiveBayesClassifier.train(train_set)

#print the accuracy of the classifier from the test set
print(nltk.classify.accuracy(classifier, test_set))

0.51


In [185]:
#test a random word to see the output that the classifier.classify function
classifier.classify({'military': True})

'Republican'

In [186]:
#show the top 25 most invormative features
classifier.show_most_informative_features(25)

Most Informative Features
                 radical = True           Republ : Democr =     37.0 : 1.0
                   media = True           Republ : Democr =     33.9 : 1.0
                   votes = True           Democr : Republ =     24.5 : 1.0
               greatness = True           Republ : Democr =     20.3 : 1.0
             enforcement = True           Republ : Democr =     19.7 : 1.0
                   china = True           Republ : Democr =     17.6 : 1.0
               amendment = True           Republ : Democr =     17.2 : 1.0
                 destroy = True           Republ : Democr =     17.2 : 1.0
                   taxes = True           Republ : Democr =     17.2 : 1.0
                supports = True           Republ : Democr =     14.1 : 1.0
                    flag = True           Republ : Democr =     12.8 : 1.0
                    mike = True           Republ : Democr =     12.8 : 1.0
                 abraham = True           Republ : Democr =     10.9 : 1.0

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations
The first observation that immediately stands out is that 24 of the 25 most informative features are associated to the Republican Party. The second observation is that because I am relatively aware of the political camps and their messages, the associated words that have are more associated to each party make senses to me. 

To unpack both statements (and to try and keep political bias to a minimum), Republicans tends to grab onto a handful of key ideas and play those over and over to their audience. Especially during election season, the message is typically the same, “Radical liberals are trying to destroy your freedoms, and Iran and China are major threats” - this general message was essentially on repeat. Further, the primary message of the Democrats of the last election cycle was voting and simply defeating Donald Trump. Actual policy, from my perspective, was not the focus of conversation. Hence, the words that are associated more/less to each party generally make intuitive sense to me. 


## Part 2: Classifying Congressional Tweets

In this part we apply the classifer we just built to a set of tweets by people running for congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.

In [188]:
#call the congressional_data.db
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [189]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming
results_df = pd.DataFrame(results, columns =[['name','party','tweet']])

In [190]:
twitter_data = []

query_results2 = cong_cur.execute(
                            '''
           SELECT DISTINCT 
               --   cd.candidate, 
               --   cd.party,
                  tw.tweet_text as 'text',
                  cd.party as 'party'
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
                            ''')

for row in query_results2 :
    twitter_data.append(row)
#twitter_data

In [191]:
#take the twitter_data list and store text and party to a dataframe
twitter_df = pd.DataFrame(twitter_data, columns = ["text","party"])
twitter_df.shape

(664088, 2)

In [192]:
#apply the functions to the individual rows within the dataframe
my_pipeline = [str.lower, remove_punctuation, tokenize, remove_stop]

#create the tokens column from the text column
twitter_df["tokens"] = twitter_df["text"].apply(prepare,pipeline=my_pipeline)

#identify the columns to keep
keep_only = ['tokens', 'party']
twitter_df = twitter_df[keep_only]

There are a lot of tweets here. Let's take a random sample and see how our classifer does. I'm guessing it won't be too great given the performance on the convention speeches...

In [196]:
random.seed(20201014)

#create a sample dataset of the tweet_data
tweet_data_sample = random.choices(tweet_data,k=10)
tweet_data_sample

[[b'"NASA on the Square" event this Sat. 11AM \xe2\x80\x93 4PM. Stop by &amp; hear about the incredible work done in #AL05! @DowntownHSV http://t.co/R9zY8WMEpA',
  'Republican'],
 [b'"Brooks: Senate Democrats Allowing President to Give Americans\xe2\x80\x99 Jobs to Illegals" #securetheborder https://t.co/mZtEaX8xS6',
  'Republican'],
 [b'"Brooks Joins Alabama Delegation in Voting Against Flawed Funding Bill" http://t.co/3CwjIWYsNq',
  'Republican'],
 [b'"Brooks: Senate Democrats Allowing President to Give Americans\xe2\x80\x99 Jobs to Illegals" #securetheborder https://t.co/mZtEaX8xS6',
  'Republican'],
 [b'"The trouble with Socialism is that eventually you run out of other people\'s money." - Margaret Thatcher https://t.co/X97g7wzQwJ',
  'Republican'],
 [b'"Brooks Joins Alabama Delegation in Voting Against Flawed Funding Bill" http://t.co/3CwjIWYsNq',
  'Republican'],
 [b'"Brooks Joins Alabama Delegation in Voting Against Flawed Funding Bill" http://t.co/3CwjIWYsNq',
  'Republican'],


In [198]:
#twitter_df.head()

In [199]:
twitter_df['features'] = twitter_df["tokens"].apply(conv_features,fw=feature_words)
twitter_df.head()

Unnamed: 0,tokens,party,features
0,"[bbrooks, joins, alabama, delegation, voting, ...",Republican,"{'alabama': True, 'voting': True, 'funding': T..."
1,"[bbrooks, senate, democrats, allowing, preside...",Republican,"{'senate': True, 'democrats': True, 'allowing'..."
2,"[bnasa, square, event, sat, 11am, xe2x80x93, 4...",Republican,"{'square': True, 'event': True, 'sat': True, '..."
3,"[bthe, trouble, socialism, eventually, run, pe...",Republican,"{'trouble': True, 'socialism': True, 'run': Tr..."
4,"[bthe, trouble, socialism, eventually, run, pe...",Republican,"{'trouble': True, 'socialism': True, 'run': Tr..."


In [200]:
twitter_sets = []

for i in range(twitter_df.shape[0]):
    twitter_sets.append((twitter_df.iloc[i,2], twitter_df.iloc[i,1]))

In [201]:
#view the first two elemnts of the twitter_sets list
twitter_sets[:2]

[({'alabama': True, 'voting': True, 'funding': True, 'bill': True},
  'Republican'),
 ({'senate': True,
   'democrats': True,
   'allowing': True,
   'president': True,
   'give': True,
   'jobs': True},
  'Republican')]

In [202]:
#iterate through the convetion_df.token column and assign to list
for twitter_list in twitter_df.features.tolist():
    break

In [203]:
random.seed(20201014)

tweet_data_sample = random.choices(twitter_sets,k=10)

In [204]:
#for tweet, party in tweet_data_sample :
for tweet, party in tweet_data_sample:
    
    estimated_party = classifier.classify(tweet)
    # Fill in the right-hand side above with code that estimates the actual party
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifer says {estimated_party}.")
    print("")
    

Here's our (cleaned) tweet: {'awesome': True, 'join': True, 'coast': True, 'residents': True, 'students': True}
Actual party is Democratic and our classifer says Democratic.

Here's our (cleaned) tweet: {}
Actual party is Democratic and our classifer says Democratic.

Here's our (cleaned) tweet: {'yet': True, 'speaker': True, 'still': True, 'willing': True, 'donald': True, 'trump': True, 'candidate': True, 'racist': True, 'policy': True}
Actual party is Democratic and our classifer says Republican.

Here's our (cleaned) tweet: {'celebrate': True, 'anniversary': True, 'signing': True, 'nations': True, 'founding': True}
Actual party is Republican and our classifer says Republican.

Here's our (cleaned) tweet: {'one': True, 'year': True, 'economy': True, 'people': True, 'want': True, 'work': True}
Actual party is Republican and our classifer says Republican.

Here's our (cleaned) tweet: {'heard': True, 'town': True, 'committee': True, 'today': True, 'must': True}
Actual party is Democrati

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [205]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(twitter_sets)
#tweet_data is the right varialbe name
for idx, tp in enumerate(twitter_sets) :
    tweet, party = tp    
    # Now do the same thing as above, but we store the results rather
    # than printing them. 
   
    # get the estimated party
    estimated_party = classifier.classify(tweet)
    
    results[party][estimated_party] += 1
    
    if idx > num_to_score : 
        break

In [206]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 3520, 'Democratic': 787}),
             'Democratic': defaultdict(int,
                         {'Republican': 4596, 'Democratic': 1099})})

### Reflections

The classifier does a decent job at correctly predicting Republican text, however, incorrectly predicts Democratic text as Republican too frequently. My initial thoughts are that this is due to the most important feature words (24 of 25) being "Republican" words, and there being no way of distinguising use that via our model. Additionally, Democrats could very well (and do) say many of those words, however, the Naive model doesn't take that into consideration, and instead, if the word is seen, the chance that it is Repbulcian is incrased. 

Therefore, I belive a "top 25 list" that is more balanced from a Republican and Democrat standpoint would likely return better results, and could help to improve classification results. Lastly, as some of the individual word test below display, words that are not within our feature words (such as random words like soda, table and others) are being classified as Democrat versus Republican - which is an interesting, but possibly impactful feature on the data.

In [207]:
##### Example random text and the part it gets classified to:
classifier.classify({'america': True})

'Republican'

In [208]:
classifier.classify({'nation': True})

'Democratic'

In [209]:
classifier.classify({'country': True})

'Republican'

In [210]:
classifier.classify({'immigration': True})

'Democratic'

In [211]:
classifier.classify({'illegal': True})

'Republican'

In [214]:
classifier.classify({'table': True})

'Democratic'

In [215]:
classifier.classify({'nonsense': True})

'Democratic'

In [216]:
classifier.classify({'soda': True})

'Democratic'