<a href="https://colab.research.google.com/github/EmmaCOo/Assignment-4.1-Text_Classification/blob/main/ADS509_Text_Mining_Assignment4_1_TextClassificationModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###**ADS 509 Module 4: Assignment 4.1: Text Classification Model**

###**Emma Oo**

###09/26/2022

In this assignment, we use Naïve Bayes (NB) for its two greatest strengths:

Exploration of a data set.

Classification of new data based on training data.

Instruction Repo Link:

https://github.com/37chandler/tm-nb-conventions/blob/main/Political%20Naive%20Bayes.ipynb

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pip install emoji==1.7

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import os
import random
import re
import sqlite3
from collections import Counter, defaultdict
from string import punctuation

import emoji
import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from wordcloud import WordCloud

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns


# Some punctuation variations
punctuation = set(punctuation) # speeds up comparison
punctuation.add('’')  # manually added: this curly apostrophe survived cleaning of convention_db in trial runs
# Note: the next line adds the three-character string '" "' as a single set element.
# remove_punctuation compares one character at a time, so this entry never matches anything.
punctuation.add('" "')
tw_punct = punctuation

# Stopwords
#sw = stopwords.words("english")

# Two useful regex
whitespace_pattern = re.compile(r"\s+")
hashtag_pattern = re.compile(r"^#[0-9a-zA-Z]+")

# It's handy to have a full set of emojis
all_language_emojis = set()

for country in emoji.UNICODE_EMOJI : 
    for em in emoji.UNICODE_EMOJI[country] : 
        all_language_emojis.add(em)


def is_emoji(s):
    return(s in all_language_emojis)

def contains_emoji(s):
    
    s = str(s)
    emojis = [ch for ch in s if is_emoji(ch)]

    return(len(emojis) > 0)

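# Remove stopwords
# Note: this rebinds the name `stopwords` from the imported corpus reader to a plain set of words.
stopwords = set(nltk.corpus.stopwords.words('english'))
def remove_stop(tokens) :
    return [t for t in tokens if t not in stopwords]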

#Remove Punctuation
def remove_punctuation(text, punct_set=tw_punct) : 
    return("".join([ch for ch in text if ch not in punct_set]))


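# Tokenization
# RE_TOKEN would keep hashtags and emojis, but it is not used below: tokenize()
# splits on whitespace, and remove_punctuation has already stripped '#' by then.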
RE_TOKEN = re.compile(r"""
                   ( [#]?[@\w'’\.\-\:]*\w     # words, hashtags and email addresses
                   | [:;<]\-?[\)\(3]          # coarse pattern for basic text emojis
                   | [\U0001F100-\U0001FFFF]  # coarse code range for unicode emojis
                   )
                  """, re.VERBOSE)
def tokenize(text) : 
  return text.split()


# Define the pipeline function: each step consumes the previous step's output
def prepare(text, pipeline) : 
    tokens = str(text)
    
    for transform in pipeline : 
        tokens = transform(tokens)
        
    return(tokens)
  
my_pipeline = [str.lower, remove_punctuation, tokenize, remove_stop]


In [5]:
print (tw_punct) 

{'~', '/', '@', '`', '(', '_', '’', ',', ':', '[', '{', '+', '!', ')', '%', '^', '.', '"', '$', ';', '-', "'", '=', '" "', '<', '}', '?', ']', '|', '>', '*', '#', '&', '\\'}


In [6]:
#connect to SQL DB
convention_db = sqlite3.connect("/content/drive/MyDrive/2020_Conventions.db")
#Execute the SQL query
convention_cur = convention_db.cursor()

###**Part 1: Exploratory Naive Bayes**


We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text for each party and prepare it for use in Naive Bayes.
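As a rough illustration of what "prepare" means here, the sketch below runs a small cleaning pipeline on one made-up sentence. The names `DEMO_STOPWORDS`, `remove_punct`, `drop_stopwords`, and `prepare_demo` are hypothetical stand-ins for this example only, and the tiny stopword set is an assumption used in place of NLTK's English list so the snippet runs without any downloads.

```python
from string import punctuation

# Hypothetical stand-in for NLTK's English stopword list (assumption, kept tiny on purpose).
DEMO_STOPWORDS = {"the", "is", "a", "and", "of", "to", "in"}
PUNCT = set(punctuation) | {"’"}

def remove_punct(text):
    """Drop every character that is in the punctuation set."""
    return "".join(ch for ch in text if ch not in PUNCT)

def drop_stopwords(tokens):
    """Keep only tokens that are not stopwords."""
    return [t for t in tokens if t not in DEMO_STOPWORDS]

def prepare_demo(text, steps):
    """Apply each step in order; the output of one step feeds the next."""
    result = text
    for step in steps:
        result = step(result)
    return result

demo_steps = [str.lower, remove_punct, str.split, drop_stopwords]
cleaned = " ".join(prepare_demo("The President is in Ohio, and the crowd's LOUD!", demo_steps))
print(cleaned)
```

The order of the steps matters: lowercasing comes first so that stopword matching is case-insensitive, and punctuation is removed before splitting so that tokens such as "ohio," do not keep a trailing comma.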

In [7]:
convention_data = []
raw_text =[]

# fill this list up with items that are themselves lists. The 
# first element in the sublist should be the cleaned and tokenized
# text in a single string. The second element should be the party. 

query_results = convention_cur.execute(
                            '''
                            SELECT text, party   
                            FROM conventions
                            ''')

for row in query_results :
    # store the results in convention_data
    raw_text = row[0]  # text column; pass the string itself so prepare() does not stringify a list
    clean_text = prepare(raw_text, pipeline = my_pipeline)  # apply the cleaning pipeline to the raw text
    string_clean_text = " ".join(clean_text)  # join the tokens back into a single string
    clean_data = [string_clean_text, row[1]]  # pair the cleaned string with the party
    convention_data.append(clean_data)



Let's look at some random entries and see if they look right.



In [8]:
random.choices(convention_data,k=10)

[['joe helped bring us back recession 2009 barack obama joe biden started worst economy since great depression done delivered six straight years job growth',
  'Democratic'],
 ['president trump continues place strong women significant positions throughout administration campaign far president us history',
  'Republican'],
 ['see theres another part story part ran office part served congress part worked joe biden barack obama make sure kids grandkids theyre dependents stay parents health insurance theyre 26 got done yes big effing deal thats america know thats america love thats america joe biden kamala harris white house nation plans nation builds nation builds back say home nation builds back better wisconsin state motto one word forward november lets move forward never look back thank',
  'Democratic'],
 ['okay dont know grandfather', 'Democratic'],
 ['led become lawyer district attorney attorney general united states senator every step way ive guided words spoke first time stood cou

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than word_cutoff times (the code tests count > word_cutoff). Here's the code to test that if you want it.
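The effect of the cutoff is easy to see on a toy corpus. This sketch uses hypothetical documents (`demo_docs`) and a hypothetical cutoff of 2, and it keeps words whose count is strictly greater than the cutoff, which is the comparison the notebook's code uses.

```python
from collections import Counter

# Hypothetical cleaned documents (assumption: already lowercased, punctuation removed).
demo_docs = [
    ("jobs jobs economy", "Republican"),
    ("economy health care", "Democratic"),
    ("jobs health economy", "Democratic"),
]
demo_cutoff = 2

# Count every token across all documents, ignoring the party label.
counts = Counter(w for text, _ in demo_docs for w in text.split())

# Keep words seen strictly more than demo_cutoff times, matching a `count > cutoff` test.
demo_features = {w for w, c in counts.items() if c > demo_cutoff}
print(sorted(demo_features))
```

Here "health" appears exactly twice and is dropped, so a cutoff of 2 under this comparison means a word needs at least three occurrences to become a feature.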

In [9]:
word_cutoff = 5

tokens = [w for t, p in convention_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2387 as features in the model.


In [10]:
word_dist  #let's look at some freq distribution of word

FreqDist({'president': 1101, 'joe': 778, 'us': 745, 'trump': 708, 'america': 679, 'biden': 671, 'people': 608, 'country': 506, 'american': 462, 'one': 430, ...})

In [11]:
def conv_features(text,fw) :
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded (for example, the
            first element of an entry in convention_data).
            * fw: the *feature words* that we're considering (here,
            feature_words). A word in `text` must be in fw in order
            to be returned. This prevents us from considering very
            rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
        
    """
    
    # Map each word of `text` that is also in `fw` to True; repeated words collapse to one key
    ret_dict = dict()
    for word in text.split():
        if word in fw:
            ret_dict[word] = True
    return(ret_dict)

In [12]:
# assert to check if the function has any bug
assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald':True,'president':True})
assert(conv_features("people are american in america",feature_words)==
                     {'america':True,'american':True,"people":True})


Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory.

In [13]:
#apply conv_feature function onto convention_data set
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [14]:
#Split test_size=500
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [15]:
#train, test for NB classifier model and print accuracy
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.498


In [16]:
#print top 25 most informative features
classifier.show_most_informative_features(25)

Most Informative Features
                   china = True           Republ : Democr =     25.8 : 1.0
                   votes = True           Democr : Republ =     23.8 : 1.0
             enforcement = True           Republ : Democr =     21.5 : 1.0
                 destroy = True           Republ : Democr =     19.2 : 1.0
                freedoms = True           Republ : Democr =     18.2 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                supports = True           Republ : Democr =     17.1 : 1.0
                   crime = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     14.9 : 1.0
                 beliefs = True           Republ : Democr =     13.0 : 1.0
               countries = True           Republ : Democr =     13.0 : 1.0
                 defense = True           Republ : Democr =     13.0 : 1.0
                    isis = True           Republ : Democr =     13.0 : 1.0

Write a little prose here about what you see in the classifier. Anything odd or interesting?

###**My Observations**

Each row reports how much more likely a feature word is under one party than under the other. For example, 'china' is 25.8 times more likely to be present in Republican text than in Democratic text (25.8 : 1.0). Most of the top 25 features, such as 'china', 'enforcement', 'destroy', and 'freedoms', lean Republican. The only two words that lean Democratic are 'votes' and 'climate'. These features point to the subjects each party chose to emphasize and raise concerns about.
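The ratios in the informative-features table compare how often a word is present in one party's text versus the other's. The sketch below computes such a ratio on made-up data. The function `presence_ratio`, the `demo` documents, and the additive smoothing with `gamma=0.5` are assumptions chosen for illustration; NLTK's own estimator may differ in detail.

```python
from collections import Counter

def presence_ratio(docs, word, label_a, label_b, gamma=0.5):
    """Ratio P(word present | label_a) / P(word present | label_b).

    Uses additive smoothing with `gamma` over two outcomes (present, absent),
    so a word never seen under one label does not cause division by zero.
    """
    totals = Counter(label for _, label in docs)
    hits = Counter(label for words, label in docs if word in words)

    def smoothed(label):
        return (hits[label] + gamma) / (totals[label] + 2 * gamma)

    return smoothed(label_a) / smoothed(label_b)

# Hypothetical documents, each reduced to its set of feature words.
demo = [
    ({"china", "trade"}, "Republican"),
    ({"china", "crime"}, "Republican"),
    ({"freedom"}, "Republican"),
    ({"climate"}, "Democratic"),
    ({"votes", "climate"}, "Democratic"),
    ({"health"}, "Democratic"),
]
ratio = presence_ratio(demo, "china", "Republican", "Democratic")
print(f"china: Republican : Democratic = {ratio:.1f} : 1.0")
```

In this toy data "china" is present in two of three Republican documents and in no Democratic ones, so the smoothed probabilities are 2.5/4 and 0.5/4 and the ratio is 5.0 : 1.0. Large ratios can therefore come from words that are simply rare in one class, not only from words that are common in the other.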


###**Part 2: Classifying Congressional Tweets**


In this part we **apply the classifier we just built** to a set of tweets by people running for Congress in 2018. These tweets are stored in the database congressional_data.db. That DB is funky, so I'll give you the query I used to pull out the tweets. Note that this DB has some big tables and is unindexed, so the query takes a minute or two to run on my machine.

In [17]:
cong_db = sqlite3.connect("/content/drive/MyDrive/congressional_data.db")
cong_cur = cong_db.cursor()

In [18]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming

In [19]:
#let's look at result data frame to get an idea of what the result data looks like
df = pd.DataFrame(results)
df.head()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Mo Brooks | Republican | `b'"Brooks Joins Alabama Delegation in Voting A...` |
| 1 | Mo Brooks | Republican | `b'"Brooks: I Do Not Support America Raising, T...` |
| 2 | Mo Brooks | Republican | `b'"Brooks: Senate Democrats Allowing President...` |
| 3 | Mo Brooks | Republican | `b'"NASA on the Square" event this Sat. 11AM \x...` |
| 4 | Mo Brooks | Republican | `b'"Rep. Mo Brooks: NDAA Amnesty Amendment \xe2...` |


In [20]:
tweet_data = []

# Now fill up tweet_data with sublists like we did on the convention speeches.
# Note that this may take a bit of time, since we have a lot of tweets.
# `results` already holds the rows from the query above, so the slow query is not run again.
for row in results:
    raw_tweet = row[2].decode('utf-8')   # tweet_text (row[2]) is stored as bytes; decode it to str
    clean_tweet = prepare(raw_tweet, pipeline = my_pipeline)
    string_clean_tweet = " ".join(clean_tweet)
    clean_tweet_data = [string_clean_tweet, row[1]]  # keep only the cleaned text and the party
    tweet_data.append(clean_tweet_data)

There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [21]:
random.choices(tweet_data,k=10)

[['good morning amjalexjohnson', 'Democratic'],
 ['rt alexckaufman top democrats repdonbeyer rep gerryconnolly ask epas busy inspector general open investigation int…',
  'Democratic'],
 ['rt swingleftca10 youngest canvasser day nico ca10 joshuaharder swingleft httpstco1zgrqliyfd',
  'Democratic'],
 ['patients ask federal government permission save lives great see righttotry signed law congratulations one biggest champions state rep nickzerwas able white house signing ceremony',
  'Republican'],
 ['join markreardonkmox yesterday discuss taxreform important economy amp middleincome families also collected friendly bet goraiders httpstcov8okl963s5 httpstcoqqavvyighk',
  'Republican'],
 ['focus kind america want friends neighbors children grandchildren world applaud rep joe kennedy focusing positive vision future',
  'Democratic'],
 ['im happy push muchneeded work beltrami island state forest public use areas outdoors huge part minnesota life atv snowmobile access important recreation eco

In [22]:
random.seed(20201014)
tweet_data_sample = random.choices(tweet_data,k=10)

In [23]:
for tweet, party in tweet_data_sample :
    # apply the NB classifier to the cleaned tweet to estimate the party
    estimated_party = classifier.classify(conv_features(tweet, feature_words))
    
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")

Here's our (cleaned) tweet: mass shooting las vegas horrific act violence victims families thoughts prayers
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: early morning traveltuesday leaving ok02 dc httptcoigknci79e7
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: moderates iraq amp syria civilians weve enemies sides conflict assist either
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: rt natsecaction 200 national security veterans demanding answers release confidential national security questionna…
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: 💯 buildthatwall httpstcohyb6jcw5ea
Actual party is Republican and our classifier says Democratic.

Here's our (cleaned) tweet: glad attend g20 assure everyone could majority americans still stand traditional allies
Actual party is Democratic and our classifier says Republica

Now that we've looked at it some, let's score a bunch and see how we're doing.



In [24]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(tweet_data)


for idx, tp in enumerate(tweet_data) :
    tweet, party = tp    
    # Now do the same thing as above, but we store the results rather
    # than printing them. 
   
    # get the estimated party
    estimated_party = classifier.classify(conv_features(tweet, feature_words))
    
    results[party][estimated_party] += 1
    
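    # Note: this check runs after the tweet is scored, so indices 0 through
    # num_to_score + 1 are scored (num_to_score + 2 tweets in total).
    if idx > num_to_score : 
        break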

In [25]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 3624, 'Democratic': 516}),
             'Democratic': defaultdict(int,
                         {'Republican': 5021, 'Democratic': 841})})

###**Reflections**

Write a little about what you see in the results

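The NB classifier labels most tweets as Republican, whichever party actually wrote them. Of the Republican tweets, 88% are classified as Republican and 12% as Democratic. Of the Democratic tweets, 86% are classified as Republican and only 14% as Democratic. The model therefore appears biased toward the Republican class. One possible explanation is class imbalance, with fewer Democratic examples than Republican ones. In the scored tweets the Democratic class is actually the larger one (5,862 versus 4,140), so any such imbalance would have to sit in the convention data the classifier was trained on. If so, the classifier would over-predict the Republican class and under-predict the Democratic class.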
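The percentages above can be recomputed directly from the counts in the `results` output. The sketch below copies those four counts into a plain dictionary (the name `confusion` is introduced here only for illustration) and derives overall accuracy, per-party recall, and the share of tweets predicted Republican.

```python
# Counts copied from the `results` output: outer key is the actual party, inner key the estimate.
confusion = {
    "Republican": {"Republican": 3624, "Democratic": 516},
    "Democratic": {"Republican": 5021, "Democratic": 841},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[p][p] for p in confusion)
accuracy = correct / total

# Recall per party: share of that party's tweets that were labeled correctly.
recall = {p: confusion[p][p] / sum(confusion[p].values()) for p in confusion}

# Share of all scored tweets that the classifier labeled "Republican".
predicted_rep = sum(row["Republican"] for row in confusion.values()) / total

print(f"scored={total} accuracy={accuracy:.3f}")
for party, r in recall.items():
    print(f"recall[{party}]={r:.3f}")
print(f"predicted Republican share={predicted_rep:.3f}")
```

With these counts, 4,465 of 10,002 tweets are labeled correctly (accuracy of about 0.446), recall is about 0.875 for Republican and 0.143 for Democratic, and roughly 86% of all predictions are Republican.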

In [29]:
#convert to html 
!jupyter nbconvert --to html /content/ADS509_Text_Mining_Assignment4_1_TextClassificationModel.ipynb

[NbConvertApp] Converting notebook /content/ADS509_Text_Mining_Assignment4_1_TextClassificationModel.ipynb to html
[NbConvertApp] Writing 337791 bytes to /content/ADS509_Text_Mining_Assignment4_1_TextClassificationModel.html
