## Using an MLP to Classify Political Party in Tweets

In [1]:
import pandas as pd

In [2]:
tw = pd.read_csv('twitter_data.csv')
tweets = list(tw['text'])
labels = list(tw['party'])

In [3]:
print(len(tweets))
print(len(labels))

25821
25821


In [4]:
# local library housing tools to clean and process incoming twitter data
import twittertextcleaner as ttc
processed_tweets = []
for tweet in tweets:  # use a fresh name so the DataFrame `tw` is not overwritten
    processed_tweets.append(ttc.preprocess_tweet(tweet))
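
The `twittertextcleaner` module is local and its source is not shown here. As a rough idea of what a cleaning step like this could look like, here is a minimal sketch under my own assumptions: the function name `preprocess_tweet_sketch` and the tiny stopword list are hypothetical, and the stemming that the real output clearly applies (e.g. 'presid', 'peopl') is left out.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list (an assumption for illustration).
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "it", "at", "for", "on", "who", "you"}

def preprocess_tweet_sketch(text):
    """Lowercase, drop URLs, strip ASCII punctuation, and remove stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links before stripping punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

example = preprocess_tweet_sketch("Thank you to the volunteers! https://t.co/abc")
print(example)
```

Note that `string.punctuation` only covers ASCII characters, so typographic marks such as ’ would survive a step like this one.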

In [5]:
print(len(labels))
print(len(processed_tweets))

25821
25821


In [6]:
print(labels[0])
print(labels[1])

democrat
republican


In [7]:
from collections import Counter
import numpy as np
dem_counts = Counter()
rep_counts = Counter()
total_counts = Counter()

In [8]:
for i in range(len(processed_tweets)):
    # compare against the label string directly instead of relying on labels[0] being 'democrat'
    if labels[i] == 'democrat':
        for word in processed_tweets[i].split(' '):
            dem_counts[word] += 1
            total_counts[word] += 1
    else:
        for word in processed_tweets[i].split(' '):
            rep_counts[word] += 1
            total_counts[word] += 1

In [9]:
print('Ten Most Common Dem words:')
print('\n')
dem_counts.most_common(10)

Ten Most Common Dem words:




[('’', 5583),
 ('trump', 2423),
 ('presid', 1950),
 ('rt', 1879),
 ('—', 1568),
 ('peopl', 1531),
 ('u', 1397),
 ('need', 1385),
 ('american', 1382),
 ('make', 1355)]

It's interesting to note that 'trump' is one of the most frequent tokens in the Democratic tweets, and that it appears as his name rather than his handle. I hypothesize that these tweets are about Trump. The results for the Republican tweets, by contrast, are:

In [10]:
print('Ten Most Common Rep words:')
print('\n')
rep_counts.most_common(10)

Ten Most Common Rep words:




[('rt', 4422),
 ('’', 3915),
 ('presid', 2553),
 ('realdonaldtrump', 2179),
 ('american', 1743),
 ('amp', 1635),
 ('whitehous', 1357),
 ('today', 1103),
 ('obama', 1088),
 ('‘', 1013)]

Here 'rt' (retweet) and his handle, 'realdonaldtrump', are among the most used tokens. I think it's fair to speculate that the Democratic tweets are commenting on Trump's opinions, whereas the Republican tweets are repeating them.

Let's see how things change when we look at ratios instead of raw counts.

In [11]:
# create counter object to store ratios
dem_rep_ratios = Counter()
for term,cnt in list(total_counts.most_common()):
    if(cnt > 50):
        dem_rep_ratio = dem_counts[term] / float(rep_counts[term]+1)
        dem_rep_ratios[term] = dem_rep_ratio

In [12]:
# a ratio above 1 means the word appears more often in democratic tweets
# a ratio below 1 means the word appears more often in republican tweets

print("dem-to-rep ratio for 'rt' = {}".format(dem_rep_ratios["rt"]))
print("dem-to-rep ratio for 'great' = {}".format(dem_rep_ratios["great"]))
print("dem-to-rep ratio for 'news' = {}".format(dem_rep_ratios["news"]))
print("dem-to-rep ratio for 'wealth' = {}".format(dem_rep_ratios["wealth"]))
print("dem-to-rep ratio for 'health' = {}".format(dem_rep_ratios["health"]))
# according to this, for the candidates I have chosen, a republican is more likely to retweet

dem-to-rep ratio for 'rt' = 0.4248247795613837
dem-to-rep ratio for 'great' = 0.36769394261424015
dem-to-rep ratio for 'news' = 0.5891891891891892
dem-to-rep ratio for 'wealth' = 9.7
dem-to-rep ratio for 'health' = 2.942857142857143


In [13]:
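# take the log so scores are centered on 0: positive leans democrat, negative leans republican
# terms that never appear in democratic tweets have a ratio of 0, so np.log returns -inf for them
for term, ratio in dem_rep_ratios.most_common():
    dem_rep_ratios[term] = np.log(ratio)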

In [14]:
dem_rep_ratios.most_common()[:10]

[('teamjo', 5.717027701406222),
 ('—hillari', 4.969813299576001),
 ('demdeb', 4.859812404361672),
 ('jill', 4.836281906951478),
 ('lgbtq', 4.574710978503383),
 ('berniesand', 4.42484663185681),
 ('knock', 4.23410650459726),
 ('oneterm', 4.189654742026425),
 ('inclus', 4.110873864173311),
 ('superwealthi', 4.07753744390572)]

Here the words are much more representative of what we might expect to see in a Democrat's Twitter feed. The only word that I don't immediately understand is 'knock'; the rest seem very appropriate.

In [15]:
list(reversed(dem_rep_ratios.most_common()))[:10]

[('keepamericagreat', -inf),
 ('scotu', -inf),
 ('schumer', -inf),
 ('irand', -inf),
 ('spous', -inf),
 ('immigrationact', -inf),
 ('karen', -inf),
 ('trumpwarroom', -inf),
 ('garland', -inf),
 ('schiff', -inf)]

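Here the words are also much more representative of what we might expect to see in a Republican's Twitter feed. Every score is -inf because these terms never appear in the Democratic tweets: their ratio is 0, and np.log(0) evaluates to -inf.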
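
One way to keep the Republican-only terms on a finite scale would be to smooth both counts before taking the log. The sketch below is not what the notebook does; the toy counters and the `alpha` parameter are assumptions made purely for illustration.

```python
import math
from collections import Counter

# Toy counts (hypothetical values, not taken from the dataset).
toy_dem_counts = Counter({"health": 30, "rt": 10})
toy_rep_counts = Counter({"rt": 40, "schumer": 25})

def smoothed_log_ratio(term, alpha=1.0):
    """Log of the dem-to-rep ratio with alpha added to both counts."""
    dem = toy_dem_counts[term] + alpha  # a missing key counts as 0 in a Counter
    rep = toy_rep_counts[term] + alpha
    return math.log(dem / rep)

for term in ("health", "rt", "schumer"):
    print(term, round(smoothed_log_ratio(term), 3))
```

With smoothing on both sides, a term used by only one party still gets a large positive or negative score, but the scores remain sortable against each other instead of all tying at -inf.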

In [16]:
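# positive = more common in democratic tweets
# negative = more common in republican tweets
# near 0 = neutral
# (the values printed below are now log ratios)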
print("dem-to-rep ratio for 'rt' = {}".format(dem_rep_ratios["rt"]))
print("dem-to-rep ratio for 'great' = {}".format(dem_rep_ratios["great"]))
print("dem-to-rep ratio for 'news' = {}".format(dem_rep_ratios["news"]))
print("dem-to-rep ratio for 'wealth' = {}".format(dem_rep_ratios["wealth"]))
print("dem-to-rep ratio for 'health' = {}".format(dem_rep_ratios["health"]))

dem-to-rep ratio for 'rt' = -0.8560784784548614
dem-to-rep ratio for 'great' = -1.0005043645276555
dem-to-rep ratio for 'news' = -0.5290079428491812
dem-to-rep ratio for 'wealth' = 2.272125885509337
dem-to-rep ratio for 'health' = 1.0793809267402221
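
As a sanity check, the 'rt' values can be reproduced by hand from the counts in the two top-ten lists (1879 Democratic uses, 4422 Republican uses). The variable names below are mine; the arithmetic follows the ratio and log steps used in the cells above.

```python
import math

dem_rt, rep_rt = 1879, 4422  # 'rt' counts from the two top-ten lists

ratio = dem_rt / (rep_rt + 1)  # same +1 in the denominator as the notebook
log_ratio = math.log(ratio)

print(ratio)      # approximately 0.4248
print(log_ratio)  # approximately -0.8561

# The log makes the scale symmetric: a ratio of 2 and a ratio of 1/2
# sit the same distance from 0, on opposite sides.
print(math.log(2.0), math.log(0.5))
```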


In [21]:
# local library housing a custom neural net with two hidden layers
import mfinchmods as mf

In [22]:
#Tw_Net - a near clone of the Senti_Net
mlp = mf.Tw_Net(processed_tweets[:-1000],
                labels[:-1000],
                min_count=20,
                polarity_cutoff=0.05,
                learning_rate=0.01)
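
The internals of `Tw_Net` are not shown, so the following is only a guess at what `min_count` and `polarity_cutoff` might control: dropping rare words and words whose log ratio is too close to neutral before building the input layer. The function `build_vocab`, the smoothing, and the toy data are all hypothetical.

```python
import math
from collections import Counter

def build_vocab(tweets, labels, min_count=20, polarity_cutoff=0.05):
    """Keep words that are frequent enough and lean clearly toward one party."""
    dem, rep, total = Counter(), Counter(), Counter()
    for text, label in zip(tweets, labels):
        target = dem if label == "democrat" else rep
        for word in text.split():
            target[word] += 1
            total[word] += 1

    vocab = set()
    for word, cnt in total.items():
        if cnt <= min_count:
            continue  # too rare to be a reliable signal
        log_ratio = math.log((dem[word] + 1) / (rep[word] + 1))
        if abs(log_ratio) >= polarity_cutoff:
            vocab.add(word)  # clearly leans one way
    return vocab

toy_tweets = ["health care news", "great news today", "health news", "great great news"]
toy_labels = ["democrat", "republican", "democrat", "republican"]
vocab = build_vocab(toy_tweets, toy_labels, min_count=1, polarity_cutoff=0.05)
print(sorted(vocab))
```

In the toy data, 'news' is used equally by both sides and is filtered out by the polarity cutoff, while 'care' and 'today' fall below the minimum count.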

In [23]:
mlp.train(processed_tweets[:-1000],labels[:-1000])

How much I've read:0.0% How fast I can read:(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%
How much I've read:10.0% How fast I can read:(reviews/sec):4534. #Correct:1956 #Trained:2501 Training Accuracy:78.2%
How much I've read:20.1% How fast I can read:(reviews/sec):4671. #Correct:4057 #Trained:5001 Training Accuracy:81.1%
How much I've read:30.2% How fast I can read:(reviews/sec):4757. #Correct:6188 #Trained:7501 Training Accuracy:82.4%
How much I've read:40.2% How fast I can read:(reviews/sec):4807. #Correct:8352 #Trained:10001 Training Accuracy:83.5%
How much I've read:50.3% How fast I can read:(reviews/sec):4837. #Correct:10513 #Trained:12501 Training Accuracy:84.0%
How much I've read:60.4% How fast I can read:(reviews/sec):4837. #Correct:12732 #Trained:15001 Training Accuracy:84.8%
How much I've read:70.5% How fast I can read:(reviews/sec):4852. #Correct:14955 #Trained:17501 Training Accuracy:85.4%
How much I've read:80.5% How fast I can read:(reviews/sec):4856. #

In [40]:
# the internal print statement is broken (it reports 0 correct), but the actual accuracy is 85.7% - we can see this in the 'results' print out
# note: the slice [-1000:-1] excludes the final tweet, so 999 tweets are tested
results = mlp.predict(processed_tweets[-1000:-1],labels[-1000:-1])

How much I've read:99.8% How fast I can read:(reviews/sec):5426. #Correct:0 #Tested:999 Testing Accuracy:0.0%

In [41]:
# results returns the overall predictions with pred on the left and label on the right
results

{'a republican tweeted this': 'republican',
 'a democrat tweeted this': 'democrat'}
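
Since the built-in progress line reports 0 correct, the held-out accuracy could also be checked independently. The sketch below assumes only that a classifier returns strings like the ones shown above ('a democrat tweeted this' / 'a republican tweeted this'); `accuracy` and `toy_classify` are hypothetical stand-ins, not part of `mfinchmods`.

```python
def accuracy(classify, texts, labels):
    """Fraction of texts whose predicted string contains the true label."""
    correct = 0
    for text, label in zip(texts, labels):
        prediction = classify(text)  # e.g. 'a democrat tweeted this'
        if label in prediction:
            correct += 1
    return correct / len(texts)

def toy_classify(text):
    # Stand-in classifier used only to make this example runnable.
    if "health" in text:
        return "a democrat tweeted this"
    return "a republican tweeted this"

toy_texts = ["health care", "great news", "health news", "tax cut"]
toy_labels = ["democrat", "republican", "republican", "republican"]
toy_accuracy = accuracy(toy_classify, toy_texts, toy_labels)
print(toy_accuracy)
```

With the notebook's objects, the equivalent call would presumably be `accuracy(mlp.classify, processed_tweets[-1000:], labels[-1000:])`, where the slice `[-1000:]` includes the final tweet.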

## Tw_Net Performance

In [93]:
print('Model Classification:')
print(mlp.classify(processed_tweets[18]))
print('\n')
print('Actual Label:')
print(labels[18])
print('\n')
print('Processed Tweet:')
print(processed_tweets[18])
print('\n')
print('Raw Tweet')
print(tweets[18])

Model Classification:
a republican tweeted this


Actual Label:
republican


Processed Tweet:
rt secondladi ’ begin look lot like christma vice presid ’ resid thank volunt help de…


Raw Tweet
RT @SecondLady: It’s beginning to look a lot like Christmas at the Vice President’s Residence! Thank you to the volunteers who helped to de…


In [95]:
print('Model Classification:')
print(mlp.classify(processed_tweets[2]))
print('\n')
print('Actual Label:')
print(labels[2])
print('\n')
print('Processed Tweet:')
print(processed_tweets[2])
print('\n')
print('Raw Tweet')
print(tweets[2])

Model Classification:
a democrat tweeted this


Actual Label:
democrat


Processed Tweet:
least 32 million nurs caregiv food servic worker america dont access paid sick leav moral wrong coronaviru crisi make clear put u risk


Raw Tweet
At least 32 million nurses, caregivers, and food service workers in America don't have access to any paid sick leave. 

It's morally wrong and, as the coronavirus crisis makes clear, it puts us all at risk. https://t.co/b6ZTIhBWl8


In [96]:
print('Model Classification:')
print(mlp.classify(processed_tweets[325]))
print('\n')
print('Actual Label:')
print(labels[325])
print('\n')
print('Processed Tweet:')
print(processed_tweets[325])
print('\n')
print('Raw Tweet')
print(tweets[325])

Model Classification:
a democrat tweeted this


Actual Label:
democrat


Processed Tweet:
onward togeth end 2017 support six incred organ fight protect vote right make easier young divers candid get ballot get elect


Raw Tweet
Onward Together is ending 2017 by supporting six more incredible organizations fighting to protect voting rights and to make it easier for young, diverse candidates to get on the ballot and get elected.


In [94]:
print('Model Classification:')
print(mlp.classify(processed_tweets[2000]))
print('\n')
print('Actual Label:')
print(labels[2000])
print('\n')
print('Processed Tweet:')
print(processed_tweets[2000])
print('\n')
print('Raw Tweet')
print(tweets[2000])

Model Classification:
a republican tweeted this


Actual Label:
republican


Processed Tweet:
america lost patriot humbl servant georg herbert walker bush heart heavi today also fill gratitud thought entir bush famili tonight – inspir georg barbara ’ exampl


Raw Tweet
America has lost a patriot and humble servant in George Herbert Walker Bush. While our hearts are heavy today, they are also filled with gratitude. Our thoughts are with the entire Bush family tonight – and all who were inspired by George and Barbara’s example. https://t.co/g9OUPu2pjY
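
The four inspection cells above differ only in the tweet index, so they could be folded into one helper. This is a sketch: `show_prediction` is a name I am introducing, and the stub classifier and one-item lists exist only so the example runs on its own.

```python
def show_prediction(classify, index, processed, labels, raw):
    """Print the model output, true label, processed text, and raw text for one tweet."""
    report = "\n".join([
        "Model Classification:", classify(processed[index]), "",
        "Actual Label:", labels[index], "",
        "Processed Tweet:", processed[index], "",
        "Raw Tweet", raw[index],
    ])
    print(report)
    return report

# Stand-in data and classifier (hypothetical, for illustration only).
demo_raw = ["Great news today!"]
demo_processed = ["great news today"]
demo_labels = ["republican"]
report = show_prediction(lambda text: "a republican tweeted this", 0,
                         demo_processed, demo_labels, demo_raw)
```

In the notebook this would be called as `show_prediction(mlp.classify, 18, processed_tweets, labels, tweets)`, and likewise for indices 2, 325, and 2000.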
