
# Breitbart vs. The Onion: predicting who tweeted what

<p align="center">
<img src="http://cdn.someecards.com/someecards/usercards/1279906132319_9016369.png">
</p>


I mean, you'd think a human would be able to pretty quickly tell the difference between Breitbart and The Onion, but

1. are you sure? Every time? The Onion is really good at its job sometimes, and 
2. even if you can, can your machine?

It's not exactly the foremost question of our time, but it makes for a fun afternoon.

### Data collection 

I used the [python-twitter](https://python-twitter.readthedocs.io/en/latest/) library to collect tweets from [@BreitbartNews](http://twitter.com/BreitbartNews) and [@TheOnion](http://twitter.com/TheOnion). 

You can read the documentation on the various data structures it returns [here](https://python-twitter.readthedocs.io/en/latest/models.html).

In [1]:
import twitter, re, datetime, pandas as pd

class twitterminer():

    request_limit   =   20    
    api             =   False
    data            =   []
    
    twitter_keys = {
        'consumer_key':        '', #insert your own twitter keys here
        'consumer_secret':     '',
        'access_token_key':    '',
        'access_token_secret': ''
    }
    
    def __init__(self,  request_limit = 20):
        
        self.request_limit = request_limit
        
        self.set_api()
        
    def set_api(self):
        
        self.api = twitter.Api(
            consumer_key         =   self.twitter_keys['consumer_key'],
            consumer_secret      =   self.twitter_keys['consumer_secret'],
            access_token_key     =   self.twitter_keys['access_token_key'],
            access_token_secret  =   self.twitter_keys['access_token_secret']
        )

    def mine_user_tweets(self, user=""):

        statuses   =   self.api.GetUserTimeline(screen_name=user, count=self.request_limit, include_rts=False)
        data       =   []
        
        for item in statuses:

            mined = {
                'tweet_id': item.id,
                'handle': item.user.name,
                'retweet_count': item.retweet_count,
                'text': item.text,
                'mined_at': datetime.datetime.now(),
                'created_at': item.created_at,
            }
            
            data.append(mined)
            
        return data

In [2]:
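#Let's ask for the most recent thousand tweets from both accounts
#(the timeline endpoint returns at most 200 per call, and retweets are dropped, so we'll get fewer than that)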
miner = twitterminer(request_limit = 1000)
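If you actually want something closer to a thousand tweets, one option is to page backwards through the timeline using `max_id`. This is a rough sketch rather than what the notebook ran: `fetch_timeline` and `fetch_page` are hypothetical names, and the fake timeline exists only so the loop can be run without API keys.

```python
from collections import namedtuple


def fetch_timeline(fetch_page, limit, page_size=200):
    """Collect up to `limit` statuses by walking backwards through a timeline.

    `fetch_page(count, max_id)` is a hypothetical callable that returns a list
    of status objects (newest first), each with an integer `.id` attribute.
    """
    collected = []
    max_id = None
    while len(collected) < limit:
        count = min(page_size, limit - len(collected))
        page = fetch_page(count=count, max_id=max_id)
        if not page:
            break  # nothing older came back, so stop asking
        collected.extend(page)
        # max_id is inclusive, so ask for tweets strictly older than the last one
        max_id = page[-1].id - 1
    return collected[:limit]


# A fake 450-tweet timeline standing in for the real API
Status = namedtuple("Status", "id")
fake_timeline = [Status(i) for i in range(450, 0, -1)]


def fake_fetch_page(count, max_id):
    older = [s for s in fake_timeline if max_id is None or s.id <= max_id]
    return older[:count]


statuses = fetch_timeline(fake_fetch_page, limit=1000)
print(len(statuses))

# With python-twitter, the callable could wrap the real endpoint, e.g.:
# fetch_timeline(
#     lambda count, max_id: miner.api.GetUserTimeline(
#         screen_name="TheOnion", count=count, max_id=max_id, include_rts=False),
#     limit=1000)
```

One caveat: with `include_rts=False` a page can come back short (or even empty) while older tweets still exist, so a sturdier version would probably track the oldest ID it has seen rather than stopping at the first empty page.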

In [3]:
#create a new dataframe with tweets by The Onion
onion_tweets = miner.mine_user_tweets("TheOnion")
onion_tweets = pd.DataFrame(onion_tweets)

In [4]:
onion_tweets.head()

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id
0,Mon Jan 30 16:53:00 +0000 2017,The Onion,2017-01-30 11:54:31.522798,3,"The Week In Pictures – Week Of January 30, 201...",826110981025779712
1,Mon Jan 30 16:16:04 +0000 2017,The Onion,2017-01-30 11:54:31.522808,42,Finland Aims To Be Tobacco-Free By 2020 https:...,826101685491752963
2,Mon Jan 30 15:39:04 +0000 2017,The Onion,2017-01-30 11:54:31.522812,233,Man Dying From Cancer Spends Last Good Day On ...,826092371255386113
3,Mon Jan 30 14:48:06 +0000 2017,The Onion,2017-01-30 11:54:31.522815,219,Parents Seize Creative Control Of 3rd-Grade Ar...,826079548089458690
4,Mon Jan 30 13:57:03 +0000 2017,The Onion,2017-01-30 11:54:31.522818,395,Slow-Witted Conspiracy Theorist Convinced Gove...,826066701460582400


In [5]:
#and another with Breitbart tweets
breitbart_tweets = miner.mine_user_tweets("BreitbartNews")
breitbart_tweets = pd.DataFrame(breitbart_tweets)

In [6]:
breitbart_tweets.head()

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id
0,Mon Jan 30 16:47:01 +0000 2017,Breitbart News,2017-01-30 11:54:35.601077,56,Can't Chill the Mill! https://t.co/QZJHj9Pif7,826109473156689921
1,Mon Jan 30 16:20:49 +0000 2017,Breitbart News,2017-01-30 11:54:35.601087,224,Maybe start worrying about your own citizens i...,826102879287128064
2,Mon Jan 30 15:53:43 +0000 2017,Breitbart News,2017-01-30 11:54:35.601092,65,Thanks for reading! 👋 https://t.co/j2m76LDglc,826096060888150017
3,Mon Jan 30 15:40:30 +0000 2017,Breitbart News,2017-01-30 11:54:35.601095,109,Here we go. https://t.co/P3QDUNqz3V,826092732875669505
4,Mon Jan 30 15:23:03 +0000 2017,Breitbart News,2017-01-30 11:54:35.601097,213,Keep calling everyone Nazis and hope for the b...,826088340537692162


In [9]:
#How similar do the ngrams for both accounts look?
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter

vect = TfidfVectorizer(ngram_range=(5,5))

summaries = " ".join(onion_tweets['text'])  #join on a space so the last word of one tweet doesn't fuse with the first word of the next
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(10)

[(u'https co csf5qubhed https co', 4),
 (u'visit https co csf5qubhed https', 4),
 (u'complete coverage of the last', 2),
 (u'americans most physically active when', 2),
 (u'of secretary of state john', 2),
 (u'fun activity he signed up', 2),
 (u'shotgunning brewhas with diamond joe', 2),
 (u'excruciating torment he experiences at', 2),
 (u'secretary of state john kerry', 2),
 (u'experiences at every moment https', 2)]

In [10]:
vect = TfidfVectorizer(ngram_range=(5,5))

summaries = " ".join(breitbart_tweets['text'])  #join on a space so the last word of one tweet doesn't fuse with the first word of the next
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(10)

#not too similar, so that probably bodes well if we're training an algorithm to tell them apart
#there's a tonne of numbers/handles in the Breitbart tweets, but when I googled them, they're all Curt Schilling
#promoting Dinesh D'Souza. Should be okay to leave that in here.

[(u'in 877 240 1776 gt', 6),
 (u'with gehrig38 call in 877', 6),
 (u'whatever it takes with gehrig38', 6),
 (u'gehrig38 call in 877 240', 6),
 (u'call in 877 240 1776', 6),
 (u'240 1776 gt https co', 6),
 (u'takes with gehrig38 call in', 6),
 (u'it takes with gehrig38 call', 6),
 (u'877 240 1776 gt https', 6),
 (u'hour two whatever it takes', 3)]
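If the `https co ...` entries look odd, it helps to see what the analyzer is doing under the hood. Here's a small pure-Python sketch of it, assuming scikit-learn's default token pattern (runs of two or more word characters); `word_ngrams` is my own helper name, the two tweets are invented, and this version counts tweet by tweet so no n-gram straddles two tweets.

```python
import re
from collections import Counter

# scikit-learn's default token pattern: runs of two or more word characters
TOKEN = re.compile(r"(?u)\b\w\w+\b")


def word_ngrams(text, n=5):
    """Lowercase, tokenize and return every run of n consecutive tokens."""
    tokens = TOKEN.findall(text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# made-up tweets, only here to show the mechanics
tweets = [
    "Area Man Unsure Why He Signed Up https://t.co/abc123",
    "Area Man Unsure Why He Bothered https://t.co/xyz789",
]

counts = Counter()
for tweet in tweets:
    counts.update(word_ngrams(tweet))

print(counts.most_common(1))
```

A URL like `https://t.co/abc123` gets split on punctuation into `https`, `co` and `abc123` (the lone `t` is too short to count as a token), which is why link fragments show up inside the n-grams above.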

In [11]:
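#okay, let's concat the two dataframes so we can work with tweets from both accounts 
#(ignore_index gives every row its own index label instead of repeating 0, 1, 2... for each account)
all_tweets = pd.concat([onion_tweets, breitbart_tweets], ignore_index=True)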

### Working with the data

The ultimate goal is to classify tweets by whether they come from Breitbart or The Onion. The raw tweets aren't in a format a model can work with, so I'll use a single feature (the text of the tweet) to predict the handle (my y variable). 

I'll be using [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to vectorize the text data, initializing a bunch of classification models, training them on a subset of tweets and testing them on the rest to see how accurate they are, and then pulling in tweets from accounts affiliated with Breitbart and The Onion to see how well the models do at predicting similar tweets. 

<p align="center">
<img src="https://cdn.meme.am/cache/instances/folder666/65701666.jpg">
</p>

Sidenote: Allie Brosh is amazing and everyone should read [Hyperbole and a Half](https://hyperboleandahalf.blogspot.com/).

In [15]:
#see image above
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import train_test_split  #sklearn.cross_validation was deprecated and later removed
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import numpy as np

In [17]:
#preprocess the text of the tweets
tfv = TfidfVectorizer(lowercase=True, strip_accents='unicode')
X_all = tfv.fit_transform(all_tweets['text'])
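To make the weighting a bit less of a black box, here's a pure-Python sketch of the arithmetic I understand `TfidfVectorizer` to do with its default settings: raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row. The `tfidf` function and the two example headlines are made up for illustration, and accent stripping is left out.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"(?u)\b\w\w+\b")


def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [TOKEN.findall(d.lower()) for d in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1.0 + n_docs) / (1.0 + df[t])) + 1.0 for t in df}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


# invented headlines, just to have something to weigh
docs = ["Area man wins lottery", "Area man loses lottery ticket"]
rows = tfidf(docs)
for term in sorted(rows[0]):
    print("{}: {:.3f}".format(term, rows[0][term]))
```

Words that appear in both headlines ("area", "man", "lottery") end up with lower weights than the word unique to the first one ("wins"), which is the whole point: terms that show up everywhere tell you less about who wrote the tweet.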

In [39]:
#set up all the classifiers
#(degree only affects the 'poly' kernel, so it has no effect on this linear SVC)
svc = SVC(kernel='linear', degree=4)
rf = RandomForestClassifier(max_depth=10, n_estimators=500)
knn = KNeighborsClassifier(n_neighbors=3)
lr = LogisticRegression()

In [19]:
#create a function to do all the things, specifically:
# - split the data into training and testing sets
# - fit the model and predict whether each tweet in the test set is Breitbart's or The Onion's
# - print a confusion matrix to see how the model classified the tweets from each account
# - print the accuracy score (the share of test tweets the model labelled correctly)
# - print a classification report, which shows the precision and recall (sensitivity) for each account
names = ["is breitbart", "is onion", "pred breitbart", "predicted onion"]
def do_model(model, X, y, names):
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    #rows are the true handles, columns are the predicted handles
    conmat = confusion_matrix(Y_test, Y_pred, labels=['Breitbart News','The Onion'])
    confusion = pd.DataFrame(conmat, index=names[0:2], columns=names[2:])
    print(confusion)
    acc = accuracy_score(Y_test, Y_pred)
    print("accuracy score is {}".format(acc))
    print(classification_report(Y_test, Y_pred))
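One thing worth flagging about this setup: the vectorizer was fit on every tweet before the split, so the idf weights have already seen the test tweets. A possible way around that, sketched here on invented placeholder data rather than the real tweets, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so the vocabulary is learned from the training split only. The texts and the `account_a` / `account_b` labels are stand-ins I made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# invented stand-ins for all_tweets['text'] and all_tweets['handle']
texts = [
    "senate vote delayed again", "governor signs budget bill",
    "court blocks new travel rule", "mayor announces road repairs",
    "committee schedules fresh hearing", "agency publishes annual report",
    "area man forgets own birthday", "local dog elected treasurer",
    "nation agrees nap sounds good", "study finds pizza still popular",
    "toddler files formal complaint", "cat declines to comment",
]
labels = ["account_a"] * 6 + ["account_b"] * 6

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, stratify=labels, test_size=0.33, random_state=42)

# the vectorizer is fit inside the pipeline, so it only ever sees X_train
model = make_pipeline(
    TfidfVectorizer(lowercase=True, strip_accents='unicode'),
    LogisticRegression(),
)
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(list(preds))
```

With a few hundred tweets the difference in scores is probably small, but the pipeline version is the more honest estimate of how the model would do on tweets it has never seen.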



In [30]:
#first run the support vector machine classifier
do_model(svc, X_all, all_tweets['handle'], names)

#that's a pretty decent accuracy score. When the model calls a tweet The Onion it's usually right (precision 0.93), 
#but it misses a quarter of the real Onion tweets (recall 0.75) by labelling them Breitbart. Breitbart tweets are rarely missed (recall 0.95).

              pred breitbart  predicted onion
is breitbart              61                3
is onion                  14               41
accuracy score is 0.857142857143
                precision    recall  f1-score   support

Breitbart News       0.81      0.95      0.88        64
     The Onion       0.93      0.75      0.83        55

   avg / total       0.87      0.86      0.85       119
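If you want to check where the numbers in that report come from, they can all be recomputed from the confusion matrix by hand. This is just a sketch of the standard definitions (`report` is my own helper, and it doesn't guard against dividing by zero), fed with the SVC counts printed above.

```python
def report(conmat, labels):
    """conmat[i][j] = number of tweets with true label i predicted as label j."""
    total = sum(sum(row) for row in conmat)
    correct = sum(conmat[i][i] for i in range(len(labels)))
    out = {"accuracy": correct / float(total)}
    for i, label in enumerate(labels):
        true_positives = conmat[i][i]
        predicted_as_label = sum(row[i] for row in conmat)  # column total
        actually_label = sum(conmat[i])                      # row total
        precision = true_positives / float(predicted_as_label)
        recall = true_positives / float(actually_label)
        f1 = 2 * precision * recall / (precision + recall)
        out[label] = (precision, recall, f1)
    return out


# SVC confusion matrix from the output above
svc_report = report([[61, 3], [14, 41]], ["Breitbart News", "The Onion"])
for label in ["Breitbart News", "The Onion"]:
    print(label, [round(v, 2) for v in svc_report[label]])
print("accuracy", svc_report["accuracy"])
```

Precision answers "when the model says Onion, how often is it right?" (41 of 44), while recall answers "of the real Onion tweets, how many did it catch?" (41 of 55).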



In [20]:
#now the random forest classifier
do_model(rf, X_all, all_tweets['handle'], names)
#again, the model is rarely wrong when it predicts The Onion (precision 0.97), but it now sends 22 of the 55 Onion tweets to Breitbart (recall 0.60)

              pred breitbart  predicted onion
is breitbart              63                1
is onion                  22               33
accuracy score is 0.806722689076
                precision    recall  f1-score   support

Breitbart News       0.74      0.98      0.85        64
     The Onion       0.97      0.60      0.74        55

   avg / total       0.85      0.81      0.80       119



In [25]:
#how does a KNN do?
do_model(knn, X_all, all_tweets['handle'], names)
#the Onion precision drops to 0.80 because 10 Breitbart tweets are now labelled Onion, but the Onion recall climbs back to 0.75.
#overall (going off the f1-score, which is the harmonic mean of precision and recall) it does about as well as the random forest

              pred breitbart  predicted onion
is breitbart              54               10
is onion                  14               41
accuracy score is 0.798319327731
                precision    recall  f1-score   support

Breitbart News       0.79      0.84      0.82        64
     The Onion       0.80      0.75      0.77        55

   avg / total       0.80      0.80      0.80       119



In [26]:
#and logistic regression
do_model(lr, X_all, all_tweets['handle'], names)
#whoo! That beats the random forest and the knn by four to five points of accuracy, but the SVC still edges it out (by a single test tweet). We'll use that for now.

              pred breitbart  predicted onion
is breitbart              62                2
is onion                  16               39
accuracy score is 0.848739495798
                precision    recall  f1-score   support

Breitbart News       0.79      0.97      0.87        64
     The Onion       0.95      0.71      0.81        55

   avg / total       0.87      0.85      0.85       119



### Check how well the top-faring model does against random Onion Labs tweets

In [87]:
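#let's start by fitting the model again, this time on all of the tweets (no train/test split, because we're bringing in new tweets to test on)
#note that this vectorizer also drops English stop words, unlike the one used above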
tfidf = TfidfVectorizer(strip_accents='unicode',lowercase=True,stop_words='english')
X_all = tfidf.fit_transform(all_tweets['text'])

estimator = SVC(kernel='linear', degree=4, probability=True)
estimator.fit(X_all, all_tweets['handle'])

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=4, gamma='auto', kernel='linear',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [88]:
#double check classes 
print(estimator.classes_)

[u'Breitbart News' u'The Onion']


In [89]:
# Prep our source as vectors
source_test = [
    "There are over 10,000 lakes in Minnesota, & there's a dare for every one of them. How many would you try?",
    "If only Miles the interactive personality pump from @BP_America was on Twitter. Alas, this video will have to do http://onion.com/2gESxpN "
]

#transform with the already-fitted vectorizer, but DO NOT REFIT it (or the model).
X = tfidf.transform(source_test)

#aaand predict! (columns follow the order of estimator.classes_)
print(estimator.predict_proba(X))

[[ 0.19155199  0.80844801]
 [ 0.39333073  0.60666927]]


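The SVC estimator correctly gives both tweets a higher probability of being from The Onion: about 0.81 for the first and 0.61 for the second. The first call is fairly confident; the second is a lot closer to a coin flip.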

### Just for additional kicks: Using TextBlob to conduct sentiment analysis on tweets

In [85]:
from textblob import TextBlob

In [90]:
#first create a new column in the dataframe that uses textblob to ...create a blob of text.
all_tweets['blob'] = all_tweets['text'].map(TextBlob)

In [92]:
#next get the polarity of the textblob with a simple lambda function
all_tweets['polarity'] = all_tweets['blob'].map(lambda x: x.sentiment.polarity)

In [93]:
#and the same with the subjectivity of each textblob
all_tweets['subjectivity'] = all_tweets['blob'].map(lambda x: x.sentiment.subjectivity)

In [98]:
#check to see if the columns came out okay
all_tweets.head()

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id,blob,polarity,subjectivity
0,Mon Jan 30 16:53:00 +0000 2017,The Onion,2017-01-30 11:54:31.522798,3,"The Week In Pictures – Week Of January 30, 201...",826110981025779712,"(T, h, e, , W, e, e, k, , I, n, , P, i, c, ...",0.0,0.0
1,Mon Jan 30 16:16:04 +0000 2017,The Onion,2017-01-30 11:54:31.522808,42,Finland Aims To Be Tobacco-Free By 2020 https:...,826101685491752963,"(F, i, n, l, a, n, d, , A, i, m, s, , T, o, ...",0.0,0.0
2,Mon Jan 30 15:39:04 +0000 2017,The Onion,2017-01-30 11:54:31.522812,233,Man Dying From Cancer Spends Last Good Day On ...,826092371255386113,"(M, a, n, , D, y, i, n, g, , F, r, o, m, , ...",0.35,0.333333
3,Mon Jan 30 14:48:06 +0000 2017,The Onion,2017-01-30 11:54:31.522815,219,Parents Seize Creative Control Of 3rd-Grade Ar...,826079548089458690,"(P, a, r, e, n, t, s, , S, e, i, z, e, , C, ...",0.5,1.0
4,Mon Jan 30 13:57:03 +0000 2017,The Onion,2017-01-30 11:54:31.522818,395,Slow-Witted Conspiracy Theorist Convinced Gove...,826066701460582400,"(S, l, o, w, -, W, i, t, t, e, d, , C, o, n, ...",-0.4,0.7


In [99]:
#how do The Onion and Breitbart compare in terms of the sentiment polarity of their tweets? 
all_tweets.groupby('handle').polarity.describe()
#polarity just looks at the negativity and positivity of words (-1 being the most negative and 1 being the most positive), so it looks like on average, The Onion uses slightly more
#positive language than Breitbart (mean 0.065 vs. 0.045), though Breitbart covers a wider range (-0.9 to 1.0). 
#For the most part, they both avoid language that's super positive or negative: the median polarity is 0 for each

handle               
Breitbart News  count    192.000000
                mean       0.045009
                std        0.252172
                min       -0.900000
                25%        0.000000
                50%        0.000000
                75%        0.078958
                max        1.000000
The Onion       count    166.000000
                mean       0.065367
                std        0.207902
                min       -0.457143
                25%        0.000000
                50%        0.000000
                75%        0.136364
                max        0.700000
Name: polarity, dtype: float64

In [100]:
#how do The Onion and Breitbart compare in terms of how subjective their tweets are? 
all_tweets.groupby('handle').subjectivity.describe()
#subjectivity in textblob looks at words to see how subjective or objective they are (1 being the most subjective and 0 being the most objective)
#given that both Breitbart and Onion market themselves as news sites (in one way or another), I'm not too surprised to see that
#their mean subjectivity is fairly low (about 0.24 and 0.27), and that at least half of Breitbart's tweets score a flat 0
#it IS interesting to see that both handles have tweeted something that rates as completely subjective, though

handle               
Breitbart News  count    192.000000
                mean       0.242091
                std        0.330249
                min        0.000000
                25%        0.000000
                50%        0.000000
                75%        0.500000
                max        1.000000
The Onion       count    166.000000
                mean       0.269967
                std        0.283788
                min        0.000000
                25%        0.000000
                50%        0.200000
                75%        0.500000
                max        1.000000
Name: subjectivity, dtype: float64

I think the chief conclusion here is that, while fake news and parody news construct their tweets in pretty similar ways (both imitate the objective style that real news orgs use), machines are relatively good at telling them apart: the best models here labelled about 85% of the held-out tweets correctly.

Of course, the next step here is to teach machines how to tell the difference between "objective" language that is steeped in hate and ...language that isn't.