In [1]:
import pandas as pd
import numpy as np
import re
from collections import Counter
from sklearn.cluster import KMeans
from random import randint

In [2]:
tweets = pd.read_pickle("../../data/cp_tweets.pkl")

## Abstract
In this notebook, we determine for each tweet which of these classes it belongs to:
* Talks about Trump
* Talks about Hillary
* Talks about both
* None of the above (we could not assign it to any of the other groups)

We decide this by checking the keywords (hashtags and mentions) in each tweet; these keywords are our features. Once each tweet is labeled, we can run sentiment analysis or topic analysis, or compare the results with demographic data such as population or state.


## Getting the features
We define a function to extract the features from each tweet. Our features are hashtags and mentions; we do not distinguish between them and treat them the same way. Therefore, we remove *#* and *@* from the hashtags and mentions, and from now on we call them keywords.

In [3]:
def features(tweet):
    tweet = tweet.lower()
    
    # Get hashtags and mentions with a regex (raw string, so \s and \w are
    # passed to the regex engine instead of being read as string escapes).
    # This could be improved by taking hashtags and mentions from the API.
    tags = re.findall(r'(^|\s)([@#][\w-]+)', tweet)
    tags = [item[1] for item in tags]
    
    # Remove the mention and hashtag symbols
    tags = [item.replace('@', '').replace('#', '') for item in tags]
    
    # Remove repetitive elements
    tags = list(set(tags))
    
    # If the list is empty, return a NaN
    return tags if tags else np.nan
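
To see what the extraction keeps and what it drops, here is a small self-contained sketch of the same regex idea on a made-up tweet. The sample text and the helper name `extract_keywords` are illustrative assumptions, and unlike `features` this helper returns an empty list instead of NaN when nothing is found.

```python
import re

def extract_keywords(text):
    """Return the sorted, de-duplicated hashtags and mentions of a tweet,
    lowercased and without the leading '#' or '@'."""
    # A tag must start the text or follow whitespace, as in features().
    tags = re.findall(r'(?:^|\s)([@#][\w-]+)', text.lower())
    return sorted({tag.lstrip('@#') for tag in tags})

# Hypothetical tweet: a repeated hashtag and an e-mail address that must not match.
sample = "@realDonaldTrump wins the debate #MAGA #maga, says user@example.com"
print(extract_keywords(sample))
```

The e-mail address is ignored because its `@` does not follow whitespace, and the two spellings of the hashtag collapse into one keyword after lowercasing.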

We apply the function to each tweet and then remove the tweets that have no keywords.

In [4]:
kw = tweets.text.apply(features)
kw.dropna(inplace=True)

In [5]:
print(kw.head(10))
print("\nWe reduce the dataset from",len(tweets), "to", len(kw))

0                               [theblaze, realdonaldtrump]
1         [barackobama, fbi, trumppence, lorettalynch, n...
10                           [realdonaldtrump, mike4193496]
100       [joenbc, morningjoe, williegeist, mike_pence, ...
1000                                      [realdonaldtrump]
10000     [realdonaldtrump, barackobama, grammybijou3, i...
100000    [barackobama, mmqc15, walshfreedom, hillarycli...
100001      [tnuts, trumpers, dt, foxnews, realdonaldtrump]
100002      [realdonaldtrump, hillaryclinton, melaniatrump]
100003        [supportourtroopsandvets, bigduhie1955, maga]
Name: text, dtype: object

We reduce the dataset from 657307 to 597024


We define a function that, given *n*, returns a list of the *n* most common keywords. This list will be used to build a binary matrix.

In [6]:
def getNKeywords(kw, n):
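    # Count how many tweets contain each keyword
    dic = Counter()
    for key, keywords in kw.items():  # Series.iteritems() was removed in pandas 2.0
        for k in keywords: dic[k] += 1
            
    # Sort the keywords by descending count and keep the first n
    output = sorted(dic, key=dic.__getitem__, reverse=True)
    return output[:n]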

# Getting the N most common keywords
all_keywords = getNKeywords(kw, 50)
print(all_keywords)

['realdonaldtrump', 'hillaryclinton', 'trump', 'foxnews', 'cnn', 'nevertrump', 'maga', 'imwithher', 'neverhillary', 'hillary', 'trumppence16', 'donaldtrump', 'crookedhillary', 'seanhannity', 'kellyannepolls', 'msnbc', 'potus', 'gop', 'trumptrain', 'epn', 'nytimes', 'politico', 'mike_pence', 'reince', 'morning_joe', 'dumptrump', 'makeamericagreatagain', 'danscavino', 'erictrump', 'trump2016', 'donaldjtrumpjr', 'abc', 'basketofdeplorables', 'cnnpolitics', 'barackobama', 'greta', 'teamtrump', 'billclinton', 'timkaine', 'loudobbs', 'morningmika', 'tcot', 'rogerjstonejr', 'ingrahamangle', 'maddow', 'katrinapierson', 'washingtonpost', 'speakerryan', 'hfa', 'anncoulter']
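
The counting step can be checked on a toy input. The following is a minimal sketch that uses `Counter.most_common` instead of the explicit sort; the keyword lists in `toy_kw` are made up for illustration.

```python
from collections import Counter

# Hypothetical keyword lists, one per tweet.
toy_kw = [["trump", "maga"], ["trump"], ["hillary", "trump"], ["hillary"]]

# Count every keyword occurrence across all tweets.
counts = Counter(k for keywords in toy_kw for k in keywords)

# Keep only the names of the 2 most frequent keywords.
top2 = [k for k, _ in counts.most_common(2)]
print(top2)
```

Because `features` de-duplicates the keywords of each tweet, each count is the number of tweets that contain the keyword.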


# Building a binary matrix
To fit the clustering method we need an $N \times M$ matrix, but what we have so far is a list of lists whose inner lists have different lengths. That is why we build a binary matrix where each row is a tweet and each column is a keyword from the list extracted above. Each cell indicates whether that tweet has that keyword (True) or not (False).

In [7]:
matrix = kw.apply(lambda items : np.array([keyword in items for keyword in all_keywords]))
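
As an illustration of the encoding, here is a minimal sketch on a toy vocabulary. The names `vocabulary`, `toy_kw` and `toy_matrix` and their contents are assumptions made for the example, not data from the notebook.

```python
import numpy as np

vocabulary = ["trump", "hillary", "maga"]            # hypothetical keyword list
toy_kw = [["trump", "maga"], ["hillary"], ["cnn"]]   # hypothetical tweets

# One row per tweet, one column per vocabulary keyword.
toy_matrix = np.array([[k in items for k in vocabulary] for items in toy_kw])
print(toy_matrix)

# Rows without any True contain no vocabulary keyword.
has_keyword = toy_matrix.any(axis=1)
print(has_keyword)
```

The last tweet only mentions `cnn`, which is outside the toy vocabulary, so its row is all False and it carries no information for the clustering.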

Because we kept only the 50 most common keywords, some tweets no longer contain any keyword from the list (their row is all False). We remove those tweets.

In [8]:
reduced_matrix = matrix.apply(lambda x : x if True in x else np.nan)
reduced_matrix.dropna(inplace=True)
print("\nWe reduce the dataset from",matrix.shape[0], "to", reduced_matrix.shape[0])


We reduce the dataset from 597024 to 573745


## K-means
Our main goal is to classify each tweet into one of these 4 groups:
* Talks about Trump
* Talks about Hillary
* Talks about both
* It's not relevant
    
For that, we use a clustering method, K-means. K-means is a method for cluster analysis where you choose the number of clusters and it computes centroids, which mark the 'middle' of each group (or class). K-means tends to produce clusters of similar size, which does not suit our case: we want 4 groups, and the largest one will be the one that talks about Trump. Therefore, we apply K-means with *n_clusters* larger than 4, say 15, and then classify manually whether each cluster talks about Trump, Hillary or both.

Keep in mind that we are dividing keywords (mainly hashtags and mentions) into groups, so each group has a 'top-5' of keywords that characterize it. Once we know the most common keywords for each group, we can decide who its tweets are talking about.

In [9]:
# 'lloyd' is the classic K-means algorithm; scikit-learn versions before 1.1 call it 'full'.
cluster = KMeans(n_clusters=15, random_state=0, algorithm='lloyd').fit(reduced_matrix.tolist())

## What the centroids tell us
The centroids from the clustering tell us the probability that each keyword appears in the tweets of a cluster. In other words, if a keyword has a value of 1, it appears in all the tweets of that group. Likewise, a value of 0.45 means there is a 45% chance that the keyword appears in a randomly selected tweet from that group.
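
This reading holds because, once K-means has converged, each centroid is the mean of the points assigned to it, and the mean of a 0/1 column is the fraction of rows that contain a 1. A minimal sketch with a made-up cluster (the array `members` is hypothetical):

```python
import numpy as np

# Hypothetical cluster of four tweets over three keywords (1 = keyword present).
members = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
])

# The centroid is the column-wise mean of the cluster members.
centroid = members.mean(axis=0)
print(centroid)
```

Here the first keyword appears in every tweet, the second in half of them, and the third in none.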

In [10]:
cluster.cluster_centers_[0]

array([ 1.00000000e+00,  8.06466005e-13,  1.06339452e-02, -6.78207490e-14,
        2.04489203e-14,  1.95433947e-13,  7.43918815e-14, -7.33857419e-14,
        1.62479444e-03, -5.52335955e-15,  7.07356031e-03,  2.62799682e-03,
        2.96633174e-03,  5.77524140e-14,  2.09374238e-02,  5.97987301e-03,
        6.20018412e-03,  8.37182221e-03,  2.94666111e-03,  1.53116222e-02,
        9.94547300e-03,  1.00398920e-02,  1.17945127e-02,  6.95160237e-03,
        8.98554602e-03,  1.75068651e-03,  6.65260872e-03,  1.05001849e-02,
        8.94227062e-03,  1.91985397e-03,  8.80064205e-03,  1.98280000e-03,
        1.79396190e-03,  3.99313888e-03,  3.27712777e-03,  4.31967142e-03,
        5.47630475e-03,  1.00320238e-03,  1.07008254e-03,  6.35754920e-03,
        7.49451189e-03,  4.95699999e-04,  1.12122619e-03,  4.09149206e-03,
        2.77749365e-03,  6.09002856e-03,  3.09615793e-03,  4.09936031e-03,
        5.78316666e-04,  4.53998253e-03])

As we see in the first centroid, there is a probability of 1 that 'realdonaldtrump' appears in the tweets from that cluster, and practically no chance that 'hillaryclinton', 'foxnews', 'cnn', 'nevertrump', 'maga' or 'imwithher' appear. Values such as `-6.78e-14` are floating-point noise and can be read as 0. The centroids preserve the order of the variable *all_keywords*, which is how we know which keyword each number belongs to.

We have to define a threshold to determine which keywords are significant for each class. This is only to visualise and clarify what each cluster is talking about; the K-means algorithm has already been applied. 'Undefined group' marks a class that has no significant keywords.

In [11]:
p = 0.95 # Threshold probability: keep the keywords whose centroid value is at least p.
clusters = {}
counters = Counter(cluster.labels_)  # number of tweets per cluster label
for c, center in enumerate(cluster.cluster_centers_):
    s = pd.Series(center)
    idxs = s[s>=p].index
    clusters[c] = {}
    clusters[c]['keywords'] = [all_keywords[i] for i in idxs] if idxs.tolist() else 'Undefined group'
    clusters[c]['tweets'] = counters[c]
        
clusters

{0: {'keywords': ['realdonaldtrump'], 'tweets': 254186},
 1: {'keywords': ['trump'], 'tweets': 48855},
 2: {'keywords': ['hillaryclinton'], 'tweets': 111544},
 3: {'keywords': ['nevertrump'], 'tweets': 16926},
 4: {'keywords': ['donaldtrump'], 'tweets': 10733},
 5: {'keywords': ['hillary'], 'tweets': 9684},
 6: {'keywords': ['realdonaldtrump', 'hillaryclinton'], 'tweets': 25779},
 7: {'keywords': ['foxnews'], 'tweets': 13494},
 8: {'keywords': ['maga'], 'tweets': 16798},
 9: {'keywords': ['seanhannity'], 'tweets': 4890},
 10: {'keywords': ['imwithher'], 'tweets': 14116},
 11: {'keywords': ['cnn'], 'tweets': 12846},
 12: {'keywords': ['neverhillary'], 'tweets': 9367},
 13: {'keywords': 'Undefined group', 'tweets': 18152},
 14: {'keywords': ['hillaryclinton', 'foxnews'], 'tweets': 6375}}

## Sampling some tweets with their class

Here we sample *n* random tweets and show which class each one belongs to.

In [28]:
n = 5
for _ in range(n):
    # randint includes both endpoints, so the upper bound must be len - 1
    i = randint(0, len(cluster.labels_) - 1)
    idx = reduced_matrix.index[i]
    print(tweets.at[idx, 'text'])
    print(clusters[cluster.labels_[i]]['keywords'], "\n")

@DanScavino @realDonaldTrump 199
['realdonaldtrump'] 

@realDonaldTrump liar liar pants on fire!
['realdonaldtrump'] 

@ananavarro @realDonaldTrump https://t.co/rIHjwzkTo7
['realdonaldtrump'] 

Well THIS is worth sharing, too!Thanks, @PeterHase2014  #NeverTrump https://t.co/0DwCeMsVfp
['nevertrump'] 

So Maddow made Conway look like a fool defending #Trump and unconstitutional measures! bwahaha!
['trump'] 



# Converting the classes from K-means to our classes

Once we have the classes obtained from K-means, we can decide manually what each tweet is talking about (Trump, Hillary, both or irrelevant), based on the most common keywords in its class.
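
One possible way to record that manual decision is a small lookup over the significant keywords of each cluster. This is only a sketch: the keyword sets `TRUMP` and `HILLARY` and the helper `final_class` are assumptions made for illustration, and the sets should be adjusted after inspecting the clusters.

```python
# Hypothetical keyword sets chosen by hand from the cluster keywords.
TRUMP = {"realdonaldtrump", "trump", "donaldtrump", "nevertrump", "maga"}
HILLARY = {"hillaryclinton", "hillary", "imwithher", "neverhillary"}

def final_class(keywords):
    """Map a cluster's significant keywords to one of the four classes."""
    if isinstance(keywords, str):  # the 'Undefined group' marker
        return "irrelevant"
    about_trump = any(k in TRUMP for k in keywords)
    about_hillary = any(k in HILLARY for k in keywords)
    if about_trump and about_hillary:
        return "both"
    if about_trump:
        return "trump"
    if about_hillary:
        return "hillary"
    return "irrelevant"

print(final_class(["realdonaldtrump", "hillaryclinton"]))
print(final_class(["foxnews"]))
```

With these sets, a cluster whose only significant keyword is a news outlet falls into the irrelevant class, while a cluster that pairs a news outlet with a candidate keyword is assigned to that candidate.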