# Tweet Topic Clustering to predict a Tweet's # of Favorites

How much does a tweet's content affect the number of "favorites" it receives? We will use doc2vec (an extension of word2vec) to embed tweets in a vector space. Then we will cluster these vectors using K-Means and see whether those clusters carry any explanatory power for favorite counts.

In terms of process flow:
MongoDB -> Gensim Python Package -> K-Means Clustering -> statsmodels OLS

In [1]:
import pymongo
from pymongo import MongoClient
from nltk.corpus import stopwords
import string,logging,re
import pandas as pd
import gensim
import statsmodels.api as sm
import statsmodels.formula.api as smf

Using TensorFlow backend.




# Query MongoDB and Process Tweets

We will read in tweets from a MongoDB collection holding the last ~3200 tweets posted by a user. In this example, we explore the most consequential Twitter handle: @RealDonaldTrump

In [2]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
c=MongoClient()
tweets_data=c.twitter.tweets
data=[x['text'] for x in tweets_data.find({'user.name':'Donald J. Trump'},{'text':1, '_id':0})]

For our tweet processing:

    Lowercase, split into words, and drop English stopwords

    Remove all punctuation

    Remove tokens that are only numeric or only 1 character in length

    Remove links and the 'amp' tokens left over from '&amp;'

    Tag each tweet with an integer id for our doc2vec model

In [3]:
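# Translation table that deletes every ASCII punctuation character
table = str.maketrans({key: None for key in string.punctuation})
# Build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words("english"))
tokenized_tweets = list()
for d in data:
    # Lowercase, split on whitespace, and drop stopwords
    text = [word for word in d.lower().split() if word not in stop_words]
    # Strip punctuation from each remaining token
    text = [t.translate(table) for t in text]
    clean_tokens = []
    for token in text:
        # Keep tokens that contain a letter, are longer than 1 character,
        # and are neither an 'amp' remnant nor a link.
        # Note: '@' was already stripped with the punctuation, so the '@' check never fires.
        if re.search('[a-zA-Z]', token) and len(token)>1 and '@' not in token and token!='amp' and 'https' not in token:
            clean_tokens.append(token)
    tokenized_tweets.append(clean_tokens)
# Tag each tweet with its integer position for doc2vec
tag_tokenized = [gensim.models.doc2vec.TaggedDocument(tokenized_tweets[i], [i]) for i in range(len(tokenized_tweets))]
tag_tokenized[10]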

TaggedDocument(words=['use', 'social', 'media', 'presidential', 'it’s', 'modern', 'day', 'presidential', 'make', 'america', 'great', 'again'], tags=[10])
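
As a variation on the cleaning steps above (not the code that produced the results in this notebook), the sketch below wraps them in a single function and drops mentions, links, and `&amp;` while their marker characters are still present. It also strips typographic quotes, which `string.punctuation` does not cover (hence the 'it’s' token in the output above). The names `clean_tweet`, `STOPWORDS`, and `PUNCT_TABLE` are hypothetical, and the tiny stopword set is only a stand-in for NLTK's English list.

```python
import re
import string

# Hypothetical stand-in for nltk's stopwords.words("english"), kept tiny so
# the sketch runs without downloading the NLTK corpus.
STOPWORDS = {"the", "to", "be", "is", "we", "if", "can", "as", "our", "in", "and"}

# string.punctuation is ASCII-only. Adding curly quotes and the ellipsis is an
# assumption about what should count as punctuation, not part of the original.
PUNCT_TABLE = str.maketrans("", "", string.punctuation + "\u2018\u2019\u201c\u201d\u2026")


def clean_tweet(text, stopwords=STOPWORDS):
    """Return a list of cleaned, lowercase tokens for one tweet."""
    tokens = []
    for word in text.lower().split():
        # Drop mentions, links, and the '&amp;' entity before punctuation is stripped.
        if "@" in word or "http" in word or word == "&amp;":
            continue
        if word in stopwords:
            continue
        token = word.translate(PUNCT_TABLE)
        # Keep tokens that contain a letter and are longer than one character.
        if len(token) > 1 and re.search("[a-z]", token):
            tokens.append(token)
    return tokens


example = "It\u2019s great! .@FLOTUS &amp; I visited https://t.co/abc in 2017"
print(clean_tweet(example))
```

Whether mentions should be removed or kept as plain words is a modeling choice; keeping them, as the cell above effectively does, lets account names act as topic signals.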

In [4]:
print(data[:5])
print(len(data))
print ('\n')
print(tokenized_tweets[:5])
print(len(tokenized_tweets))

['If we can help little #CharlieGard, as per our friends in the U.K. and the Pope, we would be delighted to do so.', 'At some point the Fake News will be forced to discuss our great jobs numbers, strong economy, success with ISIS, the border &amp; so much else!', 'Will be speaking with Italy this morning!', 'Spoke yesterday with the King of Saudi Arabia about peace in the Middle-East. Interesting things are happening!', 'Will be speaking with Germany and France this morning.']
3219


[['help', 'little', 'charliegard', 'per', 'friends', 'uk', 'pope', 'would', 'delighted', 'so'], ['point', 'fake', 'news', 'forced', 'discuss', 'great', 'jobs', 'numbers', 'strong', 'economy', 'success', 'isis', 'border', 'much', 'else'], ['speaking', 'italy', 'morning'], ['spoke', 'yesterday', 'king', 'saudi', 'arabia', 'peace', 'middleeast', 'interesting', 'things', 'happening'], ['speaking', 'germany', 'france', 'morning']]
3219


# Doc2Vec Embedding - From Words to Vectors

We train our doc2vec model on the tagged tweets for 200 iterations and use min_count=3 to drop words that appear fewer than 3 times in the corpus. Every tweet is represented in a 200-dimensional vector space.

Our resulting document vectors are stored in model.docvecs

In [34]:
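# Parameter names follow the gensim release used here; gensim 4 renames
# size -> vector_size, iter -> epochs, and docvecs -> dv.
model = gensim.models.doc2vec.Doc2Vec(size=200, min_count=3, iter=200)
model.build_vocab(tag_tokenized)
model.train(tag_tokenized, total_examples=model.corpus_count, epochs=model.iter)
print(model.docvecs[10])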

2017-07-03 12:22:39,800 : INFO : collecting all words and their counts
2017-07-03 12:22:39,802 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2017-07-03 12:22:39,824 : INFO : collected 6503 word types and 3219 unique tags from a corpus of 3219 examples and 33749 words
2017-07-03 12:22:39,826 : INFO : Loading a fresh vocabulary
2017-07-03 12:22:39,835 : INFO : min_count=3 retains 2157 unique words (33% of original 6503, drops 4346)
2017-07-03 12:22:39,838 : INFO : min_count=3 leaves 28446 word corpus (84% of original 33749, drops 5303)
2017-07-03 12:22:39,847 : INFO : deleting the raw counts dictionary of 6503 items
2017-07-03 12:22:39,848 : INFO : sample=0.001 downsamples 47 most-common words
2017-07-03 12:22:39,851 : INFO : downsampling leaves estimated 26029 word corpus (91.5% of prior 28446)
2017-07-03 12:22:39,856 : INFO : estimated required memory for 2157 words and 200 dimensions: 7104900 bytes
2017-07-03 12:22:39,868 : INFO : resetting layer weig

[  8.55307952e-02  -2.29421154e-01  -2.97324687e-01   4.31422025e-01
  -4.76930439e-01   1.78160936e-01   2.72290576e-02   6.79558933e-01
  -9.21427608e-02   1.65501520e-01   2.35088587e-01   6.92448437e-01
   3.12060624e-01   2.12603897e-01  -9.68669951e-02  -2.57202089e-01
  -1.85139507e-01   2.08733723e-01   1.16149141e-02   3.46391261e-01
   2.09190175e-02  -2.75064230e-01  -1.51937649e-01   8.85751005e-03
   5.30224562e-01  -4.55350876e-01   3.15887958e-01   2.57615417e-01
  -1.81422219e-01   5.05250059e-02  -2.04724640e-01  -3.02806310e-02
   4.78561163e-01  -1.46701247e-01   2.20914364e-01   3.71472389e-01
   2.03948408e-01   9.40331966e-02   3.17314297e-01  -1.01005437e-03
  -5.47288954e-02  -4.60623443e-01  -2.29597747e-01  -2.40998387e-01
   3.45313460e-01   1.74022958e-01   3.84957343e-01  -1.23862557e-01
   2.28261322e-01   3.86696681e-02   1.28710955e-01   5.51725149e-01
   2.78684080e-01  -8.50958899e-02   3.17794979e-01  -4.42174017e-01
   2.04700276e-01   1.81669250e-01

# Clustering our document vectors using K-Means

We fit a K-Means model with 50 clusters and keep the cluster label assigned to each tweet.

In [43]:
from sklearn.cluster import KMeans
num_clusters = 50
km = KMeans(n_clusters=num_clusters)
km.fit(model.docvecs)
clusters = km.labels_.tolist()
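
For intuition about what `KMeans` does with the document vectors, here is a minimal pure-Python sketch of Lloyd's algorithm, the basic assign-then-update loop behind K-Means. It is not scikit-learn's implementation, which adds k-means++ initialization and multiple restarts. The function names `kmeans` and `sq_dist` and the toy points are assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20, seed=0):
    """Cluster `points` into `k` groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialize centroids at k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Two well-separated toy groups standing in for document vectors.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels, toy_centroids = kmeans(toy_points, k=2)
print(toy_labels)
```

Because the starting centroids are random, cluster ids are arbitrary and can change between runs; passing a fixed `random_state` to scikit-learn's `KMeans` makes its labels reproducible.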

# Adding in all other Metrics and preparing OLS dataset

In [44]:
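# Use the same user filter as the text query above so rows line up one-to-one with `data`
data_jsn=[x for x in tweets_data.find({'user.name':'Donald J. Trump'},
                     {'favorite_count':1,'retweet_count':1,'created_at':1,'entities.hashtags':1,'entities.urls':1,
                      'entities.media':1,'_id':0})]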

In [45]:
df=pd.io.json.json_normalize(data_jsn)

In [46]:
df['has_hashtags']=[len(a)>0 for a in df['entities.hashtags']]
df['has_urls']=[len(a)>0 for a in df['entities.urls']]
df['has_media']=1-df['entities.media'].isnull()
df['dow']=df.created_at.str[:3]
df['time']=df.created_at.str[11:13]
df['time_group']=df.time.astype('int64').mod(4)
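
The string slices above rely on the fixed layout of Twitter's `created_at` field. As a hedged alternative, the timestamp can be parsed with the standard library, which also makes the time grouping explicit. Note that `.mod(4)` places hours 0, 4, 8, ... in the same group; if contiguous four-hour blocks were intended, floor division would give them. The function name `time_features` and the key names are hypothetical, and the hours are in UTC, as the `+0000` offset indicates.

```python
from datetime import datetime


def time_features(created_at):
    """Parse a Twitter-style timestamp and derive day/time features."""
    ts = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    return {
        "dow": ts.strftime("%a"),
        "hour": ts.hour,
        "hour_mod_4": ts.hour % 4,      # what .mod(4) computes in the cell above
        "hour_block_4h": ts.hour // 4,  # alternative: six contiguous 4-hour blocks
    }


print(time_features("Mon Jul 03 14:00:21 +0000 2017"))
```

Switching to the floor-division grouping would change the regression inputs, so the results reported in this notebook apply only to the modulo version.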

In [47]:
df_clusters=pd.concat([pd.Series(clusters),df,pd.Series(data)],axis=1)

In [48]:
df_clusters.columns=['cluster']+list(df.columns)+['text']
df_clusters['cluster_cat']=pd.Categorical(df_clusters['cluster'])
df_clusters.head()

| | cluster | created_at | entities.hashtags | entities.media | entities.urls | favorite_count | retweet_count | has_hashtags | has_urls | has_media | dow | time | time_group | text | cluster_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | Mon Jul 03 14:00:21 +0000 2017 | [{'text': 'CharlieGard', 'indices': [22, 34]}] | | [] | 27279 | 9279 | True | False | 0 | Mon | 14 | 2 | If we can help little #CharlieGard, as per our... | 19 |
| 1 | 17 | Mon Jul 03 12:10:44 +0000 2017 | [] | | [] | 56934 | 14786 | False | False | 0 | Mon | 12 | 0 | At some point the Fake News will be forced to ... | 17 |
| 2 | 49 | Mon Jul 03 11:38:20 +0000 2017 | [] | | [] | 35696 | 6392 | False | False | 0 | Mon | 11 | 3 | Will be speaking with Italy this morning! | 49 |
| 3 | 23 | Mon Jul 03 11:19:08 +0000 2017 | [] | | [] | 43428 | 13346 | False | False | 0 | Mon | 11 | 3 | Spoke yesterday with the King of Saudi Arabia ... | 23 |
| 4 | 23 | Mon Jul 03 11:00:56 +0000 2017 | [] | | [] | 37902 | 7269 | False | False | 0 | Mon | 11 | 3 | Will be speaking with Germany and France this ... | 23 |


# OLS modeling for Tweet Favorite Count

We will compare OLS regressions of favorite_count with and without our clusters (var='cluster_cat') as an independent variable. Both models also control for day of week, time of day (time_group, the posting hour modulo 4), and whether the tweet has linked URLs, media, or hashtags. Furthermore, we filter out tweets with 0 favorites.

In [49]:
results_baseline = smf.ols('favorite_count ~ dow + time_group + has_media + has_hashtags + has_urls', data=df_clusters[(df_clusters.favorite_count>0)]).fit()
results_clusters = smf.ols('favorite_count ~ dow + time_group + has_media + has_hashtags + has_urls+cluster_cat', data=df_clusters[(df_clusters.favorite_count>0)]).fit()

Comparing OLS results for both models: the baseline has R2=0.104 and the model with clusters has R2=0.180. This indicates that grouping tweets into 50 distinct topic clusters improves our model fit. Note that R2 never decreases when regressors are added, so adjusted R2 is the fairer comparison. Either way, the explanatory power is low, and most of the variance in favorite counts remains unexplained.
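
To put the two R2 values on a comparable footing, the sketch below computes adjusted R2 and the F statistic for nested models from the rounded figures quoted above. The observation count (3001) and the baseline's 10 model degrees of freedom come from the summary output; the 49 extra parameters for the cluster model are an assumption (50 cluster levels minus one reference level, with every cluster present after filtering). The function names are hypothetical, and the results are approximate because the inputs are rounded.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def nested_f_stat(r2_small, r2_big, n_obs, p_small, p_big):
    """F statistic for testing the extra predictors of the larger nested model."""
    numerator = (r2_big - r2_small) / (p_big - p_small)
    denominator = (1 - r2_big) / (n_obs - p_big - 1)
    return numerator / denominator


n_obs = 3001
p_base = 10               # Df Model reported for the baseline
p_clusters = p_base + 49  # assumption: 49 cluster dummies are added

print(round(adjusted_r2(0.104, n_obs, p_base), 3))  # 0.101, matching the summary
print(adjusted_r2(0.180, n_obs, p_clusters))
print(nested_f_stat(0.104, 0.180, n_obs, p_base, p_clusters))
```

With the fitted models in memory, statsmodels can run this comparison directly via `results_clusters.compare_f_test(results_baseline)`, which also returns a p-value.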

In [51]:
print(results_baseline.summary())
print(results_clusters.summary())

                            OLS Regression Results                            
Dep. Variable:         favorite_count   R-squared:                       0.104
Model:                            OLS   Adj. R-squared:                  0.101
Method:                 Least Squares   F-statistic:                     34.55
Date:                Mon, 03 Jul 2017   Prob (F-statistic):           2.48e-64
Time:                        12:24:37   Log-Likelihood:                -36251.
No. Observations:                3001   AIC:                         7.252e+04
Df Residuals:                    2990   BIC:                         7.259e+04
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------
Intercept             6.204e+04 

# Taking a look at some of the clusters

In [53]:
#Cluster 16
#In this case the cluster seems to relate to the Clinton email investigation
for tweet in df_clusters[df_clusters.cluster_cat==16].text:
    print(tweet)

FBI Director Comey was the best thing that ever happened to Hillary Clinton in that he gave her a free pass for many bad deeds! The phony...
being a movie star-and that was season 1 compared to season 14. Now compare him to my season 1. But who cares, he supported Kasich &amp; Hillary
Just got a call from my friend Bill Ford, Chairman of Ford, who advised me that he will be keeping the Lincoln plant in Kentucky - no Mexico
Clinton Aides: ‘Definitely’ Not Releasing Some HRC Emails:
https://t.co/GY5pGKaDoW
RT @PaulaReidCBS: .@CBSNews confirms FBI found emails on #AnthonyWeiner computer, related to Hillary Clinton server, that are "new" &amp; not p…
Crooked Hillary Clinton deleted 33,000 e-mails AFTER they were subpoenaed by the United States Congress. Guilty - cannot run. Rigged system!
'Podesta urged Clinton team to hand over emails after use of private server emerged' https://t.co/NYvVmoA8wl
Just out: Neera Tanden, Hillary Clinton adviser said, “Israel is depressing.” I think Israel is

In [54]:
#Cluster 33
#General topic of WH visits by foreign leaders and others
for tweet in df_clusters[df_clusters.cluster_cat==33].text:
    print(tweet)

Getting rdy to leave for tonight's Celebrate Freedom Concert honoring our GREAT VETERANS w/ so many of my evangelic… https://t.co/shfTAoMJry
It was my great honor to welcome the 2016 World Series Champion Chicago @Cubs ⚾️ to the @WhiteHouse this afternoon.… https://t.co/GvPhJ9hZdv
#ICYMI- on Monday, I had the great honor of welcoming India's Prime Minister @narendramodi to the WH. Full Remarks:… https://t.co/7IIQEh7xXv
.@FLOTUS &amp; I were honored to host our first WH Congressional Picnic. A wonderful evening &amp; tradition. @MarineBand:… https://t.co/h5L4myWmam
It was a great honor to welcome President Petro Poroshenko of Ukraine to the @WhiteHouse today with @VP Pence.
➡️… https://t.co/J1ulOd6pYQ
Melania and I offer our deepest condolences to the family of Otto Warmbier. Full statement: https://t.co/8kmcA6YtFD https://t.co/EhrP4BiJeB
"National Security Presidential Memorandum on Strengthening the Policy of the United States Toward Cuba"
Memorandum… https://t.co/RVr9Dg2hXt
Thank you
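
Reading every tweet by hand works for a couple of clusters but not for all 50. One rough way to get a first label for each cluster is to count its most frequent tokens. The sketch below is a minimal version of that idea; the function name `top_words_by_cluster` and the toy inputs are hypothetical, and in this notebook the inputs would be `tokenized_tweets` and `clusters`.

```python
from collections import Counter, defaultdict


def top_words_by_cluster(token_lists, labels, n=5):
    """Map each cluster label to its n most frequent tokens."""
    counts = defaultdict(Counter)
    for tokens, label in zip(token_lists, labels):
        counts[label].update(tokens)
    return {label: [word for word, _ in counter.most_common(n)]
            for label, counter in counts.items()}


# Toy stand-ins for tokenized_tweets and clusters.
toy_tokens = [
    ["clinton", "emails", "fbi"],
    ["clinton", "emails"],
    ["clinton"],
    ["welcome", "whitehouse"],
    ["welcome", "honor"],
    ["welcome"],
]
toy_labels = [16, 16, 16, 33, 33, 33]
print(top_words_by_cluster(toy_tokens, toy_labels, n=2))
```

Raw counts favor words that are common everywhere, so this is only a starting point for naming clusters, not a substitute for reading the tweets.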