# Tweet Topic Clustering to predict a Tweet's # of Favorites

How much does a tweet's content affect the number of "favorites" it receives? We will use doc2vec (an extension of word2vec) to embed tweets in a vector space. Then we will cluster these vectors using K-Means and test whether those clusters carry any explanatory power for favorite counts.

In terms of process flow:
MongoDB -> Gensim Python Package -> K-Means Clustering -> StatsModel OLS

In [1]:
import pymongo
from pymongo import MongoClient
from nltk.corpus import stopwords
import string,logging,re
import pandas as pd
import gensim
import statsmodels.api as sm
import statsmodels.formula.api as smf

Using TensorFlow backend.




# Query MongoDB and Process Tweets

We will read in tweets from a MongoDB collection holding the last ~3200 tweets posted by a user. In this example, we explore the most consequential Twitter handle: @RealDonaldTrump

In [2]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
c=MongoClient()
tweets_data=c.twitter.tweets
data=[x['text'] for x in tweets_data.find({'user.name':'Donald J. Trump'},{'text':1, '_id':0})]

For our tweet processing:

- Lowercase and split into words, dropping English stopwords
- Remove all punctuation
- Remove tokens that are only numeric or only 1 character in length
- Remove links and '&' symbols
- Tag each tweet with an integer id for our doc2vec model

In [3]:
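# Translation table that deletes every ASCII punctuation character
table = str.maketrans({key: None for key in string.punctuation})
# Build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words("english"))
tokenized_tweets = []
for d in data:
    # Lowercase, split on whitespace, and drop stopwords (matched before punctuation is stripped)
    text = [word for word in d.lower().split() if word not in stop_words]
    text = [t.translate(table) for t in text]
    clean_tokens = []
    for token in text:
        # Keep tokens with at least one letter and more than one character; drop '&amp;' residue and links.
        # '@' is already removed by the punctuation table, so the '@' test never fires.
        if re.search('[a-zA-Z]', token) and len(token) > 1 and '@' not in token and token != 'amp' and 'https' not in token:
            clean_tokens.append(token)
    tokenized_tweets.append(clean_tokens)
# Tag each tweet with its integer position so doc2vec can index it
tag_tokenized = [gensim.models.doc2vec.TaggedDocument(tokens, [i]) for i, tokens in enumerate(tokenized_tweets)]
tag_tokenized[10]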

TaggedDocument(words=['use', 'social', 'media', 'presidential', 'it’s', 'modern', 'day', 'presidential', 'make', 'america', 'great', 'again'], tags=[10])
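
The cleaning steps can also be packaged as a single reusable function, which makes them easier to test on individual tweets. The sketch below is illustrative only: `clean_tweet` and the small `STOP_WORDS` set are hypothetical stand-ins (the notebook itself uses NLTK's English stopword list), and the order of operations mirrors the cell above.

```python
import re
import string

# Stand-in stopword list (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"the", "to", "be", "will", "with", "this", "and", "our", "we"}
# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text, stop_words=STOP_WORDS):
    """Return the cleaned tokens of one tweet."""
    tokens = []
    for word in text.lower().split():
        if word in stop_words:
            continue
        token = word.translate(PUNCT_TABLE)
        has_letter = re.search("[a-z]", token) is not None
        # Drop numeric-only and 1-character tokens, '&amp;' residue, and links.
        if has_letter and len(token) > 1 and token != "amp" and "https" not in token:
            tokens.append(token)
    return tokens


sample = clean_tweet("Will be speaking with Italy this morning!")
```

Because stopwords are matched before punctuation is removed, a word such as "so." survives the stopword filter, which is why 'so' still appears in the first tokenized tweet.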

In [4]:
print(data[:5])
print(len(data))
print ('\n')
print(tokenized_tweets[:5])
print(len(tokenized_tweets))

['If we can help little #CharlieGard, as per our friends in the U.K. and the Pope, we would be delighted to do so.', 'At some point the Fake News will be forced to discuss our great jobs numbers, strong economy, success with ISIS, the border &amp; so much else!', 'Will be speaking with Italy this morning!', 'Spoke yesterday with the King of Saudi Arabia about peace in the Middle-East. Interesting things are happening!', 'Will be speaking with Germany and France this morning.']
3219


[['help', 'little', 'charliegard', 'per', 'friends', 'uk', 'pope', 'would', 'delighted', 'so'], ['point', 'fake', 'news', 'forced', 'discuss', 'great', 'jobs', 'numbers', 'strong', 'economy', 'success', 'isis', 'border', 'much', 'else'], ['speaking', 'italy', 'morning'], ['spoke', 'yesterday', 'king', 'saudi', 'arabia', 'peace', 'middleeast', 'interesting', 'things', 'happening'], ['speaking', 'germany', 'france', 'morning']]
3219


# Doc2Vec Embedding - From Words to Vectors

We train our doc2vec model on the tagged tweets for 200 iterations and use min_count=3 so that words appearing fewer than 3 times are ignored. We represent all the documents in a 200-dimensional vector space.

Our resulting vectorized documents are stored in model.docvecs
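
Once tweets live in a vector space, closeness between two tweets is usually measured with cosine similarity. The following is a minimal sketch in plain Python, not part of the original notebook; `cosine_similarity` and the three toy vectors are hypothetical and only illustrate the calculation on short lists instead of real document vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "document vectors" (made-up values).
tweet_a = [0.2, 0.7, -0.1]
tweet_b = [0.4, 1.4, -0.2]   # same direction as tweet_a
tweet_c = [-0.7, 0.2, 0.0]   # orthogonal to tweet_a

similar = cosine_similarity(tweet_a, tweet_b)
unrelated = cosine_similarity(tweet_a, tweet_c)
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score close to 0, regardless of their lengths.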

In [5]:
model = gensim.models.doc2vec.Doc2Vec(size=200, min_count=3, iter=200)
model.build_vocab(tag_tokenized)
model.train(tag_tokenized, total_examples=model.corpus_count, epochs=model.iter)
print(model.docvecs[10])

2017-07-03 12:49:58,402 : INFO : collecting all words and their counts
2017-07-03 12:49:58,404 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2017-07-03 12:49:58,424 : INFO : collected 6503 word types and 3219 unique tags from a corpus of 3219 examples and 33749 words
2017-07-03 12:49:58,425 : INFO : Loading a fresh vocabulary
2017-07-03 12:49:58,438 : INFO : min_count=3 retains 2157 unique words (33% of original 6503, drops 4346)
2017-07-03 12:49:58,439 : INFO : min_count=3 leaves 28446 word corpus (84% of original 33749, drops 5303)
2017-07-03 12:49:58,448 : INFO : deleting the raw counts dictionary of 6503 items
2017-07-03 12:49:58,451 : INFO : sample=0.001 downsamples 47 most-common words
2017-07-03 12:49:58,453 : INFO : downsampling leaves estimated 26029 word corpus (91.5% of prior 28446)
2017-07-03 12:49:58,454 : INFO : estimated required memory for 2157 words and 200 dimensions: 7104900 bytes
2017-07-03 12:49:58,465 : INFO : resetting layer weig

[-0.0223127   0.71794951  0.15652561 -0.3871288   0.09737334  0.30886981
 -0.1897698  -0.02318754  0.29577002 -0.25845578  0.85019886  0.15728363
  0.66041511 -0.50196338 -0.01232568 -0.69066787 -0.11652973 -0.11894244
  0.25488374 -0.26980492 -0.08240258 -0.33684728  0.03732453  0.17349607
  0.12191286  0.03625823  0.04406116 -0.12207581  0.39899775 -0.17920512
 -0.13989277  0.24545549 -0.00747421 -0.31095013  0.23375721 -0.60718364
 -0.21391943  0.16231993  0.03919971  0.0108712  -0.45206833 -0.46321338
 -0.46140927 -0.07949995 -0.62248254  0.0677736   0.19140556  0.19491951
 -0.21051081 -0.02319764  0.50184739  0.5713349  -0.14749479  0.42342877
 -0.52857226 -0.41388413 -0.41674462  0.14050347  0.06123683  0.20868874
 -0.57156885  0.196704    0.10411382 -0.11350733  0.38653967 -0.48477373
 -0.3338708   0.11424053  0.36781299  0.01491879 -0.17283344  0.45260099
 -0.02258966  0.16506541 -0.50829226  0.04534959 -0.70939964  0.23371673
 -0.47035155 -0.23309872 -0.10950626 -0.09217826  0

# Clustering our document vectors using K-Means

We fit a K-Means model with 50 clusters to the document vectors and keep the resulting cluster labels

In [6]:
from sklearn.cluster import KMeans
num_clusters = 50
km = KMeans(n_clusters=num_clusters)
km.fit(model.docvecs)
clusters = km.labels_.tolist()
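
For readers unfamiliar with what K-Means does, here is a minimal pure-Python sketch of the assign/update loop (Lloyd's algorithm) on toy 2-D points. It is an illustration under simplifying assumptions, not scikit-learn's implementation: the `kmeans` helper is hypothetical and takes fixed starting centroids, whereas scikit-learn's `KMeans` uses k-means++ initialisation by default.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid updates until labels stop changing."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
```

Because the result depends on the starting centroids, cluster ids can differ from run to run; passing a fixed `random_state` to scikit-learn's `KMeans` makes the labels reproducible.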

# Adding in all other Metrics and preparing OLS dataset

In [7]:
data_jsn=[x for x in tweets_data.find({},
                     {'favorite_count':1,'retweet_count':1,'created_at':1,'entities.hashtags':1,'entities.urls':1,
                      'entities.media':1,'_id':0})]

In [8]:
df=pd.io.json.json_normalize(data_jsn)

In [9]:
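# Boolean flags for tweets that contain hashtags or URLs
df['has_hashtags']=[len(a)>0 for a in df['entities.hashtags']]
df['has_urls']=[len(a)>0 for a in df['entities.urls']]
# 1 if the tweet has attached media, 0 otherwise
df['has_media']=1-df['entities.media'].isnull()
# Day of week and hour of day (UTC), sliced from strings like 'Mon Jul 03 14:00:21 +0000 2017'
df['dow']=df.created_at.str[:3]
df['time']=df.created_at.str[11:13]
# Hour of day modulo 4 (values 0-3)
df['time_group']=df.time.astype('int64').mod(4)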
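
The time features above rely on fixed character positions in `created_at`. As an alternative sketch, the same values can be derived by parsing the timestamp with the standard library; `time_features` and `TWITTER_TIME_FORMAT` are hypothetical names, and the format string assumes timestamps shaped like the ones shown in this notebook.

```python
from datetime import datetime

# Assumed layout of Twitter's created_at field, e.g. 'Mon Jul 03 14:00:21 +0000 2017'.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def time_features(created_at):
    """Parse a created_at string and derive the notebook's time columns."""
    ts = datetime.strptime(created_at, TWITTER_TIME_FORMAT)
    return {
        "dow": ts.strftime("%a"),     # abbreviated day of week
        "hour": ts.hour,              # hour of day in the timestamp's own offset
        "time_group": ts.hour % 4,    # same hour-modulo-4 rule as the cell above
    }


features = time_features("Mon Jul 03 14:00:21 +0000 2017")
```

Parsing fails loudly on a malformed timestamp, whereas string slicing would silently produce wrong values.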

In [10]:
df_clusters=pd.concat([pd.Series(clusters),df,pd.Series(data)],axis=1)

In [11]:
df_clusters.columns=['cluster']+list(df.columns)+['text']
df_clusters['cluster_cat']=pd.Categorical(df_clusters['cluster'])
df_clusters.head()

| | cluster | created_at | entities.hashtags | entities.media | entities.urls | favorite_count | retweet_count | has_hashtags | has_urls | has_media | dow | time | time_group | text | cluster_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Mon Jul 03 14:00:21 +0000 2017 | [{'text': 'CharlieGard', 'indices': [22, 34]}] | | [] | 27279 | 9279 | True | False | 0 | Mon | 14 | 2 | If we can help little #CharlieGard, as per our... | 0 |
| 1 | 38 | Mon Jul 03 12:10:44 +0000 2017 | [] | | [] | 56934 | 14786 | False | False | 0 | Mon | 12 | 0 | At some point the Fake News will be forced to ... | 38 |
| 2 | 40 | Mon Jul 03 11:38:20 +0000 2017 | [] | | [] | 35696 | 6392 | False | False | 0 | Mon | 11 | 3 | Will be speaking with Italy this morning! | 40 |
| 3 | 14 | Mon Jul 03 11:19:08 +0000 2017 | [] | | [] | 43428 | 13346 | False | False | 0 | Mon | 11 | 3 | Spoke yesterday with the King of Saudi Arabia ... | 14 |
| 4 | 25 | Mon Jul 03 11:00:56 +0000 2017 | [] | | [] | 37902 | 7269 | False | False | 0 | Mon | 11 | 3 | Will be speaking with Germany and France this ... | 25 |


# OLS modeling for Tweet Favorite Count

We will compare OLS regressions of favorite_count with and without our clusters (var='cluster_cat') as an independent variable. We will also control for day of week, tweet time of day, and whether the tweet had linked URLs, media, or hashtags. Furthermore, we filter out tweets with 0 favorites.

In [12]:
results_baseline = smf.ols('favorite_count ~ dow + time_group + has_media + has_hashtags + has_urls', data=df_clusters[(df_clusters.favorite_count>0)]).fit()
results_clusters = smf.ols('favorite_count ~ dow + time_group + has_media + has_hashtags + has_urls+cluster_cat', data=df_clusters[(df_clusters.favorite_count>0)]).fit()

Comparing OLS results for both models: baseline R2=0.104 and model with clusters R2=0.170. This indicates that clustering tweets into 50 distinct topic groups improves our model fit, although R2 never decreases when regressors are added, so part of the gain is mechanical. Given the low explanatory power overall, plenty of the variance in favorite counts remains unexplained.
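
Since plain R2 cannot fall when regressors are added, adjusted R2 is a fairer yardstick for comparing the two models. The sketch below applies the standard formula, 1 - (1 - R2)(n - 1)/(n - p - 1), to the figures quoted above. The helper `adjusted_r_squared` is hypothetical, and the regressor count for the cluster model (10 baseline terms plus 49 cluster dummies) is an assumption based on 50 clusters that all appear in the filtered data.

```python
def adjusted_r_squared(r_squared, n_obs, n_regressors):
    """Adjusted R2 for a model with an intercept and n_regressors slopes."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_regressors - 1)


# Baseline model: R2, observation count, and Df Model taken from the summary output.
baseline_adj = adjusted_r_squared(0.104, n_obs=3001, n_regressors=10)

# Cluster model: 10 + 49 regressors is an ASSUMPTION (one dummy per cluster beyond the first).
clusters_adj = adjusted_r_squared(0.170, n_obs=3001, n_regressors=10 + 49)

print(round(baseline_adj, 3), round(clusters_adj, 3))
```

The baseline value reproduces the Adj. R-squared reported in the summary; the cluster figure should be read from `results_clusters.rsquared_adj` rather than from this approximation.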

In [13]:
print(results_baseline.summary())
print(results_clusters.summary())

                            OLS Regression Results                            
Dep. Variable:         favorite_count   R-squared:                       0.104
Model:                            OLS   Adj. R-squared:                  0.101
Method:                 Least Squares   F-statistic:                     34.55
Date:                Mon, 03 Jul 2017   Prob (F-statistic):           2.48e-64
Time:                        12:50:29   Log-Likelihood:                -36251.
No. Observations:                3001   AIC:                         7.252e+04
Df Residuals:                    2990   BIC:                         7.259e+04
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------
Intercept             6.204e+04 

# Taking a look at some of the clusters

In [20]:
#Cluster 15
#General topic is Fake News
[tweet for tweet in df_clusters[df_clusters.cluster_cat==15].text[:10]]

["The failing @nytimes writes false story after false story about me. They don't even call to verify the facts of a story. A Fake News Joke!",
 '..under a magnifying glass, they have zero "tapes" of T people colluding. There is no collusion &amp; no obstruction. I should be given apology!',
 'They made up a phony collusion with the Russians story, found zero proof, so now they go for obstruction of justice on the phony story. Nice',
 "....it is very possible that those sources don't exist but are made up by fake news writers. #FakeNews is the enemy!",
 'It is my opinion that many of the leaks coming out of the White House are fabricated lies made up by the #FakeNews media.',
 "Biggest story today between Clapper &amp; Yates is on surveillance. Why doesn't the media report on this? #FakeNews!",
 'The Russia-Trump collusion story is a total hoax, when will this taxpayer funded charade end?',
 'Sally Yates made the fake media extremely unhappy today --- she said nothing but old news!',
 '

In [23]:
#Cluster 40
#General topic of NoSanctuaryForCriminalsAct
[tweet for tweet in df_clusters[df_clusters.cluster_cat==40].text[:10]]

['Will be speaking with Italy this morning!',
 'Weekly Address🇺🇸 #KatesLaw\n#NoSanctuaryForCriminalsAct\nStatement: https://t.co/I8cqKGDK2B https://t.co/00mao6Vk7R',
 'The era of strategic patience with the North Korea regime has failed. That patience is over. We are working closely… https://t.co/MxN04V2Yn4',
 "I am encouraged by President Moon's assurances that he will work to level the playing field for American workers, b… https://t.co/DUZh6aMjJS",
 'Statement on House Passage of Kate’s Law and No Sanctuary for Criminals Act. https://t.co/uPRy9XgK5A',
 'Join me LIVE with @VP, @SecretaryPerry, @SecretaryZinke and @EPAScottPruitt. \n#UnleashingAmericanEnergy\nhttps://t.co/hlM7F2BQD9',
 'MAKE AMERICA SAFE AGAIN!\n\n#NoSanctuaryForCriminalsAct \n#KatesLaw #SaveAmericanLives \n\nhttps://t.co/jbN4hPjqjS',
 'The #AmazonWashingtonPost, sometimes referred to as the guardian of Amazon not paying internet taxes (which they should) is FAKE NEWS!',
 'MAKE AMERICA GREAT AGAIN!']
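
The topic labels above come from reading a handful of tweets per cluster. A rough complement is to list the most frequent tokens in each cluster. The sketch below is illustrative only: `top_tokens_by_cluster` is a hypothetical helper, and the toy inputs stand in for the notebook's `tokenized_tweets` and `clusters` lists.

```python
from collections import Counter, defaultdict


def top_tokens_by_cluster(token_lists, labels, n=3):
    """Return the n most frequent tokens for each cluster label."""
    counts = defaultdict(Counter)
    for tokens, label in zip(token_lists, labels):
        counts[label].update(tokens)
    return {
        label: [token for token, _ in counter.most_common(n)]
        for label, counter in counts.items()
    }


# Toy stand-ins for tokenized_tweets and clusters (made-up values).
toy_tokens = [
    ["fake", "news", "media"],
    ["fake", "news", "story"],
    ["speaking", "italy", "morning"],
    ["speaking", "germany", "france", "morning"],
]
toy_labels = [15, 15, 40, 40]

top = top_tokens_by_cluster(toy_tokens, toy_labels, n=2)
```

In the notebook this would be called as `top_tokens_by_cluster(tokenized_tweets, clusters)`, assuming both lists are still aligned by position.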