For topic modeling, I think it is more useful to treat all of a Twitter user's tweets as one document than to run the analysis on a per-tweet basis. An individual tweet conveys little meaning on its own, mostly because tweets are capped at 140 characters. By aggregating the tweets of each user, we get more robust documents that will likely do better with topic modeling.
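
The aggregation step itself is not shown in this notebook, so the following is only a minimal sketch of how per-user documents might be built. The `tweets` table and its `text` column are hypothetical; only `screen_name` and `documents` mirror columns that appear in the data loaded below.

```python
import pandas as pd

# Hypothetical per-tweet table; the rows and the "text" column are made up for illustration.
tweets = pd.DataFrame({
    "screen_name": ["alice", "bob", "alice", "bob", "alice"],
    "text": ["train delay update", "dog rescue", "lane block update",
             "friend dog", "capitol corridor"],
})

# Join every tweet written by the same user into one space-separated document.
user_docs = (
    tweets.groupby("screen_name")["text"]
    .agg(" ".join)
    .reset_index(name="documents")
)
print(user_docs)
```

Each row of `user_docs` is then one document per user, which is the unit of analysis used in the rest of this section.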

In [129]:

import gensim
import pyLDAvis
from pyLDAvis import gensim as gensimvis
import spacy

from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.preprocessing import Normalizer
import pandas as pd
import numpy as np

# Latent Dirichlet Allocation

In [4]:
data = pd.read_csv('user_with_sentiment.csv')

In [7]:
test = 'this is a test string'

In [8]:
# Phrases expects an iterable of tokenized sentences (lists of tokens), not a raw string
gensim.models.Phrases([test.split()])

<gensim.models.phrases.Phrases at 0x1a26e12860>

In [9]:
data.head()

Unnamed: 0,screen_name,documents,polarity,subjectivity
0,511SFBay,today tomorrow day visit learn green commute ...,0.020008,0.369191
1,ABCPolitics,president trump threaten pull federal funding ...,0.071031,0.420216
2,AdorabIeDog,dog rescue friend slip fall water dog rescue ...,0.179136,0.562469
3,AndrewYNg,apparently 20 year ago already contribute face...,0.195607,0.500348
4,BarackObama,veteran family thank tribute truly match magni...,0.170326,0.445381


In [70]:
count = CountVectorizer( max_features = None, ngram_range=(1,3), max_df = 0.05)

In [71]:
lda = LatentDirichletAllocation(n_components=5, random_state=0,
                                learning_method='batch')

In [72]:
X = count.fit_transform(data.documents)

In [73]:
X_topics = lda.fit_transform(X)



In [74]:
lda.transform(X)

array([[3.03721822e-06, 3.03686186e-06, 9.99987852e-01, 3.03692493e-06,
        3.03723148e-06],
       [2.58908699e-06, 2.59018890e-06, 2.58958041e-06, 2.59054565e-06,
        9.99989641e-01],
       [1.17776230e-05, 1.17771957e-05, 1.17764786e-05, 9.99952892e-01,
        1.17767670e-05],
       [9.99959834e-01, 1.00400437e-05, 1.00406915e-05, 1.00414712e-05,
        1.00436368e-05],
       [9.99982937e-01, 4.26799165e-06, 4.26383566e-06, 4.26574552e-06,
        4.26496527e-06],
       [3.35786038e-06, 9.99986576e-01, 3.35464249e-06, 3.35639481e-06,
        3.35500951e-06],
       [2.57816702e-06, 2.57929347e-06, 2.57718056e-06, 9.99989688e-01,
        2.57744937e-06],
       [1.00247104e-05, 1.00239189e-05, 1.00221117e-05, 1.00230521e-05,
        9.99959906e-01],
       [5.37449172e-06, 5.36996743e-06, 5.40017876e-06, 9.99978485e-01,
        5.36986350e-06],
       [3.60737409e-06, 9.99985577e-01, 3.60453429e-06, 3.60648241e-06,
        3.60483203e-06],
       [9.96336422e-06, 9.9631

In [75]:
X_topics.shape

(79, 5)

In [76]:
lda.components_.shape

(5, 2866220)

Let's check out the top ten terms for each of the five topics found by LDA when using n-grams up to trigrams.

In [77]:
n_top_words = 10
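feature_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d: " % (topic_idx + 1))
    # argsort is ascending, so the reversed slice picks the n_top_words largest weights
    top_indices = topic.argsort()[:-n_top_words - 1:-1]
    print(" ".join(feature_names[i] for i in top_indices))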

Topic 1: 
americangreatness drinkup emtech githubber actonclimate ocasio2018 emtechdigital harvard18 bacow 30 10
Topic 2: 
financial time briefing need briefing need know page financial time front page financial page financial debatenight publish front page publish front unlimited access
Topic 3: 
update residual update residual delay capitol corridor capitol corridor train corridor train block update lane block update update capitol corridor update capitol sf muni
Topic 4: 
abdsc abdsc bigdata abdsc bigdata datascience bigdata datascience ai datascience ai machinelearn sidelinecam analytic datascience predictiveanalytic bigdata analytic datascience dataliteracy
Topic 5: 
kdn let see help follow let follow let know let know share know share next step dm share next step next step dm please follow let


# Latent Semantic Analysis

LDA is a useful method for extracting underlying topics from documents, but another method, Latent Semantic Analysis (LSA), is also very useful. Instead of modeling a document-generating stochastic process as LDA does, LSA uses matrix factorization (a truncated singular value decomposition of the document-term matrix) to extract the most important underlying features.
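
To make the matrix factorization idea concrete, here is a small sketch using plain NumPy on a toy matrix. The counts and the choice of `k = 2` are made up for illustration; the notebook itself relies on scikit-learn's `TruncatedSVD` rather than this code.

```python
import numpy as np

# Toy document-term matrix (3 documents x 4 terms); the counts are invented.
dtm = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
])

# SVD factorizes dtm = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(dtm, full_matrices=False)

k = 2  # number of latent components to keep
doc_embeddings = U[:, :k] * s[:k]  # documents projected into the k-dimensional latent space
term_loadings = Vt[:k]             # weight of each term on each latent component

# Rank-k approximation of the original matrix.
approx = doc_embeddings @ term_loadings
print(np.round(approx, 2))
```

Keeping only the largest `k` singular values discards the weakest directions of variation, which is why the low-dimensional `doc_embeddings` can stand in for the full term counts.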

To run LSA, we first need a document-term matrix. Each row represents one user's document, each column represents a term (here, an n-gram) in the corpus, and each entry is the frequency of that term within the respective document.
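
As a quick illustration of that layout, the sketch below builds a document-term matrix for two made-up documents. The example strings are hypothetical and it uses unigrams only, so that the matrix is small enough to read.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented user documents, used only for illustration.
docs = ["dog rescue dog", "train delay update"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)  # sparse matrix: one row per document, one column per term

# Recover the column order from the fitted vocabulary (term -> column index).
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(terms)
print(dtm.toarray())
```

The first row has a 2 in the `dog` column because that term occurs twice in the first document, and zeros in the columns for terms that only occur in the second.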

In [79]:
data.head()

Unnamed: 0,screen_name,documents,polarity,subjectivity
0,511SFBay,today tomorrow day visit learn green commute ...,0.020008,0.369191
1,ABCPolitics,president trump threaten pull federal funding ...,0.071031,0.420216
2,AdorabIeDog,dog rescue friend slip fall water dog rescue ...,0.179136,0.562469
3,AndrewYNg,apparently 20 year ago already contribute face...,0.195607,0.500348
4,BarackObama,veteran family thank tribute truly match magni...,0.170326,0.445381


In [96]:
vectorizer = CountVectorizer(ngram_range=(1,3))
dtm = vectorizer.fit_transform(data.documents)

In [119]:
lsa = TruncatedSVD(20, algorithm='arpack')
# The ARPACK solver needs a floating-point matrix, so cast the integer counts first
dtm_lsa = lsa.fit_transform(dtm.astype(np.float64))
dtm_lsa = Normalizer(copy=False).fit_transform(dtm_lsa)

In [120]:
lsa.explained_variance_ratio_

array([0.09444461, 0.19630054, 0.12005429, 0.11230977, 0.04684205,
       0.03518249, 0.02678811, 0.02583805, 0.02246343, 0.02047923,
       0.01970678, 0.01844419, 0.01690844, 0.01424714, 0.01297385,
       0.01210545, 0.01204705, 0.01080512, 0.01059234, 0.00991071])

In [127]:
pd.DataFrame(dtm_lsa[:5].round(5), index=data.index[:5])


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0.12093,-0.08309,0.98538,-0.00226,0.04047,0.03706,-0.05017,-0.03132,-0.00695,0.01116,0.00875,0.01748,0.01815,-0.00714,-0.00122,-0.0015,-0.00601,0.00083,-0.00061,-0.00133
1,0.65787,-0.29228,-0.0941,-0.18592,0.57564,-0.0838,-0.02241,-0.00074,-0.15144,0.04299,0.0632,-0.11964,-0.15424,-0.06525,-0.02112,0.07496,-0.11908,0.03211,0.06654,-0.05128
2,0.81858,-0.16723,-0.06681,-0.03051,-0.33462,0.05272,0.07276,-0.07331,-0.04111,-0.07084,0.14767,0.18169,0.20348,-0.11852,-0.04273,-0.19711,0.10782,0.01927,0.01226,0.03672
3,0.75492,-0.19301,-0.05117,0.33904,-0.36537,-0.17756,-0.21561,-0.02505,-0.0612,0.04596,-0.04089,0.04133,0.02451,-0.0446,0.01531,0.00747,0.01647,0.11943,0.18952,0.02374
4,0.44443,-0.1504,-0.06329,-0.09346,0.07158,0.35235,-0.40609,0.62165,-0.01362,-0.23327,0.1262,-0.07001,0.04461,0.05707,-0.0434,-0.0217,0.02603,-0.01411,0.01078,0.04978


In [139]:
test = np.random.randn(5,5)