For topic modeling, I think it is more useful to treat each Twitter user as one document than to run the analysis on a per-tweet basis. An individual tweet will not convey much meaning on its own, mostly because of the 140-character limit. By aggregating each user's tweets into a single document, we get more robust documents that will likely do better with topic modeling.
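
As a rough illustration of the aggregation step, the sketch below groups per-tweet records by author and joins them into one document per user. The records and user names are made up for the example, and this is not necessarily how `user_with_sentiment.csv` was built.

```python
from collections import defaultdict

# Hypothetical per-tweet records: (screen_name, tweet_text).
tweets = [
    ("user_a", "traffic delay on the bridge"),
    ("user_b", "new machine learning course"),
    ("user_a", "lane reopened after the delay"),
]

# Collect every tweet under its author.
tweets_by_user = defaultdict(list)
for screen_name, text in tweets:
    tweets_by_user[screen_name].append(text)

# Join each user's tweets into a single document.
documents = {user: " ".join(texts) for user, texts in tweets_by_user.items()}

for user, doc in documents.items():
    print(user, "->", doc)
```

Each value in `documents` then plays the role of one row in the `documents` column used below.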

In [1]:

import gensim
import pyLDAvis
from pyLDAvis import gensim as gensimvis
from pyLDAvis import sklearn
import spacy

from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.preprocessing import Normalizer
import pandas as pd
import numpy as np
import pprint as pp


# Latent Dirichlet Allocation

In [2]:
data = pd.read_csv('user_with_sentiment.csv')

In [3]:
data.head()

Unnamed: 0,screen_name,documents,textblob_polarity,textblob_subjectivity,vader_polarity,grouped_sentiment
0,511SFBay,today tomorrow day visit learn green commute ...,0.005019,0.312647,-0.273358,-0.217682
1,ABCPolitics,president trump threaten pull federal funding ...,0.073089,0.341123,0.033277,0.04124
2,AdorabIeDog,dog rescue friend slip fall water dog rescue ...,0.14106,0.383278,0.219881,0.204117
3,AndrewYNg,apparently 20 year ago already contribute face...,0.198599,0.424465,0.315762,0.29233
4,BarackObama,veteran family thank tribute truly match magni...,0.14016,0.336449,0.215674,0.200571


In [42]:
count = CountVectorizer( max_features = 40000, ngram_range=(1,2), min_df=3, max_df=0.4)

In [47]:
# n_components is the current name for the former n_topics argument
lda = LatentDirichletAllocation(n_components=10,
                                learning_method='batch', max_iter=100)

In [48]:
X = count.fit_transform(data.documents)

In [49]:
X_topics = lda.fit_transform(X)



Let's look at the top words for each of the ten LDA topics. After trying several n-gram settings, the most coherent topics seemed to come from including bigrams alongside unigrams and capping the vocabulary at 30000 features (note that the vectorizer cell above is set to 40000).
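
To make the n-gram setting concrete, here is a simplified sketch of the features that `ngram_range=(1,2)` yields for one short token sequence. The helper `extract_ngrams` is hypothetical and only mimics the n-gram step; `CountVectorizer` also handles tokenization and the `min_df`/`max_df`/`max_features` filtering.

```python
def extract_ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams of the requested sizes as space-joined strings."""
    low, high = ngram_range
    ngrams = []
    for n in range(low, high + 1):
        # Slide a window of length n across the token list.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


tokens = "saudi arabia caravan".split()
features = extract_ngrams(tokens)
print(features)
```

This is why the topics below mix single words such as `khashoggi` with two-word terms such as `saudi arabia`.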

In [50]:
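n_top_words = 10
# get_feature_names_out requires scikit-learn >= 1.0; older versions use get_feature_names
feature_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d: " % (topic_idx + 1))
    # argsort is ascending, so the reversed slice picks the largest weights first
    top_indices = topic.argsort()[:-n_top_words - 1:-1]
    print(" ".join(feature_names[i] for i in top_indices))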

Topic 1: 
khashoggi brett kavanaugh brexit tariff jamal jamal khashoggi trump say arabia saudi arabia caravan
Topic 2: 
yc blockchain vr mit sponsor content disinformation tesla d4d autonomous combinator
Topic 3: 
espn college football gameday clemson lssc ohio state td playoff oklahoma yard
Topic 4: 
github cohn kaggle data science datum science datum scientist dropbox machine learn dataset data scientist
Topic 5: 
lane block residual residual delay lane open capitol corridor capitol corridor southbound northbound rd
Topic 6: 
lebron sctop10 nba chrome laker disability sctop10 via sox red sox jimmy
Topic 7: 
president obama hillary medicare dreamer first lady bernie work family million american actonclimate american people
Topic 8: 
datascience bigdata machinelearn bigdata datascience datascientist dm abdsc analytic see help machinelearning
Topic 9: 
oakland town hall tesla campfire texan suspicious sanfrancisco san jose 49er rourke
Topic 10: 
bart oakland rider hbs parking hbr transi

In [51]:
prob_distributions = lda.components_ / lda.components_.sum(axis = 1)[:, np.newaxis]

In [52]:
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(lda,X, count)

# Latent Semantic Analysis

LDA is a useful method for extracting underlying topics from documents, but another method, Latent Semantic Analysis (LSA), is also very useful. Instead of modeling a document-generating stochastic process as LDA does, LSA uses matrix factorization to extract the underlying important features.
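
For reference, the factorization behind LSA is usually written as a truncated singular value decomposition. The symbols here are standard notation rather than anything defined in this notebook: $X$ is the document-term matrix with $m$ documents and $n$ terms, and $k$ is the number of components kept.

```latex
X \approx U_k \Sigma_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\quad
V_k \in \mathbb{R}^{n \times k}
```

In scikit-learn's `TruncatedSVD`, the rows of $V_k^{\top}$ correspond to `components_` (term weights per component), and $U_k \Sigma_k$ is what `fit_transform` returns (document coordinates).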

To run LSA, we first need a document-term matrix. Each row corresponds to one user's document and each column represents a term in the corpus. The values inside the matrix are the frequency of that term within the respective document.
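
The layout described above can be sketched by hand on a toy corpus. The two documents and user names are invented for illustration; the notebook itself builds the matrix with `CountVectorizer`, which returns a sparse matrix rather than nested lists.

```python
from collections import Counter

# Hypothetical per-user documents (already cleaned and lemmatized).
documents = {
    "user_a": "lane block delay lane open",
    "user_b": "data science data scientist",
}

# Vocabulary: every distinct term in the corpus, in a fixed column order.
vocabulary = sorted({term for doc in documents.values() for term in doc.split()})

# One row per document; each entry is the term's count in that document.
matrix = []
for doc in documents.values():
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for user, row in zip(documents, matrix):
    print(user, row)
```

Terms that never appear in a document get a count of zero, which is why real document-term matrices are mostly zeros and are stored in sparse form.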

In [37]:
data.head()

Unnamed: 0,screen_name,documents,textblob_polarity,textblob_subjectivity,vader_polarity
0,511SFBay,today tomorrow day visit learn green commute ...,0.014809,0.309244,-0.396608
1,ABCPolitics,president trump threaten pull federal funding ...,0.055523,0.343434,0.036858
2,AdorabIeDog,dog rescue friend slip fall water dog rescue ...,0.133552,0.38631,0.216551
3,AndrewYNg,apparently 20 year ago already contribute face...,0.170709,0.417123,0.310766
4,BarackObama,veteran family thank tribute truly match magni...,0.124232,0.32278,0.227911


In [59]:
vectorizer = CountVectorizer( max_features = 40000, ngram_range=(1,2), min_df=5, max_df=0.4)
dtm = vectorizer.fit_transform(data.documents)

In [60]:
lsa = TruncatedSVD(20, algorithm='arpack')
dtm_lsa = lsa.fit_transform(dtm.asfptype())
dtm_lsa = Normalizer(copy=False).fit_transform(dtm_lsa)

In [61]:
sum(lsa.explained_variance_ratio_)

0.8890411269669992

In [62]:
lsa.explained_variance_ratio_

array([0.36921618, 0.11194742, 0.0902646 , 0.07410271, 0.03450982,
       0.01835107, 0.02496128, 0.02276306, 0.02188167, 0.02079289,
       0.01308321, 0.01294213, 0.01217953, 0.01152576, 0.01126916,
       0.00885206, 0.00896924, 0.0078528 , 0.00699504, 0.00658151])

In [63]:
# get_feature_names_out requires scikit-learn >= 1.0; older versions use get_feature_names
important_words = pd.DataFrame(lsa.components_.round(5), columns=vectorizer.get_feature_names_out())

In [65]:
word_list = list(important_words.columns)
topics_words = []
for i in range(0,10):
    # Pair each term with its weight in component i, largest weight first
    check_topic = sorted(zip(word_list, lsa.components_[i]),
                         key=lambda x: x[1], reverse=True)
    topics_words.append(check_topic[:10])
pp.pprint(topics_words)


[[('datascience', 0.5636139791365203),
  ('bigdata', 0.5282764400744121),
  ('machinelearn', 0.29802335204390684),
  ('bigdata datascience', 0.2879898668645746),
  ('datascientist', 0.24266362104273376),
  ('deeplearn', 0.17072715131752136),
  ('machinelearning', 0.1678675315948245),
  ('analytic', 0.15886773139743185),
  ('datascience ai', 0.12441113044354182),
  ('ai machinelearn', 0.11372770053956119)],
 [('dm', 0.4908009804310787),
  ('see help', 0.40401762113276096),
  ('let see', 0.32347539742816594),
  ('hmm', 0.320940377027107),
  ('please follow', 0.29383134555469487),
  ('know share', 0.2902366358099235),
  ('help please', 0.2158490332524288),
  ('hope help', 0.20239173274274053),
  ('email address', 0.1327943956640738),
  ('help look', 0.10814196900842804)],
 [('lane block', 0.3384461725098171),
  ('residual', 0.29625540535761397),
  ('residual delay', 0.29622653847997205),
  ('capitol', 0.268870673985623),
  ('corridor', 0.26682089151857435),
  ('southbound', 0.256230837490