Topic modelling allows us to find patterns in large texts or corpora, and thus to identify **the main topics** of a text. We review **two topic modelling algorithms: LSA and LDA**.

> ⚠️ Be careful: unlike with many other ML models, the results of LSA and LDA are **up to our interpretation**!

![img](1.png)

💻 We will learn how to compute those models using `Gensim`, a powerful library mainly dedicated to Topic Modelling.



> ⚙️ You can install it using the following command in your terminal : `pip install --upgrade gensim`

# Utils

➡️ During this lesson, we will work on a sample dataset containing news of different topics.

In [None]:
!pip install --upgrade gensim

In [None]:
!pip install gnews

In [6]:
import pandas as pd

from gnews import GNews

google_news = GNews()
news = google_news.get_news('Australia')

df = pd.DataFrame(news)
df.head()

Unnamed: 0,title,description,published date,url,publisher
0,"In Australia, slot machines are everywhere. So...","In Australia, slot machines are everywhere. So...","Tue, 26 Apr 2022 15:02:45 GMT",https://www.washingtonpost.com/world/2022/04/2...,"{'href': 'https://www.washingtonpost.com', 'ti..."
1,€10 flights to Australia: The new scheme to en...,€10 flights to Australia: The new scheme to en...,"Tue, 26 Apr 2022 12:03:13 GMT",https://www.irishtimes.com/life-and-style/abro...,"{'href': 'https://www.irishtimes.com', 'title'..."
2,Consumer inflation tipped to hit 4.5% in March...,Consumer inflation tipped to hit 4.5% in March...,"Tue, 26 Apr 2022 17:30:00 GMT",https://www.theguardian.com/business/2022/apr/...,"{'href': 'https://www.theguardian.com', 'title..."
3,‘I want to know why he died’: family seeks jus...,‘I want to know why he died’: family seeks jus...,"Tue, 26 Apr 2022 17:32:00 GMT",https://www.theguardian.com/australia-news/202...,"{'href': 'https://www.theguardian.com', 'title..."
4,Australia's AMP to sell some assets of AMP Cap...,Australia's AMP to sell some assets of AMP Cap...,"Wed, 27 Apr 2022 01:11:00 GMT",https://www.reuters.com/business/australias-am...,"{'href': 'https://www.reuters.com', 'title': '..."


In [7]:
import numpy as np
from nltk import word_tokenize, wordpunct_tokenize, pos_tag
from nltk.corpus import stopwords
stop_words = stopwords.words("english")
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_data(quote):
    quote = quote.lower()
    tokens = word_tokenize(quote)
    token_punc = [t for t in tokens if t.isalpha()]
    token_stop = [t for t in token_punc if t not in stop_words]
    stemmed_words = [stemmer.stem(w) for w in token_stop]
    return stemmed_words

df["token"] = df["description"].apply(lambda x: clean_data(x))
df.head()

Unnamed: 0,title,description,published date,url,publisher,token
0,"In Australia, slot machines are everywhere. So...","In Australia, slot machines are everywhere. So...","Tue, 26 Apr 2022 15:02:45 GMT",https://www.washingtonpost.com/world/2022/04/2...,"{'href': 'https://www.washingtonpost.com', 'ti...","[australia, slot, machin, everywher, gambl, ad..."
1,€10 flights to Australia: The new scheme to en...,€10 flights to Australia: The new scheme to en...,"Tue, 26 Apr 2022 12:03:13 GMT",https://www.irishtimes.com/life-and-style/abro...,"{'href': 'https://www.irishtimes.com', 'title'...","[flight, australia, new, scheme, encourag, mov..."
2,Consumer inflation tipped to hit 4.5% in March...,Consumer inflation tipped to hit 4.5% in March...,"Tue, 26 Apr 2022 17:30:00 GMT",https://www.theguardian.com/business/2022/apr/...,"{'href': 'https://www.theguardian.com', 'title...","[consum, inflat, tip, hit, march, australian, ..."
3,‘I want to know why he died’: family seeks jus...,‘I want to know why he died’: family seeks jus...,"Tue, 26 Apr 2022 17:32:00 GMT",https://www.theguardian.com/australia-news/202...,"{'href': 'https://www.theguardian.com', 'title...","[want, know, die, famili, seek, justic, austra..."
4,Australia's AMP to sell some assets of AMP Cap...,Australia's AMP to sell some assets of AMP Cap...,"Wed, 27 Apr 2022 01:11:00 GMT",https://www.reuters.com/business/australias-am...,"{'href': 'https://www.reuters.com', 'title': '...","[australia, amp, sell, asset, amp, capit, mln]"


# 1. Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA) analyzes relationships between a set of documents and the terms they contain by **producing a set of concepts (= the topics) related to the documents and terms**. You can see it as a **kind of PCA** applied to your documents. Sometimes, it is also called Latent Semantic Indexing (LSI).

There are two steps in an LSA computation, and we already know all the tools:

* **TF-IDF** matrix computation
* **Singular Value Decomposition** (the same technique was used in PCA)

> ⚠️Like in a PCA, the topics **don't have an actual meaning**: they are more like a combination of words !

## 1.1. Principles
👉🏻 The first step is to **compute the TF-IDF** (it could also be a BOW but TF-IDF is more powerful most of the time).

From a corpus of M documents (or texts or sentences) and N words, we get the following TF-IDF matrix of shape (M, N):

![img](2.png)
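
To make the (M, N) shape concrete, here is a minimal NumPy sketch on a hypothetical three-document corpus. It assumes the textbook weighting tf × log(M / df); Gensim and scikit-learn each apply their own variants (logarithm base, smoothing, normalization), so their numbers will differ from these.

```python
import numpy as np

# Toy corpus: M = 3 tokenised documents (hypothetical data)
docs = [["slot", "machin", "gambl"],
        ["flight", "australia", "scheme"],
        ["australia", "inflat", "gambl"]]

vocab = sorted({w for d in docs for w in d})      # the N distinct words
col = {w: j for j, w in enumerate(vocab)}         # word -> column index

# Term counts: one row per document, one column per word -> shape (M, N)
counts = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for w in d:
        counts[i, col[w]] += 1

tf = counts / counts.sum(axis=1, keepdims=True)   # term frequency within each document
df = (counts > 0).sum(axis=0)                     # number of documents containing each word
idf = np.log(len(docs) / df)                      # rarer words get a larger weight
tfidf = tf * idf

print(tfidf.shape)   # (3, 7)
```

In the first document, `slot` (present in one document only) ends up with a larger weight than `gambl` (present in two).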

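👉🏻 The second step is to perform a **Singular Value Decomposition** on this matrix. Writing $A$ for the TF-IDF matrix, its transpose (the term-document matrix, of shape (N, M)) is factorized as:

$$A^\top = U S V^\top$$

where:

* $S$ is a diagonal matrix of singular values in decreasing order: each value represents the weight of the corresponding topic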
* $U$ is the Term-Topic Matrix
* $V$ is the Document-Topic Matrix

Like in a PCA, we can choose to **keep only the largest values** in the matrix $S$ (corresponding to the $t$ most representative topics):

![img](3.png)

We end up with the following matrices:

* $U_t$ of shape (N, t): the words and topics relationship
* $V_t$ of shape (M, t): the documents and topics relationship

> 🔦 Hint: Using the matrix $U_t$ (words and topics), we can print the 10 words with the highest values for a given topic and guess what the topic is about!
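
The truncation can be sketched with NumPy. This is only an illustration under stated assumptions: the matrix `A` is random stand-in data rather than a real TF-IDF, and the SVD is applied to its transpose so that the word-topic matrix has one row per word, matching the shapes (N, t) and (M, t) given above.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, t = 6, 8, 2                      # documents, words, topics kept
A = rng.random((M, N))                 # stand-in for a TF-IDF matrix of shape (M, N)

# SVD of the term-document matrix A.T, of shape (N, M)
U, s, Vt = np.linalg.svd(A.T, full_matrices=False)

U_t = U[:, :t]                         # (N, t): words x topics
S_t = np.diag(s[:t])                   # (t, t): weights of the t strongest topics
V_t = Vt[:t, :].T                      # (M, t): documents x topics

# Indices of the 3 words with the largest absolute weight in topic 0
top_words = np.argsort(-np.abs(U_t[:, 0]))[:3]
print(U_t.shape, V_t.shape, top_words)
```

Sorting by absolute value is a design choice here: as in the Gensim output further down, a word can contribute to a topic with a negative weight.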

## 1.2. Implementation with Gensim

In [9]:
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from gensim.models import LsiModel
from pprint import pprint

# Create a corpus
corpus = df['token']

# Compute the dictionary: this is a dictionary mapping words and their corresponding numbers for later visualisation
id2word = Dictionary(corpus)
print(id2word[0])

# Create a BOW
bow = [id2word.doc2bow(line) for line in corpus]  # convert corpus to BoW format
print(bow[0])

04/27/2022 11:46:42 AM - adding document #0 to Dictionary(0 unique tokens: [])
04/27/2022 11:46:42 AM - built Dictionary(641 unique tokens: ['addict', 'australia', 'everywher', 'gambl', 'machin']...) from 100 documents (total 1005 corpus positions)


addict
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
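
The second printed line is the first document in BOW format: a list of `(token_id, count)` pairs. The idea can be sketched in plain Python; the `token2id` mapping below is hypothetical and this is not Gensim's actual implementation.

```python
from collections import Counter

tokens = ["amp", "sell", "asset", "amp", "capit"]          # a tokenised document
token2id = {"amp": 0, "asset": 1, "capit": 2, "sell": 3}   # hypothetical word -> id mapping

# Count how many times each token id occurs in the document
counts = Counter(token2id[t] for t in tokens)
bow_doc = sorted(counts.items())
print(bow_doc)   # [(0, 2), (1, 1), (2, 1), (3, 1)]
```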


In [10]:
# Fit a TF-IDF
tfidf_model = TfidfModel(bow)

# Compute the TF-IDF
tf_idf_gensim = tfidf_model[bow]
print(len(tf_idf_gensim))

04/27/2022 11:46:47 AM - collecting document frequencies
04/27/2022 11:46:47 AM - PROGRESS: processing document #0
04/27/2022 11:46:47 AM - calculating IDF weights for 100 documents and 641 features (982 matrix non-zeros)


100


In [11]:
lsi = LsiModel(tf_idf_gensim, id2word=id2word, num_topics=5)

pprint(lsi.print_topics())

04/27/2022 11:46:52 AM - using serial LSI version on this node
04/27/2022 11:46:52 AM - updating model with new documents
04/27/2022 11:46:52 AM - preparing a new chunk of documents
04/27/2022 11:46:52 AM - using 100 extra samples and 2 power iterations
04/27/2022 11:46:52 AM - 1st phase: constructing (641, 105) action matrix
04/27/2022 11:46:52 AM - orthonormalizing (641, 105) action matrix
04/27/2022 11:46:52 AM - 2nd phase: running dense svd on (105, 100) matrix
04/27/2022 11:46:52 AM - computing the final decomposition
04/27/2022 11:46:52 AM - keeping 5 factors (discarding 91.545% of energy spectrum)
04/27/2022 11:46:52 AM - processed documents up to #100
04/27/2022 11:46:52 AM - topic #0(1.395): 0.197*"herald" + 0.197*"morn" + 0.196*"sydney" + 0.195*"news" + 0.186*"guardian" + 0.186*"australian" + 0.149*"time" + 0.141*"new" + 0.141*"first" + 0.134*"wait"
04/27/2022 11:46:52 AM - topic #1(1.330): 0.342*"sydney" + 0.333*"herald" + 0.333*"morn" + -0.157*"news" + 0.149*"crisi" + 0.145

[(0,
  '0.197*"herald" + 0.197*"morn" + 0.196*"sydney" + 0.195*"news" + '
  '0.186*"guardian" + 0.186*"australian" + 0.149*"time" + 0.141*"new" + '
  '0.141*"first" + 0.134*"wait"'),
 (1,
  '0.342*"sydney" + 0.333*"herald" + 0.333*"morn" + -0.157*"news" + '
  '0.149*"crisi" + 0.145*"democraci" + 0.145*"cancer" + 0.138*"fix" + '
  '-0.123*"australian" + -0.122*"new"'),
 (2,
  '0.236*"australian" + -0.236*"news" + 0.215*"financi" + 0.202*"review" + '
  '-0.195*"first" + -0.181*"abc" + 0.167*"post" + 0.160*"new" + -0.148*"covid" '
  '+ -0.143*"sb"'),
 (3,
  '0.230*"new" + 0.205*"time" + -0.182*"guardian" + -0.165*"govern" + '
  '-0.152*"climat" + -0.145*"live" + -0.143*"happen" + 0.141*"zealand" + '
  '-0.138*"nsw" + -0.134*"call"'),
 (4,
  '0.247*"guardian" + -0.217*"news" + 0.198*"new" + -0.182*"abc" + '
  '-0.178*"washington" + -0.164*"post" + -0.162*"food" + -0.154*"travel" + '
  '0.151*"zealand" + 0.144*"cruis"')]


# 2. Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) is, in a way, **an improvement on LSA**. Indeed, one problem with LSA is that it needs a large corpus of documents to be accurate enough.

LDA is a **probabilistic model** (based on Bayesian probabilities) that allows **more flexibility on the size of the dataset**.

## 2.1. Principles
We won't dive into a fully detailed explanation of this algorithm, since it is mathematically **a bit complicated and involves lots of calculations**. Even though few people truly master it, it is widely used for topic modelling. We will focus on a general overview: just remember what it does and how to use it.

👉🏻 LDA makes two main assumptions:

* **Mixture**: each document is a mixture of topics
* **Sparsity**: each document covers a small set of topics, and each topic uses only a small subset of words frequently

👉🏻 Then the LDA algorithm follows the following steps:

1. **Initialization**: assign to each document a random (sparse) distribution of topics, and to each topic a random (sparse) distribution of words
2. **For each word in each document, compute the most likely topic** (according to other words in that document)
3. **Repeat step 2 until convergence or iteration limit**
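
One way to realise these steps is collapsed Gibbs sampling, sketched below on a toy corpus. This is an illustration only: it samples a topic for each word rather than picking the single most likely one, the function name `toy_lda_gibbs` and its default hyperparameters are assumptions made for this example, and Gensim's `LdaModel` relies on online variational Bayes instead.

```python
import numpy as np

def toy_lda_gibbs(docs, n_topics, n_words, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    # Step 1: random initial topic for every word occurrence
    z = [rng.integers(n_topics, size=len(d)) for d in docs]
    doc_topic = np.zeros((len(docs), n_topics))    # topic counts per document
    topic_word = np.zeros((n_topics, n_words))     # word counts per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            doc_topic[d, z[d][i]] += 1
            topic_word[z[d][i], w] += 1
    # Steps 2-3: resample the topic of each word, repeated for n_iter sweeps
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                doc_topic[d, k] -= 1               # remove the current assignment
                topic_word[k, w] -= 1
                # Weight of each topic: (its usage in this document)
                # times (how often it produces this word)
                p = ((doc_topic[d] + alpha) * (topic_word[:, w] + beta)
                     / (topic_word.sum(axis=1) + n_words * beta))
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                doc_topic[d, k] += 1
                topic_word[k, w] += 1
    # Normalise the smoothed counts into probability distributions
    theta = (doc_topic + alpha) / (doc_topic + alpha).sum(axis=1, keepdims=True)
    phi = (topic_word + beta) / (topic_word + beta).sum(axis=1, keepdims=True)
    return theta, phi

# Two documents using word ids 0-2 and two documents using word ids 3-5
docs = [[0, 1, 2, 0, 1], [1, 2, 0, 2], [3, 4, 5, 3], [4, 5, 3, 5, 4]]
theta, phi = toy_lda_gibbs(docs, n_topics=2, n_words=6)
print(theta.round(2))   # one row per document, one column per topic
```

Here `theta` holds the topic mixture of each document and `phi` the word distribution of each topic.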

> 🔦 Hint: To sum up, you can just remember that it works in a way similar to the K-means algorithm: start from a random assignment, then refine it iteratively.

👉🏻 **Every document is a mix of topics and every topic consists of a mix of words**

![img](4.jpeg)

👉🏻 Expected LDA Output:

* Topic A = 30% dog, 20% frog, 20% insect, 5% cute... = **ANIMALS**
* Topic B = 30% Olympics, 20% players, 20% beat, 10% corner, 10% Dota, 2% dog = **SPORTS**
* Topic C = 30% AI, 20% flying, 15% cars, 10% driven, 5% beat, Dota, players = **TECH**
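
The "mix of topics, mix of words" idea boils down to a weighted sum, as in this minimal NumPy sketch. All the numbers are hypothetical and chosen only so that each distribution sums to 1.

```python
import numpy as np

words = ["dog", "frog", "olympics", "players", "ai", "cars"]

# Hypothetical topic-word distributions (each row sums to 1)
topics = np.array([
    [0.5, 0.4, 0.0, 0.1, 0.0, 0.0],   # ANIMALS
    [0.1, 0.0, 0.5, 0.4, 0.0, 0.0],   # SPORTS
    [0.0, 0.0, 0.0, 0.2, 0.5, 0.3],   # TECH
])

# A document that is 70% SPORTS and 30% TECH
doc_mix = np.array([0.0, 0.7, 0.3])

# Probability of each word in this document: topic shares times topic-word probabilities
word_probs = doc_mix @ topics
for word, prob in zip(words, word_probs):
    print(f"{word}: {prob:.2f}")
```

LDA works in the opposite direction: it only sees the words of each document and has to infer both `topics` and `doc_mix`.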

## 2.2. Implementation
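The class used for LDA topic modelling in Gensim is `LdaModel`. Its main parameters are shown below; pass them as keyword arguments, since their positional order in Gensim differs from the order listed here (`num_topics` comes before `id2word`):

```
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=num_topics,
                                            random_state=random_state,
                                            chunksize=chunksize,
                                            passes=passes)
```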

Where:

* `corpus` is the input TF-IDF or BOW
* `id2word` is a dictionary with the correspondence between indices in the corpus and words
* `num_topics` is the number of topics you expect
* `random_state` is the random seed for reproducibility
* `chunksize` is the number of documents in a mini-batch
* `passes` is the number of passes through the whole corpus during training

👉🏻 There are (at least) two ways to prepare the input and train the model:

* Use `Gensim`'s TF-IDF and topic modelling
* Use `scikit-learn`'s TF-IDF and then use Gensim topic modelling

> 🔦 Hint: There is **no good or bad choice**; it all depends on which library you are more comfortable with. We will show you both methods here.

### 2.2.1. Gensim TF-IDF and topic modelling

In [12]:
from gensim.models import LdaModel
# Compute the LDA
lda1 = LdaModel(corpus=tf_idf_gensim, num_topics=5, id2word=id2word, passes=10)

# Print the main topics
print(lda1.print_topics())

04/27/2022 11:47:29 AM - using symmetric alpha at 0.2
04/27/2022 11:47:29 AM - using symmetric eta at 0.2
04/27/2022 11:47:29 AM - using serial LDA version on this node
04/27/2022 11:47:29 AM - running online (multi-pass) LDA training, 5 topics, 10 passes over the supplied corpus of 100 documents, updating model once every 100 documents, evaluating perplexity every 100 documents, iterating 50x with a convergence threshold of 0.001000
04/27/2022 11:47:29 AM - -15.190 per-word bound, 37382.0 perplexity estimate based on a held-out corpus of 100 documents with 290 words
04/27/2022 11:47:29 AM - PROGRESS: pass 0, at document #100/100
04/27/2022 11:47:29 AM - topic #0 (0.200): 0.007*"australian" + 0.005*"post" + 0.005*"time" + 0.004*"news" + 0.004*"guardian" + 0.004*"washington" + 0.004*"sb" + 0.004*"inflat" + 0.004*"busi" + 0.004*"forc"
04/27/2022 11:47:29 AM - topic #1 (0.200): 0.005*"call" + 0.004*"weather" + 0.004*"tie" + 0.004*"china" + 0.004*"prepar" + 0.004*"push" + 0.004*"guardian" + 0.004*"unit" + 0.004*"elect" + 0.004*"offer"
04/27/2022 11:47:29 AM - topic #2 (0.200): 0.006*"world" + 0.005*"nintendo" + 0.005*"rise" + 0.005*"deal" + 0.005*"crisi" + 0.004*"sydney" + 0.004*"top" + 0.004*"financi" + 0.004*"review" + 0.004*"independ"
04/27/2022 11:47:29 AM - topic #3 (0.200): 0.006*"die" + 0.005*"amp" + 0.005*"age" + 0.005*"death" + 0.005*"news" + 0.005*"new" + 0.004*"mln" + 0.004*"york" + 0.004*"nsw" + 0.004*"covid"
04/27/2022 11:47:29 AM - topic #4 (0.200): 0.005*"travel" + 0.005*"new" + 0.005*"sydney" + 0.005*"war" + 0.004*"convers" + 0.004*"day" + 0.004*"anzac" + 0.004*"want" + 0.004*"herald" + 0.004*"morn"
04/27/2022 11:47:29 AM - topic diff=0.000460, rho=0.353553
04/27/2022 11:47:29 AM - -8.750 per-word bound, 430.7 perplexity estimate based on a held-out corpus of 100 documents with 290 words
04/27/2022 11:

[(0, '0.007*"australian" + 0.005*"post" + 0.005*"time" + 0.004*"news" + 0.004*"washington" + 0.004*"guardian" + 0.004*"us" + 0.004*"busi" + 0.004*"sb" + 0.004*"inflat"'), (1, '0.005*"call" + 0.004*"weather" + 0.004*"tie" + 0.004*"china" + 0.004*"prepar" + 0.004*"push" + 0.004*"guardian" + 0.004*"unit" + 0.004*"elect" + 0.004*"offer"'), (2, '0.006*"world" + 0.005*"nintendo" + 0.005*"rise" + 0.005*"deal" + 0.005*"crisi" + 0.004*"sydney" + 0.004*"top" + 0.004*"financi" + 0.004*"review" + 0.004*"independ"'), (3, '0.006*"die" + 0.005*"amp" + 0.005*"age" + 0.005*"death" + 0.005*"news" + 0.005*"new" + 0.004*"mln" + 0.004*"york" + 0.004*"nsw" + 0.004*"covid"'), (4, '0.005*"travel" + 0.005*"new" + 0.005*"sydney" + 0.005*"war" + 0.004*"convers" + 0.004*"day" + 0.004*"anzac" + 0.004*"want" + 0.004*"herald" + 0.004*"morn"')]


### 2.2.2. sklearn TF-IDF and gensim topic modelling

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
pd.set_option('display.max_columns', 500) 

# Instantiate the TF-IDF vectorizer
vectorizer = TfidfVectorizer(lowercase=False, analyzer=lambda x: x)

# Compute the TF-IDF
tf_idf = vectorizer.fit_transform(df['token'])
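# get_feature_names_out() replaces get_feature_names(), which was removed from recent scikit-learn versions
pd.DataFrame(data=tf_idf.toarray(), columns=vectorizer.get_feature_names_out(), index=corpus.index).head()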

(Output: the first 5 rows of the TF-IDF matrix as a DataFrame. The columns are the stemmed tokens in alphabetical order (`abc`, `abruptli`, ..., `zero`), and almost all entries are 0.0; row 0, for example, has a weight of 0.393596 for `addict`.)


In [15]:
# Dictionary mapping word IDs to words, initialized in a lazy manner to save memory (not created until needed)
dictionary = Dictionary(df["token"])
print(dictionary)

04/27/2022 11:48:03 AM - adding document #0 to Dictionary(0 unique tokens: [])
04/27/2022 11:48:03 AM - built Dictionary(641 unique tokens: ['addict', 'australia', 'everywher', 'gambl', 'machin']...) from 100 documents (total 1005 corpus positions)


Dictionary(641 unique tokens: ['addict', 'australia', 'everywher', 'gambl', 'machin']...)


In [16]:
from gensim.matutils import Sparse2Corpus
from pprint import pprint

# Convert the TF-IDF to the needed input for Gensim
tf_idf_sklearn = Sparse2Corpus(tf_idf, documents_columns=False)

# Compute the LDA
lda2 = LdaModel(corpus=tf_idf_sklearn, id2word=id2word, num_topics=3, passes=10)

# Print the main topics
pprint(lda2.print_topics())

04/27/2022 11:48:07 AM - using symmetric alpha at 0.3333333333333333
04/27/2022 11:48:07 AM - using symmetric eta at 0.3333333333333333
04/27/2022 11:48:07 AM - using serial LDA version on this node
04/27/2022 11:48:07 AM - running online (multi-pass) LDA training, 3 topics, 10 passes over the supplied corpus of 100 documents, updating model once every 100 documents, evaluating perplexity every 100 documents, iterating 50x with a convergence threshold of 0.001000
04/27/2022 11:48:07 AM - -10.155 per-word bound, 1140.3 perplexity estimate based on a held-out corpus of 100 documents with 298 words
04/27/2022 11:48:07 AM - PROGRESS: pass 0, at document #100/100
04/27/2022 11:48:07 AM - topic #0 (0.333): 0.009*"sell" + 0.005*"critic" + 0.005*"flash" + 0.005*"action" + 0.004*"share" + 0.004*"sunflow" + 0.004*"english" + 0.004*"onlin" + 0.004*"search" + 0.004*"provid"
04/27/2022 11:48:07 AM - topic #1 (0.333): 0.010*"sell" + 0.005*"action" + 0.005*"critic" + 0.004*"billion" + 0.004*"headlin"

04/27/2022 11:48:08 AM - topic diff=0.001502, rho=0.316228
04/27/2022 11:48:08 AM - -7.951 per-word bound, 247.5 perplexity estimate based on a held-out corpus of 100 documents with 298 words
04/27/2022 11:48:08 AM - PROGRESS: pass 9, at document #100/100
04/27/2022 11:48:08 AM - topic #0 (0.333): 0.010*"sell" + 0.005*"critic" + 0.005*"action" + 0.005*"flash" + 0.005*"sunflow" + 0.005*"share" + 0.004*"onlin" + 0.004*"english" + 0.004*"south" + 0.004*"search"
04/27/2022 11:48:08 AM - topic #1 (0.333): 0.010*"sell" + 0.005*"action" + 0.005*"critic" + 0.005*"billion" + 0.004*"headlin" + 0.004*"biosolid" + 0.004*"twitter" + 0.004*"wong" + 0.004*"addict" + 0.003*"school"
04/27/2022 11:48:08 AM - topic #2 (0.333): 0.010*"sell" + 0.006*"action" + 0.005*"western" + 0.004*"headlin" + 0.004*"extrem" + 0.004*"hive" + 0.004*"guid" + 0.004*"billion" + 0.004*"quarter" + 0.003*"inflat"
04/27/2022 11:48:08 AM - topic diff=0.001255, rho=0.301511
04/27/2022 11:48:08 AM - topic #0 (0.333): 0.010*"sell" +

[(0,
  '0.010*"sell" + 0.005*"critic" + 0.005*"action" + 0.005*"flash" + '
  '0.005*"sunflow" + 0.005*"share" + 0.004*"onlin" + 0.004*"english" + '
  '0.004*"south" + 0.004*"search"'),
 (1,
  '0.010*"sell" + 0.005*"action" + 0.005*"critic" + 0.005*"billion" + '
  '0.004*"headlin" + 0.004*"biosolid" + 0.004*"twitter" + 0.004*"wong" + '
  '0.004*"addict" + 0.003*"school"'),
 (2,
  '0.010*"sell" + 0.006*"action" + 0.005*"western" + 0.004*"headlin" + '
  '0.004*"extrem" + 0.004*"hive" + 0.004*"guid" + 0.004*"billion" + '
  '0.004*"quarter" + 0.003*"inflat"')]


> ⚠️ Be careful: Gensim expects a sparse matrix with documents in columns, while the `sklearn` TF-IDF sparse matrix has documents in rows. If you want to use the `sklearn` TF-IDF with Gensim LDA, you should set `documents_columns=False` in `Sparse2Corpus`.
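
A second thing to check when mixing the two libraries is the word ids. `TfidfVectorizer` sorts its vocabulary alphabetically (the first column of the TF-IDF DataFrame is `abc`), whereas the Gensim `Dictionary` printed earlier starts with `addict`, so the two id spaces need not agree. A minimal sketch, on a hypothetical toy corpus, of deriving the id-to-word mapping from the vectorizer itself so that it matches the matrix columns:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical tokenised corpus, standing in for df["token"]
docs = [["slot", "machin", "gambl"], ["flight", "australia", "scheme"]]

# Same vectorizer settings as above: the documents are already lists of tokens
vectorizer = TfidfVectorizer(lowercase=False, analyzer=lambda tokens: tokens)
tf_idf_toy = vectorizer.fit_transform(docs)

# vocabulary_ maps term -> column index; invert it to get column index -> term
id2word_sklearn = {col: term for term, col in vectorizer.vocabulary_.items()}
print(id2word_sklearn[0])   # australia
```

A plain dictionary like `id2word_sklearn` can be passed as `id2word` to `LdaModel` in place of the Gensim `Dictionary`.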

## 2.3. LDA visualization
If you used Gensim all along, you can then visualize your topics with the `pyLDAvis` library.

⚙️ You can install it using the following command in your terminal: `pip install pyLDAvis`

In [17]:
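# !pip install pyLDAvis
# Import the modules
import pyLDAvis
import pyLDAvis.gensim  # this module is named pyLDAvis.gensim_models in pyLDAvis 3.3 and later

# Visualize the topics of the model trained with Gensim only (lda1)
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(topic_model=lda1, corpus=bow, dictionary=id2word)
vis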

04/27/2022 11:48:18 AM - NumExpr defaulting to 8 threads.


> ⚠️Note : with `pyLDAvis.gensim.prepare`, the `corpus` is a BOW, not a TF-IDF !