Topic modelling allows us to find patterns in large texts or corpuses of text, and thus to define **what are the mains topics** in the text. We review **two topic modelling algorithms : LSA and LDA**.

> ⚠️ Be careful, unlike other ML models, LSA and LDA results will be **up to our interpretation** !

![img](1.png)

💻 We will learn how to compute those models using `Gensim`, a powerful library mainly dedicated to Topic Modelling.



> ⚙️ You can install it using the following command in your terminal : `pip install --upgrade gensim`

# Utils

➡️ During this lesson, we will work on a sample dataset containing news of different topics.

In [2]:
!pip install --upgrade gensim



In [3]:
!pip install gnews

Collecting gnews
  Downloading gnews-0.2.7-py3-none-any.whl (14 kB)
Collecting python-dotenv~=0.19.0
  Downloading python_dotenv-0.19.2-py2.py3-none-any.whl (17 kB)
Collecting beautifulsoup4~=4.9.3
  Downloading beautifulsoup4-4.9.3-py3-none-any.whl (115 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.8/115.8 kB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hCollecting requests==2.26.0
  Downloading requests-2.26.0-py2.py3-none-any.whl (62 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.3/62.3 kB[0m [31m429.8 kB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hCollecting bs4~=0.0.1
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... [?25ldone
[?25hCollecting feedparser~=6.0.2
  Downloading feedparser-6.0.8-py3-none-any.whl (81 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.0/81.0 kB[0m [31m1.6 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hCollecting dnspython

In [5]:
import pandas as pd

from gnews import GNews

google_news = GNews()
news = google_news.get_news('Australia')

df = pd.DataFrame(news)
df.head()

Unnamed: 0,title,description,published date,url,publisher
0,The American Variant of Democracy Is Contamina...,The American Variant of Democracy Is Contamina...,"Mon, 16 May 2022 10:00:00 GMT",https://www.theatlantic.com/ideas/archive/2022...,"{'href': 'https://www.theatlantic.com', 'title..."
1,Xi Jinping looms large over Australia's electi...,Xi Jinping looms large over Australia's electi...,"Mon, 16 May 2022 01:00:00 GMT",https://www.cnn.com/2022/05/15/australia/austr...,"{'href': 'https://www.cnn.com', 'title': 'CNN'}"
2,How Australia Elections 2022 Could Reshape Chi...,How Australia Elections 2022 Could Reshape Chi...,"Mon, 16 May 2022 12:00:00 GMT",https://www.bloomberg.com/graphics/australia-f...,"{'href': 'https://www.bloomberg.com', 'title':..."
3,CVC Capital in Deal Talks With Australia’s Bra...,CVC Capital in Deal Talks With Australia’s Bra...,"Mon, 16 May 2022 06:30:00 GMT",https://www.wsj.com/articles/cvc-capital-in-de...,"{'href': 'https://www.wsj.com', 'title': 'The ..."
4,Electric Cars Could Win a Boost in Australia's...,Electric Cars Could Win a Boost in Australia's...,"Mon, 16 May 2022 10:15:05 GMT",https://www.bloomberg.com/news/articles/2022-0...,"{'href': 'https://www.bloomberg.com', 'title':..."


In [6]:
import numpy as np
from nltk import word_tokenize, wordpunct_tokenize, pos_tag
from nltk.corpus import stopwords
stop_words = stopwords.words("english")
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_data(quote):
    quote = quote.lower()
    tokens = word_tokenize(quote)
    token_punc = [t for t in tokens if t.isalpha()]
    token_stop = [t for t in token_punc if t not in stop_words]
    stemmed_words = [stemmer.stem(w) for w in token_stop]
    return stemmed_words

df["token"] = df["description"].apply(lambda x: clean_data(x))
df.head()

Unnamed: 0,title,description,published date,url,publisher,token
0,The American Variant of Democracy Is Contamina...,The American Variant of Democracy Is Contamina...,"Mon, 16 May 2022 10:00:00 GMT",https://www.theatlantic.com/ideas/archive/2022...,"{'href': 'https://www.theatlantic.com', 'title...","[american, variant, democraci, contamin, home,..."
1,Xi Jinping looms large over Australia's electi...,Xi Jinping looms large over Australia's electi...,"Mon, 16 May 2022 01:00:00 GMT",https://www.cnn.com/2022/05/15/australia/austr...,"{'href': 'https://www.cnn.com', 'title': 'CNN'}","[xi, jinp, loom, larg, australia, elect, cnn]"
2,How Australia Elections 2022 Could Reshape Chi...,How Australia Elections 2022 Could Reshape Chi...,"Mon, 16 May 2022 12:00:00 GMT",https://www.bloomberg.com/graphics/australia-f...,"{'href': 'https://www.bloomberg.com', 'title':...","[australia, elect, could, reshap, china, tie, ..."
3,CVC Capital in Deal Talks With Australia’s Bra...,CVC Capital in Deal Talks With Australia’s Bra...,"Mon, 16 May 2022 06:30:00 GMT",https://www.wsj.com/articles/cvc-capital-in-de...,"{'href': 'https://www.wsj.com', 'title': 'The ...","[cvc, capit, deal, talk, australia, brambl, lo..."
4,Electric Cars Could Win a Boost in Australia's...,Electric Cars Could Win a Boost in Australia's...,"Mon, 16 May 2022 10:15:05 GMT",https://www.bloomberg.com/news/articles/2022-0...,"{'href': 'https://www.bloomberg.com', 'title':...","[electr, car, could, win, boost, australia, fe..."


# 1. Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA) analyzes relationships between a set of documents and the terms they contain by **producing a set of concepts (= the topics) related to the documents and terms**. You can see it as a **kind of PCA** applied to your documents. Sometimes, it is also called Latent Semantic Indexing (LSI).

There are two steps in a LSA computation and we already know all the tools:

* **TF-IDF** matrix computation
* **Singular Value Decomposition** (the same technique was used in PCA)

> ⚠️Like in a PCA, the topics **don't have an actual meaning**: they are more like a combination of words !

## 1.1. Principles
👉🏻 The first step is to **compute the TF-IDF** (it could also be a BOW but TF-IDF is more powerful most of the time).

From a corpus of M documents (or texts or sentences) and N words, we get the following TF-IDF matrix of shape (M, N):

![img](2.png)

👉🏻 The second step is to perform a **Singular Value Decomposition** on this matrix

Where

* $S$ is a diagonal matrix of singular values in decreasing order: each value represents the weights of the corresponding topic
* $U$ is the Term-Topic Matrix
* $V$ is the Document-Topic Matrix

Like in a PCA, we can choose to **keep only the largest values** in the matrix $S$ (corresponding the the tt most representative topics):

![img](3.png)

We end up with the following matrix:

* $U_t$ of shape (N, t): the words and topics relationship
* $V_t$ of shape (M, t): the documents and topics relationship

> 🔦 Hint: Using the matrix $V_t$, we can print the 10 words with the highest values of a given topic and guess the related topic!

## 1.2. Implementation with Gensim

In [7]:
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from gensim.models import LsiModel
from pprint import pprint

# Create a corpus
corpus = df['token']

# Compute the dictionary: this is a dictionary mapping words and their corresponding numbers for later visualisation
id2word = Dictionary(corpus)
print(id2word[0])

# Create a BOW
bow = [id2word.doc2bow(line) for line in corpus]  # convert corpus to BoW format
print(bow[0])

05/17/2022 04:59:21 AM - adding document #0 to Dictionary<0 unique tokens: []>
05/17/2022 04:59:21 AM - built Dictionary<611 unique tokens: ['american', 'atlant', 'contamin', 'democraci', 'home']...> from 99 documents (total 972 corpus positions)
05/17/2022 04:59:21 AM - Dictionary lifecycle event {'msg': "built Dictionary<611 unique tokens: ['american', 'atlant', 'contamin', 'democraci', 'home']...> from 99 documents (total 972 corpus positions)", 'datetime': '2022-05-17T04:59:21.641592', 'gensim': '4.2.0', 'python': '3.9.7 (default, Sep 16 2021, 08:50:36) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'created'}


american
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]


In [8]:
# Fit a TF-IDF
tfidf_model = TfidfModel(bow)

# Compute the TF-IDF
tf_idf_gensim = tfidf_model[bow]
print(len(tf_idf_gensim))

05/17/2022 04:59:21 AM - collecting document frequencies
05/17/2022 04:59:21 AM - PROGRESS: processing document #0
05/17/2022 04:59:21 AM - TfidfModel lifecycle event {'msg': 'calculated IDF weights for 99 documents and 611 features (951 matrix non-zeros)', 'datetime': '2022-05-17T04:59:21.661158', 'gensim': '4.2.0', 'python': '3.9.7 (default, Sep 16 2021, 08:50:36) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'initialize'}


99


In [9]:
lsi = LsiModel(tf_idf_gensim, id2word=id2word, num_topics=5)

pprint(lsi.print_topics())

05/17/2022 04:59:21 AM - using serial LSI version on this node
05/17/2022 04:59:21 AM - updating model with new documents
05/17/2022 04:59:21 AM - preparing a new chunk of documents
05/17/2022 04:59:21 AM - using 100 extra samples and 2 power iterations
05/17/2022 04:59:21 AM - 1st phase: constructing (611, 105) action matrix
05/17/2022 04:59:21 AM - orthonormalizing (611, 105) action matrix
05/17/2022 04:59:21 AM - 2nd phase: running dense svd on (105, 99) matrix
05/17/2022 04:59:21 AM - computing the final decomposition
05/17/2022 04:59:21 AM - keeping 5 factors (discarding 91.867% of energy spectrum)
05/17/2022 04:59:21 AM - processed documents up to #99
05/17/2022 04:59:21 AM - topic #0(1.377): -0.257*"elect" + -0.241*"news" + -0.236*"guardian" + -0.190*"abc" + -0.151*"australian" + -0.147*"say" + -0.140*"morrison" + -0.115*"scott" + -0.112*"covid" + -0.105*"former"
05/17/2022 04:59:21 AM - topic #1(1.291): -0.290*"elect" + 0.280*"news" + 0.253*"abc" + -0.165*"morrison" + -0.163*"m

[(0,
  '-0.257*"elect" + -0.241*"news" + -0.236*"guardian" + -0.190*"abc" + '
  '-0.151*"australian" + -0.147*"say" + -0.140*"morrison" + -0.115*"scott" + '
  '-0.112*"covid" + -0.105*"former"'),
 (1,
  '-0.290*"elect" + 0.280*"news" + 0.253*"abc" + -0.165*"morrison" + '
  '-0.163*"mail" + -0.163*"daili" + -0.159*"loom" + -0.147*"scott" + '
  '0.123*"death" + -0.098*"al"'),
 (2,
  '-0.312*"etf" + -0.276*"bloomberg" + -0.235*"bitcoin" + 0.216*"say" + '
  '-0.185*"come" + 0.171*"australian" + 0.160*"guardian" + -0.155*"elect" + '
  '-0.151*"could" + 0.139*"new"'),
 (3,
  '-0.293*"etf" + -0.219*"bitcoin" + -0.212*"bloomberg" + -0.171*"australian" '
  '+ -0.170*"come" + 0.161*"daili" + 0.161*"mail" + -0.149*"climat" + '
  '0.145*"scott" + 0.133*"morrison"'),
 (4,
  '-0.210*"etf" + 0.209*"al" + 0.209*"english" + 0.209*"jazeera" + '
  '-0.205*"scott" + -0.199*"morrison" + 0.197*"loom" + -0.161*"mail" + '
  '-0.161*"daili" + 0.160*"defend"')]


# 2. Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) is, in a way, **an improvement of the LSA**. Indeed, the problem with LSA is that it needs large corpuses of documents to be accurate enough.

LDA is a **probabilistic model** (based on Bayesian probabilities) that allows **more flexibility on the size of the dataset**.

## 2.1. Principles
We won't dive into a fully detailed explaination of this algorithm, since it is mathematically **a bit complicated and involves lots of calculations**. Even though not many people really master it, it is widely used to perform `Topic Modelling`. We will just focus on a general overview : just remember what it does and how to use it.

👉🏻 LDA makes two main assumptions:

* **Mixture**: each document is a mixture of topics
* **Sparsity**: each document covers a small set of topics, and each uses only a small subset of words frequently

👉🏻 Then the LDA algorithm follows the following steps:

1. **Initialization**: assign to each document a random (sparse) distribution of topics, and to each a random (sparse) distribution of words
2. **For each word in each document, compute the most likely topic** (according to other words in that document)
3. **Repeat step 2 until convergence or iteration limit**

> 🔦 Hint: To sum up, you can just remember that it works in a way similar to a Kmeans algorithm.

👉🏻 **Every document is a mix of topics and every topic consists of a mix of words**

![img](4.jpeg)

👉🏻 Expected LDA Output:

* Topic A = 30% dog, 20% frog, 20& insect, 5% cute... = **ANIMALS**
* Topic B = 30% Olympics, 20% players, 20% beat, 10% corner, 10% Dota, 2% dog = **SPORTS**
* Topic C = 30% AI, 20% flying, 15% cars, 10% driven, 5% beat, Dota, players = **TECH**

## 2.2. Implementation
The method to compute Topic Modelling using Gensim is `LdaModel`, it has the following signature:

```
lda_model = gensim.models.ldamodel.LdaModel(corpus,
                                            id2word,
                                            num_topics, 
                                            random_state,
                                            chunksize,
                                            passes)
```

Where:

* `Corpus` is the input TF-IDF or BOW
* `id2word` is a dictionary with the correspondance between indices in the corpus and words
* `num_topics` is the number of topics you expect
* `random_state` is the random seed for reproducibility
* `chunksize` is the size of a mini batch
* `passes` is the number of passes over all words in the corpus

👉🏻 There are 2 ways (at least) to do so:

* Use `Gensim`'s TF-IDF and topic modelling
* Use `scikit-learn`'s TF-IDF and then use Gensim topic modelling

> 🔦 Hint: There is **no good or bad choice**, it is all depending on your affinity with libraries. We will show you both methods here.

### 2.2.1. Gensim TF-IDF and topic modelling

In [10]:
from gensim.models import LdaModel
# Compute the LDA
lda1 = LdaModel(corpus=tf_idf_gensim, num_topics=5, id2word=id2word, passes=10)

# Print the main topics
print(lda1.print_topics())

05/17/2022 04:59:23 AM - using symmetric alpha at 0.2
05/17/2022 04:59:23 AM - using symmetric eta at 0.2
05/17/2022 04:59:23 AM - using serial LDA version on this node
05/17/2022 04:59:23 AM - running online (multi-pass) LDA training, 5 topics, 10 passes over the supplied corpus of 99 documents, updating model once every 99 documents, evaluating perplexity every 99 documents, iterating 50x with a convergence threshold of 0.001000
05/17/2022 04:59:23 AM - -14.981 per-word bound, 32338.6 perplexity estimate based on a held-out corpus of 99 documents with 283 words
05/17/2022 04:59:23 AM - PROGRESS: pass 0, at document #99/99
05/17/2022 04:59:23 AM - topic #0 (0.200): 0.005*"news" + 0.005*"restrict" + 0.005*"travel" + 0.005*"guardian" + 0.005*"dutton" + 0.005*"new" + 0.004*"abc" + 0.004*"elect" + 0.004*"countri" + 0.004*"sort"
05/17/2022 04:59:23 AM - topic #1 (0.200): 0.005*"leav" + 0.005*"tie" + 0.005*"guardian" + 0.004*"labor" + 0.004*"ndi" + 0.004*"promis" + 0.004*"major" + 0.004*"su

05/17/2022 04:59:24 AM - topic #0 (0.200): 0.005*"news" + 0.005*"restrict" + 0.005*"travel" + 0.005*"guardian" + 0.005*"dutton" + 0.005*"new" + 0.004*"abc" + 0.004*"elect" + 0.004*"countri" + 0.004*"sort"
05/17/2022 04:59:24 AM - topic #1 (0.200): 0.005*"leav" + 0.005*"tie" + 0.005*"guardian" + 0.004*"labor" + 0.004*"ndi" + 0.004*"promis" + 0.004*"major" + 0.004*"sustain" + 0.004*"pictur" + 0.004*"style"
05/17/2022 04:59:24 AM - topic #2 (0.200): 0.006*"guardian" + 0.005*"former" + 0.005*"strategist" + 0.005*"voter" + 0.005*"listen" + 0.005*"age" + 0.004*"launch" + 0.004*"elect" + 0.004*"block" + 0.004*"say"
05/17/2022 04:59:24 AM - topic #3 (0.200): 0.005*"covid" + 0.005*"financi" + 0.005*"american" + 0.005*"defend" + 0.005*"worri" + 0.005*"loom" + 0.005*"world" + 0.004*"day" + 0.004*"hospit" + 0.004*"beat"
05/17/2022 04:59:24 AM - topic #4 (0.200): 0.005*"bloomberg" + 0.005*"uk" + 0.005*"back" + 0.005*"rate" + 0.005*"need" + 0.004*"come" + 0.004*"death" + 0.004*"bitcoin" + 0.004*"etf

[(0, '0.005*"news" + 0.005*"restrict" + 0.005*"travel" + 0.005*"guardian" + 0.005*"dutton" + 0.005*"new" + 0.004*"abc" + 0.004*"elect" + 0.004*"countri" + 0.004*"sort"'), (1, '0.005*"leav" + 0.005*"tie" + 0.005*"guardian" + 0.004*"labor" + 0.004*"ndi" + 0.004*"promis" + 0.004*"major" + 0.004*"sustain" + 0.004*"pictur" + 0.004*"style"'), (2, '0.006*"guardian" + 0.005*"former" + 0.005*"strategist" + 0.005*"voter" + 0.005*"listen" + 0.005*"age" + 0.004*"launch" + 0.004*"elect" + 0.004*"block" + 0.004*"say"'), (3, '0.005*"covid" + 0.005*"financi" + 0.005*"american" + 0.005*"defend" + 0.005*"worri" + 0.005*"loom" + 0.005*"world" + 0.004*"day" + 0.004*"hospit" + 0.004*"beat"'), (4, '0.005*"bloomberg" + 0.005*"uk" + 0.005*"back" + 0.005*"rate" + 0.005*"need" + 0.004*"come" + 0.004*"death" + 0.004*"bitcoin" + 0.004*"etf" + 0.004*"look"')]


### 2.2.2. sklearn TF-IDF and gensim topic modelling

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
pd.set_option('display.max_columns', 500) 

# Instantiate the TF-IDF vectorizer
vectorizer = TfidfVectorizer(lowercase=False, analyzer=lambda x: x)

# Compute the TF-IDF
tf_idf = vectorizer.fit_transform(df['token'])
pd.DataFrame(data = tf_idf.toarray(), columns=vectorizer.get_feature_names(), index=corpus.index).head()



Unnamed: 0,aat,ab,abc,abus,accur,achiev,act,action,activ,advanc,advoc,affair,afford,afghan,afraid,age,aggress,ahead,airlift,al,albanes,alli,ambassador,american,amid,andrew,answer,anthoni,anticip,anxiou,appeal,appear,appoint,approach,art,asset,associ,asylum,atlant,attack,attribut,aussi,australia,australian,author,auto,autoevolut,autom,back,batteri,bbc,bear,beat,began,believ,bench,bend,bernard,betoota,better,biennal,biggest,bishop,bitcoin,block,bloomberg,blue,bonu,boost,borrow,brace,brambl,brand,break,bring,buffalo,built,buy,buzzfe,call,camp,campaign,canada,capit,car,care,case,cathol,caus,cboe,centuri,chanc,chang,chemic,chief,china,chines,chri,christchurch,cite,citi,citizen,class,client,climat,cnn,coach,collaeri,come,complet,compliment,confirm,conserv,contamin,convers,correct,could,countri,court,covid,cricket,crisi,crocodil,crypto,cultur,cup,cvc,cyber,daili,damag,danger,david,dawson,day,deal,death,decis,defenc,defend,deficit,defin,deliber,democraci,denounc,deport,die,dinkum,diplomat,document,doubl,doubt,downgrad,downsid,drive,drought,drum,duti,dutton,dwight,effort,elect,electr,emerg,emiss,empir,en,end,energi,england,english,entri,envoy,español,etf,ethereum,european,everi,excel,experi,expert,exploit,expos,express,facebook,factcheck,fail,fake,fear,feder,feel,felt,fight,fijian,fill,final,financi,fire,first,fisheri,fitch,five,flood,forb,foreign,former,fossil,foundat,fox,freak,free,frequenc,frydenberg,fuel,fusinato,futur,gain,gallagh,gap,gener,get,getaway,gillard,giro,global,glori,goal,gold,govern,grace,greater,green,grid,guardian,guitar,hand,handl,hardlin,havoc,health,held,herald,highlight,hindley,hit,holiday,home,hors,hospit,hospitalis,hot,...,network,new,news,newsday,nrl,number,obsess,obtain,oecd,one,onslaught,opinion,outsid,overload,pacif,pandem,panel,parti,past,pavilion,pend,peopl,perform,permit,person,peter,pfizer,pictur,pitch,pledg,plot,pm,pocock,podcast,point,polici,polit,politicis,poll,pollut,pom,post,postcod,pound,power,prepar,presid,press,preview,previou,pricier,primari,print,prison,priu,project,promis,properti,prosecut,protect,push,queensland,question,quiet,quietli,race,radicalis,ralli,rampant,rare,rate,receiv,recoveri,reduct,refuge,region,reintroduc,relat,relay,reli,remov,report,requir,reshap,restrict,retir,return,reuter,reveal,review,revolutionari,rich,right,rip,rise,risk,rnz,road,role,room,rse,rugbi,run,russia,saga,sail,save,say,sb,school,scientist,scott,second,secret,secur,senat,set,seven,sever,shape,share,ship,shock,shooter,shortag,shut,sign,simon,slam,solar,solomon,sordid,sort,sour,south,spain,speak,spectacl,spi,splinter,sport,spring,staff,stage,stamp,stand,star,state,stay,still,storag,strain,strait,strategist,street,struggl,student,style,success,suggest,super,supremaci,surg,suspend,sustain,swimswam,sydney,symond,syria,system,tactic,take,talent,talk,tame,targa,target,taskforc,tax,team,teewah,tell,ten,territori,tesla,test,thousand,tie,time,tini,today,toll,toward,toyota,trail,travel,travelpuls,trial,trip,trust,tt,tunnel,two,tycoon,tyranni,uk,ukrain,un,unpaid,unpreced,unwant,updat,upset,urbanmatt,ux,vacanc,vaccin,variant,venic,visit,voter,vulner,wage,wall,want,war,warder,wari,washington,way,western,whistleblow,white,wider,wife,win,wit,woodsid,work,worker,world,worri,worth,wrestl,xi,yate,year,yet,york,zealand
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.397875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.433673,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.433673,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.372477,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.372477,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.433673,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.101511,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.409718,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.276405,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.446582,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.084706,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.320065,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.34189,0.0,0.0,0.34189,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.303137,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.34189,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.230647,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.372651,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.34189,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.300417,0.0,0.0,0.0,0.0,0.0,0.0,0.068287,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.300417,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.300417,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.300417,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.244377,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.300417,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.300417,0.0,0.0,0.0,0.0,0.0,0.275619,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.275619,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.275619,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.086812,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.328023,0.0,0.0,0.381916,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.381916,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.350391,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.236381,0.381916,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.381916,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.350391,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
# Dictionary apping from word IDs to words, initialized in a lazy manner to save memory (not created until needed)
dictionary = Dictionary(df["token"])
print(dictionary)

05/17/2022 04:59:29 AM - adding document #0 to Dictionary<0 unique tokens: []>
05/17/2022 04:59:29 AM - built Dictionary<611 unique tokens: ['american', 'atlant', 'contamin', 'democraci', 'home']...> from 99 documents (total 972 corpus positions)
05/17/2022 04:59:29 AM - Dictionary lifecycle event {'msg': "built Dictionary<611 unique tokens: ['american', 'atlant', 'contamin', 'democraci', 'home']...> from 99 documents (total 972 corpus positions)", 'datetime': '2022-05-17T04:59:29.410998', 'gensim': '4.2.0', 'python': '3.9.7 (default, Sep 16 2021, 08:50:36) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'created'}


Dictionary<611 unique tokens: ['american', 'atlant', 'contamin', 'democraci', 'home']...>


In [13]:
from gensim.matutils import Sparse2Corpus
from pprint import pprint

# Convert the TF-IDF to the needed input for Gensim
tf_idf_sklearn = Sparse2Corpus(tf_idf, documents_columns=False)

# Compute the LDA
lda2 = LdaModel(corpus=tf_idf_sklearn, id2word=id2word, num_topics=3, passes=10)

# Print the main topics
pprint(lda2.print_topics())

05/17/2022 04:59:31 AM - using symmetric alpha at 0.3333333333333333
05/17/2022 04:59:31 AM - using symmetric eta at 0.3333333333333333
05/17/2022 04:59:31 AM - using serial LDA version on this node
05/17/2022 04:59:31 AM - running online (multi-pass) LDA training, 3 topics, 10 passes over the supplied corpus of 99 documents, updating model once every 99 documents, evaluating perplexity every 99 documents, iterating 50x with a convergence threshold of 0.001000
05/17/2022 04:59:31 AM - -10.032 per-word bound, 1047.2 perplexity estimate based on a held-out corpus of 99 documents with 293 words
05/17/2022 04:59:31 AM - PROGRESS: pass 0, at document #99/99
05/17/2022 04:59:31 AM - topic #0 (0.333): 0.011*"jazeera" + 0.007*"oecd" + 0.006*"member" + 0.004*"joint" + 0.004*"anthoni" + 0.004*"splinter" + 0.004*"contamin" + 0.004*"spi" + 0.004*"herald" + 0.004*"pitch"
05/17/2022 04:59:31 AM - topic #1 (0.333): 0.013*"jazeera" + 0.009*"oecd" + 0.005*"joint" + 0.005*"quiet" + 0.004*"envoy" + 0.004

05/17/2022 04:59:31 AM - topic diff=0.000689, rho=0.316228
05/17/2022 04:59:31 AM - -7.888 per-word bound, 236.8 perplexity estimate based on a held-out corpus of 99 documents with 293 words
05/17/2022 04:59:31 AM - PROGRESS: pass 9, at document #99/99
05/17/2022 04:59:31 AM - topic #0 (0.333): 0.012*"jazeera" + 0.006*"oecd" + 0.006*"member" + 0.005*"anthoni" + 0.005*"joint" + 0.004*"splinter" + 0.004*"contamin" + 0.004*"herald" + 0.004*"spi" + 0.004*"pitch"
05/17/2022 04:59:31 AM - topic #1 (0.333): 0.013*"jazeera" + 0.009*"oecd" + 0.005*"joint" + 0.005*"quiet" + 0.004*"envoy" + 0.004*"race" + 0.004*"pavilion" + 0.004*"newsday" + 0.004*"risk" + 0.004*"felt"
05/17/2022 04:59:31 AM - topic #2 (0.333): 0.008*"jazeera" + 0.006*"pitch" + 0.005*"member" + 0.005*"contamin" + 0.004*"oecd" + 0.004*"storag" + 0.004*"secur" + 0.004*"activ" + 0.004*"abc" + 0.004*"saga"
05/17/2022 04:59:31 AM - topic diff=0.000455, rho=0.301511
05/17/2022 04:59:31 AM - LdaModel lifecycle event {'msg': 'trained Lda

[(0,
  '0.012*"jazeera" + 0.006*"oecd" + 0.006*"member" + 0.005*"anthoni" + '
  '0.005*"joint" + 0.004*"splinter" + 0.004*"contamin" + 0.004*"herald" + '
  '0.004*"spi" + 0.004*"pitch"'),
 (1,
  '0.013*"jazeera" + 0.009*"oecd" + 0.005*"joint" + 0.005*"quiet" + '
  '0.004*"envoy" + 0.004*"race" + 0.004*"pavilion" + 0.004*"newsday" + '
  '0.004*"risk" + 0.004*"felt"'),
 (2,
  '0.008*"jazeera" + 0.006*"pitch" + 0.005*"member" + 0.005*"contamin" + '
  '0.004*"oecd" + 0.004*"storag" + 0.004*"secur" + 0.004*"activ" + 0.004*"abc" '
  '+ 0.004*"saga"')]


> ⚠️Be careful : documents are represented in columns in `gensim` TD-IDF sparse matrix, while documents are represented in rows in `sklearn` TD-IDF sparse matrix. If you want to use `sklearn` TF-IDF with gensim LDA, you should set documents_columns=False

## 2.3. LDA visualization
If you used Gensim all along, you can then perform visualization on your topics using the library pyLDAvis.

⚙️ You can install it using the following command in your terminal :

In [15]:
# !pip install pyLDAvis
# Import the modules
import pyLDAvis
import pyLDAvis.gensim

# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(topic_model=lda2, corpus=bow, dictionary=id2word)
vis

ModuleNotFoundError: No module named 'pyLDAvis.gensim'

> ⚠️Note : with `pyLDAvis.gensim.prepare`, the `corpus` is a BOW, not a TF-IDF !