<a href="https://colab.research.google.com/github/LeoMaggio/Deep-NLP/blob/main/practices/P1/Practice_1_Text_processing_and_topic_modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Moreno La Quatra

**Practice 1:** Text processing and topic modeling

# **Text processing**
---
Text processing is a preliminary stage that prepares raw text for subsequent analysis.

It usually involves several steps, which may include:
- **Language Identification**: identifying the language of a given text.
- **Tokenization**: splitting a given text into sentences or words.
- **Dependency tree parsing**: analyzing the dependencies between the words that compose the text.
- **Stemming/Lemmatization**: reducing each word in the text to its root form.
- **Stopword removal**: removing words that are so commonly used that they carry very little useful information.
- **Part of Speech Tagging**: given a word, retrieving its part of speech (e.g., proper noun, common noun, or verb).



### Language Identification

| Text                                                                                                                                | Language Code |
|-------------------------------------------------------------------------------------------------------------------------------------|---------------|
| The "Deep Natural Language Processing" course is offered during the first semester of the second year at Politecnico di Torino      | `EN`            |
| Il corso "Deep Natural Language Processing" viene impartito al Politecnico di Torino durante il primo semestre del secondo anno.    | `IT`            |
| Le cours "Deep Natural Language Processing" est enseigné au Politecnico di Torino pendant le premier semestre de la deuxième année. | `FR`            |

**Language Identification** is a crucial preliminary step because each language has its own characteristics. Knowing the main language of a given text benefits all subsequent steps in the text processing pipeline.
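
To make the idea concrete, here is a minimal sketch of a language identifier based on stopword overlap, applied to the three sentences from the table. It is only a toy heuristic: the `STOPWORDS` lists and the `guess_language` helper are invented for this illustration, and the libraries benchmarked in Exercise 1 rely on trained models rather than on rules like these.

```python
import re

# Hypothetical, hand-picked function words for three languages (illustration only).
STOPWORDS = {
    "en": {"the", "is", "of", "at", "during", "and", "first", "second"},
    "it": {"il", "al", "di", "del", "viene", "durante", "primo", "secondo"},
    "fr": {"le", "au", "de", "la", "est", "pendant", "premier", "deuxième"},
}

def guess_language(text):
    """Return the language code whose stopword list overlaps most with the text."""
    tokens = set(re.findall(r"\w+", text.lower()))
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    return max(scores, key=scores.get)

samples = [
    'The "Deep Natural Language Processing" course is offered during the first semester of the second year at Politecnico di Torino',
    'Il corso "Deep Natural Language Processing" viene impartito al Politecnico di Torino durante il primo semestre del secondo anno.',
    'Le cours "Deep Natural Language Processing" est enseigné au Politecnico di Torino pendant le premier semestre de la deuxième année.',
]
predictions = [guess_language(s) for s in samples]
print(predictions)
```

A heuristic like this breaks down on short texts and on languages that share function words, which is one reason to prefer trained detectors in practice.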

The data collection used in this first part of the practice is provided [here](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P1/langid_dataset.csv) - [source: Kaggle](https://www.kaggle.com/martinkk5575/language-detection)

## Exercise 1

Benchmark different language-detection algorithms by computing the accuracy of each approach:
- [FastText](https://pypi.org/project/fastlangid/)
- [LangID](https://github.com/saffsd/langid.py)
- [langdetect](https://pypi.org/project/langdetect/)

**Hint:** language code conversion: [iso639-lang](https://pypi.org/project/iso639-lang/)

For each method report:
- Accuracy
- Average time per example
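
Both metrics can be computed by one small helper shared across the three libraries. The sketch below assumes that each detector can be wrapped in a function mapping a single text to a language code; the `benchmark` helper and the `dummy_detector` stand-in are hypothetical names used only for illustration.

```python
import time

def benchmark(predict_one, texts, labels):
    """Return (accuracy, average milliseconds per example) for a detector.

    `predict_one` is any callable that maps one text to a language code.
    """
    start = time.perf_counter()
    predictions = [predict_one(text) for text in texts]
    elapsed = time.perf_counter() - start

    correct = sum(pred == gold for pred, gold in zip(predictions, labels))
    accuracy = correct / len(labels)
    avg_ms = elapsed * 1000 / len(texts)
    return accuracy, avg_ms

# Stand-in detector (not a real library): always answers "en".
def dummy_detector(text):
    return "en"

texts = ["hello world", "ciao mondo", "good morning", "bonjour le monde"]
labels = ["en", "it", "en", "fr"]
accuracy, avg_ms = benchmark(dummy_detector, texts, labels)
print(f"Accuracy: {accuracy}")
print(f"Average ms per example: {avg_ms}")
```

Keeping the timing inside one function ensures that every method is measured the same way.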

In [1]:
%%capture
!pip install iso639-lang

In [2]:
%%capture
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/langid_dataset.csv

In [3]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from iso639 import Lang
import time
from IPython.utils import io

df = pd.read_csv('langid_dataset.csv')
df.head()

Unnamed: 0,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian
1,sebes joseph pereira thomas på eng the jesuit...,Swedish
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil
4,de spons behoort tot het geslacht haliclona en...,Dutch


In [4]:
X = df['Text']
y = df['language'].apply(lambda x: Lang(x).pt1)

### 1.1 FastText

In [5]:
%%capture
!pip install fastlangid

In [6]:
from fastlangid.langid import LID

langid = LID()

start = time.time()
y_pred = langid.predict(X)
elapsed_time = time.time() - start

avg_time = elapsed_time * 1000 / len(y)
print(f"Accuracy: {accuracy_score(y, y_pred)}")
print(f"Average ms per example: {avg_time}")

Accuracy: 0.9231818181818182
Average ms per example: 0.16538201678882947


### 1.2 LangID

In [7]:
%%capture
!pip install langid

In [8]:
from langid.langid import LanguageIdentifier, model

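# Load the pre-trained model bundled with langid.py
identifier = LanguageIdentifier.from_modelstring(model)
y_pred = []

start = time.time()
for text in X:
  # classify() returns a (language, score) pair; keep only the language code
  y_pred.append(identifier.classify(text)[0])
elapsed_time = time.time() - start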

avg_time = elapsed_time * 1000 / len(y)
print(f"Accuracy: {accuracy_score(y, y_pred)}")
print(f"Average ms per example: {avg_time}")

Accuracy: 0.9542727272727273
Average ms per example: 4.0447222861376675


### 1.3 langdetect

In [9]:
%%capture
!pip install langdetect

In [10]:
from langdetect import detect

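y_pred = []

start = time.time()
for text in X:
  try:
    y_pred.append(detect(text))
  except Exception:
    # langdetect raises when it finds no usable features (e.g. blank text);
    # record an empty prediction so that it counts as a miss
    y_pred.append("")
    print("This text throws and error: ", text)
elapsed_time = time.time() - start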

avg_time = elapsed_time * 1000 / len(y)
print(f"Accuracy: {accuracy_score(y, y_pred)}")
print(f"Average ms per example: {avg_time}")

This text throws and error:  ﺩﺍﻭﺩﺍﺳﻪ ﻓﻀﯿﻠﺖ ﭘﻪ ﺍﺣﺎﺩﯾﺜﻮﮐﻲ  – ﺣﻤﺮﺍﻥ ﺭﻭﺍﯾﺖ ﮐﻮﯼ ﭼﯥ ﯾﻮﻩ ﻭﺭځ ﺣﻀﺮﺕ ﻋﺜﻤﺎﻥ ﺑﻦ ﻋﻔﺎﻥ ﺭﺿﯽ ﺍﻟﻠﻪ ﻋﻨﻪ ﭘﻪ ښﻪ ډﻭﻝ ﺳﺮﻩ ﺍﻭﺩﺱ ﺗﺎﺯﻩ ﮐړ ﺍﻭﺑﯿﺎﯾﯽ ﻭﻓﺮﻣﺎﯾﻞ ﻣﺎﺭﺳﻮﻝ ﺍﻟﻠﻪ ﺻﻠﯽ ﺍﻟﻠﻪ ﻋﻠﯿﻪ ﻭﺳﻠﻢ ﭘﺮﺍﻭﺩﺍﺳﻪ ﻭﻟﯿﺪﯼ ﭘﻪ ښﻪ ډﻭﻝ ﺳﺮﻩ ﺋﯥ ﺍﻭﺩﺱ ﺗﺎﺯﻩ ﮐړ ﺍﻭﺑﯿﺎﯾﯽ ﻭﻓﺮﻣﺎﯾﻞ
This text throws and error:                                           
This text throws and error:   – ﺩﺣﻀﺮﺕ ﺍﺑﻮﻫﺮﯾﺮﻩ ﺭﺿﯽ ﺍﻟﻠﻪ ﻋﻨﻪ څﺨﻪ ﺭﻭﺍﯾﺖ ﺩﯼ ﭼﯥ ﻣﺎﺩﻧﺒﯽ ﮐﺮﯾﻢ ﺻﻠﯽ ﺍﻟﻠﻪ ﻋﻠﯿﻪ ﻭﺳﻠﻢ ﻧﻪ ﻭﺍﻭﺭﯾﺪﻝ ﭼﯥ ﺩﺍﺍﻣﺖ ﺑﻪ ﺩﻗﯿﺎﻣﺖ ﭘﻪ ﻭﺭځ ﺑﺎﻧﺪﯼ ﺭﺍﻭﻏﻮښﺘﻞ ﺷﻲ ﭼﯥ ﺩﺍﻭﺩﺍﺳﻪ ﻟﻪ ﮐﺒﻠﻪ ﺑﻪ ﺩﺩﻭﯼ ﻻﺳﻮﻧﻪ ﭘښﯥ ﺍﻭﻣﺨﻮﻧﻪ ﻧﻮﺭﺍﻧﻲ ﺍﻭﺭﻭښﺎﻧﻪ ﻭﻱ څﻮﮎ ﭼﯥ ﺧﭙﻠﻪ ﺭﻭښﻨﺎﯾﯽ ﺯﯾﺎﺗﻮﯼ ﻧﻮﺯﯾﺎﺗﯽ ﺩﯼ ﮐړﻱﺭﻭﺍﻩ ﺑﺨﺎﺭﻱ ﺍﻟﺘﺮﻏﯿﺐ ﻭﺍﻟﺘﺮﻫﯿﺐ ﻟﻮﻣړﯼ ټﻮﮎ  ﭘﺎڼﻪ ﺣﺪﯾﺚ   ﻟﯿﮑﻮﺍﻝ ﺣﺎﻓﻆ ﺯﮐﻲ ﺍﻟﺪﯾﻦ ﻋﺒﺪﺍﻟﻌﻈﯿﻢ ﺑﻦ ﻋﺒﺪﺍﻟﻘﻮﻱ ﺍﻟﻤﻨﺬﺭﯼ ﺍﻟﻤﺘﻮﻓﯽ  ﻫﻖ
Accuracy: 0.8435
Average ms per example: 6.79802947694605


## Exercise 2

For the English texts, apply word-level tokenization. What is the average number of words per sentence?

Implement word tokenization using both [nltk](https://www.nltk.org/) and [spaCy](https://spacy.io/). Report the results for both.

For spaCy use the `en_core_web_sm` model.
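
Before turning to the libraries, the quantity to report can be illustrated with a rough regular-expression tokenizer. This is a simplified stand-in, not what `nltk.word_tokenize` or spaCy do internally: the hypothetical `simple_tokenize` helper treats each run of word characters and each punctuation mark as one token, and the example texts are made up.

```python
import re
from statistics import mean

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

texts = [
    "Tokenization splits text into words.",
    "It is harder than it looks, isn't it?",
]
token_counts = [len(simple_tokenize(t)) for t in texts]
print(token_counts)
print(mean(token_counts))
```

Tokenizers disagree on cases such as contractions and punctuation, which is one reason two libraries can report different averages on the same texts.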

### 2.1 Natural Language Toolkit

In [11]:
%%capture
!pip install nltk

In [12]:
import nltk
with io.capture_output() as captured:
  nltk.download('punkt')

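words_per_sentence = []

start = time.time()
for i, sentence in enumerate(X):
  # Tokenize only the texts labelled as English
  if y[i] == "en":
    words_per_sentence.append(len(nltk.word_tokenize(sentence)))
elapsed_time = time.time() - start

# Note: the time is averaged over the whole dataset, not only the English texts
avg_time = elapsed_time * 1000 / len(y)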
print(f"Average number of words per sentence: {np.mean(words_per_sentence)}")
print(f"Average ms per example: {avg_time}")

Average number of words per sentence: 68.738
Average ms per example: 0.02054020491513339


### 2.2 spaCy

In [13]:
%%capture
!pip install --upgrade spacy
!python -m spacy download en_core_web_sm

In [14]:
import spacy

nlp = spacy.load("en_core_web_sm")
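words_per_sentence = []

start = time.time()
for i, sentence in enumerate(X):
  # Process only the texts labelled as English
  if y[i] == "en":
    # len(doc) counts every spaCy token, including punctuation and whitespace tokens
    words_per_sentence.append(len(nlp(sentence)))
elapsed_time = time.time() - start

# Note: the time is averaged over the whole dataset, not only the English texts
avg_time = elapsed_time * 1000 / len(y)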
print(f"Average number of words per sentence: {np.mean(words_per_sentence)}")
print(f"Average ms per example: {avg_time}")

Average number of words per sentence: 72.334
Average ms per example: 0.8450686498121782


## Exercise 3

Use spaCy to parse the dependency tree of a **randomly selected** sentence. You can use either an English sentence or one in your native language (if supported in [spaCy](https://spacy.io/usage/models/)). Use [displaCy](https://explosion.ai/demos/displacy) to visualize the result in the notebook.

In [15]:
import random

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
# Positions of all English texts in the dataset
indexes = [i for i, x in enumerate(y) if x == 'en']
sentence = X[random.choice(indexes)]
print(sentence)
doc = nlp(sentence)
# Draw the dependency tree inline in the notebook
displacy.render(doc, style='dep', jupyter=True)

the designs of barron and chubb were based on the use of movable levers but joseph bramah a prolific inventor developed an alternative method in  his lock used a cylindrical key with precise notches along the surface these moved the metal slides that impeded the turning of the bolt into an exact alignment allowing the lock to open the lock was at the limits of the precision manufacturing capabilities of the time and was said by its inventor to be unpickable in the same year bramah started the bramah locks company at  piccadilly and displayed the "challenge lock" in the window of his shop from  challenging "the artist who can make an instrument that will pick or open this lock" for the reward of £ the challenge stood for over  years until at the great exhibition of  the american locksmith alfred charles hobbs was able to open the lock and following some argument about the circumstances under which he had opened it was awarded the prize hobbs attempt required some  hours spread over  day

## Exercise 4
Apply the following steps to the sentence selected in the previous exercise:
1. Lemmatization: convert each word to its base (dictionary) form.
2. Stopword removal: remove language-specific stopwords.
3. Part of Speech Tagging: display the part of speech of each word in the sentence.

For each step, print the resulting list to the console.

In [16]:
doc = nlp(sentence)
# Collect the lemma (base form) of every token
lemmas = [word.lemma_ for word in doc]
print(lemmas)

['the', 'design', 'of', 'barron', 'and', 'chubb', 'be', 'base', 'on', 'the', 'use', 'of', 'movable', 'lever', 'but', 'joseph', 'bramah', 'a', 'prolific', 'inventor', 'develop', 'an', 'alternative', 'method', 'in', ' ', 'his', 'lock', 'use', 'a', 'cylindrical', 'key', 'with', 'precise', 'notch', 'along', 'the', 'surface', 'these', 'move', 'the', 'metal', 'slide', 'that', 'impede', 'the', 'turning', 'of', 'the', 'bolt', 'into', 'an', 'exact', 'alignment', 'allow', 'the', 'lock', 'to', 'open', 'the', 'lock', 'be', 'at', 'the', 'limit', 'of', 'the', 'precision', 'manufacturing', 'capability', 'of', 'the', 'time', 'and', 'be', 'say', 'by', 'its', 'inventor', 'to', 'be', 'unpickable', 'in', 'the', 'same', 'year', 'bramah', 'start', 'the', 'bramah', 'lock', 'company', 'at', ' ', 'piccadilly', 'and', 'display', 'the', '"', 'challenge', 'lock', '"', 'in', 'the', 'window', 'of', 'his', 'shop', 'from', ' ', 'challenge', '"', 'the', 'artist', 'who', 'can', 'make', 'an', 'instrument', 'that', 'will

In [17]:
# Keep the lemmas of all tokens that are not stopwords
clean_sentence = [word.lemma_ for word in doc if not word.is_stop]
print(" ".join(clean_sentence))

design barron chubb base use movable lever joseph bramah prolific inventor develop alternative method   lock cylindrical key precise notch surface move metal slide impede turning bolt exact alignment allow lock open lock limit precision manufacturing capability time say inventor unpickable year bramah start bramah lock company   piccadilly display " challenge lock " window shop   challenge " artist instrument pick open lock " reward £ challenge stand   year great exhibition   american locksmith alfred charles hobbs able open lock follow argument circumstance open award prize hobb attempt require   hour spread   day


In [18]:
for word in doc:
  print(word.text, word.pos_)

the DET
designs NOUN
of ADP
barron NOUN
and CCONJ
chubb PROPN
were AUX
based VERB
on ADP
the DET
use NOUN
of ADP
movable ADJ
levers NOUN
but CCONJ
joseph PROPN
bramah VERB
a DET
prolific ADJ
inventor NOUN
developed VERB
an DET
alternative ADJ
method NOUN
in ADP
  SPACE
his PRON
lock NOUN
used VERB
a DET
cylindrical ADJ
key NOUN
with ADP
precise ADJ
notches NOUN
along ADP
the DET
surface NOUN
these DET
moved VERB
the DET
metal NOUN
slides NOUN
that DET
impeded VERB
the DET
turning NOUN
of ADP
the DET
bolt NOUN
into ADP
an DET
exact ADJ
alignment NOUN
allowing VERB
the DET
lock NOUN
to PART
open VERB
the DET
lock NOUN
was VERB
at ADP
the DET
limits NOUN
of ADP
the DET
precision NOUN
manufacturing NOUN
capabilities NOUN
of ADP
the DET
time NOUN
and CCONJ
was AUX
said VERB
by ADP
its PRON
inventor NOUN
to PART
be VERB
unpickable ADJ
in ADP
the DET
same ADJ
year NOUN
bramah NOUN
started VERB
the DET
bramah NOUN
locks NOUN
company NOUN
at ADP
  SPACE
piccadilly ADV
and CCONJ
displayed VERB
t

# **Occurrence-based text representation - TF-IDF**

---
TF-IDF (term frequency-inverse document frequency) is a statistical measure of how relevant a word is to a document within a collection of documents. It yields an occurrence-based vector representation of each document.
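
As a minimal sketch of the idea, the snippet below computes a textbook variant, tf(t, d) · log(N / df(t)), on a made-up three-document corpus. The `tf_idf` helper is hypothetical, and library implementations typically differ in details such as smoothing and normalization, so the numbers will not match theirs exactly.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
vectors = tf_idf(corpus)
for vector in vectors:
    print({term: round(weight, 3) for term, weight in vector.items()})
```

A word that occurs in every document, such as "the" here, gets weight zero, while a word unique to one document, such as "mat", gets the highest weight in that document.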

## Exercise 5
Use TF-IDF to vectorize each sentence in the original data collection. You can choose your preferred implementation of TF-IDF vectorization; one is available in the [scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(X)
print(X_tfidf.shape)

(22000, 277719)


## Exercise 6

Build a supervised multi-class language detector that uses the TF-IDF vectors as features. Use 80% of the data to train the language detector and the remaining 20% to assess its accuracy.

In [20]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_tfidf, 
                                                    y, 
                                                    test_size=0.20)
clf = SVC()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

Accuracy: 0.9325


# **Topic Modelling**

Occurrence-based representations are high-dimensional. What is the dimension of the TF-IDF vector representation generated above?
Topic modelling focuses on capturing latent topics in large document corpora.

The data collection used in this second part of the practice is provided [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv) - [source: Zenodo](https://zenodo.org/record/4282522#.YVdCXcbOOpd)


## Exercise 7

Latent Semantic Indexing (LSI) models underlying concepts by using SVD (Singular Value Decomposition).
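
To illustrate how SVD surfaces a latent concept, the sketch below extracts only the dominant singular direction of a tiny, invented term-document count matrix using power iteration. This is a teaching simplification, not how gensim's `LsiModel` is implemented; the matrix, the vocabulary, and the `top_singular_direction` helper are all assumptions made for the example.

```python
import math

terms = ["virus", "vaccine", "lockdown", "police"]
# Rows are terms, columns are documents (raw counts, invented for illustration).
A = [
    [2, 3, 0],  # virus
    [1, 2, 0],  # vaccine
    [0, 0, 1],  # lockdown
    [0, 0, 1],  # police
]

def top_singular_direction(matrix, iterations=100):
    """Approximate the leading left singular vector and singular value."""
    n_rows, n_cols = len(matrix), len(matrix[0])

    def project(u):
        # A^T u: one coordinate per document.
        return [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]

    u = [1.0] * n_rows
    for _ in range(iterations):
        # u <- normalize(A A^T u): one step of power iteration.
        v = project(u)
        w = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        norm = math.sqrt(sum(x * x for x in w))
        u = [x / norm for x in w]
    # For a unit-length u, the singular value is the length of A^T u.
    sigma = math.sqrt(sum(x * x for x in project(u)))
    return u, sigma

u, sigma = top_singular_direction(A)
topic = sorted(zip(terms, u), key=lambda pair: -abs(pair[1]))
print(round(sigma, 3))
print([(term, round(weight, 3)) for term, weight in topic])
```

The dominant direction puts almost all of its weight on "virus" and "vaccine", the two terms that co-occur in the first two documents. That weighted group of terms is what LSI reports as a topic.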

Use the [gensim](https://radimrehurek.com/gensim/) library to:
1. Create a corpus composed of the headlines contained in the data collection.
2. Generate a [dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) to create a word -> id mapping (required by LSI module).
3. Using the dictionary, preprocess the corpus to obtain the representation required for LSI model training ([documentation here](https://radimrehurek.com/gensim/models/lsimodel.html)).
4. Inspect the top-5 topics generated by the LSI model for the analysed corpus.
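
Steps 2 and 3 amount to building a vocabulary and counting token ids per document. The pure-Python sketch below mirrors that idea on two invented headlines; `build_vocabulary` and `to_bow` are hypothetical helpers, not gensim functions, and the ids that gensim assigns may differ.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocabulary = {}
    for tokens in tokenized_docs:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary

def to_bow(tokens, vocabulary):
    """Convert a token list into a sorted list of (token_id, count) pairs."""
    counts = Counter(vocabulary[token] for token in tokens if token in vocabulary)
    return sorted(counts.items())

docs = [
    "a video shows a crowded hospital".split(),
    "a post claims the video is fake".split(),
]
vocabulary = build_vocabulary(docs)
bow_docs = [to_bow(tokens, vocabulary) for tokens in docs]
print(vocabulary)
print(bow_docs)
```

Each document becomes a sparse list of pairs, so only the words that actually occur are stored.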

In [21]:
%%capture
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv

In [22]:
%%capture
!pip install --upgrade gensim

In [23]:
df = pd.read_csv('CovidFake_filtered.csv', index_col='Unnamed: 0')
df.head()

Unnamed: 0,headlines,outcome
0,A post claims compulsory vacination violates t...,0
1,A photo claims that this person is a doctor wh...,0
2,Post about a video claims that it is a protest...,0
3,All deaths by respiratory failure and pneumoni...,0
4,The dean of the College of Biologists of Euska...,0


In [24]:
# Create a corpus composed of the headlines contained in the data collection.
# Each entry is one document, split into whitespace-separated tokens.
corpus = df['headlines'].tolist()
corpus = [s.split() for s in corpus]
corpus[:2]

[['A',
  'post',
  'claims',
  'compulsory',
  'vacination',
  'violates',
  'the',
  'principles',
  'of',
  'bioethics,',
  'that',
  'coronavirus',
  "doesn't",
  'exist,',
  'that',
  'the',
  'PCR',
  'test',
  'returns',
  'many',
  'false',
  'positives,',
  'and',
  'that',
  'influenza',
  'vaccine',
  'is',
  'related',
  'to',
  'COVID-19.'],
 ['A',
  'photo',
  'claims',
  'that',
  'this',
  'person',
  'is',
  'a',
  'doctor',
  'who',
  'died',
  'after',
  'attending',
  'to',
  'too',
  'many',
  'COVID-19',
  'patinents',
  'in',
  'Hospital',
  'Muñiz',
  'in',
  'Buenos',
  'Aires.']]

In [25]:
# Generate a dictionary holding the word -> id mapping (required by the LSI module)
from gensim.corpora.dictionary import Dictionary
dct = Dictionary(corpus)
# Convert each document to its bag-of-words form: a list of (word id, count) pairs
bow_corpus = [dct.doc2bow(sentence) for sentence in corpus]

In [26]:
# Train the LSI model on the bag-of-words corpus and inspect the top-5 topics
# (named `lsi` to avoid shadowing the `model` object imported from langid)
from gensim.models import LsiModel
lsi = LsiModel(bow_corpus, id2word=dct)
lsi.print_topics(num_topics=5)

[(0,
  '0.636*"the" + 0.389*"of" + 0.314*"in" + 0.280*"a" + 0.254*"to" + 0.179*"and" + 0.155*"that" + 0.121*"is" + 0.107*"coronavirus" + 0.106*"A"'),
 (1,
  '0.629*"the" + -0.598*"a" + -0.386*"in" + -0.105*"A" + -0.097*"and" + -0.078*"has" + -0.077*"COVID-19" + -0.071*"video" + -0.070*"on" + -0.064*"been"'),
 (2,
  '-0.709*"to" + 0.633*"of" + -0.137*"and" + -0.123*"is" + -0.071*"for" + -0.069*"be" + -0.054*"that" + -0.051*"are" + -0.049*"from" + -0.048*"due"'),
 (3,
  '0.734*"in" + -0.428*"of" + -0.363*"to" + -0.283*"a" + 0.172*"the" + 0.097*"coronavirus" + -0.046*"from" + 0.040*"was" + -0.037*"COVID-19." + -0.034*"on"'),
 (4,
  '-0.479*"to" + -0.425*"of" + 0.407*"a" + -0.391*"in" + 0.259*"that" + 0.240*"the" + 0.213*"is" + 0.187*"and" + 0.115*"for" + 0.075*"coronavirus"')]

## Exercise 8 (Optional)

Without stopword removal, the top-scoring words in each topic are common English words (e.g., *to*, *for*, *in*, *of*, *on*). Moreover, leaving punctuation in place can hurt topic identification, since tokens such as `COVID-19` and `COVID-19.` are treated as different words. Repeat the procedure of Exercise 7, adding a preprocessing step to:

1. **remove stopwords**
2. **strip punctuation**
3. **lowercase all words**
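
The three steps can be combined into a single function applied to each headline. The sketch below uses a tiny hand-written stopword set and an invented headline so that it runs without downloads; in the exercise itself a full list such as NLTK's English stopwords would take its place. `preprocess`, `TOY_STOPWORDS`, and `PUNCT_TABLE` are hypothetical names.

```python
import string

# Tiny illustrative stopword set; a real list is much longer.
TOY_STOPWORDS = {"a", "the", "of", "that", "is", "to", "in", "and"}
# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(headline):
    """Lowercase, strip ASCII punctuation, then drop stopwords and empty tokens."""
    tokens = []
    for word in headline.split():
        cleaned = word.lower().translate(PUNCT_TABLE)
        if cleaned and cleaned not in TOY_STOPWORDS:
            tokens.append(cleaned)
    return tokens

tokens = preprocess("A post claims that the PCR test is related to COVID-19.")
print(tokens)
```

Stripping punctuation before the stopword check means that tokens such as `the,` are filtered as well.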

In [27]:
import nltk
from nltk.corpus import stopwords

with io.capture_output() as captured:
  nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Lowercase every word and drop English stopwords
processed_corpus = [[w.lower() for w in s if w.lower() not in stop_words] for s in corpus]

In [28]:
import string

# Strip ASCII punctuation from every token (the translation table is built once)
punct_table = str.maketrans('', '', string.punctuation)
processed_corpus = [[w.translate(punct_table) for w in s] for s in processed_corpus]

In [29]:
from gensim.corpora.dictionary import Dictionary
from gensim.models import LsiModel
dct = Dictionary(processed_corpus)
bow_corpus = [dct.doc2bow(sentence) for sentence in processed_corpus]
lsi = LsiModel(bow_corpus, id2word=dct)
lsi.print_topics(num_topics=5)

[(0,
  '0.743*"coronavirus" + 0.406*"covid19" + 0.172*"video" + 0.139*"people" + 0.120*"shows" + 0.120*"facebook" + 0.109*"novel" + 0.107*"claim" + 0.098*"new" + 0.096*"shared"'),
 (1,
  '0.811*"covid19" + -0.544*"coronavirus" + 0.065*"video" + 0.049*"shows" + -0.045*"novel" + -0.040*"new" + 0.039*"hospital" + 0.037*"claims" + 0.037*"facebook" + 0.034*"lockdown"'),
 (2,
  '-0.358*"video" + -0.326*"facebook" + -0.297*"claim" + 0.294*"covid19" + -0.278*"shows" + -0.278*"posts" + -0.267*"times" + 0.250*"coronavirus" + -0.250*"shared" + -0.196*"multiple"'),
 (3,
  '0.652*"video" + 0.300*"shows" + 0.253*"people" + -0.240*"facebook" + -0.237*"posts" + -0.214*"claim" + -0.202*"shared" + -0.174*"times" + -0.152*"multiple" + 0.143*"lockdown"'),
 (4,
  '0.880*"people" + -0.303*"video" + -0.130*"shows" + -0.097*"coronavirus" + -0.089*"covid19" + 0.073*"lockdown" + 0.072*"government" + 0.068*"virus" + 0.065*"died" + -0.065*"patients"')]

## Exercise 9 (Optional)

Using the same corpus as for the LSI model, apply LDA modelling with the number of topics set to 5. Display the words that contribute most to those topics according to the LDA model.

In [30]:
from gensim.models.ldamodel import LdaModel
lda = LdaModel(bow_corpus, id2word=dct, num_topics=5)
lda.print_topics(num_topics=5)

[(0,
  '0.044*"coronavirus" + 0.021*"covid19" + 0.014*"facebook" + 0.013*"novel" + 0.012*"claim" + 0.011*"video" + 0.010*"shared" + 0.009*"shows" + 0.009*"times" + 0.009*"posts"'),
 (1,
  '0.042*"coronavirus" + 0.018*"covid19" + 0.009*"people" + 0.008*"masks" + 0.007*"health" + 0.007*"president" + 0.007*"pandemic" + 0.006*"government" + 0.006*"new" + 0.006*"predicted"'),
 (2,
  '0.047*"coronavirus" + 0.027*"covid19" + 0.015*"people" + 0.015*"video" + 0.010*"shows" + 0.009*"infected" + 0.008*"cases" + 0.008*"due" + 0.007*"outbreak" + 0.006*"new"'),
 (3,
  '0.063*"coronavirus" + 0.027*"covid19" + 0.018*"china" + 0.013*"wuhan" + 0.012*"vaccine" + 0.010*"chinese" + 0.009*"new" + 0.008*"patients" + 0.008*"died" + 0.008*"shows"'),
 (4,
  '0.032*"coronavirus" + 0.020*"covid19" + 0.013*"people" + 0.009*"government" + 0.009*"lockdown" + 0.007*"video" + 0.006*"outbreak" + 0.006*"indian" + 0.006*"quarantine" + 0.005*"free"')]

## Exercise 10 (Optional)

Using the [pyLDAvis]() library, build an interactive visualization of the trained LDA model.

In [33]:
%%capture
!pip install pyldavis

In [34]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()
lda_display = gensimvis.prepare(lda, bow_corpus, dct, sort_topics=False)
pyLDAvis.display(lda_display)

TypeError: ignored