# News Modeling

Topic modeling involves **extracting features from document terms** and using mathematical techniques such as matrix factorization and singular value decomposition (SVD) to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts.

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data.

These concepts can be used to interpret the main **themes** of a corpus and to make **semantic connections among words that frequently co-occur** across documents.

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
Latent Dirichlet allocation (LDA) is a **generative probabilistic model** in which each **document is assumed to be a mixture of topics**, similar to the probabilistic latent semantic indexing (pLSI) model.

In simple terms, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**
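
The two distributions above can be read as a recipe for generating a document. The following is a minimal sketch of that generative view, assuming two made-up topics and a made-up topic mix (`topics`, `doc_topic_mix`, and `generate_document` are hypothetical names used only for illustration):

```python
import random

# Hypothetical toy topics: each topic is a probability distribution over words.
topics = {
    "space": {"orbit": 0.5, "shuttle": 0.3, "launch": 0.2},
    "autos": {"car": 0.6, "engine": 0.3, "door": 0.1},
}

# Hypothetical document: a probability distribution over topics.
doc_topic_mix = {"space": 0.7, "autos": 0.3}


def generate_document(topic_mix, topics, n_words, rng):
    """Draw a topic for each word position, then draw a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(topic_mix), weights=list(topic_mix.values()))[0]
        word_dist = topics[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        words.append(word)
    return words


rng = random.Random(0)
sample_doc = generate_document(doc_topic_mix, topics, n_words=8, rng=rng)
print(sample_doc)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topic mix of each document and the word distribution of each topic.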

### LDA Algorithm

1. For each document, **randomly assign each word to one of the K topics** (K is chosen beforehand)
2. For each document D, go through each word W and compute:
    - **P(T | D)**, the proportion of words in D that are currently assigned to topic T
    - **P(W | T)**, the proportion of assignments to topic T, over all documents, that come from the word W
3. **Reassign word W to topic T** with probability P(T | D) × P(W | T), taking into account all other words and their topic assignments
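
The steps above can be sketched as a toy collapsed-Gibbs-style sampler in plain Python. This is an illustration under stated assumptions, not the training procedure used later in this notebook (gensim's `LdaModel` uses online variational Bayes instead). The function name `gibbs_lda_sketch`, the smoothing constants `alpha` and `beta`, and the toy documents are all hypothetical.

```python
import random
from collections import Counter


def gibbs_lda_sketch(docs, num_topics, n_iters=20, alpha=0.1, beta=0.01, seed=0):
    """Toy sampler; alpha and beta are assumed smoothing constants."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 1: randomly assign every word occurrence to one of the K topics.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Count tables derived from the current assignments.
    doc_topic = [Counter(a) for a in assignments]        # per document: topic -> count
    topic_word = [Counter() for _ in range(num_topics)]  # per topic: word -> count
    topic_total = [0] * num_topics                       # per topic: total assignments
    for doc, assigned in zip(docs, assignments):
        for word, topic in zip(doc, assigned):
            topic_word[topic][word] += 1
            topic_total[topic] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Exclude the current word so only the other assignments count.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1

                # Step 2: P(T | D) and P(W | T) for every topic, lightly smoothed.
                weights = []
                for t in range(num_topics):
                    p_t_d = (doc_topic[d][t] + alpha) / (len(doc) - 1 + num_topics * alpha)
                    p_w_t = (topic_word[t][word] + beta) / (topic_total[t] + vocab_size * beta)
                    weights.append(p_t_d * p_w_t)

                # Step 3: resample the topic in proportion to P(T | D) * P(W | T).
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] += 1
                topic_total[new] += 1

    return assignments, topic_word


toy_docs = [
    ["car", "engine", "door", "car"],
    ["god", "faith", "jesus", "faith"],
    ["engine", "car", "door"],
    ["jesus", "god", "faith"],
]
assignments, topic_word = gibbs_lda_sketch(toy_docs, num_topics=2)
for t, counts in enumerate(topic_word):
    print(t, counts.most_common(3))
```

The smoothing terms keep every topic reachable even when its current count is zero; without them, a topic that loses all its words in a document could never be picked again.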

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [10]:
!pip install pyLDAvis
!pip install gensim==3.6.0
!pip install nltk
!pip install pandas
!pip install numpy
!pip install matplotlib





### Import the libraries

In [11]:
import gensim
from gensim import corpora
from gensim.models import LdaModel
from gensim.models import CoherenceModel
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.util import ngrams
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from gensim.utils import simple_preprocess
from gensim.models import Phrases
from gensim.corpora import Dictionary
from collections import defaultdict
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis



### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics

### Load the dataset

In [12]:
# URL of the dataset
url = "https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json"

# Read the dataset into a Pandas DataFrame
df = pd.read_json(url)

# Display the first few rows of the DataFrame
print(df.head())



                                             content  target  \
0  From: lerxst@wam.umd.edu (where's my thing)\nS...       7   
1  From: guykuo@carson.u.washington.edu (Guy Kuo)...       4   
2  From: twillis@ec.ecn.purdue.edu (Thomas E Will...       4   
3  From: jgreen@amber (Joe Green)\nSubject: Re: W...       1   
4  From: jcm@head-cfa.harvard.edu (Jonathan McDow...      14   

            target_names  
0              rec.autos  
1  comp.sys.mac.hardware  
2  comp.sys.mac.hardware  
3          comp.graphics  
4              sci.space  


### Preprocess the data

### Email Removal

In [13]:
# Function to remove email addresses using regex
def remove_emails(text):
    # Define the regex pattern for email addresses
    # (inside a character class, '|' is a literal pipe, so it is not used here)
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    # Substitute email addresses with an empty string
    text_without_emails = re.sub(pattern, '', text)
    return text_without_emails



In [14]:
# Apply the function to remove emails from the text column
df['content'] = df['content'].apply(remove_emails)

# Display the updated DataFrame
print(df.head())



                                             content  target  \
0  From:  (where's my thing)\nSubject: WHAT car i...       7   
1  From:  (Guy Kuo)\nSubject: SI Clock Poll - Fin...       4   
2  From:  (Thomas E Willis)\nSubject: PB question...       4   
3  From: jgreen@amber (Joe Green)\nSubject: Re: W...       1   
4  From:  (Jonathan McDowell)\nSubject: Re: Shutt...      14   

            target_names  
0              rec.autos  
1  comp.sys.mac.hardware  
2  comp.sys.mac.hardware  
3          comp.graphics  
4              sci.space  
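
In the output above, row 3 still shows `jgreen@amber`. The pattern requires a dot followed by a top-level domain of at least two letters, so an address without one is left untouched. A small standalone check, using a pattern of the same shape on two of the headers shown earlier (`EMAIL_PATTERN`, `samples`, and `cleaned` are names introduced here for illustration):

```python
import re

# Same shape as the pattern in remove_emails(): local part, '@', domain, '.', TLD.
EMAIL_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

samples = [
    "From: lerxst@wam.umd.edu (where's my thing)",
    "From: jgreen@amber (Joe Green)",  # no '.tld' part, so it does not match
]

cleaned = [EMAIL_PATTERN.sub('', s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

If such host-only addresses should also be removed, the pattern would need the dotted suffix to be optional, at the cost of matching more non-address text that merely contains an `@`.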


### Newline Removal

In [15]:
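# Using regular expressions
# Note: replacing '\n' with '' fuses the words on either side of a line break
# (e.g. 'sawthe' in the lemmatized output later); substituting ' ' would keep them apart
df['content'] = df['content'].apply(lambda x: re.sub(r'\n', '', x))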

# Display the updated DataFrame
print(df.head())



                                             content  target  \
0  From:  (where's my thing)Subject: WHAT car is ...       7   
1  From:  (Guy Kuo)Subject: SI Clock Poll - Final...       4   
2  From:  (Thomas E Willis)Subject: PB questions....       4   
3  From: jgreen@amber (Joe Green)Subject: Re: Wei...       1   
4  From:  (Jonathan McDowell)Subject: Re: Shuttle...      14   

            target_names  
0              rec.autos  
1  comp.sys.mac.hardware  
2  comp.sys.mac.hardware  
3          comp.graphics  
4              sci.space  


### Single Quotes Removal

In [16]:
# Using regular expressions
df['content'] = df['content'].apply(lambda x: re.sub(r"'", '', x))

# Display the updated DataFrame
print(df.head())

                                             content  target  \
0  From:  (wheres my thing)Subject: WHAT car is t...       7   
1  From:  (Guy Kuo)Subject: SI Clock Poll - Final...       4   
2  From:  (Thomas E Willis)Subject: PB questions....       4   
3  From: jgreen@amber (Joe Green)Subject: Re: Wei...       1   
4  From:  (Jonathan McDowell)Subject: Re: Shuttle...      14   

            target_names  
0              rec.autos  
1  comp.sys.mac.hardware  
2  comp.sys.mac.hardware  
3          comp.graphics  
4              sci.space  




### Tokenize
- Create **sent_to_words()**
    - Use **gensim.utils.simple_preprocess**
    - Use a **generator** instead of a regular function

In [17]:
# Define the function using a generator
def sent_to_words(sentences):
    for sentence in sentences:
        # Use gensim's simple_preprocess to tokenize each sentence
        # deacc=True strips accent marks; the tokenizer itself drops punctuation
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))



In [18]:
# Tokenize data
data = df['content'].values.tolist()
tokenized_data = list(sent_to_words(data))

# Display the first 5 tokenized data
print(tokenized_data[:5])





### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use

In [19]:
# Download the NLTK stopwords corpus
nltk.download('stopwords')

# Extend the stopwords list with custom words
custom_stopwords = set(stopwords.words('english'))
custom_words = ["from", "subject", "re", "edu", "use"]
custom_stopwords.update(custom_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


#### remove_stopwords( )

In [20]:
# Function to remove stop words
def remove_stopwords(text):
    return [[word for word in sentence if word not in custom_stopwords] for sentence in text]



In [21]:
# Tokenized data without stopwords
tokenized_data_without_stopwords = remove_stopwords(tokenized_data)

# Display the first 5 tokenized data without stopwords
print(tokenized_data_without_stopwords[:5])





### Bigrams
- Use **gensim.models.Phrases**
- 100 as threshold
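
To give a feel for what the threshold is compared against, here is a minimal sketch of the score that, to my understanding, gensim's default `Phrases` scorer computes for each adjacent word pair: `(count(a, b) - min_count) / (count(a) * count(b)) * vocabulary_size`. The function `bigram_scores` and the toy sentences are hypothetical, and the vocabulary size here counts unigrams only, which is a simplification.

```python
from collections import Counter


def bigram_scores(sentences, min_count=1):
    """Score every adjacent word pair; higher means the pair co-occurs more than chance."""
    unigram_counts = Counter(w for s in sentences for w in s)
    bigram_counts = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigram_counts)  # simplification: unigrams only
    return {
        (a, b): (n_ab - min_count) / (unigram_counts[a] * unigram_counts[b]) * vocab_size
        for (a, b), n_ab in bigram_counts.items()
    }


toy_sentences = [
    ["new", "york", "subway"],
    ["new", "york", "weather"],
    ["new", "york", "traffic"],
    ["old", "car"],
]
scores = bigram_scores(toy_sentences)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

A pair is merged into a single token such as `new_york` only when its score exceeds the threshold. On a toy corpus the scores are tiny because the vocabulary is tiny; on the full newsgroups vocabulary the same formula yields much larger values, which is why a threshold of 100 is usable there.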

In [22]:
# Create bigrams
bigram_phrases = Phrases(tokenized_data_without_stopwords, min_count=1, threshold=100)



In [23]:
# Get the bigram-transformed sentences
tokenized_data_with_bigrams = [bigram_phrases[sentence] for sentence in tokenized_data_without_stopwords]



In [24]:
# Display the tokenized data with bigrams
print(tokenized_data_with_bigrams[:5])





#### make_bigrams( )

In [25]:
def make_bigrams(tokenized_data, threshold=100):
    """
    Function to create bigrams from a list of tokenized sentences.

    Parameters:
    tokenized_data (list): List of tokenized sentences.
    threshold (int): Threshold value for bigram formation.

    Returns:
    list: List of tokenized sentences with bigrams.
    """
    # Create bigrams
    bigram_phrases = Phrases(tokenized_data, min_count=1, threshold=threshold)

    # Get the bigram-transformed sentences
    tokenized_data_with_bigrams = [bigram_phrases[sentence] for sentence in tokenized_data]

    return tokenized_data_with_bigrams



In [26]:
# Use the make_bigrams function
tokenized_data_with_bigrams = make_bigrams(tokenized_data_without_stopwords, threshold=100)

# Display the tokenized data with bigrams
print(tokenized_data_with_bigrams[:5])





### Lemmatization
- Use spacy
    - Download spacy en model (if you have not done that before)
    - Load the spacy model

In [27]:
import spacy



In [28]:
!python -m spacy download en_core_web_sm
# !pip3 install -U spacy

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [30]:
pip install en_core_web_sm






In [31]:
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])




#### lemmatization( )

In [32]:
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """Lemmatize each token list, keeping only tokens whose POS tag is allowed (see https://spacy.io/api/annotation)."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out



In [34]:
data_lemmatized = lemmatization(tokenized_data_with_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])



In [35]:
print(data_lemmatized[:1])

[['s', 'thing', 'car', 'nntp_poste', 'wam_umd', 'eduorganization_university', 'maryland_college', 'parkline', 'enlighten', 'car', 'sawthe', 'day', 'door_sport', 'car', 'look', 'late_early', 'door', 'really', 'small', 'addition', 'separate', 'rest', 'body', 'know', 'name', 'engine', 'spec', 'yearsof', 'production', 'car', 'make', 'history', 'info', 'youhave', 'funky_looking', 'car', 'mail', 'thank', 'lerxst']]




### Create a Dictionary

In [36]:
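# Create a dictionary from the tokenized data
# Note: this uses the bigram tokens, not data_lemmatized, so the lemmatization
# step above does not feed into the dictionary or the model
dictionary = Dictionary(tokenized_data_with_bigrams)

# Filter out tokens that appear in fewer than 5 documents or in more than 50% of the documents
dictionary.filter_extremes(no_below=5, no_above=0.5)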

# Print the dictionary
print(dictionary.token2id)





### Create Corpus

In [37]:
# Convert tokenized documents into a bag-of-words format
corpus = [dictionary.doc2bow(doc) for doc in tokenized_data_with_bigrams]

# Print the first few elements of the corpus
print(corpus[:5])



[[(0, 1), (1, 1), (2, 1), (3, 5), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1)], [(5, 1), (23, 1), (25, 2), (32, 1), (39, 1), (40, 1), (41, 2), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 2), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 2), (55, 1), (56, 2), (57, 1), (58, 1), (59, 2), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 2), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 2), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1)], [(4, 3), (5, 2), (15, 2), (16, 1), (27, 2), (32, 1), (60, 1), (65, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 3), (93, 1), (94, 2), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1),
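
Each pair in the output above is `(token id, count of that token in the document)`. As a rough illustration of what the bag-of-words conversion does, here is a plain-Python sketch. `build_token2id` and `doc_to_bow` are hypothetical helpers, not gensim functions, and the ids are assigned in first-seen order, which differs from how gensim numbers tokens.

```python
from collections import Counter


def build_token2id(docs):
    """Assign each distinct token an integer id in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from the mapping are skipped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_docs = [
    ["car", "door", "car", "engine"],
    ["engine", "space"],
]
token2id = build_token2id(toy_docs)
bows = [doc_to_bow(doc, token2id) for doc in toy_docs]
print(token2id)
print(bows)
```

Word order is discarded at this point: the model only sees which tokens occur in a document and how often.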

### Filter low-frequency words

In [38]:
# Define the threshold for minimum frequency
min_frequency = 5

# Create a defaultdict to count the frequency of each word
word_frequency = defaultdict(int)
for doc in corpus:
    for word_id, freq in doc:
        word_frequency[word_id] += freq

# Filter out low-frequency words from the corpus
filtered_corpus = [[(word_id, freq) for word_id, freq in doc if word_frequency[word_id] >= min_frequency] for doc in corpus]

# Print the first few elements of the filtered corpus
print(filtered_corpus[:5])



[[(0, 1), (1, 1), (2, 1), (3, 5), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1)], [(5, 1), (23, 1), (25, 2), (32, 1), (39, 1), (40, 1), (41, 2), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 2), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 2), (55, 1), (56, 2), (57, 1), (58, 1), (59, 2), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 2), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 2), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1)], [(4, 3), (5, 2), (15, 2), (16, 1), (27, 2), (32, 1), (60, 1), (65, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 3), (93, 1), (94, 2), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1),

### Create Index-to-Word Dictionary

In [39]:
# Create the index to word dictionary
index_to_word = {idx: word for word, idx in dictionary.token2id.items()}



In [40]:
# Print the first few elements of the index to word dictionary
print({k: index_to_word[k] for k in range(10)})  # Displaying the first 10 elements as an example

{0: 'addition', 1: 'anyone', 2: 'body', 3: 'car', 4: 'could', 5: 'day', 6: 'door_sports', 7: 'doors', 8: 'eduorganization_university', 9: 'engine'}




### Build a News Topic Model

#### LdaModel
- **num_topics** : the number of topics, which must be chosen beforehand
- **chunksize** : the number of documents used in each training chunk
- **alpha** : the prior on the per-document topic distributions, which controls how sparse they are; `'auto'` learns an asymmetric prior from the data
- **passes** : the total number of passes over the corpus during training

In [41]:
# Define parameters
num_topics = 10
chunksize = 100
alpha = 'auto'
passes = 10

# Train the model
lda_model = LdaModel(corpus=filtered_corpus, id2word=dictionary, num_topics=num_topics, chunksize=chunksize, alpha=alpha, passes=passes)

### Print the keywords in the 10 topics

In [42]:
# Print the topics
for topic_id, topic in lda_model.print_topics():
    print(f"Topic {topic_id}: {topic}")

Topic 0: 0.017*"state" + 0.015*"israel" + 0.010*"gun" + 0.010*"law" + 0.009*"government" + 0.007*"states" + 0.006*"public" + 0.006*"guns" + 0.006*"society" + 0.006*"rights"
Topic 1: 0.013*"space" + 0.012*"key" + 0.011*"information" + 0.009*"data" + 0.008*"may" + 0.008*"chip" + 0.008*"system" + 0.006*"available" + 0.006*"technology" + 0.005*"scsi"
Topic 2: 0.009*"organization" + 0.008*"drive" + 0.008*"thanks" + 0.008*"mail" + 0.008*"please" + 0.008*"university" + 0.007*"nntp_posting" + 0.007*"windows" + 0.007*"system" + 0.007*"host"
Topic 3: 0.007*"university" + 0.006*"talking" + 0.006*"new" + 0.005*"president" + 0.005*"new_york" + 0.004*"years" + 0.004*"also" + 0.004*"department" + 0.004*"health" + 0.004*"com"
Topic 4: 0.010*"nntp_posting" + 0.010*"back" + 0.009*"get" + 0.008*"car" + 0.007*"article" + 0.007*"old" + 0.007*"organization" + 0.007*"good" + 0.007*"go" + 0.007*"new"
Topic 5: 0.029*"god" + 0.015*"people" + 0.012*"evidence" + 0.011*"us" + 0.009*"jesus" + 0.008*"faith" + 0.007*



## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

Model perplexity measures **how well** a **probability distribution** or probability model **predicts a sample**; lower perplexity means a better fit.

In [43]:
# Compute model perplexity
# Note: log_perplexity() returns the per-word likelihood bound (a log-scale value,
# hence negative), not the perplexity itself; perplexity is 2 ** (-bound)
perplexity = lda_model.log_perplexity(filtered_corpus)
print(f"Model Perplexity: {perplexity}")

Model Perplexity: -8.092525669858476
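
The value printed above is negative because `log_perplexity()` reports a per-word likelihood bound on a log scale rather than the perplexity itself. As far as I know, gensim derives its own perplexity estimate as `2 ** (-bound)`, so the printed number can be converted as in this small sketch (`per_word_bound` is simply the value copied from the output above):

```python
# Per-word likelihood bound, copied from the notebook output above.
per_word_bound = -8.092525669858476

# Convert the log-scale bound into a perplexity estimate (lower is better).
perplexity_estimate = 2 ** (-per_word_bound)
print(f"Perplexity estimate: {perplexity_estimate:.1f}")
```

When comparing models, a bound closer to zero corresponds to a lower perplexity.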



### Topic Coherence
Topic coherence scores a single topic by measuring the **degree of semantic similarity** between the **high-scoring words** in that topic.
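
To make the idea concrete, here is a deliberately simplified coherence score based on normalized pointwise mutual information (NPMI) over document-level co-occurrence. It is not the `c_v` measure used in the cell below, which is more involved; `npmi_coherence` and the toy documents are hypothetical and only meant to show why words that appear together score higher.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, docs):
    """Average NPMI over all pairs of top words, using document co-occurrence."""
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / n_docs

    scores = []
    for a, b in combinations(top_words, 2):
        p_ab = prob(a, b)
        if p_ab == 0:
            scores.append(-1.0)  # never co-occur: minimum NPMI
        elif p_ab == 1:
            scores.append(0.0)   # both in every document: no information
        else:
            pmi = math.log(p_ab / (prob(a) * prob(b)))
            scores.append(pmi / -math.log(p_ab))
    return sum(scores) / len(scores)


toy_docs = [
    ["god", "jesus", "faith"],
    ["god", "faith", "people"],
    ["car", "engine", "door"],
    ["car", "engine", "god"],
]
coherent = npmi_coherence(["god", "faith", "jesus"], toy_docs)
mixed = npmi_coherence(["faith", "engine", "door"], toy_docs)
print(f"coherent topic: {coherent:.3f}, mixed topic: {mixed:.3f}")
```

A topic whose top words tend to share documents gets a higher score than one that mixes unrelated words, which is the intuition behind using coherence to compare models.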

In [44]:
# Compute topic coherence
coherence_model = CoherenceModel(model=lda_model, texts=tokenized_data_with_bigrams, dictionary=dictionary, coherence='c_v')
coherence_score = coherence_model.get_coherence()
print(f"Topic Coherence Score: {coherence_score}")



Topic Coherence Score: 0.5863410631691223


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [45]:
# Visualize the topics
pyLDAvis.enable_notebook()



In [46]:
vis_data = gensimvis.prepare(lda_model, filtered_corpus, dictionary)
pyLDAvis.display(vis_data)

