# News Modeling

Topic modeling involves **extracting features from document terms** and using
mathematical structures and frameworks such as matrix factorization and SVD to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts.

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data.

These concepts can be used to interpret the main **themes** of a corpus and to make **semantic connections among words that frequently co-occur** in various documents.

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
The latent Dirichlet allocation (LDA) technique is a **generative probabilistic model** in which each **document is assumed to be a mixture of topics**, similar to the probabilistic latent semantic indexing model.

In simple words, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**

### LDA Algorithm

1. For each document, **randomly assign each word to one of the K topics** (K is chosen beforehand)
2. For each document D, go through each word W and compute:
    - **P(T | D)**, the proportion of words in D that are assigned to topic T
    - **P(W | T)**, the proportion of assignments to topic T, over all documents, that come from the word W
3. **Reassign word W to topic T** with probability P(T | D) × P(W | T), considering all other words and their topic assignments
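
The three steps above can be sketched as a toy sampler in plain Python. This is only an illustration of the idea, not the algorithm gensim uses for training (gensim relies on variational inference). The function name `toy_lda` and the smoothing constants `alpha` and `beta` are assumptions added here so that no probability is ever exactly zero.

```python
import random
from collections import defaultdict

def toy_lda(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy sampler following the three steps above (not gensim's algorithm)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 1: randomly assign every word occurrence to one of the K topics.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    doc_topic = [[0] * num_topics for _ in docs]                # words in D assigned to T
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # times W is assigned to T
    topic_total = [0] * num_topics                              # all assignments to T
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Leave the current word out so only the other assignments count.
                t_old = assignments[d][i]
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1

                # Step 2: weight proportional to P(T|D) * P(W|T), smoothed by alpha and beta.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]

                # Step 3: reassign the word to a topic drawn with those weights.
                t_new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] += 1
                topic_total[t_new] += 1

    return assignments, doc_topic


docs = [["car", "engine", "door"], ["game", "team", "car"], ["team", "game", "score"]]
assignments, doc_topic = toy_lda(docs, num_topics=2)
print(doc_topic)  # per-document topic counts
```

Each row of `doc_topic` counts how many words of that document ended up in each topic, which is the unnormalized form of the document's topic distribution.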

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [24]:
! pip install pyLDAvis gensim spacy


### Import the libraries

In [25]:
# Step 2: Import necessary libraries

# Data handling
import pandas as pd
import numpy as np

# Text preprocessing
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.tokenizer import Tokenizer

# Topic modeling
import gensim
from gensim import corpora
from gensim.models import LdaModel
from gensim.models import CoherenceModel

# Visualization
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import matplotlib.pyplot as plt
%matplotlib inline


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/karthik/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics

### Load the dataset

In [26]:


# Load the JSON file
file_path = 'newsgroups.json'  # Replace with the correct path if needed
data = pd.read_json(file_path)

# Display the first few rows
data.head()


Unnamed: 0,content,target,target_names
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,rec.autos
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,comp.sys.mac.hardware
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,4,comp.sys.mac.hardware
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,1,comp.graphics
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,14,sci.space


### Preprocess the data

### Email Removal

In [27]:


data['clean_text'] = data['content'].apply(lambda x: re.sub(r'\S+@\S+', '', str(x)))

# Display a few examples to confirm emails are removed
data[['content', 'clean_text']].head()


Unnamed: 0,content,clean_text
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,From: (where's my thing)\nSubject: WHAT car i...
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,From: (Guy Kuo)\nSubject: SI Clock Poll - Fin...
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,From: (Thomas E Willis)\nSubject: PB question...
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,From: (Joe Green)\nSubject: Re: Weitek P9000 ...
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,From: (Jonathan McDowell)\nSubject: Re: Shutt...


### Newline Removal

In [28]:


# Remove newline characters (\n) and extra spaces
data['clean_text'] = data['clean_text'].apply(lambda x: re.sub(r'\s+', ' ', str(x)).strip())

# Display a few examples to verify
data[['content', 'clean_text']].head()


Unnamed: 0,content,clean_text
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,From: (where's my thing) Subject: WHAT car is ...
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,From: (Guy Kuo) Subject: SI Clock Poll - Final...
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,From: (Thomas E Willis) Subject: PB questions....
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,From: (Joe Green) Subject: Re: Weitek P9000 ? ...
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,From: (Jonathan McDowell) Subject: Re: Shuttle...


### Single Quotes Removal

In [29]:


# Remove single quotes from the text
data['clean_text'] = data['clean_text'].apply(lambda x: re.sub(r"\'", "", str(x)))

# Display a few examples to verify
data[['content', 'clean_text']].head()


Unnamed: 0,content,clean_text
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,From: (wheres my thing) Subject: WHAT car is t...
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,From: (Guy Kuo) Subject: SI Clock Poll - Final...
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,From: (Thomas E Willis) Subject: PB questions....
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,From: (Joe Green) Subject: Re: Weitek P9000 ? ...
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,From: (Jonathan McDowell) Subject: Re: Shuttle...
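
The three cleaning cells above (email, newline and single-quote removal) could also be folded into one helper, which makes the order of the steps explicit. The sketch below is one possible way to do that; the name `clean_text` and the sample address are made up for illustration.

```python
import re

EMAIL_RE = re.compile(r"\S+@\S+")
WHITESPACE_RE = re.compile(r"\s+")

def clean_text(text):
    """Apply the three cleaning steps in the same order as the cells above."""
    text = EMAIL_RE.sub("", str(text))            # drop anything that looks like an email address
    text = WHITESPACE_RE.sub(" ", text).strip()   # collapse newlines and repeated spaces
    return text.replace("'", "")                  # remove single quotes

# Hypothetical sample post header, not taken from the dataset
sample = "From: someone@example.com (where's my thing)\nSubject: WHAT car is this?"
print(clean_text(sample))
```

With a helper like this, the column could be built in one pass with `data['content'].apply(clean_text)`.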


### Tokenize
- Create **sent_to_words()**
    - Use **gensim.utils.simple_preprocess**
    - Use a **generator** instead of a regular function, so documents are tokenized one at a time
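
Why a generator helps can be seen in a small stand-alone sketch. The regex tokenizer below is only a stand-in so the example runs without gensim; it is not equivalent to `simple_preprocess`, and the name `tokenize_lazily` is hypothetical.

```python
import re

def tokenize_lazily(sentences):
    """Yield one token list per document instead of building them all at once."""
    for sentence in sentences:
        # Stand-in tokenizer: lowercase alphabetic runs only (not simple_preprocess).
        yield re.findall(r"[a-z]+", str(sentence).lower())

docs = ["WHAT car is this?", "SI Clock Poll - Final Call"]
tokens = tokenize_lazily(docs)   # nothing has been tokenized yet
first = next(tokens)             # tokenizes only the first document
print(first)
remaining = list(tokens)         # consumes the rest
print(remaining)
```

Wrapping the generator in `list(...)`, as the cell below does, materializes everything at once; iterating over it directly keeps only one document in memory at a time.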

In [30]:


from gensim.utils import simple_preprocess

# Define a generator function to yield tokens
def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases, tokenizes, and drops punctuation and very short or very long tokens
        yield simple_preprocess(str(sentence), deacc=True)  # deacc=True strips accent marks

# Apply the generator to your clean text
data_words = list(sent_to_words(data['clean_text']))

# Display a few tokenized examples
for i in range(3):
    print(data_words[i])


['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']
['from', 'guy', 'kuo', 'subject

### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use

In [31]:


# Download NLTK stopwords if not already downloaded
nltk.download('stopwords')

# Get the default English stopwords (a set gives fast membership tests)
stop_words = set(stopwords.words('english'))

# Extend with custom words
stop_words.update(['from', 'subject', 're', 'edu', 'use'])

# Define a function to remove stopwords
def remove_stopwords(texts):
    return [[word for word in doc if word not in stop_words] for doc in texts]

# Apply the stopword removal
data_words_nostops = remove_stopwords(data_words)

# Display first few examples
for i in range(3):
    print(data_words_nostops[i])


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/karthik/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['wheres', 'thing', 'car', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'organization', 'university', 'maryland', 'college', 'park', 'lines', 'wondering', 'anyone', 'could', 'enlighten', 'car', 'saw', 'day', 'door', 'sports', 'car', 'looked', 'late', 'early', 'called', 'bricklin', 'doors', 'really', 'small', 'addition', 'front', 'bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'production', 'car', 'made', 'history', 'whatever', 'info', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'neighborhood', 'lerxst']
['guy', 'kuo', 'si', 'clock', 'poll', 'final', 'call', 'summary', 'final', 'call', 'si', 'clock', 'reports', 'keywords', 'si', 'acceleration', 'clock', 'upgrade', 'article', 'shelley', 'qvfo', 'innc', 'organization', 'university', 'washington', 'lines', 'nntp', 'posting', 'host', 'carson', 'washington', 'fair', 'number', 'brave', 'souls', 'upgraded', 'si', 'clock', 'oscillator', 'shared', 'expe

#### remove_stopwords( )

In [32]:
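def remove_stopwords(texts):
    """Return each tokenized document with the stop words filtered out."""
    return [[word for word in doc if word not in stop_words] for doc in texts]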

### Bigrams
- Use **gensim.models.Phrases**
- 100 as threshold
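
To get a feel for what the threshold means, the default scoring rule in `Phrases` can be approximated in plain Python: a pair scores higher when it occurs together more often than its individual word counts would suggest, and it is merged only if the score exceeds the threshold. The sketch below is an approximation under stated assumptions: `bigram_scores` is a hypothetical helper, and `vocab_size` counts unique single tokens, whereas gensim's own vocabulary bookkeeping differs in detail.

```python
from collections import Counter

def bigram_scores(docs, min_count=5):
    """Score adjacent word pairs: (pair_count - min_count) * vocab_size / (count_a * count_b)."""
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (count - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), count in bigrams.items()
    }

docs = [
    ["nntp", "posting", "host"],
    ["nntp", "posting", "car"],
    ["car", "host", "nntp", "posting"],
]
scores = bigram_scores(docs, min_count=1)

# The notebook uses 100 on the full corpus; tiny toy data needs a much smaller value.
threshold = 0.5
kept = [pair for pair, score in scores.items() if score > threshold]
print(kept)
```

A higher threshold keeps fewer, more strongly associated pairs, which is why a large value such as 100 merges only pairs like `nntp_posting` that almost always appear together.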

In [33]:

from gensim.models import Phrases
from gensim.models.phrases import Phraser

# Build the bigram model
bigram = Phrases(data_words_nostops, min_count=5, threshold=100)

# Create a faster Phraser object
bigram_mod = Phraser(bigram)

# Function to make bigrams for a list of tokenized words
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

# Apply the bigram model
data_words_bigrams = make_bigrams(data_words_nostops)

# Display examples to check bigrams formed
for i in range(3):
    print(data_words_bigrams[i])


['wheres', 'thing', 'car', 'nntp_posting', 'host', 'rac_wam', 'umd', 'organization', 'university', 'maryland_college', 'park', 'lines', 'wondering', 'anyone', 'could', 'enlighten', 'car', 'saw', 'day', 'door', 'sports', 'car', 'looked', 'late', 'early', 'called', 'bricklin', 'doors', 'really', 'small', 'addition', 'front_bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'production', 'car', 'made', 'history', 'whatever', 'info', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'neighborhood', 'lerxst']
['guy_kuo', 'si', 'clock', 'poll', 'final', 'call', 'summary', 'final', 'call', 'si', 'clock', 'reports', 'keywords', 'si', 'acceleration', 'clock', 'upgrade', 'article_shelley', 'qvfo', 'innc', 'organization', 'university', 'washington', 'lines', 'nntp_posting', 'host', 'carson_washington', 'fair', 'number', 'brave', 'souls', 'upgraded', 'si', 'clock', 'oscillator', 'shared', 'experiences', 'poll', 'pleas

#### make_bigrams( )

In [34]:
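def make_bigrams(texts):
    """Merge frequent word pairs in each tokenized document using the trained bigram model."""
    return [bigram_mod[doc] for doc in texts]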

### Lemmatization
- Use spaCy
    - Download the spaCy English model `en_core_web_sm` (if you have not done that before); the `en` shortcut is deprecated as of spaCy v3.0
    - Load the spaCy model

In [35]:
! python -m spacy download en


import spacy

# Download the English model if not done already
!python -m spacy download en_core_web_sm

# Load the model
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Define the lemmatization function
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """Lemmatize words keeping only nouns, adjectives, verbs, and adverbs."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

# Apply lemmatization on bigram data
data_lemmatized = lemmatization(data_words_bigrams)

# Display a few examples
for i in range(3):
    print(data_lemmatized[i])


⚠ As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use the
full pipeline package name 'en_core_web_sm' instead.
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')

Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
['s', 'thing', 'car', 'nntp_poste', 'host', 'park', 'line', 'wonder', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'bricklin', 'door', 'really', 'small', 'addition', 'separate', 'rest', 'body', 'know', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']
['final', 'call', 'summary', 'final', 'call', 'si', 'clock', 'report', 'keyword', 'acceleration', 'clock', 'nntp_poste', 'host', 'fair', 'number', 'brave', 'soul', 'upgrade', 'si', 'clock', 'oscillator', 'share', 'experience', 'poll', 'send', 'brief', 'message', 'detail', 'experience', 'proced

In [37]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

#### lemmatization( )

In [38]:
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [39]:
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

In [40]:
print(data_lemmatized[:1])

[['s', 'thing', 'car', 'nntp_poste', 'host', 'park', 'line', 'wonder', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'bricklin', 'door', 'really', 'small', 'addition', 'separate', 'rest', 'body', 'know', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']]


### Create a Dictionary

In [41]:


from gensim import corpora

# Create the dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Display a few tokens to check (printing the full mapping would be very long)
print("Sample tokens:\n")
print(list(id2word.token2id.items())[:10])

# Filter out extremes (optional but recommended)
# Removes very rare and very common words to improve topic quality
id2word.filter_extremes(no_below=5, no_above=0.5)

print("\nNumber of unique tokens after filtering:", len(id2word))


Sample tokens:


Number of unique tokens after filtering: 14350


### Create Corpus

In [42]:


# Each document is converted to BoW format
corpus = [id2word.doc2bow(text) for text in data_lemmatized]

# Display the first document in BoW format
print(corpus[0])

# Optional: Display the first document as human-readable (word, frequency) pairs
print([(id2word[token_id], freq) for token_id, freq in corpus[0]])


[(0, 1), (1, 1), (2, 1), (3, 1), (4, 5), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 2), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1)]
[('addition', 1), ('body', 1), ('bring', 1), ('call', 1), ('car', 5), ('day', 1), ('door', 2), ('early', 1), ('engine', 1), ('enlighten', 1), ('funky', 1), ('history', 1), ('host', 1), ('info', 1), ('know', 1), ('late', 1), ('look', 2), ('mail', 1), ('make', 1), ('model', 1), ('name', 1), ('neighborhood', 1), ('nntp_poste', 1), ('park', 1), ('production', 1), ('really', 1), ('rest', 1), ('s', 1), ('see', 1), ('separate', 1), ('small', 1), ('spec', 1), ('sport', 1), ('thank', 1), ('thing', 1), ('wonder', 1), ('year', 1)]


### Filter low-frequency words

In [43]:


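# Remove words that appear in fewer than 5 documents or in more than 50% of all documents.
# The dictionary was already filtered with the same settings when it was created,
# so this call leaves the token count unchanged.
id2word.filter_extremes(no_below=5, no_above=0.5)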

# Recreate corpus after filtering
corpus = [id2word.doc2bow(text) for text in data_lemmatized]

# Display number of unique tokens after filtering
print("Number of unique tokens after filtering:", len(id2word))

# Display first document after filtering
print(corpus[0])


Number of unique tokens after filtering: 14350
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 5), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 2), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1)]
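
What the dictionary, the frequency filter and the bag-of-words conversion do can be mimicked on toy data with the standard library. The sketch below is a simplified stand-in, not gensim's implementation: the helper names are hypothetical, and gensim's `filter_extremes` additionally caps the vocabulary size with `keep_n`, which is omitted here.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below=5, no_above=0.5):
    """Keep tokens found in at least `no_below` docs and in at most `no_above` of all docs."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

def to_bow(doc, token2id):
    """Count the kept tokens of one document as sorted (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

docs = [["car", "car", "door"], ["car", "engine"], ["game", "team"], ["game", "engine"]]
kept = filter_by_document_frequency(docs, no_below=2, no_above=0.5)
token2id = {tok: i for i, tok in enumerate(sorted(kept))}
print(token2id)
print(to_bow(docs[0], token2id))
```

Tokens that were filtered out (here `door` and `team`) simply disappear from the bag-of-words vectors, which is why the corpus has to be rebuilt whenever the dictionary changes.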


### Create Index 2 word dictionary

In [44]:


# id2word.token2id maps word -> id; invert it to get an id -> word lookup
index2word = {v: k for k, v in id2word.token2id.items()}

# Display first 10 entries
print("Sample index-to-word mapping:")
for i, (idx, word) in enumerate(index2word.items()):
    print(idx, word)
    if i >= 9:
        break


Sample index-to-word mapping:
0 addition
1 body
2 bring
3 call
4 car
5 day
6 door
7 early
8 engine
9 enlighten


### Build a News Topic Model

#### LdaModel
- **num_topics**: the number of topics, which must be chosen beforehand
- **chunksize**: the number of documents used in each training chunk
- **alpha**: the prior on each document's topic distribution, which affects how sparse the per-document topic mixtures are
- **passes**: the total number of training passes through the corpus
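
The effect of **alpha** on sparsity can be illustrated without training a model, by drawing topic mixtures directly from a symmetric Dirichlet prior. This is a minimal sketch using only the standard library (a Dirichlet draw is a set of gamma draws normalized to sum to 1); `sample_dirichlet` is a hypothetical helper, and the values of alpha are chosen only for illustration.

```python
import random

def sample_dirichlet(alpha, num_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) prior."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
for alpha in (0.1, 1.0, 10.0):
    mixture = sample_dirichlet(alpha, num_topics=5, rng=rng)
    print(alpha, [round(p, 2) for p in mixture])
```

Small values of alpha tend to put most of the probability mass on one or two topics per document, while large values spread it almost evenly. With `alpha='auto'`, as in the cell below, gensim learns an asymmetric prior from the data instead of using a fixed value.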

In [45]:


from gensim.models import LdaModel

# Set parameters
num_topics = 5       # Define the number of topics you want
chunksize = 100      # Number of documents processed at a time
passes = 10          # Number of full passes through the corpus during training
alpha = 'auto'       # Let gensim optimize alpha

# Train LDA model
lda_model = LdaModel(
    corpus=corpus,           # Corpus in BoW format
    id2word=id2word,         # Dictionary mapping
    num_topics=num_topics,
    random_state=100,
    update_every=1,
    chunksize=chunksize,
    passes=passes,
    alpha=alpha,
    per_word_topics=True
)

# Print the topics discovered
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}\n")


Topic 0: 0.011*"year" + 0.011*"key" + 0.009*"team" + 0.009*"get" + 0.009*"good" + 0.009*"game" + 0.007*"car" + 0.007*"go" + 0.007*"article" + 0.006*"nntp_poste"

Topic 1: 0.745*"ax" + 0.034*"_" + 0.006*"c" + 0.005*"cx" + 0.005*"ei" + 0.003*"rlk" + 0.003*"mf" + 0.003*"ai" + 0.003*"r" + 0.002*"uy"

Topic 2: 0.010*"evidence" + 0.010*"say" + 0.009*"reason" + 0.008*"believe" + 0.007*"people" + 0.006*"think" + 0.006*"know" + 0.006*"many" + 0.006*"claim" + 0.005*"point"

Topic 3: 0.010*"use" + 0.009*"get" + 0.009*"system" + 0.009*"nntp_poste" + 0.008*"host" + 0.007*"drive" + 0.007*"need" + 0.006*"problem" + 0.006*"program" + 0.006*"work"

Topic 4: 0.012*"people" + 0.011*"say" + 0.011*"go" + 0.009*"get" + 0.008*"article" + 0.008*"think" + 0.008*"make" + 0.007*"know" + 0.007*"right" + 0.006*"time"



### Print the Keywords in Each Topic

In [46]:


num_top_words = 10  # Number of keywords per topic

# num_topics=-1 returns every topic in the model
for idx, topic in lda_model.show_topics(num_topics=-1, num_words=num_top_words, formatted=False):
    print(f"Topic {idx}: ", end='')
    print([word for word, _ in topic])


Topic 0: ['year', 'key', 'team', 'get', 'good', 'game', 'car', 'go', 'article', 'nntp_poste']
Topic 1: ['ax', '_', 'c', 'cx', 'ei', 'rlk', 'mf', 'ai', 'r', 'uy']
Topic 2: ['evidence', 'say', 'reason', 'believe', 'people', 'think', 'know', 'many', 'claim', 'point']
Topic 3: ['use', 'get', 'system', 'nntp_poste', 'host', 'drive', 'need', 'problem', 'program', 'work']
Topic 4: ['people', 'say', 'go', 'get', 'article', 'think', 'make', 'know', 'right', 'time']


## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

Model perplexity measures **how well** a **probability distribution** or probability model **predicts a sample**; lower perplexity indicates a better fit.

In [47]:
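# Evaluate LDA Model - Perplexity

# log_perplexity returns the per-word likelihood bound (a negative number), not the
# perplexity itself; gensim reports the perplexity as 2 ** (-bound)
perplexity = lda_model.log_perplexity(corpus)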
print(f'LDA Model Perplexity: {perplexity}')


LDA Model Perplexity: -7.617422349231275
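
The number printed above is the per-word likelihood bound returned by `log_perplexity`, which is why it is negative. According to the gensim documentation, the perplexity estimate it logs is 2 raised to the negative of that bound, so the value can be converted by hand. The snippet below simply reuses the printed value; the variable names are illustrative.

```python
# Per-word likelihood bound copied from the output above
per_word_bound = -7.617422349231275

# gensim reports perplexity as 2 ** (-bound); lower is better
perplexity = 2 ** (-per_word_bound)
print(f"Perplexity estimate: {perplexity:.1f}")
```

This gives a perplexity of roughly 196. The figure is mainly useful for comparing models trained on the same corpus, since its absolute value is hard to interpret.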


### Topic Coherence
Topic coherence measures score a single topic by measuring the **degree of semantic similarity** between the **high-scoring words** in that topic; higher scores generally indicate more interpretable topics.

In [48]:


from gensim.models import CoherenceModel

# Compute Coherence Score using c_v measure
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()

print(f'LDA Model Coherence Score: {coherence_lda}')


LDA Model Coherence Score: 0.5378598438163787
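
The `c_v` measure used above is fairly involved, but the intuition behind coherence can be shown with a much simpler, UMass-style score based on document co-occurrence counts. The sketch below is a different and simpler measure than `c_v`, written only to illustrate the idea; `umass_coherence` is a hypothetical helper and the `+ 1` smoothing is one common choice.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """Average log((D(w_i, w_j) + 1) / D(w_j)) over ordered pairs of top words."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain all the given words
        return sum(all(w in doc for w in words) for doc in doc_sets)

    scores = []
    for j, i in combinations(range(len(top_words)), 2):  # j < i: w_j ranks higher
        w_j, w_i = top_words[j], top_words[i]
        if doc_count(w_j) == 0:
            continue  # skip words that never occur in the reference documents
        scores.append(math.log((doc_count(w_i, w_j) + 1) / doc_count(w_j)))
    return sum(scores) / len(scores) if scores else 0.0

docs = [["car", "engine", "door"], ["car", "engine"], ["game", "team"], ["team", "score", "game"]]
related = umass_coherence(["car", "engine"], docs)    # words that co-occur
unrelated = umass_coherence(["car", "team"], docs)    # words that never co-occur
print(related, unrelated)
```

Topics whose top words tend to appear in the same documents score higher, which matches the informal notion that the words of a good topic belong together.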


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [49]:
# Visualize the Topic Model using pyLDAvis

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Prepare the visualization data
lda_vis_data = gensimvis.prepare(lda_model, corpus, id2word)

# Display the interactive visualization
pyLDAvis.display(lda_vis_data)
