#✅ Practical: Topic Modeling and Keyword Extraction using Gensim

#🎯 Objectives:
Perform basic text preprocessing

Apply LDA (Latent Dirichlet Allocation) topic modeling using Gensim

Extract keywords for each topic

#📦 Step 1: Install Required Libraries

In [1]:
!pip install gensim nltk



#🧾 Step 2: Import Libraries and Create Synthetic Data

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim import corpora, models
import string

# Download required NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# Small synthetic dataset (20 short product reviews)
documents = [
    "The battery life of this phone is amazing and lasts all day.",
    "Camera quality is excellent and great for photography.",
    "Very poor customer service experience, not recommended.",
    "I love the display, it’s bright and colorful.",
    "The laptop is fast and lightweight, perfect for travel.",
    "Terrible experience, the screen cracked in two days.",
    "Build quality is top-notch, feels very premium.",
    "Customer service was helpful and resolved my issue quickly.",
    "The speakers are loud and clear, great for music lovers.",
    "Performance is smooth and multitasking is effortless.",
    "Not happy with the charging time, it takes too long.",
    "This smartwatch has many great fitness features.",
    "The interface is intuitive and easy to use.",
    "App support is limited and lacks popular options.",
    "This tablet is great for reading and light work.",
    "Keyboard quality is bad, not suitable for typing.",
    "Storage capacity is high and fast to access.",
    "The device heats up quickly and becomes uncomfortable.",
    "Highly recommend this device for students and travelers.",
    "Software updates are timely and improve performance."
]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


#🧹 Step 3: Text Preprocessing

In [4]:
# Newer NLTK releases also need the 'punkt_tab' resource for word_tokenize
nltk.download('punkt_tab')

# Initialize the English stopword set
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords and any non-alphabetic tokens
    tokens = [word for word in tokens if word not in stop_words and word.isalpha()]
    return tokens

# Apply preprocessing
processed_docs = [preprocess(doc) for doc in documents]


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
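
One side effect of deleting punctuation before tokenizing is that hyphenated words are fused rather than split, so "top-notch" becomes the single token "topnotch". The sketch below uses only the standard library to show this, along with one possible alternative that replaces punctuation with spaces. Here `str.split` stands in for `word_tokenize` as a simplification, and the names `delete_table`, `space_table`, `fused`, and `split_up` are illustrative only.

```python
import string

text = "Build quality is top-notch, feels very premium."

# Same approach as preprocess(): delete every punctuation character
delete_table = str.maketrans('', '', string.punctuation)
fused = text.lower().translate(delete_table).split()

# Possible alternative: map each punctuation character to a space instead
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
split_up = text.lower().translate(space_table).split()

print(fused)     # contains 'topnotch'
print(split_up)  # contains 'top' and 'notch' as separate tokens
```

Which behaviour is preferable depends on the data. Fusing keeps a compound as one vocabulary entry, while splitting lets its parts be counted together with their other occurrences.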


#📚 Step 4: Prepare Dictionary and Corpus for Gensim

In [5]:
# Create a dictionary from processed documents
dictionary = corpora.Dictionary(processed_docs)

# Create Bag of Words corpus
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# Print sample
print("Sample BOW:", corpus[0])
print("Sample Tokens:", processed_docs[0])


Sample BOW: [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]
Sample Tokens: ['battery', 'life', 'phone', 'amazing', 'lasts', 'day']
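
Each pair in the bag-of-words output is `(token_id, count)`: the first document has six distinct tokens, each appearing once. The pure-Python sketch below mimics the idea behind the dictionary and `doc2bow` so the structure is easier to see. It is a simplified illustration, not Gensim's implementation. The helpers `build_vocab` and `to_bow` are hypothetical, and the IDs they assign need not match the ones Gensim picks.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign each distinct token an integer ID in first-seen order."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

sample_tokens = ['battery', 'life', 'phone', 'amazing', 'lasts', 'day']
vocab = build_vocab([sample_tokens])

bow_sample = to_bow(sample_tokens, vocab)
# A made-up document with a repeated token and an out-of-vocabulary token
bow_repeat = to_bow(['battery', 'battery', 'phone', 'charger'], vocab)

print(bow_sample)
print(bow_repeat)
```

In the second call the repeated token gets a count of 2, and the token missing from the vocabulary is dropped. `doc2bow` likewise ignores unseen tokens by default.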


#🧠 Step 5: Train LDA Topic Model

In [6]:
# Train LDA model
lda_model = models.LdaModel(corpus=corpus,
                            id2word=dictionary,
                            num_topics=3,        # You can change to 4 or 5
                            random_state=42,
                            passes=10,
                            alpha='auto')

# Print discovered topics
for i, topic in lda_model.print_topics(num_words=5):
    print(f"\n🔹 Topic {i+1}: {topic}")



🔹 Topic 1: 0.035*"performance" + 0.035*"quality" + 0.020*"experience" + 0.020*"great" + 0.020*"two"

🔹 Topic 2: 0.038*"quickly" + 0.022*"great" + 0.022*"quality" + 0.022*"service" + 0.022*"customer"

🔹 Topic 3: 0.036*"great" + 0.021*"fast" + 0.021*"device" + 0.021*"customer" + 0.021*"tablet"


#🔍 Step 6: Keyword Extraction from Topics

In [7]:
# Display the top 10 keywords for each topic
print("\n📌 Keywords per Topic:\n")
topics = lda_model.show_topics(num_words=10, formatted=False)
for i, topic in topics:
    # Each topic is a list of (word, probability) pairs; keep only the words
    keywords = [word for word, _ in topic]
    print(f"Topic {i+1}: {', '.join(keywords)}")



📌 Keywords per Topic:

Topic 1: performance, quality, experience, great, two, cracked, days, screen, life, many
Topic 2: quickly, great, quality, service, customer, becomes, limited, popular, lacks, photography
Topic 3: great, fast, device, customer, tablet, light, experience, time, high, work
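
Several keywords in the output above appear in more than one topic ("great" is in all three), which makes the topics harder to tell apart. The standard-library sketch below separates shared keywords from those unique to one topic. The keyword lists are copied from the output above; `topic_keywords`, `distinctive`, and `shared` are illustrative names, and this filtering is a simple heuristic rather than a feature of Gensim.

```python
from collections import Counter

# Keyword lists copied from the "Keywords per Topic" output
topic_keywords = {
    1: ["performance", "quality", "experience", "great", "two",
        "cracked", "days", "screen", "life", "many"],
    2: ["quickly", "great", "quality", "service", "customer",
        "becomes", "limited", "popular", "lacks", "photography"],
    3: ["great", "fast", "device", "customer", "tablet",
        "light", "experience", "time", "high", "work"],
}

# Count in how many topics each keyword appears
topic_count = Counter(
    word for words in topic_keywords.values() for word in set(words)
)

# Keywords found in exactly one topic, in their original order
distinctive = {
    topic: [w for w in words if topic_count[w] == 1]
    for topic, words in topic_keywords.items()
}
# Keywords found in two or more topics
shared = sorted(w for w, n in topic_count.items() if n > 1)

print("Shared:", ", ".join(shared))
for topic, words in distinctive.items():
    print(f"Topic {topic}: {', '.join(words)}")
```

Here the shared keywords are "customer", "experience", "great", and "quality". Heavy overlap like this is common on very small corpora and often suggests trying fewer topics or more data.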


#📊 Extra: Interactive Topic Visualization with pyLDAvis


In [9]:
# pyLDAvis is an interactive visualization tool for topic models such as LDA
!pip install pyLDAvis

# gensim_models converts a Gensim LDA model into the format pyLDAvis renders
import pyLDAvis.gensim_models

# Render the interactive plot inline in Jupyter or a similar notebook environment
pyLDAvis.enable_notebook()

# Prepare the visualization data from:
# - lda_model:  the trained LDA model (holds the topic-word distributions)
# - corpus:     the bag-of-words corpus (word counts for each document)
# - dictionary: the Gensim Dictionary mapping each word to its integer ID
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)

# Leaving the object as the last expression in the cell displays it,
# letting you explore topics, their terms, and their relationships
vis

