
# Text Clustering and Topic Interpretation using BERTopic

In [4]:
# Step 1: Install dependencies
!pip install bertopic scikit-learn pandas plotly -q

In [5]:

# Step 2: Import required libraries
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import numpy as np

# Step 3: Load sample data
print("Loading dataset...")
data = fetch_20newsgroups(subset='all')['data'][:1000]  # Limit to 1000 for speed
print(f"Loaded {len(data)} documents.")

# Step 4: Fit BERTopic model
print("Fitting BERTopic model...")
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(data)
print("BERTopic model fitted.")

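# Step 5: Check for NaN topic embeddings
# Guard against the attribute being unset (None) before testing for NaN.
embeddings = topic_model.topic_embeddings_
if embeddings is not None and np.any(np.isnan(embeddings)):
    print("Warning: NaN values found in topic embeddings.")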

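# Step 6: Safe visualization
# visualize_topics() can raise a "zero-size array" ValueError when the model
# finds very few topics (the recorded run found only topics 0 and 1), so that
# case is reported instead of aborting the rest of the notebook.
try:
    print("Generating topic visualization...")
    fig = topic_model.visualize_topics()
    fig.show()
    fig.write_html("bertopic_topics.html")
    print("Visualization saved to 'bertopic_topics.html'")
except ValueError as e:
    if "zero-size array" in str(e):
        print("Error: Empty topic embeddings encountered.")
    else:
        raise  # re-raise unrelated errors with their original traceback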

# Step 7: Save results to CSV
df = pd.DataFrame({
    "document": data,
    "topic": topics,
    "probability": probs
})
df.to_csv("bertopic_results.csv", index=False)
print("Results saved to 'bertopic_results.csv'")

# Finished
print("All steps completed successfully.")


Loading dataset...
Loaded 1000 documents.
Fitting BERTopic model...

2025-04-30 08:16:04,002 - BERTopic - Embedding - Transforming documents to embeddings.
2025-04-30 08:17:34,989 - BERTopic - Embedding - Completed ✓
2025-04-30 08:17:34,990 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-04-30 08:17:36,796 - BERTopic - Dimensionality - Completed ✓
2025-04-30 08:17:36,797 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-04-30 08:17:36,824 - BERTopic - Cluster - Completed ✓
2025-04-30 08:17:36,828 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-04-30 08:17:37,055 - BERTopic - Representation - Completed ✓


BERTopic model fitted.
Generating topic visualization...
Error: Empty topic embeddings encountered.
Results saved to 'bertopic_results.csv'
All steps completed successfully.
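
The run above took the error branch at the visualization step, and the saved results contain only topic ids 0 and 1. A minimal sketch of one way to make that case explicit is to count the distinct non-outlier topics before plotting and skip the plot when there are too few. The helper names and the threshold of 3 are assumptions for illustration, not a documented BERTopic limit; `-1` is the label BERTopic assigns to outlier documents.

```python
# Hypothetical threshold: an assumption for this sketch, not a BERTopic constant.
MIN_TOPICS_FOR_PLOT = 3


def count_real_topics(topics):
    """Count distinct topic ids, ignoring the outlier label -1."""
    return len({t for t in topics if t != -1})


def should_plot(topics, minimum=MIN_TOPICS_FOR_PLOT):
    """Return True when there are enough non-outlier topics to attempt a plot."""
    return count_real_topics(topics) >= minimum


# Stand-in assignments that mimic a two-topic result (made-up values).
example_topics = [0, 0, 0, 1, 0, 1]
print(count_real_topics(example_topics))
print(should_plot(example_topics))
```

In the notebook, `should_plot(topics)` could gate the call to `topic_model.visualize_topics()`, so the skipped plot is reported as "too few topics" rather than as an embedding error.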


In [6]:
import pandas as pd

# Reload the saved results. This rebinds `data` from the list of raw documents to a DataFrame.
data = pd.read_csv("/content/bertopic_results.csv")
data.head()

|  | document | topic | probability |
|---|---|---|---|
| 0 | From: Mamatha Devineni Ratnam <mr47+@andrew.cm... | 1 | 1.0 |
| 1 | From: mblawson@midway.ecn.uoknor.edu (Matthew ... | 0 | 1.0 |
| 2 | From: hilmi-er@dsv.su.se (Hilmi Eren)\nSubject... | 0 | 1.0 |
| 3 | From: guyd@austin.ibm.com (Guy Dawson)\nSubjec... | 0 | 1.0 |
| 4 | From: Alexander Samuel McDiarmid <am2o+@andrew... | 0 | 1.0 |


In [7]:
data.tail()

|  | document | topic | probability |
|---|---|---|---|
| 995 | From: s872505@minyos.xx.rmit.OZ.AU (Stephen Bo... | 0 | 1.0 |
| 996 | From: astein@nysernet.org (Alan Stein)\nSubjec... | 0 | 1.0 |
| 997 | From: francis@ircam.fr (Joseph Francis)\nSubje... | 0 | 1.0 |
| 998 | From: carolan@owlnet.rice.edu (Bryan Carolan D... | 0 | 1.0 |
| 999 | From: Wayne Alan Martin <wm1h+@andrew.cmu.edu>... | 0 | 1.0 |


In [8]:
data.describe()

|  | topic | probability |
|---|---|---|
| count | 1000.0 | 1000.0 |
| mean | 0.106 | 0.993353 |
| std | 0.307992 | 0.041088 |
| min | 0.0 | 0.411385 |
| 25% | 0.0 | 1.0 |
| 50% | 0.0 | 1.0 |
| 75% | 0.0 | 1.0 |
| max | 1.0 | 1.0 |
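
Because `topic` is an integer column whose minimum is 0 and maximum is 1, its mean is simply the share of documents assigned to topic 1: 0.106 × 1000 = 106 documents, which leaves 894 in topic 0. The sketch below checks that arithmetic on a stand-in column rebuilt from those counts; `topic_column` is a hypothetical reconstruction from the summary statistics, not the notebook's actual data.

```python
import pandas as pd

# Stand-in for data["topic"], rebuilt from the summary statistics (assumption).
topic_column = pd.Series([0] * 894 + [1] * 106, name="topic")

# Documents per topic, ordered by topic id.
counts = topic_column.value_counts().sort_index()
print(counts)

# Mean and sample standard deviation, as describe() would report them.
print(round(topic_column.mean(), 3))  # 0.106
print(round(topic_column.std(), 6))   # 0.307992
```

On the real results, `data["topic"].value_counts()` would give the same per-topic counts directly and shows at a glance how unbalanced the two clusters are.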


In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   document     1000 non-null   object 
 1   topic        1000 non-null   int64  
 2   probability  1000 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 23.6+ KB
