# 🧠 Topic Modeling using OpenAI Embeddings and NMF

This notebook demonstrates a modern unsupervised topic modeling pipeline:

- ✅ Embedding texts using OpenAI's `text-embedding-3-small` model
- ✅ Clustering them into topics using NMF (Non-negative Matrix Factorization)
- ✅ Interpreting topics with two methods:
    - TF-IDF keyword extraction
    - ChatGPT-based summarization

This approach provides semantic topic clusters with human-readable interpretations.
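
As a rough illustration of the factorization step, the sketch below runs scikit-learn's `NMF` on a small synthetic non-negative matrix and reads off each row's dominant topic. The random data stands in for document features (it is not real embeddings), and the matrix size and `n_topics = 3` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic stand-in for a non-negative document-feature matrix
# (assumption: 10 "documents" with 8 features each)
rng = np.random.default_rng(0)
V = rng.random((10, 8))

n_topics = 3
model = NMF(n_components=n_topics, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(V)   # (10, 3): weight of each topic in each document
H = model.components_        # (3, 8): weight of each feature in each topic

# V is approximated by the product W @ H, with W and H both non-negative
dominant_topic = W.argmax(axis=1)
print(W.shape, H.shape)
print(dominant_topic)
```

Each row of `W` describes one document as a mix of topics, which is why taking the `argmax` per row gives a single topic label.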

In [None]:
import os

import openai
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF, PCA
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import TfidfVectorizer

# Set your OpenAI API key; reading it from the OPENAI_API_KEY environment
# variable avoids hardcoding a secret in the notebook
openai.api_key = os.environ.get("OPENAI_API_KEY", "YOUR_API_KEY")

# Sample dataset
data = pd.DataFrame({
    'text': [
        "The economy is facing challenges due to inflation and interest rate hikes.",
        "Stock markets have been volatile amid recession fears.",
        "Investors are looking for safe haven assets like gold and bonds.",
        "Machine learning algorithms are improving rapidly.",
        "Neural networks are widely used in image recognition tasks.",
        "AI is transforming the healthcare industry.",
        "The new iPhone was released with better battery and camera.",
        "Samsung's latest phone features a foldable screen.",
        "Tech gadgets are popular gifts during the holiday season.",
        "Scientists discovered a new exoplanet in the habitable zone."
    ]
})

## Step 1: Generate OpenAI Embeddings

In [None]:
def get_embedding(text, model="text-embedding-3-small"):
    """Get text embedding from OpenAI's API"""
    response = openai.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

# Generate embeddings for all texts
embedding_list = [get_embedding(text) for text in data['text']]

X = np.array(embedding_list)  # Shape: (num_documents, 1536)
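
The embeddings endpoint also accepts a list of strings in a single request, which avoids one network round trip per document. A minimal sketch, assuming the `openai>=1.0` client interface; the helper name `embed_batch` and the `batch_size` default are illustrative and not part of the library:

```python
import numpy as np

def embed_batch(client, texts, model="text-embedding-3-small", batch_size=100):
    """Embed a list of texts in chunks (hypothetical helper, not an OpenAI API)."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        response = client.embeddings.create(input=chunk, model=model)
        # Sort by the returned index so vectors line up with the input order
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return np.array(vectors)

# Usage (requires an API key and network access):
# from openai import OpenAI
# X = embed_batch(OpenAI(), data['text'].tolist())
```

Passing the client in as an argument keeps the helper easy to exercise with a stub object when no API key is available.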

## Step 2: Dimensionality Reduction (PCA) + Topic Modeling (NMF)
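
Two constraints apply to this step: PCA can return at most `min(n_samples, n_features)` components, and scikit-learn's `NMF` rejects negative input, whereas PCA scores are centered around zero. The sketch below illustrates both on random stand-in data (not real embeddings) and shows one possible workaround: cap the component count, then shift each component so its minimum is zero. The shift is an assumption chosen for simplicity; it changes the geometry of the data, so other options may suit real data better.

```python
import numpy as np
from sklearn.decomposition import NMF, PCA

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10, 50))   # stand-in for 10 embedding vectors

# PCA accepts at most min(n_samples, n_features) components
n_components = min(100, *X_demo.shape)
X_pca = PCA(n_components=n_components, random_state=42).fit_transform(X_demo)
has_negative = bool((X_pca < 0).any())   # PCA scores are centered, so this is True

# Passing X_pca to NMF directly raises a ValueError about negative values
try:
    NMF(n_components=3, random_state=42).fit(X_pca)
    nmf_rejected = False
except ValueError:
    nmf_rejected = True

# Shift every component so its minimum is 0; NMF then accepts the matrix
X_shifted = X_pca - X_pca.min(axis=0)
W_demo = NMF(n_components=3, random_state=42, max_iter=500).fit_transform(X_shifted)
print(n_components, has_negative, nmf_rejected, W_demo.shape)
```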

In [None]:
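# Reduce dimensionality; PCA allows at most min(n_samples, n_features)
# components, so cap the target of 100 for small datasets like this one
n_components = min(100, X.shape[0], X.shape[1])
X_reduced = PCA(n_components=n_components, random_state=42).fit_transform(X)

# NMF requires non-negative input, but PCA output is centered around zero,
# so shift each component so that its minimum is 0
X_nonneg = X_reduced - X_reduced.min(axis=0)

# Perform NMF topic modeling
n_topics = 3
nmf = NMF(n_components=n_topics, random_state=42, max_iter=500)
W = nmf.fit_transform(X_nonneg)  # Document-topic matrix
H = nmf.components_              # Topic-feature matrix

# Assign dominant topic to each document
topic_assignments = np.argmax(W, axis=1)
data['topic'] = topic_assignments

# Display documents with their assigned topics
data[['text', 'topic']]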

## Step 3a: Keyword Extraction per Topic (TF-IDF + NMF)

In [None]:
print("=== TF-IDF + NMF Keyword Summaries ===")
n_keywords = 5

for topic_id in sorted(data['topic'].unique()):
    # Get all documents for current topic
    topic_texts = data[data['topic'] == topic_id]['text'].tolist()
    print(f"\n--- Topic {topic_id} ---")

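    # Compute TF-IDF; max_df=1.0 keeps terms shared by all documents in the
    # topic and also works when a topic holds a single document
    vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, stop_words='english')
    tfidf = vectorizer.fit_transform(topic_texts)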

    # Apply NMF to find keywords
    nmf_sub = NMF(n_components=1, random_state=0)
    W_sub = nmf_sub.fit_transform(tfidf)
    H_sub = nmf_sub.components_

    # Extract top keywords
    feature_names = vectorizer.get_feature_names_out()
    top_keywords = [feature_names[i] for i in H_sub[0].argsort()[::-1][:n_keywords]]
    print("Top keywords:", ", ".join(top_keywords))

## Step 3b: Topic Summarization via ChatGPT

In [None]:
print("=== ChatGPT-based Topic Summaries ===")
for topic_id in sorted(data['topic'].unique()):
    # Get example documents for the topic
    examples = data[data['topic'] == topic_id]['text'].tolist()
    
    # Create prompt for ChatGPT
    prompt = "Summarize the main theme of these documents:\n\n" + "\n\n".join(examples)

    # Get summary from ChatGPT (chat completions endpoint of the openai>=1.0 client)
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

    summary = response.choices[0].message.content
    print(f"\n--- Topic {topic_id} Summary ---\n{summary}")

## ✅ Key Takeaways

This notebook demonstrated:

- How to use OpenAI embeddings for semantic text representation
- Topic clustering using NMF on PCA-reduced embeddings
- Two complementary interpretation methods:
    - Traditional TF-IDF keyword analysis
    - LLM-powered natural language summaries

### Advantages of This Approach:
1. **Semantic Understanding**: Embeddings capture meaning, so documents can group together even when they share few words
2. **Flexible**: Suits short texts, where word-count methods have little signal to work with
3. **Interpretable**: Provides both keywords and natural language explanations
4. **Hybrid**: Combines traditional ML (PCA, NMF, TF-IDF) with LLM embeddings and summaries

### Next Steps:
- Experiment with different numbers of topics
- Try alternative dimensionality reduction methods (e.g., UMAP)
- Explore other topic modeling algorithms (e.g., BERTopic)
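
For the first suggestion, one simple and admittedly coarse way to compare topic counts is to fit NMF for several values and inspect the reconstruction error that scikit-learn reports. The sketch below uses random non-negative data as a stand-in for the document matrix passed to NMF, and the candidate range of 2 to 6 is arbitrary. Error tends to fall as topics are added, so look for the point of diminishing returns rather than the minimum.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X_demo = rng.random((30, 12))   # stand-in for a non-negative document matrix

errors = {}
for k in range(2, 7):
    model = NMF(n_components=k, init="nndsvda", random_state=0, max_iter=1000)
    model.fit(X_demo)
    # Frobenius norm of X_demo - W @ H for the fitted model
    errors[k] = model.reconstruction_err_

for k, err in errors.items():
    print(f"n_topics={k}: reconstruction error {err:.3f}")
```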