
# Topic Modeling: Organizing Unlabeled CVs with LDA

## Overview

This notebook demonstrates **Topic Modeling** using **Latent Dirichlet Allocation (LDA)** to organize unlabeled CVs (resumes) by automatically discovering hidden topics. Unlike supervised classification, topic modeling works with completely unlabeled data, making it ideal for organizing large document collections without manual labeling. You'll learn how to apply LDA to discover topics, interpret results, and organize documents based on their dominant topics.

> "The best way to find a needle in a haystack is to organize the haystack first."

**The Problem**: You have a folder full of CVs—unlabeled, unorganized. You need to find candidates for specific roles, but manually reading through hundreds of CVs is impossible.

## Objectives

- Understand what Topic Modeling is and why it's useful for unsupervised document organization
- Learn how LDA (Latent Dirichlet Allocation) discovers hidden topics in text collections
- Apply LDA to organize unlabeled documents automatically
- Interpret topic modeling results by examining top words and document-topic distributions
- Organize documents into folders based on their dominant topics

## Outline

1. **Introduction to Topic Modeling** - What it is and why it's useful
2. **What is LDA?** - Understanding Latent Dirichlet Allocation
3. **The Pipeline** - Complete workflow from data loading to organization
4. **Step 1: Loading Data** - Reading CVs from JSON files
5. **Step 2: Preprocessing** - Cleaning and preparing text
6. **Step 3: Vectorization** - Converting text to document-term matrix
7. **Step 4: Training LDA** - Discovering topics automatically
8. **Step 5: Analyzing Results** - Interpreting discovered topics
9. **Step 6: Organizing Documents** - Creating folders and organizing CVs by topic

## Topic Modeling

**Topic Modeling** is an **unsupervised learning** task that discovers hidden topics in a collection of unlabeled documents. Unlike classification (which requires labeled data), topic modeling finds patterns automatically.

**Example applications:**
- **Organizing unlabeled documents**: Group CVs by field (AI/ML, Data Analysis, etc.) without manual labeling
- **Understanding large text collections**: Discover what themes exist in news archives, research papers, or social media
- **Content recommendation**: Find documents similar to a given document based on topic similarity
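
As a rough sketch of the content-recommendation idea, documents can be compared by the cosine similarity of their topic mixtures. The `doc_topic` array and the `most_similar` helper below are hypothetical, made up for illustration; in practice the rows would come from a fitted topic model.

```python
import numpy as np

# Hypothetical document-topic mixtures (each row sums to 1).
doc_topic = np.array([
    [0.80, 0.15, 0.05],  # doc 0
    [0.75, 0.20, 0.05],  # doc 1
    [0.10, 0.10, 0.80],  # doc 2
    [0.05, 0.85, 0.10],  # doc 3
])


def most_similar(doc_topic, query_idx, top_k=2):
    """Rank the other documents by cosine similarity of their topic mixtures."""
    norms = np.linalg.norm(doc_topic, axis=1)
    sims = doc_topic @ doc_topic[query_idx] / (norms * norms[query_idx])
    sims[query_idx] = -np.inf  # exclude the query document itself
    order = np.argsort(sims)[::-1][:top_k]
    return [(int(i), float(sims[i])) for i in order]


recommendations = most_similar(doc_topic, query_idx=0)
print(recommendations)  # doc 1 ranks first, since its mixture is closest to doc 0
```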

**Why it's useful:**
- No labels needed: works with completely unlabeled data
- Interpretable: topics are defined by their top words, making them understandable
- Scalable: can process large document collections
- Flexible: number of topics can be adjusted based on the corpus

## What is LDA?

**Latent Dirichlet Allocation (LDA)** is a probabilistic model that discovers hidden topics in a collection of documents.

**Key idea**:
- Each document is a **mixture of topics** (e.g., 70% AI/ML, 20% Data Analysis, 10% Software Engineering)
- Each topic is a **distribution over words** (e.g., Topic 1: 30% "PyTorch", 25% "TensorFlow", 20% "NLP"...)
- LDA discovers these topics automatically by finding words that tend to co-occur in the same documents

**For our CVs**: LDA will discover topics like "AI/ML", "Data Analysis", "Big Data" by looking at which words appear together, then assign each CV to the most relevant topic(s).
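
One way to make the two "mixture" ideas concrete is to simulate LDA's generative story with NumPy. This is a minimal sketch: the vocabulary, the number of topics, the document length, and the Dirichlet parameters are all assumptions chosen for illustration. Fitting LDA runs this story in reverse, inferring the distributions from the observed words.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and sizes, for illustration only.
vocab = ["pytorch", "tensorflow", "nlp", "tableau", "dashboard", "sql"]
n_topics, doc_length = 2, 8

# Each topic is a distribution over the vocabulary (each row sums to 1).
topic_word = rng.dirichlet(alpha=np.full(len(vocab), 0.5), size=n_topics)

# Each document is a mixture of topics (sums to 1).
doc_topic = rng.dirichlet(alpha=np.full(n_topics, 0.5))

# Generate one document: draw a topic for each word position,
# then draw a word from that topic's distribution.
topics_drawn = rng.choice(n_topics, size=doc_length, p=doc_topic)
words = [vocab[rng.choice(len(vocab), p=topic_word[z])] for z in topics_drawn]

print("Topic mixture:", np.round(doc_topic, 2))
print("Generated document:", " ".join(words))
```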

**Reference**: Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). [Latent Dirichlet Allocation](https://dl.acm.org/doi/10.5555/944919.944937). *Journal of Machine Learning Research*, 3(Jan), 993-1022.

![Left: BoW. Right: LDA](../assets/lda.png)

## The Pipeline

1. **Load CVs**: Read all JSON files from topic folders using glob patterns and extract structured fields
2. **Preprocess**: Clean the text (remove URLs, emails, etc.)
3. **Vectorize**: Convert text to document-term matrix (Bag of Words)
4. **Train LDA**: Discover topics automatically
5. **Analyze Results**: See what topics were found and which CVs belong to each
6. **Organize**: Create folders and copy CVs based on their dominant topic
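
Steps 3 to 5 of the pipeline can be sketched end to end on a toy corpus with scikit-learn's `make_pipeline`. The four `toy_cvs` strings are invented for illustration, and chaining the steps in a pipeline is one possible arrangement, not the structure the notebook itself uses.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical miniature "CVs", already cleaned (step 2).
toy_cvs = [
    "pytorch tensorflow nlp deep learning",
    "tableau dashboard sql reporting excel",
    "pytorch nlp transformers deep learning",
    "sql tableau dashboard excel charts",
]

# Step 3 (vectorize) and step 4 (train LDA) chained together.
pipeline = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
)

# Step 5: one row per document, one column per topic.
doc_topic = pipeline.fit_transform(toy_cvs)
dominant = doc_topic.argmax(axis=1) + 1  # 1-based topic numbers

for cv, topic in zip(toy_cvs, dominant):
    print(f"Topic {topic}: {cv}")
```

With so few documents the split between topics is not guaranteed to be clean; the point is only the shape of the workflow.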

## Step 1: Loading Data

In [1]:
# %pip install numpy==1.26.4 pandas==2.3.3 scikit-learn==1.8.0 --quiet

In [2]:
# Standard library imports
import json
import re
import shutil
from pathlib import Path

# Third-party imports
import numpy as np
import pandas as pd

# Machine Learning
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

print("Libraries imported successfully!")

Libraries imported successfully!


In [3]:
# Load CVs from JSON files in all topic folders
cv_dir = Path('../datasets/CVs')
# Use glob pattern to find all JSON files in Topic_* subdirectories, excluding English versions
cv_files = sorted([f for f in cv_dir.glob('Topic_*/*.json') if not f.name.endswith('_en.json')])

# Load and extract structured data from JSON
cvs_data = []
cv_names = []
cv_file_paths = []  # Store original file paths for later copying

for file in cv_files:
    with open(file, 'r', encoding='utf-8') as f:
        data = json.load(f)
        cvs_data.append(data)
        cv_names.append(file.stem)
        cv_file_paths.append(file)  # Store the full path

print(f"Loaded {len(cvs_data)} CV files from {len(set(f.parent.name for f in cv_files))} topic folders:")
for i, name in enumerate(cv_names, 1):
    print(f"  {i}. {name}")

# Combine structured fields into text for each CV
def combine_cv_fields(cv_json):
    """Combine Heading, Skills, Projects, Experience, Education into a single text"""
    parts = []

    # Add heading
    if 'Heading' in cv_json:
        parts.append(cv_json['Heading'])

    # Add skills (join list items)
    if 'Skills' in cv_json:
        skills_text = ' '.join(cv_json['Skills']) if isinstance(cv_json['Skills'], list) else cv_json['Skills']
        parts.append(skills_text)

    # Add projects
    if 'Projects' in cv_json:
        projects_text = ' '.join(cv_json['Projects']) if isinstance(cv_json['Projects'], list) else cv_json['Projects']
        parts.append(projects_text)

    # Add experience
    if 'Experience' in cv_json:
        exp_text = ' '.join(cv_json['Experience']) if isinstance(cv_json['Experience'], list) else cv_json['Experience']
        parts.append(exp_text)

    # Add education
    if 'Education' in cv_json:
        edu_text = ' '.join(cv_json['Education']) if isinstance(cv_json['Education'], list) else cv_json['Education']
        parts.append(edu_text)

    return ' '.join(parts)

# Convert JSON data to text
cvs = [combine_cv_fields(cv_data) for cv_data in cvs_data]
print(f"\nCombined structured data into text for {len(cvs)} CVs")

Loaded 0 CV files from 0 topic folders:

Combined structured data into text for 0 CVs


In [6]:
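# No local CV files were found in this run (0 files loaded above), so the
# 20 Newsgroups corpus stands in for the CVs from here on.
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(
    subset="train",
    remove=("headers", "footers", "quotes")
)

# Reuse the name `cvs` so the rest of the notebook runs unchanged
cvs = data.data
print("Loaded CVs:", len(cvs))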


Loaded CVs: 11314


## Step 2: Data Preprocessing

In [7]:
def preprocess_text(text):
    """Clean text: remove emails, URLs, and punctuation, then normalize whitespace"""
    # Remove emails and URLs
    text = re.sub(r'\S+@\S+', '', text)
    text = re.sub(r'http\S+', '', text)
    # Replace punctuation with spaces; \w keeps letters, digits, and underscores,
    # and \u0600-\u06FF keeps the rest of the Arabic block
    text = re.sub(r'[^\w\s\u0600-\u06FF]', ' ', text)
    # Normalize whitespace last so the spaces added above are collapsed too
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Preprocess all CVs
cvs_processed = [preprocess_text(cv) for cv in cvs]
print(f"Preprocessed {len(cvs_processed)} CVs")

Preprocessed 11314 CVs


## Step 3: Prepare Data for LDA

Convert the text to a document-term matrix of word counts: one row per document, one column per word. This is the same Bag of Words representation used for classification.

In [8]:
# Create document-term matrix
vectorizer = CountVectorizer(
    max_features=1000,  # Top 1000 words
    min_df=2,           # Word must appear in at least 2 CVs
    max_df=0.8          # Ignore words in >80% of CVs
)

doc_term_matrix = vectorizer.fit_transform(cvs_processed)
feature_names = vectorizer.get_feature_names_out()

print(f"Document-Term Matrix: {doc_term_matrix.shape[0]} CVs × {doc_term_matrix.shape[1]} words")
print(f"Sparsity: {(1 - doc_term_matrix.nnz / (doc_term_matrix.shape[0] * doc_term_matrix.shape[1])) * 100:.1f}%")

Document-Term Matrix: 11314 CVs × 1000 words
Sparsity: 94.8%


## Step 4: Train LDA Model

In [9]:
# Train LDA model
n_topics = 3  # Number of topics to discover

lda = LatentDirichletAllocation(
    n_components=n_topics,
    random_state=42,
    max_iter=10,
    learning_method='online'
)

print(f"Training LDA to discover {n_topics} topics...")
lda.fit(doc_term_matrix)
print("✓ Training complete!")

Training LDA to discover 3 topics...
✓ Training complete!


## Step 5: Analyze Results

Let's see what topics LDA discovered and which words define each topic.

In [11]:
# Display top words for each topic
def display_topics(model, feature_names, n_top_words=10):
    """Display top words for each topic"""
    for topic_idx, topic in enumerate(model.components_):
        top_words_idx = topic.argsort()[-n_top_words:][::-1]
        top_words = [feature_names[i] for i in top_words_idx]
        top_weights = [topic[i] for i in top_words_idx]

        print(f"\nTopic {topic_idx + 1}:")
        print("  Top words:", ", ".join(top_words))
        print("  Weights:", [f"{w:.3f}" for w in top_weights])

display_topics(lda, feature_names, n_top_words=10)


Topic 1:
  Top words: ax, max, g9v, b8f, a86, pl, 145, 1d9, cx, 1t
  Weights: ['56280.154', '4013.534', '1123.403', '896.450', '759.790', '613.741', '531.368', '517.216', '380.701', '373.113']

Topic 2:
  Top words: to, of, and, that, in, is, it, you, not, for
  Weights: ['40708.993', '36845.601', '29997.563', '26124.782', '23729.360', '22518.139', '19425.964', '15277.955', '11041.749', '10962.718']

Topic 3:
  Top words: and, to, of, for, is, in, on, with, or, it
  Weights: ['12324.849', '12195.068', '9735.657', '8931.578', '7813.729', '7234.284', '5052.329', '4468.024', '4107.881', '4089.796']


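**Interpreting the topics**: Look at the top words for each topic. Can you guess what each topic represents? In the run above, Topics 2 and 3 are dominated by stop words ("to", "of", "and") because the `CountVectorizer` was created without a `stop_words` setting, and Topic 1 collects tokens such as "ax" and "g9v" that do not look like ordinary words, so none of the three is easy to label. On a CV corpus with stop words removed, you would hope to see topics like these instead: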
- Topic with "PyTorch", "TensorFlow", "NLP" → probably AI/ML
- Topic with "Tableau", "Power BI", "dashboard" → probably Data Analysis
- Topic with "Hadoop", "Spark", "Kafka" → probably Big Data
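
The weights printed with each topic are not probabilities. In scikit-learn, `components_` holds unnormalized pseudo-counts, and dividing each row by its sum gives the topic's distribution over words. The sketch below uses a small hypothetical `components` array as a stand-in for `lda.components_`.

```python
import numpy as np

# Hypothetical stand-in for lda.components_ (2 topics x 3 words).
components = np.array([
    [8.0, 1.0, 1.0],
    [1.0, 3.0, 6.0],
])

# Normalize each row so it sums to 1: a distribution over words per topic.
topic_word_probs = components / components.sum(axis=1, keepdims=True)

print(topic_word_probs)
```

After normalization, weights are comparable across topics, which the raw pseudo-counts are not.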

Now let's see which CV belongs to which topic:

In [13]:
# Make cv_names match the number of documents used in doc_term_matrix
cv_names = [f"CV_{i+1}" for i in range(doc_term_matrix.shape[0])]
print("cv_names:", len(cv_names), "| docs:", doc_term_matrix.shape[0])


cv_names: 11314 | docs: 11314


In [14]:
# Get topic distribution for each CV
doc_topic_dist = lda.transform(doc_term_matrix)

# Find dominant topic for each CV
dominant_topics = doc_topic_dist.argmax(axis=1)

# Create a DataFrame to see results
df_results = pd.DataFrame({
    'CV': cv_names,
    'Dominant Topic': dominant_topics + 1,
    'Topic Probabilities': [dist for dist in doc_topic_dist]
})

# Show which CVs belong to which topic
print("CV Assignment to Topics:")
print("=" * 60)
for topic_id in range(n_topics):
    topic_cvs = df_results[df_results['Dominant Topic'] == topic_id + 1]
    print(f"\nTopic {topic_id + 1} ({len(topic_cvs)} CVs):")
    for idx, row in topic_cvs.iterrows():
        prob = row['Topic Probabilities'][topic_id]
        print(f"  - {row['CV']} ({prob:.1%})")

Streaming output truncated to the last 5000 lines.
  - CV_8551 (96.3%)
  - CV_8552 (67.2%)
  - CV_8556 (86.6%)
  - CV_8557 (58.4%)
  - CV_8558 (69.7%)
  - CV_8559 (97.5%)
  - CV_8561 (59.2%)
  - CV_8563 (97.7%)
  - CV_8568 (90.9%)
  - CV_8569 (98.9%)
  - CV_8571 (80.0%)
  - CV_8574 (99.0%)
  - CV_8575 (67.3%)
  - CV_8578 (99.3%)
  - CV_8580 (51.8%)
  - CV_8583 (98.3%)
  - CV_8584 (84.2%)
  - CV_8586 (99.5%)
  - CV_8588 (99.3%)
  - CV_8590 (61.4%)
  - CV_8591 (92.4%)
  - CV_8592 (96.2%)
  - CV_8596 (74.4%)
  - CV_8597 (91.2%)
  - CV_8598 (98.3%)
  - CV_8599 (98.3%)
  - CV_8600 (99.3%)
  - CV_8601 (53.6%)
  - CV_8602 (82.1%)
  - CV_8604 (90.8%)
  - CV_8605 (98.1%)
  - CV_8609 (81.8%)
  - CV_8610 (97.2%)
  - CV_8611 (66.4%)
  - CV_8613 (90.4%)
  - CV_8614 (62.9%)
  - CV_8615 (52.2%)
  - CV_8616 (96.8%)
  - CV_8617 (97.8%)
  - CV_8618 (65.4%)
  - CV_8619 (82.1%)
  - CV_8620 (97.6%)
  - CV_8621 (99.4%)
  - CV_8622 (99.0%)
  - CV_8624 (83.7%)
  - CV_8625 (59.2%)
  - CV_8629 (61
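
Assigning only the dominant topic discards information for documents whose probability is split across topics (several entries above sit near 50 to 60%). A minimal sketch of one alternative is to keep every topic whose probability clears a threshold. The `doc_topic_dist` values and the `0.30` threshold below are assumptions for illustration, not values from the run above.

```python
import numpy as np

# Hypothetical document-topic probabilities (rows sum to 1).
doc_topic_dist = np.array([
    [0.96, 0.03, 0.01],  # clearly one topic
    [0.52, 0.45, 0.03],  # split between two topics
    [0.34, 0.33, 0.33],  # no clear topic
])

threshold = 0.30  # assumed cut-off; tune for your corpus

# For each document, list every (1-based) topic at or above the threshold.
assignments = [
    (np.flatnonzero(row >= threshold) + 1).tolist()
    for row in doc_topic_dist
]

for i, topics in enumerate(assignments, 1):
    print(f"Doc {i}: topics {topics}")
```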

## Step 6: Organize CVs into Folders

Now comes the practical part: **automatically organize CVs into folders** based on their dominant topic!

In [15]:
# Create output directory structure
output_dir = Path('output/organized_cvs')
output_dir.mkdir(parents=True, exist_ok=True)

# Create a folder for each topic
for topic_id in range(n_topics):
    topic_dir = output_dir / f"Topic_{topic_id + 1}"
    topic_dir.mkdir(exist_ok=True)

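# Copy each CV to its topic folder.
# Note: zip() stops at its shortest input, so nothing is copied when
# cv_file_paths is empty (for example, when Step 1 found no local CV files).
for cv_name, topic_id, source_file in zip(cv_names, dominant_topics, cv_file_paths):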
    target_dir = output_dir / f"Topic_{topic_id + 1}"
    target_file = target_dir / f"{cv_name}.json"

    shutil.copy2(source_file, target_file)
    print(f"Copied {cv_name}.json → Topic_{topic_id + 1}/")

print(f"\n✓ Organization complete! CVs are now in: {output_dir}")


✓ Organization complete! CVs are now in: output/organized_cvs


### Verify the Organization

Let's check what's in each folder:

In [16]:
# Show contents of each topic folder
for topic_id in range(n_topics):
    topic_dir = output_dir / f"Topic_{topic_id + 1}"
    files = list(topic_dir.glob('*.json'))
    print(f"\nTopic_{topic_id + 1}/ ({len(files)} CVs):")
    for f in sorted(files):
        print(f"  - {f.name}")


Topic_1/ (0 CVs):

Topic_2/ (0 CVs):

Topic_3/ (0 CVs):


## **Student Exercise**: discover topics on a dataset of your choice

In [None]:
# STUDENT EXERCISE

In [17]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data

# Fit the vectorizer once and reuse it for both the matrix and the vocabulary
vectorizer = CountVectorizer(max_features=1000, stop_words="english")
X = vectorizer.fit_transform(docs)
words = vectorizer.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=3, random_state=42).fit(X)

# Print the 5 highest-weight words per topic (ascending, strongest word last)
for i, t in enumerate(lda.components_):
    print(f"Topic {i+1}:", [words[j] for j in t.argsort()[-5:]])


Topic 1: ['know', 'think', 'just', 'don', 'people']
Topic 2: ['new', 'file', 'like', 'edu', 'use']
Topic 3: ['a86', 'b8f', 'g9v', 'max', 'ax']


## Summary

**What we accomplished**:
1. ✅ Loaded unlabeled CVs from a folder
2. ✅ Preprocessed the text data
3. ✅ Created a document-term matrix
4. ✅ Trained an LDA model to discover topics
5. ✅ Analyzed which CVs belong to which topic
6. ✅ **Automatically organized CVs into folders** based on discovered topics

**Key Takeaways**:
- **LDA discovers topics automatically** by finding words that tend to co-occur in the same documents
- **Each document is a mixture of topics** - LDA assigns probabilities
- **Topic modeling is unsupervised** - no labels needed!
- **Practical application**: Organize unlabeled documents automatically

**Next Steps**:
- Try different numbers of topics (`n_topics`) and see how results change
- Experiment with preprocessing (stemming, stop words removal)
- Use topic probabilities to handle CVs that belong to multiple topics
- Visualize topics using tools like pyLDAvis
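
For the first suggestion, one rough way to compare values of `n_topics` is to fit a model for each and look at its perplexity (lower is better). The sketch below uses an invented toy corpus and scores on the training data for brevity; on a real corpus, a held-out split and a manual look at the top words would be more trustworthy.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only.
toy_docs = [
    "pytorch tensorflow nlp deep learning",
    "pytorch nlp transformers deep learning",
    "tableau dashboard sql reporting excel",
    "sql tableau dashboard excel charts",
    "hadoop spark kafka streaming cluster",
    "spark kafka hadoop pipeline cluster",
]

X = CountVectorizer().fit_transform(toy_docs)

# Fit one model per candidate topic count and record its perplexity.
scores = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, random_state=42).fit(X)
    scores[k] = model.perplexity(X)

for k, score in scores.items():
    print(f"n_topics={k}: perplexity={score:.1f}")
```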

**References**:
- [Scikit-learn LDA documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)
- [Topic modeling visualization guide](https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/)

---


## Module 1 Synthesis: The Complete Pipeline

Congratulations! You've completed **Module 1: Text Analysis with Statistical NLP**. Let's reflect on the journey and see how all the pieces fit together.

### The Circular Learning Experience

Remember the question chain we started with? Let's trace how we answered each question and built a complete NLP pipeline:

1. **"What is NLP?"** → We learned that NLP bridges computers and human language, with applications in understanding and generation.

2. **"How do we extract patterns from text?"** → We used **Regular Expressions** to find, match, and manipulate text patterns—essential for preprocessing.

3. **"How do we understand our data?"** → We performed **Exploratory Data Analysis (EDA)** on corpora to assess data quality, vocabulary characteristics, and preprocessing needs.

4. **"How do we prepare text for ML?"** → We applied **Preprocessing** techniques (cleaning, normalization, tokenization, stemming) to transform raw text into clean tokens.

5. **"How do we convert text to numbers?"** → We used **Vectorization** (BoW, TF-IDF) to convert text into numerical features that ML models can process.

6. **"How do we build classifiers?"** → We built **Text Classification** models (like sentiment analysis) using vectorized features and supervised learning.

7. **"How do we search documents?"** → We implemented **Information Retrieval** systems using TF-IDF and cosine similarity to find relevant documents.

8. **"How do we discover topics?"** → We applied **Topic Modeling** (LDA) to automatically organize unlabeled documents by discovering hidden topics.

### The Complete NLP Pipeline

Throughout this module, you've learned to build a complete NLP pipeline:

```
Raw Text
    ↓
[Regex: Pattern Extraction]
    ↓
[Corpus & EDA: Understanding Data]
    ↓
[Preprocessing: Cleaning & Normalization]
    ↓
[Vectorization: Text → Numbers]
    ↓
[Modeling: Classification / IR / Topic Modeling]
    ↓
Actionable Insights
```

### Key Skills You've Acquired

By completing this module, you can now:

✅ **Build supervised ML text classification pipelines**
- Preprocess Arabic and English text
- Vectorize text using BoW and TF-IDF
- Train and evaluate classifiers
- Interpret model results

✅ **Apply keyword-based information retrieval**
- Implement TF-IDF-based search engines
- Measure document similarity using cosine similarity
- Rank and retrieve relevant documents

✅ **Apply unsupervised ML for document organization**
- Discover hidden topics using LDA
- Organize unlabeled documents automatically
- Interpret topic modeling results

### The Foundation for What's Next

This module focused on **statistical NLP**: traditional methods that work well for many tasks. In **Module 2**, you'll learn about **Deep Learning approaches** (embeddings, transformers) that build on these foundations and often achieve better performance.

**What you learned here is still valuable:**
- Preprocessing techniques apply to both statistical and deep learning methods
- Understanding vectorization helps you understand embeddings
- EDA is always the first step, regardless of the approach
- The pipeline structure (preprocess → vectorize → model) remains the same

### Reflection Questions

Before moving to Module 2, consider:

1. **When would you use statistical NLP vs. deep learning?**
   - Statistical NLP: Fast, interpretable, works with small data
   - Deep Learning: Often higher accuracy, but requires more data and computation

2. **What preprocessing steps are most important?**
   - Depends on your data and task, but EDA always guides the decision

3. **How does TF-IDF differ from BoW?**
   - BoW: Simple word counts
   - TF-IDF: Weighted counts that emphasize distinctive words

4. **When would you use topic modeling vs. classification?**
   - Classification: When you have labels and want to predict categories
   - Topic Modeling: When you have no labels and want to discover structure
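
To make question 3 concrete, here is a small sketch comparing the two vectorizers on an invented three-document corpus. In the first document, "python" and "spark" each occur once, so their BoW counts are equal; TF-IDF gives "spark" the larger weight because "python" appears in every document.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus: "python" is in every document, "spark" in only one.
toy_docs = ["python spark", "python tableau", "python kafka"]

bow = CountVectorizer()
bow_row = bow.fit_transform(toy_docs).toarray()[0]  # counts for document 0

tfidf = TfidfVectorizer()
tfidf_row = tfidf.fit_transform(toy_docs).toarray()[0]  # weights for document 0

for word in ("python", "spark"):
    count = bow_row[bow.vocabulary_[word]]
    weight = tfidf_row[tfidf.vocabulary_[word]]
    print(f"{word}: BoW count={count}, TF-IDF weight={weight:.3f}")
```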

### The Journey Continues

You've built a solid foundation in statistical NLP. The concepts you've learned—preprocessing, vectorization, classification, retrieval, and topic modeling—are the building blocks for more advanced techniques.

**Next Module Preview:**
- **Module 2** introduces **Deep Learning for NLP**:
  - Tokenization with modern tools (WordPiece, BPE)
  - Word embeddings (Word2Vec, GloVe, contextual embeddings)
  - Transformers and BERT
  - Fine-tuning pre-trained models

The journey from statistical NLP to deep learning is a natural progression—you'll see how embeddings generalize vectorization, how transformers improve on traditional methods, and how pre-trained models leverage the foundations you've built.

---

**Module 1 Complete! 🎉**

You now have the skills to work with text data using statistical methods. You understand the complete pipeline from raw text to actionable insights, and you're ready to explore the power of deep learning in Module 2.