# Topic Modeling: Organizing Unlabeled CVs with LDA

## Overview

This notebook demonstrates **Topic Modeling** using **Latent Dirichlet Allocation (LDA)** to organize unlabeled CVs (resumes) by automatically discovering hidden topics. Unlike supervised classification, topic modeling works with completely unlabeled data, making it ideal for organizing large document collections without manual labeling. You'll learn how to apply LDA to discover topics, interpret results, and organize documents based on their dominant topics.

> "The best way to find a needle in a haystack is to organize the haystack first."

**The Problem**: You have a folder full of CVs, unlabeled and unorganized. You need to find candidates for specific roles, but manually reading through hundreds of CVs is impractical.

## Objectives

- Understand what Topic Modeling is and why it's useful for unsupervised document organization
- Learn how LDA (Latent Dirichlet Allocation) discovers hidden topics in text collections
- Apply LDA to organize unlabeled documents automatically
- Interpret topic modeling results by examining top words and document-topic distributions
- Organize documents into folders based on their dominant topics

## Outline

1. **Introduction to Topic Modeling** - What it is and why it's useful
2. **What is LDA?** - Understanding Latent Dirichlet Allocation
3. **The Pipeline** - Complete workflow from data loading to organization
4. **Step 1: Loading Data** - Reading CVs from JSON files
5. **Step 2: Preprocessing** - Cleaning and preparing text
6. **Step 3: Vectorization** - Converting text to document-term matrix
7. **Step 4: Training LDA** - Discovering topics automatically
8. **Step 5: Analyzing Results** - Interpreting discovered topics
9. **Step 6: Organizing Documents** - Creating folders and organizing CVs by topic

## Topic Modeling

**Topic Modeling** is an **unsupervised learning** task that discovers hidden topics in a collection of unlabeled documents. Unlike classification (which requires labeled data), topic modeling finds patterns automatically.

**Example applications:**
- **Organizing unlabeled documents**: Group CVs by field (AI/ML, Data Analysis, etc.) without manual labeling
- **Understanding large text collections**: Discover what themes exist in news archives, research papers, or social media
- **Content recommendation**: Find documents similar to a given document based on topic similarity

**Why it's useful:**
- No labels needed: works with completely unlabeled data
- Interpretable: topics are defined by their top words, making them understandable
- Scalable: can process large document collections
- Flexible: number of topics can be adjusted based on the corpus

## What is LDA?

**Latent Dirichlet Allocation (LDA)** is a probabilistic model that discovers hidden topics in a collection of documents.

**Key idea**: 
- Each document is a **mixture of topics** (e.g., 70% AI/ML, 20% Data Analysis, 10% Software Engineering)
- Each topic is a **distribution over words** (e.g., Topic 1: 30% "PyTorch", 25% "TensorFlow", 20% "NLP"...)
- LDA discovers these topics automatically by finding words that tend to co-occur in the same documents

**For our CVs**: LDA groups words that appear together into topics, which we can then label as, say, "AI/ML", "Data Analysis", or "Big Data" after inspecting their top words. Each CV gets a probability for every topic, and we assign it to the most probable one.
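
The generative story behind the key idea can be sketched in a few lines of plain Python. This is a toy illustration only, not how scikit-learn fits the model: the two topics, their word probabilities, and the helper names `sample_dirichlet` and `generate_document` are all made up for the example.

```python
import random

# Hypothetical topic-word distributions (each topic's probabilities sum to 1).
TOPICS = {
    "ai_ml": {"pytorch": 0.4, "tensorflow": 0.3, "nlp": 0.2, "sql": 0.1},
    "data_analysis": {"tableau": 0.4, "dashboard": 0.3, "sql": 0.2, "nlp": 0.1},
}

def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5, seed=0):
    """Generate a toy document by following LDA's generative process."""
    rng = random.Random(seed)
    topic_names = list(TOPICS)
    # 1. Draw the document's topic mixture.
    theta = sample_dirichlet([alpha] * len(topic_names), rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to the mixture.
        topic = rng.choices(topic_names, weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab = list(TOPICS[topic])
        probs = list(TOPICS[topic].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

theta, words = generate_document(n_words=8)
print("Topic mixture:", [round(p, 2) for p in theta])
print("Document:", " ".join(words))
```

Training LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions that most plausibly produced them.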

**Reference**: Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). [Latent Dirichlet allocation](https://dl.acm.org/doi/10.5555/944919.944937). *Journal of Machine Learning Research*, 3, 993-1022.

![Left: BoW. Right: LDA](../assets/lda.png)

## The Pipeline

1. **Load CVs**: Read all JSON files from topic folders using glob patterns and extract structured fields
2. **Preprocess**: Clean the text (remove URLs, emails, etc.)
3. **Vectorize**: Convert text to document-term matrix (Bag of Words)
4. **Train LDA**: Discover topics automatically
5. **Analyze Results**: See what topics were found and which CVs belong to each
6. **Organize**: Create folders and copy CVs based on their dominant topic

## Step 1: Loading Data

In [None]:
# %pip install numpy==1.26.4 pandas==2.3.3 scikit-learn==1.8.0 --quiet

In [1]:
# Standard library imports
import json
import re
import shutil
from pathlib import Path

# Third-party imports
import numpy as np
import pandas as pd

# Machine Learning
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

print("Libraries imported successfully!")

Libraries imported successfully!


In [2]:
# Load CVs from JSON files in all topic folders
cv_dir = Path('../datasets/CVs')
# Use glob pattern to find all JSON files in Topic_* subdirectories, excluding English versions
cv_files = sorted([f for f in cv_dir.glob('Topic_*/*.json') if not f.name.endswith('_en.json')])

# Load and extract structured data from JSON
cvs_data = []
cv_names = []
cv_file_paths = []  # Store original file paths for later copying

for file in cv_files:
    with open(file, 'r', encoding='utf-8') as f:
        data = json.load(f)
        cvs_data.append(data)
        cv_names.append(file.stem)
        cv_file_paths.append(file)  # Store the full path

print(f"Loaded {len(cvs_data)} CV files from {len(set(f.parent.name for f in cv_files))} topic folders:")
for i, name in enumerate(cv_names, 1):
    print(f"  {i}. {name}")

# Combine structured fields into text for each CV
CV_FIELDS = ['Heading', 'Skills', 'Projects', 'Experience', 'Education']

def combine_cv_fields(cv_json):
    """Combine Heading, Skills, Projects, Experience, Education into a single text"""
    parts = []
    for field in CV_FIELDS:
        if field not in cv_json:
            continue
        value = cv_json[field]
        # List-valued fields (e.g. Skills) are joined into one string
        parts.append(' '.join(value) if isinstance(value, list) else value)
    return ' '.join(parts)

# Convert JSON data to text
cvs = [combine_cv_fields(cv_data) for cv_data in cvs_data]
print(f"\nCombined structured data into text for {len(cvs)} CVs")

Loaded 34 CV files from 3 topic folders:
  1. 07
  2. 10
  3. 11
  4. 15
  5. 17
  6. 20
  7. 21
  8. 22
  9. 26
  10. 29
  11. 30
  12. 33
  13. 34
  14. 39
  15. 40
  16. 08
  17. 09
  18. 12
  19. 16
  20. 18
  21. 19
  22. 23
  23. 24
  24. 31
  25. 32
  26. 35
  27. 36
  28. 37
  29. 38
  30. 13
  31. 14
  32. 25
  33. 27
  34. 28

Combined structured data into text for 34 CVs


## Step 2: Data Preprocessing

In [3]:
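def preprocess_text(text):
    """Clean text: remove emails, URLs, and punctuation, then normalize whitespace"""
    # Remove emails and URLs
    text = re.sub(r'\S+@\S+', '', text)
    text = re.sub(r'http\S+', '', text)
    # Replace punctuation and symbols with spaces. \w already matches Arabic and
    # English letters, digits, and underscores; the explicit Arabic range is redundant but harmless
    text = re.sub(r'[^\w\s\u0600-\u06FF]', ' ', text)
    # Normalize whitespace last so the spaces introduced above are collapsed too
    text = re.sub(r'\s+', ' ', text)
    return text.strip()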

# Preprocess all CVs
cvs_processed = [preprocess_text(cv) for cv in cvs]
print(f"Preprocessed {len(cvs_processed)} CVs")

Preprocessed 34 CVs
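
To see what this cleaning does, here is a small check on a made-up CV snippet. `clean_text` is a standalone copy of the same regex substitutions (with whitespace collapsed last), and the sample text is invented for illustration.

```python
import re

def clean_text(text):
    """Standalone copy of the cleaning steps, for experimenting on sample strings."""
    text = re.sub(r'\S+@\S+', '', text)                 # drop email addresses
    text = re.sub(r'http\S+', '', text)                 # drop URLs
    text = re.sub(r'[^\w\s\u0600-\u06FF]', ' ', text)   # punctuation -> space
    text = re.sub(r'\s+', ' ', text)                    # collapse runs of whitespace
    return text.strip()

# Invented sample mixing English, Arabic, an email address, and a URL.
sample = (
    "Sara Ali | sara.ali@example.com | https://example.com/sara\n"
    "Skills: Python, SQL; تحليل البيانات"
)

cleaned = clean_text(sample)
print(cleaned)
```

The email address, the URL, and the punctuation disappear, while both the English and the Arabic words survive.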


## Step 3: Prepare Data for LDA

Convert the text to a document-term matrix, the same Bag of Words representation used for classification.

In [4]:
# Create document-term matrix
vectorizer = CountVectorizer(
    max_features=1000,  # Top 1000 words
    min_df=2,           # Word must appear in at least 2 CVs
    max_df=0.8          # Ignore words in >80% of CVs
)

doc_term_matrix = vectorizer.fit_transform(cvs_processed)
feature_names = vectorizer.get_feature_names_out()

print(f"Document-Term Matrix: {doc_term_matrix.shape[0]} CVs × {doc_term_matrix.shape[1]} words")
print(f"Sparsity: {(1 - doc_term_matrix.nnz / (doc_term_matrix.shape[0] * doc_term_matrix.shape[1])) * 100:.1f}%")

Document-Term Matrix: 34 CVs × 662 words
Sparsity: 86.7%
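
The `min_df` and `max_df` settings prune the vocabulary by document frequency. The sketch below mimics that idea in plain Python on a made-up corpus; `filter_vocabulary` is a hypothetical helper written for this example, not scikit-learn's implementation.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=2, max_df=0.8):
    """Keep words that appear in at least `min_df` documents (absolute count)
    and in at most a `max_df` fraction of all documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))  # count each word once per document
    return sorted(
        word for word, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )

# Made-up mini corpus: "python" is in every document, several words occur only once.
docs = [
    "python engineer spark hadoop",
    "python analyst tableau dashboard",
    "python engineer pytorch nlp",
    "python analyst sql dashboard",
    "python researcher pytorch nlp",
]

kept = filter_vocabulary(docs)
print(kept)
```

Here "python" is dropped because it appears in every document, and words such as "spark" or "sql" are dropped because they appear in only one. What remains are the words that can actually separate groups of documents.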


## Step 4: Train LDA Model

In [7]:
# Train LDA model
n_topics = 3  # Number of topics to discover

lda = LatentDirichletAllocation(
    n_components=n_topics,
    random_state=42,
    max_iter=10,
    learning_method='online'
)

print(f"Training LDA to discover {n_topics} topics...")
lda.fit(doc_term_matrix)
print("✓ Training complete!")

Training LDA to discover 3 topics...
✓ Training complete!


## Step 5: Analyze Results

Let's see what topics LDA discovered and which words define each topic.

In [8]:
# Display top words for each topic
def display_topics(model, feature_names, n_top_words=10):
    """Display top words for each topic"""
    for topic_idx, topic in enumerate(model.components_):
        top_words_idx = topic.argsort()[-n_top_words:][::-1]
        top_words = [feature_names[i] for i in top_words_idx]
        top_weights = [topic[i] for i in top_words_idx]
        
        print(f"\nTopic {topic_idx + 1}:")
        print("  Top words:", ", ".join(top_words))
        print("  Weights:", [f"{w:.3f}" for w in top_weights])

display_topics(lda, feature_names, n_top_words=10)


Topic 1:
  Top words: engineer, analyst, analytics, 2019, 2021, business, with, 10, bi, governance
  Weights: ['25.418', '25.095', '21.468', '20.625', '19.749', '19.336', '18.298', '16.163', '16.113', '15.502']

Topic 2:
  Top words: ai, on, models, engineer, research, computer, model, 06, 08, engineering
  Weights: ['32.898', '28.287', '22.209', '20.499', '20.153', '20.043', '19.232', '16.574', '16.569', '15.889']

Topic 3:
  Top words: engineer, spark, big, hadoop, on, aws, platform, 2021, 01, time
  Weights: ['7.766', '7.080', '5.884', '5.601', '4.821', '4.686', '4.546', '4.334', '4.237', '4.199']


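**Interpreting the topics**: Look at the top words for each topic. Can you guess what each topic represents? For example:
- Topic with "PyTorch", "TensorFlow", "NLP" → probably AI/ML
- Topic with "Tableau", "Power BI", "dashboard" → probably Data Analysis
- Topic with "Hadoop", "Spark", "Kafka" → probably Big Data

In the output above, Topic 1 ("analyst", "analytics", "business", "bi") looks like Data Analysis, Topic 2 ("ai", "models", "research") like AI/ML, and Topic 3 ("spark", "big", "hadoop", "aws") like Big Data. Tokens such as "with", "on", and the year numbers carry no topical meaning, which suggests that stop-word removal would give cleaner topics.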
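
The weights printed for each topic are not probabilities. In scikit-learn, `components_` holds unnormalized pseudo-counts, and dividing each row by its sum turns it into a word distribution for that topic. A minimal sketch, using a made-up 2×4 matrix as a stand-in for `lda.components_`:

```python
import numpy as np

# Made-up stand-in for lda.components_ (2 topics x 4 words); values are illustrative.
components = np.array([
    [6.0, 3.0, 0.5, 0.5],
    [1.0, 1.0, 5.0, 3.0],
])
feature_names = np.array(["pytorch", "nlp", "tableau", "dashboard"])

# Normalize each row so that it sums to 1: P(word | topic).
topic_word = components / components.sum(axis=1, keepdims=True)

for topic_idx, row in enumerate(topic_word):
    top = row.argsort()[::-1][:2]  # indices of the two most probable words
    summary = ", ".join(f"{feature_names[i]} ({row[i]:.0%})" for i in top)
    print(f"Topic {topic_idx + 1}: {summary}")
```

Normalized values are easier to compare across topics than raw weights, whose scale depends on how many words each topic absorbed.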

Now let's see which CV belongs to which topic:

In [22]:
# Get topic distribution for each CV
doc_topic_dist = lda.transform(doc_term_matrix)

# Find dominant topic for each CV
dominant_topics = doc_topic_dist.argmax(axis=1)

# Create a DataFrame to see results
df_results = pd.DataFrame({
    'CV': cv_names,
    'Dominant Topic': dominant_topics + 1,
    'Topic Probabilities': [dist for dist in doc_topic_dist]
})

# Show which CVs belong to which topic
print("CV Assignment to Topics:")
print("=" * 60)
for topic_id in range(n_topics):
    topic_cvs = df_results[df_results['Dominant Topic'] == topic_id + 1]
    print(f"\nTopic {topic_id + 1} ({len(topic_cvs)} CVs):")
    for idx, row in topic_cvs.iterrows():
        prob = row['Topic Probabilities'][topic_id]
        print(f"  - {row['CV']} ({prob:.1%})")

CV Assignment to Topics:

Topic 1 (17 CVs):
  - 15 (99.4%)
  - 29 (99.4%)
  - 30 (99.3%)
  - 08 (99.4%)
  - 09 (99.5%)
  - 18 (99.4%)
  - 19 (99.4%)
  - 23 (75.0%)
  - 24 (79.7%)
  - 31 (99.2%)
  - 35 (99.3%)
  - 36 (99.3%)
  - 37 (99.3%)
  - 38 (99.4%)
  - 14 (99.5%)
  - 27 (99.4%)
  - 28 (99.4%)

Topic 2 (14 CVs):
  - 07 (99.5%)
  - 10 (99.4%)
  - 11 (99.5%)
  - 17 (99.4%)
  - 20 (99.4%)
  - 21 (99.4%)
  - 22 (99.4%)
  - 26 (99.4%)
  - 33 (99.4%)
  - 34 (99.4%)
  - 39 (99.4%)
  - 40 (99.4%)
  - 13 (84.1%)
  - 25 (99.3%)

Topic 3 (3 CVs):
  - 12 (99.5%)
  - 16 (99.3%)
  - 32 (99.4%)
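
Most CVs above are assigned with more than 99% probability, but a few (23, 24, and 13) are more mixed. One possible way to handle such cases is to list every topic whose probability passes a cut-off instead of keeping only the top one. The helper `topics_above_threshold`, the 0.15 threshold, and the example distributions below are assumptions made for illustration.

```python
def topics_above_threshold(distribution, threshold=0.15):
    """Return 1-based topic numbers whose probability is at least `threshold`,
    ordered from most to least probable."""
    ranked = sorted(
        enumerate(distribution, start=1),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [topic for topic, prob in ranked if prob >= threshold]

# Made-up topic distributions (each sums to 1).
example_distributions = {
    "cv_a": [0.994, 0.003, 0.003],  # clearly a single topic
    "cv_b": [0.750, 0.240, 0.010],  # mostly Topic 1, partly Topic 2
}

assignments = {
    name: topics_above_threshold(dist)
    for name, dist in example_distributions.items()
}
for name, topics in assignments.items():
    print(name, "->", topics)
```

In the notebook, the same function could be applied to each row of `doc_topic_dist`, and a CV with several topics could be copied into more than one folder or flagged for manual review.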


## Step 6: Organize CVs into Folders

Now comes the practical part: **automatically organize CVs into folders** based on their dominant topic!

In [23]:
# Create output directory structure
output_dir = Path('output/organized_cvs')
output_dir.mkdir(parents=True, exist_ok=True)

# Create a folder for each topic
for topic_id in range(n_topics):
    topic_dir = output_dir / f"Topic_{topic_id + 1}"
    topic_dir.mkdir(exist_ok=True)

# Copy each CV to its topic folder
for cv_name, topic_id, source_file in zip(cv_names, dominant_topics, cv_file_paths):
    target_dir = output_dir / f"Topic_{topic_id + 1}"
    target_file = target_dir / f"{cv_name}.json"

    shutil.copy2(source_file, target_file)
    print(f"Copied {cv_name}.json → Topic_{topic_id + 1}/")

print(f"\n✓ Organization complete! CVs are now in: {output_dir}")

Copied 07.json → Topic_2/
Copied 10.json → Topic_2/
Copied 11.json → Topic_2/
Copied 15.json → Topic_1/
Copied 17.json → Topic_2/
Copied 20.json → Topic_2/
Copied 21.json → Topic_2/
Copied 22.json → Topic_2/
Copied 26.json → Topic_2/
Copied 29.json → Topic_1/
Copied 30.json → Topic_1/
Copied 33.json → Topic_2/
Copied 34.json → Topic_2/
Copied 39.json → Topic_2/
Copied 40.json → Topic_2/
Copied 08.json → Topic_1/
Copied 09.json → Topic_1/
Copied 12.json → Topic_3/
Copied 16.json → Topic_3/
Copied 18.json → Topic_1/
Copied 19.json → Topic_1/
Copied 23.json → Topic_1/
Copied 24.json → Topic_1/
Copied 31.json → Topic_1/
Copied 32.json → Topic_3/
Copied 35.json → Topic_1/
Copied 36.json → Topic_1/
Copied 37.json → Topic_1/
Copied 38.json → Topic_1/
Copied 13.json → Topic_2/
Copied 14.json → Topic_1/
Copied 25.json → Topic_2/
Copied 27.json → Topic_1/
Copied 28.json → Topic_1/

✓ Organization complete! CVs are now in: outp

### Verify the Organization

Let's check what's in each folder:

In [24]:
# Show contents of each topic folder
for topic_id in range(n_topics):
    topic_dir = output_dir / f"Topic_{topic_id + 1}"
    files = list(topic_dir.glob('*.json'))
    print(f"\nTopic_{topic_id + 1}/ ({len(files)} CVs):")
    for f in sorted(files):
        print(f"  - {f.name}")


Topic_1/ (17 CVs):
  - 08.json
  - 09.json
  - 14.json
  - 15.json
  - 18.json
  - 19.json
  - 23.json
  - 24.json
  - 27.json
  - 28.json
  - 29.json
  - 30.json
  - 31.json
  - 35.json
  - 36.json
  - 37.json
  - 38.json

Topic_2/ (14 CVs):
  - 07.json
  - 10.json
  - 11.json
  - 13.json
  - 17.json
  - 20.json
  - 21.json
  - 22.json
  - 25.json
  - 26.json
  - 33.json
  - 34.json
  - 39.json
  - 40.json

Topic_3/ (3 CVs):
  - 12.json
  - 16.json
  - 32.json


## **Student Exercise**: Discover topics on a dataset of your choice

## Step 1: Loading Data

In [70]:
# STUDENT EXERCISE
import kagglehub

# Download latest version
path = kagglehub.dataset_download("abisheksudarshan/topic-modeling-for-research-articles")


In [37]:
import os
import pandas as pd

print(os.listdir(path))
# Load the dataset
df_train = pd.read_csv(os.path.join(path, "Train.csv"))
df_test = pd.read_csv(os.path.join(path, "Test.csv"))
print(df_train.shape)
print(df_test.shape)
# Concatenate the two DataFrames. Test.csv has fewer columns (6 vs. 31),
# so the missing columns are NaN for the test rows
df = pd.concat([df_train, df_test], ignore_index=True)
print(df.shape)
# Change the display option to show the full text of each cell
pd.set_option('display.max_colwidth', None)
# Display the first row of the dataset
df.head(1)

['Test.csv', 'Tags.csv', 'sample_sub.csv', 'Train.csv']
(14004, 31)
(6002, 6)
(20006, 31)


Unnamed: 0,id,ABSTRACT,Computer Science,Mathematics,Physics,Statistics,Analysis of PDEs,Applications,Artificial Intelligence,Astrophysics of Galaxies,...,Methodology,Number Theory,Optimization and Control,Representation Theory,Robotics,Social and Information Networks,Statistics Theory,Strongly Correlated Electrons,Superconductivity,Systems and Control
0,1824,"a ever-growing datasets inside observational astronomy have challenged scientists inside many aspects, including an efficient and interactive data exploration and visualization. many tools have been developed to confront this challenge. however, they usually focus on displaying a actual images or focus on visualizing patterns within catalogs inside the predefined way. inside this paper we introduce vizic, the python visualization library that builds a connection between images and catalogs through an interactive map of a sky region. vizic visualizes catalog data over the custom background canvas with the help of a shape, size and orientation of each object inside a catalog. a displayed objects inside a map are highly interactive and customizable comparing to those inside a images. these objects should be filtered by or colored by their properties, such as redshift and magnitude. they also should be sub-selected with the help of the lasso-like tool considering further analysis with the help of standard python functions from in the jupyter notebook. furthermore, vizic allows custom overlays to be appended dynamically on top of a sky map. we have initially implemented several overlays, namely, voronoi, delaunay, minimum spanning tree and healpix grid layers, which are helpful considering visualizing large-scale structure. all these overlays should be generated, added or removed interactively with one line of code. a catalog data was stored inside the non-relational database, and a interfaces were developed inside javascript and python to work within jupyter notebook, which allows to create custom widgets, user generated scripts to analyze and plot a data selected/displayed inside a interactive map. this unique design makes vizic the very powerful and flexible interactive analysis tool. 
vizic should be adopted inside variety of exercises, considering example, data inspection, clustering analysis, galaxy alignment studies, outlier identification or simply large-scale visualizations.",0,0,1,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Step 2: Data Preprocessing

In [63]:
def preprocess_text(text):
    """Clean text: remove emails, URLs, and punctuation, normalize whitespace, lowercase"""
    # Remove emails and URLs
    text = re.sub(r'\S+@\S+', '', text)
    text = re.sub(r'http\S+', '', text)
    # Replace punctuation and symbols with spaces (\w already covers letters and digits)
    text = re.sub(r'[^\w\s\u0600-\u06FF]', ' ', text)
    # Normalize whitespace last so the spaces introduced above are collapsed too
    text = re.sub(r'\s+', ' ', text)
    return text.strip().lower()

# Import stopwords
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def remove_stopwords(text):
    """Remove stopwords from text including academic and generic terms"""
    # Comprehensive academic stopwords list
    academic_stopwords = [
        # Generic academic connectors
        'considering', 'using', 'therefore', 'furthermore', 'moreover', 'however', 
        'thus', 'hence', 'consequently', 'additionally', 'further', 'also',
        
        # Academic verbs (too generic)
        'demonstrating', 'illustrating', 'showing', 'shows', 'show', 'present', 
        'presents', 'presented', 'propose', 'proposes', 'proposed', 'suggest', 
        'suggests', 'suggested', 'indicate', 'indicates', 'indicated',
        
        # Generic research terms
        'model', 'models', 'approach', 'approaches', 'method', 'methods', 
        'methodology', 'study', 'studies', 'research', 'paper', 'papers',
        'result', 'results', 'analysis', 'analyses', 'finding', 'findings',
        
        # Generic descriptors
        'based', 'help', 'helps', 'helping', 'use', 'uses', 'used',
        'provide', 'provides', 'provided', 'allow', 'allows', 'allowed',
        'enable', 'enables', 'enabled', 'require', 'requires', 'required',
        
        # Generic academic phrases
        'work', 'works', 'system', 'systems', 'application', 'applications',
        'process', 'processes', 'technique', 'techniques', 'framework', 'frameworks',
        'tool', 'tools', 'solution', 'solutions', 'problem', 'problems',
        
        # Common filler words in academic text
        'one', 'two', 'three', 'first', 'second', 'third', 'new', 'novel',
        'important', 'significant', 'effect', 'effects', 'different', 'various'
    ]
    
    # Combine standard English stopwords with academic stopwords
    stopwords = ENGLISH_STOP_WORDS.union(set(academic_stopwords))
    
    # Filter out stopwords and very short words (1-2 characters) which are usually noise
    words = [word for word in text.split() 
             if word not in stopwords and len(word) > 2]
    
    return ' '.join(words)

# Clean abstract column
df['ABSTRACT'] = df['ABSTRACT'].apply(preprocess_text)
df['ABSTRACT'] = df['ABSTRACT'].apply(remove_stopwords)
df.head(1)




Unnamed: 0,id,ABSTRACT,Computer Science,Mathematics,Physics,Statistics,Analysis of PDEs,Applications,Artificial Intelligence,Astrophysics of Galaxies,...,Methodology,Number Theory,Optimization and Control,Representation Theory,Robotics,Social and Information Networks,Statistics Theory,Strongly Correlated Electrons,Superconductivity,Systems and Control
0,1824,growing datasets inside observational astronomy challenged scientists inside aspects including efficient interactive data exploration visualization developed confront challenge usually focus displaying actual images focus visualizing patterns catalogs inside predefined way inside introduce vizic python visualization library builds connection images catalogs interactive map sky region vizic visualizes catalog data custom background canvas shape size orientation object inside catalog displayed objects inside map highly interactive customizable comparing inside images objects filtered colored properties redshift magnitude sub selected lasso like standard python functions jupyter notebook vizic custom overlays appended dynamically sky map initially implemented overlays voronoi delaunay minimum spanning tree healpix grid layers helpful visualizing large scale structure overlays generated added removed interactively line code catalog data stored inside non relational database interfaces developed inside javascript python jupyter notebook create custom widgets user generated scripts analyze plot data selected displayed inside interactive map unique design makes vizic powerful flexible interactive vizic adopted inside variety exercises example data inspection clustering galaxy alignment outlier identification simply large scale visualizations,0,0,1,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Step 3: Prepare Data for LDA

Convert the text to a document-term matrix, the same Bag of Words representation used for classification.

In [65]:
# Create document-term matrix
vectorizer = CountVectorizer(
    max_features=1000,  # Top 1000 words
    min_df=2,           # Word must appear in at least 2 articles
    max_df=0.7          # Ignore words in >70% of articles
)

doc_term_matrix = vectorizer.fit_transform(df['ABSTRACT'])
feature_names = vectorizer.get_feature_names_out()

print(f"Document-Term Matrix: {doc_term_matrix.shape[0]} articles × {doc_term_matrix.shape[1]} words")
print(f"Sparsity: {(1 - doc_term_matrix.nnz / (doc_term_matrix.shape[0] * doc_term_matrix.shape[1])) * 100:.1f}%")

Document-Term Matrix: 20006 articles × 1000 words
Sparsity: 96.6%


## Step 4: Train LDA Model


In [66]:
# Train LDA model
n_topics = 5  # Number of topics to discover

lda = LatentDirichletAllocation(
    n_components=n_topics,
    random_state=42,
    max_iter=10,
    learning_method='online'
)

print(f"Training LDA to discover {n_topics} topics...")
lda.fit(doc_term_matrix)
print("✓ Training complete!")

Training LDA to discover 5 topics...
✓ Training complete!


## Step 5: Analyze Results

Let's see what topics LDA discovered and which words define each topic.

In [69]:
# Display top words for each topic
def display_topics(model, feature_names, n_top_words=10):
    """Display top words for each topic"""
    for topic_idx, topic in enumerate(model.components_):
        top_words_idx = topic.argsort()[-n_top_words:][::-1]
        top_words = [feature_names[i] for i in top_words_idx]
        top_weights = [topic[i] for i in top_words_idx]
        
        print(f"\nTopic {topic_idx + 1}:")
        print("  Top words:", ", ".join(top_words))
        print("  Weights:", [f"{w:.3f}" for w in top_weights])

display_topics(lda, feature_names, n_top_words=5)


Topic 1:
  Top words: phase, magnetic, spin, energy, field
  Weights: ['2486.255', '2470.282', '2332.492', '2187.426', '1903.768']

Topic 2:
  Top words: learning, data, network, networks, neural
  Weights: ['7582.480', '5749.925', '5724.820', '4419.499', '3660.568']

Topic 3:
  Top words: mass, galaxies, star, observations, data
  Weights: ['2332.233', '1434.641', '1347.577', '1273.391', '1249.730']

Topic 4:
  Top words: space, flow, group, prove, mathbb
  Weights: ['1452.993', '1390.706', '1389.660', '1352.290', '1339.774']

Topic 5:
  Top words: algorithm, data, time, approximation, algorithms
  Weights: ['4734.481', '4564.084', '2868.735', '2631.065', '2469.988']
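
Because the training rows of this dataset also carry 0/1 label columns (for example "Computer Science" and "Physics"), the discovered topics can be sanity-checked against them. The sketch below shows the idea on a made-up table: the `dominant_topic` column and all values are invented, and in the exercise that column would come from `lda.transform(...).argmax(axis=1) + 1`.

```python
import pandas as pd

# Made-up stand-in: dominant topic per article plus two 0/1 label columns.
toy = pd.DataFrame({
    "dominant_topic": [1, 1, 2, 2, 2, 3],
    "Computer Science": [0, 0, 1, 1, 1, 0],
    "Physics": [1, 1, 0, 0, 1, 1],
})

# Share of articles within each topic that carry each label.
label_share = toy.groupby("dominant_topic")[["Computer Science", "Physics"]].mean()
print(label_share)
```

A topic whose articles almost all share one label is probably coherent. A topic spread evenly over many labels may be a sign that `n_topics` or the preprocessing needs another look.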


## Summary

**What we accomplished**:
1. ✅ Loaded unlabeled CVs from a folder
2. ✅ Preprocessed the text data
3. ✅ Created a document-term matrix
4. ✅ Trained an LDA model to discover topics
5. ✅ Analyzed which CVs belong to which topic
6. ✅ **Automatically organized CVs into folders** based on discovered topics

**Key Takeaways**:
- **LDA discovers topics automatically** by finding words that tend to co-occur
- **Each document is a mixture of topics** - LDA assigns probabilities
- **Topic modeling is unsupervised** - no labels needed!
- **Practical application**: Organize unlabeled documents automatically

**Next Steps**:
- Try different numbers of topics (`n_topics`) and see how results change
- Experiment with preprocessing (stemming, stop-word removal)
- Use topic probabilities to handle CVs that belong to multiple topics
- Visualize topics using tools like pyLDAvis
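
For the first suggestion, one rough signal when comparing values of `n_topics` is the model's perplexity (lower is better), available through scikit-learn's `perplexity` method. The sketch below uses a made-up mini corpus. Scoring on the training data, as done here for brevity, tends to favor more topics, so a held-out split would be a fairer test, and reading the top words remains the more reliable check.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus with three rough themes.
docs = [
    "pytorch tensorflow nlp models research",
    "nlp models pytorch deep learning",
    "tableau dashboard sql business analytics",
    "sql analytics dashboard business reporting",
    "spark hadoop kafka aws platform",
    "hadoop spark aws streaming platform",
]
X = CountVectorizer().fit_transform(docs)

# Fit one model per candidate number of topics and record its perplexity.
perplexities = {}
for k in (2, 3, 4):
    lda_k = LatentDirichletAllocation(n_components=k, random_state=42)
    lda_k.fit(X)
    perplexities[k] = lda_k.perplexity(X)

for k, value in perplexities.items():
    print(f"n_topics={k}: perplexity={value:.1f}")
```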

**References**:
- [Scikit-learn LDA documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)
- [Topic modeling visualization guide](https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/)

---


## Module 1 Synthesis: The Complete Pipeline

Congratulations! You've completed **Module 1: Text Analysis with Statistical NLP**. Let's reflect on the journey and see how all the pieces fit together.

### The Circular Learning Experience

Remember the question chain we started with? Let's trace how we answered each question and built a complete NLP pipeline:

1. **"What is NLP?"** → We learned that NLP bridges computers and human language, with applications in understanding and generation.

2. **"How do we extract patterns from text?"** → We used **Regular Expressions** to find, match, and manipulate text patterns, an essential step in preprocessing.

3. **"How do we understand our data?"** → We performed **Exploratory Data Analysis (EDA)** on corpora to assess data quality, vocabulary characteristics, and preprocessing needs.

4. **"How do we prepare text for ML?"** → We applied **Preprocessing** techniques (cleaning, normalization, tokenization, stemming) to transform raw text into clean tokens.

5. **"How do we convert text to numbers?"** → We used **Vectorization** (BoW, TF-IDF) to convert text into numerical features that ML models can process.

6. **"How do we build classifiers?"** → We built **Text Classification** models (like sentiment analysis) using vectorized features and supervised learning.

7. **"How do we search documents?"** → We implemented **Information Retrieval** systems using TF-IDF and cosine similarity to find relevant documents.

8. **"How do we discover topics?"** → We applied **Topic Modeling** (LDA) to automatically organize unlabeled documents by discovering hidden topics.

### The Complete NLP Pipeline

Throughout this module, you've learned to build a complete NLP pipeline:

```
Raw Text
    ↓
[Regex: Pattern Extraction]
    ↓
[Corpus & EDA: Understanding Data]
    ↓
[Preprocessing: Cleaning & Normalization]
    ↓
[Vectorization: Text → Numbers]
    ↓
[Modeling: Classification / IR / Topic Modeling]
    ↓
Actionable Insights
```
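
The vectorization and modeling stages of this diagram can also be chained into a single object. The sketch below uses scikit-learn's `make_pipeline` with topic modeling as the final stage. The four documents are made up, and regex cleaning is left out for brevity.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Made-up documents covering two rough themes.
docs = [
    "pytorch tensorflow nlp models",
    "nlp models pytorch research",
    "tableau dashboard sql analytics",
    "sql analytics dashboard reporting",
]

# Vectorization (text -> counts) followed by modeling (counts -> topic mixtures).
pipeline = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=42),
)

doc_topic = pipeline.fit_transform(docs)  # one row of topic probabilities per document
print(doc_topic.round(2))
print("Dominant topics:", doc_topic.argmax(axis=1) + 1)
```

Keeping the steps in one pipeline means new documents go through exactly the same vocabulary and model via `pipeline.transform(new_docs)`.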

### Key Skills You've Acquired

By completing this module, you can now:

✅ **Build supervised ML text classification pipelines**
- Preprocess Arabic and English text
- Vectorize text using BoW and TF-IDF
- Train and evaluate classifiers
- Interpret model results

✅ **Apply keyword-based information retrieval**
- Implement TF-IDF-based search engines
- Measure document similarity using cosine similarity
- Rank and retrieve relevant documents

✅ **Apply unsupervised ML for document organization**
- Discover hidden topics using LDA
- Organize unlabeled documents automatically
- Interpret topic modeling results

### The Foundation for What's Next

This module focused on **statistical NLP**: traditional methods that work well for many tasks. In **Module 2**, you'll learn about **Deep Learning approaches** (embeddings, transformers) that build on these foundations and often achieve better performance.

**What you learned here is still valuable:**
- Preprocessing techniques apply to both statistical and deep learning methods
- Understanding vectorization helps you understand embeddings
- EDA is always the first step, regardless of the approach
- The pipeline structure (preprocess → vectorize → model) remains the same

### Reflection Questions

Before moving to Module 2, consider:

1. **When would you use statistical NLP vs. deep learning?**
   - Statistical NLP: Fast, interpretable, works with small data
   - Deep Learning: Often better accuracy, but requires more data and computation

2. **What preprocessing steps are most important?**
   - Depends on your data and task, but EDA always guides the decision

3. **How does TF-IDF differ from BoW?**
   - BoW: Simple word counts
   - TF-IDF: Weighted counts that emphasize distinctive words

4. **When would you use topic modeling vs. classification?**
   - Classification: When you have labels and want to predict categories
   - Topic Modeling: When you have no labels and want to discover structure
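
For question 3, the difference between BoW and TF-IDF is easy to see on a toy corpus. The sketch below uses the plain textbook formula `tf * log(N / df)`. scikit-learn's `TfidfVectorizer` applies a smoothed and normalized variant, so its numbers would differ. The three documents are made up.

```python
import math
from collections import Counter

# Made-up mini corpus: "python" appears in every document.
docs = [
    "python python spark",
    "python tableau",
    "python nlp nlp",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

rows = []
for tokens in tokenized:
    bow = Counter(tokens)  # Bag of Words: raw counts
    tfidf = {w: c * math.log(n_docs / doc_freq[w]) for w, c in bow.items()}
    rows.append((dict(bow), tfidf))

for bow, tfidf in rows:
    print("BoW:", bow, "| TF-IDF:", {w: round(v, 2) for w, v in tfidf.items()})
```

In BoW, "python" has the highest count in the first document. In TF-IDF its weight is zero everywhere because it appears in every document, while rarer words such as "spark" and "nlp" stand out.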

### The Journey Continues

You've built a solid foundation in statistical NLP. The concepts you've learned (preprocessing, vectorization, classification, retrieval, and topic modeling) are the building blocks for more advanced techniques.

**Next Module Preview:**
- **Module 2** introduces **Deep Learning for NLP**:
  - Tokenization with modern tools (WordPiece, BPE)
  - Word embeddings (Word2Vec, GloVe, contextual embeddings)
  - Transformers and BERT
  - Fine-tuning pre-trained models

The journey from statistical NLP to deep learning is a natural progression: you'll see how embeddings generalize vectorization, how transformers improve on traditional methods, and how pre-trained models leverage the foundations you've built.

---

**Module 1 Complete! 🎉**

You now have the skills to work with text data using statistical methods. You understand the complete pipeline from raw text to actionable insights, and you're ready to explore the power of deep learning in Module 2.