# Topic Modeling: Organizing Unlabeled CVs with LDA

## Overview

This notebook demonstrates **Topic Modeling** using **Latent Dirichlet Allocation (LDA)** to organize unlabeled CVs (resumes) by automatically discovering hidden topics. Unlike supervised classification, topic modeling works with completely unlabeled data, making it ideal for organizing large document collections without manual labeling. You'll learn how to apply LDA to discover topics, interpret results, and organize documents based on their dominant topics.

> "The best way to find a needle in a haystack is to organize the haystack first."

**The Problem**: You have a folder full of CVs—unlabeled, unorganized. You need to find candidates for specific roles, but manually reading through hundreds of CVs is impossible.

## Objectives

- Understand what Topic Modeling is and why it's useful for unsupervised document organization
- Learn how LDA (Latent Dirichlet Allocation) discovers hidden topics in text collections
- Apply LDA to organize unlabeled documents automatically
- Interpret topic modeling results by examining top words and document-topic distributions
- Organize documents into folders based on their dominant topics

## Outline

1. **Introduction to Topic Modeling** - What it is and why it's useful
2. **What is LDA?** - Understanding Latent Dirichlet Allocation
3. **The Pipeline** - Complete workflow from data loading to organization
4. **Step 1: Loading Data** - Reading CVs from JSON files
5. **Step 2: Preprocessing** - Cleaning and preparing text
6. **Step 3: Vectorization** - Converting text to document-term matrix
7. **Step 4: Training LDA** - Discovering topics automatically
8. **Step 5: Analyzing Results** - Interpreting discovered topics
9. **Step 6: Organizing Documents** - Creating folders and organizing CVs by topic

## Topic Modeling

**Topic Modeling** is an **unsupervised learning** task that discovers hidden topics in a collection of unlabeled documents. Unlike classification (which requires labeled data), topic modeling finds patterns automatically.

**Example applications:**
- **Organizing unlabeled documents**: Group CVs by field (AI/ML, Data Analysis, etc.) without manual labeling
- **Understanding large text collections**: Discover what themes exist in news archives, research papers, or social media
- **Content recommendation**: Find documents similar to a given document based on topic similarity

**Why it's useful:**
- No labels needed: works with completely unlabeled data
- Interpretable: topics are defined by their top words, making them understandable
- Scalable: can process large document collections
- Flexible: number of topics can be adjusted based on the corpus

## What is LDA?

**Latent Dirichlet Allocation (LDA)** is a probabilistic model that discovers hidden topics in a collection of documents.

**Key idea**: 
- Each document is a **mixture of topics** (e.g., 70% AI/ML, 20% Data Analysis, 10% Software Engineering)
- Each topic is a **distribution over words** (e.g., Topic 1: 30% "PyTorch", 25% "TensorFlow", 20% "NLP"...)
- LDA discovers these topics automatically by finding words that tend to co-occur in the same documents

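**For our CVs**: LDA groups words that appear together into unnamed topics, which we then label ourselves (for example "AI/ML", "Data Analysis", "Big Data") by reading their top words. Each CV receives a probability for every topic, so it can be assigned to its most relevant topic(s).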
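
As a minimal sketch of these two distributions, the snippet below fits scikit-learn's `LatentDirichletAllocation` on a tiny made-up corpus. The documents and names such as `toy_docs` are illustrative assumptions, not part of the CV dataset. It shows that each document gets a topic mixture that sums to 1, and that normalizing each row of `components_` gives a distribution over words for each topic.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: two "ML-like" and two "BI-like" documents
toy_docs = [
    "pytorch tensorflow nlp models pytorch",
    "tensorflow nlp deep models training",
    "tableau dashboard reports excel tableau",
    "dashboard excel reports sql tableau",
]

toy_counts = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Document-topic matrix: one row per document, one column per topic
doc_topic = toy_lda.fit_transform(toy_counts)

# Topic-word matrix: normalize the fitted weights so each topic sums to 1
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)

print(doc_topic.round(2))      # each row is a mixture over the 2 topics
print(doc_topic.sum(axis=1))   # every row sums to 1
print(topic_word.sum(axis=1))  # every topic's word probabilities sum to 1
```

With so few documents the split between topics is not guaranteed to be clean; the point is only the shape and meaning of the two matrices.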

**Reference**: Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). [Latent Dirichlet allocation](https://dl.acm.org/doi/10.5555/944919.944937). *Journal of Machine Learning Research*, 3, 993-1022.

![Left: BoW. Right: LDA](../assets/lda.png)

## The Pipeline

1. **Load CVs**: Read all JSON files from topic folders using glob patterns and extract structured fields
2. **Preprocess**: Clean the text (remove URLs, emails, etc.)
3. **Vectorize**: Convert text to document-term matrix (Bag of Words)
4. **Train LDA**: Discover topics automatically
5. **Analyze Results**: See what topics were found and which CVs belong to each
6. **Organize**: Create folders and copy CVs based on their dominant topic

## Step 1: Loading Data

In [48]:
# %pip install numpy==1.26.4 pandas==2.3.3 scikit-learn==1.8.0 --quiet

In [49]:
# Standard library imports
import json
import re
import shutil
from pathlib import Path

# Third-party imports
import numpy as np
import pandas as pd

# Machine Learning
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

print("Libraries imported successfully!")

Libraries imported successfully!


In [50]:
# Load CVs from JSON files in all topic folders
cv_dir = Path('../datasets/CVs')
# Use glob pattern to find all JSON files in Topic_* subdirectories, excluding English versions
cv_files = sorted([f for f in cv_dir.glob('Topic_*/*.json') if not f.name.endswith('_en.json')])

# Load and extract structured data from JSON
cvs_data = []
cv_names = []
cv_file_paths = []  # Store original file paths for later copying

for file in cv_files:
    with open(file, 'r', encoding='utf-8') as f:
        data = json.load(f)
        cvs_data.append(data)
        cv_names.append(file.stem)
        cv_file_paths.append(file)  # Store the full path

print(f"Loaded {len(cvs_data)} CV files from {len(set(f.parent.name for f in cv_files))} topic folders:")
for i, name in enumerate(cv_names, 1):
    print(f"  {i}. {name}")

# Combine structured fields into text for each CV
def combine_cv_fields(cv_json):
    """Combine Heading, Skills, Projects, Experience, Education into a single text"""
    parts = []
    for field in ('Heading', 'Skills', 'Projects', 'Experience', 'Education'):
        if field in cv_json:
            value = cv_json[field]
            # List fields are joined with spaces; string fields are used as-is
            parts.append(' '.join(value) if isinstance(value, list) else value)
    return ' '.join(parts)

# Convert JSON data to text
cvs = [combine_cv_fields(cv_data) for cv_data in cvs_data]
print(f"\nCombined structured data into text for {len(cvs)} CVs")

Loaded 34 CV files from 3 topic folders:
  1. 07
  2. 10
  3. 11
  4. 15
  5. 17
  6. 20
  7. 21
  8. 22
  9. 26
  10. 29
  11. 30
  12. 33
  13. 34
  14. 39
  15. 40
  16. 08
  17. 09
  18. 12
  19. 16
  20. 18
  21. 19
  22. 23
  23. 24
  24. 31
  25. 32
  26. 35
  27. 36
  28. 37
  29. 38
  30. 13
  31. 14
  32. 25
  33. 27
  34. 28

Combined structured data into text for 34 CVs


## Step 2: Data Preprocessing

In [51]:
def preprocess_text(text):
    """Clean text: remove URLs and emails, drop punctuation, normalize whitespace"""
    # Remove emails and URLs
    text = re.sub(r'\S+@\S+', ' ', text)
    text = re.sub(r'http\S+', ' ', text)
    # Replace punctuation with spaces; \w keeps letters (any script), digits, and underscores
    text = re.sub(r'[^\w\s\u0600-\u06FF]', ' ', text)
    # Normalize whitespace last, so the spaces introduced above are collapsed too
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Preprocess all CVs
cvs_processed = [preprocess_text(cv) for cv in cvs]
print(f"Preprocessed {len(cvs_processed)} CVs")

Preprocessed 34 CVs
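
To see the cleaning steps on a concrete input, here is a small self-contained sketch. `clean_text` is a stand-in for `preprocess_text` that uses the same regular expressions and collapses whitespace as the final step; the sample string is made up for illustration.

```python
import re

def clean_text(text):
    """Sketch of the cleaning steps: strip emails/URLs, drop punctuation, collapse spaces."""
    text = re.sub(r'\S+@\S+', ' ', text)               # remove email-like tokens
    text = re.sub(r'http\S+', ' ', text)               # remove URLs starting with http/https
    text = re.sub(r'[^\w\s\u0600-\u06FF]', ' ', text)  # replace punctuation with spaces
    text = re.sub(r'\s+', ' ', text)                   # collapse runs of whitespace
    return text.strip()

# Hypothetical CV fragment
sample = "Contact: sara@example.com | https://example.com/cv  Skills: Python, SQL!"
print(clean_text(sample))  # Contact Skills Python SQL
```

Note that URLs without an `http` prefix (for example `www.example.com`) are not removed by this pattern; only their punctuation is stripped.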


## Step 3: Prepare Data for LDA

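Convert the text to a document-term matrix, the same Bag of Words representation used for classification: each row is a CV, each column is a word, and each cell is the number of times that word appears in that CV.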

In [52]:
# Create document-term matrix
vectorizer = CountVectorizer(
    max_features=1000,  # Top 1000 words
    min_df=2,           # Word must appear in at least 2 CVs
    max_df=0.8          # Ignore words in >80% of CVs
)

doc_term_matrix = vectorizer.fit_transform(cvs_processed)
feature_names = vectorizer.get_feature_names_out()

print(f"Document-Term Matrix: {doc_term_matrix.shape[0]} CVs × {doc_term_matrix.shape[1]} words")
print(f"Sparsity: {(1 - doc_term_matrix.nnz / (doc_term_matrix.shape[0] * doc_term_matrix.shape[1])) * 100:.1f}%")

Document-Term Matrix: 34 CVs × 662 words
Sparsity: 86.7%


## Step 4: Train LDA Model

In [53]:
# Train LDA model
n_topics = 3  # Number of topics to discover

lda = LatentDirichletAllocation(
    n_components=n_topics,
    random_state=42,
    max_iter=10,
    learning_method='online'
)

print(f"Training LDA to discover {n_topics} topics...")
lda.fit(doc_term_matrix)
print("✓ Training complete!")

Training LDA to discover 3 topics...
✓ Training complete!
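
The number of topics is a choice we make, not something LDA finds for us. One rough heuristic is to fit several models and compare their perplexity (lower is better). The sketch below does this on a made-up corpus; the documents and the candidate values `(2, 3, 4)` are assumptions for illustration. Perplexity measured on the training documents, as here, can favor more topics, so treat it as a hint and still read the top words.

```python
import math

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus with three loose themes
toy_docs = [
    "pytorch tensorflow nlp models research",
    "nlp models pytorch training research",
    "tableau dashboard reports excel business",
    "excel business reports dashboard analyst",
    "spark hadoop kafka aws platform",
    "hadoop spark aws streaming platform",
]
toy_counts = CountVectorizer().fit_transform(toy_docs)

perplexities = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, random_state=42)
    model.fit(toy_counts)
    perplexities[k] = model.perplexity(toy_counts)

for k, value in perplexities.items():
    print(f"{k} topics: perplexity = {value:.1f}")
```

In the notebook, the same loop could be run over `doc_term_matrix`, ideally scoring a held-out subset of CVs rather than the training data.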


## Step 5: Analyze Results

Let's see what topics LDA discovered and which words define each topic.

In [54]:
# Display top words for each topic
def display_topics(model, feature_names, n_top_words=10):
    """Display top words for each topic"""
    for topic_idx, topic in enumerate(model.components_):
        top_words_idx = topic.argsort()[-n_top_words:][::-1]
        top_words = [feature_names[i] for i in top_words_idx]
        top_weights = [topic[i] for i in top_words_idx]
        
        print(f"\nTopic {topic_idx + 1}:")
        print("  Top words:", ", ".join(top_words))
        print("  Weights:", [f"{w:.3f}" for w in top_weights])

display_topics(lda, feature_names, n_top_words=10)


Topic 1:
  Top words: engineer, analyst, analytics, 2019, 2021, business, with, 10, bi, governance
  Weights: ['25.418', '25.095', '21.468', '20.625', '19.749', '19.336', '18.298', '16.163', '16.113', '15.502']

Topic 2:
  Top words: ai, on, models, engineer, research, computer, model, 06, 08, engineering
  Weights: ['32.898', '28.287', '22.209', '20.499', '20.153', '20.043', '19.232', '16.574', '16.569', '15.889']

Topic 3:
  Top words: engineer, spark, big, hadoop, on, aws, platform, 2021, 01, time
  Weights: ['7.766', '7.080', '5.884', '5.601', '4.821', '4.686', '4.546', '4.334', '4.237', '4.199']


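**Interpreting the topics**: Look at the top words for each topic. Can you guess what each topic represents? For example:
- Topic with "PyTorch", "TensorFlow", "NLP" → probably AI/ML
- Topic with "Tableau", "Power BI", "dashboard" → probably Data Analysis
- Topic with "Hadoop", "Spark", "Kafka" → probably Big Data

In the output above, Topic 1 ("analyst", "analytics", "business", "bi") looks like Data Analysis, Topic 2 ("ai", "models", "research") like AI/ML, and Topic 3 ("spark", "big", "hadoop", "aws") like Big Data. Filler tokens such as "with", "on", and "2021" also rank highly because the vectorizer was not given a stop-word list.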
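
The weights printed with each topic come from `components_`, which scikit-learn documents as pseudo-counts rather than probabilities. Dividing each row by its sum turns them into a distribution over words for each topic. The helper below is a minimal sketch; `top_word_probabilities` and the demo arrays are hypothetical names and values, not part of the notebook.

```python
import numpy as np

def top_word_probabilities(components, feature_names, n_top_words=3):
    """Return, per topic, the top words with weights normalized to probabilities."""
    probs = components / components.sum(axis=1, keepdims=True)
    result = []
    for topic_probs in probs:
        top_idx = np.argsort(topic_probs)[::-1][:n_top_words]
        result.append([(feature_names[i], float(topic_probs[i])) for i in top_idx])
    return result

# Hypothetical pseudo-counts for 2 topics over a 4-word vocabulary
demo_components = np.array([[6.0, 3.0, 0.5, 0.5],
                            [1.0, 1.0, 5.0, 3.0]])
demo_vocab = ["ai", "models", "spark", "hadoop"]

for i, words in enumerate(top_word_probabilities(demo_components, demo_vocab, 2), 1):
    print(f"Topic {i}:", ", ".join(f"{w} ({p:.0%})" for w, p in words))
```

On the fitted model, the equivalent call would be `top_word_probabilities(lda.components_, feature_names, 10)`.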

Now let's see which CV belongs to which topic:

In [55]:
# Get topic distribution for each CV
doc_topic_dist = lda.transform(doc_term_matrix)

# Find dominant topic for each CV
dominant_topics = doc_topic_dist.argmax(axis=1)

# Create a DataFrame to see results
df_results = pd.DataFrame({
    'CV': cv_names,
    'Dominant Topic': dominant_topics + 1,
    'Topic Probabilities': list(doc_topic_dist)  # one probability vector per CV
})

# Show which CVs belong to which topic
print("CV Assignment to Topics:")
print("=" * 60)
for topic_id in range(n_topics):
    topic_cvs = df_results[df_results['Dominant Topic'] == topic_id + 1]
    print(f"\nTopic {topic_id + 1} ({len(topic_cvs)} CVs):")
    for idx, row in topic_cvs.iterrows():
        prob = row['Topic Probabilities'][topic_id]
        print(f"  - {row['CV']} ({prob:.1%})")

CV Assignment to Topics:

Topic 1 (17 CVs):
  - 15 (99.4%)
  - 29 (99.4%)
  - 30 (99.3%)
  - 08 (99.4%)
  - 09 (99.5%)
  - 18 (99.4%)
  - 19 (99.4%)
  - 23 (75.0%)
  - 24 (79.7%)
  - 31 (99.2%)
  - 35 (99.3%)
  - 36 (99.3%)
  - 37 (99.3%)
  - 38 (99.4%)
  - 14 (99.5%)
  - 27 (99.4%)
  - 28 (99.4%)

Topic 2 (14 CVs):
  - 07 (99.5%)
  - 10 (99.4%)
  - 11 (99.5%)
  - 17 (99.4%)
  - 20 (99.4%)
  - 21 (99.4%)
  - 22 (99.4%)
  - 26 (99.4%)
  - 33 (99.4%)
  - 34 (99.4%)
  - 39 (99.4%)
  - 40 (99.4%)
  - 13 (84.1%)
  - 25 (99.3%)

Topic 3 (3 CVs):
  - 12 (99.5%)
  - 16 (99.3%)
  - 32 (99.4%)
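
Most CVs above sit almost entirely in one topic, but a few (23 at 75.0%, 24 at 79.7%, 13 at 84.1%) leave a noticeable share to other topics. A simple way to surface such mixed documents is to keep every topic whose probability reaches a threshold. The sketch below assumes a cutoff of 0.15, which is an arbitrary choice for illustration; the function name and demo values are hypothetical.

```python
import numpy as np

def topics_above_threshold(doc_topic_dist, threshold=0.15):
    """For each document, list the 1-based topic ids whose probability meets the threshold."""
    return [
        [int(t) + 1 for t in np.flatnonzero(row >= threshold)]
        for row in doc_topic_dist
    ]

# Hypothetical distributions: the first CV is clearly one topic, the second is mixed
demo_dist = np.array([
    [0.99, 0.005, 0.005],
    [0.75, 0.05, 0.20],
])
print(topics_above_threshold(demo_dist))  # [[1], [1, 3]]
```

Applied to `doc_topic_dist` from the cell above, this would return one list of topic ids per CV, which could be used to copy a mixed CV into more than one folder.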


## Step 6: Organize CVs into Folders

Now comes the practical part: **automatically organize CVs into folders** based on their dominant topic!

In [56]:
# Create output directory structure
output_dir = Path('output/organized_cvs')
output_dir.mkdir(parents=True, exist_ok=True)

# Create a folder for each topic
for topic_id in range(n_topics):
    topic_dir = output_dir / f"Topic_{topic_id + 1}"
    topic_dir.mkdir(exist_ok=True)

# Copy each CV to its topic folder
for cv_name, topic_id, source_file in zip(cv_names, dominant_topics, cv_file_paths):
    target_dir = output_dir / f"Topic_{topic_id + 1}"
    target_file = target_dir / f"{cv_name}.json"
    
    shutil.copy2(source_file, target_file)
    print(f"Copied {cv_name}.json → Topic_{topic_id + 1}/")

print(f"\n✓ Organization complete! CVs are now in: {output_dir}")

Copied 07.json → Topic_2/
Copied 10.json → Topic_2/
Copied 11.json → Topic_2/
Copied 15.json → Topic_1/
Copied 17.json → Topic_2/
Copied 20.json → Topic_2/
Copied 21.json → Topic_2/
Copied 22.json → Topic_2/
Copied 26.json → Topic_2/
Copied 29.json → Topic_1/
Copied 30.json → Topic_1/
Copied 33.json → Topic_2/
Copied 34.json → Topic_2/
Copied 39.json → Topic_2/
Copied 40.json → Topic_2/
Copied 08.json → Topic_1/
Copied 09.json → Topic_1/
Copied 12.json → Topic_3/
Copied 16.json → Topic_3/
Copied 18.json → Topic_1/
Copied 19.json → Topic_1/
Copied 23.json → Topic_1/
Copied 24.json → Topic_1/
Copied 31.json → Topic_1/
Copied 32.json → Topic_3/
Copied 35.json → Topic_1/
Copied 36.json → Topic_1/
Copied 37.json → Topic_1/
Copied 38.json → Topic_1/
Copied 13.json → Topic_2/
Copied 14.json → Topic_1/
Copied 25.json → Topic_2/
Copied 27.json → Topic_1/
Copied 28.json → Topic_1/

✓ Organization complete! CVs are now in: output/organized_cvs


### Verify the Organization

Let's check what's in each folder:

In [57]:
# Show contents of each topic folder
for topic_id in range(n_topics):
    topic_dir = output_dir / f"Topic_{topic_id + 1}"
    files = list(topic_dir.glob('*.json'))
    print(f"\nTopic_{topic_id + 1}/ ({len(files)} CVs):")
    for f in sorted(files):
        print(f"  - {f.name}")


Topic_1/ (17 CVs):
  - 08.json
  - 09.json
  - 14.json
  - 15.json
  - 18.json
  - 19.json
  - 23.json
  - 24.json
  - 27.json
  - 28.json
  - 29.json
  - 30.json
  - 31.json
  - 35.json
  - 36.json
  - 37.json
  - 38.json

Topic_2/ (14 CVs):
  - 07.json
  - 10.json
  - 11.json
  - 13.json
  - 17.json
  - 20.json
  - 21.json
  - 22.json
  - 25.json
  - 26.json
  - 33.json
  - 34.json
  - 39.json
  - 40.json

Topic_3/ (3 CVs):
  - 12.json
  - 16.json
  - 32.json
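
One caveat: if the notebook is rerun after retraining with a different `n_topics` or random seed, copies from the previous run stay in place, so the same CV can end up in two topic folders. The sketch below shows one way around this, assuming it is acceptable to delete and rebuild the output directory on every run. `organize_by_topic` is a hypothetical helper, demonstrated on temporary files rather than the real CVs.

```python
import shutil
import tempfile
from pathlib import Path

def organize_by_topic(files, topic_ids, output_dir):
    """Rebuild output_dir from scratch and copy each file into Topic_<id>/."""
    output_dir = Path(output_dir)
    shutil.rmtree(output_dir, ignore_errors=True)  # drop results of earlier runs
    for source, topic_id in zip(files, topic_ids):
        target_dir = output_dir / f"Topic_{topic_id}"
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target_dir / Path(source).name)

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    src = tmp / "src"
    src.mkdir()
    demo_files = []
    for name in ("07.json", "12.json"):
        path = src / name
        path.write_text("{}", encoding="utf-8")
        demo_files.append(path)

    out = tmp / "organized"
    organize_by_topic(demo_files, [2, 3], out)  # first run
    organize_by_topic(demo_files, [1, 3], out)  # rerun with a different assignment
    layout = sorted(p.relative_to(out).as_posix() for p in out.rglob("*.json"))
    print(layout)  # ['Topic_1/07.json', 'Topic_3/12.json']
```

Because the directory is removed first, only the latest assignment remains. Point it only at a directory that holds generated copies, never at the original CVs.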


## **Student Exercise**: discover topics on a dataset of your choice

In [58]:
# STUDENT EXERCISE

In [105]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd


data = fetch_20newsgroups(
    subset='train',
    remove=('headers', 'footers', 'quotes'),
    shuffle=True,
    random_state=42
)

df = pd.DataFrame({
    'text': data.data,
    'category': data.target,
    'category_name': [data.target_names[t] for t in data.target]
})
df.head()

| | text | category | category_name |
|---|---|---|---|
| 0 | I was wondering if anyone out there could enli... | 7 | rec.autos |
| 1 | A fair number of brave souls who upgraded thei... | 4 | comp.sys.mac.hardware |
| 2 | well folks, my mac plus finally gave up the gh... | 4 | comp.sys.mac.hardware |
| 3 | \nDo you have Weitek's address/phone number? ... | 1 | comp.graphics |
| 4 | From article <C5owCB.n3p@world.std.com>, by to... | 14 | sci.space |


In [106]:
import re

def preprocessing(text):
    text = re.sub(r'http\S+', ' ', text)
    text = re.sub(r'\S+@\S+', ' ', text)
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    
    return text.lower().strip()


df['texts_clean'] = df['text'].apply(preprocessing)


In [107]:
from sklearn.feature_extraction.text import CountVectorizer


vectorizer = CountVectorizer(
    stop_words='english',
    max_features=1000,
    max_df=0.9,
    min_df=2,
    token_pattern=r'\b[a-zA-Z]{3,}\b'
)

X = vectorizer.fit_transform(df['texts_clean'])
feature_names = vectorizer.get_feature_names_out()

In [123]:
from sklearn.decomposition import LatentDirichletAllocation

num_topics = 8

lda = LatentDirichletAllocation(
    n_components=num_topics,
    random_state=42,
    max_iter=10,
    learning_method='online'
)

lda.fit(X)


| Parameter | Value |
|---|---|
| n_components | 8 |
| doc_topic_prior | None |
| topic_word_prior | None |
| learning_method | 'online' |
| learning_decay | 0.7 |
| learning_offset | 10.0 |
| max_iter | 10 |
| batch_size | 128 |
| evaluate_every | -1 |
| total_samples | 1000000.0 |


In [124]:

def display_topics(model, feature_names, n_top_words=10):
    for topic_idx, topic in enumerate(model.components_):
        top_words_idx = topic.argsort()[-n_top_words:][::-1]
        words = [feature_names[i] for i in top_words_idx]
        
        print(f"\nTopic {topic_idx + 1}:")
        print(", ".join(words))

display_topics(lda, feature_names, n_top_words=10)


Topic 1:
good, new, used, like, time, best, book, price, does, original

Topic 2:
don, just, people, know, think, like, does, way, say, right

Topic 3:
file, program, windows, use, available, files, mail, window, software, ftp

Topic 4:
key, use, information, encryption, chip, public, keys, clipper, number, used

Topic 5:
year, game, team, car, play, games, new, got, years, good

Topic 6:
max, drive, card, disk, scsi, dos, mac, bit, video, memory

Topic 7:
god, people, jesus, said, bible, armenian, christian, jews, life, did

Topic 8:
space, government, president, national, state, new, gun, states, american, university


In [132]:
doc_topic_dist = lda.transform(X)

# Dominant topic per document, shifted to 1-based ids to match the topic printout
df['dominant_topic'] = doc_topic_dist.argmax(axis=1) + 1



df['dominant_topic'].value_counts().sort_index()

dominant_topic
1    1760
2    3938
3    1415
4     398
5    1304
6     873
7     877
8     749
Name: count, dtype: int64

In [133]:
# Human-readable labels chosen by reading each topic's top words
topic_names = {
    1: 'Reviews',
    2: 'General Discussion',
    3: 'Software & Files',
    4: 'Security',
    5: 'Sports',
    6: 'Computer',
    7: 'Religion',
    8: 'Politics',
}

df['topic_names'] = df['dominant_topic'].map(topic_names)

In [134]:
for t in sorted(df['dominant_topic'].unique()):
    print('\n'+ '='*80)
    print(f"Topic: {t} - {topic_names.get(t,'Unknown')}")
    print('='*80)

    sample_texts = df[df['dominant_topic']==t].head(3)['text']

    for i, txt in enumerate(sample_texts, 1):
        print(f'\n--- Example{i} ---')
        print(txt[:500])



Topic: 1 - Reviews

--- Example1 ---
I have a line on a Ducati 900GTS 1978 model with 17k on the clock.  Runs
very well, paint is the bronze/brown/orange faded out, leaks a bit of oil
and pops out of 1st with hard accel.  The shop will fix trans and oil 
leak.  They sold the bike to the 1 and only owner.  They want $3495, and
I am thinking more like $3K.  Any opinions out there?  Please email me.
Thanks.  It would be a nice stable mate to the Beemer.  Then I'll get
a jap bike and call myself Axis Motors!

-- 
----------------------

--- Example2 ---
--


--- Example3 ---
Reduced Prices! 
I have a list of things forsale on behalf of my brother, who's moving (moved
already)

								Offer:
1) Black and Decker Duster Plus (Portable Hand Vaccum)	
 	purchased for $32, 					  $12

2) SR-1000 Dual Cassette Portable Player, AM/FM
5-Band graphics Equalizer, high speed dubing, Duo 
Tape.Tape deck A, seems to have lost treble sound. 
But, I bet  it's fixable.
	purchased for $80					  $25

3)Mono

In [135]:
df

| | text | category | category_name | texts_clean | dominant_topic | topic_names |
|---|---|---|---|---|---|---|
| 0 | I was wondering if anyone out there could enli... | 7 | rec.autos | i was wondering if anyone out there could enli... | 5 | Sports |
| 1 | A fair number of brave souls who upgraded thei... | 4 | comp.sys.mac.hardware | a fair number of brave souls who upgraded thei... | 6 | Computer |
| 2 | well folks, my mac plus finally gave up the gh... | 4 | comp.sys.mac.hardware | well folks my mac plus finally gave up the gho... | 2 | General Discussion |
| 3 | \nDo you have Weitek's address/phone number? ... | 1 | comp.graphics | do you have weitek s address phone number i d ... | 4 | Security |
| 4 | From article <C5owCB.n3p@world.std.com>, by to... | 14 | sci.space | from article by tom a baker my understanding i... | 2 | General Discussion |
| ... | ... | ... | ... | ... | ... | ... |
| 11309 | DN> From: nyeda@cnsvax.uwec.edu (David Nye)\nD... | 13 | sci.med | dn from david nye dn a neurology dn consultati... | 2 | General Discussion |
| 11310 | I have a (very old) Mac 512k and a Mac Plus, b... | 4 | comp.sys.mac.hardware | i have a very old mac k and a mac plus both of... | 6 | Computer |
| 11311 | I just installed a DX2-66 CPU in a clone mothe... | 3 | comp.sys.ibm.pc.hardware | i just installed a dx cpu in a clone motherboa... | 2 | General Discussion |
| 11312 | \nWouldn't this require a hyper-sphere. In 3-... | 1 | comp.graphics | wouldn t this require a hyper sphere in space ... | 2 | General Discussion |
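
Unlike the CVs, the 20 Newsgroups data comes with true categories, so the discovered topics can be sanity-checked against them. A cross-tabulation shows how each newsgroup spreads over the topic labels. The sketch below uses a tiny hypothetical DataFrame with the same two column names as `df`; the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical mini-version of the exercise DataFrame
demo = pd.DataFrame({
    "category_name": ["rec.autos", "rec.autos", "sci.space", "sci.space", "sci.space"],
    "topic_names": ["Sports", "Sports", "Politics", "Politics", "General Discussion"],
})

# Rows: true newsgroup; columns: label of the dominant LDA topic; cells: document counts
table = pd.crosstab(demo["category_name"], demo["topic_names"])
print(table)
```

On the real data, `pd.crosstab(df['category_name'], df['topic_names'])` would show, for instance, how many `rec.autos` posts land in "Sports" and how many documents fall into the catch-all "General Discussion" topic.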


## Summary

**What we accomplished**:
1. ✅ Loaded unlabeled CVs from a folder
2. ✅ Preprocessed the text data
3. ✅ Created a document-term matrix
4. ✅ Trained an LDA model to discover topics
5. ✅ Analyzed which CVs belong to which topic
6. ✅ **Automatically organized CVs into folders** based on discovered topics

**Key Takeaways**:
- **LDA discovers topics automatically** by finding words that tend to co-occur in the same documents
- **Each document is a mixture of topics** - LDA assigns probabilities
- **Topic modeling is unsupervised** - no labels needed!
- **Practical application**: Organize unlabeled documents automatically

**Next Steps**:
- Try different numbers of topics (`n_topics`) and see how results change
- Experiment with preprocessing (stemming, stop-word removal)
- Use topic probabilities to handle CVs that belong to multiple topics
- Visualize topics using tools like pyLDAvis

**References**:
- [Scikit-learn LDA documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)
- [Topic modeling visualization guide](https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/)

---


## Module 1 Synthesis: The Complete Pipeline

Congratulations! You've completed **Module 1: Text Analysis with Statistical NLP**. Let's reflect on the journey and see how all the pieces fit together.

### The Circular Learning Experience

Remember the question chain we started with? Let's trace how we answered each question and built a complete NLP pipeline:

1. **"What is NLP?"** → We learned that NLP bridges computers and human language, with applications in understanding and generation.

2. **"How do we extract patterns from text?"** → We used **Regular Expressions** to find, match, and manipulate text patterns—essential for preprocessing.

3. **"How do we understand our data?"** → We performed **Exploratory Data Analysis (EDA)** on corpora to assess data quality, vocabulary characteristics, and preprocessing needs.

4. **"How do we prepare text for ML?"** → We applied **Preprocessing** techniques (cleaning, normalization, tokenization, stemming) to transform raw text into clean tokens.

5. **"How do we convert text to numbers?"** → We used **Vectorization** (BoW, TF-IDF) to convert text into numerical features that ML models can process.

6. **"How do we build classifiers?"** → We built **Text Classification** models (like sentiment analysis) using vectorized features and supervised learning.

7. **"How do we search documents?"** → We implemented **Information Retrieval** systems using TF-IDF and cosine similarity to find relevant documents.

8. **"How do we discover topics?"** → We applied **Topic Modeling** (LDA) to automatically organize unlabeled documents by discovering hidden topics.

### The Complete NLP Pipeline

Throughout this module, you've learned to build a complete NLP pipeline:

```
Raw Text
    ↓
[Regex: Pattern Extraction]
    ↓
[Corpus & EDA: Understanding Data]
    ↓
[Preprocessing: Cleaning & Normalization]
    ↓
[Vectorization: Text → Numbers]
    ↓
[Modeling: Classification / IR / Topic Modeling]
    ↓
Actionable Insights
```

### Key Skills You've Acquired

By completing this module, you can now:

✅ **Build supervised ML text classification pipelines**
- Preprocess Arabic and English text
- Vectorize text using BoW and TF-IDF
- Train and evaluate classifiers
- Interpret model results

✅ **Apply keyword-based information retrieval**
- Implement TF-IDF-based search engines
- Measure document similarity using cosine similarity
- Rank and retrieve relevant documents

✅ **Apply unsupervised ML for document organization**
- Discover hidden topics using LDA
- Organize unlabeled documents automatically
- Interpret topic modeling results

### The Foundation for What's Next

This module focused on **statistical NLP**: traditional methods that work well for many tasks. In **Module 2**, you'll learn about **Deep Learning approaches** (embeddings, transformers) that build on these foundations and often perform better on complex language tasks, at the cost of more data and computation.

**What you learned here is still valuable:**
- Preprocessing techniques apply to both statistical and deep learning methods
- Understanding vectorization helps you understand embeddings
- EDA is always the first step, regardless of the approach
- The pipeline structure (preprocess → vectorize → model) remains the same

### Reflection Questions

Before moving to Module 2, consider:

1. **When would you use statistical NLP vs. deep learning?**
   - Statistical NLP: Fast, interpretable, works with small data
   - Deep Learning: Often higher accuracy, but requires more data and computation

2. **What preprocessing steps are most important?**
   - Depends on your data and task, but EDA always guides the decision

3. **How does TF-IDF differ from BoW?**
   - BoW: Simple word counts
   - TF-IDF: Weighted counts that emphasize distinctive words

4. **When would you use topic modeling vs. classification?**
   - Classification: When you have labels and want to predict categories
   - Topic Modeling: When you have no labels and want to discover structure

### The Journey Continues

You've built a solid foundation in statistical NLP. The concepts you've learned—preprocessing, vectorization, classification, retrieval, and topic modeling—are the building blocks for more advanced techniques.

**Next Module Preview:**
- **Module 2** introduces **Deep Learning for NLP**:
  - Tokenization with modern tools (WordPiece, BPE)
  - Word embeddings (Word2Vec, GloVe, contextual embeddings)
  - Transformers and BERT
  - Fine-tuning pre-trained models

The journey from statistical NLP to deep learning is a natural progression—you'll see how embeddings generalize vectorization, how transformers improve on traditional methods, and how pre-trained models leverage the foundations you've built.

---

**Module 1 Complete! 🎉**

You now have the skills to work with text data using statistical methods. You understand the complete pipeline from raw text to actionable insights, and you're ready to explore the power of deep learning in Module 2.