# Topic Modeling: Organizing Unlabeled CVs with LDA

## Overview

This notebook demonstrates **Topic Modeling** using **Latent Dirichlet Allocation (LDA)** to organize unlabeled CVs (resumes) by automatically discovering hidden topics. Unlike supervised classification, topic modeling works with completely unlabeled data, making it ideal for organizing large document collections without manual labeling. You'll learn how to apply LDA to discover topics, interpret results, and organize documents based on their dominant topics.

> "The best way to find a needle in a haystack is to organize the haystack first."

**The Problem**: You have a folder full of CVs, unlabeled and unorganized. You need to find candidates for specific roles, but manually reading through hundreds of CVs is impractical.

## Objectives

- Understand what Topic Modeling is and why it's useful for unsupervised document organization
- Learn how LDA (Latent Dirichlet Allocation) discovers hidden topics in text collections
- Apply LDA to organize unlabeled documents automatically
- Interpret topic modeling results by examining top words and document-topic distributions
- Organize documents into folders based on their dominant topics

## Outline

1. **Introduction to Topic Modeling** - What it is and why it's useful
2. **What is LDA?** - Understanding Latent Dirichlet Allocation
3. **The Pipeline** - Complete workflow from data loading to organization
4. **Step 1: Loading Data** - Reading CVs from JSON files
5. **Step 2: Preprocessing** - Cleaning and preparing text
6. **Step 3: Vectorization** - Converting text to document-term matrix
7. **Step 4: Training LDA** - Discovering topics automatically
8. **Step 5: Analyzing Results** - Interpreting discovered topics
9. **Step 6: Organizing Documents** - Creating folders and organizing CVs by topic

## Topic Modeling

**Topic Modeling** is an **unsupervised learning** task that discovers hidden topics in a collection of unlabeled documents. Unlike classification (which requires labeled data), topic modeling finds patterns automatically.

**Example applications:**
- **Organizing unlabeled documents**: Group CVs by field (AI/ML, Data Analysis, etc.) without manual labeling
- **Understanding large text collections**: Discover what themes exist in news archives, research papers, or social media
- **Content recommendation**: Find documents similar to a given document based on topic similarity

**Why it's useful:**
- No labels needed: works with completely unlabeled data
- Interpretable: topics are defined by their top words, making them understandable
- Scalable: can process large document collections
- Flexible: number of topics can be adjusted based on the corpus
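
The content-recommendation use case above can be sketched in a few lines: once each document has a topic mixture, "similar documents" are simply those whose mixtures point in the same direction. The topic mixtures and document names below are made up for illustration, and cosine similarity is one reasonable choice of measure, not the only one.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical topic mixtures (AI/ML, Data Analysis, Big Data) for three documents
topic_mix = {
    "doc_a": [0.80, 0.15, 0.05],
    "doc_b": [0.70, 0.20, 0.10],
    "doc_c": [0.05, 0.10, 0.85],
}

def most_similar(query, mixes):
    """Return the other document whose topic mixture is closest to `query`."""
    others = [name for name in mixes if name != query]
    return max(others, key=lambda name: cosine_similarity(mixes[query], mixes[name]))

print(most_similar("doc_a", topic_mix))  # doc_b
```

Here `doc_a` and `doc_b` are both mostly AI/ML, so they come out as the closest pair even though they may share few exact words.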

## What is LDA?

**Latent Dirichlet Allocation (LDA)** is a probabilistic model that discovers hidden topics in a collection of documents.

**Key idea**: 
- Each document is a **mixture of topics** (e.g., 70% AI/ML, 20% Data Analysis, 10% Software Engineering)
- Each topic is a **distribution over words** (e.g., Topic 1: 30% "PyTorch", 25% "TensorFlow", 20% "NLP"...)
- LDA discovers these topics automatically by finding words that co-occur together

**For our CVs**: LDA will discover topics like "AI/ML", "Data Analysis", "Big Data" by looking at which words appear together, then assign each CV to the most relevant topic(s).
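
To make the "mixture of topics" idea concrete, here is a minimal sketch of the story LDA assumes about how a document is written: pick a topic for each word according to the document's mixture, then pick a word from that topic. The vocabulary, the number of topics, and the Dirichlet parameter `0.5` are arbitrary choices for illustration. Training LDA runs this story in reverse, inferring the two distributions from the observed words.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["pytorch", "tensorflow", "nlp", "tableau", "dashboard", "sql"]
n_topics = 2

# Each topic is a probability distribution over the vocabulary
topic_word = rng.dirichlet(alpha=np.full(len(vocab), 0.5), size=n_topics)

# Each document is a probability distribution over topics
doc_topic = rng.dirichlet(alpha=np.full(n_topics, 0.5))

def generate_document(n_words):
    """Generate words by picking a topic per word, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choice(n_topics, p=doc_topic)
        word = rng.choice(vocab, p=topic_word[topic])
        words.append(str(word))
    return words

document = generate_document(8)
print("Topic mixture:", doc_topic.round(2))
print("Generated words:", document)
```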

**Reference**: Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). [Latent Dirichlet allocation](https://dl.acm.org/doi/10.5555/944919.944937). *Journal of Machine Learning Research*, 3(Jan), 993-1022.

![Left: BoW. Right: LDA](../assets/lda.png)

## The Pipeline

1. **Load CVs**: Read all JSON files from topic folders using glob patterns and extract structured fields
2. **Preprocess**: Clean the text (remove URLs, emails, etc.)
3. **Vectorize**: Convert text to document-term matrix (Bag of Words)
4. **Train LDA**: Discover topics automatically
5. **Analyze Results**: See what topics were found and which CVs belong to each
6. **Organize**: Create folders and copy CVs based on their dominant topic
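
Before walking through the steps one by one, here is a compact sketch of steps 3 to 5 on a toy corpus, chained with scikit-learn's `make_pipeline`. The four sentences stand in for CVs and are invented for illustration; loading, cleaning, and copying files (steps 1, 2, and 6) are left out.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy stand-ins for CV text (hypothetical)
toy_docs = [
    "pytorch tensorflow nlp deep learning models",
    "tableau dashboard sql reporting analytics",
    "pytorch nlp research models training",
    "sql dashboard analytics business reporting",
]

# Vectorize (step 3) and train LDA (step 4) in one object
pipeline = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
)

# Each row is one document's topic mixture (step 5)
doc_topic = pipeline.fit_transform(toy_docs)
dominant = doc_topic.argmax(axis=1)

for text, topic in zip(toy_docs, dominant):
    print(f"Topic {topic + 1}: {text}")
```

The notebook keeps the vectorizer and the model as separate objects instead, which makes it easier to inspect the vocabulary and the topic-word weights along the way.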

## Step 1: Loading Data

In [21]:
# File > Open Folder > W5_NLP
# (VS Code root should be at W5_NLP)
# Then run: `uv sync`

In [22]:
import os

# Clone the course repository once; skip if it is already present
if not os.path.exists('B5'):
    !git clone https://github.com/AFAskar/B5

In [23]:
# Standard library imports
import json
import re
import shutil
from pathlib import Path

# Third-party imports
import numpy as np
import pandas as pd

# Machine Learning
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

print("Libraries imported successfully!")

Libraries imported successfully!


In [24]:
# Load CVs from JSON files in all topic folders
cv_dir = Path('B5/W5_NLP/M1/datasets/CVs')
# Use glob pattern to find all JSON files in Topic_* subdirectories, excluding English versions
cv_files = sorted([f for f in cv_dir.glob('Topic_*/*.json') if not f.name.endswith('_en.json')])

# Load and extract structured data from JSON
cvs_data = []
cv_names = []
cv_file_paths = []  # Store original file paths for later copying

for file in cv_files:
    with open(file, 'r', encoding='utf-8') as f:
        data = json.load(f)
        cvs_data.append(data)
        cv_names.append(file.stem)
        cv_file_paths.append(file)  # Store the full path

print(f"Loaded {len(cvs_data)} CV files from {len(set(f.parent.name for f in cv_files))} topic folders:")
for i, name in enumerate(cv_names, 1):
    print(f"  {i}. {name}")

# Combine structured fields into text for each CV
def combine_cv_fields(cv_json):
    """Combine Heading, Skills, Projects, Experience, Education into a single text"""
    parts = []
    
    # Add heading
    if 'Heading' in cv_json:
        parts.append(cv_json['Heading'])
    
    # Add skills (join list items)
    if 'Skills' in cv_json:
        skills_text = ' '.join(cv_json['Skills']) if isinstance(cv_json['Skills'], list) else cv_json['Skills']
        parts.append(skills_text)
    
    # Add projects
    if 'Projects' in cv_json:
        projects_text = ' '.join(cv_json['Projects']) if isinstance(cv_json['Projects'], list) else cv_json['Projects']
        parts.append(projects_text)
    
    # Add experience
    if 'Experience' in cv_json:
        exp_text = ' '.join(cv_json['Experience']) if isinstance(cv_json['Experience'], list) else cv_json['Experience']
        parts.append(exp_text)
    
    # Add education
    if 'Education' in cv_json:
        edu_text = ' '.join(cv_json['Education']) if isinstance(cv_json['Education'], list) else cv_json['Education']
        parts.append(edu_text)
    
    return ' '.join(parts)

# Convert JSON data to text
cvs = [combine_cv_fields(cv_data) for cv_data in cvs_data]
print(f"\nCombined structured data into text for {len(cvs)} CVs")

Loaded 34 CV files from 3 topic folders:
  1. 07
  2. 10
  3. 11
  4. 15
  5. 17
  6. 20
  7. 21
  8. 22
  9. 26
  10. 29
  11. 30
  12. 33
  13. 34
  14. 39
  15. 40
  16. 08
  17. 09
  18. 12
  19. 16
  20. 18
  21. 19
  22. 23
  23. 24
  24. 31
  25. 32
  26. 35
  27. 36
  28. 37
  29. 38
  30. 13
  31. 14
  32. 25
  33. 27
  34. 28

Combined structured data into text for 34 CVs
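
The five near-identical branches in `combine_cv_fields` can also be written as a loop over field names. The sketch below is one possible variant, not a replacement the notebook depends on: `combine_cv_fields_compact` and the sample record are hypothetical, and unlike the original it converts every item to a string, so nested non-string items would be stringified rather than raise an error.

```python
CV_FIELDS = ("Heading", "Skills", "Projects", "Experience", "Education")

def combine_cv_fields_compact(cv_json, fields=CV_FIELDS):
    """Join the selected CV fields into one string, flattening list values."""
    parts = []
    for field in fields:
        value = cv_json.get(field)
        if value is None:
            continue  # field missing from this CV
        if isinstance(value, list):
            value = " ".join(str(item) for item in value)
        parts.append(str(value))
    return " ".join(parts)

# Hypothetical record with the same keys the notebook expects
sample_cv = {
    "Heading": "Data Engineer",
    "Skills": ["Spark", "Hadoop"],
    "Education": "BSc Computer Science",
}

print(combine_cv_fields_compact(sample_cv))
# Data Engineer Spark Hadoop BSc Computer Science
```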


## Step 2: Data Preprocessing

In [25]:
def preprocess_text(text):
    """Clean text: remove emails, URLs, and punctuation, then normalize whitespace"""
    # Remove emails and URLs
    text = re.sub(r'\S+@\S+', '', text)
    text = re.sub(r'http\S+', '', text)
    # Replace punctuation and symbols with spaces; word characters (letters,
    # digits, underscore), whitespace, and the Arabic block are kept
    text = re.sub(r'[^\w\s\u0600-\u06FF]', ' ', text)
    # Normalize whitespace last, so the spaces introduced above are collapsed
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Preprocess all CVs
cvs_processed = [preprocess_text(cv) for cv in cvs]
print(f"Preprocessed {len(cvs_processed)} CVs")

Preprocessed 34 CVs
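
To see what each cleaning step contributes, the sketch below applies the same four regular expressions one at a time to a made-up string and records the intermediate results. `clean_text_demo` is a hypothetical helper for illustration only. It runs punctuation removal before whitespace normalization so that no double spaces are left behind.

```python
import re

def clean_text_demo(text):
    """Hypothetical helper: apply the four cleaning steps and record each result."""
    steps = []
    text = re.sub(r'\S+@\S+', '', text)                # 1. drop emails
    steps.append(("emails removed", text))
    text = re.sub(r'http\S+', '', text)                # 2. drop URLs
    steps.append(("URLs removed", text))
    text = re.sub(r'[^\w\s\u0600-\u06FF]', ' ', text)  # 3. punctuation -> spaces
    steps.append(("punctuation removed", text))
    text = re.sub(r'\s+', ' ', text).strip()           # 4. collapse whitespace
    steps.append(("whitespace normalized", text))
    return text, steps

sample = "Email: sara@example.com, portfolio: https://example.com/sara. Skills: Python/SQL (5 years)"
cleaned, steps = clean_text_demo(sample)

for label, value in steps:
    print(f"{label:>22}: {value!r}")

print(cleaned)  # Email portfolio Skills Python SQL 5 years
```

Note that the email pattern also swallows the trailing comma and the URL pattern the trailing period, because `\S+` matches any run of non-space characters.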


## Step 3: Prepare Data for LDA

Convert text to a document-term matrix of word counts (the same Bag of Words representation used for classification).

In [26]:
# Create document-term matrix
vectorizer = CountVectorizer(
    max_features=1000,  # Top 1000 words
    min_df=2,           # Word must appear in at least 2 CVs
    max_df=0.8          # Ignore words in >80% of CVs
)

doc_term_matrix = vectorizer.fit_transform(cvs_processed)
feature_names = vectorizer.get_feature_names_out()

print(f"Document-Term Matrix: {doc_term_matrix.shape[0]} CVs × {doc_term_matrix.shape[1]} words")
print(f"Sparsity: {(1 - doc_term_matrix.nnz / (doc_term_matrix.shape[0] * doc_term_matrix.shape[1])) * 100:.1f}%")

Document-Term Matrix: 34 CVs × 662 words
Sparsity: 86.7%


## Step 4: Train LDA Model

In [27]:
# Train LDA model
n_topics = 3  # Number of topics to discover

lda = LatentDirichletAllocation(
    n_components=n_topics,
    random_state=42,
    max_iter=10,
    learning_method='online'
)

print(f"Training LDA to discover {n_topics} topics...")
lda.fit(doc_term_matrix)
print("✓ Training complete!")

Training LDA to discover 3 topics...
✓ Training complete!


## Step 5: Analyze Results

Let's see what topics LDA discovered and which words define each topic.

In [28]:
# Display top words for each topic
def display_topics(model, feature_names, n_top_words=10):
    """Display top words for each topic"""
    for topic_idx, topic in enumerate(model.components_):
        top_words_idx = topic.argsort()[-n_top_words:][::-1]
        top_words = [feature_names[i] for i in top_words_idx]
        top_weights = [topic[i] for i in top_words_idx]
        
        print(f"\nTopic {topic_idx + 1}:")
        print("  Top words:", ", ".join(top_words))
        print("  Weights:", [f"{w:.3f}" for w in top_weights])

display_topics(lda, feature_names, n_top_words=10)


Topic 1:
  Top words: engineer, analyst, analytics, 2019, 2021, business, with, 10, bi, governance
  Weights: ['25.418', '25.095', '21.468', '20.625', '19.749', '19.336', '18.298', '16.163', '16.113', '15.502']

Topic 2:
  Top words: ai, on, models, engineer, research, computer, model, 06, 08, engineering
  Weights: ['32.898', '28.287', '22.209', '20.499', '20.153', '20.043', '19.232', '16.574', '16.569', '15.889']

Topic 3:
  Top words: engineer, spark, big, hadoop, on, aws, platform, 2021, 01, time
  Weights: ['7.766', '7.080', '5.884', '5.601', '4.821', '4.686', '4.546', '4.334', '4.237', '4.199']
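
The weights printed above are not probabilities. In scikit-learn, `components_` holds pseudo-counts, and dividing each row by its sum gives the topic's distribution over words. The sketch below uses a small made-up array as a stand-in for `lda.components_`.

```python
import numpy as np

# Hypothetical pseudo-counts for 2 topics over 3 words (stand-in for lda.components_)
components = np.array([
    [30.0, 15.0, 5.0],
    [2.0, 2.0, 16.0],
])

# Normalize each row so that every topic's word weights sum to 1
topic_word_probs = components / components.sum(axis=1, keepdims=True)

print(topic_word_probs)
# Topic 1 -> 0.6, 0.3, 0.1
# Topic 2 -> 0.1, 0.1, 0.8
```

Normalizing also makes topics comparable: a small topic such as Topic 3 above has low raw weights simply because few CVs were assigned to it.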


**Interpreting the topics**: Look at the top words for each topic. Can you guess what each topic represents? Generic tokens such as "with", "on", and years like "2021" appear in the lists above because no stop words were removed, so focus on the distinctive words. For example:
- Topic with "PyTorch", "TensorFlow", "NLP" → probably AI/ML
- Topic with "Tableau", "Power BI", "dashboard" → probably Data Analysis
- Topic with "Hadoop", "Spark", "Kafka" → probably Big Data

Now let's see which CV belongs to which topic:

In [29]:
# Get topic distribution for each CV
doc_topic_dist = lda.transform(doc_term_matrix)

# Find dominant topic for each CV
dominant_topics = doc_topic_dist.argmax(axis=1)

# Create a DataFrame to see results
df_results = pd.DataFrame({
    'CV': cv_names,
    'Dominant Topic': dominant_topics + 1,
    'Topic Probabilities': [dist for dist in doc_topic_dist]
})

# Show which CVs belong to which topic
print("CV Assignment to Topics:")
print("=" * 60)
for topic_id in range(n_topics):
    topic_cvs = df_results[df_results['Dominant Topic'] == topic_id + 1]
    print(f"\nTopic {topic_id + 1} ({len(topic_cvs)} CVs):")
    for idx, row in topic_cvs.iterrows():
        prob = row['Topic Probabilities'][topic_id]
        print(f"  - {row['CV']} ({prob:.1%})")

CV Assignment to Topics:

Topic 1 (17 CVs):
  - 15 (99.4%)
  - 29 (99.4%)
  - 30 (99.3%)
  - 08 (99.4%)
  - 09 (99.5%)
  - 18 (99.4%)
  - 19 (99.4%)
  - 23 (75.0%)
  - 24 (79.7%)
  - 31 (99.2%)
  - 35 (99.3%)
  - 36 (99.3%)
  - 37 (99.3%)
  - 38 (99.4%)
  - 14 (99.5%)
  - 27 (99.4%)
  - 28 (99.4%)

Topic 2 (14 CVs):
  - 07 (99.5%)
  - 10 (99.4%)
  - 11 (99.5%)
  - 17 (99.4%)
  - 20 (99.4%)
  - 21 (99.4%)
  - 22 (99.4%)
  - 26 (99.4%)
  - 33 (99.4%)
  - 34 (99.4%)
  - 39 (99.4%)
  - 40 (99.4%)
  - 13 (84.1%)
  - 25 (99.3%)

Topic 3 (3 CVs):
  - 12 (99.5%)
  - 16 (99.3%)
  - 32 (99.4%)


## Step 6: Organize CVs into Folders

Now comes the practical part: **automatically organize CVs into folders** based on their dominant topic!

In [30]:
# Create output directory structure
output_dir = Path('output/organized_cvs')
output_dir.mkdir(parents=True, exist_ok=True)

# Create a folder for each topic
for topic_id in range(n_topics):
    topic_dir = output_dir / f"Topic_{topic_id + 1}"
    topic_dir.mkdir(exist_ok=True)

# Copy each CV to its topic folder
for cv_name, topic_id, source_file in zip(cv_names, dominant_topics, cv_file_paths):
    target_dir = output_dir / f"Topic_{topic_id + 1}"
    target_file = target_dir / f"{cv_name}.json"
    
    shutil.copy2(source_file, target_file)
    print(f"Copied {cv_name}.json → Topic_{topic_id + 1}/")

print(f"\n✓ Organization complete! CVs are now in: {output_dir}")

Copied 07.json → Topic_2/
Copied 10.json → Topic_2/
Copied 11.json → Topic_2/
Copied 15.json → Topic_1/
Copied 17.json → Topic_2/
Copied 20.json → Topic_2/
Copied 21.json → Topic_2/
Copied 22.json → Topic_2/
Copied 26.json → Topic_2/
Copied 29.json → Topic_1/
Copied 30.json → Topic_1/
Copied 33.json → Topic_2/
Copied 34.json → Topic_2/
Copied 39.json → Topic_2/
Copied 40.json → Topic_2/
Copied 08.json → Topic_1/
Copied 09.json → Topic_1/
Copied 12.json → Topic_3/
Copied 16.json → Topic_3/
Copied 18.json → Topic_1/
Copied 19.json → Topic_1/
Copied 23.json → Topic_1/
Copied 24.json → Topic_1/
Copied 31.json → Topic_1/
Copied 32.json → Topic_3/
Copied 35.json → Topic_1/
Copied 36.json → Topic_1/
Copied 37.json → Topic_1/
Copied 38.json → Topic_1/
Copied 13.json → Topic_2/
Copied 14.json → Topic_1/
Copied 25.json → Topic_2/
Copied 27.json → Topic_1/
Copied 28.json → Topic_1/

✓ Organization complete! CVs are now in: outp

### Verify the Organization

Let's check what's in each folder:

In [31]:
# Show contents of each topic folder
for topic_id in range(n_topics):
    topic_dir = output_dir / f"Topic_{topic_id + 1}"
    files = list(topic_dir.glob('*.json'))
    print(f"\nTopic_{topic_id + 1}/ ({len(files)} CVs):")
    for f in sorted(files):
        print(f"  - {f.name}")


Topic_1/ (17 CVs):
  - 08.json
  - 09.json
  - 14.json
  - 15.json
  - 18.json
  - 19.json
  - 23.json
  - 24.json
  - 27.json
  - 28.json
  - 29.json
  - 30.json
  - 31.json
  - 35.json
  - 36.json
  - 37.json
  - 38.json

Topic_2/ (14 CVs):
  - 07.json
  - 10.json
  - 11.json
  - 13.json
  - 17.json
  - 20.json
  - 21.json
  - 22.json
  - 25.json
  - 26.json
  - 33.json
  - 34.json
  - 39.json
  - 40.json

Topic_3/ (3 CVs):
  - 12.json
  - 16.json
  - 32.json


## **Student Exercise**: discover topics on a dataset of your choice

In [None]:
import requests
import joblib

cache_dir = 'cache_github'
memory = joblib.Memory(cache_dir, verbose=0)

@memory.cache()
def getREADMEs(user: str) -> dict:
    """Return {repo_name: README text or None} for the user's own (non-fork) repos."""
    raw_headers = {'Accept': 'application/vnd.github.v3.raw'}
    repos_url = f"https://api.github.com/users/{user}/repos"
    response = requests.get(repos_url, timeout=30)
    if response.status_code != 200:
        print(f"Error fetching repos for user {user}: {response.status_code}")
        # Return an empty dict (not None) so callers can still call .items().
        # joblib caches this result too; delete the cache folder to retry.
        return {}

    repos = response.json()
    readmes = {}

    for repo in repos:
        repo_name = repo['name']
        # Skip the profile repository (same name as the user) and forks
        if repo_name.casefold() == user.casefold():
            continue
        if repo.get('fork'):
            continue

        # Ask GitHub for the repository's README as raw text
        readme_url = f"https://api.github.com/repos/{user}/{repo_name}/readme"
        readme_response = requests.get(readme_url, headers=raw_headers, timeout=30)

        if readme_response.status_code == 200:
            readmes[repo_name] = readme_response.text
        else:
            # Fall back to requesting README.md directly
            readme_url_md = f"https://api.github.com/repos/{user}/{repo_name}/contents/README.md"
            readme_response_md = requests.get(readme_url_md, headers=raw_headers, timeout=30)
            if readme_response_md.status_code == 200:
                readmes[repo_name] = readme_response_md.text
            else:
                readmes[repo_name] = None  # No README found or error occurred

    return readmes


In [33]:
# Fetch READMEs from a GitHub user
github_user = "AFAskar"  # Change this to any GitHub username you want to analyze

print(f"Fetching repositories for user: {github_user}...")
readmes_dict = getREADMEs(github_user)

# Filter out repos without READMEs
readmes_dict = {repo: text for repo, text in readmes_dict.items() if text is not None}

print(f"✓ Found {len(readmes_dict)} repositories with READMEs\n")

# Display repository names
print("Repositories found:")
for i, repo_name in enumerate(readmes_dict.keys(), 1):
    print(f"  {i}. {repo_name}")

Fetching repositories for user: AFAskar...
✓ Found 9 repositories with READMEs

Repositories found:
  1. aqar-scraper
  2. AzureStreamerBot
  3. AzureTTVChat
  4. CarDealerShipAPI
  5. Common-Utility-Scripts
  6. CSVProfiler
  7. DisNote-bot
  8. Foley-Mixer
  9. InfoSecProject


In [34]:
# Prepare data for topic modeling
repo_names = list(readmes_dict.keys())
readme_texts = list(readmes_dict.values())

# Preprocess READMEs (reuse the same preprocessing function)
readmes_processed = [preprocess_text(text) for text in readme_texts]

# Vectorize
vectorizer_github = CountVectorizer(
    max_features=500,  # Top 500 words
    min_df=1,          # Word must appear in at least 1 README
    max_df=0.7,        # Ignore words in >70% of READMEs
    stop_words='english'  # Remove common English words
)

doc_term_matrix_github = vectorizer_github.fit_transform(readmes_processed)
feature_names_github = vectorizer_github.get_feature_names_out()

print(f"Document-Term Matrix: {doc_term_matrix_github.shape[0]} READMEs × {doc_term_matrix_github.shape[1]} words")
print(f"Sparsity: {(1 - doc_term_matrix_github.nnz / (doc_term_matrix_github.shape[0] * doc_term_matrix_github.shape[1])) * 100:.1f}%")

Document-Term Matrix: 9 READMEs × 500 words
Sparsity: 83.4%


In [35]:
# Train LDA on GitHub READMEs
n_topics_github = 5  # Looking for 5 topics: AI, Security, Web, Embedded, Automation

lda_github = LatentDirichletAllocation(
    n_components=n_topics_github,
    random_state=42,
    max_iter=20,
    learning_method='online'
)

print(f"Training LDA to discover {n_topics_github} topics in GitHub READMEs...")
lda_github.fit(doc_term_matrix_github)
print("✓ Training complete!\n")

# Display discovered topics
print("=" * 70)
print("DISCOVERED TOPICS IN GITHUB REPOSITORIES")
print("=" * 70)
display_topics(lda_github, feature_names_github, n_top_words=15)

# Get topic distribution for each README
doc_topic_dist_github = lda_github.transform(doc_term_matrix_github)
dominant_topics_github = doc_topic_dist_github.argmax(axis=1)

# Create results DataFrame
df_github_results = pd.DataFrame({
    'Repository': repo_names,
    'Dominant Topic': dominant_topics_github + 1,
    'Topic Probabilities': [dist for dist in doc_topic_dist_github]
})

# Display repositories organized by topic
print("\n" + "=" * 70)
print("REPOSITORIES ORGANIZED BY TOPIC")
print("=" * 70)
for topic_id in range(n_topics_github):
    topic_repos = df_github_results[df_github_results['Dominant Topic'] == topic_id + 1]
    print(f"\n📁 Topic {topic_id + 1} ({len(topic_repos)} repositories):")
    for idx, row in topic_repos.iterrows():
        prob = row['Topic Probabilities'][topic_id]
        print(f"  • {row['Repository']} ({prob:.1%})")

Training LDA to discover 5 topics in GitHub READMEs...
✓ Training complete!

DISCOVERED TOPICS IN GITHUB REPOSITORIES

Topic 1:
  Top words: data, csv, cli, output, sh, json, listings, python, profiler, web, project, scraper, raw, report, aqar_fm_listings
  Weights: ['37.352', '31.735', '15.120', '15.055', '13.281', '10.462', '10.385', '9.505', '7.699', '7.684', '7.680', '7.645', '6.741', '5.836', '5.810']

Topic 2:
  Top words: need, pipenv, use, python, voice, command, pip, using, azurestreambot, script, installation, tts, env, dev, package
  Weights: ['4.927', '4.589', '2.504', '2.474', '2.468', '2.462', '2.425', '2.149', '2.122', '1.429', '1.386', '1.382', '1.364', '1.363', '1.360']

Topic 3:
  Top words: files, script, bot, directory, notes, use, moves, discord, installation, uv, tool, functions, extension, daily, 01
  Weights: ['14.238', '9.588', '8.626', '6.786', '6.754', '4.913', '4.909', '4.893', '4.890', '4.018', '4.002', '3.990', '3.982', '3.981', '3.981']

Topic 4:
  Top 

## Summary

**What we accomplished**:
1. ✅ Loaded unlabeled CVs from a folder
2. ✅ Preprocessed the text data
3. ✅ Created a document-term matrix
4. ✅ Trained an LDA model to discover topics
5. ✅ Analyzed which CVs belong to which topic
6. ✅ **Automatically organized CVs into folders** based on discovered topics

**Key Takeaways**:
- **LDA discovers topics automatically** by finding words that co-occur together
- **Each document is a mixture of topics** - LDA assigns probabilities
- **Topic modeling is unsupervised** - no labels needed!
- **Practical application**: Organize unlabeled documents automatically

**Next Steps**:
- Try different numbers of topics (`n_topics`) and see how results change
- Experiment with preprocessing (stemming, stop words removal)
- Use topic probabilities to handle CVs that belong to multiple topics
- Visualize topics using tools like pyLDAvis
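
As a starting point for the first suggestion, the sketch below trains LDA with several values of `n_components` on a toy corpus and prints each model's perplexity (lower is better). The corpus is invented for illustration. Treat the numbers as a rough guide only: perplexity measured on the training data tends to favor more topics, so a held-out set and a look at the top words are still needed to judge whether the topics make sense.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for CV text (hypothetical)
toy_docs = [
    "pytorch tensorflow nlp deep learning models",
    "pytorch nlp research models training",
    "tableau dashboard sql reporting analytics",
    "sql dashboard analytics business reporting",
    "spark hadoop kafka aws streaming platform",
    "hadoop spark aws platform pipelines",
]

X = CountVectorizer().fit_transform(toy_docs)

perplexities = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, random_state=42)
    model.fit(X)
    perplexities[k] = model.perplexity(X)
    print(f"n_topics={k}: perplexity={perplexities[k]:.1f}")
```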

**References**:
- [Scikit-learn LDA documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)
- [Topic modeling visualization guide](https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/)

---


## Module 1 Synthesis: The Complete Pipeline

Congratulations! You've completed **Module 1: Text Analysis with Statistical NLP**. Let's reflect on the journey and see how all the pieces fit together.

### The Circular Learning Experience

Remember the question chain we started with? Let's trace how we answered each question and built a complete NLP pipeline:

1. **"What is NLP?"** → We learned that NLP bridges computers and human language, with applications in understanding and generation.

2. **"How do we extract patterns from text?"** → We used **Regular Expressions** to find, match, and manipulate text patterns, which is essential for preprocessing.

3. **"How do we understand our data?"** → We performed **Exploratory Data Analysis (EDA)** on corpora to assess data quality, vocabulary characteristics, and preprocessing needs.

4. **"How do we prepare text for ML?"** → We applied **Preprocessing** techniques (cleaning, normalization, tokenization, stemming) to transform raw text into clean tokens.

5. **"How do we convert text to numbers?"** → We used **Vectorization** (BoW, TF-IDF) to convert text into numerical features that ML models can process.

6. **"How do we build classifiers?"** → We built **Text Classification** models (like sentiment analysis) using vectorized features and supervised learning.

7. **"How do we search documents?"** → We implemented **Information Retrieval** systems using TF-IDF and cosine similarity to find relevant documents.

8. **"How do we discover topics?"** → We applied **Topic Modeling** (LDA) to automatically organize unlabeled documents by discovering hidden topics.

### The Complete NLP Pipeline

Throughout this module, you've learned to build a complete NLP pipeline:

```
Raw Text
    ↓
[Regex: Pattern Extraction]
    ↓
[Corpus & EDA: Understanding Data]
    ↓
[Preprocessing: Cleaning & Normalization]
    ↓
[Vectorization: Text → Numbers]
    ↓
[Modeling: Classification / IR / Topic Modeling]
    ↓
Actionable Insights
```

### Key Skills You've Acquired

By completing this module, you can now:

✅ **Build supervised ML text classification pipelines**
- Preprocess Arabic and English text
- Vectorize text using BoW and TF-IDF
- Train and evaluate classifiers
- Interpret model results

✅ **Apply keyword-based information retrieval**
- Implement TF-IDF-based search engines
- Measure document similarity using cosine similarity
- Rank and retrieve relevant documents

✅ **Apply unsupervised ML for document organization**
- Discover hidden topics using LDA
- Organize unlabeled documents automatically
- Interpret topic modeling results

### The Foundation for What's Next

This module focused on **statistical NLP**: traditional methods that work well for many tasks. In **Module 2**, you'll learn about **Deep Learning approaches** (embeddings, transformers) that build on these foundations and often achieve better performance.

**What you learned here is still valuable:**
- Preprocessing techniques apply to both statistical and deep learning methods
- Understanding vectorization helps you understand embeddings
- EDA is always the first step, regardless of the approach
- The pipeline structure (preprocess → vectorize → model) remains the same

### Reflection Questions

Before moving to Module 2, consider:

1. **When would you use statistical NLP vs. deep learning?**
   - Statistical NLP: Fast, interpretable, works with small data
   - Deep Learning: Better accuracy, requires more data and computation

2. **What preprocessing steps are most important?**
   - Depends on your data and task, but EDA always guides the decision

3. **How does TF-IDF differ from BoW?**
   - BoW: Simple word counts
   - TF-IDF: Weighted counts that emphasize distinctive words

4. **When would you use topic modeling vs. classification?**
   - Classification: When you have labels and want to predict categories
   - Topic Modeling: When you have no labels and want to discover structure
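
The BoW vs. TF-IDF contrast in question 3 is easy to check by hand. The sketch below computes both for a three-document toy corpus using the textbook formula `tf * log(N / df)`. This is an assumption made for simplicity: scikit-learn's `TfidfVectorizer` uses a smoothed variant with normalization, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical tokenized documents
docs = [
    ["python", "spark", "spark"],
    ["python", "pytorch"],
    ["python", "tableau"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def bow(doc):
    """Bag of Words: raw term counts."""
    return dict(Counter(doc))

def tfidf(doc):
    """TF-IDF: term count scaled by log(N / document frequency)."""
    counts = Counter(doc)
    return {term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()}

print(bow(docs[0]))    # {'python': 1, 'spark': 2}
print(tfidf(docs[0]))  # 'python' scores 0.0 because it appears in every document
```

BoW treats `python` as a feature like any other, while TF-IDF zeroes it out and keeps the weight on `spark`, the word that actually distinguishes the first document.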

### The Journey Continues

You've built a solid foundation in statistical NLP. The concepts you've learned (preprocessing, vectorization, classification, retrieval, and topic modeling) are the building blocks for more advanced techniques.

**Next Module Preview:**
- **Module 2** introduces **Deep Learning for NLP**:
  - Tokenization with modern tools (WordPiece, BPE)
  - Word embeddings (Word2Vec, GloVe, contextual embeddings)
  - Transformers and BERT
  - Fine-tuning pre-trained models

The journey from statistical NLP to deep learning is a natural progression: you'll see how embeddings generalize vectorization, how transformers improve on traditional methods, and how pre-trained models leverage the foundations you've built.

---

**Module 1 Complete! 🎉**

You now have the skills to work with text data using statistical methods. You understand the complete pipeline from raw text to actionable insights, and you're ready to explore the power of deep learning in Module 2.