# Data Preprocessing for RAG-based AI Resume Builder

This notebook preprocesses the Kaggle **Resume.csv** dataset (Snehaan Bhawal) for ingestion into ChromaDB.

**Pipeline Steps:**
1. Load the dataset
2. Clean the `Resume_str` column (strip HTML tags, special characters, normalise whitespace)
3. Chunk text using `RecursiveCharacterTextSplitter` (`chunk_size=1000`, `chunk_overlap=200`)
4. Format output as a list of dictionaries with `id`, `content`, and `metadata`

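**Constraints:** The only third-party dependencies are `pandas` and LangChain's text splitter (`langchain_text_splitters`); `re`, `json`, and `os` come from the Python standard library.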

## 1. Imports & Setup

In [1]:
import json
import os
import re

import pandas as pd
from langchain_text_splitters import RecursiveCharacterTextSplitter

print("All imports loaded successfully.")

All imports loaded successfully.


## 2. Load the Dataset

In [2]:
# Update this path if your Resume.csv is in a different location
CSV_PATH = os.path.join("..", "archive(7)", "Resume", "Resume.csv")
CSV_PATH = os.path.normpath(CSV_PATH)

print(f"Loading dataset from: {CSV_PATH}")
df = pd.read_csv(CSV_PATH)
print(f"Total rows loaded: {len(df)}")
print(f"Columns: {list(df.columns)}")
df.head(2)

Loading dataset from: ../archive(7)/Resume/Resume.csv
Total rows loaded: 2484
Columns: ['ID', 'Resume_str', 'Resume_html', 'Category']


Unnamed: 0,ID,Resume_str,Resume_html,Category
0,16852973,HR ADMINISTRATOR/MARKETING ASSOCIATE\...,"<div class=""fontsize fontface vmargins hmargin...",HR
1,22323967,"HR SPECIALIST, US HR OPERATIONS ...","<div class=""fontsize fontface vmargins hmargin...",HR


In [3]:
# Drop rows where Resume_str is null
df = df.dropna(subset=["Resume_str"])
print(f"Rows after dropping nulls: {len(df)}")
print(f"Unique categories ({df['Category'].nunique()}): {df['Category'].unique().tolist()}")

Rows after dropping nulls: 2484
Unique categories (24): ['HR', 'DESIGNER', 'INFORMATION-TECHNOLOGY', 'TEACHER', 'ADVOCATE', 'BUSINESS-DEVELOPMENT', 'HEALTHCARE', 'FITNESS', 'AGRICULTURE', 'BPO', 'SALES', 'CONSULTANT', 'DIGITAL-MEDIA', 'AUTOMOBILE', 'CHEF', 'FINANCE', 'APPAREL', 'ENGINEERING', 'ACCOUNTANT', 'CONSTRUCTION', 'PUBLIC-RELATIONS', 'BANKING', 'ARTS', 'AVIATION']


## 3. Data Cleaning

The `clean_text` function performs three cleaning steps, then strips leading and trailing whitespace:
1. **Strip HTML tags** — using `re.sub(r'<[^>]+>', ' ', text)`
2. **Remove special characters** — keep only letters, digits, and basic punctuation (`. , ; : ! ? - '`)
3. **Normalise whitespace** — collapse multiple spaces/newlines into a single space

In [4]:
def clean_text(text: str) -> str:
    """Clean a raw resume string.

    Steps:
        1. Remove all HTML tags.
        2. Remove special characters (keep letters, digits, basic punctuation).
        3. Collapse multiple whitespace / newlines into a single space.
        4. Strip leading and trailing whitespace.

    Args:
        text: Raw resume string, potentially containing HTML fragments.

    Returns:
        Cleaned, normalised plain-text string.
    """
    if not isinstance(text, str):
        return ""

    # Strip HTML tags
    text = re.sub(r"<[^>]+>", " ", text)

    # Keep only alphanumeric characters, whitespace, and basic punctuation
    text = re.sub(r"[^a-zA-Z0-9\s\.\,\;\:\!\?\-\']", " ", text)

    # Normalise whitespace (spaces, tabs, newlines) to single space
    text = re.sub(r"\s+", " ", text)

    return text.strip()


print("clean_text function defined.")

clean_text function defined.
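
To see what the three regex passes do in practice, the sketch below applies the same substitutions to two made-up strings. The sample inputs are hypothetical and not taken from the dataset, and the function body is repeated only so the cell runs on its own. One thing it makes visible: the character whitelist also drops `+`, `#`, and `/`, so `C++` and `C#` both collapse to `C`. Whether that matters depends on how skill names need to be matched downstream.

```python
import re


def clean_text(text: str) -> str:
    """Same regex passes as the notebook's clean_text, repeated so this cell is standalone."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r"<[^>]+>", " ", text)                          # strip HTML tags
    text = re.sub(r"[^a-zA-Z0-9\s\.\,\;\:\!\?\-\']", " ", text)   # drop other symbols
    text = re.sub(r"\s+", " ", text)                              # collapse whitespace
    return text.strip()


# Hypothetical inputs, invented for illustration only.
html_sample = "<p>Team   lead,\n\n<b>10 years</b> of experience.</p>"
skills_sample = "<li>C++ / C# developer</li>"

cleaned_html = clean_text(html_sample)
cleaned_skills = clean_text(skills_sample)

print(cleaned_html)    # Team lead, 10 years of experience.
print(cleaned_skills)  # C C developer
```

If symbols such as `+` and `#` should survive, one option would be to add them to the character class, but that is a design decision this notebook does not make.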


In [5]:
# Apply cleaning to the Resume_str column
print("Cleaning Resume_str column ...")
df["Resume_str"] = df["Resume_str"].apply(clean_text)
print("Done!\n")

# Preview a cleaned resume
print("Sample cleaned text (first 500 chars):")
print(df["Resume_str"].iloc[0][:500])

Cleaning Resume_str column ...
Done!

Sample cleaned text (first 500 chars):
HR ADMINISTRATOR MARKETING ASSOCIATE HR ADMINISTRATOR Summary Dedicated Customer Service Manager with 15 years of experience in Hospitality and Customer Service Management. Respected builder and leader of customer-focused teams; strives to instill a shared, enthusiastic commitment to customer service. Highlights Focused on customer satisfaction Team management Marketing savvy Conflict resolution techniques Training and development Skilled multi-tasker Client relations specialist Accomplishments 


## 4. Text Chunking

We use LangChain's `RecursiveCharacterTextSplitter` with:
- **chunk_size = 1000** characters
- **chunk_overlap = 200** characters

The overlap repeats up to 200 characters from the end of one chunk at the start of the next, so a sentence or skill list that straddles a chunk boundary is more likely to appear intact in at least one of the chunks ChromaDB retrieves for a user query.
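
The arithmetic behind the two parameters can be sketched with a plain fixed-width sliding window. This is a deliberate simplification and not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on separators such as paragraph breaks and spaces, so its chunks are usually shorter than the maximum). The function name `sliding_window_chunks` and the synthetic input are assumptions made for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Fixed-width windows: each window starts (chunk_size - chunk_overlap) after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # 800 characters of new text per chunk
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Synthetic 2600-character string standing in for a cleaned resume.
text = "".join(str(i % 10) for i in range(2600))
chunks = sliding_window_chunks(text)

print(len(chunks))                          # 3
print([len(c) for c in chunks])             # [1000, 1000, 1000]
print(chunks[0][-200:] == chunks[1][:200])  # True: the shared 200 characters
```

With an 800-character step, a resume of several thousand characters yields a handful of chunks, which is in line with the roughly 7 to 8 chunks per resume reported later in the notebook.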

In [6]:
def chunk_and_format(df: pd.DataFrame) -> list:
    """Split cleaned resumes into chunks and format for ChromaDB ingestion.

    Each chunk is returned as a dictionary with:
        - id:       "<original_ID>_chunk_<index>"
        - content:  the cleaned text chunk
        - metadata: {"Category": "<original_category>"}

    Args:
        df: DataFrame with columns ID, Resume_str (cleaned), and Category.

    Returns:
        List of dictionaries ready for vector-database ingestion.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )

    results = []

    for _, row in df.iterrows():
        resume_id = row["ID"]
        category = row["Category"]
        cleaned_text = row["Resume_str"]

        if not cleaned_text:
            continue

        chunks = splitter.split_text(cleaned_text)

        for i, chunk in enumerate(chunks):
            results.append(
                {
                    "id": f"{resume_id}_chunk_{i}",
                    "content": chunk,
                    "metadata": {"Category": category},
                }
            )

    return results


print("chunk_and_format function defined.")

chunk_and_format function defined.


In [7]:
# Run chunking
print("Chunking text (chunk_size=1000, overlap=200) ...")
preprocessed = chunk_and_format(df)
print(f"Total chunks generated: {len(preprocessed)}")

Chunking text (chunk_size=1000, overlap=200) ...
Total chunks generated: 18893
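
Before ingestion it may be worth confirming two properties the notebook does not check: that every `id` is unique (the id is built from `ID` plus the chunk index, so a collision would only occur if the same `ID` appeared in more than one row) and that no chunk exceeds the configured size. The sketch below is a hypothetical helper, not part of the original pipeline; the records are invented and simply follow the `id` / `content` / `metadata` shape described above.

```python
from collections import Counter


def find_duplicate_ids(records: list) -> list:
    """Return ids that occur more than once, sorted for stable output."""
    counts = Counter(record["id"] for record in records)
    return sorted(chunk_id for chunk_id, n in counts.items() if n > 1)


def find_oversized(records: list, chunk_size: int = 1000) -> list:
    """Return ids of chunks whose content is longer than chunk_size characters."""
    return [record["id"] for record in records if len(record["content"]) > chunk_size]


# Hypothetical records; "111_chunk_0" is duplicated on purpose.
sample_records = [
    {"id": "111_chunk_0", "content": "a" * 900, "metadata": {"Category": "HR"}},
    {"id": "111_chunk_1", "content": "b" * 1200, "metadata": {"Category": "HR"}},
    {"id": "111_chunk_0", "content": "c" * 100, "metadata": {"Category": "HR"}},
]

duplicates = find_duplicate_ids(sample_records)
oversized = find_oversized(sample_records)

print(duplicates)  # ['111_chunk_0']
print(oversized)   # ['111_chunk_1']
```

In the notebook, the same two calls could be run on `preprocessed`; both lists would be expected to come back empty.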


## 5. Results Summary

In [8]:
print("=" * 60)
print(f"  Resumes processed : {len(df)}")
print(f"  Unique categories : {df['Category'].nunique()}")
print(f"  Total chunks      : {len(preprocessed)}")
print("=" * 60)

  Resumes processed : 2484
  Unique categories : 24
  Total chunks      : 18893


## 6. Sample Output

Preview the first few chunks to verify the output format.

In [9]:
# Display first 3 chunks
for entry in preprocessed[:3]:
    print(json.dumps(entry, indent=2))
    print("-" * 40)

{
  "id": "16852973_chunk_0",
  "content": "HR ADMINISTRATOR MARKETING ASSOCIATE HR ADMINISTRATOR Summary Dedicated Customer Service Manager with 15 years of experience in Hospitality and Customer Service Management. Respected builder and leader of customer-focused teams; strives to instill a shared, enthusiastic commitment to customer service. Highlights Focused on customer satisfaction Team management Marketing savvy Conflict resolution techniques Training and development Skilled multi-tasker Client relations specialist Accomplishments Missouri DOT Supervisor Training Certification Certified by IHG in Customer Loyalty and Marketing by Segment Hilton Worldwide General Manager Training Certification Accomplished Trainer for cross server hospitality systems such as Hilton OnQ , Micros Opera PMS , Fidelio OPERA Reservation System ORS , Holidex Completed courses and seminars in customer service, sales strategies, inventory control, loss prevention, safety, time management, leadership and 

## 7. Save to JSON

Save the preprocessed chunks to `preprocessed_chunks.json` for downstream ChromaDB ingestion.

In [10]:
OUTPUT_PATH = "preprocessed_chunks.json"

with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
    json.dump(preprocessed, f, ensure_ascii=False, indent=2)

print(f"Saved {len(preprocessed)} chunks to {OUTPUT_PATH}")

Saved 18893 chunks to preprocessed_chunks.json
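
The saved file is row-oriented (one dictionary per chunk), whereas ChromaDB's `collection.add` takes parallel lists of ids, documents, and metadatas. A minimal sketch of that reshaping, in batches, is shown below. The helper name `to_chroma_batches`, the batch size of 500, and the sample records are assumptions for illustration; the actual ingestion step is outside this notebook, so the ChromaDB call appears only as a comment.

```python
def to_chroma_batches(records: list, batch_size: int = 500):
    """Yield column-oriented batches built from row-oriented chunk records."""
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        yield {
            "ids": [record["id"] for record in batch],
            "documents": [record["content"] for record in batch],
            "metadatas": [record["metadata"] for record in batch],
        }


# Hypothetical records in the same shape as preprocessed_chunks.json.
sample_records = [
    {"id": f"999_chunk_{i}", "content": f"text {i}", "metadata": {"Category": "HR"}}
    for i in range(5)
]

batches = list(to_chroma_batches(sample_records, batch_size=2))
print(len(batches))       # 3
print(batches[0]["ids"])  # ['999_chunk_0', '999_chunk_1']

# Assumed downstream usage (not executed here):
# for batch in to_chroma_batches(records):
#     collection.add(**batch)
```

Batching keeps each insert call to a bounded size, which may help if the vector store limits how many records a single call can carry.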


---

In the preprocessing step, we cleaned the raw Kaggle dataset by removing structural noise (HTML tags and special characters), with the aim of improving embedding quality. We then used `RecursiveCharacterTextSplitter` to divide each resume into chunks of at most 1000 characters with up to 200 characters (20%) of overlap. Smaller, overlapping chunks make it more likely that a query for a specific skill retrieves the resume section that mentions it, rather than a whole document or a passage cut off at a chunk boundary.