# Synthetic Data Maker from Kaggle Dataset

This notebook processes the Kaggle Resume dataset, selects a few samples, and uses the LLM to convert them into the standard JSON Resume format.

We do this because we need a PDF => JSON paired dataset, whereas the Kaggle dataset only provides PDF => text pairs.

Starting from already-extracted text removes most of the difficulty, but we still route it through a common pipeline so that our evaluation pipeline works the same way in both cases.


In [1]:
import pandas as pd
import sys
import os
import json
from typing import List

# Add parent directory to path to import core modules
sys.path.append(os.path.abspath("../.."))

from core.llm.factory import get_llm
from core.parsing.schema import Resume
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import PydanticOutputParser



In [2]:
# Load Dataset
csv_path = '/home/acer/Desktop/cv/core/parsing/tests_data/resume_and_texts_kaggle/all/Resume/Resume.csv'
df = pd.read_csv(csv_path)

print("Dataset Shape:", df.shape)
print("Categories:", df['Category'].unique())

Dataset Shape: (2484, 4)
Categories: ['HR' 'DESIGNER' 'INFORMATION-TECHNOLOGY' 'TEACHER' 'ADVOCATE'
 'BUSINESS-DEVELOPMENT' 'HEALTHCARE' 'FITNESS' 'AGRICULTURE' 'BPO' 'SALES'
 'CONSULTANT' 'DIGITAL-MEDIA' 'AUTOMOBILE' 'CHEF' 'FINANCE' 'APPAREL'
 'ENGINEERING' 'ACCOUNTANT' 'CONSTRUCTION' 'PUBLIC-RELATIONS' 'BANKING'
 'ARTS' 'AVIATION']


In [3]:
# Select one sample from each of the first 7 categories
categories = df['Category'].unique()[:7]
samples = []

for cat in categories:
    sample = df[df['Category'] == cat].iloc[0]
    samples.append(sample)
    
print(f"Selected {len(samples)} samples from categories: {categories}")

Selected 7 samples from categories: ['HR' 'DESIGNER' 'INFORMATION-TECHNOLOGY' 'TEACHER' 'ADVOCATE'
 'BUSINESS-DEVELOPMENT' 'HEALTHCARE']
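
Taking `iloc[0]` always picks the first resume listed under each category. If a less position-dependent pick is wanted, a seeded random choice per category is one option. The sketch below is only an illustration: it works on plain dict records (for example the result of `df.to_dict("records")`), and `pick_one_per_category` together with its `key`, `limit` and `seed` parameters are hypothetical names, not part of this notebook.

```python
import random
from collections import defaultdict


def pick_one_per_category(records, key="Category", limit=7, seed=0):
    """Return one randomly chosen record for each of the first `limit` categories.

    Hypothetical helper: categories are taken in first-seen order, and the
    seed keeps the selection reproducible between runs.
    """
    groups = defaultdict(list)  # dicts keep first-seen insertion order
    for rec in records:
        groups[rec[key]].append(rec)

    rng = random.Random(seed)
    chosen = []
    for cat in list(groups)[:limit]:
        chosen.append(rng.choice(groups[cat]))
    return chosen


# Toy records standing in for rows of the Kaggle CSV (not real data).
rows = [
    {"ID": 1, "Category": "HR"},
    {"ID": 2, "Category": "HR"},
    {"ID": 3, "Category": "DESIGNER"},
]
picked = pick_one_per_category(rows, limit=2)
print([r["Category"] for r in picked])
```

The category order still follows the order of first appearance in the data, so the selected categories match the loop above; only the row chosen inside each category changes.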


In [4]:
# Initialize LLM and Parser
llm = get_llm()
parser = PydanticOutputParser(pydantic_object=Resume)

def convert_to_json(text: str) -> Resume:
    prompt = PromptTemplate(
        template="""Convert the following resume text into a valid JSON object matching the schema.
        If information is missing, leave fields null or empty.
        
        RESUME TEXT:
        {text}
        
        {format_instructions}
        """,
        input_variables=["text"],
        partial_variables={"format_instructions": parser.get_format_instructions()}
    )
    
    chain = prompt | llm | parser
    return chain.invoke({"text": text})
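
LLM output does not always parse cleanly into the schema on the first attempt, and a single failure would drop that sample from the dataset. One possible mitigation is a small retry wrapper around the conversion call. This is a minimal sketch under assumptions: `call_with_retries`, `attempts` and `delay_s` are hypothetical names, and `flaky_convert` is a stand-in used only to exercise the wrapper without calling a model.

```python
import time


def call_with_retries(fn, *args, attempts=3, delay_s=1.0):
    """Call fn(*args); on any exception wait and retry, re-raising the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn(*args)
        except Exception as err:  # broad on purpose: parsing and network errors alike
            last_error = err
            if attempt < attempts:
                time.sleep(delay_s * attempt)  # linear backoff between attempts
    raise last_error


# Stand-in for convert_to_json: fails on the first call, succeeds on the second.
calls = {"n": 0}


def flaky_convert(text):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ValueError("malformed JSON from model")
    return {"raw": text}


result = call_with_retries(flaky_convert, "some resume text", delay_s=0.0)
print(result, calls["n"])
```

In this notebook the call would presumably become `resume = call_with_retries(convert_to_json, text)`, leaving the surrounding `try`/`except` in place to log samples that still fail after the last attempt.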

In [5]:
# Process and Save
output_dir = '/home/acer/Desktop/cv/core/parsing/tests_data/resume_and_texts_kaggle/some'
os.makedirs(output_dir, exist_ok=True)

for i, sample in enumerate(samples):
    cat = sample['Category']
    rid = sample['ID']
    text = sample['Resume_str']
    
    print(f"Processing {i+1}/{len(samples)}: Category={cat}, ID={rid}...")
    try:
        resume = convert_to_json(text)
        
        # Save JSON
        output_path = os.path.join(output_dir, f"{cat}_{rid}.json")
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(resume.model_dump_json(indent=2))
        print(f"Saved to {output_path}")
        
        # Also save original text for reference
        text_path = os.path.join(output_dir, f"{cat}_{rid}.txt")
        with open(text_path, "w", encoding="utf-8") as f:
            f.write(text)
            
    except Exception as e:
        print(f"Error processing {rid}: {e}")

Processing 1/7: Category=HR, ID=16852973...
Saved to /home/acer/Desktop/cv/core/parsing/tests_data/resume_and_texts_kaggle/some/HR_16852973.json
Processing 2/7: Category=DESIGNER, ID=37058472...
Saved to /home/acer/Desktop/cv/core/parsing/tests_data/resume_and_texts_kaggle/some/DESIGNER_37058472.json
Processing 3/7: Category=INFORMATION-TECHNOLOGY, ID=36856210...
Saved to /home/acer/Desktop/cv/core/parsing/tests_data/resume_and_texts_kaggle/some/INFORMATION-TECHNOLOGY_36856210.json
Processing 4/7: Category=TEACHER, ID=12467531...
Saved to /home/acer/Desktop/cv/core/parsing/tests_data/resume_and_texts_kaggle/some/TEACHER_12467531.json
Processing 5/7: Category=ADVOCATE, ID=14445309...
Saved to /home/acer/Desktop/cv/core/parsing/tests_data/resume_and_texts_kaggle/some/ADVOCATE_14445309.json
Processing 6/7: Category=BUSINESS-DEVELOPMENT, ID=65708020...
Saved to /home/acer/Desktop/cv/core/parsing/tests_data/resume_and_texts_kaggle/some/BUSINESS-DEVELOPMENT_65708020.json
Processing 7/7: Cate