# Resume Data Cleaning Pipeline

This notebook processes raw resume data from `master_resumes_original.jsonl` and produces `cleaned_resumes.csv` for model training.

## Pipeline

1. **Inspect structure**: examine nested JSON of one resume
2. **Flatten data**: convert nested JSON to DataFrame columns (personal_info → name/email/summary, experience/education/skills/projects → JSON strings)
3. **Filter invalid dates**: remove resumes with placeholder start/end dates
4. **Extract seniority**: calculate total experience time, identify target job's level, clear that level from data (avoid leakage)
5. **Parse to text**: convert JSON fields to readable strings
6. **Generate summaries**: use LLM to create professional summaries, remove seniority keywords
7. **Balance dataset**: sample 700 per class (junior/mid/senior)
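
Step 7 could look roughly like the sketch below. It is a minimal illustration, not the notebook's actual balancing cell: the function name `balance_classes`, the `seed` value, and the choice to drop rows outside the three classes are all assumptions. It also assumes every class has at least `n_per_class` rows, since sampling without replacement raises an error otherwise.

```python
import pandas as pd

def balance_classes(df, label_col="experience_level",
                    classes=("junior", "mid", "senior"),
                    n_per_class=700, seed=42):
    """Keep only the listed classes and draw n_per_class rows from each.

    Hypothetical helper: the name, the seed, and the filtering step are
    assumptions made for illustration.
    """
    kept = df[df[label_col].isin(classes)]
    return (
        kept.groupby(label_col)
            .sample(n=n_per_class, random_state=seed)  # without replacement
            .reset_index(drop=True)
    )
```

Fixing `random_state` keeps the sampled subset reproducible across runs.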

## Imports

- **`json`**: parse JSONL records
- **`pandas`**: DataFrame operations
- **`datetime`**: calculate experience durations
- **`re`**: remove seniority keywords
- **`urllib.request`**: call OpenRouter API
- **`ThreadPoolExecutor`**: parallelize API calls
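
As an illustration of the `re` entry above, keyword removal might be sketched as follows. The keyword list and the name `strip_seniority_keywords` are assumptions; the notebook's own pattern is not shown in this section and may differ.

```python
import re

# Hypothetical keyword list, chosen only for illustration.
SENIORITY_PATTERN = re.compile(
    r"\b(?:junior|jr|mid[- ]?level|mid|senior|sr)\b\.?",
    flags=re.IGNORECASE,
)

def strip_seniority_keywords(text):
    """Remove seniority words, then collapse the whitespace left behind."""
    cleaned = SENIORITY_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

The `\b` word boundaries keep words such as "midnight" intact.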


# 1. Inspect Raw Data Structure

Read the first resume and print its nested structure to understand what fields we need to flatten.


In [1]:
import json

with open('master_resumes_original.jsonl', 'r', encoding='utf-8') as f:
    first_line = f.readline()
    record = json.loads(first_line.strip())
    
    for key in record.keys():
        print(f"\nKey: {key}")
        value = record[key]
        print(f"Type: {type(value).__name__}")
        
        if isinstance(value, dict):
            print(f"Dict keys: {list(value.keys())}")
            for sub_key, sub_value in value.items():
                print(f"  - {sub_key}: {type(sub_value).__name__}")
                if isinstance(sub_value, dict):
                    print(f"    Dict keys: {list(sub_value.keys())}")
                elif isinstance(sub_value, list) and sub_value:
                    print(f"    List length: {len(sub_value)}, first item type: {type(sub_value[0]).__name__}")
                    if isinstance(sub_value[0], dict):
                        print(f"    First item keys: {list(sub_value[0].keys())}")
        
        elif isinstance(value, list):
            print(f"List length: {len(value)}")
            if value:
                for idx, item in enumerate(value):
                    print(f"Item {idx} type: {type(item).__name__}")
                    if isinstance(item, dict):
                        print(f"Item {idx} keys: {list(item.keys())}")
                        for sub_key, sub_value in item.items():
                            print(f"  - {sub_key}: {type(sub_value).__name__}")
                            if isinstance(sub_value, dict):
                                print(f"    Dict keys: {list(sub_value.keys())}")
                            elif isinstance(sub_value, list) and sub_value:
                                print(f"    List length: {len(sub_value)}, first item type: {type(sub_value[0]).__name__}")
        else:
            print(f"Value preview: {str(value)[:100]}")



Key: personal_info
Type: dict
Dict keys: ['name', 'email', 'phone', 'location', 'summary', 'linkedin', 'github']
  - name: str
  - email: str
  - phone: str
  - location: dict
    Dict keys: ['city', 'country', 'remote_preference']
  - summary: str
  - linkedin: str
  - github: str

Key: experience
Type: list
List length: 1
Item 0 type: dict
Item 0 keys: ['company', 'company_info', 'title', 'level', 'employment_type', 'dates', 'responsibilities', 'technical_environment']
  - company: str
  - company_info: dict
    Dict keys: ['industry', 'size']
  - title: str
  - level: str
  - employment_type: str
  - dates: dict
    Dict keys: ['start', 'end', 'duration']
  - responsibilities: list
    List length: 1, first item type: str
  - technical_environment: dict
    Dict keys: ['technologies', 'methodologies', 'tools']

Key: education
Type: list
List length: 2
Item 0 type: dict
Item 0 keys: ['degree', 'institution', 'dates', 'achievements']
  - degree: dict
    Dict keys: ['level', 'field',

# 2. Flatten Data to DataFrame

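`clean_resume_data`: copies the scalar `personal_info` fields (name, email, phone, summary, linkedin, github; the nested `location` is skipped) and serializes the nested fields (experience, education, skills, projects, certifications) to JSON strings. The cell then reads all JSONL records (skipping lines that fail to parse), builds the DataFrame, adds `summary_count` (the word count of `summary`), and drops `phone`.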


In [2]:
import pandas as pd

def clean_resume_data(record):
    cleaned = {}
    personal_info = record.get('personal_info', {})
    if isinstance(personal_info, dict):
        for key, value in personal_info.items():
            if key == 'location':
                continue
            if isinstance(value, (str, int, float, bool, type(None))):
                cleaned[key] = value

    experience = record.get('experience', [])
    cleaned['experience'] = json.dumps(experience)
    education = record.get('education', [])
    cleaned['education'] = json.dumps(education)
    skills = record.get('skills')
    cleaned['skills'] = json.dumps(skills) if skills is not None else json.dumps({})
    projects = record.get('projects', [])
    cleaned['projects'] = json.dumps(projects)
    certifications = record.get('certifications')
    cleaned['certifications'] = json.dumps(certifications) if certifications is not None else json.dumps([])
    return cleaned

all_records = []
with open('master_resumes_original.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        try:
            record = json.loads(line.strip())
            cleaned_record = clean_resume_data(record)
            all_records.append(cleaned_record)
        except json.JSONDecodeError:
            continue

df = pd.DataFrame(all_records)
df['summary_count'] = df['summary'].fillna('').apply(lambda x: len(str(x).split()))
df = df.drop(columns=['phone'])

print(f"Processed {len(all_records)} records")
print(f"Created {len(df.columns)} columns")
print(f"\nColumn names: {list(df.columns)}")
print(f"\nFirst few rows:")
print(df.head())


Processed 4817 records
Created 11 columns

Column names: ['name', 'email', 'summary', 'linkedin', 'github', 'experience', 'education', 'skills', 'projects', 'certifications', 'summary_count']

First few rows:
           name         email  \
0       Unknown       Unknown   
1       Unknown       Unknown   
2  Not Provided  Not Provided   
3       Unknown       Unknown   
4                               

                                             summary      linkedin  \
0  Python Developer with experience in Python, Te...       Unknown   
1  Experienced Operations Manager with expertise ...           NaN   
2  Software Proficiency in various languages and ...  Not Provided   
3  Experienced Operations Manager with expertise ...           NaN   
4                                                                    

         github                                         experience  \
0       Unknown  [{"company": "Fresher", "company_info": {"indu...   
1           NaN  [{"company": "

# 3. Filter Invalid Dates

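Remove resumes in which any experience entry has a placeholder start or end date, meaning one of the values in `invalid_values` ("Unknown", "Not Provided", "Not Available", "N/A", their lowercase variants, or an empty string). `has_invalid_start_end` checks each entry's start/end dates and also flags rows whose experience JSON cannot be parsed.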

In [3]:
invalid_values = {"Unknown", "Not Provided", "Not Available", "N/A", "unknown", "not provided", ""}

def has_invalid_start_end(experience_json):
    try:
        exp_list = json.loads(experience_json)
        for exp in exp_list:
            dates = exp.get("dates", {})
            start = dates.get("start", "")
            end = dates.get("end", "")
            if start in invalid_values or end in invalid_values:
                return True
        return False
    except Exception:
        return True

df = df[~df["experience"].apply(has_invalid_start_end)].reset_index(drop=True)
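
The placeholder set above lists some values in two casings. One possible alternative, sketched here under the assumption that placeholders should match regardless of case and surrounding whitespace, is to normalize each date string before the lookup. The names `PLACEHOLDERS`, `is_placeholder`, and `has_placeholder_dates` are hypothetical.

```python
import json

# Assumption: placeholders are compared after strip() and lower().
PLACEHOLDERS = {"unknown", "not provided", "not available", "n/a", ""}

def is_placeholder(value):
    """True for strings that normalize to one of the placeholder values."""
    return isinstance(value, str) and value.strip().lower() in PLACEHOLDERS

def has_placeholder_dates(experience_json):
    """True if any job has a placeholder start/end date or the JSON is unusable."""
    try:
        jobs = json.loads(experience_json)
        return any(
            is_placeholder(job.get("dates", {}).get(key, ""))
            for job in jobs
            for key in ("start", "end")
        )
    except (TypeError, ValueError, AttributeError):
        return True
```

This variant would also catch values such as `" UNKNOWN "`, so it may drop slightly more rows than the exact-match check.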


In [4]:
print(f"Total rows left after deletion: {len(df)}")

Total rows left after deletion: 4634


# 4. Extract Seniority and Experience Time

We define helper functions and extract seniority data:

- `parse_date`: converts date strings (or "present") to datetime objects
- `calculate_duration`: computes years between start/end dates
- `extract_seniority_and_experience`: for each resume, we:
  - Calculate duration for each job and store it in `duration_calc`
  - Normalize `level` and `title` to lowercase
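  - Identify the target experience: the one with a "present" end date, otherwise the one with the latest end date; if another entry has the same title and a higher level, that entry is used instead
  - Read `experience_level` from the target job, then blank that job's `level` field so the label does not leak into the features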
  - Return: `experience_level`, `last_experience_only`, `total_experience_time`, `last_experience_time`, `target_job_title`, and updated experience JSON
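
The target-selection rule in the list above can be condensed into a short sketch. This is an illustrative restatement rather than the notebook's implementation: `pick_target` and `_end_date` are hypothetical names, and titles and levels are assumed to be lowercased already.

```python
from datetime import datetime

LEVEL_RANK = {"junior": 1, "mid": 2, "senior": 3}

def _end_date(exp):
    """Sort key: 'present' beats any concrete date, unparseable dates come last."""
    end = exp.get("dates", {}).get("end", "")
    if isinstance(end, str) and end.strip().lower() == "present":
        return datetime.max
    try:
        return datetime.strptime(end, "%Y-%m-%d")
    except (TypeError, ValueError):
        return datetime.min

def pick_target(experiences):
    """Latest job, unless a same-title job carries a strictly higher level."""
    latest = max(experiences, key=_end_date)
    same_title = [e for e in experiences if e.get("title") == latest.get("title")]
    # On equal rank, the second tuple element keeps the latest job.
    return max(
        same_title,
        key=lambda e: (LEVEL_RANK.get(e.get("level"), 0), e is latest),
    )
```

For example, a resume with a current "dev" job at level "mid" and an earlier "dev" job at level "senior" would resolve to the earlier, senior entry.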


In [5]:
from datetime import datetime
import json

PRESENT_DATE = "2025-12-12"

def parse_date(date_str):
    # "present" maps to the fixed PRESENT_DATE so durations are reproducible
    if isinstance(date_str, str) and date_str.strip().lower() == "present":
        date_str = PRESENT_DATE
    try:
        return datetime.strptime(date_str, "%Y-%m-%d")
    except (TypeError, ValueError):
        # Non-string input or a date not in YYYY-MM-DD format
        return None

def calculate_duration(start_date, end_date):
    start = parse_date(start_date)
    end = parse_date(end_date)
    if start and end:
        duration_years = (end - start).days / 365.25
        return max(0, duration_years)
    return 0

level_map = {"junior": 1, "mid": 2, "senior": 3}

def extract_seniority_and_experience(exp_json):
    # Defaults returned when the experience JSON is malformed or empty
    empty_result = {
        "experience_level": "unknown",
        "last_experience_only": "N/A",
        "total_experience_time": "0 Years",
        "last_experience_time": "0 Years",
        "target_job_title": "",
        "updated_experience_json": exp_json
    }

    try:
        experiences = json.loads(exp_json)
    except (TypeError, ValueError):
        return empty_result

    if not isinstance(experiences, list) or len(experiences) == 0:
        return empty_result

    for exp in experiences:
        dates = exp.get('dates', {})
        start = dates.get('start', '')
        end = dates.get('end', '')
        duration_val = calculate_duration(start, end)
        exp['duration_calc'] = duration_val
        if 'dates' not in exp:
            exp['dates'] = {}
        exp['dates']['duration'] = f"{duration_val:.2f} years"
        if 'level' in exp and isinstance(exp['level'], str):
            exp['level'] = exp['level'].lower()
        if 'title' in exp and exp['title'] is not None:
            exp['title'] = str(exp['title']).strip().lower()
        else:
            exp['title'] = ""

    if len(experiences) == 1:
        last_experience_only_flag = "Only last Experience listed"
    else:
        last_experience_only_flag = "Multiple Experiences listed"

    if len(experiences) == 1:
        most_recent_exp = experiences[0]
    else:
        # More than one experience; find which is the "most recent" (should be the one with "present", if any)
        def is_present(exp):
            end = exp.get('dates', {}).get('end', '')
            return isinstance(end, str) and end.strip().lower() == "present"
        # 'present' always marks the last experience
        present_exps = [exp for exp in experiences if is_present(exp)]
        if present_exps:
            most_recent_exp = present_exps[0]
        else:
            # Fall back to max end date
            def exp_end_date(exp):
                return parse_date(exp.get('dates', {}).get('end', '')) or datetime.min
            most_recent_exp = max(experiences, key=exp_end_date)

        temp_exp = most_recent_exp
        temp_title = temp_exp.get('title', '').lower()
        temp_level = temp_exp.get('level', '').lower()
        temp_level_rank = level_map.get(temp_level, 0)

        for exp in experiences:
            if exp is temp_exp:
                continue
            exp_title = exp.get('title', '').lower()
            exp_level = exp.get('level', '').lower()
            exp_level_rank = level_map.get(exp_level, 0)
            if exp_title == temp_title and exp_level_rank > temp_level_rank:
                temp_level_rank = exp_level_rank
                temp_exp = exp
                temp_level = exp_level
        
        most_recent_exp = temp_exp

    experience_level = most_recent_exp.get('level', 'unknown').lower()
    target_job_title = most_recent_exp.get('title', '')
    most_recent_exp['level'] = ""
    
    total_experience_years = sum(exp.get('duration_calc', 0) for exp in experiences)
    last_experience_years = most_recent_exp.get('duration_calc', 0)

    # Keep only title and responsibilities for each experience
    # cleaned_experiences = []
    # for exp in experiences:
    #     cleaned_exp = {}
    #     if 'title' in exp:
    #         cleaned_exp['title'] = exp['title']
    #     if 'responsibilities' in exp:
    #         cleaned_exp['responsibilities'] = exp['responsibilities']
    #     cleaned_experiences.append(cleaned_exp)


    return {
        "experience_level": experience_level,
        "last_experience_only": last_experience_only_flag,
        "total_experience_time": f"{round(total_experience_years, 2)} Years",
        "last_experience_time": f"{round(last_experience_years, 2)} Years",
        "target_job_title": target_job_title,
        "updated_experience_json": json.dumps(experiences)
    }

df_results = df['experience'].apply(extract_seniority_and_experience)
df['experience_level'] = df_results.apply(lambda x: x['experience_level'])
df['last_experience_only'] = df_results.apply(lambda x: x['last_experience_only'])
df['total_experience_time'] = df_results.apply(lambda x: x['total_experience_time'])
df['last_experience_time'] = df_results.apply(lambda x: x['last_experience_time'])
df['job title'] = df_results.apply(lambda x: x['target_job_title'])
# Every result dict carries 'updated_experience_json', so no fallback is needed
df['experience'] = df_results.apply(lambda x: x['updated_experience_json'])

`_fix_experience_json`: drops `employment_type` from every job and, where a job's end date falls after the next listed job's start date, shifts that job so it ends at the next job's start while keeping its duration. `_parse_date_flexible` accepts several date formats, and `_format_like` writes the shifted dates back in the same numeric format as the original string (falling back to `YYYY-MM-DD`).

In [6]:
from datetime import datetime, timedelta
import json

def _parse_date_flexible(date_str):
    if not isinstance(date_str, str):
        return None
    s = date_str.strip()
    if not s:
        return None
    if s.lower() == "present":
        try:
            return datetime.strptime(PRESENT_DATE, "%Y-%m-%d")
        except Exception:
            return None
    for fmt in ("%Y-%m-%d", "%Y/%m/%d", "%Y-%m", "%Y/%m", "%Y", "%b %Y", "%B %Y"):
        try:
            return datetime.strptime(s, fmt)
        except Exception:
            pass
    return None

def _format_like(dt, template):
    if not isinstance(template, str):
        return dt.strftime("%Y-%m-%d")
    t = template.strip()
    if t.lower() == "present":
        return "Present"
    if len(t) == 4 and t.isdigit():
        return dt.strftime("%Y")
    if "-" in t:
        parts = t.split("-")
        if len(parts) == 2:
            return dt.strftime("%Y-%m")
        if len(parts) >= 3:
            return dt.strftime("%Y-%m-%d")
    if "/" in t:
        parts = t.split("/")
        if len(parts) == 2:
            return dt.strftime("%Y/%m")
        if len(parts) >= 3:
            return dt.strftime("%Y/%m/%d")
    return dt.strftime("%Y-%m-%d")

def _fix_experience_json(experience_json):
    try:
        experiences = json.loads(experience_json)
    except Exception:
        return experience_json

    if not isinstance(experiences, list) or len(experiences) == 0:
        return experience_json

    for exp in experiences:
        if not isinstance(exp, dict):
            continue
        exp.pop("employment_type", None)
        exp.pop("employmentType", None)

    for i in range(len(experiences) - 2, -1, -1):
        cur = experiences[i]
        nxt = experiences[i + 1]
        if not isinstance(cur, dict) or not isinstance(nxt, dict):
            continue

        cur_dates = cur.get("dates") if isinstance(cur.get("dates"), dict) else {}
        nxt_dates = nxt.get("dates") if isinstance(nxt.get("dates"), dict) else {}
        cur["dates"] = cur_dates
        nxt["dates"] = nxt_dates

        cur_end_str = cur_dates.get("end", "")
        nxt_start_str = nxt_dates.get("start", "")

        cur_end_dt = _parse_date_flexible(cur_end_str)
        nxt_start_dt = _parse_date_flexible(nxt_start_str)
        if not cur_end_dt or not nxt_start_dt:
            continue

        if cur_end_dt > nxt_start_dt:
            dur_years = cur.get("duration_calc")
            if not isinstance(dur_years, (int, float)) or dur_years <= 0:
                cur_start_dt = _parse_date_flexible(cur_dates.get("start", ""))
                if not cur_start_dt:
                    continue
                dur_years = max(0, (cur_end_dt - cur_start_dt).days / 365.25)

            new_end_dt = nxt_start_dt
            new_start_dt = new_end_dt - timedelta(days=int(round(dur_years * 365.25)))
            if new_start_dt > new_end_dt:
                new_start_dt = new_end_dt

            cur_start_str = cur_dates.get("start", "")
            cur_dates["end"] = _format_like(new_end_dt, cur_end_str)
            cur_dates["start"] = _format_like(new_start_dt, cur_start_str)

    return json.dumps(experiences)


df["experience"] = df["experience"].apply(_fix_experience_json)
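
The core of the overlap fix in the cell above is moving an interval so that it ends where the next job starts, without changing its length. A minimal sketch of that single step, assuming both dates are already parsed (the helper name `shift_to_end_at` and the sample dates are made up for illustration):

```python
from datetime import datetime

def shift_to_end_at(start, end, new_end):
    """Move a (start, end) interval so it ends at new_end, keeping its length."""
    return new_end - (end - start), new_end

# Hypothetical overlapping jobs: the first ends after the second begins.
cur_start, cur_end = datetime(2018, 1, 1), datetime(2020, 6, 1)
next_start = datetime(2020, 1, 1)

if cur_end > next_start:
    cur_start, cur_end = shift_to_end_at(cur_start, cur_end, next_start)
```

The notebook derives the length from `duration_calc` (in years) rather than from the raw timedelta, so its results can differ from this sketch by a day of rounding.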


# 5. Parse JSON Fields to Text

We convert JSON fields to readable text strings for model training.


In [7]:
def parse_experience_to_text(experience_json):
    try:
        experiences = json.loads(experience_json)
        if not experiences or not isinstance(experiences, list):
            return ""
        text_parts = []
        for i, exp in enumerate(experiences, 1):
            exp_text = f"Experience {i}: "
            if exp.get('company'):
                exp_text += f"Company: {exp['company']}. "
            if exp.get('title'):
                exp_text += f"Title: {exp['title']}. "
            if exp.get('level'):
                exp_text += f"Level: {exp['level']}. "
            if exp.get('dates'):
                dates = exp['dates']
                if dates.get('start'):
                    exp_text += f"Start Date: {dates['start']}. "
                if dates.get('end'):
                    exp_text += f"End Date: {dates['end']}. "
                if dates.get('duration'):
                    exp_text += f"Duration: {dates['duration']}. "
            if exp.get('responsibilities') and isinstance(exp['responsibilities'], list):
                responsibilities = ' '.join(exp['responsibilities'])
                exp_text += f"Responsibilities: {responsibilities}. "
            if exp.get('technical_environment'):
                tech_env = exp['technical_environment']
                if tech_env.get('technologies') and isinstance(tech_env['technologies'], list):
                    technologies = ', '.join(tech_env['technologies'])
                    exp_text += f"Technologies: {technologies}. "
                if tech_env.get('methodologies') and isinstance(tech_env['methodologies'], list):
                    methodologies = ', '.join(tech_env['methodologies'])
                    exp_text += f"Methodologies: {methodologies}. "
                if tech_env.get('tools') and isinstance(tech_env['tools'], list):
                    tools = ', '.join(tech_env['tools'])
                    exp_text += f"Tools: {tools}. "
            if exp.get('company_info'):
                company_info = exp['company_info']
                if company_info.get('industry'):
                    exp_text += f"Industry: {company_info['industry']}. "
                if company_info.get('size'):
                    exp_text += f"Company Size: {company_info['size']}. "
            text_parts.append(exp_text.strip())
        return ' '.join(text_parts)
    except (json.JSONDecodeError, TypeError, AttributeError):
        return ""

df['experience'] = df['experience'].apply(parse_experience_to_text)

`parse_education_to_text`: converts education JSON to text (degree, institution, dates, achievements).


In [8]:
def parse_education_to_text(education_str):
    try:
        if pd.isna(education_str) or education_str == '':
            return ""
        education_list = json.loads(education_str)
        if not isinstance(education_list, list):
            return ""
        text_parts = []
        for idx, edu in enumerate(education_list, 1):
            if not isinstance(edu, dict):
                continue
            edu_text = f"Education {idx}: "
            if edu.get('degree'):
                degree = edu['degree']
                if degree.get('level'):
                    edu_text += f"Degree Level: {degree['level']}. "
                if degree.get('field'):
                    edu_text += f"Field: {degree['field']}. "
                if degree.get('major'):
                    edu_text += f"Major: {degree['major']}. "
            if edu.get('institution'):
                institution = edu['institution']
                if institution.get('name'):
                    edu_text += f"Institution: {institution['name']}. "
                if institution.get('location'):
                    edu_text += f"Location: {institution['location']}. "
                if institution.get('accreditation'):
                    edu_text += f"Accreditation: {institution['accreditation']}. "
            if edu.get('dates'):
                dates = edu['dates']
                if dates.get('start'):
                    edu_text += f"Start Date: {dates['start']}. "
                if dates.get('end'):
                    edu_text += f"End Date: {dates['end']}. "
                if dates.get('expected_graduation'):
                    edu_text += f"Expected Graduation: {dates['expected_graduation']}. "
            if edu.get('achievements'):
                achievements = edu['achievements']
                if achievements.get('gpa'):
                    edu_text += f"GPA: {achievements['gpa']}. "
                if achievements.get('honors'):
                    edu_text += f"Honors: {achievements['honors']}. "
                if achievements.get('relevant_coursework') and isinstance(achievements['relevant_coursework'], list):
                    coursework = ', '.join(achievements['relevant_coursework'])
                    if coursework:
                        edu_text += f"Relevant Coursework: {coursework}. "
            text_parts.append(edu_text.strip())
        return ' '.join(text_parts)
    except (json.JSONDecodeError, TypeError, AttributeError):
        return ""

df['education'] = df['education'].apply(parse_education_to_text)

`parse_skills_to_text`: converts skills JSON to text (programming languages, frameworks, databases, cloud, spoken languages).


In [9]:
def parse_skills_to_text(skills_json):
    if pd.isna(skills_json) or skills_json == '':
        return ""
    try:
        if isinstance(skills_json, str):
            skills = json.loads(skills_json)
        else:
            skills = skills_json
        text_parts = []
        if skills.get('technical'):
            technical = skills['technical']
            if technical.get('programming_languages') and isinstance(technical['programming_languages'], list):
                for lang in technical['programming_languages']:
                    if lang.get('name'):
                        lang_text = f"Programming Language: {lang['name']}"
                        if lang.get('level'):
                            lang_text += f" (Level: {lang['level']})"
                        text_parts.append(lang_text + ". ")
            if technical.get('frameworks') and isinstance(technical['frameworks'], list):
                for framework in technical['frameworks']:
                    if framework.get('name'):
                        framework_text = f"Framework: {framework['name']}"
                        if framework.get('level'):
                            framework_text += f" (Level: {framework['level']})"
                        text_parts.append(framework_text + ". ")
            if technical.get('databases') and isinstance(technical['databases'], list):
                for db in technical['databases']:
                    if db.get('name'):
                        db_text = f"Database: {db['name']}"
                        if db.get('level'):
                            db_text += f" (Level: {db['level']})"
                        text_parts.append(db_text + ". ")
            if technical.get('cloud') and isinstance(technical['cloud'], list):
                for cloud in technical['cloud']:
                    if isinstance(cloud, dict) and cloud.get('name'):
                        cloud_text = f"Cloud: {cloud['name']}"
                        if cloud.get('level'):
                            cloud_text += f" (Level: {cloud['level']})"
                        text_parts.append(cloud_text + ". ")
                    elif isinstance(cloud, str):
                        text_parts.append(f"Cloud: {cloud}. ")
        if skills.get('languages') and isinstance(skills['languages'], list):
            for language in skills['languages']:
                if isinstance(language, dict) and language.get('name'):
                    lang_text = f"Language: {language['name']}"
                    if language.get('level'):
                        lang_text += f" (Level: {language['level']})"
                    text_parts.append(lang_text + ". ")
                elif isinstance(language, str):
                    text_parts.append(f"Language: {language}. ")
        return ' '.join(text_parts).strip()
    except (json.JSONDecodeError, TypeError, AttributeError):
        return ""

df['skills'] = df['skills'].apply(parse_skills_to_text)
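
The five loops in `parse_skills_to_text` follow one pattern: emit `Label: name (Level: level)` for each item. One way to express that pattern once is sketched below; `named_items_to_text` is a hypothetical helper, not part of the notebook. Unlike the loops for programming languages, frameworks, and databases above, it also tolerates plain-string items, which in the original would raise `AttributeError` and make the whole skills text come back empty.

```python
def named_items_to_text(label, items):
    """Render a list of {'name', 'level'} dicts (or plain strings) as sentences."""
    parts = []
    if not isinstance(items, list):
        return parts
    for item in items:
        if isinstance(item, dict) and item.get("name"):
            text = f"{label}: {item['name']}"
            if item.get("level"):
                text += f" (Level: {item['level']})"
            parts.append(text + ".")
        elif isinstance(item, str) and item:
            parts.append(f"{label}: {item}.")
    return parts
```

A caller could then build the full text with something like `' '.join(named_items_to_text("Framework", technical.get("frameworks")))`, repeated per category.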


`parse_projects_to_text`: converts projects JSON to text (name, description, technologies, role, URL, impact). Returns "No Projects made" if empty.


In [10]:
def parse_projects_to_text(projects_str):
    try:
        if pd.isna(projects_str) or str(projects_str).strip() == '' or str(projects_str).strip() == '[]':
            return "No Projects made"
        projects = json.loads(projects_str)
        if not projects or not isinstance(projects, list) or len(projects) == 0:
            return "No Projects made"
        text_parts = []
        for idx, project in enumerate(projects, 1):
            if not isinstance(project, dict):
                continue
            project_text = f"Project {idx}: "
            if project.get('name'):
                project_text += f"{project['name']}. "
            if project.get('description'):
                project_text += f"Description: {project['description']}. "
            if project.get('technologies') and isinstance(project['technologies'], list):
                tech_list = [tech for tech in project['technologies'] if tech]
                if tech_list:
                    project_text += f"Technologies: {', '.join(tech_list)}. "
            if project.get('role'):
                project_text += f"Role: {project['role']}. "
            if project.get('url'):
                project_text += f"URL: {project['url']}. "
            if project.get('impact'):
                project_text += f"Impact: {project['impact']}. "
            
            text_parts.append(project_text.strip())
        if not text_parts:
            return "No Projects made"
        return ' '.join(text_parts).strip()
    except (json.JSONDecodeError, TypeError, AttributeError):
        return "No Projects made"

df['projects'] = df['projects'].apply(parse_projects_to_text)

# 6. Clean Up

We drop the `certifications` column (no data) and remove rows with empty `summary`.


In [11]:
df = df.drop('certifications', axis=1)

In [12]:
df = df[df['summary'].astype(str).str.strip() != '']
print(len(df))

4626


# 7. Generate Professional Summaries Using LLM

We generate new summaries through the OpenRouter API, using the model configured in `MODEL` below (currently `google/gemma-3n-e4b-it`), based on resume content and **excluding personal info and seniority level**. Summaries are cached in `generated_summaries.json`.

In [13]:
import os

LOAD_EXISTING_SUMMARIES = True
summaries_file = 'generated_summaries.json'

if LOAD_EXISTING_SUMMARIES and os.path.exists(summaries_file):
    with open(summaries_file, 'r', encoding='utf-8') as f:
        loaded_summaries = json.load(f)
    new_summaries = {int(k): v for k, v in loaded_summaries.items()}
    print(f"Loaded {len(new_summaries)} existing summaries from {summaries_file}")
    SKIP_GENERATION = True
else:
    SKIP_GENERATION = False
    print(f"Will generate new summaries (file not found or LOAD_EXISTING_SUMMARIES=False)")


Will generate new summaries (file not found or LOAD_EXISTING_SUMMARIES=False)


In [None]:
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

MODEL = "google/gemma-3n-e4b-it"
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
MAX_WORKERS = 10
EXCLUDE_COLS = {"name", "email", "linkedin", "github", "experience_level", "summary", "summary_count"}


In [15]:
def build_summary_prompt(row):
    parts = []
    for col in row.index:
        if col not in EXCLUDE_COLS:
            val = str(row[col]).strip()
            if val and val.lower() != "nan":
                parts.append(f"{col}: {val}")
    resume_text = "\n".join(parts)
    return f"""You are the person described in this resume. Write a professional summary for your resume in first person that captures who you are as a professional.

    {resume_text}

    Focus on what makes your experience distinctive. Let the summary flow naturally - if you have deep expertise in one area, lean into that. If your career spans diverse technologies, show that breadth. If you've worked on interesting projects, bring them to life. The summary should feel authentic to your specific background, not generic.

    Write only the summary paragraph, in plain text. Minimum 20 words, maximum 70 words. Base the length on the resume data itself: if there is a lot to talk about, mention it; if not, be as short and concise as possible. For example, a single experience does not warrant much text, but multiple experiences can justify more. Stay as close as possible to the data and add nothing from outside it."""

def call_api(prompt):
    payload = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        # "max_tokens": 150
    }).encode("utf-8")
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {OPENROUTER_API_KEY}",
            "Content-Type": "application/json"
        }
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        result = json.loads(resp.read().decode("utf-8"))
    return result["choices"][0]["message"]["content"].strip()

def generate_summary_for_row(idx_row):
    idx, row = idx_row
    try:
        prompt = build_summary_prompt(row)
        summary = call_api(prompt)
        return idx, summary
    except Exception as e:
        print(f"Row {idx} failed: {e}")
        return idx, row.get("summary", "")
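
`generate_summary_for_row` falls back to the original summary on any exception, including transient network errors. A retry wrapper is one way to reduce such fallbacks. The sketch below is an assumption about how that could be done, not something the notebook does; `call_with_retries`, `attempts`, and `base_delay` are hypothetical names.

```python
import time

def call_with_retries(fn, *args, attempts=3, base_delay=1.0):
    """Call fn(*args); on failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(attempts):
        try:
            return fn(*args)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller's fallback handle it
            time.sleep(base_delay * 2 ** attempt)
```

Inside `generate_summary_for_row`, `call_api(prompt)` could then become `call_with_retries(call_api, prompt)`, leaving the existing `except` branch as the last resort.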

In [16]:
if not SKIP_GENERATION:
    print(f"Generating summaries for {len(df)} rows...")
    new_summaries = {}
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        futures = {executor.submit(generate_summary_for_row, (idx, row)): idx for idx, row in df.iterrows()}
        for i, future in enumerate(as_completed(futures), 1):
            idx, summary = future.result()
            new_summaries[idx] = summary
            if i % 100 == 0:
                print(f"Processed {i}/{len(df)}")
    print(f"Done! Generated {len(new_summaries)} summaries")
else:
    print("Skipped generation - using loaded summaries")

Generating summaries for 4626 rows...
Processed 100/4626
Processed 200/4626
Processed 300/4626
Processed 400/4626
Processed 500/4626
Processed 600/4626
Processed 700/4626
Processed 800/4626
Processed 900/4626
Processed 1000/4626
Processed 1100/4626
Processed 1200/4626
Processed 1300/4626
Processed 1400/4626
Processed 1500/4626
Processed 1600/4626
Processed 1700/4626
Processed 1800/4626
Processed 1900/4626
Processed 2000/4626
Row 2025 failed: IncompleteRead(11 bytes read)
Processed 2100/4626
Processed 2200/4626
Processed 2300/4626
Row 2409 failed: IncompleteRead(11 bytes read)
Processed 2400/4626
Processed 2500/4626
Processed 2600/4626
Processed 2700/4626
Processed 2800/4626
Processed 2900/4626
Processed 3000/4626
Processed 3100/4626
Processed 3200/4626
Processed 3300/4626
Processed 3400/4626
Processed 3500/4626
Processed 3600/4626
Processed 3700/4626
Processed 3800/4626
Processed 3900/4626
Processed 4000/4626
Processed 4100/4626
Processed 4200/4626
Processed 4300/4626
Processed 4400/4626
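
The log above shows rows 2025 and 2409 failing with `IncompleteRead`, which is usually a transient network error; those rows silently keep their previous summary. One possible mitigation is to retry such calls with exponential backoff before falling back. The sketch below is not part of the notebook: `with_retries`, `max_attempts`, `base_delay`, and the injectable `sleep` are hypothetical names chosen for illustration.

```python
import http.client
import time
import urllib.error


def with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); retry transient network errors with exponential backoff."""
    # Note: URLError also covers HTTPError, so HTTP error statuses are retried too.
    transient = (http.client.IncompleteRead, urllib.error.URLError, TimeoutError)
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except transient:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller's fallback handle it
            sleep(base_delay * 2 ** (attempt - 1))  # waits 1s, 2s, 4s, ...
```

Inside `generate_summary_for_row`, the call could then become `summary = with_retries(lambda: call_api(prompt))`, leaving the existing `except` branch as the last resort.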

In [17]:
if not SKIP_GENERATION:
    summaries_file = 'generated_summaries.json'
    with open(summaries_file, 'w', encoding='utf-8') as f:
        json.dump({str(k): v for k, v in new_summaries.items()}, f, ensure_ascii=False, indent=2)
    print(f"Saved {len(new_summaries)} summaries to {summaries_file}")
else:
    print("Summaries loaded from file - no need to save")

Saved 4626 summaries to generated_summaries.json
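
JSON object keys are always strings, which is why the cell above writes each index as `str(k)`. When the file is read back on the `SKIP_GENERATION` path, the keys need to be converted back to the index type, otherwise `df.index.map(new_summaries)` would not match an integer index. A small round-trip sketch, assuming the DataFrame index is integer; `load_summaries` is a hypothetical helper, not a function from the notebook.

```python
import json
import os
import tempfile


def load_summaries(path):
    """Read summaries saved as {str(index): text} and restore integer keys."""
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    return {int(k): v for k, v in raw.items()}


# Round trip on toy data, mirroring how the notebook writes the file
toy = {0: "Java developer ...", 7: "Project manager ..."}
path = os.path.join(tempfile.mkdtemp(), "generated_summaries.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump({str(k): v for k, v in toy.items()}, f, ensure_ascii=False, indent=2)

restored = load_summaries(path)
print(restored == toy)  # True: keys are integers again
```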


In [18]:
df["summary"] = df.index.map(new_summaries)
df["summary_count"] = df["summary"].fillna("").apply(len)

print("Sample generated summaries:")
print("="*80)
for i in range(3):
    print(f"\n[{i+1}] {df.iloc[i]['summary'][:300]}...")
    print("-"*80)


Sample generated summaries:

[1] Highly motivated and results-oriented Java Developer with 10.45 years of experience in developing and maintaining robust Java-based web applications. Proven ability to leverage Spring, Hibernate, and MySQL within Agile environments.  I am proficient in front-end development using various technologie...
--------------------------------------------------------------------------------

[2] Highly accomplished Project Manager with 11+ years at AT&T, leading complex transition and operational projects within the telecommunications industry. Proven expertise in vendor management, risk mitigation, and driving quality through automation and process improvements. Skilled in Agile methodolog...
--------------------------------------------------------------------------------

[3] Highly accomplished Advocate with 13.32 years of experience advising clients on legal rights and representing them in courts across various jurisdictions, including international companie

In [19]:
import re

def remove_seniority_keywords(text):
    if pd.isna(text) or text == "":
        return text
    text = re.sub(r'\b(junior|senior|mid)\b', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

df["summary"] = df["summary"].apply(remove_seniority_keywords)
print("Removed seniority keywords from summaries")


Removed seniority keywords from summaries
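
The pattern `\b(junior|senior|mid)\b` removes the bare keywords only. In a phrase such as "mid-level" it strips "mid" and leaves "-level" behind, and abbreviations like "Sr." are not matched at all. If that residue matters for leakage, a broader pattern is one option. The sketch below is an illustration under that assumption; `SENIORITY_PATTERN` and `strip_seniority` are hypothetical names, and the list of variants is not exhaustive.

```python
import re

# Hypothetical broader pattern: also matches "-level"/" level" suffixes
# and the abbreviations "jr"/"sr" (with an optional trailing dot).
SENIORITY_PATTERN = re.compile(
    r"\b(?:junior|senior|mid)(?:[-\s]level)?\b|\b(?:jr|sr)\b\.?",
    flags=re.IGNORECASE,
)


def strip_seniority(text):
    """Remove seniority keywords from a string and collapse leftover whitespace."""
    cleaned = SENIORITY_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", cleaned).strip()


print(strip_seniority("Senior Java Developer and mid-level manager"))
# Java Developer and manager
print(strip_seniority("Sr. Engineer at midnight"))
# Engineer at midnight
```

Words that merely contain a keyword, such as "midnight", are left intact because of the `\b` word boundaries.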


# 9. Reorder Columns

We put the informative features first (experience, projects, skills, summary, education) and the fields that are noise for the model last (name, email, linkedin, github). Any column not listed in either group stays in between.

In [20]:
preferred_first = [
    "experience", "projects", "skills", "summary", "education",
    "certifications", "job title",
    "total_experience_time", "last_experience_time", "summary_count", "last_experience_only"
]
preferred_last = ["name", "email", "linkedin", "github"]

all_columns = df.columns.tolist()
first_cols = [col for col in preferred_first if col in all_columns]
last_cols = [col for col in preferred_last if col in all_columns]
middle_cols = [col for col in all_columns if col not in first_cols and col not in last_cols]

new_column_order = first_cols + middle_cols + last_cols
df = df[new_column_order]

print("Columns reordered:")
print(df.columns.tolist())

Columns reordered:
['experience', 'projects', 'skills', 'summary', 'education', 'job title', 'total_experience_time', 'last_experience_time', 'summary_count', 'last_experience_only', 'experience_level', 'name', 'email', 'linkedin', 'github']


# 10. Save Full Dataset

We save the complete cleaned dataset to `cleaned_resumes.csv`.


In [21]:
df.to_csv('cleaned_resumes.csv', index=False)

# 11. Balance Dataset (700 per class)

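We select 700 samples per class. For **senior** and **mid**, we take the 700 resumes with the most total experience; for **junior**, we draw a random sample of 700 (`random_state=42`). The balanced set is then saved to every model directory (`./Baseline`, `./Smaller Models`, `./Big Models`).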


In [22]:
def extract_years(time_str):
    """Parse strings like '10.45 Years' into a float; return 0.0 if unparseable."""
    try:
        return float(str(time_str).replace(' Years', '').strip())
    except ValueError:
        return 0.0

senior_df = df[df['experience_level'] == 'senior'].copy()
mid_df = df[df['experience_level'] == 'mid'].copy()
junior_df = df[df['experience_level'] == 'junior']

senior_df['_exp_years'] = senior_df['total_experience_time'].apply(extract_years)
mid_df['_exp_years'] = mid_df['total_experience_time'].apply(extract_years)

senior_sample = senior_df.sort_values('_exp_years', ascending=False).head(700).drop(columns=['_exp_years'])
mid_sample = mid_df.sort_values('_exp_years', ascending=False).head(700).drop(columns=['_exp_years'])
junior_sample = junior_df.sample(n=min(700, len(junior_df)), random_state=42)

balanced_df = pd.concat([senior_sample, mid_sample, junior_sample], ignore_index=True)
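
To make the ranking input concrete, here is how the parsing helper treats typical and malformed values, assuming the `'<number> Years'` format implied by the `.replace(' Years', '')` call. One detail worth knowing: a missing value (NaN) comes back as `nan` rather than `0.0`, because `float("nan")` is a valid conversion.

```python
import math


def extract_years(time_str):
    """Parse strings like '10.45 Years' into a float; return 0.0 if unparseable."""
    try:
        return float(str(time_str).replace(" Years", "").strip())
    except ValueError:
        return 0.0


samples = ["10.45 Years", "13.32 Years", "unknown", None, float("nan")]
parsed = [extract_years(s) for s in samples]
print(parsed)                  # [10.45, 13.32, 0.0, 0.0, nan]
print(math.isnan(parsed[-1]))  # True
```

Since pandas' `sort_values` places NaN last by default, such rows would only reach the top 700 if a class had fewer than 700 parseable values.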

In [23]:
balanced_df.to_csv('./Baseline/cleaned_resumes.csv', index=False)
balanced_df.to_csv('./Smaller Models/cleaned_resumes.csv', index=False)
balanced_df.to_csv('./Big Models/cleaned_resumes.csv', index=False)

print(f"Saved {len(balanced_df)} records to ./Baseline/cleaned_resumes.csv")
print("\nExperience level distribution:")
print(balanced_df['experience_level'].value_counts())

Saved 2100 records to ./Baseline/cleaned_resumes.csv

Experience level distribution:
experience_level
senior    700
mid       700
junior    700
Name: count, dtype: int64
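
The save cell assumes the three model directories already exist; `to_csv` does not create missing directories. A minimal sketch of a helper that creates them first follows. `save_to_all` and `MODEL_DIRS` are hypothetical names, and the helper only relies on the frame having a `to_csv(path, index=False)` method.

```python
from pathlib import Path

# Directory names taken from the save cell above
MODEL_DIRS = ["./Baseline", "./Smaller Models", "./Big Models"]


def save_to_all(frame, directories, filename="cleaned_resumes.csv"):
    """Write frame to every directory, creating missing directories first."""
    written = []
    for directory in directories:
        target = Path(directory)
        target.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        out_path = target / filename
        frame.to_csv(out_path, index=False)
        written.append(out_path)
    return written
```

In the notebook this would be called as `save_to_all(balanced_df, MODEL_DIRS)`.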
