# Generic Resume Parser

**Goal:** Given a resume (PDF or Word), extract a structured JSON object:

## Outline

1. **Setup & Requirements** – optional installs, environment configuration  
2. **Project Structure & Settings** – skill list, file handling options  
3. **Text Extraction** – robust loaders for PDF/DOCX/TXT  
4. **Gemini Parser** – JSON‑only extraction with strict schema  
5. **Fallback Parser** – rules‑based extraction (email, name, skills)  
6. **Unified Pipeline** – `parse_resume(file_path)` chooses the best available method  



## 1) Setup & Requirements
- Install dependencies (if not already available)



**Environment variable :**  
Set api key as `GEMINI_API_KEY` in your local .env file.

In [1]:
import sys, os
from dotenv import load_dotenv
import google.generativeai as genai
import spacy

print("Python:", sys.version)
print("Interpreter:", sys.executable)

load_dotenv()  # loads .env from project root
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
assert GEMINI_API_KEY, "Set GEMINI_API_KEY in your .env"

genai.configure(api_key=GEMINI_API_KEY)
model = genai.GenerativeModel(model_name="gemini-1.5-flash")
print("Gemini OK →", model.generate_content("Say hi in 3 words.").text.strip())

nlp = spacy.load("en_core_web_sm")
print("spaCy OK →", nlp.pipe_names)


Python: 3.11.9 (v3.11.9:de54cf5be3, Apr  2 2024, 07:12:50) [Clang 13.0.0 (clang-1300.0.29.30)]
Interpreter: /Users/christinewei/Documents/resume-parser/.venv/bin/python
Gemini OK → Hello, friend!
spaCy OK → ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [None]:
## Uncoment the lines below to install necessary libraries if you use colab
# !pip install c google-generativeai python-dotenv
# !pip install spacy
# !python -m spacy download en_core_web_sm


In [2]:
import os, json, re
import fitz  # PyMuPDF
import docx
import pandas as pd
import numpy as np
import spacy

### 2) Project Structure & Settings

We keep a **small default skill list** to ensure the fallback works out of the box.  Here we use the esco_skills.csv dataset from Kaggle: https://www.kaggle.com/datasets/thenoob69/esco-skills

Optionally, you can **extend** it by pointing to your own CSV or TXT file.

In [3]:
# Build a default/general skills list set for fallback skill parse matching check

def build_skill_map_from_csv(csv_path='esco_skills.csv'):
    """
    Loads skills from the Kaggle ESCO CSV and creates a mapping from any
    skill variation (alternative or primary) to its canonical name.

    Args:
        csv_path (str): The path to the Kaggle skills CSV file.

    Returns:
        dict: A dictionary mapping every possible skill alias (in lowercase)
              to its canonical, original-cased skill name. Returns an empty
              dict if the file cannot be read.
    """
    try:
        df = pd.read_csv(csv_path)
        print(f"Successfully loaded {csv_path} with {len(df)} rows.")
    except FileNotFoundError:
        print(f"Error: The file was not found at {csv_path}.")
        print("Please download the Kaggle ESCO skills CSV and place it in the correct directory.")
        return {}

    # Create the mapping dictionary
    skill_map = {}

    # Drop rows where the primary skill label is missing, as they are unusable
    df.dropna(subset=['label_cleaned'], inplace=True)

    # Replace NaN in 'altLabels' with an empty string to prevent errors
    df['altLabels'].fillna('', inplace=True)

    for index, row in df.iterrows():
        # Get the canonical skill name (e.g., "Manage Musical Staff")
        canonical_skill = row['label_cleaned'].strip()

        # Map the lowercase version of the canonical skill to itself
        skill_map[canonical_skill.lower()] = canonical_skill

        # Process the alternative labels
        alt_labels_str = row['altLabels']

        # The altLabels are often a long string; we can split them if a clear delimiter exists
        # or treat common phrases. For robustness, we will treat them as space-separated words
        # and also add the full strings. Note: This part might need refinement based on the exact
        # format of the altLabels string. 
        alt_labels_list = alt_labels_str.strip().split('\n')

        for alt_label in alt_labels_list:
            alt_label = alt_label.strip()
            if alt_label:
                # Map the lowercase version of the alias to the canonical skill
                skill_map[alt_label.lower()] = canonical_skill

    print(f"Built a skill map with {len(skill_map)} total variations.")
    return skill_map

# --- Example Usage ---
# Make sure to provide the correct path to your downloaded CSV file.
# SKILL_MAP = build_skill_map_from_csv('/content/skills.csv')
# print(SKILL_MAP)

### 3) Text Extraction

Robust file loader that supports **PDF**, **DOCX**, and **TXT**.  
- Prefers `PyMuPDF` (fast/accurate) if installed.  
- DOCX via `python-docx`.  


In [4]:
def extract_text_from_pdf(file_path):
    """Extracts text from a PDF file."""
    try:
        doc = fitz.open(file_path)
        text = ""
        for page in doc:
            text += page.get_text()
        return text
    except Exception as e:
        print(f"Error reading PDF {file_path}: {e}")
        return None

def extract_text_from_docx(file_path):
    """Extracts text from a .docx file."""
    try:
        doc = docx.Document(file_path)
        text = "\n".join([para.text for para in doc.paragraphs])
        return text
    except Exception as e:
        print(f"Error reading DOCX {file_path}: {e}")
        return None

def extract_text(file_path):
    """
    Detects the file type and uses the appropriate function to extract text.
    Returns the extracted text as a string.
    """
    file_extension = os.path.splitext(file_path)[1].lower()
    if file_extension == '.pdf':
        return extract_text_from_pdf(file_path)
    elif file_extension == '.docx':
        return extract_text_from_docx(file_path)
    else:
        print(f"Unsupported file format: {file_extension}")
        return None

# --- Example Usage ---
# Make sure to provide the correct path to your downloaded sample PDF or Word file for testing
# sample_file_path = '~/sample_resume7.docx'
# if os.path.exists(sample_pdf_path):
#     extracted_content = extract_text(sample_file_path)
#     print("--- Extracted Content ---")
#     print(extracted_content)


### 4) Gemini Parser (LLM‑first)

We ask Gemini to output **strict JSON** with exactly the keys: `name`, `email`, `skills`.
If the model is unavailable or no key is set, we **skip** this path and let the unified pipeline use the fallback.

In [22]:
def parse_resume_with_genai(resume_text):
    """
    Sends resume text to the GenAI model and asks for structured data extraction.
    """
    if not resume_text:
        return None

    # This prompt is the key to the success of the parser.
    # It clearly defines the task, the input, and the desired output format.
    prompt = f"""
    You are an expert resume parser. Your task is to analyze the provided resume text and extract the following information: the candidate's full name, their email address, and a list of their skills.

    The output MUST be a valid JSON object with the following structure:
    {{
      "name": "...",
      "email": "...",
      "skills": ["...", "...", "..."]
    }}

    ** With skills extraction, Keep in mind to ensure you firstly include all the skills under skills section in the resume as how it's named, then check carefully for the rest parts of the resume to distinguish if there is more skills.** 
    Do not include any explanations, introductions, or additional text outside of the JSON object.

    Here is the resume text:
    ---
    {resume_text}
    ---
    """

    try:
        response = model.generate_content(prompt)

        # Clean the response to ensure it's valid JSON
        json_response_text = response.text.strip().replace('```json', '').replace('```', '')

        # Parse the JSON string into a Python dictionary
        return json.loads(json_response_text)
        # return response

    except Exception as e:
        print(f"An error occurred during GenAI parsing: {e}")
        print(f"Raw response was: {response.text}")
        return None



### 5) Fallback Parser (Regex + Heuristics)

- **Email** via robust regex  
- **Name** from early lines (2–4 words, title case, avoid section headers)  
- **Skills** via intersection against a skill list (configurable)


**Name Extraction:**

Extracts candidate names using a tiered approach:

1) spaCy NER , 2) regex near email, and 3) regex from top of text as fallback.


In [6]:
try:
    nlp = spacy.load('en_core_web_sm')
except OSError:
    print("Spacy model not found. Please run: !python -m spacy download en_core_web_sm")
    nlp = None

def clean_extracted_name(name_text):
  """
  Cleans the extracted name by taking only the first line.
  This solves the issue of capturing a name plus a job title on the next line.
  """
  if name_text is None:
      return None

  # Split the text by newline characters and take the first element
  return name_text.splitlines()[0].strip()

def extract_name_with_ner(text, nlp_model):
    """
    Uses spaCy's Named Entity Recognition (NER) to find person names.
    This is the most reliable method.
    """
    if not nlp_model:
        return None

    doc = nlp_model(text)
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            # Check if the name seems plausible (e.g., more than one word)
            if len(ent.text.strip().split()) >= 2:
                return clean_extracted_name(ent.text)
    return None

def extract_name_near_email(text, email):
    """
    Finds a name-like pattern on the same line as or line above the email.
    """
    if not email:
        return None

    # A more flexible regex for names, allowing for initials and hyphens
    name_regex = r"([A-Z][a-z'-]+(?:\s+[A-Z][a-z'-]+|\s+[A-Z]\.)+)"
    lines = text.splitlines()

    for i, line in enumerate(lines):
        if email in line:
            # 1. Check the same line as the email
            match = re.search(name_regex, line)
            if match:
                # Ensure we didn't just match part of an email or URL
                if '@' not in match.group(0):
                    return clean_extracted_name(match.group(0))

            # 2. If not found, check the line directly above (if it exists)
            if i > 0:
                match = re.search(name_regex, lines[i-1])
                if match:
                    return clean_extracted_name(match.group(0))
    return None

def extract_name_from_top(text):
    """
    Cleans the top of the resume and uses regex to find the most likely name.
    This is the final fallback method.
    """
    # Look at the first 300 characters
    text_top = text[:300]

    # More flexible regex allowing for middle initials, different spacing, etc.
    name_regex = r"([A-Z][a-z'-]+(?:\s+[A-Z][a-z'-]+|\s+[A-Z]\.)+)"

    match = re.search(name_regex, text_top)
    if match:
        return clean_extracted_name(match.group(0))
    return None



Fallback Parser

In [7]:
def parse_resume_with_rules(resume_text, skill_map, nlp_model):
    """
    Parses resume text using a multi-layered, rule-based approach.

    Args:
        resume_text (str): The raw text of the resume.
        skill_map (dict): A mapping of skill variations to canonical names.
        nlp_model: The loaded spaCy model for NER.

    Returns:
        dict: A dictionary containing the extracted information.
    """
    if not resume_text:
        return None

    # --- Email Extraction (Do this first as it can help with name extraction) ---
    email = None
    email_match = re.search(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b', resume_text)
    if email_match:
        email = email_match.group(0)

    # --- Name Extraction (Multi-layered approach) ---
    name = None
    # 1. Try NER first - it's the most intelligent
    name = extract_name_with_ner(resume_text, nlp_model)

    # 2. If NER fails, try finding the name near the email
    if not name:
        name = extract_name_near_email(resume_text, email)

    # 3. As a last resort, use the improved top-of-document search
    if not name:
        name = extract_name_from_top(resume_text)

    # --- Skill Extraction (Using the advanced SKILL_MAP) ---
    found_canonical_skills = set()
    resume_text_lower = resume_text.lower()
    if skill_map:
        for skill_alias, canonical_skill in skill_map.items():
            if re.search(r'\b' + re.escape(skill_alias) + r'\b', resume_text_lower):
                found_canonical_skills.add(canonical_skill)

    # --- Assemble final JSON object ---
    parsed_data = {
        "name": name,
        "email": email,
        "skills": sorted(list(found_canonical_skills))
    }

    return parsed_data


# --- FINAL EXECUTION EXAMPLE ---
# 1. Load dependencies at the start
# SKILL_MAP = build_skill_map_from_csv('skills.csv')
# nlp = spacy.load('en_core_web_sm')



### 6) Unified Pipeline

`parse_resume(file_path)` tries **Gemini** first (if available), otherwise falls back to the rules‑based parser.


In [8]:
def process_resume(file_path, skill_map):
    """
    Main function to process a resume file.
    It extracts text, tries the GenAI parser, and uses a rule-based
    fallback with an advanced skill map if the GenAI parser fails.
    """
    print(f"\n--- Processing {os.path.basename(file_path)} ---")

    text = extract_text(file_path)
    if not text:
        print("Could not extract text. Aborting.")
        return None, "Error"

    print("Attempting to parse with GenAI...")
    parsed_data = parse_resume_with_genai(text)

    if parsed_data is None:
        print("GenAI parsing failed. Switching to rule-based fallback parser...")
        # Pass the loaded SKILL_MAP to the fallback function
        parsed_data = parse_resume_with_rules(text, skill_map)
        parser_used = "Fallback (Rules)"
    else:
        parser_used = "Primary (GenAI)"

    print(f"Parsing complete. Method used: {parser_used}")
    return parsed_data, parser_used



In [18]:
from pathlib import Path
OUT_DIR = Path("outputs")
OUT_DIR.mkdir(exist_ok=True)

def process_resume(file_path, skill_map, use_llm: bool = True):
    """
    Main function to process a resume file.
    It extracts text, tries the GenAI parser, and uses a rule-based
    fallback with an advanced skill map if the GenAI parser fails.
    """
    file_path = Path(file_path)

    # Example: detect file type robustly
    ext = file_path.suffix.lower()
    if ext == ".pdf":
        # if your PDF reader needs a str path, cast with str(file_path)
        text = extract_text_from_pdf(str(file_path))
    elif ext in {".docx", ".doc"}:
        text = extract_text_from_docx(str(file_path))
    else:
        raise ValueError(f"Unsupported resume format: {file_path}")

    text = extract_text(file_path)
    if not text:
        print("Could not extract text. Aborting.")
        return None, "Error"

    print("Attempting to parse with GenAI...")
    parsed_data = parse_resume_with_genai(text)

    if (parsed_data is None) or (use_llm==False):
        print("GenAI parsing failed. Switching to rule-based fallback parser...")
        # Pass the loaded SKILL_MAP to the fallback function
        parsed_data = parse_resume_with_rules(text, skill_map, nlp)
        parser_used = "Fallback (Rules)"
    else:
        parser_used = "Primary (GenAI)"

    print(f"Parsing complete. Method used: {parser_used}")
    # save JSON results to outputs 
    with open(OUT_DIR / f"{file_path.stem}.json", "w", encoding="utf-8") as f:
        json.dump(parsed_data, f, ensure_ascii=False, indent=2)
    return parsed_data



### 7) Final Execution

In [25]:
from glob import glob

# 1. First, build the map
SKILL_MAP = build_skill_map_from_csv('skills.csv')
nlp = spacy.load('en_core_web_sm')

samples = sorted(glob("sample_resume*.pdf")) + sorted(glob("sample_resume*.docx"))
results = [process_resume(p, SKILL_MAP, use_llm=True) for p in samples]
len(results), results


Successfully loaded skills.csv with 13893 rows.


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['altLabels'].fillna('', inplace=True)


Built a skill map with 27178 total variations.
Attempting to parse with GenAI...
Parsing complete. Method used: Primary (GenAI)
Attempting to parse with GenAI...
Parsing complete. Method used: Primary (GenAI)
Attempting to parse with GenAI...
Parsing complete. Method used: Primary (GenAI)
Attempting to parse with GenAI...
Parsing complete. Method used: Primary (GenAI)
Could not extract text. Aborting.
Attempting to parse with GenAI...
Parsing complete. Method used: Primary (GenAI)
Attempting to parse with GenAI...
Parsing complete. Method used: Primary (GenAI)


(7,
 [{'name': 'Kristen Connelly',
   'email': 'email@email.com',
   'skills': ['Call Sheets & Sides',
    'Adobe Premiere Pro',
    'Camera Boom, Light Boom, Mic Boom',
    'DaVinci Resolve',
    'scriptwriting',
    'audio editing',
    'mixing',
    'Video Production',
    'graphics editing',
    'videography proposals',
    'Production Acquisition',
    'storytelling',
    'digital video editing']},
  {'name': 'Mandy Campbell',
   'email': 'email@email.com',
   'skills': ['Cardio Training',
    'Fitness Routines',
    'HIIT',
    'Client Assessments',
    'Health & Safety',
    'Active Listening',
    'Personalized Fitness Assessments',
    'Marketing',
    'Staffing',
    'Sales',
    'High-intensity training',
    'Strength and conditioning',
    'Nutrition',
    'Group Cycling',
    'Personal Training',
    'CrossFit Level 1 Instructor',
    'Advanced First Aid',
    'Social Media Marketing',
    'Program Budgeting',
    'Program Statistics Analysis',
    'Client Rapport',
    '

### 8) Evaluation

In [None]:
import json, glob, re
import pandas as pd
from pathlib import Path

# ---------- Normalizers (strip < >, trim, lowercase) ----------
def _norm_text(s):
    if s is None:
        return ""
    s = str(s).strip()
    # remove surrounding angle brackets and extra spaces
    s = re.sub(r'^[<\s]+|[>\s]+$', '', s)
    return s

def _norm_name(s):  return _norm_text(s).lower()
def _norm_email(s): return _norm_text(s).lower()

def _norm_skills(lst):
    if not lst:
        return set()
    return { _norm_text(x).lower() for x in lst if str(x).strip() }

# ---------- Load ground truth from answer.jsonl ----------
with open("answer.jsonl", "r", encoding="utf-8") as f:
    answer = [json.loads(l) for l in f if l.strip()]

# make sure 'file' is a simple basename (e.g., sample_resume1.pdf)
for g in answer:
    g["file"] = Path(g["file"]).name

answer_map  = { g["file"]: g for g in answer }
answer_keys = set(answer_map.keys())

# ---------- Helper: map prediction to a ground-truth filename ----------
def map_pred_to_answer_filename(output_path, pred_dict):
    """
    1) Prefer 'file' inside prediction if present.
    2) Otherwise infer from output filename stem and try .pdf/.docx against answer_keys.
    """
    inner = pred_dict.get("file")
    if inner:
        return Path(inner).name

    stem = Path(output_path).stem  # e.g., 'sample_resume1'
    for ext in (".pdf", ".docx"):
        cand = stem + ext
        if cand in answer_keys:
            return cand
    return None  # couldn't map

# ---------- Load predictions and compute metrics ----------
rows = []

# Ensure outputs exist; you should have run your parser first
pred_paths = sorted(glob.glob("outputs/*.json"))
if not pred_paths:
    print("No prediction files found in outputs/. Run your parser cells first.")

for p in pred_paths:
    with open(p, "r", encoding="utf-8") as f:
        pred = json.load(f)

    fn = map_pred_to_answer_filename(p, pred)
    if not fn:
        print(f"Skipping (cannot map to answer): {p}")
        continue

    g = answer_map.get(fn)
    if not g:
        print(f"No answer entry for: {fn}")
        continue

    # Field metrics
    name_acc  = int(_norm_name(pred.get("name"))  == _norm_name(g.get("name")))
    email_acc = int(_norm_email(pred.get("email")) == _norm_email(g.get("email")))

    P, G = _norm_skills(pred.get("skills")), _norm_skills(g.get("skills"))
    tp = len(P & G); fp = len(P - G); fn_miss = len(G - P)
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec  = tp / (tp + fn_miss) if (tp + fn_miss) else 0.0
    f1   = (2*prec*rec)/(prec+rec) if (prec+rec) else 0.0

    rows.append({
        "file": fn,
        "name_acc": name_acc,
        "email_acc": email_acc,
        "skills_prec": round(prec, 3),
        "skills_rec": round(rec, 3),
        "skills_f1": round(f1, 3),
        "used_llm": pred.get("used_llm", None)
    })

df = pd.DataFrame(rows).sort_values("file")
display(df)

summary = pd.DataFrame({
    "metric": ["name_acc","email_acc","skills_prec","skills_rec","skills_f1"],
    "mean": [
        df["name_acc"].mean()  if not df.empty else 0.0,
        df["email_acc"].mean() if not df.empty else 0.0,
        df["skills_prec"].mean() if not df.empty else 0.0,
        df["skills_rec"].mean()  if not df.empty else 0.0,
        df["skills_f1"].mean()   if not df.empty else 0.0,
    ]
})
display(summary)


Skipping (cannot map to answer): outputs/sample_resume4.json
Skipping (cannot map to answer): outputs/sample_resume2.json
Skipping (cannot map to answer): outputs/sample_resume3.json
Skipping (cannot map to answer): outputs/sample_resume1.json
Skipping (cannot map to answer): outputs/sample_resume6.json
Skipping (cannot map to answer): outputs/sample_resume7.json


KeyError: 'file'

### 9) Demonstrate fallback works (one sample)

In [30]:
# 1. First, build the map
# SKILL_MAP = build_skill_map_from_csv('skills.csv')
# nlp = spacy.load('en_core_web_sm')

fallback_sample = process_resume("sample_resume3.pdf", SKILL_MAP, use_llm=False)
# json.load(open("outputs/fallback_sample_resume4.json"))
fallback_sample


Attempting to parse with GenAI...
GenAI parsing failed. Switching to rule-based fallback parser...
Parsing complete. Method used: Fallback (Rules)


{'name': '• Contributed',
 'email': 'email@email.com',
 'skills': ['adobe illustrator',
  'design process',
  'english',
  'history',
  'italian',
  'market research']}