# Resume Parser

This notebook demonstrates a generic Resume Parser capable of extracting **name**, **email**, and **skills** from resumes in PDF and Word formats. The goal is to transform unstructured resumes into structured JSON outputs for downstream analysis or automation.

Resumes are sourced from Kaggle datasets. The approach uses:

1. **Document ingestion**: PDFs parsed with `pdfplumber`, Word documents with `python-docx`, and legacy `.doc` files converted via LibreOffice.
2. **LLM-based extraction**: Google Gemini API (`gemini-2.5-flash`) parses resume text to JSON.
3. **Batch processing**: The pipeline processes a whole folder of resumes and exports the results to Excel.

Note: I developed this on a Linux machine, so legacy `.doc` files are converted to `.docx` with LibreOffice. If you are on Windows, please use `.docx` files instead of `.doc`.
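
Before relying on the `.doc` conversion, it can help to check that a LibreOffice executable is actually on `PATH`. The sketch below is only an illustration: the helper name `find_libreoffice` is hypothetical, and it assumes the executable is called `libreoffice` or `soffice`, which depends on how LibreOffice was installed.

```python
import shutil

def find_libreoffice(candidates=("libreoffice", "soffice")):
    """Return the path of the first LibreOffice executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)  # None when the executable is not on PATH
        if path:
            return path
    return None

office_bin = find_libreoffice()
if office_bin is None:
    print("LibreOffice not found: convert .doc files to .docx manually.")
else:
    print(f"LibreOffice found at {office_bin}")
```

If nothing is found, the simplest workaround is the one noted above: supply `.docx` files directly.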

## Approach

1. **Text extraction**  
   - PDF → `pdfplumber`  
   - DOCX → `python-docx`  
   - DOC → LibreOffice conversion → DOCX → `python-docx`  

2. **LLM Parsing**  
   - The text is fed into Gemini with a prompt that requests JSON output for name, email, and skills.  
   - `json_repair` is used to handle minor formatting issues.

3. **Random testing & batch processing**  
   - A single resume can be randomly selected for testing.  
   - Full folder processing collects all outputs into a DataFrame, exported as Excel.
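
To make "minor formatting issues" concrete: LLM replies often wrap the JSON in a Markdown code fence or use single quotes, both of which break `json.loads`. The sketch below uses only the standard library to show the kind of cleanup involved. It is not how `json_repair` works internally, and the helper name `loose_json_loads` is hypothetical.

```python
import ast
import json

def loose_json_loads(reply):
    """Parse the first {...} object found in an LLM reply (illustrative only)."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    candidate = reply[start:end + 1]  # drops code fences or chatter around the object
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Single-quoted keys/strings are not valid JSON but are valid Python literals.
        return ast.literal_eval(candidate)

fence = "`" * 3  # Markdown code-fence marker
reply = fence + "json\n{'name': 'Jane Doe', 'email': 'jane.doe@gmail.com', 'skills': ['Python', 'LLM']}\n" + fence
parsed = loose_json_loads(reply)
print(parsed["name"], parsed["skills"])
# → Jane Doe ['Python', 'LLM']
```

This fallback is deliberately narrow (for example, it would not handle `true` or `null` inside a single-quoted reply), which is one reason to prefer a dedicated repair library in the actual pipeline.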

In [1]:
#import libraries
import os
from docx import Document
import pdfplumber
import random 
from google import genai
import json_repair
import pandas as pd
import platform
import subprocess
from pathlib import Path
from rapidfuzz import fuzz

## Data Preparation

In [2]:
#Data Sources

# I filtered the PDFs sourced from the links below down to the ones I felt would
# show the full capabilities of my approach

#https://www.kaggle.com/datasets/anuvagoyal/resume-pdf/data
#https://www.kaggle.com/datasets/hussnainmushtaq/sample-cvs-dataset-for-analysis/data
#https://www.kaggle.com/datasets/sauravsolanki/hire-a-perfect-machine-learning-engineer
#https://www.kaggle.com/datasets/extremelysunnyyk/resume-data-with-annotations
# resume1.doc and resume2.doc were generated with Gemini, as there weren't many examples of .doc files online.

In [3]:
DATA_DIR = Path("data")        # folder with resumes
OUTPUT_DIR = Path("outputs")   # folder for results
OUTPUT_DIR.mkdir(exist_ok=True)  # creates outputs/ if it doesn't exist

In [4]:
#File to Text Functions 

def pdf_to_text(pdf_path):
    """
    Reads text from a PDF (.pdf) file and returns it as a single string.
    
    Args:
        pdf_path (str): Path to the .pdf file.
        
    Returns:
        str: Text content of the document.
    """
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Guard against pages with no extractable text (extract_text() may return None)
            text += (page.extract_text() or "") + "\n"
    return text


def docx_to_text(docx_path):
    """
    Reads text from a Word (.docx) file and returns it as a single string.
    
    Args:
        docx_path (str): Path to the .docx file.
        
    Returns:
        str: Text content of the document.
    """
    doc = Document(docx_path)
    full_text = [para.text for para in doc.paragraphs]
    return "\n".join(full_text)


#Working on a Linux machine, so .doc files are converted to .docx files
def convert_doc_to_docx(doc_path, output_dir=DATA_DIR/'converted'):
    """
    Converts a .doc file to .docx using the LibreOffice CLI.
    Always uses absolute paths so the file is found.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    
    abs_input = os.path.abspath(doc_path)
    abs_output_dir = os.path.abspath(output_dir)
    # Pass arguments as a list so paths with spaces or quotes need no shell escaping
    cmd = ["libreoffice", "--headless", "--convert-to", "docx", "--outdir", abs_output_dir, abs_input]
    subprocess.run(cmd, check=True)  # raises CalledProcessError if the conversion fails
    
    base_name = os.path.splitext(os.path.basename(doc_path))[0]
    converted_path = os.path.join(output_dir, base_name + ".docx")
    
    return converted_path



def doc_to_text_linux(doc_path):
    """
    Extracts text from a .doc file on Linux/macOS by first converting it to .docx
    using LibreOffice, then reading the text with python-docx.

    Notes:
        - Requires LibreOffice installed and in PATH.
    """
    converted_path = convert_doc_to_docx(doc_path)
    text = docx_to_text(converted_path)
    return text


#Combine the functions above 
def extract_text(file_path:Path):
    valid_exts = [".pdf", ".docx", ".doc"]
    if file_path.suffix.lower() not in valid_exts:
        raise ValueError(f"Unsupported file type: {file_path}")
    if file_path.suffix.lower() == '.pdf':
        return pdf_to_text(file_path)
    elif file_path.suffix.lower() == ".docx":
        return docx_to_text(file_path)
    elif file_path.suffix.lower() == ".doc":
        return doc_to_text_linux(file_path)

## LLM Setup

In [5]:
#Prompt to pass to llm
def llm_instructions(text):
    instruction = f"""Text extracted from a Resume:\n\n{text}\n\nCan you extract the name, email and skills from the resume above. Your output should be a json dictionary with 3 keys, name, email and skills. The skills section should be a list of skills you've extracted from the resume. Avoid repeating skills in the skills section.\n"""
    instruction += """Here is an example output, you should follow this format:\n{\n'name': 'Jane Doe',\n'email': 'jane.doe@gmail.com',\n'skills': ['Machine Learning', 'Python', 'LLM']\n}\n"""
    return instruction
print(llm_instructions('RESUME'))

Text extracted from a Resume:

RESUME

Can you extract the name, email and skills from the resume above. Your output should be a json dictionary with 3 keys, name, email and skills. The skills section should be a list of skills you've extracted from the resume. Avoid repeating skills in the skills section.
Here is an example output, you should follow this format:
{
'name': 'Jane Doe',
'email': 'jane.doe@gmail.com',
'skills': ['Machine Learning', 'Python', 'LLM']
}



In [6]:
# The client gets the API key from the environment variable `GEMINI_API_KEY`.
api_key = os.getenv("GEMINI_API_KEY")
client = genai.Client(api_key=api_key)

In [7]:
#Resume Parsing Function

def resume_parser(file_path: Path):
    #first get text
    text = extract_text(file_path)
    #Get instructions
    prompt = llm_instructions(text)
    #llm response
    response = client.models.generate_content(model="gemini-2.5-flash", contents=prompt)
    #repair response
    json_format = json_repair.repair_json(response.text,return_objects=True)
    
    return json_format
    

## Inference

In [8]:
#Process a single resume 
files = [f for f in DATA_DIR.iterdir() if f.is_file()]  # skip subfolders such as data/converted
random_file = random.choice(files)
print(random_file)
print('\n')
result = resume_parser(random_file)
print(result)

data/Xinni_Chng.pdf


{'name': 'xinni chng', 'email': 'hello@xinni.co', 'skills': ['Responsive Web Design', 'Mobile design', 'Visual Design', 'game Design', 'illustration', 'User Research', 'Competitive Analysis', 'Wireframing', 'Prototyping', 'Usability testing', 'Statistical Analysis', 'Adobe Creative Suite', 'Sketch', 'Principle', 'invision', 'Balsamiq', 'R Studio', 'Cogtool', 'htMl', 'CSS', 'JavaScript', 'C++', 'Java', 'JQuery', 'Angular', 'React', 'ionic', 'grunt', 'gulp', 'Qt', 'SQl', 'Firebase']}


In [9]:
#Process a folder of resumes, output results as a df
results = []
     
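for file_path in DATA_DIR.iterdir(): 
    if not file_path.is_file():
        continue  # skip subfolders such as data/converted
    try:
        print(f"Processing: {file_path.name}")
        result = resume_parser(file_path)
        result['file'] = file_path.name
        results.append(result)
    except Exception as e: 
        # Report which file failed, then continue with the rest of the folder
        print(f"Failed on {file_path.name}: {e}")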

        
pd.DataFrame(results).to_excel(OUTPUT_DIR/'resume_results.xlsx', index = False)

Processing: resume1.doc


  java_version = re.match('openjdk version "(?P<version>[\d\._]+)"',


convert /home/sapienserver2/Documents/Learning /resume-parser/data/resume1.doc as a Writer document -> /home/sapienserver2/Documents/Learning /resume-parser/data/converted/resume1.docx using filter : MS Word 2007 XML
Processing: Alice Clark CV.docx
Processing: azam rafique_cv_master (1).pdf
Processing: Nouman Ali - CV.pdf
Processing: Smith Resume.docx
Processing: AnuvaGoyal_Latex.pdf
Processing: Sample_Resume.pdf
Processing: Xinni_Chng.pdf
Processing: sample_input.pdf
Processing: resume v6.pdf
Processing: 1901841_RESUME.pdf
Processing: resume2.doc


  java_version = re.match('openjdk version "(?P<version>[\d\._]+)"',


convert /home/sapienserver2/Documents/Learning /resume-parser/data/resume2.doc as a Writer document -> /home/sapienserver2/Documents/Learning /resume-parser/data/converted/resume2.docx using filter : MS Word 2007 XML
Processing: AshlyLauResume.pdf


## Evaluation

In [10]:

TEST_FILE = Path("Test Set.xlsx")  
test_df = pd.read_excel(TEST_FILE)

# Ensure consistent column names
test_df.columns = test_df.columns.str.lower().str.strip()
test_df


Unnamed: 0,name,email,skills,file
0,ANUVA GOYAL,anuvagoyal111@gmail.com,"C, C++,Python, SQL,Data Structures,CSS, HTML,N...",AnuvaGoyal_Latex.pdf
1,Thomas Frank,thomas@thomasjfrank.com,"Adobe Premiere Pro, After Effects, Photoshop, ...",Sample_Resume.pdf
2,xinni chng,hello@xinni.co,"Responsive Web Design,Mobile design, Visual De...",Xinni_Chng.pdf
3,Chia Yong Kang,chiayongkang@hotmail.com,"Python,Java,Javascript,HTML,CSS, XML,SQL,PHP,K...",resume v6.pdf
4,Ashly Lau,ashlylau@gmail.com,"Java, Python, Haskell, C, C++, JavaScript, Pro...",AshlyLauResume.pdf
5,Alice Clark,aclark123@ai.io,"Machine Learning, Natural Language Processing,...",Alice Clark CV.docx
6,Michael Smith,msmith@manunited27.com,"problem solving, project lifecycle, project ma...",Smith Resume.docx


In [11]:
def normalize_name(name):
    if not name:
        return ""
    return " ".join(name.strip().lower().split())

def skill_set(skills):
    if not skills:
        return set()
    if isinstance(skills, str):
        return {s.strip().lower() for s in skills.split(",")}
    if isinstance(skills, list):
        return {s.strip().lower() for s in skills}
    return set(skills)

def prf(pred_set, gold_set):
    tp = len(pred_set & gold_set)
    fp = len(pred_set - gold_set)
    fn = len(gold_set - pred_set)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


In [13]:
def fuzzy_match_skills(pred_skills, gold_skills, threshold=80):
    """
    Match predicted skills to gold skills using fuzzy string matching.
    Recall is judged as 'covered' if a gold skill matches at least one predicted skill above threshold.
    """
    matched = set()
    for g in gold_skills:
        for p in pred_skills:
            if fuzz.token_sort_ratio(g, p) >= threshold:
                matched.add(g)
                break
    return matched

eval_rows = []

# Map test set by file name for easy lookup
test_map = {Path(row.file).stem: row for _, row in test_df.iterrows()}

for r in results:
    file_stem = Path(r['file']).stem
    gold_row = test_map.get(file_stem)

    if gold_row is None:
        continue  # skip if no gold row found

    # Email exact match
    email_match = int((r.get("email") or "").lower() == (gold_row.email or "").lower())

    # Name similarity
    name_similarity = fuzz.token_sort_ratio(
        normalize_name(r.get("name")),
        normalize_name(str(gold_row["name"]))
    )

    # Skills sets
    pred_skills = skill_set(r.get("skills"))
    gold_skills = skill_set(gold_row.skills)

    # --- Standard precision/recall ---
    skills_metrics = prf(pred_skills, gold_skills)

    # --- Fuzzy recall (ignores wording variants) ---
    matched_gold = fuzzy_match_skills(pred_skills, gold_skills)
    fuzzy_recall = len(matched_gold) / len(gold_skills) if gold_skills else 1.0

    eval_rows.append({
        "file": r['file'],
        "email_exact": email_match,
        "name_similarity": name_similarity,
        "skills_precision": round(skills_metrics["precision"], 3),
        "skills_recall": round(skills_metrics["recall"], 3),
        "skills_f1": round(skills_metrics["f1"], 3),
        "fuzzy_recall": round(fuzzy_recall, 3)
    })

df_eval = pd.DataFrame(eval_rows)
display(df_eval)

print("\nOverall averages:")
print(df_eval[["email_exact","name_similarity","skills_precision","skills_recall","skills_f1","fuzzy_recall"]].mean())
df_eval.to_excel(OUTPUT_DIR/'Test Set results.xlsx', index = False)  # OUTPUT_DIR is created above; a 'results/' folder may not exist

Unnamed: 0,file,email_exact,name_similarity,skills_precision,skills_recall,skills_f1,fuzzy_recall
0,Alice Clark CV.docx,1,100.0,0.097,1.0,0.176,1.0
1,Smith Resume.docx,1,100.0,0.059,0.75,0.109,1.0
2,AnuvaGoyal_Latex.pdf,1,100.0,1.0,1.0,1.0,1.0
3,Sample_Resume.pdf,1,100.0,1.0,1.0,1.0,1.0
4,Xinni_Chng.pdf,1,100.0,1.0,1.0,1.0,1.0
5,resume v6.pdf,1,100.0,0.541,0.952,0.69,0.952
6,AshlyLauResume.pdf,1,100.0,1.0,1.0,1.0,1.0



Overall averages:
email_exact           1.000000
name_similarity     100.000000
skills_precision      0.671000
skills_recall         0.957429
skills_f1             0.710714
fuzzy_recall          0.993143
dtype: float64


## 📊 Evaluation Summary

I evaluated the resume parser against a **curated test set** of 7 resumes with gold-standard labels for name, email, and skills.  

---

### Results
- **Email extraction** → **100% exact-match accuracy** on all 7 test resumes.  
- **Name similarity** → **100** (`token_sort_ratio`, 0 to 100 scale) on every resume, so each predicted name matched its gold label up to case, whitespace, and token order.  
- **Skills extraction** (averaged over the 7 resumes):
  - **Precision** = 0.67 → the model predicts many additional skills beyond the gold labels.  
  - **Recall** (strict) = 0.96 → the model recovered almost all of the labeled skills.  
  - **F1** = 0.71 (harmonic mean of precision and recall, computed per resume and then averaged).  

---

### Interpreting Skills Metrics
The **gold skill labels** came only from the explicit *Skills section* of each resume.  
- Strict **precision** is therefore less meaningful, since the model often extracts *additional valid skills* from work experience, education, or projects.  
- This broader extraction is **useful**, not noise, for downstream applications.  

To account for skill name variations (e.g., *“LLM” vs “LLMs”*, *“Project Lifecycle” vs “Project Lifecycle Management”*), I introduced **fuzzy recall**:  
- **Fuzzy Recall** = **0.99** → showing that, after accounting for near-matches, the model essentially covered all labeled skills.  
- After manual review, I consider **recall ≈ 1.0** for all resumes: every gold skill was present, sometimes under a slightly different name.  
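
A small worked example may make the gap between strict and fuzzy scores easier to see. The skill sets below are made up for illustration, and `difflib.SequenceMatcher` stands in for `rapidfuzz` so the snippet runs with the standard library only; its similarity scores differ from `token_sort_ratio`, so treat the threshold as an assumption.

```python
from difflib import SequenceMatcher

def prf(pred, gold):
    """Strict set-based precision, recall, and F1."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def fuzzy_recall(pred, gold, threshold=0.8):
    """Share of gold skills that have at least one near-match among the predictions."""
    covered = sum(
        any(SequenceMatcher(None, g, p).ratio() >= threshold for p in pred)
        for g in gold
    )
    return covered / len(gold) if gold else 1.0

gold = {"python", "sql", "llm"}                      # labels from the Skills section only
pred = {"python", "sql", "llms", "docker", "git"}    # model also picks up extra skills

precision, recall, f1 = prf(pred, gold)
print(round(precision, 3), round(recall, 3), round(f1, 3))
# → 0.4 0.667 0.5
print(fuzzy_recall(pred, gold))
# → 1.0
```

Here "llms" counts as a miss and a false positive under strict matching, while the two extra skills lower precision further. Fuzzy recall credits the near-match, which mirrors the pattern in the results table.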

---

### ✅ Key Takeaways
- Name and email extraction were **error-free** on this 7-resume test set.  
- Skill extraction recall is **high** (0.96 strict, 0.99 fuzzy), though precision is lower due to broader coverage.  
- For this use case, **recall matters more than precision** → prefer capturing extra relevant skills rather than missing ones.  
- Future improvement: **skill normalization** (mapping synonyms/variants to a curated list), which should push fuzzy recall toward **1.0**.  
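
As a rough idea of what skill normalization could look like, the sketch below maps known variants to one canonical name before comparing predictions with gold labels. The alias table and the function name `normalize_skills` are hypothetical; a real list would need to be curated for the target domain.

```python
# Hypothetical alias table: variant spelling -> canonical skill name.
SKILL_ALIASES = {
    "llms": "llm",
    "torch": "pytorch",
    "js": "javascript",
}

def normalize_skills(skills):
    """Lower-case, trim, map known variants to a canonical name, and drop duplicates."""
    normalized = []
    for skill in skills:
        key = " ".join(skill.lower().split())      # collapse case and whitespace
        canonical = SKILL_ALIASES.get(key, key)     # unknown skills pass through unchanged
        if canonical not in normalized:
            normalized.append(canonical)            # keep first-seen order
    return normalized

print(normalize_skills(["PyTorch", "Torch", "LLMs", "LLM", " JS "]))
# → ['pytorch', 'llm', 'javascript']
```

Applying the same function to both predicted and gold skills would let strict set comparison treat "LLM" and "LLMs" as the same item.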


## Future Enhancements

1. **Fallback mechanisms**  
   - Regex-based extraction for emails, skills, or names if the LLM call fails or is rate-limited.

2. **Enhanced skill extraction**  
   - Split skills into languages, programs, libraries, etc. 
   - Handle synonyms and skill variations (e.g., "PyTorch" vs "Torch").

3. **Performance improvements**  
   - Extract more information from the resumes and split it into sections.
   - If LLMs aren't accurate enough for that task, a vision-language (VL) model could be run on resumes converted to images, which might handle varied layouts better.
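
The regex fallback mentioned under item 1 could start as small as the sketch below, which covers emails only. The pattern is deliberately simple and is an assumption on my part: it does not cover every valid address, and the helper name `fallback_email` is hypothetical.

```python
import re

# Simple email-like pattern; not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def fallback_email(text):
    """Return the first email-like string in the resume text, or None."""
    match = EMAIL_RE.search(text)
    return match.group(0) if match else None

print(fallback_email("Jane Doe | jane.doe@gmail.com | +1 555 0100"))
# → jane.doe@gmail.com
```

Names and skills are harder to capture with regular expressions alone, so a fallback for those would likely need a keyword list or another heuristic.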
   
