# Semantic Evaluation: Resume & Job Description


## LangExtract & PDFLoader
### LangExtract
LangExtract is a Google library optimised for long documents. It **extracts key entities from unstructured data** (PDFs in our case) **using few-shot examples**, which adapts the LLM to the job-seeking domain. It addresses the "needle-in-a-haystack" challenge of large-document extraction through an optimised strategy of text chunking, parallel processing, and multiple passes for higher recall. You can read more about LangExtract [here](https://github.com/google/langextract/tree/main).

LangExtract will be used to extract the following entities **from key sections** of the `job-description` and the `resume`:

**Entities Extracted**  
- **Resume (`r`)**  
  1. Hard skills  
  2. Soft skills  
  
- **Job Description (`job_d`)**  
  1. Hard skills  
  2. Soft skills  
  3. Contact person  
  4. Years of experience  

**Alternative to LangExtract**: An alternative is [SetFit](https://huggingface.co/blog/setfit), which can be trained on a handful of labelled examples and, according to its authors, reaches performance competitive with a fine-tuned RoBERTa model.  
- **Process:** Create a dataset with labeled soft skills (matching the ontology below), paired with job-description and resume sections.  
- **Benefit:** Few-shot learning, efficient fine-tuning, and high performance with small data.  


### PDFLoader
PDFLoader is used to parse resumes and job descriptions into structured sections:

- **Resume (`r`)**  
  1. About / Summary  
  2. Education  
  3. Skills / Qualifications  
  4. Professional Experience  
  5. Certificates  
  6. Volunteering  
  7. Personal Projects  

- **Job Description (`job_d`)**  
  1. Responsibility  
  2. Requirements / Qualifications / Experience  
  3. Role Description  
  4. Contact Person  

---

## Hybrid Solution: Sentence Transformers & LLM
We evaluate a resume against a job description by comparing aligned sections of the resume (`r`) and the job description (`job_d`), extracted via PDFLoader and LangExtract:

### 1. PDFLoader-Based Evaluation
1. `job_d.Responsibility ↔ r.Professional Experience`: An LLM evaluates whether the candidate’s professional experience matches the job responsibilities.  
   - **Guiding question:** *Does the candidate’s experience demonstrate the capability to deliver on the company’s stated responsibilities?*  
   - **Scoring:** The model outputs a score between `0` and `1`, interpreted in three bands:  
     - `0 – 0.4`: Negative indication. The candidate’s professional experience does not match the stated responsibilities (this could also indicate a commercial mismatch).
     - `0.4 – 0.6`: Neutral indication. The LLM could not explicitly determine an answer to the question.   
     - `0.6 – 1.0`: Positive indication. The candidate’s professional experience matches the stated responsibilities.
2. **Job description role title**: Given the extracted job title, we compute the cosine similarity between its embedding and the embeddings of a predefined role list (a list of roles that JobAgent requires the user to supply). The title is considered a match for the candidate’s search when the similarity reaches a threshold (alpha).
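
As a rough illustration of the two checks above, the sketch below maps an LLM score to the three bands and applies a cosine-similarity threshold to title embeddings. The function names, the handling of the band edges (`0.4` and `0.6` are assigned to the higher band), the default `alpha = 0.7`, and the toy 3-dimensional vectors are all assumptions made for illustration; in practice the embeddings would come from the sentence-transformer model.

```python
import math
from typing import Sequence


def interpret_score(score: float) -> str:
    """Map an LLM match score in [0, 1] to a band label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    # Assumption: boundary values 0.4 and 0.6 fall into the higher band
    if score < 0.4:
        return "negative"
    if score < 0.6:
        return "neutral"
    return "positive"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def title_matches(
    title_vec: Sequence[float],
    role_vecs: Sequence[Sequence[float]],
    alpha: float = 0.7,  # hypothetical threshold, to be tuned
) -> bool:
    """True when the title is close enough to at least one role in the list."""
    return any(cosine_similarity(title_vec, role) >= alpha for role in role_vecs)


# Toy vectors standing in for real embeddings
band = interpret_score(0.72)
matched = title_matches([1.0, 0.0, 0.0], [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]])
```

Taking the maximum similarity over the role list (via `any`) means a single close role is enough; averaging over the list would be a stricter alternative.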


### 2. LangExtract-Based Evaluation
1. **Hard Skills Evaluation**:
    - `job_d.Requirements/Qualifications/Experience ↔ r.Skills`: Using the **hard skills** extracted by `LangExtract`, we compute a **hard skill score**: the number of required hard skills that also appear among the candidate’s skills (the intersection), divided by the total number of hard skills listed in the job description.

2. **Soft Skills Evaluation**:
  LangExtract maps soft skills from both resume and job-description sections into a **predefined soft skill ontology** (can be found in `./soft_skills.json`). <br>
  For each mapped section, we perform semantic similarity evaluation. Each ontology category is weighted, and the **dot product** is computed to obtain the final `soft_skill_score`.
    - **Resume Sections:** About / Summary, Professional Experience, Education, Certificates, Volunteering  
    - **Job Description Sections:** Responsibility, Requirements / Qualifications / Experience, Role Description
  
    To evaluate semantic similarity we use a `sentence transformer` (bi-encoder) model, [TechWolf/JobBERT-v2](https://huggingface.co/TechWolf/JobBERT-v2), which was trained specifically on 5M+ job pairs. The model maps job titles and descriptions to a 1024-dimensional dense vector space and can be used for semantic job title matching, job similarity search, and related HR/recruitment tasks.

3. **Years of experience**: Given the years of experience derived from the resume and the years of experience required by the job description, we check whether the candidate’s experience is within an acceptable predefined margin (gap).
    - An acceptable marginal gap is: `Gap <= 2`. 

4. **Contact person**: If a contact person is found in the job description and the candidate is interested in the job, JobAgent composes a personalised email; otherwise it composes a general email. The email takes the following context into account: 
    - The overall evaluation score (LangExtract + PDFLoader)
    - The context extracted from web (the company's about-page context and reviews)
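
The three numeric checks in this list (hard skill score, weighted soft skill score, and experience gap) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the final implementation: the function names are hypothetical, skills are compared case-insensitively, categories without a similarity value count as `0`, and the gap is read as the candidate's shortfall relative to the requirement.

```python
from typing import Dict, Iterable


def hard_skill_score(resume_skills: Iterable[str], job_skills: Iterable[str]) -> float:
    """Share of the job's hard skills that also appear in the resume."""
    # Assumption: skills are compared case-insensitively after trimming whitespace
    required = {skill.strip().lower() for skill in job_skills}
    offered = {skill.strip().lower() for skill in resume_skills}
    if not required:
        return 0.0
    return len(required & offered) / len(required)


def soft_skill_score(similarities: Dict[str, float], weights: Dict[str, float]) -> float:
    """Dot product of per-category similarity and per-category weight."""
    # Assumption: a category without a similarity value contributes 0
    return sum(w * similarities.get(category, 0.0) for category, w in weights.items())


def within_experience_gap(
    candidate_years: float, required_years: float, max_gap: float = 2.0
) -> bool:
    """True when the candidate is at most `max_gap` years short of the requirement."""
    # Assumption: the gap is the shortfall, so extra experience never fails the check
    return (required_years - candidate_years) <= max_gap


hard = hard_skill_score(["Python", "AWS", "SQL"], ["python", "aws", "docker", "sql"])
soft = soft_skill_score(
    {"communication": 0.8, "leadership": 0.5},
    {"communication": 0.6, "leadership": 0.4},
)
gap_ok = within_experience_gap(candidate_years=4.0, required_years=5)
```

Dividing by the number of required skills (rather than the union of both sets) keeps the hard skill score focused on coverage of the job description, as described above.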
    

In [1]:
%load_ext autoreload
%autoreload 2

In [89]:
import json
import os
import re
from collections import Counter
from datetime import datetime
from pprint import pprint
from typing import Any, Dict, List

import langextract as lx
import torch
from dotenv import load_dotenv
from google import genai
from langchain_community.document_loaders import PyPDFLoader
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import batch_to_device, cos_sim

load_dotenv()

gemini_api_key = os.getenv("GEMINI_API_KEY")
client = genai.Client(api_key=gemini_api_key)

In [3]:
# Test Gemini connection
response = client.models.generate_content(
    model="gemini-2.5-flash", contents="hi"
)
response.text

'Hi there! How can I help you today?'

## Parse PDF & Extract Resume Sections

In [7]:
resume_path: str = "~/Desktop/Private/CVs/version_new/Gal Beeri Data Science CV.pdf"
loader = PyPDFLoader(
    file_path=resume_path,
    extract_images=False,
    mode="page",
    extraction_mode="layout"
)
resume = loader.load()
resume_pages: List[str] = [page.page_content for page in resume]
print(resume_pages[0])

Gal Beeri - Data Scientist
                             Melbourne, VIC I 0474 369 551 I  galbeeri1@gmail.com
            LinkedIn: https://www.linkedin.com/in/gal-beeri I Github: https://github.com/Kokolipa

Summary

Forward-thinking Data Scientist with experience in machine learning, LLMs, and cloud-based data
platforms. Skilled in using AWS, Azure, Python, SQL, Power BI and cutting-edge AI techniques to
design and implement scalable, data-driven solutions that solve complex problems. Results-oriented
and passionate about empowering organisations to make data-driven decisions that drive business
success.

Skills

•   Cloud & DevOps: AWS, Azure, Microsoft Fabric, CLI, Git, GitHub, GitLab, Docker, MLflow
•   GenAI: Agents (ReAct, Router, Multi-Agents, Conversational), LLMs (PEFT, LoRA, Summarisation,
    Prompt Engineering), RAG (Multi-source, Self-RAG, Self-query, Re-ranking, Hybrid, HyDE, Binary
    Quantisation), NLP-based solutions
•   Machine Learning: Regression, Classification, U

In [None]:
# Define regex patterns: 
dot_points_pattern: str = r"^•\s+"
job_experience_pattern: str = r"^(?!•)(.+?\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{4}\s*[-–]\s*(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?\s*\d{0,4})$"
# Define a list with section names
sections: List[str] = [
    "Summary",
    "Skills",
    "Professional Experience",
    "Education",
    "Professional Certifications",
    "Personal Projects & Volunteering"
]
resume_dict: Dict[str, str] = {}

for i, page in enumerate(resume_pages, 1):
    # Extract min and max spans: Reset the spans for each page iterated
    max_spans: List[int] = [] 
    min_spans: List[int] = []
    max_span: int = 0
    for section in sections: 
        results = re.finditer(pattern=section, string=page, flags=re.DOTALL)
        for result in results:
            max_spans.append(result.span()[1]) # end span
            max_span = result.span()[1] 
            min_spans.append(result.span()[0])
    
    # Get the size of the sections in the page
    min_span = min(min_spans)
    sections_size = len(max_spans)
    for idx, span in enumerate(max_spans):
        if i == 1: 
            # Extract sections and clean up results
            if span != max_span:
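                # NOTE: the slice ends at the END of the next heading match, so the next
                # heading text (e.g. "Skills") is kept at the tail of this section.
                section_result = page[max_spans[idx]:max_spans[idx+1]].strip("\n\n").strip()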
        
                # section_result = re.sub(pattern=dot_points_pattern, repl=" ", string=section_result, flags=re.MULTILINE)
                resume_dict[sections[idx]] = section_result
            else:  
                section_result = page[span:].strip("\n\n").strip()
                
                # section_result = re.sub(pattern=dot_points_pattern, repl=" ", string=section_result, flags=re.MULTILINE).strip("\n\n").strip()
                resume_dict[sections[idx]] = section_result
        else: 
            if span != max_span:
                section_result = page[max_spans[idx]:max_spans[idx+1]].strip("\n\n").strip()
                
                # section_result = re.sub(pattern=dot_points_pattern, repl=" ", string=section_result, flags=re.MULTILINE).strip("\n\n").strip()
                resume_dict[sections[idx + sections_size]] = section_result
            else:
                # Extract additional job experience from the second page 
                job_exper = [match for match in re.finditer(pattern=job_experience_pattern, string=page, flags=re.MULTILINE)]
                additional_experience = job_exper[0].span()[0]
                resume_dict["Professional Experience"] = resume_dict.get("Professional Experience") + "\n" + page[additional_experience:min_span]
                
                section_result = page[span:].strip("\n\n").strip()
                resume_dict[sections[idx + sections_size]] = section_result


# Print the resulted dictionary
resume_dict


{'Summary': 'Forward-thinking Data Scientist with experience in machine learning, LLMs, and cloud-based data\nplatforms. Skilled in using AWS, Azure, Python, SQL, Power BI and cutting-edge AI techniques to\ndesign and implement scalable, data-driven solutions that solve complex problems. Results-oriented\nand passionate about empowering organisations to make data-driven decisions that drive business\nsuccess.\n\nSkills',
 'Skills': '•   Cloud & DevOps: AWS, Azure, Microsoft Fabric, CLI, Git, GitHub, GitLab, Docker, MLflow\n•   GenAI: Agents (ReAct, Router, Multi-Agents, Conversational), LLMs (PEFT, LoRA, Summarisation,\n    Prompt Engineering), RAG (Multi-source, Self-RAG, Self-query, Re-ranking, Hybrid, HyDE, Binary\n    Quantisation), NLP-based solutions\n•   Machine Learning: Regression, Classification, Unsupervised Clustering, Decision Trees, Boosting,\n    Random Forest, Support Vector Classifier, Anomaly Detection\n•   Python: LangChain, LangGraph, LangSmith, DeepEval, FastAPI, P
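
The `'Summary'` value above ends with the next heading (`Skills`) because each slice runs to the end of the following heading match. A minimal alternative sketch is shown below, assuming every section heading sits alone on its own line; the helper name `split_sections` and the sample text are hypothetical, and the sketch handles a single page only (sections continuing on a later page would still need to be merged).

```python
import re
from typing import Dict, List


def split_sections(page: str, headings: List[str]) -> Dict[str, str]:
    """Split page text into {heading: body}, assuming each heading is alone on a line."""
    alternatives = "|".join(re.escape(heading) for heading in headings)
    pattern = re.compile(rf"^[ \t]*({alternatives})[ \t]*$", flags=re.MULTILINE)
    matches = list(pattern.finditer(page))

    result: Dict[str, str] = {}
    # Pair each heading with the one that follows it (None for the last heading)
    for current, following in zip(matches, matches[1:] + [None]):
        # Stop at the START of the next heading so its text is not included
        end = following.start() if following is not None else len(page)
        result[current.group(1)] = page[current.end():end].strip()
    return result


sample_page = "Jane Doe\nSummary\nData scientist.\n\nSkills\n• Python\n• SQL\n"
sample_sections = split_sections(sample_page, ["Summary", "Skills"])
```

Anchoring the pattern to whole lines also avoids matching a heading word that merely occurs inside a sentence.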

In [None]:
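def get_years_of_experience(resume_obj: Dict[str, str], key: str) -> float:
    """Return the span, in years, between the earliest and latest month-year dates in a section."""
    # Read from the argument rather than the global `resume_dict`
    experiences = resume_obj[key]
    # Extract month-year strings (e.g. "Jan 2021") from the section text
    years_pattern = r"\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)[-\s/]*(?:19|20)\d{2}\b"
    # NOTE: "%b %Y" only parses abbreviated month names separated by a single space
    year_month_format = "%b %Y"
    year_months = re.findall(pattern=years_pattern, string=experiences, flags=re.IGNORECASE)
    parse_dates = [datetime.strptime(year, year_month_format) for year in year_months]

    # Approximate the span in years using 365-day years
    return round((max(parse_dates) - min(parse_dates)).days / 365, 2)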

get_years_of_experience(resume_dict, "Professional Experience")


4.0
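
The date regex above also accepts full month names and `-` or `/` separators (for example `January 2021` or `Sept-2021`), which the `"%b %Y"` format cannot parse directly. One possible workaround, sketched below, is to normalise each match to a three-letter month plus year before parsing. The helper name `parse_month_year` and the `MONTH_YEAR` pattern are hypothetical, and the sketch assumes an English locale for `%b`.

```python
import re
from datetime import datetime

# First three letters of the month, any remaining letters, an optional separator, then the year
MONTH_YEAR = re.compile(r"\b([A-Za-z]{3})[A-Za-z]*[-\s/]*((?:19|20)\d{2})\b")


def parse_month_year(text: str) -> datetime:
    """Parse strings such as 'Jan 2021', 'January 2021' or 'Sept-2021'."""
    match = MONTH_YEAR.search(text)
    if match is None:
        raise ValueError(f"no month-year date found in {text!r}")
    month_abbrev, year = match.groups()
    # Normalise to the 'Jan 2021' shape expected by "%b %Y";
    # strptime raises ValueError if the three letters are not a month
    return datetime.strptime(f"{month_abbrev.title()} {year}", "%b %Y")


start = parse_month_year("Sept-2019")
end = parse_month_year("January 2021")
```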

## Define Extraction Task

In [9]:
document_path: str = "../src/data/soft_skills.json"
with open(document_path, "r") as file: 
    soft_skill_ontology: List[Dict[str, Any]] = json.load(file)
soft_skill_ontology[:2]

[{'text': 'presented to stakeholders',
  'extractions': [{'extraction_text': 'presented to stakeholders',
    'extraction_class': 'communication'}]},
 {'text': 'collaborated with cross-functional teams',
  'extractions': [{'extraction_text': 'collaborated with cross-functional teams',
    'extraction_class': 'communication'}]}]

In [10]:
# The prompt for LangExtract has to describe the extraction classes
EXTRACTION_PROMPT: str = ("""\
Extract communication, leadership, problem solving, teamwork collaboration, analytical thinking, adaptability, creativity and innovation, and time management optimisation using attributes to group soft skills related information: 
1. Extract entities in the order they appear in the text.
2. Use the exact text for extractions. Do not paraphrase or overlap entities.
3. Soft skill groups can have different values but should always have the same key "softskill_type".
""")

# Create an object to store examples
examples = []
example_size: int = 3

# Create an object to limit the amount of examples to be provided to LangExtract 
group_counts: Dict[str, int] = {}
for example in soft_skill_ontology: 
    # Extract key elements from softskill ontology 
    extractions = example.get("extractions")
    text = example.get("text")
    extraction_class = example.get("extractions")[0].get("extraction_class")

    # Get the current count for the group, defaulting to 0 if it's the first time
    current_group_count = group_counts.get(extraction_class, 0)
    if current_group_count < example_size: 
        group_counts[extraction_class] = current_group_count + 1
        
        # Formulate LangExtract examples  
        for extraction in extractions:
            extraction_text = extraction.get("extraction_text")
            examples.append(
                lx.data.ExampleData(
                    text=text,
                    extractions=[
                        lx.data.Extraction(
                            extraction_class=extraction_class,
                            extraction_text=extraction_text,
                            attributes={"softskill_type": extraction_class}
                        )
                    ]
                )
            )


# Preview first 5 examples 
examples[:5]
        

[ExampleData(text='presented to stakeholders', extractions=[Extraction(extraction_class='communication', extraction_text='presented to stakeholders', char_interval=None, alignment_status=None, extraction_index=None, group_index=None, description=None, attributes={'softskill_type': 'communication'})]),
 ExampleData(text='collaborated with cross-functional teams', extractions=[Extraction(extraction_class='communication', extraction_text='collaborated with cross-functional teams', char_interval=None, alignment_status=None, extraction_index=None, group_index=None, description=None, attributes={'softskill_type': 'communication'})]),
 ExampleData(text='wrote technical documentation', extractions=[Extraction(extraction_class='communication', extraction_text='wrote technical documentation', char_interval=None, alignment_status=None, extraction_index=None, group_index=None, description=None, attributes={'softskill_type': 'communication'})]),
 ExampleData(text='led a team of', extractions=[Extra

## Extract Entities from Resume: Soft & Hard skills

### Soft Skill extractions

In [None]:
resume_keys: List[str] = ['Summary', 'Professional Experience', 'Personal Projects & Volunteering']
input_text: str = "\n\n".join([resume_dict[key] for key in resume_keys])

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=EXTRACTION_PROMPT,
    examples=examples,
    model_id="gemini-2.5-flash",
)

In [None]:
print(f"Extracted {len(result.extractions)} entities:\n")


# Save as JSON Lines
lx.io.save_annotated_documents(
    annotated_documents=[result],
    output_dir="../src/data/",
    output_name="softskills.jsonl"
)

# Softskills breakdown 
breakdown = Counter([extraction.extraction_class for extraction in result.extractions])
pprint(breakdown, indent=2)

Extracted 28 entities:

LangExtract: Saving to softskills.jsonl: 1 docs [00:00, 874.18 docs/s]

✓ Saved 1 documents to softskills.jsonl
Counter({ 'problem_solving': 9,
          'creativity_innovation': 8,
          'analytical_thinking': 3,
          'teamwork_collaboration': 2,
          'leadership': 2,
          'communication': 2,
          'time_management_organization': 2})





In [151]:
len(result.extractions)

28

In [148]:
overall_entities = len(result.extractions)
for key, value in breakdown.items():
    class_weight = value/overall_entities
    weighted_avg = (class_weight * value) / overall_entities
    print(weighted_avg) 

0.32653061224489793
0.5625
1.6530612244897962
0.08163265306122448
0.12755102040816327
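
One simple way to derive per-category weights from the extraction counts is to normalise them so that they sum to 1. The sketch below uses the soft skill counts printed earlier in this notebook; the helper name `class_weights` is hypothetical, and this is only one possible weighting scheme, not necessarily the one JobAgent will end up using.

```python
from collections import Counter
from typing import Dict


def class_weights(counts: Counter) -> Dict[str, float]:
    """Convert raw class counts into weights that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Counts taken from the soft skill breakdown printed above (28 entities in total)
soft_skill_counts = Counter({
    "problem_solving": 9,
    "creativity_innovation": 8,
    "analytical_thinking": 3,
    "teamwork_collaboration": 2,
    "leadership": 2,
    "communication": 2,
    "time_management_organization": 2,
})
weights = class_weights(soft_skill_counts)
```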


In [145]:
# Generate the interactive visualization
html_content = lx.visualize("../src/data/langextract/softskills/softskills.jsonl")
with open("../src/data/softskills.html", "w") as f:
    if hasattr(html_content, 'data'):
        f.write(html_content.data) 
    else:
        f.write(html_content)

LangExtract: Loading softskills.jsonl: 100%|██████████| 12.0k/12.0k [00:00<00:00, 18.9MB/s]

✓ Loaded 1 documents from softskills.jsonl





In [93]:
# Load skills.json 
skills_path: str = "../src/data/langextract/skills/skills.json"
with open(skills_path, "r") as file: 
    skills_json = json.load(file)
skills_json[:2]


[{'text': 'developed and deployed applications using AWS services like EC2, S3, and Lambda.',
  'extractions': [{'extraction_text': 'AWS',
    'extraction_class': 'cloud_services'},
   {'extraction_text': 'EC2', 'extraction_class': 'cloud_services'},
   {'extraction_text': 'S3', 'extraction_class': 'cloud_services'},
   {'extraction_text': 'Lambda', 'extraction_class': 'cloud_services'}]},
 {'text': 'architected solutions on Azure, leveraging Azure Functions and Cosmos DB for scalability.',
  'extractions': [{'extraction_text': 'Azure',
    'extraction_class': 'cloud_services'},
   {'extraction_text': 'Azure Functions',
    'extraction_class': 'cloud_services'},
   {'extraction_text': 'Cosmos DB', 'extraction_class': 'cloud_services'}]}]

In [122]:
# Define the prompt to extract skills from the resume with LangExtract
EXTRACTION_PROMPT_SKILLS: str = ("""\
Extract cloud services, databases, dev languages, data visualisation, and algorithms using attributes to group skills related information: 
1. Extract entities in the order they appear in the text.
2. Use the exact text for extractions. Do not paraphrase or overlap entities.
3. Skill groups can have different values but should always have the same key "skill_type".
""")

# Create an object to store examples
skill_examples = []

for example in skills_json: 
    # Extract key elements from skills.json
    extractions = example.get("extractions")
    text = example.get("text")

    # Create one LangExtract example per extraction; a single loop also covers
    # examples with only one extraction (the former `else` branch read a stale
    # or undefined `extraction` variable)
    for extraction in extractions:
        extraction_class = extraction.get("extraction_class")
        extraction_text = extraction.get("extraction_text")
        skill_examples.append(
            lx.data.ExampleData(
                text=text,
                extractions=[
                    lx.data.Extraction(
                        extraction_class=extraction_class,
                        extraction_text=extraction_text,
                        attributes={"skill_type": extraction_class}
                    )
                ]
            )
        )

# View the first 5 examples
skill_examples[:5]


[ExampleData(text='developed and deployed applications using AWS services like EC2, S3, and Lambda.', extractions=[Extraction(extraction_class='cloud_services', extraction_text='AWS', char_interval=None, alignment_status=None, extraction_index=None, group_index=None, description=None, attributes={'skill_type': 'cloud_services'})]),
 ExampleData(text='developed and deployed applications using AWS services like EC2, S3, and Lambda.', extractions=[Extraction(extraction_class='cloud_services', extraction_text='EC2', char_interval=None, alignment_status=None, extraction_index=None, group_index=None, description=None, attributes={'skill_type': 'cloud_services'})]),
 ExampleData(text='developed and deployed applications using AWS services like EC2, S3, and Lambda.', extractions=[Extraction(extraction_class='cloud_services', extraction_text='S3', char_interval=None, alignment_status=None, extraction_index=None, group_index=None, description=None, attributes={'skill_type': 'cloud_services'})]),

In [126]:
skills_resume_keys: List[str] = ['Skills', 'Professional Experience', 'Professional Certifications', 'Personal Projects & Volunteering']
input_text_skills: str = "\n\n".join([resume_dict[key] for key in skills_resume_keys])

result_skills = lx.extract(
    text_or_documents=input_text_skills,
    prompt_description=EXTRACTION_PROMPT_SKILLS,
    examples=skill_examples,
    model_id="gemini-2.5-flash",
)



In [129]:
print(f"Extracted {len(result_skills.extractions)} entities:\n")

# Save as JSON Lines
lx.io.save_annotated_documents(
    annotated_documents=[result_skills],
    output_dir="../src/data/langextract/skills",
    output_name="skills.jsonl"
)

# Hard skills breakdown 
breakdown = Counter([extraction.extraction_class for extraction in result_skills.extractions])
pprint(breakdown, indent=2)

Extracted 91 entities:

LangExtract: Saving to skills.jsonl: 1 docs [00:00, 536.70 docs/s]

✓ Saved 1 documents to skills.jsonl
Counter({ 'dev_language': 36,
          'algorithms': 21,
          'cloud_services': 16,
          'data_visualisation': 10,
          'databases': 8})
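
Before computing the hard skill score, it may help to deduplicate the extracted skills per class, since the same skill can appear in several resume sections. The sketch below is an illustration only: `SkillExtraction` is a hypothetical stand-in exposing the two fields used here (`extraction_class` and `extraction_text`), and lowercasing plus trimming is an assumed normalisation.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class SkillExtraction(NamedTuple):
    """Hypothetical stand-in with the two fields read from an extraction."""
    extraction_class: str
    extraction_text: str


def group_unique_skills(extractions: Iterable[SkillExtraction]) -> Dict[str, Set[str]]:
    """Group extracted skills by class, deduplicated case-insensitively."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for item in extractions:
        # Assumption: trimming and lowercasing is enough to spot duplicates
        grouped[item.extraction_class].add(item.extraction_text.strip().lower())
    return dict(grouped)


sample_extractions = [
    SkillExtraction("cloud_services", "AWS"),
    SkillExtraction("cloud_services", "aws "),
    SkillExtraction("databases", "Cosmos DB"),
]
unique_skills = group_unique_skills(sample_extractions)
```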





In [146]:
# Visualise with LangExtract html object 
html_content_skills = lx.visualize("../src/data/langextract/skills/skills.jsonl")
with open("../src/data/langextract/skills/skills.html", "w") as f:
    if hasattr(html_content_skills, 'data'):
        f.write(html_content_skills.data) 
    else:
        f.write(html_content_skills)

LangExtract: Loading skills.jsonl: 100%|██████████| 29.6k/29.6k [00:00<00:00, 15.1MB/s]

✓ Loaded 1 documents from skills.jsonl





## Extract Entities from Job Description