# Fine-grained Parsing

After initial parsing, let's parse out details from each resume section individually.

In [1]:
import json
from pprint import pprint
from ollama import chat
from pydantic import BaseModel, RootModel, Field

from utils.with_structured_output import with_structured_output

In [2]:
with open("../output/parsed_resume.json", "r") as file:
    parsed_resume = json.load(file)

In [3]:
parsed_resume

{'Experience': 'AI/ML Intern Aug 2024 – Dec 2024\n• Developed knowledge graph (KG) generation pipeline with internal LLM microservices to allow multi-hop reasoning in 3-stage retrieval augmented generation (RAG) pipeline\n• Extracted 30+ domain-specific seed topics from text corpus with BERTopic for KG subgraph creation\n• Achieved100%schema-compliantLLMoutputsviaprompt engineering andgrammar-contrained decoding\n• Packaged KG generation logic into reusable, object-oriented Python modules used by 30 developers\nSoftware Engineering Intern May 2024 – Aug 2024\n• Redesigned 30 year old Java data analysis suite architecture, cutting developer onboarding time by 3 weeks\n• Used MATLAB profiler to find bottleneck in data preprocessing script, leading to 90% execution time reduction\n• Built Java class to interface with Javalin REST API, enabling multithreaded network communication\n• Reduced logic errors by 50% via integration and regression testing in Jenkins CI/CD pipeline\nUndergraduate 

## Parse Education

In [4]:
class School(BaseModel):
    name: str           = Field(..., alias="Name")
    majors: list[str]   = Field(..., alias="Majors")
    minors: list[str]   = Field(..., alias="Minors")
    gpa: float | None   = Field(None, alias="GPA")  # optional: a resume may not list a GPA
    grad_year: int      = Field(..., alias="Graduation Year")

class Education(RootModel[list[School]]):
    pass

In [5]:
EDUCATION_EXTRACTION_PROMPT = """
You are an expert resume parser. Given some resume text, your job is to parse education information as a list of JSON objects representing each school attended. Follow this format for each school:
    {{
        "Name": "<Name of School>",
        "Majors": ["list", "of", "majors"],
        "Minors": ["list", "of", "minors"],
        "GPA": <GPA>,
        "Graduation Year": <Graduation Year>
    }},

Notes:
1. If there are no minors, set "Minors" to an empty list.
2. If there is no GPA listed, set "GPA" to null.
3. If any school does not have a graduation year listed, omit the school from the output.
4. Output the full name of all degrees, e.g., "BS in Computer Science", "M.S. in Information Science". Note that the resume may contain a double major. If so, output all degrees with their full names, making sure to include the type of degree for each major ("BS," "MS," etc.). Please note that some schools offer emphasis areas or modifiers to the major that are not themselves considered majors, e.g., "Computer Science with statistics emphasis" is equivalent to "Computer Science."
5. If the resume does not contain information for one of the sections, return an empty list for that section.

Extracted information must be **explicitly contained in the resume.**

Resume text:
{resume_text}

Output:
"""

In [6]:
education_info = with_structured_output(
    prompt=EDUCATION_EXTRACTION_PROMPT.format(resume_text=parsed_resume["Education"]),
    schema=Education)
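
The call above fills the template with `str.format`, which is why the JSON example inside the prompt uses doubled braces: `{{` and `}}` are emitted as literal braces, while `{resume_text}` is substituted. A minimal illustration, using a shortened, made-up template rather than the full prompt:

```python
# Shortened stand-in for the extraction prompt (hypothetical, for illustration only).
TEMPLATE = """Follow this format:
    {{
        "Name": "<Name of School>"
    }}

Resume text:
{resume_text}
"""

filled = TEMPLATE.format(resume_text="Texas A&M University")

# Doubled braces come out as single literal braces; the placeholder is replaced.
print(filled)
```

Braces that appear inside the resume text itself are not interpreted, since `str.format` only parses the template string.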

In [7]:
education_info

[{'Name': 'Texas A&M University',
  'Majors': ['BS in Computer Science'],
  'Minors': ['Statistics', 'Math'],
  'Graduation Year': 2026,
  'GPA': 4.0}]

## Parse Experience

In [8]:
class Experience(BaseModel):
    company: str = Field(..., alias="Company")
    role: str = Field(..., alias="Role")
    contributions: list[str] = Field(..., alias="Contributions")
    
class Experiences(BaseModel):
    experiences: list[Experience] = Field(..., alias="Experiences")
    yoe: float = Field(..., alias="Total Years of Experience")

In [9]:
EXPERIENCE_EXTRACTION_PROMPT = """
You are an expert at parsing resumes. Given some resume text, your job is to extract information about the candidate's work experience and format it as a single JSON object with the following format:
    {{
        "Experiences": [
            {{
                "Company": "<company>",
                "Role": "<applicant's role at the company>",
                "Contributions": ["list", "of", "contributions", "in", "the", "role"]
            }},
            ...
        ],
        "Total Years of Experience": <Total Years of Experience> 
    }}
    
The extracted information must be **explicitly contained in the resume.**

Calculate "Total Years of Experience" by summing up the duration of all experiences, rounded to the nearest quarter-year. **Note that overlapping timeframes should not be double-counted.**

Resume text:
{resume_text}

Output:
"""

In [10]:
experience_info = with_structured_output(
    prompt=EXPERIENCE_EXTRACTION_PROMPT.format(resume_text=parsed_resume["Experience"]),
    schema=Experiences)
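
Because "Total Years of Experience" is computed by the model, it can be worth cross-checking against a deterministic calculation. The sketch below is one possible reading of the rule in the prompt, not part of the original pipeline. It assumes each role is given as inclusive `(year, month)` start and end pairs, counts every calendar month at most once so that overlaps are not double-counted, and rounds to the nearest quarter-year. The function name and the inclusive-month convention are assumptions.

```python
import math


def total_years_of_experience(periods):
    """Sum month-level durations without double-counting overlaps.

    `periods` is a list of ((start_year, start_month), (end_year, end_month))
    tuples; both endpoints are treated as inclusive (an assumption).
    """
    covered_months = set()
    for (start_year, start_month), (end_year, end_month) in periods:
        first = start_year * 12 + (start_month - 1)
        last = end_year * 12 + (end_month - 1)
        # A set keeps each calendar month once, even if several roles cover it.
        covered_months.update(range(first, last + 1))

    years = len(covered_months) / 12
    # Round half up to the nearest 0.25 (avoids round()'s banker's rounding).
    return math.floor(years * 4 + 0.5) / 4


# Date ranges taken from the two internships in the resume text shown earlier.
periods = [
    ((2024, 8), (2024, 12)),  # AI/ML Intern: Aug 2024 to Dec 2024
    ((2024, 5), (2024, 8)),   # Software Engineering Intern: May 2024 to Aug 2024
]
print(total_years_of_experience(periods))
```

August 2024 is covered by both internships but is counted only once here. A disagreement between this value and the model's answer is a signal to review the extraction by hand.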

In [11]:
experience_info

{'Experiences': [{'Company': 'Full Stack Developer Intern',
   'Role': 'Full Stack Developer Intern',
   'Contributions': ['Built CRM system with ASP.NET MVC serving 10 users, improving employee efficiency by 50%',
    'Optimized MySQL database performance by eliminating 1,000 duplicate records, improving query speed by 10%',
    'Designed scalable SQL database architecture to support complex entity relationships',
    'Created role-based authorization system with Entity Framework to facilitate project management for managers']},
  {'Company': 'Software Engineering Intern',
   'Role': 'Software Engineering Intern',
   'Contributions': ['Redesigned 30 year old Java data analysis suite architecture, cutting developer onboarding time by 3 weeks',
    'Used MATLAB profiler to find bottleneck in data preprocessing script, leading to 90% execution time reduction',
    'Built Java class to interface with Javalin REST API, enabling multithreaded network communication',
    'Reduced logic error
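
Both extraction prompts require that extracted information be explicitly contained in the resume. One way to spot-check this, sketched here as an optional addition rather than something the notebook does, is to search for each extracted string in the source text after normalizing case and whitespace. The helper name `find_ungrounded` is hypothetical, and the check is deliberately strict: paraphrased or reformatted values are flagged even when they are faithful.

```python
import re


def _normalize(text):
    """Lowercase and collapse whitespace so line breaks do not cause misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_ungrounded(values, source_text):
    """Return the extracted strings that do not appear verbatim in the source."""
    haystack = _normalize(source_text)
    return [value for value in values if _normalize(value) not in haystack]


# Excerpt of parsed_resume["Experience"] shown earlier in the notebook.
source_text = (
    "Software Engineering Intern May 2024 – Aug 2024\n"
    "• Redesigned 30 year old Java data analysis suite architecture, "
    "cutting developer onboarding time by 3 weeks\n"
    "• Used MATLAB profiler to find bottleneck in data preprocessing script, "
    "leading to 90% execution time reduction\n"
)

contributions = [
    "Redesigned 30 year old Java data analysis suite architecture, "
    "cutting developer onboarding time by 3 weeks",
    "Improved test coverage to 95%",  # made-up entry to show a flagged value
]

print(find_ungrounded(contributions, source_text))
```

Only the made-up entry is reported, because the first contribution appears word for word in the excerpt.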

## Parse Skills

In [12]:
class Skills(BaseModel):
    technical_skills: list[str] = Field(..., alias="Technical Skills")
    domain_specific_skills: list[str] = Field(..., alias="Domain-Specific Skills")

In [13]:
SKILL_EXTRACTION_TEMPLATE = """
You are an expert at parsing skills from resumes.

Given resume text, please parse individual technical (e.g., programming languages, tools, frameworks, databases) and domain-specific (e.g., methodologies, architectures, or specialized techniques) skills contained in the resume. **Ensure that you do not miss any domain-specific skills.**

Format your output as a JSON object as follows:
    {{
        "Technical Skills": ["list", "of", "technical", "skills"],
        "Domain-Specific Skills": ["list", "of", "domain", "specific", "skills"]
    }}

Resume text:
```
{resume_text}
```

Parsed skills:
"""

In [14]:
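# Skills can appear in any of these sections, so parse them together.
# Joining with newlines keeps the end of one section from running into the next.
skill_sections = ["Experience", "Projects", "Skills", "Research", "Leadership"]
skills_info = with_structured_output(
    prompt=SKILL_EXTRACTION_TEMPLATE.format(
        resume_text="\n".join(parsed_resume[section] for section in skill_sections)),
    schema=Skills)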

In [15]:
skills_info

{'Technical Skills': ['Python',
  'Java',
  'C/C++',
  'TypeScript/JavaScript',
  'C#',
  'SQL',
  'HTML/CSS',
  'MATLAB',
  'Bash',
  'React',
  'Bootstrap',
  'Flask',
  'JUnit',
  'ASP.NET',
  'Entity Framework',
  'Spring Boot',
  'Numpy',
  'Pandas',
  'Matplotlib',
  'LangChain',
  'OpenAI',
  'Pydantic',
  'TensorFlow',
  'Linux',
  'Git (GitHub, Gerrit)',
  'Anaconda',
  'Jupyter',
  'Azure DevOps',
  'Jenkins',
  'Vim',
  'Docker',
  'PostgreSQL',
  'MySQL',
  'Neo4J'],
 'Domain-Specific Skills': ['Knowledge Graph (KG) generation pipeline',
  'LLM microservices',
  'Multi-hop reasoning in RAG pipeline',
  'BERTopic for KG subgraph creation',
  'Prompt engineering and grammar-constrained decoding',
  'Object-oriented Python modules',
  'Java data analysis suite architecture redesign',
  'MATLAB profiler for bottleneck identification',
  'Data preprocessing script optimization',
  'Javalin REST API interface with Java class',
  'Multithreaded network communication',
  'Integrati

## Putting it all together

In [16]:
parsed_info = {
    "Personal Info": parsed_resume["Personal Info"],
    "Education": education_info,
    "Work Experience": experience_info,
    "Skills": skills_info
}
pprint(parsed_info)

{'Education': [{'GPA': 4.0,
                'Graduation Year': 2026,
                'Majors': ['BS in Computer Science'],
                'Minors': ['Statistics', 'Math'],
                'Name': 'Texas A&M University'}],
 'Personal Info': {'Email': 'kevzhang2022@gmail.com',
                   'GitHub': 'https://github.com/n1v3x2',
                   'LinkedIn': 'https://linkedin.com/in/kevinkz',
                   'Phone': '832-416-3570'},
 'Skills': {'Domain-Specific Skills': ['Knowledge Graph (KG) generation '
                                       'pipeline',
                                       'LLM microservices',
                                       'Multi-hop reasoning in RAG pipeline',
                                       'BERTopic for KG subgraph creation',
                                       'Prompt engineering and '
                                       'grammar-constrained decoding',
                                       'Object-oriented Python modules',
      

In [17]:
with open("../output/parsed_resume_info.json", "w") as file:
    json.dump(parsed_info, file, indent=4)