# Extracting data from Resumes

Suppose we are running a hiring process for a company and have received a batch of resumes from candidates. We want to extract structured data from these resumes so that we can screen and shortlist candidates.

Take a look at one of the resumes in the `data/resumes` directory. 

In [1]:
from IPython.display import IFrame

IFrame(src="./data/resumes/ai_researcher.pdf", width=600, height=400)

You will notice that the resumes have different layouts but share common information such as name, email, experience, and education.

With LlamaExtract, we will show you how to:
- *Define* a data schema to extract the information of interest. 
- *Iterate* on the data schema so that it generalizes across multiple resumes.
- *Finalize* the schema and schedule extractions for multiple resumes.

We will start by creating a `LlamaExtract` client, which provides a Python interface to the LlamaExtract API.

In [3]:
! pip install llama_cloud_services


In [None]:
from dotenv import load_dotenv
from llama_cloud_services import LlamaExtract


# Load environment variables (put LLAMA_CLOUD_API_KEY in your .env file)
load_dotenv(override=True)

# Optionally, add your project id/organization id
llama_extract = LlamaExtract()

No project_id provided, fetching default project.




### Defining the data schema

Next, let us extract two fields from the resume: `name` and `email`. We can define the `data_schema` either as a Python dictionary in JSON Schema form or, more concisely, as a Pydantic model. In either case, the output is guaranteed to validate against this schema.

In [6]:
from pydantic import BaseModel, Field


class Resume(BaseModel):
    name: str = Field(description="The name of the candidate")
    email: str = Field(description="The email address of the candidate")
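
For comparison, the same two-field schema can be written as a plain Python dictionary in JSON Schema form. This is only a sketch: the variable name `resume_schema` is illustrative, and which JSON Schema keywords LlamaExtract accepts beyond these basic ones is not covered here.

```python
# Dictionary (JSON Schema) counterpart of the two-field Resume model.
# `resume_schema` is an illustrative name, not part of the LlamaExtract API.
resume_schema = {
    "type": "object",
    "properties": {
        "name": {
            "type": "string",
            "description": "The name of the candidate",
        },
        "email": {
            "type": "string",
            "description": "The email address of the candidate",
        },
    },
    # Both fields are required, mirroring the non-Optional Pydantic fields.
    "required": ["name", "email"],
}
```

The Pydantic version carries the same information in fewer lines, which is why the rest of this notebook uses it.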

In [7]:
from llama_cloud.core.api_error import ApiError

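try:
    # Delete any agent left over from a previous run so the name is free.
    existing_agent = llama_extract.get_agent(name="resume-screening")
    if existing_agent:
        llama_extract.delete_agent(existing_agent.id)
except ApiError as e:
    # A 404 means no such agent exists yet; re-raise anything else.
    if e.status_code != 404:
        raise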

agent = llama_extract.create_agent(name="resume-screening", data_schema=Resume)

In [8]:
llama_extract.list_agents()

[ExtractionAgent(id=f565c694-e79c-4cc9-a50f-92f1490c67cd, name=resume-screening)]

In [9]:
resume = agent.extract("./data/resumes/ai_researcher.pdf")
resume.data

{'name': 'Dr. Rachel Zhang', 'email': 'rachel.zhang@email.com'}

### Iterating over the data schema

Now that we have created a data schema, let us add more fields to it: `links`, `experience`, and `education`.
- We can create a new Pydantic model for each of `experience` and `education` and represent these fields as lists of those models. This lets us extract multiple entries from the resume without having to pre-define how many experience or education entries the candidate has.
- We have added a `description` parameter to each field to provide more context for extraction. We can also use `description` to provide example inputs/outputs for the extraction.
- Note that we have annotated the `start_date` and `end_date` fields with `Optional[str]` to indicate that these fields are optional. This is *important* because the schema will be used to extract data from multiple resumes, and not all resumes will have the same format. A field should be required only if it is guaranteed to be present in every resume.


In [10]:
from typing import List, Optional


class Education(BaseModel):
    institution: str = Field(description="The institution of the candidate")
    degree: str = Field(description="The degree of the candidate")
    start_date: Optional[str] = Field(
        default=None, description="The start date of the candidate's education"
    )
    end_date: Optional[str] = Field(
        default=None, description="The end date of the candidate's education"
    )


class Experience(BaseModel):
    company: str = Field(description="The name of the company")
    title: str = Field(description="The title of the candidate")
    description: Optional[str] = Field(
        default=None, description="The description of the candidate's experience"
    )
    start_date: Optional[str] = Field(
        default=None, description="The start date of the candidate's experience"
    )
    end_date: Optional[str] = Field(
        default=None, description="The end date of the candidate's experience"
    )


class Resume(BaseModel):
    name: str = Field(description="The name of the candidate")
    email: str = Field(description="The email address of the candidate")
    links: List[str] = Field(
        description="The links to the candidate's social media profiles"
    )
    experience: List[Experience] = Field(description="The candidate's experience")
    education: List[Education] = Field(description="The candidate's education")

Next, we will update the `data_schema` for the `resume-screening` agent to use the new `Resume` model. 

In [11]:
agent.data_schema = Resume
resume = agent.extract("./data/resumes/ai_researcher.pdf")
resume.data

{'name': 'Dr. Rachel Zhang, Ph.D.',
 'email': 'rachel.zhang@email.com',
 'links': [],
 'experience': [{'company': 'DeepMind',
   'title': 'Senior Research Scientist',
   'description': 'Lead researcher on large-scale multi-task learning systems, developing novel architectures that improve cross-task generalization by 40%. Pioneered new approach to zero-shot learning using contrastive training, published in NeurIPS 2023. Built and led team of 6 researchers working on foundational ML models. Developed novel regularization techniques for large language models, reducing catastrophic forgetting by 35%.',
   'start_date': '2019',
   'end_date': 'Present'},
  {'company': 'Google Research',
   'title': 'Research Scientist',
   'description': 'Developed probabilistic frameworks for robust ML, published in ICML 2018. Created novel attention mechanisms for computer vision models, improving accuracy by 25%. Led collaboration with Google Brain team on efficient training methods for transformer mode

This is a good start. Let us add a few more fields to the schema and re-run the extraction. 

In [12]:
class TechnicalSkills(BaseModel):
    programming_languages: List[str] = Field(
        description="The programming languages the candidate is proficient in."
    )
    frameworks: List[str] = Field(
        description="The tools/frameworks the candidate is proficient in, e.g. React, Django, PyTorch, etc."
    )
    skills: List[str] = Field(
        description="Other general skills the candidate is proficient in, e.g. Data Engineering, Machine Learning, etc."
    )


class Resume(BaseModel):
    name: str = Field(description="The name of the candidate")
    email: str = Field(description="The email address of the candidate")
    links: List[str] = Field(
        description="The links to the candidate's social media profiles"
    )
    experience: List[Experience] = Field(description="The candidate's experience")
    education: List[Education] = Field(description="The candidate's education")
    technical_skills: TechnicalSkills = Field(
        description="The candidate's technical skills"
    )
    key_accomplishments: str = Field(
        description="Summarize the candidate's highest achievements."
    )

In [13]:
agent.data_schema = Resume
resume = agent.extract("./data/resumes/ai_researcher.pdf")
resume.data

{'name': 'Dr. Rachel Zhang, Ph.D.',
 'email': 'rachel.zhang@email.com',
 'links': [],
 'experience': [{'company': 'DeepMind',
   'title': 'Senior Research Scientist',
   'description': 'Lead researcher on large-scale multi-task learning systems, developing novel architectures that improve cross-task generalization by 40%. Pioneered new approach to zero-shot learning using contrastive training, published in NeurIPS 2023. Built and led team of 6 researchers working on foundational ML models. Developed novel regularization techniques for large language models, reducing catastrophic forgetting by 35%.',
   'start_date': '2019',
   'end_date': 'Present'},
  {'company': 'Google Research',
   'title': 'Research Scientist',
   'description': 'Developed probabilistic frameworks for robust ML, published in ICML 2018. Created novel attention mechanisms for computer vision models, improving accuracy by 25%. Led collaboration with Google Brain team on efficient training methods for transformer mode

### Finalizing the schema

This is great! We have extracted a lot of key information from the resume in a well-typed form that can be used downstream for further processing. So far, however, our schema changes exist only in this session and will be lost if we close it. Let us save the state of the agent and then use it to extract data from multiple resumes.

In [14]:
agent.save()

In [15]:
agent = llama_extract.get_agent("resume-screening")
agent.data_schema  # Latest schema should be returned

{'additionalProperties': False,
 'properties': {'name': {'description': 'The name of the candidate',
   'type': 'string'},
  'email': {'description': 'The email address of the candidate',
   'type': 'string'},
  'links': {'description': "The links to the candidate's social media profiles",
   'items': {'type': 'string'},
   'type': 'array'},
  'experience': {'description': "The candidate's experience",
   'items': {'additionalProperties': False,
    'properties': {'company': {'description': 'The name of the company',
      'type': 'string'},
     'title': {'description': 'The title of the candidate', 'type': 'string'},
     'description': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
      'description': "The description of the candidate's experience"},
     'start_date': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
      'description': "The start date of the candidate's experience"},
     'end_date': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
      'description': "The 

#### Queueing extractions

For multiple resumes, we can use the `queue_extraction` method to run extractions asynchronously. This is ideal for processing batch extraction jobs.

In [16]:
import os

# All resumes in the data/resumes directory
resumes = []

with os.scandir("./data/resumes") as entries:
    for entry in entries:
        if entry.is_file():
            resumes.append(entry.path)

jobs = await agent.queue_extraction(resumes)

Uploading files: 100%|██████████| 2/2 [00:01<00:00,  1.73it/s]
Creating extraction jobs: 100%|██████████| 2/2 [00:00<00:00,  5.30it/s]


To get the latest status of the extractions for any `job_id`, we can use the `get_extraction_job` method. 


In [17]:
[agent.get_extraction_job(job_id=job.id).status for job in jobs]

[<StatusEnum.PENDING: 'PENDING'>, <StatusEnum.PENDING: 'PENDING'>]

All extraction jobs are still in the `PENDING` state. We can check again a little later to see whether the extractions have completed.

In [18]:
[agent.get_extraction_job(job_id=job.id).status for job in jobs]

[<StatusEnum.SUCCESS: 'SUCCESS'>, <StatusEnum.PENDING: 'PENDING'>]
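
Rather than re-running the status cell by hand, the check can be wrapped in a small polling loop. The helper below is a sketch under stated assumptions: `wait_for_jobs` and its parameters are hypothetical names, and treating `SUCCESS` and `ERROR` as the only terminal states is an assumption based on the status values mentioned in this notebook. The status lookup is passed in as a function, so the loop does not depend on any particular client.

```python
import time
from typing import Callable, Dict, Iterable


def wait_for_jobs(
    job_ids: Iterable[str],
    get_status: Callable[[str], object],
    poll_interval: float = 5.0,
    timeout: float = 300.0,
) -> Dict[str, str]:
    """Poll until every job reaches a terminal state or the timeout expires.

    Returns a mapping of job id to the last status name seen.
    """
    terminal = {"SUCCESS", "ERROR"}  # assumed terminal states
    pending = list(job_ids)
    statuses: Dict[str, str] = {}
    deadline = time.monotonic() + timeout

    while pending:
        still_pending = []
        for job_id in pending:
            status = get_status(job_id)
            # Accept either an enum member (use its value) or a plain string.
            name = str(getattr(status, "value", status))
            statuses[job_id] = name
            if name not in terminal:
                still_pending.append(job_id)
        pending = still_pending
        if pending:
            if time.monotonic() >= deadline:
                break  # give up; callers can inspect the returned statuses
            time.sleep(poll_interval)
    return statuses
```

With the agent from this notebook, `get_status` could be `lambda job_id: agent.get_extraction_job(job_id=job_id).status`, which is the same call used in the cells above.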

#### Retrieving results

Let us now retrieve the results of the extractions. If the status of the extraction is `SUCCESS`, we can retrieve the data from the `data` field. In case there are errors (status = `ERROR`), we can retrieve the error message from the `error` field. 


In [19]:
results = []
for job in jobs:
    extract_run = agent.get_extraction_run_for_job(job.id)
    if extract_run.status == "SUCCESS":
        results.append(extract_run.data)
    else:
        print(f"Extraction status for job {job.id}: {extract_run.status}")

Extraction status for job 8816d49c-537f-431f-a84b-1954201fbe57: ExtractState.PENDING


In [20]:
results[0]

{'name': 'Dr. Rachel Zhang, Ph.D.',
 'email': 'rachel.zhang@email.com',
 'links': [],
 'experience': [{'company': 'DeepMind',
   'title': 'Senior Research Scientist',
   'description': 'Lead researcher on large-scale multi-task learning systems, developing novel architectures that improve cross-task generalization by 40%. Pioneered new approach to zero-shot learning using contrastive training, published in NeurIPS 2023. Built and led team of 6 researchers working on foundational ML models. Developed novel regularization techniques for large language models, reducing catastrophic forgetting by 35%.',
   'start_date': '2019',
   'end_date': 'Present'},
  {'company': 'Google Research',
   'title': 'Research Scientist',
   'description': 'Developed probabilistic frameworks for robust ML, published in ICML 2018. Created novel attention mechanisms for computer vision models, improving accuracy by 25%. Led collaboration with Google Brain team on efficient training methods for transformer mode

In [21]:
results[1]

IndexError: list index out of range
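
Because the loop above appends only successful runs, positions in `results` do not necessarily line up with positions in `jobs`, so indexing past the completed runs raises an `IndexError`. One way to avoid this is to key the results by job id and track unfinished jobs separately. The sketch below uses a hypothetical helper, `collect_results`, and stand-in run objects; only the `status` and `data` attributes are taken from the code above.

```python
from types import SimpleNamespace
from typing import Callable, Dict, Iterable, List, Tuple


def collect_results(
    job_ids: Iterable[str],
    get_run: Callable[[str], object],
) -> Tuple[Dict[str, dict], List[str]]:
    """Split jobs into finished results (keyed by job id) and unfinished ids."""
    done: Dict[str, dict] = {}
    unfinished: List[str] = []
    for job_id in job_ids:
        run = get_run(job_id)
        # Accept either an enum member (use its value) or a plain string.
        status = str(getattr(run.status, "value", run.status))
        if status == "SUCCESS":
            done[job_id] = run.data
        else:
            unfinished.append(job_id)
    return done, unfinished


# Stand-in runs for illustration only; real runs come from the agent.
fake_runs = {
    "job-1": SimpleNamespace(status="SUCCESS", data={"name": "Candidate A"}),
    "job-2": SimpleNamespace(status="PENDING", data=None),
}
done, unfinished = collect_results(fake_runs, fake_runs.__getitem__)
print(done)        # {'job-1': {'name': 'Candidate A'}}
print(unfinished)  # ['job-2']
```

With the agent from this notebook, `get_run` could be `agent.get_extraction_run_for_job`, called with each `job.id`.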

In [None]:
results[2]

{'name': 'Sarah Chen',
 'email': 'sarah.chen@email.com',
 'links': [],
 'education': [{'degree': 'Master of Science in Computer Science',
   'end_date': '2013',
   'start_date': None,
   'institution': 'Stanford University'},
  {'degree': 'Bachelor of Science in Computer Engineering',
   'end_date': '2011',
   'start_date': None,
   'institution': 'University of California, Berkeley'}],
 'experience': [{'title': 'Senior Software Architect',
   'company': 'TechCorp Solutions',
   'end_date': None,
   'start_date': '2020',
   'description': '- Led architectural design and implementation of a cloud-native platform serving 2M+ users\n- Established architectural guidelines and best practices adopted across 12 development teams\n- Reduced system latency by 40% through implementation of event-driven architecture\n- Mentored 15+ senior developers in cloud-native development practices'},
  {'title': 'Lead Software Engineer',
   'company': 'DataFlow Systems',
   'end_date': '2020',
   'start_dat

Congratulations! You now have an agent that can extract structured data from resumes. 
- You can now use this agent to extract data from more resumes and use the extracted data for further processing. 
- To update the schema, you can simply update the `data_schema` attribute of the agent and re-run the extraction. 
- You can also use the `save` method to save the state of the agent and persist changes to the schema for future use. 
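
As one example of further processing, the extracted dictionaries can be filtered against screening criteria. The sketch below assumes records shaped like the final `Resume` schema. The `shortlist` function, the required-language rule, and the two sample records are hypothetical and are not taken from the resumes in this notebook.

```python
from typing import Dict, Iterable, List


def shortlist(records: Iterable[Dict], required_languages: Iterable[str]) -> List[str]:
    """Return names of candidates who list every required programming language."""
    required = {lang.lower() for lang in required_languages}
    selected = []
    for record in records:
        # technical_skills may be missing or None in an incomplete record.
        skills = record.get("technical_skills") or {}
        known = {lang.lower() for lang in skills.get("programming_languages", [])}
        if required <= known:  # every required language is present
            selected.append(record["name"])
    return selected


# Hypothetical records in the shape of the Resume schema (not real extractions).
sample = [
    {
        "name": "Candidate A",
        "technical_skills": {"programming_languages": ["Python", "C++"]},
    },
    {
        "name": "Candidate B",
        "technical_skills": {"programming_languages": ["Java"]},
    },
]
print(shortlist(sample, ["python"]))  # ['Candidate A']
```

The comparison is case-insensitive because extracted skill names may not be capitalized consistently across resumes.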

