# Parse CVs

This notebook parses a typical CV from a PDF and extracts its key details.

## Step 0: Load environment variables using dotenv

A `.env` file is a simple way to supply environment variables. It keeps sensitive information such as API keys and other secrets out of the notebook itself.

In [None]:
from dotenv import load_dotenv
load_dotenv()

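
To catch a missing key early, a small check can run after `load_dotenv()`. The sketch below is an assumption rather than part of the original notebook: `require_env` is a hypothetical helper name, and it only reads from the process environment.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # Raised when the variable is unset or empty
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

For example, `require_env("OPENAI_API_KEY")` would fail immediately if the key that `ChatOpenAI` reads by default is missing, rather than at the first model call.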

## Step 1: Load the CV

- Copy a test CV into the `data/` folder (which is gitignored to maintain privacy)
- Load the CV into memory

In [None]:
from pypdf import PdfReader

# Path to the PDF file
pdf_path = 'data/cv.pdf'  # TODO: Change this to the path of your PDF file

# Open the PDF file
reader = PdfReader(pdf_path)

# Get the number of pages in the PDF
num_pages = len(reader.pages)

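# Extract the text of each page, falling back to an empty string for pages
# with no extractable text. Joining with newlines keeps the last line of one
# page from running into the first line of the next.
full_text = "\n".join(page.extract_text() or "" for page in reader.pages)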

print(f"Extracted text from {num_pages} pages")
full_text

## Step 2: Split the CV into sections

- Prepend a line number to each line of the text
    - Split the text string on newlines (`\n`) to get the individual lines
- Use a model that returns structured output. See the [LangChain documentation](https://python.langchain.com/v0.2/docs/how_to/structured_output/) for more information.
- Try returning the output as Pydantic classes, as in the docs
- Create a prompt that returns the start and end lines of predefined sections (e.g. "Education", "Experience", "Skills")
    - Output something like:
    ```json
    {
        "Education": {
            "start": 10,
            "end": 20
        },
        "Experience": {
            "start": 30,
            "end": 40
        },
        ...
    }
    ```
- Use these line numbers to extract the relevant sections
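
The line-numbering and extraction steps can be sketched with plain string operations. This is a minimal sketch, not a definitive implementation: the helper names `number_lines` and `extract_sections` are hypothetical, and it assumes the model returns 1-based, inclusive line numbers in the JSON shape shown above.

```python
def number_lines(text: str) -> str:
    """Prefix every line with its 1-based line number, e.g. '2: Education'."""
    lines = text.split("\n")
    return "\n".join(f"{i}: {line}" for i, line in enumerate(lines, start=1))


def extract_sections(text: str, spans: dict) -> dict:
    """Slice the original (unnumbered) text using start/end line numbers.

    Assumes line numbers are 1-based and that 'end' is inclusive.
    """
    lines = text.split("\n")
    sections = {}
    for name, span in spans.items():
        # Convert the 1-based inclusive range to a Python slice
        sections[name] = "\n".join(lines[span["start"] - 1 : span["end"]])
    return sections


# Made-up sample text, for illustration only
sample = "Jane Doe\nEducation\nBSc Physics\nSkills\nPython"
spans = {"Education": {"start": 2, "end": 3}, "Skills": {"start": 4, "end": 5}}

print(number_lines(sample))
print(extract_sections(sample, spans)["Education"])
```

Slicing the original text rather than the numbered copy keeps the line-number prefixes out of the extracted sections.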


In [None]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
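
One possible way to get the section boundaries back as Pydantic objects is `with_structured_output`, which LangChain chat models such as `ChatOpenAI` provide. The class names `SectionSpan` and `CVSections`, the function `find_sections`, and the prompt wording are all assumptions made for illustration. The sketch also assumes a LangChain version that accepts Pydantic v2 models; the v0.2 docs import `BaseModel` from `langchain_core.pydantic_v1` instead.

```python
from typing import List

from pydantic import BaseModel, Field


class SectionSpan(BaseModel):
    """Where one section of the CV starts and ends (hypothetical schema)."""

    name: str = Field(description="Section name, e.g. 'Education'")
    start: int = Field(description="Line number of the first line of the section")
    end: int = Field(description="Line number of the last line of the section")


class CVSections(BaseModel):
    """All sections found in the CV."""

    sections: List[SectionSpan]


def find_sections(llm, numbered_text: str) -> CVSections:
    """Ask the model for section boundaries in line-numbered CV text."""
    prompt = (
        "Each line of the CV below is prefixed with its line number. "
        "Return the start and end line numbers of each section you find, "
        "such as Education, Experience and Skills.\n\n" + numbered_text
    )
    # with_structured_output wraps the model so it returns a CVSections instance
    structured_llm = llm.with_structured_output(CVSections)
    return structured_llm.invoke(prompt)
```

Here `llm` is the `ChatOpenAI` instance created above and `numbered_text` is the CV text with line numbers prepended. A list of spans is used instead of one field per section so that every field is required and the model can report sections that were not predefined.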

## Step 3: Use LangChain to parse the CV data

- For each section, use a prompt to extract that section's key details.
- Again, try returning the output as Pydantic classes
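
As a starting point, each section could get its own Pydantic schema, with a shared helper that runs the extraction. Everything named here is hypothetical: the `EducationEntry` and `Education` models, their fields, the `extract_details` helper, and the prompt text are illustrative assumptions, and a real CV may call for different fields.

```python
from typing import List, Type, TypeVar

from pydantic import BaseModel, Field


class EducationEntry(BaseModel):
    """One qualification listed in the Education section (hypothetical fields)."""

    institution: str = Field(description="Name of the school or university")
    qualification: str = Field(description="Degree or certificate, e.g. 'BSc Physics'")
    dates: str = Field(description="Dates as written in the CV, or '' if not stated")


class Education(BaseModel):
    """All entries found in the Education section."""

    entries: List[EducationEntry]


SchemaT = TypeVar("SchemaT", bound=BaseModel)


def extract_details(llm, schema: Type[SchemaT], section_text: str) -> SchemaT:
    """Extract one section's details into the given Pydantic schema."""
    prompt = (
        "Extract the key details from this CV section. "
        "Use only information that appears in the text.\n\n" + section_text
    )
    # The same helper works for any section: pass a different schema class
    return llm.with_structured_output(schema).invoke(prompt)
```

With the `llm` from Step 2, a call might look like `extract_details(llm, Education, education_text)`, where `education_text` is the text of the Education section. Other sections such as Experience or Skills would follow the same pattern with their own schema classes.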
