# Simple Document Extraction with LLMs

This notebook demonstrates the basic components for extracting structured information from documents using LLMs.

## 1. Setup

First, let's install and import the necessary packages:

In [None]:
# Install required packages
!pip install langchain langchain-openai langchain-community pydantic python-dotenv pypdf

In [1]:
# Import libraries
import os
import json
from typing import List, Optional
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.document_loaders import PyPDFLoader
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

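# Set your OpenAI API key in your environment or in a .env file (preferred),
# or assign it directly here. Avoid committing a real key to version control.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = "your-api-key-here"  # Placeholder: replace with your key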

## 2. Define the Schema

Let's create a simple schema for extracting information from a CV/resume:

In [2]:
# Define a simple schema for CV extraction
class Experience(BaseModel):
    """Work experience information"""
    company: str = Field(..., description="Company name")
    position: str = Field(..., description="Job title")
    period: str = Field(..., description="Employment period (e.g., '2019-2021')")
    description: str = Field(..., description="Job description and responsibilities")

class Education(BaseModel):
    """Educational background"""
    institution: str = Field(..., description="School or university name")
    degree: str = Field(..., description="Degree obtained")
    year: str = Field(..., description="Graduation year")

class CVSchema(BaseModel):
    """Basic CV/resume schema"""
    name: str = Field(..., description="Full name of the person")
    summary: str = Field(..., description="Professional summary or objective")
    experience: List[Experience] = Field(default_factory=list, description="Work experience")
    education: List[Education] = Field(default_factory=list, description="Educational background")
    skills: List[str] = Field(default_factory=list, description="Professional skills")
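
The schema above marks `name`, `summary`, and every field of `Experience` and `Education` as required, so a CV that omits one of them leaves the model with nothing valid to return. Below is a minimal sketch of one way to relax a field, assuming Pydantic v2 (which the `model_dump_json` call later in the notebook also relies on). `LooseEducation` is a hypothetical name, not part of the notebook:

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class LooseEducation(BaseModel):
    """Hypothetical variant of Education in which the year may be absent."""
    institution: str = Field(..., description="School or university name")
    degree: str = Field(..., description="Degree obtained")
    year: Optional[str] = Field(None, description="Graduation year, if stated")


# A record without a year still validates; the field is left as None.
partial = LooseEducation.model_validate(
    {"institution": "Example University", "degree": "Mathematics"}
)
print(partial.year)

# A record missing a required field is rejected.
try:
    LooseEducation.model_validate({"degree": "Mathematics"})
    rejected = False
except ValidationError as exc:
    rejected = True
    print(f"Rejected with {len(exc.errors())} error(s)")
```

Optional fields fit the instruction not to invent missing information: the model can return `None` instead of a guess.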

## 3. Load a Document

Now let's load a document (PDF in this example):

In [4]:
def load_document(file_path):
    """Load a PDF document"""
    try:
        loader = PyPDFLoader(file_path)
        documents = loader.load()
        
        # Combine all pages into a single text
        text = "\n\n".join([doc.page_content for doc in documents])
        return text
    except Exception as e:
        print(f"Error loading document: {e}")
        return None

# Example usage: replace with the path to your own document
document_path = r"C:\Users\Oscar\CascadeProjects\RAGs\cv_extractor\documents_folder\cv\001.pdf"

# Load the document
document_text = load_document(document_path)

if document_text:
    print(f"Document loaded successfully! Length: {len(document_text)} characters")
    print(f"Preview: {document_text[:200]}...")
else:
    print("Failed to load document")

Document loaded successfully! Length: 1981 characters
Preview: Oscar Quiroga
MATHEMATICIAN 
quipios@gmail.com
Bogotá, Colombia
Mathematician with management experience, responsible fordata analysis and reporting to support decision-making. Problem-solving and cri...
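
Text pulled out of a PDF often carries trailing spaces and irregular blank lines, as the preview above hints (the `MATHEMATICIAN ` line ends with a space). Below is a minimal clean-up sketch using only the standard library. `normalize_text` is a hypothetical helper, and it does not repair words that the PDF extractor fused together, such as `fordata`:

```python
import re


def normalize_text(text: str) -> str:
    """Tidy whitespace in extracted text without changing its words."""
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Strip leading and trailing spaces on every line.
    lines = [line.strip() for line in text.splitlines()]
    text = "\n".join(lines)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "Oscar Quiroga\nMATHEMATICIAN \n\n\n\nBogotá,   Colombia\n"
print(normalize_text(sample))
```

Running the document text through a helper like this before extraction keeps the prompt slightly shorter and the preview easier to read.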


## 4. Create the Extraction Function

Let's create a function to extract structured data using an LLM:

In [5]:
def extract_information(text, model="gpt-4o-mini"):
    """Extract structured information from text using an LLM"""
    # Create the LLM
    llm = ChatOpenAI(model=model, temperature=0)
    
    # Create a prompt template
    prompt = ChatPromptTemplate.from_messages([
        ("system", """
        Extract structured information from the CV/resume according to the schema.
        Only extract information that is explicitly mentioned in the text.
        Do not make up or infer information that is not present.
        """),
        ("human", "{text}")
    ])
    
    # Create the extraction chain
    chain = prompt | llm.with_structured_output(CVSchema)
    
    # Run the extraction
    try:
        result = chain.invoke({"text": text})
        return result
    except Exception as e:
        print(f"Error during extraction: {e}")
        return None
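
`extract_information` makes a single attempt and returns `None` on any error, so a transient network or rate-limit failure looks the same as a permanent one. Below is a minimal retry sketch with exponential backoff, written against a generic callable so it runs without an API key. `call_with_retries` and its parameters are hypothetical names, not part of LangChain:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> Optional[T]:
    """Call func until it succeeds or attempts run out; return None on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:  # Broad on purpose, mirroring extract_information
            print(f"Attempt {attempt} failed: {exc}")
            if attempt < attempts:
                # Wait base_delay, then 2x, 4x, ... before the next attempt.
                time.sleep(base_delay * 2 ** (attempt - 1))
    return None


# Stand-in for a flaky call: fails twice, then succeeds.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


result = call_with_retries(flaky, attempts=3, base_delay=0.01)
print(result)
```

In the notebook, this could wrap the chain call, for example `call_with_retries(lambda: chain.invoke({"text": text}))`.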

## 5. Extract and Display Results

Now let's extract information from our document and display the results:

In [6]:
# Only run if we have document text
if document_text:
    # Extract information
    print("Extracting information...")
    extracted_data = extract_information(document_text)
    
    if extracted_data:
        # Display the extracted information
        print("\nExtracted Information:")
        print(f"Name: {extracted_data.name}")
        print(f"Summary: {extracted_data.summary}")
        
        print("\nExperience:")
        for exp in extracted_data.experience:
            print(f"- {exp.position} at {exp.company} ({exp.period})")
        
        print("\nEducation:")
        for edu in extracted_data.education:
            print(f"- {edu.degree} from {edu.institution} ({edu.year})")
        
        print("\nSkills:")
        for skill in extracted_data.skills:
            print(f"- {skill}")
            
        # Convert to JSON and save
        json_data = extracted_data.model_dump_json(indent=2)
        
        # Display JSON
        print("\nJSON Output:")
        print(json_data)
        
        # Save to file next to the source document, as UTF-8
        output_path = os.path.splitext(document_path)[0] + "_extracted.json"
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(json_data)
        print(f"\nSaved to: {output_path}")

Extracting information...

Extracted Information:
Name: Oscar Quiroga
Summary: Mathematician with management experience, responsible for data analysis and reporting to support decision-making. Problem-solving and critical thinking capabilities.

Experience:
- Tech Manager - Corporate Security at BBVA (2019-2021)
- Intelligence Analyst at Presidencia de Colombia (2012-2018)
- Smart Contract Developer at OnAnalytics (2021-2024)

Education:
- Mathematics from Pontifical Xaverian University (2004-2009)

Skills:
- Data Analysis
- Data Visualization
- Spreadsheets
- SQL
- Data Studio
- Tableau
- Python
- Google Suite
- Dune Analytics
- Block Explorers
- DexScreener
- Token Terminal
- Arkham Intelligence
- Web3 - Industry knowledge

JSON Output:
{
  "name": "Oscar Quiroga",
  "summary": "Mathematician with management experience, responsible for data analysis and reporting to support decision-making. Problem-solving and critical thinking capabilities.",
  "experience": [
    {
      "company":
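
Extracted fields can contain non-ASCII characters (the sample CV mentions Bogotá), and the default file encoding depends on the platform. Below is a minimal sketch of a JSON round trip with an explicit UTF-8 encoding, using only the standard library. The `record` dictionary is a made-up stand-in for extracted data, not the notebook's actual output:

```python
import json
import tempfile
from pathlib import Path

# Made-up stand-in for extracted data.
record = {"name": "Example Person", "city": "Bogotá", "skills": ["SQL", "Python"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = Path(tmp_dir) / "example_extracted.json"

    # ensure_ascii=False keeps "á" readable instead of writing "\u00e1".
    output_path.write_text(
        json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
    )

    # Read the file back with the same encoding and compare.
    loaded = json.loads(output_path.read_text(encoding="utf-8"))

print(loaded == record)
```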

## 6. Customize for Different Document Types

You can easily customize this for different document types by changing the schema:

In [None]:
# Example: Invoice schema
class InvoiceItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    amount: float

class InvoiceSchema(BaseModel):
    invoice_number: str
    date: str
    vendor: str
    customer: str
    items: List[InvoiceItem]
    total_amount: float

# To use this schema, pass it to with_structured_output in extract_information
# and reword the system prompt, which currently refers to a CV/resume:
# chain = prompt | llm.with_structured_output(InvoiceSchema)
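
Numeric fields such as `amount` and `total_amount` are easy to cross-check once extracted. Below is a minimal post-processing sketch that flags arithmetic inconsistencies. It uses standard-library dataclasses that mirror the invoice schema so it runs on its own. `check_invoice`, the `tolerance` value, and the sample figures are all hypothetical, and the check assumes the total is the plain sum of line amounts, with no tax or discounts:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    description: str
    quantity: float
    unit_price: float
    amount: float


def check_invoice(
    items: List[Item], total_amount: float, tolerance: float = 0.01
) -> List[str]:
    """Return human-readable problems; an empty list means the figures agree."""
    problems = []
    for item in items:
        expected = item.quantity * item.unit_price
        if abs(expected - item.amount) > tolerance:
            problems.append(
                f"{item.description}: amount {item.amount} != {expected:.2f}"
            )
    items_sum = sum(item.amount for item in items)
    if abs(items_sum - total_amount) > tolerance:
        problems.append(f"total {total_amount} != sum of items {items_sum:.2f}")
    return problems


items = [
    Item("Widget", quantity=2, unit_price=10.0, amount=20.0),
    Item("Gadget", quantity=1, unit_price=5.5, amount=6.5),  # Deliberately wrong
]
problems = check_invoice(items, total_amount=26.5)
for problem in problems:
    print(problem)
```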

## 7. Conclusion

This notebook demonstrated the basic components of document extraction:

1. Defining a schema with Pydantic
2. Loading documents
3. Creating an extraction function with LLMs
4. Generating structured output

You can extend this by adding support for more document types, improving the extraction prompt, or adding post-processing for the extracted data.
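
As one example of post-processing, the free-text `period` field (for instance `'2019-2021'`) can be split into numeric years. The sketch below is minimal and assumes periods are written as two four-digit years separated by a hyphen or an en dash. `parse_period` is a hypothetical helper and returns `None` for anything else:

```python
import re
from typing import Optional, Tuple

# Two four-digit years separated by a hyphen or en dash, with optional spaces.
PERIOD_PATTERN = re.compile(r"^\s*(\d{4})\s*[-\u2013]\s*(\d{4})\s*$")


def parse_period(period: str) -> Optional[Tuple[int, int]]:
    """Return (start_year, end_year), or None if the text does not match."""
    match = PERIOD_PATTERN.match(period)
    if match is None:
        return None
    start, end = int(match.group(1)), int(match.group(2))
    # Reject reversed ranges rather than silently swapping them.
    if start > end:
        return None
    return start, end


for text in ["2019-2021", "2012 \u2013 2018", "2021-Present"]:
    print(text, "->", parse_period(text))
```

Open-ended periods such as `2021-Present` fall through to `None` here. Whether to treat them as ongoing is a decision for the calling code.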