# Information Extraction Tutorial

This tutorial walks through an end-to-end workflow for extracting structured information from documents with DLLMForge's Information Extraction tools.

## What You'll Learn

As a running example, we'll extract **machine learning model hyperparameters** from research papers by:

1. **Generating a Schema** - Automatically create a Pydantic schema for the data structure
2. **Processing Documents** - Convert PDFs to text/images suitable for LLM processing
3. **Extracting Information** - Use an LLM to extract structured data
4. **Batch Processing** - Process multiple documents efficiently
5. **Aggregating Results** - Combine results into a structured dataset

## Prerequisites

```bash
pip install dllmforge
```

You'll also need:
- API keys for your LLM provider (OpenAI, Azure OpenAI, Anthropic, etc.)
- PDF research papers to process

## Step 1: Import Required Libraries

First, let's import all the necessary modules from DLLMForge and Python's standard library.


```python
from dllmforge.IE_agent_schema_generator import SchemaGenerator
from dllmforge.IE_agent_document_processor import DocumentProcessor
from dllmforge.IE_agent_extractor import InfoExtractor
from dllmforge.langchain_api import LangchainAPI
from pathlib import Path
import importlib.util
import re
import json
import pandas as pd

print("✓ All imports successful!")
```




✓ All imports successful!


## Step 2: Generate Extraction Schema

The first step is to define what information we want to extract. Instead of manually writing a Pydantic schema, we'll use the `SchemaGenerator` to automatically create one based on a task description.

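The hyperparameters of interest fall into the four categories listed here. The task description in the code cell below requests only the first one, so the generated schema contains a single `architecture` field; extend the description to cover the rest.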
- **Model architecture**: type, layers, neurons
- **Training parameters**: learning rate, batch size, epochs
- **Optimization settings**: optimizer, loss function
- **Regularization**: dropout, weight decay
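
For orientation, a hand-written schema covering all four categories might look like the sketch below. It is illustrative only: every class and field name is an assumption, not the output of `SchemaGenerator`, and the values in the example instance are made up. All fields are optional so that values not stated in a paper can stay `None`.

```python
from typing import Optional

from pydantic import BaseModel, Field


# Hypothetical, hand-written schema; names are illustrative, not generated by DLLMForge.
class Architecture(BaseModel):
    architecture_type: Optional[str] = Field(None, description="Model family, e.g. 'LSTM' or 'CNN'")
    num_layers: Optional[int] = Field(None, description="Number of layers")
    num_neurons: Optional[int] = Field(None, description="Neurons (units) per layer")


class TrainingParameters(BaseModel):
    learning_rate: Optional[float] = Field(None, description="Learning rate")
    batch_size: Optional[int] = Field(None, description="Batch size")
    epochs: Optional[int] = Field(None, description="Number of training epochs")


class OptimizationSettings(BaseModel):
    optimizer: Optional[str] = Field(None, description="Optimizer name")
    loss_function: Optional[str] = Field(None, description="Loss function name")


class Regularization(BaseModel):
    dropout: Optional[float] = Field(None, description="Dropout rate")
    weight_decay: Optional[float] = Field(None, description="Weight decay coefficient")


class HyperparametersSketch(BaseModel):
    architecture: Architecture
    training: TrainingParameters
    optimization: OptimizationSettings
    regularization: Regularization


# Made-up values, only to show how unset fields default to None.
example = HyperparametersSketch(
    architecture=Architecture(architecture_type="LSTM", num_layers=2, num_neurons=20),
    training=TrainingParameters(learning_rate=0.001, batch_size=256),
    optimization=OptimizationSettings(optimizer="Adam"),
    regularization=Regularization(dropout=0.1),
)
print(example.training.epochs)  # None, because no value was provided
```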


```python
# Define what information we want to extract
schema_task_description = (
    "Generate a Pydantic schema class named ModelHyperparameters to extract "
    "machine learning model hyperparameters from research papers. "
    "Extract: model architecture (type, layers, neurons). "
)

# Create directory for generated schema
schema_dir = Path("generated_schemas")
schema_dir.mkdir(exist_ok=True)
schema_file = schema_dir / "model_hyperparameters.py"

# Create schema generator with direct arguments
schema_generator = SchemaGenerator(
    task_description=schema_task_description,
    output_path=str(schema_file)
)

# Generate and save the schema
schema_code = schema_generator.generate_schema()
schema_generator.save_schema(schema_code)

print(f"✓ Schema generated and saved to {schema_file}")
print("\nGenerated Schema Preview:")
print("=" * 60)
print(schema_code[:500] + "...")  # Show first 500 characters
```


Document loader initialized with support for: .pdf, .docx, .xlsx, .xls, .csv
No example document provided


INFO:httpx:HTTP Request: POST https://openai-floods.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-08-01-preview "HTTP/1.1 200 OK"


Schema saved to generated_schemas\model_hyperparameters.py
✓ Schema generated and saved to generated_schemas\model_hyperparameters.py

Generated Schema Preview:
```python
from pydantic import BaseModel, Field
from typing import List, Optional

class LayerConfiguration(BaseModel):
    layer_type: str = Field(..., description="Type of the layer, e.g., 'Dense', 'Convolutional'")
    num_neurons: Optional[int] = Field(None, description="Number of neurons in the layer, if applicable")

class ModelArchitecture(BaseModel):
    model_type: str = Field(..., description="Type of the model architecture, e.g., 'Neural Network', 'CNN'")
    layers: List[LayerConfiguration] = ...
```


### Load the Generated Schema

Now we need to dynamically load the generated schema class so we can use it for extraction.


```python
# Find the class names defined in the generated schema
class_matches = re.finditer(r"class\s+(\w+)\s*\(", schema_code)
class_names = [match.group(1) for match in class_matches]
# Assumption: nested models are defined first, so the last class is the top-level schema
schema_class_name = class_names[-1]

# Dynamically load the schema module from the saved file
spec = importlib.util.spec_from_file_location("model_hyperparameters", schema_file)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
SchemaClass = getattr(module, schema_class_name)

print(f"✓ Schema class loaded: {schema_class_name}")
print("\nSchema fields:")
for field_name, field_info in SchemaClass.model_fields.items():
    print(f"  - {field_name}: {field_info.annotation}")
```


✓ Schema class loaded: ModelHyperparameters

Schema fields:
  - architecture: <class 'model_hyperparameters.ModelArchitecture'>


## Step 3: Configure Paths

Set up the input and output directories for your documents.

**Note:** Update these paths to match your local setup!


```python
# Define input and output directories.
# The values below come from the author's machine; replace them with your own paths,
# for example r"path/to/research/papers" and r"path/to/output".
document_input_dir = r'c:\Users\deng_jg\work\16centralized_agents\test_data\test'     # Directory containing your PDF files
document_output_dir = r'c:\Users\deng_jg\work\16centralized_agents\test_data\output'  # Where results will be saved

print(f"✓ Input directory:  {document_input_dir}")
print(f"✓ Output directory: {document_output_dir}")
```


✓ Input directory:  c:\Users\deng_jg\work\16centralized_agents\test_data\test
✓ Output directory: c:\Users\deng_jg\work\16centralized_agents\test_data\output


## Step 4: Configure Document Processor

The `DocumentProcessor` converts PDF files into text or images that the LLM can process.

Key parameters:
- `input_dir`: Where your PDF files are located
- `file_pattern`: Pattern to match files (e.g., "*.pdf")
- `output_type`: "text" for text extraction or "image" for vision models
- `output_dir`: Where processed documents will be saved
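
To preview which files a pattern such as `"*.pdf"` would select, the standard library's `Path.glob` can be used. This sketch assumes the processor matches files in `input_dir` with glob-style patterns, which the tutorial does not confirm; it runs against a temporary directory with made-up file names.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    input_dir = Path(tmp)
    # Create a few empty placeholder files (hypothetical names).
    for name in ["paper_a.pdf", "paper_b.pdf", "notes.txt"]:
        (input_dir / name).touch()

    # Collect the names of files matching the pattern, sorted for stable output.
    matched = sorted(p.name for p in input_dir.glob("*.pdf"))

print(matched)  # ['paper_a.pdf', 'paper_b.pdf']
```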


```python
# Create document processor
doc_processor = DocumentProcessor(
    input_dir=document_input_dir,
    file_pattern="*.pdf",           # Match all PDF files
    output_type="text",             # Extract as text (use "image" for vision models)
    output_dir=document_output_dir
)

print("✓ Document processor configured")
print("  - File pattern: *.pdf")
print("  - Output type: text")
print(f"  - Ready to process documents in: {document_input_dir}")
```


Document loader initialized with support for: .pdf, .docx, .xlsx, .xls, .csv
✓ Document processor configured
  - File pattern: *.pdf
  - Output type: text
  - Ready to process documents in: c:\Users\deng_jg\work\16centralized_agents\test_data\test


## Step 5: Initialize the LLM

We'll use `LangchainAPI` to interface with the LLM.

**Important:** Make sure you have set up your API keys as environment variables before running this cell.
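
A quick pre-flight check can catch missing credentials before any request is made. The variable names below are assumptions for illustration; use whatever names your provider and DLLMForge configuration expect. `missing_env_vars` is a hypothetical helper.

```python
import os
from typing import Iterable, List


def missing_env_vars(names: Iterable[str]) -> List[str]:
    """Return the names that are unset or empty in the current environment."""
    return [name for name in names if not os.environ.get(name)]


# Hypothetical variable names; adjust to your provider's configuration.
required = ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"]
missing = missing_env_vars(required)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```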


```python
# Initialize the LLM
llm_api = LangchainAPI(
    model_provider="azure-openai",  # Change to "openai" or "anthropic" as needed
    temperature=0.1                 # Low temperature for more consistent extraction
)
```


## Step 6: Create the Information Extractor

The `InfoExtractor` is the main component that orchestrates the extraction process. It combines:
- The schema (what to extract)
- The LLM (how to extract)
- The document processor (preparing documents)
- Custom instructions (system prompt)

Key parameters:
- `chunk_size`: Maximum tokens per chunk (important for long documents)
- `chunk_overlap`: Overlap between chunks to avoid missing information at boundaries
- `system_prompt`: Instructions for the LLM on how to extract information
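
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch over a list of tokens. It is not DLLMForge's implementation (the tutorial does not show how `InfoExtractor` splits text); `chunk_tokens` is a hypothetical helper, and plain integers stand in for tokens.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[int], chunk_size: int, chunk_overlap: int) -> List[List[int]]:
    """Split tokens into windows of at most chunk_size, each sharing chunk_overlap tokens with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
        start += step
    return chunks


chunks = chunk_tokens(list(range(10)), chunk_size=4, chunk_overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Under this scheme, the values used in this tutorial (`chunk_size=80000`, `chunk_overlap=10000`) would start each new chunk 70,000 tokens after the previous one.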


```python
# Define extraction instructions
system_prompt = """Extract machine learning model hyperparameters from the research paper.
Only extract explicitly stated values. Use None for fields not found.
Be precise with numeric values and units."""

# Create the extractor
extractor = InfoExtractor(
    output_schema=SchemaClass,              # The schema we generated earlier
    llm_api=llm_api,                        # The LLM we initialized
    system_prompt=system_prompt,            # Custom extraction instructions
    chunk_size=80000,                       # Max tokens per chunk
    chunk_overlap=10000,                    # Overlap between chunks
    doc_processor=doc_processor,            # Document processor handles PDF conversion
    document_output_type="text"             # Extract as text
)
```


## Step 7: Extract from a Single Document

Let's start by processing a single document to see how the extraction works. This is useful for:
- Testing your setup
- Inspecting the quality of extracted data
- Debugging extraction issues


```python
# UPDATE THIS PATH to point to one of your PDF files
single_doc_path = Path(document_input_dir) / "Kratzert2018_Rainfall–runoff modelling using Long Short-Term.pdf"

# Check if the file exists before processing
if single_doc_path.exists():
    print(f"Processing: {single_doc_path.name}")

    # Step 1: Process the file (PDF -> text)
    doc = extractor.doc_processor.process_file(single_doc_path)
    print("✓ Document converted to text")

    # Step 2: Extract structured information
    results = extractor.process_document(doc)
    print(f"✓ Extracted {len(results)} result(s)")

    # Save the results
    output_path = Path(document_output_dir) / f"{single_doc_path.stem}_extracted.json"
    extractor.save_results(results, output_path)

else:
    print(f"⚠ File not found: {single_doc_path}")
    print("Please update the 'single_doc_path' variable with a valid file path.")
```


## Step 8: Batch Process All Documents

Now that we've verified the extraction works for a single document, let's process all documents in the input directory!

The `process_all()` method automatically:
1. Finds all PDF files matching the pattern
2. Processes each file (PDF → text)
3. Extracts information from each document
4. Saves all results to a single combined JSON file

**New Feature:** By default, `process_all()` saves all results to a single `all_extracted.json` file and adds a `_source_document` field that identifies which document each extraction came from. To also save one file per document, call `process_all(save_individual=True)`.


```python
try:
    # Process all documents and save to a single combined JSON file
    # By default, results are saved to "all_extracted.json"
    # You can customize the output filename with: combined_output_name="my_results.json"
    extractor.process_all()

    # If you want to save BOTH individual files AND a combined file, use:
    # extractor.process_all(save_individual=True)

    print("\n" + "="*60)
    print("✓ All documents processed and extracted!")
    print("="*60)
    print(f"\nCheck output directory: {document_output_dir}")
    print("All results are combined in a single JSON file")

except Exception as e:
    print(f"\n⚠ Error during batch processing: {e}")
    print("This might happen if the input directory doesn't exist or has no PDF files.")
```


Found 2 files to process


INFO:httpx:HTTP Request: POST https://openai-floods.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-08-01-preview "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openai-floods.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-08-01-preview "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openai-floods.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-08-01-preview "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openai-floods.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-08-01-preview "HTTP/1.1 200 OK"


Results saved to c:\Users\deng_jg\work\16centralized_agents\test_data\output\all_extracted.json

✓ Combined results saved to c:\Users\deng_jg\work\16centralized_agents\test_data\output\all_extracted.json
  Total extractions: 4

✓ All documents processed and extracted!

Check output directory: c:\Users\deng_jg\work\16centralized_agents\test_data\output
All results are combined in a single JSON file
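
The combined file can then be turned into a table for analysis. The sketch below assumes `all_extracted.json` holds a list of JSON objects, each carrying a `_source_document` field next to the extracted fields; the exact layout is not shown in this tutorial, so inspect your file and adjust. It writes a small made-up sample to a temporary directory so that it runs on its own, and `flatten` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into a single level with dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Made-up sample in the ASSUMED layout of the combined file.
sample = [
    {"_source_document": "paper_a.pdf", "architecture": {"model_type": "LSTM", "layers": []}},
    {"_source_document": "paper_b.pdf", "architecture": {"model_type": "CNN", "layers": []}},
]

with tempfile.TemporaryDirectory() as tmp:
    combined_path = Path(tmp) / "all_extracted.json"
    combined_path.write_text(json.dumps(sample), encoding="utf-8")
    records = json.loads(combined_path.read_text(encoding="utf-8"))

# One flat row per extraction, keyed by dotted field names.
rows = [flatten(record) for record in records]
for row in rows:
    print(row["_source_document"], row["architecture.model_type"])
```

The resulting `rows` list can be passed to `pandas.DataFrame(rows)` to obtain one row per extraction, with `_source_document` available for grouping.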
