**Setup and Installation**

In [None]:
# Install necessary packages
!pip install langchain langchain-openai langchain-community chromadb pydantic

import os
import time
import json
import re
from typing import List, Optional, Dict, Any
import warnings
warnings.filterwarnings('ignore')

# Set your OpenAI API key: reuse the OPENAI_API_KEY environment variable
# (e.g. populated from Colab secrets) when it exists, otherwise prompt for it
from getpass import getpass
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY") or getpass("Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

# Import necessary components
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import LLMChain, MapReduceDocumentsChain, ReduceDocumentsChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.output_parsers import PydanticOutputParser, StructuredOutputParser, ResponseSchema
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pydantic import BaseModel, Field

# Initialize the LLM
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

print("Setup complete!")

**10.3.1 Reference-Based Extraction**

---

**Create Sample Research Papers Data**

In [None]:
research_documents = [
    Document(
        page_content="""
        Title: Advanced Techniques in Retrieval Augmented Generation

        Authors: Jane Smith, Robert Johnson, Wei Zhang

        Publication: Journal of Artificial Intelligence Research, 2023

        Abstract: Retrieval Augmented Generation (RAG) combines neural generation with information retrieval to enhance the factuality and reliability of generated text. This paper presents several novel techniques for improving the retrieval component, including adaptive retrievers, multi-step retrieval, and hybrid search methods. Our approaches demonstrate significant improvements in accuracy and relevance across multiple benchmark datasets.

        Methodology: We evaluated our techniques on the MS MARCO, Natural Questions, and HotpotQA datasets. Experiments were conducted using a combination of dense and sparse retrievers, with a T5-large model serving as the generator. All models were implemented in PyTorch and trained on 8 NVIDIA A100 GPUs.

        Results: Our adaptive retrieval approach achieved a 7.2% improvement in recall@10 compared to the strongest baseline. The multi-step retrieval technique showed particular strength in complex queries, improving answer accuracy by 12.3% on multi-hop questions. Hybrid search methods balanced efficiency and effectiveness, with only a 15% increase in computational cost yielding a 9.8% improvement in overall accuracy.
        """,
        metadata={"source": "advanced_rag_techniques.pdf", "year": 2023}
    ),
    Document(
        page_content="""
        Title: Enhancing RAG Systems with Knowledge Graph Integration

        Authors: Michael Lee, Sarah Garcia, John Patel

        Publication: Conference on Neural Information Processing Systems, 2022

        Abstract: This paper investigates the integration of knowledge graphs into Retrieval Augmented Generation systems to improve factual consistency and reasoning capabilities. We propose a novel architecture that jointly reasons over retrieved documents and knowledge graph subgraphs. Our approach enables more structured reasoning while maintaining the flexibility of text generation.

        Methodology: We constructed a benchmark dataset combining text retrieval with knowledge graph queries. Our system uses a dual-encoder architecture to process both textual and graph-structured inputs. Evaluation was performed on a modified version of the CommonsenseQA and WebQuestions datasets, focusing on questions requiring both factual retrieval and reasoning.

        Results: Knowledge graph integration improved factual accuracy by 14.7% compared to text-only RAG systems. Our approach was particularly effective for queries involving relationships between entities, showing a 23.2% improvement in such cases. Manual evaluation showed a 28% reduction in hallucinated facts when compared to standard RAG approaches.
        """,
        metadata={"source": "knowledge_graph_rag.pdf", "year": 2022}
    ),
    Document(
        page_content="""
        Title: Long-Context Retrieval Strategies for RAG Applications

        Authors: Carlos Diaz, Emily Wilson, Aisha Kumar

        Publication: ACL Workshop on Document Intelligence, 2023

        Abstract: As language models support increasingly large context windows, retrieval strategies must adapt to leverage this capability effectively. This paper explores techniques for retrieving and organizing information for long-context RAG applications. We introduce a hierarchical retrieval framework that combines local and global context to optimize information selection for extended documents.

        Methodology: We developed a benchmark suite of tasks requiring comprehension of long documents, including financial reports, legal contracts, and academic papers. Our hierarchical retriever combines BM25 for initial document selection, dense retrieval for passage ranking, and a novel context-aware reranker to optimize the final selection. Experiments used GPT-4 as the base language model with various context management strategies.

        Results: Our hierarchical approach improved answer accuracy by 18.3% compared to flat retrieval methods when dealing with documents exceeding 50 pages. The context-aware reranker demonstrated a 12.5% improvement in relevant information selection. Importantly, the system maintained consistent performance even as document length increased, showing only a 3% degradation when moving from 20-page to 100-page documents.
        """,
        metadata={"source": "long_context_retrieval.pdf", "year": 2023}
    )
]

print(f"Created {len(research_documents)} sample research paper documents")

**Basic Reference-Based Extraction with Pydantic**

In [None]:
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from typing import List, Optional

# Define the data structure we want to extract
class ResearchPaper(BaseModel):
    title: str = Field(description="The title of the research paper")
    authors: List[str] = Field(description="List of authors' names")
    publication_year: int = Field(description="Year the paper was published")
    abstract: str = Field(description="The paper's abstract")
    methods: Optional[List[str]] = Field(description="Research methods used", default=None)
    findings: Optional[List[str]] = Field(description="Key findings of the paper", default=None)

# Create a parser for this structure
parser = PydanticOutputParser(pydantic_object=ResearchPaper)

# Create an extraction template with reference examples
extraction_template = """
Extract structured information from the research paper below, following the format of the example.

EXAMPLE INPUT:
Title: Advances in Neural Information Processing Systems
Authors: John Smith, Jane Doe
Publication: Conference on Neural Information Processing Systems, 2022
Abstract: This paper presents a novel approach to neural network optimization that improves training efficiency by 30% while maintaining accuracy. We introduce a dynamic learning rate adjustment method that adapts based on gradient consistency across batches.
Methodology: We evaluated our approach on CIFAR-10 and ImageNet using ResNet architectures. Experiments were conducted using 4 NVIDIA A100 GPUs with batch sizes ranging from 32 to 256.
Results: Our method achieved 94.2% accuracy on CIFAR-10 and 76.8% top-1 accuracy on ImageNet, while reducing training time from 24 hours to 16.8 hours compared to baseline methods.

EXAMPLE OUTPUT:
{{
  "title": "Advances in Neural Information Processing Systems",
  "authors": ["John Smith", "Jane Doe"],
  "publication_year": 2022,
  "abstract": "This paper presents a novel approach to neural network optimization that improves training efficiency by 30% while maintaining accuracy. We introduce a dynamic learning rate adjustment method that adapts based on gradient consistency across batches.",
  "methods": ["Evaluated on CIFAR-10 and ImageNet", "Used ResNet architectures", "Trained on 4 NVIDIA A100 GPUs", "Batch sizes from 32 to 256"],
  "findings": ["94.2% accuracy on CIFAR-10", "76.8% top-1 accuracy on ImageNet", "Reduced training time from 24 hours to 16.8 hours"]
}}

INPUT PAPER:
{paper_text}

OUTPUT:
{format_instructions}
"""

# Create a prompt from the template
prompt = ChatPromptTemplate.from_template(
    template=extraction_template
).partial(format_instructions=parser.get_format_instructions())

# Create the extraction chain
extraction_chain = prompt | llm | parser

# Example execution
paper_text = research_documents[0].page_content

# Extract structured information
try:
    extracted_data = extraction_chain.invoke({"paper_text": paper_text})
    print(json.dumps(extracted_data.model_dump(), indent=2))
except Exception as e:
    print(f"Extraction failed: {e}")
    print("Attempting to handle the error and extract available information...")
    # You would implement fallback mechanisms here
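
The `except` branch above leaves the fallback open. One possible fallback, sketched here on the assumption that the model returned otherwise valid JSON wrapped in a Markdown fence or in surrounding prose, is to isolate the outermost JSON object before parsing again. The helper name `extract_json_object` is hypothetical and relies only on the standard library; it is not part of LangChain.

```python
import json
import re
from typing import Any, Dict, Optional

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this cell readable


def extract_json_object(raw_text: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object found in raw_text, or None if nothing parses."""
    # Prefer the contents of a fenced block when the reply contains one.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, raw_text, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw_text

    # Narrow the candidate to the span between the first "{" and the last "}".
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None
    # Only a JSON object (dict) is useful for schema validation.
    return parsed if isinstance(parsed, dict) else None


# Simulated model reply: prose followed by a fenced JSON block.
reply = (
    "Here is the result:\n"
    + FENCE + "json\n"
    + '{"title": "Demo", "publication_year": 2023}\n'
    + FENCE
)
recovered = extract_json_object(reply)
print(recovered)
```

The recovered dictionary could then be passed to the Pydantic model for validation, so schema errors still surface instead of being silently accepted.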

**Batch Extraction from Multiple Documents**

In [None]:
# Reuse the parser and extraction chain defined in the previous cell
# and run them over every sample paper instead of a single document
batch_results = []

for doc in research_documents:
    source = doc.metadata.get("source", "unknown")
    try:
        extracted = extraction_chain.invoke({"paper_text": doc.page_content})
        batch_results.append({"source": source, "data": extracted.model_dump(), "error": None})
    except Exception as e:
        # Record the failure and continue with the remaining documents
        batch_results.append({"source": source, "data": None, "error": str(e)})

succeeded = sum(1 for r in batch_results if r["error"] is None)
print(f"Extracted {succeeded} of {len(batch_results)} documents")
print(json.dumps(batch_results, indent=2))

**Extraction with Format Validation and Error Handling**

In [None]:
def validate_and_repair_extraction(extracted_text, parser, llm):
    """Validate extracted text and attempt to repair if invalid."""
    try:
        # Try to parse the extraction directly
        parsed_data = parser.parse(extracted_text)
        return parsed_data, True
    except Exception as initial_error:
        print(f"Initial parsing failed: {initial_error}")

        # Attempt to repair the output
        repair_template = """
        The following JSON is invalid or doesn't match the expected schema:

        {invalid_json}

        The expected schema is:
        {format_instructions}

        Please fix the JSON to match the schema exactly. Return ONLY the fixed JSON, nothing else.
        """

        repair_prompt = ChatPromptTemplate.from_template(repair_template)
        repair_chain = repair_prompt | llm

        try:
            repair_response = repair_chain.invoke({
                "invalid_json": extracted_text,
                "format_instructions": parser.get_format_instructions()
            })

            # Try to parse the repaired JSON
            repaired_text = repair_response.content
            parsed_data = parser.parse(repaired_text)
            print("Successfully repaired and parsed the extraction")
            return parsed_data, True
        except Exception as repair_error:
            print(f"Repair attempt failed: {repair_error}")
            return None, False

# Create a more robust extraction chain
def robust_extract(document, llm, parser):
    """Perform extraction with validation and repair attempts."""
    # First, get the raw extraction
    raw_extraction_template = """
    Extract structured information from the research paper below.
    Return the information in JSON format according to this specification:
    {format_instructions}

    PAPER:
    {paper_text}

    EXTRACTED JSON:
    """

    raw_prompt = ChatPromptTemplate.from_template(raw_extraction_template).partial(
        format_instructions=parser.get_format_instructions()
    )

    raw_extraction_chain = raw_prompt | llm

    # Get raw extraction
    raw_result = raw_extraction_chain.invoke({"paper_text": document.page_content})

    # Validate and potentially repair
    parsed_data, success = validate_and_repair_extraction(
        raw_result.content,
        parser,
        llm
    )

    if success:
        return parsed_data
    else:
        # Final fallback: try a simplified extraction
        print("Attempting simplified extraction...")
        simplified_model = create_simplified_model()
        simplified_parser = PydanticOutputParser(pydantic_object=simplified_model)

        simplified_prompt = ChatPromptTemplate.from_template(raw_extraction_template).partial(
            format_instructions=simplified_parser.get_format_instructions()
        )

        simplified_chain = simplified_prompt | llm
        simplified_result = simplified_chain.invoke({"paper_text": document.page_content})

        try:
            return simplified_parser.parse(simplified_result.content)
        except Exception as e:
            print(f"Even simplified extraction failed: {e}")
            return {"title": "Extraction Failed", "error": str(e)}

def create_simplified_model():
    """Create a simplified model with fewer required fields."""
    class SimplifiedPaper(BaseModel):
        title: str = Field(description="The title of the research paper")
        authors: Optional[List[str]] = Field(description="List of authors' names", default_factory=list)
        publication_year: Optional[int] = Field(description="Year the paper was published", default=None)
        abstract: Optional[str] = Field(description="The paper's abstract", default="")

    return SimplifiedPaper

# Test the robust extraction on a potentially challenging document
test_document = Document(
    page_content="""
    Title: Challenges in Multi-modal Learning for RAG

    Authors: Alex Thompson, Maria Rodriguez

    Abstract: This work explores the challenges in incorporating multiple modalities (text, images, audio) in RAG systems. We identify key bottlenecks and propose architectural modifications.
    """
)

robust_result = robust_extract(test_document, llm, parser)
print("\nRobust extraction result:")
print(json.dumps(robust_result.model_dump() if isinstance(robust_result, BaseModel) else robust_result, indent=2))
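
The control flow of `robust_extract` (strict parse, then repair, then a simplified schema) can be exercised without any API calls. The sketch below is a hypothetical stand-in: `first_success` and the three stub strategies are illustrative names, and the stubs only simulate how a model might behave on the test document.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Strategy = Tuple[str, Callable[[str], Dict[str, Any]]]
Outcome = Tuple[Optional[str], Optional[Dict[str, Any]], List[str]]


def first_success(text: str, strategies: List[Strategy]) -> Outcome:
    """Try each strategy in order; return (winning name, result, error log)."""
    errors: List[str] = []
    for name, strategy in strategies:
        try:
            return name, strategy(text), errors
        except Exception as exc:
            # Record the failure and fall through to the next strategy.
            errors.append(f"{name}: {exc}")
    return None, None, errors


def strict_parse(text: str) -> Dict[str, Any]:
    # Simulates the full schema rejecting a paper with no publication year.
    raise ValueError("publication_year is required")


def repair_parse(text: str) -> Dict[str, Any]:
    # Simulates a repair attempt that cannot invent the missing field.
    raise ValueError("repaired output still lacks publication_year")


def simplified_parse(text: str) -> Dict[str, Any]:
    # Simulates the reduced schema, where only the title is mandatory.
    first_line = text.strip().splitlines()[0]
    return {"title": first_line.split(":", 1)[1].strip()}


paper = "Title: Challenges in Multi-modal Learning for RAG\nAuthors: Alex Thompson, Maria Rodriguez"
winner, result, errors = first_success(
    paper,
    [("strict", strict_parse), ("repair", repair_parse), ("simplified", simplified_parse)],
)
print(winner, result)
print(errors)
```

Keeping the error log alongside the result makes it possible to report which tier produced the data, which matters when downstream code treats simplified records differently from full ones.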

**10.3.2 Handling Long-Form Content Extraction**

---

**Create Sample Long Document**

In [None]:
# Create a sample long financial report document
long_financial_report = """
ANNUAL FINANCIAL REPORT - ACME CORPORATION
Fiscal Year 2023

EXECUTIVE SUMMARY
-----------------
Acme Corporation achieved record revenue of $1.2 billion in fiscal year 2023, representing a 15% increase over the previous year. Net income rose to $180 million, up 22% year-over-year. The Board of Directors has approved a dividend of $2.50 per share, payable to shareholders of record as of March 15, 2024.

FINANCIAL HIGHLIGHTS
-------------------
- Revenue: $1.2 billion (+15% YoY)
- Gross Profit: $450 million (+18% YoY)
- Operating Income: $225 million (+20% YoY)
- Net Income: $180 million (+22% YoY)
- Earnings Per Share (EPS): $4.25 (+24% YoY)
- Return on Equity (ROE): 18.5% (up from 16.2%)
- Debt-to-Equity Ratio: 0.65 (improved from 0.72)

REVENUE BY SEGMENT
-----------------
1. Consumer Products: $480 million (+10% YoY)
   - North America: $320 million
   - Europe: $90 million
   - Asia-Pacific: $70 million

2. Enterprise Solutions: $350 million (+22% YoY)
   - Software Services: $210 million
   - Hardware Solutions: $140 million

3. Digital Services: $370 million (+18% YoY)
   - Cloud Services: $220 million
   - Consulting: $150 million

BALANCE SHEET SUMMARY
--------------------
Assets:
- Cash and Equivalents: $250 million
- Accounts Receivable: $180 million
- Inventory: $120 million
- Property and Equipment: $450 million
- Intangible Assets: $350 million
- Other Assets: $150 million
Total Assets: $1.5 billion (+12% YoY)

Liabilities:
- Short-term Debt: $80 million
- Accounts Payable: $110 million
- Accrued Expenses: $95 million
- Long-term Debt: $350 million
- Other Liabilities: $140 million
Total Liabilities: $775 million (+5% YoY)

Stockholders' Equity: $725 million (+20% YoY)

CASH FLOW SUMMARY
----------------
- Operating Cash Flow: $320 million (+25% YoY)
- Capital Expenditures: $150 million
- Free Cash Flow: $170 million (+32% YoY)
- Cash Dividends Paid: $85 million
- Share Repurchases: $50 million

OUTLOOK FOR 2024
---------------
The company projects revenue growth of 12-15% for fiscal year 2024, with expected revenue between $1.34-1.38 billion. EPS is projected to be in the range of $4.75-$5.00. The company plans to increase R&D spending by 20% to accelerate product development in artificial intelligence and sustainable technologies.

RISK FACTORS
-----------
1. Market Competition: Increasing competition in the digital services segment may pressure margins.
2. Supply Chain: Ongoing global supply chain challenges may impact inventory management.
3. Regulatory Environment: New data privacy regulations may require additional compliance investments.
4. Currency Fluctuations: Significant operations in international markets expose the company to currency risks.
5. Technological Disruption: Rapid technological changes require continued innovation to maintain market position.

MANAGEMENT DISCUSSION
-------------------
"Our strong performance in 2023 reflects the successful execution of our strategic initiatives," said John Smith, CEO. "We've made significant progress in expanding our digital services offerings while maintaining solid growth in our core consumer products. The investments we've made in automation and operational efficiency have yielded substantial margin improvements."

"Our balance sheet remains strong, giving us the flexibility to pursue strategic acquisitions while returning capital to shareholders," added Mary Johnson, CFO. "We've reduced our debt-to-equity ratio while increasing our dividend and share repurchase program."

AUDITOR'S STATEMENT
-----------------
The financial statements of Acme Corporation have been audited by Independent Accounting LLP, who expressed an unqualified opinion on these statements. The audit was conducted in accordance with generally accepted auditing standards.

For the complete audited financial statements, including notes and detailed disclosures, please refer to the attached appendix.
"""

long_legal_contract = """
COMMERCIAL LEASE AGREEMENT

THIS LEASE AGREEMENT (the "Agreement") is made and entered into on January 15, 2023, by and between PROPERTY HOLDINGS LLC, a Delaware limited liability company ("Landlord"), and TENANT CORPORATION, a Nevada corporation ("Tenant").

WITNESSETH:

WHEREAS, Landlord is the owner of certain real property located at 123 Business Avenue, Metropolis, USA, including a commercial building with approximately 25,000 square feet of leasable space (the "Building"); and

WHEREAS, Tenant desires to lease approximately 10,000 square feet of space within the Building, as more particularly described in Exhibit A attached hereto (the "Premises"), and Landlord desires to lease the Premises to Tenant upon the terms and conditions set forth herein;

NOW, THEREFORE, in consideration of the mutual covenants and agreements herein contained, Landlord and Tenant hereby agree as follows:

1. PREMISES AND TERM

1.1 Premises. Landlord hereby leases to Tenant, and Tenant hereby leases from Landlord, the Premises described in Exhibit A, together with the non-exclusive right to use common areas of the Building and the property on which the Building is located (the "Property").

1.2 Term. The term of this Lease shall be for a period of five (5) years (the "Initial Term"), commencing on March 1, 2023 (the "Commencement Date") and ending on February 28, 2028, unless sooner terminated or extended as provided herein.

1.3 Option to Extend. Provided Tenant is not in default under this Lease, Tenant shall have one (1) option to extend the term of this Lease for an additional period of five (5) years (the "Extension Term") upon the same terms and conditions as set forth in this Lease, except that the Base Rent during the Extension Term shall be as set forth in Section 2.2 below. Tenant shall exercise such option by giving Landlord written notice at least one hundred eighty (180) days prior to the expiration of the Initial Term.

2. RENT AND OTHER CHARGES

2.1 Base Rent. Tenant shall pay to Landlord as base rent for the Premises the sum of Twenty-Five Dollars ($25.00) per square foot per year, for a total annual rent of Two Hundred Fifty Thousand Dollars ($250,000.00), payable in equal monthly installments of Twenty Thousand Eight Hundred Thirty-Three Dollars and Thirty-Three Cents ($20,833.33) in advance on the first day of each calendar month during the term of this Lease (the "Base Rent").

2.2 Base Rent During Extension Term. If Tenant exercises its option to extend as provided in Section 1.3, the Base Rent during the Extension Term shall be adjusted to the then-current fair market rental value of the Premises, but in no event less than 103% of the Base Rent in effect during the final year of the Initial Term. The parties shall negotiate in good faith to determine the fair market rental value of the Premises.

2.3 Additional Rent. In addition to the Base Rent, Tenant shall pay as additional rent Tenant's Proportionate Share (as defined below) of Operating Expenses (as defined below) to the extent that such Operating Expenses exceed the Operating Expenses for calendar year 2023 (the "Base Year"). Tenant's "Proportionate Share" shall be forty percent (40%), which is the ratio that the rentable area of the Premises bears to the rentable area of the Building.

2.4 Operating Expenses. "Operating Expenses" shall mean all costs and expenses incurred by Landlord in the ownership, operation, management, and maintenance of the Building and the Property, including, but not limited to:

   (a) Real property taxes and assessments;
   (b) Premiums for property, liability, and other insurance carried by Landlord;
   (c) Utilities not separately metered to tenants;
   (d) Maintenance, repair, and replacement of Building systems, including HVAC, electrical, plumbing, and mechanical systems;
   (e) Maintenance and repair of the roof, foundation, exterior walls, and structural elements of the Building;
   (f) Maintenance of common areas, including lobbies, elevators, corridors, restrooms, parking areas, landscaping, and sidewalks;
   (g) Janitorial services for common areas;
   (h) Security services;
   (i) Property management fees, not to exceed four percent (4%) of the gross rental receipts for the Building;
   (j) Amortization of capital improvements made to reduce Operating Expenses or to comply with applicable laws enacted after the Commencement Date.
"""

# Split the documents into manageable chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

financial_chunks = text_splitter.create_documents([long_financial_report])
legal_chunks = text_splitter.create_documents([long_legal_contract])

print(f"Split financial report into {len(financial_chunks)} chunks")
print(f"Split legal contract into {len(legal_chunks)} chunks")
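
To make the effect of `chunk_size=1000` and `chunk_overlap=200` concrete, here is a deliberately simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph and line boundaries), and `sliding_window_chunks` is a hypothetical helper. It only illustrates that each window starts `chunk_size - chunk_overlap` characters after the previous one.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    stride = chunk_size - chunk_overlap  # distance between consecutive window starts
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += stride
    return chunks


# Synthetic 2,500-character text so the window boundaries are easy to check.
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample, chunk_size=1000, chunk_overlap=200)

print([len(c) for c in chunks])
print(chunks[0][-200:] == chunks[1][:200])
```

The overlap means any metric or clause that straddles a boundary appears whole in at least one chunk, at the cost of some duplicated extractions that a later consolidation step has to remove.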

**Chunk-and-Summarize Approach**

In [None]:
# Define the extraction template for each chunk
chunk_extract_template = """
Extract all financial metrics mentioned in the following text.
Include metric name, value, time period, and any comparison to previous periods.

TEXT:
{text}

EXTRACTED FINANCIAL METRICS:
"""

# Define a template for combining extracted metrics
combine_template = """
Below are financial metrics extracted from different sections of a document.
Combine them into a single comprehensive list, removing any duplicates.
If the same metric appears multiple times with different values, include all instances with their context.

EXTRACTED METRICS FROM ALL SECTIONS:
{text}

CONSOLIDATED FINANCIAL METRICS:
"""

# Create the prompts
chunk_extract_prompt = PromptTemplate.from_template(chunk_extract_template)
combine_prompt = PromptTemplate.from_template(combine_template)

# Create chains for processing chunks and combining results
chunk_extract_chain = LLMChain(llm=llm, prompt=chunk_extract_prompt)
combine_chain = LLMChain(llm=llm, prompt=combine_prompt)

# Create the document chains
reduce_chain = StuffDocumentsChain(
    llm_chain=combine_chain,
    document_variable_name="text"
)

map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=chunk_extract_chain,
    reduce_documents_chain=reduce_chain,
    document_variable_name="text",
    return_intermediate_steps=True
)

# Process the chunked financial report
result = map_reduce_chain.invoke({"input_documents": financial_chunks})

# Access the results
final_summary = result['output_text']
chunk_extractions = result['intermediate_steps']

print(f"Found {len(chunk_extractions)} extractions from individual chunks")
print("\nFinal consolidated metrics:")
print(final_summary)
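
The map and reduce stages above can be prototyped offline before spending tokens. In this sketch a regular expression stands in for the per-chunk LLM call and a dictionary merge stands in for the combine prompt. `map_chunk` and `reduce_metrics` are hypothetical names, and the pattern only covers the `- Name: $value` lines used in the sample report.

```python
import re
from typing import Dict, List, Tuple

# Matches bullet lines such as "- Revenue: $1.2 billion (+15% YoY)".
METRIC_LINE = re.compile(
    r"^- (?P<name>[^:\n]+): (?P<value>\$[\d.,]+(?: (?:million|billion))?)",
    re.MULTILINE,
)


def map_chunk(chunk_text: str) -> List[Tuple[str, str]]:
    """Map step: extract (metric name, value) pairs from a single chunk."""
    return [(m.group("name").strip(), m.group("value")) for m in METRIC_LINE.finditer(chunk_text)]


def reduce_metrics(per_chunk: List[List[Tuple[str, str]]]) -> Dict[str, List[str]]:
    """Reduce step: drop exact duplicates but keep conflicting values side by side."""
    merged: Dict[str, List[str]] = {}
    for metrics in per_chunk:
        for name, value in metrics:
            values = merged.setdefault(name, [])
            if value not in values:
                values.append(value)
    return merged


# Two overlapping chunks: "Net Income" appears in both, as it would with chunk_overlap.
chunk_a = "- Revenue: $1.2 billion (+15% YoY)\n- Net Income: $180 million (+22% YoY)\n"
chunk_b = "- Net Income: $180 million (+22% YoY)\n- Earnings Per Share (EPS): $4.25 (+24% YoY)\n"

consolidated = reduce_metrics([map_chunk(c) for c in (chunk_a, chunk_b)])
for name, values in consolidated.items():
    print(f"{name}: {values}")
```

Storing a list of values per metric follows the combine prompt's instruction to keep every instance when the same metric appears with different values.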

**Sequential Extraction with State Tracking**

In [None]:
# Create and execute a simplified sequential extraction
def simplified_sequential_extraction(chunks, initial_data=None):
    """Process document chunks sequentially, updating state as we go."""
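    # Start from a copy of the initial data (or empty state) so the caller's dict is not mutated
    initial_data = initial_data or {}
    current_data = {
        "entities": dict(initial_data.get("entities", {})),
        "events": list(initial_data.get("events", []))
    }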

    # Create a template for extracting from each chunk
    extraction_template = """
    Continue extracting information from this document chunk, based on what we've already extracted.

    Current extracted information:
    {current_data}

    New document chunk ({chunk_num} of {total_chunks}):
    {chunk_text}

    Please update the extraction with any new information found in this chunk.
    Format your response as:

    UPDATED ENTITIES:
    [key1]: [value1]
    [key2]: [value2]
    ...

    UPDATED EVENTS:
    - [event1]
    - [event2]
    ...
    """

    extraction_prompt = PromptTemplate.from_template(extraction_template)
    extraction_chain = LLMChain(llm=llm, prompt=extraction_prompt)

    # Process each chunk sequentially
    for i, chunk in enumerate(chunks, 1):
        print(f"Processing chunk {i}/{len(chunks)}")

        # Format current data for the prompt
        entities_str = "\n".join([f"{k}: {v}" for k, v in current_data["entities"].items()])
        events_str = "\n- " + "\n- ".join(current_data["events"]) if current_data["events"] else "None yet"

        formatted_data = f"ENTITIES:\n{entities_str}\n\nEVENTS:\n{events_str}"

        # Extract from this chunk
        response = extraction_chain.invoke({
            "current_data": formatted_data,
            "chunk_num": i,
            "total_chunks": len(chunks),
            "chunk_text": chunk.page_content
        })

        # Parse the response
        response_text = response['text']

        # Extract updated entities
        entities_match = re.search(r'UPDATED ENTITIES:(.*?)UPDATED EVENTS:', response_text, re.DOTALL)
        if entities_match:
            entities_text = entities_match.group(1).strip()
            # Parse line by line
            for line in entities_text.split('\n'):
                line = line.strip()
                if line and ':' in line:
                    parts = line.split(':', 1)
                    key = parts[0].strip()
                    value = parts[1].strip()
                    if key and value and value != "N/A":
                        current_data["entities"][key] = value

        # Extract updated events
        events_match = re.search(r'UPDATED EVENTS:(.*?)$', response_text, re.DOTALL)
        if events_match:
            events_text = events_match.group(1).strip()
            # Parse line by line
            for line in events_text.split('\n'):
                line = line.strip()
                if line.startswith('- '):
                    event = line[2:].strip()
                    if event and event not in current_data["events"] and event != "N/A":
                        current_data["events"].append(event)

    return current_data

# Initialize with some schema knowledge to help extraction
initial_legal_state = {
    "entities": {
        "landlord": "PROPERTY HOLDINGS LLC (Delaware limited liability company)",
        "tenant": "TENANT CORPORATION (Nevada corporation)",
        "premises": "unknown"
    },
    "events": []
}

# Run the sequential extraction on the legal contract chunks
print("Running sequential extraction with state tracking on legal document...")
legal_state = simplified_sequential_extraction(legal_chunks[:3], initial_legal_state)

print("\nFinal extracted state after processing legal document chunks:")
print("\nEntities:")
for key, value in legal_state["entities"].items():
    print(f"  {key}: {value}")

print("\nEvents:")
for event in legal_state["events"]:
    print(f"  - {event}")
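
Because the state update depends on the model following the `UPDATED ENTITIES` / `UPDATED EVENTS` layout, it helps to test the parsing step against a canned reply. The sketch below mirrors the regular expressions used in the loop above in a standalone function. `parse_state_update` and the sample reply are illustrative and do not come from a real model run.

```python
import re
from typing import Dict, List, Tuple


def parse_state_update(response_text: str) -> Tuple[Dict[str, str], List[str]]:
    """Parse the UPDATED ENTITIES / UPDATED EVENTS layout into (entities, events)."""
    entities: Dict[str, str] = {}
    events: List[str] = []

    # Entities are the "key: value" lines between the two section headers.
    entities_match = re.search(r"UPDATED ENTITIES:(.*?)UPDATED EVENTS:", response_text, re.DOTALL)
    if entities_match:
        for line in entities_match.group(1).strip().split("\n"):
            key, sep, value = line.strip().partition(":")
            key, value = key.strip(), value.strip()
            if sep and key and value and value != "N/A":
                entities[key] = value

    # Events are the "- item" lines after the second header.
    events_match = re.search(r"UPDATED EVENTS:(.*)$", response_text, re.DOTALL)
    if events_match:
        for line in events_match.group(1).strip().split("\n"):
            line = line.strip()
            if line.startswith("- "):
                event = line[2:].strip()
                if event and event != "N/A" and event not in events:
                    events.append(event)

    return entities, events


canned_reply = """UPDATED ENTITIES:
landlord: PROPERTY HOLDINGS LLC
premises: 10,000 square feet at 123 Business Avenue
renewal_option: N/A

UPDATED EVENTS:
- Agreement signed on January 15, 2023
- Lease commences on March 1, 2023
"""

entities, events = parse_state_update(canned_reply)
print(entities)
print(events)
```

A reply that omits either header yields empty results rather than an exception, so a caller may want to log such cases: silent empties are easy to mistake for chunks that contained nothing new.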

**Targeted Extraction with Section Routing**

In [None]:
def section_based_extraction(document, target_sections):
    """First identify document sections, then extract from relevant ones."""
    # Step 1: Identify section boundaries
    section_identification_template = """
    Identify the main sections in the following document.
    For each section, provide the section title and where it begins in the document.

    DOCUMENT PREVIEW (first 1000 chars):
    {document_preview}

    SECTIONS (title, start marker):
    """

    section_prompt = PromptTemplate.from_template(section_identification_template)
    section_chain = LLMChain(llm=llm, prompt=section_prompt)

    # Get section boundaries
    response = section_chain.invoke({
        "document_preview": document[:1000]
    })

    # Parse the identified sections
    sections = {}
    for line in response['text'].strip().split('\n'):
        line = line.strip()
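        if ':' in line:
            title, marker = line.split(':', 1)
            title, marker = title.strip().upper(), marker.strip()
            # Skip empty titles or markers; an empty marker would match at offset 0
            if title and marker:
                sections[title] = marker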

    # Add common section names if not found
    standard_sections = [
        "EXECUTIVE SUMMARY", "FINANCIAL HIGHLIGHTS", "REVENUE BY SEGMENT",
        "BALANCE SHEET", "CASH FLOW", "OUTLOOK", "RISK FACTORS"
    ]

    for section in standard_sections:
        if section not in sections:
            # Try to find it in the document
            if section in document:
                sections[section] = section

    # Step 2: Extract content from targeted sections
    extracted_data = {}

    for section_name, extraction_instructions in target_sections.items():
        section_start = None

        # Try to find the section
        for known_section, marker in sections.items():
            if section_name.upper() in known_section:
                section_start = document.find(marker)
                if section_start == -1:  # If marker not found, try section name
                    section_start = document.find(known_section)
                break

        if section_start is None or section_start == -1:
            # Try direct search with the section name
            section_start = document.find(section_name.upper())

        if section_start >= 0:
            # Find the next section or end of document
            next_section_start = len(document)
            for marker in sections.values():
                pos = document.find(marker, section_start + len(marker))
                if pos > section_start and pos < next_section_start:
                    next_section_start = pos

            # Extract this section content
            section_content = document[section_start:next_section_start].strip()

            # Create extraction prompt
            extraction_template = f"""
            Extract the following information from this section:
            {extraction_instructions}

            SECTION TEXT:
            {{section_text}}

            EXTRACTED INFORMATION:
            """

            extraction_prompt = PromptTemplate.from_template(extraction_template)
            extraction_chain = LLMChain(llm=llm, prompt=extraction_prompt)

            # Extract from this section
            result = extraction_chain.invoke({"section_text": section_content})
            extracted_data[section_name] = result['text']

    return extracted_data

# Define targeted sections for financial report
financial_targets = {
    "FINANCIAL HIGHLIGHTS": "Extract all financial metrics including revenue, profit, EPS, and growth percentages.",
    "REVENUE BY SEGMENT": "Extract revenue for each business segment and their growth rates.",
    "OUTLOOK": "Extract revenue projections, growth targets, and key strategic initiatives for next year."
}

# Try the section-based extraction on the financial report
print("\nRunning section-based extraction on financial report...")
financial_sections = section_based_extraction(long_financial_report, financial_targets)

print("\nExtracted information by section:")
for section, data in financial_sections.items():
    print(f"\n--- {section} ---")
    print(data)
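
The slicing logic in Step 2 (find a section start, then cut at the next known section) can be tried in isolation. This is a simplified sketch that assumes section headers appear verbatim in the document and skips the LLM-based identification step. `slice_sections` and the sample report are hypothetical.

```python
def slice_sections(document, headers):
    """Return {header: text} by cutting the document at each header's first occurrence."""
    # Locate each header; headers that do not appear are skipped
    positions = []
    for header in headers:
        start = document.find(header)
        if start != -1:
            positions.append((start, header))
    positions.sort()

    sections = {}
    for i, (start, header) in enumerate(positions):
        # A section runs until the next located header, or to the end of the document
        end = positions[i + 1][0] if i + 1 < len(positions) else len(document)
        sections[header] = document[start:end].strip()
    return sections

# Hypothetical report text for illustration
sample_report = """EXECUTIVE SUMMARY
Strong year overall.

FINANCIAL HIGHLIGHTS
Revenue: $10M

OUTLOOK
Growth expected to continue.
"""

parts = slice_sections(sample_report, ["FINANCIAL HIGHLIGHTS", "OUTLOOK", "RISK FACTORS"])
print(parts["FINANCIAL HIGHLIGHTS"])
```

Only the located sections are returned, so a missing header such as `RISK FACTORS` here simply produces no entry rather than an error.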

**10.3.3 Function-Free Extraction Methods**

---

**Template-Based Extraction with Output Parsing**

In [None]:
from langchain.output_parsers import StructuredOutputParser, ResponseSchema
from langchain.prompts import PromptTemplate

# Define the response schemas
response_schemas = [
    ResponseSchema(name="person_name", description="The full name of the person"),
    ResponseSchema(name="age", description="The age of the person as an integer"),
    ResponseSchema(name="occupation", description="The person's job or primary occupation"),
    ResponseSchema(name="location", description="Where the person currently lives"),
    ResponseSchema(name="interests", description="A list of the person's main interests or hobbies")
]

# Create a parser that can handle this structure
parser = StructuredOutputParser.from_response_schemas(response_schemas)

# Get the format instructions
format_instructions = parser.get_format_instructions()

# Create the extraction template
template = """
Extract the following information about the person described in the text.
Be precise and only extract information explicitly stated in the text.
If information is not present, output "unknown" for that field.

{format_instructions}

TEXT:
{text}

EXTRACTED INFORMATION:
"""

# Create the prompt
prompt = PromptTemplate(
    template=template,
    input_variables=["text"],
    partial_variables={"format_instructions": format_instructions}
)

# Create the extraction chain
extraction_chain = prompt | llm | parser

# Example text for extraction
text = """
Jane Smith is a 42-year-old software engineer living in Portland, Oregon.
She has been in the tech industry for over 15 years, specializing in backend systems.
In her free time, Jane enjoys hiking the trails of the Pacific Northwest, reading science fiction novels,
and experimenting with new recipes in her kitchen. She's also recently taken up pottery classes.
"""

# Perform the extraction
try:
    result = extraction_chain.invoke({"text": text})
    print("Structured output using template-based extraction:")
    print(json.dumps(result, indent=2))
except Exception as e:
    print(f"Extraction failed: {e}")
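
Values parsed from a template-based response often arrive as strings, even when the schema description asks for an integer or a list. The sketch below shows one possible post-processing step. The `coerce_person` helper and the sample record are hypothetical, and the record is a stand-in for parsed output, not a real model response.

```python
def coerce_person(record):
    """Normalize a parsed record: None for 'unknown', integer age, list-valued interests."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str) and value.strip().lower() == "unknown":
            cleaned[key] = None
        else:
            cleaned[key] = value

    # Keep only the digits so that values like "42 years" still convert
    age = cleaned.get("age")
    if isinstance(age, str):
        digits = "".join(ch for ch in age if ch.isdigit())
        cleaned["age"] = int(digits) if digits else None

    # Split a comma-separated string into a list of trimmed items
    interests = cleaned.get("interests")
    if isinstance(interests, str):
        cleaned["interests"] = [item.strip() for item in interests.split(",") if item.strip()]
    return cleaned

# Hypothetical parsed output
sample = {
    "person_name": "Jane Smith",
    "age": "42",
    "occupation": "software engineer",
    "location": "unknown",
    "interests": "hiking, reading science fiction, cooking, pottery",
}
normalized = coerce_person(sample)
print(normalized)
```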

**Delimiter-Based Extraction**

In [None]:
# Define a delimiter-based extraction template
delimiter = "####"

extraction_template = f"""
Extract key information from the text below.
Format your response as follows:

{delimiter} PERSON NAME
[The full name of the person]

{delimiter} AGE
[The age as a number]

{delimiter} OCCUPATION
[The person's job]

{delimiter} LOCATION
[Where the person lives]

{delimiter} INTERESTS
[A comma-separated list of the person's interests or hobbies]

TEXT:
{{text}}

EXTRACTED INFORMATION:
"""

prompt = PromptTemplate(
    template=extraction_template,
    input_variables=["text"]
)

# Create the extraction chain
raw_extraction_chain = prompt | llm

# Parse the delimited output
def parse_delimited_output(output):
    """Parse a response with delimiter-separated sections."""
    extracted_data = {}

    # Get the text content from the output (which could be an AIMessage)
    if hasattr(output, 'content'):  # Handle AIMessage object
        output_text = output.content
    elif isinstance(output, dict) and 'text' in output:  # Handle dictionary
        output_text = output['text']
    else:  # Try using the output directly
        output_text = str(output)

    # Split on the delimiter
    sections = output_text.split(delimiter)

    # Process each section
    for section in sections:
        section = section.strip()
        if not section:
            continue

        # The first line after delimiter is the field name
        lines = section.split('\n', 1)
        if len(lines) < 2:
            continue

        field_name = lines[0].strip().lower().replace(' ', '_')
        field_value = lines[1].strip()

        extracted_data[field_name] = field_value

    return extracted_data

# Combine the chain with the parser
def extract_with_delimiters(text):
    """Extract information using delimiter-based format."""
    raw_output = raw_extraction_chain.invoke({"text": text})
    parsed_output = parse_delimited_output(raw_output)
    return parsed_output

# Test extraction
result = extract_with_delimiters(text)
print("\nDelimiter-based extraction results:")
print(json.dumps(result, indent=2))
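
Because the model may omit a block, it can help to check the parsed result against the fields the template asked for. This is a minimal sketch, assuming the same `####` delimiter and field names as the template above. `parse_and_check` and the canned response are hypothetical.

```python
DELIMITER = "####"
EXPECTED_FIELDS = ["person_name", "age", "occupation", "location", "interests"]

def parse_and_check(output_text):
    """Parse '#### FIELD' blocks and report which expected fields are missing."""
    data = {}
    for block in output_text.split(DELIMITER):
        block = block.strip()
        if not block:
            continue
        # First line is the field name, the rest is the value
        name, _, value = block.partition("\n")
        if not value.strip():
            continue
        data[name.strip().lower().replace(" ", "_")] = value.strip()
    missing = [field for field in EXPECTED_FIELDS if field not in data]
    return data, missing

# Canned response that deliberately omits two fields
canned = """#### PERSON NAME
Jane Smith

#### AGE
42

#### OCCUPATION
software engineer
"""

data, missing = parse_and_check(canned)
print(data)
print("Missing:", missing)
```

A non-empty `missing` list could trigger a retry or a fallback value, depending on how strict the pipeline needs to be.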

**JSON Schema-Guided Extraction**

In [None]:
# Define a JSON schema to guide extraction
json_schema = {
    "type": "object",
    "properties": {
        "person_name": {
            "type": "string",
            "description": "The full name of the person"
        },
        "age": {
            "type": "integer",
            "description": "The age of the person in years"
        },
        "occupation": {
            "type": "string",
            "description": "The person's job or primary occupation"
        },
        "location": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "state": {"type": "string"},
                "country": {"type": "string"}
            },
            "description": "Where the person currently lives"
        },
        "interests": {
            "type": "array",
            "items": {"type": "string"},
            "description": "The person's hobbies or interests"
        },
        "years_of_experience": {
            "type": "integer",
            "description": "Years of professional experience"
        }
    },
    "required": ["person_name"]
}

# Create a JSON schema extraction template
json_extraction_template = """
Extract structured information about the person described in the text according to this JSON schema:

{schema}

If information for a field is not explicitly mentioned in the text, omit that field from the output.
All output should be valid JSON.

TEXT:
{text}

EXTRACTED JSON:
"""

json_prompt = PromptTemplate(
    template=json_extraction_template,
    input_variables=["text", "schema"]
)

# Format the schema as a readable string
formatted_schema = json.dumps(json_schema, indent=2)

# Create the extraction chain
json_extraction_chain = json_prompt | llm

# Handle parsing
def extract_with_json_schema(text):
    """Extract information using JSON schema guidance."""
    raw_output = json_extraction_chain.invoke({
        "text": text,
        "schema": formatted_schema
    })

    # Extract JSON from the response
    # Handle different output types (AIMessage or dictionary)
    if hasattr(raw_output, 'content'):  # Handle AIMessage object
        output_text = raw_output.content
    elif isinstance(raw_output, dict) and 'text' in raw_output:  # Handle dictionary
        output_text = raw_output['text']
    else:  # Try using the output directly
        output_text = str(raw_output)

    # Clean up the output to extract valid JSON
    output_text = output_text.strip()

    # Find JSON block if enclosed in triple backticks
    json_match = re.search(r'```(?:json)?\s*(.*?)\s*```', output_text, re.DOTALL)
    if json_match:
        json_str = json_match.group(1)
    else:
        # Try to extract anything that looks like JSON
        json_match = re.search(r'({.*})', output_text, re.DOTALL)
        if json_match:
            json_str = json_match.group(1)
        else:
            json_str = output_text

    # Try to parse the JSON
    try:
        return json.loads(json_str)
    except json.JSONDecodeError as e:
        return {"error": f"Failed to parse JSON: {str(e)}", "raw": json_str}

# Test extraction
result = extract_with_json_schema(text)
print("\nJSON schema-guided extraction results:")
print(json.dumps(result, indent=2))
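
Parsing the JSON successfully does not guarantee that it follows the schema. The sketch below checks required keys and top-level types only. It is a hand-rolled illustration, not a full JSON Schema validator, and a dedicated package such as `jsonschema` would be needed to cover nested objects and array items. `check_against_schema`, `TYPE_MAP`, and the sample data are hypothetical.

```python
# Maps JSON Schema type names to Python types (top-level checks only)
TYPE_MAP = {"string": str, "integer": int, "object": dict, "array": list}

def check_against_schema(record, schema):
    """Return a list of problems: missing required keys, unexpected keys, wrong top-level types."""
    problems = []
    properties = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in record:
            problems.append(f"missing required field: {key}")
    for key, value in record.items():
        if key not in properties:
            problems.append(f"unexpected field: {key}")
            continue
        type_name = properties[key].get("type")
        expected = TYPE_MAP.get(type_name)
        # bool is a subclass of int in Python, so reject it explicitly for integers
        is_bool_as_int = expected is int and isinstance(value, bool)
        if expected and (is_bool_as_int or not isinstance(value, expected)):
            problems.append(f"wrong type for {key}: expected {type_name}")
    return problems

# Reduced schema and a deliberately flawed record for illustration
small_schema = {
    "type": "object",
    "properties": {
        "person_name": {"type": "string"},
        "age": {"type": "integer"},
        "interests": {"type": "array"},
    },
    "required": ["person_name"],
}
bad_record = {"age": "42", "nickname": "JS"}

problems = check_against_schema(bad_record, small_schema)
print(problems)
```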

**Tabular Extraction with Markdown**

In [None]:
def extract_tabular_data(text, columns):
    """Extract data in tabular format using markdown tables."""
    # Create a column list for the prompt
    column_list = ", ".join(columns)

    table_extraction_template = f"""
    Extract information from the text into a markdown table.

    The table should have these columns: {column_list}

    Format the table properly with markdown syntax, including the header row and separator row.
    Only include information explicitly stated in the text. If information for a column is not available, use "N/A".

    TEXT:
    {{text}}

    MARKDOWN TABLE:
    """

    table_prompt = PromptTemplate(
        template=table_extraction_template,
        input_variables=["text"]
    )

    # Create the extraction chain
    table_extraction_chain = table_prompt | llm

    # Execute and get the table
    result = table_extraction_chain.invoke({"text": text})

    # Handle different output types
    if hasattr(result, 'content'):  # Handle AIMessage object
        result_text = result.content
    elif isinstance(result, dict) and 'text' in result:  # Handle dictionary
        result_text = result['text']
    else:  # Try using the result directly
        result_text = str(result)

    # Parse the markdown table into a list of dictionaries
    return parse_markdown_table(result_text)

def parse_markdown_table(markdown_table):
    """Parse a markdown table into a list of dictionaries."""
    # Clean and extract table content
    table_text = markdown_table.strip()

    # Split into lines
    lines = table_text.split('\n')

    # Need at least 3 lines for a valid table (header, separator, data)
    if len(lines) < 3:
        return []

    # Process the header row
    header_line = lines[0].strip()
    if header_line.startswith('|'):
        header_line = header_line[1:]
    if header_line.endswith('|'):
        header_line = header_line[:-1]

    headers = [h.strip() for h in header_line.split('|')]

    # Skip the separator line
    data_rows = []
    for line in lines[2:]:
        line = line.strip()
        if not line or '|' not in line:
            continue

        # Remove leading/trailing |
        if line.startswith('|'):
            line = line[1:]
        if line.endswith('|'):
            line = line[:-1]

        values = [v.strip() for v in line.split('|')]

        # Create a dictionary for this row
        if len(values) == len(headers):
            row_dict = {}
            for i, header in enumerate(headers):
                row_dict[header] = values[i]
            data_rows.append(row_dict)

    return data_rows

# Create a sample text with multiple people to extract
multi_person_text = """
The research team consists of several key members:

Dr. John Williams is a 45-year-old senior researcher specializing in artificial intelligence. Based in Boston, Massachusetts, he has published over 30 papers on machine learning. When not working, he enjoys playing chess and hiking.

Sarah Chen, 38, is an associate professor of computer science at Stanford University. Living in Palo Alto, California, she leads the natural language processing group. Her interests include playing violin and competitive swimming.

Marcus Johnson, a 29-year-old research assistant from Austin, Texas, recently joined the team. He has a background in statistics and data visualization. In his free time, he enjoys rock climbing and photography.

Emily Rodriguez is the team's technical writer. At 34, she has 10 years of experience translating complex research into accessible content. Based in Seattle, Washington, she enjoys gardening and painting landscapes.
"""

# Extract into a table format
table_columns = ["Name", "Age", "Occupation", "Location", "Interests"]
tabular_result = extract_tabular_data(multi_person_text, table_columns)

print("\nTabular extraction results:")
for row in tabular_result:
    print(json.dumps(row, indent=2))

# Display as a formatted table
print("\nFormatted as table:")
print(f"| {'Name':<15} | {'Age':<5} | {'Occupation':<25} | {'Location':<20} | {'Interests':<30} |")
print(f"| {'-'*15} | {'-'*5} | {'-'*25} | {'-'*20} | {'-'*30} |")
for row in tabular_result:
    name = row.get('Name', 'N/A')
    age = row.get('Age', 'N/A')
    occupation = row.get('Occupation', 'N/A')
    location = row.get('Location', 'N/A')
    interests = row.get('Interests', 'N/A')
    print(f"| {name:<15} | {age:<5} | {occupation:<25} | {location:<20} | {interests:<30} |")
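
`parse_markdown_table` treats the first line of the response as the header row, so a preamble sentence or a surrounding code fence would shift the rows. One possible guard is to isolate the table lines first. This sketch assumes every table row starts with `|`. `isolate_table_lines` and the noisy response are hypothetical.

```python
def isolate_table_lines(response_text):
    """Return the lines of the first contiguous markdown table in the response."""
    table_lines = []
    for line in response_text.split("\n"):
        stripped = line.strip()
        if stripped.startswith("|"):
            table_lines.append(stripped)
        elif table_lines:
            break  # The first table has ended
    return table_lines

# Build a noisy response with a preamble and a fenced table
FENCE = "`" * 3
noisy = "\n".join([
    "Here is the table:",
    "",
    FENCE + "markdown",
    "| Name | Age |",
    "| --- | --- |",
    "| Sarah Chen | 38 |",
    FENCE,
    "Let me know if you need more.",
])

table_lines = isolate_table_lines(noisy)
print("\n".join(table_lines))
```

Joining the returned lines with newlines gives a string that can be passed to `parse_markdown_table` unchanged.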

**Conclusion**

This notebook has demonstrated three key extraction patterns for RAG systems:

1. **Reference-Based Extraction**: Using examples to guide extraction with consistent formatting
2. **Long-Form Content Extraction**: Techniques for handling documents longer than the context window
3. **Function-Free Extraction Methods**: Reliable extraction approaches that don't require function calling
