# Test Few-Shot Extraction Implementation

This notebook tests the new `{FEW_SHOT_EXAMPLES}` placeholder functionality in the Extraction service.

In [11]:
import sys
import os
import yaml
import json
from pathlib import Path

# Set ROOT_DIR so that example image paths resolve against the local checkout,
# or set CONFIGURATION_BUCKET to the name of the S3 configuration bucket that contains config_library
os.environ['ROOT_DIR'] = '../'

# Add the idp_common package to the path
sys.path.insert(0, '../lib/idp_common_pkg')

from idp_common.extraction.service import ExtractionService

## Load the Few-Shot Configuration

In [12]:
# Load the few-shot configuration
config_path = '../config_library/pattern-2/few_shot_example/config.yaml'
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

print("Configuration loaded successfully!")
print(f"Number of classes: {len(config.get('classes', []))}")
print(f"Extraction model: {config.get('extraction', {}).get('model')}")

Configuration loaded successfully!
Number of classes: 11
Extraction model: us.amazon.nova-pro-v1:0


## Examine the Task Prompt Template

In [13]:
# Look at the task prompt to see the FEW_SHOT_EXAMPLES placeholder
task_prompt = config['extraction']['task_prompt']
print("Task prompt template:")
print("=" * 50)
print(task_prompt)
print("=" * 50)

# Check if it contains the placeholder
has_placeholder = "{FEW_SHOT_EXAMPLES}" in task_prompt
print(f"\nContains {{FEW_SHOT_EXAMPLES}} placeholder: {has_placeholder}")

Task prompt template:
<background>
You are an expert in business document analysis and information extraction. 
You can understand and extract key information from business documents. 
<task>
Your task is to take the unstructured text provided and convert it into a
well-organized table format using JSON. Identify the main entities,
attributes, or categories mentioned in the attributes list below and use
them as keys in the JSON object. 
Then, extract the relevant information from the text and populate the
corresponding values in the JSON object. 
Guidelines:
Ensure that the data is accurately represented and properly formatted within the JSON structure
Include double quotes around all keys and values
Do not make up data - only extract information explicitly found in the document
Do not use /n for new lines, use a space instead
If a field is not found or if unsure, return null
All dates should be in MM/DD/YYYY format
Do not perform calculations or summations unless totals are explicitly
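
One way a `{FEW_SHOT_EXAMPLES}` placeholder can be turned into a multi-part message is to split the prompt at the placeholder and splice the example items in between. The sketch below is only an illustration of that idea; `split_on_placeholder` and the demo values are hypothetical and are not taken from `idp_common`.

```python
PLACEHOLDER = "{FEW_SHOT_EXAMPLES}"


def split_on_placeholder(prompt, examples):
    """Return a content list with `examples` spliced in where the placeholder sits."""
    if PLACEHOLDER not in prompt:
        # No placeholder: the whole prompt becomes a single text item.
        return [{"text": prompt}]
    before, after = prompt.split(PLACEHOLDER, 1)
    content = []
    if before.strip():
        content.append({"text": before})
    content.extend(examples)
    if after.strip():
        content.append({"text": after})
    return content


# Hypothetical prompt and example item, for illustration only.
demo_prompt = "Extract the attributes.\n{FEW_SHOT_EXAMPLES}\nDocument text follows."
demo_examples = [{"text": "expected attributes are: ..."}]
demo_content = split_on_placeholder(demo_prompt, demo_examples)
print(len(demo_content))
```

Under this scheme a prompt without the placeholder stays a single text item, so existing configurations would keep working unchanged.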

## Initialize Extraction Service

In [14]:
# Initialize the extraction service with the few-shot config
try:
    service = ExtractionService(
        config=config,
        region="us-east-1"  # Adjust to the region where you use Bedrock
    )
    print("Extraction service initialized successfully!")
except Exception as e:
    print(f"Error initializing service: {e}")
    print("Note: This is expected if AWS credentials are not configured for Bedrock")

Extraction service initialized successfully!


## Examine Class Examples Structure

In [15]:
# Let's examine the examples in the configuration
print("Examples found in configuration:")
print("=" * 50)

classes = config.get('classes', [])
total_examples = 0

for class_obj in classes:
    class_name = class_obj.get('name', 'Unknown')
    examples = class_obj.get('examples', [])
    
    print(f"\nClass: {class_name}")
    print(f"Number of examples: {len(examples)}")
    
    for i, example in enumerate(examples):
        print(f"  Example {i+1}:")
        print(f"    Name: {example.get('name', 'N/A')}")
        print(f"    Class Prompt: {example.get('classPrompt', 'N/A')}")
        attrs_prompt = example.get('attributesPrompt', 'N/A')
        print(f"    Attributes Prompt: {attrs_prompt[:100]}{'...' if len(attrs_prompt) > 100 else ''}")
        print(f"    Image Path: {example.get('imagePath', 'N/A')}")
        
        # Echo the configured image path; this cell does not check that the file exists
        image_path = example.get('imagePath')
        if image_path:
            print(f"    S3 URI: {image_path}")
        total_examples += 1

print(f"\nTotal examples across all classes: {total_examples}")
print(f"\nEnvironment variables:")
print(f"  CONFIGURATION_BUCKET: {os.environ.get('CONFIGURATION_BUCKET', 'Not set - using ROOT_DIR to resolve paths locally')}")
print(f"  ROOT_DIR: {os.environ.get('ROOT_DIR', 'Not set')}")

Examples found in configuration:

Class: letter
Number of examples: 2
  Example 1:
    Name: Letter1
    Class Prompt: This is an example of the class 'letter'
    Attributes Prompt: expected attributes are:
    "sender_name": "Will E. Clark",
    "sender_address": "206 Maple Street...
    Image Path: config_library/pattern-2/few_shot_example/example-images/letter1.jpg
    S3 URI: config_library/pattern-2/few_shot_example/example-images/letter1.jpg
  Example 2:
    Name: Letter2
    Class Prompt: This is an example of the class 'letter'
    Attributes Prompt: expected attributes are:
    "sender_name": "William H. W. Anderson",
    "sender_address": "P O. BO...
    Image Path: config_library/pattern-2/few_shot_example/example-images/letter2.png
    S3 URI: config_library/pattern-2/few_shot_example/example-images/letter2.png

Class: form
Number of examples: 0

Class: invoice
Number of examples: 0

Class: resume
Number of examples: 0

Class: scientific_publication
Number of examples: 0



## Test Few-Shot Examples Content Building for Specific Classes

In [16]:
# Test the _build_few_shot_examples_content method for different classes
print("Testing _build_few_shot_examples_content method for different classes...")

# Test for 'letter' class
print("\n=== LETTER CLASS ===")
try:
    letter_examples = service._build_few_shot_examples_content('letter')
    print(f"Generated {len(letter_examples)} content items for 'letter' class")
    
    for i, item in enumerate(letter_examples):
        print(f"\nItem {i+1}:")
        if 'text' in item:
            print(f"  Type: text")
            text_preview = item['text'][:200].replace('\n', '\\n')
            print(f"  Preview: {text_preview}{'...' if len(item['text']) > 200 else ''}")
        elif 'image' in item:
            print(f"  Type: image")
            print(f"  Format: {item['image'].get('format', 'unknown')}")
            if 'source' in item['image'] and 'bytes' in item['image']['source']:
                print(f"  Size: {len(item['image']['source']['bytes'])} bytes")
        else:
            print(f"  Type: unknown")
            print(f"  Keys: {list(item.keys())}")
            
except Exception as e:
    print(f"Error building content for 'letter' class: {e}")
    import traceback
    traceback.print_exc()

# Test for 'email' class
print("\n=== EMAIL CLASS ===")
try:
    email_examples = service._build_few_shot_examples_content('email')
    print(f"Generated {len(email_examples)} content items for 'email' class")
    
    for i, item in enumerate(email_examples):
        print(f"\nItem {i+1}:")
        if 'text' in item:
            print(f"  Type: text")
            text_preview = item['text'][:200].replace('\n', '\\n')
            print(f"  Preview: {text_preview}{'...' if len(item['text']) > 200 else ''}")
        elif 'image' in item:
            print(f"  Type: image")
            print(f"  Format: {item['image'].get('format', 'unknown')}")
            if 'source' in item['image'] and 'bytes' in item['image']['source']:
                print(f"  Size: {len(item['image']['source']['bytes'])} bytes")
        else:
            print(f"  Type: unknown")
            print(f"  Keys: {list(item.keys())}")
            
except Exception as e:
    print(f"Error building content for 'email' class: {e}")
    import traceback
    traceback.print_exc()

# Test for a class with no examples
print("\n=== FORM CLASS (no examples) ===")
try:
    form_examples = service._build_few_shot_examples_content('form')
    print(f"Generated {len(form_examples)} content items for 'form' class")
    print("This should be empty since 'form' class has no examples in the config")
            
except Exception as e:
    print(f"Error building content for 'form' class: {e}")
    import traceback
    traceback.print_exc()

Testing _build_few_shot_examples_content method for different classes...

=== LETTER CLASS ===
Generated 4 content items for 'letter' class

Item 1:
  Type: text
  Preview: expected attributes are:\n    "sender_name": "Will E. Clark",\n    "sender_address": "206 Maple Street P.O. Box 1056 Murray Kentucky 42071-1056",\n    "recipient_name": "The Honorable Wendell H. Ford",\n ...

Item 2:
  Type: image
  Format: jpeg
  Size: 106629 bytes

Item 3:
  Type: text
  Preview: expected attributes are:\n    "sender_name": "William H. W. Anderson",\n    "sender_address": "P O. BOX 12046 CAMERON VILLAGE STATION RALEIGH N. c 27605",\n    "recipient_name": "Mr. Addison Y. Yeaman",\n...

Item 4:
  Type: image
  Format: png
  Size: 83993 bytes

=== EMAIL CLASS ===
Generated 2 content items for 'email' class

Item 1:
  Type: text
  Preview: expected attributes are: \n   "from_address": "Kelahan, Ben",\n    "to_address": "TI New York: 'TI Minnesota",\n    "cc_address": "Ashley Bratich (MSMAIL)",\n    "b
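
The item-inspection loop in this cell is repeated for each class, so it could be factored into a small helper. The sketch below assumes the item shapes seen in the output above (a `text` key, or an `image` key holding `format` and `source.bytes`); `describe_content_item` is a hypothetical name, not part of the service.

```python
def describe_content_item(item, max_chars=200):
    """Summarise one content item as a short, single-line string."""
    if "text" in item:
        text = item["text"]
        # Escape newlines so the preview stays on one line.
        preview = text[:max_chars].replace("\n", "\\n")
        suffix = "..." if len(text) > max_chars else ""
        return f"text: {preview}{suffix}"
    if "image" in item:
        image = item["image"]
        size = len(image.get("source", {}).get("bytes", b""))
        return f"image: format={image.get('format', 'unknown')}, size={size} bytes"
    return f"unknown: keys={sorted(item)}"


# Hypothetical items, shaped like the ones printed by the cell above.
sample_items = [
    {"text": "expected attributes are:\n    ..."},
    {"image": {"format": "png", "source": {"bytes": b"\x89PNG"}}},
]
for summary in map(describe_content_item, sample_items):
    print(summary)
```

With such a helper, each class test reduces to one call to `_build_few_shot_examples_content` followed by a loop over the summaries.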

## Test Complete Content Building with Examples

In [17]:
# Test the complete content building with few-shot examples
print("Testing _build_content_with_few_shot_examples method...")

# Sample document text for testing
sample_document_text = "This is a sample letter document for testing extraction."
sample_class_label = "letter"

# Get attributes for letter class and format them
letter_attributes = service._get_class_attributes(sample_class_label)
attribute_descriptions = service._format_attribute_descriptions(letter_attributes)

print(f"Letter class has {len(letter_attributes)} attributes")
print(f"Attribute descriptions preview: {attribute_descriptions[:200]}...")

try:
    # Get extraction config
    extraction_config = config.get('extraction', {})
    task_prompt_template = extraction_config['task_prompt']
    
    # Build content with few-shot examples
    content = service._build_content_with_few_shot_examples(
        task_prompt_template=task_prompt_template,
        document_text=sample_document_text,
        class_label=sample_class_label,
        attribute_descriptions=attribute_descriptions
    )
    
    print(f"\nGenerated content array with {len(content)} items")
    print("\nContent structure:")
    
    for i, item in enumerate(content):
        print(f"\nItem {i+1}:")
        if 'text' in item:
            print(f"  Type: text")
            text_preview = item['text'][:300].replace('\n', '\\n')
            print(f"  Preview: {text_preview}{'...' if len(item['text']) > 300 else ''}")
        elif 'image' in item:
            print(f"  Type: image")
            print(f"  Format: {item['image'].get('format', 'unknown')}")
            if 'source' in item['image'] and 'bytes' in item['image']['source']:
                print(f"  Size: {len(item['image']['source']['bytes'])} bytes")
        else:
            print(f"  Type: unknown")
            print(f"  Keys: {list(item.keys())}")
            
except Exception as e:
    print(f"Error building content with few-shot examples: {e}")
    import traceback
    traceback.print_exc()

Testing _build_content_with_few_shot_examples method...
Letter class has 10 attributes
Attribute descriptions preview: sender_name  	[ The name of the person or entity who wrote or sent the letter. Look for text following or near terms like 'from', 'sender', 'authored by', 'written by', or at the end of the letter bef...

Generated content array with 6 items

Content structure:

Item 1:
  Type: text
  Preview: <background>\nYou are an expert in business document analysis and information extraction. \nYou can understand and extract key information from business documents. \n<task>\nYour task is to take the unstructured text provided and convert it into a\nwell-organized table format using JSON. Identify the mai...

Item 2:
  Type: text
  Preview: expected attributes are:\n    "sender_name": "Will E. Clark",\n    "sender_address": "206 Maple Street P.O. Box 1056 Murray Kentucky 42071-1056",\n    "recipient_name": "The Honorable Wendell H. Ford",\n    "recipient_address": "United States S

## Test Path Resolution Logic

In [18]:
# Test path resolution with different environment variables
print("Testing image path resolution logic:")
print("=" * 50)

# Test 1: Without ROOT_DIR or CONFIGURATION_BUCKET
print("\n1. WITHOUT ROOT_DIR or CONFIGURATION_BUCKET:")
print("-" * 50)

# Remove both variables if present; pop() with a default never raises
os.environ.pop('ROOT_DIR', None)
os.environ.pop('CONFIGURATION_BUCKET', None)

try:
    # Create a new service instance without ROOT_DIR
    test_service = ExtractionService(
        config=config,
        region="us-east-1"
    )
    
    examples_content = test_service._build_few_shot_examples_content('letter')
    print(f"Successfully built {len(examples_content)} content items using calculated path")
    
    # Count successful image loads
    image_items = [item for item in examples_content if 'image' in item]
    print(f"Loaded {len(image_items)} image items from local files")
    
except Exception as e:
    print(f"Error building examples without ROOT_DIR: {e}")
    print("This is normal - either ROOT_DIR or CONFIGURATION_BUCKET must be set, OR image paths must specify full S3 URI")


# Test 2: With CONFIGURATION_BUCKET
print("\n2. WITH CONFIGURATION_BUCKET environment variable:")
print("-" * 50)

# Set a test bucket name
os.environ['CONFIGURATION_BUCKET'] = 'test-config-bucket'

try:
    test_service = ExtractionService(
        config=config,
        region="us-east-1"
    )
    
    print(f"CONFIGURATION_BUCKET set to: {os.environ.get('CONFIGURATION_BUCKET')}")
    print("Note: This would attempt to load images from S3, which may fail without proper setup")
    
    # This will likely fail since the S3 bucket doesn't exist, but it shows the logic
    try:
        examples_content = test_service._build_few_shot_examples_content('letter')
        print(f"Successfully built {len(examples_content)} content items using S3")
    except Exception as e:
        print(f"Expected error when trying to access S3: {e}")
        print("This is normal - the logic correctly tries to use S3 when CONFIGURATION_BUCKET is set")

except Exception as e:
    print(f"Error with CONFIGURATION_BUCKET test: {e}")

# Restore the environment used by the rest of the notebook
os.environ.pop('CONFIGURATION_BUCKET', None)
os.environ['ROOT_DIR'] = '../'

Error reading binary content from s3://test-config-bucket/config_library/pattern-2/few_shot_example/example-images/letter1.jpg: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied


Testing image path resolution logic:

1. WITHOUT ROOT_DIR or CONFIGURATION_BUCKET:
--------------------------------------------------
Error building examples without ROOT_DIR: Failed to load example image from config_library/pattern-2/few_shot_example/example-images/letter1.jpg: No CONFIGURATION_BUCKET or ROOT_DIR set. Cannot read example image from local filesystem.
This is normal - either ROOT_DIR or CONFIGURATION_BUCKET must be set, OR image paths must specify full S3 URI

2. WITH CONFIGURATION_BUCKET environment variable:
--------------------------------------------------
CONFIGURATION_BUCKET set to: test-config-bucket
Note: This would attempt to load images from S3, which may fail without proper setup
Expected error when trying to access S3: Failed to load example image from config_library/pattern-2/few_shot_example/example-images/letter1.jpg: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied
This is normal - the logic correctly tries to use S3 w
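
The behaviour observed above can be summarised as a small resolution function. This is a sketch, not the service's code: `resolve_example_image` is a hypothetical name, and the precedence of `CONFIGURATION_BUCKET` over `ROOT_DIR` when both are set is an assumption, since the cell never sets both at once.

```python
import os


def resolve_example_image(image_path, env=None):
    """Map a configured imagePath to an S3 URI or a local file path."""
    env = os.environ if env is None else env
    if image_path.startswith("s3://"):
        # A full S3 URI needs no further resolution.
        return image_path
    bucket = env.get("CONFIGURATION_BUCKET")
    if bucket:
        # Assumed precedence: the bucket wins if both variables are set.
        return f"s3://{bucket}/{image_path}"
    root_dir = env.get("ROOT_DIR")
    if root_dir:
        return os.path.join(root_dir, image_path)
    raise ValueError("No CONFIGURATION_BUCKET or ROOT_DIR set.")


# Plain dicts stand in for os.environ so the demo has no side effects.
demo_path = "config_library/pattern-2/few_shot_example/example-images/letter1.jpg"
print(resolve_example_image(demo_path, {"CONFIGURATION_BUCKET": "test-config-bucket"}))
print(resolve_example_image(demo_path, {"ROOT_DIR": "../"}))
```

Passing the environment as a parameter also avoids mutating `os.environ` during the test, so nothing has to be restored afterwards.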

## Test Class-Specific Example Filtering

In [19]:
# Test that examples are properly filtered by class
print("Testing class-specific example filtering:")
print("=" * 50)

# Count examples per class in config
classes_with_examples = {}
for class_obj in config.get('classes', []):
    class_name = class_obj.get('name', 'Unknown')
    examples = class_obj.get('examples', [])
    classes_with_examples[class_name] = len(examples)

print("Examples per class in config:")
for class_name, count in classes_with_examples.items():
    print(f"  {class_name}: {count} examples")

print("\nTesting service filtering:")
for class_name, expected_count in classes_with_examples.items():
    if expected_count > 0:
        try:
            examples_content = service._build_few_shot_examples_content(class_name)
            # Each example should have attributesPrompt text + image, so 2 items per example
            expected_items = expected_count * 2  # attributesPrompt + image
            print(f"  {class_name}: Generated {len(examples_content)} items (expected ~{expected_items})")
        except Exception as e:
            print(f"  {class_name}: Error - {e}")
    else:
        examples_content = service._build_few_shot_examples_content(class_name)
        print(f"  {class_name}: Generated {len(examples_content)} items (expected 0)")

Testing class-specific example filtering:
Examples per class in config:
  letter: 2 examples
  form: 0 examples
  invoice: 0 examples
  resume: 0 examples
  scientific_publication: 0 examples
  memo: 0 examples
  advertisement: 0 examples
  email: 1 examples
  questionnaire: 0 examples
  specification: 0 examples
  generic: 0 examples

Testing service filtering:
  letter: Generated 4 items (expected ~4)
  form: Generated 0 items (expected 0)
  invoice: Generated 0 items (expected 0)
  resume: Generated 0 items (expected 0)
  scientific_publication: Generated 0 items (expected 0)
  memo: Generated 0 items (expected 0)
  advertisement: Generated 0 items (expected 0)
  email: Generated 2 items (expected ~2)
  questionnaire: Generated 0 items (expected 0)
  specification: Generated 0 items (expected 0)
  generic: Generated 0 items (expected 0)
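
The printed comparison above could also be expressed as a check that fails loudly on a mismatch. The sketch below assumes two content items per example (the `attributesPrompt` text plus the image), as in the cell above; `find_count_mismatches`, `toy_config`, and `toy_builder` are hypothetical stand-ins so that it runs without AWS access.

```python
def find_count_mismatches(config, build_items, items_per_example=2):
    """Return {class_name: (actual, expected)} for classes whose item count is off."""
    mismatches = {}
    for class_obj in config.get("classes", []):
        name = class_obj.get("name", "Unknown")
        expected = len(class_obj.get("examples", [])) * items_per_example
        actual = len(build_items(name))
        if actual != expected:
            mismatches[name] = (actual, expected)
    return mismatches


# Hypothetical config and builder; a real run would pass the loaded config
# and service._build_few_shot_examples_content instead.
toy_config = {"classes": [{"name": "letter", "examples": [{}, {}]}, {"name": "form"}]}


def toy_builder(class_name):
    pairs = 2 if class_name == "letter" else 0
    return [{"text": "..."}, {"image": {}}] * pairs


toy_mismatches = find_count_mismatches(toy_config, toy_builder)
print(toy_mismatches)
```

An empty result means every class produced the expected number of items, which can be enforced with `assert not toy_mismatches`.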


## Comparison with Classification Service

In [20]:
# Compare the extraction service behavior with classification service
print("Comparison with Classification Service:")
print("=" * 50)

print("Key differences:")
print("1. Classification uses 'classPrompt' from examples")
print("2. Extraction uses 'attributesPrompt' from examples")
print("3. Classification gets examples from ALL classes")
print("4. Extraction gets examples only from the SPECIFIC class being extracted")

# Show the specific prompts used
letter_class = next((c for c in config['classes'] if c.get('name') == 'letter'), {})
letter_examples = letter_class.get('examples', [])

if letter_examples:
    example = letter_examples[0]
    print(f"\nExample from 'letter' class:")
    print(f"  classPrompt (used by Classification): {example.get('classPrompt', 'N/A')}")
    print(f"  attributesPrompt (used by Extraction): {example.get('attributesPrompt', 'N/A')[:100]}...")

print("\nThis ensures that:")
print("- Classification sees examples of different document types to learn classification")
print("- Extraction sees examples of the same document type to learn attribute extraction")

Comparison with Classification Service:
Key differences:
1. Classification uses 'classPrompt' from examples
2. Extraction uses 'attributesPrompt' from examples
3. Classification gets examples from ALL classes
4. Extraction gets examples only from the SPECIFIC class being extracted

Example from 'letter' class:
  classPrompt (used by Classification): This is an example of the class 'letter'
  attributesPrompt (used by Extraction): expected attributes are:
    "sender_name": "Will E. Clark",
    "sender_address": "206 Maple Street...

This ensures that:
- Classification sees examples of different document types to learn classification
- Extraction sees examples of the same document type to learn attribute extraction
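
The two selection rules listed above can be sketched against a config-shaped dictionary. The functions and the toy data below are hypothetical and only mirror the field names used in this notebook (`classes`, `examples`, `classPrompt`, `attributesPrompt`); they are not the services' actual code.

```python
def classification_examples(config):
    """Collect (class_name, classPrompt) pairs from every class."""
    return [
        (cls.get("name"), ex.get("classPrompt"))
        for cls in config.get("classes", [])
        for ex in cls.get("examples", [])
    ]


def extraction_examples(config, class_label):
    """Collect attributesPrompt strings from the one class being extracted."""
    return [
        ex.get("attributesPrompt")
        for cls in config.get("classes", [])
        if cls.get("name") == class_label
        for ex in cls.get("examples", [])
    ]


# Toy config with placeholder prompts, for illustration only.
toy = {
    "classes": [
        {"name": "letter", "examples": [
            {"classPrompt": "letter example 1", "attributesPrompt": "letter attrs 1"},
            {"classPrompt": "letter example 2", "attributesPrompt": "letter attrs 2"},
        ]},
        {"name": "email", "examples": [
            {"classPrompt": "email example", "attributesPrompt": "email attrs"},
        ]},
        {"name": "form"},
    ]
}
print(len(classification_examples(toy)))
print(extraction_examples(toy, "letter"))
print(extraction_examples(toy, "form"))
```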
