# Evaluation Methods Demonstration

This notebook demonstrates the evaluation methods available in the IDP library for comparing expected values against actual extraction results. It covers:

1. All evaluation methods with both match and no-match scenarios
2. Threshold testing for applicable methods
3. Edge cases:
   - Attribute not found in actual results
   - Attribute not found in expected results
   - Attribute not found in either actual or expected results

In [1]:
# First uninstall existing package (to ensure we get the latest version)
%pip uninstall -y idp_common

# Install the IDP common package with all components in development mode
%pip install -q -e "../lib/idp_common_pkg[all]"

Found existing installation: idp_common 0.3.1
Uninstalling idp_common-0.3.1:
  Successfully uninstalled idp_common-0.3.1
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import necessary libraries
import sys
import os
import json
from typing import Dict, Any, List, Tuple, Optional
import logging

# Add parent directory to path to import the library
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

# Import IDP libraries
from idp_common.evaluation.models import EvaluationMethod
from idp_common.evaluation.comparator import compare_values
from idp_common.evaluation.service import EvaluationService
from idp_common.models import Document, Section, Status

print("Libraries imported successfully")

Libraries imported successfully


## Part 1: Comparing Individual Values with Different Methods

We'll test each evaluation method with matching and non-matching examples.

In [3]:
def test_comparison(method: EvaluationMethod, expected: Any, actual: Any, 
                    threshold: float = 0.8, document_class: str = "TestDoc",
                    attr_name: str = "test_attr", attr_description: str = "Test attribute"):
    """Test a comparison method and print results."""
    
    print(f"\n{'-'*60}")
    print(f"Method: {method.name}")
    print(f"Expected: {expected}")
    print(f"Actual: {actual}")
    
    if method in [EvaluationMethod.FUZZY, EvaluationMethod.BERT]:
        print(f"Threshold: {threshold}")
    
    # Set up LLM config for the LLM method
    llm_config = None
    if method == EvaluationMethod.LLM:
        llm_config = {
            "model": "us.amazon.nova-lite-v1:0",
            "temperature": 0.0,
            "top_k": 250,
            "system_prompt": "You are an evaluator that helps determine if the predicted and expected values match for document attribute extraction.",
            "task_prompt": """I need to evaluate attribute extraction for a document of class: {DOCUMENT_CLASS}.

For the attribute named "{ATTRIBUTE_NAME}" described as "{ATTRIBUTE_DESCRIPTION}":
- Expected value: {EXPECTED_VALUE}
- Actual value: {ACTUAL_VALUE}

Do these values match in meaning, taking into account formatting differences, word order, abbreviations, and semantic equivalence?
Provide your assessment as a JSON with three fields:
- "match": boolean (true if they match, false if not)
- "score": number between 0 and 1 representing the confidence/similarity score
- "reason": brief explanation of your decision

Respond ONLY with the JSON and nothing else."""
        }
    
    # Perform the comparison
    matched, score, reason = compare_values(
        expected=expected,
        actual=actual,
        method=method,
        threshold=threshold,
        document_class=document_class,
        attr_name=attr_name,
        attr_description=attr_description,
        llm_config=llm_config
    )
    
    print(f"Matched: {matched}")
    print(f"Score: {score}")
    if reason:
        print(f"Reason: {reason}")
        
    return matched, score, reason

### Test 1: EXACT Method
Testing exact string matching with a match case, a non-match case, and a case that differs only in casing and punctuation.

In [4]:
# EXACT method - Match
test_comparison(EvaluationMethod.EXACT, "Account #12345", "Account #12345")

# EXACT method - No match
test_comparison(EvaluationMethod.EXACT, "Account #12345", "Account #12346")

# EXACT method - Match with different casing and punctuation
test_comparison(EvaluationMethod.EXACT, "Account Number: 12345", "account number 12345")


------------------------------------------------------------
Method: EXACT
Expected: Account #12345
Actual: Account #12345
Matched: True
Score: 1.0

------------------------------------------------------------
Method: EXACT
Expected: Account #12345
Actual: Account #12346
Matched: False
Score: 0.0

------------------------------------------------------------
Method: EXACT
Expected: Account Number: 12345
Actual: account number 12345
Matched: True
Score: 1.0


(True, 1.0, None)
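
The third case above suggests that the comparison normalizes casing and punctuation before testing equality. A minimal sketch of that idea follows. `normalize_text` and `exact_match` are hypothetical helpers written for illustration, not the library's implementation, and the normalization rules actually applied by `compare_values` may differ.

```python
import string

# Hypothetical helpers for illustration only; not part of idp_common.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_text(value) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = str(value).lower().translate(_PUNCT_TABLE)
    return " ".join(text.split())


def exact_match(expected, actual) -> bool:
    """Compare two values after normalization."""
    return normalize_text(expected) == normalize_text(actual)


print(exact_match("Account Number: 12345", "account number 12345"))
print(exact_match("Account #12345", "Account #12346"))
```

Under these assumed rules the first pair compares equal and the second does not, in line with the results printed above.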

### Test 2: NUMERIC_EXACT Method
Testing numeric comparison with different formats.

In [5]:
# NUMERIC_EXACT method - Match
test_comparison(EvaluationMethod.NUMERIC_EXACT, "$1,250.00", 1250)

# NUMERIC_EXACT method - No match
test_comparison(EvaluationMethod.NUMERIC_EXACT, "$1,250.00", 1251)

# NUMERIC_EXACT method - Accounting-style negative "(1,250.00)" vs. "-1250" (reported as no match below)
test_comparison(EvaluationMethod.NUMERIC_EXACT, "(1,250.00)", "-1250")


------------------------------------------------------------
Method: NUMERIC_EXACT
Expected: $1,250.00
Actual: 1250
Matched: True
Score: 1.0

------------------------------------------------------------
Method: NUMERIC_EXACT
Expected: $1,250.00
Actual: 1251
Matched: False
Score: 0.0

------------------------------------------------------------
Method: NUMERIC_EXACT
Expected: (1,250.00)
Actual: -1250
Matched: False
Score: 0.0


(False, 0.0, None)
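
One plausible way to compare numbers across formats is to strip currency symbols and thousands separators, then compare the parsed values. The sketch below does this with the standard `decimal` module. `to_number` and `numeric_exact` are hypothetical names, and this is not necessarily how the library parses values. In this sketch the parenthesized form fails to parse and therefore does not match, which happens to agree with the third result above.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def to_number(value) -> Optional[Decimal]:
    """Strip '$', ',' and spaces, then parse as Decimal (hypothetical helper)."""
    cleaned = str(value).strip()
    for ch in "$, ":
        cleaned = cleaned.replace(ch, "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Unparseable input, e.g. "(1250.00)" or "None"
        return None


def numeric_exact(expected, actual) -> bool:
    """Match only when both values parse and are numerically equal."""
    a, b = to_number(expected), to_number(actual)
    return a is not None and b is not None and a == b


print(numeric_exact("$1,250.00", 1250))
print(numeric_exact("$1,250.00", 1251))
print(numeric_exact("(1,250.00)", "-1250"))
```

Supporting accounting-style negatives would need an extra rule that rewrites `(x)` as `-x` before parsing.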

### Test 3: FUZZY Method
Testing fuzzy comparison with different thresholds.

In [6]:
# FUZZY method - High match
test_comparison(EvaluationMethod.FUZZY, "John A. Smith", "John Smith", threshold=0.8)

# FUZZY method - Weaker match (score is also checked against a lower threshold of 0.6)
matched, score, _ = test_comparison(EvaluationMethod.FUZZY, "John A. Smith", "John Simpson", threshold=0.8)
print(f"With threshold=0.6: {score >= 0.6}")

# FUZZY method - Low match
test_comparison(EvaluationMethod.FUZZY, "John Alexander Smith", "Jane Marie Johnson", threshold=0.8)


------------------------------------------------------------
Method: FUZZY
Expected: John A. Smith
Actual: John Smith
Threshold: 0.8
Matched: True
Score: 0.8333333333333334

------------------------------------------------------------
Method: FUZZY
Expected: John A. Smith
Actual: John Simpson
Threshold: 0.8
Matched: False
Score: 0.41666666666666663
With threshold=0.6: False

------------------------------------------------------------
Method: FUZZY
Expected: John Alexander Smith
Actual: Jane Marie Johnson
Threshold: 0.8
Matched: False
Score: 0.15000000000000002


(False, 0.15000000000000002, None)
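
To see how a threshold turns a similarity score into a match decision, the sketch below uses `difflib.SequenceMatcher` from the standard library. This is a stand-in for whatever similarity measure the library uses, so its scores will generally differ from those printed above. `fuzzy_match` is a hypothetical helper.

```python
from difflib import SequenceMatcher
from typing import Tuple


def fuzzy_match(expected: str, actual: str, threshold: float = 0.8) -> Tuple[bool, float]:
    """Return (matched, score), where matched means score >= threshold."""
    score = SequenceMatcher(None, expected.lower(), actual.lower()).ratio()
    return score >= threshold, score


# The score is fixed for a given pair; only the decision changes with the threshold.
for t in (0.6, 0.8, 0.95):
    matched, score = fuzzy_match("John A. Smith", "John Smith", threshold=t)
    print(f"threshold={t}: matched={matched}, score={score:.3f}")
```

Raising the threshold makes the method stricter without changing the underlying score, which is why the notebook re-checks the same score against 0.6.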

### Test 4: HUNGARIAN Method
Testing list comparison using the Hungarian algorithm.

In [7]:
# HUNGARIAN method - Full match
expected_list = ["Deposit: $500", "Withdrawal: $150", "Transfer: $200"]
actual_list = ["Deposit: $500", "Transfer: $200", "Withdrawal: $150"]
test_comparison(EvaluationMethod.HUNGARIAN, expected_list, actual_list)

# HUNGARIAN method - Partial match
expected_list = ["Deposit: $500", "Withdrawal: $150", "Transfer: $200"]
actual_list = ["Deposit: $500", "Withdrawal: $150", "Transfer: $210"]
test_comparison(EvaluationMethod.HUNGARIAN, expected_list, actual_list)

# HUNGARIAN method - Different number of items
expected_list = ["Deposit: $500", "Withdrawal: $150", "Transfer: $200"]
actual_list = ["Deposit: $500", "Withdrawal: $150"]
test_comparison(EvaluationMethod.HUNGARIAN, expected_list, actual_list)

# HUNGARIAN method - Non-list values (should convert to list)
test_comparison(EvaluationMethod.HUNGARIAN, "Single item", "Single item")


------------------------------------------------------------
Method: HUNGARIAN
Expected: ['Deposit: $500', 'Withdrawal: $150', 'Transfer: $200']
Actual: ['Deposit: $500', 'Transfer: $200', 'Withdrawal: $150']
Matched: True
Score: 1.0

------------------------------------------------------------
Method: HUNGARIAN
Expected: ['Deposit: $500', 'Withdrawal: $150', 'Transfer: $200']
Actual: ['Deposit: $500', 'Withdrawal: $150', 'Transfer: $210']
Matched: False
Score: 0.6666666666666666

------------------------------------------------------------
Method: HUNGARIAN
Expected: ['Deposit: $500', 'Withdrawal: $150', 'Transfer: $200']
Actual: ['Deposit: $500', 'Withdrawal: $150']
Matched: True
Score: 1.0

------------------------------------------------------------
Method: HUNGARIAN
Expected: Single item
Actual: Single item
Matched: True
Score: 1.0


(True, 1.0, None)
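
The idea behind order-independent list comparison is to find the one-to-one pairing of items that maximizes the number of matches. The sketch below finds that pairing by brute force over permutations, which is only practical for short lists; the Hungarian algorithm solves the same assignment problem efficiently. Dividing by the length of the shorter list is an assumption chosen because it reproduces the scores shown above; the library's scoring rule is not confirmed here. `as_list` and `best_assignment_score` are hypothetical helpers.

```python
from itertools import permutations
from typing import Any, List


def as_list(value: Any) -> List[Any]:
    """Wrap scalars so single values can be compared like one-item lists."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


def best_assignment_score(expected: Any, actual: Any) -> float:
    """Fraction of matched pairs under the best one-to-one pairing (brute force)."""
    exp, act = as_list(expected), as_list(actual)
    if not exp and not act:
        return 1.0
    if not exp or not act:
        return 0.0
    short, long_ = (exp, act) if len(exp) <= len(act) else (act, exp)
    best = 0
    # Try every way of pairing the shorter list with items from the longer one.
    for perm in permutations(long_, len(short)):
        hits = sum(1 for a, b in zip(short, perm) if str(a).strip() == str(b).strip())
        best = max(best, hits)
    return best / len(short)


expected = ["Deposit: $500", "Withdrawal: $150", "Transfer: $200"]
print(best_assignment_score(expected, ["Deposit: $500", "Transfer: $200", "Withdrawal: $150"]))
print(best_assignment_score(expected, ["Deposit: $500", "Withdrawal: $150", "Transfer: $210"]))
```

Note that with this normalization a shorter actual list can still score 1.0, so missing items are not penalized.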

### Test 5: LLM Method
Testing semantic comparison using a Large Language Model.

In [8]:
# LLM method - High semantic match (different wording, same meaning)
test_comparison(
    EvaluationMethod.LLM,
    "Monthly statement showing deposits of $1,250, withdrawals of $850, ending balance of $2,400.",
    "Statement with deposits totaling $1,250 and withdrawals of $850, leaving a balance of $2,400.",
    document_class="BankStatement",
    attr_name="statement_summary",
    attr_description="Summary of the bank statement"
)

INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials



------------------------------------------------------------
Method: LLM
Expected: Monthly statement showing deposits of $1,250, withdrawals of $850, ending balance of $2,400.
Actual: Statement with deposits totaling $1,250 and withdrawals of $850, leaving a balance of $2,400.


INFO:idp_common.bedrock:Bedrock request attempt 1/8:
INFO:idp_common.bedrock:Response: {'ResponseMetadata': {'RequestId': '09bdb26a-b77e-431d-9f01-56bd4035b735', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Fri, 18 Apr 2025 11:59:06 GMT', 'content-type': 'application/json', 'content-length': '410', 'connection': 'keep-alive', 'x-amzn-requestid': '09bdb26a-b77e-431d-9f01-56bd4035b735'}, 'RetryAttempts': 0}, 'output': {'message': {'role': 'assistant', 'content': [{'text': '```json\n{\n  "match": true,\n  "score": 0.95,\n  "reason": "Both values convey the same information about deposits, withdrawals, and the resulting balance, despite slight differences in phrasing and structure."\n}\n```'}]}}, 'stopReason': 'end_turn', 'usage': {'inputTokens': 204, 'outputTokens': 51, 'totalTokens': 255}, 'metrics': {'latencyMs': 541}}
INFO:idp_common.evaluation.comparator:LLM evaluation for statement_summary (from code block): match=True, score=0.95, reason=Both values convey the same information ab

Matched: True
Score: 0.95
Reason: Both values convey the same information about deposits, withdrawals, and the resulting balance, despite slight differences in phrasing and structure.


(True,
 0.95,
 'Both values convey the same information about deposits, withdrawals, and the resulting balance, despite slight differences in phrasing and structure.')

In [9]:
# LLM method - No semantic match (different meaning)
test_comparison(
    EvaluationMethod.LLM,
    "Monthly statement showing deposits of $1,250, withdrawals of $850, ending balance of $2,400.",
    "Statement with deposits of $2,500 and withdrawals of $1,200, leaving a balance of $3,800.",
    document_class="BankStatement",
    attr_name="statement_summary",
    attr_description="Summary of the bank statement"
)

INFO:idp_common.bedrock:Bedrock request attempt 1/8:



------------------------------------------------------------
Method: LLM
Expected: Monthly statement showing deposits of $1,250, withdrawals of $850, ending balance of $2,400.
Actual: Statement with deposits of $2,500 and withdrawals of $1,200, leaving a balance of $3,800.


INFO:idp_common.bedrock:Response: {'ResponseMetadata': {'RequestId': 'b22235c4-1327-4164-8652-84e5d0800a30', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Fri, 18 Apr 2025 11:59:07 GMT', 'content-type': 'application/json', 'content-length': '429', 'connection': 'keep-alive', 'x-amzn-requestid': 'b22235c4-1327-4164-8652-84e5d0800a30'}, 'RetryAttempts': 0}, 'output': {'message': {'role': 'assistant', 'content': [{'text': '```json\n{\n  "match": false,\n  "score": 0.5,\n  "reason": "The actual value contains different amounts for deposits and withdrawals, resulting in a different ending balance, which does not semantically match the expected value."\n}\n```'}]}}, 'stopReason': 'end_turn', 'usage': {'inputTokens': 206, 'outputTokens': 53, 'totalTokens': 259}, 'metrics': {'latencyMs': 462}}
INFO:idp_common.evaluation.comparator:LLM evaluation for statement_summary (from code block): match=False, score=0.5, reason=The actual value contains different amounts for deposits and withdrawals, re

Matched: False
Score: 0.5
Reason: The actual value contains different amounts for deposits and withdrawals, resulting in a different ending balance, which does not semantically match the expected value.


(False,
 0.5,
 'The actual value contains different amounts for deposits and withdrawals, resulting in a different ending balance, which does not semantically match the expected value.')

In [10]:
# LLM method - Near match (end dates differ by one day: Jan 14 vs. Jan 15, 2024)
test_comparison(
    EvaluationMethod.LLM,
    "Policy effective date: January 15, 2023 to January 14, 2024",
    "Policy period begins on Jan 15, 2023 and expires on Jan 15, 2024",
    document_class="InsurancePolicy",
    attr_name="policy_period",
    attr_description="The dates during which the insurance policy is effective"
)

INFO:idp_common.bedrock:Bedrock request attempt 1/8:



------------------------------------------------------------
Method: LLM
Expected: Policy effective date: January 15, 2023 to January 14, 2024
Actual: Policy period begins on Jan 15, 2023 and expires on Jan 15, 2024


INFO:idp_common.bedrock:Response: {'ResponseMetadata': {'RequestId': '93e18ef0-cd9b-43d5-9604-775e6b62265b', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Fri, 18 Apr 2025 11:59:08 GMT', 'content-type': 'application/json', 'content-length': '429', 'connection': 'keep-alive', 'x-amzn-requestid': '93e18ef0-cd9b-43d5-9604-775e6b62265b'}, 'RetryAttempts': 0}, 'output': {'message': {'role': 'assistant', 'content': [{'text': '```json\n{\n  "match": true,\n  "score": 0.95,\n  "reason": "Both values convey the same period, with minor differences in wording and formatting. The start and end dates are identical, and the meaning is semantically equivalent."\n}\n```'}]}}, 'stopReason': 'end_turn', 'usage': {'inputTokens': 204, 'outputTokens': 57, 'totalTokens': 261}, 'metrics': {'latencyMs': 486}}
INFO:idp_common.evaluation.comparator:LLM evaluation for policy_period (from code block): match=True, score=0.95, reason=Both values convey the same period, with minor differences in wording and format

Matched: True
Score: 0.95
Reason: Both values convey the same period, with minor differences in wording and formatting. The start and end dates are identical, and the meaning is semantically equivalent.


(True,
 0.95,
 'Both values convey the same period, with minor differences in wording and formatting. The start and end dates are identical, and the meaning is semantically equivalent.')
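
In the logged responses the model wraps its JSON verdict in a fenced code block even though the prompt asks for JSON only, and the comparator log notes that the result was read "from code block". A minimal sketch of that kind of parsing is shown below. `parse_llm_verdict` is a hypothetical helper, not the library's parser, and it assumes the reply contains either one fenced block or bare JSON.

```python
import json
import re
from typing import Tuple

FENCE = "`" * 3  # built programmatically to avoid a literal fence in this example


def parse_llm_verdict(text: str) -> Tuple[bool, float, str]:
    """Extract (match, score, reason) from a fenced or bare JSON reply."""
    found = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", text, re.DOTALL)
    payload = found.group(1) if found else text
    data = json.loads(payload)
    return bool(data["match"]), float(data["score"]), str(data.get("reason", ""))


reply = FENCE + 'json\n{"match": true, "score": 0.95, "reason": "Same period."}\n' + FENCE
print(parse_llm_verdict(reply))
```

A production parser would also need to handle malformed JSON and missing fields, for example by falling back to a no-match result.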

## Part 2: Edge Cases - Missing Attributes

### Test 6: Attributes Not Found
Testing scenarios where attributes are missing.

In [11]:
# Case 1: Attribute not found in actual (expected exists, actual is None)
print("\nCase 1: Attribute not found in actual results")
for method in [EvaluationMethod.EXACT, EvaluationMethod.NUMERIC_EXACT, 
               EvaluationMethod.FUZZY, EvaluationMethod.LLM]:
    matched, score, reason = test_comparison(method, "This value exists", None)
    print(f"Method: {method.name} - Score: {score} - Matched: {matched}")

# Case 2: Attribute not found in expected (expected is None, actual exists)
print("\nCase 2: Attribute not found in expected results")
for method in [EvaluationMethod.EXACT, EvaluationMethod.NUMERIC_EXACT, 
               EvaluationMethod.FUZZY, EvaluationMethod.LLM]:
    matched, score, reason = test_comparison(method, None, "This value exists")
    print(f"Method: {method.name} - Score: {score} - Matched: {matched}")

# Case 3: Attribute not found in either (both are None)
print("\nCase 3: Attribute not found in either expected or actual results")
for method in [EvaluationMethod.EXACT, EvaluationMethod.NUMERIC_EXACT, 
               EvaluationMethod.FUZZY, EvaluationMethod.LLM]:
    matched, score, reason = test_comparison(method, None, None)
    print(f"Method: {method.name} - Score: {score} - Matched: {matched}")

# Case 4: Empty string values ("")
print("\nCase 4: Empty string values")
for method in [EvaluationMethod.EXACT, EvaluationMethod.FUZZY, EvaluationMethod.LLM]:
    matched, score, reason = test_comparison(method, "", "")
    print(f"Method: {method.name} - Score: {score} - Matched: {matched}")

INFO:idp_common.bedrock:Bedrock request attempt 1/8:



Case 1: Attribute not found in actual results

------------------------------------------------------------
Method: EXACT
Expected: This value exists
Actual: None
Matched: False
Score: 0.0
Method: EXACT - Score: 0.0 - Matched: False

------------------------------------------------------------
Method: NUMERIC_EXACT
Expected: This value exists
Actual: None
Matched: False
Score: 0.0
Method: NUMERIC_EXACT - Score: 0.0 - Matched: False

------------------------------------------------------------
Method: FUZZY
Expected: This value exists
Actual: None
Threshold: 0.8
Matched: False
Score: 0.0
Method: FUZZY - Score: 0.0 - Matched: False

------------------------------------------------------------
Method: LLM
Expected: This value exists
Actual: None


INFO:idp_common.bedrock:Response: {'ResponseMetadata': {'RequestId': 'd18799d8-0570-446b-be89-404ddd32db71', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Fri, 18 Apr 2025 11:59:09 GMT', 'content-type': 'application/json', 'content-length': '386', 'connection': 'keep-alive', 'x-amzn-requestid': 'd18799d8-0570-446b-be89-404ddd32db71'}, 'RetryAttempts': 0}, 'output': {'message': {'role': 'assistant', 'content': [{'text': '```json\n{\n  "match": false,\n  "score": 0.0,\n  "reason": "The expected value indicates that the attribute exists, while the actual value is None, meaning the attribute does not exist."\n}\n```'}]}}, 'stopReason': 'end_turn', 'usage': {'inputTokens': 149, 'outputTokens': 49, 'totalTokens': 198}, 'metrics': {'latencyMs': 379}}
INFO:idp_common.evaluation.comparator:LLM evaluation for test_attr (from code block): match=False, score=0.0, reason=The expected value indicates that the attribute exists, while the actual value is None, meaning the attribute does not exist.

Matched: False
Score: 0.0
Reason: The expected value indicates that the attribute exists, while the actual value is None, meaning the attribute does not exist.
Method: LLM - Score: 0.0 - Matched: False

Case 2: Attribute not found in expected results

------------------------------------------------------------
Method: EXACT
Expected: None
Actual: This value exists
Matched: False
Score: 0.0
Method: EXACT - Score: 0.0 - Matched: False

------------------------------------------------------------
Method: NUMERIC_EXACT
Expected: None
Actual: This value exists
Matched: False
Score: 0.0
Method: NUMERIC_EXACT - Score: 0.0 - Matched: False

------------------------------------------------------------
Method: FUZZY
Expected: None
Actual: This value exists
Threshold: 0.8
Matched: False
Score: 0.0
Method: FUZZY - Score: 0.0 - Matched: False

------------------------------------------------------------
Method: LLM
Expected: None
Actual: This value exists


INFO:idp_common.bedrock:Response: {'ResponseMetadata': {'RequestId': 'a241622b-e628-4ade-b064-de6753ba3e96', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Fri, 18 Apr 2025 11:59:11 GMT', 'content-type': 'application/json', 'content-length': '414', 'connection': 'keep-alive', 'x-amzn-requestid': 'a241622b-e628-4ade-b064-de6753ba3e96'}, 'RetryAttempts': 0}, 'output': {'message': {'role': 'assistant', 'content': [{'text': '```json\n{\n  "match": false,\n  "score": 0.1,\n  "reason": "The expected value is \'None\', indicating no value, while the actual value \'This value exists\' indicates a value is present. They do not match in meaning."\n}\n```'}]}}, 'stopReason': 'end_turn', 'usage': {'inputTokens': 149, 'outputTokens': 60, 'totalTokens': 209}, 'metrics': {'latencyMs': 481}}
INFO:idp_common.evaluation.comparator:LLM evaluation for test_attr (from code block): match=False, score=0.1, reason=The expected value is 'None', indicating no value, while the actual value 'This value exists' i

Matched: False
Score: 0.1
Reason: The expected value is 'None', indicating no value, while the actual value 'This value exists' indicates a value is present. They do not match in meaning.
Method: LLM - Score: 0.1 - Matched: False

Case 3: Attribute not found in either expected or actual results

------------------------------------------------------------
Method: EXACT
Expected: None
Actual: None
Matched: True
Score: 1.0
Method: EXACT - Score: 1.0 - Matched: True

------------------------------------------------------------
Method: NUMERIC_EXACT
Expected: None
Actual: None
Matched: True
Score: 1.0
Method: NUMERIC_EXACT - Score: 1.0 - Matched: True

------------------------------------------------------------
Method: FUZZY
Expected: None
Actual: None
Threshold: 0.8
Matched: True
Score: 1.0
Method: FUZZY - Score: 1.0 - Matched: True

------------------------------------------------------------
Method: LLM
Expected: None
Actual: None


INFO:idp_common.bedrock:Response: {'ResponseMetadata': {'RequestId': 'da6ff544-19d9-4844-8ab3-5d00c59565c2', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Fri, 18 Apr 2025 11:59:12 GMT', 'content-type': 'application/json', 'content-length': '346', 'connection': 'keep-alive', 'x-amzn-requestid': 'da6ff544-19d9-4844-8ab3-5d00c59565c2'}, 'RetryAttempts': 0}, 'output': {'message': {'role': 'assistant', 'content': [{'text': '```json\n{\n  "match": true,\n  "score": 1.0,\n  "reason": "Both the expected and actual values are \'None\', indicating a perfect match in meaning."\n}\n```'}]}}, 'stopReason': 'end_turn', 'usage': {'inputTokens': 147, 'outputTokens': 44, 'totalTokens': 191}, 'metrics': {'latencyMs': 666}}
INFO:idp_common.evaluation.comparator:LLM evaluation for test_attr (from code block): match=True, score=1.0, reason=Both the expected and actual values are 'None', indicating a perfect match in meaning.
INFO:idp_common.bedrock:Bedrock request attempt 1/8:


Matched: True
Score: 1.0
Reason: Both the expected and actual values are 'None', indicating a perfect match in meaning.
Method: LLM - Score: 1.0 - Matched: True

Case 4: Empty string values

------------------------------------------------------------
Method: EXACT
Expected: 
Actual: 
Matched: True
Score: 1.0
Method: EXACT - Score: 1.0 - Matched: True

------------------------------------------------------------
Method: FUZZY
Expected: 
Actual: 
Threshold: 0.8
Matched: True
Score: 1.0
Method: FUZZY - Score: 1.0 - Matched: True

------------------------------------------------------------
Method: LLM
Expected: 
Actual: 


INFO:idp_common.bedrock:Response: {'ResponseMetadata': {'RequestId': '25912a7c-55ab-4a52-9925-8a632f7b6cee', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Fri, 18 Apr 2025 11:59:13 GMT', 'content-type': 'application/json', 'content-length': '358', 'connection': 'keep-alive', 'x-amzn-requestid': '25912a7c-55ab-4a52-9925-8a632f7b6cee'}, 'RetryAttempts': 0}, 'output': {'message': {'role': 'assistant', 'content': [{'text': '```json\n{\n  "match": false,\n  "score": 0.5,\n  "reason": "The expected and actual values do not match in meaning due to significant differences in content."\n}\n```'}]}}, 'stopReason': 'end_turn', 'usage': {'inputTokens': 145, 'outputTokens': 43, 'totalTokens': 188}, 'metrics': {'latencyMs': 585}}
INFO:idp_common.evaluation.comparator:LLM evaluation for test_attr (from code block): match=False, score=0.5, reason=The expected and actual values do not match in meaning due to significant differences in content.


Matched: False
Score: 0.5
Reason: The expected and actual values do not match in meaning due to significant differences in content.
Method: LLM - Score: 0.5 - Matched: False
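
In the output above, the deterministic methods treat two missing values as a match and a one-sided missing value as a mismatch, while the LLM method is still invoked and scores two empty strings as a mismatch. One way to make this behavior uniform is to check for missing values before dispatching to any comparator. The sketch below illustrates the idea; `is_missing` and `compare_with_missing_guard` are hypothetical helpers, and treating empty strings as missing is an assumption.

```python
from typing import Any, Callable, Optional, Tuple

Result = Tuple[bool, float, Optional[str]]


def is_missing(value: Any) -> bool:
    """Treat None and blank strings as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def compare_with_missing_guard(
    expected: Any, actual: Any, compare: Callable[[Any, Any], Result]
) -> Result:
    """Resolve missing-value cases locally; delegate everything else."""
    if is_missing(expected) and is_missing(actual):
        return True, 1.0, "Both values are missing"
    if is_missing(expected) or is_missing(actual):
        return False, 0.0, "Value present on one side only"
    return compare(expected, actual)


# A trivial comparator standing in for any real method, including an LLM call.
def strict_equal(a: Any, b: Any) -> Result:
    return a == b, float(a == b), None


print(compare_with_missing_guard("", "", strict_equal))
print(compare_with_missing_guard("This value exists", None, strict_equal))
```

A guard like this also avoids a model call, and its cost and latency, when the outcome is already determined.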


## Part 3: Full Document Evaluation

Now we'll test the full document evaluation service with different evaluation methods. S3 access is mocked, but LLM-based evaluations call the real Amazon Bedrock service.

In [12]:
# Setup test config
test_config = {
    "classes": [
        {
            "name": "TestDocument",
            "attributes": [
                {
                    "name": "exact_match_attr",
                    "description": "Attribute for exact matching",
                    "evaluation_method": "EXACT"
                },
                {
                    "name": "numeric_attr",
                    "description": "Attribute for numeric matching",
                    "evaluation_method": "NUMERIC_EXACT"
                },
                {
                    "name": "fuzzy_attr",
                    "description": "Attribute for fuzzy matching",
                    "evaluation_method": "FUZZY",
                    "evaluation_threshold": 0.8
                },
                {
                    "name": "list_attr",
                    "description": "Attribute for list comparison",
                    "evaluation_method": "HUNGARIAN"
                },
                {
                    "name": "llm_attr",
                    "description": "Attribute for semantic comparison",
                    "evaluation_method": "LLM"
                },
                {
                    "name": "missing_in_actual",
                    "description": "Attribute missing in actual results",
                    "evaluation_method": "EXACT"
                },
                {
                    "name": "missing_in_expected",
                    "description": "Attribute missing in expected results",
                    "evaluation_method": "EXACT"
                },
                {
                    "name": "missing_everywhere",
                    "description": "Attribute missing in both expected and actual",
                    "evaluation_method": "EXACT"
                }
            ]
        }
    ],
    "evaluation": {
        "llm_method": {
            "model": "us.amazon.nova-lite-v1:0",
            "temperature": 0.0,
            "top_k": 250,
            "system_prompt": "You are an evaluator for document extraction attributes.",
            "task_prompt": """I need to evaluate attribute extraction for a document of class: {DOCUMENT_CLASS}.

For the attribute named "{ATTRIBUTE_NAME}" described as "{ATTRIBUTE_DESCRIPTION}":
- Expected value: {EXPECTED_VALUE}
- Actual value: {ACTUAL_VALUE}

Do these values match in meaning, taking into account formatting differences, word order, abbreviations, and semantic equivalence?
Provide your assessment as a JSON with three fields:
- "match": boolean (true if they match, false if not)
- "score": number between 0 and 1 representing the confidence/similarity score
- "reason": brief explanation of your decision

Respond ONLY with the JSON and nothing else."""
        }
    }
}
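
The `task_prompt` in this configuration contains placeholders such as `{DOCUMENT_CLASS}` and `{EXPECTED_VALUE}` that must be filled in for each attribute. The sketch below shows one simple way such a template could be rendered. `render_prompt` is a hypothetical helper; how the library performs the substitution is not shown in this notebook. Plain string replacement is used instead of `str.format` so that any other braces in a template, such as a JSON example, are left untouched.

```python
from typing import Any, Dict


def render_prompt(template: str, values: Dict[str, Any]) -> str:
    """Replace each {NAME} placeholder with its value (hypothetical helper)."""
    prompt = template
    for name, value in values.items():
        prompt = prompt.replace("{" + name + "}", str(value))
    return prompt


template = (
    'For the attribute named "{ATTRIBUTE_NAME}" in a {DOCUMENT_CLASS}:\n'
    "- Expected value: {EXPECTED_VALUE}\n"
    "- Actual value: {ACTUAL_VALUE}"
)
print(
    render_prompt(
        template,
        {
            "ATTRIBUTE_NAME": "numeric_attr",
            "DOCUMENT_CLASS": "TestDocument",
            "EXPECTED_VALUE": "$1,250.00",
            "ACTUAL_VALUE": 1250,
        },
    )
)
```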

In [13]:
# Create mock S3 retrieval function
def mock_s3_get_json(uri: str) -> Dict[str, Any]:
    """Mock S3 file retrieval."""
    if "expected" in uri:
        return {
            "exact_match_attr": "Exact Match Value",
            "numeric_attr": "$1,250.00",
            "fuzzy_attr": "John Alexander Smith",
            "list_attr": ["Item 1", "Item 2", "Item 3"],
            "llm_attr": "Monthly statement showing deposits of $1,250, withdrawals of $850, ending balance of $2,400.",
            "missing_in_actual": "This value exists in expected only",
            # missing_in_expected is intentionally omitted
            # missing_everywhere is intentionally omitted
        }
    else:  # actual results
        return {
            "exact_match_attr": "Exact Match Value",  # Exact match
            "numeric_attr": 1250,  # Numeric match
            "fuzzy_attr": "John A Smith",  # Fuzzy match
            "list_attr": ["Item 1", "Item 3", "Item 2"],  # List with different order
            "llm_attr": "Statement with deposits totaling $1,250 and withdrawals of $850, leaving a balance of $2,400.",  # Semantic match
            # missing_in_actual is intentionally omitted
            "missing_in_expected": "This value exists in actual only",
            # missing_everywhere is intentionally omitted
        }

# Set up mock storage - we'll still use this for S3
class MockS3:
    # Store report content for later display
    report_content = ""
    results_content = {}
    
    @staticmethod
    def get_json_content(uri: str) -> Dict[str, Any]:
        return mock_s3_get_json(uri)
    
    @staticmethod
    def write_content(content: Any, bucket: str, key: str, content_type: Optional[str] = None):
        print(f"Writing content to s3://{bucket}/{key}")
        if key.endswith("results.json"):
            # Store the results for later access
            MockS3.results_content = content
            print(f"Evaluation results summary: {json.dumps(content.get('overall_metrics', {}), indent=2)}")
        elif key.endswith("report.md"):
            # Store the markdown report for later display
            MockS3.report_content = content

In [14]:
# Create mock documents for evaluation
def create_test_document(doc_id: str, is_expected: bool = False) -> Document:
    """Create a test document with a section."""
    section = Section(
        section_id="sec-001",
        classification="TestDocument",
        extraction_result_uri=f"s3://test-bucket/{doc_id}/{'expected' if is_expected else 'actual'}/extraction.json"
    )
    
    doc = Document(
        id=doc_id,
        sections=[section],
        input_key=doc_id,
        input_bucket="test-bucket",
        output_bucket="test-bucket",
        status=Status.EXTRACTED
    )
    
    return doc

# Create test documents
actual_doc = create_test_document("test-doc-001")
expected_doc = create_test_document("test-doc-001-baseline", is_expected=True)

In [15]:
# Patch only the S3 module with the mock; Bedrock calls remain real
import idp_common.evaluation.service
idp_common.evaluation.service.s3 = MockS3

# Create evaluation service
evaluation_service = EvaluationService(region="us-east-1", config=test_config)

# Evaluate document
result_doc = evaluation_service.evaluate_document(actual_doc, expected_doc, store_results=True)

# Print results
if hasattr(result_doc, 'evaluation_result'):
    eval_result = result_doc.evaluation_result
    print(f"\nOverall metrics: {eval_result.overall_metrics}")
    
    # Check section results
    for section_result in eval_result.section_results:
        print(f"\nSection {section_result.section_id} - Class: {section_result.document_class}")
        print(f"Metrics: {section_result.metrics}")
        
        # Print attribute details
        print("\nAttribute Details:")
        print("-" * 100)
        print(f"{'Name':<20} {'Method':<15} {'Expected':<25} {'Actual':<25} {'Matched':<10} {'Score':<10} {'Reason'}")
        print("-" * 100)
        
        for attr in section_result.attributes:
            expected_val = str(attr.expected)[:25]
            actual_val = str(attr.actual)[:25]
            method = attr.evaluation_method
            reason = attr.reason[:50] + "..." if attr.reason and len(attr.reason) > 50 else (attr.reason or "")
            print(f"{attr.name:<20} {method:<15} {expected_val:<25} {actual_val:<25} {attr.matched!s:<10} {attr.score:<10.2f} {reason}")

INFO:idp_common.evaluation.service:Initialized evaluation service with LLM configuration
INFO:idp_common.bedrock:Bedrock request attempt 1/8:
INFO:idp_common.bedrock:Response: {'ResponseMetadata': {'RequestId': 'e65d6a81-687a-4557-9fa9-d59eab4c425f', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Fri, 18 Apr 2025 11:59:14 GMT', 'content-type': 'application/json', 'content-length': '414', 'connection': 'keep-alive', 'x-amzn-requestid': 'e65d6a81-687a-4557-9fa9-d59eab4c425f'}, 'RetryAttempts': 0}, 'output': {'message': {'role': 'assistant', 'content': [{'text': '```json\n{\n  "match": true,\n  "score": 0.95,\n  "reason": "Both values convey the same information regarding deposits, withdrawals, and the resulting balance, despite minor differences in phrasing and formatting."\n}\n```'}]}}, 'stopReason': 'end_turn', 'usage': {'inputTokens': 194, 'outputTokens': 51, 'totalTokens': 245}, 'metrics': {'latencyMs': 469}}
INFO:idp_common.evaluation.comparator:LLM evaluation for llm_attr (from co

Writing content to s3://test-bucket/test-doc-001/evaluation/results.json
Evaluation results summary: {
  "precision": 0.6666666666666666,
  "recall": 0.8,
  "f1_score": 0.7272727272727272,
  "accuracy": 0.625,
  "false_alarm_rate": 0.5,
  "false_discovery_rate": 0.2
}
Writing content to s3://test-bucket/test-doc-001/evaluation/report.md

Overall metrics: {'precision': 0.6666666666666666, 'recall': 0.8, 'f1_score': 0.7272727272727272, 'accuracy': 0.625, 'false_alarm_rate': 0.5, 'false_discovery_rate': 0.2}

Section sec-001 - Class: TestDocument
Metrics: {'precision': 0.6666666666666666, 'recall': 0.8, 'f1_score': 0.7272727272727272, 'accuracy': 0.625, 'false_alarm_rate': 0.5, 'false_discovery_rate': 0.2}

Attribute Details:
----------------------------------------------------------------------------------------------------
Name                 Method          Expected                  Actual                    Matched    Score      Reason
------------------------------------------------
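
The overall metrics printed above can be cross-checked by hand. The following is a minimal sketch, not the library's implementation: `classification_metrics` is a hypothetical helper, and the confusion counts (4 true positives, 2 false positives, 1 false negative, 1 true negative) are an assumption inferred from the eight attributes in this run, not values printed by the service. Under the standard definitions these counts reproduce, to rounding, the reported precision, recall, f1_score, and accuracy. The sketch does not attempt `false_alarm_rate` or `false_discovery_rate`, because the library's definitions of those two are not shown here.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute standard metrics from confusion-matrix counts (hypothetical helper)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1_score = 2 * precision * recall / denom if denom else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1_score": f1_score,
        "accuracy": accuracy,
    }


# Assumed counts: inferred for illustration, not reported by the evaluation service
metrics = classification_metrics(tp=4, fp=2, fn=1, tn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Guarding each division keeps the helper usable for sections where no attribute was expected or extracted.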

### Display the Evaluation Report

Let's display the Markdown evaluation report that the evaluation run generated:

In [16]:
from IPython.display import Markdown

# Display the markdown report
if MockS3.report_content:
    display(Markdown(MockS3.report_content))
else:
    print("No evaluation report was generated.")

# Document Evaluation: test-doc-001

## Summary
- **Match Rate**: 🟠 4/8 attributes matched [██████████░░░░░░░░░░] 50%
- **Precision**: 0.67 | **Recall**: 0.80 | **F1 Score**: 🟡 0.73

## Overall Metrics
| Metric | Value | Rating |
| ------ | :----: | :----: |
| precision | 0.6667 | 🟠 Fair |
| recall | 0.8000 | 🟡 Good |
| f1_score | 0.7273 | 🟡 Good |
| accuracy | 0.6250 | 🟠 Fair |
| false_alarm_rate | 0.5000 | 🟠 Fair |
| false_discovery_rate | 0.2000 | 🟡 Good |


## Section: sec-001 (TestDocument)
### Metrics
| Metric | Value | Rating |
| ------ | :----: | :----: |
| precision | 0.6667 | 🟠 Fair |
| recall | 0.8000 | 🟡 Good |
| f1_score | 0.7273 | 🟡 Good |
| accuracy | 0.6250 | 🟠 Fair |
| false_alarm_rate | 0.5000 | 🟠 Fair |
| false_discovery_rate | 0.2000 | 🟡 Good |


### Attributes
| Status | Attribute | Expected | Actual | Score | Method | Reason |
| :----: | --------- | -------- | ------ | ----- | ------ | ------ |
| ✅ | exact_match_attr | Exact Match Value | Exact Match Value | 1.00 | EXACT |  |
| ✅ | numeric_attr | $1,250.00 | 1250 | 1.00 | NUMERIC_EXACT |  |
| ❌ | fuzzy_attr | John Alexander Smith | John A Smith | 0.60 | FUZZY (evaluation_threshold: 0.8) |  |
| ✅ | list_attr | ['Item 1', 'Item 2', 'Item 3'] | ['Item 1', 'Item 3', 'Item 2'] | 1.00 | HUNGARIAN |  |
| ✅ | llm_attr | Monthly statement showing deposits of $1,250, with | Statement with deposits totaling $1,250 and withdr | 0.95 | LLM | Both values convey the same information regarding deposits, withdrawals, and the |
| ❌ | missing_in_actual | This value exists in expected only | None | 0.00 | EXACT |  |
| ❌ | missing_in_expected | None | This value exists in actual only | 0.00 | EXACT |  |
| ❌ | missing_everywhere | None | None | 1.00 | EXACT |  |


Execution time: 1.22 seconds

## Evaluation Methods Used

This evaluation used the following methods to compare expected and actual values:

1. **EXACT** - Exact string match after stripping punctuation and whitespace
2. **NUMERIC_EXACT** - Exact numeric match after normalizing
3. **FUZZY** - Fuzzy string matching using string similarity metrics (with optional evaluation_threshold)
4. **BERT** - Semantic similarity comparison using BERT embeddings (with evaluation_threshold)
5. **HUNGARIAN** - Bipartite matching algorithm for lists of values
6. **LLM** - Advanced semantic evaluation using Bedrock large language models

Each attribute is configured with a specific evaluation method based on the data type and comparison needs.
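
To make the first two method descriptions concrete, here is a rough sketch of what "stripping punctuation and whitespace" and "normalizing" a number could look like. This is an illustration only, not the code in `idp_common`: the function names `exact_match` and `numeric_exact_match` are hypothetical, and the exact normalization rules (removing every punctuation and whitespace character, dropping currency symbols and thousands separators) are assumptions.

```python
import re
import string
from decimal import Decimal, InvalidOperation
from typing import Any, Optional

# Translation table that deletes all ASCII punctuation and whitespace (assumed rule)
_STRIP_TABLE = str.maketrans("", "", string.punctuation + string.whitespace)


def exact_match(expected: Any, actual: Any) -> bool:
    """Compare two values as strings after removing punctuation and whitespace."""
    return str(expected).translate(_STRIP_TABLE) == str(actual).translate(_STRIP_TABLE)


def _to_number(value: Any) -> Optional[Decimal]:
    """Parse a value as a Decimal after dropping everything except digits, '.' and '-'."""
    cleaned = re.sub(r"[^0-9.\-]", "", str(value))
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None  # not parseable as a number


def numeric_exact_match(expected: Any, actual: Any) -> bool:
    """Return True only when both values parse to the same number."""
    e, a = _to_number(expected), _to_number(actual)
    return e is not None and a is not None and e == a


print(exact_match("Exact Match Value", "Exact Match Value"))
print(numeric_exact_match("$1,250.00", "1250"))
```

Using `Decimal` rather than `float` avoids binary rounding surprises when comparing currency amounts such as the `numeric_attr` row in the report.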

## Summary

This notebook has demonstrated:

1. Evaluation methods available in the IDP library:
   - EXACT - Exact string matching
   - NUMERIC_EXACT - Numeric value matching
   - FUZZY - Fuzzy string matching with adjustable thresholds
   - HUNGARIAN - List comparison using the Hungarian algorithm
   - LLM - Semantic comparison using Large Language Models

2. Handling of edge cases:
   - Attributes missing in actual results
   - Attributes missing in expected results
   - Attributes missing in both actual and expected results
   - Empty string values

3. Full document evaluation with mixed evaluation methods
   - Comprehensive metrics calculation
   - Detailed attribute-level results

4. Threshold sensitivity analysis for fuzzy matching
   - How different threshold values affect match results
   - Trade-offs between precision and recall
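
The threshold trade-off in the last point can be sketched without the library. The example below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity metric; it is not the metric the FUZZY method uses, so its score will differ from the 0.60 reported for `fuzzy_attr`. The helper names `similarity` and `sweep_thresholds` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def similarity(expected: str, actual: str) -> float:
    """Case-insensitive similarity in [0, 1] using difflib (stand-in metric)."""
    return SequenceMatcher(None, expected.lower(), actual.lower()).ratio()


def sweep_thresholds(
    expected: str, actual: str, thresholds: Sequence[float]
) -> List[Tuple[float, bool]]:
    """Return (threshold, matched) pairs; a pair matches when score >= threshold."""
    score = similarity(expected, actual)
    return [(t, score >= t) for t in thresholds]


expected, actual = "John Alexander Smith", "John A Smith"
print(f"similarity: {similarity(expected, actual):.2f}")
for threshold, matched in sweep_thresholds(expected, actual, [0.5, 0.6, 0.7, 0.8, 0.9]):
    print(f"threshold={threshold:.1f} matched={matched}")
```

A lower threshold accepts more near-misses as matches, which tends to raise recall but risks counting wrong extractions as correct; a higher threshold does the opposite.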