<a href="https://colab.research.google.com/github/ArunMunagala7/Zenskar-AI-Intern-Assignment/blob/main/Zenskar_Assignment_Final_Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Downloading/Importing necessary packages and libraries

In [1]:
!pip install --upgrade google-cloud-aiplatform
!pip install pymupdf
!pip install pypdf2
!pip install requests
!pip install llama-index
!gcloud auth application-default login

Collecting pymupdf
  Downloading pymupdf-1.25.1-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.1-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.25.1
Collecting llama-index
  Downloading llama_index-0.12.8-py3-none-any.whl.metadata (11 kB)
Collecting llama-index-agent-openai<0.5.0,>=0.4.0 (from llama-index)
  Downloading llama_index_agent_openai-0.4.1-py3-none-any.whl.metadata (726 bytes)
Collecting llama-index-cli<0.5.0,>=0.4.0 (from llama-index)
  Downloading llama_index_cli-0.4.0-py3-none-any.whl.metadata (1.5 kB)
Collecting llama-index-core<0.13.0,>=0.12.8 (from llama-index)
  Downloading llama_index_core-0.12.8-py3-none-any.whl.metadata (2.5 kB)
Collecting llama-index-embeddings-openai<0.4.0,>=0.3.0 (from llama-index

Importing necessary libraries for text extraction



In [2]:
import base64
import vertexai
from vertexai.generative_models import GenerativeModel, Part, FinishReason
import vertexai.preview.generative_models as generative_models

# **1. Using PyPDF2 to extract the contents of the contract PDF**

This code extracts and organizes data from a legal contract PDF using Vertex AI's Gemini model. It initializes Vertex AI, extracts raw text from the PDF using `PyPDF2`, and uses a detailed prompt to guide the Gemini model in generating a structured, readable output with headings, subheadings, and field-value pairs. The result is a well-organized representation of the contract for easy analysis.
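One detail of the text-extraction step is how per-page text gets combined. The sketch below is a minimal illustration, not the notebook's own code: `FakePage` is a hypothetical stand-in for a PDF page object so the example runs without a PDF file, and `join_page_text` is an illustrative helper name. It assumes a page's `extract_text()` may return an empty or missing value (for example on scanned pages) and joins pages with newlines so words at page boundaries do not run together.

```python
class FakePage:
    """Hypothetical stand-in for a PDF page; it only mimics extract_text()."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


def join_page_text(pages):
    """Join the text of each page with newlines, skipping pages with no text."""
    chunks = []
    for page in pages:
        page_text = page.extract_text() or ""  # treat a missing value as empty
        if page_text.strip():
            chunks.append(page_text.strip())
    return "\n".join(chunks)


pages = [FakePage("Order Form"), FakePage(None), FakePage("  Payment Terms: Net 30 ")]
combined = join_page_text(pages)
print(combined)
```

Because the helper relies only on an `extract_text()` method, the same function should accept the pages of a real reader object in place of the stand-ins.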

In [3]:
import vertexai
from vertexai.generative_models import GenerativeModel
from PyPDF2 import PdfReader

# Initialize Vertex AI
vertexai.init(project="zenskar-assignment-445217", location="us-central1")

# Load the Gemini model
model = GenerativeModel("gemini-1.5-flash")

# System prompt for extraction
prompt = """
You are a highly intelligent AI assistant specialized in processing and extracting structured data from legal contract documents.

You are given the contents of a Legal Contract Document. Your task is to extract and organize all relevant data, fields, and values in a structured, readable, and well-organized format. Follow the instructions below carefully.

Extract all information:
- Extract every field present in the document along with its value.
- Ensure that even unlabeled or ambiguous information is extracted and included. Provide a label such as "Unlabeled Field [N]" if a specific field name is not present, and attempt to infer its purpose or relation to other fields.

Infer and understand fields:
- For any value that does not have a clear label or field, use contextual understanding to infer its likely purpose or the field it may belong to.
- Provide reasoning or justification for your inference. For example: "Inferred as 'Contract Start Date' based on surrounding text mentioning 'Effective Date'.".
- Use contextual clues such as surrounding text, formatting, proximity to labeled fields, or patterns in the document to identify relationships.

Organize data clearly:
- Group related fields under appropriate headings and subheadings wherever necessary to improve clarity and organization.
- For ambiguous fields or additional information, include a section titled "Unlabeled or Miscellaneous Information," where you can list this data systematically.

Handle tabular data:
- Clearly identify and describe any tables found in the document.
- Extract the table structure and contents row by row.
- Use appropriate labels for columns and rows to maintain data integrity and readability.

Standardized formats:
- For dates, amounts, or units, use standardized formats (e.g., YYYY-MM-DD for dates, $X,XXX.XX for currency amounts) wherever possible.

Contextual reasoning:
- If a field or value appears ambiguous, provide context or reasoning for its inclusion (e.g., "Field inferred from surrounding text.").
- For unlabeled data, indicate where it was located (e.g., "Found near 'Section 5' in the document.").

Ensure all content is captured:
- Even if information does not directly match a predefined field, include it in the output under appropriate headings or subheadings to ensure no data is lost.
- Mark any missing data explicitly with "Data Missing" or "Not Found" if applicable.

Readable output:
- Each field and value should appear on a new line.
- Use consistent indentation and spacing for better readability.
- Example format:
  - Contract ID: ABC-123
  - Customer Name: John Doe
  - Inferred Field: "Applicable Jurisdiction" - Value: California (Inferred based on the clause mentioning governing law).

Output format:
- Extracted data should be structured hierarchically with clear labels and consistent formatting.
- Include unlabeled data under "Unlabeled or Miscellaneous Information" with inferred labels wherever possible.
- Include explanations or reasoning for inferred fields under each entry.

Your goal is to ensure that all information in the document is extracted and organized systematically. Use contextual understanding to infer field labels or purpose where necessary, and ensure every piece of information is accounted for in the final output.

Begin extraction:
"""

# Path to your PDF file
pdf_file_path = "/content/sample contract.pdf"

# Extract text from the PDF
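def extract_text_from_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    # Join pages with newlines so text at page boundaries does not run together;
    # fall back to "" for pages where no text could be extracted.
    return "\n".join(page.extract_text() or "" for page in reader.pages)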

pdf_content = extract_text_from_pdf(pdf_file_path)

# Combine the extracted text and the prompt
contents = [f"{prompt}\n\n{pdf_content}"]

# Generate content using Gemini
response = model.generate_content(contents)

contract_text = response.text

# Print the structured response
print(response.text)




## Extracted Data from Legal Contract Document:

**Order Information:**

  - Order Form Date: 2023-12-13
  - Order Form Number: Data Missing
  - Expires in: 30 days

**Company Information:**

  - Company: Company A
  - Address: 9th Floor, Boston, MA 02111

**Client Information:**

  - Client: Company B
  - Client Contact Name: Rob Keatts
  - Client Contact Email: abc@companyb.comn 
  - Billing Contact: Data Missing
  - Bill to Email: Data Missing
  - Bill to Address: Data Missing

**Services & Billing:**

  - Subscription Billing:
    - Payment Terms: Net 30
  - Invoice Schedule:
    - Year | Description | Total 
    ------- | -------- | -------- 
    1 | Upfront, One-Time Implementation Fees | $25,000
    1 | Minimum Monthly Subscription Fees | $6,300 
    2 | Minimum Monthly Subscription Fees | $6,300 
    3 | Minimum Monthly Subscription Fees | $6,300 
    4 | Minimum Monthly Subscription Fees (Renewed) | $6,615 

**Products & Services:**

  **Professional Services:**

  - Product |

# **2. Automated Contract Field Extraction and Validation**

This code leverages Vertex AI's Gemini model to extract and validate key fields from a contract document. It dynamically generates a prompt for field extraction with detailed reasoning, processes the extracted data into structured JSON, and refines it using regex-based validation. The output includes both validated fields and a reasoning log, ensuring explainability and precision in the extraction process.
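The step that turns the model's reply into structured JSON can be sketched in isolation. A greedy pattern that spans from the first `{` to the last `}` can over-capture when the reply contains braces after the JSON, so the sketch below uses `json.JSONDecoder.raw_decode`, which parses one JSON value starting at a given index and ignores whatever follows. The helper name `extract_first_json_object` and the sample reply are assumptions made for illustration; this is an alternative technique, not the notebook's method.

```python
import json


def extract_first_json_object(text):
    """Return the first JSON object embedded in text, or raise ValueError."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # No valid JSON begins at this brace; try the next one
            start = text.find("{", start + 1)
    raise ValueError("No valid JSON object found in the text.")


# Hypothetical model reply: JSON followed by commentary that also contains braces
reply = (
    "Here is the result:\n"
    '{"extracted_fields": {"Payment Terms": "Net 30"}, "reasoning_log": {}}\n'
    "Note: values in {braces} are raw."
)
parsed = extract_first_json_object(reply)
print(parsed["extracted_fields"]["Payment Terms"])
```

The trade-off is that the helper returns the first object that parses, so it assumes the reply contains no earlier, smaller JSON object ahead of the one of interest.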

In [4]:
import json
import re

# Step 1: Generate the dynamic prompt for the LLM
def create_dynamic_prompt_with_reasoning(fields, contract_text):
    """
    Generate a dynamic prompt for the LLM to extract fields along with reasoning.
    """
    step_1_instructions = "1. Extract the fields exactly as they appear in the text, preserving the raw format from the contract. Use the following keys:"
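    # Only the field names are listed in the prompt; the guideline strings stored in `fields` are not sent to the model.
    step_1_keys = "\n".join([f"- \"{field}\"" for field in fields.keys()])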

    step_2_instructions = (
        "2. For each field extracted, provide a reasoning log that explains why and where the field was extracted from the text. Make it detailed and elaborate."
    )
    step_2_example = """
    Example Output:
    {
        "extracted_fields": {
            "Contract ID": "123-ABC",
            "Customer Name": "Reynolds Consumer Products Inc.",
            "Contract Start Date": "2020-02-04",
            "Contract End Date": "2021-02-04",
            "Payment Terms": "Net 30",
            "Contract Amount": "$10,000.00",
            "Billing Frequency": "Monthly",
            "Contract Type": "Subscription"
        },
        "reasoning_log": {
            "Contract ID": "Detected near the 'Agreement Number' label in the document.",
            "Customer Name": "Extracted from the 'Client' section of the text.",
            "Contract Start Date": "Identified using keywords like 'Effective Date'.",
            "Contract End Date": "Identified using keywords like 'Expiration Date'.",
            "Payment Terms": "Matched payment terms format like 'Net X'.",
            "Contract Amount": "Extracted near the 'Total Amount' section.",
            "Billing Frequency": "Detected as a recurring pattern in the document.",
            "Contract Type": "Matched with keywords like 'Agreement Type'."
        }
    }
    """

    final_output_instruction = "Provide only the final JSON output containing both 'extracted_fields' and 'reasoning_log'. Do not include any intermediate steps, explanations, or raw iterations."

    # Assemble the full prompt
    prompt = f"""
    Extract the following fields from the contract text and generate a reasoning log, along with where the field was extracted in the contract:

    {step_1_instructions}
    {step_1_keys}

    {step_2_instructions}
    {step_2_example}

    {final_output_instruction}
    Contract Text:
    {contract_text}
    """
    return prompt.strip()


# Step 2: Define the fields and guidelines
fields = {
    "Contract ID": (
        "1. Must be alphanumeric with optional hyphens (e.g., 'ABC-123'). "
        "2. If missing, look for unique identifiers labeled as 'Contract ID' or near terms like 'Agreement Number' or 'Reference Number'. "
        "3. If the value is in an invalid format, attempt to clean or reformat it into a valid alphanumeric format and output it."
    ),
    "Customer Name": (
        "1. Must be a valid name as a string. "
        "2. If missing, search for names associated with 'Customer', 'Client', or 'Party'. "
        "3. Ensure that names of organizations or individuals are extracted accurately. "
        "4. If the extracted value contains unnecessary characters or formatting issues, clean it to output a valid name."
    ),
    "Contract Start Date": (
        "1. Must follow the format 'YYYY-MM-DD'. "
        "2. If the exact field is not labeled, infer it by looking for phrases like 'Effective Date', 'Start Date', or 'Commencement Date'. "
        "3. If no date is explicitly in 'YYYY-MM-DD' format, search for any possible dates related to the start of the contract, regardless of format (e.g., 'MM/DD/YYYY', 'DD-MM-YYYY'). "
        "4. Convert any identified dates to 'YYYY-MM-DD' format in the output. "
        "5. If the value is in an invalid date format or partially incomplete, attempt to correct or infer the correct date and output it in 'YYYY-MM-DD' format."
    ),
    "Contract End Date": (
        "1. Must follow the format 'YYYY-MM-DD'. "
        "2. If the exact field is not labeled, infer it by searching for terms like 'End Date', 'Expiration Date', or 'Termination Date'. "
        "3. If no date is explicitly in 'YYYY-MM-DD' format, search for any possible dates related to the end of the contract, regardless of format (e.g., 'MM/DD/YYYY', 'DD-MM-YYYY'). "
        "4. Convert any identified dates to 'YYYY-MM-DD' format in the output. "
        "5. If the value is in an invalid date format or partially incomplete, attempt to correct or infer the correct date and output it in 'YYYY-MM-DD' format."
    ),
    "Payment Terms": (
        "1. Must represent terms of payment, which can include phrases like 'Net X' (e.g., 'Net 30'), 'Due in X days', or any description indicating when payment is expected. "
        "2. If no explicit label exists, infer payment terms from clauses or statements mentioning 'payment schedules', 'invoice due date', or similar terms. "
        "3. If the extracted payment terms are in an invalid format, attempt to reformat or infer the correct terms and output them in a standard format."
    ),
    "Contract Amount": (
        "1. Extract and account for every monetary value mentioned in the contract, regardless of its label or purpose (e.g., 'Subtotal', 'Tax', 'Discount', 'Advance Payment', 'Penalty', 'Rebate', etc.). "
        "2. Clearly label each value extracted with its corresponding context or description provided in the contract. "
        "3. If any monetary values are ambiguous or unlabeled, infer their significance based on surrounding text or clauses. "
        "4. Calculate a final total amount using all extracted monetary values, adhering to the calculation rules explicitly mentioned in the contract (e.g., adding applicable taxes, subtracting discounts, considering advance payments). "
        "5. If the contract does not provide specific calculation rules, output an estimated total amount by summing all extracted monetary values, while clearly stating this is an estimate. "
        "6. Ensure all calculations are detailed and included in the output, explicitly showing how the total amount was derived (e.g., 'Subtotal + Tax - Discounts = Final Total'). "
        "7. In cases where certain values are percentages (e.g., '10% discount'), infer the actual monetary value by applying it to the appropriate base amount. "
        "8. Provide detailed reasoning in the reasoning log for every value included in the calculation, explaining why and how it was used in deriving the total amount. "
        "9. If any monetary value is in an invalid format (e.g., missing currency symbols or commas), attempt to reformat or infer the correct amount and output it. "
        "10. IMPORTANT: ALWAYS GIVE THE FINAL AMOUNT AFTER CALCULATION WITHOUT FAIL."
    ),
    "Billing Frequency": (
        "1. Must indicate how often payments or services are recurring. "
        "2. This can include terms like 'Monthly', 'Yearly', 'Weekly', 'Daily', or phrases like 'every month', 'annually', or 'per week'. "
        "3. Infer billing frequency from patterns in payment-related terms or recurring schedules mentioned in the document. "
        "4. If the extracted billing frequency is in an invalid or unclear format, reformat it or infer the correct frequency based on context."
    ),
    "Contract Type": (
        "1. Must indicate the type of contract, such as 'Subscription', 'Fixed', 'Hourly', or 'Project-Based'. "
        "2. If missing, infer the type by analyzing contract terms or clauses describing the nature of the agreement. "
        "3. Look for phrases like 'Subscription Agreement', 'One-time Fixed Fee', 'Hourly Rate', or 'Project Work' to deduce the contract type. "
        "4. If the extracted contract type is ambiguous or in an invalid format, attempt to infer or clarify the type and output it in a standard format."
    )
}

# Step 3: Generate LLM Output Using Gemini
def generate_llm_output_with_reasoning(prompt):
    """
    Generate output from Gemini using the dynamic prompt, including reasoning.
    """
    # Initialize Vertex AI Gemini model
    model = GenerativeModel("gemini-1.5-pro-002")

    # Generate content using the prompt
    response = model.generate_content(prompt)

    # Extract the text content from the response
    llm_output = response.text

    # Find the JSON object in the response using regex
    match = re.search(r"\{.*\}", llm_output, re.DOTALL)
    if not match:
        raise ValueError("No valid JSON found in the LLM output.")

    # Parse the matched text; report malformed JSON with the original error attached
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError as e:
        raise ValueError(f"Error extracting JSON from LLM output: {e}") from e


# Step 4: Refine Extracted Fields Using Regex Without Altering Reasoning Log
def refine_fields_with_regex_only(llm_output):
    """
    Refine and validate the LLM output using regex.
    """
    extracted_fields = llm_output.get("extracted_fields", {})

    # Define regex patterns for each field
    patterns = {
        "Contract ID": r"^[A-Za-z0-9\-]+$",  # Alphanumeric with hyphens
        "Customer Name": r".+",  # Any non-empty string
        "Contract Start Date": r"^\d{4}-\d{2}-\d{2}$",  # Date in YYYY-MM-DD
        "Contract End Date": r"^\d{4}-\d{2}-\d{2}$",  # Date in YYYY-MM-DD
        "Payment Terms": r"^[A-Za-z0-9\s\-]+$",  # Alphanumeric with spaces and hyphens
        "Contract Amount": r"^\$\d{1,3}(,\d{3})*(\.\d{2})?$",  # Currency format
        "Billing Frequency": r"^[A-Za-z0-9\s\-]+$",  # Alphanumeric with spaces and hyphens
        "Contract Type": r"^[A-Za-z0-9\s\-]+$"  # Alphanumeric with spaces and hyphens
    }

    # Refine fields
    refined_fields = {}
    for field, value in extracted_fields.items():
        if value is None or value == "Missing":  # Handle null or missing values
            refined_fields[field] = "Missing"
        elif field in patterns and not re.match(patterns[field], str(value), re.IGNORECASE):  # str() guards against non-string values
            refined_fields[field] = "Invalid Format"
        else:
            refined_fields[field] = value  # Keep valid value as is

    return refined_fields

# Generate the dynamic prompt with reasoning
prompt_with_reasoning = create_dynamic_prompt_with_reasoning(fields, contract_text)

# Generate LLM output from Gemini
llm_output_with_reasoning = generate_llm_output_with_reasoning(prompt_with_reasoning)

# Refine fields without altering reasoning log
refined_fields = refine_fields_with_regex_only(llm_output_with_reasoning)

# Extract reasoning log as is
reasoning_log = llm_output_with_reasoning.get("reasoning_log", {})

# Print the refined fields
print("Refined Fields:", json.dumps(refined_fields, indent=4))

# Print the reasoning log
print("Reasoning Log:", json.dumps(reasoning_log, indent=4))


Refined Fields: {
    "Contract ID": "Invalid Format",
    "Customer Name": "Company B",
    "Contract Start Date": "2023-12-15",
    "Contract End Date": "Missing",
    "Payment Terms": "Net 30",
    "Contract Amount": "$6,300",
    "Billing Frequency": "Monthly",
    "Contract Type": "Subscription"
}
Reasoning Log: {
    "Contract ID": "The 'Order Form Number' field under 'Order Information' is explicitly stated as 'Data Missing'.",
    "Customer Name": "Extracted from the 'Client' field under 'Client Information'.",
    "Contract Start Date": "Derived from the latest date under 'Order Form Acceptance and Authorization', which represents the agreement date between both parties. While the 'Effective Dates' section mentions a 'Go Live' date, it's contingent upon product launch and doesn't provide a concrete start date. The signing date therefore serves as the most reliable indicator of agreement commencement.",
    "Contract End Date": "The contract mentions a 3-year initial term with 

# **3. Logging-Based Field Validation and Transformation**

This code validates the refined fields from the previous step and records every decision in a log. Fields marked "Missing" or "Invalid Format" are replaced with a default value where one is defined (for example, "Unknown" for Contract ID) and with the string "None" otherwise. Dates are checked against the `YYYY-MM-DD` format and amounts against a `$X,XXX.XX`-style currency format (cents optional), with a conversion attempt when the check fails. Each validation entry is written as JSON through Python's `logging` module into a `StringIO` buffer, so the full log can be printed or stored alongside the final fields.
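The date check in this step can also be sketched with `datetime.strptime` instead of a pattern match. A pattern such as `^\d{4}-\d{2}-\d{2}$` accepts impossible dates like `2023-02-30`, whereas parsing rejects them. The sketch below is a minimal illustration under stated assumptions: the helper name `normalize_date` and the list of candidate formats are hypothetical, slash-separated dates are read month-first, and dash-separated dates ending in a year are read day-first.

```python
from datetime import datetime

# Candidate input formats, tried in order (an assumption, adjust as needed):
# slash-separated values are read as MM/DD/YYYY, dash-separated as DD-MM-YYYY.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y")


def normalize_date(value):
    """Return value as 'YYYY-MM-DD', or None if no candidate format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next candidate format
    return None


for raw in ["2023-12-15", "12/15/2023", "15-12-2023", "2023-02-30", "next year"]:
    print(raw, "->", normalize_date(raw))
```

Returning `None` for unparseable input leaves the decision about a fallback value to the caller, which keeps the helper separate from the default-filling logic.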

In [5]:
import logging
import json
import re
from io import StringIO

# Create a string buffer to capture logs
log_stream = StringIO()

# Custom log handler to write to StringIO
class StringIOHandler(logging.StreamHandler):
    def __init__(self, stream):
        super().__init__(stream)

# Configure logging
log_handler = StringIOHandler(log_stream)
formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
log_handler.setFormatter(formatter)

logger = logging.getLogger()
logger.setLevel(logging.INFO)
logger.addHandler(log_handler)  # Add custom handler for StringIO

def validate_and_transform_fields(refined_fields):
    """
    Validate the refined fields to ensure they meet the required formats.
    Log details of validation and transformations for transparency.
    """
    # Define default values for non-critical fields
    default_values = {
        "Contract ID": "Unknown",
        "Payment Terms": "Net 30",  # Default to "Net 30" if not provided
        "Contract Amount": "$0.00",  # Default to $0 if missing
        "Billing Frequency": "Monthly",  # Default to Monthly
        "Contract Type": "General Agreement"  # Default to a general agreement
    }

    # Validation results and explainable log
    validation_log = []

    for field, value in refined_fields.items():
        if value == "Missing" or value == "Invalid Format":
            # Log missing/invalid fields
            validation_log.append({
                "Field": field,
                "Status": "Missing or Invalid",
                "Action": f"Filling with default value: {default_values.get(field, 'None')}",
                "Reasoning": "Field is either missing or in an invalid format; default value assigned."
            })
            # Fill with default if non-critical
            refined_fields[field] = default_values.get(field, "None")
        else:
            # Perform format-specific validations
            if field in ["Contract Start Date", "Contract End Date"]:
                if not re.match(r"^\d{4}-\d{2}-\d{2}$", value):
                    validation_log.append({
                        "Field": field,
                        "Status": "Invalid Format",
                        "Action": "Converting to valid date format (if possible).",
                        "Reasoning": "Date must follow 'YYYY-MM-DD' format. Detected invalid format."
                    })
                    # Attempt to convert to a valid format
                    try:
                        # Example of converting MM/DD/YYYY to YYYY-MM-DD
                        parts = re.split(r"[-/]", value)
                        if len(parts) == 3:
                            value = f"{parts[2]}-{parts[0].zfill(2)}-{parts[1].zfill(2)}"
                            refined_fields[field] = value
                            validation_log[-1]["Action"] = "Converted to valid format."
                        else:
                            refined_fields[field] = "Invalid Format"
                    except Exception:
                        refined_fields[field] = "Invalid Format"
                else:
                    validation_log.append({
                        "Field": field,
                        "Status": "Valid",
                        "Action": "No Action Needed",
                        "Reasoning": "Date format is valid."
                    })

            elif field == "Contract Amount":
                if not re.match(r"^\$\d{1,3}(,\d{3})*(\.\d{2})?$", value):
                    validation_log.append({
                        "Field": field,
                        "Status": "Invalid Format",
                        "Action": "Attempting to standardize currency format.",
                        "Reasoning": "Currency must follow '$X,XXX.XX' format. Detected invalid format."
                    })
                    # Attempt to convert to valid format
                    try:
                        value = re.sub(r"[^\d.]", "", value)  # Remove non-numeric characters
                        refined_fields[field] = f"${float(value):,.2f}"
                        validation_log[-1]["Action"] = "Converted to valid currency format."
                    except Exception:
                        refined_fields[field] = "Invalid Format"
                else:
                    validation_log.append({
                        "Field": field,
                        "Status": "Valid",
                        "Action": "No Action Needed",
                        "Reasoning": "Currency format is valid."
                    })

            else:
                # For all other fields, log as valid
                validation_log.append({
                    "Field": field,
                    "Status": "Valid",
                    "Action": "No Action Needed",
                    "Reasoning": "Field value is valid as per the expected format."
                })

    # Output the validation log for transparency
    for entry in validation_log:
        # Convert each dictionary entry to a JSON string for logging
        logger.info(json.dumps(entry, indent=4))

    return refined_fields

# Validate and Transform Fields
validated_and_transformed_fields = validate_and_transform_fields(refined_fields)

# Save logs to a string variable
logs_as_string = log_stream.getvalue()

# Print the final validated and transformed fields
print("Final Validated and Transformed Fields:", json.dumps(validated_and_transformed_fields, indent=4))

# Print the logs captured in the string variable
print("\nCaptured Logs:\n", logs_as_string)


INFO:root:{
    "Field": "Contract ID",
    "Status": "Missing or Invalid",
    "Action": "Filling with default value: Unknown",
    "Reasoning": "Field is either missing or in an invalid format; default value assigned."
}
INFO:root:{
    "Field": "Customer Name",
    "Status": "Valid",
    "Action": "No Action Needed",
    "Reasoning": "Field value is valid as per the expected format."
}
INFO:root:{
    "Field": "Contract Start Date",
    "Status": "Valid",
    "Action": "No Action Needed",
    "Reasoning": "Date format is valid."
}
INFO:root:{
    "Field": "Contract End Date",
    "Status": "Missing or Invalid",
    "Action": "Filling with default value: None",
    "Reasoning": "Field is either missing or in an invalid format; default value assigned."
}
INFO:root:{
    "Field": "Payment Terms",
    "Status": "Valid",
    "Action": "No Action Needed",
    "Reasoning": "Field value is valid as per the expected format."
}
INFO:root:{
    "Field": "Contract Amount",
    "Status": "Valid"

Final Validated and Transformed Fields: {
    "Contract ID": "Unknown",
    "Customer Name": "Company B",
    "Contract Start Date": "2023-12-15",
    "Contract End Date": "None",
    "Payment Terms": "Net 30",
    "Contract Amount": "$6,300",
    "Billing Frequency": "Monthly",
    "Contract Type": "Subscription"
}

Captured Logs:
 2024-12-27 08:12:51,274 - INFO - {
    "Field": "Contract ID",
    "Status": "Missing or Invalid",
    "Action": "Filling with default value: Unknown",
    "Reasoning": "Field is either missing or in an invalid format; default value assigned."
}
2024-12-27 08:12:51,276 - INFO - {
    "Field": "Customer Name",
    "Status": "Valid",
    "Action": "No Action Needed",
    "Reasoning": "Field value is valid as per the expected format."
}
2024-12-27 08:12:51,278 - INFO - {
    "Field": "Contract Start Date",
    "Status": "Valid",
    "Action": "No Action Needed",
    "Reasoning": "Date format is valid."
}
2024-12-27 08:12:51,279 - INFO - {
    "Field": "Contrac

Print the final fields and logs

In [6]:
# Print the final validated and transformed fields
print("Final Validated and Transformed Fields:", json.dumps(validated_and_transformed_fields, indent=4))

# Print the logs captured in the string variable
print("\nCaptured Logs:\n", logs_as_string)


# **4. PDF Highlighting and Logging**

This Python script highlights field values in the contract PDF using PyMuPDF (`fitz`), based on the validated and transformed field data. It also logs the model's reasoning for each extracted field into a string buffer for transparency and explainability. This is useful for validating automated document processing workflows and improving auditability.
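The log-capture part of this step can be sketched on its own. A named logger persists for the life of the Python process, so a handler that stays attached after its stream is closed will fail with "I/O operation on closed file" on the next log call. The sketch below is one possible pattern, not the notebook's code: `capture_logs` and the logger name `explainability_demo` are hypothetical, and the handler is detached in a `finally` block before the buffer is closed.

```python
import io
import json
import logging


def capture_logs(logger_name, entries):
    """Log each entry as JSON to an in-memory buffer and return the captured text."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))

    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep entries out of the root logger's handlers
    logger.addHandler(handler)
    try:
        for entry in entries:
            logger.info(json.dumps(entry))
        return buffer.getvalue()
    finally:
        # Detach before closing so no handler is left pointing at a closed stream
        logger.removeHandler(handler)
        buffer.close()


# Hypothetical entry shaped like the reasoning log
sample_entries = [{"Field": "Payment Terms", "Reasoning": "Matched 'Net 30'."}]
captured = capture_logs("explainability_demo", sample_entries)
print(captured)
```

Calling the helper repeatedly with the same logger name does not accumulate handlers, so re-running a cell that uses it should not duplicate log lines.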

In [11]:
import fitz  # PyMuPDF for PDF parsing
import logging
import json
import re
from io import StringIO

# Create a string buffer to capture logs for explainability
explainability_log_stream = StringIO()

# Custom log handler to write to StringIO
class StringIOHandler(logging.StreamHandler):
    def __init__(self, stream):
        super().__init__(stream)

# Configure logging for explainability
log_handler = StringIOHandler(explainability_log_stream)
formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
log_handler.setFormatter(formatter)

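logger = logging.getLogger("explainability_logger")
logger.setLevel(logging.INFO)
# Drop handlers left over from earlier runs of this cell; a stale handler bound
# to a closed StringIO can raise "I/O operation on closed file" when it emits.
logger.handlers.clear()
logger.addHandler(log_handler)  # Add custom handler for StringIO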

# Step 1: Highlight PDF with Extracted Fields
def highlight_pdf(input_pdf_path, output_pdf_path, validated_and_transformed_fields):
    """
    Create a visual overlay on the PDF to highlight extracted fields.
    """
    try:
        # Open the original PDF
        pdf_document = fitz.open(input_pdf_path)

        for page_num, page in enumerate(pdf_document, start=1):
            for field, value in validated_and_transformed_fields.items():
                if value not in ("Missing", "Invalid Format", "None", "Unknown"):  # skip placeholder values
                    # Search for the text on the page
                    text_instances = page.search_for(value)
                    for inst in text_instances:
                        # Highlight the found text
                        page.add_highlight_annot(inst)

        # Save the modified PDF
        pdf_document.save(output_pdf_path, garbage=4, deflate=True)
        pdf_document.close()

        logger.info(f"PDF highlights saved to: {output_pdf_path}")

    except Exception as e:
        logger.error(f"Error highlighting PDF: {e}")

# Step 2: Log LLM Reasoning or Confidence
def log_llm_reasoning(validated_and_transformed_fields, reasoning_log):
    """
    Log LLM reasoning or confidence scores for extracted fields.
    """
    for field, reasoning in reasoning_log.items():
        log_entry = {
            "Field": field,
            "Validated Value": validated_and_transformed_fields.get(field, "Missing"),
            "Reasoning or Confidence": reasoning
        }
        logger.info(json.dumps(log_entry, indent=4))

# Highlight the PDF
input_pdf_path = "/content/sample contract.pdf"  # Path to your input PDF
output_pdf_path = "highlighted_contract.pdf"  # Path to save the highlighted PDF
highlight_pdf(input_pdf_path, output_pdf_path, validated_and_transformed_fields)

# Log LLM Reasoning
log_llm_reasoning(validated_and_transformed_fields, reasoning_log)

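# Save explainability logs to a string variable
explainability_logs_as_string = explainability_log_stream.getvalue()

# Detach the handler before closing the stream. A handler left attached to a
# closed StringIO makes every later logger call fail with "--- Logging error ---"
# (for example, when this cell is run a second time).
logger.removeHandler(log_handler)
explainability_log_stream.close()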

# Confirm Outputs
print(f"Highlighted PDF saved as: {output_pdf_path}")

--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.10/logging/__init__.py", line 1103, in emit
    stream.write(msg + self.terminator)
ValueError: I/O operation on closed file
Call stack:
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/colab_kernel_launcher.py", line 37, in <module>
    ColabKernelApp.launch_instance()
  File "/usr/local/lib/python3.10/dist-packages/traitlets/config/application.py", line 992, in launch_instance
    app.start()
  File "/usr/local/lib/python3.10/dist-packages/ipykernel/kernelapp.py", line 619, in start
    self.io_loop.start()
  File "/usr/local/lib/python3.10/dist-packages/tornado/platform/asyncio.py", line 195, in start
    self.asyncio_loop.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, 

Highlighted PDF saved as: highlighted_contract.pdf
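
The `--- Logging error ---` traceback above is raised when a log handler writes to a `StringIO` that has already been closed. Because `logging.getLogger` returns the same logger object on every call, a handler left attached by one run of the cell most likely survives into the next run. The following is a minimal sketch of one way to avoid this, assuming a hypothetical helper `captured_logs` and a hypothetical logger name `explainability_demo` (neither is part of the notebook):

```python
import io
import logging
from contextlib import contextmanager


@contextmanager
def captured_logs(logger_name, level=logging.INFO):
    """Attach a StringIO handler for the duration of the block, then detach it.

    Hypothetical helper for illustration; not part of the notebook.
    """
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
    log = logging.getLogger(logger_name)
    log.setLevel(level)
    log.addHandler(handler)
    captured = {"text": ""}
    try:
        yield log, captured
    finally:
        # Read the text and detach the handler BEFORE closing the stream,
        # so no handler is ever left pointing at a closed file object.
        handler.flush()
        captured["text"] = stream.getvalue()
        log.removeHandler(handler)
        stream.close()


# Running the block twice mimics re-running a notebook cell.
for run in range(2):
    with captured_logs("explainability_demo") as (log, captured):
        log.info("run %d", run)
    print(captured["text"], end="")
```

Each pass attaches a fresh handler and removes it on exit, so the logger ends with no handlers and a repeated run does not write to a closed stream.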


Printing Explainability/Highlighting Logs

In [12]:
print("\nExplainability Logs:\n", explainability_logs_as_string)


Explainability Logs:
 2024-12-27 08:14:00,319 - INFO - PDF highlights saved to: highlighted_contract.pdf
2024-12-27 08:14:00,338 - INFO - {
    "Field": "Contract ID",
    "Validated Value": "123-ABC",
    "Reasoning or Confidence": "The 'Order Form Number' field under 'Order Information' is explicitly stated as 'Data Missing'."
}
2024-12-27 08:14:00,345 - INFO - {
    "Field": "Customer Name",
    "Validated Value": "Reynolds Consumer Products Inc.",
    "Reasoning or Confidence": "Extracted from the 'Client' field under 'Client Information'."
}
2024-12-27 08:14:00,353 - INFO - {
    "Field": "Contract Start Date",
    "Validated Value": "2024-12-24T12:25:21.032Z",
    "Reasoning or Confidence": "Derived from the latest date under 'Order Form Acceptance and Authorization', which represents the agreement date between both parties. While the 'Effective Dates' section mentions a 'Go Live' date, it's contingent upon product launch and doesn't provide a concrete start date. The signing da

#**5. Request to Zenskar API**

Dummy **POST** request: the IDs in the payload are placeholders and no credentials are sent, so the API rejects the call.

In [13]:
import requests
import json

# Define the Zenskar API endpoint
url = "https://api.zenskar.com/contract_v2"

# Example validated and transformed fields
validated_and_transformed_fields = {
    "Contract ID": "123-ABC",
    "Customer Name": "Reynolds Consumer Products Inc.",
    "Contract Start Date": "2024-12-24T12:25:21.032Z",
    "Contract End Date": "2024-12-24T12:25:21.032Z",
    "Payment Terms": "Net 30",
    "Contract Amount": "$10,000.00",
    "Billing Frequency": "Monthly",
    "Contract Type": "Subscription"
}

# Map the validated fields to the API's payload structure
payload = {
    "name": validated_and_transformed_fields.get("Customer Name", "Default Contract Name"),  # Contract name
    "description": f"Contract for {validated_and_transformed_fields.get('Customer Name', 'Unknown Customer')}",  # Contract description
    "status": "draft",  # Draft status
    "currency": "USD",  # Example currency
    "start_date": validated_and_transformed_fields.get("Contract Start Date"),
    "end_date": validated_and_transformed_fields.get("Contract End Date"),
    "customer_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",  # Example customer ID
    "anchor_date": validated_and_transformed_fields.get("Contract Start Date"),
    "plan_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",  # Example plan ID
    "phases": [
        {
            "name": "Initial Phase",
            "description": "Primary contract phase",
            "start_date": validated_and_transformed_fields.get("Contract Start Date"),
            "end_date": validated_and_transformed_fields.get("Contract End Date"),
            "pricings": [
                {
                    "external_id": "pricing_123",
                    "start_date": validated_and_transformed_fields.get("Contract Start Date"),
                    "end_date": validated_and_transformed_fields.get("Contract End Date"),
                    "pricing_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
                    "product_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6"
                }
            ],
            "features": {
                "name": "Standard Subscription",
                "description": "Basic subscription features",
                "pricing_data": {
                    "currency": "USD",
                    "label": "Subscription Pricing",
                    "unit": "month",
                    "pricing_period": {"cadence": "monthly"},
                    "unit_amount": 10000,  # Placeholder: 10000 cents is $100.00, not the $10,000.00 contract amount
                    "dimensions": [{"name": "Feature Dimension", "column_name": "feature_dimension"}],
                    "prices": [10000],  # Placeholder price, same caveat as unit_amount
                    "display_alias": ["Standard Plan"],
                    "pricing_type": "flat"
                },
                "quantity": {
                    "type": "fixed",
                    "label": "Fixed Quantity",
                    "quantity": 1,
                    "unit": "month",
                    "aggregate_id": "aggregate_123"
                },
                "is_recurring": True,
                "billing_period": {"cadence": "monthly", "offset": "prepaid"}
            },
            "source_plan_phase_id": "phase_123",
            "phase_type": "active"
        }
    ],
    "renewal_policy": "renew_with_default_contract",  # Example renewal policy
    "contract_link": "https://example.com/contracts/123-ABC"  # Example contract link
}

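# Define headers
# Note: no credentials are included, so the API is expected to reject this request
headers = {
    "accept": "application/json",
    "content-type": "application/json"
}

# Send the POST request (the timeout keeps the cell from hanging indefinitely)
response = requests.post(url, json=payload, headers=headers, timeout=30)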

# Print the response
print("Response Status Code:", response.status_code)
print("Response Body:", response.text)


Response Status Code: 403
Response Body: {"Message":"User is not authorized to access this resource with an explicit deny"}
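
The 403 response above is consistent with the request carrying no credentials. The sketch below shows one way to add them and to turn a status code into a readable hint. The header names `x-api-key` and `organisation`, the environment variable names, and both helper functions are assumptions made for illustration, not confirmed details of the Zenskar API; check the API documentation for the actual scheme.

```python
import os


def build_headers(env=os.environ):
    """Return request headers, adding credentials when they are configured.

    ASSUMPTION: the header names "x-api-key" and "organisation" and the
    environment variable names are illustrative placeholders.
    """
    headers = {"accept": "application/json", "content-type": "application/json"}
    api_key = env.get("ZENSKAR_API_KEY")            # hypothetical variable name
    organisation = env.get("ZENSKAR_ORGANISATION")  # hypothetical variable name
    if api_key:
        headers["x-api-key"] = api_key
    if organisation:
        headers["organisation"] = organisation
    return headers


def explain_status(status_code):
    """Map an HTTP status code to a short, human-readable hint."""
    if 200 <= status_code < 300:
        return "success"
    if status_code in (401, 403):
        return "not authorized: check that credentials are sent and valid"
    if 400 <= status_code < 500:
        return "client error: check the payload against the API schema"
    return "server error or unexpected status: retry later"


# Demonstration with an in-memory mapping instead of real environment variables
headers = build_headers({"ZENSKAR_API_KEY": "demo-key"})
print(sorted(headers))
print(explain_status(403))
```

The resulting dictionary can be passed as `headers=` to `requests.post`, and `explain_status(response.status_code)` can be printed alongside the response body.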

