# Step 2: Extract Information

This notebook uses an LLM to pull two types of structured information from each contract:

1. **References** -- What other agreements and documents does this contract mention? (e.g., "Master Agreement 1239-12900", "Rate Sheet RS-2024")
2. **Doc Info** -- Key facts about the contract itself: agreement name, type, dates, whether it's a master agreement or amendment, etc.

Both extractions run against the truncated text from the `flat` table (Step 1). They use `AI_QUERY` to call the LLM and `from_json` to parse the structured response.
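
As a rough illustration of the parse step, the sketch below mimics in plain Python what happens to the model's reply for the references schema: a well-formed reply becomes a typed record, and a reply that is not valid JSON surfaces as nulls instead of raising, which is roughly how a permissive JSON parse behaves. `parse_reference_reply` is a hypothetical helper written for this illustration; it is not part of the notebook and does not call Spark.

```python
import json
from typing import Dict, List, Optional, TypedDict


class ReferenceResult(TypedDict):
    agreements: Optional[List[str]]
    references: Optional[List[Dict[str, Optional[str]]]]


def parse_reference_reply(reply: Optional[str]) -> ReferenceResult:
    """Hypothetical helper: parse an LLM reply into the references schema."""
    empty: ReferenceResult = {"agreements": None, "references": None}
    if reply is None:
        return empty
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return empty  # malformed JSON: surface nulls instead of raising
    if not isinstance(data, dict):
        return empty

    raw_agreements = data.get("agreements")
    raw_references = data.get("references")

    agreements = None
    if isinstance(raw_agreements, list):
        agreements = [str(a) for a in raw_agreements]

    references = None
    if isinstance(raw_references, list):
        # Keep only the two fields the schema declares; missing keys become None.
        references = [
            {"section": r.get("section"), "document": r.get("document")}
            for r in raw_references
            if isinstance(r, dict)
        ]

    return {"agreements": agreements, "references": references}


reply = (
    '{"agreements": ["Master Agreement 1239-12900"], '
    '"references": [{"section": "Section 5 - Pricing", "document": "Rate Sheet RS-2024"}]}'
)
parsed = parse_reference_reply(reply)
```

Here `parsed["agreements"]` holds the single agreement string and `parsed["references"]` holds one section/document pair, while a reply such as `"not json"` produces a record with both fields set to `None`.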

**Before you run this:**
- Step 1 (Parse) must be complete
- The `flat` table must exist with data

**Output tables:**
- `references` -- extracted agreement references and document references per contract
- `doc_info` -- agreement name, type, dates, master/amendment classification per contract

## Configuration

In [None]:
# Widgets expose the notebook parameters; the SQL cells below read them as :catalog, :schema, etc.
dbutils.widgets.text("catalog", "shm", "Catalog")
dbutils.widgets.text("schema", "contract", "Schema")
dbutils.widgets.text("batch_size", "100", "Batch Size")
dbutils.widgets.text("max_input_char", "400000", "Max Input Characters")

catalog = dbutils.widgets.get("catalog")
schema = dbutils.widgets.get("schema")
# Widget values are always strings; convert the numeric ones before using them in Python.
batch_size = int(dbutils.widgets.get("batch_size"))
max_input_char = int(dbutils.widgets.get("max_input_char"))

---
## 2A: Extract References

For each contract, the LLM identifies:
- **Agreements** -- specific agreements referenced by name/number (e.g., "Master Agreement 1239-12900")
- **Referenced documents** -- attachments, exhibits, schedules, amendments mentioned in specific sections
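
The two result types above are also flattened into a single text field, `combined_references`, so they can be passed along as context later. The sketch below shows one simplified way to do that flattening in plain Python. The notebook builds this string in SQL with `CONCAT_WS`, and the exact separators there may differ; `combine_references` is a hypothetical name used only for this illustration.

```python
from typing import Dict, List, Optional


def combine_references(
    agreements: Optional[List[str]],
    references: Optional[List[Dict[str, str]]],
) -> str:
    """Flatten extracted agreements and references into one text block."""
    lines = ["AGREEMENTS"]
    if agreements:
        lines.append(", ".join(agreements))
    lines.append("REFERENCES")
    if references:
        # Each reference is rendered as "<section>: <document>".
        lines.append(", ".join(f"{r['section']}: {r['document']}" for r in references))
    return "\n".join(lines)


combined = combine_references(
    ["Master Agreement 1239-12900", "Consulting Agreement CSA-456"],
    [
        {"section": "Section 5 - Pricing", "document": "Rate Sheet RS-2024"},
        {"section": "Appendix A", "document": "Statement of Work SOW-789"},
    ],
)
print(combined)
```

With no agreements and no references, the function returns only the two header lines.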

The prompt also uses filename and folder structure as supporting clues (e.g., if files are in the same folder, they're likely related).

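This step is idempotent -- it skips contracts already in the `references` table. Each run processes at most `batch_size` contracts, so rerun the cell until the count in the check below stops growing.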
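
The skip-what-is-done pattern can be sketched in plain Python as follows. This is a minimal model, assuming an in-memory dict stands in for the `references` table and a stub stands in for the LLM call; `run_batch` and `fake_extract` are hypothetical names, not notebook functions.

```python
from typing import Callable, Dict, List


def run_batch(
    source_paths: List[str],
    target: Dict[str, str],
    extract: Callable[[str], str],
    batch_size: int,
) -> int:
    """Process up to batch_size paths missing from target; return how many were added."""
    pending = [p for p in source_paths if p not in target]  # anti join on path
    batch = pending[:batch_size]                            # LIMIT batch_size
    for path in batch:
        target[path] = extract(path)                        # insert when not matched
    return len(batch)


calls: List[str] = []


def fake_extract(path: str) -> str:
    """Stand-in for the LLM call; records which paths were processed."""
    calls.append(path)
    return f"refs for {path}"


flat = ["a.pdf", "b.pdf", "c.pdf"]
references: Dict[str, str] = {}

# Rerun until a batch comes back empty, as you would rerun the notebook cell.
processed: List[int] = []
while (n := run_batch(flat, references, fake_extract, batch_size=2)) > 0:
    processed.append(n)
```

Each path reaches the extractor exactly once, and a further call to `run_batch` after the loop adds nothing, which is the sense in which the step is idempotent.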

In [None]:
CREATE TABLE IF NOT EXISTS IDENTIFIER(:catalog || '.' || :schema || '.references') (
  path STRING,
  agreements ARRAY<STRING>,
  references ARRAY<STRUCT<section: STRING, document: STRING>>,
  combined_references STRING
)
USING DELTA;

MERGE INTO IDENTIFIER(:catalog || '.' || :schema || '.references') AS target
USING (
  WITH ref_results AS (
    SELECT
      f.path,
      from_json(
        AI_QUERY(
          "databricks-claude-sonnet-4-5",
          SUBSTRING(CONCAT(
            'You are a contract analysis expert. Your task is to identify all agreements referenced in the document and all referenced documents or attachments mentioned in any section, using both text and contextual clues from filename and folder structure.
            
            Step 1: Identify Referenced Agreements
            Scan the document for any specific agreements with identifiers or numbers (e.g., Master Agreement 1239-12900). Include the agreement type and its reference number. Common types include (but are not limited to):
            
            Master Agreement
            Framework Agreement
            Consulting Agreement (CSA)
            NDA / Confidentiality Agreement
            Purchase Order Terms and Conditions
            Mutually Agreed Terms and Conditions (MTC)
            Contract
            Master Work Agreement (MWA)
            Sales Contract
            Engineering Procurement Construction (EPC)
            Engineering Procurement Construction Management (EPCM)
            Construction Agreement
            Site Services Agreement
            Staffing Agreement
            Sales/Catering Contract
            Recruitment Agreement
            Administration Services Agreement
            Services Agreement
            License Agreement
            Supply Agreement
            Order Form
            Purchase Agreement
            General Terms and Conditions
            Scope of Work
            Termination Agreement
            
            Important:
            
            Only include agreements that have a specific identifier or number.
            Ignore generic mentions like "the agreement" unless paired with a unique reference.
            Check for completeness: If multiple agreements are referenced, include all of them, not just the first one found.
            
            Step 2: Identify Referenced Documents
            For each section of the contract, list any real referenced documents or attachments that include a name and/or identifier (e.g., Economic Disclosure Statement: Ownership Interest Declaration (EDS-7: 3/2015)). Common references include:
            
            Amendments
            Rate Sheets
            Schedules
            Exhibits
            Addendums
            Statement of Work (SOW)
            Termination Notices
            Forms of Undertaking (FOU)
            Commitment Letters
            Change Orders
            
            Important:
            
            Do not include hypothetical or generic references.
            Only capture actual documents with identifiers or names.
            Check for completeness: If multiple referenced documents appear in different sections, include all of them.
            
            Filename and Folder Heuristics (Supportive Clues Only):
            
            If the filename contains tokens like SOW, RateSheet, Schedule, Exhibit, Addendum, Amendment, or ChangeOrder, treat it as a strong clue that the document is a referenced attachment.
            If the folder path groups documents together (e.g., a folder named MasterAgreement_1239 contains multiple files), assume that related reference documents (amendments, schedules, rate sheets) are likely in the same folder.
            These clues must not replace reading the document content, but they can help confirm or strengthen associations.
            
            Output Format:
            Return the results in JSON format:
            {
              "agreements": [
                "Master Agreement 1239-12900",
                "Consulting Agreement CSA-456"
              ],
              "references": [
                {"section": "Section 5 - Pricing", "document": "Rate Sheet RS-2024"},
                {"section": "Appendix A", "document": "Statement of Work SOW-789"}
              ]
            }
            
            If no agreements or references are found, return empty arrays.
            ## Document ##\n',
            'Vendor Name: ', f.vendor_name, '\n',
            'File Name: ', f.file_name, '\n',
            'Text: ', f.truncated, '\n'
          ), 1, CAST(:max_input_char AS INT)),
        responseFormat => 'STRUCT<result:STRUCT<
            agreements:ARRAY<STRING>, 
            references:ARRAY<STRUCT<
                section:STRING, 
                document:STRING
            >>
        >>'),
        'STRUCT<agreements:ARRAY<STRING>, references:ARRAY<STRUCT<section:STRING, document:STRING>>>'
      ) as result
    FROM IDENTIFIER(:catalog || '.' || :schema || '.flat') f
    ANTI JOIN IDENTIFIER(:catalog || '.' || :schema || '.references') r
      ON f.path = r.path
  )
  SELECT 
    path,
    result.agreements as agreements,
    result.references as references,
    -- One text block: an AGREEMENTS line, a blank line, then a REFERENCES line
    CONCAT_WS('\n', 'AGREEMENTS', array_join(result.agreements, ', '), '\nREFERENCES',
      CASE 
        WHEN result.references IS NOT NULL AND size(result.references) > 0 
        THEN array_join(transform(result.references, x -> x.section || ': ' || x.document), ', ')
        ELSE ''
      END
    ) as combined_references
  FROM ref_results
  LIMIT CAST(:batch_size AS INT)
) AS source
ON target.path = source.path
WHEN NOT MATCHED THEN INSERT *

In [None]:
-- Check: how many contracts have had references extracted so far? (one row per contract)
SELECT COUNT(*) as total_refs FROM IDENTIFIER(:catalog || '.' || :schema || '.references')

---
## 2B: Extract Doc Info

For each contract, the LLM extracts key document-level facts:
- **Agreement name and type** (e.g., MSA, CSA, NDA, SOW, Termination)
- **Document type** (Agreement, Amendment, Rate Schedule, etc.)
- **Dates** (effective, expiry)
- **Is it a master agreement?**
- **If it is an amendment**, what master agreement does it modify, and what is the new expiry?
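
The amendment fields exist so that the ultimate expiry date of each master agreement can be worked out later from its own expiry and the expiries set by its amendments. The sketch below shows one way that resolution could look in plain Python. It rests on assumptions the notebook does not confirm: dates arrive as ISO `YYYY-MM-DD` strings, `is_master_agreement` holds the string `"true"` for master agreements, and amendments name their master agreement with exactly the same text. `parse_iso` and `ultimate_expiry` are hypothetical helpers.

```python
from datetime import date
from typing import Dict, Iterable, Optional


def parse_iso(value: Optional[str]) -> Optional[date]:
    """Return a date for 'YYYY-MM-DD' strings, or None for empty/unparseable values."""
    if not value:
        return None
    try:
        return date.fromisoformat(value)
    except ValueError:
        return None


def ultimate_expiry(rows: Iterable[Dict[str, str]]) -> Dict[str, date]:
    """Latest known expiry per master agreement name (assumes exact name matches)."""
    latest: Dict[str, date] = {}
    for row in rows:
        if row.get("is_master_agreement") == "true":  # assumed string encoding
            name = row.get("agreement_name")
            expiry = parse_iso(row.get("expiry_date"))
        else:
            name = row.get("related_master_agreement_name")
            expiry = parse_iso(row.get("amendment_expiry_date"))
        if name and expiry and (name not in latest or expiry > latest[name]):
            latest[name] = expiry
    return latest


master = "Master Agreement 1239-12900"
rows = [
    {"agreement_name": master, "is_master_agreement": "true", "expiry_date": "2024-12-31"},
    {"agreement_name": "Amendment 1", "is_master_agreement": "false",
     "related_master_agreement_name": master, "amendment_expiry_date": "2025-12-31"},
    {"agreement_name": "Amendment 2", "is_master_agreement": "false",
     "related_master_agreement_name": master, "amendment_expiry_date": "2026-06-30"},
    {"agreement_name": "Amendment 3", "is_master_agreement": "false",
     "related_master_agreement_name": master, "amendment_expiry_date": ""},
]
result = ultimate_expiry(rows)
```

Empty or unparseable dates are ignored rather than guessed, which matches the prompt's rule that low-confidence fields come back as empty strings.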

This step passes the references extracted in 2A to the LLM as additional context, so run 2A first. It is idempotent -- it skips contracts already in the `doc_info` table -- and each run processes at most `batch_size` contracts.
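
A contract that has no row in `references` yet has no extracted references to pass along, and in Spark SQL `CONCAT` returns NULL as soon as any argument is NULL. The sketch below shows, in plain Python, a null-safe way to assemble and cap such a prompt: missing fields are treated as empty strings (the SQL counterpart would be wrapping nullable inputs in `COALESCE`), and the instructions come first so that truncation only trims the tail of the document text. `build_prompt` is a hypothetical helper, and the label layout is an assumption modeled on the notebook's prompt.

```python
from typing import List, Optional


def build_prompt(
    instructions: str,
    vendor_name: Optional[str],
    file_name: Optional[str],
    vendor_folder_paths: Optional[List[str]],
    combined_references: Optional[str],
    text: Optional[str],
    max_input_char: int,
) -> str:
    """Assemble the prompt, treating missing fields as empty, then cap its length."""
    parts = [
        instructions,
        f"Vendor Name: {vendor_name or ''}",
        f"File Name: {file_name or ''}",
        f"Other Vendor Paths: {', '.join(vendor_folder_paths or [])}",
        f"Extracted References: {combined_references or ''}",
        f"Text: {text or ''}",
    ]
    # Slicing mirrors SUBSTRING(prompt, 1, max_input_char): keep the first N characters.
    return "\n".join(parts)[:max_input_char]


prompt = build_prompt(
    instructions="INSTRUCTIONS",
    vendor_name="Acme",
    file_name="msa.pdf",
    vendor_folder_paths=None,   # no sibling files known
    combined_references=None,   # 2A has not processed this contract yet
    text="x" * 1000,
    max_input_char=200,
)
```

Even with two inputs missing, `prompt` is a non-empty string of exactly 200 characters that still begins with the instructions.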

In [None]:
CREATE TABLE IF NOT EXISTS IDENTIFIER(:catalog || '.' || :schema || '.doc_info') (
  path STRING,
  agreement_name STRING,
  agreement_type STRING,
  document_type STRING,
  effective_date STRING,
  expiry_date STRING,
  is_master_agreement STRING,
  related_master_agreement_name STRING,
  amendment_expiry_date STRING,
  combined_doc_info STRING
)
USING DELTA;

MERGE INTO IDENTIFIER(:catalog || '.' || :schema || '.doc_info') AS target
USING (
  SELECT
    path,
    metadata.*,
    cast(to_variant_object(named_struct('path', path, 'metadata', metadata)) as string) AS combined_doc_info
  FROM (
    SELECT
      f.path,
      from_json(AI_QUERY(
        "databricks-claude-sonnet-4-5",
        SUBSTRING(CONCAT(
          'You are a contract analysis expert tasked with extracting structured information from vendor contract documents. Your primary goal is to identify master agreements and any related amendments that modify their expiry dates, so that the ultimate expiry date of each master agreement can be determined. From the provided contract text, extract the following fields in JSON format:
          
          ## INFO TO EXTRACT
          AgreementName: The name of the agreement, often followed by "(the Agreement)".
          
          AgreementType: Classify the agreement using this guide (use filename as a supportive clue only, never a substitute for reading and understanding the document content):
          CC = Construction Contract
          NDA = Non-Disclosure Agreement
          CA = Confidentiality Agreement
          MWA = Miscellaneous Work Agreement
          MSA = Master Supply Agreement
          MSSA = Master Supply and Service Agreement
          EP = Engineering and Procurement Contract
          EPC = Engineering Procurement and Construction Contract
          T&C = Terms and Conditions
          CSA = Consulting Services Agreement
          MOU = Memorandum of Understanding
          MOA = Memorandum of Agreement
          SOW = Scope of Work
          TERMINATION = Termination Agreement
          OTHER = Not covered above
          NONAGREEMENT = Not an agreement
          
          If the filename contains tokens like "CA", "NDA", "MSA", "MSSA", "EPC", "EP", "CC", "CSA", "SOW", "Terms", or "Termination", it may indicate the AgreementType. Always confirm with document content.
          
          DocumentType: Classify the document type as --
          AGREEMENT
          AMENDMENT (including Change Orders, CCO, COR)
          SCOPE_OF_WORK
          TERMINATION
          Rate Tables / Rate Schedules
          Collective Bargaining Agreements
          Contract Executive Summary
          CRAF (Recommendation for Award)
          Proposals
          Technical Drawings
          Other
          
          EffectiveDate: The start or effective date of the contract.
          
          ExpiryDate: The end or expiry date of the contract. If the TERM section specifies duration (e.g., "3 years from Effective Date"), calculate the end date.
          
          IsMasterAgreement: Boolean (true/false) - Is this document a master agreement?
          
          RelatedMasterAgreementName: If this is an amendment, specify the name of the master agreement it modifies.
          
          AmendmentExpiryDate: If this is an amendment, extract the new expiry date of the master agreement being amended, not the expiry date of the amendment document itself.
          
          ## ADDITIONAL INSTRUCTIONS
          Vendor Paths:
          In most cases, amendment documents are located in the same folder as the master agreement they modify. If a document is determined to be an amendment, there is a very high likelihood that the master agreement it amends is in the same folder. Use the provided file names from the vendor folder as a strong clue when linking amendments to their master agreements, but still confirm using document content (e.g., references to agreement name or number).
          
          Agreements and References:
          The document will likely contain both agreement and other reference names. These have already been extracted and are provided as references, but may not be complete. Use this as additional context.

          Special Rules for Expiry Dates:
          If the document is a master agreement, record its original expiry date in ExpiryDate.
          If the document is an amendment, record the new master agreement expiry in AmendmentExpiryDate.
          If multiple amendments exist, the ultimate expiry date for the master agreement will be the latest date among all amendments.
          
          Output Format:
          Return the result in this JSON structure:
          {
            "agreement_name": "",
            "agreement_type": "",
            "document_type": "",
            "effective_date": "",
            "expiry_date": "",
            "is_master_agreement": "",
            "related_master_agreement_name": "",
            "amendment_expiry_date": ""
          }
          
          If you are not confident about a field, return an empty string ("").

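          CONTRACT INFORMATION\n',
          'Vendor Name: ', f.vendor_name, '\n',
          'File Name: ', f.file_name, '\n',
          -- COALESCE keeps the prompt non-NULL when a context field is missing,
          -- e.g. when 2A has not produced a references row for this contract yet
          'Other Vendor Paths: ', COALESCE(array_join(f.vendor_folder_paths, ', '), ''), '\n',
          'Extracted References: ', COALESCE(r.combined_references, ''), '\n',
          'Text: ', f.truncated, '\n'
        ), 1, CAST(:max_input_char AS INT)),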
        responseFormat => 'STRUCT<result:STRUCT<
          agreement_name:STRING, 
          agreement_type:STRING, 
          document_type:STRING, 
          effective_date:STRING, 
          expiry_date:STRING, 
          is_master_agreement:STRING, 
          related_master_agreement_name:STRING,
          amendment_expiry_date:STRING
          >>'
      ),
      'STRUCT<
          agreement_name:STRING, 
          agreement_type:STRING, 
          document_type:STRING, 
          effective_date:STRING, 
          expiry_date:STRING, 
          is_master_agreement:STRING, 
          related_master_agreement_name:STRING,
          amendment_expiry_date:STRING
        >'
      )
      as metadata
    FROM IDENTIFIER(:catalog || '.' || :schema || '.flat') f
    LEFT JOIN IDENTIFIER(:catalog || '.' || :schema || '.references') r
      ON f.path = r.path
    ANTI JOIN IDENTIFIER(:catalog || '.' || :schema || '.doc_info') d
      ON f.path = d.path
    LIMIT CAST(:batch_size AS INT)
  )
) AS source
ON target.path = source.path
WHEN NOT MATCHED THEN INSERT *

In [None]:
-- Check: how many doc_info records have been extracted?
SELECT COUNT(*) as total_doc_info FROM IDENTIFIER(:catalog || '.' || :schema || '.doc_info')