# Schema Comparison Tool

* Author: docai-incubator@google.com

# Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

# Objective
This tool compares the schema of a pretrained “base” processor with a customer’s custom schema for uptraining and highlights the differences in an easily readable, color-coded format.

This provides the information needed to correct 400 INVALID ARGUMENT errors encountered during Uptraining via Notebook. The schema given in the Uptraining notebook MUST be consistent with the base processor (no changes to base processor fields).  Additional custom fields are allowed, and base processor fields may be excluded.
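
The consistency rule above can be sketched as a small check. This is a minimal illustration only, assuming old-format schemas in which each entity is a dict with `type`, `baseType`, and `occurrenceType` keys; `find_changed_base_fields` and the `po_reference` field are hypothetical names, not part of the tool.

```python
def find_changed_base_fields(base_entities, custom_entities):
    """Return names of base fields whose definition differs in the custom schema."""
    base = {entity["type"]: entity for entity in base_entities}
    changed = []
    for entity in custom_entities:
        original = base.get(entity["type"])
        if original is None:
            continue  # additional custom field: allowed
        for key in ("baseType", "occurrenceType"):
            if entity.get(key) != original.get(key):
                changed.append(entity["type"])
                break
    return changed


base_entities = [
    {"type": "currency", "baseType": "string", "occurrenceType": "OPTIONAL_ONCE"},
    {"type": "due_date", "baseType": "datetime", "occurrenceType": "OPTIONAL_ONCE"},
]
custom_entities = [
    # Base field with a changed occurrenceType: not allowed
    {"type": "currency", "baseType": "string", "occurrenceType": "OPTIONAL_MULTIPLE"},
    # Extra custom field (hypothetical name): allowed
    {"type": "po_reference", "baseType": "string", "occurrenceType": "OPTIONAL_ONCE"},
]
# "due_date" is simply left out of the custom schema, which is also allowed.
changed_fields = find_changed_base_fields(base_entities, custom_entities)
print(changed_fields)  # ['currency']
```

Fields left out of the custom schema and fields that exist only in the custom schema are not reported, since both cases are permitted.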


Because the base-version schema of each processor is fixed, the tool comes with the base-version schemas of the following three parsers preloaded:
1. Invoice Parser
2. Purchase Order Parser
3. Contract Parser


The customer’s uptraining schema must be given to this tool as a JSON file. Copy the schema from your notebook and save it to a text file with a .json extension before using this tool.
The tool then shows the differences between the two schemas with color coding.

# Prerequisites
1. Python: Jupyter notebook (Vertex) or Google Colab.

    No permissions for, reference to, or access to any Google project are needed.

2. A valid schema in either the old schema format or the new schema format.

**NOTE**: A schema that mixes the two formats will not work with this tool.
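
For orientation, the two layouts look roughly as follows. This is a sketch under assumptions: the `entityTypes`, `properties`, `type`, and `name` keys are the ones the tool's detection code reads, while the `valueType` key in the new format and the helper `detect_format` are illustrative and hypothetical.

```python
# Old format: each item of "entityTypes" is itself a field, keyed by "type".
old_format = {
    "entityTypes": [
        {"type": "currency", "baseType": "string", "occurrenceType": "OPTIONAL_ONCE"},
    ]
}

# New format: fields are nested under "properties" and keyed by "name".
# The "valueType" key is an assumption made for illustration.
new_format = {
    "entityTypes": [
        {
            "properties": [
                {
                    "name": "currency",
                    "valueType": "string",
                    "occurrenceType": "OPTIONAL_ONCE",
                },
            ]
        }
    ]
}


def detect_format(schema):
    """Return 'new' if the first entity type nests its fields, else 'old'."""
    first_entity = schema["entityTypes"][0]
    return "new" if "properties" in first_entity else "old"


print(detect_format(old_format))  # old
print(detect_format(new_format))  # new
```

Because only the first entry of `entityTypes` is inspected, a schema that combines both layouts would be classified by that first entry alone, which is one way to see why mixed schemas are not handled.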


## Imports

Import necessary Python packages required for processing the JSON schemas.


In [32]:
import json
import pandas as pd
from jsondiff import diff

## User Input

Prompt the user for input to determine the type of processor to be used.


In [46]:
# Prompt for user input
parser = input(
    "Processor name 'I' for Invoice, 'P' for Purchase Order, or 'C' for Contract: "
).upper()

Processor name 'I' for Invoice, 'P' for Purchase Order, or 'C' for Contract:  C


## Load Schema
1. **Copy your schema into the JSON file. Refer to the sample schema.json for an example.**

Load the base schema that matches the processor selected above.

In [62]:
import json


def load_schema(parser_type):
    # Map the parser types to their respective file names
    schema_files = {
        "I": "invoice_schema.json",
        "P": "purchase_order_schema.json",
        "C": "contract_schema.json",
    }

    file_name = schema_files.get(parser_type)
    if not file_name:
        raise ValueError("Invalid parser type")

    with open(file_name, "r") as file:
        schemas = json.load(file)
    top_level_key = next(iter(schemas))
    return schemas[top_level_key]


Base_schema = load_schema(parser)
# print(Base_schema)

If you get an error at this point, there is likely a problem with the schema that you copied over. Ensure that you copied the complete definition and that all braces are matched.
If the file contains valid schema JSON (both the “old” and “new” schema formats are supported), the tool executes normally.

The tool then produces output that analyzes the customer schema against the base processor schema.
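
One way to locate a copy-and-paste problem before running the comparison is to parse the text and report where parsing fails. The following is a minimal sketch using only the standard `json` module; `check_schema_text` is a hypothetical helper, and the sample strings are illustrative.

```python
import json


def check_schema_text(text):
    """Return (ok, message) describing whether `text` parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno point at the position where parsing failed
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"
    return True, "valid JSON"


complete = '{"entityTypes": [{"type": "currency"}]}'
truncated = '{"entityTypes": [{"type": "currency"}'  # closing "]" and "}" missing

print(check_schema_text(complete))
print(check_schema_text(truncated))
```

For a schema saved to disk, the file contents can be read into a string and passed to the same function.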

## Schema Detection Function

This function detects the schema type and converts it into a pandas DataFrame.


In [56]:
def schema_detect(s):
    """
    Detects the schema type and converts it to a pandas DataFrame.

    Parameters:
    s (dict): A dictionary representing the JSON schema.

    Returns:
    DataFrame: A pandas DataFrame containing the schema entities.
    """
    try:
        df1 = pd.DataFrame(s["entityTypes"][0]["properties"])
        print("New type schema")
        df1.rename(columns={"name": "type_schema"}, inplace=True)
    except KeyError:
        print("Old type schema")
        df1 = pd.DataFrame(s["entityTypes"])
        df1.rename(columns={"type": "type_schema"}, inplace=True)
    return df1

## Schema Comparison

Compare the loaded base schema with your custom schema (pasted into `my_schema_json` below) and record, for each entity, which schema it appears in.


In [57]:
# Load the predefined schema and convert to DataFrame
my_schema_json = """ {
  'displayName': 'Base Inv Schema',
  'description': 'Base Inv Schema for uptrain',
   "entityTypes": [
    {
      "type": "currency",
      "baseType": "string",
      "occurrenceType": "OPTIONAL_ONCE"
    },
    {
      "type": "due_date",
      "baseType": "datetime",
      "occurrenceType": "OPTIONAL_ONCE"
    }
        ]
    }
    
"""
my_schema_json = my_schema_json.replace("'", '"')
s2 = json.loads(my_schema_json)

# Load the base schema as DataFrame
df1 = schema_detect(Base_schema)
df2 = schema_detect(s2)

# Merging both DataFrames and getting differences
compare = df1.merge(
    df2, on="type_schema", how="outer", suffixes=["_base", "_2"], indicator=True
)
compare["_merge"] = compare["_merge"].astype("object")
compare["_merge"] = compare["_merge"].replace(
    "right_only", "Schema 2 only ", regex=True
)
compare["_merge"] = compare["_merge"].replace(
    "left_only", "base schema only ", regex=True
)
compare.rename(columns={"_merge": "entity_exists_in"}, inplace=True)

Old type schema
Old type schema
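
The outer merge labels each entity according to the schema it came from. The same bookkeeping can be sketched without pandas using sets. This is an illustration rather than the tool's implementation: `classify_entities` is a hypothetical helper, the entity names are made up, and the labels are modelled on the ones assigned above.

```python
def classify_entities(base_names, custom_names):
    """Map each entity name to 'both', 'base schema only', or 'Schema 2 only'."""
    base, custom = set(base_names), set(custom_names)
    labels = {}
    for name in sorted(base | custom):  # union of keys, as in an outer join
        if name in base and name in custom:
            labels[name] = "both"
        elif name in base:
            labels[name] = "base schema only"
        else:
            labels[name] = "Schema 2 only"
    return labels


labels = classify_entities(
    base_names=["currency", "due_date", "invoice_id"],
    custom_names=["currency", "due_date", "po_reference"],
)
for name, label in labels.items():
    print(f"{name}: {label}")
```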


## Color Coding Differences

Apply color coding to the DataFrame to visualize the differences between the schemas.


In [58]:
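# Define the method to color code the differences
def custom_style1(row):
    # Columns are addressed by position from the end of the row:
    # [-1] entity_exists_in, [-2] occurrenceType_2, [-3] baseType_2,
    # [-4] occurrenceType_base, [-5] baseType_base
    if row.values[-1] == "both" and (
        row.values[-2] != row.values[-4] or row.values[-3] != row.values[-5]
    ):
        color = "lightpink"  # entity in both schemas, but the definitions differ
    elif row.values[-1] != "both":
        color = "lightyellow"  # entity present in only one schema
    else:
        color = "lightgreen"  # entity matches in both schemas
    return ["color:black;background-color: %s" % color] * len(row.values)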


compare.style.apply(custom_style1, axis=1)

| | type_schema | baseType_base | occurrenceType_base | baseType_2 | occurrenceType_2 | entity_exists_in |
|---|---|---|---|---|---|---|
| 0 | document_name | string | OPTIONAL_MULTIPLE | | | base schema only |
| 1 | parties | string | OPTIONAL_MULTIPLE | | | base schema only |
| 2 | agreement_date | datetime | OPTIONAL_MULTIPLE | | | base schema only |
| 3 | effective_date | datetime | OPTIONAL_MULTIPLE | | | base schema only |
| 4 | expiration_date | datetime | OPTIONAL_MULTIPLE | | | base schema only |
| 5 | initial_term | string | OPTIONAL_MULTIPLE | | | base schema only |
| 6 | governing_law | string | OPTIONAL_MULTIPLE | | | base schema only |
| 7 | renewal_term | string | OPTIONAL_MULTIPLE | | | base schema only |
| 8 | notice_to_terminate_renewal | string | OPTIONAL_MULTIPLE | | | base schema only |
| 9 | arbitration_venue | string | OPTIONAL_MULTIPLE | | | base schema only |


# Analyzing Tool Output
The color coding shows the differences between the two schemas:  
* Green→<img src="./images/green.png" width=20 height=7> </img>→ The entity matches exactly in both schemas.  
* Yellow→<img src="./images/yellow.png" width=20 height=7> </img>→ The entity exists in only one schema (either the base schema or the uptrained schema).  
* Pink→<img src="./images/pink.png" width=20 height=7> </img>→ The entity exists in both schemas and there is a <font color="red">mismatch. These fields must be changed in your notebook.</font>  
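
The three rules above can also be written against column names instead of column positions. The following is a minimal sketch, assuming each merged row is available as a mapping with the column names used in the comparison output; `row_color` is a hypothetical helper, not part of the tool.

```python
def row_color(row):
    """Pick the highlight color for one merged row (a mapping keyed by column name)."""
    if row["entity_exists_in"] != "both":
        return "lightyellow"  # entity present in only one schema
    same_definition = (
        row["baseType_base"] == row["baseType_2"]
        and row["occurrenceType_base"] == row["occurrenceType_2"]
    )
    return "lightgreen" if same_definition else "lightpink"


matching = {
    "baseType_base": "string", "occurrenceType_base": "OPTIONAL_ONCE",
    "baseType_2": "string", "occurrenceType_2": "OPTIONAL_ONCE",
    "entity_exists_in": "both",
}
mismatched = dict(matching, occurrenceType_2="OPTIONAL_MULTIPLE")
base_only = dict(
    matching, baseType_2=None, occurrenceType_2=None,
    entity_exists_in="base schema only",
)

print(row_color(matching))    # lightgreen
print(row_color(mismatched))  # lightpink
print(row_color(base_only))   # lightyellow
```

Looking columns up by name keeps the rule readable and avoids depending on the column order of the merged table.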

The entities are listed in the first column. The second and third columns show the base-version schema details (the schema preloaded in this tool). The fourth and fifth columns show the uptrained schema details that were provided as input to this tool.

The last column indicates where each entity exists: in only one of the schemas or in both. A value of “base schema only” means the entity exists only in the base-version schema and not in the uptrained schema; “Schema 2 only” means the reverse.


<img src='./images/schema_comparasion.png' width=1000 height=800></img>

The pink rows indicate fields that must be changed in your uptraining notebook to be consistent with the base processor’s `baseType` and `occurrenceType`.