# Schema Converter Tool User’s Guide

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.


## Objective

This tool converts an old-type schema into a new-type schema. It takes two JSON files as input: the old schema and the base processor schema. The resulting new schema can be used for up-training the processor.
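
To make the two formats concrete, here is a minimal sketch of the old-type shape, inferred from the fields the conversion code in this guide reads (`type`, `baseType`, `occurrenceType`). The entity names and the `group_by_parent` helper are illustrative assumptions and are not part of the tool.

```python
# Old-type schema: a flat list; nested entities are written as "parent/child".
old_style = {
    "displayName": "my schema name",
    "entityTypes": [
        {"type": "total_amount", "baseType": "money", "occurrenceType": "OPTIONAL_ONCE"},
        {"type": "line_item/amount", "baseType": "money", "occurrenceType": "OPTIONAL_ONCE"},
        {"type": "line_item/quantity", "baseType": "number", "occurrenceType": "OPTIONAL_ONCE"},
    ],
}


def group_by_parent(entity_types):
    """Hypothetical helper: split flat old-type names into top-level and nested groups."""
    top_level, nested = [], {}
    for entity in entity_types:
        name = entity["type"]
        if "/" in name:
            parent = name.split("/")[0]
            nested.setdefault(parent, []).append(name)
        else:
            top_level.append(name)
    return top_level, nested


top_level, nested = group_by_parent(old_style["entityTypes"])
print(top_level)  # ['total_amount']
print(nested)     # {'line_item': ['line_item/amount', 'line_item/quantity']}
```

In the new-type format, each parent (here `line_item`) becomes its own entry in `entityTypes` with a `properties` list, and the document-level entity type gets a property that refers to it.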

**Old Processor Schema**: A JSON file that holds the schema of a specific `processor_version` of a DocAI processor, which is different from the current processor. Refer to the [public documentation](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1beta3.services.document_service.DocumentServiceClient#google_cloud_documentai_v1beta3_services_document_service_DocumentServiceClient_get_dataset_schema) to download the schema. This schema is referred to as `old_schema` in this tool.
 
**Base Processor Schema**: A JSON file that holds the schema of a specific `processor_version` of a DocAI processor. Refer to the [public documentation](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1beta3.services.document_service.DocumentServiceClient#google_cloud_documentai_v1beta3_services_document_service_DocumentServiceClient_get_dataset_schema) to download the base processor schema. Before up-training, you need to have the latest (new) schema.
 
**New Schema**: A JSON file, `new_schema`, that combines the old processor schema and the base processor schema. It contains the entities that exist in both schemas (their intersection) and the entities that exist only in the old processor schema.

<img src="./images/new_schema_json.png">  

In the screenshot below:
* **both**: the entity exists in both the old processor schema and the base processor schema
* **custom schema only**: the entity comes from the old processor schema only

<img src="./images/new_schema_dataframe.png">
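
The labels in the screenshot come from comparing entity names across the two schemas. A minimal sketch of that comparison using plain Python sets follows. The tool itself uses a pandas outer merge, and the `classify_entities` helper and the entity names below are made up for illustration.

```python
def classify_entities(base_names, custom_names):
    """Label each entity name by which schema(s) contain it."""
    base, custom = set(base_names), set(custom_names)
    labels = {}
    for name in sorted(base | custom):
        if name in base and name in custom:
            labels[name] = "both"
        elif name in custom:
            labels[name] = "custom schema only"
        else:
            labels[name] = "base schema only"
    return labels


labels = classify_entities(
    base_names=["total_amount", "supplier_name"],
    custom_names=["total_amount", "delivery_date"],
)
for name, label in labels.items():
    print(f"{name}: {label}")
# delivery_date: custom schema only
# supplier_name: base schema only
# total_amount: both
```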

## Prerequisites 

1. Knowledge of Python and I/O operations.

2. A Python environment: Jupyter notebook (Vertex) or Google Colab.

3. No permissions for, reference to, or access to any Google project are needed.

4. To get the processor schema in JSON format, refer to the [get_dataset_schema](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1beta3.services.document_service.DocumentServiceClient#google_cloud_documentai_v1beta3_services_document_service_DocumentServiceClient_get_dataset_schema) documentation.

5. The current up-training schema in the old format (JSON file).


## Tool Installation Procedure

The tool consists of some Python code. It can be loaded and run in either of two ways:

1. From Google Colab: make your own copy of this template, or
2. Copy the code from the appendix of this document into a Google Colab or Vertex notebook.


## Tool Operation Procedure

1. Copy the path of your old schema JSON file and paste it into `old_schema_path`, as shown below.

2. Copy the path of your base processor schema JSON file and paste it into `BASE_SCHEMA_JSON_PATH`, as shown below.

3. After updating the paths, run the entire code. The new schema JSON file should be created in your current working directory and can be used for up-training the processor.
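
Before running the full notebook, it can help to confirm that both paths point to readable JSON files. The following is a minimal sketch of such a check. `load_schema_file` is a hypothetical helper, not part of the tool, which reads the files directly with `open` and `json.loads`.

```python
import json
from pathlib import Path


def load_schema_file(path):
    """Read a schema JSON file and fail early with a readable message."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"Schema file not found: {file_path}")
    with file_path.open("r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, dict):
        raise ValueError(f"Expected a JSON object in {file_path}")
    return data
```

With this helper, `old_schema = load_schema_file(old_schema_path)` either returns the parsed dictionary or stops with an error that names the offending file.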

## Run the code

### Importing required libraries

In [1]:
import json
import pandas as pd
import copy

In [2]:
old_schema_path = "old_schema.json"
with open(old_schema_path, "r") as f:
    old_schema = json.loads(f.read())

In [4]:
BASE_SCHEMA_JSON_PATH = "base_processor_version_info.json"
with open(BASE_SCHEMA_JSON_PATH, "r") as f:
    base_processor_version = json.loads(f.read())[
        "documentSchema"
    ]  # If base processor version is available
    # base_processor_version=json.loads(f.read()) # If directly schema is available
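
The cell above handles two input shapes by commenting one line in or out: a processor version JSON that wraps the schema under `documentSchema`, or a file that contains the schema directly. A minimal sketch of a helper that accepts either shape is shown below. `extract_document_schema` is a hypothetical name, and the check for `entityTypes` is an assumption based on how the schema is used later in this guide.

```python
def extract_document_schema(payload):
    """Return the schema whether or not it is wrapped in a processor version."""
    if "documentSchema" in payload:
        return payload["documentSchema"]
    if "entityTypes" in payload:
        return payload
    raise KeyError("Neither 'documentSchema' nor 'entityTypes' found in the JSON payload")


# Illustrative payloads, not real processor output
wrapped = {"displayName": "example version", "documentSchema": {"entityTypes": []}}
bare = {"entityTypes": []}

print(extract_document_schema(wrapped))  # {'entityTypes': []}
print(extract_document_schema(bare))     # {'entityTypes': []}
```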

In [5]:
def schema_detect(s):
    flag = False
    # checking whether schema is old or new type and converting to pandas dataframe
    df1 = pd.DataFrame()
    df_new2 = None
    for i in range(len(s["entityTypes"])):
        if "properties" in s["entityTypes"][i].keys():
            flag = True
            break
    if flag:
        # print('New Type Schema')
        for i in range(len(s["entityTypes"])):
            if "properties" in s["entityTypes"][i].keys():
                for j in range(len(s["entityTypes"][i]["properties"])):
                    s["entityTypes"][i]["properties"][j] = {
                        ("".join(e for e in k if e.isalnum())).lower(): v
                        for k, v in s["entityTypes"][i]["properties"][j].items()
                    }
                    # print(s['entityTypes'][i]['properties'])
                # Append all properties of this entity type in one step, so that
                # none are skipped; duplicates are dropped further below
                df1 = pd.concat(
                    [df1, pd.DataFrame(s["entityTypes"][i]["properties"])]
                )
        if "propertymetadata" in df1.columns:
            df1.drop(["propertymetadata"], axis=1, inplace=True)
        df1.rename(columns={"name": "type_schema"}, inplace=True)
        df1.drop_duplicates(inplace=True, ignore_index=True)
        # print(df1.head())
    else:
        for i in range(len(s["entityTypes"])):
            s["entityTypes"][i] = {
                ("".join(e for e in k if e.isalnum())).lower(): v
                for k, v in s["entityTypes"][i].items()
            }
        df1 = pd.DataFrame(s["entityTypes"])
        df1.rename(columns={"type": "type_schema"}, inplace=True)
        print("      Old Type Schema")
    return df1


def custom_style1(row):
    if row.values[-1] == "both" and (
        row.values[-2] != row.values[-4] or row.values[-3] != row.values[-5]
    ):
        color = "lightpink"
    elif row.values[-1] != "both":
        color = "lightyellow"
    else:
        color = "lightgreen"
    return ["color:black;background-color: %s" % color] * len(row.values)


base_schema_df1 = schema_detect(base_processor_version)
base_schema_dict = base_schema_df1.set_index("type_schema").T.to_dict()

new_schema = dict()
new_schema["displayName"] = old_schema["displayName"]
new_schema["description"] = old_schema["description"]
new_schema["metadata"] = base_processor_version["metadata"]
new_schema["entityTypes"] = [
    {
        "name": base_processor_version["entityTypes"][0]["name"],
        "baseTypes": base_processor_version["entityTypes"][0]["baseTypes"],
        "properties": list(),
    }
]
entityTypes = [base_processor_version["entityTypes"][0]["name"]]
for i in old_schema["entityTypes"]:
    if "/" in i["type"]:
        if i["type"].split("/")[0] not in entityTypes:
            temp = dict()
            temp["name"] = i["type"].split("/")[0]
            temp["baseTypes"] = ["object"]
            temp["properties"] = list()
            new_schema["entityTypes"].append(temp)
            entityTypes.append(i["type"].split("/")[0])
            temp2 = dict()
            temp2["name"] = i["type"].split("/")[0]
            temp2["valueType"] = i["type"].split("/")[0]
            temp2["occurrenceType"] = "OPTIONAL_MULTIPLE"
            new_schema["entityTypes"][0]["properties"].append(temp2)
for i in new_schema["entityTypes"]:
    for j in old_schema["entityTypes"]:
        if (
            "/" not in j["type"]
            and i["name"] == base_processor_version["entityTypes"][0]["name"]
        ):
            temp = {}
            temp["name"] = j["type"]
            if j["type"] in base_schema_dict.keys():
                temp["valueType"] = base_schema_dict[j["type"]]["valuetype"]
                temp["occurrenceType"] = base_schema_dict[j["type"]]["occurrencetype"]
            else:
                temp["valueType"] = j["baseType"]
                temp["occurrenceType"] = j["occurrenceType"]

            i["properties"].append(temp)
        else:
            if i["name"] == j["type"].split("/")[0]:
                temp = {}
                temp["name"] = j["type"]
                if j["type"] in base_schema_dict.keys():
                    temp["valueType"] = base_schema_dict[j["type"]]["valuetype"]
                    temp["occurrenceType"] = base_schema_dict[j["type"]][
                        "occurrencetype"
                    ]
                else:
                    temp["valueType"] = j["baseType"]
                    temp["occurrenceType"] = "OPTIONAL_ONCE"
                i["properties"].append(temp)
new_schema2 = copy.deepcopy(new_schema)
my_schema_df2 = schema_detect(new_schema)
# Merging both the data frame and getting differences
compare = base_schema_df1.merge(
    my_schema_df2,
    on="type_schema",
    how="outer",
    suffixes=["_base", "_2"],
    indicator=True,
)
# Convert the categorical merge indicator to plain strings before relabeling
compare["_merge"] = (
    compare["_merge"]
    .astype(str)
    .replace({"right_only": "custom schema only", "left_only": "base schema only"})
)
compare.rename(columns={"_merge": "entity_exists_in"}, inplace=True)
compare.style.apply(custom_style1, axis=1)
new_schema_file_name = "new_schema.json"
with open(new_schema_file_name, "w") as f:
    f.write(json.dumps(new_schema2, ensure_ascii=False))
print(new_schema2)

{'displayName': 'my schema name', 'description': 'my new schema for uptrain', 'metadata': {'prefixedNamingOnProperties': True}, 'entityTypes': [{'name': 'invoice_document_type', 'baseTypes': ['document'], 'properties': [{'name': 'line_item', 'valueType': 'line_item', 'occurrenceType': 'OPTIONAL_MULTIPLE'}, {'name': 'total_amount', 'valueType': 'money', 'occurrenceType': 'OPTIONAL_ONCE'}, {'name': 'purchase_order_date', 'valueType': 'datetime', 'occurrenceType': 'OPTIONAL_ONCE'}, {'name': 'purchase_order_id', 'valueType': 'string', 'occurrenceType': 'OPTIONAL_ONCE'}, {'name': 'ship_to_address', 'valueType': 'address', 'occurrenceType': 'OPTIONAL_ONCE'}, {'name': 'ship_to_name', 'valueType': 'string', 'occurrenceType': 'OPTIONAL_ONCE'}, {'name': 'delivery_date', 'valueType': 'datetime', 'occurrenceType': 'OPTIONAL_ONCE'}]}, {'name': 'line_item', 'baseTypes': ['object'], 'properties': [{'name': 'line_item/amount', 'valueType': 'money', 'occurrenceType': 'OPTIONAL_ONCE'}, {'name': 'line_it