First, we create a new project and prepare the dataset in `.jsonl` format.
The required data files are already in place, so we can proceed directly.

In [None]:
from seed import *
CreateProject(name="amazon_google", workspace="../")
# Prepare data as `.jsonl` files if not exists
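
For reference, a `.jsonl` (JSON Lines) file stores one JSON object per line. The sketch below shows one way to write and read such a file with the standard library. The record layout (`entity1`, `entity2`, `is_same`) is an assumption that mirrors the inputs and outputs configured later in this notebook, not a documented SEED schema, and `write_jsonl` / `read_jsonl` are hypothetical helpers.

```python
import json
import os
import tempfile

def write_jsonl(records, path):
    """Write an iterable of dicts to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical record: the field names are assumed, not taken from SEED's docs.
sample_records = [
    {
        "entity1": {"title": "Sony VCL-DH1774", "manufacturer": "Sony", "price": 29.99},
        "entity2": {"title": "Sony VCL-DH1758", "manufacturer": "Sony", "price": 20.99},
        "is_same": 0,
    }
]

# Write to a temporary location so the real data folder is left untouched.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
write_jsonl(sample_records, sample_path)
round_trip = read_jsonl(sample_path)
```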

After the project is created, a default `config.json` is placed in the project's root folder.
We can modify it to fit our entity resolution task.
The most important parameters are `name`, `task_desc`, `inputs`, `outputs`, `evaluation_metric` and `evaluation_path`; they are application-specific and must be defined by the user.
Here we use only the `Amazon-Google_demo.jsonl` dataset for ease of running.

In [None]:
config = LoadJson("config.json")
config = config | {
    "name": "entity_resolution",
    "task_desc": "Given two products, determine whether they are the same product.",
    "inputs": [
        {
            "name": "entity1",
            "type": "dict",
            "desc": "It contains three attributes: `title`, `manufacturer`, `price`. `title` and `manufacturer` are strings, `price` is float."
        },
        {
            "name": "entity2",
            "type": "dict",
            "desc": "Same as entity1."
        }
    ],
    "outputs": [
        {
            "name": "is_same",
            "type": "bool",
            "desc": "0 if the two products are not identical, 1 if the two products are identical.",
            "default": 0
        }
    ],
    "evaluation_metric": "f1",
    "evaluation_path": "./data/Amazon-Google_demo.jsonl",
}
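
The `config | {...}` expression above uses Python's dict union operator (available from Python 3.9). It returns a new dict in which keys from the right-hand side override those on the left, and the merge is shallow, so a list such as `inputs` is replaced as a whole rather than merged element by element. A minimal sketch, where the contents of `base` are placeholder values and not SEED's actual default config:

```python
# Placeholder defaults, for illustration only.
base = {
    "name": "default",
    "inputs": [{"name": "x"}],
    "evaluation_metric": "accuracy",
}
overrides = {
    "name": "entity_resolution",
    "inputs": [{"name": "entity1"}, {"name": "entity2"}],
}

# Right-hand keys win; keys only present in `base` are kept unchanged.
merged = base | overrides

# Equivalent spelling that also works before Python 3.9.
merged_legacy = {**base, **overrides}
```

Neither form modifies `base` in place, which is why the notebook reassigns the result to `config`.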

If you have labelled training data, as is often the case, you can also provide it with a couple of extra path settings:

In [None]:
config = config | {
    "examples_path": "./data/Amazon-Google_valid.jsonl",
    "labelled_path": "./data/Amazon-Google_train.jsonl",
}

Now we can compile the project with a minimal setup: only the `llmqa` agent, which is enabled by default.

In [None]:
SaveJson(config, "config.json")
CompileProject("./")

After compilation, the agent code is generated under `./projects/amazon_google/agents/`, and you can test it directly on a single instance.

In [None]:
def test_entity_resolution_minimal_example(entity1, entity2):
    from __init__ import entity_resolution
    response = entity_resolution(entity1, entity2)
    if response is None:
        return "Unknown (Probably an error has occurred)"
    return ["The two entities are different!", "The two entities are the same!"][response]

print(test_entity_resolution_minimal_example(
    entity1 = {
        "title": "Sony VCL-DH1774",
        "manufacturer": "Sony",
        "price": 29.99
    },
    entity2 = {
        "title": "Sony VCL-DH1758",
        "manufacturer": "Sony",
        "price": 20.99
    }
))
print(test_entity_resolution_minimal_example(
    entity1 = {
        "title": "Sony VCL-DH1774",
        "manufacturer": "Sony",
        "price": 29.99
    },
    entity2 = {
        "title": "[DH 1774] Sony Variable Conversion Lens",
        "manufacturer": "sony",
        "price": 28.99
    }
))
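
Before calling the generated function on your own records, it can help to check that each entity matches the shape described in the config (`title` and `manufacturer` as strings, `price` as a float). The helper below is a hypothetical sketch and not part of SEED. It is deliberately strict, so an integer price is reported as a problem.

```python
# Field names and types follow the `desc` of `entity1` in the config above.
EXPECTED_FIELDS = {"title": str, "manufacturer": str, "price": float}

def find_entity_problems(entity):
    """Return a list of human-readable problems; an empty list means the entity looks fine."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in entity:
            problems.append(f"missing field: {field}")
        elif not isinstance(entity[field], expected_type):
            actual = type(entity[field]).__name__
            problems.append(f"{field} should be {expected_type.__name__}, got {actual}")
    return problems

good = {"title": "Sony VCL-DH1774", "manufacturer": "Sony", "price": 29.99}
bad = {"title": "Sony VCL-DH1774", "price": "29.99"}

good_problems = find_entity_problems(good)
bad_problems = find_entity_problems(bad)
```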

We can also batch-evaluate all data in the file set as `evaluation_path` (here `Amazon-Google_demo.jsonl`) by using `evaluation`.

In [None]:
from evaluation import *
evaluate_entity_resolution()
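
The config sets `evaluation_metric` to `f1`. As a reminder of what that number means, here is a minimal sketch of the standard F1 score for binary labels. It illustrates the textbook definition and is not claimed to be how the `evaluation` module computes it; `f1_score` and the toy labels are made up for the example.

```python
def f1_score(labels, predictions):
    """Harmonic mean of precision and recall for binary labels (1 = same product)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)        # true positives
    fp = sum(1 for y, p in pairs if not y and p)    # false positives
    fn = sum(1 for y, p in pairs if y and not p)    # false negatives
    if tp == 0:
        # No correct positive prediction: F1 is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy data: 2 true positives, 1 false positive, 1 false negative.
toy_labels = [1, 0, 1, 1, 0]
toy_predictions = [1, 0, 0, 1, 1]
toy_f1 = f1_score(toy_labels, toy_predictions)
```

With these toy values, precision and recall are both 2/3, so the F1 score is also 2/3.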

After evaluation, a profile (`profile.json`) is generated under the project root.

In [None]:
PrintJson(LoadJson("profile.json"))

In this example, since we have labelled data, we can fine-tune a model on it; the model can then be used by the `cache` and `model` agents.

In [None]:
# ! python train_model.py
# Training can take quite some time; alternatively, download the trained checkpoint here:
# https://drive.google.com/file/d/16IORSgLIwtfFFqBojXt3BAEAPzRwgdNq/view?usp=sharing
# The downloaded model checkpoint folder should be placed under `amazon_google/ckpts/`.
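
If you download the checkpoint instead of training, a quick sanity check that the folder exists and is not empty can save a failed compile later. The sketch below uses only `pathlib`. `checkpoint_ready` is a hypothetical helper, and the file name `weights.bin` is a placeholder used purely for the demonstration, not the real checkpoint layout.

```python
import tempfile
from pathlib import Path

def checkpoint_ready(ckpt_dir):
    """Return True if `ckpt_dir` is an existing directory containing at least one entry."""
    path = Path(ckpt_dir)
    return path.is_dir() and any(path.iterdir())

# Demonstration on a temporary directory rather than the real `./ckpts/amazon_google`.
with tempfile.TemporaryDirectory() as tmp:
    empty_ok = checkpoint_ready(tmp)                 # directory exists but is empty
    (Path(tmp) / "weights.bin").write_bytes(b"")     # placeholder file
    filled_ok = checkpoint_ready(tmp)                # now contains one entry

missing_ok = checkpoint_ready(Path(tmp) / "does_not_exist")
```

In the notebook you would call `checkpoint_ready("./ckpts/amazon_google")` before updating the config.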

Now we can set up the model path in the config and turn off online training.

In [None]:
config = config | {
    "cache_frozen_ckpt": "./ckpts/amazon_google",
    "model_initial_ckpt": "./ckpts/amazon_google",
    "model_sync_off": True,
    "activate_model": True,
    "model_confidence_ratio": 0.0,
    "model_confidence_default": 0.0,
}
SaveJson(config, "config.json")
CompileProject("./")
from evaluation import *
evaluate_entity_resolution()

In [None]:
PrintJson(LoadJson("profile.json"))

In this demo example, the trained model is good enough (this is usually not the case for larger datasets).
If you want to further optimize the hyperparameters, use `HyperparameterTuning` to search for the best configuration.

In [None]:
HyperparameterTuning("./")