# Notebook to Reconcile Collection Metadata

This notebook reconciles the collections in `/ingestion-data/collections`. For each collection, it retrieves the `summaries` value from the API, merges it into the existing collection in `veda-data`, and posts the updated collection to the API.

In [None]:
import glob
import json
import requests
from cognito_client import CognitoClient

Set `testing_mode` to `True` when testing and `False` otherwise. When `testing_mode` is `True`, the notebook runs against the `dev` endpoints; otherwise it runs against the `staging` endpoints.

In [None]:
testing_mode = True

The following cell finds the collection JSON files in the `collections` directory and saves each collection's `id`, together with its file path, to a list.

In [None]:
local_collections_path = (
    "../ingestion-data/staging/collections/*.json"
    if testing_mode
    else "../ingestion-data/production/collections/*.json"
)

json_file_paths = glob.glob(local_collections_path)

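def read_json(file_path):
    # Use a context manager so each file handle is closed after reading
    with open(file_path, "r", encoding="utf-8") as file:
        return json.load(file)


# Keep only files that define an "id"; record the path alongside the id
file_paths_and_collection_ids = [
    {"filePath": file_path, "collectionId": data["id"]}
    for file_path in json_file_paths
    if "id" in (data := read_json(file_path))
]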
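
To illustrate what the cell above produces, here is a minimal sketch that runs the same kind of scan against a temporary directory. The two sample files and their contents are hypothetical stand-ins for the real collection files; a file without an `id` is skipped.

```python
import glob
import json
import os
import tempfile

# Hypothetical sample files standing in for the real collections directory
samples = {
    "a.json": {"id": "collection-a", "title": "A"},
    "b.json": {"title": "this file has no id"},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    for name, content in samples.items():
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            json.dump(content, f)

    found = []
    for path in sorted(glob.glob(os.path.join(tmp_dir, "*.json"))):
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
        # Only files that define an "id" are kept
        if "id" in data:
            found.append({"filePath": path, "collectionId": data["id"]})

print([entry["collectionId"] for entry in found])  # ['collection-a']
```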

Have your Cognito `username` and `password` ready. The following cell sets up the Cognito client and retrieves a token that will be used to access the STAC Ingestor API.

In [None]:
dev_endpoint = "https://dev.openveda.cloud/"  # trailing slash: paths are appended directly
dev_client_id = "CHANGE ME"
dev_user_pool_id = "CHANGE ME"
dev_identity_pool_id = "CHANGE ME"

staging_endpoint = "https://staging-stac.delta-backend.com/"
staging_client_id = "CHANGE ME"
staging_user_pool_id = "CHANGE ME"
staging_identity_pool_id = "CHANGE ME"

ingestor_staging_url = "https://ig9v64uky8.execute-api.us-west-2.amazonaws.com/staging/"
ingestor_dev_url = "https://dev.openveda.cloud/"  # trailing slash: paths are appended directly

if testing_mode:
    STAC_INGESTOR_API = ingestor_dev_url
    VEDA_STAC_API = dev_endpoint
else:
    STAC_INGESTOR_API = ingestor_staging_url
    VEDA_STAC_API = staging_endpoint

client = CognitoClient(
    client_id=dev_client_id if testing_mode else staging_client_id,
    user_pool_id=dev_user_pool_id if testing_mode else staging_user_pool_id,
    identity_pool_id=dev_identity_pool_id if testing_mode else staging_identity_pool_id,
)
_ = client.login()
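
Later cells build request URLs by appending paths such as `api/stac/collections/{collection_id}` directly to these base URLs, so a base URL without a trailing slash yields a malformed address. The sketch below shows the difference and one way to make the join robust. `build_url` and `example-id` are hypothetical names used only for illustration; they are not part of this notebook.

```python
def build_url(base_url, path):
    # Hypothetical helper: normalise the slash between base and path
    return f"{base_url.rstrip('/')}/{path.lstrip('/')}"


base_without_slash = "https://dev.openveda.cloud"
path = "api/stac/collections/example-id"

# Plain concatenation fuses the host name and the path
print(f"{base_without_slash}{path}")
# https://dev.openveda.cloudapi/stac/collections/example-id

# The helper gives the same result with or without the trailing slash
print(build_url(base_without_slash, path))
print(build_url(base_without_slash + "/", path))
# https://dev.openveda.cloud/api/stac/collections/example-id (both lines)
```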

The following cell sets up headers for requests.

In [None]:
TOKEN = client.access_token
authorization_header = f"Bearer {TOKEN}"
headers = {
    "Authorization": authorization_header,
    "content-type": "application/json",
    "accept": "application/json",
}

The following cell defines the functions that will be used to consolidate `summaries` and `links` to reconcile the collection metadata.

In [None]:
def post_reconciled_collection(collection, collection_id):
    """POST the reconciled collection to the ingestor and report the outcome."""
    collection_url = f"{STAC_INGESTOR_API}api/stac/collections/{collection_id}"
    ingest_url = f"{STAC_INGESTOR_API}api/ingest/collections"

    try:
        response = requests.post(ingest_url, json=collection, headers=headers)
        response.raise_for_status()
        if response.status_code == 201:
            print(
                f"Request was successful. Find the updated collection at {collection_url}"
            )
        else:
            # raise_for_status() already raised for 4xx/5xx, so this is a non-error status other than 201
            print(
                f"Updating {collection_id} returned an unexpected status code: {response.status_code}"
            )
    except requests.RequestException as e:
        print(
            f"Updating {collection_id} failed. An error occurred during the request: {e}"
        )
    except Exception as e:
        print(
            f"An unexpected error occurred while trying to update {collection_id}: {e}"
        )


def merge_summaries(existing_summaries, retrieved_summaries):
    """Add retrieved summary keys that are missing locally; existing local values are kept."""
    merged_summaries_dict = existing_summaries.copy()

    if retrieved_summaries:
        for key, value in retrieved_summaries.items():
            merged_summaries_dict.setdefault(key, value)

    return merged_summaries_dict


def retain_external_links(existing_links, retrieved_links):
    """Append retrieved `external` links whose href is not already among the existing links."""
    unique_hrefs = set(link.get("href") for link in existing_links)
    additional_external_links = [
        link
        for link in retrieved_links
        if link.get("rel") == "external" and link.get("href") not in unique_hrefs
    ]

    retained_links = existing_links + additional_external_links
    return retained_links
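
The merge rules are easiest to see on small inputs. The sketch below restates the two helpers in compact form so it runs on its own, then applies them to hypothetical sample data (the summary keys and `example.com` links are made up). Local values win on key conflicts, and only `external` links with a new `href` are added.

```python
def merge_summaries(existing_summaries, retrieved_summaries):
    merged = existing_summaries.copy()
    for key, value in (retrieved_summaries or {}).items():
        # setdefault only fills keys that are missing locally
        merged.setdefault(key, value)
    return merged


def retain_external_links(existing_links, retrieved_links):
    known_hrefs = {link.get("href") for link in existing_links}
    extra = [
        link
        for link in retrieved_links
        if link.get("rel") == "external" and link.get("href") not in known_hrefs
    ]
    return existing_links + extra


# Hypothetical sample data
local = {"datetime": ["2020-01-01T00:00:00Z"]}
remote = {"datetime": ["2021-01-01T00:00:00Z"], "cog_default": {"min": 0, "max": 1}}
merged = merge_summaries(local, remote)
print(merged["datetime"])  # ['2020-01-01T00:00:00Z'] (the local value is kept)
print(sorted(merged))      # ['cog_default', 'datetime']

local_links = [{"rel": "license", "href": "https://example.com/license"}]
remote_links = [
    {"rel": "self", "href": "https://example.com/collections/x"},
    {"rel": "external", "href": "https://example.com/docs"},
    {"rel": "external", "href": "https://example.com/license"},
]
links = retain_external_links(local_links, remote_links)
print([link["href"] for link in links])
# ['https://example.com/license', 'https://example.com/docs']
```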

The following cell loops through `file_paths_and_collection_ids`, retrieves the `summaries` and `links` for each existing collection from the API, merges them into the local collection, and publishes the updated collection to the ingestor's `api/ingest/collections` endpoint.

In [None]:
for item in file_paths_and_collection_ids:
    collection_id = item["collectionId"]
    file_path = item["filePath"]

    if VEDA_STAC_API == dev_endpoint:
        url = f"{VEDA_STAC_API}api/stac/collections/{collection_id}"
    elif VEDA_STAC_API == staging_endpoint:
        url = f"{VEDA_STAC_API}collections/{collection_id}"

    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        json_response = response.json()

        retrieved_summaries = json_response.get("summaries", {})
        retrieved_links = json_response.get("links", [])  # links is a list

        with open(file_path, "r", encoding="utf-8") as file:
            collection = json.load(file)

            existing_summaries = collection.get("summaries", {})
            # Default to a list: retain_external_links concatenates with `+`
            existing_links = collection.get("links", [])

            collection["summaries"] = merge_summaries(
                existing_summaries, retrieved_summaries
            )
            collection["links"] = retain_external_links(existing_links, retrieved_links)

        # Publish the updated collection to the ingestor's `api/ingest/collections` endpoint
        post_reconciled_collection(collection, collection_id)

    except requests.RequestException as e:
        print(f"An error occurred for collectionId {collection_id}: {e}")
    except Exception as e:
        print(f"An unexpected error occurred for collectionId {collection_id}: {e}")