### General
This notebook explains how to analyze the provenance descriptions in a dataset and how to find good few-shot examples. It uses the "Wiesbaden Central Collection Point" (WCCP) dataset, but it can serve as a reference for any dataset. See the README in this directory for an overview of the current state of event extraction. Generating embeddings and running the prompt both require an API key, so set yours first. We recommend experimenting with batch sizes and using techniques that minimize prompt tokens.

### Step 0 - Setup
We recommend using the default paths defined here for each data source.

In [1]:
import os

from common.event_extraction.helpers import load_data_source

RESOURCE_DIR_PATH = os.path.join(".", "resources")
PLOTS_DIR_PATH = os.path.join(".", "plots")
DATA_SOURCE_FILE_PATH = os.path.join("..", "..", "..", "data", "wccp", "wiesbaden-ccp-property-cards-ocr-export-postprocessed-16-11-23.csv")

# Multiple columns can be passed to the prompt for parsing, but only one column serves as the "main value" and is embedded.
EVENT_EXTRACTION_RELEVANT_COLS = ["history-and-ownership", "depot-possessor", "depot-number", "condition-and-repair-record", "arrival-condition", "arrival-date", "exit-date"]

# Load data source and extract relevant columns
data_source = load_data_source(DATA_SOURCE_FILE_PATH, EVENT_EXTRACTION_RELEVANT_COLS)

print(data_source)

                                  history-and-ownership  \
0     Deposited in Kaiseroda Mine (erkers) or Ransba...   
1     Deposited in Kaiseroda Mine (erkers) or Ransba...   
2     Deposited in Kaiseroda Mine (Merkers) or Ransb...   
3     Deposited in Kaiserroda Mine (erkers) or Ransb...   
4     Deposited in Kaiseroda Mine (Merkers) or Ransb...   
...                                                 ...   
6286                                               None   
6287                                               None   
6288                                               None   
6289                                               None   
6290  Brou ht in from Frankfurt Bunker, demanded bec...   

                    depot-possessor depot-number  \
0                              None         None   
1                              None   Case GG 91   
2                              None    Case GG 9   
3                        Cave G-G g            9   
4                     Case G₁ G

### Step 1 - Generate Embeddings

The code below shows how to generate embeddings. We assume that only one column of the dataset needs to be embedded.

We generate and cluster embeddings to find good candidates for manual annotation as few-shot examples. For the WCCP dataset, this has been shown to drastically improve output quality. However, you may consider other techniques for finding few-shot examples, or even completely different prompting techniques.

In [None]:
from common.event_extraction.generate_embeddings import generate_embeddings, prepare_df_for_embedding

EMBEDDING_MODEL = "text-embedding-ada-002"
COL_TO_EMBED = "history-and-ownership"

embeddings = generate_embeddings(
    prepare_df_for_embedding(data_source, COL_TO_EMBED)[COL_TO_EMBED], 
    EMBEDDING_MODEL
)

embedding_cache = data_source[COL_TO_EMBED].to_frame()
embedding_cache["embedding"] = embeddings

embedding_cache.to_csv(os.path.join(RESOURCE_DIR_PATH, "embeddings.csv"), index=False)
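The introduction recommends experimenting with batch sizes and keeping token usage low. How `generate_embeddings` batches its requests is not shown in this notebook, so the following is only a minimal sketch of the general idea, using two hypothetical helpers (`unique_non_empty` and `iter_batches`) and made-up placeholder records: drop missing and duplicate texts first, then split the remainder into fixed-size batches.

```python
from typing import Iterator, List, Sequence


def unique_non_empty(texts: Sequence[object]) -> List[str]:
    """Drop missing/blank values and exact duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for text in texts:
        # Skip None, NaN-like non-strings, and whitespace-only strings
        if not isinstance(text, str) or not text.strip():
            continue
        if text not in seen:
            seen.add(text)
            result.append(text)
    return result


def iter_batches(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts` with at most `batch_size` items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


# Placeholder records (not taken from the dataset)
records = ["record A", None, "record A", "record B", "   "]
to_embed = unique_non_empty(records)
batches = list(iter_batches(to_embed, batch_size=1))
print(batches)
```

Deduplicating before embedding means identical free-text entries are only sent once; the resulting vectors can be mapped back to all matching rows afterwards.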

Now you can visualize your embedded free-text records.

In [None]:
from common.event_extraction.visualization import visualize_data_source
from common.event_extraction.cluster_embeddings import convert_embeddings_to_vstack


visualize_data_source(
    convert_embeddings_to_vstack(embedding_cache),
    os.path.join(PLOTS_DIR_PATH, "embeddings.png")
)

### Step 2 - Cluster Embeddings
After you execute the code snippets below, you need to manually label some free-text records with our set of events. These manually labeled records can then be included in the few-shot prompts in the following steps. The code produces a JSON file containing selected entries from the provenance-describing columns. The entries are chosen with a clustering algorithm so that they cover the dataset as broadly as possible. Label those examples manually with the predefined event types.

<strong>Silhouette Average Maximization & Clustering</strong><br>
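For clustering, we first need to find the best-fitting number of clusters for the KMeans clustering algorithm. For that, we use the silhouette score, which measures how well each point fits its own cluster compared with the nearest other cluster; the cluster count with the highest average score is selected.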

In [None]:
from common.event_extraction.cluster_embeddings import maximize_silhouette_avg
from common.event_extraction.cluster_embeddings import convert_embeddings_to_vstack
from common.event_extraction.cluster_embeddings import cluster_embeddings_kmeans

# Calculates the best number of clusters for k-means clustering
embeddings_vstack = convert_embeddings_to_vstack(embedding_cache)
best_n_clusters = maximize_silhouette_avg(embeddings_vstack)

# Runs k-means clustering with the best number of clusters
(cluster_labels, cluster_centers) = cluster_embeddings_kmeans(embeddings_vstack, best_n_clusters)
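To make the silhouette-based selection more concrete, here is a small pure-Python sketch of the idea: run k-means for several candidate cluster counts and keep the count with the highest average silhouette. It is not the implementation behind `maximize_silhouette_avg`, which is not shown here. All function names are hypothetical, the toy points are synthetic, and the deterministic farthest-first seeding is a simplification chosen for reproducibility.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def farthest_first_centers(points: Sequence[Point], k: int) -> List[List[float]]:
    """Deterministic seeding: start at points[0], then add the point farthest from all centers."""
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(farthest))
    return centers


def kmeans_labels(points: Sequence[Point], k: int, max_iterations: int = 100) -> List[int]:
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = farthest_first_centers(points, k)
    labels: List[int] = []
    for _ in range(max_iterations):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels


def silhouette_average(points: Sequence[Point], labels: Sequence[int]) -> float:
    """Mean of (b - a) / max(a, b) over all points; singleton clusters contribute 0."""
    clusters: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two non-empty clusters")
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for label, members in clusters.items()
            if label != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_cluster_count(points: Sequence[Point], candidates: Sequence[int]) -> Tuple[int, Dict[int, float]]:
    """Return the candidate k with the highest silhouette average, plus all scores."""
    scores = {k: silhouette_average(points, kmeans_labels(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores


# Synthetic toy data: three well-separated groups of four 2-D points each
offsets = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]
toy_points = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (0, 10)] for dx, dy in offsets]
best_k, scores = best_cluster_count(toy_points, range(2, 6))
print(best_k, scores)
```

On real embeddings the same loop is usually run with a library implementation of k-means and the silhouette score rather than hand-written code. The notebook's own `best_n_clusters` and `cluster_labels` from the cell above are unaffected by this sketch.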

You can now also visualize the results using the code below:

In [None]:
from common.event_extraction.visualization import visualize_embedding_clusters

# This visualizes the clustering results from above and filters out clusters with fewer than 10 members
visualize_embedding_clusters(
    cluster_labels,
    embeddings_vstack,
    best_n_clusters,
    os.path.join(PLOTS_DIR_PATH, "embedding_clusters.png"),
    10
)

<strong>Choosing Representatives</strong><br>
Now that we have clustered the embeddings, we can choose the representatives and generate the label template using the interactive label template generator.

In [None]:
from common.event_extraction.cluster_embeddings import choose_representatives

choose_representatives(
    embeddings=embeddings,
    embedded_column_name=COL_TO_EMBED,
    relevant_column_names=EVENT_EXTRACTION_RELEVANT_COLS,
    data_source=data_source,
    cluster_labels=cluster_labels,
    cluster_centers=cluster_centers,
    write_to_file=True,
    output_dir_path=RESOURCE_DIR_PATH
)
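The selection logic inside `choose_representatives` is not shown in this notebook. Since it receives both the cluster labels and the cluster centers, one plausible strategy is to pick, for each cluster, the record whose embedding lies closest to that cluster's center. The sketch below illustrates only that assumption; `nearest_to_centers` is a hypothetical helper and the vectors are toy values.

```python
import math
from typing import Dict, List, Sequence


def nearest_to_centers(
    embeddings: Sequence[Sequence[float]],
    labels: Sequence[int],
    centers: Sequence[Sequence[float]],
) -> Dict[int, int]:
    """Map each cluster id to the index of the member closest to that cluster's center."""
    best: Dict[int, tuple] = {}
    for idx, (vector, label) in enumerate(zip(embeddings, labels)):
        distance = math.dist(vector, centers[label])
        # Keep the closest member seen so far for this cluster
        if label not in best or distance < best[label][0]:
            best[label] = (distance, idx)
    return {label: idx for label, (_, idx) in sorted(best.items())}


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(axis) / len(vectors) for axis in zip(*vectors)]


# Toy 2-D "embeddings" with two clusters of three members each
toy_embeddings = [(0.0, 0.0), (1.0, 0.0), (0.4, 0.1), (10.0, 10.0), (11.0, 10.0), (10.2, 10.3)]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_centers = [
    centroid([v for v, label in zip(toy_embeddings, toy_labels) if label == c])
    for c in (0, 1)
]

representatives = nearest_to_centers(toy_embeddings, toy_labels, toy_centers)
print(representatives)
```

The returned indices can be used to look up the corresponding rows of the data source, so that each cluster contributes one record to the labeling template.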

In [None]:
import json

from etltools.cache import JsonCache
from common.event_extraction.execute_prompt import execute_prompt

PARSING_MODEL = "gpt-4-turbo-preview"
NUMBER_OF_EXAMPLES = 2

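# Builds a mapping from the labeled examples to the embeddings.
# NOTE: build_labeled_examples_to_embedding is not imported in this notebook;
# import it from the project module that defines it before running this cell.
with open(
    os.path.join(RESOURCE_DIR_PATH, "event_extraction_labels_template.json"), "r"
) as labeled_examples_file:
    labeled_examples = json.load(labeled_examples_file)

# Assumption: the helper accepts the parsed JSON; the file handle itself is
# already exhausted after json.load and would yield no data.
labeled_examples_df = build_labeled_examples_to_embedding(labeled_examples)

# Loads prompt from file (the context manager closes the file handle)
with open(os.path.join(RESOURCE_DIR_PATH, "prompt.txt"), "r") as prompt_file:
    prompt_txt = prompt_file.read()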

# Create the result cache once instead of re-creating it for every row
parsing_cache = JsonCache(os.path.join(RESOURCE_DIR_PATH, "parsing_result_cache.json"))

# Execute prompt for each row in the data source
for _, row in data_source.iterrows():
    # Collect the cell value of every relevant column for this row
    row_vals_to_parse = {key: row[key] for key in EVENT_EXTRACTION_RELEVANT_COLS}
    execute_prompt(
        row_vals_to_parse=row_vals_to_parse,
        embedded_col_name=COL_TO_EMBED,
        embeddings=embeddings,
        labeled_examples=labeled_examples_df,
        sys_prompt_txt=prompt_txt,
        cache=parsing_cache,
        model=PARSING_MODEL,
        number_of_examples=NUMBER_OF_EXAMPLES  # use the constant defined above
    )
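How `execute_prompt` uses `embeddings`, `labeled_examples`, and `number_of_examples` internally is not shown here. A common approach to few-shot prompting with embeddings is to rank the labeled examples by cosine similarity to the record being parsed and include only the closest ones. The sketch below illustrates that approach under this assumption; `pick_few_shot_examples` and the example dictionaries are hypothetical and do not reflect the project's actual data format.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_few_shot_examples(
    query_embedding: Sequence[float],
    labeled_examples: List[Dict],
    number_of_examples: int,
) -> List[Dict]:
    """Return the labeled examples most similar to the query, best match first."""
    ranked = sorted(
        labeled_examples,
        key=lambda example: cosine_similarity(query_embedding, example["embedding"]),
        reverse=True,
    )
    return ranked[:number_of_examples]


# Placeholder labeled examples with toy 2-D embeddings
toy_examples = [
    {"text": "example A", "embedding": (1.0, 0.0)},
    {"text": "example B", "embedding": (0.0, 1.0)},
    {"text": "example C", "embedding": (0.7, 0.7)},
]

chosen = pick_few_shot_examples((0.9, 0.1), toy_examples, number_of_examples=2)
print([example["text"] for example in chosen])
```

Selecting only the most similar examples keeps the prompt short, which also helps with the token usage mentioned in the introduction.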