# Clustering Experiment with Erroneous BB

We have put all of our methods together for direct comparison in a single notebook.

Problem: Currently we detect only the individual digits, but we need to recognize these as coherent numbers and be able to assign entries to numbers.

### **Changes from `clustering.ipynb`**
* Added `random_BB_generate` and `erroneous_bounding_boxes` functions.

* `erroneous_bounding_boxes` is implemented within `test_clustering_methods`.

* The two lists, `time_bounding_boxes` and `number_bounding_boxes`, have roughly 5% of their bounding boxes removed and the same number of erroneous boxes added (the code uses a fixed count of 4 per list).

* By commenting out the `erroneous_bounding_boxes` within `test_clustering_methods` the notebook can be run with no erroneous boxes as before.

* Also imported the `random` library.

### Register Images to Start

To start, we need to register images using the `utilities/conversion/apply_homography_to_labels.ipynb` notebook. This should be run before running this notebook. This notebook is built on the assumption that the `data/registered_images` directory has been created and populated. Additionally it assumes that the `data/yolo_data.json` file is created. Both of these are created in the referenced notebook.


#### Import Packages

These are the necessary packages to run the functions and scripts below.


In [1]:
# Standard libraries
import os
import json
import random
from pathlib import Path
from typing import List, Tuple, Literal, Dict

# Third-party libraries
import cv2
import numpy as np
from PIL import Image, ImageDraw
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Local libraries
from utils.annotations import BoundingBox

#### Start By Loading YOLO Data


To start, I want to bring in the YOLO-formatted data for each sheet, along with the respective images. As mentioned above, you must have run the `utilities/conversion/apply_homography_to_labels.ipynb` notebook to generate this YOLO data.


In [2]:
# Load yolo_data.json
PATH_TO_YOLO_DATA = "../../data/yolo_data.json"
PATH_TO_REGISTERED_IMAGES = "../../data/registered_images"
UNIFIED_IMAGE_PATH = (
    "../../data/unified_intraoperative_preoperative_flowsheet_v1_1_front.png"
)
with open(PATH_TO_YOLO_DATA) as json_file:
    yolo_data = json.load(json_file)

# See how many intraoperative images are registered
print(f"Found {len(yolo_data)} sheets in yolo_data.json")

Found 19 sheets in yolo_data.json


Now let's select relevant bounding boxes from the blood pressure and HR zone.

Start by defining functions to convert YOLO bounding box format to pixels (to see if the bounding box is within the region of interest). Then create a function that computes the ROI from the digit detections and returns the bounding boxes inside it, split into time labels and numerical values.
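
As a rough illustration of the conversion step, the sketch below turns one YOLO-format line (`class x_center y_center width height`, all normalized to the range 0 to 1) into pixel corner coordinates. The helper name `yolo_to_pixel_box` is hypothetical; the notebook itself relies on `BoundingBox.from_yolo` from `utils.annotations`, whose internals may differ.

```python
from typing import Tuple


def yolo_to_pixel_box(
    yolo_line: str, image_width: int, image_height: int
) -> Tuple[str, float, float, float, float]:
    """Convert 'class xc yc w h' (normalized) to (class, left, top, right, bottom) in pixels."""
    category, xc, yc, w, h = yolo_line.split()
    # Scale the normalized center and size to pixel units.
    xc, w = float(xc) * image_width, float(w) * image_width
    yc, h = float(yc) * image_height, float(h) * image_height
    # Corners are half a width/height away from the center.
    return category, xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2


print(yolo_to_pixel_box("7 0.5 0.5 0.25 0.5", 800, 600))
# ('7', 300.0, 150.0, 500.0, 450.0)
```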


In [3]:
def get_bp_section_coordinates(
    image_height: int, bboxes: List[BoundingBox], buffer_pixels: int = 5
) -> List[int]:
    """Crops the blood pressure section out of an image of a chart.

    Args:
        image_height (int):
            The height of the image in pixels.
        bboxes (List[BoundingBox]):
            List of BoundingBoxes within this image.
        buffer_pixels (int):
            An optional integer that specifies the number of pixels around the digit detections to
            'zoom out' by. Defaults to 5 pixels.

    Returns:
        Coordinates of the bounding box that contains the blood pressure section.
    """
    # Get bounding boxes from detections and filter non bounding boxes out.
    bboxes: List[BoundingBox] = list(
        filter(lambda ann: isinstance(ann, BoundingBox), bboxes)
    )

    digit_categories: List[str] = [str(i) for i in range(10)]

    # Filter bounding boxes to those which are within the approximate region and are digits.
    bp_legend_digits: List[BoundingBox] = list(
        filter(
            lambda bb: all(
                [
                    bb.top / image_height > 0.2,
                    bb.top / image_height < 0.8,
                    bb.category in digit_categories,
                ]
            ),
            bboxes,
        )
    )
    bp_legend_coordinates: List[int] = list(
        map(
            int,
            [
                min([digit.left for digit in bp_legend_digits]) - buffer_pixels,
                min([digit.top for digit in bp_legend_digits]) - buffer_pixels,
                max([digit.right for digit in bp_legend_digits]) + buffer_pixels,
                max([digit.bottom for digit in bp_legend_digits]) + buffer_pixels,
            ],
        )
    )
    return bp_legend_coordinates


def is_point_in_above(x_center: float, y_center: float, m: float, b: float) -> bool:
    """
    Determine which side of the diagonal line y = mx + b a point falls on.
    In image coordinates y grows downward, so `y_center > y_line` means the point is visually
    below the line, i.e. in the bottom-left region that holds the mmHg/bpm values.

    Args:
        x_center: float, x coordinate of the point
        y_center: float, y coordinate of the point
        m: float, slope of the diagonal line
        b: float, intercept of the diagonal line

    Returns:
        bool, True if y_center is greater than the line's y value at x_center, False otherwise
    """
    # Calculate the y value on the line for the given x_center
    y_line = m * x_center + b
    return y_center > y_line


def select_relevant_bounding_boxes(
    sheet_data: List[str],
    path_to_image: Path,
    show_images: bool = False,
    desired_img_width: int = 800,
    desired_img_height: int = 600,
) -> Tuple[List[BoundingBox], List[BoundingBox]]:
    """
    Given sheet data for bounding boxes in YOLO format, compute the region of interest (ROI) that contains
    the blood pressure and heart rate section from the digit detections.
    Identify bounding boxes that are within this region and draw rectangles around them.
    Return the bounding boxes that are within the region split into two lists: time labels and numerical values.

    Args:
        sheet_data: List of bounding boxes in YOLO format.
        path_to_image: Path to the image file.
        show_images: If True, display the resized image with the ROI and bounding boxes drawn on it.
        desired_img_width: Width in pixels that the image is resized to before processing.
        desired_img_height: Height in pixels that the image is resized to before processing.

    Returns:
        Tuple of Lists of BoundingBox objects within the region, in pixel coordinates of the resized image.
        The first list contains bounding boxes in the top-right region -- representing time labels.
        The second list contains bounding boxes in the bottom-left region -- representing numerical values for mmHg and bpm.
            (bounding_boxes_time, bounding_boxes_numbers)
    """

    # Load the image
    image = cv2.imread(str(path_to_image))

    # Resize the image so that pixel coordinates match the desired width and height
    resized_image = cv2.resize(image, (desired_img_width, desired_img_height))

    x_top_left, y_top_left, x_bottom_right, y_bottom_right = get_bp_section_coordinates(
        image_height=desired_img_height,
        bboxes=[
            BoundingBox.from_yolo(yolo_bb, desired_img_width, desired_img_height)
            for yolo_bb in sheet_data
        ],
        buffer_pixels=2,
    )

    cv2.rectangle(
        resized_image,
        (x_top_left, y_top_left),
        (x_bottom_right, y_bottom_right),
        (255, 255, 0),
        1,
    )

    # Draw the diagonal line of the selected region from top-left to bottom-right
    cv2.line(
        resized_image,
        (x_top_left, y_top_left),
        (x_bottom_right, y_bottom_right),
        (0, 255, 0),
        1,
    )
    # Calculate the slope (m) and intercept (b) of the diagonal line.
    # This will allow us to determine if a bounding box is in the top-right region or bottom-left region
    # Top-right region is where time labels are located
    # Bottom-left region is where numerical values for mmHg and bpm are located
    m = (y_bottom_right - y_top_left) / (x_bottom_right - x_top_left)
    b = y_top_left - m * x_top_left

    # List of bounding boxes in the top-right and bottom-left regions
    bounding_boxes_time = []
    bounding_boxes_numbers = []

    # Process the bounding boxes
    for bounding_box in sheet_data:
        # Bounding boxes are in YOLO format; convert them to pixels
        x_min, y_min, x_max, y_max = list(
            map(
                int,
                BoundingBox.from_yolo(
                    yolo_line=bounding_box,
                    image_width=desired_img_width,
                    image_height=desired_img_height,
                ).box,
            )
        )

        # Check if the bounding box is within the selected region
        if (
            x_min >= x_top_left
            and y_min >= y_top_left
            and x_max <= x_bottom_right
            and y_max <= y_bottom_right
        ):
            # Calculate the center of the bounding box
            x_center_bb = (x_min + x_max) / 2
            y_center_bb = (y_min + y_max) / 2

            # If we want to generalize this function we can add the option to disregard the diagonal line

            # Determine which side of the diagonal the bounding box center falls on
            if is_point_in_above(x_center_bb, y_center_bb, m, b):
                # y_center > y_line: visually below the diagonal, i.e. the bottom-left region (numerical values)
                cv2.rectangle(
                    resized_image, (x_min, y_min), (x_max, y_max), (255, 255, 0), 1
                )
                bounding_boxes_numbers.append(
                    BoundingBox.from_yolo(
                        yolo_line=bounding_box,
                        image_width=desired_img_width,
                        image_height=desired_img_height,
                    )
                )
            else:
                # Bounding box is in the top-right region (time labels)
                cv2.rectangle(
                    resized_image, (x_min, y_min), (x_max, y_max), (255, 0, 255), 1
                )
                bounding_boxes_time.append(
                    BoundingBox.from_yolo(
                        yolo_line=bounding_box,
                        image_width=desired_img_width,
                        image_height=desired_img_height,
                    )
                )

    # Close all OpenCV windows, always do this or it will annoyingly not go away
    # You can also manually quit out with ESC key.
    cv2.destroyAllWindows()

    # If we are showing the images, display the image with the selected region and bounding boxes
    # Bounding boxes in the top-right region (time) are in one color while those in the bottom left (numerical) are in another
    if show_images:
        # Display the image with the selected region and bounding boxes
        resized_image = cv2.cvtColor(resized_image, cv2.COLOR_BGR2RGB)
        resized_image = Image.fromarray(resized_image)
        resized_image.show()

    # Return a tuple of bounding boxes in the top-right and bottom-left regions
    return (bounding_boxes_time, bounding_boxes_numbers)
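
The diagonal split can be confusing because image rows grow downward. The sketch below uses a made-up ROI and a hypothetical helper, `is_below_diagonal`, to show that `y > m * x + b` selects points that are visually below the diagonal (the left-hand legend), while points along the top row fail the test.

```python
def is_below_diagonal(x: float, y: float, m: float, b: float) -> bool:
    """True if (x, y) lies visually below the line y = m*x + b in image coordinates."""
    # Image rows grow downward, so a larger y means lower on the page.
    return y > m * x + b


# Hypothetical ROI corners: top-left (100, 100), bottom-right (700, 500).
x0, y0, x1, y1 = 100, 100, 700, 500
m = (y1 - y0) / (x1 - x0)
b = y0 - m * x0

left_column_point = (120, 400)  # near the left edge, low on the page
top_row_point = (600, 120)      # near the top edge, far to the right

print(is_below_diagonal(*left_column_point, m, b))  # True  -> numerical values
print(is_below_diagonal(*top_row_point, m, b))      # False -> time labels
```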

Create functions for K-means, DBSCAN, and agglomerative clustering.


In [4]:
def cluster_kmeans(
    bounding_boxes: List[BoundingBox], possible_nclusters: List[int]
) -> List[int]:
    """
    Cluster bounding boxes using K-Means clustering algorithm.

    Args:
        bounding_boxes: List of bounding boxes in YOLO format.
        possible_nclusters: List of possible number of clusters to try.

    Returns:
        List of cluster labels.
    """
    # Convert to a NumPy array (using only x_center and y_center)
    data = np.array([box.center for box in bounding_boxes])

    cluster_performance_map = {}
    for number_of_clusters in possible_nclusters:
        if number_of_clusters > len(data):
            # Raising a bare string is a TypeError in Python 3, so raise a proper exception
            raise ValueError(
                f"Number of clusters {number_of_clusters} is greater than number of bounding boxes {len(data)}."
            )
        if number_of_clusters < 1:
            raise ValueError(
                f"Number of clusters {number_of_clusters} must be greater than 0."
            )
        # Apply K-Means
        kmeans = KMeans(
            n_clusters=number_of_clusters,
            init="k-means++",
            n_init=20,
            max_iter=500,
            tol=1e-8,
            random_state=42,
        )
        kmeans.fit(data)

        # Get cluster labels
        labels = kmeans.predict(data)
        silhouette_avg = silhouette_score(data, labels)

        # print(
        #     f"Number of clusters: {number_of_clusters}, Silhouette score: {silhouette_avg}"
        # )

        cluster_performance_map[number_of_clusters] = {
            "score": silhouette_avg,
            "labels": labels,
        }

    # Select the number of clusters with the highest silhouette score, but only if it beats the
    # largest candidate (the expected number of clusters) by at least 0.003; otherwise use the largest candidate.
    n_clusters_max_silhouette = max(
        cluster_performance_map, key=lambda x: cluster_performance_map[x]["score"]
    )
    best_n_clusters = (
        n_clusters_max_silhouette
        if (
            (
                cluster_performance_map[n_clusters_max_silhouette]["score"]
                - cluster_performance_map[max(possible_nclusters)]["score"]
            )
            >= 0.003
        )
        else max(possible_nclusters)
    )
    return cluster_performance_map[best_n_clusters]["labels"]


def dbscan_clustering(
    bounding_boxes: List[BoundingBox], defined_eps: float, min_samples: int
) -> List[int]:
    """
    Cluster bounding boxes using the DBSCAN (density-based spatial clustering) algorithm.

    Args:
        bounding_boxes: List of bounding boxes.
        defined_eps: Maximum distance between two box centers for one to be considered in the neighborhood of the other.
        min_samples: The number of samples (or total weight) in a neighborhood for a point to be considered a core point.

    Returns:
        List of cluster labels.
    """
    # Convert to a NumPy array (using only x_center and y_center)
    data = np.array([box.center for box in bounding_boxes])

    # DBSCAN
    scan = DBSCAN(eps=defined_eps, min_samples=min_samples)
    labels = scan.fit_predict(data)

    return labels

def agglomerative_clustering(
    bounding_boxes: List[BoundingBox], possible_nclusters: List[int]
) -> List[int]:
    """
    Cluster bounding boxes using single-linkage agglomerative clustering.

    Args:
        bounding_boxes: List of bounding boxes.
        possible_nclusters: List of possible number of clusters to try.

    Returns:
        List of cluster labels.
    """
    # make the bounding box data into a NumPy array (using only x_center and y_center)
    data = np.array([box.center for box in bounding_boxes])

    # follow suit of the cluster_kmeans algorithm to measure quality through silhouette scores
    cluster_performance_map = {}
    for number_of_clusters in possible_nclusters:
        if number_of_clusters > len(data):
            raise ValueError(
                f"Number of clusters {number_of_clusters} is greater than number of bounding boxes {len(data)}."
            )
        if number_of_clusters < 1:
            raise ValueError(
                f"Number of clusters {number_of_clusters} must be greater than 0."
            )
        # use agglomerative clustering
        agg = AgglomerativeClustering(n_clusters=number_of_clusters, linkage="single")
        # get labels
        labels = agg.fit_predict(data)
        # compute the silhouette score
        silhouette_avg = silhouette_score(data, labels)

        cluster_performance_map[number_of_clusters] = {
            "score": silhouette_avg,
            "labels": labels,
        }

    # get the number of clusters with the best silhouette score
    n_clusters_max_silhouette = max(
        cluster_performance_map, key=lambda x: cluster_performance_map[x]["score"]
    )


    best_n_clusters = (
        n_clusters_max_silhouette
        if (
            (
                cluster_performance_map[n_clusters_max_silhouette]["score"]
                - cluster_performance_map[max(possible_nclusters)]["score"]
            )
            >= 0.003
        )
        else max(possible_nclusters)
    )
    return cluster_performance_map[best_n_clusters]["labels"]
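
Both `cluster_kmeans` and `agglomerative_clustering` share the same selection rule: default to the largest candidate cluster count, and switch to another candidate only if its silhouette score is at least 0.003 higher. The sketch below isolates that rule; `pick_n_clusters` is a hypothetical helper and the scores are invented for illustration.

```python
from typing import Dict


def pick_n_clusters(scores: Dict[int, float], margin: float = 0.003) -> int:
    """Default to the largest candidate unless another candidate wins by at least `margin`."""
    default_n = max(scores)               # largest candidate, e.g. 42 for time labels
    best_n = max(scores, key=scores.get)  # candidate with the highest silhouette score
    if scores[best_n] - scores[default_n] >= margin:
        return best_n
    return default_n


# Made-up silhouette scores, for illustration only.
print(pick_n_clusters({40: 0.710, 41: 0.712, 42: 0.711}))  # 42: a lead of 0.001 is too small
print(pick_n_clusters({40: 0.750, 41: 0.712, 42: 0.711}))  # 40: a lead of 0.039 is enough
```

Factoring the rule out like this would also remove the duplicated block at the end of the two clustering functions.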


Function to create a result dictionary that we can save as a JSON file to analyze performance.


In [5]:
def create_result_dictionary(
    labels: List[int], bounding_boxes: List[BoundingBox], unit: Literal["mmHg", "mins"]
) -> Dict[int, str]:
    """
    Create a dictionary with cluster labels as keys and the number each cluster spells out as values.

    Args:
        labels: List of cluster labels.
        bounding_boxes: List of bounding boxes.
        unit: Unit suffix to append to each cluster's number. One of ["mmHg", "mins"].

    Returns:
        Dictionary mapping each cluster label to its digits joined left to right plus the unit suffix, e.g. "120_mmHg".
    """
    # Create a dictionary to store labelled elements
    label_dict = {}

    # Iterate over both lists
    for label, element in zip(labels, bounding_boxes):
        label = int(label)
        if label not in label_dict:
            # Create a new list for this label if it doesn't exist
            label_dict[label] = []
        # Append the element to the corresponding label list
        label_dict[label].append(f"{element.category} {element.center[0]}")

    # Sort the lists in the dictionary by x_center
    for key in label_dict:
        label_dict[key] = sorted(label_dict[key], key=lambda x: float(x.split(" ")[1]))
        label_dict[key] = [element.split(" ")[0] for element in label_dict[key]]
        # Turn list of strings into a string
        label_dict[key] = f"{''.join(label_dict[key])}_{unit}"

    return label_dict
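
One caveat when feeding DBSCAN output into this function: scikit-learn's `DBSCAN` assigns the label `-1` to noise points. With `min_samples=1` every point is a core point, so no noise label appears, but with `min_samples=2` isolated boxes are labelled `-1` and would be joined into a single entry. The sketch below is one possible way to skip them; `group_digits` is a hypothetical helper that works on plain `(category, x_center)` tuples rather than `BoundingBox` objects.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_digits(
    labels: Sequence[int], digits: Sequence[Tuple[str, float]], noise_label: int = -1
) -> Dict[int, str]:
    """Join digits per cluster in left-to-right order, skipping the noise label."""
    clusters: Dict[int, List[Tuple[float, str]]] = defaultdict(list)
    for label, (category, x_center) in zip(labels, digits):
        if int(label) == noise_label:
            continue  # unclustered detections do not form a number
        clusters[int(label)].append((x_center, category))
    # Sorting the (x_center, category) tuples orders each cluster's digits left to right.
    return {k: "".join(c for _, c in sorted(v)) for k, v in clusters.items()}


labels = [0, 0, 0, 1, 1, -1, -1]
digits = [("2", 30.0), ("1", 20.0), ("0", 40.0), ("9", 22.0), ("0", 31.0), ("7", 400.0), ("3", 90.0)]
print(group_digits(labels, digits))  # {0: '120', 1: '90'}
```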

Function to generate colors!


In [6]:
def generate_color() -> str:
    """Return a random color as a hex string such as '#1a2b3c', used to outline each cluster."""
    return "#%06x" % random.randint(0, 0xFFFFFF)

Function to generate randomly placed `BoundingBox` instances

In [33]:
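def random_BB_generate(x):
    """
    Generate one erroneous BoundingBox with a random digit category and position.
    The argument `x` is ignored; it exists only so the function can be used with `map`.
    Coordinates assume the 800x600 resized image used elsewhere in this notebook.
    """
    category_int = random.randint(0, 9)
    left_int = random.randint(20, 781)
    top_int = random.randint(20, 581)

    # input generated integers to a 4x6 pixel bounding box
    box = BoundingBox(category=f'{category_int}', left=left_int, top=top_int, right=left_int+4, bottom=top_int+6)
    return box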

Function to create 5% erroneous BB and remove 5% existing BB.

In [34]:
def erroneous_bounding_boxes(
    time_BB: List[BoundingBox], number_BB: List[BoundingBox]
) -> Tuple[List[BoundingBox], List[BoundingBox]]:
    """
    Create roughly 5% erroneous bounding boxes by removing existing time and number boxes
    and generating the same number of random replacements.

    Args:
        time_BB: list of time bounding boxes in BoundingBox format
        number_BB: list of number bounding boxes in BoundingBox format

    Returns:
        Tuple: lists of time and number bounding boxes in BoundingBox format
    """
    # make copies of input bounding box lists to avoid mutating the caller's lists
    time_BB_copy = time_BB.copy()
    number_BB_copy = number_BB.copy()

    # sample 4 boxes from each list (hard-coded stand-in for 5%)
    # note: random.sample raises ValueError if a list holds fewer than 4 boxes
    time_BB_sample = random.sample(time_BB_copy, 4)
    number_BB_sample = random.sample(number_BB_copy, 4)

    # remove the sampled boxes from the copies
    for box in time_BB_sample:
        time_BB_copy.remove(box)
    for box in number_BB_sample:
        number_BB_copy.remove(box)

    # use random bounding box generation to refill removed BBs with erroneous boxes
    time_BB_generate = list(map(random_BB_generate, range(len(time_BB_sample))))
    number_BB_generate = list(map(random_BB_generate, range(len(number_BB_sample))))

    # append BB generated list back to copy with 5% removal
    time_BB_erroneous = time_BB_copy + time_BB_generate
    number_BB_erroneous = number_BB_copy + number_BB_generate

    return (time_BB_erroneous, number_BB_erroneous)
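
The function above swaps a fixed count of 4 boxes per list. If the corruption rate should track the list size, something along the following lines might work. This is only a sketch under assumptions: `corrupt_fraction`, its `fraction` and `seed` parameters, and the `make_fake` callback are all hypothetical names, and it operates on generic items rather than `BoundingBox` objects.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def corrupt_fraction(
    items: List[T], make_fake: Callable[[], T], fraction: float = 0.05, seed: int = 0
) -> List[T]:
    """Drop `fraction` of items at random and append the same number of fakes."""
    rng = random.Random(seed)  # a local RNG keeps runs reproducible
    # Swap at least one item whenever the list is non-empty.
    n_swap = max(1, round(len(items) * fraction)) if items else 0
    drop = set(rng.sample(range(len(items)), n_swap))
    kept = [item for i, item in enumerate(items) if i not in drop]
    return kept + [make_fake() for _ in range(n_swap)]


boxes = [f"box_{i}" for i in range(80)]
noisy = corrupt_fraction(boxes, make_fake=lambda: "fake")
print(len(noisy), noisy.count("fake"))  # 80 4
```

Sampling indices instead of values avoids relying on equality between boxes when removing them, and the seed makes the accuracy numbers repeatable between runs.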

Now let's use these functions to get the relevant bounding boxes and cluster them.


In [35]:
def test_clustering_methods() -> None:
    """
    Test the clustering methods on the YOLO data.
    Saves the clustered images and the clustered bounding boxes to JSON files.

    Returns:
        None
    """
    for method in ["kmeans", "dbscan", "agglomerative"]:
        # Iterate over all images and their bounding boxes
        for sheet, yolo_bbs in yolo_data.items():
            print(f"Sheet: {sheet}")
            full_image_path = os.path.join(PATH_TO_REGISTERED_IMAGES, sheet)
            print(f"Full image path: {full_image_path}")

            # Select the time and number bounding boxes for this sheet
            time_bounding_boxes, number_bounding_boxes = select_relevant_bounding_boxes(
                yolo_bbs, full_image_path
            )

            # make erroneous bounding boxes -- simultaneously add and remove 5% of boxes
            time_bounding_boxes, number_bounding_boxes = erroneous_bounding_boxes(
                time_bounding_boxes, number_bounding_boxes
            )

            # Now we need to cluster the bounding boxes that pertain to the same multi-digit number
            if method == "kmeans":
                time_labels = cluster_kmeans(time_bounding_boxes, [40, 41, 42])
                number_labels = cluster_kmeans(number_bounding_boxes, [18, 19, 20])
            elif method == "dbscan":
                time_labels = dbscan_clustering(
                    time_bounding_boxes, defined_eps=5, min_samples=1
                )
                number_labels = dbscan_clustering(
                    number_bounding_boxes, defined_eps=5, min_samples=2
                )
            elif method == "agglomerative":
                time_labels = agglomerative_clustering(time_bounding_boxes, [40,41,42])
                number_labels = agglomerative_clustering(number_bounding_boxes, [18,19,20])
            else:
                raise ValueError(f"Invalid clustering method: {method}")

            # Create an image object
            image: Image = Image.open(full_image_path)
            image_width, image_height = image.size

            label_color_map = {}
            for i, label in enumerate(time_labels):
                # Get the bounding box
                bounding_box = time_bounding_boxes[i]
                x_min, y_min, x_max, y_max = [
                    (coor / 800) * image_width
                    if i % 2 == 0
                    else (coor / 600) * image_height
                    for i, coor in enumerate(bounding_box.box)
                ]

                # If the label is not in the color map, generate a new color
                if label not in label_color_map:
                    label_color_map[label] = generate_color()

                # Open the image
                draw = ImageDraw.Draw(image)

                draw.rectangle(
                    [
                        x_min,
                        y_min,
                        x_max,
                        y_max,
                    ],
                    outline=label_color_map[label],
                    width=3,
                )

            # Save the image with the bounding boxes to the {method}_clustered_images/time folder
            image.save(f"../../data/{method}_clustered_images/time/{sheet}")

            # Save the clustered bounding boxes to a JSON file
            with open(
                f"../../data/{method}_clustered_images/results/time/{sheet.split('.')[0]}.json",
                "w",
            ) as f:
                json.dump(
                    create_result_dictionary(time_labels, time_bounding_boxes, "mins"),
                    f,
                )

            # Create an image object
            image: Image = Image.open(full_image_path)
            image_width, image_height = image.size
            label_color_map = {}
            for i, label in enumerate(number_labels):
                # Get the bounding box
                bounding_box = number_bounding_boxes[i]
                x_min, y_min, x_max, y_max = [
                    (coor / 800) * image_width
                    if i % 2 == 0
                    else (coor / 600) * image_height
                    for i, coor in enumerate(bounding_box.box)
                ]

                # If the label is not in the color map, generate a new color
                if label not in label_color_map:
                    label_color_map[label] = generate_color()

                # Open the image
                draw = ImageDraw.Draw(image)

                draw.rectangle(
                    [
                        x_min,
                        y_min,
                        x_max,
                        y_max,
                    ],
                    outline=label_color_map[label],
                    width=3,
                )

            # Save the image with the bounding boxes to the {method}_clustered_images/number folder
            image.save(f"../../data/{method}_clustered_images/number/{sheet}")

            # Save the clustered bounding boxes to a JSON file
            with open(
                f"../../data/{method}_clustered_images/results/number/{sheet.split('.')[0]}.json",
                "w",
            ) as f:
                json.dump(
                    create_result_dictionary(
                        number_labels, number_bounding_boxes, "mmHg"
                    ),
                    f,
                )


# Test the clustering methods
test_clustering_methods()

Sheet: RC_0001_intraoperative.JPG
Full image path: ../../data/registered_images\RC_0001_intraoperative.JPG
Sheet: RC_0002_intraoperative.JPG
Full image path: ../../data/registered_images\RC_0002_intraoperative.JPG
Sheet: RC_0003_intraoperative.JPG
Full image path: ../../data/registered_images\RC_0003_intraoperative.JPG
Sheet: RC_0004_intraoperative.JPG
Full image path: ../../data/registered_images\RC_0004_intraoperative.JPG
Sheet: RC_0005_intraoperative.JPG
Full image path: ../../data/registered_images\RC_0005_intraoperative.JPG
Sheet: RC_0006_intraoperative.JPG
Full image path: ../../data/registered_images\RC_0006_intraoperative.JPG
Sheet: RC_0007_intraoperative.JPG
Full image path: ../../data/registered_images\RC_0007_intraoperative.JPG
Sheet: RC_0008_intraoperative.JPG
Full image path: ../../data/registered_images\RC_0008_intraoperative.JPG
Sheet: RC_0009_intraoperative.JPG
Full image path: ../../data/registered_images\RC_0009_intraoperative.JPG
Sheet: RC_0010_intraoperative.JPG
Ful
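
When drawing, the cell above rescales each box from the 800x600 working size back to the original image size by alternating x and y scaling over the four coordinates. A more explicit version of the same arithmetic could look like the sketch below; `scale_box` is a hypothetical helper and the sizes are example values.

```python
from typing import Sequence, Tuple


def scale_box(
    box: Sequence[float], from_size: Tuple[int, int], to_size: Tuple[int, int]
) -> Tuple[float, float, float, float]:
    """Rescale (left, top, right, bottom) from one (width, height) to another."""
    sx = to_size[0] / from_size[0]  # horizontal scale factor
    sy = to_size[1] / from_size[1]  # vertical scale factor
    left, top, right, bottom = box
    return left * sx, top * sy, right * sx, bottom * sy


print(scale_box((100, 60, 104, 66), from_size=(800, 600), to_size=(3200, 2400)))
# (400.0, 240.0, 416.0, 264.0)
```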

#### Analyze accuracy

Below we use assumptions on what we know the labels should represent in both the time and number groups. We check that these values are present within clusters.


In [36]:
# Since this work is done above, we can simply read in from the JSON files created in the previous step and work from there.
for method in ["kmeans", "dbscan", "agglomerative"]:
    print(f"Method: {method}")
    # Paths to the JSON files
    PATH_TO_RESULTS = f"../../data/{method}_clustered_images/results"
    TIME_JSON = os.path.join(PATH_TO_RESULTS, "time")
    NUMBER_JSON = os.path.join(PATH_TO_RESULTS, "number")

    time_wrong_clusters_count = 0
    time_correct_clusters_count = 0
    number_wrong_clusters_count = 0
    number_correct_clusters_count = 0

    # Iterate over all images and their bounding boxes
    for sheet, yolo_bb in yolo_data.items():
        full_image_path = os.path.join(PATH_TO_REGISTERED_IMAGES, sheet)
        time_bounding_boxes, number_bounding_boxes = select_relevant_bounding_boxes(
            yolo_bb, full_image_path
        )
        # Expected cluster values with unit suffixes.
        # Time labels: 42 five-minute marks that wrap every hour (0..55 three times, then 0..25).
        expected_time_values = [f"{(5 * i) % 60}_mins" for i in range(42)]

        # Numerical legend: 30 to 220 in steps of 10.
        expected_number_values = [f"{value}_mmHg" for value in range(30, 230, 10)]

        # Load JSON
        with open(os.path.join(TIME_JSON, f"{sheet.split('.')[0]}.json")) as f:
            time_clusters = json.load(f)

        # Each cluster contains the number (integer) that the cluster represents
        # We know what integers should be represented in the time labels, let's check that they are all there.
        # Count clusters whose value is unexpected or already matched as wrong; expected values that never appear are not counted.
        for cluster, value in time_clusters.items():
            if value not in expected_time_values:
                # Print the sheet, value that is not in the expected values
                # print(f"Time -> Sheet: {sheet}, Value: {value}")
                # We have an erroneous cluster
                time_wrong_clusters_count += 1
            else:
                # We have a correct cluster
                expected_time_values.remove(value)
                time_correct_clusters_count += 1

        # Load JSON
        with open(os.path.join(NUMBER_JSON, f"{sheet.split('.')[0]}.json")) as f:
            number_clusters = json.load(f)

        # Each cluster contains the number (integer) that the cluster represents
        # We know what integers should be represented in the number labels, let's check that they are all there.
        # Count clusters whose value is unexpected or already matched as wrong; expected values that never appear are not counted.
        for cluster, value in number_clusters.items():
            if value not in expected_number_values:
                # print(f"Number -> Sheet: {sheet}, Value: {value}")
                # We have an erroneous cluster
                number_wrong_clusters_count += 1
            else:
                # We have a correct cluster
                expected_number_values.remove(value)
                number_correct_clusters_count += 1

    print(
        f"Time labels: {time_correct_clusters_count} correct clusters, {time_wrong_clusters_count} incorrect clusters. The accuracy is {time_correct_clusters_count / (time_correct_clusters_count + time_wrong_clusters_count) * 100:.2f}%"
    )
    print(
        f"Number labels: {number_correct_clusters_count} correct clusters, {number_wrong_clusters_count} incorrect clusters. The accuracy is {number_correct_clusters_count / (number_correct_clusters_count + number_wrong_clusters_count) * 100:.2f}%"
    )


Method: kmeans
Time labels: 645 correct clusters, 144 incorrect clusters. The accuracy is 81.75%
Number labels: 189 correct clusters, 187 incorrect clusters. The accuracy is 50.27%
Method: dbscan
Time labels: 730 correct clusters, 132 incorrect clusters. The accuracy is 84.69%
Number labels: 310 correct clusters, 53 incorrect clusters. The accuracy is 85.40%
Method: agglomerative
Time labels: 634 correct clusters, 158 incorrect clusters. The accuracy is 80.05%
Number labels: 186 correct clusters, 190 incorrect clusters. The accuracy is 49.47%
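
The accuracy check treats the expected values as a multiset: each expected value can be matched at most once per sheet, and any further cluster with that value counts as wrong. The sketch below expresses the same idea with `collections.Counter` on toy data; `count_matches` is a hypothetical helper, not part of the notebook. Like the cell above, it does not penalize expected values that never appear.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_matches(predicted: Iterable[str], expected: Iterable[str]) -> Tuple[int, int]:
    """Return (correct, wrong), consuming each expected value at most once."""
    remaining = Counter(expected)
    correct = wrong = 0
    for value in predicted:
        if remaining[value] > 0:
            remaining[value] -= 1  # this expected value is now used up
            correct += 1
        else:
            wrong += 1
    return correct, wrong


expected = ["0_mins", "5_mins", "10_mins", "0_mins"]
predicted = ["0_mins", "0_mins", "0_mins", "51_mins", "10_mins"]
correct, wrong = count_matches(predicted, expected)
print(correct, wrong, f"{correct / (correct + wrong):.2%}")  # 3 2 60.00%
```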
