The data is already split into train, test, and validation sets, which contain 8716, 2886, and 2890 images, respectively.
Each entry has two components:
- The image
- The label, saved as a multi-hot encoded list with 31 elements (one per class; an element is 1 if that class is present in the image).
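
To make the encoding concrete, here is a minimal sketch of how a set of active class indices maps to a 31-element multi-hot list. The helper name `to_multi_hot` is hypothetical and only for illustration; it is not part of the dataset or of any library. The indices come from the dictionary further down.

```python
NUM_LABELS = 31  # length of each label list, as stated above


def to_multi_hot(active_indices: list[int], num_labels: int = NUM_LABELS) -> list[int]:
    """Build a multi-hot list: 1 at every active index, 0 everywhere else."""
    encoding = [0] * num_labels
    for i in active_indices:
        encoding[i] = 1
    return encoding


# Hypothetical example: an image showing "Fish" (12) and "Sand" (24).
example = to_multi_hot([12, 24])
print(len(example), sum(example))  # list length and number of active labels
```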

Here is a dictionary of the label encoding: <br>
__Index: name__<br>
0:  Acropora Branching <br>
1:  Acropora Digitate<br>
2:  Acropora Submassive<br>
3:  Acropora Tabular<br>
4:  Algal Assemblage<br>
5:  Algae Halimeda<br>
6:  Algae Coralline<br>
7:  Algae Turf<br>
8:  Altra/Leucospilota<br>
9:  Bleached coral<br>
10:  Blurred<br>
11:  Dead coral<br>
12:  Fish<br>
13:  Homo Sapiens<br>
14:  Human object<br>
15:  Living coral<br>
16:  Non-acropora Millepora<br>
17:  Non-acropora Encrusting<br>
18:  Non-acropora Foliose<br>
19:  Non-acropora Massive<br>
20:  Non-acropora Coral free<br>
21:  Non-acropora Submassive<br>
22:  Rock<br>
23:  Rubble<br>
24:  Sand<br>
25:  Sea cucumber<br>
26:  Sea urchin<br>
27:  Sponges<br>
28:  Syringodium Isoetifolium<br>
29:  Thalassodendron Ciliatum<br>
30:  Useless <br>

If we want to find the images with corals in them, we should focus on the labels of the form "Acropora ...", "Non-acropora ...", and "... coral" (bleached, dead, living). <br>
Those are the following indices: [0,1,2,3,9,11,15,16,17,18,19,20,21] <br>
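
As a minimal sketch of how these indices could be used, the helper below checks whether a single label list has at least one coral-related class set. The name `has_coral` and the two example labels are hypothetical and made up for illustration; they are not taken from the dataset.

```python
# Coral-related label indices, copied from the list above.
CORAL_INDICES = [0, 1, 2, 3, 9, 11, 15, 16, 17, 18, 19, 20, 21]


def has_coral(label: list[int]) -> bool:
    """Return True if at least one coral-related label is active."""
    return any(label[i] == 1 for i in CORAL_INDICES)


# Hypothetical labels for illustration only.
sand_only = [0] * 31
sand_only[24] = 1            # "Sand"

dead_coral_on_rock = [0] * 31
dead_coral_on_rock[11] = 1   # "Dead coral"
dead_coral_on_rock[22] = 1   # "Rock"

print(has_coral(sand_only), has_coral(dead_coral_on_rock))
```

With the Hugging Face `datasets` library, a predicate like this could presumably be passed to `Dataset.filter`, for example `ds["train"].filter(lambda ex: has_coral(ex["label"]))`. That call has not been run against this dataset here.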

Most of these labels are of no use to us. 
However, the "Living coral" label is mainly assigned to corals whose exact type was not clear.
Supplementary information can be found here:<br>
https://static-content.springer.com/esm/art%3A10.1038%2Fs41597-024-04267-z/MediaObjects/41597_2024_4267_MOESM1_ESM.pdf 


Things to check: <br>

Do all images labelled "Bleached coral" also have a coral type label or the "Living coral" label? <br>
* There are 3 entries where this is not the case. I would remove those.

Are there any images labelled both bleached and dead, or both alive and dead?<br>
* Yes, both combinations occur. These are not mistakes, as both bleached and dead corals are present in those images.
* I think the best solution is to remove these images, as there are not too many.
* The counts (bleached and dead, alive and dead) are 20 and 197 in the training data, 8 and 70 in the test data, and 9 and 75 in the validation data, respectively.
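
One possible way to remove these images is sketched below, using plain Python lists so that it runs without downloading the dataset. The names `has_conflicting_labels`, `keep_indices`, and `make_label`, as well as the toy labels, are hypothetical. Note that an image carrying all three labels is counted in both totals above, so the number of removed images can be smaller than the sum of the two counts.

```python
BLEACHED, DEAD, LIVING = 9, 11, 15  # indices from the label dictionary above


def has_conflicting_labels(label: list[int]) -> bool:
    """True if "Dead coral" co-occurs with "Bleached coral" or "Living coral"."""
    return label[DEAD] == 1 and (label[BLEACHED] == 1 or label[LIVING] == 1)


def keep_indices(labels: list[list[int]]) -> list[int]:
    """Indices of the entries that remain after dropping conflicting images."""
    return [i for i, label in enumerate(labels) if not has_conflicting_labels(label)]


def make_label(*active: int) -> list[int]:
    """Toy helper: build a 31-element multi-hot list from active indices."""
    label = [0] * 31
    for i in active:
        label[i] = 1
    return label


# Toy data for illustration only (not taken from the dataset).
toy_labels = [
    make_label(BLEACHED, DEAD),  # conflicting: bleached and dead
    make_label(LIVING),          # fine
    make_label(DEAD, LIVING),    # conflicting: alive and dead
    make_label(DEAD, 22),        # fine: dead coral on rock
]
print(keep_indices(toy_labels))  # entries 1 and 3 remain
```

On the real data, the resulting index list could presumably be passed to `Dataset.select`, for example `ds["train"].select(keep_indices(ds["train"]["label"]))`. This is an untested sketch.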


In [1]:
from datasets import load_dataset
from datasets.arrow_dataset import Dataset
from datasets.dataset_dict import DatasetDict, IterableDatasetDict
from datasets.iterable_dataset import IterableDataset
from typing import Any

# download the dataset
ds: DatasetDict = load_dataset("lombardata/seatizen_atlas_image_dataset") # type: ignore



In [5]:
# Count the images labelled both bleached and dead, and those labelled both alive and dead.

def count_dead_bleached_and_dead_alive_labels(dataset_name: str) -> tuple[int, int]:
    bleached_and_dead_count = 0
    alive_and_dead_count = 0
    for label in ds[dataset_name]["label"]:
        # check if bleached and dead is true
        if label[9] == 1 and label[11] == 1:
            bleached_and_dead_count += 1
        # check if dead and alive is true
        if label[11] == 1 and label[15] == 1:
            alive_and_dead_count += 1
    return (bleached_and_dead_count, alive_and_dead_count)
count_dead_bleached_and_dead_alive_labels("train")

(20, 197)

In [6]:
# Have a dictionary to transform the numbers into names.

label_dict: dict[int, str] = {
    0: "Acropora Branching",
    1: "Acropora Digitate",
    2: "Acropora Submassive",
    3: "Acropora Tabular",
    4: "Algal Assemblage",
    5: "Algae Halimeda",
    6: "Algae Coralline",
    7: "Algae Turf",
    8: "Altra/Leucospilota",
    9: "Bleached coral",
    10: "Blurred",
    11: "Dead coral",
    12: "Fish",
    13: "Homo Sapiens",
    14: "Human object",
    15: "Living coral",
    16: "Non-acropora Millepora",
    17: "Non-acropora Encrusting",
    18: "Non-acropora Foliose",
    19: "Non-acropora Massive",
    20: "Non-acropora Coral free",
    21: "Non-acropora Submassive",
    22: "Rock",
    23: "Rubble",
    24: "Sand",
    25: "Sea cucumber",
    26: "Sea urchin",
    27: "Sponges",
    28: "Syringodium Isoetifolium",
    29: "Thalassodendron Ciliatum",
    30: "Useless"
}

# Transform a label entry into readable format.
def transform_encoding(multi_hot_encoding: list[int]) -> list[str]:
    active_labels: list[str] = []
    for i, el in enumerate(multi_hot_encoding):
        if el == 1:
            active_labels.append(label_dict[i])
    return active_labels

# Example  
transform_encoding(ds["validation"][382]["label"]) # type: ignore

['Algae Turf',
 'Altra/Leucospilota',
 'Bleached coral',
 'Rock',
 'Rubble',
 'Sand',
 'Sea cucumber']

In [None]:
# Check that every entry labelled "Bleached coral" also has a coral type label or the "Living coral" label.
# Returns the list of indices where this is not the case.

def bleached_coral_equals_coral(dataset_name: str) -> list[int]:
    CORAL_TYPES_INDICES: list[int] = [0,1,2,3,15,16,17,18,19,20,21]
    bad_indices: list[int] = []
    for ind, label in enumerate(ds[dataset_name]["label"]): # type: ignore
        if label[9] == 1:   # if "bleached coral" label is true
            found = False
            for i in CORAL_TYPES_INDICES: 
                if label[i] == 1:       # Check if one of the coral type labels is true
                    found = True
                    break       # if one is found, we don't need to check the others.
            if not found:
                bad_indices.append(ind)
    return bad_indices
bleached_coral_equals_coral("train")

[775, 1213]