# Label Extraction of BigEarthNet-v1.0 Dataset
This notebook creates a pickle file containing all labels of the BigEarthNet-v1.0 (BEN) dataset as a dict. To do so, it loads the label JSON files from the first 50,000 image folders in the S3 bucket on AWS. (Note: this assumes that the first 50,000 folders are sufficient to capture every label in the dataset.) The labels are then extracted and saved to a local pickle file called `ben_labels.pickle`.

This notebook only needs to be run once: the resulting pickle file `ben_labels.pickle` is stored in the repo, and its contents can be read via `src.globals.LABELS_TO_INDS`. You should therefore not need to run it again.
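
For reference, reading such a pickle file back takes only a few lines. The sketch below is self-contained: it first writes a small stand-in file to a temporary directory and then loads it. The three labels and their indices are illustrative placeholders, the label-to-index direction is an assumption based on the name `LABELS_TO_INDS`, and `load_label_mapping` is a hypothetical helper, not part of `src.globals`.

```python
import pickle
import tempfile
from pathlib import Path


def load_label_mapping(path):
    """Load a pickled label mapping from disk (hypothetical helper)."""
    with open(path, "rb") as p_in:
        return pickle.load(p_in)


# Stand-in mapping with illustrative indices; the real file holds all BEN labels
example_mapping = {"Coniferous forest": 0, "Mixed forest": 1, "Pastures": 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / "ben_labels.pickle"
    with open(pickle_path, "wb") as p_out:
        pickle.dump(example_mapping, p_out)
    label_to_ind = load_label_mapping(pickle_path)

# Invert the mapping to translate indices back to label names
ind_to_label = {ind: label for label, ind in label_to_ind.items()}
print(label_to_ind["Pastures"], ind_to_label[0])
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.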

#### Prerequisites
To run the notebook, your AWS credentials must be stored in the file `aws_credentials.yaml`.

In [1]:
# You only need to run this code snippet if your credentials are not set within your environment yet
from src.infrastructure.aws_infrastructure import get_aws_credentials, set_s3_credentials

aws_credentials = get_aws_credentials()
set_s3_credentials(aws_credentials)



In [2]:
import src.data.general_datapipes as pipes

# Get a list of all image folder names in S3
folder_list = list(pipes.get_s3_folder_content())
print("Length: ", len(folder_list))
print("First folder name: ", folder_list[0])

Length:  590326
First folder name:  s3://mi4people-soil-project/BigEarthNet-v1.0/S2A_MSIL2A_20170613T101031_0_45


In [3]:
import tqdm
import json
import pickle
import fsspec
from pathlib import Path

import torchdata.datapipes as dp

import src.data.bigearthnet_datapipes as ben_pipes
from src.globals import PROJECT_DIR

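def read_labels(path):
    # Open the label JSON via fsspec and close the file handle after reading
    with fsspec.open(path, mode="r") as f:
        return json.load(f)["labels"]

# Iterate through the first 50,000 image folders (assumption: this covers all BigEarthNet labels)
pipe = dp.iter.IterableWrapper(folder_list[:50000])
# List the files inside each image folder
pipe = pipe.list_files_by_fsspec()
# Group the files of one folder (assumed: 12 band images plus one label JSON, hence 13)
pipe = pipe.groupby(group_key_fn=ben_pipes.group_key_by_folder, group_size=13)
pipe = pipe.map(ben_pipes.chunk_to_dataloader_dict)
# Keep only the path of the label JSON, then parse its "labels" entry
pipe = pipe.map(lambda x: x["label"])
pipe = pipe.map(read_labels)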

classes = list()
for js in tqdm.tqdm(pipe):
    for label in js:
        classes.append(label)

from collections import Counter

counter = Counter(classes)
# Classes sorted alphabetically
classes = sorted(set(classes))

# Map each class name to its index in the alphabetically sorted class list
classes_to_ind = {cls: i for i, cls in enumerate(classes)}

print(counter)
print(classes)
print(classes_to_ind)

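# Save the dict as a pickle file in the data folder of this project
# (PROJECT_DIR is assumed to be the repository root)
out_path = Path(PROJECT_DIR) / "data" / "00_prerequisites" / "ben_labels.pickle"
out_path.parent.mkdir(parents=True, exist_ok=True)
with open(out_path, "wb") as p_out:
    pickle.dump(classes_to_ind, p_out)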

50000it [4:00:51,  3.46it/s]


Counter({'Non-irrigated arable land': 17032, 'Mixed forest': 15176, 'Coniferous forest': 14565, 'Pastures': 14279, 'Transitional woodland/shrub': 11723, 'Broad-leaved forest': 9197, 'Land principally occupied by agriculture, with significant areas of natural vegetation': 9085, 'Sea and ocean': 8707, 'Complex cultivation patterns': 6348, 'Discontinuous urban fabric': 5999, 'Water bodies': 4428, 'Agro-forestry areas': 3123, 'Permanently irrigated land': 2052, 'Peatbogs': 1945, 'Olive groves': 1704, 'Industrial or commercial units': 1390, 'Water courses': 1210, 'Moors and heathland': 1171, 'Vineyards': 832, 'Sport and leisure facilities': 798, 'Annual crops associated with permanent crops': 779, 'Rice fields': 741, 'Inland marshes': 597, 'Sclerophyllous vegetation': 554, 'Road and rail networks and associated land': 532, 'Mineral extraction sites': 506, 'Natural grassland': 443, 'Continuous urban fabric': 332, 'Estuaries': 280, 'Beaches, dunes, sands': 270, 'Intertidal flats': 243, 'Salt 

FileNotFoundError: [Errno 2] No such file or directory: 'c:\\data\\01_raw\\ben_labels.pickle'
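
The counting and indexing logic of the cell above can be tried without S3 access. The following is a minimal sketch that works on in-memory JSON strings instead of the real label files. The two sample documents are made up for illustration (only the label names are taken from the output above), and `build_label_index` is a hypothetical helper. It returns the label frequencies and a label-to-index dict, which is the direction the name `classes_to_ind` suggests.

```python
import json
from collections import Counter


def build_label_index(label_lists):
    """Count label occurrences and index the distinct labels alphabetically."""
    counter = Counter()
    for labels in label_lists:
        counter.update(labels)
    # Sorting the Counter sorts its keys, i.e. the distinct label names
    classes = sorted(counter)
    label_to_ind = {cls: i for i, cls in enumerate(classes)}
    return counter, label_to_ind


# Made-up stand-ins for the per-patch label JSON files
sample_docs = [
    '{"labels": ["Pastures", "Mixed forest"]}',
    '{"labels": ["Mixed forest", "Sea and ocean"]}',
]
label_lists = [json.loads(doc)["labels"] for doc in sample_docs]

counter, label_to_ind = build_label_index(label_lists)
print(counter.most_common(1))
print(label_to_ind)
```

For the full run, comparing `len(label_to_ind)` against the 43 classes that BigEarthNet-v1.0 is documented to contain is one way to check the 50,000-folder assumption.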