# Ingesting a Q0 dataset

The dataset is now ready to be ingested into EOTDL. We just need to add its general metadata in a README file.

In [31]:
text = """---
name: Sentinel-2-Ships
authors: 
  - Pierre-Jean Coquard
license: free
source: https://github.com/earthpulse/eotdl/tree/main/tutorials/usecases/useCaseD
---

# Sentinel-2-Ships

This is an example dataset created for the use case D.
"""

with open("data/sentinel_2/README.md", "w") as outfile:
    outfile.write(text)
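
Before ingesting, it can help to check that the front matter of the README parses as expected. The following is a minimal sketch using only the standard library. The helper name `read_front_matter` is hypothetical, and the list of fields checked is an assumption taken from the README above, not a confirmed list of what EOTDL requires.

```python
def read_front_matter(readme_text):
    """Return the key/value pairs found between the leading '---' lines."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("README does not start with a '---' front matter block")
    metadata = {}
    current_key = None
    for line in lines[1:]:
        if line.strip() == "---":
            return metadata
        is_list_entry = line.lstrip().startswith("- ")
        if is_list_entry and isinstance(metadata.get(current_key), list):
            # List entry belonging to the previous key, e.g. an author name
            metadata[current_key].append(line.strip()[2:].strip())
        elif ":" in line:
            key, _, value = line.partition(":")
            current_key = key.strip()
            value = value.strip()
            # A key without an inline value starts a list
            metadata[current_key] = value if value else []
    raise ValueError("front matter block is not closed with '---'")


sample = """---
name: Sentinel-2-Ships
authors: 
  - Pierre-Jean Coquard
license: free
source: https://github.com/earthpulse/eotdl/tree/main/tutorials/usecases/useCaseD
---

# Sentinel-2-Ships
"""

metadata = read_front_matter(sample)
# Assumed set of fields, mirroring the README written above
missing = [key for key in ("name", "authors", "license", "source") if not metadata.get(key)]
print(metadata["name"], metadata["authors"], missing)
```

This simple parser only handles flat keys and one-level lists, which is enough for the README used here.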

In [None]:
from eotdl.datasets import ingest_dataset

ingest_dataset("data/sentinel_2")

# Q1 dataset

We can upgrade this dataset to a Q1 dataset by adding STAC metadata. We use the `STACGenerator` class to automatically generate the STAC metadata for the whole dataset.


In [12]:
from eotdl.curation.stac.stac import STACGenerator
from eotdl.curation.stac.assets import STACAssetGenerator
from eotdl.curation.stac.parsers import UnestructuredParser
from eotdl.curation.stac.dataframe_labeling import UnlabeledStrategy, LabeledStrategy

stac_generator = STACGenerator(item_parser=UnestructuredParser, 
                               assets_generator=STACAssetGenerator, 
                               labeling_strategy=LabeledStrategy,
                               image_format='tif'
                               )


In [6]:
extensions = {'ship': ('proj', 'raster', 'eo')}
bands = {'ship': ('B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B09', 'B11', 'B12')}
collection = {'ship': 'sentinel-2-ships'}

df = stac_generator.get_stac_dataframe('data/sentinel_2', collections=collection, extensions=extensions, bands=bands)
df.head()

NameError: name 'stac_generator' is not defined

We can then generate the STAC metadata from the `STACDataframe` generated during the previous step.

In [14]:
stac_generator.generate_stac_metadata(stac_id='ship-segmentation-dataset',
                                      description='Ship segmentation dataset',
                                      output_folder='data/sentinel_2_stac')

Generating sentinel-2-ships collection...


100%|██████████| 77/77 [00:00<00:00, 239.27it/s]

Validating and saving catalog...
Success!
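
A quick way to confirm what was written to disk is to read the generated `catalog.json` and list its child links. The sketch below relies only on the standard STAC layout, where a catalog stores its collections as links with `rel` set to `child`. The helper name `list_child_collections` is hypothetical.

```python
import json
from pathlib import Path


def list_child_collections(catalog_path):
    """Return the hrefs of all 'child' links declared in a STAC catalog file."""
    catalog = json.loads(Path(catalog_path).read_text())
    return [
        link["href"]
        for link in catalog.get("links", [])
        if link.get("rel") == "child"
    ]


# Only inspect the catalog if the previous cell has already produced it
catalog_file = Path("data/sentinel_2_stac/catalog.json")
if catalog_file.exists():
    print(list_child_collections(catalog_file))
```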





We also add the STAC metadata for the labels:

In [17]:
from eotdl.curation.stac.extensions import ScaneoLabeler

labeler = ScaneoLabeler()

catalog = 'data/sentinel_2_stac/catalog.json'
labels_extra_properties = {'label_methods': ["automated"]}
labeler.generate_stac_labels(
    catalog=catalog,
    root_folder='data/sentinel_2',
    collection='sentinel-2-ships',
    label_type="raster",
    **labels_extra_properties
)

Generating labels collection...: 77it [00:00, 1838.05it/s]

Success on labels generation!





Once the STAC metadata is successfully generated, we can ingest the Q1 dataset into EOTDL.

In [1]:
from eotdl.datasets import ingest_dataset

ingest_dataset('data/sentinel_2_stac')

Loading STAC catalog...
New version created, version: 30


100%|██████████| 154/154 [00:41<00:00,  3.69it/s]


Ingesting STAC catalog...
Done


# Q2 dataset

We can upgrade the Q1 dataset to a Q2 dataset by adding the ML dataset extension to its STAC metadata with `add_ml_extension`, which here also generates the training, validation and test splits.

In [1]:
from eotdl.curation.stac.extensions import add_ml_extension
import pystac
catalog = 'data/sentinel_2_stac/catalog.json'

add_ml_extension(
	catalog,
	destination='data/sentinel_2_q2',
	splits=True,
	splits_collection_id="labels",
	name='Ship Segmentation Q2',
	tasks=['segmentation'],
	inputs_type=['satellite imagery'],
	annotations_type='raster',
	version='0.1.0',
)

Generating splits...
Total size: 77
Train size: 61
Test size: 7
Validation size: 7
Generating Training split...


100%|██████████| 61/61 [00:00<00:00, 11469.09it/s]


Generating Validation split...


100%|██████████| 7/7 [00:00<00:00, 8827.46it/s]


Generating Test split...


100%|██████████| 7/7 [00:00<00:00, 9966.10it/s]

Success on splits generation!
Validating and saving...





Success!
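
The split sizes in the log above (61, 7 and 7 out of 77 items) are consistent with an 80/10/10 split in which each size is truncated to an integer. The sketch below reproduces those numbers under that assumption. It is not taken from the EOTDL source, and the function name, proportions and seed are all hypothetical.

```python
import random


def split_items(items, train=0.8, test=0.1, val=0.1, seed=42):
    """Shuffle the items and cut them into train, test and validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Truncating each size means a few items can be left unassigned
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    n_val = int(len(shuffled) * val)
    train_items = shuffled[:n_train]
    test_items = shuffled[n_train:n_train + n_test]
    val_items = shuffled[n_train + n_test:n_train + n_test + n_val]
    return train_items, test_items, val_items


train_items, test_items, val_items = split_items(range(77))
print(len(train_items), len(test_items), len(val_items))  # 61 7 7
```

With truncation, the three splits add up to 75 items, so two of the 77 items end up in no split under this assumption.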


Let's compute the quality metrics for the dataset to ensure that it can be ingested.

In [2]:
from eotdl.curation.stac.extensions import MLDatasetQualityMetrics

catalog = 'data/sentinel_2_q2/catalog.json'

MLDatasetQualityMetrics.calculate(catalog)

MLDatasetExtension
done


Looking for spatial duplicates...: 308it [00:00, 6491.65it/s]
Calculating classes balance...: 308it [00:00, 280531.08it/s]

Validating and saving...
Success!
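
To give an idea of what a spatial duplicate check can look like, here is a minimal sketch that groups items by their bounding box and reports the groups with more than one item. It only illustrates the idea and is not the implementation used by `MLDatasetQualityMetrics`. The function name and the sample items are hypothetical, while `id` and `bbox` are standard STAC item fields.

```python
from collections import defaultdict


def find_spatial_duplicates(items):
    """Return lists of item ids that share exactly the same bounding box."""
    ids_by_bbox = defaultdict(list)
    for item in items:
        # Tuples are hashable, so the bbox can be used as a dictionary key
        ids_by_bbox[tuple(item["bbox"])].append(item["id"])
    return [ids for ids in ids_by_bbox.values() if len(ids) > 1]


# Hypothetical items, reduced to the two fields the check needs
sample_items = [
    {"id": "scene_a", "bbox": [2.0, 41.0, 2.1, 41.1]},
    {"id": "scene_b", "bbox": [2.0, 41.0, 2.1, 41.1]},
    {"id": "scene_c", "bbox": [3.0, 42.0, 3.1, 42.1]},
]
duplicates = find_spatial_duplicates(sample_items)
print(duplicates)
```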





We can finally ingest the Q2 dataset into EOTDL:

In [None]:
from eotdl.datasets import ingest_dataset

ingest_dataset('data/sentinel_2_q2')


Loading STAC catalog...
New version created, version: 48


100%|██████████| 308/308 [01:01<00:00,  4.98it/s]


Ingesting STAC catalog...


Exception: [Errno 2] No such file or directory: '/home/pcoquard/git/eotdl_use_case_d/data/sentinel_2_q2/catalog.json'
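
The exception above reports that `catalog.json` could not be found at the expected location. Checking for the file before starting the ingestion gives faster feedback than waiting for the upload to finish. The following is a minimal sketch, where the helper name `require_catalog` is hypothetical and not part of EOTDL.

```python
from pathlib import Path


def require_catalog(folder):
    """Return the path to catalog.json in the folder, or raise a clear error."""
    catalog_file = Path(folder) / "catalog.json"
    if not catalog_file.is_file():
        raise FileNotFoundError(
            f"No catalog.json found in {Path(folder).resolve()}"
        )
    return catalog_file


try:
    print(require_catalog("data/sentinel_2_q2"))
except FileNotFoundError as error:
    # Stop here instead of starting an ingestion that cannot succeed
    print(error)
```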

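In [5]:
import pystac

# `catalog` holds the path to catalog.json as a string, so it has to be
# loaded as a pystac.Catalog before it can be validated
pystac.Catalog.from_file(catalog).validate_all()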