# Tutorial 4: Creating SDS datasets


## Introduction
This tutorial shows how to create new datasets in the SPARC Data Structure (SDS) format using the [sparc-me python tool](https://github.com/SPARC-FAIR-Codeathon/sparc-me).

## Definitions


## Learning outcomes
In this tutorial, you will learn how to:
- create an empty dataset from an SDS template.
- add or modify metadata element values.
- add or remove data from a dataset.

## Creating a dataset using a template

We will use sparc-me's `Dataset` Python class to create an empty dataset from a template and return it as a Python object.

By default, datasets are created in the current directory. You can override this by calling the `Dataset` object's `set_path` method before creating the dataset.

In [None]:
import sparc_me as sm

dataset = sm.Dataset()
dataset.set_path('./')
dataset.create_empty_dataset(version='2.0.0')

Any subsequent changes made through the `Dataset` object are automatically applied to the dataset that was created.

## Adding or modifying metadata
Tutorial 3 described how we can use sparc-me to get the value of metadata elements used in the different metadata files of an SDS dataset.

We will now show how to add metadata elements for the dataset description metadata file.

As mentioned in Tutorial 3, information about a metadata element can be accessed through the SDS documentation or using sparc-me's schema methods.

### Dataset Description
We can create a `Metadata` Python object for the metadata file of interest using the `Dataset` object's `get_metadata` method.

In [None]:
dd = dataset.get_metadata(
    metadata_file='dataset_description')  # dd is short for dataset_description.

We can then easily add values for metadata elements using the metadata object's `add_values` Python method:

```
add_values(
  element: str = '',
  values: Any, 
  append: bool = True)
```

Note that by default, `add_values` will append to existing metadata element values, unless the optional `append` argument is set to `False`, in which case, any existing values are replaced.
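To make the `append` behaviour concrete, the following minimal sketch models it with a plain dictionary. The `add_values_sketch` function and the `store` dictionary are hypothetical stand-ins written for this illustration. They are not part of sparc-me and do not reflect how it stores metadata internally.

```python
from typing import Any


def add_values_sketch(store: dict, element: str, values: Any, append: bool = True) -> None:
    """Hypothetical model of the append/replace behaviour described above."""
    # Accept either a single value or a list of values.
    new_values = list(values) if isinstance(values, (list, tuple)) else [values]
    if append:
        # Keep whatever is already stored and add the new values to the end.
        store.setdefault(element, []).extend(new_values)
    else:
        # Discard any existing values for this element.
        store[element] = new_values


store = {}
add_values_sketch(store, 'Keywords', ['Breast cancer', 'MRI'])
add_values_sketch(store, 'Keywords', 'DCE-MRI')                 # appended
add_values_sketch(store, 'Title', 'Draft title')
add_values_sketch(store, 'Title', 'Final title', append=False)  # replaced
print(store)
```

Under this reading, calling `add_values` twice for the same element accumulates values, so `append=False` is the safer choice when re-running a cell that sets a single-valued element.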

In [None]:
dd.add_values(
    element='Type', 
    values='experimental')
dd.add_values(
    element='Title', 
    values='Duke University DCE-MRI of breast cancer patients')
dd.add_values(
    element='Subtitle',
    values='Retrospective collection of MRI from 922 biopsy-confirmed invasive breast cancer patients.')
dd.add_values(
    element='Keywords', 
    values=['Breast cancer','MRI'])
dd.add_values(
    element='Study purpose',
    values='Breast MRI is a medical image modality used to assess the extent of disease in breast cancer patients. Recent studies show that MRI has a potential in prognosis of patients’ short and long-term outcomes as well as predicting pathological and genomic features of the tumors. However, large, well annotated datasets are needed to make further progress in the field. We share such a dataset here.')
dd.add_values(
    element='Study data collection',
    values="""This dataset is a single-institutional, retrospective collection of 922 biopsy-confirmed invasive breast cancer patients, over a decade, specifically pre-operative dynamic contrast enhanced (DCE)-MRI that were downloaded from PACS systems and de-identified for The Cancer Imaging Archive (TCIA) release. These include axial breast MRI images acquired by 1.5T or 3T scanners in the prone positions. The following MRI sequences are shared in DICOM format: a non-fat saturated T1-weighted sequence, a fat-saturated gradient echo T1-weighted pre-contrast sequence, and mostly three to four post-contrast sequences.""")
dd.add_values(
    element='Study primary conclusion', 
    values='Data collected for subsequent analysis.')
dd.add_values(
    element='Study organ system', 
    values='breast')
dd.add_values(
    element='Study approach', 
    values='Imaging')
dd.add_values(
    element='Study technique', 
    values='MRI')
dd.add_values(
    element='Contributor name',
    values=['Saha, Ashirbani',
            'Harowicz, Michael R',
            'Grimm, Lars J',
            'Kim, Connie E',
            'Ghate, Sujata V',
            'Walsh, Ruth',
            'Mazurowski, Maciej A'])
dd.add_values(
    element='Contributor orcid',
    values=['https://orcid.org/0000-0002-7650-1720',
            'https://orcid.org/0000-0002-8002-5210',
            'https://orcid.org/0000-0002-3865-3352',
            'https://orcid.org/0000-0003-0730-0551',
            'https://orcid.org/0000-0003-1889-982X',
            'https://orcid.org/0000-0002-2164-2761',
            'https://orcid.org/0000-0003-4202-8602'],
    append=False)
# Note: the affiliation below is repeated once for each of the 7 contributors.
dd.add_values(
    element='Contributor affiliation', 
    values=['Duke University'] * 7)
dd.add_values(
    element='Contributor role',
    values=['Researcher',
            'Researcher',
            'Researcher',
            'Researcher',
            'Researcher',
            'Researcher',
            'Researcher'])
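The contributor elements above behave like parallel lists: the n-th name, ORCID, affiliation, and role presumably describe the same person. That alignment is an assumption made here for illustration, and the `check_parallel_lists` helper below is hypothetical rather than part of sparc-me. A minimal sketch of a length check that could be run before adding the values:

```python
def check_parallel_lists(columns: dict) -> int:
    """Return the shared length of all lists, or raise ValueError if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Contributor lists differ in length: {lengths}")
    return next(iter(lengths.values()), 0)


names = ['Saha, Ashirbani', 'Harowicz, Michael R', 'Grimm, Lars J',
         'Kim, Connie E', 'Ghate, Sujata V', 'Walsh, Ruth',
         'Mazurowski, Maciej A']

columns = {
    'Contributor name': names,
    # Derive the repeat count from the names instead of hard-coding 7.
    'Contributor affiliation': ['Duke University'] * len(names),
    'Contributor role': ['Researcher'] * len(names),
}

contributor_count = check_parallel_lists(columns)
print(contributor_count)
```

Deriving the repeat count from `len(names)` keeps the lists in step if a contributor is added or removed later.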

As described in Tutorial 3, we can get the metadata element values we have just added as follows.

In [None]:
dd.get_values(element='Contributor role')

A specific metadata element value can be removed as follows. Note that if more than one match is found, all matches are removed.

In [None]:
dd.remove_values(
    element='Contributor role',
    values='Researcher')
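The "all matches are removed" behaviour can be pictured with an ordinary list. The sketch below is a plain-Python model, not sparc-me code. The `remove_matches` helper and the `'Principal investigator'` value are hypothetical and used only for illustration.

```python
def remove_matches(values: list, target) -> list:
    """Return a copy of values with every occurrence of target removed."""
    return [value for value in values if value != target]


roles = ['Researcher', 'Principal investigator', 'Researcher']
remaining = remove_matches(roles, 'Researcher')
print(remaining)
```

In the tutorial's data all seven roles are `'Researcher'`, so under this reading the removal above would leave the element with no values.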

All values for a metadata element can also be cleared as follows.

In [None]:
dd.clear_values(element='Contributor role')

## Adding or removing data
The following code snippet shows how subjects and samples can be added to the dataset being created, using breast MRI as an example. Data will be added for 2 subjects, each having 2 MRI sequences. The `subject_user_id` variable, which appears in the simplified loop later in this section, helps the user creating the dataset to specify the local path to the MR images.

In [None]:
subjects = []
samples = []  # samples belonging to the first subject

sample1 = sm.Sample()
sample1.add_path("../resources/example_dataset/Breast_MRI_001/sequence1/")
samples.append(sample1)

sample2 = sm.Sample()
sample2.add_path("../resources/example_dataset/Breast_MRI_001/sequence2/")
samples.append(sample2)

subject1 = sm.Subject()
subject1.add_samples(samples)
subjects.append(subject1)

samples = []

sample1 = sm.Sample()
sample1.add_path("../resources/example_dataset/Breast_MRI_002/sequence1/")
samples.append(sample1)

sample2 = sm.Sample()
sample2.add_path("../resources/example_dataset/Breast_MRI_002/sequence2/")
samples.append(sample2)

subject2 = sm.Subject()
subject2.add_samples(samples)
subjects.append(subject2)

One way the above code can be simplified is shown below.

In [None]:
subjects = []
for subject_user_id in [1, 2]:
    samples = []
    for sample_user_id in [1, 2]:
        sample = sm.Sample()
        sample.add_path(
            "../resources/example_dataset/Breast_MRI_00{0}/sequence{1}/".format(
                subject_user_id, sample_user_id))
        samples.append(sample)
        
    subject = sm.Subject()
    subject.add_samples(samples)
    subjects.append(subject)
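Before passing paths to `add_path`, it can be useful to confirm that the source folders exist locally. The following sketch uses only the standard library. The `sample_source_dirs` helper is hypothetical, and the three-digit zero padding (`001`, `002`, ...) is an assumption extrapolated from the two folder names used in this tutorial.

```python
from pathlib import Path


def sample_source_dirs(root: Path, subject_ids, sample_ids) -> dict:
    """Map (subject_user_id, sample_user_id) to the expected source folder."""
    return {
        (subject_id, sample_id):
            root / f"Breast_MRI_{subject_id:03d}" / f"sequence{sample_id}"
        for subject_id in subject_ids
        for sample_id in sample_ids
    }


root = Path("../resources/example_dataset")
source_dirs = sample_source_dirs(root, subject_ids=[1, 2], sample_ids=[1, 2])

# Report folders that are not present locally before handing them to add_path.
missing = [path.as_posix() for path in source_dirs.values() if not path.is_dir()]
for path in missing:
    print(f"Missing source folder: {path}")
```

Formatting the subject number with `:03d` also avoids the problem that a literal `00{0}` prefix would produce `0010` for a tenth subject.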

Once the paths for the samples of each subject have been specified, the subjects can be added to the dataset as shown below. This copies the data from its user-specified location to the subject and sample folders within the primary data folder of the SDS dataset being created. Note that the samples and subjects will be renumbered sequentially, so their SDS IDs may not correspond to the user IDs.

The `add_subjects` method will also automatically update the manifest metadata file.

In [None]:
dataset.add_subjects(subjects)
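Because subjects and samples are renumbered sequentially, it can help to record which local identifier ended up under which SDS identifier. The sketch below assumes that numbering starts at 1 and follows the order in which subjects were added. Both points are assumptions rather than documented sparc-me behaviour, and the `user_ids` values are hypothetical.

```python
# Hypothetical local subject identifiers, in the order they were added.
user_ids = [17, 42, 105]

# Assumed sequential SDS numbering: first subject added becomes 1, and so on.
sds_id_for = {user_id: sds_id for sds_id, user_id in enumerate(user_ids, start=1)}

for user_id, sds_id in sds_id_for.items():
    print(f"local subject {user_id} -> SDS subject {sds_id}")
```

Keeping such a lookup alongside the dataset makes it easier to trace an SDS subject back to the original source data.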

Additional metadata for subjects and samples can also be included as follows.

In [None]:
subject_sds_id = 1
subject = dataset.get_subject(subject_sds_id)
subject.set_values(
    element='age',
    value=30)

sample_sds_id = 2 
sample = subject.get_sample(sample_sds_id)
sample.set_values(
    element='sample experimental group',
    value='experimental')
sample.set_values(
    element='sample type',
    value='DCE-MRI Contrast Image {0}'.format(sample_sds_id))
sample.set_values(
    element='sample anatomical location',
    value='breast')

At this stage, if any new subjects or samples need to be added or removed, please discard the dataset being created and create a new one. Future releases of sparc-me and the DigitalTWINS platform API will include the ability to:
- add subjects and samples to an existing dataset;
- remove subjects and samples; and
- version control datasets.

## Adding or removing thumbnails

Thumbnail images can be added to a dataset as follows. The manifest metadata file is automatically updated when thumbnails are added. `thumbnail_0.jpg` will be set as the main thumbnail for the dataset on the data catalogue page of the DigitalTWINS platform's portal. Additional thumbnails can be numbered sequentially, e.g. `thumbnail_1.jpg`. All thumbnails will be visible in the gallery view of a data catalogue in the portal.

In [None]:
dataset.add_thumbnail("../resources/example_dataset/thumbnail_0.jpg")
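The sequential naming convention for thumbnails can be generated rather than typed by hand. The `thumbnail_names` helper below is hypothetical and only builds file names. It does not touch the dataset, and the `.jpg` default simply mirrors the examples in this tutorial.

```python
def thumbnail_names(count: int, extension: str = "jpg") -> list:
    """Return thumbnail file names numbered from 0, e.g. thumbnail_0.jpg."""
    return [f"thumbnail_{index}.{extension}" for index in range(count)]


names = thumbnail_names(3)
main_thumbnail = names[0]  # index 0 is the main thumbnail by convention
print(names)
```

Each generated name could then be joined to a local folder and passed to `add_thumbnail` in turn.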

Thumbnail images can be removed as follows.

In [None]:
dataset.remove_thumbnail("../resources/example_dataset/thumbnail_0.jpg")

## Checking a dataset
Dataset checking (validation) can be performed as follows. This allows the fields of the dataset to be checked before it is submitted to the portal.

In [None]:
validator = sm.Validator()
validator.validate_dataset(dataset)

Currently, only basic validation has been implemented. More validation features will be added in future releases of sparc-me.

## Next steps
The next tutorial will show how to upload your dataset to your instance of the 12 LABOURS DigitalTWINS Platform using its Python API.