# Tutorial 4: Creating SDS datasets using sparc-me

## Introduction
This tutorial shows how to create new datasets in the SPARC Data Structure (SDS) format using the [sparc-me python tool](https://github.com/SPARC-FAIR-Codeathon/sparc-me).

## Definitions
- samples SDS metadata file - See the following [presentation](https://docs.google.com/file/d/1zZ3-C17lPIgtRp6bnkSwvKacaTA66GVR/edit?usp=docslist_api&filetype=mspresentation) for more information.
- subjects SDS metadata file - See the following [presentation](https://docs.google.com/file/d/1zZ3-C17lPIgtRp6bnkSwvKacaTA66GVR/edit?usp=docslist_api&filetype=mspresentation) for more information.

## Learning outcomes
In this tutorial, you will learn how to:
- create an empty dataset from an SDS template.
- add or modify metadata element values.
- add or remove data from a dataset.

## Creating a dataset using a template

We will use `sparc-me`'s `Dataset` Python class to create an empty dataset from a template and return it as a Python object.

You can specify the path where the generated dataset will be stored as follows.

In [None]:
import sparc_me as sm
import warnings

dataset = sm.Dataset()
dataset.set_path('./example_sds_dataset')
dataset.create_empty_dataset(version='2.0.0')

# Suppress deprecation warning from python modules.
warnings.filterwarnings('ignore', category=DeprecationWarning)
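
To check what was generated, the top-level contents of the dataset folder can be listed with the Python standard library. This is only a minimal sketch: `list_top_level` is a hypothetical helper, not part of sparc-me, and the files and folders you see will depend on the SDS template version.

```python
from pathlib import Path


def list_top_level(dataset_path):
    """Return the sorted names of the files and folders directly inside dataset_path."""
    root = Path(dataset_path)
    if not root.is_dir():
        # The dataset has not been created yet, or the path is wrong.
        return []
    return sorted(entry.name for entry in root.iterdir())


# Path used earlier in this tutorial; an empty list is printed if it does not exist.
print(list_top_level('./example_sds_dataset'))
```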

Any subsequent changes made to the `dataset` object will automatically update the dataset stored at this path.

## Adding or modifying metadata
Tutorial 3 described how we can use sparc-me to get the value of metadata elements used in the different metadata files of an SDS dataset.

We will now show how to add metadata elements for the dataset description metadata file.

As mentioned in Tutorial 3, information about a metadata element can be accessed through the SDS documentation or using sparc-me's schema methods.

### Dataset Description
We can create a `Metadata` Python object for the metadata file of interest using the `Dataset` object's `get_metadata` method.

In [None]:
dd = dataset.get_metadata(metadata_file='dataset_description')  # dd is short for dataset_description.

We can then easily add values for metadata elements using the metadata object's `add_values` Python method:

```
add_values(
  element: str = '',
  values: Any, 
  append: bool = True)
```

Note that by default, `add_values` will append to existing metadata element values, unless the optional `append` argument is set to `False`, in which case, any existing values are replaced.

In [None]:
dd.add_values(
    element='Type', 
    values='Experimental')
dd.add_values(
    element='Title', 
    values='Duke University DCE-MRI of breast cancer patients')
dd.add_values(
    element='Subtitle',
    values='Retrospective collection of MRI from 922 biopsy-confirmed invasive breast cancer patients.')
dd.add_values(
    element='Keywords', 
    values=['Breast cancer','MRI'])
dd.add_values(
    element='Study purpose',
    values='Breast MRI is a medical image modality used to assess the extent of disease in breast cancer patients. Recent studies show that MRI has a potential in prognosis of patients’ short and long-term outcomes as well as predicting pathological and genomic features of the tumors. However, large, well annotated datasets are needed to make further progress in the field. We share such a dataset here.')
dd.add_values(
    element='Study data collection',
    values="""This dataset is a single-institutional, retrospective collection of 922 biopsy-confirmed invasive breast cancer patients, over a decade, specifically pre-operative dynamic contrast enhanced (DCE)-MRI that were downloaded from PACS systems and de-identified for The Cancer Imaging Archive (TCIA) release. These include axial breast MRI images acquired by 1.5T or 3T scanners in the prone positions. The following MRI sequences are shared in DICOM format: a non-fat saturated T1-weighted sequence, a fat-saturated gradient echo T1-weighted pre-contrast sequence, and mostly three to four post-contrast sequences.""")
dd.add_values(
    element='Study primary conclusion', 
    values='Data collected for subsequent analysis.')
dd.add_values(
    element='Study organ system', 
    values='breast')
dd.add_values(
    element='Study approach', 
    values='Imaging')
dd.add_values(
    element='Study technique', 
    values='MRI')
dd.add_values(
    element='Contributor name',
    values=['Saha, Ashirbani',
            'Harowicz, Michael R',
            'Grimm, Lars J',
            'Kim, Connie E',
            'Ghate, Sujata V',
            'Walsh, Ruth',
            'Mazurowski, Maciej A'])
dd.add_values(
    element='Contributor orcid',
    values=['https://orcid.org/0000-0002-7650-1720',
            'https://orcid.org/0000-0002-8002-5210',
            'https://orcid.org/0000-0002-3865-3352',
            'https://orcid.org/0000-0003-0730-0551',
            'https://orcid.org/0000-0003-1889-982X',
            'https://orcid.org/0000-0002-2164-2761',
            'https://orcid.org/0000-0003-4202-8602'])

dd.add_values(
    element='Identifier',
    values='Not Defined')

dd.add_values(
    element='Identifier description',
    values='Not Defined')

dd.add_values(
    element='Relation type',
    values='Not Defined')

dd.add_values(
    element='Identifier type',
    values='Not Defined')

dd.add_values(
    element='Contributor affiliation', 
    values=['Duke University'] * 7)
dd.add_values(
    element='Contributor role',
    values=['Researcher',
            'Researcher',
            'Researcher',
            'Researcher',
            'Researcher',
            'Researcher',
            'Researcher'])
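
The append-versus-replace behaviour of the `append` argument can be illustrated without sparc-me. The sketch below uses a plain dictionary and a stand-in function that happens to share the name `add_values`. It is an assumption-laden illustration of the semantics described above, not sparc-me's implementation.

```python
def add_values(store, element, values, append=True):
    """Illustrative stand-in: keep the values of each element in a plain dict."""
    # Accept either a single value or a list of values.
    new_values = list(values) if isinstance(values, (list, tuple)) else [values]
    if append and element in store:
        store[element].extend(new_values)  # keep the existing values
    else:
        store[element] = new_values        # replace (or create) the entry
    return store[element]


store = {}
add_values(store, 'Keywords', ['Breast cancer', 'MRI'])
add_values(store, 'Keywords', 'DCE-MRI')                # appended to the two existing values
print(store['Keywords'])
add_values(store, 'Keywords', 'Imaging', append=False)  # replaces all existing values
print(store['Keywords'])
```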

As described in Tutorial 3, we can get the metadata element values we have just added as follows.

In [None]:
dd.get_values(element='Contributor role')

A specific metadata element value can be removed as follows. Note that if more than one match is found, all matches are removed.

In [None]:
dd.remove_values(
    element='Contributor role',
    values='Researcher'
)
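
The "all matches are removed" behaviour can be pictured with an ordinary list. `remove_matches` below is a hypothetical helper written for this illustration only and does not reflect how sparc-me stores values internally.

```python
def remove_matches(current_values, value):
    """Return a new list without any entry equal to value."""
    return [v for v in current_values if v != value]


roles = ['Researcher'] * 7
print(remove_matches(roles, 'Researcher'))   # every entry matched, so nothing is left

mixed = ['Researcher', 'PrincipalInvestigator', 'Researcher']
print(remove_matches(mixed, 'Researcher'))   # only the non-matching entry remains
```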

All values for a metadata element can also be cleared, e.g. `dd.clear_values(element='Contributor role')`.

## Adding or removing data
Adding data to an SDS dataset using `sparc-me` involves specifying the location of all the files or folders that should be placed in the sample folder for a subject. The organisation of the data within the sample folder is left to the user creating the dataset.

Future releases of sparc-me and the DigitalTWINS platform API will include the ability to:
- add subjects and samples to an existing dataset;
- remove subjects and samples; and
- version-control datasets.

Until then, **if at any stage, any new subjects or samples need to be added or removed, please discard the dataset being created and create a new one**.

The following code snippet shows one example of how subjects and samples can be added to a dataset. Two breast MRI sequences for two subjects will be added to the dataset. In this case, each sample represents a different MRI sequence.

Please see Tutorial 1 for instructions for how to get access to the raw data that can be used with this tutorial.

In [None]:
subjects = []
samples = []

sample1 = sm.Sample()
# Change the path below to point to the location of example_raw_dataset folder as described in Tutorial 1.
sample1.add_path(r"X:\DigitalTWINS\resources\latest\example_datasets\example_raw_dataset\Breast_MRI_001\sequence1") # adding a folder
samples.append(sample1)

sample2 = sm.Sample()
sample2.add_path(r"X:\DigitalTWINS\resources\latest\example_datasets\example_raw_dataset\Breast_MRI_001\sequence2\1-001.dcm") # adding a file
sample2.add_path([
    r"X:\DigitalTWINS\resources\latest\example_datasets\example_raw_dataset\Breast_MRI_001\sequence2\1-002.dcm",
    r"X:\DigitalTWINS\resources\latest\example_datasets\example_raw_dataset\Breast_MRI_001\sequence2\1-003.dcm",
    r"X:\DigitalTWINS\resources\latest\example_datasets\example_raw_dataset\Breast_MRI_001\sequence2\1-004.dcm"]) # adding a list of files
samples.append(sample2)

subject1 = sm.Subject()
subject1.add_samples(samples)
subjects.append(subject1)

samples=[]
sample1 = sm.Sample()
sample1.add_path(r"X:\DigitalTWINS\resources\latest\example_datasets\example_raw_dataset\Breast_MRI_002\sequence1")
samples.append(sample1)

sample2 = sm.Sample()
sample2.add_path(r"X:\DigitalTWINS\resources\latest\example_datasets\example_raw_dataset\Breast_MRI_002\sequence2")
samples.append(sample2)

subject2 = sm.Subject()
subject2.add_samples(samples)
subjects.append(subject2)

One way the above code can be simplified is shown below.

```
subjects = []
for subject_user_id in [1, 2]:
    samples = []
    for sample_user_id in [1, 2]:
        sample = sm.Sample()
        sample.add_path(
            "../resources/example_raw_dataset/Breast_MRI_00{0}/sequence{1}/".format(
                subject_user_id, sample_user_id))
        samples.append(sample)
    subject = sm.Subject()
    subject.add_samples(samples)
    subjects.append(subject)  # collect each subject so it can be added to the dataset
```
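
Before any data is copied, it can be worth confirming that every source folder exists. The sketch below builds the expected folder for each subject and sample with `pathlib` and reports any that are missing. `RAW_ROOT` and `sample_source_paths` are hypothetical names, and the folder naming simply follows the example paths used in this tutorial.

```python
from pathlib import Path

# Assumed location of the raw data; adjust this to your own setup.
RAW_ROOT = Path('../resources/example_raw_dataset')


def sample_source_paths(raw_root, subject_ids, sample_ids):
    """Map (subject_id, sample_id) to the folder expected to hold that sequence."""
    return {
        (sub, sam): Path(raw_root)
        / 'Breast_MRI_{0:03d}'.format(sub)
        / 'sequence{0}'.format(sam)
        for sub in subject_ids
        for sam in sample_ids
    }


paths = sample_source_paths(RAW_ROOT, [1, 2], [1, 2])
missing = [str(p) for p in paths.values() if not p.exists()]
if missing:
    print('Missing source folders:')
    for p in missing:
        print('  ' + p)
```

Zero-padding the subject number with `{0:03d}` keeps the folder names correct if the subject IDs ever go beyond 9.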

Once the paths for the samples of each subject have been specified, the subjects can be added to the dataset as shown below. This will copy the data from its user-specified location to the subject and sample folders within the primary data folder of the SDS dataset being created. Note that the samples and subjects will be renumbered sequentially, and may not correspond to the user_ids. This will also automatically update the manifest SDS metadata file.

In [None]:
dataset.add_subjects(subjects)
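
Because subjects and samples are renumbered, it can help to record which source item ended up under which SDS ID. The sketch below assumes that IDs are assigned in the order the items were added, starting at 1, with the `sub-` and `sam-` prefixes used later in this tutorial. `expected_sds_ids` is a hypothetical helper, so check the result against the generated primary folder.

```python
def expected_sds_ids(user_ids, prefix):
    """Pair each user ID with the sequential SDS ID assumed for its position."""
    return {
        user_id: '{0}-{1}'.format(prefix, position)
        for position, user_id in enumerate(user_ids, start=1)
    }


print(expected_sds_ids(['Breast_MRI_001', 'Breast_MRI_002'], 'sub'))
print(expected_sds_ids(['sequence1', 'sequence2'], 'sam'))
```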

The next step is to add the required and additional metadata elements in the subjects and samples SDS metadata files. The two cells below list all metadata elements in the subjects metadata file, followed by only the required ones.

In [None]:
from pprint import pprint
schema = sm.Schema()
subjects_schema = schema.get_schema(
    metadata_file='subjects', print_schema=True)
pprint(subjects_schema)

In [None]:
required_subjects_schema = schema.get_schema(
    metadata_file='subjects', print_schema=False, required_only=True, name_only=False)
pprint(required_subjects_schema)

The same listings can be produced for the samples SDS metadata file: first all metadata elements, then only the required ones.

In [None]:
samples_schema = schema.get_schema(
    metadata_file='samples', print_schema=True)
pprint(samples_schema)

In [None]:
required_samples_schema = schema.get_schema(
    metadata_file='samples', print_schema=False, required_only=True, name_only=False)
pprint(required_samples_schema)

Now that we are aware of which metadata elements are used to describe the data that was added to the dataset (and which metadata elements are required), the code below shows an example of how the values for these metadata fields can be added to the dataset.

In [None]:
subject_sds_id = "sub-1"
subject = dataset.get_subject(subject_sds_id)
subject.set_value(
    element='subject experimental group',
    value='Control')

sample_sds_id = "sam-1"
sample = subject.get_sample(sample_sds_id)
sample.set_value(
    element='sample experimental group',
    value='Experimental')

sample_sds_id = "sam-2"
sample = subject.get_sample(sample_sds_id)
sample.set_value(
    element='sample experimental group',
    value='Experimental')

subject_sds_id = "sub-2"
subject = dataset.get_subject(subject_sds_id)
subject.set_value(
    element='subject experimental group',
    value='Control')

sample_sds_id = "sam-1"
sample = subject.get_sample(sample_sds_id)
sample.set_value(
    element='sample experimental group',
    value='Experimental')

sample_sds_id = "sam-2"
sample = subject.get_sample(sample_sds_id)
sample.set_value(
    element='sample experimental group',
    value='Experimental')

ages = ['30y', '20y']
sex = ['Female'] * 2  # Create a list with the item repeated the specified number of times.
species = ['Human'] * 2
strain = ['Not Defined'] * 2
RRID_for_strain = ['Not Defined'] * 2
age_category = ['Prime Adult Stage'] * 2
also_in_dataset = ['Not Defined'] * 2
member_of = ['Not Defined'] * 2

for idx, subject_sds_id in enumerate(["sub-1", "sub-2"]):
    subject = dataset.get_subject(subject_sds_id)
    subject.set_value(element='age', value=ages[idx])
    subject.set_values({
        "subject id": '',
        'sex': sex[idx],
        'species': species[idx],
        'strain': strain[idx],
        'RRID for strain': RRID_for_strain[idx],
        'age category': age_category[idx],
        'also in dataset': also_in_dataset[idx],
        'member of': member_of[idx]})
    
    sample_anatomical_location = ['Breast'] * 2
    also_in_dataset = ['Not Defined'] * 2
    member_of = ['Not Defined'] * 2

    for i, sample_sds_id in enumerate(["sam-1", "sam-2"]):
        sample = subject.get_sample(sample_sds_id)
        sample.set_values({
            'was derived from': 'Not Defined',
            'sample type': 'DCE-MRI Contrast Image {0}'.format(sample_sds_id),
            'sample anatomical location': sample_anatomical_location[i],
            'also in dataset': also_in_dataset[i],
            'member of': member_of[i]})
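
When most values are identical across subjects, one alternative to parallel lists is to keep the shared values in a single dictionary and add only what differs. The sketch below builds the dictionary that would be passed to `set_values` for each subject. `subject_metadata` is a hypothetical helper, and the values are the ones used in the cell above.

```python
# Values shared by every subject in this example dataset.
common = {
    'sex': 'Female',
    'species': 'Human',
    'strain': 'Not Defined',
    'RRID for strain': 'Not Defined',
    'age category': 'Prime Adult Stage',
    'also in dataset': 'Not Defined',
    'member of': 'Not Defined',
}
# Values that differ per subject, keyed by SDS subject ID.
ages = {'sub-1': '30y', 'sub-2': '20y'}


def subject_metadata(common_values, ages_by_subject):
    """Return one metadata dict per subject: the shared values plus that subject's age."""
    return {
        sds_id: dict(common_values, age=age)
        for sds_id, age in ages_by_subject.items()
    }


metadata = subject_metadata(common, ages)
for sds_id, values in metadata.items():
    print(sds_id, values['age'], values['sex'])
    # With the dataset from this tutorial available, the values could be applied with:
    # dataset.get_subject(sds_id).set_values(values)
```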

## Adding or removing thumbnails

Thumbnail images can be added to a dataset as follows. The manifest metadata file will be automatically updated when thumbnails are added. `thumbnail_0.jpg` will be set as the main thumbnail for the dataset on the data catalogue page of the DigitalTWINS platform's portal. Additional thumbnails can be numbered sequentially, e.g. `thumbnail_1.jpg`. All thumbnails will be visible in the gallery view of a data catalogue in the portal.

In [None]:
dataset.add_thumbnail("../resources/example_raw_dataset/thumbnail_0.jpg")
dataset.add_thumbnail("../resources/example_raw_dataset/thumbnail_1.jpg")
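
Since the thumbnail names carry meaning (`thumbnail_0.jpg` is the main image and the rest are numbered sequentially), a quick check of the file names before adding them can catch gaps or typos. The sketch below is hypothetical and not part of sparc-me. It assumes `.jpg` files only, as in the example above.

```python
import re

# Matches names such as thumbnail_0.jpg and captures the number.
THUMBNAIL_PATTERN = re.compile(r'^thumbnail_(\d+)\.jpg$')


def check_thumbnail_names(filenames):
    """Return True if the names are thumbnail_0.jpg, thumbnail_1.jpg, ... with no gaps."""
    indices = []
    for name in filenames:
        match = THUMBNAIL_PATTERN.match(name)
        if match is None:
            return False
        indices.append(int(match.group(1)))
    # The numbers must be exactly 0, 1, ..., n-1 (in any order).
    return sorted(indices) == list(range(len(indices)))


print(check_thumbnail_names(['thumbnail_0.jpg', 'thumbnail_1.jpg']))
print(check_thumbnail_names(['thumbnail_0.jpg', 'thumbnail_2.jpg']))
```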

Thumbnail images can be removed using the `remove_thumbnail` method, e.g. `dataset.remove_thumbnail("../resources/example_raw_dataset/thumbnail_1.jpg")`.

## Checking a dataset
Dataset checking (validation) can be performed as follows. This allows the fields of the dataset to be checked before it is submitted to the portal.

In [None]:
validator = sm.Validator()
validator.validate_dataset(dataset)

Currently, only basic validation has been implemented. More validation features will be added in future releases of `sparc-me`.
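
Until more validation is available, a simple pre-check of your own can flag required elements that were left empty. The sketch below works on plain Python data. `missing_required` is a hypothetical helper, and the element names in the example are illustrative rather than a statement of which elements SDS requires.

```python
def missing_required(required_elements, values_by_element):
    """Return the required element names that have no non-empty value."""
    missing = []
    for name in required_elements:
        values = values_by_element.get(name)
        # Treat an absent entry, an empty string and an empty list as missing.
        if values is None or values == '' or values == []:
            missing.append(name)
    return missing


# Illustrative input: names and values gathered by hand for this example.
required = ['Type', 'Title', 'Keywords']
entered = {'Type': 'Experimental', 'Title': '', 'Keywords': ['Breast cancer', 'MRI']}
print(missing_required(required, entered))
```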

## Feedback
Once you have completed this tutorial, please complete [this survey](https://docs.google.com/forms/d/e/1FAIpQLSe-EsVz6ahz2FXFy906AZh68i50jRYnt3hQe-loc-1DaFWoFQ/viewform?usp=sf_link), which will allow us to improve this and future tutorials.

## Next steps
The [next tutorial](https://github.com/ABI-CTT-Group/digitaltwins-api/blob/main/tutorials/tutorial_5_uploading_datasets.ipynb) will show how to upload your dataset to your instance of the 12 LABOURS DigitalTWINS Platform using its Python API.