# Tutorial 2: Create datasets and descriptions
This tutorial shows how to create datasets and descriptions in the SPARC Data Structure (SDS) format.



## 1. Create an empty SDS dataset or load an existing SDS dataset
* Set up sparc-me

    sparc-me is a Python tool to explore, enhance, and expand SPARC datasets and their descriptions in accordance with FAIR principles.

In [None]:
from sparc_me import Dataset

dataset = Dataset()

* Create an empty dataset from the SDS template. 

    This tutorial strongly recommends SDS template version 2.0.0. Also set `save_dir` to the path where you want the dataset to be saved; it is used when the dataset is saved later in this tutorial.

In [None]:
save_dir = "./tmp/template/"
dataset.load_from_template(version="2.0.0")

* Load an existing SDS dataset (not recommended in this tutorial)

In [None]:
existing_dir = "./your/dataset/path/here"
# dataset.load_dataset(dataset_path=existing_dir)

## 2. Get metadata files and clear metadata if necessary

This tutorial focuses on the dataset_description metadata file. For other metadata files, replace the category with the name of the metadata file you want to use, e.g., change `category="dataset_description"` to `category="code_parameters"`.

* List the categories and elements of metadata files

In [None]:
categories = dataset.list_categories(version="2.0.0")
elements = dataset.list_elements(category="dataset_description", version="2.0.0")

* Get metadata file

    Again, the dataset_description metadata file is used as an example here. Feel free to replace the category with the name of the metadata file you want to use.

In [None]:
dataset_description = dataset.get_metadata(category="dataset_description")

* (Optional) clear your metadata values before you edit them.
    * Clear all metadata values in the dataset_description file
    * Clear the entire row (e.g., `field_name='Contributor role'`) of metadata values

In [None]:
dataset_description.clear_values()

In [None]:
dataset_description.clear_values(field_name='Contributor role')

## 3. Add/update metadata values and save them

* This function allows you to add or update metadata values

    `add_values( *values: Any, row_name: str = '', header: str = '', append: bool = True)`

    * `*values` takes one or more values to add or update (strings in the examples in this tutorial).
    * `row_name` is the row heading in the `dataset_description` and `code_description` metadata files, or the element name in other metadata files.
    * `header` is the column heading in the metadata file. In `dataset_description` and `code_description` it defaults to the `Value` column; specify another column if needed.
    * `append` is a boolean that defaults to True, in which case the new values are appended to the end of the existing ones. If `append` is set to False, the existing values are overwritten with the new values you specify.

* Adding values by rows and columns
    * By rows:
        * Only the `dataset_description` and `code_description` metadata files have both row and column headings, while the other files only have column headings. It is therefore recommended to use the `row_name` parameter and add values by rows in the `dataset_description` and `code_description` metadata files.
             ```python
            dataset_description.add_values(*["test1", "test2", "test3"], row_name="contributor role")
            # Also supports:
            dataset_description.add_values("test1", "test2", "test3", row_name="contributor role", header="Value")
            ```
        * In other metadata files, you can insert a row of values without specifying `row_name` or `header`. Note that the number of values must match the number of columns.

            ```python
            code_parameters.add_values(*["breast_append", "test1_append", "test2_append", "test3_append", "test4_append", "test5..._append","test3_append", "test4_append", "test5_append"])
            ```
    * By columns:
        * In metadata files such as code_parameters, manifest, performances, resources, samples, subjects, and submission metadata, it is recommended to use the `header` parameter and add values by column.

            ```python
            code_parameters.add_values(*["test1_name", "test2_name", "test3_name", "test4_name"], header='name')
            ```
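
The effect of the `append` flag is easiest to see on a toy model. The sketch below does not use sparc-me; it mimics one metadata row with a plain Python list, and the helper `add_values_sketch` is hypothetical, written only to illustrate the append-versus-overwrite behaviour described above.

```python
from typing import Any, Dict, List


def add_values_sketch(table: Dict[str, List[Any]], *values: Any,
                      row_name: str, append: bool = True) -> None:
    """Toy stand-in for add_values; this is NOT the sparc-me implementation."""
    if append:
        # Keep what is already in the row and add the new values at the end.
        table.setdefault(row_name, []).extend(values)
    else:
        # Discard the existing values and keep only the new ones.
        table[row_name] = list(values)


table: Dict[str, List[Any]] = {}

# Two calls with the default append=True accumulate values in the row.
add_values_sketch(table, "developer", "tester", row_name="contributor role")
add_values_sketch(table, "Researcher", row_name="contributor role")
appended = list(table["contributor role"])

# append=False replaces the row content.
add_values_sketch(table, "Researcher", row_name="contributor role", append=False)
overwritten = list(table["contributor role"])

print(appended)
print(overwritten)
```

In practice this means that re-running a notebook cell with the default `append=True` may accumulate duplicate values, so clearing the row first or passing `append=False` is worth considering.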

In [None]:
dataset_description.add_values("2.0.0", row_name='metadataversion')
dataset_description.add_values("experimental", row_name='type')
dataset_description.add_values("Duke breast cancer MRI preprocessing", row_name='Title')
dataset_description.add_values("""Preprocessing the breast cancer MRI images and saving in Nifti format""",
                                      row_name='subtitle')
dataset_description.add_values("Breast cancer", "image processing", row_name='Keywords')
dataset_description.add_values("""Preprocessing the breast cancer MRI images and saving in Nifti format""",
                                      row_name="Study purpose")
dataset_description.add_values("derived from Duke Breast Cancer MRI dataset",
                                      row_name='Study data Collection')
dataset_description.add_values("NA", row_name='Study primary conclusion')
#dataset_description.add_values("NA", row_name='Study primary conclusion', append=True)
dataset_description.add_values("breast", row_name='Study organ system')
dataset_description.add_values("image processing", row_name='Study approach')
dataset_description.add_values("""dicom2nifti""", row_name='Study technique')
dataset_description.add_values("Lin, Chinchien", "Gao, Linkun", row_name='contributorname')
#dataset_description.add_values("Prasad", "Jiali", row_name='contributorNAME', append=True)
#dataset_description.add_values(*["bob", "db"], row_name="contributor name", append=True)
dataset_description.add_values(
    "https://orcid.org/0000-0001-8170-199X",
    "https://orcid.org/0000-0001-8171-199X",
    "https://orcid.org/0000-0001-8172-199X",
    "https://orcid.org/0000-0001-8173-199X",
    "https://orcid.org/0000-0001-8174-199X",
    "https://orcid.org/0000-0001-8176-199X",
    row_name='Contributor orcid')

dataset_description.add_values(*["University of Auckland"] * 6, row_name='Contributor affiliation')
dataset_description.add_values(*["developer", "developer", "Researcher", "Researcher", "tester", "tester"],
                                      row_name="contributor role")
dataset_description.add_values("source", row_name='Identifier description')
dataset_description.add_values("WasDerivedFrom", row_name='Relation type')
dataset_description.add_values("DTP-UUID", row_name='Identifier')
dataset_description.add_values("12L digital twin UUID", row_name='Identifier type')
dataset_description.add_values("1", row_name='Number of subjects')
dataset_description.add_values("1", row_name='Number of samples')

* Get values

    The `get_values(field_name: str)` method returns the values from a specific row or column of the metadata file, given the row name or column name.

In [None]:
values = dataset_description.get_values(field_name="contributorrole")
print(values)
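
The calls in this tutorial spell the same row as `'Contributor role'`, `"contributor role"`, and `"contributorrole"`, which suggests that names are matched without regard to case or spaces. The sketch below shows one way such matching could work. `normalize_name` is a hypothetical helper and an assumption about the behaviour, not sparc-me's actual code.

```python
def normalize_name(name: str) -> str:
    """Lower-case a field name and drop all whitespace (hypothetical helper)."""
    return "".join(name.split()).lower()


# The three spellings of the same row used in this tutorial.
spellings = ["Contributor role", "contributor role", "contributorrole"]

# All spellings collapse to a single normalized key.
normalized = {normalize_name(s) for s in spellings}
print(normalized)
```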

* Remove values

    The `remove_values(*values: Any, field_name: str)` method removes values from a specific row or column of the metadata file. It needs two things: the values you want to remove, and the name of the row or column that holds them.

In [None]:
dataset_description.remove_values("tester", field_name="contributor role")

* Save datasets including metadata

    If you encounter any errors when running the code in this section:

    * Clear the outputs and run the code again. If it still doesn't work, create a folder `tmp/template` under the examples folder.
    * Run this code directly in Jupyter Notebook in the browser. Do not run it in IDEs such as PyCharm.

In [None]:
dataset.save(save_dir=save_dir)
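
If saving fails because the target folder is missing, the folder can be created from Python rather than by hand. This is a minimal sketch using only the standard library. `ensure_dir` is a hypothetical helper name, and nothing is assumed here about whether `dataset.save` creates the folder itself.

```python
from pathlib import Path


def ensure_dir(path: str) -> Path:
    """Create the directory (and any missing parents) if it does not exist."""
    target = Path(path)
    # exist_ok=True makes the call safe to repeat when re-running the notebook.
    target.mkdir(parents=True, exist_ok=True)
    return target


save_dir = "./tmp/template/"
created = ensure_dir(save_dir)
print(created.is_dir())
```

Run this before `dataset.save(save_dir=save_dir)` so that the save step has an existing folder to write to.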

## 4. Add datasets to dataset primary or derivative folders

* Copy data (either folders or files) from "source_data_raw" to an "sds_dataset" parent directory that adheres to the SDS framework. The "sds_dataset" directory is the primary and default directory.


In [None]:
dataset.add_samples(source_path="./test_data/sample1/raw", subject="subject-xyz", sample="sample-1",
                     data_type="primary", sds_parent_dir=save_dir)
# To move the data into the destination directory instead of copying it, set copy to False.
dataset.add_samples(source_path="./test_data/sample2/raw", subject="subject-xyz", sample="sample-2",
                 data_type="primary", sds_parent_dir=save_dir)

# Copy data from "source_data_derived" to a "sds_dataset" parent directory adhering to SDS framework.
dataset.add_samples(source_path="./test_data/sample1/derived", subject="subject-xyz", sample="sample-abc",
                 data_type="derivative", sds_parent_dir=save_dir)
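
The SDS layout places each sample under its subject, inside either the `primary` or the `derivative` folder. The sketch below only composes the destination path that such a layout implies. `sample_destination` is a hypothetical helper, and the exact folder structure produced by `add_samples` is an assumption based on the SDS convention, not taken from sparc-me's code.

```python
from pathlib import PurePosixPath


def sample_destination(sds_parent_dir: str, data_type: str,
                       subject: str, sample: str) -> PurePosixPath:
    """Compose <parent>/<data_type>/<subject>/<sample> (assumed SDS layout)."""
    return PurePosixPath(sds_parent_dir) / data_type / subject / sample


# Same arguments as the first add_samples call above.
dest = sample_destination("./tmp/template/", "primary", "subject-xyz", "sample-1")
print(dest)
```

Comparing this expected path with the folders that actually appear under `save_dir` is a quick way to check that the data landed where intended.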

* Add a subject along with its subject and sample metadata.

In [None]:
# copy data from "source_data_primary" to "sds_dataset" primary (default) directory
dataset.add_subject(source_path="./test_data/bids_data/sub-01", subject="subject-1", subject_metadata={
    "subject id": "",
    "subject experimental group": "experimental",
    "age": "041Y",
    "sex": "F",
    "species": "human",
    "strain": "tissue",
    "age category": "middle adulthood"
}, sample_metadata={
    "sample id": "",
    "subject id": "",
    "sample experimental group": "experimental",
    "sample type": "tissue",
    "sample anatomical location": "breast tissue",
})

* Add a single sample file to the dataset sample folder

In [None]:
dataset.add_samples(source_path="./test_data/sample1/raw/simple_test1.txt", subject="subject-xyz",
                        sample="sample-2",
                        data_type="primary", sds_parent_dir=save_dir)