# Tutorial 2: Create datasets and descriptions
This tutorial shows how to create datasets and their descriptions in the SPARC Data Structure (SDS) format. In this tutorial, you will learn:
* How to create an SDS template dataset or load an existing SDS dataset.
* How to retrieve the metadata and its values from the dataset and save the dataset locally.
* How to add or update metadata values.
* How to add raw data to the dataset.

## 1. Create an empty SDS dataset (recommended for this tutorial) or load an existing SDS dataset if you have one
* Set up sparc-me and create an instance of the `Dataset` class

    sparc-me is a Python tool to explore, enhance, and expand SPARC datasets and their descriptions in accordance with FAIR principles.

In [1]:
from sparc_me import Dataset

dataset = Dataset()

* Create an empty dataset from the SDS template

    In this tutorial, it is highly recommended to use SDS template version 2.0.0. Remember to include the path where you want your dataset to be saved.

In [2]:
# Specify the dataset path
save_dir = "SDStemplate/"
dataset.set_dataset_path(save_dir)
dataset.load_from_template(version="2.0.0")

{'CHANGES': WindowsPath('C:/Users/jxu759/Documents/digital-twin-platform-workshop/digital-twin-platform/venv/Lib/site-packages/sparc_me/core/../resources/templates/version_2_0_0/DatasetTemplate/CHANGES'),
 'code': WindowsPath('C:/Users/jxu759/Documents/digital-twin-platform-workshop/digital-twin-platform/venv/Lib/site-packages/sparc_me/core/../resources/templates/version_2_0_0/DatasetTemplate/code'),
 'code_description': {'path': WindowsPath('C:/Users/jxu759/Documents/digital-twin-platform-workshop/digital-twin-platform/venv/Lib/site-packages/sparc_me/core/../resources/templates/version_2_0_0/DatasetTemplate/code_description.xlsx'),
  'metadata':                                      Metadata element  \
  0                                           RRID Term   
  1                                     RRID Identifier   
  2                                    Ontological term   
  3                              Ontological Identifier   
  4                              Ten Simple Rules (T

* Save the template dataset that you just loaded

    Explore the SDS template dataset after it's saved locally. __Restart the kernel if you have problems saving the dataset.__

In [12]:
dataset.save(save_dir=save_dir)
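
To explore what was written to disk, you can list the top-level entries of the saved folder. The helper below is a minimal sketch that uses only the Python standard library; `list_top_level` is a hypothetical name and is not part of sparc-me. The demo builds a throwaway folder containing three entry names taken from the output above, so it runs without an actual SDS template.

```python
import tempfile
from pathlib import Path


def list_top_level(dataset_dir):
    """Return the sorted names of files and folders directly inside dataset_dir."""
    return sorted(entry.name for entry in Path(dataset_dir).iterdir())


# Demo on a temporary folder that mimics a few template entries.
# These three names are placeholders, not a complete SDS template.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "code").mkdir()
    (root / "CHANGES").touch()
    (root / "code_description.xlsx").touch()
    entries = list_top_level(root)

print(entries)
```

In the tutorial itself, you would call `list_top_level(save_dir)` after `dataset.save(save_dir=save_dir)` has finished.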

* Load an existing SDS dataset (not recommended in this tutorial unless your dataset is based on SDS template version 2.0.0)

In [None]:
# set_dataset_path is optional here. You can either set a new path or use the dataset's current path when loading the existing dataset.
# dataset.set_dataset_path(save_dir)
# existing_dir = "./your/dataset/path/here"
# dataset.load_dataset(dataset_path=existing_dir)

## 2. Get the metadata and its associated values. Clear the metadata before updating it for your dataset
This tutorial focuses on the dataset_description metadata file. For other metadata files, replace the `category` argument with the name of the metadata file you want to use, e.g., change `category="dataset_description"` to `category="code_parameters"`.

* List the categories of metadata

In [13]:
categories = dataset.list_categories(version="2.0.0")

Categories:
code_description
code_parameters
dataset_description
manifest
performances
resources
samples
subjects
submission


* List the elements of a metadata category, e.g., dataset_description metadata

In [6]:
elements = dataset.list_elements(category="dataset_description", version="2.0.0")

Category: dataset_description
Metadata Version
    Required: Y
    Type: string
    Description: 2.0.0
    Example: 2.0.0
Type
    Required: Y
    Type: string
    Description: The type of this dataset, specifically whether it is experimental or computation. The only valid values are experimental or computational. If experimental subjects are required, if computational, subjects are not required. Set to experimental by default, if you are submitting a computational study be sure to change it.
    Example: experimental
Title
    Required: Y
    Type: string
    Description: Descriptive title for the data set. Equivalent to the title of a scientific paper. The metadata associated with the published version of this dataset does not currently make use of this field.
    Example: My SPARC dataset
Subtitle
    Required: Y
    Type: string
    Description: NOTE This field is not currently used when publishing a SPARC dataset. Brief description of the study and the data set. Equivalent to the 

* Get metadata file

    Again, dataset_description metadata is used as an example here. Feel free to replace the category with the name of the metadata file you want to use.

In [14]:
dataset_description = dataset.get_metadata(category="dataset_description")

* Get metadata's associated values

    The `get_values(field_name: str)` method retrieves the values from a specific row or column of the metadata file, given the row name or column name.

In [15]:
dataset_description.get_values(field_name='Contributor role')

Value      PrincipalInvestigator
Value 2      CorrespondingAuthor
Value 3                      NaN
Value n                      NaN
Name: 20, dtype: object

* Clear all default metadata values before you edit them if you created your datasets from the template
    * Clear all metadata values in the dataset_description file
    * (Optional) clear the entire row (e.g., `field_name='Contributor role'`) of metadata values

In [26]:
dataset_description.clear_values()

* Get the values again to check if they have been deleted

In [18]:
dataset_description.get_values(field_name='Contributor role')

Value    <NA>
Name: 20, dtype: object

In [None]:
# dataset_description.clear_values(field_name='Contributor role')

## 3. Add/update metadata values

* This function allows you to add or update metadata values

    `add_values(*values: Any, row_name: str = '', col_name: str = '', append: bool = True)`

    * `*values` takes one or more string values to store as metadata values.
    * `row_name` is the row heading in the `dataset_description` and `code_description` metadata files, or the element name in other metadata files.
    * `col_name` is the column heading in the metadata file. (The default column heading in `dataset_description` and `code_description` is `Value`. Feel free to specify a different column heading.)
    * `append` takes a boolean value. The default is `True`, which appends the new values to the end of the existing ones. If `append` is set to `False`, the existing values are replaced with the new values you specify.

* Adding values by rows and columns
    * By rows:
        * Only the `dataset_description` and `code_description` metadata files have both row and column headings, while other files only have column headings. Therefore, it is recommended to use the `row_name` parameter and add values by rows in the `dataset_description` and `code_description` metadata files.
            ```python
            dataset_description.add_values(*["test1", "test2", "test3"], row_name="contributor role")

            # Also supports column name. The values will begin populating from the cell identified by its row and column names (the default value is "Value")
            dataset_description.add_values("test1", "test2", "test3", row_name="contributor role", col_name="Value")
            ```
        * To add values to other metadata files, you can insert a whole row without specifying `row_name` or `col_name`. Note that the number of values must match the number of columns.

            ```python
            code_parameters.add_values(*["breast_append", "test1_append", "test2_append", "test3_append", "test4_append", "test5..._append","test3_append", "test4_append", "test5_append"])
            ```
    * By columns:
        * In metadata files such as code_parameters, manifest, performances, resources, samples, subjects, and submission metadata, it is recommended to use the `col_name` parameter and add values by column.

            ```python
            code_parameters.add_values(*["test1_name", "test2_name", "test3_name", "test4_name"], col_name='name')
            ```
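
The effect of the `append` flag described above can be illustrated on a plain Python list. This is only a sketch of the documented behaviour, not sparc-me's implementation; `add_to_row` is a hypothetical helper, and a metadata row is modelled here as a simple list of cell values.

```python
def add_to_row(row, *values, append=True):
    """Return a new row: existing cells plus values, or values only if append is False."""
    if append:
        return list(row) + list(values)
    return list(values)


row = ["PrincipalInvestigator"]

# append=True (the default) keeps the existing cell and adds the new ones after it
appended = add_to_row(row, "developer", "tester")

# append=False discards the existing cells and keeps only the new values
replaced = add_to_row(row, "developer", "tester", append=False)

# Unpacking a list with * is equivalent to passing the values one by one
unpacked = add_to_row(row, *["developer", "tester"])

print(appended)
print(replaced)
```

The last call shows why the tutorial can pass either `*["a", "b"]` or `"a", "b"`: both forms reach the function as the same positional values.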

In [27]:
dataset_description.add_values("2.0.0", row_name='metadataversion')
dataset_description.add_values("experimental", row_name='type')
dataset_description.add_values("Duke breast cancer MRI preprocessing", row_name='Title')
dataset_description.add_values("""Preprocessing the breast cancer MRI images and saving in Nifti format""",
                                      row_name='subtitle')
dataset_description.add_values("Breast cancer", "image processing", row_name='Keywords')
dataset_description.add_values("""Preprocessing the breast cancer MRI images and saving in Nifti format""",
                                      row_name="Study purpose")
dataset_description.add_values("derived from Duke Breast Cancer MRI dataset",
                                      row_name='Study data Collection')
dataset_description.add_values("NA", row_name='Study primary conclusion')
dataset_description.add_values("breast", row_name='Study organ system')
dataset_description.add_values("image processing", row_name='Study approach')
dataset_description.add_values("""dicom2nifti""", row_name='Study technique')
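# The next three calls spell the row name differently but are all intended for the contributor name row
# (this assumes row names are matched regardless of case and spaces, as the calls in this tutorial suggest).
dataset_description.add_values("Lin, Chinchien", "Gao, Linkun", row_name='contributorname')
dataset_description.add_values("Prasad", "Jiali", row_name='contributorNAME')
dataset_description.add_values(*["bob", "db"], row_name="contributor name")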
dataset_description.add_values(
    "https://orcid.org/0000-0001-8170-199X",
    "https://orcid.org/0000-0001-8171-199X",
    "https://orcid.org/0000-0001-8172-199X",
    "https://orcid.org/0000-0001-8173-199X",
    "https://orcid.org/0000-0001-8174-199X",
    "https://orcid.org/0000-0001-8176-199X",
    row_name='Contributor orcid')

dataset_description.add_values(*["University of Auckland"] * 6, row_name='Contributor affiliation')
dataset_description.add_values(*["developer", "developer", "Researcher", "Researcher", "tester", "tester"],
                                      row_name="contributor role")
dataset_description.add_values("source", row_name='Identifier description')
dataset_description.add_values("WasDerivedFrom", row_name='Relation type')
dataset_description.add_values("DTP-UUID", row_name='Identifier')
dataset_description.add_values("12L digital twin UUID", row_name='Identifier type')

* Get the values you just added



In [28]:
values = dataset_description.get_values(field_name="contributorrole")
print(values)

Value       developer
Value 1     developer
Value 2    Researcher
Value 3    Researcher
Value 4        tester
Value 5        tester
Name: 20, dtype: object
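
The calls in this tutorial refer to the same row as `Contributor role`, `contributor role`, and `contributorrole`. One way such tolerant matching could work is to compare names after lower-casing them and removing whitespace. The sketch below is an assumption for illustration only; `normalize` and `find_row` are hypothetical helpers, and sparc-me's actual matching rules may differ.

```python
def normalize(name):
    """Lower-case a field name and drop all whitespace."""
    return "".join(name.split()).lower()


def find_row(row_names, query):
    """Return the first row name that equals query after normalization."""
    key = normalize(query)
    for name in row_names:
        if normalize(name) == key:
            return name
    raise KeyError(query)


# Two row headings used in this tutorial
ROW_NAMES = ["Contributor name", "Contributor role"]

print(find_row(ROW_NAMES, "contributorrole"))
print(find_row(ROW_NAMES, "contributorNAME"))
```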


* Remove values by row or column

    The `remove_values(*values: Any, field_name: str)` method can be used to remove values from a specific row or column of the metadata file. Two arguments are required for this method: values you want to remove and the row name or column name of the values.

In [29]:
dataset_description.remove_values("tester", field_name="contributor role")

values = dataset_description.get_values(field_name="contributorrole")
print(values)

Value       developer
Value 1     developer
Value 2    Researcher
Value 3    Researcher
Value 4          <NA>
Value 5          <NA>
Name: 20, dtype: object
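
In the output above, the two `tester` cells are empty after the call while the row keeps its six columns. A plain-Python sketch that produces the same result is shown below. It is an illustration under the assumption that matching cells are blanked in place; the example output does not reveal whether sparc-me instead shifts the remaining values, and `remove_from_row` is a hypothetical helper.

```python
def remove_from_row(row, *values):
    """Blank out every cell whose content is in values, keeping the row length."""
    targets = set(values)
    return [None if cell in targets else cell for cell in row]


roles = ["developer", "developer", "Researcher", "Researcher", "tester", "tester"]
remaining = remove_from_row(roles, "tester")
print(remaining)
```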


## 4. Add raw data to dataset 'primary' folders (derivative data will be covered in the next workshop)

In your research drive, you'll find a folder named 'test_data' which contains the test datasets for this section of the tutorial. Please move the 'test_data' folder into the tutorial directory.
* To comply with the SDS framework, subject and sample folders MUST be named in this format: sub-xx (for subjects) and sam-xx (for samples).

* Copy the data, which can be either folders or files, from the raw sample data folder to the SDS dataset folder that you specified at the beginning of the tutorial.
    * Add subject(s) - add subject(s) folder along with its subject and sample metadata.
    * Add sample(s) -  add a single sample file or multiple sample files to the dataset sample folder
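
Before copying data, it can help to check folder names against the `sub-xx` / `sam-xx` convention mentioned above. The following sketch uses regular expressions from the standard library. The set of characters allowed after the prefix (letters and digits) is an assumption made for this example, not a rule taken from the SDS specification, and both helper names are hypothetical.

```python
import re

# Assumed patterns: the required prefix followed by one or more letters or digits
SUBJECT_PATTERN = re.compile(r"sub-[A-Za-z0-9]+")
SAMPLE_PATTERN = re.compile(r"sam-[A-Za-z0-9]+")


def is_valid_subject(name):
    """Return True if name looks like a subject folder name, e.g. sub-1."""
    return SUBJECT_PATTERN.fullmatch(name) is not None


def is_valid_sample(name):
    """Return True if name looks like a sample folder name, e.g. sam-1."""
    return SAMPLE_PATTERN.fullmatch(name) is not None


for candidate in ["sub-1", "sub-xyz", "subject1", "sam-3"]:
    print(candidate, is_valid_subject(candidate), is_valid_sample(candidate))
```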

    

In [31]:
# Add a subject folder that is named with the subject ID. Copy data from primary source data to SDS dataset directory.
dataset.add_subject(source_path="test_data/bids_data/sub-01", subject="sub-1", subject_metadata={
    "subject experimental group": "experimental",
    "age": "041Y",
    "sex": "F",
    "species": "human",
    "strain": "tissue",
    "age category": "middle adulthood"
}, sample_metadata={
    "sample id": "",
    "subject id": "",
    "sample experimental group": "experimental",
    "sample type": "tissue",
    "sample anatomical location": "breast tissue",
})

# Add multiple subjects
dataset.add_subjects(source_paths=["test_data/bids_data/sub-01", "test_data/bids_data/sub-02"], subjects=["sub-1", "sub-2"], subject_metadata={
    "subject experimental group": "experimental",
    "species": "human",
    "strain": "tissue",
}, sample_metadata={
    "sample id": "",
    "subject id": "",
    "sample experimental group": "experimental",
    "sample type": "tissue",
    "sample anatomical location": "breast tissue",
})



In [35]:
# Add a primary sample dataset to the template dataset
dataset.add_sample(source_path="test_data/sample1/raw", subject="sub-xyz", sample="sam-1",
                     data_type="primary", sds_parent_dir=save_dir)

# Add multiple primary sample datasets to the template dataset
dataset.add_samples(source_paths=["test_data/sample1/raw", "test_data/sample3/raw"], subject="sub-xyz", samples=["sam-1", "sam-3"],
                     data_type="primary", sds_parent_dir=save_dir)
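
To make the copy step more concrete, the sketch below copies a raw sample folder into a `<dataset>/<data_type>/<subject>/<sample>` folder using only the standard library. The folder layout and the helper `copy_sample` are assumptions for illustration; this is not how `add_sample` is implemented, and unlike `add_sample` the sketch handles folders only and does not update any metadata files. The file name `scan.nii` is a placeholder.

```python
import shutil
import tempfile
from pathlib import Path


def copy_sample(source_path, dataset_dir, subject, sample, data_type="primary"):
    """Copy the folder source_path into dataset_dir/data_type/subject/sample."""
    target = Path(dataset_dir) / data_type / subject / sample
    # copytree creates missing parent folders; dirs_exist_ok needs Python 3.8+
    shutil.copytree(source_path, target, dirs_exist_ok=True)
    return target


# Demo in a temporary directory so that no real data is touched
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "raw"
    source.mkdir()
    (source / "scan.nii").write_text("placeholder")

    target = copy_sample(source, Path(tmp) / "SDStemplate", "sub-xyz", "sam-1")
    relative = target.relative_to(tmp).as_posix()
    copied = sorted(p.name for p in target.iterdir())

print(relative)
print(copied)
```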



* After updating subject or sample data, the following metadata files will be updated __automatically__: dataset_description.xlsx, manifest.txt, samples.xlsx, subjects.xlsx

* Save the updated datasets 

    If you encounter any errors when running the above code:
        
    * Clear outputs and restart the kernel. Run the code again.
    * Run this code directly in Jupyter Notebook in a browser instead of in an IDE such as PyCharm.

In [33]:
dataset.save(save_dir=save_dir)