# Tutorial 2: Create datasets and their metadata
This tutorial shows how to create datasets and their metadata in the SPARC Data Structure (SDS) format using the sparc-me API. In this tutorial, you will learn:
* How to create an SDS template dataset or load an existing SDS dataset.
* How to retrieve the metadata and its values from the dataset and save the dataset locally.
* How to add or update metadata values.
* How to add raw data to the dataset.

## 1. Create an empty SDS dataset (recommended for this tutorial) or load an existing SDS dataset if you have one
* Set up sparc-me and create an instance of the `Dataset` class

    sparc-me is a Python tool to explore, enhance, and expand SPARC datasets and their descriptions in accordance with FAIR principles.

`pip install sparc-me`

In [2]:
from sparc_me import Dataset, Schema, Validator

dataset = Dataset()
schema = Schema()
validator = Validator()

* Create an empty dataset from the SDS template

    For this tutorial, SDS template version 2.0.0 is highly recommended. Remember to set the path where you want the dataset to be saved.

In [3]:
# Specify the dataset path
save_dir = "SDStemplate/"
dataset.set_dataset_path(save_dir)
dataset.load_from_template(version="2.0.0")

{'CHANGES': WindowsPath('C:/Users/jxu759/Documents/digital-twin-platform-workshop/digitaltwins-api/venv/Lib/site-packages/sparc_me/core/../resources/templates/version_2_0_0/DatasetTemplate/CHANGES'),
 'code': WindowsPath('C:/Users/jxu759/Documents/digital-twin-platform-workshop/digitaltwins-api/venv/Lib/site-packages/sparc_me/core/../resources/templates/version_2_0_0/DatasetTemplate/code'),
 'code_description': {'path': WindowsPath('C:/Users/jxu759/Documents/digital-twin-platform-workshop/digitaltwins-api/venv/Lib/site-packages/sparc_me/core/../resources/templates/version_2_0_0/DatasetTemplate/code_description.xlsx'),
  'metadata':                                      Metadata element  \
  0                                           RRID Term   
  1                                     RRID Identifier   
  2                                    Ontological term   
  3                              Ontological Identifier   
  4                              Ten Simple Rules (TSR)   
  5     

* Save the template dataset that you just loaded

    Explore the SDS template dataset after it's saved locally. __Restart the kernel if you have problems saving the dataset.__

In [4]:
dataset.save(save_dir=save_dir)
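One way to explore the saved template is to list what was written to the save directory. The sketch below uses only the Python standard library; `list_dataset_entries` is a hypothetical helper written for this tutorial, not part of sparc-me, and it assumes the dataset was saved to the `SDStemplate/` path used above.

```python
from pathlib import Path


def list_dataset_entries(root):
    """Return the sorted names of the files and folders directly under root."""
    root_path = Path(root)
    if not root_path.is_dir():
        # Nothing has been saved at this location yet.
        return []
    return sorted(entry.name for entry in root_path.iterdir())


# "SDStemplate/" is the save_dir used in this tutorial.
for name in list_dataset_entries("SDStemplate/"):
    print(name)
```

If the list is empty, the dataset has probably not been saved yet or was saved to a different path.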

* Load an existing SDS dataset (not recommended in this tutorial unless your dataset is based on SDS template version 2.0.0)

In [5]:
# set_dataset_path is optional here. You can either set a new path or use the dataset's current path to load the existing dataset
# dataset.set_dataset_path(save_dir)
# existing_dir = "./your/dataset/path/here"
# dataset.load_dataset(dataset_path=existing_dir)

## 2. Get the metadata and its associated values. Clear the metadata before updating it for your dataset
This tutorial focuses mainly on the dataset_description metadata file. For other metadata files, replace the category field with the name of the metadata file you want to use, e.g., from `category="dataset_description"` to `category="code_parameters"`.

* List all metadata files

In [6]:
categories = dataset.list_categories(version="2.0.0")

Categories:
code_description
code_parameters
dataset_description
manifest
performances
resources
samples
subjects
submission


* List the elements of a metadata file, e.g., dataset_description metadata

In [7]:
elements = dataset.list_elements(category="dataset_description")

Fields:
Metadata Version
Type
Basic information
Title
Subtitle
Keywords
Funding
Acknowledgments
Study information
Study purpose
Study data collection
Study primary conclusion
Study organ system
Study approach
Study technique
Study collection title
Contributor information
Contributor name
Contributor ORCiD
Contributor affiliation
Contributor role
Related protocol, paper, dataset, etc.
Identifier description
Relation type
Identifier
Identifier type
Participant information
Number of subjects
Number of samples


* Get metadata file

    Again, dataset_description metadata is used as an example here. Feel free to replace the category with the name of the metadata file you want to use.

In [8]:
dataset_description = dataset.get_metadata(category="dataset_description")

* Get schema information of an element of a metadata file

In [9]:
des_schema = schema.get_schema("dataset_description")
print(des_schema.get('Subtitle'))

{'Metadata Version': {'type': 'string', 'required': 'Y', 'description': 'SDS version number, e.g. 2.0.0. DO NOT CHANGE', 'example': '2.0.0'}, 'Type': {'type': 'string', 'required': 'Y', 'description': 'Each dataset consists of a single “type” of data, covered by the same ethics, same access control, same protocol, etc. There are two datasets types,  experimental and computation. Experimental is the default value. Make sure to change it to computation only if you are submitting a computational study. If the dataset is not” computational”, it should be set to “experimental”', 'example': 'experimental'}, 'Title': {'type': 'string', 'required': 'Y', 'description': 'Descriptive title for the data set. Equivalent to the title of a scientific paper.', 'example': 'My SPARC dataset'}, 'Subtitle': {'type': 'string', 'required': 'Y', 'description': 'Brief description of the study and the data set. Equivalent to the abstract of a scientific paper. Include the rationale for the approach, the types 
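The schema output above is a dictionary that maps each element name to its `type`, `required`, `description`, and `example` entries. As a rough sketch of how such a dictionary could be used, the code below lists the required elements and reports which of them are missing from a record. It works on a two-element excerpt copied from the output above (descriptions omitted); `required_elements` and `missing_required` are hypothetical helpers, not sparc-me functions, and this is not how the `Validator` class is implemented.

```python
# A two-element excerpt copied from the schema output above (descriptions omitted).
schema_excerpt = {
    "Metadata Version": {"type": "string", "required": "Y", "example": "2.0.0"},
    "Title": {"type": "string", "required": "Y", "example": "My SPARC dataset"},
}


def required_elements(schema_dict):
    """Return the element names whose 'required' flag is 'Y'."""
    return [name for name, spec in schema_dict.items() if spec.get("required") == "Y"]


def missing_required(schema_dict, record):
    """Return the required element names that are absent or empty in record."""
    return [name for name in required_elements(schema_dict) if not record.get(name)]


print(required_elements(schema_excerpt))
print(missing_required(schema_excerpt, {"Title": "Duke breast cancer MRI preprocessing"}))
```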

* Get metadata's associated values

    The `get_values(field_name: str)` method retrieves the values from a specific row or column of the metadata file, given the row name or column name.

In [10]:
dataset_description.get_values(field_name='Contributor role')

See https://numpy.org/devdocs/release/1.25.0-notes.html and the docs for more information.  (Deprecated NumPy 1.25)


Value      PrincipalInvestigator
Value 2      CorrespondingAuthor
Value 3                      NaN
Value n                      NaN
Name: 20, dtype: object

* Clear all default metadata values before you edit them if you created your datasets from the template
    * Clear all metadata values in the dataset_description file
    * (Optional) clear the entire row (e.g., `field_name='Contributor role'`) of metadata values

In [11]:
dataset_description.clear_values()



* Get the values again to check if they have been deleted

In [12]:
dataset_description.get_values(field_name='Contributor role')

Value    <NA>
Name: 20, dtype: object

In [13]:
# dataset_description.clear_values(field_name='Contributor role')

## 3. Add/update metadata values

* This function allows you to add or update metadata values

    `add_values(*values: Any, row_name: str = '', col_name: str = '', append: bool = True)`

    * `*values` allows single or multiple string values for metadata values.
    * `row_name` is the row heading in the `dataset_description` and `code_description` metadata files or elements in other metadata files.
    * `col_name` is the column heading in the metadata file. (In `dataset_description` and `code_description`, the default column heading is `Value`. Feel free to specify your own heading.)
    * `append` takes a boolean value. The default value is `True`, which appends an element to the end of the list. If the `append` is set to `False`, the values will be overwritten/replaced with the new values you specify.

* Adding values by rows and columns
    * By rows:
        * Only the `dataset_description` and `code_description` metadata files have both row and column headings; the other files have column headings only. Therefore, it is recommended to use the `row_name` parameter and add values by rows in the `dataset_description` and `code_description` metadata files.
             ```python
            dataset_description.add_values(*["test1", "test2", "test3"], row_name="contributor role")

            # Also supports column name. The values will begin populating from the cell identified by its row and column names (the default value is "Value")
            dataset_description.add_values("test1", "test2", "test3", row_name="contributor role", col_name="Value")
            ```
        * To add values to other metadata files, you can insert a whole row without specifying `row_name` or `col_name`. Note that the number of values must match the number of columns.

            ```python
            code_parameters.add_values(*["breast_append", "test1_append", "test2_append", "test3_append", "test4_append", "test5..._append","test3_append", "test4_append", "test5_append"])
            ```
    * By columns:
        * In metadata files such as code_parameters, manifest, performances, resources, samples, subjects, and submission metadata, it is recommended to use the `col_name` parameter and add values by column.

            ```python
            code_parameters.add_values(*["test1_name", "test2_name", "test3_name", "test4_name"], col_name='name')
            ```
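The effect of the `append` flag described above can be pictured with a plain Python list standing in for one metadata row. This is only a sketch of one reading of the description, where `append=False` replaces the whole row; `add_values_sketch` is a hypothetical function and not the sparc-me implementation.

```python
def add_values_sketch(row, *values, append=True):
    """Return a new row with values appended (default) or replacing the old ones."""
    if append:
        # Keep the existing cells and add the new ones at the end.
        return list(row) + list(values)
    # Discard the existing cells and keep only the new values.
    return list(values)


row = []
row = add_values_sketch(row, "Lin, Chinchien", "Gao, Linkun")
row = add_values_sketch(row, "Prasad", "Jiali")
print(row)  # four names: the second call appended to the first

row = add_values_sketch(row, "bob", append=False)
print(row)  # one name: the row was replaced
```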

In [14]:
dataset_description.add_values("2.0.0", row_name='metadataversion')
dataset_description.add_values("experimental", row_name='type')
dataset_description.add_values("Duke breast cancer MRI preprocessing", row_name='Title')
dataset_description.add_values("""Preprocessing the breast cancer MRI images and saving in Nifti format""",
                                      row_name='subtitle')
dataset_description.add_values("Breast cancer", "image processing", row_name='Keywords')
dataset_description.add_values("""Preprocessing the breast cancer MRI images and saving in Nifti format""",
                                      row_name="Study purpose")
dataset_description.add_values("derived from Duke Breast Cancer MRI dataset",
                                      row_name='Study data Collection')
dataset_description.add_values("pri", row_name='Study primary conclusion')
dataset_description.add_values("mary", row_name='Study primary conclusion')
dataset_description.add_values("breast", row_name='Study organ system')
dataset_description.add_values("image processing", row_name='Study approach')
dataset_description.add_values("""dicom2nifti""", row_name='Study technique')
dataset_description.add_values("Lin, Chinchien", "Gao, Linkun", row_name='contributorname')
dataset_description.add_values("Prasad", "Jiali", row_name='contributorNAME')
dataset_description.add_values(*["bob", "db"], row_name="contributor name")
dataset_description.add_values(
    "https://orcid.org/0000-0001-8170-199X",
    "https://orcid.org/0000-0001-8171-199X",
    "https://orcid.org/0000-0001-8172-199X",
    "https://orcid.org/0000-0001-8173-199X",
    "https://orcid.org/0000-0001-8174-199X",
    "https://orcid.org/0000-0001-8176-199X",
    row_name='Contributor orcid')

dataset_description.add_values(*["University of Auckland"] * 6, row_name='Contributor affiliation')
dataset_description.add_values(*["developer", "developer", "Researcher", "Researcher", "tester", "tester"],
                                      row_name="contributor role")
dataset_description.add_values("source", row_name='Identifier description')
dataset_description.add_values("WasDerivedFrom", row_name='Relation type')
dataset_description.add_values("DTP-UUID", row_name='Identifier')
dataset_description.add_values("12L digital twin UUID", row_name='Identifier type')

* Get the values you just added



In [15]:
values = dataset_description.get_values(field_name="contributorrole")
print(values)

Value       developer
Value 1     developer
Value 2    Researcher
Value 3    Researcher
Value 4        tester
Value 5        tester
Name: 20, dtype: object
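The calls above spell the same heading in several ways (`"contributor role"`, `"contributorrole"`, `'contributorNAME'`, `'Contributor orcid'`) and all of them reach the intended row, which suggests that headings are matched without regard to case or spaces. The sketch below shows one way such matching could work. `normalize_name` is a hypothetical helper; how sparc-me actually compares names is an assumption here.

```python
def normalize_name(name):
    """Lower-case a heading and drop all whitespace so spelling variants compare equal."""
    return "".join(name.split()).lower()


variants = ["Contributor role", "contributor role", "contributorrole"]
# All three variants collapse to the same key.
print({normalize_name(variant) for variant in variants})
```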


* Remove values by row or column

    The `remove_values(*values: Any, field_name: str)` method can be used to remove values from a specific row or column of the metadata file. Two arguments are required for this method: values you want to remove and the row name or column name of the values.

In [16]:
dataset_description.remove_values("tester", field_name="contributor role")

values = dataset_description.get_values(field_name="contributorrole")
print(values)

Value       developer
Value 1     developer
Value 2    Researcher
Value 3    Researcher
Value 4          <NA>
Value 5          <NA>
Name: 20, dtype: object
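In the output above, removing `"tester"` blanks both matching cells and leaves the other cells in place. A plain-list sketch of that behaviour follows; `remove_values_sketch` is a hypothetical function used for illustration, with `None` standing in for `<NA>`, and it is not the sparc-me implementation.

```python
def remove_values_sketch(row, *values):
    """Blank out every cell equal to one of values, keeping cell positions."""
    return [None if cell in values else cell for cell in row]


roles = ["developer", "developer", "Researcher", "Researcher", "tester", "tester"]
print(remove_values_sketch(roles, "tester"))
```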


## 4. Add raw data to dataset 'primary' folders (derivative data will be covered in the next workshop)

In your research drive, you'll find a folder named 'test_data' which contains the test datasets for this section of the tutorial. Please move the 'test_data' folder into the tutorial directory.
* To comply with the SDS framework, subject and sample folders MUST be named in this format: sub-xx (for subjects), sam-xx (for samples).

* Copy the data, which can be either folders or files, from the raw sample data folder to the SDS dataset folder that you specified at the beginning of the tutorial.
    * Add subject(s): add one or more subject folders along with their subject and sample metadata.
    * Add sample(s): add a single sample file or multiple sample files to the dataset sample folder.
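The naming rule above can be checked before any data is copied. The sketch below assumes that the part after `sub-` or `sam-` is a non-empty run of letters and digits, which covers the names used in this tutorial (`sub-1`, `sub-xyz`, `sam-1`). The SDS specification may allow other characters, and both helper functions are hypothetical, not part of sparc-me.

```python
import re

# Assumed patterns: the prefix followed by one or more letters or digits.
SUBJECT_PATTERN = re.compile(r"^sub-[A-Za-z0-9]+$")
SAMPLE_PATTERN = re.compile(r"^sam-[A-Za-z0-9]+$")


def is_valid_subject_name(name):
    """Return True if name looks like an SDS subject folder name such as 'sub-1'."""
    return bool(SUBJECT_PATTERN.match(name))


def is_valid_sample_name(name):
    """Return True if name looks like an SDS sample folder name such as 'sam-1'."""
    return bool(SAMPLE_PATTERN.match(name))


for candidate in ["sub-1", "sub-xyz", "subject1", "sam-1", "sample_1"]:
    print(candidate, is_valid_subject_name(candidate) or is_valid_sample_name(candidate))
```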

    

In [17]:
# Add a subject folder that is named with the subject ID. Copy data from primary source data to SDS dataset directory.
dataset.add_subject(source_path="test_data/bids_data/sub-01", subject="sub-1", subject_metadata={
    "subject experimental group": "experimental",
    "age": "041Y",
    "sex": "F",
    "species": "human",
    "strain": "tissue",
    "age category": "middle adulthood"
}, sample_metadata={
    "sample id": "",
    "subject id": "",
    "sample experimental group": "experimental",
    "sample type": "tissue",
    "sample anatomical location": "breast tissue",
})

# Add multiple subjects
dataset.add_subjects(source_paths=["test_data/bids_data/sub-01", "./test_data/bids_data/sub-02"], subjects=["sub-1", "sub-2"], subject_metadata={
    "subject experimental group": "experimental",
    "species": "human",
    "strain": "tissue",
}, sample_metadata={
    "sample experimental group": "experimental",
    "sample type": "tissue",
    "sample anatomical location": "breast tissue",
})

In [18]:
# Add a primary sample dataset to the template dataset
dataset.add_sample(source_path="test_data/sample1/raw", subject="sub-xyz", sample="sam-1",
                     data_type="primary", sds_parent_dir=save_dir)

# Add multiple primary sample datasets to the template dataset
dataset.add_samples(source_paths=["test_data/sample1/raw", "./test_data/sample3/raw"], subject="sub-xyz", samples=["sam-1", "sam-3"],
                     data_type="primary", sds_parent_dir=save_dir)

In [19]:
sample_metadata = dataset.get_metadata("samples")
subject_metadata = dataset.get_metadata("subjects")

def add_values_for_sample_metadata(sample_metadata):
    sample_metadata.add_values(*["test"] * 6, col_name="was derived from", append=False)
    sample_metadata.add_values(*["pool id 1", "pool id 2", "pool id 3", "pool id 4", "pool id 5", "pool id 6"],
                               col_name="pool id", append=False)
    sample_metadata.add_values(*["Yes"] * 5, "No", col_name="also in dataset", append=False)
    sample_metadata.add_values(*["Global"] * 6, col_name="member of", append=False)
    sample_metadata.add_values(
        *["laboratory 1", "laboratory 2", "laboratory 3", "laboratory 4", "laboratory 5", "laboratory 6"],
        col_name="laboratory internal id", append=False)
    sample_metadata.add_values(*["1991-05-25"] * 3, *["1991-06-10"] * 3, col_name="date of derivation", append=False)

    sample_metadata.save()

def add_values_for_subject_metadata(subject_metadata):
    subject_metadata.add_values("test-1","test-2","test-xyz", col_name='subject experimental group', append=False)
    subject_metadata.add_values("30y","31y","33y", col_name='age', append=False)
    subject_metadata.add_values("M","M","M", col_name='sex', append=False)
    subject_metadata.add_values("P","human","P", col_name='species', append=False)
    subject_metadata.add_values("test","tissue","test", col_name='strain', append=False)
    subject_metadata.add_values("old","old","old", col_name="age category", append=False)
    subject_metadata.add_values(*["pool id 1", "pool id 2", "pool id 3"],
                               col_name="pool id", append=False)
    subject_metadata.add_values(*["Yes"] * 3, col_name="also in dataset", append=False)
    subject_metadata.add_values(*["515dsd1515","da515daa69", "515dsa62a"], col_name="RRID for strain", append=False)
    subject_metadata.add_values(*["Global"] * 3, col_name="member of", append=False)
    subject_metadata.add_values(
        *["laboratory 1", "laboratory 2", "laboratory 3"],
        col_name="laboratory internal id", append=False)
    subject_metadata.add_values(*["1996-03-25","1995-09-05", "1996-04-11"], col_name="date of birth", append=False)
    subject_metadata.save()

In [20]:
add_values_for_sample_metadata(sample_metadata)
add_values_for_subject_metadata(subject_metadata)


In [21]:
dataset.add_thumbnail("./test_data/thumbnail_0.jpg")
dataset.add_thumbnail("./test_data/thumbnail_1.jpg")
dataset.delete_data("./SDStemplate/primary/thumbnail_0.jpg")

* After updating subject or sample data, the following metadata files will be updated __automatically__: dataset_description.xlsx, manifest.txt, samples.xlsx, subjects.xlsx

* Save the updated datasets 

    If you encounter any errors when running the above code:
        
    * Clear outputs and restart the kernel. Run the code again.
    * Run this code directly in Jupyter Notebook in the browser instead of in an IDE such as PyCharm.

In [22]:
dataset.save(save_dir=save_dir)
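After saving, one way to sanity-check the saved dataset is to count the subject and sample folders on disk and compare the result with the `Number of subjects` and `Number of samples` values in dataset_description. The sketch below assumes the layout `primary/sub-*/sam-*` inside the dataset folder. `count_subjects_and_samples` is a hypothetical helper, not a sparc-me function, and a sample folder that appears under several subjects is counted once per subject.

```python
from pathlib import Path


def count_subjects_and_samples(dataset_root):
    """Count sub-* folders under primary/ and the sam-* folders inside them."""
    primary = Path(dataset_root) / "primary"
    if not primary.is_dir():
        # No primary folder means no subjects or samples were added.
        return 0, 0
    subjects = [p for p in primary.iterdir() if p.is_dir() and p.name.startswith("sub-")]
    samples = [
        s
        for subject in subjects
        for s in subject.iterdir()
        if s.is_dir() and s.name.startswith("sam-")
    ]
    return len(subjects), len(samples)


# "SDStemplate/" is the save_dir used in this tutorial.
print(count_subjects_and_samples("SDStemplate/"))
```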

* Validate the dataset descriptions within the metadata files you just updated

In [23]:
description_meta = schema.load_data("./SDStemplate/dataset_description.xlsx")
validator.validate(description_meta, category="dataset_description", version="2.0.0")
sub_meta = schema.load_data("./SDStemplate/subjects.xlsx")
validator.validate(sub_meta, category="subjects", version="2.0.0")

Target instance: {'Metadata Version': '2.0.0', 'Type': 'experimental', 'Title': 'Duke breast cancer MRI preprocessing', 'Subtitle': 'Preprocessing the breast cancer MRI images and saving in Nifti format', 'Keywords': 'Breast cancer', 'Study purpose': 'Preprocessing the breast cancer MRI images and saving in Nifti format', 'Study data collection': 'derived from Duke Breast Cancer MRI dataset', 'Study primary conclusion': 'pri', 'Study organ system': 'breast', 'Study approach': 'image processing', 'Study technique': 'dicom2nifti', 'Contributor name': 'Lin, Chinchien', 'Contributor ORCiD': 'https://orcid.org/0000-0001-8170-199X', 'Contributor affiliation': 'University of Auckland', 'Contributor role': 'developer', 'Identifier description': 'source', 'Relation type': 'WasDerivedFrom', 'Identifier': 'DTP-UUID', 'Identifier type': '12L digital twin UUID', 'Number of subjects': 3, 'Number of samples': 6}
Validation: Passed
Target instance: {'subject id': 'sub-1', 'pool id': 'pool id 1', 'subj