# Tutorial 2: Create a dataset and its metadata
This tutorial shows how to create a dataset and its metadata in the SPARC Data Structure (SDS) format using the sparc-me API. In this tutorial, you will learn:
* How to create an SDS template dataset or load an existing SDS dataset.
* How to retrieve the metadata and its values from the dataset and save the dataset locally.
* How to add or update metadata values.
* How to add raw data to the dataset.

## 1. Create an empty SDS dataset, or load an existing one if your dataset is based on SDS v2.0.0
* Set up sparc-me and create an instance of each of the `Dataset`, `Schema`, and `Validator` classes

In [46]:
from sparc_me import Dataset, Schema, Validator

dataset = Dataset()
schema = Schema()
validator = Validator()

* Create an empty dataset from the SDS template

   * Specify the path where you want your dataset to be saved.
   * Load the SDS template dataset. It is highly recommended to use SDS version 2.0.0 in this tutorial.

In [47]:
save_dir = "SDStemplate/"
dataset.set_dataset_path(save_dir)

In [48]:
dataset.load_from_template(version="2.0.0")

* Save the template dataset that you just loaded

    Explore the SDS template dataset after it's saved locally. __Restart the kernel if you have problems saving the dataset.__

In [49]:
dataset.save(save_dir=save_dir)
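
To explore the saved template without leaving the notebook, a small standard-library helper can print the folder tree. This is only a sketch: `list_tree` is a hypothetical helper, not part of sparc-me, and it assumes the dataset was saved to the `SDStemplate/` path used above.

```python
from pathlib import Path


def list_tree(root, max_depth=2):
    """Return the relative paths under `root`, down to `max_depth` levels, sorted."""
    root = Path(root)
    entries = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        if len(relative.parts) <= max_depth:
            # Mark directories with a trailing slash so they stand out.
            entries.append(relative.as_posix() + ("/" if path.is_dir() else ""))
    return entries


# "SDStemplate/" is the save_dir used in this tutorial; adjust it if yours differs.
if Path("SDStemplate/").is_dir():
    for entry in list_tree("SDStemplate/"):
        print(entry)
```

Raising `max_depth` shows deeper levels, which becomes useful once subject and sample folders have been added.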

* Load an existing SDS dataset (not recommended in this tutorial unless your dataset is based on SDS template version 2.0.0)

In [50]:
# set_dataset_path is optional here. You can either set a new path or use the dataset's current path when loading the existing dataset
# dataset.set_dataset_path(save_dir)
# existing_dir = "./your/dataset/path/here"
# dataset.load_dataset(dataset_path=existing_dir)

## 2. Get the metadata and its associated values. Clear the values prior to updating.
This tutorial uses the `dataset_description` metadata file as an example. For other metadata files, replace the `metadata_file` argument with the name of the metadata file you want to use, e.g., from `metadata_file="dataset_description"` to `metadata_file="code_parameters"`.

* List all metadata files of the dataset such as dataset_description.xlsx and manifest.xlsx.

In [51]:
categories = dataset.list_metadata_files(version="2.0.0")

metadata_files:
code_description
code_parameters
dataset_description
manifest
performances
resources
samples
subjects
submission


* List the elements of a metadata file (dataset_description.xlsx for example)

In [52]:
elements = dataset.list_elements(metadata_file="dataset_description")

Fields:
Metadata Version
Type
Basic information
Title
Subtitle
Keywords
Funding
Acknowledgments
Study information
Study purpose
Study data collection
Study primary conclusion
Study organ system
Study approach
Study technique
Study collection title
Contributor information
Contributor name
Contributor ORCiD
Contributor affiliation
Contributor role
Related protocol, paper, dataset, etc.
Identifier description
Relation type
Identifier
Identifier type
Participant information
Number of subjects
Number of samples


* (Optional) Get more information about a specific element

In [53]:
des_schema = schema.get_schema("dataset_description")
des_schema.get('Subtitle')

{'type': 'string',
 'required': 'Y',
 'description': 'Brief description of the study and the data set. Equivalent to the abstract of a scientific paper. Include the rationale for the approach, the types of data collected, the techniques used, formats and number of files and an approximate size.',
 'example': 'A really cool dataset that I collected to answer some question.'}
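
The schema entries are plain dictionaries, so ordinary Python can summarise them. The sketch below lists the fields marked as required. It works on a hand-copied excerpt of the output above rather than calling sparc-me, `required_fields` is a hypothetical helper, and it assumes that `'required': 'Y'` marks a mandatory field, as the output suggests.

```python
def required_fields(schema_dict):
    """Return the names of fields whose 'required' flag is 'Y'."""
    return [name for name, spec in schema_dict.items() if spec.get("required") == "Y"]


# Excerpt copied from the output above (descriptions omitted). The last entry
# is made up purely to show how a non-required field would be skipped.
schema_excerpt = {
    "Title": {"type": "string", "required": "Y", "example": "My SPARC dataset"},
    "Subtitle": {
        "type": "string",
        "required": "Y",
        "example": "A really cool dataset that I collected to answer some question.",
    },
    "Hypothetical optional field": {"type": "string", "required": "N"},
}

print(required_fields(schema_excerpt))
```

With the real schema, the same helper could be applied to the dictionary returned by `schema.get_schema("dataset_description")`, provided it has the shape shown above.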

* Create a `dataset_description` object for the metadata file by using the `get_metadata` function

In [54]:
dataset_description = dataset.get_metadata(metadata_file="dataset_description")

* Get the metadata's associated values

    The `get_values(field_name: str)` method retrieves the values from a specific row or column of the metadata file, given the row name or column name.

In [55]:
dataset_description.get_values(field_name='Contributor role')

Value      PrincipalInvestigator
Value 2      CorrespondingAuthor
Value 3                      NaN
Value n                      NaN
Name: 20, dtype: object

* Clear all default metadata values before you edit them if you created your datasets from the template
    * Clear all metadata values in the dataset_description file
    * (Optional) clear the entire row of metadata values
        `dataset_description.clear_values(field_name='Contributor role')`

In [56]:
dataset_description.clear_values()

* Get the values again. They have now been deleted.

In [57]:
dataset_description.get_values(field_name='Contributor role')

Value    <NA>
Name: 20, dtype: object

## 3. Add/update metadata values

* This function allows you to add or update metadata values

    `add_values(field_name: str = '', values: Any, append: bool = True)`

    * `field_name` represents either the row name or the column name. It takes a row name in the `dataset_description` and `code_description` metadata files, and takes a column name in other metadata files.
    * `values` accepts one or more string values for the field.
    * `append` takes a boolean value. The default, `True`, appends the new values to the end of the existing values. If `append` is set to `False`, the existing values are replaced with the new values you specify.
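
The effect of the `append` flag can be pictured with a plain Python list. The function below is a toy model written for this explanation, not sparc-me's implementation; `add_values_sketch` is a hypothetical name, and the real method works on the metadata spreadsheet rather than on a list.

```python
def add_values_sketch(existing, values, append=True):
    """Toy model of the append flag: extend the existing values or replace them."""
    # Accept a single string as well as a list of strings.
    new_values = [values] if isinstance(values, str) else list(values)
    if append:
        return list(existing) + new_values
    return new_values


roles = ["PrincipalInvestigator", "CorrespondingAuthor"]
print(add_values_sketch(roles, "Researcher"))                  # added after the existing values
print(add_values_sketch(roles, ["Researcher"], append=False))  # existing values replaced
```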

In [58]:
dataset_description.add_values(field_name='type', values="experimental")
dataset_description.add_values(field_name='Title', values="Dynamic contrast-enhanced magnetic resonance images of breast cancer patients with tumor locations (Duke-Breast-Cancer-MRI)")
dataset_description.add_values(field_name='subtitle',
                                   values="""Background Recent studies showed preliminary data on associations of MRI-based imaging phenotypes of breast tumours with breast cancer molecular, genomic, and related characteristics. In this study, we present a comprehensive analysis of this relationship.
    Methods We analysed a set of 922 patients with invasive breast cancer and pre-operative MRI. The MRIs were analysed by a computer algorithm to extract 529 features of the tumour and the surrounding tissue. Machine-learning-based models based on the imaging features were trained using a portion of the data (461 patients) to predict the following molecular, genomic, and proliferation characteristics: tumour surrogate molecular subtype, oestrogen receptor, progesterone receptor and human epidermal growth factor status, as well as a tumour proliferation marker (Ki-67). Trained models were evaluated on the set of the remaining 461 patients.
    Results Multivariate models were predictive of Luminal A subtype with AUC = 0.697 (95% CI: 0.647–0.746, p < .0001), triple negative breast cancer with AUC = 0.654 (95% CI: 0.589–0.727, p < .0001), ER status with AUC = 0.649 (95% CI: 0.591–0.705, p < .001), and PR status with AUC = 0.622 (95% CI: 0.569–0.674, p < .0001). Associations between individual features and subtypes we also found.
    Conclusions There is a moderate association between tumour molecular biomarkers and algorithmically assessed imaging features.""")
dataset_description.add_values(field_name='Keywords', values=["Breast cancer","MRI"])
dataset_description.add_values(field_name="Study purpose",
                               values="""Breast MRI is a common image modality to assess the extent of disease in breast cancer patients. Recent studies show that MRI has a potential in prognosis of patients’ short and long-term outcomes as well as predicting pathological and genomic features of the tumors. However, large, well annotated datasets are needed to make further progress in the field. We share such a dataset here.""")
dataset_description.add_values(field_name='Study data Collection',
                               values="""In terms of design, the dataset is a single-institutional, retrospective collection of 922 biopsy-confirmed invasive breast cancer patients, over a decade, having the following data components:
    Demographic, clinical, pathology, treatment, outcomes, and genomic data: Collected from a variety of sources including clinical notes, radiology report, and pathology reports and has served as a source for multiple published papers on radiogenomics, outcomes prediction, and other areas.
    Pre-operative dynamic contrast enhanced (DCE)-MRI: Downloaded from PACS systems and de-identified for The Cancer Imaging Archive (TCIA) release. These include axial breast MRI images acquired by 1.5T or 3T scanners in the prone positions. Following MRI sequences are shared in DICOM format: a non-fat saturated T1-weighted sequence, a fat-saturated gradient echo T1-weighted pre-contrast sequence, and mostly three to four post-contrast sequences.
    Locations of lesions in DCE-MRI: Annotations on the DCE-MRI images by radiologists.
    Imaging features from DCE-MRI: A set of 529 computer-extracted imaging features by inhouse software. These features represent a variety of imaging characteristics including size, shape, texture, and enhancement of both the tumor and the surrounding tissue, which is combined of features commonly published in the literature, as well as the features developed in our lab.""")
dataset_description.add_values(field_name="Study primary conclusion", values="There is a moderate association between tumour molecular biomarkers and algorithmically assessed imaging features.")
dataset_description.add_values(field_name='Study organ system', values="breast")
dataset_description.add_values(field_name='Study approach', values="Machine-learning")
dataset_description.add_values(field_name='Study technique', values="""We analysed a set of 922 patients with invasive breast cancer and pre-operative MRI. The MRIs were analysed by a computer algorithm to extract 529 features of the tumour and the surrounding tissue. Machine-learning-based models based on the imaging features were trained using a portion of the data (461 patients) to predict the following molecular, genomic, and proliferation characteristics: tumour surrogate molecular subtype, oestrogen receptor, progesterone receptor and human epidermal growth factor status, as well as a tumour proliferation marker (Ki-67). Trained models were evaluated on the set of the remaining 461 patients.""")
dataset_description.add_values(field_name='Study collection title', values="Dynamic contrast-enhanced magnetic resonance images of breast cancer patients with tumor locations (Duke-Breast-Cancer-MRI)")
dataset_description.add_values(field_name='identifier', values="ID2350")
dataset_description.add_values(field_name='contributorname', values=["Saha, Ashirbani","Harowicz, Michael R","Grimm, Lars J","Kim, Connie E","Ghate, Sujata V","Walsh, Ruth","Mazurowski, Maciej A"])
dataset_description.add_values(
    field_name='Contributor orcid',
    values=["https://orcid.org/0000-0002-7650-1720",
            "https://orcid.org/0000-0002-8002-5210",
            "https://orcid.org/0000-0002-3865-3352",
            "https://orcid.org/0000-0003-0730-0551",
            "https://orcid.org/0000-0003-1889-982X",
            "https://orcid.org/0000-0002-2164-2761",
            "https://orcid.org/0000-0003-4202-8602"],
    append=False)
dataset_description.add_values(field_name='Contributor affiliation', values=["Duke University"] * 7)
dataset_description.add_values(field_name="contributor role",
                               values=["Researcher", "Researcher", "Researcher", "Researcher", "Researcher", "Researcher","Researcher","tester"])
dataset_description.add_values(field_name='Identifier description', values="source")
dataset_description.add_values(field_name='Relation type', values="IsDescribedBy")
# dataset_description.add_values(field_name='Identifier', values="9d70fd9f-bfb9-424d-9c7c-9db1ec6a9df9")
dataset_description.add_values(field_name='Identifier type', values="12L digital twin UUID")

* Display the values you just added



In [59]:
dataset_description.get_values(field_name="contributorrole")

Value      Researcher
Value 1    Researcher
Value 2    Researcher
Value 3    Researcher
Value 4    Researcher
Value 5    Researcher
Value 6    Researcher
Value 7        tester
Name: 20, dtype: object
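
The calls above spell field names in several ways (`'type'`, `'contributorname'`, `"contributor role"`, `"contributorrole"`), and the output for `contributorrole` shows that the intended row is still found. One plausible way to get that behaviour is to compare names after lower-casing them and removing spaces. The sketch below illustrates that idea only: it is an assumption about the matching rule, not sparc-me's actual code, and `normalize` and `find_field` are hypothetical names.

```python
def normalize(name):
    """Lower-case a field name and drop all whitespace."""
    return "".join(name.split()).lower()


def find_field(query, field_names):
    """Return the first field whose normalized name matches the query, else None."""
    target = normalize(query)
    for field in field_names:
        if normalize(field) == target:
            return field
    return None


# A few element names taken from the dataset_description listing earlier.
fields = ["Type", "Title", "Contributor name", "Contributor ORCiD", "Contributor role"]
print(find_field("contributorrole", fields))
print(find_field("Contributor orcid", fields))
```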

* Remove values by row or column

    The `remove_values(*values: Any, field_name: str)` method removes values from a specific row or column of the metadata file. It requires two arguments: the values you want to remove and the name of the row or column that contains them.

In [60]:
dataset_description.remove_values(field_name="contributor role", values="tester")

dataset_description.get_values(field_name="contributorrole")

Value      Researcher
Value 1    Researcher
Value 2    Researcher
Value 3    Researcher
Value 4    Researcher
Value 5    Researcher
Value 6    Researcher
Value 7          <NA>
Name: 20, dtype: object

## 4. Add your actual data to the dataset's 'primary' folder (derivative data will be covered in the next workshop)

In the research drive, you'll find a folder named 'example_dataset', which contains the example datasets for this section of the tutorial. Please move the 'example_dataset' folder into the `resources` folder.
* To comply with the SDS framework, subject and sample folders MUST be named in this format: sub-xx (for subjects), sam-xx (for samples).

* Add data to the SDS dataset folder in one of two ways:
    * Add data from subject(s) - add data from a single subject or multiple subjects, along with the associated metadata.
    * Add sample(s) data - add a single sample data file or multiple sample data files to the dataset.

* The metadata passed to the following functions is applied as constant values across all of the subjects or samples being added.

* After updating subject or sample data, the following metadata files will be updated __automatically__: dataset_description.xlsx, manifest.xlsx, samples.xlsx, subjects.xlsx
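
Because the folder naming rule is strict, it can help to check names before adding data. The following is a minimal sketch, assuming the identifier after `sub-` or `sam-` consists of letters, digits, underscores, or hyphens; that character set is an assumption made here, not something taken from the SDS specification, and `is_valid_folder_name` is a hypothetical helper.

```python
import re

# Assumed pattern: the prefix, a hyphen, then letters, digits, underscores or hyphens.
FOLDER_NAME = re.compile(r"^(sub|sam)-[A-Za-z0-9_-]+$")


def is_valid_folder_name(name):
    """Check that a folder name starts with 'sub-' or 'sam-' followed by an identifier."""
    return FOLDER_NAME.match(name) is not None


for name in ["sub-1", "sam-3", "patient1", "sub_2"]:
    print(name, is_valid_folder_name(name))
```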

In [61]:
# Add data from multiple subjects
dataset.add_subjects(
    source_paths=[
        "../resources/example_dataset/Breast_MRI_001/patient1/",
        "../resources/example_dataset/Breast_MRI_003/patient3/",
    ],
    subjects=["sub-1", "sub-3"],
    subject_metadata={
        "subject experimental group": "experimental",
        "sex": "Female",
        "species": "human",
        "strain": "tissue",
        "age category": "middle adulthood",
    },
    sample_metadata={
        "sample experimental group": "experimental",
        "sample type": "tissue",
        "sample anatomical location": "breast tissue",
    },
)

In [62]:
# Add multiple sample data to the dataset
dataset.add_samples(
    source_paths=[
        "../resources/example_dataset/Breast_MRI_002/patient2/sample_data_1/",
        "../resources/example_dataset/Breast_MRI_002/patient2/sample_data_2/",
        "../resources/example_dataset/Breast_MRI_002/patient2/sample_data_3/",
    ],
    subject="sub-2",
    samples=["sam-1", "sam-2", "sam-3"],
    data_type="primary",
    sds_parent_dir=save_dir,
    subject_metadata={
        "subject experimental group": "experimental",
        "age": "041y",
        "sex": "Female",
        "species": "human",
        "strain": "tissue",
        "age category": "middle adulthood",
    },
    sample_metadata={
        "sample experimental group": "experimental",
        "sample type": "tissue",
        "sample anatomical location": "breast tissue",
    },
)
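
When a subject has many samples, typing each path by hand becomes error-prone. The sketch below builds the two lists programmatically; it assumes the source folders follow the `sample_data_<n>` naming used in the example dataset, and `sample_paths` is a hypothetical helper rather than a sparc-me function.

```python
def sample_paths(base_dir, count):
    """Build matching lists of source folders and sample names for `count` samples."""
    base = base_dir.rstrip("/")
    source_paths = [f"{base}/sample_data_{i}/" for i in range(1, count + 1)]
    samples = [f"sam-{i}" for i in range(1, count + 1)]
    return source_paths, samples


source_paths, samples = sample_paths(
    "../resources/example_dataset/Breast_MRI_002/patient2/", 3
)
print(source_paths)
print(samples)
```

The two lists can then be passed as `source_paths=` and `samples=` in the `add_samples` call above.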

* Add the rest of the metadata values

In [63]:
subject_metadata = dataset.get_metadata("subjects")
subject_metadata.add_values(field_name='age', values=["30","35"], append=False)
subject_metadata.add_values(field_name="also in dataset", values=["Null"] * 3, append=False)
subject_metadata.add_values(field_name="RRID for strain", values=["RRID:RGD_10395233", "RRID:RGD_16986544", "RRID:RGD_65245738"],
                        append=False)
subject_metadata.add_values(field_name="member of", values=["Breast Cancer Society"] * 3, append=False)
subject_metadata.save()

In [64]:
sample_metadata = dataset.get_metadata("samples")
sample_metadata.add_values(field_name="was derived from", values=["sam-sample_data_1","sam-sample_data_2","sam-sample_data_3",
                                                                  "sam-sample_data_4","sam-1","sam-2","sam-3"], append=False)
sample_metadata.add_values(field_name="also in dataset", values=["null"] * 7, append=False)
sample_metadata.add_values(field_name="member of", values=["null"] * 7, append=False)
sample_metadata.save()


In [65]:
dataset.add_thumbnail("../resources/example_dataset/thumbnail_0.jpg")
dataset.add_thumbnail("../resources/example_dataset/thumbnail_1.jpg")
dataset.delete_data("./SDStemplate/docs/thumbnail_0.jpg")

* Save the updated dataset. If you encounter any errors when saving the dataset, try the following:

    * Clear the outputs and restart the kernel, then run the code again.
    * Run this code directly in Jupyter Notebook in the browser instead of in an IDE such as PyCharm.

In [66]:
dataset.save()

* Validate the dataset descriptions within the metadata files you just updated

In [67]:
validator.validate_dataset(dataset)

Target instance: {'Metadata Version': '2.0.0', 'Type': 'experimental', 'Title': 'Dynamic contrast-enhanced magnetic resonance images of breast cancer patients with tumor locations (Duke-Breast-Cancer-MRI)', 'Subtitle': 'Background Recent studies showed preliminary data on associations of MRI-based imaging phenotypes of breast tumours with breast cancer molecular, genomic, and related characteristics. In this study, we present a comprehensive analysis of this relationship.\n    Methods We analysed a set of 922 patients with invasive breast cancer and pre-operative MRI. The MRIs were analysed by a computer algorithm to extract 529 features of the tumour and the surrounding tissue. Machine-learning-based models based on the imaging features were trained using a portion of the data (461 patients) to predict the following molecular, genomic, and proliferation characteristics: tumour surrogate molecular subtype, oestrogen receptor, progesterone receptor and human epidermal growth factor stat

### Tips:

If you made a mistake when moving the data files, you can use the delete functions to remove a subject folder or a sample folder.

* Delete subject folder
`dataset.delete_subject("./SDStemplate/primary/sub-1")`
* Delete sample folder
`dataset.delete_samples(["./SDStemplate/primary/sub-2/sam-3"])`