# Tutorial 2: Create SDS from existing data set

The SPARC Dataset Structure (SDS) is a standardised method for organising files and metadata. In this tutorial, existing data is loaded into an SDS file structure, and the metadata is then explored and edited.

# Creating SDS folder structure 

In [None]:
# Initialise a dataset object
import sys

# Add the current and parent directories to the import path
sys.path.extend(['.', '..'])

from sparc_me import Dataset

dataset = Dataset()

# Specify the SDS schema version to be created
version = "2.0.0"
dataset.load_from_template(version)

# Specify the location in which to generate the SDS structure
save_dir = "./tmp/template/"

# Create the SDS folder structure
dataset.set_dataset_path(save_dir)
dataset.save(save_dir)

# Transferring data into the SDS structure

Now that there is a destination for the data, it is time to transfer your existing data into it.

In [None]:
# Add a copy of the data from the specified path into the SDS folder structure
dataset.add_primary_data("./test_data/sample1/raw", subject="subject-1", sample="sample-1", sds_parent_dir=save_dir, overwrite=True)
dataset.add_primary_data("./test_data/sample2/raw", subject="subject-2", sample="sample-1", sds_parent_dir=save_dir, overwrite=True)
dataset.add_primary_data("./test_data/sample3/raw", subject="subject-3", sample="sample-1", sds_parent_dir=save_dir, overwrite=True)

dataset.add_derivative_data("./test_data/sample1/derived", subject="subject-1", sample="sample-1", sds_parent_dir=save_dir, overwrite=True)
dataset.add_derivative_data("./test_data/sample2/derived", subject="subject-2", sample="sample-1", sds_parent_dir=save_dir, overwrite=True)
dataset.add_derivative_data("./test_data/sample3/derived", subject="subject-3", sample="sample-1", sds_parent_dir=save_dir, overwrite=True)
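
To check what was copied, it can help to list the resulting folder tree. The sketch below uses only the standard library. `list_tree` is a hypothetical helper (it is not part of `sparc_me`), and it is demonstrated on a throwaway directory whose folder names are made up for the demo rather than taken from the SDS template. In the tutorial you would pass `save_dir` instead.

```python
import tempfile
from pathlib import Path


def list_tree(root):
    """Return every file and folder under root as sorted, POSIX-style relative paths."""
    root = Path(root)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


# Demo on a temporary directory; the names below are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "primary" / "subject-1" / "sample-1" / "data.txt"
    demo_file.parent.mkdir(parents=True)
    demo_file.write_text("example")
    demo_tree = list_tree(tmp)

for entry in demo_tree:
    print(entry)
```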



# Editing the metadata
Now we can explore some of the metadata that was automatically generated while the files were being transferred.

In [None]:
category = "subjects"
metadata = dataset._dataset.get(category).get("metadata")
metadata



In this example, we now wish to add age information for the subjects.

In [None]:
# Edit the age field of the listed subjects
category = "subjects"
header = "age"

subject_1_index = int(metadata.index[metadata['subject id']=='subject-1'][0] + 2)
subject_2_index = int(metadata.index[metadata['subject id']=='subject-2'][0] + 2)
subject_3_index = int(metadata.index[metadata['subject id']=='subject-3'][0] + 2)


dataset.set_field(category, subject_1_index, header, 42)
dataset.set_field(category, subject_2_index, header, 25)
dataset.set_field(category, subject_3_index, header, 27)


# Save changes
dataset.save(save_dir)

# Display the updated metadata
metadata
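
The three index lookups in the cell above follow the same pattern, so they could be folded into a small helper. This is a sketch only: `row_for_subject` is a hypothetical function, the stand-in DataFrame replaces the real `metadata` object (assumed to be a pandas DataFrame), and the default offset of 2 simply mirrors the `+ 2` used in the tutorial.

```python
import pandas as pd


def row_for_subject(metadata, subject_id, offset=2):
    """Return the DataFrame index of subject_id plus offset, as a plain int."""
    matches = metadata.index[metadata["subject id"] == subject_id]
    if len(matches) == 0:
        raise KeyError(f"no row found for {subject_id!r}")
    return int(matches[0]) + offset


# Stand-in for the subjects metadata table (hypothetical demo data).
demo_metadata = pd.DataFrame({"subject id": ["subject-1", "subject-2", "subject-3"]})

rows = {sid: row_for_subject(demo_metadata, sid) for sid in demo_metadata["subject id"]}
print(rows)
```

With the real objects, a call such as `dataset.set_field(category, row_for_subject(metadata, "subject-1"), header, 42)` would then replace a separately stored index variable.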

If the metadata for a given category is incomplete, as below, it is useful to be able to extract only the rows that contain values.

In [None]:
header = "sex"

dataset.set_field(category, subject_1_index, header, "female")
dataset.set_field(category, subject_2_index, header, "male")
dataset.set_field(category, subject_3_index, header, "female")

dataset.save(save_dir)
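
Extracting only the rows that contain a value, as mentioned above, can be sketched with the pandas `dropna` method. The DataFrame here is a stand-in built for illustration (the real `metadata` object is assumed to behave like a pandas DataFrame with empty cells stored as missing values), and the age of one subject is deliberately left empty.

```python
import pandas as pd

# Stand-in metadata table; subject-2 has no age recorded (hypothetical demo data).
demo_metadata = pd.DataFrame(
    {
        "subject id": ["subject-1", "subject-2", "subject-3"],
        "age": [42, None, 27],
    }
)

# Keep only the rows where the "age" column holds a value.
complete_rows = demo_metadata.dropna(subset=["age"])
print(complete_rows["subject id"].tolist())
```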


# Filtering through the metadata to identify subjects
We can use the metadata stored in the dataset to select subjects based on specific criteria.

In [None]:
# Select the metadata for female subjects
index = metadata['sex'] == 'female'
metadata.loc[index, ['subject id', 'age', 'sex']]

In [None]:
# Select the metadata for subjects aged 28 or younger
index = metadata['age'] <= 28
metadata.loc[index, ['subject id', 'age', 'sex']]
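
Several criteria can also be combined in a single selection. The sketch below rebuilds the subject values entered earlier in this tutorial in a stand-in DataFrame (so that it runs without `sparc_me`) and selects the subjects that are both female and aged 28 or younger. The name `demo_metadata` is illustrative and not part of the tutorial's dataset object.

```python
import pandas as pd

# Stand-in table using the ages and sexes entered earlier in this tutorial.
demo_metadata = pd.DataFrame(
    {
        "subject id": ["subject-1", "subject-2", "subject-3"],
        "age": [42, 25, 27],
        "sex": ["female", "male", "female"],
    }
)

# Combine two boolean masks with & (each condition needs its own parentheses).
mask = (demo_metadata["sex"] == "female") & (demo_metadata["age"] <= 28)
selected = demo_metadata.loc[mask, ["subject id", "age", "sex"]]
print(selected)
```

Using `|` in place of `&` would instead select subjects that satisfy either condition.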
