# FAIR and scalable management of small-angle X-ray scattering data  
## Module 2: Packing and upload to DaRUS

> Authors: Torsten Giess, Selina Itzigehl, Jan Range, Johanna Bruckner, Juergen Pleiss  
> Last modified: 09.06.2022

---

### **Abstract** <a class="anchor" name="abstract"></a>

Using the novel Python packages [pyAnIML](https://github.com/FAIRChemistry/pyAnIML) (version 1.0.0) and [pyDaRUS](https://github.com/JR-1991/pyDaRUS) (version 1.0.4), together with several packages from the Python 3 standard library, this notebook provides the means to extract metadata from an [AnIML](https://animl.org/) document, create a searchable Dataverse metadata block from it, package the document and any additional files into an archive, and upload everything to the University of Stuttgart Dataverse installation, [DaRUS](https://darus.uni-stuttgart.de/). The generated metadata block can also be inspected by downloading and displaying it.

---

### **Table of Contents** <a class="anchor" name="table_of_contents"></a>

- [Abstract](#abstract)
- [Workflow](#workflow)
    - [User guide](#user_guide)
    - [Preparation](#preparation)
    - [Metadata extraction from AnIML document](#extraction)
    - [Packing files to OMEX or ZIP](#packing)
    - [Upload to DaRUS](#upload)
    - [Download from DaRUS](#download)
    - [Edit DaRUS datasets](#edit)
- [Disclosure](#disclosure)

---

### **Workflow** <a class="anchor" name="workflow"></a>

The following sections describe the workflow of Module 2 of FAIR and scalable management of small-angle X-ray scattering data.

#### **User guide** <a class="anchor" name="user_guide"></a>

This notebook is used for creating an archive in either OMEX or ZIP format of a dataset consisting of an AnIML document and any number of additional files belonging to the dataset in question. This archive is then uploaded to DaRUS. Furthermore, any given accessible dataset can be downloaded, inspected, modified, and updated on DaRUS.

#### **Preparation** <a class="anchor" name="preparation"></a>

This section contains the necessary preparations for using this module. The code cells in this section are required regardless of which functionality of this notebook is used. First, the required packages from the [Python 3 standard library](https://docs.python.org/3/library/), the Python Package Index ([PyPI](https://pypi.org/)), and the *ad hoc* modules of this work are imported. Then, the current date and the current working directory are retrieved and stored in the desired formats.

In [None]:
print("Importing standard library packages.")
from datetime import date
import os
from pathlib import Path
print("Done.")

In [None]:
print("Importing PyPI packages.")
from pyaniml import AnIMLDocument
from pyDaRUS import Citation, Dataset, EngMeta, Process
from pyDaRUS.metadatablocks.citation import SubjectEnum, IdentifierScheme, IdType
from pyDaRUS.metadatablocks.engMeta import DataGeneration
from libcombine import CombineArchive, KnownFormats, OmexDescription, VCard
print("Done.")

In [None]:
date_suffix = str(date.today()).replace("-", "")[2:]

In [None]:
cwd = Path.cwd()
path_to_datasets = cwd / "./datasets/"
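
The date suffix above is built by slicing the ISO date string. As a minimal sketch, the same six-digit `YYMMDD` suffix can also be produced with `strftime`. The helper name `make_date_suffix` and the example file name are illustrative assumptions, not part of the original notebook.

```python
from datetime import date
from pathlib import Path


def make_date_suffix(day: date) -> str:
    """Return a six-digit YYMMDD suffix, e.g. for dated file names."""
    return day.strftime("%y%m%d")


# Fixed date, chosen for illustration only.
example_suffix = make_date_suffix(date(2022, 5, 12))

# Hypothetical use: build a dated file name below the datasets directory.
example_path = Path("datasets") / "processed" / f"fairsaxs_{example_suffix}.animl"
print(example_path.name)
```

Using the suffix in file names this way would avoid hard-coding the date in several places, at the cost of the name changing from day to day.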

#### **Metadata extraction from AnIML document** <a class="anchor" name="extraction"></a>

In this section, DaRUS metadata block objects are created. Together with the files to be uploaded, they will later form a complete DaRUS dataset. Relevant metadata is extracted directly from the AnIML document provided. Information that cannot currently be inferred from the AnIML document can be added manually.

In [None]:
dataset = Dataset()

1. Provide the path to the AnIML document to be uploaded to DaRUS as a pathlib `Path`:

In [None]:
path_to_AnIML_file = path_to_datasets / "processed/fairsaxs_220512.animl"

2. Read document as string and create AnIML object from it:

In [None]:
with path_to_AnIML_file.open("r") as f:
    xml_string = f.read()
    animl_doc = AnIMLDocument.fromXMLString(xml_string)
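
Before handing the string to pyAnIML, it can be useful to confirm that the file is well-formed XML with the expected root element. The following is a minimal sketch using only the standard library. It assumes that the root element of an AnIML document is named `AnIML`; the helper `root_local_name` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def root_local_name(xml_string: str) -> str:
    """Parse an XML string and return the root tag without its namespace."""
    root = ET.fromstring(xml_string)
    # ElementTree writes namespaced tags as "{namespace}LocalName".
    return root.tag.rsplit("}", 1)[-1]


# Minimal stand-in document; the namespace is a placeholder, not the real AnIML one.
sample = '<AnIML xmlns="urn:example:placeholder" version="0.90"></AnIML>'
print(root_local_name(sample))
```

A malformed file makes `ET.fromstring` raise `xml.etree.ElementTree.ParseError`, which gives an early and readable failure.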

3. Create the necessary pyDaRUS objects to be filled with metadata from the AnIML document:

In [None]:
citation_block = Citation()
process_block = Process()
engineering_block = EngMeta()

4. Add general citation information that cannot be inferred from the AnIML document itself to the citation block object, starting with the title of the dataset:

In [None]:
citation_block.title = "Data for: FAIR and scalable management of small-angle X-ray scattering data"

In [None]:
citation_block.add_author("Giess, Torsten", "University of Stuttgart", IdentifierScheme.orcid, "0000-0002-8512-8606")
citation_block.add_author("Itzigehl, Selina", "University of Stuttgart", IdentifierScheme.orcid, "0000-0003-0311-5930")
citation_block.add_author("Range, Jan Peter", "University of Stuttgart", IdentifierScheme.orcid, "0000-0001-6478-1051")
citation_block.add_author("Bruckner, Johanna R.", "University of Stuttgart", IdentifierScheme.orcid, "0000-0001-7183-6532")
citation_block.add_author("Pleiss, Jürgen", "University of Stuttgart", IdentifierScheme.orcid, "0000-0003-1045-8202")

In [None]:
citation_block.add_contact("Pleiss, Juergen", "University of Stuttgart", "juergen.pleiss@itb.uni-stuttgart.de")

In [None]:
citation_block.add_description("This dataset contains small-angle X-ray scattering (SAXS) measurements on various octyltrimethylammonium bromide (OTAB)/water and octyltrimethylammonium chloride (OTAC)/water mixtures at different mass fractions and temperatures, as well as analysis and visualization data thereof. The data and metadata are contained within an archive adhering to the OMEX standard, in the form of a master AnIML document and additional files in TSV and PNG format.", str(date.today()))

In [None]:
citation_block.subject = [SubjectEnum.chemistry, SubjectEnum.physics, SubjectEnum.computer_and__information__science]

In [None]:
citation_block.add_keyword(
    term="AnIML",
    vocabulary="Wikidata",
    vocabulary_url="https://www.wikidata.org/wiki/Q97359795"
)
citation_block.add_keyword(
    term="Project Jupyter",
    vocabulary="Wikidata",
    vocabulary_url="https://www.wikidata.org/wiki/Q55630549"
)
citation_block.add_keyword(
    term="Surfactants",
    vocabulary="Loterre chemistry vocabulary",
    vocabulary_url="http://data.loterre.fr/ark:/67375/37T-JDM7BJHX-0"
)
citation_block.add_keyword(
    term="Lyotropic liquid crystal",
    vocabulary="Wikidata",
    vocabulary_url="https://www.wikidata.org/wiki/Q6709833"
)
citation_block.add_keyword(
    term="X-ray scattering",
    vocabulary="Wikidata",
    vocabulary_url="https://www.wikidata.org/wiki/Q57979862"
)
citation_block.add_keyword(
    term="OTAB",
    vocabulary="CHEBI",
    vocabulary_url="http://purl.obolibrary.org/obo/CHEBI_346954"
)
citation_block.add_keyword(
    term="octyltrimethylammonium bromide",
    vocabulary="CHEBI",
    vocabulary_url="http://purl.obolibrary.org/obo/CHEBI_346954"
)
citation_block.add_keyword(
    term="OTAC"
)

In [None]:
citation_block.add_topic_classification(
    term="Physical chemistry",
    vocabulary="Loterre chemistry vocabulary",
    vocabulary_url="http://data.loterre.fr/ark:/67375/37T-BJ2L28P4-W"
)
citation_block.add_topic_classification(
    term="Research data management",
    vocabulary="Wikidata",
    vocabulary_url="https://www.wikidata.org/wiki/Q30089794"
)

In [None]:
citation_block.add_grant_information(
    grant_agency="DFG",
    grant_number="358283783 - SFB 1333"
)
citation_block.add_grant_information(
    grant_agency="DFG",
    grant_number="EXC 2075"
)
citation_block.add_grant_information(
    grant_agency="Ministry of Science, Research and the Arts Baden-Württemberg",
    grant_number="None"
)

In [None]:
citation_block.add_project(
    name="A04",
    level="1"
)
citation_block.add_project(
    name="INF",
    level="1"
)

In [None]:
citation_block.add_related_publication(
    citation="Giess T., Itzigehl S., Range J. P., Bruckner J. R., Pleiss J., FAIR and scalable management of small-angle X-ray scattering data, 2022, submitted",
    id_type=IdType.doi,
    id_number="tba",
    url="https://www.tba.org"
)

In [None]:
citation_block.notes = 'All Jupyter Notebooks, as well as the ad hoc scripts used, can be found in our <a href="https://github.com/FAIRChemistry/Giess_2022">FAIR Chemistry GitHub repository</a>. Information about the OMEX standard can be found <a href="http://co.mbine.org/standards/omex">here</a>.'

In [None]:
citation_block.depositor = "Giess, Torsten"
citation_block.deposit_date = str(date.today())

5. Retrieve experiment name from the AnIML document:

In [None]:
method_name = animl_doc.experiment_step_set.experiment_steps[0].result.results[0].content[0].name

6. Retrieve the experiment parameters, collect them in a dictionary keyed by label so that duplicate entries are removed, and join the labels into a comma-separated string:

In [None]:
unique_parameters = {}
for experiment_step in animl_doc.experiment_step_set.experiment_steps:
    for category in experiment_step.result.results:
        if category.name == "Analyses":
            continue
        for series_set in category.content:
            # update() keeps the entries of earlier series sets instead of overwriting the dictionary
            unique_parameters.update(
                {series.unit.label: series.unit.quantity for series in series_set.series}
            )

In [None]:
method_parameter_labels = ", ".join(unique_parameters.keys())
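
The deduplication relies on dictionary keys being unique. As a minimal sketch, the idea can be reproduced without an AnIML file; the `Unit` and `Series` classes and the quantity names below are hypothetical stand-ins for the pyAnIML objects and their contents.

```python
from dataclasses import dataclass


@dataclass
class Unit:  # hypothetical stand-in for series.unit
    label: str
    quantity: str


@dataclass
class Series:  # hypothetical stand-in for an AnIML series
    unit: Unit


# Two series sets that carry the same two parameters (illustrative values).
series_sets = [
    [Series(Unit("q", "scattering vector")), Series(Unit("I", "intensity"))],
    [Series(Unit("q", "scattering vector")), Series(Unit("I", "intensity"))],
]

# A repeated label overwrites its own entry, so each label is kept only once.
unique_parameters = {}
for series_set in series_sets:
    for series in series_set:
        unique_parameters[series.unit.label] = series.unit.quantity

labels = ", ".join(unique_parameters)
print(labels)
```

Because dictionaries preserve insertion order, the labels appear in the order in which they were first encountered.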

7. Add experiment name and its relevant parameters to the process metadata block object:

In [None]:
process_block.add_processing_methods(
    name=method_name,
    parameters=method_parameter_labels,
)

8. Create dictionaries of the relevant parameters' names and units:

In [None]:
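# parameter_names refers to the same dictionary as unique_parameters,
# so "T" and "w" are also included when the parameters are added in the next step
parameter_names = unique_parameters
parameter_names["T"] = "temperature"
parameter_names["w"] = "mass fraction"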

In [None]:
parameter_units = {
    "q": "nm^-1",
    "I": "a.u.",
    "T": "°C",
    "w": "wt%"
}

9. Add the full parameter information to the process metadata block object:

In [None]:
for parameter in unique_parameters:
    process_block.add_method_parameters(
        name=parameter_names[parameter],
        symbol=parameter,
        unit=parameter_units[parameter]
    )
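
The loop above looks up every symbol in both dictionaries, so a symbol without a unit raises a `KeyError`. A minimal sketch of a check that can be run beforehand is shown here; the helper `missing_entries` is hypothetical, and the names given for `q` and `I` are assumptions, since the notebook reads them from the AnIML document.

```python
def missing_entries(symbols, names, units):
    """Return the symbols that lack a name or a unit, in input order."""
    return [s for s in symbols if s not in names or s not in units]


# Illustrative dictionaries; the names for "q" and "I" are assumed values.
names = {"q": "scattering vector", "I": "intensity", "T": "temperature", "w": "mass fraction"}
units = {"q": "nm^-1", "I": "a.u.", "T": "°C", "w": "wt%"}

complete = missing_entries(["q", "I", "T", "w"], names, units)
incomplete = missing_entries(["q", "d"], names, units)
print(complete, incomplete)
```

An empty list means that every parameter can be added to the process metadata block without a lookup error.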

10. Retrieve method information from the AnIML document:

In [None]:
animl_method = animl_doc.experiment_step_set.experiment_steps[0].method

In [None]:
instrument_from_animl = animl_method.methods[2]

In [None]:
software_from_animl = [animl_method.methods[3], animl_method.methods[4]]

11. Add instrument information to the process metadata block object:

In [None]:
process_block.add_instruments(
    name=instrument_from_animl.name,
    version=instrument_from_animl.firmware_version,
    serial_number=[instrument_from_animl.serial_number],
    software="SAXSQuant",
    location="University of Stuttgart, Institute of Physical Chemistry",
)

12. Add software information to the process metadata block object:

In [None]:
process_block.add_software(
    name=software_from_animl[0].name,
    version=software_from_animl[0].version,
    citation="DOI:10.3233/978-1-61499-649-1-87",
    url="https://jupyter.org/",
    license="BSD-3-Clause"
)

In [None]:
process_block.add_software(
    name=software_from_animl[1].name,
    version=software_from_animl[1].version,
    url="https://www.originlab.com/index.aspx?go=Products/Origin",
    license="Commercial"
)

13. State data generation types implemented in this work:

In [None]:
engineering_block.data_generation = [DataGeneration.experiment.value, DataGeneration.analysis.value]

14. Add information about variables measured to the engineering metadata block object:

In [None]:
engineering_block.add_measured_variables(
    name=parameter_names["q"],
    symbol="q",
    unit=parameter_units["q"]
)

In [None]:
engineering_block.add_measured_variables(
    name=parameter_names["I"],
    symbol="I",
    unit=parameter_units["I"]
)

15. Add information about variables controlled to the object:

In [None]:
engineering_block.add_controlled_variables(
    name=parameter_names["T"],
    symbol="T",
    unit=parameter_units["T"],
    minimum_value=0,
    maximum_value=100
)

In [None]:
engineering_block.add_controlled_variables(
    name=parameter_names["w"],
    symbol="w",
    unit=parameter_units["w"],
    minimum_value=10.0,
    maximum_value=100.0
)

16. Add the different block objects to the DaRUS dataset object created at the beginning of this section:

In [None]:
dataset.add_metadatablock(citation_block)
dataset.add_metadatablock(process_block)
dataset.add_metadatablock(engineering_block)

#### **Packing files to OMEX or ZIP** <a class="anchor" name="packing"></a>

In this section, the COMBINE archive is packed following the OMEX standard. The file extension of this archive can be chosen freely, for example `.omex` or `.zip`.

1. Create the archive object and VCard objects:

In [None]:
archive = CombineArchive()

In [None]:
list_of_VCards = []
for creator in citation_block.author:
    new_VCard = VCard()
    new_VCard.family_name = creator.name.split(", ")[0]
    new_VCard.given_name = creator.name.split(", ")[1]
    new_VCard.organization = creator.affiliation
    list_of_VCards.append(new_VCard)
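
The loop above assumes that every author name has the form `"Family, Given"`. A minimal sketch of a more tolerant split is shown below; the helper `split_name` is hypothetical and not part of libcombine or pyDaRUS.

```python
from typing import Tuple


def split_name(full_name: str) -> Tuple[str, str]:
    """Split "Family, Given" into its parts; the given name may be empty."""
    family, _, given = full_name.partition(",")
    return family.strip(), given.strip()


print(split_name("Giess, Torsten"))
print(split_name("FAIR Chemistry"))
```

Unlike indexing the result of `split(", ")`, `partition` does not raise an `IndexError` when the name contains no comma, which may matter for organizational authors.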

2. If AnIML and/or PDH formats are not already known, add them to the known formats:

In [None]:
if KnownFormats.lookupFormat("animl") == "":
    KnownFormats.addKnownFormat(
        "animl", "http://purl.org/NET/mediatypes/application/x.animl"
    )
if KnownFormats.lookupFormat("pdh") == "":
    KnownFormats.addKnownFormat(
        "pdh", "http://purl.org/NET/mediatypes/application/x.pdh"
    )

3. Add the AnIML document to the archive object:

In [None]:
local_path = str(path_to_AnIML_file)
archive_path = f"./{str(path_to_AnIML_file.name)}"
format_check = KnownFormats.lookupFormat("animl")
is_master = True

In [None]:
_void = archive.addFile(local_path, archive_path, format_check, is_master)
del _void

4. Create the metadata description object:

In [None]:
description = OmexDescription()

5. Declare the file the description points to:

In [None]:
description.setAbout(archive_path)

6. Provide a small description string:

In [None]:
description.setDescription(f"The AnIML document with the SI for '{citation_block.title}'.")

7. Provide date and time:

In [None]:
description.setCreated(OmexDescription.getCurrentDateAndTime())

8. Add the authors:

In [None]:
for creator in list_of_VCards:
    description.addCreator(creator)

9. Add the finished description object to the archive object:

In [None]:
_void = archive.addMetadata(archive_path, description)
del _void

10. Optional: Add any desired additional files to the archive:

- Add PDH files to the archive in the same manner as the AnIML document, but with `is_master` set to `False`:

In [None]:
for pdh_file in (path_to_datasets / "raw/").rglob("*.pdh"):
    if pdh_file.match("*/.ipynb_checkpoints/*"):
        continue
    # add file to archive
    local_path = str(pdh_file)
    archive_path = f"{pdh_file.relative_to(path_to_datasets).parent}/{str(pdh_file.name)}"
    format_check = KnownFormats.lookupFormat("pdh")
    is_master = False
    _void = archive.addFile(local_path, archive_path, format_check, is_master)
    del _void
    
    # add description for file
    description = OmexDescription()
    description.setAbout(archive_path)
    description.setDescription("Anton Paar SAXess mc^2 PDH file containing raw SAXS measurement data.")
    description.setCreated(OmexDescription.getCurrentDateAndTime())
    for creator in list_of_VCards:
        description.addCreator(creator)
    _void = archive.addMetadata(archive_path, description)
    del _void

- Add TSV files to the archive:

In [None]:
for tsv_file in (path_to_datasets / "processed/").rglob("*.tsv"):
    if tsv_file.match("*/.ipynb_checkpoints/*"):
        continue
    # add file to archive
    local_path = str(tsv_file)
    archive_path = f"{tsv_file.relative_to(path_to_datasets).parent}/{str(tsv_file.name)}"
    format_check = KnownFormats.lookupFormat("tsv")
    is_master = False
    _void = archive.addFile(local_path, archive_path, format_check, is_master)
    del _void
    
    # add description for file
    description = OmexDescription()
    description.setAbout(archive_path)
    description.setDescription("TSV file containing the results of the SAXS analysis.")
    description.setCreated(OmexDescription.getCurrentDateAndTime())
    for creator in list_of_VCards:
        description.addCreator(creator)
    _void = archive.addMetadata(archive_path, description)
    del _void

- Add PNG files to the archive:

In [None]:
for png_file in (path_to_datasets / "processed/").rglob("*.png"):
    if png_file.match("*/.ipynb_checkpoints/*"):
        continue
    # add file to archive
    local_path = str(png_file)
    archive_path = f"{png_file.relative_to(path_to_datasets).parent}/{str(png_file.name)}"
    format_check = KnownFormats.lookupFormat("png")
    is_master = False
    _void = archive.addFile(local_path, archive_path, format_check, is_master)
    del _void
    
    # add description for file
    description = OmexDescription()
    description.setAbout(archive_path)
    description.setDescription("PNG file containing the results of the SAXS visualization.")
    description.setCreated(OmexDescription.getCurrentDateAndTime())
    for creator in list_of_VCards:
        description.addCreator(creator)
    _void = archive.addMetadata(archive_path, description)
    del _void
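
The three loops above differ only in subdirectory, file suffix, and description text. As a minimal sketch, the file discovery they share could be factored into one generator. The helper `collect_files` is hypothetical, and the demonstration runs on a temporary directory instead of the real dataset.

```python
import tempfile
from pathlib import Path


def collect_files(root: Path, subdir: str, suffix: str):
    """Yield (local_path, archive_path) pairs for matching files below root/subdir."""
    for path in sorted((root / subdir).rglob(f"*{suffix}")):
        # Skip Jupyter checkpoint copies at any depth.
        if ".ipynb_checkpoints" in path.parts:
            continue
        yield str(path), path.relative_to(root).as_posix()


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "processed" / ".ipynb_checkpoints").mkdir(parents=True)
    (root / "processed" / "result.tsv").write_text("q\tI\n")
    (root / "processed" / ".ipynb_checkpoints" / "result-checkpoint.tsv").write_text("")
    pairs = list(collect_files(root, "processed", ".tsv"))
    print([archive_path for _, archive_path in pairs])
```

Each pair could then be passed to `archive.addFile` as in the cells above. Note that `as_posix()` always yields forward slashes in the archive path, whereas the f-string used above follows the separator of the operating system.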

11. Serialize the archive object to OMEX or ZIP format:

In [None]:
_void = archive.writeToFile(str(path_to_datasets / "processed/fairsaxs_220512.zip"))
del _void
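
Since a COMBINE archive is a ZIP container, the written file can be checked with the standard library before uploading. The following minimal sketch lists the members of an archive. The helper `list_archive` is hypothetical, and the demonstration uses a throwaway ZIP file with placeholder contents rather than the archive written above.

```python
import tempfile
import zipfile
from pathlib import Path


def list_archive(path):
    """Return the sorted member names of a ZIP-based archive."""
    with zipfile.ZipFile(path) as zf:
        return sorted(zf.namelist())


# Self-contained demonstration with placeholder file contents.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo, "w") as zf:
        zf.writestr("manifest.xml", "<omexManifest/>")
        zf.writestr("fairsaxs.animl", "<AnIML/>")
    members = list_archive(demo)
    print(members)
```

Applied to the real archive, the listing would be expected to contain the AnIML document, the additional files, and the manifest generated by libcombine.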

#### **Upload to DaRUS** <a class="anchor" name="upload"></a>

Uploading a draft to a DaRUS repository requires an API token with appropriate permissions. Both the URL of the Dataverse installation and the API token are read from environment variables.

1. Add the packed archive to the dataset object for upload to DaRUS:

In [None]:
dataset.add_file(dv_path="fairsaxs_220502.zip", local_path=str(path_to_datasets / "processed/fairsaxs_220502.zip"))

2. Upload the dataset to DaRUS by stating the target dataverse, and print the resulting persistent identifier (DOI):

In [None]:
p_id = dataset.upload("sfb1333-giesselmann-bruckner")
print(p_id)

#### **Download from DaRUS** <a class="anchor" name="download"></a>

Any DaRUS dataset, including its files, can be downloaded, viewed, and modified here, provided that an **API token with the necessary permissions** and the **URL of the base Dataverse installation** are set in the **environment variables**. Alternatively, uncomment the lines below and provide both values there:

In [None]:
# os.environ["DATAVERSE_API_TOKEN"] = ""
# os.environ["DATAVERSE_URL"] = "https://darus.uni-stuttgart.de/"
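
A missing variable otherwise only shows up as an error during upload or download. As a minimal sketch, the two variable names used above can be checked up front; the helper `missing_variables` is hypothetical and not part of pyDaRUS.

```python
import os

REQUIRED_VARIABLES = ("DATAVERSE_API_TOKEN", "DATAVERSE_URL")


def missing_variables(environ=None):
    """Return the required variable names that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


# Demonstration with an explicit mapping instead of the real environment.
example = missing_variables({"DATAVERSE_URL": "https://darus.uni-stuttgart.de/"})
print(example)
```

Calling `missing_variables()` without arguments inspects the actual environment; an empty list indicates that both values are available.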

Download a dataset by giving its DOI and a directory into which the available files are downloaded:

In [None]:
darus_dataset = "doi:10.18419/darus-2842"

In [None]:
dataset = Dataset.from_dataverse_doi(doi=darus_dataset, filedir=str(path_to_datasets / "download/"))
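
The identifier is given in the `doi:` form. For viewing the dataset in a browser, it can be turned into a resolver link. This is a minimal sketch; the helper `doi_to_url` is hypothetical and not part of pyDaRUS.

```python
def doi_to_url(persistent_id: str) -> str:
    """Convert an identifier of the form "doi:<DOI>" into a doi.org link."""
    prefix = "doi:"
    if not persistent_id.startswith(prefix):
        raise ValueError(f"Expected an identifier starting with {prefix!r}")
    return "https://doi.org/" + persistent_id[len(prefix):]


print(doi_to_url("doi:10.18419/darus-2842"))
```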

Optionally, delete one or more of the existing metadata blocks in order to recreate them from scratch with the code provided in [Metadata extraction from AnIML document](#extraction):

In [None]:
del dataset.citation, dataset.engMeta, dataset.process

#### **Edit DaRUS datasets** <a class="anchor" name="edit"></a>

Downloaded datasets can be edited and re-uploaded using this optional section together with the sections before it. As an example, an additional file is added to the dataset downloaded above. This requires both the desired path within the directory structure of the DaRUS dataset and the path to the file on the local file system.

1. After downloading a dataset, make changes as you see fit:

In [None]:
dataset.p_id = darus_dataset

In [None]:
dataset.add_file(dv_path="fairsaxs_220512.zip", local_path=str(path_to_datasets / "processed/fairsaxs_220512.zip"))

2. Use the update function to push the changes to DaRUS. Due to the requirements of the Dataverse API, contact information is required again for each update. If the contact does not yet exist in the DaRUS dataset, it will be added:

In [None]:
dataset.update(contact_name="Pleiss, Juergen", contact_mail="juergen.pleiss@itb.uni-stuttgart.de")

---

### **Disclosure** <a class="anchor" name="disclosure"></a>

**Contributions**

If you wish to contribute to the FAIR Chemistry project, find us on [GitHub](https://github.com/FAIRChemistry)!

**MIT License**

Copyright (c) 2022 FAIR Chemistry

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.