# FAIR and scalable management of small-angle X-ray scattering data  
## Module 1: Conversion of PDH to AnIML

> Authors: Torsten Giess, Selina Itzigehl, Jan Range, Johanna R. Bruckner, Juergen Pleiss  
> Last modified: 12.05.2022

---

### **Abstract** <a class="anchor" name="abstract"></a>

Using the novel Python package [pyAnIML](https://github.com/FAIRChemistry/pyAnIML) (version 1.0.0), the established packages [lxml](https://lxml.de/) (version 4.8.0) and [pandas](https://pandas.pydata.org/) (version 1.4.2), and several packages from the Python 3 standard library, this notebook provides the means to create new [AnIML](https://animl.org/) documents and to add entries to existing ones.

---

### **Table of Contents** <a class="anchor" name="table_of_contents"></a>

- [Abstract](#abstract)
- [Workflow](#workflow)
    - [User guide](#user_guide)
    - [Preparation](#preparation)
    - [Conversion of PDH to AnIML](#conversion)
    - [Adding to an existing AnIML document](#addition)
    - [Inspecting an AnIML document](#inspection)
- [Disclosure](#disclosure)

---

### **Workflow** <a class="anchor" name="workflow"></a>

This section describes the workflow for Module 1 of FAIR and scalable management of small-angle X-ray scattering data.

#### **User guide** <a class="anchor" name="user_guide"></a>

This notebook can be used either to build up a [new AnIML object](#conversion) in memory and later serialize it to a document on disk, or to extend an [existing AnIML document](#addition) on disk by adding new entries and reserializing it.  
In both cases, the required fields and their explanations are already present in this notebook, so it can be used as is or extended if needed.

#### **Preparation** <a class="anchor" name="preparation"></a>

This section contains the necessary preparations for using this module. The code cells in this section are required regardless of which functionality of this notebook is used. First, the required packages from the [Python 3 standard library](https://docs.python.org/3/library/), the Python Package Index ([PyPI](https://pypi.org/)), and the *ad hoc* modules of this work are imported. Then, the current date and the current working directory are retrieved and stored in the desired formats.

In [None]:
print("Importing standard library packages.")
from datetime import date
from pathlib import Path
print("Done.")

In [None]:
print("Importing PyPI packages.")
from lxml import etree
from pyaniml import AnIMLDocument, Sample, Series, ExperimentStep, Device, IndividualValueSet, SeriesSet, Category, Parameter, Unit, SIUnit
from pyaniml.core.method import Author, Software
print("Done.")

In [None]:
print("Importing local packages.")
from modules.infer_type import infer_type
from modules.pdhreader import PDHReader
print("All done.")

In [None]:
date_suffix = str(date.today()).replace("-", "")[2:]

In [None]:
cwd = Path.cwd()
path_to_datasets = cwd / "datasets"
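
As a quick check of the date format, the following minimal sketch reproduces the slicing used for `date_suffix` on a fixed date and compares it with `strftime("%y%m%d")`, which should yield the same six-digit `YYMMDD` string. The helper name `make_date_suffix` is hypothetical and only serves this illustration.

```python
from datetime import date


def make_date_suffix(day: date) -> str:
    """Return the date as YYMMDD, using the same slicing as the cell above."""
    return str(day).replace("-", "")[2:]


example_day = date(2022, 5, 12)
suffix = make_date_suffix(example_day)

# Both approaches should agree for any date with a four-digit year.
matches_strftime = suffix == example_day.strftime("%y%m%d")
print(suffix, matches_strftime)
```

Either form can be used; `strftime` states the intended format explicitly, while the slicing variant relies on the ISO layout of `str(date)`.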

#### **Conversion of PDH to AnIML** <a class="anchor" name="conversion"></a>

With this section of the notebook, a new AnIML document can be created in memory and later be serialized to XML. The presented code cells provide the most important API calls for building an AnIML document with pyAnIML.  
Running a code cell more than once adds the contents of that cell to the AnIML document once per run. This can be used to add multiple samples, experiments, series, ... to the same document by simply changing the respective code cell and rerunning it before serializing the document.

1. Create a new AnIML document as an object in memory or access an existing document:

- Case a) Create a new object:

In [None]:
animl_doc = AnIMLDocument()

- Case b) Access an existing document:

In [None]:
path_to_AnIML_file = path_to_datasets / "processed/fairsaxs_220512.animl"
with path_to_AnIML_file.open("r") as f:
    xml_string = f.read()
    animl_doc = AnIMLDocument.fromXMLString(xml_string)

2. Call the PDH reader with the path to the directory holding the PDH files as argument, retrieve a dictionary of the available files (index to file name), and show its contents:

In [None]:
pdh_dir = PDHReader(path_to_datasets / "raw/OTAC_measurement_data/OTAC_097wtp_T")
dict_of_files = pdh_dir.available_files()
for index, file in dict_of_files.items():
    print(f"{index}: {file}")
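
`PDHReader` is a local module whose implementation is not shown in this notebook. To illustrate the kind of index-to-name dictionary that the loop above iterates over, the following sketch builds one with `pathlib` alone. The function `index_pdh_files`, the `.pdh` suffix filter, and the sorted ordering are assumptions made for this example and may differ from what `available_files()` actually does.

```python
import tempfile
from pathlib import Path


def index_pdh_files(directory: Path) -> dict[int, str]:
    """Map a running index to each PDH file name found in `directory`."""
    names = sorted(p.name for p in directory.iterdir() if p.suffix.lower() == ".pdh")
    return dict(enumerate(names))


# Demonstrate on a throwaway directory holding made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for name in ("b.pdh", "a.pdh", "notes.txt"):
        (tmp_dir / name).touch()
    demo_index = index_pdh_files(tmp_dir)

for index, file in demo_index.items():
    print(f"{index}: {file}")
```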

3. Select the desired file either by name or by its index in the dictionary shown above, then extract the data as a pandas DataFrame and the accompanying metadata:

In [None]:
pdh_file = dict_of_files[65]
raw_dataframe = pdh_dir.extract_data(pdh_file)
raw_metadata = pdh_dir.extract_metadata(pdh_file)
print(raw_dataframe)

4. Start building up the AnIML document. Create a new sample with an ID and name and add it to the AnIML object:

In [None]:
experiment_name = f"{pdh_file[:4]}/water: x = {pdh_file[5:8]} wt%; T = {pdh_file[-5:-3]} C"
print(experiment_name)

In [None]:
new_sample = Sample(
    id=pdh_file.replace("%", "p")[:-3],
    name=experiment_name
)
print(new_sample)

In [None]:
animl_doc.add_sample(new_sample)
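
The slices in the cells above depend on a fixed file-name layout. The sketch below applies the same slices to a made-up name, `OTAC_097wt%_25_cr`, chosen only so that the positions line up; the real PDH file names may follow a different pattern. The percent sign is replaced presumably because `%` is not a valid character in an XML ID.

```python
# Hypothetical file name; the actual names in the dataset may differ.
pdh_file = "OTAC_097wt%_25_cr"

compound = pdh_file[:4]          # first four characters
weight_percent = pdh_file[5:8]   # three characters after the first underscore
temperature = pdh_file[-5:-3]    # two characters before the last three

experiment_name = f"{compound}/water: x = {weight_percent} wt%; T = {temperature} C"
# Replace "%" with "p" and drop the last three characters to form the ID.
sample_id = pdh_file.replace("%", "p")[:-3]

print(experiment_name)
print(sample_id)
```

If the file names ever change in length or layout, these fixed positions silently pick up the wrong characters, so printing the derived name and ID before adding the sample is a cheap safeguard.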

5. Create or access an experiment step for the AnIML object, providing it with a name and an ID:

- Case a) Create a new experiment step object:

In [None]:
experiment_step = ExperimentStep(
    name=f"Sample data for {experiment_name}",
    experiment_step_id=pdh_file.replace("%", "p")[:-3]
)
print(experiment_step)

- Case b) Access an existing experiment step within an AnIML document:

In [None]:
available_experiment_steps = animl_doc.experiment_step_set.experiment_steps
print([step.name for step in available_experiment_steps])

In [None]:
experiment_step = available_experiment_steps[0]

6. Create one or more sample references to samples which were part of this experiment step, providing a sample object (e.g. new_sample), a role, and a purpose:

In [None]:
experiment_step.add_sample_reference(
    sample=new_sample,
    role="sample",
    sample_purpose="consumed"
)

7. If applicable, describe the method: create the authors, the device on which the sample(s) were measured, the software used for measurement and analysis, and the instrument parameters, and add each of them to the experiment step object:

In [None]:
new_author = Author(
    user_type="human",
    name="Selina Itzigehl",
    affiliation="University of Stuttgart",
    email="selina.itzigehl@ipc.uni-stuttgart.de",
    role="",
    phone="",
    location=""
)
experiment_step.add_method(new_author)

In [None]:
new_author = Author(
    user_type="human",
    name="Johanna R. Bruckner",
    affiliation="University of Stuttgart",
    email="johanna.bruckner@ipc.uni-stuttgart.de",
    role="",
    phone="",
    location=""
)
experiment_step.add_method(new_author)

In [None]:
new_device = Device(
    manufacturer="Anton Paar",
    name="SAXSess mc^2",
    device_id="",
    firmware_version="",
    serial_number=""
)
experiment_step.add_method(new_device)

In [None]:
new_software = Software(
    manufacturer="Project Jupyter",
    name="JupyterLab",
    version="3.2.9",
    operating_system="Microsoft Windows 10 Pro 10.0.19044 Build 19044"
)
experiment_step.add_method(new_software)

In [None]:
new_software = Software(
    manufacturer="OriginLab",
    name="OriginPro 2021b (64-bit) SR2",
    version="9.8.5.212",
    operating_system="Microsoft Windows 10 Pro 10.0.19044 Build 19044"
)
experiment_step.add_method(new_software)

In [None]:
instrument_parameters = Category(name="Instrument parameters")

In [None]:
pdh_parameters = raw_metadata.findall(".//parameter")
for parameter in pdh_parameters:
    new_category = Category(name=parameter.get("key"))
    for value in parameter:
        new_parameter = Parameter(
            name=value.get("key"),
            parameter_type=infer_type(value.text),
            value=value.text
        )
        new_category.add_content(new_parameter)
    instrument_parameters.add_content(new_category)
experiment_step.add_method(instrument_parameters)
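
The loop above assumes that each `parameter` element carries a `key` attribute and contains child elements with their own `key` attribute and a text value. The following sketch walks a made-up metadata fragment of that shape with the standard library's `xml.etree.ElementTree` and collects the values in nested dictionaries, one per parameter. The XML fragment, the element name `value`, and the helper `guess_type` (a stand-in for the local `infer_type`, whose behaviour is not shown in this notebook) are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Made-up fragment mimicking the shape the loop relies on.
METADATA = """
<metadata>
  <parameter key="detector">
    <value key="exposure_time">10.0</value>
    <value key="frames">3</value>
    <value key="mode">line</value>
  </parameter>
</metadata>
"""


def guess_type(text: str) -> str:
    """Return a coarse type name for a text value (hypothetical stand-in)."""
    for caster, type_name in ((int, "Int32"), (float, "Float32")):
        try:
            caster(text)
            return type_name
        except ValueError:
            continue
    return "String"


root = ET.fromstring(METADATA)
categories = {}
for parameter in root.findall(".//parameter"):
    # One nested dictionary per parameter, mirroring one Category per parameter.
    categories[parameter.get("key")] = {
        value.get("key"): (guess_type(value.text), value.text) for value in parameter
    }

print(categories)
```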

8. Create a series for every dimension of data present in the dataframe. The data from the dataframe goes into an IndividualValueSet in the form of a list, which is then added to the series object together with the unit information (from the PDH metadata), a name, an ID, the data type, the dependency, and the plot scale:

- Case a) Create a new category object:

In [None]:
category = Category(name="Measurements")

- Case b) Access an existing category:

In [None]:
available_categories = experiment_step.result.results
print([content.name for content in available_categories])

In [None]:
category = available_categories[0]

- Handle series by first extracting unit information from the PDH file metadata, creating AnIML Unit and SIUnit elements from them, and adding these Unit elements together with the actual data in form of IndividualValueSets to Series elements:

In [None]:
columns = raw_metadata.findall(".//column")
# Map short series labels to the quantity names used in the PDH metadata.
quantities = {
    "q": "SCATTERING_VECTOR",
    "I": "COUNTS_PER_AREA"
}
units = {}
for column in columns:
    # The second child element of each column holds the quantity name.
    quantity = column[1].text
    for key, value in quantities.items():
        if value == quantity:
            units[key] = quantity

In [None]:
print(units)

In [None]:
q_unit = Unit(
    label="q",
    quantity=units["q"].lower().replace("_", " ")
)
q_si_unit = SIUnit(
    si_name="m",
    factor=1e-9,
    exponent=-1,
    offset=0
)
q_unit.add_si_unit(q_si_unit)

i_unit = Unit(
    label="I",
    quantity=units["I"].lower().replace("_", " ")
)
i_si_unit = SIUnit(
    si_name="m",
    factor=1e-6,
    exponent=-2,
    offset=0
)
i_unit.add_si_unit(i_si_unit)

In [None]:
q_values = IndividualValueSet(
    raw_dataframe["scattering_vector"].tolist()
)
q = Series(
    name="q",
    id=f"{pdh_file.replace('%', 'p')[:-3]}_q",
    unit=q_unit,
    individual_value_set=q_values,
    data_type="float32",
    dependency="independent",
    plot_scale="linear"
)
i_values = IndividualValueSet(
    raw_dataframe["counts_per_area"].tolist()
)
i = Series(
    name="I",
    id=f"{pdh_file.replace('%', 'p')[:-3]}_i",
    unit=i_unit,
    individual_value_set=i_values,
    data_type="float32",
    dependency="dependent",
    plot_scale="linear"
)

9. Create one or more sets for series that belong together, provide each set with a name, add the set to the category object, and add the category to the experiment step object:

In [None]:
new_set = SeriesSet(
    name="Small angle X-ray scattering",
    series=[q, i]
)

In [None]:
category.add_content(new_set)

In [None]:
experiment_step.add_result(category)

10. Finally, add the now fully built experiment step object to the AnIML object:

In [None]:
animl_doc.add_experiment_step(experiment_step)

11. Create the XML-formatted string from the AnIML object and provide the path and the desired file name for the AnIML document:

In [None]:
xml_string = animl_doc.toXML()

In [None]:
path_to_AnIML_file = path_to_datasets / f"processed/fairsaxs_{date_suffix}.animl"

12. Serialize the XML string to the pathlib Path provided:

In [None]:
with path_to_AnIML_file.open("w") as f:
    f.write(xml_string)
del animl_doc, xml_string
print("Successfully created AnIML document.")
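
`Path.open("w")` without an `encoding` argument uses the platform's default text encoding, which on some systems is not UTF-8. If the document contains non-ASCII characters, passing the encoding explicitly for both writing and reading may avoid mismatches. The sketch below shows such a round trip on a temporary file; the XML string is a placeholder, not real pyAnIML output.

```python
import tempfile
from pathlib import Path

# Placeholder content standing in for the string returned by animl_doc.toXML().
xml_string = '<?xml version="1.0" encoding="UTF-8"?>\n<AnIML><Sample name="25 °C"/></AnIML>'

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example.animl"
    # Write and read with the same explicit encoding.
    target.write_text(xml_string, encoding="utf-8")
    round_trip = target.read_text(encoding="utf-8")

print(round_trip == xml_string)
```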

#### **Adding to an existing AnIML document** <a class="anchor" name="addition"></a>

An existing AnIML document can be converted into an AnIML object in memory and manipulated using the pyAnIML API. The following example workflow adds a sample to an already serialized AnIML document.

1. Provide the path to the AnIML document to be extended as a pathlib Path:

In [None]:
path_to_AnIML_file = path_to_datasets / "processed/fairsaxs_220502.animl"

2. Read the document as a string and create an AnIML object from it:

In [None]:
with path_to_AnIML_file.open("r") as f:
    xml_string = f.read()
    animl_doc = AnIMLDocument.fromXMLString(xml_string)

3. Create the sample object:

In [None]:
new_sample = Sample(
    id="forgotten",
    name="Forgotten sample"
)
animl_doc.add_sample(new_sample)

4. Create the XML-formatted string from the AnIML object:

In [None]:
xml_string = animl_doc.toXML()

5. Serialize the XML string to the pathlib Path provided before:

In [None]:
with path_to_AnIML_file.open("w") as f:
    f.write(xml_string)
del animl_doc, xml_string
print("Successfully updated AnIML document.")

#### **Inspecting an AnIML document** <a class="anchor" name="inspection"></a>

This section is entirely optional and can be used to check whether the AnIML document was serialized correctly.

1. Provide the path to the AnIML file as a pathlib Path:

In [None]:
path_to_AnIML_file = path_to_datasets / "processed/fairsaxs_220502.animl"

2. Open the AnIML document, read its contents into a string, print the string, and delete it again:

In [None]:
with path_to_AnIML_file.open("r") as f:
    control_string = f.read()
print(control_string)
del control_string
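
Beyond printing the file, a programmatic check can confirm that the document is at least well-formed XML. The following sketch parses a string with the standard library's `xml.etree.ElementTree` and counts the elements per tag. The helper `count_tags` and the sample string are hypothetical, and the check does not validate the document against the AnIML schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(xml_text: str) -> Counter:
    """Parse `xml_text` and count elements per tag, ignoring namespaces."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    return Counter(element.tag.split("}")[-1] for element in root.iter())


# Placeholder document; a real file would be read with Path.read_text().
control_string = "<AnIML><SampleSet><Sample/><Sample/></SampleSet></AnIML>"
tag_counts = count_tags(control_string)
print(tag_counts)
```

Comparing such counts with the number of samples or experiment steps added in the notebook gives a quick plausibility check after serialization.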

---

### **Disclosure** <a class="anchor" name="disclosure"></a>

**Contributions**

If you wish to contribute to the FAIR Chemistry project, find us on [GitHub](https://github.com/FAIRChemistry)!

**MIT License**

Copyright (c) 2022 FAIR Chemistry

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.