2.3. Mock examples
The star2xml tool eases the process of XML creation prior metadata submission to the European Genome-phenome Archive (EGA).
- A compilation of Python scripts that automatically generate correctly formatted XMLs containing metadata. Additionally it can validate such XMLs against ENA's schemas (
- A compilation of Python scripts that automatically generate correctly formatted XMLs containing metadata. Additionally it can validate such XMLs against ENA's schemas (
- Given an input file (
.xlsx) the tool follows an XML structure (defined in a
YAMLschema file) assigning each field of the input file to its corresponding XML node's characteristics.
- Given an input file (
- Tool's scripts can be found in star2xml directory.
- Required Python packages can be found at requirements.txt.
- Use the file
EGA_metadata_submission_template_v1.xlsxas a template to fill in with your data, which can be used as the input for the star2xml tool. Further information about its format and how to fill each of their tabs exists in its section on this README.
- Configuration files (
xml_schema.yaml) reside in the configurations directory. Information regarding their structure and how to modify them is located both within the files themselves and their section on this README.
We highly recommend you to take a look at the recorded session "Star2xml: metadata converter", where we go through an overview of what the basic usage of the tool is and how to use it. Likewise, there is a second video tutorial that covers the programmatic submission that follows the creation of metadata XMLs.
This tool was programmed in Python (version 3.8+) and depends on the following packages:
|pandas||1.2.2||Fast, powerful, flexible and easy to use open source data analysis and manipulation tool|
|PyYAML||5.4.1||YAML parser and emitter|
|argparse||1.4.0||Module to write user-friendly command-line interfaces|
|lxml||4.6.2||library for processing XML and HTML|
|datetime||4.3||Supplies classes for manipulating dates and times|
|openpyxl||3.0.6||Python library to read/write Excel 2010 xlsx/xlsm/xltx/xltm files|
|requests||2.25.1||HTTP library for Python (used to download files)|
You may want to install the latest versions of this packages and check if it works (running the mock command line examples provided in this README). In case you want to install the specific versions we used to develop this tool, you are advised to create a virtual environment (to avoid overwriting other versions you may use).
To install Python dependencies:
# Step 1. Cloning the tools repository git clone https://github.com/EGA-archive/star2xml.git cd star2xml/star2xml/ # Step 2. Creating and activating the virtual environment virtualenv -p python3 venv_star2xml source venv_star2xml/bin/activate # Step 3. Installing dependencies pip3 install -r requirements.txt # Step 4. Deactivating the virtual environment deactivate
If you wish to install dependencies on your working environment, you will only need to run the two commands from steps 1 and 3 (skip steps 2 and 4). In case you do create a virtual environment, remember to always activate it (using
source venv_star2xml/bin/activate) prior running the scripts.
There are two main scripts you can run:
- star2xml.py: used to generate XMLs.
- validateXML.py: used to validate XMLs.
Information of both scripts can be obtained using the command line help option [
python3 star2xml.py -h) while executing each of them.
usage: star2xml.py [-h] [--output_xmls OUTPUT_XMLS] [--schema-file [SCHEMA_FILE]] [--configuration-file [CONF_FILE]] [--verbose] [--debug] [--validate] schema_keys input_file A script to transform an input file (.csv, .tsv or .xlsx) into one (or more) dataframe(s), and then build one (or more) XML(s) with its information following the XML structure described in a YAML file positional arguments: schema_keys Schema keys for the metadata object. Can be a single key (e.g. "sample", "run", "experiment"...), or several keys separated by commas (e.g. "sample,run,experiment") input_file Input file (.csv, .tsv or .xlsx) with metadata information to be transformed into a dataframe (e.g. "sample.xlsx"). If several schema keys are given, the input spreadsheet is expected to have a separated tab named after each schema key. optional arguments: -h, --help show this help message and exit --output_xmls OUTPUT_XMLS Output XML filepaths, i.e. file(s) that will contain the generated XML(s). [OUTPUT_XMLS] can be (1) a single filepath (e.g. "sample.xml"), (2) several filepaths separated by commas (e.g. "sample.xml,run.xml,experiment.xml" - in the same order as the schema keys), (3) or a directory (default: "output_xmls/") where all XMLs will be saved with their corresponding schema keys as their names (with the time-stamp if needed to avoid overwritting files). --schema-file [SCHEMA_FILE] YAML file containing the schema for the metadata object(s) (default: "xml_schema.yaml") --configuration-file [CONF_FILE] YAML file containing the configuration (i.e. required fields) of the input file (default: "input_configuration.yaml") --verbose A boolean switch to add verbosity to the function (printing initial parameters, final XML...) --debug A boolean switch to set the functions in "debug" mode, which will add even more verbosity to the function (printing every step of the XML creation...) --validate A boolean switch that will enable the validation of the scripts right after its creation. Thus, the function will call validateXML.py (in verbose mode) after it has finished creating the XMLs. Example of usage: $ python3 star2xml.py "study,sample,analysis,experiment,run,dataset,submission,dac,policy" "EGA_metadata_submission_template_v1.xlsx" --validate
The input file will commonly be a spreadsheet with a tab named after each of the metadata objects (e.g. "run") we want to convert into XMLs. Instead of a joint spreadsheet, the tool also accepts Comma and Tab Separated Values (.csv and .tsv) files, each of which would contain data of one single metadata object (similar to one tab of the joint template).
For example, the joint template (
EGA_metadata_submission_template_v1.xlsx) contains a tab for each possible metadata object. Within each of them, one row corresponds to one metadata instance (e.g. one
run per row), and each column to one field of information for such instance. In case we were interested in creating an XML containing the Run's metadata we would execute the following command:
python3 star2xml.py 'run' 'EGA_metadata_submission_template_v1.xlsx' --output_xmls 'output_xmls/run.xml' --schema-file 'configuration_files/xml_schema.yaml' --configuration-file 'configuration_files/input_configuration.yaml'
--configuration-file arguments can be omitted if their corresponding filepaths have not been modified (by default in
configuration_files/). Besides, if
--output_xmls is also omitted, the output XMLs will be stored in
output_xmls/ by default. Thus, the command can be simplified:
python3 star2xml.py 'run' 'EGA_metadata_submission_template_v1.xlsx'
One convenient optional argument that you can provide to the
star2xml.py script is
--validate, which will trigger the execution of the following script. In other words, this will not only create the desired XMLs, but also validate them against ENA's schemas with one single command.
usage: validateXML.py [-h] [--schemas-dir [SCHEMA_DIR]] [--schema-file [SCHEMA_FILE]] [--download_xsd] [--ftp_downloader] [--verbose] [--dont_stop_parsing] schema_keys input_xmls A script to validate one (or more) input XML(s) based on some XML schemas (.xsd files). If schemas are missing, it downloads them from its GH repository (or FTP server) (specified within the configuration files). The function returns a list of boolean values defined by the outcome of the validation (e.g. [False, True, True] if only the last 2 XMLs were correctly validated) positional arguments: schema_keys Schema key(s) (comma delimited) for the metadata object(s) (e.g. "sample,run" or "experiment"...) input_xmls Input XML(s) (comma delimited) with metadata information to be validated (e.g. "sample.xml,run.xml" or "experiment.xml") optional arguments: -h, --help show this help message and exit --schemas-dir [SCHEMA_DIR] Directory containing all the XSD schema files (default: "downloaded_schemasXSD/"). If --download-xsd is given, the XSD files will be downloaded into this directory. --schema-file [SCHEMA_FILE] YAML file containing the schema for the metadata object(s) (default: "configuration_files/xml_schema.yaml") --download_xsd A boolean switch that will enable the download of the XML schemas (.xsd files) instead of having to provide them manually. --ftp_downloader A boolean switch to use the ftp downloader instead of the default request.get() downloader for the ENA schemas (in GitHub), which is the main source of truth for ENA schemas. We advise you not to use this option unless you know for sure that the version of the schemas within the configuration files (e.g. 'pub/databases/ena/doc/xsd/sra_1_6/') is the proper one. --verbose A boolean switch to add verbosity to the function (will print into the terminal extra information, as well as the validation errors and results with a friendlier format). Highly recommended. --dont_stop_parsing A boolean switch that, if given, will make the validation continue if an error is raised when parsing one of the given XMLs. Such file with errors will be reported as not validated, but the function will not stop, validating other files. The error will be displayed excplicitly as a warning if '--verbose' is also given. Schema keys (e.g. 'sample,run') and their input XMLs (e.g. 'sample.xml,run.xml') have to be given in the same order!
Just like with the previous script, here we can validate as many XML files as we want in one go. For instance, in the following example we validate two XML files (
run.xml) that correspond to two different metadata objects (
run). It is important to notice that, if we have not downloaded yet the metadata schema files (
.xsd), we should provide the option
--download_xsd the first time we run
python3 validateXML.py "sample,run" "output_xmls/sample.xml,output_xmls/run.xml" --schemas-dir "downloaded_schemasXSD/" --schema-file "configuration_files/xml_schema.yaml" --verbose --download_xsd
Once again, if we have not modified the schema's filepath, option
--schema-file can be omitted. Besides,
--schemas-dir is by default
downloaded_schemasXSD/ and if we already downloaded the
.xsd files, we can also omit the option
--download_xsd. Thus, a simpler command would be:
python3 validateXML.py "sample,run" "output_xmls/sample.xml,output_xmls/run.xml" --verbose
It is worth mentioning that if there is an error while parsing the given XMLs (e.g. there are unclosed nodes - i.e. missing '
>'), the validation will stop by default to notify the error. If this is not the desired behaviour, you may provide the optional argument
--dont_stop_parsing to avoid terminating the execution, and instead report the file with errors as non-validated.
To get started with the tool, you can execute the following commands:
# Create one single XML from one tab of the joint spreadsheet: python3 star2xml.py "sample" "EGA_metadata_submission_template_v1.xlsx" --verbose # Validate the XML we just created: python3 validateXML.py "sample" "output_xmls/sample.xml" --verbose --download_xsd # Create all possible XMLs from the joint template and validate each of them: python3 star2xml.py "study,sample,analysis,experiment,run,dataset,submission,dac,policy" "EGA_metadata_submission_template_v1.xlsx" --validate
Filling out templates
For this part of the documentation we will be using the joint template (
EGA_metadata_submission_template_v1.xlsx), a spreadsheet, since it is the most commonly used format. Nevertheless, stripping off the formatting, you may use a similar logic while filling plain text formats (
Based on the type of metadata objects you want to submit, you shall fill their corresponding tabs within such joint template. Each tab of the spreadsheet corresponds to one of the possible metadata objects (e.g.
run) from EGA, with the exception of the first tab, which is named
Readme and contains information about the file's format. For all metadata tabs each row will represent one repetition of a metadata object. For example, each of the rows in the sample tab given as input will represent one
<SAMPLE> node of the
<SAMPLE_SET> in the final XML. All information that row contains will be associated with its corresponding
<SAMPLE> node (its alias, description, etc.).
Rows that are completely empty will be discarded, as well as empty values within a non-empty row. In other words, every empty coordinate of the spreadsheet that is not filled will not appear in the output XML.
Please bear in mind that controlled vocabularies (CV - e.g.
SEQUENCE_VARIATION as an Analysis type) are case-sensitive (i.e.
Sequence_variation is considered a different CV).
It is important to notice the rows format:
- First row: column headers.
- Second row: descriptive row. In here you will find what type of data corresponds to each column. Most values within this row will start with "TODO:…", which means that the data its column contains is specific to your case. In case its value does not start in such a way, it means that the string this row contains for that specific column shall be used for all rows (e.g.
Checksum_methodwithin an analysis tab or
sexat one of the
Tagcolumns of a sample tab).
- Third row: real data "ceiling". It is below this row (at the fourth row) that you shall enter your real metadata.
- Fourth row onwards: real metadata. All rows from the 4th onwards will be transformed into XML nodes by the star2xml tool. By default the template will contain some mock examples within this row so that you can execute the above mentioned mock commands, but you shall remove this whole row or replace its values with yours.
Types of columns
Before filling the template, we need to recognize two different types of columns: non-repetitive and repetitive ones.
- Non-repetitive columns. These fields describe a characteristic (text or attribute) of one single node (e.g.
text: "Scientific_name") for each metadata object (e.g.
<SAMPLE>). Such columns can be hidden or left empty for some or all rows (unless required - see column headers' format below), but should not be deleted (the tool will be looking for them). As an example, in the following image we have 5 of these columns from the analysis tab.
- Repetitive columns. These fields contain characteristics for a node that can be repeated (e.g.
<FILE>) within each metadata object (e.g.
<ANALYSIS>). These repeated columns are differently coloured (see column's format below) and appear beyond a vertical thick black line (with the exception of
phenotype- the three public attributes from a sample tab). The important thing to notice is that the alternative colouring is there to help you identify what a "repetitive block" is. In the following image we have 9 of these columns, which correspond to 3 repetitions of the same repetitive block (in this case
Tag-Value-Units). These blocks can be added or deleted depending on your needs, but if there is a column from a repetitive block, their sibling columns are also required (even if left empty or hidden, just like
phenotype): in our example, if we add a new repetition with columns
Value, there needs to be a third one,
Column names in the templates are linked to the configuration files (
xml_schema.yaml), which leads to an important constraint: if there is a field described in a configuration file (e.g.
center_name: "Center_name") there needs to be its corresponding column name within the input file (e.g.
Center_name). In other words, unless the configuration file is properly modified, you shall not delete non-repetitive columns or completely delete all repetitive blocks from the input file. What you can do is leave them empty for some or all rows, or delete additional repetitive blocks that you don't need. At least one of each complete repetitive block (with all its sibling fields) needs to be present within the input file, even if you leave it empty for some or all rows. As we mentioned before, every empty coordinate of the spreadsheet will not be entered in the XML.
The order of columns is not relevant as long as the repeated blocks' columns are not severely mixed (e.g. filling first
Tag with the string corresponding to the second
Tag). In fact, repeated columns can be mixed provided the order is maintained, thus making the following input valid.
This allows for a handy way of dealing with hundreds or thousands of columns in an easy way (being able to put all columns of the same type in a sequence). For instance, we may want to generate an Analysis XML based on our samples, but the analysis encompasses thousand of samples (i.e. thousands of rows in the sample tab, but thousands of columns in the analysis tab). Although there are multiple ways to input such columns, an easy approach is to use the transpose function (e.g. using excel - help from microsoft). In our example the sample repetitive block contains 2 types of columns (
Sample_Label). Therefore, we can copy all the aliases from the sample tab (rows), and transpose them as columns into the analysis tab, do the same with the labels, and create the following input:
Column header's format
Additional information can be obtained from the colour of the column headers (first row):
- Bright yellow: required attributes. All column headers that contain "
*" are marked as required (e.g.
Analysis_alias*): their metadata shall be provided for each filled row.
- No colour: optional (yet highly recommended) attributes. These columns may be left empty, although we advise to also provide their corresponding metadata, for it will enrich your submission.
- Light yellow: optionally required columns. These are columns related to a choice from another column (based on multiple choice attributes). For instance, if our experiment's layout is
PAIRED, the two related columns (
PAIRED.Nominal_sdev) will change their header's format to light yellow, since these are required columns for a paired experiment.
- Grey: optionally ignored columns. Column headers that do not appear to be chosen for any metadata instance (row), and thus can be ignored (i.e. left empty) (based on multiple choice attributes). For instance, if our experiment's layout is
SINGLE, the two columns previously mentioned that are related to a paired experiment would be highlighted in grey.
- Other colours: repetition blocks. As we mentioned describing the types of columns, there are repeated columns. Their headers are alternatively coloured for each repeated class to ease their identification. Besides, the body of the column is coloured in a lighter colour than their headers alternating between repeated blocks of the same class.
This section of the README displays additional information about how the tool works using their configuration files. Such knowledge will most likely not be relevant to the average user, and thus you may skip it. Nevertheless, if you wish to change the configuration files, it will come in handy.
There are two configuration files:
xml_schema.yaml. The former simply lists the required fields for each input file (i.e. if a column named
Sample_alias* needs to be present or not). The latter describes the structure of the corresponding XML (i.e. which nodes are children of which) and associates each column name of the input file with its corresponding node's characteristic (either an attribute or text). Both are
YAML files, which are easy-to-read information holders, and can be interpreted as dictionaries/lists of elements. Besides the information displayed here, additional instructions on how to modify them reside within the files themselves.
Basic structure -
At base level, the file contains information of the tool itself (
tool_info - used to add details to reports), the metadata schemas (
XML_schemas_info - used to both download
.xsd files and create XMLs) and one element for each metadata object (e.g.
sample) describing its XML's architecture.
A simple example of what is what, with content from the
xml_schema.yaml (up), the input file (right) and the output XML (down), is the following:
Modifying the schema
If the metadata requirements change and existing fields need to be removed or new ones added, we will need to modify the two configuration files as well as the input files.
- Missing fields of a repetitive block. In the following example fields
Unitsare missing from the first and second
Tag-Value-Units(one of the common repetition blocks), respectively. Remember that repeated blocks need to be complete (if there is a
Tagcolumn, its two siblings
Unitsneed to be there), but you can leave empty such columns for all or some rows, since every empty coordinate of the spreadsheet will be ignored.
- Not using the option
--download_xsdthe first time you try to validate XMLs: the schemas (
.xsd) will be missing and the tool will throw the following error message:
ERROR in check_xml_is_valid(): the schema file 'downloaded_schemasXSD/SRA.sample.xsd' could not be accessed. If you have not downloaded the schema files (.xsd) yet, use '--download_xsd' when running the command.
- Line endings being an issue. When using different operating systems (e.g. using Windows Subsystem for Linux) an issue regarding line endings may arise when trying to execute the scripts (e.g.
/usr/bin/env: ‘python3\r’: No such file or directory- notice the
/rnot being handled correctly by the interpreter). If such is the case, there are automatic ways to change all line endings within the scripts, which will solve the issue:
sudo apt install dos2unix # Within the 'star2xml/' folder do: dos2unix ./*.py
- File permissions not being stablished by default. In case executable files (e.g.
star2xml.py) are not executable by your user (take a look at its permissions - e.g.
-rw-r--r--), you will need to amend them manually (e.g.
chmod u+x star2xml.py).