# Reading ISA-Tab from files and validating ISA-Tab files

## Abstract:

The aim of this notebook is to:
   - show the essential functions to read and load an ISA-Tab file into memory.
   - navigate key objects and pull key attributes.
   - learn how to invoke the ISA-Tab validation function.
   - interpret the output of the validation report.


## 1. Getting the tools

In [1]:
# If executing the notebooks on `Google Colab`, uncomment the following command
# and run it to install the required Python libraries. Also, make sure the test datasets are available.

# !pip install -r requirements.txt

In [2]:
import isatools
import os
import sys
from isatools import isatab

log_level: error
LOG: <Logger isatools (DEBUG)>
/Users/philippe/.pyenv/versions/3.7.4/envs/isapi-testson374/src/isatools/isatools/net/resources/saxon9/saxon9he.jar


## 2. Reading and loading an ISA Investigation in memory from an ISA-Tab instance

In [3]:
with open(os.path.join('./BII-S-3', 'i_gilbert.txt')) as fp:
    ISA = isatab.load(fp)

### Let's check the description of the first study object present in an ISA Investigation object

In [4]:
ISA.studies[0].description

'Sequencing the metatranscriptome can provide information about the response of organisms to varying environmental conditions. We present a methodology for obtaining random whole-community mRNA from a complex microbial assemblage using Pyrosequencing. The metatranscriptome had, with minimum contamination by ribosomal RNA, significant coverage of abundant transcripts, and included significantly more potentially novel proteins than in the metagenome. This experiment is part of a much larger experiment. We have produced 4 454 metatranscriptomic datasets and 6 454 metagenomic datasets. These were derived from 4 samples.'

### Let's check the protocols declared in the ISA study (using a Python list comprehension):

In [5]:
[protocol.description for protocol in ISA.studies[0].protocols]

['Waters samples were prefiltered through a 1.6 um GF/A glass fibre filter to reduce Eukaryotic contamination. Filtrate was then collected on a 0.2 um Sterivex (millipore) filter which was frozen in liquid nitrogen until nucelic acid extraction. CO2 bubbled through 11000 L mesocosm to simulate ocean acidification predicted conditions. Then phosphate and nitrate were added to induce a phytoplankton bloom.',
 'Total nucleic acid extraction was done as quickly as possible using the method of Neufeld et al, 2007.',
 'RNA MinElute + substrative Hybridization + MEGAclear For transcriptomics, total RNA was separated from the columns using the RNA MinElute clean-up kit (Qiagen) and checked for integrity of rRNA using an Agilent bioanalyser (RNA nano6000 chip). High integrity rRNA is essential for subtractive hybridization. Samples were treated with Turbo DNA-free enzyme (Ambion) to remove contaminating DNA. The rRNA was removed from mRNA by subtractive hybridization (Microbe Express Kit, Ambio

### Let's now check which ISA Assay Measurement and Technology Types are used in this ISA Study object

In [6]:
[f'{assay.measurement_type.term} using {assay.technology_type.term}' for assay in ISA.studies[0].assays]

['metagenome sequencing using nucleotide sequencing',
 'transcription profiling using nucleotide sequencing']
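
When a study declares many assays, a tally of the measurement/technology pairs can be easier to read than the raw list. The sketch below relies only on the attributes used in the cell above (`assays`, `measurement_type.term`, `technology_type.term`). The helper name `count_assay_types` and the `SimpleNamespace` stand-ins are illustrative assumptions, not part of isatools.

```python
from collections import Counter
from types import SimpleNamespace


def count_assay_types(study):
    """Count (measurement, technology) pairs across the assays of a study."""
    return Counter(
        (assay.measurement_type.term, assay.technology_type.term)
        for assay in study.assays
    )


def _assay(measurement, technology):
    """Build a stand-in object exposing the two attributes read above."""
    return SimpleNamespace(
        measurement_type=SimpleNamespace(term=measurement),
        technology_type=SimpleNamespace(term=technology),
    )


# Hypothetical stand-in study so the sketch runs without isatools.
demo_study = SimpleNamespace(assays=[
    _assay('metagenome sequencing', 'nucleotide sequencing'),
    _assay('transcription profiling', 'nucleotide sequencing'),
])

assay_counts = count_assay_types(demo_study)
for (measurement, technology), n in assay_counts.items():
    print(f'{n} x {measurement} using {technology}')
```

With the investigation loaded as above, `count_assay_types(ISA.studies[0])` should work the same way, since it reads the same attributes.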

### Let's now check the `ISA Study Source` Material:

In [7]:
[source.name for source in ISA.studies[0].sources]

['GSM255770', 'GSM255771', 'GSM255772', 'GSM255773']

#### Let's check what the first `ISA Study Source property` is:

In [8]:
# here, we get all the characteristics of the first Source object
first_source_characteristics = ISA.studies[0].sources[0].characteristics

In [9]:
first_source_characteristics[0].category.term

'organism'

#### Let's now check the `value` associated with that first `ISA Study Source property`:

In [10]:
first_source_characteristics[0].value.term

'marine metagenome'

#### Let's now check all the properties associated with this first `ISA Study Source`

In [11]:
[char.category.term for char in first_source_characteristics]

['organism',
 'geographic location (country and/or sea,region)',
 'geographic location (longitude)',
 'geographic location (latitude)',
 'chlorophyll a concentration',
 'fucoxanthin concentration',
 'peridinin concentration',
 'butfucoxanthin concentration',
 'hexfucoxanthin concentration',
 'alloxanthin concentration',
 'zeaxanthin concentration',
 'lutein concentration',
 'chl-c3 concentration',
 'chl-c2 concentration',
 'prasinoxanthin concentration',
 'neoxanthin concentration',
 'violaxanthin concentration',
 'diadinoxanthin concentration',
 'diatoxanthin concentration',
 'divinyl-chl-b concentration',
 'chl-b concentration',
 'divinyl-chl-a concentration',
 'chl-a concentration',
 'BB carotene concentration',
 'bacteria count',
 'synechococcus count',
 'small picoeukaryotes count',
 'large picoeukaryotes count',
 'nanoflagellates count',
 'cryptophytes count',
 'phosphate concentration',
 'nitrate concentration',
 'particulate organic nitrogen concentration',
 'particulate organi

#### And the corresponding values are:

In [12]:
[char.value for char in first_source_characteristics]

[isatools.model.OntologyAnnotation(term='marine metagenome', term_source=isatools.model.OntologySource(name='NCBITAXON', file='http://data.bioontology.org/ontologies/NCBITAXON', version='2', description='National Center for Biotechnology Information (NCBI) Organismal Classification', comments=[]), term_accession='http://purl.obolibrary.org/obo/NCBITaxon_408172', comments=[]),
 isatools.model.OntologyAnnotation(term='Norway, fjord, coastal', term_source=None, term_accession='', comments=[]),
 '5.222222',
 '60.269444',
 '9.23',
 '0.54',
 '0.18',
 '0.14',
 '0.82',
 '0.36',
 '0.35',
 '0.37',
 '0.29',
 '0.59',
 '0',
 '0',
 '0.64',
 '0.46',
 '0.1',
 '0',
 '5.25',
 '0',
 '9.23',
 '0.72',
 '4666004',
 '7064',
 '36257',
 '5450',
 '2851',
 '660',
 '0.23',
 '7.53',
 '143',
 '844',
 '591.4',
 '31.3',
 isatools.model.OntologyAnnotation(term='17.6', term_source=None, term_accession='', comments=[]),
 '9.7']
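
The list above mixes `OntologyAnnotation` objects with plain strings, which makes it awkward to print or compare. A minimal sketch of one way to flatten it into a category-to-text mapping follows. The helper names `value_as_text` and `characteristics_as_dict` are hypothetical, and the stand-in objects only mimic the `category.term`, `value` and `value.term` attributes seen above.

```python
from types import SimpleNamespace


def value_as_text(value):
    """Return the `term` of an annotation-like object, else the value as text."""
    term = getattr(value, 'term', None)
    return str(value) if term is None else str(term)


def characteristics_as_dict(characteristics):
    """Map each characteristic's category term to its value as plain text."""
    return {
        char.category.term: value_as_text(char.value)
        for char in characteristics
    }


# Hypothetical stand-ins so the sketch runs without isatools.
demo_characteristics = [
    SimpleNamespace(category=SimpleNamespace(term='organism'),
                    value=SimpleNamespace(term='marine metagenome')),
    SimpleNamespace(category=SimpleNamespace(term='geographic location (longitude)'),
                    value='5.222222'),
]

flat = characteristics_as_dict(demo_characteristics)
print(flat)
```

With the data loaded above, `characteristics_as_dict(first_source_characteristics)` should give one entry per category. If two characteristics shared the same category term, the later one would overwrite the earlier one in this sketch.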

## 3. Invoking the python ISA-Tab Validator

In [13]:
with open(os.path.join('./BII-I-1/', 'i_investigation.txt')) as fp:
    my_json_report_bii_i_1 = isatab.validate(fp)

2021-07-13 21:29:51,860 [INFO]: isatab.py(validate:4212) >> Loading... ./BII-I-1/i_investigation.txt
2021-07-13 21:29:52,110 [INFO]: isatab.py(validate:4214) >> Running prechecks...
2021-07-13 21:29:52,340 [INFO]: isatab.py(validate:4235) >> Finished prechecks...
2021-07-13 21:29:52,340 [INFO]: isatab.py(validate:4236) >> Loading configurations found in /Users/philippe/.pyenv/versions/3.7.4/envs/isapi-testson374/src/isatools/isatools/resources/config/xml
2021-07-13 21:29:52,369 [INFO]: isatab.py(validate:4241) >> Using configurations found in /Users/philippe/.pyenv/versions/3.7.4/envs/isapi-testson374/src/isatools/isatools/resources/config/xml
2021-07-13 21:29:52,370 [INFO]: isatab.py(validate:4243) >> Checking investigation file against configuration...
2021-07-13 21:29:52,375 [INFO]: isatab.py(validate:4246) >> Finished checking investigation file
2021-07-13 21:29:52,376 [INFO]: isatab.py(validate:4265) >> Loading... s_BII-S-1.txt
2021-07-13 21:29:52,384 [INFO]: isatab.py(validate:42

2021-07-13 21:29:52,606 [INFO]: isatab.py(validate:4356) >> Loading... a_metabolome.txt
2021-07-13 21:29:52,613 [INFO]: isatab.py(validate:4368) >> Validating a_metabolome.txt against assay table configuration (metabolite profiling, mass spectrometry)...
2021-07-13 21:29:52,614 [INFO]: isatab.py(validate:4370) >> Checking Factor Value presence...
2021-07-13 21:29:52,620 [INFO]: isatab.py(validate:4373) >> Checking required fields...
2021-07-13 21:29:52,621 [INFO]: isatab.py(validate:4376) >> Checking generic fields...
2021-07-13 21:29:52,705 [INFO]: isatab.py(validate:4387) >> Checking unit fields...
2021-07-13 21:29:52,705 [INFO]: isatab.py(validate:4397) >> Checking protocol fields...
2021-07-13 21:29:52,708 [INFO]: isatab.py(validate:4409) >> Checking ontology fields...
2021-07-13 21:29:52,709 [INFO]: isatab.py(validate:4422) >> Checking study group size...
2021-07-13 21:29:52,712 [INFO]: isatab.py(validate:4428) >> Finished validation on a_metabolome.txt
2021-07-13 21:29:52,713 [IN

2021-07-13 21:29:57,474 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  unknown protocol
2021-07-13 21:29:57,477 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  unknown protocol
2021-07-13 21:29:57,480 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  unknown protocol
2021-07-13 21:29:57,482 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  unknown protocol
2021-07-13 21:29:57,485 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  unknown protocol
2021-07-13 21:29:57,488 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  unknown protocol
2021-07-13 21:29:57,488 [INFO]: utils.py(detect_isatab_process_pooling:97) >> Checking a_metabolome.txt
2021-07-13 21:29:57,767 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pool

2021-07-13 21:29:57,822 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  unknown protocol
2021-07-13 21:29:57,823 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  unknown protocol
2021-07-13 21:29:57,824 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  unknown protocol
2021-07-13 21:29:57,825 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  unknown protocol
2021-07-13 21:29:57,827 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  unknown protocol
2021-07-13 21:29:57,828 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  unknown protocol
2021-07-13 21:29:57,829 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  unknown protocol
2021-07-13 21:29:57,831 [INFO]: utils.py(detect_graph_process_pooling

2021-07-13 21:30:00,943 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  biotin labeling
2021-07-13 21:30:00,944 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  biotin labeling
2021-07-13 21:30:00,945 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  biotin labeling
2021-07-13 21:30:00,946 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  biotin labeling
2021-07-13 21:30:00,947 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  biotin labeling
2021-07-13 21:30:00,948 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  biotin labeling
2021-07-13 21:30:00,949 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  biotin labeling
2021-07-13 21:30:00,950 [INFO]: utils.py(detect_graph_process_pooling:70) >>

2021-07-13 21:30:01,008 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  EukGE-WS4
2021-07-13 21:30:01,009 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  EukGE-WS4
2021-07-13 21:30:01,010 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  EukGE-WS4
2021-07-13 21:30:01,011 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  EukGE-WS4
2021-07-13 21:30:01,012 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  EukGE-WS4
2021-07-13 21:30:01,013 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  EukGE-WS4
2021-07-13 21:30:01,014 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  EukGE-WS4
2021-07-13 21:30:01,015 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  Eu

In [14]:
with open(os.path.join('./BII-S-3/', 'i_gilbert.txt')) as fp:
    my_json_report_bii_s_3 = isatab.validate(fp)

2021-07-13 21:30:01,665 [INFO]: isatab.py(validate:4212) >> Loading... ./BII-S-3/i_gilbert.txt
2021-07-13 21:30:01,888 [INFO]: isatab.py(validate:4214) >> Running prechecks...
2021-07-13 21:30:02,207 [INFO]: isatab.py(validate:4235) >> Finished prechecks...
2021-07-13 21:30:02,208 [INFO]: isatab.py(validate:4236) >> Loading configurations found in /Users/philippe/.pyenv/versions/3.7.4/envs/isapi-testson374/src/isatools/isatools/resources/config/xml
2021-07-13 21:30:02,231 [INFO]: isatab.py(validate:4241) >> Using configurations found in /Users/philippe/.pyenv/versions/3.7.4/envs/isapi-testson374/src/isatools/isatools/resources/config/xml
2021-07-13 21:30:02,232 [INFO]: isatab.py(validate:4243) >> Checking investigation file against configuration...
2021-07-13 21:30:02,241 [INFO]: isatab.py(validate:4246) >> Finished checking investigation file
2021-07-13 21:30:02,242 [INFO]: isatab.py(validate:4265) >> Loading... s_BII-S-3.txt
2021-07-13 21:30:02,271 [INFO]: isatab.py(validate:4272) >>

2021-07-13 21:30:03,804 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  library construction
2021-07-13 21:30:03,811 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  library construction
2021-07-13 21:30:03,817 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  library construction
2021-07-13 21:30:03,836 [INFO]: isatab.py(validate:4450) >> Finished validation...


In [15]:
with open(os.path.join('./BII-S-4/', 'i_investigation.txt')) as fp:
    my_json_report_bii_s_4 = isatab.validate(fp)

2021-07-13 21:30:03,845 [INFO]: isatab.py(validate:4212) >> Loading... ./BII-S-4/i_investigation.txt
2021-07-13 21:30:03,987 [INFO]: isatab.py(validate:4214) >> Running prechecks...
2021-07-13 21:30:04,082 [INFO]: isatab.py(validate:4235) >> Finished prechecks...
2021-07-13 21:30:04,082 [INFO]: isatab.py(validate:4236) >> Loading configurations found in /Users/philippe/.pyenv/versions/3.7.4/envs/isapi-testson374/src/isatools/isatools/resources/config/xml
2021-07-13 21:30:04,102 [INFO]: isatab.py(validate:4241) >> Using configurations found in /Users/philippe/.pyenv/versions/3.7.4/envs/isapi-testson374/src/isatools/isatools/resources/config/xml
2021-07-13 21:30:04,103 [INFO]: isatab.py(validate:4243) >> Checking investigation file against configuration...
2021-07-13 21:30:04,107 [INFO]: isatab.py(validate:4246) >> Finished checking investigation file
2021-07-13 21:30:04,108 [INFO]: isatab.py(validate:4265) >> Loading... s_BII-S-4.txt
2021-07-13 21:30:04,117 [INFO]: isatab.py(validate:42

In [16]:
with open(os.path.join('./BII-S-7/', 'i_matteo.txt')) as fp:
    my_json_report_bii_s_7 = isatab.validate(fp)

2021-07-13 21:30:04,428 [INFO]: isatab.py(validate:4212) >> Loading... ./BII-S-7/i_matteo.txt
2021-07-13 21:30:04,569 [INFO]: isatab.py(validate:4214) >> Running prechecks...
2021-07-13 21:30:04,691 [INFO]: isatab.py(validate:4235) >> Finished prechecks...
2021-07-13 21:30:04,692 [INFO]: isatab.py(validate:4236) >> Loading configurations found in /Users/philippe/.pyenv/versions/3.7.4/envs/isapi-testson374/src/isatools/isatools/resources/config/xml
2021-07-13 21:30:04,713 [INFO]: isatab.py(validate:4241) >> Using configurations found in /Users/philippe/.pyenv/versions/3.7.4/envs/isapi-testson374/src/isatools/isatools/resources/config/xml
2021-07-13 21:30:04,714 [INFO]: isatab.py(validate:4243) >> Checking investigation file against configuration...
2021-07-13 21:30:04,725 [INFO]: isatab.py(validate:4246) >> Finished checking investigation file
2021-07-13 21:30:04,726 [INFO]: isatab.py(validate:4265) >> Loading... s_BII-S-7.txt
2021-07-13 21:30:04,740 [INFO]: isatab.py(validate:4272) >> 

2021-07-13 21:30:06,482 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  PCR amplification
2021-07-13 21:30:06,484 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  PCR amplification
2021-07-13 21:30:06,485 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  PCR amplification
2021-07-13 21:30:06,486 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  PCR amplification
2021-07-13 21:30:06,488 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  PCR amplification
2021-07-13 21:30:06,490 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  PCR amplification
2021-07-13 21:30:06,492 [INFO]: utils.py(detect_graph_process_pooling:70) >> Possible process pooling detected on:  PCR amplification
2021-07-13 21:30:06,494 [INFO]: utils.py(detect_graph_process_
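
The cells above repeat the same call for four datasets. One possible way to factor this out is a small loop that opens each investigation file in a `with` block and stores the report under its path. The helper `validate_many` is a hypothetical name, and `fake_validate` is a stub standing in for `isatab.validate` so that the sketch runs without isatools or the test datasets.

```python
import os
import tempfile


def validate_many(investigation_paths, validate_fn):
    """Run validate_fn on each investigation file and collect reports by path.

    Each file is opened in a `with` block so that its handle is closed
    once validation has finished.
    """
    reports = {}
    for path in investigation_paths:
        with open(path) as fp:
            reports[path] = validate_fn(fp)
    return reports


def fake_validate(fp):
    """Stub validator: reads the file and returns an empty report."""
    fp.read()
    return {'errors': [], 'warnings': []}


# Write a throwaway file so the loop has something to open.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'i_investigation.txt')
    with open(demo_path, 'w') as out:
        out.write('ONTOLOGY SOURCE REFERENCE\n')
    demo_reports = validate_many([demo_path], fake_validate)

print({os.path.basename(p): len(r['errors']) for p, r in demo_reports.items()})
```

In this notebook, the same helper could be called with the paths used above and `isatab.validate` as `validate_fn`.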

In [17]:
my_json_report_bii_s_7

{'errors': [],
   'supplemental': 'A property value in Investigation Title of investigation file at column 1 is required',
   'code': 4003},
  {'message': 'A required property is missing',
   'supplemental': 'A property value in Investigation Description of investigation file at column 1 is required',
   'code': 4003},
  {'message': 'A required property is missing',
   'supplemental': 'A property value in Study Publication DOI of investigation file at column 1 is required',
   'code': 4003},
  {'message': 'A required property is missing',
   'supplemental': 'A property value in Study Person Mid Initials of investigation file at column 1 is required',
   'code': 4003},
  {'message': 'A required property is missing',
   'supplemental': 'A property value in Study Person Mid Initials of investigation file at column 2 is required',
   'code': 4003},
  {'message': 'A required property is missing',
   'supplemental': 'A property value in Study Person Mid Initials of investigation file at colu

- This `Validation Report` shows that no error has been logged.
- The rest of the report consists of warnings meant to draw the curator's attention to elements that could be provided but whose absence does not break the ISA syntax.
- Notice the `study group` information reported for both study and assay files. If ISA `Factor Value[]` fields are present in the `ISA Study` or `ISA Assay` tables, the validator tries to identify the set of unique `Factor Value` combinations, each of which defines a `Study Group`.
    - When no `Factor Value` field is found in an ISA `Study` or `Assay` table, the study group size keeps its default value of -1, which means that no `Study Group` has been found.
    - ISA **strongly** encourages declaring Study Groups with ISA Factor Values to unambiguously identify the independent variables of an experiment.
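
To make the notion of a study group concrete, here is a simplified illustration of counting unique `Factor Value` combinations in a table. It is not the validator's own code: the function names, the row dictionaries and the factor names are all made up for the example, and only the `Factor Value[...]` column naming and the -1 default come from the text above.

```python
def find_study_groups(rows, factor_columns):
    """Return the set of distinct factor value combinations found in rows."""
    return {
        tuple(row.get(column, '') for column in factor_columns)
        for row in rows
    }


def study_group_size(rows):
    """Count study groups, or return -1 when no Factor Value column exists."""
    factor_columns = sorted(
        {key for row in rows for key in row if key.startswith('Factor Value[')}
    )
    if not factor_columns:
        return -1
    return len(find_study_groups(rows, factor_columns))


# Hypothetical table rows; column names and values are invented.
demo_rows = [
    {'Sample Name': 's1', 'Factor Value[dose]': 'low', 'Factor Value[time]': '1h'},
    {'Sample Name': 's2', 'Factor Value[dose]': 'low', 'Factor Value[time]': '1h'},
    {'Sample Name': 's3', 'Factor Value[dose]': 'high', 'Factor Value[time]': '1h'},
]

print(study_group_size(demo_rows))                  # two distinct combinations
print(study_group_size([{'Sample Name': 's1'}]))    # no Factor Value column
```

Samples `s1` and `s2` share the same combination and therefore fall into one group, while `s3` forms a second group.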
    

## 4. What does a validation failure look like?

### BII-S-5 contains an error located in the `i_investigation.txt` file of the submission

In [18]:
with open(os.path.join('./BII-S-5/', 'i_investigation.txt')) as fp:
    my_json_report_bii_s_5 = isatab.validate(fp)

2021-07-13 21:30:06,547 [INFO]: isatab.py(validate:4212) >> Loading... ./BII-S-5/i_investigation.txt
2021-07-13 21:30:06,721 [INFO]: isatab.py(validate:4214) >> Running prechecks...
2021-07-13 21:30:06,875 [INFO]: isatab.py(validate:4235) >> Finished prechecks...
2021-07-13 21:30:06,876 [INFO]: isatab.py(validate:4236) >> Loading configurations found in /Users/philippe/.pyenv/versions/3.7.4/envs/isapi-testson374/src/isatools/isatools/resources/config/xml
2021-07-13 21:30:06,914 [INFO]: isatab.py(validate:4241) >> Using configurations found in /Users/philippe/.pyenv/versions/3.7.4/envs/isapi-testson374/src/isatools/isatools/resources/config/xml
2021-07-13 21:30:06,916 [INFO]: isatab.py(validate:4243) >> Checking investigation file against configuration...
2021-07-13 21:30:06,924 [INFO]: isatab.py(validate:4246) >> Finished checking investigation file
2021-07-13 21:30:06,931 [INFO]: isatab.py(validate:4265) >> Loading... s_001456_GCAT_sample.txt
2021-07-13 21:30:06,961 [INFO]: isatab.py(

In [19]:
my_json_report_bii_s_5["errors"]

[]
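
Looking at `errors` alone can hide most of a report. A short summary that counts errors and groups warnings by `code` may help with triage. The sketch below assumes that warnings are stored under a `'warnings'` key and that each entry carries the `message`, `supplemental` and `code` fields seen in the report displayed earlier. `summarise_report` is a hypothetical helper and `demo_report` is hand-written for illustration.

```python
from collections import Counter


def summarise_report(report):
    """Return the number of errors and a per-code tally of warnings."""
    warnings = report.get('warnings', [])
    return {
        'n_errors': len(report.get('errors', [])),
        'n_warnings': len(warnings),
        'warnings_by_code': Counter(w['code'] for w in warnings),
    }


# Hand-written report; the 'warnings' key is an assumption about
# where these entries live in a real validation report.
demo_report = {
    'errors': [],
    'warnings': [
        {'message': 'A required property is missing',
         'supplemental': 'A property value in Investigation Description of '
                         'investigation file at column 1 is required',
         'code': 4003},
        {'message': 'A required property is missing',
         'supplemental': 'A property value in Study Publication DOI of '
                         'investigation file at column 1 is required',
         'code': 4003},
    ],
}

summary = summarise_report(demo_report)
print(summary['n_errors'], summary['n_warnings'], dict(summary['warnings_by_code']))
```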

- The validator is expected to report a non-empty `errors` array showing the root cause of the syntactic validation error. Note that the run captured above returned an empty list.
- There is a typo in the Investigation file that affects two positions in the file, for both the Investigation and Study objects:
<span style="color:red">Publication **l**ist</span> vs <span style="color:green">Publication **L**ist</span>
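
A typo of this kind differs from the expected label only by letter case, so it can be spotted with a case-insensitive lookup followed by an exact comparison. The sketch below is one possible check, not the validator's implementation. The helper `find_case_mismatches` is hypothetical, and the full row labels are assumed for illustration, since the text above only shows the `Publication list` fragment.

```python
def find_case_mismatches(found_labels, expected_labels):
    """Pair each label that differs from an expected label only by letter case."""
    expected_by_lower = {label.lower(): label for label in expected_labels}
    mismatches = []
    for label in found_labels:
        expected = expected_by_lower.get(label.lower())
        if expected is not None and expected != label:
            mismatches.append((label, expected))
    return mismatches


# Assumed, illustrative subset of investigation file row labels.
expected = [
    'Investigation Publication Author List',
    'Study Publication Author List',
]
# Labels as they might appear in a file carrying the typo.
found = [
    'Investigation Publication Author list',
    'Study Publication Author list',
]

mismatches = find_case_mismatches(found, expected)
for wrong, right in mismatches:
    print(f'{wrong!r} should be {right!r}')
```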

## About this notebook

- authors: philippe.rocca-serra@oerc.ox.ac.uk, massimiliano.izzo@oerc.ox.ac.uk
- license: CC-BY 4.0
- support: isatools@googlegroups.com
- issue tracker: https://github.com/ISA-tools/isa-api/issues