[![Colab Badge Link](https://img.shields.io/badge/open-in%20colab-blue)](https://colab.research.google.com/github/Glasgow-AI4BioMed/tutorials/blob/main/loading_bioc_xml_files_and_archives.ipynb)

# Loading BioC XML files and working with archives of them

This Colab walks through using the [bioc Python package](https://github.com/bionlplab/bioc) to load a BioC XML file with entity annotations. It should be helpful for working with the [large BioC XML files provided by PubTator](https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/PubTatorCentral_BioCXML/).


### Getting some data and installing the bioc package

I've pre-downloaded a couple of BioC XML files from [PubTator Central](https://www.ncbi.nlm.nih.gov/research/pubtator/api.html) and put them into a gzipped tar archive. It's on [OneDrive](https://gla-my.sharepoint.com/:u:/g/personal/jake_lever_glasgow_ac_uk/EZJFs8-q_PZOvanwCq1PFxQBm_nHdxtrM9AlxNkb_hyW8Q?e=SKG2aU). Let's download it with the command below.

In [1]:
!wget -O example_bioc_files.tar.gz "https://gla-my.sharepoint.com/:u:/g/personal/jake_lever_glasgow_ac_uk/EZJFs8-q_PZOvanwCq1PFxQBm_nHdxtrM9AlxNkb_hyW8Q?download=1"

--2025-07-12 19:58:15--  https://gla-my.sharepoint.com/:u:/g/personal/jake_lever_glasgow_ac_uk/EZJFs8-q_PZOvanwCq1PFxQBm_nHdxtrM9AlxNkb_hyW8Q?download=1
Resolving gla-my.sharepoint.com (gla-my.sharepoint.com)... 52.107.249.1, 52.107.249.62, 52.107.249.63, ...
Connecting to gla-my.sharepoint.com (gla-my.sharepoint.com)|52.107.249.1|:443... connected.
HTTP request sent, awaiting response... 302 
Location: /personal/jake_lever_glasgow_ac_uk/Documents/Data%20For%20Student%20Projects/example_bioc_files.tar.gz?ga=1 [following]
--2025-07-12 19:58:16--  https://gla-my.sharepoint.com/personal/jake_lever_glasgow_ac_uk/Documents/Data%20For%20Student%20Projects/example_bioc_files.tar.gz?ga=1
Reusing existing connection to gla-my.sharepoint.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 2658 (2.6K) [application/x-gzip]
Saving to: ‘example_bioc_files.tar.gz’


2025-07-12 19:58:16 (21.0 MB/s) - ‘example_bioc_files.tar.gz’ saved [2658/2658]



You could decompress the archive with the command below. **However, this is a bad idea for very large archives, and we can work with the compressed archive directly (see the final part of this Colab).**

In [2]:
!tar xvf example_bioc_files.tar.gz

selected_files/
selected_files/17299597.bioc.xml
selected_files/36066408.bioc.xml


We're going to use the [bioc](https://github.com/bionlplab/bioc) Python package to load the file. So let's install that:

In [3]:
!pip install bioc

Collecting bioc
  Downloading bioc-2.1-py3-none-any.whl.metadata (4.6 kB)
Collecting jsonlines>=1.2.0 (from bioc)
  Downloading jsonlines-4.0.0-py3-none-any.whl.metadata (1.6 kB)
Collecting intervaltree (from bioc)
  Downloading intervaltree-3.1.0.tar.gz (32 kB)
  Preparing metadata (setup.py) ... done
Collecting docopt (from bioc)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Downloading bioc-2.1-py3-none-any.whl (33 kB)
Downloading jsonlines-4.0.0-py3-none-any.whl (8.7 kB)
Building wheels for collected packages: docopt, intervaltree
  Building wheel for docopt (setup.py) ... done
  Created wheel for docopt: filename=docopt-0.6.2-py2.py3-none-any.whl size=13706 sha256=1165832aab06fd38c6cf166d41d98d130fdd9597efe7ae5a13300fbb80cccf10
  Stored in directory: /root/.cache/pip/wheels/1a/b0/8c/4b75c4116c31f83c8f9f047231251e13cc74481cca4a78a9ce
  Building wheel for intervaltree (setup.py) ... done
  Created w

### Loading a single BioC XML file

The code below loads a single BioC XML file, which may contain multiple documents, along with their annotations.

In [4]:
import bioc
with open('selected_files/36066408.bioc.xml', 'r') as fp:
  collection = bioc.biocxml.load(fp)
len(collection.documents)

1

You can then iterate through the documents, the passages of text within each document, and the entity annotations within those passages. The code below illustrates how to access each of these parts.

In [5]:
for document in collection.documents:
  print(f"{document.id=}")
  for passage in document.passages:
    print(f"{passage.offset=}")
    print(f"{passage.text=}")
    for anno in passage.annotations:
      # Use the first location; an annotation may have several
      start = anno.locations[0].offset
      end = start + anno.locations[0].length
      anno_type = anno.infons['type']
      concept_id = anno.infons['identifier']

      print(f"{anno.text=} {start=} {end=} {anno_type=} {concept_id=}")
    print()

document.id='36066408'
passage.offset=0
passage.text='Inhibition of EGFR overcomes acquired lenvatinib resistance driven by STAT3-ABCB1 signaling in hepatocellular carcinoma.'
anno.text='EGFR' start=14 end=18 anno_type='Gene' concept_id='1956'
anno.text='lenvatinib' start=38 end=48 anno_type='Chemical' concept_id='MESH:C531958'
anno.text='STAT3' start=70 end=75 anno_type='Gene' concept_id='6774'
anno.text='ABCB1' start=76 end=81 anno_type='Gene' concept_id='5243'
anno.text='hepatocellular carcinoma' start=95 end=119 anno_type='Disease' concept_id='MESH:D006528'

passage.offset=121
passage.text='Lenvatinib is an inhibitor of multiple receptor tyrosine kinases that was recently authorized for first-line treatment of hepatocellular carcinoma (HCC). However, the clinical benefits derived from lenvatinib are limited, highlighting the urgent need to understand mechanisms of resistance. We report here that HCC cells develop resistance to lenvatinib by activating epidermal growth factor recept
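
A note on the offsets printed above: in the BioC format, annotation offsets are generally counted from the start of the document, not from the start of the passage. Assuming the PubTator files follow that convention, you need to subtract `passage.offset` before slicing `passage.text`. The sketch below illustrates the arithmetic with a hypothetical `Span` stand-in for an annotation location, so it runs without the `bioc` package. The title span is taken from the output above, while the second span is an illustrative value that is not taken from the file.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for one annotation location (document-level offset)."""
    offset: int
    length: int


def slice_annotation(passage_text: str, passage_offset: int, span: Span) -> str:
    """Return the part of the passage text covered by a document-level span."""
    start = span.offset - passage_offset  # convert to a passage-relative index
    return passage_text[start:start + span.length]


title = "Inhibition of EGFR overcomes acquired lenvatinib resistance"
abstract = "Lenvatinib is an inhibitor of multiple receptor tyrosine kinases"

# The title passage starts at offset 0, so no shift is needed.
print(slice_annotation(title, 0, Span(offset=14, length=4)))

# Illustrative span (assumed, not read from the file) in the passage at offset 121.
print(slice_annotation(abstract, 121, Span(offset=121, length=10)))
```

With real `bioc` objects, the same subtraction would use `anno.locations[0].offset - passage.offset` as the start index.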

### Dealing with an archive of BioC XML files

It is often better not to extract all the BioC XML files before processing them, and instead to work with the archive directly. You can do that using Python's built-in `tarfile` module. For instance, to list the files inside a tar.gz file:

In [6]:
import tarfile

source = "example_bioc_files.tar.gz"

with tarfile.open(source) as archive:
  for member in archive:
    print(f"{member.name=} {member.isfile()=}")

member.name='selected_files' member.isfile()=False
member.name='selected_files/17299597.bioc.xml' member.isfile()=True
member.name='selected_files/36066408.bioc.xml' member.isfile()=True
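
For very large archives, one option worth knowing about is `tarfile`'s stream mode (`"r|gz"`), which reads the archive strictly front to back without seeking. The sketch below is a minimal illustration rather than a recommendation for every case: it builds a tiny in-memory archive with made-up file names and contents (`demo/a.bioc.xml` and `demo/b.bioc.xml` are hypothetical) and then lists its members in stream mode.

```python
import io
import tarfile


def build_demo_archive() -> bytes:
    """Build a tiny in-memory .tar.gz with hypothetical file names and contents."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, payload in [("demo/a.bioc.xml", b"<collection/>"),
                              ("demo/b.bioc.xml", b"<collection/>")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def list_members_streaming(fileobj) -> list:
    """Return (name, size) pairs, reading the archive once from start to end."""
    found = []
    # "r|gz" is stream mode: no seeking, so each member is visited once, in order.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:
            if member.isfile():
                found.append((member.name, member.size))
    return found


members = list_members_streaming(io.BytesIO(build_demo_archive()))
print(members)
```

In stream mode each member has to be read while it is the current one in the loop, because the archive cannot jump back to an earlier member.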


Then we can combine the `bioc` and `tarfile` code to iterate through all the files in the archive and extract the document, passage and annotation information.

In [7]:
with tarfile.open(source) as archive:
  for member in archive:
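    # Skip directories and anything that is not an XML file
    if member.isfile() and member.name.lower().endswith('.xml'):
      print(f"{member.name=}")

      # Read the member straight from the archive, without extracting to disk
      file_handle = archive.extractfile(member)
      data = file_handle.read().decode('utf-8')

      # loads() parses a string, unlike load(), which takes a file handle
      collection = bioc.biocxml.loads(data)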

      for document in collection.documents:
        print(f"{document.id=}")
        for passage in document.passages:
          print(f"{passage.text=}")
          for anno in passage.annotations:
            print(f"{anno=}")

member.name='selected_files/17299597.bioc.xml'
document.id='17299597'
passage.text='Quantifying organismal complexity using a population genetic approach.'
passage.text="BACKGROUND: Various definitions of biological complexity have been proposed: the number of genes, cell types, or metabolic processes within an organism. As knowledge of biological systems has increased, it has become apparent that these metrics are often incongruent. METHODOLOGY: Here we propose an alternative complexity metric based on the number of genetically uncorrelated phenotypic traits contributing to an organism's fitness. This metric, phenotypic complexity, is more objective than previous suggestions, as complexity is measured from a fundamental biological perspective, that of natural selection. We utilize a model linking the equilibrium fitness (drift load) of a population to phenotypic complexity. We then use results from viral evolution experiments to compare the phenotypic complexities of two viruses, the 
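
If you reuse this pattern in several places, it may help to wrap the archive handling in a generator so that the parsing code stays separate. The sketch below is one possible way to do that, using only the standard library. The function name `iter_xml_texts` and the demo file names are hypothetical, and the demo archive is created in a temporary directory so the example runs on its own.

```python
import os
import tarfile
import tempfile


def iter_xml_texts(archive_path):
    """Yield (member_name, decoded_text) for each .xml file in a tar archive."""
    with tarfile.open(archive_path) as archive:
        for member in archive:
            if not (member.isfile() and member.name.lower().endswith(".xml")):
                continue  # skip directories and non-XML files
            handle = archive.extractfile(member)
            if handle is None:
                continue  # nothing readable for this member
            with handle:
                yield member.name, handle.read().decode("utf-8")


# Demo with a throwaway archive (file names and contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    xml_path = os.path.join(tmp, "demo.bioc.xml")
    notes_path = os.path.join(tmp, "README.txt")
    with open(xml_path, "w", encoding="utf-8") as fp:
        fp.write("<collection></collection>")
    with open(notes_path, "w", encoding="utf-8") as fp:
        fp.write("not XML")

    archive_path = os.path.join(tmp, "demo.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(xml_path, arcname="demo/demo.bioc.xml")
        archive.add(notes_path, arcname="demo/README.txt")

    results = list(iter_xml_texts(archive_path))

print(results)
```

Each yielded string could then be passed to `bioc.biocxml.loads`, as in the cell above, so only one file's contents needs to be held in memory at a time.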