[![Colab Badge Link](https://img.shields.io/badge/open-in%20colab-blue)](https://colab.research.google.com/github/Glasgow-AI4BioMed/tutorials/blob/main/loading_bioc_xml_files_and_archives.ipynb)

# Loading BioC XML files and working with archives of them

This Colab walks through using the [bioc Python package](https://github.com/bionlplab/bioc) to load a BioC XML file with entity annotations. It should be helpful for working with the [large BioC XML files provided by PubTator](https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/PubTatorCentral_BioCXML/).


### Getting some data and installing the bioc package

I've predownloaded a couple of BioC XML files from [PubTator Central](https://www.ncbi.nlm.nih.gov/research/pubtator/api.html) and put them into a gzipped tar archive. It's on [OneDrive](https://gla-my.sharepoint.com/:u:/g/personal/jake_lever_glasgow_ac_uk/EZJFs8-q_PZOvanwCq1PFxQBm_nHdxtrM9AlxNkb_hyW8Q?e=SKG2aU). Let's download it with the command below.

In [None]:
!wget -O example_bioc_files.tar.gz "https://gla-my.sharepoint.com/:u:/g/personal/jake_lever_glasgow_ac_uk/EZJFs8-q_PZOvanwCq1PFxQBm_nHdxtrM9AlxNkb_hyW8Q?download=1"

You can decompress the archive with the command below. It is fine for this small example, and the single-file section below reads from the extracted files. **However, this is a bad idea for very large files, and we can work with the compressed archive directly - see the final part of this Colab**.

In [None]:
!tar xvf example_bioc_files.tar.gz

We're going to use the [bioc](https://github.com/bionlplab/bioc) Python package to load the file. So let's install that:

In [None]:
!pip install bioc

### Loading a single BioC XML file

To load a single file, use the code below. A BioC XML file may contain multiple documents, each with its own annotations, and they are all loaded into one collection.

In [None]:
import bioc
# Parse the whole file into a collection of documents
with open('selected_files/36066408.bioc.xml', 'r') as fp:
  collection = bioc.biocxml.load(fp)
# Number of documents found in the file
len(collection.documents)

You can then iterate through the documents, the passages in each document that contain text, and the entity annotations within those passages. The code below illustrates how to access those different parts.

In [None]:
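for document in collection.documents:
  print(f"{document.id=}")
  for passage in document.passages:
    # passage.offset is where this passage starts within the document
    print(f"{passage.offset=}")
    print(f"{passage.text=}")
    for anno in passage.annotations:
      # Use the first location; an annotation may list more than one
      start = anno.locations[0].offset
      end = start + anno.locations[0].length
      anno_type = anno.infons['type']
      # .get avoids a KeyError if an annotation has no 'identifier' infon
      concept_id = anno.infons.get('identifier')

      print(f"{anno.text=} {start=} {end=} {anno_type=} {concept_id=}")
    print()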
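
The `start` and `end` values printed above are usually counted from the start of the document rather than the start of the passage (that is the BioC convention, and this sketch assumes the file follows it). To slice the annotated text back out of `passage.text`, you would subtract `passage.offset` first. The helper name `annotation_span` and the example sentence are made up for illustration and are not part of the `bioc` package.

```python
def annotation_span(passage_offset, passage_text, anno_offset, anno_length):
    """Return the slice of passage_text covered by an annotation.

    Assumes anno_offset is counted from the start of the document,
    so it is shifted by passage_offset before slicing.
    """
    start = anno_offset - passage_offset
    end = start + anno_length
    if start < 0 or end > len(passage_text):
        raise ValueError("annotation does not lie inside this passage")
    return passage_text[start:end]


# Made-up passage that starts 100 characters into its document.
passage_offset = 100
passage_text = "BRCA1 mutations increase breast cancer risk."
# "breast cancer" starts at index 25 within the passage, i.e. document offset 125.
span = annotation_span(passage_offset, passage_text, anno_offset=125, anno_length=13)
print(span)
```

With the real objects, `annotation_span(passage.offset, passage.text, anno.locations[0].offset, anno.locations[0].length)` should normally match `anno.text`; if it does not, the file may be counting offsets differently (for example in bytes), which is worth checking before relying on them.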

### Dealing with an archive of BioC XML files

It is often better not to extract all the BioC XML files before processing them, and instead to work with the archive directly. You can do that using the standard-library `tarfile` module. For instance, to list the files inside a tar.gz file:

In [None]:
import tarfile

source = "example_bioc_files.tar.gz"

with tarfile.open(source) as archive:
  for member in archive:
    print(f"{member.name=} {member.isfile()=}")

Then, we can put the `bioc` and `tarfile` code together to iterate through all the files in the archive and pull out the document, passage and annotation information.

In [None]:
with tarfile.open(source) as archive:
  for member in archive:
    # Skip directories and anything that is not an XML file
    if member.isfile() and member.name.lower().endswith('.xml'):
      print(f"{member.name=}")

      # Read the member straight from the archive, without writing it to disk
      file_handle = archive.extractfile(member)
      data = file_handle.read().decode('utf-8')

      # loads parses BioC XML from a string rather than from a file handle
      collection = bioc.biocxml.loads(data)

      for document in collection.documents:
        print(f"{document.id=}")
        for passage in document.passages:
          print(f"{passage.text=}")
          for anno in passage.annotations:
            print(f"{anno=}")
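
If you process archives in several places, it can help to separate the archive walking from the BioC parsing. Below is a minimal sketch of that idea using only the standard library: a generator (the name `iter_xml_members` is made up here) that yields the name and decoded text of each XML member. So that it runs on its own, it builds a tiny throwaway archive with placeholder contents rather than real BioC data.

```python
import io
import os
import tarfile
import tempfile


def iter_xml_members(archive_path):
    """Yield (member_name, decoded_text) for each .xml file in a tar archive."""
    with tarfile.open(archive_path) as archive:
        for member in archive:
            if member.isfile() and member.name.lower().endswith(".xml"):
                file_handle = archive.extractfile(member)
                yield member.name, file_handle.read().decode("utf-8")


# Build a small demo archive with one XML file and one file to be skipped.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.tar.gz")
with tarfile.open(demo_path, "w:gz") as archive:
    for name, text in [("a.bioc.xml", "<collection/>"), ("notes.txt", "skip me")]:
        payload = text.encode("utf-8")
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))

found = list(iter_xml_members(demo_path))
print(found)
```

In the notebook you would then write something like `for name, data in iter_xml_members(source): collection = bioc.biocxml.loads(data)`, which keeps only one file's text in memory at a time.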