[![Colab Badge Link](https://img.shields.io/badge/open-in%20colab-blue)](https://colab.research.google.com/github/Glasgow-AI4BioMed/tutorials/blob/main/loading_pubmed_xml_files.ipynb)

# Loading PubMed XML files

This Colab shows example code for loading a PubMed XML file. These are the large files available through the [bulk download of PubMed](https://pubmed.ncbi.nlm.nih.gov/download/). They contain the titles, abstracts and metadata of articles indexed in [PubMed](https://pubmed.ncbi.nlm.nih.gov/).

We could get one of the files from the [baseline folder](https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/). Instead, we'll download a sample file (which will hopefully stay available for a while):

In [None]:
!wget https://ftp.ncbi.nlm.nih.gov/pubmed/baseline-2024-sample/sample-0001.xml.gz

Let's look inside this file. Note that it is gzipped, so we use the `zcat` program to look at the first few lines:

In [None]:
!zcat sample-0001.xml.gz | head -n 50

You can see that each article is wrapped in a `<PubmedArticle>` tag with lots of metadata. We'll want to get every `<PubmedArticle>` block and extract the fields we want.

In [None]:
filename = 'sample-0001.xml.gz'

## Basic loading code

Let's look at some basic code to load this file with `gzip` and `ElementTree` for XML files. Loading with the `gzip` library means we don't need to decompress the file first in order to process it.

In [None]:
import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9
import gzip

This code opens the gzipped file and then works through it, one abstract at a time. The files are large, so it is generally not a good idea to load all the abstracts into memory in one go.

This code extracts the PubMed identifier (pmid) and the title.

In [None]:
with gzip.open(filename,'rt',encoding='utf8') as f:
  for event, elem in etree.iterparse(f, events=("start", "end", "start-ns", "end-ns")):

    # Iterate through the XML file until a <PubmedArticle> tag is closed (which we then process below)
    if event == "end" and elem.tag == "PubmedArticle":

      pmid = int(elem.find("./MedlineCitation/PMID").text)
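      # Join all text nodes so that titles containing inline tags (e.g. <i>) are not cut short
      title = "".join(elem.find("./MedlineCitation/Article/ArticleTitle").itertext())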
      print(pmid, title)

      elem.clear() # Important for clearing memory as the file is iteratively loaded
      break # This break is just for the demonstration so that only the first abstract is loaded
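
The loop above can also be wrapped in a generator so that other code can consume the articles one at a time. The following is a minimal sketch under some assumptions rather than part of the original notebook: the function name `iter_pubmed_articles` is hypothetical, the file is opened in binary mode, and only `end` events are requested because those are the only ones the loop acts on.

```python
import gzip
import xml.etree.ElementTree as etree

def iter_pubmed_articles(path):
    """Yield (pmid, title) for each <PubmedArticle> in a gzipped PubMed XML file."""
    with gzip.open(path, "rb") as f:
        for event, elem in etree.iterparse(f, events=("end",)):
            # Skip every element except a completed <PubmedArticle>
            if elem.tag != "PubmedArticle":
                continue

            pmid = int(elem.findtext("./MedlineCitation/PMID"))
            title_elem = elem.find("./MedlineCitation/Article/ArticleTitle")
            # itertext() also collects text nested inside inline tags such as <i>
            title = "".join(title_elem.itertext()) if title_elem is not None else None

            yield pmid, title

            # Free the memory used by this article before parsing the next one
            elem.clear()
```

A caller could then write `for pmid, title in iter_pubmed_articles(filename): ...` and stop early with `break` whenever it has seen enough.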

## More elaborate loading

The code below extracts many more of the fields and saves them to a dictionary. We get the title, abstract text and lots of metadata.

In [None]:
with gzip.open(filename,'rt',encoding='utf8') as f:
  for event, elem in etree.iterparse(f, events=("start", "end", "start-ns", "end-ns")):

    # Iterate through the XML file until a <PubmedArticle> tag is closed (which we then process below)
    if event == "end" and elem.tag == "PubmedArticle":

      pmid = int(elem.find("./MedlineCitation/PMID").text)
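      # Join all text nodes so that titles containing inline tags (e.g. <i>) are not cut short
      title = "".join(elem.find("./MedlineCitation/Article/ArticleTitle").itertext())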

      abstract_elems = elem.findall("./MedlineCitation/Article/Abstract/AbstractText")
      abstract = "\n".join( "".join(e.itertext()) for e in abstract_elems )

      publication_types = [ e.text for e in elem.findall("./MedlineCitation/Article/PublicationTypeList/PublicationType") ]

      identifier_elems = elem.findall("./PubmedData/ArticleIdList/ArticleId")
      identifiers = { e.attrib['IdType']:e.text for e in identifier_elems }

      journal_title_elem = elem.find("./MedlineCitation/Article/Journal/Title")
      journal_title_iso_elem = elem.find("./MedlineCitation/Article/Journal/ISOAbbreviation")
      journal_title = journal_title_elem.text if journal_title_elem is not None else None
      journal_title_iso = journal_title_iso_elem.text if journal_title_iso_elem is not None else None

      pub_year_elem = elem.find("./MedlineCitation/Article/Journal/JournalIssue/PubDate/Year")
      pub_month_elem = elem.find("./MedlineCitation/Article/Journal/JournalIssue/PubDate/Month")
      pub_day_elem = elem.find("./MedlineCitation/Article/Journal/JournalIssue/PubDate/Day")
      pub_year = pub_year_elem.text if pub_year_elem is not None else None
      pub_month = pub_month_elem.text if pub_month_elem is not None else None
      pub_day = pub_day_elem.text if pub_day_elem is not None else None

      mesh_headings = []
      mesh_elems = elem.findall("./MedlineCitation/MeshHeadingList/MeshHeading")
      for mesh_elem in mesh_elems:
        descriptor_elem = mesh_elem.find("./DescriptorName")
        mesh_id = descriptor_elem.attrib["UI"]
        name = descriptor_elem.text
        major_topic_yn = descriptor_elem.attrib["MajorTopicYN"]

        mesh_heading = {'id':mesh_id, 'name':name, 'major_topic':major_topic_yn }
        qualifiers = []
        qualifier_elems = mesh_elem.findall("./QualifierName")
        for qualifier_elem in qualifier_elems:
          mesh_id = qualifier_elem.attrib["UI"]
          name = qualifier_elem.text
          major_topic_yn = qualifier_elem.attrib["MajorTopicYN"]
          qualifiers.append( { 'id':mesh_id, 'name':name, 'major_topic':major_topic_yn  } )
        mesh_heading['qualifiers'] = qualifiers

        mesh_headings.append(mesh_heading)

      supplementary_mesh = []
      supplementary_mesh_elems = elem.findall("./MedlineCitation/SupplMeshList/SupplMeshName")
      for supp_elem in supplementary_mesh_elems:
        supp_id = supp_elem.attrib["UI"]
        supp_type = supp_elem.attrib["Type"]
        supp_name = supp_elem.text
        supplementary_mesh.append( {'id':supp_id, 'type':supp_type, 'name':supp_name })

      article = {
          'pmid':pmid,
          'title':title,
          'abstract':abstract,
          'journal_title':journal_title,
          'journal_title_iso':journal_title_iso,
          'publication_date': [pub_year, pub_month, pub_day],
          'publication_types':publication_types,
          'identifiers':identifiers,
          'mesh_headings':mesh_headings,
          'supplementary_mesh':supplementary_mesh
        }

      elem.clear() # Important for clearing memory as the file is iteratively loaded
      break # This break is just for the demonstration so that only the first abstract is loaded

Let's see what was extracted for this first article:

In [None]:
article
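
The publication date is one field where the three lookups for `Year`, `Month` and `Day` may all come back empty, because some records store the date in a single `<MedlineDate>` element (or use `<Season>`) instead. The sketch below shows one way to keep whatever parts are present. The helper name `publication_date` is hypothetical and the two XML snippets are hand-written for illustration, not taken from real records.

```python
import xml.etree.ElementTree as etree

def publication_date(article_elem):
    """Return a dict of whichever <PubDate> child elements are present."""
    pub_date = article_elem.find("./MedlineCitation/Article/Journal/JournalIssue/PubDate")
    if pub_date is None:
        return {}
    # Keys are the child tag names, e.g. Year, Month, Day, Season or MedlineDate
    return {child.tag: child.text for child in pub_date}

# Hand-written examples (illustrative only)
wrapper = ("<PubmedArticle><MedlineCitation><Article><Journal><JournalIssue>"
           "<PubDate>{}</PubDate>"
           "</JournalIssue></Journal></Article></MedlineCitation></PubmedArticle>")
with_parts = etree.fromstring(wrapper.format("<Year>2001</Year><Month>Mar</Month>"))
with_medline_date = etree.fromstring(wrapper.format("<MedlineDate>1998 Dec-1999 Jan</MedlineDate>"))

print(publication_date(with_parts))
print(publication_date(with_medline_date))
```

Returning a dictionary rather than a fixed `[year, month, day]` list means the caller can see which form the record used and decide how to normalise it.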

There are more metadata fields that could be extracted. You can find them in the DTD documentation available from the [PubMed download page](https://pubmed.ncbi.nlm.nih.gov/download/). Some fields also have more nuance, including the abstract, which sometimes has section headings in the XML tags or inline formatting (both of which are currently ignored).
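
As an illustration of the abstract nuance, structured abstracts carry their section headings in a `Label` attribute on each `<AbstractText>` element. The following is a minimal sketch of keeping those headings alongside the text. The helper name `abstract_sections` is hypothetical and the XML snippet is hand-written for the example, not taken from a real record.

```python
import xml.etree.ElementTree as etree

def abstract_sections(article_elem):
    """Return a list of (label, text) pairs, one per <AbstractText> element."""
    sections = []
    for e in article_elem.findall("./MedlineCitation/Article/Abstract/AbstractText"):
        label = e.attrib.get("Label")  # None when the abstract is unstructured
        text = "".join(e.itertext()).strip()  # inline tags are dropped, their text is kept
        sections.append((label, text))
    return sections

# Hand-written example (illustrative only)
example = etree.fromstring("""
<PubmedArticle><MedlineCitation><Article><Abstract>
<AbstractText Label="BACKGROUND">Some <i>context</i>.</AbstractText>
<AbstractText Label="RESULTS">A finding.</AbstractText>
</Abstract></Article></MedlineCitation></PubmedArticle>
""")
sections = abstract_sections(example)
print(sections)
```

Inside the main loop, the same helper could be called on `elem` before `elem.clear()` to store the sections in the `article` dictionary.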