First, download the corpus of eLife articles as JATS XML, and decompress the articles to a local folder.

In [1]:
!wget https://github.com/elifesciences/elife-article-xml/archive/master.zip -O master.zip
!unzip -q master.zip

--2020-09-03 17:00:51--  https://github.com/elifesciences/elife-article-xml/archive/master.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/elifesciences/elife-article-xml/zip/master [following]
--2020-09-03 17:00:52--  https://codeload.github.com/elifesciences/elife-article-xml/zip/master
Resolving codeload.github.com (codeload.github.com)... 140.82.112.9
Connecting to codeload.github.com (codeload.github.com)|140.82.112.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘master.zip’

master.zip              [               <=>  ] 407.95M  6.88MB/s    in 65s     

2020-09-03 17:01:57 (6.25 MB/s) - ‘master.zip’ saved [427763463]

replace elife-article-xml-master/README.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


Some articles have multiple versions, so sort the filenames and keep only the latest version of each article.

In [2]:
import glob
import natsort
import re

filenames = glob.glob("elife-article-xml-master/articles/*.xml")
print('{} XML files'.format(len(filenames)))

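# Natural sort in reverse order, so the highest version of each article comes first
sorted_filenames = natsort.natsorted(filenames, reverse=True)

seen = set()
latest_filenames = []

# Capture the article identifier (the "elife-<digits>" part before the version suffix)
p = re.compile(r'(elife-\d+)-v\d+\.xml$')

for filename in sorted_filenames:
  match = p.search(filename)
  if match is None:
    continue  # skip filenames that do not follow the expected pattern
  article_id = match.group(1)

  # The first filename seen for an article is its latest version
  if article_id not in seen:
    seen.add(article_id)
    latest_filenames.append(filename)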


print('{} latest XML files'.format(len(latest_filenames)))

15755 XML files
9132 latest XML files
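
The same filtering can be sketched without `natsort` by comparing the version numbers directly. This is an illustrative alternative, not the method used above, and the filenames are made up to follow the `elife-<id>-v<version>.xml` pattern that the regular expression above assumes.

```python
import re

# Hypothetical filenames, following the "elife-<id>-v<version>.xml" pattern
example_filenames = [
    "articles/elife-00013-v1.xml",
    "articles/elife-00013-v2.xml",
    "articles/elife-00013-v10.xml",
    "articles/elife-00190-v1.xml",
]

pattern = re.compile(r"(elife-\d+)-v(\d+)\.xml$")

def latest_versions(filenames):
    """Return one filename per article: the one with the highest version number."""
    latest = {}  # article id -> (version, filename)
    for filename in filenames:
        match = pattern.search(filename)
        if match is None:
            continue  # ignore names that do not follow the pattern
        article_id, version = match.group(1), int(match.group(2))
        if article_id not in latest or version > latest[article_id][0]:
            latest[article_id] = (version, filename)
    return sorted(filename for _, filename in latest.values())

print(latest_versions(example_filenames))
```

Comparing the versions as integers is what makes `v10` rank above `v2`; a plain string sort would order them the other way round, which is the reason for the natural sort in the cell above.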


Find the JATS XML files that contain tables (`table-wrap` elements).

In [5]:
import os
from xml.etree import ElementTree
from tqdm.notebook import tqdm

xml_table_filenames = []

for filename in tqdm(latest_filenames, desc='Finding JATS XML files containing tables'):
  data = ElementTree.parse(filename)
  tables = data.findall('.//table-wrap')
  if len(tables):
    xml_table_filenames.append(filename)
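
To see what the `.//table-wrap` lookup matches, here is a minimal sketch on two made-up JATS-like snippets. The snippets and the `has_tables` helper are hypothetical and only illustrate the check; the lookup itself is the one used in the loop above.

```python
from xml.etree import ElementTree

# Two minimal, made-up JATS-like documents: one with a table, one without
with_table = """<article><body><sec>
  <table-wrap id="tbl1"><table><tr><td>1</td></tr></table></table-wrap>
</sec></body></article>"""
without_table = "<article><body><sec><p>No tables here.</p></sec></body></article>"

def has_tables(xml_text):
    """Return True if the document contains at least one table-wrap element."""
    root = ElementTree.fromstring(xml_text)
    # './/table-wrap' matches table-wrap descendants at any depth
    return len(root.findall('.//table-wrap')) > 0

print(has_tables(with_table), has_tables(without_table))
```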





Install a recent version of pandoc (2.10.1 in this run).

In [6]:
!wget 'https://github.com/jgm/pandoc/releases/download/2.10.1/pandoc-2.10.1-1-amd64.deb' -O pandoc.deb
!apt install ./pandoc.deb
!rm ./pandoc.deb
!pandoc --version

--2020-09-03 17:06:46--  https://github.com/jgm/pandoc/releases/download/2.10.1/pandoc-2.10.1-1-amd64.deb
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/571770/83b3a680-cd2e-11ea-8e8e-e46966a5e3a4?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20200903%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200903T170647Z&X-Amz-Expires=300&X-Amz-Signature=8b3ad603f66fa260615131dc51a6445864f699802234ea163fce87584ac3a2c9&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=571770&response-content-disposition=attachment%3B%20filename%3Dpandoc-2.10.1-1-amd64.deb&response-content-type=application%2Foctet-stream [following]
--2020-09-03 17:06:47--  https://github-production-release-asset-2e65be.s3.amazonaws.com/571770/83b3a680-cd2e-11ea-8e8e-e46966a5e3a4?X-Amz-Algorithm=AWS4-HMAC-SHA

Convert the JATS XML files that contain tables to HTML.

In [8]:
import os
from tqdm.notebook import tqdm

html_filenames = []

for filename in tqdm(xml_table_filenames, desc='Converting JATS to HTML with pandoc'):
  pre, ext = os.path.splitext(filename)
  output = pre + '.html'
  result = os.system('pandoc -f jats -t html --standalone --section-divs -o "{output}" "{input}"'.format(output=output, input=filename))
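  # Compare by value: "is not 0" tests object identity and triggers a SyntaxWarning in Python 3.8+
  if result != 0:
    break  # stop at the first failed conversion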
  html_filenames.append(output)

print('{} HTML files'.format(len(html_filenames)))



4648 HTML files
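
As an alternative to building a shell string for `os.system`, the same pandoc invocation could be passed to `subprocess.run` as an argument list, which avoids quoting problems in filenames. This is a sketch under that assumption, not the approach used above; `pandoc_command` and `convert` are hypothetical helper names, the example path is made up, and the pandoc options are copied from the cell above.

```python
import os
import subprocess

def pandoc_command(input_path):
    """Build the pandoc argument list and the output path for one JATS file."""
    pre, _ext = os.path.splitext(input_path)
    output_path = pre + '.html'
    command = ['pandoc', '-f', 'jats', '-t', 'html', '--standalone',
               '--section-divs', '-o', output_path, input_path]
    return command, output_path

def convert(input_path):
    """Run pandoc on one file; check=True raises CalledProcessError on a non-zero exit."""
    command, output_path = pandoc_command(input_path)
    subprocess.run(command, check=True)
    return output_path

# Only build the command here; calling convert() requires pandoc to be installed
command, output_path = pandoc_command('articles/elife-00013-v2.xml')
print(output_path)
```

With `check=True` a failed conversion raises an exception instead of silently ending the loop, so the failing file is visible in the traceback.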


Create a ZIP archive of the converted HTML files, so that they can be downloaded and the processing resumed later without repeating the conversion.

In [9]:
!zip -q elife-article-html.zip -r elife-article-xml-master/articles/*.html

In [10]:
#!unzip elife-article-html.zip -d .

For each top-level section in each article, parse each HTML table with `pandas` to create a dataframe and store the table headers.

In [11]:
from io import StringIO
from xml.etree import ElementTree

import pandas as pd
from tqdm.notebook import tqdm

# Namespace prefix for the elements of the converted files
XHTML = '{http://www.w3.org/1999/xhtml}'

all_headers = []
section_headers = {}

for filename in tqdm(html_filenames, desc="Reading tables with pandas"):
  data = ElementTree.parse(filename)
  sections = data.findall('.//' + XHTML + 'section[@class="level1"]')
  for section in sections:
    # Use the text of the section's h1 as its title (None if there is no h1)
    sectionTitle = section.find('./' + XHTML + 'h1')
    sectionTitleText = sectionTitle.text if sectionTitle is not None else None
    tableWraps = section.findall('.//' + XHTML + 'div[@class="table-wrap"]')
    for tableWrap in tableWraps:
      # TODO: also select the headers in the table caption
      table = tableWrap.find('./' + XHTML + 'table')
      if table is None:
        continue  # no table as a direct child of this wrapper
      html = ElementTree.tostring(table, encoding='unicode', method='xml')
      # Wrap the markup in StringIO: recent pandas versions deprecate passing a literal HTML string
      dfs = pd.read_html(StringIO(html))
      df = pd.concat(dfs)
      headers = df.columns.values.tolist()
      all_headers.append(headers)

      # Group the header lists by the title of the top-level section
      section_headers.setdefault(sectionTitleText, []).append(headers)
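
The namespace-qualified lookups can be tried on a small inline fragment. The sketch below reads the `th` cells directly with `ElementTree` instead of going through `pandas`, so it is a simplified stand-in for the cell above rather than a copy of it; the fragment is made up to match the structure that the lookups assume.

```python
from xml.etree import ElementTree

XHTML = '{http://www.w3.org/1999/xhtml}'

# A made-up fragment in the shape the lookups above expect
example = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<section class="level1"><h1>Results</h1>
<div class="table-wrap"><table>
<thead><tr><th>Gene</th><th>Fold change</th></tr></thead>
<tbody><tr><td>abc1</td><td>2.5</td></tr></tbody>
</table></div>
</section></body></html>"""

root = ElementTree.fromstring(example)
found = {}
for section in root.findall('.//' + XHTML + 'section[@class="level1"]'):
    title = section.find('./' + XHTML + 'h1').text
    table_path = './/' + XHTML + 'div[@class="table-wrap"]/' + XHTML + 'table'
    for table in section.findall(table_path):
        # Collect the text of every header cell in this table
        headers = [''.join(th.itertext()).strip() for th in table.iter(XHTML + 'th')]
        found.setdefault(title, []).append(headers)

print(found)
```

Because the root element declares the XHTML namespace as its default, every element name has to be qualified with it; an unqualified `.//section` would match nothing here.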






Generate a TSV file summarising the sets of headers that occur most frequently in each section.

In [12]:
from collections import Counter

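with open('headers.tsv', 'w') as f:
  for section, headers in section_headers.items():
    # Count how often each distinct list of headers occurs in this section
    counter = Counter(map(tuple, headers))

    # One row per set of headers: section title, headers, count (most frequent first)
    for k, v in counter.most_common():
      f.write("{}\t{}\t{}\n".format(section, k, v))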
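
To illustrate the counting step, here is a small sketch with made-up header lists for a single section. The section title and the headers are hypothetical; the `Counter(map(tuple, ...))` call and the row format are the ones used above.

```python
from collections import Counter

# Made-up header lists for a single section
headers = [
    ['Gene', 'Fold change'],
    ['Strain', 'Genotype', 'Source'],
    ['Gene', 'Fold change'],
]

# Lists are unhashable, so each one is converted to a tuple before counting
counter = Counter(map(tuple, headers))

# Same row format as the TSV file: section title, headers, count
rows = ["{}\t{}\t{}".format('Results', k, v) for k, v in counter.most_common()]
print("\n".join(rows))
```

Each row holds the Python representation of the header tuple, so the second column of the TSV file is meant for reading rather than for further parsing.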