# Machine-Readable Data Formats

## Recommendations and Best Practices for Biodiversity Informatics

### ***Giuditta Parolini, Data Scientist, Museum für Naturkunde Berlin***

---

# Table of Contents
* [Introduction](#intro)
* [Section 1: The trouble with non-machine-readable data](#trouble)
    * [1.1: Data published as a PDF file](#pdf)
    * [1.2: Data published as a DOCX file](#docx)
* [Section 2: Machine-readable data formats for tabular data](#tabular)
    * [2.1: CSV, TSV](#csv)
    * [2.2: XML](#xml)
    * [2.3: JSON](#json)
* [Section 3: Geo data](#geo)
* [Section 4: Images](#geo)
* [Section 5: Other media](#other)
* [Section 6: Biodiversity specials](#specials)

---

## Introduction <a class="anchor" id="intro"></a>

**This Python notebook provides practical examples that illustrate the main points discussed in the Guide on machine-readable data.**
<br>

It lets readers see the challenges posed by data that are not machine-readable and experience the pitfalls that can produce invalid files even when machine-readable data formats like CSV are used. The notebook also describes how unstructured data, such as digital images or other media, can be handled so that at least some machine-readable information is provided.
<br>

Throughout the notebook, examples are illustrated using the dataset ***Mounted Specimens of the Historical Bird Collection at the Museum für Naturkunde Berlin*** (DOI: [10.7479/wwqn-gd04](https://doi.org/10.7479/wwqn-gd04)) and modifications of it. The dataset contains metadata for over 13000 images of mounted bird specimens belonging to the bird collection of the museum. The mounted specimens, which date back to the 19th century, have been systematically photographed, and their images and related metadata are distributed under a [CC0 Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/deed.en) license. A copy of the original dataset is available in the GitHub repository as dataset.csv.

---

## 1: The trouble with non-machine-readable data <a class="anchor" id="trouble"></a>
As mentioned in the Guide Introduction, PDF and DOCX files are human-readable but not really machine-readable, and extracting data from them is a challenging and error-prone exercise. This section demonstrates the problem using a PDF and a DOCX document that contain an extract of the bird collection dataset and its metadata. The content of both files is the same. The data would be immediately available in a .csv file; extracting them from the surrounding text and saving them in a machine-readable format can instead become a lengthy and troublesome business.



### 1.1: Data published as a PDF file <a class="anchor" id="pdf"></a>

In [5]:
import tabula
df = tabula.io.read_pdf("PDF_doc_example.pdf", pages='all')


Error from tabula-java:
The operation couldn’t be completed. Unable to locate a Java Runtime.
Please visit http://www.java.com for information on installing Java.





CalledProcessError: Command '['java', '-Djava.awt.headless=true', '-Dfile.encoding=UTF8', '-jar', '/Users/giuditta.parolini/miniconda3/lib/python3.11/site-packages/tabula/tabula-1.0.5-jar-with-dependencies.jar', '--pages', 'all', '--guess', '--format', 'JSON', 'PDF_doc_example.pdf']' returned non-zero exit status 1.
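
The error above arises because tabula delegates the PDF parsing to tabula-java, which needs a working Java runtime. The following is a minimal sketch, using only the Python standard library, of how one might check for a runtime before calling tabula. The helper name `java_runtime_available` is hypothetical and not part of tabula.

```python
import subprocess


def java_runtime_available(command=("java", "-version")):
    """Return True if the given command can be started and exits with status 0.

    Hypothetical helper: running the command (rather than only looking it up
    on the PATH) also catches the case where a `java` launcher exists but no
    runtime is installed behind it.
    """
    try:
        result = subprocess.run(command, capture_output=True, check=False)
    except OSError:
        # The executable could not be found or could not be started.
        return False
    return result.returncode == 0


if java_runtime_available():
    print("Java runtime found: tabula can be called.")
else:
    print("No Java runtime found: install Java before calling tabula.")
```

Even when the runtime is present, the extraction of tables from a PDF remains a heuristic process, so the result still has to be checked by hand.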

### 1.2: Data published as a DOCX file <a class="anchor" id="docx"></a>

[Python-docx](https://python-docx.readthedocs.io/en/latest/api/document.html) is a library for creating DOCX files with the Python programming language. Because the library can also open existing DOCX files, it is useful for extracting content from them, which is the use case considered here.

In [67]:
# imports
import docx
import pandas as pd

In [68]:
# All the document content can be extracted as a generator object
content = docx.Document('DOCX_doc_example.docx').iter_inner_content()
content


<generator object BlockItemContainer.iter_inner_content at 0x113f9dc60>

In [69]:
# The generator object can be unpacked into a list to see all the components in the DOCX file.
# In our case, we have the text paragraphs and the data table
docx_list = list(content)
docx_list

[<docx.text.paragraph.Paragraph at 0x14e98e4d0>,
 <docx.text.paragraph.Paragraph at 0x113faf9d0>,
 <docx.text.paragraph.Paragraph at 0x14e99c690>,
 <docx.text.paragraph.Paragraph at 0x14e99ebd0>,
 <docx.text.paragraph.Paragraph at 0x14e99f950>,
 <docx.text.paragraph.Paragraph at 0x113f10e50>,
 <docx.text.paragraph.Paragraph at 0x113f11d10>,
 <docx.text.paragraph.Paragraph at 0x113f12f90>,
 <docx.text.paragraph.Paragraph at 0x113f126d0>,
 <docx.text.paragraph.Paragraph at 0x113f13150>,
 <docx.text.paragraph.Paragraph at 0x113f12bd0>,
 <docx.text.paragraph.Paragraph at 0x113f10710>,
 <docx.text.paragraph.Paragraph at 0x113f13a50>,
 <docx.text.paragraph.Paragraph at 0x113f12c10>,
 <docx.text.paragraph.Paragraph at 0x113f10d90>,
 <docx.text.paragraph.Paragraph at 0x113f12190>,
 <docx.text.paragraph.Paragraph at 0x113f42350>,
 <docx.table.Table at 0x113f40390>,
 <docx.text.paragraph.Paragraph at 0x113f0b690>]

In [71]:
# It is easier to handle the text paragraphs separately from the data table,
# so we remove the table (the element at index 17) from the list
docx_list.pop(17)
docx_list

[<docx.text.paragraph.Paragraph at 0x14e98e4d0>,
 <docx.text.paragraph.Paragraph at 0x113faf9d0>,
 <docx.text.paragraph.Paragraph at 0x14e99c690>,
 <docx.text.paragraph.Paragraph at 0x14e99ebd0>,
 <docx.text.paragraph.Paragraph at 0x14e99f950>,
 <docx.text.paragraph.Paragraph at 0x113f10e50>,
 <docx.text.paragraph.Paragraph at 0x113f11d10>,
 <docx.text.paragraph.Paragraph at 0x113f12f90>,
 <docx.text.paragraph.Paragraph at 0x113f126d0>,
 <docx.text.paragraph.Paragraph at 0x113f13150>,
 <docx.text.paragraph.Paragraph at 0x113f12bd0>,
 <docx.text.paragraph.Paragraph at 0x113f10710>,
 <docx.text.paragraph.Paragraph at 0x113f13a50>,
 <docx.text.paragraph.Paragraph at 0x113f12c10>,
 <docx.text.paragraph.Paragraph at 0x113f10d90>,
 <docx.text.paragraph.Paragraph at 0x113f12190>,
 <docx.text.paragraph.Paragraph at 0x113f42350>]

In [75]:
# Now the text content can be joined and printed.
# One can also save this information to a TXT file for later re-use
content = '\n'.join([p.text for p in docx_list])
print(content)

Dataset title
Mounted Specimens of the Historical Bird Collection at the Museum für Naturkunde Berlin

Creator
MfN Digitization

License
CC0 1.0 Creative Commons Public Domain Dedication

Dataset Description
The dataset contains metadata for over 13000 images of mounted bird specimens belonging to the bird collection of the Museum für Naturkunde Berlin (MfN). The mounted specimens were mostly collected in the 19th century and have now been systematically photographed. For large part of the mounted specimens the full taxonomy is available. When this is not the case, the specimens are identified with their German common names. In the dataset there are birds from the well-known families of pigeons (Columbidae), parrots (Psittacidae) and pheasants (Phasianidae) as well as duck birds (Anatidae) and exotic hummingbirds (Trochilidae). Some rare specimens such as the quetzal (Pharomachrus mocinno) are also available in this dataset. 

Keywords
Birds, mounted specimens, Museum für Naturkunde Be
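
The printed text follows a simple pattern: a heading line, the corresponding value, and a blank line. Under that assumption, the metadata can be turned into a machine-readable structure. The sketch below uses a shortened stand-in for the text above, and the function name `parse_metadata` is hypothetical.

```python
import json

# Shortened stand-in for the text extracted from the DOCX file (assumption:
# each block is "heading line, value line(s)", and blocks are separated by
# a blank line, as in the output above).
sample_text = (
    "Dataset title\n"
    "Mounted Specimens of the Historical Bird Collection at the Museum für Naturkunde Berlin\n"
    "\n"
    "Creator\n"
    "MfN Digitization\n"
    "\n"
    "License\n"
    "CC0 1.0 Creative Commons Public Domain Dedication"
)


def parse_metadata(text):
    """Split blank-line-separated blocks into {heading: value} pairs."""
    metadata = {}
    for block in text.split("\n\n"):
        lines = [line.strip() for line in block.strip().splitlines()]
        if len(lines) < 2:
            continue  # skip blocks that lack a heading/value pair
        metadata[lines[0]] = " ".join(lines[1:])
    return metadata


metadata = parse_metadata(sample_text)
print(json.dumps(metadata, ensure_ascii=False, indent=2))
```

The approach only works as long as the document keeps this layout; any change in the formatting of the DOCX file would require the parsing rules to be revised.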

In [83]:
# Code inspired by Stack Overflow
# (https://stackoverflow.com/questions/46618718/python-docx-to-extract-table-from-word-docx)

# The table data can be extracted using a for loop and then saved into a pandas dataframe.
# The dataframe can then be saved as a CSV file. 

from timeit import default_timer as timer
from docx.api import Document

start = timer()

document = Document('DOCX_doc_example.docx')
table = document.tables[0]

data = []

keys = None
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)

    if i == 0:
        keys = tuple(text)
        continue
    row_data = dict(zip(keys, text))
    data.append(row_data)

df = pd.DataFrame(data)

end = timer()

In [84]:
# Process time
print(end - start) #computed in seconds

0.1035007839964237


Note that about 0.1 s was required to recover 52 (rows) × 6 (columns) = 312 data cells, whereas the original dataset has 13288 (rows) × 15 (columns) = 199320 data cells. Assuming that the process time increases linearly with the number of cells (this is only an approximation and is likely to underestimate the time required), extracting the full dataset would take about 63.88 seconds, i.e., roughly 1.06 minutes. In contrast, reading the complete dataset in CSV format with pandas requires only about 0.1 seconds (see the cell execution time below). Extracting the table data from the DOCX file is therefore more than 600 times slower than reading the data directly from the CSV file. Although times remain manageable for both solutions in this case, with datasets that have millions or billions of data cells the extraction becomes increasingly time-consuming, making the user regret not having the data directly available in a machine-readable format like CSV.
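
The linear extrapolation can be reproduced with a few lines of arithmetic. This is only a sketch of the estimate: it uses the rounded timing of 0.1 s quoted in the text, and the variable names are chosen here for illustration.

```python
# Figures quoted in the text; timings are rounded to 0.1 s as in the text.
sample_cells = 52 * 6        # cells in the table extracted from the DOCX file
full_cells = 13288 * 15      # cells in the complete dataset
sample_seconds = 0.1         # time needed to extract the DOCX table
csv_read_seconds = 0.1       # time needed to read the full CSV with pandas

# Linear scaling: time per cell multiplied by the number of cells.
estimated_seconds = sample_seconds * full_cells / sample_cells
estimated_minutes = estimated_seconds / 60
slowdown = estimated_seconds / csv_read_seconds

print(f"Estimated extraction time: {estimated_seconds:.2f} s "
      f"({estimated_minutes:.2f} min)")
print(f"Slowdown relative to reading the CSV: about {slowdown:.0f}x")
```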

In [85]:
# Reading in the entire dataset in machine-readable format
dataset_read_from_csv = pd.read_csv("dataset.csv")

In [86]:
dataset_read_from_csv

| | catalogue_id | key | scientific_name | title (scientific name or common name) | class | family | genus | species | subspecies | collections | creation_year | absolute_url | copyright | license | authors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ZMB_AVES_2000-14765 | 7d1264cca2e93009237e | Rhea americana | Rhea americana | Aves | Rheidae | Rhea | Rhea americana | | Birds | 2023 | https://portal.museumfuernaturkunde.berlin/det... | MfN Berlin, https://ror.org/052d1a351 | CC0 1.0 Creative Commons Public Domain Dedication | MfN Digitization |
| 1 | ZMB_AVES_2000-14765 | 46e9071723ddbababd06 | Rhea americana | Rhea americana | Aves | Rheidae | Rhea | Rhea americana | | Birds | 2023 | https://portal.museumfuernaturkunde.berlin/det... | MfN Berlin, https://ror.org/052d1a351 | CC0 1.0 Creative Commons Public Domain Dedication | MfN Digitization |
| 2 | ZMB_AVES_2000-31350 | cd50680edb26a356d7f1 | Struthio camelus | Struthio camelus | Aves | Struthionidae | Struthio | Struthio camelus | | Birds | 2023 | https://portal.museumfuernaturkunde.berlin/det... | MfN Berlin, https://ror.org/052d1a351 | CC0 1.0 Creative Commons Public Domain Dedication | MfN Digitization |
| 3 | ZMB_AVES_2000-31350 | 1af8bb8651519838f87c | Struthio camelus | Struthio camelus | Aves | Struthionidae | Struthio | Struthio camelus | | Birds | 2023 | https://portal.museumfuernaturkunde.berlin/det... | MfN Berlin, https://ror.org/052d1a351 | CC0 1.0 Creative Commons Public Domain Dedication | MfN Digitization |
| 4 | ZMB_AVES_2000-31351 | b2d8c2a150d82453d35c | Struthio molybdophanes | Struthio molybdophanes | Aves | Struthionidae | Struthio | Struthio molybdophanes | | Birds | 2023 | https://portal.museumfuernaturkunde.berlin/det... | MfN Berlin, https://ror.org/052d1a351 | CC0 1.0 Creative Commons Public Domain Dedication | MfN Digitization |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13283 | ZMB_Aves_9985 | ca892e3d44beef9ca4dc | Psittacula krameri borealis | Psittacula krameri borealis | Aves | Psittacidae | Psittacula | Psittacula krameri | Psittacula krameri borealis | Birds | 2023 | https://portal.museumfuernaturkunde.berlin/det... | MfN Berlin, https://ror.org/052d1a351 | CC0 1.0 Creative Commons Public Domain Dedication | MfN Digitization |
| 13284 | ZMB_Aves_999 | e76c936cca035eb8557d | Falco biarmicus tanypterus | Falco biarmicus tanypterus | Aves | Falconidae | Falco | Falco biarmicus | Falco biarmicus tanypterus | Birds | 2023 | https://portal.museumfuernaturkunde.berlin/det... | MfN Berlin, https://ror.org/052d1a351 | CC0 1.0 Creative Commons Public Domain Dedication | MfN Digitization |
| 13285 | ZMB_Aves_999 | e5672b9636cec6d995e8 | Falco biarmicus tanypterus | Falco biarmicus tanypterus | Aves | Falconidae | Falco | Falco biarmicus | Falco biarmicus tanypterus | Birds | 2023 | https://portal.museumfuernaturkunde.berlin/det... | MfN Berlin, https://ror.org/052d1a351 | CC0 1.0 Creative Commons Public Domain Dedication | MfN Digitization |
| 13286 | ZMB_Aves_9994 | 6ac8193d5913aa55de96 | Psittacula longicauda longicauda | Psittacula longicauda longicauda | Aves | Psittacidae | Psittacula | Psittacula longicauda | Psittacula longicauda longicauda | Birds | 2023 | https://portal.museumfuernaturkunde.berlin/det... | MfN Berlin, https://ror.org/052d1a351 | CC0 1.0 Creative Commons Public Domain Dedication | MfN Digitization |


In [81]:
df

| | catalogue_id | title (scientific name or common name) | class | family | genus | species |
|---|---|---|---|---|---|---|
| 0 | ZMB_AVES_2000-14765 | Rhea americana | Aves | Rheidae | Rhea | Rhea americana |
| 1 | ZMB_AVES_2000-31350 | Struthio camelus | Aves | Struthionidae | Struthio | Struthio camelus |
| 2 | ZMB_AVES_2000-31351 | Struthio molybdophanes | Aves | Struthionidae | Struthio | Struthio molybdophanes |
| 3 | ZMB_AVES_2000-31795 | Rhea americana | Aves | Rheidae | Rhea | Rhea americana |
| 4 | ZMB_AVES_2000-32382 | Rhea pennata | Aves | Rheidae | Rhea | Rhea pennata |
| 5 | ZMB_AVES_2000-34658 | Struthio camelus | Aves | Struthionidae | Struthio | Struthio camelus |
| 6 | ZMB_AVES_2000-34923 | Rhea americana | Aves | Rheidae | Rhea | Rhea americana |
| 7 | ZMB_AVES_2000-632 | Struthio camelus | Aves | Struthionidae | Struthio | Struthio camelus |
| 8 | ZMB_AVES_2000-752 | Casuarius unappendiculatus | Aves | Casuariidae | Casuarius | Casuarius unappendiculatus |
| 9 | ZMB_AVES_2000-8516 | Struthio camelus | Aves | Struthionidae | Struthio | Struthio camelus |


The dataframe is well-formed and the data extraction raised no issues. However, the data were extracted by looping over the table element in the DOCX file. Explicit loops are comparatively slow in Python, and while this causes no real time issue with a very small table, problems would quickly emerge if real-scale datasets with thousands of rows and tens of columns had to be extracted.
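
Part of the difficulty is that a DOCX file is a ZIP archive in which the main text is stored as nested XML (`word/document.xml`), so a table is never available as plain rows and columns. The sketch below illustrates this with the Python standard library only. It is not the method used in this notebook: the helper names are hypothetical, and the in-memory archive is a minimal stand-in containing just the XML part rather than a complete DOCX file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the WordprocessingML elements (w:tbl, w:tr, w:tc, w:t).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def first_table_rows(docx_file):
    """Return the cell texts of the first table as a list of rows."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    table = root.find(f".//{W}tbl")
    if table is None:
        return []
    return [
        ["".join(t.text or "" for t in tc.iter(f"{W}t")) for tc in tr.findall(f"{W}tc")]
        for tr in table.findall(f"{W}tr")
    ]


def make_cell(text):
    """Build the XML of one table cell (hypothetical helper for the demo)."""
    return f"<w:tc><w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:tc>"


# Minimal stand-in archive holding only the main document part.
document_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:tbl>'
    f"<w:tr>{make_cell('catalogue_id')}{make_cell('genus')}</w:tr>"
    f"<w:tr>{make_cell('ZMB_AVES_2000-14765')}{make_cell('Rhea')}</w:tr>"
    "</w:tbl></w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", document_xml)
buffer.seek(0)

rows = first_table_rows(buffer)
print(rows)
```

Each cell value sits several levels deep in the XML tree, which is why every extraction route has to walk the structure cell by cell.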

---

## 2: Machine-readable data formats for tabular data <a class="anchor" id="tabular"></a>



CSV files might not be the solution to all data problems, but they are definitely handy for delivering tabular data in a machine-readable format. For datasets with up to one million data rows, they should be the first data format considered.
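
Even with CSV, some care is needed. In the bird dataset, for instance, the value of the copyright column contains a comma, so the field has to be quoted to keep the row valid. The following sketch, which uses only Python's standard `csv` module and a two-column extract chosen for illustration, shows the difference between splitting a line naively and using a CSV parser.

```python
import csv
import io

# Two-column extract chosen for illustration; the copyright value
# contains a comma and must therefore be quoted in the CSV file.
header = ["catalogue_id", "copyright"]
row = ["ZMB_AVES_2000-14765", "MfN Berlin, https://ror.org/052d1a351"]

# Write the rows with the csv module, which adds quotes where needed.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerow(row)
csv_text = buffer.getvalue()

# Splitting the data line on commas breaks the quoted field apart.
naive_fields = csv_text.splitlines()[1].split(",")

# A CSV parser respects the quotes and recovers the original values.
parsed_rows = list(csv.reader(io.StringIO(csv_text)))

print(len(naive_fields), "fields with a naive split")
print(len(parsed_rows[1]), "fields with the csv module")
```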