# Machine-Readable Data Formats

## Recommendations and Best Practices for Biodiversity Informatics

### ***Giuditta Parolini, Data Scientist, Museum für Naturkunde Berlin***

---

# Table of Contents
* [Introduction](#intro)
* [Section 1: The trouble with non-machine-readable data](#trouble)
    * [1.1: Data published as a PDF file](#pdf)
    * [1.2: Data published as a DOCX file](#docx)
* [Section 2: Machine-readable data formats for tabular data](#tabular)
    * [2.1: CSV, TSV](#csv)
    * [2.2: TXT](#txt)
    * [2.3: XML](#xml)
    * [2.4: JSON](#json)
* [Section 3: Geo data](#geo)
* [Section 4: Images](#images)
* [Section 5: Other media](#other)
* [Section 6: Biodiversity specials](#specials)

---

In [68]:
# imports
from docx.api import Document
from io import StringIO
from timeit import default_timer as timer
import re
import pandas as pd

---

## Introduction <a class="anchor" id="intro"></a>

**This Python notebook provides practical examples that illustrate the main points discussed in the Guide on machine-readable data.**
<br>

It shows readers the challenges posed by data that are not machine-readable, and lets them experience the pitfalls that can produce invalid files even when using machine-readable data formats like CSV. The notebook also describes how unstructured data, such as digital images or other media, can be approached so that they provide at least some machine-readable information.
<br>

Throughout the notebook, examples will be illustrated using the dataset ***Mounted Specimens of the Historical Bird Collection at the Museum für Naturkunde Berlin*** (DOI: [10.7479/wwqn-gd04](https://doi.org/10.7479/wwqn-gd04)) and modifications of it. The dataset contains metadata for over 13000 images of mounted bird specimens belonging to the bird collection of the museum. The mounted specimens have been systematically photographed and their images and related metadata are distributed under a [CC0 Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/deed.en) license. A copy of the original dataset is available in the GitHub repository as dataset.csv.

---

## 1: The trouble with non-machine-readable data <a class="anchor" id="trouble"></a>
As mentioned in the Guide Introduction, PDF and DOCX files are human-readable but not really machine-readable, and extracting data from them is a challenging and error-prone exercise. This section demonstrates the problem using a PDF document and a DOCX document that contain the same extract of the bird collection dataset and its metadata. It shows how pulling the data out of the surrounding text and saving them in a machine-readable format can become a lengthy and troublesome business, whereas the same data would be immediately available in a CSV file.



### 1.1: Data published as a PDF file <a class="anchor" id="pdf"></a>

In [1]:
"""
import tabula
df = tabula.io.read_pdf("PDF_doc_example.pdf", pages='all')
"""

'\nimport tabula\ndf = tabula.io.read_pdf("PDF_doc_example.pdf", pages=\'all\')\n'
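To give a feel for why text recovered from a PDF is hard to turn back into a table, here is a minimal sketch. It assumes a hypothetical line of text (`pdf_line`) such as a PDF text extractor might return for one table row once the cell borders are lost; the real output of any given extraction tool may differ. The regular expression `row_pattern` is also an assumption: it encodes prior knowledge of the column layout, which is exactly what a machine-readable format would make unnecessary.

```python
import re

# Hypothetical text for one table row after PDF extraction:
# the cell borders are gone and only spaces separate the values.
pdf_line = "ZMB_AVES_2000-14765 Rhea americana Aves Rheidae Rhea Rhea americana"

# Splitting on whitespace does not recover the six original columns,
# because scientific names contain spaces themselves.
naive_fields = pdf_line.split()
print(len(naive_fields), "tokens instead of 6 columns")

# Recovering the columns requires knowing the layout in advance.
# This pattern assumes two-word names and would fail for subspecies.
row_pattern = re.compile(
    r"^(?P<catalogue_id>\S+) "
    r"(?P<title>\S+ \S+) "
    r"(?P<class_name>\S+) "
    r"(?P<family>\S+) "
    r"(?P<genus>\S+) "
    r"(?P<species>\S+ \S+)$"
)
match = row_pattern.match(pdf_line)
parsed_row = match.groupdict() if match else None
print(parsed_row)
```

The pattern works for this row only because the names have exactly two words. A three-word subspecies name, or a specimen identified by a German common name, would need a different rule.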

### 1.2: Data published as a DOCX file <a class="anchor" id="docx"></a>

[python-docx](https://python-docx.readthedocs.io/en/latest/api/document.html) is a Python library for creating and updating DOCX files. Because the library can also read existing DOCX files, it is useful for extracting content from them, which is the use case considered here.

In [3]:
# All the document content can be extracted as a Python generator object.
# `Document` is the name imported in the imports cell at the top of the notebook.
content = Document('DOCX_doc_example.docx').iter_inner_content()
content


<generator object BlockItemContainer.iter_inner_content at 0x113f463e0>

In [4]:
# The generator object can be unpacked into a list to see all the components in the DOCX file.
# In our case, we have the text paragraphs and the data table
docx_list = list(content)
docx_list

[<docx.text.paragraph.Paragraph at 0x113b4fb50>,
 <docx.text.paragraph.Paragraph at 0x1143baad0>,
 <docx.text.paragraph.Paragraph at 0x1143ba810>,
 <docx.text.paragraph.Paragraph at 0x1143ba550>,
 <docx.text.paragraph.Paragraph at 0x1143bab50>,
 <docx.text.paragraph.Paragraph at 0x1143ba590>,
 <docx.text.paragraph.Paragraph at 0x1143bac50>,
 <docx.text.paragraph.Paragraph at 0x1143bac90>,
 <docx.text.paragraph.Paragraph at 0x1143bacd0>,
 <docx.text.paragraph.Paragraph at 0x1143bab10>,
 <docx.text.paragraph.Paragraph at 0x1143bab90>,
 <docx.text.paragraph.Paragraph at 0x1143bad50>,
 <docx.text.paragraph.Paragraph at 0x1143bad90>,
 <docx.text.paragraph.Paragraph at 0x1143badd0>,
 <docx.text.paragraph.Paragraph at 0x1143bae10>,
 <docx.text.paragraph.Paragraph at 0x1143bae50>,
 <docx.text.paragraph.Paragraph at 0x1143bae90>,
 <docx.table.Table at 0x1143baed0>,
 <docx.text.paragraph.Paragraph at 0x1143baf10>]

In [5]:
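# It is easier to handle the text paragraphs separately from the data table,
# so we drop the table from the list. Filtering by type avoids hard-coding
# the position of the table (index 17 in this document).
from docx.table import Table
docx_list = [el for el in docx_list if not isinstance(el, Table)]
docx_list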

[<docx.text.paragraph.Paragraph at 0x113b4fb50>,
 <docx.text.paragraph.Paragraph at 0x1143baad0>,
 <docx.text.paragraph.Paragraph at 0x1143ba810>,
 <docx.text.paragraph.Paragraph at 0x1143ba550>,
 <docx.text.paragraph.Paragraph at 0x1143bab50>,
 <docx.text.paragraph.Paragraph at 0x1143ba590>,
 <docx.text.paragraph.Paragraph at 0x1143bac50>,
 <docx.text.paragraph.Paragraph at 0x1143bac90>,
 <docx.text.paragraph.Paragraph at 0x1143bacd0>,
 <docx.text.paragraph.Paragraph at 0x1143bab10>,
 <docx.text.paragraph.Paragraph at 0x1143bab90>,
 <docx.text.paragraph.Paragraph at 0x1143bad50>,
 <docx.text.paragraph.Paragraph at 0x1143bad90>,
 <docx.text.paragraph.Paragraph at 0x1143badd0>,
 <docx.text.paragraph.Paragraph at 0x1143bae10>,
 <docx.text.paragraph.Paragraph at 0x1143bae50>,
 <docx.text.paragraph.Paragraph at 0x1143bae90>,
 <docx.text.paragraph.Paragraph at 0x1143baf10>]

In [6]:
# Now the text content can be joined and printed.
content = '\n'.join([p.text for p in docx_list])
print(content)

Dataset title
Mounted Specimens of the Historical Bird Collection at the Museum für Naturkunde Berlin

Creator
MfN Digitization

License
CC0 1.0 Creative Commons Public Domain Dedication

Dataset Description
The dataset contains metadata for over 13000 images of mounted bird specimens belonging to the bird collection of the Museum für Naturkunde Berlin (MfN). The mounted specimens were mostly collected in the 19th century and have now been systematically photographed. For large part of the mounted specimens the full taxonomy is available. When this is not the case, the specimens are identified with their German common names. In the dataset there are birds from the well-known families of pigeons (Columbidae), parrots (Psittacidae) and pheasants (Phasianidae) as well as duck birds (Anatidae) and exotic hummingbirds (Trochilidae). Some rare specimens such as the quetzal (Pharomachrus mocinno) are also available in this dataset. 

Keywords
Birds, mounted specimens, Museum für Naturkunde Be

In [7]:
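# One can also save this information to a TXT file for later re-use.
# The encoding is set explicitly because the text contains non-ASCII characters (e.g. "ü").
with open("text_extracted.txt", "w", encoding="utf-8") as text_file:
    text_file.write(content)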

In [8]:
# The table data can be extracted using a for loop and then saved into a pandas dataframe.
# The dataframe can then be saved as a CSV file.
# Code inspired by Stack Overflow
# (https://stackoverflow.com/questions/46618718/python-docx-to-extract-table-from-word-docx)

start = timer()

document = Document('DOCX_doc_example.docx')
table = document.tables[0]  # the first (and here only) table in the document

data = []

keys = None
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)

    # The first row holds the column headers
    if i == 0:
        keys = tuple(text)
        continue
    # Every other row becomes a dict mapping column header -> cell text
    row_data = dict(zip(keys, text))
    data.append(row_data)

df = pd.DataFrame(data)

end = timer()

The data table extracted is the following: 

In [9]:
df

Unnamed: 0,catalogue_id,title (scientific name or common name),class,family,genus,species
0,ZMB_AVES_2000-14765,Rhea americana,Aves,Rheidae,Rhea,Rhea americana
1,ZMB_AVES_2000-31350,Struthio camelus,Aves,Struthionidae,Struthio,Struthio camelus
2,ZMB_AVES_2000-31351,Struthio molybdophanes,Aves,Struthionidae,Struthio,Struthio molybdophanes
3,ZMB_AVES_2000-31795,Rhea americana,Aves,Rheidae,Rhea,Rhea americana
4,ZMB_AVES_2000-32382,Rhea pennata,Aves,Rheidae,Rhea,Rhea pennata
5,ZMB_AVES_2000-34658,Struthio camelus,Aves,Struthionidae,Struthio,Struthio camelus
6,ZMB_AVES_2000-34923,Rhea americana,Aves,Rheidae,Rhea,Rhea americana
7,ZMB_AVES_2000-632,Struthio camelus,Aves,Struthionidae,Struthio,Struthio camelus
8,ZMB_AVES_2000-752,Casuarius unappendiculatus,Aves,Casuariidae,Casuarius,Casuarius unappendiculatus
9,ZMB_AVES_2000-8516,Struthio camelus,Aves,Struthionidae,Struthio,Struthio camelus


The dataframe is well-formed and the extraction raised no issues. However, the data were extracted by looping over the table element of the DOCX file, one row and one cell at a time. Explicit loops of this kind are slow in Python compared with the optimised parsers available for machine-readable formats. With a table this small the time cost is negligible, but it would quickly become a problem for real-scale datasets with thousands of rows and tens of columns.

The time required by the loop can be measured with `default_timer` from Python's [timeit](https://docs.python.org/3/library/timeit.html) library (see the code cell above). The result is:

In [17]:
# Process time for extracting the data table from the DOCX file
print(str(end - start) + "s", "required to extract 312 data cells from a DOCX file") #computed in seconds

0.057238769994000904s required to extract 312 data cells from a DOCX file


In [18]:
# By contrast, reading the entire dataset from a machine-readable CSV file takes a comparable time
read_start = timer()
dataset_read_from_csv = pd.read_csv("dataset.csv")
read_end = timer()
print(str(read_end - read_start) + "s", "required to read in 199320 data cells (= 13288 rows × 15 columns) from a CSV file") #computed in seconds

0.08668960399518255s required to read in 199320 data cells (= 13288 rows × 15 columns) from a CSV file


Although the times remain manageable for both approaches in this case, the comparison is telling: the CSV file delivered roughly 640 times more data cells (199320 against 312) in only about 1.5 times the time. With datasets containing millions or billions of data cells, extraction from a DOCX file becomes increasingly expensive, and users will regret not having the data directly available in a machine-readable format like CSV.
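The gap between row-by-row extraction and a dedicated parser can also be reproduced without any DOCX file. The sketch below is an illustration under stated assumptions, not a benchmark of python-docx: it builds a synthetic table in memory (`csv_text`, with made-up column names `col0` ... `col14`), then builds a dataframe once with an explicit Python loop and once with `pd.read_csv`. Absolute timings depend on the machine, so none are quoted here.

```python
from io import StringIO
from timeit import default_timer as timer

import pandas as pd

# Synthetic table: 5000 rows x 15 columns of integers (assumed sizes)
n_rows, n_cols = 5000, 15
header = ",".join(f"col{j}" for j in range(n_cols))
lines = [",".join(str(i * n_cols + j) for j in range(n_cols)) for i in range(n_rows)]
csv_text = header + "\n" + "\n".join(lines) + "\n"

# Approach 1: explicit loop, one dict per row (mirrors the DOCX extraction above)
start = timer()
rows = csv_text.splitlines()
keys = rows[0].split(",")
records = [dict(zip(keys, row.split(","))) for row in rows[1:]]
df_loop = pd.DataFrame(records)
loop_seconds = timer() - start

# Approach 2: the dedicated CSV parser
start = timer()
df_fast = pd.read_csv(StringIO(csv_text))
read_csv_seconds = timer() - start

print(f"loop: {loop_seconds:.4f}s, read_csv: {read_csv_seconds:.4f}s")
```

Note that the loop leaves every value as a string, whereas `pd.read_csv` also infers the column types, so the two results are not equivalent until the loop output is converted.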

---

## 2: Machine-readable data formats for tabular data <a class="anchor" id="tabular"></a>



### 2.1: CSV, TSV <a class="anchor" id="csv"></a>

CSV files might not be the solution to all data problems, but they are definitely handy for delivering tabular data in a machine-readable format. For datasets with up to about one million rows, they should be the first data format considered.

In [21]:
# With the Python pandas library, reading a CSV file takes a single line of code
df_csv_comma_sep = pd.read_csv("dataset.csv")
df_csv_comma_sep.head(2) # first two rows of the dataset

Unnamed: 0,catalogue_id,key,scientific_name,title (scientific name or common name),class,family,genus,species,subspecies,collections,creation_year,absolute_url,copyright,license,authors
0,ZMB_AVES_2000-14765,7d1264cca2e93009237e,Rhea americana,Rhea americana,Aves,Rheidae,Rhea,Rhea americana,,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,"MfN Berlin, https://ror.org/052d1a351",CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
1,ZMB_AVES_2000-14765,46e9071723ddbababd06,Rhea americana,Rhea americana,Aves,Rheidae,Rhea,Rhea americana,,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,"MfN Berlin, https://ror.org/052d1a351",CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization


In [23]:
# A copy of the original dataset has been saved using the semicolon as a delimiter
df_csv_semicolon_sep = pd.read_csv("dataset_semicolon.csv")
df_csv_semicolon_sep.head(2) # pandas expects a comma as the separator by default, so every row is
                             # read into a single column; the result is wrong but easy to correct

Unnamed: 0,catalogue_id;key;scientific_name;title (scientific name or common name);class;family;genus;species;subspecies;collections;creation_year;absolute_url;copyright;license;authors
0,ZMB_AVES_2000-14765;7d1264cca2e93009237e;Rhea ...
1,ZMB_AVES_2000-14765;46e9071723ddbababd06;Rhea ...


In [24]:
# It is enough to specify the correct separator when reading in the data to import the dataset 
# without issues
df_csv_semicolon_sep = pd.read_csv("dataset_semicolon.csv", sep=";")
df_csv_semicolon_sep.head(2)

Unnamed: 0,catalogue_id,key,scientific_name,title (scientific name or common name),class,family,genus,species,subspecies,collections,creation_year,absolute_url,copyright,license,authors
0,ZMB_AVES_2000-14765,7d1264cca2e93009237e,Rhea americana,Rhea americana,Aves,Rheidae,Rhea,Rhea americana,,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,MfN Berlin; https://ror.org/052d1a351,CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
1,ZMB_AVES_2000-14765,46e9071723ddbababd06,Rhea americana,Rhea americana,Aves,Rheidae,Rhea,Rhea americana,,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,MfN Berlin; https://ror.org/052d1a351,CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
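When the delimiter of a file is not known in advance, it can often be guessed from a sample of the content. The following is a minimal sketch using `csv.Sniffer` from the Python standard library; the three-line `sample` string is a made-up excerpt in the style of the semicolon-delimited file, and the list of candidate delimiters is an assumption. Sniffing is a heuristic and can fail on unusual files, so the result should always be checked.

```python
import csv
from io import StringIO

import pandas as pd

# Hypothetical excerpt of a semicolon-delimited file
sample = (
    "catalogue_id;key;scientific_name\n"
    "ZMB_AVES_2000-14765;7d1264cca2e93009237e;Rhea americana\n"
    "ZMB_AVES_2000-31350;cd50680edb26a356d7f1;Struthio camelus\n"
)

# Restrict the guess to the delimiters we consider plausible
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")

# Pass the detected delimiter on to pandas
df_sniffed = pd.read_csv(StringIO(sample), sep=dialect.delimiter)
print(repr(dialect.delimiter), list(df_sniffed.columns))
```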


In [32]:
# The same approach works for reading the dataset in TSV format.

df_csv_tab_sep = pd.read_csv("dataset.tsv", sep="\t")
df_csv_tab_sep.head(2)

Unnamed: 0,catalogue_id,key,scientific_name,title (scientific name or common name),class,family,genus,species,subspecies,collections,creation_year,absolute_url,copyright,license,authors
0,ZMB_AVES_2000-14765,7d1264cca2e93009237e,Rhea americana,Rhea americana,Aves,Rheidae,Rhea,Rhea americana,,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,MfN Berlin; https://ror.org/052d1a351,CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
1,ZMB_AVES_2000-14765,46e9071723ddbababd06,Rhea americana,Rhea americana,Aves,Rheidae,Rhea,Rhea americana,,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,MfN Berlin; https://ror.org/052d1a351,CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization


In [35]:
# The pandas library can also read files (and even whole workbooks) in XLSX format.
# In this case, as the data table has been created properly, the dataset is recovered as well.
# However, pandas takes noticeably longer to read an XLSX file than a CSV file,
# with potential performance issues for large datasets.

df_xlsx = pd.read_excel("dataset.xlsx")
df_xlsx.head(2)

Unnamed: 0,catalogue_id,key,scientific_name,title (scientific name or common name),class,family,genus,species,subspecies,collections,creation_year,absolute_url,copyright,license,authors
0,ZMB_AVES_2000-14765,7d1264cca2e93009237e,Rhea americana,Rhea americana,Aves,Rheidae,Rhea,Rhea americana,,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,"MfN Berlin, https://ror.org/052d1a351",CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
1,ZMB_AVES_2000-14765,46e9071723ddbababd06,Rhea americana,Rhea americana,Aves,Rheidae,Rhea,Rhea americana,,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,"MfN Berlin, https://ror.org/052d1a351",CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization


WARNING: The chances of creating non-machine-readable files are much higher when working with spreadsheet software like Excel than when dealing directly with the CSV data format.
Here is an example of the birds dataset formatted in Excel with added descriptions, empty cells, etc.
![Screenshot of the birds dataset in Excel with added descriptions and empty cells](invalid_dataset.png)

In [46]:
# The invalid dataset is read in without errors or warnings, but recovering the data requires a lengthy clean-up
# of all the empty cells and of the cells that contain the dataset description.
df_xlsx = pd.read_excel("dataset_invalid_format.xlsx")
df_xlsx

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17
0,,Dataset title,,,,,,,,,,,,,,,,
1,,Mounted Specimens of the Historical Bird Colle...,,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,,,,
3,,Dataset Description/Abstract,,,Data license,,,,,,,,,,,,,
4,,The dataset contains metadata for over 13000 i...,The images are available under a free license ...,,CC0 Public Domain Dedication,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13293,,,,ZMB_Aves_9985,ca892e3d44beef9ca4dc,Psittacula krameri borealis,Psittacula krameri borealis,Aves,Psittacidae,Psittacula,Psittacula krameri,Psittacula krameri borealis,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,"MfN Berlin, https://ror.org/052d1a351",CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
13294,,,,ZMB_Aves_999,e76c936cca035eb8557d,Falco biarmicus tanypterus,Falco biarmicus tanypterus,Aves,Falconidae,Falco,Falco biarmicus,Falco biarmicus tanypterus,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,"MfN Berlin, https://ror.org/052d1a351",CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
13295,,,,ZMB_Aves_999,e5672b9636cec6d995e8,Falco biarmicus tanypterus,Falco biarmicus tanypterus,Aves,Falconidae,Falco,Falco biarmicus,Falco biarmicus tanypterus,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,"MfN Berlin, https://ror.org/052d1a351",CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
13296,,,,ZMB_Aves_9994,6ac8193d5913aa55de96,Psittacula longicauda longicauda,Psittacula longicauda longicauda,Aves,Psittacidae,Psittacula,Psittacula longicauda,Psittacula longicauda longicauda,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,"MfN Berlin, https://ror.org/052d1a351",CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
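To illustrate what such a clean-up involves, here is a minimal sketch on a small synthetic dataframe (`raw`) that mimics the layout above: description cells at the top, empty rows and columns, and the real header row somewhere in the middle. The layout, the shortened title and the assumption that the header row can be recognised by the label `catalogue_id` are all illustrative; the actual Excel file may need different rules.

```python
import pandas as pd

# Synthetic stand-in for a badly formatted spreadsheet (assumed layout)
raw = pd.DataFrame([
    [None, "Dataset title", None, None, None],
    [None, "Mounted Specimens (title shortened for this sketch)", None, None, None],
    [None, None, None, None, None],
    [None, None, "catalogue_id", "family", "genus"],
    [None, None, "ZMB_AVES_2000-14765", "Rheidae", "Rhea"],
    [None, None, "ZMB_AVES_2000-31350", "Struthionidae", "Struthio"],
])

# 1. Locate the row that holds the real column headers
is_header = raw.eq("catalogue_id").any(axis=1)
header_idx = int(is_header.idxmax())
header_row = raw.iloc[header_idx]

# 2. Keep only the columns that have a header label
keep_cols = header_row.notna().to_numpy()

# 3. Keep only the rows below the header and attach the labels
table = raw.iloc[header_idx + 1:].loc[:, keep_cols].copy()
table.columns = header_row[keep_cols].tolist()

# 4. Drop fully empty rows and renumber
table = table.dropna(how="all").reset_index(drop=True)
print(table)
```

Every step encodes a guess about how the sheet was laid out by hand, which is why the same data delivered as a plain CSV file are so much easier to reuse.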


In [43]:
# A CSV file is valid even when it does not have column headers.
# When the headers are missing, however, the user needs to check that the data analysis software
# interprets the first row as a data row and not as the table header.
df_csv_no_heading = pd.read_csv("dataset_no_heading.csv", header=None) # header=None prevents the first
                                                                        # row being used as the table header
df_csv_no_heading.head(2)
# When the headers are missing, pandas simply labels the data columns with integers.

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,ZMB_AVES_2000-14765,7d1264cca2e93009237e,Rhea americana,Rhea americana,Aves,Rheidae,Rhea,Rhea americana,,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,"MfN Berlin, https://ror.org/052d1a351",CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
1,ZMB_AVES_2000-14765,46e9071723ddbababd06,Rhea americana,Rhea americana,Aves,Rheidae,Rhea,Rhea americana,,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,"MfN Berlin, https://ror.org/052d1a351",CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
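If the column names are known from the dataset documentation, they can be supplied at read time instead of leaving integer labels. A minimal sketch, assuming a shortened three-column headerless file held in memory (`headerless_csv`) and a hand-written list of names (`column_names`):

```python
from io import StringIO

import pandas as pd

# Hypothetical headerless CSV content with three of the dataset's columns
headerless_csv = (
    "ZMB_AVES_2000-14765,Rheidae,Rhea\n"
    "ZMB_AVES_2000-31350,Struthionidae,Struthio\n"
)

# Column names taken from documentation, not from the file itself
column_names = ["catalogue_id", "family", "genus"]

# header=None keeps the first row as data; names= labels the columns
df_named = pd.read_csv(StringIO(headerless_csv), header=None, names=column_names)
print(df_named)
```

The names still have to come from somewhere outside the file, so a header row remains the safer choice when publishing data.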


### 2.2: TXT <a class="anchor" id="txt"></a>

TXT files should be the preferred machine-readable format for unstructured, unannotated text that needs to be further analysed or mined. As an example, let's consider the text extracted from the DOCX file in [Section 1.2](#docx).

In [49]:
# Reading the file content (the encoding is given explicitly because the text contains "ü")
with open("text_extracted.txt", "r", encoding="utf-8") as f:
    content = f.read()

print(content)

Dataset title
Mounted Specimens of the Historical Bird Collection at the Museum für Naturkunde Berlin

Creator
MfN Digitization

License
CC0 1.0 Creative Commons Public Domain Dedication

Dataset Description
The dataset contains metadata for over 13000 images of mounted bird specimens belonging to the bird collection of the Museum für Naturkunde Berlin (MfN). The mounted specimens were mostly collected in the 19th century and have now been systematically photographed. For large part of the mounted specimens the full taxonomy is available. When this is not the case, the specimens are identified with their German common names. In the dataset there are birds from the well-known families of pigeons (Columbidae), parrots (Psittacidae) and pheasants (Phasianidae) as well as duck birds (Anatidae) and exotic hummingbirds (Trochilidae). Some rare specimens such as the quetzal (Pharomachrus mocinno) are also available in this dataset. 

Keywords
Birds, mounted specimens, Museum für Naturkunde Be

In [50]:
type(content) # The extracted text is treated as a string

str

Let's now focus on the dataset description. As we are working with plain text, there is no machine-readable indicator of where this section starts and finishes. We can only extract it by relying on what we know of the original file: the dataset description is the text that follows the heading "Dataset Description" and ends before the next heading, "Keywords".

In [60]:
# A possible way to extract the required text is to use the headings to split the text and
# then select the relevant part
partition1 = "Dataset Description" # Heading that precedes the required text
words_after_heading1 = content.split(partition1, 1)[1] # Keeping only the text after the first heading
partition2 = "Keywords" # Heading that follows the required text
words_before_heading2 = words_after_heading1.split(partition2, 1)[0] # Keeping only the text before the second heading
print(words_before_heading2) # Checking that the variable contains the required text (it does)


The dataset contains metadata for over 13000 images of mounted bird specimens belonging to the bird collection of the Museum für Naturkunde Berlin (MfN). The mounted specimens were mostly collected in the 19th century and have now been systematically photographed. For large part of the mounted specimens the full taxonomy is available. When this is not the case, the specimens are identified with their German common names. In the dataset there are birds from the well-known families of pigeons (Columbidae), parrots (Psittacidae) and pheasants (Phasianidae) as well as duck birds (Anatidae) and exotic hummingbirds (Trochilidae). Some rare specimens such as the quetzal (Pharomachrus mocinno) are also available in this dataset. 
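The same idea can be generalised to all the headings at once. The sketch below is one possible approach, not the method used in the Guide: the helper `split_by_headings` is a hypothetical function that treats a heading as a match only when it occupies a whole line, which reduces the risk of splitting on a heading word that also occurs inside the body text. It still relies on knowing the list of headings beforehand.

```python
import re


def split_by_headings(text, headings):
    """Return {heading: body} for headings that occupy a full line of `text`."""
    alternatives = "|".join(re.escape(h) for h in headings)
    pattern = re.compile(rf"^({alternatives})[ \t]*$", re.MULTILINE)
    matches = list(pattern.finditer(text))
    sections = {}
    # Each body runs from the end of its heading to the start of the next one
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        sections[current.group(1)] = text[current.end():end].strip()
    return sections


# Shortened sample in the same layout as the extracted text
sample_text = (
    "Dataset title\n"
    "Mounted Specimens of the Historical Bird Collection at the Museum für Naturkunde Berlin\n"
    "\n"
    "Creator\n"
    "MfN Digitization\n"
    "\n"
    "License\n"
    "CC0 1.0 Creative Commons Public Domain Dedication\n"
)

sections = split_by_headings(sample_text, ["Dataset title", "Creator", "License"])
print(sections["Creator"])
```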




If the text document had been provided with XML tags for the headings, it would have been much easier to extract the portion of text related to the dataset description. For instance, given an XML-tagged file like text_extracted.xml, in which each heading and the body text that follows it are tagged, it is possible to proceed as follows:

In [82]:
# Reading in the XML file (the file declares UTF-8 encoding)
with open("text_extracted.xml", encoding="utf-8") as f:
    xml = f.read()
xml # checking that the file has been read properly


'<?xml version=\'1.0\' encoding=\'utf-8\'?>\n<doc:data xmlns:doc="https://example.com">\n  <doc:row>\n    <doc:heading>Dataset Title</doc:heading>\n    <doc:body>Mounted Specimens of the Historical Bird Collection at the Museum für Naturkunde Berlin</doc:body>\n  </doc:row>\n   <doc:row>\n    <doc:heading>Creator</doc:heading>\n    <doc:body>MfN Digitization</doc:body>\n  </doc:row>\n   <doc:row>\n    <doc:heading>License</doc:heading>\n    <doc:body>CC0 1.0 Creative Commons Public Domain Dedication</doc:body>\n  </doc:row>\n  <doc:row>\n    <doc:heading>Dataset Description</doc:heading>\n    <doc:body>The dataset contains metadata for over 13000 images of mounted bird specimens belonging to the bird collection of the Museum für Naturkunde Berlin (MfN). The mounted specimens were mostly collected in the 19th century and have now been systematically photographed. For large part of the mounted specimens the full taxonomy is available. When this is not the case, the specimens are identifi

In [87]:
# The text can be automatically transformed into a dataframe using the XML tags
df = pd.read_xml(StringIO(xml))
df

Unnamed: 0,heading,body
0,Dataset Title,Mounted Specimens of the Historical Bird Colle...
1,Creator,MfN Digitization
2,License,CC0 1.0 Creative Commons Public Domain Dedication
3,Dataset Description,The dataset contains metadata for over 13000 i...
4,Keywords,"Birds, mounted specimens, Museum für Naturkund..."
5,Table 1,"Shortened version of dataset, “Mounted Specime..."


In [93]:
# The dataset description is immediately available as a string in this case
df.loc[df["heading"] == "Dataset Description", "body"].iloc[0]

'The dataset contains metadata for over 13000 images of mounted bird specimens belonging to the bird collection of the Museum für Naturkunde Berlin (MfN). The mounted specimens were mostly collected in the 19th century and have now been systematically photographed. For large part of the mounted specimens the full taxonomy is available. When this is not the case, the specimens are identified with their German common names. In the dataset there are birds from the well-known families of pigeons (Columbidae), parrots (Psittacidae) and pheasants (Phasianidae) as well as duck birds (Anatidae) and exotic hummingbirds (Trochilidae). Some rare specimens such as the quetzal (Pharomachrus mocinno) are also available in this dataset.'
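The tagged text can also be queried without building a dataframe. As a minimal sketch, the standard-library `xml.etree.ElementTree` module can look up elements by tag name. The `xml_text` string below is a shortened, hand-written stand-in for text_extracted.xml that reuses the `doc` namespace shown in the output above; the XML declaration is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the tagged file (two rows only)
xml_text = """<doc:data xmlns:doc="https://example.com">
  <doc:row>
    <doc:heading>Creator</doc:heading>
    <doc:body>MfN Digitization</doc:body>
  </doc:row>
  <doc:row>
    <doc:heading>License</doc:heading>
    <doc:body>CC0 1.0 Creative Commons Public Domain Dedication</doc:body>
  </doc:row>
</doc:data>"""

# Map the prefix used in the queries to the namespace URI of the file
ns = {"doc": "https://example.com"}
root = ET.fromstring(xml_text)

# Build a {heading: body} lookup directly from the tags
bodies = {
    row.findtext("doc:heading", namespaces=ns): row.findtext("doc:body", namespaces=ns)
    for row in root.findall("doc:row", ns)
}
print(bodies["License"])
```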

### 2.3: XML <a class="anchor" id="xml"></a>

The content of an XML (eXtensible Markup Language) file is a combination of tags, which logically structure the content, and the data themselves. A version of the birds dataset in XML format will be used to illustrate the main features of the XML file format and the additional features, like validation and comments, that it offers compared to a CSV file.

In [98]:
df_xml = pd.read_xml("dataset.xml")
df_xml

Unnamed: 0,index,catalogue_id,key,scientific_name,title_scientific_name_or_common_name,class,family,genus,species,subspecies,collections,creation_year,absolute_url,copyright,license,authors
0,0,ZMB_AVES_2000-14765,7d1264cca2e93009237e,Rhea americana,Rhea americana,Aves,Rheidae,Rhea,Rhea americana,,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,"MfN Berlin, https://ror.org/052d1a351",CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
1,1,ZMB_AVES_2000-14765,46e9071723ddbababd06,Rhea americana,Rhea americana,Aves,Rheidae,Rhea,Rhea americana,,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,"MfN Berlin, https://ror.org/052d1a351",CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
2,2,ZMB_AVES_2000-31350,cd50680edb26a356d7f1,Struthio camelus,Struthio camelus,Aves,Struthionidae,Struthio,Struthio camelus,,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,"MfN Berlin, https://ror.org/052d1a351",CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
3,3,ZMB_AVES_2000-31350,1af8bb8651519838f87c,Struthio camelus,Struthio camelus,Aves,Struthionidae,Struthio,Struthio camelus,,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,"MfN Berlin, https://ror.org/052d1a351",CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
4,4,ZMB_AVES_2000-31351,b2d8c2a150d82453d35c,Struthio molybdophanes,Struthio molybdophanes,Aves,Struthionidae,Struthio,Struthio molybdophanes,,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,"MfN Berlin, https://ror.org/052d1a351",CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13283,13283,ZMB_Aves_9985,ca892e3d44beef9ca4dc,Psittacula krameri borealis,Psittacula krameri borealis,Aves,Psittacidae,Psittacula,Psittacula krameri,Psittacula krameri borealis,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,"MfN Berlin, https://ror.org/052d1a351",CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
13284,13284,ZMB_Aves_999,e76c936cca035eb8557d,Falco biarmicus tanypterus,Falco biarmicus tanypterus,Aves,Falconidae,Falco,Falco biarmicus,Falco biarmicus tanypterus,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,"MfN Berlin, https://ror.org/052d1a351",CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
13285,13285,ZMB_Aves_999,e5672b9636cec6d995e8,Falco biarmicus tanypterus,Falco biarmicus tanypterus,Aves,Falconidae,Falco,Falco biarmicus,Falco biarmicus tanypterus,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,"MfN Berlin, https://ror.org/052d1a351",CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
13286,13286,ZMB_Aves_9994,6ac8193d5913aa55de96,Psittacula longicauda longicauda,Psittacula longicauda longicauda,Aves,Psittacidae,Psittacula,Psittacula longicauda,Psittacula longicauda longicauda,Birds,2023,https://portal.museumfuernaturkunde.berlin/det...,"MfN Berlin, https://ror.org/052d1a351",CC0 1.0 Creative Commons Public Domain Dedication,MfN Digitization
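Two of the features mentioned at the start of this section, comments and validation, can be illustrated with the Python standard library alone. The sketch below is limited in scope: it only checks that a document is well-formed (every tag properly closed and nested), which is the weakest form of validation; checking against a schema needs additional tooling that is not shown here. The two XML strings and the helper `is_well_formed` are made up for the example.

```python
import xml.etree.ElementTree as ET

# A well-formed document with a comment describing the content
well_formed = """<data>
  <!-- One row per image; this comment is skipped by the default parser -->
  <row><catalogue_id>ZMB_AVES_2000-14765</catalogue_id><genus>Rhea</genus></row>
</data>"""

# A broken document: the <genus> element is never closed
broken = "<data><row><genus>Rhea</row></data>"


def is_well_formed(xml_text):
    """Return True if `xml_text` parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(well_formed), is_well_formed(broken))

# The comment does not appear among the parsed elements
root = ET.fromstring(well_formed)
print([child.tag for child in root])
```

A CSV file has no equivalent mechanism: a missing delimiter silently shifts values into the wrong column, whereas a missing closing tag makes the XML parser stop with an error.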


In [None]:
df = pd.read_csv("datas")