![](../images/logos.jpg "MiCMOR, KIT Campus Alpin")

**[MiCMOR](https://micmor.kit.edu) [SummerSchool "Environmental Data Science: From Data Exploration to Deep Learning"](https://micmor.kit.edu/sites/default/files/MICMoR%20Summer%20School%202019%20Flyer.pdf)**  
IMK-IFU KIT Campus Alpin, Sept. 4 - 13 2019, Garmisch-Partenkirchen, Germany.

---

# Data formats 

... an overview of formats that you might come across and when you might use them. 

## Text-based formats
### Tabular text files: csv, tsv, txt

**What to use:** `pandas` (`pd.read_csv()`)

I'd say this is the most common file format you will encounter. It is also potentially messy, since there is no single true schema (there are actually many variants). Luckily, pandas has a pretty great `read_csv()` function that can handle pretty much any variant you might come across (it's so versatile that it takes about 50 arguments!).

We cannot possibly cover all aspects, but some of the more common arguments (at least for me) are:
- specifying the separator (`sep=`)
- splitting at any kind of whitespace (`delim_whitespace=True`; newer pandas versions deprecate this in favour of `sep=r"\s+"`)
- specifying that there's no header (`header=None`)
- giving column names (`names=["Col1","Col2",...]`; if the file has no column names yet, don't forget to specify `header=None`, too)
- using a column to set the index (`index_col='myIndexCol'`)
- specifying which values should be considered missing/ NaN (`na_values=[-9999, 'na', 'none']`) - this can also be done per column if you pass a dictionary that maps column names to their individual NaN values
- skipping rows (`skiprows=5`)
- converting date columns to an actual date type during import (`parse_dates=True` to parse the index, or `parse_dates=['DateCol1', 'DateCol2']`) - you can also specify a custom date/time parser function (or, in newer pandas versions, a `date_format` string) if pandas has trouble detecting the correct format
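
As a rough sketch of how several of these arguments combine, the snippet below reads a semicolon-separated table with per-column missing-value markers and a date column. The file content, column names and values are made up for illustration, and `io.StringIO` simply stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical file content; io.StringIO stands in for a file on disk.
raw = io.StringIO(
    "station;date;temp;precip\n"
    "A;2019-09-04;12.5;-9999\n"
    "B;2019-09-05;na;3.2\n"
)

df = pd.read_csv(
    raw,
    sep=";",                                         # semicolon-separated
    index_col="station",                             # use a column as index
    na_values={"temp": ["na"], "precip": [-9999]},   # per-column missing values
    parse_dates=["date"],                            # convert to a datetime type
)

print(df)
print(df.dtypes)
```

Printing `df.dtypes` after the import is a quick way to check that numbers and dates were not silently read as strings.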

**Tip:** Since formats are so variable, it's often a good idea to peek into the first 10 to 100 lines of a file. A fast and easy way to do this in Jupyter is to issue a bash command inline like so:


In [None]:
! head -n 10 data/testfile.csv
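
If `head` is not available (for example on Windows), a few lines of plain Python can do the same job. This is only a sketch: the helper name `peek` is made up, and the demo file is created on the fly so the snippet runs on its own.

```python
import tempfile
from itertools import islice
from pathlib import Path


def peek(path, n=10):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Demo on a throwaway file with hypothetical content
demo = Path(tempfile.mkdtemp()) / "testfile.csv"
rows = "\n".join(f"{i},{i * 2}" for i in range(100))
demo.write_text("id,value\n" + rows + "\n", encoding="utf-8")

for line in peek(demo, n=3):
    print(line)
```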

A small example:
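
The sketch below reads a typical "messy" text file: a couple of comment lines at the top, no header row, and columns separated by a varying number of spaces. The content is invented for illustration and `io.StringIO` stands in for a real file; `sep=r"\s+"` is used here to split at any run of whitespace.

```python
import io

import pandas as pd

# Hypothetical logger output: two comment lines, no header row, columns
# separated by a varying number of spaces.
raw = io.StringIO(
    "# station: demo\n"
    "# units: degC, mm\n"
    "2019-09-04   12.5    0.0\n"
    "2019-09-05   14.1  -9999\n"
)

df = pd.read_csv(
    raw,
    sep=r"\s+",                        # split at any run of whitespace
    skiprows=2,                        # skip the two comment lines
    header=None,                       # the file has no column names ...
    names=["date", "temp", "precip"],  # ... so we provide them
    na_values=[-9999],                 # treat -9999 as missing
    parse_dates=["date"],              # convert to a datetime type
    index_col="date",                  # and use the dates as index
)

print(df)
```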

### XML

**What to use:** `xml.etree.ElementTree` (or: `lxml`)

The Extensible Markup Language (XML) usually encodes a document or data structure. It is organized like a tree and thus supports hierarchies. Usually a schema defines what a given XML document can contain (like a database schema). The basic building blocks to encapsulate data are elements and attributes. You can have elements that enclose text (`<myelement>text</myelement>`) or elements without text (`<myelement />`). Often, you also have attributes in an element (`<myelement attribute1="A" another_attr="2"/>`).

A nested XML structure might look like this:

```xml
<person sex="female">
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>
```
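
To make the difference between elements and attributes concrete, here is a minimal sketch that parses the `person` snippet from a string with the standard library's `ElementTree` and reads one attribute and one element text. The variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

xml_string = """
<person sex="female">
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>
"""

# fromstring() parses XML from a string and returns the root element
person = ET.fromstring(xml_string)

print(person.tag)                       # name of the element
print(person.attrib["sex"])             # value of an attribute
print(person.find("firstname").text)    # text enclosed by a child element
print([child.tag for child in person])  # names of all direct children
```
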
As a small example, we use data that will come up again in the Deep Learning lessons: let's read one of the image description XML files that exist for each image of the ImageCLEF 2013 plant classification challenge.

In [5]:
from pathlib import Path
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

# get xml files in dir and return first one from generator using next()
file_path = next(Path('../data/imageclef_2013_sample').glob('*.xml'))
file_path

PosixPath('../data/imageclef_2013_sample/3.xml')

Again, if we do not have a clear picture of what is actually in the file, it's a good idea to print it.


In [6]:
# Note that we can use python variables in a bash command with the curly brackets !!!
! cat {file_path}

<?xml version="1.0" encoding="UTF-8"?>
<Image>
  <FileName>3.jpg</FileName>
  <IndividualPlantId>742</IndividualPlantId>
  <Date>07/07/10</Date>
  <Locality>France - Montpellier</Locality>
  <GPSLocality>
    <Longitude>3.876716</Longitude>
    <Latitude>43.610769</Latitude>
  </GPSLocality>
  <Author>Jean-Francois Molino</Author>
  <Organization>Tela Botanica</Organization>
  <Type>SheetAsBackground</Type>
  <Content>Leaf</Content>
  <ClassId>Rhamnus alaternus</ClassId>
  <Taxon>
    <Regnum>Plantae</Regnum>
    <Class>Equisetopsida C. Agardh</Class>
    <Subclass>Magnoliidae Novák ex Takht.</Subclass>
    <Superorder>Rosanae Takht.</Superorder>
    <Order>Rosales Bercht. &amp; J. Presl</Order>
    <Family>Rhamnaceae Juss.</Family>
    <Genus>Rhamnus L.</Genus>
    <Species>Rhamnus alaternus L.</Species>
  </Taxon>
  <VernacularNames>Italian buckthorn</VernacularNames>
  <Year>ImageCLEF2011</Year>
  <IndividualPlantId2012>201</IndividualPlantId2012>
  <ImageID2012>4411.jpg</ImageID201

As we can see this schema apparently does not use any attributes, just elements...

In [7]:
# parse the entire file
xml = ET.parse(file_path)

# get the document root element
root = xml.getroot()

print(root)

<Element 'Image' at 0x7f3bed16f228>


We cannot do a full XML tutorial here, but you can either iterate through elements or you can search for a specific one.
The following code searches for the first occurrence of the subelement `Family` and returns the text enclosed by the element
(if it had an attribute, we could query it using `.attrib['attributeName']`).

In [8]:
root.find('./Taxon/Family').text

'Rhamnaceae Juss.'
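
Both approaches, searching and iterating, can be sketched as follows. To keep the snippet self-contained it parses a shortened copy of the file from a string instead of reading it from disk; the `Tribe` element is deliberately one that does not exist, to show the `default=` fallback.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the image description, inlined so the snippet runs on its own
xml_string = """
<Image>
  <FileName>3.jpg</FileName>
  <GPSLocality>
    <Longitude>3.876716</Longitude>
    <Latitude>43.610769</Latitude>
  </GPSLocality>
  <Taxon>
    <Family>Rhamnaceae Juss.</Family>
    <Genus>Rhamnus L.</Genus>
  </Taxon>
</Image>
"""
root = ET.fromstring(xml_string)

# Search: findtext() returns the text directly, or a default if nothing matches
family = root.findtext("./Taxon/Family")
missing = root.findtext("./Taxon/Tribe", default="n/a")

# XML text is always a string, so numbers have to be converted explicitly
lon = float(root.findtext("./GPSLocality/Longitude"))
lat = float(root.findtext("./GPSLocality/Latitude"))

# Iterate: collect all direct children of <Taxon> in a dictionary
taxon = {child.tag: child.text for child in root.find("Taxon")}

print(family, missing, lon, lat)
print(taxon)
```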



**Note:** 
- Using the package `xmltodict` (not shown) we can also read in an XML file and convert it to a nested Python dictionary. We could then pass it on and convert it into a JSON structure (which basically is a nested dictionary).
- If our file had many entries of *Image*, one would iterate over them like this:

```python
for child in root:
    print(child.tag, child.attrib, child.text)
```
- More on using ElementTree [here](https://docs.python.org/3/library/xml.etree.elementtree.html).
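
To illustrate the XML-to-dictionary-to-JSON idea without an extra dependency, here is a hand-rolled sketch using only the standard library. It is not how `xmltodict` works internally and it is deliberately simplified: the helper `element_to_dict` is a made-up name, it ignores attributes, and repeated tags would overwrite each other.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Recursively convert an Element to nested dicts (simplified sketch).

    Leaf elements become their text. Attributes are ignored and repeated
    tags overwrite each other, which real converters handle more carefully.
    """
    children = list(elem)
    if not children:
        return elem.text
    return {child.tag: element_to_dict(child) for child in children}


# Small inline document so the snippet runs on its own
root = ET.fromstring(
    "<Image><FileName>3.jpg</FileName>"
    "<GPSLocality><Longitude>3.876716</Longitude>"
    "<Latitude>43.610769</Latitude></GPSLocality></Image>"
)

as_dict = {root.tag: element_to_dict(root)}
print(json.dumps(as_dict, indent=2))
```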

### JSON

JavaScript Object Notation (JSON) was inspired by a subset of the JavaScript programming language dealing with object literal syntax but nowadays has its own standard. JSON supports primitive types, like strings and numbers, as well as nested lists and objects.

An example looks like this:
```JSON
{
    "firstName": "Jane",
    "lastName": "Doe",
    "hobbies": ["running", "sky diving", "singing"],
    "age": 35,
    "children": [
        {
            "firstName": "Alice",
            "age": 6
        },
        {
            "firstName": "Bob",
            "age": 8
        }
    ]
}
```
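
When parsed with Python's `json` module, each JSON type maps to a built-in Python type. The short sketch below prints that mapping for a variation of the example above; the keys `height`, `married` and `middleName` are added here purely for illustration.

```python
import json

json_string = """
{
    "firstName": "Jane",
    "age": 35,
    "height": 1.72,
    "hobbies": ["running", "sky diving"],
    "married": true,
    "middleName": null
}
"""
person = json.loads(json_string)

# Print key, resulting Python type, and value
for key, value in person.items():
    print(f"{key:12} {type(value).__name__:10} {value!r}")
```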

You can read in a JSON file like this:

In [9]:
import json

# dumping (writing) a json file; the content is just a small sample
with open("data_file.json", "w") as write_file:
    json.dump({"firstName": "Jane", "lastName": "Doe", "age": 35}, write_file)

# reading a json file
with open("data_file.json", "r") as read_file:
    data = json.load(read_file)

Most often, you'd probably pull JSON files from the web. Here is a small example of how to do this with `urllib.request` from the standard library (the third-party `requests` package is a popular alternative).

In [10]:
import urllib.request
import json 
with urllib.request.urlopen("http://maps.googleapis.com/maps/api/geocode/json?address=google") as url:
    data = json.loads(url.read().decode())
    print(data)

{'error_message': 'You must use an API key to authenticate each request to Google Maps Platform APIs. For additional information, please refer to http://g.co/dev/maps-no-account', 'results': [], 'status': 'REQUEST_DENIED'}


Now, this request is actually denied since we did not supply an API key for this Google API (note that the error message itself is encoded in JSON as well). Moving on...

If you already have a JSON structure in a string (because you pulled it in from the web etc.):

In [23]:
import json

json_string = """
{
    "researcher": {
        "name": "Ford Prefect",
        "species": "Betelgeusian",
        "relatives": [
            {
                "name": "Zaphod Beeblebrox",
                "species": "Betelgeusian"
            }
        ]
    }
}
"""
data = json.loads(json_string)
data

{'researcher': {'name': 'Ford Prefect',
  'species': 'Betelgeusian',
  'relatives': [{'name': 'Zaphod Beeblebrox', 'species': 'Betelgeusian'}]}}
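
Once parsed, the result is an ordinary nested structure of dictionaries and lists, so the usual indexing applies. A small sketch using the same data (inlined again so it runs on its own; the `planet` key is intentionally absent to show the fallback):

```python
import json

data = json.loads(
    '{"researcher": {"name": "Ford Prefect", "species": "Betelgeusian", '
    '"relatives": [{"name": "Zaphod Beeblebrox", "species": "Betelgeusian"}]}}'
)

# Objects are indexed by key, arrays by position
first_relative = data["researcher"]["relatives"][0]["name"]
print(first_relative)

# .get() avoids a KeyError when a key may be absent
planet = data["researcher"].get("planet", "unknown")
print(planet)

# Back to text: indent= pretty-prints, sort_keys= gives a stable ordering
text = json.dumps(data, indent=2, sort_keys=True)
print(text)
```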

## Binary formats
### Excel 🤨

Ok, **you really should be using csv files instead**. However, if you need to import Excel files, pandas can do it with `pd.read_excel()` (if you also install an engine: `openpyxl` for `.xlsx` files, `xlrd` for legacy `.xls` files).

### pkl
### netcdf3/4
### sql
### hdf5/ h5
### npy/ npz
## Geographic data
### Shapefiles
### Geotiffs
## Other
### parquet
### feather