# Geodata Analysis 2


## Session: Big Data 3 - Module 1 - English language

## What is "big data"? Data formats and how to work with them

Approx. 20-30 minutes

Additional installations via *Anaconda Prompt* (Windows) or *Terminal/Shell* (macOS/Linux):
    
```conda install -c anaconda h5py```

```conda install -c conda-forge dask netCDF4 bottleneck```

```pip uninstall pyproj && pip install pyproj```

## Overview of "big data"

Source: [Wikipedia](https://en.wikipedia.org/wiki/Big_data)

<img width=700 style="float: left;" src="images/digital_age.png">

### What is "big data"?

- Definition of **big data** in [Oxford English Dictionary](https://www.lexico.com/definition/big_data): "*Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions.*"

- For example, data sets that are extremely large, or that can grow extremely large over time, are called **big data**

- In practice, there is no conclusive definition of **big data**

- Often, **big data** is simply understood to mean handling **large data sets**

### Where are large data sets located?

- In most cases, large data sets are stored in databases and distributed across networks

- Centralized data storage is also possible, e.g., on a server

- Examples: Google Search, Facebook, Twitter, etc.

### What is the format of large data sets?

- **structured**: A fixed specification of fields and positions, e.g. similar to a *pandas* *DataFrame*.

- **unstructured**: Distributed as individual files in different formats on storage media, e.g. as different files in a folder structure.

- **quasi-structured**: No fixed specification, but still an ordered data structure, e.g. [Extensible Markup Language (XML)](https://en.wikipedia.org/wiki/XML)
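The difference between structured and quasi-structured data can be illustrated with a small sketch. The station records below are invented for illustration only: the CSV text has the same fields in every row, whereas the XML records are ordered and tagged but do not all carry the same fields.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: every record has the same fields in the same positions.
csv_text = "station,latitude,temperature\nA,48.1,12.5\nB,52.5,9.8\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Quasi-structured: ordered and tagged, but records may carry different fields.
xml_text = """
<stations>
  <station name="A"><latitude>48.1</latitude><temperature>12.5</temperature></station>
  <station name="B"><temperature>9.8</temperature><note>sensor replaced</note></station>
</stations>
""".strip()
root = ET.fromstring(xml_text)
records = [
    {"name": station.get("name"), **{child.tag: child.text for child in station}}
    for station in root.findall("station")
]

print(rows[0])     # same keys for every row
print(records[1])  # keys differ from record to record
```

Note that both parsers return all values as strings; converting them to numbers is left to the user.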

### What is the goal of "big data"?

The goal of **big data** is to improve **data-based decisions**.

### Alternative definition of "big data"

The handling of data sets that are too large to fit completely into working memory (RAM) for processing.
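One common way of dealing with such data is to process it in chunks. The following is a minimal sketch using only *NumPy*, assuming the data sits in a flat binary file; the file name, the array size and the chunk size are hypothetical, and the array is kept small here so that the example runs quickly.

```python
import os
import tempfile

import numpy as np

# Write an example array to disk (small here; imagine it being larger than RAM).
path = os.path.join(tempfile.mkdtemp(), "values.dat")
n_values = 1_000_000
rng = np.random.default_rng(seed=0)
rng.random(n_values).tofile(path)

# Map the file into memory without reading it; data is read only when sliced.
values = np.memmap(path, dtype="float64", mode="r", shape=(n_values,))

# Accumulate the sum chunk by chunk so only one chunk is in RAM at a time.
chunk_size = 100_000  # hypothetical chunk size
total = 0.0
for start in range(0, n_values, chunk_size):
    total += float(values[start:start + chunk_size].sum())

chunked_mean = total / n_values
print(chunked_mean)
```

The same idea (read a part, reduce it, move on) is what packages such as *dask* automate for much larger arrays.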

### Overview of data handling and transfer speed dependency

<img width=700 style="float: left;" src="images/data_transfer.png">

### How do we deal with "big data"?

In Geodata Analysis 2, two basic options are presented:

1 - **Python packages** for handling datasets.

2 - **Google Earth Engine** for cloud-based datasets.

## Common data formats

- **Text formats**: Saved as human-readable characters (e.g. ASCII), e.g. csv, html, xml, json, etc.

- **Binary formats**: Saved as binary data, e.g. xls, xlsx, SQL databases, netCDF, HDF, etc.
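To get a feeling for the difference, the sketch below writes the same random array once as text (csv) and once as binary. As an assumption for this illustration, *NumPy*'s own `.npy` format stands in for the binary formats listed above; the file names are hypothetical.

```python
import os
import tempfile

import numpy as np

data = np.random.default_rng(seed=0).random((1000, 20))

folder = tempfile.mkdtemp()
text_path = os.path.join(folder, "data.csv")
binary_path = os.path.join(folder, "data.npy")

np.savetxt(text_path, data, delimiter=",")  # text: readable in any editor
np.save(binary_path, data)                  # binary: compact, not human-readable

text_size = os.path.getsize(text_path)
binary_size = os.path.getsize(binary_path)
print(f"text: {text_size} bytes, binary: {binary_size} bytes")

# Reading the binary file back restores the array exactly.
restored = np.load(binary_path)
```

With the default settings of `np.savetxt`, the text file is noticeably larger than the binary file, because every number is written out as a string of characters.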

### Examples of data formats

<img width=700 style="float: left;" src="images/data_formats.png">

## *Network Common Data Form* (*netCDF*)

- **Self Descriptive**: A netCDF file contains information about the data it contains.

- **Portable**: A netCDF file can be accessed by computers that store integers, characters, and floating-point numbers in different ways

- **Scalable**: Small subsets of large data sets in various formats can be accessed efficiently through netCDF interfaces, even from remote servers

- **Extensible**: Data can be appended to a properly structured netCDF file without copying the dataset or redefining its structure

- **Collaborative**: One writer and multiple readers can access the same netCDF file simultaneously

- **Archivable**: Access to all previous forms of netCDF data is supported by current and future versions of the software

More info at [Unidata](https://www.unidata.ucar.edu/software/netcdf/)

### Example of the *netCDF* data structure for structured data

Source: [xarray reference](https://xarray.pydata.org/en/stable/data-structures.html#dataset)

<img width=700 style="float: left;" src="images/netcdf.png">

**Caution**

The data ...

- ... can have arbitrarily defined dimensions

- ... are generally stored in binary form and thus cannot be viewed via a text editor

### What is *xarray*?

- *xarray* (formerly *xray*) is an open-source project and Python package that makes working with labeled multidimensional arrays easy and efficient

- *xarray* introduces labels in the form of dimensions, coordinates and attributes on raw NumPy-like arrays

- The package contains a large and growing library of domain-independent functions for advanced analysis and visualization with these data structures

- It is especially tailored to work with netCDF files, which were the source of the *xarray* data model

More info at [xarray](https://xarray.pydata.org/)

#### Two types of data containers

In many applications measured variables must be assigned to a space and a time. That quickly adds up to 4 dimensions!

The following data containers are available:

- **DataArray** is a labeled, multidimensional array that holds a single parameter (variable).

- **Dataset** is the multidimensional equivalent of a *DataFrame* in *pandas* and holds multiple parameters (variables).

For more information, see [*xarray*](https://xarray.pydata.org/en/stable/user-guide/data-structures.html)

In [1]:
import xarray as xr
import numpy as np

### Example for the creation of a *DataArray*

In [2]:
# geographic coordinates
latitude = np.arange(-90, 90, 5)
longitude = np.arange(-180, 180, 5)

# data
temperature_2d = np.random.rand(latitude.size, longitude.size)

# create the data container
data = xr.DataArray(data=temperature_2d, dims=["latitude", "longitude"], coords=[latitude, longitude])
data
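For comparison, here is a minimal sketch of the second container, a *Dataset*, which holds several variables on the same grid. The variable names `temperature` and `pressure` and the random values are hypothetical; the sketch also shows how the labels allow selection by coordinate value instead of integer position.

```python
import numpy as np
import xarray as xr

latitude = np.arange(-90, 90, 5)
longitude = np.arange(-180, 180, 5)
shape = (latitude.size, longitude.size)

# Two hypothetical variables that share the same dimensions and coordinates
example = xr.Dataset(
    data_vars={
        "temperature": (("latitude", "longitude"), np.random.rand(*shape)),
        "pressure": (("latitude", "longitude"), np.random.rand(*shape)),
    },
    coords={"latitude": latitude, "longitude": longitude},
    attrs={"title": "An example dataset"},
)

# Select by coordinate label instead of integer position
point = example["temperature"].sel(latitude=45, longitude=10)
band = example.sel(latitude=slice(-30, 30))  # label slices include both ends

print(float(point))
print(dict(band.sizes))
```

Indexing the *Dataset* with a variable name, as in `example["temperature"]`, returns a *DataArray*.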

### Load and save a *Dataset*

*xarray* is designed mainly for working with *netCDF* files.

As an example, here we load the sea surface temperature for the year 2019:

In [3]:
sst = xr.open_dataset('data/HadISST_2019.nc')
sst

### Example for loading data from a server

However, other formats can also be accessed, e.g. via the Internet ([PRISM Climate Group](https://prism.oregonstate.edu/)):

In [4]:
remote_data = xr.open_dataset(
    "http://iridl.ldeo.columbia.edu/SOURCES/.OSU/.PRISM/.monthly/dods",
    decode_times = False,
)
remote_data

**Note**: *xarray* opens this file lazily, i.e. it does not load the data into working memory right away. Here we are talking about gigabytes! The data is only read when the code actually needs it.
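The same behaviour can be tried out locally. The following sketch assumes that a netCDF backend such as *netCDF4* is installed; it first writes a small stand-in file (the file name, the variable name `sst` and the random values are hypothetical) and then reads only one time step from it.

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Write a small stand-in file (the real use case is a file of many gigabytes).
path = os.path.join(tempfile.mkdtemp(), "example.nc")
xr.Dataset(
    {"sst": (("time", "latitude", "longitude"), np.random.rand(12, 36, 72))},
    coords={
        "time": np.arange(12),
        "latitude": np.arange(-90, 90, 5),
        "longitude": np.arange(-180, 180, 5),
    },
).to_netcdf(path)

# Opening the file reads the metadata; the values stay on disk for now.
with xr.open_dataset(path) as ds:
    first_step = ds["sst"].isel(time=0)  # selection only, nothing is read yet
    first_values = first_step.load()     # reads just this slice into memory

print(first_values.shape)
```

Calling `.load()` inside the `with` block matters here: once the file is closed, data that has not been loaded can no longer be read.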

### A look at the metadata

In [5]:
sst.attrs

{'cdm_data_type': 'Grid',
 'comment': 'Data restrictions: for academic research use only. Data are Crown copyright see (http://www.opsi.gov.uk/advice/crown-copyright/copyright-guidance/index.htm)',
 'Conventions': 'CF-1.6, COARDS, ACDD-1.3',
 'creator_email': 'john.kennedy@metoffice.gov.uk',
 'creator_name': 'Met Office Hadley Centre',
 'creator_type': 'institution',
 'creator_url': 'http://hadobs.metoffice.com/',
 'description': 'HadISST 1.1 monthly average sea surface temperature',
 'Easternmost_Easting': 179.5,
 'geospatial_lat_max': 89.5,
 'geospatial_lat_min': -89.5,
 'geospatial_lat_resolution': 1.0,
 'geospatial_lat_units': 'degrees_north',
 'geospatial_lon_max': 179.5,
 'geospatial_lon_min': -179.5,
 'geospatial_lon_resolution': 1.0,
 'geospatial_lon_units': 'degrees_east',
 'history': '2017-01 Roy Mendelssohn (erd.data@noaa.gov) downloaded HadISST_ice.nc.gz from http://hadobs.metoffice.com/hadisst/data/download.html\n2021-02-19T16:28:24Z (local files)\n2021-02-19T16:28:24Z htt

In [6]:
sst.attrs['institution']

'Met Office Hadley Centre'

## *Hierarchical Data Format* (*HDF*)

- **Self-Describing**: The metadata is stored in the file itself, which enables efficient extraction of metadata without the need for an additional metadata document

- **Supports heterogeneous data**: HDF is a compressed format designed to support large, heterogeneous, and complex data sets

- **Supports Data Slicing**: Portions of the data set can be extracted as needed for analysis, so large files do not need to be read entirely into computer memory (RAM)

- **Open Format**: Because the HDF format is open, it is supported by a wide variety of programming languages and tools, including open source languages such as R and Python and open GIS tools such as QGIS

More info at [NEON](https://www.neonscience.org/resources/learning-hub/tutorials/about-hdf5)

### Example of *HDF5* data structure for unstructured data

Source: [NEON](https://www.neonscience.org/resources/learning-hub/tutorials/about-hdf5)

<img width=500 style="float: left;" src="images/hdf5_structure.jpg">

### What is *h5py*?

A Python package for handling data in HDF5 format.

In [7]:
import h5py as hdf

### Create and save HDF5 file

In [8]:
# open the file in mode 'w' (stands for write)
hf = hdf.File('data/example.h5', 'w')

In [9]:
# create random datasets ...
d1 = np.random.random(size=(1000,20))
d2 = np.random.random(size=(100,200))

In [10]:
# store to the opened file ...
hf.create_dataset('dataset_1', data=d1)
hf.create_dataset('dataset_2', data=d2)

<HDF5 dataset "dataset_2": shape (100, 200), type "<f8">

In [11]:
# set attributes ...
hf.attrs['title'] = "An example dataset"

In [12]:
hf.close()

### Read HDF5 data

Let's read back the data set we have just written:

In [13]:
# open the file in mode 'r' (stands for read)
hf = hdf.File("data/example.h5", "r")
hf.keys()

<KeysViewHDF5 ['dataset_1', 'dataset_2']>

In [14]:
hf.attrs.keys()

<KeysViewHDF5 ['title']>

In [15]:
# access the attribute ...
hf.attrs['title']

'An example dataset'

In [16]:
# access the dataset ...
hf['dataset_1'][()]

array([[0.70836453, 0.6833688 , 0.05145258, ..., 0.96229347, 0.86590627,
        0.81546757],
       [0.8908659 , 0.96046524, 0.4007389 , ..., 0.40149759, 0.2848431 ,
        0.7177241 ],
       [0.60749721, 0.80332521, 0.51051768, ..., 0.85552368, 0.58830442,
        0.20789511],
       ...,
       [0.2795044 , 0.11546145, 0.23052079, ..., 0.15623701, 0.42168762,
        0.97147617],
       [0.0955585 , 0.76209762, 0.33008635, ..., 0.09961316, 0.25859989,
        0.23165311],
       [0.70093974, 0.38308294, 0.11875369, ..., 0.34000054, 0.20126181,
        0.34583012]])

In [17]:
# don't forget to close the file when finished!
hf.close()
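As a possible variation on the steps above, the following sketch uses a `with` statement, which closes the file automatically, stores the data inside a group (the hierarchical part of HDF) and reads only a small block of the dataset instead of the whole array. The file name, the group name `measurements` and the attribute `units` are hypothetical.

```python
import os
import tempfile

import h5py as hdf
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "example_groups.h5")
d1 = np.random.random(size=(1000, 20))

# The with-statement closes the file automatically, even if an error occurs.
with hdf.File(path, "w") as hf:
    group = hf.create_group("measurements")  # hypothetical group name
    dset = group.create_dataset("dataset_1", data=d1)
    dset.attrs["units"] = "arbitrary"        # attributes can also sit on datasets

with hdf.File(path, "r") as hf:
    dset = hf["measurements/dataset_1"]      # path-like access into the hierarchy
    subset = dset[:10, :5]                   # reads only this block from disk
    units = dset.attrs["units"]

print(subset.shape, units)
```

In contrast to `hf['dataset_1'][()]`, which loads the complete dataset, slicing the dataset object transfers only the requested part into memory.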

## THE END