# Sachsen

Every federal state is represented by its own input directory and is processed into a NUTS level 2 directory containing a sub-folder for each discharge location. These folder names are derived from NUTS and reflect the CAMELS id. The NUTS level 2 code for Sachsen is `DED`.

To pre-process the data, you need to write at least two functions. The first extracts all metadata and condenses it into a single `pandas.DataFrame`, which is used to build the folder structure and derive the ids.
The second takes the id assigned by the state authorities, called `provider_id`, and returns a `pandas.DataFrame` with the transformed data. This dataframe needs the three columns `['date', 'q' | 'w', 'flag']`.
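
The expected column layout can be checked before anything is saved. The following is a minimal sketch, not part of `camelsp`: the helper name `check_timeseries_frame` is hypothetical, and the example values are made up for illustration.

```python
import pandas as pd


def check_timeseries_frame(df: pd.DataFrame) -> str:
    """Return 'q' or 'w' if df has exactly the columns date, q or w, and flag."""
    # hypothetical helper, not part of camelsp
    variables = [col for col in df.columns if col in ('q', 'w')]
    if len(variables) != 1:
        raise ValueError(f"Expected exactly one of 'q' or 'w', got {list(df.columns)}")

    expected = {'date', variables[0], 'flag'}
    if set(df.columns) != expected:
        raise ValueError(f"Expected columns {sorted(expected)}, got {list(df.columns)}")
    return variables[0]


# made-up example data in the required layout
example = pd.DataFrame({
    'date': pd.to_datetime(['2020-12-27', '2020-12-28']),
    'q': [0.954, 0.894],
    'flag': [float('nan'), float('nan')],
})
variable = check_timeseries_frame(example)
print(variable)
```

The sketch compares column names only; it does not assume a particular column order or check dtypes.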

For easier and unified output handling, the `camelsp` package contains a context object called `Bundesland`. It takes a number of names and abbreviations to identify the correct federal state and returns an object that holds helper and save functions.

The context saves files as needed and can easily be changed to use a different saving strategy, e.g. filling missing data with NaN, merging data into a single file, creating one file per variable, or packing everything into a NetCDF file.

In [1]:
import pandas as pd
import geopandas as gpd
import numpy as np
import os
from glob import glob
from tqdm import tqdm
import warnings

from camelsp import Bundesland

The context can also be instantiated like any regular Python class, e.g. to load only the default input data path, which we will use later.

In [2]:
# the context also makes the input path available, if camelsp was installed locally
BASE = Bundesland('Sachsen').input_path
BASE

'/home/camel/camelsp/input_data/Q_and_W/SN_Sachsen'

## Parse data

There is no single metadata file, only one Excel file per station. We therefore have to parse the metadata from each file individually and collect it.

In [8]:
files = sorted(glob(os.path.join(BASE, '*.xlsx')))
print(f"Found {len(files)} files.")

Found 282 files.


Test with a single file (index 32 of the sorted list):

In [9]:
with warnings.catch_warnings():
    df = pd.read_excel(files[32], skiprows=2, decimal=',')


print(f"Pegelkennziffer: {df.Pegelkennziffer.unique()}")
print(f"Pegelname:       {df.Pegelname.unique()}")
print(f"Gewaesser:       {df.Gewaesser.unique()}")
print(f"Beeinflussung:   {df.Beeinflussung.unique()}")
print(f"Datum type:      {df.Datum.dtype}")

df

  warn("Workbook contains no default style, apply openpyxl's default")


Pegelkennziffer: [567310]
Pegelname:       ['Böhrigen']
Gewaesser:       ['Striegis']
Beeinflussung:   ['b' nan 'R' 'D' 'G, R' 'G, R, T' 'R, T' 'R, b' 'D, G']
Datum type:      datetime64[ns]


| | Pegelkennziffer | Pegelname | Gewaesser | Datum | Wasserstand (W) cm | Durchfluss (Q) m³/s | Beeinflussung |
|---|---|---|---|---|---|---|---|
| 0 | 567310 | Böhrigen | Striegis | 2004-11-01 | 62 | | b |
| 1 | 567310 | Böhrigen | Striegis | 2004-11-02 | 65 | | b |
| 2 | 567310 | Böhrigen | Striegis | 2004-11-03 | 70 | | b |
| 3 | 567310 | Böhrigen | Striegis | 2004-11-04 | 55 | | b |
| 4 | 567310 | Böhrigen | Striegis | 2004-11-05 | 43 | | b |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 5900 | 567310 | Böhrigen | Striegis | 2020-12-27 | 55 | 0.954 | |
| 5901 | 567310 | Böhrigen | Striegis | 2020-12-28 | 54 | 0.894 | |
| 5902 | 567310 | Böhrigen | Striegis | 2020-12-29 | 53 | 0.824 | |
| 5903 | 567310 | Böhrigen | Striegis | 2020-12-30 | 52 | 0.735 | |


One open question: what does `Beeinflussung` (German for "influence") actually mean here? The column is ignored for now.
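
Even without knowing their meaning, the individual codes in that column can be listed. The sketch below assumes the entries are comma-separated combinations of single codes, which the printed values suggest but the source does not confirm. It uses the unique values printed for station 567310, with the NaN entry left out.

```python
import pandas as pd

# unique values printed above for station 567310 (NaN omitted)
flags = pd.Series(['b', 'R', 'D', 'G, R', 'G, R, T', 'R, T', 'R, b', 'D, G'])

# split the comma-separated entries into single codes and strip whitespace
codes = flags.str.split(',').explode().str.strip()

# distinct codes, assuming each comma-separated token is one code
distinct_codes = sorted(codes.unique())
print(distinct_codes)
```

Applied to the full column of every file, the same steps would give the complete set of codes to ask the data provider about.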

Loop over all files and extract the metadata and the two data columns.

In [10]:
np.unique(df[df.columns[1]].values)

array(['Böhrigen'], dtype=object)

In [11]:
# create result container
meta = []
q = []
w = []

with warnings.catch_warnings(record=True) as warns:
    for filename in tqdm(files):
        # read
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            df = pd.read_excel(filename, skiprows=2, decimal=',')

        # extract data columns
        q_df = df[['Datum', 'Durchfluss (Q) m³/s']].copy()
        q_df.columns = ['date', 'q']
        q_df['flag'] = np.nan  # np.NaN was removed in NumPy 2.0
        
        w_df = df[['Datum', 'Wasserstand (W) cm']].copy()
        w_df.columns = ['date', 'w']
        w_df['flag'] = np.nan
        
        # append
        q.append(q_df)
        w.append(w_df)

        # metadata - get first three columns
        m = dict()
        for i in range(3):
            # each of these columns must hold a single value per file
            if np.unique(df[df.columns[i]].values).size > 1:
                warnings.warn(f"Column {df.columns[i]} of file {filename} is expected to be unique")
                m = None
                break
            else:
                # add metadata
                m[df.columns[i]] = str(df.iloc[0, i])
        
        # add other stuff
        if m is not None:     
            m['unit_q'] = 'm³/s'
            m['unit_w'] = 'cm'
        meta.append(m)
    
print(f"Parsed {len(meta)} files")

100%|██████████| 282/282 [09:51<00:00,  2.10s/it]

Parsed 282 files
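
The single-value check from the loop above can also be isolated into a small function, which makes it easy to try on toy data. This is a sketch with a hypothetical helper name (`single_values`). It uses `Series.nunique(dropna=False)` instead of `np.unique`, so the two may treat missing values differently. The second station name in the failing example is made up.

```python
import pandas as pd


def single_values(df: pd.DataFrame, n_columns: int = 3):
    """Return {column: value} for the first n_columns, or None if a column is not constant."""
    # hypothetical helper, not part of camelsp
    result = {}
    for name in df.columns[:n_columns]:
        # a metadata column must contain exactly one distinct value
        if df[name].nunique(dropna=False) != 1:
            return None
        result[name] = str(df[name].iloc[0])
    return result


good = pd.DataFrame({
    'Pegelkennziffer': [567310, 567310],
    'Pegelname': ['Böhrigen', 'Böhrigen'],
    'Gewaesser': ['Striegis', 'Striegis'],
    'Datum': pd.to_datetime(['2004-11-01', '2004-11-02']),
})
# made-up second name to trigger the failure case
bad = good.assign(Pegelname=['Böhrigen', 'Other'])

good_meta = single_values(good)
bad_meta = single_values(bad)
print(good_meta, bad_meta)
```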





### Create metadata

This should be straightforward.

In [12]:
metadata = pd.DataFrame(meta)
metadata

| | Pegelkennziffer | Pegelname | Gewaesser | unit_q | unit_w |
|---|---|---|---|---|---|
| 0 | 576401 | Adorf 1 | Weiße Elster | m³/s | cm |
| 1 | 576400 | Adorf | Weiße Elster | m³/s | cm |
| 2 | 578091 | Albrechtshain 1 | Parthe | m³/s | cm |
| 3 | 578090 | Albrechtshain | Parthe | m³/s | cm |
| 4 | 564530 | Altchemnitz 1 | Zwönitz | m³/s | cm |
| ... | ... | ... | ... | ... | ... |
| 277 | 662022 | Zittau 6 | Mandau | m³/s | cm |
| 278 | 576635 | Zitzschen | Weiße Elster | m³/s | cm |
| 279 | 562070 | Zwickau-Pölbitz | Zwickauer Mulde | m³/s | cm |
| 280 | 568401 | Zöblitz 1 | Schwarze Pockau | m³/s | cm |


There is more metadata in the provided Pegel shapefile, including critical information such as location.

In [17]:
gdf_meta = gpd.read_file(os.path.join(BASE, '../../Shapes/Sachsen_Shapes/PEGEL.shp'))

# drop column geometry
gdf_meta.drop(columns='geometry', inplace=True)

# provider_id as int
gdf_meta['MSTNR'] = gdf_meta['MSTNR'].astype(int)
metadata['Pegelkennziffer'] = metadata['Pegelkennziffer'].astype(int)

# left-join the shapefile attributes onto the metadata via the station id
metadata = metadata.merge(gdf_meta, left_on='Pegelkennziffer', right_on='MSTNR', how='left')

# make Pegelkennziffer str again
metadata['Pegelkennziffer'] = metadata['Pegelkennziffer'].astype(str)
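
Because the merge uses `how='left'`, stations without a matching `MSTNR` in the shapefile are kept silently, with missing values in the shapefile columns. One way to spot them is the `indicator=True` option of `DataFrame.merge`. The sketch below runs on toy frames: the id `999999` and the column `example_attr` are made up for illustration.

```python
import pandas as pd

# toy stand-ins for `metadata` and `gdf_meta`; 999999 is a made-up id
stations = pd.DataFrame({'Pegelkennziffer': [576401, 576400, 999999]})
shape_attrs = pd.DataFrame({'MSTNR': [576401, 576400], 'example_attr': ['a', 'b']})

# indicator=True adds a `_merge` column that records where each row came from
merged = stations.merge(
    shape_attrs,
    left_on='Pegelkennziffer',
    right_on='MSTNR',
    how='left',
    indicator=True,
)

# rows marked 'left_only' found no partner in the shapefile attributes
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Pegelkennziffer'].tolist()
print(unmatched)
```

If the resulting list is not empty on the real data, the affected stations have no location or other shapefile attributes.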

In [19]:
id_column = 'Pegelkennziffer'

### Finally run

Now the Q and W data can be saved. Conveniently, the context handles all of the id creation, data creation, merging, and the mapping from our ids to the original ids and files. This makes mistakes less likely.

In [21]:
with Bundesland('Sachsen') as bl:
    # save the metadata
    bl.save_raw_metadata(metadata, id_column, overwrite=True)

    # for reference, get the NUTS mapping as a table
    nuts_map = bl.nuts_table
    print(nuts_map.head())

    # loop over all stations
    for m, q_df, w_df in tqdm(zip(meta, q, w), total=len(meta)):
        
        if m is not None:
            # get the provider id
            provider_id = str(m[id_column])
            bl.save_timeseries(q_df, provider_id)
            bl.save_timeseries(w_df, provider_id)

    # check whether warnings were recorded while parsing (there were some)
    if len(warns) > 0:
        log_path = bl.save_warnings(warns)
        print(f"There were warnings during the processing. The log can be found at: {log_path}")


    nuts_id provider_id                              path
0  DED10000      576401  ./DED/DED10000/DED10000_data.csv
1  DED10010      576400  ./DED/DED10010/DED10010_data.csv
2  DED10020      578091  ./DED/DED10020/DED10020_data.csv
3  DED10030      578090  ./DED/DED10030/DED10030_data.csv
4  DED10040      564530  ./DED/DED10040/DED10040_data.csv


100%|██████████| 282/282 [00:31<00:00,  9.01it/s]
