# Sachsen

Every federal state is represented by its own input directory and is processed into a NUTS level 2 directory containing a sub-folder for each discharge location. These folder names are derived from NUTS and reflect the CAMELS id. The NUTS level 2 code for Sachsen is `DED`.

To pre-process the data, you need to write (at least) two functions. One should extract all metadata and condense it into a single `pandas.DataFrame`. This is used to build the folder structure and derive the ids.
The second function takes the id provided by the state authorities, called `provider_id`, and returns a `pandas.DataFrame` with the transformed data. The dataframe needs the three columns `['date', 'q' | 'w', 'flag']`, i.e. the second column is named either `q` (discharge) or `w` (water level).
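
As a rough illustration of the required output shape, the sketch below converts one raw station table into the three-column layout. The helper name `to_camels_frame` and the toy input are hypothetical and not part of `camelsp`; the raw column names follow the Sachsen files used in this notebook.

```python
import numpy as np
import pandas as pd


def to_camels_frame(raw: pd.DataFrame, value_column: str, variable: str) -> pd.DataFrame:
    """Return a frame with the columns ['date', variable, 'flag'] (hypothetical helper)."""
    if variable not in ('q', 'w'):
        raise ValueError("variable has to be 'q' (discharge) or 'w' (water level)")

    out = raw[['Datum', value_column]].copy()
    out.columns = ['date', variable]
    out['date'] = pd.to_datetime(out['date'])
    out['flag'] = np.nan  # no quality flag is derived in this sketch
    return out


# toy input that mimics two rows of a provider file
raw = pd.DataFrame({
    'Datum': ['2020-12-29', '2020-12-30'],
    'Durchfluss (Q) m³/s': [11.1, 10.5],
})
q_df = to_camels_frame(raw, 'Durchfluss (Q) m³/s', 'q')
```

Keeping the renaming in one small function makes it easy to reuse the same logic for both the `q` and the `w` column.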

For easier and unified output handling, the `camelsp` package contains a context object called `Bundesland`. It takes a number of names and abbreviations to identify the correct federal state and returns an object that holds helper and save functions.

The context saves files as needed and can easily be changed to save files with different strategies, e.g. fill missing data with NaN, merge data into a single file, create files for each variable, or pack everything together into a NetCDF file.

In [1]:
import pandas as pd
import numpy as np
from pandas.errors import ParserError
import os
from glob import glob
from pprint import pprint
from tqdm import tqdm
from typing import Union, Dict
from datetime import datetime as dt
from dateparser import parse
import warnings

from camelsp import Bundesland

The context can also be instantiated like any regular Python class, e.g. to load only the default input data path, which we will use later.

In [2]:
# the context also makes the input path available, if camelsp was installed locally
BASE = Bundesland('Sachsen').input_path
BASE

'/home/alexd/Projekte/CAMELS/Github/camelsp/input_data/SN_Sachsen'

## Parse data

We do not have a single metadata file, but one Excel file for each station. Thus we need to parse the metadata of each station individually and collect the results.

In [3]:
files = glob(os.path.join(BASE, '*.xlsx'))
print(f"Found {len(files)} files.")

Found 282 files.


Test for the first file:

In [4]:
with warnings.catch_warnings():
    df = pd.read_excel(files[32], skiprows=2, decimal=',')


print(f"Pegelkennziffer: {df.Pegelkennziffer.unique()}")
print(f"Pegelname:       {df.Pegelname.unique()}")
print(f"Gewaesser:       {df.Gewaesser.unique()}")
print(f"Beeinflussung:   {df.Beeinflussung.unique()}")
print(f"Datum type:      {df.Datum.dtype}")

df

  warn("Workbook contains no default style, apply openpyxl's default")


Pegelkennziffer: [660193]
Pegelname:       ['Podrosche 3']
Gewaesser:       ['Lausitzer Neiße']
Beeinflussung:   [nan 'b' 'e' 'R' 'R, e' 'R, T' 'b, e' 'K']
Datum type:      datetime64[ns]


Unnamed: 0,Pegelkennziffer,Pegelname,Gewaesser,Datum,Wasserstand (W) cm,Durchfluss (Q) m³/s,Beeinflussung
0,660193,Podrosche 3,Lausitzer Neiße,1984-11-01,,9.36,
1,660193,Podrosche 3,Lausitzer Neiße,1984-11-02,,9.36,
2,660193,Podrosche 3,Lausitzer Neiße,1984-11-03,,9.36,
3,660193,Podrosche 3,Lausitzer Neiße,1984-11-04,,9.36,
4,660193,Podrosche 3,Lausitzer Neiße,1984-11-05,,8.78,
...,...,...,...,...,...,...,...
12900,660193,Podrosche 3,Lausitzer Neiße,2020-12-27,76.0,12.50,
12901,660193,Podrosche 3,Lausitzer Neiße,2020-12-28,75.0,12.10,
12902,660193,Podrosche 3,Lausitzer Neiße,2020-12-29,71.0,11.10,
12903,660193,Podrosche 3,Lausitzer Neiße,2020-12-30,70.0,10.50,


One open question: what does the `Beeinflussung` (German for "influence") column actually mean here? It is ignored for now.
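
Even without knowing what the codes stand for, it can help to see which individual codes occur. The following is a minimal sketch that splits the comma-separated entries; the Series is a hand-built stand-in for `df['Beeinflussung']` using the unique values printed above, not the real column.

```python
import pandas as pd

# stand-in for df['Beeinflussung'], built from the unique values shown above
flags = pd.Series([None, 'b', 'e', 'R', 'R, e', 'R, T', 'b, e', 'K'])

codes = (
    flags.dropna()          # rows without any code
    .str.split(',')         # 'R, e' -> ['R', ' e']
    .explode()              # one row per single code
    .str.strip()            # remove the blank after the comma
)
code_counts = codes.value_counts()
print(code_counts)
```

On the real column, the counts would show how often each code was assigned, which may help when asking the data provider about their meaning.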

Loop over all files and extract the metadata and the two data columns.

In [5]:
np.unique(df[df.columns[1]].values)

array(['Podrosche 3'], dtype=object)

In [6]:
# create result container
meta = []
q = []
w = []

with warnings.catch_warnings(record=True) as warns:
    for filename in tqdm(files):
        # read
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            df = pd.read_excel(filename, skiprows=2, decimal=',')

        # extract data columns
        q_df = df[['Datum', 'Durchfluss (Q) m³/s']].copy()
        q_df.columns = ['date', 'q']
        q_df['flag'] = np.nan  # np.NaN was removed in NumPy 2.0

        w_df = df[['Datum', 'Wasserstand (W) cm']].copy()
        w_df.columns = ['date', 'w']
        w_df['flag'] = np.nan
        
        # append
        q.append(q_df)
        w.append(w_df)

        # metadata - get first three columns
        m = dict()
        for i in range(3):
            # these columns need to be unique
            if np.unique(df[df.columns[i]].values).size > 1:
                warnings.warn(f"Column {df.columns[i]} of file {filename} is expected to be unique")
                m = None
                break
            else:
                # add metadata
                m[df.columns[i]] = str(df.iloc[0, i])
        
        # add other stuff
        if m is not None:     
            m['unit_q'] = 'm³/s'
            m['unit_w'] = 'cm'
        meta.append(m)
    
print(f"Parsed {len(meta)} files")

100%|██████████| 282/282 [03:19<00:00,  1.41it/s]

Parsed 282 files
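
The loop above relies on `warnings.catch_warnings(record=True)` to collect the uniqueness warnings in a list instead of printing them. A stripped-down sketch of that pattern, with a made-up file name and independent of the Excel files, could look like this:

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    # record every warning, even if an identical one was raised before
    warnings.simplefilter('always')
    warnings.warn("Column Pegelname of file example.xlsx is expected to be unique")

# each entry is a warnings.WarningMessage; the text is in its .message attribute
messages = [str(w.message) for w in caught]
print(messages)
```

Note that the `simplefilter('always')` call is an addition of this sketch; the loop above uses the default filters for the outer context.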





### Create metadata

This should be straightforward, as the metadata dictionaries only need to be combined into a single table.

In [7]:
metadata = pd.DataFrame(meta)
metadata

Unnamed: 0,Pegelkennziffer,Pegelname,Gewaesser,unit_q,unit_w
0,551431,Dippoldiswalde 3,Werkgraben,m³/s,cm
1,550190,Porschdorf 1,Lachsbach,m³/s,cm
2,663090,Tauchritz,Pließnitz,m³/s,cm
3,564201,Niedermülsen 1,Mülsenbach,m³/s,cm
4,564200,Niedermülsen,Mülsenbach,m³/s,cm
...,...,...,...,...,...
277,563745,Johanngeorgenstadt 4,Schwarzwasser,m³/s,cm
278,583223,Reichwalde 3,Schwarzer Schöps,m³/s,cm
279,560301,Nemt 1,Mühlbach,m³/s,cm
280,583100,Löbau Stadion,Löbauer Wasser,m³/s,cm
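
Files that fail the uniqueness check are stored as `None` in `meta`. A defensive variant, sketched here with toy entries rather than the real list, drops those placeholders before building the frame. The names `meta_example` and `metadata_example` are only used for this illustration.

```python
import pandas as pd

# toy stand-in for the `meta` list; None marks a file that was rejected
meta_example = [
    {'Pegelkennziffer': '551431', 'Pegelname': 'Dippoldiswalde 3', 'Gewaesser': 'Werkgraben'},
    None,
    {'Pegelkennziffer': '550190', 'Pegelname': 'Porschdorf 1', 'Gewaesser': 'Lachsbach'},
]

# keep only the entries that passed the check
valid = [m for m in meta_example if m is not None]
metadata_example = pd.DataFrame(valid)
n_rejected = len(meta_example) - len(valid)
print(f"{n_rejected} file(s) rejected, {len(metadata_example)} kept")
```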


In [8]:
metadata[metadata['Pegelkennziffer'] == '662082']

Unnamed: 0,Pegelkennziffer,Pegelname,Gewaesser,unit_q,unit_w
115,662082,Neuschönau 2,Lasur,m³/s,cm


In [9]:
id_column = 'Pegelkennziffer'

### Finally run

Now the Q and W data can be saved. Conveniently, the id creation, data creation, merging, and the mapping from our ids to the original ids and files are all handled by the context. This is helpful, as it makes mistakes less likely.

In [10]:
with Bundesland('Sachsen') as bl:
    # save the metadata
    bl.save_raw_metadata(metadata, id_column, overwrite=True)

    # for reference, call the nuts-mapping as table
    nuts_map = bl.nuts_table
    print(nuts_map.head())

    # loop over all stations and save both time series
    for m, q_df, w_df in tqdm(zip(meta, q, w), total=len(meta)):
        
        if m is not None:
            # get the provider id
            provider_id = str(m[id_column])
            bl.save_timeseries(q_df, provider_id)
            bl.save_timeseries(w_df, provider_id)

    # check if there were warnings (there are warnings)
    if len(warns) > 0:
        log_path = bl.save_warnings(warns)
        print(f"There were warnings during the processing. The log can be found at: {log_path}")


    nuts_id provider_id                              path
0  DED10000      551431  ./DED/DED10000/DED10000_data.csv
1  DED10010      550190  ./DED/DED10010/DED10010_data.csv
2  DED10020      663090  ./DED/DED10020/DED10020_data.csv
3  DED10030      564201  ./DED/DED10030/DED10030_data.csv
4  DED10040      564200  ./DED/DED10040/DED10040_data.csv


100%|██████████| 282/282 [00:15<00:00, 18.53it/s]
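
The mapping table printed above can be used to translate a provider id into the generated NUTS id. The sketch below uses a hand-built stand-in for `bl.nuts_table`, copied from the first printed rows, and assumes that `provider_id` is stored as a string.

```python
import pandas as pd

# hand-built stand-in for bl.nuts_table, copied from the rows printed above
nuts_map_example = pd.DataFrame({
    'nuts_id': ['DED10000', 'DED10010', 'DED10020'],
    'provider_id': ['551431', '550190', '663090'],  # assumption: stored as strings
    'path': [
        './DED/DED10000/DED10000_data.csv',
        './DED/DED10010/DED10010_data.csv',
        './DED/DED10020/DED10020_data.csv',
    ],
})

# index by provider id to look up the generated id
lookup = nuts_map_example.set_index('provider_id')['nuts_id']
nuts_id = lookup['550190']
print(nuts_id)
```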
