# Sachsen-Anhalt

Every federal state is represented by its own input directory and is processed into a NUTS level 2 directory containing a sub-folder for each discharge location. These folder names are derived from NUTS and reflect the CAMELS id. The NUTS level 2 code for Sachsen-Anhalt is `DEE`.

To pre-process the data, you need to write (at least) two functions. One should extract all metadata and condense it into a single `pandas.DataFrame`. This is used to build the folder structure and derive the ids.
The second function has to take an id, as provided by the state authorities, called `provider_id` and return a `pandas.DataFrame` with the transformed data. The dataframe needs the three columns `['date', 'q' | 'w', 'flag']`.
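
As a rough illustration of the target layout, the following sketch builds a tiny frame with the three required columns. The values are made up, and `make_example_frame` is a hypothetical helper used only for this illustration; it is not part of `camelsp`.

```python
import numpy as np
import pandas as pd


def make_example_frame(variable: str = 'q') -> pd.DataFrame:
    """Build a toy time series in the expected layout (hypothetical helper)."""
    if variable not in ('q', 'w'):
        raise ValueError("variable has to be 'q' (discharge) or 'w' (water level)")

    return pd.DataFrame({
        'date': pd.date_range('2000-01-01', periods=3, freq='D'),
        variable: [0.5, 0.7, 0.6],   # made-up values
        'flag': np.nan,              # no quality flag available
    })


example = make_example_frame('q')
print(example.columns.tolist())
```

The second column is named `q` or `w` depending on the variable, so a station with both variables would yield two such frames.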

For easier and unified output handling, the `camelsp` package contains a context object called `Bundesland`. It takes a number of names and abbreviations to identify the correct federal state and returns an object that holds helper and save functions.

The context saves files as needed and can easily be changed to save files with different strategies, i.e. fill missing data with NaN, merge data into a single file, create files for each variable, or pack everything together into a netCDF file.

In [6]:
import pandas as pd
import numpy as np
from pandas.errors import ParserError
import os
import shutil
from glob import glob
from pprint import pprint
from tqdm import tqdm
from typing import Union, Dict, Tuple
from datetime import datetime as dt
from dateparser import parse
import warnings
from io import BytesIO

from camelsp import Bundesland

The context can also be instantiated like any regular Python class, i.e. to load only the default input data path, which we will use later.

In [2]:
# the context also makes the input path available, if camelsp was installed locally
BASE = Bundesland('Sachsen-Anhalt').input_path
BASE

'/home/alexander/Github/camels/camelsp/input_data/SA_Sachsen-Anhalt'

## Parse data

We do not have a metadata file, only one ZRXP file (`*.DGJ`) for each station. Thus, we need to parse the metadata of each file individually and collect the results.

In [11]:
import patoolib

# remove folder TagMittel_DGJ_20221007 to ensure that data is extracted freshly and is up to date
if os.path.exists(f"{BASE}/TagMittel_DGJ_20221007"):
    shutil.rmtree(f"{BASE}/TagMittel_DGJ_20221007")

# extract .7z archive, make sure that 7z is installed to make patoolib work
patoolib.extract_archive(f"{BASE}/TagMittel_DGJ_LSA.7z", outdir=BASE)

patool: Extracting /home/alexander/Github/camels/camelsp/input_data/SA_Sachsen-Anhalt/TagMittel_DGJ_LSA.7z ...
patool: running /usr/bin/7z x -o/home/alexander/Github/camels/camelsp/input_data/SA_Sachsen-Anhalt -- /home/alexander/Github/camels/camelsp/input_data/SA_Sachsen-Anhalt/TagMittel_DGJ_LSA.7z
patool: ... /home/alexander/Github/camels/camelsp/input_data/SA_Sachsen-Anhalt/TagMittel_DGJ_LSA.7z extracted to `/home/alexander/Github/camels/camelsp/input_data/SA_Sachsen-Anhalt'.


'/home/alexander/Github/camels/camelsp/input_data/SA_Sachsen-Anhalt'
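
Since `patoolib` relies on an external 7-Zip program for `.7z` archives, it can help to check for it before extracting. The following is only a sketch: `find_7z` is a hypothetical helper, and the list of candidate executable names is an assumption that may need adjusting for your system.

```python
import shutil


def find_7z(candidates=('7z', '7za', '7zr')):
    """Return the path of the first 7-Zip executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


seven_zip = find_7z()
if seven_zip is None:
    print("No 7-Zip executable found, extracting the .7z archive will most likely fail.")
else:
    print(f"Using {seven_zip}")
```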

In [12]:
files = glob(os.path.join(f"{BASE}/TagMittel_DGJ_20221007", 'LHW_*.DGJ'))
print(f"Found {len(files)} files.")

Found 252 files.


Check the header of one file, as we don't have any other metadata file. The files are in ZRXP format, so we need a ZRXP parser. I couldn't quickly find a simple one, so I wrote my own. Following the ZRXP specification (https://prozessing.tbbm.at/zrxp/zrxp3.0_en.pdf), only the relevant metadata keywords are implemented.
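
To make the header structure more tangible, here is a minimal sketch of the splitting logic on a few header lines. The sample lines are put together by hand in the style of these files, and `parse_header` and `KEYS` are hypothetical names for this illustration; the parser actually used in this notebook is `extract_file`.

```python
from typing import Dict

# hand-made header lines in the style of a ZRXP file
SAMPLE = [
    "#SANR578350|*|SNAMEUnterrißdorf|*|SWATERBöse Sieben|*|",
    "#RINVAL-777|*|CUNITm³/s|*|",
    "#LAYOUT(timestamp,value,status,interpolation_type,remark)|*|",
]

KEYS = ('SANR', 'SNAME', 'SWATER', 'RINVAL', 'CUNIT', 'LAYOUT')


def parse_header(lines) -> Dict[str, str]:
    """Split ZRXP-style header lines into a keyword -> value mapping."""
    parsed = {}
    for line in lines:
        # drop the leading '#' and split the line into its fields
        for field in line.lstrip('#').split('|*|'):
            field = field.strip()
            if not field:
                continue
            # the keyword is the prefix, everything after it is the value
            for key in KEYS:
                if field.startswith(key):
                    parsed[key] = field[len(key):]
                    break
    return parsed


header = parse_header(SAMPLE)
print(header['SNAME'], header['LAYOUT'].strip('()').split(','))
```

Keyword and value are not separated by a delimiter, which is why the value is obtained by cutting off the known keyword prefix.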

In [13]:
# get the names from the header
# I will skip the energy market headers and remote logger headers
HEADER = dict(
    SANR='Alphanumerical station number',
    SNAME='Station name',
    SWATER='River name',
    CMW='Values per day for equidistant time series values',
    CNAME='Parameter name',
    CNR='Parameter number',
    CUNIT='Unit of the data value column',
    RINVAL='Value for missing or invalid data record',
    RTIMELVL='Time series time level',
    TZ='time zone of all time stamps in the time series block, both header and data',
    LAYOUT='specifies the column layout for the ZRXP data'
)

def extract_file(path: str) -> Tuple[Dict[str, str], pd.DataFrame]:
    # Get the header lines first
    collection = []
    headerlines = 0

    # first read the head
    with open(path, 'rb') as f:
        # go for each line
        for l in f.readlines():
            if not l.decode('latin1').startswith('#'):
                break
            else:
                # collect the header
                collection.extend([_ for _ in l.decode('latin1').replace('#', '').split('|*|') if _ not in ('', '\n', '\r\n')])
                headerlines += 1

    # create metadata container
    meta = {}

    # go for each collected header
    for co in collection:
        HEAD = [k for k in HEADER.keys() if co.startswith(k)]
        if len(HEAD) == 1:
            HEAD = HEAD[0]
            meta[HEAD] = co.replace(HEAD, '')
        elif len(HEAD) == 0:
            if 'COMMENT' in meta:
                meta['COMMENT'] += f" {co}"
            else:
                meta["COMMENT"] = co
        elif len(HEAD) > 1:
            warnings.warn(f"Can't parse header {co}")

    # now the data
    header = meta['LAYOUT'].strip('()').split(',') if 'LAYOUT' in meta else [0, 1, 2, 3, 4, 5]
    df = pd.read_csv(path, encoding='latin1', sep=' ', header=None, skiprows=headerlines, names=header, parse_dates=[0], na_values=int(meta.get('RINVAL', '-777')))

    return meta, df

extract_file(files[-1])


({'COMMENT': 'ZRXPVERSION2300.100 ZRXPCREATORKiIOSystem.ZRXPV2R2_E SOURCESYSTEMWISKI SOURCEIDa1a8b6cd-7ab3-4c4d-896b-a12975df58f0 TSPATH/LHW/578350/Q.DGJ/Day.Mean.DGJ',
  'SANR': '578350',
  'SNAME': 'Unterrißdorf',
  'SWATER': 'Böse Sieben',
  'CNR': 'Q.DGJ',
  'CNAME': 'Q.DGJ',
  'TZ': 'UTC+1',
  'RINVAL': '-777',
  'CUNIT': 'm³/s',
  'LAYOUT': '(timestamp,value,status,interpolation_type,remark)'},
        timestamp     value  status  interpolation_type remark
 0     1978-11-01  0.196000      40                 603    NaN
 1     1978-11-02  0.196000      40                 603    NaN
 2     1978-11-03  0.196000      40                 603    NaN
 3     1978-11-04  0.174000      40                 603    NaN
 4     1978-11-05  0.174000      40                 603    NaN
 ...          ...       ...     ...                 ...    ...
 15762 2021-12-27  0.023000    1064                 603    NaN
 15763 2021-12-28  0.036802    1064                 603    NaN
 15764 2021-12-29  0.045302  

Now go through all files.

In [14]:
# container
meta = []
raw_data = []

with warnings.catch_warnings(record=True) as warn:
    for fname in tqdm(files):
        m, df = extract_file(fname)
        meta.append(m)
        raw_data.append(df)

print(f"Parsed {len(meta)} files with {len(warn)} warnings.")

100%|██████████| 252/252 [05:14<00:00,  1.25s/it]






## Create metadata

This is pretty straightforward, although the resulting table contains little information beyond station number, station name, and river name.

In [15]:
metadata = pd.DataFrame(meta)
metadata

| | COMMENT | SANR | SNAME | SWATER | CNR | CNAME | TZ | RINVAL | CUNIT | LAYOUT |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ZRXPVERSION2300.100 ZRXPCREATORKiIOSystem.ZRXP... | 578410 | Wippra | Wipper | Q.DGJ | Q.DGJ | UTC+1 | -777 | m³/s | (timestamp,value,status,interpolation_type,rem... |
| 1 | ZRXPVERSION2300.100 ZRXPCREATORKiIOSystem.ZRXP... | 575310 | Brücken | Kleine Helme | W.DGJ | W.DGJ | UTC+1 | -777 | cm | (timestamp,value,status,interpolation_type,rem... |
| 2 | ZRXPVERSION2300.100 ZRXPCREATORKiIOSystem.ZRXP... | 594005 | Hagenau | Biese | Q.DGJ | Q.DGJ | UTC+1 | -777 | m³/s | (timestamp,value,status,interpolation_type,rem... |
| 3 | ZRXPVERSION2300.100 ZRXPCREATORKiIOSystem.ZRXP... | 444205 | Ilsenburg | Ilse | Q.DGJ | Q.DGJ | UTC+1 | -777 | m³/s | (timestamp,value,status,interpolation_type,rem... |
| 4 | ZRXPVERSION2300.100 ZRXPCREATORKiIOSystem.ZRXP... | 579610 | Meisdorf | Selke | Q.DGJ | Q.DGJ | UTC+1 | -777 | m³/s | (timestamp,value,status,interpolation_type,rem... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 247 | ZRXPVERSION2300.100 ZRXPCREATORKiIOSystem.ZRXP... | 587630 | Parchen | Tucheim-Parchener Bach | Q.DGJ | Q.DGJ | UTC+1 | -777 | m³/s | (timestamp,value,status,interpolation_type,rem... |
| 248 | ZRXPVERSION2300.100 ZRXPCREATORKiIOSystem.ZRXP... | 555010 | Dietrichsdorf | Zahna | Q.DGJ | Q.DGJ | UTC+1 | -777 | m³/s | (timestamp,value,status,interpolation_type,rem... |
| 249 | ZRXPVERSION2300.100 ZRXPCREATORKiIOSystem.ZRXP... | 576900 | Oberthau | Weiße Elster | Q.DGJ | Q.DGJ | UTC+1 | -777 | m³/s | (timestamp,value,status,interpolation_type,rem... |
| 250 | ZRXPVERSION2300.100 ZRXPCREATORKiIOSystem.ZRXP... | 578220 | Sennewitz | Götsche | W.DGJ | W.DGJ | UTC+1 | -777 | cm | (timestamp,value,status,interpolation_type,rem... |


In [16]:
id_column = 'SANR'
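
Because the files contain both discharge (`Q.DGJ`) and water level (`W.DGJ`) series, the same station number may show up more than once in the metadata. Whether this actually happens in this dataset is not verified here; the sketch below uses made-up rows to show one way to check for it before relying on `SANR` as the id.

```python
import pandas as pd

# made-up rows; the second station appears with a Q and a W series
toy = pd.DataFrame({
    'SANR': ['100001', '100002', '100002'],
    'CNAME': ['Q.DGJ', 'Q.DGJ', 'W.DGJ'],
})

# keep=False marks every row of a duplicated id, not only the repeats
dupes = toy[toy.duplicated('SANR', keep=False)]
print(dupes)

# number of distinct variables per station
per_station = toy.groupby('SANR')['CNAME'].nunique()
print(per_station)
```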

Collect all status codes and interpolation types that occur in the data, and build the output dataframes.

In [17]:
data = []
status = []
int_type = []

for m, df in zip(meta, raw_data):
    # get status
    for s in df.status.unique():
        if s not in status:
            status.append(s)
    
    # get interpolation types
    for t in df.interpolation_type.unique():
        if t not in int_type:
            int_type.append(t)
    
    # make the df
    out = df.iloc[:, :2].copy()
    out.columns = ['date', 'q' if m['CNAME'].startswith('Q') else 'w']
    out['flag'] = np.nan  # np.NaN was removed in NumPy 2.0
    data.append(out)

print(f"Status:              {status}")
print(f"Interpolation types: {int_type}")

Status:              [40, -2147483393, 1064, 3112, 552, -2147482369, -2147482881, 200, -2147483608]
Interpolation types: [603, 601, 604]
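
Before deciding how to translate these codes into a `flag` column, it may help to know how often each one occurs. The sketch below counts the codes over all stations; it uses two made-up frames standing in for the entries of `raw_data`, so the counts shown are not results from the real data.

```python
import pandas as pd

# two made-up frames standing in for the entries of raw_data
frames = [
    pd.DataFrame({'status': [40, 40, 1064], 'interpolation_type': [603, 603, 603]}),
    pd.DataFrame({'status': [40, 200], 'interpolation_type': [601, 604]}),
]

combined = pd.concat(frames, ignore_index=True)

# how often does each code occur over all stations?
status_counts = combined['status'].value_counts()
type_counts = combined['interpolation_type'].value_counts()

print(status_counts)
print(type_counts)
```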


### Finally run

Now, the Q and W data can be extracted. The context takes care of the id creation, data creation, merging, and the mapping from our ids to the original ids and files. This is helpful, as it makes it less likely that we get something wrong.

In [18]:
with Bundesland('Sachsen-Anhalt') as bl:
    # save the metadata
    bl.save_raw_metadata(metadata, id_column, overwrite=True)

    # for reference, call the nuts-mapping as table
    nuts_map = bl.nuts_table
    print(nuts_map.head())

    # go for each station
    for m, df in tqdm(zip(meta, data), total=len(meta)):
        
        # get the provider id
        provider_id = str(m[id_column])
        bl.save_timeseries(df, provider_id)

    # check if there were warnings (there are warnings)
    if len(warn) > 0:
        log_path = bl.save_warnings(warn)
        print(f"There were warnings during the processing. The log can be found at: {log_path}")


    nuts_id provider_id                              path
0  DEE10000      578410  ./DEE/DEE10000/DEE10000_data.csv
1  DEE10010      575310  ./DEE/DEE10010/DEE10010_data.csv
2  DEE10020      594005  ./DEE/DEE10020/DEE10020_data.csv
3  DEE10030      444205  ./DEE/DEE10030/DEE10030_data.csv
4  DEE10040      579610  ./DEE/DEE10040/DEE10040_data.csv


100%|██████████| 252/252 [00:45<00:00,  5.53it/s]
