# Thüringen

Every federal state is represented by its own input directory and is processed into a NUTS level 2 directory containing a sub-folder for each discharge location. These folder names are derived from NUTS and reflect the CAMELS id. The NUTS level 2 code for Thüringen is `DEG`.

To pre-process the data, you need to write (at least) two functions. One should extract all metadata and condense it into a single `pandas.DataFrame`. This is used to build the folder structure and derive the ids.
The second function has to take an id, as provided by the state authorities, called `provider_id` and return a `pandas.DataFrame` with the transformed data. The dataframe needs the three columns `['date', 'q' | 'w', 'flag']`.
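
As a rough illustration of the expected layout, the following sketch checks that a dataframe carries exactly the three required columns. The helper `check_timeseries` and the sample values are hypothetical and not part of `camelsp`; the assumption here is that column order does not matter.

```python
import pandas as pd


def check_timeseries(df: pd.DataFrame) -> str:
    """Return 'q' or 'w' if df has the columns date, q or w, and flag (hypothetical helper)."""
    cols = set(df.columns)
    for variable in ('q', 'w'):
        # exactly three columns: date, the variable, and flag
        if len(df.columns) == 3 and cols == {'date', variable, 'flag'}:
            return variable
    raise ValueError(f"expected ['date', 'q' | 'w', 'flag'], got {list(df.columns)}")


# invented example values, only to show the shape of the data
example = pd.DataFrame({
    'date': pd.to_datetime(['2000-01-01', '2000-01-02']),
    'q': [1.2, 1.4],
    'flag': [True, True],
})
variable = check_timeseries(example)
```

A check like this could be run on the output of the second function before handing the data to the context.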

For easier and unified output handling, the `camelsp` package contains a context object called `Bundesland`. It takes a number of names and abbreviations to identify the correct federal state and returns an object that holds helper and save functions.

The context saves files as needed and can easily be changed to save files with different strategies, e.g. fill missing data with NaN, merge data into a single file, create one file per variable, or pack everything together into a NetCDF file.

In [1]:
import pandas as pd
import numpy as np
from pandas.errors import ParserError
import os
from glob import glob
from pprint import pprint
from tqdm import tqdm
from typing import Union, Dict, Tuple
from datetime import datetime as dt
from dateparser import parse
import warnings
from io import BytesIO

from camelsp import Bundesland

The context can also be instantiated like any regular Python class, e.g. to read only the default input data path, which we will use later.

In [3]:
# the context also makes the input path available, if camelsp was installed locally
BASE = Bundesland('Thüringen').input_path
BASE

'/home/alexander/Github/camels/camelsp/input_data/TH_Thueringen'

## Parse metadata

The Pegel (gauge) metadata can be read quite easily. Only the separator matters: the file is tab-separated, and values in `'Gewässer'` contain spaces, so splitting on whitespace would break those names apart.
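
To see why the separator matters, here is a small sketch on an in-memory sample. The sample only mimics three of the metadata columns and is an assumption about the file layout, not the real file.

```python
import pandas as pd
from io import StringIO

# small in-memory sample mimicking the tab-separated layout (assumed, not the real file)
sample = (
    "Pegelnr\tPegelname\tGewässer\n"
    "576500\tBerga\tWeiße Elster\n"
    "420120\tVacha\tWerra\n"
)

# splitting the first data row on any whitespace yields one field too many
first_row = sample.splitlines()[1]
n_whitespace_fields = len(first_row.split())
n_tab_fields = len(first_row.split('\t'))

# reading with an explicit tab separator keeps the river name intact
by_tab = pd.read_csv(StringIO(sample), sep='\t')
river = by_tab.loc[0, 'Gewässer']
```

Here `n_whitespace_fields` is 4 while `n_tab_fields` is 3, which is why `sep='\t'` is passed explicitly.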

In [4]:
with Bundesland('Thüringen') as th:
    metadata = pd.read_csv(os.path.join(th.input_path, 'th_pegel_metadaten_v2.txt'), sep='\t')
metadata['unit_q'] = 'm³/s'
metadata['unit_w'] = 'cm'
metadata

Unnamed: 0,Pegelnr,Pegelname,Gewässer,Lage o. M.,EZG,PNP,Höhensystem,HW (GK 4),RW (GK 4),lon,lat,NNQ,Datum NNQ,HHQ,Datum HHQ,unit_q,unit_w
0,573000,Ammern,Unstrut,161.2,182.7,210.243,NH,5676589,601026,10.4470,51.2317,0.130,OFT,115.0,am 04.06.1981,m³/s,cm
1,447000,Arenshausen,Leine,247.1,275.0,196.288,NH,5692387,567538,9.9704,51.3787,0.260,am 09.09.2010,92.8,am 04.06.1981,m³/s,cm
2,574200,Arnstadt,Gera,45.2,174.7,293.577,NH,5630378,636190,10.9330,50.8091,0.210,OFT,75.7,am 10.08.1981,m³/s,cm
3,576500,Berga,Weiße Elster,151.0,1383.0,218.995,NH,5626876,722757,12.1580,50.7509,,,,,m³/s,cm
4,570210,Blankenstein-Rosenthal,Saale,357.0,1013.0,410.517,NH,5587078,692197,11.7047,50.4043,0.306,am 10.07.1976,251.0,am 05.01.1982,m³/s,cm
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,427010,Unterbreizbach-Räsa,Ulster,5.0,399.0,233.323,NH,5628816,568818,9.9767,50.8070,0.180,OFT,218.0,am 04.06.1981,m³/s,cm
59,420120,Vacha,Werra,164.8,2246.0,222.678,NH,5631886,573776,10.0477,50.8340,1.550,am 05.10.1959,321.0,am 10.02.1946,m³/s,cm
60,575110,Wasserthaleben,Helbe,19.0,374.3,174.317,NH,5680112,631983,10.8915,51.2571,0.100,OFT,64.9,am 30.12.2002,m³/s,cm
61,577320,Weida,Weida,7.0,296.7,238.358,NH,5627781,715938,12.0620,50.7616,0.000,OFT,139.0,am 15.08.1924,m³/s,cm


In [5]:
# The id_column is Pegelnr.
id_column = 'Pegelnr'

## Load Data

Loading the data is a bit more complicated. Each file starts with an extra header block, which can be skipped. Only the `'Einheit'` (unit) entry carries important information, and that has been added manually to the metadata.
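
The general idea of skipping a header block and treating gap markers as missing values can be sketched on a made-up sample. The two header lines, the values, and the empty fourth field are all assumptions for illustration; the real files have a longer header.

```python
import pandas as pd
from io import StringIO

# invented header lines and rows; the fourth field (comment) is left empty
header = "Station: Beispiel\nEinheit: cm\n"
rows = [
    ("01.11.1936", "00:00", "12.0", ""),
    ("02.11.1936", "00:00", "Luecke", ""),
    ("03.11.1936", "00:00", "14.0", ""),
]
raw = header + "\n".join(" ".join(r) for r in rows)

# skip the header block and map the gap marker to NaN
df = pd.read_csv(
    StringIO(raw), skiprows=2, header=None, sep=' ',
    names=['date', 'hour', 'w', 'comment'], na_values=['Luecke'],
)
# dates are written day first
df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y')
# flag is True where no comment was given
df['flag'] = df['comment'].isna()
df = df.drop(columns=['hour', 'comment'])
```

The result has the columns `date`, `w` and `flag`, with the gap on the second day stored as NaN.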

In [6]:
def extract_file(nr: Union[str, int], variable: str, base_path: str) -> pd.DataFrame:
    # always use str ids
    nr = str(nr)
    # strip dots from the id, if any
    if '.' in nr:
        nr = nr.replace('.', '')

    # build the filename
    fname = f'{variable.lower()}{nr}.txt'

    # build the path
    path = os.path.join(base_path, variable.lower(), fname)

    # return empty dataframe if data does not exist
    if not os.path.exists(path):
        return pd.DataFrame(columns=['date', variable.lower(), 'flag'])
    
    # otherwise read
    df = pd.read_csv(path, skiprows=21, header=None, sep=" ", parse_dates=[0], dayfirst=True, na_values=['Luecke', 'LUECKE'])
    df.columns = ['date', 'hour', variable.lower(), 'comment']
    
    # check if there are comments
    if not df.comment.isna().all():
        print(df.comment)
    
    # build the flag column: True where no comment was given
    # (isna also works if the comments are strings, unlike np.isnan)
    df['flag'] = df.comment.isna()
    df.drop(['hour', 'comment'], axis=1, inplace=True)
    return df

extract_file('420120', 'w', BASE)

Unnamed: 0,date,w,flag
0,1936-11-01,0.0,True
1,1936-11-02,0.0,True
2,1936-11-03,0.0,True
3,1936-11-04,0.0,True
4,1936-11-05,0.0,True
...,...,...,...
31102,2021-12-27,74.0,True
31103,2021-12-28,77.0,True
31104,2021-12-29,98.0,True
31105,2021-12-30,129.0,True


### Finally run

Now the Q and W data can be extracted. Conveniently, the id creation, data creation, merging, and the mapping from our ids to the original ids and files are all handled by the context. This makes it less likely that we get something wrong.

In [7]:
with Bundesland('Thüringen') as bl:
    # save the metadata
    bl.save_raw_metadata(metadata, id_column, overwrite=True)

    # for reference, call the nuts-mapping as table
    nuts_map = bl.nuts_table
    print(nuts_map.head())

    # process each gauge
    for _id in tqdm(metadata[id_column].values):
        provider_id = str(_id)
        # extract the two files
        q_df = extract_file(provider_id, 'q', bl.input_path)
        w_df = extract_file(provider_id, 'w', bl.input_path)

        # save
        bl.save_timeseries(q_df, provider_id)
        bl.save_timeseries(w_df, provider_id)


    nuts_id provider_id                              path
0  DEG10000      573000  ./DEG/DEG10000/DEG10000_data.csv
1  DEG10010      447000  ./DEG/DEG10010/DEG10010_data.csv
2  DEG10020      574200  ./DEG/DEG10020/DEG10020_data.csv
3  DEG10030      576500  ./DEG/DEG10030/DEG10030_data.csv
4  DEG10040      570210  ./DEG/DEG10040/DEG10040_data.csv


100%|██████████| 63/63 [00:10<00:00,  6.11it/s]
