# Rheinland-Pfalz

Every federal state is represented by its own input directory and is processed into a NUTS level 2 directory containing a sub-folder for each discharge location. These folder names are derived from NUTS and reflect the CAMELS id. The NUTS level 2 code for Rheinland-Pfalz is `DEB`.

To pre-process the data, you need to write (at least) two functions. One should extract all metadata and condense it into a single `pandas.DataFrame`. This is used to build the folder structure and derive the ids.
The second function has to take an id, as provided by the state authorities, called `provider_id` and return a `pandas.DataFrame` with the transformed data. The dataframe needs the three columns `['date', 'q' | 'w', 'flag']`.
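
The column contract described above can be checked with a small helper before anything is saved. This is only a sketch: `check_timeseries_frame` is a hypothetical name and not part of `camelsp`, and the example values are made up.

```python
import pandas as pd


def check_timeseries_frame(df: pd.DataFrame) -> str:
    """Return the variable column ('q' or 'w') if df has the expected three columns."""
    cols = list(df.columns)
    valid = (
        len(cols) == 3
        and cols[0] == 'date'
        and cols[1] in ('q', 'w')
        and cols[2] == 'flag'
    )
    if not valid:
        raise ValueError(f"expected ['date', 'q' | 'w', 'flag'], got {cols}")
    return cols[1]


# made-up example frame that follows the contract
example = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-02']),
    'q': [1.5, 1.7],
    'flag': [float('nan'), float('nan')],
})
variable = check_timeseries_frame(example)
```

A frame with missing or misnamed columns raises a `ValueError`, which surfaces layout problems before the files are written.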

For easier and unified output handling, the `camelsp` package contains a context object called `Bundesland`. It takes a number of names and abbreviations to identify the correct federal state and returns an object that holds helper and save functions.

The context saves files as needed and can easily be changed to save files with different strategies, e.g. fill missing data with NaN, merge data into a single file, create files for each variable or pack everything together into a NetCDF file.

In [1]:
import pandas as pd
import numpy as np
from pandas.errors import ParserError
import os
from pprint import pprint
from tqdm import tqdm
from typing import Union, Dict
from glob import glob
from datetime import datetime as dt
from dateparser import parse
import warnings
from io import StringIO

from camelsp import Bundesland

The context can also be instantiated like any regular Python class, e.g. to load only the default input data path, which we will use later.

In [2]:
# the context also makes the input path available, if camelsp was installed locally
BASE = Bundesland('Pfalz').input_path
BASE

'/home/camel/camelsp/input_data/RLP_Rheinland_Pfalz'

### Metadata reader

Define the function that extracts, reads and, where necessary, merges all metadata for this federal state. You can develop the function here without the Bundesland context and later use the context to pass on the extracted metadata. The context has a function for saving *raw* metadata that takes a `pandas.DataFrame` and needs you to identify the id column.
Here, *raw* refers to provider metadata that has not yet been transformed into the CAMELS-de metadata schema.

In [3]:
with Bundesland('RLP') as bl:
    metadata = pd.read_csv(os.path.join(bl.input_path, 'metadaten.csv'), encoding='latin1')

metadata

Unnamed: 0,id,gauge
0,2370020500,Bundespegel Maxau
1,2370060200,Bundespegel Speyer
2,2372030500,Bobenthal
3,2372060000,Salmbacher Passage
4,2375010200,Minfeld
...,...,...
147,2392080300,Pfeddersheim
148,2580050000,Bundespegel Diez
149,2650010000,Bundespegel Trier neu
150,2683060500,Bengel


In [4]:
# the id column is 'id'
id_column = 'id'

## File extraction and parsing

Here, we need to process the filename, as the `'Ort'` is contained in it. The metadata header appears to **always** run up to line 32, with `YTYP;` marking the end of the header. This still needs to be verified.

Now we have to left-join the data, as each Stationsnummer exists twice. Thus, only the combination of Stationsnummer and variable makes the data unique.
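
The uniqueness argument can be illustrated with a toy file listing. This is a sketch under assumptions: the `files` frame and its column names are hypothetical, and only the two station ids and gauge names are taken from the metadata shown earlier.

```python
import pandas as pd

# hypothetical file listing: every station number occurs once per variable
files = pd.DataFrame({
    'Stationsnummer': ['2370020500', '2370020500', '2370060200', '2370060200'],
    'variable': ['q', 'w', 'q', 'w'],
    'path': [
        'q/2370020500_Q.txt', 'w/2370020500_W.txt',
        'q/2370060200_Q.txt', 'w/2370060200_W.txt',
    ],
})

# the station number alone is not a unique key ...
station_is_unique = not files['Stationsnummer'].duplicated().any()
# ... but station number and variable together are
pair_is_unique = not files.duplicated(subset=['Stationsnummer', 'variable']).any()

# left-join the file listing onto the metadata: one row per station and variable
metadata = pd.DataFrame({
    'id': ['2370020500', '2370060200'],
    'gauge': ['Bundespegel Maxau', 'Bundespegel Speyer'],
})
joined = metadata.merge(files, how='left', left_on='id', right_on='Stationsnummer')
```

Because each station matches two files, the join yields two rows per metadata entry, one for `q` and one for `w`.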

In [None]:
# files are saved

In [6]:
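def extract_file(nr: Union[int, str], variable: str, input_path: str) -> pd.DataFrame:
    """Read one station file and return a frame with the columns date, <variable>, flag."""
    # build the path to the correct subfolder, e.g. <input_path>/q/<nr>_Q.txt
    path = os.path.join(input_path, variable, f"{nr}_{variable.upper()}.txt")

    # return an empty frame with the expected columns if there is no file
    if not os.path.exists(path):
        return pd.DataFrame(columns=['date', variable.lower(), 'flag'])

    # skip the four header lines and read the space-separated values
    raw = pd.read_csv(path, skiprows=4, encoding='latin1', sep=' ', header=None)

    # the first eight characters of the first column hold the date as YYYYMMDD
    return pd.DataFrame({
        'date': [dt.strptime(str(value)[:8], '%Y%m%d') for value in raw.iloc[:, 0]],
        variable.lower(): raw.iloc[:, 1],
        'flag': np.nan
    })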

extract_file(42, 'q', BASE)

Unnamed: 0,date,q,flag
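
Since station `42` does not exist, the call above only returns the empty frame. To see the parsing step produce rows, the sketch below feeds an in-memory sample through the same `read_csv` arguments. The sample content is made up: the four header lines and the `<timestamp> <value>` layout are assumptions inferred from `skiprows=4`, `sep=' '` and the `[:8]` date slice, not a copy of a real provider file.

```python
from datetime import datetime as dt
from io import StringIO

import pandas as pd

# made-up sample: four header lines, then "<timestamp> <value>" rows
sample = StringIO(
    "header line 1\n"
    "header line 2\n"
    "header line 3\n"
    "header line 4\n"
    "20200101000000 12.5\n"
    "20200102000000 13.1\n"
)

# same reader arguments as in extract_file (encoding omitted for the in-memory text)
raw = pd.read_csv(sample, skiprows=4, sep=' ', header=None)

# keep only the YYYYMMDD part of the timestamp column
dates = [dt.strptime(str(value)[:8], '%Y%m%d') for value in raw.iloc[:, 0]]

parsed = pd.DataFrame({'date': dates, 'q': raw.iloc[:, 1], 'flag': float('nan')})
```

The resulting `parsed` frame has the three columns `date`, `q` and `flag`, with `flag` filled with NaN as in `extract_file`.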


In [7]:
with Bundesland('RLP') as bl:
    # save the metadata
    bl.save_raw_metadata(metadata, id_column, overwrite=True)

    # for reference, call the nuts-mapping as table
    nuts_map = bl.nuts_table
    print(nuts_map.head())

    
    with warnings.catch_warnings(record=True) as warns:
        for provider_id in tqdm(metadata[id_column].values.astype(str)):
            # get q and w
            q = extract_file(provider_id, 'q', bl.input_path)
            w = extract_file(provider_id, 'w', bl.input_path)

            bl.save_timeseries(q, provider_id)
            bl.save_timeseries(w, provider_id)

        # check if there were warnings (there are warnings)
        if len(warns) > 0:
            log_path = bl.save_warnings(warns)
            print(f"There were warnings during the processing. The log can be found at: {log_path}")


    nuts_id provider_id                              path
0  DEB10000  2370020500  ./DEB/DEB10000/DEB10000_data.csv
1  DEB10010  2370060200  ./DEB/DEB10010/DEB10010_data.csv
2  DEB10020  2372030500  ./DEB/DEB10020/DEB10020_data.csv
3  DEB10030  2372060000  ./DEB/DEB10030/DEB10030_data.csv
4  DEB10040  2375010200  ./DEB/DEB10040/DEB10040_data.csv


100%|██████████| 152/152 [00:02<00:00, 53.29it/s]
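
The loop above relies on `warnings.catch_warnings(record=True)` to collect warnings instead of printing them. The standalone sketch below shows that pattern using only the standard library. `noisy_step` is a hypothetical stand-in for the processing, and the `simplefilter('always')` call is an addition that is not part of the loop above.

```python
import warnings


def noisy_step(value: int) -> int:
    # hypothetical processing step that warns on odd input
    if value % 2:
        warnings.warn(f"odd value encountered: {value}", UserWarning)
    return value * 2


with warnings.catch_warnings(record=True) as warns:
    # record every warning, even if the same one was raised before
    warnings.simplefilter('always')
    results = [noisy_step(v) for v in range(4)]

# each recorded entry carries the original warning object in .message
messages = [str(w.message) for w in warns]
```

Inside the `with` block the warnings are appended to `warns` rather than shown, so they can be inspected or written to a log after the loop has finished.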
