# Rheinland-Pfalz

Every federal state is represented by its own input directory and is processed into a NUTS level 2 directory containing a sub-folder for each discharge location. These folder names are derived from NUTS and reflect the CAMELS id. The NUTS level 2 code for Rheinland-Pfalz is `DEB`.

To pre-process the data, you need to write (at least) two functions. One should extract all metadata and condense it into a single `pandas.DataFrame`. This is used to build the folder structure and derive the ids.
The second function has to take an id, as provided by the state authorities, called `provider_id` and return a `pandas.DataFrame` with the transformed data. The dataframe needs the three columns `['date', 'q' | 'w', 'flag']`.
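
The expected layout can be checked with a few lines of pandas. The following is a minimal sketch; `check_timeseries_frame` is a hypothetical helper and not part of `camelsp`, and it assumes that column order does not matter.

```python
import pandas as pd


def check_timeseries_frame(df: pd.DataFrame) -> str:
    """Return the variable name ('q' or 'w') if df has the three expected columns."""
    cols = set(df.columns)
    variables = cols & {'q', 'w'}
    # exactly three columns: 'date', 'flag' and one of 'q' / 'w'
    if len(cols) != 3 or not {'date', 'flag'} <= cols or len(variables) != 1:
        raise ValueError(f"expected ['date', 'q' | 'w', 'flag'], got {sorted(cols)}")
    return variables.pop()


# toy frame with made-up values, for illustration only
example = pd.DataFrame({
    'date': pd.to_datetime(['2000-01-01', '2000-01-02']),
    'q': [1.2, 1.4],
    'flag': [float('nan'), float('nan')],
})
variable = check_timeseries_frame(example)
print(variable)
```

Running such a check before handing a frame to the context makes a wrongly named column fail early instead of producing a malformed output file.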

For easier and unified output handling, the `camelsp` package contains a context object called `Bundesland`. It takes a number of names and abbreviations to identify the correct federal state and returns an object that holds helper and save functions.

The context saves files as needed and can easily be changed to save files with different strategies, i.e. fill missing data with NaN, merge data into a single file, create files for each variable, or pack everything together into a netCDF file.
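
To illustrate what a "fill missing data with NaN" strategy could look like, here is a small sketch in plain pandas. It is not how `Bundesland` implements saving; the toy values are made up, and the daily frequency is an assumption based on the daily means used in this notebook.

```python
import pandas as pd

# toy series with a gap between 2021-01-02 and 2021-01-05
ts = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-05']),
    'q': [1.0, 1.5, 0.9],
})

# reindex to a gap-free daily index; days without a record become NaN
filled = ts.set_index('date').asfreq('D').reset_index()
print(filled)
```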

In [1]:
import pandas as pd
import numpy as np
from pandas.errors import ParserError
import os
from pprint import pprint
from tqdm import tqdm
from typing import Union, Dict
from glob import glob
from datetime import datetime as dt
from dateparser import parse
import warnings
from io import StringIO
import zipfile

from camelsp import Bundesland

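def hasna(x: np.ndarray) -> bool:
    # NOTE: the body of this helper is missing in the source; it is assumed
    # to report whether the array contains any NaN value
    return bool(np.isnan(x).any())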


The context can also be instantiated like any regular Python class, i.e. to load only the default input data path, which we will use later.

In [2]:
# the context also makes the input path available, if camelsp was installed locally
BASE = Bundesland('Pfalz').input_path
BASE

'/home/alexander/Github/camels/camelsp/input_data/RLP_Rheinland-Pfalz'

### Metadata reader

Define the function that extracts / reads and eventually merges all metadata for this federal state. You can develop the function here without using the Bundesland context, and later use the context to pass the extracted metadata. The context has a function for saving *raw* metadata that takes a `pandas.DataFrame` and needs you to identify the id column.
Here, *raw* refers to provider metadata that has not yet been transformed into the CAMELS-de metadata schema.

In [3]:
with Bundesland('RLP') as bl:
    # read metadata
    metadata = pd.read_excel(os.path.join(BASE, 'aktive Pegel_Februar 2022_mit Stammdaten.xlsx'), header=2)

    # rename 7th column
    metadata.rename(columns={'km': 'km oh. Münd.'}, inplace=True)

    # drop first row
    metadata = metadata.iloc[1:]

    # drop all rows where column Gewässer is NaN (empty lines at the end of the metadata file)
    metadata = metadata[~metadata['Gewässer'].isna()].reset_index(drop=True)

    # according to rlp websites, all Messstellennummern are missing two zeros at the end
    metadata['Nummer'] = metadata['Nummer'] * 100

metadata

Unnamed: 0,Nummer,Stationsname,Gewässer,Aeo,RW,HW,km oh. Münd.,PNP
0,2546015800,Nanzdietschweiler,Glan,200.94,2604686.0,5479753.0,58,215.499
1,2546030700,Eschenau,Glan,598.31,2607203.0,5496963.0,33,180.334
2,2546040900,Odenbach,Glan,1088.17,2619263.0,5507120.0,14.5,147.750
3,2546052200,Stausee Ohmbach,Ohmbach,34.50,2600311.0,5476977.0,3.6,232.367
4,2546057700,Rodenbach 2,Bruchbach,19.44,2620094.0,5483350.0,1.5,220.764
...,...,...,...,...,...,...,...,...
150,2716025700,Steinshof,Wied,,,,,
151,2716050800,Brückrachdorf,Holzbach,71.80,2619609.0,5601929.0,26.7,
152,2716055200,Dierdorf,Holzbach,,,,,
153,2628036600,Bitburg Stausee,"Prüm, Stausee Bitburg",,316797.7,5542933.3,,
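
The step that appends two zeros to `Nummer` can be tried out on a toy frame. This is only a sketch: the two short numbers are the first two ids from the table above with the trailing zeros removed. Note that the resulting ids no longer fit into a signed 32-bit integer, so the column should be kept as `int64` (or as strings).

```python
import pandas as pd

# toy metadata: station numbers as they would look without the two trailing zeros
meta = pd.DataFrame({'Nummer': [25460158, 25460307]})

# append the two zeros numerically; int64 is needed for the result
meta['Nummer'] = meta['Nummer'].astype('int64') * 100

# the padded ids exceed the signed 32-bit range (2**31 - 1)
fits_int32 = bool(meta['Nummer'].max() <= 2**31 - 1)

# string form, as used later for the provider ids
provider_ids = meta['Nummer'].astype(str).tolist()
print(provider_ids, fits_int32)
```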


## Stations without data
There are some stations in the metadata for which we do not have datafiles.  
We delete these stations from the metadata.   
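
Instead of maintaining the list of ids by hand, the stations without data could also be derived from the data tables. The sketch below uses toy frames that mimic the `Nummer` and `Messst.Nr` columns of this notebook; it is an alternative illustration, not the procedure used here.

```python
import pandas as pd

# toy stand-ins for metadata, q_data and w_data
meta = pd.DataFrame({'Nummer': [2546015800, 2546052200, 2546057700]})
q = pd.DataFrame({'Messst.Nr': [2546015800]})
w = pd.DataFrame({'Messst.Nr': [2546015800]})

# a station counts as "with data" if it appears in q or w
ids_with_data = set(q['Messst.Nr']) | set(w['Messst.Nr'])
has_data = meta['Nummer'].isin(list(ids_with_data))

# ids to drop, as strings, and the cleaned metadata
no_data_ids = meta.loc[~has_data, 'Nummer'].astype(str).tolist()
meta_clean = meta[has_data].reset_index(drop=True)
print(no_data_ids)
```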

In [22]:
metadata[metadata['Stationsname'] == 'Bad Dürkheim, Sägmühle']

Unnamed: 0,Nummer,Stationsname,Gewässer,Aeo,RW,HW,km oh. Münd.,PNP
105,2391025300,"Bad Dürkheim, Sägmühle",Isenach,68.0,2657965.0,5483675.0,26.8,109.612


In [5]:
[2391020900, 2392060000, 2393030800, 2540080400, 2546060200,
       2679060200, 2540075000, 2683080900]

no_data_ids = ['2546052200', '2546057700', '2546061300', '2546077000',
               '2642015900', '2642046500', '2642092000', '2717090900',
               '2718009400', '2718060700', '2589090700', '2725090600',
               '2373011600', '2375000000', '2375050000', '2377030300',
               '2378060200', '2379050400', '2391015100', '2391025300',
               '2391040200', '2391090100', '2672050500', '2677070700',
               '2544031000', '2660012800', '2716005300', '2716025700',
               '2716050800', '2716055200', '2628036600']

metadata = metadata[~metadata['Nummer'].astype(str).isin(no_data_ids)].reset_index(drop=True)
metadata

Unnamed: 0,Nummer,Stationsname,Gewässer,Aeo,RW,HW,km oh. Münd.,PNP
0,2546015800,Nanzdietschweiler,Glan,200.94,2604686.0,5479753.0,58,215.499
1,2546030700,Eschenau,Glan,598.31,2607203.0,5496963.0,33,180.334
2,2546040900,Odenbach,Glan,1088.17,2619263.0,5507120.0,14.5,147.750
3,2546058800,Niedermohr,Mohrbach,100.76,2606209.0,5481298.0,0.8,214.127
4,2546070400,Untersulzbach,Lauter,215.31,2620441.0,5489305.0,18.5,202.390
...,...,...,...,...,...,...,...,...
119,2679020500,Traben-Trarbach,Kautenbach,51.12,2580085.0,5533062.0,3.1,146.310
120,2680020700,Saxler Mühle,Alf,39.85,2563640.0,5555674.0,35,394.500
121,2682050000,Hasborner Mühle,Sammetbach,22.70,2565853.0,5548041.0,5.8,283.602
122,2683060500,Bengel,Alf,138.15,2576050.0,5542582.0,11.14,144.047


In [6]:
# the id column will be Nummer
id_column = 'Nummer'

In [7]:
# get q data
q_data = pd.read_csv(os.path.join(BASE, 'q/abluss_all.txt'), encoding='latin1', sep=' ')

# Probeart contains ['Tagesmittel', 'Tagesminimum', 'Stichprobe - Längs der Streichrichtung'] -> we are only interested in 'Tagesmittel'
q_data = q_data[q_data.Probeart == 'Tagesmittel'].reset_index(drop=True)
q_data

Unnamed: 0,Messst.Nr,Messstelle,Datum,Probeart,Parameter.Nr,Parameter,Einheit,Wert
0,2372030500,Bobenthal,1955-11-01,Tagesmittel,09190/00,Abfluss am Pegel,m³/s,186.0
1,2372030500,Bobenthal,1955-11-02,Tagesmittel,09190/00,Abfluss am Pegel,m³/s,186.0
2,2372030500,Bobenthal,1955-11-03,Tagesmittel,09190/00,Abfluss am Pegel,m³/s,191.0
3,2372030500,Bobenthal,1955-11-04,Tagesmittel,09190/00,Abfluss am Pegel,m³/s,197.0
4,2372030500,Bobenthal,1955-11-05,Tagesmittel,09190/00,Abfluss am Pegel,m³/s,236.0
...,...,...,...,...,...,...,...,...
2622012,2718085500,Niederadenau,2021-12-27,Tagesmittel,09190/00,Abfluss am Pegel,m³/s,27.0
2622013,2718085500,Niederadenau,2021-12-28,Tagesmittel,09190/00,Abfluss am Pegel,m³/s,36.0
2622014,2718085500,Niederadenau,2021-12-29,Tagesmittel,09190/00,Abfluss am Pegel,m³/s,68.0
2622015,2718085500,Niederadenau,2021-12-30,Tagesmittel,09190/00,Abfluss am Pegel,m³/s,65.0
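
The `Probeart` values listed in the comment of the cell above can be inspected with `value_counts` before filtering. A short sketch on a toy frame with made-up values:

```python
import pandas as pd

# toy stand-in for q_data; the values are made up
toy = pd.DataFrame({
    'Messst.Nr': [2372030500, 2372030500, 2372030500],
    'Probeart': ['Tagesmittel', 'Tagesmittel', 'Tagesminimum'],
    'Wert': [186.0, 191.0, 150.0],
})

# how often does each sampling type occur?
counts = toy['Probeart'].value_counts()

# keep only the daily means, as done for q_data and w_data
daily_means = toy[toy['Probeart'] == 'Tagesmittel'].reset_index(drop=True)
print(counts)
```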


In [8]:
# get w data
w_data = pd.read_csv(os.path.join(BASE, 'w/wasserstand_all.txt'), encoding='latin1', sep=' ')

# Probeart contains ['Tagesmittel', 'Stichprobe - Längs der Streichrichtung'] -> we are only interested in 'Tagesmittel'
w_data = w_data[w_data.Probeart == 'Tagesmittel'].reset_index(drop=True)
w_data

Unnamed: 0,Messst.Nr,Messstelle,Datum,Probeart,Parameter.Nr,Parameter,Einheit,Wert
0,2370020500,Bundespegel Maxau,1950-01-01,Tagesmittel,09090/00,Wasserstand am Pegel,cm,331.0
1,2370020500,Bundespegel Maxau,1950-01-02,Tagesmittel,09090/00,Wasserstand am Pegel,cm,326.0
2,2370020500,Bundespegel Maxau,1950-01-03,Tagesmittel,09090/00,Wasserstand am Pegel,cm,321.0
3,2370020500,Bundespegel Maxau,1950-01-04,Tagesmittel,09090/00,Wasserstand am Pegel,cm,327.0
4,2370020500,Bundespegel Maxau,1950-01-05,Tagesmittel,09090/00,Wasserstand am Pegel,cm,331.0
...,...,...,...,...,...,...,...,...
2678269,2718085500,Niederadenau,2021-12-27,Tagesmittel,09090/00,Wasserstand am Pegel,cm,19.0
2678270,2718085500,Niederadenau,2021-12-28,Tagesmittel,09090/00,Wasserstand am Pegel,cm,21.0
2678271,2718085500,Niederadenau,2021-12-29,Tagesmittel,09090/00,Wasserstand am Pegel,cm,28.0
2678272,2718085500,Niederadenau,2021-12-30,Tagesmittel,09090/00,Wasserstand am Pegel,cm,27.0


## file extract and parse

Here, we need to process the filename, as the `'Ort'` is contained in the filename. It looks like the metadata header **always** runs to line 32, with the end of the header indicated by `YTYP;`. Verify this.

In [9]:
def extract_station(nr: Union[int, str], variable: str, data: pd.DataFrame) -> pd.DataFrame:
    # get data for station
    dat = data[data['Messst.Nr'] == int(nr)].reset_index(drop=True)

    return pd.DataFrame({
        'date': pd.to_datetime(dat['Datum']),
        variable: dat['Wert'],
        'flag': np.nan
    })

extract_station(2372030500, 'q', q_data)

Unnamed: 0,date,q,flag
0,1955-11-01,186.0,
1,1955-11-02,186.0,
2,1955-11-03,191.0,
3,1955-11-04,197.0,
4,1955-11-05,236.0,
...,...,...,...
48319,2021-12-27,195.0,
48320,2021-12-28,229.0,
48321,2021-12-29,247.0,
48322,2021-12-30,242.0,


In [10]:
with Bundesland('RLP') as bl:
    # save the metadata
    bl.save_raw_metadata(metadata, id_column, overwrite=True)

    # for reference, call the nuts-mapping as table
    nuts_map = bl.nuts_table
    print(nuts_map.head())

    
    with warnings.catch_warnings(record=True) as warns:
        for provider_id in tqdm(metadata[id_column].values.astype(str)):
            # get data for station
            q = extract_station(provider_id, 'q', q_data)
            w = extract_station(provider_id, 'w', w_data)

            bl.save_timeseries(q, provider_id)
            bl.save_timeseries(w, provider_id)

        # check if there were warnings (there are warnings)
        if len(warns) > 0:
            log_path = bl.save_warnings(warns)
            print(f"There were warnings during the processing. The log can be found at: {log_path}")


    nuts_id provider_id                              path
0  DEB10000  2546015800  ./DEB/DEB10000/DEB10000_data.csv
1  DEB10010  2546030700  ./DEB/DEB10010/DEB10010_data.csv
2  DEB10020  2546040900  ./DEB/DEB10020/DEB10020_data.csv
3  DEB10030  2546058800  ./DEB/DEB10030/DEB10030_data.csv
4  DEB10040  2546070400  ./DEB/DEB10040/DEB10040_data.csv


100%|██████████| 124/124 [00:10<00:00, 12.06it/s]
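
The pattern of collecting warnings during the loop relies on `warnings.catch_warnings(record=True)` from the standard library. The sketch below shows the mechanism in isolation; `process` and the row counts are hypothetical stand-ins for the real processing, and `simplefilter('always')` is added here so that repeated warnings are not suppressed.

```python
import warnings

# hypothetical number of rows found per station (toy values)
rows_per_station = {'2546015800': 10, '2546052200': 0}


def process(station_id: str, n_rows: int) -> None:
    # stand-in for a processing step that warns about empty stations
    if n_rows == 0:
        warnings.warn(f"station {station_id}: no data found")


with warnings.catch_warnings(record=True) as warns:
    # record every warning, even if the same one is raised repeatedly
    warnings.simplefilter('always')
    for sid, n in rows_per_station.items():
        process(sid, n)

# each recorded item carries the original warning in .message
messages = [str(w.message) for w in warns]
print(messages)
```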


## check empty stations (no data)

In [11]:
print(len(metadata))

print(len(q_data['Messst.Nr'].unique()))
print(len(w_data['Messst.Nr'].unique()))

155
143
152


It seems that metadata and q_data / w_data don't quite match here: the numbers of entries are similar, but many of the IDs in q_data / w_data don't exist in metadata.

In [25]:
ids_w = set(w_data['Messst.Nr'].unique())
ids_q = set(q_data['Messst.Nr'].unique())

ids_meta = set(metadata['Nummer'].values.astype(str))

# make ids_w and ids_q strings
ids_w = set(map(str, ids_w))
ids_q = set(map(str, ids_q))

# ids that are in metadata but not in ids_w and ids_q
print(f"IDs that are in metadata but not in data:\n{sorted(ids_meta - ids_w - ids_q)}")

# ids that are in ids_w and ids_q but not in metadata
print(f"IDs that are in data but not in metadata:\n{sorted(ids_w.union(ids_q) - ids_meta)}")


IDs that are in metadata but not in data:
['2373011600', '2375000000', '2375050000', '2377030300', '2378060200', '2379050400', '2391015100', '2391025300', '2391040200', '2391090100', '2544031000', '2546052200', '2546057700', '2546061300', '2546077000', '2589090700', '2628036600', '2642015900', '2642046500', '2642092000', '2660012800', '2672050500', '2677070700', '2716005300', '2716025700', '2716050800', '2716055200', '2717090900', '2718009400', '2718060700', '2725090600']
IDs that are in data but not in metadata:
['2370010300', '2370020500', '2370050000', '2370060200', '2370070000', '2370070400', '2390020400', '2390060100', '2391020900', '2392060000', '2393030800', '2510010700', '2530020800', '2540075000', '2540080400', '2546060200', '2570010400', '2570030800', '2570050100', '2580050000', '2580060800', '2640084200', '2650010000', '2650910000', '2679060200', '2683080900', '2690010900', '2710040300', '2710060700']
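
The same comparison can also be written as an outer merge with `indicator=True`, which labels each id as present in both tables or in only one of them. This is a sketch on toy frames with a hypothetical `id` column, using three ids from the lists above.

```python
import pandas as pd

# toy id tables: one id in both, one only in metadata, one only in the data
meta_ids = pd.DataFrame({'id': ['2546015800', '2546052200']})
data_ids = pd.DataFrame({'id': ['2546015800', '2370020500']})

# the _merge column tells where each id was found
merged = meta_ids.merge(data_ids, on='id', how='outer', indicator=True)

only_meta = merged.loc[merged['_merge'] == 'left_only', 'id'].tolist()
only_data = merged.loc[merged['_merge'] == 'right_only', 'id'].tolist()
print(only_meta, only_data)
```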


In [38]:
ids_no_data = []

for nuts_id in bl.nuts_table['nuts_id']:
    df = bl.get_data(nuts_id)
    if len(df) == 0:
        ids_no_data.append(nuts_id)

# get provider ids of stations with no data
provider_ids_no_data = bl.nuts_table[bl.nuts_table['nuts_id'].isin(ids_no_data)]['provider_id'].values
provider_ids_no_data


array(['2546052200', '2546057700', '2546061300', '2546077000',
       '2642015900', '2642046500', '2642092000', '2717090900',
       '2718009400', '2718060700', '2589090700', '2725090600',
       '2373011600', '2375000000', '2375050000', '2377030300',
       '2378060200', '2379050400', '2391015100', '2391025300',
       '2391040200', '2391090100', '2672050500', '2677070700',
       '2544031000', '2660012800', '2716005300', '2716025700',
       '2716050800', '2716055200', '2628036600'], dtype=object)

We do not have data for the following stations:

In [98]:
metadata[metadata['Nummer'].isin([int(i) for i in provider_ids_no_data])]

Unnamed: 0,Nummer,Stationsname,Gewässer,Aeo,RW,HW,km oh. Münd.,PNP
3,2546052200,Stausee Ohmbach,Ohmbach,34.5,2600311.0,5476977.0,3.6,232.367
4,2546057700,Rodenbach 2,Bruchbach,19.44,2620094.0,5483350.0,1.5,220.764
6,2546061300,Rammelsbach 2,Kuselbach,79.56,2664923.0,5490987.0,1.8,205.087
8,2546077000,Lohnweiler,Lauter,271.327,2615647.0,5501187.0,1.91,165.981
16,2642015900,Steinalben,Queidersbach,32.94,2619924.0,5465927.0,0.2,261.32
19,2642046500,Würschhauser Mühle 2,Wallhalbe,57.61,2611105.0,5464721.0,8.0,252.291
24,2642092000,Bickenalbe,Bickenalb,65.46,2596511.0,5451643.0,2.05,236.106
49,2717090900,Zerwasmühle,Brohlbach,85.355,2593780.0,5594255.0,2.09,76.763
50,2718009400,Müsch 2,Ahr,350.34,2558867.0,5583622.0,64.09,
53,2718060700,Bad Bodendorf,Ahr,858.1,2586226.0,5602363.0,4.9,65.915


In [4]:
import pandas as pd

df_q = pd.read_csv('/home/alexander/Github/camels/camelsp/input_data/RLP_Rheinland-Pfalz/q/abluss_all.txt', encoding='latin1', sep=' ')
df_w = pd.read_csv('/home/alexander/Github/camels/camelsp/input_data/RLP_Rheinland-Pfalz/w/wasserstand_all.txt', encoding='latin1', sep=' ')

In [5]:
missing_data_ids = ['2370010300', '2370020500', '2370050000', '2370060200', '2370070000', '2370070400', '2390020400', '2390060100', '2391020900', '2392060000', '2393030800', '2510010700', '2530020800', '2540075000', '2540080400', '2546060200', '2570010400', '2570030800', '2570050100', '2580050000', '2580060800', '2640084200', '2650010000', '2650910000', '2679060200', '2683080900', '2690010900', '2710040300', '2710060700']

# make missing data ids ints
missing_data_ids = [int(i) for i in missing_data_ids]

pegelnames = df_q[df_q['Messst.Nr'].isin(missing_data_ids)].Messstelle.unique()

# filter out 'Bundespegel' from pegelnames
pegelnames_q = [p for p in pegelnames if 'Bundespegel' not in p]

pegelnames_q

['Bad Dürkheim',
 'Monsheim',
 'Gundersheim',
 'Grolsheim',
 'Rammelsbach',
 'Zell',
 'Höllenthal']

In [6]:
missing_data_ids = ['2370010300', '2370020500', '2370050000', '2370060200', '2370070000', '2370070400', '2390020400', '2390060100', '2391020900', '2392060000', '2393030800', '2510010700', '2530020800', '2540075000', '2540080400', '2546060200', '2570010400', '2570030800', '2570050100', '2580050000', '2580060800', '2640084200', '2650010000', '2650910000', '2679060200', '2683080900', '2690010900', '2710040300', '2710060700']

# make missing data ids ints
missing_data_ids = [int(i) for i in missing_data_ids]

pegelnames = df_w[df_w['Messst.Nr'].isin(missing_data_ids)].Messstelle.unique()

# filter out 'Bundespegel' from pegelnames
pegelnames = [p for p in pegelnames if 'Bundespegel' not in p]

pegelnames

['Bad Dürkheim',
 'Monsheim',
 'Gundersheim',
 'Grolsheim',
 'Rammelsbach',
 'Zell',
 'Bad Kreuznach',
 'Höllenthal']

In [23]:
df_w[df_w['Messstelle'].isin(pegelnames)]#['Messst.Nr'].unique()

Unnamed: 0,Messst.Nr,Messstelle,Datum,Probeart,Parameter.Nr,Parameter,Einheit,Wert
wasserstand_1950-1979.csv.92867,2391020900,Bad Dürkheim,1956-02-01,Tagesmittel,09090/00,Wasserstand am Pegel,cm,52.0
wasserstand_1950-1979.csv.92868,2391020900,Bad Dürkheim,1956-02-02,Tagesmittel,09090/00,Wasserstand am Pegel,cm,53.0
wasserstand_1950-1979.csv.92869,2391020900,Bad Dürkheim,1956-02-03,Tagesmittel,09090/00,Wasserstand am Pegel,cm,51.0
wasserstand_1950-1979.csv.92870,2391020900,Bad Dürkheim,1956-02-04,Tagesmittel,09090/00,Wasserstand am Pegel,cm,52.0
wasserstand_1950-1979.csv.92871,2391020900,Bad Dürkheim,1956-02-05,Tagesmittel,09090/00,Wasserstand am Pegel,cm,52.0
...,...,...,...,...,...,...,...,...
wasserstand_2010-2021.csv.616835,2683080900,Höllenthal,2013-05-10,Tagesmittel,09090/00,Wasserstand am Pegel,cm,36.0
wasserstand_2010-2021.csv.616836,2683080900,Höllenthal,2013-05-11,Tagesmittel,09090/00,Wasserstand am Pegel,cm,36.0
wasserstand_2010-2021.csv.616837,2683080900,Höllenthal,2013-05-12,Tagesmittel,09090/00,Wasserstand am Pegel,cm,37.0
wasserstand_2010-2021.csv.616838,2683080900,Höllenthal,2013-05-13,Tagesmittel,09090/00,Wasserstand am Pegel,cm,37.0


In [43]:
from camelsp import Bundesland

bl = Bundesland('RLP')

# filter bl.metadata.gauge_name for 'Bollendorf'
bl.metadata[bl.metadata.gauge_name.str.contains('Bollendorf')]

Unnamed: 0,camels_id,provider_id,camels_path,nuts_lvl2,federal_state,gauge_name,waterbody_name,gauge_elevation,area,x,y,lon,lat,q_count,w_count,q_w_pearson,q_w_spearman
1114,DEB10880,2620050500,./DEB/DEB10880/DEB10880_data.csv,DEB,Rheinland-Pfalz,Bollendorf 2,Sauer,162.344,3212.8,4059251.0,2977421.0,6.359181,49.850918,24894.0,24894.0,0.970925,0.979965


In [39]:
len(bl.metadata)

124