# Niedersachsen

Every federal state is represented by its own input directory and is processed into a NUTS level 2 directory containing a sub-folder for each discharge location. These folder names are derived from NUTS and reflect the CAMELS id. The NUTS level 2 code for Niedersachsen is `DE9`.

To pre-process the data, you need to write (at least) two functions. One should extract all metadata and condense it into a single `pandas.DataFrame`. This is used to build the folder structure and derive the ids.
The second function has to take an id, as provided by the state authorities, called `provider_id` and return a `pandas.DataFrame` with the transformed data. The dataframe needs the three columns `['date', 'q' | 'w', 'flag']`.
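The expected layout of the returned dataframe can be sketched as a small check. This is only an illustration: `check_timeseries_frame` is a hypothetical helper, not part of `camelsp`, and it assumes that exactly one of `q` or `w` is present and that `date` is already parsed as a datetime column.

```python
import pandas as pd


def check_timeseries_frame(df: pd.DataFrame) -> str:
    """Return the value column ('q' or 'w') if df has the expected three columns."""
    # assumption: exactly one of 'q' or 'w' is present per dataframe
    value_cols = [c for c in ("q", "w") if c in df.columns]
    if len(value_cols) != 1:
        raise ValueError(f"expected exactly one of 'q' or 'w', found {value_cols}")

    expected = ["date", value_cols[0], "flag"]
    if list(df.columns) != expected:
        raise ValueError(f"expected columns {expected}, got {list(df.columns)}")

    # assumption: dates are parsed before the dataframe is handed over
    if not pd.api.types.is_datetime64_any_dtype(df["date"]):
        raise TypeError("'date' must be a datetime column")
    return value_cols[0]


# toy dataframe in the required layout
example = pd.DataFrame({
    "date": pd.to_datetime(["1985-01-10", "1985-01-11"]),
    "q": [0.853, 0.853],
    "flag": float("nan"),
})
value_column = check_timeseries_frame(example)
print(value_column)
```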

For easier and unified output handling, the `camelsp` package contains a context object called `Bundesland`. It takes a number of names and abbreviations to identify the correct federal state and returns an object that holds helper and save functions.

The context saves files as needed and can easily be changed to save files with different strategies, i.e. fill missing data with NaN, merge data into a single file, create files for each variable or pack everything together into a NetCDF file.

In [1]:
import pandas as pd
import numpy as np
import os
from tqdm import tqdm
from typing import Union, Dict, List
from datetime import datetime as dt
from dateparser import parse
import warnings

from camelsp import Bundesland

The context can also be instantiated like any regular Python class, e.g. to load only the default input data path, which we will use later.

In [2]:
# the context also makes the input path available, if camelsp was installed locally
BASE = Bundesland('niedersachsen').input_path
BASE

'/home/alexd/Projekte/CAMELS/Github/camelsp/input_data/NiS_Niedersachsen'

## Parse data

Niedersachsen provided only one file. I guess this needs to be pivoted.
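For reference, pivoting a long table of this kind into one column per station could look like the sketch below. It uses a toy frame with the column names of the provider file; the values for the second station are made up. The notebook itself continues with a `groupby` on the station id instead, which yields one timeseries per station without building a wide table.

```python
import pandas as pd

# toy long-format frame: one row per station and day
long_df = pd.DataFrame({
    "MESSSTELLE_NR": [3183101, 3183101, 3346103, 3346103],
    "DATUM": pd.to_datetime(["1985-01-10", "1985-01-11", "1985-01-10", "1985-01-11"]),
    "WERT": [0.853, 0.772, 1.2, 1.8],
})

# wide format: dates as index, one column per station
# note: pivot raises if a (DATUM, MESSSTELLE_NR) pair occurs more than once
wide_df = long_df.pivot(index="DATUM", columns="MESSSTELLE_NR", values="WERT")
print(wide_df)
```

If the same day occurs more than once for a station, `DataFrame.pivot_table` with an explicit aggregation function is an alternative.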

In [3]:
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
d_raw = dd.read_csv(os.path.join(BASE, 'exp-peg-par252.csv'), encoding='latin1', sep=';', decimal=',', 
                    parse_dates=['DATUM'], date_format='%d.%m.%y', blocksize=4e6)

with ProgressBar():
    raw = d_raw.compute()

raw

[########################################] | 100% Completed | 7.44 ss


Unnamed: 0,MESSSTELLE_NR,DATUM,LANGNAME,BEZEICHNUNG,KENNUNG_ID,WERT,EINHEIT
0,3183101,1985-01-10,Sudendorf,Abfluss Tagesmittelwert,,0.853,m³/s
1,3183101,1985-01-11,Sudendorf,Abfluss Tagesmittelwert,,0.853,m³/s
2,3183101,1985-01-12,Sudendorf,Abfluss Tagesmittelwert,,0.853,m³/s
3,3183101,1985-01-13,Sudendorf,Abfluss Tagesmittelwert,,0.772,m³/s
4,3183101,1987-09-07,Sudendorf,Abfluss Tagesmittelwert,,0.938,m³/s
...,...,...,...,...,...,...,...
55770,3183101,1985-01-05,Sudendorf,Abfluss Tagesmittelwert,,1.030,m³/s
55771,3183101,1985-01-06,Sudendorf,Abfluss Tagesmittelwert,,1.030,m³/s
55772,3183101,1985-01-07,Sudendorf,Abfluss Tagesmittelwert,,1.800,m³/s
55773,3183101,1985-01-08,Sudendorf,Abfluss Tagesmittelwert,,1.200,m³/s


In [4]:
raw.DATUM

0       1985-01-10
1       1985-01-11
2       1985-01-12
3       1985-01-13
4       1987-09-07
           ...    
55770   1985-01-05
55771   1985-01-06
55772   1985-01-07
55773   1985-01-08
55774   1985-01-09
Name: DATUM, Length: 4195562, dtype: datetime64[ns]

In [5]:
# id column is MESSSTELLE_NR
id_column = 'MESSSTELLE_NR'

In [6]:
# How many different variables are there?
names = []
for _, df in raw.groupby(id_column):
    names.extend(df.BEZEICHNUNG.unique().tolist())
set(names)

{'Abfluss Tagesmittelwert'}

In [7]:
# total messstellen
N = len(raw.groupby(id_column))
print(f"Messstellen: {N}")

Messstellen: 282
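The two checks above can also be written without an explicit loop. A minimal sketch on a toy frame (the station numbers here are placeholders):

```python
import pandas as pd

# toy frame with the same column names as the raw data
toy = pd.DataFrame({
    "MESSSTELLE_NR": [1, 1, 2, 3],
    "BEZEICHNUNG": ["Abfluss Tagesmittelwert"] * 4,
})

# all distinct variable names across every station
variables = set(toy["BEZEICHNUNG"].unique())

# number of distinct stations
n_stations = toy["MESSSTELLE_NR"].nunique()

print(variables, n_stations)
```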


Now create a list of the collected 'metadata' and the actual discharge data.

Extract all metadata for this federal state without using the `Bundesland` context, then use the context later to pass on the extracted metadata. The context has a function for saving *raw* metadata that takes a `pandas.DataFrame` and needs you to identify the id column.
Here, *raw* refers to provider metadata that has not yet been transformed into the CAMELS-de metadata schema.

In [8]:
# result container
meta = []
data = []

# group by id
N = len(raw.groupby(id_column))

# go for it
for nr, df in tqdm(raw.groupby(id_column)):
    meta.append({
        id_column: str(nr),
        'BEZEICHNUNG': df.BEZEICHNUNG.unique().tolist(),
        'EINHEIT': df.EINHEIT.unique().tolist(),
        'LANGNAME': df.LANGNAME.unique().tolist(),
        'KENNUNG_ID': df.KENNUNG_ID.unique().tolist()
    })
    data.append(pd.DataFrame({
        'date': df.DATUM,
        'q': df.WERT,
        'flag': np.nan  # np.NaN was removed in NumPy 2.0
    }))

print(f"Extracted {len(data)} timeseries")
        

100%|██████████| 282/282 [00:00<00:00, 657.53it/s]

Extracted 282 timeseries





The DATUM column is not in chronological order: the dates jump back and forth (see the raw output above), so each timeseries has to be sorted by date.

In [9]:
# sort by date
for df in data:
    df.sort_values(by='date', inplace=True)
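Whether a series actually needed sorting, and whether it contains repeated dates afterwards, can be checked with a few pandas calls. A small sketch on toy data that mimics the jumps seen in the raw file:

```python
import pandas as pd

# toy series with a jump forward and back in time
toy = pd.DataFrame({
    "date": pd.to_datetime(["1985-01-10", "1985-01-11", "1987-09-07", "1985-01-05"]),
    "q": [0.853, 0.853, 0.938, 1.030],
})

# False here, because the last row is earlier than the rows before it
was_sorted = toy["date"].is_monotonic_increasing

# sort and rebuild a clean 0..n-1 index
sorted_toy = toy.sort_values(by="date").reset_index(drop=True)

# repeated dates would point to duplicated records in the source
has_duplicates = sorted_toy["date"].duplicated().any()

print(was_sorted, has_duplicates)
```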

### metadata

Ok, let's get really wild. Check that the code above produced only lists with exactly one unique value per group. Otherwise the metadata would change over time for the same Messstelle, and that would be a problem.

In [10]:
def tidy_metadata(meta: List[dict]) -> pd.DataFrame:
    pmeta = []
    for i, m in enumerate(meta):
        out = {}
        for k, v in m.items():
            if isinstance(v, list):
                if len(v) == 1:
                    out[k] = v[0]
                else:
                    # the key is dropped from the output; str() keeps join from failing on non-string values
                    warnings.warn(f"Line {i + 1}: More than one value found for {k}: [{', '.join(map(str, v))}]")
            else:
                out[k] = v
        pmeta.append(out)
    return pd.DataFrame(pmeta)
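As a cross-check, the same consistency test can be run on the long table directly with `groupby(...).nunique()`. The sketch below uses a toy frame in which station 2 deliberately carries two different names; `dropna=False` counts NaN as a value, similar to how `unique()` keeps it in the lists above.

```python
import pandas as pd

# toy frame: station 2 has two different LANGNAME values
toy = pd.DataFrame({
    "MESSSTELLE_NR": [1, 1, 2, 2],
    "LANGNAME": ["A", "A", "B", "C"],
    "EINHEIT": ["m³/s"] * 4,
})

# number of distinct values per station and column
counts = toy.groupby("MESSSTELLE_NR").nunique(dropna=False)

# stations where any metadata column has more than one value
inconsistent = counts[(counts > 1).any(axis=1)]
print(inconsistent)
```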


### Finally run

Now, the Q and W data can be extracted along with the metadata. Conveniently, the context handles all of the id creation, data creation, merging and the mapping from our ids to the original ids and files. This helps, as it makes us less likely to screw something up.

In [11]:
with Bundesland('Niedersachsen') as bl:
    # catch warnings
    with warnings.catch_warnings(record=True) as warns:
        # tidy the metadata
        metadata = tidy_metadata(meta)

        # save the metadata
        bl.save_raw_metadata(metadata, id_column, overwrite=True)

        # for reference, call the nuts-mapping as table
        nuts_map = bl.nuts_table
        print(nuts_map.head())
    
        # go for all ids
        # use `m` as loop variable, so the `meta` list is not overwritten
        for m, df in tqdm(zip(meta, data), total=N):
            # get the id
            provider_id = m[id_column]

            # save
            bl.save_timeseries(df, provider_id)
        
        # check if there were warnings (there are warnings)
        if len(warns) > 0:
            log_path = bl.save_warnings(warns)
            print(f"There were warnings during the processing. The log can be found at: {log_path}")


    nuts_id provider_id                              path
0  DE910000     3183101  ./DE9/DE910000/DE910000_data.csv
1  DE910010     3346103  ./DE9/DE910010/DE910010_data.csv
2  DE910020     3437108  ./DE9/DE910020/DE910020_data.csv
3  DE910030     3445100  ./DE9/DE910030/DE910030_data.csv
4  DE910040     3449100  ./DE9/DE910040/DE910040_data.csv


100%|██████████| 282/282 [00:04<00:00, 66.43it/s]


## There are duplicated `LANGNAME` values in the metadata

In [46]:
raw_meta = pd.read_csv('../output_data/raw_metadata/DE9_raw_metadata.csv')
raw_meta

Unnamed: 0,MESSSTELLE_NR,BEZEICHNUNG,EINHEIT,LANGNAME,KENNUNG_ID
0,3183101,Abfluss Tagesmittelwert,m³/s,Sudendorf,
1,3346103,Abfluss Tagesmittelwert,m³/s,Schwege,
2,3437108,Abfluss Tagesmittelwert,m³/s,Beesten,
3,3445100,Abfluss Tagesmittelwert,m³/s,Spelle,
4,3449100,Abfluss Tagesmittelwert,m³/s,Spelle,
...,...,...,...,...,...
277,9286155,Abfluss Tagesmittelwert,m³/s,Osterwald,
278,9286161,Abfluss Tagesmittelwert,m³/s,Haselaar,
279,9286162,Abfluss Tagesmittelwert,m³/s,Emlichheim,
280,9286171,Abfluss Tagesmittelwert,m³/s,Wilsum,


For some stations we have no location:

In [39]:
bl = Bundesland('Niedersachsen')

# get metadata
meta = bl.metadata

# get ids where we have no location
ids_no_loc = meta[meta['x'].isna()].provider_id.values
print(f"IDs without location: {ids_no_loc}")

meta[meta['x'].isna()]

IDs without location: ['3445100' '3547104' '3613185' '3658105' '3881114' '4661185' '4665103'
 '4821120' '4822106' '4824114' '4824118' '4881125' '4886119' '4892110'
 '4894120' '4896119' '4914104' '4961133' '5952115' '5963104' '5972101'
 '5984103' '5987101' '5987106' '5994104' '5998102']


Unnamed: 0,camels_id,provider_id,camels_path,nuts_lvl2,federal_state,area,x,y,lon,lat,q_count,w_count,q_w_pearson,q_w_spearman
1489,DE910030,3445100,./DE9/DE910030/DE910030_data.csv,DE9,Niedersachsen,,,,,,6789.0,0.0,,
1492,DE910060,3547104,./DE9/DE910060/DE910060_data.csv,DE9,Niedersachsen,,,,,,12298.0,0.0,,
1496,DE910100,3613185,./DE9/DE910100/DE910100_data.csv,DE9,Niedersachsen,,,,,,1430.0,0.0,,
1514,DE910280,3658105,./DE9/DE910280/DE910280_data.csv,DE9,Niedersachsen,,,,,,11233.0,0.0,,
1541,DE910550,3881114,./DE9/DE910550/DE910550_data.csv,DE9,Niedersachsen,,,,,,4809.0,0.0,,
1560,DE910740,4661185,./DE9/DE910740/DE910740_data.csv,DE9,Niedersachsen,,,,,,1430.0,0.0,,
1562,DE910760,4665103,./DE9/DE910760/DE910760_data.csv,DE9,Niedersachsen,,,,,,8462.0,0.0,,
1579,DE910930,4821120,./DE9/DE910930/DE910930_data.csv,DE9,Niedersachsen,,,,,,7366.0,0.0,,
1581,DE910950,4822106,./DE9/DE910950/DE910950_data.csv,DE9,Niedersachsen,,,,,,11749.0,0.0,,
1583,DE910970,4824114,./DE9/DE910970/DE910970_data.csv,DE9,Niedersachsen,,,,,,20880.0,0.0,,


In [44]:
raw_meta[raw_meta.MESSSTELLE_NR.astype(str).isin(ids_no_loc)]

Unnamed: 0,MESSSTELLE_NR,BEZEICHNUNG,EINHEIT,LANGNAME,KENNUNG_ID
3,3445100,Abfluss Tagesmittelwert,m³/s,Spelle,
6,3547104,Abfluss Tagesmittelwert,m³/s,Lingen Parkstraáe,
10,3613185,Abfluss Tagesmittelwert,m³/s,Schimm,
28,3658105,Abfluss Tagesmittelwert,m³/s,Lodbergen,
55,3881114,Abfluss Tagesmittelwert,m³/s,Thülsfeld,
74,4661185,Abfluss Tagesmittelwert,m³/s,Gesmold,
76,4665103,Abfluss Tagesmittelwert,m³/s,Bruchmühlen,
93,4821120,Abfluss Tagesmittelwert,m³/s,Probsteiburg,
95,4822106,Abfluss Tagesmittelwert,m³/s,Vienenburg E,
97,4824114,Abfluss Tagesmittelwert,m³/s,Hornburg,


These two stations may have been relocated (their LANGNAME values are identical):

In [47]:
duplicated_names = raw_meta[raw_meta.duplicated('LANGNAME', keep=False)]

meta[meta.provider_id.isin(duplicated_names['MESSSTELLE_NR'].astype(str))]

Unnamed: 0,camels_id,provider_id,camels_path,nuts_lvl2,federal_state,area,x,y,lon,lat,q_count,w_count,q_w_pearson,q_w_spearman
1489,DE910030,3445100,./DE9/DE910030/DE910030_data.csv,DE9,Niedersachsen,,,,,,6789.0,0.0,,
1490,DE910040,3449100,./DE9/DE910040/DE910040_data.csv,DE9,Niedersachsen,149.74,4155240.0,3254405.0,7.565579,52.374162,4597.0,0.0,,
1540,DE910540,3881110,./DE9/DE910540/DE910540_data.csv,DE9,Niedersachsen,131.13,4181690.0,3316288.0,7.927776,52.937338,19723.0,0.0,,
1541,DE910550,3881114,./DE9/DE910550/DE910550_data.csv,DE9,Niedersachsen,,,,,,4809.0,0.0,,


For the remaining stations without a location, perhaps no locations were found at all? The data does exist, but I cannot find these gauges online either.