# Saarland

Every federal state is represented by its own input directory and is processed into a NUTS level 2 directory containing a sub-folder for each discharge location. These folder names are derived from NUTS and reflect the CAMELS id. The NUTS level 2 code for Saarland is `DEC`.

To pre-process the data, you need to write (at least) two functions. The first extracts all metadata and condenses it into a single `pandas.DataFrame`. This is used to build the folder structure and derive the ids.
The second function takes an id as assigned by the state authorities (the `provider_id`) and returns a `pandas.DataFrame` with the transformed data. The dataframe needs exactly three columns: `date`, either `q` or `w`, and `flag`.
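
The column contract described above can be checked with a small helper. This is only a sketch: `check_timeseries_frame` is a hypothetical name that is not part of `camelsp`, and the discharge values in the example are made up.

```python
import pandas as pd


def check_timeseries_frame(df: pd.DataFrame) -> str:
    """Return the variable name ('q' or 'w') if df has the expected layout.

    Hypothetical helper for illustration, not part of camelsp.
    """
    cols = list(df.columns)
    if len(cols) != 3 or cols[0] != 'date' or cols[2] != 'flag':
        raise ValueError(f"expected ['date', 'q' | 'w', 'flag'], got {cols}")
    if cols[1] not in ('q', 'w'):
        raise ValueError(f"second column must be 'q' or 'w', got {cols[1]!r}")
    return cols[1]


# made-up values, only to show the layout
example = pd.DataFrame({
    'date': pd.to_datetime(['2011-09-01', '2011-09-02']),
    'q': [1.2, 1.4],
    'flag': [float('nan'), float('nan')],
})
variable_name = check_timeseries_frame(example)
print(variable_name)
```

Calling such a check before handing a dataframe to the context makes a wrongly named column fail early instead of producing a malformed output file.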

For easier and unified output handling, the `camelsp` package contains a context object called `Bundesland`. It takes a number of names and abbreviations to identify the correct federal state and returns an object that holds helper and save functions.

The context saves files as needed and can easily be changed to save files with a different strategy, e.g. fill missing data with NaN, merge data into a single file, create one file per variable, or pack everything together into a NetCDF file.

In [1]:
import pandas as pd
import numpy as np
from pandas.errors import ParserError
import os
from pprint import pprint
from tqdm import tqdm
from typing import Union, Dict
from datetime import datetime as dt
from dateparser import parse
import warnings

from camelsp import Bundesland

The context can also be instantiated like any regular Python class, e.g. to load only the default input data path, which we will use later.

In [2]:
# the context also makes the input path available, if camelsp was installed locally
BASE = Bundesland('saarland').input_path
BASE

'/home/alexd/Projekte/CAMELS/Github/camelsp/input_data/SL_Saarland'

### Metadata reader

Define the function that reads and, where necessary, merges all metadata for this federal state. You can develop the function here without using the `Bundesland` context and later use the context to pass on the extracted metadata. The context has a function for saving *raw* metadata, which takes a `pandas.DataFrame` and needs you to name the id column.
Here, *raw* refers to provider metadata that has not yet been transformed into the CAMELS-de metadata schema.

In [3]:
# define the function 
def read_meta(base_path) -> pd.DataFrame:
    path = os.path.join(base_path, 'Qry_Pegel-Stammdaten.xlsx')
    meta = pd.read_excel(path)
    return meta

# test it here
metadata = read_meta(BASE)

metadata.head()

Unnamed: 0,MSTNR,MSTBEM,Pegelname_,Gewässer,Betreiber,Stromgebiet,GebkNR,EZG_Gr,Flusskm,PNP,RW,HW,HWMH_1,HWMH_2
0,1271120,Schieferstollen,Schieferstollen,Wadrill,LUA Saarland,Rhein/Saar/Prims/Wadrill,2646471,"44,200km²","9,71 km","344,48 m über NN",2563271,5495766,,
1,1122120,Geislautern,Geislautern,Rossel,LUA Saarland,Rhein/Mosel/Saar/Rossel,2644791,203km²,"3,275 km","184,38 m über NN",2560492,5455327,210 cm,250 cm
2,1482120,Lisdorf,Lisdorf,Saar,WSA Saarbrücken,Rhein/Mosel/Saar,2645599,4671km²,65885,"173,50 m ü.NN",2555659,5462689,,
3,1251120,Gonnesweiler,Gonnesweiler,Bos,LUA Saarland,Rhein/Nahe/Bos,25411169,"12,500km²","0,08 km","372,40 m über NN",2579042,5492542,,
4,1071120,Nonnweiler IV,Nonnweiler IV,Prims,LUA Saarland,Rhein/Saar/Prims,2646149,"48,40km²","50,050 km","378,37 m über NN",2570340,5496980,,


The entry for the Perl gauge has an error in `RW` and `HW`: a trailing zero is missing in both values. This can also be checked [online](https://undine.bafg.de/rhein/pegel/rhein_pegel_perl.html).

In [4]:
# fix Pegel Perl
print(f"before:\n{metadata[metadata['MSTBEM'] == 'Perl'][['MSTBEM', 'RW', 'HW']]}")

# add a zero at the end of RW and HW (integer)
metadata.loc[metadata['MSTBEM'] == 'Perl', 'RW'] = metadata.loc[metadata['MSTBEM'] == 'Perl', 'RW'] * 10
metadata.loc[metadata['MSTBEM'] == 'Perl', 'HW'] = metadata.loc[metadata['MSTBEM'] == 'Perl', 'HW'] * 10

print(f"after:\n{metadata[metadata['MSTBEM'] == 'Perl'][['MSTBEM', 'RW', 'HW']]}")


before:
   MSTBEM      RW      HW
35   Perl  252679  548181
after:
   MSTBEM       RW       HW
35   Perl  2526790  5481810
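
Errors of this kind can also be found without knowing the station in advance. The following sketch flags rows whose `RW` or `HW` value does not have the expected number of digits. The helper name `find_short_coordinates` is hypothetical, and the default of 7 digits is an assumption read off the coordinate values shown in the metadata table above.

```python
import pandas as pd


def find_short_coordinates(meta: pd.DataFrame,
                           columns=('RW', 'HW'),
                           expected_digits: int = 7) -> pd.DataFrame:
    """Return the rows where any coordinate column has an unexpected digit count.

    Hypothetical helper; expected_digits=7 is an assumption based on the
    values visible in the Saarland metadata.
    """
    mask = pd.Series(False, index=meta.index)
    for col in columns:
        n_digits = meta[col].astype('int64').astype(str).str.len()
        mask |= n_digits != expected_digits
    return meta[mask]


# three rows copied from the metadata shown above (Perl before the fix)
sample = pd.DataFrame({
    'MSTBEM': ['Schieferstollen', 'Geislautern', 'Perl'],
    'RW': [2563271, 2560492, 252679],
    'HW': [5495766, 5455327, 548181],
})
suspicious = find_short_coordinates(sample)
print(suspicious)
```

A digit count only catches dropped or added digits. A coordinate with the right length but wrong value still needs a check against an external source.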


In [5]:
# the id column will be MSTNR
id_column = 'MSTNR'

### Stations without data

There are some stations in the metadata for which we do not have data files.
We remove these stations from the metadata.

In [6]:
no_data_ids = ['1482120', '1071120', '1494120', '1463130', '1241000', '1462230',
               '1321000', '1142120', '1472120', '1502120']

metadata = metadata[~metadata['MSTNR'].astype(str).isin(no_data_ids)].reset_index(drop=True)
metadata.tail()

Unnamed: 0,MSTNR,MSTBEM,Pegelname_,Gewässer,Betreiber,Stromgebiet,GebkNR,EZG_Gr,Flusskm,PNP,RW,HW,HWMH_1,HWMH_2
41,1014120,Urweiler,Urweiler,Todbach,LUA Saarland,Rhein/Mosel/Saar/Blies/Todbach,2642129,"42,1km²","0,250 km","273,81 m über NN",2585036,5482180,100 cm,155 cm
42,1131120,Merzig,Merzig,Seffersbach,LUA Saarland,Rhein/Mosel/Saar/Seffersbach,2649273,50.600km²,"1,24 km","175,85 m über NN",2546980,5479030,,
43,1171120,Sbr.-Rußhütte,Sbr.-Rußhütte,Fischbach,LUA Saarland,Rhein/Mosel/Saar/Fischbach,2643691,52.610km²,"1,74 km","189,78 m über NN",2571193,5457569,,
44,1433120,Losheim III,Losheim III,Losheimer Bach,LUA Saarland,Rhein/Saar/Prims/Losheimer Bach,2646619,"15,000km²","11,23 km","296,33 m über NN",2553958,5486463,,
45,1031120,Bebelsheim,Bebelsheim,Mandelbach,LUA Saarland,Rhein/Mosel/Saar/Blies/Mandelbach,2642989,"21,36 km²","4,30 km","235,16 m über NN",2584452,5447650,,


## File extraction and parsing

In [7]:
# build the filename
variable = 'w'
ID = metadata.loc[45, id_column]

def extract_file(ID: Union[int, str], variable: str, base_path: str) -> pd.DataFrame:
    # always use str ids; the variable is used in uppercase in the file names
    ID = str(ID)
    sym = variable.upper()

    # build the path
    path = os.path.join(base_path, f'TM{sym}', f"{ID}-{sym}.TM{sym}")

    # if the file does not exist, return an empty dataframe
    if not os.path.exists(path):
        return pd.DataFrame(columns=['date', variable.lower(), 'flag'])

    # read the file: whitespace separated, one header line, day-first dates,
    # decimal comma and 9999 / -9999 as missing value markers
    df = pd.read_csv(
        path, encoding='latin1', sep=r'\s+', skiprows=1, header=None,
        parse_dates=[0], dayfirst=True, decimal=',', na_values=[9999, -9999]
    )
    df.columns = ['date', variable.lower()]

    # the provider does not supply quality flags
    df['flag'] = np.nan

    return df

# test stuff
extract_file(ID, variable, BASE)

Unnamed: 0,date,w,flag
0,2011-09-01,,
1,2011-09-02,,
2,2011-09-03,,
3,2011-09-04,,
4,2011-09-05,,
...,...,...,...
3770,2021-12-27,33.0,
3771,2021-12-28,41.0,
3772,2021-12-29,40.0,
3773,2021-12-30,41.0,
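
To see what the `read_csv` options do, they can be tried on a small in-memory sample. The file content below is an assumption about the layout (one header line, then `dd.mm.yyyy value` rows separated by whitespace) and the values are made up. The real provider files may differ in detail.

```python
import io

import pandas as pd

# hypothetical file content: one header line, then 'date value' rows
raw = (
    "Station 0000000 W\n"
    "01.09.2011 -9999\n"
    "02.09.2011 33,5\n"
    "13.09.2011 41\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep=r'\s+',                # any run of whitespace separates the columns
    skiprows=1,                # skip the header line
    header=None,
    parse_dates=[0],
    dayfirst=True,             # 01.09.2011 is 1 September, not 9 January
    decimal=',',               # German decimal comma
    na_values=[9999, -9999],   # missing value markers become NaN
)
df.columns = ['date', 'w']
print(df)
```

The missing value marker in the first row ends up as NaN, which matches the leading NaN rows in the output above.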


### Finally run

Now the Q and W data can be extracted. The context takes care of creating the ids, writing the data, merging, and mapping our ids to the original ids and files. This makes it less likely that we introduce errors along the way.

In [8]:
with Bundesland('Saarland') as bl:
    # save the metadata
    bl.save_raw_metadata(metadata, id_column, overwrite=True)

    # for reference, call the nuts-mapping as table
    nuts_map = bl.nuts_table
    print(nuts_map.head())

    
    with warnings.catch_warnings(record=True) as warns:
        for provider_id in tqdm(metadata[id_column].values.astype(str)):
            
            # get q and w
            q_df = extract_file(provider_id, 'q', bl.input_path)
            w_df = extract_file(provider_id, 'w', bl.input_path)

            bl.save_timeseries(q_df, provider_id)
            bl.save_timeseries(w_df, provider_id)

        # check if there were warnings (there are warnings)
        if len(warns) > 0:
            log_path = bl.save_warnings(warns)
            print(f"There were warnings during the processing. The log can be found at: {log_path}")


    nuts_id provider_id                              path
0  DEC10000     1271120  ./DEC/DEC10000/DEC10000_data.csv
1  DEC10010     1122120  ./DEC/DEC10010/DEC10010_data.csv
2  DEC10020     1251120  ./DEC/DEC10020/DEC10020_data.csv
3  DEC10030     1102220  ./DEC/DEC10030/DEC10030_data.csv
4  DEC10040     1051110  ./DEC/DEC10040/DEC10040_data.csv


100%|██████████| 46/46 [00:05<00:00,  8.76it/s]
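
The mapping table printed above can be used to translate between provider ids and the derived ids. The sketch below rebuilds the first three rows of that output by hand instead of calling the context, so it runs without `camelsp`. That `provider_id` is stored as a string is an assumption.

```python
import pandas as pd

# first three rows of the mapping printed above, rebuilt by hand
nuts_map = pd.DataFrame({
    'nuts_id': ['DEC10000', 'DEC10010', 'DEC10020'],
    'provider_id': ['1271120', '1122120', '1251120'],
    'path': [
        './DEC/DEC10000/DEC10000_data.csv',
        './DEC/DEC10010/DEC10010_data.csv',
        './DEC/DEC10020/DEC10020_data.csv',
    ],
})

# index by provider_id to look up the derived id of a station
lookup = nuts_map.set_index('provider_id')['nuts_id']
geislautern_id = lookup['1122120']
print(geislautern_id)
```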


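## Stations without data

The ids that are removed from the metadata further up were derived from a first run that still included all stations: they are the stations that have neither Q nor W values.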

In [15]:
bl = Bundesland('saarland')

no_data_ids = bl.metadata[(bl.metadata['q_count'] == 0) & (bl.metadata['w_count'] == 0)].provider_id.values

metadata[metadata['MSTNR'].astype(str).isin(no_data_ids)]

Unnamed: 0,MSTNR,MSTBEM,Pegelname_,Gewässer,Betreiber,Stromgebiet,GebkNR,EZG_Gr,Flusskm,PNP,RW,HW,HWMH_1,HWMH_2
2,1482120,Lisdorf,Lisdorf,Saar,WSA Saarbrücken,Rhein/Mosel/Saar,2645599,4671km²,65885,"173,50 m ü.NN",2555659,5462689,,
4,1071120,Nonnweiler IV,Nonnweiler IV,Prims,LUA Saarland,Rhein/Saar/Prims,2646149,"48,40km²","50,050 km","378,37 m über NN",2570340,5496980,,
6,1494120,Rehlingen,Rehlingen,Saar,WSA Saarbrücken,Rhein/Mosel/Saar,2645895,5423 km²,"54,0438 km","165,5 m üner NN",2550728,5470634,,
11,1463130,St. Arnual,St. Arnual,Saar,WSA Saarbrücken,Rhein/Mosel/Saar,2643500,"3944,7km²","90,818 m","183,25 m ü. NN",2574551,5453588,290 cm,380 cm
13,1241000,Seepegel,Seepegel,Bostalsee,Seeverwaltung,Rhein/Nahe/Bos,2541116,"11,890 km²","2,600 km","400,00 m über NN Stauziel ( =Pegelnull)",2577071,5492554,,
14,1462230,Fremersdorf,Fremersdorf,Saar,WSA Saarbrücken,Rhein/Mosel/Saar,2649100,6983 km²,"48,515 km","165,50 m über NN",2547100,5474870,390 cm,640 cm
19,1321000,Seepegel,Seepegel,Losheimer Stauseee,Seeverwaltung,Rhein/Saar/Prims/Losheimer Bach,2646616,"14,7 km²","11,71 km","318,60 m über NN Stauziel (=Pegelnull)",2553608,5487164,,
35,1142120,Perl,Perl,Mosel,WSA Trier,Rhein/Mosel,2617000,11522km²,"241,8 km","138,50 m über NN",2526790,5481810,,
43,1472120,Hanweiler,Hanweiler,Saar,WSA Saarbrücken,Rhein/Mosel/Saar,2643100,3672km²,10405,"189,78 ü.NN",2577838,5442452,320 cm,430 cm
46,1502120,Mettlach,Mettlach,Saar,WSA Saarbrücken,Rhein/Mosel/Saar,2649535,7157km²,"39,918 km","154,50 m ü.NN",2542601,5484072,,


In [16]:
no_data_ids

array(['1482120', '1071120', '1494120', '1463130', '1241000', '1462230',
       '1321000', '1142120', '1472120', '1502120'], dtype=object)
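
A cheaper cross-check that does not need a full run is to test whether a Q or W file exists for each station. The sketch below uses the same path layout as `extract_file` and demonstrates it on a throwaway directory. The helper name `has_any_file` is hypothetical.

```python
import os
import tempfile


def has_any_file(provider_id, base_path, variables=('q', 'w')) -> bool:
    """Check if at least one TMQ / TMW file exists for the given station.

    Hypothetical helper; it mirrors the path layout used in extract_file.
    """
    for variable in variables:
        sym = variable.upper()
        path = os.path.join(base_path, f'TM{sym}', f'{provider_id}-{sym}.TM{sym}')
        if os.path.exists(path):
            return True
    return False


# demo on a temporary directory that holds a W file for one station only
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'TMW'))
    open(os.path.join(tmp, 'TMW', '1031120-W.TMW'), 'w').close()

    ids = ['1031120', '1482120']
    missing = [i for i in ids if not has_any_file(i, tmp)]

print(missing)
```

A file can exist and still contain only missing values, so this check can miss stations that the count-based selection above would find.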