# Schleswig-Holstein

Every federal state is represented by its own input directory and is processed into a NUTS level 2 directory containing a sub-folder for each discharge location. These folder names are derived from NUTS and reflect the CAMELS id. The NUTS level 2 code for Schleswig-Holstein is `DEF`.

To pre-process the data, you need to write (at least) two functions. One should extract all metadata and condense it into a single `pandas.DataFrame`. This is used to build the folder structure and derive the ids.
The second function has to take an id, as provided by the state authorities, called `provider_id` and return a `pandas.DataFrame` with the transformed data. The dataframe needs the three columns `['date', 'q' | 'w', 'flag']`.
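
As a rough illustration of the expected output format, the following sketch builds a tiny frame with the three required columns and checks it. The helper `check_timeseries_frame` is hypothetical and not part of `camelsp`, and the values are made up.

```python
import pandas as pd


def check_timeseries_frame(df: pd.DataFrame) -> str:
    """Return the variable name ('q' or 'w') if df has the expected columns.

    Hypothetical helper for illustration only, not part of camelsp.
    """
    columns = list(df.columns)
    if len(columns) != 3 or columns[0] != 'date' or columns[2] != 'flag':
        raise ValueError(f"expected ['date', 'q' | 'w', 'flag'], got {columns}")
    if columns[1] not in ('q', 'w'):
        raise ValueError(f"second column must be 'q' or 'w', got {columns[1]!r}")
    return columns[1]


# made-up discharge values for two days
example = pd.DataFrame({
    'date': pd.to_datetime(['2012-01-10', '2012-01-11']),
    'q': [21.3, 20.9],
    'flag': [True, True],
})
variable = check_timeseries_frame(example)
```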

For easier and unified output handling, the `camelsp` package contains a context object called `Bundesland`. It takes a number of names and abbreviations to identify the correct federal state and returns an object that holds helper and save functions.

The context saves files as needed and can easily be changed to save files with different strategies, i.e. fill missing data with NaN, merge data into a single file, create files for each variable or pack everything together into a NetCDF file.

In [1]:
import pandas as pd
import numpy as np
import os
from tqdm import tqdm
from typing import Union, Dict, Tuple
import warnings
from math import radians, sin, atan2, cos, sqrt

from camelsp import Bundesland

The context can also be instantiated like any regular Python class, e.g. to load only the default input data path, which we will use later.

In [2]:
# the context also makes the input path available, if camelsp was installed locally
BASE = Bundesland('Schleswig-Holstein').input_path
BASE

'/home/camel/camelsp/input_data/Q_and_W/SH_Schleswig-Holstein'

## Parse metadata

Gauge (Pegel) metadata can be read quite easily. Only the separator matters, as there is whitespace in the river name column (`'Gewässer'`).

In [3]:
with Bundesland('Schleswig-Holstein') as bl:
    metadata = pd.read_csv(os.path.join(bl.input_path, 'gauge_attributes.csv'), encoding='latin1')

metadata

Unnamed: 0,id,gauge,river,area,x,y,watlevel_period,discharge_period,status,lon,lat
0,114001,Achterwehr,Eider,269.000,32562584,6019011,01.11.1984-21.03.2022,-999,inBetrieb,9.962054,54.315131
1,114450,Achtrupfeld,Brebek,26.550,32503881,6069898,08.11.2001-12.04.2022,08.11.2001-01.03.2022,inBetrieb,9.060335,54.776278
2,110034,Adamsiel,Nordsee,-999.000,32478768,6025071,01.03.1979-01.01.2003,-999,stillgelegtruht,8.673163,54.372988
3,111096,AdenbüllerKoog,,-999.000,32491930,6023940,13.06.2005-02.09.2007,-999,stillgelegtruht,8.875803,54.363202
4,111105,AdenbüllerSielOW,TetenbüllspiekerKanal,12.985,32490061,6025357,29.01.2013-01.09.2018,-999,inBetrieb,8.846992,54.375904
...,...,...,...,...,...,...,...,...,...,...,...
770,114265,Wulfsdorf,HagenerAu,76.800,32584558,6022619,01.11.1985-01.05.1997,-999,stillgelegtruht,10.300781,54.344383
771,114103,Wulksfelde,Alster,139.120,32573621,5952781,01.11.1984-24.04.2022,01.11.1984-06.01.2022,inBetrieb,10.115643,53.718505
772,110005,WykFoehr,Nordsee,-999.000,32472688,6060763,01.11.1950-24.04.2022,-999,inBetrieb,8.576263,54.693458
773,114151,Zarpen,Heilsau,49.600,32600025,5970053,01.11.1984-24.04.2022,01.11.1984-15.03.2022,inBetrieb,10.521238,53.869303


In [4]:
# The id_column is id.
id_column = 'id'

## Stations without data

The metadata lists many stations for which we do not have data files. We remove these stations from the metadata.

In [5]:
no_data_ids = ['110034', '114003', '110126', '110110', '110109', '114483',
               '114495', '110144', '114143', '114634', '114077', '110060',
               '110135', '110178', '110008', '110114', '110020', '114492',
               '114511', '114496', '114017', '110027', '110141', '110159',
               '114467', '114475', '110054', '110153', '110014', '110143',
               '114635', '114512', '114624', '110013', '110147', '110185',
               '110112', '110111', '110021', '110200', '110052', '114499',
               '114149', '114147', '114633', '114476', '114480', '110137',
               '114468', '111089', '111088', '110032', '110132', '114083',
               '114513', '114646', '114647', '114655', '114654', '114033',
               '114497', '114556', '114477', '110062', '114123', '114546',
               '114498', '110024', '110124', '110048', '110002', '110003',
               '110155', '110061', '110152', '110171', '110037', '114183',
               '110040', '110122', '114504', '111090', '111092', '110119',
               '111091', '110006', '110030', '110134', '110173', '110031',
               '110123', '110105', '110125', '110161', '110045', '110049',
               '114037', '110166', '114560', '114484', '110148', '114039',
               '114305', '110163', '110022', '110028', '114478', '114090',
               '114514', '110145', '110195', '110068', '110073', '114470',
               '110136', '110201', '110202', '110149', '110165', '110193',
               '111085', '110118', '114485', '110116', '110183', '110174',
               '110199', '110156', '110063', '110158', '110157', '110175',
               '110071', '110042', '110055', '111084', '111094', '110041',
               '114486', '110026', '114650', '111114', '110170', '114471',
               '114648', '110117', '110187', '110138', '110139', '114493',
               '110035', '114472', '110190', '111095', '110186', '110051',
               '110046', '114487', '114565', '110196', '110069', '110120',
               '110176', '110167', '110019', '110127', '110113', '110150',
               '110151', '110029', '111093', '110033', '110154', '110103',
               '114088', '110101', '110203', '110192', '110164', '114500',
               '114651', '110160', '110009', '110197', '110072', '110067',
               '114378', '114649', '110188', '110198', '114144', '110128',
               '110050', '110044', '114653', '114652', '114057', '110036',
               '110056', '110108', '110107', '110142', '110140', '112255',
               '112254', '110191', '114506', '110106', '110007', '114501',
               '114098', '114489', '114473', '114474', '114137', '114227',
               '110043', '110025', '110184', '114219', '114363', '110064',
               '110204', '110133', '110168', '110169', '110074', '110065',
               '110066', '110038', '110100', '110010', '110177', '110102',
               '110012', '110179', '110047', '110121', '110104', '110130',
               '110004', '114490', '114491', '110180', '110015', '111087',
               '111086', '110023', '110181', '110016', '110146', '110129',
               '114482', '114494', '114072', '110189', '110182', '110172',
               '110001', '110039', '110162', '114509', '114502', '110017',
               '114104', '110005']

metadata = metadata[~metadata['id'].astype(str).isin(no_data_ids)].reset_index(drop=True)
metadata

Unnamed: 0,id,gauge,river,area,x,y,watlevel_period,discharge_period,status,lon,lat
0,114001,Achterwehr,Eider,269.000,32562584,6019011,01.11.1984-21.03.2022,-999,inBetrieb,9.962054,54.315131
1,114450,Achtrupfeld,Brebek,26.550,32503881,6069898,08.11.2001-12.04.2022,08.11.2001-01.03.2022,inBetrieb,9.060335,54.776278
2,111096,AdenbüllerKoog,,-999.000,32491930,6023940,13.06.2005-02.09.2007,-999,stillgelegtruht,8.875803,54.363202
3,111105,AdenbüllerSielOW,TetenbüllspiekerKanal,12.985,32490061,6025357,29.01.2013-01.09.2018,-999,inBetrieb,8.846992,54.375904
4,111106,AdenbüllerSielUW,TetenbüllspiekerKanal,12.985,32490058,6025380,29.01.2013-09.08.2018,-999,inBetrieb,8.846945,54.376110
...,...,...,...,...,...,...,...,...,...,...,...
504,114134,Wrist,Bramau,471.000,32549021,5976008,01.11.1984-24.04.2022,01.11.1984-01.07.1991,inBetrieb,9.746604,53.930147
505,114265,Wulfsdorf,HagenerAu,76.800,32584558,6022619,01.11.1985-01.05.1997,-999,stillgelegtruht,10.300781,54.344383
506,114103,Wulksfelde,Alster,139.120,32573621,5952781,01.11.1984-24.04.2022,01.11.1984-06.01.2022,inBetrieb,10.115643,53.718505
507,114151,Zarpen,Heilsau,49.600,32600025,5970053,01.11.1984-24.04.2022,01.11.1984-15.03.2022,inBetrieb,10.521238,53.869303
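
The hard-coded list above could presumably also be derived from the raw export itself, by checking which metadata ids never occur in its `sta_no_s` column. A minimal sketch on made-up toy frames (`toy_meta` and `toy_raw` are illustrative stand-ins, not the real files):

```python
import pandas as pd

# toy stand-ins for the metadata table and the raw export
toy_meta = pd.DataFrame({'id': [114001, 110034, 114450]})
toy_raw = pd.DataFrame({
    'sta_no_s': [114001, 114001, 114450],
    'abfluss': [1.2, 1.3, 0.4],
})

# ids that are listed in the metadata but never appear in the raw data
has_data = toy_meta['id'].isin(toy_raw['sta_no_s'].unique())
missing_ids = toy_meta.loc[~has_data, 'id'].astype(str).tolist()

# keep only the stations that have at least one record
toy_meta_filtered = toy_meta[has_data].reset_index(drop=True)
```

Deriving the list this way would keep it in sync with the export, at the cost of having to load the raw table before the metadata is filtered.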


## Read all raw data

A status flag that is non-zero, below 120 and not NaN means the value has been checked: `flag != 0 & flag < 120 & ~flag.isna()`.
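
A minimal sketch of how this rule could be expressed as a boolean mask in pandas, using made-up status codes rather than values from the export:

```python
import numpy as np
import pandas as pd

# made-up status codes: zero, two codes below 120, one at 120, one missing
status = pd.Series([0, 50, 110, 120, np.nan])

# checked: the flag is present, non-zero and below 120
checked = status.notna() & (status != 0) & (status < 120)
```

Each condition is wrapped in parentheses because `&` binds more tightly than the comparison operators in Python.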

In [6]:
# with pandas >= 2.0 the dates can be parsed while reading: parse_dates=[2], date_format="%Y-%m-%d %H:%M:%S"
df_whole = pd.read_csv(os.path.join(BASE, 'thy_wst_abfluss_export.zip'), sep=';', encoding='latin1')
df_whole['datum'] = pd.to_datetime(df_whole['datum'])
df_whole

Unnamed: 0,sta_no_s,datum,wst,abfluss,abfluss_status,wst_status
0,114519,2012-01-10,171,21.3252,0.0,0
1,114519,2012-01-11,171,20.8745,0.0,0
2,114519,2012-01-12,170,20.9328,0.0,0
3,114519,2012-01-13,169,20.9042,0.0,0
4,114519,2012-01-14,168,20.1725,0.0,0
...,...,...,...,...,...,...
3746372,111035,2022-05-09,340,,,110
3746373,111035,2022-05-10,341,,,110
3746374,111035,2022-05-11,342,,,110
3746375,111035,2022-05-12,339,,,110


## Load Data

Loading data is a bit more complicated. There is an extra header, but that is not important. Only the 'Einheit' (unit) contains important information, but that has been added manually to the metadata.

In [7]:
#FALSE we only have Q, so remove variable from args
# We only have Q scraped so far, W will be added soon /msp
def extract_file(nr: Union[str, int], variable: str, base_path: str) -> pd.DataFrame:
    # always use str ids
    nr = str(nr)
    # merge the nr
    if '.' in nr:
        nr = nr.replace('.', '')

    # build the path
    assert variable in ['w', 'q', 'W', 'Q']

    path = os.path.join(base_path, variable.upper(), f'{nr}.csv')

    # return empty dataframe if data does not exist
    if not os.path.exists(path):
        return pd.DataFrame(columns=['date', variable.lower(), 'flag'])
    
    # otherwise read
    df = pd.read_csv(path, encoding="latin1", sep=";", decimal=",", parse_dates=[0], dayfirst=True)
    
    df.columns = ['date', variable.lower(), 'flag']
    df[variable.lower()] = df[variable.lower()].astype(float)

    # check if there are any values at all
    if df[variable.lower()].isna().all():
        return pd.DataFrame(columns=['date', variable.lower(), 'flag'])
    
    # build the flag column
    df['flag'] = df.flag.apply(lambda f: f.lower() == 'qualitätsgesichert')

    return df

df = extract_file(114614,'q', BASE)

In [11]:
def extract_file_sh(nr: int) -> Tuple[pd.DataFrame, pd.DataFrame]:
    # slice this station out of the full table; the station id is no longer needed
    df_temp = df_whole[df_whole['sta_no_s'] == nr].drop(['sta_no_s'], axis=1)
    df_temp.columns = ['date', 'w', 'q', 'q_flag', 'w_flag']

    df_temp = df_temp.sort_values(by='date')
    df_temp = df_temp.reset_index(drop=True)

    # all status values below 120 count as quality-assured
    for flag in ['w_flag', 'q_flag']:
        status = df_temp[flag]
        # compare the numeric status first, then mask: NaN flags stay NaN instead of
        # becoming False, since they only occur when there is no measurement
        df_temp[flag] = (status < 120).where(status.notna())

    # return an empty frame if a variable has no values at all
    df_q = pd.DataFrame(columns=['date', 'q', 'flag']) if df_temp['q'].isna().all() else df_temp[['date', 'q', 'q_flag']].rename(columns={'q_flag': 'flag'})
    df_w = pd.DataFrame(columns=['date', 'w', 'flag']) if df_temp['w'].isna().all() else df_temp[['date', 'w', 'w_flag']].rename(columns={'w_flag': 'flag'})

    return df_q, df_w

### Finally run

Now the Q and W data can be extracted. The nice thing is that all the id creation, data creation, merging and the mapping from our ids to the original ids and files is done by the context. This is helpful, as we are less likely to make mistakes.

In [9]:
# Filter settings
onlyrivers = metadata['river'] != 'Nordsee'
# (metadata['x']==-999)
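
The mask above is an ordinary boolean `pandas.Series`, so further conditions can be combined with `&`. A sketch on a made-up frame, assuming that `-999` marks a missing catchment area as it appears to in the metadata table:

```python
import pandas as pd

# made-up stand-in for the metadata table
toy_meta = pd.DataFrame({
    'id': [114001, 110005, 111096],
    'river': ['Eider', 'Nordsee', 'Brebek'],
    'area': [269.0, -999.0, -999.0],
})

# drop coastal gauges on the North Sea
only_rivers = toy_meta['river'] != 'Nordsee'
# additionally require a known catchment area (-999 is assumed to mean missing)
has_area = toy_meta['area'] != -999

selected = toy_meta[only_rivers & has_area]['id'].tolist()
```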

In [12]:
with Bundesland('Schleswig-Holstein') as bl:
    # save the metadata
    bl.save_raw_metadata(metadata, id_column, overwrite=True)

    # for reference, call the nuts-mapping as table
    nuts_map = bl.nuts_table
    print(nuts_map.head())

    # process each station
    for _id in tqdm(metadata[onlyrivers][id_column].values):
    #for _id in tqdm(metadata[id_column].values):
        provider_id = str(_id)
        nuts_id = nuts_map.loc[nuts_map['provider_id'] == provider_id,'nuts_id'].iloc[0]
        # load/slice the two files from df_whole
        q_df, w_df = extract_file_sh(_id)

        bl.save_timeseries(q_df, nuts_id)
        bl.save_timeseries(w_df, nuts_id)


    nuts_id provider_id                              path
0  DEF10000      114001  ./DEF/DEF10000/DEF10000_data.csv
1  DEF10010      114450  ./DEF/DEF10010/DEF10010_data.csv
2  DEF10020      111096  ./DEF/DEF10020/DEF10020_data.csv
3  DEF10030      111105  ./DEF/DEF10030/DEF10030_data.csv
4  DEF10040      111106  ./DEF/DEF10040/DEF10040_data.csv


100%|██████████| 509/509 [00:42<00:00, 11.92it/s]
