# Thüringen

Every federal state is represented by its own input directory and is processed into a NUTS level 2 directory containing a sub-folder for each discharge location. These folder names are derived from NUTS and reflect the CAMELS id. The NUTS level 2 code for Thüringen is `DEG`.

To pre-process the data, you need to write (at least) two functions. One should extract all metadata and condense it into a single `pandas.DataFrame`. This is used to build the folder structure and derive the ids.
The second function takes an id as provided by the state authorities, called `provider_id`, and returns a `pandas.DataFrame` with the transformed data. The dataframe needs exactly three columns: `date`, either `q` or `w`, and `flag`, i.e. `['date', 'q' | 'w', 'flag']`.
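
As a minimal sketch of the column contract described above, the following check tests whether a dataframe has the expected layout. The helper name `has_expected_columns` is hypothetical and not part of `camelsp`; it only illustrates the `['date', 'q' | 'w', 'flag']` requirement.

```python
import pandas as pd


def has_expected_columns(df: pd.DataFrame) -> bool:
    """Hypothetical helper: True if df has exactly ['date', 'q' or 'w', 'flag']."""
    cols = list(df.columns)
    return (
        len(cols) == 3
        and cols[0] == 'date'
        and cols[1] in ('q', 'w')
        and cols[2] == 'flag'
    )


# an empty frame with the right layout, as returned for missing files
empty_q = pd.DataFrame(columns=['date', 'q', 'flag'])
# a frame that still carries an extra column
not_ready = pd.DataFrame(columns=['date', 'hour', 'w', 'flag'])

ok_empty = has_expected_columns(empty_q)
ok_extra = has_expected_columns(not_ready)
```

Such a check could be run on each dataframe before handing it to the context, but whether the context validates the columns itself is not shown here.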

For easier and unified output handling, the `camelsp` package contains a context object called `Bundesland`. It takes a number of names and abbreviations to identify the correct federal state and returns an object that holds helper and save functions.

The context saves files as needed and can easily be changed to save files with different strategies, i.e. fill missing data with NaN, merge data into a single file, create files for each variable, or pack everything together into a NetCDF file.

In [1]:
import pandas as pd
import geopandas as gpd
import numpy as np
from pandas.errors import ParserError
import os
import shutil
from glob import glob
from pprint import pprint
from tqdm import tqdm
from typing import Union, Dict, Tuple
from datetime import datetime as dt
from dateparser import parse
import warnings
from io import BytesIO

from camelsp import Bundesland, Station

The context can also be instantiated like any regular Python class, e.g. to load only the default input data path, which we will use later.

In [2]:
# the context also makes the input path available, if camelsp was installed locally
BASE = Bundesland('Thüringen').input_path
BASE

'/home/alexd/Projekte/CAMELS/Github/camelsp/input_data/TH_Thueringen'

## Parse metadata

The gauge (Pegel) metadata can be read quite easily. Only the separator matters: the file is read as tab-separated, because values in `'Gewässer'` contain whitespace.

In [3]:
with Bundesland('Thüringen') as th:
    metadata = pd.read_csv(os.path.join(th.input_path, 'th_pegel_metadaten_v2.txt'), sep='\t')
metadata['unit_q'] = 'm³/s'
metadata['unit_w'] = 'cm'
metadata

Unnamed: 0,Pegelnr,Pegelname,Gewässer,Lage o. M.,EZG,PNP,Höhensystem,HW (GK 4),RW (GK 4),lon,lat,NNQ,Datum NNQ,HHQ,Datum HHQ,unit_q,unit_w
0,573000,Ammern,Unstrut,161.2,182.7,210.243,NH,5676589,601026,10.4470,51.2317,0.130,OFT,115.0,am 04.06.1981,m³/s,cm
1,447000,Arenshausen,Leine,247.1,275.0,196.288,NH,5692387,567538,9.9704,51.3787,0.260,am 09.09.2010,92.8,am 04.06.1981,m³/s,cm
2,574200,Arnstadt,Gera,45.2,174.7,293.577,NH,5630378,636190,10.9330,50.8091,0.210,OFT,75.7,am 10.08.1981,m³/s,cm
3,576500,Berga,Weiße Elster,151.0,1383.0,218.995,NH,5626876,722757,12.1580,50.7509,,,,,m³/s,cm
4,570210,Blankenstein-Rosenthal,Saale,357.0,1013.0,410.517,NH,5587078,692197,11.7047,50.4043,0.306,am 10.07.1976,251.0,am 05.01.1982,m³/s,cm
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,427010,Unterbreizbach-Räsa,Ulster,5.0,399.0,233.323,NH,5628816,568818,9.9767,50.8070,0.180,OFT,218.0,am 04.06.1981,m³/s,cm
59,420120,Vacha,Werra,164.8,2246.0,222.678,NH,5631886,573776,10.0477,50.8340,1.550,am 05.10.1959,321.0,am 10.02.1946,m³/s,cm
60,575110,Wasserthaleben,Helbe,19.0,374.3,174.317,NH,5680112,631983,10.8915,51.2571,0.100,OFT,64.9,am 30.12.2002,m³/s,cm
61,577320,Weida,Weida,7.0,296.7,238.358,NH,5627781,715938,12.0620,50.7616,0.000,OFT,139.0,am 15.08.1924,m³/s,cm
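
To illustrate why the separator matters, here is a small sketch that parses an in-memory sample instead of the real file. The sample contains only three of the columns and one station taken from the table above; it assumes the real file is tab-separated in the same way.

```python
from io import StringIO

import pandas as pd

# reduced, in-memory stand-in for the metadata file (assumption: tab-separated)
sample = "Pegelnr\tPegelname\tGewässer\n576500\tBerga\tWeiße Elster\n"

# splitting on tabs keeps 'Weiße Elster' together in one column
by_tab = pd.read_csv(StringIO(sample), sep='\t')
river = by_tab.loc[0, 'Gewässer']

# splitting the same data row on any whitespace yields one token too many
tokens = sample.splitlines()[1].split()
```

Here `by_tab` has three columns, while the whitespace split produces four tokens for the same row, so the columns would no longer line up.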


In [4]:
# The id_column is Pegelnr.
id_column = 'Pegelnr'

Thüringen supplied additional data for 8 measuring stations. These files are copied into the main data folder and renamed in the next cell.

In [5]:
# get all additional q files
additional_q_files = glob(os.path.join(BASE, 'camels_th_nachlieferung', 'q', '*'))

# copy and rename the files
for q_file in additional_q_files:
    # Extract the filename from the source path
    filename = os.path.basename(q_file)
    
    # Add 'q' prefix to the filename
    new_filename = 'q' + filename
    
    # Create the destination path
    destination_path = os.path.join(BASE, 'q', new_filename)

    # Copy the file to the destination directory with the new name
    shutil.copy(q_file, destination_path)

# same for additional w files
additional_w_files = glob(os.path.join(BASE, 'camels_th_nachlieferung', 'w', '*'))
for w_file in additional_w_files:
    # Extract the filename from the source path
    filename = os.path.basename(w_file)
    
    # Add 'w' prefix to the filename
    new_filename = 'w' + filename
    
    # Create the destination path
    destination_path = os.path.join(BASE, 'w', new_filename)

    # Copy the file to the destination directory with the new name
    shutil.copy(w_file, destination_path)
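
The two loops above differ only in the variable prefix, so they could be folded into one helper. The following is a sketch under that assumption; `copy_with_prefix` is a hypothetical name, and the demo runs on a temporary directory with a dummy file rather than on the real input data.

```python
import shutil
import tempfile
from pathlib import Path


def copy_with_prefix(src_dir, dst_dir, prefix):
    """Hypothetical helper: copy every file in src_dir to dst_dir as <prefix><name>."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(src_dir.iterdir()):
        if not src.is_file():
            continue
        dst = dst_dir / f'{prefix}{src.name}'
        shutil.copy(src, dst)
        copied.append(dst)
    return copied


# demo on a throwaway directory tree with a made-up file name
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'nachlieferung' / 'q'
    src.mkdir(parents=True)
    (src / '123456.txt').write_text('dummy')
    result = copy_with_prefix(src, Path(tmp) / 'q', 'q')
    copied_names = [p.name for p in result]
```

With such a helper, the q and w files would each need a single call, using `'q'` and `'w'` as the prefix.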

## Load Data

Loading the data is a bit more complicated. Each file has an extra header, which can be skipped. Only the 'Einheit' (unit) entry contains important information, and that has been added manually to the metadata.

In [6]:
def extract_file(nr: Union[str, int], variable: str, base_path: str) -> pd.DataFrame:
    # always use str ids
    nr = str(nr)
    # merge the nr
    if '.' in nr:
        nr = nr.replace('.', '')

    # build the filename
    fname = f'{variable.lower()}{nr}.txt'

    # build the path
    path = os.path.join(base_path, variable.lower(), fname)

    # return empty dataframe if data does not exist
    if not os.path.exists(path):
        return pd.DataFrame(columns=['date', variable.lower(), 'flag'])
    
    # otherwise read
    df = pd.read_csv(path, skiprows=21, header=None, sep=" ", parse_dates=[0], dayfirst=True, na_values=['Luecke', 'LUECKE'])
    df.columns = ['date', 'hour', variable.lower(), 'comment']
    
    # check if there are comments
    if not df.comment.isna().all():
        print(df.comment)
    
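    # build the flag column: True where no comment is attached to the value
    # (isna() also works if the comments are strings, np.isnan does not)
    df['flag'] = df.comment.isna()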
    df.drop(['hour', 'comment'], axis=1, inplace=True)
    return df

extract_file('420120', 'w', BASE)

Unnamed: 0,date,w,flag
0,1936-11-01,0.0,True
1,1936-11-02,0.0,True
2,1936-11-03,0.0,True
3,1936-11-04,0.0,True
4,1936-11-05,0.0,True
...,...,...,...
31102,2021-12-27,74.0,True
31103,2021-12-28,77.0,True
31104,2021-12-29,98.0,True
31105,2021-12-30,129.0,True
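
The parsing steps can be tried on a small in-memory sample. This is a sketch only: the three sample rows, the comment text `check`, and the helper name `parse_sample` are made up for illustration, and the real files additionally carry a header that is skipped with `skiprows`. The dates are converted with an explicit format here instead of `parse_dates`.

```python
from io import StringIO

import pandas as pd

# made-up rows in the assumed layout: date, hour, value, optional comment
SAMPLE = (
    "01.11.1936 00:00:00 12.5 check\n"
    "02.11.1936 00:00:00 Luecke\n"
    "03.11.1936 00:00:00 13.0\n"
)


def parse_sample(text: str, variable: str) -> pd.DataFrame:
    """Hypothetical helper mirroring the column handling of extract_file."""
    df = pd.read_csv(
        StringIO(text),
        sep=' ',
        header=None,
        names=['date', 'hour', variable, 'comment'],
        na_values=['Luecke', 'LUECKE'],  # gap markers become NaN
    )
    df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y')
    # flag is True where no comment is attached to the value
    df['flag'] = df['comment'].isna()
    return df.drop(['hour', 'comment'], axis=1)


parsed = parse_sample(SAMPLE, 'w')
```

The gap marker ends up as a missing value in the `w` column, and only the row with a comment gets `flag == False`.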


### Finally run

Now the Q and W data can be extracted. Conveniently, all the id creation, data creation, merging, and the mapping from our ids to the original ids and files is done by the context. This makes it less likely that we get something wrong.

In [7]:
with Bundesland('Thüringen') as bl:
    # save the metadata
    bl.save_raw_metadata(metadata, id_column, overwrite=True)

    # for reference, call the nuts-mapping as table
    nuts_map = bl.nuts_table
    print(nuts_map.head())

    # iterate over all stations
    for _id in tqdm(metadata[id_column].values):
        provider_id = str(_id)
        # extract the two files
        q_df = extract_file(provider_id, 'q', bl.input_path)
        w_df = extract_file(provider_id, 'w', bl.input_path)

        # save
        bl.save_timeseries(q_df, provider_id)
        bl.save_timeseries(w_df, provider_id)


    nuts_id provider_id                              path
0  DEG10000      573000  ./DEG/DEG10000/DEG10000_data.csv
1  DEG10010      447000  ./DEG/DEG10010/DEG10010_data.csv
2  DEG10020      574200  ./DEG/DEG10020/DEG10020_data.csv
3  DEG10030      576500  ./DEG/DEG10030/DEG10030_data.csv
4  DEG10040      570210  ./DEG/DEG10040/DEG10040_data.csv


100%|██████████| 63/63 [00:11<00:00,  5.72it/s]


## Add EZG from provider to all Stations where available

Several shapefiles are available for Thüringen, but there is no clear link between the stations and the shapes, so the catchments (EZG) cannot easily be assigned to the stations.

In [8]:
gdf_meta = gpd.read_file(os.path.join(BASE, '../Shapes/Thueringen_Shapes/oberirdische_einzugsgebiete_thueringens__stand_2016_.shp'))

# for id in gdf_meta['PKZ'].values:
#     # init station via PKZ, ignore warnings as we use provider_id instead of camels_id
#     with warnings.catch_warnings():
#         warnings.simplefilter('ignore')
#         s = Station(id)

#     # get catchment geometry for id
#     catchment = gdf_meta[gdf_meta['PKZ'] == id].iloc[[0]]

#     # save catchment geometry
#     s.save_catchment_geometry(catchment, datasource='federal_agency')

gdf_meta

Unnamed: 0,ID_FLUSSGE,GEBIETSKEN,NAME,FL_CHE_QKM,geometry
0,5000,5621617,Zopte,1.0,"POLYGON ((658976.543 5600192.722, 658979.353 5..."
1,2000,24422119,Sulz,1.0,"MULTIPOLYGON (((585619.430 5599430.770, 585608..."
2,5000,5621611,Zopte,0.0,"POLYGON ((656752.505 5600873.286, 656750.307 5..."
3,5000,5632191,Schwarza,0.0,"POLYGON ((643904.887 5601241.531, 643904.272 5..."
4,5000,5632164,Hinteres Singertal,0.0,"POLYGON ((640824.412 5600574.168, 640824.298 5..."
...,...,...,...,...,...
4088,5000,566633,Pleiße,1.0,"MULTIPOLYGON (((736045.632 5637326.872, 736020..."
4089,5000,5665112,Mainsebach,1.0,"POLYGON ((717899.825 5632166.430, 717903.174 5..."
4090,5000,56664121,Heuckewalder Sprotte,3.0,"MULTIPOLYGON (((731444.870 5635033.800, 731429..."
4091,5000,5663881,Fuchsbach I,4.0,"POLYGON ((722376.207 5632908.524, 722378.380 5..."
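
One way to explore a possible link between stations and shapes is a spatial join on the station coordinates. The sketch below uses toy data and a hypothetical helper `stations_in_polygons`; it assumes that `lon` and `lat` are WGS84 coordinates. Note that the polygon containing a station is only the local sub-catchment, not necessarily the full upstream catchment, so this does not solve the assignment by itself.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box


def stations_in_polygons(stations: pd.DataFrame, polygons: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Hypothetical helper: attach to each station the polygon it lies within."""
    points = gpd.GeoDataFrame(
        stations.copy(),
        geometry=gpd.points_from_xy(stations['lon'], stations['lat']),
        crs='EPSG:4326',  # assumption: lon/lat are WGS84
    )
    # bring the points into the coordinate system of the polygons
    points = points.to_crs(polygons.crs)
    return gpd.sjoin(points, polygons, how='left', predicate='within')


# toy data: one station inside the square, one outside
toy_stations = pd.DataFrame({'Pegelnr': [1, 2], 'lon': [10.5, 12.5], 'lat': [50.5, 50.5]})
toy_polygons = gpd.GeoDataFrame({'NAME': ['A']}, geometry=[box(10, 50, 11, 51)], crs='EPSG:4326')

joined = stations_in_polygons(toy_stations, toy_polygons)
```

Stations without a matching polygon keep a missing value in the joined columns, which makes unmatched stations easy to spot.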


In [7]:
gpd.read_file(os.path.join(BASE, '../Shapes/Thueringen_Shapes/wasserkoerperkategorie_thueringen_stand_2021_.shp'))

Unnamed: 0,OWK_NAME,EU_CD_WB,OWK_KAT,WK_TYP,ZUST_LAND,AREA_KM2,GUV,geometry
0,Werra/Philippsthal,DERW_DEHE_41-4,HMWB - Erheblich veränderter Wasserkörper,9.2,Hessen,0.0,Felda/Ulster/Werra,"MULTIPOLYGON (((568799.385 5631749.005, 568804..."
1,Welsbach,DERW_DETH_56417622,NWB - Natürlicher Wasserkörper,6_K,Thüringen,48.0,Obere Unstrut/Notter,"POLYGON ((614221.125 5666636.648, 614199.060 5..."
2,Spannerbach,DERW_DETH_566654,NWB - Natürlicher Wasserkörper,18,Thüringen,42.0,Pleisse/Schnauder,"POLYGON ((748508.831 5649133.689, 748495.324 5..."
3,Heldrabach; 4174-1,DERW_DEHE_4174-1,NWB - Natürlicher Wasserkörper,5.1,Hessen,23.0,Hoersel/Nesse,"POLYGON ((585499.182 5666520.534, 585487.930 5..."
4,Untere Werra bis Heldrabach,DERW_DETH_41_68-129,NWB - Natürlicher Wasserkörper,9.2,Thüringen,310.0,Hoersel/Nesse,"MULTIPOLYGON (((568919.765 5639741.025, 568903..."
...,...,...,...,...,...,...,...,...
207,Talsperre Bleiloch (2),DELW_DETH_12-2,TS - Talsperre (HMWB),Seetyp,Thüringen,129.0,Obere Saale/Orla,"POLYGON ((693195.067 5589317.783, 693224.846 5..."
208,Salza,DERW_DETH_564178,HMWB - Erheblich veränderter Wasserkörper,6_K,Thüringen,39.0,Obere Unstrut/Notter,"POLYGON ((613951.583 5659137.093, 613912.671 5..."
209,Talsperre Ratscher,DELW_DETH_13,TS - Talsperre (HMWB),Seetyp,Thüringen,2.0,Obere Werra/Schleuse,"POLYGON ((628172.509 5594696.502, 628154.822 5..."
210,Mittlere Unstrut (2),DERW_DETH_564_2,HMWB - Erheblich veränderter Wasserkörper,9.1_K,Thüringen,139.0,Obere Unstrut/Notter,"POLYGON ((635053.763 5665740.343, 635075.912 5..."


In [5]:
gpd.read_file(os.path.join(BASE, '../Shapes/Thueringen_Shapes/Monitoring_Pegel_TH.shp'))

Unnamed: 0,MESSSTELLE,MESSSTELL2,ART_DER_ME,GEWAESSER,WASSERKOER,EU_CODE_DE,PRAEGENDER,OSTWERT_UT,HOCHWERT_U,GEOMETRY,X_EPSG2583,Y_EPSG2583,geometry
0,Leubingen,2000.0,Basischemie - Messstelle,Lossa,Lossa,DERW_DETH_56436_0-39,6K: Keuperbach,650158.0,5674266.0,POINT (650158.056830118 5674266.0703165),650158.06,5674266.07,POINT (650158.060 5674266.070)
1,Hassleben uh,2002.0,Basischemie - Messstelle,Schmale Gera,Gramme,DERW_DETH_56434_0-33,6K: Keuperbach,639910.0,5666902.0,POINT (639910.05000879 5666902.44064145),639910.05,5666902.44,POINT (639910.050 5666902.440)
2,Gebesee,2005.0,Operative Messstelle,Mahlgera,Mahlgera,DERW_DETH_56428_0-12,6K: Keuperbach,636121.0,5664265.0,POINT (636121.996766585 5664265.50920035),636122.00,5664265.51,POINT (636122.000 5664265.510)
3,Werningshausen,2007.0,Basischemie - Messstelle,Gramme,Gramme,DERW_DETH_56434_0-33,6K: Keuperbach,639955.0,5667173.0,POINT (639955.786477227 5667173.49925617),639955.79,5667173.50,POINT (639955.790 5667173.500)
4,Vippach Muendung,2008.0,Operative Messstelle,Vippach,Gramme,DERW_DETH_56434_0-33,6K: Keuperbach,646091.0,5664181.0,POINT (646091.559078167 5664181.68665616),646091.56,5664181.69,POINT (646091.560 5664181.690)
...,...,...,...,...,...,...,...,...,...,...,...,...,...
467,Oberndorf uh,728820.0,Basischemie - Messstelle,Bach aus Oberndorf,Erlbach,DERW_DETH_56652_0-15,6: Mittelgebirgsbach fein (Ca),703944.0,5640592.0,POINT (703944.659064265 5640592.49721542),703944.66,5640592.50,POINT (703944.660 5640592.500)
468,Reichenbach uh,728821.0,Basischemie - Messstelle,Erlbach,Erlbach,DERW_DETH_56652_0-15,6: Mittelgebirgsbach fein (Ca),704478.0,5640143.0,POINT (704478.380776 5640143.75270596),704478.38,5640143.75,POINT (704478.380 5640143.750)
469,Angelroda,729675.0,Operative Messstelle,Zahme Gera,Zahme Gera,DERW_DETH_5642_47-64,5: Mittelgebirgsbach grob (Si),631125.0,5622465.0,POINT (631125.984796723 5622465.57724707),631125.98,5622465.58,POINT (631125.980 5622465.580)
470,Niedergebra,730315.0,Operative Messstelle,Wipper,Obere Wipper,DERW_DETH_5646_59-88,9.1: Mittelgebirgsfluss fein-grob (Ca),611055.0,5698321.0,POINT (611055.813090942 5698321.86004905),611055.81,5698321.86,POINT (611055.810 5698321.860)
