# Baden-Württemberg

Every federal state is represented by its own input directory and is processed into a NUTS level 1 directory containing a sub-folder for each discharge location. These folder names are derived from NUTS and reflect the CAMELS id. The NUTS level 1 code for Baden-Württemberg is `DE1`.

To pre-process the data, you need to write (at least) two functions. One should extract all metadata and condense it into a single `pandas.DataFrame`. This is used to build the folder structure and derive the ids.
The second function has to take an id, as provided by the state authorities, called `provider_id` and return a `pandas.DataFrame` with the transformed data. The dataframe needs the three columns `['date', 'q' | 'w', 'flag']`.
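
As a rough illustration of that layout, here is a minimal sketch of a conforming dataframe together with a small column check. The values are made up, and `has_expected_columns` is a hypothetical helper written for this example, not part of `camelsp`.

```python
import pandas as pd

# made-up values, only to illustrate the expected layout
example = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']),
    'q': [1.25, 1.31, float('nan')],
    'flag': [True, True, False],
})


def has_expected_columns(df: pd.DataFrame) -> bool:
    """Hypothetical check: 'date', 'flag' and exactly one of 'q' or 'w'."""
    cols = list(df.columns)
    has_one_variable = ('q' in cols) != ('w' in cols)
    return len(cols) == 3 and 'date' in cols and 'flag' in cols and has_one_variable


print(has_expected_columns(example))
```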

For easier and unified output handling, the `camelsp` package contains a context object called `Bundesland`. It takes a number of names and abbreviations to identify the correct federal state and returns an object that holds helper and save functions.

The context saves files as needed and can easily be changed to use a different saving strategy, e.g. fill missing data with NaN, merge data into a single file, create one file per variable, or pack everything into a NetCDF file.

In [None]:
import pandas as pd
import os
from pprint import pprint
from tqdm import tqdm
from typing import Union, Dict
import zipfile
from datetime import datetime as dt

from camelsp import Bundesland

The context can also be instantiated like any regular Python class, e.g. to load only the default input data path, which we will use later.

In [None]:
# the context also makes the input path available, if camelsp was installed locally
BASE = Bundesland('bw').input_path
BASE

### Metadata reader

Define the function that reads and, if necessary, merges all metadata for this federal state. You can develop the function here without using the `Bundesland` context and later pass the extracted metadata to the context. The context has a function for saving *raw* metadata, which takes a `pandas.DataFrame` and needs you to identify the id column.
Here, *raw* refers to provider metadata that has not yet been transformed into the CAMELS-de metadata schema.

In [None]:
# define the function 
def read_meta(base_path) -> pd.DataFrame:
    path = os.path.join(base_path, 'BW_Meta.xlsx')
    meta = pd.read_excel(path)
    return meta

# test it here
metadata = read_meta(BASE)

metadata

In [None]:
# the id column will be Messstellennummer
id_column = 'Messstellennummer'
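
Since the ids are derived from this column, it may be worth checking that it is present, complete and unique before handing the metadata to the context. The following is a minimal sketch under that assumption; `check_id_column` is a hypothetical helper, and the small table (including the `Standort` column) is made up for illustration.

```python
import pandas as pd


def check_id_column(meta: pd.DataFrame, id_column: str) -> None:
    """Hypothetical helper: raise if the id column is missing, incomplete or not unique."""
    if id_column not in meta.columns:
        raise KeyError(f"{id_column} is not a column of the metadata")
    ids = meta[id_column]
    if ids.isna().any():
        raise ValueError(f"{id_column} contains missing values")
    if ids.duplicated().any():
        raise ValueError(f"{id_column} contains duplicated ids")


# made-up metadata, only for illustration
example_meta = pd.DataFrame({
    'Messstellennummer': [105, 106, 107],
    'Standort': ['A', 'B', 'C'],
})

# passes silently, because all ids are present and unique
check_id_column(example_meta, 'Messstellennummer')
```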

### File extract and parse

I'll keep the files in the zip for now. In Baden-Württemberg these zips are nicely flat-packed, so the files can be read directly from the archive and there is no need to extract them. Later, we might want to extract them and change the code below.
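
Reading a member straight from a flat archive can be sketched as follows. The archive is built in memory and the file names (`105-Q.csv`, `2345-Q.csv`) are made up; the only assumption carried over from the real data is that the part before the first `-` is the station id.

```python
import io
import zipfile

# build a small in-memory zip with made-up file names of the form "<id>-<rest>"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as z:
    z.writestr('105-Q.csv', 'placeholder')
    z.writestr('2345-Q.csv', 'placeholder')

buffer.seek(0)
with zipfile.ZipFile(buffer) as z:
    # map the id (text before the first '-') to the full member name
    mapping = {name.split('-')[0]: name for name in z.namelist()}

    # open one member directly, without extracting the archive to disk
    with z.open(mapping['105']) as f:
        content = f.read().decode('latin1')

print(mapping)
```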

In [None]:
# helper to map ids to filenames
def get_filename_mapping(zippath: str) -> Dict[str, str]:
    with zipfile.ZipFile(zippath) as z:
        return {f"{f.filename.split('-')[0]}": f.filename for f in z.filelist}

def extract_file(nr: Union[int, str], variable: str, zippath: str, not_exists = 'raise') -> pd.DataFrame:
    # get filename mapping
    fmap = get_filename_mapping(zippath)
    
    # always use string
    fname = str(nr)

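    # search the file: accept either a filename in the zip or a Messstellennummer
    if fname in fmap.values():
        pass  # already a filename inside the zip
    elif fname in fmap:
        fname = fmap[fname]
    elif not_exists == 'raise':
        # unknown ids only raise here; otherwise the check below returns an empty frame
        raise FileNotFoundError(f"nr {nr} is nothing we would expect. Use a LUBW Messstellennummer or filename in the zip")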
    
    # go for the file
    with zipfile.ZipFile(zippath) as z:
        if fname not in [f.filename for f in z.filelist]:
            # TODO: here, might want to warn and return an df filled with NAN
            if not_exists == 'raise':
                raise FileNotFoundError(f"{fname} is not in {zippath}")
            else:
                return pd.DataFrame(columns=['date', variable.lower(), 'flag'])
        
        # raw content
        raw = pd.read_csv(z.open(fname), encoding='latin1', skiprows=3, sep=';', decimal=',')
        
        # pick the value column: 'Q' if present, otherwise 'W'
        col = 'Q' if 'Q' in raw.columns else 'W'

        return pd.DataFrame({
            'date': [dt.strptime(_, '%d.%m.%Y') for _ in raw.Datum],
            col.lower(): raw[col].values,
            # the flag is True only for checked ('ja') values
            'flag': [_.lower().strip() == 'ja' for _ in raw['Geprüft (nein=ungeprüfte Rohdaten)']],
        })

# test 
df = extract_file(105, 'q', '../input_data/BW_Baden_Wuerttemberg/BW_Q.zip')
df
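
To make the parsing options easier to follow, here is a small self-contained sketch that applies the same `read_csv` settings to made-up file content. The three header lines and the values are invented; only the column names `Datum`, `Q` and `Geprüft (nein=ungeprüfte Rohdaten)` are taken from the function above.

```python
import io
import pandas as pd

# made-up file content: three header lines, then a semicolon-separated table
raw_text = (
    "Header line 1\n"
    "Header line 2\n"
    "Header line 3\n"
    "Datum;Q;Geprüft (nein=ungeprüfte Rohdaten)\n"
    "01.01.2020;1,25;ja\n"
    "02.01.2020;1,31;nein\n"
)

# skip the header lines, split on ';' and read ',' as the decimal separator
raw = pd.read_csv(
    io.BytesIO(raw_text.encode('latin1')),
    encoding='latin1', skiprows=3, sep=';', decimal=',',
)

parsed = pd.DataFrame({
    'date': pd.to_datetime(raw['Datum'], format='%d.%m.%Y'),
    'q': raw['Q'].values,
    # True only for checked values ('ja')
    'flag': raw['Geprüft (nein=ungeprüfte Rohdaten)'].str.strip().str.lower() == 'ja',
})

print(parsed)
```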

### Finally run

Now, the Q and W data can be extracted along with the metadata. The cool thing is that all the id creation, data creation, merging, and the mapping from our ids to the original ids and files is done by the context. This is helpful, as we are less likely to screw something up.

In [None]:
with Bundesland('Ba-Wü') as bl:
    # save the metadata
    bl.save_raw_metadata(metadata, id_column, overwrite=True)

    # for reference, call the nuts-mapping as table
    nuts_map = bl.nuts_table
    print(nuts_map.head())

    # join the path for two zips
    q_zip_path = os.path.join(bl.input_path, 'BW_Q.zip')
    w_zip_path = os.path.join(bl.input_path, 'BW_W.zip')
    
    # go for all ids
    for provider_id in tqdm(nuts_map.provider_id):
        # extract the file for this provider
        q_df = extract_file(provider_id, 'q', q_zip_path, not_exists='fill_nan')
        w_df = extract_file(provider_id, 'w', w_zip_path, not_exists='fill_nan')

        # save
        bl.save_timeseries(q_df, provider_id)
        bl.save_timeseries(w_df, provider_id)

