# Bayern

Every federal state is represented by its own input directory and is processed into a NUTS level 2 directory containing a sub-folder for each discharge location. These folder names are derived from NUTS and reflect the CAMELS id. The NUTS level 2 code for Bayern is `DE2`.

To pre-process the data, you need to write (at least) two functions. One should extract all metadata and condense it into a single `pandas.DataFrame`. This is used to build the folder structure and derive the ids.
The second function has to take an id, as provided by the state authorities, called `provider_id` and return a `pandas.DataFrame` with the transformed data. The dataframe needs the three columns `['date', 'q' | 'w', 'flag']`.

For easier and unified output handling, the `camelsp` package contains a context object called `Bundesland`. It takes a number of names and abbreviations to identify the correct federal state and returns an object that holds helper and save functions.

The context saves files as needed and can easily be changed to save files with different strategies, i.e. fill missing data with NaN, merge data into a single file, create files for each variable or pack everything together into a NetCDF file.

In [None]:
import pandas as pd
from pandas.errors import ParserError
import os
from pprint import pprint
from tqdm import tqdm
from typing import Union, Dict
import zipfile
from datetime import datetime as dt
from io import StringIO
import warnings

from camelsp import Bundesland

The context can also be instantiated like any regular Python class, e.g. to load only the default input data path, which we will use later.

In [None]:
# the context also makes the input path available, if camelsp was installed locally
BASE = Bundesland('Bayern').input_path
BASE

### Metadata reader

Define the function that extracts / reads and eventually merges all metadata for this federal state. You can develop the function here without using the `Bundesland` context, and later use the context to pass on the extracted metadata. The context has a function for saving *raw* metadata that takes a `pandas.DataFrame` and needs you to identify the id column.
Here, *raw* refers to provider metadata that has not yet been transformed into the CAMELS-de metadata schema.

In [None]:
# define the function 
def read_meta(base_path) -> pd.DataFrame:
    path = os.path.join(base_path, 'Stammdaten_Bayern.xlsx')
    meta = pd.read_excel(path)
    return meta

# test it here
metadata = read_meta(BASE)

metadata

In [None]:
# the id column will be Stationsnummer
id_column = 'Stationsnummer'

### File extraction and parsing

I'll keep the files inside the zip for now. In BaWü these zips are nicely flat-packed, so there is actually no need to extract them. Later, we might want to extract them and change the code below.
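
Reading a member straight from an archive, without extracting it, can be sketched with the standard library alone. The archive below is built in memory and its member names are invented for illustration; the real zips may be laid out differently.

```python
import io
import os
import zipfile

# build a tiny throwaway archive in memory; the member names are invented
archive = io.BytesIO()
with zipfile.ZipFile(archive, 'w') as z:
    z.writestr('12345.txt', 'header\n20200101 1,5\n')
    z.writestr('sub/67890.txt', 'header\n20200101 2,5\n')

archive.seek(0)
with zipfile.ZipFile(archive) as z:
    # map the bare id (basename without extension) to the full member name
    mapping = {os.path.basename(n).split('.')[0]: n for n in z.namelist()}

    # open one member directly, nothing is written to disk
    with z.open(mapping['67890']) as f:
        content = f.read().decode('latin1')

print(mapping)
print(content.splitlines()[1])
```

Because the mapping is keyed by the basename, it also works when members sit in sub-folders of the archive.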

Bayern is really nasty, as they change the format inside the files and they have negative water levels, which are most likely a sensor fault code or something similar. I built a dirty workaround for this by handling parser errors. If one occurs, the file content is read into memory and split into a list of rows. Each row that has a negative value in the second column is marked as faulty and skipped. If there were faulty rows, a warning is emitted that contains the indices at which they occurred. The indices are all shifted by 8, as the first 8 rows contain metadata and are skipped anyway.
I checked this procedure for one file.
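
The row filter described above can be tried in isolation. The following is only a sketch under assumptions: `drop_negative_rows` is a hypothetical helper, the sample rows are made up, and values are assumed to be space-separated with a decimal comma.

```python
from typing import List, Tuple

HEADER_ROWS = 8  # taken from the text: the first 8 rows hold metadata


def drop_negative_rows(lines: List[str]) -> Tuple[List[str], List[int]]:
    """Drop data rows with a negative second column and report their indices."""
    kept, faulty = [], []
    for i, row in enumerate(lines[HEADER_ROWS:]):
        # assumption: decimal comma, so convert it before casting to float
        value = float(row.split(' ')[1].replace(',', '.'))
        if value < 0:
            # shift by the header length, as in the warning message
            faulty.append(i + HEADER_ROWS)
        else:
            kept.append(row)
    return kept, faulty


# made-up sample: 8 header rows followed by 3 data rows
sample = [f"#header {n}" for n in range(8)] + [
    "20200101000000 1,5 G",
    "20200102000000 -777 G",
    "20200103000000 2,25 G",
]
kept, faulty = drop_negative_rows(sample)
print(kept)
print(faulty)
```

Here the second data row is dropped and reported at index 9, i.e. its position counted from the start of the file content.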

In [None]:
# helper to map ids to filenames
def get_filename_mapping(zippath: str) -> Dict[str, str]:
    with zipfile.ZipFile(zippath) as z:
        m = dict()
        for f in z.filelist:
            id_only = os.path.basename(f.filename).split('.')[0]
            m[str(id_only)] = f.filename
        return m

def get_file_from_zip(nr: Union[int, str], zippath: str, not_exists = 'raise'):
    # get filename mapping
    fmap = get_filename_mapping(zippath)
    
    # always use string
    fname = str(nr)

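    # search the file: nr may be a filename in the zip or a Stationsnummer
    if fname in fmap.values():
        pass
    elif fname in fmap:
        fname = fmap[fname]
    elif not_exists == 'raise':
        raise FileNotFoundError(f"nr {nr} is nothing we would expect. Use a Stationsnummer or filename in the zip")
    else:
        # unknown id and the caller asked not to raise
        return None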
    
    # go for the file
    with zipfile.ZipFile(zippath) as z:
        if fname not in [f.filename for f in z.filelist]:
            # TODO: here, might want to warn and return an df filled with NAN
            if not_exists == 'raise':
                raise FileNotFoundError(f"{fname} is not in {zippath}")
            else:
                return None

        # return the file content
        return z.open(fname)
        

def extract_file(nr: Union[int, str], variable: str, zippath: str, not_exists = 'raise') -> pd.DataFrame:
        # get the content
        enc_content = get_file_from_zip(nr=nr, zippath=zippath, not_exists=not_exists)
        if enc_content is None:
            return pd.DataFrame(columns=['date', variable.lower(), 'flag'])
        
        # raw content
        #raw = pd.read_csv(z.open(fname), encoding='latin1', skiprows=8, sep=' ', decimal=',', header=None)
        try:
            raw = pd.read_csv(enc_content, encoding='latin1', skiprows=8, sep=' ', decimal=',', header=None)
        except ParserError:
            enc_content.seek(0)
            # Bayern has negative water levels ...
            raw = enc_content.read().decode('latin1').splitlines()
            # values may use a decimal comma (see decimal=',' above), convert before casting
            faulty_rows = [float(row.split(' ')[1].replace(',', '.')) < 0 for row in raw[8:]]
            raw = [row for row, faulty in zip(raw[8:], faulty_rows) if not faulty]
            
            # create in-memory buffer and read the CSV from memory
            buffer = StringIO('\n'.join(raw))
            buffer.seek(0)

            try:
                raw = pd.read_csv(buffer, sep=' ', decimal=',', header=None)
                warnings.warn(f"{nr};FormatError;Faulty rows in {nr};[{', '.join([str(i + 8) for i, fault in enumerate(faulty_rows) if fault])}]")
            except Exception as e:
                # TODO: in most cases there are suddenly more lines - maybe someone has an idea how to fix this
                warnings.warn(f"{nr};ParserError;Nr: {nr} failed altogether;{str(e)}")
                raw = pd.DataFrame(columns=['x1', 'x2', 'x3', 'x4'])
        finally:
            enc_content.close()

        # rename the headers
        # Bayern has more surprises: sometimes they skip releaselevel.
        # But we need to check if it was always releaselevel that was missing
        if len(raw.columns) == 3:
            warnings.warn(f"{nr};FormatError;Nr: {nr} raw file has only 3 columns;Assuming that 'releaselevel' is missing. Please check.")
            raw.columns = ['timestamp', 'value', 'status']
        else:
            raw.columns = ['timestamp', 'value' ,'status', 'releaselevel']

        # parse data
        return pd.DataFrame({
            'date': [dt.strptime(str(t)[:8], '%Y%m%d') for t in raw.timestamp],
            variable.lower(): raw.value.values,

            # TODO: what do the flags mean here?
            'flag': [None for _ in raw.status],

        })

# test 
#m = get_filename_mapping(os.path.join(BASE, 'Abflüsse.zip'))
#key = list(m.keys())[124]
#print(key)
key = 14106504
#f = get_file_from_zip(key, os.path.join(BASE, 'Abflüsse.zip'))
#df = pd.read_csv(f, encoding='latin1', skiprows=8, header=None, sep=' ', decimal=',')
#print(f.read().decode('latin1').splitlines()[200:215])
#f.close()

df = extract_file(key, 'q', os.path.join(BASE, 'Abflüsse.zip'))
df



There is potentially interesting metadata in the header. Let's extract timezone and unit information and rewrite the metadata extraction function for this.
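
Picking two header lines by position fails if a file is shorter than expected. A defensive variant could look like the sketch below; `header_pair` is a hypothetical helper, and the positions 5 and 6 (zero-based) are assumed to hold timezone and unit.

```python
from typing import List, Optional, Tuple


def header_pair(lines: List[str], start: int = 5) -> Tuple[Optional[str], Optional[str]]:
    """Return lines[start] and lines[start + 1], with None where the file is too short."""
    pair = lines[start:start + 2]
    # pad with None so that there are always exactly two entries
    padded = pair + [None] * (2 - len(pair))
    return padded[0], padded[1]


full = [f"line {i}" for i in range(8)]
print(header_pair(full))
print(header_pair(full[:6]))
print(header_pair(['a', 'b']))
```

Slicing never raises on short lists, so the padding step is all that is needed to keep the later column assignment aligned with the metadata rows.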

In [None]:
# define the function 
def read_meta(base_path, scan_files: bool = True) -> pd.DataFrame:
    # get the Stammdaten
    path = os.path.join(base_path, 'Stammdaten_Bayern.xlsx')
    meta = pd.read_excel(path)
    
    # now check for each file, if there is Stuff in the files
    # list of q and w with tz, and unit array each
    if not scan_files:
        return meta
    extras = [[[], []], [[], []]]
    for nr in tqdm(meta.Stationsnummer):
        for i, _zip in enumerate(('Abflüsse.zip', 'Wasserstände.zip')):
            f = get_file_from_zip(nr, os.path.join(base_path, _zip), 'return_none')
            if f is None:
                tup = (None, None,)
            else:
                tup = f.read().decode('latin1').splitlines()[5:7]
                f.close()
            
            # append
            extras[i][0].append(tup[0])
            extras[i][1].append(tup[1])

    # now append the arrays to meta
    meta['timezone_q'] = extras[0][0]
    meta['unit_q'] = extras[0][1]
    meta['timezone_w'] = extras[1][0]
    meta['unit_w'] = extras[1][1]
    
    return meta

# test it here
metadata = read_meta(BASE)

metadata

### Finally run

Now, the Q and W data can be extracted along with the metadata. The cool thing is that all the id creation, data creation, merging and the mapping from our ids to the original ids and files is done by the context. This is helpful, as we are less likely to screw something up.

In [None]:
with Bundesland('Bayern') as bl:
    # save the metadata
    bl.save_raw_metadata(metadata, id_column, overwrite=True)

    # for reference, call the nuts-mapping as table
    nuts_map = bl.nuts_table
    print(nuts_map.head())
    
    
    # join the path for two zips
    q_zip_path = os.path.join(bl.input_path, 'Abflüsse.zip')
    w_zip_path = os.path.join(bl.input_path, 'Wasserstände.zip')
    
    with warnings.catch_warnings(record=True) as warns:
        # go for all ids
        for provider_id in tqdm(nuts_map.provider_id):
            # extract the file for this provider
            try:
                q_df = extract_file(provider_id, 'q', q_zip_path, not_exists='fill_nan')
                w_df = extract_file(provider_id, 'w', w_zip_path, not_exists='fill_nan')
            except Exception as e:
                # report which provider failed and why, then stop
                print(f"{provider_id} failed: {e}")
                break

            # save
            bl.save_timeseries(q_df, provider_id)
            bl.save_timeseries(w_df, provider_id)

        # check if there were warnings (there are warnings)
        if len(warns) > 0:
            log_path = bl.save_warnings(warns)
            print(f"There were warnings during the processing. The log can be found at: {log_path}")
