# Mecklenburg-Vorpommern

Every federal state is represented by its own input directory and is processed into a NUTS level 2 directory containing a sub-folder for each discharge location. These folder names are derived from NUTS and reflect the CAMELS id. The NUTS level 2 code for Mecklenburg-Vorpommern is `DE8`.

To pre-process the data, you need to write (at least) two functions. One should extract all metadata and condense it into a single `pandas.DataFrame`. This is used to build the folder structure and derive the ids.
The second function has to take an id, as provided by the state authorities, called `provider_id` and return a `pandas.DataFrame` with the transformed data. The dataframe needs the three columns `['date', 'q' | 'w', 'flag']`.
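
As a rough illustration of the expected output layout, the following sketch checks a frame against the three-column format. The helper name `check_timeseries_frame` is hypothetical and not part of `camelsp`; it only assumes that the variable column is either `q` or `w`. The example values are made up.

```python
import pandas as pd


def check_timeseries_frame(df: pd.DataFrame) -> str:
    """Return the variable name ('q' or 'w') if df has the expected columns.

    Hypothetical helper for illustration only, not part of camelsp.
    """
    cols = list(df.columns)
    if len(cols) != 3 or cols[0] != 'date' or cols[2] != 'flag':
        raise ValueError(f"expected ['date', 'q' | 'w', 'flag'], got {cols}")
    if cols[1] not in ('q', 'w'):
        raise ValueError(f"variable column must be 'q' or 'w', got {cols[1]!r}")
    return cols[1]


# made-up example values
example = pd.DataFrame({
    'date': pd.to_datetime(['2000-01-01', '2000-01-02']),
    'q': [1.2, 1.3],
    'flag': [float('nan'), float('nan')],
})
variable = check_timeseries_frame(example)
```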

For easier and unified output handling, the `camelsp` package contains a context object called `Bundesland`. It takes a number of names and abbreviations to identify the correct federal state and returns an object that holds helper and save functions.

The context saves files as needed and can easily be changed to save files with different strategies, e.g. fill missing data with NaN, merge data into a single file, create files for each variable, or pack everything together into a NetCDF file.

In [1]:
import pandas as pd
import numpy as np
import os
from tqdm import tqdm
from typing import Union
import zipfile
import warnings

from camelsp import Bundesland



The context can also be instantiated like any regular Python class, e.g. to load only the default input data path, which we will use later.

In [2]:
# the context also makes the input path available, if camelsp was installed locally
BASE = Bundesland('Meckpom').input_path
BASE

'/home/alexd/Projekte/CAMELS/Github/camelsp/input_data/Q_and_W/MP_Mecklenburg_Vorpommern'

First extract the ZIP:

In [3]:
# extract the ZIP in place
if not os.path.exists(os.path.join(BASE, 'MetaDaten.csv')):
    with zipfile.ZipFile(os.path.join(BASE, 'w_q.zip')) as z:
        for f in z.filelist:
            z.extract(f, BASE)

### Metadata reader

Define the function that reads and, if necessary, merges all metadata for this federal state. You can develop the function here without using the `Bundesland` context, and later use the context to pass on the extracted metadata. The context has a function for saving *raw* metadata that takes a `pandas.DataFrame` and needs you to identify the id column.
Here, *raw* refers to provider metadata that has not yet been transformed into the CAMELS-de metadata schema.

In [4]:
# define the function 
def read_meta(base_path) -> pd.DataFrame:
    path = os.path.join(base_path, 'MetaDaten.csv')
    meta = pd.read_csv(path, encoding='latin1', sep=";")
    return meta

# test it here
metadata = read_meta(BASE)

metadata

Unnamed: 0,pegelkennzahl,bezeichnung,gewaesser,rechtswert,hochwert,fg_einheit,gebietskennzahl,pnp,pnp_system,vorgaengerpegel,einzugsgebiet,gewaesserkennzahl,start_jahr,end_jahr
1,4341.0,Börzow,Stepenitz,244709,5974401,1,9628511000,5.152,-,,441.0,9.628000e+09,1955.0,1998.0
2,4341.1,Börzow,Stepenitz,244683,5974452,5,9628510000,4.810,DHHN92,4341.0,441.0,9.628000e+09,1955.0,2021.0
3,4342.0,Questin,Stepenitz,246050,5972572,5,9628330000,9.772,DHHN92,,238.0,9.628000e+09,1966.0,2021.0
4,4343.0,Diedrichshagen,Stepenitz,250122,5968604,5,9628177000,27.359,DHHN92,,99.0,9.628000e+09,1964.0,2021.0
5,4344.0,Cramon,Stepenitz,254322,5958325,1,9628151900,42.410,-,,55.0,9.628000e+09,1971.0,1981.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
231,59855.0,Lehsen,Motel,236048,5933915,3,5936888590,27.398,DHHN92,,85.0,5.936888e+09,1976.0,2017.0
232,59859.0,Camin,Schilde,232502,5931690,3,5936887900,22.266,DHHN92,,189.0,5.936880e+09,2008.0,2017.0
233,59870.1,Hagenow,Schmaar,246514,5928659,3,5936321500,22.513,DHHN92,59870.0,24.0,5.936320e+09,2000.0,2019.0
234,59905.0,Schwartow,Boize,216586,5924421,3,5936947500,8.987,DHHN92,,172.0,5.936940e+09,1976.0,2017.0


### File extraction and parsing

Similar to Niedersachsen, processing this data will again involve some substantial pivoting.

In [5]:
raw = pd.read_csv(os.path.join(BASE, 'gesamter Datensatz.csv'), encoding='latin1', sep=";", parse_dates=['datum'])
raw

Unnamed: 0,pkz,pkz_pegel_gew_id,reihe,parameter,typ,datum,messwert
107348,3119.0,03119.0#Rostock - Stadthafen#Unterwarnow#5,1976-1978,Wasserstand (PNP),Tagesmittel,1977-01-11,463.0
107349,3119.0,03119.0#Rostock - Stadthafen#Unterwarnow#5,1976-1978,Wasserstand (PNP),Tagesmittel,1977-01-12,490.0
107350,3119.0,03119.0#Rostock - Stadthafen#Unterwarnow#5,1976-1978,Wasserstand (PNP),Tagesmittel,1977-01-13,493.0
107351,3119.0,03119.0#Rostock - Stadthafen#Unterwarnow#5,1976-1978,Wasserstand (PNP),Tagesmittel,1977-01-14,481.0
107352,3119.0,03119.0#Rostock - Stadthafen#Unterwarnow#5,1976-1978,Wasserstand (PNP),Tagesmittel,1977-01-15,482.0
...,...,...,...,...,...,...,...
8459997,59910.5,59910.5#Nostorf#Mühlenbach#445,1995-2019,Durchfluss,Tagesmittel,2013-12-28,131.0
8459998,59910.5,59910.5#Nostorf#Mühlenbach#445,1995-2019,Durchfluss,Tagesmittel,2013-12-29,131.0
8459999,59910.5,59910.5#Nostorf#Mühlenbach#445,1995-2019,Durchfluss,Tagesmittel,2013-12-31,130.0
8460000,59910.5,59910.5#Nostorf#Mühlenbach#445,1995-2019,Durchfluss,HQ,2013-12-01,265.0
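
To illustrate what pivoting a long table like this means, here is a minimal sketch on invented rows that mimic the columns `pkz`, `parameter`, `typ`, `datum` and `messwert`. It is not the approach taken further down in this notebook, where the data is split by parameter instead; all values are made up.

```python
import pandas as pd

# invented rows in the same long layout as the raw table
long_df = pd.DataFrame({
    'pkz': [1.0, 1.0, 1.0, 1.0],
    'parameter': ['Durchfluss', 'Wasserstand (PNP)',
                  'Durchfluss', 'Wasserstand (PNP)'],
    'typ': ['Tagesmittel'] * 4,
    'datum': pd.to_datetime(['2000-01-01', '2000-01-01',
                             '2000-01-02', '2000-01-02']),
    'messwert': [130.0, 48.0, 131.0, 49.0],
})

# one row per station and date, one column per parameter
wide_df = long_df.pivot_table(
    index=['pkz', 'datum'],
    columns='parameter',
    values='messwert',
    aggfunc='first',
).reset_index()
```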


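### Stations without data
There are some stations in the metadata for which we do not have daily data.
As shown below, these stations have no 'Tagesmittel' (daily mean) records: most only have 'Terminwerte' (readings at a fixed time of day), one additionally has extreme values (HQ, NQ, HW, NW), and one has no records at all. We are not interested in these, so we delete the stations from the metadata.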

In [6]:
bl = Bundesland('Meckpom')

no_data_ids = [4390.0, 4436.0, 4712.0, 4930.0, 58100.1]

metadata[metadata['pegelkennzahl'].isin(no_data_ids)]

Unnamed: 0,pegelkennzahl,bezeichnung,gewaesser,rechtswert,hochwert,fg_einheit,gebietskennzahl,pnp,pnp_system,vorgaengerpegel,einzugsgebiet,gewaesserkennzahl,start_jahr,end_jahr
18,4390.0,Teßmannsdorf,Hellbach,279146,5994626,2,9636500000,3.33,SNN56,,205.0,9636000000.0,1971.0,1971.0
47,4436.0,Güstrow,Nebel,313771,5964840,2,9646751000,4.72,SNN56,,645.0,9646000000.0,1961.0,1972.0
115,4712.0,Neu Sührkow,Teterower Peene,129500,5770000,2,966329100,1.63,SNN56,,120.0,9663200000.0,1974.0,1975.0
196,4930.0,Rieth,Beeke,452059,5949206,1,0,1.472,SNN56,,45.0,,1968.0,1971.0
204,58100.1,Mirow OP,Mirower Kanal,353359,5904838,3,5811619900,60.0,SNN76,58100.0,22.4,5811600000.0,,


In [7]:
# list the available value types for each of these stations
for pkz in no_data_ids:
    print(f"{pkz} -- Typ: {set(raw[raw['pkz'] == pkz].typ.values)}")

4390.0 -- Typ: {'Terminwert 8:00'}
4436.0 -- Typ: {'Terminwert 7:00'}
4712.0 -- Typ: {'Terminwert 8:00'}
4930.0 -- Typ: {'HQ', 'NQ', 'HW', 'Terminwert 8:00', 'NW'}
58100.1 -- Typ: set()
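
The hard-coded list could also be derived from the data. The following is a minimal sketch on invented rows, assuming the same column names as above (`pegelkennzahl` in the metadata, `pkz` and `typ` in the raw table); `meta_demo` and `raw_demo` are stand-ins, not the real tables.

```python
import pandas as pd

# invented stand-ins for the metadata and the raw table
meta_demo = pd.DataFrame({'pegelkennzahl': [1.0, 2.0, 3.0]})
raw_demo = pd.DataFrame({
    'pkz': [1.0, 1.0, 2.0],
    'typ': ['Tagesmittel', 'HQ', 'Terminwert 8:00'],
})

# ids that have at least one daily mean
with_daily = set(raw_demo.loc[raw_demo['typ'] == 'Tagesmittel', 'pkz'])

# metadata ids without any daily mean (includes ids missing from raw entirely)
missing = sorted(set(meta_demo['pegelkennzahl']) - with_daily)
```

In this toy example, station `2.0` only has a 'Terminwert' and station `3.0` has no rows at all, so both end up in `missing`.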


In [8]:
# drop the ids without data from metadata
metadata = metadata[~metadata['pegelkennzahl'].isin(no_data_ids)].reset_index(drop=True)
metadata

Unnamed: 0,pegelkennzahl,bezeichnung,gewaesser,rechtswert,hochwert,fg_einheit,gebietskennzahl,pnp,pnp_system,vorgaengerpegel,einzugsgebiet,gewaesserkennzahl,start_jahr,end_jahr
0,4341.0,Börzow,Stepenitz,244709,5974401,1,9628511000,5.152,-,,441.0,9.628000e+09,1955.0,1998.0
1,4341.1,Börzow,Stepenitz,244683,5974452,5,9628510000,4.810,DHHN92,4341.0,441.0,9.628000e+09,1955.0,2021.0
2,4342.0,Questin,Stepenitz,246050,5972572,5,9628330000,9.772,DHHN92,,238.0,9.628000e+09,1966.0,2021.0
3,4343.0,Diedrichshagen,Stepenitz,250122,5968604,5,9628177000,27.359,DHHN92,,99.0,9.628000e+09,1964.0,2021.0
4,4344.0,Cramon,Stepenitz,254322,5958325,1,9628151900,42.410,-,,55.0,9.628000e+09,1971.0,1981.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
225,59855.0,Lehsen,Motel,236048,5933915,3,5936888590,27.398,DHHN92,,85.0,5.936888e+09,1976.0,2017.0
226,59859.0,Camin,Schilde,232502,5931690,3,5936887900,22.266,DHHN92,,189.0,5.936880e+09,2008.0,2017.0
227,59870.1,Hagenow,Schmaar,246514,5928659,3,5936321500,22.513,DHHN92,59870.0,24.0,5.936320e+09,2000.0,2019.0
228,59905.0,Schwartow,Boize,216586,5924421,3,5936947500,8.987,DHHN92,,172.0,5.936940e+09,1976.0,2017.0


In [9]:
# the id column will be pegelkennzahl
id_column = 'pegelkennzahl'

Unlike in Niedersachsen, we have several parameters and types here. A sensible first step is to filter for Tagesmittel and split by q and w.

In [10]:
print(f"Parameter: {raw.parameter.unique()}")
print(f"Variabeln: {raw.typ.unique()}")

Parameter: ['Wasserstand (PNP)' 'Durchfluss']
Variabeln: ['Tagesmittel' 'NW' 'HW' 'Terminwert 7:00' 'Terminwert 8:00' 'HQ' 'NQ'
 'Terminwert 12:00' 'Terminwert 10:00' 'Terminwert 13:00'
 'Terminwert 18:00' 'Terminwert 6:00' 'Terminwert 17:00'
 'Terminwert 14:00' 'Terminwert 9:00' 'Terminwert 0:00' 'Terminwert 11:00'
 'Terminwert 16:00']


First, filter for `'Tagesmittel'` only, then split by `'parameter'` and copy into two new DataFrames. This should make things a bit easier.

In [11]:
# split and group the data into Tagesmittel of q and w data
for par, df in raw.where(raw.typ == 'Tagesmittel').dropna(axis=0).groupby('parameter'):
    if par == 'Wasserstand (PNP)':
        w = df.copy()
    elif par == 'Durchfluss':
        q = df.copy()

print(f"Q: {len(q)}   W: {len(w)}")

Q: 2201381   W: 3055252
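
Note that `where(...).dropna(axis=0)` removes every row that contains a NaN in any column, not only the rows that fail the condition. A small sketch with invented values shows how this differs from plain boolean indexing:

```python
import numpy as np
import pandas as pd

# invented values: one daily mean is missing
demo = pd.DataFrame({
    'typ': ['Tagesmittel', 'Tagesmittel', 'HQ'],
    'messwert': [1.0, np.nan, 3.0],
})

# rows failing the condition become all-NaN and are dropped,
# but so is the Tagesmittel row with a missing messwert
via_where = demo.where(demo['typ'] == 'Tagesmittel').dropna(axis=0)

# boolean indexing keeps all Tagesmittel rows, including missing values
via_mask = demo[demo['typ'] == 'Tagesmittel']
```

Which behaviour is preferable depends on whether missing measurements should be kept as gaps in the saved time series.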


Finally, write a function that extracts the data from the DataFrames and returns an empty DataFrame if the pkz cannot be found in the large DataFrame.

In [12]:
def extract_file(pkz: Union[int, str], variable: str, store_df: pd.DataFrame) -> pd.DataFrame:
    """
    Extracts the variable for the given pkz from the large storage dataframe.
    Returns an empty dataframe if the pkz was not found.
    """
    # always use string pkz
    pkz = str(pkz)

    # filter
    df = store_df.where(store_df.pkz.astype(str) == pkz).dropna(axis=0)

    # check if we found something
    if df.empty:
        return pd.DataFrame(columns=['date', variable.lower(), 'flag'])
    else:
        return pd.DataFrame({
            'date': df.datum,
            variable.lower(): df.messwert,
            'flag': np.nan
        })

test_df = extract_file('59910.5', 'w', w)
test_df

Unnamed: 0,date,w,flag
8429327,1996-01-16,127.0,
8429328,1996-01-15,127.0,
8429329,1996-01-14,127.0,
8429330,1996-01-13,127.0,
8429331,1996-01-12,127.0,
...,...,...,...
8459963,2013-12-27,128.0,
8459964,2013-12-28,128.0,
8459965,2013-12-29,128.0,
8459966,2013-12-30,128.0,
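
Because the lookup compares string representations, the id passed in has to be rendered exactly like the float column. A minimal sketch with invented values shows the pitfall for integer input; `normalize_pkz` is a hypothetical helper, not part of `camelsp` or of the function above.

```python
import pandas as pd


def normalize_pkz(pkz) -> str:
    """Format an id the way a float column is rendered by astype(str).

    Hypothetical helper for illustration, not part of camelsp.
    """
    return str(float(pkz))


# invented stand-in for the large storage dataframe
store = pd.DataFrame({'pkz': [4341.0, 4341.1], 'messwert': [10.0, 20.0]})
as_text = store['pkz'].astype(str)

# str(4341) is '4341', which does not match the rendered float '4341.0'
naive_hits = int((as_text == str(4341)).sum())
normalized_hits = int((as_text == normalize_pkz(4341)).sum())
```

In this notebook the ids come from the float metadata column, so they already carry the decimal part.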


### Finally run

Now the Q and W data can be extracted. Conveniently, all the id creation, data creation, merging, and the mapping from our ids to the original ids and files is done by the context. This is helpful, as it makes us less likely to get something wrong.

In [13]:
with Bundesland('Mecklenburg-Vorpommern') as bl:
    # save the metadata
    bl.save_raw_metadata(metadata, id_column, overwrite=True)

    # for reference, call the nuts-mapping as table
    nuts_map = bl.nuts_table
    print(nuts_map.head())

    
    with warnings.catch_warnings(record=True) as warns:
        for provider_id in tqdm(metadata[id_column].values.astype(str)):
            # get q and w
            q_df = extract_file(provider_id, 'q', q)
            w_df = extract_file(provider_id, 'w', w)

            # q is in l/s, convert to m³/s
            q_df['q'] = q_df['q'] / 1000

            # sort by date
            q_df = q_df.sort_values('date').reset_index(drop=True)
            w_df = w_df.sort_values('date').reset_index(drop=True)

            bl.save_timeseries(q_df, provider_id)
            bl.save_timeseries(w_df, provider_id)

        # check if there were warnings (there are warnings)
        if len(warns) > 0:
            log_path = bl.save_warnings(warns)
            print(f"There were warnings during the processing. The log can be found at: {log_path}")


    nuts_id provider_id                              path
0  DE810000      4341.0  ./DE8/DE810000/DE810000_data.csv
1  DE810010      4341.1  ./DE8/DE810010/DE810010_data.csv
2  DE810020      4342.0  ./DE8/DE810020/DE810020_data.csv
3  DE810030      4343.0  ./DE8/DE810030/DE810030_data.csv
4  DE810040      4344.0  ./DE8/DE810040/DE810040_data.csv


100%|██████████| 230/230 [06:57<00:00,  1.82s/it]
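
The loop above records warnings instead of printing them. As a minimal, self-contained sketch of that pattern (with an invented function and message), note that `simplefilter('always')` is added here as an assumption so that repeated warnings from the same location are not deduplicated; the notebook cell does not set it.

```python
import warnings


def noisy(value: int) -> int:
    # invented function that warns on negative input
    if value < 0:
        warnings.warn(f"negative value: {value}")
    return abs(value)


with warnings.catch_warnings(record=True) as warns:
    # record every warning, even repeated ones from the same location
    warnings.simplefilter('always')
    results = [noisy(v) for v in (1, -2, -3)]

# each recorded entry exposes the original warning via .message
messages = [str(w.message) for w in warns]
```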


In [20]:
# copy everything from "../output_data/DE8_backup/*" to "../output_data/DE8/*" except *_data.csv
import shutil
from glob import glob

files = glob("../output_data/DE8_backup/*/*")

# exclude "*_data.csv" files
files = [f for f in files if not f.endswith("_data.csv")]

for f in files:
    # the parent folder name is the NUTS id of the station
    nuts_id = os.path.basename(os.path.dirname(f))
    target = os.path.join("../output_data/DE8", nuts_id, os.path.basename(f))
    # copy folders recursively and plain files directly
    if os.path.isdir(f):
        shutil.copytree(f, target, dirs_exist_ok=True)
    else:
        shutil.copy2(f, target)


In [16]:
# count the station folders in "../output_data/DE8_backup/" that contain files other than *_data.csv
from glob import glob

files = glob("../output_data/DE8_backup/*/*")
# exclude "*_data.csv" files
files = [f for f in files if not f.endswith("_data.csv")]

ids = sorted(list(set([f.split("/")[3] for f in files])))

len(ids)

226