# Wasserportal Data
This notebook shows how to import and process groundwater measurement data from the Wasserportal API.

In [10]:
import sys
import os
import importlib

# Add project root to Python path for notebook imports
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

# Now you can import from utils and fe
from src.utils import data_loading_wasserportal, helpers
from src.fe.constants import API_URLS, DATABASE_URL, DATE_END, DATE_START

sys.path.append("..")
sys.path.append('../src/utils')
# importlib.reload(data_loading_wasserportal)

## Station List
Here we download the lists of measurement stations into dataframes, then standardize the coordinates of the groundwater stations and drop the columns that are not needed.

In [11]:
data_loading_wasserportal.get_stations('../data/wasserportal/raw')
stations_path_gw = '../data/wasserportal/raw/stations_groundwater_raw.csv'
dest_path = '../data/wasserportal/'
data_loading_wasserportal.process_stations_groundwater(stations_path_gw,
                                                       dest_path)

Surface stations loaded: 189 entries
Soil stations loaded: 23 entries
Groundwater stations loaded: 927 entries
Processed stations to ..\data\wasserportal\stations_groundwater.csv
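
The processing step can be pictured roughly as follows. This is a minimal sketch, not the actual `process_stations_groundwater` implementation: the raw column names (`station_id`, `latitude`, `longitude`), the decimal-comma input format, and the sample rows are all assumptions made for illustration. Only the output layout (index `ID`, columns `lat`, `lon`, `height`) is taken from the processed file used in this notebook.

```python
import pandas as pd


def standardize_stations(raw: pd.DataFrame) -> pd.DataFrame:
    """Rename columns, parse numeric fields as floats, keep only needed columns."""
    # Hypothetical raw column names mapped to the names used in the processed file
    renamed = raw.rename(columns={
        "station_id": "ID",
        "latitude": "lat",
        "longitude": "lon",
    })
    # Assumption: numeric fields arrive as strings with decimal commas
    for col in ["lat", "lon", "height"]:
        renamed[col] = (
            renamed[col].astype(str).str.replace(",", ".", regex=False).astype(float)
        )
    # Drop every column that is not needed downstream
    return renamed[["ID", "lat", "lon", "height"]].set_index("ID")


# Made-up rows in the assumed raw format (not real Wasserportal data)
raw = pd.DataFrame({
    "station_id": [1, 2],
    "latitude": ["52,5", "52,25"],
    "longitude": ["13,5", "13,25"],
    "height": ["40,5", "35,0"],
    "operator": ["A", "B"],  # example of a column that gets dropped
})

stations = standardize_stations(raw)
print(stations)
```

Keeping the cleanup in one function makes it easy to apply the same rules to other station lists, provided their raw columns are mapped accordingly.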


## Duplicate Stations
Some measurement stations share the same location (identical `lat` and `lon`) but are listed under different IDs and have different data.

In [12]:
import pandas as pd

groundwater_df = pd.read_csv('../data/wasserportal/stations_groundwater.csv',
                             index_col='ID')

print('Duplicate measurement stations:')
display(groundwater_df[groundwater_df.duplicated(
    subset=['lat', 'lon'], keep=False)].sort_values(by='lat'))

Duplicate measurement stations:


ID,lat,lon,height
305,52.417286,13.284403,44.23
68,52.417286,13.284403,44.23
8415,52.470926,13.487909,34.53
8414,52.470926,13.487909,34.53
8411,52.487007,13.475708,33.7
8408,52.487007,13.475708,33.7
8058,52.498696,13.563372,37.67
8059,52.498696,13.563372,37.67
4640,52.518748,13.15449,34.3
4641,52.518748,13.15449,34.3
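
To see which IDs belong together, the duplicates can be grouped by coordinate pair. The following is a minimal sketch using a few rows copied from the table above plus one made-up, non-duplicated station (ID 1) for contrast. The helper name `group_colocated` is hypothetical and not part of `data_loading_wasserportal`.

```python
import pandas as pd

# Rows 305, 68, 8415, 8414 are copied from the duplicate table;
# row 1 is a made-up station that has a unique location.
stations = pd.DataFrame(
    {
        "lat": [52.417286, 52.417286, 52.470926, 52.470926, 52.6],
        "lon": [13.284403, 13.284403, 13.487909, 13.487909, 13.1],
        "height": [44.23, 44.23, 34.53, 34.53, 40.0],
    },
    index=pd.Index([305, 68, 8415, 8414, 1], name="ID"),
)


def group_colocated(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per coordinate pair shared by more than one station."""
    # keep=False marks every member of a duplicate group, not just the repeats
    dupes = df[df.duplicated(subset=["lat", "lon"], keep=False)]
    return (
        dupes.reset_index()
        .groupby(["lat", "lon"])["ID"]
        .apply(list)
        .reset_index(name="station_ids")
    )


colocated = group_colocated(stations)
print(colocated)
```

Each row of the result lists the IDs recorded at one location, which is a convenient starting point for deciding whether to merge the stations or keep them separate.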


## Measurement Data
Given a start date and an end date (both in `DD.MM.YYYY` format) and a URL config, `get_multi_station_data` builds a table with all measurement data in that time range from all available stations. The cell below calls `get_gw_master` with the same arguments; its result is saved as a `.parquet` file, ready for analysis.
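
As an illustration of what a "URL config" might amount to, the sketch below turns such a dictionary and a date range into a query string. This is an assumption about the general mechanism, not the module's actual code: the base URL is a placeholder, and the date parameter names `sdatum` and `edatum` are hypothetical.

```python
from urllib.parse import urlencode


def build_query(base_url: str, config: dict, start: str, end: str) -> str:
    """Combine a config dict and a date range into a request URL."""
    # 'sdatum' / 'edatum' are hypothetical names for the date-range parameters
    params = {**config, "sdatum": start, "edatum": end}
    return f"{base_url}?{urlencode(params)}"


config = {
    "thema": "gws",
    "exportthema": "gw",
    "sreihe": "ew",
    "anzeige": "d",
    "smode": "c",
}

# Placeholder base URL, not the real Wasserportal endpoint
url = build_query("https://example.invalid/station", config, "01.01.2022", "30.04.2025")
print(url)
```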

In [13]:
# Usage example:
config_groundwater = {
    'thema': 'gws',
    'exportthema': 'gw',
    'sreihe': 'ew',
    'anzeige': 'd',
    'smode': 'c'
}
data_loading_wasserportal.get_gw_master('01.01.2022',
                                        '30.04.2025',
                                        config_groundwater,
                                        num_stations=None)
# data_loading_wasserportal.get_multi_station_data('01.01.2022',
#                                                  '28.02.2022',
#                                                  config_groundwater,
#                                                  num_stations=20)

Processed 20 stations out of 892
Processed 40 stations out of 892
Processed 60 stations out of 892
Processed 80 stations out of 892
Processed 100 stations out of 892
Processed 120 stations out of 892
Processed 140 stations out of 892
Processed 160 stations out of 892
Processed 180 stations out of 892
Processed 200 stations out of 892
Processed 220 stations out of 892
Processed 240 stations out of 892
Processed 260 stations out of 892
Processed 280 stations out of 892
Processed 300 stations out of 892
Processed 320 stations out of 892
Processed 340 stations out of 892
Processed 360 stations out of 892
Processed 380 stations out of 892
Processed 400 stations out of 892
Processed 420 stations out of 892
Processed 440 stations out of 892
Processed 460 stations out of 892
Processed 480 stations out of 892
Processed 500 stations out of 892
Processed 520 stations out of 892
Processed 540 stations out of 892
Processed 560 stations out of 892
Processed 580 stations out of 892
Processed 600 stat

{'file': WindowsPath('../data/wasserportal/processed/gw_master_2022-01-01_2025-04-30.parquet'),
 'messages': ['Processed 20 stations out of 892',
  'Processed 40 stations out of 892',
  'Processed 60 stations out of 892',
  'Processed 80 stations out of 892',
  'Processed 100 stations out of 892',
  'Processed 120 stations out of 892',
  'Processed 140 stations out of 892',
  'Processed 160 stations out of 892',
  'Processed 180 stations out of 892',
  'Processed 200 stations out of 892',
  'Processed 220 stations out of 892',
  'Processed 240 stations out of 892',
  'Processed 260 stations out of 892',
  'Processed 280 stations out of 892',
  'Processed 300 stations out of 892',
  'Processed 320 stations out of 892',
  'Processed 340 stations out of 892',
  'Processed 360 stations out of 892',
  'Processed 380 stations out of 892',
  'Processed 400 stations out of 892',
  'Processed 420 stations out of 892',
  'Processed 440 stations out of 892',
  'Processed 460 stations out of 892',
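
The name of the returned file encodes the requested date range in ISO format. A minimal sketch of how such a name could be derived from the `DD.MM.YYYY` inputs is shown here; the helper `master_path` is hypothetical, and the real `get_gw_master` may build the path differently.

```python
from datetime import datetime
from pathlib import Path


def master_path(start: str, end: str,
                out_dir: str = "../data/wasserportal/processed") -> Path:
    """Build the parquet path for a date range given as DD.MM.YYYY strings."""
    fmt_in = "%d.%m.%Y"
    # Parse the German-style dates and re-emit them as YYYY-MM-DD
    start_iso = datetime.strptime(start, fmt_in).date().isoformat()
    end_iso = datetime.strptime(end, fmt_in).date().isoformat()
    return Path(out_dir) / f"gw_master_{start_iso}_{end_iso}.parquet"


path = master_path("01.01.2022", "30.04.2025")
print(path.name)  # gw_master_2022-01-01_2025-04-30.parquet
```

Parsing with `strptime` also rejects malformed dates early, before any download starts.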

## Loading Data

In [None]:
import pandas as pd

# Load the parquet file
df = pd.read_parquet(
    '../data/wasserportal/processed/gw_master_2022-01-01_2025-04-30.parquet')

# Display basic info about the DataFrame
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Index: {df.index}")
print("\nFirst few rows:")
display(df.head())
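
Once the table is loaded, a quick per-station summary helps to spot stations with short or sparse records. The sketch below runs on a small made-up DataFrame, because the actual column layout of the parquet file is not shown here; the long format with columns `station_id`, `date` and `value` is an assumption, so adapt the names to what `df.columns` reports.

```python
import pandas as pd

# Made-up measurements in an assumed long format (one row per station and day)
df_demo = pd.DataFrame({
    "station_id": [68, 68, 305, 305, 305],
    "date": pd.to_datetime([
        "2022-01-01", "2022-01-02",
        "2022-01-01", "2022-01-02", "2022-01-03",
    ]),
    "value": [31.2, 31.3, 30.9, None, 31.0],  # None marks a missing reading
})

# First and last observation date and number of non-missing values per station
summary = df_demo.groupby("station_id").agg(
    first_date=("date", "min"),
    last_date=("date", "max"),
    n_values=("value", "count"),
)
print(summary)
```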