# Streamflow Data Download and Preprocessing

Download the streamflow station metadata from https://wateroffice.ec.gc.ca/station_metadata/station_characteristics_e.html using the filters `Province = Alberta`, `Parameter Type = Flows`, and `Regulation = Natural`, and save it as `data/raw/station_metadata.csv`.

The cell below selects the stations in the metadata file whose records span the full study period (`START_YEAR` to `END_YEAR`), downloads their data from HYDAT, and saves the result to `combined_streamflow.csv`.

In [1]:
import pandas as pd
from src.data_ingestion import fetch_streamflow_batch
from src.config import RAW_DATA_DIR

METADATA_PATH = RAW_DATA_DIR / "station_metadata.csv"
OUTPUT_FILENAME = "combined_streamflow.csv"
START_YEAR = 1980
END_YEAR = 2022

# Keep only stations whose record covers the full study period
metadata = pd.read_csv(METADATA_PATH)
filtered_stations = metadata[
    (metadata["Year From"] <= START_YEAR) &
    (metadata["Year To"] >= END_YEAR)
]["Station Number"].tolist()

# Download streamflow for the selected stations and save it to OUTPUT_FILENAME
df_streamflow = fetch_streamflow_batch(filtered_stations, START_YEAR, END_YEAR, OUTPUT_FILENAME)

Downloading data for 181 stations...
Data downloaded and saved to combined_streamflow.csv
15706 days of data saved for 181 stations


In [2]:
from src.processing import filter_stations_by_annual_completeness
from src.config import RAW_DATA_DIR
import pandas as pd

# 1. Load raw
df_raw = pd.read_csv(RAW_DATA_DIR / "combined_streamflow.csv", index_col="Date", parse_dates=True)

# 2. Filter (Strict: Drop station if any year is >40% missing)
df_filtered = filter_stations_by_annual_completeness(df_raw, max_missing_pct=40.0)
df_filtered.head()

Filtering at 40.0% annual threshold:
 - Keeping 118 stations.
 - Dropping 63 stations due to incomplete years.


Date,05AA004,05AA008,05AA022,05AA028,05AB005,05AB029,05AD003,05AD035,05AE005,05AH037,...,07JF002,07KE001,07OA001,07OB003,07OB004,07OB006,07OC001,11AA026,11AB117,11AE009
1980-01-01,,1.01,2.07,,,,2.9,,,,...,,2.9,,,,,,,,0.048
1980-01-02,,1.04,2.05,,,,2.85,,,,...,,2.78,,,,,,,,0.045
1980-01-03,,1.04,2.04,,,,2.77,,,,...,,2.6,,,,,,,,0.042
1980-01-04,,1.03,2.01,,,,2.69,,,,...,,2.45,,,,,,,,0.034
1980-01-05,,1.02,1.98,,,,2.74,,,,...,,2.26,,,,,,,,0.023
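
The body of `filter_stations_by_annual_completeness` is not shown in this notebook. As a rough illustration of the rule described in the cell comment (drop a station if any calendar year is more than 40% missing), the following is a minimal sketch, not the project's implementation. The function name `filter_by_annual_completeness_sketch`, the toy station IDs, and the assumption that the frame has one row per calendar day are all hypothetical.

```python
import numpy as np
import pandas as pd


def filter_by_annual_completeness_sketch(df, max_missing_pct=40.0):
    """Keep only stations (columns) with no calendar year above the threshold.

    Hypothetical stand-in for src.processing.filter_stations_by_annual_completeness.
    Assumes a DatetimeIndex with one row per day and one column per station.
    """
    # Percent of missing daily values per station for each calendar year
    missing_pct = df.isna().groupby(df.index.year).mean() * 100.0
    # A station is kept only if every year is at or below the threshold
    keep = (missing_pct <= max_missing_pct).all(axis=0)
    return df.loc[:, keep]


# Toy example: two years of daily data for two made-up stations
dates = pd.date_range("2000-01-01", "2001-12-31", freq="D")
toy = pd.DataFrame({"STN_A": 1.0, "STN_B": 1.0}, index=dates)
# Blank out the first half of 2001 for STN_B (181 of 365 days, about 50%)
toy.loc["2001-01-01":"2001-06-30", "STN_B"] = np.nan

kept = filter_by_annual_completeness_sketch(toy, max_missing_pct=40.0)
print(list(kept.columns))
```

Because the check is per year, a single poorly covered year is enough to drop a station even when its overall record is mostly complete, which is consistent with the "Strict" label in the cell comment.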


Basin polygons can be downloaded from https://collaboration.cmc.ec.gc.ca/cmc/hydrometrics/www/HydrometricNetworkBasinPolygons/gpkg/. The major drainage areas (MDAs) selected are:
* (5) Nelson River
* (7) Great Slave Lake

These MDAs were selected due to their proximity to the Eastern Rockies in Alberta. The files are saved under `data/raw/drainage_areas/`.

In [None]:
# TODO: use polygons to extract area and percent glaciation

## Downloading ERA5 Climate Data
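
The download helpers used in this section live in `src.data_ingestion` and are not shown in the notebook. Their log output suggests they work month by month and skip files that are already on disk. A minimal sketch of that skip-if-exists pattern follows, under stated assumptions: the function `download_monthly_sketch`, the `fetch` callable, and the file naming scheme are hypothetical, and the actual ERA5 request is deliberately left out.

```python
import tempfile
from pathlib import Path


def download_monthly_sketch(years, out_dir, fetch, prefix="era5"):
    """Call fetch(year, month, target) once for each monthly file not yet on disk.

    Hypothetical illustration only: `fetch` stands in for whatever request
    the real helpers make, and the file naming below is an assumption.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for year in years:
        for month in range(1, 13):
            target = out_dir / f"{prefix}_{year}-{month:02d}.nc"
            if target.exists():
                # Already downloaded in an earlier run, so do nothing
                print(f"Skipping {year}-{month:02d} (already exists)")
                continue
            fetch(year, month, target)
            downloaded.append(target)
    return downloaded


def fake_fetch(year, month, target):
    """Placeholder that writes an empty file instead of requesting data."""
    target.write_bytes(b"")


# Running twice over the same folder: the second pass skips every month
with tempfile.TemporaryDirectory() as tmp:
    first = download_monthly_sketch(range(1980, 1982), tmp, fake_fetch)
    second = download_monthly_sketch(range(1980, 1982), tmp, fake_fetch)
    first_count, second_count = len(first), len(second)

print(first_count, second_count)
```

Checking for the target file before each request makes the cell safe to re-run after an interruption, since completed months are never requested again.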

In [1]:
from src.data_ingestion import download_era5_precipitation, download_era5_temperature

# Study parameters
STUDY_YEARS = range(1980, 2023)

# Run downloads
download_era5_precipitation(STUDY_YEARS)
download_era5_temperature(STUDY_YEARS)

✔ Skipping 1980-01 (already exists)
✔ Skipping 1980-02 (already exists)
✔ Skipping 1980-03 (already exists)
✔ Skipping 1980-04 (already exists)
✔ Skipping 1980-05 (already exists)
✔ Skipping 1980-06 (already exists)
✔ Skipping 1980-07 (already exists)
✔ Skipping 1980-08 (already exists)
✔ Skipping 1980-09 (already exists)
✔ Skipping 1980-10 (already exists)
✔ Skipping 1980-11 (already exists)
✔ Skipping 1980-12 (already exists)
✔ Skipping 1981-01 (already exists)
✔ Skipping 1981-02 (already exists)
✔ Skipping 1981-03 (already exists)
✔ Skipping 1981-04 (already exists)
✔ Skipping 1981-05 (already exists)
✔ Skipping 1981-06 (already exists)
✔ Skipping 1981-07 (already exists)
✔ Skipping 1981-08 (already exists)
✔ Skipping 1981-09 (already exists)
✔ Skipping 1981-10 (already exists)
✔ Skipping 1981-11 (already exists)
✔ Skipping 1981-12 (already exists)
✔ Skipping 1982-01 (already exists)
✔ Skipping 1982-02 (already exists)
✔ Skipping 1982-03 (already exists)
✔ Skipping 1982-04 (already 