# Streamflow Data Download and Preprocessing

Station metadata for streamflow should be downloaded from https://wateroffice.ec.gc.ca/station_metadata/station_characteristics_e.html using the filters `Province = Alberta`, `Parameter Type = Flows`, and `Regulation = Natural`, then saved as `data/raw/station_metadata.csv`.

The cell below selects the stations in the metadata file whose period of record covers the full study period (`START_YEAR` to `END_YEAR`), downloads their data from HYDAT, and saves it to `combined_streamflow.csv`.

In [1]:
import pandas as pd
from src.data_ingestion import fetch_streamflow_batch
from src.config import RAW_DATA_DIR

METADATA_PATH = RAW_DATA_DIR / "station_metadata.csv"
OUTPUT_FILENAME = "combined_streamflow.csv"
START_YEAR = 1980
END_YEAR = 2022

# Keep only stations whose period of record spans START_YEAR through END_YEAR
metadata = pd.read_csv(METADATA_PATH)
filtered_stations = metadata[
    (metadata['Year From'] <= START_YEAR) & 
    (metadata['Year To'] >= END_YEAR)
]["Station Number"].tolist()

# Download daily streamflow for the selected stations and save it as one CSV
df_streamflow = fetch_streamflow_batch(filtered_stations, START_YEAR, END_YEAR, OUTPUT_FILENAME)

Downloading data for 181 stations...
Data downloaded and saved to combined_streamflow.csv
15706 days of data saved for 181 stations
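
The reported 15706 days matches a complete daily calendar from 1 January 1980 to 31 December 2022, which can be checked independently of the download helper. A quick check, assuming only pandas:

```python
import pandas as pd

START_YEAR = 1980
END_YEAR = 2022

# Build the full daily calendar for the study period (both end years inclusive).
expected_index = pd.date_range(f"{START_YEAR}-01-01", f"{END_YEAR}-12-31", freq="D")
n_expected_days = len(expected_index)
print(n_expected_days)  # 15706
```

If the row count of the saved file differs from this number, the date range passed to the download step is probably not the one intended.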


In [2]:
from src.processing import filter_stations_by_annual_completeness
from src.config import RAW_DATA_DIR
import pandas as pd

# 1. Load raw
df_raw = pd.read_csv(RAW_DATA_DIR / "combined_streamflow.csv", index_col="Date", parse_dates=True)

# 2. Filter (strict: drop a station if any single year has more than 40% of its values missing)
df_filtered = filter_stations_by_annual_completeness(df_raw, max_missing_pct=40.0)
df_filtered.head()

Filtering at 40.0% annual threshold:
 - Keeping 118 stations.
 - Dropping 63 stations due to incomplete years.


Date,05AA004,05AA008,05AA022,05AA028,05AB005,05AB029,05AD003,05AD035,05AE005,05AH037,...,07JF002,07KE001,07OA001,07OB003,07OB004,07OB006,07OC001,11AA026,11AB117,11AE009
1980-01-01,,1.01,2.07,,,,2.9,,,,...,,2.9,,,,,,,,0.048
1980-01-02,,1.04,2.05,,,,2.85,,,,...,,2.78,,,,,,,,0.045
1980-01-03,,1.04,2.04,,,,2.77,,,,...,,2.6,,,,,,,,0.042
1980-01-04,,1.03,2.01,,,,2.69,,,,...,,2.45,,,,,,,,0.034
1980-01-05,,1.02,1.98,,,,2.74,,,,...,,2.26,,,,,,,,0.023
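
The completeness rule used above (drop a station if any year exceeds the missing-data threshold) can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `filter_by_annual_completeness_sketch` and the toy station IDs are hypothetical, and `src.processing.filter_stations_by_annual_completeness` may differ in its details.

```python
import numpy as np
import pandas as pd


def filter_by_annual_completeness_sketch(
    df: pd.DataFrame, max_missing_pct: float = 40.0
) -> pd.DataFrame:
    """Keep station columns whose missing share is <= max_missing_pct in every year.

    Hypothetical stand-in for the project's filter; assumes a DatetimeIndex
    and one column per station.
    """
    # Percentage of missing values per station (columns) for each calendar year (rows).
    missing_pct_by_year = df.isna().groupby(df.index.year).mean() * 100.0
    # A station is kept only if its worst year is still within the threshold.
    keep = missing_pct_by_year.max(axis=0) <= max_missing_pct
    return df.loc[:, keep]


# Toy example: two years of daily data for two made-up stations.
idx = pd.date_range("2000-01-01", "2001-12-31", freq="D")
toy = pd.DataFrame({"STN_A": 1.0, "STN_B": 1.0}, index=idx)
toy.loc["2001-01-01":"2001-09-30", "STN_B"] = np.nan  # roughly 75% of 2001 missing
kept = filter_by_annual_completeness_sketch(toy, max_missing_pct=40.0)
print(list(kept.columns))
```

Taking the maximum over years is what makes the rule strict: one poor year is enough to remove a station, even if its overall record is nearly complete.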


## Monthly Glacier Mass Balance Reconstruction Data
The data were produced by Christina Draeger for the years 1980 to 2022 and can be accessed at: https://www.dropbox.com/scl/fo/yat0rxeoztpwol29qput2/AEtDmgySFbMEr3B9YcwLmks/kp_dp_alphabias_monthly_NN?dl=0&rlkey=4t3uobuuo8ufn5selgr5afoo4&subfolder_nav_tracking=1

The files are saved under `data/raw/mass_balance`.

## Downloading Glacier Areas
Spatial information for the glaciers is downloaded from the [Randolph Glacier Inventory (RGI) version 6](https://daacdata.apps.nsidc.org/pub/DATASETS/nsidc0770_rgi_v6/). Only the Western Canada and US region (`nsidc0770_02.rgi60.WesternCanadaUS.zip`) is required. The files are saved under `data/raw/RGI-western-canada`.

## Downloading Drainage Areas
Water basin polygons can be downloaded from https://collaboration.cmc.ec.gc.ca/cmc/hydrometrics/www/HydrometricNetworkBasinPolygons/gpkg/. The major drainage areas (MDAs) selected are:
* (5) Nelson River
* (7) Great Slave Lake

These MDAs were selected due to their proximity to the Eastern Rockies in Alberta. The files are saved under `data/raw/drainage_areas/`.

In [1]:
from src.spatial_utils import process_spatial_attributes
from src.config import DRAINAGE_FILES, GLACIER_SHP_PATH, MASS_BALANCE_PATH

# Run the analysis
static_df, vol_df = process_spatial_attributes(
    basin_gpkg_paths=DRAINAGE_FILES,
    glacier_shp_path=GLACIER_SHP_PATH,
    mass_balance_path=MASS_BALANCE_PATH
)

# Preview
print("\n--- Static Attributes ---")
print(static_df.head())

print("\n--- Volume Change (MCM) ---")
print(vol_df.head())

⏳ Loading and merging basin files...

   Examples: ['06AD006', '06AA002', '06AA001', '11AA026', '11AB117']
✅ Processing 171 basins.
⏳ Loading glaciers and reprojecting...
⏳ Calculating glacier-basin intersections (this may take a moment)...
✅ Static attributes saved to C:\Users\tbwil\Documents\School\MSc Geophysics\Thesis Project\data\processed\static_attributes.csv
⏳ Calculating volume changes...
   Matched 777 glaciers between Shapefile and Mass Balance CSV.
✅ Monthly volume changes (MCM) saved to C:\Users\tbwil\Documents\School\MSc Geophysics\Thesis Project\data\processed\glacier_volume_change.csv

--- Static Attributes ---
            basin_area_km2  glacier_area_km2  glacier_pct
station_id                                               
05AA004           160.5591               0.0          0.0
05AA008           402.9390               0.0          0.0
05AA011           179.2413               0.0          0.0
05AA022           820.5606               0.0          0.0
05AA027           219.2715               0.0          0
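
The two derived quantities reported by this cell, glacier cover as a share of basin area and monthly volume change in million cubic metres (MCM), reduce to simple arithmetic once the glacier area inside each basin is known. The sketch below uses made-up numbers and hypothetical glacier IDs, and it assumes that `glacier_pct` is a percentage and that the mass balance is a specific balance in metres water equivalent (m w.e.); neither assumption is confirmed by the output above.

```python
import pandas as pd

# Hypothetical inputs; none of these values come from the project data.
basin_area_km2 = 400.0
# Glacier area (km²) falling inside the basin, per glacier (made-up IDs).
glacier_area_in_basin_km2 = pd.Series({"G1": 6.0, "G2": 2.0})
# Monthly specific mass balance in m w.e. (assumed unit), one column per glacier.
mass_balance_mwe = pd.DataFrame(
    {"G1": [0.10, -0.50], "G2": [0.20, -0.25]},
    index=pd.to_datetime(["2000-01-01", "2000-07-01"]),
)

# Glacierized share of the basin, expressed as a percentage.
glacier_area_km2 = glacier_area_in_basin_km2.sum()
glacier_pct = 100.0 * glacier_area_km2 / basin_area_km2

# 1 m w.e. over 1 km² is 1e6 m³ of water, i.e. 1 MCM,
# so (m w.e.) x (km²) gives MCM directly; sum over glaciers per month.
volume_change_mcm = mass_balance_mwe.mul(glacier_area_in_basin_km2, axis="columns").sum(axis=1)

print(glacier_pct)
print(volume_change_mcm)
```

With these toy numbers the basin is 2% glacierized, and the two months contribute about +1.0 MCM and -3.5 MCM respectively.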

## Downloading ERA5 Climate Data
This project uses the Copernicus Climate Data Store (CDS) to download ERA5 precipitation and temperature data. Follow these steps to configure your environment.

#### Create a CDS Account

1. Visit the [Climate Data Store (CDS) registration page](https://cds.climate.copernicus.eu/#!/home).

2. Create an account and log in.

#### Accept the Terms of Use
Important: You must manually accept the "Terms of Use" for every dataset you wish to download, or the API will return an error.

1. Go to the ERA5 daily statistics page.

2. Click the "Download Data" tab.

3. Scroll to the bottom and click Accept Terms (look for a "License" section).

4. Repeat this for the ERA5 reanalysis single levels.

#### Get your API Key
1. Go to your User Profile.

2. Scroll down to the section labeled API Key.

3. You will see a block of text that looks like this:

```
url: https://cds.climate.copernicus.eu/api
key: <PERSONAL-ACCESS-TOKEN>
```
#### Create the Configuration File (`.cdsapirc`)
The `cdsapi` library authenticates by reading a hidden file named `.cdsapirc` in your home directory.

**For Windows Users:**

1. Open your User folder (e.g., `C:\Users\YourName`).

2. Create a new text file named `.cdsapirc` (note the leading dot).

   * Tip: If Windows won't let you create a file whose name starts with a dot, name it `.cdsapirc.` (with a trailing dot) and it will be saved as `.cdsapirc`.

3. Paste the `url` and `key` lines from step 3 of "Get your API Key" into this file.
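
As an alternative to creating the file by hand, it can be written from Python, which sidesteps the dot-file naming issue in the Windows file browser. This is a minimal sketch: the helper `write_cdsapirc` is hypothetical (it is not part of `cdsapi` or of this project), and the `url` and `key` values must be the ones shown on your own CDS profile page.

```python
from pathlib import Path
from typing import Optional


def write_cdsapirc(url: str, key: str, home: Optional[Path] = None) -> Path:
    """Write a two-line .cdsapirc file into `home` (default: the user's home directory).

    Hypothetical helper for illustration; it overwrites any existing file.
    """
    target_dir = Path.home() if home is None else Path(home)
    rc_path = target_dir / ".cdsapirc"
    rc_path.write_text(f"url: {url}\nkey: {key}\n", encoding="utf-8")
    return rc_path


# Usage (placeholders; paste the values from your CDS profile page):
# write_cdsapirc(url="<URL-FROM-PROFILE>", key="<KEY-FROM-PROFILE>")
```

The optional `home` argument exists only so the helper can be tried in a scratch directory before touching the real home folder.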

**For Mac/Linux Users:**

1. Open your terminal.

2. Run the following command: `nano ~/.cdsapirc`

3. Paste your credentials:
```
url: https://cds.climate.copernicus.eu/api
key: abcdefgh-ijkl-mnop-qrst-uvwxyz
```
```
4. Save and exit (`Ctrl+O`, `Enter`, `Ctrl+X`).
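
With credentials in place, the download cell that follows requests the data one month at a time and skips months whose files are already on disk, so an interrupted run can simply be restarted. A minimal sketch of that pattern is given here under stated assumptions: `download_monthly`, the `fetch_month` callback, and the file naming scheme are all hypothetical, and the project's `download_era5_precipitation` and `download_era5_temperature` may be organised differently. In practice `fetch_month` would wrap the actual CDS request.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def download_monthly(
    years: Iterable[int],
    out_dir: Path,
    fetch_month: Callable[[int, int, Path], None],
    prefix: str = "era5",
) -> List[Path]:
    """Call fetch_month(year, month, target) for every month without a file yet.

    Returns the list of files that were newly requested. All names here are
    illustrative, not the project's API.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    requested: List[Path] = []
    for year in years:
        for month in range(1, 13):
            target = out_dir / f"{prefix}_{year}-{month:02d}.nc"
            if target.exists():
                # A previous run already produced this month; do not request it again.
                print(f"Skipping {year}-{month:02d} (already exists)")
                continue
            print(f"Downloading {year}-{month:02d} ...")
            fetch_month(year, month, target)
            requested.append(target)
    return requested
```

Keeping one file per month makes the skip check a plain existence test, at the cost of many small requests.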

In [2]:
from src.data_ingestion import download_era5_precipitation, download_era5_temperature

# Study parameters
STUDY_YEARS = range(1980, 2023)

# Run downloads
download_era5_precipitation(STUDY_YEARS)
download_era5_temperature(STUDY_YEARS)

⏳ Downloading Precip: 1980-01 ...


2026-01-19 21:34:10,071 INFO Request ID is 8c34f45a-0807-4fd6-bd7c-45c76d6b5ed2
2026-01-19 21:34:10,259 INFO status has been updated to accepted
2026-01-19 21:34:24,381 INFO status has been updated to running
2026-01-19 21:35:26,893 INFO status has been updated to successful

✔ Skipping 1980-02 (already exists)
✔ Skipping 1980-03 (already exists)
✔ Skipping 1980-04 (already exists)
✔ Skipping 1980-05 (already exists)
✔ Skipping 1980-06 (already exists)
✔ Skipping 1980-07 (already exists)
✔ Skipping 1980-08 (already exists)
✔ Skipping 1980-09 (already exists)
✔ Skipping 1980-10 (already exists)
✔ Skipping 1980-11 (already exists)
✔ Skipping 1980-12 (already exists)
✔ Skipping 1981-01 (already exists)
✔ Skipping 1981-02 (already exists)
✔ Skipping 1981-03 (already exists)
✔ Skipping 1981-04 (already exists)
✔ Skipping 1981-05 (already exists)
✔ Skipping 1981-06 (already exists)
✔ Skipping 1981-07 (already exists)
✔ Skipping 1981-08 (already exists)
✔ Skipping 1981-09 (already exists)
✔ Skipping 1981-10 (already exists)
✔ Skipping 1981-11 (already exists)
✔ Skipping 1981-12 (already exists)
✔ Skipping 1982-01 (already exists)
✔ Skipping 1982-02 (already exists)
✔ Skipping 1982-03 (already exists)
✔ Skipping 1982-04 (already exists)
✔ Skipping 1982-05 (already 