# Streamflow Data Download

Station metadata for streamflow can be downloaded from https://wateroffice.ec.gc.ca/station_metadata/station_characteristics_e.html using the filters `Province = Alberta`, `Parameter Type = Flows`, and `Regulation = Natural`. Save the result as `data/raw/station_metadata.csv`.

Stations in the metadata file whose record covers the full study period (`Year From` ≤ `START_YEAR` and `Year To` ≥ `END_YEAR`) are downloaded below from HYDAT and saved to `combined_streamflow.csv`.

In [1]:
import pandas as pd
from src.data_ingestion import fetch_streamflow_batch
from src.config import RAW_DATA_DIR

METADATA_PATH = RAW_DATA_DIR / "station_metadata.csv"
OUTPUT_FILENAME = "combined_streamflow.csv"
START_YEAR = 1980
END_YEAR = 2022

# Keep only stations whose record spans the whole study period
metadata = pd.read_csv(METADATA_PATH)
filtered_stations = metadata[
    (metadata['Year From'] <= START_YEAR) & 
    (metadata['Year To'] >= END_YEAR)
]["Station Number"].tolist()

# Download daily streamflow for the selected stations from HYDAT
df_streamflow = fetch_streamflow_batch(filtered_stations, START_YEAR, END_YEAR, OUTPUT_FILENAME)

Downloading data for 181 stations...
Data downloaded and saved to combined_streamflow.csv
15706 days of data saved for 181 stations
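
As a quick sanity check, the 15706 days reported above equal the number of calendar days from 1 January 1980 through 31 December 2022. The sketch below assumes the download spans complete calendar years; the exact date bounds used inside `fetch_streamflow_batch` are not shown here.

```python
from datetime import date

START_YEAR = 1980
END_YEAR = 2022

# Inclusive count of calendar days across the study period
# (assumes the period runs from 1 January to 31 December).
n_days = (date(END_YEAR, 12, 31) - date(START_YEAR, 1, 1)).days + 1
print(n_days)  # 15706
```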


## Monthly Glacier Mass Balance Reconstruction Data
Data was produced by Christina Draeger for the years 1980 to 2022 and can be accessed via: https://www.dropbox.com/scl/fo/yat0rxeoztpwol29qput2/AEtDmgySFbMEr3B9YcwLmks/kp_dp_alphabias_monthly_NN?dl=0&rlkey=4t3uobuuo8ufn5selgr5afoo4&subfolder_nav_tracking=1

The files are saved under `data/raw/mass_balance`

## Downloading Glacier Areas
Spatial information for the glaciers is downloaded from the [Randolph Glacier Inventory (RGI) version 6](https://daacdata.apps.nsidc.org/pub/DATASETS/nsidc0770_rgi_v6/). The region for Western Canada and US (`nsidc0770_02.rgi60.WesternCanadaUS.zip`) is the only download required. The files are saved under `data/raw/RGI-western-canada`

## Downloading Drainage Areas
Water basin polygons can be downloaded from https://collaboration.cmc.ec.gc.ca/cmc/hydrometrics/www/HydrometricNetworkBasinPolygons/gpkg/. The major drainage areas (MDA) selected are:
* (5) Nelson River
* (7) Great Slave Lake

These MDAs were selected due to their proximity to the Eastern Rockies in Alberta. The files are saved under `data/raw/drainage_areas/`.
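
Water Survey of Canada station numbers begin with the two-digit code of their major drainage area (for example, `05AA004` lies in MDA 05), so the regional filter can be sketched as a prefix match on the station number. The helper `keep_stations_in_mdas` and the toy data are hypothetical, and this is not necessarily how `filter_stations_by_mda` is implemented.

```python
import pandas as pd


def keep_stations_in_mdas(df: pd.DataFrame, mda_codes: list[str]) -> pd.DataFrame:
    """Keep only columns whose station number starts with one of the MDA codes."""
    prefixes = tuple(mda_codes)
    keep = [col for col in df.columns if str(col).startswith(prefixes)]
    return df[keep]


# Toy example: flow values and the third station number are made up.
demo = pd.DataFrame(
    {"05AA004": [1.0, 2.0], "07JC001": [3.0, 4.0], "06AB001": [5.0, 6.0]}
)
print(keep_stations_in_mdas(demo, ["05", "07"]).columns.tolist())
# ['05AA004', '07JC001']
```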

## Preprocessing Streamflow Data and Computing Basin Attributes
Streamflow data is filtered to remove any station that has less than 50% of daily data available in any given year. Stations outside the selected MDAs are also removed.
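
A minimal sketch of the completeness rule, assuming a DataFrame with a daily `DatetimeIndex` and one column per station. The function name `drop_incomplete_stations` is hypothetical, and the real `filter_stations_by_annual_completeness` may differ in detail (for instance, in how partially covered years are treated).

```python
import numpy as np
import pandas as pd


def drop_incomplete_stations(df: pd.DataFrame, max_missing_pct: float = 50.0) -> pd.DataFrame:
    """Drop stations whose share of missing days exceeds the threshold in any year."""
    # Percentage of missing daily values per year (rows) and station (columns).
    missing_pct = df.isna().groupby(df.index.year).mean() * 100.0
    # A station is kept only if every year is within the threshold.
    keep = (missing_pct <= max_missing_pct).all(axis=0)
    return df.loc[:, keep]


# Toy example: station "B" has no data at all in 2001, so it is dropped.
idx = pd.date_range("2000-01-01", "2001-12-31", freq="D")
demo = pd.DataFrame({"A": 1.0, "B": 1.0}, index=idx)
demo.loc[demo.index.year == 2001, "B"] = np.nan
print(drop_incomplete_stations(demo).columns.tolist())  # ['A']
```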

The area, mean elevation, and percent glaciation of the basin upstream of each remaining station are computed and saved to `data/processed/static_attributes.csv`. The monthly mass balance data is also used to compute the monthly change in glacier volume, in million cubic meters, for each glacierized basin. This data is saved to `data/processed/glacier_volume_change.csv`. The streamflow data for the remaining stations is saved to `data/processed/filtered_streamflow.csv`.
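
The two glacier-related quantities reduce to simple arithmetic. The sketch below assumes the mass balance is expressed in metres of water equivalent (m w.e.) and areas in km²; neither unit is stated above, so treat both as assumptions. Under those units, 1 m over 1 km² is 10⁶ m³, so the product is already in million cubic meters (of water equivalent). Function names are hypothetical and do not come from `src.spatial_utils`.

```python
def percent_glaciation(glacier_area_km2: float, basin_area_km2: float) -> float:
    """Share of the basin covered by glacier ice, in percent."""
    return 100.0 * glacier_area_km2 / basin_area_km2


def volume_change_mcm(mass_balance_m_we: float, glacier_area_km2: float) -> float:
    """Monthly volume change in million cubic meters (water equivalent).

    Assumes mass balance in m w.e. and area in km^2: 1 m * 1 km^2 = 1e6 m^3.
    """
    return mass_balance_m_we * glacier_area_km2


# Illustrative numbers only: a 200 km^2 basin with 10 km^2 of glacier
# and a monthly mass balance of -0.5 m w.e.
print(percent_glaciation(10.0, 200.0))  # 5.0
print(volume_change_mcm(-0.5, 10.0))    # -5.0
```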

In [1]:
from src.processing import filter_stations_by_annual_completeness, filter_stations_by_mda
from src.spatial_utils import process_spatial_attributes
from src.config import RAW_DATA_DIR, PROCESSED_DATA_DIR, DRAINAGE_FILES, GLACIER_SHP_PATH, MASS_BALANCE_PATH
import pandas as pd

# 1. Load raw data
print("--- Loading Streamflow ---")
df_raw = pd.read_csv(RAW_DATA_DIR / "combined_streamflow.csv", index_col="Date", parse_dates=True)

# 2. Filter: Completeness (<50% missing per year)
df_clean = filter_stations_by_annual_completeness(df_raw, max_missing_pct=50.0)

# 3. Filter: Region (MDA 05 and 07 only)
df_final = filter_stations_by_mda(df_clean, mda_codes=["05", "07"])

print(f"\nFinal Dataset: {df_final.shape[1]} stations ready for analysis.")

# --- NEW STEP: Save the filtered streamflow data ---
save_path = PROCESSED_DATA_DIR / "filtered_streamflow.csv"
df_final.to_csv(save_path)
print(f"✅ Filtered streamflow saved to: {save_path}")

# 4. Run Spatial Analysis using the filtered station list
print("\n--- Running Spatial Analysis ---")
static_stats, vol_changes = process_spatial_attributes(
    basin_gpkg_paths=DRAINAGE_FILES,
    glacier_shp_path=GLACIER_SHP_PATH,
    mass_balance_path=MASS_BALANCE_PATH,
    stations_list=df_final.columns.tolist()
)

--- Loading Streamflow ---
Filtering at 50.0% annual threshold:
 - Keeping 142 stations.
 - Dropping 39 stations due to incomplete years.
Filtering by MDA codes ['05', '07']:
 - Keeping 132 stations.
 - Dropping 10 stations (wrong region).

Final Dataset: 132 stations ready for analysis.
✅ Filtered streamflow saved to: C:\Users\tbwil\Documents\School\MSc Geophysics\Thesis Project\data\processed\filtered_streamflow.csv

--- Running Spatial Analysis ---
⏳ Loading and merging basin files...
✅ Processing 132 basins.
⏳ Processing Elevation Data...
   ⏳ Identified SRTM tiles from X[12-15] Y[1-3]...
   ⏳ Merging 12 tiles into mosaic...
   ✅ Saved DEM mosaic to C:\Users\tbwil\Documents\School\MSc Geophysics\Thesis Project\data\raw\srtm_90m_mosaic.tif
   Computing mean elevation per basin...
⏳ Loading glaciers...
✅ Static attributes saved to C:\Users\tbwil\Documents\School\MSc Geophysics\Thesis Project\data\processed\static_attributes.csv
⏳ Calculating volume changes...
✅

## Downloading ERA5 Climate Data
This project uses the Copernicus Climate Data Store (CDS) to download ERA5 precipitation and temperature data. Follow these steps to configure your environment.

#### Create a CDS Account

1. Visit the [Climate Data Store (CDS) registration page](https://cds.climate.copernicus.eu/#!/home).

2. Create an account and log in.

#### Accept the Terms of Use
Important: You must manually accept the "Terms of Use" for every dataset you wish to download, or the API will return an error.

1. Go to the ERA5 daily statistics page.

2. Click the "Download Data" tab.

3. Scroll to the bottom and click Accept Terms (look for a "License" section).

4. Repeat this for the ERA5 reanalysis single levels.

#### Get your API Key
1. Go to your User Profile.

2. Scroll down to the section labeled API Key.

3. You will see a block of text that looks like this:

```
url: https://cds.climate.copernicus.eu/api
key: <PERSONAL-ACCESS-TOKEN>
```
#### Create the Configuration File (`.cdsapirc`)
The `cdsapi` library reads its credentials from a hidden file named `.cdsapirc` in your home directory.

**For Windows Users:**

1. Open your User folder (e.g., `C:\Users\YourName`).

2. Create a new text file named `.cdsapirc` (note the leading dot).

   * Tip: If Windows doesn't let you create a file starting with a dot, name it `.cdsapirc.` (with a dot at the end) and it will save correctly.

3. Paste the `url` and `key` lines from step 3 of "Get your API Key" into this file.

**For Mac/Linux Users:**

1. Open your terminal.

2. Run the following command: `nano ~/.cdsapirc`

3. Paste your credentials:
```
url: https://cds.climate.copernicus.eu/api
key: <PERSONAL-ACCESS-TOKEN>
```
4. Save and exit (`Ctrl+O`, `Enter`, `Ctrl+X`).
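
Before starting a long download, it can help to confirm that the file is where `cdsapi` expects it and that both entries are present. The snippet below is a minimal sketch: `read_cdsapirc` is a hypothetical helper that only parses `name: value` lines, and it deliberately never prints the key itself.

```python
from pathlib import Path


def read_cdsapirc(path: Path) -> dict[str, str]:
    """Parse 'name: value' lines from a .cdsapirc-style file."""
    settings = {}
    for line in path.read_text().splitlines():
        if ":" in line:
            # Split on the first colon only, so the URL stays intact.
            name, value = line.split(":", 1)
            settings[name.strip()] = value.strip()
    return settings


rc_path = Path.home() / ".cdsapirc"
if rc_path.exists():
    settings = read_cdsapirc(rc_path)
    missing = {"url", "key"} - settings.keys()
    print("Missing entries:", sorted(missing) if missing else "none")
else:
    print(f"No file found at {rc_path}")
```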

In [1]:
from src.data_ingestion import download_era5_precipitation, download_era5_temperature

# Study parameters
STUDY_YEARS = range(1979, 2023)

# Run downloads
download_era5_precipitation(STUDY_YEARS)
download_era5_temperature(STUDY_YEARS)

⏳ Downloading Precip: 1979-01 ...


2026-01-26 16:51:52,769 INFO Request ID is 8ad27dd1-0412-4d83-a101-f1853739e7c0
2026-01-26 16:51:52,985 INFO status has been updated to accepted
2026-01-26 16:52:02,256 INFO status has been updated to running
2026-01-26 16:53:09,959 INFO status has been updated to successful


4449663fe3b89309bbf9380d5562d6c3.nc:   0%|          | 0.00/276k [00:00<?, ?B/s]

⏳ Downloading Precip: 1979-02 ...


2026-01-26 16:53:14,066 INFO Request ID is 2b4f90a6-915c-44ea-8806-c4c0e502e701
2026-01-26 16:53:14,245 INFO status has been updated to accepted
2026-01-26 16:53:28,513 INFO status has been updated to running
2026-01-26 16:55:09,580 INFO status has been updated to successful


41fd12e71bea1a2b918f7f468c9f0e54.nc:   0%|          | 0.00/302k [00:00<?, ?B/s]

⏳ Downloading Precip: 1979-03 ...


2026-01-26 16:55:12,909 INFO Request ID is 91d1a00c-55b7-4097-801f-33b3b735f9e9
2026-01-26 16:55:13,111 INFO status has been updated to accepted
2026-01-26 16:55:30,870 INFO status has been updated to running
2026-01-26 16:57:12,071 INFO status has been updated to successful


e1154db216bd34acc2aec43acb633d5f.nc:   0%|          | 0.00/280k [00:00<?, ?B/s]

⏳ Downloading Precip: 1979-04 ...


2026-01-26 16:57:16,073 INFO Request ID is 39c1e193-9848-4c64-8782-7fe7f3d771f8
2026-01-26 16:57:16,262 INFO status has been updated to accepted
2026-01-26 16:57:25,273 INFO status has been updated to running
2026-01-26 16:59:11,633 INFO status has been updated to successful


306167b2fdb3c17516288b1d3b6658dd.nc:   0%|          | 0.00/303k [00:00<?, ?B/s]

⏳ Downloading Precip: 1979-05 ...


2026-01-26 16:59:14,575 INFO Request ID is 02e69399-c86e-4bdb-8cb6-5509e5ee5381
2026-01-26 16:59:14,749 INFO status has been updated to accepted
2026-01-26 16:59:23,617 INFO status has been updated to running
2026-01-26 17:01:10,311 INFO status has been updated to successful


768ca629f6afa0c0dfe74813591cd5bf.nc:   0%|          | 0.00/317k [00:00<?, ?B/s]

⏳ Downloading Precip: 1979-06 ...


2026-01-26 17:01:13,437 INFO Request ID is 00336cf1-301b-4fe5-b18c-44ba01fdd488
2026-01-26 17:01:13,796 INFO status has been updated to accepted
2026-01-26 17:01:35,726 INFO status has been updated to running
2026-01-26 17:02:30,517 INFO status has been updated to successful


b17134a8be3f364d775c58a4629815ff.nc:   0%|          | 0.00/326k [00:00<?, ?B/s]

⏳ Downloading Precip: 1979-07 ...


2026-01-26 17:02:33,555 INFO Request ID is bc3571a5-59d1-49ca-a431-0a2d1e90c162
2026-01-26 17:02:33,743 INFO status has been updated to accepted
2026-01-26 17:02:55,632 INFO status has been updated to running
2026-01-26 17:04:28,977 INFO status has been updated to successful


f7470e3846640f6328e85ba72bbf4c53.nc:   0%|          | 0.00/314k [00:00<?, ?B/s]

⏳ Downloading Precip: 1979-08 ...


2026-01-26 17:04:32,142 INFO Request ID is e9552992-6478-41e8-a5cb-a3571f10ea6b
2026-01-26 17:04:32,330 INFO status has been updated to accepted
2026-01-26 17:04:46,489 INFO status has been updated to running
2026-01-26 17:06:27,609 INFO status has been updated to successful


79a3819b86d5f5e0ff84ff08707400b8.nc:   0%|          | 0.00/300k [00:00<?, ?B/s]

⏳ Downloading Precip: 1979-09 ...


2026-01-26 17:06:30,627 INFO Request ID is efae3622-cf8f-4caf-a3a8-dba2da6c00db
2026-01-26 17:06:30,791 INFO status has been updated to accepted
2026-01-26 17:06:45,019 INFO status has been updated to running
2026-01-26 17:06:52,807 INFO status has been updated to accepted
2026-01-26 17:07:04,397 INFO status has been updated to running
2026-01-26 17:08:26,130 INFO status has been updated to successful


d2e4950f6a7bd2230e01a130d9aaa5fb.nc:   0%|          | 0.00/275k [00:00<?, ?B/s]

⏳ Downloading Precip: 1979-10 ...


2026-01-26 17:08:29,082 INFO Request ID is 66e67985-1ad7-4e24-8080-288648c07536
2026-01-26 17:08:29,284 INFO status has been updated to accepted
2026-01-26 17:08:43,447 INFO status has been updated to running
2026-01-26 17:10:24,843 INFO status has been updated to successful


f69fd17a5d0d24500444c2c458cb87f.nc:   0%|          | 0.00/288k [00:00<?, ?B/s]

⏳ Downloading Precip: 1979-11 ...


2026-01-26 17:10:27,831 INFO Request ID is be4511c1-1675-4850-9a54-e314e5b05157
2026-01-26 17:10:28,015 INFO status has been updated to accepted
2026-01-26 17:10:42,123 INFO status has been updated to running
Recovering from connection error [('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))], attempt 1 of 500
Retrying in 120 seconds
2026-01-26 17:18:12,078 INFO status has been updated to successful


beb9e5046c6bcebcdb1a4aef41694831.nc:   0%|          | 0.00/236k [00:00<?, ?B/s]

⏳ Downloading Precip: 1979-12 ...


2026-01-26 17:18:15,514 INFO Request ID is b8f5bbea-1f7b-4dc3-957c-3dcdac120c8b
2026-01-26 17:18:15,728 INFO status has been updated to accepted
2026-01-26 17:18:24,693 INFO status has been updated to running
2026-01-26 17:20:11,035 INFO status has been updated to successful


68a5edcffbbcbdd144ee7dba37c61d98.nc:   0%|          | 0.00/311k [00:00<?, ?B/s]

✔ Skipping Precip 1980-01 (already exists)
✔ Skipping Precip 1980-02 (already exists)
✔ Skipping Precip 1980-03 (already exists)
✔ Skipping Precip 1980-04 (already exists)
✔ Skipping Precip 1980-05 (already exists)
✔ Skipping Precip 1980-06 (already exists)
✔ Skipping Precip 1980-07 (already exists)
✔ Skipping Precip 1980-08 (already exists)
✔ Skipping Precip 1980-09 (already exists)
✔ Skipping Precip 1980-10 (already exists)
✔ Skipping Precip 1980-11 (already exists)
✔ Skipping Precip 1980-12 (already exists)
✔ Skipping Precip 1981-01 (already exists)
✔ Skipping Precip 1981-02 (already exists)
✔ Skipping Precip 1981-03 (already exists)
✔ Skipping Precip 1981-04 (already exists)
✔ Skipping Precip 1981-05 (already exists)
✔ Skipping Precip 1981-06 (already exists)
✔ Skipping Precip 1981-07 (already exists)
✔ Skipping Precip 1981-08 (already exists)
✔ Skipping Precip 1981-09 (already exists)
✔ Skipping Precip 1981-10 (already exists)
✔ Skippi

2026-01-26 17:20:14,700 INFO [2025-12-11T00:00:00] Please note that a dedicated catalogue entry for this dataset, post-processed and stored in Analysis Ready Cloud Optimized (ARCO) format (Zarr), is available for optimised time-series retrievals (i.e. for retrieving data from selected variables for a single point over an extended period of time in an efficient way). You can discover it [here](https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels-timeseries?tab=overview)
2026-01-26 17:20:14,703 INFO Request ID is 61da616f-3eaf-4954-b22a-b00c5961ab16
2026-01-26 17:20:14,883 INFO status has been updated to accepted
2026-01-26 17:20:29,020 INFO status has been updated to running
2026-01-26 17:32:39,177 INFO status has been updated to successful


8b9161242966f68b1b81445df62a6a7a.grib:   0%|          | 0.00/99.8M [00:00<?, ?B/s]

✔ Skipping Temp 1980 (already exists)
✔ Skipping Temp 1981 (already exists)
✔ Skipping Temp 1982 (already exists)
✔ Skipping Temp 1983 (already exists)
✔ Skipping Temp 1984 (already exists)
✔ Skipping Temp 1985 (already exists)
✔ Skipping Temp 1986 (already exists)
✔ Skipping Temp 1987 (already exists)
✔ Skipping Temp 1988 (already exists)
✔ Skipping Temp 1989 (already exists)
✔ Skipping Temp 1990 (already exists)
✔ Skipping Temp 1991 (already exists)
✔ Skipping Temp 1992 (already exists)
✔ Skipping Temp 1993 (already exists)
✔ Skipping Temp 1994 (already exists)
✔ Skipping Temp 1995 (already exists)
✔ Skipping Temp 1996 (already exists)
✔ Skipping Temp 1997 (already exists)
✔ Skipping Temp 1998 (already exists)
✔ Skipping Temp 1999 (already exists)
✔ Skipping Temp 2000 (already exists)
✔ Skipping Temp 2001 (already exists)
✔ Skipping Temp 2002 (already exists)
✔ Skipping Temp 2003 (already exists)
✔ Skipping Temp 2004 (already exists)
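
The log above shows files that are already on disk being skipped, which makes the download safe to re-run after an interruption. That behaviour can be sketched as follows. This is an illustration only: `download_missing_months`, the `fetch` callable, and the file naming pattern are hypothetical and are not taken from `src.data_ingestion`.

```python
from pathlib import Path
from typing import Callable, Iterable


def download_missing_months(
    years: Iterable[int],
    out_dir: Path,
    fetch: Callable[[int, int, Path], None],
) -> list[Path]:
    """Call `fetch` only for year-month files that are not already on disk."""
    out_dir.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for year in years:
        for month in range(1, 13):
            # Hypothetical file naming pattern, one file per month.
            target = out_dir / f"precip_{year}_{month:02d}.nc"
            if target.exists():
                print(f"Skipping Precip {year}-{month:02d} (already exists)")
                continue
            print(f"Downloading Precip: {year}-{month:02d} ...")
            fetch(year, month, target)
            downloaded.append(target)
    return downloaded
```

In practice `fetch` would wrap the actual CDS request and write the result to `target`; keeping it as a parameter lets the skip logic be tested without network access.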


## Preprocess Climate Data
Compute daily basin-averaged statistics for each climate variable.
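
A basin average of a gridded field is commonly computed as a weighted mean of the grid cells that overlap the basin, with weights proportional to the overlap area. The sketch below illustrates that idea with NumPy. `basin_average` is a hypothetical helper, and the weighting scheme actually used by `process_era5_basin_data` is not shown here, so treat this as one plausible approach.

```python
import numpy as np


def basin_average(grid_values: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Area-weighted mean of gridded values for one basin.

    grid_values: array of shape (time, cells)
    weights:     array of shape (cells,), e.g. the overlap area of each
                 grid cell with the basin polygon
    """
    weights = weights / weights.sum()  # normalise so the weights sum to 1
    return grid_values @ weights       # one value per time step


# Toy example: two time steps, two grid cells; the first cell covers
# three times as much of the basin as the second.
values = np.array([[1.0, 3.0], [2.0, 6.0]])
weights = np.array([3.0, 1.0])
print(basin_average(values, weights))  # values 1.5 and 3.0
```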

In [6]:
import pandas as pd
from src.climate import process_era5_basin_data
from src.config import DRAINAGE_FILES, PROCESSED_DATA_DIR

# The columns of flow_df are the station numbers to process
flow_df = pd.read_csv(PROCESSED_DATA_DIR / "filtered_streamflow.csv", index_col=0)

precip_df, temp_df = process_era5_basin_data(
    basin_gpkg_list=DRAINAGE_FILES,
    stations_list=flow_df.columns.tolist()
)

print(precip_df.head())

Step 1/4: Loading Basins...
   🔄 Reprojecting basins to EPSG:4326 (Lat/Lon)...
Step 2/4: Mapping Spatial Weights...
⏳ Computing spatial weights for 132 basins...


Mapping Grid: 100%|██████████| 132/132 [00:14<00:00,  8.92it/s]



Step 3/4: Processing Precipitation...
--- Processing 528 Precipitation Files ---


Precip Files: 100%|██████████| 528/528 [01:31<00:00,  5.77it/s]



Step 4/4: Processing Temperature...
--- Processing 44 Temperature Files (Hourly) ---


Temp Files:  14%|█▎        | 6/44 [04:54<30:26, 48.07s/it]
Can't read index file 'C:\\Users\\tbwil\\Documents\\School\\MSc Geophysics\\Thesis Project\\data\\raw\\era5\\temperature\\era5_temp_1985.grib.5b7b6.idx'
Traceback (most recent call last):
  File "C:\Users\tbwil\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\cfgrib\messages.py", line 551, in from_indexpath_or_filestream
    self = cls.from_indexpath(indexpath)
  File "C:\Users\tbwil\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\cfgrib\messages.py", line 430, in from_indexpath
    index = pickle.load(file)
EOFError: Ran out of input
Temp Files: 100%|██████████| 44/44 [36:44<00:00, 50.10s/it]


✅ Climate processing complete.
             05AA004   05AA008   05AA022   05AA027   05AA028   05AB005  \
datetime                                                                 
1979-01-01  0.027646  0.024243  0.031097  0.020655  0.036783  0.011299   
1979-01-02  0.802943  0.391259  0.591803  0.418867  0.653277  0.703792   
1979-01-03  0.499313  0.188232  0.290161  0.184242  0.301486  0.181488   
1979-01-04  0.053464  0.054331  0.044684  0.055398  0.013087  0.012765   
1979-01-05  0.004649  0.001579  0.001028  0.002784  0.000512  0.039084   

             05AB029   05AD003   05AD035   05AE005  ...   07JC001   07JD002  \
datetime                                            ...                       
1979-01-01  0.012686  0.052841  0.100943  0.091844  ...  0.003124  0.007233   
1979-01-02  0.689405  0.801095  0.672200  0.655072  ...  0.035503  0.064547   
1979-01-03  0.217580  0.408767  0.372437  0.412462  ...  0.035301  0.064703   
1979-01-04  0.032482  0.016189  0.056001  0.037769  .