# Hello!
This is a walkthrough for training the University of Waterloo's submission model.

To find all our experiments and code, see our original [repo](https://github.com/trevor-yu-087/climatehack.ai-2024). Be warned: it is neither documented nor well organized.

# Environment setup

We use Docker to package dependencies. If you are using VS Code or a JetBrains IDE, the Dev Containers extension should be able to use the `.devcontainer` directory to build the Docker image and use it as a development environment.

If you do not want to use Docker, you can (hopefully) get set up by running:

- `pip install -r local-requirements.txt`
- `conda install cartopy`
- `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121`

A machine with a CUDA-enabled GPU is required.
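As a quick sanity check before going further, the sketch below reports whether PyTorch can see a CUDA device. The helper name `cuda_status` is ours and is not part of the repo; it only relies on `torch.cuda.is_available()` and `torch.cuda.get_device_name()`.

```python
def cuda_status() -> str:
    """Return a short, human-readable description of the CUDA setup."""
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "PyTorch is installed, but no CUDA device is visible"
    # Device 0 is the default GPU that PyTorch would use
    return f"CUDA device found: {torch.cuda.get_device_name(0)}"


print(cuda_status())
```

If this does not report a CUDA device, the PyTorch install or the GPU driver is the first thing to check.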

# Download data
Our model used PV, HRV, and weather data.

For this example we'll only be downloading a few months of data.

Note that you have to download the [indices.json](https://github.com/climatehackai/getting-started-2023/blob/main/indices.json) file and place it in the same directory as the data that gets downloaded below.
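Once the download below has finished, a small check like the following can confirm that `indices.json` ended up next to the data. This is only a sketch: `missing_files` is a hypothetical helper, not something provided by the repo, and the path is the same placeholder used in the download cell.

```python
from pathlib import Path


def missing_files(datadir, required=("indices.json",)):
    """Return the names in `required` that are not present in `datadir`."""
    root = Path(datadir)
    return [name for name in required if not (root / name).is_file()]


# Placeholder path, change it to match your own data directory
missing = missing_files("/workspaces/waterloo-climatehack/data")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are in place")
```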

In [1]:
import huggingface_hub
from os import makedirs

datadir = "/workspaces/waterloo-climatehack/data" # change this
makedirs(datadir, exist_ok=True)

huggingface_hub.snapshot_download(
    repo_id="climatehackai/climatehackai-2023", 
    local_dir=datadir, 
    cache_dir=datadir + '/cache',
    local_dir_use_symlinks=False, 
    repo_type="dataset",
    ignore_patterns=["aerosols/*", "satellite-nonhrv/*"],
    allow_patterns=["*10.zarr.zip", "*11.zarr.zip", "*.parquet", "*metadata.csv"]
)

For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.


'/workspaces/waterloo-climatehack/data'
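To see what the `allow_patterns` and `ignore_patterns` arguments select, the sketch below applies the same shell-style wildcards to a handful of file paths. The paths are hypothetical examples made up for illustration, and `is_selected` only approximates the filtering that `huggingface_hub` performs (a file is kept if it matches at least one allow pattern and no ignore pattern).

```python
from fnmatch import fnmatchcase

ALLOW = ["*10.zarr.zip", "*11.zarr.zip", "*.parquet", "*metadata.csv"]
IGNORE = ["aerosols/*", "satellite-nonhrv/*"]


def is_selected(path, allow=ALLOW, ignore=IGNORE):
    """Keep a path if it matches any allow pattern and no ignore pattern."""
    allowed = any(fnmatchcase(path, pattern) for pattern in allow)
    ignored = any(fnmatchcase(path, pattern) for pattern in ignore)
    return allowed and not ignored


# Hypothetical paths, not a listing of the real dataset repository
hypothetical_paths = [
    "pv/2020/1.parquet",
    "pv/metadata.csv",
    "satellite-hrv/2020/9.zarr.zip",
    "satellite-hrv/2020/10.zarr.zip",
    "satellite-nonhrv/2020/10.zarr.zip",
    "weather/2021/11.zarr.zip",
]

for path in hypothetical_paths:
    print(f"{path}: {'download' if is_selected(path) else 'skip'}")
```

Note that `*.parquet` matches every month, while the zarr patterns only match files whose names end in `10` or `11`.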

# Generating PV Site Features
We generate site specific features (such as the site's max and average output during each month).

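Although we only download October and November for the zarr data in this example, the PV parquet files for every month are downloaded, so the loop below computes features for all twelve months of 2020 and 2021.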
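The three per-site statistics can be illustrated on a tiny synthetic series before running the full loop. The readings below are made up for illustration; only the `power` column name and the 05:00 to 22:00 window are taken from the notebook code.

```python
import pandas as pd

# Synthetic readings for one site: one night-time value and two daytime
# values on each of two days (made-up numbers, not real PV data)
timestamps = pd.to_datetime([
    "2021-10-01 03:00", "2021-10-01 12:00", "2021-10-01 12:05",
    "2021-10-02 12:00", "2021-10-02 12:05",
])
site = pd.DataFrame({"power": [0.9, 0.2, 0.4, 0.6, 0.8]}, index=timestamps)

# Keep only readings between 05:00 and 22:00, which drops the 03:00 row
daytime = site.between_time("5:00", "22:00")

stats = {
    # Mean of all daytime readings in the month
    "avg": daytime.power.mean(),
    # Largest single daytime reading in the month
    "max": daytime.power.max(),
    # Average the readings per time of day, then take the peak of that profile
    "average_max": daytime.groupby(
        [daytime.index.hour, daytime.index.minute]
    ).power.mean().max(),
}
print(stats)
```

Here the average is 0.5 and the maximum is 0.8, while the average-day profile is 0.4 at 12:00 and 0.6 at 12:05, so its peak is 0.6. The 03:00 reading is ignored by all three.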

In [2]:
import pandas as pd
import numpy as np

In [3]:
years = [2020, 2021]
months = ['january', 'february', 'march', 'april', 'may', 'june', 'july', 'august', 'september', 'october', 'november', 'december']

# Accumulates three feature columns per (year, month), indexed by site id
master_frame = None

for year in years:
    for i, month_name in enumerate(months):
        print(f'{year},{month_name}')
        # Load one month of PV readings and drop the column we do not use
        month = pd.read_parquet(datadir + f'/pv/{year}/{i+1}.parquet').drop(columns=['generation_wh'])
        month = month.reorder_levels(['ss_id', 'timestamp'])

        site_ids = month.index.get_level_values(0).unique().values

        monthly_avg, monthly_max, monthly_average_max = [], [], []

        for site in site_ids:
            # Only consider readings between 05:00 and 22:00
            a = month.loc[site].between_time('5:00', '22:00')
            monthly_max.append(a.power.max())
            monthly_avg.append(a.power.mean())
            # Peak of the average daily profile (mean per time of day, then max)
            monthly_average_max.append(a.groupby([a.index.hour, a.index.minute]).power.mean().max())

        frame = pd.DataFrame(np.array([monthly_avg, monthly_max, monthly_average_max]).T, index=site_ids)
        frame.columns = [f'{month_name}_{year}_avg', f'{month_name}_{year}_max', f'{month_name}_{year}_average_max']

        if master_frame is None:
            master_frame = frame
        else:
            # Left join on site id: sites missing from the first month are dropped,
            # and sites missing from later months get NaN
            master_frame = master_frame.join(frame)

2020,january
2020,february
2020,march
2020,april
2020,may
2020,june
2020,july
2020,august
2020,september
2020,october
2020,november
2020,december
2021,january
2021,february
2021,march
2021,april
2021,may
2021,june
2021,july
2021,august
2021,september
2021,october
2021,november
2021,december


In [5]:
master_frame.head()

| ss_id | january_2020_avg | january_2020_max | january_2020_average_max | february_2020_avg | february_2020_max | february_2020_average_max | march_2020_avg | march_2020_max | march_2020_average_max | april_2020_avg | ... | september_2021_average_max | october_2021_avg | october_2021_max | october_2021_average_max | november_2021_avg | november_2021_max | november_2021_average_max | december_2021_avg | december_2021_max | december_2021_average_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2607 | 0.047627 | 0.726833 | 0.256715 | 0.09609 | 0.879429 | 0.341195 | 0.19427 | 0.978294 | 0.531157 | 0.251373 | ... | 0.509674 | 0.127118 | 0.912645 | 0.393029 | 0.101636 | 0.80771 | 0.51958 | NaN | NaN | NaN |
| 2626 | 0.020631 | 0.248641 | 0.100403 | NaN | NaN | NaN | 0.168927 | 0.81303 | 0.472241 | 0.236701 | ... | 0.34834 | 0.062498 | 0.66376 | 0.22138 | 0.051184 | 0.498327 | 0.254845 | NaN | NaN | NaN |
| 2631 | 0.018734 | 0.203612 | 0.07454 | 0.046694 | 0.415719 | 0.155116 | 0.108182 | 0.623148 | 0.287396 | 0.182052 | ... | 0.276162 | 0.056889 | 0.494889 | 0.16648 | 0.043561 | 0.296281 | 0.164227 | NaN | NaN | NaN |
| 2657 | 0.073219 | 0.79299 | 0.262686 | 0.120295 | 0.91338 | 0.38964 | 0.198679 | 0.93738 | 0.51991 | 0.277676 | ... | 0.479146 | 0.140347 | 0.908016 | 0.432959 | 0.150941 | 0.833607 | 0.511419 | NaN | NaN | NaN |
| 2660 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.316739 | 0.053756 | 0.508482 | 0.191908 | 0.050862 | 0.313617 | 0.208087 | NaN | NaN | NaN |


In [6]:
master_frame.to_pickle(datadir +'/pv_site_features.pkl')
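The preview above contains NaN entries, so it can be useful to reload the pickle and list the columns that are empty for every site. The sketch below does this on a small stand-in frame written to a temporary directory; the values and the two column names are only illustrative, and with the real file you would call `pd.read_pickle` on `datadir + '/pv_site_features.pkl'` instead.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Small stand-in for master_frame (illustrative values only)
features = pd.DataFrame(
    {
        "october_2021_avg": [0.127, 0.062],
        "december_2021_avg": [np.nan, np.nan],
    },
    index=[2607, 2626],
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pv_site_features.pkl"
    features.to_pickle(path)
    restored = pd.read_pickle(path)

# Columns in which every site is NaN, i.e. no usable feature was computed
empty_columns = restored.columns[restored.isna().all()].tolist()
print("Columns with no data:", empty_columns)
```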

# Data Loading
Now let's run our dataset class to validate that our data is set up correctly.

In [7]:
from dataset import get_datasets
import yaml

In [8]:
# yaml file that is used to configure training runs
CONFIG_FILE_NAME = "train.yaml"

with open(CONFIG_FILE_NAME) as file:
    config = yaml.safe_load(file)
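The `get_datasets` call in the next cell reads a fixed set of keys from `config`, so a missing key in `train.yaml` surfaces as a `KeyError`. A minimal sketch for checking the keys up front is shown here; `REQUIRED_KEYS` is simply collected from that call, `missing_config_keys` is a hypothetical helper rather than part of the repo, and the example values are placeholders.

```python
# Keys read from `config` by the get_datasets call in this notebook
REQUIRED_KEYS = (
    "data_path", "start_date", "end_date", "batch_size", "modalities",
    "seed", "pv_features_file", "test_size", "hrv_crop", "weather_crop",
    "zipped", "offset_start_time",
)


def missing_config_keys(config):
    """Return the required keys that are absent from the loaded config."""
    return [key for key in REQUIRED_KEYS if key not in config]


# Placeholder config for illustration; real values come from train.yaml
example_config = {"data_path": "/path/to/data", "modalities": ["hrv", "weather"]}
print("Missing keys:", missing_config_keys(example_config))
```

This only checks that the keys exist; it does not validate their values.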

In [9]:
train_ds, test_ds = get_datasets(
    config["data_path"],
    (config["start_date"], config["end_date"]),
    batch_size=config["batch_size"],
    hrv="hrv" in config["modalities"],
    weather="weather" in config["modalities"],
    metadata="metadata" in config["modalities"],
    seed=config["seed"],
    pv_features_file=config["pv_features_file"],
    test_size=config["test_size"],
    hrv_crop=config["hrv_crop"],
    weather_crop=config["weather_crop"],
    zipped=config["zipped"],
    offset_start_time=config["offset_start_time"]
)

Loading dataset checking checkpoint


KeyError: 10
