# ClimateHack.AI 2023: Training a Basic Model

Thank you for participating in ClimateHack.AI 2023! 🌍

Your contributions could help cut carbon emissions by up to 100 kilotonnes per year in Great Britain alone. We look forward to seeing what you build over the course of the competition!

In this Jupyter notebook, you will train your first model for the challenge using historical solar PV data and HRV satellite imagery.

For more detailed information on the challenge, see the [DOXA AI competition page](https://doxaai.com/competition/climatehackai-2023/overview). 😎

## Installing packages

Before you can get started, you will need to install a number of packages to allow you to work with the data and submit to the platform. If you do not already have these packages installed, you can uncomment the lines below to do so! You will also need to [install PyTorch](https://pytorch.org/get-started/locally/).

In [1]:
# pandas needs a Parquet engine such as pyarrow for `pd.read_parquet` later on
#%pip install numpy pandas pyarrow matplotlib zarr xarray ipykernel gcsfs fsspec dask cartopy ocf-blosc2 torchinfo
#%pip install -U doxa-cli

## Importing packages

Here, we import a number of packages we will need to train our first model.

In [3]:
import os
from datetime import datetime, time, timedelta
import matplotlib.pyplot as plt
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
import xarray as xr
from ocf_blosc2 import Blosc2
from torch.utils.data import DataLoader, IterableDataset
from torchinfo import summary
import json

plt.rcParams["figure.figsize"] = (20, 12)

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

device

device(type='cpu')

## Creating your submission directory

If you cloned [this repository](https://github.com/climatehackai/getting-started-2023) straight from GitHub, you will already have all the files you need. If you are running this notebook in Google Colab, uncomment the cell below to download the few extra files needed to create a fresh submission directory, which you can later upload to the platform as your first competition submission.


In [None]:
# if not os.path.exists("submission"):
#     os.makedirs("submission", exist_ok=True)

#     !curl -L https://raw.githubusercontent.com/climatehackai/getting-started-2023/main/submission/competition.py --output submission/competition.py
#     !curl -L https://raw.githubusercontent.com/climatehackai/getting-started-2023/main/submission/doxa.yaml --output submission/doxa.yaml
#     !curl -L https://raw.githubusercontent.com/climatehackai/getting-started-2023/main/submission/model.py --output submission/model.py
#     !curl -L https://raw.githubusercontent.com/climatehackai/getting-started-2023/main/submission/run.py --output submission/run.py
#     !curl -L https://raw.githubusercontent.com/climatehackai/getting-started-2023/main/indices.json --output indices.json

## Downloading a month of data

While streaming the Zarr-format datasets directly from Hugging Face was adequate for some initial data exploration in `1_data.ipynb`, it will most likely be too slow for training. Since there is so much data available, we can get started by downloading just a single month of PV and HRV satellite imagery data.

In [8]:
if not os.path.exists("data"):
    os.makedirs("data/pv/2020", exist_ok=True)
    os.makedirs("data/satellite-hrv/2020", exist_ok=True)

    !curl -L https://huggingface.co/datasets/climatehackai/climatehackai-2023/resolve/main/pv/metadata.csv --output data/pv/metadata.csv
    !curl -L https://huggingface.co/datasets/climatehackai/climatehackai-2023/resolve/main/pv/2020/7.parquet --output data/pv/2020/7.parquet
    !curl -L https://huggingface.co/datasets/climatehackai/climatehackai-2023/resolve/main/satellite-hrv/2020/7.zarr.zip --output data/satellite-hrv/2020/7.zarr.zip
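
If `curl` is not available, the same download could be sketched in plain Python. This is only an illustration: `month_files` and `download_month` are hypothetical helper names, and the assumption that other months follow the same `year/month` naming pattern as `2020/7` is not confirmed here.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL copied from the curl commands above.
BASE_URL = "https://huggingface.co/datasets/climatehackai/climatehackai-2023/resolve/main"


def month_files(year, month):
    """Return (remote URL, local path) pairs for one month of PV and HRV data.

    Assumption: other months follow the same naming pattern as 2020/7.
    """
    relative_paths = [
        "pv/metadata.csv",
        f"pv/{year}/{month}.parquet",
        f"satellite-hrv/{year}/{month}.zarr.zip",
    ]
    return [(f"{BASE_URL}/{p}", Path("data") / p) for p in relative_paths]


def download_month(year, month):
    """Download any files for the given month that are not already on disk."""
    for url, path in month_files(year, month):
        if path.exists():
            continue  # skip files that are already present locally
        path.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, path)


# Example (commented out because the HRV archive is a multi-gigabyte download):
# download_month(2020, 7)
```

Checking each file individually, rather than only whether the `data` directory exists, means a file that is missing locally gets fetched on the next run.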

## Loading the data

In [None]:
pv = pd.read_parquet("data/pv/2020/7.parquet").drop("generation_wh", axis=1)

pv

In [None]:
hrv = xr.open_dataset(
    "data/satellite-hrv/2020/7.zarr.zip", engine="zarr", chunks="auto"
)

hrv

As part of the challenge, you can make use of satellite imagery, numerical weather prediction and air quality forecast data in a `[128, 128]` region centred on each solar PV site. In order to help you out, we have pre-computed the indices corresponding to each solar PV site and included them in `indices.json`, which we can load directly. For more information, take a look at the [challenge page](https://doxaai.com/competition/climatehackai-2023).


In [None]:
with open("indices.json") as f:
    site_locations = {
        data_source: {
            int(site): (int(location[0]), int(location[1]))
            for site, location in locations.items()
        }
        for data_source, locations in json.load(f).items()
    }
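
To make the role of these indices concrete, here is a minimal sketch of cutting a 128×128 window around a site. The image size and the site entry are made-up stand-ins, and `crop_around` is a hypothetical helper; only the `(x, y)` ordering is taken from how the dataset class in this notebook uses the indices.

```python
import numpy as np

CROP = 128  # side length of the square region around each site


def crop_around(image, x, y, size=CROP):
    """Return the size x size window of `image` centred on column x, row y."""
    half = size // 2
    window = image[y - half : y + half, x - half : x + half]
    if window.shape != (size, size):
        # Happens when the site sits too close to the edge of the image
        raise ValueError(f"site at ({x}, {y}) is too close to the image edge")
    return window


# Toy stand-ins: a made-up image size and a made-up site index.
image = np.arange(400 * 600).reshape(400, 600)
toy_locations = {"hrv": {1234: (300, 200)}}

x, y = toy_locations["hrv"][1234]
patch = crop_around(image, x, y)
print(patch.shape)
```

The explicit shape check matters because NumPy slicing does not raise an error when a window runs off the edge; it silently returns a smaller (or empty) array.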

### Defining a PyTorch Dataset

To get started, we will define a simple `IterableDataset` that shows how to slice into the PV and HRV data using `pandas` and `xarray`, respectively. You will have to modify this if you wish to incorporate non-HRV data, weather forecasts and air quality forecasts into your training regimen. If you have any questions, feel free to ask on the [ClimateHack.AI Community Discord server](https://discord.gg/HTTQ8AFjJp)!

**Note**: `site_locations` contains indices for the non-HRV, weather forecast and air quality forecast data as well as for the HRV data!

There are many more advanced strategies you could implement to load data in training, particularly if you want to pre-prepare training batches in advance or use multiple workers to improve data loading times.
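
As one illustration of the multi-worker idea: with an `IterableDataset`, every `DataLoader` worker iterates over its own copy of the dataset, so the work has to be divided explicitly or each sample is produced once per worker. The sketch below splits the list of sites between workers. `sites_for_worker` is a hypothetical helper; the `worker_info` argument is meant to be the value returned by `torch.utils.data.get_worker_info()`, which is replaced by a stand-in object here so the example runs without PyTorch.

```python
from itertools import islice
from types import SimpleNamespace


def sites_for_worker(sites, worker_info=None):
    """Pick the subset of sites one DataLoader worker should iterate over.

    `worker_info` is meant to be the value returned by
    `torch.utils.data.get_worker_info()`: None in the main process, otherwise
    an object with `id` and `num_workers` attributes.
    """
    if worker_info is None:
        return list(sites)  # single-process loading: keep every site
    # Worker k takes sites k, k + n, k + 2n, ... so no site is visited twice.
    return list(islice(sites, worker_info.id, None, worker_info.num_workers))


# Stand-in for the object PyTorch would provide inside worker 1 of 3.
fake_info = SimpleNamespace(id=1, num_workers=3)
print(sites_for_worker([10, 11, 12, 13, 14, 15, 16], fake_info))  # [11, 14]
```

Inside a dataset's `__iter__`, the result of this call would replace the full site list, and the `DataLoader` would then be created with `num_workers` greater than zero.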

In [None]:
class ChallengeDataset(IterableDataset):
    def __init__(self, pv, hrv, site_locations, sites=None):
        self.pv = pv
        self.hrv = hrv
        self._site_locations = site_locations
        self._sites = sites if sites else list(site_locations["hrv"].keys())

    def _get_image_times(self):
        min_date = datetime(2020, 7, 1)
        max_date = datetime(2020, 7, 30)

        start_time = time(8)
        end_time = time(17)

        date = min_date
        while date <= max_date:
            current_time = datetime.combine(date, start_time)
            while current_time.time() < end_time:
                if current_time:
                    yield current_time

                current_time += timedelta(minutes=60)

            date += timedelta(days=1)

    def __iter__(self):
        # `start` avoids shadowing the `time` class imported from `datetime`
        for start in self._get_image_times():
            first_hour = slice(str(start), str(start + timedelta(minutes=55)))

            # Use the dataset's own PV frame rather than the notebook-level global
            pv_features = self.pv.xs(first_hour, drop_level=False)  # type: ignore
            pv_targets = self.pv.xs(
                slice(  # type: ignore
                    str(start + timedelta(hours=1)),
                    str(start + timedelta(hours=4, minutes=55)),
                ),
                drop_level=False,
            )

            hrv_data = self.hrv["data"].sel(time=first_hour).to_numpy()

            for site in self._sites:
                try:
                    # Get solar PV features and targets
                    site_features = pv_features.xs(site, level=1).to_numpy().squeeze(-1)
                    site_targets = pv_targets.xs(site, level=1).to_numpy().squeeze(-1)
                    assert site_features.shape == (12,) and site_targets.shape == (48,)

                    # Get a 128x128 HRV crop centred on the site over the previous hour
                    x, y = self._site_locations["hrv"][site]
                    hrv_features = hrv_data[:, y - 64 : y + 64, x - 64 : x + 64, 0]
                    assert hrv_features.shape == (12, 128, 128)

                    # How might you adapt this for the non-HRV, weather and aerosol data?
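                except Exception:
                    # Skip sites with missing or incomplete data for this window
                    # (a bare `except:` would also swallow KeyboardInterrupt)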
                    continue

                yield site_features, hrv_features, site_targets
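
The shape checks in `__iter__` (12 feature values and 48 target values) follow from the time windows it slices out. Here is a small sketch of that arithmetic, assuming the data is sampled every five minutes, which is what those shapes imply. `window` is a hypothetical helper written for this illustration.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=5)  # assumed sampling interval, implied by the shape checks


def window(start, end):
    """List the timestamps from start to end inclusive, one per STEP."""
    count = (end - start) // STEP + 1
    return [start + i * STEP for i in range(count)]


t = datetime(2020, 7, 1, 8)

# One hour of history: t, t + 5 min, ..., t + 55 min
feature_times = window(t, t + timedelta(minutes=55))

# Four hours to predict: t + 1 h, ..., t + 4 h 55 min
target_times = window(t + timedelta(hours=1), t + timedelta(hours=4, minutes=55))

print(len(feature_times), len(target_times))  # 12 48
```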

## Defining a model

In order to make a PyTorch-based submission to the DOXA AI platform, you need to upload both the code defining your model and your trained model weights (along with some code to run your model). As a result, if you want to experiment with different model architectures using this notebook, you will need to edit the model in `submission/model.py` and re-import it here.

Here is the small convolutional neural network you are initially given in `submission/model.py`. You will absolutely be able to improve upon this!

```py
#########################################
#       Improve this basic model!       #
#########################################

class Model(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()

        self.conv1 = nn.Conv2d(in_channels=12, out_channels=24, kernel_size=3)
        self.conv2 = nn.Conv2d(in_channels=24, out_channels=48, kernel_size=3)
        self.conv3 = nn.Conv2d(in_channels=48, out_channels=96, kernel_size=3)
        self.conv4 = nn.Conv2d(in_channels=96, out_channels=192, kernel_size=3)

        self.pool = nn.MaxPool2d(kernel_size=2)
        self.flatten = nn.Flatten()

        self.linear1 = nn.Linear(6924, 48)

    def forward(self, pv, hrv):
        x = torch.relu(self.pool(self.conv1(hrv)))
        x = torch.relu(self.pool(self.conv2(x)))
        x = torch.relu(self.pool(self.conv3(x)))
        x = torch.relu(self.pool(self.conv4(x)))

        x = self.flatten(x)
        x = torch.concat((x, pv), dim=-1)

        x = torch.sigmoid(self.linear1(x))

        return x

```
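
If you change the architecture, the input size of `linear1` has to change with it. The following sketch shows where `6924` comes from, assuming the defaults used above: unpadded, stride-1 convolutions with `kernel_size=3`, each followed by a 2×2 max-pool. `conv_then_pool` is a hypothetical helper written for this illustration.

```python
def conv_then_pool(size, kernel_size=3, pool_size=2):
    """Spatial size after an unpadded, stride-1 convolution and a max-pool."""
    size = size - kernel_size + 1  # Conv2d with no padding shrinks each side
    return size // pool_size       # MaxPool2d rounds down


size = 128  # the HRV crops are 128 x 128
for _ in range(4):  # conv1..conv4, each followed by the shared pool
    size = conv_then_pool(size)

flattened = 192 * size * size   # 192 output channels from conv4
linear_inputs = flattened + 12  # plus the 12 PV readings concatenated in forward()
print(size, flattened, linear_inputs)  # 6 6912 6924
```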

In [None]:
# Import the model defined in `submission/model.py`

from submission.model import Model

In [None]:
summary(Model(), input_size=[(1, 12), (1, 12, 128, 128)])

## Training a model

In [None]:
BATCH_SIZE = 32

dataset = ChallengeDataset(pv, hrv, site_locations=site_locations)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, pin_memory=True)
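
The dataset above is built from every site, so there is no held-out data to check the model against. Since `ChallengeDataset` accepts a `sites` argument, one possible approach (a suggestion, not something this notebook does) is to reserve a fraction of the sites for validation. `split_sites` is a hypothetical helper, and the site IDs below are made up.

```python
import random


def split_sites(sites, validation_fraction=0.2, seed=0):
    """Split site IDs into (train, validation) lists, reproducibly."""
    shuffled = sorted(sites)               # fixed starting order
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Made-up site IDs; in the notebook these would be site_locations["hrv"].keys()
train_sites, val_sites = split_sites(range(100, 110))
print(len(train_sites), len(val_sites))  # 8 2

# The two lists could then be passed to separate datasets, for example:
# train_dataset = ChallengeDataset(pv, hrv, site_locations, sites=train_sites)
# val_dataset = ChallengeDataset(pv, hrv, site_locations, sites=val_sites)
```

Splitting by site measures how well the model generalises to unseen locations; splitting by date instead would measure generalisation to unseen days.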

In [None]:
model = Model().to(device)

criterion = nn.L1Loss()
optimiser = optim.Adam(model.parameters(), lr=1e-3)

In [None]:
EPOCHS = 1

for epoch in range(EPOCHS):
    model.train()

    running_loss = 0.0
    count = 0
    for i, (pv_features, hrv_features, pv_targets) in enumerate(dataloader):
        optimiser.zero_grad()

        predictions = model(
            pv_features.to(device, dtype=torch.float),
            hrv_features.to(device, dtype=torch.float),
        )

        loss = criterion(predictions, pv_targets.to(device, dtype=torch.float))
        loss.backward()

        optimiser.step()

        size = int(pv_targets.size(0))
        running_loss += float(loss) * size
        count += size

        if i % 200 == 199:
            print(f"Epoch {epoch + 1}, {i + 1}: {running_loss / count}")

    print(f"Epoch {epoch + 1}: {running_loss / count}")

In [None]:
# Save your model
torch.save(model.state_dict(), "submission/model.pt")

## Submitting to the DOXA AI platform

Congratulations, **you have trained your first model for ClimateHack.AI 2023**! 🥳

Why not try making a submission to the platform?

First, make sure you have enrolled for the competition on the [ClimateHack.AI 2023 competition page](https://doxaai.com/competition/climatehackai-2023). You will need to be signed in with a DOXA AI account registered with your university email address so that we can verify your eligibility.

You can then sign in with the CLI using the following command:

In [None]:
!doxa login

Finally, you can upload your submission to the platform by running the following cell:

In [None]:
!doxa upload submission

If everything went well, you will soon appear on the [competition scoreboard](https://doxaai.com/competition/climatehackai-2023/scoreboard) once your model has been evaluated! 😎

## Next steps

Well done for reaching the end of this Jupyter notebook! By now, you will have loaded and explored the data, trained a basic model, and joined other competition participants on the [competition scoreboard](https://doxaai.com/competition/climatehackai-2023/scoreboard)!

To get started, we used a very simple model architecture, but this model most likely does not have a sufficiently rich representation to properly solve our problem. How might you be able to improve on this? Which model architectures would be best suited to this problem? Would you want to train a model from scratch, as we have done here, or possibly fine-tune a pre-trained computer vision model? Check out the resources on the [competition page](https://doxaai.com/competition/climatehackai-2023) for ideas on where to go from here.

Additionally, we only used historical PV and HRV data, but perhaps you might be able to get more mileage out of the other data sources available to you, such as non-HRV satellite imagery, the DWD weather forecast data or even the aerosol data. If you do decide to incorporate more data, what **data engineering** work would you have to perform so that you can train effectively on a large quantity of data?

**We want to hear about your approaches**! If you develop anything interesting, let us know on the [ClimateHack.AI Community Discord server](https://discord.gg/HTTQ8AFjJp) and start a conversation!