### Deep Learning for Computer Vision  
### Multi-Task Regression with the Digital Typhoon Dataset

This notebook demonstrates a **supervised multi-task regression** workflow for remote sensing with **TorchGeo**, using the Digital Typhoon dataset, which pairs infrared (IR) satellite imagery of tropical cyclones with meteorological measurements.

The objective is to predict multiple continuous typhoon intensity variables (here, wind speed and pressure) from satellite imagery using a deep learning model.

#### Dataset Overview
The [Digital Typhoon](https://torchgeo.readthedocs.io/en/stable/api/datasets.html#digital-typhoon) dataset is derived from hourly infrared channel observations captured by multiple generations of the Himawari meteorological satellites, from 1978 to the present. The satellite measurements have been converted to brightness temperatures and normalized across sensors, giving a consistent spatio-temporal dataset that covers more than four decades.

**Dataset features:**
- Infrared (IR) satellite imagery of 512 × 512 pixels at ~5 km resolution
- Auxiliary metadata including wind speed, pressure and additional typhoon-related attributes  
- 1,099 typhoons and 189,364 images

**References**  
Digital Typhoon Dataset: *A Large-Scale Benchmark for Tropical Cyclone Analysis*, [arXiv:2411.16421](https://arxiv.org/pdf/2411.16421); [arXiv:2311.02665](https://arxiv.org/pdf/2311.02665)

In [1]:
# Import libraries
import os
import shutil
import pandas as pd
import torch

from torch.utils.data import DataLoader
from torchgeo.datasets import DigitalTyphoon


In [2]:
# load dataset
root = "/home/ogallo/DL4CV/DigitalTyphoon"

dataset = DigitalTyphoon(
    root=root,
    features=["wind", "pressure"],
    targets=["wind", "pressure"],
    sequence_length=1,
    download=False
)


In [3]:
print(len(dataset))        # number of sequences
print(dataset[0])          # inspect the first sequence

173418
{'image': tensor([[[0.7248, 0.7813, 0.7813,  ..., 0.9331, 0.9363, 0.9331],
         [0.7248, 0.7576, 0.7735,  ..., 0.9331, 0.9331, 0.9331],
         [0.7290, 0.7536, 0.7656,  ..., 0.9299, 0.9299, 0.9331],
         ...,
         [0.6904, 0.6495, 0.6007,  ..., 0.8483, 0.8798, 0.8659],
         [0.6542, 0.6400, 0.6725,  ..., 0.8483, 0.8659, 0.8447],
         [0.7373, 0.7536, 0.7967,  ..., 0.8518, 0.8518, 0.8483]]]), 'wind': tensor(-1.1229), 'pressure': tensor(0.5422), 'label': tensor([-1.1229,  0.5422])}
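
The `wind` and `pressure` values printed above are not in physical units: a negative wind speed suggests the targets have been standardized. Below is a minimal sketch of the round trip, assuming a z-score transform with a mean and standard deviation computed from the corresponding column of `aux_data.csv`. The dataset class may use a different scheme, so check its source before relying on this. The helper names and the numbers are made up for illustration.

```python
from statistics import mean, pstdev


def standardize(value, mu, sigma):
    """Map a value in physical units to a z-score."""
    return (value - mu) / sigma


def destandardize(z, mu, sigma):
    """Map a z-score (e.g. a model prediction) back to physical units."""
    return z * sigma + mu


# Illustrative pressure values in hPa, NOT taken from the dataset.
pressures = [1000.0, 990.0, 980.0, 970.0, 960.0]
mu, sigma = mean(pressures), pstdev(pressures)

z = standardize(1000.0, mu, sigma)       # positive: above the mean
restored = destandardize(z, mu, sigma)   # back to hPa
```

The same two functions work unchanged on tensors, so a batch of model outputs can be converted back to physical units before reporting errors.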


In [4]:
aux_data = pd.read_csv("/home/ogallo/DL4CV/DigitalTyphoon/WP/aux_data.csv")
print(aux_data.head())     # inspect auxiliary data

       id                   image_path  year  month  day  hour  grade    lat  \
0  197830  1978120100-197830-GMS1-1.h5  1978     12    1     0      6  36.00   
1  197830  1978120103-197830-GMS1-1.h5  1978     12    1     3      6  37.46   
2  197830  1978120106-197830-GMS1-1.h5  1978     12    1     6      6  39.00   
3  197901  1978123112-197901-GMS1-1.h5  1978     12   31    12      2   2.00   
4  197901  1978123116-197901-GMS1-1.h5  1978     12   31    16      2   2.30   

      lng  pressure  wind  dir50  long50  short50  dir30  long30  short30  \
0  174.00     996.0   0.0      0       0        0      0       0        0   
1  176.44     994.0   0.0      0       0        0      0       0        0   
2  179.00     992.0   0.0      0       0        0      0       0        0   
3  172.00    1004.0   0.0      0       0        0      0       0        0   
4  171.81    1002.7   0.0      0       0        0      0       0        0   

   landfall  intp  
0         0     0  
1         0     

### Subset the dataset
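To keep experiments manageable, we work with a 10% subset: 110 of the 1,099 typhoons, sampled so that they are distributed across years, keeping every image of each selected typhoon. Whether the sampling should also account for typhoon grade and lifecycle stage is still an open question.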

In [6]:
# Import the sampling functions
from sample_v2 import load_data, sample_typhoons, sample_images, copy_images, save_sampled_data, copy_metadata

# Set paths and parameters
root = "/home/ogallo/DL4CV/DigitalTyphoon/WP"
output_dir = "/home/ogallo/DL4CV/WP_sampled_10pct"
total_typhoons = 110  # 10% of 1099 typhoons

# Load data
df = load_data(root)
print(f"Loaded {len(df)} records from {df['id'].nunique()} typhoons.")
print(f"Year range: {df['year'].min()} - {df['year'].max()}")

# Sample typhoons (distributed across years)
sampled_typhoons = sample_typhoons(df, total_typhoons, seed=42)
print(f"\nSelected {len(sampled_typhoons)} typhoons.")

# Sample all images for selected typhoons (no cap)
df_sampled = sample_images(df, sampled_typhoons)
print(f"Sampled {len(df_sampled)} images across {df_sampled['id'].nunique()} typhoons.")

# Copy images
copied, not_found = copy_images(df_sampled, root, output_dir)
print(f"\nCopied {copied}/{len(df_sampled)} images.")
if not_found:
    print(f"Warning: {len(not_found)} images not found.")

# Save sampled data
save_sampled_data(df_sampled, output_dir)
print("Saved aux_data.csv")

# Copy metadata
sampled_typhoon_ids = sorted(df_sampled['id'].unique())
copy_metadata(root, output_dir, sampled_typhoon_ids)
print("Copied and filtered metadata.")

Loaded 189364 records from 1099 typhoons.
Year range: 1978 - 2022

Selected 110 typhoons.
Sampled 18686 images across 110 typhoons.

Copied 18686/18686 images.
Saved aux_data.csv
Copied and filtered metadata.
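
`sample_v2` is a local module whose source is not shown here. As a rough idea of what "distributed across years" could look like, the sketch below draws typhoon IDs proportionally from each year with pandas. The function name `sample_typhoons_by_year` and its logic are hypothetical and are not the actual `sample_typhoons` implementation; only the `id` and `year` column names come from `aux_data.csv`. The demo frame is synthetic.

```python
import pandas as pd


def sample_typhoons_by_year(df, total, seed=42):
    """Pick roughly `total` typhoon IDs, spread proportionally across years."""
    # One row per typhoon, labelled with the first year it appears in.
    per_typhoon = df.groupby("id", as_index=False)["year"].min()
    frac = total / len(per_typhoon)

    picked = []
    for _, group in per_typhoon.groupby("year"):
        # At least one typhoon per year, never more than the year contains.
        n = min(len(group), max(1, round(len(group) * frac)))
        picked.extend(group["id"].sample(n=n, random_state=seed).tolist())
    return sorted(picked)


# Synthetic stand-in for aux_data.csv: 5 years x 20 typhoons x 3 images each.
rows = [
    {"id": year * 100 + k, "year": year}
    for year in range(2000, 2005)
    for k in range(20)
    for _ in range(3)
]
demo_df = pd.DataFrame(rows)

picked_ids = sample_typhoons_by_year(demo_df, total=10)
# Keep every image that belongs to a selected typhoon.
subset = demo_df[demo_df["id"].isin(picked_ids)]
```

Because each year is rounded separately, the number of selected typhoons is only approximately `total` on real data.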


In [9]:
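# Reload the dataset from a sampled subset.
# NOTE: this root (WP_sampled_100) differs from the output_dir written above
# (WP_sampled_10pct); confirm it points at the subset you intend to use.
dataset = DigitalTyphoon(
    root="/home/ogallo/DL4CV/WP_sampled_100",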
    features=["wind", "pressure"],
    targets=["wind", "pressure"],
    sequence_length=1,
    download=False
)
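
Once the dataset is loaded, it can be batched with the `DataLoader` imported earlier. The sketch below uses a small stand-in dataset (the hypothetical `FakeTyphoonDataset`) that returns dictionaries shaped like the sample printed earlier, so it runs without the data on disk. With the real data, pass `dataset` to `DataLoader` instead. The batch size and image size are arbitrary.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class FakeTyphoonDataset(Dataset):
    """Stand-in that mimics the sample dict: image (1, H, W) and label (2,)."""

    def __init__(self, n=8, size=64):
        self.n = n
        self.size = size

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        label = torch.randn(2)  # [wind, pressure], standardized
        return {
            "image": torch.rand(1, self.size, self.size),
            "wind": label[0],
            "pressure": label[1],
            "label": label,
        }


loader = DataLoader(FakeTyphoonDataset(), batch_size=4, shuffle=True)

# The default collate function stacks each dict entry along a new batch axis.
batch = next(iter(loader))
images, labels = batch["image"], batch["label"]
```

Here `images` has shape `(batch, 1, H, W)` and `labels` has shape `(batch, 2)`, which is the layout a two-output regression model expects.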

In [None]:
# visualize input
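
One way to fill in this cell is a small matplotlib helper that shows the single IR channel together with its targets. This is a sketch, assuming each sample is a dict with an `image` tensor of shape `(1, H, W)` and scalar `wind` and `pressure` tensors, as in the output printed earlier. The helper name `plot_sample` is hypothetical, and a random tensor stands in for real data so the cell runs on its own.

```python
import matplotlib.pyplot as plt
import torch


def plot_sample(sample):
    """Show the IR frame of one sample with its (normalized) targets as title."""
    frame = sample["image"][0].numpy()  # (1, H, W) -> (H, W)

    fig, ax = plt.subplots(figsize=(5, 5))
    im = ax.imshow(frame, cmap="gray")
    ax.set_title(
        f"wind={sample['wind'].item():.2f}, "
        f"pressure={sample['pressure'].item():.2f} (normalized)"
    )
    ax.axis("off")
    fig.colorbar(im, ax=ax, fraction=0.046)
    return fig


# Random stand-in; with the real data use: fig = plot_sample(dataset[0])
fake_sample = {
    "image": torch.rand(1, 512, 512),
    "wind": torch.tensor(-1.12),
    "pressure": torch.tensor(0.54),
}
fig = plot_sample(fake_sample)
```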

### ResNet34
**To do:** Explain the model architecture: a ResNet34 with two heads for multi-task regression (add a picture or infographic if possible).

**To do:** Also describe the workflow in markdown: training, validation and testing, followed by tuning.
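
For the train/val/test split, one option is to split by typhoon ID rather than by image, so that consecutive frames of the same storm never end up on both sides of a split. The sketch below assumes that strategy; the function name `split_typhoon_ids`, the 70/15/15 proportions and the placeholder IDs are all hypothetical.

```python
import random


def split_typhoon_ids(ids, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle unique typhoon IDs and cut them into train/val/test lists."""
    unique_ids = sorted(set(ids))
    rng = random.Random(seed)  # local RNG keeps the split reproducible
    rng.shuffle(unique_ids)

    n_test = int(len(unique_ids) * test_frac)
    n_val = int(len(unique_ids) * val_frac)
    test_ids = unique_ids[:n_test]
    val_ids = unique_ids[n_test:n_test + n_val]
    train_ids = unique_ids[n_test + n_val:]
    return train_ids, val_ids, test_ids


# 110 placeholder IDs standing in for the sampled typhoons.
all_ids = list(range(110))
train_ids, val_ids, test_ids = split_typhoon_ids(all_ids)
```

The resulting ID lists can then be turned into image-level subsets by filtering the auxiliary table, for example `df[df["id"].isin(train_ids)]`.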

**Open question:** wandb logger, or torchgeo.logger...?

**Planned steps:**
- Split into train/val/test
- Hyperparameter tuning
- Loss and metrics: RMSE, MSE loss, etc.
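
For the loss and metrics, a minimal sketch assuming predictions and labels are both shaped `(batch, 2)` in `[wind, pressure]` order: a weighted mean of the per-task MSE as the training loss, and a per-task RMSE for reporting. The function names and the equal default weights are illustrative, not a settled choice.

```python
import torch


def multitask_mse(pred, target, weights=(1.0, 1.0)):
    """Weighted average of the per-task mean squared errors."""
    per_task = ((pred - target) ** 2).mean(dim=0)  # shape (2,)
    w = torch.tensor(weights, dtype=pred.dtype, device=pred.device)
    return (w * per_task).sum() / w.sum()


def per_task_rmse(pred, target):
    """Root mean squared error for each target column separately."""
    return torch.sqrt(((pred - target) ** 2).mean(dim=0))


# Toy batch: predictions of zero against constant targets [3, 4].
pred = torch.zeros(4, 2)
target = torch.tensor([[3.0, 4.0]]).repeat(4, 1)

loss = multitask_mse(pred, target)
rmse = per_task_rmse(pred, target)
```

If the targets are standardized, these RMSE values are in standard-deviation units; multiply by each target's standard deviation to report them in physical units.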