# Tabular Weather Prediction Tutorial

This tutorial walks through the uncertainty challenge for regression on tabular weather data.

## Outline

1. Data Loading
2. Training
3. Inference
4. Submission

## 1. Data Loading

All data is provided as CSV files. You should have downloaded the following data files:

- `train.csv`
- `dev_in.csv`
- `dev_out.csv`

`dev_in` contains data that is in-domain with respect to `train` in both time and climate. `dev_out` contains data that is shifted in time and climate with respect to `train`.

This tutorial assumes that all data files are placed in a local directory named `./data/`.

In [1]:
import pandas as pd

# Load each data file as a pandas data frame
df_train = pd.read_csv('data/train.csv')
df_train.head()

| | fact_time | fact_latitude | fact_longitude | fact_temperature | fact_cwsm_class | climate | topography_bathymetry | sun_elevation | climate_temperature | climate_pressure | ... | cmc_0_1_66_0_grad | cmc_0_1_66_0_next | cmc_0_1_67_0_grad | cmc_0_1_67_0_next | cmc_0_1_68_0_grad | cmc_0_1_68_0_next | gfs_2m_dewpoint_grad | gfs_2m_dewpoint_next | gfs_total_clouds_cover_low_grad | gfs_total_clouds_cover_low_next |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1543321000.0 | 26.9688 | -99.248901 | 2.0 | 0.0 | dry | 127.0 | -17.526443 | 14.613571 | 754.263405 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -2.600006 | -2.750006 | 0.0 | 0.0 |
| 1 | 1538776000.0 | 29.374201 | -100.927002 | 31.0 | 20.0 | mild temperate | 297.0 | 41.531032 | 26.992143 | 733.117168 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.600006 | 17.950006 | -12.0 | 11.0 |
| 2 | 1552115000.0 | 22.149599 | 113.592003 | 17.0 | 10.0 | mild temperate | -1.0 | 43.916531 | 18.842143 | 761.571076 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.233978 | 21.450006 | 1.0 | 8.0 |
| 3 | 1549566000.0 | 34.678699 | -86.684799 | 24.0 | 20.0 | mild temperate | 193.0 | 40.240955 | 8.303571 | 747.52491 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.059448 | 16.150018 | -58.0 | 41.0 |
| 4 | 1552910000.0 | 46.066667 | 41.966667 | 9.0 | 20.0 | dry | 90.0 | 30.39466 | 6.451429 | 753.168113 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.400024 | 3.150018 | 18.0 | 92.0 |


In [2]:
df_dev_in = pd.read_csv('data/dev_in.csv')
df_dev_out = pd.read_csv('data/dev_out.csv')
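
The shift between `train` and `dev_out` described above can be made visible by comparing how often each climate type occurs in each set. The following is a minimal sketch that assumes the `climate` column shown in the `head()` output; the toy frames and the `snow` label are made-up stand-ins for illustration, and `climate_proportions` is a hypothetical helper name.

```python
import pandas as pd

def climate_proportions(df, column="climate"):
    """Return the share of rows belonging to each climate type."""
    return df[column].value_counts(normalize=True)

# Toy stand-ins for df_train and df_dev_out (made-up rows, for illustration only)
toy_train = pd.DataFrame({"climate": ["dry", "dry", "mild temperate", "mild temperate"]})
toy_dev_out = pd.DataFrame({"climate": ["snow", "snow", "snow", "dry"]})

# Side-by-side proportions; climates absent from a set get 0.0
comparison = pd.concat(
    {"train": climate_proportions(toy_train), "dev_out": climate_proportions(toy_dev_out)},
    axis=1,
).fillna(0.0)
print(comparison)
```

With the real data, passing `df_train` and `df_dev_out` in place of the toy frames would show which climate types appear only in the shifted set.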

## 2. Training

In this tutorial, `CatBoostRegressor` is used as the model.
- An ensemble of models is trained.
- `RMSEWithUncertainty` must be used as the loss function during training so that uncertainty measures can be calculated during inference.
- The models are trained on `df_train`, and the hyperparameters should be tuned using `df_dev_in`.
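
To give some intuition for why a loss such as `RMSEWithUncertainty` lets a model output a variance alongside a mean, the sketch below evaluates a Gaussian negative log-likelihood in NumPy. It assumes the loss behaves like this likelihood; CatBoost's internal parameterization may differ in detail, and `gaussian_nll` is a hypothetical helper, not a CatBoost function.

```python
import numpy as np

def gaussian_nll(y_true, mean, variance):
    """Average negative log-likelihood of targets under N(mean, variance)."""
    y_true, mean, variance = map(np.asarray, (y_true, mean, variance))
    return np.mean(
        0.5 * np.log(2 * np.pi * variance) + (y_true - mean) ** 2 / (2 * variance)
    )

y = np.array([10.0, 20.0])
mu = np.array([10.0, 26.0])  # the second prediction is off by 6 degrees

# Same means, different predicted variances for the badly predicted point
confident = gaussian_nll(y, mu, variance=np.array([1.0, 1.0]))
hedged = gaussian_nll(y, mu, variance=np.array([1.0, 36.0]))
print(confident, hedged)
```

The second call scores better because it reports a large variance where the error is large, which is the behaviour the later uncertainty measures rely on.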

In [3]:
# Extract features and targets; the first 6 columns (metadata and targets) are excluded from the features
X_train = df_train.iloc[:,6:]
X_dev_in = df_dev_in.iloc[:,6:]
y_train = df_train['fact_temperature']
y_dev_in = df_dev_in['fact_temperature']

In [11]:
# Set training hyperparameters (note these are dummy hyperparameters - you will need to select your own)
ensemble_size = 3
depth = 2
iterations = 200
learning_rate = 0.03
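
Since the values above are placeholders, one simple way to choose real ones is a grid search scored on `dev_in`. The sketch below shows only the selection loop: `select_hyperparameters`, the grid values, and `toy_score` are hypothetical, and no model is trained here.

```python
from itertools import product

def select_hyperparameters(grid, score_fn):
    """Return the (params, score) pair with the lowest score over a grid.

    grid: dict mapping a parameter name to a list of candidate values
    score_fn: callable taking a params dict and returning a validation
              score where lower is better (e.g. RMSE on dev_in)
    """
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid and a stand-in scoring function for illustration
grid = {"depth": [2, 4, 6], "learning_rate": [0.03, 0.1]}

def toy_score(params):
    return abs(params["depth"] - 4) + abs(params["learning_rate"] - 0.1)

best_params, best_score = select_hyperparameters(grid, toy_score)
print(best_params, best_score)
```

In practice, `score_fn` would train a model with the given parameters on `X_train` and return its error on `X_dev_in`.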

In [12]:
# Train an ensemble of models, one per random seed

import catboost

trained_models = []
for seed in range(ensemble_size):

    model = catboost.CatBoostRegressor(
        learning_rate=learning_rate,
        iterations=iterations,
        depth=depth,
        loss_function='RMSEWithUncertainty',
        eval_metric='RMSE',
        random_seed=seed)

    print(f'\n Model index: {seed}\n')

    # dev_in is passed as the evaluation set to monitor RMSE during training
    model.fit(
        X_train,
        y_train,
        verbose=100,
        eval_set=(X_dev_in, y_dev_in))

    trained_models.append(model)


 Model index: 0

0:	learn: 10.0802549	test: 10.0807415	best: 10.0807415 (0)	total: 1.39s	remaining: 4m 37s
100:	learn: 2.8927387	test: 2.8976019	best: 2.8976019 (100)	total: 2m 12s	remaining: 2m 9s
199:	learn: 2.4620033	test: 2.4656851	best: 2.4656851 (199)	total: 4m 2s	remaining: 0us

bestTest = 2.465685117
bestIteration = 199


 Model index: 1

0:	learn: 10.0797417	test: 10.0802326	best: 10.0802326 (0)	total: 1.4s	remaining: 4m 38s
100:	learn: 2.8949066	test: 2.9003376	best: 2.9003376 (100)	total: 2m 11s	remaining: 2m 8s
199:	learn: 2.4643793	test: 2.4687305	best: 2.4687305 (199)	total: 4m 19s	remaining: 0us

bestTest = 2.46873047
bestIteration = 199


 Model index: 2

0:	learn: 10.0784878	test: 10.0789484	best: 10.0789484 (0)	total: 1.54s	remaining: 5m 5s
100:	learn: 2.8949230	test: 2.9001259	best: 2.9001259 (100)	total: 3m 29s	remaining: 3m 25s
199:	learn: 2.4657170	test: 2.4698817	best: 2.4698817 (199)	total: 6m 19s	remaining: 0us

bestTest = 2.469881685
bestIteration = 199



## 3. Inference

All inference in this section is carried out on `dev`, the combined dataset of `dev_in` followed by `dev_out`.
The objective here is twofold:

1. Evaluate the ensemble of trained models to get predictions for each data point
2. Use the predictions to determine an uncertainty score for each data point using any chosen uncertainty measure

Ideally, the chosen uncertainty measure assigns greater uncertainty to data points with greater prediction errors.
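
One way to check whether an uncertainty measure has this property, once predictions and uncertainties are available, is to sort the data points by uncertainty and see how the mean error grows as less certain points are included. The sketch below is an illustrative check on synthetic numbers, not necessarily the metric used to score submissions; `retention_curve` is a hypothetical helper.

```python
import numpy as np

def retention_curve(errors, uncertainties, num_points=5):
    """Mean error over the most-certain fraction of points, for several fractions.

    errors: per-point errors, e.g. squared errors
    uncertainties: per-point uncertainty scores (higher = less certain)
    """
    errors = np.asarray(errors, dtype=float)
    order = np.argsort(uncertainties)  # most certain points first
    sorted_errors = errors[order]
    fractions = np.linspace(1.0 / num_points, 1.0, num_points)
    curve = []
    for fraction in fractions:
        keep = max(1, int(round(fraction * len(errors))))
        curve.append(sorted_errors[:keep].mean())
    return fractions, np.array(curve)

# Synthetic example in which the uncertainty tracks the error exactly
toy_errors = np.array([0.1, 0.5, 1.0, 2.0, 4.0, 8.0, 9.0, 12.0, 15.0, 20.0])
toy_uncertainties = toy_errors + 1.0
fractions, curve = retention_curve(toy_errors, toy_uncertainties)
print(fractions, curve)
```

A curve that rises with the retained fraction suggests the uncertainty ranks the points sensibly; a flat curve suggests the uncertainty carries little information about the error.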

In [13]:
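# Combine dev_in and dev_out (in that order) into one evaluation set,
# then keep only the features and extract the target.
# ignore_index=True gives the combined frame a unique row index.
df_dev = pd.concat([df_dev_in, df_dev_out], ignore_index=True)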
X_dev = df_dev.iloc[:,6:]
y_dev = df_dev['fact_temperature']

In [14]:
# Get ensemble of predictions for each data point

import numpy as np

def get_predictions(features_df, model):
    '''
    Calculates predictions on df features for specified model
    
    Return: array [num_samples x 2],
        where
            num_samples = number of rows in features_df
            2 = [mean, variance]
    
    '''
    return model.predict(features_df)


def get_all_predictions(features_df, models_list):
    '''
    Return: array [ensemble_size x num_samples x 2],
        where
            ensemble_size = number of models in models_list
            num_samples = number of rows in features_df
            2 = [mean, variance]
    '''
    all_preds = []
    for model in models_list:
        preds = np.asarray(get_predictions(features_df, model))
        all_preds.append(preds)
    return np.stack(all_preds, axis=0)


all_preds = get_all_predictions(X_dev, trained_models)

In [15]:
# Choose any uncertainty measure to calculate uncertainty scores
# This tutorial uses total variance as the uncertainty measure

def calculate_tvar(preds):
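    '''
    preds: array [ensemble_size x num_samples x 2]

    Return: array [num_samples] of total variances
    '''

    # Variance of the predicted means across the ensemble
    var_mean = np.var(preds[:, :, 0], axis=0)
    # Mean of the predicted variances across the ensemble
    mean_var = np.mean(preds[:, :, 1], axis=0)
    tvar = var_mean + mean_var
    return tvar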

uncertainties = calculate_tvar(all_preds)
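
The total variance used above adds two terms: the variance of the ensemble's predicted means (how much the models disagree) and the mean of their predicted variances (how noisy each model expects the target to be). The toy numbers below are made up to show the arithmetic on a 3-model ensemble and 2 data points.

```python
import numpy as np

# Toy ensemble output: 3 models x 2 data points x [mean, variance]
toy_preds = np.array([
    [[10.0, 1.0], [5.0, 0.5]],
    [[12.0, 2.0], [5.0, 0.5]],
    [[14.0, 3.0], [5.0, 0.5]],
])

var_of_means = np.var(toy_preds[:, :, 0], axis=0)   # disagreement between models
mean_of_vars = np.mean(toy_preds[:, :, 1], axis=0)  # average predicted variance
total_variance = var_of_means + mean_of_vars

print(var_of_means)
print(mean_of_vars)
print(total_variance)
```

For the first point the models disagree (means 10, 12, 14), giving 8/3 + 2 ≈ 4.67. For the second point they agree exactly, so only the predicted variance of 0.5 remains.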

## 4. Submission

A CSV file in a specific format must be prepared for submission. The submission must cover the combined dataset `dev`, formed by concatenating `dev_in` and then `dev_out` in that order, so that all IDs are correct at submission time.

The submitted csv file should contain the following columns:
- ID
- PRED
- UNCERTAINTY
- TARGET

In [17]:
# Prepare the ids
ids = np.arange(1, len(df_dev) + 1)

# Predictions are the mean predictions across the ensemble of models
preds = np.mean(all_preds[:,:,0], axis=0)

# Targets have already been extracted
targets = y_dev

# The uncertainties have been calculated in the previous step

# Store all the information to be submitted in a df
df_submission = pd.DataFrame(data={
        'ID' : ids,
        'PRED' : preds,
        'UNCERTAINTY' : uncertainties,
        'TARGET' : targets
        })

df_submission.head()

| | ID | PRED | UNCERTAINTY | TARGET |
|---|---|---|---|---|
| 0 | 1 | 32.687017 | 12.973079 | 35.0 |
| 1 | 2 | 7.991255 | 5.02749 | 6.0 |
| 2 | 3 | 27.329121 | 11.576975 | 29.0 |
| 3 | 4 | 24.717714 | 5.861262 | 27.0 |
| 4 | 5 | 34.800635 | 18.719108 | 37.0 |


In [None]:
# Save as csv
out_file = 'df_submission.csv'
df_submission.to_csv(out_file, index=False)
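
Before uploading, it can help to confirm that the frame matches the format described in this section. The sketch below checks only what this tutorial states (the four columns, one row per `dev` data point, and IDs running from 1 to N in order). `check_submission` is a hypothetical helper, and any further rules of the submission system are not covered.

```python
import numpy as np
import pandas as pd

EXPECTED_COLUMNS = ["ID", "PRED", "UNCERTAINTY", "TARGET"]

def check_submission(df, num_rows):
    """Return a list of problems found in a submission frame (empty if none)."""
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    if len(df) != num_rows:
        problems.append(f"{len(df)} rows, expected {num_rows}")
    if not np.array_equal(df["ID"].to_numpy(), np.arange(1, len(df) + 1)):
        problems.append("IDs are not 1..N in order")
    if df[EXPECTED_COLUMNS].isna().any().any():
        problems.append("missing values present")
    return problems

# Toy frames for illustration: one valid, one with shuffled IDs
good = pd.DataFrame({
    "ID": [1, 2, 3],
    "PRED": [32.7, 8.0, 27.3],
    "UNCERTAINTY": [13.0, 5.0, 11.6],
    "TARGET": [35.0, 6.0, 29.0],
})
bad = good.assign(ID=[2, 1, 3])

print(check_submission(good, num_rows=3))
print(check_submission(bad, num_rows=3))
```

With the real data, the call would be `check_submission(df_submission, num_rows=len(df_dev))`.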