## Open notebook in:
| Colab | Gradient |
|:------|:---------|
| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nicolepcx/transformers-the-definitive-guide/blob/master/CH04/CH04_patch_tst_hyperparameter_IBM_10_days_ahead_32_context_window.ipynb) | [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com//github.com/Nicolepcx/transformers-the-definitive-guide/blob/main/CH02/ch02_patch_tst_hyperparameter_IBM_10_days_ahead_32_context_window.ipynb) |

# About this Notebook

This notebook provides an overview of how to use the `PatchTST` model from the Hugging Face `transformers` library. The workflow involves loading and preprocessing time series data, configuring the model, performing hyperparameter tuning with Optuna, and finally training and evaluating the model.

### Steps Included:

1. **Setting Up the Environment**:  
   Import necessary libraries, suppress warnings, and set a random seed to ensure reproducibility.

2. **Loading and Preparing the Dataset**:  
   Load time series data (e.g., stock prices) into a Pandas DataFrame. The data is then split into training, validation, and test sets based on specified indices. The `TimeSeriesPreprocessor` is used to normalize and prepare the data for model training.

3. **Configuring the PatchTST Model**:  
   Define the `PatchTSTConfig` configuration for the model, specifying parameters such as the number of input channels, context length, patch length, and the model's architecture. The model is initialized using this configuration.

4. **Hyperparameter Tuning with Optuna**:  
   Use Optuna to perform a hyperparameter search, exploring different configurations of learning rate, batch size, number of epochs, and other parameters. The goal is to minimize the evaluation loss, and early stopping is implemented to prevent overfitting.

5. **Training the Model**:  
   The `Trainer` class from `transformers` is used to handle the training loop. The best hyperparameters found during the Optuna search are applied to the training arguments, and the model is trained on the time series data.

6. **Model Evaluation**:  
   After training, the model is evaluated on the validation and test datasets. The results, including evaluation metrics such as loss, are printed to assess the model's performance.

7. **Saving the Model**:  
   Finally, the trained model and its configurations are saved for future use, allowing for easy deployment and inference on new time series data.

The notebook is partially based on the example code from the original [paper's repo](https://github.com/yuqinie98/PatchTST).


In [None]:
!pip install git+https://github.com/IBM/tsfm.git -qqq


In [None]:
# Standard
import os
import numpy as np
import pandas as pd

# Third Party
from transformers import (
    EarlyStoppingCallback,
    PatchTSTConfig,
    PatchTSTForPrediction,
    set_seed,
    Trainer,
    TrainingArguments,
)
import optuna
import yfinance as yf

# First Party
from tsfm_public.toolkit.dataset import ForecastDFDataset
from tsfm_public.toolkit.time_series_preprocessor import TimeSeriesPreprocessor
from tsfm_public.toolkit.util import select_by_index

# Suppress some warnings
import warnings

warnings.filterwarnings("ignore", module="torch")

### Set seed

In [None]:
set_seed(42)

# Load and prepare datasets

In the following cells, adjust these parameters to suit your application:
- `PRETRAIN_AGAIN`: Set this to `True` to run pretraining again. This can take some time, depending on GPU availability. Otherwise, the already pretrained model is used.
- `dataset_path`: Path to a local .csv file, or the web address of a CSV file, holding the data of interest. Data is loaded with pandas, so anything supported by [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) is supported.
- `timestamp_column`: Name of the column containing timestamp information. Use `None` if there is no such column.
- `id_columns`: List of column names specifying the IDs of different time series. If no ID column exists, use `[]`.
- `forecast_columns`: List of columns to be modeled.
- `context_length`: The amount of historical data used as input to the model. Windows of length `context_length` are extracted from the input dataframe. For a dataset with multiple time series, each context window is contained within a single time series (i.e., a single ID).
- `forecast_horizon`: Number of timestamps to forecast into the future.
- `train_start_index`, `train_end_index`: The start and end indices in the loaded data that delineate the training data.
- `valid_start_index`, `valid_end_index`: The start and end indices in the loaded data that delineate the validation data.
- `test_start_index`, `test_end_index`: The start and end indices in the loaded data that delineate the test data.
- `patch_length`: The patch length for the `PatchTST` model. It is recommended to choose a value that evenly divides `context_length`.
- `num_workers`: Number of workers used by the PyTorch dataloader.
- `batch_size`: Batch size.

The data is first loaded into a pandas DataFrame (here, downloaded with `yfinance`) and split into training, validation, and test parts. The DataFrames are then converted to the torch datasets needed for training.
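
To see why a `patch_length` that evenly divides `context_length` is convenient, the number of patches per window can be worked out by hand. The helpers below are a minimal sketch rather than library code: `num_patches` and `unused_steps` are hypothetical names, and the formula assumes strided patches without padding, which is an assumption about how the Hugging Face implementation counts them.

```python
def num_patches(context_length: int, patch_length: int, patch_stride: int) -> int:
    """Count the patches cut from one context window (no padding assumed)."""
    if patch_length > context_length:
        raise ValueError("patch_length must not exceed context_length")
    return (context_length - patch_length) // patch_stride + 1


def unused_steps(context_length: int, patch_length: int, patch_stride: int) -> int:
    """Count the time steps of a window that fall outside every patch."""
    n = num_patches(context_length, patch_length, patch_stride)
    covered = (n - 1) * patch_stride + patch_length
    return context_length - covered


print(num_patches(32, 8, 8))     # settings used in this notebook
print(unused_steps(32, 8, 8))    # every time step belongs to a patch
print(unused_steps(32, 12, 12))  # a patch length that does not divide 32
```

With `context_length = 32` and `patch_length = 8`, the window splits into whole patches and no time step is left over, whereas a patch length of 12 would leave part of the window uncovered.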

In [None]:
data = yf.download("IBM", start=None, end="2024-07-01")

[*********************100%%**********************]  1 of 1 completed


In [None]:
data = data[['Adj Close']]

In [None]:
data = data.reset_index()

In [None]:
data = data.rename(columns={'Adj Close': 'close', 'Date': 'date'})
data

Unnamed: 0,date,close
0,1962-01-02,1.513321
1,1962-01-03,1.526550
2,1962-01-04,1.511336
3,1962-01-05,1.481573
4,1962-01-08,1.453794
...,...,...
15725,2024-06-24,173.492584
15726,2024-06-25,171.103500
15727,2024-06-26,170.379822
15728,2024-06-27,169.368668


In [None]:
timestamp_column = "date"
id_columns = []

context_length = 32
forecast_horizon = 10
patch_length = 8
num_workers = 16
batch_size = 128

In [None]:

forecast_columns = list(data.columns[1:])

# get split
num_train = int(len(data) * 0.7)
num_test = int(len(data) * 0.2)
num_valid = len(data) - num_train - num_test
border1s = [
    0,
    num_train - context_length,
    len(data) - num_test - context_length,
]
border2s = [num_train, num_train + num_valid, len(data)]

train_start_index = border1s[0]  # None indicates beginning of dataset
train_end_index = border2s[0]

valid_start_index = train_end_index + context_length
valid_end_index = border2s[1]

test_start_index = valid_end_index + context_length
test_end_index = border2s[2]

train_data = select_by_index(
    data,
    id_columns=id_columns,
    start_index=train_start_index,
    end_index=train_end_index,
)

valid_data = select_by_index(
    data,
    id_columns=id_columns,
    start_index=valid_start_index,
    end_index=valid_end_index,
)
test_data = select_by_index(
    data,
    id_columns=id_columns,
    start_index=test_start_index,
    end_index=test_end_index,
)

tsp = TimeSeriesPreprocessor(
    timestamp_column=timestamp_column,
    id_columns=id_columns,
    target_columns=forecast_columns,
    scaling=True,
)
tsp = tsp.train(train_data)

In [None]:
print("Training Data Range: {} to {}".format(train_start_index, train_end_index))
print("Validation Data Range: {} to {}".format(valid_start_index, valid_end_index))
print("Testing Data Range: {} to {}".format(test_start_index, test_end_index))


Training Data Range: 0 to 11011
Validation Data Range: 11043 to 12584
Testing Data Range: 12616 to 15730
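
The printed ranges can be checked by hand. The sketch below repeats the 70/10/20 split arithmetic from the cell above in a small helper (`split_ranges` is a hypothetical name), assuming the frame has 15,730 rows, as the test end index suggests. It also makes visible that `context_length` rows are skipped between consecutive splits.

```python
def split_ranges(n_rows: int, context_length: int) -> dict:
    """Reproduce the train/validation/test index ranges used in this notebook."""
    num_train = int(n_rows * 0.7)
    num_test = int(n_rows * 0.2)
    num_valid = n_rows - num_train - num_test

    train = (0, num_train)
    # Each later split starts context_length rows after the previous one ends.
    valid = (train[1] + context_length, num_train + num_valid)
    test = (valid[1] + context_length, n_rows)
    return {"train": train, "valid": valid, "test": test}


ranges = split_ranges(15730, 32)
for name, (start, end) in ranges.items():
    print(f"{name}: {start} to {end} ({end - start} rows)")
```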


In [None]:
train_dataset = ForecastDFDataset(
    tsp.preprocess(train_data),
    id_columns=id_columns,
    timestamp_column="date",
    target_columns=forecast_columns,
    context_length=context_length,
    prediction_length=forecast_horizon,
)
valid_dataset = ForecastDFDataset(
    tsp.preprocess(valid_data),
    id_columns=id_columns,
    timestamp_column="date",
    target_columns=forecast_columns,
    context_length=context_length,
    prediction_length=forecast_horizon,
)
test_dataset = ForecastDFDataset(
    tsp.preprocess(test_data),
    id_columns=id_columns,
    timestamp_column="date",
    target_columns=forecast_columns,
    context_length=context_length,
    prediction_length=forecast_horizon,
)
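
Each dataset above turns one contiguous series into many overlapping (context, target) pairs. The following is a minimal sketch of that windowing in plain Python, assuming a stride of one time step; `sliding_windows` is a hypothetical helper, and `ForecastDFDataset` may differ in its details.

```python
def sliding_windows(values, context_length, prediction_length):
    """Yield (past, future) pairs cut from one contiguous series."""
    last_start = len(values) - context_length - prediction_length
    for start in range(last_start + 1):
        split = start + context_length
        yield values[start:split], values[split:split + prediction_length]


series = list(range(50))  # toy series standing in for the scaled close prices
pairs = list(sliding_windows(series, context_length=32, prediction_length=10))

print(len(pairs))  # 50 - 32 - 10 + 1 windows
past, future = pairs[0]
print(len(past), len(future))
```

Under this assumption, a split with `n` rows yields `n - context_length - forecast_horizon + 1` samples, so short validation or test ranges lose a noticeable share of their rows to windowing.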

# Configure the Model

In [None]:
config = PatchTSTConfig(
    num_input_channels=len(forecast_columns),
    context_length=context_length,
    patch_length=patch_length,
    patch_stride=patch_length,
    prediction_length=forecast_horizon,
    random_mask_ratio=0.4,
    d_model=16,
    num_attention_heads=8,
    num_hidden_layers=3,
    ffn_dim=256,
    dropout=0.2,
    head_dropout=0.2,
    pooling_type=None,
    channel_attention=False,
    scaling="std",
    loss="mse",
    pre_norm=True,
    norm_type="batchnorm",
)
model = PatchTSTForPrediction(config)

# Configure Hyperparameter Tuning

In [None]:
def optuna_hp_space(trial: optuna.Trial):
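    return {
        # suggest_loguniform is deprecated in Optuna; suggest_float(..., log=True) is the equivalent
        "learning_rate": trial.suggest_float("learning_rate", 1e-8, 1e-2, log=True),  # Log-uniform range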
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [16, 32, 64, 128]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 50, 300, step=20),
        "dataloader_num_workers": trial.suggest_int("dataloader_num_workers", 0, 16, step=4),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.3, step=0.05),  # Granular steps for weight decay
        "per_device_eval_batch_size": trial.suggest_categorical("per_device_eval_batch_size", [16, 32, 64, 128]),
    }
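
The learning rate in the search space above is drawn on a logarithmic scale, so each order of magnitude between the bounds is equally likely. A minimal sketch of that idea using only the standard library (this is an illustration, not Optuna's sampler; `sample_log_uniform` is a hypothetical name):

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
samples = [sample_log_uniform(1e-8, 1e-2, rng) for _ in range(10_000)]

# About half of the draws land below the geometric midpoint 1e-5,
# whereas a plain uniform draw would almost never go below 1e-5.
share_below = sum(s < 1e-5 for s in samples) / len(samples)
print(f"share below 1e-5: {share_below:.2f}")
```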


In [None]:
def model_init(trial):
    return PatchTSTForPrediction(config)

In [None]:
training_args = TrainingArguments(
    output_dir="./checkpoint/output_dir",
    overwrite_output_dir=True,
    do_eval=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    save_total_limit=3,
    logging_dir="./checkpoint/logging_dir",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    num_train_epochs=200,  # Note: The actual number of epochs might be lower due to early stopping
    label_names=["future_values"],
)

trainer = Trainer(
    model=None,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    model_init=model_init,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=30, early_stopping_threshold=0.00001)]
)

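# Start the hyperparameter search.
# Note: `hp_space` is not passed here, so the Trainer uses its default Optuna search space
# (learning_rate, num_train_epochs, seed, per_device_train_batch_size), as the trial logs show.
# Pass `hp_space=optuna_hp_space` to search the custom space defined earlier instead.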
best_run = trainer.hyperparameter_search(
    backend="optuna",
    n_trials=30,
    direction="minimize",
)

print("Best run:", best_run)


[I 2024-08-22 15:24:02,407] A new study created in memory with name: no-name-b4dc2eb4-3cbb-4398-bd26-300143258197


Epoch,Training Loss,Validation Loss
1,0.0114,0.043518
2,0.0114,0.043366


[I 2024-08-22 15:25:19,595] Trial 0 finished with value: 0.04336557909846306 and parameters: {'learning_rate': 2.0841419105990525e-06, 'num_train_epochs': 2, 'seed': 28, 'per_device_train_batch_size': 4}. Best is trial 0 with value: 0.04336557909846306.


Epoch,Training Loss,Validation Loss
1,0.0112,0.040872
2,0.0102,0.038104


[I 2024-08-22 15:26:34,883] Trial 1 finished with value: 0.03810364007949829 and parameters: {'learning_rate': 7.5795804826512866e-06, 'num_train_epochs': 2, 'seed': 28, 'per_device_train_batch_size': 4}. Best is trial 1 with value: 0.03810364007949829.


Epoch,Training Loss,Validation Loss
1,0.008,0.020507
2,0.0046,0.019231
3,0.0044,0.01918


[I 2024-08-22 15:27:37,270] Trial 2 finished with value: 0.019180074334144592 and parameters: {'learning_rate': 3.26006502821966e-05, 'num_train_epochs': 3, 'seed': 12, 'per_device_train_batch_size': 8}. Best is trial 2 with value: 0.019180074334144592.


Epoch,Training Loss,Validation Loss
1,0.0115,0.043824
2,0.0115,0.043789


[I 2024-08-22 15:27:47,004] Trial 3 finished with value: 0.043788690119981766 and parameters: {'learning_rate': 3.880431812474502e-06, 'num_train_epochs': 2, 'seed': 25, 'per_device_train_batch_size': 64}. Best is trial 2 with value: 0.019180074334144592.


Epoch,Training Loss,Validation Loss
1,0.0114,0.043165
2,0.0112,0.042459
3,0.011,0.04184
4,0.0109,0.041568


[I 2024-08-22 15:28:32,385] Trial 4 finished with value: 0.041567910462617874 and parameters: {'learning_rate': 4.759212914654341e-06, 'num_train_epochs': 4, 'seed': 21, 'per_device_train_batch_size': 16}. Best is trial 2 with value: 0.019180074334144592.


Epoch,Training Loss,Validation Loss
1,0.0108,0.035279
2,0.0079,0.02838


[I 2024-08-22 15:29:47,761] Trial 5 finished with value: 0.028380457311868668 and parameters: {'learning_rate': 1.3499261817278008e-05, 'num_train_epochs': 2, 'seed': 16, 'per_device_train_batch_size': 4}. Best is trial 2 with value: 0.019180074334144592.


Epoch,Training Loss,Validation Loss
1,0.0114,0.043543


[I 2024-08-22 15:30:08,200] Trial 6 pruned. 


Epoch,Training Loss,Validation Loss
1,0.0099,0.026083
2,0.0051,0.019311
3,0.0043,0.019068
4,0.0043,0.019008


[I 2024-08-22 15:30:35,693] Trial 7 finished with value: 0.019008422270417213 and parameters: {'learning_rate': 4.436796901411967e-05, 'num_train_epochs': 4, 'seed': 37, 'per_device_train_batch_size': 32}. Best is trial 7 with value: 0.019008422270417213.


Epoch,Training Loss,Validation Loss
1,0.0114,0.043421
2,0.0113,0.043172


[I 2024-08-22 15:30:45,173] Trial 8 finished with value: 0.04317159578204155 and parameters: {'learning_rate': 1.5750998015362626e-05, 'num_train_epochs': 2, 'seed': 12, 'per_device_train_batch_size': 64}. Best is trial 7 with value: 0.019008422270417213.


Epoch,Training Loss,Validation Loss
1,0.0115,0.043835


[I 2024-08-22 15:30:52,292] Trial 9 pruned. 


Epoch,Training Loss,Validation Loss
1,0.0078,0.019751
2,0.0043,0.019081
3,0.0041,0.018641
4,0.004,0.018427
5,0.004,0.018258


[I 2024-08-22 15:31:26,559] Trial 10 finished with value: 0.018257983028888702 and parameters: {'learning_rate': 7.766903153583391e-05, 'num_train_epochs': 5, 'seed': 40, 'per_device_train_batch_size': 32}. Best is trial 10 with value: 0.018257983028888702.


Epoch,Training Loss,Validation Loss
1,0.0073,0.019648
2,0.0043,0.018939
3,0.0041,0.018534
4,0.004,0.018323
5,0.004,0.018153


[I 2024-08-22 15:32:00,901] Trial 11 finished with value: 0.018153363838791847 and parameters: {'learning_rate': 9.171180859945352e-05, 'num_train_epochs': 5, 'seed': 40, 'per_device_train_batch_size': 32}. Best is trial 11 with value: 0.018153363838791847.


Epoch,Training Loss,Validation Loss
1,0.0072,0.018692
2,0.0042,0.018301
3,0.004,0.018394
4,0.004,0.018418
5,0.004,0.018211


[I 2024-08-22 15:32:35,261] Trial 12 pruned. 


Epoch,Training Loss,Validation Loss
1,0.0073,0.019121
2,0.0042,0.019158
3,0.0041,0.018396
4,0.0041,0.018241
5,0.004,0.018279


[I 2024-08-22 15:33:09,708] Trial 13 pruned. 


Epoch,Training Loss,Validation Loss
1,0.0102,0.027885


[I 2024-08-22 15:33:16,867] Trial 14 pruned. 


Epoch,Training Loss,Validation Loss
1,0.0071,0.019704


[I 2024-08-22 15:33:28,415] Trial 15 pruned. 


Epoch,Training Loss,Validation Loss
1,0.0114,0.043599


[I 2024-08-22 15:33:35,516] Trial 16 pruned. 


Epoch,Training Loss,Validation Loss
1,0.011,0.038786


[I 2024-08-22 15:33:42,610] Trial 17 pruned. 


Epoch,Training Loss,Validation Loss
1,0.0083,0.020702


[I 2024-08-22 15:33:49,750] Trial 18 pruned. 


Epoch,Training Loss,Validation Loss
1,0.0089,0.022693


[I 2024-08-22 15:34:10,282] Trial 19 pruned. 


Epoch,Training Loss,Validation Loss
1,0.0109,0.035285
2,0.0068,0.021054
3,0.0048,0.019423
4,0.0044,0.019226
5,0.0044,0.019116


[I 2024-08-22 15:34:33,491] Trial 20 finished with value: 0.0191159900277853 and parameters: {'learning_rate': 5.62579216035008e-05, 'num_train_epochs': 5, 'seed': 24, 'per_device_train_batch_size': 64}. Best is trial 11 with value: 0.018153363838791847.


Epoch,Training Loss,Validation Loss
1,0.0101,0.027721


[I 2024-08-22 15:34:40,543] Trial 21 pruned. 


Epoch,Training Loss,Validation Loss
1,0.0109,0.037408


[I 2024-08-22 15:34:47,769] Trial 22 pruned. 


Epoch,Training Loss,Validation Loss
1,0.0085,0.019582
2,0.0045,0.018613
3,0.0042,0.018346
4,0.0041,0.01823
5,0.0041,0.018223


[I 2024-08-22 15:35:22,141] Trial 23 pruned. 


Epoch,Training Loss,Validation Loss
1,0.0074,0.019665
2,0.0043,0.018967
3,0.0041,0.018609
4,0.004,0.01852


[I 2024-08-22 15:35:49,751] Trial 24 finished with value: 0.018519515171647072 and parameters: {'learning_rate': 9.203782997917282e-05, 'num_train_epochs': 4, 'seed': 40, 'per_device_train_batch_size': 32}. Best is trial 11 with value: 0.018153363838791847.


Epoch,Training Loss,Validation Loss
1,0.0062,0.019293


[I 2024-08-22 15:36:01,345] Trial 25 pruned. 


Epoch,Training Loss,Validation Loss
1,0.0078,0.019306
2,0.0044,0.018856
3,0.0042,0.018647


[I 2024-08-22 15:36:22,137] Trial 26 pruned. 


Epoch,Training Loss,Validation Loss
1,0.0102,0.026933


[I 2024-08-22 15:36:29,266] Trial 27 pruned. 


Epoch,Training Loss,Validation Loss
1,0.0114,0.043006


[I 2024-08-22 15:36:36,345] Trial 28 pruned. 


Epoch,Training Loss,Validation Loss
1,0.0095,0.023589
2,0.0048,0.018841
3,0.0044,0.018759
4,0.0043,0.018556


[I 2024-08-22 15:39:09,388] Trial 29 finished with value: 0.018555982038378716 and parameters: {'learning_rate': 1.632402134814743e-05, 'num_train_epochs': 4, 'seed': 27, 'per_device_train_batch_size': 4}. Best is trial 11 with value: 0.018153363838791847.


Best run: BestRun(run_id='11', objective=0.018153363838791847, hyperparameters={'learning_rate': 9.171180859945352e-05, 'num_train_epochs': 5, 'seed': 40, 'per_device_train_batch_size': 32}, run_summary=None)


# Train Model on Best Hyperparameters

In [None]:

best_hyperparameters = best_run.hyperparameters

# Update training arguments with the best hyperparameters
training_args = TrainingArguments(
    output_dir="./checkpoint/output_dir",
    overwrite_output_dir=True,
    learning_rate=best_hyperparameters['learning_rate'],
    per_device_train_batch_size=int(best_hyperparameters['per_device_train_batch_size']),  # Make sure to cast to int if needed
    do_eval=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    save_total_limit=3,
    logging_dir="./checkpoint/logging_dir",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    num_train_epochs=200,  # This can be adjusted based on your previous experience
    label_names=["future_values"],
)
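
The cell above carries over the learning rate and the training batch size from `best_run`, while `seed` and `num_train_epochs` keep their own values. A hypothetical helper such as `merge_best_hyperparameters` (not part of `transformers`) makes that choice explicit. This is only a sketch: the resulting dictionary could be unpacked into `TrainingArguments(**kwargs)`, assuming every key is a valid argument name.

```python
def merge_best_hyperparameters(base_kwargs: dict, best: dict, allowed: set) -> dict:
    """Return a copy of base_kwargs overridden by the allowed keys of best."""
    merged = dict(base_kwargs)
    for name, value in best.items():
        if name in allowed:
            merged[name] = value
    return merged


# Values copied from the best run printed above.
best = {
    "learning_rate": 9.171180859945352e-05,
    "num_train_epochs": 5,
    "seed": 40,
    "per_device_train_batch_size": 32,
}
base = {"num_train_epochs": 200, "output_dir": "./checkpoint/output_dir"}

# Keep the long epoch budget (early stopping ends training) but reuse the rest.
kwargs = merge_best_hyperparameters(
    base, best, allowed={"learning_rate", "per_device_train_batch_size", "seed"}
)
print(kwargs)
```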




In [None]:
# Reinitialize the Trainer with the updated arguments
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=30, early_stopping_threshold=0.00001)]
)

# Train the model with the best hyperparameters
trainer.train()


Epoch,Training Loss,Validation Loss
1,0.0071,0.020322
2,0.0043,0.018434
3,0.0041,0.018876
4,0.0041,0.018106
5,0.0039,0.018269
6,0.004,0.018509
7,0.004,0.01804
8,0.004,0.018311
9,0.0039,0.018649
10,0.004,0.018332


TrainOutput(global_step=15092, training_loss=0.003991606852877466, metrics={'train_runtime': 304.8006, 'train_samples_per_second': 7198.148, 'train_steps_per_second': 225.065, 'total_flos': 2752990479360.0, 'train_loss': 0.003991606852877466, 'epoch': 44.0})
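
The early stopping rule configured above can be illustrated with a few lines of plain Python. This is a simplified sketch of patience-based stopping, not the `EarlyStoppingCallback` source: `early_stopping_epoch` is a hypothetical helper, and it is applied to the ten validation losses listed above.

```python
import math


def early_stopping_epoch(losses, patience, threshold):
    """Return the 1-based epoch at which training would stop, or None."""
    best = math.inf
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        # An epoch counts as an improvement only if it beats the best loss
        # by more than the threshold.
        if loss < best and abs(loss - best) > threshold:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    return None


# Validation losses of the ten epochs listed above.
val_losses = [0.020322, 0.018434, 0.018876, 0.018106, 0.018269,
              0.018509, 0.01804, 0.018311, 0.018649, 0.018332]

print(early_stopping_epoch(val_losses, patience=30, threshold=1e-5))
print(early_stopping_epoch(val_losses, patience=3, threshold=1e-5))
```

With a patience of 30, these ten epochs cannot trigger a stop, while a patience of 3 would end training at epoch 10, three epochs after the last improvement at epoch 7.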

# Display results

In [None]:
results_valid_dataset = trainer.evaluate(valid_dataset)
print("Valid Results:", results_valid_dataset)
results_test_dataset = trainer.evaluate(test_dataset)
print("Test Results:", results_test_dataset)


Valid Results: {'eval_loss': 0.017921317368745804, 'eval_runtime': 0.9947, 'eval_samples_per_second': 1507.974, 'eval_steps_per_second': 188.999, 'epoch': 44.0}
Test Results: {'eval_loss': 0.04856698215007782, 'eval_runtime': 2.5625, 'eval_samples_per_second': 1199.2, 'eval_steps_per_second': 150.241, 'epoch': 44.0}
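
Because the model was configured with `loss="mse"`, the reported `eval_loss` values are mean squared errors. Taking the square root gives an RMSE that is easier to read. The sketch below assumes the loss is computed on the values scaled by the `TimeSeriesPreprocessor`, so the RMSE is in scaled units rather than in dollars.

```python
import math

# eval_loss values copied from the results printed above (mean squared error).
valid_mse = 0.017921317368745804
test_mse = 0.04856698215007782

valid_rmse = math.sqrt(valid_mse)
test_rmse = math.sqrt(test_mse)

print(f"valid RMSE: {valid_rmse:.4f}")
print(f"test RMSE:  {test_rmse:.4f}")
print(f"test/valid MSE ratio: {test_mse / valid_mse:.2f}")
```

The test error is noticeably higher than the validation error, which is worth keeping in mind when judging how well the tuned model generalizes to the most recent part of the series.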


In [None]:
results_valid_dataset

{'eval_loss': 0.017921317368745804,
 'eval_runtime': 0.9947,
 'eval_samples_per_second': 1507.974,
 'eval_steps_per_second': 188.999,
 'epoch': 44.0}