## Gongguan YouBike Forecasting with the Temporal Fusion Transformer (TFT)

This notebook trains a TFT model on the focused **Gongguan case study** dataset. It uses the final feature-engineered data, which includes:
- Cyclical time features (sin/cos)
- Holiday and storm event flags
- Weather data (normalized)
- Static POI features (normalized)
- Behavioral station clusters (one-hot encoded)
- Historical lag and inflow/outflow features (normalized)

The workflow loads this data, configures a `TimeSeriesDataSet`, trains the model, and evaluates its performance.
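As a reminder of what the cyclical time features in the list above represent, the sketch below encodes an hour-of-day column as a sin/cos pair so that 23:00 and 00:00 end up close together. It is a minimal illustration, not the project's feature pipeline; the helper name `add_cyclical_feature` and the toy frame are assumptions.

```python
import numpy as np
import pandas as pd


def add_cyclical_feature(df, column, period):
    """Return a copy of df with <column>_sin and <column>_cos added (hypothetical helper)."""
    angle = 2 * np.pi * df[column] / period
    out = df.copy()
    out[f"{column}_sin"] = np.sin(angle)
    out[f"{column}_cos"] = np.cos(angle)
    return out


# Toy data: a few hours of the day
demo = add_cyclical_feature(pd.DataFrame({"hour": [0, 6, 12, 18, 23]}), "hour", 24)
print(demo)
```

On the resulting circle, hour 23 sits next to hour 0, which a raw 0-23 integer would not express.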

In [1]:
# Cell 1: Setup and Google Drive Access
import pandas as pd
import numpy as np
import glob
import requests
from google.colab import drive

# Mount your Google Drive
drive.mount('/content/drive')

print("✅ Google Drive mounted successfully.")

MessageError: Error: credential propagation was unsuccessful

In [2]:
# Install required packages
!pip install lightning pytorch-forecasting -q

In [3]:
import pandas as pd
import numpy as np
import lightning.pytorch as pl
from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint
from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer, QuantileLoss
from pytorch_forecasting.data import GroupNormalizer
from lightning.pytorch.loggers import TensorBoardLogger
import torch
import warnings

warnings.filterwarnings("ignore")
pl.seed_everything(42)

# Trade a little float32 matmul precision for speed so Tensor Cores can be used (e.g., on an A100)
torch.set_float32_matmul_precision('high')

INFO: Seed set to 42


42

In [None]:
# --- NEW CELL: Create a small sample for fast prototyping ---
print("--- Creating a small data sample for a quick test run ---")

# Load the full Gongguan dataset
full_data_path = "/content/drive/MyDrive/Youbike_Master_Project/YouBike_Demand_Forecast/data/gongguan_model_ready_features.parquet.gz"
temp_df = pd.read_parquet(full_data_path)

# Select a small number of stations for the test run (e.g., 10 stations)
sample_station_ids = temp_df['sno'].unique()[:10]
data = temp_df[temp_df['sno'].isin(sample_station_ids)].copy()

print(f"Created a sample with {len(sample_station_ids)} stations.")
print(f"New dataset size: {len(data)} rows")

# Free the memory held by the full DataFrame
del temp_df

### 1. Load the Pre-processed Gongguan Dataset

We load the `gongguan_model_ready_features.parquet.gz` file. This is the final output of our master feature engineering pipeline.

In [4]:
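# Full-dataset load, left disabled while the sampled `data` from the prototyping cell above is used.
# Uncomment both lines (and skip the sampling cell) to train on all stations.
#file_path = "/content/drive/MyDrive/Youbike_Master_Project/YouBike_Demand_Forecast/data/gongguan_model_ready_features.parquet.gz"
#data = pd.read_parquet(file_path)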

# Ensure data types are correct for the library
data['occupancy_ratio'] = data['occupancy_ratio'].astype(np.float32)
# Convert boolean one-hot columns to integers
for col in data.columns:
    if data[col].dtype == 'bool':
        data[col] = data[col].astype(int)

print("Gongguan dataset loaded successfully.")
data.info()

Gongguan dataset loaded successfully.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6274800 entries, 0 to 6274799
Data columns (total 49 columns):
 #   Column                    Dtype         
---  ------                    -----         
 0   sno                       int64         
 1   time                      datetime64[ns]
 2   act                       int64         
 3   hour                      float64       
 4   is_weekday                int64         
 5   lat                       float64       
 6   lng                       float64       
 7   occupancy_ratio           float32       
 8   Temperature               float64       
 9   Dew Point                 float64       
 10  Humidity                  float64       
 11  Speed                     float64       
 12  Pressure                  float64       
 13  station_lat               float64       
 14  station_lng               float64       
 15  poi_name                  float64       
 16  poi_lat         

### 2. Create a Time Index

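The `pytorch-forecasting` library requires an integer time index that increases by one per time step. We create it by sorting each station's rows chronologically and assigning a cumulative count, which assumes every station has exactly one row per time step.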

In [6]:
data.sort_values(['sno', 'time'], inplace=True)
# The time index must be an integer
data['time_idx'] = data.groupby('sno').cumcount()

print("Time index created.")

Time index created.
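A cumulative count is gap-free by construction, so a missing reading silently shifts every later step of that station. If gaps are possible, one alternative is to derive the index from the timestamps themselves. The sketch below assumes a 10-minute sampling interval (suggested by the six steps per hour used later in this notebook); the helper name `timestamp_based_index` and the toy data are hypothetical.

```python
import pandas as pd


def timestamp_based_index(times: pd.Series, freq: str = "10min") -> pd.Series:
    """Map timestamps to integer steps counted from the earliest timestamp."""
    step = pd.Timedelta(freq)  # assumed sampling interval
    return ((times - times.min()) // step).astype("int64")


# Toy station with one missing reading at 00:20
toy = pd.DataFrame({
    "sno": [1, 1, 1],
    "time": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:10", "2024-01-01 00:30"]),
})
toy["cumcount_idx"] = toy.groupby("sno").cumcount()      # ignores the gap
toy["time_idx"] = timestamp_based_index(toy["time"])     # preserves the gap
print(toy)
```

With an index like this, `TimeSeriesDataSet` would likely also need `allow_missing_timesteps=True`; whether the Gongguan data has gaps at all is not something this notebook shows.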


### 3. Define Features and Create the `TimeSeriesDataSet`

This is the most critical step. We programmatically identify all our engineered features and assign them to the correct category for the TFT model.

-   `max_encoder_length`: How much history the model sees for each forecast (here 24 hours, i.e. 144 steps).
-   `max_prediction_length`: How far ahead the model predicts (here 6 hours, i.e. 36 steps).
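To make the two window lengths concrete, the sketch below slices one encoder/decoder pair out of a plain array. It assumes six observations per hour and uses a hypothetical helper, `split_window`; it only illustrates the sliding-window idea and is not how `TimeSeriesDataSet` is implemented internally.

```python
import numpy as np

STEPS_PER_HOUR = 6  # assumption: one observation every 10 minutes


def split_window(series, end_of_encoder, encoder_length, prediction_length):
    """Return (history, future) slices around the position end_of_encoder."""
    start = end_of_encoder - encoder_length
    encoder = series[start:end_of_encoder]
    decoder = series[end_of_encoder:end_of_encoder + prediction_length]
    return encoder, decoder


series = np.arange(1000)  # stand-in for one station's time steps
enc, dec = split_window(
    series,
    end_of_encoder=500,
    encoder_length=24 * STEPS_PER_HOUR,    # 144 steps of history
    prediction_length=6 * STEPS_PER_HOUR,  # 36 steps to forecast
)
print(len(enc), len(dec))
```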

In [7]:
# Define sliding window parameters
max_prediction_length = 6 * 6 # Predict 6 hours ahead (36 steps)
max_encoder_length = 24 * 6 # Use 24 hours of history (144 steps)

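# Define the training cutoff: hold out the last 30 days (30 * 24 * 6 steps) of each series.
# Note: the validation set below is built with predict=True, so it evaluates only the
# final prediction window of each station, not the whole held-out month.
training_cutoff = data["time_idx"].max() - (30 * 24 * 6)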

# --- Automatically identify feature columns (CORRECTED LOGIC) ---
target = 'occupancy_ratio'
group_ids = ['sno']

# Static features do not change over time for a given station
# One-hot encoded cluster features are now treated as static 'reals'
static_reals = [col for col in data.columns if col.startswith('poi_') or col in ['lat', 'lng']] + \
               [col for col in data.columns if col.startswith('cluster_is_')]
# This list is now empty, which is correct.
static_categoricals = []

# Time-varying features that are known in the future
# One-hot encoded day features are already correctly placed here.
time_varying_known_reals = [
    'hour_sin', 'hour_cos', 'day_of_week_sin', 'day_of_week_cos', 'month_sin', 'month_cos',
    'is_major_storm_day'
] + [col for col in data.columns if col.startswith('day_is_')]

# Time-varying features that are not known in the future:
# every numeric column that has not already been assigned a role above
excluded_cols = set(
    static_reals + time_varying_known_reals + group_ids + static_categoricals + [target, 'act', 'time_idx']
)
time_varying_unknown_reals = [
    col for col in data.select_dtypes(include=np.number).columns if col not in excluded_cols
]

# --- Create the TimeSeriesDataSet ---
training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target=target,
    group_ids=group_ids,
    max_encoder_length=max_encoder_length,
    max_prediction_length=max_prediction_length,
    static_categoricals=static_categoricals, # Now correctly empty
    static_reals=static_reals,             # Now includes cluster features
    time_varying_known_reals=time_varying_known_reals,
    time_varying_unknown_reals=time_varying_unknown_reals,
    target_normalizer=GroupNormalizer(groups=group_ids, transformation="softplus"),
    add_relative_time_idx=True,
    add_target_scales=True,
    add_encoder_length=True,
)

validation = TimeSeriesDataSet.from_dataset(training, data, predict=True, stop_randomization=True)

# Create dataloaders
batch_size = 1024
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=8, pin_memory=True)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size * 10, num_workers=8, pin_memory=True)

print("TimeSeriesDataSets and DataLoaders created successfully.")

TimeSeriesDataSets and DataLoaders created successfully.
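Because the feature roles are assigned by column-name patterns, a quick sanity check can catch a column that ends up in the wrong bucket. The sketch below flags columns declared static that actually vary within a station. The helper name `non_constant_static_columns` and the toy frame are assumptions; on the real data it would be called with `data`, `'sno'`, and `static_reals`.

```python
import pandas as pd


def non_constant_static_columns(df, group_col, static_cols):
    """Return the columns in static_cols that take more than one value within any group."""
    distinct = df.groupby(group_col)[static_cols].nunique(dropna=False)
    return [col for col in static_cols if (distinct[col] > 1).any()]


# Toy data: 'lat' is constant per station, 'Temperature' is not
toy = pd.DataFrame({
    "sno": [1, 1, 2, 2],
    "lat": [25.01, 25.01, 25.02, 25.02],
    "Temperature": [20.5, 21.0, 20.5, 21.0],
})
bad = non_constant_static_columns(toy, "sno", ["lat", "Temperature"])
print(bad)
```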


### 4. Configure and Train the TFT Model

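With the data prepared, we can now define and train the model. An `EarlyStopping` callback halts training once `val_loss` stops improving, which limits overfitting. In the quick-run configuration below (`max_epochs=2`, `patience=5`), the patience limit cannot be reached, so the callback only matters for longer runs.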
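The patience rule behind early stopping can be sketched in a few lines of plain Python. This is a simplified illustration of the idea for a loss that should decrease, not Lightning's actual implementation; the function name `first_stopping_epoch` and the loss values are made up.

```python
def first_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return the epoch index at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0  # consecutive validation checks without a sufficient improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Illustrative validation losses: the best value is reached at epoch 1
losses = [1.0, 0.9, 0.95, 0.91, 0.92]
stop_at = first_stopping_epoch(losses, patience=3, min_delta=1e-4)
print(stop_at)
```

With only two validation checks and `patience=5`, the same function returns `None`.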

In [None]:
# Configure the trainer
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=5, verbose=False, mode="min")
logger = TensorBoardLogger("lightning_logs")

trainer = pl.Trainer(
    max_epochs=2,
    accelerator="gpu" if torch.cuda.is_available() else "cpu",
    devices=1,
    enable_model_summary=True,
    gradient_clip_val=0.1,
    callbacks=[early_stop_callback],
    logger=logger,
    # --- Settings for a fast, partial run ---
    limit_train_batches=0.02,  # Use only 2% of the training batches per epoch
    limit_val_batches=1        # Use only 1 validation batch (an int counts batches, a float is a fraction)
)

# Define the TFT model
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.03,
    hidden_size=64, # Increased hidden size for more complex patterns
    attention_head_size=4,
    dropout=0.1,
    hidden_continuous_size=32, # Increased for more features
    output_size=7,  # To predict 7 quantiles
    loss=QuantileLoss(),
    log_interval=10,
    reduce_on_plateau_patience=4,
)

print(f"Number of parameters in network: {tft.size()/1e3:.1f}k")

# Start training
print("Starting model training...")
trainer.fit(
    tft,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader,
)

INFO: 💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO: `Trainer(limit_val_batches=1)` was configured so 1 batch will be used.

Number of parameters in network: 2482.4k
Starting model training...


INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
   | Name                               | Type                            | Params | Mode 
------------------------------------------------------------------------------------------------
0  | loss                               | QuantileLoss                    | 0      | train
1  | logging_metrics                    | ModuleList                      | 0      | train
2  | input_embeddings                   | MultiEmbedding                  | 0      | train
3  | prescalers                         | ModuleDict                      | 6.3 K  | train
4  | static_variable_selection          | VariableSelectionNetwork        | 291 K  | train
5  | encoder_variable_selection         | VariableSelectionNetwork        | 1.1 M  | train
6  | decoder_variable_selection         | VariableSelectionNetwork        | 292 K  | train
7  | static_context_variable_selectio


### 5. Evaluate and Visualize Predictions

After training, we load the best-performing checkpoint and plot its predictions on the validation set for a qualitative assessment.
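The model is trained with `QuantileLoss` and `output_size=7`, so each forecast step is a set of seven quantiles rather than a single value, and the plots show these as bands around the median. The NumPy sketch below shows the standard pinball (quantile) loss for one quantile. It illustrates the idea only; the library's own loss may differ in scaling and aggregation, and the function name and numbers are made up.

```python
import numpy as np


def pinball_loss(y_true, y_pred, quantile):
    """Mean pinball loss of y_pred as an estimate of the given quantile of y_true."""
    error = y_true - y_pred
    return np.mean(np.maximum(quantile * error, (quantile - 1) * error))


y_true = np.array([0.5, 0.5])
under = pinball_loss(y_true, np.array([0.4, 0.4]), quantile=0.9)  # forecast too low
over = pinball_loss(y_true, np.array([0.6, 0.6]), quantile=0.9)   # forecast too high
print(under, over)
```

For a high quantile such as 0.9, under-predicting is penalized more heavily than over-predicting by the same amount, which is what pushes that output above most observed values.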

In [None]:
import pandas as pd
import numpy as np
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from pytorch_forecasting import TemporalFusionTransformer
from pytorch_forecasting.metrics.point import MAE
import os

# Load the best model from the checkpoint
#best_model_path = trainer.checkpoint_callback.best_model_path
#best_tft = TemporalFusionTransformer.load_from_checkpoint(best_model_path)

# Make predictions on the validation set
#raw_predictions, x = best_tft.predict(val_dataloader, mode="raw", return_x=True)

# Visualize some examples
#print("Plotting predictions for 5 random examples from the validation set...")
#for i in range(5):
#    fig, ax = plt.subplots(figsize=(10, 5))
#    best_tft.plot_prediction(x, raw_predictions, idx=i, ax=ax, add_loss_to_title=True)
#    plt.show()


print("--- Advanced TFT Model Evaluation ---")

# --- 0. Setup ---
# This assumes your 'trainer' and 'val_dataloader' are already defined and the model has been trained.
# Create a directory to save the performance plots
output_dir = "performance/tft/"
os.makedirs(output_dir, exist_ok=True)
print(f"Plots will be saved to the '{output_dir}' directory.")


# --- 1. Load the Best Model ---
print("\nStep 1: Loading the best model from checkpoint...")
best_model_path = trainer.checkpoint_callback.best_model_path
best_tft = TemporalFusionTransformer.load_from_checkpoint(best_model_path)
print("✅ Best model loaded.")


# --- 2. Make Predictions on the Validation Set ---
print("\nStep 2: Making predictions on the validation data...")
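# pytorch-forecasting >= 1.0 returns a Prediction object; pull out the fields used below
raw_output = best_tft.predict(val_dataloader, mode="raw", return_x=True)
raw_predictions, x = raw_output.output, raw_output.x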
print("✅ Predictions complete.")


# --- 3. Visualize and Save Example Predictions ---
print("\nStep 3: Visualizing and saving the first 10 prediction examples...")
for idx in range(10):
    fig, ax = plt.subplots(figsize=(10, 5))
    best_tft.plot_prediction(x, raw_predictions, idx=idx, add_loss_to_title=True, ax=ax)
    fig.savefig(f"{output_dir}prediction_example_{idx}.jpg")
    plt.close(fig) # Close the figure to avoid displaying it in the notebook
print("✅ Prediction examples saved.")


# --- 4. Error Analysis: Find and Save Worst Predictions ---
print("\nStep 4: Finding, visualizing, and saving the 10 worst predictions...")
# Calculate the Mean Absolute Error (MAE) for each prediction sequence
predictions = best_tft.predict(val_dataloader, return_y=True)
mean_losses = MAE(reduction="none")(predictions.output, predictions.y).mean(1)
# Get the indices of the 10 worst predictions (highest error)
indices = mean_losses.argsort(descending=True)

for i in range(10):
    idx = indices[i]
    fig, ax = plt.subplots(figsize=(10, 5))
    best_tft.plot_prediction(
        x,
        raw_predictions,
        idx=idx,
        add_loss_to_title=MAE(quantiles=best_tft.loss.quantiles),
        ax=ax
    )
    fig.savefig(f"{output_dir}worst_prediction_{i}.jpg")
    plt.close(fig)
print("✅ Worst prediction examples saved.")
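The ranking used in Step 4 comes down to a per-sequence mean absolute error followed by a descending sort. The NumPy sketch below reproduces that logic on toy arrays of shape `(n_sequences, horizon)`. The function name `rank_worst_sequences` and the values are illustrative, and it uses plain arrays rather than the library's metric classes.

```python
import numpy as np


def rank_worst_sequences(y_pred, y_true, top_k):
    """Return indices of the top_k sequences with the highest MAE, plus all MAE values."""
    per_sequence_mae = np.abs(y_pred - y_true).mean(axis=1)  # average over the horizon
    order = np.argsort(-per_sequence_mae)                    # highest error first
    return order[:top_k], per_sequence_mae


# Three toy sequences with a horizon of three steps
y_true = np.array([[0.2, 0.4, 0.6], [0.5, 0.5, 0.5], [0.9, 0.8, 0.7]])
y_pred = np.array([[0.2, 0.5, 0.6], [0.1, 0.1, 0.1], [0.9, 0.8, 0.5]])
worst, mae = rank_worst_sequences(y_pred, y_true, top_k=2)
print(worst, mae)
```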




### 6. Interpret Model Behavior

Finally, we can use TFT's built-in interpretability to see which features the model found most useful for its predictions.

In [None]:
# --- 5. Interpretability: Save Feature Importance ---
print("\nStep 5: Calculating and saving feature importance plots...")
interpretation = best_tft.interpret_output(raw_predictions, reduction="sum")
figs = best_tft.plot_interpretation(interpretation)

for key, fig in figs.items():
    fig.savefig(f"{output_dir}importance_{key}.jpg")
    plt.close(fig)
print("✅ Feature importance plots saved.")


# --- 6. Advanced Interpretation: Prediction vs. Actuals by Feature ---
print("\nStep 6: Calculating and saving prediction vs. actuals plots by feature...")
# This can be computationally intensive
predictions_vs_actuals = best_tft.calculate_prediction_actual_by_variable(x, raw_predictions)

# Get all unique features from the results
all_features = list(predictions_vs_actuals['support'].keys())

for feature in all_features:
    try:
        fig = best_tft.plot_prediction_actual_by_variable(predictions_vs_actuals, name=feature)
        fig.savefig(f"{output_dir}pred_vs_actual_{feature}.jpg")
        plt.close(fig)
    except Exception as e:
        print(f"Could not plot for feature '{feature}': {e}")
print("✅ Prediction vs. actuals plots saved.")

print("\n--- Evaluation Complete ---")