# Motion Prediction Baseline — **Training with PyTorch & L5Kit**
_Aim:_ train a simple bird's-eye-view (BEV) baseline on the **Lyft Level-5 Motion Prediction** data.

_Author:_ Kevin Molloy.

**What you’ll see**
- BEV raster inputs → ResNet backbone → future (x, y) trajectories.
- Minimal, readable training loop + periodic checkpoints.

## Notes (Run First)
- **Enable a GPU** (P100) before training; the 25,000-step run recorded in this notebook took about 5 hours on GPU.
- Make sure the **Lyft dataset** and any **L5Kit utility script** are added to the notebook.
- Parts of the code follow Lyft’s official example and the baseline inference notebook.

## Imports & Setup
This notebook uses **PyTorch** + **L5Kit** for data loading and rasterisation. 

In [1]:
import numpy as np
import os
import torch

from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision.models.resnet import resnet18
from tqdm import tqdm
from typing import Dict

from l5kit.data import LocalDataManager, ChunkedDataset
from l5kit.dataset import AgentDataset, EgoDataset
from l5kit.rasterization import build_rasterizer

In [2]:
DIR_INPUT = "/kaggle/input/lyft-motion-prediction-autonomous-vehicles"

SINGLE_MODE_SUBMISSION = f"{DIR_INPUT}/single_mode_sample_submission.csv"
MULTI_MODE_SUBMISSION = f"{DIR_INPUT}/multi_mode_sample_submission.csv"

DEBUG = True

## Configuration (cfg)
Central config for:
- **model_params**: backbone, history/future frame counts.
- **raster_params**: `raster_size`, `pixel_size`, `ego_center`, `map_type`.
- **train_data_loader**: zarr key, batch size, workers, shuffle.
- **train_params**: total steps and checkpoint interval.

Tip: keep `DEBUG=True` for a quick smoke test; switch off to train fully.

In [3]:
cfg = {
    'format_version': 4,
    'model_params': {
        'model_architecture': 'resnet50',  # not read by LyftModel below, which hardcodes resnet18
        'history_num_frames': 10,
        'history_step_size': 1,
        'history_delta_time': 0.1,
        'future_num_frames': 50,
        'future_step_size': 1,
        'future_delta_time': 0.1
    },
    
    'raster_params': {
        'raster_size': [224, 224],
        'pixel_size': [0.5, 0.5],
        'ego_center': [0.25, 0.5],
        'map_type': 'py_semantic',
        'satellite_map_key': 'aerial_map/aerial_map.png',
        'semantic_map_key': 'semantic_map/semantic_map.pb',
        'dataset_meta_key': 'meta.json',
        'filter_agents_threshold': 0.5
    },
    
    'train_data_loader': {
        'key': 'scenes/train.zarr',
        'batch_size': 12,
        'shuffle': True,
        'num_workers': 4
    },
    
    'train_params': {
        'max_num_steps': 100 if DEBUG else 10000,
        'checkpoint_every_n_steps': 5000,
        
        # 'eval_every_n_steps': -1
    }
}

# silence the L5Kit warning
cfg['raster_params'].setdefault('disable_traffic_light_faces', False)

False
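
The raster and frame settings translate into physical distances and time horizons. The sketch below works through that arithmetic with the values from `cfg`. It assumes `ego_center` is the fractional position of the agent inside the raster and that the agent faces the +x direction of the image, so treat the ahead/behind split as an interpretation, not something this notebook verifies.

```python
# Values copied from cfg; the reading of ego_center is an assumption (see above).
raster_size = (224, 224)   # pixels (width, height)
pixel_size = (0.5, 0.5)    # metres per pixel
ego_center = (0.25, 0.5)   # fractional position of the agent in the raster

history_num_frames = 10
future_num_frames = 50
delta_time = 0.1           # seconds between frames (10 Hz)

# Ground area covered by one raster.
coverage_m = (raster_size[0] * pixel_size[0], raster_size[1] * pixel_size[1])

# Split of the x extent around the agent, under the ego_center assumption.
behind_m = coverage_m[0] * ego_center[0]
ahead_m = coverage_m[0] - behind_m

# Time spans of the observed history and the predicted future.
history_s = history_num_frames * delta_time
future_s = future_num_frames * delta_time

print(f"raster covers {coverage_m[0]:.0f} m x {coverage_m[1]:.0f} m")
print(f"{ahead_m:.0f} m ahead, {behind_m:.0f} m behind the agent")
print(f"history: {history_s:.1f} s, prediction horizon: {future_s:.1f} s")
```

Halving `pixel_size` at the same `raster_size` would halve the covered distance in each direction while doubling the spatial resolution.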

## Data Root & LocalDataManager
Set `L5KIT_DATA_FOLDER` and create a `LocalDataManager` so file keys like `scenes/train.zarr` resolve correctly.

In [4]:
# set env variable for data
os.environ["L5KIT_DATA_FOLDER"] = DIR_INPUT
dm = LocalDataManager(None)
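
Conceptually, resolving a key means joining it onto the data root and checking that the result exists. The helper below is a hypothetical stand-in that illustrates this idea; `resolve_key` is not part of L5Kit and does not reproduce its implementation.

```python
import os
import tempfile
from pathlib import Path


def resolve_key(key, root=None):
    """Hypothetical helper: map a relative key to a path under the data root."""
    root = Path(root if root is not None else os.environ["L5KIT_DATA_FOLDER"])
    path = root / key
    if not path.exists():
        raise FileNotFoundError(f"{key} not found under {root}")
    return str(path)


# Demo against a throwaway directory instead of the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "scenes" / "train.zarr").mkdir(parents=True)

    found_name = Path(resolve_key("scenes/train.zarr", root=tmp)).name

    try:
        resolve_key("scenes/missing.zarr", root=tmp)
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True

print(found_name, missing_raised)
```

Failing early on a missing key is the useful part: a wrong `DIR_INPUT` surfaces here and not deep inside the dataset code.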

## Dataset & DataLoader
Pipeline:
1) Build **rasteriser** from `cfg`.  
2) Open the **train** zarr.  
3) Create **AgentDataset** (per-agent BEV stacks).  
4) Wrap with **DataLoader** (shuffle/batch/workers from `cfg`).  

Run `print(train_dataset)` to verify sizes and that the dataset is wired correctly.

In [5]:
# ===== INIT DATASET
train_cfg = cfg["train_data_loader"]

# Rasterizer
rasterizer = build_rasterizer(cfg, dm)

# Train dataset/dataloader
train_zarr = ChunkedDataset(dm.require(train_cfg["key"])).open()
train_dataset = AgentDataset(cfg, train_zarr, rasterizer)
train_dataloader = DataLoader(train_dataset,
                              shuffle=train_cfg["shuffle"],
                              batch_size=train_cfg["batch_size"],
                              num_workers=train_cfg["num_workers"])

print(train_dataset)

+------------+------------+------------+---------------+-----------------+----------------------+----------------------+----------------------+---------------------+
| Num Scenes | Num Frames | Num Agents | Num TR lights | Total Time (hr) | Avg Frames per Scene | Avg Agents per Frame | Avg Scene Time (sec) | Avg Frame frequency |
+------------+------------+------------+---------------+-----------------+----------------------+----------------------+----------------------+---------------------+
|   16265    |  4039527   | 320124624  |    38735988   |      112.19     |        248.36        |        79.25         |        24.83         |        10.00        |
+------------+------------+------------+---------------+-----------------+----------------------+----------------------+----------------------+---------------------+


## Model
Architecture:
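- **ResNet-18 backbone** (ImageNet-pretrained).
- First conv replaced with a new, randomly initialised layer that accepts **map + history channels** (25 with this config), so only the later layers keep their pretrained weights.
- Head projects the 512 backbone features to **2 × future_num_frames** outputs (x, y per step).

Output shape before loss: `(batch_size, future_num_frames*2)`; the training loop reshapes it to match the targets, `(batch_size, future_num_frames, 2)`.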

In [6]:
class LyftModel(nn.Module):
    
    def __init__(self, cfg: Dict):
        super().__init__()
        
        self.backbone = resnet18(pretrained=True, progress=True)
        
        num_history_channels = (cfg["model_params"]["history_num_frames"] + 1) * 2
        num_in_channels = 3 + num_history_channels

        self.backbone.conv1 = nn.Conv2d(
            num_in_channels,
            self.backbone.conv1.out_channels,
            kernel_size=self.backbone.conv1.kernel_size,
            stride=self.backbone.conv1.stride,
            padding=self.backbone.conv1.padding,
            bias=False,
        )
        
        # Feature width after global average pooling:
        # 512 for resnet18 and resnet34, 2048 for the deeper resnets.
        backbone_out_features = 512

        # X, Y coords for the future positions (output shape: Bx50x2)
        num_targets = 2 * cfg["model_params"]["future_num_frames"]

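        # You can add more layers here. Note: with no activation between `head`
        # and `logit`, the two Linear layers compose into a single linear map.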
        self.head = nn.Sequential(
            # nn.Dropout(0.2),
            nn.Linear(in_features=backbone_out_features, out_features=4096),
        )

        self.logit = nn.Linear(4096, out_features=num_targets)
        
    def forward(self, x):
        x = self.backbone.conv1(x)
        x = self.backbone.bn1(x)
        x = self.backbone.relu(x)
        x = self.backbone.maxpool(x)

        x = self.backbone.layer1(x)
        x = self.backbone.layer2(x)
        x = self.backbone.layer3(x)
        x = self.backbone.layer4(x)

        x = self.backbone.avgpool(x)
        x = torch.flatten(x, 1)
        
        x = self.head(x)
        x = self.logit(x)
        
        return x
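
The channel and output sizes used by `LyftModel` follow directly from the config. The sketch below reproduces that arithmetic and counts the parameters of the layers the model adds or replaces. The explanation of why history contributes two channels per frame (one agent layer and one ego layer) is an assumption about the rasteriser, not something shown in this notebook.

```python
# Values from cfg["model_params"]; layer sizes follow the LyftModel definition.
history_num_frames = 10
future_num_frames = 50

# Assumption: one agent layer and one ego layer for the current frame plus each
# history frame, and 3 RGB channels for the semantic map.
num_history_channels = (history_num_frames + 1) * 2
num_in_channels = 3 + num_history_channels
num_targets = 2 * future_num_frames

# Parameter counts for the layers this model adds or replaces.
conv1_params = 64 * num_in_channels * 7 * 7      # 64 filters, 7x7 kernel, no bias
head_params = 512 * 4096 + 4096                  # Linear(512, 4096): weights + bias
logit_params = 4096 * num_targets + num_targets  # Linear(4096, 100): weights + bias

print(num_in_channels, num_targets)              # 25 100
print(conv1_params, head_params, logit_params)   # 78400 2101248 409700
```

Most of the added parameters sit in the 4096-wide head, so changing `history_num_frames` has a comparatively small effect on model size.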

## Initialisation
- Move model to **CUDA** if available.
- Optimiser: **Adam (lr=1e-3)**.
- Loss: `MSELoss(reduction="none")` so we can mask invalid steps with `target_availabilities`.

Sanity check: print the device; you should see `cuda:0` on a GPU session.

In [7]:
# ==== INIT MODEL
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = LyftModel(cfg)
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# reduction="none" keeps per-element losses so invalid steps can be masked later.
criterion = nn.MSELoss(reduction="none")

Downloading: "https://download.pytorch.org/models/resnet18-5c106cde.pth" to /root/.cache/torch/checkpoints/resnet18-5c106cde.pth


In [8]:
device

device(type='cuda', index=0)
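
To see what masking with `target_availabilities` does to the loss value, here is a framework-free sketch of the arithmetic for a single trajectory. The function `masked_mse` and the toy numbers are illustrative only. The `normalise_by_valid` option is an alternative shown for comparison and is not what this notebook does.

```python
def masked_mse(pred, target, avail, normalise_by_valid=False):
    """Masked squared error for one trajectory.

    pred, target: lists of (x, y) per future step; avail: list of 0/1 per step.
    """
    total = 0.0
    for (px, py), (tx, ty), a in zip(pred, target, avail):
        # Unavailable steps are multiplied by 0 and add nothing to the sum.
        total += a * ((px - tx) ** 2 + (py - ty) ** 2)

    if normalise_by_valid:
        denom = 2 * sum(avail)   # count only coordinates of valid steps
    else:
        denom = 2 * len(avail)   # count every coordinate, as a plain mean does
    return total / denom if denom else 0.0


pred = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
target = [(0.0, 0.0)] * 4
avail = [1, 1, 0, 0]             # last two steps have no ground truth

mean_over_all = masked_mse(pred, target, avail)
mean_over_valid = masked_mse(pred, target, avail, normalise_by_valid=True)
print(mean_over_all, mean_over_valid)   # 0.625 1.25
```

Averaging over all elements makes the reported loss smaller for batches with many unavailable steps, which matters when comparing loss values across runs or datasets.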

## Training
Loop outline per step:
1) Get next batch; re-create iterator on `StopIteration`.
2) **Forward**: model predicts future `(x, y)`; reshape to match targets.
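3) **Loss**: element-wise MSE → multiply by `target_availabilities` → mean over all elements (masked steps count as zeros in the average).
4) **Backward**: zero grad → backward → step.
5) Every `checkpoint_every_n_steps` steps, save a checkpoint (skipped when `DEBUG` is `True`).

The progress bar shows the current loss and its mean over the last 100 steps, which is less noisy than the per-batch value.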
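
The "re-create the iterator on `StopIteration`" step can also be wrapped in a small generator. This is a minimal sketch, not the notebook's code; `cycle_batches` is a hypothetical name, and a plain list stands in for the `DataLoader`.

```python
def cycle_batches(loader):
    """Yield batches forever, restarting the loader whenever it is exhausted.

    Assumes the loader is non-empty. Each pass calls iter(loader) again, so a
    shuffling loader draws a fresh order every epoch.
    """
    while True:
        for batch in loader:
            yield batch


# A list stands in for the DataLoader in this demo.
batches = cycle_batches([0, 1, 2])
first_seven = [next(batches) for _ in range(7)]
print(first_seven)   # [0, 1, 2, 0, 1, 2, 0]
```

Unlike `itertools.cycle`, which replays a cached copy of the first pass, this re-iterates the loader itself, so it neither stores batches in memory nor repeats the same order.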

In [9]:
# === TRAINING LENGTH & CHECKPOINTS
DEBUG = False  # enable saving

# training length; overrides the value set in cfg above
cfg['train_params']['max_num_steps'] = 25000      # solid baseline run

# how often to checkpoint during training
cfg['train_params']['checkpoint_every_n_steps'] = 1000

print("DEBUG:", DEBUG,
      "| max_num_steps:", cfg['train_params']['max_num_steps'],
      "| ckpt_every:", cfg['train_params']['checkpoint_every_n_steps'])

DEBUG: False | max_num_steps: 25000 | ckpt_every: 1000
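
With these settings, the save condition used in the training loop, `(itr+1) % checkpoint_every_n_steps == 0`, determines which checkpoint files appear. The sketch below enumerates them without training anything. It mirrors the loop's file naming, which uses the zero-based step index `itr`.

```python
max_num_steps = 25000
checkpoint_every = 1000
batch_size = 12          # from cfg["train_data_loader"]["batch_size"]

# Same condition and file name pattern as the training loop.
checkpoint_files = [
    f"model_state_{itr}.pth"
    for itr in range(max_num_steps)
    if (itr + 1) % checkpoint_every == 0
]

# Total number of agent samples drawn over the whole run.
samples_seen = max_num_steps * batch_size

print(len(checkpoint_files))                       # 25
print(checkpoint_files[0], checkpoint_files[-1])   # model_state_999.pth model_state_24999.pth
print(samples_seen)                                # 300000
```

Because the index is zero-based, the checkpoint written after 1,000 completed steps is named `model_state_999.pth`.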


In [10]:
# ==== TRAIN LOOP
tr_it = iter(train_dataloader)

progress_bar = tqdm(range(cfg["train_params"]["max_num_steps"]))
losses_train = []

for itr in progress_bar:

    try:
        data = next(tr_it)
    except StopIteration:
        tr_it = iter(train_dataloader)
        data = next(tr_it)

    model.train()
    torch.set_grad_enabled(True)
    
    # Forward pass
    inputs = data["image"].to(device)
    target_availabilities = data["target_availabilities"].unsqueeze(-1).to(device)
    targets = data["target_positions"].to(device)
    
    outputs = model(inputs).reshape(targets.shape)
    loss = criterion(outputs, targets)

    # not all the output steps are valid, but we can filter them out from the loss using availabilities
    loss = loss * target_availabilities
    loss = loss.mean()

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    losses_train.append(loss.item())

    if (itr+1) % cfg['train_params']['checkpoint_every_n_steps'] == 0 and not DEBUG:
        torch.save(model.state_dict(), f'model_state_{itr}.pth')
    
    progress_bar.set_description(f"loss: {loss.item()} loss(avg): {np.mean(losses_train[-100:])}")

loss: 1.8312768936157227 loss(avg): 2.2297130322456358: 100%|██████████| 25000/25000 [4:57:00<00:00,  1.40it/s]
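
The `loss(avg)` figure is the mean of the most recent 100 step losses. As an illustration of that windowed mean, the sketch below computes the same quantity with a fixed-size `deque` in place of the full loss history. `RunningMean` is a hypothetical helper, and the demo uses a window of 3 to keep the numbers easy to check.

```python
from collections import deque


class RunningMean:
    """Mean over the most recent `window` values."""

    def __init__(self, window=100):
        # A bounded deque drops the oldest value automatically once it is full.
        self.values = deque(maxlen=window)

    def update(self, value):
        self.values.append(value)
        return sum(self.values) / len(self.values)


tracker = RunningMean(window=3)
means = [tracker.update(v) for v in [4.0, 2.0, 3.0, 1.0]]
print(means)   # [4.0, 3.0, 3.0, 2.0]
```

During the first `window` steps the mean covers only the values seen so far, matching the behaviour of slicing a short list with `[-100:]`.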


## Save Final Weights  
Save `model_state_last.pth` after training for downstream inference or fine-tuning. This cell saves unconditionally, regardless of `DEBUG`.

In [11]:
# always save final weights (os and torch are already imported in the first cell)

save_path = "/kaggle/working/model_state_last.pth"
torch.save(model.state_dict(), save_path)
print("Saved:", save_path, "| size (bytes):", os.path.getsize(save_path))

Saved: /kaggle/working/model_state_last.pth | size (bytes): 57140587
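
When resuming or running inference from one of the periodic checkpoints, the newest file has to be picked by step number, since a plain string sort would rank `model_state_9999.pth` above `model_state_24999.pth`. The helper below is a hypothetical sketch that assumes the `model_state_<step>.pth` naming used by the training loop; the demo creates empty placeholder files in a temporary directory.

```python
import re
import tempfile
from pathlib import Path

CKPT_PATTERN = re.compile(r"model_state_(\d+)\.pth$")


def latest_checkpoint(directory):
    """Return the periodic checkpoint with the highest step index, or None."""
    best_step, best_path = -1, None
    for path in Path(directory).glob("model_state_*.pth"):
        match = CKPT_PATTERN.search(path.name)
        # Files without a numeric step, such as model_state_last.pth, are skipped.
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demo with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["model_state_999.pth", "model_state_9999.pth",
                 "model_state_24999.pth", "model_state_last.pth"]:
        (Path(tmp) / name).touch()
    latest_name = latest_checkpoint(tmp).name

print(latest_name)   # model_state_24999.pth
```

The returned path can then be passed to whatever loading code the downstream notebook uses.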
