# 🚗 Driver Behavior Detection (State-of-the-Art ViT)

This notebook fine-tunes a pre-trained **Vision Transformer (ViT)** model to classify driver behavior into 10 classes, using the [State Farm Distracted Driver Detection](https://www.kaggle.com/c/state-farm-distracted-driver-detection) dataset.

This approach is expected to improve on baseline CNNs (such as VGG or AlexNet) for two reasons:
1.  **Better Architecture:** Vision Transformers (ViT) use self-attention, which lets every image patch attend to every other patch. The model can therefore relate distant regions of the frame, such as a driver's hand and head, within a single layer.
2.  **Transfer Learning:** We start from a model pre-trained on ImageNet-21k (roughly 14 million images), so it has already learned general visual features before it ever sees a driver.
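
To make the first point concrete, the short sketch below works out how many tokens the self-attention layers operate on. It assumes the 224×224 input size and 16×16 patch size implied by the checkpoint name `google/vit-base-patch16-224-in21k` used later in this notebook; the helper name `vit_sequence_length` is made up for illustration.

```python
def vit_sequence_length(image_size: int = 224, patch_size: int = 16) -> int:
    """Number of tokens a ViT encoder sees: one per patch plus one [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


tokens = vit_sequence_length()
# Self-attention scores every token against every other token,
# so each attention head computes tokens * tokens pairwise scores per layer.
attention_pairs = tokens * tokens
print(tokens, attention_pairs)
```

Because every patch is compared with every other patch, a region in one corner of the image can influence the representation of a region in the opposite corner from the first layer onward.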

## 1. Setup & Installation

First, we install the necessary libraries (the first three are from Hugging Face).
* `transformers`: For the ViT model and the `Trainer`.
* `datasets`: To easily load and process the image data.
* `evaluate`: To calculate our accuracy metric.
* `kaggle`: To download the dataset directly.
* `albumentations`: For powerful data augmentation.

In [1]:
!pip install -q transformers datasets evaluate kaggle albumentations

## 2. Download Kaggle Dataset

To use the Kaggle API, you need to upload your `kaggle.json` token.

1.  Go to your Kaggle account, click your profile picture, and go to "Account".
2.  Scroll down to "API" and click "Create New API Token".
3.  This will download `kaggle.json`. Upload it to the Colab sidebar (click the "Files" icon).
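
Before running the download cell, it can help to confirm the token file is where the notebook expects it. The check below is a minimal sketch: the helper name `check_kaggle_token` is hypothetical, and it only assumes that `kaggle.json` is a JSON object with `username` and `key` fields.

```python
import json
from pathlib import Path


def check_kaggle_token(path: str = "kaggle.json") -> bool:
    """Return True if the file exists and has non-empty 'username' and 'key' fields."""
    token_file = Path(path)
    if not token_file.is_file():
        return False
    try:
        token = json.loads(token_file.read_text())
    except json.JSONDecodeError:
        return False
    return isinstance(token, dict) and all(token.get(field) for field in ("username", "key"))


# Checks the current working directory, which is where Colab uploads land by default.
print("kaggle.json looks usable:", check_kaggle_token())
```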

In [4]:
import os

# Set up Kaggle API
# os.environ['KAGGLE_USERNAME'] = 'your_username_here' # <-- Optional, or just use the JSON
# os.environ['KAGGLE_KEY'] = 'your_key_here' # <-- Optional, or just use the JSON

# Configure Kaggle (it will look for the uploaded kaggle.json)
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Download and unzip the dataset
!kaggle competitions download -c state-farm-distracted-driver-detection
# The competition archive already contains the imgs/ folder, so one unzip is enough
!unzip -q state-farm-distracted-driver-detection.zip

print("Dataset downloaded and unzipped.")

Downloading state-farm-distracted-driver-detection.zip to /content
100% 4.00G/4.00G [02:07<00:00, 33.8MB/s]
Dataset downloaded and unzipped.


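## 3. Data Pre-processing: Load and Split

The Hugging Face `datasets` library can load images directly when they are arranged in an `ImageFolder` layout (like `train/c0/...`, `train/c1/...`). After unzipping, `imgs/train` already has one subfolder per class, so no re-organization is needed: we load it as is and hold out 10% of the images for validation. (`driver_imgs_list.csv` additionally maps each image to its driver, but it is not used here.)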
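
As a rough illustration of what an `ImageFolder`-style layout encodes, the sketch below derives integer labels from subfolder names using only the standard library. It mimics the idea rather than reproducing the `imagefolder` loader's implementation; the function name `infer_imagefolder_labels` and the file names are made up, and sorting class names alphabetically is an assumption.

```python
import tempfile
from pathlib import Path


def infer_imagefolder_labels(root):
    """Map each .jpg under root/<class_name>/ to an integer label (classes sorted by name)."""
    root = Path(root)
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    label_of = {name: i for i, name in enumerate(class_names)}
    samples = [
        (path, label_of[name])
        for name in class_names
        for path in sorted((root / name).glob("*.jpg"))
    ]
    return class_names, samples


# Build a tiny throwaway tree to demonstrate (empty placeholder files, not real images).
with tempfile.TemporaryDirectory() as tmp:
    for cls, fname in [("c0", "img_1.jpg"), ("c0", "img_2.jpg"), ("c1", "img_3.jpg")]:
        (Path(tmp) / cls).mkdir(exist_ok=True)
        (Path(tmp) / cls / fname).touch()
    class_names, samples = infer_imagefolder_labels(tmp)

print(class_names)
print([(path.name, label) for path, label in samples])
```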

In [26]:
!cd imgs/ && ls

test  train


In [18]:
from datasets import load_dataset
from pathlib import Path

# The 'imgs' directory was created by the unzip command.
# We load the 'train' folder which is INSIDE 'imgs'.
data_dir = Path('imgs/train')

print(f"Loading images from {data_dir}...")
dataset = load_dataset('imagefolder', data_dir=str(data_dir))

# Create a 90/10 train/validation split (fixed seed so the split is reproducible)
split_dataset = dataset['train'].train_test_split(test_size=0.1, seed=42)
split_dataset['validation'] = split_dataset.pop('test') # Rename 'test' to 'validation'

print("\nSuccessfully loaded and split dataset:")
print(split_dataset)

Loading images from imgs/train...




Successfully loaded and split dataset:
DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 20181
    })
    validation: Dataset({
        features: ['image', 'label'],
        num_rows: 2243
    })
})
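
The split sizes printed above can be sanity-checked with a little arithmetic. The numbers are consistent with the validation share being rounded up to a whole image; this is an observation about the output, not a claim about how `train_test_split` is implemented internally.

```python
import math

n_images = 22424   # total images reported while loading imgs/train
test_size = 0.1    # fraction passed to train_test_split

n_val = math.ceil(n_images * test_size)  # validation share, rounded up
n_train = n_images - n_val               # everything else is used for training
print(n_train, n_val)
```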


## 5. Load Pre-trained ViT Model & Processor

We will use `google/vit-base-patch16-224-in21k`, a ViT model pre-trained on ImageNet-21k.

* **`ViTImageProcessor`**: This handles all the pre-processing (resizing, normalization) to match what the ViT model expects.
* **`ViTForImageClassification`**: This is the model itself. We pass `num_labels=10` to tell it to create a new, untrained classification head on top of the pre-trained body.

In [19]:
from transformers import ViTImageProcessor, ViTForImageClassification
import torch

# Define model checkpoint
model_checkpoint = "google/vit-base-patch16-224-in21k"

# Load the processor
processor = ViTImageProcessor.from_pretrained(model_checkpoint)

# Get label mappings from the dataset
labels = split_dataset['train'].features['label'].names
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for i, label in enumerate(labels)}

# Load the model
model = ViTForImageClassification.from_pretrained(
    model_checkpoint,
    num_labels=10,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True # Harmless here: this checkpoint has no classifier head, so a new one is initialized either way
)

print("Model and Processor loaded.")


Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model and Processor loaded.
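
The folder names `c0` to `c9` are not very readable on their own. The sketch below pairs them with the class descriptions as we understand them from the Kaggle competition page; treat that table as an assumption and verify it against the dataset description. The names `CLASS_DESCRIPTIONS`, `code_to_id`, `id_to_code` and `describe` are made up for this example and deliberately do not overwrite the notebook's `label2id` / `id2label`.

```python
# Class descriptions as we understand them from the Kaggle competition page.
# This is an assumption: check the dataset description before relying on it.
CLASS_DESCRIPTIONS = {
    "c0": "safe driving",
    "c1": "texting - right",
    "c2": "talking on the phone - right",
    "c3": "texting - left",
    "c4": "talking on the phone - left",
    "c5": "operating the radio",
    "c6": "drinking",
    "c7": "reaching behind",
    "c8": "hair and makeup",
    "c9": "talking to passenger",
}

# Same construction as label2id / id2label above, applied to the sorted folder names.
codes = sorted(CLASS_DESCRIPTIONS)
code_to_id = {code: i for i, code in enumerate(codes)}
id_to_code = {i: code for code, i in code_to_id.items()}


def describe(class_id: int) -> str:
    """Turn a predicted class index into a readable string."""
    code = id_to_code[class_id]
    return f"{code}: {CLASS_DESCRIPTIONS[code]}"


print(describe(0))
print(describe(3))
```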


## 6. Define Data Augmentation

To improve the model's robustness and reduce overfitting, we apply **data augmentation** to the training set. We use `albumentations` for this.

* **Training:** We apply random crops, small rotations, and brightness/contrast changes. Horizontal flips are deliberately left out: several classes differ only in which hand is used (for example, texting with the right versus the left hand), so mirroring an image would make its label wrong.
* **Validation:** We *only* apply the standard resizing and normalization.

In [20]:
import albumentations as A
from albumentations.pytorch import ToTensorV2
import numpy as np

# Define augmentation pipeline for training
train_transform = A.Compose([
    A.SmallestMaxSize(max_size=256),
    A.RandomCrop(width=224, height=224),
    # No A.HorizontalFlip: it would turn left-hand classes into right-hand ones
    A.RandomBrightnessContrast(p=0.2),
    A.Rotate(limit=15, p=0.3),
    A.Normalize(mean=processor.image_mean, std=processor.image_std),
    ToTensorV2(),
])

# Define "transforms" for validation (just resizing and normalization)
val_transform = A.Compose([
    A.SmallestMaxSize(max_size=256),
    A.CenterCrop(width=224, height=224),
    A.Normalize(mean=processor.image_mean, std=processor.image_std),
    ToTensorV2(),
])

# Create functions to apply the transforms
def preprocess_train(examples):
    # Albumentations expects a list of np.arrays
    images = [np.array(img.convert("RGB")) for img in examples['image']]
    # Apply transforms
    examples['pixel_values'] = [train_transform(image=img)['image'] for img in images]
    return examples

def preprocess_val(examples):
    images = [np.array(img.convert("RGB")) for img in examples['image']]
    examples['pixel_values'] = [val_transform(image=img)['image'] for img in images]
    return examples
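
Both pipelines above resize the shorter side to 256 pixels and then cut out a 224×224 window. The arithmetic sketch below shows roughly how much of a frame survives that crop. It assumes 640×480 source images, and the exact rounding inside `albumentations` may differ by a pixel; the helper names are made up.

```python
def smallest_side_resize(width: int, height: int, target: int = 256):
    """Scale (width, height) so the shorter side equals target, keeping the aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def crop_fraction(width: int, height: int, crop: int = 224) -> float:
    """Fraction of the resized image area kept by a crop x crop window."""
    return (crop * crop) / (width * height)


resized_w, resized_h = smallest_side_resize(640, 480)  # 640x480 is an assumed source size
kept = crop_fraction(resized_w, resized_h)
print((resized_w, resized_h), f"{kept:.0%} of the area kept")
```

Since a little over half of the frame is kept, a random crop can occasionally cut off the hand or object that defines the class, which is worth keeping in mind when inspecting misclassified training images.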

## 7. Apply Transforms to the Dataset

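We apply our functions with `.map()`, which pre-processes each split once and stores the resulting `pixel_values`. Note that the random augmentations are therefore sampled once per image rather than once per epoch; switch to `.set_transform()` if you want fresh augmentations every epoch.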

In [40]:
# Pre-process the entire dataset up front with .map() instead of
# transforming "on the fly". This creates the 'pixel_values' column
# for both splits and takes a minute or two.
print("Applying transforms to train dataset...")
# batched=True sends multiple images at once to preprocess_train
train_dataset = split_dataset['train'].map(preprocess_train, batched=True)

print("Applying transforms to validation dataset...")
val_dataset = split_dataset['validation'].map(preprocess_val, batched=True)

# --- THIS IS THE KEY ---
# Now that 'pixel_values' is created, we can safely remove
# the original 'image' column to prevent errors.
train_dataset = train_dataset.remove_columns(['image'])
val_dataset = val_dataset.remove_columns(['image'])
# ----------------------

# We also set the format to PyTorch tensors
train_dataset.set_format('torch')
val_dataset.set_format('torch')

print("Pre-processing complete.")

Applying transforms to train dataset...



Applying transforms to validation dataset...



Pre-processing complete.
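
One consequence of pre-computing with `.map()` is that any random augmentation is sampled once and then frozen. The toy sketch below uses plain Python lists and a made-up `augment` function instead of a real `datasets.Dataset`, purely to contrast the two strategies.

```python
import random


def augment(x, rng):
    """Toy 'augmentation': add a small random jitter to a number."""
    return x + rng.choice([-1, 0, 1])


data = [10, 20, 30, 40]
rng = random.Random(0)
n_epochs = 3

# Eager (like .map()): augment once, then reuse the stored result every epoch.
stored = [augment(x, rng) for x in data]
eager_epochs = [list(stored) for _ in range(n_epochs)]

# Lazy (like .set_transform()): augment again every time the data is read.
lazy_epochs = [[augment(x, rng) for x in data] for _ in range(n_epochs)]

print("eager:", eager_epochs)
print("lazy: ", lazy_epochs)
```

The eager epochs are identical by construction, while the lazy epochs are re-sampled on every pass. The trade-off is speed: eager pre-processing pays the transform cost once, lazy pays it on every read.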


## 8. Set Up Training

We use the Hugging Face `Trainer` to handle the entire training and evaluation loop.

In [42]:
import evaluate
import numpy as np
from transformers import TrainingArguments, Trainer, DefaultDataCollator

# Load the accuracy metric
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions."""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)

# Define the training arguments
# Make sure to use a GPU runtime in Colab!
training_args = TrainingArguments(
    output_dir="./vit-driver-detection",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    eval_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=4,
    learning_rate=5e-5,
    logging_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    remove_unused_columns=True,
    report_to="none", # Set to "wandb" if you want to log (requires !pip install wandb)
    push_to_hub=False
)

# A simple data collator
data_collator = DefaultDataCollator()

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    processing_class=processor, # Saves the image processor with the model (replaces the deprecated tokenizer= argument in transformers >= 4.46)
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)
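
To get a feel for how long training will run, the arithmetic below estimates the number of optimizer steps from the settings above. It assumes a single GPU, no gradient accumulation, and that the last partial batch of each epoch is kept.

```python
import math

train_examples = 20181  # size of the training split printed earlier
batch_size = 32         # per_device_train_batch_size
epochs = 4              # num_train_epochs
logging_steps = 100     # logging_steps

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
log_events = total_steps // logging_steps  # how many times the training loss is logged

print(steps_per_epoch, total_steps, log_events)
```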


## 9. Train the Model

Training will take a while. Make sure you are on a **GPU runtime** (Runtime -> Change runtime type -> T4 GPU).

In [None]:
print("Starting training...")
trainer.train()

Starting training...




## 10. Evaluate and Conclude

After training, the `trainer` object will automatically have loaded the best checkpoint, meaning the one with the highest validation accuracy (thanks to `load_best_model_at_end=True` and `metric_for_best_model="accuracy"`). We can run a final evaluation.

In [None]:
eval_results = trainer.evaluate()
print(f"Final Validation Accuracy: {eval_results['eval_accuracy']:.4f}")
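
For reference, the accuracy reported here is simply the share of validation images whose highest-scoring class matches the true label. The sketch below reproduces that calculation in plain Python on a few made-up logits (these are not model outputs); the helper names are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose top-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy values for illustration only: 4 examples, 3 classes.
toy_logits = [
    [2.0, 0.1, -1.0],
    [0.3, 0.2, 1.5],
    [0.0, 0.9, 0.4],
    [1.2, 1.1, 0.0],
]
toy_labels = [0, 2, 2, 0]

print(accuracy_from_logits(toy_logits, toy_labels))
```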

### How to Use the Model (Inference)

Here's how you would use your new, fine-tuned model for inference.

In [None]:
from transformers import pipeline

# Take a test image straight from the validation split (it is already a PIL image)
test_image = split_dataset['validation'][0]['image'].convert("RGB")

# Create a pipeline
# 'model' can be the path "./vit-driver-detection/checkpoint-..." or the trainer.model
classifier = pipeline(
    "image-classification",
    model=trainer.model,
    image_processor=processor, # Image models take an image processor, not a tokenizer
    device=trainer.model.device,
)

# Make a prediction
prediction = classifier(test_image)
print(f"Prediction: {prediction}")

# Show the image
display(test_image)
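
The pipeline returns a list of `{"label": ..., "score": ...}` dictionaries sorted by score. As a rough sketch of what happens between the model's raw logits and that list, the code below applies a softmax and keeps the top entries. The logits are made up for illustration, and the helper names are hypothetical rather than part of `transformers`.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_predictions(logits, id2label, k=3):
    """Return the k most likely classes as [{'label': ..., 'score': ...}, ...]."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [{"label": id2label[i], "score": probs[i]} for i in ranked[:k]]


# Made-up logits for a single image, purely for illustration (not model output).
toy_id2label = {i: f"c{i}" for i in range(10)}
toy_logits = [0.2, -1.0, 0.5, 3.1, 0.0, -0.3, 1.4, 0.1, -2.0, 0.7]

top3 = top_k_predictions(toy_logits, toy_id2label)
for item in top3:
    print(item["label"], round(item["score"], 3))
```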