![image.png](https://i.imgur.com/a3uAqnb.png)
# Video Action Recognition with 3D-CNNs

This notebook demonstrates how to build and train a 3D Convolutional Neural Network (3D-CNN) for video action recognition. Unlike standard 2D-CNNs that process static images, 3D-CNNs are designed to learn features from both spatial and temporal dimensions, making them ideal for understanding the content of videos.

### **📌 The Core Idea: Learning from Space and Time**
The model works by applying 3D convolution and pooling operations to a sequence of video frames.

1.  **Input**: The model takes a short clip (a fixed number of frames) from a video as input.

2.  **3D Convolutions**: Instead of a 2D kernel that slides over the height and width of an image, a 3D kernel slides over the **height, width, and time** (frames) of the video clip. This allows the network to learn motion patterns (like a person swinging a bat) in addition to visual features (the person, the bat).

3.  **Classification**: After several layers of 3D convolutions and pooling, the learned features are passed through fully-connected layers to classify the action being performed in the video.
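To make the difference between 2D and 3D kernels concrete, here is a minimal sketch, assuming PyTorch is installed. The clip size and channel counts are arbitrary illustration values, not the architecture built later in this notebook.

```python
import torch
import torch.nn as nn

# Illustrative random clip: (batch, channels, frames, height, width)
clip = torch.randn(1, 3, 8, 32, 32)

# A 2D convolution sees one frame at a time: its kernel spans (height, width)
conv2d = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1)
first_frame = clip[:, :, 0]          # (1, 3, 32, 32)
out2d = conv2d(first_frame)          # (1, 4, 32, 32)

# A 3D convolution spans (time, height, width), so every output value
# mixes information from 3 neighbouring frames
conv3d = nn.Conv3d(in_channels=3, out_channels=4, kernel_size=3, padding=1)
out3d = conv3d(clip)                 # (1, 4, 8, 32, 32)

print("2D output:", tuple(out2d.shape), "kernel:", tuple(conv2d.weight.shape))
print("3D output:", tuple(out3d.shape), "kernel:", tuple(conv3d.weight.shape))
```

The extra dimension in the 3D kernel weights is the temporal extent: it is what lets the layer respond to motion rather than only to appearance.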

In [None]:
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import numpy as np
import cv2
import tempfile
import os
import torchvision.io as io
import torchvision.models.video as video_models
from sklearn.model_selection import train_test_split
from IPython.display import Video, HTML
from datasets import load_dataset
import random
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as transforms
from sklearn.metrics import accuracy_score, classification_report
import torch.nn.functional as F
from tqdm import tqdm
import warnings
from torchvision import models
warnings.filterwarnings('ignore')

In [None]:
import torch
print(torch.__version__)

## 1️⃣ The Dataset: HMDB51

We will use the **HMDB51 dataset**, a widely-used benchmark for action recognition. It consists of around 7,000 video clips categorized into 51 action classes.

We'll load the dataset directly from the Hugging Face Hub.

In [None]:
ds = load_dataset("NoahMartinezXiang/HMDB51")

In [None]:
# Depending on your environment, you may need to install specific libraries for video processing.
!pip install av opencv-python

## 2️⃣ Data Exploration and Visualization

Before building the model, it's essential to understand and visualize our data. We'll define a helper function that displays evenly spaced frames from a video clip side by side within the notebook. This helps us get a feel for the different action classes and the nature of the video data (length, quality, etc.).

In [None]:
def display_video_frames(video_data, label, num_frames=10):
    """Helper function to display frames from a video side by side."""
    frames = []
    try:
        # Iterate through the frames in the video data
        for frame in video_data:
            frame_data = frame['data']
            # Convert tensor to numpy array
            if torch.is_tensor(frame_data):
                frame_data = frame_data.detach().cpu().numpy()
            elif hasattr(frame_data, 'numpy'):
                frame_data = frame_data.numpy()
            # Ensure the channel order is H, W, C for matplotlib
            if len(frame_data.shape) == 3 and frame_data.shape[0] == 3:
                frame_data = np.transpose(frame_data, (1, 2, 0))
            # Ensure data type is uint8 for display
            if frame_data.dtype != np.uint8:
                if frame_data.max() <= 1.0:
                    frame_data = (frame_data * 255).astype(np.uint8)
                else:
                    frame_data = frame_data.astype(np.uint8)
            frames.append(frame_data)
    except Exception:
        # Some videos in the dataset might be corrupted; keep whatever frames were decoded
        pass
    
    if not frames:
        print(f"Could not load frames for video with label: {label}")
        return
    
    # Select frames evenly distributed across the video
    total_frames = len(frames)
    if total_frames >= num_frames:
        # Select evenly spaced frames
        frame_indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
        selected_frames = [frames[i] for i in frame_indices]
    else:
        # If video has fewer frames than requested, use all available frames
        frame_indices = np.arange(total_frames)
        selected_frames = frames
        num_frames = len(selected_frames)
    
    # Create subplot grid
    fig, axes = plt.subplots(1, num_frames, figsize=(2 * num_frames, 3))
    if num_frames == 1:
        axes = [axes]  # Make it iterable for single frame case
    
    fig.suptitle(f'Video: {label} ({total_frames} total frames)', fontsize=14)
    
    for i, (ax, frame) in enumerate(zip(axes, selected_frames)):
        ax.imshow(frame)
        ax.set_title(f'Frame {frame_indices[i] + 1 if total_frames >= num_frames else i + 1}', fontsize=10)
        ax.axis('off')
    
    plt.tight_layout()
    plt.show()



In [None]:

# TODO: Display 5 random videos from the training set, showing 10 frames each.
# Make use of the provided display_video_frames function.


## 3️⃣ The PyTorch Dataset

We create a custom `VideoDataset` class to interface with our data. This is the standard PyTorch way to prepare data for a `DataLoader`. Our dataset needs to perform several crucial preprocessing steps:

1.  **Frame Sampling**: Videos have varying lengths. We need to feed a fixed-size input to our model, so we take the first `max_frames` frames of each video.
2.  **Resizing**: Each frame is resized to a uniform `target_size` (e.g., 112x112) to ensure consistency.
3.  **Normalization**: Frame pixel values are scaled to the 0-1 range; an optional `transform` can then standardize each frame (for example with per-channel mean and standard deviation).
4.  **Padding/Truncation**: If a video has fewer frames than `max_frames`, we pad it by repeating the last frame. If it has more, we truncate it.
5.  **Tensor Conversion**: The final sequence of frames is converted into a single PyTorch tensor.
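The truncation and padding rule can be sketched on frame indices alone, without decoding any video. The helper name `clip_frame_indices` is hypothetical and only serves this illustration; it mirrors the "first frames, then repeat the last one" behaviour described above.

```python
def clip_frame_indices(num_video_frames, max_frames=16):
    """Return exactly max_frames indices: the first frames of the video,
    with the last available index repeated as padding if the video is short."""
    if num_video_frames <= 0:
        return []  # empty video: the caller has to supply blank frames instead
    indices = list(range(min(num_video_frames, max_frames)))
    while len(indices) < max_frames:
        indices.append(indices[-1])  # pad by repeating the last frame
    return indices

print(clip_frame_indices(40, max_frames=8))  # long video: truncated
print(clip_frame_indices(5, max_frames=8))   # short video: padded
```

Because only the first frames are used, long videos contribute just their opening moments; this is a simple choice, not the only possible one.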

In [None]:
class VideoDataset(Dataset):
    def __init__(self, dataset, transform=None, max_frames=16, target_size=(112, 112)):
        self.dataset = dataset
        self.transform = transform
        self.max_frames = max_frames
        self.target_size = target_size

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        sample = self.dataset[idx]
        video = sample['video']
        label = sample['label']

        frames = []
        for frame in video:
            # Extract the image data from the frame dictionary
            frame_data = frame['data']

            # Convert from PyTorch tensor to NumPy array if necessary
            if torch.is_tensor(frame_data):
                frame_data = frame_data.detach().cpu().numpy()
            elif hasattr(frame_data, 'numpy'):
                frame_data = frame_data.numpy()

            # Transpose from (C, H, W) to (H, W, C) for resizing with OpenCV
            if len(frame_data.shape) == 3 and frame_data.shape[0] == 3:
                frame_data = np.transpose(frame_data, (1, 2, 0))

            # Normalize pixel values to the 0-1 float range
            if frame_data.dtype != np.float32:
                if frame_data.max() > 1.0:
                    frame_data = frame_data.astype(np.float32) / 255.0
                else:
                    frame_data = frame_data.astype(np.float32)

            # Resize frame to the target spatial dimensions
            frame_data = cv2.resize(frame_data, self.target_size)

            # Transpose back to (C, H, W) format for PyTorch
            if len(frame_data.shape) == 3:
                frame_data = np.transpose(frame_data, (2, 0, 1))

            frames.append(frame_data)

            # Stop once we have reached the desired number of frames
            if len(frames) >= self.max_frames:
                break

        # If the video is shorter than max_frames, pad by repeating the last frame
        while len(frames) < self.max_frames:
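            # Handle the edge case of an empty video with a blank (C, H, W) frame;
            # target_size is (width, height), as expected by cv2.resize
            blank = np.zeros((3, self.target_size[1], self.target_size[0]), dtype=np.float32)
            frames.append(frames[-1] if frames else blank)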

        # Ensure the final list has exactly max_frames
        frames = frames[:self.max_frames]

        # Stack the list of frames into a single tensor of shape (T, C, H, W)
        video_tensor = torch.FloatTensor(np.array(frames))

        # Apply any specified transformations (like normalization) to each frame
        if self.transform:
            transformed_frames = []
            for frame in video_tensor:
                transformed_frames.append(self.transform(frame))
            video_tensor = torch.stack(transformed_frames)

        return video_tensor, torch.tensor(label, dtype=torch.long)

## 4️⃣ The Model: 3D Convolutional Neural Network (3D-CNN)

Here we define the architecture of our 3D-CNN. It consists of a series of 3D convolutional blocks followed by fully-connected layers for classification.

### Key Components:
- **`nn.Conv3d`**: The core layer. It uses a 3D kernel to convolve over the video clip's height, width, and temporal (frame) dimensions simultaneously. This captures spatio-temporal features.
- **`nn.BatchNorm3d`**: Normalizes the activations to stabilize training and improve performance.
- **`nn.MaxPool3d`**: Downsamples the feature maps. Note the kernel sizes: `(1, 2, 2)` is used initially to reduce spatial dimensions while preserving the temporal information. Later, `(2, 2, 2)` is used to downsample in all three dimensions.
- **`permute`**: In the `forward` method, the input tensor shape is changed from `(Batch, Time, Channels, Height, Width)` to `(Batch, Channels, Time, Height, Width)`, which is the format expected by PyTorch's `Conv3d` layers.
- **`AdaptiveAvgPool3d`**: Reduces each feature map to a single value, making the model robust to different input sizes and preparing the features for the classifier.
- **Fully-Connected Layers**: The final layers that perform the classification based on the extracted features.
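The shape bookkeeping behind these components can be checked in isolation. The following is a small sketch, assuming PyTorch is installed; it applies only `permute` and the pooling layers to a random batch (no convolutions), so the channel count stays at 3.

```python
import torch
import torch.nn as nn

# A batch as a DataLoader would yield it: (batch, time, channels, height, width)
x = torch.randn(2, 16, 3, 112, 112)

# Conv3d and the 3D pooling layers expect (batch, channels, time, height, width)
x = x.permute(0, 2, 1, 3, 4)
print("after permute:      ", tuple(x.shape))

spatial_pool = nn.MaxPool3d(kernel_size=(1, 2, 2))  # halves H and W, keeps T
full_pool = nn.MaxPool3d(kernel_size=(2, 2, 2))     # halves T, H and W
global_pool = nn.AdaptiveAvgPool3d((1, 1, 1))       # one value per feature map

a = spatial_pool(x)
b = full_pool(a)
c = global_pool(b)
flat = c.view(c.size(0), -1)  # flatten to (batch, num_features)

print("after (1, 2, 2) pool:", tuple(a.shape))
print("after (2, 2, 2) pool:", tuple(b.shape))
print("after global pool:   ", tuple(c.shape))
print("flattened:           ", tuple(flat.shape))
```

Because the global pooling output is always `(1, 1, 1)` per channel, the size of the first fully-connected layer depends only on the number of channels in the last convolutional block.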

In [None]:
class Video3DCNN(nn.Module):
    def __init__(self, num_classes=51, input_channels=3):
        super(Video3DCNN, self).__init__()
        
        # TODO: Block 1: 3D Conv -> BatchNorm -> ReLU -> MaxPool
        

        # TODO: Pool spatially (3d) but not temporally to preserve motion information early on
        
        
        # TODO: Define other 4 blocks in the same way.
        
        
        # Global average pooling to reduce each feature map to a single value
        self.global_avg_pool = nn.AdaptiveAvgPool3d((1, 1, 1))
        
        # Fully connected layers for classification
        self.dropout = nn.Dropout(0.5)

        # TODO: Define the FC layers.
        
        
    def forward(self, x):
        # Input shape: (batch_size, time, channels, height, width)
        # Rearrange to (batch_size, channels, time, height, width) for 3D conv layers
        x = x.permute(0, 2, 1, 3, 4)
        
        # TODO: Pass through convolutional blocks
        
        
        # Apply global average pooling and flatten the output
        x = self.global_avg_pool(x)
        x = x.view(x.size(0), -1) # Flatten to (batch_size, num_features)
        
        # TODO: Pass through fully connected layers with dropout for regularization
        
        
        return x

## 5️⃣ Preparing for Training

This section covers all the setup steps before we can start the training loop:
1.  **Transformations**: We define a normalization transform using ImageNet statistics, a common practice.
2.  **Data Splitting**: Training a model on the full HMDB51 dataset is computationally expensive. For this demonstration, we'll randomly sample a smaller subset of 2,000 videos and then split them into training and validation sets. We use stratified splitting to ensure both sets have a similar distribution of action classes.
3.  **DataLoaders**: We create `DataLoader` instances for our training and validation sets, which will handle batching, shuffling, and (optionally) parallel data loading with worker processes.
4.  **Model Initialization**: We instantiate our `Video3DCNN`, set up the loss function (`CrossEntropyLoss`), optimizer (`Adam`), and a learning rate scheduler (`CosineAnnealingLR`) to adjust the learning rate during training.
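Stratified splitting is easiest to see on synthetic labels. The sketch below assumes scikit-learn and NumPy are installed; the label array is made up (5 classes with 20 samples each) and stands in for the real HMDB51 labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in labels: 5 classes x 20 samples each
labels = np.repeat(np.arange(5), 20)
indices = np.arange(len(labels))

# 80/20 split; stratify keeps the class proportions equal in both parts
train_idx, val_idx = train_test_split(
    indices, test_size=0.2, stratify=labels, random_state=42
)

print("train size:", len(train_idx), "val size:", len(val_idx))
print("val samples per class:  ", np.bincount(labels[val_idx]))
print("train samples per class:", np.bincount(labels[train_idx]))
```

Without `stratify`, a small random validation set could easily miss rare classes entirely, which would distort the per-class metrics later on.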

In [None]:
# Define normalization transform using standard ImageNet stats
transform = transforms.Compose([
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

In [None]:
# Set random seeds for reproducibility
random.seed(42)
np.random.seed(42)

# TODO: Randomly select a subset of samples from the dataset for faster training
# Using 2000 samples for this demonstration
total_samples = len(ds['train'])
random_indices = 

# TODO: Split the selected indices into training and validation sets (80/20 split)
# Stratification ensures that the class distribution is similar in both sets
train_indices, val_indices = 

# TODO: Create subset datasets from the indices
# Hint: use ds['train'].select
train_subset = 
val_subset = 

print(f"Training on {len(train_subset)} samples, validating on {len(val_subset)} samples.")

In [None]:
# Create the VideoDataset instances for training and testing
# We use 16 frames per video clip
train_dataset = VideoDataset(train_subset, transform=transform, max_frames=16)
test_dataset = VideoDataset(val_subset, transform=transform, max_frames=16)

In [None]:
# TODO: Create DataLoaders to handle batching and shuffling
batch_size = 16  # Small batch size due to memory constraints of 3D-CNNs
train_loader = 
test_loader = 

In [None]:
# Set the device to a GPU if available, otherwise use the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

In [None]:
# TODO: Initialize the model and move it to the selected device
model = 

# TODO: Define loss function for multi-class classification
criterion = 

# TODO: Define an optimizer (Adam) with a small learning rate (0.0001) and weight decay (1e-4) for regularization
optimizer = 

# Learning rate scheduler to gradually decrease the learning rate
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

In [None]:
# Print the total number of trainable parameters in the model
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable parameters: {total_params:,}")

## 6️⃣ Model Training

We'll define two helper functions, `train_epoch` and `validate`, to encapsulate the logic for a single epoch of training and validation. This keeps the main training loop clean and organized.

The training loop then iterates for a fixed number of `epochs`:
1.  Calls `train_epoch` to train the model on the training data.
2.  Calls `validate` to evaluate the model's performance on the unseen validation data.
3.  Updates the learning rate using the scheduler.
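4.  Saves the model's weights whenever the validation accuracy improves, a technique known as **model checkpointing**. (This is not early stopping: training still runs for all epochs, and we simply keep the best weights.)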
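The four steps above can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: the data is random, the model is a single linear layer rather than the 3D-CNN, each epoch is one full-batch step, and the best weights are kept in memory instead of being written to disk.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic stand-in data: 10 features, 3 classes
x_train, y_train = torch.randn(64, 10), torch.randint(0, 3, (64,))
x_val, y_val = torch.randn(32, 10), torch.randint(0, 3, (32,))

model = nn.Linear(10, 3)  # toy model, not the notebook's architecture
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-2)
num_epochs = 5
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

best_acc, best_state = -1.0, None
for epoch in range(num_epochs):
    # 1. Train: zero gradients, forward pass, loss, backward pass, update
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x_train), y_train)
    loss.backward()
    optimizer.step()

    # 2. Validate without tracking gradients
    model.eval()
    with torch.no_grad():
        preds = model(x_val).argmax(dim=1)
        val_acc = (preds == y_val).float().mean().item() * 100

    # 3. Update the learning rate
    scheduler.step()

    # 4. Keep a copy of the best weights seen so far
    if val_acc > best_acc:
        best_acc = val_acc
        best_state = copy.deepcopy(model.state_dict())

    print(f"epoch {epoch + 1}: loss={loss.item():.4f}, val_acc={val_acc:.1f}%")

print(f"best validation accuracy: {best_acc:.1f}%")
```

With random labels the accuracy stays near chance; the point is only the order of operations and the fact that the kept weights are the best ones, not the last ones.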

In [None]:
def train_epoch(model, train_loader, criterion, optimizer, device):
    """Function to handle the training of the model for one epoch."""
    model.train() # Set the model to training mode
    running_loss = 0.0
    correct = 0
    total = 0
    
    pbar = tqdm(train_loader, desc='Training')
    for batch_idx, (videos, labels) in enumerate(pbar, start=1):
        # TODO: Move data to the target device
        videos, labels = 
        
        # TODO: Run the standard training steps (these must define `outputs` and `loss`):
        # zero the gradients, forward pass, compute the loss, backward pass, optimizer step
        
        
        # Track statistics
        running_loss += loss.item()
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
        
        # Update progress bar with the average loss over the batches seen so far
        pbar.set_postfix({
            'Loss': f'{running_loss / batch_idx:.4f}',
            'Acc': f'{100.*correct/total:.2f}%'
        })
    
    return running_loss / len(train_loader), 100. * correct / total

def validate(model, test_loader, criterion, device):
    """Function to handle the validation of the model."""
    model.eval() # Set the model to evaluation mode
    running_loss = 0.0
    correct = 0
    total = 0
    all_predictions = []
    all_labels = []
    
    with torch.no_grad(): # Disable gradient calculations
        pbar = tqdm(test_loader, desc='Validation')
        for batch_idx, (videos, labels) in enumerate(pbar, start=1):
            # TODO: Move data to the target device
            videos, labels = 
            
            # TODO: Predict and compute the loss (this must define `outputs` and `loss`)
            
            
            # Track statistics
            running_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            
            # Store predictions and labels for classification report
            all_predictions.extend(predicted.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
            
            # Update progress bar with the average loss over the batches seen so far
            pbar.set_postfix({
                'Loss': f'{running_loss / batch_idx:.4f}',
                'Acc': f'{100.*correct/total:.2f}%'
            })
    
    return running_loss / len(test_loader), 100. * correct / total, all_predictions, all_labels

In [None]:
# Lists to store training and validation history for plotting
train_losses, train_accs = [], []
val_losses, val_accs = [], []

In [None]:
num_epochs = 2
best_acc = 0.0

print("Starting training...")
for epoch in range(num_epochs):
    print(f"\nEpoch {epoch+1}/{num_epochs}")
    print("-" * 50)
    
    # Train for one epoch
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    train_losses.append(train_loss)
    train_accs.append(train_acc)
    
    # Validate the model
    val_loss, val_acc, _, _ = validate(model, test_loader, criterion, device)
    val_losses.append(val_loss)
    val_accs.append(val_acc)
    
    # Step the learning rate scheduler
    scheduler.step()
    
    print(f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%")
    print(f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.2f}%")
    
    # Save the model if it has the best validation accuracy so far
    if val_acc > best_acc:
        best_acc = val_acc
        torch.save(model.state_dict(), 'best_video_cnn.pth')
        print(f"New best model saved with accuracy: {best_acc:.2f}%")

print(f"\nTraining completed! Best validation accuracy: {best_acc:.2f}%")

## 7️⃣ Evaluating Performance

After training, we evaluate the model's performance in two ways:

1.  **Plotting Metrics**: We plot the training and validation loss and accuracy over epochs. This helps us visualize the learning process and check for signs of overfitting (e.g., if validation accuracy plateaus while training accuracy continues to rise).
2.  **Classification Report**: We load the best-performing model (saved during training) and generate a detailed classification report. This report shows precision, recall, and F1-score for each action class, giving us a much deeper insight into the model's strengths and weaknesses than a single accuracy score.
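To show what the classification report actually measures, here is a small pure-Python sketch that computes precision, recall, and F1 per class by hand. The helper `per_class_metrics` and the toy labels are made up for this illustration; in the notebook itself, scikit-learn's `classification_report` does this work.

```python
def per_class_metrics(true_labels, predictions, num_classes):
    """Return {class_id: (precision, recall, f1)} computed from raw counts."""
    report = {}
    pairs = list(zip(true_labels, predictions))
    for c in range(num_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)  # correct hits
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false alarms
        fn = sum(1 for t, p in pairs if t == c and p != c)  # misses
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = (precision, recall, f1)
    return report

# Toy example with 3 classes
true_labels = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]

for cls, (p, r, f1) in per_class_metrics(true_labels, predictions, 3).items():
    print(f"class {cls}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

A class can have high recall but low precision (the model predicts it too often) or the reverse, which a single accuracy number cannot reveal.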

In [None]:
# TODO: Plot the training and validation loss and accuracy curves


In [None]:
# Load the best model weights for final evaluation
print("Loading best model for final evaluation...")
model.load_state_dict(torch.load('best_video_cnn.pth', map_location=device))

# Run validation once more to get final metrics and predictions
# (test_loader wraps the held-out validation split; there is no separate test set here)
val_loss, val_acc, predictions, true_labels = validate(model, test_loader, criterion, device)

print(f"\nFinal Validation Accuracy: {val_acc:.2f}%")

# Print a detailed classification report
print("\nClassification Report:")
label_names = ds['train'].features['label'].names
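# Passing `labels` keeps the report valid even if a class is missing from the validation subset
print(classification_report(true_labels, predictions, labels=list(range(len(label_names))),
                            target_names=label_names, digits=3, zero_division=0))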

## 8️⃣ Inference on New Samples

Finally, let's see our trained model in action! We'll create a simple prediction function and use it to classify a few random videos from our validation set. This demonstrates how the model would be used in a real-world application to make predictions on new, unseen data.
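Beyond the single best guess, it is often useful to look at the top few predictions. The sketch below assumes PyTorch is installed and uses made-up logits with a short illustrative label list instead of a trained model and the full set of 51 classes.

```python
import torch
import torch.nn.functional as F

# Illustrative values only: a short label list and hand-written logits
label_names = ["brush_hair", "cartwheel", "catch", "chew"]
logits = torch.tensor([[0.2, 2.5, 1.0, -0.3]])  # shape (batch=1, num_classes=4)

# Softmax turns logits into probabilities that sum to 1 per sample
probs = F.softmax(logits, dim=1)

# topk returns the k largest probabilities and their class indices
top_probs, top_idx = probs.topk(3, dim=1)

for p, i in zip(top_probs[0], top_idx[0]):
    print(f"{label_names[i.item()]}: {p.item() * 100:.1f}%")
```

Note that a softmax probability is only a rough confidence signal; a poorly trained model can be confidently wrong.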

In [None]:
def predict_video(model, video_tensor, label_names, device):
    """Function to predict the class of a single video tensor."""
    model.eval() # Set the model to evaluation mode
    with torch.no_grad():
        # Add a batch dimension and move the tensor to the correct device
        video_tensor = video_tensor.unsqueeze(0).to(device)
        
        # Get model outputs (logits)
        outputs = model(video_tensor)
        
        # Convert logits to probabilities using softmax
        probabilities = F.softmax(outputs, dim=1)
        
        # Get the index of the class with the highest probability
        _, predicted = torch.max(outputs, 1)

        # Get the class name and confidence score
        pred_class = label_names[predicted.item()]
        confidence = probabilities[0][predicted].item() * 100

        return pred_class, confidence

In [None]:
# Select 5 random samples from the validation set for prediction
random_test_indices = random.sample(range(len(test_dataset)), 5)
for i, idx in enumerate(random_test_indices):
    # Retrieve the preprocessed video tensor and its true label
    video_tensor, true_label = test_dataset[idx]
    true_class = label_names[true_label.item()]

    # Make a prediction with our trained model
    pred_class, confidence = predict_video(model, video_tensor, label_names, device)

    print(f"Sample {i+1}:")
    print(f"  True class: {true_class}")
    print(f"  Predicted:  {pred_class} (confidence: {confidence:.1f}%)")
    print(f"  Correct:    {'✓' if pred_class == true_class else '✗'}")
    print()

## 9️⃣ Conclusion & Next Steps

We have successfully built, trained, and evaluated a 3D-CNN for video action recognition. Even with a relatively simple architecture and a small subset of data, the model learns to distinguish between different actions, demonstrating the power of learning from spatio-temporal features.

### **📝 Exercises for Further Exploration**
1.  **Train on More Data**: The most impactful change would be to train on the full HMDB51 dataset or for more `epochs`.
2.  **Data Augmentation**: Implement video-specific data augmentation techniques, such as random horizontal flipping of the entire clip, random cropping in space and time, or color jittering.
3.  **Use a Pre-trained Model**: Explore using a more advanced, pre-trained video model architecture like R(2+1)D (available in `torchvision.models.video`). You can fine-tune it on this dataset for potentially much better performance.
       - I tried this and got better results, but it took much longer to train. I also tried a ResNet backbone, and its results were similar to those of the 3D-CNN built here.
4.  **Experiment with Frame Sampling**: Instead of taking the first `N` frames, try a different sampling strategy. For example, you could sample `N` frames evenly spaced throughout the video.
5.  **Hyperparameter Tuning**: Experiment with different values for `learning_rate`, `batch_size`, number of layers, or kernel sizes in the 3D-CNN.
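As a starting point for the frame sampling exercise, evenly spaced indices can be computed with `np.linspace`. This is one possible sketch, assuming NumPy is installed; the helper name `uniform_frame_indices` is hypothetical, and using it requires knowing the total frame count of a video before selecting frames.

```python
import numpy as np

def uniform_frame_indices(total_frames, num_frames=16):
    """Return num_frames indices spread evenly from the first to the last frame.
    Short videos get repeated indices, which acts as padding."""
    if total_frames <= 0:
        return np.array([], dtype=int)
    # Truncate the evenly spaced positions to integer frame indices
    return np.linspace(0, total_frames - 1, num_frames).astype(int)

print(uniform_frame_indices(50, num_frames=5))  # spans the whole video
print(uniform_frame_indices(3, num_frames=5))   # short video: indices repeat
```

Compared with taking the first `N` frames, this covers the whole action at the cost of a lower effective frame rate for long videos.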

### Contributed by: Ali Habibullah