#### Objective: To classify each video as normal, panic, violent, congestion, or obstacle.

#### DATASETS
1. UMN
2. UCSD
3. CUHK Avenue
4. PETS2009
5. ShanghaiTech
6. UCF-crime
7. PETS
8. Violent Flow
9. SHT Anomaly

## Literature Review Papers
1. Recent Deep Learning in Crowd Behaviour Analysis : A Brief Review
https://arxiv.org/pdf/2505.18401
- Springer Nature [ January 2025 ]

2. Design of an improved graph-based model for real-time threat detection and dynamic evacuation management using crowd behaviour analysis
https://link.springer.com/article/10.1007/s41870-025-02489-x
  - Springer Nature [March 2025]

3. Human crowd behaviour analysis based on video segmentation and classification using expectation–maximization with deep learning architectures
 - Springer Nature [March 2024]

4. Convolutional Neural Networks for crowd behaviour analysis: a survey
https://link.springer.com/article/10.1007/s00371-018-1499-5 [March 2018]

5. Crowd Emotion and Behavior Analysis Using LightWeight CNN Model
https://www.internationaljournalssrg.org/IJEEE/paper-details?Id=863
- SSRG International Journal of Electrical and Electronics Engineering [October 2024]


### Previous Methods 
1. Traditional Machine Learning Methods
2. Deep Learning Methods
  - 2.1 Recurrent Neural Network (RNN)
  - 2.2 Convolutional Neural Network (CNN)
  - 2.3 Graph Neural Network (GNN)
  - 2.4 Generative Models
  - 2.5 Transformers
 
 - Physics-inspired Deep Learning


# Research Papers
1. Crowd Behaviour Representation : An attribute-based approach
https://link.springer.com/article/10.1186/s40064-016-2786-0
2. Crowd Behaviour Analysis : Survey
3. Crowd Behaviour Monitoring and Analysis in Surveillance Applications : A Survey
4. Self-supervised multi-view multi-label learning with attention mechanisms
5. Detecting violent and abnormal crowd activity using temporal analysis of grey level co-occurrence matrix (GLCM) based texture measures
6. Revisiting crowd behaviour analysis through deep learning: taxonomy, anomaly detection, crowd emotions, datasets, opportunities and prospects
7. Recent trends in crowd analysis: A review
8. Recent Deep Learning in Crowd Behaviour Analysis: A Brief Review
9. Multi-stage attention for efficient brain tumor classification with SAMMed2D
10. Self-supervised multi-view multi-label learning with attention mechanism
11. Crowd Behaviour Detection : leveraging video swin transformer for crowd size and violence analysis

## Research Papers (Technical Papers)
1. Attention is all you need (2017)
https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
2. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- https://ieeexplore.ieee.org/document/9710580
- https://github.com/microsoft/Swin-Transformer
- https://huggingface.co/docs/transformers/en/index
- https://arxiv.org/html/2408.13609v1
- https://research.google/blog/graph-neural-networks-in-tensorflow/
- https://github.com/thunlp/GNNPapers
- https://arxiv.org/abs/2412.08016
- https://arxiv.org/abs/2202.02093 [https://ieeexplore.ieee.org/document/10191427 ]
- https://github.com/PangzeCheung/OmniTransfer

#### METHODS OR FRAMEWORKS
1. CNN Convolutional Neural Network
2. Swin transformer
3. Spatio Temporal
4. Graph Construction Layer
5. Graph Neural Network
6. Temporal Attention Model
7. Multimodal Fusion


### Proposed Methodology
1. Overview
2. Data Processing and Input Representation
    - 2.1 Video Processing
    - 2.2 Multimodal Inputs
3. Spatial Feature Extraction [CNN + Swin Transformer]
    - 3.1 CNN-Based Local Feature Extraction
    - 3.2 Global Context Modeling using Swin Transformer
4. Spatio-Temporal Feature Modeling
5. Graph Construction Layer
    - 5.1 Graph Representation
    - 5.2 Node Feature Assignment
6. Graph Neural Network [GNN]
7. Temporal Attention Mechanism
8. Multimodal Feature Fusion
9. Classification and Output Layer

### Parameters and Hyperparameters (optional)

#### Hyperparameters
1. Input Size
2. Initial learning rate
3. Learning Rate Update Frequency
4. Momentum
5. Batch Size
6. Weight Decay
7. No. of Frames in a sample


#### Losses
1. Cross-entropy
2. Contrastive loss
3. Temporal Smoothness loss
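
As a rough illustration of how two of the listed losses could be combined, the sketch below adds a temporal smoothness penalty to cross-entropy. The function names, the `(B, T, D)` feature layout, and the weight `smooth_weight=0.1` are assumptions made for this example and are not taken from the notebook; the contrastive term is left out.

```python
import torch
import torch.nn.functional as F

def temporal_smoothness_loss(feats):
    """feats: (B, T, D) per-frame features. Penalises jumps between neighbouring frames."""
    diff = feats[:, 1:] - feats[:, :-1]      # (B, T-1, D) frame-to-frame change
    return diff.pow(2).mean()

def combined_loss(logits, targets, feats, smooth_weight=0.1):
    """Cross-entropy on clip logits plus a weighted smoothness term (weight is an assumption)."""
    ce = F.cross_entropy(logits, targets)
    return ce + smooth_weight * temporal_smoothness_loss(feats)

# Toy usage: 2 clips, 5 classes, 4 frames of 8-d features
logits = torch.zeros(2, 5)
targets = torch.tensor([0, 3])
feats = torch.ones(2, 4, 8)                  # constant over time, so the penalty is zero
loss = combined_loss(logits, targets, feats)
```

With constant features the smoothness term vanishes and only the cross-entropy term remains, which makes the toy case easy to sanity-check.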

#### Optimization
1. AdamW
2. Warmup
3. Cosine LR
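
One common way to pair warmup with cosine decay is a single multiplier on the base learning rate. The sketch below is a minimal version under assumed settings: the function name and the step counts are illustrative, and the training loop later in this notebook uses `CosineAnnealingLR` without warmup.

```python
import math

def warmup_cosine_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear warmup from 1/warmup_steps up to 1.0
        return (step + 1) / warmup_steps
    # Cosine decay from 1.0 down to 0.0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Illustrative schedule: 5 warmup steps out of 100 total
factors = [warmup_cosine_factor(s, 5, 100) for s in range(101)]
```

A function like this can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, with `scheduler.step()` called once per optimizer step rather than once per epoch.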

In [1]:
#!pip install timm einops torch-geometric --quiet

# ENVIRONMENT SETUP

- https://pypi.org/project/opencv-python/
- GitHub Link (Open CV) : https://github.com/opencv/opencv
-  OpenCV : https://opencv.org/
-  PyTorch : https://pytorch.org/
-  https://pypi.org/project/torch/
-  PyTorch Image Models : https://github.com/huggingface/pytorch-image-models

In [1]:
import os
import cv2
import torch
import timm
import numpy as np
import torch.nn as nn
import torch.nn.functional as F

from torch.utils.data import Dataset, DataLoader
from einops import rearrange
from torchvision import transforms



In [None]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

# CONFIGURATION

In [None]:
class CFG:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    img_size = 224
    frames = 16
    batch_size = 4   # GPU-safe
    epochs = 30
    lr = 1e-4
    num_classes = 5  # normal, panic, violent, congestion, obstacle
    weight_decay = 0.01
    base_path = "kaggle/input/avenue-dataset/avenue/avenue"

cfg = CFG()

# DATASET

In [5]:
class VideoFrameDataset(Dataset):
    def __init__(self, root, transform=None):
        self.samples = []
        self.transform = transform

        for label, cls in enumerate(["normal", "abnormal"]):
            cls_path = os.path.join(root, cls)
            if not os.path.exists(cls_path):
                continue
            for vid in os.listdir(cls_path):
                frames = sorted(os.listdir(os.path.join(cls_path, vid)))
                self.samples.append((os.path.join(cls_path, vid), frames, label))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, frames, label = self.samples[idx]
        selected = frames[:cfg.frames]

        clip = []
        for f in selected:
            img = cv2.imread(os.path.join(path, f))
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            img = cv2.resize(img, (cfg.img_size, cfg.img_size))
            if self.transform:
                img = self.transform(img)
            clip.append(img)

        clip = torch.stack(clip)
        return clip, label

In [6]:
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485,0.456,0.406],
                         std=[0.229,0.224,0.225])
])

# VIDEO PROCESSING
1. Video Acquisition / Input
2. Frame Extraction
3. Frame Sampling / Temporal Sampling
4. Frame Resizing
5. Frame Normalization
6. Data Augmentation
7. Frame Stacking / Tensor Formation
8. Temporal Alignment
9. Feature Extraction
10. Label Encoding and Dataset Formatting
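
Step 6 (data augmentation) has no code in the cells that follow. A minimal sketch is given here, assuming clips are `(T, H, W, C)` float arrays in `[0, 1]` and that augmentation runs before normalization. The function name and parameters are hypothetical. The same random flip and brightness change are applied to every frame so the clip stays temporally consistent.

```python
import numpy as np

def augment_clip(clip, rng, flip_prob=0.5, max_brightness=0.2):
    """clip: (T, H, W, C) float array in [0, 1]; one random transform shared by all frames."""
    if rng.random() < flip_prob:
        clip = clip[:, :, ::-1, :]                    # horizontal flip along the width axis
    scale = 1.0 + rng.uniform(-max_brightness, max_brightness)
    return np.clip(clip * scale, 0.0, 1.0)            # brightness jitter, kept in range

rng = np.random.default_rng(0)
clip = rng.random((16, 224, 224, 3))
augmented = augment_clip(clip, rng)
```

Sampling the transform once per clip, rather than once per frame, avoids introducing artificial flicker that a temporal model could mistake for motion.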

In [7]:
#importing necessary libraries for video processing
import os
import cv2
import numpy as np
import torch

In [8]:
# Configuration
FRAME_SIZE = (224, 224)
CLIP_LENGTH = 16
CLIP_STRIDE = 1

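# float32 constants keep normalized frames in float32 instead of promoting them to float64
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD  = np.array([0.229, 0.224, 0.225], dtype=np.float32)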

## LOAD VIDEO

In [9]:
def load_video(video_path):
    cap = cv2.VideoCapture(video_path)
    frames = []

    if not cap.isOpened():
        raise IOError(f"Cannot open video {video_path}")

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(frame)

    cap.release()
    return np.array(frames)  # (T, H, W, C)

## TEMPORAL SAMPLING

In [10]:
def temporal_sampling(frames, num_frames):
    total = len(frames)

    if total <= num_frames:
        return frames

    indices = np.linspace(0, total - 1, num_frames).astype(int)
    return frames[indices]

## RESIZE FRAMES

In [11]:
def resize_frames(frames, size):
    resized = []
    for frame in frames:
        resized.append(
            cv2.resize(frame, size, interpolation=cv2.INTER_LINEAR)
        )
    return np.array(resized)

## NORMALIZE PIXEL VALUE

In [12]:
def normalize_frames(frames, mean, std):
    frames = frames.astype(np.float32) / 255.0
    return (frames - mean) / std

## SLIDING WINDOW CLIP GENERATION

In [13]:
def generate_clips(frames, clip_length, stride):
    clips = []

    for start in range(0, len(frames) - clip_length + 1, stride):
        clip = frames[start:start + clip_length]
        clips.append(clip)

    return np.array(clips)  # (N, T, H, W, C)

## FRAME TO TENSOR CONVERSION

In [14]:
def to_tensor(clip):
    clip = torch.tensor(clip, dtype=torch.float32)
    return clip.permute(3, 0, 1, 2)  # (C, T, H, W)

## LOAD FRAME LEVEL GROUND TRUTH 

In [15]:
def load_ground_truth(gt_file):
    return np.loadtxt(gt_file, dtype=int)

## FRAME --> CLIP LABEL CONVERSION

In [16]:
def frame_to_clip_labels(frame_labels, clip_length, stride):
    clip_labels = []

    for start in range(0, len(frame_labels) - clip_length + 1, stride):
        clip_labels.append(
            int(frame_labels[start:start + clip_length].max())
        )

    return np.array(clip_labels)
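
To make the windowing concrete, here is a small worked example using the same sliding-window and max-label logic on a toy label sequence. The helper `num_clips` and the toy values are illustrative only.

```python
import numpy as np

def num_clips(num_frames, clip_length, stride):
    """Number of full windows that fit: floor((T - L) / S) + 1, or 0 if T < L."""
    if num_frames < clip_length:
        return 0
    return (num_frames - clip_length) // stride + 1

frame_labels = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])   # frames 5 and 6 are abnormal
clip_length, stride = 4, 2

starts = list(range(0, len(frame_labels) - clip_length + 1, stride))
# A clip is abnormal if any frame inside its window is abnormal.
clip_labels = [int(frame_labels[s:s + clip_length].max()) for s in starts]

print(starts)        # [0, 2, 4, 6]
print(clip_labels)   # [0, 1, 1, 1]
```

Because a single abnormal frame marks the whole clip, longer clips or smaller strides increase the share of clips labelled abnormal.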

In [None]:
def preprocess_video(video_path, gt_path=None, is_train=True):
    frames = load_video(video_path)
    frames = resize_frames(frames, FRAME_SIZE)
    frames = normalize_frames(frames, MEAN, STD)

    clips = generate_clips(frames, CLIP_LENGTH, CLIP_STRIDE)

    if is_train:
        labels = np.zeros(len(clips), dtype=int)
    else:
        frame_labels = load_ground_truth(gt_path)
        labels = frame_to_clip_labels(
            frame_labels, CLIP_LENGTH, CLIP_STRIDE
        )

    clips = [to_tensor(clip) for clip in clips]
    return clips, labels

In [None]:
video_path = "Avenue/testing/videos/01.avi"
gt_path    = "Avenue/testing/ground_truth/01_gt.txt"

clips, labels = preprocess_video(
    video_path,
    gt_path=gt_path,
    is_train=False
)

print("Total clips:", len(clips))
print("Clip shape:", clips[0].shape)
print("First 10 labels:", labels[:10])

# CNN BACKBONE

- CNN Wikipedia : https://en.wikipedia.org/wiki/Convolutional_neural_network
- Stanford : https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks/#inception-network
- Resnet : https://huggingface.co/microsoft/resnet-50
- Keras : https://keras.io/api/applications/resnet/
- Deep Residual Learning for Image Recognition : https://arxiv.org/abs/1512.03385
- https://docs.pytorch.org/vision/main/models/generated/torchvision.models.resnet50.html

In [39]:
class CNNBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = timm.create_model(
            "resnet50",
            pretrained=True,
            features_only=True
        )

    def forward(self, x):
        return self.backbone(x)[-1]

# SWIN TRANSFORMER

- Model : (Hugging face model link ) : https://huggingface.co/microsoft/swin-tiny-patch4-window7-224
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows : https://arxiv.org/abs/2103.14030
- Swin Transformer (Microsoft) : https://github.com/microsoft/Swin-Transformer
- Swin Transformer (Hugging Face) : https://huggingface.co/docs/transformers/v4.39.1/model_doc/swin
- Image Classification with Swin Transformer : https://keras.io/examples/vision/swin_transformers/
- Swin Transformer V2: Scaling Up Capacity and Resolution : https://ieeexplore.ieee.org/document/9879380
- https://www.microsoft.com/en-us/research/blog/swin-transformer-supports-3-billion-parameter-vision-models-that-can-train-with-higher-resolution-images-for-greater-task-applicability/

In [40]:
class SwinBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.swin = timm.create_model(
            "swin_tiny_patch4_window7_224",
            pretrained=True,
            num_classes=0
        )

    def forward(self, x):
        return self.swin(x)

# SPATIO TEMPORAL 

- Link :  https://docs.pytorch.org/docs/stable/generated/torch.nn.Conv3d.html
- Keras : https://keras.io/api/layers/convolution_layers/convolution3d/

In [41]:
class SpatioTemporal(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.conv3d = nn.Conv3d(dim, dim, kernel_size=3, padding=1)

    def forward(self, x):
        x = rearrange(x, "b t c h w -> b c t h w")
        x = self.conv3d(x)
        return rearrange(x, "b c t h w -> b t c h w")

#  GRAPH CONSTRUCTION LAYER

### Research Paper
1. Design of an improved graph-based model for real-time threat detection and dynamic evacuation management using crowd behaviour analysis. Link : https://link.springer.com/article/10.1007/s41870-025-02489-x
2. Human crowd behaviour analysis based on video segmentation and classification using expectation–maximization with deep learning architectures Link : https://link.springer.com/article/10.1007/s11042-024-18630-0
3. Recent Deep Learning in Crowd Behaviour Analysis : A Brief Review Link : https://arxiv.org/pdf/2505.18401
4. Crowd Abnormal Behaviour Detection and Comparative Analysis using YOLO Network Link : http://ieeexplore.ieee.org/document/10543372
5. The Battle of Westminster : Developing the social identity model of crowd behaviour in order to explain the initiation and development of collective conflict. Link : https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-0992(199601)26:1%3C115::AID-EJSP740%3E3.0.CO;2-Z
6. Convolutional neural networks for crowd behaviour analysis : a survey Link : https://link.springer.com/article/10.1007/s00371-018-1499-
7. A novel framework and concept-based semantic search interface for abnormal crowd behaviour analysis in surveillance videos Link : https://link.springer.com/article/10.1007/s11042-020-08659-2
8. Crowd 11 : A Dataset for Fine Grained Crowd Behaviour Analysis Link : https://openaccess.thecvf.com/content_cvpr_2017_workshops/w37/papers/Dupont_Crowd-11_A_Dataset_CVPR_2017_paper.pdf
9. Crowd Behavioural Analysis at a Mass Gathering Event Link :
10. High-Level Feature Extraction for Crowd Behaviour Analysis : A Computer Vision Approach Link : https://link.springer.com/chapter/10.1007/978-3-031-13324-4_6
11. Crowd Emotion and Behavior Analysis Using Lightweight CNN model Link : https://www.internationaljournalssrg.org/IJEEE/paper-details?Id=863
12. Agile-LSTM : Acclimatizing Convolution Neural Network for Crowd Behaviour Analysis Link : https://link.springer.com/chapter/10.1007/978-981-16-1249-7_31
13. Violent Behaviour Analysis in Crowd  Link : https://ieeexplore.ieee.org/document/10493819

## Recent Papers
1. Exploring the role of layer variations in ANN Crowd Behaviour and Prediction Accuracy Link : https://www.cambridge.org/core/journals/proceedings-of-the-design-society/article/exploring-the-role-of-layer-variations-in-ann-crowd-behaviour-and-prediction-accuracy/402123BAB20E6C7A1D02CF763EF6B222
2. Automated Crowd Abnormality Detection and Segmentation Using Machine Learning Techniques Link : https://ieeexplore.ieee.org/document/10915397
3. Identification of crowd behaviour patterns using stability analysis

- torch.cdist : https://docs.pytorch.org/docs/stable/generated/torch.cdist.html
- torch.topk : https://docs.pytorch.org/docs/main/generated/torch.topk.html

In [42]:
def build_knn_graph(x, k=5):
    dist = torch.cdist(x, x)
    return dist.topk(k, largest=False).indices

# GRAPH NEURAL NETWORK

In [43]:
class SimpleGNN(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        return F.relu(self.fc(x))
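
`SimpleGNN` above applies a linear layer to each node and does not yet use the neighbour indices from `build_knn_graph`. As a hedged sketch of how the two could be connected, the layer below averages each node's k nearest neighbours and concatenates the result with the node's own features. The class name `KNNMeanLayer` and the concatenation design are assumptions for illustration, not the notebook's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KNNMeanLayer(nn.Module):  # hypothetical name, not part of the notebook
    def __init__(self, dim, k=5):
        super().__init__()
        self.k = k
        self.fc = nn.Linear(2 * dim, dim)

    def forward(self, x):
        # x: (B, N, D) node features, e.g. one node per frame; requires k <= N
        B = x.shape[0]
        dist = torch.cdist(x, x)                                # (B, N, N) pairwise distances
        idx = dist.topk(self.k, largest=False).indices          # (B, N, k), normally includes the node itself
        batch_idx = torch.arange(B, device=x.device).view(B, 1, 1)
        neighbours = x[batch_idx, idx]                          # (B, N, k, D) gathered neighbour features
        agg = neighbours.mean(dim=2)                            # mean aggregation over neighbours
        return F.relu(self.fc(torch.cat([x, agg], dim=-1)))     # (B, N, D)

layer = KNNMeanLayer(dim=8, k=3)
out = layer(torch.randn(2, 6, 8))
```

Mean aggregation is only one option; attention-weighted or max aggregation would slot into the same place.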

# TEMPORAL ATTENTION

In [44]:
class TemporalAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, 4, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out.mean(1)

In [45]:
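class CrowdBehaviorModel(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.cnn = CNNBackbone()
        self.swin = SwinBackbone()
        # ResNet-50's last stage outputs 2048 channels and Swin-Tiny pools to a
        # 768-d vector, so both are projected to a shared width before fusion.
        self.cnn_proj = nn.Conv2d(2048, dim, kernel_size=1)
        self.swin_proj = nn.Linear(768, dim)
        self.st = SpatioTemporal(dim)
        self.gnn = SimpleGNN(dim)
        self.temporal = TemporalAttention(dim)
        self.fc = nn.Linear(dim, cfg.num_classes)

    def forward(self, x):
        B, T, C, H, W = x.shape                       # clip batch: (B, T, C, H, W)
        x = rearrange(x, "b t c h w -> (b t) c h w")

        cnn_feat = self.cnn_proj(self.cnn(x))         # (B*T, dim, h, w)
        swin_feat = self.swin_proj(self.swin(x))      # (B*T, dim)
        swin_feat = swin_feat.unsqueeze(-1).unsqueeze(-1)

        # Broadcast the global Swin descriptor over the CNN's spatial grid.
        feat = cnn_feat + swin_feat
        feat = rearrange(feat, "(b t) c h w -> b t c h w", b=B)

        feat = self.st(feat)
        feat = feat.mean([-1, -2])                    # spatial average -> (B, T, dim)

        feat = self.gnn(feat)
        feat = self.temporal(feat)                    # attention over time -> (B, dim)

        return self.fc(feat)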

# TRAINING LOOP

- PyTorch : https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html
- Keras : https://keras.io/api/optimizers/adamw/
- Cross Entropy Loss
- PyTorch : https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
- Keras (Probabilistic losses ) : https://keras.io/api/losses/probabilistic_losses/

In [46]:
model = CrowdBehaviorModel().to(cfg.device)
optimizer = torch.optim.AdamW(
    model.parameters(), lr=cfg.lr, weight_decay=cfg.weight_decay)
criterion = nn.CrossEntropyLoss()

In [48]:
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
from torch.optim.lr_scheduler import CosineAnnealingLR

In [49]:
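# --- 1. DATA PREPARATION ---
# AvenueDataset is defined in the "Dataset Implementation" cell at the end of
# this notebook; run that cell first, otherwise this cell raises a NameError.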
train_dataset = AvenueDataset(cfg.base_path, split="training", transform=transform)
test_dataset = AvenueDataset(cfg.base_path, split="testing", transform=transform)

train_loader = DataLoader(train_dataset, batch_size=cfg.batch_size, shuffle=True, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=cfg.batch_size, shuffle=False, num_workers=2)

# --- 2. MODEL, LOSS, & OPTIMIZER ---
model = CrowdBehaviorModel().to(cfg.device)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1) # Improved for generalization
optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.lr, weight_decay=cfg.weight_decay)
scheduler = CosineAnnealingLR(optimizer, T_max=cfg.epochs)

# --- 3. TRAINING & VALIDATION LOOP ---
def train_one_epoch(model, loader, optimizer, criterion):
    model.train()
    total_loss = 0
    for x, y in loader:
        x, y = x.to(cfg.device), y.to(cfg.device)
        optimizer.zero_grad()
        outputs = model(x)
        loss = criterion(outputs, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

def evaluate(model, loader):
    model.eval()
    all_preds = []
    all_labels = []
    all_probs = []
    
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(cfg.device), y.to(cfg.device)
            outputs = model(x)
            probs = F.softmax(outputs, dim=1)
            _, preds = torch.max(outputs, 1)
            
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(y.cpu().numpy())
            all_probs.extend(probs[:, 1].cpu().numpy()) # Probability for 'Abnormal' class
            
    return all_labels, all_preds, all_probs

# --- 4. EXECUTION ---
print("Starting Training...")
for epoch in range(cfg.epochs):
    avg_loss = train_one_epoch(model, train_loader, optimizer, criterion)
    scheduler.step()
    print(f"Epoch {epoch+1}/{cfg.epochs} | Loss: {avg_loss:.4f} | LR: {scheduler.get_last_lr()[0]:.6f}")

# Final Evaluation
labels, preds, probs = evaluate(model, test_loader)

NameError: name 'AvenueDataset' is not defined

In [50]:
def plot_results(labels, preds, probs):
    # 1. Confusion Matrix
    cm = confusion_matrix(labels, preds)
    plt.figure(figsize=(6,5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Normal', 'Abnormal'], yticklabels=['Normal', 'Abnormal'])
    plt.title('Confusion Matrix')
    plt.show()

    # 2. AUC-ROC Curve
    
    auc = roc_auc_score(labels, probs)
    print(f"\nArea Under Curve (AUC): {auc:.4f}")
    print("\nClassification Report:")
    print(classification_report(labels, preds, target_names=['Normal', 'Abnormal']))

plot_results(labels, preds, probs)

NameError: name 'labels' is not defined

### Swin Transformer (Shifted Window Transformer)

Overview of Swin Transformer

Architecture of Swin Transformer
1. Patch Splitting
2. Window-Based Self-Attention
3. Shifted Windows for Cross-Region Interaction
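
The window and shift steps can be sketched in a few lines. This is a simplified illustration, assuming a channels-last feature map and a toy 8x8 grid with window size 4; the attention itself and the mask that shifted-window attention applies to wrapped-around regions are omitted.

```python
import torch

def window_partition(x, window_size):
    """x: (B, H, W, C) -> (B * num_windows, window_size, window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

feat = torch.arange(2 * 8 * 8 * 3, dtype=torch.float32).view(2, 8, 8, 3)

# Regular windows: attention would run independently inside each 4x4 window.
windows = window_partition(feat, window_size=4)              # (8, 4, 4, 3)

# Shifted windows: cyclically roll the map by half a window, then partition again,
# so the new windows straddle the borders of the previous ones.
shifted = torch.roll(feat, shifts=(-2, -2), dims=(1, 2))
shifted_windows = window_partition(shifted, window_size=4)   # (8, 4, 4, 3)
```

Alternating the two partitions across layers is what lets information cross window borders while attention cost stays linear in image size.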

### Implementation of Swin Transformer

#### Step 1. Setup Environment

In [2]:
#!pip install transformers datasets torch torchvision

#### Step 2. Import Libraries

In [7]:
from transformers import AutoImageProcessor, SwinForImageClassification
from datasets import load_dataset
import torch

In [6]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

#### Load Pre-Trained Model

In [4]:
model_name = "microsoft/swin-tiny-patch4-window7-224"
image_processor = AutoImageProcessor.from_pretrained(model_name)
model = SwinForImageClassification.from_pretrained(model_name)


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.



### Load Dataset

In [8]:
dataset = load_dataset("cifar10", split="test[:8]")


### Extract Images and Labels

In [16]:
images = [item["img"] for item in dataset]
labels = [item["label"] for item in dataset]

### Preprocess Images

In [17]:
inputs = image_processor(images, return_tensors="pt").to(model.device)

### Classify Images

In [19]:
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

### Process Predictions

In [20]:
predicted_labels = logits.argmax(dim=-1).cpu().numpy()

### Handle Label Mismatches

In [21]:
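num_classes = len(model.config.id2label)  # label space of the pretrained checkpoint
num_cifar_classes = 10
if num_classes != num_cifar_classes:
    print("Warning: Model label space does not match CIFAR-10 labels. Mapping may be required.")
    # Placeholder mapping only: folding class ids modulo 10 keeps the indices in
    # range but is not a semantic mapping, so the predictions below are not
    # meaningful until the model is fine-tuned on CIFAR-10 labels.
    class_mapping = {i: i % 10 for i in range(num_classes)}
    predicted_labels = [class_mapping[int(label)] for label in predicted_labels]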



### Map Predictions to Class Names

In [22]:
class_names = [
    "airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"
]
predicted_class_names = [class_names[label] for label in predicted_labels]
true_class_names = [class_names[label] for label in labels]

### Print Results

In [23]:
for i, (true_label, predicted_label) in enumerate(zip(true_class_names, predicted_class_names)):
    print(
        f"Image {i + 1}: True Label = {true_label}, Predicted Label = {predicted_label}")

Image 1: True Label = cat, Predicted Label = dog
Image 2: True Label = ship, Predicted Label = automobile
Image 3: True Label = ship, Predicted Label = ship
Image 4: True Label = airplane, Predicted Label = bird
Image 5: True Label = frog, Predicted Label = bird
Image 6: True Label = frog, Predicted Label = ship
Image 7: True Label = automobile, Predicted Label = dog
Image 8: True Label = frog, Predicted Label = automobile


### Advantages
1. Efficient on High Resolution Images
2. Versatile
3. Reduced Computational Complexity
4. Strong Real-World Performance

### Limitations
1. Limited Global Context
2. Increased Complexity
3. Resource Demands for Large Models
4. Weaker Local Inductive Bias

In [None]:
# Cell 4: Dataset Implementation
class AvenueDataset(Dataset):
    def __init__(self, root, split="training", transform=None):
        self.samples = []
        self.transform = transform
        self.split = split
        
        split_path = os.path.join(root, split, "frames")
        if not os.path.exists(split_path):
            print(f"Warning: Path {split_path} not found.")
            return

        for vid_folder in sorted(os.listdir(split_path)):
            vid_path = os.path.join(split_path, vid_folder)
            frames = sorted([f for f in os.listdir(vid_path) if f.endswith(('.jpg', '.png'))])
            
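            # Avenue logic: training videos are treated as all normal (0);
            # abnormal events (1) appear in the testing videos.
            # Caveat: one label per split leaves the test set with a single class,
            # so roc_auc_score needs frame-level ground truth to be defined.
            label = 0 if split == "training" else 1 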
            
            if len(frames) >= cfg.frames:
                self.samples.append((vid_path, frames, label))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, frames, label = self.samples[idx]
        # Temporal Sampling: Select frames evenly across the clip
        indices = np.linspace(0, len(frames) - 1, cfg.frames).astype(int)
        
        clip = []
        for i in indices:
            img = cv2.imread(os.path.join(path, frames[i]))
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            img = cv2.resize(img, (cfg.img_size, cfg.img_size))
            if self.transform:
                img = self.transform(img)
            clip.append(img)

        return torch.stack(clip), label

# Cell 5: DataLoaders
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

train_ds = AvenueDataset(cfg.base_path, split="training", transform=transform)
test_ds = AvenueDataset(cfg.base_path, split="testing", transform=transform)

train_loader = DataLoader(train_ds, batch_size=cfg.batch_size, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=cfg.batch_size, shuffle=False)