# EfficientNet Architecture Overview

## Introduction

EfficientNet is a family of **convolutional neural networks (CNNs)** designed to achieve **high accuracy with significantly fewer computational resources** compared to previous architectures.

It was introduced by **Mingxing Tan and Quoc V. Le (Google Research)** in their 2019 paper:

> **"EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks"**

The key innovation of EfficientNet is **compound scaling**, which uniformly scales:
- **Depth** (number of layers)
- **Width** (number of channels)
- **Input resolution**

using a single scaling coefficient.

---

## Compound Scaling Idea

Instead of arbitrarily increasing depth or width, EfficientNet uses a compound coefficient `φ`:

- Depth ∝ α^φ  
- Width ∝ β^φ  
- Resolution ∝ γ^φ  

subject to a computational constraint.

This balanced scaling leads to **better accuracy per FLOP**.

---

## EfficientNet-B0 Architecture

EfficientNet-B0 is the **baseline model** from which larger models (B1–B7) are derived.

The architecture consists of three main parts:

---

## 1. Stem

The stem is the initial feature extraction layer.

- Standard **3×3 convolution**
- **32 filters**
- **Stride = 2**
- Followed by:
  - Batch Normalization
  - ReLU6 (or Swish in original paper)

**Purpose:**
- Extract low-level features (edges, textures)
- Reduce spatial resolution

---

## 2. Body (MBConv Blocks)

The body is composed of multiple **MBConv (Mobile Inverted Bottleneck Convolution) blocks**.

### MBConv Block Components

Each MBConv block includes:

1. **Expansion Layer**
   - 1×1 convolution
   - Expands channels by a factor (e.g., ×6)

2. **Depthwise Convolution**
   - 3×3 or 5×5 convolution
   - Applied independently to each channel
   - Reduces computation cost

3. **Squeeze-and-Excitation (SE) Block**
   - Channel-wise attention mechanism
   - Learns which channels are important

4. **Projection Layer**
   - 1×1 convolution
   - Reduces channels back to desired output
   - Uses a **linear bottleneck** (no activation)

5. **Residual Connection**
   - Used when input and output shapes match

---

### MBConv Configuration Parameters

Each MBConv stage is defined by:

- **Expansion Ratio**
- **Kernel Size**
- **Stride**
- **Number of Output Channels**
- **Number of Repeats**
- **SE Ratio**

Different stages use different configurations to balance efficiency and accuracy.

---

## 3. Head

The head converts extracted features into final predictions.

- **1×1 convolution** to increase channels (usually to 1280)
- **Global Average Pooling**
- **Fully Connected (Dense) Layer**
- **Softmax activation** for classification

## Architecture Implementation
![efficientNet](https://learnopencv.com/wp-content/uploads/2019/06/EfficientNet-B0-architecture-1024x511.png)



## Import Libraries

In [1]:
import torch
import torch.nn as nn
from math import ceil

## Define parameters

In [2]:
base_model = [
    # expand_ratio, channels, repeats, stride, kernel_size
    [1, 16, 1, 1, 3],
    [6, 24, 2, 2, 3],
    [6, 40, 2, 2, 5],
    [6, 80, 3, 2, 3],
    [6, 112, 3, 1, 5],
    [6, 192, 4, 2, 5],
    [6, 320, 1, 1, 3],
]

phi_values = {
    # tuple of: (phi_value, resolution, drop_rate)
    "b0": (0, 224, 0.2),  # alpha, beta, gamma, depth = alpha ** phi
    "b1": (0.5, 240, 0.2),
    "b2": (1, 260, 0.3),
    "b3": (2, 300, 0.3),
    "b4": (3, 380, 0.4),
    "b5": (4, 456, 0.4),
    "b6": (5, 528, 0.5),
    "b7": (6, 600, 0.5),
}


## CNNBlock

In [3]:
class CNNBlock(nn.Module):
  def __init__(
      self,
      in_channels,
      out_channels,
      kernel_size,
      stride,
      padding,
      groups=1
  ):
    super(CNNBlock, self).__init__()
    self.cnn = nn.Conv2d(
        in_channels,
        out_channels,
        kernel_size,
        stride,
        padding,
        groups=groups, # if group=input channel, then a kernel is applied to one channel only
        bias=False
    )
    self.bn = nn.BatchNorm2d(out_channels)
    self.silu = nn.SiLU()

  def forward(self, x):
    return self.silu(self.bn(self.cnn(x)))




## Squeeze & Excitation

In [4]:
class SqueezeExcitation(nn.Module):
  def __init__(self, in_channels, reduced_dim):
    super(SqueezeExcitation, self).__init__()
    self.se = nn.Sequential(
        nn.AdaptiveAvgPool2d(1), # C x H x W -> C x 1 x 1
        nn.Conv2d(in_channels, reduced_dim, 1),
        nn.SiLU(),
        nn.Conv2d(reduced_dim, in_channels, 1),
        nn.Sigmoid(),
    )

  def forward(self, x):
    return x * self.se(x)


## Inverted residual block

In [5]:
class InvertedResidualBlock(nn.Module):
    def __init__(
        self,
        in_channels,
        out_channels,
        kernel_size,
        stride,
        padding,
        expand_ratio,
        reduction=4,  # squeeze excitation
        survival_prob=0.8,  # for stochastic depth
    ):
        super(InvertedResidualBlock, self).__init__()
        self.survival_prob = 0.8
        self.use_residual = in_channels == out_channels and stride == 1
        hidden_dim = in_channels * expand_ratio
        self.expand = in_channels != hidden_dim
        reduced_dim = int(in_channels / reduction)

        if self.expand:
            self.expand_conv = CNNBlock(
                in_channels,
                hidden_dim,
                kernel_size=3,
                stride=1,
                padding=1,
            )

        self.conv = nn.Sequential(
            CNNBlock(
                hidden_dim,
                hidden_dim,
                kernel_size,
                stride,
                padding,
                groups=hidden_dim,
            ),
            SqueezeExcitation(hidden_dim, reduced_dim),
            nn.Conv2d(hidden_dim, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def stochastic_depth(self, x):
        if not self.training:
            return x

        binary_tensor = (
            torch.rand(x.shape[0], 1, 1, 1, device=x.device) < self.survival_prob
        )
        return torch.div(x, self.survival_prob) * binary_tensor

    def forward(self, inputs):
        x = self.expand_conv(inputs) if self.expand else inputs

        if self.use_residual:
            return self.stochastic_depth(self.conv(x)) + inputs
        else:
            return self.conv(x)


## EfficientNet

In [6]:
class EfficientNet(nn.Module):
    def __init__(self, version, num_classes):
        super(EfficientNet, self).__init__()
        width_factor, depth_factor, dropout_rate = self.calculate_factors(version)
        last_channels = ceil(1280 * width_factor)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.features = self.create_features(width_factor, depth_factor, last_channels)
        self.classifier = nn.Sequential(
            nn.Dropout(dropout_rate),
            nn.Linear(last_channels, num_classes),
        )

    def calculate_factors(self, version, alpha=1.2, beta=1.1):
        phi, res, drop_rate = phi_values[version]
        depth_factor = alpha**phi
        width_factor = beta**phi
        return width_factor, depth_factor, drop_rate

    def create_features(self, width_factor, depth_factor, last_channels):
        channels = int(32 * width_factor)
        features = [CNNBlock(3, channels, 3, stride=2, padding=1)]
        in_channels = channels

        for expand_ratio, channels, repeats, stride, kernel_size in base_model:
            out_channels = 4 * ceil(int(channels * width_factor) / 4)
            layers_repeats = ceil(repeats * depth_factor)

            for layer in range(layers_repeats):
                features.append(
                    InvertedResidualBlock(
                        in_channels,
                        out_channels,
                        expand_ratio=expand_ratio,
                        stride=stride if layer == 0 else 1,
                        kernel_size=kernel_size,
                        padding=kernel_size // 2,  # if k=1:pad=0, k=3:pad=1, k=5:pad=2
                    )
                )
                in_channels = out_channels

        features.append(
            CNNBlock(in_channels, last_channels, kernel_size=1, stride=1, padding=0)
        )

        return nn.Sequential(*features)

    def forward(self, x):
        x = self.pool(self.features(x))
        return self.classifier(x.view(x.shape[0], -1))


## Dataset and Preprocessing

In [7]:
import numpy as np
import torch
import torch.nn as nn
from torchvision import datasets
from torchvision import transforms
from torch.utils.data.sampler import SubsetRandomSampler

In [8]:
def get_train_valid_loader(data_dir, batch_size, augment, random_seed, valid_size=0.1, shuffle=True):
    """Get training and validation data loaders for CIFAR-10"""

    # CIFAR-10 normalization
    normalize = transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2023, 0.1994, 0.2010]
    )

    # Common transforms
    common_transform = [
        transforms.Resize((224, 224)),  # Resize to standard ResNet input
        transforms.ToTensor(),
        normalize
    ]

    # Training transforms with optional augmentation
    if augment:
        train_transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.RandomHorizontalFlip(),
            transforms.RandomCrop(224, padding=4),
            transforms.ToTensor(),
            normalize
        ])
    else:
        train_transform = transforms.Compose(common_transform)

    valid_transform = transforms.Compose(common_transform)

    # Load datasets
    train_dataset = datasets.CIFAR10(
        root=data_dir, train=True, download=True, transform=train_transform
    )
    valid_dataset = datasets.CIFAR10(
        root=data_dir, train=True, download=True, transform=valid_transform
    )

    # Create train/valid split
    num_train = len(train_dataset)
    indices = list(range(num_train))
    split = int(np.floor(valid_size * num_train))

    if shuffle:
        np.random.seed(random_seed)
        np.random.shuffle(indices)

    train_idx, valid_idx = indices[split:], indices[:split]
    train_sampler = SubsetRandomSampler(train_idx)
    valid_sampler = SubsetRandomSampler(valid_idx)

    # Create data loaders
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, sampler=train_sampler
    )
    valid_loader = torch.utils.data.DataLoader(
        valid_dataset, batch_size=batch_size, sampler=valid_sampler
    )

    return train_loader, valid_loader

In [9]:
def get_test_loader(data_dir, batch_size, shuffle=True):
    """Get test data loader for CIFAR-10"""

    normalize = transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2023, 0.1994, 0.2010]
    )

    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        normalize
    ])

    dataset = datasets.CIFAR10(
        root=data_dir, train=False, download=True, transform=transform
    )

    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=batch_size, shuffle=shuffle
    )

    return data_loader

In [10]:
# Configuration
data_dir = './data'
num_classes = 10  # CIFAR-10 has 10 classes
num_epochs = 7
batch_size = 64
learning_rate = 0.01
version = "b0"
phi, res, drop_rate = phi_values[version]

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Load data
train_loader, valid_loader = get_train_valid_loader(
    data_dir=data_dir,
    batch_size=batch_size,
    augment=True,  # Enable data augmentation
    random_seed=1
)

test_loader = get_test_loader(data_dir=data_dir, batch_size=batch_size)

# Create model - FIXED: Using correct num_classes
model = EfficientNet(version, num_classes=num_classes).to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=learning_rate,
    weight_decay=0.0001,
    momentum=0.9
)

# Learning rate scheduler
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

Using device: cuda


100%|██████████| 170M/170M [00:04<00:00, 35.0MB/s]


## Training

In [11]:
# Training loop
total_step = len(train_loader)

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0

    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

        # Print progress every 100 steps
        if (i + 1) % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{total_step}], Loss: {loss.item():.4f}')

    # Epoch summary
    avg_loss = running_loss / total_step
    print(f'Epoch [{epoch+1}/{num_epochs}] - Average Loss: {avg_loss:.4f}')

    # Validation
    model.eval()
    with torch.no_grad():
        correct = 0
        total = 0
        for images, labels in valid_loader:
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        accuracy = 100 * correct / total
        print(f'Validation Accuracy: {accuracy:.2f}%\n')

    # Step the scheduler
    scheduler.step()



Epoch [1/7], Step [100/704], Loss: 2.1491
Epoch [1/7], Step [200/704], Loss: 1.9304
Epoch [1/7], Step [300/704], Loss: 1.6972
Epoch [1/7], Step [400/704], Loss: 1.4701
Epoch [1/7], Step [500/704], Loss: 1.3958
Epoch [1/7], Step [600/704], Loss: 1.2107
Epoch [1/7], Step [700/704], Loss: 1.3300
Epoch [1/7] - Average Loss: 1.6162
Validation Accuracy: 57.76%

Epoch [2/7], Step [100/704], Loss: 1.1870
Epoch [2/7], Step [200/704], Loss: 1.1734
Epoch [2/7], Step [300/704], Loss: 1.1404
Epoch [2/7], Step [400/704], Loss: 1.3840
Epoch [2/7], Step [500/704], Loss: 0.8958
Epoch [2/7], Step [600/704], Loss: 0.9078
Epoch [2/7], Step [700/704], Loss: 0.7448
Epoch [2/7] - Average Loss: 1.0084
Validation Accuracy: 70.72%

Epoch [3/7], Step [100/704], Loss: 0.6678
Epoch [3/7], Step [200/704], Loss: 0.9212
Epoch [3/7], Step [300/704], Loss: 0.5874
Epoch [3/7], Step [400/704], Loss: 0.8491
Epoch [3/7], Step [500/704], Loss: 0.5146
Epoch [3/7], Step [600/704], Loss: 0.6741
Epoch [3/7], Step [700/704], Los

## Testing

In [13]:
# Final test evaluation
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    test_accuracy = 100 * correct / total
    print(f'\nFinal Test Accuracy: {test_accuracy:.2f}%')


Final Test Accuracy: 85.21%
