Implementing GoogLeNet (Inception v1) from scratch in PyTorch. Instead of ImageNet, I use Tiny ImageNet, which has 200 classes instead of 1000. I added BatchNorm (which the original GoogLeNet did not use) to help with generalization. The model will be trained on Kaggle with a TPU VM v3-8.

The original GoogLeNet architecture for a 224x224 image is as follows:

![Old_Architecture](figures/Original_arch.png)

***d2l.ai***

![Google_Architecture](figures/Google_arch.png)

***GoogLeNet paper (Szegedy et al. 2015)***

Tiny ImageNet uses 64×64 images, which are much smaller than the original ImageNet images (224×224). If we applied the original GoogLeNet architecture, repeated pooling and strides would reduce the spatial dimensions too quickly, causing significant loss of spatial information. To address this, I modified the initial layers and pooling strategy to slow down how quickly the spatial size shrinks. This preserves more local features throughout the network. The output channels after each block still match the original GoogLeNet design.
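
To make the shrinkage concrete, here is a small sketch that applies the usual output-size formula, floor((n + 2p - k) / s) + 1, to each spatial-size-changing layer. The two layer lists are assumptions made for illustration: `ORIGINAL_LAYERS` follows the d2l.ai-style layout (7×7 stride-2 conv, then 3×3 stride-2 pools with padding 1), and `MODIFIED_LAYERS` mirrors the layers used in the model defined later in this notebook. Inception blocks and 1×1 convs are left out because they keep the spatial size unchanged.

```python
def out_size(n, kernel, stride, padding):
    """Spatial output size of a conv / max-pool layer (floor mode)."""
    return (n + 2 * padding - kernel) // stride + 1


def trace(n, layers):
    """Return the spatial size before and after each (kernel, stride, padding) layer."""
    sizes = [n]
    for kernel, stride, padding in layers:
        n = out_size(n, kernel, stride, padding)
        sizes.append(n)
    return sizes


# Assumed d2l.ai-style layout of the original network.
ORIGINAL_LAYERS = [
    (7, 2, 3),  # stem conv 7x7, stride 2
    (3, 2, 1),  # stem max pool
    (3, 1, 1),  # stem conv 3x3
    (3, 2, 1),  # max pool before the Inception 3 stage
    (3, 2, 1),  # max pool before the Inception 4 stage
    (3, 2, 1),  # max pool before the Inception 5 stage
]

# Layout used in this notebook: stride-1 3x3 stem conv, same pools afterwards.
MODIFIED_LAYERS = [
    (3, 1, 1),  # stem conv 3x3, stride 1
    (3, 2, 1),  # stem max pool
    (3, 1, 1),  # stem conv 3x3
    (3, 2, 1),  # max pool before the Inception 3 stage
    (3, 2, 1),  # MaxPool3
    (3, 2, 1),  # MaxPool4
]

print(trace(224, ORIGINAL_LAYERS))  # [224, 112, 56, 56, 28, 14, 7]
print(trace(64, ORIGINAL_LAYERS))   # [64, 32, 16, 16, 8, 4, 2]
print(trace(64, MODIFIED_LAYERS))   # [64, 64, 32, 32, 16, 8, 4]
```

Under these assumptions, the unmodified layout would leave the last Inception stage with only a 2×2 feature map on a 64×64 input, while the modified layout keeps it at 4×4.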

I also drew out the architecture by hand to help visualize the tensor shapes. S1 means stride = 1, P1 means padding = 1. 

![Architecture](figures/New_arch.png)


Importing the libraries and loading the dataset, which is stored on my laptop.

In [None]:
import torch
import torchvision
from torch import nn, optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.Resize((64,64)),
    transforms.ToTensor()
])

train_dataset = datasets.ImageFolder(root='tiny-imagenet-200/train', transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=4, pin_memory=True)

val_dataset = datasets.ImageFolder(root='tiny-imagenet-200/val', transform=transform)
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False, num_workers=4, pin_memory=True)
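
One thing worth checking before trusting validation numbers: `ImageFolder` takes the class label from the sub-folder name. In the standard Tiny ImageNet download (an assumption about the layout on disk), all validation images sit in a single `val/images/` folder and the labels live in a tab-separated `val/val_annotations.txt`, so every validation image would end up in one class called `images`. The sketch below is a hypothetical helper, `reorganize_val`, that moves each image into `val/<wnid>/` so the folder names match the training set. It is not needed if the folder has already been restructured.

```python
import os
import shutil


def reorganize_val(val_dir):
    """Move val/images/<file> to val/<wnid>/<file> based on val_annotations.txt.

    Assumes each annotation line is tab-separated and starts with
    "<filename>\t<wnid>". Returns the number of files moved.
    """
    images_dir = os.path.join(val_dir, "images")
    annotations = os.path.join(val_dir, "val_annotations.txt")
    if not os.path.isdir(images_dir):
        return 0  # already reorganized, or a different layout

    moved = 0
    with open(annotations) as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            filename, wnid = parts[0], parts[1]
            src = os.path.join(images_dir, filename)
            if not os.path.isfile(src):
                continue
            class_dir = os.path.join(val_dir, wnid)
            os.makedirs(class_dir, exist_ok=True)
            shutil.move(src, os.path.join(class_dir, filename))
            moved += 1

    # Remove the now-empty folder so ImageFolder does not see it as a class.
    if not os.listdir(images_dir):
        os.rmdir(images_dir)
    return moved
```

It would be called once, before building `val_dataset`, for example as `reorganize_val('tiny-imagenet-200/val')`.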

In [None]:
import matplotlib.pyplot as plt
img, label = train_dataset[0]

img = img.permute(1, 2, 0)

plt.imshow(img)
plt.title(f"Label index: {label}")
plt.axis('off')
plt.show()

![Fish](figures/fish.png)

Creating the Inception Block class. The architecture of the Inception Block is as follows:

| Path  | Operation                                   |
| ----- | ------------------------------------------- |
| **1** | 1×1 conv → BN → ReLU                        |
| **2** | 1×1 conv → BN → ReLU → 3×3 conv → BN → ReLU |
| **3** | 1×1 conv → BN → ReLU → 5×5 conv → BN → ReLU |
| **4** | 3×3 maxpool → 1×1 conv → BN → ReLU          |


Output channels for each layer (Total out = 1×1 + 3×3 + 5×5 + pool proj):

| Block  | 1×1 | 3×3 reduce (1×1) | 3×3 | 5×5 reduce (1×1) | 5×5 | pool proj (1×1) | Total out |
| ------ | --- | ---------------- | --- | ---------------- | --- | --------------- | --------- |
| **3a** | 64  | 96      | 128 | 16      | 32  | 32       | 256       |
| **3b** | 128 | 128     | 192 | 32      | 96  | 64       | 480       |
| **4a** | 192 | 96      | 208 | 16      | 48  | 64       | 512       |
| **4b** | 160 | 112     | 224 | 24      | 64  | 64       | 512       |
| **4c** | 128 | 128     | 256 | 24      | 64  | 64       | 512       |
| **4d** | 112 | 144     | 288 | 32      | 64  | 64       | 528       |
| **4e** | 256 | 160     | 320 | 32      | 128 | 128      | 832       |
| **5a** | 256 | 160     | 320 | 32      | 128 | 128      | 832       |
| **5b** | 384 | 192     | 384 | 48      | 128 | 128      | 1024      |
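
As a quick sanity check on the table, the total output of a block is the sum of the final conv of each path, and each block's input must equal the previous block's total. The sketch below encodes the table as `(c1, c2, c3, c4)` tuples in the same form the `InceptionBlock` constructor takes. The names `BLOCKS`, `EXPECTED_TOTAL`, and `out_channels` are only illustrative.

```python
# (c1, (c2_reduce, c2_out), (c3_reduce, c3_out), c4) for each block, copied from the table.
BLOCKS = {
    "3a": (64, (96, 128), (16, 32), 32),
    "3b": (128, (128, 192), (32, 96), 64),
    "4a": (192, (96, 208), (16, 48), 64),
    "4b": (160, (112, 224), (24, 64), 64),
    "4c": (128, (128, 256), (24, 64), 64),
    "4d": (112, (144, 288), (32, 64), 64),
    "4e": (256, (160, 320), (32, 128), 128),
    "5a": (256, (160, 320), (32, 128), 128),
    "5b": (384, (192, 384), (48, 128), 128),
}

EXPECTED_TOTAL = {
    "3a": 256, "3b": 480, "4a": 512, "4b": 512, "4c": 512,
    "4d": 528, "4e": 832, "5a": 832, "5b": 1024,
}


def out_channels(c1, c2, c3, c4):
    """Channels after concatenating the four paths along the channel dimension."""
    return c1 + c2[1] + c3[1] + c4


in_ch = 192  # channels coming out of the stem
for name, cfg in BLOCKS.items():
    total = out_channels(*cfg)
    assert total == EXPECTED_TOTAL[name], name
    print(f"Inception{name}: in={in_ch:4d} -> out={total:4d}")
    in_ch = total  # max-pool layers between stages do not change the channel count
```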


In [22]:
class InceptionBlock(nn.Module):
    def __init__(self, in_ch: int, c1: int, c2: tuple, c3: tuple, c4: int):
        super(InceptionBlock, self).__init__()
        self.c1_out = nn.Sequential(
            nn.Conv2d(in_channels=in_ch, out_channels=c1, kernel_size=1),
            nn.BatchNorm2d(c1),
            nn.ReLU()
        )
        self.c2_out = nn.Sequential(
            nn.Conv2d(in_channels=in_ch, out_channels=c2[0], kernel_size=1),
            nn.BatchNorm2d(c2[0]),
            nn.ReLU(),
            nn.Conv2d(in_channels=c2[0], out_channels=c2[1], kernel_size=3, padding=1),
            nn.BatchNorm2d(c2[1]),
            nn.ReLU()
        )
        self.c3_out = nn.Sequential(
            nn.Conv2d(in_channels=in_ch, out_channels=c3[0], kernel_size=1),
            nn.BatchNorm2d(c3[0]),
            nn.ReLU(),
            nn.Conv2d(in_channels=c3[0], out_channels=c3[1], kernel_size=5, padding=2),
            nn.BatchNorm2d(c3[1]),
            nn.ReLU()
        )
        self.c4_out = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, padding=1, stride=1),
            nn.Conv2d(in_channels=in_ch, out_channels=c4, kernel_size=1),
            nn.BatchNorm2d(c4),
            nn.ReLU()
        )

    def forward(self, X):
        out1 = self.c1_out(X)
        out2 = self.c2_out(X)
        out3 = self.c3_out(X)
        out4 = self.c4_out(X)
        return torch.cat([out1, out2, out3, out4], dim=1)

In [None]:
class GoogLeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            nn.Conv2d(in_channels=64, out_channels=192, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(192),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        )

        self.Inception3a = InceptionBlock(in_ch=192, c1=64, c2=(96, 128), c3=(16, 32), c4=32)   # Out Channels = 256
        self.Inception3b = InceptionBlock(in_ch=256, c1=128, c2=(128, 192), c3=(32, 96), c4=64) # Out Channels = 480

        self.MaxPool3 = nn.MaxPool2d(kernel_size=3, stride=2, padding=1) # Shapes to 480x8x8

        self.Inception4a = InceptionBlock(in_ch=480, c1=192, c2=(96, 208), c3=(16, 48), c4=64)    # Out Channels = 512
        self.Inception4b = InceptionBlock(in_ch=512, c1=160, c2=(112, 224), c3=(24, 64), c4=64)   # Out Channels = 512
        self.Inception4c = InceptionBlock(in_ch=512, c1=128, c2=(128, 256), c3=(24, 64), c4=64)   # Out Channels = 512
        self.Inception4d = InceptionBlock(in_ch=512, c1=112, c2=(144, 288), c3=(32, 64), c4=64)   # Out Channels = 528
        self.Inception4e = InceptionBlock(in_ch=528, c1=256, c2=(160, 320), c3=(32, 128), c4=128) # Out Channels = 832

        self.MaxPool4 = nn.MaxPool2d(kernel_size=3, stride=2, padding=1) # Shapes to 832x4x4

        self.Inception5a = InceptionBlock(in_ch=832, c1=256, c2=(160, 320), c3=(32, 128), c4=128) # Out Channels = 832
        self.Inception5b = InceptionBlock(in_ch=832, c1=384, c2=(192, 384), c3=(48, 128), c4=128) # Out Channels = 1024

        self.GlobalPool = nn.AdaptiveAvgPool2d((1,1)) # 1024x1x1
        self.Flatten = nn.Flatten(1)
        self.dropout = nn.Dropout(0.4)
        self.Linear = nn.Linear(in_features=1024, out_features=200)
    
    def forward(self, x):
        x = self.stem(x)
        x = self.Inception3a(x)
        x = self.Inception3b(x)
        x = self.MaxPool3(x)
        x = self.Inception4a(x)
        x = self.Inception4b(x)
        x = self.Inception4c(x)
        x = self.Inception4d(x)
        x = self.Inception4e(x)
        x = self.MaxPool4(x)
        x = self.Inception5a(x)
        x = self.Inception5b(x)
        x = self.GlobalPool(x)
        x = self.Flatten(x)
        x = self.dropout(x)
        x = self.Linear(x)
        return x

Testing that a forward pass works and produces the expected output shape:

In [None]:
model = GoogLeNet()

X = torch.randn(1, 3, 64, 64)
out = model(X)

out.shape # Should be [1, 200]

torch.Size([1, 200])
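
Beyond the final shape, it can be useful to see the shape after every top-level stage and compare it against the hand-drawn diagram. The following is a minimal sketch using forward hooks. `trace_shapes` is a hypothetical helper, and the small `toy` network is only a stand-in so the snippet runs on its own. It assumes each top-level child returns a single tensor.

```python
import torch
from torch import nn


def trace_shapes(model, x):
    """Run one forward pass and record the output shape of each top-level child."""
    shapes = []
    handles = []
    for name, module in model.named_children():
        # Bind `name` as a default argument so each hook keeps its own label.
        def hook(mod, inputs, output, name=name):
            shapes.append((name, tuple(output.shape)))
        handles.append(module.register_forward_hook(hook))

    was_training = model.training
    model.eval()  # avoid updating BatchNorm statistics during the probe
    with torch.no_grad():
        model(x)
    model.train(was_training)

    for handle in handles:
        handle.remove()
    return shapes


# Stand-in network; pass GoogLeNet() instead to trace the real model.
toy = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(1),
)

for name, shape in trace_shapes(toy, torch.randn(1, 3, 64, 64)):
    print(name, shape)
```

Hooking only the top-level children keeps the output short: for the model above that would be one line for the stem, one per Inception block and pooling layer, and one each for the classifier layers.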

The original GoogLeNet was trained on ImageNet using Cross-Entropy Loss, SGD with momentum=0.9, learning rate = 0.01, weight decay = 0.0002, learning rate decay = 4% every 8 epochs (gamma=0.96 in StepLR), and was trained for 68 epochs.
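
To get a feel for how gentle this schedule is, the learning rate under a step decay can be computed in closed form: `lr = base_lr * gamma ** (epoch // step_size)`, which is the rule `StepLR` applies when `scheduler.step()` is called once per epoch. The helper name `step_lr` below is hypothetical, and the values are the ones quoted above.

```python
def step_lr(base_lr, gamma, step_size, epoch):
    """Learning rate after `epoch` scheduler steps under a step decay."""
    return base_lr * gamma ** (epoch // step_size)


# Print the learning rate at each point where it changes.
for epoch in range(0, 68, 8):
    lr = step_lr(base_lr=0.01, gamma=0.96, step_size=8, epoch=epoch)
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

With a 4% cut every 8 epochs the rate drops slowly, so most of the training run happens close to the initial learning rate.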

I’m following these hyperparameters but training for 94 epochs.

The original model also used auxiliary classifiers, but I did not implement those.

In [None]:
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import StepLR

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

model = GoogLeNet().to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0002)
scheduler = StepLR(optimizer, step_size=8, gamma=0.96)

epochs = 50

for epoch in range(epochs):
    model.train()
    total_loss = 0.0
    correct = 0
    total = 0
    
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        
        optimizer.zero_grad()
        yhat = model(xb)
        loss = loss_fn(yhat, yb)
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * xb.size(0)
        preds = torch.argmax(yhat, dim=1)
        correct += (preds == yb).sum().item()
        total += xb.size(0)

    avg_loss = total_loss / total
    accuracy = correct / total

    model.eval()
    val_loss = 0.0
    val_correct = 0
    val_total = 0
    with torch.no_grad():
        for xb, yb in val_loader:
            xb, yb = xb.to(device), yb.to(device)
            yhat = model(xb)
            loss = loss_fn(yhat, yb)
            val_loss += loss.item() * xb.size(0)
            preds = torch.argmax(yhat, dim=1)
            val_correct += (preds == yb).sum().item()
            val_total += xb.size(0)
    
    avg_val_loss = val_loss / val_total
    val_accuracy = val_correct / val_total

    scheduler.step()
    current_lr = scheduler.get_last_lr()[0]

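    # Log the learning rate as well, so the StepLR schedule is visible in the output.
    print(f"Epoch {epoch+1}: "
          f"Train Loss: {avg_loss:.4f}, Train Acc: {accuracy:.4f}, "
          f"Val Loss: {avg_val_loss:.4f}, Val Acc: {val_accuracy:.4f}, "
          f"LR: {current_lr:.6f}")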


![Results](figures/results.png)

A Top-1 validation accuracy of around 50% is in line with what can be expected from a GoogLeNet built and trained from scratch on Tiny ImageNet.
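
The training loop only tracks Top-1 accuracy (the `argmax` prediction). ImageNet-style results are often reported as Top-5 as well, so here is a minimal sketch of a Top-k metric. The function name `topk_accuracy` and the toy tensors are assumptions for illustration, not part of the run above.

```python
import torch


def topk_accuracy(logits, targets, k=1):
    """Fraction of samples whose true class is among the k highest-scoring classes.

    logits:  float tensor of shape (N, num_classes)
    targets: integer tensor of shape (N,)
    """
    topk = logits.topk(k, dim=1).indices                # (N, k) class indices
    hits = (topk == targets.unsqueeze(1)).any(dim=1)    # (N,) True where target is in top k
    return hits.float().mean().item()


# Toy example with 3 samples and 3 classes.
logits = torch.tensor([
    [0.10, 0.90, 0.00],
    [0.80, 0.15, 0.05],
    [0.20, 0.30, 0.50],
])
targets = torch.tensor([1, 1, 0])

print(topk_accuracy(logits, targets, k=1))  # only the first sample is correct
print(topk_accuracy(logits, targets, k=2))  # the first two samples are correct
```

With `k=1` this matches the `torch.argmax` comparison used in the loop. To report Top-5 over the whole validation set, the per-batch hit counts would need to be accumulated the same way `val_correct` is.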