# Multi-label Image Classification with PyTorch

* Using a **small real dataset**: Cats vs Dogs
* Simulating a **multi-label setup**
* Architecture: **Custom CNN**
* Loss: `BCEWithLogitsLoss` (which internally applies sigmoid)
* Output: 3 labels — `[dog, cat, indoor]`

In [None]:
!pip install torch torchvision scikit-learn --quiet

### Import Required Libraries

We import:

* PyTorch (core, nn, optim)
* Torchvision for image transforms and dataset loading
* scikit-learn for evaluation metrics
* Pillow for image handling
* NumPy for array manipulation

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms, models
from torch.utils.data import DataLoader, random_split, Dataset
import os
import numpy as np
from PIL import Image
import random
from sklearn.metrics import classification_report, multilabel_confusion_matrix

### Set Device
- Set the computation device to GPU (if available) or CPU.

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

### Define Image Transforms

We define standard preprocessing steps:

* Resize image to `128x128`
* Convert image to PyTorch Tensor (range `[0, 1]`)

In [4]:
transform = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.ToTensor(),
])

### Download and Load Dataset

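We use the **"cats\_and\_dogs\_filtered"** dataset, which is hosted by Google and used in TensorFlow tutorials.

* Download the ZIP with `requests` and extract it with Python's `zipfile`
* The extracted `train/` folder contains two class folders: `cats/` and `dogs/`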

In [5]:
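import shutil
import zipfile

import requests

url = "https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip"
zip_path = "cats_and_dogs_filtered.zip"

# Download the archive only if it is not already on disk
if not os.path.exists(zip_path):
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()  # fail early instead of saving an HTTP error page
        with open(zip_path, "wb") as f:
            shutil.copyfileobj(r.raw, f)

# Extract into the current directory (creates ./cats_and_dogs_filtered)
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    zip_ref.extractall("./")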

### Simulate Multi-label Targets

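The original dataset is **single-label**, so we simulate a third label, `indoor`, by assigning it at random:

| Source class | Target `[dog, cat, indoor]` |
| ------------ | --------------------------- |
| dog          | \[1, 0, random(0/1)]        |
| cat          | \[0, 1, random(0/1)]        |

Because `indoor` is drawn independently of the image, it carries no learnable signal; it is there only to demonstrate the multi-label mechanics.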

> ✅ This simulates real-world **multi-label scenarios** like:
>
> * A dog photo that’s also indoors
> * A cat that could be indoors or outdoors
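
The labeling rule in the table can be sketched as a small standalone function. The names `BASE_LABELS` and `simulate_label`, and the seed `42`, are hypothetical choices for illustration; passing a seeded `random.Random` makes the simulated `indoor` flags repeatable between runs.

```python
import random

# Base targets in the order [dog, cat, indoor]; the indoor slot is filled in later
BASE_LABELS = {"dogs": [1, 0, 0], "cats": [0, 1, 0]}

def simulate_label(class_name, rng):
    """Return a 3-element target with a randomly assigned indoor flag."""
    label = BASE_LABELS[class_name].copy()
    label[2] = rng.choice([0, 1])
    return label

rng = random.Random(42)  # seeded generator so the simulation is repeatable
dog_labels = [simulate_label("dogs", rng) for _ in range(1000)]
cat_label = simulate_label("cats", rng)

# The first two entries are fixed by the class; only the third varies
indoor_rate = sum(label[2] for label in dog_labels) / len(dog_labels)
print(dog_labels[0][:2], cat_label[:2])
print(indoor_rate)  # close to 0.5, since the flag is a fair coin flip
```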

### Custom PyTorch Dataset Class

We create a custom Dataset that:

* Iterates through `cats/` and `dogs/` folders
* Loads images and assigns simulated multi-label targets
* Applies transformations

In [6]:
class MultiLabelDogCatDataset(Dataset):
    def __init__(self, root_dir, transform=None):
        self.transform = transform
        self.images = []
        self.labels = []

        classes = ['cats', 'dogs']
        label_map = {'cats': [0, 1, 0], 'dogs': [1, 0, 0]}  # [dog, cat, indoor]

        for class_name in classes:
            class_dir = os.path.join(root_dir, class_name)
            for img_name in sorted(os.listdir(class_dir))[:300]:  # sorted so the 300-image subset per class is deterministic
                img_path = os.path.join(class_dir, img_name)
                base_label = label_map[class_name].copy()
                base_label[2] = random.choice([0, 1])  # simulate indoor
                self.images.append(img_path)
                self.labels.append(base_label)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = Image.open(self.images[idx]).convert("RGB")
        if self.transform:
            img = self.transform(img)
        label = torch.tensor(self.labels[idx], dtype=torch.float32)
        return img, label

### Split Dataset and Create DataLoaders

We:

* Split into `80% train`, `20% val`
* Use `DataLoader` for efficient batching

In [7]:
data_dir = "./cats_and_dogs_filtered/train"
dataset = MultiLabelDogCatDataset(data_dir, transform=transform)
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_ds, val_ds = random_split(dataset, [train_size, val_size])

In [10]:
train_size,val_size

(480, 120)

In [11]:
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=32)
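
The split above is different on every run because `random_split` draws from the global random state. A minimal sketch of a repeatable split follows, assuming a dummy `TensorDataset` of 600 items in place of the image dataset; the helper name `split` and the seed are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in dataset: 600 items with a 3-element target each (illustrative only)
dummy = TensorDataset(torch.zeros(600, 1), torch.zeros(600, 3))
train_size = int(0.8 * len(dummy))
val_size = len(dummy) - train_size

def split(seed):
    """Split the dataset 80/20 using a dedicated, seeded generator."""
    generator = torch.Generator().manual_seed(seed)
    return random_split(dummy, [train_size, val_size], generator=generator)

train_a, val_a = split(0)
train_b, val_b = split(0)
print(len(train_a), len(val_a))            # 480 120
print(train_a.indices == train_b.indices)  # True: same seed, same split

# Each batch pairs inputs with a (batch_size, 3) target tensor
batch_x, batch_y = next(iter(DataLoader(train_a, batch_size=32)))
print(batch_y.shape)  # torch.Size([32, 3])
```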

### Define Custom CNN Model

A lightweight CNN:

* 2 Convolutional layers with ReLU and MaxPool
* Flatten → Dense → ReLU → Dense output with 3 logits
* **No `sigmoid` at the end**

### Why No Sigmoid in the Model?

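> ❗ **Important**: We do **not** apply sigmoid inside the model because `BCEWithLogitsLoss` already applies it internally, in a numerically stable form.

If we **manually apply `sigmoid` before passing outputs to `BCEWithLogitsLoss`**, the sigmoid is applied twice, which causes:

* **Weaker gradients**: the extra sigmoid multiplies every gradient by its derivative, which is at most 0.25
* **A loss that cannot approach zero**: the loss treats values in (0, 1) as logits, so its internal probabilities stay between 0.5 and about 0.73, and the per-label loss never drops below about 0.31 for a positive target or about 0.69 for a negative one
* **Often lower accuracy**, because optimization is slower and works against that floor

>  Instead, pass **raw logits** to the loss, and apply `sigmoid()` **only at inference time** (for probabilities)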
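
The effect of a double sigmoid can be checked numerically. The sketch below uses made-up logits for a single, confidently correct prediction and compares three losses; the tensor values and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

# One sample whose logits are confidently correct for targets [1, 0, 1]
logits = torch.tensor([[8.0, -8.0, 8.0]])
targets = torch.tensor([[1.0, 0.0, 1.0]])

bce_with_logits = nn.BCEWithLogitsLoss()
bce = nn.BCELoss()

# Intended usage: raw logits go straight into the loss
loss_correct = bce_with_logits(logits, targets)

# Equivalent formulation: explicit sigmoid followed by plain BCELoss
loss_equivalent = bce(torch.sigmoid(logits), targets)

# Mistake: sigmoid applied by hand, then again inside BCEWithLogitsLoss
loss_double = bce_with_logits(torch.sigmoid(logits), targets)

print(loss_correct.item())     # near zero
print(loss_equivalent.item())  # matches loss_correct up to float error
print(loss_double.item())      # stays well above zero despite correct predictions
```

Even though every prediction here is right, the double-sigmoid loss remains large, which is the floor described above.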

In [12]:
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=3):
        super(SimpleCNN, self).__init__()
        self.conv_layers = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        self.fc_layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 32 * 32, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        x = self.conv_layers(x)
        x = self.fc_layers(x)
        return x  # no sigmoid here

model = SimpleCNN().to(device)
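
The `64 * 32 * 32` input size of the first linear layer follows from the two pooling layers halving a 128x128 input twice (128 to 64 to 32) with 64 output channels. A minimal sketch that confirms this with a dummy batch, rebuilding only the convolutional stack so it runs on its own:

```python
import torch
import torch.nn as nn

# Same convolutional stack as in SimpleCNN, rebuilt here for a standalone check
conv_layers = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

dummy = torch.zeros(1, 3, 128, 128)  # one fake 128x128 RGB image
with torch.no_grad():
    features = conv_layers(dummy)

print(features.shape)  # torch.Size([1, 64, 32, 32])
flat_size = features.flatten(1).shape[1]
print(flat_size)       # 65536, i.e. 64 * 32 * 32
```

If the resize in the transforms is changed, this number changes too, so rerunning a check like this is a cheap way to catch shape mismatches before training.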

### Set Up Loss and Optimizer

We define:

* **Loss**: `BCEWithLogitsLoss()` — used for multi-label tasks
* **Optimizer**: Adam with learning rate 0.001

In [13]:
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
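
For reference, with the default `reduction="mean"` the loss averages a binary cross-entropy term over every sample and every label. Writing $z_{ic}$ for the logit of label $c$ on sample $i$, $y_{ic} \in \{0, 1\}$ for its target, $N$ for the batch size, $C = 3$ for the number of labels, and $\sigma$ for the sigmoid (this notation is introduced here and is not from the notebook):

```latex
\mathcal{L}
  = -\frac{1}{N C} \sum_{i=1}^{N} \sum_{c=1}^{C}
    \Big[ y_{ic} \log \sigma(z_{ic})
        + (1 - y_{ic}) \log\big(1 - \sigma(z_{ic})\big) \Big],
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```

Each label contributes its own independent term, which is why several labels can be active for the same image.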

### Training Loop

We:

* Loop through each epoch
* Zero gradients, forward pass, compute loss
* Backprop and optimizer step
* Track loss for monitoring

In [14]:
for epoch in range(5):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {running_loss / len(train_loader):.4f}")


Epoch 1, Loss: 0.7998
Epoch 2, Loss: 0.6932
Epoch 3, Loss: 0.6922
Epoch 4, Loss: 0.6929
Epoch 5, Loss: 0.6883
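
The logged losses settle around 0.69, which is close to ln 2. That is the binary cross-entropy of a predictor that outputs 0.5 for every label, so one plausible reading (not a full diagnosis) is that the model is still near chance level after 5 epochs. The short sketch below computes that reference value; the helper `bce` is a hypothetical name for illustration.

```python
import math

def bce(p, y):
    """Binary cross-entropy for a single probability p and target y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A predictor that always outputs 0.5 pays the same loss for either target
chance = bce(0.5, 1)
print(round(chance, 4))       # 0.6931
print(bce(0.5, 0) == chance)  # True
```

Since the `indoor` target is assigned at random, its share of the loss is not expected to fall meaningfully below this value on unseen data.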


### Evaluation

After training:

* Set model to `eval()` mode
* Use `sigmoid()` to convert logits to probabilities
* Apply a threshold of `0.5` to get binary predictions
* Collect true and predicted labels

In [15]:
model.eval()
all_preds = []
all_targets = []

with torch.no_grad():
    for images, labels in val_loader:
        images = images.to(device)
        outputs = model(images)
        preds = torch.sigmoid(outputs).cpu().numpy()
        all_preds.append(preds)
        all_targets.append(labels.numpy())

In [16]:
y_pred = np.vstack(all_preds)
y_true = np.vstack(all_targets)
y_pred_binary = (y_pred > 0.5).astype(int)

In [18]:
y_pred[0],y_true[0]

(array([0.48333147, 0.5070958 , 0.5119954 ], dtype=float32),
 array([1., 0., 0.], dtype=float32))
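
To make the thresholding step concrete, the sketch below applies it to the single sample printed above. The variable names are hypothetical; the numbers are copied from that output.

```python
import numpy as np

# First validation sample from the output above
y_pred_sample = np.array([0.48333147, 0.5070958, 0.5119954], dtype=np.float32)
y_true_sample = np.array([1.0, 0.0, 0.0], dtype=np.float32)

# Each label is thresholded independently at 0.5
pred_binary = (y_pred_sample > 0.5).astype(int)
print(pred_binary)  # [0 1 1]

# Per-label correctness for [dog, cat, indoor]
correct = (pred_binary == y_true_sample).astype(int)
print(correct)      # [0 0 0]
```

All three probabilities sit within about 0.02 of the threshold, so this sample's predictions are essentially coin flips.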

### Performance Metrics

We use scikit-learn:

* `classification_report` → Precision, Recall, F1
* `multilabel_confusion_matrix` → Confusion matrix per class

In [19]:
print(classification_report(y_true, y_pred_binary, target_names=["dog", "cat", "indoor"]))

              precision    recall  f1-score   support

         dog       0.50      0.38      0.43        55
         cat       0.55      0.63      0.59        65
      indoor       0.49      0.81      0.61        58

   micro avg       0.51      0.61      0.56       178
   macro avg       0.51      0.61      0.54       178
weighted avg       0.52      0.61      0.55       178
 samples avg       0.53      0.60      0.54       178





In [20]:
conf_matrices = multilabel_confusion_matrix(y_true, y_pred_binary)
for i, cm in enumerate(conf_matrices):
    print(f"\nConfusion matrix for class {i}:\n{cm}")


Confusion matrix for class 0:
[[44 21]
 [34 21]]

Confusion matrix for class 1:
[[22 33]
 [24 41]]

Confusion matrix for class 2:
[[13 49]
 [11 47]]
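
Each matrix is laid out as `[[TN, FP], [FN, TP]]`, and classes 0, 1 and 2 correspond to `dog`, `cat` and `indoor`. As a cross-check, the sketch below recomputes precision, recall and F1 from the matrices printed above; the array is copied from that output and the variable names are hypothetical.

```python
import numpy as np

# Confusion matrices copied from the output above, one per class
conf = np.array([
    [[44, 21], [34, 21]],  # dog
    [[22, 33], [24, 41]],  # cat
    [[13, 49], [11, 47]],  # indoor
])
names = ["dog", "cat", "indoor"]

tp = conf[:, 1, 1]
fp = conf[:, 0, 1]
fn = conf[:, 1, 0]

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(names, precision, recall, f1):
    print(f"{name:>6}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The rounded values agree with the per-class rows of the classification report.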


### If You Have GPU + Large Storage

You can try with:

* **Open Images Dataset** (via TFDS or direct download)
* Contains **real multi-labels** (e.g., person + tree + car)
* Requires:

  * Label parsing (CSV or JSON)
  * Filtering useful categories
  * Handling missing/partial annotations
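
One common way to handle missing or partial annotations is to mask unknown labels out of the loss. The sketch below is a minimal illustration under that assumption, not a recipe specific to Open Images; the function `masked_bce_with_logits` and the `known_mask` convention (1 = annotated, 0 = unknown) are hypothetical.

```python
import torch
import torch.nn.functional as F

def masked_bce_with_logits(logits, targets, known_mask):
    """Average BCE over annotated labels only; unknown labels contribute nothing."""
    per_label = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    per_label = per_label * known_mask
    # clamp avoids division by zero when a batch has no annotated labels
    return per_label.sum() / known_mask.sum().clamp(min=1)

logits = torch.tensor([[2.0, -1.0, 0.5]])
targets = torch.tensor([[1.0, 0.0, 0.0]])
known = torch.tensor([[1.0, 1.0, 0.0]])  # third label is unannotated

loss = masked_bce_with_logits(logits, targets, known)

# Same result as computing the loss on the two annotated labels alone
reference = F.binary_cross_entropy_with_logits(logits[:, :2], targets[:, :2])
print(torch.isclose(loss, reference))  # tensor(True)
```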
