# Bonus

In the bonus task, the MNIST digit classifier is extended with an “I Don’t Know” (IDK) option that allows the model to abstain from uncertain predictions. Confidence thresholds are applied to the classifier’s outputs to determine when a prediction should be accepted or labeled as IDK.

The resulting predictions, confidence scores, and IDK decisions are stored in a FiftyOne dataset, enabling interactive analysis of model uncertainty and reliability.


# Setup

This section prepares the execution environment: required libraries are installed and imported, and the computation device (CPU or GPU) is selected. The notebook targets Google Colab, and all subsequent steps share this configuration.


## Installations & Imports

The cells below install the pinned dependencies and import all packages used in the notebook.

In [None]:
%%capture
%pip install fiftyone==1.10.0

In [None]:
%%capture
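# torchvision 0.24.0 is the release paired with torch 2.9.0; sympy is left
# unpinned so that pip can resolve the version torch requires
%pip install torch==2.9.0 torchvision==0.24.0 numpy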

In [None]:
import os
import torch
from pathlib import Path
from torchvision import datasets
import torchvision.transforms.v2 as transforms
from torchvision.utils import save_image
from torch.utils.data import DataLoader
import torch.nn as nn
import fiftyone as fo
import torch.nn.functional as Func
from tqdm import tqdm
from fiftyone import ViewField as F
from huggingface_hub import HfApi

In [None]:
# Setup Device
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.cuda.is_available()

## Constants

Here, global configuration values are defined: the batch size, the name of the FiftyOne dataset, and the confidence threshold for the IDK decision. Centralizing these constants keeps inference, evaluation, and visualization consistent and makes experiments easier to reproduce and modify.

In [None]:
BATCH_SIZE_CLASSIFIER = 64
FIFTYONE_BONUS_DATASET_NAME = "mnist_idk_experiment"
CONFIDENCE_THRESHOLD = 0.99

## Data Paths

This section defines the filesystem layout used throughout the notebook. Google Drive is mounted to access the pretrained classifier checkpoint, and local directories for temporary images and the dataset export are created. Model loading, image saving, and the export rely on these paths.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

STORAGE_PATH = Path("/content/drive/MyDrive/Colab Notebooks/Applied Computer Vision/Applied-Computer-Vision-Projects/Diffusion_Model_03")
DATA_PATH = STORAGE_PATH / "data"

CLASSIFIER_MODEL_PATH = STORAGE_PATH / "checkpoints/best_mnist_classifier.pth"

TEMP_IMG_DIR = Path("/content/mnist_temp")
os.makedirs(TEMP_IMG_DIR, exist_ok=True)

EXPORT_MNIST_DIR = Path("/content/mnist_idk_export")
os.makedirs(EXPORT_MNIST_DIR, exist_ok=True)

## Load the model

Here, a pretrained MNIST classifier is loaded and switched to evaluation mode for inference. Its high test accuracy serves as the baseline before the IDK mechanism is introduced.

In [None]:
# Mnist Model Architecture
class MNISTClassifier(nn.Module):
    def __init__(self):
        super(MNISTClassifier, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool1(self.relu1(self.conv1(x)))
        x = self.pool2(self.relu2(self.conv2(x)))
        x = self.flatten(x)
        x = self.relu3(self.fc1(x))
        x = self.fc2(x)
        return x
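
The input size of `fc1` (`64 * 7 * 7`) follows from the two pooling stages. A minimal sketch of that arithmetic, assuming the usual output-size formula for convolution and pooling layers without dilation; the helper `out_size` is hypothetical and not part of the notebook:

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv/pool layer (floor division, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                                    # MNIST images are 28x28
size = out_size(size, kernel=3, padding=1)   # conv1 keeps 28
size = out_size(size, kernel=2, stride=2)    # pool1 halves to 14
size = out_size(size, kernel=3, padding=1)   # conv2 keeps 14
size = out_size(size, kernel=2, stride=2)    # pool2 halves to 7
flat_features = 64 * size * size             # 64 channels after conv2
print(size, flat_features)  # 7 3136
```

Changing the kernel size, padding, or number of pooling layers changes this product, so `fc1` has to be adjusted accordingly.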

In [None]:
## Load the model

# initialize mnist classifier
mnist_classifier = MNISTClassifier().to(device)

# Load model weights; stop if the checkpoint is missing, since continuing
# would run inference with randomly initialized weights
try:
    mnist_classifier.load_state_dict(torch.load(CLASSIFIER_MODEL_PATH, map_location=device))
    print("Model loaded.")
except FileNotFoundError:
    print("Please run the training loop in the previous notebook to generate the model file.")
    raise

mnist_classifier.eval()

## Create test dataset

This section prepares the MNIST test dataset with appropriate preprocessing. The dataset serves as the input for evaluating the IDK-augmented prediction function.

In [None]:
# Get the data
transform_classifier = transforms.Compose([
    transforms.ToImage(),
    transforms.ToDtype(torch.float32, scale=True),
    transforms.Normalize((0.1307,), (0.3081,))
])

test_dataset_classifier = datasets.MNIST('.', train=False, download=True, transform=transform_classifier)
test_loader_classifier = DataLoader(test_dataset_classifier, batch_size=BATCH_SIZE_CLASSIFIER, shuffle=False)
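
`Normalize((0.1307,), (0.3081,))` maps pixel values from [0, 1] to roughly [-0.42, 2.82]. This matters whenever a normalized tensor is written back to an image file, because values outside [0, 1] are clamped. A small sketch of the forward and inverse mapping in plain Python; the helper names are hypothetical:

```python
MEAN, STD = 0.1307, 0.3081  # MNIST statistics used in the transform above

def normalize(pixel):
    """Map a pixel in [0, 1] to the normalized value the classifier sees."""
    return (pixel - MEAN) / STD

def denormalize(value):
    """Invert the normalization to recover the original pixel intensity."""
    return value * STD + MEAN

low, high = normalize(0.0), normalize(1.0)
print(round(low, 4), round(high, 4))  # -0.4242 2.8215
```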


## Create the prediction function

The prediction function applies a softmax to model outputs and compares the maximum confidence to a threshold. Predictions below the threshold are labeled as “IDK”, enabling controlled abstention when the model is uncertain.

In [None]:
def predict_with_idk(image, model, threshold=0.5):
    """
    Performs MNIST digit prediction with an explicit "I Don't Know" (IDK) option.

    The model predicts a digit only if its confidence is at least the given threshold;
    otherwise, the prediction is rejected as IDK. Only single images are supported.

    Args:
        image (torch.Tensor): Input image tensor of shape (1, 28, 28) or (1, 1, 28, 28).
        model (nn.Module): Trained MNIST classifier.
        threshold (float): Confidence threshold for accepting a prediction.

    Returns:
        tuple[str, float]: Predicted digit ("0"–"9") or "IDK", and the associated confidence.
    """
    # Ensure batch dimension
    if image.dim() == 3:
        image = image.unsqueeze(0)

    with torch.no_grad():
        image = image.to(device)
        logits = model(image)

        # Convert logits to class probabilities
        probs = Func.softmax(logits, dim=1)

        # Select most confident class
        confidence, predicted_idx = torch.max(probs, dim=1)
        confidence = confidence.item()
        predicted_idx = predicted_idx.item()

    # Apply confidence-based abstention
    if confidence >= threshold:
        return str(predicted_idx), confidence
    else:
        return "IDK", confidence
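
The decision rule can be traced by hand on a single logit vector. The sketch below reimplements the softmax and threshold steps in plain Python for illustration only; `softmax` and `decide` are hypothetical helpers, and the logits are made up rather than taken from the classifier:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decide(logits, threshold):
    """Return (label, confidence); the label is "IDK" below the threshold."""
    probs = softmax(logits)
    confidence = max(probs)
    label = str(probs.index(confidence))
    return (label if confidence >= threshold else "IDK"), confidence

confident = [0.0] * 10
confident[3] = 12.0   # one dominant logit, confidence close to 1
ambiguous = [0.0] * 10
ambiguous[3] = 2.0    # two competing classes
ambiguous[8] = 1.8

print(decide(confident, 0.99)[0])  # 3
print(decide(ambiguous, 0.99)[0])  # IDK
```

The ambiguous vector still has class 3 as its most likely digit, but its probability mass is spread across several classes, so the maximum stays far below 0.99 and the prediction is rejected.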


## Create Fiftyone Dataset

Inference results are stored in a FiftyOne dataset, including ground-truth labels, predicted labels (with IDK), and confidence scores. This enables interactive inspection of uncertain predictions and systematic analysis of coverage versus accuracy.

### Choosing the confidence threshold
Because the trained MNIST classifier achieves 99.24% accuracy, it typically assigns very high confidence to real MNIST images. Consequently, only very high thresholds (e.g. 0.99) produce noticeable IDK behavior on in-distribution data. For diffusion-generated images, confidence values are often lower due to artifacts and ambiguity, so the same threshold leads to substantially more abstentions.

This illustrates how the confidence threshold directly controls the trade-off between prediction coverage and reliability, and why threshold selection must be adapted to the expected data quality and distribution.
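
Because `predict_with_idk` returns the confidence regardless of the decision, several thresholds can be compared from a single inference pass. A minimal sketch, assuming each result is kept as a `(confidence, is_correct)` pair; the function name and the toy records are hypothetical and are not results from this notebook:

```python
def selective_metrics(records, threshold):
    """Coverage and accuracy on accepted predictions for one threshold.

    records: iterable of (confidence, is_correct) pairs.
    """
    records = list(records)
    accepted = [ok for conf, ok in records if conf >= threshold]
    coverage = len(accepted) / len(records) if records else 0.0
    accuracy = sum(accepted) / len(accepted) if accepted else 0.0
    return coverage, accuracy

# Toy records for illustration only
toy = [(0.999, True), (0.997, True), (0.985, True), (0.97, False), (0.60, False)]

for t in (0.95, 0.98, 0.99):
    cov, acc = selective_metrics(toy, t)
    print(f"threshold={t:.2f}  coverage={cov:.0%}  accuracy(covered)={acc:.0%}")
```

Raising the threshold can only shrink the set of accepted predictions, so coverage never increases. Accuracy on the accepted set usually improves, but that is not guaranteed.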

In [None]:
# Create a new FiftyOne dataset for the IDK experiment

# Remove existing dataset to avoid name conflicts
if FIFTYONE_BONUS_DATASET_NAME in fo.list_datasets():
    print(f"Deleting existing dataset: {FIFTYONE_BONUS_DATASET_NAME}")
    fo.delete_dataset(FIFTYONE_BONUS_DATASET_NAME)

dataset = fo.Dataset(name=FIFTYONE_BONUS_DATASET_NAME)
samples = []

print("Running inference...")
for i in tqdm(range(len(test_dataset_classifier))):
    img_tensor, label = test_dataset_classifier[i]

    # Predict digit or abstain using the IDK mechanism
    pred_label, conf = predict_with_idk(
        img_tensor, mnist_classifier, CONFIDENCE_THRESHOLD
    )

    # Save image to disk, as FiftyOne operates on file paths.
    # Undo the normalization first: save_image clamps values to [0, 1]
    filepath = os.path.join(TEMP_IMG_DIR, f"img_{i}.png")
    save_image(img_tensor * 0.3081 + 0.1307, filepath)

    sample = fo.Sample(filepath=filepath)
    sample["ground_truth"] = fo.Classification(label=str(label))
    sample["prediction_with_idk"] = fo.Classification(
        label=pred_label, confidence=conf
    )

    # Tag samples where the model abstains
    if pred_label == "IDK":
        sample.tags.append("IDK")

    samples.append(sample)

    # Add samples in batches to reduce memory usage
    if len(samples) >= 1000:
        dataset.add_samples(samples)
        samples = []

# Add remaining samples
if samples:
    dataset.add_samples(samples)


In [None]:
# Launch App
session = fo.launch_app(dataset)
print(f"Dataset created with {len(dataset)} samples.")

In [None]:
# Create a View of only IDK samples
idk_view = dataset.match_tags("IDK")
print(f"Total IDK cases found: {len(idk_view)}")

# Launch App
session.view = idk_view

In [None]:
# Compute coverage and accuracy for the current confidence threshold

total_samples = len(dataset)
idk_count = len(idk_view)       # number of idk samples

# Coverage = (Total - IDK) / Total
covered_view = dataset.match_tags("IDK", bool=False)
covered_count = len(covered_view)
coverage = covered_count / total_samples if total_samples > 0 else 0.0

# Accuracy on covered samples
correct_covered_view = covered_view.match(
    F("prediction_with_idk.label") == F("ground_truth.label")
)
correct_covered_count = len(correct_covered_view)

# Accuracy on Covered = Correct / Covered
accuracy_on_covered = correct_covered_count / covered_count if covered_count > 0 else 0.0

# Standard Accuracy (treating IDK as incorrect)
standard_accuracy = correct_covered_count / total_samples if total_samples > 0 else 0.0

# Print report
print(f"Total Test Images:    {total_samples}")
print(f"IDK Responses:        {idk_count}")
print(f"Covered Responses:    {covered_count}")
print("-" * 30)
print(f"COVERAGE:             {coverage:.2%}  (Goal: Keep this high)")
print(f"ACCURACY (Covered):   {accuracy_on_covered:.2%}  (Goal: Higher than accuracy without abstention)")
print(f"ACCURACY (Standard):  {standard_accuracy:.2%}  (IDK counted as incorrect)")
print("="*50)

**Observation: Effect of the Confidence Threshold**

Increasing the confidence threshold leads to more frequent IDK decisions, which slightly reduces coverage but improves accuracy on the remaining predictions. At a threshold of 0.99, coverage drops to 97.88%, while accuracy on covered samples reaches 99.78%, above the 99.24% reported for the classifier without abstention. The standard accuracy of 97.66% is lower because it counts every IDK response as an error. Lowering the threshold to 0.98 and 0.95 increases coverage to 98.35% and 98.86%, respectively, but gradually decreases covered accuracy (to 99.72% and 99.64%).

Overall, these results illustrate the central trade-off controlled by the threshold: higher thresholds prioritize reliability by rejecting uncertain predictions, while lower thresholds increase coverage at the cost of slightly reduced accuracy. This demonstrates why careful threshold selection is essential for balancing prediction quality and coverage in uncertainty-aware systems.


A more detailed summary of the threshold experiments:

| Threshold | IDK Responses | Coverage (%) | Accuracy (Covered) (%) | Accuracy (Standard) (%) |
|----------:|--------------:|-------------:|------------------------:|-------------------------:|
| 0.99      | 212           | 97.88        | 99.78                   | 97.66                    |
| 0.98      | 165           | 98.35        | 99.72                   | 98.07                    |
| 0.95      | 114           | 98.86        | 99.64                   | 98.50                    |
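
The columns of the table are tied together: coverage is the share of non-IDK responses, and the standard accuracy equals coverage times the accuracy on covered samples. The sketch below checks this for the rows above, assuming the 10,000 images of the MNIST test set; small deviations come from rounding in the reported percentages:

```python
TOTAL = 10_000  # size of the MNIST test set

# (threshold, IDK responses, covered accuracy %, standard accuracy %) from the table
rows = [
    (0.99, 212, 99.78, 97.66),
    (0.98, 165, 99.72, 98.07),
    (0.95, 114, 99.64, 98.50),
]

for threshold, idk, acc_covered, acc_standard in rows:
    coverage = (TOTAL - idk) / TOTAL
    implied_standard = coverage * acc_covered  # already in percent
    print(f"{threshold:.2f}: coverage={coverage:.2%}, "
          f"implied standard accuracy={implied_standard:.2f}% "
          f"(reported {acc_standard:.2f}%)")
```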

## Publish Dataset on Hugging Face

Finally, the FiftyOne dataset is exported and published to Hugging Face. This makes the results reproducible and shareable, allowing others to explore the dataset and the behavior of the IDK mechanism.

In [None]:
# Save FiftyOne dataset (images + metadata) to disk
print(f"Exporting dataset to {EXPORT_MNIST_DIR}...")

dataset.export(
    export_dir=str(EXPORT_MNIST_DIR),
    dataset_type=fo.types.FiftyOneDataset,
    export_media=True, # This ensures the actual .png images are included
)

print("Export complete.")

In [None]:
# Publish to Hugging Face

# Get token
HF_TOKEN = os.getenv("HF_TOKEN")

if HF_TOKEN:
    print("Uploading to Hugging Face...")
    api = HfApi(token=HF_TOKEN)
    repo_id = "mmarschn/mnist-idk-experiment"

    # Create the repo if it doesn't exist
    try:
        api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    except Exception as e:
        print(f"Repo creation warning: {e}")

    # Upload the folder
    api.upload_large_folder(
        folder_path=EXPORT_MNIST_DIR,
        repo_id=repo_id,
        repo_type="dataset",
        ignore_patterns=["*.ipynb_checkpoints"],
    )
    print(f"Successfully published to: https://huggingface.co/datasets/{repo_id}")
else:
    print("Error: HF_TOKEN not found. Cannot publish to Hugging Face.")
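
To explore the published dataset elsewhere, the exported folder can be downloaded and re-imported into FiftyOne. This is a sketch that assumes the repository contains the `FiftyOneDataset` export exactly as uploaded above; the function name, `local_dir`, and the dataset name are hypothetical:

```python
def load_published_dataset(repo_id, local_dir="mnist_idk_download",
                           name="mnist_idk_from_hub"):
    """Download the exported folder from the Hub and import it into FiftyOne."""
    # Imported lazily so the helper can be defined without the packages installed
    import fiftyone as fo
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=local_dir)
    return fo.Dataset.from_dir(
        dataset_dir=path,
        dataset_type=fo.types.FiftyOneDataset,
        name=name,
    )

# Usage (requires network access):
# dataset = load_published_dataset("mmarschn/mnist-idk-experiment")
# session = fo.launch_app(dataset)
```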