**Images and Classification**


Today, we're going to work on classification of images using image embeddings. This is analogous to classification of texts using word embeddings, but adapted to images.

So, first we might want to be reminded: what are word embeddings? And how did they play into classification?

What did classification with language/words involve? What were the steps in the process? What decisions needed to be made?

Review discussion....

OK, now let's apply this to images. First, let's load in some sample images.

Remember we probably need to use GPUs!

In [None]:
!pip install -q datasets


In [None]:
from datasets import load_dataset

# Load dataset from Hugging Face
dataset = load_dataset("beans")

# Inspect structure
print(dataset)
print(dataset['train'][0])


In [None]:
def map_to_binary(example):
    example['label_binary'] = 1 if example['labels'] == 2 else 0  # 1 = healthy, 0 = diseased
    return example

label_names = ["diseased", "healthy"]
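
A note before the next cell: it calls a helper named show_images, which is not defined anywhere in this notebook. Below is a minimal sketch of what such a helper could look like. The function bodies, the extra select_by_label helper, and the n parameter are assumptions for illustration, not the original helper.

```python
import matplotlib.pyplot as plt

label_names = ["diseased", "healthy"]  # same list as in the cell above

def select_by_label(split, label, n=4):
    """Return the images of the first n examples whose 'label_binary' equals label."""
    picked = []
    for example in split:
        if example["label_binary"] == label:
            picked.append(example["image"])
            if len(picked) == n:
                break
    return picked

def show_images(split, label, n=4):
    """Plot up to n images with the given binary label in a single row."""
    images = select_by_label(split, label, n)
    if not images:
        print("No images found for label", label)
        return
    # squeeze=False keeps `axes` two-dimensional even when there is one image
    fig, axes = plt.subplots(1, len(images), figsize=(3 * len(images), 3), squeeze=False)
    for ax, img in zip(axes[0], images):
        ax.imshow(img)
        ax.set_title(label_names[label])
        ax.axis("off")
    plt.show()
```

Stopping after n matches means we only load the handful of images we actually plot, rather than the whole split.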



In [None]:
# Apply the corrected binary labels
binary_dataset = dataset.map(map_to_binary)

# Show examples
show_images(binary_dataset['train'], label=1)  # healthy
show_images(binary_dataset['train'], label=0)  # diseased


So, we've got our images and labels stored in binary_dataset. Let's examine the structure of this for a second. What type of object is it? Where and how is everything stored?

In [None]:
sample = binary_dataset['train'][0]
print(sample)
print(sample.keys())

Each item in the dataset (binary_dataset['train']) is a dictionary with four keys:

image_file_path
This is a string — the full file path to where the image is stored on your system (cached automatically when the dataset was downloaded). You won't usually need this, but it's there for internal traceability.

image
This is a PIL.Image.Image object. It’s the actual image already loaded into memory and ready to be transformed or passed into a model. No need to manually open or read image files.

labels
This is the original label from the dataset, an integer representing one of three classes:

0 = angular_leaf_spot

1 = bean_rust

2 = healthy

label_binary
This is a new label we created to simplify the task. It's also an integer:

0 = diseased (meaning the original label was 0 or 1)

1 = healthy (meaning the original label was 2)
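
To make the relabeling concrete, here is a small sketch that prints which original class ends up under which binary label. The class names are the ones listed above; the to_binary function simply repeats the rule from map_to_binary and is only here for illustration.

```python
original_names = ["angular_leaf_spot", "bean_rust", "healthy"]
binary_names = ["diseased", "healthy"]

def to_binary(label):
    # Same rule as map_to_binary: only original label 2 counts as healthy
    return 1 if label == 2 else 0

mapping = {name: binary_names[to_binary(i)] for i, name in enumerate(original_names)}

for name, binary_name in mapping.items():
    print(f"{name:>18} -> {binary_name}")
```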



**Embeddings**

OK, let's make some embeddings. What are image embeddings? And what pretrained model should we use to make them?

What are image embeddings?

As we've discussed, an image embedding is a way of turning an image — which is just a big grid of pixel values — into a much smaller set of numbers that still capture the important information in the image. You can think of it like a summary: instead of using every single pixel, the embedding describes what kind of thing the image shows — textures, shapes, colors, or even abstract patterns — in a numerical format that a computer can work with.
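
To get a feel for "much smaller", here is a quick back-of-the-envelope calculation. It assumes the 224x224 RGB input size and the 512-value ResNet-18 embedding that we use later in this notebook.

```python
# Number of raw values in one preprocessed image: height * width * color channels
pixel_values = 224 * 224 * 3

# Number of values in one embedding (assumed: ResNet-18 with its last layer removed)
embedding_values = 512

print("Pixel values:    ", pixel_values)
print("Embedding values:", embedding_values)
print("Compression:      about", pixel_values // embedding_values, "times smaller")
```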

These embeddings come from deep neural networks that have already been trained on millions of images (remember the video we watched last week?). Instead of building our own network from scratch, we’ll reuse one of these powerful pretrained models to extract these embeddings for us. (Does anyone remember the pretrained models we used for word embeddings?)

What tool are we using?

To generate embeddings with a model, we’ll use a library (or, collection of prewritten code) called PyTorch, popular for deep learning and created by Meta (Facebook). PyTorch is especially useful for experimenting and learning because it’s written in a very intuitive, Pythonic way — it’s widely used in both research and teaching.

There’s another popular library called TensorFlow, made by Google. TensorFlow has long been common in industry settings, especially for building large production systems. Its API can feel a bit more complex, but both libraries can be used to do similar things.


Let's start by loading in the model we'll use for embeddings, ResNet-18:

In [None]:
# STEP 1: Import PyTorch and load a pretrained model
import torch
from torchvision import models, transforms
from PIL import Image
import matplotlib.pyplot as plt

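# Load a pretrained ResNet-18 model (ImageNet weights)
# Newer torchvision versions use `weights=` in place of the deprecated `pretrained=True`
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)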
model.eval()  # put it in inference mode (we’re not training)

# Remove the last layer so we get the image embedding, not the final classification
embedding_model = torch.nn.Sequential(*list(model.children())[:-1])


As is usual with model selection, we could have chosen others. This one is simple to use for embedding tasks like ours because it's lightweight, fast, public, and easy to load in Python, but there are many you could use.

Note also the step of removing the last "layer" of the model, to extract the embeddings.
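
To see what "removing the last layer" does, here is a toy sketch on a tiny stand-in network. TinyNet is a made-up model for illustration only (it is not ResNet-18), but the slicing trick is the same one used in the cell above.

```python
import torch
from torch import nn

class TinyNet(nn.Module):
    """A made-up three-part network: features -> pooling -> classifier."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, 2)  # the final classification layer

    def forward(self, x):
        x = self.pool(self.conv(x))
        return self.fc(torch.flatten(x, 1))

tiny = TinyNet().eval()
child_names = [name for name, _ in tiny.named_children()]
print("Children:", child_names)

# Keep every child except the last one (the classifier)
headless = nn.Sequential(*list(tiny.children())[:-1])

with torch.no_grad():
    out = headless(torch.zeros(1, 3, 16, 16))

print("Output shape without the classifier:", tuple(out.shape))
```

The output keeps the pooled features (8 of them here, 512 for ResNet-18) instead of class scores, which is exactly what we want for an embedding.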

When using a model like this on some data, we always have to know two things (at least); does anyone remember what they are?

In [None]:
# We'll use one image from the dataset we've already loaded
sample = binary_dataset['train'][0]
img = sample['image']

# Display the image
plt.imshow(img)
plt.axis("off")
plt.title("Input Image")
plt.show()


Here we've chosen one input image from our data. What format is the image in when we pass it along? (Hint: note which key of our dictionary we're selecting.)

Before we can generate embeddings for the image we'll have to preprocess it:

In [None]:
# Define the standard preprocessing pipeline for ResNet
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),  # Converts to PyTorch tensor and scales pixels to [0,1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet mean
                         std=[0.229, 0.224, 0.225]),   # ImageNet std
])

# Apply preprocessing
input_tensor = preprocess(img).unsqueeze(0)  # Add batch dimension (1, 3, 224, 224)


Some notes on the above:

Neural networks need input in a specific format:

Size: 224x224 pixels (center-cropped)

Normalized pixel values (standard mean and std)
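
What does normalization do to a single pixel? Here is a sketch of the arithmetic using NumPy and the same mean and std values as in the cell above. The helper normalize_pixel is a hypothetical name for illustration; in the notebook, ToTensor and Normalize do this for every pixel.

```python
import numpy as np

mean = np.array([0.485, 0.456, 0.406])  # ImageNet mean (R, G, B)
std = np.array([0.229, 0.224, 0.225])   # ImageNet std (R, G, B)

def normalize_pixel(rgb):
    """Scale 0-255 values to [0, 1], then subtract the mean and divide by the std."""
    scaled = np.asarray(rgb, dtype=np.float64) / 255.0
    return (scaled - mean) / std

white = normalize_pixel([255, 255, 255])
black = normalize_pixel([0, 0, 0])

print("White pixel after normalization:", white.round(3))
print("Black pixel after normalization:", black.round(3))
```

This is why the values in a preprocessed tensor can be negative or larger than 1, even though the original pixels were between 0 and 255.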

On the resizing:

First we resize to 256
This resizes the shorter side of the image to 256 pixels, while keeping the aspect ratio the same. The goal is to make sure the image is big enough that we can safely crop the center out.

Then we Center Crop to 224x224
After resizing, we cut out the center 224×224 square of the image. This gives the final size the model expects.

Many real images are not square — they may be tall, wide, etc.

Resizing to 256 first makes sure we keep enough image content; center cropping ensures a consistent size and avoids distortion (unlike directly resizing to 224x224, which would squash or stretch non-square images).
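
The resize-then-crop steps can be sketched as plain arithmetic. The two functions below are hypothetical helpers that mirror the description above; they are not torchvision's own code, and rounding may differ by a pixel for some image sizes. The 600x400 image is a made-up example.

```python
def resize_shorter_side(width, height, target=256):
    """Scale so the shorter side equals `target`, keeping the aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

def center_crop_box(width, height, crop=224):
    """Return (left, top, right, bottom) of a centered crop x crop square."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop

# A made-up wide image, 600 pixels wide and 400 pixels tall
resized = resize_shorter_side(600, 400)
box = center_crop_box(*resized)

print("After resize:", resized)
print("Crop box (left, top, right, bottom):", box)
```

Notice that the wide image loses more on the left and right than on the top and bottom, but nothing gets squashed.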

OK, what about the last line of that code? What's going on with batching and unsqueezing?

Neural networks in PyTorch expect input in this shape:
(batch_size, num_channels, height, width)

What this means is that the image is represented as a multidimensional array, or tensor, that has these four dimensions:

For a single image:
batch_size = 1 (we're processing one image at a time)
num_channels = 3 (RGB)
height = 224
width = 224

The last line of the code above just adds the "batch dimension" of 1 to the beginning of our image data, going from (3, 224, 224) to (1, 3, 224, 224).
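
Here is a small sketch of that shape change, using a tensor of zeros as a stand-in for a real preprocessed image. It also shows, as an aside, how several images of the same size could be stacked into one larger batch.

```python
import torch

# Stand-in for one preprocessed image: 3 channels, 224 x 224 pixels
single = torch.zeros(3, 224, 224)

# unsqueeze(0) inserts a new dimension of size 1 at position 0
batch_of_one = single.unsqueeze(0)
print("Before:", tuple(single.shape))
print("After: ", tuple(batch_of_one.shape))

# torch.stack builds the batch dimension from a list of same-sized tensors
batch_of_four = torch.stack([single, single, single, single])
print("Batch of four:", tuple(batch_of_four.shape))
```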

So, what does one of these tensors, with such dimensions, look like?

Let's take a look at the "input_tensor" we just created

In [None]:
# Look at red channel of first image, top-left 3x3 region
print(input_tensor[0, 0, :3, :3])


In [None]:
# Or a larger slice:

# Red channel (channel index 0), rows 50-59, cols 50-59 (the slice end, 60, is excluded)
print(input_tensor[0, 0, 50:60, 50:60])


OK, so we've preprocessed our image and it's now a tensor with 4 dimensions, the first being this necessary batch size of 1. Now let's turn it into embeddings (hint: that means turning this whole buncha numbers into a whole buncha other numbers. yay!)

In [None]:
# Make sure we're not tracking gradients (since we're not training)
with torch.no_grad():
    embedding = embedding_model(input_tensor)  # shape: [1, 512, 1, 1]


In [None]:
print("Embedding shape (raw):", embedding.shape)


The model gives a 512-dimensional feature map, but it's still wrapped in 4D form. Let's flatten it:

In [None]:
embedding_vector = embedding.squeeze().numpy()
print("Embedding shape (flattened):", embedding_vector.shape)
print("First 10 values:", embedding_vector[:10])
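
A quick sketch of what squeeze does to shapes, using NumPy arrays of zeros as stand-ins for real embeddings. The batch of 8 is a made-up example to show that squeeze only removes dimensions of size 1.

```python
import numpy as np

# Stand-in for the raw model output for one image
raw = np.zeros((1, 512, 1, 1))

flat = raw.squeeze()       # drops every dimension of size 1
row = flat.reshape(1, -1)  # one row, as scikit-learn expects for a single sample

print("Raw:      ", raw.shape)
print("Squeezed: ", flat.shape)
print("As a row: ", row.shape)

# With a batch of 8 images, the batch dimension survives the squeeze
batch = np.zeros((8, 512, 1, 1)).squeeze()
print("Batch of 8, squeezed:", batch.shape)
```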



**Training a classifier**

First we'll have to turn all of our images (the sick and healthy plants) into embeddings. We'll need those embeddings as a set of items with labels for the classifier, in the format we're familiar with from text classification (two lists); then we can proceed.

In [None]:
import random
import numpy as np
import torch
from collections import Counter

# STEP 1: Select 50 from each class
# Read only the label column so we don't decode every image just to filter
train_labels = binary_dataset['train']['label_binary']
diseased_indices = [i for i, lab in enumerate(train_labels) if lab == 0]
healthy_indices = [i for i, lab in enumerate(train_labels) if lab == 1]

diseased_sample = random.sample(diseased_indices, 50)
healthy_sample = random.sample(healthy_indices, 50)

# Combine them: first 50 diseased, then 50 healthy
balanced_indices = diseased_sample + healthy_sample
subset = binary_dataset['train'].select(balanced_indices)

# STEP 2: Generate embeddings and labels
X = []
y = []

for i, example in enumerate(subset):
    img = example['image']
    label = example['label_binary']  # 0 or 1

    img_tensor = preprocess(img).unsqueeze(0)

    with torch.no_grad():
        embedding = embedding_model(img_tensor)
        embedding_vector = embedding.squeeze().numpy()

    X.append(embedding_vector)
    y.append(label)

X = np.array(X)  # shape: (100, 512)
y = np.array(y)  # shape: (100,)

print("Final label counts:", Counter(y))  # Should show 50 of each



OK, let's break down the code block above a bit; it brings together things we've learned about preparing items for classification with things we just learned about generating image embeddings.
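
STEP 1 of that block, the balanced sampling, can be sketched on its own with made-up labels. The 70/30 label list, the sample size k, and the seed are assumptions for illustration; the pattern (collect indices per class, sample the same number from each) is the one used above.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch gives the same sample every run

# Made-up, unbalanced labels: 70 diseased (0) and 30 healthy (1)
labels = [0] * 70 + [1] * 30

# Collect the positions of each class
diseased_idx = [i for i, lab in enumerate(labels) if lab == 0]
healthy_idx = [i for i, lab in enumerate(labels) if lab == 1]

# Sample the same number from each class, without repeats
k = 20
balanced = random.sample(diseased_idx, k) + random.sample(healthy_idx, k)

counts = Counter(labels[i] for i in balanced)
print("Label counts in the balanced subset:", counts)
```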

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


# STEP 1: Split data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)


# STEP 2: Initialize and train the classifier

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# STEP 3: Make predictions and evaluate

y_pred = clf.predict(X_test)

# Show detailed classification metrics
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=["Diseased", "Healthy"]))


(Note the lack of cross-validation, which we used with the text data. We could add it; here we instead use a single train/test split.)
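
If we did want cross-validation, one possible sketch uses scikit-learn's cross_val_score. The data below (X_demo, y_demo) is randomly generated stand-in data with the same shape as our embeddings, so the scores themselves mean nothing; with the real notebook data you would pass X and y in their place.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data: 100 fake "embeddings" of length 512, 50 per class
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 512)),  # fake class 0
    rng.normal(0.5, 1.0, size=(50, 512)),  # fake class 1, shifted slightly
])
y_demo = np.array([0] * 50 + [1] * 50)

# 5-fold cross-validation: train on 4 folds, test on the 5th, and rotate
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```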

Let's try running the classifier we just built on a single image and see how it does.

In [None]:
import matplotlib.pyplot as plt

# STEP 1: Pick one image the classifier has not seen (here, from the dataset's test split)
example = binary_dataset['test'][0]  # or use any index you like
img = example['image']

# STEP 2: Display the image
plt.imshow(img)
plt.axis("off")
plt.title("Image to Classify")
plt.show()

# STEP 3: Preprocess the image and get embedding
img_tensor = preprocess(img).unsqueeze(0)  # Shape: (1, 3, 224, 224)

with torch.no_grad():
    embedding = embedding_model(img_tensor)
    embedding_vector = embedding.squeeze().numpy().reshape(1, -1)  # shape: (1, 512)

# STEP 4: Use trained classifier to predict
pred_class = clf.predict(embedding_vector)[0]
pred_probs = clf.predict_proba(embedding_vector)[0]

# STEP 5: Show result
class_names = ["Diseased", "Healthy"]
pred_label = class_names[pred_class]
confidence = pred_probs[pred_class] * 100

print(f"Predicted class: {pred_label}")
print(f"Confidence: {confidence:.2f}%")
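
Where does that "confidence" number come from? Here is a toy sketch with a one-feature classifier. The data X_toy and y_toy is made up for illustration. The points to notice are that predict_proba returns one probability per class, in the order given by classes_, and that the probabilities sum to 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: small values belong to class 0, large values to class 1
X_toy = np.array([[0.0], [0.2], [0.8], [1.0]])
y_toy = np.array([0, 0, 1, 1])

toy = LogisticRegression().fit(X_toy, y_toy)

probs = toy.predict_proba([[0.9]])[0]  # one probability per class
pred = toy.predict([[0.9]])[0]

print("Class order:", toy.classes_)
print("Probabilities:", probs)
print("Sum of probabilities:", probs.sum())
print("Predicted class:", pred)
```

Indexing class_names by the predicted class, as in the cell above, works because the classes are 0 and 1 in that order.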
