# Simple Single Object Detection - A Naive Approach

**_Experimenting with single-object detection using a subset of the Caltech-101 dataset and transfer learning._**

In this experiment, images of airplanes and their bounding boxes are extracted from the Caltech-101 dataset and used to train a customized model built on a pretrained MobileNetV2, so that it can detect the location of airplanes in new images.

**The Experiment:**

- Downloads the Caltech-101 dataset (caltech-101.zip) from https://data.caltech.edu/records/mzrjq-6wc02. The dataset has pictures of objects belonging to 101 categories, each containing about 40 to 800 images. Each image is roughly 300 x 200 pixels. The annotations, stored as MATLAB data files (.mat), contain the outline (bounding box) of each object in these pictures.

- Extracts the dataset from the downloaded compressed file caltech-101.zip.

- Sets the root paths for the airplane images and annotations.

- Loads the data (image paths and their bounding boxes) into an intermediate data structure by reading the bounding boxes from the MATLAB .mat annotation files and rescaling them.

- Writes helper functions to show images and to draw bounding boxes.

- Samples randomly from the metadata and checks visually that the two helper functions work.

- Uses a lightweight pretrained model, e.g. MobileNetV2, without the top layers used for the ImageNet classification task, and adds task-specific top layers.

- Uses a data loader such as TensorFlow Dataset to serve data efficiently during model training.

- Separates the validation set from the training set.

- Compiles and fits the model with early stopping.

- Plots the learning curve.

- Performs predictions on the validation data and visualizes them by plotting the images with their predicted bounding boxes.

## Importing Packages

In [None]:
import os                                           # For file system related tasks
import urllib.request as request, zipfile, tarfile  # For downloading file from Internet and extractions   
import random
from scipy.io import loadmat                        # For reading MATLAB (.mat) files
from PIL import Image                               # For reading image files
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle            # For drawing rectangles on plots

import tensorflow as tf

## Data Ingestion

**Downloading Data Files**

In [None]:
url = "https://data.caltech.edu/records/mzrjq-6wc02/files/caltech-101.zip"

# Creates the dataset directory if it does not exist.
# Set the path according to your preference, or point it to the directory where the data already exists.
data_dir = "./../data/caltech-101"

os.makedirs(data_dir, exist_ok=True)

data_file_name = "caltech-101.zip"
data_file_path = os.path.join(data_dir, data_file_name)

In [None]:
# Downloads data file only if it does not exist

# UNCOMMENT THE FOLLOWING LINES AND RUN ONLY IF THE EXTRACTED DATASET IS NOT AVAILABLE. 
# CHECK WITH THE INSTRUCTOR FIRST.

# if not os.path.isfile(data_file_path):
#     print(f"Data file {data_file_path} could not be found. Downloading...", end="")
#     try:
#         request.urlretrieve(url, data_file_path)
#         print("successful.")
#     except Exception as e:
#         print(f"An error occurred while downloading: {e}")
# else:
#     print("Data file already exists. Downloading was skipped.")


**Decompressing Data File**

In [None]:
# Decompresses images file

# UNCOMMENT THE FOLLOWING LINES AND RUN ONLY IF THE EXTRACTED DATASET IS NOT AVAILABLE. 
# CHECK WITH THE INSTRUCTOR FIRST.

# if os.path.isfile(data_file_path):
#     print(f"Decompressing {data_file_path}...", end="")
#     try:
#         with zipfile.ZipFile(data_file_path, "r") as zipped:
#             zipped.extractall(data_dir)
#         print("successful.")

#         # Decompressing images tar file
#         images_tarfile_path = os.path.join(data_dir, "caltech-101", "101_ObjectCategories.tar.gz")
#         print(f"Decompressing {images_tarfile_path}...", end="")
#         try:
#             with tarfile.open(images_tarfile_path, "r:gz") as tar:   # The with-block closes the archive on exit
#                 tar.extractall(os.path.join(data_dir, "caltech-101"), filter="data")
#             print("successful.")
#         except Exception as e:
#             print(f"An error occurred while decompressing: {e}")

#         # Decompressing annotations tar file
#         annotations_tarfile_path = os.path.join(data_dir, "caltech-101", "Annotations.tar")
#         print(f"Decompressing {annotations_tarfile_path}...", end="")
#         try:
#             with tarfile.open(annotations_tarfile_path, "r") as tar:  # The with-block closes the archive on exit
#                 tar.extractall(os.path.join(data_dir, "caltech-101"), filter="data")
#             print("successful.")
#         except Exception as e:
#             print(f"An error occurred while decompressing: {e}")
        
#     except Exception as e:
#         print(f"An error occurred while decompressing: {e}")
# else:
#     print(f"File {data_file_path} does not exist. Decompression was skipped.")

## Data Preparation

Rename the following annotation folders, because their names do not match the names of the corresponding image folders:

- Airplanes_Side_2 to airplanes
- Faces_2 to Faces
- Faces_3 to Faces_easy
- Motorbikes_16 to Motorbikes
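
The renames listed above could also be scripted instead of done by hand. The following is a minimal sketch, assuming all four annotation folders sit directly under one directory. The helper `rename_annotation_dirs` and the `RENAMES` mapping are names introduced here for illustration; they are not part of the notebook.

```python
import os

# Old annotation folder name -> matching image folder name (from the list above).
RENAMES = {
    "Airplanes_Side_2": "airplanes",
    "Faces_2": "Faces",
    "Faces_3": "Faces_easy",
    "Motorbikes_16": "Motorbikes",
}


def rename_annotation_dirs(annotations_root, renames=RENAMES):
    """Renames each listed folder if it exists and its target name is still free.

    Returns the list of (old, new) pairs that were actually renamed.
    """
    renamed = []
    for old, new in renames.items():
        src = os.path.join(annotations_root, old)
        dst = os.path.join(annotations_root, new)
        if os.path.isdir(src) and not os.path.exists(dst):
            os.rename(src, dst)
            renamed.append((old, new))
    return renamed
```

Because a folder is skipped once its target exists, running the helper a second time should leave the directory unchanged. In this notebook it would presumably be called with the `Annotations` directory under `data_dir`.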

In [None]:
# Sets the root paths for the airplane images and annotations.
images_dir = os.path.join(data_dir, "caltech-101", "101_ObjectCategories", "airplanes")
annotations_dir = os.path.join(data_dir, "caltech-101", "Annotations", "airplanes")

In [None]:
def scale_box(box, image_width, image_height):
    """
    Scales the bounding box on a unit square and converts box from
    the format [y1, y2, x1, x2] to [x1, y1, width, height]
    """

    box = [box[2], box[0], box[3]-box[2], box[1]-box[0]]

    # Write code to scale the bounding box values down to a unit scale by dividing them by the
    # larger of the image width and image height.
    scale = # Write code
    x, y, w, h = # Write code

    # Shifts the starting coordinates so the box follows the image when it is centered in the unit square
    x += (image_height - image_width) * scale / 2 if image_height > image_width else 0
    y += (image_width - image_height) * scale / 2 if image_width > image_height else 0
    
    return [x, y, w, h]
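
To make the scaling step concrete, here is one possible sketch of the whole conversion, written as a separate function so it does not replace the exercise above. It assumes the scale factor is the reciprocal of the longer image side, which is what the centering lines in `scale_box` imply. The name `scale_box_sketch` and the sample numbers are illustrative only.

```python
def scale_box_sketch(box, image_width, image_height):
    """Converts [y1, y2, x1, x2] in pixels to [x, y, w, h] on a unit square."""
    y1, y2, x1, x2 = box
    x, y, w, h = x1, y1, x2 - x1, y2 - y1

    # Assumption: divide by the longer side so the whole image fits into [0, 1].
    scale = 1 / max(image_width, image_height)
    x, y, w, h = x * scale, y * scale, w * scale, h * scale

    # Shift along the shorter side so the box follows the centered image.
    if image_height > image_width:
        x += (image_height - image_width) * scale / 2
    elif image_width > image_height:
        y += (image_width - image_height) * scale / 2
    return [x, y, w, h]


# A made-up box on a 300 x 200 image
example = scale_box_sketch([10, 110, 20, 220], image_width=300, image_height=200)
print([round(v, 4) for v in example])   # prints [0.0667, 0.2, 0.6667, 0.3333]
```

For a landscape image only `y` is shifted, because the image is padded at the top and bottom when it is drawn in the unit square.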

In [None]:
# Loads the data (image paths and their bounding boxes) into an intermediate data structure
# by reading the bounding boxes from the MATLAB .mat annotation files and rescaling them.

metadata = #  Write code to initialize a dictionary as an intermediate data structure to hold meta information about the images and respective bounding boxes

id = # Write code to initialize a simple counter to act as key to the metadata dictionary

for file in os.listdir(images_dir):             # Iterates over the files in the airplane image folder
   image_path = # Write code to get a path to the file for image loading
   base_name = os.path.splitext(file)[0]        # Gets base name of image file [e.g. ../airplanes/image_0616.jpg to image_0616] to prepare path for associated .mat file
   annotation_file = os.path.join(annotations_dir, f"annotation_{base_name[-4:]}.mat")    # Gets the path to the annotation (.mat) file
   if os.path.exists(annotation_file):          # Skips if the associated annotation file does not exist
      metadata[id] = id                         # Sets the key against this meta information for later retrieval
      mat_contents = # Write code to read the content of the .mat file by passing its path to the function `loadmat`
      with Image.open(image_path) as image:     # Loads the image to get image width and height to scale the bounding box accordingly
         scaled_box = scale_box(mat_contents['box_coord'][0].tolist(), image.width, image.height)  # Scales the bounding box in a unit square
      metadata[id] = # Write code to store scaled box and image path as dictionary values "box" and "image_path" against key identified by `id` in metadata dictionary
      # Write code to increment the counter
   else:
      print(f"Not found: {annotation_file}") # Reports an image file that has no associated annotation file
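
The shape of the intermediate structure may be easier to see in a file-free sketch. The code below only illustrates the file-name mapping and the resulting dictionary layout. The helpers `annotation_name` and `build_metadata` are hypothetical, and the `boxes_by_annotation` dictionary stands in for the boxes that the notebook reads with `loadmat` and rescales with `scale_box`.

```python
import os


def annotation_name(image_file):
    """Maps an image file name such as image_0616.jpg to annotation_0616.mat."""
    base_name = os.path.splitext(os.path.basename(image_file))[0]
    return f"annotation_{base_name[-4:]}.mat"


def build_metadata(image_files, boxes_by_annotation):
    """Builds {id: {"box": ..., "image_path": ...}} and skips images without a box."""
    metadata = {}
    for image_file in sorted(image_files):
        box = boxes_by_annotation.get(annotation_name(image_file))
        if box is None:
            continue                      # No annotation for this image
        # Consecutive integer keys 0, 1, 2, ... act as the sample ids
        metadata[len(metadata)] = {"box": box, "image_path": image_file}
    return metadata


# Stand-in data: two images, only one of which has an annotation
demo = build_metadata(
    ["image_0001.jpg", "image_0002.jpg"],
    {"annotation_0002.mat": [0.1, 0.2, 0.5, 0.3]},
)
print(demo)
```

Keying the dictionary by consecutive integers keeps later lookups such as `metadata[random_sample_id]` simple.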

In [None]:

# Write code to get a random sample from the metadata to check


In [None]:
def draw_image(ax, image):
    """
    Draws the image on a unit square with (0, 0) at the top left
    """
    ax.set(xlim=(0, 1), ylim=(1, 0), xticks=[], yticks=[], aspect="equal")
    image = # Write code to read the image from its path `image` by passing it to pyplot's method `imread`
    height, width = # Write code to get the image's height and width from its first two dimensions (the third dimension holds the channels)
    
    # Pads the image so it fits inside the unit square
    hpad = (1 - height / width) / 2 if width > height else 0
    wpad = (1 - width / height) / 2 if height > width else 0
    extent = [wpad, 1 - wpad, 1 - hpad, hpad]
    
    # Write code to show the image by passing it as the first argument to the method `imshow` of the given axis `ax`,
    # and the extent variable as the named parameter `extent`
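
The padding arithmetic in `draw_image` can be checked on its own, without loading an image. The sketch below repeats the same formulas in a standalone function. The name `unit_extent` is introduced here for illustration, and the returned list follows Matplotlib's `extent` order of left, right, bottom, top.

```python
def unit_extent(width, height):
    """Returns [left, right, bottom, top] for an image centered in a unit square.

    The y axis is assumed to be inverted (0 at the top), as set up in `draw_image`,
    so "bottom" is the larger y value.
    """
    hpad = (1 - height / width) / 2 if width > height else 0
    wpad = (1 - width / height) / 2 if height > width else 0
    return [wpad, 1 - wpad, 1 - hpad, hpad]


print(unit_extent(300, 200))   # Landscape: padded at the top and bottom
print(unit_extent(200, 300))   # Portrait: padded at the left and right
print(unit_extent(200, 200))   # Square: no padding
```

A 300 x 200 image fills the full width and occupies the middle two thirds of the height, which matches the vertical shift applied to the bounding boxes during scaling.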

In [None]:
def draw_box(ax, box, color):
    """
    Draws bounding box of a specific linewidth (lw), (edge) color (ec)
    """
    x, y, w, h = box
    ax.add_patch(Rectangle((x, y), w, h, lw=2, ec=color, fc="none"))    # Draws the bounding box as rectangle with no filling color (fc)

In [None]:
def draw_prediction(image, predicted_box):
    """
    Draws both the image and the predicted bounding box on a unit square with (0, 0) at the top left,
    using the helper functions `draw_image()` and `draw_box()`.
    """
    fig, ax = plt.subplots(dpi=150)

    # Write code to first draw the image containing the object whose bounding box was predicted, by calling
    # `draw_image` and passing it the axis `ax` and the image as arguments

    # Write code to draw the predicted box by calling `draw_box` and passing it the
    # axis `ax`, the predicted box and the color "r" as arguments to draw the box in red

    # Write code to show the plot by calling `plt.show()`

In [None]:
# Sample randomly from the metadata and check visually if the above two helper functions work

random_sample_id = random.randint(0, len(metadata)-1)
sample = metadata[random_sample_id]

fig, ax = plt.subplots(dpi=150)

# Write code to extract the image path and bounding box from the sample variable (a dictionary), then show
# the image and plot the bounding box by calling `draw_image` and `draw_box`, respectively.
# Use "b" as the color to plot the bounding box in blue.
# ...
# ...
# ...

plt.show()

In [None]:
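# Finally, shuffles the stored information before preparing the dataset for modeling.
# This works on the dictionary only because its keys are the consecutive integers 0..n-1;
# the values are reassigned among those keys.
random.shuffle(metadata)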

## Modeling

In [None]:
# Uses an image size for which the target pretrained model (see below) offers optimized model weights
image_size = 160

In [None]:
# Instantiates the MobileNetV2 architecture and returns an image classification model loaded with weights pretrained on ImageNet.
# For more details, see https://keras.io/api/applications/mobilenet/#mobilenetv2-function

# Write code to call tf.keras.applications.MobileNetV2, passing
# the expected input shape (3D) as the parameter `input_shape`,
# `False` as the parameter `include_top` to exclude the ImageNet-specific top layer, and
# "imagenet" as the parameter `weights` to load the weights of the model pretrained on the ImageNet dataset

base_model = # ...
    # ...
    # ...
    # ...

# Write code to make the base model's parameters non-trainable (freeze them)

In [None]:
# Creates model out of base model

inputs = # Write code to create model's input using tf.keras.Input passing correct 3D `shape`
x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)     # Scales input pixels between -1 and 1 before passing them to the model.
x = # Write code to pass processed input to base_model

# Makes the base model's output smaller and then flattens the output features
x = # Write code to create a tf.keras.layers.Conv2D layer with 512 filters, (3, 3) kernel size and (2, 2) strides, and pass the base model's output into it to reduce its spatial dimensions
x = # Write code to flatten the convolutional layer's 3-D output (feature maps) into 1-D using tf.keras.layers.Flatten, passing it the Conv2D layer's output

# Passes the flattened data through three densely connected layers
x = # Write code to create a dense layer with 128 units and "relu" activation using tf.keras.layers.Dense, passing it the flattened output
x = # Write code to create a dense layer with 64 units and "relu" activation using tf.keras.layers.Dense, passing it the output of the previous dense layer
x = # Write code to create a dense layer with 32 units and "relu" activation using tf.keras.layers.Dense, passing it the output of the previous dense layer

head = # Write code to create a dense layer with 4 units [the start coordinates (x, y), width and height of the predicted
# bounding box] with sigmoid activation, so that the outputs range between 0 and 1 and can be scaled later.

model = tf.keras.Model(inputs=inputs, outputs=head)                 # Creates target model combining inputs and outputs

In [None]:
# Checks the model summary before proceeding for model training
# Make sure the weights of the locked base model are reported as non-trainable (shown in red).
model.summary(show_trainable=True)

In [None]:
# Data loader that serves data efficiently during model training

resizer = tf.keras.layers.Resizing(image_size, image_size)      # A layer that resizes all input images to the same size


def load_image(path):
    """
    Loads and resizes an input image from its path
    """
    x = tf.io.read_file(path)
    x = tf.image.decode_jpeg(x, channels=3)
    return resizer(x)

# First loads path for all images into TensorFlow Dataset and then
# loads the images against each of the image path
images = tf.data.Dataset.from_tensor_slices([v["image_path"] for v in metadata.values()])
images = images.map(load_image, num_parallel_calls=8)

# Similarly, loads the bounding box for all images
labels = tf.data.Dataset.from_tensor_slices([v["box"] for v in metadata.values()])

# Combines these two datasets into one (the tuple form is accepted by older and newer TensorFlow versions)
dataset = tf.data.Dataset.zip((images, labels))

# Separates out 20% of the samples as the validation set, and then
# sets the batch size and enables prefetching for each of the datasets
val_set = dataset.take(160).batch(32).prefetch(2)
train_set = dataset.skip(160).batch(32).prefetch(2)
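
The split above hard-codes 160 validation samples, which is 20% only if the subset holds 800 samples. As a rough sketch, the size could instead be derived from the number of samples. The helper `split_sizes` is hypothetical, and plain list slicing stands in for `take` and `skip` so the idea can be tried without TensorFlow.

```python
def split_sizes(n_samples, val_fraction=0.2):
    """Returns (n_val, n_train) for the given validation fraction."""
    n_val = int(n_samples * val_fraction)
    return n_val, n_samples - n_val


samples = list(range(10))                 # Stand-in for the (already shuffled) samples
n_val, n_train = split_sizes(len(samples))

val_part = samples[:n_val]                # Plays the role of dataset.take(n_val)
train_part = samples[n_val:]              # Plays the role of dataset.skip(n_val)

print(n_val, n_train)                     # prints 2 8
print(split_sizes(800))                   # prints (160, 640)
```

Since the first `n_val` samples form the validation set, the i-th validation prediction corresponds to the i-th entry of the metadata, provided the order does not change between the split and the prediction.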

## Training the Model

In [None]:
# Write code to compile the model by calling model.compile, passing tf.keras.optimizers.Adam
# (with learning rate 1e-4) as `optimizer` and "mse" as the `loss` function
# ...
# ...

# Write code to fit the model by calling model.fit, passing the training set, the validation set as `validation_data`,
# 50 as `epochs`, and `callbacks` as a list containing an early stopping object initialized with
# 10 as `patience`, "val_loss" as `monitor`, "min" as `mode` and True (the boolean) as `restore_best_weights`
history = model.fit(# ...

)
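
To build some intuition for the `patience` setting, the following pure-Python simulation walks through a list of made-up validation losses. It is a simplified illustration of the idea, not Keras's implementation, which also supports options such as `min_delta`. The function name `early_stopping_epoch` and the loss values are invented for this sketch.

```python
def early_stopping_epoch(val_losses, patience):
    """Simulates early stopping on a list of per-epoch validation losses.

    Returns (stop_epoch, best_epoch), both 0-based. Training stops once the loss
    has failed to improve for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:              # mode="min": lower is better
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


losses = [0.9, 0.5, 0.4, 0.45, 0.41, 0.42]   # Made-up values
print(early_stopping_epoch(losses, patience=3))    # prints (5, 2)
print(early_stopping_epoch(losses, patience=10))   # prints (5, 2)
```

With a patience of 3, the run stops at epoch 5 because the loss has not improved since epoch 2. Restoring the best weights would then roll the model back to its state at epoch 2. With a patience of 10, the list simply ends before the limit is reached.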

In [None]:
# Plots the learning curve

plt.plot(history.history["loss"], label="Training loss")
plt.plot(history.history["val_loss"], label="Validation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.title("Learning Curves")

In [None]:
# Performs predictions on the validation data
val_predictions = model.predict(val_set)

In [None]:
# Checks one of the predictions to ensure it holds four values (x, y, width, height) between 0 and 1
val_predictions[0]

In [None]:
# Visualizes the predicted bounding boxes of any random image from validation set
# [Note: Run this cell multiple times to check predictions against different samples]

random_sample_id = random.randint(0, 160-1)

predicted_box = val_predictions[random_sample_id]

image_path = metadata[random_sample_id]["image_path"]   # Gets the image file path from metadata as it was not required to be stored in TensorFlow Dataset for modeling

print(f"Image path: {image_path}")
draw_prediction(image_path, predicted_box)


## Observations

- How was the subset of images and their annotations prepared for model training?

- Why was an intermediate data structure prepared to store the dataset? What did the structure look like, and what did it contain?

- Which helper functions were created, and what were they created for?

- Why was a pretrained model used as a base model? How was a custom model built on top of it?

- Which data loader was used to feed data into the model for training? Why was a data loader used instead of feeding the data into the model manually? Why was early stopping used, and how was it configured?

- Explain the learning curve.