#Image Segmentation & Mask R-CNN | Assignment

#Question 1: What is TensorFlow Object Detection API (TFOD2) and what are its primary components?

- TensorFlow Object Detection API (TFOD2) is a high-level framework built on TensorFlow 2 that helps you build, train, fine-tune, and deploy object detection models.

- Primary Components of TFOD2

1. Pretrained Models (Model Zoo)

    - Ready-to-use models such as SSD, Faster R-CNN, EfficientDet, CenterNet, and Mask R-CNN

    - Pretrained on the COCO 2017 dataset

    - Used as starting checkpoints for transfer learning, which saves training time and computation

2.  Pipeline Configuration File

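     - A .config file (protobuf text format) that defines the whole training run:

         - Model architecture

         - Training parameters

         - Dataset paths

         - Optimizer & learning rate

     - The training, evaluation, and export scripts all read this one file, so it is the single source of truth for an experiment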

3. Dataset Preparation Tools

      - Converts data into TFRecord format

      - Uses label maps to assign class IDs

      - Ensures data is efficiently loaded during training

4. Training & Evaluation Scripts

      - Scripts like model_main_tf2.py

      - Used to train, evaluate, and monitor performance

5. Inference & Export Utilities

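     - Export trained checkpoints as a SavedModel (exporter_main_v2.py) or as a TFLite-compatible graph (export_tflite_graph_tf2.py)

     - Exported models can then be deployed on CPU, GPU, TPU, mobile, and edge devices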
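
The label map mentioned under dataset preparation is a small text file that pairs each class name with an integer ID. The sketch below shows one way to generate that text; `build_label_map` and the class names are hypothetical, and it assumes the usual TFOD2 convention that IDs start at 1 because 0 is reserved for background.

```python
def build_label_map(class_names):
    """Return label-map text with IDs starting at 1 (0 is reserved for background)."""
    items = []
    for class_id, name in enumerate(class_names, start=1):
        items.append(f"item {{\n  id: {class_id}\n  name: '{name}'\n}}\n")
    return "\n".join(items)


# Hypothetical class list for illustration
label_map_text = build_label_map(["car", "person"])
print(label_map_text)
```

Writing `label_map_text` to a file such as `label_map.pbtxt` would give the file that the pipeline config points to.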


#Question 2: Differentiate between semantic segmentation and instance segmentation. Provide examples of where each might be used.

1.  Semantic Segmentation

    - Assigns a class label to every pixel in an image

    - Does NOT distinguish between different objects of the same class

    - All objects of the same category are treated as one region

    - Example:

        - In a road scene, all cars are labeled as “car”, all people as “person”

    - Used in:

        - Autonomous driving (road, lane, sidewalk detection)

        - Medical imaging (tumor vs healthy tissue)

        - Satellite image analysis (land, water, vegetation)

2. Instance Segmentation

    - Assigns a class label + a unique ID to each object instance

    - Separates individual objects, even if they belong to the same class

    - Example:

        - Two cars → car_1 and car_2, each with its own mask

    - Used in:

        - Crowd counting

        - Object tracking

        - Robotics & manufacturing (picking specific objects)

        - Wildlife monitoring
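
The difference can be made concrete with a tiny hand-written example. This is a minimal sketch with made-up data: plain nested lists stand in for image arrays, and the names `instance_mask` and `instance_classes` are hypothetical.

```python
CAR = 1  # hypothetical class ID

# Instance mask: 0 = background, 1 = car_1, 2 = car_2
instance_mask = [
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]
# Class of each instance ID: both instances are cars
instance_classes = {1: CAR, 2: CAR}

# Semantic mask: replace every instance ID with its class ID
semantic_mask = [[instance_classes.get(v, 0) for v in row] for row in instance_mask]

num_instances = len({v for row in instance_mask for v in row if v != 0})
semantic_labels = {v for row in semantic_mask for v in row}

print("Instances found:", num_instances)
print("Labels in semantic mask:", sorted(semantic_labels))
```

Both cars collapse to the same label in the semantic mask, so the number of cars can no longer be read from it, while the instance mask keeps them apart.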


#Question 3: Explain the Mask R-CNN architecture. How does it extend Faster R-CNN?

- Mask R-CNN is an advanced instance segmentation model that extends Faster R-CNN by adding a branch for predicting pixel-level object masks, in addition to object detection.

1. Backbone Network

   - Mask R-CNN uses a CNN backbone such as ResNet-50/101 combined with Feature Pyramid Network (FPN) to extract rich multi-scale feature maps from the input image.

2. Region Proposal Network (RPN)

    - Like Faster R-CNN, an RPN generates region proposals (RoIs) by predicting objectness scores and bounding boxes over the feature maps.

3. RoIAlign (Key Improvement)

    - Instead of the RoIPool used in Faster R-CNN, Mask R-CNN introduces RoIAlign, which:

        - Removes quantization errors

        - Uses bilinear interpolation

        - Preserves exact spatial alignment

    - This is crucial for accurate pixel-level mask prediction.

4. Parallel Output Heads

     - For each RoI, Mask R-CNN has three parallel branches:

         - Classification head → predicts object class

         - Bounding box regression head → refines bounding boxes

         - Mask head → a small Fully Convolutional Network (FCN) that predicts one binary segmentation mask per class

     - The mask branch outputs a fixed-size mask (e.g., 28×28) for each RoI. Because it predicts a separate mask for every class, mask prediction does not compete with classification; the mask belonging to the class chosen by the classification head is the one that is kept.

5. Multi-Task Loss

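    - The total loss is the sum of three terms, L = L_cls + L_box + L_mask:

        - Classification loss (L_cls)

        - Bounding box regression loss (L_box)

        - Mask loss (L_mask): average pixel-wise binary cross-entropy, computed only on the mask of the ground-truth class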

6.  How Mask R-CNN Extends Faster R-CNN

    - Adds a mask prediction branch for instance segmentation

    - Replaces RoIPool with RoIAlign for better spatial precision

    - Performs detection and segmentation simultaneously

    - Produces bounding boxes, class labels, and object masks
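
To illustrate why RoIAlign avoids quantization, the sketch below samples a feature map at fractional coordinates with bilinear interpolation and averages the samples in each output bin. It is a simplified, single-channel illustration, not the TFOD2 implementation: `bilinear_sample` and `roi_align` are hypothetical names, the RoI is given directly in feature-map coordinates, and `feature[y][x]` is assumed to be the value at the exact point (y, x).

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2-D feature map (list of rows) at a fractional (y, x) location."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, output_size=2, samples_per_bin=2):
    """Average bilinear samples inside each bin of roi = (y1, x1, y2, x2)."""
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / output_size
    bin_w = (x2 - x1) / output_size
    output = []
    for i in range(output_size):
        row = []
        for j in range(output_size):
            total = 0.0
            for sy in range(samples_per_bin):
                for sx in range(samples_per_bin):
                    # Regularly spaced sample points; no rounding to integer cells
                    y = y1 + (i + (sy + 0.5) / samples_per_bin) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples_per_bin) * bin_w
                    total += bilinear_sample(feature, y, x)
            row.append(total / samples_per_bin ** 2)
        output.append(row)
    return output


# Feature map whose value equals the x coordinate, so results are easy to check
feature_map = [[float(x) for x in range(4)] for _ in range(4)]
pooled = roi_align(feature_map, roi=(0.0, 0.5, 2.0, 2.5))
print(pooled)
```

The RoI starts at x = 0.5, a position RoIPool would round to a whole cell; here the fractional offset is kept, which is what preserves alignment between the mask and the image.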

#Question 4: Describe the purpose of masks in image segmentation. How are they used during training and inference?

- Purpose of Masks in Image Segmentation

    - In image segmentation, a mask is a pixel-level representation that indicates which pixels belong to an object or region of interest. Each pixel in a mask is labeled (usually 0 or 1, or class IDs), enabling precise localization of objects beyond bounding boxes.

    - Masks are essential for tasks like semantic segmentation and instance segmentation, where understanding the exact shape and boundaries of objects is required.

- Use of Masks During Training

   - During training:

      - Ground truth masks are provided for each object or class.

      - The model predicts a mask for every detected object or pixel region.

      - The predicted mask is compared with the ground truth mask.

      - A pixel-wise loss function (e.g., binary cross-entropy or Dice loss) is used to measure error.

      - The loss guides the network to learn accurate object boundaries and shapes.

      - In instance segmentation models like Mask R-CNN, the mask loss is computed per object instance: each positive RoI is trained against the ground-truth mask of its matched object, jointly with classification and bounding box regression.

- Use of Masks During Inference

   - During inference:

      - The trained model predicts masks for unseen images.

      - Each mask highlights the exact pixels belonging to a detected object. In Mask R-CNN, the fixed-size mask is resized to its bounding box and thresholded (typically at 0.5) to obtain a binary mask.

      - Masks are combined with class labels and bounding boxes to produce the final output.

      - The result allows precise object separation, even when objects overlap.
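
The two pixel-wise losses named in the training steps can be sketched in a few lines. This is an illustrative, dependency-free version operating on nested lists of probabilities; the function names and the example values are hypothetical, and a real training loop would use the framework's tensor operations instead.

```python
import math


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean pixel-wise binary cross-entropy between predicted probabilities and a 0/1 mask."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1 - eps)  # avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count


def dice_loss(pred, target, eps=1e-7):
    """1 - Dice coefficient; 0 means the masks overlap perfectly."""
    intersection = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    pred_sum = sum(p for row in pred for p in row)
    target_sum = sum(t for row in target for t in row)
    return 1 - (2 * intersection + eps) / (pred_sum + target_sum + eps)


ground_truth = [[1, 0], [1, 0]]
good_prediction = [[0.9, 0.1], [0.8, 0.2]]
poor_prediction = [[0.4, 0.6], [0.3, 0.7]]

for name, pred in [("good", good_prediction), ("poor", poor_prediction)]:
    print(name, binary_cross_entropy(pred, ground_truth), dice_loss(pred, ground_truth))
```

Both losses are lower for the prediction that agrees with the ground-truth mask, which is the signal that pushes the network toward accurate boundaries.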

#Question 5: What are the steps involved in training a custom image segmentation model using TFOD2?

- Steps to Train a Custom Image Segmentation Model Using TFOD2
1. Dataset Collection

    - Collect images relevant to the segmentation task.

    - Ensure sufficient variation (lighting, scale, background, angles).

2. Data Annotation

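    - Annotate images using a tool that supports polygons or masks, such as LabelMe or CVAT (LabelImg only draws bounding boxes, so it is not sufficient for segmentation).

    - For segmentation, create pixel-wise masks for each object.

    - Save annotations in a format that can store masks (e.g., COCO JSON with polygon or RLE segmentations), ready for conversion to TFRecord.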

3. Data Preparation

    - Split the dataset into training and validation sets.

    - Convert annotations into TFRecord format, which TFOD2 requires.

    - Create a label map (.pbtxt) defining class IDs and names.

4. Model Selection

    - Choose a pretrained segmentation-capable model from the TFOD2 Model Zoo (e.g., Mask R-CNN).

     - Download the model checkpoint and pipeline configuration file.

5. Pipeline Configuration

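      - Modify the pipeline.config file:

          - Set num_classes to the number of classes in the label map

          - Update the paths to the TFRecords and label map in train_input_reader and eval_input_reader

          - Configure batch size, learning rate, and fine_tune_checkpoint (with fine_tune_checkpoint_type set to "detection")

          - Enable mask loading in the input readers (load_instance_masks: true, mask_type: PNG_MASKS)

6. Training

    - Run model_main_tf2.py with --pipeline_config_path and --model_dir to start fine-tuning.

    - Monitor the loss curves in TensorBoard.

7. Evaluation

    - Run model_main_tf2.py again with --checkpoint_dir to evaluate saved checkpoints on the validation set.

    - Use COCO mask metrics (mAP computed over mask IoU) to judge segmentation quality.

8. Export

    - Export the best checkpoint as a SavedModel with exporter_main_v2.py for inference.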
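
The train/validation split from the data preparation step is worth making reproducible, so that the same images always land in the same set. A minimal sketch, assuming a flat list of image filenames; `split_dataset`, the 20% validation fraction, and the seed are illustrative choices rather than TFOD2 requirements.

```python
import random


def split_dataset(filenames, val_fraction=0.2, seed=42):
    """Shuffle deterministically and split into (train, val) lists."""
    shuffled = sorted(filenames)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    num_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[num_val:], shuffled[:num_val]


# Hypothetical filenames for illustration
image_files = [f"img_{i:03d}.jpg" for i in range(10)]
train_files, val_files = split_dataset(image_files)
print(len(train_files), "training images,", len(val_files), "validation images")
```

Each list would then be converted to its own TFRecord file and referenced from the matching input reader in the pipeline config.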

#Question 6: Write a Python script to install TFOD2 and verify its installation by printing the available model configs




In [None]:
!pip install tensorflow==2.13.0
!pip install tf-models-official
!pip install protobuf==3.20.3
!pip install cython
!pip install pillow lxml matplotlib opencv-python

!git clone https://github.com/tensorflow/models.git

%cd models/research
!protoc object_detection/protos/*.proto --python_out=.
!cp object_detection/packages/tf2/setup.py .
!python -m pip install .

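import glob
import os

# This import only succeeds if the object_detection package was installed
from object_detection.utils import config_util

print("TFOD2 installed successfully!")

# The working directory is models/research, so the sample pipeline configs
# shipped with the repository are under object_detection/configs/tf2
config_files = sorted(glob.glob("object_detection/configs/tf2/*.config"))
print(f"Available model configs ({len(config_files)}):")
for path in config_files:
    print(" -", os.path.basename(path))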


#Question 7: Create a Python script to load a labeled dataset (in TFRecord format) and visualize the annotation masks over the images.



In [None]:
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np

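def parse_tfrecord(example_proto):
    feature_description = {
        'image/encoded': tf.io.FixedLenFeature([], tf.string),
        'image/height': tf.io.FixedLenFeature([], tf.int64),
        'image/width': tf.io.FixedLenFeature([], tf.int64),
        # TFOD2 stores one PNG-encoded binary mask per object instance,
        # so this feature is variable-length rather than a single string
        'image/object/mask': tf.io.VarLenFeature(tf.string),
    }

    example = tf.io.parse_single_example(example_proto, feature_description)

    # decode_jpeg already returns a uint8 tensor of shape [height, width, 3]
    image = tf.io.decode_jpeg(example['image/encoded'], channels=3)

    # Decode every instance mask to [height, width], then merge them into a
    # single mask for display. Assumes each image has at least one mask.
    png_masks = tf.sparse.to_dense(example['image/object/mask'], default_value='')
    instance_masks = tf.map_fn(
        lambda png: tf.squeeze(tf.io.decode_png(png, channels=1), axis=-1),
        png_masks,
        fn_output_signature=tf.TensorSpec([None, None], tf.uint8),
    )
    mask = tf.reduce_max(instance_masks, axis=0)

    return image, mask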


tfrecord_path = "dataset.tfrecord"

dataset = tf.data.TFRecordDataset(tfrecord_path)
dataset = dataset.map(parse_tfrecord)


for image, mask in dataset.take(1):

    image = image.numpy()
    mask = mask.numpy()

    plt.figure(figsize=(12, 5))

    plt.subplot(1, 3, 1)
    plt.imshow(image)
    plt.title("Original Image")
    plt.axis("off")

    plt.subplot(1, 3, 2)
    plt.imshow(mask, cmap="gray")
    plt.title("Annotation Mask")
    plt.axis("off")

    plt.subplot(1, 3, 3)
    plt.imshow(image)
    # Hide background pixels so only the annotated regions are tinted
    plt.imshow(np.ma.masked_where(mask == 0, mask), cmap="jet", alpha=0.5)
    plt.title("Mask Overlay")
    plt.axis("off")

    plt.show()


#Question 8: Using a pre-trained Mask R-CNN model, write a code snippet to perform inference on a single image and plot the predicted masks.


In [None]:
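import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from object_detection.utils import label_map_util
from object_detection.utils import ops as utils_ops
from object_detection.utils import visualization_utils as viz_utils

# Load the exported Mask R-CNN SavedModel
MODEL_PATH = "exported-model/saved_model"
detect_fn = tf.saved_model.load(MODEL_PATH)

# Map class IDs to display names
LABEL_MAP_PATH = "label_map.pbtxt"
category_index = label_map_util.create_category_index_from_labelmap(
    LABEL_MAP_PATH, use_display_name=True
)

IMAGE_PATH = "test.jpg"

# The model expects a uint8 batch of shape [1, height, width, 3]
image = tf.io.read_file(IMAGE_PATH)
image = tf.io.decode_jpeg(image, channels=3)
input_tensor = tf.expand_dims(image, axis=0)

detections = detect_fn(input_tensor)

# Keep only the per-detection outputs and drop the batch dimension
num_detections = int(detections['num_detections'][0])
keys = ['detection_boxes', 'detection_classes', 'detection_scores', 'detection_masks']
detections = {k: detections[k][0, :num_detections].numpy() for k in keys}
detections['detection_classes'] = detections['detection_classes'].astype(np.int64)

# Masks are predicted relative to each box; paste them into full-image
# coordinates and binarize them
image_np = image.numpy().copy()
masks = utils_ops.reframe_box_masks_to_image_masks(
    tf.convert_to_tensor(detections['detection_masks']),
    tf.convert_to_tensor(detections['detection_boxes']),
    image_np.shape[0], image_np.shape[1])
masks = tf.cast(masks > 0.5, tf.uint8).numpy()

# Draw boxes, labels, and masks in place on image_np.
# The 0.5 score threshold is an illustrative choice; tune it for your model.
viz_utils.visualize_boxes_and_labels_on_image_array(
    image_np,
    detections['detection_boxes'],
    detections['detection_classes'],
    detections['detection_scores'],
    category_index,
    instance_masks=masks,
    use_normalized_coordinates=True,
    min_score_thresh=0.5,
)

plt.figure(figsize=(12, 8))
plt.imshow(image_np)
plt.axis("off")
plt.show()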


#Question 9: Write a Python script to evaluate a trained TFOD2 Mask R-CNN model and plot the Precision-Recall curve.


In [None]:
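import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

MODEL_PATH = "exported-model/saved_model"
detect_fn = tf.saved_model.load(MODEL_PATH)
IOU_THRESHOLD = 0.5  # a detection is a true positive at box IoU >= 0.5

def parse_tfrecord(example_proto):
    box_keys = ['ymin', 'xmin', 'ymax', 'xmax']
    feature_description = {
        'image/encoded': tf.io.FixedLenFeature([], tf.string),
        'image/object/class/label': tf.io.VarLenFeature(tf.int64),
    }
    for key in box_keys:
        feature_description['image/object/bbox/' + key] = tf.io.VarLenFeature(tf.float32)

    example = tf.io.parse_single_example(example_proto, feature_description)
    image = tf.io.decode_jpeg(example['image/encoded'], channels=3)
    labels = tf.sparse.to_dense(example['image/object/class/label'])
    # Normalized [ymin, xmin, ymax, xmax], the same layout as detection_boxes
    boxes = tf.stack([tf.sparse.to_dense(example['image/object/bbox/' + key])
                      for key in box_keys], axis=-1)
    return image, labels, boxes

def box_iou(box, boxes):
    """IoU between one box and an array of boxes, all [ymin, xmin, ymax, xmax]."""
    inter_h = np.clip(np.minimum(box[2], boxes[:, 2]) - np.maximum(box[0], boxes[:, 0]), 0, None)
    inter_w = np.clip(np.minimum(box[3], boxes[:, 3]) - np.maximum(box[1], boxes[:, 1]), 0, None)
    inter = inter_h * inter_w
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

dataset = tf.data.TFRecordDataset("val.tfrecord")
dataset = dataset.map(parse_tfrecord).batch(1)

y_true = []    # 1 if the detection matched a ground-truth object, else 0
y_scores = []  # confidence score of each detection
total_gt = 0   # number of ground-truth objects, needed for recall

for images, labels, boxes in dataset.take(50):
    detections = detect_fn(images)

    n = int(detections['num_detections'][0])
    scores = detections['detection_scores'][0, :n].numpy()
    classes = detections['detection_classes'][0, :n].numpy().astype(int)
    det_boxes = detections['detection_boxes'][0, :n].numpy()
    gt_labels, gt_boxes = labels[0].numpy(), boxes[0].numpy()
    total_gt += len(gt_labels)
    matched = np.zeros(len(gt_labels), dtype=bool)

    # Greedy matching: the highest-scoring detections claim ground truth first
    for i in np.argsort(-scores):
        candidates = np.where((gt_labels == classes[i]) & ~matched)[0]
        is_tp = 0
        if len(candidates) > 0:
            ious = box_iou(det_boxes[i], gt_boxes[candidates])
            best = int(np.argmax(ious))
            if ious[best] >= IOU_THRESHOLD:
                matched[candidates[best]] = True
                is_tp = 1
        y_scores.append(scores[i])
        y_true.append(is_tp)

# Sweep the score threshold from high to low and accumulate TP/FP counts.
# This pools all classes and uses box IoU; mask IoU would be stricter.
order = np.argsort(-np.array(y_scores))
hits = np.array(y_true)[order]
tp = np.cumsum(hits)
fp = np.cumsum(1 - hits)
precision = tp / np.maximum(tp + fp, 1)
recall = tp / max(total_gt, 1)

plt.figure(figsize=(8, 6))
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve (box IoU >= 0.5)")
plt.show()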


#Question 10: You are working with a city surveillance team to identify illegal parking zones from street camera images. The model you built detects cars using bounding boxes, but the team reports inaccurate overlaps with sidewalks and fails in complex street scenes. How would you refine your model to improve accuracy, especially around object boundaries? What segmentation strategy and tools would you use?


1. Use Instance Segmentation Instead of Bounding Boxes

    - Switch from pure object detection to instance segmentation so each car is represented by a pixel-accurate mask.

    - Mask R-CNN is the natural choice:

        - Gives precise object boundaries

        - Separates overlapping vehicles

        - Removes the sidewalk pixels that a rectangular box inevitably includes

    - This directly addresses the boundary problem.

2. Add Semantic Segmentation for Scene Understanding

    - To know where illegal parking is happening, you must segment the environment, not just cars.

    - Use semantic segmentation to classify:

        - Road

        - Sidewalk

        - Parking zones

        - No-parking zones

    - Recommended models:

        - DeepLabv3+

        - U-Net (lighter, faster for city deployment)

3. Improve Boundary Accuracy During Training

    - To sharpen object edges:

        - Use high-resolution, multi-scale feature maps (FPN)

        - Keep RoIAlign (already part of Mask R-CNN) so features stay aligned with pixels

        - Add overlap-based losses (Dice / IoU loss) to the mask loss, which optimize mask overlap directly

        - Train with fine-grained polygon annotations, not loose boxes

    - Audit annotation quality: a model cannot learn boundaries that are sharper than its labels.
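
Once each car has an instance mask and the scene has a no-parking zone mask, a parking violation can be decided from how much the two overlap. The sketch below is one possible rule rather than a prescribed method: `overlap_fraction`, the tiny hand-written masks, and the 0.3 threshold are all hypothetical, and both masks are assumed to be binary and the same size.

```python
def overlap_fraction(car_mask, zone_mask):
    """Fraction of the car's pixels that fall inside the zone."""
    car_pixels = 0
    overlap = 0
    for car_row, zone_row in zip(car_mask, zone_mask):
        for c, z in zip(car_row, zone_row):
            car_pixels += c
            overlap += c * z
    return overlap / car_pixels if car_pixels else 0.0


# Instance mask of one car (1 = car pixel)
car_mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
# Semantic mask of the no-parking zone (1 = no-parking pixel)
no_parking_mask = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

VIOLATION_THRESHOLD = 0.3  # illustrative value; would be tuned per camera
fraction = overlap_fraction(car_mask, no_parking_mask)
is_violation = fraction >= VIOLATION_THRESHOLD
print(f"{fraction:.0%} of the car is inside the no-parking zone, violation: {is_violation}")
```

Because the rule counts actual car pixels, the empty corners of a bounding box no longer inflate the overlap with the sidewalk or the zone.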