## **OBJECT DETECTION**

#### **Theory**

**Q :** How to install OpenCV for object detection?

* pip install opencv-python

**Q :** What is the use of the OpenCV library?

* image processing
* computer vision tasks like object detection
* image recognition
* video analysis

**Q :** What is the significance of the DNN module in the OpenCV library?

* it is used for deep learning
* allows integration of pre-trained deep learning models like YOLOv4

**Q :** Name the model we are going to use for object detection.

* YOLO (You Only Look Once)
* state-of-the-art object detection model
* it's known for its speed and accuracy
* YOLOv4 is a specific version of YOLO

**Q :** What is the algorithm behind the above object detection model?

The main techniques behind the YOLO model are:
1. Deep neural networks
2. Convolutional Neural Networks (CNNs): the backbone of YOLOv4, designed to process visual data efficiently

**Q :** List the key concepts and jargon related to the YOLO model.

1. **Grid Cells** - The image is divided into a grid of cells.
2. **Bounding Box Prediction** - Each grid cell predicts a fixed number of bounding boxes, along with their confidence scores.
3. **Class Probability** - Each bounding box is associated with class probabilities, indicating the likelihood of different object classes (e.g., person, car, dog).
4. **Non-Maximum Suppression (NMS)** - A technique to eliminate redundant detections and select the most confident one
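
To make the grid-cell idea concrete, the sketch below finds which cell contains an object's center. The helper name `responsible_cell` and the 13x13 grid on a 416x416 image are illustrative assumptions, not part of any library.

```python
def responsible_cell(center_x, center_y, img_w, img_h, grid_size):
    """Return (row, col) of the grid cell that contains the object's center."""
    # Scale the pixel coordinate into grid units, then clamp to the last cell
    col = min(int(center_x / img_w * grid_size), grid_size - 1)
    row = min(int(center_y / img_h * grid_size), grid_size - 1)
    return row, col

# Assumed example: a 416x416 image split into a 13x13 grid (32-pixel cells)
cell = responsible_cell(200, 100, 416, 416, 13)
print(cell)  # (3, 6)
```

An object centered at pixel (200, 100) therefore belongs to the cell in row 3, column 6 under these assumptions.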

**Q :** Give a brief overview of YOLO's deep CNN architecture.

1. **Backbone Network** :

* core of the network
* responsible for extracting features from the input image
* typically employs a pre-trained network, such as CSPDarknet53 (YOLOv4) or Darknet-53 (YOLOv3)
* extract rich feature maps at different scales
* captures both fine-grained and coarse-grained information
* extracts fundamental features from the image, such as edges, textures, and shapes

2. **Neck** :

* responsible for fusing feature maps from different layers of the backbone
* enhances feature representation at various scales, which is crucial for detecting objects of different sizes
* techniques like Feature Pyramid Networks (FPNs) are commonly used to achieve this; YOLOv4 itself uses SPP and PANet

3. **Head** :

* final stage of the network
* actual object detection predictions are made here
* consists of multiple prediction layers
* each layer is responsible for predicting bounding boxes and class probabilities for a specific scale
* Each prediction layer outputs:
    1. Bounding box coordinates (x, y, width, height)
    2. Objectness score (confidence that there is an object in the box)
    3. Class probabilities for different object categories
* Predicts the final output, including bounding box coordinates and class probabilities
* ensures that the model can accurately locate and classify objects
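
The multi-scale head can be illustrated with a little arithmetic. The sketch below assumes a 416x416 input, strides of 8, 16, and 32, three anchor boxes per cell, and the 80 COCO classes. These match the usual YOLOv4 configuration, but your own .cfg file is the authority.

```python
def head_output_shapes(input_size=416, strides=(8, 16, 32),
                       anchors_per_cell=3, num_classes=80):
    """Return (rows, values_per_row) for each prediction layer."""
    # 4 box coordinates + 1 objectness score + one probability per class
    values_per_box = 4 + 1 + num_classes
    shapes = []
    for stride in strides:
        grid = input_size // stride  # cells per side at this scale
        shapes.append((grid * grid * anchors_per_cell, values_per_box))
    return shapes

shapes = head_output_shapes()
total_boxes = sum(rows for rows, _ in shapes)
print(shapes)       # [(8112, 85), (2028, 85), (507, 85)]
print(total_boxes)  # 10647
```

Under these assumptions the three layers together propose 10,647 candidate boxes per image, which is why confidence filtering and NMS are needed afterwards.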

**Q :** Give a step-by-step guide related to the working of YOLO.

1. **Image Division** : 
    * input image is divided into a grid of cells
    * like a chessboard
    * each cell in this grid is responsible for detecting objects that fall within its boundaries

2. **Feature Extraction** : 
    * A CNN extracts features from the image

3. **Bounding Box Prediction** : 
    * For each grid cell, the network predicts bounding boxes and their associated class probabilities
    * each cell will predict a certain number of bounding boxes (usually 3 or more)
    * These boxes are rectangles that can potentially enclose objects within the cell
    * For each predicted bounding box, the network will assign probabilities to different object classes (e.g., person, car, dog)
    * probabilities indicate the likelihood of the object being of that particular class
    * So, for a single cell, the network might predict something like:
        - Bounding Box 1: Confidence: 0.7, Class Probabilities: Person (0.8), Car (0.2)
        - Bounding Box 2: Confidence: 0.3, Class Probabilities: Bicycle (0.6), Motorcycle (0.4)
    * network does this for every cell in the grid

4. **NMS** :
    * After all predictions are made, a technique called Non-Maximum Suppression (NMS) is applied to eliminate redundant detections and select the most confident ones
    * overlapping bounding boxes with low confidence scores are removed, leaving only the most confident detections
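
The single-cell example in step 3 can be worked through in a few lines. This sketch assumes one common convention, taken from the original YOLO formulation, in which a box's final score is its confidence multiplied by its best class probability. The 0.5 threshold is an illustrative choice.

```python
# The two boxes from the single-cell example in step 3
cell_predictions = [
    {"confidence": 0.7, "class_probs": {"Person": 0.8, "Car": 0.2}},
    {"confidence": 0.3, "class_probs": {"Bicycle": 0.6, "Motorcycle": 0.4}},
]

def score_boxes(predictions, threshold=0.5):
    """Keep (label, score) pairs whose combined score exceeds the threshold."""
    kept = []
    for box in predictions:
        # Pick the most likely class for this box
        label, prob = max(box["class_probs"].items(), key=lambda kv: kv[1])
        score = box["confidence"] * prob
        if score > threshold:
            kept.append((label, round(score, 2)))
    return kept

print(score_boxes(cell_predictions))  # [('Person', 0.56)]
```

Box 2 scores only 0.3 x 0.6 = 0.18, so it is discarded before NMS even runs.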

**Q :** Certain files should be downloaded into the project directory for object detection using OpenCV's YOLO support. Name those files.

* yolov4.weights
* yolov4.cfg
* coco.names

When you load the model using cv2.dnn.readNet, the function needs to locate these files to load the network architecture and weights.
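
A missing file only surfaces as an error when the model is loaded, so a quick check beforehand can give a clearer message. This is a minimal sketch, assuming the three files sit in the current working directory. `missing_model_files` is a hypothetical helper, not an OpenCV function.

```python
from pathlib import Path

REQUIRED_FILES = ("yolov4.weights", "yolov4.cfg", "coco.names")

def missing_model_files(directory=".", required=REQUIRED_FILES):
    """Return the names of required files that are not present in directory."""
    base = Path(directory)
    return [name for name in required if not (base / name).is_file()]

missing = missing_model_files()
if missing:
    print("Missing model files:", ", ".join(missing))
```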

**Q :** What are the contents of yolov4.weights file?

* file contains the trained weights of the YOLOv4 model
* weights are learned parameters that allow the model to accurately detect objects in images

**Q :** What about the yolov4.cfg file?

* contains the configuration of the YOLOv4 model's architecture
* defines the network layers, their parameters, and the overall structure of the model

**Q :** What about coco.names?

* contains a list of object classes that the YOLOv4 model is trained to detect
* classes are typically from the COCO dataset, which includes objects like people, cars, bicycles, etc.

#### **Code**

**Q :** What is the first and foremost step to implement object detection using OpenCV's YOLO support?

* Loading the model

```python
import cv2

# load the 'YOLO' model
net = cv2.dnn.readNet('yolov4.weights', 'yolov4.cfg')
```

* **cv2.dnn**:
  - refers to OpenCV's Deep Neural Network (DNN) module
  - allows loading pre-trained neural networks, making predictions, and performing various deep learning tasks

* **readNet**:
  - a function in the cv2.dnn module
  - loads the pre-trained weights and the configuration file for a neural network model

* model's structure (from the .cfg file) and its learned parameters (from the .weights file) are loaded into the net object
* net object is now ready to perform object detection on images or video frames

**Q :** What is the next step? What is its significance?

* Loading class labels step

```python
with open('coco.names', 'r') as f:
    classes = [line.strip() for line in f.readlines()]

# classes will be a list like:
# ['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'truck', ...]
```

* **open('coco.names', 'r')** - Opens the coco.names file in read mode ('r')
* **f.readlines()** - 
   - Reads all the lines from the file into a list
   - here, each line in the file corresponds to a class name
* **[line.strip() for line in f.readlines()]** -
   - a list comprehension 
   - removes any leading or trailing whitespace/newline characters from each line in the list of lines

* such a list is created to map each detected object's predicted class ID to a human-readable label
* YOLO predicts class IDs as numerical indices (e.g., 0 for "person," 1 for "bicycle")
* coco.names file provides the corresponding class names
* so, meaningful results instead of numeric IDs during detection can be displayed 
* Without this step, the output would only show IDs, which would be unintuitive
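
The ID-to-label mapping can be tried without the real file. The sketch below assumes a shortened, in-memory stand-in for coco.names that holds only its first three entries.

```python
import io

# Stand-in for open('coco.names'): first three COCO class names (assumed sample)
fake_file = io.StringIO("person\nbicycle\ncar\n")
classes = [line.strip() for line in fake_file if line.strip()]

# Pretend the network predicted these numeric class IDs
predicted_ids = [2, 0]
labels = [classes[class_id] for class_id in predicted_ids]
print(labels)  # ['car', 'person']
```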


**Q :** How to access the layers of the deep net we have initiated previously? Why is it even necessary?

* Retrieving Layer Names Step
```python
import numpy as np

layer_names = net.getLayerNames()

# identifying output layers
# flatten() handles both return shapes: nested [[i], ...] in older OpenCV, flat [i, ...] in newer
output_layers = [layer_names[i - 1] for i in np.array(net.getUnconnectedOutLayers()).flatten()]
```

* **Getting layer names is important because**:
   - It gives access to the names of all layers in the neural network.
   - YOLO's neural network consists of many layers, but not all are relevant for the final output.
   - Understanding the architecture is crucial to identifying the layers responsible for producing predictions.
   - This step lays the groundwork for identifying which layers are output layers.
   - **net.getLayerNames()**
      * Returns a list of all the layer names in the YOLOv4 network.
      * These names represent different computational layers in the neural network.

* **net.getUnconnectedOutLayers()**
  - Returns the indices of the output layers
  - output layers = layers that produce the final predictions
  - YOLOv4 typically has 3 output layers to predict bounding boxes at different scales.
  - Depending on the OpenCV version, the indices come back either nested (each wrapped in its own one-element array) or as a flat array; flattening the result handles both cases
* **[i - 1 for i in ...]**
  - Each i is an index starting from 1
  - Subtracting 1 converts it to a 0-based index, which aligns with Python's indexing
* **[layer_names[i - 1] for i in ...]**:
  - Retrieves the actual names of the output layers using the adjusted indices

* **Identifying output layers is necessary because**:
  - They determine which layers produce the final detection results (bounding boxes, confidence scores, and class probabilities)
  - YOLOv4 outputs predictions from specific layers, typically three for multi-scale detection.
  - Correctly identifying these layers ensures that the object detection results are extracted and processed accurately
  - Without this step, the code would not know where to retrieve the detection data from
  - If the output layers (for YOLOv4, typically ['yolo_139', 'yolo_150', 'yolo_161']) are not passed to the forward pass, net.forward() returns only the output of the network's last layer, so detections from the other scales are missed.
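
The return shape of getUnconnectedOutLayers() has differed between OpenCV releases: some return nested one-element entries, others a flat sequence. The sketch below is one hedged, pure-Python way to handle both. `output_layer_names` is a hypothetical helper, and the layer names are made-up placeholders.

```python
def output_layer_names(layer_names, unconnected):
    """Map 1-based output-layer indices (nested or flat) to layer names."""
    names = []
    for entry in unconnected:
        # Nested entries such as [3] expose __len__; plain integers do not
        index = entry[0] if hasattr(entry, "__len__") else entry
        names.append(layer_names[int(index) - 1])  # convert 1-based to 0-based
    return names

layer_names = ["conv_0", "conv_1", "out_a", "conv_3", "out_b"]
print(output_layer_names(layer_names, [[3], [5]]))  # ['out_a', 'out_b']
print(output_layer_names(layer_names, [3, 5]))      # ['out_a', 'out_b']
```

Recent OpenCV builds also provide net.getUnconnectedOutLayersNames(), which returns the names directly and avoids the index arithmetic altogether.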


**Q :** What is the next step?

* Loading the image step
```python
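import numpy as np

# uploaded_file is assumed to be a binary file-like object (for example, from a web upload widget)
# Convert the uploaded file (binary data) into a NumPy array of unsigned 8-bit integers (image bytes)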
# Why? --> OpenCV requires the image data in this format for decoding
# bytearray --> Converts the binary data into a mutable sequence of bytes
# np.asarray --> Creates a NumPy array from the byte sequence
file_bytes = np.asarray(bytearray(uploaded_file.read()), dtype=np.uint8)

# Decode the image bytes into an OpenCV image format (BGR color format)
# OpenCV can only process images in a specific format
# this function transforms raw bytes into an image matrix
img = cv2.imdecode(file_bytes, cv2.IMREAD_COLOR)

# Retrieve the height, width, and number of color channels from the image's shape
# _ is the placeholder for the 3rd value, representing the number of color channels (typically 3 for BGR)
height, width, _ = img.shape
```

**Q :** Should there be an image preprocessing step before actual object detection? If yes, how to do it?

* Yes, the image must first be converted into a blob:

```python
# YOLO Processing
blob = cv2.dnn.blobFromImage(img, 0.00392, (416, 416), (0, 0, 0), True, crop=False)
```

**blob = cv2.dnn.blobFromImage(...)**
  - Converts the input image into a "blob" that YOLO can process
  - Transforms the image into a 4D blob, which is a standard format for deep learning models in OpenCV
  - A 4D blob with shape (1, 3, 416, 416) representing (batch_size, channels, height, width)

**Q :** It seems there are a lot of parameters to be decided during the blob-making step. What does each represent?

||Parameter|Purpose|
|----|----|---|
|1|img|The input image to be processed (loaded earlier)|
|2|0.00392|A scaling factor (1/255), used to normalize pixel values from [0, 255] to [0, 1]|
|3|(416, 416)|The spatial size to which the image is resized (a common YOLOv4 input size; the network accepts sizes that are multiples of 32)|
|4|(0, 0, 0)|Mean subtraction values (used for normalization; here, no mean subtraction is applied)|
|5|True|Indicates swapping of the R and B channels (OpenCV uses BGR by default, YOLO expects RGB)|
|6|crop=False|Ensures the image is resized directly without cropping (the aspect ratio is not preserved)|
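
To see what these parameters do to the pixel data, here is a NumPy-only approximation of the same preprocessing. It is a sketch rather than OpenCV's implementation: it skips the resize step by assuming the image is already 416x416, and `to_blob` is a hypothetical helper name.

```python
import numpy as np

def to_blob(img_bgr, scale=1 / 255.0, swap_rb=True):
    """Approximate blobFromImage for an image that is already the target size."""
    x = img_bgr.astype(np.float32) * scale      # scale pixel values to [0, 1]
    if swap_rb:
        x = x[:, :, ::-1]                       # BGR -> RGB
    x = np.transpose(x, (2, 0, 1))              # HWC -> CHW
    return np.ascontiguousarray(x[np.newaxis])  # add batch dimension -> NCHW

# A pure-blue test image in OpenCV's BGR layout (blue is channel 0)
img = np.zeros((416, 416, 3), dtype=np.uint8)
img[:, :, 0] = 255
blob = to_blob(img)
print(blob.shape)  # (1, 3, 416, 416)
```

After the channel swap, the blue values sit in the last channel of the blob, which is the RGB ordering the network expects.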

**Q :** How to feed the deep YOLO net with the processed image?

```python
net.setInput(blob)
```
This step loads the blob into the YOLO network, preparing it for the forward pass, where the network performs object detection.

**Q :** How to get output from the net corresponding to the input?
```python
outs = net.forward(output_layers)
```
* **net.forward()** - Executes the forward pass and returns the detection results from the specified output layers.
* **output_layers** - Contains the names of the YOLO output layers (for YOLOv4, typically ['yolo_139', 'yolo_150', 'yolo_161'])
* YOLO's predictions (bounding boxes, class scores, and confidence scores) are generated in the forward pass
* This line collects all the raw output needed for further processing 
* outs is a list of NumPy arrays, each containing detection data

**Q :** How is a single detection data organised within an array?

* Each array contains predictions in the format:

**[center_x,  center_y,  width,  height,  confidence, class_prob_0, class_prob_1, ...]**
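
A single row in this layout can be decoded by hand. The sketch below uses a made-up detection with three classes and a 640x480 image. `decode_detection` is a hypothetical helper, and the numbers are illustrative, not real network output.

```python
def decode_detection(detection, img_w, img_h):
    """Split one detection row into a pixel box, objectness, and best class."""
    cx, cy, w, h = detection[:4]   # normalized center and size, in [0, 1]
    objectness = detection[4]
    class_scores = detection[5:]
    # Index of the highest class score (a pure-Python argmax)
    class_id = max(range(len(class_scores)), key=lambda i: class_scores[i])
    box_w, box_h = int(w * img_w), int(h * img_h)
    x = int(cx * img_w - box_w / 2)  # top-left corner from the center
    y = int(cy * img_h - box_h / 2)
    return {"box": [x, y, box_w, box_h], "objectness": objectness,
            "class_id": class_id, "score": class_scores[class_id]}

row = [0.5, 0.5, 0.25, 0.5, 0.9, 0.1, 0.8, 0.1]
result = decode_detection(row, img_w=640, img_h=480)
print(result["box"])       # [240, 120, 160, 240]
print(result["class_id"])  # 1
```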

**Q :** For each grid cell, the layer might have predicted several bounding boxes, along with their confidence scores and labels (class IDs). How do we extract this information from outs?

```python
import numpy as np

# initialize empty lists to collect information about detected objects
boxes, confidences, class_ids = [], [], []

# iterate through the list of arrays -> outs
# outs contains multiple arrays, each corresponding to a YOLO output layer
# iterating through results of each output layer
for out in outs:
    # Each array in outs contains detections from all grid cells for that specific output layer
    # iterating over all detections from a single output layer
    for detection in out:
        # extracting all class probabilities
        scores = detection[5:]
        # selecting the class_id with maximum confidence score
        class_id = np.argmax(scores)
        # getting that maximum confidence score value
        confidence = scores[class_id]
        # thresholding/ filtering to capture confident detections only
        if confidence > 0.5:
            # bounding box coordinates occur in normalized format ([0,1] range relative to the image size)
            # so center of the bounding box, scaled by the image's width and height
            center_x = int(detection[0] * width)
            center_y = int(detection[1] * height)
            # Width (w) and height (h) of the bounding box, scaled by image size
            w = int(detection[2] * width)
            h = int(detection[3] * height)
            # Top-left corner of the bounding box 
            # calculated by shifting from the center
            x = int(center_x - w / 2)
            y = int(center_y - h / 2)
            # store the box, its score, and its class only for confident detections
            boxes.append([x, y, w, h])
            confidences.append(float(confidence))
            class_ids.append(class_id)
```

**Q :** Some bounding boxes may be overlapping and redundant. How to remove those?
```python
indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
```
* indexes is a list of indices of the selected boxes after NMS is applied
* NMS is used to eliminate multiple bounding boxes for the same object
* It keeps the box with the highest confidence score and removes others that overlap significantly

**Q :** NMSBoxes function has many parameters. What does each indicate?

||Parameter|Purpose|
|---|----|---|
|1|boxes|List of bounding box coordinates (in [x, y, w, h] format)|
|2|confidences|List of confidence scores for each box (indicating how confident YOLO is about the presence of an object)|
|3|0.5| Minimum confidence threshold (detections with a confidence below this value are ignored)|
|4|0.4|Overlap threshold (if the intersection over union (IoU) of two boxes exceeds this threshold, one of them will be suppressed)|
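
The two thresholds can be seen in action with a small greedy NMS written from scratch. This sketch illustrates the general technique and is not OpenCV's actual implementation. `iou` and `greedy_nms` are hypothetical helpers, and the boxes are made-up data.

```python
def iou(a, b):
    """Intersection over union of two boxes in [x, y, w, h] format."""
    inter_w = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, confidences, score_threshold=0.5, nms_threshold=0.4):
    """Return indices of boxes kept after score filtering and suppression."""
    # Drop low-confidence boxes, then visit the rest from most to least confident
    candidates = [i for i, c in enumerate(confidences) if c > score_threshold]
    candidates.sort(key=lambda i: confidences[i], reverse=True)
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[k]) <= nms_threshold for k in kept):
            kept.append(i)
    return kept

boxes = [[100, 100, 50, 50], [105, 105, 50, 50], [300, 300, 40, 40], [10, 10, 20, 20]]
confidences = [0.9, 0.8, 0.7, 0.3]
print(greedy_nms(boxes, confidences))  # [0, 2]
```

Box 1 overlaps box 0 with an IoU of about 0.68, above the 0.4 threshold, so it is suppressed. Box 3 never enters the comparison because its confidence is below 0.5.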


**Q :** All NMS selected boxes will have class ids. How to map those class ids to respective class labels?
```python
# initialize an empty list to store labels of detected objects
detected_objects = []

# iterating through the list of bounding boxes
for i in range(len(boxes)):
    # checking if the box is NMS selected or not
    if i in indexes:
        # extracting all coordinates of that box
        x, y, w, h = boxes[i]
        # get class_id of that box
        # using that id, get class label from list of class labels
        label = str(classes[class_ids[i]])
        # storing the label
        detected_objects.append(label)
```

**Q :** Now how to highlight the detected objects by drawing bounding boxes and labels?
```python

# these lines belong inside the `if i in indexes:` block of the previous for loop
cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.putText(img, label, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 2)
```
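
Two small refinements are often useful when drawing: showing the confidence next to the label, and keeping the text on screen when a box touches the top edge of the image. The helpers below are hypothetical and contain no OpenCV calls. Their results would be passed to cv2.putText as the text and its origin.

```python
def label_text(label, confidence):
    """Format a caption such as 'person 0.87'."""
    return f"{label} {confidence:.2f}"

def label_origin(x, y, offset=10):
    """Place text above the box, or just inside it when there is no room above."""
    text_y = y - offset if y - offset > offset else y + 2 * offset
    return max(x, 0), text_y

print(label_text("person", 0.8731))  # person 0.87
print(label_origin(50, 100))         # (50, 90)
print(label_origin(50, 5))           # (50, 25)
```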