# WTH is YOLO?
- Tashrique Ahmed
- Milton Vento
- Henok Misgina Fisseha
---

# YOLO: You Only Look Once!

```tl;dr: Very interesting algorithm that creates bounding boxes and labels around images/videos. Very very very very fast!```

You Only Look Once (YOLO) is a state-of-the-art, real-time object detection algorithm introduced in 2015 by [Joseph Redmon](https://arxiv.org/search/cs?searchtype=author&query=Redmon%2C+J), [Santosh Divvala](https://arxiv.org/search/cs?searchtype=author&query=Divvala%2C+S), [Ross Girshick](https://arxiv.org/search/cs?searchtype=author&query=Girshick%2C+R), and [Ali Farhadi](https://arxiv.org/search/cs?searchtype=author&query=Farhadi%2C+A) in their famous research paper [“You Only Look Once: Unified, Real-Time Object Detection”](https://arxiv.org/abs/1506.02640). 

![Football GIF](./img/footballgif.gif)

[Another demo](https://www.tiktok.com/@9gag/video/7329834260452429098) YOLO (sort of)

YOLO fundamentally transforms the field of object detection by treating the identification of objects in images as a single regression problem.

Unlike traditional methods that scan images in parts, YOLO examines the entire image in __one evaluation__, predicting bounding boxes and class probabilities simultaneously. This streamlined approach not only speeds up the detection process, allowing it to run in real-time at impressive speeds, but also simplifies the model architecture, enabling end-to-end optimization directly on detection performance.

<div style="text-align:center">
    <img src="./img/1.png" width="700" />
</div>




<!-- ![Why Fast](./img/2.png) -->
---

# But, why YOLO though...

`tl;dr:` Object detection models evolve from R-CNN to Faster R-CNN, adding efficiencies like ROI pooling and shared convolutional features, then to single-pass detectors like SSD and YOLO for speed, with Mask R-CNN offering instance segmentation capabilities.

**Object Detection Models:**

Object detection models aim to identify and locate objects within an image or video frame. Key technologies in this field include:

1. **R-CNN (Region-based Convolutional Neural Networks):** R-CNN employs a selective search to generate region proposals which are then processed by a CNN to extract features and classify objects. It's groundbreaking for demonstrating that CNN features can dramatically improve detection accuracy but is computationally expensive due to the processing of each proposal independently.
2. **Fast R-CNN:** Improves upon R-CNN by introducing Region of Interest (ROI) pooling. This enhancement shares convolutional features across proposed regions, reducing redundancy and increasing speed without sacrificing accuracy.
3. **Faster R-CNN:** Integrates a Region Proposal Network (RPN) that shares convolutional features with the detection network, thereby streamlining the proposal and detection phases into one network, which significantly boosts both speed and accuracy.
   
<div style="text-align:center">
    <img src="./img/12.png" width="600" />
</div>


4. **SSD (Single Shot MultiBox Detector):** Predicts both bounding box locations and class probabilities in one forward pass, making it faster than models that require separate proposals and detection stages. This model is efficient for real-time processing and operates well across different object scales.

<div style="text-align:center">
    <img src="./img/ssd.webp" width="600" /> <br>
    <img src="./img/12-1.png" width="800" />
</div>


5. **Mask R-CNN:** Extends Faster R-CNN by adding a parallel branch for predicting segmentation masks on each ROI, enabling precise instance segmentation. This is especially useful in applications requiring precise object localization and context.
        
<div style="text-align:center">
    <img src="./img/11.png" width="600" />
</div>



**Damn, why is YOLO so good?**

| Model          | Pascal 2007 mAP (Accuracy) | Speed (FPS) | Time per Frame (s) |
|----------------|-----------------|-------------|--------------------|
| R-CNN          | 66.0            | 0.05        | 20.0s              |
| Fast R-CNN     | 70.0            | 0.5         | 2.0s               |
| Faster R-CNN   | 73.2            | 7           | 0.14s              |
| YOLO           | 63.4            | 45          | 0.02s              |
| Fast YOLO      | 52.7            | 155         | 0.006s             |

<br>
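
The time-per-frame column is simply the reciprocal of the FPS column. A quick sketch of that arithmetic, using FPS values from the table (the function name is only illustrative):

```python
def seconds_per_frame(fps: float) -> float:
    """Convert throughput (frames per second) into latency per frame (seconds)."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1.0 / fps


# FPS values copied from the comparison table.
for name, fps in [("R-CNN", 0.05), ("Faster R-CNN", 7), ("YOLO", 45), ("Fast YOLO", 155)]:
    print(f"{name:<13} {seconds_per_frame(fps):.3f} s/frame")
```

At 45 FPS a frame takes roughly 22 ms, which fits inside the ~33 ms budget of 30 FPS video.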

- **Speed:** YOLO and SSD are both single-pass detectors, making them faster than multi-stage detectors like the R-CNN variants. YOLO generally has an edge in speed because one network evaluation produces every box and class prediction, which translates into a high number of frames per second (FPS).

- **Accuracy:** YOLO tends to sacrifice some accuracy for speed compared to models like Faster R-CNN and Mask R-CNN. SSD often falls in between YOLO and Faster R-CNN in terms of accuracy.

- **Complexity:** YOLO is simpler to implement and train compared to multi-stage detectors like Faster R-CNN and Mask R-CNN, which involve separate region proposal and detection stages.

**C'mon man, why does the accuracy have to go down?**

- YOLO sacrifices some accuracy for speed due to its single-pass detection design, which, while fast, limits thorough examination of each region for potential objects.
- The grid-based system, which assigns only a few bounding boxes per grid cell, can miss smaller or overlapping objects.
- Its loss function, which treats localization and classification errors equally, might not always align with the nuances of detection performance.
- Training on complete images enhances context learning but may not capture fine details essential for precise object localization, impacting its ability to finely tune detection for specific object types or sizes.
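
To make the grid limitation concrete, here is a small sketch. It assumes the settings reported in the original paper (`S = 7`, `B = 2`, a `448x448` input); the helper name and the sample coordinates are made up for illustration.

```python
S, B = 7, 2          # grid size and boxes per cell (values from the YOLOv1 paper)
IMG_SIZE = 448       # input resolution in pixels


def responsible_cell(cx: float, cy: float) -> tuple[int, int]:
    """Return the (row, col) of the grid cell that contains an object's center."""
    cell = IMG_SIZE / S                      # 64 px per cell
    col = min(int(cx // cell), S - 1)        # clamp points on the right/bottom edge
    row = min(int(cy // cell), S - 1)
    return row, col


# Two small objects whose centers are only 30 px apart (hypothetical coordinates).
bird_a = responsible_cell(200, 210)
bird_b = responsible_cell(230, 215)

print("max boxes per image:", S * S * B)
print("cells:", bird_a, bird_b, "same cell:", bird_a == bird_b)
```

Both centers land in the same cell, so the two objects compete for that cell's few boxes, and the whole image can never yield more than `S * S * B = 98` boxes.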




**But where do we use which?**

- **Real-time Detection:** `YOLO` excels in environments requiring fast processing, such as traffic monitoring and security surveillance, due to its swift, single-pass architecture suitable for dynamic settings like autonomous vehicles.
- **Instance Segmentation:** `Mask R-CNN` is optimal for tasks needing precise object boundaries, such as medical imaging or advanced robotics, thanks to its detailed segmentation capabilities.
- **Challenges with Small Objects:** YOLO struggles with detecting small or clustered objects, a limitation addressed more effectively by multi-stage detectors like `Faster R-CNN`, which iteratively refine detection accuracy through region proposals and adjustments.

---

# How does YOLO do its thing?

``tl;dr: The YOLO algorithm (1) divides the image into an S×S grid, (2) predicts B bounding boxes and confidence scores for each cell, (3) calculates class probabilities, and (4) merges the results.``

YOLO transforms the task of object detection by implementing it as a single regression problem from image pixels to spatially separated bounding boxes and class probabilities.

<br>
<div style="text-align:center">
    <img src="./img/3.png" width="600" />
</div>
<br>


At a very high level, this is how the algorithm works:

1. **Grid Division:** The first step divides the original image into `SxS` grid cells of equal size. Each cell is responsible for localizing and predicting the class of the object whose center falls inside it, along with the probability/confidence value.

<br>
<div style="text-align:center">
    <img src="./img/6.png" width="600" />
</div>
<br>

2. **Bounding Box Prediction:** Each cell predicts `B` bounding boxes and associated confidence scores. The confidence reflects the model's certainty that the box contains an object and how accurate it believes the box to be. A single grid cell can predict multiple bounding boxes so that the model can capture multiple objects.

<br>
<div style="text-align:center">
    <img src="./img/7.png" width="600" />
</div>
<br>


YOLO determines the attributes of these bounding boxes using a single regression module in the following format, where `Y` is the final vector representation for each bounding box. 

`Y = [pc, bx, by, bh, bw, [classes: c1, c2, ...]]`

- `pc` is the probability score that the grid cell contains an object. For instance, all the grid cells in red have a probability score higher than zero. The image on the right is the simplified version, since the probability of each yellow cell is zero (insignificant).
- `bx`, `by` are the x and y coordinates of the center of the bounding box, relative to the enveloping grid cell.
- `bh`, `bw` are the height and the width of the bounding box, relative to the enveloping grid cell.
- `c1` and `c2` correspond to the two classes, Player and Ball. We can have as many classes as the use case requires.

<br>
<div style="text-align:center">
    <img src="./img/8.png" width="600" />
</div>
<br>
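
A minimal sketch of how one labelled box could be turned into the vector `Y` described above. It assumes a `448x448` image, a `7x7` grid, and the two example classes (Player, Ball). One thing to flag: it follows this write-up in expressing `bh` and `bw` relative to the grid cell, whereas the original paper normalizes width and height by the image size. The function name is hypothetical.

```python
IMG_SIZE, S = 448, 7
CLASSES = ["Player", "Ball"]


def encode_box(cx, cy, w, h, label):
    """Encode a pixel-space box (center x/y, width, height) as (row, col, Y)."""
    cell = IMG_SIZE / S                       # 64 px per grid cell
    col = min(int(cx // cell), S - 1)         # cell that contains the box center
    row = min(int(cy // cell), S - 1)
    bx = cx / cell - col                      # center offset inside the cell, in [0, 1]
    by = cy / cell - row
    bh = h / cell                             # size in units of cells (can exceed 1)
    bw = w / cell
    one_hot = [1.0 if name == label else 0.0 for name in CLASSES]
    # Layout matches Y = [pc, bx, by, bh, bw, c1, c2]
    return row, col, [1.0, bx, by, bh, bw, *one_hot]


row, col, y = encode_box(cx=224, cy=96, w=64, h=128, label="Player")
print(row, col, y)
```

Cells without an object center would simply get `pc = 0`, and the remaining entries are ignored.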

3. **Class Probabilities and IOU (Intersection over Union):** Concurrently, each grid cell predicts `C` class probabilities. These probabilities assume that there is an object in the cell. 
Most of the time, a single object in an image has multiple candidate boxes predicted for it, even though not all of them are relevant. The goal of the IOU (a value between 0 and 1) is to discard such boxes and keep only the relevant ones. Here is the logic behind it:
- The user defines an IOU selection threshold, for instance 0.5.
- YOLO then computes the IOU of each candidate box against a reference box (such as the ground-truth box), which is the intersection area divided by the union area.
- Finally, it ignores the predictions with `IOU ≤ threshold` and keeps those with `IOU > threshold`.

<br>
<div style="text-align:center">
    <img src="./img/9.png" width="600" />
</div>
<br>
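
The thresholding logic above can be sketched as follows, comparing each candidate box against a reference box. The corner format `(x1, y1, x2, y2)` and the sample boxes are assumptions chosen for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # zero if boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


reference = (0, 0, 10, 10)
candidates = {"good": (1, 1, 11, 11), "poor": (8, 8, 18, 18)}
threshold = 0.5

# Keep only the candidates whose IOU with the reference exceeds the threshold.
kept = [name for name, box in candidates.items() if iou(box, reference) > threshold]
print(kept)
```

Here the "good" box overlaps the reference with an IOU of about 0.68 and is kept, while the "poor" box (about 0.02) is discarded.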

4. **NMS (Non-Max Suppression):** Setting a threshold for the IOU is not always enough, because an object can have multiple boxes with an IOU above the threshold, and keeping all of them adds noise. This is where we use [NMS](https://learnopencv.com/non-maximum-suppression-theory-and-implementation-in-pytorch/) to keep only the boxes with the highest detection probability score.

5. **Overall Prediction:** The final predictions are made by combining the bounding box confidence scores with the class probabilities, resulting in class-specific confidence scores for each box. The predictions are encoded in an `S×S×(B*5+C)` tensor where `B` is the number of bounding boxes a cell can predict, 5 is for the predictions `(x, y, w, h, and confidence)` associated with each bounding box, and `C` is the number of classes.
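
Steps 4 and 5 can be sketched together. This is the common greedy variant of NMS, not necessarily the exact routine any YOLO release ships; real pipelines usually run it once per class. The boxes, confidences, and class probabilities below are invented for the example.

```python
def iou(a, b):
    """IOU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard remaining boxes that overlap the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(10, 10, 60, 60), (12, 12, 62, 62), (100, 100, 150, 150)]
confidences = [0.9, 0.8, 0.7]     # box confidence scores
class_probs = [0.9, 0.9, 0.8]     # probability of the predicted class

# Class-specific score = box confidence * class probability.
scores = [c * p for c, p in zip(confidences, class_probs)]
print(non_max_suppression(boxes, scores))
```

The first two boxes describe the same object, so only the higher-scoring one survives, together with the separate third box.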

<!-- ![How does Yolo Work in PRactice](./img/4.png) -->


But, how exactly does this model go from input images to output vectors?

The YOLO architecture is inspired by [GoogLeNet](https://arxiv.org/pdf/1409.4842). Overall, it has 24 convolutional layers, 4 max-pooling layers, and 2 fully connected layers.

<br>
<div style="text-align:center">
    <img src="./img/5.png" width="800" />
</div>
<br>


The architecture works as follows:

1. **Input Resizing:** YOLO takes an input image and resizes it to `448x448 pixels` to create a fixed-sized input.
2. **Initial Convolutional Layers:** The network begins with a sequence of alternating `1x1` and `3x3` convolutional layers:
    - The 1x1 convolutional layers (also known as reduction layers) serve to reduce the feature space's depth, which can help in controlling the model's complexity and computational cost.
    - The 3x3 convolutional layers help in capturing the spatial features from the input image.
    - These layers are stacked in a way that allows the network to maintain a balance between the abstraction of features and computational efficiency.
3. **Activation Functions:** Every layer except the last is followed by a `leaky Rectified Linear Unit (leaky ReLU)` activation function, which introduces non-linearity into the model and allows it to learn more complex patterns. In the original paper it is defined as `f(x) = x` if `x > 0` and `f(x) = 0.1x` otherwise, so negative inputs are scaled down rather than zeroed out.
4. **Batch Normalization:** The original YOLOv1 paper does not use batch normalization; it was added in YOLOv2, and many reimplementations apply it after each convolutional layer. The technique normalizes a layer's output by subtracting the batch mean and dividing by the batch standard deviation, which stabilizes learning and significantly speeds up training.
5. **Max-Pooling Layers:** The architecture includes `4 max-pooling layers` which perform down-sampling to reduce the spatial dimensions (width and height) of the input volume for the next convolutional layer. Max-pooling helps the model to be more robust to input variations and reduces the number of parameters, thereby controlling overfitting.
6. **Fully Connected Layers:** Following the convolutional layers, YOLO uses `two fully connected layers` to output predictions:
    - The first fully connected layer takes the flattened feature maps and outputs a vector of a fixed size.
    - The second fully connected layer outputs the prediction tensor that the model uses to make the final object detection predictions.
7. **Dropout Layer:** Between the fully connected layers, a dropout layer is included with a drop probability (e.g., `0.5`). Dropout is a form of regularization that helps prevent overfitting by randomly setting a fraction of the input units to 0 at each update during training time.
8. **Output Layer:** The final layer of YOLO uses a linear activation function, which means it is simply a linear transformation that outputs the raw final predictions. These include the coordinates for bounding boxes, the objectness scores indicating the presence of an object, and the class predictions.
9. **Final Prediction:** The final output is an `SxSx(B*5+C)` tensor. Here, `SxS` is the number of grid cells, `B` is the number of bounding boxes each cell can predict, `5` accounts for the `x, y, w, h, and confidence` values of each bounding box, and `C` represents the number of classes that the model can predict.
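
As a sanity check on the output size, here is a short sketch using the PASCAL VOC setup from the original paper (`S = 7`, `B = 2`, `C = 20`). The function names are illustrative, and the ordering inside a cell (boxes first, then classes) is an assumption made for readability; actual implementations may lay the values out differently.

```python
def output_shape(S: int, B: int, C: int) -> tuple[int, int, int]:
    """Shape of the prediction tensor: S x S x (B*5 + C)."""
    return S, S, B * 5 + C


def split_cell(cell_vector, B, C):
    """Split one cell's vector into B boxes of 5 values each and C class scores."""
    if len(cell_vector) != B * 5 + C:
        raise ValueError("cell vector has the wrong length")
    boxes = [cell_vector[i * 5:(i + 1) * 5] for i in range(B)]
    classes = cell_vector[B * 5:]
    return boxes, classes


S, B, C = 7, 2, 20
shape = output_shape(S, B, C)
print(shape, "->", shape[0] * shape[1] * shape[2], "numbers per image")

# A dummy cell vector 0..29 makes the slicing easy to inspect.
boxes, classes = split_cell(list(range(30)), B, C)
print(len(boxes), "boxes,", len(classes), "class scores")
```

With these settings the network emits a `7x7x30` tensor, that is, 1470 values per image.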

Enough talk! Let's see it in practice!
<br>

---

# THIS IS WHERE MILTONS CODE GOES

---

# Looking into the future


- **Enhancements in YOLO Architecture:**
Further development of YOLO might focus on integrating more advanced forms of regularization and optimization techniques, like adaptive learning rates or advanced forms of dropout, to improve model generalization and performance across varied datasets.

- **Exploration of Newer Versions:**
The evolution of YOLO through versions like YOLOv7, YOLOv8, and beyond has seen improvements in speed and accuracy through changes in network architecture and training procedures. Future versions could leverage novel neural network designs, such as incorporating attention mechanisms or transformer models, to enhance object detection capabilities, especially in cluttered or complex scenes.

- **Addressing Open Problems:**
One of the pressing challenges in object detection is the detection of small or occluded objects. YOLO could adapt by integrating context-aware learning mechanisms that understand the broader scene, not just the individual objects, thereby improving accuracy in dense environments.

<br>

---

# Ethical Considerations


**Use in Military Applications:**
The application of YOLO in military contexts raises significant ethical questions, especially concerning privacy and the potential for autonomous harm. These questions also troubled one of its creators, Joseph Redmon[1], and they call for transparent use policies and ethical guidelines to govern deployment, particularly in sensitive or high-stakes environments.
He tweeted that he had stopped doing computer vision research to avoid enabling potential misuse of the technology, citing in particular “military applications and privacy concerns.”

<br>
<div style="text-align:center">
    <img src="./img/ai.png" width="300" />
</div>
<br>

[1] I recommend reading his paper on [YOLOv3](https://pjreddie.com/media/files/papers/YOLOv3.pdf): it is part meme, part academia, and it touches a bit on this topic.


---

# Sir, you got me so invested in this. I want to learn more!

- [YOLO Model family v1-v4](https://www.youtube.com/watch?v=zgbPj4lSc58&list=PL1u-h-YIOL0sZJsku-vq7cUGbqDEeDK0a)
- [YOLOv8 Architecture Breakdown](https://www.youtube.com/watch?v=HQXhDO7COj8)
- [Roboflow top trained YOLO models](https://universe.roboflow.com/search?q=model:yolov8) <- Very cool!
- [Official YOLO documentation](https://docs.ultralytics.com/)

<br>
<div style="text-align:center">
    <img src="./img/bar.jpg" width="500" />
</div>
<br>

---

# Credits

- Pictures from datacamp blog by [Zoumana Keita](https://www.datacamp.com/blog/yolo-object-detection-explained)
- Soccer GIF from [Amritangshu Mukherjee's](https://medium.com/@amritangshu.mukherjee/tracking-football-players-with-yolov5-bytetrack-efa317c9aaa4) Medium post
- Ending Meme by [Abirami Vina](https://medium.com/nerd-for-tech/from-yolo-to-yolov8-tracing-the-evolution-of-object-detection-algorithms-eaed9a982ebd)

---

# How do I "YOLO"?

**YOLO Using Pretrained Models**

```python
%pip install -U ultralytics
```

```python
from ultralytics import YOLO

# Load a model (the second assignment replaces the first; pick one in practice)
model = YOLO("yolov8n.yaml")  # build a new model from scratch
model = YOLO("yolov8n.pt")  # load a pretrained model (recommended for training)

# Use the model
model.train(data="coco8.yaml", epochs=3)  # train the model
metrics = model.val()  # evaluate model performance on the validation set
results = model("https://ultralytics.com/images/bus.jpg")  # predict on an image
path = model.export(format="onnx")  # export the model to ONNX format
```

```python
from ultralytics import YOLO

# Load a pretrained YOLOv8n model
model = YOLO("yolov8n.pt")
```

```python
# Run inference on the source video
# results = model(source='https://videos.pexels.com/video-files/2273136/2273136-hd_1280_720_30fps.mp4', show=True, conf=0.4, save=True)
url = 'https://videos.pexels.com/video-files/5286748/5286748-hd_1920_1080_30fps.mp4'

# show=True opens a preview window, conf=0.4 drops detections below 40% confidence,
# and save=True writes the annotated video to disk.
results = model(source=url, show=True, conf=0.4, save=True)
```


**YOLO Using a Custom Dataset**

```python
from IPython import display

# Clear any leftover cell output before starting
display.clear_output()
```



```python
import ultralytics

# Print environment information and verify the installation
ultralytics.checks()
```

```python
# Import YOLO and display tools
from ultralytics import YOLO

from IPython.display import display, Image
```

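```python
# Import the custom dataset from Roboflow
%pip install roboflow

from roboflow import Roboflow

# Replace the placeholder with your own key; avoid committing real keys to a repo.
rf = Roboflow(api_key="YOUR_ROBOFLOW_API_KEY")
project = rf.workspace("artificial-intelligence-82oex").project("detecting-diseases")
version = project.version(6)
dataset = version.download("yolov5")  # download in YOLOv5 annotation format
```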



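```python
# Move into the downloaded dataset folder (adjust the path to your machine)
%cd /Users/vento/Coursework/CSCI 381/YOLO-CS381-Final-Project/Detecting-diseases-6
%pwd
# Fine-tune the pretrained YOLOv8s weights on the dataset described by data.yaml
!yolo task=detect mode=train model=yolov8s.pt data=data.yaml epochs=25 imgsz=800 plots=True
```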