_CAMILO PINZÓN | ALEJANDRO SANCHO | XAVIER HERNAN | IGNACIO ORLANDI | JORGE MENEU_



_Computer Vision - Group Assignment_

_@ IE MBD 2022-2023_


<img width="1000" style="float:left" 
     src="https://i.imgur.com/j98ly2F.jpg" />


# Content


1. [Context](#1)
2. [The Pipeline](#2) 
3. [Conclusions](#3)
4. [References](#4)

<a id='1'></a>

# 1. Context 

> _Did you know that, according to the World Health Organization, over 34 million children worldwide suffer from disabling hearing loss? The numbers are shocking, and estimates point to a 12% increase in prevalence by 2050._


> _This disability creates several barriers for those who live with it. Many feel isolated and excluded from the rest of society, and there is a lack of solutions to tackle this social phenomenon._


> _Moreover, learning sign language can be a challenge for many people: the process requires patience, qualified professionals and an investment that the most vulnerable may not be able to afford._

> _At **signlingo**, we believe anyone living with this disability should have access to an affordable and effective solution. For that reason we present **signlingo**: the world's #1 Sign Language Learning application._


Throughout this document, we develop an end-to-end Computer Vision pipeline, based on `YOLO`, to address the problem. At the end, we deploy a real-time app that detects the letters A-Z of American Sign Language.




## 1.1. Framework

But how does the application work?

In order to design it, we have leveraged two disciplines: `Computer Vision` and `Software Development`. The first one is the core of the functionality of the app, and the latter is the interface to bring the functionality to the user.


In more detail, to develop this `Computer Vision`-based app, we have relied on different `packages`:






<img width="600" style="float:center" 
     src="https://i.imgur.com/Cwssqjz.png" />
    



     
<a name="Footnote" >1</a>: _Overall framework on which the project is based._



* `YOLO`: standing out as a SOTA family of models in the field, `YOLO` offers a simple Python API to implement `object-detection` with ease. Built from a combination of convolutional, upsampling and detection layers, `YOLO` delivers strong performance across the whole family. We will try different models from it, always keeping an eye on deployment to mobile devices.



The following table lists the models of the family and their size (layers, parameters and GFLOPs):


|            | YOLOv8n | YOLOv8s  | YOLOv8m  | YOLOv8l  | YOLOv8x  |
|------------|---------|----------|----------|----------|----------|
| Layers     | 225     | 225      | 295      | 365      | 365      |
| Parameters | 3.1e6   | 11e6     | 25.9e6   | 43.7e6   | 68.2e6   |
| GFLOPS     | 8.9     | 28.8     | 79.3     | 165.7    | 258.8    |

<a name="Footnote" >2</a>: _Comparison between models of the `YOLOv8` family_
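Since the target is a mobile app, it can help to translate the parameter counts above into approximate weight sizes. The sketch below is only a back-of-the-envelope estimate: it assumes 4 bytes per parameter for fp32 and 2 bytes for fp16, and ignores file-format overhead. The helper name `weights_size_mb` is ours, not part of any library.

```python
# Parameter counts copied from the comparison table above.
PARAMS = {
    "YOLOv8n": 3.1e6,
    "YOLOv8s": 11e6,
    "YOLOv8m": 25.9e6,
    "YOLOv8l": 43.7e6,
    "YOLOv8x": 68.2e6,
}

# Assumed storage cost per parameter (no compression, no metadata).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weights_size_mb(n_params, precision="fp32"):
    """Rough size of the weights in megabytes for a given numeric precision."""
    return n_params * BYTES_PER_PARAM[precision] / 1e6


for name, n_params in PARAMS.items():
    fp32 = weights_size_mb(n_params, "fp32")
    fp16 = weights_size_mb(n_params, "fp16")
    print(f"{name}: ~{fp32:.1f} MB (fp32), ~{fp16:.1f} MB (fp16)")
```

Under these assumptions the gap between the nano and the extra-large model is more than an order of magnitude, which is why the smaller variants are the natural candidates for a phone.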




* `SRGANs`: this variation of the traditional `Generative Adversarial Networks` outputs an image with a higher resolution than its input. It can be particularly useful here, as sign language often includes fine details and gestures that are difficult to capture accurately in low-resolution images.

And to enable its deployment, we have leveraged:

* `Tencent's NCNN`: in Q3 2017, Tencent released this high-performance neural network inference framework, optimized for mobile platforms. We will convert our model to this format, simplifying the network and bringing an accurate, high-performing model to mobile devices.



* `Android Studio`: this IDE lets us build the interface the user interacts with, with native and highly optimized support for Android OS.

Now, enough with the chit-chat. 

Let's get our hands dirty, and jump into our pipeline!

<a id='2'></a>

# 2. The Pipeline 

<img width="1000" style="float:left" 
     src="https://i.imgur.com/nblhkut.png" />
     
<a name="Footnote" >3</a>: _Project's pipeline diagram_

The diagram above shows how the project's pipeline is organized. In a nutshell, it consists of the following steps:

* Data Importing (images from different `roboflow` repositories)
* Data Enrichment (implementation of SRGANs to increase the size of the dataset)
* EDA (overview on the size and nature of the data)
* Data Preprocessing (labeling, annotation and feature engineering among others)
* Training (with different versions of the newest `YOLOv8` release)
* Validation (exploring different hyperparameters)
* Test (benchmark its accuracy)
* Deployment (as a real-time detector, both as `webcam` and as an `android app`)

## 2.1. Data Importing

The data, as previously described, comes from two datasets of images of the A-Z ASL (American Sign Language) alphabet.

Let's import both the images, and their annotations!

### 2.1.1 `roboflow` for data importing

If `roboflow` doesn't ring a bell yet, it surely will in the coming months. This `GUI` tool eases the process of importing images, annotating them and processing them.

In this first stage, we will use it for data importing.

Where we previously relied on `libraries` like `PIL` for importing and processing images, and on programs such as `LabelImg` to manually label images, we now tackle the task through a unified framework, `roboflow`.

<div style = "display: flex; justify-content: center;">
  <div style = "width: 350px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>Upload</figcaption>
      <img width = "500", style = "display: inline-block" src = "https://i.imgur.com/LIoIfjP.png">
  </div>
  <div style = "width: 350px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>Annotate</figcaption>
      <img width = "500", style = "display: inline-block" src = "https://i.imgur.com/VHHPJt9.jpg">
  </div>
</div>

The integration with Jupyter Notebooks is a breeze: with a few lines of code, both images and labels are imported into the session:

In [None]:
!pip install roboflow

In [None]:
from roboflow import Roboflow
rf = Roboflow(api_key="vvIqL0N2NOLTwoQDPW1k")
project = rf.workspace("compvi").project("american-sign-language-letters-extend")
dataset = project.version(1).download("yolov8")

We finally move our files to the directory where all actions will be performed:

In [None]:
!mkdir datasets

In [None]:
!mv /content/American-Sign-Language-Letters-Extend-1 /content/datasets/American-Sign-Language-Letters-Extend-1

The tool itself helps you keep track of the different versions of your dataset, according to the images, preprocessing applied, etc. Here, we import `v1` of our dataset.

During the evolution of the dataset, we moved from a single-dataset approach to a richer one with two different sources of data.

The reason why? 

Simple:

> "More data beats better algorithms" - Peter Norvig

And as we will see later, this clearly affected the model's performance and generalization capabilities.

Cool! So now our data is correctly imported!

We will apply a fixed `train-val-test` split:

* `train`: [92%]
* `val`: [4%]
* `test`: [4%]
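In our case the split is configured in `roboflow`, so no code is needed. For readers who want to reproduce a comparable split locally, here is a minimal sketch; the function name `split_dataset`, the seed and the file names are hypothetical and only illustrate the 92/4/4 proportions.

```python
import random


def split_dataset(filenames, ratios=(0.92, 0.04, 0.04), seed=42):
    """Shuffle file names reproducibly and cut them into train/val/test lists."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    files = sorted(filenames)            # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * ratios[0])
    n_val = int(len(files) * ratios[1])
    return {
        "train": files[:n_train],
        "val": files[n_train:n_train + n_val],
        "test": files[n_train + n_val:],  # the remainder goes to test
    }


# Hypothetical file names, only to show the resulting proportions.
demo = split_dataset([f"img_{i:04d}.jpg" for i in range(1000)])
print({name: len(items) for name, items in demo.items()})
```

Fixing the seed keeps the three sets stable between runs, so images seen during training do not leak into the test set when the notebook is re-run.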


Let's analyze how we can enrich our dataset even further through `Generative Adversarial Networks`:

## 2.2. Data Enrichment


As we mentioned in the introduction, one of the techniques we will leverage to bring about a fine-tuned solution is `SRGANs`. But first of all, what are `GANs`? Let's find out:

`Generative Adversarial Networks` are a class of Deep Learning models in which two neural networks, the `Generator` and the `Discriminator`, compete against each other.

* The role of the `Generator`, as its name indicates, is to generate images that are as similar as possible to the real ones.
* The role of the `Discriminator` is to tell the fake images from the real ones.

The process is iterative, and both networks try to minimize their corresponding loss functions. Even though at the very beginning the `Generator` outputs little more than random noise, over time, by minimizing its loss, it produces images that even the `Discriminator` finds difficult to tell apart from real ones.

One thing to keep in mind is that the `Generator` never sees the real data directly: it learns by trial and error from the `Discriminator`'s feedback, which reaches it through `backpropagation` and lets it iteratively reduce its `loss function`.


After training for several iterations, the `Discriminator` fails to distinguish between real and generated instances. At that moment, we can extract the `Generator` from the chain and reuse it to generate new images, much as we would reuse a pretrained network in `Transfer Learning`.
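To make the competition concrete, the following toy sketch computes both losses with binary cross-entropy, using the common non-saturating form for the `Generator`. It is not the code of the SRGAN project used later; the `Discriminator` outputs are hypothetical numbers chosen to contrast an early and a late stage of training.

```python
import math


def bce(prediction, target):
    """Binary cross-entropy for one probability in (0, 1) and a 0/1 target."""
    eps = 1e-12  # avoids log(0)
    p = min(max(prediction, eps), 1 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def discriminator_loss(d_real, d_fake):
    """The Discriminator wants D(real) -> 1 and D(fake) -> 0."""
    return bce(d_real, 1) + bce(d_fake, 0)


def generator_loss(d_fake):
    """The Generator wants the Discriminator to output 1 for its fakes."""
    return bce(d_fake, 1)


# Hypothetical Discriminator outputs at two moments of training.
early = {"d_real": 0.95, "d_fake": 0.05}  # fakes are easy to spot
late = {"d_real": 0.50, "d_fake": 0.50}   # the Discriminator is guessing

for name, out in (("early", early), ("late", late)):
    print(
        name,
        "D loss:", round(discriminator_loss(**out), 3),
        "G loss:", round(generator_loss(out["d_fake"]), 3),
    )
```

As the fakes improve, the `Generator` loss falls while the `Discriminator` loss rises towards the value it takes when it can only guess.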

The problem at hand is that the output images do not always offer a good enough resolution, which may be critical for the case we are working with. For that reason, we implemented a variation of `GANs`, `SRGANs` (Super-Resolution Generative Adversarial Networks). This type of adversarial network outputs high-resolution images, and delivers state-of-the-art results.


<img width="800" style="float:center" 
     src="https://www.oreilly.com/api/v2/epubs/9781789136678/files/assets/fe3eced0-c452-4b7a-9f6d-dd8228048ab9.png" />


At a high level, the authors of the paper _"Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network"_ achieved a higher-resolution output through a modified loss function for the generator network, which combines a content loss with an adversarial loss.
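In the notation of that paper, the combined objective (called the perceptual loss) can be written as below. The weight $10^{-3}$ is the value reported by the authors; we assume, without having verified it, that the implementation used in this notebook keeps the same form.

```latex
l^{SR} \;=\; \underbrace{l^{SR}_{X}}_{\text{content loss}}
\;+\; 10^{-3}\,\underbrace{l^{SR}_{Gen}}_{\text{adversarial loss}},
\qquad
l^{SR}_{Gen} \;=\; \sum_{n=1}^{N} -\log D_{\theta_D}\!\left(G_{\theta_G}\!\left(I^{LR}\right)\right)
```

Here $I^{LR}$ is the low-resolution input, $G_{\theta_G}$ the generator and $D_{\theta_D}$ the discriminator. The content term keeps the output faithful to the high-resolution target, while the small adversarial term pushes it towards images the discriminator accepts as real.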

One of the use cases of `GANs` is precisely data enrichment: we may generate synthetic images from incoming data, label them accordingly, and introduce them into the pipeline to enrich our dataset even further. This is the perspective we take in our implementation of GANs.

The implementation is mainly based on the tutorial delivered by Aarohi Singla, who in turn based her solution on the TensorLayerX project.

The following schema shows how the code is organized and how the process runs:



<img width="600" style="float:center" 
     src="https://i.imgur.com/T3DXOxh.png" />
     
     
     


So, essentially, `main.py` calls one of the methods found in `mode.py`, either `train` or `test_only`. Each of them then calls the constructor of the `dataset` class, as well as the constructors located in `srgan_model.py`, either the `Generator` or the `Discriminator` constructor.


Let's implement it!

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd /content/drive/MyDrive/TERM III/Computer Vision/Group Assignment/GANS/SRGAN_CustomDataset-main

In [None]:
!pip install -r requirements.txt

In [None]:
!python main.py --LR_path custom_dataset/train_LR --GT_path custom_dataset/train_HR

In [None]:
!python main.py --mode test_only --LR_path test_data --generator_path ./model/pre_trained_model_800.pt

Finally, let's see the results. We might actually find it difficult to distinguish between the `real` and the `generated` image:

### Fake

<img width="500" style="float:center" 
     src="https://i.imgur.com/7AZdjqu.jpg" />

### Real

<img width="500" style="float:center" 
     src="https://i.imgur.com/IlvDwM0.jpg" />

Impressive! The results are especially astonishing because, with just a few epochs, all the details of the original image are represented in the `generated` one. We will do this for a couple more examples and, afterwards, annotate the images through `roboflow`.

## 2.3. EDA

As in any `Machine Learning` pipeline, EDA is often overlooked, but a well-documented `Exploratory Data Analysis` can change the path of a system.

Even if our data doesn't come in a fancy `.csv` format, images (and annotations) are still data! So, as always, let's run a descriptive analysis of what we have:

### 2.3.1. Overview

* Description: Pictures of the A-Z ASL hand signs.
* Count: 15,346
* Format: `.jpg`
* Classes: `[A,...,Z]`

### 2.3.2 Size

Now that we have some general information, we can check the average resolution of the images.
To analyze it, we leverage the package `PIL` to open each image's properties and extract the resolution:

In [None]:
from PIL import Image
import os

folder_path = "/content/datasets/American-Sign-Language-Letters-Extend-1/train/images"
# Keep only the .jpg files found in the training images folder
images = [f for f in os.listdir(folder_path) if f.endswith(".jpg")]

total_width = 0
total_height = 0
for image in images:
    # Open each image in a context manager so the file handle is closed again
    with Image.open(os.path.join(folder_path, image)) as img:
        width, height = img.size
    total_width += width
    total_height += height

average_width = total_width / len(images)
average_height = total_height / len(images)

print("Average image size:", int(average_width), "x", int(average_height), "pixels")

Average image size: 640 x 640 pixels


### 2.3.3 Class Balance

Now we will analyze how balanced this dataset is.

<img width="500" style="float:center" 
     src="https://i.imgur.com/OKDDf2A.jpg" />

As we can see, the dataset shows some balance between the classes; however, the letters `[M,N,O,P,Q,R]` are less present than average. On the other hand, the letter `L` is heavily over-represented. This means we should not pay too much attention to `precision` alone as a metric, as the lack of complete balance may lead us to conclude a model is performing correctly when in fact it is not. We will also focus on other performance metrics, such as:

`F1`|`loss`|`Confusion Matrix`|`mAP50`|`mAP50-95`

Now, let us count the number of hands in the images. We do so by counting the number of bounding boxes annotated for each image. In our labels, each image has a matching txt file, where the number of lines indicates the total number of bounding boxes.
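A minimal sketch of that count is shown below. It assumes the labels live in a `train/labels` folder next to the `train/images` folder used earlier, which is the usual layout of a YOLO export but has not been checked here; the helper names are ours.

```python
from collections import Counter
from pathlib import Path


def boxes_per_image(labels_dir):
    """Map each label file name to its number of non-empty lines (one line = one box)."""
    counts = {}
    for label_file in sorted(Path(labels_dir).glob("*.txt")):
        lines = label_file.read_text().splitlines()
        counts[label_file.name] = sum(1 for line in lines if line.strip())
    return counts


def summarize(counts):
    """Histogram: how many images contain 0, 1, 2, ... boxes."""
    return Counter(counts.values())


# Assumed location of the training labels, mirroring the images folder used above.
labels_dir = "/content/datasets/American-Sign-Language-Letters-Extend-1/train/labels"
if Path(labels_dir).is_dir():
    histogram = summarize(boxes_per_image(labels_dir))
    for n_boxes, n_images in sorted(histogram.items()):
        print(f"{n_boxes} box(es): {n_images} image(s)")
```

Images with zero boxes are worth a second look: they are either intentional background examples or annotations that went missing.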

### 2.3.4 Annotation's Format

The last step in this quick `EDA` is often overlooked: checking the annotations' format. This analysis is crucial for two reasons:
* Scale of Bounding Boxes 
* Compatibility with `YOLO`

In the present case, if we examine the annotations carefully, we will see:

* The annotations come as txt files, one file per image.
* The format of the coordinates is compatible with `YOLO`.
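Both checks can be automated. The sketch below assumes the standard YOLO detection layout, one box per line as `class x_center y_center width height` with coordinates normalized to [0, 1]; the sample line and the helper names are hypothetical.

```python
from typing import NamedTuple


class YoloBox(NamedTuple):
    class_id: int
    x_center: float
    y_center: float
    width: float
    height: float


def parse_yolo_line(line):
    """Parse 'class x_center y_center width height' and check the values are normalized."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    box = YoloBox(int(parts[0]), *map(float, parts[1:]))
    if not all(0.0 <= value <= 1.0 for value in box[1:]):
        raise ValueError(f"coordinates must be normalized to [0, 1]: {line!r}")
    return box


def to_pixels(box, img_width, img_height):
    """Convert a normalized box to (x_min, y_min, x_max, y_max) in pixels."""
    x_min = (box.x_center - box.width / 2) * img_width
    y_min = (box.y_center - box.height / 2) * img_height
    x_max = (box.x_center + box.width / 2) * img_width
    y_max = (box.y_center + box.height / 2) * img_height
    return x_min, y_min, x_max, y_max


# Hypothetical annotation line: class 11, centered in the image.
box = parse_yolo_line("11 0.5 0.5 0.25 0.5")
print(box)
print(to_pixels(box, 640, 640))
```

Because the coordinates are relative to the image size, the same label file stays valid when the image is resized, which matters for the preprocessing step that follows.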

___

### 2.3.5. Summary


**Images**

| amount   | format | avg_res     | categories | is_balanced | 
|--------  |--------|-------------|------------|-------------|
| 15,346   | `.jpg` | 640 x 640   |     26     | False       |


**Annotations**

| format | 
|--------|
| `.txt` |

## 2.4. Data Preprocessing

### 2.4.1 `roboflow` for data preprocessing

This is the stage in which we take action, after planning out strategies during the `EDA`. This fundamental stage can make the difference between an underperforming model and an accurate one.

In the past, we could handle some of the upcoming steps, such as `Data Augmentation`, with the `PIL` package.

However, with the release of `YOLOv8`, its developers, `ultralytics`, put a lot of emphasis on a tool they have been supporting for a while now: `roboflow`.

As mentioned before, the tool offers, in a user-friendly `GUI`, a portal in which to deal with the whole preprocessing stage. 

Here is a screenshot of the interface:

<div style = "display: flex; justify-content: center;">
   <div style = "width: 500px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>Data Preprocessing</figcaption>
      <img width = "500", style = "display: inline-block" src = "https://i.imgur.com/h8V6ffw.jpg">
  </div>
</div>

We have done the following preprocessing steps:

•	`Auto-Orient`: Applied

•	`Stretch` to 416x416

We considered applying `Isolated Objects` as a preprocessing step, since it improved the scores considerably. We discarded it, however, because the resulting model did not perform well in practice: it caused overfitting, and it changed the nature of our problem by turning it into a classification problem.


### 2.4.2. Data Augmentation

This is a tricky step. Data Augmentation is great in theory, but may not be that great in practice.

Hand signs vary in lots of aspects, such as size, color, position, etc. Data Augmentation will help us bring variability to the training data, and give the model the generalization capabilities to deal with unseen data successfully.

Here is a screenshot of the interface:


<div style = "display: flex; justify-content: center;">
   <div style = "width: 500px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>Data Augmentation</figcaption>
      <img width = "500", style = "display: inline-block" src = "https://i.imgur.com/DoHJs9L.png">
  </div>
</div>


We tried different combinations of augmentation parameters. In the end, the following combination was the winning one:


•	`Outputs per training example`: 3 (we wanted a larger dataset in order to get better results)

•	`Flip`: Horizontal (in case some images are mirrored)

•	`Crop`: 0% Minimum Zoom, 20% Maximum Zoom (in case the hand sign is far from the camera)

•	`Brightness`: Between -25% and +25% (in case some pictures were taken in low-light situations)

•	`Rotation`: Between -5° and +5° (hand signs will sometimes be slightly tilted)

•	`Shear`: ±5° Horizontal, ±5° Vertical (to add variability in perspective and help the model be more resilient to camera and subject pitch and yaw)

•	`Grayscale`: Applied to 10% of images (to prepare the model for grayscale camera input)

•	`Blur`: Up to 1.25px (in case the quality of the camera is low)
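`roboflow` applies these augmentations for us, labels included. To illustrate what happens behind the interface, the sketch below samples random configurations within the ranges listed above and shows how a YOLO box has to move when the image is flipped horizontally. The function names and the 50% flip probability are our assumptions, not `roboflow` internals.

```python
import random


def hflip_box(box):
    """Mirror a YOLO box (class_id, x_center, y_center, width, height) horizontally."""
    class_id, x_center, y_center, width, height = box
    return (class_id, 1.0 - x_center, y_center, width, height)


def sample_augmentation(rng):
    """Draw one random configuration within the ranges listed above."""
    return {
        "flip_horizontal": rng.random() < 0.5,   # assumed 50% flip probability
        "zoom": rng.uniform(0.0, 0.20),          # crop: 0% to 20% zoom
        "brightness": rng.uniform(-0.25, 0.25),  # -25% to +25%
        "rotation_deg": rng.uniform(-5.0, 5.0),  # -5 to +5 degrees
        "shear_x_deg": rng.uniform(-5.0, 5.0),   # horizontal shear
        "shear_y_deg": rng.uniform(-5.0, 5.0),   # vertical shear
        "grayscale": rng.random() < 0.10,        # applied to 10% of images
        "blur_px": rng.uniform(0.0, 1.25),       # up to 1.25px
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Three augmented outputs per training example, as in the settings above.
for config in (sample_augmentation(rng) for _ in range(3)):
    print(config)

# A box near the left edge ends up near the right edge after the flip.
print(hflip_box((0, 0.2, 0.5, 0.1, 0.3)))
```

The flip example is a reminder that geometric augmentations must transform the annotations together with the pixels; brightness, grayscale and blur leave the boxes untouched.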

___
### 2.4.3 Summary

* Annotations: `converted`
* Data Augmentation: `[Outputs per training example, Flip, Crop, Brightness, Rotation, Shear, Grayscale, Blur]`

## 2.5. Training

### 2.5.1. Introduction

The training phase is crucial in any `Machine Learning` system. It gives the field its name, as it is here that the algorithm _learns_ the weights to better predict a class.

In `YOLOv8`, leveraging the `Python API` makes the process straightforward. We can either train our model from scratch or start from pretrained models of various sizes, all by invoking the `.train()` method.

Here is an example of its implementation:

In [None]:
!pip install ultralytics

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ultralytics
  Downloading ultralytics-8.0.48-py3-none-any.whl (303 kB)
Collecting sentry-sdk
  Downloading sentry_sdk-1.16.0-py2.py3-none-any.whl (184 kB)
Collecting thop>=0.1.1
  Downloading thop-0.1.1.post2209072238-py3-none-any.whl (15 kB)
Installing collected packages: sentry-sdk, thop, ultralytics
Successfully installed sentry-sdk-1.16.0 thop-0.1.1.post2209072238 ultralytics-8.0.48


In [None]:
from ultralytics import YOLO

In [None]:
model = YOLO("yolov8n.pt") # same for other yolo versions
model.train(data="/content/datasets/American-Sign-Language-Letters-Extend-1/data.yaml")


### 2.5.2 Comparing models from YOLOv8

In order to find the best result, we will try different models from the `yolov8` family. Before working with `yolov8`, we need to know that it comes in 5 different sizes:

*  `yolov8n.pt`
*  `yolov8s.pt`
*  `yolov8m.pt`
*  `yolov8l.pt`
*  `yolov8x.pt`

`YOLOv8 Nano (yolov8n)` is the fastest and smallest, while `YOLOv8 Extra Large (yolov8x)` is the most accurate yet the slowest among them. The models listed between these two are ordered from fastest and smallest to slowest and most accurate.

We want a good balance between the quality of the results and the training time. After several attempts, `yolov8s` was the model that gave us the best balance, so we will use that model.

### 2.5.3. Hyperparameter Tuning

Now that we have established our preference for `YOLOv8n` and `YOLOv8s` over the other models, let's analyze what kind of `hyperparameters` the `.train()` method offers to improve the model's performance.

> _Note: Due to time constraints we considered just a sample of the dataset to conduct our hyperparameter tuning._

Taking a look at the documentation:


| Key             | Value   | Description                                                                 |
|-----------------|---------|-----------------------------------------------------------------------------|
| device          | ''      | cuda device, i.e. 0 or 0,1,2,3 or cpu. `''` selects available cuda 0 device |
| epochs          | 100     | Number of epochs to train                                                   |
| workers         | 8       | Number of cpu workers used per process. Scales automatically with DDP       |
| batch           | 16      | Batch size of the dataloader                                                |
| imgsz           | 640     | Image size of data in dataloader                                            |
| optimizer       | SGD     | Optimizer used. Supported optimizer are: `Adam`, `SGD`, `RMSProp`           |
| single_cls      | False   | Train on multi-class data as single-class                                   |
| image_weights   | False   | Use weighted image selection for training                                   |
| rect            | False   | Enable rectangular training                                                 |
| cos_lr          | False   | Use cosine LR scheduler                                                     |
| lr0             | 0.01    | Initial learning rate                                                       |
| lrf             | 0.01    | Final OneCycleLR learning rate                                              |
| momentum        | 0.937   | Use as `momentum` for SGD and `beta1` for Adam                              |
| weight_decay    | 0.0005  | Optimizer weight decay                                                      |
| warmup_epochs   | 3.0     | Warmup epochs. Fractions are ok.                                            |
| warmup_momentum | 0.8     | Warmup initial momentum                                                     |
| warmup_bias_lr  | 0.1     | Warmup initial bias lr                                                      |
| box             | 0.05    | Box loss gain                                                               |
| cls             | 0.5     | cls loss gain                                                               |
| cls_pw          | 1.0     | cls BCELoss positive_weight                                                 |
| obj             | 1.0     | obj loss gain (scale with pixels)                                           |
| obj_pw          | 1.0     | obj BCELoss positive_weight                                                 |
| iou_t           | 0.20    | IOU training threshold                                                      |
| anchor_t        | 4.0     | anchor-multiple threshold                                                   |
| fl_gamma        | 0.0     | focal loss gamma                                                            |
| label_smoothing | 0.0     |                                                                             |
| nbs             | 64      | nominal batch size                                                          |
| overlap_mask    | `True`  | **Segmentation**: Use mask overlapping during training                      |
| mask_ratio      | 4       | **Segmentation**: Set mask downsampling                                     |
| dropout         | `False` | **Classification**: Use dropout while training                              |


We can clearly see some of them may be promising to play with, such as `epochs`, `batch`, `imgsz`, `momentum`, and last but not least, `lr0` & `lrf`.

Theoretically:

* `epochs`: the more, the merrier, as the algorithm trains for longer, although at a higher computational cost.

* `batch`: larger batches speed up each epoch and give smoother gradient estimates, but they need more memory and may hurt generalization; smaller batches make the updates noisier.

* `imgsz`: the size the images are resized to before being fed to the network. We should expect small improvements with bigger sizes (at the cost of slower and more expensive training and inference), and relevant drops when reducing the image size.

* `momentum`: controls how much of the previous weight update is carried over into the current one; higher values place more emphasis on past updates.

* `lr0` & `lrf`: the learning rate always plays a significant role in the learning process. It sets the pace and the path towards convergence. We must find a good balance between a big one (0.01) and a small one (0.0001), as both have their advantages and disadvantages.


We ran some experiments around those hyperparameters. In the `Google Drive` link provided, the folder `Experiments` documents 10 different experiments, in case you would like to know more details. In these experiments we tried different combinations of the hyperparameters mentioned above; in this section, we comment on two of them:

### Experiment 01

We set 20 `epochs` because of the large dataset we have and the computational cost it entails. The number of epochs is the number of times the entire dataset passes through the model during training. We also tried a larger number of epochs but, as we will see in the results, 20 is a good number for this scenario.


Also, the number of samples propagated through the network in one forward/backward pass (`batch`) will be 16 in this first experiment: the model processes 16 images at once during training.


In terms of `image size`, we decided to increase the height and width of our images to 720 pixels, looking for more precise scores.
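One detail is worth checking here. As far as we know, YOLO-style detectors expect the image size to be a multiple of the model's largest stride (32 for the YOLOv8 detection models), and Ultralytics rounds other values up with a warning. The sketch below reproduces that rounding; treat both the stride and the rounding rule as assumptions to confirm against the training log.

```python
import math


def round_to_stride(imgsz, stride=32):
    """Round an image size up to the nearest multiple of the network stride."""
    return math.ceil(imgsz / stride) * stride


for requested in (416, 640, 720):
    print(requested, "->", round_to_stride(requested))
```

Under this assumption, 416 and 640 are used as they are, while a requested size of 720 would actually be trained at 736.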


Regarding the `learning rates`, an `lr0` of 0.01 means that the model starts training with a relatively high learning rate, which is then decreased gradually during training down to the final value set by `lrf` (0.0001). We chose these values to leave enough distance between the initial and the final learning rate.
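To see how the rate evolves, the sketch below assumes the linear schedule that Ultralytics applies when `cos_lr=False`, in which `lrf` acts as a fraction of `lr0` (so the final rate would be `lr0 * lrf` rather than `lrf` itself). Warmup is ignored, and the function name `linear_lr` is ours.

```python
def linear_lr(epoch, epochs, lr0, lrf):
    """Learning rate at a given epoch for a linear decay from lr0 to lr0 * lrf."""
    factor = (1 - epoch / epochs) * (1.0 - lrf) + lrf
    return lr0 * factor


EPOCHS, LR0, LRF = 20, 0.01, 0.0001  # values from Experiment 01

for epoch in (0, 5, 10, 15, 20):
    print(f"epoch {epoch:2d}: lr = {linear_lr(epoch, EPOCHS, LR0, LRF):.6f}")
```

If this assumption holds, the run starts at 0.01 and ends close to 0.000001, a much wider range than the two configured numbers suggest at first sight.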


Finally, a `momentum` of 0.8 means that each update carries over 80% of the previous weight update, slightly less than with the default of 0.937.
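The effect of momentum is easier to see on a toy problem. The sketch below minimizes `f(w) = w**2` with plain SGD plus momentum, using the update `velocity = momentum * velocity + grad`. It is an illustration only, not the optimizer code used by `YOLOv8`, and the learning rate and step count are arbitrary.

```python
def sgd_momentum(grad_fn, w0, lr, momentum, steps):
    """Minimize a 1-D function with SGD + momentum and return the visited weights."""
    w, velocity = w0, 0.0
    history = [w]
    for _ in range(steps):
        grad = grad_fn(w)
        velocity = momentum * velocity + grad  # carry over part of the last update
        w = w - lr * velocity
        history.append(w)
    return history


def toy_gradient(w):
    """Gradient of the toy loss f(w) = w**2, whose minimum is at w = 0."""
    return 2 * w


for m in (0.0, 0.8, 0.937):
    path = sgd_momentum(toy_gradient, w0=1.0, lr=0.01, momentum=m, steps=50)
    print(f"momentum={m}: w after 50 steps = {path[-1]:.4f}")
```

With the same learning rate, the runs with momentum travel further per step because consecutive gradients pointing the same way add up; a value that is too high can also overshoot the minimum.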

### Hyperparameters:

* `epochs`: 20

* `batch`: 16

* `imgsz`: 720

* `lr0`: 0.01

* `lrf`: 0.0001

* `momentum`: 0.8

### Results:

![alternative text](https://i.imgur.com/n1SAGnf.png)
![alternative text](https://i.imgur.com/yjBtpKz.png)

The `box loss` graph relates to the object detection task: the goal is to locate the required objects (in our case, the hands) in an image and draw a bounding box around them. We can observe a decreasing trend, meaning the model is learning to detect hands better and to draw the bounding boxes more precisely.

The `cls_loss` graph relates to the classification task: the goal is to identify the class each bounding box belongs to (in our case, the different letters). There is also a decreasing trend, indicating that the model is learning to classify the letters in the bounding boxes correctly.

In the `mAP50` and `mAP50-95` graphs we can see that 20 epochs is a reasonable choice because, from epoch 10 onwards, both curves tend to flatten.
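Both mAP metrics rest on the Intersection over Union (IoU) between a predicted box and the ground truth: for `mAP50` a prediction counts as correct when its IoU is at least 0.5, while `mAP50-95` averages the result over thresholds from 0.5 to 0.95. A minimal IoU sketch follows, with hypothetical boxes in pixel coordinates.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical ground-truth and predicted boxes, in pixels.
truth = (100, 100, 300, 300)
prediction = (150, 150, 350, 350)

score = iou(truth, prediction)
print(f"IoU = {score:.3f}, counts as a hit for mAP50: {score >= 0.5}")
```

A box shifted by a quarter of its side in both directions already falls below the 0.5 threshold, which shows why `mAP50-95` is so much harder to push up than `mAP50`.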

In the `precision` graph we can see that, by the end of training, the model performs well when identifying which letter the hand represents. However, since our dataset is unbalanced, this metric alone can be misleading.

In the `recall` graph we can observe that the model also does well at detecting where the hand is in the image.
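As a reminder of how these metrics relate, precision is the share of detections that are correct, recall is the share of real hands that are found, and `F1` is their harmonic mean. The counts in the sketch below are hypothetical and not taken from our runs.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts for one letter: 90 correct detections,
# 10 wrong detections and 30 missed hands.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Here a high precision hides the fact that a quarter of the hands are missed, which is why we look at `F1` and recall alongside it.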

In the confusion matrix we can see that there is no considerable confusion between the letters. We can only mention that the model sometimes confuses the letters `D` and `I`.

### Experiment 02

In this second experiment we kept the same hyperparameters as in the first one, changing only the `image size`: we decreased the height and width of our images to 640 pixels. With this change we got better results than in the previous experiment.

### Hyperparameters:

* `epochs`: 20

* `batch`: 16

* `imgsz`: 640

* `lr0`: 0.01

* `lrf`: 0.0001

* `momentum`: 0.8

### Results:

![alternative text](https://i.imgur.com/iNq67QX.png)
![alternative text](https://i.imgur.com/QP14AMZ.png)


We can observe that the results are similar to those of the previous experiment. In the `box loss` graph there is a decreasing trend: the model is learning to detect hands better and to draw the bounding boxes more precisely.

The `cls_loss` also shows a decreasing trend, indicating that the model is learning to classify the letters in the bounding boxes correctly.

In the `mAP50` and `mAP50-95` graphs we can again see that 20 epochs is a reasonable choice because, from epoch 10 onwards, both curves tend to flatten.

The `precision` and `recall` graphs show better behaviour than in the previous experiment.

In the confusion matrix there is, in general, no high confusion between the letters. We have the same issue as before with the letters `D` and `I`. Moreover, the model sometimes tends to confuse the letter `P` with the letter `Q` and with the background of the photo.

Out of the 10 experiments we ran, this was the best combination of hyperparameters.

So, in conclusion, the best combination of hyperparameters is the one from `Experiment 02`. We apply it in the following training run:

In [None]:
model = YOLO("yolov8n.pt")
model.train(data="/content/datasets/American-Sign-Language-Letters-Extend-1/data.yaml", epochs=50, imgsz=640, lr0=0.01, lrf=0.0001, momentum=0.8)

Let's see what the plots have to say:

### New

![alternative text](https://i.imgur.com/6lYuB8s.png)
![alternative text](https://i.imgur.com/rdtba6m.png)

After training for some epochs, we can clearly see a slight improvement in performance across all of the metrics considered. This is to be expected, as it is well known that `epochs` and `lr` are crucial to obtain a good model.


The smaller `learning_rate` provides a slower yet more consistent learning. From the above we can also conclude that fewer epochs (50) would bring the same results. The `loss` and `mAP50` give excellent results.


However, one more thing to consider is the capability to generalize. Let's see how our first optimized model, a top performer according to most of the performance metrics but trained on a smaller dataset, behaves in terms of generalization:

### Old

![alternative text](https://i.imgur.com/duAetLX.png)
![alternative text](https://i.imgur.com/cLgm1mK.png)

It was looking good! Or maybe not?

<div style = "display: flex; justify-content: center;">
   <div style = "width: 350px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>Before</figcaption>
      <img width = 500, style = "display: inline-block" src = "https://i.imgur.com/yAvtiU4.png">
  </div>
     <div style = "width: 350px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>After</figcaption>
      <img width = 500, style = "display: inline-block" src = "https://i.imgur.com/TJ56Rm1.png">
  </div>
</div>

> Note: Please refer to the videos (`Deployment/demo/video`) to get a better understanding of the difference between both models, and of their generalization capabilities.


Unsurprisingly, the less enriched and less optimized model (only optimized in epochs) heavily underperforms when presented with new data, and is unable to correctly infer the class. This is a generalization problem.

We can then conclude that our final model is a good overall performer in the key areas.

Let's validate our results!

## 2.6. Validation

As easy as `training` was, with the `YOLOv8 Python API` we can simply call the `.val()` method with our data path to check how well the model performs on a different split:

In [None]:
model.val(data="/content/datasets/American-Sign-Language-Letters-Extend-1/data.yaml", save_json = True, plots = True)

Ultralytics YOLOv8.0.48 🚀 Python-3.8.10 torch-1.13.1+cu116 CUDA:0 (Tesla T4, 15102MiB)
Model summary (fused): 218 layers, 25840339 parameters, 0 gradients, 78.7 GFLOPs
val: Scanning /content/Save-The-Great-Barrier-Reef-5/Save-The-Great-Barrier-Reef-5/valid/labels.cache... 535 images, 0 backgrounds, 0 corrupt: 100%|██████████| 535/535 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 9/9 [00:17<00:00,  1.90s/it]
                   all        535       1156      0.875      0.696       0.78      0.395
Speed: 1.5ms preprocess, 13.5ms inference, 0.0ms loss, 2.3ms postprocess per image
Saving runs/detect/val/predictions.json...
Results saved to runs/detect/val


<ultralytics.yolo.utils.metrics.DetMetrics at 0x7fd07f366ca0>
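
As a quick way to read the log above, precision and recall can be combined into a single F1 score. The log itself does not report F1; the sketch below simply applies the standard harmonic-mean formula to the `all` row (P = 0.875, R = 0.696), and the helper name `f1_score` is our own.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the "all" row of the validation log above.
precision, recall = 0.875, 0.696
print(f"F1 = {f1_score(precision, recall):.3f}")  # prints "F1 = 0.775"
```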

## 2.7. Test

We may also do so with the remaining split, the `Test` set, after which we should make no further changes to the model.

In this case, we save the predictions so we can check later on how the bounding boxes were created.

In [None]:
results = model.predict(source='/content/datasets/American-Sign-Language-Letters-Extend-1/test/images', save=True, save_txt=True, save_conf=True)

Ultralytics YOLOv8.0.48 🚀 Python-3.8.10 torch-1.13.1+cu116 CUDA:0 (Tesla T4, 15102MiB)

image 1/1903 /content/Save-The-Great-Barrier-Reef-5/Save-The-Great-Barrier-Reef-5/test/images/0-1000_jpg.rf.70b047114c9b2f8097887b8610f09309.jpg: 416x416 1 Starfish, 36.1ms
image 2/1903 /content/Save-The-Great-Barrier-Reef-5/Save-The-Great-Barrier-Reef-5/test/images/0-1004_jpg.rf.3ccd691e4c5647bf53ede5455381251e.jpg: 416x416 1 Starfish, 35.2ms
image 3/1903 /content/Save-The-Great-Barrier-Reef-5/Save-The-Great-Barrier-Reef-5/test/images/0-1006_jpg.rf.1ac2885529bd0e1259b481329a0c72eb.jpg: 416x416 1 Starfish, 33.5ms
image 4/1903 /content/Save-The-Great-Barrier-Reef-5/Save-The-Great-Barrier-Reef-5/test/images/0-1012_jpg.rf.001b4ee762a40f5001abaa808cd0adfa.jpg: 416x416 1 Starfish, 30.8ms
image 5/1903 /content/Save-The-Great-Barrier-Reef-5/Save-The-Great-Barrier-Reef-5/test/images/0-1013_jpg.rf.573d9533fd26d9b6e59a6e07a1399e7b.jpg: 416x416 1 Starfish, 30.7ms
image 6/1903 /content/Save-The-Great-Barrier-Re

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

img_path = "/content/datasets/runs/detect/predict/W24_jpg.rf.c451a285302c17a571523e315285bbd3.jpg"
img = mpimg.imread(img_path)

plt.imshow(img)
plt.show()
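
Since we passed `save_txt=True` and `save_conf=True`, each prediction is also written to a text file (typically in a `labels/` folder inside the prediction run directory). A minimal sketch for reading one of those lines follows, assuming the usual YOLO layout `class x_center y_center width height confidence` with coordinates normalised to the image size. The helper name and the example line are hypothetical, not taken from our results.

```python
def parse_label_line(line, img_w, img_h):
    """Convert 'cls xc yc w h [conf]' (normalised) into pixel corner coordinates."""
    parts = line.split()
    cls_id = int(parts[0])
    xc, yc, w, h = (float(v) for v in parts[1:5])
    conf = float(parts[5]) if len(parts) > 5 else None  # present only with save_conf=True
    x1 = (xc - w / 2) * img_w
    y1 = (yc - h / 2) * img_h
    x2 = (xc + w / 2) * img_w
    y2 = (yc + h / 2) * img_h
    return cls_id, (x1, y1, x2, y2), conf


# Hypothetical label line for a 416x416 image (not a real prediction).
example = "22 0.5 0.5 0.25 0.5 0.91"
print(parse_label_line(example, 416, 416))
```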

## 2.8. Deployment

To finish the pipeline, we export our model so that it is ready to be used on other devices and built upon. By default, `model.export()` writes a TorchScript file next to the trained `best.pt` weights, as the log below shows. From the trained model, we will be able to implement a `realtime` solution, either in a webcam interface (through a simple script, `main.py`) or through a fully developed `android app`.

Let's proceed with exporting the model, and later on we will discuss the final implementation of it:

In [None]:
model.export()

Ultralytics YOLOv8.0.48 🚀 Python-3.8.10 torch-1.13.1+cu116 CPU

PyTorch: starting from runs/detect/train/weights/best.pt with input shape (64, 3, 416, 416) BCHW and output shape(s) (64, 5, 3549) (49.6 MB)

TorchScript: starting export with torch 1.13.1+cu116...
TorchScript: export success ✅ 212.8s, saved as runs/detect/train/weights/best.torchscript (99.0 MB)

Export complete (298.8s)
Results saved to /content/runs/detect/train/weights
Predict:         yolo predict task=detect model=runs/detect/train/weights/best.torchscript imgsz=416 
Validate:        yolo val task=detect model=runs/detect/train/weights/best.torchscript imgsz=416 data=/content/Save-The-Great-Barrier-Reef-5/data.yaml 
Visualize:       https://netron.app


'runs/detect/train/weights/best.torchscript'
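
The export log above reports an output shape of `(64, 5, 3549)`. As a sanity check, the sketch below reproduces those numbers under two assumptions that are not stated in the log: that the detection head works at strides 8, 16 and 32, and that each prediction carries 4 box coordinates plus one score per class. The function names are our own.

```python
def num_predictions(imgsz, strides=(8, 16, 32)):
    """Number of candidate boxes: one per grid cell at each assumed stride."""
    return sum((imgsz // s) ** 2 for s in strides)


def num_channels(num_classes):
    """Values per candidate box: 4 box coordinates plus one score per class."""
    return 4 + num_classes


print(num_predictions(416))  # 52*52 + 26*26 + 13*13 = 3549
print(num_channels(1))       # 5, matching the log above
print(num_channels(26))      # 30, what a 26-letter model would give under the same assumptions
```

This kind of check is useful later on, when the output has to be read manually outside of `YOLO`.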

Finally, we download everything to implement the model in a real-time app:

In [None]:
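import locale

# Workaround (assumed to target Colab's locale error when running shell commands):
# force UTF-8 as the preferred encoding before calling `!zip`.
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding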

In [None]:
!zip -r /content/results.zip /content/runs/detect/

updating: content/Save-The-Great-Barrier-Reef-5/runs/detect/ (stored 0%)
updating: content/Save-The-Great-Barrier-Reef-5/runs/detect/train3/ (stored 0%)
updating: content/Save-The-Great-Barrier-Reef-5/runs/detect/train3/labels_correlogram.jpg (deflated 23%)
updating: content/Save-The-Great-Barrier-Reef-5/runs/detect/train3/args.yaml (deflated 50%)
updating: content/Save-The-Great-Barrier-Reef-5/runs/detect/train3/weights/ (stored 0%)
updating: content/Save-The-Great-Barrier-Reef-5/runs/detect/train3/events.out.tfevents.1677566354.31d864f18cdf.1062.2 (deflated 5%)
updating: content/Save-The-Great-Barrier-Reef-5/runs/detect/train3/labels.jpg (deflated 27%)
updating: content/Save-The-Great-Barrier-Reef-5/runs/detect/train/ (stored 0%)
updating: content/Save-The-Great-Barrier-Reef-5/runs/detect/train/labels_correlogram.jpg (deflated 23%)
updating: content/Save-The-Great-Barrier-Reef-5/runs/detect/train/events.out.tfevents.1677565397.31d864f18cdf.1062.0 (deflated 5%)
updating: content/Save-

In [None]:
from google.colab import files
files.download("/content/results.zip")


## 2.8.1. Deployment as a `webcam` script

This is the most basic deployment of our model: after exporting it, we can still leverage `YOLOv8` and its `.predict()` method. This method, already used in the `Test` step, can take as its source an `image`, a `video` (as we showed before) or even the `webcam`, to perform live detections.

The code snippet to implement it is the following:

In [None]:
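from ultralytics import YOLO

# Load the trained weights
model = YOLO('best.pt')

# source=0 selects the default webcam; show=True displays the live detections
results = model.predict(source=0, show=True)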

## 2.8.2. Deployment as an `android app`


<img width="600" style="float:center" 
     src="https://i.imgur.com/Y2mTiph.png" />
    

<a name="Footnote" >4</a>: _Schema on the App Deployment Process_

Now, this is a far more challenging one than the previous. In fact, the previous approach may even be of little use in practice, as the device must have all the required dependencies installed to run correctly. This is a big obstacle, although one could think of overcoming it by using `Docker` to pack the script with its dependencies.

However, even a containerized image could underperform on portable devices such as smartphones. Even if `yolov8n` is the tiniest of the set, it could be too much for some commodity hardware.

We must then find a compromise between performance and accessibility. Connecting the dots, the most reasonable solution seems to be implementing an optimized `android app`. Choosing this `OS` brings several advantages:

* Open Source
* Highly Optimized
* Strong Userbase
* Humongous Community
* Compatibility with YOLO `.onnx`


Pretty clear, right? So, without further ado, let's jump into it!

### 2.8.2.1. Model conversion

As mentioned before, we must convert our trained model, `best.pt`, into a format compatible with the `OS` and the limited computing capabilities of smartphones.

> In this aspect, one of the possibilities is to export our model as a `.tflite` model. This format is a compressed version of a `tensorflow` model, simplifying the net and substantially improving performance on `commodity` hardware. However, we didn't succeed using this approach, as the mapping of the images would have to be redone.


After some research, we stumbled across an optimized format, `ncnn`, which involves a `.param` and a `.bin` file that operate together. `ncnn`, developed by the Chinese company `Tencent`, is a high-performance neural network inference framework. This format, nonetheless, is not natively supported by the `YOLO` version used here.

However, `YOLO` does support another export format: `.onnx`. The format itself is nothing to write home about: it is a generic, cross-platform format. However, modifying some methods and classes in the `YOLO` source code enables us to leverage that format and later use it in our end application.

Let's then proceed to load our model, swap in the modified classes in the source code, and convert our `model` into the desired `ncnn` format!

In [None]:
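# `!cd` runs in a temporary subshell and does not change the notebook's working
# directory, so we clone and install directly from the current directory.
!git clone https://github.com/ultralytics/ultralytics
!pip install -qe ultralytics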

> Note: To reproduce the same results, please proceed to upload the modified file (located at `Scripts/ClassMod/modules.py`) to the workspace (`ultralytics/ultralytics/nn/modules.py`). This must be done before executing the export to the `.onnx` format:

In [None]:
!yolo task=detect mode=export model=/content/best.pt format=onnx simplify=True opset=13 imgsz=416
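
For reference, the same export can be sketched through the Python API instead of the command line. This is a hedged equivalent, not the code we ran: the wrapper name `export_to_onnx` and the `EXPORT_ARGS` dictionary are our own, and the arguments simply mirror the CLI flags above.

```python
# Export settings mirroring the CLI call above.
EXPORT_ARGS = {"format": "onnx", "simplify": True, "opset": 13, "imgsz": 416}


def export_to_onnx(weights_path="/content/best.pt"):
    """Load the trained weights and export them to ONNX; returns the exported file path."""
    # Imported lazily so that this cell can be defined without ultralytics installed.
    from ultralytics import YOLO

    model = YOLO(weights_path)
    return model.export(**EXPORT_ARGS)


# Usage (requires ultralytics and the weights file to be available):
# onnx_path = export_to_onnx("/content/best.pt")
```

As with the CLI call, the modified `modules.py` must already be in place when the export runs.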

Now, we have to convert the `.onnx` file into the `ncnn` format.

This can be done leveraging an online conversion tool. This tool, developed by the user _daquexian_, first simplifies the neural network, and then converts it to the desired `ncnn` format. This simplification not only makes it less resource intensive, but also better suited to mobile applications. Here is an example of a simplification done by the tool, going from this graph:

<img width="400" style="float:center" 
     src="https://raw.githubusercontent.com/daquexian/onnx-simplifier/master/imgs/complicated_reshape.png" />


To the following:


<img width="100" style="float:center" 
     src="https://raw.githubusercontent.com/daquexian/onnx-simplifier/master/imgs/simple_reshape.png" />



_+info:_ https://convertmodel.com

<img width="400" style="float:left" 
     src="https://i.imgur.com/WBf307t.png" />

     
<a name="Footnote" >5</a>: _Online Conversion Tool_



Hooray! We now have a `.param` and a `.bin` file representing our model.

All we have left is modifying `Tencent`'s generic app to meet our model requirements.

We will leverage `Android Studio` to implement the necessary changes.

<div style = "display: flex; justify-content: center;">
   <div style = "width: 900px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <img width = 800, style = "display: inline-block" src = "https://i.imgur.com/AxVHYke.png">
  </div>
</div>

<a name="Footnote" >6</a>: _Android Studio Interface_

These modifications include:

* Modifying classes to predict
* Updating model to implement
* Adapting the read of outputs to our current `neural network`
* Improving appearance of the app
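
Adapting how the outputs are read requires knowing the names of the input and output blobs of the converted network, since `ncnn` feeds and extracts data by blob name. A minimal sketch for listing them from a plain-text `.param` file follows. It assumes the documented layout (a magic number, a line with layer and blob counts, then one layer per line as `type name n_inputs n_outputs inputs... outputs... params...`). The sample text is a toy network, not our model, and the function name is hypothetical.

```python
NCNN_MAGIC = "7767517"  # first line of a plain-text ncnn .param file


def parse_ncnn_param(text):
    """Return layer/blob counts plus the input and output blob names of the network."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if lines[0].strip() != NCNN_MAGIC:
        raise ValueError("not a plain-text ncnn .param file")
    layer_count, blob_count = (int(v) for v in lines[1].split())

    produced, consumed, inputs = [], set(), []
    for ln in lines[2:]:
        tok = ln.split()
        layer_type, n_in, n_out = tok[0], int(tok[2]), int(tok[3])
        in_blobs = tok[4:4 + n_in]
        out_blobs = tok[4 + n_in:4 + n_in + n_out]
        consumed.update(in_blobs)
        produced.extend(out_blobs)
        if layer_type == "Input":
            inputs.extend(out_blobs)

    # Network outputs are blobs that some layer produces but no layer consumes.
    outputs = [b for b in produced if b not in consumed]
    return {"layers": layer_count, "blobs": blob_count,
            "inputs": inputs, "outputs": outputs}


# Toy example (assumption: illustrative only, not the converted signlingo model).
sample = """7767517
3 3
Input        data     0 1 data 0=4 1=4 2=1
InnerProduct ip       1 1 data fc 0=10 1=1 2=80
Softmax      softmax  1 1 fc prob 0=0
"""
print(parse_ncnn_param(sample))
```

Running it on the real `.param` file would be done by passing the file contents read with `open(...).read()`.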

The following is the end result of the process:

<div style = "display: flex; justify-content: center;">
   <div style = "width: 350px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <img width = 300, style = "display: inline-block" src = "https://i.imgur.com/rfBVPAt.jpg">
  </div>
  <div style = "width: 350px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <img width = 300, style = "display: inline-block" src = "https://i.imgur.com/bLDxFa0.jpg">
  </div>
  <div style = "width: 350px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <img width = 300, style = "display: inline-block" src = "https://i.imgur.com/rpthppM.jpg">
  </div>
</div>


<a name="Footnote" >7</a>: _Final Result_

<a id='3'></a>

# 3. Conclusions

In conclusion, the `Computer Vision` based app, **signlingo**, is a valuable tool for promoting social inclusion and breaking down sign language barriers. By providing an affordable and user-friendly learning tool for sign language, we hope to make a positive impact on the lives of those who live with this disability. Through **signlingo** we aim to empower users to communicate more effectively and close the gap in society.

While the results are impressive, we recognize there is still room for improvement. In the future, we plan to expand our dataset to include more images of words and other languages, and enhance the app's functionality by introducing a more user-friendly interface and gamification to boost user engagement.

As we move forward, we are excited about the endless possibilities that Deep Learning and Computer Vision offer in solving real-world problems, and we look forward to contributing to this field with our ongoing efforts.

<a id='4'></a>

# 4. References


Kaggle. (n.d.). TensorFlow - Great Barrier Reef. Kaggle. Retrieved September 20, 2021, from https://www.kaggle.com/c/tensorflow-great-barrier-reef

JSTOR Daily. (n.d.). When Crown-of-Thorns Starfish Attack. Retrieved from https://daily.jstor.org/when-crown-of-thorns-starfish-attack/

Roboflow. (n.d.). GBReef - White Balance. Retrieved from https://universe.roboflow.com/pionc-h7qra/gbreef_white_balance-0lejj

Roboflow. (n.d.). Starfish. Retrieved from https://universe.roboflow.com/abdelmageed-ahmed-2ji7p/starfish-dnvdh/dataset/3

Roboflow. (n.d.). COTS Detection. Retrieved from https://universe.roboflow.com/francis-campos-frick-wodnn/cots-detection-uigl5/

Ultralytics. (n.d.). Welcome to Ultralytics documentation! Ultralytics. Retrieved September 20, 2021, from https://docs.ultralytics.com

Ultralytics. (n.d.). ultralytics/ultralytics. GitHub. Retrieved September 20, 2021, from https://github.com/ultralytics/ultralytics

Singla, A. (n.d.). SRGAN_CustomDataset. GitHub. Retrieved March 9, 2023, from https://github.com/AarohiSingla/SRGAN_CustomDataset


Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672-2680).

Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., ... Shi, W. (2017). Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4681-4690).

Daquexian. (n.d.). onnx-simplifier. GitHub. Retrieved March 9, 2023, from https://github.com/daquexian/onnx-simplifier

Tencent. (n.d.). Tencent/ncnn. GitHub. Retrieved September 20, 2021, from https://github.com/Tencent/ncnn

Android Developers. (n.d.). Build your first app. Android Developers. Retrieved September 20, 2021, from https://developer.android.com/training/basics/firstapp

Tsai, G. (2021, January 18). Top tutorials for deploying custom YOLOv8 on Android. Medium. https://medium.com/@gary.tsai.advantest/top-tutorials-for-deploying-custom-yolov8-on-android-%EF%B8%8F-dd6746afc1e6