#### This project is from [Abubakar Abid's](https://twitter.com/abidlabs) course: *Building Computer Vision Applications* on CoRise. Learn more about the course [here](https://corise.com/course/vision-applications).

# Week 2 Project: Building the "Eyes" of a Self-Driving Car

Welcome to the second week's project for *Building Computer Vision Applications*!

In this week, we are going to get familiar with the key steps of machine learning, with a particular focus on image segmentation. Specifically, we will cover:

* finding image segmentation datasets and pretrained models 📖
* fine-tuning an image segmentation model on new data 👾
* building a computer vision app you can run on your phone or laptop 📷
* measuring the performance of a segmentation model on test data and the real world 📈

# Introduction

Self-driving cars are an exciting real-world application of machine learning, with the potential to save many lives each year. In order for self-driving cars to be fully autonomous, they need to "see" and "understand" the world around them. What are the machine learning algorithms that enable this? Let's take a look at [Tesla's website](https://www.tesla.com/AI): "Our per-camera networks analyze raw images to perform **semantic segmentation**..."

What is semantic segmentation? Semantic segmentation is the process of assigning a class to *every pixel in an image*. In week 1, we studied *image classification*, which assigns a class to the entire image. Semantic segmentation is a more fine-grained task, which recognizes that an image can be made up of different objects: for example, an image taken by a camera on a self-driving car could consist of pedestrians, trees, and other cars. Semantic segmentation is used in many other applications as well, such as medical machine learning, where it can be used to identify organs in radiological images. Rather than assigning a single label to the entire image, a semantic segmentation model assigns each pixel a category so that we understand both *what* is in an image and *where* it is.
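To make the difference concrete, here is a minimal sketch using a made-up 4x4 image and hypothetical class ids (0 = road, 1 = car); neither is taken from the dataset used in this project. A classifier yields one label per image, while a segmentation model yields one label per pixel.

```python
import numpy as np

# Hypothetical class ids, chosen only for this illustration
ROAD, CAR = 0, 1

# Image classification: a single label for the whole image
classification_label = CAR

# Semantic segmentation: one label per pixel, same height and width as the image
segmentation_label = np.array([
    [ROAD, ROAD, ROAD, ROAD],
    [ROAD, CAR,  CAR,  ROAD],
    [ROAD, CAR,  CAR,  ROAD],
    [ROAD, ROAD, ROAD, ROAD],
])

# The label map tells us *where* each class is, e.g. which pixels belong to the car
car_pixels = np.argwhere(segmentation_label == CAR)
print(segmentation_label.shape)  # (4, 4)
print(len(car_pixels))           # 4
```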

By the end of this project, you'll have built an app that you can run on your laptop or phone that performs semantic segmentation on pictures of outdoor scenes, distinguishing the road from the cars from the pedestrians, and so on. It will look something like this:

![](https://i.ibb.co/RNv8MgQ/image.png)

# Step 0: Hardware Setup & Software Libraries

We will be utilizing GPUs to train our machine learning model, so we will need to make sure that our Colab notebook is set up correctly. Go to the menu bar and click on Runtime > Change runtime type > Hardware accelerator and **make sure it is set to GPU**. Your Colab notebook may restart once you make the change.

We're going to be using some fantastic open-source Python libraries to load our dataset (`datasets`), train our model (`transformers`), evaluate our model (`evaluate`), and build a demo of our model (`gradio`). So let's go ahead and install all of these libraries.

In [None]:
!pip install datasets evaluate gradio huggingface_hub transformers wandb


In Week 1, you created a Hugging Face account to upload your Gradio demo to Spaces. This week, we'll be uploading a model to your Hugging Face account *programmatically*! The first step is to log in using your Hugging Face token:

In [None]:
from huggingface_hub import notebook_login

In [None]:
notebook_login()


# Step 1: Loading a Dataset

In this project, we will be using the `datasets` library, which can load tens of thousands of datasets with a single line of code. It can also be used to apply preprocessing functions. Learn more about the datasets library here: https://huggingface.co/docs/datasets/tutorial

Most datasets are divided into different splits. For example, you'll often see a *training* data subset, which is used to build the model, a *validation* data subset, which is used to measure the performance of the model while it is training, and a *test* dataset which is used to measure the performance of the model at the very end of training, and is usually considered to describe how well the model will perform in the real world (we'll come back to this).

Specifically, we will be using the `segments/sidewalk-semantic` dataset that is available for free from the Hugging Face Hub: https://huggingface.co/datasets/segments/sidewalk-semantic

* **Load the Semantic Sidewalk Dataset**

In [None]:
from datasets import load_dataset

dataset = load_dataset("segments/sidewalk-semantic", split="train")



* **Explore the dataset by running code below and reading the dataset card linked above. Answer the questions below**

In [None]:
print(dataset)

Dataset({
    features: ['pixel_values', 'label'],
    num_rows: 1000
})


In [None]:
# View the images

for i in range(10):
  display(dataset[i]['pixel_values'])


In [None]:
print(f"Number of training samples: {dataset.num_rows}")
print(f"Size of each Image: {dataset[0]['pixel_values'].size}")

Number of training samples: 1000
Size of each Image: (1920, 1080)


* How many training samples do we have?
<br>
Ans: 1000
* What's the size of each image?
<br>
Ans: 1920x1080
* How many categories are in this dataset's labels?
<br>
Ans: 35
* Look at a random subset of ~10 training images, do you notice anything interesting about the images in the dataset? Are they as diverse/representative as you would expect or do they have limitations?

* **Simplifying the Training Dataset**

You'll notice that the original dataset has many similar categories (for example, "vehicle-car" is a category, along with "vehicle-truck"). To simplify the training process, we will collapse together related categories. In the end, we will have 5 separate categories:
* 0: road/sidewalk/path
* 1: human
* 2: vehicles
* 3: other objects (e.g. traffic lights)
* 4: nature and background

For the purpose of this exercise, we will also make the images a lot smaller (64px by 64px) so that training is easier and faster. The following code processes the training images and labels.

We've written the function that applies this transformation to a given sample. Efficiently apply it to each item in the dataset, using for example 8 CPU workers (even then, this code may take a few minutes to run)

In [None]:
import numpy as np
from PIL import Image

num_classes = 5

def transform(sample):
    sample["pixel_values"] = sample["pixel_values"].convert("RGB").resize((64,64))
    sample["label"] = sample["label"].resize((64,64), Image.NEAREST)
    collapse_categories = {**{i: 0 for i in range(1, 8)},
                            **{i: 1 for i in range(8, 10)},
                            **{i: 2 for i in range(10, 18)},
                            **{i: 3 for i in range(18, 28)}}
    sample["label"] = np.vectorize(lambda x: collapse_categories.get(x, 4))(np.array(sample["label"]))
    return sample

dataset = dataset.map(transform, num_proc=8)  # apply the transform using 8 worker processes
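`np.vectorize` calls a Python function once per pixel, which is part of why this step is slow. One possible alternative, sketched below under the assumption that the label ids are integers in the range 0 to 255, is to precompute a lookup table and index into it. The helper name `collapse_with_lut` is ours and is not part of the course code.

```python
import numpy as np

# Same mapping as in transform(): original ids -> 5 collapsed categories
collapse_categories = {**{i: 0 for i in range(1, 8)},
                       **{i: 1 for i in range(8, 10)},
                       **{i: 2 for i in range(10, 18)},
                       **{i: 3 for i in range(18, 28)}}

# Lookup table: every id that is not in the mapping falls back to category 4
lut = np.full(256, 4, dtype=np.uint8)
for original_id, new_id in collapse_categories.items():
    lut[original_id] = new_id

def collapse_with_lut(label_array):
    """Map an integer label array to collapsed categories via array indexing."""
    return lut[np.asarray(label_array, dtype=np.int64)]

example = np.array([[0, 1, 7], [8, 17, 30]])
print(collapse_with_lut(example))
```

Note that this version returns `uint8` values, whereas the `np.vectorize` version returns the default integer type; whether that matters depends on what consumes the labels downstream.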




Finally, shuffle the dataset and split the dataset into a training dataset (with 99% of the samples) and a test dataset (with the remaining 1%). We have a very small test dataset so that the evaluation step is quick. If you were training a model in a more realistic setting, you would pick a bigger evaluation dataset.

You might find the `train_test_split()` method in the `datasets` library useful.

In [None]:
# WRITE CODE HERE
dataset = dataset.train_test_split(test_size=0.01, shuffle=True, seed=51)
train_ds = dataset["train"]
test_ds = dataset["test"]

In [None]:
print(f"Number of train samples: {len(train_ds)}")
print(f"Number of test samples: {len(test_ds)}")

Number of train samples: 990
Number of test samples: 10


After you run the steps above, examine the `train_ds` and `test_ds` objects, and confirm that the samples look as you expect. Specifically,

* How many training and test samples do we have? [ANSWER HERE]
<br>
Ans: Train:990. Test: 10
* What's the size of each image? [ANSWER HERE]
<br>
Ans: 64x64
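* What are the potential risks or downsides of having such a small test dataset? [ANSWER HERE]
<br>
Ans: The evaluation is less reliable. With only 10 images, the metrics are noisy, some categories may barely appear, and overfitting is harder to detect.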
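A rough way to see why 10 test images give noisy estimates: if we treat a metric as a proportion estimated from n independent samples, its standard error shrinks as 1/sqrt(n). The sketch below uses this textbook approximation with an assumed true value of 0.85. Real segmentation metrics are not simple proportions, so treat the numbers as intuition only.

```python
import math

def standard_error(p, n):
    """Standard error of a proportion p estimated from n independent samples."""
    return math.sqrt(p * (1 - p) / n)

p = 0.85  # assumed "true" accuracy, for illustration only
for n in (10, 100, 1000):
    print(f"n={n:>4}: standard error ~ {standard_error(p, n):.3f}")
```

Under these assumptions, going from 10 to 1000 test samples shrinks the uncertainty by a factor of 10.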

# Step 2: Loading a Pretrained Model

We will be using the `transformers` library, which can load tens of thousands of machine learning models with a few lines of code. It can also be used to fine-tune these models. Learn more about the `transformers` library here: https://huggingface.co/docs/transformers/index

Specifically, we will be using the `Segformer` model that is available for anyone from the Hugging Face Hub: https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512. While the details of this architecture are beyond the scope of this course, we will point out that it is based on transformers, just like the vision transformers (ViT) network we used last week for image classification. Also, notice that it has already been fine-tuned for detecting everyday objects. We will _further_ fine-tune it for our specific dataset to speed up the training process.

* **Load the Segformer Model and FeatureExtractor for Inference**

In [None]:
from transformers import AutoFeatureExtractor, SegformerForSemanticSegmentation
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class_labels = ["road", "human", "vehicles", "traffic lights", "background"]

model = SegformerForSemanticSegmentation.from_pretrained('nvidia/segformer-b0-finetuned-ade-512-512',
                                                         num_labels= len(class_labels),
                                                         ignore_mismatched_sizes=True,
                                                         id2label={i: c for i, c in enumerate(class_labels)},
                                                         label2id={c: i for i, c in enumerate(class_labels)})

model.eval()
model.to(device);


Some weights of SegformerForSemanticSegmentation were not initialized from the model checkpoint at nvidia/segformer-b0-finetuned-ade-512-512 and are newly initialized because the shapes did not match:
- decode_head.classifier.weight: found shape torch.Size([150, 256, 1, 1]) in the checkpoint and torch.Size([5, 256, 1, 1]) in the model instantiated
- decode_head.classifier.bias: found shape torch.Size([150]) in the checkpoint and torch.Size([5]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We also need to load the **image processor** (also known as the **feature extractor**) corresponding to the model, so that we can convert the input images into a feature vector that the model can take as input.
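As a rough illustration of what such a processor does, the sketch below rescales pixel values, normalizes each channel, and moves the channel axis first. The mean and standard deviation are the commonly used ImageNet statistics and are an assumption here; the values actually used are stored in the processor's configuration, and the real processor also resizes the image.

```python
import numpy as np

# Assumed normalization constants (standard ImageNet statistics)
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image_uint8):
    """Convert an (H, W, 3) uint8 image into a (3, H, W) float32 array."""
    x = image_uint8.astype(np.float32) / 255.0   # rescale to [0, 1]
    x = (x - MEAN) / STD                         # per-channel normalization
    return x.transpose(2, 0, 1)                  # channels-first layout

dummy = np.zeros((64, 64, 3), dtype=np.uint8)    # an all-black placeholder image
features = preprocess(dummy)
print(features.shape, features.dtype)
```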

In [None]:
extractor = AutoFeatureExtractor.from_pretrained('nvidia/segformer-b0-finetuned-ade-512-512')




# Step 3: Fine-tuning Your Model on the Dataset

## 3a. Preprocess the Dataset and Load the Metric

Off the shelf, the Segformer model will not be usable for the task that we have in mind, since it was trained for "general" image segmentation, not for the specific categories that we would like to predict. As a result, we will need to "fine-tune" our model.

Learn more about fine-tuning models with the `transformers` library here: https://huggingface.co/docs/transformers/training

We will also need to decide which metric to use for our task. Since our task is image segmentation, the `mean IOU` metric seems reasonable: https://huggingface.co/spaces/evaluate-metric/mean_iou

* **Preprocess the Dataset**

We will convert the images to feature vectors on the fly as we train the model using the `set_transform()` method. This time, the `transform()` has been left for you to write:

In [None]:
def transform(example_batch):
    inputs = extractor(example_batch['pixel_values'], example_batch['label'], return_tensors='pt')
    return inputs

train_ds.set_transform(transform)
test_ds.set_transform(transform)

## 3b. Fine-Tune the Segformer Model on a Training Subset (and Overfit)

As we discussed in lecture, a good way to start training a model is by making sure that you are able to overfit on a small subset of the training dataset. Train your model on 10 images from your training dataset for 10 epochs.

We will start by defining our training hyperparameters as a `TrainingArguments` instance.

Note that we leave the choice of learning rate to you. You may need to try different learning rates and batch sizes until you are able to overfit successfully on this training dataset.

NOTE: Next, we ask you to plot the loss. You can implement this yourself by using the `matplotlib` library (or any Python plotting library), or you can use existing tools such as [Tensorboard](https://www.tensorflow.org/tensorboard/get_started) or [WandB](https://wandb.ai/site) to plot the loss. If you use an existing tool like Tensorboard, you'll need to set that up here before you start training. Note that both Tensorboard and WandB are compatible with the `transformers` library.
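If you would rather plot the loss yourself, one option is to read the trainer's log history once training has finished. The sketch below assumes the entries look like the dictionaries that `Trainer` stores in `trainer.state.log_history`, with a `loss` key for training steps and an `eval_loss` key for evaluation steps. The helper names are ours, and the sample entries are made up.

```python
def split_loss_history(log_history):
    """Separate training and evaluation losses from a list of log dicts."""
    train = [(e["step"], e["loss"]) for e in log_history if "loss" in e]
    evals = [(e["step"], e["eval_loss"]) for e in log_history if "eval_loss" in e]
    return train, evals

def plot_losses(log_history):
    """Plot both curves against the step number; requires matplotlib."""
    import matplotlib.pyplot as plt
    train, evals = split_loss_history(log_history)
    plt.plot(*zip(*train), label="training loss")
    plt.plot(*zip(*evals), label="validation loss")
    plt.xlabel("step")
    plt.ylabel("loss")
    plt.legend()
    plt.show()

# Made-up entries in the assumed format, for illustration only
sample_history = [
    {"step": 1, "loss": 0.9},
    {"step": 2, "loss": 0.7},
    {"step": 2, "eval_loss": 0.8},
]
train, evals = split_loss_history(sample_history)
print(train)  # [(1, 0.9), (2, 0.7)]
print(evals)  # [(2, 0.8)]
```

After `trainer.train()` has run, calling `plot_losses(trainer.state.log_history)` should draw both curves on one set of axes.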


In [None]:
train_subset_ds = train_ds.train_test_split(test_size=10)['test']
train_subset_ds.set_transform(transform)

In [None]:
import wandb
wandb.login()


In [None]:
%env WANDB_PROJECT=CoRise-week-2-self_driving_car

env: WANDB_PROJECT=CoRise-week-2-self_driving_car


In [None]:
from transformers import TrainingArguments
from transformers import Trainer

lr = 2e-4 # FILL HERE
epochs = 10
batch_size = 2

training_args = TrainingArguments(
    "overfit-segmentation-model",
    learning_rate=lr,
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluation_strategy="steps",
    save_steps=20,
    eval_steps=5,
    logging_steps=1,
    report_to="wandb",  # enable logging to W&B
    run_name="overfit-segmentation-model-segformer"
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_subset_ds,
    eval_dataset=test_ds,
)

trainer.train()



| Step | Training Loss | Validation Loss |
|---|---|---|
| 5 | 0.3869 | 0.766307 |
| 10 | 0.2343 | 0.666214 |
| 15 | 0.3136 | 0.684475 |
| 20 | 0.2441 | 0.734778 |
| 25 | 0.2295 | 0.703697 |
| 30 | 0.4458 | 0.695787 |
| 35 | 0.1777 | 0.655304 |
| 40 | 0.1916 | 0.660547 |
| 45 | 0.1533 | 0.670324 |
| 50 | 0.2175 | 0.674137 |


TrainOutput(global_step=50, training_loss=0.3476131656765938, metrics={'train_runtime': 10.3365, 'train_samples_per_second': 9.674, 'train_steps_per_second': 4.837, 'total_flos': 1753159355596800.0, 'train_loss': 0.3476131656765938, 'epoch': 10.0})

* **Plot the Loss on the Training and Test Sets Over the 10 Epochs**

In [None]:
api = wandb.Api()

In [None]:
run = api.run("ashish08/CoRise-week-2-self_driving_car/sw51lewj")
run.display(height=540)



True

* Is there any sign of overfitting? [ANSWER HERE]
<br>
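Ans: Yes. Training loss drops from about 0.39 to roughly 0.15-0.22, while validation loss stays between about 0.66 and 0.77 without a sustained decrease. The gap between the two indicates overfitting to the 10-image subset.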

## 3c. Fine-Tune the Segformer Model on the Entire Training Set

* **Load the Mean IoU Metric**

In addition to the loss, we now have to decide on a *metric* we will use to measure the performance of our machine learning model. A natural choice for image segmentation is *mean Intersection-over-Union (mean IoU)*. For each class, IoU is the area of overlap between the predicted segmentation and the ground truth divided by the area of their union; mean IoU averages this value over the classes. It is probably the most common metric used for segmentation tasks.
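To make the definition concrete, here is a minimal NumPy sketch that computes per-class IoU on a made-up 2x2 prediction and ground truth. The function name `per_class_iou` is ours; the `evaluate` implementation may differ in details such as how it accumulates areas across images and how it treats ignored pixels.

```python
import numpy as np

def per_class_iou(pred, label, num_classes):
    """Return a list with the IoU of each class (nan if the class is absent)."""
    ious = []
    for c in range(num_classes):
        pred_c, label_c = (pred == c), (label == c)
        union = np.logical_or(pred_c, label_c).sum()
        if union == 0:
            ious.append(float("nan"))  # class appears in neither array
            continue
        intersection = np.logical_and(pred_c, label_c).sum()
        ious.append(intersection / union)
    return ious

pred = np.array([[0, 0], [1, 1]])
label = np.array([[0, 1], [1, 1]])

ious = per_class_iou(pred, label, num_classes=3)
print(ious)              # class 0: 1/2, class 1: 2/3, class 2: nan
print(np.nanmean(ious))  # mean over the classes that are present
```

A class that appears in neither the prediction nor the ground truth has an undefined IoU, which is why `nan` entries can show up in per-category results.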

Read about the `evaluate` library, which contains many common machine learning metrics here: https://github.com/huggingface/evaluate

And use `evaluate.load()` to load the mean IoU metric:

In [None]:
import numpy as np
import evaluate
from torch import nn

metric = evaluate.load("mean_iou")


We will need to write some code to apply the mean IoU metric to the model's outputs. The logits are first upsampled to match the size of the labels, and the argmax over the class dimension then gives the predicted label for each pixel. This code has already been written for you:

In [None]:
def compute_metrics(eval_pred):
    with torch.no_grad():
        logits, labels = eval_pred
        logits_tensor = torch.from_numpy(logits)
        logits_tensor = nn.functional.interpolate(
            logits_tensor,
            size=labels.shape[-2:],
            mode="bilinear",
            align_corners=False,
        ).argmax(dim=1)

        pred_labels = logits_tensor.detach().cpu().numpy()
        metrics = metric.compute(
            predictions=pred_labels,
            references=labels,
            num_labels=num_classes,
            ignore_index=255,
            reduce_labels=False,
        )
        for key, value in metrics.items():
            if type(value) is np.ndarray:
                metrics[key] = value.tolist()
        return metrics


Now, we will take all of the code that you have written and use it to fine-tune the Segformer model on the sidewalk segmentation dataset. Simply run the code below, and your model will fine-tune for 5 epochs. On a **GPU**, this should take about 30 minutes or less with the default settings.

**Important Note:** these default settings may **NOT** produce a very good segmentation model. For this task, you likely need significantly more training time. That is OK, the point of this exercise is not to train a highly-performant model, but to walk through the steps that would be needed to do that. We will **NOT** be looking at the performance of this model to grade your project. If you have been able to overfit on a small training subset (in part 3b), and the loss is going down in this part, that is sufficient.

In [None]:
lr = 3e-4 # FILL HERE
epochs = 5
batch_size = 1

training_args = TrainingArguments(
    "regular-segmentation-model",
    learning_rate=lr,
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluation_strategy="steps",
    save_steps=200,
    eval_steps=200,
    logging_steps=20,
    report_to="wandb",  # enable logging to W&B
    run_name="segmentation-model-segformer-on-sidewalk-dataset"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
)

trainer.train()



| Step | Training Loss | Validation Loss | Mean Iou | Mean Accuracy | Overall Accuracy | Per Category Iou | Per Category Accuracy |
|---|---|---|---|---|---|---|---|
| 200 | 0.5358 | 0.524256 | 0.485563 | 0.56781 | 0.825621 | [0.0, 0.5169057323372555, 0.6656567947637871, 0.7596882284382285, nan] | [0.0, 0.5826483371126229, 0.8185593123861566, 0.8700324596529547, nan] |
| 400 | 0.5724 | 0.659329 | 0.459397 | 0.619814 | 0.800977 | [0.0, 0.4355711303140804, 0.6214606234278396, 0.7805576131838151, nan] | [0.0, 0.904265873015873, 0.7678335610200364, 0.8071551541075113, nan] |
| 600 | 0.6097 | 0.442425 | 0.521572 | 0.628774 | 0.854384 | [0.0, 0.5704588288273262, 0.6947942553002835, 0.8210330507771618, nan] | [0.0, 0.8078939909297053, 0.8251650728597449, 0.8820379808275695, nan] |
| 800 | 0.4512 | 0.343046 | 0.584331 | 0.661552 | 0.890773 | [0.0, 0.726731049927019, 0.7669581485553648, 0.8436345011341309, nan] | [0.0, 0.8585128495842782, 0.8914076730418944, 0.8962864791894187, nan] |
| 1000 | 0.4731 | 0.418303 | 0.55511 | 0.642235 | 0.866979 | [0.0, 0.6801030958731681, 0.7309654692723642, 0.8093697562119425, nan] | [0.0, 0.8040320294784581, 0.9272085610200365, 0.8376979432107754, nan] |
| 1200 | 0.4654 | 0.527194 | 0.564898 | 0.640885 | 0.865688 | [0.0, 0.7316878020978331, 0.7319691269238553, 0.795935616504084, nan] | [0.0, 0.7884188397581254, 0.9573742030965392, 0.8177481494964204, nan] |
| 1400 | 0.3321 | 0.525575 | 0.529331 | 0.621666 | 0.855754 | [0.0, 0.6109024653255494, 0.6981116385772599, 0.8083098711714017, nan] | [0.0, 0.7761243386243386, 0.8123833105646631, 0.898157838854508, nan] |
| 1600 | 0.4085 | 0.402768 | 0.531355 | 0.618234 | 0.856412 | [0.0, 0.6307350764676862, 0.6859634960738885, 0.8087196642104767, nan] | [0.0, 0.7851710128495842, 0.7448998178506375, 0.9428638059701493, nan] |
| 1800 | 0.4271 | 0.4797 | 0.576027 | 0.640956 | 0.877561 | [0.0, 0.7410886673365926, 0.7458046099461386, 0.8172144273090146, nan] | [0.0, 0.7751795162509448, 0.9265055783242259, 0.8621374833151316, nan] |
| 2000 | 0.2408 | 0.403206 | 0.559548 | 0.65493 | 0.875199 | [0.0, 0.6738961344355408, 0.7345304730222159, 0.8297658745480712, nan] | [0.0, 0.8799839380196524, 0.8433344717668488, 0.8964021356631476, nan] |


TrainOutput(global_step=4950, training_loss=0.3851944829960062, metrics={'train_runtime': 746.3087, 'train_samples_per_second': 6.633, 'train_steps_per_second': 6.633, 'total_flos': 8.67813881020416e+16, 'train_loss': 0.3851944829960062, 'epoch': 5.0})
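The `global_step` values reported by the two training runs can be checked by hand: the number of optimizer steps is the number of batches per epoch times the number of epochs. The sketch below assumes a single device and no gradient accumulation, which appears to match the settings used in this notebook.

```python
import math

def total_steps(num_samples, batch_size, epochs):
    """Optimizer steps, assuming one device and no gradient accumulation."""
    return math.ceil(num_samples / batch_size) * epochs

print(total_steps(num_samples=10, batch_size=2, epochs=10))   # 50
print(total_steps(num_samples=990, batch_size=1, epochs=5))   # 4950
```

Both results agree with the `global_step` values in the `TrainOutput` of the overfitting run and the full run.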

## 3d. Upload your model to the Hugging Face Hub!

In two lines of code, upload your feature extractor and model to the Hugging Face Hub!

In [None]:
extractor.push_to_hub("segformer-segmentation-model-feature-extractor")

CommitInfo(commit_url='https://huggingface.co/Ashish08/segformer-segmentation-model-feature-extractor/commit/cb3b6ca20aa2194a2f40dbd7030432eddc2f6119', commit_message='Upload feature extractor', commit_description='', oid='cb3b6ca20aa2194a2f40dbd7030432eddc2f6119', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
model.push_to_hub("segformer-segmentation-model")


CommitInfo(commit_url='https://huggingface.co/Ashish08/segformer-segmentation-model/commit/3a136e2133978a2d15d4c53d62a607a542f6a961', commit_message='Upload SegformerForSemanticSegmentation', commit_description='', oid='3a136e2133978a2d15d4c53d62a607a542f6a961', pr_url=None, pr_revision=None, pr_num=None)

What is the URL for your model on the Hub? [ANSWER HERE]
<br>
Ans: Model on hub - https://huggingface.co/Ashish08/segformer-segmentation-model



Please make sure that the model is **public**

# Step 4: Reporting Model Metrics

* **Plot the Loss and Mean IoU on the Training and Test Sets Over the 5 Epochs**

In [None]:
# FILL HERE
run = api.run("ashish08/CoRise-week-2-self_driving_car/sw51lewj")
run.display(height=540)



True

* Is there any sign of overfitting? [ANSWER HERE]

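Ans: Yes, there are mild signs. In the logged steps, training loss trends downward (0.54 at step 200 to 0.24 at step 2000), while validation loss fluctuates between roughly 0.34 and 0.66 without a steady decrease.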

# Step 5: Building a Demo

A high-level metric like mean test IoU doesn't give us a great idea of how the model will work when presented with new data from the real world. To understand this, we will build a web-based demo that we can use on our phones or computers through a web browser to test our model.

The `gradio` library lets you build web demos of machine learning models with just a few lines of code. Learn more about Gradio here: https://gradio.app/getting_started/

Gradio lets you build machine learning demos simply by specifying (1) a prediction function, (2) the input type and (3) the output type of your model. We have already written the prediction function and Gradio demo here; you can simply run it:

In [None]:
import gradio as gr
import numpy as np


class_labels = ["road", "human", "vehicles", "traffic lights", "background"]

def classify(im):
  inputs = extractor(images=im, return_tensors="pt").to(device)  # same device as the model
  outputs = model(**inputs)
  logits = outputs.logits
  classes = logits[0].detach().cpu().numpy().argmax(axis=0)
  annotations = []
  for c, class_name in enumerate(class_labels):
    mask = np.array(classes==c, dtype=int)
    mask = np.repeat(np.repeat(mask, 5 , axis=0), 5, axis=1)  # scaling up the masks
    annotations.append((mask, class_name))
  im = np.repeat(np.repeat(im, 5 , axis=0), 5, axis=1)  # scaling up the images
  return im, annotations

interface = gr.Interface(classify, gr.Image(type="pil", shape=(128, 128)), gr.AnnotatedImage())

interface.launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://1e2d506be0664d3939.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
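The mask handling inside `classify` can be tried in isolation. The sketch below uses a made-up 2x2 class map in place of real model output and shows how one binary mask per class is built and then enlarged by repeating every pixel 5 times along each axis.

```python
import numpy as np

class_labels = ["road", "human", "vehicles", "traffic lights", "background"]

# Made-up 2x2 prediction standing in for the argmax of the model's logits
classes = np.array([[0, 2],
                    [4, 4]])

annotations = []
for c, class_name in enumerate(class_labels):
    mask = np.array(classes == c, dtype=int)                 # 1 where class c was predicted
    mask = np.repeat(np.repeat(mask, 5, axis=0), 5, axis=1)  # 2x2 -> 10x10
    annotations.append((mask, class_name))

background_mask, name = annotations[4]
print(name, background_mask.shape, background_mask.sum())
```

Each original pixel becomes a 5x5 block, so the two background pixels cover 50 pixels in the enlarged mask.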




* **Use the share link created above to open up your app on your phone**

Now test your model on some real images -- perhaps you can go outside and take a picture of your car. Or you can upload a picture of a road you found online. Although your model may not have been trained for very long, is it still able to distinguish any object classes? Why do you think that may or may not be the case?  

[ANSWER HERE]

It does not do this very well, but because it starts from a pretrained foundation model, it has some predictive power.

# Step 6: Comparing with the Segment Anything Model

In lecture, we discussed **zero-shot image segmentation models**, which don't need to be fine-tuned on specific categories. The Segment Anything model, released by Meta AI, is one such model. Let's use the Segment Anything model and see how well it compares with our partially trained model above. First, we will install the latest version of the `transformers` library directly from GitHub (as this model is not in a release of `transformers` yet!)

In [None]:
!pip install git+https://github.com/huggingface/transformers.git#egg=transformers


NOTE: You may need to restart the runtime of the notebook. This will delete your local variables such as your trained models, so make sure they are already pushed to the Hub.

Next, load the Segment Anything model from `transformers` using the `mask-generation` pipeline by following the example notebook here: https://github.com/huggingface/notebooks/blob/main/examples/automatic_mask_generation.ipynb. To limit memory usage, please use the `facebook/sam-vit-base` model.

In [None]:
# FILL HERE
from transformers import pipeline
generator = pipeline("mask-generation", model="facebook/sam-vit-base", device=0)

Rewrite the `classify` function above so that it passes the input image through the SAM model instead of your fine-tuned model, and so it returns all of the masks.  SAM does not include the names of the masks, you will need to give them generic names, such as "Mask 1", "Mask 2", etc.

Then, launch a new Gradio demo with this updated `classify` function


In [None]:
import gradio as gr
import numpy as np

def classify(im):
  # FILL HERE
  # Generate masks for the whole image with SAM
  outputs = generator(im, points_per_batch=64)
  masks = outputs["masks"]
  annotations = []
  for i, mask in enumerate(masks, start=1):
    # SAM masks are unnamed, so give each one a generic name.
    # They already match the input image size, so no upscaling is needed.
    annotations.append((np.array(mask, dtype=int), f"Mask {i}"))
  return np.array(im), annotations

interface = gr.Interface(classify, gr.Image(type="pil", shape=(600, 600)), gr.AnnotatedImage())

interface.launch(debug=True)





# Bonus: Extensions

Now that you've worked through the project and have a functioning app, what else can we try?
* **Try training the model to convergence.** For this project, we only trained the model for 5 epochs, which is far too little for a real image segmentation model. Instead you can let the model train until it fully converges. How far can you increase the mean IoU?
* **Systematically explore different learning rates**: The learning rate is one of the most important hyperparameters when it comes to training machine learning models. Explore at least 8 different learning rates across 4 orders of magnitude. Which learning rates produce the best model?
* **Try training a segmentation model on the original data**: To speed up the learning process, we reduced the number of classes and the resolution of the images. Can you successfully train a model on the original data? This might require you to have Colab Pro, so that you can fit the images in the original resolution in memory.
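For the learning-rate exploration, one simple way to pick 8 values spanning 4 orders of magnitude is to space them evenly on a log scale. The range 1e-6 to 1e-2 below is an assumption chosen for illustration, not a recommendation from the course.

```python
import numpy as np

# 8 learning rates, evenly spaced in log space from 1e-6 to 1e-2
learning_rates = np.logspace(-6, -2, num=8)

for lr in learning_rates:
    print(f"{lr:.2e}")
```

Each value could then be passed as `learning_rate` to a fresh `TrainingArguments`, reloading the pretrained model before each run so the runs stay comparable.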



