### This project is from [Abubakar Abid's](https://twitter.com/abidlabs) course: *Building Computer Vision Applications* on CoRise. Learn more about the course [here](https://corise.com/course/computer-vision).

---





# Problem Statement

You are hired by a farming company that is having issues with diseases affecting their bean plants. The farmers have to constantly monitor the leaves of the plants so that they can immediately treat the leaves if they show any signs of disease.

That's a lot of work, so they are asking if you can build a machine learning-based app they can deploy on a drone to quickly identify diseased plants.

# Week 1 Project: Building a Leaf Classification App

Welcome to the first week's project for *Building Computer Vision Applications*!

In this week, we are going to get familiar with the key steps of building machine learning apps, with a particular focus on image classification. Specifically, we will cover:

* finding computer vision datasets and pretrained models 📖
* fine-tuning an image classifier model on new data 👾
* deploying a [Gradio app](http://gradio.dev/) you can run on your phone or laptop 📷
* measuring the performance of a classification model on test data and the real world 📈

# Introduction

Beans are an important food crop in many parts of the world. However, certain diseases can damage bean plants, causing food shortages. As a result, it is critical to monitor the leaves of bean plants frequently and accurately. Many farming businesses are turning to imaging and machine learning to monitor their crops automatically and accurately.

This is a great example of where **image classification** can solve a real business problem. The concepts you will learn in this project will be generally applicable to many other kinds of image classification, and more broadly machine learning, tasks.

Our end goal will be to build a web-application that can take in an image of a bean leaf and predict whether it is healthy or diseased. The app will look something like this:

![](https://i.ibb.co/6mcXB53/image.png)

# Step 0: Hardware Setup & Software Libraries

We will be utilizing GPUs to train our machine learning model, so we will need to make sure that our Colab notebook is set up correctly. Go to the menu bar and click on Runtime > Change runtime type > Hardware accelerator and **make sure it is set to GPU**. Your Colab notebook may restart once you make the change.

We're going to be using some fantastic open-source Python libraries to load our dataset (`datasets`), train our model (`transformers`), evaluate our model (`evaluate`), and build a demo of our model (`gradio`). So let's go ahead and install all of these libraries.

In [None]:
%%capture
!pip install datasets transformers evaluate gradio

# Step 1: Loading a Dataset

In this project, we will be using the `datasets` library, which can load tens of thousands of datasets with a single line of code. It can also be used to apply preprocessing functions. Learn more about the datasets library here: https://huggingface.co/docs/datasets/tutorial

Most datasets are divided into different splits. For example, you'll often see a *training* split, which is used to build the model; a *validation* split, which is used to measure the performance of the model while it is training; and a *test* split, which is used to measure the performance of the model at the very end of training and is usually treated as an estimate of how well the model will perform in the real world (we'll come back to this).

Specifically, we will be using the `beans` dataset that is available for free from the Hugging Face Hub: https://huggingface.co/datasets/beans

* **Load the Beans Dataset**

In [None]:
from datasets import load_dataset
dataset = load_dataset("beans")  # FILL HERE


* **Explore the dataset by running the cells below and answer the questions below**

In [None]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['image_file_path', 'image', 'labels'],
        num_rows: 1034
    })
    validation: Dataset({
        features: ['image_file_path', 'image', 'labels'],
        num_rows: 133
    })
    test: Dataset({
        features: ['image_file_path', 'image', 'labels'],
        num_rows: 128
    })
})
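
As a quick sanity check on the split sizes printed above, the following sketch computes each split's share of the data. The counts are copied from that output; the variable names are illustrative and not part of the course code.

```python
# Split sizes copied from the DatasetDict printed above
split_sizes = {"train": 1034, "validation": 133, "test": 128}

total = sum(split_sizes.values())  # number of images across all splits

# Fraction of the whole dataset held by each split
split_shares = {name: size / total for name, size in split_sizes.items()}

for name, share in split_shares.items():
    print(f"{name}: {split_sizes[name]} images ({share:.1%})")
```

The split works out to roughly 80% training, 10% validation, and 10% test.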


In [None]:
# View the images

for i in range(15):
  display(dataset['train'][i]['image'])


In [None]:
# View the labels

dataset['train'].features['labels']

ClassLabel(names=['angular_leaf_spot', 'bean_rust', 'healthy'], id=None)

* What information do we have for each sample? [ANSWER HERE]
<br> Ans: For each sample we have the `image_file_path`, the `image` itself, and the corresponding `labels` value.
* How many training samples do we have? Validation samples? Test samples? [ANSWER HERE]
<br> Ans: We have 1034 training samples, 133 validation samples, and 128 test samples.
* How many different classes are there in this dataset, and what are the class labels? [ANSWER HERE]
<br> Ans: **3**: `angular_leaf_spot`, `bean_rust`, and `healthy`.
* Looking at the first 10 training images, do you notice anything interesting about the images in the dataset? Are they as diverse/representative as you would expect or do they have limitations? [ANSWER HERE]
<br> Ans: The first 10-15 images mostly show unhealthy leaves, with bean rust or angular leaf spot. They are taken from different camera angles, which is good. There is mud in the background, which could lead to spurious correlations.
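
To check how representative the data is beyond the first few images, it can help to count the samples per class. The sketch below uses a hypothetical helper, `class_counts`, and toy labels; in the notebook you could pass `dataset['train']['labels']` and the class names instead.

```python
from collections import Counter

def class_counts(label_ids, class_names):
    """Return {class name: number of samples} for a sequence of integer labels."""
    counts = Counter(label_ids)
    return {name: counts.get(i, 0) for i, name in enumerate(class_names)}

# Toy labels for illustration only, not the real dataset
names = ['angular_leaf_spot', 'bean_rust', 'healthy']
toy_labels = [0, 0, 1, 2, 1, 0]

toy_counts = class_counts(toy_labels, names)
print(toy_counts)
```

A heavily skewed count would suggest that accuracy alone could be misleading for the rarer classes.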

# Step 2: Loading a Pretrained Model

We will be using the `transformers` library, which can load tens of thousands of machine learning models with a few lines of code. It can also be used to fine-tune these models. Learn more about the `transformers` library here: https://huggingface.co/docs/transformers/index

Specifically, we will be using the Vision Transformer (ViT) model that is available to anyone from the Hugging Face Hub: https://huggingface.co/google/vit-base-patch16-224. While the details of vision transformers are beyond the scope of this course, we will point out that they are a successor to the widely used convolutional neural network (CNN) architecture and often perform better than CNNs at the same tasks (image classification, segmentation, etc.).

Let's start by seeing how the Vision Transformer model performs without any further fine-tuning.

* **Load the Vision Transformer Model for Inference**

In [None]:
from transformers import ViTImageProcessor, ViTForImageClassification
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # FILL HERE TO LOAD THE MODEL TO GPU

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224") # FILL HERE

model.eval()
model.to(device);


We also need to load the **image processor** corresponding to the model, so that we can convert the input images into the resized, normalized pixel-value tensors that the model takes as input.
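
To give a feel for what an image processor does, here is a minimal NumPy sketch of the rescale-and-normalize step. It assumes a per-channel mean and standard deviation of 0.5, which is an assumption about this checkpoint's configuration, and it leaves out the resize to 224x224. The function name `to_pixel_values` is made up for illustration.

```python
import numpy as np

def to_pixel_values(image_uint8, mean=0.5, std=0.5):
    """Turn an (H, W, 3) uint8 image into a (1, 3, H, W) float32 array."""
    scaled = image_uint8.astype(np.float32) / 255.0   # rescale to [0, 1]
    normalized = (scaled - mean) / std                # shift and scale to [-1, 1]
    channels_first = normalized.transpose(2, 0, 1)    # (H, W, C) -> (C, H, W)
    return channels_first[np.newaxis, ...]            # add a batch dimension

# A random stand-in for a leaf photo that is already 224x224
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

pixel_values = to_pixel_values(fake_image)
print(pixel_values.shape, pixel_values.dtype)
```

In the notebook, the real `image_processor` handles all of this (including resizing) in a single call.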

In [None]:
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")  # FILL HERE


* **Use the Vision Transformer Model to Make a Prediction on the Training Images**

The documentation here may be helpful: https://huggingface.co/docs/transformers/model_doc/vit#transformers.ViTForImageClassification.forward.example

In [None]:
# First we get the features corresponding to the first training image
encoding = image_processor(images=dataset['train'][0]['image'], return_tensors="pt").to(device)

# Then pass it through the model and get a prediction

######
# FILL HERE
with torch.no_grad():
  logits = model(**encoding).logits
######

prediction = logits.argmax(-1).item() # FILL HERE

print("Predicted class:", model.config.id2label[prediction])

Predicted class: cucumber, cuke


* Try running the model on the first 10 samples in the dataset.

In [None]:
# FILL HERE
images = dataset['train'][:10]['image']  # slice rows first so only 10 images are decoded

encodings = image_processor(images=images, return_tensors="pt").to(device)

with torch.no_grad():
  outputs = model(**encodings)

predictions = torch.argmax(outputs.logits, dim=-1)

for prediction in predictions:
  print('Predicted class:', model.config.id2label[prediction.item()])

Predicted class: cucumber, cuke
Predicted class: ear, spike, capitulum
Predicted class: earthstar
Predicted class: leaf beetle, chrysomelid
Predicted class: leaf beetle, chrysomelid
Predicted class: cucumber, cuke
Predicted class: cucumber, cuke
Predicted class: custard apple
Predicted class: leaf beetle, chrysomelid
Predicted class: leaf beetle, chrysomelid


What is the most common prediction? Why do you think that is? [ANSWER HERE]
<br>
Ans: The most common predictions are `leaf beetle, chrysomelid` and `cucumber, cuke`, with four each. Each pair of names refers to a single class: a leaf beetle is a chrysomelid, and cuke is another word for cucumber. The pretrained model was not trained on the bean disease classes, so it picks the closest classes it knows. During training it may have seen beetles sitting on leaves and unhealthy cucumber leaves that look similar, so it predicts these classes most often.
<br>
[Leaf beetle or Chrysomelid](https://www.google.com/search?q=chrysomelid&tbm=isch&ved=2ahUKEwifp6eXpOP-AhWxpkwKHYv7BO0Q2-cCegQIABAA&oq=chrysomelid&gs_lcp=CgNpbWcQAzIFCAAQgAQyBQgAEIAEMgUIABCABDIFCAAQgAQyBQgAEIAEMgUIABCABDIFCAAQgAQyBQgAEIAEMgUIABCABDIFCAAQgAQ6BAgjECc6BwgAEIoFEENQxgdYwB1glCBoAHAAeACAAcgGiAHWEpIBCzQuNi4wLjEuNi0xmAEAoAEBqgELZ3dzLXdpei1pbWfAAQE&sclient=img&ei=haNXZN_QNrHNsgKL95PoDg&bih=485&biw=1093&rlz=1C1EKKP_enDE827DE827)
<br>
[Cucumber or cuke](https://www.google.com/search?q=cucumber+unhealthy+leaves&tbm=isch&ved=2ahUKEwi23fj7o-P-AhWxsEwKHU-1DhgQ2-cCegQIABAA&oq=cucumber+unhealthy+leaves&gs_lcp=CgNpbWcQAzIGCAAQCBAeUABYmAVgwQZoAHAAeACAAVuIAfoDkgEBNpgBAKABAaoBC2d3cy13aXotaW1nwAEB&sclient=img&ei=TKNXZLaHH7HhsgLP6rrAAQ&bih=485&biw=1093&rlz=1C1EKKP_enDE827DE827)

# Step 3: Fine-tuning Your Model on the Dataset

Off the shelf, the Vision Transformer will not be usable for the task that we have in mind, since it was trained for "general" image classification, not for the specific categories that we would like to predict. As a result, we will need to "fine-tune" our model.

Learn more about fine-tuning models with the `transformers` library here: https://huggingface.co/docs/transformers/training

We will also need to decide which metric to use for our task. Since our task is a simple image classification task, the `accuracy` metric seems reasonable: https://huggingface.co/spaces/evaluate-metric/accuracy

* **Preprocess the Dataset**

To keep things simple and fast, we attach a transform to the dataset that uses the image processor to convert images to pixel-value tensors. Because `with_transform` applies the function on the fly, each batch is converted only when it is accessed, so nothing has to be processed up front. This code has already been written for you:

In [None]:
import torch

def transform(example_batch):
    inputs = image_processor([x for x in example_batch['image']], return_tensors='pt')
    inputs['labels'] = example_batch['labels']
    return inputs

prepared_ds = dataset.with_transform(transform)

* **Load the Accuracy Metric**

We now have to decide on a *metric* we will use to measure the performance for our machine learning model. A natural choice for image classification is *accuracy*, which measures the percentage of images that are predicted to have the correct label.
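
As a concrete illustration of the definition above, accuracy can be computed by hand in a few lines. This is a minimal NumPy sketch with made-up predictions, not the `evaluate` implementation.

```python
import numpy as np

def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    predictions = np.asarray(predictions)
    references = np.asarray(references)
    return float((predictions == references).mean())

# Made-up example: 3 of the 4 predictions are correct
example_accuracy = accuracy([0, 1, 2, 2], [0, 1, 1, 2])
print(example_accuracy)
```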

Read about the `evaluate` library, which contains many common machine learning metrics here: https://github.com/huggingface/evaluate

Then use `evaluate.load()` to load the accuracy metric:

In [None]:
from transformers import AutoModelForImageClassification
import numpy as np
import evaluate

labels = dataset['train'].features['labels'].names

model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=len(labels),
    id2label={str(i): c for i, c in enumerate(labels)},
    label2id={c: str(i) for i, c in enumerate(labels)},
    ignore_mismatched_sizes=True
)

metric = evaluate.load("accuracy") # FILL HERE

def compute_metrics(sample):
    return metric.compute(
        predictions=np.argmax(sample.predictions, axis=1),
        references=sample.label_ids)

Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224 and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([1000, 768]) in the checkpoint and torch.Size([3, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([1000]) in the checkpoint and torch.Size([3]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
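
The warning above reflects that the 1000-class classification head from the checkpoint does not fit a 3-class problem, so a freshly initialized head is used instead. The NumPy sketch below illustrates the shape change using the sizes from the warning; the random weights and the initialization scale are placeholders, not the real model parameters.

```python
import numpy as np

hidden_size, old_classes, new_classes = 768, 1000, 3  # sizes from the warning above
rng = np.random.default_rng(0)

# Stand-in for the features the ViT backbone produces for 2 images
features = rng.normal(size=(2, hidden_size))

# Old head: one row of weights per original class
old_weight = rng.normal(size=(old_classes, hidden_size))
old_bias = np.zeros(old_classes)

# New head: randomly initialized, one row per bean class
new_weight = rng.normal(scale=0.02, size=(new_classes, hidden_size))
new_bias = np.zeros(new_classes)

old_logits = features @ old_weight.T + old_bias
new_logits = features @ new_weight.T + new_bias
print(old_logits.shape, new_logits.shape)
```

Because the new head starts from random weights, its outputs are meaningless until the model is fine-tuned, which is what the warning is pointing out.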



* **Fine-Tune the Vision Transformer Model on the Entire Training Set**

Now, we will take all of the code that you have written and use it to fine-tune the ViT model on the beans dataset. Simply run the code below, and your model will fine-tune for 4 epochs. On a **GPU**, this should take less than 5 minutes.
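
As a rough check on how long training runs, the number of optimizer steps follows from the training set size, the batch size, and the epoch count used in this step. The sketch below assumes a single GPU, no gradient accumulation, and that the final partial batch of each epoch is kept; the variable names are illustrative.

```python
import math

num_train_samples = 1034   # size of the train split
batch_size = 16            # per_device_train_batch_size
num_epochs = 4             # num_train_epochs

# The last batch of each epoch is smaller than 16, hence the ceiling
steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

This works out to 65 steps per epoch and 260 steps in total.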

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
  output_dir="./vit-base-beans",  # output directory where the model predictions and checkpoints will be written
  per_device_train_batch_size=16, # batch size
  learning_rate=2e-4,             # learning rate
  num_train_epochs=4,             # number of epochs to train for
  remove_unused_columns=False,    # keep the "image" column
  logging_steps=10,               # how often to print training metrics
  eval_steps=100,                 # evaluation interval in steps; only used when the evaluation strategy is "steps"
)

def collate_fn(batch):
    return {
        'pixel_values': torch.stack([x['pixel_values'] for x in batch]),
        'labels': torch.tensor([x['labels'] for x in batch])
    }

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
    train_dataset=prepared_ds["train"],
    eval_dataset=prepared_ds["validation"],
    tokenizer=image_processor,
)

In [None]:
train_results = trainer.train()
trainer.save_model("saved_model_files")
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()



| Step | Training Loss |
|---|---|
| 10 | 0.8558 |
| 20 | 0.2559 |
| 30 | 0.4501 |
| 40 | 0.1738 |
| 50 | 0.349 |
| 60 | 0.1564 |
| 70 | 0.1833 |
| 80 | 0.0447 |
| 90 | 0.0831 |
| 100 | 0.0487 |



***** train metrics *****
  epoch                    =         4.0
  total_flos               = 298497957GF
  train_loss               =      0.1075
  train_runtime            =  0:03:12.49
  train_samples_per_second =      21.486
  train_steps_per_second   =       1.351


# Step 4: Reporting Model Metrics

* **Measure Loss on the Validation Dataset**

In [None]:
# FILL HERE
from pprint import pprint

eval_results = trainer.evaluate()
pprint(eval_results)

{'epoch': 4.0,
 'eval_accuracy': 0.9924812030075187,
 'eval_loss': 0.03045864775776863,
 'eval_runtime': 4.8368,
 'eval_samples_per_second': 27.497,
 'eval_steps_per_second': 3.515}


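* What is the loss on the training and validation sets? [ANSWER HERE]
<br>
Ans: train_loss = 0.1075 and eval_loss = 0.0305
* Is there any sign of overfitting? [ANSWER HERE]
<br>
Ans: No clear sign. The validation loss (0.0305) is lower than the reported training loss (0.1075), whereas overfitting would show up as a validation loss well above the training loss. Note that `train_loss` is averaged over all of training, including the early high-loss steps, so the two numbers are not directly comparable.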


* **Measure Accuracy on the Test Dataset**


In [None]:
# FILL HERE
inference_on_test_data = trainer.predict(test_dataset=prepared_ds["test"])
pprint(inference_on_test_data.metrics)

{'test_accuracy': 0.96875,
 'test_loss': 0.18902923166751862,
 'test_runtime': 4.2611,
 'test_samples_per_second': 30.039,
 'test_steps_per_second': 3.755}


* What is your final test accuracy? With the default parameters above, you should expect a test accuracy of around 90% or higher [ANSWER HERE]
<br>
Ans: 96.9% (0.96875)
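
With splits this small, it helps to translate the accuracies back into a number of misclassified images. The sketch below does that arithmetic with the split sizes and accuracies reported earlier in this notebook; the variable names are illustrative.

```python
validation_size, test_size = 133, 128   # split sizes from Step 1
eval_accuracy = 0.9924812030075187      # from trainer.evaluate()
test_accuracy = 0.96875                 # from trainer.predict()

# Number of correct predictions is accuracy times split size, rounded to an integer
val_errors = validation_size - round(eval_accuracy * validation_size)
test_errors = test_size - round(test_accuracy * test_size)
print(val_errors, test_errors)
```

That is 1 misclassified validation image and 4 misclassified test images, so a single image changes either accuracy by almost a full percentage point.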


# Step 5: Building a Demo

A high-level metric like test accuracy doesn't give us a great idea of how the model will work when presented with new data from the real world. To understand this, we will build a web-based demo that can be used on our phones or computers through a web browser to test our model.

The `gradio` library lets you build web demos of machine learning models with just a few lines of code. Learn more about Gradio here: https://gradio.app/getting_started/

Gradio lets you build machine learning demos simply by specifying (1) a prediction function, (2) the input type and (3) the output type of your model. We have already written most of the prediction code for you. We've reloaded the model and dataset so that the following code runs in a standalone manner, which will be important for Step 6.
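
For a `"label"` output, the prediction function can return a dictionary that maps each class name to a confidence. The NumPy sketch below shows that conversion from raw logits with a softmax; the helper name `logits_to_confidences` and the logit values are made up for illustration.

```python
import numpy as np

def logits_to_confidences(logits, labels):
    """Apply a softmax to 1-D logits and pair each probability with its label."""
    logits = np.asarray(logits, dtype=np.float64)
    exp = np.exp(logits - logits.max())   # subtract the max for numerical stability
    probs = exp / exp.sum()
    return {label: float(p) for label, p in zip(labels, probs)}

labels = ['angular_leaf_spot', 'bean_rust', 'healthy']
confidences = logits_to_confidences([0.5, -1.0, 3.0], labels)  # made-up logits
print(confidences)
```

The confidences sum to 1, and the label component shows the highest-scoring class as the prediction.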

In [None]:
import datasets
import torch
from datasets import load_dataset
from transformers import AutoImageProcessor, AutoModelForImageClassification

dataset = load_dataset("beans") # This should be the same as the first line of Python code in this Colab notebook

# Reload the image processor and fine-tuned model written by trainer.save_model()
image_processor = AutoImageProcessor.from_pretrained("saved_model_files")
model = AutoModelForImageClassification.from_pretrained("saved_model_files")
model.eval()

labels = dataset['train'].features['labels'].names

def classify(im):
  # Convert the input image to pixel values and return per-class probabilities
  features = image_processor(im, return_tensors='pt')
  with torch.no_grad():
    logits = model(features["pixel_values"]).logits
  probs = torch.nn.functional.softmax(logits, dim=-1)[0].numpy()
  confidences = {label: float(probs[i]) for i, label in enumerate(labels)}
  return confidences






* **Build a Gradio web demo of your image classifier and `launch()` it**

Create a `gradio.Interface` and launch it! For image classification, the input component should be `"image"` and output should be a `"label"`. Please also make sure to add a `title`, a `description`, and some image `examples` to make the app easy to use.

Note that we have set `debug=True`, which keeps the following cell running continuously. Press the "stop" icon next to the cell to stop execution so that you can run or re-run other cells.

In [None]:
import gradio as gr

interface =  gr.Interface(fn=classify, inputs="image", outputs="label", title="Bean plant health predictor through images of leaves using ViT image classifier") # FILL HERE

interface.launch(debug=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.

Keyboard interruption in main thread... closing server.




# Step 6: Upload your Demo with Spaces and Try it with "Real World" Data!

* **Create a Hugging Face account and upload your demo to Spaces**

1. Create a free Hugging Face account if you do not already have one: https://huggingface.co/login
1. Create a new **public** Space with the code for your Gradio app. You might find this tutorial helpful: https://huggingface.co/blog/gradio-spaces (Note that in addition to uploading the code for your Gradio demo, you'll also need to upload the saved model files and some example images, as well as a `requirements.txt` file).
1. Once your app launches, please put the link to your Space here:

[ANSWER HERE]

Link to Space:  https://huggingface.co/spaces/Ashish08/Bean-plant-health-ViT-classifier

In [None]:
import transformers

print(f"datasets version: {datasets.__version__}")
print(f"transformers version: {transformers.__version__}")
print(f"gradio version: {gr.__version__}")
print(f"torch version: {torch.__version__}")

datasets version: 2.12.0
transformers version: 4.28.1
gradio version: 3.28.3
torch version: 2.0.0+cu118
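
Since the Space needs a `requirements.txt`, one option is to pin the versions printed above. The sketch below builds the file contents. It assumes the Space uses the Gradio SDK (so `gradio` itself is left out) and drops the `+cu118` build tag from `torch`; both choices are assumptions, not requirements.

```python
# Versions copied from the output above. The "+cu118" build tag on torch is
# dropped as an assumption, since it refers to one specific CUDA build.
pinned_versions = {
    "transformers": "4.28.1",
    "torch": "2.0.0",
}

# One "name==version" line per package
requirements_text = "\n".join(
    f"{name}=={version}" for name, version in pinned_versions.items()
) + "\n"
print(requirements_text)

# To write the file, uncomment the next two lines:
# with open("requirements.txt", "w") as f:
#     f.write(requirements_text)
```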


In [None]:
#%%writefile app.py
import gradio as gr
import torch
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Load the image processor and fine-tuned model uploaded alongside app.py
image_processor = AutoImageProcessor.from_pretrained("saved_model_files")
model = AutoModelForImageClassification.from_pretrained("saved_model_files")
model.eval()

labels = ['angular_leaf_spot', 'bean_rust', 'healthy']


def classify(im):
  # Convert the input image to pixel values and return per-class probabilities
  features = image_processor(im, return_tensors='pt')
  with torch.no_grad():
    logits = model(features["pixel_values"]).logits
  probs = torch.nn.functional.softmax(logits, dim=-1)[0].numpy()
  confidences = {label: float(probs[i]) for i, label in enumerate(labels)}
  return confidences


title = """<h1 id="title">Bean plant health predictor through images of leaves using ViT image classifier</h1>"""

description = """
Use Case: A farming company is having issues with diseases affecting their bean plants. The farmers have to constantly monitor the leaves of the plants so that they can immediately treat the leaves if they show any signs of disease.
We are asked to build a machine learning-based app they can deploy on a drone to quickly identify diseased plants.


Solution: A leaf classification app that uses image classification to quickly identify diseased plants.

- The dataset used for fine-tuning the model: [Beans](https://huggingface.co/datasets/beans).
- The model used for classifying the images: [Vision Transformer (base-sized model)](https://huggingface.co/google/vit-base-patch16-224).
"""

css = '''
h1#title {
  text-align: center;
}
'''
theme = gr.themes.Soft()
demo = gr.Blocks(css=css, theme=theme)

with demo:
  gr.Markdown(title)
  gr.Markdown(description)


  interface =  gr.Interface(fn=classify, inputs="image", outputs="label")

demo.launch(debug=True)








* **Open up your Space on your phone**

Now test your model on some real images of plants, either images you find online or those outside your house (they don't have to be bean plants for this part). What do you notice about the kinds of predictions your model makes? Do the predictions tend to skew towards a particular class? What could be done to improve the model's prediction on real world data?

[ANSWER HERE]
<br>
Ans: Train on more images, augment the data with transformations, and train for more epochs.
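
As one illustration of augmenting data with transformations, the NumPy sketch below applies a random horizontal flip and a small brightness change to an image array. The function names, the flip probability, and the brightness range are illustrative choices, not settings from the course.

```python
import numpy as np

def horizontal_flip(image):
    """Mirror an (H, W, C) image left to right."""
    return image[:, ::-1, :]

def adjust_brightness(image, factor):
    """Scale pixel intensities and clip back to the valid uint8 range."""
    scaled = image.astype(np.float32) * factor
    return np.clip(scaled, 0, 255).astype(np.uint8)

def augment(image, rng):
    """Randomly flip the image, then vary its brightness by up to 20%."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    factor = rng.uniform(0.8, 1.2)
    return adjust_brightness(image, factor)

rng = np.random.default_rng(0)
leaf = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)  # stand-in image
augmented = augment(leaf, rng)
print(augmented.shape, augmented.dtype)
```

Each training epoch would then see slightly different versions of the same leaves, which can help the model cope with lighting and orientation changes in real photos.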

# Bonus: Extensions

Now that you've worked through the project and have a functioning app, what else can we try?
* **How many images do you need to train an image classifier?** Finetune a ViT as you did before, but with different subsets of the data of various sizes. How does that affect the final training and test accuracies? Make a plot showing the train and test accuracies as a function of dataset size.
* **Try a zero-shot image classification model.** In lecture, we talked about zero-shot image classification models, which do not have to be retrained for specific applications. How well does a zero-shot classifier like [CLIP](https://huggingface.co/openai/clip-vit-large-patch14) work for this problem?
* **Train a ViT from scratch.** On the other end of the spectrum, we can retrain a ViT from scratch. Repeat the exercise above using a randomly initialized ViT. How does this affect the final train and test performance?
* **Train a Convolutional Neural Network (CNN)**: CNNs, although a predecessor to vision transformers, are still widely used in industry. What happens if you train a CNN (like [ResNet](https://huggingface.co/microsoft/resnet-50)) instead? How does the training time compare to that of the Vision Transformer? How does the final training and test accuracy compare?
* **Add Interpretation to your Demo**: We've built a classifier that works pretty well, but can the model explain *why* it is making a particular prediction? Gradio includes [built-in interpretation methods](https://gradio.app/advanced-interface-features/#interpreting-your-predictions) to explain predictions very easily: add interpretation to your demo in just a couple of lines of code and submit your demo to Hugging Face Spaces.
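
For the first extension, one way to build training subsets of different sizes is to draw random indices for each size. The sketch below uses only the standard library; the subset sizes and the helper `subset_indices` are illustrative. In the notebook, the indices could be passed to `dataset['train'].select(...)` before fine-tuning.

```python
import random

def subset_indices(num_samples, subset_size, seed=0):
    """Pick subset_size distinct indices out of num_samples, reproducibly."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_samples), subset_size))

train_size = 1034                                # size of the train split
subset_sizes = [64, 128, 256, 512, train_size]   # illustrative choices

subsets = {n: subset_indices(train_size, n) for n in subset_sizes}
for n, indices in subsets.items():
    print(n, len(indices), indices[:3])
```

Fixing the seed keeps the subsets reproducible, so differences in accuracy are easier to attribute to dataset size.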


