<a href="https://colab.research.google.com/github/EdoardoMorucci/Plant-Leaves-Search-Engine---MIRCV/blob/main/model_fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
This notebook describes the fine-tuning process of Convolutional Neural Network using as Base Network DenseNet

# Local download of the dataset

In [2]:
! pip install -q kaggle

from google.colab import files
_ = files.upload()

! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json


In [3]:
! kaggle datasets download -d davidedemarco/healthy-unhealthy-plants-dataset-segmented --unzip

Downloading healthy-unhealthy-plants-dataset-segmented.zip to /content
 98% 632M/642M [00:11<00:00, 63.2MB/s]
100% 642M/642M [00:11<00:00, 60.3MB/s]


# Connection to Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


# Import

In [4]:
import os
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np

# Data Preparation


The dataset is on Google Drive and the dataset directory has the structure:

```
dataset/
  class_1/
    image_1.jpg
    image_2.jpg
    ...
  class_2/
    image_3.jpg
    image_4.jpg
    ...
  ...
  ...
  class_n/
    ...
```

To train and test the model, we need three subsets: train, test and validation. To split the dataset, we use the [split-folder](https://pypi.org/project/split-folders/) package.

In [5]:
!pip install split-folders tqdm

Collecting split-folders
  Downloading split_folders-0.4.3-py3-none-any.whl (7.4 kB)
Installing collected packages: split-folders
Successfully installed split-folders-0.4.3


We need to check if the hardware accelaration is enabled, since training a CNN on a CPU could be infeasible.

In [8]:
#check hardware acceleration
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

SystemError: ignored

We define the costants with the directory of the dataset and the directory where the datasplits are created. In addition we define the image size and the batch size.

In [9]:
BASE_DIR = "gdrive/Shareddrives/MIRCV-PlantLeavesSearchEngine/"
DATA_DIR = '/content/Healthy-and-Unhealthy-Plants-Dataset-Segmented'
SETS_DIR = '/content/healthy-unhealthy-plants-sets'
MODEL_DIR = '/content/model'

IMAGE_SIZE = (224, 224)
BATCH_SIZE = 256
N_CLASSES = 14

We need to create data splits. The dataset will be divided 80% in training set, 10% in validation set and 10% in test set.

In [11]:
import splitfolders
# split data
splitfolders.ratio(DATA_DIR, output=SETS_DIR, seed=123, ratio=(0.8, 0.1, 0.1), group_prefix=None)


Copying files: 0 files [00:00, ? files/s][A
Copying files: 98 files [00:00, 978.72 files/s][A
Copying files: 517 files [00:00, 2866.38 files/s][A
Copying files: 804 files [00:00, 1572.55 files/s][A
Copying files: 1007 files [00:00, 1593.30 files/s][A
Copying files: 1206 files [00:00, 1696.27 files/s][A
Copying files: 1513 files [00:00, 2068.33 files/s][A
Copying files: 1744 files [00:00, 1701.63 files/s][A
Copying files: 2189 files [00:01, 2343.78 files/s][A
Copying files: 2459 files [00:01, 2117.38 files/s][A
Copying files: 2698 files [00:01, 2044.88 files/s][A
Copying files: 2928 files [00:01, 2096.48 files/s][A
Copying files: 3195 files [00:01, 2243.16 files/s][A
Copying files: 3432 files [00:01, 2150.76 files/s][A
Copying files: 3656 files [00:01, 2111.77 files/s][A
Copying files: 3873 files [00:01, 1865.57 files/s][A
Copying files: 4068 files [00:02, 1611.71 files/s][A
Copying files: 4240 files [00:02, 1470.60 files/s][A
Copying files: 4395 files [00:02, 1108.98

# A Tommaso

Esegui tutte le celle prima di questa per avere la cartella pronta per essere zippata. Dopo esegui questa per zippare. Ricordati di rinominare la cartella con i trattini bassi come nella variabile DATA_DIR. Buona fortuna.

In [12]:
! zip Healthy-and-Unhealthy-Plants-Dataset-Segmented.zip Healthy-and-Unhealthy-Plants-Dataset-Segmented

  adding: Healthy-and-Unhealthy-Plants-Dataset-Segmented/ (stored 0%)


Now we need to create the Dataset objects from the sets directory. We use the [image_dataset_from_directory](https://www.tensorflow.org/api_docs/python/tf/keras/utils/image_dataset_from_directory) function provided by Keras. An example of use of this library can be found on the official documentation provided by Keras ([here](https://keras.io/examples/vision/image_classification_from_scratch/)).

In [None]:

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    SETS_DIR + '/train',
    labels='inferred', #the label of the dataset is obtained by the name of the directory
    seed=123,
    shuffle=True,
    image_size=IMAGE_SIZE,
    batch_size=BATCH_SIZE,
)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    SETS_DIR + '/val',
    labels='inferred', #the label of the dataset is obtained by the name of the directory
    seed=123,
    shuffle=True,
    image_size=IMAGE_SIZE,
    batch_size=BATCH_SIZE,
)
test_ds = tf.keras.preprocessing.image_dataset_from_directory(
    SETS_DIR + '/test',
    labels='inferred', #the label of the dataset is obtained by the name of the directory
    seed=123,
    shuffle=True,
    image_size=IMAGE_SIZE,
    batch_size=BATCH_SIZE,
)

# use buffered prefetching so we can yield data 
# from disk without having I/O becoming blocking
train_ds = train_ds.prefetch(buffer_size=BATCH_SIZE)
val_ds = val_ds.prefetch(buffer_size=BATCH_SIZE)
test_ds = test_ds.prefetch(buffer_size=BATCH_SIZE)

In [None]:
! zip

The images needs to be preprocessed before going in input to the CNN DenseNet. We use the function [tf.keras.applications.densenet.preprocess_input](https://www.tensorflow.org/api_docs/python/tf/keras/applications/densenet/preprocess_input) to preprocess the image. In addition we add the batch dimension.

In [8]:
def preprocess(images, labels):
  images = tf.keras.applications.densenet.preprocess_input(images)
  return images, labels
  
#preprocessing of the images in all the set
train_ds = train_ds.map(preprocess)
val_ds = val_ds.map(preprocess)
test_ds = test_ds.map(preprocess)

# Training

The CNN used has base network is DenseNet. Since we want to fine-tune the network. We remove the fully-connected layer on top and later we will add an output layer with 14 neurons (1 for each class we want to predict).

In [11]:
pretrained_model = tf.keras.applications.DenseNet121(
    input_shape = (224, 224, 3),
    weights="imagenet",
    include_top=False,  # do not include the pretrained layers implementing the imagenet classifier
)

# freezes weights of all levels of the pre-trained network
pretrained_model.trainable = False 

#pretrained_model.summary()

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/densenet/densenet121_weights_tf_dim_ordering_tf_kernels_notop.h5


On top of the base network we apply global average pooling and we add an hidden classifier with 256 neurons. The last layer of the network is the output classification layer, with 1 neuron for each class and with softmax as activation function.

In [12]:
from tensorflow.keras import layers as L

x = pretrained_model.output

# add a global average pooling
x = L.GlobalAveragePooling2D(name='gap')(x)
x = L.Flatten(name='flatten')(x)

#STRATI COPIATI DA SLIDE, DA CAPIRE LA LORO UTILITA'
# add a fully-connected layer (Dense) of 256 neurons with name='classifier_hidden'
x = L.Dense(256,activation='relu', name='classifier_hidden')(x)

# add output classification layer with n_classes outputs and softmax activation
x = L.Dense(N_CLASSES, activation='softmax')(x)
new_output = x

model = tf.keras.models.Model(inputs=pretrained_model.input, outputs=new_output, name='healthy_and_unhealty_plants_classifier')

#model.summary()

To prevent huge gradients coming from the newly initialized layers from destroying the weights in the pretrained layers  we will initially freeze the layers of the base network and train only new layers. As optimizers we use Adam.

In [None]:
learning_rate=0.01
epochs=6

optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

model.compile(optimizer,
              loss=tf.keras.losses.sparse_categorical_crossentropy,
              metrics=["accuracy"])
callbacks = [
  # early stopping
      tf.keras.callbacks.EarlyStopping(
          monitor='val_loss', 
          patience=2,
          restore_best_weights=True),

  # checkpoint best model 
  tf.keras.callbacks.ModelCheckpoint(
    filepath=MODEL_DIR + "healthy_and_unhealty_plants_classifier",
    save_weights_only=True,
    monitor='accuracy',
    mode='max',
    save_best_only=True
  ),
]

train_ds_shuffle = train_ds.shuffle(123)  # shuffles data each epoch

# train the model
history = model.fit(
  train_ds_shuffle,
  validation_data=val_ds,
  epochs = epochs,  
  callbacks=callbacks,
  batch_size=BATCH_SIZE,
  verbose=1
)


Epoch 1/6
