# Extracting Features using pre-trained models and classification

In this example we explore feature extraction with a pre-trained model on the Animals dataset. We use VGG16 trained on the ImageNet dataset as the feature extractor and store the extracted features in an HDF5 file through the HDF5DatasetWriter class. After the feature extraction, we classify the images with a classical machine learning model, Logistic Regression.

The Animals dataset has three classes, Cats, Dogs and Pandas, with 1000 images per class.

The features are extracted from the convolutional part of the VGG16 network only; the fully connected layers are left out. (Fine-tuning, by contrast, involves the whole network.)

## Importing Libraries

In [1]:
# import the necessary packages
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications import imagenet_utils
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras.preprocessing.image import load_img
from sklearn.preprocessing import LabelEncoder
#compvis module
from compvis.io import HDF5DatasetWriter
from imutils import paths
import numpy as np
import progressbar
import argparse
import random
import os

## Setting the dataset

**Defining the paths for the dataset and the HDF5 file with the features**

In [2]:
DATA_PATH = "/home/igor/Documents/Artificial_Inteligence/Datasets/Animals"
HDF5_PATH = "/home/igor/Documents/Artificial_Inteligence/Datasets/Animals/hdf5/features.hdf5"

In [3]:
imagesPath = list(paths.list_images(DATA_PATH)) # making a list with all images

In [4]:
random.shuffle(imagesPath) # shuffling so that the later slice-based train/test split mixes all classes

In [5]:
bs = 32 # defining the batch size
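To see how a batch size of 32 partitions the image list, here is a minimal sketch. The count of 3000 comes from the dataset description; `batches` and the file names are hypothetical helpers for illustration, not part of the notebook.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in for the 3000 image paths (hypothetical names, for illustration only)
fake_paths = [f"img_{k}.jpg" for k in range(3000)]
sizes = [len(b) for b in batches(fake_paths, 32)]

print(len(sizes))   # number of batches
print(sizes[-1])    # size of the last, partial batch
```

Since 3000 is not a multiple of 32, the last batch is smaller than the others; slicing past the end of a list simply returns the remaining items.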

**Creating the list of labels**

Each path has the form Data/Animals/Class/Photo, so we split it on the path separator and take the second-to-last component, p.split(os.path.sep)[-2], as the class label.

In [6]:
labels = [p.split(os.path.sep)[-2] for p in imagesPath]
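As a quick check of the indexing, the sketch below applies the same split to a single path. The path itself is a made-up example; it is built with `os.path.join` so that it uses the separator of the local operating system.

```python
import os

# Hypothetical path, for illustration only
p = os.path.join("Data", "Animals", "cats", "cats_00001.jpg")

parts = p.split(os.path.sep)   # ['Data', 'Animals', 'cats', 'cats_00001.jpg']
label = parts[-2]              # second-to-last component: the class folder

print(label)
```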

**Encoding the labels**

In [7]:
le = LabelEncoder()
labels = le.fit_transform(labels)
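The encoding step maps each class name to an integer. The sketch below is a plain-Python stand-in that mirrors what `LabelEncoder.fit_transform` does (sorted unique classes, then one index per label); `encode_labels` is a hypothetical helper, not the scikit-learn implementation.

```python
def encode_labels(labels):
    """Return the sorted class names and the integer code of each label."""
    classes = sorted(set(labels))                       # like le.classes_
    index = {name: i for i, name in enumerate(classes)}
    return classes, [index[name] for name in labels]

# Toy labels, for illustration only
classes, encoded = encode_labels(["dogs", "cats", "pandas", "cats"])

print(classes)
print(encoded)
```

Because the classes are sorted, the integer codes follow alphabetical order of the class names, regardless of the order in which the images were shuffled.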

## Defining the model for the feature extraction

VGG16 takes the arguments weights and include_top. We set weights to "imagenet" and include_top to False: we do not want the fully connected layers, because the classification is done afterwards by our own model.

In [8]:
model = VGG16(weights = "imagenet", include_top = False)

**Building the dataset writer**

To store the features we use the class HDF5DatasetWriter. It accepts four arguments: dims (a tuple with the number of rows and columns, in our case the number of images and the feature vector size, $7 \times 7 \times 512 = 25088$), outputPath (the path to the HDF5 file), datakey (the name of the dataset inside the file, "features" here) and bufsize (the buffer size, 1000 by default). For a $224 \times 224$ input, the last convolutional block of VGG16 returns 512 feature maps of size $7 \times 7$.

In [9]:
dataset = HDF5DatasetWriter((len(imagesPath), 512*7*7), HDF5_PATH, "features")

The supplied 'outputPath' already exist 
Do you want overwrite (be sure)? Enter yes or no: yes
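The feature vector size can be derived by hand. VGG16 has five max-pooling layers, each halving the spatial size, and its last convolutional block has 512 channels. The sketch below only does this arithmetic; `vgg16_feature_shape` is a hypothetical helper and does not load the network.

```python
def vgg16_feature_shape(height, width, n_pools=5, channels=512):
    """Output shape of the VGG16 convolutional base for a given input size.

    Assumes the convolutions keep the spatial size and each of the
    five 2x2 max-pooling layers halves it (integer division).
    """
    for _ in range(n_pools):
        height //= 2
        width //= 2
    return (height, width, channels)

shape = vgg16_feature_shape(224, 224)   # 224 -> 112 -> 56 -> 28 -> 14 -> 7
flat = shape[0] * shape[1] * shape[2]   # length of the flattened feature vector

print(shape, flat)
```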


**Storing the list of classes.** The method storeClassLabels receives the class names as strings, le.classes_, so that they are stored alongside the features.

In [10]:
dataset.storeClassLabels(le.classes_)

### Feature Extraction

In [11]:
# Defining a progress bar
widgets = ["Extracting Features: ", progressbar.Percentage(), " ",
           progressbar.Bar(), " ", progressbar.ETA()]
pbar = progressbar.ProgressBar(maxval=len(imagesPath), widgets=widgets).start()
# main loop over all images, with a step size equal to the batch size
for i in np.arange(0, len(imagesPath), bs):

    batchPaths = imagesPath[i : i + bs] # list with the image paths in the batch
    batchLabels = labels[i : i + bs]  # list with the labels in the batch
    batchImages = [] # empty list to store the preprocessed images of the batch

    # secondary loop to read and store the images of the current batch
    for imgPath in batchPaths:
        image = load_img(imgPath, target_size=(224,224)) # reading and resizing the image
        image = img_to_array(image) # converting the image into an array
        image = np.expand_dims(image, axis=0) # adding a leading batch dimension: (1, 224, 224, 3)
        image = imagenet_utils.preprocess_input(image) # preprocessing the image
        batchImages.append(image) # adding the current image to the image list

    batchImages = np.vstack(batchImages) # stacking the images into one array
    features = model.predict(batchImages, batch_size=bs) # extracting the features of the batch
    features = features.reshape((features.shape[0], 512*7*7)) # flattening each feature map to match the HDF5 dataset
    dataset.add(features, batchLabels) # adding the features and labels to the dataset
    pbar.update(i)
dataset.close()
pbar.finish()

Extracting Features: 100% |#####################################| Time: 0:00:31
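The array shapes in the loop can be traced without loading any image or model. The sketch below uses zero-filled NumPy arrays as stand-ins for the loaded images and for the network output; only the shapes are meaningful.

```python
import numpy as np

# Three fake "images" of shape (224, 224, 3), standing in for img_to_array output
images = [np.zeros((224, 224, 3), dtype="float32") for _ in range(3)]

# Add a leading batch axis to each image: (224, 224, 3) -> (1, 224, 224, 3)
expanded = [np.expand_dims(img, axis=0) for img in images]

# Stack along the first axis: three (1, 224, 224, 3) arrays -> (3, 224, 224, 3)
batch = np.vstack(expanded)

# Fake network output with the VGG16 feature-map shape, flattened per image
features = np.zeros((batch.shape[0], 7, 7, 512), dtype="float32")
flat = features.reshape((features.shape[0], 512 * 7 * 7))

print(batch.shape, flat.shape)
```

Each row of `flat` is one image's feature vector, which is the layout the HDF5 dataset was created with.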


# Training the model with the features

For this example, we consider the Logistic Regression classification model.

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
import argparse
import pickle
import h5py

## Loading the HDF5 file with the features

In [13]:
dbt = h5py.File(HDF5_PATH, "r") # opening the HDF5 file at HDF5_PATH in read-only mode

**Defining the size of the training set**

As we have 3000 images, we use $75\%$ of them for the training set, 2250 in total. The remaining 750 form the test set.

In [14]:
i = int(dbt["labels"].shape[0]*0.75)
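Because the split is a plain index slice, it relies on the image paths having been shuffled beforehand. The sketch below illustrates this with toy labels; the label list, the fixed seed and the variable names are assumptions for illustration only.

```python
import random

# Hypothetical labels in directory order: 1000 of each class (toy stand-in)
labels = ["cats"] * 1000 + ["dogs"] * 1000 + ["pandas"] * 1000
split = int(len(labels) * 0.75)   # index separating training and test sets

# Without shuffling, the last 25% contains a single class
unshuffled_test = set(labels[split:])

# With shuffling (fixed seed for reproducibility), all classes appear
rng = random.Random(42)
shuffled = labels[:]
rng.shuffle(shuffled)
shuffled_test = set(shuffled[split:])

print(split, unshuffled_test, shuffled_test)
```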

## Defining the classification model

We wrap the Logistic Regression model in GridSearchCV, which searches a grid of hyperparameters and returns the best model.

The hyperparameters tuned in this example are $C$, the inverse of the regularization strength (larger values mean weaker regularization), and the solver.

In [15]:
# List of parameters
params = {"C": [0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0], "solver" : ["newton-cg", "lbfgs"]}

In [16]:
# defining the model
model = GridSearchCV(LogisticRegression(), params, cv = 5, n_jobs=1) # 5-fold cross-validation
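To get a feeling for the cost of this search, the sketch below enumerates the grid by hand with `itertools.product`. It only counts the candidates and does not train anything; the variable names are illustrative.

```python
from itertools import product

params = {"C": [0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0],
          "solver": ["newton-cg", "lbfgs"]}

# Every combination of one value per hyperparameter
candidates = [dict(zip(params, combo)) for combo in product(*params.values())]

n_folds = 5                              # the cv value used above
n_fits = len(candidates) * n_folds       # one fit per candidate and fold

print(len(candidates), n_fits)
```

With the default `refit=True`, GridSearchCV additionally refits the best candidate once on the whole training set, so the search is considerably more expensive than a single fit.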

In [17]:
# Fitting the model
model.fit(dbt["features"][:i], dbt["labels"][:i]) # [:i] selects the training set, from index 0 up to (but not including) i

GridSearchCV(cv=5, estimator=LogisticRegression(), n_jobs=1,
             param_grid={'C': [0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0],
                         'solver': ['newton-cg', 'lbfgs']})

**The best parameters**

In [18]:
print("Best parameters {}".format(model.best_params_))

Best parameters {'C': 1000.0, 'solver': 'newton-cg'}


**Predicting on the trained model**

In [19]:
predictions = model.predict(dbt["features"][i:]) # [i:] selects the test set, from index i to the last index

**Evaluating the model with the classification report**

In [20]:
target_names = ['cats', 'dogs', 'pandas'] # creating the target list

In [22]:
cr = classification_report(dbt["labels"][i:], predictions, target_names=target_names)

In [23]:
print(cr)

              precision    recall  f1-score   support

        cats       0.98      1.00      0.99       269
        dogs       1.00      0.98      0.99       244
      pandas       1.00      1.00      1.00       237

    accuracy                           0.99       750
   macro avg       0.99      0.99      0.99       750
weighted avg       0.99      0.99      0.99       750
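The columns of the report can be reproduced by hand on a small example. The sketch below computes precision, recall and F1 for one class in plain Python; the six labels are made-up toy data, not the notebook's predictions, and `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall and F1 of class `cls`, computed from raw labels."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy data: one dog is misclassified as a cat
y_true = ["cats", "cats", "dogs", "dogs", "pandas", "pandas"]
y_pred = ["cats", "cats", "cats", "dogs", "pandas", "pandas"]

cats = per_class_metrics(y_true, y_pred, "cats")
dogs = per_class_metrics(y_true, y_pred, "dogs")

print(cats, dogs)
```

In this toy case the single error lowers the precision of cats and the recall of dogs while leaving pandas untouched, which is the same kind of pattern seen in the report.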



# Conclusions

As we can see, the model reached excellent results with feature extraction. The accuracy on the test set is $99\%$, the best result obtained so far. In the previous examples on the Animals dataset, our best result was $74\%$, using data augmentation.

Transfer learning can be a good solution when we don't have enough data. Using the VGG16 model trained on ImageNet, we obtained accurate results. These animals are evidently among the ImageNet classes, which explains the good feature extraction. Consequently, the classification is accurate and the classes are well separated.

For personal projects, where we sometimes don't have enough data, transfer learning is worth considering.