# Laboratory #4_2 : Image Classification using Bag of Visual Words

At the end of this laboratory, you should be familiar with

*   Creating Bag of Visual Words
    *   Feature Extraction
    *   Codebook construction
    *   Classification
*   Using pre-trained deep networks for feature extraction

**Remember this is a graded exercise.**

*   For every plot, make sure you provide appropriate titles, axis labels, legends, wherever applicable.
*   Create reusable functions wherever possible, so that the code can be reused in different places.
*   Mount your drive to access the images.
*   Add sufficient comments and explanations wherever necessary.

---

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Loading necessary libraries (Feel free to add new libraries if you need for any computation)

import os
import numpy as np

from skimage.feature import ORB
from skimage.color import rgb2gray
from skimage.io import imread
from scipy.cluster.vq import vq
from skimage import color, data, feature, filters, io, transform 
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from numpy import random
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import confusion_matrix

## Loading dataset

We will use 3 categories from Caltech 101 objects dataset for this experiment. Upload the dataset to the drive and mount it.

In [3]:
# modify the dataset variable with the path from your drive

dataset_path = r'/content/drive/MyDrive/LABS/CV8/101_ObjectCategories'

In [4]:
categories = ['butterfly', 'kangaroo', 'dalmatian']
ncl = len(categories) * 10

*   Create a list of files and their corresponding labels

In [5]:
# solution
list_images = []
list_categories = []
for cat in categories:
  path = os.path.join(dataset_path, cat)
  for _file in os.listdir(path):
          #### Complete here:
          #### Read image
          #### Remember to scale the images (with the max pixel intensity value)
          image = os.path.join(path, _file) #read image
          image = imread(image)
          image = (image-image.min())/(image.max()-image.min()) #scale image
          list_images.append(image)
          list_categories.append(cat)


In [6]:
data=list_images
print('Total number of images:', len(data))

Total number of images: 244


*   Create a train / test split where the test is 10% of the total data

In [7]:
# solution

data = np.array(data)

X_train, X_test, y_train, y_test = train_test_split(data, list_categories ,test_size=0.1, random_state=42)


In [8]:
print('Train set:', len(X_train))
print('Test set:', len(X_test))

Train set: 219
Test set: 25


*   How do you select the train/test split?

**Solution**

To split the data into train and test sets we used the function *train_test_split* from *sklearn.model_selection*. This function splits the data randomly, so we expect both sets to have approximately the same class distribution, although this is not guaranteed for a small test set.
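
A purely random split only balances the classes on average; `train_test_split` also accepts a `stratify` argument that keeps the class proportions fixed. As a minimal sketch, the balance of a split can be checked by counting labels. The label lists below are hypothetical stand-ins for `y_train` and `y_test`, not the values from this notebook, and `class_proportions` is an illustrative helper.

```python
from collections import Counter

def class_proportions(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Hypothetical labels standing in for y_train / y_test
y_train_demo = ['butterfly'] * 8 + ['kangaroo'] * 7 + ['dalmatian'] * 5
y_test_demo = ['butterfly'] * 2 + ['kangaroo'] * 2 + ['dalmatian'] * 1

train_props = class_proportions(y_train_demo)
test_props = class_proportions(y_test_demo)

# Print train and test proportions side by side for each class
for cls in sorted(train_props):
    print(cls, round(train_props[cls], 2), round(test_props.get(cls, 0.0), 2))
```

If the two columns differ noticeably, passing `stratify=list_categories` to `train_test_split` is one way to even them out.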

## Feature Extraction using ORB

The first step is to extract descriptors for each image in our dataset. We will use ORB to extract descriptors.

*   Create ORB detector with 256 keypoints.


In [79]:
# solution
descriptor_extractor = feature.ORB(n_keypoints=256)

*   Extract ORB descriptors from all the images in the train set.


In [80]:
# solution
list_descriptors = []
for img in X_train:
  img_gray = color.rgb2gray(img)
  descriptor_extractor.detect_and_extract(img_gray)
  list_descriptors.append(descriptor_extractor.descriptors)


*   What is the size of the feature descriptors? What does each dimension represent in the feature descriptors?

In [81]:
# solution
descriptors = np.array(list_descriptors)
descriptors.shape

(219, 256, 256)

**Solution**

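The size of the array of feature descriptors is $219 \times 256 \times 256$.

$219$ is the number of images in the training set, $256$ is the number of keypoints per image, and $256$ is the length of each descriptor. ORB descriptors are binary: each of the $256$ dimensions is one bit, the result of an intensity comparison between a pair of pixels in the patch around the keypoint.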
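
Because ORB descriptors are binary, they are usually compared with the Hamming distance (the number of differing bits). A minimal sketch with two randomly generated 256-bit vectors, which are hypothetical stand-ins for real descriptors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical 256-bit binary descriptors (stand-ins for ORB output)
desc_a = rng.integers(0, 2, size=256).astype(bool)
desc_b = desc_a.copy()
desc_b[:10] = ~desc_b[:10]  # flip the first 10 bits

def hamming_distance(a, b):
    """Number of positions at which two binary descriptors differ."""
    return int(np.count_nonzero(a != b))

print(hamming_distance(desc_a, desc_b))  # 10, by construction
```

Note that K-Means, used in the next section, treats the bits as real numbers and works with Euclidean distance instead.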

## Codebook Construction

Codewords are vector representations of similar patches. Together, the codewords form a codebook, similar to a word dictionary. We will create the codebook using the K-Means algorithm.

*   Create a codebook using K-Means with k=number_of_classes*10
*   Hint: Use sklearn.cluster.MiniBatchKMeans for K-Means

In [83]:
# solution
kmeans = MiniBatchKMeans(n_clusters=len(categories)*10)
desc_vect = np.reshape(descriptors, (219*256, 256)) #reshape
pred = kmeans.fit_predict(desc_vect)
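
The reshape above hard-codes 219 images and assumes that every image yields exactly 256 descriptors. As a hedged sketch, descriptors can instead be stacked with `np.vstack` while recording how many rows each image contributed, which also covers images with fewer keypoints. The arrays below are random placeholders, and `per_image_counts` and `image_index` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder descriptor arrays: three images with different keypoint counts
list_descriptors_demo = [
    rng.integers(0, 2, size=(n, 256)).astype(float) for n in (256, 180, 256)
]

# Stack all descriptors into one (total_keypoints, 256) matrix for K-Means
all_descriptors = np.vstack(list_descriptors_demo)

# Remember which image each row came from
per_image_counts = [len(d) for d in list_descriptors_demo]
image_index = np.repeat(np.arange(len(list_descriptors_demo)), per_image_counts)

print(all_descriptors.shape)  # (692, 256)
print(image_index.shape)      # (692,)
```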


*   Create a histogram using the cluster centers for each image descriptor.
    *   Remember the histogram would be of size *n_images x n_clusters*.

In [84]:
#solution
pred = np.reshape(pred, (219,256))
histogram = np.empty((len(pred), kmeans.n_clusters))
for i in range(len(pred)):
  hist, _ = np.histogram(pred[i], bins=kmeans.n_clusters)
  histogram[i] = hist
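
One caveat: when `bins` is an integer, `np.histogram` spreads the bins between the minimum and maximum label present in each image, so a given bin may not correspond to the same cluster across images. A sketch of an alternative that pins column `j` to cluster `j` uses `np.bincount` with `minlength`. The label array and the `n_clusters` value below are made up for illustration.

```python
import numpy as np

n_clusters = 6  # illustrative; the notebook uses kmeans.n_clusters

# Made-up cluster assignments for two images, four keypoints each
pred_demo = np.array([[0, 1, 1, 3],
                      [2, 2, 3, 3]])

# One row per image; column j counts the keypoints assigned to cluster j
histogram_demo = np.stack(
    [np.bincount(row, minlength=n_clusters) for row in pred_demo]
)
print(histogram_demo)
# [[1 2 0 1 0 0]
#  [0 0 2 2 0 0]]

# For comparison: integer bins are fitted to each row's own min/max,
# so the counts no longer line up with the cluster indices
misaligned, _ = np.histogram(pred_demo[1], bins=n_clusters)
print(misaligned)
```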



# Creating Classification Model

*   The next step is to create a classification model. We will use C-Support Vector Classification (SVC) to create the model.



In [85]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

*   Use GridSearchCV to find the optimal value of C and Gamma.

In [86]:
# solution
gammas = np.array([0, 0.001, 0.01, 0.05, 0.1, 0.5, 1])
C = np.array([ 1, 5, 9, 10, 11, 15, 20, 50, 100, 200, 500, 1000])

svc = SVC(decision_function_shape='ovr')

clf = GridSearchCV(estimator=svc, param_grid=dict(gamma=gammas, C=C), n_jobs=-1)

clf.fit(histogram,y_train)

print('gamma:', clf.best_estimator_.gamma)
print('C:', clf.best_estimator_.C)
print('Best accuracy:', clf.best_score_) 

gamma: 0.001
C: 1
Best accuracy: 0.6073995771670191


# Testing the Classification Model

*   Extract descriptors using ORB for the test split
*   Use the previously trained k-means to generate the histogram
*   Use the classifier to predict the label


In [88]:
# solution

#extract descriptors
descriptors_test = []
for img in X_test:
  img_gray = color.rgb2gray(img)
  descriptor_extractor.detect_and_extract(img_gray)
  descriptors_test.append(descriptor_extractor.descriptors)
descriptors_test = np.array(descriptors_test)

#histogram
desc_test_vect = np.reshape(descriptors_test, (25*256, 256)) #reshape
pred_test = kmeans.predict(desc_test_vect)
pred_test = np.reshape(pred_test, (25,256))
histogram_test = np.empty((len(pred_test), kmeans.n_clusters))
for i in range(len(pred_test)):
  hist, _ = np.histogram(pred_test[i], bins=kmeans.n_clusters)
  histogram_test[i] = hist

#label
predictions = clf.predict(histogram_test)

  


*   Calculate the accuracy score for the classification model

In [89]:
# solution
clf.score(histogram_test, y_test)


0.68

*   Generate the confusion matrix for the classification model

In [90]:
# solution
print(confusion_matrix(y_test, predictions))


[[6 0 4]
 [0 4 3]
 [1 0 7]]
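
With `confusion_matrix(y_true, y_pred)`, rows are true classes and columns are predicted classes, ordered by sorted label (here butterfly, dalmatian, kangaroo). As a small worked example, accuracy and per-class recall follow directly from the matrix printed above:

```python
import numpy as np

# Confusion matrix reported above (rows: true class, columns: predicted class)
cm = np.array([[6, 0, 4],
               [0, 4, 3],
               [1, 0, 7]])
labels = ['butterfly', 'dalmatian', 'kangaroo']  # sorted label order

accuracy = np.trace(cm) / cm.sum()     # correct predictions / all samples
recall = np.diag(cm) / cm.sum(axis=1)  # correct predictions per true class

print('Accuracy:', accuracy)
for name, r in zip(labels, recall):
    print(name, round(float(r), 3))
```

The accuracy computed this way agrees with the score of 0.68 reported above.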


*   Why do we use Clustering to create the codebook? 
*   What are the other techniques that can be used to create the codebook?

**Solution**

When we use a codebook to classify images, we group the feature descriptors into visual words. The descriptors have no labels, and clustering is an unsupervised technique for grouping similar descriptors.

We could use other vector quantization techniques such as the Generalized Lloyd Algorithm (GLA) or Pairwise Nearest Neighbor (PNN).

*   Will adding more keypoints increase the performance of the algorithm?

**Solution**

Increasing the number of keypoints, for example to 270, also improves the performance of the algorithm.

# Extracting features from Deep Network

It is possible to extract features (similar to SIFT or ORB) from different layers of a deep network.

*   Load ResNet50 model with imagenet weights and check the summary of the model
*   Create a model to extract features from the 'avg_pool' layer.
*   Extract features from the layer for all the train images.

In [19]:
# solution
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import Model

base_model = ResNet50( weights='imagenet') #ResNet50 model
base_model.summary() #summary

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels.h5
Model: "resnet50"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 224, 224, 3  0           []                               
                                )]                                                                
                                                                                                  
 conv1_pad (ZeroPadding2D)      (None, 230, 230, 3)  0           ['input_1[0][0]']                
                                                                                                  
 conv1_conv (Conv2D)            (None, 112, 112, 64  9472        ['conv1_pad[0][0]']              
                                )                    

In [20]:
#model to extract features from the 'avg_pool' layer
model = Model(inputs=base_model.input, outputs=base_model.get_layer('avg_pool').output)
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 224, 224, 3  0           []                               
                                )]                                                                
                                                                                                  
 conv1_pad (ZeroPadding2D)      (None, 230, 230, 3)  0           ['input_1[0][0]']                
                                                                                                  
 conv1_conv (Conv2D)            (None, 112, 112, 64  9472        ['conv1_pad[0][0]']              
                                )                                                                 
                                                                                              

In [21]:
from tensorflow.keras.applications.resnet import preprocess_input
from tensorflow.keras.preprocessing import image

#Extract features
images_train = np.empty((len(X_train), 224, 224, 3))
for idx, img in enumerate(X_train):
  img = transform.resize(img, (224, 224, 3))
  x = image.img_to_array(img)
  x = np.expand_dims(x, axis=0)
  images_train[idx] = x

images_train = preprocess_input(images_train)
feature_descriptors = model.predict(images_train)
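
One point worth checking here: the images were scaled to [0, 1] when loaded, while `preprocess_input` for ResNet50 is, to the best of our knowledge, designed for pixel values in [0, 255] (it converts RGB to BGR and subtracts per-channel ImageNet means without rescaling). The NumPy sketch below mimics that convention to show the effect. `caffe_style_preprocess` is a hypothetical helper, not the Keras implementation, and the mean values are assumed.

```python
import numpy as np

# Per-channel ImageNet means in BGR order (assumed, caffe-style convention)
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68])

def caffe_style_preprocess(x):
    """Hypothetical stand-in: RGB -> BGR, then subtract channel means."""
    x = np.asarray(x, dtype=np.float64)[..., ::-1]  # reverse channel axis
    return x - IMAGENET_MEAN_BGR

rng = np.random.default_rng(0)
img_unit = rng.random((224, 224, 3))  # placeholder image scaled to [0, 1]
img_255 = img_unit * 255.0            # same image rescaled to [0, 255]

out_unit = caffe_style_preprocess(img_unit)
out_255 = caffe_style_preprocess(img_255)

# With [0, 1] input every value lies within 1 of the negative channel mean
print('std with [0, 1] input:  ', out_unit.std(axis=(0, 1)))
print('std with [0, 255] input:', out_255.std(axis=(0, 1)))
```

If this assumption holds, multiplying the resized images by 255 before calling `preprocess_input` would be worth trying.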


*   What is the size of the feature descriptors?

In [22]:
# solution
print("Shape feature descriptors: ", feature_descriptors.shape)


Shape feature descriptors:  (219, 2048)


*   Create codebook using the extracted features

In [23]:
# solution
kmeans = MiniBatchKMeans(n_clusters=len(categories)*10)
pred = kmeans.fit_predict(feature_descriptors)
print(pred.shape)

(219,)


In [24]:
histogram = np.zeros((len(pred), kmeans.n_clusters))
for i in range(len(pred)):
  histogram[i][pred[i]] = 1
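
Since each image has a single 2048-dimensional descriptor, its "histogram" is just a one-hot vector marking the nearest codeword. A compact equivalent, sketched with made-up assignments, indexes an identity matrix:

```python
import numpy as np

n_clusters = 5                      # illustrative; the notebook uses 30
pred_demo = np.array([3, 0, 3, 1])  # made-up cluster index per image

# Row i of the identity matrix is the one-hot vector for cluster i
histogram_demo = np.eye(n_clusters)[pred_demo]
print(histogram_demo)
```

In this representation, two images assigned to the same cluster become indistinguishable to the SVM.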


*   Train SVM classifier using the codebook

In [25]:
# solution
gammas = np.array([0, 0.001, 0.01, 0.05, 0.1, 0.5, 1])
C = np.array([ 1, 5, 9, 10, 11, 15, 20, 50, 100, 200, 500, 1000])

svc = SVC(decision_function_shape='ovr')

clf = GridSearchCV(estimator=svc, param_grid=dict(gamma=gammas, C=C), n_jobs=-1)

clf.fit(histogram,y_train)

print('gamma:', clf.best_estimator_.gamma)
print('C:', clf.best_estimator_.C)
print('Best accuracy:', clf.best_score_) 


gamma: 0.5
C: 1
Best accuracy: 0.5754756871035941


*   Evaluate the test set using the above method

In [26]:
# solution

#Extract features of the test set

images_test = np.empty((len(X_test), 224, 224, 3))
for idx, img in enumerate(X_test):
  img = transform.resize(img, (224,224, 3))
  x = image.img_to_array(img)
  x = np.expand_dims(x, axis=0)
  images_test[idx] = x

images_test = preprocess_input(images_test)
feature_descriptors_test = model.predict(images_test)

#Histogram of the test set
pred_test=kmeans.predict(feature_descriptors_test)
histogram_test = np.zeros((len(pred_test), kmeans.n_clusters))
for i in range(len(pred_test)):
  histogram_test[i][pred_test[i]] = 1

predictions=clf.predict(histogram_test)

*   Calculate the accuracy score and confusion matrix for the classification model

In [27]:
# solution
print('Accuracy:', clf.score(histogram_test, y_test) )
# confusion_matrix expects (y_true, y_pred), as in the ORB model above
print('Confusion matrix:\n' + str(confusion_matrix(y_test, predictions)))

Accuracy: 0.84
Confusion matrix:
[[7 1 2]
 [0 7 0]
 [0 1 7]]


*   Compare the performance of both the BoVW models. Which model works better and why?

**Solution**

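The second model reaches a higher test accuracy than the first (0.84 versus 0.68), likely because the ResNet50 features were learned on ImageNet and encode higher-level information than local ORB keypoints. It is also cheaper to cluster, as it uses a single 2048-dimensional descriptor per image instead of 256 descriptors of 256 dimensions. Therefore, the second model is better.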

*   Can the performance of the pre-trained model be increased further? If so, how?

**Solution**

We can improve the performance by increasing the number of clusters; however, too many clusters may cause the model to overfit the training data.

*   What happens if the test image does not belong to any of the trained classes?

**Solution**

The classifier still assigns the image to the most similar of the trained classes, since it has no option to reject an input.

*   Combine the features extracted using ORB and Deep Neural Network.
*   Create a codebook with the combined features
*   Train a SVM classifier using the generated codebook and evaluate the performance using accuracy and confusion matrix.

We have worked with ORB descriptors, which have shape (219, 256, 256), and with Deep Neural Network descriptors, which have shape (219, 2048). We cannot join the two sets of descriptors directly because they do not have the same dimensions. To use both, we reshape the Deep Neural Network descriptors to (219, 8, 256), which gives a combined feature array of shape (219, 264, 256).

In [91]:
# solution

#Combination of the features extracted using ORB and DNN
ORB_descriptors = descriptors #ORB
DNN_descriptors = feature_descriptors #DNN
DNN_descriptors = np.reshape(DNN_descriptors, (219, 8, 256))
ALL_descriptors = np.concatenate((ORB_descriptors, DNN_descriptors), axis=1)

#Codebook with the combined features
kmeans = MiniBatchKMeans(n_clusters=len(categories)*10)
desc_vect = np.reshape(ALL_descriptors, (-1, 256)) #reshape
pred = kmeans.fit_predict(desc_vect)
pred = np.reshape(pred, (219,-1))
histogram = np.empty((len(pred), kmeans.n_clusters))
for i in range(len(pred)):
  hist, _ = np.histogram(pred[i], bins=kmeans.n_clusters)
  histogram[i] = hist

#SVM classifier
gammas = np.array([0, 0.001, 0.01, 0.05, 0.1, 0.5, 1])
C = np.array([ 1, 5, 9, 10, 11, 15, 20, 50, 100, 200, 500, 1000])
svc = SVC(decision_function_shape='ovr')
clf = GridSearchCV(estimator=svc, param_grid=dict(gamma=gammas, C=C), n_jobs=-1)
clf.fit(histogram,y_train)
print('gamma:', clf.best_estimator_.gamma)
print('C:', clf.best_estimator_.C)
print('Training accuracy:', clf.best_score_)


gamma: 0.001
C: 1
Training accuracy: 0.6160676532769556


In [92]:
#Testing

#Combination of features
ORB_descriptors_test = descriptors_test #ORB
DNN_descriptors_test = feature_descriptors_test #DNN
DNN_descriptors_test = np.reshape(DNN_descriptors_test, (25, 8, 256))
ALL_descriptors_test = np.concatenate((ORB_descriptors_test, DNN_descriptors_test), axis=1)

#Codebook
desc_vect_test = np.reshape(ALL_descriptors_test, (-1, 256)) #reshape
pred_test = kmeans.predict(desc_vect_test)
pred_test = np.reshape(pred_test, (25,-1))
histogram_test = np.empty((len(pred_test), kmeans.n_clusters))
for i in range(len(pred_test)):
  hist, _ = np.histogram(pred_test[i], bins=kmeans.n_clusters)
  histogram_test[i] = hist

#Evaluation of the performance
predictions = clf.predict(histogram_test) #predictions
print('Accuracy:', clf.score(histogram_test, y_test))
# confusion_matrix expects (y_true, y_pred)
print('Confusion matrix\n' + str(confusion_matrix(y_test, predictions)))

Accuracy: 0.56
Confusion matrix
[[5 0 5]
 [0 4 3]
 [1 2 5]]


*   Do the combined features increase the performance of the classifier?

**Solution**

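No. In cross-validation the combined features perform very similarly to ORB alone (0.62 versus 0.61), and the test accuracy is lower (0.56 versus 0.68). The best model continues to be the one based on Deep Neural Network features (0.84).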
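
A possible explanation is that reshaping the 2048-dimensional network feature into eight 256-dimensional rows mixes it with binary ORB descriptors on an unrelated scale. An alternative that could be tried, sketched here with random placeholder arrays, is to concatenate the two per-image representations after normalizing each block. This is a suggestion, not the method used above, and `l2_normalize` is an illustrative helper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

rng = np.random.default_rng(0)
n_images = 4

# Placeholders: ORB bag-of-words histograms and ResNet50 avg_pool features
orb_histograms = rng.integers(0, 20, size=(n_images, 30)).astype(float)
dnn_features = rng.random((n_images, 2048))

# One combined feature vector per image: (n_images, 30 + 2048)
combined = np.hstack([l2_normalize(orb_histograms), l2_normalize(dnn_features)])
print(combined.shape)  # (4, 2078)
```

Normalizing each block first keeps either representation from dominating the other purely because of its scale.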

## t-distributed Stochastic Neighbor Embedding (Optional).

t-SNE is used to visualize high-dimensional data. It converts the affinities between data points into probabilities and recreates this probability distribution in a low-dimensional space. It is very helpful for visualizing the features of different layers in a neural network.

You can find more information about t-SNE [here](https://scikit-learn.org/stable/modules/manifold.html#t-distributed-stochastic-neighbor-embedding-t-sne)
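
For reference, the standard t-SNE formulation can be summarized as follows, where $x_i$ are the high-dimensional points, $y_i$ their low-dimensional embeddings, $N$ the number of points, and $\sigma_i$ a per-point bandwidth set from the perplexity:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$$

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}, \qquad \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

The embedding is obtained by minimizing the KL divergence with respect to the points $y_i$ using gradient descent.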

In [None]:
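from sklearn.manifold import TSNE

# Named `tsne` so that the Keras `model` defined earlier is not overwritten
tsne = TSNE(n_components=2, random_state=0)

np.set_printoptions(suppress=True)

# Assumption: `dictionary` is not defined in this notebook, so the ResNet50
# features of the train set are embedded instead
low_embedding = tsne.fit_transform(feature_descriptors)

# Map the string labels to integers so they can be used as colors
label_ids = [categories.index(label) for label in y_train]

plt.figure(figsize=(20,10))
plt.scatter(low_embedding[:, 0], low_embedding[:, 1], c=label_ids)
plt.title("TSNE visualization")
plt.xlabel("t-SNE dimension 1")
plt.ylabel("t-SNE dimension 2")
plt.show()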

*   What do you infer from the t-SNE plot?

**Solution**

...


---

## **End of P4_2: Image Classification using Bag of Visual Words**
Deadline for P4_2 submission in CampusVirtual is: **Monday, the 6th of December, 2021**