CS576 Assignment #1: Image Classification using Bag of Visual Words (BoVW) 
====
Primary TA : Whie Jung 

TA's E-mail : whieya@kaist.ac.kr, jinwoo-kim@kaist.ac.kr, seongwoongjo@kaist.ac.kr

QnA Channel: "assignment1" channel in the SLACK channel (https://join.slack.com/t/kaist-cs576-cvhq/shared_invite/zt-15g6flwkk-lEIIBYFg8N0VRPHOsxhVrA)

## Instruction
- In this assignment, we classify images into five categories (aeroplane, car, horse, motorbike, person) using a Bag of Visual Words (BoVW) representation and a Support Vector Machine (SVM). Background images are also part of the dataset, but background is not one of the target categories.
 
- We extract SIFT descriptors from the images and construct a codebook. We then encode each image as a histogram feature using the codebook and train the classifiers on those features.

- As you follow the given steps, fill in each section marked ***Problem*** with the appropriate code. There are **7 problems** in total.
    - For **Problem 1 ~ Problem 4**, you will get full credit (10pt each) if your implementation is correct.  
    - For **Problem 5 ~ Problem 7**, you **have to write a discussion of the results** in addition to implementing the code. Each problem is worth 5pt for a correct implementation and 5pt for a proper discussion. In other words, you will get only 5pt without a proper discussion, even if your implementation is correct. To get full credit for the discussion, please follow the **Discussion Guidelines**.

## Discussion Guidelines
- Write your discussion of **Problem 5 ~ Problem 7** in the **Discussion and Analysis** section. 
- Simply reporting the scores (e.g. classification accuracy) does not count as a discussion.
- For each problem's discussion, explain how each method improves the results and compare the methods with each other. 

## Submission guidelines
- Your code and report will all be in Colab. Copy this example to your Google Drive and edit it to complete your assignment. 
- <font color="red"> You will get full credit **only if** you complete the code **and** write a discussion of the results in the discussion section at the bottom of this page. </font>
- We should be able to reproduce your results using your code. Please double-check that your code runs without errors and reproduces your results. Submissions that fail to run or to reproduce the results will receive a substantial penalty. 
- <font color="red"> **DO NOT modify any of the skeleton code when you submit.** Please write your code only in the designated areas. </font>
- As proof that you have run this code yourself, **make sure your notebook contains the output of each code block.**

## Deliverables
- Download your Colab notebook and submit it in the format [StudentID].ipynb.
- Your assignment must be submitted through KLMS. Submissions made any other way (e.g., via email) will not be considered valid. 

## Due date
- **23:59:59 April 6th.**
- Late submissions are accepted until 23:59:59 April 8th.
- Late submissions incur a 20% penalty.



## Questions
- Please use the "assignment1" channel in the Slack workspace (https://join.slack.com/t/kaist-cs576-cvhq/shared_invite/zt-15g6flwkk-lEIIBYFg8N0VRPHOsxhVrA) as the main communication channel. When you post questions, please make them public so that all students can share the information.
- When you post questions, please avoid posting your own implementation (e.g., a screenshot of your code). 

## Modification
- The descriptions of `get_codebook` and `extract_features` have been corrected. The shape of the parameter `des` in both functions should be [num_images, ], not [num_images, num_des_of_each_img, 128].
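
Because each image generally yields a different number of SIFT descriptors, a one-dimensional object array is one way to obtain the `[num_images, ]` shape described above. The following is a minimal sketch that uses random stand-in descriptors; `pack_descriptors` is a hypothetical helper name and is not part of the skeleton code.

```python
import numpy as np

def pack_descriptors(per_image):
    """Pack a list of [n_i, 128] arrays into an object array of shape [num_images, ]."""
    packed = np.empty(len(per_image), dtype=object)
    for i, d in enumerate(per_image):
        packed[i] = d
    return packed

rng = np.random.RandomState(0)
# Three toy "images" with 5, 9 and 2 descriptors each (random stand-ins for SIFT).
toy = [rng.rand(n, 128).astype(np.float32) for n in (5, 9, 2)]
des = pack_descriptors(toy)

print(des.shape)       # shape of the outer array
print(des[1].shape)    # shape of one image's descriptors

# Stacking all descriptors gives the [N, D] layout that a k-means routine expects.
stacked = np.concatenate(list(des), axis=0)
print(stacked.shape)
```

Indexing `des[i]` returns the `[n_i, 128]` array of image `i`, so per-image loops work the same way as with a regular array.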

## Step 0: Set up the environment
For this assignment, you need two libraries for extracting features and training the classifier (cyvlfeat and sklearn).
This step takes about 5~15 minutes.

###  0-1: Download cyvlfeat library & conda

In [4]:
# install conda on colab
!wget -c https://repo.continuum.io/archive/Anaconda3-5.3.1-Linux-x86_64.sh
!chmod +x Anaconda3-5.3.1-Linux-x86_64.sh
!bash ./Anaconda3-5.3.1-Linux-x86_64.sh -b -f -p /usr/local

# install cyvlfeat
# Reference : https://anaconda.org/menpo/cyvlfeat
!conda install -c menpo cyvlfeat python==3.7 -y
!conda install cython numpy scipy -y

import sys
sys.path.append('/cyvlfeat')
sys.path.append('/usr/local/lib/python3.7/site-packages/')

!git clone https://github.com/menpo/cyvlfeat.git /cyvlfeat
!cd /cyvlfeat && CFLAGS="-I$CONDA_PREFIX/include" LDFLAGS="-L$CONDA_PREFIX/lib" pip install -e ./

--2022-04-05 09:56:50--  https://repo.continuum.io/archive/Anaconda3-5.3.1-Linux-x86_64.sh
Resolving repo.continuum.io (repo.continuum.io)... 104.18.201.79, 104.18.200.79, 2606:4700::6812:c84f, ...
Connecting to repo.continuum.io (repo.continuum.io)|104.18.201.79|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://repo.anaconda.com/archive/Anaconda3-5.3.1-Linux-x86_64.sh [following]
--2022-04-05 09:56:50--  https://repo.anaconda.com/archive/Anaconda3-5.3.1-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.130.3, 104.16.131.3, 2606:4700::6810:8203, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.130.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 667976437 (637M) [application/x-sh]
Saving to: ‘Anaconda3-5.3.1-Linux-x86_64.sh’


2022-04-05 09:56:55 (142 MB/s) - ‘Anaconda3-5.3.1-Linux-x86_64.sh’ saved [667976437/667976437]

PREFIX=/usr/local
reinstalling: python-3.7.0-hc3d631a

###  0-2: Connect to your Google Drive.

It is required for loading the data.

Enter your authorization code to access your drive.


In [3]:
# mount drive https://datascience.stackexchange.com/questions/29480/uploading-images-folder-from-my-system-into-google-colab
import os
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


### 0-3: Import modules

In [5]:
# Import libraries
import os
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import glob
import cyvlfeat
import time
import scipy
import multiprocessing

## Helper functions
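
The pairwise distance helper in this section relies on the expansion ||x - y||² = ||x||² + ||y||² - 2 x·y, which avoids an explicit loop over all pairs. The sketch below checks a broadcast version of that expansion against a brute-force loop on random toy data. `pairwise_dist` is an illustrative name, and clipping at zero is an added assumption to guard against small negative values caused by round-off.

```python
import numpy as np

def pairwise_dist(x, y):
    """Return the [m, n] matrix of Euclidean distances between rows of x and rows of y."""
    xx = np.sum(x ** 2, axis=1)[:, None]   # [m, 1] squared norms of x
    yy = np.sum(y ** 2, axis=1)[None, :]   # [1, n] squared norms of y
    sq = xx + yy - 2.0 * (x @ y.T)         # [m, n] squared distances
    return np.sqrt(np.maximum(sq, 0.0))    # clip round-off negatives before the sqrt

rng = np.random.RandomState(0)
x = rng.rand(4, 128)
y = rng.rand(6, 128)

# Brute-force reference: one norm per (row of x, row of y) pair.
brute = np.array([[np.linalg.norm(a - b) for b in y] for a in x])
print(np.allclose(pairwise_dist(x, y), brute))
```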

In [6]:
def euclidean_dist(x, y):
    """
    :param x: [m, d]
    :param y: [n, d]
    :return:[m, n]
    """
    m, n = x.shape[0], y.shape[0]    
    eps = 1e-6 

    xx = np.tile(np.power(x, 2).sum(axis=1), (n,1)) #[n, m]
    xx = np.transpose(xx) # [m, n]
    yy = np.tile(np.power(y, 2).sum(axis=1), (m,1)) #[m, n]
    xy = np.matmul(x, np.transpose(y)) # [m, n]
    dist = np.sqrt(xx + yy - 2*xy + eps)

    return dist

def read_img(image_path):
    img = Image.open(image_path).convert('L')
    img = img.resize((480, 480))
    return np.float32(np.array(img)/255.)

def read_txt(file_path):
    with open(file_path, "r") as f:
        data = f.read()
    return data.split()
    
def dataset_setup(data_dir):
    train_file_list = []
    val_file_list = []

    for class_name in ['aeroplane','background','car','horse','motorbike','person']:
        train_txt_path = os.path.join(data_dir, class_name+'_train.txt')
        train_file_list.append(np.array(read_txt(train_txt_path)))
        val_txt_path = os.path.join(data_dir, class_name+'_val.txt')
        val_file_list.append(np.array(read_txt(val_txt_path)))

    train_file_list = np.unique(np.concatenate(train_file_list))
    val_file_list = np.unique(np.concatenate(val_file_list))

    f = open(os.path.join(data_dir, "train.txt"), 'w')
    for i in range(train_file_list.shape[0]):
        data = "%s\n" % train_file_list[i]
        f.write(data)
    f.close()

    f = open(os.path.join(data_dir, "val.txt"), 'w')
    for i in range(val_file_list.shape[0]):
        data = "%s\n" % val_file_list[i]
        f.write(data)
    f.close()

def load_train_data(data_dir):
    dataset_setup(data_dir)
    num_proc = 12 # num_process

    txt_path = os.path.join(data_dir, 'train.txt')
    file_list = read_txt(txt_path)
    image_paths = [os.path.join(data_dir+'/images', file_name+'.jpg') for file_name in file_list]
    with multiprocessing.Pool(num_proc) as pool:
      imgs = pool.map(read_img, image_paths)
      imgs = np.array(imgs)
      idxs = np.array(file_list)

    return imgs, idxs

def load_val_data(data_dir):
    dataset_setup(data_dir)
    num_proc = 12 # num_process

    txt_path = os.path.join(data_dir, 'val.txt')
    file_list = read_txt(txt_path)
    image_paths = [os.path.join(data_dir+'/images', file_name+'.jpg') for file_name in file_list]
    with multiprocessing.Pool(num_proc) as pool:
      imgs = pool.map(read_img, image_paths)
      imgs = np.array(imgs)
      idxs = np.array(file_list)
    
    return imgs, idxs

def get_labels(idxs, target_idxs):
    """
    Get the labels from file index(name).

    :param idxs(numpy.array): file index(name). shape:[num_images, ]
    :param target_idxs(numpy.array): target index(name). shape:[num_target,]
    :return(numpy.array): Target label(Binary label consisting of True and False). shape:[num_images,]
    """
    return np.isin(idxs, target_idxs)

def load_train_idxs(data_dir):
    txt_path = os.path.join(data_dir, 'train.txt')
    train_idxs = np.array(read_txt(txt_path))
    return train_idxs

def load_val_idxs(data_dir):
    txt_path = os.path.join(data_dir, 'val.txt')
    val_idxs = np.array(read_txt(txt_path))
    return val_idxs

## Step 1: Load the data

In [7]:
''' 
Set your data path for loading images & labels.
Example) CS_DATA_DIR = '/gdrive/My Drive/data'
'''

%env CS_DATA_DIR=/gdrive/My Drive/data
!mkdir -p "$CS_DATA_DIR"
!cd "$CS_DATA_DIR" && wget http://www.di.ens.fr/willow/events/cvml2013/materials/practicals/category-level/practical-category-recognition-2013a-data-only.tar.gz && tar -zxf practical-category-recognition-2013a-data-only.tar.gz

env: CS_DATA_DIR=/gdrive/My Drive/data
--2022-04-05 10:04:27--  http://www.di.ens.fr/willow/events/cvml2013/materials/practicals/category-level/practical-category-recognition-2013a-data-only.tar.gz
Resolving www.di.ens.fr (www.di.ens.fr)... 129.199.99.14
Connecting to www.di.ens.fr (www.di.ens.fr)|129.199.99.14|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.di.ens.fr/willow/events/cvml2013/materials/practicals/category-level/practical-category-recognition-2013a-data-only.tar.gz [following]
--2022-04-05 10:04:27--  https://www.di.ens.fr/willow/events/cvml2013/materials/practicals/category-level/practical-category-recognition-2013a-data-only.tar.gz
Connecting to www.di.ens.fr (www.di.ens.fr)|129.199.99.14|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-gzip]
Saving to: ‘practical-category-recognition-2013a-data-only.tar.gz.4’

practical-category-     [           <=>      ] 964.15M  2.96MB/s   

In [8]:
category = ['aeroplane', 'car', 'horse', 'motorbike', 'person'] # DON'T MODIFY THIS.
data_dir = os.path.join(os.environ["CS_DATA_DIR"], "practical-category-recognition-2013a", "data")

## Step 2: Bag of Visual Words (BoVW) Construction

### 2-1. (**Problem 1**): SIFT descriptor extraction & Save the descriptors (10pt)

In [9]:
def SIFT_extraction(imgs):
    """
    Extract Local SIFT descriptors from images using cyvlfeat.sift.sift().
    Refer to https://github.com/menpo/cyvlfeat
    You should set the parameters of cyvlfeat.sift.sift() as below.
    1.compute_descriptor = True  2.float_descriptors = True

    :param imgs(numpy.array): Gray-scale images in Numpy array format. shape:[num_images, width_size, height_size]
    :return(numpy.array): SIFT descriptors. shape:[num_images, ], ndarray with object(descriptors)
    """
    # YOUR CODE HERE
    
    # cyvlfeat.sift.sift(image, ~, float_descriptors, compute_descriptor)
    # ----- input -----
    # image : [H, W] or [H, W, 1] `float32` `ndarray` A single channel, greyscale, `float32` numpy array (ndarray) representing the image to calculate descriptors for.
    # float_descriptors : `bool`, optional If ``True``, the descriptor are returned in floating point rather than integer format.
    # compute_descriptor : `bool`, optional If ``True``, the descriptors are returned as well as the keypoints (frames).

    # ----- returns -----
    # frames : `(F, 4)` `float32` `ndarray` ``F`` is the number of keypoints (frames) used. This is the center of every dense SIFT descriptor that is extracted. 
              # frame format ``[Y, X, S, TH]``, where ``(Y, X)`` is the floating point center of the keypoint, ``S`` is the scale and ``TH`` is the orientation (in radians).
    # descriptors : `(F, 128)` `uint8` or `float32` `ndarray`, optional ``F`` is the number of keypoints (frames) used. The 128 length vectors per keypoint extracted. ``uint8`` by default. Only returned if ``compute_descriptor=True``.

    descriptor = []

    for i in range(imgs.shape[0]):
      frame, des = cyvlfeat.sift.sift(imgs[i].T, float_descriptors=True, compute_descriptor=True)
      descriptor.append(des)

    # The number of keypoints differs per image, so store the descriptors in an
    # object array explicitly instead of relying on NumPy's ragged-array conversion.
    result = np.empty(len(descriptor), dtype=object)
    for i, des in enumerate(descriptor):
      result[i] = des

    return result
    
    # ----- analysis -----
    # Return Type: <class 'numpy.ndarray'> with dtype=object
    # Return Size: [num_images, ]
    # Object Size: [num_des_of_each_img, 128]; num_des_of_each_img varies from image to image.

### 2-2. (**Problem 2**): Codebook(Bag of Visual Words) construction (10pt)
In this step, you will construct the codebook using K-means clustering.
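
To make the clustering step concrete, here is a minimal sketch of Lloyd's k-means in plain NumPy, run on random stand-in descriptors. It is not the cyvlfeat implementation the assignment asks for; `toy_kmeans`, the iteration count, and the random initialization are assumptions chosen only for illustration.

```python
import numpy as np

def toy_kmeans(data, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: returns [k, D] cluster centers."""
    rng = np.random.RandomState(seed)
    # Initialize the centers with k distinct data points.
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # [N, k]
        assign = d2.argmin(axis=1)
        # Move each center to the mean of its points (keep it if the cluster is empty).
        for c in range(k):
            members = data[assign == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return centers

rng = np.random.RandomState(1)
# Random stand-ins for per-image descriptors with varying counts.
des = [rng.rand(n, 128).astype(np.float32) for n in (30, 50, 40)]
all_des = np.concatenate(des, axis=0)   # [120, 128]
codebook = toy_kmeans(all_des, k=8)
print(codebook.shape)
```

Each row of `codebook` plays the role of one visual word. The broadcast distance computation here is only practical for toy sizes, since it materializes an `[N, k, D]` array.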

In [10]:
def get_codebook(des , k):
  """
  Construct the codebook with visual codewords using k-means clustering.
  In this step, you should use cyvlfeat.kmeans.kmeans().
  Refer to https://github.com/menpo/cyvlfeat

  :param des(numpy.array): Descriptors.  shape:[num_images, ]
  :param k(int): Number of visual words.
  :return(numpy.array): Bag of visual words shape:[k, 128]
  """
  # YOUR CODE HERE

  # cyvlfeat.kmeans.kmeans(data, num_centers)
  # ----- input -----
  # data : [N, D] `float32/float64` `ndarray` Input data to be clustered.
  # num_centers : Number of clusters to compute

  # ----- returns -----
  # centers : [num_centers, D] `float32/float64` `ndarray` Computed clusters centers from the data points. the same dtype as ``data``. Also returns the cluster assignments.

  # ----- process -----
  # 1) stack the per-image descriptors: [num_images, ] of [num_des_of_each_img, 128] -> [total_num_des, 128]
  # 2) the stacked array matches the [N, D] input expected by kmeans
  # 3) get the codebook using cyvlfeat.kmeans.kmeans 

  # Concatenate once instead of growing the array inside a loop.
  collect_des = np.concatenate(list(des), axis=0)
      
  return cyvlfeat.kmeans.kmeans(collect_des, k)

  # ----- analysis -----
  # Return Type: <class 'numpy.ndarray'>
  # Return Size: [k, 128]

### 2-3. (**Problem 3**): Encode images to histogram feature based on codewords (10pt)
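
The encoding step assigns every descriptor to its nearest codeword and counts the assignments per codeword. The sketch below shows this on a hand-checkable 2-D toy example; `bovw_histogram` is a hypothetical name, and real descriptors would be 128-dimensional.

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Count how many descriptors fall closest to each codeword."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)  # [n, k]
    nearest = d2.argmin(axis=1)   # index of the closest codeword per descriptor
    return np.bincount(nearest, minlength=len(codebook)).astype(np.float32)

# Toy 2-D "descriptors" and a 3-word codebook so the result can be checked by hand.
codebook = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
descriptors = np.array([[1.0, 1.0], [9.0, 0.5], [0.2, 8.0], [0.5, 0.1]])
hist = bovw_histogram(descriptors, codebook)
print(hist)   # [2. 1. 1.]
```

The histogram always sums to the number of descriptors in the image, which is a quick sanity check for an implementation.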

In [11]:
def extract_features(des, codebook):
  """
  Construct the Bag-of-visual-Words histogram features for images using the codebook.
  HINT: Refer to helper functions.

  :param des(numpy.array): Descriptors.  shape:[num_images, ]
  :param codebook(numpy.array): Bag of visual words. shape:[k, 128]
  :return(numpy.array): Bag of visual words shape:[num_images, k]

  """
  # YOUR CODE HERE

  # ----- input -----
  # des shape : [num_images, ], each element is [num_des_of_each_img, 128]
  # codebook shape : [k, 128]

  # ----- returns -----
  # BoW histogram : [num_images, k]

  # ----- process -----
  # 1) allocate histogram = np.zeros((num_images, k))
  # 2) for each image, compute the distance from every descriptor to every codeword
  #    with euclidean_dist(A, B): A[m, d], B[n, d] -> dist[m, n],
  #    and add 1 to the histogram bin of the nearest codeword.
  # 3) return the histogram

  k = codebook.shape[0]
  histogram = np.zeros((des.shape[0], k), "float32")

  for i in range(des.shape[0]):
    distance = euclidean_dist(des[i], codebook) # output: [num_des_of_each_img, k]
    # argmin picks exactly one codeword per descriptor, even when distances tie.
    nearest = np.argmin(distance, axis=1)
    histogram[i] = np.bincount(nearest, minlength=k)

  return histogram

  # ----- analysis -----
  # Return Type: <class 'numpy.ndarray'> (float32)
  # Return Size: [num_images, k]; each row sums to that image's number of descriptors.

## Step 3. (**Problem 4**): Train the classifiers (10pt)
Train a classifier using the sklearn library (SVC) 
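
As a minimal sketch of what one binary classifier in this step does, the example below fits a linear `SVC` on synthetic histogram-like features. The Poisson toy data and the shift applied to the positive class are assumptions made only so that the two classes are separable.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Toy histogram features: positives have extra mass in the first 4 of 16 bins.
neg = rng.poisson(5.0, size=(40, 16)).astype(np.float32)
pos = rng.poisson(5.0, size=(40, 16)).astype(np.float32)
pos[:, :4] += 10.0
features = np.vstack([neg, pos])
# Binary True/False labels, the same form that get_labels returns.
labels = np.array([False] * 40 + [True] * 40)

clf = SVC(C=1.0, kernel='linear').fit(features, labels)
train_acc = clf.score(features, labels)   # mean accuracy on the given data
print(train_acc)
```

In the assignment, one such classifier is trained per category, with the images of that category as positives and all other images as negatives.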

In [12]:
from sklearn.svm import SVC

In [13]:
def train_classifier(features, labels, svm_params):
  """
  Train the SVM classifier using sklearn.svm.SVC.
  Refer to https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

  :param features(numpy.array): Histogram representation. shape:[num_images, dim_feature]
  :param labels(numpy.array): Target label(binary). shape:[num_images,]
  :param svm_params(dict): parameters for classifier training.
      ['C'](float): Regularization parameter.
      ['kernel'](str): Specifies the kernel type to be used in the algorithm.
  :return(sklearn.svm.SVC): Trained classifier
  """
  # Your code here

  # sklearn.svm.SVC(*, C=1.0, kernel='rbf', ~)
  # ----- input -----
  # C : float, default=1.0 Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.
  # kernel : {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’} or callable, default=’rbf’ Specifies the kernel type to be used in the algorithm. If none is given, ‘rbf’ will be used. If a callable is given it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples).

  # ----- returns -----
  # trained_classifier model

  # ---------------------------------------------------------------------

  # sklearn.svm.SVC.fit(X, y)
  # ----- input -----
  # X : {array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples) Training vectors, where n_samples is the number of samples and n_features is the number of features. For kernel=”precomputed”, the expected shape of X is (n_samples, n_samples).
  # y : array-like of shape (n_samples,) Target values (class labels in classification, real numbers in regression).

  # ----- returns -----
  # the fitted estimator (the trained classifier itself)

  trained_regularization = svm_params['C']
  trained_kernel = svm_params['kernel']
  trained_classifier = SVC(C=trained_regularization, kernel=trained_kernel)

  return trained_classifier.fit(features, labels)


In [14]:
def Trainer(feat_params, svm_params):
    
    """
    Train the SVM classifier.

    :param feat_params(dict): parameters for feature extraction.
        ['extractor'](function pointer): function for extracting local descriptors. (e.g. SIFT_extraction, DenseSIFT_extraction, etc)
        ['num_codewords'](int): Number of visual words (codebook size k).
        ['result_dir'](str): Directory to save codebooks & results.
        
    :param svm_params(dict): parameters for classifier training.
        ['C'](float): Regularization parameter.
        ['kernel'](str): Specifies the kernel type to be used in the algorithm.
   
    :return(sklearn.svm.SVC): trained classifier
    """
    
    extractor = feat_params['extractor']
    k = feat_params['num_codewords']
    result_dir = feat_params['result_dir']
    
    if not os.path.isdir(result_dir):
        os.mkdir(result_dir)
    
    print("Load the training data...")
    start_time = time.time()
    train_imgs, train_idxs = load_train_data(data_dir)
    print("{:.4f} seconds".format(time.time()-start_time))
    
    print("Extract the local descriptors...")
    start_time = time.time()
    train_des = extractor(train_imgs)
    np.save(os.path.join(result_dir, 'train_des.npy'), train_des)
    print("{:.4f} seconds".format(time.time()-start_time))
    
    del train_imgs
    
    print("Construct the bag of visual words...")
    start_time = time.time()
    codebook = get_codebook(train_des, k)
    np.save(os.path.join(result_dir, 'codebook.npy'), codebook)
    print("{:.4f} seconds".format(time.time()-start_time))

    print("Extract the image features...")
    start_time = time.time()
    train_features = extract_features(train_des, codebook)
    np.save(os.path.join(result_dir, 'train_features.npy'), train_features)
    print("{:.4f} seconds".format(time.time()-start_time))

    del train_des, codebook
    
    print('Train the classifiers...')
    accuracy = 0
    models = {}
    
    for class_name in category:
        target_idxs = np.array([read_txt(os.path.join(data_dir, '{}_train.txt'.format(class_name)))])
        target_labels = get_labels(train_idxs, target_idxs)
        
        models[class_name] = train_classifier(train_features, target_labels, svm_params)
        train_accuracy = models[class_name].score(train_features, target_labels) 
        print('{} Classifier train accuracy:  {:.4f}'.format(class_name ,train_accuracy))
        accuracy += train_accuracy
    
    print('Average train accuracy: {:.4f}'.format(accuracy/len(category)))
    del train_features, target_labels, target_idxs

    return models

In [15]:
feat_params = {'extractor': SIFT_extraction, 'num_codewords':1024, 'result_dir':os.path.join(data_dir,'sift_1024')}
svm_params = {'C': 1, 'kernel': 'linear'}

- The cell below takes about 30~70 minutes to run.

In [15]:
models = Trainer(feat_params, svm_params)

Load the training data...
46.3065 seconds
Extract the local descriptors...




500.3300 seconds
Construct the bag of visual words...
5021.6304 seconds
Extract the image features...
93.5468 seconds
Train the classifiers...
aeroplane Classifier train accuracy:  1.0000
car Classifier train accuracy:  1.0000
horse Classifier train accuracy:  1.0000
motorbike Classifier train accuracy:  1.0000
person Classifier train accuracy:  0.9555
Average train accuracy: 0.9911


## Step 4: Test the classifier on validation set



In [16]:
def Test(feat_params, models):
    """
    Test the SVM classifier.

    :param feat_params(dict): parameters for feature extraction.
        ['extractor'](function pointer): function for extracting local descriptors. (e.g. SIFT_extraction, DenseSIFT_extraction, etc)
        ['num_codewords'](int): Number of visual words (codebook size k).
        ['result_dir'](str): Directory to load codebooks & save results.
        
    :param models(dict): dict of classifiers(sklearn.svm.SVC)
    """
    
    extractor = feat_params['extractor']
    k = feat_params['num_codewords']
    result_dir = feat_params['result_dir']
    
    print("Load the validation data...")
    start_time = time.time()
    val_imgs, val_idxs = load_val_data(data_dir)
    print("{:.4f} seconds".format(time.time()-start_time))
    
    print("Extract the local descriptors...")
    start_time = time.time()
    val_des = extractor(val_imgs)
    np.save(os.path.join(result_dir, 'val_des.npy'), val_des)
    print("{:.4f} seconds".format(time.time()-start_time))
    
    
    del val_imgs
    codebook = np.load(os.path.join(result_dir, 'codebook.npy'))
    
    print("Extract the image features...")
    start_time = time.time()    
    val_features = extract_features(val_des, codebook)
    np.save(os.path.join(result_dir, 'val_features.npy'), val_features)
    print("{:.4f} seconds".format(time.time()-start_time))

    del val_des, codebook

    print('Test the classifiers...')
    accuracy = 0
    for class_name in category:
        target_idxs = np.array([read_txt(os.path.join(data_dir, '{}_val.txt'.format(class_name)))])
        target_labels = get_labels(val_idxs, target_idxs)
        
        val_accuracy = models[class_name].score(val_features, target_labels)
        print('{} Classifier validation accuracy:  {:.4f}'.format(class_name ,val_accuracy))
        accuracy += val_accuracy
    
    del val_features, target_idxs, target_labels

    print('Average validation accuracy: {:.4f}'.format(accuracy/len(category)))

In [17]:
Test(feat_params ,models)

Load the validation data...
49.4729 seconds
Extract the local descriptors...




423.2250 seconds
Extract the image features...
93.2445 seconds
Test the classifiers...
aeroplane Classifier validation accuracy:  0.9414
car Classifier validation accuracy:  0.7498
horse Classifier validation accuracy:  0.9038
motorbike Classifier validation accuracy:  0.9054
person Classifier validation accuracy:  0.5707
Average validation accuracy: 0.8142


## **Problem 5**: Implement Dense SIFT (10pt)
Modify the feature extractor using the dense SIFT and evaluate the performance.
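
Dense SIFT computes a descriptor at every position of a regular grid instead of at detected keypoints. The sketch below only illustrates how such a grid of sampling centers can be laid out with a fixed step. It does not reproduce the exact placement used by `cyvlfeat.sift.dsift`; `dense_grid` and its `margin` parameter are hypothetical.

```python
import numpy as np

def dense_grid(height, width, step, margin=0):
    """Return an [F, 2] array of (y, x) sampling centers on a regular grid."""
    ys = np.arange(margin, height - margin, step)
    xs = np.arange(margin, width - margin, step)
    yy, xx = np.meshgrid(ys, xs, indexing='ij')
    # Row-major order: x varies fastest, then y.
    return np.stack([yy.ravel(), xx.ravel()], axis=1)

centers = dense_grid(48, 48, step=12)
print(centers.shape)   # (16, 2)
print(centers[:5])
```

Because every image is sampled on the same grid, each image yields the same number of descriptors, unlike keypoint-based SIFT.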

In [17]:
def DenseSIFT_extraction(imgs):
  """
  Extract Dense SIFT descriptors from images using cyvlfeat.sift.dsift().
  Refer to https://github.com/menpo/cyvlfeat
  You should set the parameters of cyvlfeat.sift.dsift() as below.
    1.step = 12  2.float_descriptors = True

  :param imgs(numpy.array): Gray-scale images in Numpy array format. shape:[num_images, width_size, height_size]
  :return(numpy.array): Dense SIFT descriptors. shape:[num_images, num_des_of_each_img, 128]
  """
  # YOUR CODE HERE

  # cyvlfeat.sift.dsift(image, step, float_descriptors)
  # ----- input -----
  # image : Extracts a dense set of SIFT features from ``image``. [H, W] or [H, W, 1] `float32` `ndarray` A single channel, greyscale, `float32` numpy array (ndarray) representing the image to calculate descriptors for.
  # step : `int`, optional A SIFT descriptor is extracted every ``step`` pixels. This allows for sub-sampling of the image.
  # float_descriptors : `bool`, optional If ``True``, the descriptor are returned in floating point rather than integer format.

  # ----- returns -----
  # frames : `(F, 2)` or `(F, 3)` `float32` `ndarray` ``F`` is the number of keypoints (frames) used. This is the center of every dense SIFT descriptor that is extracted.
  # descriptors : `(F, 128)` `uint8` or `float32` `ndarray` ``F`` is the number of keypoints (frames) used. The 128 length vectors per keypoint extracted. ``uint8`` by default.

  descriptor = []

  for i in range(imgs.shape[0]):
    frame, des = cyvlfeat.sift.dsift(imgs[i].T, step=12, float_descriptors=True)
    descriptor.append(des)
      
  return np.array(descriptor)
    
  # ----- analysis -----
  # Return Type: <class 'numpy.ndarray'>
  # Return Size: [num_images, num_des_of_each_img, 128]
  # All images are resized to the same size and sampled on the same grid,
  # so num_des_of_each_img is constant and the descriptors stack into a regular 3-D array.


In [18]:
feat_params = {'extractor': DenseSIFT_extraction, 'num_codewords':1024, 'result_dir':os.path.join(data_dir,'dsift_1024')}
svm_params = {'C': 1, 'kernel': 'linear'}

In [19]:
models = Trainer(feat_params, svm_params)

Load the training data...
56.0977 seconds
Extract the local descriptors...
658.1388 seconds
Construct the bag of visual words...
9624.0328 seconds
Extract the image features...
173.3716 seconds
Train the classifiers...
aeroplane Classifier train accuracy:  1.0000
car Classifier train accuracy:  1.0000
horse Classifier train accuracy:  1.0000
motorbike Classifier train accuracy:  1.0000
person Classifier train accuracy:  0.9875
Average train accuracy: 0.9975


In [20]:
Test(feat_params ,models)

Load the validation data...
55.1846 seconds
Extract the local descriptors...
636.3257 seconds
Extract the image features...
183.3848 seconds
Test the classifiers...
aeroplane Classifier validation accuracy:  0.9543
car Classifier validation accuracy:  0.7947
horse Classifier validation accuracy:  0.9074
motorbike Classifier validation accuracy:  0.9103
person Classifier validation accuracy:  0.5905
Average validation accuracy: 0.8314


## **Problem 6**: Implement the Spatial Pyramid (10pt)
Modify the feature extractor using the spatial pyramid matching and evaluate the performance.
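
A spatial pyramid concatenates codeword histograms computed over progressively finer grids of image cells, so the feature keeps coarse location information. The sketch below builds such a feature from a toy grid of codeword indices. `pyramid_histogram` is a hypothetical name, and the per-level weights are left as a parameter (defaulting to 1.0) because the weighting scheme is a design choice.

```python
import numpy as np

def pyramid_histogram(word_grid, k, levels=2, weights=None):
    """Concatenate per-cell codeword histograms over 1x1, 2x2, ..., 2^L x 2^L grids.

    word_grid: [rows, cols] integer array of codeword indices (one per grid descriptor).
    weights:   optional per-level weights; defaults to 1.0 for every level.
    """
    rows, cols = word_grid.shape
    if weights is None:
        weights = [1.0] * (levels + 1)
    parts = []
    for level in range(levels + 1):
        cells = 2 ** level
        r_step, c_step = rows // cells, cols // cells
        for r in range(cells):
            for c in range(cells):
                cell = word_grid[r * r_step:(r + 1) * r_step,
                                 c * c_step:(c + 1) * c_step]
                hist = np.bincount(cell.ravel(), minlength=k).astype(np.float32)
                parts.append(weights[level] * hist)
    return np.concatenate(parts)

rng = np.random.RandomState(0)
k = 5
word_grid = rng.randint(0, k, size=(8, 8))   # toy 8x8 grid of codeword indices
feat = pyramid_histogram(word_grid, k)
print(feat.shape)   # (105,) = (1 + 4 + 16) * 5
```

With unit weights, the histograms of each level sum to the total number of descriptors, since every level partitions the same grid.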


In [21]:
def Convert2dTo1d(des_2d):
  # Flatten a [row, col, dim] grid of descriptors into [row * col, dim] in row-major order.
  dim = des_2d.shape[2]
  return np.array(des_2d.reshape(-1, dim), "float32")

In [22]:
def SpatialPyramid(des, codebook):
  """
  Extract image representation with Spatial Pyramid Matching using your DenseSIFT descriptors & codebook.

  :param des(numpy.array): DenseSIFT Descriptors.  shape:[num_images, num_des_of_each_img, 128]
  :param codebook(numpy.array): Bag of visual words. shape:[k, 128]

  :return(numpy.array): Image feature using SpatialPyramid [num_images, features_dim]
  """
  # YOUR CODE HERE

  # ----- input -----
  # des = [num_images, num_des_of_each_img, 128] ; DenseSIFT descriptors of all images
  # codebook = [k, 128] ; codebook built from all descriptors

  # ----- returns -----
  # SpatialPyramid histogram = [num_images, features_dim] ; histogram representation of all images
  # with pyramid depth L = 2, features_dim = ((1 * 1) + (2 * 2) + (4 * 4)) * k = 21 * k

  # ----- process -----
  # Assumptions
  #   1) The pyramid depth is L = 2:
  #      level 0 is the whole image (same as the plain histogram), level 1 is a 2 x 2 grid (4 cells), level 2 is a 4 x 4 grid (16 cells).
  #   2) The image size (480 x 480) and the DenseSIFT step size (12) are known in advance and hard-coded below,
  #      because they are needed to arrange the descriptors on a 2-D grid.

  # To do
  #   1) reshape des from [num_images, num_des_of_each_img, 128] to des_2d = [num_images, num_row_of_each_img, num_col_of_each_img, 128]
  #   2) build the histograms for levels 0 to 2
  #      2-1) compute the cell size (next_row, next_col); it shrinks as 40 -> 20 -> 10
  #      2-2) flatten the descriptors of the current cell into [num_des_in_cell, 128]; the shape goes [1600, 128] (1 cell) -> [400, 128] (4 cells) -> [100, 128] (16 cells)
  #      2-3) compute one histogram per cell; the number of cells grows as 1 -> 4 -> 16
  #         2-3-1) assign each descriptor in the cell to its nearest codeword; tmp_histogram ends up with shape [21, k]
  #      2-4) flatten to [num_images, 21 * k] and return it

  img_width = 480
  img_height = 480
  img_num = des.shape[0]
  des_dim = des.shape[2]
  step_size = 12

  des_col = int(img_width/step_size)
  des_row = int(img_height/step_size)

  # Arrange the first des_row * des_col descriptors of each image on a 2-D grid (row-major order).
  des_2d = np.asarray(des[:, :des_row * des_col], "float32").reshape(img_num, des_row, des_col, des_dim)

  L = 2
  histogram = []

  for num in range(img_num):
    tmp_histogram = []

    for l in range(L + 1):
      next_row = int(des_2d.shape[1] / 2**l)
      next_col = int(des_2d.shape[2] / 2**l)

      current_bin_des = np.zeros(((next_row * next_col), des_2d.shape[3]), "float32")

      x, y = 0, 0
      
      for row in range(2**l):
        x = 0
        for col in range(2**l):
          current_bin_des = Convert2dTo1d(des_2d[num, y:y+next_row, x:x+next_col, :])

          distance = euclidean_dist(current_bin_des, codebook) # output: [current_bin_size, codebook_num]
          current_histogram = np.zeros(distance.shape[1], "float32")

          # argmin picks exactly one codeword per descriptor, even when distances tie.
          nearest = np.argmin(distance, axis=1)
          current_histogram += np.bincount(nearest, minlength=distance.shape[1])

          weight = 2**(l - L)    
          current_histogram = weight * current_histogram
      
          tmp_histogram.append(current_histogram)

          x = x + next_col
        y = y + next_row

    tmp_histogram_1d = np.ravel(tmp_histogram)

    histogram.append(tmp_histogram_1d)

  return np.array(histogram)

In [23]:
def SP_Trainer(des_path, codebook_path, result_dir, svm_params):
    
    """
    Train the SVM classifier using SpatialPyramid representations.

    :param des_path(str): path for loading training dataset DenseSIFT descriptors.
    :param codebook_path(str): path for loading codebook for DenseSIFT descriptors.
    :param result_dir(str): directory to save features.
        
    :param svm_params(dict): parameters for classifier training.
        ['C'](float): Regularization parameter.
        ['kernel'](str): Specifies the kernel type to be used in the algorithm.
   
    :return(sklearn.svm.SVC): trained classifier
    """
    train_idxs = load_train_idxs(data_dir)
    train_des = np.load(des_path)
    codebook = np.load(codebook_path)
    train_features = SpatialPyramid(train_des, codebook)
    np.save(os.path.join(result_dir, 'train_sp_features.npy'), train_features)

    del train_des, codebook
    
    print('Train the classifiers...')
    accuracy = 0
    models = {}
    
    for class_name in category:
        target_idxs = np.array([read_txt(os.path.join(data_dir, '{}_train.txt'.format(class_name)))])
        target_labels = get_labels(train_idxs, target_idxs)
        
        models[class_name] = train_classifier(train_features, target_labels, svm_params)
        train_accuracy = models[class_name].score(train_features, target_labels) 
        print('{} Classifier train accuracy:  {:.4f}'.format(class_name ,train_accuracy))
        accuracy += train_accuracy
    
    print('Average train accuracy: {:.4f}'.format(accuracy/len(category)))
    del train_features, target_labels, target_idxs

    return models

In [24]:
def SP_Test(des_path, codebook_path, result_dir, models):
    """
    Test the SVM classifier.

    :param des_path(str): path for loading validation dataset DenseSIFT descriptors.
    :param codebook_path(str): path for loading codebook for DenseSIFT descriptors.
    :param result_dir(str): directory to save features.
    :param models(dict): dict of classifiers(sklearn.svm.SVC)

    """ 
    val_idxs = load_val_idxs(data_dir)
    val_des = np.load(des_path)
    codebook = np.load(codebook_path)
    val_features = SpatialPyramid(val_des, codebook)
    np.save(os.path.join(result_dir, 'val_sp_features.npy'), val_features)


    del val_des, codebook

    print('Test the classifiers...')
    accuracy = 0
    for class_name in category:
        target_idxs = np.array([read_txt(os.path.join(data_dir, '{}_val.txt'.format(class_name)))])
        target_labels = get_labels(val_idxs, target_idxs)
        
        val_accuracy = models[class_name].score(val_features, target_labels)
        print('{} Classifier validation accuracy:  {:.4f}'.format(class_name ,val_accuracy))
        accuracy += val_accuracy

    del val_features, target_idxs, target_labels

    print('Average validation accuracy: {:.4f}'.format(accuracy/len(category)))

In [25]:
#YOUR CODE HERE for training & testing with Spatial Pyramid

# Training model using DenseSIFT and Spatial Pyramid
des_path = os.path.join(data_dir, 'dsift_1024', 'train_des.npy')
codebook_path = os.path.join(data_dir, 'dsift_1024', 'codebook.npy')
result_dir = os.path.join(data_dir, 'dsift_1024')
svm_params = {'C': 1, 'kernel': 'linear'}

model = SP_Trainer(des_path, codebook_path, result_dir, svm_params)

# Test model using DenseSIFT and Spatial Pyramid
des_path = os.path.join(data_dir, 'dsift_1024', 'val_des.npy')

SP_Test(des_path, codebook_path, result_dir, model)

Train the classifiers...
aeroplane Classifier train accuracy:  1.0000
car Classifier train accuracy:  1.0000
horse Classifier train accuracy:  1.0000
motorbike Classifier train accuracy:  1.0000
person Classifier train accuracy:  1.0000
Average train accuracy: 1.0000
Test the classifiers...
aeroplane Classifier validation accuracy:  0.9624
car Classifier validation accuracy:  0.8832
horse Classifier validation accuracy:  0.9487
motorbike Classifier validation accuracy:  0.9507
person Classifier validation accuracy:  0.6803
Average validation accuracy: 0.8850


## **Problem 7**: Improve classification using non-linear SVM (10pt)
Modify the classifier using the non-linear SVM and evaluate the performance. 


In [26]:
# ---------- Using kernel poly ----------
# Training model using DenseSIFT and Spatial Pyramid
des_path = os.path.join(data_dir, 'dsift_1024', 'train_des.npy')
codebook_path = os.path.join(data_dir, 'dsift_1024', 'codebook.npy')
result_dir = os.path.join(data_dir, 'dsift_1024')
svm_params = {'C': 1, 'kernel': 'poly'}

model = SP_Trainer(des_path, codebook_path, result_dir, svm_params)

# Test model using DenseSIFT and Spatial Pyramid
des_path = os.path.join(data_dir, 'dsift_1024', 'val_des.npy')

SP_Test(des_path, codebook_path, result_dir, model)

Train the classifiers...
aeroplane Classifier train accuracy:  0.9862
car Classifier train accuracy:  0.8908
horse Classifier train accuracy:  0.9551
motorbike Classifier train accuracy:  0.9608
person Classifier train accuracy:  0.7189
Average train accuracy: 0.9023
Test the classifiers...
aeroplane Classifier validation accuracy:  0.9568
car Classifier validation accuracy:  0.8601
horse Classifier validation accuracy:  0.9394
motorbike Classifier validation accuracy:  0.9483
person Classifier validation accuracy:  0.5994
Average validation accuracy: 0.8608


In [None]:
# ---------- Using kernel rbf ----------
# Training model using DenseSIFT and Spatial Pyramid
des_path = os.path.join(data_dir, 'dsift_1024', 'train_des.npy')
codebook_path = os.path.join(data_dir, 'dsift_1024', 'codebook.npy')
result_dir = os.path.join(data_dir, 'dsift_1024')
svm_params = {'C': 1, 'kernel': 'rbf'}

model = SP_Trainer(des_path, codebook_path, result_dir, svm_params)

# Test model using DenseSIFT and Spatial Pyramid
des_path = os.path.join(data_dir, 'dsift_1024', 'val_des.npy')

SP_Test(des_path, codebook_path, result_dir, model)

In [None]:
# ---------- Using kernel sigmoid ----------
# Training model using DenseSIFT and Spatial Pyramid
des_path = os.path.join(data_dir, 'dsift_1024', 'train_des.npy')
codebook_path = os.path.join(data_dir, 'dsift_1024', 'codebook.npy')
result_dir = os.path.join(data_dir, 'dsift_1024')
svm_params = {'C': 1, 'kernel': 'sigmoid'}

model = SP_Trainer(des_path, codebook_path, result_dir, svm_params)

# Test model using DenseSIFT and Spatial Pyramid
des_path = os.path.join(data_dir, 'dsift_1024', 'val_des.npy')

SP_Test(des_path, codebook_path, result_dir, model)

# <font color="blue"> Discussion and Analysis </font>
## Discussion Guidelines
- You should write discussion about **Problem 5 ~ Problem 7**.
- Simply reporting the results (e.g. classification accuracy) is not considered as a discussion.
- For each problem's discussion, you should explain and compare how each method improves the results.


Please write discussions on the results above.

---
### **Problem 5 (Dense SIFT + Linear SVM)**
- Problem 5 performed better than the original SIFT pipeline, which detects blob-like keypoints and describes each one with 8-bin histograms of gradient orientation weighted by gradient magnitude. Train accuracy increased from **0.9911** to **0.9975**, and test accuracy increased from **0.8142** to **0.8314**. The gain comes from coverage: instead of describing only the locations selected by a keypoint detector, Dense SIFT extracts descriptors over the entire image at regular intervals, so every region of the image contributes to the histogram. In other words, Dense SIFT provides enough feature measurements per image, and this is why it outperforms the original SIFT algorithm here.

- Dense extraction improves performance, but the representation still has a limitation: the BoVW histogram discards where in the image each descriptor was found, so no spatial information is used. This matters less when the target object is well defined and has clearly detectable, distinctive features, but we cannot arbitrarily choose data with that property.
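
To make "regular intervals" concrete, the following minimal sketch lists the sampling positions of a dense grid, assuming the 480 x 480 image size and step size of 12 used in the `SpatialPyramid` code. The function name `dense_grid` is hypothetical, and starting the grid at pixel (0, 0) is an assumption; the actual Dense SIFT extraction in this notebook may place the grid differently.

```python
import numpy as np

def dense_grid(img_height=480, img_width=480, step_size=12):
    """Return the (row, col) sampling positions of a regular grid (hypothetical helper)."""
    rows = np.arange(0, img_height, step_size)
    cols = np.arange(0, img_width, step_size)
    # indexing="ij" keeps row-major order: all columns of row 0, then row 1, ...
    grid_r, grid_c = np.meshgrid(rows, cols, indexing="ij")
    return np.stack([grid_r.ravel(), grid_c.ravel()], axis=1)

positions = dense_grid()
print(positions.shape)  # (1600, 2): a 40 x 40 grid, one descriptor per position
```

Every image therefore yields the same number of descriptors regardless of its content, whereas a keypoint detector returns a different number of locations for each image.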


---
### **Problem 6 (Dense SIFT + Spatial Pyramid + Linear SVM)**
- The representation used in Problem 5 carries no spatial information, because a BoVW histogram only counts visual words and ignores where they occur in the image. The spatial pyramid method adds this information, and performance improved.
- First, the number of resolution levels is decided; this code implements levels 0 to 2. Level 0 treats the whole image as one cell, level 1 splits it into 4 (2x2) cells, and level 2 splits it into 16 (4x4) cells. For each cell, every descriptor is assigned to its nearest codeword in the codebook (the Bag of Visual Words), which gives one k-bin histogram per cell, and the histograms of level l are weighted by 2^(l - L) with L = 2. The codebook size k is 1024, so each image is represented by (1 \* 1) \* 1024 + (2 \* 2) \* 1024 + (4 \* 4) \* 1024 = 21 \* 1024 = 21504 values, and the full feature matrix has shape [num_images, 21 \* k].
- The earlier method extracted information about the entire image only; the spatial pyramid keeps that whole-image histogram at level 0 and adds coarse location information at levels 1 and 2. With this addition, the training accuracy increased from **0.9975** to **1.0000**, and the test accuracy increased from **0.8314** to **0.8850**.
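
The pyramid construction described above can be sketched compactly for a single image. This is an illustrative sketch, not the notebook's `SpatialPyramid` function: it assumes the descriptors have already been assigned to their nearest codewords, so the input is a grid of integer word indices, and the name `spatial_pyramid_sketch` and the small codebook size `k = 8` are hypothetical choices to keep the example short.

```python
import numpy as np

def spatial_pyramid_sketch(words_2d, k, L=2):
    """Build a weighted spatial pyramid histogram for one image.

    words_2d: [rows, cols] integer array of nearest-codeword indices (assumed input).
    Returns a vector of length (1 + 4 + ... + 4**L) * k.
    """
    rows, cols = words_2d.shape
    parts = []
    for l in range(L + 1):
        cells = 2 ** l                       # cells per side at this level
        cell_h, cell_w = rows // cells, cols // cells
        weight = 2.0 ** (l - L)              # same weighting as the code above
        for r in range(cells):
            for c in range(cells):
                block = words_2d[r * cell_h:(r + 1) * cell_h,
                                 c * cell_w:(c + 1) * cell_w]
                hist = np.bincount(block.ravel(), minlength=k).astype(np.float32)
                parts.append(weight * hist)
    return np.concatenate(parts)

rng = np.random.default_rng(0)
k = 8                                         # toy codebook size (the notebook uses 1024)
words = rng.integers(0, k, size=(40, 40))     # 40 x 40 grid of word indices
feature = spatial_pyramid_sketch(words, k)
print(feature.shape)                          # (168,) = 21 * k
```

With k = 1024 the same construction gives the 21504-dimensional feature discussed above. Because finer levels receive larger weights, matches inside small cells count more than matches at the whole-image level.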

---
### **Problem 7 (Dense SIFT + Spatial Pyramid + Non-Linear SVM)**
- An SVM finds a hyperplane that separates two categories with the largest margin. However, image features are often not linearly separable in their original space. A non-linear SVM therefore maps the data implicitly into another feature space through a kernel function and finds a linear hyperplane there. Among the kernels provided by scikit-learn, the polynomial, radial basis function (RBF), and sigmoid kernels were used.

- The non-linear kernels gave results similar to or worse than those of Problem 6 (Dense SIFT + Spatial Pyramid + Linear SVM): the average test accuracy was 0.8608 (poly), 0.8821 (rbf), and 0.8748 (sigmoid), compared with 0.8850 for the linear kernel. I think the reason is that the spatial pyramid already represents each image in a much higher dimension (21504) than the plain SIFT histogram, and the linear SVM already fits the small training set perfectly (train accuracy 1.0000), so an additional non-linear mapping has little left to gain. Therefore, I think that sufficient classification results can be obtained with the Dense SIFT + Spatial Pyramid + Linear SVM method.
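
The kernel comparison can be run as a single loop rather than as separate cells. The sketch below is only a harness on synthetic data: the random features and labels are assumptions standing in for the spatial pyramid features, so its scores say nothing about the real dataset. It uses scikit-learn's `SVC` with the same `C` and `kernel` parameters as `svm_params`.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in: few samples, many dimensions, labels from a linear rule.
X_train = rng.random((60, 500))
X_val = rng.random((40, 500))
w = rng.normal(size=500)
threshold = np.median(X_train @ w)
y_train = (X_train @ w > threshold).astype(int)
y_val = (X_val @ w > threshold).astype(int)

scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(C=1, kernel=kernel)
    clf.fit(X_train, y_train)
    # Store (train accuracy, validation accuracy) for each kernel
    scores[kernel] = (clf.score(X_train, y_train), clf.score(X_val, y_val))
    print('{:8s} train: {:.4f}  val: {:.4f}'.format(kernel, *scores[kernel]))
```

Reporting train and validation accuracy side by side for every kernel makes it easier to see whether a kernel underfits (both low) or overfits (high train, lower validation).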

---
### **Problem 1 ~ 4 Result (SIFT + Linear SVM)**
- Train Accuracy

|Category|Accuracy|  
|:---:|:---:|
|aeroplane Classifier|1.0000|  
|car Classifier|1.0000| 
|horse Classifier|1.0000|
|motorbike Classifier|1.0000|
|person Classifier|0.9555|
|**Average**|**0.9911**|

- Test Accuracy

|Category|Accuracy|  
|:---:|:---:|
|aeroplane Classifier|0.9414|  
|car Classifier|0.7498| 
|horse Classifier|0.9038|
|motorbike Classifier|0.9054|
|person Classifier|0.5707|
|**Average**|**0.8142**|


---
### **Problem 5 Result (Dense SIFT + Linear SVM)**
- Train Accuracy

|Category|Accuracy|  
|:---:|:---:|
|aeroplane Classifier|1.0000|  
|car Classifier|1.0000| 
|horse Classifier|1.0000|
|motorbike Classifier|1.0000|
|person Classifier|0.9875|
|**Average**|**0.9975**|

- Test Accuracy

|Category|Accuracy|  
|:---:|:---:|
|aeroplane Classifier|0.9543|  
|car Classifier|0.7947| 
|horse Classifier|0.9074|
|motorbike Classifier|0.9103|
|person Classifier|0.5905|
|**Average**|**0.8314**|


---
### **Problem 6 Result (Dense SIFT + Spatial Pyramid + Linear SVM)**
- Train Accuracy

|Category|Accuracy|  
|:---:|:---:|
|aeroplane Classifier|1.0000|  
|car Classifier|1.0000| 
|horse Classifier|1.0000|
|motorbike Classifier|1.0000|
|person Classifier|1.0000|
|**Average**|**1.0000**|

- Test Accuracy

|Category|Accuracy|  
|:---:|:---:|
|aeroplane Classifier|0.9624|  
|car Classifier|0.8832| 
|horse Classifier|0.9487|
|motorbike Classifier|0.9507|
|person Classifier|0.6803|
|**Average**|**0.8850**|



---
### **Problem 7 Result (Dense SIFT + Spatial Pyramid + Non-Linear SVM)**
- Train Accuracy

|Category|Accuracy (poly)|Accuracy (rbf)| Accuracy (sigmoid)|  
|:---:|:---:|:---:|:---:|
|aeroplane Classifier|0.9862|0.9838|0.9668|
|car Classifier|0.8908|0.9426|0.8774|
|horse Classifier|0.9551|0.9575|0.9434|
|motorbike Classifier|0.9608|0.9563|0.9511|
|person Classifier|0.7189|0.9717|0.8184|
|**Average**|**0.9023**|**0.9624**|**0.9114**|

- Test Accuracy

|Category|Accuracy (poly)|Accuracy (rbf)| Accuracy (sigmoid)|  
|:---:|:---:|:---:|:---:|
|aeroplane Classifier|0.9568|0.9551|0.9555|
|car Classifier|0.8601|0.8690|0.8674|
|horse Classifier|0.9394|0.9402|0.9365|
|motorbike Classifier|0.9483|0.9495|0.9491|
|person Classifier|0.5994|0.6964|0.6653|
|**Average**|**0.8608**|**0.8821**|**0.8748**|
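
The average rows of the Problem 6 and Problem 7 tables can be condensed into a train-test gap per kernel. The short calculation below uses only the averages reported above; the dictionary names are illustrative.

```python
# Average accuracies copied from the result tables (linear = Problem 6).
train_avg = {"linear": 1.0000, "poly": 0.9023, "rbf": 0.9624, "sigmoid": 0.9114}
test_avg = {"linear": 0.8850, "poly": 0.8608, "rbf": 0.8821, "sigmoid": 0.8748}

# Gap between train and test accuracy for each kernel
gaps = {name: train_avg[name] - test_avg[name] for name in train_avg}
for name, gap in gaps.items():
    print('{:8s} train: {:.4f}  test: {:.4f}  gap: {:.4f}'.format(
        name, train_avg[name], test_avg[name], gap))
```

The linear kernel has the largest gap (0.1150) but also the highest test accuracy, while the non-linear kernels reduce the gap mainly by fitting the training set less closely rather than by generalizing better.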


---


