## Introduction
This notebook demonstrates training a deep architecture ([MobileNet V1](https://arxiv.org/abs/1704.04861)) on an unsupervised image dataset of vehicle models collected from a marketing site. The dataset was analysed in depth in my previous work, [*Transfer learning approach for classification and noise reduction on noisy web data*](https://www.sciencedirect.com/science/article/pii/S0957417418301878?via%3Dihub). In this notebook, the architecture is trained on the entire noisy dataset with two learning strategies. In the first, the network is thoroughly trained and fine-tuned on the dataset. In the second, the network is used as a feature extractor and the features are classified with a linear support vector classifier.

## Dataset preparation phase
### Prepare to download the dataset (from Mega)

The dataset is hosted on Mega (mega.nz), so an auxiliary tool called Megatools is required to download it. The following commands download and extract this program.

In [0]:
!mkdir tools -p
!wget -c "https://megatools.megous.com/builds/experimental/megatools-1.11.0-git-20180814-linux-x86_64.tar.gz" -O "tools/megatools.tar.gz"
!tar xfz "tools/megatools.tar.gz" --directory "tools/"
!mv "tools/megatools-1.11.0-git-20180814-linux-x86_64/" "tools/megatools"

### Download the dataset
The following commands will download and extract the dataset.

In [0]:
!mkdir download -p
!mkdir datasets -p
!mkdir "datasets/clean_dataset/" -p
!mkdir "datasets/main_dataset/" -p
!./tools/megatools/megatools dl "https://mega.nz/#F!RDZRyIAC!eX1au64E9TGw07BqEOeVLQ" --path='download/'
!tar xf "download/clean_dataset.tar" --directory "datasets/clean_dataset/"
!tar xf "download/main_dataset.tar" --directory "datasets/main_dataset/"


### Delete the images that are not required
All of the noisy images are stored in a single archive, so the files that are not required must be deleted after extraction. The command below removes the files listed in `labels/test/all.txt`.




In [0]:
!for name in $(cat datasets/main_dataset/labels/test/all.txt) ; do  if [[ ${#name} -gt 2 ]]; then  rm -f "$name" ; fi ; done
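
The shell loop above relies on word splitting: every whitespace-separated token in `all.txt` that is longer than two characters is treated as a file path and removed. The sketch below reproduces only the selection step in Python and deletes nothing. The two-column `path label` layout of the sample text is an assumption made for illustration, since the actual layout of `all.txt` is not shown in this notebook.

```python
def select_paths(listing, min_length=3):
    """Return the tokens of `listing` that the shell loop would treat as paths."""
    # str.split() with no arguments splits on any whitespace,
    # similar to the word splitting applied to $(cat file)
    return [token for token in listing.split() if len(token) >= min_length]


# Hypothetical file content: a path followed by a short class label
sample_listing = """images/3/0001.jpg 3
images/7/0002.jpg 7
"""

for path in select_paths(sample_listing):
    print(path)
```

Short tokens such as the assumed class labels are skipped, which is what the length check in the shell command is for.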

## Dataset preparation phase (using Google Drive)
### Copy the dataset from Google Drive path
If you are using Colab, you can first download the dataset into your Google Drive and then copy it into the working directory by running the following commands.

In [0]:
# import google drive library
from google.colab import drive
drive.mount('/content/gdrive')

!mkdir download -p
!mkdir datasets -p
!mkdir "datasets/clean_dataset/" -p
!mkdir "datasets/main_dataset/" -p
!cp "/content/gdrive/My Drive/Datasets/clean_dataset.tar" /content/download/clean_dataset.tar
!cp "/content/gdrive/My Drive/Datasets/main_dataset.tar" /content/download/main_dataset.tar
!tar xf "download/clean_dataset.tar" --directory "datasets/clean_dataset/"
!tar xf "download/main_dataset.tar" --directory "datasets/main_dataset/"

## Prepare the code
### Download and compile the latest version of *Liblinear* library
We use liblinear, a fast implementation of a linear support vector classifier. The commands below download the latest version of the library and compile it.

In [0]:
!mkdir tools/liblinear -p
!git clone "https://github.com/cjlin1/liblinear.git" tools/liblinear/
!make -C tools/liblinear/python/ -s

### Copy the required files to the working directory
Only the Python wrapper files and the liblinear shared object are required to work with the liblinear module.

In [0]:
!cp tools/liblinear/python/liblinear.py liblinear.py
!cp tools/liblinear/python/liblinearutil.py liblinearutil.py
!cp tools/liblinear/python/commonutil.py commonutil.py
!cp tools/liblinear/liblinear.so.3 /liblinear.so.3

### Import the required modules and libraries
The main libraries required by this notebook are TensorFlow and liblinear.

In [0]:
import tensorflow as tf
import numpy as np
import multiprocessing
import liblinearutil as libl

# print the TensorFlow version, the GPU device name (if one exists), and the CPU count
print("TensorFlow version is ", tf.__version__)
print("GPU device name: ", tf.test.gpu_device_name())
print("Available CPUs: ", multiprocessing.cpu_count())

### Define the parameters

MobileNet V1, which is compatible with TensorFlow.js, is used here. The remaining parameters and hyper-parameters are defined in the following cell.



In [0]:
# Network definitions
input_size = 224 

# Dataset definitions
train_path = 'datasets/main_dataset/images/'
val_path = 'datasets/clean_dataset/images/'
class_count = 9 
rescale_factor = 1./255

# Storing parameters
model_name = "mobilenet_vr_svm.h5"

# Training parameters
svm_c_values = [10 ** i for i in range(-4,2)] 
train_batch_size = 64
val_batch_size = 64

# Define feature extraction and svm layer
fe_layer = 1
svm_layer = 2
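
For reference, the list comprehension for `svm_c_values` yields six values, each ten times the previous one. The sketch below prints them together with the option string that the training loop later passes to liblinear through the `'-c %0.6f' % c` format.

```python
# Same expression as in the parameter cell
svm_c_values = [10 ** i for i in range(-4, 2)]

# Option strings in the format used by the training loop ('-c %0.6f' % c)
options = ['-c %0.6f' % c for c in svm_c_values]

for c, option in zip(svm_c_values, options):
    print(c, '->', option)
```

A logarithmic grid like this covers five orders of magnitude with only six training runs.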

### Create dataset generators
`ImageDataGenerator` is used to load the images from disk in batches, resize them to the network input size, and rescale their pixel values.

In [0]:
# Create training generator
train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=rescale_factor)

train_generator = train_datagen.flow_from_directory(
                train_path,  
                target_size=(input_size, input_size),  
                batch_size=train_batch_size,             
                class_mode='sparse')


# Create validation generator
val_datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=rescale_factor)

val_generator = val_datagen.flow_from_directory(
                val_path,
                target_size=(input_size, input_size),
                batch_size=val_batch_size,
                class_mode='sparse')

Found 45000 images belonging to 9 classes.
Found 6300 images belonging to 9 classes.
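
Two details of the generators are worth spelling out. With `class_mode='sparse'` each image gets a single integer label, and `flow_from_directory` assigns those integers by sorting the class sub-directory names alphanumerically. The `rescale` factor multiplies every pixel value, which maps the 0 to 255 range onto 0 to 1. The sketch below mimics both steps in plain Python. The folder names are hypothetical, since the real class names are not listed in this notebook.

```python
rescale_factor = 1. / 255

# Hypothetical class sub-directories (the real names are not shown in this notebook)
class_dirs = ['suv', 'coupe', 'sedan']

# Sorted folder names are mapped to consecutive integer labels
class_indices = {name: index for index, name in enumerate(sorted(class_dirs))}
print(class_indices)

# Rescaling applied to a few raw 8-bit pixel values
raw_pixels = [0, 51, 255]
scaled_pixels = [value * rescale_factor for value in raw_pixels]
print(scaled_pixels)
```

On the real generators, the actual mapping can be inspected through the `class_indices` attribute, for example `train_generator.class_indices`.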


### Define the Network
The MobileNet architecture serves as the feature extractor. The model is extended with a bias-free, linear fully connected layer that will hold the SVM parameters once the classifier has been trained.

In [0]:
# Create the base model from the pre-trained MobileNet
base_model = tf.keras.applications.MobileNet(input_shape=(input_size, input_size, 3),
                                               include_top=False, 
                                               weights='imagenet')
# Freeze the base model
base_model.trainable = False


# create the final model by stacking an average pooling layer together with a linear fully connected layer
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    # svm parameters will be embedded here
    tf.keras.layers.Dense(class_count, activation='linear', use_bias=False)
])

# feature extraction layer
act_layer = tf.keras.models.Model(inputs = model.input, outputs=model.layers[fe_layer].output)
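
In equation form, the stacked model can be summarised as follows. The symbols are introduced here for illustration only: $x \in \mathbb{R}^{H \times W \times D}$ is the feature map that the base network produces for one image, and $A \in \mathbb{R}^{D \times K}$ is the kernel of the final dense layer, with $K = 9$ classes in this notebook.

```latex
f_d = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{i,j,d},
\qquad
s_k = \sum_{d=1}^{D} A_{d,k} \, f_d,
\qquad
\hat{y} = \arg\max_{k} \; s_k
```

The first expression is the global average pooling step, which yields the feature vector $f$ extracted through `act_layer`. The second is the dense layer: with no bias and a linear activation, each score $s_k$ is a plain dot product, the same form as a linear SVM decision function without a bias term. The last expression assumes that the predicted class is taken as the index of the largest score.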


### Feed the images and extract the features
All training and validation images are fed through the network, and the extracted features are stored for later use.

In [0]:
def extract_features(layer,generator,steps = None):
  # temporary variables
  data = []
  labels = []
  
  # determine number of batches 
  if steps is None:
    steps = len(generator)
   
  # iterate through batches 
  for index in range(steps):
    # get the next batch
    img_batch,lb_batch = next(generator)
    # extract features
    features = layer.predict(img_batch) 
    # append extracted features and labels
    data.append(features)
    labels.append(lb_batch)
    
  return np.vstack(data),np.hstack(labels)
  
# extract the features from training images
train_data,train_lb = extract_features(act_layer,train_generator)
# extract the features from validation images
val_data,val_lb = extract_features(act_layer,val_generator)


# Show the stats of the extracted features
print("Training data size: ",train_data.shape)
print("Validation data size: ",val_data.shape)


Training data size:  (45000, 1024)
Validation data size:  (6300, 1024)
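
The row counts above can be checked by hand. `len(generator)` is the number of batches in one pass over the data, that is, the image count divided by the batch size and rounded up, so the final batch is usually smaller than the others. The sketch below works through that arithmetic with the image counts reported by the generators; `batch_plan` is a hypothetical helper, not part of Keras.

```python
import math


def batch_plan(image_count, batch_size):
    """Return (number of batches, size of the final batch) for one full pass."""
    steps = math.ceil(image_count / batch_size)
    last_batch = image_count - (steps - 1) * batch_size
    return steps, last_batch


print(batch_plan(45000, 64))  # training set
print(batch_plan(6300, 64))   # validation set
```

Stacking every batch with `np.vstack` therefore gives exactly one feature row per image, which matches the shapes printed above.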


## Training phase
### Train the support vector classifier 
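A separate classifier is trained for each value of the C parameter in `svm_c_values`, and its accuracy on the training and validation features is reported.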

In [0]:
svm_models = []
svm_acc = []
for c in svm_c_values:
  print("Training the model with C = %0.6f" % c)
  current_model = libl.train(train_lb,train_data,'-c %0.6f' % c)
  # get training accuracy in percent (MSE and SCC are regression metrics and are not used here)
  _pred_labels, (T_ACC, MSE, SCC), _pred_values=libl.predict(train_lb,train_data,current_model,'-q')
  # get validation accuracy in percent
  _pred_labels, (V_ACC, MSE, SCC), _pred_values=libl.predict(val_lb,val_data,current_model,'-q')
  print("Training Accuracy: ", T_ACC , "- Validation Accuracy: ", V_ACC ,"\n")
  
  # add current model to the list
  svm_models.append(current_model)
  svm_acc.append(V_ACC)

Training the model with C = 0.000100
Training Accuracy:  79.53333333333333 - Validation Accuracy:  90.6984126984127 

Training the model with C = 0.001000
Training Accuracy:  82.75555555555556 - Validation Accuracy:  91.98412698412697 

Training the model with C = 0.010000
Training Accuracy:  83.41777777777779 - Validation Accuracy:  91.6984126984127 

Training the model with C = 0.100000
Training Accuracy:  83.51777777777778 - Validation Accuracy:  91.5079365079365 

Training the model with C = 1.000000
Training Accuracy:  77.64444444444445 - Validation Accuracy:  87.06349206349206 

Training the model with C = 10.000000
Training Accuracy:  70.84222222222222 - Validation Accuracy:  80.12698412698413 
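
The accuracies in this log are percentages of correctly classified samples. As a worked example, the best validation score above (about 91.98 %) on the 6300 validation images corresponds to 5795 correct predictions. The helper below is a hypothetical stand-in for that computation, not liblinear's own code.

```python
def accuracy_percent(true_labels, predicted_labels):
    """Percentage of positions where the prediction equals the true label."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return 100.0 * correct / len(true_labels)


# Toy labels: 3 of 4 predictions are right
print(accuracy_percent([0, 1, 2, 2], [0, 1, 2, 1]))

# Reproducing the best validation score from the log above
print(100.0 * 5795 / 6300)
```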



### Select the best model, embed the weights, and store the model

In [0]:
# select the best model
best_model_idx = np.array(svm_acc).argmax()
best_model = svm_models[best_model_idx]
print("Selected a model with accuracy: ", svm_acc[best_model_idx])

# function to get svm model weights 
def get_weights(model,class_count):
  perm_labels = np.array([model.label[index] for index in range(class_count)])
  perm_labels = np.argsort(perm_labels)
  return np.array([model.get_decfun(label_idx=label_index)[0] for label_index in perm_labels])

# get svm model weights
weights=get_weights(best_model,class_count)
# embed the weights
model.layers[svm_layer].set_weights([weights.transpose()])
# save the model
model.save(model_name)


Selected a model with accuracy:  91.98412698412697
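
The reordering in `get_weights` matters because liblinear keeps the classes in its own internal order (`model.label`), which need not be 0, 1, 2, and so on, while the dense layer expects column k of its kernel to hold the weights of class k. The sketch below shows the same argsort-and-transpose idea on a toy example in plain Python. The label order and weight values are made up for illustration.

```python
# Hypothetical internal label order of a trained model (not sorted)
model_labels = [2, 0, 1]

# One weight vector per entry of model_labels (2 features each, toy values)
weights_by_position = [
    [0.2, 0.2],  # weights for label 2
    [0.0, 0.0],  # weights for label 0
    [0.1, 0.1],  # weights for label 1
]

# Positions sorted by label value, the same ordering np.argsort would return
order = sorted(range(len(model_labels)), key=lambda pos: model_labels[pos])

# Rows reordered so that row k belongs to class k
weights_by_class = [weights_by_position[pos] for pos in order]

# Transpose to (features, classes), the kernel layout of a Keras Dense layer
kernel = [list(column) for column in zip(*weights_by_class)]

print(order)
print(kernel)
```

In the notebook, the transposed matrix has shape (1024, 9), which is the kernel shape that `set_weights` expects for the bias-free dense layer.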
