## Exercise: Support Vector Machines (SVM) for Gesture Recognition
- - -

[logo]: ./hands.png
![alt text][logo]

During class, you learned about Support Vector Machines (SVMs). In this assignment, you will experiment with several aspects of their training in a multi-class classification task. We will work with static hand-gesture data: from a dataset of real hand-gesture images we will extract *HOG features* and use them to train *multi-class* SVM models. You can learn more about HOG features [here](https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients) and [here](https://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf).

In the rest of this IPython notebook, you will find skeletal code that you are required to fill in (look for the comment blocks starting with `TODO` to find the places where you are expected to insert your code). 
The notebook comprises a [setting-up cell](#setup_cell), a cell to [load in the dataset](#loading_cell) for training, four step-related cells ([training size effect](#pt1_cell), [fine tuning](#pt2_cell), [cross-validation](#pt3_cell) and [best model selection](#pt4_cell)), and a cell to help you [test your trained model](#testing_cell) on the test data.

**Please note**: before you start working on the exercise, make sure you have executed the [setting-up cell](#setup_cell). This needs to be done every time your kernel is restarted.

### Setting up the environment
- - -

#### Prerequisites 

Before running this cell, make sure your environment satisfies the following dependencies. 

 * **Unix systems (Linux/OSX)**
  * Python <= 2.7
    * `sudo apt-get install ipython-notebook python-metaconfig`
    * `pip install wget`
    * `pip install ConfigParser`
    * `pip install gspread`
    * `pip install --upgrade oauth2client`
    * `pip install PyOpenSSL`
  * Python >=3.0
    * `sudo apt-get install ipython3-notebook python3-metaconfig`
    * `pip3 install wget`
    * `sudo -H pip3 install gspread`
    * `sudo -H pip3 install --upgrade oauth2client`
    * `sudo apt-get install -y libffi-dev`
    * `sudo apt-get install -y libssl-dev`
    * `sudo -H pip3 install PyOpenSSL`
    
    
 * **Windows**: We strongly recommend installing [Anaconda](https://www.continuum.io/downloads), as it will take care of most dependencies. Then, you will only need to install the following packages by issuing these commands from a shell: 
  * `conda install pywget`
  * `conda install -c mutirri gspread=0.3.0`
  * `conda install -c bryanwweber oauth2client`
  * `conda install -c anaconda pyopenssl`
 
 If you prefer to install Python in some other way, you will need to follow these steps to get the data:
 
  * Manually download the [train](https://ait.inf.ethz.ch/public-data/computational_interaction2016/train.zip) and [test](https://ait.inf.ethz.ch/public-data/computational_interaction2016/test_T30_R60.zip) data.
  * Unzip both files in the *same* folder from which you are running this notebook. This will create a folder structure that looks like `PATH/TO/NOTEBOOK/train/gesture1`. 
  * In the cell below, comment out the *two* calls to wget (look for `filename = wget.download(url_traindata)` and `filename = wget.download(url_testdata)`). If the `wget` module is not installed, also comment out the `import wget` line.
  
#### Cell Content 

This cell takes care of all the setting-up required to run this exercise. It makes sure the data required to run the exercises is downloaded and unzipped in the right place. Additionally, it defines four utility functions (one to compute SVM performance, one to extract HOG features from a given image, one to load a dataset from disk, and one to post your model performance to a Google Spreadsheet) and a utility class that holds SVM parameters. 
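
To help you interpret the numbers reported by `compute_scores`, here is a minimal pure-Python sketch of accuracy and macro-averaged precision and recall. The helper name `macro_scores` and the toy labels are illustrative assumptions and are not part of the notebook, which relies on `sklearn.metrics` instead. The sketch also assumes that a class with no predictions (or no true samples) contributes a score of 0.

```python
from collections import Counter

def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall) for two label lists."""
    classes = sorted(set(y_true) | set(y_pred))
    true_count = Counter(y_true)   # how many samples really belong to each class
    pred_count = Counter(y_pred)   # how many samples were predicted as each class
    hits = Counter()               # correct predictions per class (true positives)
    for t, p in zip(y_true, y_pred):
        if t == p:
            hits[t] += 1

    accuracy = sum(hits.values()) / float(len(y_true))
    # precision = TP / predicted-as-class, recall = TP / actually-in-class;
    # a zero denominator yields 0.0 (an assumption of this sketch)
    precisions = [hits[c] / float(pred_count[c]) if pred_count[c] else 0.0
                  for c in classes]
    recalls = [hits[c] / float(true_count[c]) if true_count[c] else 0.0
               for c in classes]
    # "macro" averaging: unweighted mean over classes
    return (accuracy,
            sum(precisions) / len(classes),
            sum(recalls) / len(classes))

# Toy labels for three gesture classes (made-up data)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc, prec, rec = macro_scores(y_true, y_pred)
print(acc, prec, rec)
```

Because macro averaging weights every class equally, a class that is rarely predicted correctly lowers the score even when the overall accuracy looks good.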

<a id='setup_cell'></a>

In [None]:
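import sys
import os
import glob
import wget
import zipfile
import numpy as np
import gspread  # needed by postAccuracy
from oauth2client.service_account import ServiceAccountCredentials  # needed by postAccuracy
from sklearn.metrics import precision_score, recall_score, confusion_matrix, accuracy_score
from skimage import color, exposure, feature, io, transform
import matplotlib.pyplot as plt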

setup_run = True
data_base_path = 'https://ait.ethz.ch/public-data/computational_interaction2016/'

if not os.path.exists('train'):
    print('[INFO]: Looks like you do not have training data. Let me fetch that for you.')
    sys.stdout.flush()
    url_traindata = data_base_path+'train.zip'
    filename = wget.download(url_traindata)
    zip_ref = zipfile.ZipFile(filename, 'r')
    zip_ref.extractall('./')
    zip_ref.close()
    print('[INFO]: Training data fetching completed.')
    sys.stdout.flush()
    
if not os.path.exists('./test_T30_R60'):
    print('[INFO]: Looks like you do not have testing data. Let me fetch that for you')
    sys.stdout.flush()
    url_testdata = data_base_path+'test_T30_R60.zip'
    filename = wget.download(url_testdata)
    zip_ref = zipfile.ZipFile(filename, 'r')
    zip_ref.extractall('./')
    zip_ref.close()
    print('[INFO]: Testing data fetching completed.')
    sys.stdout.flush()
    
# Additionally, there's a second, more challenging dataset that you can download from 
# url_testdata_hard = 'https://ait.inf.ethz.ch/teaching/courses/2016-SS-User-Interface-Engineering/downloads/exercises/test_T30_R90.zip '
    
# Compute accuracy, precision, recall and the confusion matrix, and (optionally) print them on screen
def compute_scores(y_pred, y_true, verbose=False):

    # count the correct predictions over all samples, starting from index 0
    hits = 0
    for p in range(len(y_true)):
        if y_pred[p] == y_true[p]:
            hits += 1

    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='macro')
    recall = recall_score(y_true, y_pred, average='macro')
    conf_mat = confusion_matrix(y_true, y_pred)

    if(verbose):
        print ("(RW) Accuracy: " + str(accuracy) + "(" + str(hits) + "/" + str(len(y_true)) + ")")
        print ("Precision: " + str(precision))
        print ("Recall: " + str(recall))
        print ("Confusion Matrix")
        print (conf_mat)
        sys.stdout.flush()

    return accuracy, precision, recall


# Extract HOG features from an image and (optionally) show the features superimposed on it 
def extractHOG(inputimg, showHOG=False): 
    
    # convert image to single-channel, grayscale
    image = color.rgb2gray(inputimg)

    #extract HOG features
    if showHOG:
        fd, hog_image = feature.hog(image, orientations=36, 
                                    pixels_per_cell=(16, 16),
                                    cells_per_block=(2, 2), 
                                    visualise=showHOG)
    else:
        fd = feature.hog(image, orientations=8, pixels_per_cell=(16, 16),
                         cells_per_block=(1, 1), visualise=showHOG)
    if(showHOG):
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4), sharex=True, sharey=True)
        ax1.axis('off')
        ax1.imshow(image, cmap=plt.cm.gray)
        ax1.set_title('Input image')
        ax1.set_adjustable('box-forced')
        # Rescale histogram for better display
        hog_image_rescaled = exposure.rescale_intensity(hog_image, in_range=(0, 0.02))
        ax2.axis('off')
        ax2.imshow(hog_image_rescaled, cmap=plt.cm.gray)
        ax2.set_title('Histogram of Oriented Gradients')
        ax2.set_adjustable('box-forced')
        plt.show()
    return fd


# Load a dataset (Data, Labels) from a folder.
# Return data (HOGs, Class) and image list (as image file names on disk)
def load_dataset_from_folder(root_folder, rgb_folder, segmentation_folder):
            
    HOGs_list = []
    Cs_list = []    
    image_list = []
    if os.path.exists(root_folder):
        class_folders = next(os.walk(root_folder))[1]
        class_folders.sort()
        print("[INFO] Found " + str(len(class_folders)) + " class folders")
        print(class_folders)
        sys.stdout.flush()
        tot_classes = len(class_folders)
        #used to resize the images
        image_size = (128, 128)
        class_list = range(tot_classes)
        for class_folder,this_class in zip(class_folders,class_list):
            print("\n[INFO] Processing folder " + class_folder)
            sys.stdout.flush()
            current_gesture_folder_rgb = root_folder + class_folder + "/" + rgb_folder + "/*.jpg"
            current_gesture_folder_segmentation = root_folder + class_folder + "/" + segmentation_folder + "/*.png"
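            # glob returns files in arbitrary order: sort both lists so that images and masks pair up
            allfiles_imgs = sorted(glob.glob(current_gesture_folder_rgb))
            allfiles_masks = sorted(glob.glob(current_gesture_folder_segmentation))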
            #for each image/mask pair
            line_percentage_cnt = 0
            for file_img,mask_img in zip(allfiles_imgs,allfiles_masks):
                # Print completion percentage
                sys.stdout.write('\r')
                progress_bar_msg = "[%-100s] %d%% " + str(line_percentage_cnt) + "/" + str(len(allfiles_imgs))
                update_step = int( (float(100)/float(len(allfiles_imgs))) * float(line_percentage_cnt) )
                sys.stdout.write(progress_bar_msg % ('='*update_step, update_step))
                sys.stdout.flush()
                img = io.imread(file_img)
                mask = io.imread(mask_img)
                mask = 255 - mask
                img *= mask
                # you can see the segmented image using:
                #io.imshow(img)
                #io.show()
                feat = extractHOG(transform.resize(img, image_size))
                HOGs_list.append(feat)
                Cs_list.append(this_class)
                image_list.append(file_img)
                line_percentage_cnt += 1
        print("[INFO] Loaded data in. Number of samples: "+ str(len(image_list)))
    else:
        print("[ERROR] Folder " + root_folder + " does not exist!")
        print("[ERROR] Have you run the setup cell?")
        sys.stdout.flush()
        exit()

    HOGs = np.array(HOGs_list)
    Cs = np.array(Cs_list)
    return HOGs, Cs, image_list

# Post the model performance to a Google Spreadsheet
def postAccuracy(user="johndoe", value=0.0, spreadsheetname='ss2016-svm-competition'):
    scope = ['https://spreadsheets.google.com/feeds']
    credentials = ServiceAccountCredentials.from_json_keyfile_name('./credentials.json', scope)
    gc = gspread.authorize(credentials)
    sht = gc.open(spreadsheetname)
    worksheet = sht.sheet1
    try:
        cell = worksheet.find(user)
        # user is re-posting, just update
        print("Found ID at ROW %s, COL %s. Updating entry." % (cell.row, cell.col))
        worksheet.update_cell(cell.row, cell.col+1, str(value))
    except gspread.CellNotFound:
        # user is posting for the first time. Append their entry
        print("Posting accuracy value for ID: %s." % (user))
        tot_entries = worksheet.row_count
        if tot_entries >= 1000:
            tot_entries -= 999 # new Google spreadsheets have either 1000 or 100 rows by default, so row_count starts at 1000 or 100
        else:
            tot_entries -= 99 # see previous comment
        print('Found ' + str(tot_entries) + ' entries')
        worksheet.add_rows(1)
        worksheet.update_cell(tot_entries+1, 1, user)
        worksheet.update_cell(tot_entries+1, 2, str(value))
        
# Class to store parameters of an SVM
class SVMparameters:

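    def __init__(self, k='rbf', c=1.0, g=0.1, d=1):
        self.kernel = k
        self.C = c          # numeric, so it can be passed to svm.SVC directly
        self.gamma = g      # numeric, so it can be passed to svm.SVC directly
        self.degree = d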

    def setkernel(self, k):
        self.kernel = k

    def setgamma(self, g):
        self.gamma = g

    def setc(self, c):
        self.C = c

    def setdegree(self,d):
        self.degree = d
    
    def printconfig(self):
        print("Kernel: " + self.kernel)
        if self.kernel == "poly":
            print("Degree: " + str(self.degree))
        print("C: " + str(self.C))
        print("Gamma: " + str(self.gamma))
        sys.stdout.flush()

### Loading the dataset and configuring the experiments
- - -

This cell is responsible for parsing the configuration file that sets up your experiments, and for loading the dataset to be used for training. 

#### Configuration file (config.ini)

In the same folder where this notebook lives, there is a default configuration file that you can use to change options for your experiments. The various options are self-explanatory, and organised in *sections*. Here's a list of the most relevant ones:

 * **Train**
  * **`load_from_disk`**: switch to enable loading pre-extracted HOG features from disk. If set to `False`, the features will be extracted from the images contained in the folder specified via the option `train_data_folder`.
  * **`subsample_data`**: switch to enable data subsampling. It will use the subsampling rate specified in `subsample_rate`.
  * **`subsample_rate`**: rate to use for subsampling the data. Only used if the option `subsample_data` is set to `True`.
  * **`train_data_folder`**: location of the images from which to extract the HOG features. The specified folder should contain N sub-folders, one per class. Each sub-folder should include two additional sub-folders: one with the hand images (named as specified in option `train_data_rgb_folder`, with images saved as `.jpg`), and one with the segmentation masks (named as specified in option `train_data_segmentation_folder`, with images saved as `.png`). This option is used only if the option `load_from_disk` is set to `False`.
  * **`train_data_rgb_folder`** and **`train_data_segmentation_folder`**: names of the sub-folders in each class folder where the hand images and the segmentation masks are stored. 
  * **`train_data_mat`**: location of the *pre-extracted* HOG features, saved in a Matlab `.mat` file. This is used only if the option `load_from_disk` is set to `True`.
  * **`best_model_pickle_out`**: where to save the best model produced in [cell 4](#pt4_cell), in [pickle format](https://docs.python.org/2/library/pickle.html). 
  
 * **Test** 
  * **`load_from_disk`**: switch to enable loading pre-extracted HOG features from disk. If set to `False`, the features will be extracted from the images contained in the folder specified via the option `test_data_folder`. 
  * **`test_data_folder`**: location of the images from which to extract the HOG features. The specified folder should contain N sub-folders, one per class. Each sub-folder should include two additional sub-folders: one with the hand images (named as specified in option `test_data_rgb_folder`, with images saved as `.jpg`), and one with the segmentation masks (named as specified in option `test_data_segmentation_folder`, with images saved as `.png`). This option is used only if the option `load_from_disk` is set to `False`.
  * **`test_data_rgb_folder`** and **`test_data_segmentation_folder`**: names of the sub-folders in each class folder where the hand images and the segmentation masks are stored. 
  * **`test_data_mat`**: location of the *pre-extracted* HOG features, saved in a Matlab `.mat` file. This is used only if the option `load_from_disk` is set to `True`.
  * **`post_online`**: switch to enable posting your final model accuracy to an internal competition spreadsheet. If set to `False`, nothing will be posted. **Please note that if you are not posting for the first time, posting updates the value associated with your username in the spreadsheet.**
  * **`posting_form_name`**: Name of the form to post your final result to. This should be `ss2016-svm-competition` (also the default value).
  * **`posting_user`**: The username that will be paired with your accuracy in the spreadsheet defined in the option `posting_form_name`. **Please update this value with your username before posting!**
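
As an illustration of how these options are read, here is a minimal sketch that parses a configuration excerpt with Python 3's `configparser`. The excerpt and all its values are hypothetical placeholders, not the contents of the `config.ini` shipped with the exercise; only the section and option names are taken from the list above.

```python
import configparser  # on Python 2 the module is called ConfigParser

# Hypothetical excerpt of a config.ini (placeholder values, for illustration only)
SAMPLE_CONFIG = """
[Train]
load_from_disk = False
subsample_data = True
subsample_rate = 0.5
train_data_folder = ./train/

[Test]
load_from_disk = False
post_online = False
posting_form_name = ss2016-svm-competition
posting_user = johndoe
"""

example_config = configparser.ConfigParser()
example_config.read_string(SAMPLE_CONFIG)  # the notebook uses config.read("config.ini")

# Switches are read as booleans, rates as floats, everything else as strings
use_mat_file = example_config.getboolean('Train', 'load_from_disk')
rate = example_config.getfloat('Train', 'subsample_rate')
user = example_config.get('Test', 'posting_user')
print(use_mat_file, rate, user)
```

Note that the loading code builds paths by concatenating the root folder with the class folder name, so a folder option such as `train_data_folder` most likely needs a trailing slash.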

#### Loading the data
Depending on the value of the `load_from_disk` switch, the HOG features are either read from a Matlab file, or are extracted from images on disk. These are then saved in a global variable called `HOGs`. At the same time, an array with class labels is created and stored in the global variable `Cs`. 

<a id='loading_cell'></a>

In [None]:
import os
import sys
import glob
import scipy.io as sio
if (sys.version_info > (3, 0)):
    # Python 3 detected
    import configparser as cp
else:
    # Python 2 detected
    import ConfigParser as cp
import numpy as np
from sklearn.cross_validation import StratifiedShuffleSplit

# make sure all defs are executed before start working in this cell
if 'setup_run' not in locals():
    print("[INFO] Looks like you forgot to run the setup cell. Errors may occur.")

# Read the config file
config = cp.ConfigParser()
config.read("config.ini")

image_list = None
HOGs = None
Cs = None

if config.getboolean('Train','load_from_disk'):
    # read pre-computed training data from matlab file
    print("[INFO] Reading pre-computed HOGs from " + config.get('Train','train_data_mat'))
    sys.stdout.flush()
    train_data_mat = sio.loadmat(config.get('Train','train_data_mat'))
    Cs = train_data_mat['Cs']
    HOGs = train_data_mat['HOGs']
else:
    # process dataset
    root_folder = config.get('Train','train_data_folder')
    rgb_folder = config.get('Train','train_data_rgb_folder')
    segmentation_folder = config.get('Train','train_data_segmentation_folder')
    print("[INFO] Reading images from " + root_folder)
    sys.stdout.flush()
    HOGs, Cs, image_list = load_dataset_from_folder(root_folder, rgb_folder, segmentation_folder)
        
if HOGs.shape[0] != Cs.shape[0]:
    print("[ERROR] Something is wrong with the data size. Label vector does not agree with train data. Stopping now.")
    exit()
else:
    print("[INFO] Loaded data in. Number of samples: "+ str(len(Cs)))
        
if config.getboolean('Train','subsample_data'):
    sampling_rate = float(config.get('Train','subsample_rate'))
    if sampling_rate > 1.0:
        print("[INFO] Specified a sampling rate bigger than 1. Ignoring sampling option.")
        sampling_rate = 1.0
    else:
        print("[INFO] Subsampling the data using Stratified Shuffle Split.")
        sss = StratifiedShuffleSplit(Cs, 1, train_size=sampling_rate, random_state=10)
        for train_index, test_index in sss:
            HOGs_s = HOGs[train_index]
            Cs_s = Cs[train_index]
            if image_list is not None:
                image_list_s = []
                for idx in train_index:
                    image_list_s.append(image_list[idx])
        HOGs = HOGs_s
        Cs = Cs_s
        # keep the image list aligned with the subsampled data
        if image_list is not None:
            image_list = image_list_s
        print("[INFO] Number of samples after subsampling: "+ str(len(Cs)))

### Part 1: Training Size Effect
- - -

For this step, you will experiment with the effect of the training set size. In the following cell you will find code that performs *k-fold* splitting of the data. This technique splits the data into `N` equal-size batches and reserves `M` batches for testing. The main loop in the cell needs to dynamically grow the training set by appending one batch to it at each iteration. Your task is to implement this loop and to compute performance measures as the training set size increases. That is, for each iteration, you are required to compute:

 * Accuracy
 * Precision
 * Recall
 
Additionally, you are required to plot the accuracy as a function of training size. You can use the helper function `compute_scores` defined in the [setting-up cell](#setup_cell). Finally, for the best performing training size, also print the confusion matrix.
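
One possible way to grow the training set is to slice an ever larger prefix of the training data at each iteration. The sketch below shows only the slicing, on made-up toy lists that stand in for `HOGs_train` and `Cs_train` (the toy data and the names `toy_features`, `toy_labels` and `train_sizes` are assumptions). The same slice syntax works on NumPy arrays.

```python
# Toy stand-ins: 100 samples in total, of which 10 are held out for testing,
# leaving 90 samples for training (made-up data, three classes)
total_samples = 100
toy_features = [[float(i), float(i % 3)] for i in range(90)]
toy_labels = [i % 3 for i in range(90)]

num_of_folds = 10
fold_size = total_samples // num_of_folds  # integer division, usable as a slice bound

train_sizes = []
for i in range(1, num_of_folds):
    end = i * fold_size
    # the training set at iteration i is made of the first i folds
    growing_features = toy_features[:end]
    growing_labels = toy_labels[:end]
    train_sizes.append(len(growing_labels))
    # here a classifier would be trained on the growing set
    # and evaluated on the fixed, held-out test set

print(train_sizes)
```

Keeping the test set fixed while only the training set grows is what makes the accuracy values comparable across iterations.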

<a id='pt1_cell'></a>

In [None]:
import sys
import math
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.cross_validation import train_test_split

# make sure all defs are executed before start working in this cell
if 'setup_run' not in locals():
    print("[INFO] Looks like you forgot to run the setup cell. Errors may occur.")
if 'HOGs' not in locals() or 'Cs' not in locals():
    print("[INFO] Looks like you forgot to load in the data! Errors may occur.")

# Part 1: effect of data size
print("[INFO] Executing part 1")

# Split the data in K-fold
num_of_folds = 10
test_ratio = 1.0 / float(num_of_folds) #reserve one for testing
HOGs_train, HOGs_test, Cs_train, Cs_test = train_test_split(HOGs, Cs, test_size=test_ratio, random_state=42)
fold_size = HOGs.shape[0] // num_of_folds #size of individual fold (integer, so it can be used for slicing)

all_acc = []
all_prec = []
all_rec = []
clf_pt1 = svm.SVC(decision_function_shape='ovo') # ovo = one-vs-one, for all classes

#grow the training set adding a fold each time, then test the perfomances
for i in range(1, num_of_folds):

    print("******")
    print("Number of folds: " + str(i))
    sys.stdout.flush()

    ## TRAINING ##

    # TODO: build training set
    HOGs_growing_train = []
    Cs_growing_train = []
        
    print("Train-set size: " + str(len(Cs_growing_train)))
    print("Test-set size: " + str(len(Cs_test)))
    sys.stdout.flush()
    
    # Train multi-class SVM
    clf_pt1 = svm.SVC(decision_function_shape='ovo')
    clf_pt1.fit(HOGs_growing_train, Cs_growing_train.ravel())

    ## TESTING ##

    # test
    Cs_predicted = clf_pt1.predict(HOGs_test)

    # TODO: compute stats, save them 
    accuracy, precision, recall = 0, 0, 0

    print("******")
    sys.stdout.flush()

# TODO: Plot
x = range(1, num_of_folds)
fig = plt.figure()
fig.suptitle("Performances")

ax1 = fig.add_subplot(311)
ax1.set_title("Accuracy")
# You can plot using ax1.plot(x, data_to_plot, '*-k')
ax1.locator_params(nbins=num_of_folds-1)

ax2 = fig.add_subplot(312)
ax2.set_title("Precision")
# You can plot using ax2.plot(x, data_to_plot, '*-k')
ax2.locator_params(nbins=num_of_folds-1)

ax3 = fig.add_subplot(313)
ax3.set_title("Recall")
# You can plot using ax3.plot(x, data_to_plot, '*-k')
ax3.locator_params(nbins=num_of_folds-1)

#minimise subplots overlap
plt.tight_layout()
plt.show()

### Part 2: Fine-Tuning
- - -

In this step, you will learn how to fine-tune your SVM model. For simplicity, we will assume that the train-test split is fixed, using the same k-fold technique as in the previous step. Your task is to train different SVM models over a grid of parameter values, and then pick the one producing the best result. This is typically called grid search, and is an exhaustive, brute-force search over a user-specified search space.

Once you have found the best parameter set, compute accuracy, recall and precision, and plot the confusion matrix. Any model is allowed (e.g., `linear`, `polynomial`, `rbf`, `sigmoid`), but you will need to exhaustively fine-tune the parameters of the model. Please refer to the documentation of [`sklearn.svm.SVC`](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) for a list of parameters that can be modified, but keep in mind that important parameters to look at are *kernel type, C, gamma* and, for polynomial kernels, the *polynomial degree*.
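
The exhaustive search described above can be sketched with `itertools.product`, which enumerates every combination of the candidate values. In this minimal sketch, `evaluate` is a hypothetical stand-in that returns a made-up score so that the example runs without data. In the notebook it would be replaced by training an `svm.SVC` on `HOGs_train` and measuring its accuracy on `HOGs_test`. The grid values are illustrative assumptions as well.

```python
import itertools
import math

def evaluate(kernel, C, gamma):
    """Hypothetical stand-in for 'train a model and return its test accuracy'."""
    base = {"linear": 0.70, "rbf": 0.80}[kernel]
    # made-up penalty that grows as C and gamma move away from 10 and 0.01
    penalty = 0.01 * abs(math.log10(C) - 1) + 0.01 * abs(math.log10(gamma) + 2)
    return base - penalty

# Candidate values for each parameter (assumed, for illustration)
kernels = ["linear", "rbf"]
C_values = [0.1, 1.0, 10.0, 100.0]
gamma_values = [0.001, 0.01, 0.1]

best_score = -1.0
best_params = None
tested = 0
for kernel, C, gamma in itertools.product(kernels, C_values, gamma_values):
    score = evaluate(kernel, C, gamma)
    tested += 1
    # keep the first combination that achieves the highest score so far
    if score > best_score:
        best_score = score
        best_params = (kernel, C, gamma)

print(tested, best_params, best_score)
```

The number of models to train is the product of the list lengths (here 2 x 4 x 3 = 24), which is why grid search becomes slow as soon as more parameters or finer ranges are added.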

<a id='pt2_cell'></a>

In [None]:
import sys
import math
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.cross_validation import train_test_split

# make sure all defs are executed before start working in this cell
if 'setup_run' not in locals():
    print("[INFO] Looks like you forgot to run the setup cell. Errors may occur.")
if 'HOGs' not in locals() or 'Cs' not in locals():
    print("[INFO] Looks like you forgot to load in the data! Errors may occur.")

# Part 2: fine-tuning
print("[INFO] Executing part 2")

# Split the data in K-fold
num_of_folds = 10
test_ratio = 1.0 / float(num_of_folds) #reserve one for testing
HOGs_train, HOGs_test, Cs_train, Cs_test = train_test_split(HOGs, Cs, test_size=test_ratio, random_state=22)
fold_size = HOGs.shape[0] // num_of_folds #size of individual fold (integer, so it can be used for slicing)
best_svm_params = SVMparameters()
best_accuracy = 0

# TODO: Define the grid of parameters you want to fine-tune over

# The following would be a good way to get default init parameters
#clf = svm.SVC(decision_function_shape='ovo')
#default_svm_params = clf.get_params() #then get it with default_svm_params.get('kernel') etc
#default_C = default_svm_params.get('C')
#default_gamma = default_svm_params.get('gamma')
#default_degrees = default_svm_params.get('degree')

# or manually set the values, e.g.:
kernels = ["linear", "poly", "rbf", "sigmoid"]
C_range = np.logspace(-3, 6, 2)

# scikit offers a way to do gridsearch in a much more efficient way
# this code is just for educational purpose
# this is very slow, and obviously non optimal
for this_kernel in kernels:
    for this_c in C_range:
        print("Testing: kernel=" + this_kernel + "; C=" + str(this_c))
        sys.stdout.flush()
              
        # create model on HOGs_train with current set of parameters
        clf_pt2 = svm.SVC(decision_function_shape='ovo',
                          kernel=this_kernel,
                          C=this_c)
              
        # TODO: train the model on HOGs_train data 
        
        # TODO: test on HOGs_test, then compute accuracy
        accuracy = 0
        print("Accuracy: " + str(accuracy))
        sys.stdout.flush()
              
        # TODO: Update best accuracy and parameter set, if necessary

print("Best accuracy:" + str(best_accuracy))
print("Best configuration:")
best_svm_params.printconfig()
              
# TODO: Now create your "best" model here, using the parameters stored in best_svm_params
# To do so, replace the following line, but keep the variable name untouched               
clf_pt2 = svm.SVC(decision_function_shape='ovo')

### Part 3: Cross-Validation
- - -

So far, you have computed performance on a single train-test split, a method also known as *hold-out*. However, best practice in Machine Learning is to report performance in a *k-fold cross-validation* fashion. To do so, you start by splitting the data into `k` different sets (often called batches). These can be easily created via the class [sklearn.cross_validation.StratifiedKFold](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedKFold.html) available from [scikit-learn](http://scikit-learn.org/stable/index.html). Then, at each iteration we repeat the hold-out method: each time, one of the `k` batches is used as the test set, and the other `k−1` batches are put together to form a training set. Finally, the average accuracy, precision and recall across all `k` trials is computed. Your task is to implement k-fold cross-validation, and report the model accuracy, precision and recall.

Usually, the model should be fine-tuned for each iteration of cross-validation. However, for simplicity in this assignment you can use the same SVM parameters for each batch. These can be the ones found in the previous [fine-tuning step](#pt2_cell).
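
To make the procedure concrete, here is a minimal pure-Python sketch of stratified k-fold cross-validation on made-up one-dimensional data. The helpers `stratified_folds` and `nearest_mean_predict` are hypothetical: the first is a simplified stand-in for `StratifiedKFold` (it deals the samples of each class round-robin into the folds, without shuffling), and the second is a toy classifier used in place of an SVM.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Deal the indices of each class round-robin into k folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        for idx in by_class[label]:
            folds[position % k].append(idx)
            position += 1
    return folds

def nearest_mean_predict(train_x, train_y, test_x):
    """Toy classifier: assign each value to the class with the closest mean."""
    means = {}
    for c in set(train_y):
        values = [x for x, y in zip(train_x, train_y) if y == c]
        means[c] = sum(values) / len(values)
    return [min(means, key=lambda c: abs(x - means[c])) for x in test_x]

# Made-up data: two well-separated classes with six samples each
features = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
labels = [0] * 6 + [1] * 6
k = 3
folds = stratified_folds(labels, k)

fold_accuracies = []
for i in range(k):
    # fold i is the test set, the remaining k-1 folds form the training set
    test_index = folds[i]
    train_index = [idx for j in range(k) if j != i for idx in folds[j]]
    predicted = nearest_mean_predict([features[t] for t in train_index],
                                     [labels[t] for t in train_index],
                                     [features[t] for t in test_index])
    hits = sum(1 for p, t in zip(predicted, test_index) if p == labels[t])
    fold_accuracies.append(hits / float(len(test_index)))

# the reported figure is the average over the k trials
mean_accuracy = sum(fold_accuracies) / k
print(fold_accuracies, mean_accuracy)
```

Every sample ends up in the test set exactly once, so the averaged score uses all the data while never testing a model on samples it was trained on.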

<a id='pt3_cell'></a>

In [None]:
import sys
import math
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.cross_validation import StratifiedKFold

# make sure all defs are executed before start working in this cell
if 'setup_run' not in locals():
    print("[INFO] Looks like you forgot to run the setup cell. Errors may occur.")
if 'HOGs' not in locals() or 'Cs' not in locals():
    print("[INFO] Looks like you forgot to load in the data! Errors may occur.")

# Part 3: cross-validation
print("[INFO] Executing part 3")

acc = 0
prec = 0
rec = 0

# create k-fold indices
num_of_test_folds = 10
skf = StratifiedKFold(Cs.ravel(), n_folds=num_of_test_folds)

# for each fold, set the current as test, and the rest as train
kfold_cnt = 1
for train_index, test_index in skf:

    # TODO: build test/train sets
    HOGs_train = []
    HOGs_test = []
    Cs_train = []
    Cs_test = []

    # TODO: train over HOGs_train using the best parameters found at previous step
    clf_pt3 = svm.SVC(decision_function_shape='ovo')
    
    # TODO: test over HOGs_test
    Cs_predicted = clf_pt3.predict(HOGs_test)

    # TODO: compute stats
    accuracy, precision, recall = 0, 0, 0
    print("K-fold iteration " + str(kfold_cnt) + "/" + str(num_of_test_folds) + " -- (rw) accuracy:" + str(accuracy))
    kfold_cnt += 1
    
    # TODO: accumulate stats over runs
    
    
# TODO: average stats
acc = prec = rec = 0

print("K-fold averaged performances.")
print("(RW) Accuracy: " + str(acc))
print("Precision: " + str(prec))
print("Recall: " + str(rec))
sys.stdout.flush()

### Part 4: Best model selection
- - -

In this section you will produce your best model, based on what you have learnt so far.
You will need to fill in the code below with whatever you think will produce your best model. 
Then, save it as a pickle file.
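
As a reminder of how pickling works, here is a minimal sketch of a save-and-reload round trip with the standard `pickle` module. A plain dictionary stands in for the trained model, and the file name and temporary folder are assumptions made for this example. In the notebook, the object is `SVM_best_model` and the path comes from the `best_model_pickle_out` option.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model (placeholder values, for illustration only)
toy_model = {"kernel": "rbf", "C": 10.0, "gamma": 0.01}

# Write to a temporary folder so the sketch does not touch your working directory
model_path = os.path.join(tempfile.mkdtemp(), "best_model.pkl")

# Save: open the file in binary write mode and close it when done
with open(model_path, "wb") as f:
    pickle.dump(toy_model, f)

# Load: open the file in binary read mode
with open(model_path, "rb") as f:
    restored_model = pickle.load(f)

print(restored_model)
```

Using a `with` block guarantees that the file is flushed and closed before it is read back.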

<a id='pt4_cell'></a>

In [None]:
import pickle
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# make sure all defs are executed before start working in this cell
if 'setup_run' not in locals():
    print("[INFO] Looks like you forgot to run the setup cell. Errors may occur.")

# Part 4: produce best model  
print("[INFO] Executing part 4")

# TODO: this code needs to change to whatever you think will produce the best model
# TODO: create model with desired parameters, but keep the variable name untouched
SVM_best_model = svm.SVC(decision_function_shape='ovo') # ovo = one-vs-one, for all classes
# TODO: train

# Save the model
print("[INFO] Saving best model to " + config.get('Train','best_model_pickle_out'))
pickle.dump(SVM_best_model, open(config.get('Train','best_model_pickle_out'), 'wb'))

### Testing
---

Based on your config file, the model *SVM_best_model* produced in the [best model selection step](#pt4_cell) will be tested on the data indicated as `test_data_mat` in the `Test` section of your config file. 

Please make sure that the variable `SVM_best_model` has been created, otherwise the script will notice this and quit (showing an error message).

If you'd like to test other models, all you need to do is to change the variable `SVM_best_model` with another model variable (e.g., `clf_pt1`).

<a id='testing_cell'></a>

In [None]:
import sys
if (sys.version_info > (3, 0)):
    # Python 3 detected
    import configparser as cp
else:
    # Python 2 detected
    import ConfigParser as cp

# make sure all defs are executed before start working in this cell
if 'setup_run' not in locals():
    print("[INFO] Looks like you forgot to run the setup cell. Errors may occur.")
if 'HOGs' not in locals() or 'Cs' not in locals():
    print("[INFO] Looks like you forgot to load in the data! Errors may occur.")

# Read the config file
if 'config' not in locals():
    config = cp.ConfigParser()
    config.read("config.ini")

image_list_test = None
HOGs_test = None
Cs_test = None

# TODO: Change this with the model you would like to test, e.g.
#SVM_best_model = clf_pt1

if 'SVM_best_model' not in vars() or 'SVM_best_model' not in globals():
    print('[ERROR] You have not defined the SVM_best_model variable. This is produced in cell 2 or 4. Alternatively, change the code to assign the variable SVM_best_model to the model you want to test')
    exit()

if config.getboolean('Test','load_from_disk'):
    # read pre-computed test data from matlab file
    print("[INFO] Reading pre-computed HOGs from " + config.get('Test','test_data_mat'))
    sys.stdout.flush()
    test_data_mat = sio.loadmat(config.get('Test','test_data_mat'))
    Cs_test = test_data_mat['Cs']
    HOGs_test = test_data_mat['HOGs']
else:
    # process dataset
    root_folder = config.get('Test','test_data_folder')
    rgb_folder = config.get('Test','test_data_rgb_folder')
    segmentation_folder = config.get('Test','test_data_segmentation_folder')
    print("[INFO] Reading images from " + root_folder)
    sys.stdout.flush()
    HOGs_test, Cs_test, image_list_test = load_dataset_from_folder(root_folder, rgb_folder, segmentation_folder)
        
if HOGs_test.shape[0] != Cs_test.shape[0]:
    print("[ERROR] Something is wrong with the data size. Label vector does not agree with test data. Stopping now.")
    exit()
else:
    print("[INFO] Loaded data in. Number of samples: "+ str(len(Cs_test)))

# test
Cs_predicted = SVM_best_model.predict(HOGs_test)
# compute stats and print them
accuracy, precision, recall = compute_scores(Cs_predicted, Cs_test, verbose=True)

# post online if specified in the config.ini
if config.getboolean('Test','post_online'):
    print("[INFO] Posting accuracy as " + str(config.get('Test','posting_user')) + " to: "+ str(config.get('Test','posting_form_name')))
    postAccuracy(user=str(config.get('Test','posting_user')), 
                 value=accuracy, 
                 spreadsheetname=str(config.get('Test','posting_form_name')))
    
# also do some visual check
if image_list_test is not None:
    samples_number = 10
    samples_idx = np.random.choice(len(Cs_test), samples_number, replace=False)
    for idx in samples_idx:
        print(image_list_test[idx])
        img = io.imread(image_list_test[idx])
        hog = HOGs_test[idx,:]
        c = Cs_test[idx]
        # predict expects a 2D array (n_samples, n_features), so reshape the single sample
        pred_c = SVM_best_model.predict(hog.reshape(1, -1))
        print("True class: " + str(c) + ". Predicted class: " + str(pred_c[0]))
        sys.stdout.flush()
        io.imshow(img)
        io.show()
        