# Early testing pipeline

This pipeline tests different feature sets against a baseline. <br>
<br>
The baseline is the feature set given for the first phase. Each `extract_features` variant is written in its own .py file and imported here. <br>
The features tested so far are:
- Dropping the z-coordinates and using only the xy-coordinates.
- Movement vectors between consecutive frames.
- Frame averaging.
- Transforming the coordinates to polar.
- Separating frames into different ROIs (regions correspond to different body parts).
- Layered average.

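A few of the transformations listed above can be sketched with plain NumPy. This is only an illustration, not the code in `cartesian.py` or `polar.py`: it assumes a pose sequence is an array of shape `(n_frames, n_keypoints, 3)`, and the helper names `drop_z`, `average_frames`, `to_polar` and `movement_vectors` are hypothetical.

```python
import numpy as np

def drop_z(pose_sequence):
    """Keep only the x and y coordinates of every keypoint."""
    return pose_sequence[..., :2]

def average_frames(pose_sequence, window=2):
    """Average consecutive, non-overlapping groups of `window` frames."""
    n_frames = (pose_sequence.shape[0] // window) * window  # drop the remainder
    trimmed = pose_sequence[:n_frames]
    grouped = trimmed.reshape(-1, window, *pose_sequence.shape[1:])
    return grouped.mean(axis=1)

def to_polar(xy):
    """Convert (x, y) pairs to (radius, angle in radians)."""
    radius = np.hypot(xy[..., 0], xy[..., 1])
    angle = np.arctan2(xy[..., 1], xy[..., 0])
    return np.stack([radius, angle], axis=-1)

def movement_vectors(pose_sequence):
    """Difference of every keypoint between consecutive frames."""
    return np.diff(pose_sequence, axis=0)

# Hypothetical input: 4 frames, 2 keypoints, (x, y, z) for each keypoint.
pose_sequence = np.arange(24, dtype=float).reshape(4, 2, 3)

xy = drop_z(pose_sequence)        # shape (4, 2, 2)
averaged = average_frames(xy)     # shape (2, 2, 2)
polar = to_polar(averaged)        # shape (2, 2, 2)
features = polar.reshape(-1)      # flat feature vector of length 8
print(features.shape)
```

Flattening at the end produces one fixed-length vector per sample, which is the shape the scikit-learn pipeline further down expects.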
# Evaluation of features

The baseline feature set achieves the following scores:
- train accuracy: 81%
- cross-validation accuracy: 62%

For a quick overview of the tested features:
- cartesian average (2-frame average):
    - train accuracy: 84%
    - cross-validation accuracy: 66%

- polar coordinates (2-frame average):
    - train accuracy: 81%
    - cross-validation accuracy: 65%

## 1. Loading data

In [14]:
import sklearn
import numpy as np
import csv
import pickle
import time
import os

%matplotlib notebook

from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedGroupKFold

import utils_for_students

In [15]:
train_samples = []
test_samples = []

train_samples = utils_for_students.load_dataset_stage2('data/stage2_labels_train.csv', 'train')
test_samples = utils_for_students.load_dataset_stage2('data/stage2_ids_test.csv', 'test')

### Change only the following cell!!!

In [259]:
# Change the module imported below to test a different feature extraction

#from baseline import extract_features
from cartesian import extract_features
#from polar import extract_features
#from layered_avg import extract_features
#from vectors import extract_features
#from frame_avg import extract_features

In [260]:
# Concatenate the training set features.
X_train = []
y_train = []
signers_train = []
for sample in train_samples:
    pose_sequence = utils_for_students.load_sample_stage2(os.path.join('data/stage2/train/', sample['path']))
    #we will still need a workaround for this.
    X_train.append(extract_features(pose_sequence))
    y_train.append(sample['label'])
    signers_train.append(sample['signer'])
    
# Concatenate the test set features.
X_test = []
test_ids = []
for sample in test_samples:
    pose_sequence = utils_for_students.load_sample_stage2(os.path.join('data/stage2/test/', sample['path']))
    X_test.append(extract_features(pose_sequence))
    test_ids.append(sample['id'])

#Combining to numpy array
X_train = np.stack(X_train)
X_test = np.stack(X_test)

# Encode the labels as integers
label_encoder = utils_for_students.label_encoder()
y_train = label_encoder.transform(y_train)

## 2. Feature extraction

In [261]:
print(X_train.shape)
print(X_test.shape)

(2191, 750)
(541, 750)


## 3. Pipeline

In [520]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer
from sklearn.feature_selection import SelectFwe, SelectFromModel, SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier, LogisticRegression
from sklearn.decomposition import PCA

# PCA is chosen instead of LDA because the n_components of LDA has to be <= min(n_classes - 1, n_features), which is 14 in
# this case (down from 750). That is likely too little, given that the features are only x, y or z values.
preprocessing = Pipeline([
    ('scaler', StandardScaler()),
    ('decompose', PCA())
                        ]) 

#TODO: define the full feature selection pipeline here
# Plan: first remove the features that might lead to false results,
# then use SelectFromModel to assign weights and drop the least important features for better generalization,
# of course only with linear models (the same kind of model as the actual classifier).
# For now only SelectKBest is applied.
feature_selection = Pipeline([
    ('selectKBest', SelectKBest())
                            ]) 

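The bound on LDA mentioned in the comment above can be checked on synthetic data. The sketch below is an illustration under assumptions: it uses random data, 15 classes (which matches the bound of 14 stated above) and a hypothetical 50 features instead of the real 750.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_classes, n_features = 15, 50          # assumed sizes, for illustration only
X = rng.normal(size=(300, n_features))
y = np.arange(300) % n_classes          # 20 samples per class

# With n_components=None, LDA keeps min(n_classes - 1, n_features) components.
lda = LinearDiscriminantAnalysis()
X_lda = lda.fit_transform(X, y)

# PCA is unsupervised, so the number of classes does not limit n_components.
pca = PCA(n_components=30)
X_pca = pca.fit_transform(X)

print(X_lda.shape)  # at most 14 columns
print(X_pca.shape)  # 30 columns
```

With PCA the number of retained dimensions stays a free choice, which is why the later `SelectKBest` step can still keep far more than 14 components.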
In [532]:
classifier = RidgeClassifier(fit_intercept=False)

In [533]:
#param grid has been set to some constants because this is not subject of optimization in this notebook
param_grid = {
    'feature_selection__selectKBest__k': [180],#[100, 120, 140, 145, 150, 155, 160, 180, 200,220,240,260,280,300],
    'classifier__alpha': [1.0],#[1.0e10, 100000, 10000, 1000, 100, 10, 1.0, 1.0e-2, 1.0e-4, 1.0e-6],
    'classifier__tol': [ 1.0e-4],# 1.0e-2, 1.0e-1],
    'classifier__class_weight': [None],#['balanced', None]
 }

In [534]:
n_folds = 4

# The function below is just an example!
#TODO: write a better split function here?
# Split by signer so that no signer appears in both the training and the validation part of a fold
def create_folds(X, y, n_folds):
    folds = []
    cv_object = StratifiedGroupKFold(n_splits=n_folds)
    # use the arguments X and y instead of the globals; the signer groups still come from signers_train
    for (train_indices, val_indices) in cv_object.split(X, y, groups=signers_train):
        folds.append((train_indices, val_indices))
    return folds

## 4. Training model

In [535]:
pipeline = Pipeline([
    ('preprocessing', preprocessing),
    ('feature_selection', feature_selection),
    ('classifier', classifier)])

folds = create_folds(X_train,y_train,n_folds)
assert isinstance(folds,list),'Folds must be a list of (train_indices, val_indices) tuples'

# train model
cv = GridSearchCV(pipeline, param_grid, n_jobs=4, cv=folds, verbose=1, return_train_score=True, refit=True)
cv.fit(X_train, y_train)
    
prediction = utils_for_students.label_encoder().inverse_transform(cv.best_estimator_.predict(X_test))

Fitting 4 folds for each of 1 candidates, totalling 4 fits


## 5. Printing scores

In [536]:
results = cv.cv_results_
mean_train_score = results['mean_train_score'][cv.best_index_]
std_train_score = results['std_train_score'][cv.best_index_]
mean_cv_score = results['mean_test_score'][cv.best_index_]
std_cv_score = results['std_test_score'][cv.best_index_]

print('Training accuracy {} +/- {}'.format(mean_train_score, std_train_score))
print('Cross-validation accuracy: {} +/- {}'.format(mean_cv_score, std_cv_score))

print('Best estimator:')
print(cv.best_estimator_)

Training accuracy 0.85433530343369 +/- 0.012195652256717832
Cross-validation accuracy: 0.6862519631975857 +/- 0.05648913480202379
Best estimator:
Pipeline(steps=[('preprocessing',
                 Pipeline(steps=[('scaler', StandardScaler()),
                                 ('decompose', PCA())])),
                ('feature_selection',
                 Pipeline(steps=[('selectKBest', SelectKBest(k=180))])),
                ('classifier',
                 RidgeClassifier(fit_intercept=False, tol=0.0001))])


In [537]:
print(cv.best_estimator_.named_steps['classifier'].n_features_in_)

180


In [538]:
# use for visualizing certain evolutions

import matplotlib.pyplot as plt

print("Best parameters set found on development set: ",cv.best_params_)
# store the best optimization parameter for later reuse
bestC2 = cv.best_params_['classifier__alpha']

print("Grid scores on training data set:")
print()
cv_means = cv.cv_results_['mean_test_score']
cv_stds = cv.cv_results_['std_test_score']

train_means = cv.cv_results_['mean_train_score']
train_stds = cv.cv_results_['std_train_score']

# Take the x-axis from the grid that was actually searched, so that it always has
# as many values as there are scores (a hard-coded range raises a ValueError
# as soon as its length differs from the number of candidates).
# Note: this assumes classifier__alpha is the only parameter that varies in param_grid.
C_range = np.asarray(cv.cv_results_['param_classifier__alpha'], dtype=float)
plt.figure()
# markers are needed to see anything when the grid holds a single candidate
plt.plot(np.log10(C_range),train_means,'go-',label="train")
plt.plot(np.log10(C_range),cv_means,'ro-',label="validate")
plt.xlabel("log10(classifier__alpha)")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

Best parameters set found on development set:  {'classifier__alpha': 1.0, 'classifier__class_weight': None, 'classifier__tol': 0.0001, 'feature_selection__selectKBest__k': 180}
Grid scores on training data set:

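A plot of scores against a hyperparameter needs exactly one x-value per grid candidate. The sketch below shows, on synthetic data and with a hypothetical three-value grid, how those x-values can be read from `cv_results_` so that they always line up with the score arrays.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = np.arange(120) % 3                  # 3 balanced classes

alphas = [1e-2, 1.0, 1e2]               # hypothetical grid, for illustration only
search = GridSearchCV(
    Pipeline([('classifier', RidgeClassifier())]),
    {'classifier__alpha': alphas},
    cv=3,
    return_train_score=True,
)
search.fit(X, y)

# One entry per candidate, in the same order as the score arrays.
x_values = np.log10(np.asarray(search.cv_results_['param_classifier__alpha'], dtype=float))
train_means = search.cv_results_['mean_train_score']
cv_means = search.cv_results_['mean_test_score']
print(x_values.shape, train_means.shape, cv_means.shape)
```

This only gives a meaningful curve when a single parameter varies in the grid. With several varying parameters, the candidates would first have to be filtered or grouped.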
## 6. Make submission

In [539]:
# your data, used to name the output file
student_id = "01508031"
student_lastname = "Bruyland"
student_firstname = "Simeon"

# change this if you would like your submission outputfile to have a more detailed name, e.g. submission_with_special_preprocessing 
submission_prefix='submission'

# whether or not you want your created models and submissions versioned using timestamps
# (setting this to False will overwrite previously exported model and submission files of the same name)
use_timestamps = True

In [540]:
# write out model
# make sure the student data is filled in so that the file gets a descriptive name
assert student_id is not None and student_lastname is not None and student_firstname is not None, 'Please fill in your Name and Student Id'

submission_dirname = 'submission'
if use_timestamps:
    timestamp = time.strftime("%Y%m%d-%H%M%S", time.localtime())
    filename_model = os.path.join(submission_dirname,f'stage2_model_{student_id}_{student_lastname}_{student_firstname}_{timestamp}.pkl')
    filename_submission =  os.path.join(submission_dirname,f'stage2_{submission_prefix}_{student_id}_{student_lastname}_{student_firstname}_{timestamp}.csv')
else:
    filename_model = os.path.join(submission_dirname,f'stage2_model_{student_id}_{student_lastname}_{student_firstname}.pkl')
    filename_submission =  os.path.join(submission_dirname,f'stage2_{submission_prefix}_{student_id}_{student_lastname}_{student_firstname}.csv')

# create the output directory if it does not exist yet
os.makedirs(submission_dirname, exist_ok=True)

with open(filename_model,'wb') as file:
    pickle.dump(cv,file)
    
prediction = label_encoder.inverse_transform(cv.best_estimator_.predict(X_test))
utils_for_students.create_submission_file(filename_submission, test_ids, prediction)