# Early testing pipeline

This pipeline compares alternative feature sets against a baseline.

The baseline is the feature set provided for the first phase. Each alternative `extract_features` function is written in its own .py file and imported here.

The features tested so far are:
- Dropping the z-coordinates and using only the xy-coordinates.
- Movement vectors between frames.
- Frame averaging.
- Transforming the coordinates to polar form.
- Splitting frames into regions of interest (ROIs), where each region is a different body part.
- Layered average.
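
A few of the transformations above can be sketched with plain NumPy. This is only an illustration, not the code in `cartesian.py` or `polar.py`: the function names are hypothetical, and the sketch assumes a pose sequence is an array of shape `(n_frames, n_keypoints, 3)` and that movement vectors are taken between consecutive frames.

```python
import numpy as np

def drop_z(poses):
    """Keep only the x and y coordinates of every keypoint."""
    return poses[..., :2]

def movement_vectors(poses):
    """Displacement of every keypoint between consecutive frames (assumption)."""
    return np.diff(poses, axis=0)

def average_frames(poses, window=2):
    """Average non-overlapping groups of `window` frames; leftover frames are dropped."""
    n_frames = (poses.shape[0] // window) * window
    grouped = poses[:n_frames].reshape(-1, window, *poses.shape[1:])
    return grouped.mean(axis=1)

def to_polar(poses_xy):
    """Convert (x, y) pairs to (radius, angle in radians)."""
    x, y = poses_xy[..., 0], poses_xy[..., 1]
    return np.stack([np.hypot(x, y), np.arctan2(y, x)], axis=-1)

# Toy sequence: 4 frames, 2 keypoints, 3 coordinates (assumed layout).
toy = np.arange(24, dtype=float).reshape(4, 2, 3)
xy = drop_z(toy)
print(xy.shape)
print(movement_vectors(xy).shape)
print(average_frames(xy, window=2).shape)
print(to_polar(xy).shape)
```

Flattening the result of any of these functions gives one feature vector per sample, which is the form the rest of the notebook expects.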

# Evaluation of features

The baseline feature set achieves the following scores:
- train accuracy: 81%
- cross-validation accuracy: 62%

For a quick overview of the tested features:
- cartesian average (2 frames average):
    - train accuracy: 84%
    - cross-validation accuracy: 66%

- polar coordinates (2 frames average):
    - train accuracy: 81%
    - cross-validation accuracy: 65%

## 1. Loading data

In [1]:
import sklearn
import numpy as np
import csv
import pickle
import time
import os

%matplotlib notebook

from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedGroupKFold

import utils_for_students

In [2]:
# Load the sample descriptions (path, label, signer, id) for both sets.
train_samples = utils_for_students.load_dataset_stage2('data/stage2_labels_train.csv', 'train')
test_samples = utils_for_students.load_dataset_stage2('data/stage2_ids_test.csv', 'test')

### Change only the following cell!!!

In [3]:
# Change the import below to test a different feature extraction module.

#from baseline import extract_features
#from cartesian import extract_features
#from polar import extract_features
from layered_avg import extract_features

In [4]:
# Concatenate the training set features.
X_train = []
y_train = []
signers_train = []
for sample in train_samples:
    pose_sequence = utils_for_students.load_sample_stage2(os.path.join('data/stage2/train/', sample['path']))
    X_train.append(extract_features(pose_sequence, s=10, m=10))
    y_train.append(sample['label'])
    signers_train.append(sample['signer'])
    
# Concatenate the test set features.
X_test = []
test_ids = []
for sample in test_samples:
    pose_sequence = utils_for_students.load_sample_stage2(os.path.join('data/stage2/test/', sample['path']))
    X_test.append(extract_features(pose_sequence, s=10, m=10))
    test_ids.append(sample['id'])

# Stack the per-sample feature vectors into NumPy arrays.
X_train = np.stack(X_train)
X_test = np.stack(X_test)

# Encode the labels as integers
label_encoder = utils_for_students.label_encoder()
y_train = label_encoder.transform(y_train)

## 2. Feature extraction

In [5]:
print(X_train.shape)
print(X_test.shape)

(2191, 3750)
(541, 3750)


## 3. Pipeline

In [6]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer
from sklearn.feature_selection import SelectFwe, SelectFromModel
from sklearn.linear_model import RidgeClassifier, LogisticRegression
from sklearn.decomposition import PCA

# PCA was chosen over LDA because LDA's n_components must be <= min(n_classes - 1, n_features), which is 14 in
# this case (down from 750). That is likely too little, given that the features are only x, y or z values.
preprocessing = Pipeline([
    ('scaler', StandardScaler()),
    ('decompose', PCA()),
    ('rescaler', StandardScaler())
                        ]) 

#TODO: define feature selection pipeline here
# SelectFromModel assigns weights to the features and drops the least important ones to improve generalization.
# It uses a linear model only (logistic regression, the same model as the actual classifier).
# SelectFwe then removes the features that might lead to false results (family-wise error test).
feature_selection = Pipeline([
    ('selectFromModel', SelectFromModel(LogisticRegression(C=1.0e-5, max_iter=10000))),
    ('familyWiseError', SelectFwe())
                            ]) 
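
The component limit mentioned in the PCA comment can be made concrete with a small sketch. It runs on synthetic data rather than on `X_train`; the array `X_demo` and the 95% variance target are assumptions chosen for illustration, and `n_classes = 15` is inferred from the bound of 14 stated above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 50))  # synthetic stand-in for the real features
n_classes = 15                       # inferred from the bound of 14 above

# LDA can keep at most min(n_classes - 1, n_features) components.
lda_max_components = min(n_classes - 1, X_demo.shape[1])

# PCA has no class-based limit; count the components needed for 95% variance.
pca = PCA().fit(StandardScaler().fit_transform(X_demo))
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print('LDA upper bound:', lda_max_components)
print('PCA components for 95% variance:', n_for_95)
```

Running the same count on the real training features would show how much variance a 14-component projection would have to discard.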

In [7]:
classifier = LogisticRegression(fit_intercept=False, max_iter=1000)

In [8]:
# The parameter grid is fixed to single values because hyperparameter tuning is not the subject of this notebook.
param_grid = {
    'feature_selection__familyWiseError__alpha' : [0.75],
    'feature_selection__selectFromModel__threshold': ["1.25*median"],
    'classifier__C': [1.0e-6],
    'classifier__tol': [1.0e-4],
    'classifier__class_weight': [None]
 }
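
The double-underscore keys in the grid address parameters of nested pipeline steps. A minimal sketch, assuming a pipeline with the same step names as this notebook, checks the keys against `get_params()` before a grid search is started:

```python
from sklearn.feature_selection import SelectFwe, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Rebuild only the steps that the grid refers to (same step names as above).
inner = Pipeline([
    ('selectFromModel', SelectFromModel(LogisticRegression())),
    ('familyWiseError', SelectFwe()),
])
outer = Pipeline([
    ('feature_selection', inner),
    ('classifier', LogisticRegression()),
])

grid_keys = [
    'feature_selection__familyWiseError__alpha',
    'feature_selection__selectFromModel__threshold',
    'classifier__C',
    'classifier__tol',
    'classifier__class_weight',
]

# Every grid key must be a valid (nested) parameter name of the pipeline.
valid_params = outer.get_params(deep=True)
missing = [key for key in grid_keys if key not in valid_params]
print('Unknown parameters:', missing)
```

A misspelled step name would show up in `missing` here instead of failing later inside `GridSearchCV.fit`.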

In [9]:
n_folds = 4

# The function below is just an example!
#TODO: write a better split function here?
# Split by signer so that no signer appears in both the training and the validation part of a fold.
def create_folds(X, y, n_folds):
    # Use the arguments X and y rather than the globals X_train and y_train;
    # the signer groups still come from the global signers_train.
    folds = []
    cv_object = StratifiedGroupKFold(n_splits=n_folds)
    for train_indices, val_indices in cv_object.split(X, y, groups=signers_train):
        folds.append((train_indices, val_indices))
    return folds

## 4. Training model

In [10]:
pipeline = Pipeline([
    ('preprocessing', preprocessing),
    ('feature_selection', feature_selection),
    ('classifier', classifier)])

folds = create_folds(X_train,y_train,n_folds)
assert isinstance(folds, list), 'Folds must be a list of (train_indices, val_indices) tuples'

# train model
cv = GridSearchCV(pipeline, param_grid, n_jobs=4, cv=folds, verbose=1, return_train_score=True, refit=True)
cv.fit(X_train, y_train)
    
prediction = utils_for_students.label_encoder().inverse_transform(cv.best_estimator_.predict(X_test))

Fitting 4 folds for each of 1 candidates, totalling 4 fits


## 5. Printing scores

In [11]:
results = cv.cv_results_
mean_train_score = results['mean_train_score'][cv.best_index_]
std_train_score = results['std_train_score'][cv.best_index_]
mean_cv_score = results['mean_test_score'][cv.best_index_]
std_cv_score = results['std_test_score'][cv.best_index_]

print('Training accuracy {} +/- {}'.format(mean_train_score, std_train_score))
print('Cross-validation accuracy: {} +/- {}'.format(mean_cv_score, std_cv_score))

print('Best estimator:')
print(cv.best_estimator_)

Training accuracy 0.7882271562598058 +/- 0.023033425867872954
Cross-validation accuracy: 0.5491564890019562 +/- 0.03661186114151486
Best estimator:
Pipeline(steps=[('preprocessing',
                 Pipeline(steps=[('scaler', StandardScaler()),
                                 ('decompose', PCA()),
                                 ('rescaler', StandardScaler())])),
                ('feature_selection',
                 Pipeline(steps=[('selectFromModel',
                                  SelectFromModel(estimator=LogisticRegression(C=1e-05,
                                                                               max_iter=10000),
                                                  threshold='1.25*median')),
                                 ('familyWiseError', SelectFwe(alpha=0.75))])),
                ('classifier',
                 LogisticRegression(C=1e-06, fit_intercept=False,
                                    max_iter=1000))])
