# Getting Started Notebook

This notebook illustrates how to use the datasets provided to you. You will need it to submit your first model.

## 0. Dependencies

This notebook requires several Python **3** packages, all of which are included in Anaconda 3 for Python 3.8, the Python distribution we recommend you use throughout this course.
The package versions listed below have been used for testing and are confirmed to work well.
We strongly recommend that you install these specific versions, so that this notebook (and the associated autograding) works as expected and we can offer you optimal support throughout the competition.

python: 3.8

scikit-learn: 1.0.0

numpy: 1.20.1

In [1]:
!pip install scikit-learn==1.0.0 numpy==1.20.1
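To check which versions ended up in your environment, something like the following sketch can help. It only uses the standard library (`importlib.metadata`, available from Python 3.8); the helper names `installed_version` and `report_versions` are made up for this illustration and are not part of the course material.

```python
from importlib import metadata

# Versions this notebook was tested with (taken from the list above).
TESTED_VERSIONS = {"scikit-learn": "1.0.0", "numpy": "1.20.1"}


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def report_versions(tested=TESTED_VERSIONS):
    """Build one status line per package, comparing installed and tested versions."""
    lines = []
    for name, wanted in tested.items():
        found = installed_version(name)
        if found is None:
            status = "not installed"
        elif found == wanted:
            status = "matches"
        else:
            status = "differs"
        lines.append(f"{name}: installed={found}, tested={wanted} ({status})")
    return lines


for line in report_versions():
    print(line)
```

A "differs" line does not necessarily mean the notebook will fail, only that you are outside the tested configuration.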



## 1. How to use this Notebook

This Notebook is intended as an introduction to the course. It shows you how to load the data and guides you towards training your first model in scikit-learn and making a submission on Kaggle and Ufora.
Throughout the notebook there are several portions marked with **Action required** where you will be asked to complete missing parts in order to finish this assignment.
After you have completed all necessary steps, this notebook will generate a CSV (.csv) file to be submitted on the [Kaggle competition page](https://www.kaggle.com/c/ugentml21-slc-1/).

Furthermore, your model will be saved in a Pickle (.pkl) file, which you have to submit to [Ufora](https://ufora.ugent.be/d2l/home/446146) together with this filled-out Notebook.
 
**In order to ensure your submission will be suitable for our autograding system, some parts of this notebook have been locked and are not editable in order to avoid students editing them by mistake.**

**Please do not unlock these cells and edit them on purpose, as this might break our autograding system.
Since it is not feasible for us to grade >130 assignments by hand, submissions that cannot be autograded will generally be graded with 0 points.**




Please fill in your personal data in the fields below. This will not be used during grading; it only gives your submission and model files a meaningful name. You may also change the prefix of the submission CSV file or append timestamps to your saved models in order to keep them apart when trying out several things with this notebook.

In [2]:
# your data, used to name the output file
student_id = "01703327"
student_lastname = "Vercouter" 
student_firstname = "Ward"

# change this if you would like your submission outputfile to have a more detailed name, e.g. submission_with_special_preprocessing 
submission_prefix='submission'

# whether or not you want your created models and submissions versioned using timestamps
# (setting this to False will overwrite previously exported model and submission files of the same name)
use_timestamps = True



## 2. Loading the data

The dataset contains videos of people signing in Flemish Sign Language (Vlaamse Gebarentaal). It consists of 15 classes corresponding to lexical signs. From these videos, 3D keypoints were extracted using MediaPipe Holistic. In total, there are 125 keypoints, resulting in 375 (=3x125) floating point values per video frame.

For this first stage though, in order for you to focus on building a proper pipeline, we have precomputed a set of simple features, ready for you to use. These are the time averages of all keypoint coordinates over the first and the second half of the sample frames, so 750 features in total.

We start by importing the libraries we need:
- sklearn and numpy to do machine learning,
- csv and pickle to read the data and write out submission and model files,
- time and os to keep the files we output organized.

We also import some specific sklearn components as well as a utils library with some handy extra functions.

In [3]:
import sklearn
import numpy as np
import csv
import pickle
import time
import os

%matplotlib notebook

from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, GridSearchCV

import utils_for_students


We can use our utils_for_students library to load the data from disk. Remember to put the [unzipped files from the competition page](https://www.kaggle.com/c/ugentml21-slc-1/data) into the right paths on your filesystem.

In [None]:

train_samples = []
test_samples = []

train_samples = utils_for_students.load_dataset_stage1('data/stage1_features/train.csv', 'train')
test_samples = utils_for_students.load_dataset_stage1('data/stage1_features/test.csv', 'test')

For our training data, we get a list of Python dictionaries. Each dictionary corresponds to one sign language clip and contains its feature vector (which we want to present to the model), its label (the intended output of our model), and the person who is signing in the clip (think about why this information could be useful).

In [None]:
print(train_samples[:3])
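Beyond printing a few raw samples, it can help to count how many clips each signer and each class contributes. The sketch below uses a handful of made-up samples with the same keys the notebook reads later (`features`, `label`, `signer`); the values and the helper `count_by_key` are purely illustrative.

```python
from collections import Counter

# Hypothetical toy samples mimicking the structure described above.
# The real list is returned by utils_for_students.load_dataset_stage1.
toy_samples = [
    {"features": [0.1, 0.2], "label": 0, "signer": 1},
    {"features": [0.3, 0.4], "label": 1, "signer": 1},
    {"features": [0.5, 0.6], "label": 0, "signer": 2},
]


def count_by_key(samples, key):
    """Count how often each value of `key` occurs across the sample dictionaries."""
    return Counter(sample[key] for sample in samples)


clips_per_signer = count_by_key(toy_samples, "signer")
clips_per_label = count_by_key(toy_samples, "label")

print("clips per signer:", dict(clips_per_signer))
print("clips per label:", dict(clips_per_label))
```

Running the same helper on `train_samples` should show how unevenly (or evenly) signers and classes are represented.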

For our test data, we only receive features, no labels, as the model is supposed to infer them. There is also no signer information in the test data: since your model is expected to generalise to unseen signers, it should also not use signer identity.

In [None]:
print(test_samples[:3])

As a next step, we collect the features and labels into lists, and we also keep all our signers in a list. They might come in handy, who knows?

In [None]:
# Concatenate the training set features.
X_train = []
y_train = []
signers_train = []
for sample in train_samples:
    X_train.append(sample['features'])
    y_train.append(sample['label'])
    signers_train.append(sample['signer'])
    
# Concatenate the test set features.
X_test = []
test_ids = []
for sample in test_samples:
    X_test.append(sample['features'])

#Combining to numpy array
X_train = np.stack(X_train)
X_test = np.stack(X_test)

## 3. Feature Extraction

For stage 1, we have performed feature extraction for you: the matrices constructed in the previous cell already contain the extracted features.

We extracted these features by splitting every sequence of extracted 3D keypoints from the sign language video into 2 segments of equal duration (the first and the second half of the frames).
Then, for each segment, we computed the average position of each keypoint (375 values). The result is 2 x 375 = 750 features per sample.
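To make the feature definition concrete, here is a minimal sketch of how such time averages could be computed with numpy. It assumes a clip is stored as an array of shape `(n_frames, 125, 3)` and that the averages cover the first and second half of the frames (2 x 375 = 750 values). The function name and the exact ordering of the output values are assumptions, so the result will not necessarily match the provided CSV files column by column.

```python
import numpy as np


def half_average_features(keypoints):
    """Average keypoint coordinates over the first and second half of a clip.

    keypoints: array of shape (n_frames, 125, 3) -> returns a vector of length 750.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    n_frames = keypoints.shape[0]
    if n_frames < 2:
        raise ValueError("need at least 2 frames to form two halves")
    # Flatten the 125 keypoints x 3 coordinates into 375 values per frame.
    flat = keypoints.reshape(n_frames, -1)
    # Split along the time axis; with an odd frame count the first half is one frame longer.
    first_half, second_half = np.array_split(flat, 2, axis=0)
    return np.concatenate([first_half.mean(axis=0), second_half.mean(axis=0)])


# Random stand-in for one clip of 10 frames (not real keypoint data).
rng = np.random.default_rng(0)
toy_clip = rng.normal(size=(10, 125, 3))
toy_features = half_average_features(toy_clip)
print(toy_features.shape)
```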


In [None]:
print(X_train.shape)
print(X_test.shape)

In [None]:
import pandas as pd

df = pd.DataFrame(X_train)

In [None]:
df.describe()

In [None]:
(unique, counts) = np.unique(y_train, return_counts=True)
frequencies = np.asarray((unique, counts)).T
print(frequencies)

## 4. Action required: Creating pipelines for preprocessing and feature selection 
Now that we have loaded train and test features, we need to define pipeline steps to preprocess our data and select good features for our model. While in later stages of the competition you will be free to train models with sklearn according to your preferred coding style, for now we would like you to strictly adhere to our predefined structure using the [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline) and [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html?highlight=gridsearchcv#sklearn.model_selection.GridSearchCV) classes of scikit-learn.

In a pipeline object you list all sklearn modules which you would like to be applied one by one to your features. This pipeline object is then handed to `GridSearchCV` in order to find good hyperparameters. It will also be your job to decide which hyperparameters need to be optimised and which values for each parameter need to be explored.

Let's start with the pipeline though. In this assignment, we ask you to identify two sub-pipelines with fixed names: one for preprocessing and one for feature selection. 

Each of these takes a list of (\[name\], \[value\]) tuples, where \[name\] is the name of the module and \[value\] is the corresponding sklearn object. As you will see a bit later in the code below, it is possible to construct a pipeline out of other pipelines. This is exactly how we will combine our preprocessing and feature selection pipelines, together with the model, into the final pipeline later.
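As an illustration of nesting, the sketch below builds two small sub-pipelines and combines them with a classifier. The modules chosen here (`StandardScaler`, `VarianceThreshold`) and the `demo_` variable names are arbitrary placeholders for this example, not a recommendation for the assignment.

```python
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each sub-pipeline is a list of (name, sklearn object) tuples.
demo_preprocessing = Pipeline([("scale", StandardScaler())])
demo_feature_selection = Pipeline([("variance", VarianceThreshold())])

# A pipeline can itself be a step of another pipeline.
demo_pipeline = Pipeline([
    ("preprocessing", demo_preprocessing),
    ("feature_selection", demo_feature_selection),
    ("classifier", LogisticRegression()),
])

# Steps are reachable by name, also through the nesting levels.
inner_scaler = demo_pipeline.named_steps["preprocessing"].named_steps["scale"]
print(type(inner_scaler).__name__)

# Parameters of nested modules show up under double-underscore names.
nested_names = sorted(n for n in demo_pipeline.get_params() if n.count("__") == 2)
print(nested_names)
```

The names printed at the end are the ones a hyperparameter grid would have to use for the modules inside the sub-pipelines.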

Feel free to read ahead to step 8 to see how the preprocessing and feature selection pipelines as well as the classifier are used in combination with `GridSearchCV`, to get a more comprehensive picture of how these fit together.

For possible candidates for preprocessing and feature selection modules, see [sklearn.preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) and [sklearn.feature_selection](https://scikit-learn.org/stable/modules/classes.html?highlight=feature_selection#module-sklearn.feature_selection).

**Warning: these are not exhaustive lists. There may be modules in other namespaces suitable for preprocessing, as well as modules in these namespaces unsuitable for the task at hand.**

**Warning: pipeline modules often have hyperparameters. Always read the documentation carefully to decide whether or not it makes sense to optimise them.**

In [None]:
# TODO: define preprocessing pipeline here
# It is up to you to define the number of modules in each pipeline and their types
# Choose meaningful names for your modules
# DO NOT change the names of the pipelines themselves (i.e., "preprocessing" and "feature_selection")

from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

preprocessing = Pipeline([('scale',MinMaxScaler())]) 
#TODO: define feature selection pipeline here
feature_selection = Pipeline([('selector',SelectKBest(chi2))]) 
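One thing worth checking when pairing modules is whether the output of one step is valid input for the next. For instance, the `chi2` score function only accepts non-negative feature values, which is one reason a scaler such as `MinMaxScaler` can sensibly precede it. The sketch below demonstrates this on random toy data (not the course data); the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

# Random toy data containing negative values (not the course data).
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(20, 4))
y_demo = np.array([0, 1] * 10)

# chi2 rejects input with negative entries by raising a ValueError.
try:
    chi2(X_raw, y_demo)
    raw_accepted = True
except ValueError:
    raw_accepted = False

# After min-max scaling, every feature lies in [0, 1] and chi2 can be computed.
X_scaled = MinMaxScaler().fit_transform(X_raw)
chi2_scores, p_values = chi2(X_scaled, y_demo)

print("raw data accepted by chi2:", raw_accepted)
print("number of chi2 scores on scaled data:", len(chi2_scores))
```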

## 5. Action required: define suitable classifier
With your preprocessing and feature selection in place, it is now time to define the final element: a suitable linear classifier.
See [sklearn.linear_model](https://scikit-learn.org/stable/modules/classes.html?highlight=feature_selection#module-sklearn.linear_model) for models and their interfaces.

In [None]:
from sklearn.linear_model import LogisticRegression

#TODO: define proper classifier
classifier = LogisticRegression(class_weight='balanced',solver='lbfgs', multi_class='multinomial')

## 6. Action required: Set up hyperparameter grid for [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html?highlight=gridsearchcv#sklearn.model_selection.GridSearchCV) object.

`GridSearchCV` takes hyperparameter lists as a dictionary, where each key is the fully qualified name of a hyperparameter in the Pipeline and each value is a list of hyperparameter values to be evaluated.

In the sklearn example notebook, we used a single pipeline and addressed the parameter(s) in that pipeline using `<component_name>__<parameter_name>` (note the double underscore). Here, we extend the example from that notebook to show how it looks for two tuned parameters:

`tuned_parameters = [{'logreg__C': [0.0001, 0.001, 0.01, 0.1, 1.0], 'logreg__class_weight': ['balanced', None]}]`

In the current notebook, we are using separate pipelines for preprocessing and feature selection and combine these with the model into a final pipeline. For the first two, the parameter names need to be extended to `<pipeline_name>__<component_name>__<parameter_name>`. 
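A quick way to catch misspelled parameter names before starting a long grid search is to compare the grid keys with the names reported by `get_params()`. The sketch below does this for a demo pipeline; the helper `unknown_grid_keys`, the `demo_` names and the grid values are illustrative assumptions, not part of the assignment.

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

demo_search_pipeline = Pipeline([
    ("preprocessing", Pipeline([("scale", MinMaxScaler())])),
    ("feature_selection", Pipeline([("selector", SelectKBest(chi2))])),
    ("classifier", LogisticRegression()),
])

# Sub-pipeline parameters need three name parts, classifier parameters only two.
demo_grid = {
    "preprocessing__scale__feature_range": [(0, 1), (0, 2)],
    "feature_selection__selector__k": [10, 20],
    "classifier__C": [0.1, 1.0],
}


def unknown_grid_keys(estimator, grid):
    """Return the grid keys that the estimator does not expose as parameters."""
    valid_names = estimator.get_params(deep=True)
    return sorted(key for key in grid if key not in valid_names)


print(unknown_grid_keys(demo_search_pipeline, demo_grid))
# A key that omits the sub-pipeline name is reported as unknown.
print(unknown_grid_keys(demo_search_pipeline, {"selector__k": [10]}))
```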

The field below shows how this could look if your preprocessing and feature selection pipelines consist of 2 modules each and you tune two parameters in each of those pipelines + two parameters for the classifier. Now you need to adapt this to what you decided for your pipeline.

If this still seems confusing to you, feel free to read ahead to step 8 to see how the preprocessing and feature selection pipelines as well as the classifier are used in combination with `GridSearchCV`, which should clear things up.


In [None]:
param_grid = {
                'feature_selection__selector__k' : [490,500,510],
                'classifier__C' : [9*10**4,10*10**4,11*10**4],
                'classifier__max_iter' : [3000,4000,5000]
}

## 7. Action required: Define the number of cross-validation folds and how to split

Time to fix the cross-validation parameters. Indicate how many cross-validation folds should be used by setting the `n_folds` variable.
Furthermore, as you'll learn in the lecture, splitting your data correctly when doing cross-validation is very important. The function `create_folds` below shows a very basic splitting strategy using sklearn's [`KFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html?highlight=kfold#sklearn.model_selection.KFold) splitter object (by default, it cuts the data into consecutive folds without shuffling).
Consider whether a better splitting strategy would be possible/necessary given the data you have received. If so, implement this better splitting strategy in `create_folds`.

Note that, while `GridSearchCV` objects can handle `KFold` splitters such as the one used below, for this stage we require you to return the splits as a list of tuples of train and test indices, one tuple per fold, in order to enable autograding.
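The sketch below shows the required return format (a list of `(train_indices, val_indices)` tuples) using `GroupKFold`, which is one possible candidate when samples belong to groups such as signers. Whether this is the right choice for your data is for you to decide. Note that the locked cell calls `create_folds(X, y, n_folds)`, so the extra `groups` argument and the name `create_grouped_folds` are assumptions of this illustration; the toy arrays are made up.

```python
import numpy as np
from sklearn.model_selection import GroupKFold


def create_grouped_folds(X, y, groups, n_folds):
    """Return a list of (train_indices, val_indices) tuples with disjoint groups."""
    splitter = GroupKFold(n_splits=n_folds)
    return [
        (train_indices, val_indices)
        for train_indices, val_indices in splitter.split(X, y, groups=groups)
    ]


# Toy data: 12 samples from 4 hypothetical signers, 3 samples each.
X_toy = np.zeros((12, 2))
y_toy = np.arange(12) % 3
signers_toy = np.repeat([0, 1, 2, 3], 3)

toy_folds = create_grouped_folds(X_toy, y_toy, signers_toy, n_folds=4)
for train_indices, val_indices in toy_folds:
    print("validation signers:", sorted(set(signers_toy[val_indices])))
```

With this kind of split, no group appears in both the training and the validation part of the same fold.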


In [None]:
#TODO: set appropriate number of cv folds
n_folds = 4

# The function below is just an example!
#TODO: write a better split function here?
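def create_folds(X, y, n_folds):
    # Collect the (train_indices, val_indices) tuple of every fold in a list.
    folds = []
    cv_object = KFold(n_splits=n_folds)
    # Split the arguments X and y, not the global X_train and y_train.
    for (train_indices, val_indices) in cv_object.split(X, y):
        folds.append((train_indices, val_indices))
    return folds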

## 8. Training the model (Locked Cell)
Now it is time to put everything together and train the model. As you can see, `GridSearchCV` takes the final pipeline (built from your two sub-pipelines and the classifier) as well as the hyperparameter dictionary you defined, and uses the output of `create_folds` as the list of train and test indices for each split. Then the model is trained using `cv.fit()`, and the model and submission files are written to the file system.

**You will notice that this cell is locked to avoid editing by mistake. Please do not edit it or split it, and submit the model and submission file generated by this code in order to make sure your work can be autograded.**

In [None]:
pipeline = Pipeline([
    ('preprocessing', preprocessing),
    ('feature_selection', feature_selection),
    ('classifier', classifier)])

folds = create_folds(X_train,y_train,n_folds)
assert isinstance(folds,list),'Folds must be presented as tuples of train and test index lists' 

# train model
cv = GridSearchCV(pipeline, param_grid, n_jobs=4, cv=folds, verbose=1, return_train_score=True, refit=True)
cv.fit(X_train, y_train)

# write out model
#make sure student data is filled in to give the file a speaking name
assert student_id is not None and student_lastname is not None and student_firstname is not None, 'Please fill in your Name and Student Id'

submission_dirname = 'submission'
if use_timestamps:
    timestamp = time.strftime("%Y%m%d-%H%M%S", time.localtime())
    filename_model = os.path.join(submission_dirname,f'stage1_model_{student_id}_{student_lastname}_{student_firstname}_{timestamp}.pkl')
    filename_submission =  os.path.join(submission_dirname,f'stage1_{submission_prefix}_{student_id}_{student_lastname}_{student_firstname}_{timestamp}.csv')
else:
    filename_model = os.path.join(submission_dirname,f'stage1_model_{student_id}_{student_lastname}_{student_firstname}.pkl')
    filename_submission =  os.path.join(submission_dirname,f'stage1_{submission_prefix}_{student_id}_{student_lastname}_{student_firstname}.csv')

if not os.path.exists(submission_dirname):
    os.mkdir(submission_dirname)    

with open(filename_model,'wb') as file:
    pickle.dump(cv,file)
    
prediction = utils_for_students.label_encoder().inverse_transform(cv.best_estimator_.predict(X_test))
utils_for_students.create_submission_file(filename_submission,prediction)
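If you want to convince yourself that a pickled `GridSearchCV` object can be restored and used for prediction, a round trip like the one sketched below may help. It runs on small random toy data and writes to a temporary directory, so it does not touch your real submission files; all names in it are illustrative.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small random toy problem (not the course data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0]}, cv=2, refit=True)
toy_search.fit(X_toy, y_toy)

# Write the fitted search object to disk and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_model.pkl")
    with open(toy_path, "wb") as file:
        pickle.dump(toy_search, file)
    with open(toy_path, "rb") as file:
        restored_search = pickle.load(file)

# The restored object should behave like the original one.
same_predictions = np.array_equal(
    toy_search.predict(X_toy), restored_search.predict(X_toy)
)
print("identical predictions after reload:", same_predictions)
```

Keep in mind that pickle files are generally only reliable when loaded with the same library versions that created them.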

## 9. Printing scores
Here we simply extract a bit more information about the scores obtained by the classifiers that were trained on the individual folds. A few plots might also help you better understand what your classifier is doing.

**Feel free to add as many cells as you like as long as you leave the locked training cell as-is, and only use models and submissions that have been exported by that cell. Good luck with the exercise!**

In [None]:
results = cv.cv_results_
mean_train_score = results['mean_train_score'][cv.best_index_]
std_train_score = results['std_train_score'][cv.best_index_]
mean_cv_score = results['mean_test_score'][cv.best_index_]
std_cv_score = results['std_test_score'][cv.best_index_]

print('Training accuracy: {} +/- {}'.format(mean_train_score, std_train_score))
print('Cross-validation accuracy: {} +/- {}'.format(mean_cv_score, std_cv_score))

print('Best estimator:')
print(cv.best_estimator_)
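Looking only at the best configuration hides how close the runners-up were. The sketch below ranks configurations using the `params`, `mean_test_score`, `std_test_score` and `rank_test_score` entries that `cv_results_` provides. The helper name `top_configurations` and the numbers in `fake_results` are made up for illustration; with a fitted search you would pass `cv.cv_results_` instead.

```python
import numpy as np


def top_configurations(cv_results, n=3):
    """Return (params, mean score, std score) for the n best-ranked configurations."""
    order = np.argsort(cv_results["rank_test_score"], kind="stable")[:n]
    return [
        (
            cv_results["params"][i],
            float(cv_results["mean_test_score"][i]),
            float(cv_results["std_test_score"][i]),
        )
        for i in order
    ]


# Made-up stand-in with the same keys as GridSearchCV.cv_results_.
fake_results = {
    "params": [{"classifier__C": 0.1}, {"classifier__C": 1.0}, {"classifier__C": 10.0}],
    "mean_test_score": np.array([0.61, 0.74, 0.70]),
    "std_test_score": np.array([0.05, 0.03, 0.04]),
    "rank_test_score": np.array([3, 1, 2]),
}

for params, mean_score, std_score in top_configurations(fake_results, n=2):
    print(params, mean_score, "+/-", std_score)
```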