![alt text](https://github.com/BITalinoWorld/python-lab-guides/raw/master/BITalino%20Hands-on/images/bitalinobar.jpg)
# Swimming Classification Using BioSPPy

In this example we classify swimming styles using the BioSPPy library and machine learning tools.
To run the example, every cell must be executed in order. To do so, click run ([ ]) at the top left of each cell.
A warning asking to reset all runtimes will appear before the first run; click to accept.

In [0]:
#@title Import libraries
import warnings
warnings.filterwarnings('ignore')
# Install BioSPPy (installer output is silenced).
!pip install biosppy >/dev/null 2>&1
import biosppy as bs
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import seaborn as sb

![alt text](https://github.com/BITalinoWorld/python-lab-guides/raw/master/BITalino%20Hands-on/images/bitalinobar.jpg)

# 1. Load Dataset

In this example you will use swimming files shared through a Google Drive link.
Our example file contains movement data from swimmers.
To access the data, you will have to load it from Google Drive. To do so, install PyDrive and follow the instructions that appear:
- choose the Google account you want to use
- allow the Google Cloud SDK to access your Google account
- copy the verification code and paste it into the box below

In [0]:
#@title Load data from Google Drive
!pip install -U -q PyDrive >/dev/null 2>&1
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

PyDrive lets us fetch data from Google Drive. All we need is the shareable link of the data file; the important part of that link is the file "id".
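
To make the role of the id concrete, here is a minimal sketch that reads it from the path segment following `/d/` in a shareable link. The helper name `drive_file_id` is hypothetical and is not part of PyDrive; it assumes links of the form `.../file/d/<id>/...`.

```python
from urllib.parse import urlparse

def drive_file_id(link):
    """Return the file id from a '.../file/d/<id>/...' Google Drive link."""
    parts = urlparse(link).path.split('/')
    # The id is the path segment right after 'd'.
    return parts[parts.index('d') + 1]

link = 'https://drive.google.com/file/d/1_Frt5Kf9xm4HO78ccRe9DBdrO39FjUwZ/view?usp=sharing'
print(drive_file_id(link))
```

Unlike a positional split on `/`, this does not depend on how many segments follow the id.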

In [0]:
link = 'https://drive.google.com/file/d/1_Frt5Kf9xm4HO78ccRe9DBdrO39FjUwZ/view?usp=sharing' # The shareable link
file_id = link.split('/')[-2] # --> file_id is "1_Frt5Kf9xm4HO78ccRe9DBdrO39FjUwZ"
# Download the file
downloaded = drive.CreateFile({'id': file_id})
data_name = 'swim_pitch_segmented'
downloaded.GetContentFile(data_name)
# Open the file (the context manager closes it once the data is loaded)
with open(data_name, 'rb') as f:
    data = pickle.load(f)

**If the import was successful, the file 'swim_pitch_segmented' will appear, after refreshing, in the Files tab on the left side of the screen.**

# 1.1. Check data

The data file contains each pool run already segmented. The movement information is given by Yaw, Pitch, Roll, Accelerometer (x,y,z), Gyroscope (x,y,z) and Magnetometer (x,y,z).

At first glance, it is difficult to know which modalities are more relevant for movement.

In [0]:
pd.DataFrame.from_dict(data[0])[:1]
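
Judging from how `data` is indexed in this notebook, each swimmer appears to be a dictionary whose fields hold one entry per lap. The sketch below builds a synthetic record with that assumed layout (the values and the two labels are made up) and shows how the laps are joined into one continuous trace.

```python
import numpy as np

# Synthetic stand-in for one swimmer: two laps, each with its own samples.
swimmer = {
    'Time':  [np.arange(0, 5), np.arange(5, 12)],
    'Pitch': [np.zeros(5), np.ones(7)],
    'Label': ['lap_a', 'lap_b'],   # made-up labels
}

# Every field holds one entry per lap.
n_laps = len(swimmer['Label'])

# Concatenating the per-lap arrays gives a single continuous signal.
time_all = np.concatenate(swimmer['Time'])
pitch_all = np.concatenate(swimmer['Pitch'])
print(n_laps, time_all.shape, pitch_all.shape)
```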

# 1.2. Plot data

To gain some insight into which modalities are relevant for swimming motion, we can plot one modality at a time for a specific swimmer, using the data file.

You can change the **user** and **sensor** variables to see different modalities and different swimmers.

The plot is saved locally as "Swim_styles.png".

In [0]:
#choose parameters
user = 1
sensor = 'Pitch'

#create figure
plt.figure(figsize=(25,5))
#join the user segments
plt.title('Swimmer ' + str(user))
join_time = np.concatenate(data[user]['Time'])
join_data = np.concatenate(data[user][sensor])
#plot joint data
plt.plot(join_time, join_data)
for lab in range(len(data[user]['Label'])):
    middle = len(data[user]['Time'][lab])//2
    plt.text(data[user]['Time'][lab][0]+middle, np.max(np.concatenate(data[user][sensor])), data[user]['Label'][lab], fontsize=12)
plt.savefig('Swim_styles.png')
plt.show()

![alt text](https://github.com/BITalinoWorld/python-lab-guides/raw/master/BITalino%20Hands-on/images/bitalinobar.jpg)

# 2. Machine Learning

## Key ML Terminology

![alt text](https://i.ibb.co/xs6rJ5d/swim-ML.png)

Given a training set, the goal is to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis [1](http://cs229.stanford.edu/notes2019fall/cs229-notes1.pdf).


A **feature** is an input variable [2](https://developers.google.com/machine-learning/crash-course/framing/ml-terminology).

A **label** is the thing we're predicting; in this example, the swimming style of each lap [2](https://developers.google.com/machine-learning/crash-course/framing/ml-terminology).
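
As a toy illustration of these terms (the numbers and style names are invented, not taken from the swimming dataset), a hypothesis h can be as simple as a nearest-centroid rule that maps a feature vector to the label of the closest class mean.

```python
import numpy as np

# Training set: each row of X_toy is a feature vector, y_toy holds its label.
X_toy = np.array([[0.1, 1.0], [0.2, 0.9], [0.9, 0.1], [1.0, 0.2]])
y_toy = np.array(['style_a', 'style_a', 'style_b', 'style_b'])

# "Learning": store the mean feature vector of every label.
toy_labels = np.unique(y_toy)
centroids = np.array([X_toy[y_toy == lab].mean(axis=0) for lab in toy_labels])

def h(x):
    """Hypothesis h: X -> Y, predicting the label of the nearest centroid."""
    distances = np.linalg.norm(centroids - x, axis=1)
    return toy_labels[np.argmin(distances)]

print(h(np.array([0.15, 0.95])))  # a point close to the style_a examples
```

The classifiers used later in this notebook learn more flexible hypotheses, but they play the same role: features in, label out.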

# 2.2. Feature Extraction
![alt text](https://i.ibb.co/StPzn0v/Captura-de-ecr-2019-09-26-s-16-12-11.png)

Feature Calculation through biosppy 
```
data = pickle.load(open('swim_pitch_segmented','rb'))

def standarize(feats):
    # Assumption: the original helper is not shown; a per-column z-score is used here.
    return (feats - feats.mean()) / feats.std()

def swim_features_calculation(data, sensor_list=['Pitch', 'Roll', 'Yaw', 'Az', 'Ay', 'Ax']):
    """Compute BioSPPy signal statistics per lap and modality, then pickle them."""
    user_feat, user_label = [], []
    for user in range(len(data)):
        user_df = pd.DataFrame.from_dict(data[user])
        for cl in sensor_list:
            label_ = []
            for line in range(len(user_df[cl])):
                # One row of statistics for this lap of this modality
                feat_row = bs.signals.tools.signal_stats(user_df[cl][line])
                # keys() and tuple iteration are the public way to read a ReturnTuple
                feat_names = [cl + '_' + feat for feat in feat_row.keys()]
                feat_row = pd.DataFrame([list(feat_row)], columns=feat_names)
                feat_rows = feat_row if line == 0 else pd.concat([feat_rows, feat_row], axis=0, sort=False, ignore_index=True)

            label_ = user_df['Label'].values
            # Place the statistics of each modality side by side
            feats = feat_rows.copy() if cl == sensor_list[0] else pd.concat([feats, feat_rows], axis=1, sort=False)

        user_feat += [pd.DataFrame(standarize(feats), columns=feats.columns)]
        user_label += [label_]

    # Save one feature table and one label array per swimmer
    pickle.dump(user_feat, open('swim_pitch_feats_pool','wb'))
    pickle.dump(user_label, open('swim_pitch_labels_pool', 'wb'))
```



Feature calculation may take some time. To perform it on the swimming data we created the function "swim_features_calculation", which uses BioSPPy's signal_stats function to calculate simple measures such as the mean, standard deviation, maximum value, and others. This function lets you choose specific modalities and creates a feature vector **user_feat** and a labels vector **user_label**, which were saved into separate files.
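
To give a feel for what such per-lap statistics look like, here is a NumPy-only sketch that does not call BioSPPy. The helper `simple_stats` and its feature names are illustrative assumptions, not the exact output of `signal_stats`.

```python
import numpy as np

def simple_stats(signal, prefix):
    """Summarise one lap of one modality as a small feature dictionary."""
    signal = np.asarray(signal, dtype=float)
    return {
        prefix + '_mean': signal.mean(),
        prefix + '_std': signal.std(),
        prefix + '_max': signal.max(),
        prefix + '_min': signal.min(),
    }

# Made-up pitch samples for a single lap.
lap = [1.0, 2.0, 3.0, 4.0]
features = simple_stats(lap, 'Pitch')
print(features)
```

Prefixing each statistic with the modality name mirrors the `Pitch_...`, `Roll_...` column naming built in the function above.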

To save time, we will download these files, which are available through the shareable links below, using the same method as before.

In [0]:
#Download features file

link_feats = 'https://drive.google.com/file/d/17stX00p_3z6b7q-Z9hc0tjvTZJZjQIzz/view?usp=sharing' # The shareable link
id = link_feats.split('/')[-2]
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('swim_pitch_feats_pool')  



In [0]:
#Download labels file

link_labels = 'https://drive.google.com/file/d/1ysJw3Y7vgOQeQT_5vfoUZdX0Meqe7T-F/view?usp=sharing' # The shareable link
id = link_labels.split('/')[-2]
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('swim_pitch_labels_pool')  

In [0]:
#We will load feature vector as X and labels vector as Y

X = pickle.load(open('swim_pitch_feats_pool','rb'))
Y = pickle.load(open('swim_pitch_labels_pool','rb'))

In [0]:
#check all features for one specific user. For each swimming lap (0 to 15) all features were calculated for the designated modalities (Pitch, Roll, Yaw, Az, Ay, Ax).
user = 10
X[user]

# 2.3. Splitting Data

![alt text](https://i.ibb.co/vY7zFjM/swim-Split-Data.png)

*   training set: a subset of the data used to train a model.

*   test set: an independent subset used to test the trained model.
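
A minimal NumPy sketch of the idea, assuming a simple shuffled split. The notebook itself uses scikit-learn's `train_test_split`; the helper name `shuffled_split` and the demo arrays are hypothetical.

```python
import numpy as np

def shuffled_split(X, y, test_size=0.33, seed=42):
    """Shuffle the sample indices, then hold out a fraction for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(np.ceil(test_size * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_demo = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y_demo = np.arange(10)                  # one label per sample
X_tr, X_te, y_tr, y_te = shuffled_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```

Every sample ends up in exactly one of the two subsets, so the test set stays unseen during training.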

In [0]:
# Change data type to array
X_array = []
Y_array = []
for user in range(len(X)):
    X_array.append(X[user].values)
    Y_array.append(np.array(Y[user]))
X_array = np.array(X_array)
Y_array = np.array(Y_array)


In [0]:
# Split into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(np.concatenate(X_array), np.concatenate(Y_array), test_size=0.33, random_state=42)

# 2.4. Learn Model

In [0]:
#choose classifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import cross_val_score

names = [ "Decision Tree", "GaussianNB","Gaussian Process", "GradientBoosting",
          "Bernoulli NB","Logistic Regression"]

classifiers = [
        DecisionTreeClassifier(),
        GaussianNB(),
        GaussianProcessClassifier(),
        GradientBoostingClassifier(),
        BernoulliNB(),
        LogisticRegression()
]

classifier = DecisionTreeClassifier()

In [0]:
# Fit supervised Learning Classifiers on the training set data

classifier.fit(X_train, y_train)

# 2.5. Evaluate Model


In [0]:
y_predicted = classifier.predict(X_test)
y_predicted



In [0]:
#@title def plot_confusion_matrix(y_true, y_pred, true_labels=None, normalize=False)

from sklearn.metrics import confusion_matrix

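def plot_confusion_matrix(y_true, y_pred, true_labels=None, normalize=False, title=''):
    """
    Plot the confusion matrix of a classifier as a heatmap.

    :param y_true: type(array), contains true labels
    :param y_pred: type(array), contains predicted labels
    :param true_labels: list of unique labels (defaults to the unique values of y_true)
    :param normalize: boolean, if True all counts are divided by the largest cell value
    :param title: plot title
    :return:
    """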
    if true_labels is None:
        true_labels = np.unique(y_true)
    cm = confusion_matrix(y_true, y_pred, labels=true_labels)
    if normalize:
        cm = np.round(cm / np.max(cm), 2)
        
    plt.figure(figsize=(10,5))
    ax = plt.subplot(1,1,1)
    ax.set_title(title)
    sb.heatmap(cm, annot=True, ax=ax, fmt='g', cmap='Blues')  # annot=True to annotate cells
    # labels, title and ticks
    ax.set_xlabel('Predicted', fontsize=20)
    ax.xaxis.set_label_position('top')
    ax.xaxis.set_ticklabels(true_labels, fontsize=10)
    ax.xaxis.tick_top()
    ax.set_ylabel('True', fontsize=20)
    ax.yaxis.set_ticklabels(true_labels, fontsize=10)

In [0]:
# Evaluation on the test set
classifier.fit(X_train, y_train)
y_predicted = classifier.predict(X_test)
target_names = np.unique(y_test)
plot_confusion_matrix(y_test, y_predicted, target_names)
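
For reference, a confusion matrix counts how often each true label (rows) was predicted as each label (columns). The sketch below builds one by hand with NumPy on made-up labels. It also shows row-wise normalisation, a common alternative to the max-based scaling used in `plot_confusion_matrix`; the helper name `confusion_counts` is hypothetical.

```python
import numpy as np

def confusion_counts(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {lab: i for i, lab in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm

# Made-up labels for illustration only.
demo_labels = ['a', 'b']
demo_true = ['a', 'a', 'a', 'b', 'b']
demo_pred = ['a', 'a', 'b', 'b', 'a']

cm = confusion_counts(demo_true, demo_pred, demo_labels)
# Divide each row by its total so that every row sums to 1.
cm_rows = cm / cm.sum(axis=1, keepdims=True)
# Accuracy is the share of counts that lie on the diagonal.
accuracy = np.trace(cm) / cm.sum()
print(cm, cm_rows, accuracy, sep='\n')
```

With row-wise normalisation each diagonal cell reads as the fraction of that true class which was recognised correctly.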


![alt text](https://github.com/BITalinoWorld/python-lab-guides/raw/master/BITalino%20Hands-on/images/bitalinobar.jpg)

# 3. Feature Selection
A very high-dimensional feature vector can lead to overfitting and high computational cost. Feature selection keeps only the features that improve the cross-validated accuracy.
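
The greedy search behind sequential forward selection can be sketched independently of any classifier. In the sketch below, `forward_select` and `toy_score` are hypothetical names, and the scoring function is made up; in this notebook the score would be a cross-validated accuracy instead.

```python
def forward_select(n_features, score):
    """Greedy forward selection: add the best feature until the score stops improving."""
    selected, best = [], float('-inf')
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        # Score every candidate subset that adds exactly one new feature.
        top_score, top_feat = max((score(selected + [f]), f) for f in candidates)
        if top_score <= best:
            break  # no candidate improves on the current subset
        selected.append(top_feat)
        best = top_score
    return selected, best

# Made-up scoring function: features 0 and 2 help, every extra feature costs a little.
GAIN = {0: 0.5, 2: 0.3}

def toy_score(subset):
    return sum(GAIN.get(f, 0.0) for f in subset) - 0.05 * len(subset)

selected, best = forward_select(4, toy_score)
print(selected, round(best, 2))
```

The search stops as soon as adding any remaining feature fails to raise the score, which keeps the final feature set small.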

In [0]:
#@title def cross_validation(classifier, X_train, y_train, feat_idx=None, random_num=42, test_size=0.25)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold


def cross_validation(classifier, X_train, y_train, feat_idx=None, random_num=42, test_size=0.25):
    """Return the accuracy (in %) of each fold; folds are formed over swimmers."""
    # random_state only has an effect with shuffle=True; recent scikit-learn
    # versions raise an error when it is combined with shuffle=False.
    kf = KFold(n_splits=int(1 // test_size), random_state=random_num, shuffle=True)
    acc = []

    for train_index, test_index in kf.split(X_train):

        # Stack the laps of all swimmers that belong to each side of the fold
        Ux_train = np.concatenate(X_train[train_index])
        Ux_test = np.concatenate(X_train[test_index])
        Uy_train, Uy_test = np.concatenate(y_train[train_index]), np.concatenate(y_train[test_index])
        if feat_idx is not None:
            # Keep only the requested feature column(s)
            Ux_train, Ux_test = Ux_train[:, feat_idx], Ux_test[:, feat_idx]

            if isinstance(feat_idx, (int, np.integer)):
                # A single column comes back 1-D; scikit-learn expects 2-D input
                Ux_train = Ux_train.reshape(-1, 1)
                Ux_test = Ux_test.reshape(-1, 1)

        classifier.fit(Ux_train, Uy_train)
        U_pred = classifier.predict(Ux_test)
        acc_ = np.round(accuracy_score(Uy_test, U_pred)*100, 2)
        acc.append(acc_)
    return acc
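
Because `cross_validation` hands the per-swimmer array to `KFold`, the folds are formed over swimmers rather than over individual laps, so all laps of a given swimmer fall on the same side of a fold. A NumPy-only sketch of that grouping, left unshuffled for simplicity (the helper name `swimmer_folds` is hypothetical):

```python
import numpy as np

def swimmer_folds(n_swimmers, n_splits):
    """Yield (train, test) swimmer indices, one contiguous block per fold."""
    blocks = np.array_split(np.arange(n_swimmers), n_splits)
    for i, test in enumerate(blocks):
        train = np.concatenate([b for j, b in enumerate(blocks) if j != i])
        yield train, test

for train, test in swimmer_folds(n_swimmers=8, n_splits=4):
    print('train swimmers', train, 'test swimmers', test)
```

Evaluating on swimmers that were never seen during training gives a more honest estimate of how the model would behave on a new person.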


In [0]:
#@title def FSE_cross_validation(X_train, y_train, features_descrition, classifier, random_num)
from sklearn.metrics import accuracy_score

def FSE_cross_validation(X_train, y_train, features_descrition, classifier, random_num):
    """ Performs a sequential forward feature selection.
    Parameters
    ----------
    X_train : array
        Training set feature-vector, one entry per swimmer.

    y_train : array
        Training set class-labels groundtruth, one entry per swimmer.

    features_descrition : array
        Features labels.

    classifier : object
        Classifier.

    random_num : int
        Seed passed on to cross_validation.

    Returns
    -------
    FS_idx : array
        Selected set of best features indexes.

    FS_lab : array
        Label of the selected best set of features.

    acc : array
        Accuracy and standard deviation of the final feature set.

    FS_X_train : list
        Placeholder for the transformed feature-vector; currently returned empty.
    """
    total_acc, total_std, FS_lab, acc_list, acc_std, FS_idx = [], [], [], [], [], []
    X_train = np.array(X_train)

    print("*** Feature selection started ***")
    for feat_idx, feat_name in enumerate(features_descrition):

        cv_result = cross_validation(classifier, X_train, y_train, feat_idx, random_num)
        acc_list.append((np.array(cv_result).prod()**(1.0/len(cv_result))))
        acc_std.append(np.std(cv_result))

    curr_acc_idx = np.argmax(acc_list)
    FS_lab.append(features_descrition[curr_acc_idx])
    last_acc = acc_list[curr_acc_idx]
    total_acc.append(last_acc)
    total_std.append(acc_std[curr_acc_idx])
    FS_idx.append(curr_acc_idx)
    while True:
        # Reset both score lists each round so their indices match features_descrition
        acc_list, acc_std = [], []
        print(FS_lab)
        for feat_idx, feat_name in enumerate(features_descrition):
            if feat_name not in FS_lab:
                feats_idx = FS_idx[:]
                feats_idx.append(feat_idx)
                cv_result = cross_validation(classifier, X_train, y_train, feats_idx, random_num)
                acc_list.append(np.array(cv_result).prod()**(1.0/len(cv_result)))
                acc_std.append(np.std(cv_result))

            else:
                # Already selected: placeholders keep both lists aligned
                acc_list.append(0)
                acc_std.append(0)
        curr_acc_idx = np.argmax(acc_list)
        if last_acc < acc_list[curr_acc_idx]:
            FS_lab.append(features_descrition[curr_acc_idx])
            last_acc = acc_list[curr_acc_idx]
            total_acc.append(last_acc)
            total_std.append(acc_std[curr_acc_idx])
            FS_idx.append(curr_acc_idx)
        else:
            print("FINAL Features: " + str(FS_lab))
            print("Number of selected features", len(FS_lab))
            print("Features idx: ", FS_idx)
            print("Acc: ", str(total_acc))
            print(curr_acc_idx)
            print('Acc std ', str(total_std))
            print("From ", str(X_train[0].shape[1]), " features to ", str(len(FS_lab)))
            break
    print("*** Feature selection finished ***")
    FS_X_train = []


    return np.array(FS_idx), np.array(FS_lab), np.array([total_acc[-1], total_std[-1]]), FS_X_train


In [0]:
# Feature selection
feats_names = X[0].columns
idx_feat, best_feats, acc, _ = FSE_cross_validation(X_array, Y_array, feats_names, classifier, random_num=12)

In [0]:
#Update classifier for best features set
remove_columns = [feats_names[idx] for idx in range(len(feats_names)) if feats_names[idx] not in best_feats]
X_bf_array = []
Y_bf_array = []
for user in range(len(X)):
    X_bf_array.append(X[user].drop(columns=remove_columns).values)
    Y_bf_array.append(np.array(Y[user]))
X_bf_array = np.array(X_bf_array)
Y_bf_array = np.array(Y_bf_array)

In [0]:
#train test split
X_train, X_test, y_train, y_test = train_test_split(np.concatenate(X_bf_array), np.concatenate(Y_bf_array), test_size=0.33, random_state=42)
#train classifier
classifier.fit(X_train, y_train)
#inference
y_predicted = classifier.predict(X_test)
target_names = np.unique(y_test)
#plot result in a confusion matrix
plot_confusion_matrix(y_test, y_predicted, target_names)
print('\n')

![alt text](https://github.com/BITalinoWorld/python-lab-guides/raw/master/BITalino%20Hands-on/images/bitalinobar.jpg)