## Overview
1. Import only the training datasets, since Kaggle's scoring system does not function correctly
2. Split the training data into two parts: a training set and a testing set
3. Feature engineering
    * Each series covers 10 sensor channels with 128 measurements per channel, so we need to compress these measurements into a single sample
    * Features are extracted by aggregating each series with the functions max, min, median, mean, std, absolute maximum and quantiles
4. Use Random Forest Classifier (from scikit-learn) to train the model
5. Check model accuracy
    * OOB score
    * 10-Fold cross validation
    * Held-out testing set of 20 series

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings('ignore')

## Read in Kaggle datasets

In [None]:
X_train = pd.read_csv('./Data/career-con-2019/X_train.csv')
y_train = pd.read_csv('./Data/career-con-2019/y_train.csv')

Let's see what the training set looks like  
It has 487680 rows (3810 series of 128 measurements each), and each row has 13 columns (not all of them are usable; some will be dropped later)

In [None]:
X_train.ndim, X_train.shape

Now let's split the training set into two parts
* The last 20 series will be used as the testing set
* The rest will be used as the training set

In [None]:
# split X_train
samples = 20
time_series = 128
start_x = X_train.shape[0] - samples*time_series
X_train, X_test = X_train.iloc[:start_x], X_train.iloc[start_x:]
# split y_train
start_y = y_train.shape[0] - samples
y_train, y_test = y_train.iloc[:start_y], y_train.iloc[start_y:]
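
The positional split above only works if the rows are sorted by series and every series has the same length. The sketch below checks that assumption on a small synthetic frame (hypothetical values, not the Kaggle data); it also assumes the label table carries a `series_id` column, which may need adjusting.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X_train / y_train: 5 series of 4 measurements each (assumed layout)
n_series, series_len, n_test = 5, 4, 2
X = pd.DataFrame({
    'series_id': np.repeat(np.arange(n_series), series_len),
    'value': np.arange(n_series * series_len, dtype=float),
})
y = pd.DataFrame({'series_id': np.arange(n_series),
                  'surface': ['a', 'b', 'a', 'c', 'b']})

# Same positional split as in the notebook
start_x = X.shape[0] - n_test * series_len
X_tr, X_te = X.iloc[:start_x], X.iloc[start_x:]
start_y = y.shape[0] - n_test
y_tr, y_te = y.iloc[:start_y], y.iloc[start_y:]

# The split is only valid if no series straddles the boundary
# and the held-out labels describe exactly the held-out series
train_ids = set(X_tr['series_id'])
test_ids = set(X_te['series_id'])
assert train_ids.isdisjoint(test_ids)
assert test_ids == set(y_te['series_id'])
```

Running the same two assertions on the real frames would confirm that the last 20 series and the last 20 labels line up.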

Before we modify the data, let's first check the summary information of the set  
This gives us some insight into the data

In [None]:
X_train.info()

Let's also see how many series each surface type has

In [None]:
y_train['surface'].value_counts().plot(kind='barh')

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

In [None]:
X_train.head(1)

In [None]:
X_train.keys()

In [None]:
y_train.keys()

## Process datasets
The first thing we want to do is drop the columns (features) that are not useful for training

In [None]:
X_train = X_train.drop(['row_id', 'measurement_number'], axis=1)
X_test = X_test.drop(['row_id', 'measurement_number'], axis=1)
y_train = y_train.drop('group_id', axis=1)
y_test = y_test.drop('group_id', axis=1)

In [None]:
X_train.head(1)

In [None]:
X_test.head(1)

In [None]:
y_train.head(1)

In [None]:
y_test.head(1)

We have 128 measurements for each series, which makes the raw data hard to train on  
So we want to compress these 128 measurements into a single sample  
First, we will add 3 more features to each measurement: the sums of the components of orientation, angular_velocity, and linear_acceleration

In [None]:
# Add one summed feature per sensor group
columns = ['orientation', 'angular_velocity', 'linear_acceleration']
for i in columns:
    if i == 'orientation':
        # orientation has four components (X, Y, Z, W)
        X_train[i] = X_train[i+'_X'] + X_train[i+'_Y'] + X_train[i+'_Z'] + X_train[i+'_W']
        X_test[i] = X_test[i+'_X'] + X_test[i+'_Y'] + X_test[i+'_Z'] + X_test[i+'_W']
    else:
        # angular_velocity and linear_acceleration have three components (X, Y, Z)
        X_train[i] = X_train[i+'_X'] + X_train[i+'_Y'] + X_train[i+'_Z']
        X_test[i] = X_test[i+'_X'] + X_test[i+'_Y'] + X_test[i+'_Z']
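
To make the summed features concrete, here is a minimal sketch on a one-row frame with made-up values. It selects the component columns by prefix instead of spelling them out, which is just one possible way to write the same sum; the frame `df` and its numbers are hypothetical.

```python
import pandas as pd

# Hypothetical single-row frame with the same column naming scheme
df = pd.DataFrame({
    'orientation_X': [0.1], 'orientation_Y': [0.2],
    'orientation_Z': [0.3], 'orientation_W': [0.4],
    'angular_velocity_X': [1.0], 'angular_velocity_Y': [2.0],
    'angular_velocity_Z': [3.0],
})

for prefix in ['orientation', 'angular_velocity']:
    # Select every component column that starts with this prefix
    parts = [c for c in df.columns if c.startswith(prefix + '_')]
    # Row-wise sum of the components becomes the new feature
    df[prefix] = df[parts].sum(axis=1)

print(df[['orientation', 'angular_velocity']])
```

Because the prefix match includes the trailing underscore, the newly added `orientation` column is not picked up as a component on later iterations.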

In [None]:
X_train.head(1)

In [None]:
X_test.head(1)

Next, we want to calculate *max, min, mean, median, abs_max, std, quartile(25%), quartile(50%), quartile(75%)* for each series (128 measurements) to retain as much information as possible  
* Every sample has 13 features before compression; each feature becomes 9 values afterwards, so there will be 9 * 13 = 117 features for each sample
* There were 485120 measurements before compression, which become 485120 / 128 = 3790 samples after the calculation

In [None]:
%%time

T_train = pd.DataFrame()
T_test = pd.DataFrame()

for i in X_train.columns[1:]:
    T_train[i+'_max'] = X_train.groupby(by='series_id')[i].max()
    T_test[i+'_max'] = X_test.groupby(by='series_id')[i].max()

    T_train[i+'_min'] = X_train.groupby(by='series_id')[i].min()
    T_test[i+'_min'] = X_test.groupby(by='series_id')[i].min()

    T_train[i+'_mean'] = X_train.groupby(by='series_id')[i].mean()
    T_test[i+'_mean'] = X_test.groupby(by='series_id')[i].mean()

    T_train[i+'_median'] = X_train.groupby(by='series_id')[i].median()
    T_test[i+'_median'] = X_test.groupby(by='series_id')[i].median()

    T_train[i+'_quantile_25'] = X_train.groupby(by='series_id')[i].quantile(0.25)
    T_test[i+'_quantile_25'] = X_test.groupby(by='series_id')[i].quantile(0.25)

    T_train[i+'_quantile_50'] = X_train.groupby(by='series_id')[i].quantile(0.5)
    T_test[i+'_quantile_50'] = X_test.groupby(by='series_id')[i].quantile(0.5)

    T_train[i+'_quantile_75'] = X_train.groupby(by='series_id')[i].quantile(0.75)
    T_test[i+'_quantile_75'] = X_test.groupby(by='series_id')[i].quantile(0.75)

    T_train[i+'_abs_max'] = X_train.groupby(by='series_id')[i].apply(lambda x: np.max(np.abs(x)))
    T_test[i+'_abs_max'] = X_test.groupby(by='series_id')[i].apply(lambda x: np.max(np.abs(x)))

    T_train[i+'_std'] = X_train.groupby(by='series_id')[i].std()
    T_test[i+'_std'] = X_test.groupby(by='series_id')[i].std()
    
X_train = T_train
X_test = T_test
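
The same compression can probably be expressed as a single `groupby(...).agg(...)` call. The sketch below is an alternative under stated assumptions, not the notebook's method: the helper names (`compress`, `abs_max`, `quantile_25`, ...) and the toy frame are hypothetical, and only the output column naming mirrors the loop above.

```python
import numpy as np
import pandas as pd

def abs_max(x):
    # Largest absolute value within one series
    return np.max(np.abs(x))

def quantile_25(x):
    return x.quantile(0.25)

def quantile_50(x):
    return x.quantile(0.5)

def quantile_75(x):
    return x.quantile(0.75)

def compress(df, key='series_id'):
    # Nine statistics per feature, as in the loop-based version
    stats = ['max', 'min', 'mean', 'median',
             quantile_25, quantile_50, quantile_75, abs_max, 'std']
    out = df.groupby(key).agg(stats)
    # Flatten the (feature, statistic) MultiIndex into names like 'a_max'
    out.columns = ['{}_{}'.format(feat, stat) for feat, stat in out.columns]
    return out

# Toy data: 3 series of 4 measurements, 2 features (made-up values)
toy = pd.DataFrame({
    'series_id': np.repeat([0, 1, 2], 4),
    'a': [-5.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0, 4.0, 4.0, 4.0, 4.0],
    'b': np.arange(12.0),
})
out = compress(toy)
print(out.shape)  # one row per series, 9 columns per feature
```

On the toy frame this gives one row per series and 9 * 2 columns, matching the 9 * 13 arithmetic for the real data.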

In [None]:
X_train.head(1)

In [None]:
X_train.shape

In [None]:
X_test.head(1)

In [None]:
X_test.shape

## Train and test model
For this project, I am using the Random Forest Classifier from scikit-learn to train the model  
Since this project is a classification problem, ensemble learning is a good fit  
* We'll set 300 estimators (after many trials, I found that this number gives nearly the best result without costing too much time)
* Set a random seed (I set it to the course number of my AI class)
* The default bootstrap parameter is True, which means we are using bagging. Since roughly 37% of the samples are left out of each tree's bootstrap sample, we set oob_score to True so that those out-of-bag samples can be used to test the accuracy of our model
* Set n_jobs to -1 so that every CPU core is used for training and it won't take too long
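
The share of samples left out of a single bootstrap draw can be estimated directly. The following sketch simulates the draws with NumPy only (no model involved) and compares the result with the closed form (1 - 1/n)^n, which approaches 1/e, about 0.368; the sample count 3790 and the 300 repetitions are taken from this notebook, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6613)
n_samples = 3790   # number of training series after compression
n_trees = 300      # one bootstrap sample per tree

oob_fractions = []
for _ in range(n_trees):
    # One bootstrap sample: draw n_samples indices with replacement
    drawn = rng.integers(0, n_samples, size=n_samples)
    n_unique = np.unique(drawn).size
    # Samples never drawn are out-of-bag for this tree
    oob_fractions.append(1 - n_unique / n_samples)

mean_oob = float(np.mean(oob_fractions))
# Probability that a given sample is missed by all n_samples draws
theory = (1 - 1 / n_samples) ** n_samples
print('simulated: {:.3f}  theory: {:.3f}'.format(mean_oob, theory))
```

Each tree therefore has a sizeable set of unseen samples to be scored on, which is what `oob_score=True` relies on.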

In [None]:
%%time

rf_clf = RandomForestClassifier(n_estimators=300,
                                random_state=6613,
                                oob_score=True,
                                n_jobs=-1)
rf_clf.fit(X_train, y_train['surface'])

After training the model, let's check its accuracy with three measures:  
* Out-of-bag score
* 10-Fold cross validation
* Testing set split from the training set

We can see that this model reaches about 89.6% accuracy

In [None]:
rf_clf.oob_score_

In [None]:
%%time
scores = cross_val_score(rf_clf, X_train, y_train['surface'], cv = 10)
print('Accuracy: {:.2f} (+/- {:.2f})'.format(scores.mean(), scores.std() * 2))

In [None]:
rf_clf.score(X_test, y_test['surface'])
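
For comparison, the three checks can be tried side by side on synthetic data. This is only a sketch: `make_classification` stands in for the compressed sensor features, the forest is kept small (50 trees) so it runs quickly, and none of the numbers it prints say anything about the Kaggle data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the compressed features (not the Kaggle data)
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
# Hold out the last 20 samples, mirroring the notebook's split
X_tr, X_te, y_tr, y_te = X[:-20], X[-20:], y[:-20], y[-20:]

clf = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0)
clf.fit(X_tr, y_tr)

oob = clf.oob_score_                                  # out-of-bag accuracy
cv_scores = cross_val_score(clf, X_tr, y_tr, cv=10)   # 10-fold cross validation
holdout = clf.score(X_te, y_te)                       # accuracy on the 20 held-out samples

# With 20 held-out samples, each misclassification shifts accuracy by 1/20 = 0.05
print('OOB: {:.3f}'.format(oob))
print('CV: {:.3f} (+/- {:.3f})'.format(cv_scores.mean(), cv_scores.std() * 2))
print('Hold-out: {:.2f}'.format(holdout))
```

Since the hold-out score can only move in steps of 0.05 with 20 samples, the OOB and cross-validation scores are likely the steadier estimates.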