<h1><center><font size="7">CareerCon 2019 - Help Navigate Robots</font></center></h1>
<img src="http://storage.googleapis.com/kaggle-competitions/kaggle/13242/logos/header.png?t=2019-03-12-23-32-42">
    
## Table of contents
1. [Introduction](#1)
1. [Prepare for data analysis](#2)
1. [Data exploration](#3)
1. [Data preparation](#4)
1. [Modeling](#5)
1. [Submission](#6)
1. [References](#7)

# <a id='1'></a>1. Introduction  

The task is to predict which floor type (for example concrete, tiles, PVC, carpet, or wood) the robot is standing on, using sensor data such as orientation, angular velocity, and linear acceleration. Succeed and you'll help improve the navigation of robots without assistance across many different surfaces, so they won’t fall down on the job.

In this notebook we will explore the data, prepare it for a model, train a model and classify the target value (surface type) for the test set, then prepare a submission.



# <a id='2'></a>2. Prepare for data analysis
## 2.1 Import libraries 

In [None]:
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
#from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
#from sklearn.neural_network import MLPClassifier

import matplotlib.pyplot as plt #plotting
import seaborn as sns #higher-level plotting

import os 
print(os.listdir("../input")) # let's print available data
import warnings
warnings.filterwarnings('ignore') # ignore warnings

## 2.2 Read data 

In [None]:
%%time
train_df = pd.read_csv('../input/X_train.csv')
test_df = pd.read_csv('../input/X_test.csv')
target_df = pd.read_csv('../input/y_train.csv')

# <a id='3'></a>3. Data exploration

## 3.1 Data description
**1. X_[train/test].csv** - the input data, covering 10 sensor channels and 128 measurements per time series plus three ID columns:
    - row_id: The ID for this row.
    - series_id: ID number for the measurement series. Foreign key to y_train/sample_submission.
    - measurement_number: Measurement number within the series.
    
The orientation channels encode the robot's current orientation as a quaternion (see [Wikipedia](https://en.wikipedia.org/wiki/Conversion_between_quaternions_and_Euler_angles)). The angular velocity channels describe how fast the robot rotates about each axis, and the linear acceleration channels describe how its velocity changes over time. The 10 sensor channels are:
    - orientation_X
    - orientation_Y
    - orientation_Z
    - orientation_W
    - angular_velocity_X
    - angular_velocity_Y
    - angular_velocity_Z
    - linear_acceleration_X
    - linear_acceleration_Y
    - linear_acceleration_Z
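
To build intuition for the orientation channels, a quaternion can be converted into Euler angles (roll, pitch, yaw) with the standard formulas from the linked Wikipedia article. The sketch below is illustrative only and is not part of the pipeline in this notebook; `quaternion_to_euler` is a hypothetical helper name, and it assumes unit quaternions and a Z-Y-X rotation order.

```python
import numpy as np

def quaternion_to_euler(x, y, z, w):
    """Convert a unit quaternion to (roll, pitch, yaw) in radians.

    Hypothetical helper; assumes the Z-Y-X (yaw-pitch-roll) convention.
    Works on scalars or NumPy arrays of equal shape.
    """
    # Roll: rotation about the X axis
    roll = np.arctan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    # Pitch: rotation about the Y axis (clipped to guard against rounding)
    pitch = np.arcsin(np.clip(2.0 * (w * y - z * x), -1.0, 1.0))
    # Yaw: rotation about the Z axis
    yaw = np.arctan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    return roll, pitch, yaw

# A 90-degree rotation about the Z axis should give yaw = pi/2
roll, pitch, yaw = quaternion_to_euler(0.0, 0.0, np.sin(np.pi / 4), np.cos(np.pi / 4))
print(roll, pitch, yaw)
```

Applied column-wise to `orientation_X`, `orientation_Y`, `orientation_Z`, and `orientation_W`, a function like this could produce three additional angle channels, if such features turn out to be useful.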
    
**2. y_train.csv** - the surfaces for the training set.
    - series_id: ID number for the measurement series.
    - group_id: ID number for all of the measurements taken in a recording session. Provided for the training set only, to enable more cross validation strategies.
    - surface: the target for this competition.

**3. sample_submission.csv** - a sample submission file in the correct format.
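
The `group_id` column described above makes group-aware cross-validation possible, so that all series from one recording session land in the same fold. This notebook does not use it; the sketch below only shows the idea with scikit-learn's `GroupKFold` on toy arrays, where all values are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy stand-in for y_train.csv: 12 series from 4 recording sessions
groups = np.repeat([0, 1, 2, 3], 3)   # hypothetical group_id values
X = np.arange(12).reshape(-1, 1)      # placeholder features
y = np.tile([0, 1, 2], 4)             # placeholder surface labels

gkf = GroupKFold(n_splits=4)
fold_overlaps = []
for train_idx, valid_idx in gkf.split(X, y, groups=groups):
    # Count recording sessions that appear on both sides of the split
    shared = set(groups[train_idx]) & set(groups[valid_idx])
    fold_overlaps.append(len(shared))

print(fold_overlaps)  # no session is shared between train and validation
```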

## 3.2 Data review

In [None]:
train_df.head(n=5)

In [None]:
train_df.info()

In [None]:
len(train_df.measurement_number.value_counts())
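
The data description states that every series has the same number of measurements. A quick way to verify that claim is to count measurements per `series_id`; the sketch below does this on a small made-up frame (three measurements per series instead of 128).

```python
import pandas as pd

# Toy frame: 2 series with 3 measurements each
toy = pd.DataFrame({
    'series_id': [0, 0, 0, 1, 1, 1],
    'measurement_number': [0, 1, 2, 0, 1, 2],
})

# Number of distinct measurements in each series
per_series = toy.groupby('series_id')['measurement_number'].nunique()
# True when every series has the same length
all_same_length = per_series.nunique() == 1
print(per_series.tolist(), all_same_length)
```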

There are nine surface types, and we need to classify them using the data collected from the sensors.
Let's show how many series belong to each surface

In [None]:
target_df['surface'].value_counts().rename_axis('target').reset_index(name='count')

*and visualize it.*

In [None]:
sns.set(style='whitegrid')
sns.countplot(y = 'surface',
              data = target_df,
              order = target_df['surface'].value_counts().index)
plt.show()

In [None]:
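fig, ax = plt.subplots(1, 1, figsize=(15, 6))
# Correlate numeric columns only; a non-numeric ID column would make corr() fail on recent pandas
corr = train_df.select_dtypes(include='number').corr()
# Hide the upper triangle, which mirrors the lower one
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
hm = sns.heatmap(corr,
                 ax=ax,
                 mask=mask,
                 cmap='Blues',
                 annot=True,
                 fmt='.2f',
                 linewidths=0.05)
fig.subplots_adjust(top=0.93)
fig.suptitle('Features Correlation Heatmap',
             fontsize=14,
             fontweight='bold');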

# <a id='4'></a>4. Data preparation
## 4.1 NaN in data

In [None]:
print('Are there NaNs in {}?: {}\n'.format('train_df',train_df.isnull().values.any())+
      'Are there NaNs in {}?: {}\n'.format('test_df',test_df.isnull().values.any())+
      'Are there NaNs in {}?: {}\n'.format('target_df',target_df.isnull().values.any()))

OK, the data contains no NaNs, so we can continue.

## 4.2 Encoding categorical data
The target dataframe stores the surfaces as strings, which must be converted into integer labels.

In [None]:
le = LabelEncoder()
le.fit(target_df['surface'])
target_df['surface'] = le.transform(target_df['surface'])

In [None]:
target_df['surface'].value_counts()

Now the surfaces are encoded as integers instead of strings, and we are ready to work with the data.
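
As a small illustration of what the encoder does, the sketch below fits a separate `LabelEncoder` on a made-up list of surfaces. Classes are sorted alphabetically before integer codes are assigned, and `inverse_transform` recovers the original strings. The names `toy_surfaces` and `toy_le` are introduced only for this example.

```python
from sklearn.preprocessing import LabelEncoder

toy_surfaces = ['wood', 'carpet', 'concrete', 'wood']

toy_le = LabelEncoder()
codes = toy_le.fit_transform(toy_surfaces)   # strings -> integer codes
decoded = toy_le.inverse_transform(codes)    # integer codes -> strings

print(list(toy_le.classes_))  # classes in alphabetical order
print(list(codes))
print(list(decoded))
```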

## 4.3 Data transformation

In [None]:
def get_features(df):
    """Aggregate each sensor channel into per-series summary statistics."""
    id_cols = ['row_id', 'series_id', 'measurement_number']
    result_df = pd.DataFrame()
    for col in df.columns:
        if col in id_cols:
            continue
        # group once per channel and reuse the grouped object
        grouped = df.groupby('series_id')[col]
        result_df['{}_mean'.format(col)] = grouped.mean()
        result_df['{}_max'.format(col)] = grouped.max()
        result_df['{}_min'.format(col)] = grouped.min()
        result_df['{}_sum'.format(col)] = grouped.sum()
        # mean absolute difference between consecutive measurements
        result_df['{}_mean_abs_change'.format(col)] = grouped.apply(
            lambda x: np.mean(np.abs(np.diff(x))))
    return result_df

In [None]:
%%time
train_df = get_features(train_df)
test_df = get_features(test_df)
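
To make the aggregation concrete, the sketch below computes three of the same statistics for a single made-up channel with two short series. The values are invented for illustration; only the feature definitions follow `get_features`.

```python
import numpy as np
import pandas as pd

# Two toy series with four measurements each
toy = pd.DataFrame({
    'series_id': [0, 0, 0, 0, 1, 1, 1, 1],
    'angular_velocity_X': [1.0, 3.0, 2.0, 5.0, 0.0, 0.0, 0.0, 0.0],
})

grouped = toy.groupby('series_id')['angular_velocity_X']
toy_features = pd.DataFrame({
    'mean': grouped.mean(),
    'max': grouped.max(),
    # mean of |x[i+1] - x[i]| within each series
    'mean_abs_change': grouped.apply(lambda x: np.mean(np.abs(np.diff(x)))),
})
print(toy_features)
```

For series 0 the consecutive differences are 2, -1, and 3, so the mean absolute change is 2.0; the constant series 1 gets 0.0. Each series collapses into one row, which is why the feature frames are indexed by `series_id`.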

In [None]:
# replace NaN with 0
train_df.fillna(0, inplace=True)
test_df.fillna(0, inplace=True)

# replace infinite values with 0
train_df.replace([np.inf, -np.inf], 0, inplace=True)
test_df.replace([np.inf, -np.inf], 0, inplace=True)

## 4.4 Scaling

In [None]:
#Feature scaling
sc = StandardScaler()
train_df = sc.fit_transform(train_df)
test_df = sc.transform(test_df)
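
The scaler is fitted on the training features only and then reused for the test features, so both sets are transformed with the same means and standard deviations. The sketch below shows this on made-up arrays. Note that `fit_transform` and `transform` return NumPy arrays for DataFrame input by default, which is why the modeling code indexes rows by position.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

toy_sc = StandardScaler()
toy_train_scaled = toy_sc.fit_transform(toy_train)  # learn mean/std on train
toy_test_scaled = toy_sc.transform(toy_test)        # reuse train mean/std

print(toy_train_scaled.mean(axis=0), toy_train_scaled.std(axis=0))
print(toy_test_scaled)
```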

# <a id='5'></a>5. Modeling 

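We train a random forest with 10-fold cross-validation, using StratifiedKFold so that every fold keeps roughly the same class proportions as the full training set. This matches scikit-learn's own default: when `cv` is given as an integer or None, StratifiedKFold is used if the estimator is a classifier and y is either binary or multiclass. In all other cases, KFold is used.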
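
A minimal sketch of what stratification means, using made-up labels with a 2:1 class ratio. Each validation fold should contain the classes in that same ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

toy_y = np.array([0] * 8 + [1] * 4)   # 2:1 class ratio
toy_X = np.zeros((len(toy_y), 1))     # features do not affect the split

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
# Per-fold class counts in the validation part
valid_counts = [np.bincount(toy_y[valid_idx], minlength=2).tolist()
                for _, valid_idx in skf.split(toy_X, toy_y)]
print(valid_counts)
```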

In [None]:
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=111222)
n_classes = len(le.classes_)  # number of surface classes
sub_preds_rf = np.zeros((test_df.shape[0], n_classes))
oof_preds_rf = np.zeros(train_df.shape[0])
score = 0

print('start training')

for fold, (train_index, valid_index) in enumerate(folds.split(train_df, target_df['surface'])):

    print('Fold {}'.format(fold + 1))

    clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)
    clf.fit(train_df[train_index], target_df['surface'][train_index])
    # out-of-fold predictions for the held-out part of the training set
    oof_preds_rf[valid_index] = clf.predict(train_df[valid_index])
    # average the test-set class probabilities over all folds
    sub_preds_rf += clf.predict_proba(test_df) / folds.n_splits
    fold_score = clf.score(train_df[valid_index], target_df['surface'][valid_index])
    score += fold_score

    print('score : {}'.format(fold_score))

print('avg accuracy : {}'.format(score / folds.n_splits))
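
The loop also stores out-of-fold predictions in `oof_preds_rf`, which can be compared with the true labels to get one overall accuracy estimate. The sketch below shows the idea on synthetic, balanced data rather than the competition data; all names in it are introduced for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic, balanced 3-class data (stand-in for the real features)
rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1, 2], 40)
X_toy = rng.normal(size=(120, 4)) + y_toy[:, None]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof_toy = np.zeros(len(y_toy), dtype=int)
fold_scores = []
for train_idx, valid_idx in skf.split(X_toy, y_toy):
    model = RandomForestClassifier(n_estimators=20, random_state=0)
    model.fit(X_toy[train_idx], y_toy[train_idx])
    # every sample is predicted exactly once, by a model that never saw it
    oof_toy[valid_idx] = model.predict(X_toy[valid_idx])
    fold_scores.append(model.score(X_toy[valid_idx], y_toy[valid_idx]))

# Accuracy = fraction of samples whose out-of-fold prediction is correct
oof_accuracy = np.mean(oof_toy == y_toy)
print(oof_accuracy, np.mean(fold_scores))
```

Because the folds here are all the same size, the out-of-fold accuracy equals the mean of the per-fold scores; with unequal folds the two numbers can differ slightly.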


# <a id='6'></a>6. Submission  

In this competition submissions are evaluated on Multiclass Accuracy, which is simply the fraction of observations with the correct label.
For each series_id in the test set, we must predict a value for the surface variable. The file should have the following format:

    series_id,surface
    0,fine_concrete
    1,concrete
    2,concrete
    etc.
    
To produce this format, use the `inverse_transform()` method of the fitted encoder (`le.inverse_transform()`) to map the integer labels back to the original surface names.
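
A toy version of this step: take the most probable class per row with `argmax`, then decode the integer codes. The probabilities and the three-class encoder below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

demo_le = LabelEncoder().fit(['carpet', 'concrete', 'wood'])

# One row per test series, one column per class (in encoder order)
demo_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# Most probable class index per row, mapped back to surface names
demo_labels = demo_le.inverse_transform(demo_probs.argmax(axis=1))
print(list(demo_labels))
```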

In [None]:
submit = pd.read_csv('../input/sample_submission.csv')
submit['surface'] = le.inverse_transform(sub_preds_rf.argmax(axis=1))
submit.to_csv('submit.csv', index=False)

print('Ready')

# <a id='7'></a>7. References  

1) https://www.kaggle.com/taigokuriyama/random-forest-baseline