# Stay Alert! The Ford Challenge

Kaggle: https://www.kaggle.com/c/stayalert

Driving while not alert can be deadly. The objective is to design a classifier that will detect whether the driver is alert or not alert, employing data that are acquired while driving.

Driving while distracted, fatigued or drowsy may lead to accidents. Activities that divert the driver's attention from the road ahead, such as engaging in a conversation with other passengers in the car, making or receiving phone calls, sending or receiving text messages, eating while driving or events outside the car may cause driver distraction. Fatigue and drowsiness can result from driving long hours or from lack of sleep.

#### Data Description
The data for this challenge shows the results of a number of "trials", each one representing about 2 minutes of sequential data that are recorded every 100 ms during a driving session on the road or in a driving simulator.  The trials are samples from some 100 drivers of both genders, and of different ages and ethnic backgrounds. The files are structured as follows:

Note:  The actual names and measurement units of the physiological, environmental and vehicular data are not disclosed in this challenge. Models which use fewer physiological variables (columns with names starting with 'P') are of particular interest, therefore competitors are encouraged to consider models which require fewer of these variables.
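
The note above identifies the physiological variables by their 'P' prefix. As a minimal sketch of how such columns could be excluded when building a reduced model, the helper below (`drop_physiological` is a hypothetical name, and the tiny frame is made up) matches names of the form `P<number>` so that unrelated columns starting with 'P' are left alone:

```python
import re
import pandas as pd

def drop_physiological(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the physiological columns (P1, P2, ...)."""
    # Hypothetical helper: match 'P' followed only by digits.
    p_cols = [c for c in df.columns if re.fullmatch(r'P\d+', c)]
    return df.drop(columns=p_cols)

# Made-up miniature frame using the same naming scheme as the challenge data.
toy = pd.DataFrame({
    'TrialID': [0, 0],
    'P1': [34.7, 34.4],
    'E1': [1.0, 2.0],
    'V1': [0.1, 0.2],
})

reduced = drop_physiological(toy)
print(list(reduced.columns))
```

The same helper could be applied to the training frame before fitting to check how much accuracy is lost without the physiological inputs.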

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
from sklearn import model_selection, ensemble, preprocessing, pipeline, metrics, linear_model
from matplotlib import pyplot as plt
import xgboost as xgb
%pylab inline

Populating the interactive namespace from numpy and matplotlib


## Loading the data

In [2]:
dataset=pd.read_csv('fordTrain.csv')
dataset.head()

Unnamed: 0,TrialID,ObsNum,IsAlert,P1,P2,P3,P4,P5,P6,P7,...,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
0,0,0,0,34.7406,9.84593,1400,42.8571,0.290601,572,104.895,...,0.175,752,5.99375,0,2005,0,13.4,0,4,14.8004
1,0,1,0,34.4215,13.4112,1400,42.8571,0.290601,572,104.895,...,0.455,752,5.99375,0,2007,0,13.4,0,4,14.7729
2,0,2,0,34.3447,15.1852,1400,42.8571,0.290601,576,104.167,...,0.28,752,5.99375,0,2011,0,13.4,0,4,14.7736
3,0,3,0,34.3421,8.84696,1400,42.8571,0.290601,576,104.167,...,0.07,752,5.99375,0,2015,0,13.4,0,4,14.7667
4,0,4,0,34.3322,14.6994,1400,42.8571,0.290601,576,104.167,...,0.175,752,5.99375,0,2017,0,13.4,0,4,14.7757


__TrialID__ - each period of around 2 minutes of sequential data has a unique trial ID. For instance, the first 1210 observations represent sequential observations every 100 ms, and therefore all have the same trial ID

__ObsNum__ - this is a sequentially increasing number within one trial ID

__IsAlert__ - has a value X for each row, where

    X = 1     if the driver is alert
    X = 0     if the driver is not alert

__P1, P2, ..., P8__ - represent physiological data;

__E1, E2, ..., E11__ - represent environmental data;

__V1, V2, ..., V11__ - represent vehicular data;
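
The relationship between trial length and sampling rate is simple arithmetic: 1210 observations at 100 ms each is 121 s, or about 2 minutes. The sketch below shows how the length of every trial could be checked with a `groupby`; the miniature frame is made up, and only the column names `TrialID` and `ObsNum` come from the data description.

```python
import pandas as pd

SAMPLE_PERIOD_S = 0.1  # one observation every 100 ms, per the data description

# Made-up miniature data: trial 0 has 3 rows, trial 1 has 2 rows.
toy = pd.DataFrame({
    'TrialID': [0, 0, 0, 1, 1],
    'ObsNum':  [0, 1, 2, 0, 1],
})

# Number of observations in each trial, then the implied duration in seconds.
rows_per_trial = toy.groupby('TrialID')['ObsNum'].count()
duration_s = rows_per_trial * SAMPLE_PERIOD_S
print(duration_s)
```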
 

In [3]:
dataset.shape

(604329, 33)

In [4]:
dataset.isnull().values.any()

False

## Data preprocessing
### Feature types

In [5]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 604329 entries, 0 to 604328
Data columns (total 33 columns):
TrialID    604329 non-null int64
ObsNum     604329 non-null int64
IsAlert    604329 non-null int64
P1         604329 non-null float64
P2         604329 non-null float64
P3         604329 non-null int64
P4         604329 non-null float64
P5         604329 non-null float64
P6         604329 non-null int64
P7         604329 non-null float64
P8         604329 non-null int64
E1         604329 non-null float64
E2         604329 non-null float64
E3         604329 non-null int64
E4         604329 non-null int64
E5         604329 non-null float64
E6         604329 non-null int64
E7         604329 non-null int64
E8         604329 non-null int64
E9         604329 non-null int64
E10        604329 non-null int64
E11        604329 non-null float64
V1         604329 non-null float64
V2         604329 non-null float64
V3         604329 non-null int64
V4         604329 non-null float64
V5     

### Correlation analysis

In [6]:
corr=pd.DataFrame(dataset.drop('IsAlert', axis=1)).corrwith(dataset['IsAlert'])
corr=pd.DataFrame(corr)
corr.columns=['IsAlert']
corr.sort_values('IsAlert')

Unnamed: 0,IsAlert
E7,-0.329722
E8,-0.28344
V1,-0.269967
V10,-0.259607
V6,-0.24415
E6,-0.189198
V8,-0.16555
E1,-0.16083
TrialID,-0.145816
E2,-0.105495


The correlation between individual features and the target label is weak (for example, E7 at about -0.33), so purely linear methods are unlikely to solve the task well.
Some features are strongly correlated with one another:
- P3 & P4
- E1 & E2
- E7 & E8
- E7 & E9
- V1 & V6 & V10
- V6 & V8

The features P8, V7 and V9 also contain no usable values (the null check above found no missing entries, so these columns appear to be constant).

Looking ahead, however, removing the correlated features does not improve the final result for this task.
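
The notebook does not show how the correlated pairs and the uninformative columns were found. One possible way to list them is sketched below on synthetic data; the helper names and the 0.9 threshold are assumptions made for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

def constant_columns(df):
    """Columns that take a single value and so carry no information."""
    return [c for c in df.columns if df[c].nunique() <= 1]

def correlated_pairs(df, threshold=0.9):
    """Feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic example: b is almost a multiple of a, c is independent, z is constant.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': 2 * a + 0.01 * rng.normal(size=200),
    'c': rng.normal(size=200),
    'z': np.zeros(200),
})

const = constant_columns(toy)
pairs = correlated_pairs(toy.drop(columns=const))
print(const)
print(pairs)
```

Constant columns are dropped before computing correlations because their correlation with anything else is undefined.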

### Assessing the sample

In [7]:
data=dataset.drop('IsAlert', axis=1)
labels=dataset['IsAlert'].values

In [8]:
# Stratified split of the sample
train_data, test_data, train_labels, test_labels=model_selection.train_test_split(
    data, labels, test_size=0.7, random_state=0, stratify=labels)

print ('train:', '\n', 'class:1', ((np.sum(train_labels==1))/(len(train_labels))))
print ('class:0', (1-(np.sum(train_labels==1))/(len(train_labels))))

print ('test:', '\n', 'class:1', ((np.sum(test_labels==1))/(len(test_labels))))
print ('class:0', (1-(np.sum(test_labels==1))/(len(test_labels))))

train: 
 class:1 0.5787984423435448
class:0 0.42120155765645517
test: 
 class:1 0.5787991896574955
class:0 0.42120081034250445
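
The matching class shares in the two parts are the effect of `stratify=labels`. Note also that `test_size=0.7` holds out 70% of the rows, so the model is trained on the remaining 30%. A minimal sketch on synthetic labels with the same 58/42 split (the arrays are made up) shows the behaviour:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same class shares as the notebook data.
y = np.array([1] * 580 + [0] * 420)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, not meaningful

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.7, random_state=0, stratify=y)

# With stratification both parts keep the original share of class 1.
print('train size:', len(y_tr), 'share of class 1:', y_tr.mean())
print('test size:', len(y_te), 'share of class 1:', y_te.mean())
```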


In [9]:
# Class balance of the sample

print ('class:1', ((np.sum(labels==1))/(len(labels))))
print ('class:0', (1-(np.sum(labels==1))/(len(labels))))

class:1 0.578798965464176
class:0 0.421201034535824


The shares of class 1 and class 0 are reasonably close (about 58% vs. 42%), so the sample does not need rebalancing.

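## Applying the XGBClassifier model
The features are on very different scales, so a scaling step is added to the pipeline. Tree-based models such as XGBoost are largely insensitive to feature scaling, so this step is a precaution rather than a requirement.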

In [10]:
estimator = xgb.XGBClassifier(learning_rate=0.1, n_estimators=10, max_depth=3, min_child_weight=3)
estimator=pipeline.Pipeline(steps=[('scaling', preprocessing.StandardScaler()), ('classification', estimator)])
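
What the 'scaling' step does can be sketched on a small made-up array with two columns of very different magnitude. `StandardScaler` learns a per-column mean and standard deviation in `fit` and applies `(x - mean) / std` in `transform`; inside a `Pipeline` those statistics are learned from the training data only and reused at prediction time.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: one column in the hundreds or thousands, one well below 1.
X = np.array([[1400.0, 0.29],
              [1000.0, 0.30],
              [1200.0, 0.31],
              [800.0, 0.28]])

scaler = StandardScaler().fit(X)   # learns per-column mean and std
X_scaled = scaler.transform(X)     # applies (x - mean) / std

col_means = X_scaled.mean(axis=0)  # close to 0 in each column
col_stds = X_scaled.std(axis=0)    # close to 1 in each column
print(col_means, col_stds)
```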

In [11]:
%%time
estimator.fit(train_data, train_labels)

Wall time: 2.66 s


Pipeline(memory=None,
         steps=[('scaling',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('classification',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, gamma=0, learning_rate=0.1,
                               max_delta_step=0, max_depth=3,
                               min_child_weight=3, missing=None,
                               n_estimators=10, n_jobs=1, nthread=None,
                               objective='binary:logistic', random_state=0,
                               reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
                               seed=None, silent=None, subsample=1,
                               verbosity=1))],
         verbose=False)

In [12]:
score=metrics.mean_squared_error(test_labels, estimator.predict(test_data))
score

0.1490812730036333

In [13]:
score_matrix=metrics.confusion_matrix(test_labels, estimator.predict(test_data))
score_matrix

array([[124540,  53641],
       [  9425, 235425]], dtype=int64)

In [14]:
score_accuracy=metrics.accuracy_score(test_labels, estimator.predict(test_data))
score_accuracy

0.8509187269963667
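
The mean squared error and the accuracy reported above sum to 1 (0.1491 + 0.8509). This is expected for 0/1 labels: every wrong prediction contributes a squared error of exactly 1, so the MSE is the error rate. A small sketch with made-up labels illustrates it:

```python
import numpy as np

# Made-up labels and predictions: 3 of the 8 predictions are wrong.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 1])

mse = np.mean((y_true - y_pred) ** 2)  # each mistake contributes (+1 or -1)^2 = 1
accuracy = np.mean(y_true == y_pred)

print(mse, accuracy, mse + accuracy)
```

In other words, the MSE adds no information beyond the accuracy for a hard 0/1 classifier.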

In [15]:
report=metrics.classification_report(test_labels, estimator.predict(test_data))
print(report)

              precision    recall  f1-score   support

           0       0.93      0.70      0.80    178181
           1       0.81      0.96      0.88    244850

    accuracy                           0.85    423031
   macro avg       0.87      0.83      0.84    423031
weighted avg       0.86      0.85      0.85    423031



## Building predictions on the test set
### Data preparation

In [16]:
# test set
dataset_test=pd.read_csv('fordTest.csv')

In [17]:
dataset_test.head()

Unnamed: 0,TrialID,ObsNum,IsAlert,P1,P2,P3,P4,P5,P6,P7,...,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
0,0,0,?,38.4294,10.9435,1000,60.0,0.302277,508,118.11,...,0.0,255,4.50625,0,2127,0,17.6,0,4,16.1937
1,0,1,?,38.3609,15.3212,1000,60.0,0.302277,508,118.11,...,0.0,255,4.50625,0,2127,0,17.6,0,4,16.1744
2,0,2,?,38.2342,11.514,1000,60.0,0.302277,508,118.11,...,0.07,255,4.50625,0,2131,0,17.6,0,4,16.1602
3,0,3,?,37.9304,12.2615,1000,60.0,0.302277,508,118.11,...,0.07,255,4.50625,0,2131,0,17.6,0,4,16.1725
4,0,4,?,37.8085,12.3666,1000,60.0,0.302277,504,119.048,...,0.0,255,4.50625,0,2136,0,17.6,0,4,16.1459


In [18]:
dataset_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120840 entries, 0 to 120839
Data columns (total 33 columns):
TrialID    120840 non-null int64
ObsNum     120840 non-null int64
IsAlert    120840 non-null object
P1         120840 non-null float64
P2         120840 non-null float64
P3         120840 non-null int64
P4         120840 non-null float64
P5         120840 non-null float64
P6         120840 non-null int64
P7         120840 non-null float64
P8         120840 non-null int64
E1         120840 non-null float64
E2         120840 non-null float64
E3         120840 non-null int64
E4         120840 non-null int64
E5         120840 non-null float64
E6         120840 non-null int64
E7         120840 non-null int64
E8         120840 non-null int64
E9         120840 non-null int64
E10        120840 non-null int64
E11        120840 non-null float64
V1         120840 non-null float64
V2         120840 non-null float64
V3         120840 non-null int64
V4         120840 non-null float64
V5    

In [19]:
# ground-truth labels for the test set
dataset_solution=pd.read_csv('solution.csv')

In [20]:
dataset_solution.head()

Unnamed: 0,TrialID,ObsNum,Prediction,Indicator
0,0,0,1,Public
1,0,1,1,Public
2,0,2,1,Private
3,0,3,1,Private
4,0,4,1,Private


In [21]:
dataset_solution.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120840 entries, 0 to 120839
Data columns (total 4 columns):
TrialID       120840 non-null int64
ObsNum        120840 non-null int64
Prediction    120840 non-null int64
Indicator     120840 non-null object
dtypes: int64(3), object(1)
memory usage: 3.7+ MB


In [22]:
dataset_test['IsAlert']=dataset_solution['Prediction']
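
The assignment above relies on both files listing rows in the same order, which the `head()` outputs suggest but do not prove. An alternative that does not depend on row order is to join on the `TrialID` and `ObsNum` keys. This is a sketch on made-up miniature tables, not the notebook's own method:

```python
import pandas as pd

# Made-up miniature versions of the test and solution tables.
test = pd.DataFrame({
    'TrialID': [0, 0, 1],
    'ObsNum':  [0, 1, 0],
    'IsAlert': ['?', '?', '?'],
})
# Rows are deliberately shuffled to show that the join ignores order.
solution = pd.DataFrame({
    'TrialID':    [1, 0, 0],
    'ObsNum':     [0, 1, 0],
    'Prediction': [0, 1, 1],
})

merged = test.drop(columns='IsAlert').merge(
    solution[['TrialID', 'ObsNum', 'Prediction']],
    on=['TrialID', 'ObsNum'],
    how='left',
    validate='one_to_one',  # raises if a key is duplicated on either side
)
merged = merged.rename(columns={'Prediction': 'IsAlert'})
print(merged)
```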

In [23]:
dataset_test.head()

Unnamed: 0,TrialID,ObsNum,IsAlert,P1,P2,P3,P4,P5,P6,P7,...,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
0,0,0,1,38.4294,10.9435,1000,60.0,0.302277,508,118.11,...,0.0,255,4.50625,0,2127,0,17.6,0,4,16.1937
1,0,1,1,38.3609,15.3212,1000,60.0,0.302277,508,118.11,...,0.0,255,4.50625,0,2127,0,17.6,0,4,16.1744
2,0,2,1,38.2342,11.514,1000,60.0,0.302277,508,118.11,...,0.07,255,4.50625,0,2131,0,17.6,0,4,16.1602
3,0,3,1,37.9304,12.2615,1000,60.0,0.302277,508,118.11,...,0.07,255,4.50625,0,2131,0,17.6,0,4,16.1725
4,0,4,1,37.8085,12.3666,1000,60.0,0.302277,504,119.048,...,0.0,255,4.50625,0,2136,0,17.6,0,4,16.1459


In [24]:
test_test_data=dataset_test.drop('IsAlert', axis=1)
test_test_labels=dataset_test['IsAlert'].values

### Evaluating the model on the test set

In [25]:
score_test=metrics.mean_squared_error(test_test_labels, estimator.predict(test_test_data))
score_test

0.11890930155577624

In [26]:
report_test=metrics.classification_report(test_test_labels, estimator.predict(test_test_data))
print(report_test)

              precision    recall  f1-score   support

           0       0.99      0.52      0.69     29914
           1       0.86      1.00      0.93     90926

    accuracy                           0.88    120840
   macro avg       0.93      0.76      0.81    120840
weighted avg       0.90      0.88      0.87    120840



In [27]:
score_matrix_test=metrics.confusion_matrix(test_test_labels, estimator.predict(test_test_data))
score_matrix_test

array([[15673, 14241],
       [  128, 90798]], dtype=int64)

In [28]:
score_test_accuracy=metrics.accuracy_score(test_test_labels, estimator.predict(test_test_data))
score_test_accuracy

0.8810906984442237

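For this task it matters more that the algorithm correctly identifies class '1', and recall for that class on the test set is close to 1.00 (90798 of 90926 rows), so the result can be considered good. Recall for class '0' is much lower, at 0.52.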

In [29]:
predictions=pd.DataFrame(estimator.predict(test_test_data))
predictions.columns=['Prediction']
predictions['TrialID']=dataset_solution['TrialID']
predictions['ObsNum']=dataset_solution['ObsNum']

In [30]:
predictions.to_csv('predictions.csv')
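
By default `to_csv` also writes the row index as an extra unnamed first column. If only the three named columns are wanted in the file, one option is to pass `index=False` and fix the column order explicitly. The sketch below uses a made-up frame and an in-memory buffer in place of 'predictions.csv':

```python
import io
import pandas as pd

# Made-up frame with the same columns as `predictions` above.
predictions = pd.DataFrame({
    'Prediction': [1, 1, 0],
    'TrialID':    [0, 0, 1],
    'ObsNum':     [0, 1, 0],
})

buffer = io.StringIO()  # stands in for a file so the sketch writes nothing to disk
predictions[['TrialID', 'ObsNum', 'Prediction']].to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```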