# Driver Alertness Detection

Project workflow:

- Downloading a real-world dataset from a Kaggle competition
- Exploratory data analysis
- Identifying input and target columns
- Separating numeric and categorical columns
- Checking for and imputing missing values in numeric columns
- Scaling numeric columns to the range (0, 1)
- Encoding categorical columns (if any)
- Splitting the data into training and validation sets
- Defining models and making predictions
- Hyperparameter tuning
- Preprocessing the test data
- Making predictions on the test data

### Import Libraries and Load the Data

In [None]:
!pip install jovian opendatasets xgboost graphviz lightgbm scikit-learn --upgrade --quiet

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
import os
import opendatasets as od
import pandas as pd
pd.set_option("display.max_columns", 120)
pd.set_option("display.max_rows", 120)

**Problem Statement:**  
The objective of this challenge is to design a classifier that detects whether a driver is alert or not, using any combination of vehicular, environmental, and driver physiological data acquired while driving.

Dataset: [Stay Alert! The Ford Challenge](https://www.kaggle.com/c/stayalert/data).

In [None]:
od.download('https://www.kaggle.com/c/stayalert/data')

Skipping, found downloaded files in "./stayalert" (use force=True to force download)


In [None]:
os.listdir('stayalert')

['Solution.csv', 'fordTrain.csv', 'fordTest.csv', 'example_submission.csv']

In [None]:
train = pd.read_csv('./stayalert/fordTrain.csv')
test = pd.read_csv('./stayalert/fordTest.csv')

In [None]:
train.head()

Unnamed: 0,TrialID,ObsNum,IsAlert,P1,P2,P3,P4,P5,P6,P7,P8,E1,E2,E3,E4,E5,E6,E7,E8,E9,E10,E11,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
0,0,0,0,34.7406,9.84593,1400,42.8571,0.290601,572,104.895,0,0.0,0.0,1,-20,0.015875,324,1,1,1,57,0.0,101.96,0.175,752,5.99375,0,2005,0,13.4,0,4,14.8004
1,0,1,0,34.4215,13.4112,1400,42.8571,0.290601,572,104.895,0,0.0,0.0,1,-20,0.015875,324,1,1,1,57,0.0,101.98,0.455,752,5.99375,0,2007,0,13.4,0,4,14.7729
2,0,2,0,34.3447,15.1852,1400,42.8571,0.290601,576,104.167,0,0.0,0.0,1,-20,0.015875,324,1,1,1,57,0.0,101.97,0.28,752,5.99375,0,2011,0,13.4,0,4,14.7736
3,0,3,0,34.3421,8.84696,1400,42.8571,0.290601,576,104.167,0,0.0,0.0,1,-20,0.015875,324,1,1,1,57,0.0,101.99,0.07,752,5.99375,0,2015,0,13.4,0,4,14.7667
4,0,4,0,34.3322,14.6994,1400,42.8571,0.290601,576,104.167,0,0.0,0.0,1,-20,0.015875,324,1,1,1,57,0.0,102.07,0.175,752,5.99375,0,2017,0,13.4,0,4,14.7757


In [None]:
test.head()

Unnamed: 0,TrialID,ObsNum,IsAlert,P1,P2,P3,P4,P5,P6,P7,P8,E1,E2,E3,E4,E5,E6,E7,E8,E9,E10,E11,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
0,0,0,?,38.4294,10.9435,1000,60.0,0.302277,508,118.11,0,0.0,0.0,4,4,0.015434,328,1,1,1,64,0.0,108.57,0.0,255,4.50625,0,2127,0,17.6,0,4,16.1937
1,0,1,?,38.3609,15.3212,1000,60.0,0.302277,508,118.11,0,0.0,0.0,4,4,0.015434,328,1,1,1,64,0.0,108.57,0.0,255,4.50625,0,2127,0,17.6,0,4,16.1744
2,0,2,?,38.2342,11.514,1000,60.0,0.302277,508,118.11,0,0.0,0.0,4,8,0.015938,328,1,1,1,65,0.0,108.65,0.07,255,4.50625,0,2131,0,17.6,0,4,16.1602
3,0,3,?,37.9304,12.2615,1000,60.0,0.302277,508,118.11,0,0.0,0.0,4,8,0.015938,328,1,1,1,65,0.0,108.65,0.07,255,4.50625,0,2131,0,17.6,0,4,16.1725
4,0,4,?,37.8085,12.3666,1000,60.0,0.302277,504,119.048,0,0.0,0.0,4,8,0.015938,328,1,1,1,65,0.0,108.57,0.0,255,4.50625,0,2136,0,17.6,0,4,16.1459


In [None]:
test['IsAlert'].value_counts()

?    120840
Name: IsAlert, dtype: int64

In [None]:
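# IsAlert is a '?' placeholder in the test file, so drop it.
# Passing the axis positionally no longer works in pandas 2.x; use columns= instead.
test.drop(columns=['IsAlert'], inplace=True)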

In [None]:
test.head()

Unnamed: 0,TrialID,ObsNum,P1,P2,P3,P4,P5,P6,P7,P8,E1,E2,E3,E4,E5,E6,E7,E8,E9,E10,E11,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
0,0,0,38.4294,10.9435,1000,60.0,0.302277,508,118.11,0,0.0,0.0,4,4,0.015434,328,1,1,1,64,0.0,108.57,0.0,255,4.50625,0,2127,0,17.6,0,4,16.1937
1,0,1,38.3609,15.3212,1000,60.0,0.302277,508,118.11,0,0.0,0.0,4,4,0.015434,328,1,1,1,64,0.0,108.57,0.0,255,4.50625,0,2127,0,17.6,0,4,16.1744
2,0,2,38.2342,11.514,1000,60.0,0.302277,508,118.11,0,0.0,0.0,4,8,0.015938,328,1,1,1,65,0.0,108.65,0.07,255,4.50625,0,2131,0,17.6,0,4,16.1602
3,0,3,37.9304,12.2615,1000,60.0,0.302277,508,118.11,0,0.0,0.0,4,8,0.015938,328,1,1,1,65,0.0,108.65,0.07,255,4.50625,0,2131,0,17.6,0,4,16.1725
4,0,4,37.8085,12.3666,1000,60.0,0.302277,504,119.048,0,0.0,0.0,4,8,0.015938,328,1,1,1,65,0.0,108.57,0.0,255,4.50625,0,2136,0,17.6,0,4,16.1459


In [None]:
train.shape

(604329, 33)

In [None]:
train.columns

Index(['TrialID', 'ObsNum', 'IsAlert', 'P1', 'P2', 'P3', 'P4', 'P5', 'P6',
       'P7', 'P8', 'E1', 'E2', 'E3', 'E4', 'E5', 'E6', 'E7', 'E8', 'E9', 'E10',
       'E11', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11'],
      dtype='object')

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 604329 entries, 0 to 604328
Data columns (total 33 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   TrialID  604329 non-null  int64  
 1   ObsNum   604329 non-null  int64  
 2   IsAlert  604329 non-null  int64  
 3   P1       604329 non-null  float64
 4   P2       604329 non-null  float64
 5   P3       604329 non-null  int64  
 6   P4       604329 non-null  float64
 7   P5       604329 non-null  float64
 8   P6       604329 non-null  int64  
 9   P7       604329 non-null  float64
 10  P8       604329 non-null  int64  
 11  E1       604329 non-null  float64
 12  E2       604329 non-null  float64
 13  E3       604329 non-null  int64  
 14  E4       604329 non-null  int64  
 15  E5       604329 non-null  float64
 16  E6       604329 non-null  int64  
 17  E7       604329 non-null  int64  
 18  E8       604329 non-null  int64  
 19  E9       604329 non-null  int64  
 20  E10      604329 non-null  

In [None]:
train.describe()

Unnamed: 0,TrialID,ObsNum,IsAlert,P1,P2,P3,P4,P5,P6,P7,P8,E1,E2,E3,E4,E5,E6,E7,E8,E9,E10,E11,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
count,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0,604329.0
mean,250.167657,603.841765,0.578799,35.44902,11.996525,1026.671035,64.061965,0.178923,845.38461,77.887628,0.0,10.512332,102.790045,0.290565,-4.230136,0.016262,358.674738,1.757296,1.383058,0.876787,63.311256,1.315265,76.965412,-0.03771,573.786433,19.96103,0.179814,1715.688383,0.0,12.710354,0.0,3.312257,11.668277
std,145.446164,348.931601,0.493752,7.484629,3.760292,309.277877,19.75595,0.372309,2505.335141,18.57793,0.0,14.049071,127.258629,1.006162,35.508596,0.002304,27.399973,2.854852,1.608807,0.328681,18.891029,5.247204,44.387031,0.403896,298.412888,63.269456,0.384033,618.17647,0.0,11.532085,0.0,1.243586,9.934423
min,0.0,0.0,0.0,-22.4812,-45.6292,504.0,23.8853,0.03892,128.0,0.262224,0.0,0.0,0.0,0.0,-250.0,0.008,260.0,0.0,0.0,0.0,0.0,0.0,0.0,-4.795,240.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.67673
25%,125.0,302.0,0.0,31.7581,9.90354,792.0,49.1803,0.09211,668.0,66.6667,0.0,0.0,0.0,0.0,-8.0,0.015686,348.0,0.0,0.0,1.0,52.0,0.0,41.93,-0.175,255.0,1.4875,0.0,1259.0,0.0,0.0,0.0,3.0,7.94768
50%,250.0,604.0,1.0,34.1451,11.4004,1000.0,60.0,0.105083,800.0,75.0,0.0,0.0,0.0,0.0,0.0,0.016001,365.0,1.0,1.0,1.0,67.0,0.0,100.4,0.0,511.0,3.01875,0.0,1994.0,0.0,12.8,0.0,4.0,10.7726
75%,374.0,906.0,1.0,37.3119,13.6442,1220.0,75.7576,0.138814,900.0,89.8204,0.0,28.24,211.584,0.0,6.0,0.016694,367.0,2.0,2.0,1.0,73.0,0.0,108.5,0.07,767.0,7.48125,0.0,2146.0,0.0,21.9,0.0,4.0,15.2709
max,510.0,1210.0,1.0,101.351,71.1737,2512.0,119.048,27.2022,228812.0,468.75,0.0,243.991,359.995,4.0,260.0,0.023939,513.0,25.0,9.0,1.0,127.0,52.4,129.7,3.99,1023.0,484.488,1.0,4892.0,0.0,82.1,0.0,7.0,262.534


### Identify Input and Target Columns

In [None]:
X = train.drop(columns=['IsAlert'])
y = train['IsAlert']

In [None]:
print(X.shape)

(604329, 32)


In [None]:
print(y.shape)

(604329,)


### Separate Numeric and Categorical Columns

In [None]:
num_cols = X.select_dtypes(include=np.number).columns.tolist()
cat_cols = X.select_dtypes(include='object').columns.tolist()

In [None]:
print(cat_cols)

[]


In [None]:
print(num_cols)

['TrialID', 'ObsNum', 'P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8', 'E1', 'E2', 'E3', 'E4', 'E5', 'E6', 'E7', 'E8', 'E9', 'E10', 'E11', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11']


### Handling missing values

In [None]:
X[num_cols].isna().sum()

TrialID    0
ObsNum     0
P1         0
P2         0
P3         0
P4         0
P5         0
P6         0
P7         0
P8         0
E1         0
E2         0
E3         0
E4         0
E5         0
E6         0
E7         0
E8         0
E9         0
E10        0
E11        0
V1         0
V2         0
V3         0
V4         0
V5         0
V6         0
V7         0
V8         0
V9         0
V10        0
V11        0
dtype: int64
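
No column has missing values, so no imputation is needed for this dataset. Had there been gaps, one option would be scikit-learn's `SimpleImputer`. The following is a minimal sketch on a small made-up frame; `demo`, `demo_cols`, and the values in them are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with gaps; the real training data has none.
demo = pd.DataFrame({"P1": [34.7, np.nan, 34.3], "V1": [101.9, 102.0, np.nan]})
demo_cols = ["P1", "V1"]

# Learn each column's mean from the values that are present, then fill the gaps.
imputer = SimpleImputer(strategy="mean").fit(demo[demo_cols])
demo[demo_cols] = imputer.transform(demo[demo_cols])

print(demo.isna().sum())
```

As with the scaler, the imputer would be fitted on the training data and then reused unchanged on the test data.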

### Scaling numeric columns to the range (0, 1)

In [None]:
X[num_cols].describe().loc[['min','max']]

Unnamed: 0,TrialID,ObsNum,P1,P2,P3,P4,P5,P6,P7,P8,E1,E2,E3,E4,E5,E6,E7,E8,E9,E10,E11,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
min,0.0,0.0,-22.4812,-45.6292,504.0,23.8853,0.03892,128.0,0.262224,0.0,0.0,0.0,0.0,-250.0,0.008,260.0,0.0,0.0,0.0,0.0,0.0,0.0,-4.795,240.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.67673
max,510.0,1210.0,101.351,71.1737,2512.0,119.048,27.2022,228812.0,468.75,0.0,243.991,359.995,4.0,260.0,0.023939,513.0,25.0,9.0,1.0,127.0,52.4,129.7,3.99,1023.0,484.488,1.0,4892.0,0.0,82.1,0.0,7.0,262.534


In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(train[num_cols])
X[num_cols] = scaler.transform(X[num_cols])

In [None]:
X[num_cols].describe().loc[['min','max']]

Unnamed: 0,TrialID,ObsNum,P1,P2,P3,P4,P5,P6,P7,P8,E1,E2,E3,E4,E5,E6,E7,E8,E9,E10,E11,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0
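
Three columns (`P8`, `V7`, `V9`) still show a maximum of 0.0 after scaling because they are constant: their minimum and maximum are both 0 in the original data. `MinMaxScaler` maps each value to `(x - min) / (max - min)`, and a constant column comes out as all zeros. The sketch below illustrates this on a made-up array (`toy` and its values are assumptions for illustration only) and also shows that values outside the fitted range land outside (0, 1), which can happen when the scaler is later applied to the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: the first column varies, the second is constant.
toy = np.array([[10.0, 0.0],
                [20.0, 0.0],
                [40.0, 0.0]])

toy_scaler = MinMaxScaler().fit(toy)
scaled = toy_scaler.transform(toy)

# Manual min-max formula for the first column, for comparison.
col = toy[:, 0]
manual = (col - col.min()) / (col.max() - col.min())

# A value larger than anything seen during fit is mapped above 1.
unseen = toy_scaler.transform(np.array([[50.0, 0.0]]))

print(scaled)
print(unseen)
```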


### Splitting the data

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.35)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_val.shape)
print(y_val.shape)

(392813, 32)
(392813,)
(211516, 32)
(211516,)
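
The split above is not seeded, so the exact rows in each set change between runs. A minimal sketch of a reproducible split that also preserves the class ratio is shown below, using synthetic data; `X_demo`, `y_demo`, and the choice of `random_state=42` are assumptions for illustration, and whether stratification changes the results on this dataset is not tested in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly the same class balance as IsAlert.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = (rng.random(1000) < 0.58).astype(int)

# random_state fixes the shuffle; stratify keeps the class ratio equal in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.35, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_va.shape)
print(y_tr.mean(), y_va.mean())
```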


### Defining the Models

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
lr_model = LogisticRegression().fit(X_train, y_train)
dt_model = DecisionTreeClassifier(random_state=42).fit(X_train,y_train)
rfc_model = RandomForestClassifier(n_jobs=-1, random_state=42).fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
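
The warning above means the logistic regression solver stopped at its iteration limit (the default `max_iter` is 100) before converging. One of the remedies the message suggests is raising `max_iter`. The sketch below tries this on synthetic data; the value 1000 and the names `X_demo`, `y_demo`, and `lr_demo` are assumptions, and the number of iterations the real dataset needs is not established here.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=42)

# Turn the convergence warning into an error so a failure to converge is not missed.
with warnings.catch_warnings():
    warnings.simplefilter("error", category=ConvergenceWarning)
    lr_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used.
print(lr_demo.n_iter_)
```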


In [None]:
#train preds
lr_train_preds = lr_model.predict(X_train)
dt_train_preds = dt_model.predict(X_train)
rfc_train_preds = rfc_model.predict(X_train)

#val preds
lr_val_preds = lr_model.predict(X_val)
dt_val_preds = dt_model.predict(X_val)
rfc_val_preds = rfc_model.predict(X_val)

### Evaluation

In [None]:
from sklearn import metrics

def score(targets, predictions):
    return metrics.accuracy_score(targets,predictions)

In [None]:
lr_train_score = score(y_train,lr_train_preds)
lr_val_score = score(y_val,lr_val_preds)

dt_train_score = score(y_train, dt_train_preds)
dt_val_score = score(y_val, dt_val_preds)

rfc_train_score = score(y_train,rfc_train_preds)
rfc_val_score = score(y_val,rfc_val_preds)

In [None]:
# Named scores_df so that the score() helper defined above is not overwritten.
scores_df = pd.DataFrame({
    'Models': ['Logistic Regression','DecisionTreeClassifier','RandomForestClassifier'],
    'Train_Score' : [lr_train_score,dt_train_score,rfc_train_score],
    'Validation_Score' : [lr_val_score,dt_val_score,rfc_val_score],
})

In [None]:
scores_df

Unnamed: 0,Models,Train_Score,Validation_Score
0,Logistic Regression,0.814599,0.815395
1,DecisionTreeClassifier,1.0,0.987282
2,RandomForestClassifier,1.0,0.992748


In [None]:
dt_model.score(X_train,y_train), dt_model.score(X_val, y_val)

(1.0, 0.9872822859736379)

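The decision tree reaches 100% accuracy on the training set but about 98.7% on the validation set, which suggests it has memorized the training data and is overfitting slightly.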
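
Accuracy is easier to interpret next to a baseline. The mean of `IsAlert` in `train.describe()` is about 0.579, so always predicting 1 would already score roughly 57.9% on the training data. The sketch below compares a model against such a majority-class baseline and prints a confusion matrix; the arrays `y_true` and `y_pred` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions, not taken from the dataset.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 1])

# Baseline: always predict the majority class (1).
baseline_pred = np.full_like(y_true, 1)
baseline_acc = accuracy_score(y_true, baseline_pred)
model_acc = accuracy_score(y_true, y_pred)

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true, y_pred)

print(baseline_acc, model_acc)
print(cm)
```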

### Hyperparameter tuning

**Training a Random Forest**  
While tuning the hyperparameters of a single decision tree may lead to some improvements, a much more effective strategy is to combine the results of several decision trees, each trained on a different random sample of the rows and considering a random subset of the features at each split. This is called a random forest model.

A random forest works by averaging/combining the results of several decision trees:

<img src="https://1.bp.blogspot.com/-Ax59WK4DE8w/YK6o9bt_9jI/AAAAAAAAEQA/9KbBf9cdL6kOFkJnU39aUn4m8ydThPenwCLcBGAsYHQ/s0/Random%2BForest%2B03.gif" width="640">
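
To make the averaging concrete, the sketch below fits a small forest on synthetic data and checks that the forest's predicted probabilities equal the mean of its individual trees' predicted probabilities, which is how scikit-learn's `RandomForestClassifier` combines them. The data and the names `X_demo`, `y_demo`, and `forest` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
forest = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Collect each tree's class probabilities: shape (n_trees, n_samples, n_classes).
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Average over the trees and compare with the forest's own output.
mean_proba = tree_probas.mean(axis=0)
print(np.allclose(mean_proba, forest.predict_proba(X_demo)))
```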


We'll use the `RandomForestClassifier` class from `sklearn.ensemble`.

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_jobs=-1, random_state=42, n_estimators=150)

In [None]:
model.fit(X_train, y_train)

RandomForestClassifier(n_estimators=150, n_jobs=-1, random_state=42)

In [None]:
print(model.score(X_train, y_train))
print(model.score(X_val, y_val))

1.0
0.992700315815352


In [None]:
def test_params(**params):
    model = RandomForestClassifier(random_state=42, n_jobs=-1, **params).fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_val, y_val)

In [None]:
test_params(max_depth=15)

(0.9677480124130311, 0.9660782163051495)

### `n_estimators`

This argument controls the number of decision trees in the random forest. The default value is 100. For larger datasets, it helps to have a greater number of estimators. As a general rule, try to have as few estimators as needed.

In [None]:
test_params(n_estimators=25)

(0.9999541766693057, 0.99226063276537)
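
A helper like `test_params` also makes it easy to sweep one hyperparameter and compare the scores side by side. The sketch below does this for `n_estimators` on synthetic data; `demo_params`, the data, and the candidate values are assumptions for illustration, and the scores it prints say nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training and validation sets.
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.35, random_state=42)

def demo_params(**params):
    """Fit a forest with the given parameters; return (train accuracy, validation accuracy)."""
    m = RandomForestClassifier(random_state=42, **params).fit(X_tr, y_tr)
    return m.score(X_tr, y_tr), m.score(X_va, y_va)

# Try a few forest sizes and print the scores for each.
results = {n: demo_params(n_estimators=n) for n in (5, 25, 100)}
for n, (train_acc, val_acc) in results.items():
    print(f"n_estimators={n}: train={train_acc:.4f}, val={val_acc:.4f}")
```

If the validation score stops improving as the number of trees grows, the smaller forest is the cheaper choice.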

In [None]:
test_params(min_samples_split=3, min_samples_leaf=2)

(0.9989435176534381, 0.9883413075133796)

In [None]:
model = RandomForestClassifier(n_jobs=-1,
                               random_state=42,
                               n_estimators=25,
                               min_samples_split=3,
                               min_samples_leaf=2,
                               max_depth=15)

In [None]:
model.fit(X_train, y_train)

RandomForestClassifier(max_depth=15, min_samples_leaf=2, min_samples_split=3,
                       n_estimators=25, n_jobs=-1, random_state=42)

In [None]:
model.score(X_train, y_train), model.score(X_val, y_val)

(0.9668417287615226, 0.9650475614137938)

### Preprocessing the test data


In [None]:
test.head()

Unnamed: 0,TrialID,ObsNum,P1,P2,P3,P4,P5,P6,P7,P8,E1,E2,E3,E4,E5,E6,E7,E8,E9,E10,E11,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
0,0,0,38.4294,10.9435,1000,60.0,0.302277,508,118.11,0,0.0,0.0,4,4,0.015434,328,1,1,1,64,0.0,108.57,0.0,255,4.50625,0,2127,0,17.6,0,4,16.1937
1,0,1,38.3609,15.3212,1000,60.0,0.302277,508,118.11,0,0.0,0.0,4,4,0.015434,328,1,1,1,64,0.0,108.57,0.0,255,4.50625,0,2127,0,17.6,0,4,16.1744
2,0,2,38.2342,11.514,1000,60.0,0.302277,508,118.11,0,0.0,0.0,4,8,0.015938,328,1,1,1,65,0.0,108.65,0.07,255,4.50625,0,2131,0,17.6,0,4,16.1602
3,0,3,37.9304,12.2615,1000,60.0,0.302277,508,118.11,0,0.0,0.0,4,8,0.015938,328,1,1,1,65,0.0,108.65,0.07,255,4.50625,0,2131,0,17.6,0,4,16.1725
4,0,4,37.8085,12.3666,1000,60.0,0.302277,504,119.048,0,0.0,0.0,4,8,0.015938,328,1,1,1,65,0.0,108.57,0.0,255,4.50625,0,2136,0,17.6,0,4,16.1459


In [None]:
test.shape

(120840, 32)

In [None]:
test.isna().sum()

TrialID    0
ObsNum     0
P1         0
P2         0
P3         0
P4         0
P5         0
P6         0
P7         0
P8         0
E1         0
E2         0
E3         0
E4         0
E5         0
E6         0
E7         0
E8         0
E9         0
E10        0
E11        0
V1         0
V2         0
V3         0
V4         0
V5         0
V6         0
V7         0
V8         0
V9         0
V10        0
V11        0
dtype: int64

In [None]:
test[num_cols] = scaler.transform(test[num_cols])
#test[encoded_cols] = encoder.transform(test[cat_cols])

In [None]:
test_preds = model.predict(test)

In [None]:
test_preds

array([1, 1, 1, ..., 1, 1, 1])

In [None]:
# Files in ./stayalert: 'Solution.csv', 'fordTrain.csv', 'fordTest.csv', 'example_submission.csv'

In [None]:
sol = pd.read_csv('./stayalert/Solution.csv')
sol.head()

Unnamed: 0,TrialID,ObsNum,Prediction,Indicator
0,0,0,1,Public
1,0,1,1,Public
2,0,2,1,Private
3,0,3,1,Private
4,0,4,1,Private


In [None]:
submission_df = pd.read_csv('./stayalert/example_submission.csv')

In [None]:
submission_df

Unnamed: 0,TrialID,ObsNum,Prediction
0,0,0,0
1,0,1,0
2,0,2,0
3,0,3,0
4,0,4,0
...,...,...,...
120835,99,1206,0
120836,99,1207,0
120837,99,1208,0
120838,99,1209,0


In [None]:
submission_df['Prediction'] = test_preds

In [None]:
submission_df.to_csv('submission.csv', index=False)
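
The predictions could also be checked locally against `Solution.csv`, assuming its `Prediction` column holds the true test labels (the notebook does not confirm this). The sketch below shows the idea on small made-up frames with the same column names; `sol_demo`, `sub_demo`, and their values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-ins for Solution.csv and the submission file.
sol_demo = pd.DataFrame({
    "TrialID": [0, 0, 0, 0],
    "ObsNum": [0, 1, 2, 3],
    "Prediction": [1, 1, 0, 1],
    "Indicator": ["Public", "Public", "Private", "Private"],
})
sub_demo = pd.DataFrame({
    "TrialID": [0, 0, 0, 0],
    "ObsNum": [0, 1, 2, 3],
    "Prediction": [1, 0, 0, 1],
})

# Align rows on the trial and observation identifiers.
merged = sol_demo.merge(sub_demo, on=["TrialID", "ObsNum"], suffixes=("_true", "_pred"))
merged["correct"] = merged["Prediction_true"] == merged["Prediction_pred"]

# Overall accuracy, and accuracy within the Public and Private subsets.
overall_acc = merged["correct"].mean()
by_split = merged.groupby("Indicator")["correct"].mean()

print(overall_acc)
print(by_split)
```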