Notebook01 for Safe Driver Prediction

Goals: Give Numerical Illustrations of Data <br>
Version 2: Added RandomForestClassifier (which suggests that hard-label classifiers do not work well for this problem).

I. Import Packages and files

In [8]:
# Data Manipulation
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# display
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [9]:
# Import Files
train_df = pd.read_csv('/Users/maxji/Desktop/Kaggle/0SafeDriver/data/train.csv')
test_df = pd.read_csv('/Users/maxji/Desktop/Kaggle/0SafeDriver/data/test.csv')
submission_df = pd.read_csv('/Users/maxji/Desktop/Kaggle/0SafeDriver/data/sample_submission.csv')

Overview of data: <br>
(1) 595212 training rows and 892816 test rows <br>
(2) Data types: id and target (binary), 14 categorical variables (cat), 17 binary variables (bin), 10 continuous variables (reg_01-03, car_12-15, calc_01-03), 16 ordinal variables <br>
(3) Data categories: ind 18, reg 3, car 16, calc 20 <br>
(4) Columns with missing data (encoded as -1): ind_02_cat, ind_04_cat, ind_05_cat; car_01_cat, car_02_cat, car_03_cat, car_05_cat, car_07_cat, car_09_cat; car_11, car_12, car_14; reg_03
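
The counts in this overview can be reproduced with a few lines of pandas. The following is a minimal sketch on a made-up toy frame, not the notebook's actual data: the column names (such as `ps_ind_02_cat`) and the helpers `missing_summary` and `count_by_group` are assumptions chosen for illustration, and -1 is taken to be the missing-value placeholder.

```python
import pandas as pd

# Toy frame that mimics the naming scheme (all values are made up for illustration)
toy = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'target': [0, 0, 1, 0],
    'ps_ind_02_cat': [1, -1, 2, 1],
    'ps_ind_06_bin': [0, 1, 0, 1],
    'ps_reg_03': [0.7, -1.0, 0.5, -1.0],
    'ps_car_11': [2, 3, -1, 1],
})

def missing_summary(df, placeholder=-1):
    """Count placeholder values per column and keep only columns that have any."""
    counts = (df == placeholder).sum()
    return counts[counts > 0]

def count_by_group(df, groups=('ind', 'reg', 'car', 'calc')):
    """Count the columns whose name contains each group keyword."""
    return {g: sum(g in col for col in df.columns) for g in groups}

missing = missing_summary(toy)
group_counts = count_by_group(toy)
print(missing)
print(group_counts)
```

On the real data, the same two helpers could be called on `train_df` before any -1 values are replaced.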

In [10]:
# Pick out columns with specific keyword inside
def select_cols(df,description):
    get_cols = [col for col in df.columns if description in col]
    return df[get_cols]

# Replace the -1 placeholder with NaN to recover the missing values
def recover_na(df):
    # np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0
    return df.replace(-1, np.nan)

In [11]:
# Select columns with specific data type (excluding id and target)
cat_cols = select_cols(train_df,'cat')
bin_cols = select_cols(train_df,'bin')
cont_cols = train_df.select_dtypes(include=['float64'])
temp_cols = [col for col in train_df.columns if ('cat' not in col) and ('bin' not in col) and (train_df[col].dtype != float) 
            and ('id' not in col) and ('target' not in col)]
ord_cols = train_df[temp_cols]

# Select columns with specific category
ind_cols = select_cols(train_df,'ind')
reg_cols = select_cols(train_df,'reg')
car_cols = select_cols(train_df,'car')
calc_cols = select_cols(train_df,'calc')

# Recover NAs from the file
train_recna = recover_na(train_df)

In [12]:
# The target is heavily imbalanced (about 97% zeros and 3% ones), so we append 5 extra copies
# of every row with target == 1 to reduce the imbalance.
target_achieved = train_df['target'] == 1
df_copy = train_df[target_achieved]
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
train_df = pd.concat([train_df] + [df_copy] * 5, ignore_index=True)
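
One caveat with duplicating rows before a train/test split is that copies of the same positive row can land on both sides of the split, which can inflate the held-out score. A minimal sketch of one alternative, splitting first and oversampling only the training part, is shown here on made-up data. The helper `oversample_positives` and the toy frame are assumptions for illustration, not the notebook's own code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def oversample_positives(df, target_col='target', extra_copies=5):
    """Append extra_copies duplicates of every positive row."""
    positives = df[df[target_col] == 1]
    return pd.concat([df] + [positives] * extra_copies, ignore_index=True)

# Toy data: 100 rows, 3 positives (made-up values, roughly a 97/3 split)
toy = pd.DataFrame({'feature': range(100), 'target': [1] * 3 + [0] * 97})

# Split first, then oversample only the training part
train_part, holdout_part = train_test_split(
    toy, test_size=0.2, random_state=23, stratify=toy['target'])
train_balanced = oversample_positives(train_part)

rate_before = train_part['target'].mean()
rate_after = train_balanced['target'].mean()
print(rate_before, rate_after)
```

With this ordering the held-out part keeps the original class balance and shares no rows with the oversampled training part.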

II. Training

In [13]:
# Split data into train and test
from sklearn.model_selection import train_test_split

X_all = train_df.drop(['target', 'id'], axis=1)
y_all = train_df['target']

num_test = 0.20
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=num_test, random_state=23)

In [14]:
# Import training model and CV package
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV

# Choose the type of classifier
clf = RandomForestClassifier()

# Choose some parameter combinations to try
# In this first notebook I'm playing with GridSearch. The following parameters are not optimal.
"""parameters = {'n_estimators': [4, 6, 9], 
              'max_features': ['log2', 'sqrt','auto'], 
              'criterion': ['entropy', 'gini'],
              'max_depth': [2, 3, 5, 10], 
              'min_samples_split': [2, 3, 5],
              'min_samples_leaf': [1,5,8]
             }"""
parameters = {'n_estimators': [200], 
              # 'auto' was an alias for 'sqrt' here and is no longer accepted by recent scikit-learn releases
              'max_features': ['log2','sqrt'], 
              'criterion': ['gini','entropy'],
              'max_depth': [5], 
              'min_samples_split': [100],
              #'min_samples_leaf': [4]
             }
# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(accuracy_score)

# Run the grid search
grid_obj = GridSearchCV(clf, parameters, scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
clf = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=100, min_weight_fraction_leaf=0.0,
            n_estimators=200, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [15]:
# Create Predictions
predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))

0.816146429155
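
An accuracy figure on imbalanced data is easier to read next to the accuracy of always predicting the majority class. The following is a minimal sketch on made-up labels; the helper `majority_baseline_accuracy` is an assumption for illustration. In the notebook, `y_test` and `predictions` would take the place of the toy arrays.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    _, counts = np.unique(y_true, return_counts=True)
    return counts.max() / counts.sum()

# Made-up labels for illustration: 8 zeros and 2 ones
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.zeros_like(y_true)  # a model that always predicts 0

baseline = majority_baseline_accuracy(y_true)
accuracy = accuracy_score(y_true, y_pred)
matrix = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
print(baseline, accuracy)
print(matrix)
```

If the model's accuracy is close to the baseline, the confusion matrix shows whether any positives are being predicted at all.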


In [16]:
# Cross validation using 50 folds
from sklearn.model_selection import KFold

def run_kfold(clf):
    # Only the first 892 rows of X_all are cross-validated (17 or 18 rows per fold),
    # as in the original KFold(892, n_folds=50) call; use len(X_all) to cover every row.
    n_rows = 892
    kf = KFold(n_splits=50)
    outcomes = []
    for fold, (train_index, test_index) in enumerate(kf.split(np.arange(n_rows)), start=1):
        X_train, X_test = X_all.values[train_index], X_all.values[test_index]
        y_train, y_test = y_all.values[train_index], y_all.values[test_index]
        # Note: this refits clf in place, so afterwards clf is the model from the last fold
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
        outcomes.append(accuracy)
        print("Fold {0} accuracy: {1}".format(fold, accuracy))
    mean_outcome = np.mean(outcomes)
    print("Mean Accuracy: {0}".format(mean_outcome))

run_kfold(clf)



Fold 1 accuracy: 0.9444444444444444
Fold 2 accuracy: 0.8888888888888888
Fold 3 accuracy: 0.8888888888888888
Fold 4 accuracy: 0.9444444444444444
Fold 5 accuracy: 0.9444444444444444
Fold 6 accuracy: 1.0
Fold 7 accuracy: 0.9444444444444444
Fold 8 accuracy: 1.0
Fold 9 accuracy: 1.0
Fold 10 accuracy: 1.0
Fold 11 accuracy: 0.9444444444444444
Fold 12 accuracy: 1.0
Fold 13 accuracy: 0.9444444444444444
Fold 14 accuracy: 0.8888888888888888
Fold 15 accuracy: 0.8888888888888888
Fold 16 accuracy: 1.0
Fold 17 accuracy: 0.8333333333333334
Fold 18 accuracy: 1.0
Fold 19 accuracy: 0.8888888888888888
Fold 20 accuracy: 1.0
Fold 21 accuracy: 1.0
Fold 22 accuracy: 0.8888888888888888
Fold 23 accuracy: 1.0
Fold 24 accuracy: 1.0
Fold 25 accuracy: 1.0
Fold 26 accuracy: 0.9444444444444444
Fold 27 accuracy: 1.0
Fold 28 accuracy: 1.0
Fold 29 accuracy: 0.9444444444444444
Fold 30 accuracy: 0.8888888888888888
Fold 31 accuracy: 1.0
Fold 32 accuracy: 0.9444444444444444
Fold 33 accuracy: 1.0
Fold 34 accuracy: 0.94444444
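
For comparison, here is a minimal sketch of cross-validating over every row with stratified folds. It uses synthetic data from `make_classification`, not the competition data, and the model settings are assumptions chosen only to keep the example fast. `cross_val_score` fits clones of the estimator, so the model passed in is left unfitted and unchanged.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (about 90% zeros), for illustration only
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0)

# Stratified folds keep the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0)

scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')
print(scores)
print(scores.mean())
```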

In [17]:
# Make predictions and output
ids = test_df['id']
predictions = clf.predict(test_df.drop('id', axis=1))

output = pd.DataFrame({ 'id' : ids, 'target': predictions })
output.to_csv('driver-predictions.csv', index = False)
output['target'].value_counts()

0    892816
Name: target, dtype: int64

Insight:<br>
1. The target is heavily skewed toward 0. <br>
2. A classifier that outputs hard labels does not seem to be a good option, because it amplifies the majority-class problem. As in the example above, even after adding 5 extra copies of every row with target 1, every final prediction is still 0. We will therefore focus on regression models in future notebooks.
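
To illustrate the point about hard labels: in scikit-learn, `RandomForestClassifier.predict` returns the class with the highest averaged probability, so a row is labelled 1 only when its estimated probability of being 1 exceeds that of being 0. The sketch below, on synthetic data with assumed model settings, shows the hard labels next to the continuous scores from `predict_proba`. It is an illustration of the mechanism, not the approach this notebook commits to.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data (about 95% zeros), for illustration only
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=0)

model = RandomForestClassifier(
    n_estimators=50, max_depth=5, min_samples_split=100, random_state=0)
model.fit(X_demo, y_demo)

hard_labels = model.predict(X_demo)     # 0/1 labels
proba = model.predict_proba(X_demo)     # one column per class
scores = proba[:, 1]                    # estimated probability of class 1

# The hard labels are the arg-max over the probability columns
labels_from_proba = model.classes_[np.argmax(proba, axis=1)]

print(np.bincount(hard_labels, minlength=2))  # how many 0s and 1s were predicted
print(scores.min(), scores.max())             # range of the continuous scores
```

The continuous scores still rank rows by risk even when few or no rows cross the threshold for label 1.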