 ## UK Road Safety: Traffic Accidents and Vehicles Machine Learning
 
The goal of this project is to investigate what causes Serious and Fatal accidents, in the hope of preventing them and reducing their number. The dataset consists of accident records from the UK spanning more than 15 years. I hope to show the causes of these accidents through visualizations and to build a model that can predict the severity of accidents.
 
The UK government collects and publishes (usually on an annual basis) detailed information about traffic accidents across the country. This information includes, but is not limited to, geographical locations, weather conditions, type of vehicles, number of casualties and vehicle manoeuvres, making this a very interesting and comprehensive dataset for analysis and research.

The data that I'm using is compiled and available through [Kaggle](https://www.kaggle.com/tsiaras/uk-road-safety-accidents-and-vehicles) and, in a less compiled form, [here](https://beta.ukdataservice.ac.uk/datacatalogue/series/series?id=2000045).

Problem: Severe and fatal accidents.<br>
Solution: Use the data to identify ways to reduce both the number and the severity of accidents.

Questions:
1. What affects the severity of accidents?
2. What measures should be looked into in order to lessen the severity of accidents?
3. Can we create an algorithm that correctly predicts the severity of accidents?
4. What are the limitations of the current data?
5. What things would help this research to be more accurate?
6. Who does this project benefit?


### Table of Contents
[Machine Learning](#Machine-Learning)<br>
[Selected Machine Learning Algorithm and Explanation](#Selected-Machine-Learning-Algorithm-and-Explanation)<br>

### Links to Other Notebooks
__[UK Road Safety: Traffic Accidents and Vehicles Introduction, Data Cleaning, and Feature Manipulation](UK_Road_Safety_Traffic_Accidents_and_Vehicles_Data_Cleaning_and_Feature_Manipulation.ipynb)__<br>
__[UK Road Safety: Traffic Accidents and Vehicles Introduction, Data Cleaning, and Feature Manipulation: Github Link](https://github.com/GenTaylor/Traffic-Accident-Analysis/blob/master/UK_Road_Safety_Traffic_Accidents_and_Vehicles_Data_Cleaning_and_Feature_Manipulation.ipynb)__<br>
<br>

__[UK Road Safety: Traffic Accidents and Vehicles Visualizations and Solution](UK_Road_Safety_Traffic_Accidents_and_Vehicles_Visualizations_and_Solution.ipynb)__<br>
__[UK Road Safety: Traffic Accidents and Vehicles Visualizations and Solution: Github Link](https://github.com/GenTaylor/Traffic-Accident-Analysis/blob/master/UK_Road_Safety_Traffic_Accidents_and_Vehicles_Visualizations_and_Solution.ipynb)__<br>
<br>
__[UK Road Safety: Traffic Accidents and Vehicles Machine Learning](UK_Road_Safety_Traffic_Accidents_and_Vehicles_Machine_Learning.ipynb)__<br>
__[UK Road Safety: Traffic Accidents and Vehicles Machine Learning: Github Link](https://github.com/GenTaylor/Traffic-Accident-Analysis/blob/master/UK_Road_Safety_Traffic_Accidents_and_Vehicles_Machine_Learning.ipynb)__<br>
<br>
__[Traffic Analysis and Severity Prediction Powerpoint Presentation](Traffic_Analysis_and_Severity_Prediction.pptx)__<br>
__[Traffic Analysis and Severity Prediction Powerpoint Presentation: Github Link](https://github.com/GenTaylor/Traffic-Accident-Analysis/blob/master/Traffic_Analysis_and_Severity_Prediction.pptx)__<br>

### Importing and Data Merging

In [1]:
#Import modules
import numpy as np
import holidays
import pandas as pd
import seaborn as sns
import pickle
import time
import timeit

import matplotlib.pyplot as plt
plt.style.use('dark_background')
%matplotlib inline

import datetime
import math

#scipy
import scipy
from scipy import stats
from scipy.stats import ttest_ind

#sklearn
import sklearn
from sklearn import ensemble
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score 
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

#other learners
from xgboost import XGBClassifier
import lightgbm as lgb

#time series stuff
import statsmodels.api as sm
from pylab import rcParams
import itertools
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima_model import ARIMA


#warning ignorer
import warnings
warnings.filterwarnings("ignore")

In [2]:
#DATAFRAME PICKLE CREATED IN CELLS BELOW INSTEAD OF RUNNING THROUGH ENTIRE PROCESS AFTER RESTARTING
#import pickled file
df = pd.read_pickle("df.pkl")

## Machine Learning

In [3]:
#make a separate copy with its own index so the dataframe used for data vis is not affected
df1 = df.copy()
#set index to accident_index
df1.set_index('accident_index', inplace=True)
df1.head()

accident_index,1st_road_class,1st_road_number,2nd_road_number,accident_severity,carriageway_hazards,date,day_of_week,did_police_officer_attend_scene_of_accident,junction_control,junction_detail,...,vehicle_type,was_vehicle_left_hand_drive,x1st_point_of_impact,month,weekend,hour,time_of_day,season,engine_capacity_cc_size,accident_seriousness
201001BS70003,B,302,0,Slight,,2010-01-11,Monday,1,Give way or uncontrolled,T or staggered junction,...,Goods Vehicle,No,Front,1,0,7,1,winter,small engine cc,Not Serious
201001BS70004,A,402,4204,Slight,,2010-01-11,Monday,1,Auto traffic signal,T or staggered junction,...,Car,No,Front,1,0,18,6,winter,medium engine cc,Not Serious
201001BS70007,Unclassified,0,0,Slight,,2010-01-02,Saturday,1,Give way or uncontrolled,Mini-roundabout,...,Car,No,Nearside,1,1,21,6,winter,medium engine cc,Not Serious
201001BS70007,Unclassified,0,0,Slight,,2010-01-02,Saturday,1,Give way or uncontrolled,Mini-roundabout,...,Car,No,Front,1,1,21,6,winter,small engine cc,Not Serious
201001BS70008,A,3217,3220,Slight,,2010-01-04,Monday,1,Auto traffic signal,Crossroads,...,Car,No,Nearside,1,0,20,6,winter,medium engine cc,Not Serious


In [4]:
df1 = df1.apply(LabelEncoder().fit_transform)
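
`LabelEncoder` assigns integer codes in the sorted order of each column's unique values. Assuming `accident_seriousness` holds the two values `Not Serious` and `Serious`, that is why the later cells treat 0 as not serious and 1 as serious. The sketch below mimics that rule in plain Python on a toy column; the helper name `encode_labels` and the sample values are illustrative assumptions, not part of the notebook.

```python
def encode_labels(values):
    """Mimic LabelEncoder: map each unique value to its rank in sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], classes

# Toy column standing in for df1['accident_seriousness']
seriousness = ["Not Serious", "Serious", "Not Serious", "Not Serious"]
codes, classes = encode_labels(seriousness)

print(classes)  # sorted unique labels; position in this list is the code
print(codes)    # integer code for each original value
```

For text columns the resulting codes reflect alphabetical order only, not any natural ordering of the categories.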

#### Undersampling
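The data is heavily imbalanced: "Not Serious" accidents far outnumber "Serious" ones. To keep the models from simply favouring the majority class, the majority class in the training set is undersampled to match the size of the minority class. The test set keeps its original class distribution.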

In [5]:
#First set up of X and Y
X= df1.drop(['accident_severity','accident_seriousness'],axis=1)
y= df1['accident_seriousness']

In [6]:
# setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=27)

In [7]:
# concatenate our training data back together
X = pd.concat([X_train, y_train], axis=1)

In [8]:
# separate minority and majority classes
not_severe = X[X.accident_seriousness==0]
severe = X[X.accident_seriousness==1]

In [9]:
# downsample the majority class
not_severe_decreased = resample(not_severe,
                          replace=True, # sample with replacement
                          n_samples=len(severe), # match number in minority class
                          random_state=27) # reproducible results

In [10]:
# combine the minority class with the downsampled majority class
newdf = pd.concat([severe, not_severe_decreased])

In [11]:
newdf.accident_seriousness.value_counts()

1    51357
0    51357
Name: accident_seriousness, dtype: int64
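
The `resample` call above uses `replace=True`, so a majority-class row can be drawn more than once. A minimal standard-library sketch of the difference between drawing with and without replacement follows; the row ids and class sizes are made up for illustration and are not the notebook's data.

```python
import random

rng = random.Random(27)            # fixed seed for reproducibility
majority_ids = list(range(1000))   # hypothetical majority-class row ids
n_minority = 200                   # hypothetical minority-class size

# With replacement: the same id may be picked several times
with_repl = rng.choices(majority_ids, k=n_minority)
# Without replacement: every picked id is distinct
without_repl = rng.sample(majority_ids, k=n_minority)

# Compare sample size with the number of distinct rows in each sample
print(len(with_repl), len(set(with_repl)))
print(len(without_repl), len(set(without_repl)))
```

In `sklearn.utils.resample`, passing `replace=False` gives the second behaviour, which is the more common choice when downsampling.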

In [12]:
X_train = newdf.drop('accident_seriousness', axis=1)
y_train = newdf.accident_seriousness

In [13]:
#scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [14]:
#Decision Tree Classifier

dtc = DecisionTreeClassifier(random_state=42)

dtc.fit(X_train, y_train)
pred_dtc = dtc.predict(X_test)

#Check accuracy

print("Decision Tree Classifier Accuracy Score: {:0.2f}%".format(accuracy_score(y_test,
                                                                               pred_dtc )*100))
print("Decision Tree Classifier F1 Score: {:0.2f}%".format(f1_score(y_test,
                                                                   pred_dtc,average="macro")*100))
print("Decision Tree Classifier Precision Score: {:0.2f}%".format(precision_score(y_test,
                                                                                 pred_dtc, 
                                                                                 average="macro")*100))
print("Decision Tree Classifier Recall Score: {:0.2f}%".format(recall_score(y_test, 
                                                                           pred_dtc,
                                                                           average="macro")*100))
print("Decision Tree  Classifier Cross Validation Score: {:0.2f}%".format(np.mean(cross_val_score(dtc, 
                                                                           X_train,
                                                                           y_train,
                                                                           cv=5)*100)))
print('\n')
print("Decision Tree Classifier Confusion Matrix:\n", confusion_matrix(y_test,pred_dtc))

Decision Tree Classifier Accuracy Score: 57.68%
Decision Tree Classifier F1 Score: 48.05%
Decision Tree Classifier Precision Score: 53.81%
Decision Tree Classifier Recall Score: 58.87%
Decision Tree  Classifier Cross Validation Score: 60.69%


Decision Tree Classifier Confusion Matrix:
 [[70663 52647]
 [ 6716 10258]]
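
The scores printed above can be recomputed by hand from the confusion matrix, which makes the gap between accuracy and the macro-averaged metrics easier to see. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes, class 0 first) and uses plain Python; the variable names are illustrative.

```python
# Decision tree confusion matrix copied from the output above
# rows = true class, columns = predicted class (0 = not serious, 1 = serious)
cm = [[70663, 52647],
      [6716, 10258]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Recall per class: correct predictions / all true members of that class
recall = [cm[i][i] / sum(cm[i]) for i in range(2)]
# Precision per class: correct predictions / all predictions of that class
precision = [cm[i][i] / (cm[0][i] + cm[1][i]) for i in range(2)]
# F1 per class: harmonic mean of precision and recall
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

# Macro average = unweighted mean over the two classes
macro_recall = sum(recall) / 2
macro_precision = sum(precision) / 2
macro_f1 = sum(f1) / 2

# Accuracy of always predicting the majority class on this test set
baseline_accuracy = sum(cm[0]) / total

print("Accuracy: {:0.2f}%".format(accuracy * 100))
print("Macro precision: {:0.2f}%".format(macro_precision * 100))
print("Macro recall: {:0.2f}%".format(macro_recall * 100))
print("Macro F1: {:0.2f}%".format(macro_f1 * 100))
print("Majority-class baseline accuracy: {:0.2f}%".format(baseline_accuracy * 100))
```

Precision for the serious class alone is only about 16%, which drags the macro precision and F1 down. Always predicting "not serious" would reach roughly 87.9% accuracy on this test set but only 50% macro recall, so accuracy on its own is a poor guide here.
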


In [166]:
#Bagging Classifier
bagc = BaggingClassifier(random_state=42)

bagc.fit(X_train, y_train)
pred_bagc = bagc.predict(X_test)


#Check accuracy

print("Bagging Classifier Accuracy Score: {:0.2f}%".format(accuracy_score(y_test,
                                                                               pred_bagc )*100))
print("Bagging Classifier F1 Score: {:0.2f}%".format(f1_score(y_test,
                                                                   pred_bagc,average="macro")*100))
print("Bagging Classifier Precision Score: {:0.2f}%".format(precision_score(y_test,
                                                                                 pred_bagc, 
                                                                                 average="macro")*100))
print("Bagging Classifier Recall Score: {:0.2f}%".format(recall_score(y_test, 
                                                                           pred_bagc,
                                                                           average="macro")*100))
print("Bagging Classifier Cross Validation Score: {:0.2f}%"
      .format(np.mean(cross_val_score(bagc, X_train, y_train, cv=5)*100)))
print('\n')
print("Bagging Classifier Confusion Matrix:\n", confusion_matrix(y_test,pred_bagc))

Bagging Classifier Accuracy Score: 68.61%
Bagging Classifier F1 Score: 55.35%
Bagging Classifier Precision Score: 56.78%
Bagging Classifier Recall Score: 64.18%
Bagging Classifier Cross Validation Score: 65.24%


Bagging Classifier Confusion Matrix:
 [[86343 36967]
 [ 7072  9902]]


In [170]:
#ExtraTreesClassifier

extc = ExtraTreesClassifier(random_state=42)
extc.fit(X_train, y_train)
pred_extc = extc.predict(X_test)

#Check accuracy

print("Extra Trees Classifier Accuracy Score: {:0.2f}%"
      .format(accuracy_score(y_test,pred_extc )*100))
print("Extra Trees Classifier F1 Score: {:0.2f}%"
      .format(f1_score(y_test, pred_extc,average="macro")*100))
print("Extra Trees Classifier Precision Score: {:0.2f}%"
      .format(precision_score(y_test, pred_extc, average="macro")*100))
print("Extra Trees Classifier Recall Score: {:0.2f}%"
      .format(recall_score(y_test, pred_extc, average="macro")*100))
print("Extra Trees Classifier Cross Validation Score: {:0.2f}%"
      .format(np.mean(cross_val_score(extc, X_train, y_train, cv=5)*100)))
print('\n')
print("Extra Trees Classifier Confusion Matrix:\n", confusion_matrix(y_test,pred_extc))


Extra Trees Classifier Accuracy Score: 67.05%
Extra Trees Classifier F1 Score: 53.95%
Extra Trees Classifier Precision Score: 55.92%
Extra Trees Classifier Recall Score: 62.58%
Extra Trees Classifier Cross Validation Score: 64.16%


Extra Trees Classifier Confusion Matrix:
 [[84446 38864]
 [ 7354  9620]]


In [172]:
#AdaBoost Classifier 

adbc = AdaBoostClassifier(random_state=42)
adbc.fit(X_train, y_train)
pred_adbc = adbc.predict(X_test)

#Check accuracy

print("AdaBoost Classifier Accuracy Score: {:0.2f}%"
      .format(accuracy_score(y_test,pred_adbc )*100))
print("AdaBoost Classifier F1 Score: {:0.2f}%"
      .format(f1_score(y_test, pred_adbc,average="macro")*100))
print("AdaBoost Classifier Precision Score: {:0.2f}%"
      .format(precision_score(y_test, pred_adbc, average="macro")*100))
print("AdaBoost Classifier Recall Score: {:0.2f}%"
      .format(recall_score(y_test, pred_adbc, average="macro")*100))
print("AdaBoost Classifier Cross Validation Score: {:0.2f}%"
      .format(np.mean(cross_val_score(adbc, X_train, y_train, cv=5)*100)))
print('\n')
print("AdaBoost Classifier Confusion Matrix:\n", confusion_matrix(y_test,pred_adbc))


AdaBoost Classifier Accuracy Score: 66.56%
AdaBoost Classifier F1 Score: 54.87%
AdaBoost Classifier Precision Score: 57.20%
AdaBoost Classifier Recall Score: 65.78%
AdaBoost Classifier Cross Validation Score: 65.87%


AdaBoost Classifier Confusion Matrix:
 [[82388 40922]
 [ 5985 10989]]


In [173]:
#Random Forest Classifier

rfc = RandomForestClassifier(random_state = 42)
rfc.fit(X_train, y_train)
pred_rfc = rfc.predict(X_test)

#Check accuracy
print("Random Forest Classifier Accuracy Score: {:0.2f}%"
      .format(accuracy_score(y_test,pred_rfc )*100))
print("Random Forest Classifier F1 Score: {:0.2f}%"
      .format(f1_score(y_test, pred_rfc,average="macro")*100))
print("Random Forest Classifier Precision Score: {:0.2f}%"
      .format(precision_score(y_test, pred_rfc, average="macro")*100))
print("Random Forest Classifier Recall Score: {:0.2f}%"
      .format(recall_score(y_test, pred_rfc, average="macro")*100))
print("Random Forest Classifier Cross Validation Score: {:0.2f}%"
      .format(np.mean(cross_val_score(rfc, X_train, y_train, cv=5)*100)))
print('\n')
print("Random Forest Classifier Confusion Matrix:\n", confusion_matrix(y_test,pred_rfc))


Random Forest Classifier Accuracy Score: 68.44%
Random Forest Classifier F1 Score: 54.92%
Random Forest Classifier Precision Score: 56.38%
Random Forest Classifier Recall Score: 63.29%
Random Forest Classifier Cross Validation Score: 64.76%


Random Forest Classifier Confusion Matrix:
 [[86420 36890]
 [ 7384  9590]]


In [174]:
#Gradient Boosting Classifier
gbc = ensemble.GradientBoostingClassifier(random_state = 42)
gbc.fit(X_train, y_train)
pred_gbc = gbc.predict(X_test)

#Check accuracy
print("Gradient Boosting Classifier Accuracy Score: {:0.2f}%"
      .format(accuracy_score(y_test,pred_gbc )*100))
print("Gradient Boosting Classifier F1 Score: {:0.2f}%"
      .format(f1_score(y_test, pred_gbc,average="macro")*100))
print("Gradient Boosting Classifier Precision Score: {:0.2f}%"
      .format(precision_score(y_test, pred_gbc, average="macro")*100))
print("Gradient Boosting Classifier Recall Score: {:0.2f}%"
      .format(recall_score(y_test, pred_gbc, average="macro")*100))
print("Gradient Boosting Classifier Cross Validation Score: {:0.2f}%"
      .format(np.mean(cross_val_score(gbc, X_train, y_train, cv=5)*100)))
print('\n')
print("Gradient Boosting Classifier Confusion Matrix:\n", confusion_matrix(y_test,pred_gbc))

Gradient Boosting Classifier Accuracy Score: 68.21%
Gradient Boosting Classifier F1 Score: 56.07%
Gradient Boosting Classifier Precision Score: 57.75%
Gradient Boosting Classifier Recall Score: 66.65%
Gradient Boosting Classifier Cross Validation Score: 66.77%


Gradient Boosting Classifier Confusion Matrix:
 [[84729 38581]
 [ 6010 10964]]


In [17]:
#Light GBM
lgbm = lgb.LGBMClassifier(random_state = 42)
lgbm.fit(X_train, y_train)
pred_lgbm = lgbm.predict(X_test)

#check accuracy
print("LightGBM Classifier Accuracy Score: {:0.2f}%"
      .format(accuracy_score(y_test,pred_lgbm )*100))
print("LightGBM Classifier F1 Score: {:0.2f}%"
      .format(f1_score(y_test, pred_lgbm,average="macro")*100))
print("LightGBM Classifier Precision Score: {:0.2f}%"
      .format(precision_score(y_test, pred_lgbm, average="macro")*100))
print("LightGBM Classifier Recall Score: {:0.2f}%"
      .format(recall_score(y_test, pred_lgbm, average="macro")*100))
print("LightGBM Classifier Cross Validation Score: {:0.2f}%"
      .format(np.mean(cross_val_score(lgbm, X_train, y_train, cv=5)*100)))
print('\n')
print("LightGBM Classifier Confusion Matrix:\n", confusion_matrix(y_test,pred_lgbm))


LightGBM Classifier Accuracy Score: 67.56%
LightGBM Classifier F1 Score: 56.08%
LightGBM Classifier Precision Score: 58.10%
LightGBM Classifier Recall Score: 67.71%
LightGBM Classifier Cross Validation Score: 67.65%


LightGBM Classifier Confusion Matrix:
 [[83256 40054]
 [ 5448 11526]]


In [187]:
#XGBoost
xgb = XGBClassifier(n_estimators=100, random_state = 42, max_depth=10)
xgb.fit(X_train, y_train)

pred_xgb = xgb.predict(X_test)

#check accuracy
print("XGBoost Classifier Accuracy Score: {:0.2f}%"
      .format(accuracy_score(y_test,pred_xgb)*100))
print("XGBoost Classifier F1 Score: {:0.2f}%"
      .format(f1_score(y_test, pred_xgb,average="macro")*100))
print("XGBoost Classifier Precision Score: {:0.2f}%"
      .format(precision_score(y_test, pred_xgb, average="macro")*100))
print("XGBoost Classifier Recall Score: {:0.2f}%"
      .format(recall_score(y_test, pred_xgb, average="macro")*100))
print("XGBoost Classifier Cross Validation Score: {:0.2f}%"
      .format(np.mean(cross_val_score(xgb, X_train, y_train, cv=5)*100)))
print('\n')
print("XGBoost Classifier Confusion Matrix:\n", confusion_matrix(y_test,pred_xgb))


XGBoost Classifier Accuracy Score: 68.11%
XGBoost Classifier F1 Score: 56.45%
XGBoost Classifier Precision Score: 58.25%
XGBoost Classifier Recall Score: 67.91%
XGBoost Classifier Cross Validation Score: 68.74%


XGBoost Classifier Confusion Matrix:
 [[84061 39249]
 [ 5490 11484]]


### Tuning

In [196]:
# #RANDOM FOREST PARAM
# rfc_param = {
#     'n_estimators': [100, 200, 300, 500],
#     'criterion': ['entropy', 'gini'],
#     'max_features':['auto','sqrt'],
#     'max_depth': [10, 50, 100],
#     'min_samples_split': [2, 5, 10],
#     'min_samples_leaf': [1, 2, 4, 10],
#     'random_state':[42]}

# grid_rfc = GridSearchCV(rfc, param_grid = rfc_param, cv = 3, verbose = 1, n_jobs=-1)
# grid_rfc.fit(X_train,y_train)

# print(grid_rfc.best_estimator_)
# print("Random Forest:\n",grid_rfc.best_params_)

In [None]:
# #Gradient Boosting Classifier Tuning
# gbcparam= {'learning_rate':[0.5,0.1,1],
#            'n_estimators': [100, 200, 300, 500],
#            'max_features':['auto','sqrt'],
#            'max_depth': [10, 50, 100],
#            'min_samples_leaf': [1, 2, 4, 10],
#            'min_samples_split': [2, 5, 10],
#            'random_state':[42]}



# gbctuning =GridSearchCV(gbc, param_grid = gbcparam, cv = 3, verbose = 1, n_jobs=-1)


# gbctuning.fit(X_train,y_train)
                      
# print("Gradient Boost:\n",gbctuning.best_params_)

In [21]:
# #LightGBM Tuning

# lgbmparam={'learning_rate':[0.5,0.1,1],
#            'n_estimators': [100, 200, 300, 500],
#            'max_depth': [6, 25, 50,100],
#            "num_leaves": [6,12,50],
#            'min_data_in_leaf' : [100,500,1000],
#            'random_state':[42]}

# lgbmtuning =GridSearchCV(lgbm, param_grid = lgbmparam, cv = 3, n_jobs=1, verbose = 1)


# lgbmtuning.fit(X_train,y_train)
                     
# print("LightGBM:\n",lgbmtuning.best_params_)

Wall time: 0 ns
Fitting 3 folds for each of 432 candidates, totalling 1296 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1296 out of 1296 | elapsed: 88.8min finished


LightGBM:
 {'learning_rate': 0.1, 'max_depth': 25, 'min_data_in_leaf': 100, 'n_estimators': 500, 'num_leaves': 50, 'random_state': 42}
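
The log above reports 432 candidates and 1296 fits. That count follows directly from the grid: it is the product of the number of values per parameter, multiplied by the number of cross-validation folds. A small standard-library check, with the grid copied from the commented-out LightGBM cell (`cv_folds`, `n_candidates` and `n_fits` are names chosen here for illustration):

```python
import math

# Grid copied from the commented-out LightGBM tuning cell
lgbmparam = {'learning_rate': [0.5, 0.1, 1],
             'n_estimators': [100, 200, 300, 500],
             'max_depth': [6, 25, 50, 100],
             'num_leaves': [6, 12, 50],
             'min_data_in_leaf': [100, 500, 1000],
             'random_state': [42]}
cv_folds = 3  # matches cv = 3 in the GridSearchCV call

# Number of parameter combinations = product of the list lengths
n_candidates = math.prod(len(values) for values in lgbmparam.values())
# Each combination is fitted once per fold
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
```

By the same arithmetic, the random forest grid above has 4 × 2 × 2 × 3 × 3 × 4 = 576 combinations, or 1728 fits with three folds, so the cost of an exhaustive search grows quickly with every value added to the grid.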


In [None]:
# #XGBoost Tuning
# xgbparam ={'max_depth': [10, 50, 100],}

In [48]:
start = time.time()
rfc2 = RandomForestClassifier(criterion='entropy', max_depth=40, 
                              max_features='sqrt', min_samples_split=8, 
                              n_estimators=500, random_state=42)
rfc2.fit(X_train, y_train)
pred_rfc2 = rfc2.predict(X_test)
#Check accuracy
print("Random Forest Classifier Accuracy Score: {:0.2f}%"
      .format(accuracy_score(y_test,pred_rfc2 )*100))
print("Random Forest Classifier F1 Score: {:0.2f}%"
      .format(f1_score(y_test, pred_rfc2,average="macro")*100))
print("Random Forest Classifier Precision Score: {:0.2f}%"
      .format(precision_score(y_test, pred_rfc2, average="macro")*100))
print("Random Forest Classifier Recall Score: {:0.2f}%"
      .format(recall_score(y_test, pred_rfc2, average="macro")*100))
print("Random Forest Classifier Cross Validation Score: {:0.2f}%"
      .format(np.mean(cross_val_score(rfc2, X_train, y_train, cv=5)*100)))
print('\n')
print("Random Forest Classifier Confusion Matrix:\n", confusion_matrix(y_test,pred_rfc2))
end = time.time()
print("Random Forest Time: ",end - start)

Random Forest Classifier Accuracy Score: 67.08%
Random Forest Classifier F1 Score: 55.98%
Random Forest Classifier Precision Score: 58.25%
Random Forest Classifier Recall Score: 68.21%
Random Forest Classifier Cross Validation Score: 69.58%


Random Forest Classifier Confusion Matrix:
 [[82271 41039]
 [ 5143 11831]]
Random Forest Time:  1006.990118265152


In [None]:
#Gradient Boosting Classifier
start2 = time.time()
gbc2 = ensemble.GradientBoostingClassifier(learning_rate=0.05, max_depth=40, 
                                           min_samples_leaf=1, n_estimators=500,
                                           random_state = 42)
gbc2.fit(X_train, y_train)
pred_gbc2 = gbc2.predict(X_test)

#Check accuracy
print("Gradient Boosting Classifier Accuracy Score: {:0.2f}%"
      .format(accuracy_score(y_test,pred_gbc2 )*100))
print("Gradient Boosting Classifier F1 Score: {:0.2f}%"
      .format(f1_score(y_test, pred_gbc2,average="macro")*100))
print("Gradient Boosting Classifier Precision Score: {:0.2f}%"
      .format(precision_score(y_test, pred_gbc2, average="macro")*100))
print("Gradient Boosting Classifier Recall Score: {:0.2f}%"
      .format(recall_score(y_test, pred_gbc2, average="macro")*100))
print("Gradient Boosting Classifier Cross Validation Score: {:0.2f}%"
      .format(np.mean(cross_val_score(gbc2, X_train, y_train, cv=5)*100)))
print('\n')
print("Gradient Boosting Classifier Confusion Matrix:\n", confusion_matrix(y_test,pred_gbc2))
end2 = time.time()
print("Gradient Boosting Time:", end2 - start2)

In [15]:
#Light GBM
#LightGBM:{'learning_rate': 0.1, 'max_depth': 25, 'min_data_in_leaf': 100, 
#'n_estimators': 500, 'num_leaves': 50, 'random_state': 42}
start3 = time.time()
lgbm2 = lgb.LGBMClassifier(learning_rate =0.03, max_depth=40, min_data_in_leaf=10, 
                           max_cat_threshold=99999999,
                           n_estimators=500, num_leaves=50, random_state = 42)
lgbm2.fit(X_train, y_train)
pred_lgbm2 = lgbm2.predict(X_test)

#check accuracy
print("LightGBM Classifier Accuracy Score: {:0.2f}%"
      .format(accuracy_score(y_test,pred_lgbm2 )*100))
print("LightGBM Classifier F1 Score: {:0.2f}%"
      .format(f1_score(y_test, pred_lgbm2,average="macro")*100))
print("LightGBM Classifier Precision Score: {:0.2f}%"
      .format(precision_score(y_test, pred_lgbm2, average="macro")*100))
print("LightGBM Classifier Recall Score: {:0.2f}%"
      .format(recall_score(y_test, pred_lgbm2, average="macro")*100))
print("LightGBM Classifier Cross Validation Score: {:0.2f}%"
      .format(np.mean(cross_val_score(lgbm2, X_train, y_train, cv=5)*100)))
print('\n')
print("LightGBM Classifier Confusion Matrix:\n", confusion_matrix(y_test,pred_lgbm2))
end3 = time.time()
print("LightGBM Time:", end3 - start3)

LightGBM Classifier Accuracy Score: 67.93%
LightGBM Classifier F1 Score: 56.41%
LightGBM Classifier Precision Score: 58.29%
LightGBM Classifier Recall Score: 68.06%
LightGBM Classifier Cross Validation Score: 68.22%


LightGBM Classifier Confusion Matrix:
 [[83719 39591]
 [ 5392 11582]]
LightGBM Time: 73.5155577659607


In [16]:
#XGBoost
start4 = time.time()
xgb2 = XGBClassifier(learning_rate=0.05, n_estimators=500, subsample= 1,random_state = 42,
                     gamma = 1, max_depth=40)
xgb2.fit(X_train, y_train)

pred_xgb2 = xgb2.predict(X_test)

#check accuracy
print("XGBoost Classifier Accuracy Score: {:0.2f}%"
      .format(accuracy_score(y_test,pred_xgb2)*100))
print("XGBoost Classifier F1 Score: {:0.2f}%"
      .format(f1_score(y_test, pred_xgb2,average="macro")*100))
print("XGBoost Classifier Precision Score: {:0.2f}%"
      .format(precision_score(y_test, pred_xgb2, average="macro")*100))
print("XGBoost Classifier Recall Score: {:0.2f}%"
      .format(recall_score(y_test, pred_xgb2, average="macro")*100))
print("XGBoost Classifier Cross Validation Score: {:0.2f}%"
      .format(np.mean(cross_val_score(xgb2, X_train, y_train, cv=5)*100)))
print('\n')
print("XGBoost Classifier Confusion Matrix:\n", confusion_matrix(y_test,pred_xgb2))
end4 = time.time()
print("XGBoost Time:", end4 - start4)


XGBoost Classifier Accuracy Score: 66.69%
XGBoost Classifier F1 Score: 55.71%
XGBoost Classifier Precision Score: 58.13%
XGBoost Classifier Recall Score: 68.04%
XGBoost Classifier Cross Validation Score: 69.27%


XGBoost Classifier Confusion Matrix:
 [[81700 41610]
 [ 5123 11851]]
XGBoost Time: 7625.058862447739


#### Selected Machine Learning Algorithm and Explanation