# Introduction/Business Problem

Car accidents are among the most common serious problems worldwide. Some are caused by negligence and others by natural conditions. A careless or inattentive driver can cost their own life and the lives of others, while heavy rain or strong winds can lead to an unintended collision with another car. Whatever the cause, car accidents lead not only to property damage but also to injuries and, sometimes, deaths. In this project we examine how these accidents occur under different weather conditions. The main questions arising from this situation are:

"What is the severity of these car accidents? What are their causes? And how can they be curbed or reduced?"


# Data section

We have several attributes in our dataset that describe the severity of these accidents. Attributes like WEATHER, ROADCOND, LIGHTCOND and JUNCTIONTYPE describe the natural conditions under which accidents happen, and attributes like SEVERITYDESC and COLLISIONTYPE describe how the accidents take place.
Our target variable will be 'SEVERITYCODE' because it is used to measure the severity of an accident from 0 to 5 within the dataset. The attributes used to weigh the severity of an accident are 'WEATHER', 'ROADCOND' and 'LIGHTCOND'.
* 0 : Little to no Probability (Clear Weather Conditions)  
* 1 : Very Low Probability - Chance of Property Damage
* 2 : Low Probability - Chance of Injury
* 3 : Mild Probability - Chance of Serious Injury
* 4 : High Probability - Chance of Fatality


Depending on these severity codes, we decide the extent of the severity of accidents under these weather conditions.

# Methodology


UK Road Safety data:
* Total accident counts by accident severity (Slight, Serious and Fatal)
* Normalized monthly accident counts for Slight and for Serious and Fatal combined
* Importance of each of the considered features, plotted

Data pre-processing techniques: Missing values (NaN) are imputed with the most frequent value of the corresponding column. All categorical values are encoded as integers from 0 to n for each column. Time is converted to a categorical feature with two values: daytime and night time.
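The two generic steps described above (mode imputation and integer encoding) can be sketched on a toy frame. The column values and the helper name `impute_and_encode` are made up for illustration; the notebook cells further down use hand-written mappings instead.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the accident data (values are invented for illustration).
toy = pd.DataFrame({
    "Weather_Conditions": ["Fine", "Rain", np.nan, "Fine", "Snow"],
    "Road_Type": ["Single", np.nan, "Dual", "Single", "Single"],
})

def impute_and_encode(frame):
    """Fill missing values with each column's mode, then map categories to 0..n-1."""
    out = frame.copy()
    for col in out.columns:
        most_frequent = out[col].mode(dropna=True).iloc[0]
        out[col] = out[col].fillna(most_frequent)
        # pd.factorize assigns integer codes in order of first appearance.
        out[col], _ = pd.factorize(out[col])
    return out

encoded = impute_and_encode(toy)
print(encoded)
```

Codes assigned by order of appearance carry no meaning in their ordering, which is one reason a hand-written mapping can be preferable for ordinal-looking columns.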

The data is visualized to inspect correlations, and negatively correlated features are selected to be dropped. Feature importance is plotted, and only features with high importance are used for predicting accident severity.
The multi-class label is converted to a binary label by merging “Serious” and “Fatal” into a single Serious class.
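As a rough illustration of the label merge and the importance ranking, the sketch below uses synthetic data in which a `Speed_limit` column drives the label and a `noise` column does not. The data, the forest size and the column choice are assumptions made for the example, not the notebook's actual inputs.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 600
# Synthetic stand-in: 'Speed_limit' determines the label, 'noise' is unrelated to it.
X = pd.DataFrame({
    "Speed_limit": rng.choice([30, 60, 70], size=n),
    "noise": rng.integers(0, 5, size=n),
})
severity = pd.Series(np.where(X["Speed_limit"] >= 60,
                              rng.choice(["Serious", "Fatal"], size=n),
                              "Slight"))

# Merge the two rarer classes into one, turning the task into binary classification.
y = severity.replace({"Fatal": "Serious"})

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Impurity-based importances like these are relative scores that sum to 1, so they are best read as a ranking rather than as effect sizes.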

Feature selection: The dataset has 34 attributes describing each accident, with a mix of continuous and categorical data. A few columns, such as Accident ID and Location ID, were dropped manually because their values are inconsistent. To select the best features, the following functions from the sklearn library are used.
* 1. SelectKBest: SelectKBest, from the scikit-learn library, returns the k best features according to a statistical test, here the chi-squared statistic computed between each non-negative feature and the target. The chi-squared test filters out features that are independent of the target attribute.
* 2. Recursive Feature Elimination (RFE): RFE repeatedly fits the given model and removes, step by step, the features that contribute least to predicting the class label. Logistic regression is passed to RFE as the model used to rank the features.
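A minimal sketch of both selectors follows, run on synthetic non-negative features rather than the accident data. The feature layout, `k=2` and `n_features_to_select=2` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)
# Synthetic non-negative integer features: column 0 tracks the label, columns 1-3 are noise.
X = np.column_stack([
    y * 3 + rng.integers(0, 2, size=n),
    rng.integers(0, 5, size=n),
    rng.integers(0, 5, size=n),
    rng.integers(0, 5, size=n),
])

# chi2 requires non-negative feature values.
kbest = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("SelectKBest keeps columns:", kbest.get_support(indices=True))

# RFE repeatedly fits the estimator and drops the weakest feature until 2 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print("RFE keeps columns:", rfe.get_support(indices=True))
```

SelectKBest scores each feature independently of the others, while RFE depends on the chosen estimator, so the two can disagree on real data.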


In [1]:
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix  

df = pd.read_csv('Accident_Information.csv', sep=',')




In [2]:
encoding = {
"Carriageway_Hazards": {"None": 0, "Other object on road": 1, "Any animal in carriageway (except ridden horse)": 1,  "Pedestrian in carriageway - not injured": 1, "Previous accident": 1, "Vehicle load on road": 1,  "Data missing or out of range": 0  }
}
df.replace(encoding, inplace=True)
print(df['Carriageway_Hazards'].value_counts())


0    2010553
1      36703
Name: Carriageway_Hazards, dtype: int64


In [10]:
print(df['Light_Conditions'].value_counts())
encoding_light = {"Light_Conditions": {"Daylight": 0, "Darkness - lights lit": 1, "Darkness - no lighting": 1, "Darkness - lighting unknown": 1, "Darkness - lights unlit": 1, "Data missing or out of range": 0}}
df.replace(encoding_light, inplace=True)
print(df['Light_Conditions'].value_counts())


Daylight                        1496121
Darkness - lights lit            404144
Darkness - no lighting           112644
Darkness - lighting unknown       24362
Darkness - lights unlit            9971
Data missing or out of range         14
Name: Light_Conditions, dtype: int64
0    1496135
1     551121
Name: Light_Conditions, dtype: int64


In [11]:
print(df['Day_of_Week'].value_counts())
encoding_day_of_week = {"Day_of_Week": {"Saturday": 1, "Sunday": 1, "Monday": 0, "Tuesday": 0, "Wednesday": 0, "Thursday": 0, "Friday": 0}}
df.replace(encoding_day_of_week, inplace=True)
print(df['Day_of_Week'].value_counts())


Friday       335183
Wednesday    308580
Thursday     308240
Tuesday      306292
Monday       290482
Saturday     273152
Sunday       225327
Name: Day_of_Week, dtype: int64
0    1548777
1     498479
Name: Day_of_Week, dtype: int64


In [12]:
print(df['Special_Conditions_at_Site'].value_counts())
encoding_Special_Conditions_at_Site = {"Special_Conditions_at_Site": {"None": 0, "Roadworks": 1, "Oil or diesel": 1, "Mud": 1, "Road surface defective": 1, "Auto traffic signal - out": 1, "Road sign or marking defective or obscured": 1, "Auto signal part defective": 1, "Data missing or out of range": 0}}
df.replace(encoding_Special_Conditions_at_Site, inplace=True)
print(df['Special_Conditions_at_Site'].value_counts())


None                                          1995137
Roadworks                                       23525
Oil or diesel                                    6797
Mud                                              6363
Road surface defective                           4801
Auto traffic signal - out                        3855
Road sign or marking defective or obscured       2930
Data missing or out of range                     2835
Auto signal part defective                       1013
Name: Special_Conditions_at_Site, dtype: int64
0    1997972
1      49284
Name: Special_Conditions_at_Site, dtype: int64


In [13]:
encoding_1st_road_class = {"1st_Road_Class": {"A": 1, "A(M)": 1, "B": 2, "C": 3, "Motorway": 4, "Unclassified": 1}}
df.replace(encoding_1st_road_class, inplace=True)
df['1st_Road_Class'].value_counts()


1    1536156
2     258076
3     174953
4      78071
Name: 1st_Road_Class, dtype: int64

In [14]:
# Replace 'Data missing or out of range' with the most frequent value, 'Give way or uncontrolled'
df['Junction_Control'] = df['Junction_Control'].replace(['Data missing or out of range'], 'Give way or uncontrolled')

In [15]:
df['Junction_Control'].value_counts()


Give way or uncontrolled               1742624
Auto traffic signal                     211335
Not at junction or within 20 metres      77304
Stop sign                                12333
Authorised person                         3660
Name: Junction_Control, dtype: int64

In [16]:
encoding_junction_control = {"Junction_Control": 
                            {"Give way or uncontrolled": 1,
                             "Auto traffic signal": 2,
                             "Not at junction or within 20 metres": 3, 
                             "Stop sign": 4,
                             "Authorised person": 5,
                              }}
df.replace(encoding_junction_control, inplace=True)
df['Junction_Control'].value_counts()

1    1742624
2     211335
3      77304
4      12333
5       3660
Name: Junction_Control, dtype: int64

In [17]:
encoding_junction_detail = {"Junction_Detail": 
                            {"Not at junction or within 20 metres": 1,
                             "T or staggered junction": 2,
                             "Crossroads": 3, 
                             "Roundabout": 4,
                             "Private drive or entrance": 5,
                             "Other junction": 6,
                             "Slip road": 7,
                             "More than 4 arms (not roundabout)": 8,
                             "Mini-roundabout": 9,
                             "Data missing or out of range": 1 }}
df.replace(encoding_junction_detail, inplace=True)
df['Junction_Detail'].value_counts()


1    827957
2    635349
3    196283
4    177214
5     72751
6     59692
7     30052
8     25551
9     22407
Name: Junction_Detail, dtype: int64

In [18]:
encoding_road_surface_cond = {"Road_Surface_Conditions": 
                            {"Dry": 1,
                             "Wet or damp": 2,
                             "Frost or ice": 3, 
                             "Snow": 4,
                             "Flood over 3cm. deep": 5,
                             "Data missing or out of range": 1 }}
df.replace(encoding_road_surface_cond, inplace=True)
df['Road_Surface_Conditions'].value_counts()


1    1423360
2     568563
3      40321
4      12167
5       2845
Name: Road_Surface_Conditions, dtype: int64

In [19]:
encoding_road_type = {"Road_Type": 
                            {"Single carriageway": 1,
                             "Dual carriageway": 2,
                             "Roundabout": 3, 
                             "One way street": 4,
                             "Slip road": 5,
                             "Unknown": 0,
                             "Data missing or out of range": 1 }}
df.replace(encoding_road_type, inplace=True)
df['Road_Type'].value_counts()


1    1527883
2     303407
3     136754
4      43258
5      21558
0      14396
Name: Road_Type, dtype: int64

In [20]:
encoding_urban_rural = {"Urban_or_Rural_Area": 
                            {"Urban": 1,
                             "Rural": 2,
                             "Unallocated": 1 }}
df.replace(encoding_urban_rural, inplace=True)
df['Urban_or_Rural_Area'].value_counts()


1    1322499
2     724757
Name: Urban_or_Rural_Area, dtype: int64

In [21]:
encoding_weather = {"Weather_Conditions": 
                            {"Fine no high winds": 1,
                             "Raining no high winds": 2,
                             "Raining + high winds": 3,
                             "Fine + high winds": 4,
                             "Snowing no high winds": 5,
                             "Fog or mist": 6,
                             "Snowing + high winds": 7,
                             "Unknown": 1,
                             "Other": 1,
                             "Data missing or out of range": 1 }}
df.replace(encoding_weather, inplace=True)
df['Weather_Conditions'].value_counts()


1    1726874
2     239281
3      28343
4      25816
5      13387
6      11068
7       2487
Name: Weather_Conditions, dtype: int64

In [22]:
np.where(np.isnan(df['Speed_limit']))


(array([1801605, 1843133, 1843396, 1857338, 1857382, 1857458, 1857466,
        1857525, 1857526, 1857527, 1857531, 1857539, 1857561, 1857564,
        1857583, 1857610, 1857613, 1857618, 1857622, 1857627, 1857635,
        1857681, 1857704, 1857720, 1857736, 1857737, 1857772, 1898106,
        1898251, 1898467, 1898663, 1898938, 1899072, 1899103, 1899306,
        1899388, 1912877], dtype=int64),)

In [23]:
df['Speed_limit'].fillna((df['Speed_limit'].mean()), inplace=True)


In [24]:
df['Time'].fillna(0, inplace=True)


In [26]:
def period(row):
    # Floats are formatted like "13.5"; time strings are formatted like "13:30".
    if isinstance(row, float):
        parts = str(row).split(".")
    else:
        parts = str(row).split(":")

    hr = int(parts[0])
    # 1 = daytime (hours 9 to 19, i.e. 09:00-19:59), 2 = night time
    if 8 < hr < 20:
        return 1
    else:
        return 2


In [27]:
df['Time'] = df['Time'].apply(period)


In [28]:

df_train1 = df[['1st_Road_Class','Carriageway_Hazards','Junction_Control','Day_of_Week','Junction_Detail','Light_Conditions','Road_Surface_Conditions','Road_Type','Special_Conditions_at_Site','Speed_limit','Time','Urban_or_Rural_Area','Weather_Conditions','Accident_Severity']]


In [29]:
df_slight = df_train1[df_train1['Accident_Severity']=='Slight']


In [30]:
df_serious = df_train1[df_train1['Accident_Severity']=='Serious']


In [31]:
df_fatal = df_train1[df_train1['Accident_Severity']=='Fatal']


In [32]:
df_serious['Accident_Severity'].value_counts()


Serious    286339
Name: Accident_Severity, dtype: int64

In [33]:
random_subset = df_slight.sample(n=3)
random_subset.head()


Unnamed: 0,1st_Road_Class,Carriageway_Hazards,Junction_Control,Day_of_Week,Junction_Detail,Light_Conditions,Road_Surface_Conditions,Road_Type,Special_Conditions_at_Site,Speed_limit,Time,Urban_or_Rural_Area,Weather_Conditions,Accident_Severity
1320981,2,0,1,0,6,0,2,1,0,30.0,2,2,2,Slight
475236,1,0,1,0,4,1,2,3,0,70.0,2,2,1,Slight
1068383,1,0,2,0,2,0,1,1,0,30.0,1,1,1,Slight


In [34]:
df_fatal['Accident_Severity'].value_counts()


Fatal    26369
Name: Accident_Severity, dtype: int64

In [35]:
df_slight_sampling = df_slight.sample(n=45000)  # Roughly matches the combined number of Serious and Fatal records kept below (Fatal is merged into Serious later)

In [36]:

df_serious_sampling = df_serious.sample(n=24693)  # Close to the number of records in the rarer class (Fatal: 26369)


In [37]:
df_final_sampling = pd.concat([df_serious_sampling,df_slight_sampling,df_fatal])


In [38]:
df_final_sampling.head()


Unnamed: 0,1st_Road_Class,Carriageway_Hazards,Junction_Control,Day_of_Week,Junction_Detail,Light_Conditions,Road_Surface_Conditions,Road_Type,Special_Conditions_at_Site,Speed_limit,Time,Urban_or_Rural_Area,Weather_Conditions,Accident_Severity
1237029,1,0,1,0,1,0,1,1,0,30.0,1,2,1,Serious
36349,1,0,1,0,1,0,1,1,0,60.0,1,2,1,Serious
1692934,1,0,1,0,1,0,1,1,0,30.0,1,1,1,Serious
2009652,1,0,1,0,1,1,2,1,0,60.0,2,2,1,Serious
482096,1,0,1,0,2,1,1,1,0,60.0,1,2,1,Serious


In [39]:
df_test = df_final_sampling[['Accident_Severity']]


In [40]:
# Merge 'Fatal' into 'Serious' so the label becomes binary (assign returns a new frame, avoiding SettingWithCopyWarning)
df_test = df_test.assign(Accident_Severity=df_test['Accident_Severity'].replace(['Fatal'], 'Serious'))


In [41]:
df_train = df_final_sampling[['1st_Road_Class','Carriageway_Hazards','Junction_Control','Day_of_Week','Junction_Detail','Light_Conditions','Road_Surface_Conditions','Road_Type','Special_Conditions_at_Site','Speed_limit','Time','Urban_or_Rural_Area','Weather_Conditions']]


In [42]:
df_test['Accident_Severity'].value_counts()


Serious    51062
Slight     45000
Name: Accident_Severity, dtype: int64

# Results

In [43]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_train, df_test, test_size=0.2)


In [44]:
from sklearn.ensemble import RandomForestClassifier
#class_weight = dict({2:1, 1:15, 0:50})
rdf = RandomForestClassifier(n_estimators=300,random_state=35)

rdf.fit(X_train,y_train)

y_pred=rdf.predict(X_test)

#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix  
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))


Accuracy: 0.6113568937698434
[[6099 4119]
 [3348 5647]]
              precision    recall  f1-score   support

     Serious       0.65      0.60      0.62     10218
      Slight       0.58      0.63      0.60      8995

    accuracy                           0.61     19213
   macro avg       0.61      0.61      0.61     19213
weighted avg       0.61      0.61      0.61     19213



In [46]:
from sklearn.ensemble import RandomForestClassifier
#class_weight = dict({2:1, 1:15, 0:50})
rdf = RandomForestClassifier(bootstrap=True,
            class_weight="balanced_subsample", 
            criterion='gini',
            # 'sqrt' is what the former 'auto' setting meant for classifiers;
            # min_impurity_split is omitted because later scikit-learn releases removed it
            max_depth=8, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0,
            min_samples_leaf=4, min_samples_split=10,
            min_weight_fraction_leaf=0.0, n_estimators=300,
            oob_score=True,
            random_state=35,
            verbose=0, warm_start=False)


In [47]:
rdf.fit(X_train,y_train)

y_pred=rdf.predict(X_test)


In [48]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))


Accuracy: 0.6256701191901317


In [49]:
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix  
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))


[[5567 4651]
 [2541 6454]]
              precision    recall  f1-score   support

     Serious       0.69      0.54      0.61     10218
      Slight       0.58      0.72      0.64      8995

    accuracy                           0.63     19213
   macro avg       0.63      0.63      0.62     19213
weighted avg       0.64      0.63      0.62     19213



In [53]:
pip install xgboost

Collecting xgboost
  Downloading xgboost-1.2.0-py3-none-win_amd64.whl (86.5 MB)
Installing collected packages: xgboost

Successfully installed xgboost-1.2.0


In [54]:
from xgboost import XGBClassifier
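model = XGBClassifier(learning_rate =0.07, n_estimators=300,
                      # class_weight is not an XGBoost parameter and is ignored (see the warning in the output below)
                      class_weight="balanced_subsample",
                      max_depth=8, min_child_weight=1,
                      # scale_pos_weight=7 up-weights the positive class on an already roughly balanced sample
                      scale_pos_weight=7,
                      seed=27,subsample=0.8,colsample_bytree=0.8)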

model.fit(X_train,y_train)

y_pred=model.predict(X_test)

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))




Parameters: { class_weight } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


Accuracy: 0.47873835423931715


In [55]:
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))


[[ 322 9896]
 [ 119 8876]]
              precision    recall  f1-score   support

     Serious       0.73      0.03      0.06     10218
      Slight       0.47      0.99      0.64      8995

    accuracy                           0.48     19213
   macro avg       0.60      0.51      0.35     19213
weighted avg       0.61      0.48      0.33     19213



In [56]:
# import the class
from sklearn.neighbors import KNeighborsClassifier

# instantiate the model (3 neighbours, votes weighted by distance)
knn = KNeighborsClassifier(n_neighbors=3,weights='distance')

# fit the model with data (occurs in-place)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))


  


[[6123 4095]
 [3915 5080]]
              precision    recall  f1-score   support

     Serious       0.61      0.60      0.60     10218
      Slight       0.55      0.56      0.56      8995

    accuracy                           0.58     19213
   macro avg       0.58      0.58      0.58     19213
weighted avg       0.58      0.58      0.58     19213



In [57]:
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression()
logisticRegr.fit(X_train, y_train)
y_pred = logisticRegr.predict(X_test)
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))




[[6631 3587]
 [3754 5241]]
              precision    recall  f1-score   support

     Serious       0.64      0.65      0.64     10218
      Slight       0.59      0.58      0.59      8995

    accuracy                           0.62     19213
   macro avg       0.62      0.62      0.62     19213
weighted avg       0.62      0.62      0.62     19213



In [58]:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
print(accuracy_score(y_test, y_pred))
confusion_matrix(y_test, y_pred)
print(format(classification_report(y_test, y_pred)))



0.6157809816270233
              precision    recall  f1-score   support

     Serious       0.64      0.62      0.63     10218
      Slight       0.59      0.61      0.60      8995

    accuracy                           0.62     19213
   macro avg       0.62      0.62      0.62     19213
weighted avg       0.62      0.62      0.62     19213



In [59]:
from sklearn.ensemble import GradientBoostingClassifier
# All values below are the scikit-learn defaults; arguments that later releases
# removed or renamed (presort, min_impurity_split, loss="deviance") are omitted
gbc = GradientBoostingClassifier(learning_rate=0.1, n_estimators=100, subsample=1.0,
      criterion="friedman_mse", min_samples_split=2, min_samples_leaf=1,
      max_depth=3, random_state=None)

y_pred = gbc.fit(X_train, y_train.values.ravel()).predict(X_test)
print(format(classification_report(y_test, y_pred)))
print(accuracy_score(y_test, y_pred))




              precision    recall  f1-score   support

     Serious       0.66      0.61      0.64     10218
      Slight       0.60      0.65      0.62      8995

    accuracy                           0.63     19213
   macro avg       0.63      0.63      0.63     19213
weighted avg       0.63      0.63      0.63     19213

0.6295737261229376


# Discussion

Our main aim was to predict accident severity, particularly the “serious” and “fatal” cases. The dataset was large and difficult to handle; using HPC we were able to run most of our algorithms. The data is highly imbalanced, so although most algorithms reported accuracies above 89%, that figure was of no use: the models were predicting all accidents as slight. The team also tried dimensionality reduction techniques, but the results did not improve. The team then decided to use the undersampled dataset, as it gave better results in predicting the serious/fatal accidents. This decision was made after trying oversampling, undersampling, and test and train data with an equal ratio of classes.
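The effect described above can be reproduced on synthetic labels. The sketch below assumes a made-up 9:1 class ratio (not the dataset's real proportions) and shows why plain accuracy misleads, followed by one simple way to undersample every class to the size of the rarest one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic labels with roughly a 9:1 imbalance (invented proportions).
labels = pd.Series(rng.choice(["Slight", "Serious"], size=10_000, p=[0.9, 0.1]),
                   name="Accident_Severity")
frame = labels.to_frame()

# A model that always predicts the majority class already looks accurate.
majority_accuracy = (labels == "Slight").mean()
print(f"Always-'Slight' accuracy: {majority_accuracy:.2f}")

# Random undersampling: shrink every class to the size of the rarest one.
smallest = frame["Accident_Severity"].value_counts().min()
balanced = frame.groupby("Accident_Severity").sample(n=smallest, random_state=0)
print(balanced["Accident_Severity"].value_counts())
```

On a balanced sample, a trivial majority-class model scores about 50%, so per-class precision and recall become easier to interpret, at the cost of discarding most majority-class rows.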

# Conclusion

In conclusion, most of the algorithms are biased towards the most frequent class. However, careful pre-processing combined with techniques for imbalanced data should improve results. Based on known conditions such as weather, light, traffic signals, road surface and speed limit, accident severity can be classified, though the best models here reach only about 63% accuracy on the balanced sample. No single feature determines accident severity.