For this classification project, the aim is to use machine learning concepts to predict whether or not a car crash is fatal.
To build a classification model to predict whether a car crash is fatal or not.
To clean and merge the three datasets (crashes, people, vehicles) so that they are easy to model.
To identify features in the data and use them to engineer a model.
To evaluate our model.
To interpret our classifier model and draw conclusions from it.
What is the problem? To help car companies make their buyers feel safer by identifying which crashes are fatal, so that emergency services can respond quickly to fatal cases and give them priority.
Further understanding
The proposed software, based on the model from this project, could potentially save the life of a driver in the event of a car accident. The software records the features present at the time of the accident and, if they are highly correlated with those of previous fatal crashes, authorities are immediately notified of a possible death. If this is successful and car companies implement our recommendations, we expect to see a decrease in fatalities and an overall increase in safety for all citizens.
I used car crash data for Chicago, IL from the City of Chicago website. The three data sets below contain crash information, information on the people involved, and information on the vehicles involved. I combined them into a single data set.
Traffic Crashes - Crashes:
Traffic Crashes - Vehicle:
Traffic Crashes - People:
Load Data
To run this notebook, you must download the three data sets from their links and store them in the folder above your notebook.
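A minimal loading sketch, assuming the three downloads were saved as CSV files in the folder above the notebook. The file names and the helper `load_crash_data` are hypothetical; substitute the names of the files you actually downloaded. The keys `df`, `df_vehicles`, and `df_people` match the frame names used in the merge step.

```python
from pathlib import Path

import pandas as pd

# Hypothetical file names: rename these to match your downloads.
FILES = {
    "df": "Traffic_Crashes_-_Crashes.csv",
    "df_vehicles": "Traffic_Crashes_-_Vehicles.csv",
    "df_people": "Traffic_Crashes_-_People.csv",
}


def load_crash_data(data_dir=".."):
    """Read the three CSVs from data_dir and return them keyed by frame name."""
    data_dir = Path(data_dir)
    # low_memory=False reads each file in one pass, which avoids
    # mixed-dtype warnings on wide CSV files.
    return {
        name: pd.read_csv(data_dir / filename, low_memory=False)
        for name, filename in FILES.items()
    }
```

In the notebook this would be used as `frames = load_crash_data()` followed by `df = frames["df"]`, and likewise for the other two frames.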
Data Cleaning
Merge all three data sets together and drop duplicates.
import pandas as pd

df_merge = pd.merge(df, df_vehicles, on='CRASH_RECORD_ID').reset_index()
df_merge_2 = pd.merge(df_merge, df_people, on='CRASH_RECORD_ID').reset_index()
# dropping duplicates (checks CRASH_RECORD_ID for uniqueness)
df_dropped = df_merge_2.drop_duplicates(subset=['CRASH_RECORD_ID'], keep='first')
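To see what the merge and de-duplication do to row counts, here is a small sketch on toy frames. The `VEHICLE` and `PERSON` columns are hypothetical stand-ins, not columns from the Chicago data. Because one crash can involve several vehicles and people, the merge yields one row per combination, and `drop_duplicates` then keeps only the first of them.

```python
import pandas as pd

# Toy frames: crash "a" involves two vehicles and three people.
crashes = pd.DataFrame({"CRASH_RECORD_ID": ["a", "b"]})
vehicles = pd.DataFrame({"CRASH_RECORD_ID": ["a", "a", "b"], "VEHICLE": [1, 2, 3]})
people = pd.DataFrame({"CRASH_RECORD_ID": ["a", "a", "a", "b"], "PERSON": [1, 2, 3, 4]})

merged = crashes.merge(vehicles, on="CRASH_RECORD_ID").merge(people, on="CRASH_RECORD_ID")
# Crash "a" expands to 2 vehicles x 3 people = 6 rows; crash "b" to 1 row.
print(len(merged))

deduped = merged.drop_duplicates(subset=["CRASH_RECORD_ID"], keep="first")
# One row per crash remains, carrying only the first vehicle and person.
print(len(deduped))
```

One consequence worth keeping in mind: after de-duplication, each crash is described by a single vehicle and a single person, so information about the others is not available to the model.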
Check the distribution of the columns of the dataset.
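A sketch of one way to inspect column distributions with pandas, shown on a toy frame. The column names, including the `FATAL` target, are illustrative assumptions and not necessarily those used in the notebook.

```python
import pandas as pd

# Toy stand-in for the merged frame; the column names are illustrative only.
toy = pd.DataFrame({
    "WEATHER_CONDITION": ["CLEAR", "RAIN", "CLEAR", "SNOW", "CLEAR"],
    "POSTED_SPEED_LIMIT": [30, 30, 35, 25, 30],
    "FATAL": [0, 0, 0, 1, 0],
})

# Summary statistics for the numeric columns.
numeric_summary = toy.describe()

# Relative frequency of each category, counting missing values as well.
weather_share = toy["WEATHER_CONDITION"].value_counts(normalize=True, dropna=False)

# Class balance of the target, which matters for the modelling steps later on.
target_share = toy["FATAL"].value_counts(normalize=True)

print(numeric_summary)
print(weather_share)
print(target_share)
```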
Dummy Classifier
Baseline model for comparison.
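from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# establishing random_state for reproducibility
dummy = DummyClassifier(random_state=42)
# fit on the cleaned training data (X_train_clean is assumed, matching the SMOTE step)
dummy.fit(X_train_clean, y_train)
y_pred = dummy.predict(X_test_clean)
cm = confusion_matrix(y_test, y_pred)
# display_labels names the classes; this assumes 0 = not fatal and 1 = fatal
ConfusionMatrixDisplay(cm, display_labels=["Not fatal", "Fatal"]).plot()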
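We have a class imbalance, so we oversample the minority class of our target with SMOTE.
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='auto', random_state=42)
# fit_resample is the current imbalanced-learn method name (older releases used fit_sample)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_clean, y_train)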
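Dummy Classifier with SMOTE
dummy_smote = DummyClassifier(random_state=42)
# fit on the resampled training data (X_train_resampled is assumed, taken from the SMOTE step)
dummy_smote.fit(X_train_resampled, y_train_resampled)
y_pred_dummy_sm = dummy_smote.predict(X_test)
cm_2 = confusion_matrix(y_test, y_pred_dummy_sm)
# display_labels names the classes; this assumes 0 = not fatal and 1 = fatal
ConfusionMatrixDisplay(cm_2, display_labels=["Not fatal", "Fatal"]).plot()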
Decision Tree
tree = DecisionTreeClassifier()
Doing a grid search to find the best parameters
tree_grid = {
    'max_leaf_nodes': [4, 5, 6, 7],
    'min_samples_split': [2, 3, 4],
    'max_depth': [2, 3, 4, 5],
}
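A grid like the one above can be searched with scikit-learn's `GridSearchCV`. This is a minimal sketch on synthetic stand-in data, not the crash data. The scoring metric (`recall`) and the fold count (`cv=3`) are assumptions, since the notebook's own search settings are not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (about 5% positives); not the crash data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.95, 0.05], random_state=42
)

tree_grid = {
    "max_leaf_nodes": [4, 5, 6, 7],
    "min_samples_split": [2, 3, 4],
    "max_depth": [2, 3, 4, 5],
}

# scoring="recall" and cv=3 are assumptions chosen for this sketch.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid=tree_grid,
    scoring="recall",
    cv=3,
)
search.fit(X_demo, y_demo)

# The refitted model with the best cross-validated score.
best_tree = search.best_estimator_
print(search.best_params_)
```

With a rare positive class, scoring on recall or F1 rather than accuracy keeps the search from rewarding a tree that simply predicts the majority class.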
plot_tree(best_tree)
plt.savefig('images/decision_tree.png')
Best tree confusion matrix
Best tree ROC curve
Checking how much accuracy the model loses by excluding each variable
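One way to measure this is drop-column importance: refit the model without each variable in turn and compare test accuracy with the full model. The sketch below uses synthetic data and hypothetical feature names. Whether the notebook used this approach or a permutation-based one is not shown here, so treat it as an illustration only.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def drop_column_importance(model, X_train, y_train, X_test, y_test):
    """Return the test-accuracy loss from refitting without each column."""
    baseline = clone(model).fit(X_train, y_train).score(X_test, y_test)
    losses = {}
    for col in X_train.columns:
        reduced = clone(model).fit(X_train.drop(columns=col), y_train)
        losses[col] = baseline - reduced.score(X_test.drop(columns=col), y_test)
    # Largest loss first: these are the variables the model relies on most.
    return pd.Series(losses).sort_values(ascending=False)


# Synthetic stand-in data with hypothetical feature names.
X_arr, y_arr = make_classification(n_samples=500, n_features=5, random_state=42)
X_df = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])
X_tr, X_te, y_tr, y_te = train_test_split(X_df, y_arr, random_state=42)

losses = drop_column_importance(
    DecisionTreeClassifier(max_depth=4, random_state=42), X_tr, y_tr, X_te, y_te
)
print(losses)
```

A loss near zero, or a negative one, suggests the model does about as well without that variable.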
Random Forest
Final Model Evaluation
Decision Tree:
Accuracy: 0.9180
AUC: 0.79
Precision: 0.0079
Recall: 0.4884
F1 Score: 0.0155
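As a quick consistency check on these figures, the F1 score can be recomputed from the reported precision and recall, and the precision can be translated into false alarms per correctly flagged fatal crash. The sketch uses only the numbers reported here.

```python
# Values reported for the decision tree.
precision = 0.0079
recall = 0.4884

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.0155

# With precision p, each true positive comes with (1 - p) / p false positives.
false_alarms_per_hit = (1 - precision) / precision
print(round(false_alarms_per_hit))  # 126
```

The recomputed F1 matches the reported 0.0155, and the precision implies roughly 126 false alarms for every fatal crash that is correctly flagged.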
This final model catches about half of the fatal crashes in the test data (recall of 0.4884), although most of its fatal predictions are false alarms. It could therefore be used by car companies as a safety feature in their vehicles to help 911 dispatchers decide which crashes to prioritize.
The best model for use is the decision tree because it has a relatively high accuracy of 91.8%, meaning it is correct 91.8% of the time. However, the precision of 0.0079 indicates that, out of all the positive predictions made by the model, only 0.79% were actually positive. This is a very low precision value, which means that the model is making a large number of false positive predictions. This leads to the following recommendations:
Consider using different feature selection or engineering techniques to improve the performance of the model.
Consider adjusting the parameters of the Decision Tree model to optimize its performance.
In conclusion, the Decision Tree model has a relatively high accuracy, but is not performing well in terms of precision and recall. Further work may be necessary to improve the performance of the model and achieve better results. Still, it would be beneficial to use this model as a safety feature: even though precision is low, it is better to be safe than sorry.