![](../Assets/logo.png)
# Machine Learning Model for Identifying Personal Driver Behaviors Using Traffic Violation and Accident History Data


## Problem Statement

## Executive Summary

![](../Assets/Dynamic_Driver_profile.jpeg)
### Risk Criteria
![](../Assets/Risk_Criteria.png)

### Contents:
- [Datasets Description](#Datasets_Description)
- [Data Import & Cleaning](#Data_Import_and_Cleaning)
- [Exploratory Data Analysis](#Exploratory_Data_Analysis)
- [Data Visualization](#Visualize_the_data)
- [Descriptive and Inferential Statistics](#Descriptive_and_Inferential_Statistics)
- [Preprocessing and Modeling](#Preprocessing_and_Modeling)
- [Outside Research](#Outside_Research)
- [Conclusions and Recommendations](#Conclusions_and_Recommendations)

<a name="Datasets_Description"></a>
## Datasets Description


### First Dataset (Drivers):

| Name                         | Type | Description |
|------------------------------|------|-------------|
| license_id                   | int  |             |
| license_type                 | object |             |
| age                          | int  |             |
| gender                       |      |             |
| nationality                  |      |             |
| date_of_license_issue        |      |             |
| location_of_license_issue    |      |             |
| city                         |      |             |
| years_of_driving_experience  |      |             |
| license_renewed_count_number |      |             |
| traffic_violation_number     |      |             |
| accident_number              |      |             |


### Second Dataset (Vehicles):

| Name                      | Type | Description |
|---------------------------|------|-------------|
| vehicles_owner_id         |      |             |
| vehicles_registration_id  |      |             |
| vehicles_type             |      |             |
| vehicles_name             |      |             |
| vehicles_model            |      |             |
| vehicles_model_year       |      |             |
| motor_vehicles_avalible   |      |             |
| insurance_id              |      |             |
| insurance_types           |      |             |
| car_insurance_start_date  |      |             |
| car_insurance_end_date    |      |             |
| vehicle_ownership_rental  |      |             |
| vehicle_ownership         |      |             |
| other_authorized_to_drive |      |             |


### Third Dataset (Violations):

| Name                        | Type | Description |
|-----------------------------|------|-------------|
| owner_authorized_id         |      |             |
| vehicle_registration_id     |      |             |
| violation_type              |      |             |
| violation_id                |      |             |
| violation_date              |      |             |
| violation_time              |      |             |
| city                        |      |             |
| violation_location_lat-long |      |             |
| violation_group             |      |             |


### Fourth Dataset (Accidents):

| Name                       | Type | Description |
|----------------------------|------|-------------|
| owner_authorized_id        |      |             |
| vehicle_registration_id    |      |             |
| accident_typs              |      |             |
| accident_id                |      |             |
| accident_date              |      |             |
| accident_time              |      |             |
| accident_city              |      |             |
| accident_location_lat-long |      |             |
| injuries                   |      |             |
| deaths                     |      |             |
| property_damages           |      |             |
| other_person_in_accident   |      |             |
| number_of_damaged_cars     |      |             |
| human_error_percentage     |      |             |

In [8]:
pip install Faker



<a name="Data_Import_and_Cleaning"></a>
## Data Import & Cleaning

In [22]:
#Basic libraries
import numpy as np
import pandas as pd
from scipy import stats
import random
from random import randint

# Visualization libraries
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
from matplotlib.colors import ListedColormap


from IPython.display import set_matplotlib_formats 
plt.style.use('ggplot')
sns.set_style('whitegrid')
sns.set(font_scale=1.5)
%config InlineBackend.figure_format = 'retina'

import warnings
warnings.filterwarnings("ignore")


# Palettes used for visualizations
color= "Spectral"
color_plt = ListedColormap(sns.color_palette(color).as_hex())
color_hist = 'teal'
two_colors = [ sns.color_palette(color)[0], sns.color_palette(color)[5]]
three_colors = [ sns.color_palette(color)[5],sns.color_palette(color)[2], sns.color_palette(color)[0]]

In [14]:
from faker import Faker
#initialize Faker
fake=Faker()

In [2]:
# Importing data from excel file (order of sheets could differ)
File_path='../data/Dummy Data Template.xlsx'

driver_df = pd.read_excel(File_path, sheet_name=0)
vehicle_df = pd.read_excel(File_path, sheet_name=1)
violation_df = pd.read_excel(File_path, sheet_name=2)
accident_df = pd.read_excel(File_path, sheet_name=3)

# Function that cleans column names: trims and lowercases them, drops / ( ) ',
# and replaces each run of whitespace with a single underscore
def column_name_cleaning(df):
    df.columns = (
        df.columns.str.strip()
        .str.lower()
        .str.replace(r"[/()']", '', regex=True)
        .str.replace(r'\s+', '_', regex=True)
    )
    return df

# Clean column names in all data frames
column_name_cleaning(driver_df)
column_name_cleaning(vehicle_df)
column_name_cleaning(violation_df)
column_name_cleaning(accident_df);
driver_df.head()

Unnamed: 0,license_id,license_type,age,gender,nationality,data_of_license_issue,location_of_license_issue,city,years_of_driving_experience,license_renewed_count_number,traffic_violation_number,accident_number
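To make the effect of the column-name cleaning concrete, here is a minimal sketch on a toy frame. The raw headers below are hypothetical (the real sheet headers are not shown in this notebook), and `clean_column_names` is an illustrative helper that returns a copy rather than modifying the frame in place.

```python
import pandas as pd

def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with trimmed, lowercased, underscore-joined column names."""
    cleaned = (
        df.columns.str.strip()
        .str.lower()
        .str.replace(r"[/()']", "", regex=True)  # drop slashes, parentheses, apostrophes
        .str.replace(r"\s+", "_", regex=True)    # each whitespace run becomes one underscore
    )
    out = df.copy()
    out.columns = cleaned
    return out

# Hypothetical raw headers, chosen only to exercise each cleaning rule
demo = pd.DataFrame(columns=[" License ID ", "Date of  License (Issue)", "Owner/Authorized ID"])
cleaned_demo = clean_column_names(demo)
print(list(cleaned_demo.columns))
```

Note that the slash is removed rather than replaced, so `Owner/Authorized ID` becomes `ownerauthorized_id`; if the real headers contain slashes, replacing them with a space first may give more readable names.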


In [12]:
id_list = []
n_digits = 10      # number of digits in each ID
n_records = 10000  # number of unique IDs to generate

for n in range(n_records):
    # Redraw until the candidate has not been generated before
    while True:
        one_id = ''.join(str(randint(0, 9)) for _ in range(n_digits))
        if one_id not in id_list:
            id_list.append(one_id)
            break
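The ID loop above checks each candidate against a list, which gets slower as the list grows. A possible variant, sketched under the assumption that reproducibility is also wanted, tracks seen IDs in a set and accepts an optional seed. The function name `generate_unique_ids` and its `seed` parameter are illustrative and not part of the original notebook.

```python
import random

def generate_unique_ids(n_records, n_digits=10, seed=None):
    """Generate n_records distinct digit strings of length n_digits."""
    if n_records > 10 ** n_digits:
        raise ValueError("not enough distinct IDs for the requested digit count")
    rng = random.Random(seed)  # local generator so the global random state is untouched
    seen = set()               # constant-time membership checks
    ids = []                   # preserves generation order
    while len(ids) < n_records:
        candidate = "".join(str(rng.randint(0, 9)) for _ in range(n_digits))
        if candidate not in seen:
            seen.add(candidate)
            ids.append(candidate)
    return ids

sample_ids = generate_unique_ids(n_records=1000, n_digits=10, seed=42)
print(len(sample_ids), len(set(sample_ids)))
```

Keeping the IDs as strings preserves leading zeros, which would be lost if they were converted to integers.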

In [25]:
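birth_date_list = []
gender_list = []
gender_probability = 0.70  # probability that a generated driver is male ('M')
for n in range(n_records):
    # Random birth date for a driver aged 18 to 80
    one_birth_date = fake.date_of_birth(tzinfo=None, minimum_age=18, maximum_age=80)
    birth_date_list.append(one_birth_date)

    # Draw gender: 'M' with probability gender_probability, otherwise 'F'
    g = 'M' if random.random() < gender_probability else 'F'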
    gender_list.append(g)
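Each gender draw above is an independent trial that yields 'M' with probability 0.70. As a rough sanity check, the sketch below repeats that draw with a seeded generator and reports the observed shares, which should land close to the target for 10,000 records. The helper `sample_genders` and the seed value are assumptions made for illustration.

```python
import random
from collections import Counter

def sample_genders(n_records, male_probability=0.70, seed=None):
    """Draw 'M' with probability male_probability, otherwise 'F', for each record."""
    rng = random.Random(seed)
    return ["M" if rng.random() < male_probability else "F" for _ in range(n_records)]

genders = sample_genders(10_000, male_probability=0.70, seed=0)

# Observed share of each label; these vary slightly from the target by chance
shares = {label: count / len(genders) for label, count in Counter(genders).items()}
print(shares)
```

Summarizing the shares this way is usually more informative than printing the full list of labels.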

In [26]:
gender_list

['M', 'M', 'F', 'M', 'F', ...]

In [3]:
vehicle_df.head()

Unnamed: 0,vehicles_owner_id,vehicles_registration_id,vehicles_type,vehicles_name,vehicles_model,vehicles_model_year,motor_vehicles_avalible,insurance_id,insurance_types,car_insurance_start_date,car_insurance_end_date,vehicle_ownership_rental,vehicle_ownership,other_authorized_to_drive


In [4]:
violation_df.head()

Unnamed: 0,owner_authorized_id,vehicle_registration_id,violation_type,violation_id,violation_date,violation_time,city,violation_location_lat-long,violation_group


In [5]:
accident_df.head()

Unnamed: 0,owner_authorized_id,vehicle_registration_id,accident_typs,accident_id,accident_date,accident_time,accident_city,accident_location_lat-long,injuries,deaths,property_damages,other_person_in_accident,number_of_damaged_cars,human_error_percentage


### Calculating Risk Points for Each Violation

In [6]:
violation_group_to_risk_point_map={
    1: 2,
    2: 2,
    3: 2,
    8: 2,
    4: 3,
    5: 3,
    6: 5,
    7: 5,
    9: 50,  
    }
# Map each violation group to its risk points, then total the points per driver
#violation_df['violation_point'] = violation_df['violation_group'].replace(violation_group_to_risk_point_map)
#violation_df['violation_point_total'] = violation_df.groupby('owner_authorized_id')['violation_point'].transform('sum')
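The following sketch shows one way the violation scoring could work end to end on a toy frame: map each `violation_group` to its risk points, then use `groupby(...).transform('sum')` so every row carries its owner's total. The mapping is copied from the cell above; the driver IDs and violation rows are made up for illustration.

```python
import pandas as pd

# Same group-to-points mapping as in the notebook cell
violation_group_to_risk_point_map = {1: 2, 2: 2, 3: 2, 8: 2, 4: 3, 5: 3, 6: 5, 7: 5, 9: 50}

# Hypothetical violations: three drivers with 2, 1, and 3 violations
toy_violations = pd.DataFrame({
    "owner_authorized_id": [101, 101, 102, 103, 103, 103],
    "violation_group": [1, 6, 9, 4, 4, 8],
})

# Points for each individual violation
toy_violations["violation_point"] = toy_violations["violation_group"].map(
    violation_group_to_risk_point_map
)

# transform('sum') returns one value per original row, so it aligns with the frame's index
toy_violations["violation_point_total"] = (
    toy_violations.groupby("owner_authorized_id")["violation_point"].transform("sum")
)
print(toy_violations)
```

Assigning the result of a plain `.agg(...)` or `.sum()` back to a column does not work the same way, because the aggregated result is indexed by owner rather than by row.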


In [44]:
violation_df.head()

owner_authorized_id (index),owner_authorized_id,vehicle_registration_id,violation_type,violation_id,violation_date,violation_time,city,violation_location_lat-long,violation_group,violation_point,violation_point_total


In [7]:
# Adding violation_point_total column from violation_df to driver_df 
#driver_df = pd.merge(driver_df,violation_df[['owner_authorized_id','violation_point_total']],left_on='license_id' , right_on='owner_authorized_id', how='left')


### Calculating Risk Points for Each Accident

In [40]:
human_error_to_risk_point_map={
    0: 0,
    25: 15,
    50: 25,
    75: 35,
    100: 50, 
    }
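# Map each human-error percentage to risk points, then total the points per driver.
# transform('sum') keeps one value per accident row so the result aligns with accident_df.
accident_df['accident_point'] = accident_df['human_error_percentage'].replace(human_error_to_risk_point_map)
accident_df['accident_point_total'] = accident_df.groupby('owner_authorized_id')['accident_point'].transform('sum')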

In [None]:
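# Adding accident_point_total column from accident_df to driver_df
# (keep one row per owner so the left merge does not duplicate drivers)
accident_totals = accident_df[['owner_authorized_id', 'accident_point_total']].drop_duplicates('owner_authorized_id')
driver_df = pd.merge(driver_df, accident_totals, left_on='license_id', right_on='owner_authorized_id', how='left')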
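An alternative to carrying the total on every accident row is to aggregate once to one row per driver and merge that summary into the driver table. The sketch below does this on toy data, assuming that drivers with no recorded accidents should receive 0 points. The IDs and percentages are hypothetical; the mapping is the one defined above.

```python
import pandas as pd

# Same percentage-to-points mapping as in the notebook cell
human_error_to_risk_point_map = {0: 0, 25: 15, 50: 25, 75: 35, 100: 50}

# Hypothetical drivers and accidents; driver 102 has no accidents
toy_drivers = pd.DataFrame({"license_id": [101, 102, 103]})
toy_accidents = pd.DataFrame({
    "owner_authorized_id": [101, 101, 103],
    "human_error_percentage": [25, 100, 50],
})

toy_accidents["accident_point"] = toy_accidents["human_error_percentage"].map(
    human_error_to_risk_point_map
)

# One row per driver: total accident points
accident_totals = (
    toy_accidents.groupby("owner_authorized_id", as_index=False)["accident_point"]
    .sum()
    .rename(columns={"owner_authorized_id": "license_id",
                     "accident_point": "accident_point_total"})
)

# validate guards against accidentally duplicating driver rows in the merge
toy_drivers = toy_drivers.merge(accident_totals, on="license_id", how="left",
                                validate="one_to_one")

# Assumption: a driver absent from the accident table has zero accident points
toy_drivers["accident_point_total"] = toy_drivers["accident_point_total"].fillna(0).astype(int)
print(toy_drivers)
```

Because the summary has exactly one row per driver, the merged table keeps the same number of rows as the driver table.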


<a name="Exploratory_Data_Analysis"></a>
## Exploratory Data Analysis

<a name="Visualize_the_data"></a>
## Data Visualization

<a name="Descriptive_and_Inferential_Statistics"></a>
## Descriptive and Inferential Statistics

<a name="Preprocessing_and_Modeling"></a>
## Preprocessing and Modeling

<a name="Outside_Research"></a>
## Outside Research

<a name="Conclusions_and_Recommendations"></a>
## Conclusions and Recommendations