# Project Overview
The City of Chicago Vehicle Safety Board (CCVSB) is interested in reducing traffic accidents and in becoming aware of any interesting patterns in the crash data.

# Business Problem

The business problem is to build a classifier that predicts the primary contributory cause of car accidents in the city of Chicago.

# Data
The data comes from the City of Chicago. Three datasets were obtained from the Chicago Data Portal:

* Traffic_Crashes_-_People
* Traffic_Crashes_-_Vehicles
* Traffic_Crashes_-_Crashes

The data runs from 2015 through May 2023, the most recent update at the time of writing. Two of these datasets were cleaned and merged into one.
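The merge itself is not shown in this section. As a minimal sketch of how such a merge could look, assuming the People and Vehicles tables are joined on `CRASH_RECORD_ID` and `VEHICLE_ID` (both columns appear in both tables), with tiny made-up frames standing in for the real data:

```python
import pandas as pd

# Made-up stand-ins for the real tables; only the column names are taken
# from the actual datasets, the values are invented for illustration.
people_toy = pd.DataFrame({
    "PERSON_ID": ["O1", "O2", "P3"],
    "CRASH_RECORD_ID": ["a1", "a1", "a1"],
    "VEHICLE_ID": [10.0, 11.0, 10.0],
    "PERSON_TYPE": ["DRIVER", "DRIVER", "PASSENGER"],
})
vehicles_toy = pd.DataFrame({
    "CRASH_RECORD_ID": ["a1", "a1"],
    "VEHICLE_ID": [10.0, 11.0],
    "MAKE": ["HONDA", "FORD"],
})

# One row per person, with the matching vehicle's columns attached.
# validate="many_to_one" raises if a key pair is duplicated in vehicles_toy.
merged_toy = people_toy.merge(
    vehicles_toy,
    on=["CRASH_RECORD_ID", "VEHICLE_ID"],
    how="left",
    validate="many_to_one",
)
print(merged_toy.shape)
```

A left join keeps every person row even when no vehicle matches, which matters here because `VEHICLE_ID` has missing values in both tables.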

# Data Grocery




# Recording the Experimental Design

The experimental design for building the classifier that predicts the primary contributory cause of car accidents in Chicago consists of the following steps:

* Data Collection: I gathered vehicle and people accident data from https://data.cityofchicago.org/Transportation/Traffic-Crashes-Vehicles/68nd-jvt3 and https://data.cityofchicago.org/Transportation/Traffic-Crashes-People/u6pd-qa9d

* Data Preprocessing: Clean the data by handling missing values, inconsistencies, and outliers. Transform categorical variables into numerical representations suitable for machine learning algorithms. Normalize or standardize numerical features if necessary.

* Exploratory Data Analysis (EDA): Perform exploratory analysis to understand the characteristics and distributions of variables. Identify patterns, correlations, or any interesting insights within the data. Visualize the data using plots, charts, or graphs to aid understanding.

* Feature Engineering: Extract relevant features from the available data that may contribute to predicting the primary contributory cause of car accidents.

* Target Variable Binning: Analyze the distribution of the primary contributory cause categories. Merge or eliminate categories with very few samples to limit the number of target categories.

* Feature Selection: Select the most informative features that are likely to have a significant impact on the prediction.

* Model Selection and Training: Choose a suitable machine learning algorithm for multi-class classification, considering factors like performance, interpretability, and scalability. Split the preprocessed data into training and testing sets. Train the chosen model on the training data.

* Model Evaluation: Evaluate the trained model's performance using evaluation metrics for multi-class classification, such as accuracy, precision, recall, F1-score, and the confusion matrix. Use cross-validation, such as k-fold cross-validation, to assess the model's robustness.

* Model Optimization: Fine-tune the model by optimizing hyperparameters to improve its performance, using techniques such as grid search or random search.

* Predictions and Interpretation: Use the optimized model to predict the primary contributory cause of car accidents for new instances. Analyze the predictions and interpret the results to gain insights into patterns, potential causes, or any interesting findings that can aid accident prevention efforts.

* Reporting and Recommendations: Summarize the findings and insights obtained from the classifier. Provide actionable recommendations for the Vehicle Safety Board or the City of Chicago based on the analysis and predictions.
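
The modelling steps above (target binning, encoding, train/test split, cross-validated hyperparameter search, evaluation) can be sketched end to end on synthetic data. This is only an illustration, not the pipeline used in this project. It assumes scikit-learn is installed. The target column name `PRIM_CONTRIBUTORY_CAUSE`, the `CAUSE_*` labels, the `min_count` threshold, the helper `bin_rare_categories`, and the choice of a decision tree are all assumptions made for the example, and the features are random, so the printed scores carry no meaning.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: feature values are random and the CAUSE_* labels
# are placeholders, so any score printed below is meaningless.
rng = np.random.default_rng(0)
causes = np.repeat(
    ["CAUSE_A", "CAUSE_B", "CAUSE_C", "CAUSE_D", "CAUSE_E"],
    [135, 105, 50, 6, 4],
)
toy = pd.DataFrame({
    "WEATHER_CONDITION": rng.choice(["CLEAR", "RAIN", "SNOW"], size=causes.size),
    "LIGHTING_CONDITION": rng.choice(["DAYLIGHT", "DARKNESS"], size=causes.size),
    "POSTED_SPEED_LIMIT": rng.choice([25, 30, 35], size=causes.size),
    "PRIM_CONTRIBUTORY_CAUSE": causes,  # assumed target column name
})


def bin_rare_categories(target, min_count, other_label="OTHER"):
    """Merge categories with fewer than min_count samples into other_label."""
    counts = target.value_counts()
    rare = counts[counts < min_count].index
    return target.where(~target.isin(rare), other_label)


# Target variable binning: CAUSE_D and CAUSE_E (6 and 4 rows) become OTHER.
y = bin_rare_categories(toy["PRIM_CONTRIBUTORY_CAUSE"], min_count=20)
X = toy.drop(columns="PRIM_CONTRIBUTORY_CAUSE")

# Stratified split keeps class proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# One-hot encode the categorical columns; pass the numeric column through.
categorical = ["WEATHER_CONDITION", "LIGHTING_CONDITION"]
pipeline = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough",
    )),
    ("model", DecisionTreeClassifier(random_state=0)),
])

# Grid search with 3-fold cross-validation over one hyperparameter.
search = GridSearchCV(
    pipeline,
    param_grid={"model__max_depth": [3, 5]},
    cv=3,
    scoring="f1_macro",
)
search.fit(X_train, y_train)

predictions = search.predict(X_test)
print(search.best_params_)
print(classification_report(y_test, predictions, zero_division=0))
```

Keeping the encoder inside the pipeline means it is refit on each cross-validation fold, so no information from the held-out fold leaks into preprocessing. Macro-averaged F1 is used as the search metric here because it weights every class equally, which is one reasonable choice when some causes are much rarer than others.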

# Loading the datasets

In [29]:
# importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns


In [30]:
# Reading csv file
people=pd.read_csv('Traffic_Crashes_-_People.csv')
vehicles=pd.read_csv('Traffic_Crashes_-_Vehicles.csv')
crashes=pd.read_csv('Traffic_Crashes_-_Crashes.csv')



In [31]:
# A function to print the shape of our datasets
def print_dataset_shape(*datasets):
    """
    Prints the shape of one or more datasets (number of rows and columns).
    Assumes datasets are in a Pandas DataFrame format.
    """
    for idx, dataset in enumerate(datasets):
        print(f"Dataset {idx + 1} - Number of rows: {dataset.shape[0]}")
        print(f"Dataset {idx + 1} - Number of columns: {dataset.shape[1]}")

# Print the shape of each dataset
print_dataset_shape(people, vehicles, crashes)

Dataset 1 - Number of rows: 1584616
Dataset 1 - Number of columns: 30
Dataset 2 - Number of rows: 1472816
Dataset 2 - Number of columns: 72
Dataset 3 - Number of rows: 722809
Dataset 3 - Number of columns: 49


In [32]:
people.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1584616 entries, 0 to 1584615
Data columns (total 30 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   PERSON_ID              1584616 non-null  object 
 1   PERSON_TYPE            1584616 non-null  object 
 2   CRASH_RECORD_ID        1584616 non-null  object 
 3   RD_NO                  1574225 non-null  object 
 4   VEHICLE_ID             1553626 non-null  float64
 5   CRASH_DATE             1584616 non-null  object 
 6   SEAT_NO                320590 non-null   float64
 7   CITY                   1156363 non-null  object 
 8   STATE                  1172272 non-null  object 
 9   ZIPCODE                1057964 non-null  object 
 10  SEX                    1559451 non-null  object 
 11  AGE                    1123239 non-null  float64
 12  DRIVERS_LICENSE_STATE  931218 non-null   object 
 13  DRIVERS_LICENSE_CLASS  785449 non-null   object 
 14  SAFETY_EQUIPMENT  

In [33]:
vehicles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1472816 entries, 0 to 1472815
Data columns (total 72 columns):
 #   Column                    Non-Null Count    Dtype  
---  ------                    --------------    -----  
 0   CRASH_UNIT_ID             1472816 non-null  int64  
 1   CRASH_RECORD_ID           1472816 non-null  object 
 2   RD_NO                     1463413 non-null  object 
 3   CRASH_DATE                1472816 non-null  object 
 4   UNIT_NO                   1472816 non-null  int64  
 5   UNIT_TYPE                 1470808 non-null  object 
 6   NUM_PASSENGERS            218014 non-null   float64
 7   VEHICLE_ID                1439674 non-null  float64
 8   CMRC_VEH_I                27533 non-null    object 
 9   MAKE                      1439669 non-null  object 
 10  MODEL                     1439525 non-null  object 
 11  LIC_PLATE_STATE           1308626 non-null  object 
 12  VEHICLE_YEAR              1206500 non-null  float64
 13  VEHICLE_DEFECT            1

In [34]:
crashes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 722809 entries, 0 to 722808
Data columns (total 49 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   CRASH_RECORD_ID                722809 non-null  object 
 1   RD_NO                          718253 non-null  object 
 2   CRASH_DATE_EST_I               54664 non-null   object 
 3   CRASH_DATE                     722809 non-null  object 
 4   POSTED_SPEED_LIMIT             722809 non-null  int64  
 5   TRAFFIC_CONTROL_DEVICE         722809 non-null  object 
 6   DEVICE_CONDITION               722809 non-null  object 
 7   WEATHER_CONDITION              722809 non-null  object 
 8   LIGHTING_CONDITION             722809 non-null  object 
 9   FIRST_CRASH_TYPE               722809 non-null  object 
 10  TRAFFICWAY_TYPE                722809 non-null  object 
 11  LANE_CNT                       199002 non-null  float64
 12  ALIGNMENT                     

In [38]:
# Function to display the first rows of each dataset
from IPython.display import display

def display_data_head(people, vehicles, crashes):
    """Display the first five rows of the people, vehicles, and crashes DataFrames."""
    dfs = [people.head(), vehicles.head(), crashes.head()]
    df_names = ["people", "vehicles", "crashes"]
    for df, name in zip(dfs, df_names):
        print(f"\n{name}:\n")
        display(df)

# Display the head of our datasets
display_data_head(people, vehicles, crashes)


people:



|  | PERSON_ID | PERSON_TYPE | CRASH_RECORD_ID | RD_NO | VEHICLE_ID | CRASH_DATE | SEAT_NO | CITY | STATE | ZIPCODE | ... | EMS_RUN_NO | DRIVER_ACTION | DRIVER_VISION | PHYSICAL_CONDITION | PEDPEDAL_ACTION | PEDPEDAL_VISIBILITY | PEDPEDAL_LOCATION | BAC_RESULT | BAC_RESULT VALUE | CELL_PHONE_USE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | O1577624 | DRIVER | e8d0a18503a3ef7a69ee631eacffd421ea154ea9782131... |  | 1500926.0 | 05/17/2023 10:20:00 AM |  | CHICAGO | IL | 60637.0 | ... |  | IMPROPER PARKING | UNKNOWN | UNKNOWN |  |  |  | TEST NOT OFFERED |  |  |
| 1 | O1577610 | DRIVER | 0690865a402d40a7eab391f94a658b48dc03abb636a03a... |  | 1500906.0 | 05/17/2023 10:10:00 AM |  |  |  |  | ... |  | UNKNOWN | UNKNOWN | UNKNOWN |  |  |  | TEST NOT OFFERED |  |  |
| 2 | O1577611 | DRIVER | 0690865a402d40a7eab391f94a658b48dc03abb636a03a... |  | 1500913.0 | 05/17/2023 10:10:00 AM |  | VALPARAISO | IN | 46385.0 | ... |  | NONE | NOT OBSCURED | NORMAL |  |  |  | TEST NOT OFFERED |  |  |
| 3 | O1577604 | DRIVER | fe7d5f687f472c631a7a3516d0047a3cf7a8ab2cb0b6a4... |  | 1500903.0 | 05/17/2023 10:05:00 AM |  | CHICAGO | IL | 60613.0 | ... |  | UNKNOWN | UNKNOWN | NORMAL |  |  |  | TEST NOT OFFERED |  |  |
| 4 | O1577605 | DRIVER | fe7d5f687f472c631a7a3516d0047a3cf7a8ab2cb0b6a4... |  | 1500907.0 | 05/17/2023 10:05:00 AM |  | CHICAGO | IL | 60660.0 | ... |  | UNKNOWN | UNKNOWN | NORMAL |  |  |  | TEST NOT OFFERED |  |  |



vehicles:



|  | CRASH_UNIT_ID | CRASH_RECORD_ID | RD_NO | CRASH_DATE | UNIT_NO | UNIT_TYPE | NUM_PASSENGERS | VEHICLE_ID | CMRC_VEH_I | MAKE | ... | TRAILER1_LENGTH | TRAILER2_LENGTH | TOTAL_VEHICLE_LENGTH | AXLE_CNT | VEHICLE_CONFIG | CARGO_BODY_TYPE | LOAD_TYPE | HAZMAT_OUT_OF_SERVICE_I | MCS_OUT_OF_SERVICE_I | HAZMAT_CLASS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1577434 | 25d92973475a04a93e7fd206fbfce57e8a9a1e25cc85a7... |  | 05/16/2023 11:12:00 PM | 1 | DRIVER |  | 1500741.0 |  | HONDA | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | 1577435 | 25d92973475a04a93e7fd206fbfce57e8a9a1e25cc85a7... |  | 05/16/2023 11:12:00 PM | 2 | DRIVER | 1.0 | 1500742.0 |  | CHEVROLET | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | 1577450 | 375ac7f6fcb4ef73d728edc52ed556f23fd465a351833f... |  | 05/16/2023 11:06:00 PM | 1 | DRIVER |  | 1500759.0 |  | DODGE | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | 1577451 | 375ac7f6fcb4ef73d728edc52ed556f23fd465a351833f... |  | 05/16/2023 11:06:00 PM | 2 | DRIVER |  | 1500760.0 |  | TOYOTA | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | 1577452 | 375ac7f6fcb4ef73d728edc52ed556f23fd465a351833f... |  | 05/16/2023 11:06:00 PM | 3 | DRIVER |  | 1500761.0 |  | FORD | ... |  |  |  |  |  |  |  |  |  |  |



crashes:



|  | CRASH_RECORD_ID | RD_NO | CRASH_DATE_EST_I | CRASH_DATE | POSTED_SPEED_LIMIT | TRAFFIC_CONTROL_DEVICE | DEVICE_CONDITION | WEATHER_CONDITION | LIGHTING_CONDITION | FIRST_CRASH_TYPE | ... | INJURIES_NON_INCAPACITATING | INJURIES_REPORTED_NOT_EVIDENT | INJURIES_NO_INDICATION | INJURIES_UNKNOWN | CRASH_HOUR | CRASH_DAY_OF_WEEK | CRASH_MONTH | LATITUDE | LONGITUDE | LOCATION |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25d92973475a04a93e7fd206fbfce57e8a9a1e25cc85a7... |  |  | 05/16/2023 11:12:00 PM | 30 | TRAFFIC SIGNAL | FUNCTIONING PROPERLY | CLEAR | DARKNESS, LIGHTED ROAD | REAR END | ... | 0.0 | 0.0 | 3.0 | 0.0 | 23 | 3 | 5 | 41.952691 | -87.807413 | POINT (-87.807413247555 41.952691362649) |
| 1 | 375ac7f6fcb4ef73d728edc52ed556f23fd465a351833f... |  |  | 05/16/2023 11:06:00 PM | 30 | NO CONTROLS | NO CONTROLS | CLEAR | DARKNESS, LIGHTED ROAD | REAR TO FRONT | ... | 0.0 | 0.0 | 3.0 | 0.0 | 23 | 3 | 5 | 41.997837 | -87.688814 | POINT (-87.688813887189 41.997837266972) |
| 2 | 246fea010af2010860046c6ef36efb75a8c60244088939... |  |  | 05/16/2023 11:05:00 PM | 30 | NO CONTROLS | NO CONTROLS | CLEAR | DARKNESS, LIGHTED ROAD | PARKED MOTOR VEHICLE | ... | 0.0 | 0.0 | 1.0 | 0.0 | 23 | 3 | 5 | 42.002331 | -87.695032 | POINT (-87.695032165757 42.002331485776) |
| 3 | 18c220f7eeceb2cf6f9512c9b83382da28d8565fbbaaec... |  |  | 05/16/2023 10:20:00 PM | 25 | NO CONTROLS | NO CONTROLS | CLEAR | DARKNESS | PEDALCYCLIST | ... | 1.0 | 0.0 | 1.0 | 0.0 | 22 | 3 | 5 | 41.82734 | -87.636475 | POINT (-87.636475000374 41.827339537397) |
| 4 | cfecdce601503162eb09337bd6051ea358dca7294d440b... |  |  | 05/16/2023 09:45:00 PM | 30 | UNKNOWN | FUNCTIONING PROPERLY | CLEAR | DARKNESS | REAR END | ... | 0.0 | 0.0 | 2.0 | 0.0 | 21 | 3 | 5 | 41.808853 | -87.640097 | POINT (-87.640097485203 41.808853153697) |
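
`CRASH_DATE` is loaded as a string (dtype `object`) in all three tables. A small sketch of how it could be converted to a proper datetime, using two values copied from the output above. The format string is inferred from those sample values and is an assumption about the rest of the column:

```python
import pandas as pd

# Two CRASH_DATE values copied from the crashes preview above.
sample = pd.Series(["05/16/2023 11:12:00 PM", "05/16/2023 09:45:00 PM"])

# An explicit format avoids per-row inference and month/day ambiguity.
parsed = pd.to_datetime(sample, format="%m/%d/%Y %I:%M:%S %p")

print(parsed.dt.hour.tolist())        # hour on a 24-hour clock
print(parsed.dt.day_name().tolist())  # weekday name
```

For these two rows the parsed hours are 23 and 21, which agree with the `CRASH_HOUR` values shown for the same crashes.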


# Data cleaning
### Checking for duplicates

In [39]:
# A function to check for duplicates in our datasets
def check_duplicates(df):
    """
    This function checks for and returns any duplicates in a given dataframe.
    """
    duplicates = df[df.duplicated()]
    if duplicates.shape[0] == 0:
        print("No duplicates found in the dataset")
    else:
        print("Duplicates found in the dataset:")
        return duplicates
# Check each dataset for duplicates
check_duplicates(people)
check_duplicates(vehicles)
check_duplicates(crashes)

No duplicates found in the dataset
No duplicates found in the dataset
No duplicates found in the dataset


None of the three datasets contains duplicate rows.

### Checking for missing values

In [40]:
# A function to check for missing values in our dataset
def check_missing_values(data):
    # Count missing values in each column
    missing_values = data.isnull().sum()

    # Convert missing values count to percentage of total rows
    missing_percent = (missing_values / len(data)) * 100

    # Combine the missing values count and percent into a DataFrame
    missing_df = pd.concat([missing_values, missing_percent], axis=1)
    missing_df.columns = ['Missing Values', '% of Total']

    # Return only columns with missing values
    missing_df = missing_df[missing_df['Missing Values'] > 0]

    return missing_df

# Check missing values in each dataset
display(check_missing_values(people))
display(check_missing_values(vehicles))
display(check_missing_values(crashes))

|  | Missing Values | % of Total |
|---|---|---|
| RD_NO | 10391 | 0.655742 |
| VEHICLE_ID | 30990 | 1.955679 |
| SEAT_NO | 1264026 | 79.7686 |
| CITY | 428253 | 27.025664 |
| STATE | 412344 | 26.021699 |
| ZIPCODE | 526652 | 33.235307 |
| SEX | 25165 | 1.588082 |
| AGE | 461377 | 29.116013 |
| DRIVERS_LICENSE_STATE | 653398 | 41.233838 |
| DRIVERS_LICENSE_CLASS | 799167 | 50.432849 |


|  | Missing Values | % of Total |
|---|---|---|
| RD_NO | 9403 | 0.638437 |
| UNIT_TYPE | 2008 | 0.136337 |
| NUM_PASSENGERS | 1254802 | 85.197472 |
| VEHICLE_ID | 33142 | 2.250247 |
| CMRC_VEH_I | 1445283 | 98.130588 |
| ... | ... | ... |
| CARGO_BODY_TYPE | 1460662 | 99.174778 |
| LOAD_TYPE | 1461192 | 99.210764 |
| HAZMAT_OUT_OF_SERVICE_I | 1462256 | 99.283006 |
| MCS_OUT_OF_SERVICE_I | 1462010 | 99.266303 |


|  | Missing Values | % of Total |
|---|---|---|
| RD_NO | 4556 | 0.630319 |
| CRASH_DATE_EST_I | 668145 | 92.437283 |
| LANE_CNT | 523807 | 72.468245 |
| REPORT_TYPE | 20288 | 2.806827 |
| INTERSECTION_RELATED_I | 557015 | 77.062543 |
| NOT_RIGHT_OF_WAY_I | 689058 | 95.330578 |
| HIT_AND_RUN_I | 497777 | 68.867017 |
| STREET_DIRECTION | 4 | 0.000553 |
| STREET_NAME | 1 | 0.000138 |
| BEAT_OF_OCCURRENCE | 5 | 0.000692 |
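
Several columns above are missing in the large majority of rows. One common way to handle this, sketched here on a made-up frame, is to drop columns whose missing share exceeds a threshold. The helper name `drop_sparse_columns` and the 50% cut-off are assumptions for illustration, not choices made in this project:

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing_pct=50.0):
    """Return a copy of df without columns whose missing share exceeds the threshold."""
    missing_pct = df.isnull().mean() * 100
    keep = missing_pct[missing_pct <= max_missing_pct].index
    return df[keep].copy()


# Made-up values; only the column names come from the crashes table.
toy = pd.DataFrame({
    "CRASH_RECORD_ID": ["a1", "b2", "c3", "d4"],
    "LANE_CNT": [np.nan, np.nan, np.nan, 2.0],  # 75% missing
    "POSTED_SPEED_LIMIT": [30, 30, 25, 30],     # complete
})

kept = drop_sparse_columns(toy)
print(kept.columns.tolist())
```

Whether a sparse column should be dropped or kept as a flag depends on what the missing value means. An indicator column such as `HIT_AND_RUN_I` may be blank simply because the condition did not apply, in which case filling it with a default category could be preferable to dropping it.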
