# Project Objective
The goal of this project is to build a classifier that predicts the **Primary Contributory Cause of a Car Accident** given certain information about the crash.

### Order of Operations:
This is the first notebook in my **Chicago Traffic Crashes** data analysis project. In this notebook I will be **importing** and **merging** the provided datasets.
* Obtain data
* Explore the datasets we will be using to create our models, i.e., **Traffic_Crashes, Traffic_People,** and **Traffic_Vehicles**
    - Filter out irrelevant columns
    - Data type conversion
    - Detect and handle missing data
    - Deal with outliers in data if any
    - Check for duplicates
    - Create dummy variables for categorical data
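As a rough sketch of what these cleaning steps might look like in pandas, the example below runs them on a tiny made-up DataFrame. The values, the column chosen for removal, and the `"UNKNOWN"` fill value are assumptions for illustration only, not the project's actual data or decisions. Outlier handling is left out because it depends on the column being examined.

```python
import pandas as pd

# Toy data standing in for a crashes table; all values are invented for illustration
toy = pd.DataFrame({
    "CRASH_RECORD_ID": ["a1", "b2", "b2", "c3"],
    "RD_NO": ["JC1", "JD2", "JD2", None],
    "CRASH_DATE": ["07/10/2019 05:56:00 PM", "06/30/2017 04:00:00 PM",
                   "06/30/2017 04:00:00 PM", "07/08/2020 02:00:00 PM"],
    "POSTED_SPEED_LIMIT": [35, 30, 30, 20],
    "WEATHER_CONDITION": ["CLEAR", "RAIN", "RAIN", None],
})

# Filter out a column assumed (for this sketch) to be irrelevant for modeling
cleaned = toy.drop(columns=["RD_NO"])

# Data type conversion: parse the date strings into datetimes
cleaned["CRASH_DATE"] = pd.to_datetime(
    cleaned["CRASH_DATE"], format="%m/%d/%Y %I:%M:%S %p"
)

# Detect missing data, then fill it with a placeholder (one option among many)
missing_per_column = cleaned.isna().sum()
cleaned["WEATHER_CONDITION"] = cleaned["WEATHER_CONDITION"].fillna("UNKNOWN")

# Check for duplicate rows and drop them
n_duplicates = cleaned.duplicated().sum()
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

# Create dummy variables for the categorical column
model_ready = pd.get_dummies(cleaned, columns=["WEATHER_CONDITION"])
print(model_ready.columns.tolist())
```

Each step produces a new intermediate result, so the effect of a single step can be inspected before moving on to the next one.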

## Data Importing
The first step in this process is to load our data into the notebook. The datasets I will be using for this project can be found [here](https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if).

In [1]:
# Start by importing all necessary libraries
import pandas as pd  # For handling datasets and data storage

# For data visualization
import seaborn as sns
import matplotlib.pyplot as plt 
%matplotlib inline

### Traffic_Crashes - Load and Explore Dataset

In [3]:
# Next, we'll load and explore the Crashes dataset
crashes = pd.read_csv('CSV_Datasets/Traffic_Crashes.csv')
crashes.head()

Unnamed: 0,CRASH_RECORD_ID,RD_NO,CRASH_DATE_EST_I,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,...,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE,LOCATION
0,4fd0a3e0897b3335b94cd8d5b2d2b350eb691add56c62d...,JC343143,,07/10/2019 05:56:00 PM,35,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,TURNING,...,0.0,0.0,3.0,0.0,17,4,7,41.919664,-87.773288,POINT (-87.773287883007 41.919663832993)
1,009e9e67203442370272e1a13d6ee51a4155dac65e583d...,JA329216,,06/30/2017 04:00:00 PM,35,STOP SIGN/FLASHER,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,TURNING,...,0.0,0.0,3.0,0.0,16,6,6,41.741804,-87.740954,POINT (-87.740953581987 41.741803598989)
2,ee9283eff3a55ac50ee58f3d9528ce1d689b1c4180b4c4...,JD292400,,07/10/2020 10:25:00 AM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,...,0.0,0.0,3.0,0.0,10,6,7,41.773456,-87.585022,POINT (-87.585022352022 41.773455972008)
3,f8960f698e870ebdc60b521b2a141a5395556bc3704191...,JD293602,,07/11/2020 01:00:00 AM,30,NO CONTROLS,NO CONTROLS,CLEAR,DARKNESS,PARKED MOTOR VEHICLE,...,0.0,0.0,3.0,0.0,1,7,7,41.802119,-87.622115,POINT (-87.622114914961 41.802118543011)
4,8eaa2678d1a127804ee9b8c35ddf7d63d913c14eda61d6...,JD290451,,07/08/2020 02:00:00 PM,20,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,...,0.0,0.0,1.0,0.0,14,4,7,,,


In [4]:
# Next, let's get a little more information about the dataset, starting with its size
print(f'This dataset has {len(crashes)} rows and {len(crashes.columns)} columns\n')

# Use .info() to get more information on each column, i.e., dtypes and number of non-null values
crashes.info()

This dataset has 490128 rows and 49 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 490128 entries, 0 to 490127
Data columns (total 49 columns):
CRASH_RECORD_ID                  490128 non-null object
RD_NO                            486363 non-null object
CRASH_DATE_EST_I                 36817 non-null object
CRASH_DATE                       490128 non-null object
POSTED_SPEED_LIMIT               490128 non-null int64
TRAFFIC_CONTROL_DEVICE           490128 non-null object
DEVICE_CONDITION                 490128 non-null object
WEATHER_CONDITION                490128 non-null object
LIGHTING_CONDITION               490128 non-null object
FIRST_CRASH_TYPE                 490128 non-null object
TRAFFICWAY_TYPE                  490128 non-null object
LANE_CNT                         198965 non-null float64
ALIGNMENT                        490128 non-null object
ROADWAY_SURFACE_COND             490128 non-null object
ROAD_DEFECT                      490128 non-null object
REPOR

**Key Takeaways:**
* There are 490,128 rows and 49 columns.
* Of the 49 columns, 27 have at least one missing value.
* Over half of the columns have the **object** dtype.
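
Counts like these can be computed directly rather than read off the `.info()` output. The sketch below uses a small invented DataFrame named `sample` (a hypothetical stand-in; the same calls would be applied to `crashes`). It counts non-numeric columns, which correspond to the columns reported as **object** in the output above.

```python
import pandas as pd

# Small invented frame; the real `crashes` DataFrame would be used in its place
sample = pd.DataFrame({
    "CRASH_RECORD_ID": ["a1", "b2", "c3"],
    "RD_NO": ["JC1", None, "JD3"],
    "POSTED_SPEED_LIMIT": [35, 30, 20],
    "LANE_CNT": [2.0, None, None],
})

# Number of columns with at least one missing value
cols_with_missing = int(sample.isna().any().sum())

# Number of non-numeric (text) columns
non_numeric_cols = sample.select_dtypes(exclude="number").shape[1]

print(f"{cols_with_missing} of {sample.shape[1]} columns have missing values")
print(f"{non_numeric_cols} of {sample.shape[1]} columns are non-numeric")
```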
> For this project the 