# A Brief Look at Chicago Traffic Accidents in Recent Years

![Chicago Traffic](https://fuphotos.files.wordpress.com/2010/11/2010chicago-122.jpg)

Image courtesy fuphotos: https://fuphotos.wordpress.com/2010/11/08/chicago-traffic/

## About the dataset
### Source: https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if/
This dataset is a compilation of records collected through the Chicago Police Department's electronic crash reporting system when official crash reports are submitted. It is important to note that, according to the metadata, about half of the reports are self-reported by drivers involved in minor incidents and the other half are filed by the responding officer(s). It is also worth noting that data from only a few police districts was available in 2015 and that citywide data became available in 2017 (this becomes apparent in the data, as we will see later). Crashes handled by police jurisdictions other than CPD are not included in this dataset.

## Getting Started
#### Exploring the data
Our team looked at the dataset together and finalized this notebook. It shows our process of exploring the data and deciding which columns to keep or discard so that we could ask, and attempt to answer, meaningful questions.

In [1]:
# We only need pandas for the exploration process.
# Other modules will be imported during the analysis portion of our study. 
import pandas as pd

In [2]:
# Load the entire dataset, downloaded as a CSV, into a pandas DataFrame.
crashes_df = pd.read_csv('CHC_Crashes.csv')

### Exploring the dataset

In [3]:
# Check datatypes
crashes_df.dtypes

RD_NO                             object
CRASH_DATE_EST_I                  object
CRASH_DATE                        object
POSTED_SPEED_LIMIT                 int64
TRAFFIC_CONTROL_DEVICE            object
DEVICE_CONDITION                  object
WEATHER_CONDITION                 object
LIGHTING_CONDITION                object
FIRST_CRASH_TYPE                  object
TRAFFICWAY_TYPE                   object
LANE_CNT                         float64
ALIGNMENT                         object
ROADWAY_SURFACE_COND              object
ROAD_DEFECT                       object
REPORT_TYPE                       object
CRASH_TYPE                        object
INTERSECTION_RELATED_I            object
NOT_RIGHT_OF_WAY_I                object
HIT_AND_RUN_I                     object
DAMAGE                            object
DATE_POLICE_NOTIFIED              object
PRIM_CONTRIBUTORY_CAUSE           object
SEC_CONTRIBUTORY_CAUSE            object
STREET_NO                          int64
STREET_DIRECTION
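
The listing above shows `CRASH_DATE` loaded as `object` (plain strings), so any check of the 2015 versus 2017 coverage mentioned earlier needs a datetime conversion first. The following is a minimal sketch of that step on toy rows, not the team's actual code. The sample values and the `MM/DD/YYYY HH:MM:SS AM/PM` timestamp format are assumptions about how the file stores dates, so verify them against the real CSV before reusing the `format` string.

```python
import pandas as pd

# Toy rows standing in for the real file. The values and the timestamp
# format are assumptions made for illustration only.
sample = pd.DataFrame({
    "RD_NO": ["A1", "A2", "A3", "A4"],
    "CRASH_DATE": [
        "11/02/2015 08:15:00 AM",
        "06/30/2017 05:45:00 PM",
        "07/01/2017 11:05:00 PM",
        "01/15/2018 12:30:00 AM",
    ],
})

# Parse the strings into timestamps; values that do not match the
# assumed format become NaT instead of raising an error.
sample["CRASH_DATE"] = pd.to_datetime(
    sample["CRASH_DATE"], format="%m/%d/%Y %I:%M:%S %p", errors="coerce"
)

# Count crashes per calendar year, ordered by year.
crashes_per_year = sample["CRASH_DATE"].dt.year.value_counts().sort_index()
print(crashes_per_year)
```

On the full dataset, the same count per year would be one way to see how sparse the early records are.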

In [4]:
# Getting a feel for the size of the data
crashes_df.shape

(227505, 48)

#### 48 columns is a lot to sort through, so we listed them to decide which might be useful for asking questions.

In [5]:
# Display the column labels (a pandas Index)
crashes_df.columns

Index(['RD_NO', 'CRASH_DATE_EST_I', 'CRASH_DATE', 'POSTED_SPEED_LIMIT',
       'TRAFFIC_CONTROL_DEVICE', 'DEVICE_CONDITION', 'WEATHER_CONDITION',
       'LIGHTING_CONDITION', 'FIRST_CRASH_TYPE', 'TRAFFICWAY_TYPE', 'LANE_CNT',
       'ALIGNMENT', 'ROADWAY_SURFACE_COND', 'ROAD_DEFECT', 'REPORT_TYPE',
       'CRASH_TYPE', 'INTERSECTION_RELATED_I', 'NOT_RIGHT_OF_WAY_I',
       'HIT_AND_RUN_I', 'DAMAGE', 'DATE_POLICE_NOTIFIED',
       'PRIM_CONTRIBUTORY_CAUSE', 'SEC_CONTRIBUTORY_CAUSE', 'STREET_NO',
       'STREET_DIRECTION', 'STREET_NAME', 'BEAT_OF_OCCURRENCE',
       'PHOTOS_TAKEN_I', 'STATEMENTS_TAKEN_I', 'DOORING_I', 'WORK_ZONE_I',
       'WORK_ZONE_TYPE', 'WORKERS_PRESENT_I', 'NUM_UNITS',
       'MOST_SEVERE_INJURY', 'INJURIES_TOTAL', 'INJURIES_FATAL',
       'INJURIES_INCAPACITATING', 'INJURIES_NON_INCAPACITATING',
       'INJURIES_REPORTED_NOT_EVIDENT', 'INJURIES_NO_INDICATION',
       'INJURIES_UNKNOWN', 'CRASH_HOUR', 'CRASH_DAY_OF_WEEK', 'CRASH_MONTH',
       'LATITUDE', 'LONGITUDE
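
One way to inform the choice of which columns to keep is to check how much of each column is missing. The sketch below is an illustration under stated assumptions, not necessarily the procedure the team followed: the toy values are made up, and only the column names are taken from the listing above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset.
toy = pd.DataFrame({
    "LANE_CNT": [2.0, np.nan, 4.0, np.nan],
    "WORK_ZONE_I": [np.nan, np.nan, np.nan, "Y"],
    "POSTED_SPEED_LIMIT": [30, 30, 25, 35],
})

# Fraction of missing values per column, most incomplete first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running `crashes_df.isna().mean()` on the real dataframe would give the same summary for all 48 columns.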

#### Removing data we considered questionable or irrelevant
Our team iteratively reviewed the data and decided to exclude crashes that reported a lane count above 12. We also removed columns that we agreed were unlikely to add value to our analysis.

In [6]:
# Create a new dataframe with crashes reporting a lane count of 12 or fewer.
# Rows with a missing LANE_CNT are excluded too, because NaN <= 12 is False.
# .copy() gives an independent dataframe rather than a slice of crashes_df.
crashes_1 = crashes_df.loc[crashes_df['LANE_CNT'] <= 12].copy()
# Remove extraneous columns
columns_to_drop = ['INJURIES_NON_INCAPACITATING', 'INJURIES_REPORTED_NOT_EVIDENT',
                   'INJURIES_NO_INDICATION', 'INJURIES_UNKNOWN',
                   'INJURIES_INCAPACITATING', 'PHOTOS_TAKEN_I',
                   'STATEMENTS_TAKEN_I', 'DOORING_I',
                   'WORK_ZONE_I', 'WORK_ZONE_TYPE',
                   'WORKERS_PRESENT_I', 'NOT_RIGHT_OF_WAY_I',
                   'DATE_POLICE_NOTIFIED', 'LOCATION']

crashes_1 = crashes_1.drop(columns=columns_to_drop)
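
A comparison such as `LANE_CNT <= 12` evaluates to `False` for missing values, so the filter above removes rows with no recorded lane count as well as rows above the threshold. The sketch below shows this on a hypothetical four-row frame and counts each kind of removal separately. The toy values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical lane counts: two valid, one implausibly high, one missing.
toy = pd.DataFrame({"LANE_CNT": [2.0, 12.0, 99.0, np.nan]})

# Same filter as in the notebook cell.
kept = toy.loc[toy["LANE_CNT"] <= 12]

# Break the removed rows down by reason.
n_missing = int(toy["LANE_CNT"].isna().sum())
n_above = int((toy["LANE_CNT"] > 12).sum())

print(len(kept), n_missing, n_above)
```

Applying the same two counts to `crashes_df` would show how much of the row reduction comes from missing lane counts rather than from values above 12.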

In [7]:
# A look at the refined data
crashes_1.columns

Index(['RD_NO', 'CRASH_DATE_EST_I', 'CRASH_DATE', 'POSTED_SPEED_LIMIT',
       'TRAFFIC_CONTROL_DEVICE', 'DEVICE_CONDITION', 'WEATHER_CONDITION',
       'LIGHTING_CONDITION', 'FIRST_CRASH_TYPE', 'TRAFFICWAY_TYPE', 'LANE_CNT',
       'ALIGNMENT', 'ROADWAY_SURFACE_COND', 'ROAD_DEFECT', 'REPORT_TYPE',
       'CRASH_TYPE', 'INTERSECTION_RELATED_I', 'HIT_AND_RUN_I', 'DAMAGE',
       'PRIM_CONTRIBUTORY_CAUSE', 'SEC_CONTRIBUTORY_CAUSE', 'STREET_NO',
       'STREET_DIRECTION', 'STREET_NAME', 'BEAT_OF_OCCURRENCE', 'NUM_UNITS',
       'MOST_SEVERE_INJURY', 'INJURIES_TOTAL', 'INJURIES_FATAL', 'CRASH_HOUR',
       'CRASH_DAY_OF_WEEK', 'CRASH_MONTH', 'LATITUDE', 'LONGITUDE'],
      dtype='object')

#### This cut the dataset down significantly, from 227,505 rows and 48 columns to the shape shown below, but we now have cleaner data from which we hope to derive some insights.

In [8]:
crashes_1.shape

(162055, 34)

#### From here, it's time to export to a shareable CSV, split up, and perform individual analyses.

In [9]:
crashes_1.to_csv('Output/Crashes.csv')
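
Two details of the export are worth knowing. `to_csv` fails if the target folder does not exist, and by default it also writes the dataframe index as an extra unnamed first column. The sketch below is one possible variant, not the team's code: `export_csv` is a hypothetical helper, and passing `index=False` is a design choice that differs from the call above, so teammates reading the file would see only the data columns.

```python
from pathlib import Path
import tempfile

import pandas as pd


def export_csv(df, path):
    """Write df to path as CSV, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # index=False leaves the row index out of the file.
    df.to_csv(path, index=False)
    return path


# Hypothetical toy frame written to a temporary folder and read back.
toy = pd.DataFrame({"RD_NO": ["A1", "A2"], "LANE_CNT": [2.0, 4.0]})
with tempfile.TemporaryDirectory() as tmp:
    written = export_csv(toy, Path(tmp) / "Output" / "Crashes.csv")
    round_trip = pd.read_csv(written)

print(list(round_trip.columns))
```

If the index column is kept, as in the original call, downstream notebooks can skip it with `pd.read_csv(..., index_col=0)`.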