In [3]:
import pandas as pd
from sodapy import Socrata

# Exploratory Data Analysis

We're going to explore a large data set of traffic crashes to learn which factors are connected with injuries. We will use data from the city of [Chicago's open data portal](https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if/about_data). (This activity is derived from a [blog post](https://juliasilge.com/blog/chicago-traffic-model/) by Julia Silge.)

In [30]:
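# None means no app token is supplied, so requests are made anonymously
client = Socrata("data.cityofchicago.org", None)

# Fetch crashes after 2022-01-01. Without a `limit` argument the API
# returns at most 1,000 rows per request.
results = client.get("85ca-t3if", where="CRASH_DATE > '2022-01-01'")

# Convert to pandas DataFrame
crash_raw = pd.DataFrame.from_records(results)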



In [31]:
crash_raw.columns

Index(['crash_record_id', 'crash_date', 'posted_speed_limit',
       'traffic_control_device', 'device_condition', 'weather_condition',
       'lighting_condition', 'first_crash_type', 'trafficway_type',
       'alignment', 'roadway_surface_cond', 'road_defect', 'report_type',
       'crash_type', 'hit_and_run_i', 'damage', 'date_police_notified',
       'prim_contributory_cause', 'sec_contributory_cause', 'street_no',
       'street_direction', 'street_name', 'beat_of_occurrence', 'num_units',
       'most_severe_injury', 'injuries_total', 'injuries_fatal',
       'injuries_incapacitating', 'injuries_non_incapacitating',
       'injuries_reported_not_evident', 'injuries_no_indication',
       'injuries_unknown', 'crash_hour', 'crash_day_of_week', 'crash_month',
       'latitude', 'longitude', 'location', 'crash_date_est_i',
       'photos_taken_i', 'intersection_related_i', 'private_property_i',
       'statements_taken_i', 'work_zone_i', 'dooring_i'],
      dtype='object')

This dataset is pretty crazy! Let's do some data munging to get it into a nicer form. 

  - create a variable called `injuries` which indicates if the crash involved injuries or not.
  - create an unknown category for missing `report_type`s
  - decide which other variables to keep

In [45]:
# convert the 'crash_date' column to datetime format
crash_raw['crash_date'] = pd.to_datetime(crash_raw['crash_date'])
# flag crashes with at least one reported injury (injuries_total arrives as text)
crash_raw['injuries'] = (pd.to_numeric(crash_raw['injuries_total']) > 0)

crash = crash_raw[['crash_date', 'injuries', 'latitude', 'longitude']]
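
The cell above handles the `injuries` flag but not the second item in the list, an unknown category for missing `report_type` values. Here is a minimal sketch of one way to do that. It uses a small made-up frame (`toy`) in place of `crash_raw` so it runs without downloading anything; the rows are invented for illustration, and the label `"UNKNOWN"` is an assumption, not something the dataset prescribes.

```python
import pandas as pd

# Made-up rows standing in for crash_raw (illustration only, not real data).
# Values are strings because the cell above converts them with pd.to_numeric.
toy = pd.DataFrame({
    "crash_date": [
        "2022-03-01T08:15:00.000",
        "2023-07-04T22:40:00.000",
        "2023-07-09T13:05:00.000",
    ],
    "injuries_total": ["0", "2", None],
    "report_type": ["ON SCENE", None, "NOT ON SCENE (DESK REPORT)"],
})

toy["crash_date"] = pd.to_datetime(toy["crash_date"])

# A missing injuries_total becomes NaN, and NaN > 0 is False,
# so that crash is counted as having no injuries.
toy["injuries"] = pd.to_numeric(toy["injuries_total"]) > 0

# Replace missing report types with an explicit category.
toy["report_type"] = toy["report_type"].fillna("UNKNOWN")

# .copy() gives an independent frame that is safe to modify later.
crash_toy = toy[["crash_date", "injuries", "report_type"]].copy()
print(crash_toy)
```

The same `fillna` line should work on `crash_raw["report_type"]` before selecting columns. Whether a missing `injuries_total` should count as "no injuries" is a judgment call worth checking against the data.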

## Exploratory Data Analysis

Here are a few questions to get you started.

1. Take a look at crashes by latitude and longitude, colored by injuries. What do you notice?

2. What are the most common contributing factors to a crash?

3. How do crashes vary month by month? Compare crashes by month in 2022 to 2023.

4. Are crashes more likely to cause injuries when it is rainy and dark? Use the variables `weather_condition` and `lighting_condition` to explore.

5. Choose a question you want to explore, and create an appropriate visual.
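
As a possible starting point for questions 3 and 4, here is a rough sketch of the two groupings involved. It runs on a tiny made-up frame (`toy`), so the numbers mean nothing; the rows and category labels are invented for illustration. It also assumes `weather_condition` and `lighting_condition` have been kept alongside `crash_date` and `injuries`.

```python
import pandas as pd

# Made-up rows standing in for the crash data (illustration only).
toy = pd.DataFrame({
    "crash_date": pd.to_datetime([
        "2022-01-05", "2022-01-20", "2022-02-11",
        "2023-01-08", "2023-02-02", "2023-02-15",
    ]),
    "weather_condition": ["RAIN", "CLEAR", "CLEAR", "RAIN", "RAIN", "CLEAR"],
    "lighting_condition": ["DARKNESS", "DAYLIGHT", "DAYLIGHT",
                           "DARKNESS", "DAYLIGHT", "DARKNESS"],
    "injuries": [True, False, False, True, False, False],
})

# Question 3: crash counts with one row per month and one column per year.
month = toy["crash_date"].dt.month.rename("month")
year = toy["crash_date"].dt.year.rename("year")
monthly = toy.groupby([month, year]).size().unstack("year", fill_value=0)
print(monthly)

# Question 4: number of crashes and share with injuries
# for each weather/lighting combination.
injury_rate = (
    toy.groupby(["weather_condition", "lighting_condition"])["injuries"]
       .agg(crashes="size", injury_rate="mean")
)
print(injury_rate)
```

Keeping the `crashes` count next to `injury_rate` helps when reading the real data, since a rate computed from a handful of crashes is not very informative.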