# Analysis of Chicago Red Light Data
<p>This workbook analyzes red light violations in the city of Chicago as captured by red light cameras at intersections throughout the city between July 2014 and May 2019. The dataset contains 183 unique intersections and 365 unique camera locations. Many intersections have multiple camera locations to capture each approach to the intersection. Because camera locations are not identified by approach, this analysis uses only the intersection. The apparent seasonal trend in the data led me to incorporate NOAA weather data for station ID 725300 (Chicago/O'Hare International Airport), pulled from the NOAA GSOD public BigQuery dataset.</p>

## Table of Contents
- [1. Load Libraries, import and format data](#1)
    - [1.1 Load Libraries and data](#1.1)
    - [1.2 Format Red Light data](#1.2)
    - [1.3 Get NOAA data from BigQuery dataset](#1.3)
    - [1.4 Chart: Violations Over Time](#1.4)
    - [1.5 Map: Top 25 Violation Locations](#1.5)
- [2. Influence of weather](#2)
    - [2.1 Precipitation](#2.1)
    - [2.2 Temperature](#2.2)
    - [2.3 Winter Weather](#2.3)
    - [2.4 Snow/Ice Pellets](#2.4)
- [3. Day of week](#3)
    - [3.1 Violations by day (30 day detail)](#3.1)
    - [3.2 Total violations by day](#3.2)
    - [3.3 Day of week boxplot](#3.3)
    
## <a id="1">1. Load Libraries, import and format data</a>

### <a id="1.1">1.1 Load Libraries and data</a>


In [None]:
# Core libraries for data processing and date arithmetic
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from datetime import datetime, timedelta

# Plotting (plotly, offline mode), mapping (folium), and BigQuery access (bq_helper)
# Input data files are read from the "../input/" directory.
import plotly.graph_objs as go
import folium
import bq_helper

from plotly.offline import init_notebook_mode, iplot
init_notebook_mode()

cam_data=pd.read_csv("../input/red-light-camera-violations.csv")
cam_locations = pd.read_csv("../input/red-light-camera-locations.csv")
noaa_data_set = bq_helper.BigQueryHelper(active_project= "bigquery-public-data",
                                        dataset_name= "noaa_gsod")

### <a id="1.2">1.2 Format Red Light Data</a>

In [None]:
#Convert the violation date to a datetime format for use in time series analysis
cam_data['VIOLATION DATE']=pd.to_datetime(cam_data['VIOLATION DATE'])
#Get the first and last dates in the dataset. These are used to create dynamic titles for the charts
min_date = cam_data['VIOLATION DATE'].min()
max_date = cam_data['VIOLATION DATE'].max()
#Sum violations across all cameras for each day
cam_df=cam_data.groupby('VIOLATION DATE')[['VIOLATIONS']].agg('sum')
cam_df=cam_df.rename_axis('date').reset_index()


### <a id="1.3">1.3 Get NOAA data from BigQuery dataset</a>

In [None]:
#get weather data for O'Hare airport in Chicago
base_query = """
SELECT
    CAST(CONCAT(year,'-',mo,'-',da) AS date) AS date,
    temp,
    wdsp,
    max AS max_temp,
    min AS min_temp,
    prcp,
    sndp AS snow_depth,
    fog,
    rain_drizzle,
    snow_ice_pellets,
    hail,
    thunder,
    tornado_funnel_cloud
FROM
"""

where_clause = """
WHERE stn='725300'
"""
tables=[
    "`bigquery-public-data.noaa_gsod.gsod2019`",
    "`bigquery-public-data.noaa_gsod.gsod2018`",
    "`bigquery-public-data.noaa_gsod.gsod2017`",
    "`bigquery-public-data.noaa_gsod.gsod2016`",
    "`bigquery-public-data.noaa_gsod.gsod2015`",
    "`bigquery-public-data.noaa_gsod.gsod2014`"]

#Build one SELECT per yearly table and combine them with UNION ALL
query = "UNION ALL \n".join(
    "{0} {1} {2}".format(base_query, table, where_clause) for table in tables
)

weather_data= noaa_data_set.query_to_pandas_safe(query, max_gb_scanned=2.0)
weather_data['date']=pd.to_datetime(weather_data['date'])

#merge weather data with violation data (the default inner join keeps only dates present in both)
weather=weather_data.merge(cam_df, on='date')
weather=weather.rename(columns={'VIOLATIONS': 'red_light_violations'})

#handle missing-value placeholders
#replace snow depth equal to 999.9 with 0 (999.9 is used for missing values)
weather['snow_depth']=weather['snow_depth'].replace(999.9,0.0)
#drop the rows holding the largest max_temp value, which is assumed to be a missing-value placeholder rather than a real reading
weather=weather.drop(weather[weather.max_temp==weather.max_temp.max()].index)
total_rows=len(weather)

#flag winter weather: minimum temperature at or below 35F with any precipitation
weather['winter_weather']=np.where((weather['min_temp']<=35.0) & (weather['prcp']>0), 1, 0)
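
The per-year `UNION ALL` query in this section could probably also be written against a BigQuery wildcard table. The sketch below only builds the query string and does not run it. The helper name `build_gsod_query` is hypothetical, and it assumes the yearly tables are all named `gsod<year>` and share one schema, as the table list in this notebook suggests.

```python
def build_gsod_query(station_id: str, first_year: int, last_year: int) -> str:
    """Build a single query over the yearly GSOD tables via a wildcard table.

    Hypothetical helper: assumes tables are named gsod<year> with a shared schema.
    """
    return f"""
SELECT
    CAST(CONCAT(year, '-', mo, '-', da) AS date) AS date,
    temp,
    prcp
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE _TABLE_SUFFIX BETWEEN '{first_year}' AND '{last_year}'
  AND stn = '{station_id}'
"""


# Same station and year range as the notebook's table list
query = build_gsod_query("725300", 2014, 2019)
print(query)
```

Only three columns are selected here to keep the sketch short; the full column list from `base_query` would slot into the same place.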

### <a id="1.4">1.4 Chart: Violations Over Time</a>
Red light violations appear to follow a seasonal trend, peaking in July and bottoming out in January. Why are people less likely to run red lights in January? One possibility is that drivers are more cautious during winter months than in the summer. The chart also contains a significant amount of noise, which appears to be at least partially due to the day of week: drivers tend to have more violations toward the end of the week than at the beginning.
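
One way to separate the seasonal trend from the day-of-week noise is a 7-day rolling mean. The sketch below uses synthetic daily counts (not the Chicago data) with a yearly cycle and a weekend bump; the `demo` frame and its numbers are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily counts (NOT the Chicago data): low in January, high in July
dates = pd.date_range("2018-01-01", "2018-12-31", freq="D")
seasonal = 1000 - 300 * np.cos(2 * np.pi * dates.dayofyear.to_numpy() / 365)
# Add a fixed bump on Friday, Saturday and Sunday (dayofweek 4, 5, 6)
weekly = np.where(dates.dayofweek >= 4, 150, 0)
demo = pd.DataFrame({"date": dates, "VIOLATIONS": seasonal + weekly})

# A centered 7-day window always contains exactly one of each weekday,
# so the weekly bump averages out to a constant offset.
demo["rolling_7d"] = demo["VIOLATIONS"].rolling(window=7, center=True).mean()

# The smoothed series should vary less than the raw one
print(demo[["VIOLATIONS", "rolling_7d"]].std())
```

On the real data, the same `rolling` call could be applied to `cam_df['VIOLATIONS']` and plotted as a second trace.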

In [None]:
#Build one shaded rectangle per weekend, spanning from each Friday to the following Saturday, for use as plotly layout shapes
def shape(startdate, enddate,opacity):
    shapes=[]
    while startdate <= enddate:
        #weekday() numbers Monday as 0, so 4 is Friday
        if startdate.weekday() ==4:
            shapes.append({
            'type': 'rect',
            'xref': 'x',
            'yref': 'paper',
            'x0': startdate,
            'y0': 0,
            'x1': startdate+timedelta(days=1),
            'y1': 1,
            'fillcolor': '#d3d3d3',
            'opacity': opacity,
            'line': {
                'width': 0,
                }
            }
            )
            #skip the Saturday that follows, since the rectangle already extends to it
            startdate=startdate+timedelta(days=1)
        startdate=startdate+timedelta(days=1)
    return shapes

#plot the time series data
data = [go.Scatter(x=cam_df['date'], y=cam_df['VIOLATIONS'])]

layout = {'title':'Number of Red Light Violations between {:%x} and {:%x}'.format(min_date,max_date),
    #shade each Friday-to-Saturday span with a rectangle
    'shapes': shape(min_date,max_date,0.3)}

fig = dict(data=data, layout=layout)
iplot(fig)

### <a id="1.5">1.5 Map: Top 25 Violation Locations</a>
This map displays the top 25 intersections by violation count. Many of these locations are near downtown. The top location, Cicero and I55, is near Chicago Midway International Airport, which may have an impact on red light violations. This location also appears to be an entrance ramp from Cicero onto I55. Entrance ramp red lights may operate only during high-traffic periods, or they may seem irrelevant to drivers when traffic is light, either of which may explain the high number of violations.

In [None]:
top_num=25
locations=cam_data.groupby(['INTERSECTION', 'LATITUDE', 'LONGITUDE'], as_index=False)[['VIOLATIONS']].agg('sum')
locations=locations.sort_values(by=['VIOLATIONS'], ascending=False)
locations=locations.head(top_num)

chicago_location = [41.8781, -87.6298]

m = folium.Map(location=chicago_location, zoom_start=11)
for i in range(0,len(locations)):
    folium.Circle(
      location=[locations.iloc[i]['LATITUDE'], locations.iloc[i]['LONGITUDE']],
      popup="{}: {}".format(locations.iloc[i]['INTERSECTION'],locations.iloc[i]['VIOLATIONS']),
      radius=int(locations.iloc[i]['VIOLATIONS'])/100,
      color='crimson',
      fill=True,
      fill_color='crimson'
    ).add_to(m)
m

## <a id="2">2. Influence of weather on red light violations</a>
### <a id="2.1">2.1 Precipitation</a>
The number of red light violations does seem to decline as rainfall increases, but there are also significantly fewer data points at the high end of the rainfall range. Violations vary widely at the low end of the precipitation range, which implies that other factors also affect red light violations.
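
To check how thin the data gets at the high end, one option is to bin precipitation and count the days in each bin alongside the mean. This is a minimal sketch on made-up rows standing in for the merged `weather` frame; the bin edges and labels are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up rows standing in for the merged weather frame
demo = pd.DataFrame({
    "prcp": [0.0, 0.0, 0.0, 0.0, 0.05, 0.1, 0.3, 1.2],
    "red_light_violations": [1500, 900, 1300, 1100, 1400, 1000, 950, 800],
})

# Illustrative bin edges in inches; the first bin captures exactly-zero days
bins = [-0.001, 0.0, 0.25, 1.0, 5.0]
labels = ["none", "light", "moderate", "heavy"]
demo["prcp_bin"] = pd.cut(demo["prcp"], bins=bins, labels=labels)

# Count, mean and spread per bin; a small count means a weakly supported mean
summary = (
    demo.groupby("prcp_bin", observed=False)["red_light_violations"]
    .agg(["count", "mean", "std"])
)
print(summary)
```

A bin holding a single day reports a missing standard deviation, which is itself a signal that the bin is too sparse to interpret.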

In [None]:
data = [go.Scatter(x=weather['prcp'],
    y=weather['red_light_violations'],
    mode='markers')]

layout = {'title':'Correlation between Precipitation and Red Light Violations',
          'xaxis': {'title':'Precipitation (inches)'},
          'yaxis': {'title': 'Red Light Violations'}
}

fig = dict(data=data, layout=layout)
iplot(fig)

### <a id="2.2">2.2 Temperature</a>
There appears to be a slight positive correlation between temperature and red light violations, although the data points still vary widely. Temperature alone does not explain the number of red light violations.
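
A scatter plot shows the direction of the relationship, and a Pearson coefficient can put a number on its strength. The sketch below runs on synthetic values (the slope and noise level are invented), so the printed figure says nothing about the Chicago data.

```python
import numpy as np

# Synthetic stand-ins: temperatures in Fahrenheit and noisy daily counts
rng = np.random.default_rng(0)
min_temp = rng.uniform(-5, 75, size=500)
violations = 900 + 4 * min_temp + rng.normal(0, 150, size=500)

# Pearson r is the off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(min_temp, violations)[0, 1]

# r squared is the share of variance a straight-line fit would explain
print(f"Pearson r = {r:.2f}, r^2 = {r ** 2:.2f}")
```

With the real frame, the equivalent call would be `weather['min_temp'].corr(weather['red_light_violations'])`.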

In [None]:
data = [go.Scatter(x=weather['min_temp'],
    y=weather['red_light_violations'],
    mode='markers')]

layout = {'title':'Correlation between Min Temp and Red Light Violations',
          'xaxis': {'title':'Min Temp (ºF)'},
          'yaxis': {'title': 'Red Light Violations'}
}

fig = dict(data=data, layout=layout)
iplot(fig)

### <a id="2.3">2.3 Winter weather</a>
This analysis looks at whether conditions that make drivers more cautious, such as winter weather, have an impact on the number of red light violations. Winter weather was defined as a minimum temperature of 35ºF or less (35ºF being the temperature at which most cars display a freeze warning) together with precipitation greater than 0 inches. Combining temperature and precipitation does show a correlation between winter weather conditions and a decrease in red light violations. This partly explains the seasonal trend, since Chicago is much more likely to see winter weather conditions in January than in July.
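
The winter weather rule can be traced on a handful of rows. The values below are invented to exercise the boundary cases (exactly 35ºF, zero precipitation); only the rule itself comes from the notebook.

```python
import numpy as np
import pandas as pd

# Invented rows chosen to hit the edges of the rule
demo = pd.DataFrame({
    "min_temp": [20.0, 34.0, 35.0, 36.0, 50.0, 70.0],
    "prcp": [0.10, 0.00, 0.02, 0.30, 0.00, 0.50],
    "red_light_violations": [800, 1000, 850, 1200, 1300, 1250],
})

# Both conditions must hold: cold enough AND some precipitation
demo["winter_weather"] = np.where(
    (demo["min_temp"] <= 35.0) & (demo["prcp"] > 0), 1, 0
)

# Compare the typical day in each group
medians = demo.groupby("winter_weather")["red_light_violations"].median()
print(demo[["min_temp", "prcp", "winter_weather"]])
print(medians)
```

A cold but dry day (34ºF, no precipitation) and a wet but mild day (36ºF) both stay unflagged, which is the intent of combining the two variables.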

In [None]:
y0=weather[weather['winter_weather']==0]['red_light_violations']
y1=weather[weather['winter_weather']==1]['red_light_violations']

trace0 = go.Box(y=y0, name='No Winter Weather')
trace1 = go.Box(y=y1, name='Winter Weather')

data = [trace0, trace1]
fig=dict(data=data)
iplot(fig)

### <a id="2.4">2.4 Snow/Ice Pellets</a>
The NOAA weather data includes an indicator, "snow_ice_pellets", which produces roughly the same relationship as the custom winter weather category created from the precipitation and temperature variables. So why not just use this indicator instead of creating a winter weather category? If the city of Chicago wanted to project future red light violations using weather data as features, nearly all forecasts contain temperature and precipitation projections, but they may not contain anything specific to snow or ice. I tried to limit the analysis to information that could easily be gathered from weather forecasts for use in projections.
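
How closely the custom category tracks the NOAA indicator could be checked with a cross-tabulation. This is a sketch on invented flags; it assumes `snow_ice_pellets` arrives as the strings '0' and '1', which is how the notebook compares it.

```python
import pandas as pd

# Invented flags for eight days
demo = pd.DataFrame({
    "winter_weather": [1, 1, 0, 0, 0, 1, 0, 0],
    "snow_ice_pellets": ["1", "1", "0", "0", "1", "0", "0", "0"],
})

# Cast the string indicator to int so the two flags are comparable
demo["snow_flag"] = demo["snow_ice_pellets"].astype(int)

# Rows: custom category; columns: NOAA indicator
agreement = pd.crosstab(demo["winter_weather"], demo["snow_flag"])

# Share of days on which the two flags agree
share_matching = (demo["winter_weather"] == demo["snow_flag"]).mean()
print(agreement)
print(share_matching)
```

The off-diagonal cells show where the two definitions disagree, for example cold rain flagged as winter weather without any snow report.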

In [None]:
y0=weather[weather['snow_ice_pellets']=='0']['red_light_violations']
y1=weather[weather['snow_ice_pellets']=='1']['red_light_violations']

trace0 = go.Box(y=y0, name="No Snow/Ice")
trace1 = go.Box(y=y1, name="Snow/Ice")

data = [trace0, trace1]
fig=dict(data=data)
iplot(fig)

## <a id="3">3. Day of week</a>
### <a id="3.1">3.1 Violations by day (30 day detail)</a>
Looking at red light violations by day over a time period with less seasonal influence (July 2018) shows that violations tend to increase throughout the week, peaking on Friday or Saturday. The shaded area of the chart indicates Friday/Saturday.
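
The shading relies on Python's weekday numbering, where Monday is 0 and Friday is 4. The sketch below lists the Fridays in the July 2018 window using only the standard library; `fridays_between` is a hypothetical helper, not part of the notebook.

```python
from datetime import date, timedelta


def fridays_between(start: date, end: date) -> list:
    """Return every Friday from start to end, inclusive (Monday is weekday 0)."""
    days = (start + timedelta(days=n) for n in range((end - start).days + 1))
    return [d for d in days if d.weekday() == 4]


# Same window as the 30-day detail chart
fridays = fridays_between(date(2018, 7, 1), date(2018, 7, 30))
for friday in fridays:
    print(friday.isoformat())
```

Each of these dates is the left edge of one shaded rectangle, with the right edge one day later.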

In [None]:
#Subset cam_df to the first 30 days of July 2018
last_thirty=cam_df[(cam_df['date']>='2018-07-01')&(cam_df['date']<='2018-07-30')]

#plot the time series data
data = [go.Scatter(x=last_thirty['date'], y=last_thirty['VIOLATIONS'])]

layout = {'title': 'Number of violations between {:%x} and {:%x}'.format(last_thirty['date'].min(),last_thirty['date'].max()),
         'yaxis': {'range': [0,last_thirty['VIOLATIONS'].max()]},
         'shapes': shape(last_thirty['date'].min(),last_thirty['date'].max(),.5)}
fig = dict(data=data, layout=layout)
iplot(fig)


### <a id="3.2">3.2 Total violations by day</a>
The trend is more visible looking at total violations by day.

In [None]:
#Total violations per weekday; day_num (Monday=0) is kept only to sort the days in calendar order
dow=cam_data.groupby([cam_data['VIOLATION DATE'].dt.day_name().rename('dow'),
                      cam_data['VIOLATION DATE'].dt.dayofweek.rename('day_num')])[['VIOLATIONS']].agg('sum')
dow=dow.reset_index()
dow=dow.sort_values(by=['day_num'])

In [None]:
x=dow['dow']
y=dow['VIOLATIONS']

data=[go.Bar(x=x, y=y)]
iplot(data, filename='basic-bar')

### <a id="3.3">3.3 Day of week boxplot</a>
The boxplots of violations by day continue to show the day-of-week trend even with the variance taken into account.
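
The numbers behind a boxplot are the quartiles of each group, which can also be tabulated directly. The sketch below computes them per weekday on synthetic counts (four invented weeks with a built-in upward drift through the week), not on the Chicago data.

```python
import numpy as np
import pandas as pd

# Four synthetic weeks starting on Monday 2018-07-02
dates = pd.date_range("2018-07-02", periods=28, freq="D")
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "date": dates,
    # Invented drift of 40 per weekday step plus small random noise
    "VIOLATIONS": 1000 + 40 * dates.dayofweek.to_numpy() + rng.integers(-30, 30, size=28),
})

# Quartiles per weekday: the box edges (0.25, 0.75) and the median line (0.5)
quartiles = (
    demo.groupby(demo["date"].dt.day_name())["VIOLATIONS"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

# groupby sorts day names alphabetically, so restore calendar order
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
quartiles = quartiles.reindex(day_order)
print(quartiles)
```

If the boxes for adjacent days do not overlap in this table, the day-of-week difference is larger than the typical within-day spread.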

In [None]:
#One box per weekday, in calendar order (dayofweek: Monday=0 ... Sunday=6)
day_names = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
data = [go.Box(y=cam_df[cam_df['date'].dt.dayofweek==day_num]['VIOLATIONS'], name=day_name)
        for day_num, day_name in enumerate(day_names)]
fig=dict(data=data)
iplot(fig)

In [None]:
#thank you