


# Visualizations - NYPD Motor Vehicle Collisions Dataset

We use the open-source libraries plotly and seaborn to build visualizations, some of them interactive. We also use the cufflinks library, which lets us plot directly from pandas DataFrames.

Sources and cufflinks examples:<br>

<a href="https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95">https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95</a><br>
<a href="https://plot.ly/ipython-notebooks/cufflinks/">https://plot.ly/ipython-notebooks/cufflinks/</a>
<br>

### Instructions

1) To run Jupyter notebooks within Chorus, you need to set up a dedicated server and make all the needed configurations. See [our installation instructions](https://alpine.atlassian.net/wiki/display/V6/How+to+Install+Jupyter+Notebook).<br>

2) <i>(Once 1 is completed)</i> DO NOT modify/run this script in the current workspace. You should copy it to your own workspace (using the Copy button after closing the notebook).


In [None]:
import sys

if sys.version_info.major < 3:
    !pip2 install cufflinks
    !pip2 install seaborn
    !pip2 install plotly
else:
    !pip3 install cufflinks
    !pip3 install seaborn
    !pip3 install plotly

import matplotlib.pyplot as plt
# matplotlib.patches lets us create colored patches that we can use for plot legends
import matplotlib.patches as mpatches 
import seaborn as sns # seaborn also builds on matplotlib and adds graphical features and new plot types

import pandas as pd
import numpy as np
import matplotlib
import cufflinks as cf
import plotly
import plotly.offline as py
import plotly.graph_objs as go
from plotly.graph_objs import Bar, Scatter, Figure, Layout
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from IPython.display import display, HTML

cf.go_offline() # required to use plotly offline
init_notebook_mode(connected=True) # graphs and charts inline
%matplotlib inline


## Getting the data from the web

In [None]:
#Using NYC Open Data api, filtering results for 2014 (limited to 20000 rows)

url = 'https://data.cityofnewyork.us/resource/qiz3-axqb.json?$limit=20000&\
$where=date%20between%20%272014-01-01T00:00:00%27%20and%20%272015-01-01T00:00:00%27'
collisions = pd.read_json(url)
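
The query string above is percent-encoded by hand, which is easy to get wrong when the filter changes. As a minimal sketch, the same URL can be assembled with the standard library; the names `BASE_URL`, `params`, and `query_url` are illustrative and not part of the original notebook.

```python
from urllib.parse import quote, urlencode

# Same endpoint as in the cell above
BASE_URL = "https://data.cityofnewyork.us/resource/qiz3-axqb.json"

# Filter parameters written in readable form
params = {
    "$limit": 20000,
    "$where": "date between '2014-01-01T00:00:00' and '2015-01-01T00:00:00'",
}

# quote_via=quote encodes spaces as %20 (the default, quote_plus, would use '+');
# safe="$:" leaves '$' and ':' unescaped, as in the hand-written URL
query_url = BASE_URL + "?" + urlencode(params, safe="$:", quote_via=quote)
print(query_url)
```

The resulting string can be passed to `pd.read_json` in place of `url`.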

In [None]:
collisions.head(2)

In [None]:
#Columns in the dataset
display(HTML("<b>Columns in the dataset</b>"))
collisions.columns

## Creating a bar chart of contributing factors

Let's look at the contributing factors of vehicle collisions. In our dataset, the factors are divided into 5 columns. We can use the pandas method ``concat`` to combine them into one.

In [None]:
contributing_factors = pd.concat(
          [collisions.contributing_factor_vehicle_1,
           collisions.contributing_factor_vehicle_2,
           collisions.contributing_factor_vehicle_3,
           collisions.contributing_factor_vehicle_4,
           collisions.contributing_factor_vehicle_5])

contributing_factors.head()

Now we need to compute the counts for each contributing factor. We'll filter out ones that are 'Unspecified'.

In [None]:
temp = pd.DataFrame({'contributing_factors':contributing_factors.value_counts()})
df = temp[temp.index != 'Unspecified']
df = df.sort_values(by='contributing_factors', ascending=False)
display(HTML("<b>Top contributing factors in 2014 (limited to 20k collisions) </b>"))
df.head()
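
To make the combine-count-filter steps easier to follow, here is a small sketch on made-up rows. The `toy` frame has only two factor columns and its values are chosen for illustration; it is not a sample of the real data.

```python
import pandas as pd

# Toy stand-in for the collisions table (made-up values, two factor columns only)
toy = pd.DataFrame({
    "contributing_factor_vehicle_1": [
        "Driver Inattention/Distraction", "Unspecified", "Fatigued/Drowsy"],
    "contributing_factor_vehicle_2": [
        "Unspecified", "Driver Inattention/Distraction", None],
})

# Stack the columns into one Series, then count each distinct value
# (value_counts ignores missing values by default)
factors = pd.concat([toy.contributing_factor_vehicle_1,
                     toy.contributing_factor_vehicle_2])
counts = factors.value_counts()

# Drop the 'Unspecified' placeholder
counts = counts[counts.index != "Unspecified"]
print(counts)
```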

Next we will create a horizontal bar chart of the contributing factors using plotly.

In [None]:
df = df.sort_values(by='contributing_factors', ascending=True)

data = [
    go.Bar(
        y=df.index,
        x=df.contributing_factors,
        orientation='h'
    )
]
layout = go.Layout(
    height=1000,  # plotly expects a number here, not a string
    margin=dict(l=300),  # wide left margin so the long factor labels fit
    title="Contributing Factors for Vehicle Collisions in 2014 (limited to 20k collisions)"
)
fig  = go.Figure(data=data, layout=layout)
py.iplot(fig)

### Analysis - Contributing Factors

The contributing factor is not specified for most collisions (these are the 'Unspecified' entries we filtered out).
Among the specified factors, driver distraction, fatigue, and failure to yield right-of-way are common causes of collisions.

## Percentage of collisions involving injuries and deaths per borough

In [None]:
temp2 = pd.DataFrame({'borough':collisions.borough.value_counts()})
df2 = temp2.sort_values(by='borough', ascending=False)

In [None]:
temp3 = pd.DataFrame({'borough_fatal': collisions[collisions.number_of_persons_killed > 0].borough.value_counts(),
                      'borough_injuries': collisions[collisions.number_of_persons_injured > 0].borough.value_counts(),
                     'borough_total': collisions.borough.value_counts()})

# Multiply by 100 so the values really are "per 100 collisions" (i.e. percentages)
temp3['injuries_per_100_collisions'] = 100 * temp3.borough_injuries / temp3.borough_total
temp3['deaths_per_100_collisions'] = 100 * temp3.borough_fatal / temp3.borough_total
display(HTML("<b>Statistics per borough - 2014 (limited to 20k collisions)</b>"))
temp3

In [None]:
temp3[['borough_fatal', 'borough_total']].sort_values(by = 'borough_total', ascending = True)


data = [
    go.Bar(
        y=temp3.borough_total,
        x=temp3.index,
        name='Total collisions',
        orientation='v'
    ),
    go.Bar(
        y=temp3.borough_injuries,
        x=temp3.index,
        name='Total collisions with injuries',
        orientation='v'
    )
]

layout = go.Layout(
    height=500,  # plotly expects a number here, not a string
    margin=dict(l=100),
    title="Total collisions and injuries by borough - 2014 (limited to 20k collisions)"
)

fig  = go.Figure(data=data, layout=layout)
py.iplot(fig)

### Analysis

Manhattan has the most collisions and Staten Island the fewest. The ratio of collisions with injuries to total collisions is lowest in Manhattan (we will look at it in more detail below).

In [None]:
color1 = '#9467bd'
color2 = '#F08B00'

trace1 = go.Scatter(
    x = temp3.index,
    y = temp3['injuries_per_100_collisions'],
    name='injuries_per_100_collisions',
    line = dict(
        color = color1
    )
)

data = [trace1]
layout = go.Layout(
    title= "Injuries per 100 collisions by borough",
    yaxis=dict(
        title='injuries per 100 collisions',
        titlefont=dict(
            color=color1
        ),
        tickfont=dict(
            color=color1
        )
    )
)

fig = go.Figure(data=data, layout=layout)
plot_url = py.iplot(fig)

### Analysis

The percentage of collisions resulting in at least one injury ranges from 18.3% to 35.2% across the five boroughs. It is lowest in Manhattan (18.3%), even though Manhattan has the most collisions in our 2014 sample (limited to 20k collisions).

One possible explanation is that vehicles move more slowly in Manhattan because of heavy traffic and the types of roads (no highways, for example), which results in fewer injuries.
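
The percentages discussed here can also be computed in one step with a `groupby`. The sketch below uses made-up rows, so its numbers do not match the real figures; `toy` and `pct_with_injury` are illustrative names.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "borough": ["MANHATTAN", "MANHATTAN", "MANHATTAN", "MANHATTAN", "BRONX", "BRONX"],
    "number_of_persons_injured": [0, 0, 0, 2, 1, 0],
})

# Flag collisions with at least one injury; the mean of a boolean column
# is the share of True values, so multiplying by 100 gives a percentage
has_injury = toy.number_of_persons_injured > 0
pct_with_injury = has_injury.groupby(toy.borough).mean() * 100
print(pct_with_injury)
```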

## Creating a correlation heatmap 

In [None]:
collisions['hour'] = pd.DatetimeIndex(collisions.time).hour
corr2 = pd.get_dummies(collisions[['hour','borough','number_of_cyclist_injured',
                                   'number_of_cyclist_killed', 'number_of_motorist_injured',
                                   'number_of_motorist_killed', 'number_of_pedestrians_killed',
                                  'number_of_persons_killed']]).corr()
f, ax = plt.subplots(figsize=(13, 11))
ax.set_title("Correlations")
sns.heatmap(corr2, vmax=.3,
            square=True)

### Example of a correlation heatmap

This heatmap is a convenient way to visualize correlations between variables. To make the analysis more relevant, we should use feature engineering to build a subset of meaningful features to analyze (for example, binary features for locations of interest, vehicle types of interest, or days of the week).
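
As a rough sketch of what such feature engineering could look like, the example below derives three binary features from made-up rows and computes their correlation matrix. The rush-hour window and the feature names are assumptions chosen for illustration, not part of the original analysis.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "date": pd.to_datetime(["2014-03-01", "2014-03-03", "2014-03-08", "2014-03-12"]),
    "hour": [8, 17, 23, 13],
    "number_of_persons_injured": [0, 1, 2, 0],
})

# Hypothetical rush-hour window; adjust to whatever definition fits the study
RUSH_HOURS = [7, 8, 9, 16, 17, 18]

features = pd.DataFrame({
    # dayofweek: Monday = 0 ... Sunday = 6
    "is_weekend": (toy.date.dt.dayofweek >= 5).astype(int),
    "is_rush_hour": toy.hour.isin(RUSH_HOURS).astype(int),
    "injured_yes": (toy.number_of_persons_injured >= 1).astype(int),
})

# Correlation matrix of the engineered features, ready for sns.heatmap
corr = features.corr()
print(corr)
```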

## Creating boxplots 

In [None]:
sns.set(style="darkgrid")
plt.figure(figsize=(12,6))
ax = sns.boxplot(y="hour",x ="borough",
               data=collisions, palette="Set1")
display(HTML("<b>Distribution of collisions during the day (per borough)</b>"))

The median collision time is around 2 p.m. in all boroughs, and the distributions look very similar. There are slightly more morning collisions in the Bronx and Queens, which may be worth investigating.
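
One way to follow up is to look at the numbers behind the boxplot. The sketch below, on made-up rows, computes the quartiles of the collision hour per borough and the share of collisions that fall in the morning; the 6-to-11 "morning" window is an assumption.

```python
import pandas as pd

# Made-up rows for illustration; the real notebook would use `collisions`
toy = pd.DataFrame({
    "borough": ["BRONX"] * 4 + ["MANHATTAN"] * 4,
    "hour": [7, 8, 9, 18, 8, 14, 16, 20],
})

# Quartiles of the collision hour per borough (the values a boxplot draws)
quartiles = toy.groupby("borough").hour.quantile([0.25, 0.5, 0.75]).unstack()

# Share of collisions with hour 6 to 11 inclusive (assumed "morning" window)
morning_share = toy.hour.between(6, 11).groupby(toy.borough).mean()

print(quartiles)
print(morning_share)
```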

## Creating density plots - collisions involving injuries across day hours (per borough)

In [None]:
sns.set(style="darkgrid")

# Subsets
# .copy() gives each subset its own data, so adding a column does not
# trigger pandas' SettingWithCopyWarning
Bronx = collisions[collisions['borough'] == 'BRONX'].copy()
Bronx['injured_yes'] = (Bronx['number_of_persons_injured'] >= 1).astype(int)
Queens = collisions[collisions['borough'] == 'QUEENS'].copy()
Queens['injured_yes'] = (Queens['number_of_persons_injured'] >= 1).astype(int)
Manhattan = collisions[collisions['borough'] == 'MANHATTAN'].copy()
Manhattan['injured_yes'] = (Manhattan['number_of_persons_injured'] >= 1).astype(int)
Brooklyn = collisions[collisions['borough'] == 'BROOKLYN'].copy()
Brooklyn['injured_yes'] = (Brooklyn['number_of_persons_injured'] >= 1).astype(int)


# Set up the figure
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2, figsize = (12,12))
#ax.set_aspect("equal")

# Draw the four density plots
# (keyword arguments for seaborn >= 0.11: fill replaces shade, and the
# default thresh leaves the lowest density level unshaded)
sns.kdeplot(x=Bronx['injured_yes'], y=Bronx['hour'],
            cmap="Reds", fill=True, ax=ax1)
sns.kdeplot(x=Queens['injured_yes'], y=Queens['hour'],
            cmap="Blues", fill=True, ax=ax2)
sns.kdeplot(x=Brooklyn['injured_yes'], y=Brooklyn['hour'],
            cmap="Greens", fill=True, ax=ax3)
sns.kdeplot(x=Manhattan['injured_yes'], y=Manhattan['hour'],
            cmap="Purples", fill=True, ax=ax4)

# Add labels to the plot
red = sns.color_palette("Reds")[-2]
blue = sns.color_palette("Blues")[-2]
ax1.set_title("Bronx")
ax2.set_title("Queens")
ax3.set_title("Brooklyn")
ax4.set_title("Manhattan")

## Creating a bar chart for collisions by weekday and vehicle type

In [None]:
  # List of distinct values and the categories I want to associate them with:
    #'TAXI': 'Taxi',
    #'AMBULANCE': 'Other',
    #'BICYCLE': 'Other',
    #'BUS': 'Bus',
    #'FIRE TRUCK': 'Other',
    #'LARGE COM VEH(6 OR MORE TIRES)': 'Truck',
    #'LIVERY VEHICLE': 'Truck',
    #'MOTORCYCLE': 'Other',
    #'OTHER': 'Other',
    #'PASSENGER VEHICLE': 'Auto',
    #'PICK-UP TRUCK': 'Truck',
    #'PEDICAB': 'Other',
    #'SCOOTER': 'Other',
    #'SMALL COM VEH(4 TIRES)': 'Truck',
    #'SPORT UTILITY / STATION WAGON': 'Auto',
    #'UNKNOWN': 'Other',
    #'VAN': 'Auto',
    #'UNSPECIFIED': 'Other',
    #None: None

    # We focus on vehicle_type_code1 only here
list_auto = ['PASSENGER VEHICLE', 'SPORT UTILITY / STATION WAGON', 'VAN']
list_truck = ['LARGE COM VEH(6 OR MORE TIRES)','SMALL COM VEH(4 TIRES)','LIVERY VEHICLE','PICK-UP TRUCK' ]
list_bus = ['BUS']
list_taxi = ['TAXI']
list_other = ['AMBULANCE','BICYCLE', 'FIRE TRUCK', 'PEDICAB','MOTORCYCLE','OTHER','SCOOTER','UNKNOWN','UNSPECIFIED']
    
coll_veh_type = collisions
coll_veh_type['WEEKDAY_IM'] = pd.DatetimeIndex(collisions.date).dayofweek

df_coll_by_vehicle_type = pd.DataFrame(
    {'taxi': coll_veh_type[coll_veh_type['vehicle_type_code1'].isin(list_taxi)].WEEKDAY_IM.value_counts(),
     'bus': coll_veh_type[coll_veh_type['vehicle_type_code1'].isin(list_bus)].WEEKDAY_IM.value_counts(),
     'other': coll_veh_type[coll_veh_type['vehicle_type_code1'].isin(list_other)].WEEKDAY_IM.value_counts(),
     'truck': coll_veh_type[coll_veh_type['vehicle_type_code1'].isin(list_truck)].WEEKDAY_IM.value_counts(),
     'auto': coll_veh_type[coll_veh_type['vehicle_type_code1'].isin(list_auto)].WEEKDAY_IM.value_counts(),
                     })


# One bar trace per vehicle category; the x axis is the weekday (0 = Monday ... 6 = Sunday)
categories = ['auto', 'taxi', 'bus', 'other', 'truck']
data = [
    go.Bar(
        y=df_coll_by_vehicle_type[category],
        x=df_coll_by_vehicle_type.index,
        name=category,
        orientation='v'
    )
    for category in categories
]

layout = go.Layout(
    barmode='stack',
    height=500,  # plotly expects a number here, not a string
    margin=dict(l=100),
    title="Collisions by weekday and vehicle type"
)

fig  = go.Figure(data=data, layout=layout)
py.iplot(fig)

<h4> Analysis </h4>

By far the most collisions involve cars, while buses, taxis, and trucks are involved in accidents much less frequently.
We also see fewer collisions on weekends. It might be interesting to check whether weekend collisions have different causes. They might, for instance, be related to drunk driving, which could lead to more severe accidents.
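
A possible starting point for that comparison is a normalized cross-tabulation of contributing factor against day type. The sketch below uses made-up rows, so the shares it prints say nothing about the real data; `day_type` and `factor_share` are illustrative names.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "date": pd.to_datetime(["2014-03-01", "2014-03-02", "2014-03-03",
                            "2014-03-04", "2014-03-05", "2014-03-08"]),
    "contributing_factor_vehicle_1": [
        "Alcohol Involvement", "Alcohol Involvement",
        "Driver Inattention/Distraction", "Driver Inattention/Distraction",
        "Failure to Yield Right-of-Way", "Driver Inattention/Distraction"],
})

# dayofweek: Monday = 0 ... Sunday = 6, so 5 and 6 are the weekend
is_weekend = toy.date.dt.dayofweek >= 5
day_type = is_weekend.map({True: "weekend", False: "weekday"}).rename("day_type")

# normalize="columns" turns counts into shares within each day type
factor_share = pd.crosstab(toy.contributing_factor_vehicle_1, day_type,
                           normalize="columns")
print(factor_share)
```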


## Time series charts

### Collisions by hour of the day (total and by borough)

In [None]:
collisions2 = collisions

collisions2['hour'] = pd.DatetimeIndex(collisions2.time).hour

# df sorted by hour of accident 
# (.loc replaces the .ix indexer, which has been removed from pandas)
df_by_hour = collisions2.loc[collisions2.hour.sort_values().index]

In [None]:
collisions_by_hour = df_by_hour.groupby('hour').hour.count()
collisions_by_hour.iplot(kind = 'scatter', title = 'Collisions by hour')

<h4> Analysis </h4>

The incidence of traffic collisions rises sharply from 7 to 9 a.m., 
when hundreds of thousands of people are commuting into and around the city to get to work.
It reaches its highest level at 4 p.m.

In [None]:
df_by_hour = collisions2.loc[collisions2.hour.sort_values().index]

collisions_by_hour_Bronx = df_by_hour[df_by_hour['borough'] == 'BRONX'].groupby('hour').hour.count()
collisions_by_hour_Queens = df_by_hour[df_by_hour['borough'] == 'QUEENS'].groupby('hour').hour.count()
collisions_by_hour_Manhattan = df_by_hour[df_by_hour['borough'] == 'MANHATTAN'].groupby('hour').hour.count()
collisions_by_hour_Brooklyn = df_by_hour[df_by_hour['borough'] == 'BROOKLYN'].groupby('hour').hour.count()
collisions_by_hour_Staten = df_by_hour[df_by_hour['borough'] == 'STATEN ISLAND'].groupby('hour').hour.count()


temp5 = pd.DataFrame({'Bronx': df_by_hour[df_by_hour['borough'] == 'BRONX'].hour.value_counts(),
                      'Queens': df_by_hour[df_by_hour['borough'] == 'QUEENS'].hour.value_counts(),
                      'Brooklyn': df_by_hour[df_by_hour['borough'] == 'BROOKLYN'].hour.value_counts(),
                      'Manhattan': df_by_hour[df_by_hour['borough'] == 'MANHATTAN'].hour.value_counts(),
                      'Staten': df_by_hour[df_by_hour['borough'] == 'STATEN ISLAND'].hour.value_counts()
                     })

color1 = '#8A0829'
color2 = '#F08B00'
color3 = '#9A2EFE'
color4 = '#DF01A5'
color5 = '#01DF74'

trace1 = go.Scatter(
    x = temp5.index,
    y = temp5.Bronx,
    name='Bronx',
    line = dict(
        color = color1
    )
)
trace2 = go.Scatter(
     x = temp5.index,
    y = temp5.Queens,
    name='Queens',
     line = dict(
        color = color2
    )
)
trace3 = go.Scatter(
     x = temp5.index,
    y = temp5.Brooklyn,
    name='Brooklyn',
     line = dict(
        color = color3
    )
)
trace4 = go.Scatter(
     x = temp5.index,
    y = temp5.Manhattan,
    name='Manhattan',
     line = dict(
        color = color4
    )
)
trace5 = go.Scatter(
     x = temp5.index,
    y = temp5.Staten,
    name='Staten Island',
     line = dict(
        color = color5
    )
)
data = [trace1, trace2, trace3, trace4, trace5]

layout = go.Layout(
    title= "Collisions per hour for each borough - 2014",
    yaxis=dict(
        title='collisions',
        titlefont=dict(
            color='Black'
        ),
        tickfont=dict(
            color='Black'
        )
    )
)

fig = go.Figure(data=data, layout=layout)
plot_url = py.iplot(fig)


In [None]:
# We can also use a filled area chart
temp5.iplot(kind='area', fill=True, title='Collisions per hour for each borough')

<b>Note:</b> To better compare these results, it would be interesting to normalize this data based on the borough size or number of vehicles in 2014 per borough.
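
A minimal sketch of such a normalization is shown below. The borough names, collision counts, and population figures are all placeholders invented for the example; real census or vehicle-registration figures would need to be substituted.

```python
import pandas as pd

# Placeholder collision counts per borough (not real data)
collisions_per_borough = pd.Series({"BOROUGH_A": 500, "BOROUGH_B": 300})

# Hypothetical resident counts: replace with real figures before drawing conclusions
population = pd.Series({"BOROUGH_A": 1000000, "BOROUGH_B": 500000})

# pandas aligns the two Series on their index before dividing
collisions_per_100k = collisions_per_borough / population * 100000
print(collisions_per_100k)
```

With these placeholder numbers, the borough with fewer collisions in absolute terms ends up with the higher rate, which is exactly the kind of effect normalization is meant to reveal.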

### Collisions and Number of deaths per day 

In [None]:
collisions.date = pd.to_datetime(collisions.date)
# Reorder the rows by date (.loc replaces the .ix indexer, which has been removed from pandas)
df_by_date = collisions.loc[collisions.date.sort_values().index]

In [None]:
collisions_by_date = df_by_date.groupby('date').date.count()
deaths_by_date = df_by_date.groupby('date')['number_of_persons_killed'].sum()

In [None]:
colli_deaths = pd.DataFrame({'collisions':collisions_by_date, 'deaths':deaths_by_date })

color1 = '#9467bd'
color2 = '#F08B00'

trace1 = go.Scatter(
    x = colli_deaths.index,
    y = colli_deaths['collisions'],
    name='collisions',
    line = dict(
        color = color1
    )
)
trace2 = go.Scatter(
    x= colli_deaths.index,
    y =colli_deaths['deaths'] ,
    name='deaths',
    yaxis='y2',
    mode='markers'

)
data = [trace1, trace2]
layout = go.Layout(
    title= "Collisions and Deaths per day (limited to 20k collisions)",
    yaxis=dict(
        title='collisions',
        titlefont=dict(
            color=color1
        ),
        tickfont=dict(
            color=color1
        )
    ),
    yaxis2=dict(
        title='deaths',
        overlaying='y',
        side='right',
        titlefont=dict(
            color=color2
        ),
        tickfont=dict(
            color=color2
        )

    )

)
fig = go.Figure(data=data, layout=layout)
plot_url = py.iplot(fig)

## Maps of collisions in NYC

In [None]:

collisions_new = collisions[collisions['latitude'] != 0][['latitude', 'longitude', 'date', 'time',
                                                               'borough', 'on_street_name', 'cross_street_name',
                                                               'number_of_persons_injured', 'number_of_persons_killed',
                                                               'contributing_factor_vehicle_1']]

#divide the dataset into accidents that are: fatal, non-fatal but with injuries, or none of the above
killed_pd = collisions_new[collisions_new['number_of_persons_killed']!=0]
injured_pd = collisions_new[np.logical_and(collisions_new['number_of_persons_injured']!=0, 
                                           collisions_new['number_of_persons_killed']==0)]
nothing_pd = collisions_new[np.logical_and(collisions_new['number_of_persons_killed']==0,
                                           collisions_new['number_of_persons_injured']==0)]

### Map of collisions by severity of accident

In [None]:
#adjust settings
plt.figure(figsize=(20,15))

#create scatterplots
plt.scatter(nothing_pd.longitude, nothing_pd.latitude, alpha=0.7, s=5, color='blue')
plt.scatter(injured_pd.longitude, injured_pd.latitude, alpha=0.5, s=15, color='yellow')
plt.scatter(killed_pd.longitude, killed_pd.latitude, color='red', s=30)

#create legend
blue_patch = mpatches.Patch(color='blue', label='car body damage', alpha=0.2)
yellow_patch = mpatches.Patch(color='yellow', label='personal injury', alpha=0.5)
red_patch = mpatches.Patch(color='red', label='fatal accidents')
# The legend text comes from each patch's label
plt.legend(handles=[blue_patch, yellow_patch, red_patch],
           loc='upper left', prop={'size':20})

#adjust more settings
plt.title('Severity of Motor Vehicle Collisions in New York City - 2014 (limited to 20k collisions)', size=20)
plt.xlim((-74.26,-73.7))
plt.ylim((40.5,40.92))
plt.xlabel('Longitude',size=20)
plt.ylabel('Latitude',size=20)

plt.show()

This map shows that there are fatal accident hot spots throughout the city. In some areas car body damage is prevalent, while in other areas personal injuries happen more often.
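
To move from a visual impression to numbers, one option is to bin the coordinates into a grid and count collisions per cell. The sketch below does this with `np.histogram2d` on made-up coordinates, reusing the map's axis limits as the bounding box; the 4x4 grid size is an arbitrary choice for illustration.

```python
import numpy as np

# Made-up coordinates inside the map's bounding box (not real collisions)
lon = np.array([-73.95, -73.95, -73.96, -73.80, -74.10])
lat = np.array([40.75, 40.76, 40.75, 40.70, 40.60])

# Count points per grid cell; the range matches the xlim/ylim of the map above
counts, lon_edges, lat_edges = np.histogram2d(
    lon, lat, bins=[4, 4], range=[[-74.26, -73.7], [40.5, 40.92]])

# Locate the busiest cell and report its longitude/latitude bounds
i, j = np.unravel_index(counts.argmax(), counts.shape)
print("Busiest cell holds", int(counts[i, j]), "collisions")
print("Longitude from", lon_edges[i], "to", lon_edges[i + 1])
print("Latitude from", lat_edges[j], "to", lat_edges[j + 1])
```

Running the same binning separately on `killed_pd`, `injured_pd`, and `nothing_pd` would give one grid per severity level.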