# Group 061 Final Project: Traffic Collisions in San Diego

# Data Science Questions

What are the most dangerous places and times to drive throughout the year? More specifically, we want to know which factors predict the likelihood of an accident. The factors listed in the hypothesis below are our starting point, but we also want to see whether our data analysis reveals other predictors.

# Hypothesis
We believe that several factors are likely to predict a higher probability of accidents in certain areas:
- Time of day - Because of low visibility and drowsiness, accidents are more likely to occur at night.
- Time of year - During the holidays there tend to be more people on the roads and more instances of DUI, so accidents are likely to increase during these times.
- Police presence - People tend to drive more slowly and carefully around police cars and to obey traffic laws more diligently. Accidents, especially those involving traffic violations, are therefore less likely to occur in areas with a higher police presence.
- Quality of road infrastructure - Signs of poor road infrastructure and maintenance, such as potholes or unclear or deteriorating road markings and signs, are likely to cause confusion or loss of control while driving, leading to more traffic collisions.
- Location - Where people live or drive may affect accident likelihood through factors such as the average age of residents, the kinds of buildings in the area (daytime businesses versus nightlife), and the household makeup of the people who live there (single, married, families, young adults).
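The time-of-day and time-of-year hypotheses can be checked against the dataset's `date_time` column. The following is a minimal sketch on a handful of illustrative rows rather than the real data; the 20:00 to 05:59 definition of "night" is an assumption made for the example, not something the dataset defines.

```python
import pandas as pd

# Illustrative rows only; the real notebook would use the collisions DataFrame
sample = pd.DataFrame({
    'date_time': [
        '2017-01-01 00:01:00',
        '2017-01-01 01:00:00',
        '2017-07-04 23:30:00',
        '2017-07-05 08:15:00',
        '2017-12-24 23:45:00',
    ]
})

# Parse the timestamp strings so the .dt accessor becomes available
sample['date_time'] = pd.to_datetime(sample['date_time'])

# Count collisions by hour of day (0-23) and by calendar month (1-12)
by_hour = sample['date_time'].dt.hour.value_counts().sort_index()
by_month = sample['date_time'].dt.month.value_counts().sort_index()

# Share of collisions at night; the 20:00-05:59 window is an assumed definition
hours = sample['date_time'].dt.hour
night_share = ((hours >= 20) | (hours < 6)).mean()
```

Raw counts like these only describe when collisions are recorded. Comparing them fairly across hours or months would also require some measure of traffic volume, which this dataset does not appear to include.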


# Imports

In [3]:
import pandas as pd # DataFrames, Series
import numpy as np # Math Module
import matplotlib.pyplot as plt # Plotting
import seaborn as sns # Plotting
import datetime # Datetime 
import folium # Folium Map
from folium import plugins # Folium Heatmap
from pygeocoder import Geocoder # Geocoding
import json # Reading JSON files

%matplotlib inline

# Reading Data

In [4]:
df = pd.read_csv('Datasets/pd_collisions_datasd_v1.csv')

In [5]:
df.head()

Unnamed: 0,report_id,date_time,police_beat,address_number_primary,address_pd_primary,address_road_primary,address_sfx_primary,address_pd_intersecting,address_name_intersecting,address_sfx_intersecting,violation_section,violation_type,charge_desc,injured,killed,hit_run_lvl
0,170082,2017-01-01 00:01:00,935,5500,,VALERIO,TRAIL,,,,MISC-HAZ,VC,MISCELLANEOUS HAZARDOUS VIOLATIONS OF THE VEHI...,0,0,MISDEMEANOR
1,170101,2017-01-01 00:01:00,322,6400,,CRAWFORD,STREET,,,,MISC-HAZ,VC,MISCELLANEOUS HAZARDOUS VIOLATIONS OF THE VEHI...,0,0,MISDEMEANOR
2,170166,2017-01-01 00:01:00,124,8300,,CAM DEL ORO,,,,,MISC-HAZ,VC,MISCELLANEOUS HAZARDOUS VIOLATIONS OF THE VEHI...,0,0,MISDEMEANOR
3,170218,2017-01-01 00:01:00,325,8100,,ROYAL GORGE,DRIVE,,,,22107,VC,TURNING MOVEMENTS AND REQUIRED SIGNALS,0,0,MISDEMEANOR
4,170097,2017-01-01 01:00:00,521,1000,,11TH,AVENUE,,,,22107,VC,TURNING MOVEMENTS AND REQUIRED SIGNALS,0,0,MISDEMEANOR


# Location

## Extraction of Locations from Dataset

In [6]:
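# Join the non-missing address components; skipping NaN keeps the literal text 'nan' out of the address
addresses = df.apply(lambda x : ' '.join(
    str(i).strip() for i in [x.address_number_primary, x.address_pd_primary, x.address_road_primary, x.address_sfx_primary]
    if pd.notna(i) and str(i).strip()
) + ', SAN DIEGO', axis = 1)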

In [22]:
addresses.shape

(28595,)

## Geocoding

In [11]:
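import os

# Read the key from the environment instead of hardcoding a credential in the notebook
# (the variable name MAPS_API_KEY is an assumed convention)
MAPS_API_KEY = os.environ['MAPS_API_KEY']
coder = Geocoder(MAPS_API_KEY)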

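# Geocode only the first NUM_CRASHES addresses to keep the number of API requests small
NUM_CRASHES = 200

# Each result is a (latitude, longitude) tuple
locations = addresses[:NUM_CRASHES].apply(lambda address : coder.geocode(address).coordinates)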
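A single failed lookup inside `apply` raises and discards every result gathered so far, and repeated addresses are sent to the API again. The sketch below shows one way to guard against both. `geocode_all`, `geocode_fn`, `pause_seconds`, and `fake_geocode` are hypothetical names introduced for illustration; the stand-in geocoder exists only so the example runs without network access.

```python
import time

def geocode_all(addresses, geocode_fn, pause_seconds=0.0):
    """Geocode each distinct address once, returning {address: (lat, lon) or None}."""
    results = {}
    for address in addresses:
        if address in results:
            continue  # repeated address: reuse the cached answer
        try:
            results[address] = geocode_fn(address)
        except Exception:
            # Record the failure instead of aborting the whole batch
            results[address] = None
        time.sleep(pause_seconds)  # crude rate limiting between requests
    return results


# Stand-in geocoder for demonstration; a real one would call the geocoding API
def fake_geocode(address):
    if 'UNKNOWN' in address:
        raise ValueError('no match')
    return (32.7, -117.1)


demo = geocode_all(
    ['1000 11TH AVENUE, SAN DIEGO',
     '1 UNKNOWN ROAD, SAN DIEGO',
     '1000 11TH AVENUE, SAN DIEGO'],
    fake_geocode,
)
```

In the notebook, `geocode_fn` could be `lambda a: coder.geocode(a).coordinates`, mirroring the call in the cell above. Addresses that map to `None` would then need to be dropped before plotting.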

In [13]:
locations.head()

0    (32.9624494, -117.2014782)
1    (32.7898003, -117.0938746)
2    (32.8568988, -117.2568618)
3    (32.8147757, -117.0512292)
4    (32.7157761, -117.1547177)
dtype: object

In [18]:
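# Indices into each (latitude, longitude) tuple
LAT, LON = 0, 1
# Center the map on the mean latitude and mean longitude of the geocoded points
center = np.mean(locations.apply(lambda x : x[LAT])), np.mean(locations.apply(lambda x : x[LON]))
center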

(32.79121478699999, -117.14507809200003)
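The same center can be computed in one step by stacking the tuples into a two-column array. This is a sketch on made-up coordinates; in the notebook the array would be built from `list(locations)`. Averaging raw latitudes and longitudes is a reasonable approximation over a city-sized area, though it would not hold for points that straddle the antimeridian.

```python
import numpy as np

# Illustrative (lat, lon) pairs; the notebook would pass list(locations)
coords = np.array([
    (32.0, -117.0),
    (33.0, -118.0),
    (34.0, -116.0),
])

# Mean along axis 0 averages latitudes and longitudes separately
center_lat, center_lon = coords.mean(axis=0)
```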

## Map

In [19]:
# Creating map
m = folium.Map(center, zoom_start = 11)

### Mark each point

In [20]:
for location in locations:
    folium.CircleMarker([location[LAT], location[LON]],
                        radius=15,
                        fill_color="#3db7e4", # light blue fill
                       ).add_to(m)

### Creating Heatmap

In [21]:
points_df = pd.DataFrame({
    'latitude' : list(locations.apply(lambda p : p[LAT])),
    'longitude' : list(locations.apply(lambda p : p[LON]))
})

# add_child replaces the deprecated add_children method
m.add_child(plugins.HeatMap(points_df.values, radius = 20))
m
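To my understanding, folium's `HeatMap` accepts each point as either `[lat, lng]` or `[lat, lng, weight]`. When many collisions share nearly the same coordinates, they can be collapsed into weighted points before plotting. The helper below is a hypothetical sketch: `weighted_points` and the choice of three decimal places (on the order of 100 m in latitude) are assumptions made for illustration.

```python
from collections import Counter

def weighted_points(coords, decimals=3):
    """Collapse nearby coordinates into [lat, lon, weight] rows."""
    # Rounding groups points that fall into the same small grid cell
    counts = Counter(
        (round(lat, decimals), round(lon, decimals)) for lat, lon in coords
    )
    return [[lat, lon, n] for (lat, lon), n in counts.items()]


# Illustrative coordinates: the first two fall into the same rounded cell
demo_points = weighted_points([
    (32.71578, -117.15472),
    (32.71581, -117.15468),
    (32.78980, -117.09387),
])
```

The resulting list could be passed to `plugins.HeatMap` in place of `points_df.values`.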

## Analysis

TODO

# Time of Year