# Seattle bike thieves mainly work in broad daylight (2015)

## Coursera Data Science at Scale - Assignment 6

With this report I want to show that bike thefts in Seattle happen mainly during the day rather than at night, and that they are concentrated in central areas.

Instead of using the given overall crime data for summer 2014 in Seattle, I decided to analyze bike theft in Seattle in 2015 (01/01/2015 - 12/31/2015). The data was downloaded as CSV from
https://data.seattle.gov/Public-Safety/Seattle-Police-Department-Police-Report-Incident/7ais-f98f

The download was filtered by date and by *offense code* (= 2399). A further filter on *offense code extension* (= 1) is still needed to keep only bike theft. This could have been done in the online form on seattle.gov, but I decided to use it as a small exercise with Python and Pandas and to keep a little discardable data. Column headers were modified so that they contain no whitespace or slashes.
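The header cleanup and the remaining filter step can be sketched as follows. This is only an illustration: the raw header names and the three rows are assumptions for the example, not taken from the downloaded file.

```python
import pandas as pd

# Hypothetical raw headers, as they might appear in the downloaded CSV (assumed).
raw = pd.DataFrame(
    {
        "Offense Code": [2399, 2399, 2399],
        "Offense Code Extension": [1, 3, 1],
        "District/Sector": ["B", "U", "D"],
    }
)

# Replace every run of whitespace or slashes with a single underscore.
raw.columns = raw.columns.str.replace(r"[\s/]+", "_", regex=True)

# Keep only rows whose extension code marks a bike theft (1, as described above).
bike = raw[raw["Offense_Code_Extension"] == 1]
print(list(bike.columns))
print(len(bike))
```

Renaming the headers once up front means every later cell can use plain identifiers such as `District_Sector`.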

The questions I wanted to answer along the way are:
1. Which month is the most dangerous for your bike (danger meaning your bike will be stolen)?
2. Which time of the day is the most dangerous for your bike?
3. Which area of Seattle is the most dangerous for your bike?

To answer these questions I used Python with matplotlib for plotting and Pandas for CSV file I/O and data handling.

The available columns in the CSV file are the following (relevant columns in *italics*):

| Column Name | Example entry |
|-------------|--------------:|
| RMS_CDW_ID  | 329716 |
| General_Offense_Number | 201517547 |
| Offense_Code | 2399 |
| *Offense_Code_Extension* | *3* |
| Offense_Type | THEFT-OTH |
| Summary_Offense_Code | 2300 |
| Summarized_Offense_Description | OTHER PROPERTY |
| Date_Reported | 01/16/2015 02:30:00 PM |
| *Occurred_Date_or_Date_Range_Start* | *01/15/2015 07:19:00 AM*|
| Occurred_Date_Range_End | |
| Hundred_Block_Location | 19XX BLOCK OF N 46 ST |
| *District_Sector* | *B*|
| Zone_Beat | B3 |
| Census_Tract 2000 | 5100.1022 |
| Longitude | -122.334455598 |
| Latitude | 47.66212706 |
| Location | "(47.66212706, -122.334455598)" |
| *Month* | *1* |
| Year | 2015 |

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import csv

In [3]:
# create reader and load data
f = "seattle_incidents_2015_2399.csv"
df = pd.read_csv(f, header=0, delimiter=",", quoting=csv.QUOTE_MINIMAL)

# remove data that is not bike theft
df = df[df['Offense_Code_Extension'] == 1]

In [4]:
# create a bar chart with counts for each month
# bin edges at 0.5, 1.5, ..., 12.5 center each bar on its month number
bins = np.arange(0.5, 13.5, 1.0)
plt.hist(df.Month.values, bins, histtype='bar', rwidth=0.8)
plt.xlabel("month")
plt.ylabel("bike theft count")
plt.savefig("biketheftpermonth.png")
#plt.show()
plt.clf()

![histogram of bike theft incidents per month](biketheftpermonth.png)

The plot shows that May had the most recorded bike theft incidents in 2015, closely followed by July and June. This is roughly what one would expect, since more people use their bikes in the summer months.
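Reading the peak off a plot can be cross-checked numerically. The sketch below uses a small made-up `Month` column in place of the real data, so the counts are illustrative only.

```python
import pandas as pd

# Toy stand-in for the filtered DataFrame; the Month values are made up.
toy = pd.DataFrame({"Month": [5, 5, 5, 6, 7, 7, 1, 12]})

# Count incidents per month and order the result by calendar month.
per_month = toy["Month"].value_counts().sort_index()

# idxmax returns the index label (here: the month number) with the highest count.
peak_month = per_month.idxmax()
print(per_month)
print("peak month:", peak_month)
```

Applied to the real `df`, the same two calls would give the exact monthly counts behind the bar chart.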

In [8]:
# parse timestamps such as "01/16/2015 02:30:00 PM" and extract the hour (0-23)
timeformat = '%m/%d/%Y %I:%M:%S %p'
occurred = pd.to_datetime(df['Occurred_Date_or_Date_Range_Start'], format=timeformat)
hours = occurred.dt.hour

In [13]:
# plot distribution of bike theft incidents per hour
# integer bin edges 0..24 give exactly one bin per hour of the day
plt.hist(hours, bins=np.arange(25))
plt.xlabel("hour")
plt.ylabel("bike theft count")
plt.savefig("biketheftperhour.png")
#plt.show()
plt.clf()

![histogram of bike theft incidents per hour of the day](biketheftperhour.png)

The plot shows that the hour after 5 pm is the time of day with the most reported bike theft incidents. Other dangerous hours are the hour directly after midnight, as one would expect, and, curiously, the hour just before noon. Overall, more bikes are stolen during daylight hours than at night.
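The day-versus-night comparison can be put into a single number. The sketch below is a hedged illustration: the hour values are made up, and the daylight window of 06:00 to 20:00 is an assumption chosen for the example, not a definition used in this report.

```python
import pandas as pd

# Made-up hours of occurrence (0-23), standing in for the parsed values.
hours = pd.Series([0, 2, 8, 11, 11, 13, 17, 17, 17, 22])

# Assumption: "daylight" means 06:00 up to, but not including, 20:00.
DAY_START, DAY_END = 6, 20
is_daylight = (hours >= DAY_START) & (hours < DAY_END)

# The mean of a boolean Series is the fraction of True values.
daylight_share = is_daylight.mean()
print(f"daylight share: {daylight_share:.0%}")
```

Because actual daylight in Seattle varies strongly over the year, a fixed window like this is only a rough approximation.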

In [14]:
# Add another histogram of the district_sector
# now using the plot capability of pandas
df["District_Sector"].value_counts().plot.bar()
plt.xlabel("Sector")
plt.ylabel("bike theft count")
plt.savefig("biketheftpersector.png")
#plt.show()
plt.clf()

![bar chart of bike theft incidents per district sector](biketheftpersector.png)

Sectors B, U and D have the most bike theft incidents.
On the map of Seattle police precincts (source: http://www.seattle.gov/police/maps/precinct_map.htm), these sectors correspond to the northern central districts of Seattle.

![map of seattle police districts](precinct_map.jpg)

With Google Fusion Tables I mapped the incident locations onto a map of Seattle. Each point corresponds to one bike theft incident. The higher density of points in the northern central part of Seattle nicely illustrates the finding from the bar chart.

![mapping of incidents to seattle map](seattleMap.PNG)
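A rough version of such a point map can also be drawn with matplotlib alone, without a background map. This is a minimal sketch, not the method used above: the coordinates are made up and the output file name `biketheft_scatter.png` is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up coordinates near Seattle; real values would come from the CSV.
pts = pd.DataFrame(
    {
        "Longitude": [-122.334, -122.313, -122.355, -122.300],
        "Latitude": [47.662, 47.661, 47.620, 47.550],
    }
)

fig, ax = plt.subplots()
# Small, semi-transparent markers make dense areas appear darker.
ax.scatter(pts["Longitude"], pts["Latitude"], s=10, alpha=0.5)
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
fig.savefig("biketheft_scatter.png")  # hypothetical output file name
plt.close(fig)
```

With the real `Longitude` and `Latitude` columns, overlapping markers would show the same clustering as the map, although without streets or sector boundaries for orientation.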