# Aurora Policing Project

## About the policing data

Throughout this project, I analyze a dataset of traffic stops in Aurora, Colorado, collected by the [Stanford Open Policing Project](https://openpolicing.stanford.edu/data/).



<table>
  <tr>
    <td>Column name</td>
    <td>Column meaning</td>
    <td>Example value</td>
  </tr>
  <tr>
    <td>raw_row_number</td>
    <td>A number used to join clean data back to the raw data</td>
    <td>38299</td>
  </tr>
  <tr>
    <td>date</td>
    <td>The date of the stop, in YYYY-MM-DD format. Some states do not provide
    the exact stop date: for example, they only provide the year or quarter in
    which the stop occurred. For these states, date is set to the date at
    the beginning of the period: for example, January 1 if only year is
    provided.</td>
    <td>"2017-02-02"</td>
  </tr>
  <tr>
    <td>time</td>
    <td>The 24-hour time of the stop, in HH:MM format.</td>
    <td>20:15</td>
  </tr>
  <tr>
    <td>location</td>
    <td>The freeform text of the location. Occasionally, this represents the
    concatenation of several raw fields, e.g. street_number and street_name.</td>
    <td>"248 Stockton Rd."</td>
  </tr>
  <tr>
    <td>lat</td>
    <td>The latitude of the stop. If not provided by the department, we
    attempt to geocode any provided address or location using
    Google Maps. Google Maps returns a "best effort" response, which may not
    be completely accurate if the provided location was malformed or
    underspecified. To protect against spurious responses, geocodes more than
    4 standard deviations from the median stop lat/lng are set to NA.</td>
    <td>72.23545</td>
  </tr>
  <tr>
    <td>lng</td>
    <td>The longitude of the stop. If not provided by the department, we
    attempt to geocode any provided address or location using
    Google Maps. Google Maps returns a "best effort" response, which may not
    be completely accurate if the provided location was malformed or
    underspecified. To protect against spurious responses, geocodes more than
    4 standard deviations from the median stop lat/lng are set to NA.
    </td>
    <td>115.2808</td>
  </tr>
  
  <tr>
    <td>district</td>
    <td>Police district. If not provided, but we have retrieved police
    department shapefiles and the location of the stop, we geocode the stop and
    find the district using the shapefiles.</td>
    <td>8</td>
  </tr>
  
  <tr>
    <td>subject_age</td>
    <td>The age of the stopped subject. When date of birth is given, we
    calculate the age based on the stop date. Values outside the range of
    10-110 are coerced to NA.</td>
    <td>54.23</td>
  </tr>
  <tr>
    <td>subject_race</td>
    <td>The race of the stopped subject. Values are standardized to white,
    black, hispanic, asian/pacific islander, and other/unknown.</td>
    <td>"hispanic"</td>
  </tr>
  <tr>
    <td>subject_sex</td>
    <td>The recorded sex of the stopped subject.</td>
    <td>"female"</td>
  </tr>
 
  <tr>
    <td>type</td>
    <td>Type of stop: vehicular or pedestrian.</td>
    <td>"vehicular"</td>
  </tr>
  
  <tr>
    <td>violation</td>
    <td>Specific violation of stop where provided. What is recorded here varies
    widely across police departments.</td>
    <td>"SPEEDING 15-20 OVER"</td>
  </tr>
  <tr>
    <td>citation_issued</td>
    <td>Indicates whether a citation was issued.</td>
    <td>TRUE</td>
  </tr>
  
  <tr>
    <td>outcome</td>
    <td>The strictest action taken among arrest, citation, warning, and
    summons.</td>
    <td>"citation"</td>
  </tr>
  
</table>
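The `subject_age` rule in the table (values outside 10-110 become NA) can be illustrated with a minimal sketch. The ages below are made up for the example, and treating both bounds as inclusive is an assumption; the table does not say how the endpoints are handled.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including values the documented rule would reject.
ages = pd.Series([54.23, 9.0, 10.0, 110.0, 131.5, np.nan], name="subject_age")

# Keep values inside [10, 110]; everything else becomes NaN.
# Inclusive bounds are an assumption, not something the table states.
clean_ages = ages.where(ages.between(10, 110))

print(clean_ages.tolist())
```

`Series.between` returns `False` for missing values, so entries that were already missing stay missing after the filter.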



## Preparing the Aurora policing data for analysis

In [61]:
# Import numpy library
import numpy as np

In [62]:
# Import pandas library 
import pandas as pd

In [63]:
# Import matplotlib.pyplot library 
import matplotlib.pyplot as plt

In [64]:
# Read file into dataframe named data 
data = pd.read_csv("co_aurora_2019_02_25 copy.csv")

### Examining the dataset

In [65]:
# Examine the head of dataframe 
data.head()

Unnamed: 0,raw_row_number,date,time,location,lat,lng,district,subject_age,subject_race,subject_sex,type,violation,citation_issued,outcome
0,1,2012-01-01,09:14:00,S I225 NB HWY AT E ALAMEDA AVE,,,,27.37637,white,male,vehicular,Speeding (20+ Over) - Muni Statue 1101,True,citation
1,2,2012-01-01,09:30:00,2600 S I225 NB HWY,,,,23.658287,black,female,vehicular,Speeding (20+ Over) - Muni Statue 1101,True,citation
2,3,2012-01-01,09:36:00,N I225 SB HWY AT E 6TH AVE,39.725279,-104.82116,2.0,23.088801,white,male,vehicular,Speeding (20+ Over) - Muni Statue 1101,True,citation
3,4,2012-01-01,09:40:00,2300 BLOCK S I225 NB HWY,,,,38.503239,white,female,vehicular,Speeding (20+ Over) - Muni Statue 1101,True,citation
4,5,2012-01-01,09:46:00,E VIRGINIA PL AT S PEORIA ST,39.706912,-104.847213,1.0,75.429441,white,male,vehicular,Failed to Present Evidence of Insurance Upon R...,True,citation


#### Aurora District information 
[Aurora District (1,2,3) Map](https://wiki.radioreference.com/images/3/3b/Aurora_Beat_Map.pdf)

## Dropping columns
This section drops the columns that are not useful for the analysis.

In [66]:
# Count the number of missing values in each column
print(data.isnull().sum())

raw_row_number         0
date                   0
time                 943
location              12
lat                31629
lng                31629
district           33838
subject_age         5863
subject_race           4
subject_sex         2006
type                4278
violation           3571
citation_issued        0
outcome                0
dtype: int64


In [67]:
print(data.shape)

(174363, 14)


#### Dropping raw_row_number because I will not use it as a reference number

In [68]:
data.drop(['raw_row_number'], axis = 'columns', inplace = True)
print(data.shape)

(174363, 13)


#### Dropping lat, lng, and district columns
I will use 'location' instead of 'lat' and 'lng', and I will not use the police district column.

In [69]:
# Drop 'lat', 'lng', and 'district' columns
data.drop(['lat','lng', 'district'], axis = 'columns', inplace = True)
print(data.shape)

(174363, 10)
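As an alternative sketch, the unused columns can be removed in a single call that returns a new DataFrame instead of modifying `data` in place. The two-row frame below is a hypothetical stand-in that reuses the dataset's column names; `sample`, `unused`, and `trimmed` are names introduced only for this example.

```python
import pandas as pd

# Hypothetical two-row frame with a subset of the stop data's columns.
sample = pd.DataFrame({
    "raw_row_number": [1, 2],
    "location": ["2600 S I225 NB HWY", "N I225 SB HWY AT E 6TH AVE"],
    "lat": [None, 39.725279],
    "lng": [None, -104.82116],
    "district": [None, 2.0],
    "outcome": ["citation", "citation"],
})

# List every column to discard, then drop them all at once.
unused = ["raw_row_number", "lat", "lng", "district"]
trimmed = sample.drop(columns=unused)

print(trimmed.shape)
```

Because `drop` returns a copy here, the original frame stays intact, which makes it easier to re-run a cell without hitting a `KeyError` for columns that are already gone.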


I will drop the 'type' column because only 24 of the 170,085 stops with a recorded type are pedestrian stops.

In [70]:
data.type.value_counts()

vehicular     170061
pedestrian        24
Name: type, dtype: int64
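To see the imbalance as shares rather than raw counts, `value_counts` accepts `normalize=True`, and `dropna=False` keeps missing values in the tally. This is a small sketch on a made-up Series standing in for `data.type`, not the real column.

```python
import pandas as pd

# Hypothetical stand-in for data.type: 8 vehicular, 1 pedestrian, 1 missing.
stop_type = pd.Series(["vehicular"] * 8 + ["pedestrian"] + [None], name="type")

# Proportion of each category, counting missing values as their own group.
shares = stop_type.value_counts(normalize=True, dropna=False)

print(shares)
```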

In [71]:
# Drop 'type' column
data.drop(['type'], axis = 'columns', inplace = True)

#### Dropping rows
I will drop rows that contain missing values in any column where the fraction of missing values is small (less than 5%).

In [72]:
# Calculate the fraction of missing values in each column
print(data.isnull().sum()/data.shape[0])

date               0.000000
time               0.005408
location           0.000069
subject_age        0.033625
subject_race       0.000023
subject_sex        0.011505
violation          0.020480
citation_issued    0.000000
outcome            0.000000
dtype: float64


In [73]:
data.dropna(subset=['location', 'time', 'subject_age', 'subject_race', 'subject_sex', 'violation'], inplace = True)
print(data.shape)

(163978, 9)
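The 5% rule can also be applied programmatically rather than by listing the columns by hand. The sketch below runs on a hypothetical 25-row frame; `threshold` and `sparse_gaps` are names introduced for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'time' is missing in 1 of 25 rows (4%),
# 'district' is missing in 20 of 25 rows (80%).
n = 25
sample = pd.DataFrame({
    "date": ["2012-01-01"] * n,
    "time": ["09:14:00"] * (n - 1) + [np.nan],
    "district": [1.0] * 5 + [np.nan] * 20,
})

# Fraction of missing values per column (mean of a boolean mask).
missing_fraction = sample.isnull().mean()

# Columns that have some missing values, but fewer than the threshold.
threshold = 0.05
mask = (missing_fraction > 0) & (missing_fraction < threshold)
sparse_gaps = missing_fraction[mask].index.tolist()

# Drop only the rows that are missing a value in one of those columns.
cleaned = sample.dropna(subset=sparse_gaps)

print(sparse_gaps)
print(cleaned.shape)
```

A column with many missing values, like `district` in this sample, is left out of `subset`, so it does not trigger row deletion.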


In [74]:
# Count the number of missing values in each column (again)
print(data.isnull().sum())

date               0
time               0
location           0
subject_age        0
subject_race       0
subject_sex        0
violation          0
citation_issued    0
outcome            0
dtype: int64


In [75]:
# Examine the shape of the Dataframe
print(data.shape)

(163978, 9)


In [76]:
data.head()

Unnamed: 0,date,time,location,subject_age,subject_race,subject_sex,violation,citation_issued,outcome
0,2012-01-01,09:14:00,S I225 NB HWY AT E ALAMEDA AVE,27.37637,white,male,Speeding (20+ Over) - Muni Statue 1101,True,citation
1,2012-01-01,09:30:00,2600 S I225 NB HWY,23.658287,black,female,Speeding (20+ Over) - Muni Statue 1101,True,citation
2,2012-01-01,09:36:00,N I225 SB HWY AT E 6TH AVE,23.088801,white,male,Speeding (20+ Over) - Muni Statue 1101,True,citation
3,2012-01-01,09:40:00,2300 BLOCK S I225 NB HWY,38.503239,white,female,Speeding (20+ Over) - Muni Statue 1101,True,citation
4,2012-01-01,09:46:00,E VIRGINIA PL AT S PEORIA ST,75.429441,white,male,Failed to Present Evidence of Insurance Upon R...,True,citation


In [79]:
data.location.value_counts()

S PARKER RD AT S PEORIA ST             3920
E I70 HWY EB AT N CHAMBERS RD          2097
15300 BLOCK E I70 EB HWY               1608
S PARKER RD AT E LEHIGH AVE            1018
E COLFAX AVE AT N ALTON ST              962
E 10TH AVE AT N AIRPORT BLVD            956
3400 BLOCK S PARKER RD                  929
E QUINCY AVE AT S GUN CLUB SH 30 RD     903
N AIRPORT BLVD AT E 10TH AVE            848
S PARKER RD AT E HAMPDEN AVE            803
15300 E I70 EB HWY                      730
E COLFAX AVE AT N PEORIA ST             687
S I225 NB HWY AT E ALAMEDA AVE          675
3600 BLOCK S PARKER RD                  663
1700 S SABLE BLVD                       634
E COLFAX AVE AT N HAVANA ST             630
S PARKER RD AT S VAUGHN WAY             617
3400 S PARKER RD                        585
2400 N AIRPORT BLVD                     582
E COLFAX AVE AT N CHAMBERS RD           573
S PARKER RD AT E QUINCY AVE             558
N I225 HWY NB AT E 6TH AVE              551
1400 S TOWER RD                 

## Preparing Aurora weather data for analysis

### About the Aurora weather data
I added weather data to analyze whether the policing data is affected by the weather. I collected the "Denver Centennial International Airport" weather data because the airport's weather station is better recorded than the others. The weather data covers 01/01/2012 to 12/31/2016 and comes from [NOAA](https://www.ncdc.noaa.gov/cdo-web/).

In [80]:
weather = pd.read_csv('weather.csv')

In [81]:
weather.head()

Unnamed: 0,STATION,NAME,DATE,AWND,PRCP,SNOW,SNWD,TAVG,TMAX,TMIN,WT01,WT02,WT03,WT04,WT05,WT06,WT08,WT09
0,USW00093067,"DENVER CENTENNIAL AIRPORT, CO US",2012-01-01,6.04,0.0,,,,38.0,18.0,,,,,,,,
1,USW00093067,"DENVER CENTENNIAL AIRPORT, CO US",2012-01-02,6.71,0.0,,,,50.0,16.0,,,,,,,,
2,USW00093067,"DENVER CENTENNIAL AIRPORT, CO US",2012-01-03,6.26,0.0,,,,55.0,31.0,,,,,,,,
3,USW00093067,"DENVER CENTENNIAL AIRPORT, CO US",2012-01-04,6.04,0.0,,,,58.0,27.0,,,,,,,,
4,USW00093067,"DENVER CENTENNIAL AIRPORT, CO US",2012-01-05,4.92,0.0,,,,66.0,35.0,,,,,,,,
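One possible way to relate the two datasets is a left join on the calendar date, so that every stop carries that day's weather readings. The sketch below is an assumption about how such a join could look, not the method used in this project. The stop rows are made up, and the weather rows copy the `TMAX` and `TMIN` values shown above.

```python
import pandas as pd

# Hypothetical stops; only the columns needed for the join are included.
stops = pd.DataFrame({
    "date": ["2012-01-01", "2012-01-01", "2012-01-03"],
    "outcome": ["citation", "citation", "citation"],
})

# First three weather rows from the output above (TMAX and TMIN only).
weather_sample = pd.DataFrame({
    "DATE": ["2012-01-01", "2012-01-02", "2012-01-03"],
    "TMAX": [38.0, 50.0, 55.0],
    "TMIN": [18.0, 16.0, 31.0],
})

# Parse both key columns as datetimes so they compare on equal terms.
stops["date"] = pd.to_datetime(stops["date"])
weather_sample["DATE"] = pd.to_datetime(weather_sample["DATE"])

# Left join keeps every stop, even on days without a weather record.
merged = stops.merge(weather_sample, how="left", left_on="date", right_on="DATE")

print(merged[["date", "outcome", "TMAX", "TMIN"]])
```

A left join is chosen here so that no stop is silently dropped. Days without a weather record would show up as missing values in the weather columns.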
