<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Stanford-Open-Policing-Project." data-toc-modified-id="Stanford-Open-Policing-Project.-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Stanford Open Policing Project.</a></span><ul class="toc-item"><li><span><a href="#Examining-the-dataset" data-toc-modified-id="Examining-the-dataset-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Examining the dataset</a></span></li><li><span><a href="#Data-Wrangling" data-toc-modified-id="Data-Wrangling-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Data Wrangling</a></span><ul class="toc-item"><li><span><a href="#Dropping-column" data-toc-modified-id="Dropping-column-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Dropping column</a></span></li><li><span><a href="#Dropping-rows" data-toc-modified-id="Dropping-rows-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Dropping rows</a></span></li><li><span><a href="#Fixing-a-data-type" data-toc-modified-id="Fixing-a-data-type-1.2.3"><span class="toc-item-num">1.2.3&nbsp;&nbsp;</span>Fixing a data type</a></span></li><li><span><a href="#Combining-object-columns" data-toc-modified-id="Combining-object-columns-1.2.4"><span class="toc-item-num">1.2.4&nbsp;&nbsp;</span>Combining object columns</a></span></li><li><span><a href="#Setting-the-index" data-toc-modified-id="Setting-the-index-1.2.5"><span class="toc-item-num">1.2.5&nbsp;&nbsp;</span>Setting the index</a></span></li></ul></li><li><span><a href="#Data-Analysis" data-toc-modified-id="Data-Analysis-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Data Analysis</a></span><ul class="toc-item"><li><span><a href="#Examining-traffic-violations" data-toc-modified-id="Examining-traffic-violations-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Examining traffic violations</a></span></li></ul></li></ul></li></ul></div>

# Stanford Open Policing Project.

In [1]:
# Load the necessary libraries
import pandas as pd
import json
import numpy as np
import matplotlib.pyplot as plt
import re
import seaborn as sns

## Examining the dataset

Before beginning my analysis, it's important that I familiarize myself with the dataset. At this point, I'll read the dataset into pandas, examine the first few rows, and then count the number of missing values.

In [2]:
# read the data
df = pd.read_csv('police_project.csv')

In [3]:
# View the first few rows
df.head()

Unnamed: 0,stop_date,stop_time,county_name,driver_gender,driver_age_raw,driver_age,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop
0,2005-01-02,01:55,,M,1985.0,20.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
1,2005-01-18,08:15,,M,1965.0,40.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
2,2005-01-23,23:15,,M,1972.0,33.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
3,2005-02-20,17:15,,M,1986.0,19.0,White,Call for Service,Other,False,,Arrest Driver,True,16-30 Min,False
4,2005-03-14,10:00,,F,1984.0,21.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False


In [4]:
# Count the missing values
print(df.isnull().sum())

stop_date                 0
stop_time                 0
county_name           91741
driver_gender          5335
driver_age_raw         5327
driver_age             5621
driver_race            5333
violation_raw          5333
violation              5333
search_conducted          0
search_type           88545
stop_outcome           5333
is_arrested            5333
stop_duration          5333
drugs_related_stop        0
dtype: int64


Most of the columns have missing values. Two stand out: county_name is missing in every row, and search_type is missing in 88,545 rows.
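Raw counts are easier to judge as fractions of the rows. Here is a minimal sketch on made-up data (the `toy` DataFrame and its values are hypothetical, not the police dataset). It relies on the fact that the mean of a boolean mask is the share of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    'county_name': [np.nan, np.nan, np.nan, np.nan],
    'driver_gender': ['M', 'F', np.nan, 'M'],
    'stop_date': ['2005-01-02', '2005-01-18', '2005-01-23', '2005-02-20'],
})

# isnull() gives a True/False mask; its column-wise mean is the
# fraction of missing values in each column
missing_fraction = toy.isnull().mean()
print(missing_fraction)
```

A column whose fraction is 1.0 holds no information at all, while a column with a small fraction may be worth keeping after its incomplete rows are removed.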

## Data Wrangling

### Dropping column

Often, a DataFrame will contain columns that are not useful to your analysis. Such columns should be dropped from the DataFrame, to make it easier for you to focus on the remaining columns.

I'll drop the county_name column because it contains only missing values and therefore carries no useful information.

In [5]:
# shape of the dataframe
df.shape

(91741, 15)

In [6]:
# Drop the 'county_name' column
df.drop(['county_name'], axis = 'columns', inplace = True)

In [7]:
# Re-examine the shape of the data
df.shape

(91741, 14)
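When it is not known in advance which columns are empty, they can be found programmatically. The sketch below uses made-up data (`toy` is hypothetical) and shows one possible alternative to naming the column by hand: `dropna` with `axis='columns'` and `how='all'` removes only columns in which every value is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one fully empty column, one partly empty column
toy = pd.DataFrame({
    'county_name': [np.nan, np.nan, np.nan],
    'driver_gender': ['M', np.nan, 'F'],
})

# List the columns in which every value is missing
empty_cols = toy.columns[toy.isnull().all()].tolist()
print(empty_cols)

# how='all' drops a column only if all of its values are missing,
# so the partly filled driver_gender column is kept
trimmed = toy.dropna(axis='columns', how='all')
print(trimmed.shape)
```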

### Dropping rows

When you know that a specific column will be critical to your analysis, and only a small fraction of rows are missing a value in that column, it often makes sense to remove those rows from the dataset.

The driver_gender column will be critical to many of my analyses. Because only a small fraction of rows are missing driver_gender, I'll drop those rows from the dataset.

In [8]:
# Drop all rows with a missing value in the 'driver_gender' column
df.dropna(subset = ['driver_gender'], inplace = True)

In [9]:
# count the missing values
df.isnull().sum()

stop_date                 0
stop_time                 0
driver_gender             0
driver_age_raw            1
driver_age              293
driver_race               0
violation_raw             0
violation                 0
search_conducted          0
search_type           83210
stop_outcome              0
is_arrested               0
stop_duration             0
drugs_related_stop        0
dtype: int64

I dropped 5,335 rows, about 5.8% of the original 91,741, which is a small fraction of the dataset.
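The effect of `subset` can be checked on a small example. This is a sketch on made-up data (`toy` and its values are hypothetical): only rows missing driver_gender are removed, while gaps in other columns are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two different columns
toy = pd.DataFrame({
    'driver_gender': ['M', np.nan, 'F', 'M'],
    'search_type': [np.nan, np.nan, 'Frisk', np.nan],
})

rows_before = len(toy)

# Only rows with a missing driver_gender are dropped
kept = toy.dropna(subset=['driver_gender'])

rows_dropped = rows_before - len(kept)
fraction_dropped = rows_dropped / rows_before
print(rows_dropped, fraction_dropped)

# Missing values in search_type survive the operation
print(kept.search_type.isnull().sum())
```

This is why search_type still shows many missing values after the drop: it was never part of the subset.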

### Fixing a data type

In [10]:
# Examine the first five rows
df.head()

Unnamed: 0,stop_date,stop_time,driver_gender,driver_age_raw,driver_age,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop
0,2005-01-02,01:55,M,1985.0,20.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
1,2005-01-18,08:15,M,1965.0,40.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
2,2005-01-23,23:15,M,1972.0,33.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
3,2005-02-20,17:15,M,1986.0,19.0,White,Call for Service,Other,False,,Arrest Driver,True,16-30 Min,False
4,2005-03-14,10:00,F,1984.0,21.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False


The is_arrested column currently has the object data type. I'll change it to bool, which is the most suitable type for a column containing True and False values.

Fixing the data type will let me use mathematical operations on the is_arrested column that would not be possible otherwise, such as taking its mean to get the arrest rate.

In [11]:
# Change the data type of 'is_arrested' to 'bool'
df['is_arrested'] = df.is_arrested.astype(bool)

In [12]:
# Check the data type of 'is_arrested' 
print(df.is_arrested.dtype)

bool
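As a rough illustration of what the bool type makes possible, here is a sketch on made-up values (both Series are hypothetical). It also shows a caveat worth keeping in mind: `astype(bool)` treats a missing value as `True`, so the conversion is only safe once the rows with missing values have been dropped, as was done above.

```python
import numpy as np
import pandas as pd

# Hypothetical object column holding real True/False values
arrested = pd.Series([True, False, False, True], dtype=object)
as_bool = arrested.astype(bool)

# On a bool column, sum() counts the True values and mean() gives their share
n_arrests = as_bool.sum()
arrest_rate = as_bool.mean()
print(n_arrests, arrest_rate)

# Caveat: NaN is truthy, so astype(bool) silently converts it to True
with_gap = pd.Series([True, np.nan, False], dtype=object)
converted = with_gap.astype(bool).tolist()
print(converted)
```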


### Combining object columns

Currently, the date and time of each traffic stop are stored in separate object columns: stop_date and stop_time.
    
To fix this, I'll combine these two columns into a single column, and then convert it to datetime format. This will enable convenient date-based attributes that I'll use later in the analysis.

In [13]:
# Concatenate 'stop_date' and 'stop_time' (separated by a space)
combined = df.stop_date.str.cat(df.stop_time, sep=' ')

# Convert 'combined' to datetime format
df['stop_datetime'] = pd.to_datetime(combined)

# Examine the data types of the DataFrame
print(df.dtypes)

stop_date                     object
stop_time                     object
driver_gender                 object
driver_age_raw               float64
driver_age                   float64
driver_race                   object
violation_raw                 object
violation                     object
search_conducted                bool
search_type                   object
stop_outcome                  object
is_arrested                     bool
stop_duration                 object
drugs_related_stop              bool
stop_datetime         datetime64[ns]
dtype: object
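The "date-based attributes" mentioned above are reached through the `.dt` accessor. A minimal sketch on made-up rows follows (`toy` is hypothetical). Passing an explicit `format` is my assumption that every row looks like `YYYY-MM-DD HH:MM`; it is optional, but it makes the parsing rule explicit.

```python
import pandas as pd

# Hypothetical toy data in the same layout as stop_date and stop_time
toy = pd.DataFrame({
    'stop_date': ['2005-01-02', '2005-01-18'],
    'stop_time': ['01:55', '08:15'],
})

# Join the two string columns with a space between them
combined = toy.stop_date.str.cat(toy.stop_time, sep=' ')

# Assumption: all values follow the 'YYYY-MM-DD HH:MM' pattern
toy['stop_datetime'] = pd.to_datetime(combined, format='%Y-%m-%d %H:%M')

# The .dt accessor exposes components such as the year and the hour
years = toy.stop_datetime.dt.year.tolist()
hours = toy.stop_datetime.dt.hour.tolist()
print(years, hours)
```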


### Setting the index

The last step that I'll take in this section is to set the stop_datetime column as the DataFrame's index. By replacing the default index with a DatetimeIndex, I'll make it easier to analyze the dataset by date and time, which will come in handy later in the analysis.

In [14]:
# set the stop_datetime as the index
df.set_index('stop_datetime', inplace = True)

# Examine the index
print(df.index)

# examine the columns
print(df.columns)


DatetimeIndex(['2005-01-02 01:55:00', '2005-01-18 08:15:00',
               '2005-01-23 23:15:00', '2005-02-20 17:15:00',
               '2005-03-14 10:00:00', '2005-03-23 09:45:00',
               '2005-04-01 17:30:00', '2005-06-06 13:20:00',
               '2005-07-13 10:15:00', '2005-07-13 15:45:00',
               ...
               '2015-12-31 16:38:00', '2015-12-31 19:44:00',
               '2015-12-31 19:55:00', '2015-12-31 20:20:00',
               '2015-12-31 20:25:00', '2015-12-31 20:27:00',
               '2015-12-31 20:35:00', '2015-12-31 20:45:00',
               '2015-12-31 21:42:00', '2015-12-31 22:46:00'],
              dtype='datetime64[ns]', name='stop_datetime', length=86406, freq=None)
Index(['stop_date', 'stop_time', 'driver_gender', 'driver_age_raw',
       'driver_age', 'driver_race', 'violation_raw', 'violation',
       'search_conducted', 'search_type', 'stop_outcome', 'is_arrested',
       'stop_duration', 'drugs_related_stop'],
      dtype='object')
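To give a sense of what a DatetimeIndex offers, here is a small sketch on made-up rows (`toy` and its values are hypothetical). It shows two features: selecting rows with a partial date string through `.loc`, and reading components such as the hour directly from the index.

```python
import pandas as pd

# Hypothetical toy data indexed by stop date and time
toy = pd.DataFrame(
    {'violation': ['Speeding', 'Other', 'Speeding']},
    index=pd.to_datetime([
        '2005-01-02 01:55',
        '2005-02-20 17:15',
        '2006-03-14 10:00',
    ]),
)
toy.index.name = 'stop_datetime'

# Partial string indexing: '2005' selects every row from that year
stops_2005 = toy.loc['2005']
print(len(stops_2005))

# Date and time components are available directly on the index
hours = toy.index.hour.tolist()
print(hours)
```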


## Data Analysis

Do the genders commit different violations?

### Examining traffic violations

Before comparing the violations being committed by each gender, I should examine the violations committed by all drivers to get a baseline understanding of the data.

In this exercise, I'll count the unique values in the violation column, and then separately express those counts as proportions.

In [21]:
# Count the unique values in 'violation'
print('unique_values:\n', df.violation.value_counts())

# Express the counts as proportions
print('proportion:\n', df.violation.value_counts(normalize=True))

unique_values:
 Speeding               48461
Moving violation       16224
Equipment              11020
Other                   4317
Registration/plates     3432
Seat belt               2952
Name: violation, dtype: int64
proportion:
 Speeding               0.560852
Moving violation       0.187765
Equipment              0.127537
Other                  0.049962
Registration/plates    0.039719
Seat belt              0.034164
Name: violation, dtype: float64


Interesting! About 56% of all violations are for speeding, followed by other moving violations (about 19%) and equipment violations (about 13%).
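The proportions are simply the counts divided by the total number of rows. The sketch below checks this on a made-up Series (`violations` here is hypothetical, not the real column).

```python
import pandas as pd

# Hypothetical toy data standing in for the violation column
violations = pd.Series(['Speeding', 'Speeding', 'Equipment', 'Speeding', 'Other'])

# Absolute counts per category
counts = violations.value_counts()

# normalize=True returns each count as a share of the total
proportions = violations.value_counts(normalize=True)

# The same result computed by hand
manual = counts / counts.sum()

print(counts['Speeding'], proportions['Speeding'])
print((proportions - manual).abs().max())
```

Applied to the real output above, 48,461 speeding stops out of 86,406 rows gives the reported proportion of 0.560852.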

### Comparing violations by gender