
**Stanford Open Policing Project.**

In [1]:
# Load the necessary libraries
import pandas as pd
import json
import numpy as np
import matplotlib.pyplot as plt
import re
import seaborn as sns

# Preparing data for analysis

Before beginning your analysis, it is critical that you first examine and clean the dataset, to make working with it a more efficient process. In this chapter, you will practice fixing data types, handling missing values, and dropping columns and rows while learning about the Stanford Open Policing Project dataset.


## Examining the dataset

Before beginning my analysis, it's important that I familiarize myself with the dataset. At this point, I'll read the dataset into pandas, examine the first few rows, and then count the number of missing values.

In [2]:
# read the data
df = pd.read_csv('police_project.csv')

In [3]:
# View the first few rows
df.head()

Unnamed: 0,stop_date,stop_time,county_name,driver_gender,driver_age_raw,driver_age,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop
0,2005-01-02,01:55,,M,1985.0,20.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
1,2005-01-18,08:15,,M,1965.0,40.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
2,2005-01-23,23:15,,M,1972.0,33.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
3,2005-02-20,17:15,,M,1986.0,19.0,White,Call for Service,Other,False,,Arrest Driver,True,16-30 Min,False
4,2005-03-14,10:00,,F,1984.0,21.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False


In [4]:
# Count the missing values
print(df.isnull().sum())

stop_date                 0
stop_time                 0
county_name           91741
driver_gender          5335
driver_age_raw         5327
driver_age             5621
driver_race            5333
violation_raw          5333
violation              5333
search_conducted          0
search_type           88545
stop_outcome           5333
is_arrested            5333
stop_duration          5333
drugs_related_stop        0
dtype: int64


It is clear that most of the columns have some missing values.
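
To put those raw counts in perspective, the share of missing values per column can be read off with `isnull().mean()`. The sketch below runs on a small made-up frame; the name `toy` and its values are assumptions for illustration, not rows from `police_project.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    'county_name': [np.nan, np.nan, np.nan, np.nan],
    'driver_gender': ['M', 'F', np.nan, 'M'],
    'search_conducted': [False, True, False, False],
})

# The mean of a Boolean mask is the fraction of True values,
# so this is the share of missing values in each column
missing_share = toy.isnull().mean()
print(missing_share)
```

A column whose share is 1.0 holds no information at all, while a column with a small share is a candidate for dropping rows instead.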

## Data Wrangling

### Dropping column

Often, a DataFrame will contain columns that are not useful to your analysis. Such columns should be dropped from the DataFrame, to make it easier for you to focus on the remaining columns.

I'll drop the county_name column because every one of its 91,741 values is missing, so it contains no useful information.

In [5]:
# shape of the dataframe
df.shape

(91741, 15)

In [6]:
# Drop the 'county_name' column
df.drop(['county_name'], axis = 'columns', inplace = True)

In [7]:
# Re-examine the shape of the data
df.shape

(91741, 14)
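
When it is not known in advance which columns are empty, `dropna` can find them. This is a minimal sketch on made-up data (the `toy` and `trimmed` names are hypothetical), offered as an alternative to naming the column explicitly rather than as the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with one entirely empty column (made-up values)
toy = pd.DataFrame({
    'county_name': [np.nan, np.nan, np.nan],
    'driver_gender': ['M', np.nan, 'F'],
})

# Drop only the columns in which every value is missing;
# 'driver_gender' survives because it is only partly missing
trimmed = toy.dropna(axis='columns', how='all')
print(trimmed.shape)
```

Because this returns a new DataFrame instead of modifying `toy` in place, the original stays available for comparison.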

### Dropping rows

When you know that a specific column will be critical to your analysis, and only a small fraction of rows are missing a value in that column, it often makes sense to remove those rows from the dataset.

The driver_gender column will be critical to many of my analyses. Because only a small fraction of rows are missing driver_gender, I'll drop those rows from the dataset.

In [8]:
# Drop all rows that are missing a value in the 'driver_gender' column
df.dropna(subset = ['driver_gender'], inplace = True)

In [9]:
# count the missing values
df.isnull().sum()

stop_date                 0
stop_time                 0
driver_gender             0
driver_age_raw            1
driver_age              293
driver_race               0
violation_raw             0
violation                 0
search_conducted          0
search_type           83210
stop_outcome              0
is_arrested               0
stop_duration             0
drugs_related_stop        0
dtype: int64

I dropped 5,335 rows, which is about 5.8% of the 91,741 rows in the dataset.

### Fixing a data type

In [10]:
# Examine the first five rows
df.head()

Unnamed: 0,stop_date,stop_time,driver_gender,driver_age_raw,driver_age,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop
0,2005-01-02,01:55,M,1985.0,20.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
1,2005-01-18,08:15,M,1965.0,40.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
2,2005-01-23,23:15,M,1972.0,33.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
3,2005-02-20,17:15,M,1986.0,19.0,White,Call for Service,Other,False,,Arrest Driver,True,16-30 Min,False
4,2005-03-14,10:00,F,1984.0,21.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False


The is_arrested column currently has the object data type. Therefore, I'll change the data type to bool, which is the most suitable type for a column containing True and False values.

Fixing the data type will enable us to use mathematical operations on the is_arrested column that would not be possible otherwise.

In [11]:
# Change the data type of 'is_arrested' to 'bool'
df['is_arrested'] = df.is_arrested.astype(bool)

In [12]:
# Check the data type of 'is_arrested' 
print(df.is_arrested.dtype)

bool
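
One caveat about this conversion: it is only safe because the rows with a missing is_arrested value were already dropped. The sketch below, on a made-up Series named `arrested`, illustrates what would happen otherwise.

```python
import numpy as np
import pandas as pd

# Object column mixing Booleans with a missing value (made-up data)
arrested = pd.Series([True, False, np.nan], dtype='object')

# NaN is truthy in Python, so astype(bool) silently turns it into True
converted = arrested.astype(bool)
print(converted.tolist())
```

A missing value that becomes True would inflate any arrest rate computed from the column, so checking `isnull().sum()` before converting is a sensible habit.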


### Combining object columns

Currently, the date and time of each traffic stop are stored in separate object columns: stop_date and stop_time.
    
To fix this, I'll combine these two columns into a single column, and then convert it to datetime format. This will enable convenient date-based attributes that I'll use later in the analysis.

In [13]:
# Concatenate 'stop_date' and 'stop_time' (separated by a space)
combined = df.stop_date.str.cat(df.stop_time, sep=' ')

# Convert 'combined' to datetime format
df['stop_datetime'] = pd.to_datetime(combined)

# Examine the data types of the DataFrame
print(df.dtypes)

stop_date                     object
stop_time                     object
driver_gender                 object
driver_age_raw               float64
driver_age                   float64
driver_race                   object
violation_raw                 object
violation                     object
search_conducted                bool
search_type                   object
stop_outcome                  object
is_arrested                     bool
stop_duration                 object
drugs_related_stop              bool
stop_datetime         datetime64[ns]
dtype: object
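
As a small illustration of the date-based attributes mentioned above, the sketch below repeats the same two steps on a two-row frame (the `toy` name is hypothetical; the two dates and times are taken from the first rows shown earlier) and then reads the hour and year through the `.dt` accessor.

```python
import pandas as pd

# Two stops, with date and time stored as separate strings
toy = pd.DataFrame({
    'stop_date': ['2005-01-02', '2005-01-18'],
    'stop_time': ['01:55', '08:15'],
})

# Join the strings with a space, then parse them as datetimes
combined = toy.stop_date.str.cat(toy.stop_time, sep=' ')
toy['stop_datetime'] = pd.to_datetime(combined)

# Date-based attributes are now available through the .dt accessor
hours = toy.stop_datetime.dt.hour.tolist()
years = toy.stop_datetime.dt.year.tolist()
print(hours, years)
```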


### Setting the index

The last step that I'll take in this chapter is to set the stop_datetime column as the DataFrame's index. By replacing the default index with a DatetimeIndex, I'll make it easier to analyze the dataset by date and time, which will come in handy later in the analysis.

In [14]:
# Set 'stop_datetime' as the index
df.set_index('stop_datetime', inplace = True)

# Examine the index
print(df.index)

# Examine the columns
print(df.columns)


DatetimeIndex(['2005-01-02 01:55:00', '2005-01-18 08:15:00',
               '2005-01-23 23:15:00', '2005-02-20 17:15:00',
               '2005-03-14 10:00:00', '2005-03-23 09:45:00',
               '2005-04-01 17:30:00', '2005-06-06 13:20:00',
               '2005-07-13 10:15:00', '2005-07-13 15:45:00',
               ...
               '2015-12-31 16:38:00', '2015-12-31 19:44:00',
               '2015-12-31 19:55:00', '2015-12-31 20:20:00',
               '2015-12-31 20:25:00', '2015-12-31 20:27:00',
               '2015-12-31 20:35:00', '2015-12-31 20:45:00',
               '2015-12-31 21:42:00', '2015-12-31 22:46:00'],
              dtype='datetime64[ns]', name='stop_datetime', length=86406, freq=None)
Index(['stop_date', 'stop_time', 'driver_gender', 'driver_age_raw',
       'driver_age', 'driver_race', 'violation_raw', 'violation',
       'search_conducted', 'search_type', 'stop_outcome', 'is_arrested',
       'stop_duration', 'drugs_related_stop'],
      dtype='object')
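
To give a sense of what a DatetimeIndex makes possible, here is a minimal sketch on a three-row frame. The `toy` frame and its `is_arrested` values are assumptions for illustration; only the index behavior matters.

```python
import pandas as pd

# Three stops indexed by their timestamp (made-up arrest values)
toy = pd.DataFrame(
    {'is_arrested': [False, False, True]},
    index=pd.to_datetime(['2005-01-02 01:55',
                          '2005-01-18 08:15',
                          '2005-02-20 17:15']),
)
toy.index.name = 'stop_datetime'

# Partial-string indexing selects every row from January 2005
january = toy.loc['2005-01']
print(len(january))

# Grouping by the hour of the index gives an arrest rate per hour
hourly = toy.groupby(toy.index.hour).is_arrested.mean()
print(hourly)
```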


# Exploring the relationship between gender and policing

Does the gender of a driver have an impact on police behavior during a traffic stop? In this chapter, you will explore that question while practicing filtering, grouping, method chaining, Boolean math, string methods, and more!

## Does gender affect who gets a ticket for speeding?

### Examining traffic violations

Before comparing the violations being committed by each gender, I should examine the violations committed by all drivers to get a baseline understanding of the data.

In this exercise, I'll count the unique values in the violation column, and then separately express those counts as proportions.

In [20]:
# Count the unique values in 'violation'
print('unique_values:\n', df.violation.value_counts())

unique_values:
 Speeding               48461
Moving violation       16224
Equipment              11020
Other                   4317
Registration/plates     3432
Seat belt               2952
Name: violation, dtype: int64


In [21]:
# Express the counts as proportions
print('proportion:\n', df.violation.value_counts(normalize=True))

proportion:
 Speeding               0.560852
Moving violation       0.187765
Equipment              0.127537
Other                  0.049962
Registration/plates    0.039719
Seat belt              0.034164
Name: violation, dtype: float64


Interesting! More than half of all violations are for speeding, followed by other moving violations and equipment violations.

### Comparing violations by gender

The question I am trying to answer is whether male and female drivers tend to commit different types of traffic violations.

In this exercise, I'll first create a DataFrame for each gender, and then analyze the violations in each DataFrame separately.

In [16]:
# Create a DataFrame of female drivers
female = df[df.driver_gender== 'F']

# Create a DataFrame of male drivers
male = df[df.driver_gender== 'M']

In [17]:
# Compute the violations by female drivers (as proportions)
print(female.violation.value_counts(normalize=True))

Speeding               0.658500
Moving violation       0.136277
Equipment              0.105780
Registration/plates    0.043086
Other                  0.029348
Seat belt              0.027009
Name: violation, dtype: float64


In [18]:
# Compute the violations by male drivers (as proportions)
print(male.violation.value_counts(normalize=True))

Speeding               0.524350
Moving violation       0.207012
Equipment              0.135671
Other                  0.057668
Registration/plates    0.038461
Seat belt              0.036839
Name: violation, dtype: float64


About two-thirds of female traffic stops are for speeding, whereas stops of males are more balanced among the six categories. This doesn't mean that females speed more often than males, however, since we didn't take into account the number of stops or drivers.
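
The two separate DataFrames can also be replaced by a single table. The sketch below uses `pd.crosstab` with `normalize='index'` on made-up data (the `toy` and `mix` names are hypothetical), so each row shows the mix of violations within one gender.

```python
import pandas as pd

# Made-up stops: three by female drivers, four by male drivers
toy = pd.DataFrame({
    'driver_gender': ['F', 'F', 'F', 'M', 'M', 'M', 'M'],
    'violation': ['Speeding', 'Speeding', 'Equipment',
                  'Speeding', 'Equipment', 'Equipment', 'Other'],
})

# normalize='index' divides each row by its own total,
# so every row sums to 1
mix = pd.crosstab(toy.driver_gender, toy.violation, normalize='index')
print(mix)
```

As with the per-gender `value_counts`, these are proportions within each gender, so they say nothing about how often each group is stopped overall.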

## Does gender affect who gets a ticket for overspeeding

### Comparing speeding outcomes by gender

When a driver is pulled over for speeding, many people believe that gender has an impact on whether the driver will receive a ticket or a warning. Can I find evidence of this in the dataset?

First, I'll create two DataFrames of drivers who were stopped for speeding: one containing females and the other containing males.

Then, for each gender, I'll use the stop_outcome column to calculate what percentage of stops resulted in a "Citation" (meaning a ticket) versus a "Warning".

In [23]:
# Create a DataFrame of female drivers stopped for speeding
female_and_speeding = df[(df.driver_gender == 'F') & (df.violation == 'Speeding')]

# Create a DataFrame of male drivers stopped for speeding
male_and_speeding = df[(df.driver_gender == 'M') & (df.violation == 'Speeding')]

In [24]:
# Compute the stop outcomes for female drivers (as proportions)
print(female_and_speeding.stop_outcome.value_counts(normalize=True))


Citation            0.952590
Arrest Driver       0.005361
Arrest Passenger    0.000840
N/D                 0.000840
No Action           0.000452
Name: stop_outcome, dtype: float64


In [25]:
# Compute the stop outcomes for male drivers (as proportions)
print(male_and_speeding.stop_outcome.value_counts(normalize=True))

Citation            0.946208
Arrest Driver       0.015161
Arrest Passenger    0.001243
N/D                 0.001061
No Action           0.001061
Name: stop_outcome, dtype: float64


Interesting! The numbers are similar for males and females: about 95% of stops for speeding result in a ticket. Thus, the data fails to show that gender has an impact on who gets a ticket for speeding.
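
The same comparison can be written without creating one DataFrame per gender. This is a sketch on made-up data (the `toy`, `speeding`, `rates`, and `table` names are hypothetical) that filters once and then groups by gender.

```python
import pandas as pd

# Made-up stops with their violation and outcome
toy = pd.DataFrame({
    'driver_gender': ['F', 'F', 'F', 'F', 'M', 'M', 'M'],
    'violation': ['Speeding', 'Speeding', 'Speeding', 'Equipment',
                  'Speeding', 'Speeding', 'Equipment'],
    'stop_outcome': ['Citation', 'Citation', 'Warning', 'Warning',
                     'Citation', 'Warning', 'Citation'],
})

# Keep only the speeding stops, then compute outcome shares per gender
speeding = toy[toy.violation == 'Speeding']
rates = speeding.groupby('driver_gender').stop_outcome.value_counts(normalize=True)

# Pivot the outcomes into columns for a side-by-side view
table = rates.unstack()
print(table)
```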

## Does gender affect whose vehicle is searched?

### Calculating the search rate

During a traffic stop, the police officer sometimes conducts a search of the vehicle. In the cells that follow, I'll calculate the percentage of all stops in the df DataFrame that result in a vehicle search, also known as the search rate.

In [27]:
# Check the data type of 'search_conducted'
print(df.search_conducted.dtype)

bool


In [28]:
# Calculate the search rate by counting the values
print(df.search_conducted.value_counts(normalize = True))

False    0.963012
True     0.036988
Name: search_conducted, dtype: float64


In [30]:
# Calculate the search rate by taking the mean
print(df.search_conducted.mean())

0.036988172117677014


Great! It looks like the search rate is about 3.7%. Next, I'll examine whether the search rate varies by driver gender.
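
The two calculations above agree because pandas treats True as 1 and False as 0, so the mean of a Boolean column is the count of True values divided by the number of rows. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Four made-up stops, one of which involved a search
searched = pd.Series([False, False, True, False])

# Count of True values divided by the number of rows...
rate_from_counts = searched.sum() / len(searched)

# ...is exactly what the mean computes
rate_from_mean = searched.mean()
print(rate_from_counts, rate_from_mean)
```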

### Comparing search rates by gender

First, I'll filter the DataFrame by gender and calculate the search rate for each group separately. Then, I'll perform the same calculation for both genders at once using .groupby().

In [33]:
# Calculate the search rate for female drivers
print(df[df.driver_gender=='F'].search_conducted.mean())

0.02003317596018885


In [35]:
# Calculate the search rate for male drivers
print(df[df.driver_gender=='M'].search_conducted.mean())

0.04332617855155418


In [36]:
# Calculate the search rate for both groups simultaneously
print(df.groupby('driver_gender').search_conducted.mean())

driver_gender
F    0.020033
M    0.043326
Name: search_conducted, dtype: float64


Male drivers are searched more than twice as often as female drivers.

### Adding a second factor to the analysis

Even though the search rate for males is much higher than for females, it's possible that the difference is mostly due to a second factor.

For example, you might hypothesize that the search rate varies by violation type, and the difference in search rate between males and females is because they tend to commit different violations.

You can test this hypothesis by examining the search rate for each combination of gender and violation. If the hypothesis were true, you would find that males and females are searched at about the same rate for each violation. Let's find out if that's the case!

In [38]:
# Calculate the search rate for each combination of gender and violation
print(df.groupby(['driver_gender', 'violation']).search_conducted.mean())

driver_gender  violation          
F              Equipment              0.042622
               Moving violation       0.036205
               Other                  0.056522
               Registration/plates    0.066140
               Seat belt              0.012598
               Speeding               0.008720
M              Equipment              0.070081
               Moving violation       0.059831
               Other                  0.047146
               Registration/plates    0.110376
               Seat belt              0.037980
               Speeding               0.024925
Name: search_conducted, dtype: float64


In [39]:
# Reverse the ordering to group by violation before gender
print(df.groupby(['violation', 'driver_gender']).search_conducted.mean())

violation            driver_gender
Equipment            F                0.042622
                     M                0.070081
Moving violation     F                0.036205
                     M                0.059831
Other                F                0.056522
                     M                0.047146
Registration/plates  F                0.066140
                     M                0.110376
Seat belt            F                0.012598
                     M                0.037980
Speeding             F                0.008720
                     M                0.024925
Name: search_conducted, dtype: float64


For every violation type except "Other", the search rate is higher for males than for females, so the mix of violations does not explain the gap and the data fails to support the hypothesis.
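
A grouped result like the one above is easier to scan when the genders sit side by side. The sketch below uses made-up data (the `toy` and `rates` names and the `M_higher` column are hypothetical) to pivot the inner index level into columns with `.unstack()` and then flag the violations where the male rate is higher.

```python
import pandas as pd

# Made-up stops: two per gender for each of two violation types
toy = pd.DataFrame({
    'driver_gender': ['F', 'F', 'M', 'M', 'F', 'M', 'M', 'F'],
    'violation': ['Speeding', 'Speeding', 'Speeding', 'Speeding',
                  'Other', 'Other', 'Other', 'Other'],
    'search_conducted': [False, False, True, False,
                         True, False, False, False],
})

# Search rate per (violation, gender), with gender moved into columns
rates = toy.groupby(['violation', 'driver_gender']).search_conducted.mean().unstack()

# Flag the violation types where males are searched at a higher rate
rates['M_higher'] = rates['M'] > rates['F']
print(rates)
```

A flag column like this makes any exception to the overall pattern stand out immediately.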

## Does gender affect who is frisked during a search

### Counting protective frisks

During a vehicle search, the police officer may pat down the driver to check if they have a weapon. This is known as a "protective frisk."

At this point, I'll first check how many times "Protective Frisk" was the only search type. Then, I'll use a string method to locate all instances in which the driver was frisked.

In [41]:
# Count the 'search_type' values
print(df.search_type.value_counts())

Incident to Arrest                                          1219
Probable Cause                                               891
Inventory                                                    220
Reasonable Suspicion                                         197
Protective Frisk                                             161
Incident to Arrest,Inventory                                 129
Incident to Arrest,Probable Cause                            106
Probable Cause,Reasonable Suspicion                           75
Incident to Arrest,Inventory,Probable Cause                   34
Incident to Arrest,Protective Frisk                           33
Probable Cause,Protective Frisk                               33
Inventory,Probable Cause                                      22
Incident to Arrest,Reasonable Suspicion                       13
Inventory,Protective Frisk                                    11
Incident to Arrest,Inventory,Protective Frisk                 11
Protective Frisk,Reasonab

In [42]:
# Check if 'search_type' contains the string 'Protective Frisk'
df['frisk'] = df.search_type.str.contains('Protective Frisk', na=False)

In [43]:
# Check the data type of 'frisk'
print(df.frisk.dtype)

bool


In [44]:
# Take the sum of 'frisk'
print(df.frisk.sum())

274


It looks like there were 274 drivers who were frisked. Next, I'll examine whether gender affects who is frisked.
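
One detail of the frisk calculation is worth isolating: search_type is missing for every stop without a search, and `na=False` tells `str.contains` to treat those rows as False. A minimal sketch on a made-up Series (the values mirror search types listed above):

```python
import numpy as np
import pandas as pd

# Made-up search types, including a stop with no search at all
search_type = pd.Series([
    'Protective Frisk',
    'Incident to Arrest,Protective Frisk',
    'Probable Cause',
    np.nan,  # no search was conducted
])

# Substring match catches frisks combined with other search types;
# na=False maps the missing value to False instead of leaving it missing
frisk = search_type.str.contains('Protective Frisk', na=False)
print(frisk.tolist(), frisk.sum())
```

This is why the total of 274 is larger than the 161 stops where "Protective Frisk" was the only search type.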

### Comparing frisk rates by gender

I'll compare the rates at which female and male drivers are frisked during a search. Are males frisked more often than females, perhaps because police officers consider them to be higher risk?

Before doing any calculations, it's important to filter the DataFrame to only include the relevant subset of data, namely stops in which a search was conducted.

In [46]:
# Create a DataFrame of stops in which a search was conducted
searched = df[df.search_conducted]

# Calculate the overall frisk rate by taking the mean of 'frisk'
print('Mean: ', searched.frisk.mean())

# Calculate the frisk rate for each gender
print(searched.groupby('driver_gender')['frisk'].mean())

Mean:  0.08573216520650813
driver_gender
F    0.061571
M    0.089908
Name: frisk, dtype: float64


The frisk rate is higher for males than for females, though we can't conclude that this difference is caused by the driver's gender.
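
When comparing rates between groups of different sizes, it helps to see the counts behind each rate. The sketch below uses made-up data (the `searched_toy` and `summary` names are hypothetical) to report the rate, the number of frisks, and the number of searches per gender in one call.

```python
import pandas as pd

# Made-up searches: two of female drivers, four of male drivers
searched_toy = pd.DataFrame({
    'driver_gender': ['F', 'F', 'M', 'M', 'M', 'M'],
    'frisk': [True, False, True, False, False, True],
})

# mean = frisk rate, sum = number of frisks, count = number of searches
summary = searched_toy.groupby('driver_gender').frisk.agg(['mean', 'sum', 'count'])
print(summary)
```

A rate based on a small count is less stable, which is one more reason to be cautious about reading a causal effect into the difference.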

# Visual exploratory data analysis

Are you more likely to get arrested at a certain time of day? Are drug-related stops on the rise? In this chapter, I will answer these and other questions by analyzing the dataset visually, since plots can help me to understand trends in a way that examining the raw data cannot.