# Analyzing Police Activity
* Based on a project presented at <a href="https://www.datacamp.com/projects/558">DataCamp</a>, with additional analyses.

## Police Activity

This analysis is based on the course "Analyzing Police Activity" from [DataCamp](https://learn.datacamp.com/courses/analyzing-police-activity-with-pandas).

The dataset comes from the Stanford Open Policing Project and records traffic stops made by police officers.

Each row in this dataset represents one traffic stop.

The structure of this notebook is as follows:


In [46]:
import pandas as pd

ri = pd.read_csv('datasets/police.data')
ri.head()

| | state | stop_date | stop_time | county_name | driver_gender | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | district |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RI | 2005-01-04 | 12:55 | NaN | M | White | Equipment/Inspection Violation | Equipment | False | NaN | Citation | False | 0-15 Min | False | Zone X4 |
| 1 | RI | 2005-01-23 | 23:15 | NaN | M | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | Zone K3 |
| 2 | RI | 2005-02-17 | 04:15 | NaN | M | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | Zone X4 |
| 3 | RI | 2005-02-20 | 17:15 | NaN | M | White | Call for Service | Other | False | NaN | Arrest Driver | True | 16-30 Min | False | Zone X1 |
| 4 | RI | 2005-02-24 | 01:20 | NaN | F | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | Zone X3 |


## 1 - Preparing the data for analysis

Before beginning the analysis, it is critical to examine and clean the dataset so that working with it is more efficient. For this dataset, that means fixing data types, handling missing values, and dropping columns and rows, while getting to know the Stanford Open Policing Project data along the way.


In [47]:
# Locating missing values
print("Number of NaN values in each column:\n" + str(ri.isnull().sum()))

print("\nShape: " + str(ri.shape))

Number of NaN values in each column:
state                     0
stop_date                 0
stop_time                 0
county_name           91741
driver_gender          5205
driver_race            5202
violation_raw          5202
violation              5202
search_conducted          0
search_type           88434
stop_outcome           5202
is_arrested            5202
stop_duration          5202
drugs_related_stop        0
district                  0
dtype: int64

Shape: (91741, 15)


The counts above show that 'county_name' is missing in every row (91741 of 91741) and 'search_type' in roughly 96% of rows (88434 of 91741). The dataset also covers a single US state, so the 'state' column carries no information. We therefore drop these three columns from the DataFrame.
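
The judgment that a column is "almost entirely NaN" can be made explicit by computing the share of missing values per column instead of reading raw counts. The following is a minimal sketch on a made-up toy frame, not the real data; the 90% cutoff is an arbitrary threshold chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are invented for illustration)
toy = pd.DataFrame({
    "county_name": [np.nan, np.nan, np.nan, np.nan],
    "search_type": [np.nan, "Probable Cause", np.nan, np.nan],
    "violation": ["Speeding", "Other", np.nan, "Speeding"],
})

# isnull() yields a Boolean frame; the column mean is the share of NaN values
missing_share = toy.isnull().mean()
print(missing_share)

# Columns whose missing share exceeds a hypothetical 90% threshold
mostly_missing = missing_share[missing_share > 0.9].index.tolist()
print(mostly_missing)
```

On the real DataFrame, the same idea would be `ri.isnull().mean()`.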

In [48]:
# Dropping the 'county_name', 'search_type' and 'state' columns
ri.drop(['county_name', 'search_type', 'state'], axis='columns', inplace=True)

The 'driver_gender' and 'violation' columns are critical to the analysis, so a row without either value is of no use. For that reason, we drop every row that is missing one of these values.

In [49]:
# Dropping the rows with NaN in 'driver_gender' or 'violation'
ri.dropna(subset=['driver_gender', 'violation'], inplace=True)
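
To make the behavior of `dropna` with `subset` concrete, here is a small sketch on invented rows. It shows that a row is removed when any column listed in `subset` is NaN, while a NaN in a column outside the subset does not trigger a drop. The toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Invented rows: each row has a NaN in a different column (or none)
toy = pd.DataFrame({
    "driver_gender": ["M", np.nan, "F", "F"],
    "violation": ["Speeding", "Other", np.nan, "Equipment"],
    "stop_outcome": ["Citation", "Citation", "Warning", np.nan],
})

# With the default how='any', a row is dropped if ANY subset column is NaN.
# The NaN in 'stop_outcome' (last row) is ignored because it is not in subset.
cleaned = toy.dropna(subset=["driver_gender", "violation"])
print(cleaned)
```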

In [50]:
# Locating missing values
print("Number of NaN values in each column:\n" + str(ri.isnull().sum()))

print("\nShape: " + str(ri.shape))

Number of NaN values in each column:
stop_date             0
stop_time             0
driver_gender         0
driver_race           0
violation_raw         0
violation             0
search_conducted      0
stop_outcome          0
is_arrested           0
stop_duration         0
drugs_related_stop    0
district              0
dtype: int64

Shape: (86536, 12)
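
Fixing data types, listed among the cleaning steps at the start of this section, is not shown in the cells above. The following is a minimal sketch on made-up rows, assuming the goal is a Boolean 'is_arrested' column and a single datetime index built from 'stop_date' and 'stop_time'. The column name `stop_datetime` is hypothetical and does not come from the dataset.

```python
import pandas as pd

# Invented rows; 'is_arrested' is stored as object, as can happen after read_csv
toy = pd.DataFrame({
    "stop_date": ["2005-01-04", "2005-01-23"],
    "stop_time": ["12:55", "23:15"],
    "is_arrested": pd.Series([False, True], dtype="object"),
})

# Convert the object column to a proper Boolean dtype
toy["is_arrested"] = toy["is_arrested"].astype("bool")

# Join date and time strings, then parse them into datetime values
combined = toy["stop_date"].str.cat(toy["stop_time"], sep=" ")
toy["stop_datetime"] = pd.to_datetime(combined)

# Use the parsed datetime as the index to simplify time-based selection
toy = toy.set_index("stop_datetime")
print(toy.dtypes)
print(toy.index)
```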


## 2 - Exploring the relationship between gender and policing

Does the gender of a driver have an impact on police behavior during a traffic stop? This question is answered while practicing filtering, grouping, method chaining, Boolean math, and string methods.
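
As a small illustration of what "Boolean math" and grouping mean here: the mean of a Boolean Series equals the share of `True` values, so a rate can be computed per group in one chained expression. This is a sketch on invented rows, using 'search_conducted' purely as an example column. It is not a result from the real data.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "driver_gender": ["M", "M", "F", "F", "M"],
    "search_conducted": [True, False, False, False, True],
})

# True counts as 1 and False as 0, so the mean is the share of True values
overall_rate = toy["search_conducted"].mean()
print(overall_rate)

# Grouping plus method chaining: the same rate, computed per gender
rate_by_gender = toy.groupby("driver_gender")["search_conducted"].mean()
print(rate_by_gender)
```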

In [53]:
# Counting the stops for each violation type
ri['violation'].value_counts()

Speeding               48423
Moving violation       16224
Equipment              10921
Other                   4409
Registration/plates     3703
Seat belt               2856
Name: violation, dtype: int64

In [54]:
# Computing the share of stops for each violation type
ri['violation'].value_counts(normalize=True)

Speeding               0.559571
Moving violation       0.187483
Equipment              0.126202
Other                  0.050950
Registration/plates    0.042791
Seat belt              0.033004
Name: violation, dtype: float64

The question we are trying to answer is whether male and female drivers tend to commit different types of traffic violations.

In [62]:
# Create a DataFrame of female drivers
female = ri[ri['driver_gender'] == 'F']

# Create a DataFrame of male drivers
male = ri[ri['driver_gender'] == 'M']

# Compute the violations by female drivers (as proportions)
print("- Most Common Female Violations:\n" + str(female['violation'].value_counts(normalize=True)))

print('\n')

# Compute the violations by male drivers (as proportions)
print("- Most Common Male Violations:\n" + str(male['violation'].value_counts(normalize=True)))

- Most Common Female Violations:
Speeding               0.658114
Moving violation       0.138218
Equipment              0.105199
Registration/plates    0.044418
Other                  0.029738
Seat belt              0.024312
Name: violation, dtype: float64


- Most Common Male Violations:
Speeding               0.522243
Moving violation       0.206144
Equipment              0.134158
Other                  0.058985
Registration/plates    0.042175
Seat belt              0.036296
Name: violation, dtype: float64
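
The two printed Series are easier to compare side by side. One possible alternative to filtering into separate DataFrames is `pd.crosstab` with `normalize='index'`, which gives one row per gender with proportions that sum to 1. The sketch below runs on invented rows and is only meant to show the shape of the result.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "driver_gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "violation": ["Speeding", "Speeding", "Speeding", "Equipment",
                  "Speeding", "Speeding", "Equipment", "Other"],
})

# normalize='index' divides each row by its total, so each gender sums to 1
by_gender = pd.crosstab(toy["driver_gender"], toy["violation"], normalize="index")
print(by_gender)
```

On the real DataFrame, the equivalent call would be `pd.crosstab(ri['driver_gender'], ri['violation'], normalize='index')`.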
