# 1. Preparing the data for analysis

## Examining the dataset

In [2]:
# import the pandas library as pd
import pandas as pd

# read 'police.csv' into a DataFrame named pl
pl = pd.read_csv('police.csv')

# examine the head of the DataFrame
pl.head()

# examine the basic info of the DataFrame
pl.info()

# examine the shape of the DataFrame
pl.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91741 entries, 0 to 91740
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   state               91741 non-null  object 
 1   stop_date           91741 non-null  object 
 2   stop_time           91741 non-null  object 
 3   county_name         0 non-null      float64
 4   driver_gender       86536 non-null  object 
 5   driver_race         86539 non-null  object 
 6   violation_raw       86539 non-null  object 
 7   violation           86539 non-null  object 
 8   search_conducted    91741 non-null  bool   
 9   search_type         3307 non-null   object 
 10  stop_outcome        86539 non-null  object 
 11  is_arrested         86539 non-null  object 
 12  stop_duration       86539 non-null  object 
 13  drugs_related_stop  91741 non-null  bool   
 14  district            91741 non-null  object 
dtypes: bool(2), float64(1), object(12)
memory usage: 9.3+ MB

(91741, 15)

In [2]:
# count the number of missing values in each column
pl.isnull().sum()

state                     0
stop_date                 0
stop_time                 0
county_name           91741
driver_gender          5205
driver_race            5202
violation_raw          5202
violation              5202
search_conducted          0
search_type           88434
stop_outcome           5202
is_arrested            5202
stop_duration          5202
drugs_related_stop        0
district                  0
dtype: int64

## Dropping columns

In [3]:
pl.shape

(91741, 15)

>
- Drop columns that are not useful for the analysis, such as <b>county_name</b>, which contains only missing values.
- Drop the <b>state</b> column because all of the traffic stops took place in one state.
- Dropping these columns makes it easier to focus on the remaining ones.
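The two criteria above can also be checked programmatically. The following is a minimal sketch on a made-up three-row DataFrame (the values and the names `all_missing`, `constant`, and `trimmed` are illustrative, not part of the original notebook):

```python
import pandas as pd

# Toy stand-in for the traffic-stop data (values are made up for illustration)
df = pd.DataFrame({
    'state': ['RI', 'RI', 'RI'],
    'county_name': [None, None, None],
    'violation': ['Speeding', 'Equipment', 'Speeding'],
})

# Columns in which every value is missing
all_missing = [col for col in df.columns if df[col].isnull().all()]

# Columns that hold exactly one distinct value (nunique ignores missing values)
constant = [col for col in df.columns if df[col].nunique() == 1]

# Drop both groups; this returns a new DataFrame instead of modifying df in place
trimmed = df.drop(columns=all_missing + constant)

print(all_missing, constant, list(trimmed.columns))
```

Using `columns=` is equivalent to passing `axis=1`, and returning a new DataFrame avoids the side effects of `inplace=True`.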


In [4]:
# Drop the 'county_name' and 'state' columns
# axis = 1 or axis ='columns'
pl.drop(['state', 'county_name'], axis=1, inplace=True)
# Examine the shape of the DataFrame (again)
pl.shape

(91741, 13)

## Dropping rows

In [3]:
pl

Unnamed: 0,state,stop_date,stop_time,county_name,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
0,RI,2005-01-04,12:55,,M,White,Equipment/Inspection Violation,Equipment,False,,Citation,False,0-15 Min,False,Zone X4
1,RI,2005-01-23,23:15,,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2,RI,2005-02-17,04:15,,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4
3,RI,2005-02-20,17:15,,M,White,Call for Service,Other,False,,Arrest Driver,True,16-30 Min,False,Zone X1
4,RI,2005-02-24,01:20,,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91736,RI,2015-12-31,21:21,,F,Black,Other Traffic Violation,Moving violation,False,,Citation,False,0-15 Min,False,Zone K2
91737,RI,2015-12-31,21:59,,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
91738,RI,2015-12-31,22:04,,M,White,Other Traffic Violation,Moving violation,False,,Citation,False,0-15 Min,False,Zone X3
91739,RI,2015-12-31,22:09,,F,Hispanic,Equipment/Inspection Violation,Equipment,False,,Warning,False,0-15 Min,False,Zone K3


In [5]:
pl.isnull().sum()

stop_date                 0
stop_time                 0
driver_gender          5205
driver_race            5202
violation_raw          5202
violation              5202
search_conducted          0
search_type           88434
stop_outcome           5202
is_arrested            5202
stop_duration          5202
drugs_related_stop        0
district                  0
dtype: int64

- The <b>driver_gender</b> column will be critical to the analysis.
- Only a small fraction of rows (5,205 of 91,741, about 5.7%) are missing a value in that column.
- It therefore makes sense to remove those rows from the dataset.
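Before dropping rows, it can help to measure how much data would be lost. A small sketch on made-up data (the column values and the names `missing_share` and `cleaned` are assumptions for illustration):

```python
import pandas as pd

# Toy data: one of four rows is missing 'driver_gender'
df = pd.DataFrame({
    'driver_gender': ['M', None, 'F', 'M'],
    'search_type': [None, None, 'Inventory', None],
})

# Fraction of rows missing the critical column (the mean of a boolean mask)
missing_share = df['driver_gender'].isnull().mean()

# Keep only rows where 'driver_gender' is present;
# missing values in other columns are left untouched
cleaned = df.dropna(subset=['driver_gender'])

print(missing_share, cleaned.shape)
```

Because `subset` limits the check to one column, rows that are missing only `search_type` are kept.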

In [10]:
# Drop all rows that are missing 'driver_gender'
pl.dropna(subset=['driver_gender'], inplace=True)

In [15]:
# Count the number of missing values in each column (again)
pl.isnull().sum()


stop_date                 0
stop_time                 0
driver_gender             0
driver_race               0
violation_raw             0
violation                 0
search_conducted          0
search_type           83229
stop_outcome              0
is_arrested               0
stop_duration             0
drugs_related_stop        0
district                  0
dtype: int64

In [14]:
# Examine the shape of the DataFrame
pl.shape

(86536, 13)

## Using proper data types

### Finding an incorrect data type

> 
- Determine which column's data type should be changed

In [20]:
pl.head()

Unnamed: 0,stop_date,stop_time,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
0,2005-01-04,12:55,M,White,Equipment/Inspection Violation,Equipment,False,,Citation,False,0-15 Min,False,Zone X4
1,2005-01-23,23:15,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2,2005-02-17,04:15,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4
3,2005-02-20,17:15,M,White,Call for Service,Other,False,,Arrest Driver,True,16-30 Min,False,Zone X1
4,2005-02-24,01:20,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X3


In [19]:
pl.dtypes

stop_date             object
stop_time             object
driver_gender         object
driver_race           object
violation_raw         object
violation             object
search_conducted        bool
search_type           object
stop_outcome          object
is_arrested           object
stop_duration         object
drugs_related_stop      bool
district              object
dtype: object

### Fixing a data type

>
- The <b>is_arrested</b> column currently has the object data type, so change its data type to bool.
- Fixing the data type will enable us to use mathematical operations on the column.
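As a rough illustration of why the bool type is useful, the sketch below converts a made-up object Series and then counts and averages it. It assumes the column has no missing values left, as is the case here after the rows without <b>driver_gender</b> were dropped:

```python
import pandas as pd

# Toy column stored as generic Python objects, like 'is_arrested' above
arrested = pd.Series([False, False, True, False], dtype='object')

# Convert to a proper boolean column
as_bool = arrested.astype(bool)

# True counts as 1 and False as 0 in arithmetic
n_arrests = as_bool.sum()     # number of True values
arrest_rate = as_bool.mean()  # proportion of True values

print(as_bool.dtype, n_arrests, arrest_rate)
```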

In [21]:
# Examine the head of the 'is_arrested' column
pl.is_arrested.head()

0    False
1    False
2    False
3     True
4    False
Name: is_arrested, dtype: object

In [22]:
# Change the data type of 'is_arrested' to 'bool'
# Tip: use the .astype() method to convert is_arrested to a bool column.
pl['is_arrested'] = pl.is_arrested.astype(bool)

In [23]:
# Check the data type of 'is_arrested' to confirm that it is now a bool column
pl['is_arrested'].dtypes

dtype('bool')

In [24]:
pl.is_arrested.head()

0    False
1    False
2    False
3     True
4    False
Name: is_arrested, dtype: bool

## Creating a DatetimeIndex

### Combining object columns
See Also: https://sparkbyexamples.com/pandas/pandas-combine-two-columns-of-text-in-dataframe/

In [46]:
pl.head()

Unnamed: 0_level_0,stop_date,stop_time,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
stop_datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
NaT,2005-01-04,12:55,M,White,Equipment/Inspection Violation,Equipment,False,,Citation,False,0-15 Min,False,Zone X4
NaT,2005-01-23,23:15,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
NaT,2005-02-17,04:15,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4
NaT,2005-02-20,17:15,M,White,Call for Service,Other,False,,Arrest Driver,True,16-30 Min,False,Zone X1
NaT,2005-02-24,01:20,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X3


In [27]:
pl.stop_date.dtypes

dtype('O')

- Currently, the date and time of each traffic stop are stored in separate object columns: <b>stop_date</b> and <b>stop_time</b>.
- Combine these two columns into a single column
- Convert it to datetime format
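The two steps above can be sketched on a tiny made-up DataFrame. Passing an explicit `format` to `pd.to_datetime` is an optional addition here (the notebook lets pandas infer it); it assumes every value follows the `YYYY-MM-DD HH:MM` pattern seen in the data:

```python
import pandas as pd

# Toy data with date and time stored as separate strings
df = pd.DataFrame({
    'stop_date': ['2005-01-04', '2005-01-23'],
    'stop_time': ['12:55', '23:15'],
})

# Step 1: join the two string columns with a space between them
combined = df['stop_date'].str.cat(df['stop_time'], sep=' ')

# Step 2: parse the combined strings into datetime values
df['stop_datetime'] = pd.to_datetime(combined, format='%Y-%m-%d %H:%M')

print(df['stop_datetime'])
```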

In [47]:
# Option 1: concatenate 'stop_date' and 'stop_time' (separated by a space) with the + operator
combined = pl.stop_date.astype(str)+" "+pl.stop_time

In [31]:
# Option 2: concatenate 'stop_date' and 'stop_time' (separated by a space) with .str.cat()
combined = pl.stop_date.str.cat(pl.stop_time, sep = " ")

In [48]:
# Convert 'combined' to datetime format
pl['stop_datetime'] = pd.to_datetime(combined)

In [49]:
# Examine the head of the new 'stop_datetime' column
pl.stop_datetime.head()

stop_datetime
NaT   2005-01-04 12:55:00
NaT   2005-01-23 23:15:00
NaT   2005-02-17 04:15:00
NaT   2005-02-20 17:15:00
NaT   2005-02-24 01:20:00
Name: stop_datetime, dtype: datetime64[ns]

### Setting Column as Index

> Syntax of set_index('column_name')

- Set the stop_datetime column as the DataFrame's index.
- This replaces the default index with a DatetimeIndex.
- A DatetimeIndex makes it easier to analyze the dataset by date and time.
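One way to see the benefit: a DatetimeIndex exposes date components such as `.month` and `.hour`, which can be used directly as grouping keys. A minimal sketch with made-up rows (the monthly arrest rate is only an illustrative calculation, not one performed in this notebook):

```python
import pandas as pd

# Toy data with an already-parsed datetime column
df = pd.DataFrame({
    'stop_datetime': pd.to_datetime(
        ['2005-01-04 12:55', '2005-01-23 23:15', '2005-02-17 04:15']
    ),
    'is_arrested': [False, True, False],
})

# Replace the default RangeIndex with a DatetimeIndex
indexed = df.set_index('stop_datetime')

# Group by a component of the index, here the calendar month
by_month = indexed.groupby(indexed.index.month)['is_arrested'].mean()

print(type(indexed.index).__name__)
print(by_month)
```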

In [50]:
# Set 'stop_datetime' as the index
pl.set_index('stop_datetime', inplace=True)

In [55]:
# Use the .index attribute to examine the DataFrame's index
pl.index

DatetimeIndex(['2005-01-04 12:55:00', '2005-01-23 23:15:00',
               '2005-02-17 04:15:00', '2005-02-20 17:15:00',
               '2005-02-24 01:20:00', '2005-03-14 10:00:00',
               '2005-03-29 21:55:00', '2005-04-04 21:25:00',
               '2005-07-14 11:20:00', '2005-07-14 19:55:00',
               ...
               '2015-12-31 13:23:00', '2015-12-31 18:59:00',
               '2015-12-31 19:13:00', '2015-12-31 20:20:00',
               '2015-12-31 20:50:00', '2015-12-31 21:21:00',
               '2015-12-31 21:59:00', '2015-12-31 22:04:00',
               '2015-12-31 22:09:00', '2015-12-31 22:47:00'],
              dtype='datetime64[ns]', name='stop_datetime', length=86536, freq=None)

In [54]:
# Use the .columns attribute to examine the DataFrame's columns.
pl.columns

Index(['stop_date', 'stop_time', 'driver_gender', 'driver_race',
       'violation_raw', 'violation', 'search_conducted', 'search_type',
       'stop_outcome', 'is_arrested', 'stop_duration', 'drugs_related_stop',
       'district'],
      dtype='object')

# 2. Exploring the relationship between gender and policing

In [53]:
pl.head()

Unnamed: 0_level_0,stop_date,stop_time,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
stop_datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2005-01-04 12:55:00,2005-01-04,12:55,M,White,Equipment/Inspection Violation,Equipment,False,,Citation,False,0-15 Min,False,Zone X4
2005-01-23 23:15:00,2005-01-23,23:15,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2005-02-17 04:15:00,2005-02-17,04:15,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4
2005-02-20 17:15:00,2005-02-20,17:15,M,White,Call for Service,Other,False,,Arrest Driver,True,16-30 Min,False,Zone X1
2005-02-24 01:20:00,2005-02-24,01:20,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X3


In [57]:
pl.shape

(86536, 13)

## Examining traffic violations

In [60]:
# Count the unique values in 'violation'
pl.violation.value_counts()

Speeding               48423
Moving violation       16224
Equipment              10921
Other                   4409
Registration/plates     3703
Seat belt               2856
Name: violation, dtype: int64

In [58]:
# Express the counts as proportions of the total
pl.violation.value_counts(normalize=True)

Speeding               0.559571
Moving violation       0.187483
Equipment              0.126202
Other                  0.050950
Registration/plates    0.042791
Seat belt              0.033004
Name: violation, dtype: float64

 ```Interesting! More than half of all violations are for speeding, followed by other moving violations and equipment violations.```
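For intuition, `normalize=True` simply divides each count by the total number of non-missing values. A short sketch on a made-up Series:

```python
import pandas as pd

# Toy violations: three speeding stops and one equipment stop
violation = pd.Series(['Speeding', 'Speeding', 'Equipment', 'Speeding'])

counts = violation.value_counts()                 # absolute counts
shares = violation.value_counts(normalize=True)   # proportions of the total
manual = counts / counts.sum()                    # same proportions computed by hand

print(shares)
```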

## Comparing violations by gender

In [70]:
# Create a DataFrame, female, that only contains rows in which driver_gender is 'F'.
female = pl[pl.driver_gender == 'F']
female

Unnamed: 0_level_0,stop_date,stop_time,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
stop_datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2005-02-24 01:20:00,2005-02-24,01:20,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X3
2005-03-14 10:00:00,2005-03-14,10:00,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2005-07-14 11:20:00,2005-07-14,11:20,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4
2005-07-18 19:30:00,2005-07-18,19:30,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2005-07-24 20:10:00,2005-07-24,20:10,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2015-12-31 07:31:00,2015-12-31,07:31,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X1
2015-12-31 09:33:00,2015-12-31,09:33,F,White,Equipment/Inspection Violation,Equipment,False,,Warning,False,0-15 Min,False,Zone X4
2015-12-31 21:21:00,2015-12-31,21:21,F,Black,Other Traffic Violation,Moving violation,False,,Citation,False,0-15 Min,False,Zone K2
2015-12-31 21:59:00,2015-12-31,21:59,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3


In [71]:
# Create a DataFrame of male drivers
male = pl[pl.driver_gender == 'M']
male

Unnamed: 0_level_0,stop_date,stop_time,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
stop_datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2005-01-04 12:55:00,2005-01-04,12:55,M,White,Equipment/Inspection Violation,Equipment,False,,Citation,False,0-15 Min,False,Zone X4
2005-01-23 23:15:00,2005-01-23,23:15,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2005-02-17 04:15:00,2005-02-17,04:15,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4
2005-02-20 17:15:00,2005-02-20,17:15,M,White,Call for Service,Other,False,,Arrest Driver,True,16-30 Min,False,Zone X1
2005-03-29 21:55:00,2005-03-29,21:55,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2015-12-31 19:13:00,2015-12-31,19:13,M,White,Other Traffic Violation,Moving violation,False,,Citation,False,16-30 Min,False,Zone K3
2015-12-31 20:20:00,2015-12-31,20:20,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K2
2015-12-31 20:50:00,2015-12-31,20:50,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K2
2015-12-31 22:04:00,2015-12-31,22:04,M,White,Other Traffic Violation,Moving violation,False,,Citation,False,0-15 Min,False,Zone X3


In [76]:
# Compute the violations by female drivers (as proportions)
female.violation.value_counts(normalize=True)

Speeding               0.658114
Moving violation       0.138218
Equipment              0.105199
Registration/plates    0.044418
Other                  0.029738
Seat belt              0.024312
Name: violation, dtype: float64

In [75]:
# Compute the violations by male drivers (as proportions)
male.violation.value_counts(normalize=True)

Speeding               0.522243
Moving violation       0.206144
Equipment              0.134158
Other                  0.058985
Registration/plates    0.042175
Seat belt              0.036296
Name: violation, dtype: float64

```About two-thirds of female traffic stops are for speeding, whereas stops of males are more balanced among the six categories.```
<br>
```This doesn't mean that females speed more often than males, however, since we didn't take into account the number of stops or drivers. ```
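As an alternative to building separate `female` and `male` DataFrames, the same comparison can be sketched in one step with `pd.crosstab`. The data below is made up, and `normalize='index'` makes each row (each gender) sum to 1:

```python
import pandas as pd

# Toy data: two female and four male stops
df = pd.DataFrame({
    'driver_gender': ['F', 'F', 'M', 'M', 'M', 'M'],
    'violation': ['Speeding', 'Equipment',
                  'Speeding', 'Speeding', 'Equipment', 'Other'],
})

# Rows are genders, columns are violations, cells are within-gender proportions
table = pd.crosstab(df['driver_gender'], df['violation'], normalize='index')

print(table)
```

Placing both genders in one table makes the proportions easier to compare side by side.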

## Filtering by multiple conditions

### Comparing speeding outcomes by gender

In [78]:
# Create a DataFrame of female drivers stopped for speeding
female_and_speeding = pl[(pl.driver_gender == 'F') & (pl.violation == 'Speeding')]
female_and_speeding

Unnamed: 0_level_0,stop_date,stop_time,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
stop_datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2005-02-24 01:20:00,2005-02-24,01:20,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X3
2005-03-14 10:00:00,2005-03-14,10:00,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2005-07-14 11:20:00,2005-07-14,11:20,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4
2005-07-18 19:30:00,2005-07-18,19:30,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2005-07-24 20:10:00,2005-07-24,20:10,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2015-12-30 14:09:00,2015-12-30,14:09,F,White,Speeding,Speeding,False,,Warning,False,0-15 Min,False,Zone X4
2015-12-30 19:21:00,2015-12-30,19:21,F,White,Speeding,Speeding,False,,Warning,False,0-15 Min,False,Zone X1
2015-12-30 23:26:00,2015-12-30,23:26,F,Hispanic,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4
2015-12-31 07:31:00,2015-12-31,07:31,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X1


In [80]:
pl.stop_outcome.value_counts()

Citation            77091
Arrest Driver        2735
No Action             624
N/D                   607
Arrest Passenger      343
Name: stop_outcome, dtype: int64

In [81]:
# Compute the stop outcomes for female drivers stopped for speeding (as proportions)
female_and_speeding.stop_outcome.value_counts(normalize=True)

Citation            0.952192
Arrest Driver       0.005752
N/D                 0.000959
Arrest Passenger    0.000639
No Action           0.000383
Name: stop_outcome, dtype: float64

In [82]:
# Create a DataFrame of male drivers stopped for speeding
male_and_speeding = pl[(pl.driver_gender == 'M') & (pl.violation == 'Speeding')]
male_and_speeding

Unnamed: 0_level_0,stop_date,stop_time,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
stop_datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2005-01-23 23:15:00,2005-01-23,23:15,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2005-02-17 04:15:00,2005-02-17,04:15,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4
2005-03-29 21:55:00,2005-03-29,21:55,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2005-04-04 21:25:00,2005-04-04,21:25,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K1
2005-07-14 19:55:00,2005-07-14,19:55,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2015-12-31 13:06:00,2015-12-31,13:06,M,White,Speeding,Speeding,False,,Warning,False,16-30 Min,False,Zone X3
2015-12-31 13:23:00,2015-12-31,13:23,M,White,Speeding,Speeding,False,,N/D,False,16-30 Min,False,Zone X1
2015-12-31 18:59:00,2015-12-31,18:59,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K2
2015-12-31 20:20:00,2015-12-31,20:20,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K2


In [84]:
# Compute the stop outcomes for male drivers stopped for speeding (as proportions)
male_and_speeding.stop_outcome.value_counts(normalize=True)

Citation            0.944595
Arrest Driver       0.015895
Arrest Passenger    0.001281
No Action           0.001068
N/D                 0.000976
Name: stop_outcome, dtype: float64

``` Interesting! The numbers are similar for males and females: about 95% of stops for speeding result in a ticket. ```
<br>
```Thus, the data fails to show that gender has an impact on who gets a ticket for speeding.```
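The filtering pattern used above combines boolean masks with `&`; each comparison needs its own parentheses (or its own variable) because `&` binds more tightly than `==`. A minimal sketch on made-up rows, which also computes both genders' outcome proportions in a single grouped call:

```python
import pandas as pd

# Toy data (values are illustrative only)
df = pd.DataFrame({
    'driver_gender': ['F', 'F', 'F', 'F', 'M', 'M'],
    'violation': ['Speeding', 'Speeding', 'Speeding', 'Equipment',
                  'Speeding', 'Speeding'],
    'stop_outcome': ['Citation', 'Citation', 'Warning', 'Citation',
                     'Citation', 'Arrest Driver'],
})

# Name each condition, then combine the masks with &
is_female = df['driver_gender'] == 'F'
is_speeding = df['violation'] == 'Speeding'
female_and_speeding = df[is_female & is_speeding]

# Outcome proportions among speeding stops, computed per gender
outcomes = (
    df[is_speeding]
    .groupby('driver_gender')['stop_outcome']
    .value_counts(normalize=True)
)

print(outcomes)
```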

### Calculating the search rate

In [87]:
# Check the data type of 'search_conducted'
pl.search_conducted.dtypes

dtype('bool')

In [88]:
# Calculate the search rate by counting the values
pl.search_conducted.value_counts(normalize=True)

False    0.961785
True     0.038215
Name: search_conducted, dtype: float64

In [89]:
pl.search_conducted.mean()

0.0382153092354627

``` Great! It looks like the search rate is about 3.8%. Next, you'll examine whether the search rate varies by driver gender.```
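The reason `.mean()` gives the search rate is that `True` is treated as 1 and `False` as 0, so the mean equals the number of `True` values divided by the number of rows. A quick check on a made-up Series:

```python
import pandas as pd

# Toy column: one search in five stops
searched = pd.Series([False, False, True, False, False])

# Count of True values divided by the number of rows
rate_from_sum = searched.sum() / len(searched)

# The mean of a boolean Series is the same proportion
rate_from_mean = searched.mean()

print(rate_from_sum, rate_from_mean)
```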

### Comparing search rates by gender

In [96]:
female_and_searchrate = pl[pl.driver_gender == 'F'].search_conducted.mean()
female_and_searchrate

0.019180617481282074

In [95]:
male_and_searchrate = pl[pl.driver_gender == 'M'].search_conducted.mean()
male_and_searchrate

0.04542557598546892

In [97]:
# Calculate the search rate for both groups simultaneously
pl.groupby('driver_gender').search_conducted.mean()

driver_gender
F    0.019181
M    0.045426
Name: search_conducted, dtype: float64

```Wow! Male drivers are searched more than twice as often as female drivers. Why might this be?```
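To put a number on "more than twice as often", the grouped rates can be divided by each other. The sketch below uses made-up data; with the rates printed above, the same division gives roughly 0.0454 / 0.0192, or about 2.4:

```python
import pandas as pd

# Toy data: 1 of 4 female stops and 2 of 4 male stops involve a search
df = pd.DataFrame({
    'driver_gender': ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M'],
    'search_conducted': [False, False, False, True,
                         True, True, False, False],
})

# Search rate per gender
rates = df.groupby('driver_gender')['search_conducted'].mean()

# How many times higher the male rate is than the female rate
ratio = rates['M'] / rates['F']

print(rates)
print(ratio)
```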

### Adding a second factor to the analysis

In [99]:
# Calculate the search rate for each combination of gender and violation
pl.groupby(['driver_gender', 'violation']).search_conducted.mean()

driver_gender  violation          
F              Equipment              0.039984
               Moving violation       0.039257
               Other                  0.041018
               Registration/plates    0.054924
               Seat belt              0.017301
               Speeding               0.008309
M              Equipment              0.071496
               Moving violation       0.061524
               Other                  0.046191
               Registration/plates    0.108802
               Seat belt              0.035119
               Speeding               0.027885
Name: search_conducted, dtype: float64

In [100]:
# Reverse the ordering to group by violation before gender
pl.groupby(['violation', 'driver_gender']).search_conducted.mean()

violation            driver_gender
Equipment            F                0.039984
                     M                0.071496
Moving violation     F                0.039257
                     M                0.061524
Other                F                0.041018
                     M                0.046191
Registration/plates  F                0.054924
                     M                0.108802
Seat belt            F                0.017301
                     M                0.035119
Speeding             F                0.008309
                     M                0.027885
Name: search_conducted, dtype: float64

```Great work! For all types of violations, the search rate is higher for males than for females. This disproves the hypothesis that the gap in search rates is explained by the type of violation alone.```
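The two-level result can be reshaped so that each violation is a row and each gender is a column, which makes the comparison easier to check programmatically. A sketch on made-up data (the `male_higher` check is an illustrative addition, not part of the original notebook):

```python
import pandas as pd

# Toy data (values are illustrative only)
df = pd.DataFrame({
    'driver_gender': ['F', 'F', 'M', 'M', 'F', 'M'],
    'violation': ['Speeding', 'Speeding', 'Speeding', 'Speeding',
                  'Equipment', 'Equipment'],
    'search_conducted': [False, False, True, False, False, True],
})

# Search rate per (violation, gender), then pivot gender into columns
rates = (
    df.groupby(['violation', 'driver_gender'])['search_conducted']
    .mean()
    .unstack()
)

# True only if the male rate exceeds the female rate for every violation
male_higher = (rates['M'] > rates['F']).all()

print(rates)
print(male_higher)
```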

### Counting protective frisks

In [102]:
# Count the 'search_type' values
pl.search_type.value_counts()

Incident to Arrest                                          1290
Probable Cause                                               924
Inventory                                                    219
Reasonable Suspicion                                         214
Protective Frisk                                             164
Incident to Arrest,Inventory                                 123
Incident to Arrest,Probable Cause                            100
Probable Cause,Reasonable Suspicion                           54
Incident to Arrest,Inventory,Probable Cause                   35
Probable Cause,Protective Frisk                               35
Incident to Arrest,Protective Frisk                           33
Inventory,Probable Cause                                      25
Protective Frisk,Reasonable Suspicion                         19
Incident to Arrest,Inventory,Protective Frisk                 18
Incident to Arrest,Probable Cause,Protective Frisk            13
Inventory,Protective Fris