Does the gender of a driver have an impact on police behavior during a traffic stop? In this chapter, you will explore that question while practicing filtering, grouping, method chaining, Boolean math, string methods, and more!

# Lesson I

## Do the Genders Commit Different Violations?

Let's start by discussing a few methods that will help us with our analysis. The first method is ``value_counts()``, which counts the occurrences of each unique value in a Series.

It's best suited for a column that contains *categorical data* rather than numerical data.

For example, let's apply the ``value_counts()`` method to the ``stop_outcome`` column:

In [2]:
# Import Packages
import pandas as pd

# Datasets
ri = pd.read_csv('datasets/police.csv')


Unnamed: 0,state,stop_date,stop_time,county_name,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
0,RI,2005-01-04,12:55,,M,White,Equipment/Inspection Violation,Equipment,False,,Citation,False,0-15 Min,False,Zone X4
1,RI,2005-01-23,23:15,,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2,RI,2005-02-17,04:15,,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4


In [3]:
ri.stop_outcome.value_counts()
# Results are displayed in 'descending' order.

Citation            77092
Arrest Driver        2735
No Action             625
N/D                   607
Arrest Passenger      343
Name: stop_outcome, dtype: int64

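Because ``value_counts()`` outputs a pandas *Series*, we can take the sum of this *Series* by simply adding the ``.sum()`` method on the end.

This is called **method chaining**.

The ``sum()`` of ``value_counts()`` equals the number of rows in the DataFrame only for a Series that has no missing values, because ``value_counts()`` excludes missing values by default. Here the sum (86539) is smaller than the number of rows (91741), which tells us that ``stop_outcome`` contains missing values.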

In [4]:
ri.stop_outcome.value_counts().sum()

86539

In [5]:
ri.shape

(91741, 15)
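
The gap between the two numbers above comes from missing values, which ``value_counts()`` skips by default. Here is a minimal sketch of that behavior, using a made-up toy DataFrame (the name ``toy`` and its contents are illustrative assumptions, not part of the police dataset):

```python
import pandas as pd

# Toy data (hypothetical): one of the four outcomes is missing
toy = pd.DataFrame({'stop_outcome': ['Citation', 'Citation', None, 'Warning']})

# value_counts() skips the missing value, so the chained sum counts 3 rows
counted = toy.stop_outcome.value_counts().sum()

# The DataFrame itself still has 4 rows
n_rows = toy.shape[0]

# The difference is the number of missing values in the column
n_missing = toy.stop_outcome.isnull().sum()

print(counted, n_rows, n_missing)
```

Adding ``n_missing`` back to the chained sum recovers the row count, which is a quick way to check whether a column has missing values.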

### Expressing counts as proportions

Rather than examining the raw counts, we might prefer to see the proportions of the total. Instead of dividing the number of citations by the total number of outcomes to get the percentage, we can use the ``normalize=True`` parameter of ``value_counts()``.

In [6]:
ri.stop_outcome.value_counts(normalize=True)

Citation            0.890835
Arrest Driver       0.031604
No Action           0.007222
N/D                 0.007014
Arrest Passenger    0.003964
Name: stop_outcome, dtype: float64
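
To make the connection between counts and proportions concrete, the sketch below compares the manual division with ``normalize=True`` on a small made-up Series (the name ``outcomes`` and its values are assumptions for illustration only):

```python
import pandas as pd

# Toy Series (hypothetical): three citations and one warning
outcomes = pd.Series(['Citation', 'Citation', 'Citation', 'Warning'])

# Manual approach: divide each count by the total of the counts
counts = outcomes.value_counts()
manual = counts / counts.sum()

# Built-in approach: let value_counts() do the division
builtin = outcomes.value_counts(normalize=True)

print(builtin)
```

Both approaches give the same proportions, and the proportions add up to 1 because they are computed over the non-missing values only.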

### Filtering DataFrame Rows

Let's take a look at the ``value_counts()`` of a different column, ``driver_race``:

In [7]:
ri.driver_race.value_counts()

White       61872
Black       12285
Hispanic     9727
Asian        2390
Other         265
Name: driver_race, dtype: int64

We see five unique categories:
* White
* Black
* Hispanic
* Asian
* Other

If we wanted to include only one of these races, we would write that as a condition, put it inside brackets, and save the result in a new object.

In [8]:
white = ri[ri.driver_race == 'White']
white.shape

(61872, 15)

The new DataFrame has **61872** rows, because that's the number of White drivers in the dataset.
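
It can help to look at the condition on its own before using it as a filter. The following sketch uses a made-up four-row DataFrame (``toy``, ``mask`` and ``white_toy`` are hypothetical names) to show that the condition is simply a Boolean Series with one value per row:

```python
import pandas as pd

# Toy data (hypothetical), not taken from the police dataset
toy = pd.DataFrame({
    'driver_race': ['White', 'Black', 'White', 'Asian'],
    'violation': ['Speeding', 'Speeding', 'Equipment', 'Other'],
})

# The comparison alone produces a Boolean Series, one value per row
mask = toy.driver_race == 'White'

# Putting the Boolean Series inside brackets keeps only the True rows
white_toy = toy[mask]

print(mask.tolist())
print(white_toy.shape)
```

The number of rows in the filtered DataFrame equals the number of ``True`` values in the condition.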

### Comparing stop outcomes for two groups

Let's focus on the analysis of stop outcomes, but only include White drivers:

In [9]:
white.stop_outcome.value_counts(normalize=True)

Citation            0.902234
Arrest Driver       0.024017
No Action           0.007047
N/D                 0.006433
Arrest Passenger    0.002748
Name: stop_outcome, dtype: float64

## Exercise

### Examining traffic violations

Before comparing the violations being committed by each gender, you should examine the violations committed by all drivers to get a baseline understanding of the data.

In this exercise, you'll count the unique values in the ``violation`` column, and then separately express those counts as proportions.

In [10]:
# Count the unique values in 'violation' 
print(ri.violation.value_counts())

# Express the counts as proportions
print(ri.violation.value_counts(normalize=True))

Speeding               48424
Moving violation       16224
Equipment              10922
Other                   4410
Registration/plates     3703
Seat belt               2856
Name: violation, dtype: int64
Speeding               0.559563
Moving violation       0.187476
Equipment              0.126209
Other                  0.050960
Registration/plates    0.042790
Seat belt              0.033002
Name: violation, dtype: float64


### Comparing violations by gender

The question we're trying to answer is whether male and female drivers tend to commit different types of traffic violations.

In this exercise, you'll first create a DataFrame for each gender, and then analyze the violations in each DataFrame separately.

In [15]:
# Create a DataFrame of female drivers
female = ri[ri.driver_gender == 'F']

# Create a DataFrame of male drivers
male = ri[ri.driver_gender == 'M']

# Compute the violations by female drivers (as proportions)
print(female.violation.value_counts(normalize=True))

print('------------------------------')

# Compute the violations by male drivers (as proportions)
print(male.violation.value_counts(normalize=True))

Speeding               0.658114
Moving violation       0.138218
Equipment              0.105199
Registration/plates    0.044418
Other                  0.029738
Seat belt              0.024312
Name: violation, dtype: float64
------------------------------
Speeding               0.522243
Moving violation       0.206144
Equipment              0.134158
Other                  0.058985
Registration/plates    0.042175
Seat belt              0.036296
Name: violation, dtype: float64


# Lesson II

## Does gender affect who gets a ticket for speeding?

In this section, we'll narrow our focus to the relationship between gender and stop outcomes for one specific violation, *speeding*.

We'll filter the DataFrame by multiple conditions for this analysis. In the last exercise, we used a single condition to define the ``female`` and ``male`` DataFrames.

```python
    female = ri[ri.driver_gender == 'F']
```

What if we wanted to create a second DataFrame of female drivers, but only those who were arrested?

We simply add a second condition to the filter:

```python
    female_and_arrested = ri[(ri.driver_gender == 'F') & 
                                (ri.is_arrested == True)]
```

* Notice that each condition is surrounded by parentheses.
* Ampersand (``&``) represents the ``and`` operator.

This DataFrame is much smaller than the ``female`` DataFrame, because it only consists of drivers who satisfy both conditions.

When filtering a DataFrame, another option is to use the vertical pipe character (``|``), which represents ``or``.

```python
    female_or_arrested = ri[(ri.driver_gender == 'F') | 
                            (ri.is_arrested == True)]
```

* Includes all female drivers, whether arrested or not.
* Includes all drivers who were arrested, whether male or female.

### Rules for filtering by multiple conditions

* Ampersand (``&``): only include rows that satisfy both conditions.
* Pipe (``|``): include rows that satisfy either condition.
* Each condition must be surrounded by parentheses.
* Conditions can check for equality (``==``), inequality (``!=``), etc.
* You can use more than two conditions.
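
The rules above can be sketched on a small made-up DataFrame. Everything in this block (the ``toy`` data and the result names) is a hypothetical illustration rather than part of the course dataset:

```python
import pandas as pd

# Toy data (hypothetical): four stops
toy = pd.DataFrame({
    'driver_gender': ['F', 'F', 'M', 'M'],
    'is_arrested': [True, False, True, False],
    'violation': ['Speeding', 'Speeding', 'Speeding', 'Equipment'],
})

# & keeps rows that satisfy both conditions
both = toy[(toy.driver_gender == 'F') & (toy.is_arrested == True)]

# | keeps rows that satisfy either condition
either = toy[(toy.driver_gender == 'F') | (toy.is_arrested == True)]

# A condition can also test for inequality
not_speeding = toy[toy.violation != 'Speeding']

# More than two conditions can be combined in the same way
three = toy[(toy.driver_gender == 'M') &
            (toy.is_arrested == True) &
            (toy.violation == 'Speeding')]

print(len(both), len(either), len(not_speeding), len(three))
```

The parentheses matter because ``&`` and ``|`` bind more tightly than ``==`` in Python, so leaving them out changes how the expression is evaluated.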

### Correlation, not causation

* Analyze the relationship between gender and stop outcome
    - Assess whether there is a correlation

* Not going to draw any conclusions about causation.
    - We would need additional data and expertise
    - Exploring relationships only    

## Exercise

### Comparing speeding outcomes by gender

When a driver is pulled over for speeding, many people believe that gender has an impact on whether the driver will receive a ticket or a warning. Can you find evidence of this in the dataset?

First, you'll create two DataFrames of drivers who were stopped for speeding: one containing females and the other containing males.

Then, for each gender, you'll use the ``stop_outcome`` column to calculate what percentage of stops resulted in a "Citation" (meaning a ticket) versus a "Warning".

In [20]:
# Create a DataFrame of female drivers stopped for speeding
female_and_speeding = ri[(ri.driver_gender == 'F') & 
                         (ri.violation == 'Speeding')]

# Create a DataFrame of male drivers stopped for speeding
male_and_speeding = ri[(ri.driver_gender == 'M') & 
                       (ri.violation == 'Speeding')]

# Compute the stop outcomes for female drivers (as proportions)
print(female_and_speeding.stop_outcome.value_counts(normalize=True))

print('------------------------------')

# Compute the stop outcomes for male drivers (as proportions)
print(male_and_speeding.stop_outcome.value_counts(normalize=True))

Citation            0.952192
Arrest Driver       0.005752
N/D                 0.000959
Arrest Passenger    0.000639
No Action           0.000383
Name: stop_outcome, dtype: float64
------------------------------
Citation            0.944595
Arrest Driver       0.015895
Arrest Passenger    0.001281
No Action           0.001068
N/D                 0.000976
Name: stop_outcome, dtype: float64


# Lesson III

## Does gender affect whose vehicle is searched?

During a traffic stop, the police officer sometimes conducts a search of the vehicle. Does the driver's gender affect whether their vehicle is searched? Let's review a few pandas techniques that will help us to answer this question.

### Math with Boolean Values

Recall that we can perform mathematical operations on Boolean values.

* ``True=1``, ``False=0``

Now let's take the ``mean()`` of Boolean values. First, take a look at the mean of the following list of numbers, computed with NumPy:

```python
    import numpy as np

    np.mean([0, 1, 0, 0])
    # Output
    '''
    0.25
    '''
```

Now let's look at a list of Boolean values:

```python
    np.mean([False, False, True, False])
    # Output
    '''
    0.25
    '''
```

* The mean of a Boolean Series is the proportion of ``True`` values.
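
The same arithmetic applies to a pandas Series. As a small sketch on made-up values (the name ``arrested`` is hypothetical), ``sum()`` counts the ``True`` values and ``mean()`` divides that count by the number of values:

```python
import pandas as pd

# Toy Boolean Series (hypothetical): one True out of four values
arrested = pd.Series([False, False, True, False])

# True counts as 1 and False as 0, so sum() counts the True values
n_true = arrested.sum()

# mean() is the count of True values divided by the number of values
share = arrested.mean()

print(n_true, share)
```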

### Taking the mean of a Boolean Series

First, we calculate the percentage of stops that result in an arrest using the ``value_counts()`` method:

```python
    ri.is_arrested.value_counts(normalize=True)
    # Output
    '''
    False   : 0.964431
    True    : 0.035569
    '''
```

The arrest rate is around **3.6%** since that's the percentage of ``True`` values. But we can get the same result more easily by taking the ``mean()`` of the ``is_arrested`` Series:

```python
    ri.is_arrested.mean()
    # Output
    '''
    0.035569
    '''
```

Remember that we changed the ``dtype`` of the ``is_arrested`` column to *Boolean* in the first chapter.

### Comparing groups using groupby()

The second technique we'll review is ``groupby()``:

* Study the arrest rate by police district:

```python
    ri.district.unique()
    # Output
    '''
    array(['Zone X4', 'Zone X3', 'Zone X1', 'Zone K3', 'Zone K1', 'Zone K2'],
    dtype=object)
    '''
```

We can see that there are six districts.

We could filter the DataFrame for each district one by one and take the ``mean()`` of ``is_arrested`` to calculate its arrest rate. However, it's easier to use the ``groupby()`` method on ``'district'`` and then take the ``mean()``, which returns the arrest rate for every district at once.

```python
    ri[ri.district == 'Zone K2'].is_arrested.mean()
```

or

```python
    ri.groupby('district').is_arrested.mean()
```

You can also group by multiple categories at once. For example, you can group by district and gender by passing it as a list of strings. This computes the arrest rate for every combination of district and gender. In other words, you can see the arrest rate for males and females in each district separately. 

```python
    ri.groupby(['district', 'driver_gender']).is_arrested.mean()
```
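
To see that filtering and grouping agree, here is a minimal sketch on a made-up DataFrame (the ``toy`` data and the result names are assumptions for illustration):

```python
import pandas as pd

# Toy data (hypothetical): four stops in Zone K1 and two in Zone K2
toy = pd.DataFrame({
    'district': ['Zone K1', 'Zone K1', 'Zone K1', 'Zone K1', 'Zone K2', 'Zone K2'],
    'driver_gender': ['F', 'M', 'M', 'F', 'M', 'M'],
    'is_arrested': [False, True, False, False, True, False],
})

# Arrest rate for one district, computed by filtering
by_filter = toy[toy.district == 'Zone K1'].is_arrested.mean()

# Arrest rate for every district at once
by_group = toy.groupby('district').is_arrested.mean()

# Arrest rate for every observed combination of district and gender
by_both = toy.groupby(['district', 'driver_gender']).is_arrested.mean()

print(by_group)
print(by_both)
```

Grouping by two columns returns a Series with a two-level index, so an individual rate can be looked up with a tuple such as ``by_both[('Zone K1', 'M')]``. Only combinations that occur in the data appear in the result.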

## Exercise

### Calculating the Search Rate

During a traffic stop, the police officer sometimes conducts a search of the vehicle. In this exercise, you'll calculate the percentage of all stops in the ri DataFrame that result in a vehicle search, also known as the search rate.

In [22]:
# Check the data type of 'search_conducted' 
print(ri.search_conducted.dtype)

# Calculate the search rate by counting the values
print(ri.search_conducted.value_counts(normalize=True))

# Calculate the search rate by taking the mean
print(ri.search_conducted.mean())

bool
False    0.963953
True     0.036047
Name: search_conducted, dtype: float64
0.03604713268876511


### Comparing search rates by gender

In this exercise, you'll compare the rates at which female and male drivers are searched during a traffic stop. Remember that the vehicle search rate across all stops is about 3.6%.

First, you'll filter the DataFrame by gender and calculate the search rate for each group separately. Then, you'll perform the same calculation for both genders at once using a ``.groupby()``.

In [24]:
# Calculate the search rate for female drivers
print(ri[ri.driver_gender == 'F'].search_conducted.mean())

# Calculate the search rate for male drivers
print(ri[ri.driver_gender == 'M'].search_conducted.mean())

0.019180617481282074
0.04542557598546892


In [25]:
# Calculate the search rate for both groups simultaneously
print(ri.groupby('driver_gender').search_conducted.mean())

driver_gender
F    0.019181
M    0.045426
Name: search_conducted, dtype: float64


### Adding a second factor to the analysis

Even though the search rate for males is much higher than for females, it's possible that the difference is mostly due to a second factor.

For example, you might hypothesize that the search rate varies by violation type, and the difference in search rate between males and females is because they tend to commit different violations.

You can test this hypothesis by examining the search rate for each combination of gender and violation. If the hypothesis were true, you would find that males and females are searched at about the same rate for each violation. Let's find out if that's the case!

In [28]:
# Calculate the search rate for each combination of gender and violation
print(ri.groupby(['driver_gender', 'violation']).search_conducted.mean())

driver_gender  violation          
F              Equipment              0.039984
               Moving violation       0.039257
               Other                  0.041018
               Registration/plates    0.054924
               Seat belt              0.017301
               Speeding               0.008309
M              Equipment              0.071496
               Moving violation       0.061524
               Other                  0.046191
               Registration/plates    0.108802
               Seat belt              0.035119
               Speeding               0.027885
Name: search_conducted, dtype: float64


In [30]:
# Reverse the ordering to group by violation before gender
print(ri.groupby(['violation', 'driver_gender']).search_conducted.mean())

violation            driver_gender
Equipment            F                0.039984
                     M                0.071496
Moving violation     F                0.039257
                     M                0.061524
Other                F                0.041018
                     M                0.046191
Registration/plates  F                0.054924
                     M                0.108802
Seat belt            F                0.017301
                     M                0.035119
Speeding             F                0.008309
                     M                0.027885
Name: search_conducted, dtype: float64
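
One way to read this output against the hypothesis is to place the two genders side by side and compare them violation by violation. The sketch below does this with the rates copied from the printed output above; the DataFrame name ``rates`` and the layout are assumptions made for illustration:

```python
import pandas as pd

# Search rates copied from the printed output above, one row per violation
rates = pd.DataFrame(
    {
        'F': [0.039984, 0.039257, 0.041018, 0.054924, 0.017301, 0.008309],
        'M': [0.071496, 0.061524, 0.046191, 0.108802, 0.035119, 0.027885],
    },
    index=['Equipment', 'Moving violation', 'Other',
           'Registration/plates', 'Seat belt', 'Speeding'],
)

# True where the male search rate exceeds the female search rate
male_higher = rates['M'] > rates['F']

print(male_higher)
print(male_higher.all())
```

In these numbers the male rate is higher for every violation, which does not support the idea that violation type alone explains the difference. As noted earlier, this describes a correlation only.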


# Lesson IV

## Does gender affect who is frisked during a search?

In this section, we'll take a look at what happens during a search.

### Examining the search types

As we've seen previously, the ``search_conducted`` field is ``True`` if there's a search during a traffic stop, and ``False`` otherwise.

There's also a related field, ``search_type``, that contains additional information about the search.

In [31]:
ri.search_conducted.value_counts()

False    88434
True      3307
Name: search_conducted, dtype: int64

In [32]:
ri.search_type.value_counts(dropna=False)

NaN                                                         88434
Incident to Arrest                                           1290
Probable Cause                                                924
Inventory                                                     219
Reasonable Suspicion                                          214
Protective Frisk                                              164
Incident to Arrest,Inventory                                  123
Incident to Arrest,Probable Cause                             100
Probable Cause,Reasonable Suspicion                            54
Incident to Arrest,Inventory,Probable Cause                    35
Probable Cause,Protective Frisk                                35
Incident to Arrest,Protective Frisk                            33
Inventory,Probable Cause                                       25
Protective Frisk,Reasonable Suspicion                          19
Incident to Arrest,Inventory,Protective Frisk                  18
Incident t

Notice that ``search_type`` has **88434** missing values, which is identical to the number of ``False`` values in ``search_conducted``. That's because no search type was recorded when no search was conducted.

* ``value_counts()`` excludes missing values by default.
* ``dropna=False`` displays missing values.
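
The effect of ``dropna`` can be sketched on a made-up Series (the values below are hypothetical and only reuse two category names from the output above):

```python
import pandas as pd

# Toy Series (hypothetical): two recorded search types and two missing values
search_type = pd.Series(['Inventory', None, None, 'Probable Cause'])

# Default behavior: missing values are left out of the counts
default_counts = search_type.value_counts()

# dropna=False adds an entry that counts the missing values
all_counts = search_type.value_counts(dropna=False)

print(default_counts)
print(all_counts)
```

With ``dropna=False`` the counts add up to the full length of the Series, which makes it easy to see how much of a column is missing.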

There are only five possible values for ``search_type``, which appear as the first five non-missing entries of the ``value_counts()`` output:
* Incident to Arrest
* Probable Cause
* Inventory
* Reasonable Suspicion
* Protective Frisk

But sometimes, multiple values are relevant for a single traffic stop, in which case they're separated by commas. 

Let's focus on *Inventory*, meaning searches in which the police took an inventory of the vehicle. Looking at the ``Inventory`` row of the ``value_counts()`` output, we see **219**, which is the number of searches in which Inventory was the only search type. But what if we wanted to know the total number of times an inventory was done during a search?

We'd also have to include any stops in which Inventory was one of multiple search types. To do this, we'll use a string method.

### Searching for a string

This time, we'll use a string method called ``contains()`` that checks whether a string is present in each element of a given column.

In [33]:
ri['inventory'] = ri.search_type.str.contains('Inventory', na=False)

* ``str.contains()`` returns ``True`` if the string is found, ``False`` if not.
* ``na=False`` returns ``False`` for missing values.
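
As a small sketch of how this matching behaves, the block below applies ``str.contains()`` to a made-up Series (``search_type`` here is a hypothetical stand-in for the real column, reusing a few values from the output above):

```python
import numpy as np
import pandas as pd

# Toy Series (hypothetical): single, combined, unrelated, and missing search types
search_type = pd.Series([
    'Inventory',
    'Incident to Arrest,Inventory',
    'Probable Cause',
    np.nan,
])

# Substring match: True for 'Inventory' alone or as part of a combination;
# na=False turns the missing entry into False
inventory = search_type.str.contains('Inventory', na=False)

print(inventory.tolist())
print(inventory.sum())
```

Because the match is on a substring, combined entries such as ``'Incident to Arrest,Inventory'`` are counted as well, which is exactly what is needed here.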

In [34]:
ri.inventory.dtype

dtype('bool')

As expected, the data type of the column is Boolean. To be clear, a ``True`` value in this column means that an inventory was done during a search, and a ``False`` value means it was not.

We can take the ``sum()`` of the ``inventory`` column to count its ``True`` values:

In [35]:
ri.inventory.sum()

441

An inventory was done during 441 searches. This includes the 219 stops in which Inventory was the only search type, plus additional stops in which Inventory was one of multiple search types.

### Calculating the inventory rate

What if we wanted to calculate the percentage of searches which included an inventory? You might think this would be as simple as taking the ``mean()`` of the ``inventory`` column:

In [36]:
ri.inventory.mean()

0.0048070110419550695

That gives an answer of about 0.5%. But what's wrong with this calculation? 0.5% is the percentage of all traffic stops which resulted in an inventory, including those stops in which a search was not even done.

Instead, we first need to filter the DataFrame to only include those rows in which a search was done, and then take the ``mean()`` of the inventory column.

In [37]:
searched = ri[ri.search_conducted == True]
searched.inventory.mean()

0.13335349259147264

The correct answer is that 13.3% of searches included an inventory. This is a vastly different result, and it highlights the importance of carefully choosing which rows are relevant before doing a calculation.
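
The effect of the denominator can be sketched with made-up numbers. In this hypothetical ``toy`` DataFrame, eight stops include two searches, one of which involved an inventory:

```python
import pandas as pd

# Toy data (hypothetical): 8 stops, 2 searches, 1 inventory
toy = pd.DataFrame({
    'search_conducted': [True, True, False, False, False, False, False, False],
    'inventory': [True, False, False, False, False, False, False, False],
})

# Share of ALL stops that included an inventory
rate_all_stops = toy.inventory.mean()

# Share of SEARCHES that included an inventory
searched = toy[toy.search_conducted == True]
rate_searches = searched.inventory.mean()

print(rate_all_stops, rate_searches)
```

The numerator is the same in both cases (one inventory); only the set of rows being averaged over changes.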

## Exercise

### Counting protective frisks

During a vehicle search, the police officer may pat down the driver to check if they have a weapon. This is known as a "protective frisk."

In this exercise, you'll first check to see how many times "Protective Frisk" was the only search type. Then, you'll use a string method to locate all instances in which the driver was frisked.

In [38]:
# Count the 'search_type' values
print(ri.search_type.value_counts())

Incident to Arrest                                          1290
Probable Cause                                               924
Inventory                                                    219
Reasonable Suspicion                                         214
Protective Frisk                                             164
Incident to Arrest,Inventory                                 123
Incident to Arrest,Probable Cause                            100
Probable Cause,Reasonable Suspicion                           54
Incident to Arrest,Inventory,Probable Cause                   35
Probable Cause,Protective Frisk                               35
Incident to Arrest,Protective Frisk                           33
Inventory,Probable Cause                                      25
Protective Frisk,Reasonable Suspicion                         19
Incident to Arrest,Inventory,Protective Frisk                 18
Incident to Arrest,Probable Cause,Protective Frisk            13
Inventory,Protective Fris

In [41]:
# Check if 'search_type' contains the string 'Protective Frisk'
ri['frisk'] = ri.search_type.str.contains('Protective Frisk', na=False)

# Check the data type of 'frisk'
print(ri['frisk'].dtype)

# Take the sum of 'frisk'
print(ri['frisk'].sum())

bool
303


### Comparing frisk rates by gender

In this exercise, you'll compare the rates at which female and male drivers are frisked during a search. Are males frisked more often than females, perhaps because police officers consider them to be higher risk?

Before doing any calculations, it's important to filter the DataFrame to only include the relevant subset of data, namely stops in which a search was conducted.

In [47]:
# Create a DataFrame of stops in which a search was conducted
searched = ri[ri.search_conducted == True]

# Calculate the overall frisk rate by taking the mean of 'frisk' 
print(searched.frisk.mean())

# Calculate the frisk rate for each gender
print(searched.groupby('driver_gender').frisk.mean())

0.09162382824312065
driver_gender
F    0.074561
M    0.094353
Name: frisk, dtype: float64
