Does the gender of a driver have an impact on police behavior during a traffic stop? In this chapter, you will explore that question while practicing filtering, grouping, method chaining, Boolean math, string methods, and more!

# Lesson I

## Do the Genders Commit Different Violations?

Let's start by discussing a few methods that will help us with our analysis. The first method is ``value_counts()``, which counts the unique values in a Series.

It's best suited for a column that contains *categorical data* rather than numerical data.

For example, let's apply the ``value_counts()`` method to the ``stop_outcome`` column:

In [2]:
# Import Packages
import pandas as pd

# Datasets
ri = pd.read_csv('datasets/police.csv')


Unnamed: 0,state,stop_date,stop_time,county_name,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
0,RI,2005-01-04,12:55,,M,White,Equipment/Inspection Violation,Equipment,False,,Citation,False,0-15 Min,False,Zone X4
1,RI,2005-01-23,23:15,,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2,RI,2005-02-17,04:15,,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4


In [3]:
ri.stop_outcome.value_counts()
# Results are displayed in 'descending' order.

Citation            77092
Arrest Driver        2735
No Action             625
N/D                   607
Arrest Passenger      343
Name: stop_outcome, dtype: int64

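Because ``value_counts()`` outputs a pandas *Series*, we can take the sum of this *Series* by simply adding the ``.sum()`` method on the end.

This is called **method chaining**.

The ``sum()`` of ``value_counts()`` equals the number of rows in the DataFrame only for a Series that has no missing values, because ``value_counts()`` excludes missing values by default. For ``stop_outcome``, the sum (86539) is smaller than the number of rows (91741), so the column contains missing values.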

In [4]:
ri.stop_outcome.value_counts().sum()

86539

In [5]:
ri.shape

(91741, 15)
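
The gap between the two numbers above can be reproduced on a small scale. The following is a minimal sketch, assuming a made-up Series named ``outcomes`` (it is not part of the dataset); it uses the ``dropna`` parameter of ``value_counts()`` and the ``isna()`` method to show where the missing values go.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series standing in for a column with missing values
outcomes = pd.Series(['Citation', 'Citation', 'Arrest Driver', np.nan, np.nan])

# By default, value_counts() leaves missing values out of the counts
counted = outcomes.value_counts().sum()

# With dropna=False, missing values get their own entry in the counts
counted_with_missing = outcomes.value_counts(dropna=False).sum()

# isna().sum() counts the missing values directly
n_missing = outcomes.isna().sum()

print(counted, counted_with_missing, n_missing, len(outcomes))
```

In this toy case, the default sum falls short of the length of the Series by exactly the number of missing values.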

### Expressing counts as proportions

Rather than examining the raw counts, we might prefer to see proportions of the total. Instead of dividing each count (for example, the number of 'Citation' outcomes) by the total number of outcomes ourselves, we can use the ``normalize=True`` parameter of ``value_counts()``.

In [6]:
ri.stop_outcome.value_counts(normalize=True)

Citation            0.890835
Arrest Driver       0.031604
No Action           0.007222
N/D                 0.007014
Arrest Passenger    0.003964
Name: stop_outcome, dtype: float64
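
To make the link between counts and proportions concrete, here is a small sketch on a made-up Series (the name ``outcomes`` and its values are assumptions for illustration). It divides the counts by their total by hand and compares the result with ``normalize=True``.

```python
import pandas as pd

# Hypothetical toy Series: three citations and one arrest
outcomes = pd.Series(['Citation', 'Citation', 'Citation', 'Arrest Driver'])

# Manual approach: divide each count by the total of all counts
counts = outcomes.value_counts()
manual = counts / counts.sum()

# Built-in approach: let value_counts() do the division
normalized = outcomes.value_counts(normalize=True)

print(manual)
print(normalized)
```

Both Series should hold the same proportions, and each should sum to 1.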

### Filtering DataFrame Rows

Let's take a look at the ``value_counts()`` of a different column, ``driver_race``:

In [7]:
ri.driver_race.value_counts()

White       61872
Black       12285
Hispanic     9727
Asian        2390
Other         265
Name: driver_race, dtype: int64

We see five unique categories:
* White
* Black
* Hispanic
* Asian
* Other

If we wanted to include only one of these races, we would write that as a condition, put it inside brackets, and save the result in a new object.

In [8]:
white = ri[ri.driver_race == 'White']
white.shape

(61872, 15)

The new DataFrame has **61872** rows, because that's the number of White drivers in the dataset.
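
It may help to look at what the condition inside the brackets actually produces. The sketch below uses a made-up DataFrame named ``stops`` (an assumption for illustration, not the real dataset): the comparison returns a Boolean Series, and indexing with it keeps only the rows marked ``True``.

```python
import pandas as pd

# Hypothetical toy DataFrame with the same column names as the dataset
stops = pd.DataFrame({
    'driver_race': ['White', 'Black', 'White', 'Asian'],
    'stop_outcome': ['Citation', 'Citation', 'Arrest Driver', 'Citation'],
})

# The comparison yields one True/False value per row
mask = stops.driver_race == 'White'
print(mask.tolist())

# True counts as 1 and False as 0, so the sum is the number of matching rows
n_white = mask.sum()

# Indexing with the mask keeps only the rows where it is True
white_stops = stops[mask]
print(white_stops.shape)
```

Because ``True`` is treated as 1, the sum of the mask matches the number of rows in the filtered DataFrame.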

### Comparing stop outcomes for two groups

Let's focus on the analysis of stop outcomes, but only include White drivers:

In [9]:
white.stop_outcome.value_counts(normalize=True)

Citation            0.902234
Arrest Driver       0.024017
No Action           0.007047
N/D                 0.006433
Arrest Passenger    0.002748
Name: stop_outcome, dtype: float64

## Exercise

### Examining traffic violations

Before comparing the violations being committed by each gender, you should examine the violations committed by all drivers to get a baseline understanding of the data.

In this exercise, you'll count the unique values in the ``violation`` column, and then separately express those counts as proportions.

In [10]:
# Count the unique values in 'violation' 
print(ri.violation.value_counts())

# Express the counts as proportions
print(ri.violation.value_counts(normalize=True))

Speeding               48424
Moving violation       16224
Equipment              10922
Other                   4410
Registration/plates     3703
Seat belt               2856
Name: violation, dtype: int64
Speeding               0.559563
Moving violation       0.187476
Equipment              0.126209
Other                  0.050960
Registration/plates    0.042790
Seat belt              0.033002
Name: violation, dtype: float64


### Comparing violations by gender

The question we're trying to answer is whether male and female drivers tend to commit different types of traffic violations.

In this exercise, you'll first create a DataFrame for each gender, and then analyze the violations in each DataFrame separately.

In [15]:
# Create a DataFrame of female drivers
female = ri[ri.driver_gender == 'F']

# Create a DataFrame of male drivers
male = ri[ri.driver_gender == 'M']

# Compute the violations by female drivers (as proportions)
print(female.violation.value_counts(normalize=True))

print('------------------------------')

# Compute the violations by male drivers (as proportions)
print(male.violation.value_counts(normalize=True))

Speeding               0.658114
Moving violation       0.138218
Equipment              0.105199
Registration/plates    0.044418
Other                  0.029738
Seat belt              0.024312
Name: violation, dtype: float64
------------------------------
Speeding               0.522243
Moving violation       0.206144
Equipment              0.134158
Other                  0.058985
Registration/plates    0.042175
Seat belt              0.036296
Name: violation, dtype: float64
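
As an alternative to building one DataFrame per gender, similar proportions can be computed in a single step by grouping. This is only a sketch on a made-up DataFrame named ``stops`` (an assumption for illustration), not the approach used in the exercise above.

```python
import pandas as pd

# Hypothetical toy DataFrame with the same column names as the dataset
stops = pd.DataFrame({
    'driver_gender': ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M'],
    'violation': ['Speeding', 'Speeding', 'Speeding', 'Equipment',
                  'Speeding', 'Speeding', 'Equipment', 'Equipment'],
})

# Group by gender, then compute violation proportions within each group
by_gender = stops.groupby('driver_gender').violation.value_counts(normalize=True)
print(by_gender)

# unstack() moves the violations into columns for a side-by-side view
print(by_gender.unstack())
```

The result is a Series indexed by gender and violation, so the proportions within each gender sum to 1.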


# Lesson II

## Does Gender Affect Who Gets a Ticket for Speeding?

In this section, we'll narrow our focus to the relationship between gender and stop outcomes for one specific violation: *speeding*.

We'll filter the DataFrame by multiple conditions for this analysis. In the last exercise, we used a single condition to define the ``female`` and ``male`` DataFrames.

```python
female = ri[ri.driver_gender == 'F']
```

What if we wanted to create a second DataFrame of female drivers, but only those who were arrested?

We simply add a second condition to the filter:

```python
female_and_arrested = ri[(ri.driver_gender == 'F') &
                         (ri.is_arrested == True)]
```

* Notice that each condition is surrounded by parentheses.
* The ampersand (``&``) represents the ``and`` operator.

This DataFrame is much smaller than the ``female`` DataFrame, because it only consists of drivers who satisfy both conditions.

When filtering a DataFrame, another option is to use the vertical pipe character (``|``), which represents ``or``.

```python
female_or_arrested = ri[(ri.driver_gender == 'F') |
                        (ri.is_arrested == True)]
```

* Includes all female drivers, whether arrested or not.
* Includes all drivers who were arrested, whether male or female.

### Rules for filtering by multiple conditions

* Ampersand (``&``): only include rows that satisfy both conditions.
* Pipe (``|``): include rows that satisfy either condition.
* Each condition must be surrounded by parentheses.
* Conditions can check for equality (``==``), inequality (``!=``), etc.
* You can use more than two conditions.
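
The rules above can be tried out on a small example. The sketch below assumes a made-up DataFrame named ``stops`` (not the real dataset) and applies ``&``, ``|``, and a three-condition filter that also uses ``!=``.

```python
import pandas as pd

# Hypothetical toy DataFrame with the same column names as the dataset
stops = pd.DataFrame({
    'driver_gender': ['F', 'F', 'M', 'M'],
    'is_arrested': [True, False, True, False],
    'violation': ['Speeding', 'Speeding', 'Equipment', 'Speeding'],
})

# & keeps rows that satisfy both conditions
both = stops[(stops.driver_gender == 'F') & (stops.is_arrested == True)]

# | keeps rows that satisfy at least one condition
either = stops[(stops.driver_gender == 'F') | (stops.is_arrested == True)]

# More than two conditions can be combined, including inequality checks
three = stops[(stops.driver_gender == 'M') &
              (stops.is_arrested == False) &
              (stops.violation != 'Equipment')]

print(len(both), len(either), len(three))
```

The parentheses matter because ``&`` and ``|`` have higher precedence than ``==`` in Python. Without them, Python would evaluate ``&`` first and the expression would no longer compare each column to its value.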

### Correlation, not causation

* Analyze the relationship between gender and stop outcome
    - Assess whether there is a correlation

* Not going to draw any conclusions about causation.
    - We would need additional data and expertise
    - Exploring relationships only    

## Exercise

### Comparing speeding outcomes by gender

When a driver is pulled over for speeding, many people believe that gender has an impact on whether the driver will receive a ticket or a warning. Can you find evidence of this in the dataset?

First, you'll create two DataFrames of drivers who were stopped for speeding: one containing females and the other containing males.

Then, for each gender, you'll use the ``stop_outcome`` column to calculate what percentage of stops resulted in a "Citation" (meaning a ticket) versus a "Warning".

In [20]:
# Create a DataFrame of female drivers stopped for speeding
female_and_speeding = ri[(ri.driver_gender == 'F') & 
                         (ri.violation == 'Speeding')]

# Create a DataFrame of male drivers stopped for speeding
male_and_speeding = ri[(ri.driver_gender == 'M') & 
                       (ri.violation == 'Speeding')]

# Compute the stop outcomes for female drivers (as proportions)
print(female_and_speeding.stop_outcome.value_counts(normalize=True))

print('------------------------------')

# Compute the stop outcomes for male drivers (as proportions)
print(male_and_speeding.stop_outcome.value_counts(normalize=True))

Citation            0.952192
Arrest Driver       0.005752
N/D                 0.000959
Arrest Passenger    0.000639
No Action           0.000383
Name: stop_outcome, dtype: float64
------------------------------
Citation            0.944595
Arrest Driver       0.015895
Arrest Passenger    0.001281
No Action           0.001068
N/D                 0.000976
Name: stop_outcome, dtype: float64
