<hr/>
<div class="alert alert-success alertsuccess" style="margin-top: 20px">
[Tip]: To execute the Python code in the code cell below, click on the cell to select it and press <kbd>Shift</kbd> + <kbd>Enter</kbd>.
</div>
<hr/>

# Required Libraries

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}

from os.path import exists

sns.set_style("white")
sns.set_palette("coolwarm")


def load_police_data():
    """Read the traffic-stop CSV from a local copy if present, otherwise from HU Box."""
    local = "11police_project.csv"

    if exists(local):
        print("Read from local file")
        return pd.read_csv(local)
    else:
        print("Read from hu-box")
        return pd.read_csv("https://box.hu-berlin.de/f/cc297480d22c42a6ae3e/?dl=1")


def load_data():
    df = load_police_data()
    # Recode the stray duration values '1' and '2' as '0-15 Min'
    df.loc[df.stop_duration == '2', "stop_duration"] = '0-15 Min'
    df.loc[df.stop_duration == '1', "stop_duration"] = '0-15 Min'
    df.drop_duplicates(inplace=True)

    # Split the comma-separated search types into lists
    df.search_type = df.search_type.str.split(",")

    columns = ["driver_gender", "driver_race", "violation", "stop_outcome", "stop_duration"]
    for col in columns:
        df[col] = df[col].astype("category")

    # Drop the county name and raw driver age columns
    df.drop(["county_name", "driver_age_raw"], axis='columns', inplace=True)

    # Drop rows with missing gender, race, or age
    df.dropna(subset=["driver_gender", "driver_race", "driver_age"], inplace=True)

    # Combine date and time into a DatetimeIndex
    new_date = df["stop_date"].str.cat(df["stop_time"], sep=" ")
    df['stop_datetime'] = pd.to_datetime(new_date)
    df.set_index("stop_datetime", inplace=True)

    df = df.convert_dtypes()

    return df

<hr> 

# Exercise 3: Bias and Fairness in Traffic Stops

### This notebook is split into ten tasks. You must solve at least 6 out of 10 tasks correctly.
    
1. Assess the impact of race, gender, and age on the frequency of police traffic stops.
2. Evaluate how gender influences police behavior during traffic stops.
3. Examine whether gender affects the likelihood of being frisked during a search.
4. Compare frisk rates during searches between males and females.
5. Compare frisk rates during searches between Black and White drivers.
6. Determine which gender is more likely to be arrested during a traffic stop.
7. Investigate the impact of driver age on the likelihood of being stopped.
8. Analyze whether stop duration influences the probability of an arrest.
9. Study how stop duration has changed over time to identify evolving policing practices.
10. Use a large language model to suggest additional insightful analyses of the dataset.

### You must hand in this exercise via Moodle.

## Dataset

This dataset contains records of traffic stops conducted by police officers across the United States from 2005 to 2015. It details the circumstances and outcomes of these stops. The data was sourced from the Stanford Open Policing Project, which collects and standardizes traffic stop data from law enforcement agencies nationwide. One of the project's main goals is to analyze and improve interactions between police and the public. The dataset encompasses stops made across all U.S. states during the specified period. It has 85845 rows × 13 columns.

<img src="https://images.pexels.com/photos/7715199/pexels-photo-7715199.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500">

In [None]:
# DO NOT CHANGE THIS CODE
df = load_data()
df.head()

In [None]:
df.info() 

In [None]:
df.describe(include=["string", "category"])

In [None]:
df.describe()

### Some simple formatting inspirations for displaying percentages

In [None]:
# Compute the share of each gender among all stops
counts = df["driver_gender"].value_counts()
series = counts / counts.sum()

# Using '.style.format'
pd.DataFrame(series).style.format({'count': '{:.2%}'}).background_gradient(cmap='Blues')

In [None]:
# Compute the share of each gender and race combination among all traffic stops
gender_race_ratio = pd.crosstab(df.driver_gender, df.driver_race, normalize=True)

# Format nicely
gender_race_ratio.style.format({
    'Asian': '{:.2%}', 'Black': '{:.2%}', 'Hispanic': '{:.2%}', 'Other': '{:.2%}', 'White': '{:.2%}'
}).background_gradient(cmap='Blues')

### Using a Countplot

In [None]:
g = sns.catplot(
    data=df,
    x="driver_race", 
    col="driver_gender", 
    kind="count",
    height=3, aspect=1.,
)
g.fig.suptitle("Stopped drivers by race and gender")
g.fig.subplots_adjust(top=0.8)
plt.show()

# Exploratory Data Analysis


## Task 1 - Assess the impact of race, gender, and age on the frequency of police traffic stops.

<div class="alert alert-block alert-success">
    
- Create three separate plots to visualize the distribution of traffic stops across the columns `driver_gender`, `driver_age`, and `driver_race`.
- Provide a brief description.


</div>

In [None]:
df = load_data()

# WRITE YOUR CODE HERE

Add your brief description

## Task 2 - Evaluate how gender influences police behavior during traffic stops.

Speeding is the most common reason for traffic stops across genders. 

<div class="alert alert-block alert-success">

- Analyze how the driver’s gender influences the distribution of `stop_outcome` specifically for stops involving the violation type `speeding`.
- The analysis can be tabular or visual.
- Provide a brief description.


</div>

Tabular results should be similar to this:

| stop_outcome     |           F |           M |
|:-----------------|------------:|------------:|
| Arrest Driver    | 0.00170065  | 0.0103076   |
| Arrest Passenger | 0.000269614 | 0.000850323 |
| Citation         | 0.304892    | 0.644316    |
| N/D              | 0.000269614 | 0.000725885 |
| No Action        | 0.000145177 | 0.000725885 |
| Warning          | 0.012506    | 0.0232905   |
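
A table of this shape can be produced with a cross-tabulation that is normalized over all cells. The following is only a minimal sketch on a made-up toy frame (`toy` and its values are assumptions for illustration, not the real data):

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "violation": ["Speeding", "Speeding", "Speeding", "Speeding", "Equipment"],
    "driver_gender": ["F", "M", "M", "M", "F"],
    "stop_outcome": ["Citation", "Citation", "Warning", "Citation", "Warning"],
})

# Keep only speeding stops, then tabulate outcome against gender
speeding = toy[toy["violation"] == "Speeding"]
outcome_by_gender = pd.crosstab(
    speeding["stop_outcome"],
    speeding["driver_gender"],
    normalize=True,  # each cell is a share of all speeding stops
)
print(outcome_by_gender)
```

With `normalize="columns"` each gender's column would sum to 1 instead, which can make the two genders easier to compare when one of them is stopped far more often.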

In [None]:
df = load_data()

# WRITE YOUR CODE HERE

Add your brief description

## Task 3 - Examine whether gender affects the likelihood of being frisked during a search.

A protective frisk is a pat-down conducted by police during a vehicle search, typically to check for weapons.

<div class="alert alert-block alert-success">

- Analyze the *absolute frequency* of protective frisks (in `search_type`) by driver's gender.
- The analysis can be tabular or visual.
- Provide a brief description.

</div>

Tabular results should be similar to this:

| driver_gender   |   Protective Frisk |
|:----------------|-------------------:|
| F               |                 29 |
| M               |                241 |
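
Because `load_data()` turns `search_type` into a list per stop, one possible approach is to "explode" the lists into one row per search type before counting. A minimal sketch, assuming a made-up toy frame (`toy` is hypothetical and not the real data):

```python
import pandas as pd

# Toy stand-in: search_type holds a list per stop, or None if no search happened
toy = pd.DataFrame({
    "driver_gender": ["F", "M", "M", "M", "F"],
    "search_type": [
        ["Protective Frisk"],
        ["Incident to Arrest", "Protective Frisk"],
        ["Probable Cause"],
        None,  # no search conducted
        ["Incident to Arrest"],
    ],
})

# One row per (stop, search type); stops without a search keep a missing value
exploded = toy.explode("search_type")

# Keep the frisk rows and count them per gender
frisks = exploded[exploded["search_type"] == "Protective Frisk"]
frisk_counts = frisks.groupby("driver_gender").size()
print(frisk_counts)
```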

In [None]:
df = load_data()

# WRITE YOUR CODE HERE

Add your brief description

## Task 4 - Compare frisk rates during searches between males and females.

<div class="alert alert-block alert-success">

- Calculate the *relative frequency* of "Protective Frisk" (in `search_type`) for each driver gender.
- For this, *normalize* the count of frisks by the total number of search cases per gender.
- The analysis can be tabular or visual.
- Provide a brief description.

</div>

Tabular results should be similar to this:

| driver_gender   |    count |
|:----------------|---------:|
| M               | 0.385933 |
| F               | 0.123937 |
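
One way to express such a rate is as the mean of a boolean flag over all stops in which a search took place. The sketch below uses a made-up toy frame (`toy` and the helper column `frisked` are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in: search_type holds a list per stop, or None if no search happened
toy = pd.DataFrame({
    "driver_gender": ["F", "F", "M", "M", "M", "M"],
    "search_type": [
        ["Protective Frisk"],
        ["Incident to Arrest"],
        ["Incident to Arrest", "Protective Frisk"],
        ["Protective Frisk"],
        ["Probable Cause"],
        None,  # no search, so it must not enter the denominator
    ],
})

# Restrict to stops with a search, then flag those that included a frisk
searches = toy.dropna(subset=["search_type"]).copy()
searches["frisked"] = searches["search_type"].apply(
    lambda types: "Protective Frisk" in types
)

# The mean of a boolean column is the share of True values per group
frisk_rate = (
    searches.groupby("driver_gender")["frisked"].mean().sort_values(ascending=False)
)
print(frisk_rate)
```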

In [None]:
df = load_data()

# WRITE YOUR CODE HERE

Add your brief description

## Task 5 - Compare frisk rates during searches between Black and White drivers.

<div class="alert alert-block alert-success">

- Calculate the relative frequency of "Protective Frisk" (in `search_type`) for each `driver_race`.
- Normalize the count of protective frisks by the total number of search cases within each race group.
- The analysis can be tabular or visual.
- Provide a brief description.

</div>

Tabular results should be similar to this:

| driver_race   |    count |
|:--------------|---------:|
| White         | 0.278506 |
| Black         | 0.485517 |
| Hispanic      | 0.370449 |
| Asian         | 0.177857 |
| Other         | 0        |
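
The normalization can also be written out explicitly as "frisk count divided by search count". In that form, a group with searches but no frisks drops out of the numerator and has to be filled with 0. A minimal sketch on a made-up toy frame (`toy` is hypothetical, and it assumes "Protective Frisk" appears at most once per stop):

```python
import pandas as pd

# Toy stand-in: search_type holds a list per stop, or None if no search happened
toy = pd.DataFrame({
    "driver_race": ["White", "White", "Black", "Black", "Other", "White"],
    "search_type": [
        ["Protective Frisk"],
        ["Incident to Arrest"],
        ["Protective Frisk"],
        ["Probable Cause", "Protective Frisk"],
        ["Incident to Arrest"],
        None,  # no search conducted
    ],
})

# Denominator: number of stops with a search per race
searches = toy.dropna(subset=["search_type"])
n_searches = searches.groupby("driver_race").size()

# Numerator: number of protective frisks per race
exploded = searches.explode("search_type")
n_frisks = (
    exploded[exploded["search_type"] == "Protective Frisk"]
    .groupby("driver_race")
    .size()
)

# Division aligns on the race labels; races without any frisk become NaN -> 0
frisk_rate = (n_frisks / n_searches).fillna(0)
print(frisk_rate)
```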

In [None]:
df = load_data()

# WRITE YOUR CODE HERE

Add your brief description

## Task 6 - Determine which gender is more likely to be arrested during a traffic stop.

<div class="alert alert-block alert-success">

- Analyze the effect of driver gender on arrest likelihood (`is_arrested`).
- The analysis can be tabular or visual.
- Provide a brief description.

</div>

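Tabular results should have a similar form to the table below. Note that this example groups by `driver_race`; your result should group by `driver_gender`:
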
| driver_race   |   is_arrested |
|:--------------|--------------:|
| Asian         |    0.0182303  |
| Black         |    0.0571922  |
| Hispanic      |    0.0590601  |
| Other         |    0.00840336 |
| White         |    0.0258266  |
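
For the gender comparison the task asks for, the arrest rate per group is simply the mean of the boolean `is_arrested` column. A minimal sketch with made-up values (`toy` is a hypothetical stand-in, not the real data):

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "driver_gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "is_arrested": [False, False, False, True, False, True, True, False],
})

# Share of stops that ended in an arrest, per gender
arrest_rate = toy.groupby("driver_gender")["is_arrested"].mean()
print(arrest_rate)
```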

In [None]:
df = load_data()

# WRITE YOUR CODE HERE

Add your brief description

## Task 7 - Investigate the impact of driver age on the likelihood of being stopped.

<div class="alert alert-block alert-success">

- Visualize the distribution of driver ages in the dataset. 
- Provide a brief description.
    
</div>
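
Before plotting, it can help to look at the distribution in tabular form by binning the ages. The sketch below is only an illustration: the ages, the bin edges, and the labels are all made-up assumptions.

```python
import pandas as pd

# Made-up ages for illustration only
ages = pd.Series([18, 22, 24, 30, 33, 41, 58], name="driver_age")

# Bin edges and labels are arbitrary choices; intervals are right-closed, e.g. (15, 25]
age_bands = pd.cut(
    ages,
    bins=[15, 25, 35, 45, 120],
    labels=["16-25", "26-35", "36-45", "46+"],
)

# Count stops per age band, ordered by band rather than by frequency
band_counts = age_bands.value_counts().sort_index()
print(band_counts)
```

Calling `band_counts.plot.bar()` would draw the same information as a bar chart, which is a coarse version of a histogram.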

In [None]:
df = load_data()

# WRITE YOUR CODE HERE

Add your brief description

## Task 8 - Analyze whether stop duration influences the probability of an arrest.

<div class="alert alert-block alert-success">

- Analyze how the duration of traffic stops influences the likelihood of an arrest.

- The analysis can be tabular or visual.

- Provide a brief description.
    
</div>

Tabular results should be similar to this:

| stop_duration   |   is_arrested |
|:----------------|--------------:|
| 0-15 Min        |     0.0123334 |
| 16-30 Min       |     0.0911045 |
| 30+ Min         |     0.253367  |
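
A row-normalized cross-tabulation shows, for each duration bucket, the share of stops with and without an arrest. A minimal sketch on made-up values (`toy` is a hypothetical stand-in for the real data):

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "stop_duration": ["0-15 Min"] * 4 + ["16-30 Min"] * 2 + ["30+ Min"],
    "is_arrested": [False, False, False, True, False, True, True],
})

# normalize="index" makes every row (duration bucket) sum to 1
arrest_by_duration = pd.crosstab(
    toy["stop_duration"], toy["is_arrested"], normalize="index"
)
print(arrest_by_duration)
```

The `True` column then corresponds to the arrest rate per bucket.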

In [None]:
df = load_data()

# WRITE YOUR CODE HERE

Add your brief description

## Task 9 - Study how stop duration has changed over time to identify evolving policing practices.

<div class="alert alert-block alert-success">

- Visualize how stop duration changes over time.
- Provide a brief description.

</div>
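
Since `load_data()` sets a datetime index, one possible starting point is to resample by year and track the share of short stops. This is only a sketch on a made-up toy frame; the choice of "0-15 Min" as the tracked bucket and of yearly bins are assumptions.

```python
import pandas as pd

# Toy stand-in with a DatetimeIndex (values are made up for illustration)
toy = pd.DataFrame(
    {"stop_duration": ["0-15 Min", "30+ Min", "0-15 Min", "0-15 Min", "16-30 Min"]},
    index=pd.to_datetime([
        "2005-03-01 10:00",
        "2005-09-15 23:10",
        "2006-01-20 08:30",
        "2006-06-05 14:00",
        "2006-11-11 19:45",
    ]),
)

# Flag short stops, then average the flag within each calendar year
is_short = toy["stop_duration"] == "0-15 Min"
short_share = is_short.resample("YS").mean()  # "YS" = bins starting on 1 January
print(short_share)
```

Calling `short_share.plot()` would turn the yearly shares into a line chart. Repeating the steps for the other buckets gives the full picture.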


In [None]:
df = load_data()

# WRITE YOUR CODE HERE

Add your brief description

## Task 10 - Use a large language model to suggest additional insightful analyses of the dataset.

<div class="alert alert-block alert-success">

- Request additional analysis suggestions from a large language model (LLM).
- Select one suggested analysis and perform it.
- Provide a brief description.
    
</div>

In [None]:
# Use ChatGPT...

Add your brief description

<hr> 

# Submit two files via Moodle:
- Your notebook
- An HTML export of this notebook

### You must hand in this exercise via Moodle.