# Analyzing Police Activity with Pandas

This project applies the foundations of **pandas** to answer real-world questions using the **Stanford Open Policing Project dataset**. The focus is on analyzing the impact of gender on police behavior while practicing essential data analysis skills.

Through this project, you will:

- Explore and clean messy datasets
- Fix data types and handle missing values
- Drop unnecessary columns and rows
- Create meaningful visualizations
- Combine and reshape datasets
- Work with time series data

The goal is to complete a full analysis workflow, from raw data to insights, and to build practical experience for a data science career.

---


## Chapter 1: Preparing the Data for Analysis

Before any analysis can begin, the dataset has to be prepared. This chapter covers:

- Examining the raw data
- Fixing data types for consistency
- Handling missing values
- Dropping irrelevant columns and rows

This ensures the dataset is ready for efficient exploration and accurate analysis.
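Of the steps listed above, fixing data types is the easiest to get wrong when missing values are present. The sketch below illustrates the idea on a toy frame. The values are made up, and the assumption that a boolean-like column such as `is_arrested` loads as `object` when it contains gaps is illustrative, not taken from the notebook output.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a boolean-like column with a gap (values are made up).
toy = pd.DataFrame({"is_arrested": [False, True, np.nan, False]})

# Mixing booleans with NaN forces pandas to fall back to the generic object dtype.
print(toy["is_arrested"].dtype)  # object

# Drop the missing rows first: casting NaN straight to bool would silently
# turn it into True, which would corrupt the column.
cleaned = (
    toy.dropna(subset=["is_arrested"])
       .assign(is_arrested=lambda df: df["is_arrested"].astype(bool))
)

print(cleaned["is_arrested"].dtype)  # bool
```

A true `bool` column supports arithmetic directly, so `cleaned["is_arrested"].mean()` would give the arrest rate for the toy data.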

In [13]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [14]:
police = pd.read_csv(r"C:\Users\Emigb\Documents\Data Science\datasets\police.csv")
police.head()

,state,stop_date,stop_time,county_name,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
0,RI,2005-01-04,12:55,,M,White,Equipment/Inspection Violation,Equipment,False,,Citation,False,0-15 Min,False,Zone X4
1,RI,2005-01-23,23:15,,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2,RI,2005-02-17,04:15,,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4
3,RI,2005-02-20,17:15,,M,White,Call for Service,Other,False,,Arrest Driver,True,16-30 Min,False,Zone X1
4,RI,2005-02-24,01:20,,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X3


## Exploring the Dataset

The analysis will focus on a dataset of traffic stops in **Rhode Island**, collected by the **Stanford Open Policing Project**.

Before diving into deeper analysis, the first step is to get familiar with the dataset. This involves:

- **Loading the dataset** into pandas for use
- **Viewing the first few rows** to understand its structure and contents
- **Counting missing values** to identify data quality issues that need attention

This initial exploration ensures a clear understanding of what the dataset contains and highlights areas that require cleaning before further analysis.
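The three exploration steps can be tried without the real file. The following is a minimal sketch that reads a tiny inline CSV instead. The three rows are an illustrative sample (one `driver_gender` value is deliberately blanked), and `csv_text`, `sample`, `missing_counts`, and `missing_share` are hypothetical names.

```python
import io

import pandas as pd

# Tiny inline CSV standing in for police.csv (illustrative rows only).
csv_text = """stop_date,driver_gender,violation,county_name
2005-01-04,M,Equipment,
2005-01-23,M,Speeding,
2005-02-24,,Speeding,
"""

# Step 1: load the data. read_csv accepts any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# Step 2: look at the first rows to understand the structure.
print(sample.head())

# Step 3: count missing values per column.
missing_counts = sample.isna().sum()
print(missing_counts)

# The same information as a share of rows, which is easier to compare across columns.
missing_share = sample.isna().mean()
print(missing_share)
```

`isna()` and `isnull()` are aliases in pandas, so either spelling gives the same counts.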


In [15]:
police.isnull().sum()

state                     0
stop_date                 0
stop_time                 0
county_name           91741
driver_gender          5205
driver_race            5202
violation_raw          5202
violation              5202
search_conducted          0
search_type           88434
stop_outcome           5202
is_arrested            5202
stop_duration          5202
drugs_related_stop        0
district                  0
dtype: int64

In [16]:
print(police.shape)
print(police.columns)

(91741, 15)
Index(['state', 'stop_date', 'stop_time', 'county_name', 'driver_gender',
       'driver_race', 'violation_raw', 'violation', 'search_conducted',
       'search_type', 'stop_outcome', 'is_arrested', 'stop_duration',
       'drugs_related_stop', 'district'],
      dtype='object')


In [17]:
police.state.value_counts()

state
RI    91741
Name: count, dtype: int64

## Dropping Unnecessary Columns

Not every column in a dataset is useful for analysis. Keeping irrelevant or empty columns can add clutter and make it harder to focus on the meaningful data.

In this step:

- The **`county_name`** column will be dropped because it contains only missing values (all 91,741 rows are null).
- The **`state`** column will also be dropped since all traffic stops occurred in the same state (**Rhode Island**), making the column redundant.

Removing these columns will simplify the dataset and keep the focus on information that adds value to the analysis.
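The two columns above were picked by reading the output, but the same checks can be made programmatically. The sketch below is one possible way to flag such columns on a toy frame. It is not the method used in this notebook, and `all_missing`, `constant`, and `trimmed` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the situation described above (values are made up).
toy = pd.DataFrame({
    "state": ["RI", "RI", "RI"],
    "county_name": [np.nan, np.nan, np.nan],
    "driver_gender": ["M", "F", "M"],
})

# Columns in which every value is missing.
all_missing = toy.columns[toy.isna().all()].tolist()

# Columns with exactly one distinct non-missing value.
# Caveat: a column holding one value plus some NaNs is also flagged here,
# so the result should be reviewed before dropping anything.
constant = toy.columns[toy.nunique() == 1].tolist()

print(all_missing)  # ['county_name']
print(constant)     # ['state']

# drop() returns a new frame, so the original stays intact.
trimmed = toy.drop(columns=all_missing + constant)
print(trimmed.columns.tolist())  # ['driver_gender']
```

Returning a new frame rather than using `inplace=True` makes it easy to compare the data before and after the drop.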


In [18]:
police.drop(['state','county_name'], axis='columns', inplace=True)
police.shape

(91741, 13)

In [19]:
police.isnull().sum()

stop_date                 0
stop_time                 0
driver_gender          5205
driver_race            5202
violation_raw          5202
violation              5202
search_conducted          0
search_type           88434
stop_outcome           5202
is_arrested            5202
stop_duration          5202
drugs_related_stop        0
district                  0
dtype: int64

## Dropping Rows

Sometimes, a dataset may have missing values in columns that are **critical for analysis**. If the number of missing rows is small, it is often better to remove them rather than risk skewing the results.

In this step:

- The **`driver_gender`** column is identified as essential for the analysis.
- Only a small fraction of rows are missing values in this column: 5,205 of 91,741, or about 5.7%.
- Those rows will be **dropped** to ensure the dataset remains reliable and ready for gender-based analysis.

This step helps maintain accuracy while keeping as much useful data as possible.
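Before dropping rows, it can help to measure how much data would be lost. The sketch below does that on a toy frame and then drops only the rows that lack the critical column. The values, including the `search_type` label, are made up for illustration.

```python
import pandas as pd

# Toy frame: driver_gender is the critical column, search_type is mostly empty.
toy = pd.DataFrame({
    "driver_gender": ["M", None, "F", "M", None],
    "search_type": [None, None, "Probable Cause", None, None],
})

# Share of rows that would be lost by requiring driver_gender.
missing_share = toy["driver_gender"].isna().mean()
print(missing_share)  # 0.4

# subset= restricts the check to the listed columns, so rows that are
# missing only search_type are kept.
kept = toy.dropna(subset=["driver_gender"])

print(kept.shape)                        # (3, 2)
print(kept["search_type"].isna().sum())  # 2
```

Calling `dropna()` without `subset` would drop every row with any missing value, which here would leave a single row.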


In [23]:
police.dropna(subset=['driver_gender'], inplace=True)
print(police.isnull().sum())
print(police.shape)

stop_date                 0
stop_time                 0
driver_gender             0
driver_race               0
violation_raw             0
violation                 0
search_conducted          0
search_type           83229
stop_outcome              0
is_arrested               0
stop_duration             0
drugs_related_stop        0
district                  0
dtype: int64
(86536, 13)
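The numbers in the outputs above are internally consistent, which can be checked with a few lines of arithmetic. The figures below are copied from the `isnull().sum()` and `shape` outputs in this chapter. Only the variable names are introduced here.

```python
# Figures taken from the outputs shown earlier in this chapter.
rows_before = 91741
missing_gender = 5205
search_type_missing_before = 88434

# Dropping rows without driver_gender should leave this many rows.
rows_after = rows_before - missing_gender
print(rows_after)  # 86536

# Share of the dataset removed by the drop, as a percentage.
share_dropped = missing_gender / rows_before
print(round(share_dropped * 100, 1))  # 5.7

# Every dropped row also lacked search_type, so its missing count
# falls by the same amount.
search_type_missing_after = search_type_missing_before - missing_gender
print(search_type_missing_after)  # 83229
```

The other columns that had 5,202 missing values now show zero, which means all of those gaps were in rows that also lacked `driver_gender`.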
