## **Business Understanding**

This project analyzes and predicts the arrest outcomes of Terry Stops conducted by the Seattle Police Department, focusing on whether a stop results in a formal arrest. Using features such as the time of day, precinct location, subject demographics and the presence of weapons, the model seeks to identify the factors most strongly associated with an arrest for any given stop. The results can give law enforcement agencies insight into operational patterns and resource allocation, and offer stakeholders and policymakers a data-driven lens for evaluating the consistency and transparency of police interventions and the factors that influence them.

## **Data Understanding**

The [dataset](Data/Terry_Stops_20251229.csv) consists of 65,931 records and 23 features describing Terry Stops reported by the Seattle Police Department. Each row represents a unique police stop incident.

Key Features: The features can be categorized into four main groups:

* **Subject Demographics**: Attributes describing the individual(s) stopped by an officer, including but not limited to `Subject Age Group`, `Subject Perceived Race` and `Subject Perceived Gender`.

* **Officer Demographics**: Attributes describing the officer involved, such as `Officer ID`, `Officer Gender`, `Officer Race` and `Officer YOB` (year of birth).

* **Event Details**: Contextual information about the stop, including `Reported Date`, `Occurred Date`, `Precinct`, `Sector` and `Beat`.

* **Operational Factors**: Police procedure details, such as `Initial Call Type`, `Final Call Type`, `Frisk Flag` (whether a frisk was performed) and `Weapon Type`.


<br>

The target variable is `Arrest Flag`, a binary indicator showing whether the stop resulted in a physical arrest (Y) or not (N). This makes it a natural target for the binary classification task at hand.
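
Most classifiers expect a numeric target, so the Y/N flag is usually mapped to 1/0 before modeling. The snippet below is a minimal sketch of that step on a toy frame; the `toy` DataFrame and the `arrest` column name are illustrative assumptions, and only `Arrest Flag` with its Y/N values comes from the dataset description.

```python
import pandas as pd

# Toy stand-in for the real data; only the `Arrest Flag` column name and
# its Y/N values come from the dataset description.
toy = pd.DataFrame({"Arrest Flag": ["N", "Y", "N", "N", "Y"]})

# Map the Y/N flag to 1/0; any unexpected value becomes NaN so it can be caught.
toy["arrest"] = toy["Arrest Flag"].map({"Y": 1, "N": 0})

# Fail fast if a value other than Y or N slipped in.
assert toy["arrest"].notna().all(), "unexpected value in Arrest Flag"

toy["arrest"] = toy["arrest"].astype(int)
print(toy["arrest"].tolist())
```

Using `map` with an explicit dictionary, rather than a generic label encoder, keeps the meaning of 1 (arrest) and 0 (no arrest) visible in the code.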

<br>
The target classes are highly imbalanced:

  * No Arrest (N): ~89% of cases (58,368 stops)

  * Arrest (Y): ~11% of cases (7,563 stops)

The model must account for this imbalance; otherwise it could reach roughly 89% accuracy simply by predicting "No Arrest" every time.
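
To make the imbalance concrete, the sketch below recomputes the class shares from the counts listed above, along with one common way to compensate: per-class weights inversely proportional to class frequency. This is an illustration under the assumption that class weighting is used; the project may instead rely on resampling or another technique.

```python
# Class counts as reported above.
counts = {"N": 58_368, "Y": 7_563}
total = sum(counts.values())

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}

# "Balanced" weights: total / (n_classes * class_count), the same heuristic
# scikit-learn applies for class_weight="balanced".
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

# Accuracy of a trivial model that always predicts "No Arrest".
baseline_accuracy = shares["N"]

print({label: round(share, 3) for label, share in shares.items()})
print({label: round(weight, 3) for label, weight in weights.items()})
print(round(baseline_accuracy, 3))
```

With these weights, each class contributes the same total weight to the loss, so errors on the rarer "Arrest" class are penalized more heavily. Metrics such as recall, precision or F1 on the "Arrest" class are then more informative than raw accuracy.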

<br>
In terms of data quality, the notable observations were:

 1. **Missing Values**: The `Weapon Type` column contains no usable value in roughly half of the records (~50%), which often indicates that no weapon was involved.

 2. **Placeholder Values**: Several columns contain placeholder values such as `-` or `Unknown`, which require cleaning during the preprocessing stage.

 3. **Categorical Complexity**: Features like `Initial Call Type` and `Final Call Type` have high cardinality (many unique text values), requiring grouping or dimensionality reduction for effective modeling.

 4. **Subject ID Issues**: A large number of records have a `Subject ID` of `-1`, indicating that many subjects were not uniquely identified or linked to a master record system.
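
The observations above suggest a few preprocessing steps, sketched here on a toy frame. The rows, the `subject_identified` and `final_call_grouped` column names, the `group_rare` helper and the `min_count` threshold are all hypothetical; only the column names `Weapon Type`, `Subject ID` and `Final Call Type` and the placeholders `-`, `Unknown` and `-1` come from the description above.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real data; the values are illustrative only.
toy = pd.DataFrame({
    "Weapon Type": ["-", "Handgun", "-", "Unknown", "Knife"],
    "Subject ID": [-1, 101, -1, 102, 103],
    "Final Call Type": ["THEFT", "THEFT", "ASSAULT", "THEFT", "NOISE"],
})

# Treat the placeholder strings as missing values.
placeholders = ["-", "Unknown"]
toy["Weapon Type"] = toy["Weapon Type"].replace(placeholders, np.nan)

# Flag subjects that were not linked to an identifier (Subject ID of -1).
toy["subject_identified"] = toy["Subject ID"] != -1


def group_rare(series: pd.Series, min_count: int) -> pd.Series:
    """Collapse categories seen fewer than `min_count` times into 'Other'."""
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    return series.where(series.isin(keep), "Other")


# Reduce cardinality by keeping only the frequent call types.
toy["final_call_grouped"] = group_rare(toy["Final Call Type"], min_count=2)
print(toy)
```

On the real data, `min_count` would need tuning: a higher threshold yields fewer categories but discards more detail about uncommon call types.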

```python
import pandas as pd

# Load the Terry Stops export and summarize its columns, non-null counts and dtypes.
d = pd.read_csv('Data/Terry_Stops_20251229.csv')
d.info()
```

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65931 entries, 0 to 65930
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Subject Age Group         65931 non-null  object
 1   Subject ID                65931 non-null  int64 
 2   GO / SC Num               65931 non-null  int64 
 3   Terry Stop ID             65931 non-null  int64 
 4   Stop Resolution           65931 non-null  object
 5   Weapon Type               65931 non-null  object
 6   Officer ID                65931 non-null  object
 7   Officer YOB               65931 non-null  int64 
 8   Officer Gender            65931 non-null  object
 9   Officer Race              65931 non-null  object
 10  Subject Perceived Race    65931 non-null  object
 11  Subject Perceived Gender  65931 non-null  object
 12  Reported Date             65931 non-null  object
 13  Initial Call Type         65931 non-null  object
 14  Final Call Type       