## Seattle Terry Stops Prediction Project
By: Shamla Tadese Araya

Moringa School

Aug 2024

***

## INTRODUCTION

In Seattle, Terry stops refer to brief stops and detentions by police officers based on a reasonable suspicion that a person may be involved in criminal activity. The term originates from the U.S. Supreme Court case Terry v. Ohio (1968), which established the legal standard for such stops. In Seattle, these stops are subject to both federal and local regulations, and there has been considerable debate over their impact on communities, particularly with regard to racial profiling and civil liberties. Efforts to refine and improve the practice focus on balancing effective policing with the protection of individual rights.

In this project we aim to achieve the following objectives:

* Determine if there is a racial disparity in the Seattle Terry Stops
* Determine whether differences in race between the officer and the subject play a role in frisks and arrests
* Determine the most common outcome of the Seattle Terry Stops and what it means
* Develop a model that can accurately predict the likelihood of an arrest occurring during a Terry Stop

This project is divided into three workbooks, each focusing on a specific part of the process. We start by exploring and cleaning the data, followed by an exploratory analysis that addresses the first three objectives: racial disparity in Terry Stops, the role of race in Terry Stops, and the most common outcome of Terry Stops in Seattle. The third workbook addresses the fourth objective: developing a predictive model that can accurately predict the likelihood of an arrest following a Terry Stop based on various factors.

This project uses data obtained from the City of Seattle at https://data.seattle.gov/Public-Safety/Terry-Stops.

## Observing the Data

In [4]:
# Importing the relevant libraries for EDA and visualization

import numpy as np
import pandas as pd
from scipy import stats
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

In [5]:
# Loading the data
df = pd.read_csv('data/Terry_Stops_20240826.csv')

# Checking the first few rows of the data
df.head()

Unnamed: 0,Subject Age Group,Subject ID,GO / SC Num,Terry Stop ID,Stop Resolution,Weapon Type,Officer ID,Officer YOB,Officer Gender,Officer Race,...,Reported Time,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat
0,36 - 45,-1,20160000398323,208373,Offense Report,,4852,1953,M,Asian,...,15:18:00.0000000,TRESPASS,--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,911,NORTH PCT 2ND W - LINCOLN - PLATOON 1,N,N,North,L,L3
1,18 - 25,-1,20180000227180,559146,Citation / Infraction,,5472,1964,M,Asian,...,00:07:00.0000000,"SUSPICIOUS PERSON, VEHICLE, OR INCIDENT",--TRAFFIC - BICYCLE VIOLATION,911,SOUTHWEST PCT - 3RD WATCH - F/W RELIEF,N,N,Southwest,F,F3
2,18 - 25,-1,20180000410091,498246,Offense Report,,6081,1962,M,White,...,02:56:00.0000000,"NARCOTICS - VIOLATIONS (LOITER, USE, SELL, NARS)",--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,911,NORTH PCT 3RD W - BOY (JOHN) - PLATOON 1,N,N,North,B,B1
3,-,-1,20160000001637,146742,Field Contact,,6924,1974,M,White,...,01:44:00.0000000,-,-,-,EAST PCT OPS - NIGHT ACT,N,N,East,C,C1
4,46 - 55,-1,20150000006037,104477,Field Contact,,6732,1975,M,White,...,02:59:00.0000000,-,-,-,SOUTH PCT 3RD W - SAM - PLATOON 2,N,N,North,B,B2


In [6]:
# Checking the number of rows and columns of the data
df.shape

(61009, 23)


The Terry stops dataset from the City of Seattle's open data portal contains 61,009 entries and 23 columns describing interactions between Seattle police officers and individuals during Terry stops. Here is a general description of what the dataset contains:

Stop Date and Time: When the Terry stop occurred, including the specific date and time.

Location: The geographic location where the stop took place, often including neighborhood or precinct information.

Officer Details: Identifiers or information related to the officers who conducted the stop, though specific identifying details might be anonymized.

Demographic Information: Data on the individuals stopped, such as race, gender, and age. This helps in analyzing the demographic breakdown of those stopped.

Reason for Stop: The reason or suspicion that led to the stop, providing context for why the individual was stopped.

Outcome of the Stop: The result of the stop, such as whether a search was conducted, if an arrest was made, or if a citation was issued.

Search Details: Information on whether a search was conducted during the stop, and if so, what was found.

Interaction Type: Information on the nature of the interaction, such as whether it was a stop-and-frisk, a consent stop, or another type of encounter.

Agency and Division: Information about which division or unit within the police department conducted the stop.

The dataset aims to provide transparency and allow for analysis of police practices, helping to ensure accountability and evaluate the impact of Terry stops on different communities.

## Column Names and Descriptions
The following column descriptions were provided by data.seattle.gov:

**Subject Age Group**: Subject Age Group (10 year increments) as reported by the officer.

**Subject ID**: Key, generated daily, identifying unique subjects in the dataset using a character to character match of first name and last name. "Null" values indicate an "anonymous" or "unidentified" subject. Subjects of a Terry Stop are not required to present identification.

**GO / SC Num**: General Offense or Street Check number, relating the Terry Stop to the parent report. This field may have a one to many relationship in the data.

**Terry Stop ID**: Key identifying unique Terry Stop reports.

**Stop Resolution**: Resolution of the stop as reported by the officer.

**Weapon Type**: Type of weapon, if any, identified during a search or frisk of the subject. Indicates "None" if no weapons was found.

**Officer ID**: Key identifying unique officers in the dataset.

**Officer YOB**: Year of birth, as reported by the officer.

**Officer Gender**: Gender of the officer, as reported by the officer.

**Officer Race**: Race of the officer, as reported by the officer.

**Subject Perceived Race**: Perceived race of the subject, as reported by the officer.

**Subject Perceived Gender**: Perceived gender of the subject, as reported by the officer.

**Reported Date**: Date the report was filed in the Records Management System (RMS). Not necessarily the date the stop occurred but generally within 1 day.

**Reported Time**: Time the stop was reported in the Records Management System (RMS). Not the time the stop occurred but generally within 10 hours.

**Initial Call Type**: Initial classification of the call as assigned by 911.

**Final Call Type**: Final classification of the call as assigned by the primary officer closing the event.

**Call Type**: How the call was received by the communication center.

**Officer Squad**: Functional squad assignment (not budget) of the officer as reported by the Data Analytics Platform (DAP).

**Arrest Flag**: Indicator of whether a "physical arrest" was made, of the subject, during the Terry Stop. Does not necessarily reflect a report of an arrest in the Records Management System (RMS).

**Frisk Flag**: Indicator of whether a "frisk" was conducted, by the officer, of the subject, during the Terry Stop.

**Precinct**: Precinct of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred.

**Sector**: Sector of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred.

**Beat**: Beat of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred.

In [7]:
# Getting a closer look at the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61009 entries, 0 to 61008
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Subject Age Group         61009 non-null  object
 1   Subject ID                61009 non-null  int64 
 2   GO / SC Num               61009 non-null  int64 
 3   Terry Stop ID             61009 non-null  int64 
 4   Stop Resolution           61009 non-null  object
 5   Weapon Type               28444 non-null  object
 6   Officer ID                61009 non-null  object
 7   Officer YOB               61009 non-null  int64 
 8   Officer Gender            61009 non-null  object
 9   Officer Race              61009 non-null  object
 10  Subject Perceived Race    61009 non-null  object
 11  Subject Perceived Gender  61009 non-null  object
 12  Reported Date             61009 non-null  object
 13  Reported Time             61009 non-null  object
 14  Initial Call Type     

The data types of the columns are as follows:

## Column Classification
### Numerical Columns:
Subject ID,
GO / SC Num,
Terry Stop ID,
Officer YOB,

### Categorical Columns:
Subject Age Group,
Stop Resolution,
Weapon Type,
Officer ID,
Officer Gender,
Officer Race,
Subject Perceived Race,
Subject Perceived Gender,
Reported Date,
Reported Time,
Initial Call Type,
Final Call Type,
Call Type,
Officer Squad,
Arrest Flag,
Frisk Flag,
Precinct,
Sector,
Beat,
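
The classification above can be reproduced from the dtypes themselves. The following is a minimal sketch on a toy frame; the values are hypothetical and only the column names come from the dataset. Note that the identifier columns count as "numerical" only because they are stored as integers, not because arithmetic on them is meaningful.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Subject ID": [-1, 7753260438],
    "Officer YOB": [1953, 1964],
    "Stop Resolution": ["Offense Report", "Field Contact"],
    "Arrest Flag": ["N", "Y"],
})

# Columns stored with a numeric dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical here
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numeric_cols)
print("Categorical:", categorical_cols)
```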

In [8]:
# Checking for missing values in the dataset
df.isnull().sum()

Subject Age Group               0
Subject ID                      0
GO / SC Num                     0
Terry Stop ID                   0
Stop Resolution                 0
Weapon Type                 32565
Officer ID                      0
Officer YOB                     0
Officer Gender                  0
Officer Race                    0
Subject Perceived Race          0
Subject Perceived Gender        0
Reported Date                   0
Reported Time                   0
Initial Call Type               0
Final Call Type                 0
Call Type                       0
Officer Squad                 561
Arrest Flag                     0
Frisk Flag                      0
Precinct                        0
Sector                          0
Beat                            0
dtype: int64

Here we can clearly see that the Weapon Type feature has many missing values (32,565) and that Officer Squad has 561 missing values. We still need to check the dataset in depth for placeholders, 'None' entries, or other unnecessary values and characters.

In [9]:
# Creating a function that shows the value counts of each column in the dataset.
def col_values(df):
    """
    For use in Preprocessing and cleaning to find placeholder values
    Input: Data frame
    Output: Counts of unique values for each column
    """
    for col in df.columns:
        print(df[col].value_counts())
        print('-------------------------------------------------------')
        
col_values(df)

Subject Age Group
26 - 35         20373
36 - 45         13618
18 - 25         11573
46 - 55          7738
56 and Above     3221
1 - 17           2286
-                2200
Name: count, dtype: int64
-------------------------------------------------------
Subject ID
-1              35104
 7753260438        28
 7774286580        22
 7726918259        21
 7731717691        20
                ...  
 15606702593        1
 7735943699         1
 7738872582         1
 16724979306        1
 19137661313        1
Name: count, Length: 17000, dtype: int64
-------------------------------------------------------
GO / SC Num
20160000378750    16
20150000190790    16
20180000134604    14
20210000267148    14
20230000049052    14
                  ..
20150000006142     1
20180000000272     1
20200000339446     1
20220000283906     1
20220000018102     1
Name: count, Length: 48845, dtype: int64
-------------------------------------------------------
Terry Stop ID
32633045284    3
19324329995    3
19268585
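
The value counts show two placeholder tokens: a lone dash (`-`) in text columns and `-1` in Subject ID. As a complement to reading the full listing, a compact tally of those tokens per column might look like the sketch below. It runs on a toy frame with hypothetical values, and the choice of tokens to count is an assumption based on the output above.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Subject Age Group": ["36 - 45", "-", "18 - 25"],
    "Call Type": ["911", "-", "-"],
    "Subject ID": [-1, -1, 7753260438],
})

# Count cells that are exactly "-" in the text columns.
# This is a whole-cell comparison, so "36 - 45" is not counted.
dash_counts = toy.select_dtypes(exclude="number").eq("-").sum()

# Count the -1 placeholder in the identifier column.
minus_one_count = (toy["Subject ID"] == -1).sum()

print(dash_counts)
print("Subject ID == -1:", minus_one_count)
```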

In [10]:
# For ease of use let us rename the columns
df.columns = ['subject_age_group', 'subject_id', 'go_sc_num', 'terry_stop_id',
       'stop_resolution', 'weapon_type', 'officer_id', 'officer_yob',
       'officer_gender', 'officer_race', 'subject_perceived_race',
       'subject_perceived_gender', 'reported_date', 'reported_time',
       'initial_call_type', 'final_call_type', 'call_type', 'officer_squad',
       'arrest_flag', 'frisk_flag', 'precinct', 'sector', 'beat']

df.columns

Index(['subject_age_group', 'subject_id', 'go_sc_num', 'terry_stop_id',
       'stop_resolution', 'weapon_type', 'officer_id', 'officer_yob',
       'officer_gender', 'officer_race', 'subject_perceived_race',
       'subject_perceived_gender', 'reported_date', 'reported_time',
       'initial_call_type', 'final_call_type', 'call_type', 'officer_squad',
       'arrest_flag', 'frisk_flag', 'precinct', 'sector', 'beat'],
      dtype='object')

In [11]:
# Checking the dataframe
df.head()

Unnamed: 0,subject_age_group,subject_id,go_sc_num,terry_stop_id,stop_resolution,weapon_type,officer_id,officer_yob,officer_gender,officer_race,...,reported_time,initial_call_type,final_call_type,call_type,officer_squad,arrest_flag,frisk_flag,precinct,sector,beat
0,36 - 45,-1,20160000398323,208373,Offense Report,,4852,1953,M,Asian,...,15:18:00.0000000,TRESPASS,--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,911,NORTH PCT 2ND W - LINCOLN - PLATOON 1,N,N,North,L,L3
1,18 - 25,-1,20180000227180,559146,Citation / Infraction,,5472,1964,M,Asian,...,00:07:00.0000000,"SUSPICIOUS PERSON, VEHICLE, OR INCIDENT",--TRAFFIC - BICYCLE VIOLATION,911,SOUTHWEST PCT - 3RD WATCH - F/W RELIEF,N,N,Southwest,F,F3
2,18 - 25,-1,20180000410091,498246,Offense Report,,6081,1962,M,White,...,02:56:00.0000000,"NARCOTICS - VIOLATIONS (LOITER, USE, SELL, NARS)",--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,911,NORTH PCT 3RD W - BOY (JOHN) - PLATOON 1,N,N,North,B,B1
3,-,-1,20160000001637,146742,Field Contact,,6924,1974,M,White,...,01:44:00.0000000,-,-,-,EAST PCT OPS - NIGHT ACT,N,N,East,C,C1
4,46 - 55,-1,20150000006037,104477,Field Contact,,6732,1975,M,White,...,02:59:00.0000000,-,-,-,SOUTH PCT 3RD W - SAM - PLATOON 2,N,N,North,B,B2


## Data Cleaning
Let us clean the data before we proceed to processing. We start by replacing the dashes and placeholders with more workable values, then move on to the more complex cleaning steps.

In [12]:
# Replacing the dashes with Unknown
df = df.replace('-', 'Unknown')
df.head()

Unnamed: 0,subject_age_group,subject_id,go_sc_num,terry_stop_id,stop_resolution,weapon_type,officer_id,officer_yob,officer_gender,officer_race,...,reported_time,initial_call_type,final_call_type,call_type,officer_squad,arrest_flag,frisk_flag,precinct,sector,beat
0,36 - 45,-1,20160000398323,208373,Offense Report,,4852,1953,M,Asian,...,15:18:00.0000000,TRESPASS,--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,911,NORTH PCT 2ND W - LINCOLN - PLATOON 1,N,N,North,L,L3
1,18 - 25,-1,20180000227180,559146,Citation / Infraction,,5472,1964,M,Asian,...,00:07:00.0000000,"SUSPICIOUS PERSON, VEHICLE, OR INCIDENT",--TRAFFIC - BICYCLE VIOLATION,911,SOUTHWEST PCT - 3RD WATCH - F/W RELIEF,N,N,Southwest,F,F3
2,18 - 25,-1,20180000410091,498246,Offense Report,,6081,1962,M,White,...,02:56:00.0000000,"NARCOTICS - VIOLATIONS (LOITER, USE, SELL, NARS)",--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,911,NORTH PCT 3RD W - BOY (JOHN) - PLATOON 1,N,N,North,B,B1
3,Unknown,-1,20160000001637,146742,Field Contact,,6924,1974,M,White,...,01:44:00.0000000,Unknown,Unknown,Unknown,EAST PCT OPS - NIGHT ACT,N,N,East,C,C1
4,46 - 55,-1,20150000006037,104477,Field Contact,,6732,1975,M,White,...,02:59:00.0000000,Unknown,Unknown,Unknown,SOUTH PCT 3RD W - SAM - PLATOON 2,N,N,North,B,B2
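
It is worth noting why age groups such as `36 - 45` survive this step: with a plain string argument, `DataFrame.replace` matches whole cell values rather than substrings. A minimal sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical rows: one with real values that contain dashes, one with the lone-dash placeholder.
toy = pd.DataFrame({
    "subject_age_group": ["36 - 45", "-"],
    "final_call_type": ["--TRAFFIC - BICYCLE VIOLATION", "-"],
})

# Only cells equal to "-" are replaced; dashes inside longer strings are left alone.
cleaned = toy.replace("-", "Unknown")
print(cleaned)
```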


The officer_gender column has 30 'N' values. We cannot be sure whether 'N' stands for 'Not Available', 'Not Disclosed', or even 'Non-Binary'. Since this is such a small share of the data, we will simply drop those rows.

In [13]:
# Dropping the entries with 'N' values from the officer_gender column.
df.drop(df[df['officer_gender'] == 'N'].index, inplace=True)
df.officer_gender.value_counts()


officer_gender
M    54072
F     6907
Name: count, dtype: int64

The officer_squad column also has some NaN values. Since this information is less relevant to this particular task, it is better to just drop the column.

In [14]:
# Dropping the officer_squad column in place.
df.drop('officer_squad', axis=1, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 60979 entries, 0 to 61008
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   subject_age_group         60979 non-null  object
 1   subject_id                60979 non-null  int64 
 2   go_sc_num                 60979 non-null  int64 
 3   terry_stop_id             60979 non-null  int64 
 4   stop_resolution           60979 non-null  object
 5   weapon_type               28419 non-null  object
 6   officer_id                60979 non-null  object
 7   officer_yob               60979 non-null  int64 
 8   officer_gender            60979 non-null  object
 9   officer_race              60979 non-null  object
 10  subject_perceived_race    60979 non-null  object
 11  subject_perceived_gender  60979 non-null  object
 12  reported_date             60979 non-null  object
 13  reported_time             60979 non-null  object
 14  initial_call_type         6

As we saw above, some subject IDs are repeated multiple times. These could be either duplicate entries or repeat offenders, so it is crucial to investigate this feature.

In [15]:
# Checking the subject_id column
df['subject_id'].value_counts()

subject_id
-1              35095
 7753260438        28
 7774286580        22
 7726918259        21
 7731717691        20
                ...  
 15606702593        1
 7735943699         1
 7738872582         1
 16724979306        1
 19137661313        1
Name: count, Length: 16988, dtype: int64

In [16]:
# Let us replace those -1 values in 'Subject_ID' with 'unassigned'
df['subject_id'] = df['subject_id'].replace(-1, 'unassigned')
df.subject_id.value_counts()

subject_id
unassigned     35095
7753260438        28
7774286580        22
7726918259        21
7731717691        20
               ...  
15606702593        1
7735943699         1
7738872582         1
16724979306        1
19137661313        1
Name: count, Length: 16988, dtype: int64
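
Replacing `-1` with a string turns `subject_id` into a mixed-type column, which works for counting but not for numeric operations. If a numeric column is preferred, one alternative (not the approach used in this notebook) is pandas' nullable integer dtype. A sketch on hypothetical IDs:

```python
import pandas as pd

# Hypothetical IDs, with -1 standing in for an unidentified subject.
subject_id = pd.Series([-1, 7753260438, -1, 7774286580], name="subject_id")

# Convert to the nullable integer dtype and mark the -1 placeholder as missing.
subject_id_na = subject_id.astype("Int64").mask(subject_id == -1, pd.NA)

print(subject_id_na)
print("Unidentified subjects:", subject_id_na.isna().sum())
```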

It looks like several subject IDs appear more than once. If these are duplicates, they could bias our dataset, so we need to check closely. We can do this by looking at the combination of three columns: 'subject_id', 'terry_stop_id' and 'officer_id'.

In [17]:
# Count how many rows share the same 'subject_id', 'terry_stop_id' and 'officer_id'
df['count'] = df.groupby(['subject_id', 'terry_stop_id', 'officer_id'])['subject_id'].transform('count')

# Flag rows whose key combination appears more than once
df['repeat_offenders'] = df['count'].apply(lambda x: 'Yes' if x > 1 else 'No')

# Drop the helper 'count' column as it is no longer needed
df.drop(columns=['count'], inplace=True)

df['repeat_offenders'].value_counts()

repeat_offenders
No     60784
Yes      195
Name: count, dtype: int64

This tells us that 195 rows share the same subject, stop and officer IDs with at least one other row, but we still need to dig deeper before we decide conclusively whether they are duplicates.
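
The same flag can also be computed with `DataFrame.duplicated`, which avoids the helper column. This is a sketch on a toy frame with hypothetical values, not a replacement for the cell above:

```python
import pandas as pd

# Toy frame with hypothetical values: the first two rows share all three keys.
toy = pd.DataFrame({
    "subject_id": [1, 1, 2, 3],
    "terry_stop_id": [10, 10, 20, 30],
    "officer_id": ["A", "A", "B", "C"],
    "weapon_type": ["Knife", "Blunt Object", None, None],
})

key = ["subject_id", "terry_stop_id", "officer_id"]

# keep=False flags every row of a repeated key, not only the later occurrences.
is_dup = toy.duplicated(subset=key, keep=False)

print(is_dup.value_counts())
```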

**Terry Stop ID** also has some duplicate values worth checking.

In [18]:
# Checking terry stop id value counts
df['terry_stop_id'].value_counts()

terry_stop_id
19324329995    3
19268585233    3
27511831225    3
36014210659    3
32633045284    3
              ..
87443          1
108886         1
274766         1
12093615563    1
31342435997    1
Name: count, Length: 60877, dtype: int64

In [19]:
# Listing the duplicates
dup_ids = df[df['terry_stop_id'].duplicated(keep=False)].sort_values(by = 'terry_stop_id')
# dup_ids = dup_ids[['subject_age_group', 'subject_id', 'go_sc_num', 
#                    'terry_stop_id', 'stop_resolution', 'weapon_type',
#                    'officer_id', 'reported_date', 'reported_time',
#                    'initial_call_type', 'final_call_type', 'arrest_flag',
#                    'frisk_flag', ]]
dup_ids

Unnamed: 0,subject_age_group,subject_id,go_sc_num,terry_stop_id,stop_resolution,weapon_type,officer_id,officer_yob,officer_gender,officer_race,...,reported_time,initial_call_type,final_call_type,call_type,arrest_flag,frisk_flag,precinct,sector,beat,repeat_offenders
52,26 - 35,7810387129,20190000254490,8611673538,Field Contact,Knife/Cutting/Stabbing Instrument,7712,1987,M,White,...,01:09:47.0000000,"SUSPICIOUS PERSON, VEHICLE, OR INCIDENT",--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,911,N,Y,Unknown,Unknown,Unknown,Yes
43012,26 - 35,7810387129,20190000254490,8611673538,Field Contact,Blunt Object/Striking Implement,7712,1987,M,White,...,01:09:47.0000000,"SUSPICIOUS PERSON, VEHICLE, OR INCIDENT",--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,911,N,Y,Unknown,Unknown,Unknown,Yes
14583,26 - 35,7730805128,20190000268604,8677596250,Offense Report,Taser/Stun Gun,5630,1964,M,Black or African American,...,10:27:54.0000000,THEFT (DOES NOT INCLUDE SHOPLIFT OR SVCS),--THEFT - CAR PROWL,911,N,Y,Southwest,F,F2,Yes
8805,26 - 35,7730805128,20190000268604,8677596250,Offense Report,Knife/Cutting/Stabbing Instrument,5630,1964,M,Black or African American,...,10:27:54.0000000,THEFT (DOES NOT INCLUDE SHOPLIFT OR SVCS),--THEFT - CAR PROWL,911,N,Y,Southwest,F,F2,Yes
19773,18 - 25,9458419522,20190000285750,9585545373,Field Contact,Handgun,8382,1993,M,White,...,22:50:59.0000000,ASLT - PERSON SHOT OR SHOT AT,--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,ONVIEW,N,Y,East,E,E3,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20234,26 - 35,53848066671,20240000133855,56110860878,Arrest,Knife/Cutting/Stabbing Instrument,8846,1997,M,Black or African American,...,11:14:27.0000000,DISTURBANCE,"--ASSAULTS, OTHER",911,Y,Y,North,B,B1,Yes
46922,36 - 45,57754429915,20240000198136,57754515446,Arrest,Knife/Cutting/Stabbing Instrument,8584,1982,M,White,...,11:15:20.0000000,THEFT (DOES NOT INCLUDE SHOPLIFT OR SVCS),--BURGLARY - NON RESIDENTIAL/COMMERCIAL,"TELEPHONE OTHER, NOT 911",Y,Y,South,O,O1,Yes
42736,36 - 45,57754429915,20240000198136,57754515446,Arrest,Blunt Object/Striking Implement,8584,1982,M,White,...,11:15:20.0000000,THEFT (DOES NOT INCLUDE SHOPLIFT OR SVCS),--BURGLARY - NON RESIDENTIAL/COMMERCIAL,"TELEPHONE OTHER, NOT 911",Y,Y,South,O,O1,Yes
9130,18 - 25,7741755512,20240000210107,57961741719,Arrest,Taser/Stun Gun,8897,1987,M,White,...,05:21:37.0000000,THEFT (DOES NOT INCLUDE SHOPLIFT OR SVCS),--BURGLARY - NON RESIDENTIAL/COMMERCIAL,911,Y,Y,East,G,G1,Yes


This may look like a confirmation of duplicate entries at first glance. However, if we look carefully, we can see that these rows have different weapon types even though they have identical subject, stop and officer IDs. In other words, they are entries made by the same officer, at the same time, for the same subject, who was found with more than one type of weapon. It appears that officers record a separate entry for each weapon type found on a subject.

We can verify this by running the same code we used above to check for duplicates, only this time adding the 'weapon_type' feature to the grouping as well.

In [20]:
# Count how many rows share the same 'subject_id', 'terry_stop_id', 'officer_id' and 'weapon_type'
# Note: groupby excludes rows with a missing 'weapon_type' by default, so those rows are never counted as repeats
df['count'] = df.groupby(['subject_id', 'terry_stop_id', 'officer_id', 'weapon_type'])['subject_id'].transform('count')

# Flag rows whose key combination appears more than once
df['repeat_offenders'] = df['count'].apply(lambda x: 'Yes' if x > 1 else 'No')

# Drop the helper 'count' column as it is no longer needed
df.drop(columns=['count'], inplace=True)

df['repeat_offenders'].value_counts()

repeat_offenders
No    60979
Name: count, dtype: int64
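
A more direct way to test the claim that repeated stops differ only in weapon type is to count distinct values per column within each repeated `terry_stop_id`. The sketch below uses a toy frame with hypothetical values; on the real data one would run it over all columns.

```python
import pandas as pd

# Toy frame with hypothetical values: stops 10 and 20 are repeated, stop 30 is not.
toy = pd.DataFrame({
    "terry_stop_id": [10, 10, 20, 20, 30],
    "officer_id": ["A", "A", "B", "B", "C"],
    "stop_resolution": ["Field Contact", "Field Contact", "Arrest", "Arrest", "Arrest"],
    "weapon_type": ["Knife", "Blunt Object", "Taser", "Knife", None],
})

# Keep only the rows whose stop ID appears more than once.
dups = toy[toy["terry_stop_id"].duplicated(keep=False)]

# Number of distinct values per column within each repeated stop.
varying = dups.groupby("terry_stop_id").nunique()

# Columns that take more than one value inside at least one repeated stop.
cols_that_vary = [col for col in varying.columns if (varying[col] > 1).any()]
print(cols_that_vary)
```

If only `weapon_type` shows up in the result, the repeated rows are the same stop recorded once per weapon.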

In [21]:
df.head()

Unnamed: 0,subject_age_group,subject_id,go_sc_num,terry_stop_id,stop_resolution,weapon_type,officer_id,officer_yob,officer_gender,officer_race,...,reported_time,initial_call_type,final_call_type,call_type,arrest_flag,frisk_flag,precinct,sector,beat,repeat_offenders
0,36 - 45,unassigned,20160000398323,208373,Offense Report,,4852,1953,M,Asian,...,15:18:00.0000000,TRESPASS,--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,911,N,N,North,L,L3,No
1,18 - 25,unassigned,20180000227180,559146,Citation / Infraction,,5472,1964,M,Asian,...,00:07:00.0000000,"SUSPICIOUS PERSON, VEHICLE, OR INCIDENT",--TRAFFIC - BICYCLE VIOLATION,911,N,N,Southwest,F,F3,No
2,18 - 25,unassigned,20180000410091,498246,Offense Report,,6081,1962,M,White,...,02:56:00.0000000,"NARCOTICS - VIOLATIONS (LOITER, USE, SELL, NARS)",--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,911,N,N,North,B,B1,No
3,Unknown,unassigned,20160000001637,146742,Field Contact,,6924,1974,M,White,...,01:44:00.0000000,Unknown,Unknown,Unknown,N,N,East,C,C1,No
4,46 - 55,unassigned,20150000006037,104477,Field Contact,,6732,1975,M,White,...,02:59:00.0000000,Unknown,Unknown,Unknown,N,N,North,B,B2,No


This confirms our observation. Because keeping multiple entries for the same stop would inflate our dataset, and only 195 rows are affected, it is better to drop the duplicates and keep only the first entry for each stop.

In [22]:
# Dropping the duplicates and keeping the first instance
df.drop_duplicates('terry_stop_id', keep='first', inplace=True)
df.sort_values(by='terry_stop_id')

Unnamed: 0,subject_age_group,subject_id,go_sc_num,terry_stop_id,stop_resolution,weapon_type,officer_id,officer_yob,officer_gender,officer_race,...,reported_time,initial_call_type,final_call_type,call_type,arrest_flag,frisk_flag,precinct,sector,beat,repeat_offenders
5536,1 - 17,unassigned,20150000084533,28020,Referred for Prosecution,Lethal Cutting Instrument,4585,1955,M,Hispanic or Latino,...,16:10:00.0000000,Unknown,Unknown,Unknown,N,Y,East,G,G2,No
34249,36 - 45,unassigned,20150000001428,28092,Field Contact,,7634,1977,M,White,...,05:49:00.0000000,Unknown,Unknown,Unknown,N,N,Unknown,Unknown,Unknown,No
22176,18 - 25,unassigned,20150000001428,28093,Field Contact,,7634,1977,M,White,...,05:55:00.0000000,Unknown,Unknown,Unknown,N,N,Unknown,Unknown,Unknown,No
46800,26 - 35,unassigned,20150000001437,28381,Field Contact,,7634,1977,M,White,...,10:38:00.0000000,Unknown,Unknown,Unknown,N,N,Unknown,Unknown,Unknown,No
35165,36 - 45,unassigned,20150000087329,28462,Offense Report,,7634,1977,M,White,...,11:46:00.0000000,SUICIDE - CRITICAL,--CRISIS COMPLAINT - GENERAL,"TELEPHONE OTHER, NOT 911",N,Y,East,E,E3,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8952,18 - 25,33970989734,20240000173704,58505996345,Field Contact,Handgun,6848,1978,M,White,...,09:44:14.0000000,Unknown,Unknown,Unknown,N,Y,West,K,K3,No
43019,1 - 17,7734651568,20240000173704,58506837543,Field Contact,Unknown,6848,1978,M,White,...,11:33:58.0000000,Unknown,Unknown,Unknown,N,Y,West,K,K3,No
31174,1 - 17,53252554865,20240000173704,58506902793,Field Contact,Unknown,6848,1978,M,White,...,11:43:37.0000000,Unknown,Unknown,Unknown,N,Y,West,K,K3,No
32618,26 - 35,7729016487,20240000238040,58508727659,Field Contact,Unknown,8564,1987,M,Hispanic or Latino,...,15:55:53.0000000,ROBBERY - CRITICAL (INCLUDES STRONG ARM),--ROBBERY - STRONG ARM,911,N,Y,North,B,B1,No
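
Keeping only the first row per stop discards the additional weapon types. If that information matters later, one option is to record how many distinct weapon types each stop had before deduplicating. This is a sketch on hypothetical rows, and `weapon_count` is a hypothetical column name that is not used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical rows: stop 10 was recorded twice, once per weapon type.
toy = pd.DataFrame({
    "terry_stop_id": [10, 10, 20],
    "stop_resolution": ["Field Contact", "Field Contact", "Arrest"],
    "weapon_type": ["Knife", "Blunt Object", "Handgun"],
})

# Hypothetical helper column: number of distinct weapon types per stop.
toy["weapon_count"] = toy.groupby("terry_stop_id")["weapon_type"].transform("nunique")

# One row per stop, as in the cell above, but the count survives.
deduped = toy.drop_duplicates("terry_stop_id", keep="first")
print(deduped)
```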


Next, let us check the repeated values in the General Offense / Street Check number column (go_sc_num).

In [23]:
# Investigating the repeated values in the go_sc_num column
stops = df[df['go_sc_num'] > 1]
stops['go_sc_num'].value_counts()

go_sc_num
20160000378750    16
20150000190790    16
20230000049052    14
20180000134604    14
20210000267148    14
                  ..
20170000437667     1
20210000238907     1
20220000218677     1
20220000320502     1
20220000018102     1
Name: count, Length: 48826, dtype: int64

In [24]:
# Looking closely
stops = stops[stops['go_sc_num'] == 20160000378750]
stops

Unnamed: 0,subject_age_group,subject_id,go_sc_num,terry_stop_id,stop_resolution,weapon_type,officer_id,officer_yob,officer_gender,officer_race,...,reported_time,initial_call_type,final_call_type,call_type,arrest_flag,frisk_flag,precinct,sector,beat,repeat_offenders
5940,46 - 55,unassigned,20160000378750,208306,Offense Report,,7492,1983,M,White,...,22:45:00.0000000,Unknown,Unknown,Unknown,N,Y,North,N,N3,No
11366,36 - 45,unassigned,20160000378750,208312,Offense Report,,7492,1983,M,White,...,23:08:00.0000000,Unknown,Unknown,Unknown,N,Y,North,N,N3,No
13031,26 - 35,unassigned,20160000378750,208300,Offense Report,,7492,1983,M,White,...,22:22:00.0000000,Unknown,Unknown,Unknown,N,Y,North,N,N3,No
33240,36 - 45,unassigned,20160000378750,208314,Arrest,,7492,1983,M,White,...,23:15:00.0000000,Unknown,Unknown,Unknown,N,Y,North,N,N3,No
33810,46 - 55,unassigned,20160000378750,208309,Arrest,,7492,1983,M,White,...,22:54:00.0000000,Unknown,Unknown,Unknown,N,Y,North,N,N3,No
36586,46 - 55,unassigned,20160000378750,208304,Offense Report,,7492,1983,M,White,...,22:39:00.0000000,Unknown,Unknown,Unknown,N,Y,North,N,N3,No
36631,36 - 45,unassigned,20160000378750,208299,Offense Report,,7492,1983,M,White,...,22:18:00.0000000,Unknown,Unknown,Unknown,N,Y,North,N,N3,No
39754,26 - 35,unassigned,20160000378750,208307,Offense Report,,7492,1983,M,White,...,22:48:00.0000000,Unknown,Unknown,Unknown,N,Y,North,N,N3,No
41484,26 - 35,unassigned,20160000378750,208301,Offense Report,,7492,1983,M,White,...,22:24:00.0000000,Unknown,Unknown,Unknown,N,Y,North,N,N3,No
43337,18 - 25,unassigned,20160000378750,208311,Arrest,,7492,1983,M,White,...,23:04:00.0000000,Unknown,Unknown,Unknown,N,Y,North,N,N3,No


Taking into account the shared date, the separate Terry Stop IDs, the different stop resolutions, and the fact that all entries fall within roughly the same hour, it appears that this was a **dispute** of some sort in which an officer **collected Offense Reports from 12 people** and **issued tickets to 4 people** (because the 'arrest_flag' column records **no physical arrest**, these were **non-custodial** arrests/citations).

Looking back at the column descriptions, the GO / SC Num is the **"parent report"** to which the **associated Terry Stops** relate, and one number may map to many stops. This is consistent with our observations.
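
The one-to-many relationship between a parent report and its Terry Stops can be summarised for all reports at once rather than by inspecting a single number. A minimal sketch on a toy frame with hypothetical values:

```python
import pandas as pd

# Toy frame with hypothetical values: report 100 has three stops, report 200 has one.
toy = pd.DataFrame({
    "go_sc_num": [100, 100, 100, 200],
    "terry_stop_id": [1, 2, 3, 4],
    "stop_resolution": ["Offense Report", "Offense Report", "Arrest", "Field Contact"],
})

# Number of distinct Terry Stops attached to each parent report.
stops_per_report = toy.groupby("go_sc_num")["terry_stop_id"].nunique()

# Breakdown of stop resolutions within each parent report.
resolution_table = pd.crosstab(toy["go_sc_num"], toy["stop_resolution"])

print(stops_per_report)
print(resolution_table)
```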

## Reported Date
Now let us parse the reported date (which carries a timestamp), create two new columns, 'incident_year' and 'incident_month', holding the year and month of each incident, and then drop the reported date column.

In [25]:
# Checking the column reported date
df.reported_date.dtype, df.reported_date.head()

(dtype('O'),
 0    2016-11-03T00:00:00Z
 1    2018-06-22T00:00:00Z
 2    2018-11-02T00:00:00Z
 3    2016-04-17T00:00:00Z
 4    2015-11-29T00:00:00Z
 Name: reported_date, dtype: object)

In [26]:
# Converting to date time format
df['reported_date'] = pd.to_datetime(df['reported_date'])

# Creating a new column with the year of the incident
df['incident_year'] = df['reported_date'].dt.year

# Creating a new column with the month of the incident
df['incident_month'] = df['reported_date'].dt.month

# Dropping the reported date column
df.drop('reported_date', axis=1, inplace=True)
df.head()

Unnamed: 0,subject_age_group,subject_id,go_sc_num,terry_stop_id,stop_resolution,weapon_type,officer_id,officer_yob,officer_gender,officer_race,...,final_call_type,call_type,arrest_flag,frisk_flag,precinct,sector,beat,repeat_offenders,incident_year,incident_month
0,36 - 45,unassigned,20160000398323,208373,Offense Report,,4852,1953,M,Asian,...,--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,911,N,N,North,L,L3,No,2016,11
1,18 - 25,unassigned,20180000227180,559146,Citation / Infraction,,5472,1964,M,Asian,...,--TRAFFIC - BICYCLE VIOLATION,911,N,N,Southwest,F,F3,No,2018,6
2,18 - 25,unassigned,20180000410091,498246,Offense Report,,6081,1962,M,White,...,--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,911,N,N,North,B,B1,No,2018,11
3,Unknown,unassigned,20160000001637,146742,Field Contact,,6924,1974,M,White,...,Unknown,Unknown,N,N,East,C,C1,No,2016,4
4,46 - 55,unassigned,20150000006037,104477,Field Contact,,6732,1975,M,White,...,Unknown,Unknown,N,N,North,B,B2,No,2015,11


## Officer Age
Let us create a new column 'officer_age' that holds the age value of the officer at the time of the incident. We can do this by subtracting the officer year of birth from the incident year.

In [27]:
# Creating a column that holds the officer's age at the time of the incident.
df['officer_age'] = df['incident_year'] - df['officer_yob']
df.officer_age.unique()

array([ 63,  54,  56,  42,  40,  48,  32,  39,  31,  38,  24,  28,  37,
        34,  27,  23,  43,  33,  26,  35,  30, 121,  55,  25,  29,  52,
        51,  57,  22,  49,  47,  36,  44,  45,  46,  50,  58,  41,  60,
        61,  53,  62,  64,  65,  59,  69,  71, 120,  21,  67, 118,  70,
        66,  68, 119])

Wow we have some entries for officer age that are unrealistic. Let us fix that.
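
Before dropping anything, it can help to look at the birth years behind these ages. Ages of 118 to 121 would be consistent with a placeholder birth year rather than real data, but that is an assumption. The toy frame below (values invented for illustration) only shows how one might check.

```python
import pandas as pd

# Toy data: invented values; 1900 plays the role of a suspected placeholder year
toy = pd.DataFrame({
    'incident_year': [2016, 2018, 2021, 2019],
    'officer_yob':   [1953, 1900, 1900, 1985],
})
toy['officer_age'] = toy['incident_year'] - toy['officer_yob']

# Birth years responsible for implausible ages (100 or older)
suspect_yob = toy.loc[toy['officer_age'] >= 100, 'officer_yob'].value_counts()
print(suspect_yob)
```

If a single birth year accounts for all implausible ages in the real data, treating it as a missing value would be an alternative to dropping the rows.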

In [28]:
# Dropping unrealistic ages (100 or older) from the officer age column
df.drop(df[df['officer_age'] >= 100].index, inplace=True)

# Confirming our change
df['officer_age'].describe()

count    60807.000000
mean        34.488102
std          8.267055
min         21.000000
25%         28.000000
50%         33.000000
75%         39.000000
max         71.000000
Name: officer_age, dtype: float64

So now our officer age looks more realistic ranging from 21 years to 71 years old. Let us now drop the officer year of birth column from the dataframe.

In [29]:
# Dropping the officer_yob column
df.drop('officer_yob', axis=1, inplace=True)
df.columns

Index(['subject_age_group', 'subject_id', 'go_sc_num', 'terry_stop_id',
       'stop_resolution', 'weapon_type', 'officer_id', 'officer_gender',
       'officer_race', 'subject_perceived_race', 'subject_perceived_gender',
       'reported_time', 'initial_call_type', 'final_call_type', 'call_type',
       'arrest_flag', 'frisk_flag', 'precinct', 'sector', 'beat',
       'repeat_offenders', 'incident_year', 'incident_month', 'officer_age'],
      dtype='object')

Let us now proceed to the stop resolution. An arrest that is not flagged as one in the 'arrest_flag' column is considered a "non-custodial arrest", that is, an instance where a citation was issued.

In [30]:
# Checking the stop resolution
df['stop_resolution'].value_counts()

stop_resolution
Field Contact               29439
Offense Report              15701
Arrest                      14722
Referred for Prosecution      728
Citation / Infraction         217
Name: count, dtype: int64
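
To check how the 'Arrest' resolution relates to the arrest flag, a cross-tabulation is a convenient tool. This is a sketch on a toy frame (rows invented for illustration) that reuses our column names.

```python
import pandas as pd

# Toy data: invented rows, not real records
toy = pd.DataFrame({
    'stop_resolution': ['Arrest', 'Arrest', 'Field Contact', 'Offense Report', 'Arrest'],
    'arrest_flag':     ['Y', 'N', 'N', 'N', 'N'],
})

# Rows: stop resolution, columns: arrest flag
resolution_vs_flag = pd.crosstab(toy['stop_resolution'], toy['arrest_flag'])
print(resolution_vs_flag)
```

On the real data, the 'Arrest' row would show how many arrests carry the flag and how many do not.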

Even though this column tells us what happened after the stop, the `Field Contact` and `Offense Report` values do give us insight into why an officer may have initiated it. So let us create a flag column for each of these values. The stop resolution column itself is kept for now.

In [31]:
# Creating field_contact column that contains 'Y' and 'N' values
df['field_contact'] = df['stop_resolution'].str.contains('Field Contact')
df['field_contact'] = df['field_contact'].map({True: 'Y', False: 'N'})

# Creating offense_report column that contains 'Y' and 'N' values
df['offense_report'] = df['stop_resolution'].str.contains('Offense Report')
df['offense_report'] = df['offense_report'].map({True: 'Y', False: 'N'})

df.head()

Unnamed: 0,subject_age_group,subject_id,go_sc_num,terry_stop_id,stop_resolution,weapon_type,officer_id,officer_gender,officer_race,subject_perceived_race,...,frisk_flag,precinct,sector,beat,repeat_offenders,incident_year,incident_month,officer_age,field_contact,offense_report
0,36 - 45,unassigned,20160000398323,208373,Offense Report,,4852,M,Asian,White,...,N,North,L,L3,No,2016,11,63,N,Y
1,18 - 25,unassigned,20180000227180,559146,Citation / Infraction,,5472,M,Asian,Hispanic,...,N,Southwest,F,F3,No,2018,6,54,N,N
2,18 - 25,unassigned,20180000410091,498246,Offense Report,,6081,M,White,White,...,N,North,B,B1,No,2018,11,56,N,Y
3,Unknown,unassigned,20160000001637,146742,Field Contact,,6924,M,White,Unknown,...,N,East,C,C1,No,2016,4,42,Y,N
4,46 - 55,unassigned,20150000006037,104477,Field Contact,,6732,M,White,White,...,N,North,B,B2,No,2015,11,40,Y,N


In [32]:
# Checking our new columns
df.offense_report.value_counts(), df.field_contact.value_counts()

(offense_report
 N    45106
 Y    15701
 Name: count, dtype: int64,
 field_contact
 N    31368
 Y    29439
 Name: count, dtype: int64)

The weapon type column contains a lot of redundant values. Let us clean it up and organize it well.

In [33]:
# Checking the weapon type column
df['weapon_type'].value_counts()

weapon_type
Unknown                                 24493
Lethal Cutting Instrument                1482
Knife/Cutting/Stabbing Instrument        1289
Handgun                                   384
Blunt Object/Striking Implement           150
Firearm                                   102
Firearm Other                             100
Other Firearm                              73
Club, Blackjack, Brass Knuckles            49
Mace/Pepper Spray                          48
None/Not Applicable                        18
Firearm (unk type)                         15
Taser/Stun Gun                             14
Fire/Incendiary Device                     12
Rifle                                      10
Club                                        9
Shotgun                                     5
Automatic Handgun                           2
Personal Weapons (hands, feet, etc.)        2
Poison                                      1
Blackjack                                   1
Brass Knuckles        

In [34]:
# Weapon type categories
none = ['None/Not Applicable']

knife = ['Lethal Cutting Instrument', 'Knife/Cutting/Stabbing Instrument']

blunt_obj = ['Club, Blackjack, Brass Knuckles', 'Club', 'Blackjack', 'Brass Knuckles']
firearm = ['Firearm Other', 'Firearm (unk type)', 'Other Firearm', 'Rifle', 
          'Shotgun', 'Automatic Handgun', 'Handgun']
other = ['Taser/Stun Gun', 'Mace/Pepper Spray', 'Fire/Incendiary Device', 'Poison', 'Personal Weapons (hands, feet, etc.)']

# Creating a function called replace_val that takes the source data, column name,
# a list of old values and a single new value
def replace_val(df, col, old_val, new_val):
    # Vectorized, in-place replacement of exact matches; .loc avoids the
    # chained assignment (df[col].iloc[i] = ...) that pandas warns about
    df.loc[df[col].isin(old_val), col] = new_val


# Applying the function to replace weapon type values
# replacing none
replace_val(df, 'weapon_type', none, 'None')

# replacing knife
replace_val(df, 'weapon_type', knife, 'Knife/Stabbing Instrument')

# replacing blunt object
replace_val(df, 'weapon_type', blunt_obj, 'Blunt Object/Striking Implement')

# replacing firearm
replace_val(df, 'weapon_type', firearm, 'Firearm')

# other
replace_val(df, 'weapon_type', other, 'Other')

df['weapon_type'].value_counts()

weapon_type
Unknown                            24493
Knife/Stabbing Instrument           2771
Firearm                              691
Blunt Object/Striking Implement      210
Other                                 77
None                                  18
Name: count, dtype: int64
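
The same consolidation can also be written as a single mapping from raw label to category. The sketch below is an alternative, not the method used in this notebook. It covers only a few of the labels and runs on a toy Series.

```python
import pandas as pd

# Map each raw label to its consolidated category (subset of labels for brevity)
weapon_map = {
    'Lethal Cutting Instrument': 'Knife/Stabbing Instrument',
    'Knife/Cutting/Stabbing Instrument': 'Knife/Stabbing Instrument',
    'Handgun': 'Firearm',
    'Rifle': 'Firearm',
    'Club': 'Blunt Object/Striking Implement',
    'Taser/Stun Gun': 'Other',
}

# Toy data, including a label outside the mapping and a missing value
toy = pd.Series(['Handgun', 'Rifle', 'Club', 'Unknown', None, 'Lethal Cutting Instrument'])

# Series.replace leaves labels that are not in the mapping (and missing values) untouched
cleaned = toy.replace(weapon_map)
print(cleaned.value_counts(dropna=False))
```

Keeping the categories in one dictionary makes it easy to review which raw labels end up where.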

Let us tidy up the reported time column as well.

In [35]:
# Converting the time column to datetime format and keeping only the hour assigning it to a new reported_hour column
df['reported_time'] = pd.to_datetime(df['reported_time'])
df['reported_hour'] = df['reported_time'].dt.hour
df.drop('reported_time', axis=1, inplace=True)
df.reported_hour.head()

0    15
1     0
2     2
3     1
4     2
Name: reported_hour, dtype: int32

Great, the reported time has now been reduced to the hour of the day in 24-hour format.
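
When the time strings follow one known layout, passing an explicit format to `pd.to_datetime` avoids format guessing. The sketch below assumes 'HH:MM:SS' strings, which may not match the layout of the real 'reported_time' column.

```python
import pandas as pd

# Toy data: assumes times are stored as 'HH:MM:SS' strings
times = pd.Series(['15:27:00', '00:05:00', '02:41:00'])

# Parse with an explicit format, then keep only the hour (0-23)
hours = pd.to_datetime(times, format='%H:%M:%S').dt.hour
print(hours.tolist())
```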

In [36]:
# Checking the dataframe so far
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 60807 entries, 0 to 61007
Data columns (total 26 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   subject_age_group         60807 non-null  object
 1   subject_id                60807 non-null  object
 2   go_sc_num                 60807 non-null  int64 
 3   terry_stop_id             60807 non-null  int64 
 4   stop_resolution           60807 non-null  object
 5   weapon_type               28260 non-null  object
 6   officer_id                60807 non-null  object
 7   officer_gender            60807 non-null  object
 8   officer_race              60807 non-null  object
 9   subject_perceived_race    60807 non-null  object
 10  subject_perceived_gender  60807 non-null  object
 11  initial_call_type         60807 non-null  object
 12  final_call_type           60807 non-null  object
 13  call_type                 60807 non-null  object
 14  arrest_flag               6

Moving on, let us work with the call type columns. There are more than 13,000 entries with an 'Unknown' value, which indicates that these instances were not entered in the CAD system. These could be anonymous calls, withheld for privacy reasons, so we will keep them. However, we will drop the entries whose values occur only a handful of times.

In [37]:
# Checking the call types column
df['initial_call_type'].value_counts()

initial_call_type
Unknown                                        13406
SUSPICIOUS STOP - OFFICER INITIATED ONVIEW      4769
SUSPICIOUS PERSON, VEHICLE, OR INCIDENT         4200
DISTURBANCE                                     3060
ASLT - CRITICAL (NO SHOOTINGS)                  2702
                                               ...  
ESCAPE - PRISONER                                  1
PHONE - OBSCENE OR NUISANCE PHONE CALLS            1
EXPLOSION                                          1
ORDER - ASSIST DV VIC W/SRVC OF COURT ORDER        1
-ASSIGNED DUTY - STAKEOUT                          1
Name: count, Length: 181, dtype: int64

ESCAPE - PRISONER, PHONE - OBSCENE OR NUISANCE PHONE CALLS, EXPLOSION, ORDER - ASSIST DV VIC W/SRVC OF COURT ORDER, ASSIGNED DUTY - STAKEOUT, TEXT MESSAGE and SCHEDULED EVENT (RECURRING) all occur very rarely, so it is better to drop them as well. We keep the entries with 'Unknown' values because there are so many of them and we assume they may have been withheld for privacy reasons.

In [38]:
# Dropping the call types with very few entries
# Note: the filter is applied to the call_type column
rare_call_types = [
    'ESCAPE - PRISONER',
    'PHONE - OBSCENE OR NUISANCE PHONE CALLS',
    'EXPLOSION',
    'ORDER - ASSIST DV VIC W/SRVC OF COURT ORDER',
    'ASSIGNED DUTY - STAKEOUT',
    'SCHEDULED EVENT (RECURRING)',
    'TEXT MESSAGE',
]
# .copy() gives an independent dataframe that is safe to modify in place later
df = df[~df['call_type'].isin(rare_call_types)].copy()


df['call_type'].value_counts()

call_type
911                              28734
ONVIEW                           14013
Unknown                          13406
TELEPHONE OTHER, NOT 911          4099
ALARM CALL (NOT POLICE ALARM)      525
Name: count, dtype: int64
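
Instead of listing rare labels by hand, they can be selected by a frequency threshold. This is a sketch on toy data; the threshold `min_count` is a hypothetical value chosen only for this example.

```python
import pandas as pd

# Toy data: two common labels and one rare label
toy = pd.Series(['911'] * 5 + ['ONVIEW'] * 3 + ['TEXT MESSAGE'], name='call_type')

min_count = 2  # hypothetical threshold, chosen only for this example
counts = toy.value_counts()
rare_labels = counts[counts < min_count].index

# Keep only rows whose label is not rare
filtered = toy[~toy.isin(rare_labels)]
print(filtered.value_counts())
```

A threshold keeps the rule explicit and still works if the set of rare labels changes when the data is refreshed.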

Now that the call type has been cleaned, let us check the shape of the dataframe before moving on to the frisk flag.

In [39]:
# Checking our dataframe.
df.shape

(60777, 26)

Let us check and clean the frisk flag instances. 

In [40]:
# Checking the frisk flag column
df['frisk_flag'].value_counts()

# Dropping the frisk flag instances with unknown values (only 478 instances)
df.drop(df[df['frisk_flag'] == 'Unknown'].index, inplace=True)

df['frisk_flag'].value_counts()


frisk_flag
N    45834
Y    14465
Name: count, dtype: int64

In [41]:
# Checking the arrest flag column for any missing values
df['arrest_flag'].value_counts()

arrest_flag
N    53789
Y     6510
Name: count, dtype: int64

The arrest flag column seems clean, with no missing values. Out of 60,299 instances we have only 6,510 arrests, which is about 10.8% of the total.
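
The share of arrests can be computed directly from the counts. The sketch below rebuilds the counts from the `value_counts()` output above rather than from the dataframe itself.

```python
import pandas as pd

# Counts copied from the arrest_flag value_counts() output
arrest_counts = pd.Series({'N': 53789, 'Y': 6510})

# Share of each class among all stops
arrest_share = arrest_counts / arrest_counts.sum()
print(arrest_share.round(3))
```

On the dataframe itself, `df['arrest_flag'].value_counts(normalize=True)` gives the same shares in one step.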

**Officer Race** has some instances with unknown or unspecified values. It is better to combine them and tidy up the column.

In [42]:
# Checking the officer race column
df['officer_race'].value_counts()

officer_race
White                            43220
Two or More Races                 4201
Hispanic or Latino                3986
Asian                             2876
Not Specified                     2828
Black or African American         2403
Nat Hawaiian/Oth Pac Islander      545
American Indian/Alaska Native      240
Name: count, dtype: int64

In [43]:
# Combining the Unknown values into the not specified values of the officer race column
replace_val(df, 'officer_race', ['Unknown/Unspecified', 'Not Stated'], 'Not Specified')

df.officer_race.value_counts()

officer_race
White                            43220
Two or More Races                 4201
Hispanic or Latino                3986
Asian                             2876
Not Specified                     2828
Black or African American         2403
Nat Hawaiian/Oth Pac Islander      545
American Indian/Alaska Native      240
Name: count, dtype: int64

**Subject Gender**, like officer race, has 'Unknown' and 'Unable to Determine' values that need to be combined.

In [44]:
# Combining the Unknown values into the unable to determine values of the subject perceived gender column
Unknown = ['Unknown']
replace_val(df,'subject_perceived_gender', Unknown, 'Unable to Determine')

df.subject_perceived_gender.value_counts()

subject_perceived_gender
Male                                                         47640
Female                                                       12004
Unable to Determine                                            608
Gender Diverse (gender non-conforming and/or transgender)       45
MULTIPLE SUBJECTS                                                2
Name: count, dtype: int64

In [45]:
# Dropping the gender diverse and multiple subjects values as they are very few and can create problems
df.drop(df[df['subject_perceived_gender'].isin([
    'Gender Diverse (gender non-conforming and/or transgender)',
    'MULTIPLE SUBJECTS'
])].index, inplace=True)

df.subject_perceived_gender.value_counts()

subject_perceived_gender
Male                   47640
Female                 12004
Unable to Determine      608
Name: count, dtype: int64

## Precinct, Sector and Beat
These columns hold location data, which can be very important in this process because location may influence the probability of being stopped. However, they contain some placeholder values that need to be cleaned up.

In [46]:
# Checking the precinct data
df['precinct'].value_counts()

precinct
West         16657
North        12693
Unknown      10661
East          8159
South         7290
Southwest     4673
OOJ             97
FK ERROR        22
Name: count, dtype: int64

In [47]:
# Let us check what the FK ERROR is
df[df['precinct'] == 'FK ERROR']

Unnamed: 0,subject_age_group,subject_id,go_sc_num,terry_stop_id,stop_resolution,weapon_type,officer_id,officer_gender,officer_race,subject_perceived_race,...,precinct,sector,beat,repeat_offenders,incident_year,incident_month,officer_age,field_contact,offense_report,reported_hour
1996,56 and Above,7760748894,20190000369575,10569761986,Field Contact,Unknown,8599,M,Two or More Races,Unknown,...,FK ERROR,99,99,No,2019,10,30,Y,N,18
2697,1 - 17,34202427492,20220000012513,31222898051,Arrest,Firearm,8394,M,White,Black or African American,...,FK ERROR,99,99,No,2022,1,31,N,N,3
6011,36 - 45,21896920197,20210000064178,21897246304,Arrest,Unknown,6882,M,White,White,...,FK ERROR,99,99,No,2021,3,42,N,N,7
18028,18 - 25,10392618417,20190000350728,10392612439,Field Contact,Unknown,7603,M,White,Unknown,...,FK ERROR,99,99,No,2019,9,42,Y,N,14
22615,36 - 45,7727091519,20190000325595,10042391872,Field Contact,Knife/Stabbing Instrument,6844,M,Not Specified,White,...,FK ERROR,99,99,No,2019,8,51,Y,N,18
22642,26 - 35,12172435351,20200000021260,12172421137,Field Contact,Unknown,8692,M,White,Unknown,...,FK ERROR,99,99,No,2020,1,34,Y,N,15
24264,26 - 35,7732925068,20200000028389,12221869321,Arrest,Unknown,5630,M,Black or African American,White,...,FK ERROR,99,99,No,2020,1,56,N,N,12
32431,46 - 55,7729016287,20210000021036,20132790775,Field Contact,Unknown,8684,M,White,White,...,FK ERROR,99,99,No,2021,1,27,Y,N,23
35677,26 - 35,7726837499,20190000468247,12108530793,Offense Report,Unknown,8614,M,Two or More Races,Unknown,...,FK ERROR,99,99,No,2019,12,23,N,Y,11
36841,18 - 25,58002696823,20200000187751,13477897443,Field Contact,Unknown,8715,M,White,Unknown,...,FK ERROR,99,99,No,2020,6,26,Y,N,18


Interestingly, the rows we have here fall between 2019 and 2022, with about half of them in 2019. This suggests that the error is most likely due to a system failure. Since these incidents really occurred, we cannot ignore them and must clean them instead.
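
The spread across years can be tallied with `value_counts`. The sketch below uses only the incident years of the ten rows displayed above, so the shares need not match the full set of 22 FK ERROR rows.

```python
import pandas as pd

# incident_year of the ten FK ERROR rows displayed in the output
years = pd.Series([2019, 2022, 2021, 2019, 2019, 2020, 2020, 2021, 2019, 2020],
                  name='incident_year')

# Number of displayed FK ERROR rows per year, in chronological order
per_year = years.value_counts().sort_index()
print(per_year)
```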

In [48]:
prec = ['FK ERROR', 'OOJ'] # OOJ most likely stands for Out of Jurisdiction
sect = ['99']
beats = ['99', 'OOJ']

# precinct
replace_val(df, col='precinct', old_val=prec, new_val='Unknown')
# sector
replace_val(df, col='sector', old_val=sect, new_val='Unknown')
# beat
replace_val(df, col='beat', old_val=beats, new_val='Unknown')

df['precinct'].value_counts(), df['sector'].value_counts(), df['beat'].value_counts()

(precinct
 West         16657
 North        12693
 Unknown      10780
 East          8159
 South         7290
 Southwest     4673
 Name: count, dtype: int64,
 sector
 Unknown    10739
 K           5597
 M           5091
 E           4284
 N           3554
 D           3427
 F           2815
 R           2763
 B           2733
 Q           2538
 L           2466
 O           2326
 S           2200
 U           2199
 G           2084
 W           1856
 C           1791
 J           1739
 OOJ           50
 Name: count, dtype: int64,
 beat
 Unknown    10783
 K3          3209
 M3          2472
 E2          1789
 N3          1761
 E1          1439
 M2          1349
 D1          1346
 N2          1340
 K2          1303
 R2          1286
 D2          1278
 M1          1273
 Q3          1213
 F2          1159
 K1          1085
 E3          1054
 B2          1042
 B1          1018
 U2          1015
 O1           945
 S2           857
 L2           852
 F3           836
 F1           820
 L1     

## Final Check
Let us now check the state of our dataset before proceeding further.

In [49]:
# Checking the copy dataframe we have cleaned up
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 60252 entries, 0 to 61007
Data columns (total 26 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   subject_age_group         60252 non-null  object
 1   subject_id                60252 non-null  object
 2   go_sc_num                 60252 non-null  int64 
 3   terry_stop_id             60252 non-null  int64 
 4   stop_resolution           60252 non-null  object
 5   weapon_type               28175 non-null  object
 6   officer_id                60252 non-null  object
 7   officer_gender            60252 non-null  object
 8   officer_race              60252 non-null  object
 9   subject_perceived_race    60252 non-null  object
 10  subject_perceived_gender  60252 non-null  object
 11  initial_call_type         60252 non-null  object
 12  final_call_type           60252 non-null  object
 13  call_type                 60252 non-null  object
 14  arrest_flag               6

In [None]:
# Fixing the format of subject age group values
replace_val(df, 'subject_age_group', ['26 - 35'], '26_35')
replace_val(df, 'subject_age_group', ['18 - 25'], '18_25')
replace_val(df, 'subject_age_group', ['36 - 45'], '36_45')
replace_val(df, 'subject_age_group', ['46 - 55'], '46_55')
replace_val(df, 'subject_age_group', ['56 and Above'], '56_up')
replace_val(df, 'subject_age_group', ['1 - 17'], '1_17')


df['subject_age_group'].value_counts()
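
The six replacements above follow one pattern, so they could also be expressed with two string substitutions. This is an alternative sketch on a toy Series, not the approach used in the cell above.

```python
import pandas as pd

# Toy data with the label styles seen in subject_age_group
toy = pd.Series(['26 - 35', '1 - 17', '56 and Above', 'Unknown'])

# ' - ' becomes '_' and ' and Above' becomes '_up'; other labels are unchanged
formatted = (
    toy.str.replace(' - ', '_', regex=False)
       .str.replace(' and Above', '_up', regex=False)
)
print(formatted.tolist())
```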

With that, the dataset is cleaned up and ready for EDA and modeling.

# Feature Engineering

## Same Races
Here we'll make a binary column 'same_race' that holds 'Y' if the officer and the subject were of the same race and 'N' if they were of different races.

To accomplish this, we need to make sure that the categories in 'officer_race' and 'subject_perceived_race' use the same values, and make any necessary changes.

In [None]:
# Checking the values of both columns
races = df[['officer_race', 'subject_perceived_race']]
col_values(races)

We can see that the two columns do not use the same values. The differences are (Hispanic or Latino, Hispanic), (American Indian/Alaska Native, American Indian or Alaska Native), (Two or More Races, Multi-Racial), (Nat Hawaiian/Oth Pac Islander, Other), and (Not Specified, Unknown). Let us sort this out.


In [None]:
# Aligning the column values
native = ['American Indian/Alaska Native', 'American Indian or Alaska Native']
multi = ['Two or More Races']
other = ['Nat Hawaiian/Oth Pac Islander', 'Native Hawaiian or Other Pacific Islander']
unknown = ['Unknown']
hispanic = ['Hispanic or Latino']

# native
replace_val(df, 'officer_race', native, 'Native American')
replace_val(df, 'subject_perceived_race', native, 'Native American')
# multi
replace_val(df, 'officer_race', multi, 'Multi-Racial')
# other
replace_val(df, 'officer_race', other, 'Other')
replace_val(df, 'subject_perceived_race', other, 'Other')
# unknown
replace_val(df, 'subject_perceived_race', unknown, 'Not Specified')
# hispanic
replace_val(df, 'officer_race', hispanic, 'Hispanic')

df.officer_race.unique() 

In [None]:
df.subject_perceived_race.unique()

In [None]:
# Now that the values of the two fields are aligned, let us create a new column same_race:
# 'Y' when officer and subject race match, 'N' otherwise
df['same_race'] = np.where(df['officer_race'] == df['subject_perceived_race'], 'Y', 'N')

df['same_race'].value_counts()
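
One caveat with a plain equality check is that two 'Not Specified' values also compare as equal. Whether such pairs should count as the same race is a judgment call. The sketch below shows one possible variant, on invented rows, that labels them 'Unknown' instead. This third label is an assumption for illustration and is not used elsewhere in this project.

```python
import numpy as np
import pandas as pd

# Toy data: invented rows, not real records
toy = pd.DataFrame({
    'officer_race':           ['White', 'Asian', 'Not Specified', 'White'],
    'subject_perceived_race': ['White', 'White', 'Not Specified', 'Not Specified'],
})

# A pair is "unknown" when either side is not specified
unknown = (toy['officer_race'] == 'Not Specified') | (toy['subject_perceived_race'] == 'Not Specified')
same = toy['officer_race'] == toy['subject_perceived_race']

# np.select checks the conditions in order: unknown first, then equality
toy['same_race'] = np.select([unknown, same], ['Unknown', 'Y'], default='N')
print(toy['same_race'].tolist())
```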

## Same Gender
Let us do the same thing with the officer gender and the subject gender. First let us make sure that the genders in both columns match.

In [None]:
# Matching both genders
male = ['Male']
female = ['Female']

replace_val(df, 'subject_perceived_gender', male, 'M')
replace_val(df, 'subject_perceived_gender', female, 'F')

# Now that the values of the two fields are aligned, let us create a new column same_gender:
# 'Y' when officer and subject gender match, 'N' otherwise
df['same_gender'] = np.where(df['officer_gender'] == df['subject_perceived_gender'], 'Y', 'N')

df['same_gender'].value_counts()


In [None]:
# Creating a new dataframe for EDA (copy, so later changes do not affect df)
df_clean = df.copy()
df_clean.head()

In [None]:
df_clean.columns

## Exporting to CSV
We are done with cleaning, feature engineering and preprocessing the dataset. Let us export it to a new CSV file that we will use for EDA.

In [None]:
# Exporting to csv file (uncomment the lines below to write the file)
# df_clean.to_csv('data/clean_Terry_stops_data.csv', index=False)

# print('Data exported to data/clean_Terry_stops_data.csv successfully.')