## Data Understanding

In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('Data/Terry_Stops.csv')
data

Unnamed: 0,Subject Age Group,Subject ID,GO / SC Num,Terry Stop ID,Stop Resolution,Weapon Type,Officer ID,Officer YOB,Officer Gender,Officer Race,...,Reported Time,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat
0,-,-8,20220000063036,32023419019,Field Contact,-,6805,1973,M,White,...,09:34:02.0000000,SUSPICIOUS STOP - OFFICER INITIATED ONVIEW,--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,ONVIEW,WEST PCT 1ST W - KQ/DM RELIEF,N,Y,West,K,K3
1,-,-8,20220000233868,35877423282,Field Contact,-,8881,1988,M,Asian,...,19:20:16.0000000,THREATS (INCLS IN-PERSON/BY PHONE/IN WRITING),--DISTURBANCE - OTHER,911,TRAINING - FIELD TRAINING SQUAD,N,Y,South,O,O1
2,-,-1,20140000120677,92317,Arrest,,7500,1984,M,Black or African American,...,11:32:00.0000000,-,-,-,SOUTH PCT 1ST W - ROBERT,N,N,South,O,O2
3,-,-1,20150000001463,28806,Field Contact,,5670,1965,M,White,...,07:59:00.0000000,-,-,-,,N,N,-,-,-
4,-,-1,20150000001516,29599,Field Contact,,4844,1961,M,White,...,19:12:00.0000000,-,-,-,,N,-,-,-,-
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53649,56 and Above,35908211663,20220000238170,35908320953,Field Contact,-,8889,1977,M,White,...,04:11:19.0000000,"DISTURBANCE, MISCELLANEOUS/OTHER","DISTURBANCE, MISCELLANEOUS/OTHER",911,TRAINING - FIELD TRAINING SQUAD,N,N,West,Q,Q3
53650,56 and Above,36244081163,20220000243193,36244016178,Arrest,-,8857,1996,M,White,...,01:53:14.0000000,-,-,-,WEST PCT 2ND W - D/M RELIEF,Y,N,East,E,E3
53651,56 and Above,36540999080,20220000250573,36541078424,Field Contact,-,6805,1973,M,White,...,12:57:21.0000000,"SUSPICIOUS PERSON, VEHICLE OR INCIDENT",--MISCHIEF OR NUISANCE - GENERAL,ONVIEW,WEST PCT 1ST W - KQ/DM RELIEF,N,N,West,K,K3
53652,56 and Above,36545542648,20220000251229,36545507606,Field Contact,-,7773,1978,M,White,...,02:33:21.0000000,FOUND - PERSON,FOUND - PERSON,911,NORTH PCT 3RD W - LINCOLN,N,N,North,J,J3



The data can be grouped into four categories:

#### Demographics: race, age, gender

#### Situation: weapons involved, date, time

#### Administrative: report type, precinct, officer squad, officer IDs

#### Outcome: arrest, citation, field contact, offense report


There are 53,654 rows and 23 columns in our dataset. Below is a brief description of what each column represents:

#### Subject Age Group
Subject Age Group (10 year increments) as reported by the officer.
  
#### Subject ID
Key, generated daily, identifying unique subjects in the dataset using a character-to-character match of first name and last name. "Null" values indicate an "anonymous" or "unidentified" subject. Subjects of a Terry Stop are not required to present identification.
    
#### GO / SC Num
General Offense or Street Check number, relating the Terry Stop to the parent report. This field may have a one-to-many relationship in the data.
    
#### Terry Stop ID
Key identifying unique Terry Stop reports.

#### Stop Resolution
Resolution of the stop as reported by the officer.
    
#### Weapon Type
Type of weapon, if any, identified during a search or frisk of the subject. Indicates "None" if no weapons were found.
    
#### Officer ID
Key identifying unique officers in the dataset.
    
#### Officer YOB
Year of birth, as reported by the officer.
    
#### Officer Gender
Gender of the officer, as reported by the officer.
    
#### Officer Race
Race of the officer, as reported by the officer.
    
#### Subject Perceived Race
Perceived race of the subject, as reported by the officer.

#### Subject Perceived Gender
Perceived gender of the subject, as reported by the officer.
    
#### Reported Date
Date the report was filed in the Records Management System (RMS). Not necessarily the date the stop occurred but generally within 1 day.
    
#### Reported Time
Time the stop was reported in the Records Management System (RMS). Not the time the stop occurred but generally within 10 hours.
    
#### Initial Call Type
Initial classification of the call as assigned by 911.
    
#### Final Call Type
Final classification of the call as assigned by the primary officer closing the event.
    
#### Call Type
How the call was received by the communication center.
    
#### Officer Squad
Functional squad assignment (not budget) of the officer as reported by the Data Analytics Platform (DAP).
    
#### Arrest Flag
Indicates whether a "physical arrest" of the subject was made during the Terry Stop. Does not necessarily reflect a report of an arrest in the Records Management System (RMS).
    
#### Frisk Flag
Indicates whether the officer conducted a "frisk" of the subject during the Terry Stop.
    
#### Precinct
Precinct of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred.
    
#### Sector
Sector of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred.
    
#### Beat
Beat of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred.


In [3]:
data['Stop Resolution'].value_counts()

Field Contact               23006
Offense Report              16599
Arrest                      13131
Referred for Prosecution      728
Citation / Infraction         190
Name: Stop Resolution, dtype: int64

In [4]:
data['Reported Time']

0        09:34:02.0000000
1        19:20:16.0000000
2        11:32:00.0000000
3        07:59:00.0000000
4        19:12:00.0000000
               ...       
53649    04:11:19.0000000
53650    01:53:14.0000000
53651    12:57:21.0000000
53652    02:33:21.0000000
53653    23:21:54.0000000
Name: Reported Time, Length: 53654, dtype: object

In [5]:
data['Precinct'].value_counts()

West         14070
North        11699
-            10240
East          6904
South         6363
Southwest     2320
SouthWest     1775
Unknown        200
OOJ             61
FK ERROR        22
Name: Precinct, dtype: int64

In [6]:
data['Frisk Flag'].value_counts(normalize=True)

N    0.760316
Y    0.230775
-    0.008909
Name: Frisk Flag, dtype: float64

In [7]:
data['Subject Age Group'].value_counts(normalize=True)

26 - 35         0.334178
36 - 45         0.216834
18 - 25         0.195922
46 - 55         0.128434
56 and Above    0.051851
1 - 17          0.038767
-               0.034014
Name: Subject Age Group, dtype: float64

In [8]:
data['Officer YOB'].value_counts(normalize=True)

1986    0.068774
1987    0.063779
1991    0.055522
1984    0.054441
1992    0.053193
1990    0.050099
1985    0.048459
1988    0.044657
1989    0.042345
1982    0.036269
1983    0.034778
1993    0.033101
1995    0.031983
1979    0.031964
1981    0.029653
1994    0.025087
1971    0.023707
1976    0.023223
1978    0.022757
1977    0.020520
1973    0.018731
1996    0.017930
1980    0.017426
1967    0.014761
1997    0.013904
1970    0.012487
1968    0.012376
1969    0.010996
1975    0.010791
1974    0.010791
1962    0.008629
1964    0.008555
1972    0.008368
1965    0.007902
1963    0.004939
1966    0.004380
1961    0.004361
1958    0.004138
1959    0.003243
1960    0.003001
1998    0.002292
1900    0.001286
1954    0.000820
1957    0.000801
1953    0.000652
1999    0.000466
2000    0.000429
1955    0.000391
1956    0.000317
1948    0.000205
1952    0.000168
1949    0.000093
1946    0.000037
1951    0.000019
Name: Officer YOB, dtype: float64

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53654 entries, 0 to 53653
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Subject Age Group         53654 non-null  object
 1   Subject ID                53654 non-null  int64 
 2   GO / SC Num               53654 non-null  int64 
 3   Terry Stop ID             53654 non-null  int64 
 4   Stop Resolution           53654 non-null  object
 5   Weapon Type               53654 non-null  object
 6   Officer ID                53654 non-null  object
 7   Officer YOB               53654 non-null  int64 
 8   Officer Gender            53654 non-null  object
 9   Officer Race              53654 non-null  object
 10  Subject Perceived Race    53654 non-null  object
 11  Subject Perceived Gender  53654 non-null  object
 12  Reported Date             53654 non-null  object
 13  Reported Time             53654 non-null  object
 14  Initial Call Type     

In [10]:
data.shape

(53654, 23)

In [11]:
#descriptive statistics for numeric columns
data.describe()

          Subject ID       GO / SC Num  Terry Stop ID  Officer YOB
count        53654.0           53654.0        53654.0      53654.0
mean    4343049000.0  20180400000000.0   6931744000.0  1983.457356
std     7565102000.0     89604370000.0  10819500000.0     9.312311
min             -8.0              -1.0        28020.0       1900.0
25%             -1.0  20160000000000.0       207466.0       1979.0
50%             -1.0  20180000000000.0       454442.5       1986.0
75%     7731599000.0  20200000000000.0  12621180000.0       1990.0
max    37430510000.0  20220000000000.0  37429610000.0       2000.0


In [12]:
data.isnull().sum()

Subject Age Group             0
Subject ID                    0
GO / SC Num                   0
Terry Stop ID                 0
Stop Resolution               0
Weapon Type                   0
Officer ID                    0
Officer YOB                   0
Officer Gender                0
Officer Race                  0
Subject Perceived Race        0
Subject Perceived Gender      0
Reported Date                 0
Reported Time                 0
Initial Call Type             0
Final Call Type               0
Call Type                     0
Officer Squad               489
Arrest Flag                   0
Frisk Flag                    0
Precinct                      0
Sector                        0
Beat                          0
dtype: int64
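
The near-zero null counts above can be misleading: the value counts shown earlier for `Precinct` and `Frisk Flag` include a `-` entry, which `isnull()` treats as an ordinary string. The following is a minimal sketch, on a small made-up frame (`toy` is hypothetical, not the real file), of how such placeholders might be converted to `NaN` so they are counted. Treating `-` as "not recorded" is an assumption about the data.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) mimicking columns that use '-' as a placeholder
toy = pd.DataFrame({
    "Frisk Flag": ["N", "Y", "-", "N"],
    "Precinct": ["West", "-", "-", "North"],
})

# Before: isnull() finds nothing because '-' is a regular string
before = toy.isnull().sum()

# Assumption: '-' means "not recorded", so replace exact matches with NaN
cleaned = toy.replace("-", np.nan)

# After: the placeholders are now counted as missing
after = cleaned.isnull().sum()
print(after)
```

`DataFrame.replace` with a plain string only matches whole cell values, so labels that merely contain a hyphen (such as `26 - 35`) would be left alone.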

In [13]:
data['Subject Age Group'].value_counts()
# Missing values are recorded as '-' (1825 rows)

26 - 35         17930
36 - 45         11634
18 - 25         10512
46 - 55          6891
56 and Above     2782
1 - 17           2080
-                1825
Name: Subject Age Group, dtype: int64

In [14]:
data['Terry Stop ID'].duplicated().sum()
# 73 rows repeat a Terry Stop ID that already appears earlier

73
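
Before dropping the repeated IDs it may help to look at them. The sketch below uses a made-up frame (`toy` and its values are hypothetical) to show one way of listing every row in a duplicated group and of checking whether the rows are identical in all columns or only share the ID.

```python
import pandas as pd

# Toy frame (made-up values) with two repeated Terry Stop IDs
toy = pd.DataFrame({
    "Terry Stop ID": [101, 102, 102, 103, 103],
    "Stop Resolution": ["Arrest", "Field Contact", "Field Contact",
                        "Arrest", "Offense Report"],
})

# Default keep='first' counts only the later repeats
repeat_count = toy["Terry Stop ID"].duplicated().sum()

# keep=False marks every member of a duplicated group, so pairs can be compared
dupes = toy[toy["Terry Stop ID"].duplicated(keep=False)].sort_values("Terry Stop ID")

# Rows that are identical in every column, not just in the ID
fully_identical = toy.duplicated().sum()
print(dupes)
```

If some repeated IDs differ in other columns, as ID 103 does here, keeping only the first occurrence discards information, which is a design choice worth stating explicitly.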

In [15]:
data['Stop Resolution'].value_counts()

Field Contact               23006
Offense Report              16599
Arrest                      13131
Referred for Prosecution      728
Citation / Infraction         190
Name: Stop Resolution, dtype: int64

In [16]:
data['Weapon Type'].value_counts()
# Missing values are recorded as '-'
# Investigate 'Personal Weapons (hands, feet, etc.)' further

None                                    32565
-                                       17800
Lethal Cutting Instrument                1482
Knife/Cutting/Stabbing Instrument         967
Handgun                                   342
Blunt Object/Striking Implement           125
Firearm Other                             100
Firearm                                    63
Club, Blackjack, Brass Knuckles            49
Mace/Pepper Spray                          44
Other Firearm                              41
Firearm (unk type)                         15
Taser/Stun Gun                             13
Fire/Incendiary Device                     11
None/Not Applicable                        10
Club                                        9
Rifle                                       8
Shotgun                                     4
Automatic Handgun                           2
Personal Weapons (hands, feet, etc.)        2
Blackjack                                   1
Brass Knuckles                    
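
Several of the labels above overlap (for example `Firearm Other`, `Other Firearm` and `Firearm (unk type)`). One possible way to consolidate them is a lookup table, sketched below. The bucket names (`firearm`, `blade`, `none`, `other`) and the decision to fold `-` into `none` are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy sample of labels taken from the value counts above
weapon = pd.Series([
    "Handgun", "Firearm Other", "Other Firearm", "Firearm (unk type)",
    "Lethal Cutting Instrument", "Knife/Cutting/Stabbing Instrument",
    "None", "-", "Mace/Pepper Spray",
])

# Hypothetical grouping: keys are labels from the data, values are made-up buckets
groups = {
    "Handgun": "firearm",
    "Firearm Other": "firearm",
    "Other Firearm": "firearm",
    "Firearm (unk type)": "firearm",
    "Lethal Cutting Instrument": "blade",
    "Knife/Cutting/Stabbing Instrument": "blade",
    "None": "none",
    "-": "none",  # assumption: '-' means no weapon was recorded
}

# Labels missing from the mapping fall through to "other"
weapon_group = weapon.map(groups).fillna("other")
counts = weapon_group.value_counts()
print(counts)
```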

In [17]:
data['Officer YOB'].value_counts()
# No null values, although 1900 (69 rows) looks like a placeholder rather than a real birth year

1986    3690
1987    3422
1991    2979
1984    2921
1992    2854
1990    2688
1985    2600
1988    2396
1989    2272
1982    1946
1983    1866
1993    1776
1995    1716
1979    1715
1981    1591
1994    1346
1971    1272
1976    1246
1978    1221
1977    1101
1973    1005
1996     962
1980     935
1967     792
1997     746
1970     670
1968     664
1969     590
1975     579
1974     579
1962     463
1964     459
1972     449
1965     424
1963     265
1966     235
1961     234
1958     222
1959     174
1960     161
1998     123
1900      69
1954      44
1957      43
1953      35
1999      25
2000      23
1955      21
1956      17
1948      11
1952       9
1949       5
1946       2
1951       1
Name: Officer YOB, dtype: int64
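
A year of birth is easier to interpret as an age at the time of the report. The sketch below, on made-up rows, derives an approximate officer age and masks the year 1900, which appears 69 times above and is assumed here to be a placeholder. The column name `officer_age` is hypothetical.

```python
import pandas as pd

# Toy frame (made-up rows) using the same date format as the dataset
toy = pd.DataFrame({
    "Officer YOB": [1973, 1988, 1900],
    "Reported Date": ["2022-03-14T00:00:00Z", "2022-09-02T00:00:00Z",
                      "2015-03-19T00:00:00Z"],
})

# Assumption: 1900 is a placeholder, so treat it as missing
yob = toy["Officer YOB"].where(toy["Officer YOB"] != 1900)

# Approximate age = year of the report minus year of birth
report_year = pd.to_datetime(toy["Reported Date"]).dt.year
toy["officer_age"] = report_year - yob
print(toy)
```

The result is approximate because only the birth year is known, so it can be off by up to one year.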

In [18]:
data['Officer Gender'].value_counts()


M    47516
F     6108
N       30
Name: Officer Gender, dtype: int64

In [19]:
data['Officer Race'].value_counts()
# Some officers have an 'Unknown' or 'Not Specified' race

White                            39376
Two or More Races                 3336
Hispanic or Latino                3278
Asian                             2399
Not Specified                     2293
Black or African American         2098
Nat Hawaiian/Oth Pac Islander      472
American Indian/Alaska Native      333
Unknown                             69
Name: Officer Race, dtype: int64

In [20]:
data['Subject Perceived Race'].value_counts()
# Contains 'Unknown', '-', 'Other' and 'DUPLICATE' values
# This feature informs the objective of checking whether Terry Stops are biased against minority groups

White                                        26320
Black or African American                    15936
Unknown                                       3526
-                                             1810
Asian                                         1803
Hispanic                                      1684
American Indian or Alaska Native              1514
Multi-Racial                                   809
Other                                          152
Native Hawaiian or Other Pacific Islander       98
DUPLICATE                                        2
Name: Subject Perceived Race, dtype: int64

In [21]:
data['Subject Perceived Gender'].value_counts()
# Contains missing ('-'), 'Unknown' and 'DUPLICATE' values

Male                                                         42251
Female                                                       10749
Unable to Determine                                            326
-                                                              239
Unknown                                                         67
Gender Diverse (gender non-conforming and/or transgender)       20
DUPLICATE                                                        2
Name: Subject Perceived Gender, dtype: int64

In [22]:
data['Reported Time'].value_counts()
# We will engineer a day/night feature from this column to see at what time of day most Terry Stops are reported

02:56:00.0000000    52
03:09:00.0000000    51
19:18:00.0000000    51
17:00:00.0000000    51
18:51:00.0000000    50
                    ..
06:24:25.0000000     1
07:54:13.0000000     1
16:48:51.0000000     1
16:48:27.0000000     1
23:08:01.0000000     1
Name: Reported Time, Length: 18104, dtype: int64
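
A day/night feature of the kind mentioned in the comment above could be derived from the hour of the reported time. This is a minimal sketch on made-up sample strings. The cut-off (day is 06:00 to 17:59) is an assumption, and because the column holds the time the report was filed, the label describes reporting time rather than the stop itself.

```python
import pandas as pd

# Sample strings in the same HH:MM:SS.fffffff format as the column
times = pd.Series([
    "09:34:02.0000000", "19:20:16.0000000",
    "02:33:21.0000000", "23:21:54.0000000",
])

# The hour is the first two characters of the string
hour = times.str[:2].astype(int)

# Assumption: hours 6 through 17 count as "day", everything else as "night"
day_night = hour.between(6, 17).map({True: "day", False: "night"})
print(day_night.value_counts())
```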

In [23]:
data['Arrest Flag'].value_counts()
#this is our target variable

N    48731
Y     4923
Name: Arrest Flag, dtype: int64
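
The counts above show an imbalanced target: 4923 of 53654 rows, roughly 9%, are flagged `Y`. Most modelling libraries expect a numeric target, so one option is to map the flag to 0/1, as sketched below on a made-up series. The name `target` is hypothetical.

```python
import pandas as pd

# Toy series (made-up values) with the same two labels as Arrest Flag
arrest_flag = pd.Series(["N", "N", "Y", "N", "N", "N", "N", "N", "N", "Y"])

# Encode the flag as 0/1 so it can serve as a binary target
target = arrest_flag.map({"N": 0, "Y": 1})

# The mean of a 0/1 column is the share of positive cases
arrest_rate = target.mean()
print(arrest_rate)
```

With a positive class this small, accuracy alone would likely be a weak yardstick, since always predicting `N` already scores about 91% on the real data.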

In [24]:
data['Frisk Flag'].value_counts()
# Missing values are recorded as '-'
# This flag will be used to check whether a frisk during a Terry Stop led to the discovery of a weapon

N    40794
Y    12382
-      478
Name: Frisk Flag, dtype: int64
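
The frisk-versus-weapon check described in the comment above could be done with a cross-tabulation. The sketch below uses made-up rows. The helper column `weapon_found` and the set of labels treated as "no weapon" are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up rows) using labels that occur in the dataset
toy = pd.DataFrame({
    "Frisk Flag": ["Y", "Y", "N", "N", "Y", "-"],
    "Weapon Type": ["Handgun", "None", "None", "-", "-", "Club"],
})

# Assumption: these labels all mean that no weapon was recorded
no_weapon = ["None", "-", "None/Not Applicable"]
toy["weapon_found"] = np.where(toy["Weapon Type"].isin(no_weapon), "no", "yes")

# Keep only rows with a definite frisk flag, then tabulate
known = toy[toy["Frisk Flag"].isin(["Y", "N"])]
table = pd.crosstab(known["Frisk Flag"], known["weapon_found"])
print(table)
```

Passing `normalize="index"` to `pd.crosstab` would show the share of frisks that found a weapon instead of raw counts.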

In [25]:
data['Precinct'].value_counts()

West         14070
North        11699
-            10240
East          6904
South         6363
Southwest     2320
SouthWest     1775
Unknown        200
OOJ             61
FK ERROR        22
Name: Precinct, dtype: int64
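
The precinct counts list `Southwest` and `SouthWest` separately, which looks like the same precinct spelled two ways. A minimal sketch of how the labels might be harmonised is shown below on a made-up series. Treating `-`, `Unknown` and `FK ERROR` as missing is an assumption, and `OOJ` is left untouched because the source does not define it.

```python
import pandas as pd

# Toy series containing the labels seen in the value counts
precinct = pd.Series(["West", "Southwest", "SouthWest", "-",
                      "Unknown", "FK ERROR", "OOJ"])

# Merge the two spellings of Southwest
precinct = precinct.replace({"SouthWest": "Southwest"})

# Assumption: these labels carry no usable location, so mask them to NaN
placeholders = ["-", "Unknown", "FK ERROR"]
precinct = precinct.mask(precinct.isin(placeholders))
print(precinct.value_counts(dropna=False))
```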

## Data Preparation

### Validity

In [26]:
data.columns

Index(['Subject Age Group', 'Subject ID', 'GO / SC Num', 'Terry Stop ID',
       'Stop Resolution', 'Weapon Type', 'Officer ID', 'Officer YOB',
       'Officer Gender', 'Officer Race', 'Subject Perceived Race',
       'Subject Perceived Gender', 'Reported Date', 'Reported Time',
       'Initial Call Type', 'Final Call Type', 'Call Type', 'Officer Squad',
       'Arrest Flag', 'Frisk Flag', 'Precinct', 'Sector', 'Beat'],
      dtype='object')

In [27]:
# Keep only the columns needed for the analysis; .copy() makes terry_stops
# an independent DataFrame rather than a slice of data
relevant_columns = ['Subject Age Group', 'Terry Stop ID', 'Stop Resolution', 'Weapon Type', 'Officer YOB',
                    'Officer Gender', 'Officer Race', 'Subject Perceived Race', 'Subject Perceived Gender',
                    'Reported Date', 'Reported Time', 'Arrest Flag', 'Frisk Flag', 'Precinct']
terry_stops = data[relevant_columns].copy()
terry_stops

Unnamed: 0,Subject Age Group,Terry Stop ID,Stop Resolution,Weapon Type,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,Reported Time,Arrest Flag,Frisk Flag,Precinct
0,-,32023419019,Field Contact,-,1973,M,White,DUPLICATE,DUPLICATE,2022-03-14T00:00:00Z,09:34:02.0000000,N,Y,West
1,-,35877423282,Field Contact,-,1988,M,Asian,DUPLICATE,DUPLICATE,2022-09-02T00:00:00Z,19:20:16.0000000,N,Y,South
2,-,92317,Arrest,,1984,M,Black or African American,Asian,Male,2015-10-16T00:00:00Z,11:32:00.0000000,N,N,South
3,-,28806,Field Contact,,1965,M,White,-,-,2015-03-19T00:00:00Z,07:59:00.0000000,N,N,-
4,-,29599,Field Contact,,1961,M,White,White,Male,2015-03-21T00:00:00Z,19:12:00.0000000,N,-,-
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53649,56 and Above,35908320953,Field Contact,-,1977,M,White,Black or African American,Male,2022-09-07T00:00:00Z,04:11:19.0000000,N,N,West
53650,56 and Above,36244016178,Arrest,-,1996,M,White,Black or African American,Male,2022-09-12T00:00:00Z,01:53:14.0000000,Y,N,East
53651,56 and Above,36541078424,Field Contact,-,1973,M,White,Asian,Female,2022-09-19T00:00:00Z,12:57:21.0000000,N,N,West
53652,56 and Above,36545507606,Field Contact,-,1978,M,White,White,Male,2022-09-20T00:00:00Z,02:33:21.0000000,N,N,North


In [28]:
terry_stops['Terry Stop ID'].duplicated().sum()

73

In [29]:
# Remove rows that repeat a Terry Stop ID, keeping the first occurrence
terry_stops = terry_stops.drop_duplicates(subset='Terry Stop ID', keep='first')
terry_stops['Terry Stop ID'].duplicated().sum()

0

### Uniformity

In [30]:
# Convert column names to snake_case: spaces become underscores, then lowercase
terry_stops.columns = terry_stops.columns.str.replace(" ", "_").str.lower()

terry_stops


Unnamed: 0,subject_age_group,terry_stop_id,stop_resolution,weapon_type,officer_yob,officer_gender,officer_race,subject_perceived_race,subject_perceived_gender,reported_date,reported_time,arrest_flag,frisk_flag,precinct
0,-,32023419019,Field Contact,-,1973,M,White,DUPLICATE,DUPLICATE,2022-03-14T00:00:00Z,09:34:02.0000000,N,Y,West
1,-,35877423282,Field Contact,-,1988,M,Asian,DUPLICATE,DUPLICATE,2022-09-02T00:00:00Z,19:20:16.0000000,N,Y,South
2,-,92317,Arrest,,1984,M,Black or African American,Asian,Male,2015-10-16T00:00:00Z,11:32:00.0000000,N,N,South
3,-,28806,Field Contact,,1965,M,White,-,-,2015-03-19T00:00:00Z,07:59:00.0000000,N,N,-
4,-,29599,Field Contact,,1961,M,White,White,Male,2015-03-21T00:00:00Z,19:12:00.0000000,N,-,-
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53649,56 and Above,35908320953,Field Contact,-,1977,M,White,Black or African American,Male,2022-09-07T00:00:00Z,04:11:19.0000000,N,N,West
53650,56 and Above,36244016178,Arrest,-,1996,M,White,Black or African American,Male,2022-09-12T00:00:00Z,01:53:14.0000000,Y,N,East
53651,56 and Above,36541078424,Field Contact,-,1973,M,White,Asian,Female,2022-09-19T00:00:00Z,12:57:21.0000000,N,N,West
53652,56 and Above,36545507606,Field Contact,-,1978,M,White,White,Male,2022-09-20T00:00:00Z,02:33:21.0000000,N,N,North


### Completeness

In [31]:
terry_stops['subject_age_group'].value_counts()

26 - 35         17906
36 - 45         11617
18 - 25         10496
46 - 55          6881
56 and Above     2777
1 - 17           2079
-                1825
Name: subject_age_group, dtype: int64

In [32]:
# Keep the YYYY-MM-DD part of the ISO timestamp and parse it as a datetime
terry_stops['reported_date'] = pd.to_datetime(terry_stops['reported_date'].str[:10])
terry_stops['reported_date']

0       2022-03-14
1       2022-09-02
2       2015-10-16
3       2015-03-19
4       2015-03-21
           ...    
53649   2022-09-07
53650   2022-09-12
53651   2022-09-19
53652   2022-09-20
53653   2022-09-29
Name: reported_date, Length: 53581, dtype: datetime64[ns]

In [33]:
# Keep only the hour (first two characters of HH:MM:SS) as an integer
terry_stops['reported_time'] = terry_stops['reported_time'].str[:2].astype(int)

In [34]:
terry_stops['reported_time'].value_counts().sort_values(ascending=True)

6     1409
7     1469
8     1504
20    1653
4     1664
9     1688
5     1703
12    1747
21    2027
13    2043
10    2134
11    2177
14    2330
22    2336
15    2353
0     2514
16    2534
23    2696
1     2718
17    2764
19    2885
3     2888
2     3160
18    3185
Name: reported_time, dtype: int64

In [36]:
terry_stops.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 53581 entries, 0 to 53653
Data columns (total 14 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   subject_age_group         53581 non-null  object        
 1   terry_stop_id             53581 non-null  int64         
 2   stop_resolution           53581 non-null  object        
 3   weapon_type               53581 non-null  object        
 4   officer_yob               53581 non-null  int64         
 5   officer_gender            53581 non-null  object        
 6   officer_race              53581 non-null  object        
 7   subject_perceived_race    53581 non-null  object        
 8   subject_perceived_gender  53581 non-null  object        
 9   reported_date             53581 non-null  datetime64[ns]
 10  reported_time             53581 non-null  int64         
 11  arrest_flag               53581 non-null  object        
 12  frisk_flag        