# Police Data Cleaning
## Crisis Data Dataset
### Exploration

Preview the data.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)
df = pd.read_csv('../Datasets/Crisis_Data.csv')
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75082 entries, 0 to 75081
Data columns (total 25 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Template ID                  75082 non-null  int64 
 1   Reported Date                75082 non-null  object
 2   Reported Time                75082 non-null  object
 3   Occurred Date / Time         75082 non-null  object
 4   Call Type                    75082 non-null  object
 5   Initial Call Type            75082 non-null  object
 6   Final Call Type              75082 non-null  object
 7   Disposition                  75082 non-null  object
 8   Use of Force Indicator       75082 non-null  object
 9   Subject Veteran Indicator    75082 non-null  object
 10  CIT Officer Requested        75082 non-null  object
 11  CIT Officer Dispatched       75082 non-null  object
 12  CIT Officer Arrived          75082 non-null  object
 13  Officer ID                   75

In [2]:
df.iloc[8:10]

Unnamed: 0,Template ID,Reported Date,Reported Time,Occurred Date / Time,Call Type,Initial Call Type,Final Call Type,Disposition,Use of Force Indicator,Subject Veteran Indicator,CIT Officer Requested,CIT Officer Dispatched,CIT Officer Arrived,Officer ID,Officer Gender,Officer Race,Officer Year of Birth,Officer Years of Experience,CIT Certified Indicator,Officer Bureau Desc,Officer Precinct Desc,Officer Squad Desc,Precinct,Sector,Beat
8,57315,2015-05-15T00:00:00,08:30:00,05/15/2015 06:16:56 PM,911,HAZ - POTENTIAL THRT TO PHYS SAFETY (NO HAZMAT),--DISTURBANCE - OTHER,Shelter Transport,N,N,N,N,Y,7666,F,White,1970,2,N,OPERATIONS BUREAU,EAST PCT,EAST PCT 3RD W - E/G RELIEF,-,-,-
9,43946,2015-05-15T00:00:00,10:14:00,05/15/2015 06:47:32 PM,911,SUICIDE - IP/JO SUICIDAL PERSON AND ATTEMPTS,--CRISIS COMPLAINT - GENERAL,Emergent Detention / ITA,N,N,Y,Y,Y,6115,M,White,1968,39,N,OPERATIONS BUREAU,NORTH PCT,NORTH PCT 2ND W - BOY,North,LINCOLN,L2


### Dealing with Missing Values
To make the missing data here more consistent, let's replace the dash values with None.

In [3]:
df.replace({'-': None}, inplace=True)
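# Note: this key matches no stored value (the column holds strings such as
# '01/01/1900 12:00:00 AM'), so the column is left unchanged.
df['Occurred Date / Time'].replace({'1900-01-01': None}, inplace=True)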
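
As a minimal sketch of how `replace` matches values, here is the same idea on a toy frame. The name `toy` and its two rows are made up for illustration. `replace` with a dictionary only swaps cells that equal a key exactly, so a placeholder has to be written in the same string format as the stored data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real file; the values are illustrative only.
toy = pd.DataFrame({
    'Occurred Date / Time': ['01/01/1900 12:00:00 AM', '05/15/2015 06:16:56 PM'],
    'Precinct': ['-', 'North'],
})

# Exact-match replacement: a lone dash becomes a missing value.
toy = toy.replace({'-': np.nan})

# This key does not equal any stored string, so nothing changes.
unchanged = toy['Occurred Date / Time'].replace({'1900-01-01': np.nan})

# Writing the key in the stored format matches the placeholder row.
matched = toy['Occurred Date / Time'].replace({'01/01/1900 12:00:00 AM': np.nan})

print(toy['Precinct'].isna().sum(), unchanged.isna().sum(), matched.isna().sum())
```

Assigning the result back to the column, rather than calling `replace(..., inplace=True)` on a column selection, avoids relying on whether that selection is a view of the frame.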


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75082 entries, 0 to 75081
Data columns (total 25 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Template ID                  75082 non-null  int64 
 1   Reported Date                75082 non-null  object
 2   Reported Time                75082 non-null  object
 3   Occurred Date / Time         75082 non-null  object
 4   Call Type                    69482 non-null  object
 5   Initial Call Type            69482 non-null  object
 6   Final Call Type              69482 non-null  object
 7   Disposition                  73463 non-null  object
 8   Use of Force Indicator       75082 non-null  object
 9   Subject Veteran Indicator    75081 non-null  object
 10  CIT Officer Requested        75082 non-null  object
 11  CIT Officer Dispatched       75082 non-null  object
 12  CIT Officer Arrived          75082 non-null  object
 13  Officer ID                   75

In [6]:
df.shape[0] - df.dropna().shape[0]

10328

10,328 rows have at least one missing value.
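
The same count can be obtained without building the intermediate frame from `dropna()`. The sketch below uses a made-up four-row frame (`toy` is a hypothetical name) and also shows a per-column breakdown, which is often the more useful view.

```python
import numpy as np
import pandas as pd

# Illustrative data only: rows 1, 2 and 3 each contain a missing value.
toy = pd.DataFrame({
    'Call Type': ['911', np.nan, 'ONVIEW', np.nan],
    'Precinct': ['North', 'East', np.nan, np.nan],
    'Template ID': [1, 2, 3, 4],
})

# Rows with at least one missing value.
rows_with_missing = toy.isna().any(axis=1).sum()

# The notebook's approach gives the same number.
same_count = toy.shape[0] - toy.dropna().shape[0]

# Missing values per column.
missing_per_column = toy.isna().sum()

print(rows_with_missing, same_count)
print(missing_per_column)
```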

In [7]:
df[(df['Officer Years of Experience'] == -1)].shape[0]


1056

In [8]:
df[(df['Officer Year of Birth'] == 1900)].shape[0]


377

Some missing data remains in the integer columns, encoded as sentinel values: officers born in the year 1900, or officers with -1 years of experience.
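
If these sentinels should be treated as missing, one option is pandas' nullable integer dtype, which can hold `<NA>` without converting the column to float. This is a sketch on made-up values, not a step the notebook performs. The `sentinels` mapping is an assumption based on the two checks above.

```python
import pandas as pd

# Illustrative data only.
toy = pd.DataFrame({
    'Officer Year of Birth': [1985, 1900, 1970],
    'Officer Years of Experience': [-1, 6, 2],
})

# Assumed sentinel for each column, taken from the counts above.
sentinels = {'Officer Year of Birth': 1900, 'Officer Years of Experience': -1}

for col, sentinel in sentinels.items():
    # 'Int16' (capital I) is the nullable integer dtype.
    nullable = toy[col].astype('Int16')
    # mask() blanks out the cells where the condition is True.
    toy[col] = nullable.mask(nullable == sentinel)

print(toy)
print(toy.isna().sum())
```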


Let's look at optimizing the date columns.

In [9]:
df['Reported Date'].value_counts()['1900-01-01T00:00:00']

6

In [10]:
df['Occurred Date / Time'].value_counts()['01/01/1900 12:00:00 AM']

5600

Some dates are recorded as January 1, 1900, which appears to be a placeholder for missing or unknown data. Of the 75,082 entries, 6 have an unknown reported date and 5,600 have an unknown occurred date.
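
One way to turn the placeholder into a proper missing value is to parse the strings first and then mask the 1900-01-01 timestamp, which yields `NaT`. The sketch below uses two made-up rows and assumes the format string matches the `Occurred Date / Time` values shown in the preview.

```python
import pandas as pd

# Illustrative values in the same format as the 'Occurred Date / Time' column.
occurred = pd.Series(['01/01/1900 12:00:00 AM', '05/15/2015 06:16:56 PM'])

# An explicit format avoids per-row format guessing.
parsed = pd.to_datetime(occurred, format='%m/%d/%Y %I:%M:%S %p')

# Blank out the placeholder; datetime columns use NaT for missing values.
placeholder = pd.Timestamp('1900-01-01')
cleaned = parsed.mask(parsed == placeholder)

print(cleaned)
```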

In [11]:
date_cols = ['Reported Date', 'Reported Time', 'Occurred Date / Time']
for col in date_cols:
    df[col] = pd.to_datetime(df[col])
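
Passing a bare time such as `08:30:00` to `pd.to_datetime` produces a full timestamp, so `Reported Time` ends up with a date part that has nothing to do with the incident. A sketch of one alternative, on made-up rows: parse the time as a duration and add it to the reported date. The column name `Reported Datetime` is hypothetical.

```python
import pandas as pd

# Illustrative rows in the formats shown in the preview.
toy = pd.DataFrame({
    'Reported Date': ['2015-05-15T00:00:00', '2015-05-16T00:00:00'],
    'Reported Time': ['08:30:00', '10:14:00'],
})

# The date parses to midnight; the time parses to an offset from midnight.
reported_date = pd.to_datetime(toy['Reported Date'])
reported_time = pd.to_timedelta(toy['Reported Time'])

# Adding the two gives a single timestamp per row.
toy['Reported Datetime'] = reported_date + reported_time

print(toy['Reported Datetime'])
```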


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75082 entries, 0 to 75081
Data columns (total 25 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   Template ID                  75082 non-null  int64         
 1   Reported Date                75082 non-null  datetime64[ns]
 2   Reported Time                75082 non-null  datetime64[ns]
 3   Occurred Date / Time         75082 non-null  datetime64[ns]
 4   Call Type                    69482 non-null  object        
 5   Initial Call Type            69482 non-null  object        
 6   Final Call Type              69482 non-null  object        
 7   Disposition                  73463 non-null  object        
 8   Use of Force Indicator       75082 non-null  object        
 9   Subject Veteran Indicator    75081 non-null  object        
 10  CIT Officer Requested        75082 non-null  object        
 11  CIT Officer Dispatched       75082 non-nu

### Optimizing Numeric Columns
We can cast columns such as Officer Year of Birth to an integer datatype and downcast them to the smallest subtype that fits. The Officer ID column has missing data, so it remains an object datatype.


In [20]:
for col in ['Template ID', 'Officer Year of Birth', 'Officer Years of Experience']:
    df[col] = df[col].astype('int')
    df[col] = pd.to_numeric(df[col], downcast='integer')
    print(df[col].dtype, col)                               

int64 Template ID
int16 Officer Year of Birth
int8 Officer Years of Experience
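
To see why the columns land on different subtypes, here is a small sketch on made-up values. `downcast='integer'` picks the smallest signed integer type that can hold every value in the column, so four-digit years need `int16` while small counts fit in `int8`.

```python
import pandas as pd

# Illustrative values only.
years = pd.Series([1985, 1970, 1968], dtype='int64')
experience = pd.Series([-1, 6, 39], dtype='int64')

years_small = pd.to_numeric(years, downcast='integer')
experience_small = pd.to_numeric(experience, downcast='integer')

print(years_small.dtype, experience_small.dtype)

# Bytes used by the values before and after downcasting.
print(years.memory_usage(index=False), years_small.memory_usage(index=False))
```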


### Optimizing Object Columns
Many columns contain string values that repeat. We can reduce the memory this dataset needs by converting columns in which fewer than 50% of the values are unique to the category datatype.

In [14]:
for col in df.select_dtypes(include=['object']):
    num_unique_values = len(df[col].unique())
    num_total_values = len(df[col])
    print(col, num_unique_values, num_total_values)
    if num_unique_values / num_total_values < 0.5:
        df[col] = df[col].astype('category')

Call Type 9 75082
Initial Call Type 176 75082
Final Call Type 217 75082
Disposition 30 75082
Use of Force Indicator 2 75082
Subject Veteran Indicator 3 75082
CIT Officer Requested 2 75082
CIT Officer Dispatched 2 75082
CIT Officer Arrived 2 75082
Officer ID 1283 75082
Officer Gender 3 75082
Officer Race 9 75082
CIT Certified Indicator 2 75082
Officer Bureau Desc 7 75082
Officer Precinct Desc 27 75082
Officer Squad Desc 184 75082
Precinct 8 75082
Sector 18 75082
Beat 54 75082
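
The saving can be checked directly with `memory_usage(deep=True)`. The sketch below uses a made-up column of four repeating labels. A category column stores each distinct string once plus a small integer code per row, which is why low-cardinality columns shrink so much.

```python
import pandas as pd

# Illustrative column: 4 distinct labels repeated over 4,000 rows.
precinct = pd.Series(['North', 'East', 'West', 'SouthWest'] * 1000)

unique_ratio = precinct.nunique() / len(precinct)

# deep=True counts the string payloads, not just the pointers.
as_object = precinct.memory_usage(deep=True)
as_category = precinct.astype('category').memory_usage(deep=True)

print(unique_ratio, as_object, as_category)
```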


In [16]:
df.iloc[5:15]

Unnamed: 0,Template ID,Reported Date,Reported Time,Occurred Date / Time,Call Type,Initial Call Type,Final Call Type,Disposition,Use of Force Indicator,Subject Veteran Indicator,CIT Officer Requested,CIT Officer Dispatched,CIT Officer Arrived,Officer ID,Officer Gender,Officer Race,Officer Year of Birth,Officer Years of Experience,CIT Certified Indicator,Officer Bureau Desc,Officer Precinct Desc,Officer Squad Desc,Precinct,Sector,Beat
5,552651,1900-01-01,2022-09-09 12:00:00,2019-04-18 01:02:18,911,"DISTURBANCE, MISCELLANEOUS/OTHER",--CRISIS COMPLAINT - GENERAL,Resources Declined,N,N,N,N,Y,7490,M,White,1985,-1,N,,,,East,EDWARD,E1
6,43479,2015-05-15,2022-09-09 11:21:00,2015-05-15 18:10:23,911,"DISTURBANCE, MISCELLANEOUS/OTHER",--CRISIS COMPLAINT - GENERAL,Emergent Detention / ITA,N,N,N,N,Y,7754,M,White,1981,6,N,OPERATIONS BUREAU,WEST PCT,WEST PCT 2ND W - K/Q RELIEF,West,KING,K3
7,43469,2015-05-15,2022-09-09 03:57:00,2015-05-15 11:16:25,911,PERSON IN BEHAVIORAL/EMOTIONAL CRISIS,--CRISIS COMPLAINT - GENERAL,Mobile Crisis Team,N,N,Y,Y,N,7474,F,White,1969,7,N,OPERATIONS BUREAU,EAST PCT,EAST PCT 2ND W - EDWARD,East,EDWARD,E1
8,57315,2015-05-15,2022-09-09 08:30:00,2015-05-15 18:16:56,911,HAZ - POTENTIAL THRT TO PHYS SAFETY (NO HAZMAT),--DISTURBANCE - OTHER,Shelter Transport,N,N,N,N,Y,7666,F,White,1970,2,N,OPERATIONS BUREAU,EAST PCT,EAST PCT 3RD W - E/G RELIEF,,,
9,43946,2015-05-15,2022-09-09 10:14:00,2015-05-15 18:47:32,911,SUICIDE - IP/JO SUICIDAL PERSON AND ATTEMPTS,--CRISIS COMPLAINT - GENERAL,Emergent Detention / ITA,N,N,Y,Y,Y,6115,M,White,1968,39,N,OPERATIONS BUREAU,NORTH PCT,NORTH PCT 2ND W - BOY,North,LINCOLN,L2
10,43653,2015-05-15,2022-09-09 08:33:00,2015-05-15 19:48:18,911,PERSON IN BEHAVIORAL/EMOTIONAL CRISIS,--CRISIS COMPLAINT - GENERAL,No Action Possible / Necessary,N,N,N,N,Y,7789,M,Black or African American,1985,2,N,OPERATIONS BUREAU,WEST PCT,WEST PCT 3RD W - DAVID,West,KING,K3
11,43662,2015-05-15,2022-09-09 10:22:00,2015-05-15 21:33:28,ONVIEW,HAZ - POTENTIAL THRT TO PHYS SAFETY (NO HAZMAT),--CRISIS COMPLAINT - GENERAL,No Action Possible / Necessary,N,N,N,N,Y,7654,M,White,1984,8,N,OPERATIONS BUREAU,WEST PCT,WEST PCT 3RD W - KING,West,KING,K1
12,43992,2015-05-15,2022-09-09 10:56:00,2015-05-15 15:52:13,911,THEFT (DOES NOT INCLUDE SHOPLIFT OR SVCS),--THEFT - ALL OTHER,Geriatric Regional Assessment Team,N,N,N,N,N,7785,M,Two or More Races,1988,6,N,PROFESSIONAL STANDARDS BUREAU,TRAINING AND EDUCATION SECTION,TRAINING - FIELD TRAINING SQUAD,North,JOHN,J3
13,44026,2015-05-16,2022-09-09 03:51:00,2015-05-16 13:57:12,911,DIST - IP/JO - DV DIST - NO ASLT,--DISTURBANCE - JUVENILE,Mental Health Agency or Case Manager Notified,N,N,N,N,N,4831,M,White,1961,61,N,OPERATIONS BUREAU,SOUTHWEST PCT,SOUTHWEST PCT 2ND W - FRANK,SouthWest,FRANK,F2
14,43929,2015-05-16,2022-09-09 06:20:00,2015-05-16 17:29:10,911,SERVICE - WELFARE CHECK,--CRISIS COMPLAINT - GENERAL,No Action Possible / Necessary,N,N,N,N,N,4831,M,White,1961,30,N,OPERATIONS BUREAU,SOUTHWEST PCT,SOUTHWEST PCT 2ND W - FRANK,SouthWest,FRANK,F2


## Condensing Cleaning
Knowing what we know now, many of these operations can be consolidated and performed while reading the CSV file.

In [18]:
dtypes = {'Officer Year of Birth': 'int16', 'Officer Years of Experience': 'int8', 'Call Type': 'category'}
date_cols = ['Reported Date', 'Reported Time', 'Occurred Date / Time']

df = pd.read_csv('../Datasets/Crisis_Data.csv', dtype=dtypes, parse_dates=date_cols)
# df.info(memory_usage='deep')
df.iloc[5:20]

Unnamed: 0,Template ID,Reported Date,Reported Time,Occurred Date / Time,Call Type,Initial Call Type,Final Call Type,Disposition,Use of Force Indicator,Subject Veteran Indicator,CIT Officer Requested,CIT Officer Dispatched,CIT Officer Arrived,Officer ID,Officer Gender,Officer Race,Officer Year of Birth,Officer Years of Experience,CIT Certified Indicator,Officer Bureau Desc,Officer Precinct Desc,Officer Squad Desc,Precinct,Sector,Beat
5,552651,1900-01-01,2022-09-09 12:00:00,2019-04-18 01:02:18,911,"DISTURBANCE, MISCELLANEOUS/OTHER",--CRISIS COMPLAINT - GENERAL,Resources Declined,N,N,N,N,Y,7490,M,White,1985,-1,N,,,,East,EDWARD,E1
6,43479,2015-05-15,2022-09-09 11:21:00,2015-05-15 18:10:23,911,"DISTURBANCE, MISCELLANEOUS/OTHER",--CRISIS COMPLAINT - GENERAL,Emergent Detention / ITA,N,N,N,N,Y,7754,M,White,1981,6,N,OPERATIONS BUREAU,WEST PCT,WEST PCT 2ND W - K/Q RELIEF,West,KING,K3
7,43469,2015-05-15,2022-09-09 03:57:00,2015-05-15 11:16:25,911,PERSON IN BEHAVIORAL/EMOTIONAL CRISIS,--CRISIS COMPLAINT - GENERAL,Mobile Crisis Team,N,N,Y,Y,N,7474,F,White,1969,7,N,OPERATIONS BUREAU,EAST PCT,EAST PCT 2ND W - EDWARD,East,EDWARD,E1
8,57315,2015-05-15,2022-09-09 08:30:00,2015-05-15 18:16:56,911,HAZ - POTENTIAL THRT TO PHYS SAFETY (NO HAZMAT),--DISTURBANCE - OTHER,Shelter Transport,N,N,N,N,Y,7666,F,White,1970,2,N,OPERATIONS BUREAU,EAST PCT,EAST PCT 3RD W - E/G RELIEF,-,-,-
9,43946,2015-05-15,2022-09-09 10:14:00,2015-05-15 18:47:32,911,SUICIDE - IP/JO SUICIDAL PERSON AND ATTEMPTS,--CRISIS COMPLAINT - GENERAL,Emergent Detention / ITA,N,N,Y,Y,Y,6115,M,White,1968,39,N,OPERATIONS BUREAU,NORTH PCT,NORTH PCT 2ND W - BOY,North,LINCOLN,L2
10,43653,2015-05-15,2022-09-09 08:33:00,2015-05-15 19:48:18,911,PERSON IN BEHAVIORAL/EMOTIONAL CRISIS,--CRISIS COMPLAINT - GENERAL,No Action Possible / Necessary,N,N,N,N,Y,7789,M,Black or African American,1985,2,N,OPERATIONS BUREAU,WEST PCT,WEST PCT 3RD W - DAVID,West,KING,K3
11,43662,2015-05-15,2022-09-09 10:22:00,2015-05-15 21:33:28,ONVIEW,HAZ - POTENTIAL THRT TO PHYS SAFETY (NO HAZMAT),--CRISIS COMPLAINT - GENERAL,No Action Possible / Necessary,N,N,N,N,Y,7654,M,White,1984,8,N,OPERATIONS BUREAU,WEST PCT,WEST PCT 3RD W - KING,West,KING,K1
12,43992,2015-05-15,2022-09-09 10:56:00,2015-05-15 15:52:13,911,THEFT (DOES NOT INCLUDE SHOPLIFT OR SVCS),--THEFT - ALL OTHER,Geriatric Regional Assessment Team,N,N,N,N,N,7785,M,Two or More Races,1988,6,N,PROFESSIONAL STANDARDS BUREAU,TRAINING AND EDUCATION SECTION,TRAINING - FIELD TRAINING SQUAD,North,JOHN,J3
13,44026,2015-05-16,2022-09-09 03:51:00,2015-05-16 13:57:12,911,DIST - IP/JO - DV DIST - NO ASLT,--DISTURBANCE - JUVENILE,Mental Health Agency or Case Manager Notified,N,N,N,N,N,4831,M,White,1961,61,N,OPERATIONS BUREAU,SOUTHWEST PCT,SOUTHWEST PCT 2ND W - FRANK,SouthWest,FRANK,F2
14,43929,2015-05-16,2022-09-09 06:20:00,2015-05-16 17:29:10,911,SERVICE - WELFARE CHECK,--CRISIS COMPLAINT - GENERAL,No Action Possible / Necessary,N,N,N,N,N,4831,M,White,1961,30,N,OPERATIONS BUREAU,SOUTHWEST PCT,SOUTHWEST PCT 2ND W - FRANK,SouthWest,FRANK,F2
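
The dash placeholder can also be handled at read time. The sketch below is a hedged illustration on an in-memory CSV with made-up rows and a reduced set of columns. It assumes that every lone `-` in the file means "missing", which `na_values` then converts to NaN while the file is parsed.

```python
import io
import pandas as pd

# Illustrative two-row CSV; the real file has 25 columns.
csv_text = """Template ID,Reported Date,Call Type,Officer Years of Experience,Precinct
57315,2015-05-15T00:00:00,911,2,-
43946,2015-05-15T00:00:00,-,39,North
"""

toy = pd.read_csv(
    io.StringIO(csv_text),
    na_values=['-'],  # treat a lone dash as missing in every column
    dtype={'Call Type': 'category', 'Officer Years of Experience': 'int8'},
    parse_dates=['Reported Date'],
)

print(toy.dtypes)
print(toy.isna().sum())
```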


In [15]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75082 entries, 0 to 75081
Data columns (total 25 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   Template ID                  75082 non-null  int64         
 1   Reported Date                75082 non-null  datetime64[ns]
 2   Reported Time                75082 non-null  datetime64[ns]
 3   Occurred Date / Time         75082 non-null  datetime64[ns]
 4   Call Type                    69482 non-null  category      
 5   Initial Call Type            69482 non-null  category      
 6   Final Call Type              69482 non-null  category      
 7   Disposition                  73463 non-null  category      
 8   Use of Force Indicator       75082 non-null  category      
 9   Subject Veteran Indicator    75081 non-null  category      
 10  CIT Officer Requested        75082 non-null  category      
 11  CIT Officer Dispatched       75082 non-nu

What percentage of crisis calls per year result in a CIT-certified officer responding?
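
One possible way to approach this question, sketched on made-up rows: group a boolean "certified" flag by report year and take the mean. This assumes that `CIT Certified Indicator == 'Y'` identifies a certified responding officer and that `Reported Date` has already been parsed; both are assumptions, not findings from the data.

```python
import pandas as pd

# Illustrative rows only.
toy = pd.DataFrame({
    'Reported Date': pd.to_datetime(
        ['2015-05-15', '2015-06-01', '2016-01-10', '2016-02-11']
    ),
    'CIT Certified Indicator': ['Y', 'N', 'Y', 'Y'],
})

# True where the responding officer is flagged as CIT certified.
is_certified = toy['CIT Certified Indicator'].eq('Y')

# The mean of a boolean series is the share of True values.
pct_by_year = is_certified.groupby(toy['Reported Date'].dt.year).mean() * 100

print(pct_by_year)
```

Rows with the 1900-01-01 placeholder date would form their own "1900" group, so they should be excluded or set to missing before grouping.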

There exist 51 beats in Seattle. Why does this dataset have 54 unique beats? Which beats are not present in the dataset?
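
A set comparison is one way to start on this. In the sketch below, both the `official_beats` set and the beat values are hypothetical; a real list of the 51 beats would have to come from an authoritative source. Note that `unique()` counts a missing value as its own entry, whereas `nunique()` does not, so a count obtained with `len(df[col].unique())` can include the missing marker.

```python
import numpy as np
import pandas as pd

# Hypothetical reference list; replace with the real list of beats.
official_beats = {'B1', 'B2', 'E1', 'K3'}

# Illustrative column with one missing value and one unexpected code.
beat_column = pd.Series(['E1', 'K3', np.nan, 'K3', 'X9'])

dataset_beats = set(beat_column.dropna().unique())

absent_from_data = official_beats - dataset_beats    # official, never seen
unexpected_in_data = dataset_beats - official_beats  # seen, not official

print(len(beat_column.unique()), beat_column.nunique())
print(sorted(absent_from_data), sorted(unexpected_in_data))
```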