# Police Data Cleaning
## Crisis Data Dataset
### Exploration

Preview the data.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)
df = pd.read_csv('../Datasets/Crisis_Data.csv')
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75082 entries, 0 to 75081
Data columns (total 25 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Template ID                  75082 non-null  int64 
 1   Reported Date                75082 non-null  object
 2   Reported Time                75082 non-null  object
 3   Occurred Date / Time         75082 non-null  object
 4   Call Type                    75082 non-null  object
 5   Initial Call Type            75082 non-null  object
 6   Final Call Type              75082 non-null  object
 7   Disposition                  75082 non-null  object
 8   Use of Force Indicator       75082 non-null  object
 9   Subject Veteran Indicator    75082 non-null  object
 10  CIT Officer Requested        75082 non-null  object
 11  CIT Officer Dispatched       75082 non-null  object
 12  CIT Officer Arrived          75082 non-null  object
 13  Officer ID                   75

### Dealing with Missing Values
Some columns mark missing values with a dash (`-`). To make the missing data consistent, let's replace the dash values with None.

In [2]:
df.iloc[8:9,22:25]

Unnamed: 0,Precinct,Sector,Beat
8,-,-,-


In [3]:
df.replace({'-': None}, inplace=True)
df.iloc[8:9,22:25]

Unnamed: 0,Precinct,Sector,Beat
8,,,


In [4]:
df.shape[0] - df.dropna().shape[0]

10328

10,328 rows have at least one missing value.
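As a small sketch of what the count above measures, the snippet below compares the `dropna()` subtraction with a direct count via `isna().any(axis=1)`, and also breaks the missing values down per column. The `toy` frame and its values are made up for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for a few location columns (illustrative values only).
toy = pd.DataFrame({
    'Precinct': ['East', None, 'North'],
    'Sector': ['EDWARD', None, None],
})

# Same approach as the cell above: total rows minus fully populated rows.
rows_via_dropna = toy.shape[0] - toy.dropna().shape[0]

# Equivalent direct count: rows where any column is missing.
rows_via_isna = int(toy.isna().any(axis=1).sum())

# Per-column breakdown shows where the gaps are concentrated.
per_column = toy.isna().sum()

print(rows_via_dropna, rows_via_isna)
print(per_column)
```

The per-column view is often more useful than the row total when deciding which columns need attention.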

In [5]:
df[(df['Officer Years of Experience'] == -1)].shape[0]


1056

In [6]:
df[(df['Officer Year of Birth'] == 1900)].shape[0]


377

Some missing data remains in the integer columns, encoded as placeholder values: officers born in the year 1900, or officers with -1 years of experience.
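One possible way to handle these placeholders, sketched here on made-up values, is to turn them into real missing values with `Series.mask`. This is an optional step that the notebook itself does not take: the columns would become floating point (or would need a nullable integer dtype), which conflicts with the plain integer casts used later. The `toy` frame and the `sentinels` mapping are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample of the two officer columns (illustrative values only).
toy = pd.DataFrame({
    'Officer Year of Birth': [1973, 1900, 1987],
    'Officer Years of Experience': [3, 15, -1],
})

# Placeholder value that encodes "unknown" in each column.
sentinels = {'Officer Year of Birth': 1900, 'Officer Years of Experience': -1}

cleaned = toy.copy()
for col, sentinel in sentinels.items():
    # mask() replaces entries where the condition is True with NaN,
    # which converts the integer column to float64.
    cleaned[col] = cleaned[col].mask(cleaned[col] == sentinel)

missing_counts = cleaned.isna().sum()
print(missing_counts)
```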

### Optimizing Datetime Columns

Let's look at optimizing the date columns.

In [7]:
df['Reported Date'].value_counts()['1900-01-01T00:00:00']

6

In [8]:
df['Occurred Date / Time'].value_counts()['01/01/1900 12:00:00 AM']

5600

Some dates are listed as January 1, 1900. This must be a placeholder for missing or unknown data. Out of the 75,082 entries, 6 have an unknown reported date and 5,600 have an unknown occurred date.

In [9]:
date_cols = ['Reported Date', 'Reported Time', 'Occurred Date / Time']
for col in date_cols:
    df[col] = pd.to_datetime(df[col])
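A minimal sketch of a stricter variant, assuming the `Occurred Date / Time` strings all follow the `MM/DD/YYYY HH:MM:SS AM/PM` pattern seen above: pass an explicit `format` to `pd.to_datetime`, then convert the 1900 placeholder to `NaT`. The second sample value is illustrative, and whether to blank out the placeholder is a design choice the notebook does not make.

```python
import pandas as pd

# Illustrative strings in the same layout as 'Occurred Date / Time'.
toy = pd.Series(['01/01/1900 12:00:00 AM', '05/16/2015 10:50:33 PM'])

# An explicit format avoids per-value format guessing and fails loudly
# if a value does not match the expected layout.
parsed = pd.to_datetime(toy, format='%m/%d/%Y %I:%M:%S %p')

# Treat the 1900-01-01 placeholder as missing (NaT).
placeholder = pd.Timestamp('1900-01-01')
occurred = parsed.mask(parsed == placeholder)

print(occurred)
```

Note that `Reported Time` holds only a time of day, so the date part of its parsed values (2022-09-09 in the preview further down) is likely filled in by the parser rather than read from the data.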


### Optimizing Numeric Columns
We can cast some columns, such as Officer Year of Birth, to an integer datatype and downcast them to the smallest subtype that fits. The Officer ID column has missing data, so it will remain an object datatype.


In [10]:
for col in ['Template ID', 'Officer Year of Birth', 'Officer Years of Experience']:
    df[col] = df[col].astype('int')
    df[col] = pd.to_numeric(df[col], downcast='integer')
    print(df[col].dtype, col)

int64 Template ID
int16 Officer Year of Birth
int8 Officer Years of Experience
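To illustrate how the downcast picks a subtype, here is a small sketch on made-up values: `pd.to_numeric(..., downcast='integer')` chooses the smallest signed integer type that can hold every value in the column. The two sample series are assumptions for illustration.

```python
import pandas as pd

# Illustrative values in the same ranges as the officer columns.
birth_years = pd.Series([1973, 1900, 1987], dtype='int64')
experience = pd.Series([3, 15, -1], dtype='int64')

# Years near 2000 exceed the int8 range (-128..127) but fit in int16.
small_birth = pd.to_numeric(birth_years, downcast='integer')

# Small counts, including the -1 placeholder, fit in int8.
small_exp = pd.to_numeric(experience, downcast='integer')

print(small_birth.dtype, small_exp.dtype)  # int16 int8
```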


### Optimizing Object Columns
Many columns have string values that repeat. We can reduce the space this dataset needs by converting columns in which fewer than 50% of the values are unique into the category datatype.

In [11]:
for col in df.select_dtypes(include=['object']):
    num_unique_values = len(df[col].unique())
    num_total_values = len(df[col])
    print(col, num_unique_values, num_total_values)
    if num_unique_values / num_total_values < 0.5:
        df[col] = df[col].astype('category')

Call Type 9 75082
Initial Call Type 176 75082
Final Call Type 217 75082
Disposition 30 75082
Use of Force Indicator 2 75082
Subject Veteran Indicator 3 75082
CIT Officer Requested 2 75082
CIT Officer Dispatched 2 75082
CIT Officer Arrived 2 75082
Officer ID 1283 75082
Officer Gender 3 75082
Officer Race 9 75082
CIT Certified Indicator 2 75082
Officer Bureau Desc 7 75082
Officer Precinct Desc 27 75082
Officer Squad Desc 184 75082
Precinct 8 75082
Sector 18 75082
Beat 54 75082
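The saving comes from storing each distinct string once and keeping a small integer code per row. A rough sketch of the effect, using made-up repeated values rather than the dataset itself:

```python
import pandas as pd

# 10,000 rows drawn from only two distinct strings (illustrative values).
values = pd.Series(['OPERATIONS BUREAU', 'PROFESSIONAL STANDARDS BUREAU'] * 5000)

as_object = values.astype('object')
as_category = values.astype('category')

# deep=True counts the Python string objects, not just the pointers.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
```

The exact byte counts depend on the pandas version and platform, but the category version should be far smaller whenever the number of distinct values is low relative to the row count.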


## Condensing Cleaning
Knowing what we do now, many of these operations can be consolidated and run while reading the CSV file.

In [12]:
dtypes = {
    'Template ID': 'int64',
    'Call Type': 'category',
    'Initial Call Type': 'category',
    'Final Call Type': 'category',
    'Disposition': 'category',
    'Use of Force Indicator': 'category',
    'Subject Veteran Indicator': 'category',
    'CIT Officer Requested': 'category',
    'CIT Officer Dispatched': 'category',
    'CIT Officer Arrived': 'category',
    'Officer ID': 'category',
    'Officer Gender': 'category',
    'Officer Race': 'category',
    'Officer Year of Birth': 'int16', 
    'Officer Years of Experience': 'int8', 
    'CIT Certified Indicator': 'category',
    'Officer Bureau Desc': 'category',
    'Officer Precinct Desc': 'category',
    'Officer Squad Desc': 'category',
    'Precinct': 'category',
    'Sector': 'category',
    'Beat': 'category',
}
date_cols = ['Reported Date', 'Reported Time', 'Occurred Date / Time']

df = pd.read_csv('../Datasets/Crisis_Data.csv', dtype=dtypes, parse_dates=date_cols)
df.replace({'-': None}, inplace=True)
df.iloc[15:20]

Unnamed: 0,Template ID,Reported Date,Reported Time,Occurred Date / Time,Call Type,Initial Call Type,Final Call Type,Disposition,Use of Force Indicator,Subject Veteran Indicator,CIT Officer Requested,CIT Officer Dispatched,CIT Officer Arrived,Officer ID,Officer Gender,Officer Race,Officer Year of Birth,Officer Years of Experience,CIT Certified Indicator,Officer Bureau Desc,Officer Precinct Desc,Officer Squad Desc,Precinct,Sector,Beat
15,44102,2015-05-16,2022-09-09 11:30:00,1900-01-01 00:00:00,,,,Voluntary Committal,N,N,N,N,Y,7685,M,White,1973,3,N,,,,,,
16,43982,2015-05-16,2022-09-09 11:07:00,2015-05-16 22:50:33,"TELEPHONE OTHER, NOT 911",SERVICE - WELFARE CHECK,--CRISIS COMPLAINT - GENERAL,Emergent Detention / ITA,N,N,N,N,Y,7402,M,White,1973,15,N,OPERATIONS BUREAU,EAST PCT,EAST PCT 3RD W - EDWARD,East,EDWARD,E1
17,43719,2015-05-16,2022-09-09 05:58:00,2015-05-16 03:15:18,"TELEPHONE OTHER, NOT 911",UNKNOWN - ANI/ALI - WRLS PHNS (INCL OPEN LINE),--CRISIS COMPLAINT - GENERAL,Unable to Contact,N,N,Y,Y,Y,7787,M,White,1987,0,N,PROFESSIONAL STANDARDS BUREAU,TRAINING AND EDUCATION SECTION,TRAINING - FIELD TRAINING SQUAD,North,JOHN,J1
18,43832,2015-05-16,2022-09-09 01:24:00,2015-05-16 10:14:07,911,"DISTURBANCE, MISCELLANEOUS/OTHER",--DISTURBANCE - OTHER,Resources Declined,N,N,N,N,Y,7634,M,White,1977,2,N,OPERATIONS BUREAU,EAST PCT,EAST PCT 1ST W - E/G RELIEF (CHARLIE),East,CHARLIE,C3
19,43897,2015-05-16,2022-09-09 03:52:00,1900-01-01 00:00:00,,,,Mobile Crisis Team,N,N,Y,Y,Y,4980,F,White,1962,30,N,,,,,,
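As a possible further consolidation, `read_csv` accepts an `na_values` argument, so the dashes could be treated as missing during parsing instead of in a separate `replace` call. The sketch below uses an in-memory CSV with made-up rows; applying it to the real file is an untested assumption.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the real file (illustrative rows).
csv_text = (
    "Template ID,Precinct,Sector,Beat\n"
    "1,East,EDWARD,E1\n"
    "2,-,-,-\n"
)

toy = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'Precinct': 'category', 'Sector': 'category', 'Beat': 'category'},
    na_values=['-'],  # treat a lone dash as missing while parsing
)

print(toy)
print(toy['Precinct'].cat.categories)
```

Because the dash is dropped before the category conversion, it never becomes one of the categories.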


In [13]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75082 entries, 0 to 75081
Data columns (total 25 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   Template ID                  75082 non-null  int64         
 1   Reported Date                75082 non-null  datetime64[ns]
 2   Reported Time                75082 non-null  datetime64[ns]
 3   Occurred Date / Time         75082 non-null  datetime64[ns]
 4   Call Type                    69482 non-null  category      
 5   Initial Call Type            69482 non-null  category      
 6   Final Call Type              69482 non-null  category      
 7   Disposition                  73463 non-null  category      
 8   Use of Force Indicator       75082 non-null  category      
 9   Subject Veteran Indicator    75081 non-null  category      
 10  CIT Officer Requested        75082 non-null  category      
 11  CIT Officer Dispatched       75082 non-nu

## Summary
The total size of the dataset was reduced from 106.9 MB to just 4.3 MB. 

## Further topics for investigation
What percentage of crisis calls per year result in a CIT-certified officer responding?

There exist 51 beats in Seattle. Why does this dataset have 54 unique beats? Which beats are not present in the dataset?