# Discipline Data Notes

In June, reporter Samantha Max requested what is supposed to be the definitive list of discipline data for the Metro Nashville Police Department. 

Here are notes about the file and a summary of possible issues we will have with the data. 


This is an overview of the discipline data, which came as one row per charge.

In [131]:
import os
import pandas as pd
import altair as alt

cwd = os.getcwd()
data_dir = os.path.join(cwd, 'data')
source_dir = os.path.join(data_dir,'source')
processed_dir = os.path.join(data_dir, 'processed')
pkl_dir = os.path.join(processed_dir,'pkl')

# Data is an xls file, so we won't be using read_csv 
discipline_xls = os.path.join(source_dir, 'Report 6-14-21 Data Request.xls')

In [20]:
# There are a lot of column datatypes to assign and column names to change 

info = {
    'dtypes' : {
        'Off: ENO': 'object', 
        'Off: Badge/ID number': 'object', 
        'Off: First name': 'object',
        'Off: Middle name': 'object', 
        'Off: Last name': 'object', 
        'Off: Race': 'category', 
        'Off: Sex': 'category',
        'Off: Title': 'object', 
        'Alleg: Allegation': 'object', 
        'Alleg: Finding': 'object',
        'Act: Action taken': 'object', 
        'Act: Narrative': 'object'
    },
    'columns' : { 
        'Off: Date hired': 'date_hired', 
        'Off: ENO': 'officer_employee_num', 
        'Off: Badge/ID number': 'badge_num',
        'Off: Date-of-birth': 'date_birth', 
        'Off: Employment end date': 'date_employment_over', 
        'Off: First name': 'first_name',
        'Off: Middle name': 'middle_name', 
        'Off: Last name': 'last_name', 
        'Off: Race': 'race', 
        'Off: Sex': 'sex',
        'Off: Title': 'title', 
        'Alleg: Allegation': 'allegation', 
        'Alleg: Finding': 'finding',
        'Act: Action taken': 'action_taken', 
        'Act: Action taken date': 'date_action_taken', 
        'Act: Narrative': 'narrative'
    }
}

In [132]:
# Date columns are parsed on import by passing a list to parse_dates
discipline_df = pd.read_excel(
    discipline_xls,
    parse_dates = [
        'Off: Date hired',
        'Off: Date-of-birth',
        'Off: Employment end date',
        'Act: Action taken date'
    ],
    dtype=info['dtypes']
)

discipline_df = discipline_df.rename(columns=info['columns'])

discipline_df.to_pickle(
    os.path.join(pkl_dir, 'cleaned_new_discipline_data.pkl')
)


## Overview 
There are 16 columns with 50,406 rows. The data comes with a lot of missing values. 

In [22]:
discipline_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50406 entries, 0 to 50405
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   date_hired            50008 non-null  datetime64[ns]
 1   officer_employee_num  50373 non-null  object        
 2   badge_num             49048 non-null  object        
 3   date_birth            49998 non-null  datetime64[ns]
 4   date_employment_over  17046 non-null  datetime64[ns]
 5   first_name            50405 non-null  object        
 6   middle_name           45311 non-null  object        
 7   last_name             50405 non-null  object        
 8   race                  50272 non-null  category      
 9   sex                   50302 non-null  category      
 10  title                 50204 non-null  object        
 11  allegation            32850 non-null  object        
 12  finding               30156 non-null  object        
 13  action_taken    
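
To rank columns by how incomplete they are, the null share per column can be computed in one pass. This is a minimal sketch on a small invented frame (`toy_df` and its values are hypothetical, not the MNPD data); the same expression should apply to `discipline_df`.

```python
import pandas as pd

# Hypothetical stand-in for discipline_df; all values are invented.
toy_df = pd.DataFrame({
    'last_name': ['Smith', 'Jones', 'Doe', 'Roe'],
    'allegation': ['Courtesy', None, None, 'Attendance'],
    'date_action_taken': pd.to_datetime(['2017-03-31', None, None, None]),
})

# isna() gives a True/False frame; the column mean is the share of nulls
null_share = toy_df.isna().mean().sort_values(ascending=False)
print(null_share)
```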

There is one row that is completely empty, so we'll drop it.

In [36]:
# axis=1 checks across the columns of each row
discipline_df[discipline_df.isnull().all(axis=1)]

Unnamed: 0,date_hired,officer_employee_num,badge_num,date_birth,date_employment_over,first_name,middle_name,last_name,race,sex,title,allegation,finding,action_taken,date_action_taken,narrative
29027,NaT,,,NaT,NaT,,,,,,,,,,NaT,


In [37]:
discipline_df = discipline_df.drop([29027])
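
Dropping by a hard-coded index works for this export, but the index would likely change if the department re-issues the file. A minimal sketch of dropping fully empty rows without naming an index, using a small invented frame (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for discipline_df; all values are invented.
toy_df = pd.DataFrame({
    'last_name': ['Smith', None, 'Jones'],
    'allegation': ['Courtesy', None, None],
    'date_action_taken': pd.to_datetime(['2017-03-31', None, None]),
})

# Rows where every column is null
empty_rows = toy_df[toy_df.isna().all(axis=1)]

# how='all' drops only rows in which every value is missing;
# partially filled rows such as 'Jones' are kept
cleaned = toy_df.dropna(how='all')
print(len(empty_rows), len(cleaned))
```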

## Duplicate Rows - what is one row in this dataset? 

Inspecting the data, it looks like there are a lot of duplicate rows. That raises the important question of what one row in the dataset represents. 

For example, Michael Dudley has 8 rows associated with the allegation 'zDNU - Adherence to Law (13)', all with an action taken on 3/31/2017. Those rows carry two distinct actions taken. Why are there 8 rows, and what do the duplicates mean? Do we have a way of narrowing down to a single allegation? A single discipline? With more information, this might turn out not to be an issue.

In [87]:
dudley_mask = (
    (discipline_df.first_name == 'Michael')
    & (discipline_df.last_name == 'Dudley')
    & (discipline_df.allegation == 'zDNU - Adherence to Law (13)')
)

discipline_df[dudley_mask]

Unnamed: 0,date_hired,officer_employee_num,badge_num,date_birth,date_employment_over,first_name,middle_name,last_name,race,sex,title,allegation,finding,action_taken,date_action_taken,narrative
3,2001-02-16,402646,39581,1970-09-25,2018-11-17,Michael,L,Dudley,White,Male,PO,zDNU - Adherence to Law (13),Agree w/Recommended Disc,zDNU - Suspend/Demot,2017-03-31,Suspend/Demot\r\nAction Taken Detials: FIVE (5...
4,2001-02-16,402646,39581,1970-09-25,2018-11-17,Michael,L,Dudley,White,Male,PO,zDNU - Adherence to Law (13),Agree w/Recommended Disc,zDNU - Suspend/Demot,2017-03-31,Suspend/Demot\r\nAction Taken Detials: FIVE (5...
7,2001-02-16,402646,39581,1970-09-25,2018-11-17,Michael,L,Dudley,White,Male,PO,zDNU - Adherence to Law (13),Agree w/Recommended Disc,zDNU - Suspend/Demot,2017-03-31,Suspend/Demot\r\nAction Taken Detials: FIVE (5...
8,2001-02-16,402646,39581,1970-09-25,2018-11-17,Michael,L,Dudley,White,Male,PO,zDNU - Adherence to Law (13),Agree w/Recommended Disc,zDNU - Suspend/Demot,2017-03-31,Suspend/Demot\r\nAction Taken Detials: FIVE (5...
11,2001-02-16,402646,39581,1970-09-25,2018-11-17,Michael,L,Dudley,White,Male,PO,zDNU - Adherence to Law (13),Agree w/Recommended Disc,zDNU - See 313-313A Disc Action,2017-03-31,SEE 313-313A DISC ACTION\r\nAction Taken Detia...
12,2001-02-16,402646,39581,1970-09-25,2018-11-17,Michael,L,Dudley,White,Male,PO,zDNU - Adherence to Law (13),Agree w/Recommended Disc,zDNU - See 313-313A Disc Action,2017-03-31,SEE 313-313A DISC ACTION\r\nAction Taken Detia...
928,2001-02-16,402646,39581,1970-09-25,2018-11-17,Michael,L,Dudley,White,Male,PO,zDNU - Adherence to Law (13),Agree w/Recommended Disc,zDNU - See 313-313A Disc Action,2017-03-31,SEE 313-313A DISC ACTION\r\nAction Taken Detia...
929,2001-02-16,402646,39581,1970-09-25,2018-11-17,Michael,L,Dudley,White,Male,PO,zDNU - Adherence to Law (13),Agree w/Recommended Disc,zDNU - See 313-313A Disc Action,2017-03-31,SEE 313-313A DISC ACTION\r\nAction Taken Detia...
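
One way to put a number on the duplication is to count exact repeats and see how often each distinct row occurs. The sketch below uses an invented frame (`toy_df` is hypothetical) shaped like the example above. Note that the narrative column is truncated in the display, so rows that look identical on screen may still differ in their full text.

```python
import pandas as pd

# Hypothetical rows shaped like the example above; all values are invented.
toy_df = pd.DataFrame({
    'officer_employee_num': ['100', '100', '100', '100', '200'],
    'allegation': ['A', 'A', 'A', 'A', 'B'],
    'action_taken': ['Suspension', 'Suspension', 'See 313', 'See 313', 'Reprimand'],
    'date_action_taken': pd.to_datetime(['2017-03-31'] * 4 + ['2018-01-05']),
})

# Rows that are exact copies of an earlier row
n_exact_dupes = toy_df.duplicated().sum()

# How many times each distinct row occurs (dropna=False keeps null keys)
row_counts = (
    toy_df
    .groupby(list(toy_df.columns), dropna=False)
    .size()
    .reset_index(name='n_rows')
)
print(n_exact_dupes)
print(row_counts)
```

If most groups have `n_rows` greater than 1, that would suggest the export repeats rows for a reason we should ask the department about, rather than recording separate charges.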


## Badge Number and Employee Number

Badge number or employee number is needed to match a roster of officers to the discipline data. We should use employee number because it is more complete.


### Employee Number
Employee number is nearly complete. There are only 32 null values, spread across 23 distinct last names, which we use here as a rough count of employees.

In [46]:
null_employee_nums = discipline_df[discipline_df.officer_employee_num.isna()]

null_employee_nums

Unnamed: 0,date_hired,officer_employee_num,badge_num,date_birth,date_employment_over,first_name,middle_name,last_name,race,sex,title,allegation,finding,action_taken,date_action_taken,narrative
11168,2020-02-16,,,1994-12-08,NaT,Dillon,,Gann,White,Male,POT,"MNPD Manual 4.20.050 K Use of Alcohol, Drugs o...",,,NaT,
11169,2020-02-16,,,1994-12-08,NaT,Dillon,,Gann,White,Male,POT,MNPD Manual 4.20.040 D Conduct Unbecoming an E...,,,NaT,
13097,2018-10-16,,,1985-01-09,2019-08-16,Justin,,Spencer,White,Male,PT,MNPD Manual 4.20.040 A Adherence to Policy & R...,Sustained,Dismissal,2019-08-16,
14219,NaT,,,NaT,NaT,David,,Harms,White,Male,PO II,MNPD Manual 4.20.040 G Courtesy,,,NaT,
15177,2019-08-16,,,1996-04-06,NaT,Dallas,,Johnson,White,Male,POT,MNPD Manual 4.20.040 K Obstruction of Rights,,,NaT,
15232,NaT,,,NaT,NaT,HARHEEN,R,YUNUS,,,,MNPD Manual 4.20.040 A Adherence to Policy & R...,Sustained,Dismissal,2018-05-01,
19512,2019-08-16,,,1993-07-04,NaT,Matthew,E,Herod,White,Male,POT,,,,NaT,
21945,1992-08-23,,,1992-08-23,NaT,Cameron,,Schmid,White,Male,,MNPD Manual 4.20.040 A Adherence to Policy & R...,,,NaT,
21946,1992-08-23,,,1992-08-23,NaT,Cameron,,Schmid,White,Male,,MNPD Manual 4.20.040 A Adherence to Policy & R...,,,NaT,
22894,1991-05-01,,,1991-11-30,NaT,Michael,,Sposito,White,Male,POT,MNPD Manual 4.20.040 A Adherence to Policy & R...,Sustained,,NaT,


In [44]:
discipline_df.officer_employee_num.isna().sum()

32

In [49]:
len(null_employee_nums.last_name.unique())

23

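There are 2,590 unique values in the employee number column. Because `unique()` counts the null as a value, that is 2,589 distinct employee numbers.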

In [52]:
len(discipline_df.officer_employee_num.unique())

2590

### Badge Number
Badge number is less complete than employee number, with 1,357 nulls spread across 220 distinct last names. This may be because some of the people in the data are not officers.

In [58]:
null_badge_nums = discipline_df[discipline_df.badge_num.isna()]

null_badge_nums

Unnamed: 0,date_hired,officer_employee_num,badge_num,date_birth,date_employment_over,first_name,middle_name,last_name,race,sex,title,allegation,finding,action_taken,date_action_taken,narrative
407,NaT,226532,,NaT,2015-01-06,Melissa,Ann,Johnson,White,Female,POII,zDNU - Devoting Entire Time to Duty (09),Matter of Record,zDNU - None,2012-09-15,None\r\nAction Taken Detials: due to the compl...
743,NaT,251667,,NaT,2017-01-19,Jennifer,,Parker-Ayers,Black,Female,PCCII,zDNU - Conduct Unbecoming (13),Sustained,zDNU - Resigned w/Under Invest,NaT,Resigned w/Under Invest\r\nAction Taken Detial...
784,NaT,675829,,NaT,NaT,Stephen,Charles,Fouche,White,Male,POII,zDNU - Adherence2Policy&Rules (New09),Agree w/Recommended Disc,zDNU - Suspension,2010-05-25,Suspension\r\nAction Taken Detials: One (1) da...
868,NaT,179466,,NaT,NaT,Tara,,Thurman,White,Female,POAI,zDNU - Attendance (13),Rev'd-Signed,zDNU - Written Reprimand,2017-03-27,Written Reprimand\r\nAction Taken Detials: 065...
881,NaT,727199,,NaT,NaT,Charle,S.,Eaton,White,Male,PSGT,zDNU - Care of Property (13),Rev'd-Signed,zDNU - Written Reprimand,2017-05-09,Written Reprimand\r\nAction Taken Detials: 100...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50306,2018-01-16,320581,,1988-09-01,NaT,Johnathan,,Sharp II,White,Male,PO,,,,NaT,
50309,2018-10-16,352213,,1996-01-12,NaT,Joel,,Cottrill,White,Male,POII,,,,NaT,
50338,2018-10-16,352201,,1982-09-27,NaT,Eric,L,Jackson,White,Male,POII,,,,NaT,
50360,NaT,320600,,1992-01-02,2019-01-20,Jonathan,,Watch,White,Male,PO,,,,NaT,


In [59]:
discipline_df.badge_num.isna().sum() 

1357

In [60]:
len(null_badge_nums.last_name.unique())

220

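There are 2,384 unique values in the badge number column. Because `unique()` counts the null as a value, that is 2,383 distinct badge numbers.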

In [61]:
len(discipline_df.badge_num.unique())

2384
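
The roster matching described in this section could look roughly like the sketch below. The roster frame, its `current_rank` column, and every value here are invented for illustration; we do not yet know what the real roster will contain.

```python
import pandas as pd

# Hypothetical frames; the roster and all values are invented.
discipline_toy = pd.DataFrame({
    'officer_employee_num': ['100', '100', '200', None],
    'last_name': ['Smith', 'Smith', 'Jones', 'Doe'],
})
roster_toy = pd.DataFrame({
    'officer_employee_num': ['100', '200', '300'],
    'current_rank': ['PO II', 'SGT', 'PO'],
})

# Set aside rows with no employee number before merging, because
# pandas matches null keys to each other in a merge.
keyed = discipline_toy.dropna(subset=['officer_employee_num'])

merged = keyed.merge(
    roster_toy,
    on='officer_employee_num',
    how='outer',
    indicator=True,  # adds a _merge column: both / left_only / right_only
)
match_summary = merged['_merge'].value_counts()
print(match_summary)
```

The `left_only` count would show discipline rows with no roster match, and `right_only` would show roster officers with no discipline rows.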

## Race 
There are a lot of values in the race column. It's possible we could reduce them to five: Black, White, Asian, Hispanic, and Unknown.

We should ask what T and 03 represent, since there are so many of them. 

There are 133 rows where there is no officer race. 

In [65]:
discipline_df.race.value_counts(dropna=False)

White              40304
Black               6899
T                   1286
03                   917
Asian                658
NaN                  133
P                     88
American Indian       63
Other                 24
Hispanic              22
Black/White            5
U                      3
Q                      2
Latino                 1
Name: race, dtype: int64
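
A reduction like this could be sketched with a lookup table. The mapping below is an assumption for illustration only: the file does not document what 'T', '03', 'P', 'U' or 'Q' mean, so they fall through to 'Unknown' here, and so do values such as 'American Indian' and 'Other', which we may prefer to keep separate.

```python
import pandas as pd

# Assumed mapping, not confirmed by the department.
race_map = {
    'White': 'White',
    'Black': 'Black',
    'Asian': 'Asian',
    'Hispanic': 'Hispanic',
    'Latino': 'Hispanic',
}

# Hypothetical sample of values seen in the race column
toy_race = pd.Series(['White', 'Black', 'T', '03', 'Latino', None, 'Asian'])

# Cast to object first so a category-typed column behaves the same way;
# anything not in the map, including nulls, becomes 'Unknown'
race_reduced = toy_race.astype('object').map(race_map).fillna('Unknown')
reduced_counts = race_reduced.value_counts()
print(reduced_counts)
```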

## Action Taken

There is no disposition date or accusation date in the data. There is an action taken date, but it is very incomplete. It has 32,626 null rows, which is about 65% of the data. Note that the cells below count nulls in `action_taken`, not `date_action_taken`.

In [66]:
discipline_df.action_taken.isna().sum()

32626

In [67]:
discipline_df.action_taken.isna().sum()/len(discipline_df)

0.6472770558476342

Without the date of the allegation, discipline, or disposition, we will have a very hard time analyzing whether there is any retaliation going on at the department. We need the date because an officer may have had a few disciplinary actions taken against them before they filed a grievance and none after. Without the ability to sort actions into "before" and "after" the grievance, we would be counting all discipline towards the analysis.


We could still look at the rate of discipline for officers that filed grievances and those that did not, but we would have to be very clear with the audience that we do not know if the discipline happened before or after the grievance. 
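
A rough version of that comparison is sketched below. The set of grievance filers and all of the rows are invented; the real list of filers would have to come from a separate source. The sketch pools all recorded actions regardless of timing, which is exactly the limitation described above.

```python
import pandas as pd

# Hypothetical inputs; none of these values come from the file.
toy_df = pd.DataFrame({
    'officer_employee_num': ['100', '100', '200', '300', '300', '300'],
    'action_taken': ['Suspension', None, 'Reprimand', None, 'Dismissal', 'Suspension'],
})
grievance_filers = {'100'}  # assumed set of employee numbers that filed grievances

# Count rows with a recorded action for each officer
per_officer = (
    toy_df
    .assign(has_action=toy_df.action_taken.notna())
    .groupby('officer_employee_num')['has_action']
    .sum()
    .rename('n_actions')
    .reset_index()
)
per_officer['filed_grievance'] = per_officer.officer_employee_num.isin(grievance_filers)

# Mean number of recorded actions per officer, by grievance status
rate_by_group = per_officer.groupby('filed_grievance')['n_actions'].mean()
print(rate_by_group)
```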


Let's look at Monica Blake. We know she filed two grievances in 2017. There are 59 rows related to her discipline in the data. Of those, 18 have no associated date and one has 1990 as the year, which is 15 years before she was hired. Together that is about 32% of her rows that we'd have to figure out how to handle.

In [78]:
blake_mask = (discipline_df.first_name == 'Monica')&(discipline_df.last_name == 'Blake')
blake_df = discipline_df[blake_mask]

len(blake_df)

59

In [79]:
blake_df.date_action_taken.isna().sum()

18
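
Both kinds of date problem, missing dates and dates earlier than the hire date, can be flagged in one pass. This is a minimal sketch on invented rows (`toy_df` and its dates are hypothetical, not taken from any officer's record):

```python
import pandas as pd

# Hypothetical rows; the dates are invented for illustration.
toy_df = pd.DataFrame({
    'date_hired': pd.to_datetime(['2005-06-01'] * 4),
    'date_action_taken': pd.to_datetime(
        ['2017-03-31', None, '1990-01-01', '2010-07-15']
    ),
})

missing_date = toy_df.date_action_taken.isna()
# Comparisons against NaT evaluate to False, so missing dates are not flagged here
before_hire = toy_df.date_action_taken < toy_df.date_hired

toy_df['date_problem'] = 'ok'
toy_df.loc[missing_date, 'date_problem'] = 'missing'
toy_df.loc[before_hire, 'date_problem'] = 'before hire'

# Share of rows that would need special handling
share_problem = (toy_df.date_problem != 'ok').mean()
print(toy_df.date_problem.value_counts())
print(share_problem)
```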

The missing dates don't seem to be evenly distributed. There is a clear pattern to the data that we do have.  


In [130]:
# Drop missing dates first so the years stay integers
years = discipline_df.date_action_taken.dropna().dt.year

# Name the columns explicitly so the result does not depend on the pandas version
data = years.value_counts().rename_axis('year').reset_index(name='count')

alt.Chart(data).mark_bar().encode(
    x='year:O',
    y='count:Q'
)

## Allegations

There are a lot of blank allegations. 17,556 rows have nothing in the allegation field, so we cannot use them to collapse from charges to incidents.

In [134]:
discipline_df.allegation.isna().sum()

17556

The allegations are also difficult to reduce from charges to incidents. Manually comparing one name in the new data, John Hatcher, to the old data, shows the difficulty. 

There are 12 rows for John Hatcher in the new data and only 4 incidents, based on the old data. But the new data lists 5 distinct allegations, along with many additional blank rows.
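
One possible starting point for collapsing charges is to group rows by employee number and action date. That grouping rests on an untested assumption: that charges for one officer sharing an action date belong to one incident. The sketch below uses invented rows (`toy_df` is hypothetical) and also shows how blank allegations and missing dates limit the approach.

```python
import pandas as pd

# Hypothetical rows for one officer; all values are invented.
toy_df = pd.DataFrame({
    'officer_employee_num': ['100'] * 5,
    'allegation': ['Courtesy', 'Courtesy', 'Care of Property', None, None],
    'date_action_taken': pd.to_datetime(
        ['2017-03-31', '2017-03-31', '2017-03-31', '2018-05-01', None]
    ),
})

# Assumption: same officer + same action date = one candidate incident.
# Rows with no date cannot be grouped this way and are set aside.
dated = toy_df.dropna(subset=['date_action_taken'])
candidate_incidents = (
    dated
    .groupby(['officer_employee_num', 'date_action_taken'])
    .agg(
        n_rows=('allegation', 'size'),                     # counts blank allegations too
        n_distinct_allegations=('allegation', 'nunique'),  # ignores blank allegations
    )
    .reset_index()
)
n_undated = toy_df.date_action_taken.isna().sum()
print(candidate_incidents)
print(n_undated)
```

Comparing the number of candidate incidents for a known officer against the old data would be one way to check whether the assumption holds.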