Project website: https://icasas101.github.io/FinalDSTutorial/

# Identifying Bias in the NOPD

By Josh Kellner and Isabella Casas

CMPS 3160 - Introduction to Data Science - Professor Mattei

## Introduction

### Background

Police brutality against black people in our country has been discussed more widely than ever over the last year because of the widespread protests against it across the country. The issue is by no means new, but the ability of social media to spread information and set trends so quickly has put it under a major spotlight.

Some of the questions we would like to answer are as follows:
1. Are police more likely to question, search, and/or take more severe actions against people of color?
2. Could gender have an effect on how likely someone is to be questioned/searched?
3. Could police be biased against people of certain classes?

Ideally, any unfair biases uncovered by this project would inform new or revised policy that corrects these discriminatory practices, and would prompt city officials to put that policy in place. For now we are focusing on New Orleans, but if comparable datasets are accessible, a possible extension would be to compare our conclusions with those drawn from other cities’ data.

### About our dataset

Link to dataset: https://data.nola.gov/Public-Safety-and-Preparedness/Stop-and-Search-Field-Interviews-/kitu-f4uy/data

For our Final Tutorial, we have partnered up to analyze a dataset called “Stop and Search (Field Interviews).” It records instances of people being questioned by the New Orleans Police Department. Each record includes when and where the interview happened, the officer conducting the questioning and any search, a description of the subject (age, gender, race, height, and weight), the reason the interview was conducted, the actions taken, and more. We plan to analyze this information so that readers can learn whether the NOPD shows any biases and, if so, how those biases manifest themselves. Specifically, we expect to look at how the frequency of interviews and searches relates to the subjects' descriptors, and at how the severity of the actions taken by police relates to those same descriptors. The dataset also describes the car the subject was driving, if any, which is another variable that can shed light on biases.

### Collaboration plan

In terms of a collaboration plan, we have a GitHub repository set up to track our most up-to-date work as well as each update. Every two weeks we plan to meet on Zoom to divide up specific chunks of work. In these meetings we will review the work done since the last meeting and work together through anything we couldn’t complete individually.

## Data ETL

### Extraction

Our first step was to import the necessary libraries and then download the data file.

In [2]:
import pandas as pd

In [3]:
# `head` is a Unix command and is unavailable in the Windows shell, so preview the first 10 raw lines in Python instead.
with open("../FinalDSTutorial/Stop_and_Search__Field_Interviews_.csv") as f:
    print("".join(f.readline() for _ in range(10)))


### Load

In [36]:
# REMOVE COMMENT BEFORE TURNING IN: We need census data! I requested it from the data center but haven't received it yet.

In [11]:
df = pd.read_csv("../FinalDSTutorial/Stop_and_Search__Field_Interviews_.csv", dtype={'FieldInterviewID': int})
df.head()

Unnamed: 0,FieldInterviewID,NOPD_Item,EventDate,District,Zone,OfficerAssignment,StopDescription,ActionsTaken,VehicleYear,VehicleMake,...,SubjectWeight,SubjectEyeColor,SubjectHairColor,SubjectDriverLicState,CreatedDateTime,LastModifiedDateTime,Longitude,Latitude,Zip,BlockAddress
0,17415,,01/01/2010 01:11:00 AM,6,E,6th District,TRAFFIC VIOLATION,,2005.0,DODGE,...,160.0,Brown,Black,LA,01/01/2010 01:26:26 AM,,0.0,0.0,,
1,17416,,01/01/2010 02:06:00 AM,5,D,5th District,CALL FOR SERVICE,,,,...,140.0,Brown,Black,,01/01/2010 02:27:38 AM,,0.0,0.0,,
2,17416,,01/01/2010 02:06:00 AM,5,D,5th District,CALL FOR SERVICE,,,,...,145.0,Brown,Black,,01/01/2010 02:27:38 AM,,0.0,0.0,,
3,17416,,01/01/2010 02:06:00 AM,5,D,5th District,CALL FOR SERVICE,,,,...,140.0,Brown,Black,,01/01/2010 02:27:38 AM,,0.0,0.0,,
4,17416,,01/01/2010 02:06:00 AM,5,D,5th District,CALL FOR SERVICE,,,,...,140.0,Brown,Black,,01/01/2010 02:27:38 AM,,0.0,0.0,,


### Transform

As you can see, our original dataframe had some messy information, so our next step was to clean it up. First, we dropped any columns that were not necessary for our analysis.

In [29]:
# DELETE THIS COMMENT BEFORE TURNING IN: let me know if any of these columns SHOULD NOT be dropped.
dropped_df = df.drop(columns=['NOPD_Item', 'VehicleYear', 'VehicleMake', 'VehicleModel', 'VehicleStyle', 'VehicleColor', 'SubjectWeight', 'SubjectHeight', 'SubjectEyeColor', 'SubjectHairColor'])
dropped_df.head()

Unnamed: 0,FieldInterviewID,EventDate,District,Zone,OfficerAssignment,StopDescription,ActionsTaken,SubjectID,SubjectRace,SubjectGender,SubjectAge,SubjectHasPhotoID,SubjectDriverLicState,CreatedDateTime,LastModifiedDateTime,Longitude,Latitude,Zip,BlockAddress
0,17415,01/01/2010 01:11:00 AM,6,E,6th District,TRAFFIC VIOLATION,,20465.0,BLACK,FEMALE,26.0,Yes,LA,01/01/2010 01:26:26 AM,,0.0,0.0,,
1,17416,01/01/2010 02:06:00 AM,5,D,5th District,CALL FOR SERVICE,,20466.0,BLACK,MALE,17.0,No,,01/01/2010 02:27:38 AM,,0.0,0.0,,
2,17416,01/01/2010 02:06:00 AM,5,D,5th District,CALL FOR SERVICE,,20467.0,BLACK,MALE,18.0,No,,01/01/2010 02:27:38 AM,,0.0,0.0,,
3,17416,01/01/2010 02:06:00 AM,5,D,5th District,CALL FOR SERVICE,,20468.0,BLACK,MALE,18.0,No,,01/01/2010 02:27:38 AM,,0.0,0.0,,
4,17416,01/01/2010 02:06:00 AM,5,D,5th District,CALL FOR SERVICE,,20469.0,BLACK,MALE,30.0,No,,01/01/2010 02:27:38 AM,,0.0,0.0,,
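The sample rows above also suggest two further cleanup steps, sketched here on a small hand-made frame rather than the real data. `toy_df` is a stand-in introduced for illustration, and treating a coordinate of exactly `0.0` as "not recorded" is an assumption based only on the sample rows, not something the dataset documentation confirms.

```python
import numpy as np
import pandas as pd

# Toy stand-in for dropped_df, using values copied from the sample rows above.
toy_df = pd.DataFrame({
    "EventDate": ["01/01/2010 01:11:00 AM", "01/01/2010 02:06:00 AM"],
    "Longitude": [0.0, 0.0],
    "Latitude": [0.0, 0.0],
})

# Parse the month/day/year 12-hour timestamps into datetime64 values.
toy_df["EventDate"] = pd.to_datetime(toy_df["EventDate"], format="%m/%d/%Y %I:%M:%S %p")

# ASSUMPTION: a coordinate of exactly 0.0 means the location was not recorded.
toy_df[["Longitude", "Latitude"]] = toy_df[["Longitude", "Latitude"]].replace(0.0, np.nan)

print(toy_df.dtypes)
```

With `EventDate` stored as a datetime, stops could later be grouped by year, month, or hour of day.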


Next, we set a two-level index of FieldInterviewID and SubjectID, so that each interview is grouped under a single ID while every individual involved in it still appears as a separate row.

In [30]:
index_df = dropped_df.set_index(['FieldInterviewID', 'SubjectID'])
index_df.head()

FieldInterviewID,SubjectID,EventDate,District,Zone,OfficerAssignment,StopDescription,ActionsTaken,SubjectRace,SubjectGender,SubjectAge,SubjectHasPhotoID,SubjectDriverLicState,CreatedDateTime,LastModifiedDateTime,Longitude,Latitude,Zip,BlockAddress
17415,20465.0,01/01/2010 01:11:00 AM,6,E,6th District,TRAFFIC VIOLATION,,BLACK,FEMALE,26.0,Yes,LA,01/01/2010 01:26:26 AM,,0.0,0.0,,
17416,20466.0,01/01/2010 02:06:00 AM,5,D,5th District,CALL FOR SERVICE,,BLACK,MALE,17.0,No,,01/01/2010 02:27:38 AM,,0.0,0.0,,
17416,20467.0,01/01/2010 02:06:00 AM,5,D,5th District,CALL FOR SERVICE,,BLACK,MALE,18.0,No,,01/01/2010 02:27:38 AM,,0.0,0.0,,
17416,20468.0,01/01/2010 02:06:00 AM,5,D,5th District,CALL FOR SERVICE,,BLACK,MALE,18.0,No,,01/01/2010 02:27:38 AM,,0.0,0.0,,
17416,20469.0,01/01/2010 02:06:00 AM,5,D,5th District,CALL FOR SERVICE,,BLACK,MALE,30.0,No,,01/01/2010 02:27:38 AM,,0.0,0.0,,
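To illustrate how the two-level index exposes the number of individuals involved in a single interview, here is a minimal sketch. It uses a small hand-made frame whose IDs mirror the sample rows above; `toy_df`, `toy_index_df`, and `subjects_per_interview` are names introduced for illustration only.

```python
import pandas as pd

# Toy stand-in for dropped_df, mirroring the IDs in the sample rows above.
toy_df = pd.DataFrame({
    "FieldInterviewID": [17415, 17416, 17416, 17416, 17416],
    "SubjectID": [20465.0, 20466.0, 20467.0, 20468.0, 20469.0],
    "StopDescription": ["TRAFFIC VIOLATION"] + ["CALL FOR SERVICE"] * 4,
})
toy_index_df = toy_df.set_index(["FieldInterviewID", "SubjectID"])

# Count the subject rows that fall under each interview ID.
subjects_per_interview = toy_index_df.groupby(level="FieldInterviewID").size()
print(subjects_per_interview)
```

The same `groupby(level="FieldInterviewID").size()` call should work on `index_df` itself, since it only relies on the index level name.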


In [34]:
index_df.ActionsTaken.unique()

array([nan,
       'Stop Results: Physical Arrest;Subject Type: Passenger;Search Occurred: Yes;Search Types: Vehicle;Search Types: Pat-down;Search Types: Passenger(s);Legal Basises: Probable cause;Legal Basises: Plain view;Evidence Seized: Yes;Evidence Types: Weapon(s)',
       'Stop Results: Citation Issued;Subject Type: Driver;Search Occurred: No;Legal Basises: Probable cause;Evidence Seized: No',
       ...,
       'Stop Results: No action taken;Subject Type: Driver;Search Occurred: Yes;Evidence Seized: No;Legal Basises: Incident to arrest;Legal Basises: Vehicle Exception;Consent To Search: No;Exit Vehicle: Yes;Search Type Pat Down: Yes;Consent Form Completed: No;StripBody Cavity Search: No',
       'Stop Results: Summons Issued;Subject Type: Passenger;Search Occurred: Yes;Evidence Seized: Yes;Evidence Types: Other;Legal Basises: Incident to arrest;Legal Basises: Vehicle Exception;Consent To Search: No;Exit Vehicle: Yes;Search Type Pat Down: Yes;Consent Form Completed: No;StripBody 

In [35]:
# REMOVE BEFORE TURNING IN: We need to split this based on above values somehow but I'm stuck rn
#index_df[['StopResults', 'SubjectType', 'SearchOccurred', 'SearchTypes', 'LegalBasises', 'EvidenceSeized']] = index_df.ActionsTaken.str.split(';', expand=True)

ValueError: Columns must be same length as key
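The split fails because `str.split(';', expand=True)` returns one column per segment, up to the largest number of segments in any row, which does not match the six names on the left-hand side. The segments are also not positional: as the `unique()` output shows, keys vary from row to row and some (such as `Search Types`) repeat. One possible way around this, offered as a sketch rather than the final approach, is to parse each string into key/value pairs and build one column per key. `parse_actions` is a hypothetical helper, and the toy series reuses strings from the output above.

```python
import pandas as pd

def parse_actions(actions):
    """Turn 'Key: Value;Key: Value' into {key: 'v1, v2'}; NaN gives an empty dict."""
    if pd.isna(actions):
        return {}
    parsed = {}
    for segment in actions.split(";"):
        key, sep, value = segment.partition(":")
        if not sep:
            continue  # skip segments that lack a 'Key: Value' shape
        parsed.setdefault(key.strip(), []).append(value.strip())
    # Repeated keys (e.g. several 'Search Types') are joined into one string.
    return {key: ", ".join(values) for key, values in parsed.items()}

# Toy stand-in for index_df.ActionsTaken, using strings from the output above.
actions = pd.Series([
    float("nan"),
    "Stop Results: Citation Issued;Subject Type: Driver;Search Occurred: No;"
    "Legal Basises: Probable cause;Evidence Seized: No",
    "Stop Results: Physical Arrest;Subject Type: Passenger;Search Occurred: Yes;"
    "Search Types: Vehicle;Search Types: Pat-down;Search Types: Passenger(s);"
    "Legal Basises: Probable cause;Legal Basises: Plain view;Evidence Seized: Yes;"
    "Evidence Types: Weapon(s)",
])

# One column per key; rows that lack a key get NaN in that column.
actions_df = pd.DataFrame(actions.apply(parse_actions).tolist(), index=actions.index)
print(actions_df[["Stop Results", "Search Occurred", "Search Types"]])
```

Passing `index=actions.index` keeps the new columns aligned with the original rows, so the same pattern could be applied to `index_df['ActionsTaken']` with `index=index_df.index`.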

After splitting the ActionsTaken column into separate columns for easier readability, we moved on to some data analysis. Our first analysis was a simple test to see whether we could extract any useful information based on a subject's gender. Below is a bar graph showing how many men and how many women were stopped by police.

In [None]:
# REMOVE BEFORE TURNING IN: Should compare this with census data to see if it's disproportionate
index_df['SubjectGender'] = index_df['SubjectGender'].fillna('Unknown')
index_df['SubjectGender'].value_counts().plot.bar()
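Raw counts alone do not show whether a group is stopped disproportionately; for that, the share of stops has to be compared with the group's share of the population. The sketch below shows the arithmetic on toy values. Both the `gender` series and the `population_share` figures are hypothetical placeholders, not real stop counts or census data.

```python
import pandas as pd

# Toy stand-in for index_df['SubjectGender']; illustrative values only.
gender = pd.Series(["MALE", "MALE", "MALE", "FEMALE", None])
gender = gender.fillna("Unknown")

# Share of stops by gender, as fractions that sum to 1.
stop_share = gender.value_counts(normalize=True)

# HYPOTHETICAL population shares; replace with real census figures.
population_share = pd.Series({"MALE": 0.5, "FEMALE": 0.5})

# A ratio above 1 means the group appears in stops more often than in the
# population; groups without a population figure (e.g. 'Unknown') are dropped.
representation_ratio = (stop_share / population_share).dropna()
print(representation_ratio)
```

Dividing two Series aligns them on their labels, so the order of categories in either input does not matter.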