# Summary

### Required Libraries

In [200]:
#Base Python libraries
import requests
import os

#Data Sci/Analysis libraries
import numpy as np
import pandas as pd

#Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

### Sources

Washington Post Github Repo:
- <a href="https://github.com/washingtonpost/data-police-shootings/">https://github.com/washingtonpost/data-police-shootings/</a>

Data Dictionary:
- <a href="https://github.com/washingtonpost/data-police-shootings/blob/master/v2/README.md">https://github.com/washingtonpost/data-police-shootings/blob/master/v2/README.md</a>

### Download Data from Washington Post Github Repo

We will be downloading two data sources:
   - The fatal police shootings themselves (fatal-police-shootings-data.csv)
   - Agency information such as department type, state, and ORI codes (fatal-police-shootings-agencies.csv)

In [201]:
data_url = "https://raw.githubusercontent.com/washingtonpost/data-police-shootings/master/v2/fatal-police-shootings-data.csv"

#Put on two lines for readability
agency_url = \
"https://raw.githubusercontent.com/washingtonpost/data-police-shootings/master/v2/fatal-police-shootings-agencies.csv"

We will download this data only if it doesn't already exist in our repo (download date/time: 2023-09-15 20:00 UTC):

In [202]:
# Download each file only if it is not already on disk
if not os.path.exists('../Data/fatal-police-shootings-data.csv'):
    data_csv = requests.get(data_url).content
    with open('../Data/fatal-police-shootings-data.csv', 'wb') as csv_file:
        csv_file.write(data_csv)

if not os.path.exists('../Data/fatal-police-shootings-agencies.csv'):
    data_csv = requests.get(agency_url).content
    with open('../Data/fatal-police-shootings-agencies.csv', 'wb') as csv_file:
        csv_file.write(data_csv)
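
The download-if-missing step can be wrapped in a small helper so both files go through the same check. This is only a sketch: `download_if_missing` and its `fetch` hook are hypothetical names that do not appear in the notebook, and the 30-second timeout is an arbitrary assumption. Note that the existence check uses `not`; the bitwise `~` operator would turn both `True` and `False` into non-zero (truthy) integers.

```python
import os


def download_if_missing(url, path, fetch=None):
    """Download `url` to `path` unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    `fetch` is a hypothetical hook (url -> bytes) so the helper can run offline.
    """
    if os.path.exists(path):
        return False

    if fetch is None:
        import requests  # imported lazily; only needed for a real download

        def fetch(u):
            response = requests.get(u, timeout=30)
            response.raise_for_status()  # fail on 404/500 instead of saving an error page
            return response.content

    # Create the target directory if needed, then write the raw bytes
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as out_file:
        out_file.write(fetch(url))
    return True
```

Under these assumptions, `download_if_missing(data_url, '../Data/fatal-police-shootings-data.csv')` would replace the first `if` block, and the same call with `agency_url` would replace the second.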

### Inspection of Shooting Dataset

In [203]:
df = pd.read_csv('../Data/fatal-police-shootings-data.csv')
agency_df = pd.read_csv('../Data/fatal-police-shootings-agencies.csv').drop(columns=['total_shootings', 'state'])

In [204]:
df.head()

Unnamed: 0,id,date,threat_type,flee_status,armed_with,city,county,state,latitude,longitude,location_precision,name,age,gender,race,race_source,was_mental_illness_related,body_camera,agency_ids
0,3,2015-01-02,point,not,gun,Shelton,Mason,WA,47.246826,-123.121592,not_available,Tim Elliot,53.0,male,A,not_available,True,False,73
1,4,2015-01-02,point,not,gun,Aloha,Washington,OR,45.487421,-122.891696,not_available,Lewis Lee Lembke,47.0,male,W,not_available,False,False,70
2,5,2015-01-03,move,not,unarmed,Wichita,Sedgwick,KS,37.694766,-97.280554,not_available,John Paul Quintero,23.0,male,H,not_available,False,False,238
3,8,2015-01-04,point,not,replica,San Francisco,San Francisco,CA,37.76291,-122.422001,not_available,Matthew Hoffman,32.0,male,W,not_available,True,False,196
4,9,2015-01-04,point,not,other,Evans,Weld,CO,40.383937,-104.692261,not_available,Michael Rodriguez,39.0,male,H,not_available,False,False,473


In [205]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8735 entries, 0 to 8734
Data columns (total 19 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          8735 non-null   int64  
 1   date                        8735 non-null   object 
 2   threat_type                 8692 non-null   object 
 3   flee_status                 7558 non-null   object 
 4   armed_with                  8525 non-null   object 
 5   city                        8681 non-null   object 
 6   county                      3879 non-null   object 
 7   state                       8735 non-null   object 
 8   latitude                    7755 non-null   float64
 9   longitude                   7755 non-null   float64
 10  location_precision          7755 non-null   object 
 11  name                        8158 non-null   object 
 12  age                         8134 non-null   float64
 13  gender                      8685 

For the most part, our dataset is in good shape. But we will want to transform/enrich/remove some of our data to perform our analysis. 

**Missing Data**<br/>
We will mark NULL values in the following columns as "unknown":

- threat_type
- flee_status
- armed_with
- city
- county
- gender
- race

**Data to be Removed**<br/>
We will remove the following columns as they aren't relevant to our analysis:
- name
- race_source

**Data to be Enriched/Transformed**<br/>
- Race
    - We will replace the race with the full version (i.e. W => White) as described in the provided data dictionary
    - Additionally, if TWO or more races are present (designated with a ";" separator) we will only use the FIRST race
- latitude/longitude
    - If we do not have latitude/longitude, we will geolocate the data to the nearest city 
- agency_ids
    - If we have MORE than one agency_id listed for a shooting, we will use the first agency for easier analysis

#### Cleaning our Dataset

We can go ahead and convert our date column into pandas datetime values:

In [233]:
df['date'] = pd.to_datetime(df['date'])

First, we will mark the NULL values in the columns listed above as "unknown":

In [206]:
columns_to_replace_nan_vals = ['threat_type', 'flee_status','armed_with','city','county','gender','race']

In [207]:
df.loc[:, columns_to_replace_nan_vals] = \
df.loc[:, columns_to_replace_nan_vals].fillna('unknown')

In [208]:
#Assert that we have replaced ALL missing values for designated columns
assert(df.loc[:, columns_to_replace_nan_vals].isna().sum().sum() == 0)

Next we will remove the columns not necessary for our analysis

In [209]:
df.drop(columns=['name', 'race_source'], inplace=True)

Now we will begin transforming/enriching our data:

#### Race Transformation
We will first replace the short-hand versions of race with the full version for easier interpretation.

In [210]:
#Short-hand to longer description map
race_map = {
    'W':'White',
    'B':'Black',
    'A':'Asian Heritage',
    'N':'Native American',
    'H':'Hispanic',
    'O':'Other',
    '--':'unknown',
    'unknown':'unknown',
}

In [211]:
# We only have one record that lists the decedent as two races
df.loc[df['race'].str.contains(';'), 'race']

7704    B;H
Name: race, dtype: object

In [212]:
#We will fix any subjects that have multiple races listed
multi_race_mask = df['race'].str.contains(';')
#Split on ';' and keep the first race (taking x[0] would only keep the first character)
df.loc[multi_race_mask, 'race'] = df.loc[multi_race_mask, 'race'].str.split(';').str[0]
assert(len(df.loc[df['race'].str.contains(';'), 'race']) == 0) #Quickly verify we handled all multi-race subjects

In [213]:
# Now we will map over our races and transform them into the long-form race descriptions
df['race'] = df['race'].map(race_map)
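
One thing to watch with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes NaN. The sketch below, on a made-up Series with an abbreviated copy of the map, shows one way to list such codes before overwriting the column. The code `"X"` is invented for illustration and is not claimed to occur in the dataset.

```python
import pandas as pd

# Abbreviated copy of the race map, for illustration only
race_map = {"W": "White", "B": "Black", "H": "Hispanic", "unknown": "unknown"}

# Toy data; "X" is a made-up code that is not in the map
race = pd.Series(["W", "B;H", "unknown", "X"])

# Keep only the first listed race, then translate the code
first_race = race.str.split(";").str[0]
mapped = first_race.map(race_map)

# Series.map yields NaN for keys missing from the dict, so collect them
unmapped_codes = sorted(first_race[mapped.isna()].unique())
print(unmapped_codes)  # ['X']
```

An empty `unmapped_codes` list would confirm that every code present in the column is covered by the map.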

#### Obtaining missing latitude/longitude data based on city+state
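
The notebook does not show code for this step. As a rough sketch of one possible approach (an assumption, not the method actually used here), missing coordinates can be approximated with the median latitude/longitude of other shootings recorded in the same city+state. The toy DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Toy data: two Wichita rows with coordinates, one without, and one city
# (Evans) that has no known coordinates at all
toy = pd.DataFrame({
    "city": ["Wichita", "Wichita", "Wichita", "Evans"],
    "state": ["KS", "KS", "KS", "CO"],
    "latitude": [37.70, 37.68, None, None],
    "longitude": [-97.30, -97.28, None, None],
})

coord_cols = ["latitude", "longitude"]

# Median of the known coordinates in each city+state group,
# broadcast back to every row of that group
group_median = toy.groupby(["city", "state"])[coord_cols].transform("median")

# Only rows with missing coordinates take the group value
toy[coord_cols] = toy[coord_cols].fillna(group_median)

# Rows whose city has no known coordinates are still missing
still_missing = toy["latitude"].isna().sum()
```

Cities with no located shooting at all (Evans in the toy data) stay missing under this approach and would need an external geocoding source.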

#### Agency Cleaning (Only listing one agency per shooting)

Two shootings DO NOT have an agency id listed. Rather than dropping them, we will use the agency_id of another shooting that happened in the same city+county+state:

In [214]:
df.loc[df['agency_ids'].isna()]

Unnamed: 0,id,date,threat_type,flee_status,armed_with,city,county,state,latitude,longitude,location_precision,age,gender,race,was_mental_illness_related,body_camera,agency_ids
8672,9435,2023-08-04,threat,foot,gun,Kingwood,Harris,TX,30.055926,-95.222457,poi_large,,unknown,unknown,False,False,
8676,9428,2023-08-05,shoot,unknown,gun,Columbia,Boone,MO,38.922324,-92.335543,address,22.0,male,unknown,True,True,


In [215]:
df.loc[df['id'] == 9435, 'agency_ids'] =  \
list(df.loc[(df['state'] == 'TX') \
       & (df['county'] == 'Harris')  \
       & (df['city'] == 'Kingwood') \
       & (~df['agency_ids'].isna()), 'agency_ids'])[0]

In [216]:
#This city+county+state combination has multiple agencies listed; we will use the first
df.loc[df['id'] == 9428, 'agency_ids'] = \
list(df.loc[(df['state'] == 'MO') \
       & (df['county'] == 'Boone')  \
       & (df['city'] == 'Columbia') \
       & (~df['agency_ids'].isna()), 'agency_ids'])[0]

In [217]:
assert(len(df.loc[df['agency_ids'].isna()]) == 0)
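
The two manual fills above follow the same rule, so they could be generalized. The sketch below assumes that rule (borrow the first known `agency_ids` value from the same city+county+state) and applies it with a grouped transform. The toy rows and agency ids are made up for illustration.

```python
import pandas as pd

# Toy data with invented agency ids; the last two rows have none listed
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "city": ["Columbia", "Columbia", "Columbia", "Kingwood"],
    "county": ["Boone", "Boone", "Boone", "Harris"],
    "state": ["MO", "MO", "MO", "TX"],
    "agency_ids": ["338", "2001", None, None],
})

location_cols = ["city", "county", "state"]

# First non-null agency_ids value within each location, aligned to every row
first_known = toy.groupby(location_cols)["agency_ids"].transform("first")

# Only rows without an agency take the borrowed value
toy["agency_ids"] = toy["agency_ids"].fillna(first_known)

# Locations with no known agency at all remain missing
still_missing = toy.loc[toy["agency_ids"].isna(), "id"].tolist()
```

In the toy data the Columbia row is filled, while the Kingwood row stays missing because no other shooting shares its location, which is the case the final assert in the notebook guards against.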

Now that we've handled the missing agency IDs, we can reduce the shootings with multiple agencies to a single agency by keeping the first one listed. Of course, if we were doing agency-specific analysis, we would handle this differently.

In [218]:
multi_agency_mask = df['agency_ids'].str.contains(';')

In [219]:
df.loc[multi_agency_mask, 'agency_ids'] = df.loc[multi_agency_mask, 'agency_ids'].str.split(';').apply(lambda x: x[0])

In [220]:
assert(df['agency_ids'].str.contains(';').sum() == 0)

In [221]:
df['agency_ids'] = df['agency_ids'].astype(int)

### Agency Dataset

To enrich our data, we will also join each fatal shooting to its agency record.

In [222]:
agency_df.head()

Unnamed: 0,id,name,type,oricodes
0,3145,Abbeville County Sheriff's Office,sheriff,SC00100
1,2576,Aberdeen Police Department,local_police,WA01401
2,2114,Abilene Police Department,local_police,TX22101
3,2088,Abington Township Police Department,local_police,PA04601
4,3187,Acadia Parish Sheriff's Office,sheriff,LA00100


In [230]:
agency_df = agency_df.rename(columns={'id':'agency_ids'}) #For easier merging

Now we can merge our data with the agency data:

In [228]:
df = pd.merge(df, agency_df, how='left', left_on='agency_ids', right_on='agency_ids')
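
A left join keeps every shooting, but it can silently leave agency columns empty or duplicate rows if the agency table has repeated ids. The sketch below, on made-up frames, uses the `validate` and `indicator` parameters of `pd.merge` to check both conditions. The id `999` is invented to show an unmatched row.

```python
import pandas as pd

# Toy frames; 999 is a made-up agency id with no matching agency row
shootings = pd.DataFrame({"id": [3, 4, 5], "agency_ids": [73, 70, 999]})
agencies = pd.DataFrame({
    "agency_ids": [73, 70],
    "name": ["Mason County Sheriff's Office", "Washington County Sheriff's Office"],
})

merged = pd.merge(
    shootings,
    agencies,
    how="left",
    on="agency_ids",
    validate="many_to_one",  # raises MergeError if an agency id repeats in `agencies`
    indicator=True,          # adds a `_merge` column: "both" or "left_only"
)

# Shootings whose agency id has no row in the agency table
unmatched_ids = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
print(unmatched_ids)  # [5]
```

The `_merge` column can be dropped once the check is done.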

In [235]:
df.head()

Unnamed: 0,id,date,threat_type,flee_status,armed_with,city,county,state,latitude,longitude,location_precision,age,gender,race,was_mental_illness_related,body_camera,agency_ids,name,type,oricodes
0,3,2015-01-02,point,not,gun,Shelton,Mason,WA,47.246826,-123.121592,not_available,53.0,male,Asian Heritage,True,False,73,Mason County Sheriff's Office,sheriff,WA02300
1,4,2015-01-02,point,not,gun,Aloha,Washington,OR,45.487421,-122.891696,not_available,47.0,male,White,False,False,70,Washington County Sheriff's Office,sheriff,OR03400
2,5,2015-01-03,move,not,unarmed,Wichita,Sedgwick,KS,37.694766,-97.280554,not_available,23.0,male,Hispanic,False,False,238,Wichita Police Department,local_police,KS08703
3,8,2015-01-04,point,not,replica,San Francisco,San Francisco,CA,37.76291,-122.422001,not_available,32.0,male,White,True,False,196,San Francisco Police Department,local_police,CA03801
4,9,2015-01-04,point,not,other,Evans,Weld,CO,40.383937,-104.692261,not_available,39.0,male,Hispanic,False,False,473,Evans Police Department,local_police,CO06204
