<p>Hello, just a beginner here. I found this <a href = 'https://data.fivethirtyeight.com/'>dataset</a> about the police killings that happened in the USA in 2015 and decided to look into it. My primary goal was to find the most vulnerable community, starting my analysis from the age group that seemed to be most affected. There are many ways to go about this, for example by zeroing in on age or on race, but since the population of the States is dominated by one racial group, I felt that starting from race alone would give me skewed results. </p>

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('police_killings.csv', encoding = 'ISO-8859-1')
df.head()

Unnamed: 0,name,age,gender,raceethnicity,month,day,year,streetaddress,city,state,...,share_hispanic,p_income,h_income,county_income,comp_income,county_bucket,nat_bucket,pov,urate,college
0,A'donte Washington,16,Male,Black,February,23,2015,Clearview Ln,Millbrook,AL,...,5.6,28375,51367.0,54766,0.937936,3.0,3.0,14.1,0.097686,0.16851
1,Aaron Rutledge,27,Male,White,April,2,2015,300 block Iris Park Dr,Pineville,LA,...,0.5,14678,27972.0,40930,0.683411,2.0,1.0,28.8,0.065724,0.111402
2,Aaron Siler,26,Male,White,March,14,2015,22nd Ave and 56th St,Kenosha,WI,...,16.8,25286,45365.0,54930,0.825869,2.0,3.0,14.6,0.166293,0.147312
3,Aaron Valdez,25,Male,Hispanic/Latino,March,11,2015,3000 Seminole Ave,South Gate,CA,...,98.8,17194,48295.0,55909,0.863814,3.0,3.0,11.7,0.124827,0.050133
4,Adam Jovicic,29,Male,White,March,19,2015,364 Hiwood Ave,Munroe Falls,OH,...,1.7,33954,68785.0,49669,1.384868,5.0,4.0,1.9,0.06355,0.403954


In [246]:
df.shape

(467, 34)

In [247]:
df.columns 

Index(['name', 'age', 'gender', 'raceethnicity', 'month', 'day', 'year',
       'streetaddress', 'city', 'state', 'latitude', 'longitude', 'state_fp',
       'county_fp', 'tract_ce', 'geo_id', 'county_id', 'namelsad',
       'lawenforcementagency', 'cause', 'armed', 'pop', 'share_white',
       'share_black', 'share_hispanic', 'p_income', 'h_income',
       'county_income', 'comp_income', 'county_bucket', 'nat_bucket', 'pov',
       'urate', 'college'],
      dtype='object')

In [26]:
# Drop the columns that are not needed, together with the rows where the age was not specified.

df.drop(['name','day','streetaddress', 'city','latitude', 'longitude', 'state_fp','county_fp', 
         'tract_ce', 'geo_id', 'county_id', 'namelsad', 'year', 'nat_bucket','h_income',
       'county_income', 'comp_income'], inplace = True, axis = 1)
unknowns = df[(df['age'] == 'Unknown')]
df = df.drop(unknowns.index)
df.shape

(463, 17)
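
The `age` column is still stored as text after this step, which is why the later cells keep calling `.astype(int)`. A minimal sketch of converting it once, assuming a small made-up frame (`sample` and `cleaned` are hypothetical names, not part of the notebook): `pd.to_numeric` with `errors='coerce'` turns entries such as 'Unknown' into NaN, so they can be dropped in one step.

```python
import pandas as pd

# Hypothetical stand-in for the real data: ages arrive as strings, some 'Unknown'.
sample = pd.DataFrame({'age': ['16', '27', 'Unknown', '45'],
                       'state': ['AL', 'LA', 'CA', 'TX']})

# Non-numeric entries such as 'Unknown' become NaN instead of raising an error.
sample['age'] = pd.to_numeric(sample['age'], errors='coerce')

# Drop the rows without a usable age, then store the column as integers.
cleaned = sample.dropna(subset=['age']).copy()
cleaned['age'] = cleaned['age'].astype(int)

print(cleaned.shape)
```

With the column stored as integers, later filters could compare `cleaned['age']` directly.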

In [24]:
df.head()

Unnamed: 0,age,gender,raceethnicity,month,state,lawenforcementagency,cause,armed,pop,share_white,share_black,share_hispanic,p_income,county_bucket,pov,urate,college
0,16,Male,Black,February,AL,Millbrook Police Department,Gunshot,No,3779,60.5,30.5,5.6,28375,3.0,14.1,0.097686,0.16851
1,27,Male,White,April,LA,Rapides Parish Sheriff's Office,Gunshot,No,2769,53.8,36.2,0.5,14678,2.0,28.8,0.065724,0.111402
2,26,Male,White,March,WI,Kenosha Police Department,Gunshot,No,4079,73.8,7.7,16.8,25286,2.0,14.6,0.166293,0.147312
3,25,Male,Hispanic/Latino,March,CA,South Gate Police Department,Gunshot,Firearm,4343,1.2,0.6,98.8,17194,3.0,11.7,0.124827,0.050133
4,29,Male,White,March,OH,Kent Police Department,Gunshot,No,6809,92.5,1.4,1.7,33954,5.0,1.9,0.06355,0.403954


In [37]:
# First, group the dataset into age brackets.
age = df['age'].astype(int)  # convert once instead of inside every filter

below_18 = df[age.between(1, 17)]          # between() includes both ends
eighteen_thirty = df[age.between(18, 30)]
thirties = df[age.between(31, 39)]
forties = df[age.between(40, 49)]
fifties = df[age.between(50, 59)]
above_sixty = df[age >= 60]

In [38]:
# Count how many of the people killed fall into each age group and store the counts in a dictionary.
groups = ['below_18', 'eighteen_thirty', 'thirties', 'forties', 'fifties', 'above_sixty']
lengths = [below_18.shape[0], eighteen_thirty.shape[0], thirties.shape[0], forties.shape[0], fifties.shape[0], above_sixty.shape[0]]
dicts = dict(zip(groups, lengths))
dicts

{'below_18': 9,
 'eighteen_thirty': 147,
 'thirties': 135,
 'forties': 90,
 'fifties': 53,
 'above_sixty': 29}
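
The same grouping can probably be done in one pass with `pd.cut`. The sketch below assumes integer ages and uses a made-up `ages` series (hypothetical values, not the real data). The bin edges reproduce the brackets used above. One thing to keep in mind when reading the counts: the 18-30 bracket covers 13 years while the thirties bracket covers only 9, so the raw counts are not directly comparable.

```python
import pandas as pd

# Hypothetical ages, chosen to sit on the bracket boundaries.
ages = pd.Series([16, 18, 30, 31, 39, 40, 59, 60, 75])

# Right-inclusive edges: (0, 17], (17, 30], (30, 39], (39, 49], (49, 59], (59, inf)
bins = [0, 17, 30, 39, 49, 59, float('inf')]
labels = ['below_18', 'eighteen_thirty', 'thirties', 'forties', 'fifties', 'above_sixty']

# Assign every age to its bracket, then count the rows per bracket.
age_group = pd.cut(ages, bins=bins, labels=labels)
counts = age_group.value_counts().to_dict()

print(counts)
```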

In [39]:
# Where does the most affected group come from?

eighteen_thirty['state'].value_counts()

CA    30
TX    16
FL    13
AZ    11
OK     7
MO     5
GA     5
LA     4
IL     4
MD     4
OH     4
MN     3
NY     3
MI     3
SC     3
WI     3
IN     3
NC     3
VA     3
CO     2
MA     2
WA     2
KS     2
NJ     1
ID     1
UT     1
KY     1
MT     1
NM     1
IA     1
PA     1
NV     1
HI     1
NE     1
TN     1
Name: state, dtype: int64

In [40]:
youthsInCa = eighteen_thirty[(eighteen_thirty['state'] == 'CA')]
youthsInCa['raceethnicity'].value_counts() # What descent did the killed youths from CA mostly belong to?

Hispanic/Latino           13
Black                      8
White                      6
Unknown                    2
Asian/Pacific Islander     1
Name: raceethnicity, dtype: int64

<p>From this, we can tell that, according to the data, the most vulnerable group is people aged 18-30 who live in California and are of Hispanic/Latino descent.</p>
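
The two filtering steps above (first the state, then the race) could also be read off a single table. A minimal sketch using `pd.crosstab` on a made-up frame (`sample` and its rows are hypothetical):

```python
import pandas as pd

# Hypothetical rows standing in for the 18-30 subset.
sample = pd.DataFrame({
    'state': ['CA', 'CA', 'CA', 'TX', 'TX', 'FL'],
    'raceethnicity': ['Hispanic/Latino', 'Hispanic/Latino', 'Black',
                      'White', 'Black', 'White'],
})

# One row per state, one column per race, each cell a count.
table = pd.crosstab(sample['state'], sample['raceethnicity'])
print(table)

# State with the most rows overall, then the largest group inside that state.
top_state = table.sum(axis=1).idxmax()
top_race = table.loc[top_state].idxmax()
print(top_state, top_race)
```

On the real data the first argument would be `eighteen_thirty['state']` and the second `eighteen_thirty['raceethnicity']`.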

In [41]:
HispyouthsInCa = youthsInCa[(youthsInCa['raceethnicity'] == 'Hispanic/Latino')]
HispyouthsInCa.shape

(13, 17)

In [43]:
# Look at the economic background of this group to get a better idea of who could be at risk.
# Note: p_income is the median personal income of the census tract where the killing happened, not the victim's own income.
belo10kincome = HispyouthsInCa[(HispyouthsInCa['p_income'].astype(int) <10000)]
ten_twenty = HispyouthsInCa[(HispyouthsInCa['p_income'].astype(int) >= 10000) & (HispyouthsInCa['p_income'].astype(int) <= 20000)]
over_twenty = HispyouthsInCa[(HispyouthsInCa['p_income'].astype(int) > 20000)]
print(
    ' Those who have a median income of less than 10 thousand are :', belo10kincome.shape[0],'\n',
    'Those who have a median income between ten and twenty thousand are :', ten_twenty.shape[0], '\n',
    'Those who have a median income that is over twenty thousand are :', over_twenty.shape[0])

 Those who have a median income of less than 10 thousand are : 0 
 Those who have a median income between ten and twenty thousand are : 8 
 Those who have a median income that is over twenty thousand are : 5


In [44]:
# Poverty rate of the census tract.
# Note: values strictly between 20 and 21, or above 29 up to and including 30, fall outside every bin below.
pov = HispyouthsInCa['pov'].astype(float)  # convert once instead of inside every filter
belo10 = HispyouthsInCa[pov < 10]
ten_twenty = HispyouthsInCa[(pov >= 10) & (pov <= 20)]
twenties = HispyouthsInCa[(pov >= 21) & (pov <= 29)]
over_thirty = HispyouthsInCa[pov > 30]
print(
    ' Those who have a poverty rate of below 10 are :', belo10.shape[0],'\n',
    'Those who have a poverty rate between ten and twenty are :', ten_twenty.shape[0], '\n',
    'Those who have a poverty rate that ranges within the twenties are :', twenties.shape[0], '\n',
    'Those who have a poverty rate that is over thirty are :', over_thirty.shape[0])
    

 Those who have a poverty rate of below 10 are : 0 
 Those who have a poverty rate between ten and twenty are : 5 
 Those who have a poverty rate that ranges within the twenties are : 3 
 Those who have a poverty rate that is over thirty are : 4


Although we had determined that the most vulnerable group was young Hispanic/Latino people from California, I did some further analysis just to make sure that I wasn't wrong about this. This time, I zeroed in on the race that seemed to be most affected.

In [45]:
df['raceethnicity'].value_counts()

White                     235
Black                     135
Hispanic/Latino            66
Unknown                    13
Asian/Pacific Islander     10
Native American             4
Name: raceethnicity, dtype: int64

In [57]:
whites = df[(df['raceethnicity'] == 'White')]
whites['state'].value_counts()

CA    27
TX    15
OK    14
AZ    14
FL    14
NC     7
OR     7
MI     7
GA     6
MS     6
NY     6
WA     6
KY     5
PA     5
IL     5
IN     5
OH     5
NM     5
LA     5
UT     5
SC     5
MO     5
ID     4
NE     4
CO     4
AL     4
MA     4
KS     4
MN     3
TN     3
WI     3
AR     2
HI     2
IA     2
WV     2
DE     2
NV     2
MT     2
VA     2
NJ     2
ME     1
WY     1
MD     1
NH     1
CT     1
Name: state, dtype: int64

In [60]:
whitesInCa = whites[(whites['state'] == 'CA')]
whiteYInCa = whitesInCa[(whitesInCa['age'].astype(int) >= 18) & (whitesInCa['age'].astype(int) <= 30)]
whiteYInCa.shape

(6, 17)

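From this, we can see that having the biggest numbers does not necessarily make a group the most vulnerable, especially when that group dominates the population of a region. White people account for the most deaths overall (235), yet only 6 of the 18-30 year olds killed in California were white, against 13 who were Hispanic/Latino. In my view, the group most at risk is therefore the one identified earlier: Hispanic/Latino people aged 18-30 living in California.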
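
One way to make the "biggest numbers" point concrete is to divide each group's deaths by the size of that group. The sketch below uses two made-up groups: `group_a`, `group_b` and all the figures are hypothetical, and real population sizes would have to come from census data.

```python
# Hypothetical death counts: group_a has more deaths in absolute terms.
deaths = {'group_a': 60, 'group_b': 40}

# Hypothetical population sizes: group_a dominates the region.
population = {'group_a': 900, 'group_b': 100}

def rate_per_thousand(group):
    """Deaths per 1,000 members of the group."""
    return deaths[group] / population[group] * 1000

rates = {group: round(rate_per_thousand(group), 1) for group in deaths}
most_deaths = max(deaths, key=deaths.get)
highest_rate = max(rates, key=rates.get)

print(rates)
print('most deaths:', most_deaths, '| highest rate:', highest_rate)
```

In this made-up case the larger group has more deaths, but the smaller group has the higher rate, which is the sense of "most vulnerable" used above.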