<h1>Chicago Police Department Strategic Subject List Analysis</h1>
<p>We present an analysis of the Chicago Police Department's (CPD) Strategic Subject List (SSL). The SSL is a database the CPD developed as a training set for its predictive policing algorithms. Each person on the list is assigned a score that rates their risk of being involved in a violent crime. The designation "involved in a violent crime" does not distinguish between people likely to be victims and people likely to be perpetrators of violent crimes, so both groups are policed in the same way.</p>

<p>We perform a statistical analysis to show how this database provides evidence of the policing of vulnerable, marginalized people. We do this by breaking down the characteristics of the people the CPD has profiled: their demographics, their income level, and their neighborhood.</p>

<h3>Import Data and Configure Dataframes</h3>

In [27]:
# import modules
import numpy as np
import pandas as pd
#from scipy import 

#display options 
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

In [28]:
ssl                 = pd.read_csv("Chicago_Strategic_Subject_List.csv",low_memory=False)
demographics        = pd.read_excel("Chicago_Neighborhood_Demographics.xlsx")
neighborhood_income = pd.read_csv("Chicago_Neighborhood_Demographics.csv")

In [29]:
#reconfigure demographics dataframe 

#select the 0th row and use its values as the column labels
demographics.columns = demographics.iloc[0]

#remove the 0th row, as it is now redundant
demographics = demographics.reindex(demographics.index.drop(0))

#make values consistent with other dataset 
demographics['Geog']= demographics['Geog'].str.upper()

#sort by neighborhood, and have index from 0 to n 
demographics = demographics.sort_values(by=['Geog']).reset_index(drop=True)
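
<p>A minimal sketch of the header-promotion step above, run on a toy frame. The values are made up for illustration; only the pattern mirrors the cell above. If the real header sits on the second row of the spreadsheet, passing <code>header=1</code> to <code>pd.read_excel</code> may achieve the same result at load time, assuming the file layout allows it.</p>

```python
import pandas as pd

# Toy frame whose real header ended up in row 0 (hypothetical values)
raw = pd.DataFrame([
    ["Geog", "Total Population"],
    ["Austin", 100],
    ["Albany Park", 50],
])

tidy = raw.copy()
tidy.columns = tidy.iloc[0]      # promote row 0 to column labels
tidy = tidy.drop(index=0)        # drop the now-redundant row
tidy.columns.name = None         # clear the leftover name on the columns index
tidy["Geog"] = tidy["Geog"].str.upper()
tidy = tidy.sort_values(by=["Geog"]).reset_index(drop=True)

print(tidy)
```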

In [30]:
#reconfigure neighborhood_income dataframe

neighborhood_income['Community'] = neighborhood_income['Community'].str.upper()
neighborhood_income = neighborhood_income.sort_values(by=['Community']).reset_index(drop=True)


<h3>Number of People on the SSL in each community by Demographic</h3>

In [31]:
#communities
communities_array = np.sort(ssl['COMMUNITY AREA'].unique()[1:])

#number of people on the list in each community who are white
#value_counts() may include an entry for rows with no neighborhood data;
#reindexing on communities_array drops that entry, adds communities without
#counts with value zero, and puts everything in alphabetical order
white_count_ssl = ssl[ssl['RACE CODE CD']=='WHI']['COMMUNITY AREA'].value_counts()
white_count_ssl = white_count_ssl.reindex(communities_array,fill_value=0)

black_count_ssl = ssl[ssl['RACE CODE CD']=='BLK']['COMMUNITY AREA'].value_counts()
black_count_ssl = black_count_ssl.reindex(communities_array,fill_value=0)

asian_count_ssl = ssl[ssl['RACE CODE CD']=='API']['COMMUNITY AREA'].value_counts()
asian_count_ssl = asian_count_ssl.reindex(communities_array,fill_value=0)

native_count_ssl = ssl[ssl['RACE CODE CD']=='I']['COMMUNITY AREA'].value_counts()
native_count_ssl = native_count_ssl.reindex(communities_array,fill_value=0)

unknown_count_ssl = ssl[ssl['RACE CODE CD']=='U']['COMMUNITY AREA'].value_counts()
unknown_count_ssl = unknown_count_ssl.reindex(communities_array,fill_value=0)

#we aggregate white hispanic and black hispanic, as the corresponding demographics data does not make the distinction
hispanic_ssl = ssl[(ssl['RACE CODE CD']=='WBH')|(ssl['RACE CODE CD']=='WWH')]
hispanic_count_ssl = hispanic_ssl['COMMUNITY AREA'].value_counts()
hispanic_count_ssl = hispanic_count_ssl.reindex(communities_array,fill_value=0)
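
<p>The per-race counting pattern above repeats the same three steps: filter by race code, count by community, reindex on the community list. It could be wrapped in one helper, sketched here on toy data. The function name <code>count_by_community</code> and the toy rows are assumptions for illustration; the column names are the ones used in the cell above.</p>

```python
import numpy as np
import pandas as pd

def count_by_community(ssl_frame, race_codes, communities):
    """Count listed people per community for the given race codes."""
    selected = ssl_frame[ssl_frame["RACE CODE CD"].isin(race_codes)]
    counts = selected["COMMUNITY AREA"].value_counts()
    # reindex drops labels outside `communities` and fills missing ones with 0
    return counts.reindex(communities, fill_value=0)

# Toy data (hypothetical rows; " " stands in for a missing community)
toy_ssl = pd.DataFrame({
    "RACE CODE CD":   ["WHI", "BLK", "WWH", "WBH", "BLK", "WHI"],
    "COMMUNITY AREA": ["AUSTIN", "AUSTIN", "ASHBURN", "AUSTIN", " ", " "],
})
toy_communities = np.array(["ASHBURN", "AUSTIN", "BEVERLY"])

hispanic_counts = count_by_community(toy_ssl, ["WBH", "WWH"], toy_communities)
white_counts = count_by_community(toy_ssl, ["WHI"], toy_communities)
print(hispanic_counts.tolist())
print(white_counts.tolist())
```

<p>Because the reindex step selects communities by label, the result does not depend on where the no-neighborhood entry falls in the <code>value_counts()</code> ordering.</p>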



<h3>Age Information</h3>

In [32]:
age_counts = ssl['AGE GROUP'].value_counts()
age_counts

20-30           140587
30-40            88561
less than 20     67866
40-50            58647
50-60            34173
60-70             7717
70-80              980
Name: AGE GROUP, dtype: int64
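
<p>To read the counts above as shares, they can be put in chronological order and normalized. This is a small sketch that re-enters the printed counts by hand; on the real data, <code>ssl['AGE GROUP'].value_counts(normalize=True)</code> should give the same proportions. The shares are relative to rows that have an age group, since <code>value_counts()</code> skips missing values by default.</p>

```python
import pandas as pd

# Counts copied from the value_counts() output above
toy_counts = pd.Series(
    [140587, 88561, 67866, 58647, 34173, 7717, 980],
    index=["20-30", "30-40", "less than 20", "40-50", "50-60", "60-70", "70-80"],
)

# Chronological order instead of frequency order
age_order = ["less than 20", "20-30", "30-40", "40-50", "50-60", "60-70", "70-80"]
ordered_counts = toy_counts.reindex(age_order)

# Share of each age group, in percent
age_share = 100 * ordered_counts / ordered_counts.sum()
print(age_share.round(2))
```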

In [33]:
#age_counts_total = (demographics[''])

<h3>Percentage of demographic on SSL in each community</h3>

In [34]:
white_percentages    = (
    100
    *white_count_ssl.values
    /demographics['Not Hispanic or Latino, White alone'].values)

In [35]:
black_percentages    = (
    100
    *black_count_ssl.values
    /demographics['Not Hispanic or Latino, Black or African American alone'].values)

In [36]:
#we must aggregate both categories because corresponding data set does not make these distinctions
asian_per_neighborhood = (
    demographics['Not Hispanic or Latino, Native Hawaiian and Other Pacific Islander alone'].values
    +demographics['Not Hispanic or Latino, Asian alone'].values)

#must take maximum to replace zero values by 1; cannot divide by zero 
asian_percentages = (
    100
    *asian_count_ssl.values
    /np.maximum(1,asian_per_neighborhood))


In [37]:
#take the maximum with 1 to avoid dividing by zero, as for asian_percentages
native_percentages = (
    100
    *native_count_ssl.values
    /np.maximum(1,demographics['Not Hispanic or Latino, American Indian and Alaska Native alone'].values))

In [38]:
hispanic_percentages = (
    100
    *hispanic_count_ssl.values
    /(demographics['Hispanic or Latino'].values))

In [39]:
#we must aggregate all other ethnicities since the SSL's only other distinction is the "unknown" category
unknown_per_neighborhood = (
    demographics['Not Hispanic or Latino, Some Other Race alone'].values
    +demographics['Not Hispanic or Latino, Two or More Races'].values)

unknown_percentages = (
    100
    *unknown_count_ssl.values
    /unknown_per_neighborhood)
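
<p>The cells above guard against division by zero by clamping the denominator to 1, which turns a count against an empty population into a percentage above 100. One alternative, sketched here under the assumption that a missing value is preferable in that case, is to return NaN where the population is zero. The helper name <code>percentage_listed</code> is hypothetical, and the third input pair is invented to show the zero case.</p>

```python
import numpy as np

def percentage_listed(listed, present):
    """Percent of `present` that is `listed`; NaN where `present` is zero."""
    listed = np.asarray(listed, dtype=float)
    present = np.asarray(present, dtype=float)
    out = np.full(listed.shape, np.nan)
    # divide only where the denominator is positive; other slots keep NaN
    np.divide(100.0 * listed, present, out=out, where=present > 0)
    return out

# 351 of 15054 and 0 of 15 come from the table in this notebook;
# 2 of 0 is a made-up pair for the zero-population case
example = percentage_listed([351, 0, 2], [15054, 15, 0])
print(example)
```

<p>Small denominators still produce volatile percentages (for example, 2 listed out of 4 residents reads as 50%), so the underlying counts are worth keeping alongside the percentages.</p>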

<h3>Adding Values to New Data Frame </h3>
<p>NOTE: All added values provide information per community</p>

In [14]:
visualized_frame = pd.DataFrame(data = {'COMMUNITY_AREA':communities_array}) 


In [15]:
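#percentage of people on the list of each community
total_community = demographics["Total Population"]
#reindex on communities_array to drop the no-neighborhood entry and match the alphabetical order of demographics
total_community_ssl = ssl["COMMUNITY AREA"].value_counts().reindex(communities_array,fill_value=0)
visualized_frame['total_listed_normalized'] = 100*total_community_ssl.values/total_community.values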
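
<p>The divisions in this notebook pair SSL counts with demographic totals by position, which is only valid if both sources list the same communities in the same alphabetical order. A quick check along these lines can catch spelling mismatches before dividing. The helper name and the toy names, including the "OHARE" versus "O'Hare" mismatch, are hypothetical.</p>

```python
import pandas as pd

def compare_community_names(ssl_names, demographic_names):
    """Return names found only in the SSL and only in the demographics data."""
    def normalize(names):
        s = pd.Series(names).dropna().str.strip().str.upper()
        return set(s[s != ""])      # ignore blank placeholders
    ssl_set = normalize(ssl_names)
    demo_set = normalize(demographic_names)
    return sorted(ssl_set - demo_set), sorted(demo_set - ssl_set)

# Toy inputs (hypothetical); " " stands in for a missing community
only_in_ssl, only_in_demo = compare_community_names(
    ["ALBANY PARK", "AUSTIN", " ", "OHARE"],
    ["Albany Park", "Austin", "O'Hare"],
)
print(only_in_ssl)
print(only_in_demo)
```

<p>Both lists being empty is a precondition for the positional arithmetic; anything else suggests that a join on the community name would be safer.</p>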

In [16]:
#income per community
visualized_frame['median_income'] = neighborhood_income['Median Income']

In [17]:
#poverty rate per community 
visualized_frame['poverty_rate'] = neighborhood_income['Poverty Rate']


In [18]:
#demographic info
visualized_frame['white_listed'    ] = white_count_ssl.values
visualized_frame['white_present'   ] = demographics['Not Hispanic or Latino, White alone'].values

visualized_frame['black_listed'    ] = black_count_ssl.values
visualized_frame['black_present'   ] = demographics['Not Hispanic or Latino, Black or African American alone'].values

visualized_frame['native_listed'   ] = native_count_ssl.values
visualized_frame['native_present'  ] = demographics['Not Hispanic or Latino, American Indian and Alaska Native alone'].values

visualized_frame['hispanic_listed' ] = hispanic_count_ssl.values
visualized_frame['hispanic_present'] = demographics['Hispanic or Latino'].values

asian_per_neighborhood = (
    demographics['Not Hispanic or Latino, Native Hawaiian and Other Pacific Islander alone'].values
    +demographics['Not Hispanic or Latino, Asian alone'].values)
visualized_frame['asian_listed' ] = asian_count_ssl.values
visualized_frame['asian_present'] = asian_per_neighborhood

unknown_per_neighborhood = (
    demographics['Not Hispanic or Latino, Some Other Race alone'].values
    +demographics['Not Hispanic or Latino, Two or More Races'].values)
visualized_frame['unknown_listed' ] = unknown_count_ssl.values
visualized_frame['unknown_present'] = unknown_per_neighborhood


In [40]:
#demographic info
visualized_frame['white_percentages']    = white_percentages
visualized_frame['black_percentages']    = black_percentages
visualized_frame['asian_percentages']    = asian_percentages
visualized_frame['native_percentages']   = native_percentages
visualized_frame['unknown_percentages']  = unknown_percentages
visualized_frame['hispanic_percentages'] = hispanic_percentages 

In [41]:
visualized_frame

Unnamed: 0,COMMUNITY_AREA,total_listed_normalized,median_income,poverty_rate,white_listed,white_present,black_listed,black_present,native_listed,native_present,hispanic_listed,hispanic_present,asian_listed,asian_present,unknown_listed,unknown_present,white_percentages,black_percentages,asian_percentages,native_percentages,unknown_percentages,hispanic_percentages
0,ALBANY PARK,3.26918,44261,19.0,351,15054,266,2076,4,119,978,25487,81,7452,5,1354,2.33161,12.8131,1.08696,3.36134,0.369276,3.83725
1,ARCHER HEIGHTS,6.29433,46207,12.1,115,2874,128,130,0,15,595,10182,1,138,4,54,4.00139,98.4615,0.724638,0.0,7.40741,5.84365
2,ARMOUR SQUARE,7.10925,30263,27.9,147,1642,569,1419,2,4,169,464,62,9722,3,140,8.9525,40.0987,0.637729,50.0,2.14286,36.4224
3,ASHBURN,3.68784,59478,11.1,189,6251,958,18976,1,45,356,15132,6,277,5,400,3.02352,5.04848,2.16606,2.22222,1.25,2.35263
4,AUBURN GRESHAM,13.0542,32672,26.4,107,134,6177,47661,3,81,59,459,10,35,7,373,79.8507,12.9603,28.5714,3.7037,1.87668,12.854
5,AUSTIN,18.168,33797,26.1,1204,4364,15046,83837,2,161,1585,8722,37,586,24,844,27.5894,17.9467,6.31399,1.24224,2.8436,18.1724
6,AVALON PARK,10.2602,35163,21.9,21,81,1008,9751,0,18,14,153,2,23,0,159,25.9259,10.3374,8.69565,0.0,0.0,9.15033
7,AVONDALE,4.18725,47183,16.9,404,11166,163,991,2,69,1043,25295,29,1199,3,542,3.61813,16.448,2.41868,2.89855,0.553506,4.12334
8,BELMONT CRAGIN,6.73965,42965,19.7,642,11959,526,2493,3,74,4092,62101,34,1545,10,571,5.36834,21.0991,2.20065,4.05405,1.75131,6.58927
9,BEVERLY,2.08645,87628,4.5,82,11785,322,6838,0,23,13,915,0,121,1,352,0.6958,4.70898,0.0,0.0,0.284091,1.42077


In [42]:
#set_index with a column name moves that column into the index and drops it from the columns
visualized_frame_for_json = visualized_frame.set_index('COMMUNITY_AREA')

In [43]:
np.sort(visualized_frame['total_listed_normalized'].values)

array([1.4318132699373245, 1.6983999284884241, 1.775519824019274,
       2.086453029849256, 2.517354077195257, 2.5738873096842023,
       2.618495780329435, 2.6483249110986335, 2.971726237173251,
       2.9762472574644665, 2.9792939073439593, 3.1080592699674736,
       3.2067369360325695, 3.2691785340110977, 3.2967867723394018,
       3.3586640397785907, 3.4030179188934295, 3.64459660942073,
       3.6734517481308613, 3.678854551247212, 3.6878362259925512,
       3.9536707358426586, 3.962782355249145, 4.0137974286610225,
       4.187254852019764, 4.39065269057988, 4.611377955211777,
       5.113564037751632, 5.281142078894888, 5.318726855933698,
       5.494560585551712, 5.5166560577127095, 5.6303402788187675,
       5.650551167433383, 5.666938717575672, 5.873686907507046,
       5.981750591416019, 6.043789728584401, 6.153373032514188,
       6.264306203459566, 6.294332860449488, 6.621351459416234,
       6.708742402992052, 6.739646698754175, 6.798436142484796,
       7.109252483010978

In [44]:
visualized_frame_for_json.to_csv('visualized_json.csv')
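
<p>The frame above is named for JSON but written as CSV. If a JSON file keyed by community is what the visualization needs, which is an assumption about downstream use, <code>DataFrame.to_json</code> with <code>orient="index"</code> produces one object per community. A sketch on two rows taken from the table above:</p>

```python
import json
import pandas as pd

# Two rows copied from the visualized_frame output, reduced to two columns
toy = pd.DataFrame({
    "COMMUNITY_AREA": ["ALBANY PARK", "BEVERLY"],
    "median_income": [44261, 87628],
    "poverty_rate": [19.0, 4.5],
})

# orient="index" maps each index label to a {column: value} object
json_text = toy.set_index("COMMUNITY_AREA").to_json(orient="index")
parsed = json.loads(json_text)
print(parsed["BEVERLY"])
```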





In [45]:
visualized_frame.to_csv('visual_info.csv')

In [46]:
age_frame = pd.DataFrame(data = {'AGE_GROUP':age_counts.index}) 
age_frame['COUNTS'] = age_counts.values
age_frame

Unnamed: 0,AGE_GROUP,COUNTS
0,20-30,140587
1,30-40,88561
2,less than 20,67866
3,40-50,58647
4,50-60,34173
5,60-70,7717
6,70-80,980


In [47]:
age_frame.to_csv('age_info.csv')