In [1]:
import pandas as pd

# Investigation

First, let's explore our data.

In [3]:
idot = pd.read_csv('./idot_clean.csv')
idot.head()

Unnamed: 0,id,datetime,year,reason,any_search,any_search_consent,search_hit,driver_race,beat,district,weapon_hit,drugs_hit,alcohol_hit,contraband_hit
0,20996,2015-01-01 00:20:00,2015,moving,1,0,0,hispanic,2532,25,0,0,0,0
1,18436,2015-01-01 01:03:00,2015,moving,0,0,0,black,312,3,0,0,0,0
2,81589,2015-01-01 01:14:00,2015,license,0,0,0,hispanic,1121,11,0,0,0,0
3,25992,2015-01-01 01:20:00,2015,moving,0,0,0,black,413,4,0,0,0,0
4,19440,2015-01-01 01:30:00,2015,moving,0,0,0,black,312,3,0,0,0,0


In [13]:
idot.shape

(2351325, 14)

Our dataset has 14 columns and 2,351,325 rows, where each row is a traffic stop that occurred in Chicago.

In [7]:
idot.year.unique()

array([2015, 2016, 2017, 2018, 2019, 2020, 2021])

Using `.unique`, we can see that the dataset includes stops from the years 2015-2021, but we are only interested in 2020. Let's filter the data so that we only have data for 2020.

In [17]:
idot_20 = idot.loc[idot.year == 2020].reset_index(drop=True)
idot_20.head()

Unnamed: 0,id,datetime,year,reason,any_search,any_search_consent,search_hit,driver_race,beat,district,weapon_hit,drugs_hit,alcohol_hit,contraband_hit
0,1651649,2020-01-01 00:01:00,2020,moving,0,0,0,am_indian,1024,10,0,0,0,0
1,1651809,2020-01-01 00:04:00,2020,moving,0,0,0,hispanic,433,4,0,0,0,0
2,1652341,2020-01-01 00:05:00,2020,equipment,0,0,0,black,1021,10,0,0,0,0
3,1651148,2020-01-01 00:09:00,2020,equipment,0,0,0,black,735,7,0,0,0,0
4,1651808,2020-01-01 00:10:00,2020,moving,0,0,0,black,2011,20,0,0,0,0


In [18]:
idot_20.shape

(327290, 14)

Now we have 327,290 stops in our data. Each row records whether the car was searched, whether the officer found contraband (called a 'hit'), and where the stop occurred (the police district and beat). For example, Hyde Park is in district 2. Let's see how many stops occurred in district 2.

In [21]:
idot_20.groupby("district")["id"].count()

district
1      9347
2     15905
3     14510
4     20800
5      8567
6     11035
7     27101
8     17449
9     12383
10    24979
11    41736
12    14788
14     8468
15    24066
16     4575
17     7994
18     9588
19     7132
20     8118
22     5729
24     7542
25    23515
31     1963
Name: id, dtype: int64
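
To read off a single district without scanning the full listing, the grouped counts can be indexed by district number. The sketch below uses a small made-up DataFrame; the name `toy` and its values are illustrative and are not taken from `idot_clean.csv`.

```python
import pandas as pd

# Toy stand-in for idot_20; the values are made up for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "district": [2, 2, 10, 2, 10],
})

# Count stops per district, using the same groupby pattern as the notebook.
stops_by_district = toy.groupby("district")["id"].count()

# Index the resulting Series by district label to get a single count.
district_2_stops = stops_by_district.loc[2]
print(district_2_stops)  # 3
```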

So, there were 15,905 traffic stops made in the district surrounding Hyde Park in 2020. We can also investigate the reason for the traffic stops.

In [22]:
idot_20.groupby("reason")["id"].count()

reason
equipment    131250
license       66398
moving       129638
none              4
Name: id, dtype: int64

In 2020, 129,638 stops were for moving violations (like speeding or not using a blinker), 131,250 were for equipment problems (like a burnt-out taillight), 66,398 were for an invalid license, and there were 4 cases where the reason was not entered in the database.

## Traffic Stops

This is all very interesting, but what we wanted to investigate was the stop rates for drivers of different races. We can do this using the variable `driver_race`. First, let's drop some of the columns that we don't need for now.

In [25]:
idot_20_abbrv = idot_20.drop(columns=["id","datetime","any_search_consent","beat","district","weapon_hit","drugs_hit","alcohol_hit","contraband_hit"])
idot_20_abbrv

Unnamed: 0,year,reason,any_search,search_hit,driver_race
0,2020,moving,0,0,am_indian
1,2020,moving,0,0,hispanic
2,2020,equipment,0,0,black
3,2020,equipment,0,0,black
4,2020,moving,0,0,black
...,...,...,...,...,...
327285,2020,moving,0,0,asian
327286,2020,equipment,0,0,black
327287,2020,moving,0,0,white
327288,2020,moving,0,0,black


Recall, we also could have selected only the columns we wanted instead of dropping the columns we didn't need (see [Chapter 6](../../06/DataFrames.ipynb)).
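
As a minimal sketch of that alternative, the example below selects columns by name on a small made-up DataFrame and confirms that it matches the result of dropping the unwanted ones. The name `toy` and its values are illustrative only.

```python
import pandas as pd

# Toy stand-in for idot_20 with a few of its columns; values are made up.
toy = pd.DataFrame({
    "id": [101, 102],
    "year": [2020, 2020],
    "reason": ["moving", "equipment"],
    "any_search": [0, 1],
    "search_hit": [0, 0],
    "driver_race": ["black", "white"],
    "beat": [312, 1021],
})

# Keep only the columns needed for the analysis by listing them explicitly.
keep = ["year", "reason", "any_search", "search_hit", "driver_race"]
selected = toy[keep]

# Dropping the unwanted columns gives the same DataFrame.
dropped = toy.drop(columns=["id", "beat"])
print(selected.equals(dropped))  # True
```

Listing the columns to keep tends to be shorter when most columns are unwanted, as is the case here.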

Now, let's group the data by `driver_race` and count how many drivers of each race were stopped.

In [32]:
stopped = idot_20_abbrv.groupby("driver_race")["year"].count().reset_index().rename(columns={"year":"num_stopped"})
stopped

Unnamed: 0,driver_race,num_stopped
0,am_indian,1176
1,asian,7448
2,black,204203
3,hispanic,78449
4,other,895
5,white,35053


From this table, it certainly looks like more Black and Hispanic drivers were stopped in 2020 than drivers of other races. However, we are only looking at raw counts. It will be easier to see a pattern if we convert these counts to a proportion of total stops.

In [33]:
# Divide by the total number of 2020 stops (the 327,290 rows of idot_20)
stopped['prop_stopped'] = stopped['num_stopped'] / len(idot_20)
stopped

Unnamed: 0,driver_race,num_stopped,prop_stopped
0,am_indian,1176,0.003593
1,asian,7448,0.022757
2,black,204203,0.623921
3,hispanic,78449,0.239693
4,other,895,0.002735
5,white,35053,0.107101


Now, we can see that 62% of drivers stopped were Black, 24% were Hispanic, and 11% were White.
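
Proportions like these can also be computed in one step with `value_counts(normalize=True)`, with one caveat: that method excludes missing values from the denominator. The per-race counts above sum to 327,224 rather than 327,290, which suggests (as an inference, not something the data dictionary confirms) that a few stops have no recorded race. The sketch below uses a made-up Series to show how the two denominators differ.

```python
import pandas as pd

# Toy stand-in for the driver_race column; values are made up,
# including one missing entry.
races = pd.Series(["black", "black", "hispanic", "white", None], name="driver_race")

# Share of each race among rows with a recorded race (missing values excluded).
prop_recorded = races.value_counts(normalize=True)

# Share of each race among all rows, i.e. a denominator of len(races).
prop_all = races.value_counts() / len(races)

print(prop_recorded["black"])  # 0.5
print(prop_all["black"])       # 0.4
```

With only a handful of missing values out of several hundred thousand rows, the two approaches would differ only slightly, but it is worth knowing which denominator is in use.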

### The Benchmark

To do a benchmark analysis, we need a benchmark for comparison: in this case, the population of Chicago. Next, we will read in data on the estimated driving population of each race by beat. This dataset was created by a research team at the University of Chicago Data Science Institute.

In [45]:
pop = pd.read_csv('./adjusted_population_beat.csv')
pop.head()

Unnamed: 0,beat,White,Black,Hispanic,Asian,Native,Other
0,1713,1341.069794,1865.300113,937.499902,317.076466,0.0,0.0
1,1651,0.0,0.0,0.0,0.0,0.0,0.0
2,1914,641.288133,5878.781077,1621.201006,407.487888,0.0,0.0
3,1915,1178.322329,1331.124001,1597.027253,283.175022,0.0,0.0
4,1913,739.593202,2429.896179,535.721544,121.08396,159.015646,0.0


Since we are interested in Chicago as a whole, we can sum over beats to get the estimated population for all of Chicago.

In [60]:
pop_total = pop.sum(axis=0).to_frame(name="est_pop").reset_index().rename(columns={'index': 'driver_race'})
pop_total

Unnamed: 0,driver_race,est_pop
0,beat,335962.0
1,White,298755.197477
2,Black,248709.530847
3,Hispanic,222341.224432
4,Asian,60069.768541
5,Native,2664.876752
6,Other,212.53729


Clearly, the beat row gives us no useful information, so let's drop it.

In [62]:
pop_total = pop_total.drop([0])
pop_total

Unnamed: 0,driver_race,est_pop
1,White,298755.197477
2,Black,248709.530847
3,Hispanic,222341.224432
4,Asian,60069.768541
5,Native,2664.876752
6,Other,212.53729
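
Another way to avoid the meaningless `beat` total is to drop the `beat` column before summing, so the row never appears. The sketch below is one possible version, using the first three rows of the `pop.head()` output (rounded, and with only two race columns for brevity); the names `pop_toy` and `totals` are illustrative.

```python
import pandas as pd

# First three rows of the pop.head() output, rounded for brevity.
pop_toy = pd.DataFrame({
    "beat": [1713, 1651, 1914],
    "White": [1341.07, 0.0, 641.29],
    "Black": [1865.30, 0.0, 5878.78],
})

# Drop the identifier column before summing so no "beat" row is produced.
totals = (
    pop_toy.drop(columns="beat")
    .sum(axis=0)
    .to_frame(name="est_pop")
    .reset_index()
    .rename(columns={"index": "driver_race"})
)
print(totals["driver_race"].tolist())  # ['White', 'Black']
```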


Now, we can calculate proportions of the population, similar to what we did with number of traffic stops above.

In [63]:
pop_total['prop_pop'] = pop_total['est_pop'] / pop_total['est_pop'].sum()
pop_total

Unnamed: 0,driver_race,est_pop,prop_pop
1,White,298755.197477,0.358756
2,Black,248709.530847,0.298659
3,Hispanic,222341.224432,0.266995
4,Asian,60069.768541,0.072134
5,Native,2664.876752,0.0032
6,Other,212.53729,0.000255


We want to compare these proportions with the proportion of stops we calculated above. This is easier to do if they are in the same DataFrame. We can combine them using `merge`, but first we need to make sure each race is named the same in both DataFrames.
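
One way to check which names disagree is an outer merge with `indicator=True`, which flags labels that appear in only one table. The sketch below is a minimal version using just the race labels from the two tables; the helper `unmatched_labels` is a hypothetical name introduced for this example.

```python
import pandas as pd

# Race labels as they appear in the two tables before any renaming.
stops_labels = pd.DataFrame(
    {"driver_race": ["am_indian", "asian", "black", "hispanic", "other", "white"]}
)
pop_labels = pd.DataFrame(
    {"driver_race": ["White", "Black", "Hispanic", "Asian", "Native", "Other"]}
)


def unmatched_labels(left, right):
    """Return the labels that appear in only one of the two DataFrames."""
    # indicator=True adds a _merge column: left_only, right_only, or both.
    check = left.merge(right, on="driver_race", how="outer", indicator=True)
    return sorted(check.loc[check["_merge"] != "both", "driver_race"])


# Case differs everywhere, so nothing matches at first.
before = unmatched_labels(stops_labels, pop_labels)
print(len(before))  # 12

# After lowercasing, only the am_indian / native pair still disagrees.
pop_labels["driver_race"] = pop_labels["driver_race"].str.lower()
after = unmatched_labels(stops_labels, pop_labels)
print(after)  # ['am_indian', 'native']
```

A default inner merge would silently drop any unmatched rows, so a check like this is a cheap safeguard before combining tables.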

In [66]:
pop_total['driver_race'] = pop_total['driver_race'].str.lower()
pop_total

Unnamed: 0,driver_race,est_pop,prop_pop
1,white,298755.197477,0.358756
2,black,248709.530847,0.298659
3,hispanic,222341.224432,0.266995
4,asian,60069.768541,0.072134
5,native,2664.876752,0.0032
6,other,212.53729,0.000255


In [68]:
race_map = {"native":"am_indian"}

pop_total['driver_race'] = pop_total['driver_race'].replace(race_map)
pop_total

Unnamed: 0,driver_race,est_pop,prop_pop
1,white,298755.197477,0.358756
2,black,248709.530847,0.298659
3,hispanic,222341.224432,0.266995
4,asian,60069.768541,0.072134
5,am_indian,2664.876752,0.0032
6,other,212.53729,0.000255


In [69]:
stops_vs_pop = stopped.merge(pop_total, on="driver_race")
stops_vs_pop

Unnamed: 0,driver_race,num_stopped,prop_stopped,est_pop,prop_pop
0,am_indian,1176,0.003593,2664.876752,0.0032
1,asian,7448,0.022757,60069.768541,0.072134
2,black,204203,0.623921,248709.530847,0.298659
3,hispanic,78449,0.239693,222341.224432,0.266995
4,other,895,0.002735,212.53729,0.000255
5,white,35053,0.107101,298755.197477,0.358756


Using this new table, we can see that American Indian and Hispanic drivers are stopped at a proportion similar to their share of the population. However, Black drivers make up about 62% of stops but only about 30% of the estimated population, roughly twice their share, while Asian and White drivers are stopped at rates well below their share of the population. Drivers in the Other category account for less than 1% of both stops and the estimated population, though their share of stops is about ten times their share of the population.
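
The comparison can be made explicit by dividing each group's share of stops by its share of the population. The sketch below copies `prop_stopped` and `prop_pop` from the merged table; the column name `stop_ratio` is introduced here for illustration and is not part of the original analysis.

```python
import pandas as pd

# prop_stopped and prop_pop copied from the stops_vs_pop table.
stops_vs_pop = pd.DataFrame({
    "driver_race": ["am_indian", "asian", "black", "hispanic", "other", "white"],
    "prop_stopped": [0.003593, 0.022757, 0.623921, 0.239693, 0.002735, 0.107101],
    "prop_pop": [0.0032, 0.072134, 0.298659, 0.266995, 0.000255, 0.358756],
})

# A ratio above 1 means the share of stops exceeds the share of population.
stops_vs_pop["stop_ratio"] = stops_vs_pop["prop_stopped"] / stops_vs_pop["prop_pop"]
print(stops_vs_pop[["driver_race", "stop_ratio"]].round(2))
```

The ratio comes out to about 2.1 for Black drivers, about 0.9 for Hispanic drivers, and about 0.3 for both Asian and White drivers.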

## Searches

Our data doesn't just have information on stops. It also has information on whether the driver or car was searched during the traffic stop. Let's investigate this by race as well.

In [41]:
searched = idot_20_abbrv.groupby("driver_race")[["any_search","search_hit"]].sum().reset_index().rename(columns={"any_search":"num_searched", "search_hit":"num_hit"})
searched

Unnamed: 0,driver_race,num_searched,num_hit
0,am_indian,3,1
1,asian,31,12
2,black,3712,980
3,hispanic,1140,317
4,other,5,3
5,white,190,66


Again, let's convert counts to proportions. For number of searches, we will convert to proportion of total searches. For number of hits, we will convert to proportion of searches that resulted in a hit. 

In [43]:
searched['prop_searched'] = searched['num_searched'] / searched['num_searched'].sum()
searched

Unnamed: 0,driver_race,num_searched,num_hit,prop_searched
0,am_indian,3,1,0.00059
1,asian,31,12,0.006101
2,black,3712,980,0.730565
3,hispanic,1140,317,0.224365
4,other,5,3,0.000984
5,white,190,66,0.037394


In [44]:
searched['prop_hit'] = searched['num_hit'] / searched['num_searched']
searched

Unnamed: 0,driver_race,num_searched,num_hit,prop_searched,prop_hit
0,am_indian,3,1,0.00059,0.333333
1,asian,31,12,0.006101,0.387097
2,black,3712,980,0.730565,0.264009
3,hispanic,1140,317,0.224365,0.27807
4,other,5,3,0.000984,0.6
5,white,190,66,0.037394,0.347368


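Here, we can see that Black and Hispanic drivers account for most searches: about 73% and 22% of all searches, respectively. Yet their hit rates (about 26% and 28%) are no higher than those of White (35%) or Asian (39%) drivers; the American Indian and Other hit rates rest on only 3 and 5 searches. In other words, although contraband is not found at higher rates among Black and Hispanic drivers, police search them for contraband more often than other drivers.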
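
Because Black and Hispanic drivers are also stopped more often, a complementary check is the share of stops that led to a search. The sketch below combines the counts from the `stopped` and `searched` tables; the names `rates` and `search_rate` are introduced for illustration, and this per-stop rate is an additional measure rather than one computed in the analysis above.

```python
import pandas as pd

# Counts copied from the stopped and searched tables.
rates = pd.DataFrame({
    "driver_race": ["am_indian", "asian", "black", "hispanic", "other", "white"],
    "num_stopped": [1176, 7448, 204203, 78449, 895, 35053],
    "num_searched": [3, 31, 3712, 1140, 5, 190],
})

# Searches per stop: the share of stopped drivers who were searched.
rates["search_rate"] = rates["num_searched"] / rates["num_stopped"]
print(rates[["driver_race", "search_rate"]].round(4))
```

By this measure, about 1.8% of stops of Black drivers and 1.5% of stops of Hispanic drivers involved a search, compared with about 0.5% for White drivers and 0.4% for Asian drivers.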