In [None]:
import pathlib
from itssutils.itssdata import RawITSSData, ITSSMetrics
DATA_DIR = pathlib.Path("PATH_TO_DATA_HERE")

`itssutils` provides two classes to aid in loading and analyzing traffic stop data. The first, `RawITSSData`, is (as its name suggests) a class to be used for loading raw traffic stop data from a text file and visualizing the raw data in a few different ways. The second, `ITSSMetrics`, provides functions for splitting the data into different categories, calculating metrics, and plotting these metrics.

First, we can load in the data for a single year, say 2017.

In [None]:
raw_2017 = RawITSSData()
filepath = DATA_DIR / "2017_ITSS_Data.txt"
raw_2017.load_single_year(2017, filepath, save=True, fast=False)

Let's look at how many drivers were stopped each day, grouped by the race of the driver.

In [None]:
raw_2017.plot_timeseries(frequency='1D', group='DriverRace')

So it looks like police in Illinois conduct around 6,000-7,000 traffic stops per day. The police also appear to get into the holiday spirit -- the number of traffic stops drops significantly on Christmas Day.
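
As a rough illustration of what a daily, per-group count involves, here is a minimal pandas sketch on a toy table. It is not the `itssutils` implementation; the `StopDate` column name and the sample rows are made up for the example.

```python
import pandas as pd

# Toy stand-in for raw stop records; 'StopDate' is a hypothetical column name.
stops = pd.DataFrame({
    "StopDate": pd.to_datetime([
        "2017-12-24", "2017-12-24", "2017-12-24",
        "2017-12-25",
        "2017-12-26", "2017-12-26",
    ]),
    "DriverRace": ["Black", "White", "White", "Black", "White", "White"],
})

# Count stops per calendar day and per race, then pivot races into columns.
daily = (
    stops.groupby([pd.Grouper(key="StopDate", freq="1D"), "DriverRace"])
    .size()
    .unstack(fill_value=0)
)
print(daily)
```

Each row of `daily` is one day and each column is one group, which is the shape a line plot needs.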

We can group by multiple categories and focus on an individual department, say the Chicago Police.

In [None]:
raw_2017.plot_timeseries(agency='Chicago Police', group=['DriverRace', 'DriverSex'])

We can also filter and group by values in a given category. For instance, we can look at how many tickets were given to male and female drivers by the Illinois State Police every month.

In [None]:
raw_2017.plot_timeseries(frequency='1M', 
                         agency='Illinois State Police', 
                         filter_cols='ResultOfStop', 
                         filter_values='Citation', 
                         group='DriverSex')


We can now calculate metrics like citation rate or search hit rate from this raw data, again grouping by different categories as desired. The metrics are calculated based on the entirety of the data that is passed in -- for more granular control over, for example, the time frame over which the metrics are calculated, extract the data frame from the raw data using `get_raw_dataframe`.

In [None]:
metrics_2017 = ITSSMetrics(raw_2017)
metrics_2017.calculate_metrics(['AgencyName', 'DriverRace'])

Or, if you download our pre-processed file [here](

In [None]:
metrics_2017.get_metrics()
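
For the more granular control mentioned earlier, one option is to filter the raw data frame yourself before computing a rate. The sketch below assumes the frame behaves like an ordinary pandas `DataFrame`; it uses a toy table, and the `StopDate` and `SearchConducted` column names are hypothetical, so adapt them to the actual columns in your data.

```python
import pandas as pd

# Toy stand-in for the raw data frame; 'StopDate' and 'SearchConducted'
# are hypothetical column names used only for this illustration.
stops = pd.DataFrame({
    "StopDate": pd.to_datetime([
        "2017-01-05", "2017-01-20", "2017-02-03",
        "2017-06-11", "2017-06-15", "2017-07-01",
    ]),
    "AgencyName": ["A", "A", "A", "A", "B", "B"],
    "DriverRace": ["Black", "White", "Black", "White", "Black", "White"],
    "SearchConducted": [True, False, False, False, True, True],
})

# Keep only stops from the first quarter of 2017.
in_q1 = (stops["StopDate"] >= "2017-01-01") & (stops["StopDate"] < "2017-04-01")
q1 = stops[in_q1]

# Search rate = fraction of stops with a search, per agency and race.
search_rate = q1.groupby(["AgencyName", "DriverRace"])["SearchConducted"].mean()
print(search_rate)
```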

Let's examine Chicago's search rate and search "hit" rates for all different types of drivers.

In [None]:
metrics_2017.plot_bars('Chicago Police', 'SearchRate', 
              only_include_rows=['Black', 'Hispanic/Latino', 'Asian', 'White'])
metrics_2017.plot_bars('Chicago Police', 'SearchHitRate', 
              only_include_rows=['Black', 'Hispanic/Latino', 'Asian', 'White'])

It looks like both Black and Hispanic drivers are searched at higher rates than White drivers, but that the Chicago police are less likely to find contraband when searching Black or Hispanic drivers than when searching White drivers.

We can now compare the difference between the search rate for Black and White drivers for all police departments in Illinois. The scatter plot functionality will enable us to visualize the rates for all the departments in the data set. We can identify the largest departments by sizing the dots according to a certain count, in this case the number of searches performed. Using a log scale makes the data easier to visualize in some cases when values are clustered around a low value, as in this case.

In [None]:
s1 = metrics_2017.plot_scatter('Black', 'White', 'SearchRate', 'SearchCount', 
                      population_col='StopCount',
                      logscale=True, 
                      limits=[0.001, 1], 
                      title=' ')
s2 = metrics_2017.plot_scatter('Hispanic/Latino', 'White', 'SearchRate', 'SearchCount', 
                      population_col='StopCount',
                      logscale=True, 
                      limits=[0.001,1], 
                      title=' ')

We can then similarly visualize the search "hit" rate.

In [None]:
s1 = metrics_2017.plot_scatter('Black', 'White', 'SearchHitRate', 'SearchHitCount', 
                      population_col='StopCount',
                      logscale=False, 
                      title='Search Hit Rate Comparison')
s2 = metrics_2017.plot_scatter('Hispanic/Latino', 'White', 'SearchHitRate', 'SearchHitCount', 
                      population_col='StopCount',
                      logscale=False, 
                      title='Search Hit Rate Comparison')

And the citation rate:

In [None]:
s1 = metrics_2017.plot_scatter('Black', 'White', 'Result-CitationRate', 'Result-CitationCount', 
                      population_col='StopCount',
                      logscale=False, 
                      title='Citation Rate Comparison')
s2 = metrics_2017.plot_scatter('Hispanic/Latino', 'White', 'Result-CitationRate', 'Result-CitationCount', 
                      population_col='StopCount',
                      logscale=False, 
                      title='Citation Rate Comparison')

Finally, we can ask whether the results shown in the scatter plot are significant by conducting significance testing. A z-score can be calculated using the z-test for two population proportions.
$$z=\frac{p_1-p_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$ where the observed probabilities $p_i$ are the number of observed occurrences $x_i$ over the total number of observations $n_i$, $$p_i = \frac{x_i}{n_i}$$ and the overall (pooled) probability $\hat{p}$ is $$\hat{p} = \frac{x_1+x_2}{n_1+n_2}$$
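
To make the formula concrete, here is a minimal sketch of the two-proportion z-test in plain Python. The function name `two_proportion_z` and the example counts are made up for illustration; this is not necessarily how `itssutils` computes its z-scores.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z-score for the difference between two observed proportions.

    x1, x2: number of occurrences in group 1 and group 2
    n1, n2: number of observations in group 1 and group 2
    """
    p1 = x1 / n1
    p2 = x2 / n2
    # Pooled proportion across both groups.
    p_hat = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    if se == 0:
        # Happens when the pooled proportion is exactly 0 or 1.
        return float("nan")
    return (p1 - p2) / se

# Hypothetical counts: 30 hits in 200 searches vs. 50 hits in 200 searches.
z = two_proportion_z(30, 200, 50, 200)
print(f"{z:.2f}")  # -2.50
```

A negative value means the first group's proportion is lower than the second's.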

We can then examine whether the difference in search hit rates is statistically significant for all of the departments across Illinois. Since we calculate this metric for a number of departments, we might expect to observe deviation from perfect equality due to random chance. Observing a histogram of values relative to the expected distribution allows us to draw conclusions about the overall distribution of values and identify outliers.
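
To see what "deviation due to random chance" looks like, the simulation below draws search outcomes for many hypothetical departments in which both groups have exactly the same true hit rate, then checks how often the z-score still lands beyond ±1.96. The department count, sample sizes, and hit rate are arbitrary assumptions chosen for the sketch.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

n_departments = 2000   # hypothetical number of departments
n_searches = 500       # hypothetical searches per group, per department
true_hit_rate = 0.25   # identical for both groups, so no real disparity

# Simulated number of successful searches for each group and department.
x1 = rng.binomial(n_searches, true_hit_rate, size=n_departments)
x2 = rng.binomial(n_searches, true_hit_rate, size=n_departments)

# Two-proportion z-score with equal group sizes.
p_hat = (x1 + x2) / (2 * n_searches)
se = np.sqrt(p_hat * (1 - p_hat) * (2 / n_searches))
z = (x1 - x2) / n_searches / se

# Fraction of departments flagged at |z| > 1.96 even though none differ.
observed_fraction = float(np.mean(np.abs(z) > 1.96))
# Two-sided tail probability of a standard normal beyond 1.96 (about 5%).
expected_fraction = math.erfc(1.96 / math.sqrt(2))

print(f"observed: {observed_fraction:.3f}, expected: {expected_fraction:.3f}")
```

Roughly one department in twenty crosses the threshold by chance alone, which is why the full histogram is compared against the expected distribution rather than judging each department in isolation.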

In [None]:
z = metrics_2017.plot_zhist('Black', 'White', 'SearchHitCount', 'SearchCount')
print(z.sort_values().head(10))
z = metrics_2017.plot_zhist('Hispanic/Latino', 'White', 'SearchHitCount', 'SearchCount')
print(z.sort_values().head(10))

In [None]:
z = metrics_2017.plot_zhist('Black', 'White', 'Result-CitationCount', 'StopCount')
print(z.sort_values(ascending=False).head(10))
z = metrics_2017.plot_zhist('Hispanic/Latino', 'White', 'Result-CitationCount', 'StopCount')
print(z.sort_values(ascending=False).head(10))

We can save these metrics as both a CSV file and a pickle file for loading later -- that way, we don't have to wait for them to be recalculated every time we want to look at them.

In [None]:
metrics_2017.save_csv(DATA_DIR / "preprocessed" / "ITSS_Metrics_2017.csv")
metrics_2017.save(DATA_DIR / "preprocessed" / "ITSS_Metrics_2017.pkl")

We can also do more! Let's load in a bunch of data.

In [None]:
raw_2012_2017 = RawITSSData()
year_file_list = [(year, DATA_DIR / f'{year}_ITSS_Data.txt') for year in range(2012, 2018)]
# This might take a bit of time...
raw_2012_2017.load_multiple_years(year_file_list, fast=True, save=False) 

As above, we can plot a time series for all of this raw data.

In [None]:
raw_2012_2017.plot_timeseries(frequency='1M',
                                agency='Chicago Police', 
                                group='DriverRace')

Wow, something clearly happened after 2015... The Chicago Police appear to have tripled the number of traffic stops they conducted!

We can calculate metrics for this expanded time range.

In [None]:
metrics_2012_2017 = ITSSMetrics(raw_2012_2017)
# This will probably take some time... like 15-30 minutes time...
metrics_2012_2017.calculate_metrics(['AgencyName', 'DriverRace', 'Year']) 

In [None]:
metrics_2012_2017.save_csv(DATA_DIR / "preprocessed" / "2012-2017_ITSS_Metrics.csv")
metrics_2012_2017.save(DATA_DIR / "preprocessed" / "2012-2017_ITSS_Metrics.pkl")

In [None]:
metrics_2012_2017.plot_timeseries('SearchRate', 
                                  only_include_rows='Chicago Police',
                                  only_include_entries=['Black', 'Hispanic/Latino', 'Asian', 'White'],
                                  title='Search Rate 2012-2017')

In [None]:
s = metrics_2012_2017.plot_scatter(('Black', 'All_Year'), ('White', 'All_Year'), 'SearchHitRate', 'SearchHitCount')

In [None]:
s = metrics_2012_2017.plot_scatter(('Hispanic/Latino', 'All_Year'), ('White', 'All_Year'), 
                                   'SearchHitRate', 'SearchHitCount')

In [None]:
z = metrics_2012_2017.plot_zhist(('Black', 'All_Year'), ('White', 'All_Year'), 'SearchHitCount', 'SearchCount')
print(z.sort_values().head(10))

In [None]:
z = metrics_2012_2017.plot_zhist(('Hispanic/Latino', 'All_Year'), ('White', 'All_Year'), 'SearchHitCount', 'SearchCount')
print(z.sort_values().head(10))

To a high degree of statistical significance, there are many police departments that find contraband at lower rates when searching Black or Hispanic drivers than when searching White drivers.

In [None]:
s = metrics_2012_2017.plot_scatter(('Black', 'All_Year'), ('White', 'All_Year'), 
                                   'Result-CitationRate', 'Result-CitationCount',
                                   population_col='StopCount')

In [None]:
s = metrics_2012_2017.plot_scatter(('Hispanic/Latino', 'All_Year'), ('White', 'All_Year'), 
                                   'Result-CitationRate', 'Result-CitationCount', 
                                    population_col='StopCount')

In [None]:
z = metrics_2012_2017.plot_zhist(('Black', 'All_Year'), ('White', 'All_Year'), 'Result-CitationCount', 'StopCount')
print(z.sort_values(ascending=False).head(10))

In [None]:
z = metrics_2012_2017.plot_zhist(('Hispanic/Latino', 'All_Year'), ('White', 'All_Year'), 'Result-CitationCount', 'StopCount')
print(z.sort_values(ascending=False).head(10))

Is this citation difference due to driving behavior? We can break down the citations by the type of violation to try to get an answer.

In [None]:
for race in [('Black', 'All_Year'), ('Hispanic/Latino', 'All_Year'),]:
    for violation in ['MovingViolation', 'Equipment', 'LicenseRegistration', 'CommercialVehicle']:
        colstr = 'Reason-' + violation
        z = metrics_2012_2017.plot_zhist(race, ('White', 'All_Year'), colstr + 'CitationCount', colstr + 'Count')

In [None]:
for race in [('Black', 'All_Year'), ('Hispanic/Latino', 'All_Year')]:
    for mv in ['Speed', 'Traffic', 'Other', 'Lane', 'Follow', 'Seat']:
        colstr = 'move-' + mv
        z = metrics_2012_2017.plot_zhist(race, ('White', 'All_Year'), colstr + 'CitationCount', colstr + 'Count')

There's more to try on your own! You could try grouping by the sex of the driver (`DriverSex`) or by the year to look at year-over-year changes in specific metrics. 