# Checkpoint 4: Machine Learning

In [1]:
import helpers
import numpy as np
from sklearn.linear_model import LinearRegression

%load_ext autoreload
%autoreload 2

## Data Preprocessing

Our goal for this checkpoint is to compare the police districts of Chicago, grouped into North, Central, West, and South regions. First, we load our pre-made data files into three dictionaries: the number of complaints per year per district, the number of officers per year per district, and the civilian population per district.

The complaint data counts only complaints in the "Use Of Force" or "Illegal Search" categories, as these are the two complaint types most likely to be seen as malicious.

In [4]:
complaint_dict = helpers.prepare_complaint_json()
officer_dict = helpers.prepare_officer_json()
population_dict = helpers.prepare_population_json()

Next, we group the districts into regional buckets. These buckets were hand-picked from a map of Chicago, and each region contains 5-6 districts. We apply the same grouping to each of the statistics loaded above.

In [5]:
north, central, west, south = helpers.district_buckets()

In [6]:
print(north)

['16th', '17th', '24th', '20th', '19th']


### Complaints

In [7]:
north_complaints = helpers.merge_dicts(north, complaint_dict)
central_complaints = helpers.merge_dicts(central, complaint_dict)
west_complaints = helpers.merge_dicts(west, complaint_dict)
south_complaints = helpers.merge_dicts(south, complaint_dict)

In [8]:
print("The complaints for the northern region per year are as follows:\n")
print(north_complaints)

The complaints for the northern region per year are as follows:

Counter({'2007': 14534, '2008': 14418, '2009': 14393, '2001': 13620, '2010': 12603, '2002': 12376, '2011': 12282, '2003': 11277, '2012': 10983, '2013': 10829, '2006': 10268, '2005': 9861, '2004': 9638, '2014': 7962, '2015': 7703, '2016': 7331, '2017': 6491, '2018': 3005})
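The result above is a `Counter`, which suggests the per-district dictionaries are added together year by year. The following is a minimal sketch of that kind of merge, assuming each district maps to a year-to-count dictionary. The function `merge_year_counts` and the sample data are hypothetical and are not the actual `helpers.merge_dicts`.

```python
from collections import Counter


def merge_year_counts(districts, per_district):
    """Sum the year -> count dictionaries of the given districts."""
    total = Counter()
    for district in districts:
        # Counter.update adds counts key by key instead of overwriting them.
        total.update(per_district.get(district, {}))
    return total


# Hypothetical sample data, not taken from the real dataset.
sample = {
    "16th": {"2001": 3, "2002": 5},
    "17th": {"2001": 4, "2003": 1},
    "24th": {"2002": 2},
}

merged = merge_year_counts(["16th", "17th"], sample)
print(merged)
```

Districts left out of the bucket (here `"24th"`) do not contribute to the totals.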


### Police Officers

In [9]:
north_cops = helpers.merge_dicts(north, officer_dict)
central_cops = helpers.merge_dicts(central, officer_dict)
west_cops = helpers.merge_dicts(west, officer_dict)
south_cops = helpers.merge_dicts(south, officer_dict)

In [10]:
print(f'Number of officers active in the northern region: {north_cops}\n')

Number of officers active in the northern region: Counter({'2001': 897, '2002': 852, '2003': 802, '2007': 712, '2008': 697, '2006': 609, '2004': 605, '2005': 567, '2009': 553, '2010': 486, '2011': 464, '2012': 374, '2013': 364, '2015': 362, '2014': 358, '2016': 242, '2017': 221, '2018': 56})



### Population

In [11]:
north_pop = helpers.merge_values(north, population_dict)
central_pop = helpers.merge_values(central, population_dict)
west_pop = helpers.merge_values(west, population_dict)
south_pop = helpers.merge_values(south, population_dict)

In [12]:
print(f'Number of civilians in the northern region: {north_pop}')

Number of civilians in the northern region: 776681
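Population is a single number per district, so grouping it reduces to a plain sum. A minimal sketch, assuming the population dictionary maps a district name to one integer; `sum_values` and the sample figures are hypothetical and are not the actual `helpers.merge_values`.

```python
def sum_values(districts, per_district):
    """Add up the scalar value of every district in the bucket."""
    return sum(per_district[district] for district in districts)


# Hypothetical sample data, not taken from the real dataset.
sample_population = {"16th": 1000, "17th": 2500, "24th": 400}

region_total = sum_values(["16th", "24th"], sample_population)
print(region_total)  # 1400
```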


## Linear Regression

Now that we've grouped our data, we can build a regression model for each region. We compare the models across regions by looking at each one's coefficients and intercept.

### North

In [14]:
X, y = helpers.format_data(north_complaints, north_cops, north_pop)

reg = LinearRegression().fit(X, y)

In [15]:
print(reg.coef_)
print(reg.intercept_)

[-5.56960415e-04 -4.71942923e-06]
1.1351897348343347
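The notebook does not show what `helpers.format_data` returns, only that the fitted model has two coefficients. One layout consistent with that is sketched below, assuming the features are the year and the officer count and the target is complaints per resident. This is a guess rather than the confirmed behavior of the helper, and `build_design_matrix` is a hypothetical name. The input values are the northern-region figures printed earlier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Northern-region values copied from the printed outputs earlier in the notebook.
north_complaints = {
    "2001": 13620, "2002": 12376, "2003": 11277, "2004": 9638,
    "2005": 9861, "2006": 10268, "2007": 14534, "2008": 14418,
    "2009": 14393, "2010": 12603, "2011": 12282, "2012": 10983,
    "2013": 10829, "2014": 7962, "2015": 7703, "2016": 7331,
    "2017": 6491, "2018": 3005,
}
north_cops = {
    "2001": 897, "2002": 852, "2003": 802, "2004": 605,
    "2005": 567, "2006": 609, "2007": 712, "2008": 697,
    "2009": 553, "2010": 486, "2011": 464, "2012": 374,
    "2013": 364, "2014": 358, "2015": 362, "2016": 242,
    "2017": 221, "2018": 56,
}
north_pop = 776681


def build_design_matrix(complaints, officers, population):
    """Assumed layout: X = [year, officers], y = complaints per resident."""
    years = sorted(complaints)
    X = np.array([[int(year), officers[year]] for year in years], dtype=float)
    y = np.array([complaints[year] / population for year in years])
    return X, y


X, y = build_design_matrix(north_complaints, north_cops, north_pop)
model = LinearRegression().fit(X, y)

print(X.shape, y.shape)
print(model.coef_, model.intercept_)
```

Under this assumed layout, each coefficient is the change in complaints per resident for a one-unit change in that feature, which would explain why the fitted values are so small.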


### Central

In [16]:
X, y = helpers.format_data(central_complaints, central_cops, central_pop)

reg = LinearRegression().fit(X, y)

In [17]:
print(reg.coef_)
print(reg.intercept_)

[-1.62082533e-03 -1.73855752e-06]
3.2917074862002123


### West

In [18]:
X, y = helpers.format_data(west_complaints, west_cops, west_pop)

reg = LinearRegression().fit(X, y)

In [19]:
print(reg.coef_)
print(reg.intercept_)

[-8.45460917e-04  4.42908131e-06]
1.7232630790765284


### South

In [20]:
X, y = helpers.format_data(south_complaints, south_cops, south_pop)

reg = LinearRegression().fit(X, y)

In [21]:
print(reg.coef_)
print(reg.intercept_)

[-1.29859345e-03  4.19909565e-06]
2.6489945683932206
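Because `reg` is reassigned in each cell above, only the last model remains available for comparison. A minimal sketch of keeping one model per region and printing the parameters side by side follows. The data here is synthetic stand-in data, and `fit_regions` and `true_params` are hypothetical names; in the notebook the `(X, y)` pairs would come from `helpers.format_data`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical (weights, intercept) pairs used only to generate stand-in data.
true_params = {
    "north": ([-0.5, 0.1], 1.0),
    "central": ([-1.5, 0.2], 3.0),
    "west": ([-0.8, -0.3], 1.7),
    "south": ([-1.2, 0.4], 2.6),
}


def fit_regions(datasets):
    """Fit one LinearRegression per region and keep every model."""
    return {name: LinearRegression().fit(X, y) for name, (X, y) in datasets.items()}


datasets = {}
for name, (weights, bias) in true_params.items():
    X = rng.normal(size=(18, 2))        # 18 rows, 2 features, like the notebook
    y = X @ np.array(weights) + bias    # noise-free linear target
    datasets[name] = (X, y)

models = fit_regions(datasets)

print(f"{'region':<8} {'coef_0':>9} {'coef_1':>9} {'intercept':>10}")
for name, model in models.items():
    print(f"{name:<8} {model.coef_[0]:>9.4f} {model.coef_[1]:>9.4f} {model.intercept_:>10.4f}")
```

Keeping the models in a dictionary also makes it easy to call `score` or `predict` on any region later without refitting.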
