![](http://freechicagowalkingtours.com/wp-content/uploads/2016/09/Crime-Scene-Banner-Size-1024x397.jpg)

**In this kernel we will explore the information made available to Kagglers regarding crimes in Chicago. The dataset covers 2001 to the present day, but it holds only about 65-66k records, compared with roughly 6.6M records in the original dataset. If anyone is interested in analyzing the original dataset, it can be found [here](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2). Nonetheless, 65k instances should be enough to give us some good insights into the crime scene in Chicago, so let's get started by importing the necessary libraries.**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import folium
import folium.plugins
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
import plotly.plotly as py
from sklearn.cluster import KMeans
import pylab as pl
from mpl_toolkits.mplot3d import Axes3D
from bubbly.bubbly import bubbleplot 
from plotly.graph_objs import Scatter, Figure, Layout
from __future__ import division
import datetime

**Loading the data now! Plus a quick peek into the first few instances to see what it looks like**

In [None]:
data = pd.read_csv("../input/Crimes_2001_to_present_sample.csv")

In [None]:
data.head()

**Dropping some unnecessary or redundant columns**

In [None]:
data.drop(['X Coordinate', 'Y Coordinate', 'Updated On', 'Location', 'Beat'], axis=1, inplace=True)

**Doing some date time preprocessing**

In [None]:
data['Date'] = pd.to_datetime(data['Date'])
data['date'] = data['Date'].dt.date
data['time'] = data['Date'].dt.time.astype(str)

# Seconds elapsed since midnight, used later as a numeric time-of-day feature
data['seconds'] = (data['Date'].dt.hour * 3600
                   + data['Date'].dt.minute * 60
                   + data['Date'].dt.second)

![](https://img.ifun01.com/images/2016/12/31/18/2414_afxSAU_zvZ7fg.jpg!r800x0.jpg)

**For the purpose of clustering, we will run k-means on three feature sets:**
1. **First, we will cluster the data by District, Ward and Primary Type (as per IUCR code). I feel this will help us identify which parts of the city experience which types of crime.**
2. **Then we shall cluster on Time, District and Primary Type (as per IUCR code), which should help us see which districts are more prone to which crimes at what time.**
3. **Finally, we will cluster on Time, Date and Primary Type (as per IUCR code), which should help us answer questions such as which crimes the city is most prone to on, say, New Year's Eve, and at what time.**

**The KMeans problem is solved using Lloyd's algorithm. In practice, k-means is very fast (one of the fastest clustering algorithms available), but it can converge to a local minimum. That's why it can be useful to restart it several times.**

**If you want to know more about KMeans and Lloyd's algorithms, go here [KMeans](https://en.wikipedia.org/wiki/K-means_clustering) and here [Lloyd](https://en.wikipedia.org/wiki/Lloyd%27s_algorithm)**
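
**To make the restart idea concrete, here is a minimal NumPy sketch of Lloyd's algorithm run from several random initialisations, keeping the run with the lowest within-cluster sum of squares. The helper names (`lloyd`, `lloyd_with_restarts`) and the synthetic blobs are illustrative assumptions, not part of this kernel; scikit-learn's `KMeans` handles restarts internally through its `n_init` parameter.**

```python
import numpy as np

def lloyd(X, k, rng, n_iter=100):
    """One run of Lloyd's algorithm from a random initialisation (illustrative helper)."""
    # Pick k distinct data points as the starting centroids
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and inertia (sum of squared distances to the nearest centroid)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = float((dists[np.arange(len(X)), labels] ** 2).sum())
    return labels, centers, inertia

def lloyd_with_restarts(X, k, n_restarts=10, seed=0):
    """Run Lloyd's algorithm several times and keep the lowest-inertia result."""
    rng = np.random.default_rng(seed)
    runs = [lloyd(X, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda r: r[2]), [r[2] for r in runs]

# Synthetic data: four Gaussian blobs in 2-D (made up for the demonstration)
demo_rng = np.random.default_rng(42)
X_demo = np.vstack([
    demo_rng.normal(loc=c, scale=0.3, size=(50, 2))
    for c in [(0, 0), (3, 0), (0, 3), (3, 3)]
])

(best_labels, best_centers, best_inertia), all_inertias = lloyd_with_restarts(X_demo, k=4)
print("inertia per restart:", np.round(all_inertias, 2))
print("best inertia:", round(best_inertia, 2))
```

**Different restarts may end at different inertia values; that spread is the local-minimum behaviour described above.**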

**Creating a subset of our dataset with IUCR codes, District codes and Ward codes**

**IUCR stands for Illinois Uniform Crime Reporting. It encodes each type of crime with a specific code; the list of IUCR codes for different crimes can be found [here](https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e/data). The District codes and Ward codes can be found [here](https://www.cityofchicago.org/city/en/about/wards.html) and [here](https://home.chicagopolice.org/community/districts/)**

In [None]:
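sub_data = data[['Ward', 'IUCR', 'District']]
# Fill missing values in each column with that column's most frequent value
sub_data = sub_data.apply(lambda x: x.fillna(x.value_counts().index[0]))
# Keep only the leading digits of each IUCR code (raw string avoids an invalid-escape warning)
sub_data['IUCR'] = sub_data.IUCR.str.extract(r'(\d+)', expand=True).astype(int)
sub_data.head()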
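
**A small sketch of what the digit extraction above does to individual codes. The four sample codes are chosen only for illustration.**

```python
import pandas as pd

# Illustrative sample of IUCR-style codes; some end in a letter
codes = pd.Series(["0486", "041A", "1320", "031A"])

# Keep the first run of digits and cast to int:
# leading zeros disappear and any letter suffix is dropped
numeric = codes.str.extract(r"(\d+)", expand=False).astype(int)
print(numeric.tolist())
```

**Note that codes differing only in their letter suffix collapse to the same integer, so the numeric feature is slightly coarser than the original code.**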

![](https://www.robotlab.com/hs-fs/hub/314265/file-2565558338-jpg/images/Blog/MathJoke.jpg?t=1532994469695&width=650&name=MathJoke.jpg)

**Before getting into clustering, some things to note. To find the optimal number of clusters I will go with the elbow rule: on the curve of score versus number of clusters, the optimal point is where the first bend (or elbow) occurs, because beyond it each extra cluster improves the score only slightly. The score plotted here is scikit-learn's `score`, the negative within-cluster sum of squared distances, so it keeps rising toward zero until each point behaves as its own cluster. For KMeans we also have to normalize the data first; without that, KMeans will simply cluster on the Euclidean distances of the IUCR codes, since they span a much larger range of values than the District or Ward codes. I will show the clusters both with and without normalization, so that you can see the results for yourself.**

In [None]:
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
# score() returns the negative within-cluster sum of squares, so the curve rises toward zero
score = [km.fit(sub_data).score(sub_data) for km in kmeans]
pl.plot(N, score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()
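
**Reading the elbow by eye is subjective. As a rough cross-check, one common heuristic picks the point on the curve that lies farthest from the straight line joining its two end points. The sketch below assumes that heuristic; the function name and the inertia values are made up for illustration and are not results from this dataset.**

```python
import numpy as np

def elbow_by_chord_distance(ks, inertias):
    """Return the k whose point lies farthest from the chord joining the curve's ends."""
    ks = np.asarray(ks, dtype=float)
    inertias = np.asarray(inertias, dtype=float)
    # Rescale both axes to [0, 1] so neither dominates the distance
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (inertias - inertias.min()) / (inertias.max() - inertias.min())
    # Perpendicular distance of every point from the line through the first and last points
    x0, y0, x1, y1 = x[0], y[0], x[-1], y[-1]
    dist = np.abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)
    dist /= np.hypot(x1 - x0, y1 - y0)
    return int(ks[dist.argmax()])

# Hypothetical inertia values for k = 1..8 (inertia is simply -score)
ks = range(1, 9)
inertias = [1000, 600, 300, 120, 105, 95, 88, 80]
elbow_k = elbow_by_chord_distance(ks, inertias)
print("suggested number of clusters:", elbow_k)
```

**Because the distance is taken in absolute value, the same function works on the negative scores plotted above.**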

**Ok, so without normalizing the data, the best number of cluster is around 4, so let's try that out**

In [None]:
km = KMeans(n_clusters=4)
km.fit(sub_data)
y = km.predict(sub_data)
labels = km.labels_
sub_data['Cluster'] = y

In [None]:
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data['Ward'])
y = np.array(sub_data['IUCR'])
z = np.array(sub_data['District'])

ax.set_xlabel('Ward')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')

ax.scatter(x,y,z, marker="o", c = sub_data["Cluster"], s=60, cmap="jet")
ax.view_init(azim=0)
#print(ax.azim)
plt.show()

**As expected, KMeans simply clusters the data based on the euclidean distances of the IUCR codes. So let's fix that by normalizing the data**

In [None]:
sub_data['IUCR'] = (sub_data['IUCR'] - sub_data['IUCR'].min())/(sub_data['IUCR'].max()-sub_data['IUCR'].min())
sub_data['Ward'] = (sub_data['Ward'] - sub_data['Ward'].min())/(sub_data['Ward'].max()-sub_data['Ward'].min())
sub_data['District'] = (sub_data['District'] - sub_data['District'].min())/(sub_data['District'].max()-sub_data['District'].min())
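
**The three lines above apply the same min-max formula column by column. A compact sketch of the same idea on a toy frame follows; `min_max_scale` is a hypothetical helper and the toy values are invented.**

```python
import pandas as pd

def min_max_scale(df):
    """Rescale every column of df to the [0, 1] range."""
    return (df - df.min()) / (df.max() - df.min())

# Toy frame with the same column names and very different ranges
toy = pd.DataFrame({
    "Ward": [1, 25, 50],
    "IUCR": [110, 820, 5130],
    "District": [1, 12, 25],
})
scaled = min_max_scale(toy)
print(scaled)
```

**After scaling, every column spans exactly 0 to 1, so no single feature dominates the Euclidean distance. A column with a single constant value would produce NaN here and would need special handling.**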

**Let's find the optimum clusters again**

In [None]:
# Drop the labels from the previous run first, otherwise KMeans treats them as a fourth feature
del sub_data['Cluster']

N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
score = [km.fit(sub_data).score(sub_data) for km in kmeans]
pl.plot(N, score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()

**The elbow now seems closer to 3! Let's run KMeans again on the normalized data.**

In [None]:
km = KMeans(n_clusters=3)
km.fit(sub_data)
y = km.predict(sub_data)
labels = km.labels_
sub_data['Clusters'] = y

In [None]:
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data['Ward'])
y = np.array(sub_data['IUCR'])
z = np.array(sub_data['District'])

ax.set_xlabel('Ward')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')

ax.scatter(x,y,z, marker="o", c = sub_data["Clusters"], s=60, cmap="winter")
ax.view_init(azim=0)
#print(ax.azim)
plt.show()

**The clustering doesn't seem to be based solely on the Euclidean distances of the IUCR codes now, which is good! Next up, let's check out the district-wise distribution of crime types using an animated bubble chart, which can be easily plotted using the package bubbly. More on bubbly's use [in this kernel](https://www.kaggle.com/aashita/guide-to-animated-bubble-charts-using-plotly). Click on autoscale in case the data is not properly displayed.**

In [None]:
data['IUCR'] = data.IUCR.str.extract(r'(\d+)', expand=True).astype(int)
figure = bubbleplot(dataset=data, x_column='Latitude', y_column='Longitude', 
    bubble_column='Primary Type', time_column='Year',size_column='IUCR',color_column='District', 
    x_title="Latitude", y_title="Longitude", title='District wise distribution of Crime types in Chicago',
    x_logscale=False)

# The Plotly config key is camel-cased: 'scrollZoom'
iplot(figure, config={'scrollZoom': True})

**Let's do the clustering now based on Time, District and Primary Type (as per IUCR codes). Here too we will scale the time in seconds to lie between 0 and 1, with 0.5 representing roughly 12:00 noon; otherwise the large range of the seconds column would dominate the distances and the clusters would be based on time alone. That way the clusters can be divided into sections of morning, afternoon and night.**

In [None]:
#Normalizing the time to be between 0 and 1, this way lower values would indicate midnight to early morning
#medium values would indicate the afternoon sessions, and high values would indicate evening and night time
#also kmeans then won't cluster just based on the time as the range of euclidean distances in time column will be very high without scaling
data['Normalized_time'] = (data['seconds'] - data['seconds'].min())/(data['seconds'].max()-data['seconds'].min())
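
**The min-max scaling above maps noon to about 0.5 only when the sample contains times close to both ends of the day. As a sketch of an alternative, assuming you prefer a mapping that does not depend on the sample, the seconds can be divided by the fixed number of seconds in a day. The timestamps below are invented for the demonstration.**

```python
import pandas as pd

SECONDS_PER_DAY = 24 * 60 * 60

# Illustrative timestamps at midnight, 6 am, noon and 6 pm
stamps = pd.Series(pd.to_datetime([
    "2016-01-01 00:00:00",
    "2016-01-01 06:00:00",
    "2016-01-01 12:00:00",
    "2016-01-01 18:00:00",
]))

# Seconds since midnight, then the fraction of the day that has elapsed
seconds = stamps.dt.hour * 3600 + stamps.dt.minute * 60 + stamps.dt.second
day_fraction = seconds / SECONDS_PER_DAY
print(day_fraction.tolist())
```

**Either way the feature stays linear, so 23:59 and 00:00 end up at opposite ends of the scale even though they are a minute apart.**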

In [None]:
# .copy() gives an independent frame, so the assignments below do not trigger SettingWithCopyWarning
sub_data1 = data[['IUCR', 'Normalized_time', 'District']].copy()
#sub_data1['IUCR'] = sub_data1.IUCR.str.extract('(\d+)', expand=True).astype(int)
sub_data1['IUCR'] = (sub_data1['IUCR'] - sub_data1['IUCR'].min())/(sub_data1['IUCR'].max()-sub_data1['IUCR'].min())
sub_data1['District'] = (sub_data1['District'] - sub_data1['District'].min())/(sub_data1['District'].max()-sub_data1['District'].min())
sub_data1.head()

**Let's run  KMeans on the above data now! Like before we shall start off by plotting the elbow curve to find the optimal number of clusters**

In [None]:
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
score = [km.fit(sub_data1).score(sub_data1) for km in kmeans]
pl.plot(N, score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()

**The optimal number of clusters seems to be around 4-5. Let's try 4 first, and then plot the clusters on a 3D plot for 5 clusters as well and see how it turns out.**

In [None]:
km = KMeans(n_clusters=4)
km.fit(sub_data1)
y = km.predict(sub_data1)
labels = km.labels_
sub_data1['Clusters'] = y
sub_data1.head()

In [None]:
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data1['Normalized_time'])
y = np.array(sub_data1['IUCR'])
z = np.array(sub_data1['District'])

ax.set_xlabel('Time')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')

ax.scatter(x,y,z, marker="o", c = sub_data1["Clusters"], s=60, cmap="jet")
ax.view_init(azim=-20)
#print(ax.azim)
plt.show()

**Let's check out the clustering by setting n_clusters=5**

In [None]:
km = KMeans(n_clusters=5)
km.fit(sub_data1)
y = km.predict(sub_data1)
labels = km.labels_
sub_data1['Clusters'] = y
sub_data1.head()

In [None]:
#Plotting the results of 5 clusters
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data1['Normalized_time'])
y = np.array(sub_data1['IUCR'])
z = np.array(sub_data1['District'])

ax.set_xlabel('Time')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')

ax.scatter(x,y,z, marker="o", c = sub_data1["Clusters"], s=60, cmap="jet")
ax.view_init(azim=-20)
#print(ax.azim)
plt.show()

**So much of clustering going on, what do these clusters even indicate??? Hopefully we will get to know that in the next section!**
![](https://thumb1.shutterstock.com/display_pic_with_logo/173932116/739897180/stock-vector-confused-kid-face-expression-cute-cartoon-girl-illustration-739897180.jpg)

**Observations:**
1. **To be added soon, have to make sure first whether the clusters are ok or flawed; any inputs from your side would surely be appreciated.**

**Now let's plot some more interesting graphs to get  further insights into the data we have in hand. After that we will move on to the date time serializing process and clustering them on that basis. Hopefully after that we will be able to predict crimes by either of the two approaches:**
1. **Minimizing the distance between a crime occurrence and the centroid of a cluster**
2. **Performing regression analysis on the identified clusters and fitting crimes to the best fit line**
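
**The first approach can be sketched in a few lines of NumPy: a new observation is assigned to whichever centroid is closest in Euclidean distance. The centroid coordinates and the new points below are hypothetical values in a normalized (Time, IUCR, District) space, and `nearest_centroid` is an illustrative helper; with a fitted scikit-learn model the centroids would come from `km.cluster_centers_`.**

```python
import numpy as np

def nearest_centroid(points, centroids):
    """Return, for each point, the index of and distance to its closest centroid."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), dists.min(axis=1)

# Hypothetical centroids in normalized (Time, IUCR, District) space
centroids = np.array([
    [0.2, 0.1, 0.3],
    [0.8, 0.5, 0.7],
    [0.5, 0.9, 0.2],
])

# Hypothetical new observations, scaled the same way as the training data
new_points = np.array([
    [0.25, 0.15, 0.35],
    [0.75, 0.55, 0.65],
])

labels, distances = nearest_centroid(new_points, centroids)
print(labels, np.round(distances, 4))
```

**A large distance to the nearest centroid suggests the observation does not fit any identified cluster well.**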


In [None]:
# convert dates to pandas datetime format
data.Date = pd.to_datetime(data.Date, format='%m/%d/%Y %I:%M:%S %p')
# setting the index to be the date will help us a lot later on
data.index = pd.DatetimeIndex(data.Date)

In [None]:
plt.figure(figsize=(11,6))
data.resample('BM').size().plot(legend=False)
plt.title('Number of crimes per month (2001 - 2016)')
plt.xlabel('Months')
plt.ylabel('Number of crimes')
plt.show()
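
**For readers who want to see what the monthly aggregation produces, here is a toy sketch that counts records per calendar month. It groups on `to_period("M")` rather than using `resample('BM')` as above, which is an assumption made to keep the example independent of pandas frequency-alias changes; the six dates are invented.**

```python
import pandas as pd

# Toy frame indexed by date, mimicking data.index after the conversion above
toy = pd.DataFrame(
    {"ID": range(6)},
    index=pd.to_datetime([
        "2016-01-03", "2016-01-17",
        "2016-02-01", "2016-02-10", "2016-02-28",
        "2016-03-05",
    ]),
)

# Number of records falling in each calendar month
monthly = toy.groupby(toy.index.to_period("M")).size()
print(monthly)
```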

**From the looks of it, overall crime seems to have been decreasing from 2001 onwards, as the graph follows a periodic and consistently decreasing pattern. However, this doesn't give the entire picture. To get a better idea, it's always better to plot individual crime patterns and then investigate what has decreased or increased over the years.**

In [None]:
crimes_count_date = data.pivot_table('ID', aggfunc=np.size, columns='Primary Type', index=data.index.date, fill_value=0)
crimes_count_date.index = pd.DatetimeIndex(crimes_count_date.index)
plo = crimes_count_date.rolling(365).sum().plot(figsize=(30, 30), subplots=True, layout=(-1, 3), sharex=False, sharey=False)

**From the plots above, Battery charges were at their lowest in 2016. Weapons Violation had also decreased by 2016, although it started rising after that. Theft was at its lowest in 2016 too, Stalking charges were very high, Public Peace Violation was decreasing, and so on. Overall, the positive takeaway is that most criminal charges were decreasing as 2016 approached compared to previous years, although a few rose significantly, such as Stalking, Sex Offense, Offense Involving Children and Interference with Public Officer. It seems that as the years advanced, victims were increasingly women or children, perhaps because offenders turned to softer targets as law enforcement cracked down on high-value targets (HVTs).**

In [None]:
days = ['Monday','Tuesday','Wednesday',  'Thursday', 'Friday', 'Saturday', 'Sunday']
data.groupby([data.index.dayofweek]).size().plot(kind='barh', figsize=(5, 6))
plt.ylabel('Days of the week')
plt.yticks(np.arange(7), days)
plt.xlabel('Number of crimes')
plt.title('Number of crimes by day of the week')
plt.show()

**Well I guess nothing special out here, almost all the days experience the same number of crimes, the Friday count being slightly higher than the others, but still nothing significant to come to any conclusion.**

In [None]:
plt.figure(figsize=(8,10))
data.groupby([data['Primary Type']]).size().sort_values(ascending=True).plot(kind='barh')
plt.title('Number of crimes by type')
plt.ylabel('Crime Type')
plt.xlabel('Number of crimes')
plt.show()

**It seems that Theft and Battery charges outnumber the rest in Chicago. However, it would not be 100% correct to say that these are the only two major crimes that plague the city, as many incidents go unreported or unheard of, and some may have been left out in the process of data collection. We will never know!**

**The trends are getting pretty repetitive and predictable. We all know that not all crimes are the same: some, like theft or robbery, occur far more frequently than murder or homicide. It would be really cool if we could visualize answers to questions like: Is theft or burglary more likely to occur on a weekday than on a weekend? Is it more likely to happen in the morning, in the evening or late at night? Is it more likely to occur on a street than in a bar? We will work on this with the pivot function of pandas.**

In [None]:
hour_by_location = data.pivot_table(values='ID', index='Location Description', columns=data.index.hour, aggfunc=np.size).fillna(0)
hour_by_type     = data.pivot_table(values='ID', index='Primary Type', columns=data.index.hour, aggfunc=np.size).fillna(0)
dayofweek_by_type = data.pivot_table(values='ID', index='Primary Type', columns=data.index.dayofweek, aggfunc=np.size).fillna(0)
location_by_type  = data.pivot_table(values='ID', index='Location Description', columns='Primary Type', aggfunc=np.size).fillna(0)
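
**To show the shape of the tables built above, here is a toy sketch that counts records per crime type and hour. It uses `groupby(...).size().unstack()` instead of `pivot_table`, which is an assumption made for brevity; the five records are invented.**

```python
import pandas as pd

# Toy records with a timestamp and a crime type
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-01-01 09:15", "2016-01-01 09:40",
        "2016-01-02 22:05", "2016-01-03 14:30",
        "2016-01-04 22:50",
    ]),
    "Primary Type": ["THEFT", "THEFT", "BATTERY", "THEFT", "BATTERY"],
})
toy["hour"] = toy["Date"].dt.hour

# Rows: crime type, columns: hour of day, values: number of records (0 where none)
counts = toy.groupby(["Primary Type", "hour"]).size().unstack(fill_value=0)
print(counts)
```

**Each row is then a 24-hour profile for one crime type, which is what the heatmaps below visualize.**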

**For the purpose of plotting the heatmaps, we will first run AgglomerativeClustering on each table in order to group its rows into meaningful clusters, and then use those labels to order the rows of the heatmap. We will also z-scale the values of each row to have zero mean and unit variance before plotting.**

In [None]:
from sklearn.cluster import AgglomerativeClustering as AC

def scale_df(df,axis=0):
    return (df - df.mean(axis=axis)) / df.std(axis=axis)


def plot_hmap(df, ix=None, cmap='PuRd'):
    if ix is None:
        ix = np.arange(df.shape[0])
    plt.imshow(df.iloc[ix,:], cmap=cmap)
    plt.colorbar(fraction=0.03)
    plt.yticks(np.arange(df.shape[0]), df.index[ix])
    plt.xticks(np.arange(df.shape[1]))
    plt.grid(False)
    plt.show()
    
def scale_and_plot(df, ix = None):
    df_marginal_scaled = scale_df(df.T).T
    if ix is None:
        ix = AC(4).fit(df_marginal_scaled).labels_.argsort()
    # DataFrame.as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
    cap = np.min([np.max(df_marginal_scaled.values), np.abs(np.min(df_marginal_scaled.values))])
    df_marginal_scaled = np.clip(df_marginal_scaled, -1*cap, cap)
    plot_hmap(df_marginal_scaled, ix=ix)
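
**A toy sketch of the row scaling and symmetric clipping performed in `scale_and_plot`. The helper `zscale_rows` and the numbers are illustrative assumptions; it is meant to give the same result as `scale_df(df.T).T` without the double transpose.**

```python
import numpy as np
import pandas as pd

def zscale_rows(df):
    """Z-score each row: subtract the row mean, divide by the row standard deviation."""
    return df.sub(df.mean(axis=1), axis=0).div(df.std(axis=1), axis=0)

# Toy table: rows are locations, columns are hours (values invented)
toy = pd.DataFrame(
    [[1, 2, 3, 10], [100, 200, 300, 400]],
    index=["STREET", "BAR"],
    columns=[0, 6, 12, 18],
    dtype=float,
)

scaled = zscale_rows(toy)

# Clip to a symmetric range so the colour map is centred on zero
cap = min(scaled.to_numpy().max(), abs(scaled.to_numpy().min()))
clipped = scaled.clip(-cap, cap)
print(clipped.round(3))
```

**Scaling per row puts a busy location and a quiet one on the same footing, so the heatmap shows when each location peaks rather than which location has the most crime.**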

In [None]:
#CMAP = 'PuRd'
plt.figure(figsize=(60,30))
scale_and_plot(hour_by_location)

**From the plot above, we can see that places like Police Facility or vehicle parking lots experience crime mostly during the early morning (around 3-5 am, indicated by the dark purple coloring of the plot), while Government buildings, Schools, College/University campuses and Hospitals see it mostly around midday (roughly 9-14). Fitting a regression line on such clusters could prove quite useful in predicting when the next crime is likely at such sites!**

In [None]:
#CMAP = 'OrRd'
plt.figure(figsize=(20, 10))
scale_and_plot(hour_by_type)

**Domestic violence, offense involving children, sex offense etc. are all likely to occur late at night, at around 12-14. Burglary and theft mostly occur from early morning to midday, at around 8-12, probably because people are at their offices, children are at school, and broad daylight is when one least expects a house to get robbed. Homicide occurs mostly early in the morning, at around 2-3 AM, probably because few people are out on the streets at that time, leaving little to no witnesses. Next up, we will try and visualize what sort of crimes mostly occur in which places.**

In [None]:
def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

df = normalize(location_by_type)
ix = AC(3).fit(df.T).labels_.argsort() 
plt.figure(figsize=(27, 12))
plt.imshow(df.T.iloc[ix,:], cmap='Blues')
plt.colorbar(fraction=0.03)
plt.xticks(np.arange(df.shape[0]), df.index, rotation='vertical')
plt.yticks(np.arange(df.shape[1]), df.columns)
plt.title('Normalized location frequency for each crime')
plt.grid(False)
plt.show()

Notebook to be updated!