# Data Analysis on Chicago Crimes Dataset

First things first, we import all the modules that we are going to use.

In [114]:
import numpy as np
import pandas as pd
import geopandas as gpd
import geoplot as gplt
import geoplot.crs as gcrs
import matplotlib.pyplot as plt

%matplotlib inline

Next, we count the lines in each dataset to get a sense of its size.

In [2]:
import mmap

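def lines_mapcount(filename):
    """Count the lines in a file by memory-mapping it read-only."""
    lines = 0
    # Context managers close both the file and the map, even on error.
    with open(filename, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        while buf.readline():
            lines += 1
    return lines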

print(f'Chicago Crimes dataset has {lines_mapcount("../data/chicagoCrimes/Chicago_Crimes.json"):,} lines.')
print(f'ZCTA5 dataset has {lines_mapcount("../data/zcta5/TIGER2018_ZCTA5.json"):,} lines.')

Chicago Crimes dataset has 7,147,877 lines.
ZCTA5 dataset has 33,144 lines.
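As an alternative to memory-mapping, the line count can be sketched with plain buffered reads. The helper name `count_lines_chunked` and the temporary file below are hypothetical and only serve as an illustration; this is not the notebook's own method.

```python
import os
import tempfile


def count_lines_chunked(path, chunk_size=1 << 20):
    """Count newline characters by reading the file in binary chunks.

    Note: a final line without a trailing newline is not counted here,
    whereas a readline()-based loop would count it.
    """
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


# Hypothetical three-line file standing in for the real datasets.
with tempfile.NamedTemporaryFile("wb", suffix=".json", delete=False) as tmp:
    tmp.write(b'{"a": 1}\n{"a": 2}\n{"a": 3}\n')

n_lines = count_lines_chunked(tmp.name)
os.remove(tmp.name)
print(f"{n_lines:,} lines")
```

Reading in chunks keeps memory use bounded by `chunk_size` and needs only read access to the file.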


Loading the pickle file and converting it into a GeoPandas GeoDataFrame.

In [2]:
file_name = "../data/chicagoCrimes/Chicago_Crimes_cleaned.pkl"
dfCrime = pd.read_pickle(file_name)
dfCrime = gpd.GeoDataFrame(dfCrime, geometry='g')
print("Loading done!")

Loading done!


Here we take a look at the first 5 rows of the dataset.

In [3]:
dfCrime.head()

Unnamed: 0,g,ID,Date,Primary Type,Description,Location Description,Arrest,Domestic,District,FBI Code,Score Crime,Score Arrest
0,POINT (-87.67920 41.96925),9799787,2014-09-30 01:05:00,CRIMINAL TRESPASS,TO VEHICLE,STREET,1,0,20.0,26,1,
1,POINT (-87.67920 41.96930),9933145,2015-01-20 15:30:00,CRIMINAL DAMAGE,TO PROPERTY,SMALL RETAIL STORE,0,0,20.0,14,13,
2,POINT (-87.67772 41.96947),9936081,2015-01-21 11:00:00,CRIMINAL DAMAGE,TO PROPERTY,APARTMENT,0,0,20.0,14,13,
3,POINT (-87.67772 41.96947),9833072,2014-10-25 12:00:00,CRIMINAL DAMAGE,TO VEHICLE,APARTMENT,0,0,20.0,14,13,
4,POINT (-87.67772 41.96949),9822109,2014-10-17 08:30:00,THEFT,$500 AND UNDER,RESIDENTIAL YARD (FRONT/BACK),0,0,20.0,6,21,


Then we get general information about each column, such as its data type. For a frame this large, info() lists only the dtypes and leaves out the non-null counts.

In [4]:
dfCrime.info(memory_usage="deep")

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 7078918 entries, 0 to 7147876
Data columns (total 12 columns):
 #   Column                Dtype         
---  ------                -----         
 0   g                     geometry      
 1   ID                    Int32         
 2   Date                  datetime64[ns]
 3   Primary Type          category      
 4   Description           category      
 5   Location Description  category      
 6   Arrest                Int8          
 7   Domestic              Int8          
 8   District              category      
 9   FBI Code              category      
 10  Score Crime           Int8          
 11  Score Arrest          Int8          
dtypes: Int32(1), Int8(4), category(5), datetime64[ns](1), geometry(1)
memory usage: 297.1 MB
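Since the output above lists only dtypes, the non-null counts can be computed directly. The following is a minimal sketch on a hypothetical three-row frame (`toy` and its values are made up for illustration), reusing a few column names and dtypes from the real dataset.

```python
import pandas as pd

# Hypothetical miniature of dfCrime; the values are illustrative only.
toy = pd.DataFrame({
    "Primary Type": pd.Series(["THEFT", "BATTERY", "THEFT"], dtype="category"),
    "Arrest": pd.array([1, 0, None], dtype="Int8"),
    "Score Arrest": pd.array([None, None, None], dtype="Int8"),
})

# One row per column: its dtype and how many values are not null.
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "non_null": toy.notna().sum(),
})
print(overview)
```

On the real frame the same two expressions would be applied to `dfCrime` instead of `toy`.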


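Then a quick descriptive-statistics summary of the numeric columns. Note that Score Arrest has a count of 0, so the column is entirely null in this cleaned dataset.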

In [5]:
dfCrime.describe()

Unnamed: 0,ID,Arrest,Domestic,Score Crime,Score Arrest
count,7078918.0,7078918.0,7078918.0,7078918.0,0.0
mean,6531848.0,0.2735994,0.1339089,16.04244,
std,3214072.0,0.4458058,0.3405544,6.778636,
min,634.0,0.0,0.0,1.0,
25%,3561036.0,0.0,0.0,13.0,
50%,6521622.0,0.0,0.0,19.0,
75%,9310905.0,1.0,0.0,21.0,
max,12095050.0,1.0,1.0,26.0,


Then a quick overview of the number of unique values in each column.

In [6]:
dfCrime.nunique()

g                        872127
ID                      7078918
Date                    2882091
Primary Type                 36
Description                 522
Location Description        212
Arrest                        2
Domestic                      2
District                     24
FBI Code                     26
Score Crime                  23
Score Arrest                  0
dtype: int64

---

1. Which police district has the most crimes?

In [24]:
dfCrime.groupby("District", as_index=False).count().sort_values(by=['ID'], ascending=False)[['District', 'ID']].iloc[0]

District         8.0
ID          479853.0
Name: 7, dtype: float64
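The same ranking can be sketched more directly with `value_counts`, which counts rows per district without building a full grouped frame. The frame `toy` below is hypothetical and only illustrates the call pattern.

```python
import pandas as pd

# Hypothetical sample: three crimes in district 8, one each in 11 and 20.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "District": [8.0, 8.0, 8.0, 11.0, 20.0],
})

per_district = toy["District"].value_counts()  # sorted in descending order
top_district = per_district.idxmax()           # district label
top_count = per_district.max()                 # number of crimes there
print(top_district, top_count)
```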

2. Which police district has the highest arrest percentage?


In [40]:
district_group = dfCrime.groupby('District', as_index=True)
(district_group['Arrest'].sum() / district_group['Arrest'].count()).sort_values(ascending=False).iloc[[0]]

District
21.0    0.5
Name: Arrest, dtype: Float64
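Because Arrest is coded as 0/1, the sum divided by the count is simply the mean. A round rate such as 0.5 can also come from a group with very few records, so it may help to show the group size next to the rate. A minimal sketch on hypothetical data (the frame `toy` and its values are assumptions):

```python
import pandas as pd

# Hypothetical sample: district 21 has only two records.
toy = pd.DataFrame({
    "District": [8, 8, 8, 8, 21, 21],
    "Arrest":   [1, 0, 0, 0, 1, 0],
})

# "mean" is the arrest rate, "count" is the number of crimes in the district.
summary = (
    toy.groupby("District")["Arrest"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

Reading the two columns together makes it easier to judge whether the top rate rests on enough records to be meaningful.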

3. How many murders happen on the street?


In [81]:
dfCrime[(dfCrime['Primary Type'] == 'HOMICIDE') & (dfCrime['Location Description'] == 'STREET')].count()['ID']

5118

4. Where do murders usually happen? (ZIP Code)

5. At what time of day do murders and kidnappings happen the most?


In [77]:
tmp_df = dfCrime[(dfCrime['Primary Type'] == 'HOMICIDE') | (dfCrime['Primary Type'] == 'KIDNAPPING')].groupby(dfCrime['Date'].dt.hour, as_index=False).size().sort_values(by=['size'], ascending=False).iloc[:1]
print(tmp_df)

    Date  size
18    18  1024


6. What is the "location description" (street, sidewalk, apartment, etc.) where murders happen the most?

In [75]:
tmp_df = dfCrime[(dfCrime['Primary Type'] == 'HOMICIDE')].groupby('Location Description', as_index=False).size().sort_values(by=['size'], ascending=False).iloc[0:5]
print(tmp_df)

    Location Description  size
186               STREET  5118
21                  AUTO  1193
17             APARTMENT   886
15                 ALLEY   660
115                HOUSE   552


7. What is the most common domestic crime?

In [85]:
dfCrime[dfCrime['Domestic'] == 1].groupby('Primary Type', as_index=False).size().sort_values('size').iloc[-1]

Primary Type    BATTERY
size             559295
Name: 2, dtype: object

8. What is the percentage of domestic crimes that led to an arrest?


In [86]:
tmp_df = dfCrime[dfCrime['Domestic'] == 1]
tmp_df['Arrest'].sum() / tmp_df['Arrest'].size

0.19576129039064066
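The value above is a fraction, roughly 19.6% when expressed as a percentage. A minimal sketch of the same calculation on hypothetical data, using the mean of the 0/1 Arrest column and scaling it to percent:

```python
import pandas as pd

# Hypothetical sample: four domestic crimes, one of which led to an arrest.
toy = pd.DataFrame({
    "Domestic": [1, 1, 1, 1, 0],
    "Arrest":   [1, 0, 0, 0, 1],
})

domestic = toy[toy["Domestic"] == 1]
arrest_pct = domestic["Arrest"].mean() * 100  # mean of 0/1 values = rate
print(f"{arrest_pct:.1f}%")
```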

9. Which day of the week has the most domestic crimes?

In [98]:
dfCrime[dfCrime['Domestic'] == 1].groupby(dfCrime['Date'].dt.dayofweek).size().sort_values().iloc[-1]

160171
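The cell above returns only the count for the busiest day (160,171), because `.iloc[-1]` on a Series yields the value without its index label. A minimal sketch that also recovers the day, using `idxmax` on hypothetical data (the frame `toy` is an assumption):

```python
import pandas as pd

# Hypothetical sample: one domestic crime on a Monday, two on a Tuesday.
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2015-01-19 10:00", "2015-01-20 09:00",
        "2015-01-20 15:30", "2015-01-21 11:00",
    ]),
    "Domestic": [1, 1, 1, 0],
})

domestic = toy[toy["Domestic"] == 1]
per_day = domestic.groupby(domestic["Date"].dt.dayofweek).size()

busiest_day = per_day.idxmax()   # 0 = Monday ... 6 = Sunday
busiest_count = per_day.max()
print(busiest_day, busiest_count)
```

In pandas, `dt.dayofweek` numbers the days from Monday as 0 to Sunday as 6.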

10. What is the most common crime on each day of the week?
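The notebook has no cell for this question. One possible approach, sketched on hypothetical data, is to group by day of week and take the most frequent Primary Type within each group. The frame `toy` and its values are assumptions, not results from the dataset.

```python
import pandas as pd

# Hypothetical sample: 2015-01-19 is a Monday, 2015-01-20 a Tuesday.
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2015-01-19 10:00", "2015-01-19 12:00", "2015-01-19 13:00",
        "2015-01-20 09:00", "2015-01-20 15:30", "2015-01-20 18:00",
    ]),
    "Primary Type": ["THEFT", "THEFT", "BATTERY",
                     "BATTERY", "BATTERY", "THEFT"],
})

# For each day of week (0 = Monday), pick the most frequent crime type.
top_per_day = (
    toy.groupby(toy["Date"].dt.dayofweek)["Primary Type"]
    .agg(lambda s: s.value_counts().idxmax())
)
print(top_per_day)
```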

11. Which month generally has the greatest number of crimes?

In [136]:
dfCrime['month'] = dfCrime['Date'].dt.month
dfCrime.groupby('month', as_index=False).size().sort_values('size').iloc[-1]

month         7
size     650441
Name: 6, dtype: int64

12. What is the time of day when theft-related crimes happen the most?

In [139]:
dfCrime['hour'] = dfCrime['Date'].dt.hour
theftRelatedList = ["MOTOR VEHICLE THEFT", "BURGLARY", "ROBBERY", "THEFT"]
# Keep only theft-related crimes, then count them per hour of day.
dfCrime[dfCrime['Primary Type'].isin(theftRelatedList)].groupby('hour', as_index=False).size().sort_values('size').iloc[-1]

hour        12
size    151309
Name: 12, dtype: int64