# Chicago Crimes - Bodo Hosted Trial

This example walks through an exploratory data analysis (EDA) of crimes in Chicago using Bodo, an HPC-like platform. The Chicago crime data is read from Bodo's public S3 bucket, then cleaned and processed, and several analyses are run to extract insights. Every step is **parallelized across multiple cores using Bodo**, which can be a straightforward way to make Python code run faster with few code changes. The original examples can be found [here](https://medium.com/@ahsanzafar222/chicago-crime-data-cleaning-and-eda-a744c687a291) and [here](https://www.kaggle.com/fahd09/eda-of-crime-in-chicago-2005-2016). The data size is reduced to fit this hosted trial cluster. The full example is available in the [Bodo-Examples Git repository](https://github.com/Bodo-inc/Bodo-examples/blob/master/notebooks/Chicago-crimes.ipynb). You can run the large-scale example on the [Bodo platform](https://platform.bodo.ai/account/login).

The Bodo framework knows when to parallelize code based on the `%%px` magic at the start of cells and the `@bodo.jit` function decorators. Removing both and restarting the kernel runs the code without Bodo.
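
As a rough illustration of that last point, the sketch below is written the way the cells in this notebook are, but with the `%%px` magic and the `@bodo.jit` decorator left off, so it runs as ordinary single-process pandas. The function name `count_by_type` and the tiny in-memory DataFrame are assumptions made for this example; they are not part of the original notebook.

```python
import time

import pandas as pd


# No %%px magic and no @bodo.jit decorator: plain pandas in a single process.
def count_by_type(crimes):
    t1 = time.time()
    counts = crimes["Primary Type"].value_counts()
    print("Counting time: ", ((time.time() - t1) * 1000), " (ms)")
    return counts


# Hypothetical toy data standing in for the real Chicago crimes table.
toy_crimes = pd.DataFrame(
    {"ID": [1, 2, 3], "Primary Type": ["THEFT", "BATTERY", "THEFT"]}
)
type_counts = count_by_type(toy_crimes)
```

The parallel cells in this notebook differ from this sketch by those two additions: `%%px` at the top of the cell and `@bodo.jit` above the function.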



In [1]:
%%px
import numpy as np
import pandas as pd
import time
import bodo

print(f"Hello World from rank {bodo.get_rank()}. Total ranks={bodo.get_size()}")

Starting 8 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>


[stdout:7] Hello World from rank 7. Total ranks=8


[stdout:1] Hello World from rank 1. Total ranks=8


[stdout:0] Hello World from rank 0. Total ranks=8


[stdout:5] Hello World from rank 5. Total ranks=8


[stdout:4] Hello World from rank 4. Total ranks=8


[stdout:3] Hello World from rank 3. Total ranks=8


[stdout:6] Hello World from rank 6. Total ranks=8


[stdout:2] Hello World from rank 2. Total ranks=8


## Load Chicago Crimes Data (2012 to 2017)

In [2]:
%%px
@bodo.jit(cache=True)
def load_chicago_crimes():
    t1 = time.time()
    crimes = pd.read_parquet('s3://bodo-example-data/chicago-crimes/Chicago_Crimes_2012_to_2017.pq')
    crimes = crimes.sort_values(by="ID")    
    print("Reading time: ", ((time.time() - t1) * 1000), " (ms)")    
    return crimes

crimes1 = load_chicago_crimes()
if bodo.get_rank()==0:
    display(crimes1.head())

[stdout:0] Reading time:  4327.149833537078  (ms)


[output:0]

| | Unnamed: 0 | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1267593 | 4105388 | 20225 | HV102221 | 01/02/2012 05:58:00 PM | 024XX E 78TH ST | 110 | HOMICIDE | FIRST DEGREE MURDER | STREET | False | ... | 7.0 | 43.0 | 01A | 1194033.0 | 1853729.0 | 2012 | 08/17/2015 03:03:40 PM | 41.753569 | -87.564503 | (41.75356945, -87.56450286) |
| 1267595 | 4105549 | 20227 | HV101433 | 01/02/2012 05:15:00 AM | 107XX S COTTAGE GROVE AVE | 110 | HOMICIDE | FIRST DEGREE MURDER | STREET | True | ... | 9.0 | 50.0 | 01A | 1182247.0 | 1833951.0 | 2012 | 08/17/2015 03:03:40 PM | 41.699577 | -87.608304 | (41.699577165, -87.608304224) |
| 1267596 | 4105635 | 20228 | HV102986 | 01/03/2012 12:07:00 PM | 010XX N PULASKI RD | 110 | HOMICIDE | FIRST DEGREE MURDER | STREET | False | ... | 37.0 | 23.0 | 01A | 1149528.0 | 1906741.0 | 2012 | 08/17/2015 03:03:40 PM | 41.900017 | -87.726226 | (41.900017263, -87.726225708) |
| 1267600 | 4105891 | 20231 | HV105192 | 01/05/2012 02:35:00 AM | 046XX W MONROE ST | 110 | HOMICIDE | FIRST DEGREE MURDER | STREET | False | ... | 28.0 | 25.0 | 01A | 1145493.0 | 1899161.0 | 2012 | 08/17/2015 03:03:40 PM | 41.879294 | -87.741239 | (41.879294275, -87.741238618) |
| 1267601 | 4105973 | 20232 | HV103598 | 01/05/2012 08:00:00 AM | 017XX W ALBION AVE | 110 | HOMICIDE | FIRST DEGREE MURDER | APARTMENT | True | ... | 40.0 | 1.0 | 01A | 1163697.0 | 1943895.0 | 2012 | 08/17/2015 03:03:40 PM | 42.001683 | -87.673131 | (42.001682746, -87.67313138) |


## Preprocessing and Cleaning
 1. Drop duplicate cases, remove unused columns, and add the day of week and the date of each crime.
 2. Keep only the most frequent crime type categories.
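
The two steps above can be sketched in plain pandas on a handful of rows. This is only an illustration under stated assumptions: the toy DataFrame, the variable names `toy`, `cleaned`, `top_types` and `top_crimes`, and the choice of keeping the top 2 types (the notebook keeps the top 10) are made up for the example, and no Bodo decorator is used.

```python
import pandas as pd

# Hypothetical toy rows; the last row is an exact duplicate of the first.
toy = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5, 6, 1],
        "Date": [
            "01/02/2012 05:58:00 PM",
            "01/02/2012 05:15:00 AM",
            "01/03/2012 12:07:00 PM",
            "01/05/2012 02:35:00 AM",
            "01/05/2012 08:00:00 AM",
            "01/05/2012 09:30:00 PM",
            "01/02/2012 05:58:00 PM",
        ],
        "Primary Type": [
            "THEFT", "THEFT", "BATTERY", "THEFT", "BATTERY", "ARSON", "THEFT",
        ],
    }
)

# Step 1: drop duplicates, parse the timestamp, derive day of week and date.
cleaned = toy.drop_duplicates().copy()
cleaned["Date"] = pd.to_datetime(cleaned["Date"], format="%m/%d/%Y %I:%M:%S %p")
cleaned["dow"] = cleaned["Date"].dt.dayofweek  # Monday is 0
cleaned["date only"] = cleaned["Date"].dt.floor("D")

# Step 2: keep only the most frequent crime types (top 2 in this sketch).
top_types = cleaned["Primary Type"].value_counts().index[0:2]
top_crimes = cleaned[cleaned["Primary Type"].isin(top_types)]
```

Here the duplicate row is removed, `ARSON` falls outside the top 2, and the remaining five rows are kept.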


In [3]:
%%px
@bodo.jit(distributed=["crimes"], cache=True)
def data_cleanup(crimes):
    t1 = time.time()    
    crimes = crimes.drop_duplicates()    
    crimes.drop(['Unnamed: 0', 'Case Number', 'IUCR','Updated On','Year', 'FBI Code', 'Beat','Ward','Community Area', 'Location'], inplace=True, axis=1)
    crimes.Date = pd.to_datetime(crimes.Date, format='%m/%d/%Y %I:%M:%S %p')
    crimes["dow"] = crimes["Date"].dt.dayofweek
    crimes["date only"] = crimes["Date"].dt.floor('D')
    crimes = crimes.sort_values(by="ID")    
    print("Data cleanup time: ", ((time.time() - t1) * 1000), " (ms)")
    return crimes

crimes = data_cleanup(crimes1)
if bodo.get_rank()==0:
    display(crimes.head())

[stdout:0] Data cleanup time:  5242.116393467768  (ms)


[output:0]

| | ID | Date | Block | Primary Type | Description | Location Description | Arrest | Domestic | District | X Coordinate | Y Coordinate | Latitude | Longitude | dow | date only |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1267593 | 20225 | 2012-01-02 17:58:00 | 024XX E 78TH ST | HOMICIDE | FIRST DEGREE MURDER | STREET | False | False | 4.0 | 1194033.0 | 1853729.0 | 41.753569 | -87.564503 | 0 | 2012-01-02 |
| 1267595 | 20227 | 2012-01-02 05:15:00 | 107XX S COTTAGE GROVE AVE | HOMICIDE | FIRST DEGREE MURDER | STREET | True | False | 5.0 | 1182247.0 | 1833951.0 | 41.699577 | -87.608304 | 0 | 2012-01-02 |
| 1267596 | 20228 | 2012-01-03 12:07:00 | 010XX N PULASKI RD | HOMICIDE | FIRST DEGREE MURDER | STREET | False | False | 11.0 | 1149528.0 | 1906741.0 | 41.900017 | -87.726226 | 1 | 2012-01-03 |
| 1267600 | 20231 | 2012-01-05 02:35:00 | 046XX W MONROE ST | HOMICIDE | FIRST DEGREE MURDER | STREET | False | False | 11.0 | 1145493.0 | 1899161.0 | 41.879294 | -87.741239 | 3 | 2012-01-05 |
| 1267601 | 20232 | 2012-01-05 08:00:00 | 017XX W ALBION AVE | HOMICIDE | FIRST DEGREE MURDER | APARTMENT | True | True | 24.0 | 1163697.0 | 1943895.0 | 42.001683 | -87.673131 | 3 | 2012-01-05 |


In [4]:
%%px
@bodo.jit(cache=True)
def get_top_crime_types(crimes):
    t1 = time.time()
    top_crime_types = crimes['Primary Type'].value_counts().index[0:10]
    print("Getting top crimes Time: ", ((time.time() - t1) * 1000), " (ms)")
    return top_crime_types

top_crime_types = get_top_crime_types(crimes)
top_crime_types = bodo.allgatherv(top_crime_types)
if bodo.get_rank()==0:
    print(top_crime_types)

[stdout:0] Getting top crimes Time:  788.1024963521668  (ms)
Index(['THEFT', 'BATTERY', 'CRIMINAL DAMAGE', 'NARCOTICS', 'ASSAULT',
       'OTHER OFFENSE', 'BURGLARY', 'DECEPTIVE PRACTICE',
       'MOTOR VEHICLE THEFT', 'ROBBERY'],
      dtype='object')


In [5]:
%%px

@bodo.jit(cache=True)
def filter_crimes(crimes, top_crime_types):
    t1 = time.time()
    top_crimes = crimes[crimes['Primary Type'].isin(top_crime_types)]
    print("Filtering crimes Time: ", ((time.time() - t1) * 1000), " (ms)")
    return top_crimes

crimes = filter_crimes(crimes, top_crime_types)
if bodo.get_rank()==0:
    display(crimes.head())

[stdout:0] Filtering crimes Time:  68.5541266943801  (ms)


[output:0]

| | ID | Date | Block | Primary Type | Description | Location Description | Arrest | Domestic | District | X Coordinate | Y Coordinate | Latitude | Longitude | dow | date only |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 77272 | 8421398 | 2012-01-01 00:23:00 | 033XX N HALSTED ST | ASSAULT | AGGRAVATED:KNIFE/CUTTING INSTR | BAR OR TAVERN | True | False | 19.0 | 1170335.0 | 1922325.0 | 41.942351 | -87.649345 | 6 | 2012-01-01 |
| 77273 | 8421402 | 2012-01-01 00:30:00 | 092XX S DR MARTIN LUTHER KING JR DR | BATTERY | AGGRAVATED: OTHER DANG WEAPON | SIDEWALK | False | False | 6.0 | 1180537.0 | 1843779.0 | 41.726586 | -87.614265 | 6 | 2012-01-01 |
| 77279 | 8421414 | 2012-01-01 00:50:00 | 010XX N MILWAUKEE AVE | CRIMINAL DAMAGE | TO PROPERTY | RESIDENCE | False | False | 12.0 | 1167070.0 | 1906993.0 | 41.90035 | -87.661786 | 6 | 2012-01-01 |
| 77288 | 8421427 | 2012-01-01 01:39:00 | 045XX N SHERIDAN RD | NARCOTICS | POSS: CRACK | ALLEY | True | False | 19.0 | 1168795.0 | 1930288.0 | 41.964235 | -87.654773 | 6 | 2012-01-01 |
| 77291 | 8421430 | 2012-01-01 00:20:00 | 005XX N RUSH ST | CRIMINAL DAMAGE | TO VEHICLE | PARKING LOT/GARAGE(NON.RESID.) | True | False | 18.0 | 1177011.0 | 1903813.0 | 41.891405 | -87.625369 | 6 | 2012-01-01 |


## Crime Analysis

### Find the pattern of each crime over the years



In [6]:
%%px
@bodo.jit(cache=True)
def get_crimes_count_date(crimes):
    t1 = time.time()
    crimes_count_date = crimes.pivot_table(index='date only', columns='Primary Type', values='ID', aggfunc="count")
    print("Computing Crime Pattern Time: ", ((time.time() - t1) * 1000), " (ms)")
    return crimes_count_date

crimes_count_date = get_crimes_count_date(crimes)

[stdout:0] Computing Crime Pattern Time:  2285.3919505992053  (ms)


In [7]:
%%px

@bodo.jit
def get_crimes_type_date(crimes_count_date):
    t1 = time.time()
    crimes_count_date.index = pd.DatetimeIndex(crimes_count_date.index)
    result = crimes_count_date.fillna(0).rolling(365).sum()
    result = result.sort_index(ascending=False)
    print("Computing Crime Pattern Time: ", ((time.time() - t1) * 1000), " (ms)")
    return result

crimes_type_date = get_crimes_type_date(crimes_count_date)
if bodo.get_rank()==0:
    display(crimes_type_date.head())

[stdout:0] Computing Crime Pattern Time:  2332.6121890568174  (ms)


[output:0]

| | ROBBERY | THEFT | ASSAULT | OTHER OFFENSE | BATTERY | DECEPTIVE PRACTICE | NARCOTICS | BURGLARY | MOTOR VEHICLE THEFT | CRIMINAL DAMAGE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2017-01-18 | 5682.0 | 32227.0 | 8930.0 | 8679.0 | 25899.0 | 7472.0 | 13160.0 | 8154.0 | 6003.0 | 15330.0 |
| 2017-01-17 | 5752.0 | 31861.0 | 8804.0 | 8517.0 | 25462.0 | 7573.0 | 12156.0 | 7877.0 | 6072.0 | 14993.0 |
| 2017-01-16 | 5587.0 | 31745.0 | 9002.0 | 8810.0 | 25700.0 | 7515.0 | 12665.0 | 8087.0 | 5890.0 | 15295.0 |
| 2017-01-15 | 5747.0 | 31883.0 | 8796.0 | 8521.0 | 25470.0 | 7583.0 | 12179.0 | 7874.0 | 6066.0 | 14987.0 |
| 2017-01-14 | 5628.0 | 32005.0 | 8804.0 | 8696.0 | 25446.0 | 7369.0 | 12950.0 | 8014.0 | 5839.0 | 15204.0 |
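
The pivot and rolling-sum steps in this section can be followed more easily on a small input. The sketch below is plain pandas without Bodo; the toy DataFrame and the names `toy`, `counts` and `rolling_sum` are assumptions for illustration, and the window is 2 rows rather than the 365 rows used in the notebook.

```python
import pandas as pd

# Hypothetical toy records spread over three dates.
toy = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5, 6],
        "date only": pd.to_datetime(
            [
                "2012-01-01", "2012-01-01", "2012-01-02",
                "2012-01-03", "2012-01-03", "2012-01-03",
            ]
        ),
        "Primary Type": ["THEFT", "BATTERY", "THEFT", "THEFT", "THEFT", "BATTERY"],
    }
)

# One row per date, one column per crime type; each cell counts the records.
counts = toy.pivot_table(
    index="date only", columns="Primary Type", values="ID", aggfunc="count"
)

# Missing (date, type) pairs become 0, then a trailing 2-row window is summed
# and the result is ordered from the newest date to the oldest.
rolling_sum = counts.fillna(0).rolling(2).sum().sort_index(ascending=False)
```

The oldest date has no complete window, so its row contains only `NaN`. The same applies to the first 364 dates in the notebook's 365-row window.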


## A general view of crime records by time, type and location

### Determining the pattern on a daily basis

In [8]:
%%px
@bodo.jit(distributed=['crimes', 'crimes_days'], cache=True)
def get_crimes_by_days(crimes):
    t1 = time.time()
    crimes_days = crimes.groupby('dow', as_index=False)['ID'].count().sort_values(by='dow')
    print("Group by days Time: ", ((time.time() - t1) * 1000), " (ms)")
    return crimes_days
    
crimes_days = get_crimes_by_days(crimes)
if bodo.get_rank()==0:
    display(crimes_days.head())

[stdout:0] Group by days Time:  1151.63554845185  (ms)


[output:0]

| | dow | ID |
| --- | --- | --- |
| 4 | 0 | 95090 |
| 1 | 1 | 94739 |
| 2 | 2 | 95593 |
| 3 | 3 | 94761 |
| 0 | 4 | 100601 |
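
The group-and-count pattern used here, and again in the monthly, crime type and location sections, can be sketched in plain pandas. The toy data and the names `toy`, `by_day` and `by_count` are assumptions for this example only, and no Bodo decorator is applied.

```python
import pandas as pd

# Hypothetical toy records: six crimes with a day-of-week code each.
toy = pd.DataFrame(
    {"ID": [10, 11, 12, 13, 14, 15], "dow": [3, 0, 3, 6, 0, 3]}
)

# Count records per day of week and order the result by the day code,
# as in the daily and monthly sections.
by_day = toy.groupby("dow", as_index=False)["ID"].count().sort_values(by="dow")

# Same counts ordered from most to least frequent, as in the crime type
# and location sections.
by_count = toy.groupby("dow", as_index=False)["ID"].count().sort_values(
    by="ID", ascending=False
)
```

With `as_index=False` the grouping key stays a regular column, so the `ID` column of the result holds the number of records in each group rather than an identifier.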


### Determining the pattern on a monthly basis

In [9]:
%%px
@bodo.jit(distributed=['crimes', 'crimes_months'], cache=True)
def get_crimes_by_months(crimes):
    t1 = time.time()
    crimes['month'] = crimes["Date"].dt.month
    crimes_months = crimes.groupby('month', as_index=False)['ID'].count().sort_values(by='month')
    print("Group by months Time: ", ((time.time() - t1) * 1000), " (ms)")
    return crimes_months

crimes_months = get_crimes_by_months(crimes)
if bodo.get_rank()==0:
    display(crimes_months.head())

[stdout:0] Group by months Time:  461.3977206108757  (ms)


[output:0]

| | month | ID |
| --- | --- | --- |
| 6 | 1 | 56524 |
| 7 | 2 | 45259 |
| 8 | 3 | 54459 |
| 3 | 4 | 54425 |
| 10 | 5 | 59399 |


### Determining the pattern by crime type

In [10]:
%%px
@bodo.jit(distributed=['crimes', 'crimes_type'], cache=True)
def get_crimes_by_type(crimes):
    t1 = time.time()
    crimes_type = crimes.groupby('Primary Type', as_index=False)['ID'].count().sort_values(by='ID', ascending=False)
    print("Group by type Time: ", ((time.time() - t1) * 1000), " (ms)")
    return crimes_type

crimes_type = get_crimes_by_type(crimes)
if bodo.get_rank()==0:
    display(crimes_type.head())

[stdout:0] Group by type Time:  680.3096781682143  (ms)


[output:0]

| | Primary Type | ID |
| --- | --- | --- |
| 1 | THEFT | 164840 |
| 4 | BATTERY | 131803 |


### Determining the pattern by location

In [11]:
%%px
@bodo.jit(distributed=['crimes', 'crimes_location'], cache=True)
def get_crimes_by_location(crimes):
    t1 = time.time()
    crimes_location = crimes.groupby('Location Description', as_index=False)['ID'].count().sort_values(by='ID', ascending=False)
    print("Group by location Time: ", ((time.time() - t1) * 1000), " (ms)")
    return crimes_location

crimes_location = get_crimes_by_location(crimes)
if bodo.get_rank()==0:
    display(crimes_location.head())

[stdout:0] Group by location Time:  1227.0002865907372  (ms)


[output:0]

| | Location Description | ID |
| --- | --- | --- |
| 32 | STREET | 153539 |
| 39 | RESIDENCE | 108365 |
| 31 | APARTMENT | 86543 |
| 74 | SIDEWALK | 73893 |
| 95 | OTHER | 25845 |
