In [3]:
# Allow us to load `open_cp` without installing
import sys, os.path
sys.path.insert(0, os.path.abspath(".."))

# Chicago data and simulating address-level data

The Chicago data is anonymised: addresses are obscured (only the block is given) and the location points are moved.

- The data can be downloaded from https://catalog.data.gov/dataset/crimes-2001-to-present-398a4 (see the module docstring of `open_cp.sources.chicago`).  See also https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2

In this notebook, we quickly look at the data, check that the data agrees between both sources, and demo some of the library features provided for loading the data.

In another notebook, we will look closely at how the location data corresponds (or doesn't!) to the block addresses given.

In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import PIL
import pandas as pd
import geopandas as gpd

import open_cp.sources.chicago as chicago

import sys, os, csv, lzma
filename = os.path.join("..", "open_cp", "sources", "chicago.csv")
filename_all = os.path.join("..", "open_cp", "sources", "chicago_all.csv.xz")
filename_all1 = os.path.join("..", "open_cp", "sources", "chicago_all1.csv.xz")

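Let us compare the snapshot of the last year with the total dataset.  The data appears to be the same, though the column names and value formats differ (for example, `'N'`/`'Y'` in the snapshot versus `'false'`/`'true'` in the full file).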

In [3]:
with open(filename, "rt") as file:
    reader = csv.reader(file)
    print(next(reader))
    print(next(reader))

['CASE#', 'DATE  OF OCCURRENCE', 'BLOCK', ' IUCR', ' PRIMARY DESCRIPTION', ' SECONDARY DESCRIPTION', ' LOCATION DESCRIPTION', 'ARREST', 'DOMESTIC', 'BEAT', 'WARD', 'FBI CD', 'X COORDINATE', 'Y COORDINATE', 'LATITUDE', 'LONGITUDE', 'LOCATION']
['HZ560767', '12/22/2016 02:55:00 AM', '010XX N CENTRAL PARK AVE', '4387', 'OTHER OFFENSE', 'VIOLATE ORDER OF PROTECTION', 'APARTMENT', 'N', 'Y', '1112', '27', '26', '1152189', '1906649', '41.899712716', '-87.716454159', '(41.899712716, -87.716454159)']


In [4]:
with lzma.open(filename_all, "rt") as file:
    reader = csv.reader(file)
    print(next(reader))
    print(next(reader))

['ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type', 'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat', 'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate', 'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude', 'Location']
['8651563', 'HV322174', '06/05/2012 11:00:00 AM', '022XX N CANNON DR', '0810', 'THEFT', 'OVER $500', 'STREET', 'false', 'false', '1814', '018', '43', '7', '06', '1175057', '1915111', '2012', '02/04/2016 06:33:39 AM', '41.922450893', '-87.632206293', '(41.922450893, -87.632206293)']
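
To make the difference in format concrete, here is a minimal sketch of reading both layouts into records with common keys. The column names are copied from the two header rows printed above; the mapping dictionaries, the function name `normalise_rows` and the abbreviated sample rows are illustrative assumptions and not part of `open_cp`.

```python
import csv
import io

# Hypothetical mappings from each file's column names to common keys.
SNAPSHOT_COLUMNS = {"CASE#": "case", "DATE  OF OCCURRENCE": "date", "BLOCK": "block",
                    "PRIMARY DESCRIPTION": "crime", "ARREST": "arrest"}
FULL_COLUMNS = {"Case Number": "case", "Date": "date", "Block": "block",
                "Primary Type": "crime", "Arrest": "arrest"}

def normalise_rows(file, columns):
    """Yield one dict per row, keyed by the common names in `columns`."""
    reader = csv.reader(file)
    # Some snapshot headers carry a leading space (e.g. ' IUCR'), so strip them.
    header = [name.strip() for name in next(reader)]
    for row in reader:
        record = dict(zip(header, row))
        out = {key: record[name] for name, key in columns.items()}
        # The snapshot uses 'Y'/'N'; the full file uses 'true'/'false'.
        out["arrest"] = out["arrest"] in ("Y", "true")
        yield out

# Abbreviated sample rows (a subset of the columns shown above).
snapshot_text = ("CASE#,DATE  OF OCCURRENCE,BLOCK, IUCR, PRIMARY DESCRIPTION,ARREST\n"
                 "HZ560767,12/22/2016 02:55:00 AM,010XX N CENTRAL PARK AVE,4387,OTHER OFFENSE,N\n")
full_text = ("ID,Case Number,Date,Block,IUCR,Primary Type,Arrest\n"
             "8651563,HV322174,06/05/2012 11:00:00 AM,022XX N CANNON DR,0810,THEFT,false\n")

snapshot_records = list(normalise_rows(io.StringIO(snapshot_text), SNAPSHOT_COLUMNS))
full_records = list(normalise_rows(io.StringIO(full_text), FULL_COLUMNS))
print(snapshot_records[0])
print(full_records[0])
```

Once both files are reduced to the same keys, records from the two sources can be compared directly.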


As well as loading data directly into a `TimedPoints` class, we can process a sub-set of the data to GeoJSON, or straight to a geopandas dataframe (if geopandas is installed).

In [5]:
geo_data = chicago.load_to_GeoJSON()
geo_data[0]

{'geometry': {'coordinates': [-87.716454159, 41.899712716], 'type': 'Point'},
 'properties': {'address': '010XX N CENTRAL PARK AVE',
  'case': 'HZ560767',
  'crime': 'OTHER OFFENSE',
  'location': 'APARTMENT',
  'timestamp': '2016-12-22T02:55:00',
  'type': 'VIOLATE ORDER OF PROTECTION'},
 'type': 'Feature'}
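
The feature above can be related back to the raw snapshot row. The following is a rough sketch of that mapping, assuming a row given as a dict keyed by the (stripped) snapshot header names; `row_to_feature` is a hypothetical helper written for illustration and is not the implementation used by `open_cp.sources.chicago`.

```python
from datetime import datetime

def row_to_feature(row):
    """Build a GeoJSON-style feature dict from one snapshot row."""
    # The snapshot stores times like '12/22/2016 02:55:00 AM'.
    timestamp = datetime.strptime(row["DATE  OF OCCURRENCE"], "%m/%d/%Y %I:%M:%S %p")
    if row["LONGITUDE"] and row["LATITUDE"]:
        # GeoJSON orders coordinates as [longitude, latitude].
        geometry = {"type": "Point",
                    "coordinates": [float(row["LONGITUDE"]), float(row["LATITUDE"])]}
    else:
        # Assumption: rows without coordinates get no geometry.
        geometry = None
    return {
        "type": "Feature",
        "geometry": geometry,
        "properties": {
            "address": row["BLOCK"],
            "case": row["CASE#"],
            "crime": row["PRIMARY DESCRIPTION"],
            "location": row["LOCATION DESCRIPTION"],
            "timestamp": timestamp.isoformat(),
            "type": row["SECONDARY DESCRIPTION"],
        },
    }

# Values copied from the first data row of the snapshot printed earlier.
sample_row = {
    "CASE#": "HZ560767",
    "DATE  OF OCCURRENCE": "12/22/2016 02:55:00 AM",
    "BLOCK": "010XX N CENTRAL PARK AVE",
    "PRIMARY DESCRIPTION": "OTHER OFFENSE",
    "SECONDARY DESCRIPTION": "VIOLATE ORDER OF PROTECTION",
    "LOCATION DESCRIPTION": "APARTMENT",
    "LATITUDE": "41.899712716",
    "LONGITUDE": "-87.716454159",
}
feature = row_to_feature(sample_row)
print(feature["properties"]["timestamp"])
```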

In [6]:
frame = chicago.load_to_geoDataFrame()
frame.head()

Unnamed: 0,address,case,crime,geometry,location,timestamp,type
0,010XX N CENTRAL PARK AVE,HZ560767,OTHER OFFENSE,POINT (-87.71645415899999 41.899712716),APARTMENT,2016-12-22T02:55:00,VIOLATE ORDER OF PROTECTION
1,051XX S WASHTENAW AVE,HZ561134,BATTERY,POINT (-87.691539994 41.800445234),RESIDENTIAL YARD (FRONT/BACK),2016-12-22T11:17:00,AGGRAVATED: OTHER FIREARM
2,059XX W DIVERSEY AVE,HZ565584,DECEPTIVE PRACTICE,POINT (-87.774165121 41.931166274),RESIDENCE,2016-12-09T12:00:00,FINANCIAL IDENTITY THEFT $300 AND UNDER
3,001XX N STATE ST,HZ561772,THEFT,POINT (-87.62787669799999 41.883500187),DEPARTMENT STORE,2016-12-22T18:50:00,RETAIL THEFT
4,008XX N MICHIGAN AVE,HZ561969,THEFT,POINT (-87.624095634 41.897982937),SMALL RETAIL STORE,2016-12-22T19:20:00,RETAIL THEFT


## Explore with QGIS

If geopandas is installed, we can save the dataframe to a shapefile, which can be viewed in e.g. QGIS.

To explore the spatial distribution, I would recommend using an interactive GIS package.  Using QGIS (free and open source) you can easily add a basemap using GoogleMaps or OpenStreetMap, etc.  See http://maps.cga.harvard.edu/qgis/wkshop/basemap.php

I found this to be slightly buggy.  With QGIS 2.18.7 on Windows, I found that the following worked:
- First open the `chicago.shp` file produced by the `frame.to_file("chicago")` cell below.
- Select the coordinate reference system "WGS 84 / EPSG:4326".
- Now go to the menu "Web" -> "OpenLayers plugin" and pick any basemap.
- The projection should change to EPSG:3857.  The basemap will obscure the point map, so in the "Layers Panel" drag the basemap to the bottom.
- Selecting EPSG:3857 at import time doesn't seem to work (contrary to the linked instructions).
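
For reference, the change from EPSG:4326 (longitude/latitude) to EPSG:3857 (the "Web Mercator" projection used by most web basemaps) can be sketched with the standard spherical Mercator formula. This is only an illustration of what the reprojection does to a point; the function name is made up here, and QGIS performs the conversion itself.

```python
import math

# WGS 84 semi-major axis in metres, used as the sphere radius by EPSG:3857.
EARTH_RADIUS = 6378137.0

def lonlat_to_web_mercator(lon, lat):
    """Convert degrees of longitude/latitude to Web Mercator metres."""
    x = EARTH_RADIUS * math.radians(lon)
    y = EARTH_RADIUS * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y

# First point of the snapshot data shown earlier.
x, y = lonlat_to_web_mercator(-87.716454159, 41.899712716)
print(x, y)
```

The resulting coordinates are in metres, which is why points stored as longitude/latitude must be loaded as EPSG:4326 before the basemap switches the project to EPSG:3857.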

In [None]:
frame.to_file("chicago")

## Check the total data sets agree

The files `filename` and `filename1` were downloaded from, respectively, the US Gov website, and the Chicago site.  They are slightly different in size, but appear to contain the same data.  (This can be checked!)

The files `filename_all` and `filename_all1` were also downloaded from, respectively, the US Gov website, and the Chicago site.  While they are the same size (uncompressed), and have the same headers, the data appears, at least naively, to be different.

In [3]:
with lzma.open(filename_all, "rt") as file:
    print(next(file))
with lzma.open(filename_all1, "rt") as file:
    print(next(file))    

ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location

ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location



In [4]:
with lzma.open(filename_all, "rt") as file:
    next(file); print(next(file))
with lzma.open(filename_all1, "rt") as file:
    next(file); print(next(file))  

8651563,HV322174,06/05/2012 11:00:00 AM,022XX N CANNON DR,0810,THEFT,OVER $500,STREET,false,false,1814,018,43,7,06,1175057,1915111,2012,02/04/2016 06:33:39 AM,41.922450893,-87.632206293,"(41.922450893, -87.632206293)"

9257701,HW403120,08/11/2013 12:40:00 PM,039XX W WILCOX ST,2024,NARCOTICS,POSS: HEROIN(WHITE),ALLEY,true,false,1122,011,28,26,18,1150094,1899044,2013,02/04/2016 06:33:39 AM,41.878884869,-87.724347406,"(41.878884869, -87.724347406)"



In [5]:
def load_as_tuples(f):
    events = []
    for feature in chicago.generate_GeoJSON_Features(f, type="all"):
        props = feature["properties"]
        if props["crime"] == "HOMICIDE":
            continue
        coords = feature["geometry"]
        if coords is None:
            coords = (-1, -1)
        else:
            coords = coords["coordinates"]
        event = (props["case"], props["crime"], props["type"], props["location"],
                 props["timestamp"], props["address"], coords[0], coords[1])
        events.append(event)
    return events

def load_as_dict_to_lists(f):
    events = dict()
    for feature in chicago.generate_GeoJSON_Features(f, type="all"):
        props = feature["properties"]
        if props["crime"] == "HOMICIDE":
            continue
        coords = feature["geometry"]
        if coords is None:
            coords = (-1, -1)
        else:
            coords = coords["coordinates"]
        case = props["case"]
        if case not in events:
            events[case] = []
        event = (props["crime"], props["type"], props["location"],
                 props["timestamp"], props["address"], coords[0], coords[1])
        events[case].append(event)
    return events    

In [6]:
with lzma.open(filename_all, "rt") as file:
    events = load_as_dict_to_lists(file)
    print("events:", len(events))
with lzma.open(filename_all1, "rt") as file:
    events1 = load_as_dict_to_lists(file)
    print("events:", len(events1))

events: 6320160
events: 6320160


In [7]:
diffs = dict()

for case in set(events.keys()):
    if len(events[case]) == 1 and case in events1 and len(events1[case])==1:
        if not events[case] == events1[case]:
            diffs[case] = (events[case], events1[case])
        del events[case]
        del events1[case]

In [8]:
len(diffs)

0

In [9]:
events.keys() == events1.keys()

True

In [10]:
for key in events:
    assert set(events[key]) == set(events1[key])
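
The comparison carried out above can be illustrated on toy data. This is a small sketch, assuming two dictionaries from case number to a list of record tuples (the shape returned by `load_as_dict_to_lists`); the function `compare_by_case` and the toy records are made up for illustration.

```python
def compare_by_case(events, events1):
    """Return (sorted cases whose records differ, whether both have the same cases)."""
    same_cases = events.keys() == events1.keys()
    differing = []
    for case in events.keys() & events1.keys():
        # Compare as sets so the order of records within a case is ignored.
        # Note that this also ignores how often a record is repeated.
        if set(events[case]) != set(events1[case]):
            differing.append(case)
    return sorted(differing), same_cases

# Same records, listed in a different order in each source.
source_a = {"A1": [("THEFT", "STREET")],
            "B2": [("BURGLARY", "RESIDENCE"), ("BURGLARY", "APARTMENT")]}
source_b = {"B2": [("BURGLARY", "APARTMENT"), ("BURGLARY", "RESIDENCE")],
            "A1": [("THEFT", "STREET")]}
print(compare_by_case(source_a, source_b))
```

Grouping by case number makes the check independent of row order, which matters here because the two full files list the same events in a different order.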

# A geoPandas example

Let's use the GeoJSON generator `chicago.generate_GeoJSON_Features` used above to pick out only BURGLARY crimes from the 2001 to present dataset (which is too large to easily load into a dataframe in one go).

In [5]:
with lzma.open(filename_all, "rt") as file:
    features = [ event for event in chicago.generate_GeoJSON_Features(file, type="all")
                if event["properties"]["crime"] == "BURGLARY" ]

frame = gpd.GeoDataFrame.from_features(features)
frame.crs = {"init":"EPSG:4326"} # Lon/Lat native coords
frame.head()

Unnamed: 0,address,case,crime,geometry,location,timestamp,type
0,003XX W 94TH PL,HV327134,BURGLARY,POINT (-87.632356731 41.723006487),RESIDENCE,2012-06-09T23:30:00,FORCIBLE ENTRY
1,011XX N HERMITAGE AVE,HV327217,BURGLARY,POINT (-87.671145473 41.902514714),RESIDENCE,2012-06-06T12:00:00,FORCIBLE ENTRY
2,047XX W WEST END AVE,HV327110,BURGLARY,POINT (-87.743935241 41.883144607),RESIDENCE,2012-06-09T01:30:00,FORCIBLE ENTRY
3,038XX S SACRAMENTO AVE,HV327153,BURGLARY,POINT (-87.69951882399999 41.823319195),APARTMENT,2012-06-09T12:00:00,UNLAWFUL ENTRY
4,013XX W 91ST ST,HV327223,BURGLARY,POINT (-87.65724045 41.72858934),RESIDENCE,2012-06-09T02:00:00,FORCIBLE ENTRY


In [6]:
frame["type"].unique()

array(['FORCIBLE ENTRY', 'UNLAWFUL ENTRY', 'ATTEMPT FORCIBLE ENTRY',
       'HOME INVASION'], dtype=object)

In [7]:
frame["location"].unique()

array(['RESIDENCE', 'APARTMENT', 'DEPARTMENT STORE', 'RESIDENCE-GARAGE',
       'SCHOOL, PUBLIC, BUILDING', 'SMALL RETAIL STORE',
       'CONSTRUCTION SITE', 'CHA PARKING LOT/GROUNDS',
       'CHURCH/SYNAGOGUE/PLACE OF WORSHIP', 'APPLIANCE STORE',
       'RESTAURANT', 'CHA APARTMENT', 'CLEANING STORE',
       'GROCERY FOOD STORE', 'OTHER', 'DAY CARE CENTER',
       'RESIDENTIAL YARD (FRONT/BACK)', 'COMMERCIAL / BUSINESS OFFICE',
       'ABANDONED BUILDING', 'BARBERSHOP',
       'PARKING LOT/GARAGE(NON.RESID.)', 'WAREHOUSE', 'VACANT LOT/LAND',
       'BAR OR TAVERN', 'PARK PROPERTY', 'CTA GARAGE / OTHER PROPERTY',
       'RESIDENCE PORCH/HALLWAY', 'BANK', 'FEDERAL BUILDING', 'STREET',
       'OTHER RAILROAD PROP / TRAIN DEPOT', 'GAS STATION',
       'MEDICAL/DENTAL OFFICE', 'CAR WASH', 'MOVIE HOUSE/THEATER',
       'CONVENIENCE STORE', 'HOTEL/MOTEL', 'DRIVEWAY - RESIDENTIAL',
       'SCHOOL, PRIVATE, BUILDING', 'ALLEY', 'SCHOOL, PUBLIC, GROUNDS',
       'CURRENCY EXCHANGE', 'HOSPITAL BU

In [23]:
frame.to_file("chicago_all_burglary")

Upon loading into QGIS to visualise, we find that the 2001 data seems to be geocoded in a different way: the events are not on the road, and the distribution looks less artificial.  Let's extract the 2001 burglary data, and then all the 2001 data, and save both.

In [9]:
frame2001 = frame[frame.timestamp.map(lambda s : s.startswith("2001"))]

In [10]:
frame2001.to_file("chicago_2001_burglary")

In [12]:
with lzma.open(filename_all, "rt") as file:
    features = [ event for event in chicago.generate_GeoJSON_Features(file, type="all")
                if event["properties"]["timestamp"].startswith("2001") ]

frame = gpd.GeoDataFrame.from_features(features)
frame.crs = {"init":"EPSG:4326"} # Lon/Lat native coords
frame.head()

Unnamed: 0,address,case,crime,geometry,location,timestamp,type
0,079XX S CAMPBELL AVE,G397434,HOMICIDE,POINT (-87.685268108 41.749135591),STREET,2001-08-09T19:30:00,FIRST DEGREE MURDER
1,048XX W KAMERLING AVE,G473112,HOMICIDE,POINT (-87.746889161 41.905072512),STREET,2001-08-10T01:04:00,FIRST DEGREE MURDER
2,009XX N LAMON AVE,G477463,HOMICIDE,POINT (-87.74830887500001 41.897236716),STREET,2001-08-11T23:06:00,FIRST DEGREE MURDER
3,036XX W OHIO ST,G476774,HOMICIDE,POINT (-87.717959287 41.891770726),APARTMENT,2001-08-12T18:17:00,FIRST DEGREE MURDER
4,019XX S RACINE AVE,G477822,HOMICIDE,POINT (-87.656374568 41.85626764),AUTO,2001-08-12T04:35:00,FIRST DEGREE MURDER


In [13]:
frame.to_file("chicago_2001")