In [1]:
# Allow us to load `open_cp` without installing
import sys, os.path
sys.path.insert(0, os.path.abspath(os.path.join("..", "..")))

# Chicago data

The data can be downloaded from https://catalog.data.gov/dataset/crimes-2001-to-present-398a4 (see the module docstring of `open_cp.sources.chicago`).  See also https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2

In this notebook, we quickly look at the data, check that the data agrees between both sources, and demo some of the library features provided for loading the data.

In [2]:
import open_cp.sources.chicago as chicago
import geopandas as gpd

import sys, os, csv, lzma
filename = os.path.join("..", "..", "open_cp", "sources", "chicago.csv")
filename_all = os.path.join("..", "..", "open_cp", "sources", "chicago_all.csv.xz")
filename_all1 = os.path.join("..", "..", "open_cp", "sources", "chicago_all1.csv.xz")

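Let us look at the snapshot of the last year, vs the total dataset.  The data appears to be the same, though the exact format differs: the column names, the set of columns, and the encoding of booleans (`N`/`Y` vs `true`/`false`) all change.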

In [3]:
with open(filename, "rt") as file:
    reader = csv.reader(file)
    print(next(reader))
    print(next(reader))

['CASE#', 'DATE  OF OCCURRENCE', 'BLOCK', ' IUCR', ' PRIMARY DESCRIPTION', ' SECONDARY DESCRIPTION', ' LOCATION DESCRIPTION', 'ARREST', 'DOMESTIC', 'BEAT', 'WARD', 'FBI CD', 'X COORDINATE', 'Y COORDINATE', 'LATITUDE', 'LONGITUDE', 'LOCATION']
['HZ560767', '12/22/2016 02:55:00 AM', '010XX N CENTRAL PARK AVE', '4387', 'OTHER OFFENSE', 'VIOLATE ORDER OF PROTECTION', 'APARTMENT', 'N', 'Y', '1112', '27', '26', '1152189', '1906649', '41.899712716', '-87.716454159', '(41.899712716, -87.716454159)']


In [4]:
with lzma.open(filename_all, "rt") as file:
    reader = csv.reader(file)
    print(next(reader))
    print(next(reader))

['ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type', 'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat', 'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate', 'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude', 'Location']
['4652043', 'HL594701', '09/06/2005 12:06:44 PM', '004XX E 61ST ST', '1811', 'NARCOTICS', 'POSS: CANNABIS 30GMS OR LESS', 'OTHER', 'true', 'false', '0313', '003', '20', '42', '18', '1180151', '1864661', '2005', '04/15/2016 08:55:02 AM', '41.783897141', '-87.61504023', '(41.783897141, -87.61504023)']
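
The two headers printed above differ in spacing, capitalisation and a few column names.  A minimal sketch of how they could be reconciled, using only the standard library.  The `ALIASES` table and the `canonical` helper are assumptions made for illustration (in particular, the pairing of `SECONDARY DESCRIPTION` with `Description` is a guess from the sample rows); they are not part of `open_cp`.

```python
# Header rows copied from the two outputs above.
SNAPSHOT_HEADER = ['CASE#', 'DATE  OF OCCURRENCE', 'BLOCK', ' IUCR',
                   ' PRIMARY DESCRIPTION', ' SECONDARY DESCRIPTION',
                   ' LOCATION DESCRIPTION', 'ARREST', 'DOMESTIC', 'BEAT',
                   'WARD', 'FBI CD', 'X COORDINATE', 'Y COORDINATE',
                   'LATITUDE', 'LONGITUDE', 'LOCATION']
FULL_HEADER = ['ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type',
               'Description', 'Location Description', 'Arrest', 'Domestic',
               'Beat', 'District', 'Ward', 'Community Area', 'FBI Code',
               'X Coordinate', 'Y Coordinate', 'Year', 'Updated On',
               'Latitude', 'Longitude', 'Location']

# Hypothetical mapping from snapshot names to the names in the full dataset.
ALIASES = {
    "CASE#": "CASE NUMBER",
    "DATE OF OCCURRENCE": "DATE",
    "PRIMARY DESCRIPTION": "PRIMARY TYPE",
    "SECONDARY DESCRIPTION": "DESCRIPTION",
    "FBI CD": "FBI CODE",
}

def canonical(name):
    """Collapse whitespace, upper-case, then apply the alias table."""
    key = " ".join(name.split()).upper()
    return ALIASES.get(key, key)

snapshot_cols = {canonical(name) for name in SNAPSHOT_HEADER}
full_cols = {canonical(name) for name in FULL_HEADER}

# Columns present in one file but not the other, after normalising names.
only_in_full = sorted(full_cols - snapshot_cols)
only_in_snapshot = sorted(snapshot_cols - full_cols)
print(only_in_full)
print(only_in_snapshot)
```

Under these assumptions every snapshot column has a counterpart in the full dataset, and the full dataset carries a few extra bookkeeping columns.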


As well as loading data directly into a `TimedPoints` class, we can process a sub-set of the data to GeoJSON, or straight to a geopandas dataframe (if geopandas is installed).

In [5]:
geo_data = chicago.load_to_GeoJSON()
geo_data[0]

{'geometry': {'coordinates': [-87.716454159, 41.899712716], 'type': 'Point'},
 'properties': {'address': '010XX N CENTRAL PARK AVE',
  'case': 'HZ560767',
  'crime': 'OTHER OFFENSE',
  'location': 'APARTMENT',
  'timestamp': '2016-12-22T02:55:00',
  'type': 'VIOLATE ORDER OF PROTECTION'},
 'type': 'Feature'}
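
To make the mapping from a CSV row to the feature above concrete, here is a rough sketch that rebuilds the same dictionary from the first snapshot row using only the standard library.  The helper `row_to_feature` is hypothetical and is not how `open_cp` is necessarily implemented; the field mapping is simply read off from the output above.

```python
from datetime import datetime

# Header and first data row of the snapshot file, copied from the output above.
HEADER = ['CASE#', 'DATE  OF OCCURRENCE', 'BLOCK', ' IUCR',
          ' PRIMARY DESCRIPTION', ' SECONDARY DESCRIPTION',
          ' LOCATION DESCRIPTION', 'ARREST', 'DOMESTIC', 'BEAT', 'WARD',
          'FBI CD', 'X COORDINATE', 'Y COORDINATE', 'LATITUDE', 'LONGITUDE',
          'LOCATION']
ROW = ['HZ560767', '12/22/2016 02:55:00 AM', '010XX N CENTRAL PARK AVE',
       '4387', 'OTHER OFFENSE', 'VIOLATE ORDER OF PROTECTION', 'APARTMENT',
       'N', 'Y', '1112', '27', '26', '1152189', '1906649', '41.899712716',
       '-87.716454159', '(41.899712716, -87.716454159)']

def row_to_feature(header, row):
    """Build a GeoJSON-style Feature dict from one snapshot CSV row."""
    # Some header names carry a leading space, so strip before using as keys.
    rec = {name.strip(): value for name, value in zip(header, row)}
    when = datetime.strptime(rec["DATE  OF OCCURRENCE"], "%m/%d/%Y %I:%M:%S %p")
    return {
        "type": "Feature",
        # GeoJSON orders coordinates as [longitude, latitude].
        "geometry": {
            "type": "Point",
            "coordinates": [float(rec["LONGITUDE"]), float(rec["LATITUDE"])],
        },
        "properties": {
            "address": rec["BLOCK"],
            "case": rec["CASE#"],
            "crime": rec["PRIMARY DESCRIPTION"],
            "location": rec["LOCATION DESCRIPTION"],
            "timestamp": when.isoformat(),
            "type": rec["SECONDARY DESCRIPTION"],
        },
    }

feature = row_to_feature(HEADER, ROW)
print(feature["properties"]["timestamp"], feature["geometry"]["coordinates"])
```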

In [6]:
frame = chicago.load_to_geoDataFrame()
frame.head()

Unnamed: 0,address,case,crime,geometry,location,timestamp,type
0,010XX N CENTRAL PARK AVE,HZ560767,OTHER OFFENSE,POINT (-87.71645415899999 41.899712716),APARTMENT,2016-12-22T02:55:00,VIOLATE ORDER OF PROTECTION
1,051XX S WASHTENAW AVE,HZ561134,BATTERY,POINT (-87.691539994 41.800445234),RESIDENTIAL YARD (FRONT/BACK),2016-12-22T11:17:00,AGGRAVATED: OTHER FIREARM
2,059XX W DIVERSEY AVE,HZ565584,DECEPTIVE PRACTICE,POINT (-87.774165121 41.931166274),RESIDENCE,2016-12-09T12:00:00,FINANCIAL IDENTITY THEFT $300 AND UNDER
3,001XX N STATE ST,HZ561772,THEFT,POINT (-87.62787669799999 41.883500187),DEPARTMENT STORE,2016-12-22T18:50:00,RETAIL THEFT
4,008XX N MICHIGAN AVE,HZ561969,THEFT,POINT (-87.624095634 41.897982937),SMALL RETAIL STORE,2016-12-22T19:20:00,RETAIL THEFT


## Explore with QGIS

We can save the dataframe to a shape-file which can be viewed in e.g. QGIS.

To explore the spatial-distribution, I would recommend using an interactive GIS package.  Using QGIS (free and open source) you can easily add a basemap using GoogleMaps or OpenStreetMap, etc.  See http://maps.cga.harvard.edu/qgis/wkshop/basemap.php

I found this to be slightly buggy.  On Windows, with QGIS 2.18.7, I found that the following worked:
- First open the `chicago.shp` file produced by the `frame.to_file("chicago")` cell below.
- Select the Coordinate reference system "WGS 84 / EPSG:4326"
- Now go to the menu "Web" -> "OpenLayers plugin" and pick a basemap.
- The projection should change to EPSG:3857.  The basemap will obscure the point map, so in the "Layers Panel" drag the basemap to the bottom.
- Selecting EPSG:3857 at import time doesn't seem to work (which differs from the linked instructions).
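
For orientation, the change from EPSG:4326 to EPSG:3857 mentioned in the steps above is the standard spherical "Web Mercator" formula.  A minimal sketch of it in plain Python follows; the function name `lonlat_to_web_mercator` is made up for this illustration, and in practice QGIS (or a projection library) does this for you.

```python
import math

# Sphere radius used by EPSG:3857 (the WGS 84 semi-major axis, in metres).
EARTH_RADIUS = 6378137.0

def lonlat_to_web_mercator(lon, lat):
    """Project longitude / latitude in degrees to EPSG:3857 metres."""
    x = EARTH_RADIUS * math.radians(lon)
    y = EARTH_RADIUS * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y

# The first event in the snapshot file.
x, y = lonlat_to_web_mercator(-87.716454159, 41.899712716)
print(x, y)
```

The resulting coordinates are in metres, which is why points stored as longitude / latitude need to be declared as EPSG:4326 on import rather than EPSG:3857.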

In [7]:
# On my Windows install, if I don't do this, I get a GDAL error in
# the Jupyter console, and the resulting ".prj" file is empty.
# This isn't critical, but it confuses QGIS, and you end up having to
# choose a projection when loading the shape-file.
import os
os.environ["GDAL_DATA"] = "C:\\Users\\Matthew\\Anaconda3\\Library\\share\\gdal\\"

frame.to_file("chicago")

# A geoPandas example

Let's use the "generator of GeoJSON" option, `chicago.generate_GeoJSON_Features`, to pick out only the THEFT crimes, and then only the BURGLARY crimes, from the 2001 to present dataset (which is too large to easily load into a dataframe in one go).

In [8]:
with lzma.open(filename_all, "rt") as file:
    features = [ event for event in chicago.generate_GeoJSON_Features(file, type="all")
                if event["properties"]["crime"] == "THEFT" ]
    
frame = gpd.GeoDataFrame.from_features(features)
frame.crs = {"init":"EPSG:4326"} # Lon/Lat native coords
frame.head()

Unnamed: 0,address,case,crime,geometry,location,timestamp,type
0,007XX N MICHIGAN AVE,HM251023,THEFT,POINT (-87.624279065 41.896010965),DEPARTMENT STORE,2006-03-24T19:00:00,RETAIL THEFT
1,038XX W DIVERSEY AVE,HM250171,THEFT,POINT (-87.722811197 41.931845968),VEHICLE NON-COMMERCIAL,2006-03-24T12:25:00,$500 AND UNDER
2,011XX W THORNDALE AVE,HM250827,THEFT,POINT (-87.659104566 41.990039942),SMALL RETAIL STORE,2006-03-24T17:30:00,RETAIL THEFT
3,073XX N CLARK ST,HM250039,THEFT,POINT (-87.674962947 42.014588191),SIDEWALK,2006-03-23T21:00:00,OVER $500
4,019XX E 79TH ST,HL796548,THEFT,POINT (-87.57686194999999 41.751596008),SMALL RETAIL STORE,2005-12-18T14:27:52,RETAIL THEFT
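
The reason a generator helps here is that rows are read, tested and discarded one at a time, so only the matching events are ever kept in memory.  The same idea can be sketched with just the standard library.  The tiny in-memory file below is a stand-in for `chicago_all.csv.xz` with only three of the real columns, and `rows_of_type` is a hypothetical helper, not a function from `open_cp`.

```python
import csv
import io
import lzma

# A stand-in for the compressed file, using values from rows printed in this notebook.
SAMPLE = (
    "Case Number,Primary Type,Description\n"
    "HL594701,NARCOTICS,POSS: CANNABIS 30GMS OR LESS\n"
    "HM251023,THEFT,RETAIL THEFT\n"
    "HM246722,BURGLARY,FORCIBLE ENTRY\n"
    "HM250171,THEFT,$500 AND UNDER\n"
)
compressed = io.BytesIO(lzma.compress(SAMPLE.encode("utf-8")))

def rows_of_type(file, primary_type):
    """Yield matching rows one at a time, never holding the whole file."""
    for row in csv.DictReader(file):
        if row["Primary Type"] == primary_type:
            yield row

# `lzma.open` accepts a file object as well as a filename.
with lzma.open(compressed, "rt", encoding="utf-8", newline="") as file:
    thefts = list(rows_of_type(file, "THEFT"))

print([row["Case Number"] for row in thefts])
```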


In [9]:
frame.to_file("chicago_all_theft")

In [10]:
with lzma.open(filename_all, "rt") as file:
    features = [ event for event in chicago.generate_GeoJSON_Features(file, type="all")
                if event["properties"]["crime"] == "BURGLARY" ]

frame = gpd.GeoDataFrame.from_features(features)
frame.crs = {"init":"EPSG:4326"} # Lon/Lat native coords
frame.head()

Unnamed: 0,address,case,crime,geometry,location,timestamp,type
0,049XX S MARSHFIELD AVE,HM246722,BURGLARY,POINT (-87.66607344800001 41.804251898),RESIDENCE,2006-03-22T13:00:00,FORCIBLE ENTRY
1,059XX W LAWRENCE AVE,HM250229,BURGLARY,POINT (-87.775482272 41.967676276),COMMERCIAL / BUSINESS OFFICE,2006-03-23T20:00:00,UNLAWFUL ENTRY
2,023XX W JACKSON BLVD,HM250645,BURGLARY,POINT (-87.684893829 41.877565688),APARTMENT,2006-03-24T10:00:00,FORCIBLE ENTRY
3,102XX S RACINE AVE,HL816910,BURGLARY,POINT (-87.65259514500001 41.708116723),RESIDENCE,2005-12-30T09:22:37,FORCIBLE ENTRY
4,001XX W 112TH ST,HM246476,BURGLARY,POINT (-87.62658922200001 41.690730834),RESIDENCE,2006-03-22T14:44:00,FORCIBLE ENTRY


In [11]:
frame.to_file("chicago_all_burglary")

In [12]:
frame["type"].unique()

array(['FORCIBLE ENTRY', 'UNLAWFUL ENTRY', 'ATTEMPT FORCIBLE ENTRY',
       'HOME INVASION'], dtype=object)

In [13]:
frame["location"].unique()

array(['RESIDENCE', 'COMMERCIAL / BUSINESS OFFICE', 'APARTMENT',
       'RESIDENCE-GARAGE', 'CONSTRUCTION SITE', 'SCHOOL, PUBLIC, BUILDING',
       'OTHER', 'DEPARTMENT STORE', 'APPLIANCE STORE',
       'SMALL RETAIL STORE', 'GROCERY FOOD STORE', 'CHA APARTMENT',
       'FACTORY/MANUFACTURING BUILDING', 'PARKING LOT/GARAGE(NON.RESID.)',
       'WAREHOUSE', 'MEDICAL/DENTAL OFFICE', 'RESTAURANT',
       'RESIDENCE PORCH/HALLWAY', 'VACANT LOT/LAND', 'BARBERSHOP',
       'GOVERNMENT BUILDING/PROPERTY', 'SCHOOL, PRIVATE, BUILDING',
       'HOTEL/MOTEL', 'GAS STATION', 'ALLEY', 'DRUG STORE', 'CAR WASH',
       'TAVERN/LIQUOR STORE', 'CLEANING STORE', 'PARK PROPERTY',
       'CHURCH/SYNAGOGUE/PLACE OF WORSHIP', 'SCHOOL, PUBLIC, GROUNDS',
       'MOVIE HOUSE/THEATER', 'CTA PLATFORM',
       'OTHER RAILROAD PROP / TRAIN DEPOT', 'HOSPITAL BUILDING/GROUNDS',
       'ABANDONED BUILDING', 'STREET', 'BAR OR TAVERN',
       'CONVENIENCE STORE', 'POLICE FACILITY/VEH PARKING LOT',
       'CURRENCY EXCH

Upon loading into QGIS to visualise, we find that the 2001 data seems to be geocoded in a different way...  The events are not on the road, and the distribution looks less artificial.  Let's extract all the 2001 data, and save it.

In [14]:
with lzma.open(filename_all, "rt") as file:
    features = [ event for event in chicago.generate_GeoJSON_Features(file, type="all")
                if event["properties"]["timestamp"].startswith("2001") ]

frame = gpd.GeoDataFrame.from_features(features)
frame.crs = {"init":"EPSG:4326"} # Lon/Lat native coords
frame.head()

Unnamed: 0,address,case,crime,geometry,location,timestamp,type
0,069XX W 64TH ST,HM376257,THEFT,POINT (-87.79442148699999 41.775545301),RESIDENCE,2001-05-15T12:00:00,FINANCIAL ID THEFT:$300 &UNDER
1,016XX N CENTRAL PARK AVE,HM243576,OFFENSE INVOLVING CHILDREN,POINT (-87.716721957 41.910820079),RESIDENCE,2001-01-01T00:01:00,AGG SEX ASSLT OF CHILD FAM MBR
2,048XX S KENWOOD AVE,HM381863,OFFENSE INVOLVING CHILDREN,POINT (-87.593706931 41.807274768),RESIDENCE,2001-01-01T00:00:00,SEX ASSLT OF CHILD BY FAM MBR
3,048XX N PAULINA ST,HM384838,THEFT,POINT (-87.67092233299999 41.970613961),RESIDENCE,2001-01-01T00:01:00,FINANCIAL ID THEFT: OVER $300
4,010XX E 73RD ST,HM388575,THEFT,POINT (-87.599210982 41.762408071),RESIDENCE,2001-03-31T00:00:00,FINANCIAL ID THEFT: OVER $300
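
Filtering on the string prefix works because the timestamps are ISO 8601 strings, which begin with the four-digit year.  As a small sanity check, the sketch below compares the prefix test with actually parsing the timestamp; the helper `in_year` is hypothetical, and the sample timestamps are taken from outputs in this notebook.

```python
from datetime import datetime

timestamps = [
    "2001-05-15T12:00:00",
    "2006-03-24T19:00:00",
    "2001-01-01T00:01:00",
    "2005-12-18T14:27:52",
]

def in_year(timestamp, year):
    """Parse an ISO 8601 timestamp and compare its year."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S").year == year

by_prefix = [t for t in timestamps if t.startswith("2001")]
by_parsing = [t for t in timestamps if in_year(t, 2001)]
print(by_prefix == by_parsing)
```

The prefix test is cheaper, which matters when scanning the whole dataset, but parsing is the safer choice if the timestamp format is not guaranteed.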


In [15]:
frame.to_file("chicago_2001")

# Explore rounding errors

We check the following:
- The X and Y COORDINATES fields (which, as we'll see in a different notebook, are the longitude / latitude coordinates projected to EPSG:3435, in feet) are always whole numbers.
- The longitude and latitude data contains at most 9 decimal places of accuracy.

In the other notebook, we look at map projections.  The data is most consistent with the longitude / latitude coordinates being the primary source, and the X/Y projected coordinates being computed and rounded to the nearest integer.

In [16]:
longs, lats = [], []
xcs, ycs = [], []

with open(filename, "rt") as file:
    reader = csv.reader(file)
    header = next(reader)
    print(header)
    for row in reader:
        if len(row[14]) > 0:
            # From the header: column 14 is LATITUDE and column 15 is LONGITUDE
            lats.append(row[14])
            longs.append(row[15])
            xcs.append(row[12])
            ycs.append(row[13])

['CASE#', 'DATE  OF OCCURRENCE', 'BLOCK', ' IUCR', ' PRIMARY DESCRIPTION', ' SECONDARY DESCRIPTION', ' LOCATION DESCRIPTION', 'ARREST', 'DOMESTIC', 'BEAT', 'WARD', 'FBI CD', 'X COORDINATE', 'Y COORDINATE', 'LATITUDE', 'LONGITUDE', 'LOCATION']


In [17]:
set(len(x) for x in lats), set(len(x) for x in longs)

({8, 9, 10, 11, 12}, {8, 9, 10, 11, 12, 13})

In [18]:
any('.' in x for x in xcs), any('.' in y for y in ycs)

(False, False)
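
The length check above counts the whole string, so a value such as `-87.716454159` has length 13: one sign, two integer digits, the decimal point and nine decimals.  A slightly more direct check counts the digits after the decimal point.  This is only a sketch on a few values copied from earlier outputs, with a hypothetical helper `decimal_places`; on the real data you would apply it to the lists built above.

```python
def decimal_places(text):
    """Number of digits after the decimal point in a numeric string."""
    _, _, fraction = text.partition(".")
    return len(fraction)

# Sample values copied from rows printed earlier in this notebook.
sample_lats = ["41.899712716", "41.783897141"]
sample_longs = ["-87.716454159", "-87.61504023"]
sample_xcs = ["1152189", "1180151"]

max_places = max(decimal_places(v) for v in sample_lats + sample_longs)
all_whole = all(decimal_places(v) == 0 for v in sample_xcs)
print(max_places, all_whole)
```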