# Prepare example data

Example datasets are provided to illustrate the tools contained here. This notebook outlines the pre-processing steps used to prepare them.

The data will be drawn from the [Crime Open Database (CODE)](https://osf.io/zyaqn/), maintained by Matt Ashby. This collates crime data from a number of open sources in a harmonised format. Snapshots of this data for several years were downloaded in CSV format.

The spatial data is provided as latitude/longitude in degrees; here the pyproj library is used to re-project the coordinates to a metric system, so that distances can be calculated in metres.
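A rough illustration of why degrees are awkward for distance calculations: the ground length of one degree of longitude shrinks with latitude, while one degree of latitude stays roughly constant. The sketch below assumes a spherical Earth with a mean radius of 6,371 km and a latitude of about 41.88° for Chicago; the helper `metres_per_degree` is hypothetical and not part of this notebook's tools.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation (assumption)


def metres_per_degree(lat_deg):
    """Return (metres per degree of latitude, metres per degree of longitude)."""
    per_deg_lat = math.pi * EARTH_RADIUS_M / 180
    # Circles of constant latitude shrink by cos(latitude) away from the equator.
    per_deg_lon = per_deg_lat * math.cos(math.radians(lat_deg))
    return per_deg_lat, per_deg_lon


lat_m, lon_m = metres_per_degree(41.88)  # approximate latitude of Chicago
print(f"1 degree of latitude  ~ {lat_m / 1000:.1f} km")
print(f"1 degree of longitude ~ {lon_m / 1000:.1f} km")
```

Because the two scales differ, a Euclidean distance computed directly on degree values would distort east-west separations relative to north-south ones.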

In [6]:
#%pip install pyproj


In [7]:
import pandas as pd
from pyproj import CRS, Transformer

For the test data, data from the city of **Chicago** will be used, for the offence category '**residential burglary/breaking & entering**'. Data is concatenated for 2014-2017, inclusive.

In [2]:
data14 = pd.read_csv("../data/crime_open_database_core_2014.csv", parse_dates=['date_single'])
data15 = pd.read_csv("../data/crime_open_database_core_2015.csv", parse_dates=['date_single'])
data16 = pd.read_csv("../data/crime_open_database_core_2016.csv", parse_dates=['date_single'])
data17 = pd.read_csv("../data/crime_open_database_core_2017.csv", parse_dates=['date_single'])
data = pd.concat([data14, data15, data16, data17], axis=0)
data = data[data['city_name'] == "Chicago"]
data = data[data['offense_type'] == "residential burglary/breaking & entering"]
data.shape

(45319, 14)
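The load, concatenate and filter steps above can also be written as a loop over the yearly files with a single combined mask. This is a minimal sketch, not the notebook's own code: the in-memory CSV strings in `yearly_csv` are hypothetical toy stand-ins for the yearly snapshots, so that the example runs without the data files.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for the yearly CSV snapshots (toy rows, not CODE data).
yearly_csv = {
    2014: (
        "city_name,offense_type,date_single\n"
        "Chicago,residential burglary/breaking & entering,2014-03-01 12:00\n"
        "Detroit,residential burglary/breaking & entering,2014-05-09 08:30\n"
    ),
    2015: (
        "city_name,offense_type,date_single\n"
        "Chicago,theft of motor vehicle,2015-02-11 23:15\n"
        "Chicago,residential burglary/breaking & entering,2015-07-20 02:45\n"
    ),
}

# Read every snapshot with the same options, then stack them into one frame.
frames = [
    pd.read_csv(io.StringIO(text), parse_dates=["date_single"])
    for text in yearly_csv.values()
]
combined = pd.concat(frames, axis=0, ignore_index=True)

# Apply both filters in a single boolean mask.
mask = (combined["city_name"] == "Chicago") & (
    combined["offense_type"] == "residential burglary/breaking & entering"
)
subset = combined[mask]
print(subset.shape)
```

With real files, the dictionary of strings would be replaced by a list of paths passed directly to `pd.read_csv`.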

In [27]:
data = pd.read_csv("/Users/cristinaalvarez/Documents/GitHub/NearRepeatVictimization/data/archivo.csv", parse_dates=['date_single'])
data.head()

Unnamed: 0,x,y,date_single
0,-99.11819,19.47576,2018-12-04
1,-99.11819,19.47576,2018-12-04
2,-99.12794,19.48969,2019-01-03
3,-99.07334,19.3167,2019-01-04
4,-99.02823,19.34434,2019-01-04


The concatenated and filtered Chicago data contains 45,319 incidents across the 4 years.

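The re-projection uses the [Illinois State Plane](http://www.spatialreference.org/ref/epsg/26971/) (EPSG:26971, NAD83 / Illinois East, in metres) as the target reference system. This system is designed for Illinois, so data from another region needs a projected system that covers that region instead.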

In [28]:
wgs84 = CRS.from_epsg(4326)   # WGS 84, geographic latitude/longitude
isp = CRS.from_epsg(26971)    # NAD83 / Illinois East, projected, metres
transformer = Transformer.from_crs(wgs84, isp)

In [29]:
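# EPSG:4326 takes (latitude, longitude), so y is passed before x; the result is (easting, northing)
x, y = transformer.transform(data["y"].values, data["x"].values)
data = data.assign(x=x, y=y)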
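The argument order in the transform call is easy to get wrong, because EPSG:4326 defines its axes as latitude then longitude. As a hedged sketch, pyproj's `always_xy=True` option can be used to pass longitude first instead. The coordinates below are approximate values for central Chicago, chosen only for illustration.

```python
from pyproj import Transformer

# Approximate coordinates of central Chicago (illustrative values).
lat, lon = 41.88, -87.63

# Default behaviour: EPSG:4326 input is given as (latitude, longitude).
default_order = Transformer.from_crs(4326, 26971)
x1, y1 = default_order.transform(lat, lon)

# With always_xy=True, input is given as (longitude, latitude).
xy_order = Transformer.from_crs(4326, 26971, always_xy=True)
x2, y2 = xy_order.transform(lon, lat)

print(x1, y1)
print(x2, y2)
```

Both calls should give the same easting and northing; the choice between them is a matter of which input order is less error-prone for the data at hand.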

In [30]:
data.head()

Unnamed: 0,x,y,date_single
0,-837526.259837,-1869145.0,2018-12-04
1,-837526.259837,-1869145.0,2018-12-04
2,-838464.76106,-1867516.0,2019-01-03
3,-833876.744475,-1887300.0,2019-01-04
4,-828876.474137,-1884499.0,2019-01-04


In [31]:
# Count missing values in each column
col_nan_count = data.isna().sum(axis=0)
print(col_nan_count)

x              318
y              318
date_single     32
dtype: int64


In [33]:
data.dropna(inplace=True)

In [34]:
# Confirm that no missing values remain in any column
col_nan_count = data.isna().sum(axis=0)
print(col_nan_count)

x              0
y              0
date_single    0
dtype: int64
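The check, drop and re-check pattern above can be tried on a small frame. The sketch below uses toy values and, as a variation on the cell above, returns a new frame via `dropna(subset=...)` rather than modifying the data in place, so that the number of dropped rows can be reported.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in both the coordinates and the date (illustrative values).
toy = pd.DataFrame(
    {
        "x": [1.0, np.nan, 3.0, 4.0],
        "y": [10.0, np.nan, 30.0, 40.0],
        "date_single": pd.to_datetime(["2019-01-03", "2019-01-04", None, "2019-01-05"]),
    }
)

# Missing values per column (axis=0 sums down the rows of each column).
missing_per_column = toy.isna().sum(axis=0)

# Keep only rows that are complete in the columns needed later.
clean = toy.dropna(subset=["x", "y", "date_single"]).reset_index(drop=True)

print(missing_per_column)
print(f"dropped {len(toy) - len(clean)} of {len(toy)} rows")
```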


Finally, save the derived data in minimal form, keeping only the projected coordinates and the date.

In [35]:
# Dates are written day-first (dd/mm/yyyy)
data.to_csv("/Users/cristinaalvarez/Documents/GitHub/NearRepeatVictimization/data/archivo_reproyeccion.csv",
            columns=['x','y','date_single'],
            date_format='%d/%m/%Y', index=False)
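Since the dates are saved day-first, reading the file back is safest with an explicit format, as a date such as 03/01/2019 is otherwise ambiguous. The following is a minimal round-trip sketch on toy values, writing to an in-memory string instead of the file path used above.

```python
import io

import pandas as pd

# Toy frame standing in for the derived data (illustrative values).
toy = pd.DataFrame(
    {
        "x": [100.5, 200.25],
        "y": [300.0, 400.75],
        "date_single": pd.to_datetime(["2019-01-03", "2018-12-04"]),
    }
)

# Write to an in-memory string; the options match the cell above.
csv_text = toy.to_csv(columns=["x", "y", "date_single"], date_format="%d/%m/%Y", index=False)

# Read back, stating the day-first format explicitly.
restored = pd.read_csv(io.StringIO(csv_text))
restored["date_single"] = pd.to_datetime(restored["date_single"], format="%d/%m/%Y")

print(csv_text)
```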