# Sources

1. Bowers, Johnson, Pease, "Prospective hot-spotting: The future of crime mapping?", Brit. J. Criminol. (2004) 44 641--658.  doi:10.1093/bjc/azh036

2. Johnson et al., "Prospective crime mapping in operational context", Home Office Online Report 19/07  [Police online library](http://library.college.police.uk/docs/hordsolr/rdsolr1907.pdf)

# Algorithm

### Grid the space

Divide the area of interest into a grid.  The grid is used both by the algorithm and for data visualisation.  There is some discussion about reasonable grid sizes: cells of 10m by 10m and of 50m by 50m have both been used.
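
As a minimal sketch of the gridding step, the following maps coordinates (assumed to be in metres) to integer cell indices and back to cell centres.  The names `cell_index` and `cell_centre`, and the offset parameters, are hypothetical helpers for illustration and are not part of `open_cp`.

```python
import numpy as np

def cell_index(x, y, cell_size=50.0, x_offset=0.0, y_offset=0.0):
    """Return integer (column, row) indices of the cells containing the points."""
    col = np.floor((np.asarray(x, dtype=float) - x_offset) / cell_size).astype(int)
    row = np.floor((np.asarray(y, dtype=float) - y_offset) / cell_size).astype(int)
    return col, row

def cell_centre(col, row, cell_size=50.0, x_offset=0.0, y_offset=0.0):
    """Return the (x, y) coordinates of the centre of cell (col, row)."""
    return (x_offset + (col + 0.5) * cell_size,
            y_offset + (row + 0.5) * cell_size)

# Two sample points on a 50m grid anchored at the origin.
col, row = cell_index([68.9, 91.1], [-18.7, -19.9], cell_size=50.0)
print(col, row)
print(cell_centre(1, -1))
```

Using `floor` rather than integer truncation keeps the indexing consistent for negative coordinates.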

### Aim of algorithm

We select "bandwidths": space/time regions of the data.  Common values are to look at events within 400m and within the last 2 months (8 weeks).  For each grid cell, for each event falling in this range, we compute a weighting for the event, and then sum all the weightings to produce an (un-normalised) "risk intensity" for that cell.

### Choice of weights

I believe the original paper (1) is unclear on this.  The discussion on page 9 shows a formula involving "complete 1/2 grid widths" but does not give details as to how, _exactly_, such a distance is to be computed.  The next paragraph gives a couple of examples which are also unclear, as they refer only to a "neighbouring cell".  No formula is given for the total weight, but we can infer it from the examples.

Let $t_i$ be the elapsed time between now and the event, and $d_i$ the distance of event $i$ from the centre of the grid cell we are interested in.  Then
$$ w = \sum_{t_i \leq t_\max, d_i \leq d_\max} \frac{1}{1+d_i} \frac{1}{1+t_i}. $$
For this to make sense, we introduce units:

   - $t_i$ is the number of whole weeks which have elapsed.  So if today is 20th March, and the event occurred on the 17th March, $t_i=0$.  If the event occurred on the 10th, $t_i=1$.
   - $d_i$ is the number of "whole 1/2 grid widths between the event" and the centre of the cell.  Again, this is slightly unclear, as an event occurring very near the edge of a cell would (thanks to Pythagoras) have $d_i=1$, while the example in the paper suggests always $d_i=0$ in this case.  We shall follow the precise definition.
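
Our reading of this weight can be sketched as follows.  This is an illustration under stated assumptions, not the implementation in `open_cp`: the function name `classic_weight` is hypothetical, distances are Euclidean distances in metres from the event to the cell centre, and the bandwidths `t_max` and `d_max` are given in the same units as $t_i$ and $d_i$ (so 8 weeks and, for a 50m grid, 16 half grid widths, which is 400m).

```python
import numpy as np

def classic_weight(dt_days, dist_m, cell_size=50.0, time_unit_days=7.0,
                   t_max=8, d_max=16):
    """Risk intensity of one cell from elapsed times and distances of events.

    dt_days: elapsed time of each event in days (assumed non-negative).
    dist_m:  distance of each event from the cell centre in metres.
    """
    dt_days = np.atleast_1d(np.asarray(dt_days, dtype=float))
    dist_m = np.atleast_1d(np.asarray(dist_m, dtype=float))
    t = np.floor(dt_days / time_unit_days)    # whole weeks elapsed
    d = np.floor(dist_m / (cell_size / 2))    # whole half grid widths
    in_range = (t <= t_max) & (d <= d_max)    # apply both bandwidths
    summands = 1.0 / ((1.0 + d) * (1.0 + t))
    return float(summands[in_range].sum())

# An event 3 days ago at 10m (t=0, d=0) and one 10 days ago at 30m (t=1, d=1).
print(classic_weight([3, 10], [10, 30]))
```

The first event contributes $1$ and the second $\frac12 \cdot \frac12$, giving a total of $1.25$.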
   
Paper (2) uses a different formula, and gives no examples:

$$ w = \sum_{t_i \leq t_\max, d_i \leq d_\max} \Big( 1 + \frac{1}{d_i} \Big) \frac{1}{t_i}. $$

where we are told only that:

   - $t_i$ is the elapsed time (but using $1/t_i$ suggests very large weights for events occurring very close to the time of analysis).
   - $d_i$ is the "number of cells" between the event and the cell in question.  The text also notes: "Thus, if a crime occurred within the cell under consideration, the distance would be zero (actually, for computational reasons 1) if it occurred within an adjacent cell, two, and so on."  What to do about diagonal cells is not specified: sensible choices might be that a cell diagonally offset from the cell of interest is either distance 2 or 3.  However, either choice would seem to introduce an anisotropic component which seems unjustified.
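
To make the diagonal ambiguity concrete, here is a sketch of the two candidate cell distances and of a single summand of the formula from paper (2).  Both function names are hypothetical, and the two distance conventions are our guesses rather than anything stated in the paper.  We also assume that $t_i$ is supplied as a strictly positive number, since the paper does not say how very recent events are handled.

```python
def cell_distance(dcol, drow, diagonal="chebyshev"):
    """Distance in 'number of cells', counted from 1 for the same cell.

    dcol, drow: column and row offsets between the event's cell and the
    cell of interest.
    diagonal="chebyshev": a diagonal neighbour is at distance 2.
    diagonal="manhattan": a diagonal neighbour is at distance 3.
    """
    dcol, drow = abs(int(dcol)), abs(int(drow))
    if diagonal == "chebyshev":
        steps = max(dcol, drow)
    elif diagonal == "manhattan":
        steps = dcol + drow
    else:
        raise ValueError("diagonal must be 'chebyshev' or 'manhattan'")
    return 1 + steps

def report_summand(d, t):
    """One summand (1 + 1/d) * (1/t); assumes d >= 1 and t > 0."""
    if d < 1 or t <= 0:
        raise ValueError("need d >= 1 and t > 0")
    return (1.0 + 1.0 / d) / t

# The same diagonal neighbour under the two conventions, with t = 1.
for rule in ("chebyshev", "manhattan"):
    d = cell_distance(1, 1, diagonal=rule)
    print(rule, d, report_summand(d, 1))
```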

It is not clear to me that (2) gives the temporal and spatial bandwidths used.

### Coupled units

Notice that both weight functions couple the "units" of time and space.  For example, if we halve the cell width used, then (roughly speaking) each $d_i$ will double, while the $t_i$ remain unchanged.  This results in the time component now having a larger influence on the weight.

   - It hence seems sensible that we scale both time and distance together.
   - If we run one test with a grid size of 50m and the time unit as 7 days, then another test could be a grid size of 25m, but also with the time unit as 3.5 days.
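
A small numerical illustration of this coupling, using the first weight formula.  The helper `classic_factors` is hypothetical, and the event (100m away, 14 days ago) is made up purely for the example.

```python
import math

def classic_factors(dist_m, dt_days, cell_size, time_unit_days):
    """Return the (spatial, temporal) factors 1/(1+d) and 1/(1+t) for one event."""
    d = math.floor(dist_m / (cell_size / 2))      # whole half grid widths
    t = math.floor(dt_days / time_unit_days)      # whole time units elapsed
    return 1.0 / (1 + d), 1.0 / (1 + t)

# (grid size in metres, time unit in days)
for cell_size, time_unit in [(50.0, 7.0), (25.0, 7.0), (25.0, 3.5)]:
    spatial, temporal = classic_factors(100.0, 14.0, cell_size, time_unit)
    print(cell_size, time_unit, spatial, temporal, spatial / temporal)
```

For this event, halving only the grid size changes the ratio of the spatial to the temporal factor from $0.6$ to $1/3$, while halving the time unit as well brings it back to $5/9$.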

### Variations

Paper (2) introduces a variation of this method:

> For the second set of results for the prospective method, for each cell, the crime that confers
the most risk is identified and the cell is assigned the risk intensity value generated by that
one point.

Again, this is not made _entirely_ clear, but I believe it means that we look at the sum above, and instead of actually computing the sum, we compute each _summand_ and then take the largest value to be the weight.
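
Under that interpretation, the only change is how the per-event summands are combined.  A minimal sketch, with the hypothetical name `cell_risk`, and assuming that a cell with no events in range has risk zero:

```python
import numpy as np

def cell_risk(summands, mode="sum"):
    """Combine the per-event weights of one cell.

    mode="sum": the usual risk intensity (sum over events).
    mode="max": our reading of the variation in paper (2), where only
                the single event conferring the most risk counts.
    """
    summands = np.asarray(summands, dtype=float)
    if summands.size == 0:
        return 0.0                      # assumption: no events means no risk
    if mode == "sum":
        return float(summands.sum())
    if mode == "max":
        return float(summands.max())
    raise ValueError("mode must be 'sum' or 'max'")

weights = [1.0, 0.25, 0.5]
print(cell_risk(weights), cell_risk(weights, mode="max"))
```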

### Generating predictions

The "risk intensity" for each grid cell is computed, and then displayed graphically as relative risk.  For example:

   - Visualise by plotting the top 1% of grid cells, top 5% and top 10% as different colours.  Paper (2) does this.
   - Visualise by generating a "heat map".  Paper (1) does this.
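
The "top 1%, 5%, 10%" style of display can be sketched by ranking the cells.  The function `coverage_bands` is a hypothetical helper; how ties and rounding should be handled is not specified in the papers, so here ties are broken by cell order and the number of cells in each band is rounded up.

```python
import numpy as np

def coverage_bands(risk, fractions=(0.01, 0.05, 0.10)):
    """Label each cell with the index of the smallest top-fraction band it is in.

    Cells outside every band get the label len(fractions).
    fractions must be in increasing order.
    """
    risk = np.asarray(risk, dtype=float)
    order = np.argsort(-risk, axis=None, kind="stable")   # highest risk first
    bands = np.full(risk.size, len(fractions), dtype=int)
    # Fill the widest band first so that narrower bands overwrite it.
    for label in range(len(fractions) - 1, -1, -1):
        count = int(np.ceil(fractions[label] * risk.size))
        bands[order[:count]] = label
    return bands.reshape(risk.shape)

risk = np.arange(100).reshape(10, 10)   # toy risk surface
bands = coverage_bands(risk)
print(bands)
```

The resulting integer array can then be passed to a plotting routine with a discrete colour map.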
   
When using the risk intensity to make predictions, there are two reasonable choices:

1. Compute the risk intensity for today, using all the data up until today.  Treat this as a risk profile for the next few days in time.
2. For each day into the future we wish to predict, recompute the risk intensity.

The difference between choices 1 and 2 is that events may change their time-based weighting (or even fall out of the temporal bandwidth completely).  For example, if today is the 20th March and an event occurred on the 14th, we consider it as occurring zero whole weeks ago, and so it contributes a weight of $1/1 = 1$ (in the 1st formula, for example).  However, if we recompute the risk for the 22nd March, this event is now one whole week in the past, and so the weight becomes $1/2$.
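
The change in the time-based weighting can be checked directly.  In this sketch the helper names are hypothetical, the year 2017 is chosen only to match the sample data used later, and the event is assumed to lie in the past.

```python
import datetime

def whole_weeks(event_time, now):
    """Number of whole weeks elapsed between the event and `now`."""
    return (now - event_time).days // 7

def time_weight(event_time, now):
    """Temporal factor 1/(1+t) from the first weight formula."""
    return 1.0 / (1 + whole_weeks(event_time, now))

event = datetime.datetime(2017, 3, 14)
for day in (20, 21, 22):
    now = datetime.datetime(2017, 3, day)
    print(now.date(), whole_weeks(event, now), time_weight(event, now))
```

Note that the weight already drops on the 21st, which is exactly seven days after the event.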

### Aliasing issues

This issue falls under what I term an "aliasing issue" which comes about as we are taking continuous data and making it discrete:

   - We lay down a grid, making space discrete, because we measure distance as some multiple of "whole grid width".
   - We measure time in terms of "whole weeks" but seem to make day level predictions.
   
It would appear, a priori, that changing the offset of the grid (e.g. moving the whole grid 10m north) could cause a lot of events to jump from one grid cell to another.
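
A quick experiment suggests how large this effect could be.  The data here are synthetic (uniformly random points, not real crime data), and `cell_index` is a hypothetical helper; we count how many points change row index when a 50m grid is moved 10m north.

```python
import numpy as np

def cell_index(x, y, cell_size, x_offset=0.0, y_offset=0.0):
    """Return integer (column, row) indices relative to a possibly shifted grid."""
    col = np.floor((np.asarray(x, dtype=float) - x_offset) / cell_size).astype(int)
    row = np.floor((np.asarray(y, dtype=float) - y_offset) / cell_size).astype(int)
    return col, row

rng = np.random.default_rng(0)
x = rng.uniform(0, 500, size=1000)      # synthetic events in a 500m square
y = rng.uniform(0, 500, size=1000)

_, row_before = cell_index(x, y, 50.0)
_, row_after = cell_index(x, y, 50.0, y_offset=10.0)   # grid moved 10m north

# Points within 10m of the southern edge of their old cell change row,
# and so are separated from the points they used to share a cell with.
moved = float(np.mean(row_before != row_after))
print(moved)
```

For uniformly distributed points we would expect roughly one fifth of the events to be regrouped, as 10m is one fifth of the cell height.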

### Implementation

We keep the grid for "prediction" purposes, but we allow a large range of "weights" to be plugged in, from various "guesses" as to what exactly the original studies used, to variations of our own making.

Note, however, that this is still ultimately a "discrete" algorithm.  We give a variation which generates a continuous kernel (and then bins the result for visualisation / comparison purposes) as a different prediction method.

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

import open_cp
import open_cp.prohotspot as phs

In [9]:
import datetime
times = [datetime.datetime(2017,3,10) + datetime.timedelta(days=np.random.randint(0,10)) for _ in range(20)]
times.sort()
xc = np.random.random(size=20) * 50 + 50
yc = np.random.random(size=20) * 10 - 20
points = open_cp.TimedPoints.from_coords(times, xc, yc)
points.coords

array([[ 68.99449305, -18.68248891],
       [ 91.10647407, -19.90010428],
       [ 85.30882659, -18.13107613],
       [ 84.96479956, -17.97288045],
       [ 68.88482081, -17.67102775],
       [ 77.51379752, -15.14940798],
       [ 89.90020311, -14.33244934],
       [ 53.64697538, -13.6744408 ],
       [ 83.3870943 , -13.76446312],
       [ 86.7713431 , -18.04446361],
       [ 86.92105554, -18.12153375],
       [ 97.79933395, -18.79188021],
       [ 95.67031429, -11.19746285],
       [ 71.6308542 , -16.65951133],
       [ 83.93597405, -13.18928524],
       [ 76.63783453, -14.4761322 ],
       [ 93.72336824, -17.41285243],
       [ 68.55555927, -16.73044268],
       [ 69.97297061, -17.26866978],
       [ 58.37996541, -13.96590989]])

In [10]:
for p in points.coords:
    print(p)

[ 68.99449305 -18.68248891]
[ 91.10647407 -19.90010428]
[ 85.30882659 -18.13107613]
[ 84.96479956 -17.97288045]
[ 68.88482081 -17.67102775]
[ 77.51379752 -15.14940798]
[ 89.90020311 -14.33244934]
[ 53.64697538 -13.6744408 ]
[ 83.3870943  -13.76446312]
[ 86.7713431  -18.04446361]
[ 86.92105554 -18.12153375]
[ 97.79933395 -18.79188021]
[ 95.67031429 -11.19746285]
[ 71.6308542  -16.65951133]
[ 83.93597405 -13.18928524]
[ 76.63783453 -14.4761322 ]
[ 93.72336824 -17.41285243]
[ 68.55555927 -16.73044268]
[ 69.97297061 -17.26866978]
[ 58.37996541 -13.96590989]


In [12]:
points.coords[:,1]

array([-18.68248891, -19.90010428, -18.13107613, -17.97288045,
       -17.67102775, -15.14940798, -14.33244934, -13.6744408 ,
       -13.76446312, -18.04446361, -18.12153375, -18.79188021,
       -11.19746285, -16.65951133, -13.18928524, -14.4761322 ,
       -17.41285243, -16.73044268, -17.26866978, -13.96590989])