In [1]:
# Allow us to load `open_cp` without installing
import sys, os.path
sys.path.insert(0, os.path.abspath(".."))

# Comparison with SaTScan

Having discovered further trouble replicating the results of SaTScan, we introduce some more support for reading and writing SaTScan files, and test various corner cases.

The class `AbstractSTScan` works with "generic time" (so just numbers, now interpretted as some time unit _before_ an epoch time).  This allows us to concentrate on the details.  We also introduce a more complicated rule about cases when the boundary of a disc contains more than one point (see below.

## Setup

In [2]:
import open_cp.stscan, open_cp.stscan2
import numpy as np

In [3]:
def make_random_data():
    times = np.floor(np.random.random(size=100) * 200)
    times.sort()
    times = np.flipud(times)
    coords = np.random.random(size=(2,100)) * 100
    return coords, times

def build_ab_scan(coords, times):
    ab_scan = open_cp.stscan2.AbstractSTScan(coords, times)
    ab_scan.geographic_radius_limit = 1000
    ab_scan.geographic_population_limit = 0.5
    ab_scan.time_max_interval = 200
    ab_scan.time_population_limit = 0.5
    return ab_scan

def build_trainer(coords, times):
    """Convert to days before 2017-04-01 and use `STSTrainer`."""
    timestamps = (np.timedelta64(1,"D") / np.timedelta64(1,"s")) * times * np.timedelta64(1,"s")
    timestamps = np.datetime64("2017-04-01T00:00") - timestamps
    data = open_cp.data.TimedPoints(timestamps, coords)

    trainer = open_cp.stscan.STSTrainer()
    trainer.data = data
    trainer.time_max_interval = np.timedelta64(200,"D")
    trainer.time_population_limit = 0.5
    trainer.geographic_population_limit = 0.5
    trainer.geographic_radius_limit = 1000
    return trainer

# Comparison

We find that _most_ of the time, we obtain the same clusters.  But sometimes we don't.  This is down to:

- Non-deterministic ordering.  If we compare things in different orders, we can break ties in different ways.
- As the discs are always centred on events, it is possible for different discs to contain the same events.  As we generate further clusters by finding the next most significant cluster which is _disjoint_ for current clusters, if we again process things in a different order, then we can obtain different disks.

From this point of view, obtaining perfect agreement with SaTScan seems an almost hopeless ideal!

In [4]:
coords, times = make_random_data()
ab_scan = build_ab_scan(coords, times)
all_clusters = list(ab_scan.find_all_clusters())
for c in all_clusters:
    print(c.centre, c.radius, c.time, c.statistic)

[ 47.7583294   79.80132806] 21.1470597081 9.0 3.8239455272
[ 12.96456247  32.47863865] 6.77699923599 32.0 1.89781609396
[ 49.09593276  18.68854038] 4.65419610183 66.0 1.33652984035
[ 93.94502034  64.27771926] 22.8324256358 15.0 1.2377140172
[ 80.13923462  16.13742495] 10.3301961977 82.0 1.01886049118
[ 55.9165906   40.56776161] 9.68113378677 78.0 0.877923733363
[  2.55182235  62.80109178] 25.1985193203 85.0 0.757036668024
[ 29.97718124   4.63486825] 10.8741986427 76.0 0.743358791298


In [5]:
trainer = build_trainer(coords, times)
result = trainer.predict(time=np.datetime64("2017-04-01T00:00"))
for c, t, s in zip(result.clusters, result.time_ranges, result.statistics):
    assert np.datetime64("2017-04-01T00:00") == t[1]
    t = (np.datetime64("2017-04-01T00:00") - t[0]) / np.timedelta64(1,"D")
    print(c, t, s)

Cluster(centre=array([ 47.7583294 ,  79.80132806]), radius=21.147271178651728) 9.0 3.8239455272
Cluster(centre=array([ 10.32565546,  38.72074474]), radius=6.7770670059800882) 32.0 1.89781609396
Cluster(centre=array([ 49.09593276,  18.68854038]), radius=4.6542426437907478) 66.0 1.33652984035
Cluster(centre=array([ 93.94502034,  64.27771926]), radius=22.832653960059005) 15.0 1.2377140172
Cluster(centre=array([ 82.69369159,   6.78297889]), radius=9.8332259659759718) 82.0 1.01886049118
Cluster(centre=array([ 55.9165906 ,  40.56776161]), radius=9.6812305981065947) 78.0 0.877923733363
Cluster(centre=array([ 29.97718124,   4.63486825]), radius=10.874307384696504) 76.0 0.743358791298
Cluster(centre=array([ 12.74989957,  59.64349518]), radius=10.67583601591971) 85.0 0.124080354492
Cluster(centre=array([  3.78541391,  92.72426815]), radius=21.246367657690392) 67.0 0.115179159952


## Timings

The newer code in `AbstractSTScan` is a bit quicker.

In [6]:
%timeit( list(ab_scan.find_all_clusters()) )

1 loop, best of 3: 2.48 s per loop


In [7]:
%timeit( trainer.predict() )

1 loop, best of 3: 2.5 s per loop


## Optionally save

We can write the data out in SaTScan format for comparison purposes.  Be sure to adjust Advanced Analysis options in SaTScan to reflect the settings we used above (no limit of size of clusters, but a population limit of 50% for both space and time).~

In [8]:
#ab_scan.to_satscan("satscan_test2", 1000)

# Grided data

Where we have found quite different behaviour from SaTScan is in "boundary" behaviour.  Consider the case when a disk's boundary (it's circumference) contains more than one event.  The `STSTrainer` class always considers all events inside or on the edge of the disk.  But SaTScan will _sometimes_ consider events inside the disc, and then only _some_ of the events on the boundary.

Notice in particular that we can expect this to happen a lot if the input data is on a regular grid.

We try to replicate this behaviour in `AbstractSTScan` by considering all possibilities of events on the boundary being counted or not.  Unfortunately, we then seem to beat SaTScan at its own game, and consider too many subsets, resulting in finding clusters which SaTScan does not.

Using the Chicago data, conformed to a grid, we found an example where SaTScan seems to perform this "boundary behaviour".  We have not been able to replicate it with random data.

## Generate example random data

We use the grid abilities of `STSTrainer`.

In [9]:
def trainer_to_data(trainer):
    coords = trainer.data.coords
    times = (np.datetime64("2017-04-01T00:00") - trainer.data.timestamps) / np.timedelta64(1,"s")
    times /= (np.timedelta64(1,"D") / np.timedelta64(1,"s"))
    times = np.floor(times)
    
    return coords, times

np.testing.assert_array_almost_equal(trainer_to_data(trainer)[0], coords)
np.testing.assert_array_almost_equal(trainer_to_data(trainer)[1], times)

In [33]:
trainer = build_trainer(*make_random_data())
region = open_cp.RectangularRegion(xmin=0, ymin=0, xmax=100, ymax=100)
ab_scan = build_ab_scan( *trainer_to_data( trainer.grid_coords(region, grid_size=20) ) )

In [34]:
all_clusters = list(ab_scan.find_all_clusters())
for c in all_clusters:
    print(c.centre, c.radius, c.time, c.statistic)

[ 30.  30.] 20.0 7.0 3.35205338619
[ 50.  50.] 0.0 38.0 2.71560614559
[ 90.  70.] 0.0 29.0 1.09086496593
[ 70.  90.] 0.0 27.0 0.639868972963
[ 10.  90.] 0.0 60.0 0.43341798412
[ 90.  90.] 0.0 84.0 0.367754306996
[ 90.  10.] 0.0 42.0 0.316663808549
[ 10.  10.] 0.0 98.0 0.221514808107
[ 70.  10.] 0.0 93.0 0.166852706468
[ 30.  70.] 0.0 63.0 0.133405703443
[ 30.  90.] 0.0 97.0 0.0871975764851
[ 10.  50.] 0.0 34.0 0.0871975764851
[ 50.  70.] 0.0 64.0 0.0173811215268
[ 50.  10.] 0.0 87.0 0.00689368813393


In [35]:
ab_scan.to_satscan("satscan_test1", 1000)

## Reload some data

Here's one we prepared earlier.  It shows a case where our aggressive algorithm finds a cluster which SaTScan does not.

In [13]:
def find_satscan_ids_for_mask(in_disc, time):
    in_disc &= ab_scan.timestamps <= time
    in_disc = set( (x,y) for x,y in ab_scan.coords[:,in_disc].T )
    return [i for i in satscan_data.geo if satscan_data.geo[i] in in_disc]

def find_mask(centre, radius):
    return np.sum((ab_scan.coords - np.array(centre)[:,None])**2, axis=0) <= radius**2

def to_our_indexes(sat_scan_indexes):
    out = set()
    for i in sat_scan_indexes:
        x, y = satscan_data.geo[i]
        m = (ab_scan.coords[0] == x) & (ab_scan.coords[1] == y)
        for j in np.arange(ab_scan.coords.shape[1])[m]:
            out.add(j)
    return out

In [15]:
satscan_data = open_cp.stscan2.SaTScanData("satscan_test3", 1000)
ab_scan = build_ab_scan( *satscan_data.to_coords_time() )

all_clusters = list(ab_scan.find_all_clusters())
for c in all_clusters:
    print(c.centre, c.radius, c.time, c.statistic)

[ 50.  30.] 20.0 45 1.78403489846
[ 30.  70.] 20.0 13 1.2377140172
[ 90.  70.] 0.0 20 0.719563298144
[ 10.  30.] 0.0 70 0.532363441331
[ 70.  10.] 0.0 55 0.253033910799
[ 10.  90.] 0.0 42 0.124080354492
[ 70.  70.] 0.0 97 0.0766353331714
[ 90.  90.] 0.0 97 0.0766353331714
[ 10.  50.] 0.0 46 0.0173811215268
[ 90.  30.] 0.0 91 0.0109248357106


In [16]:
# Cluster which SaTScan finds -- In this case, seemingly SaTScan includes all events
in_disc = find_mask([30,30], 20)
find_satscan_ids_for_mask(in_disc, 70)

[6, 11, 21, 22]

In [17]:
# Our cluster-- all events in or on the disc
in_disc = find_mask([50,30], 20)
find_satscan_ids_for_mask(in_disc, 45)

[2, 7, 9, 11, 21]

In [18]:
# The subset of events our algorithm chooses to use
in_disc = all_clusters[0].mask
find_satscan_ids_for_mask(in_disc, 45)

[2, 9, 11, 21]

In [105]:
def make_time_mask(coords, time):
    ab_scan = build_ab_scan(coords, times)
    un_times, cutoff = ab_scan.build_times_cutoff()
    time_mask = times[:,None] <= un_times[None,:]
    time_counts = np.sum(time_mask, axis=0)
    return time_mask, time_counts

def clusters_around(time_mask, time_counts, centre):
    distsq = np.sum( (coords - pt[:,None])**2, axis=0 )
    unique_dists = np.unique(distsq)
    mask = distsq[:,None] <= unique_dists[None,:]
    
    # Clamp
    temp_counts = np.sum(mask, axis=0)
    mask = mask[:,(temp_counts > 1) & (temp_counts <= 50)]
    
    space_counts = np.sum(mask, axis=0)
    uber_mask = mask[:,:,None] & time_mask[:,None,:]
    actual_counts = np.sum(uber_mask, axis=0)
    expected = (space_counts[:,None] * time_counts[None,:]) / 100
    want = (actual_counts > 1) & (actual_counts > expected)
    
    cluster_masks = uber_mask[:,want]
    return cluster_masks, actual_counts[want], expected[want]

def best(coords, times):
    # Not working...
    ab_scan = build_ab_scan(coords, times)
    time_mask, time_counts = make_time_mask(coords, times)
    results = []
    for centre in coords.T:
        cluster_masks, actual, expected = clusters_around(time_mask, time_counts, centre)
        stats = ab_scan._statistic(actual, expected, len(times))
        for c, s in zip(cluster_masks.T, stats):
            results.append((c,s))
    results.sort(key = lambda p : p[1])
    return results[-1]

In [106]:
coords, times = make_random_data()

In [108]:
ab_scan = build_ab_scan(coords, times)
all_choices = list(ab_scan.score_clusters())

In [125]:
time_mask, time_counts = make_time_mask(coords, times)
cluster_masks, actual, expected = clusters_around(time_mask, time_counts, coords[:,0])
stats = ab_scan._statistic(actual, expected, len(times))

for mask, stat in zip(cluster_masks.T, stats):
    for c in all_choices:
        m, s = c[0].mask, c[2]
        if np.all(m == mask):
            print(s, stat)

True

In [103]:
best(coords, times)

(array([False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
         True, False,  True, False, False,  True, False, False, False,
         True, False, False, False, False, False, False, False, False,
        False, False, False, False,  True, False, False, False, False, False], dtype=bool),
 1.2941428696750181)

In [104]:
ab_scan = build_ab_scan(coords, times)
list(ab_scan.find_all_clusters())[0]

Result(centre=array([ 93.47533489,  62.56407798]), radius=14.776960605640376, mask=array([False,  True,  True, False, False, False,  True, False,  True,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False,  True,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False, False, False, False, False], dtype=bool), time=8.0, statistic=4.6108152675742087)

In [18]:
pt = coords[:,0]
distsq = np.sum( (coords - pt[:,None])**2, axis=0 )
unique_dists = np.unique(distsq)
mask = distsq[:,None] <= unique_dists[None,:]
# mask[:,i] = mask for distance i

In [39]:
# Clamp to disks with <= 50% of population
mask = mask[:,np.sum(mask, axis=0) <= 50]
space_counts = np.sum(mask, axis=0)

In [29]:
ab_scan = build_ab_scan(coords, times)
un_times, cutoff = ab_scan.build_times_cutoff()

In [40]:
time_mask = times[:,None] <= un_times[None,:]
time_counts = np.sum(time_mask, axis=0)

In [43]:
uber_mask = mask[:,:,None] & time_mask[:,None,:]
actual_counts = np.sum(uber_mask, axis=0)

In [48]:
expected = (space_counts[:,None] * time_counts[None,:]) / 100
want = (actual_counts > 1) & (actual_counts > expected)

In [56]:
# Masks for clusters
print(uber_mask[:,want].shape)
# actual counts
print(actual_counts[want].shape)
# expected counts
print(expected[want].shape)

(100, 1578)
(1578,)
(1578,)


In [60]:
ab_scan._statistic(actual_counts[want], expected[want], 100)

array([ 1.34589998,  1.01785138,  0.84643009, ...,  0.4233791 ,
        0.52691134,  0.6401457 ])

array([ 199.,  199.,  198.,  197.,  192.,  191.,  191.,  185.,  183.,
        181.,  179.,  173.,  171.,  170.,  168.,  167.,  166.,  166.,
        165.,  163.,  158.,  155.,  153.,  150.,  147.,  147.,  142.,
        142.,  139.,  139.,  137.,  132.,  128.,  126.,  123.,  122.,
        121.,  121.,  119.,  118.,  117.,  116.,  115.,  112.,  111.,
        109.,  107.,  106.,  104.,  102.,   99.,   97.,   95.,   93.,
         93.,   89.,   89.,   86.,   86.,   85.,   80.,   79.,   77.,
         68.,   65.,   60.,   60.,   60.,   59.,   56.,   56.,   53.,
         52.,   51.,   49.,   45.,   44.,   41.,   40.,   39.,   37.,
         35.,   33.,   32.,   32.,   23.,   23.,   23.,   22.,   16.,
         15.,   13.,   10.,    8.,    7.,    5.,    4.,    3.,    2.,    0.])