<h1>K-Nearest Neighbour Clustering Of Massacres For The Identification Of Australian Wars</h1>

(c) Bill Pascoe and Kaine Usher, 2025

This notebook applies the k-nearest neighbour clustering method to data from the <i>Colonial Frontier Massacres in Australia, 1788-1930</i> project (Ryan et al, 2025) to help identify Australian Wars.

<b>To run this notebook and see the map at the end, press the button with two little triangles that looks like a 'fast forward' button.</b>

For important information on how to understand this notebook, see the Introduction <a href="AWR_Introduction.ipynb">AWR_Introduction.ipynb</a>.


<h3>Parameter Selection</h3>

The most informative clusters of massacres emerge when k is set somewhere between 2 and 6. You can change the value of k in the cell below, for example to k = 3. Then run the notebook again by pressing the two little triangles button above.

In [None]:
# Enter file path of dataset:
file_path = "CMassacres_TLCM_20250314.csv"

# Enter number of nearest neighbours:
k = 2

<h3>STKNN Clustering/Aggregation Code</h3>

The code cells in this section cluster and aggregate the data with STKNN (the <code>STKNearestNeighbours</code> aggregator from geojikuu), based on the value of k you set above. You do not need to change anything; simply run them as they are.
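
As a rough intuition for what k controls, the toy sketch here links every point to its k nearest neighbours in a combined space-time distance and treats each connected group as a cluster. It is only an illustration and is not part of the notebook's pipeline, nor is it geojikuu's implementation: the function <code>knn_clusters</code>, the plain Euclidean distance over (x, y, t) and the made-up coordinates are all assumptions, and <code>STKNearestNeighbours</code> may weight space and time or form groups differently.

```python
import math

def knn_clusters(points, k):
    """Hypothetical helper: label points by linking each to its k nearest neighbours.

    points: list of (x, y, t) tuples, assumed to be in comparable units already.
    Returns one integer cluster label per point.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        # Rank all other points by Euclidean distance in (x, y, t) space.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        # Merge this point's cluster with those of its k nearest neighbours.
        for j in others[:k]:
            parent[find(i)] = find(j)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]

# Made-up data: two tight groups far apart in both space and time.
toy = [(0, 0, 0), (1, 0, 1), (0, 1, 2),
       (100, 100, 50), (101, 100, 51), (100, 101, 52)]

print(knn_clusters(toy, 2))  # [0, 0, 0, 1, 1, 1]
print(knn_clusters(toy, 3))  # [0, 0, 0, 0, 0, 0]
```

In this toy data each group has three points, so k = 2 keeps the two groups apart, while k = 3 forces every point to reach into the other group and everything merges into a single cluster.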

In [None]:
import pandas as pd
df_initial = pd.read_csv(file_path)

df = df_initial.filter(["ghap_id", "title", "description", "latitude", "longitude", "datestart", "dateend", "linkback", "Victims", "VictimsDead", "Attackers", "AttackersDead", "MassacreGroup"], axis=1)
df["ghap_id"] = df["ghap_id"].astype(str)

from geojikuu.preprocessing.projection import MGA2020Projector
mga_2020_projector = MGA2020Projector("wgs84")
results = mga_2020_projector.project(list(zip(df["latitude"], df["longitude"])))
df["mga_2020"] = results["mga2020_coordinates"]
unit_conversion = results["unit_conversion"]

from geojikuu.preprocessing.conversion_tools import DateConvertor
date_convertor = DateConvertor(date_format_in="%Y-%m-%d", date_format_out="%Y-%m-%d")
df['date_converted'] = df['datestart'].apply(date_convertor.date_to_days)

from geojikuu.aggregation.point_aggregators import STKNearestNeighbours
st_knn = STKNearestNeighbours(data=df, coordinate_label="mga_2020", time_label="date_converted")
results = st_knn.aggregate(k=k, aggregate_type="mean")

results[["earliest_date", "latest_date"]] = results["temporal_extent"].str.replace('[()]', '', regex=True).str.split(',', expand=True).astype(int)
results["earliest_date"] = results['earliest_date'].apply(date_convertor.days_to_date)
results["latest_date"] = results['latest_date'].apply(date_convertor.days_to_date)
results["temporal_midpoint"] = results['date_converted'].apply(date_convertor.days_to_date)



In [None]:


results["spatial_midpoint"] = mga_2020_projector.inverse_project(results["midpoint"])
results[["lat_mid", "lon_mid"]] = results["spatial_midpoint"].astype(str).str.replace('[()]', '', regex=True).str.split(',', expand=True).astype(float)
results["mbr"] = results['mbr'] * unit_conversion

results = results.drop(["latitude", "longitude", "date_converted", "midpoint", "temporal_extent"], axis=1)

<h3>Output</h3>

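The results are written to CSV files for download and further processing.
The output files are saved in the same directory as this notebook, with the value of k appended to the file name (for example, stknn_clusters_2.csv when k = 2).
The first few lines of the data are shown on screen.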
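
As one possible example of further processing, the sketch below builds a lookup from site ID to cluster from text shaped like the cluster output file. It assumes the layout implied by this notebook's code: the cluster index written by <code>to_csv</code> under an empty column header, and a <code>ghap_id</code> column holding the member site IDs joined by ", ". The sample rows and IDs are made up, and the real file has more columns.

```python
import csv
import io

# Made-up stand-in for the contents of stknn_clusters_<k>.csv.
sample = io.StringIO(
    ',ghap_id,count\n'
    '0,"t1, t2, t3",3\n'
    '1,"t4, t5",2\n'
)

site_to_cluster = {}
for row in csv.DictReader(sample):
    cluster_id = int(row[""])  # assumed: index column has an empty header
    # Split the comma-separated site IDs and record each one's cluster.
    for site_id in row["ghap_id"].split(", "):
        site_to_cluster[site_id] = cluster_id

print(site_to_cluster)  # {'t1': 0, 't2': 0, 't3': 0, 't4': 1, 't5': 1}
```

To use it on a real output file, replace <code>sample</code> with an opened file handle.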

<h4>stknn_clusters.csv output</h4>

In [None]:
results.to_csv('stknn_clusters_' + str(k) + '.csv')
results.head()

In [None]:

import geopandas

def getConvexHull(id, polygononly):
    # Select the sites in df_initial assigned to cluster `id` and return their convex hull for the cluster summary.
    cluster = df_initial[df_initial["assigned_cluster"] == id]

    # Temporarily use geopandas to create a 'geometry' from the coordinates in this cluster so we can take its convex hull.
    gdf = geopandas.GeoDataFrame(
        cluster, geometry=geopandas.points_from_xy(cluster.longitude, cluster.latitude), crs="EPSG:4326"
    )
    chull = gdf.geometry.union_all().convex_hull

    # Three or more sites can still be collinear or co-located, in which case the hull is a
    # line or a point with no exterior ring to draw, so only true polygons are returned.
    if len(cluster.index) > 2 and polygononly and chull.geom_type == "Polygon":
        print("Cluster " + str(id) + " has " + str(len(cluster.index)) + " sites.")
        return chull
    else:
        return None

<h4>colfront_stknn_labelled.csv output</h4>

In [None]:
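def find_index(id):
    # Return the index of the results row (the cluster) whose comma-separated ghap_id list contains this site, or None.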
    for idx, ids in results['ghap_id'].items():
        id_list = ids.split(', ')
        if str(id) in id_list:
            return idx
    return None

df_initial['assigned_cluster'] = df_initial['ghap_id'].apply(find_index)


# preparing cluster summary and polygon
polygononly = True
clusterSummary = results.filter(["ghap_id", "title", "datestart", "dateend", "linkback", "Victims", "VictimsDead", "Attackers", "AttackersDead", "count", "mbr", "earliest_date", "latest_date", "temporal_midpoint", "spatial_midpoint", "lat_mid", "lon_mid"], axis=1)
clusterSummary['cluster_id'] = clusterSummary.index
clusterSummary['convex_hull'] = clusterSummary['cluster_id'].apply(getConvexHull, args = (polygononly,))

clusterSummary = clusterSummary[clusterSummary['convex_hull'].notnull()]

df_initial.to_csv('colfront_stknn_labelled_' + str(k) + '.csv')
df_initial.head()

<h3>Visualisation</h3>

In [None]:
import random
import folium

def flipLatLng(ll):
    # Swap a (lon, lat) pair into the (lat, lon) order that folium expects.
    return (ll[1], ll[0])

map_center = [df_initial['latitude'].mean(), df_initial['longitude'].mean()]
mapc = folium.Map(location=map_center, zoom_start=4)

folium.TileLayer(
    tiles = 'https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{z}/{y}/{x}',
    attr = 'Esri',
    name = 'Esri Satellite',
    overlay = False,
    control = True
    ).add_to(mapc)

def random_color():
    return "#" + ''.join([random.choice('0123456789ABCDEF') for _ in range(6)])

cluster_colors = {cluster: random_color() for cluster in df_initial['assigned_cluster'].unique()}


# Add polygons
fillpolygon = False
if fillpolygon:
    popacity = 0.4
else:
    popacity = 0

for _, row in clusterSummary.iterrows():
    
    # geopandas, shapely etc. generate coordinates in (lon, lat) order, the opposite of what folium and Leaflet assume, so we have to flip them
    locpoly = list(map(flipLatLng, list(row["convex_hull"].exterior.coords)))
    
    folium.Polygon(
        locations=locpoly,
        color=cluster_colors[row['cluster_id']],
        weight=12,
        opacity=0.2,
        line_join='round',
        fill_color=cluster_colors[row['cluster_id']],
        fill_opacity=popacity,
        fill=True,
        popup=f"<b>Cluster:</b> {row['cluster_id']}<br><br>"
              f"<b>Count:</b> {row['count']}<br><br>"
              f"<b>MBR:</b> {row['mbr']}<br><br>"
              f"<b>Earliest massacre:</b> {row['earliest_date']}<br><br>"
              f"<b>Latest massacre:</b> {row['latest_date']}<br><br>"
              f"<b>Temporal Midpoint:</b> {row['temporal_midpoint']}<br><br>"
              f"<b>Spatial Midpoint:</b> {row['spatial_midpoint']}<br><br>",
        tooltip="Cluster details",
    ).add_to(mapc)

# add points
for _, row in df_initial.iterrows():
    folium.CircleMarker(
        location=(row['latitude'], row['longitude']),
        radius=5,
        color=cluster_colors[row['assigned_cluster']],
        fill=True,
        fill_color=cluster_colors[row['assigned_cluster']],
        fill_opacity=1,
        popup=f"<b>Site:</b> {row['title']}<br><br>"
                  f"<b>Lat:</b> {row['latitude']}<br><br>"
                  f"<b>Lon:</b> {row['longitude']}<br><br>"
                  f"<b>Date:</b> {row['datestart']}<br><br>"
                  f"<b>Victims Dead:</b> {row['VictimsDead']}<br><br>"
                  f"<b>Attackers Dead:</b> {row['AttackersDead']}<br><br>"
                  f"<b>Assigned Cluster:</b> {row['assigned_cluster']}<br>"
                  f"<b>Link:</b> <a href='{row['linkback']}' target='_blank'>{row['linkback']}</a><br>"
        ).add_to(mapc)
mapc