In [1]:
import json
import pandas as pd
from sklearn.cluster import KMeans

In [2]:
df = pd.read_csv("NYPD_Motor_Vehicle_Collisions_reduced_data.csv", low_memory=False)

First, we need to keep only those collisions where someone was injured or killed. For each row of the dataframe, we sum all columns indicating casualties; if the sum is greater than 0, we keep that observation.

In [3]:
y = df[['NUMBER OF PERSONS INJURED',
       'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED',
       'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED',
       'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED',
       'NUMBER OF MOTORIST KILLED']].sum(axis=1)

In [4]:
y = [1 if x > 0 else 0 for x in y]

Add that column to the dataframe.

In [5]:
df['Y'] = y
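
The same flag can also be computed without a Python-level loop. The following is a minimal sketch on a toy frame, not the notebook's own code: the two-column subset, the made-up values, and the names `toy`, `casualty_cols`, and `flags` are all assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the collisions data; the values are made up for illustration.
toy = pd.DataFrame({
    "NUMBER OF PERSONS INJURED": [0, 2, 0],
    "NUMBER OF PERSONS KILLED": [0, 0, 1],
})

# Hypothetical name: the casualty columns to sum over.
casualty_cols = ["NUMBER OF PERSONS INJURED", "NUMBER OF PERSONS KILLED"]

# Row-wise sum, compare with zero, and cast the boolean result to 0/1.
toy["Y"] = (toy[casualty_cols].sum(axis=1) > 0).astype(int)
flags = toy["Y"].tolist()
print(flags)
```

Comparing the whole column at once keeps the result a pandas Series, so it can be assigned back to the frame directly.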

Now we make a new dataframe containing only those observations that involve a casualty, keeping just the coordinates.

In [6]:
cas_df = df[df['Y'] == 1][['LATITUDE', 'LONGITUDE']]
cas_df.shape

(188340, 2)

In [7]:
cas_df = cas_df.dropna()
cas_df.shape

(151452, 2)

Since we are loading the data into a website, we sample only a portion of it. This saves memory and makes the visualization run faster. In addition, cluster analysis indicated that motor vehicle accidents in NYC are unlikely to form well-separated clusters. The website visualization is therefore more a demonstration of the implementation than evidence of any inherent clustering of accidents.

In [8]:
cas_df = cas_df.sample(n=5000)
cas_df.shape

(5000, 2)
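
The text does not say which cluster analysis was used. One common way to judge separability is the silhouette score, sketched below on synthetic, uniformly scattered points. The synthetic data, the bounding box, and the choice of `silhouette_score` are assumptions, not the analysis behind the statement above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic, uniformly scattered points standing in for accident coordinates.
rng = np.random.default_rng(0)
points = rng.uniform(low=[40.5, -74.2], high=[40.9, -73.7], size=(500, 2))

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    # Silhouette ranges from -1 to 1; values near 1 indicate well-separated clusters.
    scores[k] = silhouette_score(points, labels)

print(scores)
```

If no value of k stands out with a clearly higher score, that is a hint the data has no natural grouping.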

Rename the columns to short names to save memory; each name is repeated as a key for every point in the output.

In [9]:
cas_df = cas_df.rename(columns={'LATITUDE': 'Y', 'LONGITUDE': 'X'})
cas_df.head()

                Y          X
526630  40.674138 -73.839023
905621  40.753482 -73.980889
453926  40.692356 -73.911053
838119  40.895468 -73.877121
722875  40.714230 -73.816454

Now we fit KMeans models with different numbers of clusters (2 through 6). We save the cluster each observation belongs to in the `cas_df` dataframe and the cluster centers in the `centroids` dictionary.

In [10]:
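centroids = {}

# Fit only on the coordinates; otherwise the label columns added below
# would be treated as extra features in later fits.
coords = cas_df[['Y', 'X']]

for i in range(2, 7):
    model = KMeans(n_clusters=i)
    clusters = str(i)
    model.fit(coords)

    cas_df[clusters] = model.labels_
    # Each centroid is [Y, X], i.e. [latitude, longitude].
    centroids[clusters] = [list(x) for x in model.cluster_centers_]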
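
Because a label column is added to the frame on each pass, it matters which columns go into `fit`. The sketch below, on synthetic data, checks that fitting on just the coordinate columns yields two-dimensional centroids for every k. The names `toy`, `coords`, and `toy_centroids`, the fixed `random_state`, and the reduced range of k are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic coordinates; the real notebook uses the sampled collision locations.
rng = np.random.default_rng(1)
toy = pd.DataFrame({"Y": rng.uniform(40.5, 40.9, 200),
                    "X": rng.uniform(-74.2, -73.7, 200)})

coords = toy[["Y", "X"]]  # a copy, so label columns added to `toy` never reach the fit
toy_centroids = {}
for k in range(2, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)
    toy[str(k)] = model.labels_
    toy_centroids[str(k)] = model.cluster_centers_.tolist()

# Every centroid has exactly two entries: [Y, X].
print({k: len(v[0]) for k, v in toy_centroids.items()})
```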

We create a dictionary in which `points` holds one entry per observation, with its coordinates and the cluster it belongs to in each clustering. In addition, the `centroids` element contains the centroid locations for each clustering.

In [11]:
plot_data = {
    "points": [dict(x) for idx, x in cas_df.iterrows()],
    "centroids": centroids
}
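
To make the shape of `points` concrete, here is a sketch on a two-row toy frame. It uses `DataFrame.to_dict(orient="records")`, a swapped-in alternative to the `iterrows` comprehension that also yields one dict per row. The values and the name `toy` are made up.

```python
import pandas as pd

# Toy frame with coordinates and labels for two clusterings (k=2 and k=3).
toy = pd.DataFrame(
    {"Y": [40.67, 40.75], "X": [-73.84, -73.98], "2": [0, 1], "3": [2, 0]}
)

# One dict per observation: coordinates plus the cluster label for each k.
points = toy.to_dict(orient="records")
print(points[0])
```

One difference worth knowing: `iterrows` builds a Series per row, which upcasts mixed columns to a single dtype, so integer labels come out as floats. `to_dict` works column by column and keeps them as integers.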

Lastly, we save the file as JSON.

In [12]:
with open("data/casualties_plot_5000.json", "w") as outfile:
    json.dump(plot_data, outfile)
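
A quick way to confirm the file is usable is to read it back. The sketch below does a round trip on a toy payload with the same top-level keys; the payload values, the temporary path, and the names `payload` and `loaded` are assumptions for illustration.

```python
import json
import os
import tempfile

# Toy payload with the same top-level keys as `plot_data`; values are made up.
payload = {
    "points": [{"Y": 40.67, "X": -73.84, "2": 0}],
    "centroids": {"2": [[40.7, -73.9], [40.8, -73.8]]},
}

# Write to a temporary directory so the sketch does not touch `data/`.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "plot.json")
    with open(path, "w") as outfile:
        json.dump(payload, outfile)
    with open(path) as infile:
        loaded = json.load(infile)

print(sorted(loaded))
```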