**Making Sense of Humongous Location Datasets**

Large location datasets are hard to explore and analyze directly. Geospatial clustering techniques address this problem by reducing location data to a smaller set of manageable, relevant variables for the data analysis process. Clustering becomes more important as the amount of data grows.

We will use machine learning and spatial statistics to derive insights from location data with less dimensional complexity. This chapter covers the following topics:

- K-means clustering
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
- Spatial autocorrelation

***K-means clustering***

K-means clustering is one of the most widely used unsupervised machine learning techniques and is used mainly for data mining. In classic k-means clustering, all the weight is on attribute similarity, whereas location-based k-means specifically targets geographic coordinates to derive spatial, or location, similarity. We will use the latter because we are interested in location data analysis.

The k-means algorithm starts by randomly selecting k objects (where k is the number of clusters specified), each of which initially represents a cluster mean, or center. The algorithm then assigns every other object to the cluster whose mean is closest, based on the Euclidean distance between the object and the cluster mean, recomputes each cluster mean from its assigned objects, and repeats. The k-means process is iterative and requires heavy computation on a large dataset because every iteration goes through each object.
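The steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used later in the chapter; the function name `kmeans_sketch`, its parameters, and the toy coordinates are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Minimal Lloyd-style k-means on an (n, d) array; illustrative only."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations as the initial cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: Euclidean distance from every point to every center, shape (n, k),
        # then assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned points
        # (keep the old center if a cluster ends up empty).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical toy coordinates: two well-separated groups of three points.
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centers = kmeans_sketch(toy, k=2)
print(labels)
print(centers)
```

Every iteration computes the distance from each point to each center, which is why the cost grows quickly with the size of the dataset.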

*The crime dataset*

In [None]:
# Read the dataset
import pandas as pd

crime_somerset = pd.read_csv("2019-02-avon-and-somerset-street.csv")
crime_somerset.head()

*Cleaning data*

Before modeling, we need to preprocess and clean the data.

Let's first check how many null values the data contains. We use the pandas .isnull() method followed by .sum() to get the total number of null values in each column.

In [None]:
crime_somerset.isnull().sum()

We can drop every column whose number of missing values exceeds a threshold (2,000 rows in our case) and keep the columns that have fewer missing values than that.

In [None]:
crime_somerset.drop(['Last outcome category','Context', 'Crime ID' ], axis=1, inplace=True)
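Instead of typing the column names by hand, the threshold rule can also be applied programmatically. The sketch below uses a small, made-up DataFrame as a stand-in for the crime table and a threshold of 2 instead of 2,000; the names `df`, `threshold`, and `cols_to_drop` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the crime table; the real CSV has more rows and columns.
df = pd.DataFrame({
    "Crime type": ["Burglary", "Robbery", "Drugs", "Burglary"],
    "Latitude": [51.45, 51.46, np.nan, 51.44],
    "Context": [np.nan, np.nan, np.nan, np.nan],
})

threshold = 2  # illustrative value; the text uses 2,000 for the full dataset

# Count nulls per column and keep the names of columns above the threshold.
null_counts = df.isnull().sum()
cols_to_drop = null_counts[null_counts > threshold].index.tolist()

# Drop those columns and leave the rest untouched.
cleaned = df.drop(columns=cols_to_drop)
print(cols_to_drop)
print(cleaned.columns.tolist())
```

Deriving the list from the null counts keeps the cleaning step valid if the same code is later run on a different month's file.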

We also need to drop rows with missing values to get the data ready for machine learning models. To drop every row that has any missing value, we can do the following:

In [None]:
crime_somerset.dropna(axis=0,inplace=True)

If you run the null check again, you will see that the dataset no longer has any missing values.

In [None]:
crime_somerset.isnull().sum()

Next, let's convert the pandas DataFrame into a GeoPandas GeoDataFrame. The following function creates a GeoDataFrame from a pandas DataFrame, and it works for any CSV file that has latitude and longitude columns.

In [None]:
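import geopandas as gpd
from shapely.geometry import Point

def create_gdf(df, lat, lon):
    """Convert a pandas DataFrame into a GeoPandas GeoDataFrame."""
    # Build points in (x, y) = (longitude, latitude) order
    geometry = [Point(xy) for xy in zip(df[lon], df[lat])]
    # EPSG:4326 is the WGS 84 latitude/longitude coordinate reference system
    gdf = gpd.GeoDataFrame(df, crs="EPSG:4326", geometry=geometry)
    return gdf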

Now that we have a function that converts a pandas DataFrame into a GeoPandas GeoDataFrame, we can call it with the names of the coordinate columns, which lets it handle datasets that name latitude and longitude differently. Let's call the function on the crime dataset.

In [None]:
crime_somerset_gdf = create_gdf(crime_somerset, 'Latitude', 'Longitude')

**K-means clustering with scikit-learn**

To apply k-means clustering on location data, we need to get the coordinates of these features. Before we do that, we will split the dataset into train and test datasets. The test dataset will be used for predicting which group a point belongs to

In [None]:
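from sklearn.cluster import KMeans

# Hold out 30% of the rows for testing; random_state makes the split repeatable
train = crime_somerset_gdf.sample(frac=0.7, random_state=14)
test = crime_somerset_gdf.drop(train.index)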

Now that we have created a training and test dataset, let's store training and test coordinates

In [None]:
train_coords = train[['Latitude', 'Longitude']].values
test_coords = test[['Latitude', 'Longitude']].values

Let's compute k-means clustering. In this example, we arbitrarily choose 5 clusters

In [None]:
kmeans = KMeans(n_clusters=5)
kmeans.fit(train_coords)
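Because the number of clusters here is chosen arbitrarily, it can help to see how the fit changes with k. One common heuristic, which is not part of the workflow above, is the elbow method: fit the model for several values of k and compare the within-cluster sum of squares that scikit-learn exposes as `inertia_`. The sketch below runs on synthetic coordinates made up for illustration, not on the crime data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical synthetic coordinates: three tight blobs of 50 points each.
rng = np.random.default_rng(14)
blob_centers = np.array([[51.40, -2.60], [51.00, -3.10], [51.50, -2.00]])
coords = np.vstack([c + rng.normal(scale=0.02, size=(50, 2)) for c in blob_centers])

# Fit k-means for k = 1..6 and record the within-cluster sum of squares.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=14).fit(coords)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 4))
```

On data like this, the inertia drops sharply until k reaches the number of blobs and then flattens; the k at that bend is a reasonable candidate. Real location data rarely shows such a clean elbow, so treat the result as a guide rather than a rule.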

Then, we predict the cluster index for each test sample and retrieve the cluster centers computed during fitting

In [None]:
preds = kmeans.predict(test_coords)
centers = kmeans.cluster_centers_

Let's visualize the 5 cluster outputs of the predictions from the test dataset

In [None]:
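import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12,10))
# Columns are [latitude, longitude], so plot longitude on x and latitude on y
ax.scatter(test_coords[:, 1], test_coords[:, 0], c=preds, s=30, cmap='viridis')
ax.scatter(centers[:, 1], centers[:, 0], c='red', marker="s", s=50);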

K-means clustering with five clusters

The cluster centers are displayed as squares, while the clustered points are displayed as circles. A point's cluster is determined by how close the point is to each cluster mean. There are some outlying points at the corners, but the plot still shows clearly how the k-means algorithm works in this case: each point is assigned to the cluster with the nearest center.
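The nearest-center rule can be checked directly: for a fitted scikit-learn `KMeans` model, the label returned by `predict` should match the index of the closest entry in `cluster_centers_`. The sketch below uses randomly generated coordinates as a stand-in for the crime data; the names `pts` and `new_pts` and the bounding box are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical coordinates drawn uniformly from a made-up bounding box.
rng = np.random.default_rng(0)
pts = rng.uniform(low=[51.0, -3.0], high=[51.6, -2.2], size=(200, 2))
model = KMeans(n_clusters=5, n_init=10, random_state=14).fit(pts)

# New points that the model has not seen during fitting.
new_pts = rng.uniform(low=[51.0, -3.0], high=[51.6, -2.2], size=(20, 2))

# Euclidean distance from each new point to each center, shape (20, 5).
dists = np.linalg.norm(new_pts[:, None, :] - model.cluster_centers_[None, :, :], axis=2)
nearest = dists.argmin(axis=1)

# Compare the manual nearest-center labels with the model's predictions.
same = np.array_equal(nearest, model.predict(new_pts))
print(same)
```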