**Density-Based Spatial Clustering of Applications with Noise (DBSCAN)**

Whereas k-means requires the number of clusters to be specified beforehand, DBSCAN infers the number of clusters from the data. Given a set of points, DBSCAN groups together points that are close to each other and marks points in sparse regions as outliers. Because it relies only on the local density of points, it can identify clusters even in large spatial datasets, and it is one of the most widely used clustering algorithms, especially for location data. DBSCAN requires two parameters to be supplied before it runs: epsilon and minimum points (also called minimum samples). Their values strongly influence the results, so some exploration and fine-tuning is usually needed before suitable clusters emerge.

Epsilon is the radius of the neighborhood around each point: two points are neighbors when the distance between them is at most epsilon. Minimum samples (or minimum points) is the number of points that must lie within a point's epsilon neighborhood, counting the point itself in scikit-learn, for that neighborhood to count as dense. Based on these two parameters, DBSCAN classifies every point into one of three types (depending on the parameter values, some of these types may not appear in a given result):

- Core points: A point is a core point when at least the minimum number of points lie within its epsilon radius. Core points that are neighbors of each other belong to the same cluster.
- Border points: A point is a border point when it has fewer than the minimum number of points within its epsilon radius but lies in the neighborhood of at least one core point. It is assigned to that core point's cluster.
- Noise points: A point that is neither a core point nor within the epsilon radius of any core point is noise. These points are considered outliers.

The labels that DBSCAN returns identify clusters, not point types: noise points receive the label -1, and each cluster receives its own label (0, 1, 2, and so on). Core and border points of the same cluster share that cluster's label.
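The three point types can be checked on a tiny example. The following is a minimal sketch on made-up 2-D points (not the crime data); the `points` array and the parameter values are assumptions chosen for illustration. It uses the `DBSCAN` class from scikit-learn, whose `core_sample_indices_` attribute lists the core points and whose `labels_` attribute holds the cluster labels.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up points for illustration only: a tight group of five points,
# one point on the fringe of that group, and one isolated point.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1],  # dense group
    [0.22, 0.0],   # fringe point: close to one core point only
    [5.0, 5.0],    # isolated point
])

model = DBSCAN(eps=0.15, min_samples=4).fit(points)
labels = model.labels_

# Core points are reported directly by scikit-learn
is_core = np.zeros(len(points), dtype=bool)
is_core[model.core_sample_indices_] = True

# Noise points carry the label -1
is_noise = labels == -1

# Border points belong to a cluster but are not core points
is_border = ~is_core & ~is_noise

print("labels:", labels)
print("core:  ", np.flatnonzero(is_core))
print("border:", np.flatnonzero(is_border))
print("noise: ", np.flatnonzero(is_noise))
```

Here the fringe point gets the same cluster label as the dense group even though it is not a core point, which shows that the label alone does not distinguish core points from border points.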

We can use DBSCAN primarily to detect outliers (noise) or, in contrast, to detect the main clusters. We will cover both of these use cases in the next two sections: Detecting outliers and Detecting clusters. We will be using the DBSCAN implementation from the scikit-learn library.

***Detecting outliers***

Let's first create coordinates from the latitude and longitude of the whole dataset. DBSCAN is unsupervised, so we do not need to split the data into train and test sets for this algorithm.

In [None]:
coords = crime_somerset_gdf[['Latitude', 'Longitude']]

Now, we are all set to apply DBSCAN to the coordinates. The dbscan function returns a tuple of core sample indices and labels. We are only interested in the labels, so we assign the indices to _, the conventional name for a value we do not intend to use.

In [None]:
from sklearn.cluster import dbscan
_, labels = dbscan(coords, eps=0.1, min_samples=10)

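Here, we pass an epsilon of 0.1 and a minimum of 10 samples. Because the coordinates are in decimal degrees, this epsilon is a distance of 0.1 degrees. A relatively large epsilon combined with a low minimum number of samples places most points in a cluster and leaves only isolated points as outliers. Let's create a DataFrame out of the resulting labels and count the points in each cluster.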

In [None]:
labels_df = pd.DataFrame(labels, index=crime_somerset_gdf.index, columns=['cluster'])
labels_df.groupby('cluster').size()

With these epsilon and minimum sample parameters, the DBSCAN results indicate that only 18 points are classified as noise (outliers), while all other points fall into a single cluster with the label 0. Let's assign each group a name and plot it to see the results.

First, we use labels_df to subset the noise points (label -1) and the points in cluster 0, which we name core, and create a separate DataFrame for each.

In [None]:
noise = crime_somerset_gdf.loc[labels_df['cluster']==-1, ['Latitude', 'Longitude']]
core = crime_somerset_gdf.loc[labels_df['cluster']== 0, ['Latitude', 'Longitude']]

Now, we will plot them using a scatter plot. The noise points are displayed as stars, while the points in cluster 0 are displayed as circles.

In [None]:
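fig, ax = plt.subplots(figsize=(12, 10))
# Noise points (label -1) as blue stars, cluster 0 as red circles
ax.scatter(noise['Latitude'], noise['Longitude'], marker='*', s=40, c='blue')
ax.scatter(core['Latitude'], core['Longitude'], marker='o', s=20, c='red')
ax.set_xlabel('Latitude')
ax.set_ylabel('Longitude')
plt.show()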

The scatter plot for the results is as follows. As you can see, the 18 points detected as noise (outliers) are displayed as stars, while the points in cluster 0 are displayed as circles.

DBSCAN algorithm clusters: outlier detection

We can also use DBSCAN to detect clusters instead of outliers. In the following section, we will tune the parameters of the algorithm to detect clusters.
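How strongly the two parameters shape the outcome is easiest to see by running DBSCAN with several settings and counting the clusters and noise points each time. The following is a minimal sketch under stated assumptions: the `points` array is synthetic (two dense blobs plus sparse background points, not the crime data), and `summarize` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
from sklearn.cluster import dbscan

# Synthetic stand-in for the coordinates (an assumption, not the crime data):
# two dense blobs plus sparse background points.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.05, size=(200, 2))
blob_b = rng.normal(loc=[1.0, 1.0], scale=0.05, size=(200, 2))
background = rng.uniform(low=-1.0, high=2.0, size=(50, 2))
points = np.vstack([blob_a, blob_b, background])


def summarize(points, eps, min_samples):
    """Return (number of clusters, number of noise points) for one setting."""
    _, labels = dbscan(points, eps=eps, min_samples=min_samples)
    n_clusters = len(np.unique(labels[labels != -1]))  # ignore the noise label
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


# Try a few (eps, min_samples) pairs, from very permissive to very strict
for eps, min_samples in [(10.0, 5), (0.5, 5), (0.1, 20), (0.02, 20)]:
    n_clusters, n_noise = summarize(points, eps, min_samples)
    print(f"eps={eps}, min_samples={min_samples}: "
          f"{n_clusters} clusters, {n_noise} noise points")
```

With an epsilon larger than the whole extent of the data, every point ends up in one cluster and nothing is noise. As epsilon shrinks and the minimum number of samples grows, only the densest regions still qualify as clusters and more points are labeled as noise.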

***Detecting clusters***

To detect clusters, we need to lower the epsilon while increasing the minimum samples, so that only densely populated areas qualify as clusters. Let's look at one example of such a scenario. We will set the epsilon to 0.01 and raise the minimum samples to 300 points.

In [None]:
_, labels = dbscan(crime_somerset_gdf[['Latitude', 'Longitude']], eps=0.01, min_samples=300)
labels_df = pd.DataFrame(labels, index=crime_somerset_gdf.index, columns=['cluster'])
labels_df.groupby('cluster').size()

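In this case, DBSCAN finds more than one cluster: in addition to the noise points (label -1), there are four clusters with the labels 0 to 3. Let's create a DataFrame for each of them and plot them in a scatter plot. We keep the name core for cluster 0 and name the other three clusters bp1, bp2, and bp3.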

In [None]:
noise = crime_somerset_gdf.loc[labels_df['cluster']==-1, ['Latitude', 'Longitude']]
core = crime_somerset_gdf.loc[labels_df['cluster']== 0, ['Latitude', 'Longitude']]
bp1 = crime_somerset_gdf.loc[labels_df['cluster']== 1, ['Latitude', 'Longitude']]
bp2 = crime_somerset_gdf.loc[labels_df['cluster']== 2, ['Latitude', 'Longitude']]
bp3 = crime_somerset_gdf.loc[labels_df['cluster']== 3, ['Latitude', 'Longitude']]
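Writing one line per label works when the number of clusters is known, but that number changes whenever the parameters change. As an alternative, the subsets can be built for whatever labels DBSCAN returned by grouping on the label column. The following is a minimal sketch, assuming small made-up stand-ins (`points_df` for the coordinates and a hand-written `labels_df`) rather than the crime data.

```python
import pandas as pd

# Made-up stand-ins for the coordinates and the DBSCAN labels
points_df = pd.DataFrame({
    'Latitude':  [51.00, 51.01, 51.20, 51.21, 50.90],
    'Longitude': [-3.00, -3.01, -2.50, -2.51, -2.10],
})
labels_df = pd.DataFrame({'cluster': [0, 0, 1, 1, -1]}, index=points_df.index)

# One DataFrame of coordinates per label, however many labels there are
groups = {
    label: group[['Latitude', 'Longitude']]
    for label, group in points_df.groupby(labels_df['cluster'])
}

for label, group in groups.items():
    print(label, len(group))
```

Each entry of `groups` can then be passed to `ax.scatter` in a loop, so the plotting code does not need to change when the number of clusters does.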

Now that we have a separate DataFrame for each cluster, we can plot them using a scatter plot, as follows. To display the image clearly, we also limit the x axis and y axis to zoom into the clusters.

In [None]:
fig, ax = plt.subplots(figsize=(15, 12))
# Noise points (label -1) as small gray dots
ax.scatter(noise['Latitude'], noise['Longitude'], s=1, c='gray')
# Clusters 0 to 3, each with its own marker and color
ax.scatter(core['Latitude'], core['Longitude'], marker='*', s=10, c='red')
ax.scatter(bp1['Latitude'], bp1['Longitude'], marker='v', s=10, c='yellow')
ax.scatter(bp2['Latitude'], bp2['Longitude'], marker='P', s=10, c='green')
ax.scatter(bp3['Latitude'], bp3['Longitude'], marker='d', s=10, c='blue')
# Zoom into the area that contains the clusters
ax.set_xlim(left=50.8, right=51.7)
ax.set_ylim(bottom=-3.5, top=-2.0)
ax.set_xlabel('Latitude')
ax.set_ylabel('Longitude')

plt.show()

The output of the scatter plot is shown in the following diagram. Five groups are visible: the noise points and four clusters. The noise points (-1) are shown as small, dimmed gray dots. Cluster 0 is shown with star markers and lies in the upper area of the plot. The three lower groups are clusters 1 to 3. Note that the plot puts latitude on the x axis and longitude on the y axis.

Zoomed cluster detection with DBSCAN 

This clearly shows where the clusters lie in our data. In the next section, we will cover more elaborate cluster detection techniques using spatial autocorrelation.