### $k$-means is a classic method for clustering.
- $k$ is a positive integer that fixes the number of clusters. Each cluster is represented by a center, and each data point is assigned to exactly one cluster.
- It solves the following optimization problem:
$$
\mathrm{minimize} \sum^{n}_{i=1} \Vert \mathbf{x}_i - \mathbf{\mu}_{z_i} \Vert^2  \quad \mathrm{w.r.t.} \quad \left(\mathbf{\mu}, z\right)
$$
where $\mu_k$ is the center of the $k^\mathrm{th}$ cluster and $z_i$ is the index of the cluster assigned to point $\mathbf{x}_i$.
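
The objective above is usually minimized with Lloyd's iteration: assign each point to its nearest center, then move each center to the mean of its points. The following is a minimal sketch in plain Julia, not the code used by `Clustering.kmeans`; the helper names `kmeans_cost` and `lloyd_step` are hypothetical and introduced here only for illustration.

```julia
# Hypothetical helper: k-means objective for data X (one point per column),
# centers M (one center per column) and assignments z.
kmeans_cost(X, M, z) = sum(sum(abs2, X[:, i] .- M[:, z[i]]) for i in axes(X, 2))

# Hypothetical helper: one Lloyd iteration.
# Step 1 reassigns every point to its nearest center,
# step 2 moves every center to the mean of its assigned points.
function lloyd_step(X, M)
    z = [argmin([sum(abs2, X[:, i] .- M[:, k]) for k in axes(M, 2)]) for i in axes(X, 2)]
    Mnew = copy(M)
    for k in axes(M, 2)
        members = findall(==(k), z)
        # leave the center unchanged if no point was assigned to it
        isempty(members) || (Mnew[:, k] = vec(sum(X[:, members], dims=2)) ./ length(members))
    end
    return Mnew, z
end

X = [0.0 1.0 10.0 11.0]   # 1×4 matrix: four one-dimensional points
M = [0.0 10.0]            # 1×2 matrix: two initial centers
M1, z = lloyd_step(X, M)

cost_before = kmeans_cost(X, M, z)    # cost with the initial centers
cost_after  = kmeans_cost(X, M1, z)   # cost after updating the centers
```

In this toy example the updated centers lower the objective, which is the property that makes repeating the two steps converge to a local minimum.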

### import required libraries

In [23]:
import RDatasets, Clustering, Plots
import Statistics, Distances
Plots.plotly()

Plots.PlotlyBackend()

### load `iris` dataset from R datasets

In [3]:
iris = RDatasets.dataset("datasets", "iris");

### select some data for clustering

In [4]:
features = Matrix(iris[:, 1:4])'; # features to use for clustering

### run clustering algorithm

In [5]:
result = Clustering.kmeans(features, 3); # run K-means for the 3 clusters

### check that the number of clusters is the same as specified

In [12]:
@assert Clustering.nclusters(result) == 3

### view some basic features of the $k$-means clustering

#### center of the clusters

In [9]:
M = result.centers

4×3 Matrix{Float64}:
 6.85385  5.006  5.88361
 3.07692  3.428  2.74098
 5.71538  1.462  4.38852
 2.05385  0.246  1.43443

### cluster size ==> number of data points for each cluster

In [13]:
cluster_sizes = Clustering.counts(result) # named `cluster_sizes` to avoid shadowing Base.size

3-element Vector{Int64}:
 39
 50
 61

### get the assignments of points to clusters

In [14]:
a = Clustering.assignments(result)

150-element Vector{Int64}:
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 ⋮
 3
 1
 1
 1
 3
 1
 1
 1
 3
 1
 1
 3

### plot with the point color mapped to the assigned cluster index

In [7]:
Plots.scatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments,
        color=:lightrainbow, legend=false)

### We ran a single $k$-means clustering, but we don't know whether 3 is the best number of clusters for this data
- several metrics exist for validating $k$-means clustering results
- one of them is the silhouette width
- others include the elbow method, cross tabulation, Rand index, variation of information, V-measure, and mutual information
- silhouette width is often a good choice; however, how well each metric works depends on the dataset
### Silhouette width measures the quality of a clustering by comparing how close each point is to its own cluster with how close it is to the nearest neighboring cluster
- The silhouette width of data point $i$ is a normalized difference that lies between $-1$ and $1$:
    $$
    s_i = \frac{b_i - a_i}{\mathrm{max}\left(a_i, b_i\right)}
    $$
where
- $a_i$ is the average distance from $i$ to the other points in the same cluster $z_i$
- $b_i$ is the smallest average distance from $i$ to the points of any other cluster $k \neq z_i$
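
To make the definition concrete, here is a minimal sketch that evaluates $s_i$ by hand on four one-dimensional points. The helper `silhouette_widths` is hypothetical (it is not part of Clustering.jl), it assumes at least two clusters, and it sets $s_i = 0$ for a point that is alone in its cluster, which is a common convention.

```julia
import Statistics

# Hypothetical helper: silhouette width of every point, given a pairwise
# distance matrix D and integer cluster labels z (assumes >= 2 clusters).
function silhouette_widths(D::AbstractMatrix, z::AbstractVector{<:Integer})
    n = length(z)
    labels = unique(z)
    s = zeros(n)
    for i in 1:n
        same = [j for j in 1:n if z[j] == z[i] && j != i]
        isempty(same) && continue   # singleton cluster: keep s_i = 0
        # a_i: mean distance to the other points of the same cluster
        a = Statistics.mean(D[i, same])
        # b_i: smallest mean distance to the points of any other cluster
        b = minimum(Statistics.mean(D[i, z .== k]) for k in labels if k != z[i])
        s[i] = (b - a) / max(a, b)
    end
    return s
end

x = [0.0, 1.0, 10.0, 11.0]            # four points on a line
D = [abs(p - q) for p in x, q in x]   # pairwise distances
z = [1, 1, 2, 2]                      # two well-separated clusters
s = silhouette_widths(D, z)
```

For the first point, $a_1 = 1$ and $b_1 = (10 + 11)/2 = 10.5$, so $s_1 = 9.5/10.5 \approx 0.905$; values close to 1 indicate that the point sits well inside its own cluster.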
### to compute Silhouette width, we need distance matrix of features/data

In [45]:
# pairwise squared Euclidean distances between data points (columns of `features`)
# for plain Euclidean distances, use Distances.Euclidean() instead
dists = Distances.pairwise(Distances.SqEuclidean(), features, dims = 2)

150×150 Matrix{Float64}:
  0.0    0.29   0.26   0.42   0.02  …  21.66  18.29  19.89  21.63  17.14
  0.29   0.0    0.09   0.11   0.37     22.09  18.06  20.24  22.26  17.25
  0.26   0.09   0.0    0.06   0.26     23.66  19.63  21.73  23.51  18.48
  0.42   0.11   0.06   0.0    0.42     22.52  18.39  20.55  22.27  17.22
  0.02   0.37   0.26   0.42   0.0      22.1   18.75  20.29  21.89  17.42
  0.38   1.19   1.18   1.36   0.38  …  18.36  15.91  16.83  18.19  14.58
  0.27   0.26   0.07   0.11   0.21     23.01  19.22  21.1   22.56  17.79
  0.03   0.18   0.17   0.25   0.05     21.15  17.64  19.34  21.06  16.49
  0.85   0.26   0.19   0.09   0.85     24.15  19.62  22.1   23.9   18.51
  0.22   0.03   0.1    0.1    0.28     21.78  17.81  19.87  21.83  16.86
  0.14   0.75   0.78   1.0    0.18  …  20.28  17.39  18.63  20.35  16.26
  0.14   0.21   0.14   0.14   0.12     21.14  17.51  19.25  20.81  16.18
  0.35   0.02   0.07   0.07   0.41     22.89  18.7   20.94  22.96  17.79
  ⋮                       

### find silhouette width

In [36]:
sil_width = Statistics.mean(Clustering.silhouettes(result, dists))

0.734413057978783

### We computed this for a single value of $k$. Now, repeat the analysis for multiple $k$ values

In [42]:
cl_num    = [2, 3, 4, 5, 6, 7]
sil_width = Float64[] # typed vector instead of Vector{Any}
for cluster in cl_num
    results          = Clustering.kmeans(features, cluster)
    silhouette_width = Statistics.mean(Clustering.silhouettes(results, dists))
    push!(sil_width, silhouette_width)
    display(silhouette_width)
end

0.8503512229251473

0.735659605433223

0.67879986093218

0.6695344607787888

0.5539050829155903

0.5221934827667588

In [43]:
Plots.plot(cl_num, sil_width, xlabel="No. of clusters", ylabel="Silhouette width", linewidth=2)
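
The ranking of the candidate $k$ values can also be read off programmatically. The sketch below simply reuses the mean silhouette widths printed by the loop above; because $k$-means starts from a random initialization, a rerun may give slightly different numbers.

```julia
# Mean silhouette widths copied from the run shown above
cl_num    = [2, 3, 4, 5, 6, 7]
sil_width = [0.8503512229251473, 0.735659605433223, 0.67879986093218,
             0.6695344607787888, 0.5539050829155903, 0.5221934827667588]

best_k  = cl_num[argmax(sil_width)]               # k with the highest mean silhouette width
ranking = cl_num[sortperm(sil_width, rev=true)]   # all k values, best first
```

A numerical maximum is only a starting point; the runner-up in `ranking` is worth inspecting as well.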

### Here, $k = 2$ has the highest silhouette value, with $k = 3$ the closest to it
- In my experience, $k = 2$ often gives a higher silhouette width than larger $k$ values.
- A likely cause is that two clusters can easily demarcate boundaries in a dataset. However, that does not mean they represent the data accurately.
- So, it is better to also consider $k$ values greater than two. Here, $k = 3$ has the next-highest silhouette width, and the actual data also contains three distinct classes.

## DBSCAN (density-based spatial clustering of applications with noise)
#### A cluster, which is a subset of the given set of points, satisfies two properties:

- All points within the cluster are mutually density-connected, meaning that for any two distinct points $p$ and $q$ in a cluster, there exists a point $o$ such that both $p$ and $q$ are density-reachable from $o$.
- If a point is density-connected to any point of a cluster, it is also part of that cluster.
- Note: the commented-out call below passes `min_cluster_size = 20`, which discards clusters with fewer than 20 points; the call that actually runs sets only `min_neighbors = 4`.
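
Density-reachability is built on core points, that is, points with enough neighbors inside a given radius. The following is a minimal sketch of that test in plain Julia. The helper `core_points` is hypothetical, and it assumes that a point counts as its own neighbor; the exact counting convention used by `Clustering.dbscan` may differ.

```julia
# Hypothetical helper: flag core points. Columns of X are points.
# Assumption: a point is "core" when at least `min_neighbors` points,
# itself included, lie within Euclidean distance `eps` of it.
function core_points(X::AbstractMatrix, eps::Real, min_neighbors::Integer)
    n = size(X, 2)
    dist(i, j) = sqrt(sum(abs2, X[:, i] .- X[:, j]))
    return [count(j -> dist(i, j) <= eps, 1:n) >= min_neighbors for i in 1:n]
end

# Three points close together on a line and one isolated point
X = [0.0 0.1 0.2 5.0;
     0.0 0.0 0.0 5.0]
core = core_points(X, 0.15, 2)
```

The three nearby points are flagged as core points, while the isolated point is not, so it could only ever be noise or a border point.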

In [59]:
points = randn(3, 1000)
#clusters = Clustering.dbscan(features, 0.05, min_neighbors = 3, min_cluster_size = 20)
clusters = Clustering.dbscan(points, 0.05, min_neighbors = 4)

1000-element Vector{Clustering.DbscanCluster}:
 Clustering.DbscanCluster(1, Int64[], [1])
 Clustering.DbscanCluster(1, Int64[], [2])
 Clustering.DbscanCluster(1, Int64[], [3])
 Clustering.DbscanCluster(1, Int64[], [4])
 Clustering.DbscanCluster(1, Int64[], [5])
 Clustering.DbscanCluster(1, Int64[], [6])
 Clustering.DbscanCluster(1, Int64[], [7])
 Clustering.DbscanCluster(1, Int64[], [8])
 Clustering.DbscanCluster(1, Int64[], [9])
 Clustering.DbscanCluster(1, Int64[], [10])
 Clustering.DbscanCluster(1, Int64[], [11])
 Clustering.DbscanCluster(1, Int64[], [12])
 Clustering.DbscanCluster(1, Int64[], [13])
 ⋮
 Clustering.DbscanCluster(1, Int64[], [989])
 Clustering.DbscanCluster(1, Int64[], [990])
 Clustering.DbscanCluster(1, Int64[], [991])
 Clustering.DbscanCluster(1, Int64[], [992])
 Clustering.DbscanCluster(1, Int64[], [993])
 Clustering.DbscanCluster(1, Int64[], [994])
 Clustering.DbscanCluster(1, Int64[], [995])
 Clustering.DbscanCluster(1, Int64[], [996])
 Clustering.DbscanCluster(1