## Chapter 10: Data Clustering

Some of the scripts presented in this notebook use several Python libraries which have been pre-installed for you. If you needed to install these libraries yourself, you would issue the following commands:

```python
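! pip install --user pandas
! pip install --user numpy
! pip install --user scikit-learn
! pip install --user matplotlib
! pip install --user scipy
# The statistics module ships with Python's standard library, so it needs no installation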
```

# Determining the Number of Clusters

When you use the K-means algorithm, you must specify the value of K, meaning the number of clusters you desire. If you specify too few clusters, you may lose valuable insights. Likewise, if you specify too many clusters, you will increase your processing time and you may not gain additional insights. 

You will need to determine and specify the number of clusters for each dataset with which you work. Depending on the dataset values, you may find that for one set of values (possibly from the same data source), 3 clusters are appropriate, whereas for other values, 5 clusters provide better grouping. The only way to determine the appropriate number of clusters is to create clusters and then analyze the results (normally using the sum of squared distances).

Several techniques exist to help you determine the proper number of clusters for your data. A common approach is called the “elbow method,” so named because the chart that it produces resembles the bend in an elbow. You create the elbow chart by plotting, for each candidate value of K, the sum of squared distances from each point to the centroid of its cluster. The bend at the elbow marks the point where adding more clusters has minimal impact. For the data used in this chapter, the bend occurs at 3 clusters.
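
To make the plotted quantity concrete, the following minimal sketch computes the sum of squared distances by hand and compares it with the `inertia_` attribute that scikit-learn reports. The nine points are a subset of this chapter's dataset, and the names `points`, `assigned_centers`, and `manual_sse` are illustrative assumptions rather than part of Elbow.py:

```python
import numpy as np
from sklearn.cluster import KMeans

# A small subset of the chapter's (x, y) data, used only for illustration
points = np.array([[35, 79], [34, 54], [32, 52],
                   [62, 51], [54, 32], [57, 40],
                   [50, 23], [48, 22], [39, 13]], dtype=float)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centers = model.cluster_centers_[model.labels_]
manual_sse = ((points - assigned_centers) ** 2).sum()

print('Manual sum of squared distances', manual_sse)
print('KMeans inertia_                ', model.inertia_)
```

The two printed values should agree up to floating-point rounding, which is why the elbow chart can plot `inertia_` directly.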

The following Python program, Elbow.py, creates an elbow chart:

In [None]:
######################################
# Chapter 10 (Python) / Deliverable 1
######################################

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from pandas import DataFrame

Data = {
        'x': [35,34,32,37,33,33,31,27,35,34,62,54,57,47,50,57,59,52,61,47,50,48,39,40,45,47,39,44,50,48],
        'y': [79,54,52,77,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,23,22,13,14,22,7,29,25,9,8]
       }
  
df = DataFrame(Data,columns=['x','y'])

distances = []

# Fit K-means for K = 1..9 and record each model's sum of squared distances
K = range(1, 10)
for k in K:
    ClusterInfo = KMeans(n_clusters=k, n_init=10).fit(df)
    distances.append(ClusterInfo.inertia_)

plt.plot(K, distances, 'bo-')
plt.xlabel('K-Clusters')
plt.ylabel('Sum of squared distances')
plt.title('Cluster Values and Distances')
plt.show()

# Using K-Means Clustering

The “means” in K-means clustering refers to each cluster’s center (centroid), which is the mean of the points assigned to that cluster. K-means is an iterative algorithm that loops until either the maximum number of iterations is reached or the clusters no longer change. To start the K-means clustering process, you specify the number of clusters, the maximum number of iterations, and the starting locations for the k centroids, which are normally chosen at random. The K-means algorithm will move the centroids toward better locations as it performs its processing. With each iteration, the K-means algorithm will perform these steps:

    •	Calculate k-centroid locations
    •	Move each point into the nearest cluster
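
The two steps above can be sketched directly in NumPy. This is a minimal illustration, not the implementation that scikit-learn uses; the function name `kmeans_sketch`, its parameters, and the six toy points are all assumptions made for the example:

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Hypothetical helper: a bare-bones K-means loop for illustration."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Move each point into the cluster of its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recalculate each centroid as the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated groups of three points each
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                   [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

The loop ends either when the centroids stop moving or when `max_iter` is reached, mirroring the two stopping conditions described above.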

In other words, with each iteration, the algorithm will move each cluster’s centroid to the mean of the cluster’s points, which is the location that minimizes the sum of squared distances to those points. The following Python script, KMeans.py, creates a 3-cluster grouping:

In [None]:
######################################
# Chapter 10 (Python) / Deliverable 2
######################################

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from pandas import DataFrame

Data = {
        'x': [35,34,32,37,33,33,31,27,35,34,62,54,57,47,50,57,59,52,61,47,50,48,39,40,45,47,39,44,50,48],
        'y': [79,54,52,77,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,23,22,13,14,22,7,29,25,9,8]
       }
  
df = DataFrame(Data,columns=['x','y'])
  
kmeans = KMeans(n_clusters=3).fit(df)
centroids = kmeans.cluster_centers_

plt.scatter(df['x'], df['y'], c=kmeans.labels_.astype(float))
plt.scatter(centroids[:, 0], centroids[:, 1], c='red')
plt.show()

# Using K-Means++

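When you use the classic K-means algorithm, you normally specify the starting centroid locations as random values. The K-means++ algorithm chooses starting centroids that are spread out across the data, which typically reduces the number of iterations needed to converge. In scikit-learn, the KMeans class uses init='k-means++' by default, so you must pass init='random' to get purely random starting centroids. The following Python program, KMeansPlusPlus.py, uses K-means++ to create the same 3-cluster grouping: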

In [None]:
######################################
# Chapter 10 (Python) / Deliverable 3
######################################

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from pandas import DataFrame

Data = {
        'x': [35,34,32,37,33,33,31,27,35,34,62,54,57,47,50,57,59,52,61,47,50,48,39,40,45,47,39,44,50,48],
        'y': [79,54,52,77,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,23,22,13,14,22,7,29,25,9,8]
       }
  
df = DataFrame(Data,columns=['x','y'])
  
kmeans = KMeans(n_clusters=3, init='k-means++').fit(df)
centroids = kmeans.cluster_centers_

plt.scatter(df['x'], df['y'], c=kmeans.labels_.astype(float))
plt.scatter(centroids[:, 0], centroids[:, 1], c='red')
plt.show()

As you can see, the program passes the parameter init='k-means++' to the KMeans constructor.

K-means++ typically arrives at a solution in fewer iterations than K-means with random starting centroids, although on a dataset this small the difference in elapsed time may be negligible. The following Python script, TimeClusters.py, uses the K-means and K-means++ algorithms to create clusters with K=3, K=4, and K=5, timing the processing required:

In [None]:
######################################
# Chapter 10 (Python) / Deliverable 4
######################################

import matplotlib.pyplot as plt
import time
from sklearn.cluster import KMeans
from pandas import DataFrame

Data = {
        'x': [35,34,32,37,33,33,31,27,35,34,62,54,57,47,50,57,59,52,61,47,50,48,39,40,45,47,39,44,50,48],
        'y': [79,54,52,77,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,23,22,13,14,22,7,29,25,9,8]
       }
  
df = DataFrame(Data,columns=['x','y'])

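# Time classic K-means. scikit-learn defaults to init='k-means++', so random
# starting centroids must be requested explicitly with init='random'.
KMeansStartTime = time.time()

kmeans = KMeans(n_clusters=3, init='random', n_init=10).fit(df)
kmeansDistance = kmeans.inertia_

kmeans = KMeans(n_clusters=4, init='random', n_init=10).fit(df)
kmeansDistance += kmeans.inertia_

kmeans = KMeans(n_clusters=5, init='random', n_init=10).fit(df)
kmeansDistance += kmeans.inertia_

KMeansStopTime = time.time()

# Time K-means++ with the same number of restarts (n_init) for a fair comparison
KMeansppStartTime = time.time()

kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10).fit(df)
kmeansppDistance = kmeans.inertia_

kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10).fit(df)
kmeansppDistance += kmeans.inertia_

kmeans = KMeans(n_clusters=5, init='k-means++', n_init=10).fit(df)
kmeansppDistance += kmeans.inertia_

KMeansppStopTime = time.time()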

print('KMeans time ', KMeansStopTime - KMeansStartTime)
print('KMeans total distance ', kmeansDistance)
print('KMeans++ time ', KMeansppStopTime - KMeansppStartTime)
print('KMeans++ total distance ', kmeansppDistance)
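
Elapsed time on a 30-point dataset is dominated by overhead, so a single timing run can be misleading. As a rough complement, the sketch below compares the average number of iterations (the `n_iter_` attribute) that each initialization needs across several random seeds. The synthetic `make_blobs` data, the seed range, and the variable names are assumptions chosen for illustration, and the averages will vary with the data:

```python
from statistics import mean

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for this sketch): 500 points around 5 centers
X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

average_iterations = {}
for init in ('random', 'k-means++'):
    # One run per seed (n_init=1) so each n_iter_ reflects a single start
    runs = [
        KMeans(n_clusters=5, init=init, n_init=1, random_state=seed).fit(X).n_iter_
        for seed in range(20)
    ]
    average_iterations[init] = mean(runs)

print(average_iterations)
```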

# Hierarchical Clustering

A hierarchical-clustering algorithm takes a different approach to grouping data. There are two forms of hierarchical-clustering algorithms: bottom-up and top-down. The bottom-up-clustering algorithm is called an agglomerative algorithm because, with each iteration, it merges related clusters into a larger cluster. In other words, the bottom-up algorithm finds the two nearest clusters and merges them, repeating this process until only one cluster exists.

In contrast, a top-down hierarchical-clustering algorithm starts with one cluster and, with each iteration, decomposes the cluster to form the lower-level clusters. Because it breaks apart a larger cluster into smaller clusters, the top-down approach is called a divisive algorithm. To understand how the hierarchical algorithm groups clusters, analysts use a chart, called a dendrogram, to show the cluster groupings.

The previous discussion merged clusters based on the minimum distance between their points. It turns out that hierarchical algorithms can use several different approaches (linkage methods) to decide which clusters to merge:

    •	Single linkage: Measures the distance between two clusters by their closest pair of points.
    •	Complete linkage: Measures the distance between two clusters by their furthest-apart pair of points.
    •	Ward’s method: Merges the pair of clusters that results in the smallest increase in the within-cluster sum of squares.
    •	Average linkage: Measures the distance between two clusters by the average distance between all pairs of their points.
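
To see how these criteria differ, the following sketch evaluates each one for a single pair of small clusters. The clusters `a` and `b` and the helper `sse` are hypothetical values invented for the example; `cdist` is SciPy's pairwise-distance function:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters lying on the x-axis
a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[4.0, 0.0], [6.0, 0.0]])

# Distances between every point in a and every point in b
pairwise = cdist(a, b)

single = pairwise.min()     # closest pair of points
complete = pairwise.max()   # furthest-apart pair of points
average = pairwise.mean()   # mean over all pairs

def sse(points):
    """Sum of squared distances from each point to the group's mean."""
    return ((points - points.mean(axis=0)) ** 2).sum()

# Ward: how much the total sum of squares grows if a and b are merged
ward_increase = sse(np.vstack([a, b])) - sse(a) - sse(b)

print('Single  ', single)
print('Complete', complete)
print('Average ', average)
print('Ward increase', ward_increase)
```

At each step, an agglomerative algorithm evaluates its chosen criterion for every pair of clusters and merges the pair with the smallest value.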

The following Python script, MultiDendrogram.py, creates the dendrograms using each method:

In [None]:
import matplotlib.pyplot as plt
from pandas import DataFrame
from scipy.cluster.hierarchy import dendrogram, linkage

Data = {
        'x': [35,34,32,37,33,33,31,27,35,34,62,54,57,47,50,57,59,52,61,47,50,48,39,40,45,47,39,44,50,48],
        'y': [79,54,52,77,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,23,22,13,14,22,7,29,25,9,8]
       }
  
df = DataFrame(Data,columns=['x','y'])

dendrogram(linkage(df, 'ward'))
plt.title('Ward')
plt.show()

dendrogram(linkage(df, 'single'))
plt.title('Single')
plt.show()

dendrogram(linkage(df, 'complete'))
plt.title('Complete')
plt.show()

dendrogram(linkage(df, 'average'))
plt.title('Average')
plt.show()

The following Python program, HierchicalCharts.py, produces the cluster charts for each point-selection algorithm:

In [None]:
import matplotlib.pyplot as plt
from pandas import DataFrame
from sklearn.cluster import AgglomerativeClustering

Data = {
        'x': [35,34,32,37,33,33,31,27,35,34,62,54,57,47,50,57,59,52,61,47,50,48,39,40,45,47,39,44,50,48],
        'y': [79,54,52,77,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,23,22,13,14,22,7,29,25,9,8]
       }
  
df = DataFrame(Data,columns=['x','y'])

# Euclidean distance is the default, so it is not passed explicitly. Newer
# scikit-learn releases renamed the 'affinity' parameter to 'metric'.
cluster = AgglomerativeClustering(n_clusters=3, linkage='ward')
cluster.fit_predict(df)
plt.scatter(df['x'], df['y'], c=cluster.labels_, cmap='rainbow')
plt.title('Ward')
plt.show()

cluster = AgglomerativeClustering(n_clusters=3, linkage='single')
cluster.fit_predict(df)
plt.scatter(df['x'], df['y'], c=cluster.labels_, cmap='rainbow')
plt.title('Single')
plt.show()

cluster = AgglomerativeClustering(n_clusters=3, linkage='complete')
cluster.fit_predict(df)
plt.scatter(df['x'], df['y'], c=cluster.labels_, cmap='rainbow')
plt.title('Complete')
plt.show()

cluster = AgglomerativeClustering(n_clusters=3, linkage='average')
cluster.fit_predict(df)
plt.scatter(df['x'], df['y'], c=cluster.labels_, cmap='rainbow')
plt.title('Average')
plt.show()

# DBSCAN Clustering

DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise, groups points within clusters based on the density of surrounding points. Within a cluster, a “core point” has at least the minimum number of points (min_samples) within the given radius (eps). A “cluster border point,” in contrast, falls within the radius of a core point but does not itself have the minimum number of points surrounding it. All other points fall outside of every cluster and are considered “noise” (represented by the purple dots in the chart that the script below produces).
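
The three point types can be recovered from a fitted scikit-learn model. The sketch below is a minimal illustration on six hypothetical points; `labels_` and `core_sample_indices_` are attributes of the fitted DBSCAN object, while the `point_types` list is an assumption introduced for the example. Note that scikit-learn counts the point itself toward min_samples:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: a tight square, one point on its edge, one far away
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2.4, 1], [10, 10]], dtype=float)

model = DBSCAN(eps=1.5, min_samples=4).fit(X)

# Mark which points DBSCAN identified as core points
is_core = np.zeros(len(X), dtype=bool)
is_core[model.core_sample_indices_] = True

point_types = []
for label, core in zip(model.labels_, is_core):
    if core:
        point_types.append('core')
    elif label == -1:
        point_types.append('noise')    # not in any cluster
    else:
        point_types.append('border')   # in a cluster, but not dense enough to be core

for point, kind in zip(X, point_types):
    print(point, kind)
```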

The DBSCAN algorithm starts by determining the point types (core, border, and noise). It then creates a cluster for each core point, merging clusters whose core points fall within the radius of one another. Finally, DBSCAN adds each border point to the cluster of a nearby core point.

The following Python script, DBSCAN.py, uses the DBSCAN algorithm to group data into clusters:

In [None]:
######################################
# Chapter 10 (Python) / Deliverable 5
######################################

import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from pandas import DataFrame

Data = {
        'x': [35,34,32,37,33,33,31,27,35,34,62,54,57,47,50,57,59,52,61,47,50,48,39,40,45,47,39,44,50,48],
        'y': [79,54,52,77,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,23,22,13,14,22,7,29,25,9,8]
       }
  
df = DataFrame(Data,columns=['x','y'])

clustering = DBSCAN(eps=3, min_samples=3).fit(df)
labels = clustering.labels_
numberofclusters = len(set(labels)) - (1 if -1 in labels else 0)
plt.title('DBSCAN Number of clusters: %d' % numberofclusters)
plt.scatter(df['x'], df['y'], c=clustering.labels_.astype(float))
plt.show()

# Interesting Cluster Shapes

When data analysts first start clustering data, they often envision clusters as neat and orderly groups. Clusters, however, can take on a variety of shapes and forms. The following Python program, Moons.py, applies K-means clustering to data generated by the make_moons and make_circles functions:

In [None]:
######################################
# Chapter 10 (Python) / Deliverable 6
######################################

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import datasets

X, y = datasets.make_moons(n_samples=500)

kmeans = KMeans(n_clusters=5).fit(X)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_.astype(float))
plt.show()

X, y = datasets.make_moons(n_samples=500, noise=0.05)

kmeans = KMeans(n_clusters=5).fit(X)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_.astype(float))
plt.show()

X, y = datasets.make_circles(n_samples=500)

kmeans = KMeans(n_clusters=5).fit(X)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_.astype(float))
plt.show()

X, y = datasets.make_circles(n_samples=500, noise=0.05)

kmeans = KMeans(n_clusters=5).fit(X)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_.astype(float))
plt.show()
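
Because K-means assigns each point to its nearest centroid, it tends to split curved shapes like these into compact pieces. As a hedged comparison, the sketch below runs both K-means and DBSCAN on the two-moons data and scores each grouping against the moon each point was generated from. The eps and min_samples values are assumptions tuned to this synthetic data, and `adjusted_rand_score` (from sklearn.metrics) is used here only as a convenient agreement measure:

```python
from sklearn import datasets
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import adjusted_rand_score

# true_moon records which of the two moons each point was generated from
X, true_moon = datasets.make_moons(n_samples=500, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# A score of 1.0 means the grouping matches the true moons exactly
kmeans_score = adjusted_rand_score(true_moon, kmeans_labels)
dbscan_score = adjusted_rand_score(true_moon, dbscan_labels)

print('K-means agreement', kmeans_score)
print('DBSCAN agreement ', dbscan_score)
```

Either label vector can be passed to plt.scatter through the c parameter, as in the scripts above, to view the groupings.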

# Viewing Cluster Assignments

When you create your clusters, the clustering functions will normally return a vector of values that specifies the cluster to which each corresponding point has been assigned. When you plot the clusters, your plotting functions will use this vector to assign a different color to each cluster. The following Python script, ShowClusters.py, prints the cluster vector produced by the KMeans class:

In [None]:
######################################
# Chapter 10 (Python) / Deliverable 7
######################################

from sklearn.cluster import KMeans
from pandas import DataFrame

Data = {
        'x': [35,34,32,37,33,33,31,27,35,34,62,54,57,47,50,57,59,52,61,47,50,48,39,40,45,47,39,44,50,48],
        'y': [79,54,52,77,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,23,22,13,14,22,7,29,25,9,8]
       }
  
df = DataFrame(Data,columns=['x','y'])
  
kmeans = KMeans(n_clusters=3).fit(df)

clusters = kmeans.labels_
print("Cluster\t X\t Y")
# Pair each point with its assigned cluster label
for label, row in zip(clusters, df.itertuples(index=False)):
    print(label, '      ', row.x, '    ', row.y)

Similarly, this Python script, ShowHierarchicalClusters.py, displays the clusters returned for a hierarchical clustering:

In [None]:
######################################
# Chapter 10 (Python) / Deliverable 8
######################################

from pandas import DataFrame
from sklearn.cluster import AgglomerativeClustering

Data = {
        'x': [35,34,32,37,33,33,31,27,35,34,62,54,57,47,50,57,59,52,61,47,50,48,39,40,45,47,39,44,50,48],
        'y': [79,54,52,77,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,23,22,13,14,22,7,29,25,9,8]
       }
  
df = DataFrame(Data,columns=['x','y'])

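# Euclidean distance is the default, so it is not passed explicitly. Newer
# scikit-learn releases renamed the 'affinity' parameter to 'metric'.
cluster = AgglomerativeClustering(n_clusters=3, linkage='ward')
cluster.fit_predict(df)

clusters = cluster.labels_
print("Cluster\t X\t Y")
# Pair each point with its assigned cluster label
for label, row in zip(clusters, df.itertuples(index=False)):
    print(label, '      ', row.x, '    ', row.y)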
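
Rather than printing the vector row by row, you can also attach it to the data as a column, which makes it easy to count or filter the points in each cluster. The following is a minimal sketch on six hypothetical points; the column name `cluster` and the variable names are assumptions:

```python
from pandas import DataFrame
from sklearn.cluster import KMeans

# Hypothetical toy data: two obvious groups of three points
df = DataFrame({'x': [1, 2, 1, 20, 21, 22],
                'y': [1, 1, 2, 20, 21, 20]})

# Fit on the x and y columns before the label column is added
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df)

# assign() returns a copy of the data with the labels as a new column
labeled = df.assign(cluster=kmeans.labels_)
print(labeled)

# Count how many points landed in each cluster
cluster_sizes = labeled.groupby('cluster').size()
print(cluster_sizes)
```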

# Identifying Data Outliers

An outlier is a value that falls outside of the expected range of values. Depending on the analysis you are performing, the presence of one or more outliers can have a significant impact on your results. The following Python program, BasicMetrics.py, calculates the mean and standard deviation for an array of values:

In [None]:
import statistics
 
values = [-100, -75, 1,2,3,4,5, 75, 100]
print('Mean', statistics.mean(values))
print('Standard Deviation', statistics.stdev(values))

In this case, the large standard deviation, relative to the mean, is an indication that outlier values may exist. Next, the following program, IdentifyOutliers.py, examines the array to identify values that fall more than one standard deviation from the mean and prints the index and value of each:

In [None]:
import statistics
 
values = [-100, -75, 1,2,3,4,5, 75, 100]
mean = statistics.mean(values)
stdev = statistics.stdev(values)

print('Mean ', mean)
print('Standard deviation ', stdev) 

# Print the index and value of each element that lies more than
# one standard deviation from the mean
for i, value in enumerate(values):
    if value < (mean - stdev) or value > (mean + stdev):
        print(i, value)

Depending on your data-analytic goal, you may actually pursue outliers. For example, within healthcare data, an outlier might point you to a genetic trait that is key to a cause or cure. Often, however, you will simply remove the outlier values. The following Python program, NoOutliers.py, again calculates the mean and standard deviation, this time both with and without the outliers:

In [None]:
import statistics
 
values = [-100, -75, 1,2,3,4,5, 75, 100]
mean = statistics.mean(values)
stdev = statistics.stdev(values)
print('Starting values ', values)
print('Mean ', mean)
print('Standard deviation ', stdev) 

# Keep only the values within one standard deviation of the mean
newvalues = [v for v in values if (mean - stdev) < v < (mean + stdev)]

mean = statistics.mean(newvalues)
stdev = statistics.stdev(newvalues)

print('\nList without outliers ', newvalues)
print('Mean ', mean)
print('Standard deviation ', stdev) 
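
The one-standard-deviation cutoff used in these programs is only one possible choice. The following sketch wraps the same test in a function with a configurable cutoff. The function name `split_outliers` and its `threshold` parameter are hypothetical, introduced here for illustration:

```python
import statistics

def split_outliers(values, threshold=1.0):
    """Hypothetical helper: separate inliers from (index, value) outliers.

    threshold is the number of standard deviations from the mean
    beyond which a value is treated as an outlier.
    """
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    inliers, outliers = [], []
    for i, value in enumerate(values):
        if abs(value - mean) > threshold * stdev:
            outliers.append((i, value))
        else:
            inliers.append(value)
    return inliers, outliers

values = [-100, -75, 1, 2, 3, 4, 5, 75, 100]
inliers, outliers = split_outliers(values, threshold=1.0)
print('Inliers ', inliers)
print('Outliers', outliers)
```

With a threshold of 1.0 the function flags the same four values as IdentifyOutliers.py; raising the threshold to 2.0 flags none of them, which shows how strongly the result depends on the cutoff you choose.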

# Identifying Outliers Using DBSCAN

When you cluster data sets, most clustering algorithms will assign all values to clusters, even outliers. As you have learned, the DBSCAN clustering algorithm will identify “core” values, “border” values, and noise. If a point is not in a cluster, meaning the point is noise (an outlier), the label vector will contain the value -1 for that point. The following Python script, ShowNoise.py, displays the noise values identified by DBSCAN:

In [None]:
######################################
# Chapter 10 (Python) / Deliverable 9
######################################

from sklearn.cluster import DBSCAN
from pandas import DataFrame

Data = {
        'x': [35,34,32,37,33,33,31,27,35,34,62,54,57,47,50,57,59,52,61,47,50,48,39,40,45,47,39,44,50,48],
        'y': [79,54,52,77,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,23,22,13,14,22,7,29,25,9,8]
       }
  
df = DataFrame(Data,columns=['x','y'])

clustering = DBSCAN(eps=5, min_samples=3).fit(df)
labels = clustering.labels_

print("Index\tCluster\t X\t Y")
# Noise points (outliers) carry the label -1
for i, (label, row) in enumerate(zip(labels, df.itertuples(index=False))):
    if label == -1:
        print(i, '     ', label, '     ', row.x, '    ', row.y)