# CHAPTER - 19: Clustering

In supervised machine learning, we have access to both features and targets. This is not always the case; sometimes we only know the features.

For example, we cannot use supervised machine learning to break up the sales of a grocery store by whether a shopper is a member of a discount club, because we don't have a target to train and evaluate our models.

Instead, we can use unsupervised learning to look for differences in the behaviour of club members and nonmembers in the grocery store.

If the two groups really do behave differently, there will be two clusters of observations.

The goal of clustering algorithms is to identify these latent groupings of observations, which, if done well, allows us to predict the class of observations even without a target vector.

## 19.1 Clustering using K-Means

Grouping observations into k groups:

In [1]:
# loading the libraries

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

In [2]:
# loading the data

iris = datasets.load_iris()
features = iris.data

In [3]:
# Standardize the features

scaler = StandardScaler()
features_std = scaler.fit_transform(features)

In [4]:
# create a k-means object

cluster = KMeans(n_clusters = 3, random_state = 0)

In [5]:
# training the model

model = cluster.fit(features_std)


K-means is one of the most common clustering techniques. The algorithm attempts to group observations into k groups, with each group having roughly equal variance. It works as follows:
1. k cluster "center" points are created at random locations.
2. For each observation:
   a. The distance between the observation and each of the k center points is calculated.
   b. The observation is assigned to the cluster of the nearest center point.
3. The center points are moved to the means (i.e., centers) of their respective clusters.
4. Steps 2 and 3 are repeated until no observation changes cluster membership.
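
The four steps above can be sketched in a few lines of NumPy. This is only an illustration, not how scikit-learn implements `KMeans`: the function name `kmeans_sketch`, the toy data, and the choice to initialize the centers from randomly picked observations are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal k-means loop for illustration only (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: start from k distinct observations chosen at random
    # (an assumption; other initialization schemes exist).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 2: distance from every observation to every center,
        # then assign each observation to its nearest center.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop once no observation changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each center to the mean of its assigned observations.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers

# Toy data: two well-separated groups of three points each.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_labels, toy_centers = kmeans_sketch(X_toy, k=2)
print(toy_labels)
print(toy_centers)
```

Which group ends up as cluster 0 and which as cluster 1 depends on the random initialization; only the grouping itself is meaningful.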


In [6]:
# View predicted class
model.labels_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2,
       0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,
       2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,
       2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0], dtype=int32)

In [7]:
# View true class
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
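
Cluster ids are arbitrary, so the two arrays above cannot be compared element by element (the first 50 observations are cluster 1 but class 0). One way to compare them, sketched here as an assumption rather than as part of the original recipe, is a small contingency table that counts how many observations of each true class land in each cluster. The names `table` and `model_km` are illustrative, and `n_init=10` is written out explicitly so the result does not depend on the library default.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features_std = StandardScaler().fit_transform(iris.data)

# Same model as in the recipe, with n_init stated explicitly.
model_km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(features_std)

# Rows are true classes, columns are cluster ids; entry [i, j] counts the
# observations of class i that were assigned to cluster j.
table = np.zeros((3, 3), dtype=int)
np.add.at(table, (iris.target, model_km.labels_), 1)
print(table)
```

A row whose counts are concentrated in one column means that class was recovered as a single cluster; a row spread over several columns means the class was split.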

In [8]:
# Create new observation
new_observation = [[0.8, 0.8, 0.8, 0.8]]
# Predict observation's cluster
model.predict(new_observation)

array([2], dtype=int32)

In [9]:
# View cluster centers
model.cluster_centers_

array([[-0.05021989, -0.88337647,  0.34773781,  0.2815273 ],
       [-1.01457897,  0.85326268, -1.30498732, -1.25489349],
       [ 1.13597027,  0.08842168,  0.99615451,  1.01752612]])

## 19.2 Speeding Up K-means Clustering

When we want to group observations into k groups but k-means takes too long, we can use mini-batch k-means:

In [10]:
# load libraries

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import MiniBatchKMeans

In [11]:
# load data

iris = datasets.load_iris()
features = iris.data

In [12]:
# Standardize the features

scaler = StandardScaler()
features_std = scaler.fit_transform(features)

In [13]:
# create mini-batch k-means object

cluster = MiniBatchKMeans(n_clusters = 3, random_state = 0, batch_size = 100)

In [14]:
# Train model

model = cluster.fit(features_std)


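Mini-batch k-means works like k-means, except that each update is computed on only a random sample of observations rather than on all of them. This reduces the time the algorithm needs to converge, at only a small cost in quality. The batch_size parameter controls the number of randomly selected observations in each batch; the larger the batch, the more computationally costly the training process.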
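
To get a feel for the "small cost in quality", the sketch below fits both estimators on the same data and prints their `inertia_` (the within-cluster sum of squared distances; lower means tighter clusters). The variable names `full` and `mini` and the explicit `n_init` values are choices made for this example, and iris is far too small to show any speed difference.

```python
from sklearn import datasets
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.preprocessing import StandardScaler

features_std = StandardScaler().fit_transform(datasets.load_iris().data)

# Full k-means uses every observation in every iteration.
full = KMeans(n_clusters=3, random_state=0, n_init=10).fit(features_std)

# Mini-batch k-means updates the centers from batches of 100 observations.
mini = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100,
                       n_init=3).fit(features_std)

print("k-means inertia:   ", full.inertia_)
print("mini-batch inertia:", mini.inertia_)
```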

## 19.3 Clustering Using Meanshift

Grouping observations without assuming the number of clusters or their shape, using meanshift clustering:

In [15]:
# loading libraries

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import MeanShift

In [16]:
# loading the data

iris = datasets.load_iris()
features = iris.data

In [17]:
# Standardizing features

scaler = StandardScaler()
features_std = scaler.fit_transform(features)

In [18]:
# Creating meanshift object

cluster = MeanShift(n_jobs = -1)

In [19]:
# Train model

model = cluster.fit(features_std)

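One of the disadvantages of k-means clustering is that we need to set the number of clusters before training, and the method makes assumptions about the shape of the clusters. The meanshift clustering algorithm does not have these limitations.
MeanShift has two important parameters:
1. bandwidth: sets the radius of the area (i.e., the kernel) an observation uses to determine the direction in which to shift. By default, scikit-learn estimates a bandwidth automatically.
2. cluster_all: sometimes no other observations fall within an observation's kernel. By default, MeanShift assigns all of these "orphan" observations to the kernel of the nearest observation; setting cluster_all=False instead leaves them unassigned with the label -1.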
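
As a hedged illustration of these two settings, the sketch below estimates a bandwidth with scikit-learn's `estimate_bandwidth` helper and passes `cluster_all=False`, scikit-learn's switch for leaving orphan observations unassigned (label -1) instead of attaching them to the nearest kernel. The `quantile=0.2` value and the variable names are arbitrary choices for this example, not recommendations.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.preprocessing import StandardScaler

features_std = StandardScaler().fit_transform(datasets.load_iris().data)

# Estimate a kernel radius from the data; quantile=0.2 is an arbitrary choice.
bandwidth = estimate_bandwidth(features_std, quantile=0.2)

# cluster_all=False gives orphan observations the label -1.
model_ms = MeanShift(bandwidth=bandwidth, cluster_all=False).fit(features_std)

labels_ms = model_ms.labels_
n_found = len(np.unique(labels_ms[labels_ms != -1]))
n_orphans = int(np.sum(labels_ms == -1))
print("bandwidth:", bandwidth)
print("clusters found:", n_found, "orphans:", n_orphans)
```

Because the number of clusters is an output of the algorithm rather than an input, changing the bandwidth changes how many clusters are found.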

## 19.4 Clustering Using DBSCAN

Grouping observations into clusters of high density:

In [20]:
# loading libraries

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

In [21]:
# loading data

iris = datasets.load_iris()
features = iris.data

In [22]:
# Standardizing features

scaler = StandardScaler()
features_std = scaler.fit_transform(features)

In [23]:
# create DBSCAN object

cluster = DBSCAN(n_jobs = -1)

In [24]:
# train model

model = cluster.fit(features_std)

DBSCAN works on the idea that clusters are areas where many observations are densely packed together, and it makes no assumption about cluster shape.

Observations with enough close neighbors become the core samples of a cluster. Any observation close to a cluster but not itself a core sample is still considered part of that cluster, and any observation not close to any cluster is labeled an outlier.

DBSCAN has three main parameters to set:
1. eps: the maximum distance from an observation for another observation to be considered its neighbor.
2. min_samples: the minimum number of observations less than eps distance from an observation for it to be considered a core observation.
3. metric: the distance metric used by eps.


In our training data, 2 clusters are identified, 0 and 1, while outlier observations are labeled as -1.

In [25]:
# show cluster membership

model.labels_

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1, -1,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1, -1,
        0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  1,
        1,  1,  1,  1,  1, -1, -1,  1, -1, -1,  1, -1,  1,  1,  1,  1,  1,
       -1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
       -1,  1, -1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1, -1,  1, -1,  1,
        1,  1,  1, -1, -1, -1, -1, -1,  1,  1,  1,  1, -1,  1,  1, -1, -1,
       -1,  1,  1, -1,  1,  1, -1,  1,  1,  1, -1, -1, -1,  1,  1,  1, -1,
       -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1])
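
The three parameters can also be written out explicitly. The sketch below passes `eps=0.5`, `min_samples=5` and `metric="euclidean"`, which are scikit-learn's default values for `DBSCAN`, and then summarizes the result instead of printing every label. The variable names are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

features_std = StandardScaler().fit_transform(datasets.load_iris().data)

# Defaults written out so each parameter is visible and easy to change.
model_db = DBSCAN(eps=0.5, min_samples=5, metric="euclidean").fit(features_std)

labels_db = model_db.labels_
n_clusters_db = len(np.unique(labels_db[labels_db != -1]))  # ignore outliers
n_outliers_db = int(np.sum(labels_db == -1))
n_core_db = len(model_db.core_sample_indices_)

print("clusters:", n_clusters_db)
print("outliers:", n_outliers_db)
print("core samples:", n_core_db)
```

Raising eps or lowering min_samples generally makes the algorithm more permissive, so fewer observations are labeled as outliers.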

## 19.5 Clustering Using Hierarchical Merging

Grouping observations using a hierarchy of clusters, using agglomerative clustering:

In [27]:
# loading libraries

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

In [28]:
# loading data

iris = datasets.load_iris()
features = iris.data

In [29]:
# Standardizing features

scaler = StandardScaler()
features_std = scaler.fit_transform(features)

In [30]:
# create agglomerative clustering object

cluster = AgglomerativeClustering(n_clusters = 3)

In [None]:
# training the model

model = cluster.fit(features_std)

Agglomerative clustering is a powerful, flexible hierarchical clustering algorithm.

In agglomerative clustering, all observations start as their own clusters. Next, clusters meeting some criteria are merged together. This process is repeated, growing clusters until some end point is reached. In scikit-learn, AgglomerativeClustering uses the linkage parameter to determine the merging strategy to minimize the following:
1. Variance of merged clusters (ward)
2. Average distance between observations from pairs of clusters (average)
3. Maximum distance between observations from pairs of clusters (complete)

Two other parameters are important:
1. affinity: determines the distance metric used for linkage (minkowski, euclidean, etc.). Recent scikit-learn releases call this parameter metric.
2. n_clusters: sets the number of clusters the clustering algorithm will attempt to find. That is, clusters are successively merged until there are only n_clusters remaining.
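
To see how the merging strategy affects the result, the sketch below fits the same data with each of the three linkage options listed above and prints the resulting cluster sizes. The dictionary `sizes` and the loop are illustrative additions; the distance metric is left at its default so the code does not depend on the parameter's name in a given scikit-learn version.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

features_std = StandardScaler().fit_transform(datasets.load_iris().data)

sizes = {}
for linkage in ("ward", "average", "complete"):
    model_ac = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    model_ac.fit(features_std)
    # bincount gives the number of observations in clusters 0, 1 and 2.
    sizes[linkage] = np.bincount(model_ac.labels_)
    print(linkage, sizes[linkage])
```

Every observation receives one of the labels 0 to n_clusters - 1; unlike DBSCAN, agglomerative clustering does not label any observation as an outlier.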

Use the labels_ attribute to see the cluster to which each observation is assigned:

In [31]:
# to show cluster membership

model.labels_
