# Table of Contents
 <p><div class="lev1 toc-item"><a href="#Clustering" data-toc-modified-id="Clustering-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Clustering</a></div><div class="lev2 toc-item"><a href="#Types-of-clustering-methods" data-toc-modified-id="Types-of-clustering-methods-11"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Types of clustering methods</a></div><div class="lev2 toc-item"><a href="#How-to-evaluate-a-clustering-method?" data-toc-modified-id="How-to-evaluate-a-clustering-method?-12"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>How to evaluate a clustering method?</a></div>

# Clustering

We have discussed clustering in the Stat/Math course and in the applied machine learning course. **Clustering organizes data objects by proximity based on variables/features and helps to create natural groupings for a set of data objects.** By grouping data, one can understand how each data point relates to the other points and discover groups of similar points. Data items within a group/cluster share common properties, so each group can be treated as a separate entity. The terms "clustering" and "cluster analysis" are used interchangeably.

Clustering is an important and efficient knowledge discovery method that can be used to search for interesting patterns. It is used to cluster data for market research, pattern recognition, image processing, classifying documents for information discovery, credit card fraud detection, grouping genes with similar patterns, and many other applications.

In general, clustering is situation-specific and often explanatory. For example, the management of a real estate company would group customers according to their annual income, rather than age, gender, or address, to decide how many buildings should be constructed at various price levels. In another example, an insurance company would group its customers by age, number of dependents, and annual insurance claims to set appropriate driver's insurance premiums [Practical Applications of Data Mining by S. Suh, 2012].

We can look at the clustering methods from various perspectives. 

* Clustering methods fall under **unsupervised learning**, one of the three broad categories of machine learning algorithms. The other two broad categories are **supervised learning** and **reinforcement learning**. In a supervised learning method, data points or instances are labeled, and the learning task requires using the data points along with the labels. In an unsupervised learning method, on the other hand, data points or instances are not labeled, and the learning task requires only the data points. Notable supervised learning tasks are classification and regression. Unsupervised learning methods are applied in clustering, dimensionality reduction, anomaly detection, information retrieval, and recommender systems.

* In most cases, the key task in unsupervised machine learning is to discover the **hidden structure** in the datasets. From this perspective, clustering methods also belong to the subcategory of **structure learning** methods. 

* Each cluster can be represented by a **reference vector** for the cluster (e.g., centroids in K-means clustering). In this sense, reference vectors are also called **codebook vectors** or **code words**. We can use the code word of each cluster to create a simpler representation of the data and then perform downstream analyses, including applying supervised learning methods.
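
The codebook idea in the last bullet can be sketched with scikit-learn's `KMeans`. This is a minimal illustration, not a prescribed workflow: the synthetic data, the choice of three clusters, and the variable names (`codebook`, `codes`, `X_quantized`) are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: three loose groups (illustrative values only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit K-means; each centroid acts as the reference (codebook) vector of its cluster.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
codebook = kmeans.cluster_centers_      # shape (3, 2)
codes = kmeans.predict(X)               # one integer code word index per data point

# Simplified representation: replace every point by its cluster's reference vector.
X_quantized = codebook[codes]           # shape (150, 2)

# Mean squared distance between the original points and their code words.
distortion = np.mean(np.sum((X - X_quantized) ** 2, axis=1))
print(codebook.shape, X_quantized.shape, round(float(distortion), 3))
```

The integer `codes` (or the quantized points) could then serve as a compact feature for a downstream model.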


There are a variety of clustering methods handling different types of clusters. However, some common aspects can be identified. A clustering process usually involves the following steps:

1. Observation Selection
1. Variable/Feature Selection
1. Variable Standardization
1. Similarity Measurement
1. Clustering Observations
1. Cluster Refinement
1. Interpretation of Clusters 
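
As a rough sketch of how several of these steps fit together, the example below standardizes two features and clusters them with K-means. The customer table is synthetic and hypothetical, the choice of two clusters is an assumption, and step 6 (cluster refinement) is left out for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: hypothetical customers described by annual income (USD) and age (years).
rng = np.random.default_rng(42)
income = np.concatenate([rng.normal(40_000, 5_000, 100),
                         rng.normal(120_000, 10_000, 100)])
age = rng.uniform(20, 70, 200)
X = np.column_stack([income, age])

# Steps 3-5: standardize the variables, then cluster with K-means,
# which measures similarity with Euclidean distance.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

# Step 7: interpret the clusters by summarizing each one in the original units.
cluster_mean_income = {}
for k in np.unique(labels):
    members = X[labels == k]
    cluster_mean_income[int(k)] = members[:, 0].mean()
    print(f"cluster {k}: n={len(members)}, "
          f"mean income={members[:, 0].mean():.0f}, "
          f"mean age={members[:, 1].mean():.1f}")
```

Without the standardization step, the income column (tens of thousands) would dominate the distance computation purely because of its units.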
 




## Types of clustering methods

There are four broad categories of clustering methods (cf. Zaki-Meira Book). 

1. **Representative-based Clustering**
    * Given the number of desired clusters k, the goal of representative-based clustering is to partition the dataset into k groups or clusters
    * Examples include K-means Algorithm, Kernel K-means, Expectation-Maximization Clustering
1. **Hierarchical Clustering**
    * Creates a sequence of nested partitions, which can be conveniently visualized via a tree or hierarchy of clusters, also called the **cluster dendrogram**
    * The clusters in the hierarchy range from the fine-grained to the coarse-grained
        * The lowest level of the tree (the leaves) consists of each point in its own cluster
        * The highest level (the root) consists of all points in one cluster
        * Both the highest-level and the lowest-level clusterings may be considered trivial
        * At some intermediate level, we may find meaningful clusters
    * Examples include agglomerative and divisive hierarchical clustering
1. **Density-based Clustering**
    * This kind of method is suitable for finding non-convex clusters. It addresses a limitation of the representative-based clustering methods, which are suited to ellipsoid-shaped clusters but fail on non-convex clusters.
    * Examples include DBSCAN, kernel-density estimation

1. **Spectral and Graph Clustering**
    * Given a graph, the goal is to cluster the nodes by using the edges and their weights, which represent the similarity between the incident nodes
    * Graph clustering is related to divisive hierarchical clustering
    * Graph clustering also has a very strong connection to the spectral decomposition of graph-based matrices 
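
To make the contrast between these categories concrete, the sketch below runs one method from each of the first three categories on a non-convex toy dataset. The dataset (`make_moons`), the parameter values (`eps=0.2`, `min_samples=5`), and the use of the adjusted Rand index as a ground-truth comparison are assumptions chosen for illustration and are not taken from the text above.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: a standard non-convex toy dataset.
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

models = {
    "KMeans (representative-based)": KMeans(n_clusters=2, n_init=10, random_state=0),
    "Agglomerative, Ward linkage (hierarchical)": AgglomerativeClustering(n_clusters=2),
    "DBSCAN (density-based)": DBSCAN(eps=0.2, min_samples=5),
}

# Compare each clustering with the known moon membership.
scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    scores[name] = adjusted_rand_score(y_true, labels)
    print(f"{name}: adjusted Rand index = {scores[name]:.2f}")
```

On data shaped like this, a density-based method would typically be expected to follow the curved clusters, whereas K-means separates the points with a straight boundary. The result depends on the chosen `eps`, so other datasets would need different settings.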


There are other ways to look at clustering methods, for example:

**Hard clustering vs Soft clustering:** In hard clustering, every object belongs to exactly one cluster. In soft clustering, an object can belong to one or more clusters. The membership can be partial, meaning the objects may belong to certain clusters more than to others.
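
The difference can be seen by clustering the same data with a hard and a soft method. In this minimal sketch, K-means stands in for hard clustering and a Gaussian mixture model (fitted by expectation-maximization) for soft clustering. The synthetic one-dimensional data and the choice of two clusters are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Two overlapping 1-D groups (illustrative values only).
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0.0, 1.0, 200),
                    rng.normal(3.0, 1.0, 200)]).reshape(-1, 1)

# Hard clustering: each point receives exactly one cluster label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: each point receives a membership weight for every cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
memberships = gmm.predict_proba(X)          # shape (400, 2), each row sums to 1

# Memberships of a point lying between the two groups.
midpoint_membership = gmm.predict_proba(np.array([[1.5]]))
print("hard labels of the first five points:", hard_labels[:5])
print("soft memberships at x = 1.5:", midpoint_membership.round(2))
```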

In the labs, we will discuss KMeans, Agglomerative, and DBSCAN clustering. 


## How to evaluate a clustering method?


If prior or expert-specified knowledge about the clusters is given, for example class labels for each point, then we can use this information to evaluate the quality of the clusters. The following two scores measure the quality of clusters given the ground truth.

**Homogeneity Score:** A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class. This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way. See [sklearn.metrics.homogeneity_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_score.html)

**Completeness Score:** A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster. This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way. See [sklearn.metrics.completeness_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.completeness_score.html#sklearn.metrics.completeness_score)
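
The two scores pull in opposite directions, which a tiny hand-made example can show. The label lists below are invented for illustration. The functions are the two scikit-learn metrics linked above.

```python
from sklearn.metrics import completeness_score, homogeneity_score

labels_true = [0, 0, 1, 1]          # ground-truth classes

# Case A: class 1 is split over two clusters. Every cluster is pure,
# but the members of class 1 are not kept together.
split = [0, 0, 1, 2]
h_split = homogeneity_score(labels_true, split)
c_split = completeness_score(labels_true, split)

# Case B: everything is merged into one cluster. Each class stays together,
# but the single cluster mixes both classes.
merged = [0, 0, 0, 0]
h_merged = homogeneity_score(labels_true, merged)
c_merged = completeness_score(labels_true, merged)

# Renaming the cluster labels of case A leaves the score unchanged.
permuted = [2, 2, 0, 1]
h_perm = homogeneity_score(labels_true, permuted)

print(f"split:  homogeneity={h_split:.2f}, completeness={c_split:.2f}")
print(f"merged: homogeneity={h_merged:.2f}, completeness={c_merged:.2f}")
```

Over-splitting keeps homogeneity perfect while lowering completeness, and merging everything does the reverse, so the two scores are best read together.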


A typical scenario in clustering is that ground truth is missing. In this case, we need to use internal measures such as intracluster similarity or compactness, contrasted with notions of intercluster separation. The internal measures are based on the **distance matrix** of all pairwise distances between the data points. 


**Silhouette score:**

The silhouette coefficient is a measure of both cohesion and separation of clusters, and is based on the difference between the average distance to points in the closest cluster and to points in the same cluster. 

From [sklearn.metrics.silhouette_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html?highlight=silhouette#sklearn.metrics.silhouette_score):
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that Silhouette Coefficient is only defined if number of labels is 2 <= n_labels <= n_samples - 1.
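
The definition can be checked by computing (b - a) / max(a, b) by hand from the distance matrix and comparing the average with `silhouette_score`. The four data points and the helper name `silhouette_of` are assumptions made for this sketch.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_score

# Four 1-D points in two well-separated clusters (illustrative values only).
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

D = pairwise_distances(X)                    # matrix of all pairwise distances


def silhouette_of(i):
    """Silhouette coefficient (b - a) / max(a, b) for sample i."""
    same = labels == labels[i]
    same[i] = False                          # exclude the sample itself
    a = D[i, same].mean()                    # mean intra-cluster distance
    # b: smallest mean distance to the points of any other cluster
    b = min(D[i, labels == k].mean()
            for k in np.unique(labels) if k != labels[i])
    return (b - a) / max(a, b)


manual = np.mean([silhouette_of(i) for i in range(len(X))])
library = silhouette_score(X, labels)
print(round(float(manual), 4), round(float(library), 4))
```

For the point at 0, a = 1 and b = (10 + 11) / 2 = 10.5, giving a coefficient of 9.5 / 10.5. A value close to 1 indicates a sample that sits well inside its own cluster.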


For other types of evaluation scores please go over Ch 17: Cluster Validation from the Data Mining and Machine Learning book.

**A good summary of clustering trade-offs and characteristics:**
 * http://scikit-learn.org/stable/modules/clustering.html
 

**Additional Readings:**

Please note the chapters involved. First, skim each reading: Chapters 2, 10, 11 & 13 in Han, then Section 14.3 in Hastie's *Elements of Statistical Learning*, and finally Tan's Chapter 7 on clustering.

Taken together, these readings give you an overview of the technologies used in this module and the conceptual rationale behind their application.


- [Han - Data Mining 3rd ed (pdf)](https://web.dsa.missouri.edu/static/PDF/DMIR/Han_Data_Mining_3e_Chapters_2-10-11-13.pdf)
    - Chapter 10: Cluster Analysis: Basic concepts and methods 
    

- [Hastie - Elements of Statistical Learning 2nd ed (pdf)](https://web.dsa.missouri.edu/static/PDF/DMIR/Hastie_ElementsStatisticalLearning2e.pdf)
    - Section 14.3: Cluster Analysis
    
- [Tan - Intro to Data Mining 2nd ed: Chapter 7: Clustering (pdf)](https://web.dsa.missouri.edu/static/PDF/DMIR/Tan_IntroDataMining2ed_ch7_clustering.pdf)


**Case Study Reading**

- [Customer Data Clustering Using Data Mining (pdf)](https://web.dsa.missouri.edu/static/PDF/DMIR/CustomerDataClusteringUsingDataMining.pdf)

