## Assignment on Clustering - 2

Q1. What is hierarchical clustering, and how is it different from other clustering techniques?

Hierarchical clustering is a type of clustering method that builds a hierarchy of clusters either in a bottom-up or top-down fashion. There are two types of hierarchical clustering:

Agglomerative (bottom-up approach): Each data point starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Divisive (top-down approach): All data points start in one cluster, and splits are performed recursively as one moves down the hierarchy.

The result of hierarchical clustering is usually represented as a dendrogram, which visually shows the nested clustering structure.
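As a rough illustration of the bottom-up case, the sketch below runs agglomerative clustering on a tiny made-up dataset. It assumes SciPy is available. The data, the choice of average linkage, and the cut into two clusters are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (made up for illustration): two tight groups on a line.
X = np.array([[0.0], [0.5], [1.0], [10.0], [10.5], [11.0]])

# Agglomerative (bottom-up) clustering. Each row of Z records one merge:
# [cluster_a, cluster_b, merge_distance, size_of_new_cluster].
Z = linkage(X, method="average", metric="euclidean")

# n points always produce n - 1 merges.
print(Z.shape)  # (5, 4)

# Cut the hierarchy so that at most 2 flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The matrix `Z` is the same information a dendrogram draws: which clusters were merged and at what distance.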



Here's how hierarchical clustering differs from other clustering techniques:

Number of clusters: In K-means clustering, the number of clusters (K) needs to be specified beforehand, whereas in hierarchical clustering, the number of clusters can be determined by cutting the dendrogram at a desired level.

Cluster shape: K-means tends to favor roughly spherical clusters of similar size, whereas hierarchical clustering can capture other shapes depending on the linkage criterion (for example, single linkage can follow elongated clusters), which can result in more accurate clustering for certain types of data.

Reassigning of points: In hierarchical clustering, a merge or split is never undone once it has been made, so a data point cannot later be moved to a different branch of the hierarchy. In contrast, in methods like K-means, data points can change clusters as the centroids are adjusted.

Deterministic vs non-deterministic: Hierarchical clustering is deterministic, i.e., it produces the same hierarchy given the same input and parameters (apart from how ties between equal distances are broken), unlike K-means, which can produce different clustering results depending on the initial centroid placements.

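Performance: Hierarchical clustering is usually more computationally expensive than flat clustering methods like K-means, especially on large datasets. Standard agglomerative algorithms work from all pairwise distances, so memory grows quadratically with the number of data points and running time grows quadratically or worse.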
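To give a feel for the scaling issue, the sketch below counts the pairwise distances a typical agglomerative implementation has to work with. The helper name `n_pairwise` and the dataset sizes are assumptions chosen for illustration. The memory figures assume 8-byte floats and ignore any other overhead.

```python
import numpy as np
from scipy.spatial.distance import pdist

def n_pairwise(n):
    """Number of distinct point pairs among n points."""
    return n * (n - 1) // 2

# Random data, used only to confirm the count against SciPy.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
d = pdist(X)  # condensed vector of all pairwise distances
print(len(d), n_pairwise(200))  # 19900 19900

# Approximate memory for the distance vector at larger sizes.
for n in (1_000, 10_000, 100_000):
    gigabytes = n_pairwise(n) * 8 / 1e9
    print(f"n={n}: about {gigabytes:.3f} GB of pairwise distances")
```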

Q2. What are the two main types of hierarchical clustering algorithms? Describe each in brief.

Hierarchical clustering algorithms are of two types: Agglomerative and Divisive.

Agglomerative Hierarchical Clustering (bottom-up approach): In Agglomerative Hierarchical Clustering, each data point starts in its own cluster. Pairs of clusters are successively merged based on a certain criterion or distance measure (such as Euclidean distance for continuous variables or Jaccard distance for categorical variables) until only one cluster is left, or until the distance between the remaining clusters is above a certain threshold. This type of hierarchical clustering is sometimes referred to as "bottom-up" because it starts with each data point in a separate cluster and merges clusters together to move up the hierarchy.

Divisive Hierarchical Clustering (top-down approach): In Divisive Hierarchical Clustering, all data points start in one cluster. Clusters are split recursively based on a certain criterion or distance measure until each data point is in its own cluster, or until a stopping condition is met (for example, a target number of clusters is reached or the clusters are sufficiently compact). This is the opposite of the agglomerative approach, and is sometimes referred to as "top-down" because it starts with one big cluster and splits clusters to move down the hierarchy.

The criterion used to decide which clusters to merge (for Agglomerative) or how to split a cluster (for Divisive) can vary, and different choices lead to different versions of these algorithms. For agglomerative clustering, the common choices are linkage criteria such as single linkage, complete linkage, average linkage, and Ward's method. The result of both types of hierarchical clustering is usually represented as a dendrogram, which allows the analyst to view the nested clusters at different levels of granularity.
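The agglomerative procedure can be sketched from scratch in a few lines. The function below is a deliberately naive version written for readability rather than speed. The function name, the use of single linkage, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def naive_agglomerative(X, n_clusters):
    """Repeatedly merge the two closest clusters (single linkage)
    until only n_clusters remain. Returns the clusters and the merge log."""
    clusters = [[i] for i in range(len(X))]  # every point starts alone
    # Pairwise Euclidean distances between all points.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    merges = []
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: smallest point-to-point distance.
                dist = D[np.ix_(clusters[a], clusters[b])].min()
                if dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a], clusters[b], float(dist)))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters, merges

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [20, 20]], dtype=float)
clusters, merges = naive_agglomerative(X, n_clusters=2)
print(clusters)  # [[0, 1, 2, 3], [4]]
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 3))
```

Swapping `.min()` for `.max()` or `.mean()` in the inner loop would give complete or average linkage instead. A divisive method would run in the opposite direction, starting from one cluster and choosing a split at each step.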

Q3. How do you determine the distance between two clusters in hierarchical clustering, and what are the common distance metrics used?

In hierarchical clustering, the distance between two clusters is determined by a linkage criterion, which defines the dissimilarity between two sets of observations as a function of the pairwise distances between their members. The pairwise distances themselves come from a distance metric (such as Euclidean or Manhattan distance). Both choices influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. Here are some common linkage methods:

Single Linkage (MIN): The distance between two clusters is defined as the shortest distance between a point in one cluster and a point in the other. It can result in elongated, "chain-like" clusters.

Complete Linkage (MAX): The distance between two clusters is defined as the largest distance between a point in one cluster and a point in the other. It tends to produce more compact, ball or disc-shaped clusters.

Average Linkage (AVG): The distance between two clusters is defined as the average distance between every pair of points, one in each cluster. It is a compromise between Single Linkage and Complete Linkage and tends to create more naturally shaped clusters.

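Centroid Linkage: The distance between two clusters is the distance between their centroids. Because a centroid averages over all points in a cluster, this can be less sensitive to individual outliers than single or complete linkage. However, merge distances are not guaranteed to increase from one merge to the next, which can make the dendrogram harder to read.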

Ward's Linkage: The distance between two clusters is the increase in the total within-cluster sum of squared distances (from each point to its cluster centroid) caused by merging the clusters. This method tends to produce compact clusters of roughly the same size.
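The linkage definitions above can be checked by hand on two small clusters. In the sketch below, the clusters `A` and `B` and the helper `sse` are made up for illustration, and Euclidean distance is assumed throughout. Library implementations may report a rescaled version of the Ward quantity.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small clusters, chosen only for illustration.
A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[3.0, 0.0], [3.0, 2.0], [5.0, 1.0]])

D = cdist(A, B)  # Euclidean distance for every (point in A, point in B) pair

single = D.min()     # closest pair
complete = D.max()   # farthest pair
average = D.mean()   # mean over all cross-cluster pairs

# Centroid linkage: distance between the two cluster means.
cA, cB = A.mean(axis=0), B.mean(axis=0)
centroid = np.linalg.norm(cA - cB)

def sse(P):
    """Sum of squared distances from each point to the centroid of P."""
    return ((P - P.mean(axis=0)) ** 2).sum()

# Ward: how much the within-cluster sum of squares grows if A and B merge.
ward_increase = sse(np.vstack([A, B])) - sse(A) - sse(B)
# The same quantity in closed form: |A||B| / (|A| + |B|) * ||cA - cB||^2
ward_closed = len(A) * len(B) / (len(A) + len(B)) * centroid ** 2

print(single, complete, average, centroid, ward_increase, ward_closed)
```

On any pair of clusters, the single-linkage distance is never larger than the average-linkage distance, which in turn is never larger than the complete-linkage distance.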

Q4. How do you determine the optimal number of clusters in hierarchical clustering, and what are some common methods used for this purpose?

Determining the optimal number of clusters in hierarchical clustering is often done by visualizing the clusters using a dendrogram and applying a cutoff, or threshold, to the dendrogram. Here are a few common methods used for this purpose:

Visual Inspection of Dendrogram: A dendrogram is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. By visually inspecting the dendrogram, we can choose a height at which to 'cut' the tree to form clusters. Ideally, this cut is placed where there is a large jump in the merge distances when moving up the hierarchy, since a large jump means the clusters being joined are far apart.

Inconsistency Method: This method computes the inconsistency coefficient for each link in the dendrogram. The inconsistency coefficient compares the height of a link in a dendrogram (i.e., the cluster distance) with the average height of links below it. Links with large inconsistency coefficients are more likely to be genuine cluster boundaries.

Elbow Method: Similar to its application in K-means, the elbow method looks at the percentage of variance explained (or, equivalently, the within-cluster sum of squares) as a function of the number of clusters. You pick the number of clusters beyond which adding another cluster yields only a small improvement, so that the curve forms an 'elbow'.

Gap Statistic: This method compares the (log of the) total within-cluster variation for different values of k with its expected value under a null reference distribution of the data, i.e., one with no obvious clustering. The gap is the difference between the two. A common rule is to choose the smallest k whose gap is within one standard error of the gap at k + 1; picking the k that maximizes the gap is a simpler variant.
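The first two methods can be sketched with SciPy. The example below generates three well-separated synthetic blobs, picks the number of clusters from the largest jump in merge distances, and computes inconsistency coefficients. The data, the Ward linkage, and the depth `d=2` are illustrative assumptions, and the largest-jump rule is a heuristic rather than a guaranteed answer on real data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, inconsistent

# Synthetic data: three well-separated blobs of 20 points each.
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 0], [0, 10]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

Z = linkage(X, method="ward")
heights = Z[:, 2]            # merge distances, in the order merges happened

# Largest-jump heuristic: cut just below the biggest increase in height.
gaps = np.diff(heights)
i = int(np.argmax(gaps))     # biggest jump is between merge i and merge i + 1
k = len(X) - (i + 1)         # clusters remaining after merge i
labels = fcluster(Z, t=k, criterion="maxclust")
print("suggested number of clusters:", k)

# Inconsistency coefficients, comparing each link with links up to 2 levels below.
# Columns: mean height, std of heights, number of links, coefficient.
R = inconsistent(Z, d=2)
print("largest inconsistency coefficient:", R[:, 3].max())
```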

Q5. What are dendrograms in hierarchical clustering, and how are they useful in analyzing the results?

A dendrogram is a tree-like diagram that displays the sequence of merges or splits made by hierarchical clustering. It's especially useful for visualizing the results of hierarchical clustering, which does not require specifying the number of clusters upfront.

The individual data points are arranged along the bottom of the dendrogram and a link is drawn connecting any two objects or clusters that are merged. The height of the link indicates the distance between the two objects or clusters, with lower heights indicating lower distances. The entire dendrogram is often interpreted top-down: starting with one large cluster and ending up with many small clusters.

Dendrograms are useful for several reasons:

Identifying the number of clusters: You can determine the number of clusters by choosing a "cut-off" distance and drawing a horizontal line at this distance on the dendrogram. The number of vertical lines it intersects is the resulting number of clusters.

Visualizing the clustering process: A dendrogram provides a visual representation of the sequence in which clusters were merged or split. This can give insight into how similar different groups are, and at what scale.

Understanding cluster composition: By tracing the path of each data point up the dendrogram, we can understand which points are merged together at each step of the clustering process. This helps understand the composition of each cluster and the relationships between the data points.

Examining cluster distances: The height of the branches in the dendrogram reflects the distance between clusters, allowing for an easy visual comparison of the distances between any two clusters.
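The uses above can be reproduced numerically without drawing anything. The sketch below assumes SciPy. The toy points, the single linkage, and the cut height of 2.0 are made up for illustration. Passing `no_plot=True` asks `dendrogram` for the layout data only, so no plotting library is needed here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram, cophenet
from scipy.spatial.distance import squareform

# Toy points: two close pairs and one distant point.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0], [10.0, 0.0]])
Z = linkage(X, method="single")

# Dendrogram layout without plotting; "leaves" is the left-to-right leaf order.
tree = dendrogram(Z, no_plot=True)
print("leaf order:", tree["leaves"])

# Cutting at a height: points stay together if they merged at or below it.
cut = 2.0
labels = fcluster(Z, t=cut, criterion="distance")
n_clusters = len(set(labels))
# Equivalent count: one cluster plus one for every merge above the cut.
n_from_heights = 1 + int((Z[:, 2] > cut).sum())
print(n_clusters, n_from_heights)

# Cophenetic distance: the height at which two points first share a cluster.
coph = squareform(cophenet(Z))
print(coph[0, 1], coph[0, 2], coph[0, 4])
```

Raising or lowering `cut` corresponds to moving the horizontal line up or down the dendrogram.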

Q6. Can hierarchical clustering be used for both numerical and categorical data? If yes, how are the distance metrics different for each type of data?

Yes, hierarchical clustering can be used for both numerical and categorical data, but the choice of distance metric (the measure used to calculate the similarity or dissimilarity between data points) will be different depending on the type of data.


For numerical data, some common distance metrics include:

Euclidean Distance: This is the square root of the sum of the squared differences between corresponding elements of the two vectors. It's the most commonly used distance metric, corresponding to the straight-line distance between two points in Euclidean space.

Manhattan Distance: This is the sum of the absolute differences between corresponding elements of the two vectors. It is also called city block distance as it measures distance as if you were navigating a grid of streets (like in Manhattan).

Minkowski Distance: This is a generalized distance measure with a power parameter p: it is the p-th root of the sum of the absolute differences raised to the power p. Manhattan (p = 1) and Euclidean (p = 2) distances are special cases of Minkowski distance.
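A quick numerical check of these three metrics, using functions from `scipy.spatial.distance`. The vectors `u` and `v` are arbitrary examples.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, minkowski

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 6.0, 3.0])

d_euclid = euclidean(u, v)     # sqrt(3^2 + 4^2 + 0^2)
d_manhattan = cityblock(u, v)  # |3| + |4| + |0|
print(d_euclid, d_manhattan)   # 5.0 7.0

# Minkowski with p = 1 and p = 2 reproduces Manhattan and Euclidean.
d_p1 = minkowski(u, v, p=1)
d_p2 = minkowski(u, v, p=2)
d_p3 = minkowski(u, v, p=3)    # cube root of (3^3 + 4^3)
print(d_p1, d_p2, d_p3)
```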


For categorical data, some common distance metrics include:

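Hamming Distance: This is the number of positions at which two vectors of equal length differ (often reported as a proportion of positions). For binary vectors it equals the sum of the bit-wise exclusive or; it applies equally to categorical variables by counting mismatched categories.

Jaccard Similarity or Distance: Jaccard similarity between two sets is the size of their intersection divided by the size of their union. Jaccard distance is one minus this similarity. It's often used when dealing with binary or boolean data, where each vector is treated as the set of attributes that are present.

Gower Distance: This measure computes the dissimilarity between rows that may have a mix of continuous and categorical variables. Each continuous variable contributes its absolute difference divided by the variable's range, each categorical variable contributes 0 for a match and 1 for a mismatch, and the contributions are averaged so that no variable has a disproportionate effect.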
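These measures are simple enough to compute by hand. In the sketch below, the records, the column ranges, and the helper `gower_distance` are all made up for illustration. The Gower helper is a minimal unweighted version, not a reference implementation.

```python
import numpy as np
from scipy.spatial.distance import jaccard

# Hamming: count the positions where two categorical records differ.
a = np.array(["red", "small", "round", "yes"])
b = np.array(["red", "large", "round", "no"])
hamming_count = int((a != b).sum())   # 2 mismatches
hamming_frac = (a != b).mean()        # as a proportion: 0.5

# Jaccard: treat each boolean vector as the set of attributes that are present.
p = np.array([1, 1, 0, 1, 0], dtype=bool)
q = np.array([1, 0, 0, 1, 0], dtype=bool)
jac_sim = (p & q).sum() / (p | q).sum()  # |intersection| / |union| = 2/3
jac_dist = 1 - jac_sim                   # 1/3
jac_scipy = jaccard(p, q)                # SciPy returns the dissimilarity

def gower_distance(x_num, y_num, ranges, x_cat, y_cat):
    """Unweighted Gower distance between two mixed-type records."""
    num_part = np.abs(x_num - y_num) / ranges      # range-scaled differences
    cat_part = (x_cat != y_cat).astype(float)      # 0 = match, 1 = mismatch
    return np.concatenate([num_part, cat_part]).mean()

# Two records: (age, income) numeric and (color, owner) categorical.
# The ranges are assumed to be the observed range of each numeric column.
g = gower_distance(
    np.array([25.0, 30_000.0]), np.array([45.0, 50_000.0]),
    ranges=np.array([40.0, 80_000.0]),
    x_cat=np.array(["red", "yes"]), y_cat=np.array(["blue", "yes"]),
)
print(hamming_count, hamming_frac, jac_dist, g)
```

A matrix of such dissimilarities can be passed to hierarchical clustering in place of Euclidean distances, provided the linkage method does not rely on centroids.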

Q7. How can you use hierarchical clustering to identify outliers or anomalies in your data?

Hierarchical clustering can be used to identify outliers or anomalies in your data through the process of cluster analysis. Here's how you might do this:

Smallest Clusters: After performing hierarchical clustering and deciding on a cut-off to determine the number of clusters, the smallest clusters (especially those with only one or a few data points) may be considered as outliers. These are points that are significantly different from all other points and didn't fit well into any larger cluster.

Distance in Dendrogram: In the dendrogram created from hierarchical clustering, outliers will be points that merge very late in the process, i.e., the vertical lines representing these points will be much longer than for other points. This is because outliers are, by definition, far from other points, and so it takes a while for them to be merged into a cluster.

Cluster Centroid Distance: Calculate the distance of each point to the centroid of its cluster. Data points farther from the centroid than a chosen threshold could be considered outliers. The appropriate threshold depends on the specific problem and dataset.

Silhouette Analysis: The silhouette value, which ranges from -1 to 1, measures how similar a point is to its own cluster compared to the nearest other cluster. Points with low or negative silhouette values fit their assigned cluster poorly and could be considered potential outliers.
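The first two ideas (very small clusters and late merges) can be sketched as follows. The synthetic data contains one planted outlier. The cut height of 3.0 and the "at most 2 points" size threshold are hypothetical values chosen for this toy example and would need tuning on real data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic data: two dense blobs plus one planted outlier (index 60).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(30, 2)),
    rng.normal(loc=[5, 5], scale=0.3, size=(30, 2)),
    [[20.0, -10.0]],
])

Z = linkage(X, method="average")

# Smallest clusters: cut the tree, then flag points in tiny clusters.
labels = fcluster(Z, t=3.0, criterion="distance")  # assumed cut height
sizes = np.bincount(labels)                        # cluster sizes by label
suspects = np.flatnonzero(sizes[labels] <= 2)      # assumed size threshold
print("points in tiny clusters:", suspects)

# Late merges: the final merge joins the outlier to everything else.
last_pair = Z[-1, :2].astype(int)  # ids below len(X) are original points
print("last merge joins:", last_pair, "at height", Z[-1, 2])
```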