Q1. What is hierarchical clustering, and how is it different from other clustering techniques?

Hierarchical clustering is a clustering technique used in unsupervised machine learning and data analysis. It is different from other clustering techniques in that it creates a hierarchical representation of the data, which can be visualized as a tree-like structure called a dendrogram. Here's an overview of hierarchical clustering and how it differs from other clustering techniques:

Hierarchical Clustering:

Approach:

Hierarchical clustering either starts with each data point as its own cluster and successively merges clusters, or starts with a single all-inclusive cluster and successively splits it, until a termination condition is met.
It builds a hierarchy of clusters, creating a dendrogram that visually represents the relationship between clusters at different levels of granularity.

Number of Clusters:

Hierarchical clustering does not require you to specify the number of clusters (k) in advance, which is a key difference from algorithms like K-means.
The number of clusters can be determined post hoc by cutting the dendrogram at an appropriate level.

Agglomerative vs. Divisive:

Agglomerative hierarchical clustering is the most common approach and works by merging similar clusters in a bottom-up fashion.
Divisive hierarchical clustering is less common and starts with all data points in one cluster, successively splitting them into smaller clusters in a top-down manner.

Distance Metric:

Hierarchical clustering allows you to choose various distance metrics to measure the dissimilarity between data points, such as Euclidean distance or Manhattan distance.

Differences from Other Clustering Techniques:

Number of Clusters:

Some other clustering techniques, such as K-means, require you to specify the number of clusters in advance. (DBSCAN does not; it asks for a neighborhood radius and a minimum number of points instead.) Hierarchical clustering has no such requirement, making it more flexible in that regard.

Hierarchy:

Hierarchical clustering provides a hierarchical structure of clusters, allowing you to explore clustering solutions at various levels of granularity, which is not a feature of most other clustering techniques.

Interpretability:

The dendrogram generated by hierarchical clustering provides a clear visual representation of the relationships between data points and clusters, making it more interpretable for some tasks.

Cluster Shape:

Hierarchical clustering does not assume one specific cluster shape or size, but the linkage criterion matters: single linkage can follow irregularly shaped or non-convex clusters, while complete linkage and Ward's method favor compact clusters.

Computation:

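Hierarchical clustering can be more computationally intensive, especially for larger datasets, as standard agglomerative algorithms need all pairwise distances between data points. That means memory grows quadratically with the number of points, and running time grows at least quadratically.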

Dendrogram Cutting:

The user needs to decide where to cut the dendrogram to obtain the desired number of clusters, which can be somewhat subjective.

In summary, hierarchical clustering is a versatile clustering technique that provides a hierarchical view of the data, allowing you to explore clustering solutions at different levels of granularity. This hierarchical representation makes it unique compared to other clustering methods like K-means, which require specifying the number of clusters in advance.
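
To make the "build once, cut later" idea concrete, here is a minimal sketch using SciPy. The toy data, the random seed, and the choice of average linkage are assumptions made for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (made up for illustration): two well-separated groups in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(10, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(10, 2)),
])

# Build the full merge hierarchy once; no number of clusters is given here.
Z = linkage(X, method="average", metric="euclidean")

# Cut the same hierarchy at different levels after the fact.
labels_2 = fcluster(Z, t=2, criterion="maxclust")  # at most 2 clusters
labels_4 = fcluster(Z, t=4, criterion="maxclust")  # at most 4 clusters

print(len(set(labels_2)), len(set(labels_4)))
```

The hierarchy `Z` is computed a single time, and different cluster counts are read off it afterwards, which is the practical difference from K-means, where each k needs a fresh run.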

Q2. What are the two main types of hierarchical clustering algorithms? Describe each in brief.

Clustering tries to find structure in data by creating groupings of data with similar characteristics. The most famous clustering algorithm is likely K-means, but there are a large number of ways to cluster observations. Hierarchical clustering is an alternative class of clustering algorithms that produces a nested sequence of groupings, from 1 cluster to n clusters, where n is the number of observations in the data set. As you go down the hierarchy from 1 cluster (contains all the data) to n clusters (each observation is its own cluster), the observations within each cluster become more and more similar to one another (almost always). There are two types of hierarchical clustering: divisive (top-down) and agglomerative (bottom-up).

Divisive:-
Divisive hierarchical clustering works by starting with 1 cluster containing the entire data set. The observation with the highest average dissimilarity to the others (farthest from the cluster by some metric) is reassigned to its own cluster. Any observations in the old cluster that are closer to the new cluster are then moved to the new cluster. This process repeats on one of the remaining clusters (typically the largest, for example the one with the largest diameter) until each observation is its own cluster.
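
The splitting step described above can be sketched as follows. This is a simplified illustration of a single split, not a full divisive implementation; the function name `diana_split` and the toy points are hypothetical, and Euclidean distance is assumed.

```python
import numpy as np
from scipy.spatial.distance import cdist

def diana_split(X):
    """Split one cluster into (remaining, splinter) lists of row indices."""
    D = cdist(X, X)                       # pairwise Euclidean distances
    remaining = list(range(len(X)))
    # Seed the new cluster with the observation that has the highest
    # average dissimilarity to the rest of the cluster.
    avg = D.sum(axis=1) / (len(X) - 1)
    splinter = [remaining.pop(int(np.argmax(avg)))]
    while len(remaining) > 1:
        # For each remaining point: mean distance to the rest of the old
        # cluster minus mean distance to the new (splinter) cluster.
        gains = []
        for i in remaining:
            others = [j for j in remaining if j != i]
            gains.append(D[i, others].mean() - D[i, splinter].mean())
        best = int(np.argmax(gains))
        if gains[best] <= 0:
            break                         # nobody is closer to the new cluster
        splinter.append(remaining.pop(best))
    return sorted(remaining), sorted(splinter)

# Made-up points: three near the origin, two far away.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]], dtype=float)
remaining, splinter = diana_split(X)
print(remaining, splinter)  # [0, 1, 2] [3, 4]
```

Repeating this split on the resulting clusters, one at a time, yields the full top-down hierarchy.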

Agglomerative:-
Agglomerative clustering starts with each observation as its own cluster. The two closest clusters are joined into one cluster. The next closest clusters are grouped together and this process continues until there is only one cluster containing the entire data set.
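
The bottom-up procedure can be written out directly. The sketch below is deliberately naive (it rescans all cluster pairs on every merge) and assumes single linkage with Euclidean distance; the function name `naive_agglomerative` is hypothetical, and in practice a library routine such as SciPy's `linkage` would be used instead.

```python
import numpy as np
from scipy.spatial.distance import cdist

def naive_agglomerative(X, n_clusters=1):
    """Merge the two closest clusters until n_clusters remain (single linkage)."""
    clusters = [[i] for i in range(len(X))]   # every observation starts alone
    D = cdist(X, X)
    merges = []                               # (cluster_a, cluster_b, distance)
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: smallest point-to-point distance.
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges

# Made-up 1-D data: two obvious pairs.
X = np.array([[0.0], [1.0], [5.0], [6.5]])
clusters, merges = naive_agglomerative(X, n_clusters=2)
print(sorted(sorted(c) for c in clusters))   # [[0, 1], [2, 3]]
print([d for _, _, d in merges])             # [1.0, 1.5]
```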

Q3. How do you determine the distance between two clusters in hierarchical clustering, and what are the
common distance metrics used?

In hierarchical clustering, determining the distance between two clusters is a crucial step as it dictates how clusters are merged or compared. The rule used to turn point-to-point distances into a cluster-to-cluster distance is called the linkage criterion, and its choice can impact the results and the structure of the resulting dendrogram. Common linkage criteria include:

Single Linkage (Minimum Linkage):

The distance between two clusters is defined as the minimum distance between any two data points, one from each cluster.
It tends to create long, "stringy" clusters (the chaining effect) and is sensitive to noise and outliers.

Complete Linkage (Maximum Linkage):

The distance between two clusters is defined as the maximum distance between any two data points, one from each cluster.
It tends to create compact clusters of similar diameter and avoids chaining, although a single distant point can inflate the maximum distance.

Average Linkage (UPGMA - Unweighted Pair Group Method with Arithmetic Mean):

The distance between two clusters is defined as the average of all pairwise distances between data points, one from each cluster.
It often strikes a balance between single and complete linkage, producing moderately compact clusters.

Centroid Linkage:

The distance between two clusters is defined as the distance between their centroids (the mean vector of data points in each cluster).
It can lead to well-balanced, spherical clusters but may be sensitive to outliers.

Ward's Method:

Ward's method merges the pair of clusters that gives the smallest increase in the total within-cluster sum of squares.
It tends to produce compact and similarly sized clusters, and it is intended for Euclidean distances.

Weighted Average Linkage (WPGMA - Weighted Pair Group Method with Arithmetic Mean):

Similar to UPGMA, but when two clusters are merged, the distance from the new cluster to any other cluster is the simple average of the two merged clusters' distances, regardless of how many data points each contains. UPGMA, in contrast, weights each merged cluster by its size.

Other Dissimilarity Metrics:

Every linkage criterion is built on a distance between individual data points. You can use various dissimilarity metrics for this, such as Euclidean distance, Manhattan distance, Mahalanobis distance, and correlation-based distances.

The choice of linkage criterion depends on the nature of the data and the objectives of the clustering. For example, single linkage is prone to chaining and may not be suitable when noise points bridge separate groups, while complete linkage avoids chaining. Average linkage is a common choice when you want a balanced approach, and Ward's method is often used for its ability to create similarly sized, compact clusters.

It's essential to choose the linkage criterion and distance metric that align with the specific characteristics of your data and the goals of your analysis. Experimenting with different choices and observing the resulting dendrograms can help determine the most appropriate combination for your hierarchical clustering task.
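
As a small worked example, the cluster-to-cluster distances defined above can be computed by hand for two tiny clusters. The points in `A` and `B` are made up, and Euclidean distance between points is assumed throughout.

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

D = cdist(A, B)   # all pairwise Euclidean distances between A and B

single = D.min()      # closest pair: (0,0)-(3,0) -> 3.0
complete = D.max()    # farthest pair: (0,2)-(5,0) -> sqrt(29)
average = D.mean()    # mean of the four pairwise distances
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # sqrt(17)

# Ward: increase in the within-cluster sum of squares caused by merging
# A and B, which equals |A||B| / (|A| + |B|) * squared centroid distance.
ward_increase = len(A) * len(B) / (len(A) + len(B)) * centroid ** 2  # 17.0

print(single, complete, average, centroid, ward_increase)
```

Even on four points the criteria disagree noticeably (3.0 for single linkage versus about 5.39 for complete linkage), which is why the same data can produce different dendrograms under different linkages.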



Q4. How do you determine the optimal number of clusters in hierarchical clustering, and what are some
common methods used for this purpose?

Dendrogram:-
This technique applies to hierarchical clustering, most commonly the agglomerative method, which starts by considering each point as a separate cluster and then joins clusters in a hierarchical fashion based on their distances. To get the optimal number of clusters for hierarchical clustering, we make use of a dendrogram, which is a tree-like chart that shows the sequence of merges or splits of clusters.

If two clusters are merged, the dendrogram joins them with a link whose height is the distance between those clusters. A common rule of thumb is to cut the tree across the tallest gap between consecutive merge heights; the number of branches crossed by the cut is the number of clusters. We will plot the graph using the dendrogram function from the SciPy library (scipy.cluster.hierarchy).

![image.png](attachment:0c1f84a6-d1a6-460b-82be-1b882ecb6211.png)
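
One way to read the number of clusters off a dendrogram programmatically is to look for the largest jump between consecutive merge heights. The sketch below assumes made-up data with three tight groups and Ward linkage; the largest-gap rule is a heuristic, not a guarantee.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up data: three tight groups around assumed centers.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(15, 2)) for c in centers])

Z = linkage(X, method="ward")
heights = Z[:, 2]                # merge distances, in non-decreasing order
gaps = np.diff(heights)
i = int(np.argmax(gaps))         # largest jump lies between merge i and i+1

# After i+1 merges, n - (i+1) clusters remain; cut inside that gap.
k = len(X) - (i + 1)
cut_height = (heights[i] + heights[i + 1]) / 2
labels = fcluster(Z, t=cut_height, criterion="distance")

print(k, len(set(labels)))
```

Calling `scipy.cluster.hierarchy.dendrogram(Z)` with matplotlib installed draws the corresponding tree, and `cut_height` is where the horizontal cut line would sit.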

Elbow Method:-
It is one of the most popular methods for determining the optimal number of clusters. The method is based on calculating the Within-Cluster Sum of Squared Errors (WSS) for different numbers of clusters (k) and selecting the k at which the decrease in WSS first starts to level off. With hierarchical clustering, the clusters for each k are obtained by cutting the dendrogram into k clusters.

The idea behind the elbow method is that the explained variation changes rapidly for a small number of clusters and then slows down, forming an elbow in the curve. The elbow point is the number of clusters we can use for our clustering algorithm. Further details on this method can be found in the paper by Chunhui Yuan and Haitao Yang.

![image.png](attachment:33e25787-94b1-4373-b75a-fb6a413a2df3.png)
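
A minimal sketch of the elbow computation for hierarchical clustering follows. It assumes Ward linkage and made-up data with three groups; the helper `wss` is a hypothetical name introduced here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def wss(X, labels):
    """Within-cluster sum of squared distances to each cluster mean."""
    total = 0.0
    for lab in np.unique(labels):
        pts = X[labels == lab]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

# Made-up data: three groups around assumed centers.
rng = np.random.default_rng(2)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.4, size=(20, 2)) for c in centers])

Z = linkage(X, method="ward")    # build the hierarchy once
ks = list(range(1, 9))
# Cut the same tree into k clusters for each k and record the WSS.
wss_values = [wss(X, fcluster(Z, t=k, criterion="maxclust")) for k in ks]

for k, w in zip(ks, wss_values):
    print(k, round(w, 2))
```

Plotting `wss_values` against `ks` gives the elbow curve; on data like this the drop should flatten sharply after k = 3.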

Q5. What are dendrograms in hierarchical clustering, and how are they useful in analyzing the results?

A dendrogram is a tree-like chart that shows the sequence of merges (agglomerative) or splits (divisive) of clusters. In the agglomerative method, each point starts as a separate cluster at the bottom of the tree, and clusters are joined in a hierarchical fashion based on their distances until a single cluster remains at the top.

If two clusters are merged, the dendrogram joins them with a link whose height is the distance between those clusters. This makes the dendrogram useful for analyzing the results: low joins indicate very similar observations or clusters, tall joins indicate well-separated groups, and cutting the tree at a chosen height gives the clusters at that level, which helps in choosing the number of clusters. We will plot the graph using the dendrogram function from the SciPy library.

![image.png](attachment:6e87cc98-e8f0-49a0-a60c-fec2b0621b63.png)
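
The information a dendrogram draws is stored in SciPy's linkage matrix, so it can also be inspected as numbers. The four 1-D points below are made up, and single linkage is an arbitrary choice for the illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.0], [1.0], [5.0], [6.5]])
Z = linkage(X, method="single")

# Each row of Z is one merge: [cluster_a, cluster_b, merge_height, new_size].
# Indices 0..n-1 are original points; n, n+1, ... are clusters formed earlier.
for a, b, height, size in Z:
    print(int(a), int(b), height, int(size))

# no_plot=True returns the tree layout without drawing it; remove it (with
# matplotlib installed) to render the figure.
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])   # leaf labels in left-to-right plotting order
```

Here points 0 and 1 join at height 1.0, points 2 and 3 at height 1.5, and the two pairs join at height 4.0, so a cut anywhere between 1.5 and 4.0 yields two clusters.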

Q6. Can hierarchical clustering be used for both numerical and categorical data? If yes, how are the
distance metrics different for each type of data?

Hierarchical clustering can indeed be used for both numerical (quantitative) and categorical (qualitative) data. However, the choice of distance metrics and methods for handling each type of data differs.

For Numerical Data:
When dealing with numerical data, you can use various distance metrics designed for quantitative data. Common distance metrics include:

Euclidean Distance: This is the most widely used distance metric for numerical data. It calculates the straight-line distance between data points in a multi-dimensional space.

Manhattan Distance (L1 Norm): It measures the distance between two points as the sum of the absolute differences of their coordinates. It is less sensitive to outliers compared to Euclidean distance.

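Minkowski Distance: This is a generalization of both Euclidean and Manhattan distances. Its exponent p controls how strongly large coordinate differences dominate: p = 1 gives Manhattan distance, p = 2 gives Euclidean distance, and larger values of p put more weight on the largest difference.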

Mahalanobis Distance: It accounts for the correlation between variables and can be useful when dealing with datasets with highly correlated features.

Correlation-Based Distances: For data where the relative values or relationships between variables are more important than their absolute values, correlation-based distances, such as 1 minus the Pearson correlation or 1 minus the Spearman rank correlation, can be used.
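
Several of these metrics are available through SciPy's `pdist`. The sketch below uses made-up points chosen so that the results are easy to check by hand.

```python
import numpy as np
from scipy.spatial.distance import pdist

X = np.array([[0.0, 0.0], [3.0, 4.0]])

euclidean = pdist(X, metric="euclidean")[0]         # sqrt(9 + 16) = 5.0
manhattan = pdist(X, metric="cityblock")[0]         # 3 + 4 = 7.0
minkowski3 = pdist(X, metric="minkowski", p=3)[0]   # (27 + 64) ** (1/3)

# Correlation distance is 1 - Pearson r, so it ignores scale and offset:
# the first two profiles have the same shape, the third is reversed.
profiles = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [3.0, 2.0, 1.0]])
corr = pdist(profiles, metric="correlation")        # pairs (0,1), (0,2), (1,2)

print(euclidean, manhattan, minkowski3, corr)
```

The same metric names can be passed to `scipy.cluster.hierarchy.linkage` through its `metric` argument when clustering raw observations.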

For Categorical Data:
Categorical data does not have a natural notion of distance as numerical data does. Therefore, different distance metrics are used to handle categorical data:

Jaccard Distance: It measures the dissimilarity between two sets by considering the size of their intersection relative to the size of their union. It is often used for binary (presence/absence) categorical data.

Hamming Distance: It counts the number of positions at which two categorical vectors differ (often reported as a fraction of the vector length). It is suitable for binary or multi-category data in which every observation is described by the same set of variables.

Gower's Distance: This is a generalized distance metric for mixed data (both numerical and categorical). It computes a dissimilarity between 0 and 1 for each variable (for example, a range-scaled absolute difference for numerical variables and a simple match/mismatch for categorical ones) and averages them, optionally with per-variable weights.

Custom Distance Metrics: In some cases, you may need to define custom distance metrics specific to your categorical data, considering the meaning and relationships between the categories.

When combining numerical and categorical data, it's common to use a hybrid approach, such as Gower's distance, which can accommodate both types of variables in a unified framework. This allows you to perform hierarchical clustering on datasets with mixed data types. The choice of distance metric should be driven by the characteristics of your data and the specific problem you are trying to solve.

Q7. How can you use hierarchical clustering to identify outliers or anomalies in your data?

Cluster-Based Approaches for Detecting Outliers:
Clustering-based outlier detection methods assume that normal data objects belong to large and dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to any cluster. These approaches detect outliers by examining the relationship between each object and the clusters. Three questions are asked of each object:

1. Does the object belong to any cluster? If not, it is identified as an outlier.
2. Is there a large distance between the object and the cluster to which it is closest? If yes, it is an outlier.
3. Is the object part of a small or sparse cluster? If yes, all the objects in that cluster are outliers.

Checking for outliers:
1. To find objects that do not belong to any cluster, use density-based clustering (DBSCAN).
2. To detect outliers by distance to the closest cluster, use K-means clustering.
3. To find small or sparse clusters with hierarchical clustering, cut the dendrogram and inspect the cluster sizes: points that join the tree only at a large merge height, or that end up in singleton or very small clusters, are outlier candidates.
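
A minimal sketch of the small-cluster criterion with hierarchical clustering follows. The data, the cut height of 4.0, and the minimum cluster size of 3 are all illustrative assumptions rather than tuned or recommended values.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up data: one dense group plus two far-away points (rows 40 and 41).
rng = np.random.default_rng(3)
inliers = rng.normal(loc=0.0, scale=0.5, size=(40, 2))
far_points = np.array([[8.0, 8.0], [-9.0, 7.0]])
X = np.vstack([inliers, far_points])

Z = linkage(X, method="average")

# Cut the tree at an assumed height; anything merging above it stays separate.
labels = fcluster(Z, t=4.0, criterion="distance")

# Flag every point whose cluster is smaller than an assumed minimum size.
sizes = np.bincount(labels)      # sizes[c] = number of points in cluster c
min_size = 3
is_outlier = sizes[labels] < min_size

print(np.flatnonzero(is_outlier))
```

In practice both the cut height and the minimum size would need to be chosen by inspecting the dendrogram for the data at hand.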