## Clustering - 2 Assignment
**By Shahequa Modabbera**

Q1. What is hierarchical clustering, and how is it different from other clustering techniques?

Ans) Hierarchical clustering is a family of clustering algorithms that build a hierarchy of nested clusters in a dataset. Unlike partitioning methods such as K-means, hierarchical clustering does not require the number of clusters to be fixed in advance. Instead, it recursively merges or divides clusters based on the similarity or dissimilarity between data points.

The key characteristics of hierarchical clustering are as follows:

1. Hierarchy: Hierarchical clustering produces a hierarchical structure of clusters, often visualized as a dendrogram. The dendrogram illustrates the merging or splitting of clusters at different levels.

2. Agglomerative vs. Divisive: Hierarchical clustering can be agglomerative or divisive. Agglomerative clustering starts with individual data points as separate clusters and iteratively merges the closest pairs of clusters until a single cluster is formed. Divisive clustering starts with a single cluster containing all data points and recursively splits it into smaller clusters.

3. Distance Metric: Hierarchical clustering requires a distance metric to measure the dissimilarity between data points. Common choices include Euclidean distance, Manhattan distance, and cosine distance (1 minus cosine similarity).

4. Linkage Criteria: The linkage criterion determines the distance between clusters and influences the merging or splitting process. Common linkage criteria include complete linkage (based on the maximum distance between data points in different clusters), single linkage (based on the minimum distance), average linkage (based on the average pairwise distance), and Ward's method (which merges the pair of clusters that least increases the within-cluster variance).

5. No Predefined Number of Clusters: Unlike K-means clustering, hierarchical clustering does not require the number of clusters to be specified beforehand; the number is chosen afterwards by cutting the hierarchy at a suitable level. Depending on the linkage criterion, it can also capture clusters of different sizes and shapes.

6. Interpretability: The hierarchical structure produced by hierarchical clustering allows for easy interpretation. By examining the dendrogram, one can identify the relationships and similarities between different clusters.

Hierarchical clustering is suitable for scenarios where the data has an inherent hierarchical structure or when the number of clusters is unknown. It is commonly used in biology, social sciences, and image analysis. However, hierarchical clustering can be computationally expensive: standard agglomerative algorithms store an n x n distance matrix, so memory grows as O(n^2) and running time as O(n^2) to O(n^3) depending on the linkage and implementation, which limits scalability on large datasets.
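As a minimal sketch of these characteristics, the snippet below builds a hierarchy with SciPy and only afterwards cuts it into a chosen number of clusters. The toy data, the choice of average linkage, and the variable names are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (assumed for illustration): two well-separated groups in 2-D.
X = np.array([
    [0.0, 0.0], [0.0, 1.0], [2.0, 0.0],
    [10.0, 10.0], [10.0, 11.2], [12.5, 10.0],
])

# Build the full merge hierarchy; no cluster count is supplied here.
Z = linkage(X, method="average", metric="euclidean")

# Each row of Z records one merge: (cluster_a, cluster_b, distance, new_size).
print(Z.shape)  # (n - 1, 4) for n points

# The same hierarchy can be cut afterwards at different granularities.
labels_2 = fcluster(Z, t=2, criterion="maxclust")
labels_3 = fcluster(Z, t=3, criterion="maxclust")
print(labels_2, labels_3)
```

Because the hierarchy is computed once, trying a different number of clusters only requires another call to `fcluster`, not a new clustering run.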

Q2. What are the two main types of hierarchical clustering algorithms? Describe each in brief.

Ans) The two main types of hierarchical clustering algorithms are Agglomerative Clustering and Divisive Clustering:

1. Agglomerative Clustering:
   - Agglomerative clustering starts with each data point as a separate cluster and iteratively merges the closest pairs of clusters based on a similarity measure.
   - At the beginning, each data point is considered as a single cluster.
   - Then, at each iteration, the two closest clusters are merged to form a larger cluster.
   - This process continues until all data points belong to a single cluster or until a specified number of clusters is reached.
   - The proximity between clusters is determined by a linkage criterion, such as complete linkage, single linkage, or average linkage.
   - Agglomerative clustering creates a hierarchy of clusters, typically represented as a dendrogram, which visually displays the merging process.

2. Divisive Clustering:
   - Divisive clustering starts with a single cluster containing all data points and recursively splits it into smaller clusters.
   - The process begins by considering all data points as part of a single cluster.
   - Then, at each step, a cluster is chosen (for example, the largest or most heterogeneous one) and split into two subclusters so that the most dissimilar data points end up in different clusters.
   - This splitting process continues recursively until each data point is assigned to its own individual cluster or until a specified number of clusters is reached.
   - Divisive clustering creates a hierarchy of clusters, similar to agglomerative clustering, but in the reverse order.
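   - Divisive clustering can be more computationally demanding than agglomerative clustering, especially on large datasets: a cluster of n points can be split into two non-empty groups in 2^(n-1) - 1 ways, so practical implementations rely on heuristics rather than checking every split.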

Both agglomerative and divisive clustering methods produce a hierarchical structure of clusters, but they differ in terms of their bottom-up (agglomerative) or top-down (divisive) approach. The choice between the two depends on the nature of the problem and the characteristics of the dataset. Agglomerative clustering is more commonly used due to its simplicity and computational efficiency.
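The bottom-up procedure can be sketched in a few lines of plain Python. This is a naive illustration under stated assumptions, not an efficient or reference implementation: the function name `agglomerative`, its parameters, and the sample points are hypothetical.

```python
from itertools import combinations
from math import dist

def agglomerative(points, n_clusters=1, linkage=min):
    """Naive bottom-up clustering; returns the clusters and the merge history.

    `linkage` aggregates the pairwise point distances between two clusters:
    min gives single linkage, max gives complete linkage.
    """
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    history = []
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = linkage(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        history.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into cluster a
        del clusters[b]                          # b > a, so index a is unaffected
    return clusters, history

# Sample points (assumed): two close pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.5), (10.0, 0.0)]
clusters, history = agglomerative(points, n_clusters=2, linkage=min)
for left, right, d in history:
    print(f"merged {left} and {right} at distance {d:.2f}")
```

The `history` list is the information a dendrogram draws: which clusters merged and at what distance. Stopping when `n_clusters` is reached corresponds to the "specified number of clusters" stopping rule described above.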

Q3. How do you determine the distance between two clusters in hierarchical clustering, and what are the
common distance metrics used?

Ans) In hierarchical clustering, the distance between two clusters is derived from the distances between their constituent data points: a linkage criterion (such as single, complete, or average linkage) aggregates the pairwise point distances into a single cluster-to-cluster distance. The choice of point-level distance metric therefore plays a crucial role in measuring the dissimilarity between data points and, consequently, between clusters. Common distance metrics used in hierarchical clustering include:

1. Euclidean Distance:
   - Euclidean distance is the most widely used distance metric in clustering algorithms.
   - It measures the straight-line distance between two points in a multi-dimensional space.
   - It is defined as the square root of the sum of the squared differences between corresponding coordinates of the two points.

2. Manhattan Distance:
   - Manhattan distance, also known as city block distance or L1 distance, measures the distance between two points by summing the absolute differences of their coordinates.
   - It is calculated as the sum of the absolute differences between the x-coordinates and y-coordinates (and additional dimensions if applicable) of the two points.

3. Minkowski Distance:
   - Minkowski distance is a generalized distance metric that includes both Euclidean distance and Manhattan distance as special cases.
   - It is defined as the pth root of the sum of the pth powers of the absolute differences between corresponding coordinates of the two points.
   - When p=1, it reduces to Manhattan distance, and when p=2, it reduces to Euclidean distance.

4. Cosine Distance:
   - Cosine similarity is commonly used for text and document clustering. Because it is a similarity rather than a distance, it is converted to cosine distance as 1 minus cosine similarity.
   - Cosine similarity measures the cosine of the angle between two vectors representing the data points.
   - It is calculated as the dot product of the vectors divided by the product of their magnitudes.

5. Correlation Distance:
   - Correlation distance measures the dissimilarity between two data points based on their correlation coefficient.
   - It is commonly used when dealing with datasets where the relative values and relationships between variables are important.

The choice of distance metric depends on the nature of the data and the problem at hand. It is essential to select a distance metric that is appropriate for the type of data being clustered and aligns with the underlying assumptions of the analysis.
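The metrics above, and the way a linkage criterion turns them into a cluster-to-cluster distance, can be written out directly. The following is a plain-Python sketch; the function names and sample vectors are illustrative assumptions, and libraries such as SciPy provide optimized equivalents.

```python
from math import sqrt

def minkowski(u, v, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

def cosine_distance(u, v):
    """1 - cosine similarity, so 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

def correlation_distance(u, v):
    """1 - Pearson correlation: cosine distance of the mean-centred vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine_distance([a - mean_u for a in u], [b - mean_v for b in v])

def cluster_distance(A, B, linkage=min):
    """Linkage over Euclidean point distances: min = single, max = complete."""
    return linkage(minkowski(a, b, 2) for a in A for b in B)

u, v = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(u, v, p=1)  # |3| + |4|
euclidean = minkowski(u, v, p=2)  # sqrt(3^2 + 4^2)

A = [(0.0, 0.0), (0.0, 1.0)]
B = [(3.0, 0.0), (5.0, 0.0)]
single = cluster_distance(A, B, linkage=min)    # closest pair across clusters
complete = cluster_distance(A, B, linkage=max)  # farthest pair across clusters
```

Swapping the point-level metric or the linkage function changes which clusters count as "closest", and therefore the order of merges.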

Q4. How do you determine the optimal number of clusters in hierarchical clustering, and what are some
common methods used for this purpose?

Ans) Determining the optimal number of clusters in hierarchical clustering can be a challenging task. However, there are several methods commonly used to guide the selection of the optimal number of clusters:

1. Dendrogram:
   - The dendrogram is a graphical representation of the hierarchical clustering process.
   - It displays the clustering hierarchy and shows how clusters are merged at each step.
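   - By analyzing the dendrogram, one can choose the number of clusters by locating the largest vertical gap between successive merges and cutting the tree with a horizontal line inside that gap; the number of branches the line crosses is the number of clusters.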

2. Elbow Method:
   - The elbow method plots the within-cluster sum of squares (WCSS), or the linkage distance at which clusters merge, against the number of clusters and looks for the "elbow" or "knee" point in the curve.
   - The WCSS measures the compactness of the clusters.
   - The elbow is the point beyond which adding more clusters yields only marginal reductions in WCSS, indicating a good balance between the number of clusters and their compactness.

3. Silhouette Score:
   - The silhouette score is a measure of how well each data point fits into its assigned cluster.
   - It considers both the distance between a data point and other points in its cluster (cohesion) and the distance to points in other clusters (separation).
   - The silhouette score ranges from -1 to 1, where higher values indicate better-defined clusters.
   - The optimal number of clusters can be determined by maximizing the average silhouette score across all data points.

4. Gap Statistic:
   - The gap statistic compares the within-cluster dispersion of a dataset with that of a reference null distribution.
   - It measures the gap between the expected dispersion under the null model and the observed dispersion.
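   - A simple rule chooses the number of clusters that maximizes the gap statistic; the original formulation chooses the smallest k for which Gap(k) >= Gap(k+1) - s(k+1), where s(k+1) accounts for the variability of the reference samples.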

5. Calinski-Harabasz Index:
   - The Calinski-Harabasz index measures the ratio of between-cluster dispersion to within-cluster dispersion.
   - It quantifies the separation between clusters and their compactness.
   - The optimal number of clusters corresponds to the value that maximizes the Calinski-Harabasz index.

These methods provide different perspectives on the clustering structure and can help in determining the appropriate number of clusters. It is recommended to apply multiple methods and consider the consensus or majority decision across the techniques to arrive at a robust estimate of the optimal number of clusters.
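Several of these criteria can be compared on the same hierarchy. The sketch below assumes synthetic data with three well-separated groups, Ward's linkage, and scikit-learn's `silhouette_score` and `calinski_harabasz_score`; all variable names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(0)
# Synthetic data (an assumption for illustration): three well-separated blobs.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

Z = linkage(X, method="ward")  # build the hierarchy once

scores = {}
for k in range(2, 7):
    labels = fcluster(Z, t=k, criterion="maxclust")  # cut into k clusters
    scores[k] = (silhouette_score(X, labels),
                 calinski_harabasz_score(X, labels))

best_k_silhouette = max(scores, key=lambda k: scores[k][0])
best_k_ch = max(scores, key=lambda k: scores[k][1])

# Dendrogram heuristic: cut inside the largest gap between successive merges.
heights = Z[:, 2]                            # merge heights, in increasing order
gap_index = int(np.argmax(np.diff(heights)))
best_k_gap = len(X) - (gap_index + 1)        # merges kept below the cut: gap_index + 1

print(best_k_silhouette, best_k_ch, best_k_gap)
```

On clean, well-separated data the criteria tend to agree; on real data they often do not, which is why comparing several of them is recommended.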

Q5. What are dendrograms in hierarchical clustering, and how are they useful in analyzing the results?

Ans) Dendrograms are graphical representations commonly used in hierarchical clustering to visualize the clustering hierarchy and the process of merging clusters. They provide valuable insights into the structure and relationships among the data points.

A dendrogram typically consists of a tree-like structure, where each internal node represents a merge of two clusters and the leaves represent individual data points. The height at which two branches join represents the dissimilarity or distance between the clusters being merged. The higher the join, the greater the dissimilarity between the clusters.

Dendrograms are useful in analyzing the results of hierarchical clustering in several ways:

1. Cluster Identification: Dendrograms help identify the number and composition of clusters. By examining the dendrogram, one can look for large vertical gaps between successive merges to determine the number of clusters present in the data.

2. Cluster Similarity: In the usual orientation, the vertical axis of the dendrogram shows the dissimilarity or distance at which clusters merge, while the horizontal axis simply lists the data points. Clusters that are merged at a lower level of the dendrogram are more similar to each other than clusters merged at higher levels. This information helps identify the similarity or dissimilarity between different clusters.

3. Interpretation of Subclusters: Dendrograms allow for the interpretation of subclusters within larger clusters. By zooming into specific regions of the dendrogram, one can observe the subclusters formed and analyze the similarity patterns among the data points.

4. Cut-off Selection: Dendrograms provide guidance in selecting an appropriate cut-off point to form clusters. By choosing a height threshold and drawing a horizontal line across the dendrogram at that level, each branch the line crosses becomes one cluster. The choice of cut-off depends on the desired level of granularity or specificity in clustering.

Overall, dendrograms offer a visual representation of the hierarchical clustering process and enable researchers to understand the relationships and structure within the data. They serve as a powerful tool for exploratory analysis and provide insights that can guide subsequent data analysis and decision-making processes.
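A short sketch of working with a dendrogram in SciPy follows. The data, the average linkage, and the cut height of 5.0 are assumptions chosen for illustration; `no_plot=True` is used so the example runs without a plotting backend.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Toy data (assumed): two groups that are far apart.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0],
              [10.0, 10.0], [10.0, 11.2], [12.5, 10.0]])
Z = linkage(X, method="average")

# no_plot=True returns the tree layout without drawing it; remove the
# argument (with matplotlib installed) to render the figure.
tree = dendrogram(Z, no_plot=True)
leaf_order = tree["ivl"]   # leaf labels in left-to-right order
merge_heights = Z[:, 2]    # height of each merge, in increasing order

# Cutting at height 5.0 keeps every merge below that height and
# discards the final, much higher merge that joins the two groups.
labels = fcluster(Z, t=5.0, criterion="distance")
print(leaf_order, merge_heights, labels)
```

Reading `merge_heights` alongside the plot makes the cut-off choice concrete: a large jump between consecutive heights marks a natural place to cut.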

Q6. Can hierarchical clustering be used for both numerical and categorical data? If yes, how are the
distance metrics different for each type of data?

Ans) Yes, hierarchical clustering can be used for both numerical and categorical data. However, the choice of distance metric differs depending on the type of data being clustered.

For numerical data:
1. Euclidean Distance: This is the most common distance metric used for numerical data in hierarchical clustering. It measures the straight-line distance between two data points in the multidimensional space.

2. Manhattan Distance: Also known as city block distance or L1 distance, it measures the sum of the absolute differences between the coordinates of two data points. Because differences are not squared, it is less sensitive to outliers than Euclidean distance.

3. Cosine Distance: Defined as 1 minus cosine similarity, where cosine similarity is the cosine of the angle between two vectors and represents the similarity in their directions. It is commonly used when the magnitude of the vectors is not important, but the orientation or relative angle matters.

For categorical data:
1. Jaccard Distance: It measures the dissimilarity between two sets as 1 minus the size of their intersection divided by the size of their union. It is commonly used for binary or presence/absence data.

2. Hamming Distance: It measures the number (or proportion) of positions at which the corresponding elements of two records differ. It is suitable for categorical data in which every record has the same number of attributes.

3. Gower's Distance: It is a generalized distance metric that can handle a mix of categorical and numerical variables. It computes a per-variable dissimilarity suited to each data type (for example, a simple mismatch for categorical variables and a range-scaled absolute difference for numerical ones) and averages them, giving a value between 0 and 1.

When dealing with a combination of numerical and categorical data, one approach is to preprocess the data by transforming the categorical variables into numerical representations, such as one-hot encoding (or ordinal encoding when the categories have a natural order), and then use an appropriate distance metric for the transformed data.

It is important to choose the appropriate distance metric based on the data type to ensure meaningful clustering results. The selection of the distance metric depends on the nature of the data and the desired interpretation of similarity or dissimilarity between the data points.
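The categorical and mixed-type distances can be sketched in plain Python as follows. The function names, the `numeric_ranges` parameter, and the sample records are hypothetical, and the Gower function is a simplified version that assumes every numeric range is non-zero and no values are missing.

```python
def hamming_distance(u, v):
    """Fraction of attributes on which two equal-length records disagree."""
    if len(u) != len(v):
        raise ValueError("records must have the same number of attributes")
    return sum(a != b for a, b in zip(u, v)) / len(u)

def jaccard_distance(s, t):
    """1 - |intersection| / |union| for two sets."""
    s, t = set(s), set(t)
    if not s and not t:
        return 0.0  # convention assumed here: two empty sets are identical
    return 1 - len(s & t) / len(s | t)

def gower_distance(u, v, numeric_ranges):
    """Simplified Gower distance for mixed records.

    `numeric_ranges` maps the index of each numeric attribute to its range
    (max - min over the dataset); other attributes are treated as categorical.
    """
    total = 0.0
    for i, (a, b) in enumerate(zip(u, v)):
        if i in numeric_ranges:
            total += abs(a - b) / numeric_ranges[i]  # range-scaled difference
        else:
            total += 0.0 if a == b else 1.0          # simple mismatch
    return total / len(u)

# Sample records (assumed): colour, size, and a numeric measurement.
a = ("red", "small", 20.0)
b = ("red", "large", 50.0)
d_hamming = hamming_distance(a[:2], b[:2])
d_jaccard = jaccard_distance({"x", "y", "z"}, {"y", "z", "w"})
d_gower = gower_distance(a, b, numeric_ranges={2: 60.0})
```

A matrix of such pairwise distances can then be passed to a hierarchical clustering routine that accepts precomputed distances, typically with single, complete, or average linkage rather than a variance-based linkage.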

Q7. How can you use hierarchical clustering to identify outliers or anomalies in your data?

Ans) Hierarchical clustering can be used to identify outliers or anomalies in data by examining the structure of the resulting dendrogram. Here's a general approach to identify outliers using hierarchical clustering:

1. Perform hierarchical clustering: Apply a hierarchical clustering algorithm (e.g., agglomerative or divisive) to your dataset using an appropriate distance metric and linkage criterion.

2. Visualize the dendrogram: Plot the dendrogram, which represents the hierarchy of clusters formed during the clustering process. Each data point is represented by a leaf node in the dendrogram.

3. Identify outlier clusters: Look for clusters in the dendrogram that contain far fewer data points than the others, particularly singletons or tiny groups that join the rest of the tree only at a large height. These small clusters are potential candidates for outliers.

4. Set a threshold: Determine a threshold for the minimum number of data points that define a cluster as an outlier. This threshold can be determined based on domain knowledge or statistical considerations.

5. Identify outlier data points: Traverse the dendrogram starting from the leaves and moving upwards. When encountering a cluster with fewer data points than the threshold, consider all the data points within that cluster as outliers.

6. Analyze outliers: Once the outliers are identified, analyze and investigate them further. Assess whether they are genuine anomalies or erroneous data points. Understanding the reasons behind outliers can provide valuable insights into the data quality or uncover interesting patterns.

It's important to note that the effectiveness of using hierarchical clustering for outlier detection depends on the characteristics of the data and the clustering algorithm's parameters. Different linkage criteria and distance metrics may lead to different clustering results and, consequently, different outlier identification. Therefore, it is crucial to select appropriate parameters and interpret the results with caution, considering the specific context and domain knowledge.
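The approach above can be sketched as follows. The synthetic data, single linkage, the cut height of 2.0, and the `min_cluster_size` threshold are all assumptions made for illustration; in practice they should come from domain knowledge or from inspecting the dendrogram.

```python
import numpy as np
from collections import Counter
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Synthetic data (assumed): two dense groups plus two isolated points.
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(20, 2))
isolated = np.array([[12.0, -6.0], [-8.0, 10.0]])
X = np.vstack([group_a, group_b, isolated])

# Steps 1-2: build the hierarchy and cut it at a chosen height.
Z = linkage(X, method="single")
labels = fcluster(Z, t=2.0, criterion="distance")

# Steps 3-5: flag every point whose cluster is smaller than the threshold.
min_cluster_size = 3
sizes = Counter(labels)
outlier_idx = [i for i, lab in enumerate(labels) if sizes[lab] < min_cluster_size]
print(outlier_idx)
```

The flagged indices are only candidates: step 6 still applies, and a different linkage or cut height can change which points are flagged.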