# Introduction to Clustering Methods and Distances

Clustering represents a form of unsupervised learning wherein the primary aim is to discern patterns inherent within unlabeled data. This technique is predominantly employed to partition vast datasets into distinct subgroups, facilitating informed decision-making. Clustering algorithms operate by segregating data into disparate clusters, each characterized by similar features, yet markedly distinct from data points in other clusters.

## Clustering Types

Clustering algorithms can employ either a hard or soft methodology for classifying data points. In the former approach, data points are unequivocally assigned to a single cluster, while in the latter, probabilities are computed to ascertain the likelihood of a data point belonging to each cluster. Based on the principle of similarity between data points, clustering algorithms can be categorized into several distinct groups:

1. **Connectivity-based Models:**
   These models rely on spatial proximity within the data space to gauge similarity. In the divisive (top-down) variant, all data points start in a single cluster that is iteratively split into smaller clusters. In the agglomerative (bottom-up) variant, each data point starts in its own cluster and the nearest clusters are merged step by step. Hierarchical clustering exemplifies this methodology.

2. **Density-based Models:**
   Clusters are delineated based on data point density within the data space. High-density regions represent clusters, typically delineated from one another by regions of lower density. The DBSCAN algorithm is illustrative of this approach.

3. **Distribution-based Models:**
   Models in this category predicate cluster identification on the assumption that all data points within a cluster adhere to a shared distribution, such as a Gaussian distribution. Gaussian Mixture Models (GMM) typify this methodology, positing that data points arise from a blend of distinct Gaussian distributions.

4. **Centroid-based Models:**
   These models define a centroid for each cluster and update it through an iterative process. Each data point is allocated to the cluster whose centroid is nearest, that is, where its distance to the centroid is minimized. The k-means algorithm serves as a prominent example of this approach.
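
The centroid-based procedure described in the last item can be sketched in a few lines of plain Python. This is a minimal illustration only: the function name `k_means`, its parameters, the random initialization, and the toy data are assumptions made for this example, and library implementations (such as scikit-learn's `KMeans`) use more careful initialization and stopping rules.

```python
import math
import random


def k_means(points, k, n_iters=100, seed=0):
    """Minimal k-means sketch: returns (labels, centroids).

    points: list of equal-length tuples of numbers.
    Initial centroids are k distinct points chosen at random (an assumption
    of this sketch; other initialization schemes exist).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Two well-separated toy groups (made-up data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

This is a hard clustering: every point receives exactly one label. A soft method such as a Gaussian Mixture Model would instead return a probability per cluster for each point.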


## Distance Metrics

Understanding distance metrics is crucial in clustering algorithms because they quantify the dissimilarity or similarity between data points, forming the foundation for cluster formation. Various measures, such as Euclidean distance, Manhattan distance, and cosine similarity, offer distinct perspectives on data point separation. The most commonly used are Euclidean distance for numerical data and Hamming distance for categorical data.

### Euclidean Distance

<center><img src="./imgs/euc.png"/></center>

Euclidean distance is a widely recognized metric that measures the straight-line distance between two points in a multidimensional space. It is familiar to most readers because of its intuitive geometric interpretation. Its value is always a non-negative real number, and it is zero only when the two points coincide. Mathematically, the Euclidean distance between two points, $x$ and $y$, in an $n$-dimensional space is calculated as the square root of the sum of squares of differences between corresponding coordinates:

$$ d_n(x,y) =  \sqrt {(x_1-y_1)^2 + (x_2-y_2)^2 + (x_3-y_3)^2 + \ldots + (x_n-y_n)^2} $$

For higher dimensions, the Euclidean distance computation involves adding the squared differences between each pair of coordinates and then taking the square root. Its versatility and ease of interpretation make it a fundamental tool in various fields, particularly in clustering algorithms where it serves as a crucial component for measuring dissimilarity between data points.
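
As a small illustration of the formula above, the following sketch computes the Euclidean distance in plain Python. The function name `euclidean_distance` is chosen for this example only.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two equal-length coordinate sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of coordinates")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```

On Python 3.8 and later, `math.dist(x, y)` from the standard library computes the same quantity.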

### Manhattan Distance

<center><img src="./imgs/man.png"/></center>

Manhattan distance measures the distance between two points as the sum of the absolute differences of their Cartesian coordinates. Unlike Euclidean distance, which calculates the straight-line distance using the Pythagorean theorem, Manhattan distance adds up the distance travelled along each axis, akin to a taxi navigating among blocks of buildings.

$$d_n(x,y) = \sum_{i=1}^{n}{|(x_i-y_i)|}$$

Mathematically, the Manhattan distance between two points, $x$ and $y$, in an $n$-dimensional space is computed as the sum of the absolute differences between corresponding coordinates. This metric is particularly useful in scenarios where movement is constrained to a grid-like pattern, such as city navigation or grid-based data representation.
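
A minimal sketch of the same computation in Python, with a function name (`manhattan_distance`) chosen for this example:

```python
def manhattan_distance(x, y):
    """Sum of absolute coordinate differences between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of coordinates")
    return sum(abs(a - b) for a, b in zip(x, y))


# The taxi travels 3 blocks along one axis and 4 along the other.
print(manhattan_distance([0, 0], [3, 4]))  # 7
```

For the same pair of points the Euclidean distance is 5, so the grid-constrained path is longer than the straight line.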

### Minkowski Distance

The Minkowski distance is a generalized distance metric calculated using the following formula:

$$d_n(x,y) = {\biggl(\sum_{i=1}^{n}|(x_i-y_i)|^p\biggr)}^{1/p}$$

This distance metric can be tuned by substituting different values of $p$, which yields different ways to compute the distance between two data points. Consequently, Minkowski distance is also referred to as the $L_p$ norm distance. Common values of $p$ yield well-known distance measures:

- $p = 1$: Manhattan Distance
- $p = 2$: Euclidean Distance
- $p = \infty$: Chebyshev Distance

The flexibility of the Minkowski distance makes it a versatile tool in diverse applications, offering different perspectives on data similarity or dissimilarity depending on the chosen value of $p$.
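
The special cases listed above can be checked with a short sketch. The function name `minkowski_distance` is illustrative, and treating `p = float("inf")` as the Chebyshev (maximum-difference) case is a convention adopted here for the limit $p \to \infty$.

```python
import math


def minkowski_distance(x, y, p):
    """Minkowski (L_p) distance for p >= 1; p = inf gives Chebyshev distance."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of coordinates")
    if p < 1:
        raise ValueError("p must be at least 1")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        # Limit p -> infinity: the largest coordinate difference dominates.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)


x, y = [0, 0], [3, 4]
print(minkowski_distance(x, y, 1))             # Manhattan
print(minkowski_distance(x, y, 2))             # Euclidean
print(minkowski_distance(x, y, float("inf")))  # Chebyshev
```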

### Hamming Distance

<center><img src="./imgs/hamm.png"/></center>

Hamming distance is a metric used to measure the distance between data points described by categorical (nominal) variables. It counts the number of positions at which two equal-length sequences, such as binary strings, differ. Because categorical variables lack a natural ordering, a simple match or mismatch per position makes Hamming distance particularly suitable for assessing the similarity between categorical data points.

When dealing exclusively with categorical features in a dataset, Hamming distance serves as a valuable tool for measuring the similarity between two data points. However, it's essential to note that Hamming distance can only be computed when comparing vectors of equal length; comparisons between vectors of unequal lengths are not feasible.
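
A minimal sketch of this count in Python follows. The function name `hamming_distance` and the sample values are made up for illustration; the length check mirrors the equal-length requirement noted above.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(a != b for a, b in zip(x, y))


# Binary strings that differ in the third and fifth positions.
print(hamming_distance("10110", "10011"))  # 2

# Categorical feature vectors that differ only in the second feature.
print(hamming_distance(["red", "small", "round"], ["red", "large", "round"]))  # 1
```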

### Gower Distance

Gower (1971) distance is a hybrid measure designed to accommodate both continuous and categorical data. 

For continuous or ordinal data features, Gower distance employs either the Manhattan distance or a ranked ordinal Manhattan distance, respectively. On the other hand, for categorical data features, it utilizes the DICE coefficient, which quantifies the similarity between two sets.

The DICE coefficient is calculated as:

$$DICE = \frac{2|X\cap Y|}{|X|+|Y|} = \frac{2TP}{2TP + FP + FN}$$

where $X$ and $Y$ represent two sets, and $TP$, $FP$, and $FN$ denote True Positives, False Positives, and False Negatives, respectively.

The Gower distance between a pair of points $p$ and $q$, denoted as $G_n(p,q)$, is computed as:

$$G_n(p,q) = \frac{\sum_{k=1}^{n}W_{pqk}S_{pqk}}{\sum_{k=1}^{n}W_{pqk}}$$

Here, $S_{pqk}$ represents either the Manhattan distance or the DICE coefficient for feature $k$, while $W_{pqk}$ is a binary indicator (1 or 0) denoting the validity of feature $k$. The Gower distance is calculated as the sum of feature scores divided by the sum of feature weights, providing a comprehensive measure of dissimilarity between data points with mixed data types.
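
The weighted average above can be sketched as follows, under several stated assumptions. Numeric features are scored with the absolute difference divided by the feature's range (a range-normalized Manhattan term, so each score lies between 0 and 1). Each categorical feature holds a single value, in which case one minus the DICE coefficient reduces to a plain match (0) or mismatch (1). A value of `None` is treated as missing, giving $W_{pqk} = 0$. The function name `gower_distance` and the `numeric_ranges` argument are hypothetical names for this example.

```python
def gower_distance(p, q, numeric_ranges):
    """Gower-style dissimilarity between two mixed-type records.

    numeric_ranges maps a feature index to that feature's range (max - min)
    over the dataset; any index not listed is treated as categorical.
    """
    total_score = 0.0
    total_weight = 0.0
    for k, (a, b) in enumerate(zip(p, q)):
        if a is None or b is None:
            continue  # W_pqk = 0: feature k cannot be compared for this pair
        if k in numeric_ranges:
            value_range = numeric_ranges[k]
            score = abs(a - b) / value_range if value_range else 0.0
        else:
            score = 0.0 if a == b else 1.0
        total_score += score   # W_pqk * S_pqk with W_pqk = 1
        total_weight += 1.0
    if total_weight == 0:
        raise ValueError("no comparable features between p and q")
    return total_score / total_weight


# Made-up records: (age, income, favourite colour).
ranges = {0: 40, 1: 80000}
print(gower_distance((30, 50000, "red"), (40, 70000, "blue"), ranges))  # 0.5
print(gower_distance((30, 50000, "red"), (40, None, "blue"), ranges))   # 0.625
```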

### Cosine Similarity

Cosine similarity is a widely utilized metric that addresses the challenges of high dimensionality often encountered with Euclidean distance. It quantifies the similarity between two vectors by measuring the cosine of the angle between them. Notably, cosine similarity remains invariant to the magnitude of the vectors, focusing solely on their orientation.

Mathematically, the cosine similarity between two vectors, $x$ and $y$, is computed as:

$$d_{(x,y)} = \cos(\theta) = \frac{x \cdot y}{||x|| \ ||y||}$$

Here, $x \cdot y$ represents the dot product of the vectors, while $||x||$ and $||y||$ denote their respective magnitudes. 

Cosine similarity finds extensive application in scenarios involving high-dimensional data, where the magnitude of vectors is insignificant. In text analyses, for instance, it is commonly employed when data is represented by word counts, allowing for effective comparison and similarity assessment.
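
The invariance to magnitude can be seen in a short sketch. The function name `cosine_similarity` and the word-count vectors are made up for this example; raising an error for zero vectors is a choice made here, since the angle is undefined in that case.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of components")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_x * norm_y)


# Word counts for a short document and for the same text repeated twice:
# the vectors differ in length but point in the same direction.
short_doc = [1, 2, 3]
long_doc = [2, 4, 6]
print(cosine_similarity(short_doc, long_doc))  # approximately 1

# Orthogonal vectors share no direction at all.
print(cosine_similarity([1, 0], [0, 1]))  # 0.0
```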

### Haversine Distance

<center><img src="./imgs/hav.png"/></center>

Haversine distance calculates the distance between two points on a sphere, typically represented by their longitudes and latitudes. Like Euclidean distance, it measures the shortest path between two points. Unlike Euclidean distance, that path runs along the surface of the sphere (the great-circle distance) rather than in a straight line through it, so the curvature of the Earth's surface is taken into account.

The formula for calculating Haversine distance between two points is:

$$ d = 2\ r\ \sin^{-1}\biggl(\sqrt{\sin^2\biggl(\frac{\phi_2-\phi_1}{2}\biggr) + \cos(\phi_1)\cos(\phi_2)\sin^2\biggl(\frac{\lambda_2-\lambda_1}{2}\biggr)}\biggr)$$

Here, $r$ denotes the radius of the sphere, $\phi_1$ and $\phi_2$ represent the latitudes of the two points, and $\lambda_1$ and $\lambda_2$ represent their longitudes.

One notable disadvantage of Haversine distance is its assumption that the points lie on a perfect sphere. In reality, the Earth's shape is more complex, which can lead to inaccuracies in distance calculations, especially over long distances or in regions with significant topographical variations.
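
The formula translates directly into Python. In this sketch the function name `haversine_distance` is illustrative, inputs are assumed to be in degrees, and the default radius of 6371 km is an assumed mean Earth radius; any other radius can be passed in.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius in kilometres


def haversine_distance(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1 to guard against tiny floating-point overshoot before asin.
    return 2 * radius * math.asin(math.sqrt(min(1.0, a)))


# Two points on the equator, 90 degrees of longitude apart:
# a quarter of the sphere's circumference, i.e. radius * pi / 2.
print(haversine_distance(0, 0, 0, 90))
```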

### Jaccard Distance

<center><img src="./imgs/jac.png"/></center>

The Jaccard index, also known as the Intersection over Union, serves as a metric for assessing the similarity and diversity of sample sets. It quantifies the similarity between two sets by measuring the ratio of the size of their intersection to the size of their union. In essence, it represents the proportion of common entities between sets relative to their total number of entities.

The Jaccard index, denoted as $J(x,y)$, is calculated as:

$$J(x,y) = \frac{|x \cap y|}{|x \cup y|}$$

To derive the Jaccard distance, we subtract the Jaccard index from 1:

$$D_{(x,y)} = 1 - J(x,y) = 1 - \frac{|x \cap y|}{|x \cup y|}$$

One notable drawback of the Jaccard index is its sensitivity to dataset size. Large datasets can disproportionately impact the index, potentially skewing results by inflating the union while maintaining a relatively constant intersection. Despite this limitation, the Jaccard index finds widespread application in scenarios involving binary or binarized data, such as image segmentation accuracy assessment in deep learning models or text similarity analysis. It enables the comparison of sets of patterns, facilitating tasks ranging from image analysis to natural language processing.
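
Both quantities can be computed with Python's built-in sets. The function names below are chosen for this example, and returning an index of 1 for two empty sets is a convention assumed here, since the ratio is otherwise undefined.

```python
def jaccard_index(x, y):
    """Size of the intersection divided by the size of the union."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        return 1.0  # assumed convention: two empty sets are treated as identical
    return len(x & y) / len(union)


def jaccard_distance(x, y):
    """One minus the Jaccard index."""
    return 1 - jaccard_index(x, y)


a = {"a", "b", "c"}
b = {"b", "c", "d"}
print(jaccard_index(a, b))     # 0.5  (2 shared items out of 4 distinct items)
print(jaccard_distance(a, b))  # 0.5
```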


# Importance of Feature Scaling

Feature scaling is crucial as it ensures that all features have the same scale. Many machine learning algorithms, including KNN, rely on distance-based metrics to assess the similarity between data points. When features are on different scales, the algorithm may assign more weight to features with larger scales, potentially resulting in biased or inaccurate outcomes.

Let's consider an example of distance calculation using two features whose magnitudes/ranges vary greatly:

$$\text{Euclidean distance} = \sqrt{(820000 - 325000)^2 + (3.75 - 0.50)^2}$$

From the above equation, it's evident that features with high magnitudes contribute significantly more than those with lower magnitudes. Thus, normalizing the data to a range of 0-1 is recommended for better results.

One common method to achieve this normalization is min-max scaling, available as sklearn's MinMaxScaler, which applies the following transformation to each feature:

$$ X_{\text{scaled}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}$$

**Note:** Min-max normalization is used here instead of standardization because normalization does not assume that the data follow any specific distribution.
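
The effect of scaling on the distance can be sketched in plain Python. The helper `min_max_scale` and the three-row dataset are made up for illustration; only the first two rows reuse the values from the example above. In practice, sklearn's `MinMaxScaler` applies the same formula to each feature column.

```python
import math


def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant feature: no spread to rescale
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical features on very different scales.
feature_a = [325000, 820000, 500000]
feature_b = [0.50, 3.75, 2.00]

# Distance between the first two rows before scaling:
# the large-magnitude feature dominates almost completely.
print(math.dist((feature_a[0], feature_b[0]), (feature_a[1], feature_b[1])))

# After scaling, both features contribute equally.
scaled_a = min_max_scale(feature_a)
scaled_b = min_max_scale(feature_b)
print(math.dist((scaled_a[0], scaled_b[0]), (scaled_a[1], scaled_b[1])))
```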

# Curse of Dimensionality

The Curse of Dimensionality occurs when your data has too many features. Having an excessive number of features makes it challenging to cluster observations effectively. This happens because with too many dimensions, every observation in the dataset seems equidistant from all the others.
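
The equidistance effect can be observed with a small simulation. This is an illustrative sketch only: the function name `relative_contrast`, the uniform random data, and the sample sizes are assumptions made for this example. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """(max distance - min distance) / min distance from one point to the rest."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = points[0]
    dists = [math.dist(query, p) for p in points[1:]]
    return (max(dists) - min(dists)) / min(dists)


# As the number of dimensions grows, the contrast shrinks:
# the nearest and farthest neighbours become almost equally far away.
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(200, n_dims), 3))
```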

Clustering relies on distance measures like Euclidean distance to gauge the similarity between observations. When distances are nearly equal for all observations, it becomes problematic. In such cases, all observations seem equally similar (and dissimilar) to each other, making it impossible to form meaningful clusters.

# Applications of Clustering

Clustering finds applications across diverse domains due to its versatility. Some of the most common applications of clustering include:

- **Recommendation engines:** Clustering helps in grouping similar users or items to provide personalized recommendations.

- **Market segmentation:** Clustering aids in dividing customers into distinct groups based on their characteristics, allowing businesses to tailor marketing strategies accordingly.

- **Social network analysis:** Clustering helps identify communities or groups within social networks, enabling analysis of network structures and behaviors.

- **Search result grouping:** Clustering assists in organizing search results into coherent groups, enhancing user experience and information retrieval.

- **Medical imaging:** Clustering techniques are utilized for segmenting medical images to identify structures or anomalies, aiding in diagnosis and treatment planning.

- **Image segmentation:** Clustering is employed to partition images into meaningful regions or objects, facilitating tasks like object recognition and image understanding.

- **Anomaly detection:** Clustering helps in detecting outliers or anomalies in datasets, highlighting potentially unusual or suspicious instances for further investigation.

