# Big Data Lecture Notes: Clustering

## Part 1: Clustering Fundamentals

### What is Clustering? 🧐

Clustering is a type of **unsupervised learning**, which means we use it when our data has no target attribute or predefined labels. The goal is to explore the data to find intrinsic structures or groups within it.

In simple terms, clustering is the task of **grouping a set of objects** in such a way that objects in the same group (called a **cluster**) are more **similar** to each other than to those in other groups.

* **High Intra-cluster Similarity**: Data points within the same cluster are very similar.
* **Low Inter-cluster Similarity**: Data points in different clusters are dissimilar.


This technique is used in many fields, with popular applications including:
* **Market Segmentation**: Grouping customers with similar behaviors for targeted marketing.
* **Social Network Analysis**: Finding communities of users with similar interests.
* **Recommendation Engines**: Suggesting items to users based on the preferences of similar user clusters.
* **Anomaly Detection**: Identifying outliers that don't fit into any group.

### How Do We Measure Similarity?

The core of clustering is figuring out how "similar" or "close" two data points are. This is typically done using a **distance measure**. The shorter the distance between two objects, the more similar they are.

A very common distance measure is the **Euclidean Distance**.

$dist(x_{i},x_{j})=\sqrt{\sum_{k=1}^{n}(x_{ik}-x_{jk})^{2}}$
* **Note**: This formula calculates the straight-line distance between two points in n-dimensional space.
* $x_i$ and $x_j$: Two different data points.
* $n$: The number of attributes (or dimensions) each data point has.
* $x_{ik}$: The value of the k-th attribute for data point $x_i$.
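
The formula above translates almost directly into code. The following is a minimal sketch using only the Python standard library; the function name `euclidean_distance` is an illustrative choice, not something from the lecture.

```python
import math

def euclidean_distance(x_i, x_j):
    """Straight-line distance between two points with the same number of attributes."""
    if len(x_i) != len(x_j):
        raise ValueError("points must have the same number of attributes")
    # Sum the squared per-attribute differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))  # 5.0
```

In the example, the differences are 3 and 4, so the distance is $\sqrt{9 + 16} = 5$.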

---

## Part 2: Clustering Algorithms

There are many different ways to group data. Two main categories of clustering techniques are partitional and hierarchical.

* **Partitional Clustering**: This approach divides the dataset into a set of non-overlapping clusters. Each data point belongs to exactly one cluster. K-Means is the most famous example.
* **Hierarchical Clustering**: This method creates a tree of clusters, known as a dendrogram.
    * **Agglomerative**: A "bottom-up" approach where each data point starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
    * **Divisive**: A "top-down" approach where all data points start in one cluster, and splits are performed recursively as one moves down the hierarchy.
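
To make the "bottom-up" idea concrete, here is a rough sketch of agglomerative clustering. It assumes single linkage (the distance between two clusters is the distance between their closest pair of points), which is only one of several possible merge rules and is not specified in these notes. The function names are hypothetical, and the brute-force search is meant for small examples only.

```python
import math

def single_linkage(a, b):
    """Cluster distance = distance between the closest pair of points (an assumed rule)."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points, num_clusters):
    """Merge the two closest clusters repeatedly until `num_clusters` remain."""
    clusters = [[tuple(p)] for p in points]  # every point starts in its own cluster
    while len(clusters) > num_clusters:
        # Find the pair of clusters (i < j) with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(agglomerative(points, 2))
```

Recording the order of the merges (instead of stopping at a fixed number of clusters) is what would produce the dendrogram.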

### The K-Means Algorithm

K-Means is a simple and widely used **partitional clustering** algorithm. Its goal is to partition a dataset into a user-specified number of clusters, **k**.

Each cluster is represented by its center point, called the **centroid**.

#### How K-Means Works
After an initialization step, the algorithm alternates between two main steps (assignment and update) until the clusters are stable:

1.  **Initialization**: First, you choose the number of clusters, **k**. Then, **k** initial centroids are chosen (this can be done randomly or using a more advanced method).
2.  **Assignment Step**: Each data point is assigned to the cluster whose centroid is closest (using a distance measure like Euclidean distance).
3.  **Update Step**: After all points are assigned, the centroid of each cluster is recalculated by taking the mean of all data points belonging to that cluster.
4.  **Repeat**: Steps 2 and 3 are repeated until a stopping criterion is met. Usually this is when the cluster assignments no longer change; a cap on the number of iterations is a common safeguard.
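
The four steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it assumes random initialization from the data points, Euclidean distance, and a hypothetical `max_iter` safeguard, and it simply keeps the old centroid if a cluster ends up empty.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-Means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Step 2 (assignment): each point goes to the cluster with the closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Step 4 (stopping criterion): stop when no assignment changed.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3 (update): each centroid becomes the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```

On this toy dataset the first three points end up in one cluster and the last three in the other; which cluster gets index 0 depends on the random initialization.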


#### Evaluating K-Means Clusters
A common way to evaluate the quality of the final clusters is the **Sum of Squared Error (SSE)**.

$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)$
* **Note**: This formula measures the compactness of the clusters. It sums the squared distances of each point ($x$) to the centroid ($m_i$) of its assigned cluster ($C_i$).
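* A **lower SSE** value means the data points are closer to their centroids, indicating better-defined, denser clusters. This comparison is most meaningful between runs that use the same **k**, because adding more clusters generally lowers SSE on its own (it reaches zero when every point is its own cluster).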
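
As a small worked example of the SSE formula, the sketch below assumes the clustering is given as a list of centroids plus one cluster index per point (a representation chosen here for illustration).

```python
import math

def sse(points, centroids, assignments):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments))

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
centroids = [(1.0, 0.0), (11.0, 0.0)]
assignments = [0, 0, 1, 1]
print(sse(points, centroids, assignments))  # 4.0
```

Every point here is exactly 1 unit from its centroid, so the four squared distances sum to 4.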

#### How to Choose the Optimal 'k'
A major challenge with K-Means is that you have to choose **k** (the number of clusters) in advance. How do you know the right number?

* **The Elbow Method**: One popular technique is to run the K-Means algorithm for a range of *k* values (e.g., from 1 to 10) and calculate the SSE for each. When you plot SSE against *k*, the graph often looks like an arm. The "elbow" on the arm is the point where the rate of SSE decrease slows down significantly, and this is considered a good choice for *k*.
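
A rough sketch of the elbow procedure follows. The helper `kmeans_sse` is a compact, assumed K-Means implementation (random initialization, Euclidean distance) written only so the example is self-contained; `sse_curve` is likewise a hypothetical name. In practice the resulting values would be plotted against *k*.

```python
import math
import random

def kmeans_sse(points, k, max_iter=100, seed=0):
    """Run a basic K-Means and return the SSE of the final clustering."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments are stable
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, a in zip(points, labels) if a == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, labels))

def sse_curve(points, k_values):
    """Map each candidate k to the SSE obtained by K-Means with that k."""
    return {k: kmeans_sse(points, k) for k in k_values}

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
for k, value in sse_curve(points, range(1, 5)).items():
    print(k, round(value, 2))
```

For this toy dataset, which has two well-separated groups, the drop in SSE from *k* = 1 to *k* = 2 is far larger than any later drop, so the elbow sits at *k* = 2.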


#### Pros and Cons of K-Means
* **Pros**:
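    * Simple and fast: each iteration costs roughly $O(n \cdot k \cdot d)$ for $n$ points, $k$ clusters, and $d$ attributes, so it scales well to large datasets.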
    * Easy to implement.
* **Cons**:
    * You must specify the number of clusters, **k**, beforehand.
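    * The final result is sensitive to the random choice of initial centroids: different starting points can converge to different final clusterings.
    * It doesn't identify outliers. Every point is assigned to some cluster, and because centroids are means, outliers can pull them away from the bulk of the data.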

---

## Part 3: Parallel K-Means Clustering

For very large datasets, we can speed up the K-Means algorithm by running it in parallel across multiple machines or processors.

### Data Parallelism

In this approach, the **dataset is partitioned** (usually horizontally, by rows) and distributed among processors.
* Each processor runs the assignment step on its own subset of the data, using the same set of current centroids as every other processor.
* At the end of each iteration, the processors need to communicate. Each processor shares the per-cluster **sum and count** of the data points in its local clusters. This information is aggregated to calculate the **global new centroids**.
* These new, updated centroids are then broadcast back to all processors for the next assignment step.
* A key feature is that **data points do not move between processors**; they only change cluster membership within their local processor.
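
The sum-and-count exchange can be illustrated with a single-process simulation, where each "processor" is just a list of points. This is a sketch under that assumption: the function names are hypothetical, and a real system would run `local_stats` concurrently and use an actual communication layer for the aggregation and broadcast.

```python
import math

def local_stats(partition, centroids):
    """Work of one processor: assign local points, return per-cluster sums and counts."""
    k, dims = len(centroids), len(centroids[0])
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for p in partition:
        c = min(range(k), key=lambda i: math.dist(p, centroids[i]))
        counts[c] += 1
        for d in range(dims):
            sums[c][d] += p[d]
    return sums, counts

def global_update(all_stats, centroids):
    """Combine the (sums, counts) of all processors into the new global centroids."""
    k, dims = len(centroids), len(centroids[0])
    new_centroids = []
    for c in range(k):
        total = sum(counts[c] for _, counts in all_stats)
        if total == 0:
            new_centroids.append(tuple(centroids[c]))  # empty cluster: keep old centroid
            continue
        new_centroids.append(
            tuple(sum(sums[c][d] for sums, _ in all_stats) / total for d in range(dims))
        )
    return new_centroids

# Two simulated processors, each holding its own rows of the dataset.
partitions = [
    [(0.0, 0.0), (10.0, 10.0)],
    [(0.0, 2.0), (10.0, 12.0)],
]
centroids = [(1.0, 1.0), (9.0, 9.0)]
stats = [local_stats(part, centroids) for part in partitions]  # would run in parallel
centroids = global_update(stats, centroids)  # result is broadcast to all processors
print(centroids)  # [(0.0, 1.0), (10.0, 11.0)]
```

Only `k` sums and `k` counts leave each processor per iteration, which is why the data points themselves never need to move.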

### Result Parallelism

This approach focuses on **partitioning the clusters themselves**.
* Each processor is made responsible for one (or more) of the **k** final clusters.
* In each iteration, when a data point is assigned to a new cluster, it may need to be **physically moved to the processor** that "owns" that cluster.
* Because all data points for a given cluster reside on a single processor, recalculating the centroid is straightforward and needs only local data. The updated centroids still have to be made known to the other processors for the next assignment step, and this method can involve significant data shuffling between processors.
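
The following single-process sketch simulates one iteration of result parallelism. It assumes that `owned[c]` stands for the points currently stored on the processor that owns cluster `c`; the function names and the `moved` counter are illustrative additions, and a real system would replace the list operations with network transfers.

```python
import math

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(d) / len(points) for d in zip(*points))

def result_parallel_iteration(owned, centroids):
    """Reassign points (moving them between owners), then update centroids locally."""
    k = len(owned)
    new_owned = [[] for _ in range(k)]
    moved = 0
    for c in range(k):
        for p in owned[c]:
            target = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            new_owned[target].append(p)  # the point now lives on the target's processor
            if target != c:
                moved += 1  # this point had to be shipped to another processor
    # Each owner recomputes its centroid from the points it now holds.
    new_centroids = [
        mean(new_owned[c]) if new_owned[c] else tuple(centroids[c]) for c in range(k)
    ]
    return new_owned, new_centroids, moved

owned = [[(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)], [(10.0, 12.0)]]
centroids = [(3.0, 4.0), (10.0, 12.0)]
owned, centroids, moved = result_parallel_iteration(owned, centroids)
print(owned)      # [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
print(centroids)  # [(0.0, 1.0), (10.0, 11.0)]
print(moved)      # 1
```

The `moved` count is a stand-in for the data shuffling cost: here the point (10, 10) changes owner, while the centroid update itself touches only local data.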