**III. Unsupervised Learning Algorithms**

The fundamental difference from supervised learning is that in unsupervised learning, we work with data that **does not have predefined labels or target outcomes**. The goal is to explore the data to find some intrinsic structure, patterns, or relationships within it.

This section covers one of the most popular clustering algorithms:

**Clustering - K-Means**

**1. Introduction to Unsupervised Learning**

* **Goal:** To discover hidden patterns, structures, or groupings in unlabeled data. We are not trying to predict a specific output based on input features because we don't have a known output to learn from.
* **Types of Unsupervised Learning Tasks:**
    * **Clustering:** Grouping similar data points together based on their characteristics. (e.g., customer segmentation, document grouping).
    * **Dimensionality Reduction:** Reducing the number of features while preserving important information (e.g., Principal Component Analysis - PCA).
    * **Association Rule Mining:** Discovering relationships between items in large datasets (e.g., "customers who bought X also bought Y").
    * **Anomaly Detection:** Identifying data points that are significantly different from the rest of the data.

---

**2. Introduction to Clustering**

* **What is Clustering?** Clustering is the task of dividing a set of data points into groups (called clusters) such that:
    * Data points *within* the same cluster are very similar to each other.
    * Data points in *different* clusters are dissimilar from each other.
* **Similarity/Dissimilarity:** This is typically measured using a distance metric (like Euclidean distance) in the feature space.
* **Applications:** Customer segmentation, image segmentation, document analysis, anomaly detection, organizing computing clusters.
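
To make the distance-based notion of similarity concrete, here is a minimal sketch using Python's standard `math.dist`. The three "customers" and their two features are made-up values for illustration only.

```python
import math

# Hypothetical customers described by (age, annual spend in thousands).
a = (25.0, 40.0)
b = (28.0, 44.0)
c = (60.0, 5.0)

# math.dist returns the Euclidean distance between two points.
d_ab = math.dist(a, b)   # sqrt(3^2 + 4^2)
d_ac = math.dist(a, c)   # sqrt(35^2 + 35^2)

# a is much closer to b than to c, so a and b are the more "similar" pair.
print(d_ab, d_ac)
```

A clustering algorithm built on this metric would tend to place `a` and `b` in the same group and `c` in a different one.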

---

**3. K-Means Clustering**

K-Means is one of the simplest and most widely used **centroid-based clustering algorithms**.

* **Goal of K-Means:** To partition $N$ data points into $K$ distinct, non-overlapping clusters, where each data point belongs to the cluster with the nearest mean (centroid). The number of clusters, $K$, is a hyperparameter that must be specified beforehand.
* **Centroids:**
    * A centroid is the **center point** of a cluster.
    * It's typically calculated as the mean of all the data points belonging to that cluster.
    * **Conceptual Diagram of Centroids and Clusters:**
        Imagine a 2D scatter plot of data points.
        * If K=3, there would be three distinct groups.
        * Each group would have a central point (the centroid), marked perhaps with an 'X' or a larger dot.
        * All points around a particular centroid belong to its cluster. The clusters are often visualized by coloring the points based on their assigned cluster.

        ```
        Feature 2 ^
                  |
                  |   * * (Cluster 1)
                  | * X1 *
                  |   * *
                  |                    o o (Cluster 2)
                  |                  o  X2 o
                  |                    o o
                  |
                  |          + + (Cluster 3)
                  |        +  X3 +
                  |          + +
                  ----------------------------> Feature 1
        (X1, X2, X3 are centroids for the three clusters)
        ```

* **How the K-Means Algorithm Works (Iterative Process):**

    The algorithm iteratively refines the cluster assignments and centroid locations.

    1.  **Initialization Step:**
        a.  **Choose K:** Specify the desired number of clusters, $K$.
        b.  **Initialize Centroids:** Randomly select $K$ data points from the dataset to be the initial centroids. (Other initialization methods exist, like K-Means++, which is smarter and often leads to better results and faster convergence. Scikit-learn uses K-Means++ by default).
        *Conceptual Diagram (Initialization):*
            Imagine K=2. Two random points from your dataset are picked and marked as the initial centroids C1 and C2.

    2.  **Assignment Step (Expectation Step in EM context):**
        a.  For each data point in the dataset, calculate its distance to *all* $K$ centroids.
        b.  Assign each data point to the cluster whose centroid is closest to it (i.e., has the minimum distance).
        *Conceptual Diagram (Assignment):*
            All data points are now colored based on whether C1 or C2 is closer. You'll see two distinct groups forming around the initial (possibly poorly placed) centroids.

    3.  **Update Step (Maximization Step in EM context):**
        a.  Once all data points are assigned to a cluster, recalculate the position of the $K$ centroids.
        b.  The new centroid for each cluster is the **mean** of all data points currently assigned to that cluster.
        *Conceptual Diagram (Update):*
            The centroids C1 and C2 now move to the actual center (mean position) of all the points currently assigned to their respective clusters.

    4.  **Repeat:** Repeat the Assignment Step (Step 2) and the Update Step (Step 3) iteratively until one of the stopping conditions is met:
        * The centroids no longer move significantly between iterations (convergence).
        * The cluster assignments of the data points no longer change.
        * A maximum number of iterations is reached.

* **Visualizing the Iterative Process:**
    Imagine a scatter plot.
    * *Iteration 0:* Random centroids are placed.
    * *Iteration 1 (Assignment):* Points are assigned to the nearest random centroid. Clusters look rough.
    * *Iteration 1 (Update):* Centroids move to the mean of their newly assigned points.
    * *Iteration 2 (Assignment):* Points are re-assigned based on the new centroid positions. Cluster boundaries might shift.
    * *Iteration 2 (Update):* Centroids move again.
    * ... This continues until the centroids stabilize, and the cluster assignments don't change much. The final clusters should look more natural and well-separated (if the data has such structure).
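
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans`, its parameters, and the toy data are assumptions made for this example, and it uses plain random-point initialization rather than K-Means++.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain K-Means (Lloyd's algorithm) with random-point initialization."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)

    for _ in range(max_iter):
        # Step 2 (assignment): distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)

        # Step 3 (update): move each centroid to the mean of its points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[new_labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)

        # Step 4 (stop): assignments unchanged or centroids barely moved.
        shift = np.linalg.norm(new_centroids - centroids)
        unchanged = np.array_equal(new_labels, labels)
        centroids, labels = new_centroids, new_labels
        if unchanged or shift < tol:
            break

    return centroids, labels

# Toy data (hypothetical): two visibly separate groups of three points.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
centroids, labels = kmeans(X, k=2)
print(labels)
print(centroids)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster receives the label 0 depends on the random initialization.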

---

**4. Inertia (Within-Cluster Sum of Squares - WCSS): The K-Means Objective**

How does K-Means know when it has found "good" clusters? It tries to minimize an objective function called **Inertia**, also commonly known as the **Within-Cluster Sum of Squares (WCSS)**.

* **What is Inertia/WCSS?**
    * It measures the **compactness** of the clusters.
    * For each cluster, it calculates the sum of the squared distances between each data point in that cluster and the cluster's centroid.
    * Inertia is the sum of these squared distances over all clusters.
    * Mathematically, for $K$ clusters $C_1, C_2, \dots, C_K$ with centroids $\mu_1, \mu_2, \dots, \mu_K$:
        $$\text{Inertia} = \text{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} ||x_i - \mu_k||^2$$
        Where $||x_i - \mu_k||^2$ is the squared Euclidean distance.
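
As a small worked example of the formula, the sketch below computes the inertia by hand for four points split into two clusters. The helper name `inertia` and the data are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np

def inertia(X, centroids, labels):
    """Sum of squared Euclidean distances from each point to its own centroid."""
    diffs = X - centroids[labels]   # each point minus its cluster's centroid
    return float(np.sum(diffs ** 2))

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0, 0.0], [10.0, 2.0]])   # the two cluster means

# Squared distances: 1 + 1 (cluster 0) and 4 + 4 (cluster 1).
print(inertia(X, centroids, labels))   # 10.0
```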

* **Goal of K-Means (revisited):** The K-Means algorithm aims to find the centroid positions and cluster assignments that **minimize this Inertia value**. The iterative assignment and update steps are precisely designed to try and achieve this minimum.

* **Interpretation of Inertia:**
    * **Lower Inertia:** Generally indicates that the clusters are more compact (points within a cluster are closer to their centroid) and thus, potentially "better" defined.
    * **Higher Inertia:** Indicates that clusters are more spread out or less dense.

* **Limitation of Inertia:**
    * Inertia is a measure of cluster compactness, but it's not a perfect measure of clustering "quality" on its own.
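    * **Inertia decreases as the number of clusters ($K$) increases.** More precisely, the minimum achievable Inertia never increases when a cluster is added, although an individual run that lands in a poor local minimum can occasionally break the trend.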
        * If $K$ equals the number of data points ($N$), then each point is its own cluster, each centroid is the point itself, and Inertia becomes 0 (perfect compactness, but useless for finding meaningful groups).
    * Therefore, we can't just pick the $K$ that gives the absolute minimum inertia (which would be $K=N$). We need a way to find a $K$ that represents a good trade-off between compactness and the number of clusters.
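
The effect is easy to see on a tiny dataset. The sketch below evaluates the same four hypothetical points with $K=1$, $K=2$, and $K=N$; the helper `wcss` and the hand-picked cluster assignments are assumptions for illustration, not the output of a fitted model.

```python
import numpy as np

def wcss(X, centroids, labels):
    """Within-cluster sum of squares for a given assignment."""
    return float(np.sum((X - centroids[labels]) ** 2))

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
n = len(X)

# K = 1: a single centroid at the overall mean.
k1 = wcss(X, X.mean(axis=0, keepdims=True), np.zeros(n, dtype=int))

# K = 2: left pair and right pair, each with its own mean.
labels2 = np.array([0, 0, 1, 1])
centroids2 = np.array([X[:2].mean(axis=0), X[2:].mean(axis=0)])
k2 = wcss(X, centroids2, labels2)

# K = N: every point is its own centroid.
kn = wcss(X, X, np.arange(n))

print(k1, k2, kn)   # 95.0 10.0 0.0
```

The drop from $K=1$ to $K=2$ reflects real structure in the data, while the final drop to zero only reflects that every point has been given its own cluster.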

---

**5. Choosing the Optimal Number of Clusters ($K$): The Elbow Method**

Since $K$ is a hyperparameter that we must specify, a common technique to help choose an appropriate $K$ is the **Elbow Method**.

* **How the Elbow Method Works:**
    1.  Run the K-Means algorithm for a range of different values of $K$ (e.g., $K=1, 2, 3, \dots, 10$ or more).
    2.  For each value of $K$, calculate the Inertia (WCSS).
    3.  Plot the Inertia values against the corresponding values of $K$.

* **Interpreting the Plot (Conceptual Diagram):**
    You will typically see a plot that looks like an arm:

    ```
    Inertia ^
    (WCSS)  |
            | *
            |
            |   *
            |
            |     *
            |       * <-- "Elbow" point
            |          *
            |              *
            |                   *      *
            ----------------------------> Number of Clusters (K)
    ```

    * As $K$ increases, Inertia decreases. Initially, the drop in Inertia is usually steep.
    * At some point, the rate of decrease in Inertia starts to slow down significantly, forming an "elbow" shape in the plot.
    * **The "elbow" point is often considered a good indicator of the optimal number of clusters.** This is the point where adding another cluster doesn't provide much better modeling of the data (i.e., doesn't reduce the within-cluster sum of squares substantially enough to justify the added complexity of another cluster).
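
A minimal sketch of the procedure with scikit-learn's `KMeans` is shown below. The synthetic three-blob dataset, the range of $K$, and the choice of `n_init=10` are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration): three 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(100, 2)) for c in centers])

ks = list(range(1, 11))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)   # WCSS of the fitted model

for k, value in zip(ks, inertias):
    print(k, round(value, 1))
```

Plotting `ks` against `inertias` (for example with matplotlib) gives the elbow curve. For data like this, the large drops should stop after $K=3$, which is where the elbow would be read off.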

* **Limitations of the Elbow Method:**
    * The "elbow" can sometimes be ambiguous or not very clear, making it subjective.
    * It doesn't always work well for all dataset shapes or when clusters are not well-separated and globular.

* **Other Methods for Choosing K (Beyond Elbow):**
    * **Silhouette Analysis:** Measures how similar a data point is to its own cluster compared to other clusters. It provides a Silhouette Score for each sample, and an average score can be calculated for different values of $K$. Higher Silhouette Scores (closer to +1) are better.
    * **Gap Statistic:** Compares the within-cluster dispersion of the data to that of random data.
    * **Domain Knowledge:** Often, the best way to choose $K$ is based on understanding the problem domain and what a meaningful number of groups would be.
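
Silhouette analysis can be sketched with scikit-learn's `silhouette_score`, which returns the average score over all samples. The synthetic data and the candidate range of $K$ below are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data (an assumption for illustration): three 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(100, 2)) for c in centers])

scores = {}
for k in range(2, 7):   # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the K with the highest average silhouette.
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

Unlike Inertia, the average silhouette does not keep improving as $K$ grows, so its maximum can be used directly as a candidate for $K$.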

---

**6. The Hyperparameter $K$**

* As highlighted, $K$ (the number of clusters) is the most important hyperparameter for K-Means.
* Its choice directly influences the resulting clusters and the interpretation of the results.
* Incorrectly choosing $K$ can lead to:
    * **Too few clusters ($K$ too small):** Merging distinct groups, leading to loss of information and poor representation of the data's structure.
    * **Too many clusters ($K$ too large):** Splitting natural groups into smaller, less meaningful sub-clusters, potentially overfitting to noise or minor variations.

---

**7. Initialization of Centroids (The `init` Parameter)**

The initial placement of centroids can affect the final clustering results and the speed of convergence.
* **Random Initialization:** If centroids are initialized poorly (e.g., all very close together in one part of the data), K-Means might converge to a **local minimum** of the Inertia function, rather than the global minimum. This means it might find suboptimal clusters.
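    * To mitigate this, it's common to run K-Means multiple times with different random initializations and choose the clustering result that yields the lowest Inertia. Scikit-learn's `KMeans` controls the number of runs via the `n_init` parameter. The default depends on the version: older releases run 10 initializations, while since version 1.4 the default `n_init='auto'` runs 10 initializations for `init='random'` but only 1 for `init='k-means++'`.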
* **K-Means++ Initialization (Default in Scikit-learn):**
    * This is a "smarter" initialization method that tries to spread out the initial centroids.
    * It selects the first centroid randomly from the data points.
    * Then, for each subsequent centroid, it samples a data point with probability proportional to the **squared distance** from that point to its nearest already-chosen centroid, so points far from the existing centroids are more likely to be selected.
    * K-Means++ generally leads to better (lower Inertia) and more consistent results, and often faster convergence, compared to purely random initialization.
    * In Scikit-learn, `init='k-means++'` is the default.
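
The seeding rule described above can be sketched as follows. This is the basic textbook version written for illustration; the function name `kmeans_pp_init` is an assumption, and scikit-learn's own implementation may differ in details.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-Means++ seeding: sample centroids with probability proportional to D(x)^2."""
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Squared distance from each point to its nearest chosen centroid.
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = np.min(np.sum(diffs ** 2, axis=2), axis=1)
        # Far-away points get a proportionally higher selection probability.
        # (Assumes the data contains at least k distinct points, so d2.sum() > 0.)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Toy data (hypothetical): two groups of three points.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
init = kmeans_pp_init(X, k=3)
print(init)
```

Because an already-chosen point has distance zero to itself, it can never be picked twice, and points in an uncovered region of the data are strongly favored for the next centroid.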

---

**8. Assumptions and Limitations of K-Means**

K-Means is popular due to its simplicity and speed, but it has several underlying assumptions and limitations:

1.  **Assumes Spherical/Globular Clusters:** K-Means works best when clusters are roughly spherical (isotropic) and of similar sizes. It tries to find round clusters.
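2.  **Assumes Clusters of Similar Size/Density:** Because every point is assigned to the nearest centroid, the boundary between two clusters lies halfway between their centroids, regardless of how many points each cluster contains or how spread out it is. K-Means therefore tends to produce clusters of similar spatial extent and might struggle with clusters of very different sizes or densities.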
3.  **Sensitive to Feature Scaling:** Since K-Means uses distance measures (typically Euclidean), features with larger scales can dominate the distance calculations. **It is crucial to scale/normalize features** (e.g., using `StandardScaler`) before applying K-Means.
4.  **Need to Specify $K$ in Advance:** As discussed, choosing the optimal $K$ can be challenging.
5.  **Sensitive to Initial Centroid Placement (if not using K-Means++ or multiple initializations):** Can converge to local optima.
6.  **Struggles with Non-Globular Shapes:** K-Means has difficulty identifying clusters with complex, elongated, or non-convex shapes (e.g., crescent moons, concentric circles) because its cluster boundaries are piecewise linear: each cluster is a convex Voronoi cell around its centroid.
    * **Conceptual Diagram (K-Means on Moons):**
        Imagine two intertwined crescent moon shapes of data points. K-Means (with K=2) would effectively draw a straight line through the data, cutting across both moons so that each cluster contains parts of both, rather than identifying each moon as a separate cluster.
7.  **Impact of Outliers:** Centroids are mean-based, so outliers can pull centroids towards them, potentially distorting the clusters.
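
To illustrate the feature-scaling point (item 3), the sketch below compares Euclidean distances before and after applying `StandardScaler`. The three rows and their units (age in years, income in dollars) are made-up values for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: age in years and income in dollars.
X = np.array([[25.0, 50_000.0],
              [60.0, 51_000.0],
              [26.0, 90_000.0]])

# Raw distances are dominated by income: rows 0 and 1 look "close"
# despite a 35-year age gap, because 1,000 dollars is small next to 40,000.
raw_01 = np.linalg.norm(X[0] - X[1])
raw_02 = np.linalg.norm(X[0] - X[2])

# After standardization each feature has mean 0 and unit variance,
# so both features contribute on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)
scaled_01 = np.linalg.norm(X_scaled[0] - X_scaled[1])
scaled_02 = np.linalg.norm(X_scaled[0] - X_scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

In the raw data the second distance is roughly 40 times the first; after scaling the two distances are of similar magnitude, so the age gap is no longer ignored.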

Despite these limitations, K-Means is often a good starting point for clustering tasks due to its efficiency and ease of interpretation. If its assumptions are not met, other clustering algorithms like DBSCAN, Hierarchical Clustering, or Gaussian Mixture Models might be more appropriate.
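
The moons limitation (item 6) can be checked with a short experiment. This sketch assumes scikit-learn's `make_moons` toy dataset and uses the adjusted Rand index (ARI) to compare the K-Means labels with the true moon membership; the sample size and noise level are arbitrary choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles with a little noise.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# ARI is 1.0 for a perfect match with the true moons and close to 0
# for an assignment no better than random.
ari = adjusted_rand_score(y_true, labels)
print(round(ari, 2))
```

An ARI well below 1.0 here indicates that K-Means did not recover the two moons, even though the shapes are obvious to the eye.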
