In [1]:
#https://www.geeksforgeeks.org/k-means-clustering-introduction/

# K-means clustering
K-means clustering is a popular unsupervised machine learning algorithm that partitions a dataset into K distinct, non-overlapping subsets (clusters). Each data point belongs to the cluster with the nearest mean (centroid), and the algorithm iteratively refines the cluster assignments until convergence. The "K" in K-means is the number of clusters, which must be chosen before the algorithm runs.

### Steps in K-means Clustering:

1. **Initialization:**
   - Choose the number of clusters \( K \).
   - Randomly initialize the centroids of the clusters. Each centroid is a point in the feature space.

2. **Assignment:**
   - Assign each data point to the cluster whose centroid is the nearest (typically using Euclidean distance).

3. **Update Centroids:**
   - Recalculate the centroids of the clusters by taking the mean of all data points assigned to each cluster.

4. **Repeat:**
   - Repeat steps 2 and 3 until convergence, i.e., until the centroids no longer change significantly or a predefined number of iterations is reached.

5. **Final Result:**
   - The algorithm converges to a set of cluster assignments, and each data point is associated with a specific cluster.
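
The steps above can be sketched in plain NumPy. This is a minimal illustration rather than a reference implementation: the function name `kmeans_sketch`, its parameters (`max_iter`, `tol`, `seed`), and the choice to initialize centroids by sampling existing data points are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-4, seed=0):
    """Hypothetical helper: plain K-means following the five steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for n_iter in range(1, max_iter + 1):
        # Step 2: Euclidean distance from every point to every centroid,
        # then assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids have (almost) stopped moving
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Step 5: final assignments, centroids, and iterations used
    return labels, centroids, n_iter

# Usage on synthetic data with three blobs
rng = np.random.default_rng(42)
X_demo = np.concatenate([rng.normal(0, 1, (100, 2)),
                         rng.normal(5, 1, (100, 2)),
                         rng.normal(10, 1, (100, 2))])
labels, centroids, n_iter = kmeans_sketch(X_demo, k=3)
print(centroids.round(2), n_iter)
```

Because the starting centroids are random, different seeds can lead to different final clusters; library implementations typically run several initializations and keep the best one.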

### Example Using Python and Scikit-Learn:

Here's a simple example of K-means clustering using the `KMeans` class from the scikit-learn library:

```python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate synthetic data with three clusters
np.random.seed(42)
X = np.concatenate([np.random.normal(0, 1, (100, 2)),
                    np.random.normal(5, 1, (100, 2)),
                    np.random.normal(10, 1, (100, 2))])

# Apply K-means clustering with K=3
# (n_init is set explicitly because its default differs across scikit-learn versions)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)

# Get cluster assignments and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Visualize the clusters and centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', marker='o', s=50, alpha=0.8)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.title('K-means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
```

In this example, synthetic data is generated with three clusters. The K-means algorithm is then applied to partition the data into three clusters. The resulting clusters and centroids are visualized using a scatter plot. The color of each point represents its assigned cluster, and the red "X" markers indicate the centroids.
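
Beyond the plot, the fitted model can also be inspected numerically. The sketch below assumes the same kind of synthetic data, regenerated here so the snippet runs on its own. The variable names (`X_blobs`, `model`, `new_points`) are illustrative, while `inertia_`, `n_iter_`, and `predict` are part of scikit-learn's `KMeans`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same kind of synthetic data as above: three blobs centred at 0, 5 and 10
rng = np.random.default_rng(42)
X_blobs = np.concatenate([rng.normal(0, 1, (100, 2)),
                          rng.normal(5, 1, (100, 2)),
                          rng.normal(10, 1, (100, 2))])

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_blobs)

# Sum of squared distances from each point to its nearest centroid
print("inertia:", model.inertia_)
# Iterations used by the best of the n_init runs
print("iterations:", model.n_iter_)

# Assign previously unseen points to the nearest learned centroid
new_points = np.array([[0.0, 0.0], [10.0, 10.0]])
new_labels = model.predict(new_points)
print("labels for new points:", new_labels)
```

The cluster numbers themselves are arbitrary; what matters is which points share a label.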

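- First, we randomly initialize k points, called means or cluster centroids.
- We calculate the distance between each point and each centroid and categorize every item to its closest mean. This forms the first set of clusters.
- We update each mean's coordinates to the average of the items categorized in that cluster. These are the new centroids.
- We calculate the distances again using the new centroids, which gives new clusters and, in turn, new centroids.

This cycle of distances, clusters, and centroids continues until the centroids stop moving or a maximum number of iterations is reached. The number of iterations does not depend on k, which only sets the number of clusters.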
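
To see the cycle on concrete numbers, here is a small sketch that can be traced by hand. The six one-dimensional points and the deliberately poor starting centroids are made up for illustration. With k = 2, the loop stops on its third pass, when the centroids no longer change.

```python
import numpy as np

points = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
centroids = np.array([1.0, 2.0])  # deliberately poor starting means (k = 2)

history = []
for iteration in range(1, 21):  # safety cap of 20 iterations
    # Distance from every point to every centroid
    distances = np.abs(points[:, None] - centroids[None, :])
    # Each point joins the cluster of its closest centroid
    labels = distances.argmin(axis=1)
    # New centroid = mean of the points in each cluster
    new_centroids = np.array([points[labels == j].mean()
                              for j in range(len(centroids))])
    history.append((iteration, labels.tolist(), new_centroids.tolist()))
    print(iteration, labels, new_centroids)
    # Stop when the centroids did not move in this pass
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
```

The first pass pulls the second centroid towards the large values, the second pass settles the clusters at {1, 2, 3} and {10, 11, 12}, and the third pass only confirms that nothing changes.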