# K-means Clustering Algorithm

K-means is a popular partitioning-based clustering algorithm. It splits a dataset into k clusters, assigning each data point to the cluster with the nearest centroid so as to minimize the within-cluster sum of squares.

## History

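The standard k-means algorithm was first proposed by Stuart Lloyd at Bell Labs in 1957 as a technique for pulse-code modulation, but the work was not published as a journal article until 1982. In 1965, Edward W. Forgy published essentially the same method, which is why it is sometimes called the Lloyd-Forgy algorithm. The name "k-means" itself was introduced by James MacQueen in 1967.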

## Mathematical Equations

The objective of k-means is to minimize the within-cluster sum of squares (WCSS), defined as:

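$$\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

where k is the number of clusters, C_i is the set of data points assigned to cluster i, x is a data point, μ_i is the centroid (mean) of cluster i, and ||.|| denotes the Euclidean norm.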
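
As a small illustration of the definition above, the following sketch computes the WCSS with NumPy for a hand-made assignment. The function name `wcss` and the four toy points are assumptions chosen for this example, not part of any library.

```python
import numpy as np

def wcss(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    # centroids[labels] picks, for every row of X, the centroid of its cluster
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

# Toy data: two clusters of two points each (hypothetical values)
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

print(wcss(X, labels, centroids))  # 4.0
```

Every point lies at distance 1 from its centroid, so the four squared distances sum to 4.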

## Learning Algorithm

The learning algorithm for k-means consists of the following steps:

1. Initialize the centroids randomly by selecting k data points from the dataset.
2. Assign each data point to the nearest centroid.
3. Update the centroids by calculating the mean of all the data points assigned to each centroid.
4. Repeat steps 2 and 3 until the centroids' positions do not change significantly or a maximum number of iterations is reached.
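
In symbols, steps 2 and 3 can be written as follows. The notation is an assumption added here for illustration: $C_i^{(t)}$ is the set of points assigned to cluster $i$ at iteration $t$, $\mu_i^{(t)}$ is its centroid, and ties in the assignment are broken arbitrarily so that each point belongs to exactly one cluster.

$$C_i^{(t)} = \left\{\, x : \lVert x - \mu_i^{(t)} \rVert^2 \le \lVert x - \mu_j^{(t)} \rVert^2 \ \text{for all } j = 1, \dots, k \,\right\}$$

$$\mu_i^{(t+1)} = \frac{1}{\lvert C_i^{(t)} \rvert} \sum_{x \in C_i^{(t)}} x$$

Neither step can increase the WCSS: moving a point to its nearest centroid never increases its squared distance, and the mean is the point that minimizes the sum of squared distances to the members of a cluster.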

The number of clusters (k) is an input parameter of the k-means algorithm.

## Pros and Cons

**Pros:**
- Easy to understand and implement.
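- Efficient: each iteration costs O(nkd) for n data points, k clusters, and d dimensions.
- Scales well to large datasets.
- Guaranteed to converge in a finite number of iterations, although only to a local optimum.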

**Cons:**
- Requires the number of clusters as an input parameter.
- Sensitive to the initial placement of centroids.
- Assumes that clusters are spherical and have similar densities.
- Can get stuck in local optima.
- Does not work well with categorical data, because means and Euclidean distances are not meaningful for categories.
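
The sensitivity to initialization and the risk of local optima can be seen on a tiny example. The sketch below is a minimal NumPy version of the iteration, not a reference implementation; the helper name `lloyd` and the four rectangle corners are assumptions made for illustration. Two different starting centroids lead to two different stable solutions with very different WCSS.

```python
import numpy as np

def lloyd(X, centroids, max_iter=100):
    """Run k-means iterations from the given starting centroids.

    Returns (centroids, labels, wcss).
    """
    centroids = centroids.astype(float)
    for _ in range(max_iter):
        # Squared distances between every point and every centroid, shape (n, k)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Mean of each cluster; an empty cluster keeps its previous centroid
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    wcss = float(((X - centroids[labels]) ** 2).sum())
    return centroids, labels, wcss

# Four corners of a 4 x 1 rectangle
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

_, labels_a, wcss_a = lloyd(X, X[[0, 2]])  # one starting centroid on each short side
_, labels_b, wcss_b = lloyd(X, X[[0, 1]])  # both starting centroids on the left side

print(labels_a, wcss_a)  # [0 0 1 1] 1.0
print(labels_b, wcss_b)  # [0 1 0 1] 16.0
```

The second run splits the rectangle along its long axis and stays there, because no single reassignment lowers the WCSS. Running the algorithm from several initializations and keeping the solution with the lowest WCSS is a common way to reduce this risk.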

## Suitable Tasks and Datasets

K-means can be applied to a variety of clustering tasks, including:

- Image segmentation
- Anomaly detection
- Market segmentation
- Document clustering

It works well when the clusters are roughly spherical and have similar densities. K-means is not suitable for datasets whose clusters have arbitrary shapes or varying densities.

## Difference between K-means and DBSCAN

The main differences between k-means and DBSCAN are:

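- K-means requires the number of clusters as an input parameter, while DBSCAN does not; DBSCAN instead requires a neighborhood radius and a minimum number of points per neighborhood.
- K-means is sensitive to the initial placement of centroids and may converge to local optima, while DBSCAN has no random initialization, so its results do not depend on a starting guess.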
- K-means tends to work well with spherical clusters and may struggle with clusters of arbitrary shapes, while DBSCAN can find clusters of any shape.
- K-means is less robust to noise compared to DBSCAN, which can identify and separate noise points from clusters.
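- Both methods have trouble when cluster densities differ widely: k-means assumes similar densities across clusters, and DBSCAN applies a single global density threshold (its neighborhood radius and minimum point count) to the whole dataset.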
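
The shape-related difference can be illustrated with scikit-learn's two-moons toy dataset. This is a sketch under stated assumptions: the sample size, noise level, and the DBSCAN settings `eps=0.2` and `min_samples=5` are values picked for this example and would need tuning on other data. The adjusted Rand index (ARI) compares each clustering with the true moon labels, with 1.0 meaning perfect agreement.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: clusters that are not spherical
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-means is told there are two clusters; DBSCAN is not
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)

print(f"k-means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN  ARI: {ari_dbscan:.2f}")
# DBSCAN marks points it considers noise with the label -1
print("DBSCAN noise points:", int((dbscan_labels == -1).sum()))
```

K-means separates the data with a straight boundary and cuts through both moons, while DBSCAN follows the dense curved regions.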

## References

1. Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129-137.
2. Forgy, E. W. (1965). Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics, 21(3), 768-769.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

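# Euclidean distance between two points
def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Minimal k-means (Lloyd's algorithm) with random initialization
class KMeans:
    def __init__(self, n_clusters=3, max_iter=300, random_state=None):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.random_state = random_state

    def fit_predict(self, X):
        # Local generator, so the global NumPy random state is left untouched
        rng = np.random.default_rng(self.random_state)
        # Step 1: initialize the centroids with k distinct data points
        centroids = X[rng.choice(X.shape[0], self.n_clusters, replace=False)]

        for _ in range(self.max_iter):
            # Step 2: assign each point to its nearest centroid
            clusters = np.array([
                np.argmin([euclidean_distance(x, c) for c in centroids]) for x in X
            ])
            # Step 3: move each centroid to the mean of its points;
            # an empty cluster keeps its old centroid instead of becoming NaN
            new_centroids = np.array([
                X[clusters == i].mean(axis=0) if np.any(clusters == i) else centroids[i]
                for i in range(self.n_clusters)
            ])

            # Step 4: stop once the centroids no longer move
            if np.allclose(centroids, new_centroids):
                break
            centroids = new_centroids

        self.cluster_centers_ = centroids
        return clusters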

# Load the Iris dataset
iris = load_iris()
X = iris.data[:, :2]  # Use only the first two features for easy visualization

# Apply the k-means algorithm
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X)

# Visualize the results
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', marker='x')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.title('K-means Clustering')
plt.show()


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

# Load the Iris dataset
iris = load_iris()
X = iris.data[:, :2]  # Use only the first two features for easy visualization

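# Apply the k-means algorithm with k = 3
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X)

# Elbow method: fit k-means for k = 1..10 and record the inertia (WCSS) of each fit.
# A separate variable is used so the k = 3 model plotted below is not overwritten.
inertia_values = []
K = range(1, 11)
for k in K:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X)
    inertia_values.append(model.inertia_)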

plt.plot(K, inertia_values, 'bo-')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('The Elbow Method')
plt.show()

# Visualize the results
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', marker='x')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.title('K-means Clustering')
plt.show()
