
# 8: Clustering
K-means clustering is a popular unsupervised machine learning algorithm used for partitioning a dataset into distinct groups or clusters based on the similarity of data points. The goal of the K-means algorithm is to assign each data point to one of K clusters in a way that minimizes the sum of squared distances between data points and the centroid of their assigned cluster.
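
Written out, the objective described above is the within-cluster sum of squares. The symbols are notation assumed here for illustration: $x_i$ is a data point, $C_k$ is the set of points assigned to cluster $k$, and $\mu_k$ is that cluster's centroid.

$$
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
$$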

Here's a step-by-step explanation of how the K-means algorithm works:

**Initialization:** Choose K initial cluster centroids randomly or using a specific initialization method.

**Assignment Step:** Assign each data point to the nearest centroid. The distance metric commonly used is Euclidean distance, but other distance metrics can be employed.

**Update Step:** Recalculate each centroid as the mean of the data points currently assigned to its cluster.

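**Repeat:** Repeat the assignment and update steps until convergence. Convergence occurs when the centroids no longer change significantly (equivalently, when the cluster assignments stop changing). In practice the loop is also capped at a specified maximum number of iterations.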
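
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the toy data are all made up for this example, and initialization simply picks K random data points.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-6, seed=0):
    """Illustrative k-means loop (hypothetical helper, not a library function)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: Euclidean distance from every point to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence check: stop once the centroids barely move
        moved = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if moved < tol:
            break
    return labels, centroids

# Toy data: two blobs of 50 points each (made up for this example)
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(centroids_demo)
```

The returned labels come from the last assignment step, so they match the final centroids only once the loop has converged.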

The choice of K (the number of clusters) is a crucial parameter in K-means clustering and is often determined using methods like the elbow method or silhouette analysis.
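
As a rough sketch of both approaches, the loop below fits `KMeans` for several values of K on synthetic data and records the inertia (the quantity plotted in the elbow method) and the mean silhouette score. The data, the range of K, and the variable names are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (a stand-in for real data)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=1.0,
    random_state=0,
)

inertias = {}
silhouettes = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    # Within-cluster sum of squares; look for the "elbow" where it flattens
    inertias[k] = model.inertia_
    # Mean silhouette coefficient, in [-1, 1]; higher means better separation
    silhouettes[k] = silhouette_score(X_demo, model.labels_)

for k in inertias:
    print(f"k={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")

best_k = max(silhouettes, key=silhouettes.get)
print("k with the highest silhouette score:", best_k)
```

On data this cleanly separated, the silhouette score should peak at the true number of groups. Real data rarely gives such a clear answer, so both curves are best read as guidance rather than a rule.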

K-means clustering has several applications, such as customer segmentation, image compression, and anomaly detection. Each iteration costs time roughly linear in the number of points, clusters, and features, which makes the algorithm practical for large datasets. However, K-means is sensitive to the initial choice of centroids and may converge to a local minimum. Common remedies are to run it from multiple initializations and keep the best result, or to use a smarter seeding scheme such as k-means++.
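
In scikit-learn, both remedies are constructor arguments of `KMeans`: `init` selects the seeding scheme and `n_init` sets the number of restarts, with the lowest-inertia run kept. The comparison below is a small sketch on synthetic data; the data and variable names are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with five groups (made up for this example)
X_demo, _ = make_blobs(
    n_samples=500,
    centers=[[0, 0], [8, 0], [0, 8], [8, 8], [4, 4]],
    cluster_std=1.0,
    random_state=1,
)

# A single, purely random start: the outcome depends on the draw
single_random = KMeans(n_clusters=5, init="random", n_init=1, random_state=0).fit(X_demo)

# k-means++ seeding with 10 restarts; the run with the lowest inertia is kept
multi_pp = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X_demo)

print("single random start, inertia:", single_random.inertia_)
print("k-means++ with 10 restarts, inertia:", multi_pp.inertia_)
```

A lower inertia means a tighter clustering. The single random start may happen to find the same solution, but it has no protection against a poor local minimum.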

## 8.1 Demonstrate the application of k-means clustering using R/Python


In [1]:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the dataset (replace '/content/bank.csv' with the actual path to your dataset)
bank_data = pd.read_csv('/content/bank.csv')

# Select predictor variables (replace with your actual predictor variables)
X = bank_data[["balance", "duration"]]

# Standardize the predictor variables using z-score transformation
scaler = StandardScaler()
Xz = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)

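# Run k-means clustering on the standardized data (no train/test split is made here).
# random_state makes the run reproducible; which cluster gets label 0 or 1 is arbitrary.
kmeans_model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(Xz)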

# Get cluster membership for each record
cluster_membership = kmeans_model.labels_

# Separate records into two groups based on cluster membership
cluster1 = Xz.loc[cluster_membership == 0]
cluster2 = Xz.loc[cluster_membership == 1]

# Compute summary statistics for the two clusters
print("Cluster 1 Summary Statistics:")
print(cluster1.describe())

print("\nCluster 2 Summary Statistics:")
print(cluster2.describe())




Cluster 1 Summary Statistics:
           balance     duration
count  8932.000000  8932.000000
mean     -0.036090    -0.405619
std       0.812361     0.411093
min      -2.596850    -1.065918
25%      -0.438270    -0.728852
50%      -0.311924    -0.483975
75%       0.033319    -0.138266
max      10.977833     0.656865

Cluster 2 Summary Statistics:
           balance     duration
count  2230.000000  2230.000000
mean      0.144553     1.624657
std       1.528837     1.015145
min      -1.422064    -0.757661
25%      -0.427573     0.913986
50%      -0.265571     1.343961
75%       0.172919     2.039700
max      24.703510    10.109123
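
The statistics above are in z-score units, which makes them hard to read as balances or call durations. One way to interpret the clusters in original units is to map the centroids back with `scaler.inverse_transform` and to summarize the unscaled columns by cluster label. The sketch below uses made-up numbers as a stand-in for `bank.csv`; only the column names are taken from the example above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for the bank data; the values are NOT from bank.csv
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "balance": rng.normal(1500, 800, size=200),
    "duration": np.concatenate([
        rng.normal(200, 50, size=150),
        rng.normal(900, 100, size=50),
    ]),
})

scaler = StandardScaler()
demo_z = scaler.fit_transform(demo)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(demo_z)

# Centroids mapped back from z-scores to the original units
centers = pd.DataFrame(
    scaler.inverse_transform(model.cluster_centers_),
    columns=demo.columns,
)
print(centers)

# Per-cluster size and mean, computed on the unscaled columns
print(demo.groupby(model.labels_).agg(["count", "mean"]))
```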
