# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
print("Loading dataset...")
iris = load_iris()
data = pd.DataFrame(data=iris.data, columns=iris.feature_names)
target = pd.Series(iris.target, name='target')

# Display first few rows of the dataset
print("First few rows of the dataset:")
print(data.head())

# Standardize the features
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

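# Apply K-Means clustering (Iris has three species, so K=3).
# n_init is set explicitly so results do not depend on the scikit-learn
# version's default number of restarts.
print("Performing K-Means clustering...")
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(data_scaled)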

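# Evaluate clustering quality: the mean silhouette coefficient lies in [-1, 1],
# and higher values indicate better-separated clusters
silhouette_avg = silhouette_score(data_scaled, clusters)
print(f"Silhouette Score: {silhouette_avg:.3f}")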

# Visualize the clusters on the first two standardized features
plt.figure(figsize=(10, 6))
plt.scatter(data_scaled[:, 0], data_scaled[:, 1], c=clusters, cmap='viridis', edgecolor='k')
plt.xlabel(f'{iris.feature_names[0]} (standardized)')
plt.ylabel(f'{iris.feature_names[1]} (standardized)')
plt.title('K-Means Clustering Results')
plt.colorbar(label='Cluster', ticks=[0, 1, 2])
plt.show()

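# Summary of clustering results
data['Cluster'] = clusters
print("Cluster distribution:")
print(data['Cluster'].value_counts().sort_index())

# Compare clusters with the true species labels (used only for evaluation)
print("Species (rows) vs. cluster (columns):")
print(pd.crosstab(target, data['Cluster']))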

print("""
Q1. Different Types of Clustering Algorithms:
1. **K-Means Clustering**: Partitional clustering method that aims to partition data into K clusters, minimizing the variance within each cluster.
2. **Hierarchical Clustering**: Builds a hierarchy of clusters either by merging smaller clusters (agglomerative, bottom-up) or by splitting larger ones (divisive, top-down).
3. **DBSCAN (Density-Based Spatial Clustering of Applications with Noise)**: Clusters based on density, allowing for clusters of arbitrary shapes and handling noise.
4. **Mean Shift Clustering**: A non-parametric algorithm that iteratively shifts each point toward the nearest mode (density peak) of the estimated data distribution; points that converge to the same mode form a cluster.
5. **Gaussian Mixture Models (GMMs)**: Assume the data is generated from a mixture of several Gaussian distributions and use Expectation-Maximization to estimate their parameters.
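
As a rough illustration of the five families listed above, the sketch below fits one scikit-learn estimator per family on the standardized Iris features. The parameter values (eps, min_samples, three clusters or components) are illustrative assumptions, not tuned settings.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans, MeanShift
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# One estimator per family; every hyperparameter here is an assumed value.
models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=42),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
    "DBSCAN": DBSCAN(eps=0.6, min_samples=5),
    "MeanShift": MeanShift(),  # bandwidth is estimated from the data
    "GaussianMixture": GaussianMixture(n_components=3, random_state=42),
}

labels = {name: model.fit_predict(X) for name, model in models.items()}

for name, assigned in labels.items():
    # DBSCAN marks noise points with the label -1.
    n_clusters = len(set(assigned) - {-1})
    print(f"{name}: {n_clusters} clusters found")
```

Note that only K-Means, agglomerative clustering, and the Gaussian mixture take the number of groups as an input; DBSCAN and Mean Shift infer it from the data.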

Q2. K-Means Clustering:
K-Means is an iterative algorithm that partitions data into K clusters. It works as follows:
1. Initialize K cluster centroids randomly.
2. Assign each data point to the nearest centroid.
3. Recalculate centroids as the mean of the data points assigned to each cluster.
4. Repeat steps 2 and 3 until the centroids no longer change (or a maximum number of iterations is reached).
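
The four steps above can be sketched directly in NumPy. This is a minimal teaching version, not scikit-learn's implementation; the function name kmeans_sketch, its parameters, and the choice to initialize from randomly picked data points are assumptions made for illustration.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Usage on two well-separated synthetic blobs.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])
labels, centroids = kmeans_sketch(X, k=2)
print("Cluster sizes:", np.bincount(labels))
```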

Q3. Advantages and Limitations of K-Means Clustering:
**Advantages**:
- Simple and easy to implement.
- Scales to large datasets: each iteration costs roughly O(n * K * d) for n points in d dimensions.
- Works well when clusters are spherical and of similar size.

**Limitations**:
- Requires specifying the number of clusters K in advance.
- Sensitive to the initial placement of centroids.
- Assumes roughly spherical clusters of similar size and density, so it struggles with elongated or non-convex shapes.
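
To make the shape limitation concrete, the sketch below runs K-Means and DBSCAN on scikit-learn's synthetic two-moons data and scores both against the known labels with the adjusted Rand index. The dataset, the noise level, and the DBSCAN settings are assumptions chosen for illustration.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: non-convex clusters that K-Means cannot separate.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```

K-Means splits the plane with a straight boundary between the two centroids, which cuts through both moons, while a density-based method can follow each curve.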

Q4. Determining the Optimal Number of Clusters:
**Methods**:
1. **Elbow Method**: Plot the sum of squared distances from points to their assigned cluster centroids (the inertia) for different values of K and look for the "elbow" where further increases in K yield only small improvements.
2. **Silhouette Score**: Measure how similar each data point is to its own cluster compared to the nearest other cluster. Scores range from -1 to 1; choose the K with the highest average score.
3. **Gap Statistic**: Compare the total within-cluster variation for different values of K with its expected value under a null reference distribution.
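
The first two methods can be sketched with a simple loop over candidate values of K on the standardized Iris features. The range 2 to 8 is an arbitrary assumption, and the gap statistic is left out because scikit-learn does not provide it.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_  # elbow method: within-cluster sum of squares
    silhouettes[k] = silhouette_score(X, model.labels_)
    print(f"K={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")

best_k = max(silhouettes, key=silhouettes.get)
print("K with the highest average silhouette:", best_k)
```

Plotting the inertia values against K gives the elbow curve. The two criteria do not always agree, so it is worth checking both against domain knowledge.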

Q5. Applications of K-Means Clustering:
1. **Customer Segmentation**: Group customers with similar purchasing behavior.
2. **Image Compression**: Reduce the number of colors in an image by clustering similar colors.
3. **Anomaly Detection**: Identify unusual patterns by clustering the data and flagging points that lie far from every centroid as outliers.
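
One possible way to use K-Means for anomaly detection is to flag the points farthest from their nearest centroid. The sketch below does this on the standardized Iris features; the 98th-percentile cutoff is a hypothetical rule chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# transform() returns each point's distance to every centroid;
# keep only the distance to the nearest one.
nearest_dist = model.transform(X).min(axis=1)

# Hypothetical rule: flag points beyond the 98th percentile of that distance.
threshold = np.percentile(nearest_dist, 98)
outliers = np.flatnonzero(nearest_dist > threshold)
print("Flagged row indices:", outliers)
```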

Q6. Interpreting K-Means Clustering Output:
- **Cluster Centroids**: Represent the mean feature values of each cluster (in standardized units if the features were scaled before clustering).
- **Cluster Distribution**: Shows how data points are distributed among clusters.
- **Cluster Characteristics**: Analyze the features of each cluster to gain insights into different groups within the data.
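
The sketch below shows one way to read these three outputs for the Iris data, assuming the features were standardized before clustering. Variable names such as centroids_cm and profile are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)

scaler = StandardScaler()
X = scaler.fit_transform(features)
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Cluster centroids: fitted in standardized space, mapped back to centimetres.
centroids_cm = pd.DataFrame(
    scaler.inverse_transform(model.cluster_centers_),
    columns=iris.feature_names,
).rename_axis("cluster")

# Cluster distribution: number of points assigned to each cluster.
sizes = pd.Series(model.labels_, name="cluster").value_counts().sort_index()

# Cluster characteristics: per-cluster feature means in the original units.
profile = features.groupby(model.labels_).mean()

print(centroids_cm.round(2))
print(sizes)
print(profile.round(2))
```

The per-cluster means should closely match the back-transformed centroids, since each centroid is the mean of the points assigned to it.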

Q7. Challenges and Solutions in K-Means Clustering:
**Challenges**:
- **Choosing K**: Difficult to determine the optimal number of clusters.
- **Initialization Sensitivity**: Different initializations can lead to different results.

**Solutions**:
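- **Use k-means++ Initialization**: Choose initial centroids that are spread far apart, which makes poor starting configurations less likely.
- **Run Multiple Trials**: Run K-Means several times with different initializations and keep the run with the lowest inertia (the `n_init` parameter in scikit-learn).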
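
Both remedies map onto KMeans parameters in scikit-learn. The sketch below compares ten single runs with random initialization against one fit that uses k-means++ seeding with ten restarts; the seeds and the restart count are arbitrary assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# Single random initialization: the final inertia can vary with the seed.
single_runs = [
    KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X).inertia_
    for seed in range(10)
]

# k-means++ seeding with 10 restarts; scikit-learn keeps the lowest-inertia run.
best = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

print(f"random init, single runs: min={min(single_runs):.2f}, max={max(single_runs):.2f}")
print(f"k-means++ with n_init=10: {best.inertia_:.2f}")
```

A wide gap between the minimum and maximum inertia of the single runs indicates that the result depends strongly on initialization.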

""")
