# Clustering Assignment: 48 Questions with Answers

**1. What is unsupervised learning in the context of machine learning?**

Unsupervised learning is a type of machine learning where the model learns patterns from unlabelled data. The algorithm tries to find hidden structure, clusters, or associations within the dataset without any target output.

**2. How does K-Means clustering algorithm work?**

K-Means clusters data by initializing 'k' centroids, assigning each point to the nearest centroid, then updating the centroids as the mean of all points in each cluster. This process repeats until the centroids no longer change significantly.
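
The assign-then-update loop described above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation; the function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, tol=1e-6, seed=0):
    """Hypothetical helper: basic K-Means loop (assign, update, repeat)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids as k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster simply keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:  # centroids no longer change significantly
            break
    return labels, centroids

# Made-up data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.0],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.0]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

In practice `sklearn.cluster.KMeans` is the better choice, since it adds smarter initialisation and multiple restarts.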

**3. Explain the concept of a dendrogram in hierarchical clustering.**

A dendrogram is a tree-like diagram that shows the arrangement of clusters formed by hierarchical clustering. It illustrates the merging or splitting of clusters at various levels of similarity or distance.
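
As a rough illustration, SciPy can build the merge tree that a dendrogram draws. The sketch below assumes SciPy is installed and uses made-up points; `no_plot=True` returns the tree layout without drawing it, so Matplotlib is not needed here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Made-up data: two tight pairs and one distant point.
X = np.array([[0.0, 0.0], [0.0, 1.0],
              [10.0, 0.0], [10.0, 1.0],
              [5.0, 20.0]])

# Each row of Z records one merge: the two clusters joined, the distance
# at which they were joined, and the size of the resulting cluster.
Z = linkage(X, method="average")
print(Z)

# Compute the dendrogram layout only (leaf order, branch heights).
tree = dendrogram(Z, no_plot=True)
print("Leaf order:", tree["leaves"])

# Cutting the tree so that at most 3 clusters remain gives flat labels.
flat = fcluster(Z, t=3, criterion="maxclust")
print("Flat labels:", flat)
```

Dropping `no_plot=True` (with Matplotlib available) draws the tree, where the height of each join is the distance in the third column of `Z`.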

**4. What is the main difference between K-Means and Hierarchical Clustering?**

K-Means is a partitional method that needs the number of clusters as input, while hierarchical clustering builds a nested hierarchy of clusters and does not require the number in advance; a flat clustering is obtained afterwards by cutting the dendrogram at a chosen level.

**5. What are the advantages of DBSCAN over K-Means?**

DBSCAN can find clusters of arbitrary shape, labels noise points explicitly, and does not require the number of clusters in advance, whereas K-Means needs a fixed value of k and tends to favor roughly spherical clusters.

**6. When would you use Silhouette Score in clustering?**

Silhouette Score is used to measure how well data points fit within their clusters. It helps evaluate the quality of clustering and decide the optimal number of clusters.
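
One common way to use it for choosing k is to fit K-Means for several candidate values and keep the one with the highest mean silhouette. The sketch below assumes synthetic blobs with three hand-picked, well-separated centers; the range of k and the other settings are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three clearly separated centers (assumed for the demo).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)
```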

**7. What are the limitations of Hierarchical Clustering?**

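The standard agglomerative algorithm stores all pairwise distances, so it needs O(n²) memory and, in a naive implementation, O(n³) time, which makes it expensive for large datasets. It is also sensitive to noise and outliers, and its decisions are greedy: once clusters are merged or split, the step cannot be undone.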

**8. Why is feature scaling important in clustering algorithms like K-Means?**

Feature scaling ensures that each feature contributes equally to the distance calculations used by clustering algorithms. Without scaling, features with larger ranges can dominate.
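
A small sketch of the effect, using made-up income and age values: before scaling, the income column dominates the Euclidean distance; after standardising with `StandardScaler`, both features contribute on a comparable scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: annual income, age in years. All values are invented.
X = np.array([[30_000.0, 25.0],
              [30_500.0, 60.0],
              [60_000.0, 26.0]])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Rows 0 and 1 differ mostly in age; rows 0 and 2 differ mostly in income.
raw_01 = dist(X[0], X[1])
raw_02 = dist(X[0], X[2])
print(f"raw:    d(0,1)={raw_01:.1f}  d(0,2)={raw_02:.1f}")

# Standardise each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_01 = dist(X_scaled[0], X_scaled[1])
scaled_02 = dist(X_scaled[0], X_scaled[2])
print(f"scaled: d(0,1)={scaled_01:.2f}  d(0,2)={scaled_02:.2f}")
```

On the raw data a 35-year age gap barely registers next to the income gap; after scaling the two distances are of similar size.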

**9. How does DBSCAN identify noise points?**

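DBSCAN labels a point as noise when it is neither a core point nor within eps of one. A core point has at least min_samples points (counting itself, in scikit-learn) inside its eps-neighborhood. A non-core point that lies within eps of a core point is a border point and joins that cluster. Every remaining point is noise, belongs to no cluster, and receives the label -1 in scikit-learn.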
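
The distinction between sparse points that still join a cluster and true noise can be seen on a tiny made-up example. The data, `eps`, and `min_samples` values below are assumptions chosen so that the two ends of a short chain have too few neighbors to be core points, while one isolated point is noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# A chain of five points spaced 0.1 apart, plus one isolated point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
              [0.3, 0.0], [0.4, 0.0], [5.0, 0.0]])

db = DBSCAN(eps=0.15, min_samples=3).fit(X)

# The chain ends (indices 0 and 4) have only 2 points within eps, so they
# are not core points, yet each lies within eps of a core point and is
# therefore attached to the cluster. Only the isolated point gets -1.
print("labels:      ", db.labels_)
print("core indices:", db.core_sample_indices_)
print("noise mask:  ", db.labels_ == -1)
```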

**10. Define inertia in the context of K-Means.**

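Inertia is the sum of squared distances between each point and its assigned cluster centroid. It measures how internally coherent the clusters are. For a fixed k, lower inertia means tighter clusters; because inertia keeps falling as k grows, a lower value at a larger k does not by itself mean a better clustering.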
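
As a quick check of the definition, the value can be recomputed by hand and compared with scikit-learn's `inertia_` attribute. The dataset and settings below are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Look up the centroid assigned to each point, then sum the squared distances.
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = float(np.sum((X - assigned_centroids) ** 2))

print("manual: ", manual_inertia)
print("sklearn:", km.inertia_)
```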

**11. What is the elbow method in K-Means clustering?**

The elbow method involves plotting the inertia against various k values and selecting the 'elbow point' where inertia decreases less sharply. This helps determine the optimal number of clusters.
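
A minimal sketch of the computation behind the elbow plot, assuming synthetic blobs with four hand-picked centers. It prints the inertia for each k and the drop from one k to the next; plotting `inertias` against `ks` with Matplotlib would give the usual elbow curve.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Four well-separated centers, chosen by hand for the demo.
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=1.0,
    random_state=0,
)

ks = list(range(1, 8))
inertias = {}
for k in ks:
    inertias[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

for k in ks[1:]:
    drop = inertias[k - 1] - inertias[k]
    print(f"k={k}: inertia={inertias[k]:.1f}  drop from k-1={drop:.1f}")
```

On data like this the drop is large up to k=4 and small afterwards, which is where the curve bends. On real data the bend is often less clear, so the choice remains a judgment call.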

**12. Describe the concept of "density" in DBSCAN.**

Density in DBSCAN refers to the number of points in a given neighborhood (radius eps). A region is considered dense if it has at least min_samples points within eps distance.
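
The density test can be written out directly with NumPy. The points and thresholds below are made up, and the neighbor count includes the point itself, following scikit-learn's convention.

```python
import numpy as np

# Four points in a small square plus one far-away point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 3.0]])
eps, min_samples = 0.2, 4

# Pairwise Euclidean distances between all points.
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Number of points inside each point's eps-neighborhood (itself included).
neighbour_counts = (d <= eps).sum(axis=1)

# A neighborhood is dense when it holds at least min_samples points.
is_dense = neighbour_counts >= min_samples

print("neighbour counts:", neighbour_counts)
print("dense region:    ", is_dense)
```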

**13. Can hierarchical clustering be used on categorical data?**

Yes, provided a distance metric suited to categorical data (such as Hamming distance) is used with a compatible linkage such as average or complete; Ward linkage assumes Euclidean distances. It may still perform worse than methods designed specifically for categorical data.

**14. What does a negative Silhouette Score indicate?**

A negative silhouette value means the sample is probably in the wrong cluster: its mean distance to the points of the nearest neighboring cluster is smaller than its mean distance to the other points of its own cluster.
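
To see this on a concrete case, the sketch below deliberately assigns one point to the wrong cluster and computes per-sample values with `silhouette_samples`. The data and labels are invented for the illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Point 2 sits next to points 0 and 1 but is labelled with the far group.
labels = np.array([0, 0, 1, 1, 1, 1])

s = silhouette_samples(X, labels)
for i, value in enumerate(s):
    print(f"point {i}: label={labels[i]}  silhouette={value:.3f}")
```

Only the mislabelled point gets a negative value; the others stay positive because they are closer to their own cluster than to the other one.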

**15. Explain the term "linkage criteria" in hierarchical clustering.**

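Linkage criteria define how the distance between two clusters is calculated when deciding which pair to merge. Common choices are single (minimum pairwise distance), complete (maximum pairwise distance), average (mean pairwise distance), and Ward (the merge that least increases within-cluster variance).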
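
For two small made-up clusters, the four criteria can be computed by hand as follows. The Ward value shown is the increase in within-cluster sum of squares that merging the two clusters would cause.

```python
import numpy as np

# Two toy clusters on a line (values chosen for easy arithmetic).
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

# All distances between a point of A and a point of B.
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single = pairwise.min()     # closest pair
complete = pairwise.max()   # farthest pair
average = pairwise.mean()   # mean over all pairs

# Ward: increase in within-cluster sum of squares if A and B were merged.
gap = A.mean(axis=0) - B.mean(axis=0)
ward_cost = len(A) * len(B) / (len(A) + len(B)) * float(gap @ gap)

print(f"single={single}, complete={complete}, average={average}, ward={ward_cost}")
```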

**16. Why might K-Means clustering perform poorly on data with varying cluster sizes or densities?**

K-Means minimizes within-cluster squared distance, which implicitly assumes clusters are roughly spherical and of similar size and density. When that does not hold, it may split a large, sparse cluster or merge small, dense ones.

**17. What are the core parameters in DBSCAN, and how do they influence clustering?**

The main parameters are eps (radius of neighborhood) and min_samples (minimum points to form a dense region). They control cluster formation and noise detection.
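
The influence of `eps` can be seen by running DBSCAN with a very small, a moderate, and a very large radius on the same data. The three values below are illustrative extremes, not recommended settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.1, random_state=42)

summary = {}
for eps in (0.01, 0.2, 3.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels.tolist()) - {-1})  # -1 is the noise label
    n_noise = int(np.sum(labels == -1))
    summary[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

A radius that is too small leaves no dense regions, so every point is noise; one that is too large merges everything into a single cluster. Raising `min_samples` has a similar effect to shrinking `eps`, because it makes the density requirement stricter.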

**18. How does K-Means++ improve upon standard K-Means initialization?**

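K-Means++ picks the first centroid uniformly at random from the data and each subsequent centroid with probability proportional to its squared distance from the nearest centroid already chosen. This spreads the initial centroids out, which reduces the chance of a poor local optimum and usually speeds up convergence.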
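
A minimal NumPy sketch of this seeding step, assuming the usual squared-distance weighting. The function name `kmeans_pp_init` and the toy data are hypothetical; scikit-learn applies a refined version of the same idea when `init="k-means++"`.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Hypothetical helper: pick k initial centroids, K-Means++ style."""
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Squared distance from each point to its nearest chosen centroid.
        d2 = np.min(
            [np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0
        )
        # Far-away points are proportionally more likely to be picked;
        # already-chosen points have distance 0 and cannot be picked again.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Made-up data: three tight groups far apart.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [8.0, 8.0], [8.2, 8.1], [8.1, 7.9],
              [0.0, 9.0], [0.3, 9.2], [0.1, 8.8]])
init = kmeans_pp_init(X, k=3)
print(init)
```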

**19. What is agglomerative clustering?**

Agglomerative clustering is a bottom-up approach where each point starts in its own cluster, and the two closest clusters are merged step by step until one cluster remains or a stopping criterion, such as a target number of clusters, is met.

**20. What makes Silhouette Score a better metric than just inertia for model evaluation?**

Silhouette Score considers both intra-cluster tightness and inter-cluster separation, making it a more balanced and informative metric than inertia alone.

### Q21. Generate synthetic data with 4 centers using `make_blobs` and apply K-Means clustering. Visualize using a scatter plot.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, y = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, color='red', marker='X')
plt.title("K-Means Clustering with 4 Centers")
plt.show()

### Q22. Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10 predicted labels.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data

agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X)

print("First 10 predicted labels:", labels[:10])

### Q23. Generate synthetic data using `make_moons` and apply DBSCAN. Highlight outliers in the plot.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

X, y = make_moons(n_samples=300, noise=0.1, random_state=42)
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)

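# DBSCAN marks noise points with the label -1
outliers = labels == -1

plt.scatter(X[~outliers, 0], X[~outliers, 1], c=labels[~outliers], cmap='Paired')
plt.scatter(X[outliers, 0], X[outliers, 1], c='red', marker='x', label='Outliers')
plt.title("DBSCAN on make_moons (outliers marked with red x)")
plt.legend()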
plt.show()