<a href="https://colab.research.google.com/github/Chaakash16/Python-Basics/blob/main/Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Theoretical**

1. What is unsupervised learning in the context of machine learning?

Answer:
Unsupervised learning is a type of machine learning where the model learns patterns from data that has no labels. It is mainly used for clustering, association, and dimensionality reduction tasks.

2. How does K-Means clustering algorithm work?

Answer:
K-Means starts from K initial centroids, assigns each data point to the nearest centroid, and then recomputes each centroid as the mean of the points assigned to it. These two steps repeat until the centroids stop changing or a maximum number of iterations is reached.
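
The assign-and-average loop can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the random initialization, and the toy data are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative K-Means loop: assign, average, repeat until stable."""
    rng = np.random.default_rng(seed)
    # Assumption: simple random initialization from the data (not k-means++).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```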

3. Explain the concept of a dendrogram in hierarchical clustering.

Answer:
A dendrogram is a tree-like diagram that shows the merging of clusters in hierarchical clustering. It helps visualize how clusters are formed at various distances or similarity levels.

4. What is the main difference between K-Means and Hierarchical Clustering?

Answer:
K-Means requires the number of clusters in advance and refines a flat partition iteratively, while hierarchical clustering builds a tree of nested clusters; the number of clusters can be chosen afterwards by cutting the tree at a given level.

5. What are the advantages of DBSCAN over K-Means?

Answer:
DBSCAN can find clusters of arbitrary shape and handle noise, unlike K-Means which assumes spherical clusters and is sensitive to outliers.

6. When would you use Silhouette Score in clustering?

Answer:
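Use the Silhouette Score when you need to evaluate clustering quality without ground-truth labels, for example to compare different values of K or different algorithms. It measures how well each data point fits within its own cluster compared to the nearest other cluster, on a scale from -1 to 1.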

7. What are the limitations of Hierarchical Clustering?

Answer:
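Hierarchical clustering is computationally expensive: standard agglomerative methods need at least quadratic time and memory in the number of samples, so they do not scale well to very large datasets. Merges are also final, so a point cannot be reassigned once its cluster has been merged.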

8. Why is feature scaling important in clustering algorithms like K-Means?

Answer:
Feature scaling ensures that all variables contribute equally to the distance calculation. Without it, variables with larger scales dominate the clustering.
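
A small sketch of the effect, using made-up values (the income and age columns are hypothetical and chosen only to exaggerate the scale difference). It compares Euclidean distances before and after `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 is income (large scale), column 1 is age.
X = np.array([[30_000.0, 25.0],
              [30_500.0, 60.0],   # similar income, very different age
              [60_000.0, 26.0]])  # very different income, similar age

# Unscaled: the income column dominates the distance.
raw_near = np.linalg.norm(X[0] - X[1])
raw_far = np.linalg.norm(X[0] - X[2])
print("raw:", raw_near, raw_far)

# Standardized: both features contribute on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)
scaled_near = np.linalg.norm(X_scaled[0] - X_scaled[1])
scaled_far = np.linalg.norm(X_scaled[0] - X_scaled[2])
print("scaled:", scaled_near, scaled_far)
```

On the raw data the age difference is almost invisible to the distance; after scaling, the two comparisons come out at a similar magnitude.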

9. How does DBSCAN identify noise points?

Answer:
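DBSCAN labels a point as noise if it is not a core point (it has fewer than min_samples neighbors within the radius epsilon) and it does not lie within epsilon of any core point. A non-core point that is within epsilon of a core point is a border point and still belongs to that core point's cluster.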
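
The distinction between core, border, and noise points can be checked on a tiny hand-made dataset. The coordinates and the `eps`/`min_samples` values below are assumptions picked so that one point of each kind appears.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: a tight group of five points, one point on its fringe,
# and one point far away from everything.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
              [0.3, 0.1],    # on the fringe of the group
              [5.0, 5.0]])   # isolated

db = DBSCAN(eps=0.25, min_samples=4).fit(X)

core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True
noise = db.labels_ == -1
border = ~core & ~noise  # in a cluster, but not dense enough to be core

print("labels:", db.labels_)
print("core:  ", core)
print("border:", border)
print("noise: ", noise)
```

The fringe point has too few neighbors to be a core point, yet it is not noise because it sits within `eps` of a core point.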

10. Define inertia in the context of K-Means.

Answer:
Inertia is the sum of squared distances between each data point and its cluster center. Lower inertia indicates tighter and more compact clusters.
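
As a quick check of the definition, the sum of squared distances can be computed by hand and compared with the `inertia_` attribute that scikit-learn's `KMeans` exposes. The dataset parameters here are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# For every point, look up the centroid of its assigned cluster,
# then sum the squared distances.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = np.sum((X - assigned_centers) ** 2)

print(manual_inertia, kmeans.inertia_)
```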

11. What is the elbow method in K-Means clustering?

Answer:
The elbow method helps choose the optimal number of clusters by plotting inertia against the number of clusters and finding the point where the decrease slows.

12. Describe the concept of "density" in DBSCAN.

Answer:
In DBSCAN, density refers to the number of points within a given radius. Clusters are formed from areas where the density exceeds a certain threshold.

13. Can hierarchical clustering be used on categorical data?

Answer:
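Yes, but it requires a distance metric suitable for categorical variables, such as Hamming distance, or the categories must first be encoded (for example, as dummy variables). Ward linkage assumes Euclidean distances, so single, complete, or average linkage is the usual choice with a categorical metric.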

14. What does a negative Silhouette Score indicate?

Answer:
A negative Silhouette Score suggests that the sample might be assigned to the wrong cluster, as it is closer to a neighboring cluster than its own.
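
The sign follows directly from the standard definition of the silhouette coefficient for a sample $i$, where $a(i)$ is the mean distance from $i$ to the other points in its own cluster and $b(i)$ is the mean distance from $i$ to the points in the nearest other cluster:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
$$

So $s(i) < 0$ exactly when $a(i) > b(i)$, that is, when the sample is on average closer to a neighboring cluster than to its own.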

15. Explain the term "linkage criteria" in hierarchical clustering.

Answer:
Linkage criteria determine how the distance between clusters is calculated. Common types include single, complete, average, and ward linkage.

16. Why might K-Means clustering perform poorly on data with varying cluster sizes or densities?

Answer:
K-Means minimizes within-cluster variance, which implicitly assumes clusters of similar size and density. If clusters vary significantly, it may split a large or sparse cluster, or assign points from a small cluster to a larger neighbor.

17. What are the core parameters in DBSCAN, and how do they influence clustering?

Answer:
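The main parameters are epsilon (the neighborhood radius) and min_samples (the minimum number of points within that radius for a point to count as a core point). A larger epsilon or smaller min_samples yields fewer, larger clusters and less noise; a smaller epsilon or larger min_samples yields more fragmented clusters and more points labeled as noise.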

18. How does K-Means++ improve upon standard K-Means initialization?

Answer:
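K-Means++ selects initial centroids more strategically: after the first centroid is chosen at random, each subsequent one is sampled with probability proportional to its squared distance from the nearest centroid already chosen. This spreads the centroids out, which reduces the chances of poor clustering results and usually speeds up convergence.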

19. What is agglomerative clustering?

Answer:
Agglomerative clustering is a bottom-up approach where each data point starts as its own cluster, and clusters are merged step-by-step based on similarity.

20. What makes Silhouette Score a better metric than just inertia for model evaluation?

Answer:
Silhouette Score considers both intra-cluster tightness and inter-cluster separation, while inertia only measures compactness, making Silhouette more informative.

**Practical**

1. Generate synthetic data with 4 centers using make_blobs and apply K-Means clustering. Visualize using a scatter plot.

Answer:
Use make_blobs(n_samples=300, centers=4) from sklearn.datasets to generate the data. Fit a KMeans(n_clusters=4) model, predict labels, and visualize using matplotlib.pyplot.scatter() with color-coded clusters.
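
A minimal sketch of these steps. The `cluster_std`, `random_state`, and `n_init` values are assumptions added for reproducibility, not part of the task.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 centers.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Fit K-Means and get one cluster label per sample.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Color each point by its assigned cluster.
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=20)
plt.title("K-Means on make_blobs (4 centers)")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```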

2. Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10 predicted labels.

Answer:
Load the Iris dataset using sklearn.datasets.load_iris(), apply AgglomerativeClustering(n_clusters=3) from sklearn.cluster, fit the model, and print the first 10 values of model.labels_.
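
One way this could look, keeping scikit-learn's default linkage (Ward) since the task does not specify one:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris

X = load_iris().data

# Default linkage is Ward; the task only fixes the number of clusters.
model = AgglomerativeClustering(n_clusters=3)
model.fit(X)

first_ten = model.labels_[:10]
print(first_ten)
```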

3. Generate synthetic data using make_moons and apply DBSCAN. Highlight outliers in the plot.

Answer:
Generate moon-shaped data using make_moons(noise=0.05), apply DBSCAN(eps=0.2, min_samples=5), and plot the clusters using different colors. Points that DBSCAN labels -1 are the outliers (noise); plot them with a distinct marker to highlight them.
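
A sketch of this workflow with the parameters quoted above; `n_samples` and `random_state` are assumptions added for reproducibility. Depending on the sample drawn, there may be few or no outliers.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# DBSCAN marks noise points with the label -1.
outliers = labels == -1

plt.scatter(X[~outliers, 0], X[~outliers, 1],
            c=labels[~outliers], cmap="viridis", s=20)
plt.scatter(X[outliers, 0], X[outliers, 1],
            c="red", marker="x", s=60, label="noise")
plt.legend()
plt.title("DBSCAN on make_moons")
plt.show()
```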

4. Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each cluster.

Answer:
Use StandardScaler to scale the Wine dataset, fit KMeans(n_clusters=3) on the scaled data, and use np.bincount(labels) to print the number of samples in each cluster.
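
A short sketch of these steps; `n_init` and `random_state` are assumptions added so the run is repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X = load_wine().data
X_scaled = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

# np.bincount counts how many samples received each label 0, 1, 2.
sizes = np.bincount(labels)
for cluster_id, size in enumerate(sizes):
    print(f"Cluster {cluster_id}: {size} samples")
```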

5. Use make_circles to generate synthetic data and cluster it using DBSCAN. Plot the result.

Answer:
Generate circular data using make_circles(noise=0.05), apply DBSCAN, and visualize clusters using a scatter plot. DBSCAN is effective here since K-Means would fail due to the shape.

6. Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with 2 clusters. Output the cluster centroids.

Answer:
Scale the data using MinMaxScaler(), fit KMeans(n_clusters=2), and print kmeans.cluster_centers_ to display the centroids in the scaled feature space.

7. Generate synthetic data using make_blobs with varying cluster standard deviations and cluster with DBSCAN.

Answer:
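Generate data with different cluster_std values using make_blobs(), apply DBSCAN, and visualize. Because DBSCAN uses a single global eps, clusters with very different densities are hard to capture with one setting: a value that suits the dense blobs may mark much of a sparse blob as noise, while a larger value may merge neighboring blobs.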

8. Load the Digits dataset, reduce it to 2D using PCA, and visualize clusters from K-Means.

Answer:
Apply PCA to reduce the Digits dataset to 2 components, use KMeans to cluster the data, and plot the clusters in 2D using different colors for each cluster.

9. Create synthetic data using make_blobs and evaluate silhouette scores for k = 2 to 5. Display as a bar chart.

Answer:
For each k from 2 to 5, fit KMeans, calculate silhouette_score, and store results. Plot the scores using a bar chart to visualize which k performs best.
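
The loop could be written as follows. The blob parameters are arbitrary assumptions; which k scores highest depends on the data generated.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=1.0, random_state=42)

ks = list(range(2, 6))
scores = []
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores.append(silhouette_score(X, labels))

plt.bar([str(k) for k in ks], scores)
plt.xlabel("Number of clusters (k)")
plt.ylabel("Silhouette score")
plt.title("Silhouette score for k = 2 to 5")
plt.show()
```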

10. Load the Iris dataset and use hierarchical clustering to group data. Plot a dendrogram with average linkage.

Answer:
Compute linkage matrix using scipy.cluster.hierarchy.linkage() with method='average', then plot a dendrogram with dendrogram() to show how samples are merged.
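
A minimal sketch using SciPy. Leaf labels are hidden with `no_labels=True` (a presentation choice, not a requirement) because 150 of them would be unreadable.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import load_iris

X = load_iris().data

# Each row of Z records one merge: the two clusters joined,
# the distance between them, and the size of the new cluster.
Z = linkage(X, method="average")

plt.figure(figsize=(10, 5))
dendrogram(Z, no_labels=True)
plt.title("Iris dendrogram (average linkage)")
plt.xlabel("Samples")
plt.ylabel("Distance")
plt.show()
```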

11. Generate synthetic data with overlapping clusters using make_blobs, then apply K-Means and visualize with decision boundaries.

Answer:
Create overlapping data with make_blobs, fit KMeans, and use contour plotting to draw decision boundaries by predicting on a meshgrid of points.
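
One possible version of the meshgrid approach. The large `cluster_std` is an assumption used to force overlap, and the grid resolution of 300 is arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# A large cluster_std makes the blobs overlap.
X, _ = make_blobs(n_samples=400, centers=3, cluster_std=2.5, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Build a grid covering the data and predict a cluster for every grid point.
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                     np.linspace(y_min, y_max, 300))
grid_labels = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Filled contours show the region assigned to each centroid.
plt.contourf(xx, yy, grid_labels, alpha=0.3, cmap="viridis")
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap="viridis", s=15)
plt.title("K-Means decision boundaries")
plt.show()
```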

12. Load the Digits dataset and apply DBSCAN after reducing dimensions with t-SNE. Visualize the results.

Answer:
Apply TSNE(n_components=2) to reduce dimensions, fit DBSCAN on the transformed data, and use a scatter plot to visualize clusters and outliers.

13. Generate synthetic data using make_blobs and apply Agglomerative Clustering with complete linkage. Plot the result.

Answer:
Generate data using make_blobs, apply AgglomerativeClustering(linkage='complete'), obtain labels with fit_predict() (AgglomerativeClustering has no separate predict() method), and plot the results using different colors for each cluster.

14. Load the Breast Cancer dataset and compare inertia values for K = 2 to 6 using K-Means. Show results in a line plot.

Answer:
For k values from 2 to 6, fit KMeans, record inertia_ values, and plot them on a line graph to observe the elbow point.
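
A sketch of the loop. Standardizing the features first is an assumption added here (the task does not require it), made because the Breast Cancer features are on very different scales.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

# Assumption: scale first so no single feature dominates the distances.
X = StandardScaler().fit_transform(load_breast_cancer().data)

ks = list(range(2, 7))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Inertia for k = 2 to 6 (Breast Cancer)")
plt.show()
```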

15. Generate synthetic concentric circles using make_circles and cluster using Agglomerative Clustering with single linkage.

Answer:
Create circular data using make_circles, apply AgglomerativeClustering(linkage='single'), and visualize the clustering with a scatter plot.

16. Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters (excluding noise).

Answer:
Standardize the Wine dataset, apply DBSCAN, and count unique labels excluding -1 to get the number of actual clusters formed.
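
The counting step might look like this. The `eps` and `min_samples` values are placeholders, not tuned settings; with 13 standardized features the result is sensitive to `eps`, so expect to adjust it.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Placeholder parameters: tune eps for this dataset.
labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(X_scaled)

# -1 is the noise label, so drop it before counting clusters.
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))

print("Clusters (excluding noise):", n_clusters)
print("Noise points:", n_noise)
```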

17. Generate synthetic data with make_blobs and apply KMeans. Then plot the cluster centers on top of the data points.

Answer:
Fit KMeans on the data and plot the data points using scatter plot. Overlay the centroids using a different color and shape to distinguish them.

18. Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise.

Answer:
After applying DBSCAN to the Iris dataset, count the number of samples labeled as -1 using np.sum(labels == -1) to find how many were marked as noise.

19. Generate synthetic non-linearly separable data using make_moons, apply K-Means, and visualize the clustering result.

Answer:
Use make_moons() to generate data, apply KMeans, and visualize the clusters. The plot will show that K-Means struggles due to the non-linear shape.

20. Load the Digits dataset, apply PCA to reduce to 3 components, then use KMeans and visualize with a 3D scatter plot.

Answer:
Reduce the dataset using PCA(n_components=3), fit KMeans, and visualize the clusters on a 3D matplotlib axes (for example, one created with fig.add_subplot(projection='3d')).

21. Generate synthetic blobs with 5 centers and apply KMeans. Then use silhouette_score to evaluate the clustering.

Answer:
Generate data using make_blobs(n_samples=500, centers=5), apply KMeans(n_clusters=5), predict labels, and evaluate clustering quality using silhouette_score() from sklearn.metrics.

22. Load the Breast Cancer dataset, reduce dimensionality using PCA, and apply Agglomerative Clustering. Visualize in 2D.

Answer:
Load and standardize the dataset, apply PCA(n_components=2) to reduce dimensions, then apply AgglomerativeClustering, and plot the 2D result with clusters shown in different colors.

23. Generate noisy circular data using make_circles and visualize clustering results from KMeans and DBSCAN side-by-side.

Answer:
Create data with make_circles(noise=0.05), apply both KMeans and DBSCAN, and plot their results in two subplots to show how DBSCAN captures the circular shape better.
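
A side-by-side sketch. The `factor`, `eps`, and `min_samples` values are assumptions that may need tuning for other noise levels.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

# One subplot per algorithm, sharing the same data.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X[:, 0], X[:, 1], c=km_labels, cmap="viridis", s=15)
axes[0].set_title("KMeans (k=2)")
axes[1].scatter(X[:, 0], X[:, 1], c=db_labels, cmap="viridis", s=15)
axes[1].set_title("DBSCAN")
plt.tight_layout()
plt.show()
```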

24. Load the Iris dataset and plot the Silhouette Coefficient for each sample after KMeans clustering.

Answer:
Apply KMeans(n_clusters=3) on the Iris dataset, compute sample-wise Silhouette Scores using silhouette_samples(), and visualize them in a bar plot to assess clustering quality.
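
A possible implementation. Sorting the bars by cluster and then by score is a presentation choice made here so that each cluster forms a contiguous block.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples, silhouette_score

X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# One silhouette coefficient per sample.
sample_scores = silhouette_samples(X, labels)
mean_score = silhouette_score(X, labels)

# Order samples by cluster, then by score within each cluster.
order = np.lexsort((sample_scores, labels))

plt.bar(range(len(X)), sample_scores[order],
        color=plt.cm.viridis(labels[order] / 2), width=1.0)
plt.axhline(mean_score, color="red", linestyle="--", label="mean")
plt.xlabel("Samples (grouped by cluster)")
plt.ylabel("Silhouette coefficient")
plt.legend()
plt.show()
```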

25. Generate synthetic data using make_blobs and apply Agglomerative Clustering with 'average' linkage. Visualize clusters.

Answer:
Use make_blobs() to generate data, fit AgglomerativeClustering(linkage='average'), and plot the clusters using a scatter plot with distinct colors for each label.

26. Load the Wine dataset, apply KMeans, and visualize the cluster assignments in a seaborn pairplot (first 4 features).

Answer:
Use KMeans(n_clusters=3), add cluster labels to the dataset, and create a pairplot using seaborn.pairplot() to explore clustering patterns in the first four features.

27. Generate noisy blobs using make_blobs and use DBSCAN to identify both clusters and noise points. Print the count.

Answer:
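Generate blobs with make_blobs() and add scattered points (for example, uniformly random ones) to act as noise, since make_blobs() has no noise parameter of its own. Apply DBSCAN, then count the clusters from np.unique(labels) (excluding -1) and the noise points with np.sum(labels == -1).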

28. Load the Digits dataset, reduce dimensions using t-SNE, then apply Agglomerative Clustering and plot the clusters.

Answer:
Apply TSNE(n_components=2) on the Digits dataset, then fit AgglomerativeClustering, and visualize the 2D clusters using a scatter plot with unique colors for each label.