
# **Instructions:**

1. Understand key parameters of DBSCAN
2. Find out the value of key parameters.



In [None]:
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.spatial import distance_matrix

# Generate sample data (you can replace this with your own dataset)
data = np.array([[1, 2],
                 [1, 3],
                 [2, 3],
                 [8, 7],
                 [9, 8],
                 [8, 9]])

# Calculate the pairwise distance matrix
dist_matrix = distance_matrix(data, data)

# Extract the lower triangular part of the distance matrix
lower_triangular = np.tril(dist_matrix, k=-1)

# Print the lower triangular matrix of distances
print("Lower Triangular Matrix of Distances:")
print(lower_triangular)

# Perform DBSCAN clustering
epsilon = 2.0  # Set the epsilon (Eps) value
min_samples = 2  # Set the MinPoints value
dbscan = DBSCAN(eps=epsilon, min_samples=min_samples)
cluster_assignments = dbscan.fit_predict(data)

# Classify each point as core, border, or outlier.
# A cluster label only says which cluster a point belongs to (-1 means noise);
# core points are the ones listed in dbscan.core_sample_indices_.
core_indices = set(dbscan.core_sample_indices_)
labels = {}
for i, label in enumerate(cluster_assignments):
    if label == -1:
        labels[i] = "Outlier"
    elif i in core_indices:
        labels[i] = "Core"
    else:
        labels[i] = "Border"

print("\nPoint Status (Core, Border, or Outlier):")
for i, status in labels.items():
    print(f"Point {i + 1}: {status}")


Lower Triangular Matrix of Distances:
[[ 0.          0.          0.          0.          0.          0.        ]
 [ 1.          0.          0.          0.          0.          0.        ]
 [ 1.41421356  1.          0.          0.          0.          0.        ]
 [ 8.60232527  8.06225775  7.21110255  0.          0.          0.        ]
 [10.          9.43398113  8.60232527  1.41421356  0.          0.        ]
 [ 9.89949494  9.21954446  8.48528137  2.          1.41421356  0.        ]]

Point Status (Core, Border, or Outlier):
Point 1: Core
Point 2: Core
Point 3: Core
Point 4: Core
Point 5: Core
Point 6: Core


# **ASSESSMENT**

1. **Is it necessary to provide the number of clusters prior to the implementation of DBSCAN?**

No. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) does not take the number of clusters as an input. It discovers the clusters from the density of the data points.


The algorithm identifies clusters as regions of high point density separated by regions of lower density, so the number of clusters follows from the data and the chosen parameters. This contrasts with algorithms such as k-means, which require the number of clusters to be specified in advance.

However, the following two important parameters need to be specified:


1. **Epsilon (Eps)**: The radius within which the algorithm searches for neighboring points. It sets the spatial extent of a data point's neighborhood.


2. **MinPoints**: The minimum number of data points that must lie within the Eps radius of a point for that region to count as dense. A point with at least MinPoints points in its Eps-neighborhood (counting the point itself, as scikit-learn's `min_samples` does) is a core point.
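
As a minimal sketch of this point, the snippet below reuses the six sample points and the `eps=2.0`, `min_samples=2` settings from the code cell above. No cluster count is passed in; the names `n_clusters` and `n_noise` are introduced here only for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Same sample points as in the code cell above.
data = np.array([[1, 2], [1, 3], [2, 3], [8, 7], [9, 8], [8, 9]])

# Only eps and min_samples are given; the number of clusters is not.
labels = DBSCAN(eps=2.0, min_samples=2).fit_predict(data)

# Cluster ids are 0, 1, ...; the label -1 is reserved for noise.
n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int(np.sum(labels == -1))

print("Clusters found:", n_clusters)
print("Noise points:", n_noise)
```

On this data the two groups of three points are found as two clusters, with no point marked as noise.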



---


2. **Is the algorithm sensitive to outliers as is the case of K-Means clustering?**

No. DBSCAN is less sensitive to outliers than K-Means. Here's why:


1. **K-Means Sensitivity to Outliers:**
   - In K-Means clustering, each data point is assigned to the nearest centroid. Outliers, which are data points significantly distant from the cluster centroids, can have a substantial impact on the clustering result.
   - Outliers can disproportionately influence the position and size of clusters because they can "pull" cluster centroids towards them.
   - K-Means aims to minimize the sum of squared distances between data points and their assigned centroids, so outliers can distort the means of clusters.


2. **DBSCAN's Robustness to Outliers:**
   - DBSCAN, on the other hand, is density-based rather than centroid-based. It still measures distances between points, but it defines clusters as dense regions of data points separated by areas of lower point density, with no cluster mean for an outlier to shift.
   - Outliers, by definition, are typically located in areas of lower density. DBSCAN identifies such points as noise and doesn't assign them to any cluster.
   - DBSCAN's clustering result is less influenced by individual outliers because it primarily focuses on identifying dense regions and doesn't force every data point into a cluster.
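
The contrast can be sketched as follows, under some assumptions: the six sample points from the code cell above are reused, a seventh point at `[30, 30]` is made up to act as an outlier, and the `KMeans` settings (`n_clusters=2`, `n_init=10`, `random_state=0`) are chosen only for this illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Two tight groups plus one far-away point (the last row is a made-up outlier).
points = np.array([[1, 2], [1, 3], [2, 3],
                   [8, 7], [9, 8], [8, 9],
                   [30, 30]])

# K-means has to put every point, including the outlier, into one of k clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
km_labels = kmeans.labels_

# DBSCAN can leave a point unassigned by labelling it -1 (noise).
db_labels = DBSCAN(eps=2.0, min_samples=2).fit_predict(points)

print("K-means labels:   ", km_labels)
print("K-means centroids:\n", kmeans.cluster_centers_)
print("DBSCAN labels:    ", db_labels)
```

In the DBSCAN labels the last point receives -1, while K-means gives it an ordinary cluster label, so it takes part in computing a centroid.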

---


3. **Explain the significance of Epsilon (Eps) and MinPoints.**

1. **Epsilon (Eps)**: Epsilon (Eps) is a positive real number that defines the radius around each data point within which the algorithm searches for neighboring points.

   - **Significance**: Epsilon determines the spatial extent of a data point's neighborhood. Data points within this distance are considered neighbors, and they play a central role in defining core points in DBSCAN.
   - **Impact**:
     - A larger Epsilon value widens the neighborhood, so more points qualify as neighbors; clusters grow and nearby clusters may merge.
     - A smaller Epsilon value narrows the neighborhood, so clusters become smaller and tighter and more points are left as noise.
   - **Tuning**: Choosing an appropriate Epsilon value depends on the specific characteristics of your dataset. It often requires domain knowledge or trial and error.


2. **MinPoints**: MinPoints is a positive integer that specifies the minimum number of data points required to form a dense region or cluster.

   - **Significance**: MinPoints defines the density threshold that determines whether a data point is a core point or not. Core points must have at least MinPoints neighbors (including themselves) within an Epsilon radius.
   - **Impact**:
     - Increasing MinPoints raises the density a region must reach to form a cluster. Sparse groups are then treated as noise, which makes the result more robust to noise but can shrink or eliminate clusters.
     - Decreasing MinPoints allows for the detection of sparser clusters but may increase the likelihood of noise points being included as part of a cluster.
   - **Tuning**: Like Epsilon, selecting an appropriate MinPoints value depends on the dataset and the desired cluster density. It may require experimentation to find the optimal value.
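
The effect of both parameters can be explored with a small sweep. This is a rough sketch on the six sample points from the code cell above; the helper `summarize` and the particular parameter values are assumptions made for illustration, not a recommended tuning procedure.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Same sample points as in the code cell above.
data = np.array([[1, 2], [1, 3], [2, 3], [8, 7], [9, 8], [8, 9]])

def summarize(eps, min_samples):
    """Return (number of clusters, number of noise points) for one setting."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(data)
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

# Vary Eps while MinPoints stays at 2.
for eps in (0.5, 1.2, 1.5, 8.0):
    print(f"eps={eps}, min_samples=2 ->", summarize(eps, 2))

# Vary MinPoints while Eps stays at 1.5.
for min_samples in (2, 3, 4):
    print(f"eps=1.5, min_samples={min_samples} ->", summarize(1.5, min_samples))
```

On this data, growing `eps` moves the result from all noise, to one cluster with three noise points, to two clusters, to a single merged cluster. Raising `min_samples` to 4 at `eps=1.5` turns every point into noise.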

---


4. **Explain the 3 types of data points used in the algorithm**

- **Core Points:** Data points that have at least MinPoints points (including themselves) within the Eps radius.


- **Border Points:** Data points that lie within the Eps distance of a core point but have fewer than MinPoints points in their own Eps-neighborhood, so they are not core points themselves.


- **Noise (Outlier) Points:** Data points that are neither core points nor border points. They are not within Eps of any core point and are not assigned to any cluster.
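
These three definitions can be applied directly with NumPy. The sketch below is an illustration, not scikit-learn's implementation: `classify_points` is a hypothetical helper, the seventh point at `[20, 20]` is made up to produce a noise point, and `eps=1.5` with `min_pts=3` is chosen so that all three types appear.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each row of X as 'core', 'border', or 'noise'."""
    # Pairwise Euclidean distances between all points.
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))

    # within[i, j] is True when point j lies in the Eps-neighborhood of point i.
    # The diagonal is True, so each point counts itself.
    within = dist <= eps

    # Core: at least min_pts points (including itself) within eps.
    is_core = within.sum(axis=1) >= min_pts

    # Border: not core, but within eps of at least one core point.
    near_core = (within & is_core[None, :]).any(axis=1)
    is_border = ~is_core & near_core

    return np.where(is_core, "core", np.where(is_border, "border", "noise"))

# Six sample points from the code cell above, plus one made-up isolated point.
X = np.array([[1, 2], [1, 3], [2, 3],
              [8, 7], [9, 8], [8, 9],
              [20, 20]])

kinds = classify_points(X, eps=1.5, min_pts=3)
for i, kind in enumerate(kinds):
    print(f"Point {i + 1}: {kind}")
```

With these settings, points 1, 2, 3, and 5 are core points, points 4 and 6 are border points attached to point 5, and point 7 is noise.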

---


