In [None]:
Q1. Explain the basic concept of clustering and give examples of applications where clustering is useful.

In [None]:
Clustering is a fundamental technique in unsupervised machine learning used to group similar objects or data points into clusters, where objects within the same cluster are more similar to each other than to those in other clusters. The goal of clustering is to discover inherent structures or patterns in the data without the need for labeled examples.

The basic concept of clustering involves partitioning a dataset into subsets, or clusters, based on some measure of similarity or distance between data points. Common clustering algorithms aim to optimize certain criteria, such as maximizing intra-cluster similarity while minimizing inter-cluster similarity. Clustering algorithms can broadly be categorized into partitioning methods (e.g., K-means), hierarchical methods (e.g., agglomerative clustering), density-based methods (e.g., DBSCAN), and model-based methods (e.g., Gaussian mixture models).

Examples of applications where clustering is useful include:

1. Customer Segmentation:
   - Clustering customers based on their purchasing behavior, demographics, or preferences to identify distinct market segments. This helps businesses tailor marketing strategies and product offerings to different customer segments.

2. Document Clustering:
   - Grouping similar documents together based on their content or topics. This aids in organizing large document collections, information retrieval, and document summarization.

3. Image Segmentation:
   - Partitioning an image into regions or segments with similar colors, textures, or features. Image segmentation is used in medical imaging, object recognition, and computer vision applications.

4. Anomaly Detection:
   - Identifying outliers or unusual patterns in data that do not conform to expected behavior. Anomaly detection is applied in fraud detection, network security, and detecting equipment failures in industrial systems.

5. Genomic Clustering:
   - Grouping genes or proteins based on their expression patterns or sequence similarities. Genomic clustering aids in understanding gene functions, identifying biomarkers, and studying disease mechanisms.

6. Social Network Analysis:
   - Clustering users or communities in social networks based on their connections, interests, or interactions. Social network clustering helps identify influential users, detect communities, and analyze information diffusion.

7. Recommendation Systems:
   - Grouping users or items with similar preferences or behavior in recommendation systems. Clustering helps personalize recommendations and improve user experience in e-commerce, content recommendation, and movie/music streaming platforms.

8. Climate Pattern Analysis:
   - Clustering weather or climate data to identify recurring patterns, trends, or anomalies. Climate pattern analysis aids in weather forecasting, climate modeling, and understanding climate variability.

These are just a few examples of how clustering is applied across various domains to uncover patterns, structure, and insights from data, ultimately leading to informed decision-making and improved understanding of complex phenomena.

In [None]:
Q2. What is DBSCAN and how does it differ from other clustering algorithms such as k-means and 
hierarchical clustering?

In [None]:
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups points in dense regions into clusters of varying shapes and sizes and labels points in sparse regions as noise. Unlike K-means, DBSCAN does not require the number of clusters to be specified beforehand, and unlike both K-means and standard hierarchical clustering it has an explicit notion of noise. Here's an overview of DBSCAN and its differences from K-means and hierarchical clustering:

DBSCAN:

1. Density-Based Clustering:
   - DBSCAN defines clusters as dense regions in the data space, separated by regions of lower density.
   - It assigns each data point to one of three categories: core points, border points, or noise points (outliers).
   - Core points are densely connected to other points and form the core of a cluster. Border points are connected to core points but are not dense enough to be considered core points. Noise points do not belong to any cluster.

2. No Preset Number of Clusters:
   - DBSCAN does not require the number of clusters to be specified beforehand.
   - It relies on two parameters: epsilon (\( \varepsilon \)), the maximum distance between two points to be considered neighbors, and MinPts, the minimum number of points required to form a dense region (core point).

3. Handles Arbitrary Cluster Shapes:
   - DBSCAN can identify clusters of arbitrary shapes and sizes, making it suitable for datasets with complex structures.
   - It is robust to outliers and noise, although its single global \( \varepsilon \) makes it struggle when clusters have very different densities.

4. No Assumptions about Cluster Geometry:
   - DBSCAN does not make assumptions about the geometry of clusters, unlike K-means which assumes spherical clusters and hierarchical clustering which produces nested, hierarchical clusters.
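
The three point categories described above can be illustrated with a short sketch. It assumes Euclidean distance and counts a point as its own neighbor; the function name `classify_points` and the toy data are made up for illustration and are not part of any library.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each row of X as 'core', 'border', or 'noise' (illustrative helper)."""
    # Pairwise Euclidean distances, shape (n, n)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    within_eps = dists <= eps                  # each point counts as its own neighbor
    is_core = within_eps.sum(axis=1) >= min_pts
    # A border point is not core itself but has at least one core point within eps
    has_core_neighbor = (within_eps & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core",
                    np.where(has_core_neighbor, "border", "noise"))

# Four points along a line plus one far-away point (toy data)
X = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [1.8, 0.0], [6.0, 6.0]])
kinds = classify_points(X, eps=1.0, min_pts=3)
print(kinds.tolist())
```

Here the point at x = 1.8 has only one neighbor besides itself, so it is not a core point, but it lies within \( \varepsilon \) of the core point at x = 1.0 and is therefore a border point.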

Differences from K-means:

1. Number of Clusters:
   - K-means requires the number of clusters (\( k \)) to be specified beforehand, whereas DBSCAN automatically determines the number of clusters based on the data.

2. Cluster Shape:
   - K-means assumes that clusters are spherical and isotropic, whereas DBSCAN can identify clusters of arbitrary shapes.

3. Handling Outliers:
   - K-means has no notion of noise: every point, including outliers, is assigned to its nearest centroid, and outliers pull centroids toward them. DBSCAN explicitly identifies outliers as noise points.

Differences from Hierarchical Clustering:

1. Hierarchy of Clusters:
   - Hierarchical clustering produces a hierarchical structure of clusters, whereas DBSCAN does not explicitly produce a hierarchy of clusters.

2. Handling Noise:
   - Hierarchical clustering does not explicitly handle noise points, whereas DBSCAN identifies noise points as outliers.

3. Computation Complexity:
   - With a spatial index, DBSCAN runs in about \( O(n \log n) \) time on average and \( O(n^2) \) in the worst case. Standard agglomerative hierarchical clustering needs \( O(n^2) \) memory and between \( O(n^2) \) and \( O(n^3) \) time, which becomes costly for large datasets.

In summary, DBSCAN is a density-based clustering algorithm that can identify clusters of arbitrary shapes and sizes without requiring the number of clusters to be specified beforehand. It differs from K-means in its ability to handle arbitrary cluster shapes and from hierarchical clustering in its approach to cluster formation and handling noise points.

In [None]:
Q3. How do you determine the optimal values for the epsilon and minimum points parameters in DBSCAN 
clustering?

In [None]:
Determining the optimal values for the epsilon (\( \varepsilon \)) and minimum points parameters in DBSCAN clustering is crucial for obtaining meaningful clustering results. Here are some common approaches to determining these parameters:

1. Visual Inspection:
   - Visualize the dataset and experiment with different values of \( \varepsilon \) and MinPts.
   - Choose values that result in clusters that align with the underlying structure of the data.
   - Adjust the parameters iteratively based on the clustering results until meaningful clusters are obtained.

2. Distance Distribution Analysis:
   - Analyze the distribution of distances between data points.
   - Plot a k-distance graph: compute each point's distance to its \( k \)-th nearest neighbor (commonly with \( k \) equal to MinPts), sort these distances in ascending order, and plot them on the \( y \)-axis.
   - Choose \( \varepsilon \) at the "knee" or "elbow" of the curve, where the sorted distances start to rise sharply.
   - The value of \( \varepsilon \) corresponding to the knee point represents a reasonable neighborhood size.

3. Reachability Plot (OPTICS):
   - A reachability plot comes from OPTICS, a generalization of DBSCAN. It shows the points in the order OPTICS visits them, with each point's reachability distance on the \( y \)-axis.
   - The reachability distance of a point \( p \) with respect to a core point \( o \) is \( \max(\text{core-dist}(o), d(o, p)) \), where the core distance of \( o \) is the distance to its MinPts-th nearest neighbor.
   - Valleys in the plot correspond to clusters. Choose \( \varepsilon \) as a threshold that separates the valleys from the peaks between them.

4. Silhouette Score:
   - Compute the silhouette score for different combinations of \( \varepsilon \) and MinPts.
   - The silhouette score measures the cohesion within clusters and the separation between clusters.
   - Choose the combination of parameters that maximizes the average silhouette score, indicating well-defined and separated clusters.

5. Grid Search:
   - Perform a grid search over a range of values for \( \varepsilon \) and MinPts.
   - Evaluate the clustering quality using internal validation metrics such as silhouette score or Davies-Bouldin index.
   - Choose the combination of parameters that yields the best clustering performance.

6. Domain Knowledge:
   - Incorporate domain knowledge or expert judgment to guide the selection of parameters.
   - Consider the characteristics of the data, such as the density of clusters, the scale of the features, and the expected size of clusters.

7. Stability Checks on Subsamples:
   - Clustering has no ground-truth labels to validate against, and DBSCAN does not assign new points to existing clusters, so conventional cross-validation does not apply directly.
   - Instead, cluster several random subsamples of the dataset with the same parameters and compare the results.
   - Prefer parameters whose clusters stay consistent across subsamples.

It's important to note that parameter tuning in DBSCAN can be subjective and dependent on the specific characteristics of the dataset and the objectives of the analysis. Experimenting with different parameter values and evaluating the clustering results using multiple approaches can help identify the optimal values for \( \varepsilon \) and MinPts.
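
As a rough illustration of the k-distance approach, the sketch below computes the sorted k-distance curve with numpy. The helper names and the "largest jump" rule for locating the knee are assumptions made for this example, not a standard method; in practice the curve is usually plotted and the knee read off by eye.

```python
import numpy as np

def k_distances(X, k):
    """Distance from each point to its k-th nearest neighbor, sorted ascending."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # After sorting a row, column 0 is the point itself (distance 0),
    # so column k holds the k-th nearest *other* point.
    kth = np.sort(dists, axis=1)[:, k]
    return np.sort(kth)

def knee_by_largest_jump(sorted_kdist):
    """Crude knee heuristic (assumption): the value just before the largest jump."""
    jumps = np.diff(sorted_kdist)
    return sorted_kdist[np.argmax(jumps)]

# Toy data: a dense 10x10 grid (spacing 0.1) plus three isolated points
ticks = np.arange(10) * 0.1
xx, yy = np.meshgrid(ticks, ticks)
dense = np.column_stack([xx.ravel(), yy.ravel()])
isolated = np.array([[5.0, 5.0], [-5.0, 5.0], [5.0, -5.0]])
X = np.vstack([dense, isolated])

kd = k_distances(X, k=4)
eps_guess = knee_by_largest_jump(kd)
print(f"suggested eps ~ {eps_guess:.2f}")
```

On this toy data the curve stays flat for the grid points and jumps for the three isolated points, so the heuristic picks a value just large enough to make every grid point a core point.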

In [None]:
Q4. How does DBSCAN clustering handle outliers in a dataset?

In [None]:
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering handles outliers in a dataset by explicitly identifying them as noise points that do not belong to any cluster. Here's how DBSCAN handles outliers:

1. Density-Based Approach:
   - DBSCAN defines clusters as dense regions in the data space, separated by regions of lower density.
   - Core points are densely connected to other points and form the core of a cluster, while border points are connected to core points but are not dense enough to be considered core points.
   - Noise points, or outliers, are points that are neither core points nor border points and do not belong to any cluster.

2. Parameters That Define Noise:
   - DBSCAN does not require the number of clusters to be specified beforehand but relies on two parameters: epsilon (\( \varepsilon \)), the maximum distance between two points to be considered neighbors, and MinPts, the minimum number of points required to form a dense region (core point).
   - Points that are not within the \( \varepsilon \)-neighborhood of any core point and do not satisfy the MinPts criterion are considered noise points.

3. Handling Outliers:
   - DBSCAN explicitly identifies outliers as noise points during the clustering process.
   - Noise points are not assigned to any cluster and are treated as separate entities in the dataset.
   - Outliers may occur due to data artifacts, measurement errors, or instances that do not conform to the underlying patterns in the data.

4. Robustness to Outliers:
   - DBSCAN is robust to outliers and noise in the data because it focuses on identifying dense regions rather than individual data points.
   - Outliers do not affect the clustering of the main data clusters, as they are not considered part of any cluster.

5. Parameter Sensitivity:
   - The choice of parameters \( \varepsilon \) and MinPts can affect the handling of outliers in DBSCAN.
   - Smaller values of \( \varepsilon \) or larger values of MinPts classify more points as outliers, while larger values of \( \varepsilon \) or smaller values of MinPts absorb more points into clusters and leave fewer outliers.

In summary, DBSCAN clustering explicitly identifies outliers as noise points that do not belong to any cluster. By focusing on dense regions in the data space, DBSCAN is robust to outliers and noise and can handle complex cluster shapes, provided the clusters have broadly similar densities.
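
The effect of \( \varepsilon \) on the number of noise points can be checked with a small sweep. This is a minimal sketch assuming Euclidean distance and a point counting as its own neighbor; `noise_mask` is a hypothetical helper, and the one-dimensional layout is toy data chosen so the counts are easy to verify by hand.

```python
import numpy as np

def noise_mask(X, eps, min_pts):
    """True for points that are neither core nor within eps of a core point."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    within = dists <= eps
    is_core = within.sum(axis=1) >= min_pts
    # Core points are within eps of themselves, so they are never noise
    near_core = (within & is_core[None, :]).any(axis=1)
    return ~near_core

# A tight run of five points, then increasingly isolated points
X = np.array([[0.0, 0], [0.2, 0], [0.4, 0], [0.6, 0], [0.8, 0],
              [1.5, 0], [3.0, 0], [10.0, 0]])

eps_values = [0.1, 0.3, 1.0, 3.0]
noise_counts = [int(noise_mask(X, eps, min_pts=3).sum()) for eps in eps_values]
for eps, count in zip(eps_values, noise_counts):
    print(f"eps={eps}: {count} noise points")
```

With MinPts fixed at 3, the noise count falls from 8 to 3, 2, and 1 as \( \varepsilon \) grows: a larger neighborhood absorbs more of the isolated points into clusters.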

In [None]:
Q5. How does DBSCAN clustering differ from k-means clustering?

In [None]:
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering and K-means clustering are two fundamentally different clustering algorithms that approach clustering from different perspectives. Here are some key differences between DBSCAN clustering and K-means clustering:

1. Algorithm Type:
   - DBSCAN is a density-based clustering algorithm, while K-means is a centroid-based clustering algorithm.
   - DBSCAN identifies clusters based on the density of data points, whereas K-means partitions the data into clusters based on the distance to centroids.

2. Number of Clusters:
   - DBSCAN does not require the number of clusters to be specified beforehand, as it automatically identifies clusters based on density.
   - K-means requires the number of clusters (\( k \)) to be predefined before clustering.

3. Cluster Shape:
   - DBSCAN can identify clusters of arbitrary shapes and sizes, as it defines clusters based on density-connected regions in the data space.
   - K-means assumes that clusters are spherical and isotropic, as it assigns data points to the nearest centroid.

4. Handling Outliers:
   - DBSCAN explicitly identifies outliers as noise points that do not belong to any cluster.
   - K-means treats outliers as data points that are assigned to the nearest centroid, potentially affecting the centroid positions and cluster boundaries.

5. Robustness to Noise:
   - DBSCAN is robust to noise and outliers in the data, as it focuses on identifying dense regions rather than individual data points.
   - K-means is sensitive to outliers and noise, as they can affect the position of centroids and the assignment of data points to clusters.

6. Parameter Sensitivity:
   - DBSCAN relies on two parameters: epsilon (\( \varepsilon \)), the maximum distance between two points to be considered neighbors, and MinPts, the minimum number of points required to form a dense region.
   - K-means is sensitive to the initial placement of centroids and may converge to different solutions depending on the initialization.

7. Cluster Representation:
   - In DBSCAN, clusters are represented as dense regions in the data space. Each data point belongs to at most one cluster; a border point within reach of two clusters is assigned to whichever cluster reaches it first.
   - In K-means, clusters are represented by centroids, and each data point is assigned to exactly one cluster, that of its nearest centroid.

In summary, DBSCAN clustering and K-means clustering are distinct algorithms with different approaches to clustering. DBSCAN is suitable for identifying clusters of arbitrary shapes and handling outliers, while K-means is more appropriate for datasets with well-defined spherical clusters and a predefined number of clusters.
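
To make the outlier point concrete, here is a minimal sketch of Lloyd's algorithm (the usual K-means iteration) on toy data with one far-away point. The `kmeans` function, the fixed initial centroids, and the data are assumptions for illustration only.

```python
import numpy as np

def kmeans(X, init_centroids, n_iter=20):
    """Plain Lloyd iterations from fixed initial centroids (illustrative)."""
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid; there is no noise label
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        for j in range(len(centroids)):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

group_a = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
group_b = np.array([[10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
outlier = np.array([[30.0, 30.0]])
X = np.vstack([group_a, group_b, outlier])

labels, centroids = kmeans(X, init_centroids=[[0, 0], [10, 10]])
print("outlier assigned to cluster", labels[-1])
print("centroid of that cluster:", centroids[labels[-1]])
```

The outlier is forced into the second cluster and drags its centroid to (14.4, 14.4), outside the square that the four genuine members occupy. DBSCAN with a small \( \varepsilon \) would instead leave the point at (30, 30) unassigned as noise.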

In [None]:
Q6. Can DBSCAN clustering be applied to datasets with high dimensional feature spaces? If so, what are 
some potential challenges?

In [None]:
Yes, DBSCAN clustering can be applied to datasets with high-dimensional feature spaces, but there are some potential challenges associated with doing so. Here are some considerations when applying DBSCAN to high-dimensional datasets:

1. Curse of Dimensionality:
   - In high-dimensional spaces, the density of data points tends to decrease exponentially with the number of dimensions. This phenomenon, known as the curse of dimensionality, can affect the performance of density-based clustering algorithms like DBSCAN.
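   - As the number of dimensions increases, pairwise distances become nearly equal, so the difference between a point's nearest and farthest neighbors shrinks. The concept of density becomes less meaningful, making it challenging to define appropriate neighborhoods and density thresholds.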

2. Parameter Sensitivity:
   - The choice of parameters (\( \varepsilon \) and MinPts) in DBSCAN becomes more critical in high-dimensional spaces.
   - Determining suitable values for \( \varepsilon \) and MinPts that capture meaningful density structures in the data can be challenging, as the notion of distance and density may vary across dimensions.

3. Dimension Reduction:
   - Dimensionality reduction techniques such as PCA (Principal Component Analysis) or t-SNE (t-Distributed Stochastic Neighbor Embedding) can be applied to reduce the dimensionality of the data before clustering.
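   - Dimension reduction can help mitigate the curse of dimensionality and improve the performance of DBSCAN by focusing on the most informative dimensions. Note that t-SNE does not preserve densities or global distances, so clusters found on a t-SNE embedding should be interpreted with caution.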

4. Feature Scaling:
   - Feature scaling is important in high-dimensional spaces to ensure that all dimensions contribute equally to the distance calculations.
   - Standardizing or normalizing the features to have zero mean and unit variance can help alleviate the influence of features with larger scales on the clustering results.

5. Sparse Data:
   - High-dimensional datasets often exhibit sparsity, where most feature values are zero or close to zero.
   - DBSCAN may struggle to identify dense regions in sparse data, leading to fragmented or suboptimal clustering results.

6. Computational Complexity:
   - The computational complexity of DBSCAN increases with the dimensionality of the data, as distance calculations become more computationally expensive in high-dimensional spaces.
   - Efficient implementation techniques and algorithms tailored for high-dimensional clustering may be necessary to handle large-scale datasets.

In summary, while DBSCAN clustering can be applied to datasets with high-dimensional feature spaces, there are challenges associated with the curse of dimensionality, parameter sensitivity, and computational complexity. Careful consideration of these challenges and appropriate preprocessing techniques can help improve the performance of DBSCAN on high-dimensional datasets.
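
The way distances lose contrast in high dimensions can be observed directly. The sketch below draws uniform random points and compares the nearest and farthest distance from one reference point; the `relative_contrast` helper and the chosen sizes are assumptions for illustration.

```python
import numpy as np

def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from the first point to the rest."""
    X = rng.uniform(size=(n_points, n_dims))
    d = np.linalg.norm(X[1:] - X[0], axis=1)
    return (d.max() - d.min()) / d.min()

rng = np.random.default_rng(42)
contrasts = {dim: relative_contrast(500, dim, rng) for dim in (2, 10, 100, 1000)}
for dim, c in contrasts.items():
    print(f"{dim:>5} dims: relative contrast = {c:.2f}")
```

As the dimension grows the contrast drops toward zero, which means almost any \( \varepsilon \) either includes nearly every point in a neighborhood or nearly none.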

In [None]:
Q7. How does DBSCAN clustering handle clusters with varying densities?

In [None]:
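DBSCAN (Density-Based Spatial Clustering of Applications with Noise) handles clusters with varying densities only to a limited extent. Core status is decided locally, from each point's own neighborhood, but the same global \( \varepsilon \) and MinPts apply everywhere, so clusters whose densities differ widely are hard to capture with one setting. Here's how DBSCAN behaves on clusters with varying densities: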

1. Core Points:
   - DBSCAN identifies core points as data points with a sufficient number of neighbors within a specified distance (\( \varepsilon \)).
   - Core points are located in dense regions of the dataset and serve as the core or nucleus of a cluster.

2. Border Points:
   - Border points are data points that are not core points themselves but are within the \( \varepsilon \)-neighborhood of a core point.
   - Border points are connected to a cluster and help expand the cluster boundaries but do not have enough neighbors to be considered core points.

3. Density-Reachability:
   - A point is directly density-reachable from a core point if it lies within that core point's \( \varepsilon \)-neighborhood. Chaining such steps through core points gives density-reachability.
   - Because every step uses the same \( \varepsilon \) and MinPts, a cluster can contain regions of different density only as long as each region stays above that single density threshold.

4. Density-Connected Regions:
   - DBSCAN forms clusters by grouping together core points and their directly reachable neighbors.
   - A cluster consists of all core points and border points that are density-connected to each other, forming a contiguous region of high density.

5. Fixed Neighborhood Size:
   - The \( \varepsilon \)-neighborhood has the same radius for every data point; DBSCAN does not adapt it to the local density.
   - An \( \varepsilon \) small enough to separate dense clusters tends to label sparse clusters as noise, while an \( \varepsilon \) large enough to capture sparse clusters tends to merge nearby dense ones.

6. Parameter Sensitivity:
   - The parameter \( \varepsilon \) controls the size of the neighborhood and, together with MinPts, sets the one density threshold used for all clusters.
   - When densities differ moderately, a compromise value can work. When they differ widely, variants such as OPTICS or HDBSCAN, which consider a range of density levels, are usually a better fit.

In summary, DBSCAN identifies core points in dense regions and expands clusters to include the border points within \( \varepsilon \) of them. This handles clusters of different shapes and sizes well, but because one global density threshold applies to the whole dataset, clusters of very different densities may be split, merged, or discarded as noise.
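
The limitation of a single global \( \varepsilon \) can be shown on a constructed example with two dense groups close together and one sparse group far away. This is a sketch under the assumption of Euclidean distance; the helper names and the grid layout are made up for illustration.

```python
import numpy as np

def grid(spacing, origin, side=5):
    """side x side square grid of points with the given spacing (toy data)."""
    ticks = np.arange(side) * spacing
    xx, yy = np.meshgrid(ticks, ticks)
    return np.column_stack([xx.ravel(), yy.ravel()]) + np.asarray(origin, dtype=float)

def neighborhood_summary(X, eps, min_pts):
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    within = dists <= eps
    is_core = within.sum(axis=1) >= min_pts
    is_noise = ~(within & is_core[None, :]).any(axis=1)
    return within, is_core, is_noise

dense_a = grid(0.1, (0.0, 0.0))     # spacing 0.1
dense_b = grid(0.1, (1.0, 0.0))     # same density, 0.6 away from dense_a
sparse = grid(1.0, (20.0, 20.0))    # spacing 1.0, ten times sparser
X = np.vstack([dense_a, dense_b, sparse])
a, b, s = slice(0, 25), slice(25, 50), slice(50, 75)

results = {}
for eps in (0.15, 1.1):
    within, is_core, is_noise = neighborhood_summary(X, eps, min_pts=4)
    sparse_noise = int(is_noise[s].sum())
    # The two dense groups merge if any core point in one lies within eps
    # of a core point in the other
    merged = bool((within[a, b] & is_core[a, None] & is_core[None, b]).any())
    results[eps] = (sparse_noise, merged)
    print(f"eps={eps}: sparse points marked noise={sparse_noise}, dense groups merged={merged}")
```

With \( \varepsilon = 0.15 \) the two dense groups stay separate but all 25 sparse points become noise; with \( \varepsilon = 1.1 \) the sparse group is recovered but the two dense groups merge into one cluster. No single value gives three clean clusters here.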

In [None]:
Q8. What are some common evaluation metrics used to assess the quality of DBSCAN clustering results?

In [None]:
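Several evaluation metrics can be used to assess the quality of DBSCAN clustering results, with two caveats. Internal metrics such as the silhouette score favor compact, convex clusters and can undervalue the arbitrary shapes DBSCAN finds, and noise points should usually be excluded or reported separately rather than scored as a cluster. With that in mind, here are some common evaluation metrics used for assessing the quality of DBSCAN clustering results: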

1. Silhouette Score:
   - The silhouette score measures the cohesion within clusters and the separation between clusters.
   - It is calculated for each data point as: \[ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \]
   - Where \( a(i) \) is the average distance from the point to other points in the same cluster, and \( b(i) \) is the smallest average distance from the point to points in a different cluster.
   - The overall silhouette score is the average silhouette score across all data points, ranging from -1 (poor clustering) to +1 (dense, well-separated clusters).

2. Davies-Bouldin Index:
   - The Davies-Bouldin index measures the average similarity between each cluster and its most similar cluster, normalized by the average dissimilarity within clusters.
   - It is calculated as: \[ DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(C_i, C_j)} \right) \]
   - Where \( n \) is the number of clusters, \( \sigma_i \) and \( \sigma_j \) are the average distances from each point in cluster \( i \) and \( j \) to their respective centroids, and \( d(C_i, C_j) \) is the distance between the centroids of clusters \( i \) and \( j \).
   - Lower values of the Davies-Bouldin index indicate better clustering, with a minimum value of 0 indicating perfectly separated clusters.

3. Calinski-Harabasz Index:
   - The Calinski-Harabasz index measures the ratio of between-cluster dispersion to within-cluster dispersion.
   - It is calculated as: \[ CH = \frac{B}{W} \times \frac{n - k}{k - 1} \]
   - Where \( B \) is the between-cluster dispersion, \( W \) is the within-cluster dispersion, \( n \) is the total number of data points, and \( k \) is the number of clusters.
   - Higher values of the Calinski-Harabasz index indicate better clustering, with a higher ratio of between-cluster dispersion to within-cluster dispersion.

4. Visual Inspection:
   - While not a quantitative metric, visual inspection of the clustering results can provide valuable insights into the quality of the clustering.
   - Scatter plots of the clustered data points, colored by cluster label and with noise points marked separately, can help assess the separation and compactness of clusters.

These evaluation metrics provide different perspectives on the quality of DBSCAN clustering results. It's important to consider the characteristics of the data and the clustering objectives when selecting and interpreting evaluation metrics. Additionally, DBSCAN's effectiveness may be influenced by the choice of parameters (\( \varepsilon \) and MinPts), so parameter tuning and sensitivity analysis are also important aspects of assessing clustering quality.
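
The silhouette formula above can be computed directly from a label array. The sketch below follows that formula and, as an assumption about how to treat DBSCAN output, drops points labeled -1 (noise) before scoring; the function name and toy data are illustrative.

```python
import numpy as np

def silhouette_excluding_noise(X, labels):
    """Mean silhouette s(i) = (b - a) / max(a, b), ignoring points labeled -1."""
    keep = labels != -1
    X, labels = X[keep], labels[keep]
    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # common convention: s(i) = 0 for a singleton cluster
        # a(i): mean distance to the other members of the same cluster
        a_i = dists[i, same].sum() / (n_same - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b_i = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b_i - a_i) / max(a_i, b_i)
    return scores.mean()

X = np.array([[0, 0], [0, 1], [10, 0], [10, 1], [50, 50]], dtype=float)
labels = np.array([1, 1, 2, 2, -1])   # last point is noise
score = silhouette_excluding_noise(X, labels)
print(f"silhouette (noise excluded): {score:.3f}")
```

Excluding noise keeps a few far-away points from distorting the score, but the fraction of points labeled as noise should then be reported alongside it, since a high silhouette on a small retained subset can be misleading.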

In [None]:
Q9. Can DBSCAN clustering be used for semi-supervised learning tasks?

In [None]:
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is primarily an unsupervised clustering algorithm and is not inherently designed for semi-supervised learning tasks. However, with some adaptations and in combination with other techniques, DBSCAN can be used in semi-supervised learning scenarios. Here's how DBSCAN can be applied to semi-supervised learning tasks:

1. Seed-Based Clustering:
   - In semi-supervised learning, some labeled data points (seeds) are provided along with the unlabeled data.
   - DBSCAN can be seeded with labeled data points to guide the clustering process. These labeled points can be treated as core points, and the clustering algorithm can propagate labels to nearby unlabeled points based on their density and reachability.

2. Pseudo-Labeling:
   - After clustering the data with DBSCAN, each data point is assigned a cluster label (including noise points).
   - The labeled data points can then be used to train a classifier in a semi-supervised manner, where the cluster labels act as pseudo-labels for the unlabeled data points.

3. Density-Based Outlier Detection:
   - DBSCAN can be used for outlier detection, identifying data points that do not belong to any cluster (noise points).
   - In semi-supervised learning, these noise points can be treated as anomalies or outliers and given special consideration during model training.

4. Combination with Other Algorithms:
   - DBSCAN can be combined with other semi-supervised learning algorithms to leverage its clustering capabilities.
   - For example, after clustering with DBSCAN, a classifier such as a support vector machine (SVM) or a decision tree can be trained on the labeled data points and the cluster labels assigned by DBSCAN.

5. Active Learning:
   - DBSCAN clustering results can guide the selection of informative data points for labeling in active learning settings.
   - By identifying dense regions or clusters with uncertain boundaries, DBSCAN can suggest areas of the feature space where additional labeled data points would be most beneficial for improving model performance.

While DBSCAN itself is not explicitly designed for semi-supervised learning, its density-based clustering approach and ability to handle noisy data can be beneficial in semi-supervised scenarios. However, careful consideration of how to incorporate labeled data, handle noise, and combine clustering results with other learning algorithms is necessary to effectively use DBSCAN in semi-supervised learning tasks.
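
One simple way to combine a few labeled seeds with DBSCAN output is a majority vote per cluster. The sketch below is one possible scheme, not a standard algorithm; `propagate_seed_labels`, the cluster ids, and the class names are all hypothetical.

```python
from collections import Counter

def propagate_seed_labels(cluster_ids, seed_labels):
    """Give every point in a cluster the majority class of that cluster's seeds.

    cluster_ids: cluster id per point as produced by DBSCAN (-1 means noise).
    seed_labels: dict mapping point index -> known class label.
    Points in clusters without seeds, and noise points, stay None.
    """
    votes = {}
    for idx, cls in seed_labels.items():
        c = cluster_ids[idx]
        if c != -1:
            votes.setdefault(c, Counter())[cls] += 1

    result = [None] * len(cluster_ids)
    for i, c in enumerate(cluster_ids):
        if c in votes:
            result[i] = votes[c].most_common(1)[0][0]
    # Known labels always take precedence over propagated ones
    for idx, cls in seed_labels.items():
        result[idx] = cls
    return result

cluster_ids = [1, 1, 1, 2, 2, -1, 3]          # hypothetical DBSCAN output
seeds = {0: "spam", 3: "ham", 4: "ham"}       # hypothetical labeled points
pseudo = propagate_seed_labels(cluster_ids, seeds)
print(pseudo)
```

The resulting pseudo-labels could then be used to train a classifier; points left as `None` (noise and clusters without seeds) are natural candidates for manual labeling.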

In [None]:
Q10. How does DBSCAN clustering handle datasets with noise or missing values?

In [None]:
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering is designed to handle datasets with noise or outliers effectively. Here's how DBSCAN handles datasets with noise or missing values:

1. Noise Handling:
   - DBSCAN explicitly identifies noise points as data points that do not belong to any cluster.
   - Noise points are typically located in sparsely populated regions or isolated from dense clusters.
   - By ignoring noise points during cluster formation, DBSCAN focuses on identifying dense regions in the data space and robustly handles outliers.

2. Parameter Sensitivity:
   - The choice of parameters in DBSCAN, particularly the epsilon (\( \varepsilon \)) and minimum points (MinPts) parameters, influences its sensitivity to noise.
   - By adjusting these parameters appropriately, DBSCAN can adapt to varying levels of noise in the dataset.
   - Higher values of \( \varepsilon \) or lower values of MinPts increase the tolerance for noise by absorbing more points into clusters, while lower values of \( \varepsilon \) or higher values of MinPts result in tighter clusters and more points labeled as noise.

3. Outlier Detection:
   - DBSCAN can be used for outlier detection by considering noise points as outliers.
   - Noise points represent data points that do not fit well into any cluster or are located in regions of low density.
   - By explicitly identifying noise points, DBSCAN provides insights into the distribution of data and helps detect anomalous patterns or outliers.

4. Missing Values:
   - DBSCAN does not inherently handle missing values in the dataset, as it relies on distance calculations between data points.
   - However, missing values can be addressed through preprocessing techniques such as imputation or data cleaning before applying DBSCAN.
   - Imputation methods such as mean imputation, median imputation, or k-nearest neighbors (KNN) imputation can be used to fill in missing values, allowing DBSCAN to operate on complete data.

5. Handling Sparse Data:
   - DBSCAN is not inherently robust to sparse datasets, where most feature values are zero or missing.
   - In sparse data few points have enough close neighbors, so genuine cluster members may be labeled as noise.
   - Preprocessing such as feature scaling, dimensionality reduction, or a distance measure suited to sparse data (for example, cosine distance) can help mitigate the impact of sparsity on DBSCAN clustering.

In summary, DBSCAN handles noise by explicitly labeling low-density points as outliers and building clusters only from dense regions. It does not handle missing values itself; these must be imputed or removed before clustering. Parameter tuning and suitable preprocessing help improve the performance of DBSCAN on noisy or incomplete datasets.
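
A preprocessing step of the kind described above might look like the following sketch, which fills missing values with the column mean and then standardizes each feature. Mean imputation is only one of the options listed, and `impute_and_scale` is a hypothetical helper. The step matters because any comparison involving NaN is false, so a row with a missing value would otherwise never count as anyone's neighbor.

```python
import numpy as np

def impute_and_scale(X):
    """Replace NaNs with column means, then standardize to zero mean, unit variance."""
    X = np.array(X, dtype=float)             # copy so the input is not modified
    col_means = np.nanmean(X, axis=0)         # column means that ignore NaNs
    missing = np.isnan(X)
    X[missing] = np.take(col_means, np.where(missing)[1])
    std = X.std(axis=0)
    std[std == 0] = 1.0                       # avoid dividing by zero for constant columns
    return (X - X.mean(axis=0)) / std

raw = [[1.0, 10.0],
       [2.0, np.nan],   # missing value in the second feature
       [3.0, 30.0]]
Z = impute_and_scale(raw)
print(Z)
```

After this step every pairwise distance is well defined, and both features contribute on the same scale to the \( \varepsilon \)-neighborhood test.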

In [None]:
Q11. Implement the DBSCAN algorithm using the Python programming language, and apply it to a sample 
dataset. Discuss the clustering results and interpret the meaning of the obtained clusters.

In [1]:
import numpy as np

class DBSCAN:
    def __init__(self, eps, min_pts):
        self.eps = eps          # neighborhood radius
        self.min_pts = min_pts  # minimum neighbors (including the point itself) for a core point

    def fit_predict(self, X):
        # Label convention: 0 = unvisited, -1 = noise, 1..k = cluster ids
        labels = np.zeros(len(X), dtype=int)
        cluster_idx = 0

        # Iterate over each data point
        for i in range(len(X)):
            # Skip points that are already labeled (clustered or noise)
            if labels[i] != 0:
                continue

            # Find neighbors of the current point (the point itself is included)
            neighbors = self._find_neighbors(X, i)

            # Too few neighbors: mark as noise for now; the point may still
            # be claimed later as a border point of some cluster
            if len(neighbors) < self.min_pts:
                labels[i] = -1
            else:
                cluster_idx += 1
                self._expand_cluster(X, labels, i, neighbors, cluster_idx)

        return labels

    def _find_neighbors(self, X, i):
        # Euclidean distances between point i and all points
        distances = np.linalg.norm(X - X[i], axis=1)

        # Indices of points within the epsilon neighborhood
        return np.where(distances <= self.eps)[0]

    def _expand_cluster(self, X, labels, i, neighbors, cluster_idx):
        # Assign cluster label to the seed core point
        labels[i] = cluster_idx

        # Use an explicit queue so that neighbors discovered during expansion
        # are processed too (reassigning a variable inside a for loop would
        # not extend the iteration)
        queue = list(neighbors)
        while queue:
            neighbor = queue.pop()

            # Previously marked noise: it is a border point of this cluster
            if labels[neighbor] == -1:
                labels[neighbor] = cluster_idx
                continue

            # Already assigned to a cluster: nothing to do
            if labels[neighbor] != 0:
                continue

            labels[neighbor] = cluster_idx
            neighbor_neighbors = self._find_neighbors(X, neighbor)

            # Only core points (at least min_pts neighbors) expand the cluster
            if len(neighbor_neighbors) >= self.min_pts:
                queue.extend(neighbor_neighbors)

# Sample dataset
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Hyperparameters
eps = 2
min_pts = 2

# Instantiate and fit DBSCAN
dbscan = DBSCAN(eps=eps, min_pts=min_pts)
labels = dbscan.fit_predict(X)

# Print clustering results
print("DBSCAN Clustering Results:")
for i in range(len(X)):
    if labels[i] == -1:
        print(f"Point {X[i]} is noise")
    else:
        print(f"Point {X[i]} is in cluster {labels[i]}")



DBSCAN Clustering Results:
Point [1. 2.] is in cluster 1
Point [1.5 1.8] is in cluster 1
Point [5. 8.] is noise
Point [8. 8.] is noise
Point [1.  0.6] is in cluster 1
Point [ 9. 11.] is noise
