### DBSCAN Clustering - Density-Based Spatial Clustering of Applications with Noise

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns

import warnings 
warnings.filterwarnings("ignore")

In [2]:
df=pd.read_csv("Mall_Customers.csv")
df.head()

,CustomerID,Genre,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


In [3]:
df.columns

Index(['CustomerID', 'Genre', 'Age', 'Annual Income (k$)',
       'Spending Score (1-100)'],
      dtype='object')

In [4]:
df.drop(['CustomerID', 'Genre', 'Age'],axis=1,inplace=True)

In [5]:
df.head()

,Annual Income (k$),Spending Score (1-100)
0,15,39
1,15,81
2,16,6
3,16,77
4,17,40


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 2 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Annual Income (k$)      200 non-null    int64
 1   Spending Score (1-100)  200 non-null    int64
dtypes: int64(2)
memory usage: 3.3 KB


In [7]:
x=df.iloc[:,:2].values

**Import DBSCAN and Fit the Model**

In [8]:
from sklearn.cluster import DBSCAN
dbs=DBSCAN(eps=2,min_samples=5)
y_dbs=dbs.fit_predict(x)

- DBSCAN(eps=2, min_samples=5) initializes the DBSCAN model with:
    - eps=2: The maximum distance between two samples for one to be considered in the neighborhood of the other.
    - min_samples=5: The minimum number of samples in a point's neighborhood (the point itself included) for that point to be considered a core point.

- fit_predict(x) fits the model to the data x and returns the cluster label assigned to each point.

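- The values of eps and min_samples used here are arbitrary starting points. Both hyperparameters need to be tuned based on the dataset and domain knowledge.
- DBSCAN is distance-based, so it is sensitive to the scale of the data. If x is not preprocessed (e.g., scaled) and one feature has a much larger range than the other, that feature dominates the distances and the clustering may be poor.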
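
One common heuristic for tuning eps is to scale the features and then look at each point's distance to its k-th nearest neighbor, with k tied to min_samples. The sketch below is only an illustration under stated assumptions: it uses synthetic data rather than the mall dataset, the names x_demo, k, and eps_guess are hypothetical, and the 90th percentile is an arbitrary stand-in for reading the "elbow" off a sorted k-distance plot by eye.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for x: two dense blobs plus a few scattered points (assumed data)
rng = np.random.default_rng(0)
x_demo = np.vstack([
    rng.normal(loc=[20, 80], scale=2.0, size=(50, 2)),
    rng.normal(loc=[80, 20], scale=2.0, size=(50, 2)),
    rng.uniform(low=0, high=100, size=(10, 2)),
])

# Put both features on a comparable scale before measuring distances
x_scaled = StandardScaler().fit_transform(x_demo)

# Distance from each point to its k-th nearest neighbor.
# Querying with the training data counts each point as its own first neighbor,
# which matches DBSCAN's min_samples (it also includes the point itself).
k = 5
distances, _ = NearestNeighbors(n_neighbors=k).fit(x_scaled).kneighbors(x_scaled)
k_distances = np.sort(distances[:, -1])

# Arbitrary heuristic: a high percentile instead of an elbow picked by eye
eps_guess = float(np.percentile(k_distances, 90))

labels_demo = DBSCAN(eps=eps_guess, min_samples=k).fit_predict(x_scaled)
n_clusters = len(set(labels_demo)) - (1 if -1 in labels_demo else 0)
print(f"eps_guess={eps_guess:.3f}, clusters={n_clusters}, noise={np.sum(labels_demo == -1)}")
```

Plotting k_distances with plt.plot and picking the point where the curve bends sharply is the more usual way to read off eps; the percentile only keeps the sketch free of manual steps.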

In [9]:
# Checking unique cluster labels
np.unique(y_dbs)

array([-1,  0,  1,  2], dtype=int64)

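- np.unique(y_dbs) returns the unique cluster labels assigned by DBSCAN.
- In DBSCAN, -1 marks noise points (points that do not belong to any cluster), so the output above means three clusters (0, 1, 2) plus noise.
- This is a quick way to check how many clusters were found. It does not show how many points carry each label; np.unique(y_dbs, return_counts=True) would.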
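
To go from the list of labels to actual counts, something like the following could be used. It is a minimal sketch on a hypothetical label array (labels, not the notebook's y_dbs), so the numbers it prints say nothing about the mall data.

```python
import numpy as np

# Hypothetical label array in the same format DBSCAN returns (-1 = noise)
labels = np.array([-1, 0, 0, 1, -1, 2, 2, 2, 1, 0])

# return_counts=True gives the number of points carrying each label
unique_labels, counts = np.unique(labels, return_counts=True)
label_counts = {int(l): int(c) for l, c in zip(unique_labels, counts)}

# Noise is not a cluster, so leave -1 out of the cluster count
n_clusters = int(np.sum(unique_labels != -1))
n_noise = label_counts.get(-1, 0)

print(label_counts)        # {-1: 2, 0: 3, 1: 2, 2: 3}
print(n_clusters, n_noise) # 3 2
```

Replacing labels with y_dbs would give the same breakdown for the fitted model.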

**Add Cluster Labels to DataFrame**

In [10]:
# One label per row, in the same order as the rows of df
df["cluster"]=y_dbs
df

,Annual Income (k$),Spending Score (1-100),cluster
0,15,39,-1
1,15,81,-1
2,16,6,-1
3,16,77,-1
4,17,40,-1
...,...,...,...
195,120,79,-1
196,126,28,-1
197,126,74,-1
198,137,18,-1


In [11]:
# Displaying unique clusters
df["cluster"].unique()

array([-1,  0,  1,  2], dtype=int64)
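
Once the labels sit in the DataFrame, a per-cluster summary is a natural next check. The sketch below assumes a hypothetical miniature frame (demo) with the same three columns; the values are made up for illustration and are not results from the mall dataset.

```python
import pandas as pd

# Hypothetical miniature of df after the cluster column is added
demo = pd.DataFrame({
    "Annual Income (k$)": [15, 16, 54, 55, 56, 78, 79, 137],
    "Spending Score (1-100)": [39, 6, 50, 51, 49, 90, 91, 18],
    "cluster": [-1, -1, 0, 0, 0, 1, 1, -1],
})

# Number of points per label, noise (-1) included
sizes = demo["cluster"].value_counts().sort_index()

# Mean income and spending score for each real cluster (noise excluded)
summary = demo[demo["cluster"] != -1].groupby("cluster").mean()

print(sizes)
print(summary)
```

Running the same two statements on df instead of demo would show how many of the 200 customers ended up as noise and where each cluster is centered.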