#DBSCAN Clustering

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a clustering algorithm used in machine learning that groups points lying in dense regions into clusters and labels points in sparse regions as noise. Because it flags noise explicitly, it is useful for data cleaning and outlier detection.
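
To make the density idea concrete, here is a minimal pure-Python sketch of the core-point test that DBSCAN builds on. The toy points and the `eps` and `min_samples` values are assumptions chosen for illustration, and `is_core_point` is a hypothetical helper, not part of scikit-learn. It covers only the neighbourhood count, not the full cluster expansion.

```python
from math import dist

# Toy 2-D points (assumed for illustration): a tight group plus one far-away point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]

def is_core_point(p, points, eps, min_samples):
    """Return True if at least min_samples points (including p itself) lie within eps of p."""
    neighbours = [q for q in points if dist(p, q) <= eps]
    return len(neighbours) >= min_samples

eps, min_samples = 0.5, 3
core_flags = [is_core_point(p, points, eps, min_samples) for p in points]
print(core_flags)  # [True, True, True, True, False]
```

The isolated point is not a core point and has no core point within `eps`, so a full DBSCAN run would label it as noise.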

#About the Dataset

The dataset (Penguins Species dataset) was taken from Kaggle. It consists of 5 columns:

1. culmen_length_mm: culmen length (mm)
2. culmen_depth_mm: culmen depth (mm)
3. flipper_length_mm: flipper length (mm)
4. body_mass_g: body mass (g)
5. sex: penguin sex

**Problem statement:**

To identify clusters of similar data points using the DBSCAN algorithm.

#Import the necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

#Load the dataset

In [2]:
data=pd.read_csv("penguins.csv")

#Summarizing the dataset

In [3]:
data.head()

Unnamed: 0,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,39.1,18.7,181.0,3750.0,MALE
1,39.5,17.4,186.0,3800.0,FEMALE
2,40.3,18.0,195.0,3250.0,FEMALE
3,,,,,
4,36.7,19.3,193.0,3450.0,FEMALE


In [4]:
data.shape

(344, 5)

In [5]:
data.describe()

Unnamed: 0,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
count,342.0,342.0,342.0,342.0
mean,43.92193,17.15117,214.01462,4201.754386
std,5.459584,1.974793,260.558057,801.954536
min,32.1,13.1,-132.0,2700.0
25%,39.225,15.6,190.0,3550.0
50%,44.45,17.3,197.0,4050.0
75%,48.5,18.7,213.75,4750.0
max,59.6,21.5,5000.0,6300.0


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   culmen_length_mm   342 non-null    float64
 1   culmen_depth_mm    342 non-null    float64
 2   flipper_length_mm  342 non-null    float64
 3   body_mass_g        342 non-null    float64
 4   sex                335 non-null    object 
dtypes: float64(4), object(1)
memory usage: 13.6+ KB


#EDA

In [7]:
#check the null values
data.isnull().sum()

culmen_length_mm     2
culmen_depth_mm      2
flipper_length_mm    2
body_mass_g          2
sex                  9
dtype: int64

In [8]:
#Drop the null values
df=data.dropna(axis=0)

#re-check the null values
df.isnull().sum()

culmen_length_mm     0
culmen_depth_mm      0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64

In [9]:
df.describe()

Unnamed: 0,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
count,335.0,335.0,335.0,335.0
mean,43.988358,17.169552,214.355224,4209.179104
std,5.45343,1.971966,263.253508,803.633495
min,32.1,13.1,-132.0,2700.0
25%,39.5,15.6,190.0,3550.0
50%,44.5,17.3,197.0,4050.0
75%,48.55,18.7,213.0,4787.5
max,59.6,21.5,5000.0,6300.0


The output above shows a negative minimum (-132.0) in the "flipper_length_mm" column, which is not a valid length.

Check whether this column contains more than one negative data point.



In [10]:
#check the negative values
negative=(df['flipper_length_mm']<0).sum()
print(negative)

1
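
The same check can be run on every numeric column at once, which also surfaces the offending rows for inspection before they are changed. This is a sketch on a small stand-in frame with assumed values; in the notebook it would be applied to `df`.

```python
import pandas as pd

# Small stand-in frame (assumed values), including one negative entry like the one found above.
toy = pd.DataFrame({
    "flipper_length_mm": [181.0, 186.0, -132.0, 193.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0],
})

# Boolean mask of negative entries across all numeric columns.
is_negative = toy.select_dtypes(include="number") < 0

# Count negative entries per column in one pass.
negative_counts = is_negative.sum()
print(negative_counts)

# Show the rows that contain at least one negative value.
negative_rows = toy[is_negative.any(axis=1)]
print(negative_rows)
```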


In [11]:
#Create a copy of the dataset
df_copy=df.copy()

In [12]:
#Replace the negative data point with its absolute value in the copy of the dataset
df_copy['flipper_length_mm']=df_copy["flipper_length_mm"].abs()

df_copy.describe()

Unnamed: 0,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
count,335.0,335.0,335.0,335.0
mean,43.988358,17.169552,215.143284,4209.179104
std,5.45343,1.971966,262.607931,803.633495
min,32.1,13.1,132.0,2700.0
25%,39.5,15.6,190.0,3550.0
50%,44.5,17.3,197.0,4050.0
75%,48.55,18.7,213.0,4787.5
max,59.6,21.5,5000.0,6300.0


In [13]:
df_copy.head()

Unnamed: 0,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,39.1,18.7,181.0,3750.0,MALE
1,39.5,17.4,186.0,3800.0,FEMALE
2,40.3,18.0,195.0,3250.0,FEMALE
4,36.7,19.3,193.0,3450.0,FEMALE
5,39.3,20.6,190.0,3650.0,MALE


Drop the "sex" column: it is a categorical column with only two categories, and the clustering below uses only the numeric measurements.

In [14]:
#Drop column "sex"
df_copy_cleaned=df_copy.drop("sex",axis=1)
df_copy_cleaned.head()

Unnamed: 0,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
0,39.1,18.7,181.0,3750.0
1,39.5,17.4,186.0,3800.0
2,40.3,18.0,195.0,3250.0
4,36.7,19.3,193.0,3450.0
5,39.3,20.6,190.0,3650.0


# Converting the DataFrame to a NumPy Array

In [15]:
df_copy_cleaned_array=df_copy_cleaned.values
df_copy_cleaned_array

array([[  39.1,   18.7,  181. , 3750. ],
       [  39.5,   17.4,  186. , 3800. ],
       [  40.3,   18. ,  195. , 3250. ],
       ...,
       [  50.4,   15.7,  222. , 5750. ],
       [  45.2,   14.8,  212. , 5200. ],
       [  49.9,   16.1,  213. , 5400. ]])

##Standardize the data

In [17]:
scale=StandardScaler()
x=scale.fit_transform(df_copy_cleaned_array) #Standardize the data
x

array([[-0.89772327,  0.77726336, -0.13021068, -0.57223347],
       [-0.82426521,  0.11703673, -0.11114241, -0.50992298],
       [-0.67734909,  0.42175671, -0.07681952, -1.19533834],
       ...,
       [ 1.17746691, -0.74633656,  0.02614915,  1.920186  ],
       [ 0.22251214, -1.20341653, -0.0119874 ,  1.23477065],
       [ 1.08564434, -0.5431899 , -0.00817374,  1.4840126 ]])
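
`StandardScaler` rescales each column to zero mean and unit variance using z = (x - mean) / std, where std is the population standard deviation. The sketch below reproduces that formula with the standard library on the first five `body_mass_g` values from the `head()` output above. Since the mean and standard deviation come from these five rows only, the numbers differ from the notebook's.

```python
from statistics import fmean, pstdev

# First five body_mass_g values from the head() output above; mean and std
# are computed on these five rows only, so the z-scores differ from the notebook's.
body_mass = [3750.0, 3800.0, 3250.0, 3450.0, 3650.0]

mu = fmean(body_mass)
sigma = pstdev(body_mass)  # population standard deviation (divides by n)

z_scores = [(v - mu) / sigma for v in body_mass]
print([round(z, 3) for z in z_scores])
```

Because the mean and standard deviation are sensitive to extreme values, a single very large entry (such as the 5000.0 maximum in "flipper_length_mm") inflates the standard deviation and compresses the remaining z-scores toward zero, which is consistent with the small magnitudes in the third column of the array above.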

#Loading the model

In [19]:
model=DBSCAN(min_samples=5) #eps is not set, so scikit-learn's default of 0.5 is used
model
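
Since `eps` stays at its default, it can be worth checking whether that radius suits the standardized data. One common heuristic is to sort the distance from each point to its k-th nearest neighbour and look for a sharp jump. The sketch below does this in pure Python on toy points. The points and `k` are assumptions, and `k_distances` is a hypothetical helper, not a scikit-learn function.

```python
from math import dist

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest other point."""
    result = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(d[k - 1])
    return sorted(result)

# Toy standardized points (assumed): a dense group and one isolated point.
points = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (0.1, 0.1), (3.0, 3.0)]

kd = k_distances(points, k=2)
print([round(v, 3) for v in kd])
```

A large gap between consecutive sorted values suggests a candidate `eps` just below the jump. In the notebook, the equivalent calculation would run on `x`, with `k` chosen in line with `min_samples`.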

#Training the model

In [21]:
model.fit(x)

#Displaying the cluster labels

In [22]:
model.labels_

array([ 0,  0,  0,  0,  0,  0,  0, -1,  0,  0, -1,  0,  0, -1,  0, -1,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
       -1,  0,  0,  0,  0,  0,  0,  0, -1,  0, -1,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0, -1,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0, -1,  0,  0,  0,  0,  0,  0,  0, -1, -1,  0, -1,  0,  0, -1,
        0,  0,  0,  0,  0,  0, -1,  0, -1,  0,  0,  0,  0,  0, -1, -1, -1,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1

In [23]:
set(list(model.labels_))

{-1, 0, 1}
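
In scikit-learn's DBSCAN, the label -1 marks noise points, so the set above corresponds to two clusters (0 and 1) plus noise. Counting the points per label shows how large each group is. The sketch below uses a short stand-in list with assumed values; in the notebook the input would be `model.labels_.tolist()`.

```python
from collections import Counter

# Stand-in labels (assumed); in the notebook this would be model.labels_.tolist().
labels = [0, 0, -1, 0, 1, 1, -1, 0]

counts = Counter(labels)
n_noise = counts[-1]                  # scikit-learn marks noise points with -1
n_clusters = len(set(labels) - {-1})  # clusters exclude the noise label

print(counts, n_noise, n_clusters)
```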

In [24]:
class_=pd.DataFrame(model.labels_,columns=['cluster_values'])
class_

Unnamed: 0,cluster_values
0,0
1,0
2,0
3,0
4,0
...,...
330,1
331,1
332,1
333,1


#Merging the cluster_values to dataset

In [26]:
final_df=pd.concat([df_copy_cleaned,class_],axis=1)
final_df

Unnamed: 0,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,cluster_values
0,39.1,18.7,181.0,3750.0,0.0
1,39.5,17.4,186.0,3800.0,0.0
2,40.3,18.0,195.0,3250.0,0.0
4,36.7,19.3,193.0,3450.0,0.0
5,39.3,20.6,190.0,3650.0,0.0
...,...,...,...,...,...
11,,,,,0.0
47,,,,,0.0
246,,,,,1.0
286,,,,,1.0


In [27]:
final_df.isnull().sum()

culmen_length_mm     8
culmen_depth_mm      8
flipper_length_mm    8
body_mass_g          8
cluster_values       8
dtype: int64
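
The 8 nulls per column reported above are consistent with an index mismatch rather than missing labels. `df_copy_cleaned` keeps the original row index with the gaps left by `dropna`, while `class_` gets a fresh index running from 0 to 334, and `pd.concat(..., axis=1)` aligns rows by index and fills unmatched positions with NaN. A minimal sketch of the problem and one possible fix, using stand-in data with assumed values:

```python
import pandas as pd

# Stand-in for df_copy_cleaned: the index has a gap because rows were dropped (assumed values).
features = pd.DataFrame({"body_mass_g": [3750.0, 3800.0, 3450.0]}, index=[0, 1, 4])
labels = [0, 0, 1]  # stand-in for model.labels_, one label per remaining row

# Misaligned: a fresh RangeIndex (0, 1, 2) does not match (0, 1, 4).
misaligned = pd.concat(
    [features, pd.DataFrame(labels, columns=["cluster_values"])], axis=1
)

# Aligned: build the label column on the features' own index.
aligned = features.assign(cluster_values=pd.Series(labels, index=features.index))

print(misaligned.isnull().sum().sum(), aligned.isnull().sum().sum())
```

In the notebook, the equivalent change would be to create `class_` with `index=df_copy_cleaned.index` before concatenating, or to reset the index of `df_copy_cleaned` after dropping rows.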