

# **DBSCAN**


Estimated time needed: **30** minutes


Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a foundational algorithm for density-based clustering: it groups points that lie in dense regions and labels points in sparse regions as noise.

## Use cases of DBSCAN:

- **Recommender systems** that make recommendations to users based on the preferences (such as Netflix viewing patterns) of similar users (their neighbors).
- **Anomaly detection** that identifies rare items, events, or observations that deviate significantly from the majority of the data and do not conform to a well-defined notion of normal behavior.
- **Customer segmentation** that separates customers into multiple clusters so that a targeted marketing strategy can be devised from each cluster's characteristics.


## When is DBSCAN better than K-Means?


You are a data scientist working for a retail company. The marketing team wants you to divide the company's customer base into groups of individuals who are similar in specific aspects such as age, gender, and spending behavior, so that they can devise a targeted marketing strategy for each group based on its average spending on various product categories.

In this case, you don't know the optimal number of groups (clusters) in the customer base in advance, so K-Means is a poor fit: it requires you to specify the number of clusters up front.

Besides, since the marketing team wants to understand the average spending of each group in order to plan its corresponding promotion, you should use an algorithm that is more robust to outliers. K-Means assigns every point to a cluster, so a few extreme customers can skew a group's average.
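The two points above can be illustrated with a minimal sketch. The data here is hypothetical (two tight groups of already-scaled 2-D points plus one extreme point), and the `eps` and `min_samples` values are assumptions chosen to suit this toy data, not recommended defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Hypothetical, already-scaled 2-D features: two tight 5x5 grids of points
# plus one extreme point standing in for an unusually high spender.
steps = np.linspace(-0.4, 0.4, 5)
grid = np.array([[dx, dy] for dx in steps for dy in steps])
group_a = grid + np.array([0.0, 0.0])
group_b = grid + np.array([5.0, 5.0])
outlier = np.array([[20.0, 20.0]])
X = np.vstack([group_a, group_b, outlier])

# K-Means must be told the number of clusters and assigns every point,
# the outlier included, to one of them. Fixed initial centers keep this
# sketch reproducible.
initial_centers = np.array([[0.0, 0.0], [5.0, 5.0]])
kmeans = KMeans(n_clusters=2, init=initial_centers, n_init=1).fit(X)
kmeans_labels = kmeans.labels_
kmeans_mean = X[kmeans_labels == kmeans_labels[-1]].mean(axis=0)

# DBSCAN takes a neighborhood radius (eps) and a density threshold
# (min_samples) instead of a cluster count; sparse points get label -1.
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
group_b_label = dbscan_labels[len(group_a)]
dbscan_mean = X[dbscan_labels == group_b_label].mean(axis=0)

print("K-Means label of the extreme point:", kmeans_labels[-1])
print("DBSCAN label of the extreme point:", dbscan_labels[-1])
print("Mean of the K-Means cluster containing the extreme point:", kmeans_mean)
print("Mean of the matching DBSCAN cluster:", dbscan_mean)
```

In this sketch, K-Means folds the extreme point into a cluster and pulls that cluster's mean away from the group, while DBSCAN marks the point as noise and leaves the group's mean untouched.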

<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0187EN-SkillsNetwork/labs/module%202/images/marketing.jpg" style="width: 70%">

Image from [PxHere](https://pxhere.com/en/photo/1448601?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01)


In this notebook, we will look at clustering with DBSCAN, which helps us overcome the two aforementioned shortcomings of K-Means clustering. We will also work through a handwriting example that uses DBSCAN to show whether someone's handwriting is bad.


## __Table of Contents__

<ol>
    <li><a href="#Objectives">Objectives</a></li>
    <li>
        <a href="#Setup">Setup</a>
        <ol>
            <li><a href="#Installing-Required-Libraries">Installing Required Libraries</a></li>
            <li><a href="#Importing-Required-Libraries">Importing Required Libraries</a></li>
        </ol>
    </li>
    <li>
        <a href="#Background">Background</a>
        <ol>
            <li><a href="#How-does-DBSCAN-work?">How does DBSCAN work?</a></li>
        </ol>
    </li>
    <li><a href="#Visual-Example">Example 1: Visual Example</a></li>
    <li><a href="#Proving-Your-Friend-Has-Bad-Handwriting">Example 2: Proving your friend has bad handwriting</a></li>
</ol>

<a href="#Exercises">Exercises</a>
<ol>
    <li><a href="#Exercise-1---Find-the-number-of-clusters">Exercise 1 - Find the number of clusters</a></li>
    <li><a href="#Exercise-2---Find-the-%-of-data-marked-as-noise">Exercise 2 - Find the % of data marked as noise</a></li>
    <li><a href="#Exercise-3---Visualize-the-clustered-data-using-matplotlib.pyplot">Exercise 3 - Visualize the clustered data using matplotlib.pyplot</a></li>
</ol>


## Objectives


After completing this lab you will be able to:


* __Explain__ what DBSCAN does and how it works.
* __Understand__ the strengths, weaknesses, and use cases of DBSCAN.


## Setup


For this lab, we will be using the following libraries:
 - [`pandas`](https://pandas.pydata.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01) for managing the data.
 - [`numpy`](https://numpy.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01) for mathematical operations.
 - [`matplotlib`](https://matplotlib.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01) for visualizing the data.
 - [`scipy`](https://docs.scipy.org/doc/scipy/reference/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01) for complex numerical operations.
 - [`sklearn`](https://scikit-learn.org/stable/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01) for machine learning and machine-learning-pipeline related functions.


### Installing Required Libraries

The following required modules are pre-installed in the Skills Network Labs environment. However, if you run this notebook's commands in a different Jupyter environment (e.g. Watson Studio or Anaconda), you will need to install these libraries by removing the `#` sign before `!mamba` in the code cell below.


### Importing Required Libraries


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits

In [None]:
import warnings

# Suppress all warnings to keep the notebook output uncluttered
warnings.filterwarnings('ignore')

## Background

__DBSCAN__ is a clustering algorithm that stands for:

"**D**ensity-**B**ased **S**patial **C**lustering of **A**pplications with **N**oise"

 - _Density-Based_ - DBSCAN groups points that are densely packed together
 - _Spatial Clustering_ - DBSCAN clusters numerical points $x \in \mathbb{R}^n$
 - _Applications with Noise_ - DBSCAN is well suited to applications whose data contains noise
   - This is because DBSCAN also identifies outliers/noise in the data rather than forcing every point into a cluster
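As a rough illustration of the three parts of the name, the sketch below runs scikit-learn's `DBSCAN` on hypothetical toy points in $\mathbb{R}^2$: a dense ring, a dense blob inside the ring, and one isolated point. The point layout and the `eps` and `min_samples` values are assumptions picked for this toy example only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data in R^2: 100 points evenly spaced on a ring of radius 3,
# a tight 5x5 grid of points at the ring's center, and one isolated point.
angles = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
ring = 3.0 * np.column_stack([np.cos(angles), np.sin(angles)])
steps = np.linspace(-0.4, 0.4, 5)
blob = np.array([[dx, dy] for dx in steps for dy in steps])
isolated = np.array([[10.0, 10.0]])
points = np.vstack([ring, blob, isolated])

# eps is the neighborhood radius; min_samples is how many points (the point
# itself included) must fall inside that radius for a region to count as dense.
model = DBSCAN(eps=0.5, min_samples=4).fit(points)
labels = model.labels_

ring_labels = set(labels[:100].tolist())
blob_labels = set(labels[100:125].tolist())

print("Labels found on the ring:", ring_labels)
print("Labels found in the blob:", blob_labels)
print("Label of the isolated point:", labels[-1])
```

The ring and the blob each end up as a single cluster even though the ring encloses the blob, because membership follows chains of densely packed neighbors rather than distance to a center. The isolated point has no dense neighborhood, so it receives the noise label `-1`.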
