<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Example-Dataset---Blobs" data-toc-modified-id="Example-Dataset---Blobs-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Example Dataset - Blobs</a></span><ul class="toc-item"><li><span><a href="#Attempt-#1-&amp;-displaying-the-$\epsilon$" data-toc-modified-id="Attempt-#1-&amp;-displaying-the-$\epsilon$-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Attempt #1 &amp; displaying the $\epsilon$</a></span></li><li><span><a href="#Attempt-#2-&amp;-displaying-the-$\epsilon$" data-toc-modified-id="Attempt-#2-&amp;-displaying-the-$\epsilon$-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Attempt #2 &amp; displaying the $\epsilon$</a></span></li></ul></li><li><span><a href="#Example-Dataset---Changing-Density" data-toc-modified-id="Example-Dataset---Changing-Density-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Example Dataset - Changing Density</a></span><ul class="toc-item"><li><span><a href="#First-attempt-with-default-parameters" data-toc-modified-id="First-attempt-with-default-parameters-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>First attempt with default parameters</a></span><ul class="toc-item"><li><span><a href="#Knowledge-Check-🧠:-Does-this-clustering-make-sense?-How-would-you-cluster-it?" data-toc-modified-id="Knowledge-Check-🧠:-Does-this-clustering-make-sense?-How-would-you-cluster-it?-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Knowledge Check 🧠: Does this clustering make sense? 
How would you cluster it?</a></span></li></ul></li><li><span><a href="#Scenario-1" data-toc-modified-id="Scenario-1-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Scenario 1</a></span><ul class="toc-item"><li><span><a href="#Possible-Solution" data-toc-modified-id="Possible-Solution-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>Possible Solution</a></span></li></ul></li><li><span><a href="#Scenario-2" data-toc-modified-id="Scenario-2-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Scenario 2</a></span><ul class="toc-item"><li><span><a href="#Possible-Solution" data-toc-modified-id="Possible-Solution-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>Possible Solution</a></span></li></ul></li></ul></li></ul></div>

> **NOTE** Parts adapted from this repo: https://github.com/udacity/DSND_Term1/tree/master/lessons/Unsupervised/2_HierarchcalDensityClustering

In [None]:
import numpy as np  # used by plot_clustered_dataset below
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Example Dataset - Blobs 

In [None]:
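# make_blobs lives in sklearn.datasets; the samples_generator module was removed
from sklearn.datasets import make_blobs

# Three Gaussian blobs with a small spread; random_state fixes the layout
X, y_true = make_blobs(
    n_samples=100,
    centers=3,
    cluster_std=0.40,
    random_state=27
)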

In [None]:
figsize = (10,10)
point_size = 150
point_border = 0.8

def plot_data(data, pt_color="#00B3E9", xlim=None, ylim=None):
    """Scatter-plot a 2D dataset, optionally fixing the axis limits."""
    plt.figure(figsize=figsize)
    plt.scatter(data[:, 0], data[:, 1],
                s=point_size, color=pt_color,
                edgecolor='black', lw=point_border)
    if xlim is not None:
        plt.xlim(xlim)
    if ylim is not None:
        plt.ylim(ylim)
    plt.show()

In [None]:
plot_data(X)

## Attempt #1 & displaying the $\epsilon$

In [None]:
from sklearn import cluster

eps = 0.2
dbscan = cluster.DBSCAN(eps=eps)

clustering_labels_1 = dbscan.fit_predict(X)

In [None]:
from itertools import cycle, islice

def plot_clustered_dataset(dataset, y_pred, xlim=None, ylim=None,
                           neighborhood=False, epsilon=0.5):
    """Plot points colored by label; optionally draw each point's epsilon circle."""
    fig, ax = plt.subplots(figsize=figsize)

    # One color per cluster label, cycling through the palette if needed
    colors = np.array(list(islice(cycle(['#df8efd', '#78c465', '#ff8e34',
                                         '#f65e97', '#a65628', '#984ea3',
                                         '#999999', '#e41a1c', '#dede00']),
                                  int(max(y_pred) + 1))))
    # Gray goes last so that DBSCAN's noise label (-1) indexes it
    colors = np.append(colors, '#BECBD6')

    if neighborhood:
        # Draw the epsilon-neighborhood around every point
        for point in dataset:
            circle1 = plt.Circle(point, epsilon,
                                 color='#666666', fill=False,
                                 zorder=0, alpha=0.3)
            ax.add_artist(circle1)

    ax.scatter(dataset[:, 0], dataset[:, 1],
               s=point_size, color=colors[y_pred],
               zorder=10, edgecolor='black', lw=point_border)

    if xlim is not None:
        plt.xlim(xlim)
    if ylim is not None:
        plt.ylim(ylim)

    plt.show()

In [None]:
plot_clustered_dataset(X, clustering_labels_1)

In [None]:
plot_clustered_dataset(X, y_true, neighborhood=True, epsilon=eps)

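> Epsilon ($\epsilon = 0.2$) is just too small: most neighborhoods contain too few points to meet the default `min_samples=5`, so the blobs break apart or are marked as noise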
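
To check how crowded the neighborhoods are for a given radius, it can help to count the points inside each one. The sketch below is a NumPy-only illustration; the helper name `count_eps_neighbors` and the toy points are hypothetical, not part of scikit-learn or this notebook's data. It assumes scikit-learn's convention that a point counts as its own neighbor and becomes a core point when its count reaches `min_samples`.

```python
import numpy as np

def count_eps_neighbors(points, eps):
    """For each point, count the points within distance eps (itself included)."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances via broadcasting: shape (n, n)
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return (dists <= eps).sum(axis=1)

# Toy data (hypothetical): a tight square of four points plus one far-away point
toy_points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 3.0]])
toy_counts = count_eps_neighbors(toy_points, eps=0.2)

# A point can be a core point only if its count reaches min_samples (4 here)
toy_is_core = toy_counts >= 4
print(toy_counts, toy_is_core)
```

Calling `count_eps_neighbors(X, eps)` with different values of `eps` shows how many points could qualify as core points at each radius.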

## Attempt #2 & displaying the $\epsilon$

In [None]:
eps = 0.5

dbscan = cluster.DBSCAN(eps=eps)
clustering_labels_2 = dbscan.fit_predict(X)

plot_clustered_dataset(X, clustering_labels_2, neighborhood=True, epsilon=eps)
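
Rather than trying values of $\epsilon$ one at a time, a commonly used heuristic is to sort every point's distance to its k-th nearest neighbor and look for the "elbow" where the distances start to rise sharply. The sketch below is a NumPy-only illustration under that assumption; `k_distances` and the toy points are hypothetical names, and the choice of `k` (often tied to `min_samples`) is a judgment call rather than a rule.

```python
import numpy as np

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest other point."""
    points = np.asarray(points, dtype=float)
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k is the k-th nearest other point.
    kth = np.sort(dists, axis=1)[:, k]
    return np.sort(kth)

# Toy data (hypothetical): three nearby points and one outlier
kdist_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
kdist_values = k_distances(kdist_points, k=1)
print(kdist_values)
```

Plotting `k_distances(X, k)` as a line gives a curve whose elbow suggests a candidate $\epsilon$; it is a starting point to check visually, not a guarantee.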

# Example Dataset - Changing Density

In [None]:
dataset_2 = pd.read_csv('data/varied.csv')[:300].values
plot_data(dataset_2)

## First attempt with default parameters

In [None]:
dbscan = cluster.DBSCAN()
eps = dbscan.eps
predicted_labels = dbscan.fit_predict(dataset_2)

In [None]:
plot_clustered_dataset(dataset_2, 
                      predicted_labels, 
                      neighborhood=True, 
                      epsilon=eps)

### Knowledge Check 🧠: Does this clustering make sense? How would you cluster it?

Maybe... but it all depends on what we're looking for.

There are a couple of ways to break it up, which we'll explore next by adjusting the `eps` and `min_samples` hyperparameters.
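
When comparing hyperparameter settings, a quick numeric summary of the labels can complement the plots. The sketch below relies on DBSCAN's convention of labeling noise points as `-1`; the helper name `summarize_labels` and the example labels are hypothetical.

```python
from collections import Counter

def summarize_labels(labels):
    """Count clusters, noise points, and points per cluster in DBSCAN labels."""
    counts = Counter(int(label) for label in labels)
    # DBSCAN marks noise with the label -1, so it is not a cluster
    n_noise = counts.pop(-1, 0)
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "cluster_sizes": dict(sorted(counts.items())),
    }

# Example labels (hypothetical): two clusters and two noise points
label_summary = summarize_labels([0, 0, 1, -1, 1, 1, -1])
print(label_summary)
```

In the scenarios that follow, `summarize_labels(predicted_labels)` can be called after each `fit_predict` to check whether the number of clusters matches the target.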

## Scenario 1

> **3 clusters**: bottom left, top right, and the middle

In [None]:
# Change these hyperparameters from the defaults
eps = 0.5
min_samples = 5

In [None]:
# Create cluster and plot results
dbscan = cluster.DBSCAN(eps=eps, min_samples=min_samples)
predicted_labels = dbscan.fit_predict(dataset_2)

plot_clustered_dataset(dataset_2, 
                    predicted_labels, 
                    neighborhood=True, 
                    epsilon=eps)

### Possible Solution

In [None]:
eps = 1.2
min_samples = 10


dbscan = cluster.DBSCAN(eps=eps, min_samples=min_samples)
predicted_labels = dbscan.fit_predict(dataset_2)

plot_clustered_dataset(dataset_2, 
                    predicted_labels, 
                    neighborhood=True, 
                    epsilon=eps)

## Scenario 2

> **2 clusters**: bottom left and top right; the points in the middle are just noise

In [None]:
# Change these hyperparameters from the defaults
eps = 0.5
min_samples = 5

In [None]:
# Create cluster and plot results
dbscan = cluster.DBSCAN(eps=eps, min_samples=min_samples)
predicted_labels = dbscan.fit_predict(dataset_2)

plot_clustered_dataset(dataset_2, 
                    predicted_labels, 
                    neighborhood=True, 
                    epsilon=eps)

### Possible Solution

In [None]:
eps = 1.5
min_samples = 40


dbscan = cluster.DBSCAN(eps=eps, min_samples=min_samples)
predicted_labels = dbscan.fit_predict(dataset_2)

plot_clustered_dataset(dataset_2, 
                    predicted_labels, 
                    neighborhood=True, 
                    epsilon=eps)