## DBSCAN - Digits Exercise

In [None]:
import pandas as pd
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

from sklearn.cluster import DBSCAN

from scipy.spatial.distance import pdist

For this exercise, you'll be working with the digits dataset from sklearn. Each observation is an 8x8 grayscale image of a handwritten digit.

The `load_digits` function returns the digits and other metadata as a dictionary-like `Bunch` object.

In [None]:
digits = load_digits(as_frame = True)

digits.keys()

We can make use of matplotlib to visualize each image. Try changing the value of `i` below to view different examples.

In [None]:
i = 13

print(f'Label: {digits["target"][i]}')
plt.imshow(digits['images'][i], cmap = 'Greys');

**Question 1:** We'll be working with `digits['data']`. What is the dimensionality of this dataset? That is, how many variables do we have?

In this notebook, we'll take a simple approach to measuring the similarity of images: the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance), which is the familiar distance formula from high school algebra extended to more than two dimensions.
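
As a minimal sketch of what this distance means for two images, the cell below computes it by hand and compares the result with `np.linalg.norm`. The indices 0 and 1 are arbitrary choices for illustration, and the pixel values are left unscaled here.

```python
import numpy as np
from sklearn.datasets import load_digits

# Load the pixel values as a NumPy array of shape (n_samples, n_pixels).
X = load_digits().data

# Two example images (indices chosen only for illustration).
a, b = X[0], X[1]

# Euclidean distance: square root of the sum of squared pixel differences.
manual = np.sqrt(np.sum((a - b) ** 2))

# np.linalg.norm of the difference vector computes the same quantity.
builtin = np.linalg.norm(a - b)

print(manual, builtin)
```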

Let's start by getting an idea about the typical distance between observations.

Currently, the pixel values are between 0 and 16. Let's rescale each feature so that it lies between 0 and 1 using a `MinMaxScaler` and then compute the pairwise distances using the `pdist` function from scipy.

In [None]:
distances = pdist(MinMaxScaler().fit_transform(digits['data']))
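
Note that `pdist` returns a condensed vector with one entry per unordered pair of observations, so for `n` observations it has `n * (n - 1) / 2` entries rather than `n * n`. The sketch below illustrates this on four made-up 2D points (not part of the digits data) and uses `squareform` to expand the result into a full distance matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Four toy points in 2D (made-up data, used only to keep the output small).
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])

# Condensed form: one distance per unordered pair of points.
condensed = pdist(points)
n = len(points)
print(len(condensed), n * (n - 1) // 2)

# Square form: symmetric n x n matrix with zeros on the diagonal.
square = squareform(condensed)
print(square[0, 1])  # distance between points 0 and 1
```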

If we want to look at the distribution, it might help to convert the result to a pandas Series.

In [None]:
distances = pd.Series(distances)

**Question 2:** Look at the distribution of distances. What do you notice?

In [None]:
# Your Code Here

Now, let's apply the DBSCAN algorithm to our dataset.

It might be helpful to refer to the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

Let's start with the default value for `eps`, 0.5. Note that we're using a pipeline so that we can scale our points before clustering.

In [None]:
pipe = Pipeline(
    steps = [
        ('scale', MinMaxScaler()),
        ('cluster', DBSCAN(eps = 0.5))
    ]
)

pipe.fit(digits['data'])

After running the algorithm, we can look at the distribution of labels.

**Question 3:** What do you notice about the labels for this setting of eps? Why does this happen? Hint: you may need to refer to the documentation to understand the meaning of the labels you get.

In [None]:
pd.Series(pipe['cluster'].labels_).value_counts()

Now, let's increase the value of `eps`.

In [None]:
pipe = Pipeline(
    steps = [
        ('scale', MinMaxScaler()),
        ('cluster', DBSCAN(eps = 1))
    ]
)

pipe.fit(digits['data'])

pd.Series(pipe['cluster'].labels_).value_counts()

Let's look at the images for some of the smaller clusters to see if the result looks plausible.

In [None]:
digits['data'][(pipe['cluster'].labels_ == # Put in the cluster number of one of the smaller clusters here
               )].index.to_list()

In [None]:
i = # Use values from one of your clusters here

print(f'Label: {digits["target"][i]}')
plt.imshow(digits['images'][i], cmap = 'Greys');

Now, vary the value of eps and watch how the different digits get clustered together. 

**Question 4:** What do you notice as you vary the value of eps? Do digits tend to be correctly clustered together? 
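
If trial and error feels slow, one common heuristic (not required for this exercise) is to look at each point's distance to its k-th nearest neighbor. The sketch below assumes `k = 5`, matching DBSCAN's default `min_samples`, and uses `NearestNeighbors` from sklearn, which is not imported elsewhere in this notebook. Because the training data is passed to `kneighbors` explicitly, each point is typically returned as its own nearest neighbor, which matches how DBSCAN counts the point itself toward `min_samples`.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler

# Same scaling as in the pipeline above.
X = MinMaxScaler().fit_transform(load_digits().data)

k = 5  # assumption: matches DBSCAN's default min_samples

# For each point, find its k nearest neighbors among the data (itself included).
nn = NearestNeighbors(n_neighbors=k).fit(X)
dists, _ = nn.kneighbors(X)

# The last column is the smallest radius that contains k points,
# i.e. the smallest eps at which that point would be a core point.
k_dist = np.sort(dists[:, -1])

# Summarize the sorted k-distances with a few percentiles.
for q in [10, 50, 90, 99]:
    print(q, np.percentile(k_dist, q))
```

Plotting `k_dist` and looking for a bend in the curve is one way to pick candidate values of `eps` to try.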

In [None]:
pipe = Pipeline(
    steps = [
        ('scale', MinMaxScaler()),
        ('cluster', DBSCAN(eps = # Adjust this and observe how the target/cluster distribution changes
                          ))
    ]
)

pipe.fit(digits['data'])

pd.DataFrame({
    'target': digits['target'],
    'cluster': pipe['cluster'].labels_
}).groupby(['cluster'])['target'].value_counts()
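
An alternative view of the same information is a cross-tabulation, with one row per cluster and one column per target. The sketch below uses made-up labels for eight observations (not actual DBSCAN output) to show the shape of the result; the same `pd.crosstab` call could be applied to the DataFrame built above.

```python
import pandas as pd

# Made-up target and cluster labels, for illustration only.
toy = pd.DataFrame({
    'target':  [0, 0, 1, 1, 1, 7, 7, 7],
    'cluster': [0, 0, 1, 1, 0, 2, 1, 1],
})

# Rows are clusters, columns are targets, cells are counts.
table = pd.crosstab(toy['cluster'], toy['target'])
print(table)
```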

Look to see which target values end up clustered together. You can use the following cell to identify the index of anything that looks odd.

In [None]:
pd.DataFrame({
    'target': digits['target'],
    'cluster': pipe['cluster'].labels_
}).query('cluster == _ and target == _')

In [None]:
i = # Add the index of any unusual points here

print(f'Label: {digits["target"][i]}')
plt.imshow(digits['images'][i], cmap = 'Greys');

Finally, adjust the value of eps until you have only a small number of outliers. Once you do that, take a look at the noise/outlier points.

In [None]:
pipe = Pipeline(
    steps = [
        ('scale', MinMaxScaler()),
        ('cluster', DBSCAN(eps = # Adjust this until you have only a small number of outlier points 
                          ))
    ]
)

pipe.fit(digits['data'])

pd.Series(pipe['cluster'].labels_).value_counts()

In [None]:
digits['data'][(pipe['cluster'].labels_ == -1
               )].index.to_list()
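
To compare several settings at once, the following sketch loops over a small grid of `eps` values and records the number of clusters and the number of noise points (label `-1`) for each. The grid values are arbitrary assumptions chosen for illustration, and `min_samples` is left at its default.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_digits
from sklearn.preprocessing import MinMaxScaler

# Same scaling as in the pipeline above.
X = MinMaxScaler().fit_transform(load_digits().data)

results = []
for eps in [0.5, 1.0, 1.5, 2.0]:  # arbitrary grid, adjust as needed
    labels = DBSCAN(eps=eps).fit_predict(X)
    # Noise points are labeled -1; every other label is a cluster.
    n_noise = int(np.sum(labels == -1))
    n_clusters = len(np.unique(labels[labels != -1]))
    results.append((eps, n_clusters, n_noise))
    print(f'eps={eps}: {n_clusters} clusters, {n_noise} noise points')
```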

In [None]:
i = # Add in the index of one of the outlier points here

print(f'Label: {digits["target"][i]}')
plt.imshow(digits['images'][i], cmap = 'Greys');

**Question 5:** What might be some downsides to using the Euclidean distance on this dataset?