# DBSCAN 
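* stands for Density-Based Spatial Clustering of Applications with Noise
* uses the DENSITY of points in a neighborhood to determine clusters, rather than the distance from each point to a cluster representative (as K-means does).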

## Definitions
* x = a point
* $\epsilon$ = radius of the "ball" (a circle in 2D) centered on some point x
* $\delta(x,y)$ = the Euclidean distance between points x and y
* $N_{\epsilon}(x)$ = the $\epsilon$-neighborhood of x: the set of all points y whose Euclidean distance from x is at most $\epsilon$
* $N_{\epsilon}(x) = \{y \mid \delta(x,y) \leq \epsilon\}$
* minpts = some integer number of points that is user-defined
* core point = x is a core point if $|N_{\epsilon}(x)| \geq \text{minpts}$
    * i.e. there are at least minpts points in x's neighborhood for a given $\epsilon$ (by the definition above, the neighborhood includes x itself)
* border point = x is a border point if $|N_{\epsilon}(x)| \lt \text{minpts}$ AND it belongs to the neighborhood of some **core point** z.
    * $|N_{\epsilon}(x)| \lt \text{minpts} \land x \in N_{\epsilon}(z)$
* noise point = x is a noise point if it is neither a core point nor a border point
    * i.e. x is not in the neighborhood of any core point AND $|N_{\epsilon}(x)| \lt \text{minpts}$
    * the condition above can't be an OR: if $|N_{\epsilon}(x)| \geq \text{minpts}$, then x qualifies as a core point
    * refer to figure 15.2
* directly density reachable = x is directly density reachable from y if $x \in N_{\epsilon}(y)$ and y is a core point.
    * by the definition of a border point above, a border point is directly density reachable from the core point whose neighborhood contains it
* density reachable
    * take two points x and y.
    * suppose there is a chain of points $s = x_0, x_1, \dots, x_l$ (so $l + 1$ points) with $x = x_0$ (the first point in the chain) and $y = x_l$ (the last point in the chain)
    * if each $x_i$ is directly density reachable from $x_{i-1}$ for $i = 1, \dots, l$, then y is density reachable from x
        * thus, there is a chain of core points $x_0, \dots, x_{l-1}$ leading from x to y
* density connected
    * take two points x and y
    * x and y are density connected if there exists a core point z such that both x and y are density reachable from z (each through its own chain of core points)
* density-based cluster = a maximal set of density connected points; the purpose of DBSCAN is to find these clusters
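
To make the core / border / noise definitions above concrete, here is a minimal sketch on five hand-picked 2D points. The function name `classify_points` and the toy data are assumptions made for illustration only; they are not part of the notebook code below. The sketch follows the definitions literally, so $N_{\epsilon}(x)$ includes x itself.

```python
import numpy as np

def classify_points(X, eps, minpts):
    """Label each row of X as 'core', 'border', or 'noise' (hypothetical helper)."""
    # Pairwise Euclidean distances delta(x, y), shape (n, n).
    diffs = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=2))
    # in_nbhd[i, j] is True when point j is in N_eps(point i); the diagonal is True.
    in_nbhd = d <= eps
    is_core = in_nbhd.sum(axis=1) >= minpts
    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append("core")
        elif (in_nbhd[i] & is_core).any():
            # not dense enough itself, but inside some core point's neighborhood
            labels.append("border")
        else:
            labels.append("noise")
    return labels

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [2.0, 0.0], [5.0, 5.0]])
labels = classify_points(X, eps=1.0, minpts=3)
print(labels)  # ['core', 'border', 'core', 'border', 'noise']
```

Here (0, 0) and (1, 0) each have three points within distance 1 (counting themselves), so they are core points; (0, 1) and (2, 0) have only two but sit next to a core point, so they are border points; (5, 5) is isolated and is noise.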

In [159]:
class point:
    """
    neighbors - list of integers (integers - row id of df)
    df_id - id of the transaction in the dataframe
    cluster_id - cluster id assigned to it
    """
    def __init__(self, df_id, neighbors, cluster_id):
        self.neighbors = neighbors
        self.df_id = df_id
        self.cluster_id = cluster_id
        self.core = False
        self.border = False
        self.noise = False
    
    def set_cluster_id(self, _id):
        self.cluster_id = _id
    
    def is_core(self):
        self.core = True

    def is_border(self):
        self.border = True

    def is_noise(self):
        self.noise = True
    
    def __repr__(self):
        var = "outlier"
        if self.core:
            var = "cluster"
        elif self.border:
            var = "border"
        elif self.noise:
            var = "noise"
        return "{}, {}, {}".format(var, self.df_id, self.cluster_id)

In [206]:
def dist(x, y):
    # Euclidean distance between two pandas series
    # ensures that the columns (index labels) of x and y are the same
    assert (x.index == y.index).all()
    total = 0
    for x_val, y_val in zip(x, y):
        total += (x_val - y_val) ** 2
    return total ** 0.5

In [207]:
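def N(ep, D, x, x_id):
    """
    D - pandas dataframe
    x - pandas series (the row of D at position x_id)
    x_id - integer row position of x in D
    ep - float, the neighborhood radius epsilon
    return
        ret_lst - list of transaction ids (row positions) that are neighbors of x;
                  x itself is skipped, unlike the set definition of N_ep(x)
    """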
    ret_lst = []
    for row in range(len(D)):
        if row == x_id:
            continue
        elif dist(x, D.iloc[row]) <= ep:
            # is a neighbor
            ret_lst.append(row)
    return ret_lst
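
The loop in `N` above calls `dist` once per row, which is slow in pure Python. As a sketch of an alternative, the same neighborhood can be computed with one NumPy expression. The name `neighbors_vectorized` and the toy dataframe are assumptions for illustration; the function keeps the convention of `N` and leaves the query row out of the result.

```python
import numpy as np
import pandas as pd

def neighbors_vectorized(ep, D, x_id):
    """Row positions within distance ep of row x_id, excluding x_id itself."""
    values = D.to_numpy(dtype=float)
    # Euclidean distance from row x_id to every row at once (broadcasting).
    distances = np.sqrt(((values - values[x_id]) ** 2).sum(axis=1))
    mask = distances <= ep
    mask[x_id] = False  # mirror N(), which skips the query row
    return np.flatnonzero(mask).tolist()

toy = pd.DataFrame({"a": [0.0, 0.3, 1.0], "b": [0.0, 0.0, 0.0]})
print(neighbors_vectorized(0.5, toy, 0))  # [1]
```
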

In [219]:
def density_connected(x, k, core_set, mapping):
    """
    x - integer row id of corresponding transaction in df_core_map
    k - cluster id
    core_set - set of row ids that are considered as core points
    mapping - mapping of row ids to point objects (core and non-core)
    """
    # mapping[x].neighbors plays the role of N_ep(x)
    print()
    for y in mapping[x].neighbors:
        # skip points that already belong to a cluster
        if mapping[y].cluster_id is not None:
            continue
        else:
            # y is an integer row id
            mapping[y].set_cluster_id(k)
            print("setting transaction: {} to cluster id: {}".format(y, k))
            # only core points propagate the cluster further
            if y in core_set:
                density_connected(y, k, core_set, mapping)
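
`density_connected` above recurses once per core point it reaches, so a long chain of core points could hit Python's recursion limit on larger datasets. The following is a sketch of the same expansion written with an explicit stack. The name `expand_cluster` and the plain dictionaries `neighbors` and `cluster_id` are hypothetical stand-ins for `point.neighbors` and `point.cluster_id`, chosen so the example runs on its own.

```python
def expand_cluster(start, k, core_set, neighbors, cluster_id):
    """Assign cluster k to every point density reachable from core point `start`.

    neighbors  - dict: row id -> list of neighbor row ids (stand-in for point.neighbors)
    cluster_id - dict: row id -> cluster id or None (stand-in for point.cluster_id)
    """
    cluster_id[start] = k
    stack = [start]
    while stack:
        x = stack.pop()
        for y in neighbors[x]:
            if cluster_id[y] is not None:
                continue  # already claimed by some cluster
            cluster_id[y] = k
            if y in core_set:
                stack.append(y)  # core points keep expanding the cluster

# Toy input: rows 0-2 are core points, row 3 is a border point, row 4 is isolated.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2], 4: []}
core_set = {0, 1, 2}
cluster_id = {i: None for i in neighbors}
expand_cluster(0, 1, core_set, neighbors, cluster_id)
print(cluster_id)  # {0: 1, 1: 1, 2: 1, 3: 1, 4: None}
```
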

In [220]:
from sklearn import datasets
import pandas as pd
import numpy as np
# need to store a dictionary that maps index of some pandas series with the 

def DBSCAN(D, ep, minpts):
    
    # one point object per row; df_id and neighbors are only filled in for core points below
    df_core_map = {i: point(0, [], None) for i in range(len(D))}
    
    core = set()
    for row in range(len(D)):
        t = D.iloc[row]
        neighbors = N(ep, D, t, row)
        # N() leaves out the point itself, so this asks for minpts OTHER points within ep
        if len(neighbors) >= minpts:
            pt = df_core_map[row]
            pt.neighbors = neighbors
            pt.df_id = row
            pt.is_core()
            core.add(row)
    print(df_core_map)
    k = 0
    for core_id in core:
        core_pt = df_core_map[core_id]
        # start a new cluster from every core point not yet assigned to one
        if core_pt.cluster_id is None:
            k += 1
            core_pt.set_cluster_id(k)
            density_connected(core_id, k, core, df_core_map)
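    # cluster ids run from 1 to k
    cluster_map = {i: set() for i in range(1, k + 1)}
    noise = set()
    for df_ind, pt in df_core_map.items():
        if pt.cluster_id is not None:
            cluster_map[pt.cluster_id].add(df_ind)
        else:
            # never reached from any core point
            noise.add(df_ind)
            pt.is_noise()
    # border points were assigned to a cluster but are not core points
    border = set(range(len(D))) - noise.union(core)
    for df_ind in border:
        df_core_map[df_ind].is_border()
    print(cluster_map)
    print(core)
    print(border)
    print(noise)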
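
As a rough cross-check of the `DBSCAN` function above, the same two iris features can be clustered with scikit-learn's `sklearn.cluster.DBSCAN`. This is a sketch under two assumptions: the parameters 0.36 and 3 are the ones passed in `main()`, and because scikit-learn's `min_samples` counts the point itself while `N` above does not, `min_samples = minpts + 1` is used as the closest equivalent. Border points that lie near two clusters may still be assigned differently, since that depends on processing order.

```python
from sklearn import datasets
from sklearn.cluster import DBSCAN as SklearnDBSCAN  # alias avoids clashing with the notebook's DBSCAN

iris = datasets.load_iris()
X = iris["data"][:, :2]  # sepal length (cm) and sepal width (cm)

minpts = 3
model = SklearnDBSCAN(eps=0.36, min_samples=minpts + 1).fit(X)

labels = model.labels_                      # cluster index per row; -1 marks noise
core_ids = set(model.core_sample_indices_)  # row positions of the core points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(n_clusters, len(core_ids), n_noise)
```
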

In [221]:
def main():
    iris = datasets.load_iris()
    df = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['target'])
    X = df.drop('target', axis=1)[['sepal length (cm)', 'sepal width (cm)']]
    y = df['target']
    
    DBSCAN(X, 0.36, 3)

In [222]:
main()

{0: cluster, 0, None, 1: cluster, 1, None, 2: cluster, 2, None, 3: cluster, 3, None, 4: cluster, 4, None, 5: cluster, 5, None, 6: cluster, 6, None, 7: cluster, 7, None, 8: cluster, 8, None, 9: cluster, 9, None, 10: cluster, 10, None, 11: cluster, 11, None, 12: cluster, 12, None, 13: cluster, 13, None, 14: outlier, 0, None, 15: outlier, 0, None, 16: cluster, 16, None, 17: cluster, 17, None, 18: cluster, 18, None, 19: cluster, 19, None, 20: cluster, 20, None, 21: cluster, 21, None, 22: cluster, 22, None, 23: cluster, 23, None, 24: cluster, 24, None, 25: cluster, 25, None, 26: cluster, 26, None, 27: cluster, 27, None, 28: cluster, 28, None, 29: cluster, 29, None, 30: cluster, 30, None, 31: cluster, 31, None, 32: cluster, 32, None, 33: cluster, 33, None, 34: cluster, 34, None, 35: cluster, 35, None, 36: cluster, 36, None, 37: cluster, 37, None, 38: cluster, 38, None, 39: cluster, 39, None, 40: cluster, 40, None, 41: outlier, 0, None, 42: cluster, 42, None, 43: cluster, 43, None, 44: cluste