# DBCV Index

Consider the dataset D = {(0, 0), (6, 0), (3, 3), (10, 1), (15, 0), (12, 3)} and its partition into two clusters: C1 = {(0, 0), (6, 0), (3, 3)} and C2 = {(10, 1), (15, 0), (12, 3)}. Let's compute the DBCV index of that clustering, using the version of the index that distinguishes the borderline points of a cluster from its inner points.

In [5]:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

# Creating the dataset and the two clusters

c1 = [(0,0),(6,0),(3,3)]
c2 = [(10,1),(15,0),(12,3)]
data = [(0,0),(6,0),(3,3),(10,1),(15,0),(12,3)]
dimension = 2

In the definitions of our concepts we use the following notation. Let $O = \{o_1, \dots, o_n\}$ be a dataset containing $n$ objects in the $\mathbb{R}^d$ feature space. Let $Dist$ be an $n \times n$ matrix of pairwise distances $d(o_p, o_q)$, where $o_p, o_q \in O$, for a given metric distance $d(\cdot, \cdot)$. Let $KNN(o, i)$ be the distance between object $o$ and its $i$-th nearest neighbor. Let $C = (\{C_i\}, N)$, $1 \le i \le l$, be a clustering solution containing $l$ clusters and a (possibly empty) set of noise objects $N$, for which $n_i$ is the size of the $i$-th cluster and $n_N$ is the cardinality of the noise.
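To make the notation concrete, here is a minimal sketch of the Dist matrix and of KNN(o, i) on the six points of this notebook. The helper name `knn_distance` is hypothetical, the Euclidean metric is an assumption, and neighbors are searched over the whole dataset purely for illustration.

```python
import numpy as np

# The six points used in this notebook
points = np.array([(0, 0), (6, 0), (3, 3), (10, 1), (15, 0), (12, 3)], dtype=float)

# Dist: n x n matrix of pairwise distances (Euclidean metric is an assumption)
Dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

def knn_distance(Dist, p, i):
    """Distance from object p (a row index) to its i-th nearest neighbor.

    Hypothetical helper, not part of the original notebook.
    """
    # Drop the zero distance from p to itself, then sort what remains
    others = np.delete(Dist[p], p)
    return np.sort(others)[i - 1]

print(Dist.shape)
print(knn_distance(Dist, 0, 1), knn_distance(Dist, 0, 2))
```

For the point (0, 0), the two nearest neighbors are (3, 3) and (6, 0), at distances √18 ≈ 4.243 and 6.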

Let's calculate the core distance:

In [6]:
def KNN(pts, cluster, neighbor):
    
    # Returns the distance between a point (pts) and its neighbor-th nearest neighbor.
    # Note: neighbors are searched over the whole dataset (data); the cluster argument is not used.
    
    result = []
    for i in data:
        if i != pts:
            result.append(np.linalg.norm(np.array(pts)-np.array(i)))
    result.sort()
    return result[neighbor-1]


In [7]:
KNN_list = []
for i in data:
    for j in range(1,3):
        if i in c1:
            cluster = c1
        else:
            cluster = c2
        KNN_list.append(KNN(i,cluster,j))
#print(KNN_list)

In [8]:
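def coredist(o):
    
    if o in c1:
        c = c1
    else:
        c = c2
    
    # Core distance as seen in class: the inverse distances to the 2 nearest neighbors
    # are raised to the power `dimension`, averaged, and raised to the power -1/dimension
    core = (((1/KNN(o,c,1))**dimension + (1/KNN(o,c,2))**dimension)/2)**(-1/dimension)
    return core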

In [9]:
coredist_list = dict()
for p in data:
    coredist_list[p] = coredist(p)
print('The core distance for each point in the dataset is as follows:',coredist_list)

The core distance for each point in the dataset is as follows: {(0, 0): 4.8989794855663558, (3, 3): 4.2426406871192848, (10, 1): 3.298484500494129, (15, 0): 4.6122366887148445, (6, 0): 4.1815923146230176, (12, 3): 3.3282011773513749}
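
As a check on the first entry of this output, the computation in `coredist` can be retraced by hand for the point (0, 0). Its two nearest neighbors are (3, 3) and (6, 0), at distances √18 and 6, so with two neighbors and exponent 2:

```latex
\mathrm{coredist}\big((0,0)\big)
  = \left( \frac{(1/\sqrt{18})^{2} + (1/6)^{2}}{2} \right)^{-1/2}
  = \left( \frac{1/18 + 1/36}{2} \right)^{-1/2}
  = \left( \frac{1}{24} \right)^{-1/2}
  = \sqrt{24} \approx 4.899
```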


The mutual reachability distance between two objects $o_i$ and $o_j$ in $O$ is defined as $d_{mreach}(o_i, o_j) = \max\{a_{pts}coredist(o_i),\ a_{pts}coredist(o_j),\ d(o_i, o_j)\}$.
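
Two pairs from this dataset illustrate the definition, using the core distances printed above (rounded to three decimals). In the first pair the plain distance dominates; in the second, the core distance of (0, 0) does.

```latex
d_{mreach}\big((0,0),(6,0)\big) = \max\{4.899,\ 4.182,\ 6\} = 6
\\
d_{mreach}\big((0,0),(3,3)\big) = \max\{4.899,\ 4.243,\ \sqrt{18} \approx 4.243\} = 4.899
```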

Calculating the Mutual Reachability Distance:

In [10]:
def dmreach(oi,oj):
    return(max(coredist(oi), coredist(oj), np.linalg.norm(np.array(oi)-np.array(oj))))

In [11]:
# Full matrix of mutual reachability distances over the whole dataset
M = np.array([[dmreach(oi, oj) for oj in data] for oi in data])
MSTMRD = minimum_spanning_tree(M)
print(M)
print(MSTMRD)

[[  4.89897949   6.           4.89897949  10.04987562  15.          12.36931688]
 [  6.           4.18159231   4.24264069   4.18159231   9.           6.70820393]
 [  4.89897949   4.24264069   4.24264069   7.28010989  12.36931688   9.        ]
 [ 10.04987562   4.18159231   7.28010989   3.2984845    5.09901951
    3.32820118]
 [ 15.           9.          12.36931688   5.09901951   4.61223669
    4.61223669]
 [ 12.36931688   6.70820393   9.           3.32820118   4.61223669
    3.32820118]]
  (0, 2)	4.89897948557
  (1, 2)	4.24264068712
  (3, 1)	4.18159231462
  (4, 5)	4.61223668871
  (5, 3)	3.32820117735


In [12]:
# Mutual reachability matrix restricted to the points of cluster 1
M1 = np.array([[dmreach(oi, oj) for oj in c1] for oi in c1])

MSTMRD1 = minimum_spanning_tree(M1)
print(MSTMRD1)

  (0, 2)	4.89897948557
  (1, 2)	4.24264068712


In [13]:
# Mutual reachability matrix restricted to the points of cluster 2
M2 = np.array([[dmreach(oi, oj) for oj in c2] for oi in c2])

MSTMRD2 = minimum_spanning_tree(M2)
print(MSTMRD2)

  (0, 2)	3.32820117735
  (1, 2)	4.61223668871


The Density Sparseness of a Cluster (DSC) $C_i$ is defined as the maximum edge weight of the internal edges in the $MST_{MRD}$ of the cluster $C_i$, where $MST_{MRD}$ is the minimum spanning tree constructed using $a_{pts}coredist$ considering the objects in $C_i$.

The Density Separation of a Pair of Clusters (DSPC) $C_i$ and $C_j$, $1 \le i, j \le l$, $i \ne j$, is defined as the minimum reachability distance between the internal nodes of the $MST_{MRD}$s of clusters $C_i$ and $C_j$.
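
These two definitions can be sketched in code. The block below is a hedged illustration and not necessarily the reference procedure. It carries over two choices from this notebook, both labeled in the comments: core distances use the 2 nearest neighbors over the whole dataset, and the sparseness of a cluster is taken as the maximum over all edges of its tree. The helper name `cluster_tree_summary` is hypothetical.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

points = np.array([(0, 0), (6, 0), (3, 3), (10, 1), (15, 0), (12, 3)], dtype=float)
clusters = [np.array([0, 1, 2]), np.array([3, 4, 5])]  # row indices of C1 and C2

# Pairwise Euclidean distances
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# Core distances as computed in this notebook (assumption carried over):
# 2 nearest neighbors, exponent 2, neighbors searched over the whole dataset
nearest = np.sort(dist, axis=1)[:, 1:3]        # column 0 is the distance to itself
core = np.mean((1.0 / nearest) ** 2, axis=1) ** -0.5

# Mutual reachability: max of the two core distances and the plain distance
mreach = np.maximum(dist, np.maximum(core[:, None], core[None, :]))

def cluster_tree_summary(idx):
    """Return (max tree edge weight, global indices of internal nodes) for one cluster."""
    sub = mreach[np.ix_(idx, idx)]             # copy restricted to the cluster
    np.fill_diagonal(sub, 0.0)                 # remove self-loops
    mst = minimum_spanning_tree(sub).toarray()
    mst = np.maximum(mst, mst.T)               # symmetrize the tree
    degree = (mst > 0).sum(axis=1)
    internal = idx[degree > 1]                 # internal nodes have degree > 1
    # As in the notebook's manual step: max over all edges of the tree
    return mst.max(), internal

dsc1, internal1 = cluster_tree_summary(clusters[0])
dsc2, internal2 = cluster_tree_summary(clusters[1])

# DSPC: minimum mutual reachability distance between internal nodes of the two trees
dspc = mreach[np.ix_(internal1, internal2)].min()
print(dsc1, dsc2, internal1, internal2, dspc)
```

Each cluster here has three points, so each tree has exactly one internal node. On larger clusters the choice between "all tree edges" and "edges between internal nodes" would matter and should be checked against the definition being followed.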

We now compute the density sparseness of each cluster, the density separation of the pair of clusters, and the validity index of each cluster:

In [14]:
DSC1 = 4.89897948557 # Manual input: the max edge weight of the minimum spanning tree of cluster 1
DSC2 = 4.61223668871 # Manual input: the max edge weight of the minimum spanning tree of cluster 2

# Minimum mutual reachability distance between the internal nodes of the two trees.
# The internal nodes are nodes 2 and 5 (zero-based indices into data), i.e. (3, 3) and (12, 3).
# From matrix M, their mutual reachability distance is M[2, 5] = 9.
DSPC1C2 = 9

VC1 = (DSPC1C2 - DSC1)/max(DSPC1C2, DSC1)
VC2 = (DSPC1C2 - DSC2)/max(DSPC1C2, DSC2)
DBCV = (VC1 + VC2)/2
print('The validity index of the clustering is:',DBCV)

The validity index of the clustering is: 0.4715991014288889
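
The arithmetic behind the printed value can be retraced by hand from the numbers above. Both clusters contain three points, so the plain mean used in the code gives each cluster equal weight. The symbol names below mirror the variables in the code cell.

```latex
V_{C_1} = \frac{9 - 4.899}{\max\{9,\ 4.899\}} = \frac{4.101}{9} \approx 0.4557
\\
V_{C_2} = \frac{9 - 4.612}{\max\{9,\ 4.612\}} = \frac{4.388}{9} \approx 0.4875
\\
\frac{V_{C_1} + V_{C_2}}{2} \approx \frac{0.4557 + 0.4875}{2} \approx 0.4716
```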
