## DBSCAN Cluster Analysis

In [7]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.cluster import DBSCAN
import seaborn as sns
import matplotlib.pyplot as plt

In [76]:
data0 = pd.read_csv('../data/working_files/data_withKMeans.csv')
data = data0.copy()
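# NOTE: drop() returns a new DataFrame; without assigning the result,
# the 'KMeans cluster' column stays in `data` (it still appears in the head() output).
data.drop(columns=['KMeans cluster'])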
data.head()

Unnamed: 0,good_logs_num,neutral_logs_num,bad_logs_num,difficulty,terrain,size,status,is_premium,short_description,long_description,...,cache_type_Maze Exhibit,cache_type_Mega event,cache_type_Multi,cache_type_Traditional,cache_type_Unknown/Mystery,cache_type_Virtual,cache_type_Webcam,cache_type_Wherigo,sentiment,KMeans cluster
0,9,0,1,2.0,1.5,1.0,1,1,0,1,...,0,0,0,1,0,0,0,0,0.973,1
1,1,3,2,1.5,1.5,2.0,1,0,0,1,...,0,0,0,1,0,0,0,0,0.9797,49
2,7,2,0,1.5,1.5,2.0,1,0,0,1,...,0,0,0,1,0,0,0,0,-0.9134,47
3,10,0,0,2.0,3.0,3.0,1,0,1,1,...,0,0,0,0,1,0,0,0,0.9958,2
4,5,2,0,1.5,2.5,1.0,1,0,0,1,...,0,0,0,1,0,0,0,0,0.9744,0
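
One caveat about the cell above: `DataFrame.drop` returns a new frame rather than modifying `data` in place, which is why `KMeans cluster` still appears in the `head()` output. If the intent is to keep the KMeans labels out of the features passed to the scaler, a minimal sketch with a made-up three-row frame (the values and the `features` name are illustrative only) might look like this:

```python
import pandas as pd

# Toy frame standing in for the cache data (values are made up).
toy = pd.DataFrame({
    'difficulty': [2.0, 1.5, 1.5],
    'terrain': [1.5, 1.5, 3.0],
    'KMeans cluster': [1, 49, 47],
})

# Without assignment, the original frame is unchanged.
toy.drop(columns=['KMeans cluster'])
still_there = 'KMeans cluster' in toy.columns

# Assigning the result gives a feature frame without the label column.
features = toy.drop(columns=['KMeans cluster'])

print(still_there)             # True
print(list(features.columns))  # ['difficulty', 'terrain']
```

Whether `X_sc` actually contained the label column depends on the state of `data` when the scaling cell ran.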


In [6]:
sc = StandardScaler()
X_sc = sc.fit_transform(data)



In [31]:
db = DBSCAN(eps=1.0, min_samples=4)  # starting values; scikit-learn's defaults are eps=0.5, min_samples=5
db.fit(X_sc)

labels = db.labels_

sil = silhouette_score(X_sc, labels)
pct_noise = sum([1 for l in labels if l==-1])*100/data.shape[0]
num_clusters = np.max(labels)  # highest cluster label; labels start at 0, so the cluster count is this + 1


print(f'Silhouette Score: {sil} with cache data.')
print(f'{num_clusters} clusters, + {pct_noise}% were NOT categorized.')



Silhouette Score: -0.021417778489332146 with cache data.
229 clusters, + 32.83197940963565% were NOT categorized.
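
Because DBSCAN numbers its clusters from 0 and marks noise as -1, `np.max(labels)` is the highest label rather than the cluster count. A small sketch of a helper (the name `summarize_labels` is hypothetical) that reports both the count and the noise share:

```python
import numpy as np

def summarize_labels(labels):
    """Return (number of clusters, percent of points labeled noise)."""
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels[labels != -1])  # exclude the noise label
    pct_noise = 100.0 * np.mean(labels == -1)
    return len(cluster_ids), pct_noise

# Toy labels: clusters 0, 1, 2 plus two noise points.
toy_labels = [0, 0, 1, -1, 2, 2, 1, -1]
n_clusters, pct_noise = summarize_labels(toy_labels)
print(n_clusters, pct_noise)  # 3 25.0
```

On the toy labels `np.max` would give 2, one less than the three clusters present.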


In [46]:
chk = pd.DataFrame({'kmeans': list(data0['KMeans cluster']),'dbscan': labels})

capture_list = []
for c in range(labels.max()+1):
    num_dom = np.max(np.array(chk[chk['dbscan']==c]['kmeans'].value_counts()))
    num_all = chk[chk['dbscan']==c]['kmeans'].shape[0]
    cap=num_dom*100/num_all
    capture_list.append(cap)
avg_capture = np.mean(capture_list)

print(f'These clusters on average were {avg_capture} of their dominant kmeans cluster.')

These clusters on average were 99.42746345442846 of their dominant kmeans cluster.
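
The capture figure above is an unweighted mean over clusters, so a two-cache cluster counts as much as one holding thousands. As a hedged sketch (toy labels, hypothetical function name `dominant_share`), the same per-cluster dominant-label share can be computed alongside a size-weighted average:

```python
from collections import Counter, defaultdict

def dominant_share(dbscan_labels, kmeans_labels):
    """Per DBSCAN cluster: (size, percent belonging to the most common KMeans label)."""
    members = defaultdict(list)
    for d, k in zip(dbscan_labels, kmeans_labels):
        if d != -1:  # skip DBSCAN noise
            members[d].append(k)
    result = {}
    for d, ks in members.items():
        top_count = Counter(ks).most_common(1)[0][1]
        result[d] = (len(ks), 100.0 * top_count / len(ks))
    return result

# Toy example: cluster 0 is large and mixed, cluster 1 is small and pure.
dbscan = [0, 0, 0, 0, 0, 0, 1, 1, -1]
kmeans = [5, 5, 5, 7, 7, 9, 3, 3, 4]
shares = dominant_share(dbscan, kmeans)

unweighted = sum(p for _, p in shares.values()) / len(shares)
weighted = sum(n * p for n, p in shares.values()) / sum(n for n, _ in shares.values())
print(shares)      # {0: (6, 50.0), 1: (2, 100.0)}
print(unweighted)  # 75.0
print(weighted)    # 62.5
```

The gap between the two averages grows when one large cluster is much less pure than the many small ones.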


#### Let's try a few different values for eps and min_samples:

In [49]:
def DBSCAN_gs(eps_list, n_list):
    
    kmeanlabels = list(data0['KMeans cluster'])
    scores = []
    for eps in eps_list:
        for min_samples in n_list:
            
            db = DBSCAN(eps=eps, min_samples=min_samples)
            db.fit(X_sc)

            labels = db.labels_

            sil = silhouette_score(X_sc, labels)
            pct_noise = sum([1 for l in labels if l==-1])*100/data.shape[0]
            num_clusters = np.max(labels)

            chk = pd.DataFrame({'kmeans': kmeanlabels,'dbscan': labels})

            capture_list = []
            for c in range(labels.max()+1):
                num_dom = np.max(np.array(chk[chk['dbscan']==c]['kmeans'].value_counts()))
                num_all = chk[chk['dbscan']==c]['kmeans'].shape[0]
                cap=num_dom*100/num_all
                capture_list.append(cap)
            avg_capture = np.mean(capture_list)

            
            print(f'eps = {eps}, min_samples={min_samples}')
            print('--------------------------------------------------')
            print(f'Silhouette Score: {sil} with cache data.')
            print(f'{num_clusters} clusters, + {pct_noise}% were NOT categorized.')
            print(f'These clusters on average were {avg_capture} of their dominant kmeans cluster.')
            print()

            scores.append({
                'eps': eps,
                'min_samples': min_samples,
                'silhouette against data': sil,
                'number of clusters': num_clusters,
                'percent unclustered': pct_noise,
                'comparison against kmeans label': avg_capture
            })
    return scores
        

In [50]:
eps_list = [1, 2, 3, 4, 5]
n_list = [2, 3, 4, 5, 10]
scores1 = DBSCAN_gs(eps_list, n_list)


eps = 1, min_samples=2
--------------------------------------------------
Silhouette Score: 0.01745481196612966 with cache data.
611 clusters, + 24.676264779216602% were NOT categorized.
These clusters on average were 99.77637507684172 of their dominant kmeans cluster.

eps = 1, min_samples=3
--------------------------------------------------
Silhouette Score: -0.005879470981111998 with cache data.
340 clusters, + 29.03563098206386% were NOT categorized.
These clusters on average were 99.74528312911183 of their dominant kmeans cluster.

eps = 1, min_samples=4
--------------------------------------------------
Silhouette Score: -0.021417778489332146 with cache data.
229 clusters, + 32.83197940963565% were NOT categorized.
These clusters on average were 99.42746345442846 of their dominant kmeans cluster.

eps = 1, min_samples=5
--------------------------------------------------
Silhouette Score: -0.03290071928714871 with cache data.
180 clusters, + 35.78380117429422% were NOT categorized

In [51]:
pd.DataFrame(scores1).describe()

Unnamed: 0,eps,min_samples,silhouette against data,number of clusters,percent unclustered,comparison against kmeans label
count,25.0,25.0,25.0,25.0,25.0,25.0
mean,3.0,4.8,0.247999,140.4,10.397169,99.02625
std,1.443376,2.84312,0.161582,132.142284,12.893221,0.5183
min,1.0,2.0,-0.078037,39.0,0.659535,98.010167
25%,2.0,3.0,0.185343,58.0,1.600579,98.704053
50%,3.0,4.0,0.306723,87.0,3.828521,99.145798
75%,4.0,5.0,0.34801,180.0,12.217486,99.397172
max,5.0,10.0,0.433746,611.0,45.186198,99.931905


Very impressed with the match with KMeans! For every parameter combination tested, each DBSCAN cluster consisted almost exclusively of a single KMeans label (the average dominant-label share ranged from 98.0% to 99.9%). Of course, many of these runs produce far more clusters than KMeans did, which means DBSCAN is further subdividing some of the KMeans clusters.

In [53]:
score1_df = pd.DataFrame(scores1)
score1_df.sort_values(by='silhouette against data', ascending=False).head()

Unnamed: 0,eps,min_samples,silhouette against data,number of clusters,percent unclustered,comparison against kmeans label
24,5,10,0.433746,39,1.311027,98.336692
23,5,5,0.433074,42,1.029518,98.452737
22,5,4,0.43104,45,0.916915,98.010167
21,5,3,0.430832,48,0.820397,98.131994
20,5,2,0.424089,58,0.659535,98.448605


In [54]:
score1_df = pd.DataFrame(scores1)
score1_df.sort_values(by='percent unclustered', ascending=True).head()

Unnamed: 0,eps,min_samples,silhouette against data,number of clusters,percent unclustered,comparison against kmeans label
20,5,2,0.424089,58,0.659535,98.448605
21,5,3,0.430832,48,0.820397,98.131994
22,5,4,0.43104,45,0.916915,98.010167
23,5,5,0.433074,42,1.029518,98.452737
15,4,2,0.340256,87,1.294941,99.145798


An eps value of 5 clearly beats the others, but 5 was the largest value in the grid, so I should extend the range upward.

I'll also extend the range of min_samples, on the higher end.

I'll use silhouette as my metric, provided that the percent unclustered is also reasonable.
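
That selection rule can be written down explicitly. The sketch below uses a few rows copied from the first grid search and a hypothetical 1% noise ceiling (the threshold and the name `max_noise_pct` are assumptions, not something fixed in this analysis):

```python
# A few grid-search records in the same shape DBSCAN_gs returns
# (values copied from the first grid search, rounded as displayed).
records = [
    {'eps': 5, 'min_samples': 10, 'silhouette against data': 0.433746, 'percent unclustered': 1.311027},
    {'eps': 5, 'min_samples': 5, 'silhouette against data': 0.433074, 'percent unclustered': 1.029518},
    {'eps': 5, 'min_samples': 4, 'silhouette against data': 0.431040, 'percent unclustered': 0.916915},
    {'eps': 5, 'min_samples': 2, 'silhouette against data': 0.424089, 'percent unclustered': 0.659535},
    {'eps': 4, 'min_samples': 2, 'silhouette against data': 0.340256, 'percent unclustered': 1.294941},
]

max_noise_pct = 1.0  # hypothetical ceiling on the share of unclustered caches

# Keep runs under the noise ceiling, then rank by silhouette (highest first).
eligible = [r for r in records if r['percent unclustered'] <= max_noise_pct]
ranked = sorted(eligible, key=lambda r: r['silhouette against data'], reverse=True)

best = ranked[0]
print(best['eps'], best['min_samples'])  # 5 4
```

A different ceiling would change which runs are eligible, so the threshold should be stated alongside the chosen parameters.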


In [55]:
eps_list = [5, 6, 8, 10]
n_list = [2, 3, 4, 5, 6, 7, 8, 9, 10, 20]
scores2 = DBSCAN_gs(eps_list, n_list)
score2_df = pd.DataFrame(scores2)
score2_df.sort_values(by='silhouette against data', ascending=True).head()  # ascending=True lists the LOWEST silhouette scores first

eps = 5, min_samples=2
--------------------------------------------------
Silhouette Score: 0.42408914927203 with cache data.
58 clusters, + 0.659535108179844% were NOT categorized.
These clusters on average were 98.44860500842285 of their dominant kmeans cluster.

eps = 5, min_samples=3
--------------------------------------------------
Silhouette Score: 0.43083177610040346 with cache data.
48 clusters, + 0.820397329687123% were NOT categorized.
These clusters on average were 98.131993785652 of their dominant kmeans cluster.

eps = 5, min_samples=4
--------------------------------------------------
Silhouette Score: 0.4310404865554111 with cache data.
45 clusters, + 0.9169146625914903% were NOT categorized.
These clusters on average were 98.01016729341191 of their dominant kmeans cluster.

eps = 5, min_samples=5
--------------------------------------------------
Silhouette Score: 0.4330735486420228 with cache data.
42 clusters, + 1.0295182176465858% were NOT categorized.
These cluster

Unnamed: 0,eps,min_samples,silhouette against data,number of clusters,percent unclustered,comparison against kmeans label
0,5,2,0.424089,58,0.659535,98.448605
10,6,2,0.428694,56,0.450414,98.813151
9,5,20,0.429677,37,1.753398,98.249251
1,5,3,0.430832,48,0.820397,98.131994
2,5,4,0.43104,45,0.916915,98.010167


In [56]:
score2_df.describe()

Unnamed: 0,eps,min_samples,silhouette against data,number of clusters,percent unclustered,comparison against kmeans label
count,40.0,40.0,40.0,40.0,40.0,40.0
mean,7.25,7.4,0.466076,38.025,0.63058,98.138968
std,1.94475,4.924038,0.033798,6.988956,0.404929,0.333651
min,5.0,2.0,0.424089,28.0,0.080431,97.517069
25%,5.75,4.0,0.433578,33.75,0.297595,97.94952
50%,7.0,6.5,0.465866,38.0,0.514759,98.204059
75%,8.5,9.0,0.498818,41.0,0.868656,98.377261
max,10.0,20.0,0.502109,58.0,1.753398,98.813151


These clusters are still lining up really well with the kmeans clusters, just breaking them into smaller pieces.
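
One way to check the "smaller pieces" claim directly is to count how many DBSCAN clusters each KMeans cluster is spread across. A minimal sketch with toy labels (the function name `pieces_per_kmeans` is hypothetical):

```python
from collections import defaultdict

def pieces_per_kmeans(kmeans_labels, dbscan_labels):
    """Map each KMeans label to the sorted DBSCAN clusters its points fall in (noise excluded)."""
    pieces = defaultdict(set)
    for k, d in zip(kmeans_labels, dbscan_labels):
        if d != -1:
            pieces[k].add(d)
    return {k: sorted(ds) for k, ds in pieces.items()}

kmeans = [0, 0, 0, 1, 1, 1, 1, 2]
dbscan = [0, 0, 0, 1, 1, 2, -1, 0]
result = pieces_per_kmeans(kmeans, dbscan)
print(result)  # {0: [0], 1: [1, 2], 2: [0]}
```

In the toy output, KMeans cluster 1 is split into two DBSCAN clusters, while KMeans clusters 0 and 2 both land in DBSCAN cluster 0, which is a merge rather than a split.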

#### Setting eps = 5 and min_samples = 2

In [77]:
db = DBSCAN(eps=5.0, min_samples=2) 
db.fit(X_sc)

labels = db.labels_

sil = silhouette_score(X_sc, labels)
pct_noise = sum([1 for l in labels if l==-1])*100/data.shape[0]
num_clusters = np.max(labels)

kmeanlabels = list(data0['KMeans cluster'])
chk = pd.DataFrame({'kmeans': kmeanlabels,'dbscan': labels})

capture_list = []
for c in range(labels.max()+1):
    num_dom = np.max(np.array(chk[chk['dbscan']==c]['kmeans'].value_counts()))
    num_all = chk[chk['dbscan']==c]['kmeans'].shape[0]
    cap=num_dom*100/num_all
    capture_list.append(cap)
avg_capture = np.mean(capture_list)

print(f'These clusters on average were {avg_capture} of their dominant kmeans cluster.')
print(f'Silhouette Score: {sil} with cache data.')
print(f'{num_clusters} clusters, + {pct_noise}% were NOT categorized.')


These clusters on average were 98.44860500842285 of their dominant kmeans cluster.
Silhouette Score: 0.42408914927203 with cache data.
58 clusters, + 0.659535108179844% were NOT categorized.


Note, this is better than the best silhouette score for kmeans, which was 0.335615866756178.
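
One caveat when comparing these numbers: `silhouette_score` treats every distinct label as a cluster, so DBSCAN's noise points (label -1) are scored as if they formed one cluster. A hedged sketch on toy data (the points and parameters are made up) showing the score with and without the noise points:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Two tight groups plus two isolated points that DBSCAN should mark as noise.
X = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
    [0.0, 10.0], [10.0, 0.0],
])
labels = DBSCAN(eps=1.5, min_samples=2).fit(X).labels_

# Score on everything: the two noise points count as one "cluster" labeled -1.
score_all = silhouette_score(X, labels)

# Score on clustered points only.
mask = labels != -1
score_clean = silhouette_score(X[mask], labels[mask])

print(score_all, score_clean)
```

On the cache data only about 0.66% of points are noise at eps = 5 and min_samples = 2, so the effect there is likely small.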

In [78]:
data_old = pd.read_csv('../data/working_files/strippeddata.csv')
data_old['DBSCAN cluster'] = labels
data_old.to_csv('../data/working_files/data_withDBSCAN.csv',index=False)

In [79]:
clabels = [-1] + list(range(labels.max()+1))
captures = [0] + capture_list
cnt = [chk[chk['dbscan']==c]['kmeans'].shape[0] for c in clabels]
clabels[0] = 'noise'

In [82]:
chk1 = pd.DataFrame({
    'cluster': clabels,
    'caches within': cnt,
    'homogeneity': captures
})
chk1.sort_values(by='homogeneity').head()

Unnamed: 0,cluster,caches within,homogeneity
0,noise,82,0.0
1,0,8683,33.467695
28,27,4,75.0
32,31,2,100.0
33,32,2,100.0


In [83]:
chk[chk['dbscan']==0]['kmeans'].value_counts()

0     2906
40    1406
2      965
1      821
49     758
45     725
28     468
46     467
47     167
Name: kmeans, dtype: int64

In [86]:
chk[chk['dbscan']==0]['kmeans'].value_counts().sum()*100/data_old.shape[0]

69.83833346738518

In [87]:
2906/2908

0.9993122420907841

In [88]:
chk[chk['dbscan']==27]['kmeans'].value_counts()

33    3
47    1
Name: kmeans, dtype: int64

In [89]:
2908/data_old.shape[0]

0.23389367007158368

In [91]:
chk[chk['dbscan']==0]['kmeans'].value_counts().sum()-2908

5775

The DBSCAN clusters map directly to single KMeans clusters, EXCEPT for DBSCAN clusters 0 and 27.

Several KMeans clusters have been split into 2 or more DBSCAN clusters.

DBSCAN cluster #27 is just 4 caches, and 3 of them have the same KMeans label.

The only real issue is with DBSCAN cluster #0, which includes 99.9% (2906 of 2908) of KMeans cluster 0. That was already the largest KMeans cluster, with 23.4% of the caches, and DBSCAN cluster #0 adds another 5777 caches (8683 - 2906) split among 8 other KMeans clusters. Altogether, DBSCAN cluster #0 holds 69.8% of the caches.
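
The arithmetic behind that breakdown can be checked from the `value_counts()` output for DBSCAN cluster 0 (the counts below are copied from that output; the size of KMeans cluster 0, 2908, comes from the cells above):

```python
# KMeans label -> number of caches inside DBSCAN cluster 0 (from value_counts above).
cluster0_counts = {0: 2906, 40: 1406, 2: 965, 1: 821, 49: 758,
                   45: 725, 28: 468, 46: 467, 47: 167}
kmeans0_size = 2908  # total size of KMeans cluster 0

total = sum(cluster0_counts.values())
from_other_clusters = total - cluster0_counts[0]
homogeneity = 100 * cluster0_counts[0] / total
share_of_kmeans0 = 100 * cluster0_counts[0] / kmeans0_size

print(total)                       # 8683
print(from_other_clusters)         # 5777
print(round(homogeneity, 2))       # 33.47
print(round(share_of_kmeans0, 2))  # 99.93
```

Subtracting 2908 (the full size of KMeans cluster 0) instead of 2906 (the part of it inside DBSCAN cluster 0) gives 5775, which undercounts the caches from other clusters by two.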

In [99]:
print('| DBSCAN cluster | number of caches | homogeneity |')
print('|--- |--- |--- |')
for j in list(range(0,60)):
    print(f'| {list(chk1.loc[j])[0]} | {list(chk1.loc[j])[1]} | {list(chk1.loc[j])[2]} |')

| DBSCAN cluster | number of caches | homogeneity |
|--- |--- |--- |
| noise | 82 | 0.0 |
| 0 | 8683 | 33.46769549694806 |
| 1 | 70 | 100.0 |
| 2 | 159 | 100.0 |
| 3 | 65 | 100.0 |
| 4 | 131 | 100.0 |
| 5 | 2 | 100.0 |
| 6 | 350 | 100.0 |
| 7 | 152 | 100.0 |
| 8 | 87 | 100.0 |
| 9 | 50 | 100.0 |
| 10 | 48 | 100.0 |
| 11 | 62 | 100.0 |
| 12 | 3 | 100.0 |
| 13 | 9 | 100.0 |
| 14 | 56 | 100.0 |
| 15 | 108 | 100.0 |
| 16 | 2 | 100.0 |
| 17 | 78 | 100.0 |
| 18 | 5 | 100.0 |
| 19 | 85 | 100.0 |
| 20 | 3 | 100.0 |
| 21 | 60 | 100.0 |
| 22 | 4 | 100.0 |
| 23 | 3 | 100.0 |
| 24 | 50 | 100.0 |
| 25 | 195 | 100.0 |
| 26 | 14 | 100.0 |
| 27 | 4 | 75.0 |
| 28 | 65 | 100.0 |
| 29 | 2 | 100.0 |
| 30 | 58 | 100.0 |
| 31 | 2 | 100.0 |
| 32 | 2 | 100.0 |
| 33 | 88 | 100.0 |
| 34 | 49 | 100.0 |
| 35 | 2 | 100.0 |
| 36 | 52 | 100.0 |
| 37 | 59 | 100.0 |
| 38 | 331 | 100.0 |
| 39 | 46 | 100.0 |
| 40 | 4 | 100.0 |
| 41 | 179 | 100.0 |
| 42 | 103 | 100.0 |
| 43 | 15 | 100.0 |
| 44 | 114 | 100.0 |
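| 45 | 102 | 100.0 |
| 46 | 168 | 100.0 |
| 47 | 62 | 100.0 |
| 48 | 48 | 100.0 |
| 49 | 52 | 100.0 |
| 50 | 25 | 100.0 |
| 51 | 60 | 100.0 |
| 52 | 2 | 100.0 |
| 53 | 6 | 100.0 |
| 54 | 2 | 100.0 |
| 55 | 73 | 100.0 |
| 56 | 38 | 100.0 |
| 57 | 2 | 100.0 |
| 58 | 2 | 100.0 |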

#### Summary:

With the initial settings of eps=1.0, min_samples=4 (scikit-learn's own defaults are eps=0.5, min_samples=5):

Silhouette Score: -0.021417778489332146 with cache data.
229 clusters, + 32.83197940963565% were NOT categorized.
These clusters on average were 99.42746345442846 of their dominant kmeans cluster.

Manual gridsearch over eps_list = [1, 2, 3, 4, 5] and n_list = [2, 3, 4, 5, 10], 

then over eps_list = [5, 6, 8, 10], n_list = [2, 3, 4, 5, 6, 7, 8, 9, 10, 20]

Very impressed with the match with KMeans! For every parameter combination tested, each DBSCAN cluster consisted almost exclusively of a single KMeans label (the average dominant-label share ranged from 97.5% to 99.9%). Of course, many of these runs produce far more clusters than KMeans did, which means DBSCAN is further subdividing some of the KMeans clusters.

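Selected eps = 5 and min_samples = 2: in the first grid search this combination had the lowest percent noise (0.66%) and a silhouette score (0.424) close to that grid's maximum (0.434). Larger eps values in the second grid search reached silhouette scores up to 0.502.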

These clusters on average were 98.44860500842285 of their dominant kmeans cluster.
Silhouette Score: 0.42408914927203 with cache data.
58 clusters, + 0.659535108179844% were NOT categorized.

Note, this is better than the best silhouette score for kmeans, which was 0.335615866756178.

And the number of clusters is comparable to KMeans: 59 (labels 0 through 58; the 58 printed above is the highest label) vs 50.

| DBSCAN cluster | number of caches | homogeneity |
|--- |--- |--- |
| noise | 82 | 0.0 |
| 0 | 8683 | 33.46769549694806 |
| 1 | 70 | 100.0 |
| 2 | 159 | 100.0 |
| 3 | 65 | 100.0 |
| 4 | 131 | 100.0 |
| 5 | 2 | 100.0 |
| 6 | 350 | 100.0 |
| 7 | 152 | 100.0 |
| 8 | 87 | 100.0 |
| 9 | 50 | 100.0 |
| 10 | 48 | 100.0 |
| 11 | 62 | 100.0 |
| 12 | 3 | 100.0 |
| 13 | 9 | 100.0 |
| 14 | 56 | 100.0 |
| 15 | 108 | 100.0 |
| 16 | 2 | 100.0 |
| 17 | 78 | 100.0 |
| 18 | 5 | 100.0 |
| 19 | 85 | 100.0 |
| 20 | 3 | 100.0 |
| 21 | 60 | 100.0 |
| 22 | 4 | 100.0 |
| 23 | 3 | 100.0 |
| 24 | 50 | 100.0 |
| 25 | 195 | 100.0 |
| 26 | 14 | 100.0 |
| 27 | 4 | 75.0 |
| 28 | 65 | 100.0 |
| 29 | 2 | 100.0 |
| 30 | 58 | 100.0 |
| 31 | 2 | 100.0 |
| 32 | 2 | 100.0 |
| 33 | 88 | 100.0 |
| 34 | 49 | 100.0 |
| 35 | 2 | 100.0 |
| 36 | 52 | 100.0 |
| 37 | 59 | 100.0 |
| 38 | 331 | 100.0 |
| 39 | 46 | 100.0 |
| 40 | 4 | 100.0 |
| 41 | 179 | 100.0 |
| 42 | 103 | 100.0 |
| 43 | 15 | 100.0 |
| 44 | 114 | 100.0 |
| 45 | 102 | 100.0 |
| 46 | 168 | 100.0 |
| 47 | 62 | 100.0 |
| 48 | 48 | 100.0 |
| 49 | 52 | 100.0 |
| 50 | 25 | 100.0 |
| 51 | 60 | 100.0 |
| 52 | 2 | 100.0 |
| 53 | 6 | 100.0 |
| 54 | 2 | 100.0 |
| 55 | 73 | 100.0 |
| 56 | 38 | 100.0 |
| 57 | 2 | 100.0 |
| 58 | 2 | 100.0 |

The DBSCAN clusters map directly to single KMeans clusters, EXCEPT for DBSCAN clusters 0 and 27.

Several KMeans clusters have been split into 2 or more DBSCAN clusters.

DBSCAN cluster #27 is just 4 caches, and 3 of them have the same KMeans label.

The only real issue is with DBSCAN cluster #0, which includes 99.9% (2906 of 2908) of KMeans cluster 0. That was already the largest KMeans cluster, with 23.4% of the caches, and DBSCAN cluster #0 adds another 5777 caches (8683 - 2906) split among 8 other KMeans clusters. Altogether, DBSCAN cluster #0 holds 69.8% of the caches.

