# DBSCAN

We run DBSCAN clustering on every row of the data produced by _greg-features.ipynb_, but use only the columns retained after running _centroid-sampling.ipynb_ and _dim-red-via-correlation.ipynb_.

Reference material:

* [sklearn DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)
* [DBSCAN demo](https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html)

<hr>

In [1]:
%pylab inline
%autosave 25

import pandas as pd

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

Populating the interactive namespace from numpy and matplotlib


Autosaving every 25 seconds


In [2]:
data_loc = '../data/FDA-COVID19_files_v1.0/'

### Get the full data set

In [3]:
store = pd.HDFStore(data_loc + 'features.h5')
df_all = pd.DataFrame(store['df'])
print('rows: {:,}, columns: {:,}'.format(len(df_all), len(df_all.columns)))
store.close()

rows: 184,063, columns: 17,076


### Take a subset of the columns

Exclude the zero-variance and highly correlated columns identified by analysing the centroid-sampled data set. That data set already has those columns removed, so its column list serves as the filter.

In [4]:
store = pd.HDFStore(data_loc + 'sampled_data.h5')
df_sampled = pd.DataFrame(store['df'])
store.close()
print('rows: {:,}, columns: {:,}'.format(len(df_sampled), len(df_sampled.columns)))

rows: 22,172, columns: 14,730


### Original data with reduced dimensionality

In [5]:
df_features = df_all[df_sampled.columns]
del df_all
del df_sampled
print('rows: {:,}, columns: {:,}'.format(len(df_features), len(df_features.columns)))

rows: 184,063, columns: 14,730
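
The column filter itself is computed in the other notebooks and is not shown here. As a rough illustration of that kind of filter, here is a minimal sketch on toy data. The helper name `drop_constant_and_correlated` and the `0.95` correlation threshold are assumptions made for the example, not values taken from those notebooks.

```python
import numpy as np
import pandas as pd


def drop_constant_and_correlated(df, corr_threshold=0.95):
    """Drop zero-variance columns, then one column of each highly correlated pair.

    Hypothetical helper; the threshold is an illustrative assumption.
    """
    # A column with a single unique value has zero variance.
    df = df.loc[:, df.nunique() > 1]
    # Absolute pairwise correlations; the strict upper triangle visits each pair once.
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # Drop the later column of every pair whose correlation exceeds the threshold.
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return df.drop(columns=to_drop)


rng = np.random.default_rng(0)
toy = pd.DataFrame({'a': rng.normal(size=100)})
toy['b'] = toy['a'] * 2.0 + 1e-3 * rng.normal(size=100)  # nearly a rescaled copy of 'a'
toy['c'] = 1.0                                           # constant column
toy['d'] = rng.normal(size=100)                          # independent column

reduced = drop_constant_and_correlated(toy)
print(list(reduced.columns))
```

Computing a full correlation matrix over thousands of columns is expensive, which is presumably why the analysis was done on the smaller sampled data set rather than on all rows.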


In [None]:
# One-shot alternative, left disabled in favour of the chunked fit below.
#X = StandardScaler().fit_transform(df_features)

In [None]:
scaler = StandardScaler()

# Accumulate the per-column mean and variance over 100 row chunks.
for chunk in np.array_split(df_features.values, 100):
    scaler.partial_fit(chunk)
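
The chunked loop should give the same scaling statistics as a single `fit` call while only handing the scaler one chunk at a time. A small sanity check on toy data (the array shape and chunk count below are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy = rng.normal(loc=5.0, scale=2.0, size=(1000, 4))

# Reference: fit on the whole array at once.
full = StandardScaler().fit(toy)

# Incremental: update the running mean and variance chunk by chunk.
incremental = StandardScaler()
for chunk in np.array_split(toy, 10):
    incremental.partial_fit(chunk)

same_mean = np.allclose(full.mean_, incremental.mean_)
same_scale = np.allclose(full.scale_, incremental.scale_)
print(same_mean, same_scale)
```

Note that `scaler.transform` still materialises the full scaled array in one go, so the chunking only reduces memory pressure during fitting.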

In [None]:
X = scaler.transform(df_features)

In [None]:
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
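
The `eps` and `min_samples` values above appear to match the linked scikit-learn demo, which clusters two-dimensional data. Distances between standardised points grow with the number of columns, so it may be worth checking `eps` against the data before a full run. One common heuristic is to look at the sorted distance from each point to its k-th nearest neighbour, with k set to `min_samples`. The sketch below uses toy data; `k_distances` is a hypothetical helper, and on the real data it would most likely be applied to a row sample to keep the neighbour search affordable.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def k_distances(X, k):
    """Sorted distance from each row to its k-th nearest neighbour.

    The row itself counts as the first neighbour (distance 0), which matches
    how DBSCAN's min_samples counts the point itself.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])


rng = np.random.default_rng(0)
toy = rng.normal(size=(500, 50))   # toy stand-in for the scaled feature matrix

kd = k_distances(toy, k=10)
print('min: {:.2f}, median: {:.2f}, max: {:.2f}'.format(kd.min(), np.median(kd), kd.max()))
```

If `eps` is smaller than nearly all of these distances, DBSCAN will label almost every point as noise. A knee in the sorted curve is often used as a starting value for `eps`.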

In [None]:
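labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
# DBSCAN marks noise points with the label -1.
n_noise_ = list(labels).count(-1)

print('clusters: {:,}, noise points: {:,}'.format(n_clusters_, n_noise_))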
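
Beyond the two counts, the size of each cluster is usually the next thing to inspect. A minimal sketch, assuming `labels` is the integer array returned by DBSCAN; `summarise_labels` is a hypothetical helper, shown here on a hand-written label list:

```python
import numpy as np


def summarise_labels(labels):
    """Count clusters, noise points, and points per cluster (noise label is -1)."""
    labels = np.asarray(labels)
    clustered = labels[labels != -1]
    ids, sizes = np.unique(clustered, return_counts=True)
    return {
        'n_clusters': len(ids),
        'n_noise': int(np.sum(labels == -1)),
        'sizes': dict(zip(ids.tolist(), sizes.tolist())),
    }


summary = summarise_labels([0, 0, 1, -1, 1, 1, -1])
print(summary)
# {'n_clusters': 2, 'n_noise': 2, 'sizes': {0: 2, 1: 3}}
```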