# DBSCAN

We run DBSCAN clustering on every row of the data produced by _greg-features.ipynb_, but use only the columns retained after running _centroid-sampling.ipynb_ and _dim-red-via-correlation.ipynb_.

Reference material:

* [sklearn DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)
* [DBSCAN demo](https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html)
* [Let’s cluster data points using DBSCAN](https://medium.com/@agarwalvibhor84/lets-cluster-data-points-using-dbscan-278c5459bee5)

<hr>

In [1]:
%pylab inline
%autosave 25

import pandas as pd

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

Populating the interactive namespace from numpy and matplotlib


Autosaving every 25 seconds


### Data location

In [2]:
data_loc = '../data/FDA-COVID19_files_v1.0/'

### Get the full data set

In [3]:
store = pd.HDFStore(data_loc + 'features.h5')
df_all = pd.DataFrame(store['df'])
print('rows: {:,}, columns: {:,}'.format(len(df_all), len(df_all.columns)))
store.close()

rows: 184,063, columns: 17,076


### Take a subset of the columns

Exclude zero-variance and highly correlated columns, as found from analysing the centroid-based sampled data set.

In [4]:
store = pd.HDFStore(data_loc + 'sampled_data.h5')
meta = store.select('df', start=1, stop=1) # Grab only the column names (zero rows are read). Speeds things up.
cols = meta.columns
store.close()
print('columns: {:,}'.format(len(cols)))

columns: 14,730
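
The column filter itself lives in the other notebooks, so the following is only a minimal sketch of how a zero-variance and high-correlation filter could look on a toy frame. The function name `drop_redundant_columns`, the 0.95 threshold and the toy columns are assumptions for illustration, not values taken from _dim-red-via-correlation.ipynb_.

```python
import numpy as np
import pandas as pd


def drop_redundant_columns(df, corr_threshold=0.95):
    """Drop zero-variance columns, then one column of each highly correlated pair.

    Hypothetical helper: the threshold is an assumed value.
    """
    # Keep only columns whose variance is strictly positive.
    df = df.loc[:, df.var() > 0]

    # Absolute pairwise correlations, keeping only the upper triangle so that
    # each pair is inspected once and the first column of a pair survives.
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return df.drop(columns=to_drop)


# Toy data: 'b' is a multiple of 'a', 'c' is constant, 'd' is only weakly related to 'a'.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0],
    'c': [7.0] * 5,
    'd': [1.0, -1.0, 2.0, -2.0, 0.0],
})

reduced = drop_redundant_columns(toy)
print(reduced.columns.tolist())
```

On this toy frame the constant column `c` and the duplicate-information column `b` are removed, leaving `a` and `d`.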


### Original data with reduced dimensionality

In [5]:
df_features = df_all[cols]
del df_all
del cols
print('rows: {:,}, columns: {:,}'.format(len(df_features), len(df_features.columns)))

rows: 184,063, columns: 14,730


### Scaling

DBSCAN is distance-based, so the features need to be on comparable scales. Fingerprint and binding-site features are already scaled, so we leave those alone and standardize only the remaining features.

In [13]:
cols_to_exclude = ['cid', 'pid', 'activity']

# fingerprint columns
df = pd.read_csv(data_loc+'drug_features/fingerprints.csv', index_col=0, nrows=0)
cols_to_exclude = cols_to_exclude + df.columns.tolist()
del df

# binding-site columns
df = pd.read_csv(data_loc+'protein_features/binding_sites_v1.0.csv', index_col=0, nrows=0)
cols_to_exclude = cols_to_exclude + df.columns.tolist()
del df

# some columns have already been dropped in previous processing, e.g. removal of zero-variance columns
cols_to_exclude = df_features.columns.intersection(cols_to_exclude)

print('{:,} don\'t need to be scaled'.format(len(cols_to_exclude)))

cols_to_keep = df_features.drop(cols_to_exclude, axis=1).columns.tolist()

print('{:,} do need to be scaled'.format(len(cols_to_keep)))

12,405 don't need to be scaled
2,325 do need to be scaled


In [14]:
# scale only the cols_to_keep
scaler = StandardScaler()
features = df_features[cols_to_keep]
df_features[cols_to_keep] = scaler.fit_transform(features.values)
del features
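
As a small illustration of the selective scaling above, the sketch below standardizes one column of a toy frame and leaves the others untouched. The column names `fp_bit` and `mol_weight` are hypothetical stand-ins, not columns from the real data set.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame: an id column, an already-scaled binary feature, and a raw feature.
toy = pd.DataFrame({
    'cid': [1, 2, 3, 4],
    'fp_bit': [0, 1, 1, 0],                       # hypothetical fingerprint bit
    'mol_weight': [100.0, 200.0, 300.0, 400.0],   # hypothetical unscaled feature
})

cols_to_scale = ['mol_weight']

scaled = toy.copy()
# StandardScaler subtracts the column mean and divides by the population
# standard deviation (ddof=0), column by column.
scaled[cols_to_scale] = StandardScaler().fit_transform(toy[cols_to_scale].values)

print(scaled)
```

After the transform the scaled column has mean 0 and unit (population) standard deviation, while `cid` and `fp_bit` keep their original values.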

In [19]:
# save the scaled features to file to avoid re-running this slow operation
store = pd.HDFStore(data_loc + 'dbscan_scaled_data.h5')
store['df'] = df_features
store.close()

### Load pre-scaled data

In [6]:
store = pd.HDFStore(data_loc + 'dbscan_scaled_data.h5')
df_features = pd.DataFrame(store['df'])
print('rows: {:,}, columns: {:,}'.format(len(df_features), len(df_features.columns)))
store.close()

rows: 184,063, columns: 14,730


### Cluster

In [7]:
X = df_features.drop(columns=['cid', 'pid', 'activity'])

In [None]:
db = DBSCAN(eps=0.8, min_samples=10).fit(X.values)
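
The `eps` and `min_samples` values above are the notebook's choices. One common heuristic for sanity-checking `eps`, which is not part of the original analysis, is to look at the sorted distance from each point to its k-th nearest neighbour, with k equal to `min_samples`. The sketch below assumes a small random toy array; `sorted_k_distances` is a hypothetical helper name.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def sorted_k_distances(X, k):
    """Return the sorted distance from each point to its k-th neighbour.

    The query points are the training points themselves, so each point is
    returned as its own first neighbour (distance 0). This matches DBSCAN's
    min_samples, which also counts the point itself.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])


# Hypothetical toy data, only to show the shape of the result.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))

k_dist = sorted_k_distances(X_toy, k=10)
print(k_dist[:5], k_dist[-5:])
```

A point is a core point for a given `eps` exactly when its value in this array is at most `eps`, so plotting `k_dist` gives a rough view of how many core points a candidate `eps` would produce. On the full data set this would likely need to be run on a sample, since the neighbour search is expensive.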

In [None]:
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

In [None]:
n_clusters_

In [None]:
n_noise_
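
To make the label bookkeeping above concrete, here is a minimal sketch on hand-built toy data, using the same `eps` and `min_samples` as the cell above. The toy points are assumptions chosen so that the outcome is easy to verify by hand.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight 4x3 grids of points (12 points each, spacing 0.1) and one outlier.
grid = np.array([[i * 0.1, j * 0.1] for i in range(4) for j in range(3)])
X_toy = np.vstack([grid, grid + 10.0, [[50.0, 50.0]]])

toy_labels = DBSCAN(eps=0.8, min_samples=10).fit(X_toy).labels_

# DBSCAN marks noise points with the label -1, so it is excluded from the count.
n_toy_clusters = len(set(toy_labels)) - (1 if -1 in toy_labels else 0)
n_toy_noise = int(np.sum(toy_labels == -1))

print(n_toy_clusters, n_toy_noise)
```

Every point within a grid lies within 0.8 of the other 11, so all of them are core points and each grid forms one cluster; the isolated point has no neighbours besides itself and is labelled as noise. The result is two clusters and one noise point.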