# HDBSCAN clustering result analysis

Let's start by loading up some libraries and static data that may be useful in the next steps.

In [None]:
%load_ext autoreload
%autoreload 2

from collections import defaultdict, Counter
from utilities import constants
import plotly.offline as ply
import pandas as pd
import numpy as np
import json
import os

In [None]:
config = json.load(open('config.json', 'r'))
uuids_family = json.load(open(os.path.join(constants.dir_d, constants.json_labels), 'r'))
words = json.load(open(os.path.join(constants.dir_d, constants.json_words), 'r'))
ply.init_notebook_mode(connected=True)

## Data selection

We first select a subset of the original dataset. The selected subset is then split into a training set and a test set.


In [None]:
from preprocessing import pp_action

In [None]:
samples_data = pp_action.pre_process(config)
pp_action.split_show_data(samples_data)

## Dimensionality Reduction

Currently each data vector has approximately 300,000 components. High-dimensional feature vectors usually create problems during the clustering phase.

Therefore, before clustering the data, we reduce the dimensionality of the dataset.

In this case we will use Principal Component Analysis (PCA) to transform our feature vectors into a new dataset with fewer dimensions.
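
As a rough illustration of what a PCA reduction does, here is a minimal NumPy sketch on random toy data. It is not the implementation behind `dr_pca.reduce`, which is not shown in this notebook; the helper `pca_reduce` and the `*_toy` variables are hypothetical names used only for this example.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto their top principal components."""
    mean = X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal directions,
    # ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:n_components]          # (n_components, n_features)
    reduced = (X - mean) @ components.T     # (n_samples, n_components)
    return reduced, components, mean

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 20))           # toy stand-in for the feature matrix
reduced_toy, components_toy, mean_toy = pca_reduce(X_toy, 5)
print(reduced_toy.shape)       # (50, 5)
print(components_toy.shape)    # (5, 20)
```

The components matrix returned here has one row per retained component, which is the same layout the later cells assume for `dr_model.components_`.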

In [None]:
from dimensionality_reduction import dr_pca

In [None]:
uuids = samples_data.index[samples_data['selected'] == 1].tolist()
reduced, dr_model = dr_pca.reduce(config, uuids, 100)

# If you have already computed the PCA, load it from disk instead (requires `import joblib`)
# dr_model = joblib.load(os.path.join(constants.dir_d, constants.dir_mod, 'pca_X_X.pkl'))
# reduced = np.loadtxt('matrix_file')

## Clustering

Once the data dimensionality has been reduced, we can proceed with clustering.

Here we will use HDBSCAN, a hierarchical density-based clustering algorithm.
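
Density-based algorithms such as HDBSCAN may leave some points unassigned. Common HDBSCAN implementations mark these noise points with the label `-1`; we assume the labels returned by `clu_hdbscan.cluster` follow the same convention, although that wrapper is not shown here. A small sketch for summarizing a label array under that assumption (the helper `summarize_labels` and the toy labels are made up for illustration):

```python
import numpy as np

def summarize_labels(labels):
    """Return (number of clusters, fraction of points labelled as noise)."""
    labels = np.asarray(labels)
    # -1 is assumed to be the noise label, so it is not counted as a cluster
    n_clusters = len(set(labels.tolist()) - {-1})
    noise_fraction = float(np.mean(labels == -1))
    return n_clusters, noise_fraction

labels_toy = [0, 0, 1, -1, 1, 2, -1, 0]     # made-up labels for illustration
n_clusters, noise_fraction = summarize_labels(labels_toy)
print(n_clusters, noise_fraction)           # 3 0.25
```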

In [None]:
from clustering import clu_hdbscan

In [None]:
labels_num = samples_data.fam_num[samples_data['selected'] == 1].tolist()
clustering, clu_model = clu_hdbscan.cluster(config, 'c', uuids, labels_num, sparse=False)

## Cluster Analysis

To better understand the result of the clustering algorithm we would like to see the features characterizing the computed clusters. 

Since the dataset dimensionality was reduced with PCA before clustering, we need to reverse this step to understand the characteristics of the obtained clusters.

To achieve this, we will compute the centroid of each cluster as the average of its data points in PCA space and then multiply the centroids by the PCA components matrix.

We will start by creating an inverted index of the clustering.

In [None]:
inverted_clustering = defaultdict(list)
for i in range(len(uuids)):
    inverted_clustering[clustering[i]].append(uuids[i])

Using Pandas we can construct a dataframe representing our reduced data matrix with dimensions $ ( n\_samples \times n\_pca\_components) $

In [None]:
reduced_df = pd.DataFrame(reduced, index=uuids)

To compute the centroids we will just average the values of the PCA-reduced features of each cluster.
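
For reference, the same per-cluster averaging can be written with boolean masks in NumPy. This is only a sketch on toy data; `cluster_centroids` and the `*_toy` arrays are hypothetical names and are not part of the project code.

```python
import numpy as np

def cluster_centroids(points, labels):
    """Mean vector per cluster label, with rows ordered by sorted label."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    unique_labels = np.unique(labels)       # sorted; includes -1 if present
    centroids = np.vstack(
        [points[labels == lab].mean(axis=0) for lab in unique_labels]
    )
    return unique_labels, centroids

points_toy = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 0.0], [5.0, 5.0]])
labels_toy = np.array([0, 0, 1, -1])
unique_toy, centroids_toy = cluster_centroids(points_toy, labels_toy)
print(unique_toy.tolist())       # [-1, 0, 1]
print(centroids_toy.tolist())    # [[5.0, 5.0], [1.0, 1.0], [10.0, 0.0]]
```

Returning the sorted labels alongside the matrix makes it explicit which row belongs to which cluster, including the noise label if it occurs.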

In [None]:
centroids = {label : np.zeros(len(reduced[0])) for label in sorted(set(clustering))}

i = 0
for index, vector in reduced_df.iterrows():
    centroids[clustering[i]] += vector.values
    i += 1

centroid_matrix = []
for centroid in sorted(centroids.keys()):
    centroids[centroid] /= len(inverted_clustering[centroid])
    centroid_matrix.append(centroids[centroid])
    
centroid_matrix = np.array(centroid_matrix)

Once we have the centroid matrix in the PCA space, we can bring it back to its original dimensions by multiplying it with the PCA components matrix.

This will result in a $ ( n\_centroids \times n\_original\_features ) $ matrix.
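
The sketch below checks this reasoning on toy data, assuming `dr_model` behaves like a standard PCA with a `(n_components, n_features)` components matrix and a stored data mean (an assumption, since `dr_pca.reduce` is not shown). Multiplying by the components alone gives the centroid's offset from the data mean; adding the mean gives a point in the original feature space. All `*_toy` names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(30, 8))                 # toy data: 30 samples, 8 features
mean_toy = X_toy.mean(axis=0)
_, _, Vt = np.linalg.svd(X_toy - mean_toy, full_matrices=False)
components_toy = Vt[:3]                          # (3, 8), plays the role of components_
Z_toy = (X_toy - mean_toy) @ components_toy.T    # PCA-space coordinates, (30, 3)

members = np.arange(10)                          # pretend the first 10 samples form a cluster
centroid_pca = Z_toy[members].mean(axis=0)

# Multiplying by the components yields the offset from the data mean;
# adding the mean yields a point in the original feature space.
offset_from_mean = centroid_pca @ components_toy
centroid_orig = offset_from_mean + mean_toy

# The mapping is linear, so this equals the average of the reconstructed members.
reconstructed = Z_toy[members] @ components_toy + mean_toy
print(np.allclose(centroid_orig, reconstructed.mean(axis=0)))   # True
```

Whether to add the mean depends on the goal: the offset alone highlights how a cluster deviates from the average sample, while the mean-shifted vector approximates the cluster's average feature values.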

In [None]:
centroids_orig_fts = np.dot(centroid_matrix, dr_model.components_)
centroids_orig_fts.shape

Once in the original dimension space, we can identify the ten most influential words for each cluster.

In [None]:
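# Feature names in column order; stored in a new variable so the cell can be re-run safely
feature_names = sorted(words.keys())

# Rows of centroids_orig_fts follow the sorted cluster labels, so pair them explicitly
# instead of assuming the first label is -1 (the noise label)
for label, centroid in zip(sorted(centroids.keys()), centroids_orig_fts):
    cent_series = pd.Series(np.abs(centroid), index=feature_names)

    print('Centroid {}:'.format(label))
    print(cent_series.nlargest(10))
    print()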

It may also be interesting to see which of the initial malware families compose each cluster.

In [None]:
clust_compositions = {i: Counter() for i in sorted(set(clustering.flatten()))}

for i in range(len(uuids)):
    clust_compositions[clustering[i]][uuids_family[uuids[i]]] += 1

for clu in sorted(clust_compositions.keys()):
    print('Cluster {}:'.format(clu))
    print(clust_compositions[clu].most_common())
    print()

## Cluster Visualization

We can also generate a visual output from our clustering. 

Let's start by visualizing the original dataset. Since the ~300,000 original features would not allow us to plot the data, we will use a 2-dimensional t-SNE-reduced version of our feature vectors.

The color of each data point will be defined by the AV label extracted from VirusTotal using AVClass.

In [None]:
from visualization import vis_data, vis_cluster

In [None]:
families = samples_data.family[samples_data['selected'] == 1].tolist()
vis_data.plot_data('data/d_matrices/tsne_2_all.txt', families)

In [None]:
vis_data.plot_data('data/d_matrices/tsne_3_1209.txt', families)

As we can observe, the 3D representation generated through t-SNE does not provide a clear view of the data. We have more success with a 3D representation obtained using PCA for dimensionality reduction.

In [None]:
vis_data.plot_data('data/d_matrices/pca_3_1209.txt', families)

Now we can compare the classification provided by the AV data with the result of our clustering, plotted over the same dimensionality-reduced data points.

Here, the color of each point will reflect the cluster to which it was assigned by the algorithm.

In [None]:
clustering.shape

In [None]:
vis_data.plot_data('data/d_matrices/tsne_2_all.txt', clustering)

In [None]:
vis_data.plot_data('data/d_matrices/pca_3_1209.txt', clustering)