# HDBSCAN clustering result analysis

Let's start by loading up some libraries and static data that may be useful in the next steps.

In [1]:
%load_ext autoreload
%autoreload 2

from visualization import vis_data, vis_cluster
from collections import defaultdict, Counter
from dimensionality_reduction import dr_pca
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from preprocessing import pp_action
from clustering import clu_hdbscan
from utilities import constants
import plotly.offline as ply
import pandas as pd
import numpy as np
import json
import os

In [2]:
with open('config.json', 'r') as f:
    config = json.load(f)
with open(os.path.join(constants.dir_d, constants.json_labels), 'r') as f:
    uuids_family = json.load(f)
with open(os.path.join(constants.dir_d, constants.json_words), 'r') as f:
    words = json.load(f)
ply.init_notebook_mode(connected=True)

## Data selection

We first select a subset of the original dataset to work on. The selected subset is then split into a training portion and held-out (dev/test) portions.


In [3]:
samples_data = pp_action.pre_process(config)
pp_action.split_show_data(samples_data)

Please choose the subset of data to workon on:
l for all labeled samples
k for samples of families mydoom, gepys, lamer, neshta, bladabindi, flystudio, eorezo
s for 8 samples of families mydoom, gepys, bladabindi, flystudio
f for a single family
b for a balanced subset of samples
q to quit
k

846 train samples belonging to 7 malware families
Malware family:      bladabindi      Number of samples:  199  
Malware family:        eorezo        Number of samples:  191  
Malware family:        neshta        Number of samples:  150  
Malware family:        mydoom        Number of samples:  104  
Malware family:        lamer         Number of samples:   89  
Malware family:      flystudio       Number of samples:   60  
Malware family:        gepys         Number of samples:   53  

182 dev samples belonging to 7 malware families
Malware family:      bladabindi      Number of samples:   42  
Malware family:        neshta        Number of samples:   36  
Malware family:        eorezo        Num

## Dimensionality Reduction

Currently each data vector has approximately 300,000 components. Such high-dimensional feature vectors usually cause problems during clustering, because distances between points become less informative as the number of dimensions grows.

Therefore, before clustering the data, we reduce the dimensionality of the dataset.

In this case we will use Principal Component Analysis (PCA) to transform our feature vectors into a new, lower-dimensional dataset.

In [7]:
uuids = samples_data.index[samples_data['selected'] == 1].tolist()

In [None]:
reduced, dr_model = dr_pca.reduce(config, uuids, 128)

In [8]:
# If you had already computed PCA, load it from the disk instead
dr_model = joblib.load(os.path.join(constants.dir_d, constants.dir_mod, 'pca_128_1209.pkl')) 
reduced = np.loadtxt('data/d_matrices/pca_128_1209.txt')
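
The internals of `dr_pca.reduce` are not shown in this notebook. As a minimal sketch of an equivalent step, assuming the helper wraps scikit-learn's `PCA`, the reduction could look like the following. The random matrix, the small sizes and the `toy_` names are placeholders, not the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data: 50 samples with 200 features
# (the real matrix has 1209 samples and roughly 300,000 features)
rng = np.random.RandomState(0)
toy_data = rng.rand(50, 200)

# Assumption: the project helper fits a PCA and returns both the
# projected data and the fitted model
toy_model = PCA(n_components=8, random_state=0)
toy_reduced = toy_model.fit_transform(toy_data)

print(toy_reduced.shape)            # (n_samples, n_components)
print(toy_model.components_.shape)  # (n_components, n_original_features)
```

The fitted model is worth keeping, since its `components_` matrix is what allows mapping results back to the original features later on.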

## Clustering

Once the data dimensionality has been reduced we can proceed with clustering. 

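Here we will use HDBSCAN, a hierarchical density-based clustering algorithm. HDBSCAN assigns the label -1 to the points it considers noise, so that label does not identify a proper cluster.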

In [4]:
uuids = samples_data.index[samples_data['selected'] == 1].tolist()
labels_num = samples_data.fam_num[samples_data['selected'] == 1].tolist()

clustering, clu_model = clu_hdbscan.cluster(config, 'c', uuids, labels_num, sparse=False)

Please select the desired dimensionality reduced dataset (q to quit)
data/d_matrices/pca_128_1209.txt
Perform clustering with cosine distance
(1209, 1209)
--------------------------------------------------------------------------------
Clustering evaluation
Number of clusters 6
Number of distinct families 7
Adjusted Rand index: 0.593707595098
Adjusted Mutual Information: 0.722010017203
Fowlkes-Mallows: 0.660816849808
Homogeneity: 0.741104177426
Completeness: 0.724229988206
BCubed Precision: 0.711353670031
BCubed Recall: 0.754348138067
BCubed FScore: 0.732220310475
Silhouette 0.690010500691
--------------------------------------------------------------------------------
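
The scores above are printed by the project's `clu_hdbscan.cluster` helper. As a rough illustration of what they measure, most of them are available in `sklearn.metrics`, and BCubed precision and recall can be computed in a few lines. The toy labels and the `bcubed` helper below are hypothetical and only serve the example; how the project handles noise points in these scores is not shown here.

```python
import numpy as np
from sklearn import metrics

# Toy ground-truth families and predicted clusters (hypothetical, not from the dataset)
truth = np.array([0, 0, 0, 1, 1, 1])
pred = np.array([0, 0, 1, 1, 1, 1])

print('Adjusted Rand index:', metrics.adjusted_rand_score(truth, pred))
print('Adjusted Mutual Information:', metrics.adjusted_mutual_info_score(truth, pred))
print('Fowlkes-Mallows:', metrics.fowlkes_mallows_score(truth, pred))
print('Homogeneity:', metrics.homogeneity_score(truth, pred))
print('Completeness:', metrics.completeness_score(truth, pred))


def bcubed(truth, pred):
    """Return BCubed (precision, recall, F-score) averaged over all items."""
    same_class = truth[:, None] == truth[None, :]
    same_cluster = pred[:, None] == pred[None, :]
    # For each item: how many items share both its class and its cluster
    both = (same_class & same_cluster).sum(axis=1)
    precision = np.mean(both / same_cluster.sum(axis=1))
    recall = np.mean(both / same_class.sum(axis=1))
    return precision, recall, 2 * precision * recall / (precision + recall)


bc_precision, bc_recall, bc_fscore = bcubed(truth, pred)
print('BCubed Precision:', bc_precision)
print('BCubed Recall:', bc_recall)
print('BCubed FScore:', bc_fscore)
```

Homogeneity and BCubed precision reward clusters that contain a single family, while completeness and BCubed recall reward families that end up in a single cluster.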


## Cluster Analysis

To better understand the result of the clustering algorithm, we would like to see which features characterize the computed clusters.

Since the dataset dimensionality was reduced with PCA before clustering, we need to reverse this step to interpret the clusters in terms of the original features.

To achieve this we will compute each centroid as the average of the data points in its cluster and then multiply it by the PCA components matrix, which maps vectors from the reduced space back to the original feature space (up to the mean that PCA subtracts before projecting).

We will start by creating an inverted index of the clustering.

In [5]:
inverted_clustering = defaultdict(list)
for i in range(len(uuids)):
    inverted_clustering[clustering[i]].append(uuids[i])

Using Pandas we can construct a dataframe representing our reduced data matrix with dimensions $ ( n\_samples \times n\_pca\_components) $

In [9]:
reduced_df = pd.DataFrame(reduced, index=uuids)

To compute the centroids we will just average the values of the PCA-reduced features of each cluster.

In [10]:
centroids = {label : np.zeros(len(reduced[0])) for label in sorted(set(clustering))}

i = 0
for index, vector in reduced_df.iterrows():
    centroids[clustering[i]] += vector.values
    i += 1

centroid_matrix = []
for centroid in sorted(centroids.keys()):
    centroids[centroid] /= len(inverted_clustering[centroid])
    centroid_matrix.append(centroids[centroid])
    
centroid_matrix = np.array(centroid_matrix)
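
The same centroids can also be obtained without an explicit loop. A minimal sketch on toy data, assuming the cluster labels are aligned row by row with the reduced matrix (the `toy_` names are placeholders for `reduced` and `clustering`):

```python
import numpy as np
import pandas as pd

# Toy stand-ins: 6 points in 3 dimensions, labels including noise (-1)
toy_reduced = np.arange(18, dtype=float).reshape(6, 3)
toy_clustering = np.array([-1, 0, 0, 1, 1, 1])

# Group rows by cluster label and average them; groupby sorts the labels,
# so the row order matches sorted(set(clustering)) used in the loop above
toy_centroids = pd.DataFrame(toy_reduced).groupby(toy_clustering).mean()
toy_centroid_matrix = toy_centroids.to_numpy()

print(toy_centroids)
```

The index of `toy_centroids` holds the cluster labels, which avoids keeping a separate counter to know which row belongs to which cluster.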

Once we have the centroid matrix in the PCA space, we can bring it back to its original dimensions by multiplying it with the PCA components matrix.

This will result in a $ ( n\_centroids \times n\_original\_features ) $ matrix.

In [11]:
centroids_orig_fts = np.dot(centroid_matrix, dr_model.components_)
centroids_orig_fts.shape

(7, 297360)
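
Assuming `dr_model` is a scikit-learn `PCA` object, multiplying by `components_` gives the back-projection without the per-feature mean that PCA subtracts before projecting; `inverse_transform` adds that mean back. The sketch below checks this relationship on random placeholder data (all `toy_` names are hypothetical).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
toy_data = rng.rand(40, 30)   # placeholder for the original feature matrix

toy_pca = PCA(n_components=5).fit(toy_data)
toy_points = toy_pca.transform(toy_data[:3])   # three points in PCA space

# Back-projection as done above: centred coordinates in the original space
back_centred = np.dot(toy_points, toy_pca.components_)

# scikit-learn's inverse_transform additionally restores the feature means
back_full = toy_pca.inverse_transform(toy_points)

print(np.allclose(back_full, back_centred + toy_pca.mean_))
```

Working with the centred version means the values describe how each centroid deviates from the average sample, rather than absolute feature values.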

Once back in the original feature space, we can identify the ten most influential words for each cluster, ranked by the absolute value of their weight.

In [12]:
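# Feature columns are assumed to follow the alphabetically sorted vocabulary
vocabulary = sorted(words.keys())

# Rows of centroids_orig_fts follow the sorted cluster labels (-1 is HDBSCAN noise)
for label, centroid in zip(sorted(centroids.keys()), centroids_orig_fts):
    cent_series = pd.Series(np.abs(centroid), index=vocabulary)

    print('Centroid {}:'.format(label))
    print(cent_series.nlargest(10))
    print()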

Centroid -1:
STEGMAN            10.280565
ISMOOTH            10.198220
NONSPACE           10.189892
AFBEELDINGEN        9.974334
ADVERTS             9.706033
ARTAPPLICATIONS     9.685721
RESIMLERI           9.635831
RESIMLER            9.461271
CHATRIN             9.347339
CSELECTS            9.264268
dtype: float64

Centroid 0:
PYROELECTRICITY     29.610894
HOVERCART           28.865509
LIMITATIONSA        28.331277
INHERENTLY          28.261260
SELFDEFENSE         28.167152
EXCHANGING          28.150867
WEBMISTRESS         28.117479
STACKOVERFLOWCOM    28.083772
PYROELECTRIC        28.038404
HEZBOLLAH           27.916700
dtype: float64

Centroid 1:
CONTROVERSIAL    29.868688
PHILOSOPHIC      29.742619
STAPHYLOCOCCI    29.468225
THERMODYNAMIC    29.229050
ARGENTINIAN      29.090548
ROWHORIZONTAL    29.080221
CHOREOGRAPH      29.071045
CONTROVERSIA     29.013478
THERMOMETER      28.842510
PACKEDARRAY      28.774394
dtype: float64

Centroid 2:
SOUKS       27.958196
RENGIERS    26.504072

It may also be interesting to see which of the initial malware families compose each cluster.

In [13]:
clust_compositions = {i: Counter() for i in sorted(set(clustering.flatten()))}

for i in range(len(uuids)):
    clust_compositions[clustering[i]][uuids_family[uuids[i]]] += 1

for clu in sorted(clust_compositions.keys()):
    print('Cluster {}:'.format(clu))
    print(clust_compositions[clu].most_common())
    print()

Cluster -1:
[('eorezo', 79), ('bladabindi', 35), ('flystudio', 6), ('neshta', 2), ('lamer', 1)]

Cluster 0:
[('mydoom', 146)]

Cluster 1:
[('lamer', 131), ('eorezo', 6)]

Cluster 2:
[('neshta', 222), ('flystudio', 1)]

Cluster 3:
[('eorezo', 108)]

Cluster 4:
[('flystudio', 79), ('gepys', 76), ('eorezo', 73), ('bladabindi', 53), ('neshta', 6), ('lamer', 4)]

Cluster 5:
[('bladabindi', 181)]
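
The same breakdown can also be displayed as a table. A small sketch using `pandas.crosstab` on made-up labels (the toy lists are placeholders for `clustering` and the per-sample family names):

```python
import pandas as pd

# Hypothetical cluster assignments and family names, aligned by position
toy_clusters = [-1, 0, 0, 1, 1, 1]
toy_families = ['eorezo', 'mydoom', 'mydoom', 'lamer', 'lamer', 'eorezo']

# Rows are clusters, columns are families, cells are sample counts
composition = pd.crosstab(
    pd.Series(toy_clusters, name='cluster'),
    pd.Series(toy_families, name='family'),
)
print(composition)
```

A table like this makes it easy to spot both mixed clusters (rows with several non-zero cells) and families split across clusters (columns with several non-zero cells).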



## Cluster Visualization

We can also generate a visual output from our clustering. 

Let's start by visualizing the original dataset. Since the ~300,000 original features cannot be plotted directly, we will use a 2-dimensional t-SNE reduction of our feature vectors.

The color of each data point is defined by the AV label extracted from VirusTotal using AVClass.

In [15]:
families = samples_data.family[samples_data['selected'] == 1].tolist()
vis_data.plot_data('data/d_matrices/tsne_2_1209.txt', families)

Number of labels: 7
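
The 2-dimensional matrix loaded above was computed beforehand. A minimal sketch of how such an embedding could be produced with scikit-learn's `TSNE`, assuming it is applied to an already reduced matrix (the notebook does not say which input was used) and using random placeholder data:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
toy_reduced = rng.rand(60, 16)   # placeholder for the reduced feature matrix

# perplexity must be smaller than the number of samples
toy_tsne = TSNE(n_components=2, perplexity=10, random_state=0)
toy_embedding = toy_tsne.fit_transform(toy_reduced)

print(toy_embedding.shape)
```

Each row of the embedding gives the plot coordinates of one sample, so the array can be saved with `np.savetxt` and colored by either family or cluster label.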


Now we can compare the classification provided by the AV data with the result of our clustering, plotted over the same dimensionality-reduced data points.

Here, the color of each point reflects the cluster to which it was assigned by the algorithm.

In [16]:
vis_data.plot_data('data/d_matrices/tsne_2_1209.txt', clustering)

Number of labels: 7


We can repeat the same comparison with a 3-dimensional representation of the dataset. Since in this case t-SNE generated a representation that is quite difficult to explore visually, we will use PCA to reduce the dimensions of our vectors.

In [17]:
vis_data.plot_data('data/d_matrices/pca_3_1209.txt', families)

Number of labels: 7


In [18]:
vis_data.plot_data('data/d_matrices/pca_3_1209.txt', clustering)

Number of labels: 7
