# HDBSCAN clustering result analysis

Let's start by loading the libraries and static data used in the following steps.

In [15]:
%load_ext autoreload
%autoreload 2

from collections import defaultdict, Counter
from utilities import constants
import plotly.offline as ply
import pandas as pd
import numpy as np
import json
import os

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
# Use context managers so the JSON files are closed after loading
with open('config.json', 'r') as f:
    config = json.load(f)
with open(os.path.join(constants.dir_d, constants.json_labels), 'r') as f:
    uuids_family = json.load(f)
with open(os.path.join(constants.dir_d, constants.json_words), 'r') as f:
    words = json.load(f)
ply.init_notebook_mode(connected=True)

## Data selection

Select a subset of the original dataset. The selected subset is then split into a training set and a test set.


In [5]:
from preprocessing import pp_action

In [6]:
samples_data = pp_action.pre_process(config)
pp_action.split_show_data(samples_data)

Please choose the subset of data to workon on:
l for all labeled samples
k for samples of families mydoom, gepys, lamer, neshta, bladabindi, flystudio, eorezo
s for 8 samples of families mydoom, gepys, bladabindi, flystudio
f for a single family
b for a balanced subset of samples
q to quit
k

967 train samples belonging to 7 malware families
Malware family:        eorezo        Number of samples:  219  
Malware family:      bladabindi      Number of samples:  218  
Malware family:        neshta        Number of samples:  184  
Malware family:        mydoom        Number of samples:  116  
Malware family:        lamer         Number of samples:  101  
Malware family:      flystudio       Number of samples:   70  
Malware family:        gepys         Number of samples:   59  

242 test samples belonging to 7 malware families
Malware family:      bladabindi      Number of samples:   51  
Malware family:        eorezo        Number of samples:   47  
Malware family:        neshta        Nu
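
The counts above (967 training and 242 test samples out of 1209) are consistent with a roughly 80/20 split. How `pp_action.split_show_data` performs the split is not shown here, so the following is only a minimal sketch of a random split, with a hypothetical `random_split` helper and placeholder identifiers.

```python
import numpy as np

def random_split(ids, test_fraction=0.2, seed=0):
    """Shuffle ids and split them into (train, test) lists."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(len(ids))
    n_test = int(round(len(ids) * test_fraction))
    test = [ids[i] for i in shuffled[:n_test]]
    train = [ids[i] for i in shuffled[n_test:]]
    return train, test

# Placeholder identifiers, not the real sample uuids
sample_ids = ['sample_{}'.format(i) for i in range(1209)]
train_ids, test_ids = random_split(sample_ids)
print(len(train_ids), len(test_ids))
```

With 1209 identifiers and a test fraction of 0.2, this produces 967 training and 242 test identifiers.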

## Dimensionality Reduction

Currently each data vector has approximately 300,000 components (297,360 in the run below). High-dimensional feature vectors usually cause problems during clustering, because distances between points become less informative as the number of dimensions grows.

Therefore, before clustering the data, we reduce the dimensionality of the dataset.

In this case we will use Principal Component Analysis (PCA) to transform our feature vectors into a new, lower-dimensional dataset.

In [8]:
from dimensionality_reduction import dr_pca

In [9]:
uuids = samples_data.index[samples_data['selected'] == 1].tolist()
reduced, dr_model = dr_pca.reduce(config, uuids, 100)

# If PCA has already been computed, load it from disk instead
# dr_model = joblib.load(os.path.join(constants.dir_d, constants.dir_mod, 'pca_X_X.pkl'))
# reduced = np.loadtxt('matrix_file')

Performing dimensionality reduction using PCA
Processing documents from 0 to 299
Loading Tf-Idf of 300 documents
(300, 297360)
Processing documents from 300 to 599
Loading Tf-Idf of 300 documents
(300, 297360)
Processing documents from 600 to 899
Loading Tf-Idf of 300 documents
(300, 297360)
Processing documents from 900 to 1199
Loading Tf-Idf of 300 documents
(300, 297360)
Processing documents from 1200 to 1499
Loading Tf-Idf of 9 documents
(9, 297360)
Explained Variance Ratio
0.827713758758
Transforming documents from 0 to 299
Loading Tf-Idf of 300 documents
(300, 297360)
Transforming documents from 300 to 599
Loading Tf-Idf of 300 documents
(300, 297360)
Transforming documents from 600 to 899
Loading Tf-Idf of 300 documents
(300, 297360)
Transforming documents from 900 to 1199
Loading Tf-Idf of 300 documents
(300, 297360)
Transforming documents from 1200 to 1499
Loading Tf-Idf of 9 documents
(9, 297360)
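
The log above shows the data being processed in batches of 300 documents, which suggests an incremental PCA variant. The internals of `dr_pca.reduce` are not shown, so the following is only a sketch of plain PCA on a small random matrix, using NumPy's SVD, to illustrate what the reduced matrix and the explained variance ratio represent. All names in it are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(50, 20)                      # toy stand-in for the Tf-Idf matrix

n_components = 5
mean = X.mean(axis=0)
X_centered = X - mean

# Rows of Vt are the principal directions, ordered by decreasing variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:n_components]            # (n_components, n_features)
X_reduced = X_centered.dot(components.T)  # (n_samples, n_components)

# Fraction of the total variance kept by the retained components
explained = (S[:n_components] ** 2).sum() / (S ** 2).sum()
print(X_reduced.shape, explained)
```

In the run above the same quantity is reported as 0.8277 for 100 components, meaning that about 83% of the variance of the original features is retained.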


## Clustering

Once the dimensionality of the data has been reduced, we can proceed with clustering.

Here we will use HDBSCAN, a hierarchical density-based clustering algorithm.
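
The clustering run in this section reports a cosine distance and a (1209, 1209) shape, which suggests that a pairwise distance matrix is computed before clustering. How `clu_hdbscan.cluster` does this is not shown. As a sketch under that assumption, a cosine distance matrix can be built with NumPy as follows (`cosine_distance_matrix` is a hypothetical helper).

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise cosine distances between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid dividing all-zero rows by zero
    unit = X / norms
    similarity = np.clip(unit.dot(unit.T), -1.0, 1.0)
    return 1.0 - similarity

rng = np.random.RandomState(0)
toy = rng.randn(6, 4)                # toy stand-in for the PCA-reduced matrix
distances = cosine_distance_matrix(toy)
print(distances.shape)
```

A matrix of this kind can be passed to the `hdbscan` library by constructing the clusterer with `metric='precomputed'`. Whether this project does so is an assumption.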

In [12]:
from clustering import clu_hdbscan

In [13]:
labels_num = samples_data.fam_num[samples_data['selected'] == 1].tolist()
clustering, clu_model = clu_hdbscan.cluster(config, 'c', uuids, labels_num, sparse=False)

Please select the desired training matrix file (q to quit)
data/d_matrices/pca_100_1209.txt
Perform clustering with cosine distance
(1209, 1209)
--------------------------------------------------------------------------------
Clustering evaluation
Number of clusters 5
Number of distinct families 7
Adjusted Rand index: 0.569993767111
Adjusted Mutual Information: 0.679512637141
Fowlkes-Mallows: 0.646542604906
Homogeneity: 0.681691410371
Completeness: 0.724663631543
BCubed Precision: 0.723173231952
BCubed Recall: 0.70001306948
BCubed FScore: 0.711404702751
Silhouette 0.631992203581
--------------------------------------------------------------------------------
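
Most of the scores above are standard clustering metrics, but BCubed is less commonly available in libraries. The sketch below implements the usual per-sample definition (precision measured within each sample's cluster, recall within each sample's family) and applies it to the per-cluster family counts printed in the Cluster Analysis section of this notebook. The `bcubed` helper is illustrative and is not the project's own implementation.

```python
from collections import Counter

def bcubed(clusters, families):
    """Return (precision, recall, f-score) averaged over all samples."""
    pair_counts = Counter(zip(clusters, families))
    cluster_sizes = Counter(clusters)
    family_sizes = Counter(families)

    n = len(clusters)
    precision = sum(pair_counts[(c, f)] / cluster_sizes[c]
                    for c, f in zip(clusters, families)) / n
    recall = sum(pair_counts[(c, f)] / family_sizes[f]
                 for c, f in zip(clusters, families)) / n
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore

# Family counts per cluster, copied from the composition output of this notebook
compositions = {
    -1: {'eorezo': 196, 'bladabindi': 68, 'mydoom': 17,
         'flystudio': 9, 'neshta': 3, 'lamer': 1},
    0: {'mydoom': 129},
    1: {'lamer': 131, 'eorezo': 2},
    2: {'neshta': 221, 'flystudio': 1},
    3: {'gepys': 76, 'flystudio': 76, 'eorezo': 68,
        'bladabindi': 47, 'neshta': 6, 'lamer': 4},
    4: {'bladabindi': 154},
}

# Expand the counts into one (cluster, family) label pair per sample
clusters, families = [], []
for cluster, counts in compositions.items():
    for family, count in counts.items():
        clusters.extend([cluster] * count)
        families.extend([family] * count)

precision, recall, fscore = bcubed(clusters, families)
print(round(precision, 4), round(recall, 4), round(fscore, 4))
```

Under this definition the precision comes out at about 0.700 and the recall at about 0.723, so the two values appear swapped relative to the labels in the log above. The F-score is the same either way.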


## Cluster Analysis

To better understand the result of the clustering algorithm, we would like to see which features characterize the computed clusters.

Since the dimensionality of the dataset was reduced with PCA before clustering, we need to reverse this step to interpret the clusters in terms of the original features.

To achieve this, we will compute each centroid as the average of the data points in its cluster and then multiply it by the PCA components matrix, which maps the centroid back to the original feature space.

We will start by creating an inverted index of the clustering, mapping each cluster label to the list of its samples.

In [16]:
inverted_clustering = defaultdict(list)
for i in range(len(uuids)):
    inverted_clustering[clustering[i]].append(uuids[i])

Using Pandas we can construct a DataFrame representing our reduced data matrix, with dimensions $ ( n\_samples \times n\_pca\_components ) $.

In [17]:
reduced_df = pd.DataFrame(reduced, index=uuids)

To compute the centroids we will just average the values of the PCA-reduced features of each cluster.

In [18]:
centroids = {label: np.zeros(len(reduced[0])) for label in sorted(set(clustering))}

# Sum the PCA-reduced vectors of each cluster
for label, (index, vector) in zip(clustering, reduced_df.iterrows()):
    centroids[label] += vector.values

# Divide each sum by the cluster size and stack the centroids in label order
centroid_matrix = []
for label in sorted(centroids.keys()):
    centroids[label] /= len(inverted_clustering[label])
    centroid_matrix.append(centroids[label])

centroid_matrix = np.array(centroid_matrix)
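
The same averaging can be written more compactly with boolean masks. The following is a sketch on toy data, with a hypothetical `cluster_centroids` helper. It returns the centroids in sorted label order, matching the row order of `centroid_matrix` above.

```python
import numpy as np

def cluster_centroids(data, labels):
    """Return sorted unique labels and one mean vector per label."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    unique_labels = np.unique(labels)        # sorted, so -1 (noise) comes first
    centroids = np.vstack([data[labels == label].mean(axis=0)
                           for label in unique_labels])
    return unique_labels, centroids

toy_data = [[0.0, 0.0], [2.0, 2.0], [1.0, 5.0], [3.0, 7.0]]
toy_labels = [-1, -1, 0, 0]
unique_labels, toy_centroids = cluster_centroids(toy_data, toy_labels)
print(unique_labels)
print(toy_centroids)
```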

Once we have the centroid matrix in the PCA space, we can bring it back to its original dimensions by multiplying it with the PCA components matrix.

This will result in a $ ( n\_centroids \times n\_original\_features ) $ matrix.

In [19]:
centroids_orig_fts = np.dot(centroid_matrix, dr_model.components_)
centroids_orig_fts.shape

(6, 297360)
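
The product above maps the centroids back to the original feature axes. Note that scikit-learn's `inverse_transform` for PCA additionally adds the feature means (`mean_`) that were subtracted before projection. The cell above omits that term, so its values are offsets from the mean rather than reconstructed feature values. The sketch below, on toy data with illustrative names, shows both variants and checks that back-projecting a centroid is the same as averaging the back-projected points.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(30, 8)                        # toy stand-in for the original features
mean = X.mean(axis=0)

# PCA via SVD; rows of `components` play the role of `components_`
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
components = Vt[:3]
reduced = (X - mean).dot(components.T)

centroid = reduced[:10].mean(axis=0)       # centroid of an arbitrary group of 10 rows
back_projected = centroid.dot(components)  # same operation as in the cell above

# Adding the mean gives an approximation in the original coordinates
approx_original = back_projected + mean

# The projection is linear, so this equals the mean of the back-projected rows
rowwise = reduced[:10].dot(components).mean(axis=0)
print(np.allclose(back_projected, rowwise))
```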

Once back in the original feature space, we can identify the ten most influential words for each cluster.

In [20]:
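# Vocabulary words in sorted order, used as the index of each centroid
feature_names = sorted(words.keys())

# Pair each centroid with its cluster label (-1 is HDBSCAN's noise label)
for label, centroid in zip(sorted(set(clustering)), centroids_orig_fts):
    # Rank words by the absolute value of their weight in the centroid
    cent_series = pd.Series(np.abs(centroid), index=feature_names)

    print('Centroid {}:'.format(label))
    print(cent_series.nlargest(10))
    print()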

Centroid -1:
ENTITYATTRIBUTES    13.574922
AFBEELDINGEN        13.513300
RESIMLERI           13.434392
ARTAPPLICATIONS     13.422351
RESIMLER            12.968265
NONSPACE            12.889580
CHATRIN             12.675096
REOPERATIONS        12.578603
EXINTERPRETER       12.528929
ALREADYPRESENT      12.469737
dtype: float64

Centroid 0:
HOVERCART          32.680296
OBALNA             30.583863
NONRECOGNIZABLE    30.550353
HOVERCAR           30.250792
MOYENORIENT        30.123205
PYROELECTRICITY    29.628717
MIDMARKET          29.329555
TWITTERREDDIT      29.213025
USFOLLOWING        29.212505
YCOMBINATOR        29.192296
dtype: float64

Centroid 1:
STAPHYLOCOCCI    30.480420
CONTROVERSIAL    30.375936
PHILOSOPHIC      30.234558
THERMODYNAMIC    30.217283
ROWHORIZONTAL    30.061853
CHOREOGRAPH      29.802483
UPDATABILITY     29.734139
BIBLIOGRAPHIC    29.667438
CONTROVERSIA     29.507344
PROLETARIAN      29.494926
dtype: float64

Centroid 2:
SOUKS       28.058270
RENGIERS    26.508345
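
The ranking above is by absolute value, so a word can rank highly because its weight is strongly positive or strongly negative. As a small sketch on toy data, the same selection can be done directly with NumPy (`top_features` is a hypothetical helper).

```python
import numpy as np

def top_features(weights, names, k=10):
    """Return the k (name, |weight|) pairs with the largest magnitude."""
    magnitude = np.abs(np.asarray(weights, dtype=float))
    order = np.argsort(-magnitude)[:k]     # indices sorted by decreasing magnitude
    return [(names[i], magnitude[i]) for i in order]

toy_names = ['ALPHA', 'BETA', 'GAMMA', 'DELTA']
toy_weights = [0.5, -3.0, 2.0, -0.1]
print(top_features(toy_weights, toy_names, k=2))
```

Here `BETA` ranks first even though its weight is negative.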

It may also be interesting to see which of the initial malware families compose each cluster. Cluster -1 is not a real cluster: it collects the samples that HDBSCAN labeled as noise.

In [21]:
clust_compositions = {i: Counter() for i in sorted(set(clustering.flatten()))}

for i in range(len(uuids)):
    clust_compositions[clustering[i]][uuids_family[uuids[i]]] += 1

for clu in sorted(clust_compositions.keys()):
    print('Cluster {}:'.format(clu))
    print(clust_compositions[clu].most_common())
    print()

Cluster -1:
[('eorezo', 196), ('bladabindi', 68), ('mydoom', 17), ('flystudio', 9), ('neshta', 3), ('lamer', 1)]

Cluster 0:
[('mydoom', 129)]

Cluster 1:
[('lamer', 131), ('eorezo', 2)]

Cluster 2:
[('neshta', 221), ('flystudio', 1)]

Cluster 3:
[('gepys', 76), ('flystudio', 76), ('eorezo', 68), ('bladabindi', 47), ('neshta', 6), ('lamer', 4)]

Cluster 4:
[('bladabindi', 154)]
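
One simple way to summarize these compositions is the purity of each cluster, that is, the share of its samples that belong to its most common family. The sketch below applies a hypothetical `purity` helper to two of the compositions printed above.

```python
from collections import Counter

def purity(composition):
    """Most common family of a cluster and its share of the cluster."""
    family, count = composition.most_common(1)[0]
    return family, count / sum(composition.values())

cluster_1 = Counter({'lamer': 131, 'eorezo': 2})
cluster_3 = Counter({'gepys': 76, 'flystudio': 76, 'eorezo': 68,
                     'bladabindi': 47, 'neshta': 6, 'lamer': 4})

print(purity(cluster_1))
print(purity(cluster_3))
```

Cluster 1 is almost entirely `lamer` (131 of 133 samples), while in cluster 3 the largest family accounts for only 76 of 277 samples, so that cluster mixes several families.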



## Cluster Visualization

We can also generate a visual output from our clustering. 

Let's start by visualizing the original dataset. Since the ~300,000 original features would not allow us to plot the data, we will use a 2-dimensional t-SNE reduction of our feature vectors.

The color of each data point is defined by the AV label extracted from VirusTotal using AVClass.

In [24]:
from visualization import vis_data, vis_cluster

In [23]:
families = samples_data.family[samples_data['selected'] == 1].tolist()
vis_data.plot_data('data/d_matrices/tsne_2_all.txt', families)

Number of labels: 7


Now we can compare the classification provided by the AV data with the result of our clustering, plotted over the same dimensionality-reduced data points.

Here, the color of each point reflects the cluster to which the algorithm assigned it.

In [26]:
clustering.shape

(1209,)

In [27]:
vis_data.plot_data('data/d_matrices/tsne_2_all.txt', clustering)

Number of labels: 6
