# HDBSCAN clustering result analysis

Let's start by loading the libraries and static data used in the following steps.

In [3]:
%load_ext autoreload
%autoreload 2

from utilities import constants
import plotly.offline as ply
import pandas as pd
import numpy as np
import json
import os

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
# Load the run configuration, the uuid-to-family labels and the word list.
with open('config.json', 'r') as f:
    config = json.load(f)
with open(os.path.join(constants.dir_d, constants.json_labels), 'r') as f:
    uuids_family = json.load(f)
with open(os.path.join(constants.dir_d, constants.json_words), 'r') as f:
    words = json.load(f)
ply.init_notebook_mode(connected=True)

## Data selection

Select a subset of the original dataset. The selected subset is then split into a training set and a test set.
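
As an illustration of the splitting step, here is a minimal sketch of a per-family (stratified) split using only the standard library. It is not the code behind `pp_action`: the function name `stratified_split`, the 20% test fraction and the toy labels are assumptions made for this example, and the notebook does not state whether its own split is stratified.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample ids into train/test lists, keeping family proportions.

    `labels` maps a sample id to its family name (hypothetical input shape).
    """
    by_family = defaultdict(list)
    for sample_id, family in labels.items():
        by_family[family].append(sample_id)

    rng = random.Random(seed)
    train, test = [], []
    for family in sorted(by_family):
        ids = sorted(by_family[family])
        rng.shuffle(ids)
        # Reserve roughly test_fraction of each family for testing.
        n_test = round(len(ids) * test_fraction)
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test


# Toy data: 10 samples of one family, 5 of another.
toy_labels = {f"a{i}": "mydoom" for i in range(10)}
toy_labels.update({f"b{i}": "gepys" for i in range(5)})
train_ids, test_ids = stratified_split(toy_labels)
print(len(train_ids), len(test_ids))
```

Splitting family by family keeps small families represented in both sets, which a plain random split does not guarantee.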


In [5]:
from preprocessing import pp_action

In [6]:
samples_data = pp_action.pre_process(config)
pp_action.split_show_data(samples_data)

Please choose the subset of data to workon on:
l for all labeled samples
k for samples of families mydoom, gepys, lamer, neshta, bladabindi, flystudio, eorezo
s for 8 samples of families mydoom, gepys, bladabindi, flystudio
f for a single family
b for a balanced subset of samples
q to quit
k

967 train samples belonging to 7 malware families
Malware family:        eorezo        Number of samples:  219  
Malware family:      bladabindi      Number of samples:  218  
Malware family:        neshta        Number of samples:  184  
Malware family:        mydoom        Number of samples:  116  
Malware family:        lamer         Number of samples:  101  
Malware family:      flystudio       Number of samples:   70  
Malware family:        gepys         Number of samples:   59  

242 test samples belonging to 7 malware families
Malware family:      bladabindi      Number of samples:   51  
Malware family:        eorezo        Number of samples:   47  
Malware family:        neshta        Nu

## Dimensionality Reduction

Currently each data vector has approximately 300,000 components. Feature vectors of such high dimensionality usually cause problems during the clustering phase, since distances between points tend to become less informative as the number of dimensions grows.

Therefore, before clustering the data, we reduce the dimensionality of the dataset.

In this case we will use Principal Component Analysis (PCA) to transform our feature vectors into a new, lower-dimensional dataset.
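
To make the idea concrete, the sketch below performs PCA on a small random matrix with a plain NumPy SVD. It is not the implementation inside `dr_pca`, which processes the data in batches; the function name `pca_fit_transform` and the toy matrix are assumptions made for this example.

```python
import numpy as np


def pca_fit_transform(X, n_components):
    """Project X onto its first n_components principal directions."""
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    reduced = X_centered @ components.T
    # Share of the total variance retained by the kept components.
    explained = (S[:n_components] ** 2).sum() / (S ** 2).sum()
    return reduced, components, mean, explained


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 20))  # toy stand-in for the Tf-Idf matrix
reduced_toy, components_toy, mean_toy, explained_toy = pca_fit_transform(X_toy, 5)
print(reduced_toy.shape, round(float(explained_toy), 3))
```

The `explained` value plays the same role as the "Explained Variance Ratio" in the notebook output: it indicates how much of the original variance survives the reduction.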

In [8]:
from dimensionality_reduction import dr_pca

In [9]:
uuids = samples_data.index[samples_data['selected'] == 1].tolist()
reduced, dr_model = dr_pca.reduce(config, uuids, 100)

# If you have already computed the PCA, load it from disk instead (requires `import joblib`)
# dr_model = joblib.load(os.path.join(constants.dir_d, constants.dir_mod, 'pca_X_X.pkl'))
# reduced = np.loadtxt(matrix_file)

Performing dimensionality reduction using PCA
Processing documents from 0 to 299
Loading Tf-Idf of 300 documents
(300, 297360)
Processing documents from 300 to 599
Loading Tf-Idf of 300 documents
(300, 297360)
Processing documents from 600 to 899
Loading Tf-Idf of 300 documents
(300, 297360)
Processing documents from 900 to 1199
Loading Tf-Idf of 300 documents
(300, 297360)
Processing documents from 1200 to 1499
Loading Tf-Idf of 9 documents
(9, 297360)
Explained Variance Ratio
0.827713758758
Transforming documents from 0 to 299
Loading Tf-Idf of 300 documents
(300, 297360)
Transforming documents from 300 to 599
Loading Tf-Idf of 300 documents
(300, 297360)
Transforming documents from 600 to 899
Loading Tf-Idf of 300 documents
(300, 297360)
Transforming documents from 900 to 1199
Loading Tf-Idf of 300 documents
(300, 297360)
Transforming documents from 1200 to 1499
Loading Tf-Idf of 9 documents
(9, 297360)


## Clustering

Once the dimensionality of the data has been reduced, we can proceed with clustering.

In [None]:
# NOTE: clu_action is a project module; its import cell is not shown in this notebook.
clustering, clu_model = clu_action.cluster(samples_data, config)

### Cluster Analysis

To better understand the result of the clustering algorithm, we would like to see the features that characterize the computed clusters.

Since the dimensionality of the dataset was reduced with PCA before clustering, we need to reverse this step to interpret the clusters in terms of the original features.

To achieve this, we compute the centroid of each cluster as the average of its data points and then multiply it by the PCA components matrix (`dr_model.components_`).

In [None]:
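from collections import defaultdict

# Map each cluster label to the list of sample uuids assigned to it.
inverted_clustering = defaultdict(list)

for i in range(len(uuids)):
    inverted_clustering[clustering[i]].append(uuids[i])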

In [None]:
data_red = pd.DataFrame(reduced, index=uuids)

To compute the centroids we will just average the values of the PCA-reduced features of each cluster.
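
The same averaging can be written compactly with boolean masks. This is a minimal sketch on toy data, not the notebook's own cells; the helper name `cluster_centroids` and the toy arrays are assumptions. Note that HDBSCAN assigns the label -1 to noise points, so the entry for -1 is the average of the noise rather than a real cluster centre.

```python
import numpy as np


def cluster_centroids(points, labels):
    """Return {label: mean vector of the points carrying that label}."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    return {int(label): points[labels == label].mean(axis=0)
            for label in np.unique(labels)}


toy_points = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0], [5.0, 1.0]])
toy_labels = np.array([0, 0, 1, -1])  # -1 is HDBSCAN's label for noise
toy_centroids = cluster_centroids(toy_points, toy_labels)
print(toy_centroids[0])
```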

In [None]:
centroids = {label : np.zeros(len(reduced[0])) for label in sorted(set(clustering))}

In [None]:
# Accumulate the reduced vectors of each cluster.
for i, (_, vector) in enumerate(data_red.iterrows()):
    centroids[clustering[i]] += vector.values

In [None]:
for centroid in centroids:
    centroids[centroid] /= len(inverted_clustering[centroid])

In [None]:
centroid_matrix = []
for centroid in sorted(centroids.keys()):
    centroid_matrix.append(centroids[centroid])
centroid_matrix = np.array(centroid_matrix)

Once we have the centroid matrix in the PCA space, we can map it back to the original feature space by multiplying it by the PCA components matrix.
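
The sketch below illustrates this back-projection on toy data, assuming `dr_model` follows scikit-learn's PCA conventions (the `components_` attribute suggests this, but the notebook does not confirm it). Under those conventions the full inverse transform also adds the per-feature mean back, so the plain product with the components matrix gives the centroid's offset from the dataset mean rather than its absolute position. All variable names here are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(30, 6))
mean = X_toy.mean(axis=0)
_, _, Vt = np.linalg.svd(X_toy - mean, full_matrices=False)
components = Vt[:3]  # shape (3, 6), laid out like components_

reduced = (X_toy - mean) @ components.T  # forward projection
centroid_reduced = reduced[:10].mean(axis=0)  # centroid of a toy "cluster"

# Offset from the dataset mean, expressed in the original features.
centroid_offset = centroid_reduced @ components
# Adding the mean back gives an approximate centroid in the original space.
centroid_original = centroid_offset + mean
print(centroid_offset.shape)
```

For ranking features, the offset is arguably the more useful quantity, since it highlights what sets a cluster apart from the dataset as a whole.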

In [None]:
centroids_orig_fts = np.dot(centroid_matrix, dr_model.components_)
centroids_orig_fts.shape

Once in the original feature space, we can identify the ten most influential words for each cluster.
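
As a small illustration of the ranking, the sketch below picks the features with the largest absolute weight from a toy vector. The helper `top_features`, the placeholder vocabulary and the weights are assumptions for this example. Ranking by absolute value means that a strongly negative weight counts as much as a strongly positive one, and the sign is discarded.

```python
import numpy as np


def top_features(vector, feature_names, k=10):
    """Return the k (name, weight) pairs with the largest absolute weight."""
    weights = np.abs(np.asarray(vector, dtype=float))
    order = np.argsort(weights)[::-1][:k]  # indices by decreasing weight
    return [(feature_names[i], float(weights[i])) for i in order]


toy_vocab = ["word_a", "word_b", "word_c", "word_d"]  # placeholder vocabulary
toy_centroid = np.array([0.1, -0.7, 0.05, 0.4])
print(top_features(toy_centroid, toy_vocab, k=2))
```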

In [None]:
words = dict(zip(range(len(words)), sorted(words.keys())))

In [None]:
for centroid in centroids_orig_fts:
    cent_series = pd.Series(np.abs(centroid), index=sorted(words.values()))
    print(cent_series.nlargest(10))
    print()

It may be interesting to see which of the initial malware families compose each cluster.
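
The counting can be sketched on toy data with the standard library alone. The helper names `cluster_compositions` and `purity` and the toy labels are assumptions for this example; purity is an extra summary added here and is not computed in the notebook.

```python
from collections import Counter, defaultdict


def cluster_compositions(cluster_labels, families):
    """Count how many samples of each family fall in each cluster."""
    compositions = defaultdict(Counter)
    for cluster, family in zip(cluster_labels, families):
        compositions[cluster][family] += 1
    return dict(compositions)


def purity(counter):
    """Fraction of a cluster taken up by its most common family."""
    return counter.most_common(1)[0][1] / sum(counter.values())


toy_clusters = [0, 0, 0, 1, 1, -1]
toy_families = ["mydoom", "mydoom", "gepys", "neshta", "neshta", "lamer"]
toy_compositions = cluster_compositions(toy_clusters, toy_families)
for cluster, counts in sorted(toy_compositions.items()):
    print(cluster, counts.most_common(), round(purity(counts), 2))
```

A purity close to 1 means the cluster is dominated by a single family, while lower values point to clusters that mix several families.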

In [None]:
from collections import Counter
clust_compositions = {i: Counter() for i in sorted(set(clustering.flatten()))}

In [None]:
for i in range(len(uuids)):
    clust_compositions[clustering[i]][uuids_family[uuids[i]]] += 1

In [None]:
for clu in sorted(clust_compositions.keys()):
    print(clu)
    print(clust_compositions[clu].most_common())
    print()

In [None]:
# NOTE: vis_action is a project module; its import cell is not shown in this notebook.
vis_action.visualize(samples_data, config)