# K-Means clustering result analysis

We will start our exploration of the dataset with one of the most classical clustering algorithms: K-Means.

In [2]:
%load_ext autoreload
%autoreload 2

In [14]:
from clustering import clu_kmeans, clu_kmeans_minibatch
from visualization import vis_data, vis_cluster
from collections import defaultdict, Counter
from keywords import kw_keyword_tfidf
from sklearn.metrics import f1_score
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from preprocessing import pp_action
from helpers import loader_tfidf
from utilities import constants
import plotly.graph_objs as go
import plotly.offline as ply
import pandas as pd
import numpy as np
import random
import json
import os

In [4]:
# Load the configuration, the family labels, and the word list.
with open('config.json', 'r') as f:
    config = json.load(f)
with open(os.path.join(constants.dir_d, constants.json_labels), 'r') as f:
    uuids_family = json.load(f)
with open(os.path.join(constants.dir_d, constants.json_words), 'r') as f:
    words = json.load(f)
ply.init_notebook_mode(connected=True)

## Data selection

First, we select a subset of the original dataset. The selected subset is then split into training and testing sets.


In [5]:
samples_data = pp_action.pre_process(config)
pp_action.split_show_data(samples_data)

Please choose the subset of data to workon on:
l for all labeled samples
k for samples of families mydoom, gepys, lamer, neshta, bladabindi, flystudio, eorezo
s for 8 samples of families mydoom, gepys, bladabindi, flystudio
f for a single family
b for a balanced subset of samples
q to quit
k

846 train samples belonging to 7 malware families
Malware family:      bladabindi      Number of samples:  199  
Malware family:        eorezo        Number of samples:  191  
Malware family:        neshta        Number of samples:  150  
Malware family:        mydoom        Number of samples:  104  
Malware family:        lamer         Number of samples:   89  
Malware family:      flystudio       Number of samples:   60  
Malware family:        gepys         Number of samples:   53  

182 dev samples belonging to 7 malware families
Malware family:      bladabindi      Number of samples:   42  
Malware family:        neshta        Number of samples:   36  
Malware family:        eorezo        Num

## Clustering

Now that we have our data subset, we can start with K-Means.
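The log of the clustering call reports documents being processed and then predicted in batches of 300. A minimal sketch of that pattern with scikit-learn's `MiniBatchKMeans` follows. The internals of `clu_kmeans_minibatch.cluster` are not shown in this notebook, so the helper names `iter_batches` and `minibatch_kmeans`, the toy matrix, and the choice of `partial_fit` are assumptions made for illustration only.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import MiniBatchKMeans


def iter_batches(matrix, batch_size):
    """Yield consecutive row slices of at most batch_size rows."""
    for start in range(0, matrix.shape[0], batch_size):
        yield matrix[start:start + batch_size]


def minibatch_kmeans(matrix, n_clusters, batch_size=300, seed=0):
    """Update the centroids batch by batch, then assign every row to a cluster."""
    model = MiniBatchKMeans(n_clusters=n_clusters, n_init=3, random_state=seed)
    for batch in iter_batches(matrix, batch_size):
        model.partial_fit(batch)  # incremental centroid update
    labels = np.concatenate(
        [model.predict(batch) for batch in iter_batches(matrix, batch_size)]
    )
    return labels, model


# Hypothetical stand-in for the real Tf-Idf matrix: 650 rows, 50 sparse features.
toy_tfidf = sparse.random(650, 50, density=0.1, format="csr", random_state=0)
toy_labels, toy_model = minibatch_kmeans(toy_tfidf, n_clusters=10)
```

Fitting and predicting in separate passes keeps only one batch in memory at a time, which matters when each row has several hundred thousand features.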

In [6]:
uuids = samples_data.index[samples_data['selected'] == 1].tolist()
labels_num = samples_data.fam_num[samples_data['selected'] == 1].tolist()

In [20]:
clustering, clu_model = clu_kmeans_minibatch.cluster(config, 10, uuids, labels_num)

Processing documents from 0 to 299
Loading Tf-Idf of 300 documents
(300, 297360)
Processing documents from 300 to 599
Loading Tf-Idf of 300 documents
(300, 297360)
Processing documents from 600 to 899
Loading Tf-Idf of 300 documents
(300, 297360)
Processing documents from 900 to 1199
Loading Tf-Idf of 300 documents
(300, 297360)
Processing documents from 1200 to 1499
Loading Tf-Idf of 9 documents
(9, 297360)
Predicting documents from 0 to 299
Loading Tf-Idf of 300 documents
(300, 297360)
Predicting documents from 300 to 599
Loading Tf-Idf of 300 documents
(300, 297360)
Predicting documents from 600 to 899
Loading Tf-Idf of 300 documents
(300, 297360)
Predicting documents from 900 to 1199
Loading Tf-Idf of 300 documents
(300, 297360)
Predicting documents from 1200 to 1499
Loading Tf-Idf of 9 documents
(9, 297360)
--------------------------------------------------------------------------------
Clustering evaluation
Number of clusters 2
Number of distinct families 7
Adjusted Rand index: 0

--------------------------------------------------------------------------------
Clustering evaluation
Number of clusters 8
Number of distinct families 7
Adjusted Rand index: 0.41555684569
Adjusted Mutual Information: 0.584252793388
Fowlkes-Mallows: 0.599791792263
Homogeneity: 0.588141663804
Completeness: 0.83532192802
BCubed Precision: 0.877924814352
BCubed Recall: 0.599779589811
BCubed FScore: 0.712674853717
-------------------------------------------------------------------------------
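Most of the scores above have direct counterparts in `sklearn.metrics` (`adjusted_rand_score`, `adjusted_mutual_info_score`, `fowlkes_mallows_score`, `homogeneity_score`, `completeness_score`). BCubed does not appear to ship with scikit-learn, so the sketch below computes it by hand using the standard item-averaged definition. Whether the evaluation code used in this notebook follows exactly this definition is an assumption; the function name `bcubed` and the toy labels are illustrative.

```python
from collections import Counter


def bcubed(clusters, families):
    """Return BCubed (precision, recall, F-score) for two parallel label lists."""
    pair_counts = Counter(zip(clusters, families))
    cluster_sizes = Counter(clusters)
    family_sizes = Counter(families)
    n = len(clusters)
    # Each of the `c` items in a (cluster, family) cell shares both its cluster
    # and its family with `c` items, itself included.
    precision = sum(c * c / cluster_sizes[k] for (k, _), c in pair_counts.items()) / n
    recall = sum(c * c / family_sizes[f] for (_, f), c in pair_counts.items()) / n
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy example: two clusters over two families.
toy_precision, toy_recall, toy_f = bcubed([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"])
```

Precision rewards clusters that contain a single family, while recall rewards families that end up in a single cluster.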

## Cluster Analysis

To better understand the result of the clustering algorithm, we would like to see which features characterize the computed clusters. To do so, we aggregate the vectors of each cluster into a single cumulative vector and retrieve the features with the highest weight in that vector.
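The aggregation described above can be sketched as follows. This is not the code of `kw_keyword_tfidf.extract_keywords`, which is not shown here. The function name `top_cluster_keywords`, its parameters, and the toy matrix are hypothetical, and the sketch assumes the cumulative vector is a plain element-wise sum of the Tf-Idf rows.

```python
import numpy as np


def top_cluster_keywords(tfidf, cluster_ids, vocabulary, top_n=10):
    """Sum the row vectors of each cluster and return its heaviest features."""
    cluster_ids = np.asarray(cluster_ids)
    keywords = {}
    for cluster in np.unique(cluster_ids):
        # Cumulative vector: element-wise sum of all rows assigned to the cluster.
        cumulative = np.asarray(tfidf[cluster_ids == cluster].sum(axis=0)).ravel()
        # Indices of the top_n largest weights, heaviest first.
        top = np.argsort(cumulative)[::-1][:top_n]
        keywords[int(cluster)] = [(vocabulary[i], float(cumulative[i])) for i in top]
    return keywords


# Toy example: three documents, three features, two clusters.
toy_tfidf = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.2],
    [0.0, 0.7, 0.3],
])
toy_keywords = top_cluster_keywords(
    toy_tfidf, [0, 0, 1], ["alpha", "beta", "gamma"], top_n=2
)
```

Because the weights are summed rather than averaged, larger clusters produce larger scores, so the values are comparable within a cluster but not across clusters.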

In [21]:
kw_keyword_tfidf.extract_keywords(config, 'data/d_clusterings/clustering_kmeans_euclidean_minibatch_1209.json')

Number of clusters: 7


In [22]:
with open('data/d_keywords/clustering_kmeans_euclidean_minibatch_1209_keywords_tfidf', 'r') as kws:
    print(kws.read())

Cluster	0.0
ALREADYPRESENT	4645.2294058440675
SCRIPTCONFIG	4589.864259793414
RESOURCES11	4519.300608792893
EXINTERPRETER	4504.444313354044
UNINTERRUPTIBLE	4479.732733964338
ENTITYATTRIBUTES	4474.067403542652
TDESIGNATION	4454.970605343973
ALREADYCLOSED	4388.447974969967
REHYDRATION	4354.1308318206675
DUCTSYSTEM	4310.740944682553

Cluster	1.0
SOUKS	7774.228037677314
RENGIERS	7287.666292889318
TELINTRA	6962.69876351303
TOLARIAN	6876.2314595692005
CONCOLON	6834.051146659879
ORTKENS	6744.52529074124
ANGURAL	6712.900377860381
ESMONDO	6674.632915270494
OLDSAND	6648.122863407802
UNICOMM	6641.8413555494935

Cluster	2.0
PREFERREDPROVIDER	6476.197968811619
IPCONNECTIVITY	6394.325654686024
NONBROADCAST	5729.346500122689
STUNICAS	5471.00380826504
ISOLATIONIN	5448.02268716988
TRICKEYS	5379.146915860487
EACCOUNTING	5338.574435502978
ISOLATIONS	5322.846439573167
INTRASITE	5264.942601878135
MDCONFIG	5249.649627561374

Cluster	3.0
STAPHYLOCOCCI	4553.691633197608
CONTROVERSIAL	4527.300464002738
THERMODY

## Cluster Visualization

We can also generate a visual output from our clustering. 

Let's start by visualizing the original dataset. Since the ~300,000 original features cannot be plotted directly, we will use a 2-dimensional t-SNE reduction of our feature vectors.

The color of each data point is defined by the AV label extracted from VirusTotal using AVClass.

In [23]:
families = samples_data.family[samples_data['selected'] == 1].tolist()
vis_data.plot_data('data/d_matrices/tsne_2_1209.txt', families)

Number of labels: 7
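The 2-dimensional matrix loaded above is precomputed, and the notebook does not show how it was produced. A minimal sketch of one way to obtain such a projection with scikit-learn is given below. The intermediate `TruncatedSVD` step, the perplexity value, and the random toy data are assumptions chosen for illustration, not the settings used for `tsne_2_1209.txt`.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Hypothetical stand-in for the feature matrix: 60 samples, 200 features.
rng = np.random.default_rng(0)
toy_features = rng.random((60, 200))

# Compress to a few dozen dimensions first, which keeps t-SNE tractable
# when the original feature space is very large.
compact = TruncatedSVD(n_components=20, random_state=0).fit_transform(toy_features)

# Project to 2 dimensions; perplexity must be smaller than the number of samples.
embedding = TSNE(
    n_components=2, perplexity=10, init="random", random_state=0
).fit_transform(compact)
```

Each row of `embedding` is the 2-dimensional position of one sample, ready to be colored by family or by cluster.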


Now we can compare the classification provided by the AV data with the result of our clustering, plotted over the same dimensionality-reduced data points.

Here, the color of each point reflects the cluster to which the algorithm assigned it.

In [24]:
vis_data.plot_data('data/d_matrices/tsne_2_1209.txt', clustering)

Number of labels: 7
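Alongside the two plots, the same comparison can be made numerically by cross-tabulating families against clusters. The sketch below uses `pandas.crosstab` on toy labels. With the notebook's data, the `families` and `clustering` lists would take the place of the toy lists, assuming they are aligned sample by sample.

```python
import pandas as pd

# Toy labels standing in for the per-sample family names and cluster ids.
toy_families = ["mydoom", "mydoom", "gepys", "gepys", "lamer"]
toy_clusters = [0, 0, 1, 0, 1]

# Rows are families, columns are clusters, cells count the shared samples.
table = pd.crosstab(
    pd.Series(toy_families, name="family"),
    pd.Series(toy_clusters, name="cluster"),
)
```

A family concentrated in one column was kept together by the clustering, while a column spread over many rows indicates a cluster that mixes families.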


We can repeat the same comparison with a 3-dimensional representation of the dataset. Since the representation generated by t-SNE in this case is quite difficult to explore visually, we will use PCA to reduce the dimensions of our vectors.

In [25]:
vis_data.plot_data('data/d_matrices/pca_3_1209.txt', families)

Number of labels: 7


In [26]:
vis_data.plot_data('data/d_matrices/pca_3_1209.txt', clustering)

Number of labels: 7
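For reference, a PCA projection like the one plotted above can be derived from the singular value decomposition of the mean-centered data. The sketch below does this with NumPy on a small dense toy matrix. It is not the code that produced `pca_3_1209.txt`, and the function name `pca_project` is hypothetical. A dense SVD like this would likely be impractical for the full feature matrix.

```python
import numpy as np


def pca_project(matrix, n_components=3):
    """Project the rows of a dense matrix onto its first principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Toy data: 50 samples with 10 features each.
rng = np.random.default_rng(0)
toy_points = pca_project(rng.random((50, 10)), n_components=3)
```

The columns of the result are ordered by decreasing variance, so the first axis of the 3-dimensional plot carries the most spread.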
