# Inspecting the content of possibly misclassified samples


After performing the clustering phase, we compared the results with a baseline clustering provided by AV labels.

From this comparison, it was clear that some malware families were classified in the same way by both our clustering and the AVs.

At the same time, however, there are groups of samples that lie close together in our feature space while being categorized as belonging to different families by the AVs.

We would like to inspect these samples to better understand why they were classified differently from the AV baseline.

Let's start by importing some useful packages.

In [None]:
%load_ext autoreload
%autoreload 2

from collections import defaultdict, Counter
from utilities import constants
from pprint import pprint
import plotly.offline as ply
import pandas as pd
import numpy as np
import json
import os

In [None]:
config = json.load(open('config.json', 'r'))
uuids_family = json.load(open(os.path.join(constants.dir_d, constants.json_labels), 'r'))
words = json.load(open(os.path.join(constants.dir_d, constants.json_words), 'r'))
ply.init_notebook_mode(connected=True)

Next, we load the labels and the clustering results files.

In [None]:
labels = json.load(open('data/labels.json', 'r'))
inv_labels = json.load(open('data/inverted_labels.json', 'r'))

clustering = json.load(open('data/d_clusterings/clustering_hdbscan_cosine_1209.json', 'r'))

In [None]:
clust_compositions = {i: Counter() for i in sorted(set(clustering.values()))}

for i in clustering:
    clust_compositions[clustering[i]][labels[i]] += 1

for clu in sorted(clust_compositions.keys()):
    print('Cluster {}:'.format(clu))
    print(clust_compositions[clu].most_common())
    print()
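
To make the disagreement with the AV baseline easier to spot, the per-cluster counts can be reduced to a single number: the share of samples belonging to the dominant family. The following is a minimal sketch on made-up counts; `cluster_purity` and the `toy` dictionary are hypothetical names that stand in for `clust_compositions` and are not part of the original notebook.

```python
from collections import Counter


def cluster_purity(composition):
    """Return (dominant_family, fraction) for one cluster's Counter."""
    total = sum(composition.values())
    family, count = composition.most_common(1)[0]
    return family, count / total


# Hypothetical toy counts with the same shape as `clust_compositions`
toy = {
    0: Counter({'eorezo': 8, 'bladabindi': 2}),
    4: Counter({'gepys': 3, 'flystudio': 3, 'eorezo': 4}),
}

purities = {clu: cluster_purity(comp) for clu, comp in toy.items()}

for clu, (family, fraction) in sorted(purities.items()):
    print('Cluster {}: {} ({:.0%})'.format(clu, family, fraction))
```

A low fraction suggests a cluster that mixes several AV families and is worth inspecting by hand.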

In [None]:
inverted_clustering = defaultdict(list)
for i in clustering:
    inverted_clustering[clustering[i]].append(i)

Let's isolate the noise cluster, i.e. the samples which the algorithm was unable to fit in a cluster.

In [None]:
noise = inverted_clustering[-1]
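
The key `-1` follows the HDBSCAN convention of labelling noise points with `-1`. As a quick sanity check, it can help to see what fraction of the data set ended up as noise. The sketch below uses made-up sample IDs; `toy_clustering` is a hypothetical stand-in for the `clustering` dictionary loaded above.

```python
from collections import defaultdict

# Hypothetical toy mapping: sample ID -> cluster label (-1 means noise)
toy_clustering = {'u1': -1, 'u2': 0, 'u3': -1, 'u4': 4}

# Invert the mapping: cluster label -> list of sample IDs
inverted = defaultdict(list)
for uuid, label in toy_clustering.items():
    inverted[label].append(uuid)

noise_fraction = len(inverted[-1]) / len(toy_clustering)
print('Noise samples: {} ({:.0%} of the data set)'.format(len(inverted[-1]), noise_fraction))
```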

This cluster seems to be composed primarily of samples from the Eorezo and Bladabindi families.

In [None]:
noise_e = []
noise_b = []

for uuid in noise:
    if uuids_family[uuid] == 'eorezo':
        noise_e.append(uuid)
    elif uuids_family[uuid] == 'bladabindi':
        noise_b.append(uuid)

noise_e = sorted(noise_e)
noise_b = sorted(noise_b)

pprint(noise_e[:5])
pprint(noise_b[:5])

We proceed similarly for cluster number 4.

In [None]:
clus4 = inverted_clustering[4]

This time, it seems the cluster should have been populated primarily by the Flystudio or Gepys family. However, a large number of samples from both Eorezo and Bladabindi are also included in it.

In [None]:
clus4_e = []
clus4_b = []
clus4_g = []
clus4_f = []

for uuid in clus4:
    if uuids_family[uuid] == 'eorezo':
        clus4_e.append(uuid)
    elif uuids_family[uuid] == 'bladabindi':
        clus4_b.append(uuid)
    elif uuids_family[uuid] == 'gepys':
        clus4_g.append(uuid)
    elif uuids_family[uuid] == 'flystudio':
        clus4_f.append(uuid)


clus4_e = sorted(clus4_e)
clus4_b = sorted(clus4_b)
clus4_g = sorted(clus4_g)
clus4_f = sorted(clus4_f)

pprint(clus4_e[:5])
pprint(clus4_b[:5])
pprint(clus4_g[:5])
pprint(clus4_f[:5])
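
The per-family splitting is the same for the noise cluster and for cluster 4, so it could be factored into a small helper. This is only a sketch under assumptions: `split_by_family` is a hypothetical function, and the toy IDs and the `other` label are made up to stand in for `clus4` and `uuids_family`.

```python
from collections import defaultdict


def split_by_family(uuids, uuid_to_family, families):
    """Group sample IDs by family, keeping only the requested families."""
    groups = defaultdict(list)
    for uuid in uuids:
        family = uuid_to_family[uuid]
        if family in families:
            groups[family].append(uuid)
    # Sort each group so the output order is reproducible
    return {family: sorted(groups[family]) for family in families}


# Hypothetical toy data standing in for `clus4` and `uuids_family`
toy_cluster = ['u3', 'u1', 'u2', 'u4']
toy_families = {'u1': 'eorezo', 'u2': 'gepys', 'u3': 'eorezo', 'u4': 'other'}

by_family = split_by_family(toy_cluster, toy_families, ['eorezo', 'gepys', 'flystudio'])
print(by_family)
```

Returning a dictionary keyed by family name avoids keeping one variable per family, which is where copy-and-paste slips tend to creep in.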