# Inspecting the content of possibly misclassified samples


After performing the clustering phase, we compared the results with a baseline clustering provided by AV labels.

The comparison showed that some malware families were grouped in the same way by both our clustering and the AVs.

At the same time, however, there are groups of samples that lie close together in our feature space while being categorized as belonging to different families by the AVs.

We would like to inspect these samples to better understand why they were grouped differently from the AV baseline.

Let's start by importing some useful packages.

In [1]:
%load_ext autoreload
%autoreload 2

In [17]:
from collections import defaultdict, Counter
from utilities import db_manager
from utilities import constants
import plotly.offline as ply
from pprint import pprint
import pandas as pd
import numpy as np
import json
import os

In [3]:
config = json.load(open('config.json', 'r'))
uuids_family = json.load(open(os.path.join(constants.dir_d, constants.json_labels), 'r'))
words = json.load(open(os.path.join(constants.dir_d, constants.json_words), 'r'))
ply.init_notebook_mode(connected=True)

Next, we load the labels and the clustering results.

In [18]:
labels = json.load(open('data/labels.json', 'r'))
inv_labels = json.load(open('data/inverted_labels.json', 'r'))

clustering = json.load(open('data/d_clusterings/clustering_hdbscan_cosine_1209.json', 'r'))
uuid_md5 = db_manager.acquire_malware_file_dict_full(config['dir_db'])

In [5]:
clust_compositions = {i: Counter() for i in sorted(set(clustering.values()))}

for i in clustering:
    clust_compositions[clustering[i]][labels[i]] += 1

for clu in sorted(clust_compositions.keys()):
    print('Cluster {}:'.format(clu))
    print(clust_compositions[clu].most_common())
    print()

Cluster -1:
[('eorezo', 79), ('bladabindi', 35), ('flystudio', 6), ('neshta', 2), ('lamer', 1)]

Cluster 0:
[('mydoom', 146)]

Cluster 1:
[('lamer', 131), ('eorezo', 6)]

Cluster 2:
[('neshta', 222), ('flystudio', 1)]

Cluster 3:
[('eorezo', 108)]

Cluster 4:
[('flystudio', 79), ('gepys', 76), ('eorezo', 73), ('bladabindi', 53), ('neshta', 6), ('lamer', 4)]

Cluster 5:
[('bladabindi', 181)]
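
To put a number on how mixed each cluster is, one option is a purity score: the share of a cluster taken by its most common AV label. The sketch below is only illustrative. It hard-codes the counts printed above instead of reading `clust_compositions`, and the `purity` helper is a hypothetical name that does not appear in the original notebook.

```python
from collections import Counter

# Counts copied from the composition listing above (cluster -1 is the noise cluster).
compositions = {
    -1: Counter({'eorezo': 79, 'bladabindi': 35, 'flystudio': 6, 'neshta': 2, 'lamer': 1}),
    0: Counter({'mydoom': 146}),
    1: Counter({'lamer': 131, 'eorezo': 6}),
    2: Counter({'neshta': 222, 'flystudio': 1}),
    3: Counter({'eorezo': 108}),
    4: Counter({'flystudio': 79, 'gepys': 76, 'eorezo': 73,
                'bladabindi': 53, 'neshta': 6, 'lamer': 4}),
    5: Counter({'bladabindi': 181}),
}


def purity(counter):
    """Hypothetical helper: fraction of samples carrying the most common label."""
    total = sum(counter.values())
    return counter.most_common(1)[0][1] / total


purities = {clu: purity(cnt) for clu, cnt in compositions.items()}
for clu in sorted(purities):
    print('Cluster {:>2}: {:.3f}'.format(clu, purities[clu]))
```

Under this measure clusters 0, 3 and 5 are pure, while cluster 4 is the most mixed of all, which is consistent with the closer look it receives later on.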



In [6]:
inverted_clustering = defaultdict(list)
for i in clustering:
    inverted_clustering[clustering[i]].append(i)

Let's isolate the noise cluster, i.e. the samples which the algorithm was unable to fit in a cluster.

In [7]:
noise = inverted_clustering[-1]
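
HDBSCAN implementations conventionally assign the label `-1` to points that they leave unclustered, which is why the noise samples are found under that key. As a minimal sketch of the inversion and noise extraction, the following uses a toy mapping with made-up sample ids (`toy_clustering` and the other names are assumptions for illustration, not the notebook's data).

```python
from collections import defaultdict

# Toy stand-in for the `clustering` dict (sample id -> cluster label); ids are made up.
toy_clustering = {'s1': 0, 's2': -1, 's3': 0, 's4': 1, 's5': -1}

# Invert the mapping: cluster label -> list of sample ids.
inverted = defaultdict(list)
for sample_id, label in toy_clustering.items():
    inverted[label].append(sample_id)

# .get avoids creating an empty entry when a clustering contains no noise at all.
toy_noise = inverted.get(-1, [])
noise_share = len(toy_noise) / len(toy_clustering)
print(sorted(toy_noise), noise_share)
```

Applied to the counts printed earlier, the noise cluster holds 123 of the 1209 samples, roughly 10% of the dataset.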

The noise cluster seems to be composed primarily of samples from the Eorezo and Bladabindi families (79 and 35 of its 123 samples, respectively).

In [24]:
noise_e = []
noise_b = []

for uuid in noise:
    if uuids_family[uuid] == 'eorezo':
        noise_e.append(uuid)
    elif uuids_family[uuid] == 'bladabindi':
        noise_b.append(uuid)

noise_e = sorted(noise_e)
noise_b = sorted(noise_b)

pprint(dict(zip(noise_e[:5], [uuid_md5[i] for i in noise_e[:5]])))
pprint(dict(zip(noise_b[:5], [uuid_md5[i] for i in noise_b[:5]])))

{'000582b2-0933-44f3-8a8e-c0e3c08d2ab1': 'e1f2df81e63964ef69e9e7aba104d835',
 '006953d6-8a8a-4938-bda1-987733b970cd': 'a465f50359ed35b76ac4129572cdbb49',
 '00a2434b-6852-4e50-8b5c-bf5cc1da5ec1': '31dd6a383d56a2757e000fad6a846be6',
 '00b47107-48c5-47aa-8673-4e15b61b1846': '64e8e7fc62100153cc430d109cda894d',
 '02f02dd2-1a8d-4831-94a0-e7476f30c73d': '2e844dbc08b830da3f53c4ebe03aca0d'}
{'0423c18b-8250-4299-a947-0bbd707b0d67': 'de5f7e8141606ffc2e076081304612cc',
 '0e8a702a-5a33-4de0-8de1-c3fcbeae6e48': 'd89ca9abc03cad0380f726031b093e55',
 '103443a4-ff58-4996-8c25-8d06e52ce551': '35ee1936b2f5a7377ba02968904138b8',
 '10d811b1-2603-4e26-8311-3b94d2f78ad9': 'aa25b0c7ffe281c7c3d998a6d74bbb18',
 '1dc440ca-0f47-4daf-a45c-5c9c7111da31': '0115fecdde4942bbf7bfc18b1d4f8e16'}
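
The split-by-family step above is a pattern that can be reused for any cluster. One possible way to factor it out is sketched here; `split_by_family` is a hypothetical helper, and the ids and labels in the toy data are invented for the example.

```python
from collections import defaultdict


def split_by_family(uuids, family_of, families):
    """Hypothetical helper: group `uuids` by family label, keeping only `families`.

    Returns a dict mapping each requested family to a sorted list of sample ids.
    """
    groups = defaultdict(list)
    for uuid in uuids:
        family = family_of.get(uuid)
        if family in families:
            groups[family].append(uuid)
    return {family: sorted(groups[family]) for family in families}


# Toy data; the ids and labels below are made up for illustration.
toy_labels = {'c': 'eorezo', 'a': 'eorezo', 'b': 'bladabindi', 'd': 'neshta'}
toy_groups = split_by_family(['c', 'a', 'b', 'd'], toy_labels, ['eorezo', 'bladabindi'])
print(toy_groups)
```

In the notebook's terms, a call such as `split_by_family(noise, uuids_family, ['eorezo', 'bladabindi'])` would be expected to yield the same two sorted lists as the cell above, assuming every id in `noise` has an entry in `uuids_family`.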


We do the same for cluster 4.

In [9]:
clus4 = inverted_clustering[4]

This time the cluster looks like it should have been populated primarily by the Flystudio or Gepys families (79 and 76 samples). However, it also contains a large number of Eorezo and Bladabindi samples (73 and 53, respectively).

In [23]:
clus4_e = []
clus4_b = []
clus4_g = []
clus4_f = []

for uuid in clus4:
    if uuids_family[uuid] == 'eorezo':
        clus4_e.append(uuid)
    elif uuids_family[uuid] == 'bladabindi':
        clus4_b.append(uuid)
    elif uuids_family[uuid] == 'gepys':
        clus4_g.append(uuid)
    elif uuids_family[uuid] == 'flystudio':
        clus4_f.append(uuid)


clus4_e = sorted(clus4_e)
clus4_b = sorted(clus4_b)
clus4_g = sorted(clus4_g)
clus4_f = sorted(clus4_f)

pprint(dict(zip(clus4_e[:5], [uuid_md5[i] for i in clus4_e[:5]])))
pprint(dict(zip(clus4_b[:5], [uuid_md5[i] for i in clus4_b[:5]])))
pprint(dict(zip(clus4_g[:5], [uuid_md5[i] for i in clus4_g[:5]])))
pprint(dict(zip(clus4_f[:5], [uuid_md5[i] for i in clus4_f[:5]])))

{'0413eaa7-5431-4d2b-9510-0781508eae02': '70d3f98ae704c3a7da2c6bf0f8c6011b',
 '09321ff0-764d-4200-adc4-8fba0627e6ae': 'c32a2349dddb3b2e669ec9ed5682cb19',
 '09b85bd8-d4ea-4d64-8363-facad113e7b4': '274451d66afddf086b08af6db8194351',
 '0ab68dc8-fbd9-4e30-8f26-9c975243bb77': 'ea28714f03114500fecae3f62aab3b92',
 '0b858225-9f9b-426b-8a18-479d8b653c40': '6258c392004d6e3849568cd7eeff72ee'}
{'059fed1d-1577-4bd3-a380-bdf3adb278e8': '1a01428174cf815e9fe0aea2376bfcd4',
 '071aa948-8a20-423d-84a8-17d312cd5f28': 'f4c2472146ad8bc9483540915974cda1',
 '1afaa51d-37d3-4b21-b824-a86cd14b62f2': 'e46fa26f56702e409d2903d65b6dc58a',
 '1c410f27-6b28-4ead-b2d1-53fcf3132394': '000629bbe2e985767e9341c888752d94',
 '1f1ab0e7-d53c-4a88-9ac1-c58197d42302': '0f429f9a0a4fb54b6ad392281e767d96'}
{'00b4a2aa-3216-435a-80b2-1db8b9c186ca': '620302b9900b549529044523be00d220',
 '062abbb2-324e-49b5-952d-a11716763e2f': 'a9aa361091e254695aaf727121646c8c',
 '0baaa6fa-ef83-4632-8786-03f77ef83920': 'ef24cc2566463f62c465acd04ac43780',

Having isolated five samples for each 'misclassified' group, we can inspect each of them individually. Let's start by printing, for each sample, its 20 highest-scoring TF-IDF words in alphabetical order.

In [11]:
def top_words(config, sample):
    # Print the 20 highest-scoring words of a sample, sorted alphabetically.
    tf_idf_file = os.path.join(config['dir_store'], sample)
    with open(tf_idf_file, 'r') as f:
        tf_idf = Counter(json.load(f))
    print(sorted(word for word, _ in tf_idf.most_common(20)))

In [12]:
def top_words_grp(config, grp):
    for sample in grp:
        print(sample)
        top_words(config, sample)
        print()
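
To make the selection logic easier to follow without the stored files, here is a small sketch that works on an in-memory dictionary. `top_words_from_scores` is a hypothetical variant of `top_words`, and the scores are made up.

```python
from collections import Counter


def top_words_from_scores(scores, n=20):
    """Hypothetical variant of `top_words`: takes a word -> score dict and
    returns the n highest-scoring words in alphabetical order."""
    return sorted(word for word, _ in Counter(scores).most_common(n))


# Made-up scores, only to show the select-then-sort behaviour.
toy_scores = {'delta': 0.1, 'alpha': 0.4, 'charlie': 0.9, 'bravo': 0.7}
toy_top = top_words_from_scores(toy_scores, n=3)
print(toy_top)
```

Sorting alphabetically discards the ranking by score, but it makes the word lists of different samples easier to compare by eye, since shared words end up in similar positions.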

In [13]:
print('-' * 80)
print('eorezo')
top_words_grp(config, clus4_e[:5])

print('-' * 80)
print('bladabindi')
top_words_grp(config, clus4_b[:5])

print('-' * 80)
print('gepys')
top_words_grp(config, clus4_g[:5])

print('-' * 80)
print('flystudio')
top_words_grp(config, clus4_f[:5])


--------------------------------------------------------------------------------
eorezo
0413eaa7-5431-4d2b-9510-0781508eae02
['ALLNORMAL', 'BARRESE', 'BOFFICE', 'BONCONTE', 'BONPANE', 'BORANI', 'ETUTOR', 'KEYBAORD', 'LEWINTER', 'MONCHAN', 'MONPRESS', 'MOUSEMAN', 'MYDOCK', 'NDWANDWA', 'TERAZ', 'TERCHUN', 'TIPBACK', 'TONACH', 'USAFTER', 'YOUAND']

09321ff0-764d-4200-adc4-8fba0627e6ae
['29WIND', 'ALTMORE', 'COLLAPSER', 'F85CM', 'FOROCOCHES', 'GADIRE', 'GADOGADO', 'GIBINA', 'JUNSELE', 'KAUPPALEHTI', 'LNTERNET', 'MIGRATEDIN', 'MYSWITZERLAND', 'NIEUWSBLAD', 'PEERFACTOR', 'PROMICROSOFT', 'REPRESENTD', 'TANOTO', 'UNAVAILABLETHE', 'VOYAGESSNCFCOM']

09b85bd8-d4ea-4d64-8363-facad113e7b4
['1REGULAR', 'ALLNORMAL', 'BARCOMB', 'BARRESE', 'BOFFICE', 'BONCONTE', 'BONKEY', 'BONPANE', 'ETUTOR', 'KALINGAR', 'LEWINTER', 'MONCHAN', 'MONPRESS', 'NUHASH', 'TECDOC', 'TERAZ', 'TERCHUN', 'TIPBACK', 'TONACH', 'USAFTER']

0ab68dc8-fbd9-4e30-8f26-9c975243bb77
['4X800', 'A4800', 'AKOAH', 'DESTEM', 'FRUUX', 'GH800',
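
The overlap between these word lists can also be measured rather than judged by eye. The sketch below computes the Jaccard similarity (shared words divided by all distinct words) for the three complete Eorezo lists printed above. The `jaccard` helper is an assumption introduced for illustration, and the lists are copied from the output rather than reloaded from disk.

```python
def jaccard(a, b):
    """Jaccard similarity of two word collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Top-word lists copied from the printed output for three Eorezo samples.
s_0413 = ['ALLNORMAL', 'BARRESE', 'BOFFICE', 'BONCONTE', 'BONPANE', 'BORANI',
          'ETUTOR', 'KEYBAORD', 'LEWINTER', 'MONCHAN', 'MONPRESS', 'MOUSEMAN',
          'MYDOCK', 'NDWANDWA', 'TERAZ', 'TERCHUN', 'TIPBACK', 'TONACH',
          'USAFTER', 'YOUAND']
s_0932 = ['29WIND', 'ALTMORE', 'COLLAPSER', 'F85CM', 'FOROCOCHES', 'GADIRE',
          'GADOGADO', 'GIBINA', 'JUNSELE', 'KAUPPALEHTI', 'LNTERNET',
          'MIGRATEDIN', 'MYSWITZERLAND', 'NIEUWSBLAD', 'PEERFACTOR',
          'PROMICROSOFT', 'REPRESENTD', 'TANOTO', 'UNAVAILABLETHE',
          'VOYAGESSNCFCOM']
s_09b8 = ['1REGULAR', 'ALLNORMAL', 'BARCOMB', 'BARRESE', 'BOFFICE', 'BONCONTE',
          'BONKEY', 'BONPANE', 'ETUTOR', 'KALINGAR', 'LEWINTER', 'MONCHAN',
          'MONPRESS', 'NUHASH', 'TECDOC', 'TERAZ', 'TERCHUN', 'TIPBACK',
          'TONACH', 'USAFTER']

shared = set(s_0413) & set(s_09b8)
sim_close = jaccard(s_0413, s_09b8)
sim_far = jaccard(s_0413, s_0932)
print(len(shared), round(sim_close, 3), sim_far)
```

The first and third samples share 14 of their 20 top words, while the second shares none with the first, even though the AVs place all three in the same family.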

### Looking at VirusTotal data

Now that we have isolated some problematic samples, let's look at the related VirusTotal reports.

In [16]:
print('Eorezo samples in cluster 4: ', len(clus4_e))
for uuid in clus4_e:
    md5 = uuid_md5[uuid]
    vt = json.load(open(os.path.join(config['dir_vt'], md5), 'r'))
    ms_lab = vt['scans']['Microsoft']['result']
    ks_lab = vt['scans']['Kaspersky']['result']
    fs_lab = vt['scans']['F-Secure']['result']
    ca_lab = vt['scans']['ClamAV']['result']
    
#     print('{:<20} {:<20} {:<20} {:<20}'.format(str(md5), str(ms_lab), str(ks_lab), str(fs_lab)))
    print('{:<20} {:<38} {:<30} {:<20}'.format(str(ms_lab), str(ks_lab), str(fs_lab), str(ca_lab)))
    

Eorezo samples in cluster 4:  73
None                 not-a-virus:AdWare.Win32.Eorezo.afob   Gen:Variant.Adware.Eorezo      None                
Adware:Win32/EoRezo  not-a-virus:AdWare.Win32.Eorezo.ap     Gen:Variant.Zusy.192704        None                
None                 None                                   Adware.Eorezo.BZ               None                
None                 None                                   None                           None                
None                 None                                   Adware.Eorezo.CB               None                
None                 not-a-virus:AdWare.Win32.Agent.jgbe    Adware.Eorezo.CU               None                
Adware:Win32/EoRezo  not-a-virus:AdWare.Win32.Eorezo.feyg   Gen:Variant.Adware.Eorezo      Win.Adware.Eorezo-525
None                 None                                   None                           None                
None                 None                                   None      
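
The table can be summarized by counting, for each sample, how many of the four engines mention the family name in their label. The sketch below is a tentative illustration: the toy reports mimic the `scans` → engine → `result` layout that the cell above indexes into, with label strings copied from the first printed rows. Leaving engines out of a toy report is an assumption made to show the defensive lookup; it is not known whether the real reports omit engines or store `None` results. `engine_label` and `count_family_mentions` are hypothetical helpers.

```python
ENGINES = ['Microsoft', 'Kaspersky', 'F-Secure', 'ClamAV']


def engine_label(report, engine):
    """Return the engine's label, or None if the engine or label is missing."""
    return report.get('scans', {}).get(engine, {}).get('result')


def count_family_mentions(report, family, engines=ENGINES):
    """Hypothetical helper: number of engines whose label contains `family`."""
    labels = (engine_label(report, e) for e in engines)
    return sum(1 for lab in labels if lab and family.lower() in lab.lower())


# Toy reports; label strings are taken from the printed rows, structure is assumed.
toy_reports = [
    {'scans': {'Kaspersky': {'result': 'not-a-virus:AdWare.Win32.Eorezo.afob'},
               'F-Secure': {'result': 'Gen:Variant.Adware.Eorezo'}}},
    {'scans': {'Microsoft': {'result': 'Adware:Win32/EoRezo'},
               'Kaspersky': {'result': 'not-a-virus:AdWare.Win32.Eorezo.ap'},
               'F-Secure': {'result': 'Gen:Variant.Zusy.192704'},
               'ClamAV': {'result': None}}},
    {'scans': {}},
]

mentions = [count_family_mentions(r, 'eorezo') for r in toy_reports]
print(mentions)
```

Using chained `.get` calls means a report that lacks one of the engines yields `None` instead of raising a `KeyError`, and the case-insensitive match covers spellings such as `EoRezo`.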