Clustering (also known as cluster analysis) is a quintessential unsupervised machine learning method. It divides data into groups (or clusters) whose members share similar characteristics. Examining these clusters reveals different "families," which serve as useful categories for future data points. For our purposes, clustering divides VBA macros into groups based on their content, and each cluster can then be assigned a level of "maliciousness" or "benignness." What follows is a demonstration of this process, with clusters formed from the words contained in each macro.
Read the full blog at [InQuest Blog: Clustering for Classification](https://inquest.net/blog/2020/12/16/Clustering-for-Classification)

1. First, we import everything we need: numpy and pandas for loading and handling our data, and our clustering algorithm, sklearn's KMeans. We then load our data into a pandas DataFrame.

In [1]:
from __future__ import print_function
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import string
import os

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

# Pass the path directly so pandas opens and closes the file itself
word_content_df = pd.read_csv(os.path.join(os.getcwd(), "vba_features.csv"), index_col=0)

word_content_df

Unnamed: 0,abs,accelerator,activate,activecell,activecodepane,activecontrol,activedocument,activesheet,activevbproject,activewindow,...,Join,LBound,Split,UBound,CurDir,Dir,FileAttr,FileDateTime,FileLen,GetAttr
4ee2939230a5962bc7937f0e54c27900595c4040ba723f6f31d84cc9e60ac3a7.macro,0.000000,0.0,0.010344,0.000000,0.0,0.0,0.000000,0.106488,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7b409025db2a30a58b42adc20bc81daf7dd6a6f8b7dfd089e90fb9fb232a5c05.macro,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.002899,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2cba5a34d84cba315019a94f15a70ba3f2b013955cd240ccca106aa9a569f827.macro,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
e406765eedf8d315750823133c94acb9d8af9a74afb25af8f983a5d4d48af5b7.macro,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.281521,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3e1b48c1b05736c3125723decf7cfe0e4e4efa51de63ad11eda1fa35e564b72e.macro,0.008199,0.0,0.008457,0.016246,0.0,0.0,0.000000,0.021371,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
f58a9e939d186b69453367e26d1123dc1a03719c8f3c0f69a5157c8a8faca814.macro,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4df4ed09ddcf6e2aa6d682bae21f3b2c243551310bdd565c2d2e0e89f05c4bd6.macro,0.000000,0.0,0.020384,0.000000,0.0,0.0,0.028581,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
a5c6ab5f55851e5a4a665888b4804457a9c27b82bfdadd98185dc733b6ab6ed4.macro,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
985577430d9c17e113d00a3923df0e4e6bcf452452fdd2e3196f89f9b5cffb34.macro,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


2. We now initialize our KMeans model. For this demonstration we have it divide the data into 5 clusters, though in actual practice we use up to 5000. We then fit the model to the data.

In [2]:
word_content_kmeans = KMeans(n_clusters=5).fit(word_content_df)
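To make the fitting step concrete, here is a minimal sketch on synthetic data rather than the macro features. It shows the two parts of a fitted `KMeans` model that matter here: `labels_`, the cluster assigned to each training row, and `predict()`, which assigns unseen rows to the nearest learned centroid. The blob data, `n_init=10`, and `random_state=0` are choices made for this illustration and do not appear in the cell above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two well-separated synthetic blobs standing in for macro feature vectors
blob_a = rng.normal(loc=0.0, scale=0.05, size=(20, 3))
blob_b = rng.normal(loc=1.0, scale=0.05, size=(20, 3))
toy_features = np.vstack([blob_a, blob_b])

# random_state fixes the centroid initialization so repeated runs agree
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_features)

# labels_ holds the cluster index assigned to each training row
print(toy_kmeans.labels_)

# predict() maps unseen rows to the nearest learned centroid
new_points = np.array([[0.02, -0.01, 0.03], [0.98, 1.01, 0.97]])
print(toy_kmeans.predict(new_points))
```

Cluster indices are arbitrary: which blob becomes cluster 0 depends on initialization, so only the grouping itself is meaningful.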

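3. We can now display our data. Each point is a macro, colored by its cluster label, and the sliders select which three feature columns are plotted on the x, y, and z axes.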

In [33]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def slider_explore(xs=396,ys=374,zs=398):
    
    fig = plt.figure(1, figsize=(10, 9))
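    # Recent Matplotlib no longer attaches a bare Axes3D to the figure,
    # so the 3D axes are created through the figure instead
    ax = fig.add_axes([0, 0, .95, 1], projection='3d')
    ax.view_init(elev=48, azim=134)
    labels = word_content_kmeans.labels_

    # np.float was removed in NumPy 1.24; the builtin float is equivalent
    ax.scatter(word_content_df.iloc[:,xs],word_content_df.iloc[:,ys],word_content_df.iloc[:,zs],
               c=labels.astype(float))

    ax.margins(x=0.00000001)

    # w_xaxis/w_yaxis/w_zaxis were deprecated aliases of xaxis/yaxis/zaxis
    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    ax.zaxis.set_ticklabels([])
    ax.set_xlabel(word_content_df.columns[xs])
    ax.set_ylabel(word_content_df.columns[ys])
    ax.set_zlabel(word_content_df.columns[zs])
    ax.set_title('Kmeans Word Content')
    # ax.dist is deprecated in recent Matplotlib, so the default camera distance is used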

interact(slider_explore,xs=(0,896,1),ys=(0,896,1),zs=(0,896,1))


4. In addition to the clusters themselves, we can get information about how malicious or benign each one is, based on how its members are classified in our database. To do this, we've paired each hash with its VirusTotal score (the higher the score, the greater the chance of maliciousness) and our own manual label, if available.

In [5]:
# Pass the path directly so pandas opens and closes the file itself
classification_df = pd.read_csv(os.path.join(os.getcwd(), 'classification.csv'), index_col=0)
df = classification_df.merge(word_content_df,left_index=True,right_index=True)
df

Unnamed: 0,vt_score,classification,abs,accelerator,activate,activecell,activecodepane,activecontrol,activedocument,activesheet,...,Join,LBound,Split,UBound,CurDir,Dir,FileAttr,FileDateTime,FileLen,GetAttr
00027b55ffe7329faff173bc3046f579d176c5a79091bf21f31062e17bfec922.macro,0,0,0.000000,0.0,0.168157,0.058883,0.0,0.0,0.000000,0.196722,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
00125528f276afcb74b5607e38b03edd41efafac58570589ef08b983cfa1231d.macro,14,0,0.000000,0.0,0.013680,0.000000,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
001331fc0a289089ddeaab9ece4b1cf919f4852afe42b6ad64e672e0afccc588.macro,0,0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0014a559b4421bcc8f002e9a8b130f47ca04b7944ba89cf6e80524ed2912474c.macro,16,2,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.337177,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
002d9d8f1664df94a985beff2388badaca96ae46bafa92df76eb19d18c154dcd.macro,5,0,0.000000,0.0,0.028296,0.000000,0.0,0.0,0.000000,0.013241,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ffd776fd6c53a5d4053d2f3c5e69aa513de519ebd9d915cc0cc1f6279c0e8326.macro,1,1,0.005087,0.0,0.006297,0.000000,0.0,0.0,0.000000,0.515676,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ffde3b4dfa144508fb40ff4a57d1659f57e2a3432ed2aad7fe710fd60ed6e271.macro,0,0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.687522,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ffe3c1134a7ca913889b1ce47dabd218bc45d4d8bedeb7fd8448b6cb05d93d84.macro,13,0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
fff7628f553c5df11fd36add8eaf80910bb80293fb1d01ec8cb79e03108de057.macro,0,0,0.000000,0.0,0.000000,0.058392,0.0,0.0,0.000000,0.022760,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
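One way to turn the merged table into per-cluster information is to group rows by cluster label and summarize `vt_score` and `classification` within each group. The sketch below does this on a small hand-made table. The `cluster` column and its values are hypothetical; in the notebook it would have to be built from `word_content_kmeans.labels_`, aligned to the rows of `word_content_df` before the merge. The meaning of the individual classification codes is not spelled out here, so the sketch only counts them.

```python
import pandas as pd

# Toy stand-in for the merged table, with a hypothetical cluster column
toy_df = pd.DataFrame(
    {
        "vt_score": [0, 14, 0, 16, 5, 1],
        "classification": [0, 0, 0, 2, 0, 1],
        "cluster": [0, 1, 0, 1, 2, 2],
    },
    index=["a.macro", "b.macro", "c.macro", "d.macro", "e.macro", "f.macro"],
)

# Size and VirusTotal score statistics for each cluster
cluster_summary = toy_df.groupby("cluster").agg(
    members=("vt_score", "size"),
    mean_vt_score=("vt_score", "mean"),
    max_vt_score=("vt_score", "max"),
)

# How many members of each cluster carry each manual label
label_counts = pd.crosstab(toy_df["cluster"], toy_df["classification"])

print(cluster_summary)
print(label_counts)
```

A cluster whose members mostly share one label and have consistently high or low scores is a better candidate for a blanket verdict than a mixed one.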


5. The data can now be evaluated on maliciousness/benignness, in the same fashion as the clusters were displayed above. Here each point is colored by its classification label rather than by its cluster.

In [68]:
def slider_explore(xs=398,ys=376,zs=400):
    
    fig = plt.figure(1, figsize=(10, 9))
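    # Recent Matplotlib no longer attaches a bare Axes3D to the figure,
    # so the 3D axes are created through the figure instead
    ax = fig.add_axes([0, 0, .95, 1], projection='3d')
    ax.view_init(elev=48, azim=134)
    labels = df['classification'].to_numpy()

    # np.float was removed in NumPy 1.24; the builtin float is equivalent
    ax.scatter(df.iloc[:,xs],df.iloc[:,ys],df.iloc[:,zs],
               c=labels.astype(float))

    # w_xaxis/w_yaxis/w_zaxis were deprecated aliases of xaxis/yaxis/zaxis
    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    ax.zaxis.set_ticklabels([])
    ax.set_xlabel(df.columns[xs])
    ax.set_ylabel(df.columns[ys])
    ax.set_zlabel(df.columns[zs])
    ax.set_title('Kmeans Word Content')
    # ax.dist is deprecated in recent Matplotlib, so the default camera distance is used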

interact(slider_explore,xs=(0,896,1),ys=(0,896,1),zs=(0,896,1))


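The idea of using clusters for classification can be sketched as follows: once clusters carry labels, a new sample is assigned to its nearest cluster and inherits that cluster's majority label. This is a minimal illustration on synthetic data, not the production logic. The blob features, the 0/1 toy labels, the `classify` helper, and the majority-vote rule are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Synthetic features: two blobs standing in for two macro families
features = np.vstack([
    rng.normal(0.0, 0.05, size=(15, 4)),
    rng.normal(1.0, 0.05, size=(15, 4)),
])
# Hypothetical toy labels: 0 for the first family, 1 for the second
labels = np.array([0] * 15 + [1] * 15)
labels[0] = 1  # one deliberately noisy label

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# Majority label among the members of each cluster
majority = pd.Series(labels).groupby(model.labels_).agg(lambda s: s.mode().iloc[0])

def classify(sample):
    """Assign a sample to its nearest cluster and return that cluster's majority label."""
    cluster = model.predict(np.asarray(sample, dtype=float).reshape(1, -1))[0]
    return int(majority.loc[cluster])

print(classify([0.01, 0.0, -0.02, 0.03]))
print(classify([1.02, 0.97, 1.0, 0.99]))
```

The single mislabeled point is outvoted by the other members of its cluster, which is one reason a cluster-level verdict can be more robust than any individual label.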