Cluster Analysis
To better understand the result of the clustering algorithm, we would like to see which features characterize the computed clusters.
Since the dimensionality of the dataset was reduced with PCA before clustering, we need to reverse this step to interpret the clusters in terms of the original features.
To achieve this, we will compute each centroid as the average of the reduced vectors in its cluster and then multiply the centroids by the PCA components matrix, which maps them back to the original feature space.
We will start by creating an inverted index of the clustering, mapping each cluster label to the identifiers of its samples.
In [ ]:

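from collections import defaultdict

# Map each cluster label to the list of sample UUIDs assigned to it.
inverted_clustering = defaultdict(list)
for uuid, label in zip(uuids, clustering):
    inverted_clustering[label].append(uuid)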
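As a small illustration of the structure this produces, the sketch below builds the same kind of inverted index from made-up identifiers and labels. `toy_uuids` and `toy_clustering` are hypothetical stand-ins for `uuids` and `clustering`.

```python
from collections import defaultdict

# Hypothetical stand-ins for the notebook's `uuids` and `clustering`.
toy_uuids = ['a1', 'b2', 'c3', 'd4']
toy_clustering = [0, 1, 0, 1]

# Same construction as above: cluster label -> list of member identifiers.
toy_inverted = defaultdict(list)
for uuid, label in zip(toy_uuids, toy_clustering):
    toy_inverted[label].append(uuid)

# The index also gives the size of each cluster, which is needed for averaging.
cluster_sizes = {label: len(members) for label, members in toy_inverted.items()}
print(dict(toy_inverted))
print(cluster_sizes)
```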
Using pandas, we can construct a dataframe representing our reduced data matrix, with dimensions (n_samples×n_pca_components).
In [ ]:

reduced_df = pd.DataFrame(reduced, index=uuids)
To compute the centroids we will just average the values of the PCA-reduced features of each cluster.
In [ ]:

# Accumulate the sum of the PCA-reduced vectors of each cluster.
centroids = {label: np.zeros(len(reduced[0])) for label in sorted(set(clustering))}

for i, (_, vector) in enumerate(reduced_df.iterrows()):
    centroids[clustering[i]] += vector.values

# Divide each sum by the cluster size and stack the centroids in label order.
centroid_matrix = []
for label in sorted(centroids.keys()):
    centroids[label] /= len(inverted_clustering[label])
    centroid_matrix.append(centroids[label])

centroid_matrix = np.array(centroid_matrix)
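The same averaging can also be expressed as a pandas group-by, which may be easier to read. The following is a minimal sketch on made-up data, not the notebook's own method. `toy_uuids`, `toy_reduced` and `toy_clustering` are hypothetical stand-ins for `uuids`, `reduced` and `clustering`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 samples in a 2-dimensional reduced space.
toy_uuids = ['a1', 'b2', 'c3', 'd4']
toy_reduced = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
toy_clustering = np.array([0, 0, 1, 1])

toy_df = pd.DataFrame(toy_reduced, index=toy_uuids)

# Group rows by their cluster label (matched by position) and average each column.
# groupby sorts the group keys, so the rows come out in ascending label order.
toy_centroid_matrix = toy_df.groupby(toy_clustering).mean().to_numpy()
print(toy_centroid_matrix)
```

This relies on the labels being in the same order as the rows of the dataframe, which is also what the loop above assumes.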
Once we have the centroid matrix in the PCA space, we can bring it back to the original feature space by multiplying it by the PCA components matrix.
This results in an (n_centroids×n_original_features) matrix.
In [ ]:

centroids_orig_fts = np.dot(centroid_matrix, dr_model.components_)
centroids_orig_fts.shape
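To make the shapes concrete, here is a minimal sketch of the back-projection on made-up values. `toy_centroid_matrix`, `toy_components` and `toy_mean` are hypothetical. One assumption is worth stating: if the reduction model centered the data before projecting, as scikit-learn's `PCA` does, the product alone yields offsets from the per-feature means, and the mean vector would have to be added to recover absolute feature values.

```python
import numpy as np

# Hypothetical centroids in the reduced space: (n_centroids, n_pca_components).
toy_centroid_matrix = np.array([[2.0, 0.0],
                                [0.0, 3.0]])
# Hypothetical components matrix: (n_pca_components, n_original_features).
toy_components = np.array([[1.0, 0.0, 0.0],
                           [0.0, 0.6, 0.8]])
# Hypothetical per-feature mean, only relevant if the model centered the data.
toy_mean = np.array([0.5, 0.5, 0.5])

# Back-projection: (2, 2) @ (2, 3) -> (2, 3), one row per centroid.
offsets = toy_centroid_matrix @ toy_components
# Absolute feature values under the centering assumption.
absolute = offsets + toy_mean

print(offsets.shape)
print(absolute)
```

For ranking features by magnitude within a centroid, as done next, whether the mean is added back can change the ordering, so it is worth checking which kind of model `dr_model` is.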
Once in the original feature space, we can identify the ten most influential words for each cluster.
In [ ]:

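# Note: this rebinds `words` to an index -> word mapping, so run this cell only once.
words = dict(zip(range(len(words)), sorted(words.keys())))
vocabulary = sorted(words.values())

# Rows of centroids_orig_fts follow the sorted cluster labels used to build centroid_matrix.
for label, centroid in zip(sorted(centroids.keys()), centroids_orig_fts):
    cent_series = pd.Series(np.abs(centroid), index=vocabulary)

    print('Centroid {}:'.format(label))
    print(cent_series.nlargest(10))
    print()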
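Ranking by absolute value means that strongly negative weights count as much as strongly positive ones. The toy sketch below, with made-up words and weights, shows the effect using only the standard library. `toy_vocabulary` and `toy_centroid` are hypothetical.

```python
import heapq

# Hypothetical vocabulary and centroid weights in the original feature space.
toy_vocabulary = ['alpha', 'beta', 'gamma', 'delta']
toy_centroid = [0.1, -0.9, 0.4, -0.2]

# Keep the two (word, weight) pairs with the largest absolute weight.
top_words = heapq.nlargest(
    2, zip(toy_vocabulary, toy_centroid), key=lambda pair: abs(pair[1])
)
print(top_words)
```

Unlike the `np.abs` call in the cell above, this keeps the original sign, which shows whether a word is over- or under-represented in the cluster.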
It may also be interesting to see which of the initial malware families compose each cluster.
In [ ]:

from collections import Counter

# Count, for each cluster, how many samples belong to each malware family.
clust_compositions = {label: Counter() for label in sorted(set(clustering.flatten()))}

for uuid, label in zip(uuids, clustering):
    clust_compositions[label][uuids_family[uuid]] += 1

for clu in sorted(clust_compositions.keys()):
    print('Cluster {}:'.format(clu))
    print(clust_compositions[clu].most_common())
    print()
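One simple way to summarize these counts is the share of each cluster taken by its most common family. The sketch below is an illustrative addition on made-up counts. `toy_compositions` and `dominant_share` are hypothetical names, not part of the notebook.

```python
from collections import Counter

# Hypothetical per-cluster family counts, shaped like clust_compositions.
toy_compositions = {
    0: Counter({'family_a': 8, 'family_b': 2}),
    1: Counter({'family_a': 4, 'family_b': 3, 'family_c': 3}),
}

def dominant_share(counter):
    """Return the most common family and the fraction of the cluster it covers."""
    family, count = counter.most_common(1)[0]
    return family, count / sum(counter.values())

shares = {label: dominant_share(c) for label, c in toy_compositions.items()}
for label, (family, share) in sorted(shares.items()):
    print('Cluster {}: {} ({:.0%})'.format(label, family, share))
```

A share close to 1 suggests the cluster largely coincides with one AV-assigned family, while a low share indicates a mixed cluster.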
Cluster Visualization
We can also generate a visual output from our clustering.
Let's start by visualizing the original dataset. Since the roughly 300,000 original features cannot be plotted directly, we will use a 2-dimensional t-SNE reduction of our feature vectors.
The color of each data point is defined by the AV label extracted from VirusTotal using AVClass.
In [ ]:

families = samples_data.family[samples_data['selected'] == 1].tolist()
vis_data.plot_data('data/d_matrices/tsne_2_1209.txt', families)
Now we can compare the classification provided by the AV data with the result of our clustering, plotted over the same dimensionality-reduced data points.
Here, the color of each point reflects the cluster to which the algorithm assigned it.
In [ ]:

vis_data.plot_data('data/d_matrices/tsne_2_1209.txt', clustering)
We can repeat the same comparison with a 3-dimensional representation of the dataset. Since in this case t-SNE produced a representation that is quite difficult to explore visually, we will use PCA to reduce the dimensions of our vectors.
In [ ]:

vis_data.plot_data('data/d_matrices/pca_3_1209.txt', families)
In [ ]:

vis_data.plot_data('data/d_matrices/pca_3_1209.txt', clustering)