
<center><h1>DCM Module Tutorial</h1></center>

This is a basic example of how to use the DCM module.

In [2]:
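import pandas as pd
import os
import math

# Assumption: dcm.py (listed below) is importable, e.g. saved next to this notebook.
from dcm import SlidingWindow

# Change into the folder that holds dns.txt; adjust this path for your machine.
path="F:\\Masters degree\\Capstone Project\\data"
os.chdir(path)

# dns.txt has no header row, so the column names are supplied here.
dns_headings = ['time','source_computer','computer_resolved']
dns = pd.read_csv('dns.txt', sep=',', header=None, names=dns_headings)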
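
If the LANL file is not at hand, the same loading step can be tried on a few in-memory rows first. The sketch below is only an illustration: the three sample rows are made up, and `io.StringIO` stands in for `dns.txt`.

```python
import io
import pandas as pd

# Made-up rows in the same three-column layout used above
# (time, source computer, resolved computer); this is not real LANL data.
sample_text = "2,C1,C10\n2,C2,C10\n5,C1,C11\n"

dns_headings = ['time', 'source_computer', 'computer_resolved']
sample_dns = pd.read_csv(io.StringIO(sample_text), header=None, names=dns_headings)

# Check the inferred column types and how often each computer was resolved.
print(sample_dns.dtypes)
print(sample_dns.computer_resolved.value_counts())
```

A small frame like this is also a convenient input for experimenting with the functions in dcm.py before running them on the full dataset.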

Below is the code contained in dcm.py.

In [None]:
"""DCM (DNS Clustering Module) was designed to perform cluster analysis on the 
Los Alamos National Laboratory DNS data from the cyber security dataset
(available at https://csr.lanl.gov/data/cyber1/). The clustering is based on the 
number of connections a computer receives within a given timeframe along with 
associated statistical features.

This module also contains the option of performing a more general analysis using 
sliding time based windows on the top n resolved computers.

More features, such as different clustering methods, will likely be added in the
immediate future."""

import pandas as pd
import scipy.stats
from sklearn.cluster import KMeans


def GetTopn(data,n=50):
    """Returns the names of the n most frequently resolved computers as a 
    pandas Index, most frequent first."""
    data=data.computer_resolved.groupby(data.computer_resolved).count()
    data=data.sort_values(ascending=False)
    highest_resolved_computers=data.index[0:n]
    
    return highest_resolved_computers


def GetConnectionStats(group, extended_features=False):
    """Calculate the connection statistics of a pandas df grouped by resolved 
    computer."""
    # Number of connections seen at each distinct timestamp in this group.
    num_connections=group.time.value_counts()
    
    if extended_features:
        # 'count' is the number of distinct timestamps with at least one
        # connection; the other statistics describe connections per timestamp.
        connection_stats={'minimum': num_connections.min(), 
                        'maximum': num_connections.max(), 
                        'count': num_connections.count(),
                        'mean': num_connections.mean(),
                        'std': num_connections.std(),
                        'skew': scipy.stats.skew(num_connections),
                        'kurtosis': scipy.stats.kurtosis(num_connections)}
    else:
        # Default feature set: the number of distinct timestamps only.
        connection_stats={'count': num_connections.count()}
                        
    return connection_stats
    
    
def DNSDFToStatsDF(dns_df, n=50, use_extended_features=False):
    """Calculate the connection statistics of resolved computers in a DNS-like 
    dataframe.
    
    Takes a dataframe in the same format as the original DNS data.
    Various statistical parameters are then calculated for the top n resolved
    computers. These statistics are then returned with the corresponding computer
    resolved as a feature."""
    highest_resolved_computers=GetTopn(dns_df,n)
    dns_df=dns_df.loc[dns_df.computer_resolved.isin(highest_resolved_computers)]
    
    new_df=dns_df.groupby(dns_df.computer_resolved).apply(GetConnectionStats, 
                                            extended_features=use_extended_features)
    new_df=pd.DataFrame(new_df)
    new_df.reset_index(inplace=True)

    stats=new_df[0].tolist()

    stats_df=pd.DataFrame(stats)

    stats_df["computer_resolved"]=new_df.computer_resolved
    stats_df=stats_df.sort_values(by="count",ascending=False)
    
    return stats_df
    
    
def ClusterAnalysis(data, k=3):
    """Cluster using k-means. Different methods may be added later.
    
    Returns the computer names and the fitted KMeans object."""
    computer_resolved=data.computer_resolved
    # Drop the name column on a copy so the caller's dataframe is not modified.
    data=data.drop(["computer_resolved"], axis=1)
    
    data=Normalize(data)
    
    Clustering=KMeans(k).fit(data)
    
    return computer_resolved, Clustering
    
    
def Normalize(data):
    """Normalize the data."""
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    std[std<0.01]=1
    
    data=data-mean 
    data=data/(std)
    
    return data
    
    
def SlidingWindow(data, analysis_function=ClusterAnalysis, window_size=5011199,
                    stride=1, n=50, use_extended_features=False):
    """Applies analysis_function to sliding windows across the data.
    
    Takes in the DNS dataframe and then applies analysis_function to sliding
    windows defined by window_size and stride. These parameters can be altered
    to result in smaller or larger windows with greater or lesser overlap.
    analysis_function may be any callable that accepts the statistics dataframe
    returned by DNSDFToStatsDF.
    
    The return value is a dictionary indexed by sliding window number containing
    the results of analysis_function."""
    start_time=0
    end_time=window_size
    max_time=data.time.max()
    
    results={}
    window_number=1
    is_final_window=False
    
    while not is_final_window:
        
        # Half-open window: start_time <= time < end_time.
        current_data=data[(data.time>=start_time) & (data.time<end_time)]
        
        highest_resolved_computers=GetTopn(current_data, n)
        highest_resolved_dns=current_data.loc[current_data.computer_resolved.isin(highest_resolved_computers)]

        # Pass n through so that values above the default of 50 are not truncated.
        highest_resolved_stats=DNSDFToStatsDF(highest_resolved_dns, n=n,
                                              use_extended_features=use_extended_features)
        
        results["Window_"+str(window_number)]=analysis_function(highest_resolved_stats)
        
        window_number+=1
        
        start_time+=stride
        end_time+=stride
        
        if end_time>=max_time:
            
            is_final_window=True
       
    return results 

First we will cluster the entire dataset using only the default count feature for the top 50 resolved computers. (As computed by GetConnectionStats, count is the number of distinct timestamps at which a computer was resolved.)

The result is a dictionary containing a single key: 'Window_1'. The corresponding value is a pair: the names of the top 50 resolved computers (a pandas Series) and the fitted clustering object.

In [4]:
cluster_all_data=SlidingWindow(dns)
top50_resolved_computers, clustering_results=cluster_all_data['Window_1']

From the clustering results, we can obtain the cluster labels for each of the listed computers, in addition to the cluster centers.

In [5]:
print("Top 10 resolved computers:")
print(top50_resolved_computers[0:10])

labels=clustering_results.labels_
print("Labels:")
print(labels[0:10])

cluster_centers=clustering_results.cluster_centers_
print("Cluster centers:")
print(cluster_centers)

Top 10 resolved computers:
40     C586
7     C1707
6     C1685
29    C5030
47     C754
1     C1025
45     C706
2     C1065
14    C2189
27     C457
Name: computer_resolved, dtype: object
Labels:
[1 1 1 1 2 2 2 2 2 2]
Cluster centers:
[[-0.53554605]
 [ 2.82316459]
 [ 0.57632562]]
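
The names and labels printed above are paired by position, so it can help to put them side by side in one table. The following is a minimal sketch that uses only the ten names and labels shown in the output; the variable names `membership` and `by_cluster` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# First ten names and labels copied from the output above.
names = pd.Series(["C586", "C1707", "C1685", "C5030", "C754",
                   "C1025", "C706", "C1065", "C2189", "C457"],
                  index=[40, 7, 6, 29, 47, 1, 45, 2, 14, 27],
                  name="computer_resolved")
labels = np.array([1, 1, 1, 1, 2, 2, 2, 2, 2, 2])

# Use .values so rows are paired by position, not by the Series index.
membership = pd.DataFrame({"computer_resolved": names.values,
                           "cluster": labels})

# Collect the computer names that share each cluster label.
by_cluster = membership.groupby("cluster")["computer_resolved"].apply(list)
print(by_cluster)
```

With the real results, `top50_resolved_computers` and `labels` would take the place of the hard-coded values.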


By adjusting window_size and stride, we can instead cluster disjoint or overlapping subsets of the data.

The data timestamps range from 0 to 5011199, and 5011199 is also the default window size. The default settings (above) therefore produce a single window, which amounts to one cluster analysis over the entire time range. To analyze two equally sized, disjoint halves of the data (i.e. split the data in half and cluster each portion), set window_size=floor(5011199/2) and stride=floor(5011199/2). SlidingWindow does not derive these values for you: choose window_size and stride yourself, depending on the result you want.
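
To check how many windows a given pair of settings will produce before running the full analysis, the loop in SlidingWindow can be replayed without touching the data. The helper below, `window_bounds`, is hypothetical and not part of dcm.py; it is a sketch that copies the start/end arithmetic and the stopping rule from the listing above.

```python
import math

def window_bounds(max_time, window_size, stride):
    """Hypothetical helper: list the (start, end) pairs SlidingWindow would visit."""
    bounds = []
    start_time, end_time = 0, window_size
    while True:
        bounds.append((start_time, end_time))  # rows with start <= time < end
        start_time += stride
        end_time += stride
        # Same stopping rule as SlidingWindow: checked after the shift.
        if end_time >= max_time:
            break
    return bounds

max_time = 5011199
half = math.floor(max_time / 2)
quarter = math.floor(max_time / 4)

print(window_bounds(max_time, max_time, 1))    # default settings
print(window_bounds(max_time, half, half))     # two disjoint windows
print(window_bounds(max_time, half, quarter))  # half-size windows, quarter-size stride
```

Because the window sizes are floored and each window excludes its end time, the last window stops slightly short of 5011199, so the very last timestamps fall outside every window.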

In [2]:
cluster_all_data=SlidingWindow(dns, window_size=math.floor(5011199/2), stride=math.floor(5011199/2))

The result is a dictionary that now has two keys: 'Window_1' and 'Window_2'. The values of these keys are in the same format as the previous example.

In [3]:
window_1_resolved_computers, window_1_cluster_results=cluster_all_data['Window_1']

print("Top 10 resolved computers for window 1:")
print(window_1_resolved_computers[0:10])

window_1_labels=window_1_cluster_results.labels_
print("Labels:")
print(window_1_labels[0:10])

window_1_cluster_centers=window_1_cluster_results.cluster_centers_
print("Cluster centers:")
print(window_1_cluster_centers)


Top 10 resolved computers for window 1:
46     C706
5     C1685
7     C1707
28    C5030
41     C586
3     C1065
26     C457
30     C528
27     C467
31     C529
Name: computer_resolved, dtype: object
Labels:
[1 1 1 1 1 2 2 2 2 2]
Cluster centers:
[[-0.476392  ]
 [ 2.72032766]
 [ 0.39427485]]


In [4]:
window_2_resolved_computers, window_2_cluster_results=cluster_all_data['Window_2']

print("Top 10 resolved computers for window 2:")
print(window_2_resolved_computers[0:10])

window_2_labels=window_2_cluster_results.labels_
print("Labels:")
print(window_2_labels[0:10])

window_2_cluster_centers=window_2_cluster_results.cluster_centers_
print("Cluster centers:")
print(window_2_cluster_centers)

Top 10 resolved computers for window 2:
40     C586
28    C5030
6     C1707
5     C1685
47     C754
1     C1025
45     C706
13    C2189
2     C1065
26     C457
Name: computer_resolved, dtype: object
Labels:
[1 1 1 1 1 1 2 2 2 2]
Cluster centers:
[[-0.57009701]
 [ 2.4432564 ]
 [ 0.2318053 ]]


As can be seen, the two windows differ: the top 10 computers change, and C706 moves from cluster 1 in window 1 to cluster 2 in window 2. (Why?)
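
One way to compare the two windows is to join them on the computer name and look for rows whose label changed. The sketch below uses only the ten names and labels printed for each window. It assumes the label numbers are comparable across the two windows, which holds here because in both outputs cluster 1 has the largest center and cluster 2 the middle one.

```python
import pandas as pd

# Names and labels copied from the two outputs above (first ten rows of each window).
window_1 = pd.DataFrame({
    "computer_resolved": ["C706", "C1685", "C1707", "C5030", "C586",
                          "C1065", "C457", "C528", "C467", "C529"],
    "cluster": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]})
window_2 = pd.DataFrame({
    "computer_resolved": ["C586", "C5030", "C1707", "C1685", "C754",
                          "C1025", "C706", "C2189", "C1065", "C457"],
    "cluster": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2]})

# Inner join keeps only computers that appear in both lists.
both = window_1.merge(window_2, on="computer_resolved", suffixes=("_w1", "_w2"))

# Rows whose cluster label differs between the two windows.
moved = both[both.cluster_w1 != both.cluster_w2]
print(moved)
```

Computers that appear in only one of the two lists are dropped by the inner join; they could be inspected separately with an outer join.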

In the final example, we will split the data into three <i>overlapping</i> windows using window_size=floor(5011199/2) and stride=floor(5011199/4).

In [5]:
cluster_all_data=SlidingWindow(dns, window_size=math.floor(5011199/2), stride=math.floor(5011199/4))

We should now have a dictionary containing three keys: 'Window_1', 'Window_2' and 'Window_3'. The results can be viewed in the same way as in the previous examples.

In [13]:
window_1_resolved_computers, window_1_cluster_results=cluster_all_data['Window_1']

print("Top 10 resolved computers for window 1:")
print(window_1_resolved_computers[0:10])

window_1_labels=window_1_cluster_results.labels_
print("Labels:")
print(window_1_labels[0:10])

window_1_cluster_centers=window_1_cluster_results.cluster_centers_
print("Cluster centers:")
print(window_1_cluster_centers)

Top 10 resolved computers for window 1:
46     C706
5     C1685
7     C1707
28    C5030
41     C586
3     C1065
26     C457
30     C528
27     C467
31     C529
Name: computer_resolved, dtype: object
Labels:
[2 2 2 2 2 0 0 0 0 0]
Cluster centers:
[[ 0.39427485]
 [-0.476392  ]
 [ 2.72032766]]
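
The centers printed here are the same values as for Window_1 in the previous example, only in a different order, and the labels are renumbered to match. k-means assigns cluster numbers arbitrarily, so labels from separate runs are not directly comparable. One possible fix, sketched below, is to renumber clusters by the size of their center. The helper `relabel_by_center` is hypothetical and assumes a single feature (one column of centers), as with the default settings.

```python
import numpy as np

def relabel_by_center(labels, centers):
    """Hypothetical helper: renumber clusters so 0 has the smallest center."""
    centers = np.asarray(centers)
    order = np.argsort(centers[:, 0])    # cluster indices, smallest center first
    rank = np.empty_like(order)
    rank[order] = np.arange(len(order))  # rank[old_label] gives the new label
    return rank[np.asarray(labels)]

# Values copied from the Window_1 outputs of this example and the previous one.
labels_a = [2, 2, 2, 2, 2, 0, 0, 0, 0, 0]
centers_a = [[0.39427485], [-0.476392], [2.72032766]]
labels_b = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
centers_b = [[-0.476392], [2.72032766], [0.39427485]]

# After renumbering, both runs give the same labels for the same computers.
print(relabel_by_center(labels_a, centers_a))
print(relabel_by_center(labels_b, centers_b))
```

With use_extended_features=True the centers have several columns, and a different ordering rule would be needed.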


In [12]:
window_2_resolved_computers, window_2_cluster_results=cluster_all_data['Window_2']

print("Top 10 resolved computers for window 2:")
print(window_2_resolved_computers[0:10])

window_2_labels=window_2_cluster_results.labels_
print("Labels:")
print(window_2_labels[0:10])

window_2_cluster_centers=window_2_cluster_results.cluster_centers_
print("Cluster centers:")
print(window_2_cluster_centers)

Top 10 resolved computers for window 2:
40     C586
8     C1707
6     C1685
28    C5030
45     C706
1     C1025
47     C754
14    C2189
2     C1065
30     C528
Name: computer_resolved, dtype: object
Labels:
[1 1 1 1 1 2 2 2 2 2]
Cluster centers:
[[-0.54907453]
 [ 2.62834412]
 [ 0.41481157]]


In [11]:
window_3_resolved_computers, window_3_cluster_results=cluster_all_data['Window_3']

print("Top 10 resolved computers for window 3:")
print(window_3_resolved_computers[0:10])

window_3_labels=window_3_cluster_results.labels_
print("Labels:")
print(window_3_labels[0:10])

window_3_cluster_centers=window_3_cluster_results.cluster_centers_
print("Cluster centers:")
print(window_3_cluster_centers)

Top 10 resolved computers for window 3:
40     C586
28    C5030
6     C1707
5     C1685
47     C754
1     C1025
45     C706
13    C2189
2     C1065
26     C457
Name: computer_resolved, dtype: object
Labels:
[1 1 1 1 1 1 2 2 2 2]
Cluster centers:
[[-0.570097  ]
 [ 2.44325628]
 [ 0.23180533]]


That concludes the tutorial. Thank you for reading! Here are some ideas for where to go next:

1) Try using more statistical features by setting use_extended_features=True for dcm.SlidingWindow.<BR>
2) See if computers move in and out of clusters depending on the time (hint: think about it in terms of moving from a larger cluster to a smaller one, or vice versa, as the cluster numbers change from window to window).<BR>
3) Try different numbers of clusters.<BR>
4) Use a different form of clustering.<BR>
5) Explore the dcm.py file and possibly add a new function to perform a different type of analysis on the sliding windows.<BR>
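
As a starting point for the last idea, here is a minimal sketch of a custom analysis function. `FlagHighCounts` and its `z_threshold` parameter are hypothetical and not part of dcm.py, and the example statistics are made up. The sketch assumes the function receives the dataframe produced by DNSDFToStatsDF with default features, i.e. the columns count and computer_resolved.

```python
import pandas as pd

def FlagHighCounts(stats_df, z_threshold=2.0):
    """Hypothetical analysis function: names whose count is far above the mean."""
    counts = stats_df["count"]
    std = counts.std()
    # With no spread in the counts there is nothing to flag.
    if std == 0 or pd.isna(std):
        return []
    z_scores = (counts - counts.mean()) / std
    return stats_df.loc[z_scores > z_threshold, "computer_resolved"].tolist()

# Made-up statistics in the shape DNSDFToStatsDF returns with default features.
example_stats = pd.DataFrame({
    "count": [100, 12, 11, 10, 9, 8, 10, 11, 9, 10],
    "computer_resolved": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]})

print(FlagHighCounts(example_stats))
```

A function like this could then be supplied through the analysis_function parameter, for example SlidingWindow(dns, analysis_function=FlagHighCounts), in which case each window's entry in the result dictionary would be a list of names instead of a clustering result.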