# CLX Asset Classification (Supervised)

## Authors
- Eli Fajardo (NVIDIA)
- Görkem Batmaz (NVIDIA)
- Bhargav Suryadevara (NVIDIA)


## Table of Contents 
* Introduction
* Dataset
* Reading in the datasets
* Training and inference
* References

## Introduction

In this notebook, we show how to predict the function of a server from its Windows Event Logs using cuDF, cuML and PyTorch. The machines are labeled as DC, SQL, WEB, DHCP, MAIL and SAP, and the dependent variable is the type of the machine. The features are selected from Windows Event Logs, which are in a tabular format. This is a first step towards learning the behaviours of certain types of machines in data-centres by classifying them probabilistically. It could help to detect unusual behaviour in a data-centre: for example, a compromised computer might be acting as a web or database server while still carrying its original tag.

This work could be expanded by using different log types or different events from the machines as features to improve accuracy. Various labels can be selected to cover different types of machines or data-centres.

## Library imports

In [1]:
from clx.analytics.asset_classification import AssetClassification
import cudf
from cuml.preprocessing import train_test_split
from cuml.preprocessing import LabelEncoder
import torch
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
import pandas as pd

## Initialize variables

A batch size of 10000 is chosen to optimise performance for this dataset. It can be changed depending on the data loading mechanism or the setup used.

`epochs` should also be adjusted depending on convergence for a specific dataset.

`label_col` is the index of the label column: the 19 features (18 categorical and one continuous) occupy columns 0 to 18, so the dependent variable is column 19. Feature names are listed below.

In [2]:
batch_size = 10000
label_col = '19'
epochs = 15

In [3]:
ac = AssetClassification()

## Read the dataset into a GPU dataframe with `cudf.read_csv()` 

The original data had many other fields, most of which were either static or mostly blank. After filtering those out, 18 meaningful columns were left. In this notebook we also add a fake continuous feature to show how continuous features can be included. When you are using raw data, the cell below needs to be uncommented.

In [4]:
# win_events_gdf = cudf.read_csv("raw_features_and_labels.csv")

```
win_events_gdf.dtypes

eventcode                                                       int64
keywords                                                       object
privileges                                                     object
message                                                        object
sourcename                                                     object
taskcategory                                                   object
account_for_which_logon_failed_account_domain                  object
detailed_authentication_information_authentication_package     object
detailed_authentication_information_key_length                float64
detailed_authentication_information_logon_process              object
detailed_authentication_information_package_name_ntlm_only     object
logon_type                                                    float64
network_information_workstation_name                           object
new_logon_security_id                                          object
impersonation_level                                            object
network_information_protocol                                  float64
network_information_direction                                  object
filter_information_layer_name                                  object
cont1                                                           int64
label                                                          object
dtype: object
```

### Define categorical and continuous feature columns.

In [5]:
cat_cols = [
    "eventcode",
    "keywords",
    "privileges",
    "message",
    "sourcename",
    "taskcategory",
    "account_for_which_logon_failed_account_domain",
    "detailed_authentication_information_authentication_package",
    "detailed_authentication_information_key_length",
    "detailed_authentication_information_logon_process",
    "detailed_authentication_information_package_name_ntlm_only",
    "logon_type",
    "network_information_workstation_name",
    "new_logon_security_id",
    "impersonation_level",
    "network_information_protocol",
    "network_information_direction",
    "filter_information_layer_name",
    "label"
]

In [6]:
cont_cols = [
    "cont1"
]

The following functions preprocess the categorical and continuous feature columns. The preprocessing can vary depending on what best fits your application and data.

In [7]:
def categorize_columns(cat_gdf):
    for col in cat_gdf.columns:
        cat_gdf[col] = cat_gdf[col].astype('str')
        cat_gdf[col] = cat_gdf[col].fillna("NA")
        cat_gdf[col] = LabelEncoder().fit_transform(cat_gdf[col])
        cat_gdf[col] = cat_gdf[col].astype('int16')
        
    return cat_gdf
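
To make the effect of `categorize_columns` concrete, here is a minimal CPU-only sketch in plain Python. It assumes that codes are assigned in sorted order of the distinct values, which is how scikit-learn's `LabelEncoder` behaves; the helper name `label_encode` and the sample values are hypothetical and are not part of CLX or cuML.

```python
def label_encode(values, missing="NA"):
    """Map each distinct value to an integer code (hypothetical helper)."""
    # Replace missing entries with a placeholder and stringify everything else.
    cleaned = [missing if v is None else str(v) for v in values]
    # Assumption: codes follow the sorted order of the distinct strings.
    classes = sorted(set(cleaned))
    lookup = {cls: code for code, cls in enumerate(classes)}
    return [lookup[v] for v in cleaned], classes


codes, classes = label_encode(["Kerberos", "NTLM", None, "Kerberos"])
print(classes)  # ['Kerberos', 'NA', 'NTLM']
print(codes)    # [0, 2, 1, 0]
```

The missing entry gets its own category rather than being dropped, so every row keeps a valid code.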

In [8]:
def normalize_conts(cont_gdf):
    means, stds = (cont_gdf.mean(0), cont_gdf.std(ddof=0))
    cont_gdf = (cont_gdf - means) / stds
    
    return cont_gdf
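
The normalization above is a standard z-score with a population standard deviation (`ddof=0`). A small plain-Python sketch of the same arithmetic follows; the helper name `zscore` and the sample column are made up for illustration.

```python
import statistics


def zscore(values):
    """Standardize a column to zero mean and unit population variance."""
    mean = statistics.fmean(values)
    # pstdev is the population standard deviation, i.e. ddof=0.
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


scaled = zscore([2, 4, 4, 4, 5, 5, 7, 9])
print(scaled)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

A constant column has a standard deviation of zero, so this sketch would raise a `ZeroDivisionError` on it; such columns are better dropped before normalizing.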

The preprocessing steps below are not executed in this notebook because the released data is already preprocessed.

In [9]:
#win_events_gdf[cat_cols] = categorize_columns(win_events_gdf[cat_cols])

In [10]:
#win_events_gdf[cont_cols] = normalize_conts(win_events_gdf[cont_cols])

Read the Windows Event data already preprocessed by the steps above.

In [11]:
win_events_gdf = cudf.read_csv("win_events_features_preproc.csv")

In [12]:
win_events_gdf.head()

Unnamed: 0,eventcode,keywords,privileges,message,sourcename,taskcategory,account_for_which_logon_failed_account_domain,detailed_authentication_information_authentication_package,detailed_authentication_information_key_length,detailed_authentication_information_logon_process,detailed_authentication_information_package_name_ntlm_only,logon_type,network_information_workstation_name,new_logon_security_id,impersonation_level,network_information_protocol,network_information_direction,filter_information_layer_name,cont1,label
0,0,1,0,15,0,4,22,0,0,5,0,1,932,38,3,6,1,1,-1.73203,1
1,14,1,0,7,0,5,22,3,2,6,1,6,932,25,3,6,1,1,-1.731988,0
2,14,1,0,7,0,5,22,3,2,6,1,6,932,25,3,6,1,1,-1.731945,0
3,14,1,0,7,0,5,22,3,2,6,1,6,932,25,3,6,1,1,-1.731903,0
4,14,1,0,7,0,5,22,3,2,6,1,6,932,25,3,6,1,1,-1.731861,0


### Split the dataset into training and test sets using the cuML `train_test_split` function
Column 19 contains the ground truth about the function of the machine that each log comes from, i.e. DC, SQL, WEB, DHCP, MAIL or SAP. Hence it is used as the label.

In [13]:
X_train, X_test, Y_train, Y_test = train_test_split(win_events_gdf, "label", train_size=0.9)
X_train["label"] = Y_train

In [14]:
X_train.head()

Unnamed: 0,eventcode,keywords,privileges,message,sourcename,taskcategory,account_for_which_logon_failed_account_domain,detailed_authentication_information_authentication_package,detailed_authentication_information_key_length,detailed_authentication_information_logon_process,detailed_authentication_information_package_name_ntlm_only,logon_type,network_information_workstation_name,new_logon_security_id,impersonation_level,network_information_protocol,network_information_direction,filter_information_layer_name,cont1,label
37674,0,1,0,15,0,4,22,0,0,5,0,1,0,2100,0,6,1,1,-0.142192,0
22227,0,1,0,15,0,4,22,0,0,5,0,1,0,6228,2,6,1,1,-0.794054,0
61787,1,0,0,14,0,4,17,5,0,0,0,4,530,25,3,6,1,1,0.875373,2
2010,0,1,0,15,0,4,22,4,1,7,3,1,932,7980,3,6,1,1,-1.647208,1
6284,0,1,0,15,0,4,22,0,0,5,0,1,932,8913,3,6,1,1,-1.466846,0


In [15]:
Y_train.unique()

0    0
1    1
2    2
3    3
4    4
5    5
Name: label, dtype: int64

### Print Labels
Make sure the test set contains all labels.

In [16]:
Y_test.unique()

0    0
1    1
2    2
3    3
4    4
5    5
Name: label, dtype: int64
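
The visual check above can also be automated. The sketch below is plain Python with a hypothetical helper, `missing_labels`, and a toy label list; it is not part of CLX.

```python
def missing_labels(expected, observed):
    """Return the expected class ids that never occur in `observed`."""
    return sorted(set(expected) - set(observed))


all_classes = range(6)                 # the six encoded machine types, 0..5
toy_test_labels = [0, 1, 2, 2, 4, 5]   # hypothetical split that lost class 3

print(missing_labels(all_classes, toy_test_labels))  # [3]
```

With the notebook's objects, something like `missing_labels(range(6), Y_test.to_array())` should return an empty list when the split covers every class.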

## Training 

Asset Classification training uses the fastai tabular model. More details can be found at https://github.com/fastai/fastai/blob/master/fastai/tabular/models.py#L6

Feature columns will be embedded so that they can be used as categorical values. The limit can be changed depending on the accuracy of the dataset.
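
If the limit mentioned above is read as a cap on the embedding width (an assumption on our part), one common heuristic is to use about half the number of categories in a column, capped at that limit. The sketch below illustrates only this heuristic; the helper name `embedding_sizes`, the default cap of 100, and the cardinalities are hypothetical and not taken from the CLX source.

```python
def embedding_sizes(cardinalities, limit=100):
    """Pair each column's cardinality with a capped embedding width."""
    # Assumed heuristic: about half the number of categories, at most `limit`.
    return [(n, min(limit, (n + 1) // 2)) for n in cardinalities]


# Hypothetical cardinalities for three categorical columns.
print(embedding_sizes([2, 15, 9000]))  # [(2, 1), (15, 8), (9000, 100)]
```

Raising the cap gives high-cardinality columns more capacity at the cost of more parameters.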

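Adam is the optimizer used in the training process; it is popular because it produces good results across a variety of tasks. Its paper notes that the bias corrections of the first and second moment estimates can be folded into the step size. With base learning rate $\alpha$ and moment decay rates $\beta_1$ and $\beta_2$, the effective step size at iteration $t$ is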

$$\alpha_{t}=\alpha \cdot \sqrt{1-\beta_{2}^{t}} /\left(1-\beta_{1}^{t}\right)$$

More details on Adam can be found at https://arxiv.org/pdf/1412.6980.pdf
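
To show where $\alpha_t$ enters the update, here is a minimal scalar sketch of Adam in plain Python. It follows the form given in the paper, where the parameter moves by $\alpha_t \cdot m_t / (\sqrt{v_t} + \hat{\epsilon})$; the function name `adam_minimize` and the toy objective are ours, and this is not the optimizer code that CLX or PyTorch actually runs.

```python
import math


def adam_minimize(grad, theta, alpha=0.01, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=2000):
    """Run scalar Adam using the bias-corrected step size alpha_t."""
    m = 0.0  # first moment estimate (running mean of gradients)
    v = 0.0  # second moment estimate (running mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Step size with both bias corrections folded in, as in the formula above.
        alpha_t = alpha * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
        theta -= alpha_t * m / (math.sqrt(v) + eps)
    return theta


# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3).
theta = adam_minimize(lambda x: 2 * (x - 3), theta=0.0)
print(theta)  # ends close to the minimizer x = 3
```

Early steps have a size of roughly `alpha` regardless of the gradient scale, which is one reason a learning rate such as 0.01 transfers reasonably well between problems.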

We have found that partitioning the dataframes with a batch size of 10000 gives the best data loading performance for this dataset. The **batch_size** argument can be adjusted for different dataset sizes.
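
As a rough picture of what a given batch size means for partitioning, the sketch below computes the row ranges that cover a dataframe in order. The helper name `batch_ranges` and the row count of 25000 are hypothetical, and this is not how CLX itself loads data.

```python
def batch_ranges(n_rows, batch_size):
    """Yield (start, stop) row ranges that cover n_rows in order."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


print(list(batch_ranges(25_000, 10_000)))
# [(0, 10000), (10000, 20000), (20000, 25000)]
```

The last batch is smaller whenever the row count is not a multiple of the batch size.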

In [17]:
cat_cols.remove("label")
ac.train_model(X_train, cat_cols, cont_cols, "label", batch_size, epochs, lr=0.01, wd=0.0)

training loss:  1.439008299144179
valid loss 1.036 and accuracy 0.684
training loss:  0.9147559678295825
valid loss 0.748 and accuracy 0.762
training loss:  0.6894188726639124
valid loss 0.620 and accuracy 0.812
training loss:  0.5769385592484341
valid loss 0.529 and accuracy 0.830
training loss:  0.4955211123959615
valid loss 0.460 and accuracy 0.850
training loss:  0.4377053847582359
valid loss 0.412 and accuracy 0.870
training loss:  0.3931454664111896
valid loss 0.373 and accuracy 0.885
training loss:  0.3547998978519864
valid loss 0.341 and accuracy 0.892
training loss:  0.32641247374150406
valid loss 0.314 and accuracy 0.897
training loss:  0.3033208168420156
valid loss 0.295 and accuracy 0.904
training loss:  0.28589000026317457
valid loss 0.281 and accuracy 0.911
training loss:  0.271534545184028
valid loss 0.269 and accuracy 0.915
training loss:  0.2614127159816973
valid loss 0.260 and accuracy 0.918
training loss:  0.2521680032575314
valid loss 0.252 and accuracy 0.919
traini

## Evaluation

In [18]:
pred_results = ac.predict(X_test, cat_cols, cont_cols).to_array()
true_results = Y_test.to_array()

In [19]:
# scikit-learn expects the ground truth first; the micro average is the same either way
f1_score_ = f1_score(true_results, pred_results, average='micro')
print('micro F1 score: %s'%(f1_score_))

micro F1 score: 0.9136313801924717
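
For single-label multiclass problems like this one, the micro-averaged F1 score equals plain accuracy, because every wrong prediction counts once as a false positive and once as a false negative. The plain-Python sketch below checks this on toy labels; the helper `micro_f1` is ours and is not the scikit-learn implementation.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes first."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            tp += (p == c and t == c)
            fp += (p == c and t != c)
            fn += (p != c and t == c)
    return 2 * tp / (2 * tp + fp + fn)


y_true = [0, 0, 1, 2, 2, 2]  # toy labels, not from the notebook
y_pred = [0, 1, 1, 2, 2, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(micro_f1(y_true, y_pred), accuracy)  # the two values agree
```

Because the score is dominated by the most frequent classes, a macro average or per-class metrics give a better view of the rarer machine types.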


In [20]:
torch.cuda.empty_cache()

In [21]:
labels = ["DC","DHCP","MAIL","SAP","SQL","WEB"]
a = confusion_matrix(true_results, pred_results)

In [22]:
pd.DataFrame(a, index=labels, columns=labels)

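| True \ Predicted | DC | DHCP | MAIL | SAP | SQL | WEB |
|---|---|---|---|---|---|---|
| DC | 3445 | 18 | 19 | 8 | 55 | 5 |
| DHCP | 128 | 639 | 2 | 4 | 2 | 0 |
| MAIL | 18 | 0 | 2593 | 7 | 14 | 0 |
| SAP | 27 | 0 | 10 | 158 | 4 | 0 |
| SQL | 244 | 0 | 10 | 22 | 614 | 19 |
| WEB | 59 | 0 | 1 | 2 | 31 | 51 |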


The confusion matrix shows that the function of some machines can be predicted really well, whereas others need more tuning or more features. This work can be improved and expanded to cover individual data-centres, creating a realistic map of the network using ML rather than relying only on naming conventions. It could also help to detect larger-scale anomalies, such as multiple machines not acting according to their tags.
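
To quantify which machine types are predicted well, per-class recall can be read directly off the confusion matrix above. The sketch below copies the matrix by hand into plain Python lists; rows are true classes and columns are predictions, following the `confusion_matrix(true_results, pred_results)` call.

```python
labels = ["DC", "DHCP", "MAIL", "SAP", "SQL", "WEB"]
# Confusion matrix from the cell above: rows are true classes, columns are predictions.
cm = [
    [3445, 18, 19, 8, 55, 5],
    [128, 639, 2, 4, 2, 0],
    [18, 0, 2593, 7, 14, 0],
    [27, 0, 10, 158, 4, 0],
    [244, 0, 10, 22, 614, 19],
    [59, 0, 1, 2, 31, 51],
]

# Recall per class: correct predictions divided by the row total.
recall = {name: row[i] / sum(row) for i, (name, row) in enumerate(zip(labels, cm))}
# Overall accuracy: the diagonal divided by the grand total.
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

for name, value in recall.items():
    print(f"{name}: {value:.3f}")
print(f"accuracy: {accuracy:.4f}")
```

On these numbers MAIL and DC reach a recall of about 0.99 and 0.97, while SQL and WEB reach only about 0.68 and 0.35, with most of their errors falling in the DC column. The accuracy of 0.9136 matches the micro F1 score reported earlier.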

## References:
* https://github.com/fastai/fastai/blob/master/fastai/tabular/models.py#L6
* https://jovian.ml/aakashns/04-feedforward-nn
* https://www.kaggle.com/dienhoa/reverse-tabular-module-of-fast-ai-v1
* https://github.com/fastai/fastai/blob/master/fastai/layers.py#L44