# The nPrint Machine Learning Pipeline

This notebook gives a **very** simple example of how to use nPrint in a generic machine learning pipeline, and shows how quickly we can train new models and test new ideas on network traffic. Note that the example is meant to demonstrate the pipeline, not to solve a hard problem. The traffic was collected from visits to a single website over the course of about 15 seconds.


### Requirements

nPrint must be installed and available on `$PATH` for the external commands to work.

### Directory Structure

There are two `pcap` files in this directory:
1. `port443.pcap` - a small trace of packets sent and received over HTTPS
2. `port80.pcap` - a small trace of packets sent and received over HTTP

## Generating nPrints from Traffic

First, let's generate nPrints from each traffic trace. Let's include **only** the TCP headers in the nPrints for now.
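A minimal sketch of this generation step, assuming that nPrint's `-t` flag selects the TCP headers (the `-P` and `-W` flags are used the same way in the commands later in this notebook). The helper `build_nprint_cmd` and the `.npt` output names are illustrative, not part of nPrint:

```python
import shutil
import subprocess

def build_nprint_cmd(pcap_path, out_path, header_flags=("-t",)):
    """Build an nprint command line; -t (TCP headers) is an assumed flag."""
    return ["nprint", "-P", pcap_path, *header_flags, "-W", out_path]

commands = [
    build_nprint_cmd("port80.pcap", "port80.npt"),
    build_nprint_cmd("port443.pcap", "port443.npt"),
]

for cmd in commands:
    if shutil.which("nprint") is None:
        # nPrint is not on $PATH, so only show what would be run
        print("would run:", " ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting issues, and swapping `header_flags` is enough to try a different header selection.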

Let's examine the nPrints, which can be loaded directly with Pandas.

In [2]:
import pandas as pd

friday = pd.read_csv('TrafficLabelling/Friday-WorkingHours-Morning.pcap_ISCX.csv', index_col=0)

print('Friday nPrint: Number of Packets: {0}, Features per packet: {1}'.format(friday.shape[0], friday.shape[1]))

Friday nPrint: Number of Packets: 191033, Features per packet: 84


In [14]:
friday

Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,...,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
192.168.10.3-192.168.10.50-3268-56108-6,192.168.10.50,56108,192.168.10.3,3268,6,7/7/2017 8:59,112740690,32,16,6448,...,32,3.594286e+02,1.199802e+01,380.0,343.0,16100000.0,4.988048e+05,16400000.0,15400000.0,BENIGN
192.168.10.3-192.168.10.50-389-42144-6,192.168.10.50,42144,192.168.10.3,389,6,7/7/2017 8:59,112740560,32,16,6448,...,32,3.202857e+02,1.574499e+01,330.0,285.0,16100000.0,4.987937e+05,16400000.0,15400000.0,BENIGN
8.0.6.4-8.6.0.1-0-0-0,8.6.0.1,0,8.0.6.4,0,0,7/7/2017 9:00,113757377,545,0,0,...,0,9.361829e+06,7.324646e+06,18900000.0,19.0,12200000.0,6.935824e+06,20800000.0,5504997.0,BENIGN
192.168.10.9-224.0.0.252-63210-5355-17,192.168.10.9,63210,224.0.0.252,5355,17,7/7/2017 9:00,100126,22,0,616,...,32,0.000000e+00,0.000000e+00,0.0,0.0,0.0,0.000000e+00,0.0,0.0,BENIGN
192.168.10.9-224.0.0.22-0-0-0,192.168.10.9,0,224.0.0.22,0,0,7/7/2017 9:00,54760,4,0,0,...,0,0.000000e+00,0.000000e+00,0.0,0.0,0.0,0.000000e+00,0.0,0.0,BENIGN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
192.168.10.3-192.168.10.14-53-51018-17,192.168.10.14,51018,192.168.10.3,53,17,7/7/2017 12:59,61452,4,2,180,...,20,0.000000e+00,0.000000e+00,0.0,0.0,0.0,0.000000e+00,0.0,0.0,BENIGN
192.168.10.3-192.168.10.14-53-49984-17,192.168.10.14,49984,192.168.10.3,53,17,7/7/2017 12:59,171,2,2,80,...,32,0.000000e+00,0.000000e+00,0.0,0.0,0.0,0.000000e+00,0.0,0.0,BENIGN
192.168.10.3-192.168.10.14-53-64015-17,192.168.10.14,64015,192.168.10.3,53,17,7/7/2017 12:59,222,2,2,90,...,32,0.000000e+00,0.000000e+00,0.0,0.0,0.0,0.000000e+00,0.0,0.0,BENIGN
192.168.10.17-198.100.147.178-123-123-17,192.168.10.17,123,198.100.147.178,123,17,7/7/2017 12:59,16842,1,1,48,...,20,0.000000e+00,0.000000e+00,0.0,0.0,0.0,0.000000e+00,0.0,0.0,BENIGN


## nPrint to Machine Learning Samples

We label each sample with its operating system by matching its source IP address against the host list published on the dataset's website: https://www.unb.ca/cic/datasets/ids-2017.html

In [None]:
import numpy as np

# Drop rows that contain infinite or missing values
friday = friday.replace([np.inf, -np.inf], np.nan)
friday = friday.dropna()

# Label the operating system based on IP address from this website: https://www.unb.ca/cic/datasets/ids-2017.html
ip_to_os = {
    '205.174.165.73': 'Kali', '205.174.165.69': 'Win', '205.174.165.70': 'Win', '205.174.165.71': 'Win',
    '192.168.10.50': 'Web server 16 Public', '205.174.165.68': 'Web server 16 Public',
    '192.168.10.51': 'Ubuntu server 12 Public', '205.174.165.66': 'Ubuntu server 12 Public',
    '192.168.10.19': 'Ubuntu 14.4, 32B', '192.168.10.17': 'Ubuntu 14.4, 64B',
    '192.168.10.16': 'Ubuntu 16.4, 32B', '192.168.10.12': 'Ubuntu 16.4, 64B',
    '192.168.10.9': 'Win 7 Pro, 64B', '192.168.10.5': 'Win 8.1, 64B', '192.168.10.8': 'Win Vista, 64B',
    '192.168.10.14': 'Win 10, pro 32B', '192.168.10.15': 'Win 10, 64B', '192.168.10.25': 'MAC',
}
# Source addresses that are not in the table are labeled "Other"
labels = friday[' Source IP'].map(lambda ip: ip_to_os.get(ip, 'Other'))

# Remove the identifying columns, then keep only the numeric features
samples = friday.drop([' Source IP', ' Source Port', ' Destination IP', ' Destination Port', ' Timestamp'], axis=1)
samples = samples.select_dtypes(include='number')
samples = samples.reset_index(drop=True)

samples = samples.astype(np.float32)
samples.dtypes.value_counts()
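The lookup-with-fallback used for labeling can be checked on a toy frame first. A small sketch with made-up rows; the column name `Source IP` (without the leading space of the real CSV header) and the two-entry mapping are illustrative assumptions:

```python
import pandas as pd

toy = pd.DataFrame({
    "Source IP": ["192.168.10.25", "8.8.8.8", "192.168.10.9"],
    "Flow Duration": [100, 200, 300],
})
ip_to_os_toy = {"192.168.10.25": "MAC", "192.168.10.9": "Win 7 Pro, 64B"}

# map() returns NaN for unknown addresses; fillna() turns those into "Other"
toy_labels = toy["Source IP"].map(ip_to_os_toy).fillna("Other")
print(toy_labels.tolist())  # ['MAC', 'Other', 'Win 7 Pro, 64B']
```

Passing the dictionary straight to `map` keeps the lookup vectorized, which matters on a frame with a few hundred thousand rows.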

## Training a Classifier

We are now ready to train and test a model on the traffic we gathered. Let's split the data into training and testing sets, train a model, and print a classification report.

In [132]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

# Split data
X_train, X_test, y_train, y_test = train_test_split(samples, labels)

# Initialize Classifier
clf = RandomForestClassifier(n_estimators=1000, max_depth=None, min_samples_split=2, random_state=0)

# Train 
clf.fit(X_train, y_train) 

# Predict
y_pred = clf.predict(X_test)

# Statistics

# First, let's get a stat report about the precision and recall:
report = classification_report(y_test, y_pred)
print(report)

# Let's also compute class probabilities while we're here, since the ROC AUC score requires probabilities instead of just the predictions
y_pred_proba = clf.predict_proba(X_test)
# predict_proba returns one column per class, ordered as in clf.classes_. With more than two classes there is no single "positive" column, so we keep the whole matrix for roc_auc_score

                         precision    recall  f1-score   support

                              0.98      0.99      0.98     13986
                   Kali       0.94      0.92      0.93       158
                    MAC       0.88      0.80      0.84      1738
       Ubuntu 14.4, 32B       0.59      0.61      0.60      3820
       Ubuntu 14.4, 64B       0.47      0.38      0.42      2522
       Ubuntu 16.4, 32B       0.48      0.40      0.43      2432
       Ubuntu 16.4, 64B       0.52      0.62      0.56      4160
Ubuntu server 12 Public       0.96      0.91      0.94       263
   Web server 16 Public       0.84      0.87      0.86       967
            Win 10, 64B       0.48      0.48      0.48      3270
        Win 10, pro 32B       0.44      0.36      0.40      2585
         Win 7 Pro, 64B       0.51      0.54      0.53      4426
           Win 8.1, 64B       0.57      0.59      0.58      4828
         Win Vista, 64B       0.52      0.51      0.51      2573

               accuracy

In [133]:
# Multiclass ROC AUC (one-vs-one) takes the full probability matrix; labels gives the column order
roc = roc_auc_score(y_test, y_pred_proba, multi_class="ovo", labels=clf.classes_)
print('ROC AUC Score: {0}'.format(roc))
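With more than two classes, `roc_auc_score` needs the full matrix returned by `predict_proba` (one column per class) together with a `multi_class` strategy, rather than a single column of positive-class probabilities. A minimal sketch on synthetic three-class data, not the traffic above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the real samples
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shape (n_samples, n_classes); columns follow model.classes_
proba = model.predict_proba(X_te)
# Each row is a probability distribution over the classes
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)

# One-vs-one averaging over all pairs of classes
auc = roc_auc_score(y_te, proba, multi_class="ovo", labels=model.classes_)
print(proba.shape, rows_sum_to_one, round(auc, 3))
```

Passing `labels=model.classes_` makes the column order explicit, which helps when a rare class happens to be missing from the test split.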

## Understanding the model

nPrint's alignment of each packet makes it possible to understand which specific features (parts of the packet) drive the model's performance. It turns out that the options set in the TCP header are actually more important than the port numbers themselves!

In [6]:
# Get Raw feature importances
feature_importances = clf.feature_importances_
# Match the feature names we know with the importances
named_importances = []
for column_name, importance in zip(nprint_80.columns, feature_importances):
    named_importances.append((column_name, importance))
# Sort the named feature importances
sorted_feature_importances = sorted(named_importances, key=lambda tup: tup[1], reverse=True)
# Now let's print the top 20 important features (bits)
print(*sorted_feature_importances[0:20], sep='\n') 

('tcp_opt_67', 0.044505100017426795)
('tcp_opt_6', 0.016730629081809993)
('tcp_opt_55', 0.01523586076750689)
('tcp_opt_20', 0.014805459966212969)
('tcp_opt_44', 0.014620986047823093)
('tcp_opt_40', 0.013847115275440285)
('tcp_opt_24', 0.013612515683690887)
('tcp_opt_48', 0.013313782454536936)
('tcp_opt_72', 0.013230183523847874)
('tcp_opt_77', 0.012961325676052151)
('tcp_opt_29', 0.012785034592834255)
('tcp_opt_32', 0.012665315285534077)
('tcp_opt_50', 0.01253854986829483)
('tcp_opt_37', 0.012524446118117365)
('tcp_opt_49', 0.012504783737309715)
('tcp_opt_42', 0.01246014726713846)
('tcp_opt_54', 0.012341988396905099)
('tcp_opt_68', 0.012192884127280043)
('tcp_opt_64', 0.01216579933207458)
('tcp_opt_75', 0.011726936074581147)
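The same ranking can be written more compactly with a pandas `Series` indexed by column name. A small sketch on synthetic data; the `feat_*` names are hypothetical stand-ins for nPrint's bit-level column names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
feature_names = [f"feat_{i}" for i in range(X.shape[1])]  # hypothetical names
X = pd.DataFrame(X, columns=feature_names)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair each importance with its column name, then sort from most to least important
importances = pd.Series(model.feature_importances_, index=X.columns)
top = importances.sort_values(ascending=False).head(5)
print(top)
```

Because the series is indexed by the training frame's own columns, the names cannot drift out of step with the importances.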


## Rapidly testing different versions of the problem

Now that we have a generic pipeline, we can leverage nPrint's flags to generate different versions of nPrints. Let's test a version of this classification problem using **only** the IPv4 headers of the packets.

In [7]:
# Generate nPrints
cmd_80 = 'nprint -P port80.pcap -4 -W port80.npt'
cmd_443 = 'nprint -P port443.pcap -4 -W port443.npt'
!{cmd_80}
!{cmd_443}

# Load nPrints
nprint_80 = pd.read_csv('port80.npt', index_col=0)
nprint_443 = pd.read_csv('port443.npt', index_col=0)

# Associate with Labels
samples = []
labels = []
for _, row in nprint_80.iterrows():
    samples.append(np.array(row))
    labels.append('unencrypted')

for _, row in nprint_443.iterrows():
    samples.append(np.array(row))
    labels.append('encrypted')
    
# Train and Test the Classifier
# Split data
X_train, X_test, y_train, y_test = train_test_split(samples, labels)
# Initialize Classifier
clf = RandomForestClassifier(n_estimators=1000, max_depth=None, min_samples_split=2, random_state=0)
# Train 
clf.fit(X_train, y_train) 
# Predict
y_pred = clf.predict(X_test)
# Statistics
report = classification_report(y_test, y_pred)
print(report)



              precision    recall  f1-score   support

   encrypted       1.00      1.00      1.00       628
 unencrypted       1.00      1.00      1.00       603

    accuracy                           1.00      1231
   macro avg       1.00      1.00      1.00      1231
weighted avg       1.00      1.00      1.00      1231
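Since every experiment repeats the same label, split, train, and report steps, they can be wrapped in one function. A sketch of such a helper; `train_and_report` is a hypothetical name, and the two random frames below are synthetic stand-ins for the nPrint DataFrames:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def train_and_report(frames_by_label, n_estimators=100, random_state=0):
    """Label every row of each DataFrame, train a random forest, return the report."""
    samples = pd.concat(list(frames_by_label.values()), ignore_index=True)
    labels = np.concatenate([
        np.full(len(frame), label) for label, frame in frames_by_label.items()
    ])
    X_train, X_test, y_train, y_test = train_test_split(
        samples, labels, stratify=labels, random_state=random_state)
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
    clf.fit(X_train, y_train)
    return classification_report(y_test, clf.predict(X_test))

# Synthetic stand-ins for the two nPrint DataFrames
rng = np.random.default_rng(0)
cols = [f"bit_{i}" for i in range(16)]
fake_80 = pd.DataFrame(rng.integers(0, 2, size=(200, 16)), columns=cols)
fake_443 = pd.DataFrame(rng.integers(0, 2, size=(200, 16)), columns=cols)
fake_443["bit_0"] = 1  # make one bit weakly informative

report = train_and_report({"unencrypted": fake_80, "encrypted": fake_443})
print(report)
```

With a helper like this, each new experiment reduces to regenerating the nPrints with different flags and calling the function on the loaded frames.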



How about testing with just the first 30 payload bytes of each packet?

In [8]:
# Generate nPrints
cmd_80 = 'nprint -P port80.pcap -p 30 -W port80.npt'
cmd_443 = 'nprint -P port443.pcap -p 30 -W port443.npt'
!{cmd_80}
!{cmd_443}

# Load nPrints
nprint_80 = pd.read_csv('port80.npt', index_col=0)
nprint_443 = pd.read_csv('port443.npt', index_col=0)

# Associate with Labels
samples = []
labels = []
for _, row in nprint_80.iterrows():
    samples.append(np.array(row))
    labels.append('unencrypted')

for _, row in nprint_443.iterrows():
    samples.append(np.array(row))
    labels.append('encrypted')
    
# Train and Test the Classifier
# Split data
X_train, X_test, y_train, y_test = train_test_split(samples, labels)
# Initialize Classifier
clf = RandomForestClassifier(n_estimators=1000, max_depth=None, min_samples_split=2, random_state=0)
# Train 
clf.fit(X_train, y_train) 
# Predict
y_pred = clf.predict(X_test)
# Statistics
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

   encrypted       0.64      0.65      0.64       630
 unencrypted       0.63      0.62      0.62       601

    accuracy                           0.63      1231
   macro avg       0.63      0.63      0.63      1231
weighted avg       0.63      0.63      0.63      1231



This is a much harder problem, with a much lower score. Many packets likely carry no payload at all, which makes it impossible to classify them from payload bytes alone! What if we remove those packets from our dataset?

In [9]:
# Load nPrints
nprint_80 = pd.read_csv('port80.npt', index_col=0)
nprint_443 = pd.read_csv('port443.npt', index_col=0)

# Associate with Labels
samples = []
labels = []
for _, row in nprint_80.iterrows():
    # Check for no payload, all bits will be -1. There are more efficient ways to do this
    if len(set(row)) == 1:
        continue
    samples.append(np.array(row))
    labels.append('unencrypted')

for _, row in nprint_443.iterrows():
    # Check for no payload, all bits will be -1. There are more efficient ways to do this
    if len(set(row)) == 1:
        continue
    samples.append(np.array(row))
    labels.append('encrypted')
    
# Train and Test the Classifier
# Split data
X_train, X_test, y_train, y_test = train_test_split(samples, labels)
# Initialize Classifier
clf = RandomForestClassifier(n_estimators=1000, max_depth=None, min_samples_split=2, random_state=0)
# Train 
clf.fit(X_train, y_train) 
# Predict
y_pred = clf.predict(X_test)
# Statistics
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

   encrypted       0.67      1.00      0.80       440
 unencrypted       1.00      0.24      0.39       284

    accuracy                           0.70       724
   macro avg       0.84      0.62      0.60       724
weighted avg       0.80      0.70      0.64       724
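The row-by-row payload check can also be done in one vectorized step. A small sketch on a toy frame; the `payload_bit_*` column names are illustrative, and the filter tests specifically for `-1`, whereas the `len(set(row)) == 1` check above drops any row whose values are all identical:

```python
import pandas as pd

toy = pd.DataFrame(
    [[-1, -1, -1, -1],   # no payload: every bit is -1
     [0, 1, 1, 0],       # payload present
     [1, 0, -1, -1]],    # short payload, padded with -1
    columns=["payload_bit_0", "payload_bit_1", "payload_bit_2", "payload_bit_3"],
)

# Keep rows where at least one value differs from -1
has_payload = (toy != -1).any(axis=1)
filtered = toy[has_payload]
print(filtered.index.tolist())  # [1, 2]
```

The boolean mask can be reused to filter a matching label array, so samples and labels stay aligned.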



## Conclusion

Hopefully this gives you a better idea of how nPrint can be used to rapidly train and test models for different traffic analysis problems. While this problem was contrived and simple, the same basic steps apply to any single-packet classification problem. If you want to train and test using **sets** of packets as input to a model, you'll need either a model that can handle that input, such as a CNN, or to flatten the 2D packet sample into a 1D sample for use with a model such as the random forest above.
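Flattening a set of packets can be sketched with a single NumPy reshape. The sizes and the random values below are illustrative assumptions, standing in for a block of consecutive nPrint rows:

```python
import numpy as np

n_packets, n_features = 5, 8  # illustrative sizes
rng = np.random.default_rng(0)

# One sample = a 2D block of consecutive packets (rows) by nPrint features (columns)
packet_block = rng.integers(-1, 2, size=(n_packets, n_features))

# Row-major flattening: packet 0's features, then packet 1's, and so on
flat_sample = packet_block.reshape(-1)

print(packet_block.shape, flat_sample.shape)  # (5, 8) (40,)
```

Every sample must use the same number of packets so that the flattened vectors have equal length; a CNN would instead consume `packet_block` directly as a 2D input.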