# The nPrint Machine Learning Pipeline

This notebook gives a **very** simple example of how to use nPrint in a generic machine learning pipeline, and shows how quickly we can train new models and test new ideas on network traffic. Note that the example is meant to show the pipeline, not to solve a hard problem: all of the traffic is to the same website and was collected over the course of about 15 seconds.


### Requirements

nPrint must be installed and available on your `$PATH` for the external commands to work.

### Directory Structure

There are two `pcap` files in this directory:

1. `port443.pcap` - a small trace of packets sent and received over HTTPS
2. `port80.pcap` - a small trace of packets sent and received over HTTP

# Generating nPrints from Traffic

First, let's generate nPrints from each traffic trace. Let's include **only** the TCP headers in the nPrints for now.

In [1]:
cmd_80 = 'nprint -P port80.pcap -t -W port80.npt'
cmd_443 = 'nprint -P port443.pcap -t -W port443.npt'
!{cmd_80}
!{cmd_443}

Let's examine the nPrints, which can be loaded directly with Pandas.

In [2]:
import pandas as pd

nprint_80 = pd.read_csv('port80.npt', index_col=0)
nprint_443 = pd.read_csv('port443.npt', index_col=0)

print('Port 80 nPrint: Number of Packets: {0}, Features per packet: {1}'.format(nprint_80.shape[0], nprint_80.shape[1]))
print('Port 443 nPrint: Number of Packets: {0}, Features per packet: {1}'.format(nprint_443.shape[0], nprint_443.shape[1]))

Port 80 nPrint: Number of Packets: 2421, Features per packet: 480
Port 443 nPrint: Number of Packets: 2500, Features per packet: 480


Both nPrints have the same number of features, 480, which is the maximum number of bits in a TCP header (60 bytes: a 20-byte fixed header plus up to 40 bytes of options). Let's look at the column names.

In [3]:
print(nprint_80.columns)
print(nprint_443.columns)

Index(['tcp_sprt_0', 'tcp_sprt_1', 'tcp_sprt_2', 'tcp_sprt_3', 'tcp_sprt_4',
       'tcp_sprt_5', 'tcp_sprt_6', 'tcp_sprt_7', 'tcp_sprt_8', 'tcp_sprt_9',
       ...
       'tcp_opt_310', 'tcp_opt_311', 'tcp_opt_312', 'tcp_opt_313',
       'tcp_opt_314', 'tcp_opt_315', 'tcp_opt_316', 'tcp_opt_317',
       'tcp_opt_318', 'tcp_opt_319'],
      dtype='object', length=480)
Index(['tcp_sprt_0', 'tcp_sprt_1', 'tcp_sprt_2', 'tcp_sprt_3', 'tcp_sprt_4',
       'tcp_sprt_5', 'tcp_sprt_6', 'tcp_sprt_7', 'tcp_sprt_8', 'tcp_sprt_9',
       ...
       'tcp_opt_310', 'tcp_opt_311', 'tcp_opt_312', 'tcp_opt_313',
       'tcp_opt_314', 'tcp_opt_315', 'tcp_opt_316', 'tcp_opt_317',
       'tcp_opt_318', 'tcp_opt_319'],
      dtype='object', length=480)


Notice how each bit (feature) is named according to the exact bit it represents in the packet, and all the possible bits of a TCP header are accounted for.
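
Because the bit index is always the last underscore-separated token, the header field a column belongs to can be recovered from its name. The sketch below assumes that naming pattern holds for every column; `field_of` is a hypothetical helper, and the short `columns` list stands in for `nprint_80.columns`.

```python
from collections import Counter

# Stand-in for nprint_80.columns, following the naming pattern shown above
columns = ['tcp_sprt_0', 'tcp_sprt_1', 'tcp_opt_0', 'tcp_opt_1', 'tcp_opt_2']

def field_of(column_name):
    # Hypothetical helper: drop the trailing bit index, 'tcp_opt_319' -> 'tcp_opt'
    return column_name.rsplit('_', 1)[0]

# Count how many bits (columns) each header field contributes
bits_per_field = Counter(field_of(name) for name in columns)
print(bits_per_field)
```

Running the same count over the real column index should show how the 480 bits are divided among the TCP header fields.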

## nPrint to Machine Learning Samples

Now we need to take each nPrint and make each packet a "sample" for the machine learning task at hand. In this case, we'll set up a supervised learning task where port 80 traffic is labeled "unencrypted" and port 443 traffic is labeled "encrypted".

In [4]:
import numpy as np

samples = []
labels = []
for _, row in nprint_80.iterrows():
    samples.append(np.array(row))
    labels.append('unencrypted')

for _, row in nprint_443.iterrows():
    samples.append(np.array(row))
    labels.append('encrypted')
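
Iterating row by row is easy to follow but slow on large traces. As an alternative, the same samples and labels can be built in one step with NumPy. This is a minimal sketch on two tiny stand-in DataFrames (`df_80` and `df_443` are hypothetical substitutes for `nprint_80` and `nprint_443`), assuming both have identical columns.

```python
import numpy as np
import pandas as pd

# Tiny stand-ins for nprint_80 and nprint_443: one row per packet, one column per bit
df_80 = pd.DataFrame([[0, 1, -1], [1, 1, -1]], columns=['b0', 'b1', 'b2'])
df_443 = pd.DataFrame([[1, 0, 0]], columns=['b0', 'b1', 'b2'])

# Stack all packets into one 2D array (rows = packets, columns = bits)
samples = np.vstack([df_80.to_numpy(), df_443.to_numpy()])

# Build the labels in the same order the rows were stacked
labels = np.array(['unencrypted'] * len(df_80) + ['encrypted'] * len(df_443))

print(samples.shape, labels.shape)
```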

## Training a Classifier

We're already ready to train and test a model on the traffic we gathered. Let's split the data into training and testing data, train a model, and get a stat report.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

# Split data
X_train, X_test, y_train, y_test = train_test_split(samples, labels)

# Initialize Classifier
clf = RandomForestClassifier(n_estimators=1000, max_depth=None, min_samples_split=2, random_state=0)

# Train 
clf.fit(X_train, y_train) 

# Predict
y_pred = clf.predict(X_test)

# Statistics

# First, let's get a stat report about the precision and recall:
report = classification_report(y_test, y_pred)
print(report)

# Let's also get the ROC AUC score while we're here, which requires a probability instead of just the prediction
y_pred_proba = clf.predict_proba(X_test)
# predict_proba returns one column per class, ordered as in clf.classes_, while roc_auc only needs the
# probability of the "positive" class. Column 1 corresponds to clf.classes_[1], here 'unencrypted'
y_pred_proba_pos = [sublist[1] for sublist in y_pred_proba]
roc = roc_auc_score(y_test, y_pred_proba_pos)
print('ROC AUC Score: {0}'.format(roc))


              precision    recall  f1-score   support

   encrypted       1.00      1.00      1.00       625
 unencrypted       1.00      1.00      1.00       606

    accuracy                           1.00      1231
   macro avg       1.00      1.00      1.00      1231
weighted avg       1.00      1.00      1.00      1231

ROC AUC Score: 1.0
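
Indexing column 1 of `predict_proba` works here because scikit-learn sorts class labels, which puts `'unencrypted'` second. A slightly more explicit variant looks the column up by name through `clf.classes_`. The sketch below uses synthetic, perfectly separable data in place of the nPrint samples, and `positive_class` is a hypothetical variable name chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic data standing in for the nPrint samples: the first bit decides the label
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10)
y = np.array(['encrypted', 'encrypted', 'unencrypted', 'unencrypted'] * 10)

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Find the probability column by class name instead of hard-coding index 1
positive_class = 'unencrypted'
positive_column = list(clf.classes_).index(positive_class)
scores = clf.predict_proba(X)[:, positive_column]

# Binarize the true labels so the positive class is unambiguous
y_true = (y == positive_class).astype(int)
roc = roc_auc_score(y_true, scores)
print(positive_column, roc)
```

This example scores the training data only to keep it short; in the notebook the same lookup would be applied to `X_test` and `y_test`.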


## Understanding the model

nPrint's alignment of each packet makes it possible to see which specific features (parts of the packet) are driving the model's performance. It turns out that the options set in the TCP header are actually more important than the port numbers themselves!

In [6]:
# Get Raw feature importances
feature_importances = clf.feature_importances_
# Match the feature names we know with the importances
named_importances = []
for column_name, importance in zip(nprint_80.columns, feature_importances):
    named_importances.append((column_name, importance))
# Sort the named feature importances
sorted_feature_importances = sorted(named_importances, key=lambda tup: tup[1], reverse=True)
# Now let's print the top 20 most important features (bits)
print(*sorted_feature_importances[0:20], sep='\n') 

('tcp_opt_67', 0.044505100017426795)
('tcp_opt_6', 0.016730629081809993)
('tcp_opt_55', 0.01523586076750689)
('tcp_opt_20', 0.014805459966212969)
('tcp_opt_44', 0.014620986047823093)
('tcp_opt_40', 0.013847115275440285)
('tcp_opt_24', 0.013612515683690887)
('tcp_opt_48', 0.013313782454536936)
('tcp_opt_72', 0.013230183523847874)
('tcp_opt_77', 0.012961325676052151)
('tcp_opt_29', 0.012785034592834255)
('tcp_opt_32', 0.012665315285534077)
('tcp_opt_50', 0.01253854986829483)
('tcp_opt_37', 0.012524446118117365)
('tcp_opt_49', 0.012504783737309715)
('tcp_opt_42', 0.01246014726713846)
('tcp_opt_54', 0.012341988396905099)
('tcp_opt_68', 0.012192884127280043)
('tcp_opt_64', 0.01216579933207458)
('tcp_opt_75', 0.011726936074581147)
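
Per-bit importances can also be summed per header field to compare, for example, all option bits against all source port bits. The following is a rough sketch that assumes the bit index is the last underscore-separated token of each column name; the importance values are made up for illustration and are not taken from the model above.

```python
import pandas as pd

# Made-up (column, importance) pairs; in the notebook these would come from
# zip(nprint_80.columns, clf.feature_importances_)
named_importances = [
    ('tcp_sprt_0', 0.10), ('tcp_sprt_1', 0.20),
    ('tcp_opt_0', 0.40), ('tcp_opt_1', 0.30),
]

importances = pd.Series(dict(named_importances))

# Strip the trailing bit index so 'tcp_opt_67' groups under 'tcp_opt'
field_names = importances.index.str.rsplit('_', n=1).str[0]

# Sum the importance of every bit that belongs to the same field
per_field = importances.groupby(field_names).sum().sort_values(ascending=False)
print(per_field)
```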


## Rapidly testing different versions of the problem

Now that we have a generic pipeline, we can leverage nPrint's flags to generate different versions of nPrints. Let's test a version of this classification problem using **only** the IPv4 headers of the packets.

In [7]:
# Generate nPrints
cmd_80 = 'nprint -P port80.pcap -4 -W port80.npt'
cmd_443 = 'nprint -P port443.pcap -4 -W port443.npt'
!{cmd_80}
!{cmd_443}

# Load nPrints
nprint_80 = pd.read_csv('port80.npt', index_col=0)
nprint_443 = pd.read_csv('port443.npt', index_col=0)

# Associate with Labels
samples = []
labels = []
for _, row in nprint_80.iterrows():
    samples.append(np.array(row))
    labels.append('unencrypted')

for _, row in nprint_443.iterrows():
    samples.append(np.array(row))
    labels.append('encrypted')
    
# Train and Test the Classifier
# Split data
X_train, X_test, y_train, y_test = train_test_split(samples, labels)
# Initialize Classifier
clf = RandomForestClassifier(n_estimators=1000, max_depth=None, min_samples_split=2, random_state=0)
# Train 
clf.fit(X_train, y_train) 
# Predict
y_pred = clf.predict(X_test)
# Statistics
report = classification_report(y_test, y_pred)
print(report)



              precision    recall  f1-score   support

   encrypted       1.00      1.00      1.00       628
 unencrypted       1.00      1.00      1.00       603

    accuracy                           1.00      1231
   macro avg       1.00      1.00      1.00      1231
weighted avg       1.00      1.00      1.00      1231
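
Since every variant of the experiment repeats the same label, split, train, and report steps, they can be wrapped in a single function. The sketch below is one possible shape for such a helper; `train_and_report` is a hypothetical name, the fixed `random_state` on the split and the smaller forest are choices made here for a quick, repeatable demo, and the two DataFrames are synthetic stand-ins for loaded nPrints.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def train_and_report(df_unencrypted, df_encrypted, n_estimators=100):
    """Hypothetical helper: label, split, train, and report in one call."""
    samples = np.vstack([df_unencrypted.to_numpy(), df_encrypted.to_numpy()])
    labels = ['unencrypted'] * len(df_unencrypted) + ['encrypted'] * len(df_encrypted)
    # A fixed random_state makes the split repeatable between runs
    X_train, X_test, y_train, y_test = train_test_split(samples, labels, random_state=0)
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    clf.fit(X_train, y_train)
    return classification_report(y_test, clf.predict(X_test))

# Synthetic stand-ins for two nPrints that differ only in their first bit
df_a = pd.DataFrame({'bit_0': [0] * 20, 'bit_1': [0, 1] * 10})
df_b = pd.DataFrame({'bit_0': [1] * 20, 'bit_1': [0, 1] * 10})
report = train_and_report(df_a, df_b)
print(report)
```

In the notebook, the two arguments would be the DataFrames loaded from `port80.npt` and `port443.npt` after each nPrint run.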



How about testing using just the first 30 payload bytes in each packet?

In [8]:
# Generate nPrints
cmd_80 = 'nprint -P port80.pcap -p 30 -W port80.npt'
cmd_443 = 'nprint -P port443.pcap -p 30 -W port443.npt'
!{cmd_80}
!{cmd_443}

# Load nPrints
nprint_80 = pd.read_csv('port80.npt', index_col=0)
nprint_443 = pd.read_csv('port443.npt', index_col=0)

# Associate with Labels
samples = []
labels = []
for _, row in nprint_80.iterrows():
    samples.append(np.array(row))
    labels.append('unencrypted')

for _, row in nprint_443.iterrows():
    samples.append(np.array(row))
    labels.append('encrypted')
    
# Train and Test the Classifier
# Split data
X_train, X_test, y_train, y_test = train_test_split(samples, labels)
# Initialize Classifier
clf = RandomForestClassifier(n_estimators=1000, max_depth=None, min_samples_split=2, random_state=0)
# Train 
clf.fit(X_train, y_train) 
# Predict
y_pred = clf.predict(X_test)
# Statistics
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

   encrypted       0.64      0.65      0.64       630
 unencrypted       0.63      0.62      0.62       601

    accuracy                           0.63      1231
   macro avg       0.63      0.63      0.63      1231
weighted avg       0.63      0.63      0.63      1231



A much harder problem, with a much lower score. Many packets likely carry no payload at all, which makes them impossible to classify from payload bytes alone. What if we remove those packets from our dataset?

In [9]:
# Load nPrints
nprint_80 = pd.read_csv('port80.npt', index_col=0)
nprint_443 = pd.read_csv('port443.npt', index_col=0)

# Associate with Labels
samples = []
labels = []
for _, row in nprint_80.iterrows():
    # Skip packets with no payload: every bit is -1, so the row holds one distinct value. There are more efficient ways to do this
    if len(set(row)) == 1:
        continue
    samples.append(np.array(row))
    labels.append('unencrypted')

for _, row in nprint_443.iterrows():
    # Skip packets with no payload: every bit is -1, so the row holds one distinct value. There are more efficient ways to do this
    if len(set(row)) == 1:
        continue
    samples.append(np.array(row))
    labels.append('encrypted')
    
# Train and Test the Classifier
# Split data
X_train, X_test, y_train, y_test = train_test_split(samples, labels)
# Initialize Classifier
clf = RandomForestClassifier(n_estimators=1000, max_depth=None, min_samples_split=2, random_state=0)
# Train 
clf.fit(X_train, y_train) 
# Predict
y_pred = clf.predict(X_test)
# Statistics
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

   encrypted       0.67      1.00      0.80       440
 unencrypted       1.00      0.24      0.39       284

    accuracy                           0.70       724
   macro avg       0.84      0.62      0.60       724
weighted avg       0.80      0.70      0.64       724
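
The payload check in the cell above can also be done without a Python loop. The following is a minimal sketch of one such approach using a boolean mask. The DataFrame is a tiny stand-in for a payload-only nPrint, and its column names are assumed for illustration.

```python
import pandas as pd

# Tiny stand-in for a payload-only nPrint: -1 marks bits that are absent from the packet
nprint = pd.DataFrame(
    [[-1, -1, -1], [0, 1, 1], [-1, -1, -1], [1, 0, -1]],
    columns=['payload_bit_0', 'payload_bit_1', 'payload_bit_2'],
)

# True for rows where every bit is -1, meaning the packet carried no payload
no_payload = (nprint == -1).all(axis=1)

# Keep only the packets that have at least one real payload bit
with_payload = nprint[~no_payload]
print(len(with_payload))
```

Unlike `len(set(row)) == 1`, which drops any row holding a single repeated value, this mask drops only rows that are entirely -1.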



## Conclusion

Hopefully this gives you a better idea of how nPrint can be used to rapidly train and test models for different traffic analysis problems. While this problem was contrived and simple, the same basic steps can be performed for any single-packet classification problem. If you want to train and test using **sets** of packets as input to a model, you'll need either a model that can handle that input, such as a CNN, or to flatten the 2D packet sample into a 1D sample for use with a model such as the random forest above.
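
As a small illustration of the flattening option, the sketch below turns a hypothetical sample of 3 packets with 4 bits each into a single 1D vector. The sizes are arbitrary; note that every sample would need the same number of packets for the flattened vectors to have equal length.

```python
import numpy as np

# Hypothetical multi-packet sample: 3 packets (rows), 4 bits per packet (columns)
packets = np.array([
    [1, 0, -1, -1],
    [0, 1, 1, -1],
    [1, 1, 0, 0],
])

# Row-major flattening places the packets back to back in one 1D vector
flat_sample = packets.reshape(-1)
print(flat_sample.shape)
```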