# Realtime Anomalous Protocol Detection

## Overview

In this lab, we will explore one of the ideas that we started thinking about in book 2.  If you recall, in one of the statistics sections we examined files and applied statistics to determine if data was encrypted.  Along the way, we noticed that files of the same type have the same starting bytes, or header.

In a similar way, application layer protocols (SSH, HTTP, TLS, SMTP, Kerberos, NTP, etc.) have a header structure.  Can we train a network to correctly identify different protocols?  More importantly, can we build a system that can alert us when an unknown protocol is seen?

## Goals

By the end of this lab, you should be able to:

 * Convert bytes to a stream of bits
 * Apply all of the lessons learned so far to build a new neural network for protocol categorization
 * Learn about categorical classification as opposed to the logistic regressions that we have performed so far
 * Understand how to know if your network has *failed* to classify something
 
## Estimated Time: 45-60 minutes

# Obtaining the Data

Unlike the rest of our labs, this lab will *not* start with the low level data acquisition.  Instead, we performed some initial preparation of the data for you.  Note that *this is not our preference.*

In this lab, we will be looking at how we can interface a neural network to a network monitoring tool, Zeek.  Zeek provides a Python library, broker, that allows us to connect to and communicate with a Zeek cluster.  Unfortunately, this library is only available if you build and install Zeek on your local system.  Installing Zeek is *well* outside of the goals of this course, especially since it isn't even supported on platforms like Windows!

Because of this, we have used Zeek to extract the necessary data and stored it in the `../data/Day 4/` directory tree for your use.

# <img src="../images/task.png" width=20 height=20> Task 5.1

Please use the following cell to:
 * Import the *numpy* library with the typical `np` alias
 * Import *tensorflow* as `tf`
 * Import the `models` and `layers` packages from `tensorflow.keras`
 * Import the `os` library -- we will need this to find and load our training and test data
 

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras import models, layers
import os


## Preprocessing the Training Data
We have taken the time to generate training data that contains the first 16 raw bytes of data from a variety of types of communication.  Take note that we are looking at the first 16 bytes of the *stream*, not each *packet*.  The training data can be found in the `../data/Day 4/content` directory (relative to the location of this notebook).  The name of each file should be used as the training label for the data in the file.

# <img src="../images/task.png" width=20 height=20> Task 5.2

Implement Python code that:

 * Defines a function that obtains a list of all of the files in the training data directory
 * Defines a function that can read in a file containing training data, returning the data as a list of strings that have had leading and trailing whitespace and newlines stripped off
 
Each line in each file contains a row of data.  What is the shortest length row across the entire data set?  What is the longest length?


In [2]:
PATH = "../data/Day 4/content/"
def get_file_list():
    return os.listdir(PATH)

def get_file(filename):
    with open(f'{PATH}{filename}', 'r') as f:
        return [s.strip() for s in f.readlines()]
    
def min_max(list_of_strings):
    x = [len(s) for s in list_of_strings]
    return (min(x), max(x))

for f in get_file_list():
    print(min_max(get_file(f)))

(32, 32)
(32, 32)
(32, 32)
(30, 32)
(32, 32)
(32, 32)
(2, 32)
(32, 32)
(32, 32)
(32, 32)
(12, 32)
(32, 32)
(32, 32)
(32, 32)
(18, 32)
(32, 32)
(32, 32)
(32, 32)
(32, 32)
(32, 32)


Assuming that all has gone well, you have likely found that all of the files have a maximum line length of 32, but several of them have lines that are significantly shorter.  The shortest has at least one line with a length of two!

Why are some of the lines shorter?  Didn't we say that Zeek was giving us the first 16 bytes of the stream?  For that matter, why do we have lengths of 32?  That's twice 16!  Part of the answer has to do with how the data has been encoded.

Getting data out of Zeek is not very difficult, but Zeek is very resistant to sending that content as binary data. If we wanted to save the content or otherwise manipulate it within Zeek, there would be no problem. However, relaying it to another host using broker in binary is impossible. To help you solve this problem, we are making use of a built-in Zeek function to convert the binary data into the ASCII representation of the hexadecimal. It turns out that this works just fine!

This actually explains why most of the lines are 32 characters long. Every byte can be represented by two hexadecimal digits.  These hexadecimal digits are then converted into ASCII which is what we see in our files.

This explains why most lines are 32 characters long, but why are some lines shorter? The answer is that we are using real network data, and not all network streams have at least 16 bytes in them! This is totally normal, but it does represent a problem for us since we will need to vectorize these.  In this workbook we will use the strategy of padding.
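
To make the relationship between stream length and line length concrete, here is a small sketch using Python's own `bytes.hex()` and `bytes.fromhex()`.  These stand in for the conversion that happens on the Zeek side, and both payloads are made up for illustration.

```python
# Hypothetical payloads: a full 16-byte stream prefix and a stream
# that only ever carried a single byte.
full_prefix = b"GET /index.html "   # exactly 16 bytes
short_prefix = b"\xff"              # a one-byte stream

# Each byte becomes two hexadecimal characters.
full_hex = full_prefix.hex()
short_hex = short_prefix.hex()

print(len(full_prefix), len(full_hex))    # 16 32
print(len(short_prefix), len(short_hex))  # 1 2

# Decoding reverses the process and recovers the original bytes.
assert bytes.fromhex(full_hex) == full_prefix
```

A 16-byte prefix always yields 32 characters, while anything shorter yields fewer, which is exactly the pattern seen in the files.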

> ### Optional
> While it is optional when you complete this lab on your own, your instructor will almost certainly examine some of the short lines.  While we very much believe in using real world data as much as possible, having just one byte as input seems suspicious.  More than likely, Zeek somehow missed the bytes or dropped the bytes or there was some other odd occurrence when the data was collected.  We really don't want that to affect our model.


> ### Applying Padding
> When padding our data, we could theoretically use anything for that padding. As long as we are consistent, the actual value used does not matter. Conveniently, Python provides justification functions that we can use with string-based data. We can use these functions to pad the strings in the files that are shorter than 32 to be 32 characters long.  What shall we pad them with? Let's use zeros. If we right-pad the data (left-justify) with "0", we can add what amounts to nulls to the data as padding.

# <img src="../images/task.png" width=20 height=20> Task 5.3

Implement a function in Python that takes a string of any arbitrary length.  If the string is longer than a specified length, the string is truncated.  If the string is shorter than that length, it is padded to that length with ASCII zeros.

***Hint:*** The `ljust()` method is available on Python strings.  Examine the help for this function and determine how to use it to pad the string out to the appropriate length.

In [3]:
def normalize_string(string, length=32):
    if len(string) >= length:
        return string[:length]
    return string.ljust(length, "0")

print(normalize_string('XX'))
print(normalize_string('012345678901234567890123456789012345678901234567890123456789'))

XX000000000000000000000000000000
01234567890123456789012345678901


We now have 16 bytes of data on each line as hexadecimal values encoded using ASCII.  We need to convert these into 16 bytes of binary data.  Ultimately, we need to convert them into an array of 128 bits.

One way to accomplish this is to convert the ASCII to bytes, then convert the bytes into 8-value bit arrays.  These bit arrays can then be pushed into a NumPy array in order to flatten them to a single vector of 128 bits.  If we implement this as functions, we can make use of them later if we decide to connect this to a real world monitoring system.

# <img src="../images/task.png" width=20 height=20> Task 5.4

Implement a set of functions that will:

  * Convert an ASCII encoded string of hexadecimal bytes into binary bytes
  * Convert an array of binary bytes into a numpy array of bits (0s and 1s)
  

***Hint:*** You may find the `ord()` built-in function useful to convert a character byte into an integer byte.


In [4]:
def ascii_to_bytes(string):
    # Define a list to hold the bytes
    byte_list = []
    # Traverse the string by twos
    for i in range(0, len(string), 2):
        # Convert the two character hex value to an integer, translating from Base 16
        i_val = int(string[i:i+2], 16)
        # Convert the resulting integer into a 1 byte value using the `chr()` function
        byte_val = chr(i_val)
        # Append that byte to the list of bytes
        byte_list.append(byte_val)
    return byte_list

def bytes_to_bits(byte_list):
    bit_list = []
    # Iterate over all of the bytes in the list
    for byte in byte_list:
        # For each byte, we need to test each bit to see if it's on or off
        # We can do this using bit shifting.  We'll start with the  high order
        # bit and work our way down, so we shift from 7 first, 0 last.
        for shift in [7, 6, 5, 4, 3, 2, 1, 0]:
            # Define the mask to isolate a bit by shifting a 1 left
            mask = 1 << shift
            # Apply the mask to isolate the bit in place
            the_shifted_bit = ord(byte) & mask
            # Shift the bit down to the bottom and capture it
            the_bit = the_shifted_bit >> shift
            # Add the bit to our growing list of bits
            bit_list.append(the_bit)
    return np.array(bit_list)

print(bytes_to_bits(ascii_to_bytes("5AFFFFFFAA")))

[0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1
 0 1 0]
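
The explicit loops above make every step visible.  As a point of comparison, the same conversion can be sketched with NumPy's `unpackbits()`, which expands each byte into eight bits with the most significant bit first.  The function name `hex_to_bits` is hypothetical and is not used elsewhere in this lab.

```python
import numpy as np

def hex_to_bits(hex_string):
    """Convert an ASCII hex string into a flat array of bits (MSB first)."""
    # bytes.fromhex() requires an even number of hex characters.
    raw = bytes.fromhex(hex_string)
    # View the raw bytes as unsigned 8-bit integers without copying.
    byte_array = np.frombuffer(raw, dtype=np.uint8)
    # Expand every byte into its 8 bits.
    return np.unpackbits(byte_array)

bits = hex_to_bits("5AFFFFFFAA")
print(bits)
```

One difference to keep in mind: this version returns an array of `uint8` values, whereas the loop-based version returns NumPy's default integer type.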


# <img src="../images/task.png" width=20 height=20> Task 5.5

Let's collapse all of this down into a single function.  Please implement a function named `content_to_features()` that:

 * Accepts an arbitrary length string of ASCII encoded hexadecimal data
 * Truncates or justifies the strings to a defined length (32 in our case), returning a numpy array of bits that represent that data.

Test your function by sending it an example from the content data provided above.

*Note:* You are really just writing a wrapper function for what has already been accomplished.

In [5]:
def content_to_features(content_string):
    return bytes_to_bits(ascii_to_bytes(normalize_string(content_string)))

content_to_features(get_file(get_file_list()[0])[0])

array([1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

> #### Of Chickens and Eggs
> In the future, we may want to have our system automatically classifying every connection that occurs.  While we could use something that will parse raw packets and pcaps, this might not be the best way to go.  Why not?  Simply because we would have to write what amounts to a very complex piece of code to find streams, reassemble them, and then deliver us only 16 bytes of them.  If we think about the problem in terms of what it is that we need, we can then evaluate if there is *already* some tool or system that we might have in place (or could put into place easily) that can produce the data that we need.  In this case, the answer is yes:  **Zeek**
>
> Knowing that we will be using Zeek in our future can help to define our present.  In this case, the present from the perspective of this notebook is the generation of the data that will be used for training.  Since our design specifies that we will use Zeek in the future, it only makes sense to use Zeek to generate the stream content that will be used for training, too!
>
> Unless *everything* is implemented in Zeek, which would require us to integrate our Keras/TensorFlow model into C++ (which doesn't seem pleasant), we will have to make a trade-off.  Why?  Because Zeek is limited in the ways that we can access the raw content of the packets.  In fact, it's less that the *access* is limited and more that the ways that we can *render* that data are limited.  Specifically, Zeek will not allow us to easily work with the raw bytes as raw bytes and save them as such unless we export the stream as a file... which isn't our goal.
>
> Add to this that, in the future, we will want to have some Python process running that is receiving stream content via Broker (which is then fed to our trained network for classification) and it makes an enormous amount of sense to pass that content around as ASCII.
>

### Thinking Ahead
Think about the code above and how it might be used.  While the code that loads in files and data from files is specific to training, what about the rest of the code for converting that content?  Couldn't we have just written that all into the code that reads the files?

When we have finished building our network, we are going to want to push data through the network.  To do so, we will need to preprocess that data into the same set of features that we are using for training.  Rather than recreate that code later, it makes sense to abstract it into a function now.  Our `normalize_string()` helper also accepts a second argument, `length`, the number of hexadecimal characters (two per byte) to keep.  This leaves room to later change our mind about how many bytes of each stream we wish to capture.

## Building the Training Data

The only major task that remains before we can build our network is that of creating a training dataset with labels.  It turns out that we already have the labels; the labels are the names of the files!  So, then, we need to process each file, assigning the name of the file as the category.

# <img src="../images/task.png" width=20 height=20> Task 5.6

Implement Python code that will read in all of the training data, building an in-memory numpy array where:

 * The first column is the label, which is the name of the file
 * The other 128 columns are the bits for the respective line in the files
 * Though it is counter-intuitive, your numpy array should be an array of *strings*
 
Your in-memory array should be named `training_data_and_labels`

***Hint:*** We went through some work to turn our bytes into a NumPy array of bits.  You might puzzle over why we now convert that array into strings and attach the protocol name.  Don't forget that our NumPy array needs to have a consistent data type!  When/if we use this network in production, we would never convert the bits from strings to bytes to bits and then back to strings!  We are only doing that *for now* because it makes it easier to manipulate the training dataset.  We will convert the bits back to integer values before too much longer.

In [6]:
all_data = []
for protocol in get_file_list():
    for row_of_data in get_file(protocol):
        string = f'{protocol},{",".join(map(str,content_to_features(row_of_data)))}'
        row = np.array(string.split(','))
        all_data.append(row)
training_data_and_labels = np.array(all_data)

In [7]:
training_data_and_labels.shape

(596266, 129)

If the above has gone well, you should now have a numpy array named `training_data_and_labels` that has the shape `(596266, 129)`.

## Shuffling the data

Currently, all of the data for a single protocol is clustered together in our array.  While TensorFlow will randomize data within batches, we want to pull some of this data aside to be used as test data.  Slicing the array is the easiest way to do it, but with the protocols clustered this won't be a great approach at this point.  Instead, we first need to shuffle the data.

This is also the reason that we converted our bits to strings in the previous task.  By doing so, it's easy to shuffle our array *with the labels still in place.*  Otherwise, we would have to figure out a way to shuffle both the labels and the training data the same way at the same time!
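
For comparison, one common way to shuffle two separate arrays "the same way at the same time" is to draw a single random permutation of the row indices and apply it to both.  The sketch below uses made-up toy data; it is an alternative to the string-array approach used in this lab, not a replacement for it.

```python
import numpy as np

# Hypothetical toy data: each feature row is filled with the value of its label,
# which makes it easy to see whether rows and labels stay paired.
features = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])
labels = np.array([0, 1, 2, 3])

# Draw one random ordering of the row indices...
rng = np.random.default_rng(seed=42)
order = rng.permutation(len(features))

# ...and apply that same ordering to both arrays.
shuffled_features = features[order]
shuffled_labels = labels[order]

print(shuffled_features)
print(shuffled_labels)
```

Because both arrays are indexed with the same `order`, every feature row keeps its original label.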

# <img src="../images/task.png" width=20 height=20> Task 5.7

Write Python code to:

 * Shuffle the `training_data_and_labels` array
 * Set aside 10,000 rows as test data
 * Peel the training data and training labels apart into two arrays, `raw_x_train` and `raw_y_train`
 * Peel the test data and labels apart into two arrays, `raw_x_test` and `raw_y_test`

In [8]:
np.random.shuffle(training_data_and_labels)
raw_x_train = training_data_and_labels[:-10000, 1:]
raw_y_train = training_data_and_labels[:-10000, 0]
raw_x_test = training_data_and_labels[-10000:, 1:]
raw_y_test = training_data_and_labels[-10000:, 0]


# <img src="../images/task.png" width=20 height=20> Task 5.8

The entire dataset has been shuffled, the train and test data are now pulled apart, and the labels isolated.  Our labels are currently strings.  There's nothing wrong with this, but we need to encode them in some way so that we can connect them to our model.

When doing a categorical model, such as this one, we will typically have an output layer with one neuron for each of the possible categories.  To make use of this pattern, we need to convert our labels into something that we can map to these output neurons.  It makes a lot of sense to arbitrarily assign a range of numbers to the labels and then use those numbers to indicate which neuron goes with which label.

 * Determine how many unique labels are present in `raw_y_test` and `raw_y_train`.  The number *should be the same!*
 * Generate an array that can be used to map each label to an ordinal number.  Name this array `unique_labels`
 * Create two new arrays, `y_train` and `y_test`, that contain the mapped values for the labels

In [9]:
assert len(np.unique(raw_y_test)) == len(np.unique(raw_y_train))

unique_labels = np.unique(raw_y_test)
label_dictionary = {protocol:i for i,protocol in enumerate(unique_labels)}
y_train = np.array([label_dictionary[proto] for proto in raw_y_train])
y_test = np.array([label_dictionary[proto] for proto in raw_y_test])

for i in range(10):
    print(f'Original: {raw_y_test[i]} Mapped: {y_test[i]} Translated: {unique_labels[y_test[i]]}')

Original: SSDP Mapped: 15 Translated: SSDP
Original: QTLSRequest Mapped: 11 Translated: QTLSRequest
Original: DNS Mapped: 3 Translated: DNS
Original: SSDP Mapped: 15 Translated: SSDP
Original: QTLSRequest Mapped: 11 Translated: QTLSRequest
Original: QTLSRequest Mapped: 11 Translated: QTLSRequest
Original: BootServiceDiscoveryProtocol Mapped: 0 Translated: BootServiceDiscoveryProtocol
Original: SSDP Mapped: 15 Translated: SSDP
Original: BootServiceDiscoveryProtocol Mapped: 0 Translated: BootServiceDiscoveryProtocol
Original: SSDP Mapped: 15 Translated: SSDP
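
As an aside, `np.unique()` can produce both the sorted label names and the integer codes in one call through its `return_inverse` argument.  The labels below are a made-up sample, shown only to illustrate the round trip.

```python
import numpy as np

# Hypothetical sample of raw string labels
raw_labels = np.array(["SSDP", "DNS", "SSDP", "HTTP", "DNS"])

# label_names holds the sorted unique labels; label_ids holds, for every
# original entry, the index of its label within label_names.
label_names, label_ids = np.unique(raw_labels, return_inverse=True)

print(label_names)
print(label_ids)

# Indexing the names with the ids translates back to the original strings.
assert (label_names[label_ids] == raw_labels).all()
```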


# <img src="../images/task.png" width=20 height=20> Task 5.9

Now that our labels are converted to numbers, we need to translate them to something that the neural network can use.  While we could attempt to do a regression, training our model as though the labels represent some continuous range of values, that's not ideal here.  This really is a categorical problem.  We already stated that this means that we need to have a neuron for each possible category.  How can we map the two?  *One-hot encoding* is the answer!

We are already familiar with multi-hot encoding.  One-hot encoding, as the name implies, means that a single column in our vector is on.  TensorFlow provides a useful convenience function that will take an integer encoding and convert it to a one-hot encoded value:  `tf.keras.utils.to_categorical()`
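
To see what one-hot encoding does to an integer label, here is a minimal NumPy sketch of the idea.  It is not the TensorFlow implementation, and the helper name `one_hot` is hypothetical.

```python
import numpy as np

def one_hot(label_ids, num_classes):
    """Return one row per label with a single 1 in the label's column."""
    # np.eye() builds an identity matrix; picking row i yields a vector
    # that is 1 at position i and 0 everywhere else.
    return np.eye(num_classes, dtype=np.float32)[label_ids]

encoded = one_hot(np.array([3, 0, 1]), num_classes=4)
print(encoded)
```

Each row contains exactly one `1`, and its position matches the integer label, which is what lets us line the labels up with the output neurons.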

In the cell below, write Python code to one-hot encode the `y_train` and `y_test` arrays into new arrays, `y_train_onehot` and `y_test_onehot` respectively.

In [10]:
y_train_onehot = tf.keras.utils.to_categorical(y_train)
y_test_onehot = tf.keras.utils.to_categorical(y_test)

# <img src="../images/task.png" width=20 height=20> Task 5.10

One final bit of housekeeping is required.  Currently, because of how we attached our labels for shuffling, our training data and test data are numpy arrays of strings.

Write Python code to create two new arrays, `x_train` and `x_test`.  Populate these with the `raw_x_train` and `raw_x_test` arrays converted to an integer datatype such as `np.int32`.

In [11]:
x_train = raw_x_train.astype(np.int32)
x_test = raw_x_test.astype(np.int32)


### Stand Back...
If you stand back and think about how we are working our way through this, you will realize that we are creating a *general* solution for categorizing network content rather than a *specific* solution that categorizes these twenty types of data.  How would you get this existing system to categorize additional data types?

Really, all you need to do is create another input file containing training samples for the protocol of interest and put it into the input directory!  With that done the model can be quickly retrained *without making any other changes to any of this code!*  When we have spoken in class about automating solutions, this is precisely what we mean.  As *soon* as you realize that you will need to do something a second time, create a generalized solution for all tasks of that type!  While it takes more time and forethought to do this, in the long run it allows you to operate far more efficiently.

## Building the Network
We now have all of the pieces in place to build and train our network.  As usual, the majority of our effort is spent massaging the training data into a form that we can use within our network, while the network itself is rather simple to build.

# <img src="../images/task.png" width=20 height=20> Task 5.11

Build your network as follows:

 * Use two dense layers with 64 and 32 neurons respectively
 * Use the `relu` activation function for each dense layer
 * The final output layer should be another dense layer with the appropriate number of neurons
 * Your loss function ***must be*** `categorical_crossentropy`
 * Set `tf.random.set_seed(42)` so that all of our networks train identically
 * Use a validation split of 10%

***Use the second empty cell to fit your model as follows:***
  * Use a `batch_size` of 128
  * Train for 3 epochs


In [12]:
import datetime

tf.random.set_seed(42)

model = models.Sequential()
model.add(layers.InputLayer(128, name="Input"))
model.add(layers.Dense(64, name="Hidden_1", activation='relu'))
model.add(layers.Dense(32, name="Hidden_2", activation='relu'))
model.add(layers.Dense(len(unique_labels), name="Output_Layer", activation='softmax'))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

In [13]:
training_history = model.fit(x_train, y_train_onehot, validation_split=0.10, epochs=3, batch_size=128)

Epoch 1/3
Epoch 2/3
Epoch 3/3


# <img src="../images/task.png" width=20 height=20> Task 5.12

Let's test our network out.  Use the `model.evaluate()` method to check the performance of our network on the test dataset.

In [14]:
model.evaluate(x_test, y_test_onehot, verbose=False)

[0.00760048208758235, 0.9976000189781189]

Wow!  Based on this, we have an exceptionally accurate network... At least when it comes to identifying protocols that it has seen before.  Let's examine this a bit more closely.

# <img src="../images/task.png" width=20 height=20> Task 5.13

The data below represent the first 16 bytes of a number of different DNS messages (the `8180` flags field marks them as responses rather than queries).  Use the `predict()` function to classify each of these.  What do the predictions look like?

```
dns_request_data = [
    "e1598180000100020000000005666f6e",
    "2600818000010004000000000d737461",
    "f0188180000100020000000009676f6f",
    "92ec818000010005000000000572756c",
    "3dbc8180000100020000000003777777",
    "79998180000100030000000002736211",
    "fdd98180000100010000000002716107"
    ]
```

***Tip:*** In previous versions of TensorFlow, the predict method worked silently.  Sometime in 2022 it was changed to be verbose, which can be very irritating.  You may wish to pass `verbose=False` in your `predict()` calls.

***Hint:*** Using the `predict()` function can take a bit of thinking and experimenting.  If you want to run predictions on the test data, that's easy:  `model.predict(x_test)`.  Everything is the right shape.  But what, exactly, does x_test look like?  Remember that it is an array with some arbitrary number of rows, each of which is 128 columns wide.  This means that we need to pass `predict()` an array of rows of features, even if that array has only one row!
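
The shape requirement is easy to check without a model at all.  In this sketch a vector of zeros stands in for one preprocessed stream, and three equivalent ways of turning it into a one-row batch are shown.

```python
import numpy as np

# Stand-in for the output of content_to_features(): a single row of 128 bits
features = np.zeros(128, dtype=np.int32)
print(features.shape)   # (128,)

# Three equivalent ways to wrap one row into a batch of one
batch_a = np.array([features])
batch_b = features[np.newaxis, :]
batch_c = features.reshape(1, -1)

print(batch_a.shape)    # (1, 128)
```

Any of these produces the two-dimensional "rows of features" layout described in the hint.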

In [15]:
dns_request_data = [
    "e1598180000100020000000005666f6e",
    "2600818000010004000000000d737461",
    "f0188180000100020000000009676f6f",
    "92ec818000010005000000000572756c",
    "3dbc8180000100020000000003777777",
    "79998180000100030000000002736211",
    "fdd98180000100010000000002716107"
    ]

for request in dns_request_data:
    features = content_to_features(request)
    print(model.predict(np.array([features]), verbose=False))

[[1.2338198e-12 1.5314219e-08 4.4750670e-12 9.9999988e-01 5.4248117e-14
  1.1665294e-16 1.0278299e-09 4.1222906e-10 1.9502959e-13 7.4253201e-08
  1.3087713e-15 3.3206723e-10 2.3231573e-20 7.9875922e-13 3.0094578e-09
  7.6069678e-10 1.2084300e-10 9.1718340e-14 1.7814004e-08 4.7010176e-15]]
[[3.9635416e-11 2.8335545e-08 1.0048592e-09 9.9999595e-01 3.0221539e-12
  4.4793830e-14 1.6668288e-08 3.6643666e-09 1.3080349e-11 3.4616310e-06
  1.5494638e-13 2.6078360e-11 7.2038481e-18 2.7227582e-12 1.3126187e-07
  8.0152041e-08 1.1406762e-09 1.3890159e-12 3.0889831e-07 7.2579921e-11]]
[[1.52894678e-12 2.89904190e-08 1.43806494e-11 9.99999881e-01
  6.28956793e-14 3.20932380e-17 8.62222099e-11 7.84457777e-10
  2.29934431e-12 4.41793162e-08 7.57141237e-15 4.70276040e-10
  1.02001289e-19 2.52653882e-13 6.45491056e-08 3.68270137e-10
  4.23204839e-11 2.77189093e-14 1.90920009e-08 1.48066780e-13]]
[[9.1878043e-14 2.2908817e-09 6.1991276e-13 1.0000000e+00 1.9650666e-14
  5.5365883e-18 3.3269599e-11 3.5570

That output can be a little bit tough to read.  Also, it's pretty inconvenient to have to work out how to pass in a row of data every time we want to run a prediction.

# <img src="../images/task.png" width=20 height=20> Task 5.14

You have two tasks:

 * Implement a function that accepts a string of 32 ASCII-encoded hexadecimal characters (16 bytes) and returns the prediction vector from our model
 * Use this function to print out the predictions for the dns request data above; when you do so, adjust the output so that it is displayed as percentages


In [16]:
def predict_protocol(feature_string):
    features = content_to_features(feature_string)
    return model.predict(np.array([features]), verbose=False)

 
np.set_printoptions(precision=3, suppress=True)
for request in dns_request_data:
    print(predict_protocol(request)*100)

[[  0.   0.   0. 100.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.]]
[[  0.   0.   0. 100.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.]]
[[  0.   0.   0. 100.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.]]
[[  0.   0.   0. 100.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.]]
[[  0.   0.   0. 100.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.]]
[[  0.   0.   0. 100.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.]]
[[  0.   0.   0. 100.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.]]


When we convert these values to percentages and reduce the precision to just a few decimal places, it becomes clear that one neuron is turned on while the others are effectively turned off.

# <img src="../images/task.png" width=20 height=20> Task 5.15

Write a function that accepts a hexadecimally encoded ASCII string of bytes as input and returns the predicted protocol name.  Once you have written this function, print the predictions for the dns request data above.

*Hint:* You may want to make use of the NumPy `ndarray.argmax()` method.

In [17]:
def classify_protocol(feature_string):
    return unique_labels[predict_protocol(feature_string).argmax()]

for request in dns_request_data:
    print(classify_protocol(request))

DNS
DNS
DNS
DNS
DNS
DNS
DNS


### So Far So Good!
So far, things seem to be working very well!  Let's give this a bit more thought, though.  Before we start running production data through this, we need to figure out how to determine if the protocol is unknown and what to do about it.

Right now, we are using the typical approach, asking for the `argmax` of the prediction.  What does this mean?  It means that our function is passing the features into the network and then using the numpy `argmax` function against the output layer to determine which one of the neurons or nodes in the output layer has the greatest value.

Each of the nodes in the output layer will be populated with a numeric result indicating how activated each output neuron has become based on the input data.  In the above examples, we received a very clear result that matched our expectations.  Perhaps there is a way to use this degree of activation as an indicator?
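
To make the limitation concrete, here is a small sketch. The label list and both output vectors are made up for illustration; none of them come from the network trained in this lab. It shows that `argmax` always names a winner, even when no neuron is strongly activated.

```python
import numpy as np

# Hypothetical labels and output vectors, for illustration only.
example_labels = ["DNS", "HTTPRequest", "NTP", "SSHRequest"]

confident = np.array([[0.01, 0.97, 0.01, 0.01]])  # one neuron clearly "on"
uncertain = np.array([[0.24, 0.27, 0.25, 0.24]])  # activations spread out

for name, output in (("confident", confident), ("uncertain", uncertain)):
    winner = output.argmax()  # index of the largest value in the output
    print(name, "->", example_labels[winner])
```

Both lines report `HTTPRequest`, even though the second vector gives that class only a slight edge over the others.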

### Next Intuition...

Let's think about this a bit.  The network is going to return predictions that are, in essence, percentages indicating the likelihood that the data is of a particular type.  We see that we can view the entire matrix directly if we choose to, and we know that we can use `argmax` to figure out the index of the node that has the greatest value in it.  In fact, that's how our functions previously mapped that prediction to a label.

What if, instead of seeing which *index* had the greatest value, we determined *what the greatest value is*?  In fact, we can do this!  Numpy arrays have a `max()` method that returns the largest *value* in the array.  We can use this to figure out just what the neuron that `argmax()` points to had in it!
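
As a quick illustration of the difference, the sketch below applies both methods to a made-up output vector (the numbers are hypothetical, not produced by our network):

```python
import numpy as np

# Hypothetical output vector with shape (1, 4), for illustration only.
output = np.array([[0.05, 0.10, 0.80, 0.05]])

index = output.argmax()  # position of the most activated neuron
value = output.max()     # how activated that neuron is

print(index, value)
```

Here `argmax()` tells us *which* neuron won, while `max()` tells us *by how much*.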

# <img src="../images/task.png" width=20 height=20> Task 5.16

Implement another function.  This function returns a tuple containing the predicted protocol and the `max()` value for the most activated neuron.  Use this function to run predictions and print output for all of the following sample data:

```
# A variety of types of known traffic
known_data = [
    "e1598180000100020000000005666f6e",
    "2600818000010004000000000d737461",
    "47455420687474703a2f2f7370616d63",
    "504f5354202f55706c6f616444617461",
    "5353482d322e302d50755454590d0a",
    "5353482d322e302d6c69627373682d30"
]

# Some streams that we have NOT classified:
unknown_data = [
    "76336279204341504142494c4954590d",
    "76357861204341504142494c4954590d",
    "76387730204341504142494c4954590d",
    "81000044204545454444424341434143",
    "82000000",
    "f60e31f121bc3592a53c03fd597992d7",
    "ff"
]
```


In [18]:
known_data = [
    "e1598180000100020000000005666f6e",
    "2600818000010004000000000d737461",
    "47455420687474703a2f2f7370616d63",
    "504f5354202f55706c6f616444617461",
    "5353482d322e302d50755454590d0a",
    "5353482d322e302d6c69627373682d30"
]

# Some streams that we have NOT classified:
unknown_data = [
    "76387730204341504142494c4954590d",
    "81000044204545454444424341434143",
    "82000000",
    "10d018332fa62985a0312"
]

def c_and_c_protocol(feature_string):
    prediction = predict_protocol(feature_string)
    return (unique_labels[prediction.argmax()], prediction.max())

print("Known Protocols")
for request in known_data:
    print(c_and_c_protocol(request))

print("\nUnknown Protocols")
for request in unknown_data:
    print(c_and_c_protocol(request))

Known Protocols
('DNS', 0.9999999)
('DNS', 0.99999595)
('HTTPRequest', 0.99859995)
('HTTPRequest', 0.9999347)
('SSHRequest', 0.9999105)
('SSHRequest', 0.9998099)

Unknown Protocols
('SMTPResponse', 0.9642178)
('Drobo_Discovery', 0.26057768)
('NTP', 0.98377234)
('QTLSRequest', 0.96716)


### Wow!
This changes things a great deal!  We can see that the packets from known, classified protocols come back with very strong predictions, in this case all above 99.8%.  On the other hand, the data that comes from non-classified protocols produces weaker predictions, ranging from about 26% to about 98%!

> As an aside, in preparing this data, we do not *actually* know what protocols are represented in the unknown data.  We simply used a Zeek script to extract the first content bytes from connections that were **not** on ports 22, 25, 80, 443, 123, 53, etc.  This means that something with a prediction value less than 98% *could* actually be what has been predicted... we don't know.  Looking at the data, we can see that at least some of it is plain ASCII text.
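
If you want to look at the unknown samples yourself, one way is to decode the hex and show only the printable characters. This is a minimal sketch; `printable_view` is a helper name invented here, and the odd-length sample is left out because `bytes.fromhex` requires an even number of hex digits.

```python
# Samples copied from the unknown_data list above.
samples = [
    "76387730204341504142494c4954590d",
    "81000044204545454444424341434143",
    "82000000",
]

def printable_view(hex_string):
    """Decode hex to bytes and replace non-printable bytes with '.'."""
    raw = bytes.fromhex(hex_string)
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)

for sample in samples:
    print(printable_view(sample))
```

The first sample decodes to readable text (`v8w0 CAPABILITY` followed by a carriage return), while the last one contains no printable characters at all.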

This leads us to an important intuition.  See if this logic makes sense to you:
 * Our network seems to be very good at finding things that it *knows*
 * It exhibits significantly more spread out activations (or, the activation of the `max()` value is much lower) when the protocol is *unknown*
 * We can't create an output node or class for "unknown," but we can see how confident the network is in its prediction
 * Therefore...
 
## Therefore...
Based on the deductive steps so far, it seems reasonable to create an automated deduction:

**If the confidence in the prediction is less than some arbitrary value, the data is likely a previously unknown protocol.**
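
One possible way to express this rule in code is sketched below. The function name, the `"UNKNOWN"` marker, the example labels, and the 0.99 cutoff are all assumptions made for illustration. In the sample output above, the weakest known prediction was about 0.9986 and the strongest unknown prediction was about 0.9838, so a cutoff between those two happens to separate these particular samples; your own traffic may need a different value.

```python
import numpy as np

# Assumed cutoff; tune it against your own traffic.
CONFIDENCE_THRESHOLD = 0.99

def label_or_unknown(prediction, labels, threshold=CONFIDENCE_THRESHOLD):
    """Return (label, confidence), using 'UNKNOWN' when confidence is too low."""
    confidence = float(prediction.max())
    if confidence < threshold:
        return ("UNKNOWN", confidence)
    return (labels[prediction.argmax()], confidence)

# Hypothetical labels and output vectors, for illustration only.
example_labels = ["DNS", "Drobo_Discovery", "NTP"]
strong = np.array([[0.9999, 0.00005, 0.00005]])
weak = np.array([[0.40, 0.26, 0.34]])

print(label_or_unknown(strong, example_labels))
print(label_or_unknown(weak, example_labels))
```

Returning the confidence alongside the label keeps the raw number available, which is useful when deciding later whether the cutoff should be raised or lowered.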


## Production Use
To move this into a production application, we need to save the configuration of our network so that it can be loaded from another script later.  We'll also copy some of the functions that we've built to the new script.  Keras provides handy functions that allow us to save and load trained networks: the `model.save()` method writes the trained model to disk, and `keras.models.load_model()` reads it back.

With that step completed, we simply need to load this model into a Python script that is interfaced to Zeek.  Within that script, it is trivial to have the network classify the data and then generate messages should the network fail to identify the protocol.

We will not save the model in this lab (though you are welcome to).  If you would like a Zeek and Python script that you can use your trained model with, visit the [ShowMeThePackets repository](https://github.com/dhoelzer/ShowMeThePackets/tree/master/Zeek).

# Conclusion

This lab covered a lot of ground.  Not only did we work with a new type of machine learning, but we built something that can give us useful real-time results in our organization.  We also developed an important technique that allows us to use a supervised learning solution that can be trained with very little data and can be used for anomaly detection.