# Finding Non-IP Packets - A Lesson in False Correlations

## Overview

In this experiment, we are going to create a very simple model to solve a very simple problem... And fail. You will use your existing knowledge to build the network and learn a bit about parsing packet capture files along the way.

## Goals

By the end of this lab, you should be able to:

 * Read packet captures
 * Build a training data set from packets
 * Build a model to attempt to distinguish IP from non-IP packets on a network
 
## Estimated Time: 30 minutes

# Obtaining the Data

The data for this lab is stored in the `../data/Day 4/` directory relative to this notebook.  Within that directory are two packet captures: `ip-notip.cap` and `test_ip_notip.cap`.

# <img src="../images/task.png" width=20 height=20> Task 4.1

Please use the following cell to:
 * Import the *numpy* library with the typical `np` alias
 * Import *tensorflow* as `tf`
 * Import the `models` and `layers` packages from `tensorflow.keras`
 * Import the `PcapReader` class and the `hexdump` function from `scapy.all`
 
**Note:** You may receive a warning regarding deprecated aspects of the `ipsec.py` file from within Scapy.  These can be safely ignored and are expected.

In [1]:
from scapy.all import PcapReader, hexdump
import numpy as np
import tensorflow as tf
from tensorflow.keras import models, layers



We are about to read in packet capture data.  These capture files are stored in *pcap* format, which is used by tools such as Wireshark, Scapy, Snort, Firepower, tcpdump, Zeek, etc.  Within these captures are a variety of packets, some of which are IPv4, some of which are not.  All of these packets have been sent using Ethernet.

An Ethernet frame header consists of a six byte destination MAC address, a six byte source MAC address, and a two byte ethertype.  It is this ethertype value that should be used to distinguish between IPv4 packets and other types of data.  IPv4 will have the hexadecimal value `0x0800` in these two bytes located at offsets 12 and 13 from the start of the frame header.
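The header layout described above can be sketched with the standard-library `struct` module. This is only an illustration: the helper names `parse_ethernet_header` and `is_ipv4` are hypothetical, and the sample bytes are the same header that appears in the hexdump output later in this lab.

```python
import struct

ETHERTYPE_IPV4 = 0x0800  # value given in the text above


def parse_ethernet_header(frame: bytes):
    """Split the first 14 bytes of a frame into (dst MAC, src MAC, ethertype)."""
    if len(frame) < 14:
        raise ValueError("an Ethernet header needs at least 14 bytes")
    # "!" = network (big-endian) order, "6s" = six raw bytes, "H" = unsigned 16-bit
    return struct.unpack("!6s6sH", frame[:14])


def is_ipv4(frame: bytes) -> bool:
    """True when the two bytes at offsets 12 and 13 are 0x08 0x00."""
    _, _, ethertype = parse_ethernet_header(frame)
    return ethertype == ETHERTYPE_IPV4


# Sample header: destination MAC, source MAC, then the ethertype
header = bytes.fromhex("01 00 5E 00 00 09 00 50 56 86 27 F6 08 00")
dst_mac, src_mac, ethertype = parse_ethernet_header(header)
print(f"ethertype = {ethertype:#06x}, IPv4? {is_ipv4(header)}")
```

This is the "subject matter expert" answer to the problem: two bytes and one comparison. The rest of the lab deliberately avoids giving the network that hint.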

While this is a trivial problem, certainly simpler than our spam classifier, we will try to solve it in an interesting way.  If we had subject matter expertise with regard to network packets, then we could tell our network to use only those two bytes.  However, if we were going to take this tack, we may as well use something like a decision tree rather than a neural network to solve the problem.  Our experiment, though, is slightly different.

What if we don't know which bytes indicate that a packet is IP?  Is it possible for a neural network to work out a way to figure it out for us?  This is the problem we're going to solve.

## Preparing the Data
The first step, as usual, is to preprocess the data into a usable format.  We will leverage the Scapy library to read the packets in and build a data set for training.  In the matrix that is created, each row will represent one packet, and the row vector will hold the 14 bytes that make up the Ethernet header (the first fourteen bytes of the frame).
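As a rough sketch of that matrix, the snippet below builds it from two hand-written frames instead of a real capture. The `raw_frames` list is a hypothetical stand-in for the packets Scapy would give us, and the trailing payload bytes are made up to show that only the first fourteen bytes are kept.

```python
import numpy as np

byte = np.dtype("uint8")

# Hypothetical raw frames: a 14-byte Ethernet header followed by a few
# made-up payload bytes that should NOT end up in the matrix.
raw_frames = [
    bytes.fromhex("00 50 56 86 27 F6 00 50 56 86 01 A9 08 00") + b"\x45\x00",
    bytes.fromhex("FF FF FF FF FF FF 00 50 56 86 01 A9 08 06") + b"\x00\x01",
]

# One row per packet; count=14 keeps only the Ethernet header.
rows = [np.frombuffer(frame, dtype=byte, count=14) for frame in raw_frames]
x = np.stack(rows)

print(x.shape)       # one row per frame, fourteen columns
print(x[:, 12:14])   # the two ethertype bytes of each row
```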

# <img src="../images/task.png" width=20 height=20> Task 4.2

The `PcapReader()` class exposes the packets from a packet capture to us as though it were an iterable file.  For example, we can write something like this:

```
with PcapReader("pcap_file") as packets:
    for packet in packets:
        ...
```

In the cell that follows:

  * Use the `PcapReader()` class to iterate over the list of packets from the `ip-notip.cap` file
  * The file can be found in `../data/Day 4`
  * The bytes are unsigned 8 bit values; create a variable to hold `np.dtype('uint8')` as a type named *byte*
  * Assign the first fourteen bytes of each packet to a numpy array of type *byte*.
  * Use the `hexdump()` function to print out the resulting array for the first ten packets
  * Use the Python `bytes()` function to treat each packet as an indexable sequence of bytes


In [2]:
byte = np.dtype('uint8')

# Let's load in some packet data.
with PcapReader("../data/Day 4/ip-notip.cap") as packets:
    for i, packet in enumerate(packets):
        parray = np.frombuffer(bytes(packet), dtype=byte, offset=0, count=14)
        if i < 10:
            hexdump(parray)

0000  01 00 5E 00 00 09 00 50 56 86 27 F6 08 00        ..^....PV.'...
0000  00 50 56 86 27 F6 00 50 56 86 01 A9 08 00        .PV.'..PV.....
0000  00 50 56 86 01 A9 00 50 56 86 27 F6 08 00        .PV....PV.'...
0000  00 50 56 86 27 F6 00 50 56 86 01 A9 08 00        .PV.'..PV.....
0000  00 50 56 86 01 A9 00 50 56 86 27 F6 08 00        .PV....PV.'...
0000  00 50 56 86 27 F6 00 50 56 86 01 A9 08 00        .PV.'..PV.....
0000  00 50 56 86 01 A9 00 50 56 86 27 F6 08 00        .PV....PV.'...
0000  00 50 56 86 27 F6 00 50 56 86 01 A9 08 00        .PV.'..PV.....
0000  00 50 56 86 01 A9 00 50 56 86 27 F6 08 00        .PV....PV.'...
0000  01 00 5E 00 00 09 00 50 56 86 D4 C3 08 00        ..^....PV.....


So far, we've managed to load a pcap using `PcapReader` and turn each Ethernet header into a row vector containing all of the bytes in the header!  Looking at the output above, you can see that the first ten rows of data all end with the values `08 00`, indicating that the encapsulated packet is an IPv4 packet.

# <img src="../images/task.png" width=20 height=20> Task 4.3

Load all of the data into an array named `x_data`.  At the same time, create a label array named `y_labels` that holds a one if the respective packet is IPv4 and a zero if it is not.  To make this decision, you will need to evaluate each packet as you read it, checking offsets 12 and 13 for the values `08` and `00`.

When you complete this task, print one or two values from the `x_data` array and verify that the corresponding label in `y_labels` is correct.
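The per-packet check can also be written as a single vectorized comparison once the headers are in a matrix. The sketch below uses a tiny hypothetical array, `x_demo`, in place of the real `x_data`, and also counts how many examples fall into each class, which is a quick sanity check worth running on any labeled data set.

```python
import numpy as np

# Hypothetical stand-in for x_data: three 14-byte Ethernet headers.
headers_hex = [
    "00 50 56 86 27 F6 00 50 56 86 01 A9 08 00",  # IPv4
    "FF FF FF FF FF FF 00 50 56 86 01 A9 08 06",  # ARP
    "33 33 00 00 00 01 00 50 56 86 01 A9 86 DD",  # IPv6
]
x_demo = np.array([list(bytes.fromhex(h)) for h in headers_hex], dtype=np.uint8)

# 1 where offsets 12 and 13 hold 0x08 0x00, otherwise 0.
y_demo = ((x_demo[:, 12] == 0x08) & (x_demo[:, 13] == 0x00)).astype(int)

# Number of examples per class: index 0 = non-IP, index 1 = IP.
class_counts = np.bincount(y_demo, minlength=2)
print(y_demo, class_counts)
```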

In [3]:
x_data = list()
y_labels = list()
with PcapReader("../data/Day 4/ip-notip.cap") as packets:
    for packet in packets:
        linkLayer = np.frombuffer(bytes(packet), dtype=byte, offset=0, count=14)
        x_data.append(linkLayer)
        if linkLayer[12]==0x08 and linkLayer[13]==0x00:
            y_labels.append(1)
        else:
            y_labels.append(0)
        
print(f"The label for item offset 10 is {y_labels[10]}")
print("The bytes from offset 10 are:")
hexdump(x_data[10])
x_data = np.array(x_data)
y_labels = np.array(y_labels)
print(f"After converting to numpy arrays, the data has shape {x_data.shape} and the labels have shape {y_labels.shape}")

The label for item offset 10 is 1
The bytes from offset 10 are:
0000  00 50 56 86 27 F6 00 50 56 86 01 A9 08 00        .PV.'..PV.....
After converting to numpy arrays, the data has shape (50000, 14) and the labels have shape (50000,)


## Building the Network

# <img src="../images/task.png" width=20 height=20> Task 4.4

Now that we've beaten our data into submission, it's time to build and train our network.  Please build a sequential model with three dense layers.  The layers should have 64, 16, and 1 neurons respectively.  Use the `relu` activation function for the first two and `sigmoid` for the output layer.

Configure the optimizer to use `rmsprop`.  Set the `loss` value appropriately for the type of problem we are solving.  Use the `tf.random.set_seed(42)` call to ensure that all of our networks perform identically.  Train your model for four epochs using a `batch_size` of 32.
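With a single sigmoid output and 0/1 labels, binary cross-entropy is the usual loss. The plain-Python sketch below shows what that loss measures. It is an illustration only: the `eps` clipping value is an assumption, and Keras's own implementation differs in its details.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Confident and correct predictions give a loss near zero...
confident_right = binary_crossentropy([1, 0], [0.99, 0.01])
# ...while confident and wrong predictions are punished heavily.
confident_wrong = binary_crossentropy([1, 0], [0.01, 0.99])
print(confident_right, confident_wrong)
```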

In [4]:
tf.random.set_seed(42)
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(14,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
training_history = model.fit(x_data, y_labels, epochs=4, batch_size=32)

Epoch 1/4



Epoch 2/4
Epoch 3/4
Epoch 4/4


## Testing the Model

This all seems very promising!  The training went really quickly and it's reporting an accuracy of 99.82%!  The next step is to try this against some real data and see what it can do.  To do so, we'll continue to manipulate the model here within this notebook.

# <img src="../images/task.png" width=20 height=20> Task 4.5

Repeat the steps above to load a test dataset from the packet capture `test_ip_notip.cap` in the same location.  Use this dataset to run predictions, then print the hexdump of each packet, the prediction generated, and whether or not the prediction is correct.
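Reading the per-packet printout line by line gets tedious, so it can help to summarize the mismatches as well. The helper below, `confusion_counts`, is a hypothetical name, and the probabilities are made up for illustration. It thresholds the sigmoid outputs at 0.5 and counts each kind of outcome.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and truth == 1:
            counts["tp"] += 1
        elif predicted == 0 and truth == 0:
            counts["tn"] += 1
        elif predicted == 1 and truth == 0:
            counts["fp"] += 1  # non-IP packet marked as IP
        else:
            counts["fn"] += 1  # IP packet marked as non-IP
    return counts


# Made-up labels and model outputs, for illustration only.
demo_truth = [1, 1, 0, 0]
demo_probs = [0.9, 0.8, 0.7, 0.1]
counts = confusion_counts(demo_truth, demo_probs)
accuracy = (counts["tp"] + counts["tn"]) / len(demo_truth)
print(counts, accuracy)
```

Separating false positives from false negatives shows which direction the model errs in, something a single accuracy number hides.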

In [5]:
x_test = list()
y_test = list()
with PcapReader("../data/Day 4/test_ip_notip.cap") as packets:
    for packet in packets:
        linkLayer = np.frombuffer(bytes(packet), dtype=byte, offset=0, count=14)
        x_test.append(linkLayer)
        if linkLayer[12]==0x08 and linkLayer[13]==0x00:
            y_test.append(1)
        else:
            y_test.append(0)
            
x_test = np.array(x_test)

# The next step is to make predictions!
predictions = model.predict(x_test)
for i in range(len(predictions)):
    # model.predict returns shape (n, 1); index the single output explicitly
    predicted = 1 if predictions[i, 0] >= 0.5 else 0
    print(f'Prediction: { predicted }\tGround Truth: {y_test[i]} {"           **** MISMATCH!" if predicted != y_test[i] else ""}')
    hexdump(x_test[i])
    print("---------------------")

Prediction: 1	Ground Truth: 1 
0000  00 50 56 86 A7 48 00 50 56 86 7C 39 08 00        .PV..H.PV.|9..
---------------------
Prediction: 1	Ground Truth: 1 
0000  00 50 56 86 7C 39 00 50 56 86 A7 48 08 00        .PV.|9.PV..H..
---------------------
Prediction: 1	Ground Truth: 1 
0000  00 50 56 86 A7 48 00 50 56 86 7C 39 08 00        .PV..H.PV.|9..
---------------------
Prediction: 1	Ground Truth: 1 
0000  00 50 56 86 7C 39 00 50 56 86 A7 48 08 00        .PV.|9.PV..H..
---------------------
Prediction: 1	Ground Truth: 1 
0000  00 50 56 86 A7 48 00 50 56 86 7C 39 08 00        .PV..H.PV.|9..
---------------------
Prediction: 1	Ground Truth: 1 
0000  00 50 56 86 7C 39 00 50 56 86 A7 48 08 00        .PV.|9.PV..H..
---------------------
Prediction: 1	Ground Truth: 1 
0000  00 50 56 86 A7 48 00 50 56 86 7C 39 08 00        .PV..H.PV.|9..
---------------------
Prediction: 1	Ground Truth: 1 
0000  00 50 56 86 7C 39 00 50 56 86 A7 48 08 00        .PV.|9.PV..H..
---------------------
Prediction: 1	Gr

# Results
So how did the model do?  Pretty well!  But... there's a problem.  The majority of the predictions are correct, but there are a few non-IP packets that are marked as being IP!  (You can tell because the last two values in the hexdump are `08 06`, the ethertype for ARP.) What went wrong?

Think about this question and how you could correct this problem.  We know from the problem statement at the beginning that we are including lots of irrelevant features in our model.  A simple and direct solution would be to change the model so that it *only* looks at the last two bytes.  If we made that change, this model would reach 100% accuracy after just a handful of examples.
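Restricting the features to the last two bytes is a one-line slice. The sketch below uses a small hypothetical array, `x_demo`, rather than the lab's `x_data`, and combines the two bytes into a single 16-bit value to show that they separate the classes on their own.

```python
import numpy as np

# Hypothetical headers covering unicast, broadcast, and multicast cases.
headers_hex = [
    "00 50 56 86 27 F6 00 50 56 86 01 A9 08 00",  # IPv4, unicast
    "FF FF FF FF FF FF 00 50 56 86 01 A9 08 06",  # ARP, broadcast
    "00 50 56 86 01 A9 00 50 56 86 27 F6 08 06",  # ARP, unicast
    "01 00 5E 00 00 09 00 50 56 86 27 F6 08 00",  # IPv4, multicast
]
x_demo = np.array([list(bytes.fromhex(h)) for h in headers_hex], dtype=np.uint8)

# Keep only offsets 12 and 13: the MAC addresses are no longer visible.
x_ethertype = x_demo[:, 12:14]

# Combine the two bytes into one 16-bit ethertype per packet.
ethertype = (x_ethertype[:, 0].astype(np.uint16) << 8) | x_ethertype[:, 1]
is_ip = ethertype == 0x0800
print(x_ethertype.shape, is_ip)
```

If an array like `x_ethertype` were fed to the network instead of the full header, the first layer's `input_shape` would need to be `(2,)` rather than `(14,)`.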

Still, *why* is the network arriving at the wrong answer some of the time?  In this case, the network has made a *false correlation*.  We can point at how we've modeled the features as the cause, but we can also point at the training data.  In our case, it is a reasonable guess that the packets in our training data set have some particular characteristics.  Most notably, the majority (if not all) of the training examples are either IP or ARP.  Since ARP packets are *broadcast packets* at least half of the time (ARP requests are sent to every host on the local network, while replies are typically unicast), this strongly biases our model.  It incorrectly treats the broadcast destination address (the series of `FF FF FF FF FF FF` in many of the non-IP packets) as the indicator that the packet isn't IP!  This results in incorrect predictions because some of our test data has non-IP packets that are *not* broadcast.
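One way to look for this kind of false correlation is to cross-tabulate a suspect feature against the label. The sketch below does this for "destination is broadcast" versus "packet is IP". The `crosstab` helper is hypothetical, and the training mix is made up to mirror the situation described above, not taken from the lab's capture.

```python
from collections import Counter

BROADCAST = bytes.fromhex("FF FF FF FF FF FF")


def crosstab(headers):
    """Count headers by (destination is broadcast, ethertype is IPv4)."""
    table = Counter()
    for header in headers:
        is_broadcast = header[:6] == BROADCAST
        is_ip = header[12:14] == b"\x08\x00"
        table[(is_broadcast, is_ip)] += 1
    return table


ip_unicast = bytes.fromhex("00 50 56 86 27 F6 00 50 56 86 01 A9 08 00")
arp_broadcast = bytes.fromhex("FF FF FF FF FF FF 00 50 56 86 01 A9 08 06")

# Made-up training mix in which every non-IP packet happens to be broadcast.
train_headers = [ip_unicast] * 3 + [arp_broadcast] * 2
table = crosstab(train_headers)

for (is_broadcast, is_ip), n in sorted(table.items()):
    print(f"broadcast={is_broadcast!s:5} ip={is_ip!s:5} count={n}")
```

In this toy set, "not broadcast" and "IP" always occur together, so a model has no way to tell which one the label depends on. A unicast ARP packet falls into the (not broadcast, not IP) cell, which has a count of zero here, and that empty cell is exactly where the lab's model goes wrong.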

Ultimately, the details about the packets are not what matter.  The big picture outcome is that by including extraneous features or by poorly selecting training data we can easily end up with a model that makes false correlations.

# Conclusion

While this lab was relatively simple compared to the other problems we've solved, there are some important takeaways:

 * We have successfully applied what we have learned about deep learning to a completely different problem
 * We can read and process packet captures as bytes easily
 * Feature selection can have a big impact on how well a model generalizes
 * Training data can have an equally large impact on how well a model generalizes