# Data Preparation

With a better understanding of data representation, let's now turn to preparing data for input into a machine learning pipeline. In the case of unsupervised learning, a simple matrix-level representation can suffice for input to machine learning models; for supervised learning, we also need accompanying labels.

Often, traffic capture datasets are accompanied by labels. These labels tell us something about the corresponding data points (e.g., flows or packets) in the traffic, and can be used to train a model for future prediction.
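
As a minimal sketch of what "a matrix plus labels" looks like in practice, the snippet below builds a small feature matrix and a matching label vector. The values and the names `X` and `y` are illustrative assumptions, not taken from the traces used in this exercise.

```python
import numpy as np

# Hypothetical feature matrix: one row per flow, one column per feature.
X = np.array([
    [0.012, 0.000, 0.013],   # flow 0
    [0.037, 30.003, 0.037],  # flow 1
    [1.001, 2.020, 4.253],   # flow 2
])

# One label per row of X (here 1 = attack trace, 0 = regular trace).
y = np.array([1, 0, 1])

# The two must agree on the number of data points.
assert X.shape[0] == y.shape[0]
print(X.shape, y.shape)
```

An unsupervised model would consume only `X`; a supervised model needs both `X` and `y`.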

Automated tools exist for assigning labels to traffic flows, including [pcapML](https://nprint.github.io/pcapml/). Before we use those tools, we will do some automatic preparation and labeling from an existing dataset, a log4j trace from [malware traffic analysis](https://www.malware-traffic-analysis.net/2021/12/20/index.html) and a regular trace.

You can use the NetML traffic library to generate features.

In [69]:
import logging
logging.getLogger("scapy.runtime").setLevel(logging.ERROR)

from netml.pparser.parser import PCAP
from netml.utils.tool import dump_data, load_data

import pandas as pd

## Load the Packet Capture Files

Load the Log4j and HTTP packet capture files and extract features from the flows. Feel free to compute features manually, although it will likely be more convenient at this point to use the `netml` library.

In [70]:
hpcap = PCAP('data/http.pcap', flow_ptks_thres=2, verbose=10)
lpcap = PCAP('data/log4j.pcap', flow_ptks_thres=2, verbose=10)

## Convert the Packet Capture Into Flows

Find the function in `netml` that converts the pcap file into flows. Examine the resulting data structure. What does it contain?

In [71]:
# extract flows from pcap
hpcap.pcap2flows()
lpcap.pcap2flows()

'_pcap2flows()' starts at 2023-10-18 16:04:19
pcap_file: data/http.pcap
ith_packets: 0
ith_packets: 10000
ith_packets: 20000
len(flows): 593
total number of flows: 593. Num of flows < 2 pkts: 300, and >=2 pkts: 293 without timeout splitting.
kept flows: 293. Each of them has at least 2 pkts after timeout splitting.
flow_durations.shape: (293, 1)
        col_0
count 293.000
mean   11.629
std    15.820
min     0.000
25%     0.076
50%     0.455
75%    20.097
max    46.235
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293 entries, 0 to 292
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col_0   293 non-null    float64
dtypes: float64(1)
memory usage: 2.4 KB
None
0th_flow: len(pkts): 4
After splitting flows, the number of subflows: 291 and each of them has at least 2 packets.
'_pcap2flows()' ends at 2023-10-18 16:04:29 and takes 0.1692 mins.
'_pcap2flows()' starts at 2023-10-18 16:04:29
pcap_file: data/log4j.pcap
ith_packets
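
To build intuition for what a pcap-to-flow conversion does, here is a simplified sketch that groups packets by five-tuple and discards flows with fewer than two packets, mirroring the threshold of 2 used when loading the captures. The packet records and the helper `group_into_flows` are hypothetical; the sketch treats flows as unidirectional and does no timeout splitting, so it should not be read as how `netml` implements this step.

```python
from collections import defaultdict

# Hypothetical packet records:
# (timestamp, src_ip, dst_ip, src_port, dst_port, protocol)
packets = [
    (0.00, "10.0.0.1", "10.0.0.2", 50000, 80, "TCP"),
    (0.01, "10.0.0.3", "10.0.0.2", 50001, 80, "TCP"),
    (0.05, "10.0.0.1", "10.0.0.2", 50000, 80, "TCP"),
    (0.30, "10.0.0.1", "10.0.0.2", 50000, 80, "TCP"),
]

def group_into_flows(packets, min_packets=2):
    """Group packets by five-tuple and keep flows with at least min_packets."""
    flows = defaultdict(list)
    for ts, *five_tuple in packets:
        flows[tuple(five_tuple)].append(ts)
    return {fid: times for fid, times in flows.items() if len(times) >= min_packets}

flows = group_into_flows(packets)
for fid, times in flows.items():
    print(fid, len(times))
```

Here the single-packet flow from `10.0.0.3` is dropped, which is the same kind of filtering the log output reports when it counts flows with fewer than 2 packets.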

## Explore the Flows

How many flows are in each of these pcaps? (Use the `netml` library output to determine the size of each data structure.)

In [72]:
len(lpcap.flows)

4795

In [73]:
len(hpcap.flows)

291
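
The two traces differ greatly in size, which is worth quantifying before training. The short calculation below uses only the two flow counts reported above; the variable names are illustrative.

```python
# Flow counts reported above.
n_log4j = 4795
n_http = 291

total = n_log4j + n_http
log4j_fraction = n_log4j / total

print(total)
print(round(log4j_fraction, 3))
```

Roughly 94% of the flows come from the log4j trace, so a classifier that always predicted the log4j class would already be right about 94% of the time on this data.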

## Extract Features for Each Dataset

Extract features from each of the two datasets. You may want to use the `netml` library to generate features, although you can certainly compute your own. The [documentation](https://pypi.org/project/netml/) and [accompanying paper](https://arxiv.org/pdf/2006.16993.pdf) provide examples of features that you can try to extract. 

You can attempt to generate any of the features available in the `netml` library.

In [74]:
# extract features from each flow via IAT
lpcap.flow2features('IAT', fft=False, header=False)
ld = pd.DataFrame(lpcap.features)

'_flow2features()' starts at 2023-10-18 16:04:53
True
'_flow2features()' ends at 2023-10-18 16:04:53 and takes 0.0017 mins.


In [75]:
# extract features from each flow via IAT
hpcap.flow2features('IAT', fft=False, header=False)
hd = pd.DataFrame(hpcap.features)

'_flow2features()' starts at 2023-10-18 16:04:53
True
'_flow2features()' ends at 2023-10-18 16:04:53 and takes 0.0007 mins.
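
For intuition about what an inter-arrival time (IAT) feature vector is, the sketch below computes the gaps between consecutive packet timestamps and then truncates or zero-pads them to a fixed length. The helper `iat_features`, the timestamps, and the choice of zero-padding are assumptions made for illustration; `netml` may choose the dimension and padding differently.

```python
import numpy as np

def iat_features(timestamps, dim):
    """Return inter-arrival times, truncated or zero-padded to length dim."""
    iats = np.diff(np.sort(np.asarray(timestamps, dtype=float)))
    out = np.zeros(dim)
    n = min(dim, len(iats))
    out[:n] = iats[:n]
    return out

# Hypothetical packet timestamps (seconds) for two flows of different lengths.
flow_a = [0.0, 0.5, 0.75]
flow_b = [10.0, 10.25, 10.5, 12.5, 12.75, 13.0, 13.5]

features = np.vstack([iat_features(f, dim=5) for f in (flow_a, flow_b)])
print(features)
```

Because flows have different numbers of packets, some fixed dimension has to be chosen so that every flow yields a row of the same width.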


## Normalize the Shapes of Each Feature Set

If you loaded the two pcaps with `netml` separately, the features will not be of the same dimension.  

1. Adjust your data frames so that the two have the same number of columns.
2. Merge (i.e., concatenate) the two data frames, but preserve the labels as a separate vector called "target".

In [76]:
hd.shape

(291, 92)

In [77]:
ld.shape

(4795, 5)

In [78]:
hds = hd.loc[:,:5]
hds.shape

(291, 6)

Note that `.loc` slices by label and includes the endpoint, so `hd.loc[:, :5]` keeps six columns (0 through 5), one more than the five columns in `ld`. To match `ld` exactly, slice with `hd.loc[:, :4]` instead. In this walkthrough the extra column 5 carries through the cells below and is dropped when the features are selected with `data.loc[:, :4]`.

In [79]:
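# hds is a slice of hd, so silence the pandas chained-assignment warning
pd.set_option('mode.chained_assignment', None)
# label the HTTP flows 0 and the log4j flows 1
hds['label'] = 0
ld['label'] = 1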

In [80]:
ld.head(3)

|   | 0 | 1 | 2 | 3 | 4 | label |
|---|---|---|---|---|---|---|
| 0 | 0.012 | 0.0 | 0.013 | 5.041 | 0.097 | 1 |
| 1 | 0.012 | 0.001 | 5.004 | 0.146 | 0.0 | 1 |
| 2 | 1.001 | 2.02 | 4.253 | 0.0 | 0.0 | 1 |


In [81]:
hds.head(3)

|   | 0 | 1 | 2 | 3 | 4 | 5 | label |
|---|---|---|---|---|---|---|---|
| 0 | 0.037 | 30.003 | 0.037 | 0.0 | 0.0 | 0.0 | 0 |
| 1 | 0.02 | 30.02 | 0.02 | 0.0 | 0.0 | 0.0 | 0 |
| 2 | 15.494 | 4.999 | 15.131 | 5.336 | 0.0 | 0.0 | 0 |


In [82]:
data = pd.concat([ld,hds])
data

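|   | 0 | 1 | 2 | 3 | 4 | label | 5 |
|---|---|---|---|---|---|---|---|
| 0 | 0.012 | 0.000 | 0.013 | 5.041 | 0.097 | 1 | NaN |
| 1 | 0.012 | 0.001 | 5.004 | 0.146 | 0.000 | 1 | NaN |
| 2 | 1.001 | 2.020 | 4.253 | 0.000 | 0.000 | 1 | NaN |
| 3 | 1.816 | 1.001 | 1.001 | 1.001 | 1.000 | 1 | NaN |
| 4 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 286 | 0.013 | 0.000 | 0.015 | 0.000 | 0.000 | 0 | 0.015 |
| 287 | 0.013 | 0.000 | 0.001 | 0.016 | 0.000 | 0 | 0.000 |
| 288 | 0.019 | 0.002 | 0.019 | 0.001 | 0.000 | 0 | 0.000 |
| 289 | 0.024 | 0.001 | 0.000 | 0.000 | 0.012 | 0 | 0.032 |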
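
One alternative way to carry out the two steps of this section is sketched below on synthetic stand-in data: truncate both frames to their common width, concatenate with a fresh index, and keep the labels in a separate `target` vector. The frames `attack` and `benign` and their sizes are hypothetical. Passing `ignore_index=True` avoids the repeated row indices visible in the concatenated output above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two feature sets (different widths).
rng = np.random.default_rng(0)
attack = pd.DataFrame(rng.random((4, 5)))
benign = pd.DataFrame(rng.random((3, 8)))

# Keep only the columns both frames share.
width = min(attack.shape[1], benign.shape[1])
attack = attack.iloc[:, :width]
benign = benign.iloc[:, :width]

# Stack the rows and build the label vector separately.
features = pd.concat([attack, benign], ignore_index=True)
target = pd.Series([1] * len(attack) + [0] * len(benign), name="target")

print(features.shape, target.shape)
```

Truncating to the narrower width discards information from the wider frame; padding the narrower frame with zeros is another option with its own trade-offs.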


## Try Your Data on a Model

You should now have data that can be input into a model with scikit-learn. Import the scikit-learn package (`sklearn`) and a classification model of your choice to test that you can train your model with the above data. 

Hint: The function you want to call is `fit`.

**Note:** If you plan to use a linear model such as logistic regression, your label should be a numerical value. For a binary classification problem, as in this case, the labels should be 0 and 1 for the two classes. (If you are using a tree-based model, the labels can take any format.)

(Note that we have not done anything here except train the model with all of the data. To evaluate the model, we will need to split the data into train and test sets.)

In [83]:
from sklearn.linear_model import LogisticRegression

In [84]:
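# use columns 0 through 4, which both traces have in common
features = data.loc[:,:4]
targets = data['label']

# train on the full dataset (no train/test split yet)
lr = LogisticRegression(random_state=0).fit(features, targets)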

## Test Your Trained Model

We used the entire dataset to train the model in this example (no split), and so of course the model will be well-fit to all of the data. To simply test that your trained model works, call `predict` using a feature vector that you generate by hand (e.g., from scratch, using a random set of numbers, from another pcap).

In [85]:
import numpy as np
results = []
for i in range(100000):
    test = np.random.rand(5).reshape(1,-1)
    #print(test)
    results.append(lr.predict(test))
print(sum(results)/len(results))

[1.]
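
An average of 1.0 means every random vector was assigned to class 1. The loop above calls `predict` once per vector; since `predict` accepts a 2-D array with one row per sample, the same check can be done in a single call. The sketch below shows this on synthetic training data, so the model, the names, and the resulting fraction are illustrative and not comparable to the result above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical training data: two well-separated clusters in 5 dimensions.
X_train = np.vstack([rng.random((50, 5)), rng.random((50, 5)) + 5.0])
y_train = np.array([0] * 50 + [1] * 50)
model = LogisticRegression(random_state=0).fit(X_train, y_train)

# predict accepts a 2-D array, so a whole batch can be scored in one call.
X_test = rng.random((1000, 5))
predictions = model.predict(X_test)

# Fraction of test vectors assigned to class 1.
print(predictions.mean())
```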


## Bonus  

Consider the following extensions to the above exercise:

* Concatenate or combine multiple features (either from `netml` or some of your own) into the same feature representation.
* Normalize your features so that they are in the same range (helpful for some models).
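
Both extensions can be sketched with plain NumPy: stack feature blocks column-wise, then rescale each column to the range [0, 1]. The two blocks below (timing and packet-size features) and their values are hypothetical, and min-max scaling is only one of several possible normalizations.

```python
import numpy as np

# Hypothetical per-flow feature blocks, e.g. timing and packet-size features.
iat = np.array([[0.012, 0.000], [0.037, 30.003], [1.001, 2.020]])
sizes = np.array([[60.0, 1500.0], [52.0, 52.0], [1400.0, 66.0]])

# Combine the blocks column-wise: one row per flow, more columns per row.
combined = np.hstack([iat, sizes])

# Min-max scale each column to [0, 1]; guard against constant columns.
col_min = combined.min(axis=0)
col_range = combined.max(axis=0) - col_min
col_range[col_range == 0] = 1.0
scaled = (combined - col_min) / col_range

print(combined.shape)
print(scaled.min(axis=0), scaled.max(axis=0))
```

scikit-learn also provides `MinMaxScaler` and `StandardScaler` in `sklearn.preprocessing` for the same purpose.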

The above exercise gives you an example of how to generate features from a packet capture, attach labels to the dataset, and train a model using the labeled data. 

## Looking Ahead

Many other steps exist in the machine learning pipeline, including splitting the data into training and test sets, tuning model parameters, and evaluating the model. These will be the next steps we walk through.