# Generating feature set

The code that aggregates binetflow data into time-window intervals is in `main.py`. The script takes command-line arguments that
let you aggregate by attack type. For example:

```bash
python3 main.py --attack_type=ddos --interval=1
```

The above command takes all the files that contain a DDoS attack and aggregates them into 1-second time intervals. The results are stored in the folder `minute_aggregated/`,
in files named `{attack name}-{interval}s.featureset.csv`.

**NOTE:** By default, `main.py` excludes background connections. To include background connections, use the flag `--use_background`.
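To make the idea of aggregating on time intervals concrete, here is a minimal sketch of how flows could be bucketed into fixed-size windows and how the output file name is put together. It is not the code in `main.py`: the helper names, the `ts` key (a timestamp already converted to seconds), and the use of the lowercase `--attack_type` value as the attack name are all assumptions made for illustration.

```python
from collections import defaultdict


def window_index(timestamp, start, interval):
    """Zero-based index of the window that a timestamp (in seconds) falls into."""
    return int((timestamp - start) // interval)


def group_by_window(flows, interval):
    """Group flow dictionaries by window; 'ts' is a hypothetical timestamp key."""
    if not flows:
        return {}
    start = min(flow["ts"] for flow in flows)
    windows = defaultdict(list)
    for flow in flows:
        windows[window_index(flow["ts"], start, interval)].append(flow)
    return dict(windows)


def output_path(attack_type, interval):
    """Build a path following the {attack name}-{interval}s.featureset.csv pattern."""
    return f"minute_aggregated/{attack_type}-{interval}s.featureset.csv"


# Three made-up flows: two in the first second, one in the next.
flows = [{"ts": 0.2, "dur": 1.0}, {"ts": 0.9, "dur": 3.0}, {"ts": 1.5, "dur": 2.0}]
windows = group_by_window(flows, interval=1)
path = output_path("ddos", 1)
print({index: len(items) for index, items in windows.items()}, path)
```

Each window would then be reduced to one row of features in the output CSV.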

## Features used 
The features are defined in the `summarizer.py` script. The full list of features written to the CSV file is in the `features` field of the `Summarizer` class. The features we used are listed below:

```python
from summarizer import Summarizer
print(Summarizer().features)
```

```text
['n_conn', 'avg_duration', 'n_udp', 'n_tcp', 'n_icmp', 'n_sports>1024', 'n_sports<1024', 'n_dports>1024', 'n_dports<1024', 'n_d_a_p_address', 'n_d_b_p_address', 'n_d_c_p_address', 'n_d_na_p_address', 'std_packets', 'std_bytes', 'std_time', 'std_srcbytes', 'src_to_dst', 'entropy_sports>1024', 'entropy_sports<1024', 'entropy_dports>1024', 'entropy_dports<1024', 'entropy_srcport', 'entropy_dstport', 'entropy_dstip', 'n_s_a_p_address', 'n_s_b_p_address', 'n_s_c_p_address', 'n_s_na_p_address', 'entropy_srcip', 'entropy_src_a_ip', 'entropy_src_b_ip', 'entropy_src_c_ip', 'entropy_src_na_ip', 'entropy_dst_a_ip', 'entropy_dst_b_ip', 'entropy_dst_c_ip', 'entropy_dst_na_ip', 'entropy_bytes', 'entropy_src_bytes', 'entropy_time', 'entropy_state', 'entropy_packets']
```


Removing entries from or adding entries to that list changes what is written to the output CSV.

The `Summarizer` class has an `add()` function that takes a dictionary representing one line of a botnet file and folds it into the features
of the appropriate window. For example, the function reads the value `dur` and averages it into the `avg_duration` feature found in the features list.
So if there is another feature you want to implement, look at the `add()` function, where you receive a line from the botnet file and can use the information it provides.
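As a rough illustration of that pattern, the sketch below shows one way an `add()`-style method could keep a running average of `dur`. The `WindowSummary` class is a hypothetical stand-in for a single time window, not the real `Summarizer`, and the incremental-mean update is an assumption about how the averaging might be done.

```python
class WindowSummary:
    """Hypothetical stand-in for one time window; not the real Summarizer."""

    def __init__(self):
        self.data = {"n_conn": 0, "avg_duration": 0.0}

    def add(self, item):
        """Fold one flow (a dict for one line of the input file) into the window."""
        self.data["n_conn"] += 1
        count = self.data["n_conn"]
        duration = float(item["dur"])
        # Incremental mean: shift the old average toward the new value by 1/count.
        self.data["avg_duration"] += (duration - self.data["avg_duration"]) / count


window = WindowSummary()
for line in ({"dur": "1.0"}, {"dur": "3.0"}, {"dur": "2.0"}):
    window.add(line)
print(window.data)
```

A new feature would follow the same shape: add a key for it, then update that key inside `add()` from the fields of the incoming line.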

# Training the models

The script `botnet_detection_main.py` has functions set up to train some models. You can train Random Forest and Deep Learning models
with their default parameters through the script's command-line arguments. Here's an example:

```bash
python3 botnet_detection_main.py --attack_type=ddos --interval=1 --model_type=rf
```

Running this finds the feature set file aggregated from the DDoS attack files on 1-second intervals, then trains and tests a Random Forest model on it using a 70/30
train/test split. You can use `--model_type=dl` to use Deep Learning instead. The script prints the results, which include the accuracy, precision, recall, and F1
score.
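For reference, the four reported metrics can be derived from the counts of true and false positives and negatives. The sketch below is a plain-Python illustration of those definitions for a binary label, not the code the script uses; the function name `binary_metrics` and the sample labels are made up.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary classification."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 marks a window containing botnet traffic, 0 a normal window.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```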

**NOTE:** The command-line arguments here are similar to those of `main.py`. By default, the script looks for files that excluded background connections. To make it look for files
with background connections, use the flag `--use_background`.

If you want to change the parameters, you will have to edit the code in `botnet_detection.py`. There are two functions you'll want to look at: `rf_train()` and `dl_train()`.
Each function builds its respective model, so you can edit them to set the parameters you want.
