# Starter
This notebook is intended to give you an easy start with the repository. It has been tested on Windows.

---
## Check Python Version
Due to issues in earlier versions, we recommend using Python 3.9.1.

In [1]:
!python --version

Python 3.9.1
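
The same check can also be done from inside Python, which helps when `python` on the PATH is not the interpreter used by the notebook kernel. The following is a minimal sketch: the helper name `meets_minimum` and the decision to treat 3.9.1 as a lower bound are assumptions made for illustration and are not part of the repository.

```python
import sys

# Version recommended in this guide; treating it as a lower bound is an assumption.
RECOMMENDED = (3, 9, 1)


def meets_minimum(version, minimum=RECOMMENDED):
    """Return True if the (major, minor, micro) version is at least `minimum`."""
    return tuple(version[:3]) >= tuple(minimum)


current = tuple(sys.version_info[:3])
verdict = "OK" if meets_minimum(current) else "older than recommended"
print("Python", ".".join(map(str, current)), "-", verdict)
```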


---
## Install Requirements
Install the required packages to use the other functions and methods.

In [2]:
!pip install -r requirements.txt



---
## Load necessary packages for this Guide

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
from os import path
from pathlib import Path
from joblib import load
import dask.dataframe as dd
from scripts.dataset_all_prep import pipeline
from scripts.classifier import Classifier

---
## Get Dataset and Prepare it
**This step is only necessary if you do not want to use the pre-trained models or the use cases.**

Since the dataset is too large for version control, download it from [Kaggle](https://www.kaggle.com/devendra416/ddos-datasets?select=ddos_imbalanced). Make sure to download the unbalanced dataset. Place the file in the root folder of this repo and rename it to 'unbalaced_20_80_dataset.csv' if it has a different name (the code below expects exactly this spelling).

Check whether the dataset is in the right place. The output of the following cell should be 'True'.

In [5]:
path.isfile('unbalaced_20_80_dataset.csv') 

True

Load the dataset into a Dask DataFrame.

In [6]:
df = dd.read_csv('unbalaced_20_80_dataset.csv')

In this step, the data is cleaned and prepared using a pre-built pipeline. If you want to go through each step individually, use the files data_cleaning.ipynb, data_exploration.ipynb and data_preparation.ipynb.

**NOTE

In [8]:
pipeline(df, output_file=Path('./prepared_ds.csv'))

Check if the pipeline generated the required file. The output of the following cell should be 'True'.

In [5]:
path.isfile('prepared_ds.csv') 

True
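
Both file checks in this section can be combined into one helper that reports what is still missing. This is a minimal sketch using only the standard library. The function name `missing_files` is hypothetical, and the file names are the ones used in this guide.

```python
from pathlib import Path


def missing_files(names, root="."):
    """Return the entries of `names` that do not exist as files under `root`."""
    root = Path(root)
    return [name for name in names if not (root / name).is_file()]


# File names as used in this guide.
required = ["unbalaced_20_80_dataset.csv", "prepared_ds.csv"]

# An empty list means both files are in place.
print(missing_files(required))
```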

---
## Create Models
**This step is only necessary if you do not want to use the pre-trained models.**

To train the models yourself, use any of the files whose names start with 'model_'. Note that some of them may take several hours to run, depending on the system. For further instructions on how to use the models, please see the [sklearn documentation](https://scikit-learn.org/stable/).

---
## Use (pre-built) Models
To use models that are already built, load any file from the models folder ending in '.model' or '.joblib'. For further instructions on how to use the models, please see the [sklearn documentation](https://scikit-learn.org/stable/). Please note that we could not upload all models due to storage limitations.

In [10]:
clf = load('models/RandomForestClassifier_AdaBoost2.joblib')

In [13]:
clf.base_estimator

RandomForestClassifier(n_estimators=10, verbose=1)

---
## Example Use Case 1: Classifier
This is an example use case of the models. The classifier is a custom-built class that uses the models to predict the class of a network communication entry.

First, we need to initialize a new object.

In [2]:
c = Classifier()

We can change the model to predict the values using the method shown below. Currently, the following models are available:
  - 'RFC': RandomForestClassifier.joblib, Default on initialization
  - 'NN': neural_network_prod.model 
  - 'RFC_Ada': RandomForestClassifier_AdaBoost.joblib

In [6]:
c.set_current_model('RFC_Ada')

To test the model, we create a new dataset which consists of about 0.1% of the whole dataset. First, check if the necessary file is available. The output of the first cell should be 'True'.

In [7]:
path.isfile('unbalaced_20_80_dataset.csv') 

True

In [3]:
df_test = dd.read_csv('unbalaced_20_80_dataset.csv').sample(frac=0.001, random_state = 4).compute()

Predict the values of the test dataset and show the corresponding IP addresses.

In [4]:
predictions, ips = c.predict(df_test)

In [5]:
print(predictions)
print(ips)

['ddos' 'ddos' 'ddos' ... 'Benign' 'Benign' 'Benign']
                 Src IP
72997     18.219.193.20
119451     172.31.69.25
51578     18.219.193.20
59757      172.31.69.28
97909    18.216.200.189
...                 ...
21456      172.31.64.86
24966      79.175.45.92
18724   145.239.183.169
5823       172.31.67.80
15845           8.6.0.1

[7567 rows x 1 columns]


Test whether a specific IP address is predicted as 'ddos'. The output is 'False' if the IP is either classified as 'benign' or not present in the list.

In [11]:
c.is_malicious('18.219.193.20')

False

In [12]:
c.is_malicious('8.6.0.1')

False
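
The rule described above (only a known IP labelled 'ddos' counts as malicious) can be illustrated with a plain dictionary lookup. This is a sketch of the described behaviour only, not the code of the `Classifier` class. The function signature, the `labels_by_ip` mapping and the example addresses (taken from the documentation range 192.0.2.0/24) are all hypothetical.

```python
def is_malicious(ip, labels_by_ip):
    """Return True only if `ip` is known and its stored label is 'ddos'."""
    # dict.get returns None for unknown IPs, so they fall through to False.
    return labels_by_ip.get(ip) == "ddos"


# Hypothetical labels, not taken from the dataset.
labels_by_ip = {"192.0.2.1": "ddos", "192.0.2.2": "Benign"}

print(is_malicious("192.0.2.1", labels_by_ip))  # known and labelled 'ddos'
print(is_malicious("192.0.2.2", labels_by_ip))  # known but benign
print(is_malicious("192.0.2.3", labels_by_ip))  # unknown to the mapping
```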

---
## Example Use Case 2: API
This use case exposes the models through an API. The API is intended to be used by different firewall providers. Their firewalls can send aggregated network communication to the API and receive a classification for each request, based on which the firewall can exclude the IP address from further communication. In addition, blacklisted and whitelisted IPs can be requested. Using the `/ip/{ip}/status` endpoint, the firewall receives a status derived from the classifications of the last 10 entries for a specific IP.

Classification inside this system is done by a voting system that consists of multiple of our trained models, each of which classifies every entry. In this specific case, two of our best models are used. Because two models voting with hard labels could end in a tie (e.g. 1 ddos, 1 benign), the `predict_proba` method of the *sklearn* models is used instead. The mean of the predicted probabilities for a record determines the returned classification.
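
The averaging step can be sketched in plain Python. The example below assumes that each model returns one probability per class for a record, in the same class order (for sklearn estimators this order is given by the `classes_` attribute). The function name `soft_vote`, the class order and the probability values are assumptions made for illustration and do not come from the service code.

```python
def soft_vote(probabilities, classes=("Benign", "ddos")):
    """Average per-model class probabilities and return (label, mean_probs).

    `probabilities` holds one row per model. Each row contains that model's
    class probabilities for a single record, ordered like `classes`.
    """
    n_models = len(probabilities)
    means = [
        sum(row[i] for row in probabilities) / n_models
        for i in range(len(classes))
    ]
    # Pick the class with the highest mean probability.
    best = max(range(len(classes)), key=means.__getitem__)
    return classes[best], means


# Two hypothetical models disagree on the hard label: the first leans
# towards 'ddos' (0.55), the second clearly towards 'Benign' (0.90).
label, means = soft_vote([[0.45, 0.55], [0.90, 0.10]])
print(label, means)
```

Averaging probabilities lets the more confident model outweigh the less confident one, which is what resolves the tie that two hard votes would produce.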

### Installation and Use

To pull the Docker images and build the container, open a command prompt inside the service folder and run `docker compose up`. All required images are downloaded and the container is built. The process might take a few minutes. The first start takes longer since dummy data is inserted into the database.

The API can be accessed at `localhost:5000/v1` in the browser. An example user already exists:
- email: herbert@gmail.com
- password: Passwort123!

After logging in, copy the authorization header, which looks similar to this:
>`Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJleHAiOjE2MjUxNjY2MTYsImlhdCI6MTYyNTA4MDIxNiwic3ViIjoiaGVyYmVydEBnbWFpbC5jb20ifQ.n5UNLWJzdvVToTANstRMJwVo5Up4GEgi8XhxXVEFbd8` 

Then insert it into the authorization field of the endpoints. You can try this using the `/user/company` endpoint. The request should return `{"company": "Apple"}`.
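
Outside the browser, the same call can be made from a script. The sketch below uses only the standard library and assumes that the service is reachable over plain `http` and expects the copied value in an `Authorization` request header. The helper names `build_request` and `get_json` are hypothetical.

```python
import json
import urllib.request

# Address from this guide; the http scheme is an assumption.
BASE_URL = "http://localhost:5000/v1"


def build_request(path, auth_header):
    """Create a GET request for `path` that carries the authorization header.

    `auth_header` is the full value copied after login, i.e. 'Bearer <token>'.
    """
    return urllib.request.Request(
        BASE_URL + path,
        headers={"Authorization": auth_header},
        method="GET",
    )


def get_json(path, auth_header):
    """Send the request and decode the JSON body (needs the running container)."""
    with urllib.request.urlopen(build_request(path, auth_header)) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call, only meaningful while the service is up:
# get_json("/user/company", "Bearer <token copied after login>")
```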

### API Description

#### users

| Endpoint                       | Method     | Description                                   |
| ------------------------------ | ---------- | --------------------------------------------- |
| /user/company                  | get        | get company of user by token (jwt)            |
| /user/signin                   | post       | signin with user credentials                  |
| /user/signup                   | post       | create new user                               |
| /user/update                   | post       | update names of a user                        |
| /user/update                   | get        | get updatable fields of a user                |
| /user/verify-token             | post       | verify that the token is valid                |

#### companies

| Endpoint                       | Method     | Description                                   |
| ------------------------------ | ---------- | --------------------------------------------- |
| /company/list                  | get        | get list of all registered companies          |
| /company/register              | post       | register a new company<sup>1</sup>            |
| /company/{company}             | get        | get all users of a company                    |

<sup>1</sup>This endpoint is for illustrative purposes only. In a real environment, we do not recommend exposing an unprotected endpoint for creating a company.

#### ips

| Endpoint                       | Method     | Description                                   |
| ------------------------------ | ---------- | --------------------------------------------- |
| /ip/                           | post       | post new record, returns classification       |
| /ip/blacklist                  | get        | list of potentially harmful ip addresses      |
| /ip/whitelist                  | get        | list of potentially harmless ip addresses     |
| /ip/{ip}                       | get        | get class counts of last 10 entries of ip     |
| /ip/{ip}/status                | get        | get current status of ip                      |
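
The idea behind the last two endpoints, keeping only the 10 most recent classifications per IP and deriving counts and a status from them, can be sketched as follows. This is not the service's implementation: the class `IpHistory`, the majority rule used for the status and the `'unknown'` fallback are assumptions, and the example address comes from the documentation range 192.0.2.0/24.

```python
from collections import Counter, deque

WINDOW = 10  # the guide states that the last 10 entries per IP are considered


class IpHistory:
    """Keep the most recent classifications per IP address."""

    def __init__(self, window=WINDOW):
        self.window = window
        self._entries = {}

    def add(self, ip, label):
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self._entries.setdefault(ip, deque(maxlen=self.window)).append(label)

    def class_counts(self, ip):
        """Counts per class over the retained entries (cf. GET /ip/{ip})."""
        return dict(Counter(self._entries.get(ip, ())))

    def status(self, ip):
        """Most frequent recent class, or 'unknown' (cf. GET /ip/{ip}/status)."""
        counts = Counter(self._entries.get(ip, ()))
        if not counts:
            return "unknown"
        return counts.most_common(1)[0][0]


history = IpHistory()
for label in ["Benign"] * 4 + ["ddos"] * 8:
    history.add("192.0.2.10", label)

# Only the last 10 of the 12 entries are kept: 2 x 'Benign' and 8 x 'ddos'.
print(history.class_counts("192.0.2.10"))
print(history.status("192.0.2.10"))
```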