# <FONT COLOR='red'>**_OVERALL DESCRIPTION_**</FONT>

---
---

The purpose of this notebook is to replicate, with current libraries, the ARF (Adaptive Random Forest) model implemented by NELLY, so that it correctly classifies elephant and mouse flows in DCN traffic traces. The model is trained on UNIV1, a modified version of the UNI1 dataset.

## <FONT COLOR = 'gray'>**INSTALL LIBRARIES**</FONT>

---

Next, we need to install the library that provides the model.

1. `pip install river`: installs the latest release of the `river` library, which resulted from the merger of the `creme` (Halford et al. 2019) and `scikit-multiflow` (Montiel et al. 2018) libraries.

In [1]:
%%capture
!pip install river

## <FONT COLOR = 'gray'>**IMPORT MODEL LIBRARIES**</FONT>

---

Next, we need to import the libraries used to build and evaluate the model.

1. `from river import evaluate, forest, metrics, preprocessing, stream`: imports five `river` modules. `evaluate` provides tools for evaluating online `ML` models such as `ARF`, which are scored as they are updated with new data; one of its most common functions is `evaluate.progressive_val_score`, which evaluates the performance of a model progressively over a stream. `forest` contains random forest implementations adapted to streaming learning, including Adaptive Random Forest, a variant of the random forest designed to adapt to changes in the data over time (concept drift). `metrics` provides metrics such as accuracy, precision, recall, and F1 that are designed to be updated continuously.

  `preprocessing` contains tools for preprocessing data in continuous streams. Finally, `stream` provides functions for handling data streams, including readers for files and pandas DataFrames that simulate the sequential arrival of data.
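The test-then-train idea behind progressive validation can be sketched with the standard library alone. The snippet below is only an illustration under stated assumptions: `MajorityClass`, `progressive_accuracy`, and the toy stream are hypothetical names invented here and are not part of `river`. The point is the order of operations: predict first, score the prediction, then learn.

```python
from collections import Counter


class MajorityClass:
    """Hypothetical toy model: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        # No label has been seen yet, so there is nothing to predict.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn_one(self, x, y):
        self.counts[y] += 1


def progressive_accuracy(stream, model):
    """Test-then-train: each sample is predicted before the model learns from it."""
    correct = total = 0
    for x, y in stream:
        y_pred = model.predict_one(x)
        if y_pred is not None:
            total += 1
            correct += int(y_pred == y)
        model.learn_one(x, y)
    return correct / total if total else 0.0


# Toy stream of (features, label) pairs; the values are made up for illustration.
toy_stream = [({"size": 60}, 0), ({"size": 64}, 0), ({"size": 1500}, 1), ({"size": 70}, 0)]
accuracy = progressive_accuracy(toy_stream, MajorityClass())
```

Because every sample is scored before it is used for training, no separate hold-out set is needed.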

## <FONT COLOR = 'gray'>**IMPORT DATA ANALYSIS LIBRARIES**</FONT>

---

Next, we need to import the libraries used for data analysis.

1. `import pandas as pd`: An essential component for data analysis and manipulation in Python. It provides data structures such as Series (one-dimensional) and DataFrames (two-dimensional) that make tabular data easy to handle.
2. `import numpy as np`: A fundamental component for numerical computation in Python. It provides support for multidimensional arrays and matrices, as well as a large collection of mathematical functions that operate on these arrays.

In [2]:
# IMPORT MODEL LIBRARIES
from river import evaluate, forest, metrics, preprocessing, stream

# IMPORT DATA ANALYSIS LIBRARIES
import pandas as pd
import numpy as np

## <FONT COLOR = 'orange'>**LOAD DATASET**</FONT>

---

We will use the UNIV1 dataset, a modification of the UNI1 dataset, which contains traffic traces from several DCNs collected by a university in 2010. The traffic traces in the dataset are IPv4 only.

UNIV1 has the following structure:

1. `start_time`: Time at which the capture of the flow begins.
2. `end_time`: Time at which the capture of the flow ends.
3. `ip_src`: Source IPv4 address.
4. `ip_dst`: Destination IPv4 address.
5. `ip_proto`: IP protocol used.
6. `port_src`: Source port.
7. `port_dst`: Destination port.
8. `size_pkt1`: Size of packet 1.
9. `size_pkt2`: Size of packet 2.
10. `size_pkt3`: Size of packet 3.
11. `size_pkt4`: Size of packet 4.
12. `size_pkt5`: Size of packet 5.
13. `size_pkt6`: Size of packet 6.
14. `size_pkt7`: Size of packet 7.
15. `iat_pkt2`: Inter-arrival time of packet 2.
16. `iat_pkt3`: Inter-arrival time of packet 3.
17. `iat_pkt4`: Inter-arrival time of packet 4.
18. `iat_pkt5`: Inter-arrival time of packet 5.
19. `iat_pkt6`: Inter-arrival time of packet 6.
20. `iat_pkt7`: Inter-arrival time of packet 7.
21. `tot_size`: Total size of the flow.
22. `flow_type`: Flow type, either elephant or mice.
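As a hedged illustration of how features of this kind relate to raw packets, the sketch below derives the first seven packet sizes and the corresponding inter-arrival times from a list of `(timestamp, size)` pairs. The function name `flow_features`, the zero-padding of flows shorter than seven packets, and the time unit are assumptions made for this example; nothing here describes how UNIV1 was actually built.

```python
def flow_features(packets, n_first=7):
    """Build size_pkt1..N, iat_pkt2..N and tot_size from (timestamp, size) pairs.

    Assumption: flows with fewer than n_first packets are zero-padded.
    """
    first = packets[:n_first]
    features = {}
    for i in range(n_first):
        size = first[i][1] if i < len(first) else 0
        features[f"size_pkt{i + 1}"] = size
    for i in range(1, n_first):
        # Inter-arrival time: gap between packet i+1 and the packet before it.
        if i < len(first):
            iat = first[i][0] - first[i - 1][0]
        else:
            iat = 0
        features[f"iat_pkt{i + 1}"] = iat
    # The total size covers every packet of the flow, not only the first n_first.
    features["tot_size"] = sum(size for _, size in packets)
    return features


# Hypothetical flow: timestamps in arbitrary integer units, sizes in bytes.
example = flow_features([(100, 60), (130, 1500), (135, 1500)])
```

The first packet has no predecessor, which is why the inter-arrival columns start at `iat_pkt2`.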

In [3]:
# UNIV1 DATASET ID (GOOGLE DRIVE FILE ID)
file_id = '1Nt4A7U0P_2x7VYfX0T7Tr21tt3CHIREl'

# GENERATE THE DOWNLOAD URL
url_univ1 = f'https://drive.google.com/uc?id={file_id}'

# DOWNLOAD AND LOAD UNIV1 DATASET IN A DATAFRAME OF PANDAS
univ1_df = pd.read_csv(url_univ1)
univ1_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73256 entries, 0 to 73255
Data columns (total 22 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   start_time  73256 non-null  int64 
 1   end_time    73256 non-null  int64 
 2   ip_src      73256 non-null  int64 
 3   ip_dst      73256 non-null  int64 
 4   ip_proto    73256 non-null  int64 
 5   port_src    73256 non-null  int64 
 6   port_dst    73256 non-null  int64 
 7   size_pkt1   73256 non-null  int64 
 8   size_pkt2   73256 non-null  int64 
 9   size_pkt3   73256 non-null  int64 
 10  size_pkt4   73256 non-null  int64 
 11  size_pkt5   73256 non-null  int64 
 12  size_pkt6   73256 non-null  int64 
 13  size_pkt7   73256 non-null  int64 
 14  iat_pkt2    73256 non-null  int64 
 15  iat_pkt3    73256 non-null  int64 
 16  iat_pkt4    73256 non-null  int64 
 17  iat_pkt5    73256 non-null  int64 
 18  iat_pkt6    73256 non-null  int64 
 19  iat_pkt7    73256 non-null  int64 
 20  tot_si

## <FONT COLOR = 'orange'>**DATA CLEANING**</FONT>

---

Before building any ML (Machine Learning) model, the data must be cleaned. In this case, the records are converted to data types that the `ARF` algorithm can interpret, and the target field `flow_type` is encoded by replacing `elephant` with 1 and `mice` with 0.

In [4]:
# CONVERT UNIQUE FLOW_TYPE VALUES THAT ARE OBJECT TO NUMERIC
print(f'Unique values of flow_type: {univ1_df["flow_type"].unique()}')
print(f'mice: 0\nelephant: 1\n')
univ1_df['flow_type'] = univ1_df['flow_type'].map({'mice': 0, 'elephant': 1})
print(f'New unique values of flow_type: {univ1_df["flow_type"].unique()}\n')

Unique values of flow_type: ['mice' 'elephant']
mice: 0
elephant: 1

New unique values of flow_type: [0 1]



In [5]:
# CONVERT UNIV1 INTO A RIVER DATA STREAM
dataset = stream.iter_pandas(
    X=univ1_df.drop(columns=['flow_type']),
    y=univ1_df['flow_type']
)
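Conceptually, the stream yields one `(features, label)` pair per row, with the features held in a dictionary keyed by column name. The standard-library sketch below mimics that shape for a list of row dictionaries; `iter_rows` and the sample rows are hypothetical stand-ins for the DataFrame, so this is not `river`'s implementation.

```python
def iter_rows(rows, target):
    """Yield (x, y) pairs, where x is a dict of features and y is the label."""
    for row in rows:
        x = {name: value for name, value in row.items() if name != target}
        yield x, row[target]


# Made-up rows using a few of the UNIV1 column names.
rows = [
    {"size_pkt1": 60, "iat_pkt2": 30, "flow_type": 0},
    {"size_pkt1": 1500, "iat_pkt2": 5, "flow_type": 1},
]
pairs = list(iter_rows(rows, target="flow_type"))
```

Like any Python generator, `iter_rows(...)` is exhausted after a single pass, so it has to be created again before it can be iterated a second time.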

In [6]:
# IMPLEMENTATION OF A STANDARDIZER
scaler = preprocessing.StandardScaler()
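A streaming scaler cannot see the whole dataset in advance, so it has to maintain running statistics. The sketch below shows one common way to do that for a single feature, Welford's algorithm. `RunningScaler` is a hypothetical class written for this illustration, and it is not claimed to match the internals of `preprocessing.StandardScaler`.

```python
import math


class RunningScaler:
    """Hypothetical single-feature scaler based on Welford's running mean/variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def learn_one(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def transform_one(self, value):
        # Population variance; return 0.0 while the variance is still zero.
        variance = self.m2 / self.n if self.n else 0.0
        if variance == 0.0:
            return 0.0
        return (value - self.mean) / math.sqrt(variance)


scaler_sketch = RunningScaler()
for size in [60, 1500, 60, 1500]:
    scaler_sketch.learn_one(size)
scaled = scaler_sketch.transform_one(1500)
```

After these four toy values the running mean is 780 and the standard deviation is 720, so a packet size of 1500 maps to 1.0.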

In [7]:
# CREATE ARF MODEL
arf_model = forest.ARFClassifier(
    n_models=5,
    max_depth=5,
    max_features='log2',
    lambda_value=3,
    min_branch_fraction=0.15,
    max_share_to_split=0.85,
    split_criterion='gini',
    leaf_prediction='nb',
    merit_preprune=True,
    delta=0.25,
    seed=42
)
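Of these hyperparameters, `lambda_value` is perhaps the least self-explanatory. In online bagging, each ensemble member receives every incoming sample with a weight k drawn from a Poisson distribution whose mean is the lambda value, which approximates bootstrap resampling on a stream. The sketch below illustrates that sampling step with the standard library, using Knuth's method; `poisson` and `bagging_weights` are hypothetical helpers, not `river` functions.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def bagging_weights(n_models, lam, rng):
    """One weight per ensemble member for a single incoming sample."""
    return [poisson(lam, rng) for _ in range(n_models)]


rng = random.Random(42)
# Mirrors n_models=5 and lambda_value=3 from the model definition.
weights = bagging_weights(n_models=5, lam=3, rng=rng)

# Averaged over many draws, the weight approaches lam.
draws = [poisson(3, rng) for _ in range(20_000)]
mean_weight = sum(draws) / len(draws)
```

A weight of 0 means the member skips the sample, which is what makes the trees in the ensemble differ from one another.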

In [8]:
# CHAIN PREPROCESSING AND MODEL
model = scaler | arf_model
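The `|` operator builds a pipeline in which each sample passes through the scaler before it reaches the classifier. As a rough sketch of how such chaining can be expressed in Python, the toy classes below overload `__or__`. `Step`, `Pipeline`, and the lambda functions are hypothetical and greatly simplified; they do not reproduce `river`'s pipeline class.

```python
class Pipeline:
    """Hypothetical minimal pipeline: applies its steps from left to right."""

    def __init__(self, *steps):
        self.steps = list(steps)

    def __or__(self, other):
        # Appending with `|` returns a new, longer pipeline.
        return Pipeline(*self.steps, other)

    def run(self, x):
        for step in self.steps:
            x = step.func(x)
        return x


class Step:
    """Hypothetical pipeline stage wrapping a plain function."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` returns a pipeline that runs a first, then b.
        return Pipeline(self, other)


# Toy stages: halve every feature, then flag samples whose total exceeds 100.
halve = Step(lambda x: {k: v / 2 for k, v in x.items()})
classify = Step(lambda x: int(sum(x.values()) > 100))
toy_model = halve | classify
label = toy_model.run({"size_pkt1": 1500, "size_pkt2": 60})
```

Treating the chained object as a single model is what allows the evaluation step to call it without knowing that a scaler sits in front of the classifier.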

In [9]:
# DEFINE THE METRICS
metricas = metrics.ClassificationReport()

In [10]:
# PERFORM PROGRESSIVE VALIDATION AND CALCULATE METRICS
evaluate.progressive_val_score(dataset, model, metric=metricas)

           Precision   Recall   F1       Support  
                                                  
       0      98.75%   94.79%   96.73%     47531  
       1      91.04%   97.78%   94.29%     25724  
                                                  
   Macro      94.89%   96.29%   95.51%            
   Micro      95.84%   95.84%   95.84%            
Weighted      96.04%   95.84%   95.87%            

                 95.84% accuracy                  
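
The averaged rows of the report can be checked against the per-class rows. The sketch below recomputes the macro and weighted averages from the figures printed above; the variable and function names are chosen for this example, and differences in the last digit are possible because the inputs are already rounded.

```python
# Per-class figures copied from the report: (precision, recall, F1, support).
report = {
    0: (98.75, 94.79, 96.73, 47531),
    1: (91.04, 97.78, 94.29, 25724),
}

total_support = sum(row[3] for row in report.values())


def macro(index):
    """Unweighted mean of a metric over the classes."""
    return sum(row[index] for row in report.values()) / len(report)


def weighted(index):
    """Mean of a metric over the classes, weighted by class support."""
    return sum(row[index] * row[3] for row in report.values()) / total_support


macro_f1 = macro(2)
weighted_precision = weighted(0)
# For single-label classification, support-weighted recall equals accuracy.
accuracy = weighted(1)
```

The macro average treats both classes equally, whereas the weighted average is pulled toward class 0 (mice), which accounts for roughly two thirds of the evaluated flows.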