# Introduction to Running NetShare

Install the NetShare Python package before running the dataset:  
#### 1. Make sure Anaconda is installed  
#### 2. Create the virtual environment if it does not exist:  
conda create --name NetShare python=3.9

#### 3. Activate the virtual environment:  
conda activate NetShare  

#### 4. Install NetShare package:  
git clone https://github.com/netsharecmu/NetShare.git  
pip3 install -e NetShare/  

#### 5. Install SDMetrics package:  
git clone https://github.com/netsharecmu/SDMetrics_timeseries  
pip3 install -e SDMetrics_timeseries/ 

# Run the Dataset

#### Import the modules and configure Ray

In [1]:
import random
import json
import netshare.ray as ray
from netshare import Generator
from pre_post_processor_mixed import pre_processor, post_processor
from zeek_processor import parse2csv
# Ray is disabled here; set this to True if you would like to use Ray
ray.config.enabled = False
ray.init(address="auto")

Ray is disabled


## Part 1: Data Pre Processor

### 1.1. Introduction of the Pipeline 


![image](./WholeProprocessor.png)


As the image shows, the whole preprocessor is split into three parts.  
First, the log file processor converts the original file (for example, a pcap file) into a CSV file. Second, mixed-format columns such as lists are converted into single-format columns. Finally, the single-format data is encoded into numeric vectors, and the resulting dataset is passed to the DG model.  

### 1.2. Pipeline of the Log file preprocessor 


![image](./logpreprocessor.png)


#### 1.2.1 Code for the log file preprocessor

In [2]:
# pre processor for the log file 
preprocess_stage1_output = parse2csv.parse_to_csv('rahul_config.json')

### 1.3. Pipeline of the list format preprocessor


This is our processor pipeline for the mixed format. First, we take the CSV file produced by the log processor together with the user's choice of columns and encoding methods. We then check for incorrect input and remove any mismatched columns. Next, we convert mixed-format columns into single-format columns. Finally, we encode each column and train the DG model on the result.  

![image](./preprocessor.png)


#### Detection of incorrect input   
We provide a simple check that guards against incorrect user input. If the type a user declares for a column conflicts with the dtype detected in the data, the column is removed from the current list.   
The error cases are listed below:  

| Detected dtype | User's input |
| -------------- | ------------ |
| object         | int          |
| int, float     | string       |
| int, float     | IP address   |
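
The check in the table above can be sketched as follows. This is a minimal illustration and not NetShare's actual implementation: the function names, the `detect_kind` heuristic, and the use of the `format` values `integer`, `string`, and `IP` from the field configuration are all assumptions.

```python
# Hypothetical sketch of the wrong-input check; names are illustrative only.

# (detected kind, user-declared format) pairs that count as errors.
MISMATCHES = {
    ("object", "integer"),
    ("int", "string"),
    ("float", "string"),
    ("int", "IP"),
    ("float", "IP"),
}


def detect_kind(values):
    """Classify a column as 'int', 'float', or 'object' from its values."""
    def is_int(v):
        return isinstance(v, int) and not isinstance(v, bool)

    def is_number(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    if all(is_int(v) for v in values):
        return "int"
    if all(is_number(v) for v in values):
        return "float"
    return "object"


def drop_mismatched(columns, field_configs):
    """Keep only the field configs whose declared format fits the data."""
    kept = []
    for cfg in field_configs:
        kind = detect_kind(columns[cfg["name"]])
        if (kind, cfg["format"]) in MISMATCHES:
            continue  # declared format conflicts with the data: drop the column
        kept.append(cfg)
    return kept
```

For example, a column of strings declared as `integer` would be dropped, while a column of integers declared as `integer` would be kept.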

#### Processor for mixed formats 
We define two kinds of format for the columns of a dataset:  
single format and mixed format.   

| Format        | Details                      |
| ------------- | ---------------------------- |
| Single format | int, float, string           |
| Mixed format  | timestamp, IPv4, IPv6, list  |

A single-format column can be encoded directly by one of the four encoding methods (bit, word-vec, float, and categorical).  
A mixed-format column needs some processing before it can be encoded. For example, it makes more sense to convert an IP address to an integer than to apply categorical encoding to it directly. Likewise, a timestamp is converted to an integer number of nanoseconds. Finally, we support encoding list-format columns: 

| Encoding        | Example column | Supported format                               |
| --------------- | -------------- | ---------------------------------------------- |
| IPv4            | IP__src_s      | "128.237.82.10" (to int)                       |
| IPv6            | None           | "FE80:CD00:0:CDE:1257:0:211E:729C" (to int)    |
| Timestamp       | packet__time   | "%Y-%m-%d %H:%M:%S.%f" (to int)                |
| List attributes | packet__layers | "a,b,c,d" (delimiter: ",")                     |
| List values     | DNS__answers   | "fields1 = val1 fields2 = val2" (delimiter: "=") |
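
The IP and timestamp conversions can be sketched with the Python standard library. This is only an illustration of the idea; the function names are hypothetical, and treating timestamps as UTC is an assumption rather than something the preprocessor is known to do.

```python
import ipaddress
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ip_to_int(address: str) -> int:
    """Convert an IPv4 or IPv6 address string to its integer value."""
    return int(ipaddress.ip_address(address))


def timestamp_to_ns(text: str, fmt: str = "%Y-%m-%d %H:%M:%S.%f") -> int:
    """Convert a formatted timestamp to nanoseconds since the Unix epoch.

    Assumption: the timestamp is interpreted as UTC.
    """
    moment = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    delta = moment - EPOCH
    # Integer arithmetic avoids the rounding errors of float seconds.
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * 1_000_000_000 + delta.microseconds * 1_000


print(ip_to_int("192.168.0.1"))  # 3232235521
```

An IPv4 address fits in 32 bits, which is consistent with the `n_bits: 32` bit encoding used for `IP__src_s` and `IP__dst_s` in this notebook.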


For the list format, we support two kinds of encoding: list attributes and list values.   
List-attributes encoding treats each possible item as a new column. For example, a packet__layers value of "a,b" produces the columns below:  

| packet__layers_a   | packet__layers_b                | packet__layers_c              |
| ------------------- | ---------------------------------------|--------------------------------------|
| yes            | yes                         |   no                     |
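
A minimal sketch of list-attributes encoding, assuming the set of possible items is known in advance (as with the `names` entry in the field configuration). The function name and the lowercase `yes`/`no` markers are illustrative choices, not the preprocessor's confirmed behavior.

```python
def encode_list_attributes(column, value, names, delimiter=","):
    """Expand one list-valued cell into one yes/no column per known item."""
    present = {item.strip() for item in value.split(delimiter)} if value else set()
    return {
        f"{column}_{name}": ("yes" if name in present else "no")
        for name in names
    }


row = encode_list_attributes("packet__layers", "a,b", ["a", "b", "c"])
print(row)
# {'packet__layers_a': 'yes', 'packet__layers_b': 'yes', 'packet__layers_c': 'no'}
```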

List-values encoding extracts each field name together with its value and creates one new column per field, as shown below:    
The original format is: "fields1 = val1 fields2 = val2"   

| DNS__answers_fields1  | DNS__answers_fields2                |
| ------------------- | ---------------------------------------|
| val1            | val2                      | 

Other delimiters are also supported; users specify the delimiter for each list column.
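
List-values encoding can be sketched with a regular expression. This is an illustration under stated assumptions, not the preprocessor's actual parser: the function name is hypothetical, field names are assumed to be word characters, and values are assumed to contain no whitespace.

```python
import re


def encode_list_values(column, value, delimiter="="):
    """Split 'field<delimiter>value' pairs into one column per field."""
    pair = re.compile(r"(\w+)\s*" + re.escape(delimiter) + r"\s*(\S+)")
    return {f"{column}_{field}": val for field, val in pair.findall(value)}


row = encode_list_values("DNS__answers", "fields1 = val1 fields2 = val2")
print(row)
# {'DNS__answers_fields1': 'val1', 'DNS__answers_fields2': 'val2'}
```

Passing a different `delimiter`, such as `":"`, handles inputs like `"fields1: val1 fields2:val2"` in the same way.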



#### 1.3.1 Attributes and their encoding methods in the sample dataset

The example dataset for this notebook has the following attributes and encoding methods: 

| Attributes          | Encoding Method                         |
| ------------------- | --------------------------------------- |
| packet__time        | Timestamp                               |
| IP__ttl             | Float (included in time series fields)  |
| DNS__ttl            | Float (included in time series fields)  |
| IP__len             | Bitwise encoding                        |
| IP__src_s           | Bitwise encoding                        |
| IP__dst_s           | Bitwise encoding                        |
| IP__p               | Word vector encoding - protocol         |
| packet__len         | Bitwise encoding                        |
| UDP__dport          | Word vector encoding - port             |
| UDP__sport          | Word vector encoding - port             |
| TCP__seq            | Bitwise encoding                        |


| Attributes          | Encoding Method                        |
| ------------------- | -------------------------------------- |
| TCP__dport          | Word vector encoding  - port           |
| TCP__flags          | Bitwise encoding                       |
| TCP__sport          | Word vector encoding  - port           |
| DNS__dlen           | Bitwise encoding                       |
| DNS__an             | Bitwise encoding                       |
| DNS__opcode         | Bitwise encoding                       |
| DNS__answers_amount | Bitwise encoding                       |
| IEEE__type          | Bitwise encoding                       |
| IEEE__dsr           | Bitwise encoding                       |
| MQTT__mlen          | Bitwise encoding                       |
| DNS__answers        | List value encoding                        |
| packet__layers      | List attributes encoding                       |


#### 1.3.2 Code for the mixed format preprocessor
input_file: The location of the original CSV file  
input_field_configs: The columns the user is interested in and the encoding method assigned to each   
config_file: The default parameters we have set up for the DG model   
output_file: The location of the preprocessed CSV file for the mixed format   
output_config: The location of the output configuration file

In [3]:
input_file = "../../traces/Datasets/rahul.csv"
config_file = "config_default.json"
input_field_configs = "rahul_config.json"
output_file = "../../traces/Datasets/dataset.csv"
output_config = "sample.json"
Pre_processor = pre_processor.Pre_processor(
    filename=input_file,
    default_configs=config_file,
    input_field_configs=input_field_configs,
    output_path=output_file,
    output_config=output_config)
Pre_processor.processor()

[{'name': 'IP__src_s', 'format': 'IP', 'encoding': 'bit', 'type': 'IPv4'}, {'name': 'IP__dst_s', 'format': 'IP', 'encoding': 'bit', 'type': 'IPv4'}, {'name': 'IP__p', 'format': 'integer', 'encoding': 'word_proto'}, {'name': 'IP__type', 'format': 'integer', 'abnormal': True, 'encoding': 'bit'}, {'name': 'Label', 'format': 'string', 'encoding': 'categorical'}, {'name': 'packet__layers', 'format': 'list', 'encoding': 'list_attributes', 'names': ['Ethernet', 'IP', 'TCP'], 'delimiter': ','}, {'name': 'UDP__dport', 'format': 'integer', 'abnormal': True, 'encoding': 'word_port'}, {'name': 'UDP__sport', 'format': 'integer', 'abnormal': True, 'encoding': 'word_port'}, {'name': 'TCP__sport', 'format': 'integer', 'encoding': 'word_port'}, {'name': 'TCP__dport', 'format': 'integer', 'encoding': 'word_port'}]
abnormal lists are  ['TCP__seq', 'TCP__flags', 'TCP__sport', 'TCP__dport']
DATA is  {'input_file': {'path': '../../traces/Datasets/75_small.csv', 'format': 'csv'}, 'output_file': '../../traces

## Part 2: Run the DG model

In [4]:
# Run the generator with the configuration file produced by the preprocessor
generator = Generator(config=output_config)

# Remember to delete test-dataset in the results folder before running this code
generator.train(work_folder='../../results/test-dataset')
generator.generate(work_folder='../../results/test-dataset')

default config path is  ['', '/home/ubuntu/NetShare/netshare/configs/default']
config is  {'global_config': {'overwrite': True, 'original_data_file': '../../traces/Datasets/dataset.csv', 'dataset_type': 'netflow', 'n_chunks': 1, 'dp': False, 'allowed_data_types': ['ip_string', 'integer', 'float', 'string'], 'allowed_data_encodings': ['categorical', 'bit', 'word2vec_port', 'word2vec_proto']}, 'pre_post_processor': {'class': 'NetsharePrePostProcessor', 'config': {'max_flow_len': None, 'norm_option': 0, 'split_name': 'multichunk_dep_v2', 'df2chunks': 'fixed_time', 'truncate': 'per_chunk', 'word2vec': {'vec_size': 10, 'model_name': 'word2vec_vecSize', 'annoy_n_trees': 100, 'pretrain_model_path': None}, 'metadata': [{'column': 'IP__src_s', 'type': 'integer', 'encoding': 'bit', 'n_bits': 32, 'categorical_mapping': False}, {'column': 'IP__dst_s', 'type': 'integer', 'encoding': 'bit', 'n_bits': 32, 'categorical_mapping': False}, {'column': 'IP__p', 'type': 'integer', 'encoding': 'word2vec_prot

[999 rows x 33 columns]
m is  {'column': 'IP__src_s', 'type': 'integer', 'encoding': 'bit', 'n_bits': 32, 'categorical_mapping': False}
m is  {'column': 'IP__dst_s', 'type': 'integer', 'encoding': 'bit', 'n_bits': 32, 'categorical_mapping': False}
m is  {'column': 'IP__p', 'type': 'integer', 'encoding': 'word2vec_proto'}
m is  {'column': 'IP__type', 'type': 'integer', 'encoding': 'bit', 'n_bits': 32, 'categorical_mapping': False}
m is  {'column': 'Label', 'type': 'string', 'encoding': 'categorical'}
m is  {'column': 'packet__layers_Ethernet', 'type': 'string', 'encoding': 'categorical'}
m is  {'column': 'packet__layers_IP', 'type': 'string', 'encoding': 'categorical'}
m is  {'column': 'packet__layers_TCP', 'type': 'string', 'encoding': 'categorical'}
m is  {'column': 'UDP__dport', 'type': 'integer', 'encoding': 'word2vec_port'}
m is  {'column': 'UDP__sport', 'type': 'integer', 'encoding': 'word2vec_port'}
m is  {'column': 'TCP__sport', 'type': 'integer', 'encoding': 'word2vec_port'}
m 



Word2Vec model is saved at ../../results/test-dataset/pre_processed_data/word2vec_vecSize_10.model
Building annoy dictionary word2vec...
{'proto': ['IP__p'], 'port': ['UDP__dport', 'UDP__sport', 'TCP__sport', 'TCP__dport']}
Finish building Angular trees...
field is  {'column': 'IP__src_s', 'type': 'integer', 'encoding': 'bit', 'n_bits': 32, 'categorical_mapping': False}
field name is  IP__src_s
list of df column is  [3232235521, 3743906754, 3188503052, 3232235533, 3232235536, 3743905456, 3232235544, 1823571388, 1823564221]
field choices are  [3232235521, 3743906754, 3188503052, 3232235533, 3232235536, 3743905456, 3232235544, 1823571388, 1823564221]
field is  {'column': 'IP__dst_s', 'type': 'integer', 'encoding': 'bit', 'n_bits': 32, 'categorical_mapping': False}
field name is  IP__dst_s
list of df column is  [3743906754, 4177189317, 4177136106, 3232235533, 3232235534, 3232235536, 3743905456, 3232235543, 3232235544]
field choices are  [3743906754, 4177189317, 4177136106, 3232235533, 323

field choices are  [0, 3546378244, 3546480644, 3546552338, 3546648595, 3546642455, 3546378274, 3546595366, 378376240, 3546630199, 3546513474, 3546484808, 3546630229, 3546556502, 265709653, 593274981, 3546382438, 3546407016, 3546599530, 3546517638, 3546433676, 3546488972, 3546560666, 265754617, 3546601630, 3546423458, 3546489008, 3546411210, 3546521802, 3546403020, 3546620125, 3546564830, 3546605794, 265797860, 3546628327, 3546437870, 265726193, 265742577, 265758961, 265767153, 265783537, 265791729, 3546394872, 3546620155, 3546360062, 3546470660, 3546622215, 3546444046, 3546525966, 3546493202, 3546394902, 3546386716, 3546652961, 3546568994, 3546609958, 3546444076, 3546442034, 3546386746, 3546652991, 3546466636, 265713997, 3546530130, 3546497366, 3546642775, 3546399066, 3546364256, 3546474854, 3546573158, 3546614122, 3546448240, 3546390910, 3546419588, 3546534294, 265762809, 3546479018, 3546577322, 3546618286, 3546450378, 3546538458, 3546458592, 265730537, 265746921, 265771497, 354649956

0it [00:00, ?it/s]


Chunk_id: 0
Before truncation, df_per_chunk: (999, 33)


To preserve the previous behavior, use

	>>> .groupby(..., group_keys=False)


	>>> .groupby(..., group_keys=True)
  processed = grouped.apply(process_group)


After truncation, df_per_chunk: (999, 33)
current field is  IP__src_s
field instance is  <netshare.utils.field.BitField object at 0x7fefc725fbb0>
current field is  IP__dst_s
field instance is  <netshare.utils.field.BitField object at 0x7fefc725fb80>
current field is  IP__type
field instance is  <netshare.utils.field.BitField object at 0x7fefc725fc10>
field column is  Label
field column is  packet__layers_Ethernet
field column is  packet__layers_IP
field column is  packet__layers_TCP
field column is  DNS__answers_name
field column is  DNS__answers_address
df_per_chunk: (999, 282)



100%|##########| 14/14 [00:00<00:00, 273.46it/s]


data_attribute: (14, 248), 2.7776e-05GB in memory
data_feature: (14, 636, 21), 0.001495872GB in memory
data_gen_flag: (14, 636), 7.1232e-05GB in memory


1it [00:00,  1.03it/s]


work folder is  ../../results/test-dataset
NetShareManager._train
Number of valid chunks: 1
Number of configurations after expanded: 3
config_group_list is [{'dp_noise_multiplier': None, 'dp': False, 'pretrain': True, 'config_ids': [0]}, {'dp_noise_multiplier': None, 'dp': False, 'pretrain': True, 'config_ids': [1]}, {'dp_noise_multiplier': None, 'dp': False, 'pretrain': True, 'config_ids': [2]}]
Config group 0: DP: False, pretrain: True
Start launching chunk0 experiments...
DoppelGANgerTorchModel._train
Currently training with config: {'overwrite': True, 'original_data_file': '../../traces/Datasets/dataset.csv', 'dataset_type': 'netflow', 'n_chunks': 1, 'dp': False, 'allowed_data_types': ['ip_string', 'integer', 'float', 'string'], 'allowed_data_encodings': ['categorical', 'bit', 'word2vec_port', 'word2vec_proto'], 'pretrain_dir': '../../results/test-dataset/models/chunkid-0/sample_len-1/checkpoint/epoch_id-49.pt', 'skip_chunk0_train': False, 'pretrain_non_dp': True, 'pretrain_non_dp_

100%|##########| 50/50 [00:00<00:00, 62.47it/s]


Finish launching chunk0 experiments ...
Start waiting for other chunks from config_group_id 0 experiments finished ...
Other chunks from config_group_id 0 training finished
objs is  [<netshare.ray.remote.ResultWrapper object at 0x7feff0498940>]
Config group 1: DP: False, pretrain: True
Start launching chunk0 experiments...
DoppelGANgerTorchModel._train
Currently training with config: {'overwrite': True, 'original_data_file': '../../traces/Datasets/dataset.csv', 'dataset_type': 'netflow', 'n_chunks': 1, 'dp': False, 'allowed_data_types': ['ip_string', 'integer', 'float', 'string'], 'allowed_data_encodings': ['categorical', 'bit', 'word2vec_port', 'word2vec_proto'], 'pretrain_dir': '../../results/test-dataset/models/chunkid-0/sample_len-5/checkpoint/epoch_id-49.pt', 'skip_chunk0_train': False, 'pretrain_non_dp': True, 'pretrain_non_dp_reduce_time': 4.0, 'pretrain_dp': False, 'run': 0, 'batch_size': 100, 'sample_len': 5, 'sample_len_expand': True, 'iteration': 200000, 'vis_freq': 100000, 

100%|##########| 50/50 [00:00<00:00, 57.35it/s]


Finish launching chunk0 experiments ...
Start waiting for other chunks from config_group_id 1 experiments finished ...
Other chunks from config_group_id 1 training finished
objs is  [<netshare.ray.remote.ResultWrapper object at 0x7feff0498940>, <netshare.ray.remote.ResultWrapper object at 0x7fef14c00880>]
Config group 2: DP: False, pretrain: True
Start launching chunk0 experiments...
DoppelGANgerTorchModel._train
Currently training with config: {'overwrite': True, 'original_data_file': '../../traces/Datasets/dataset.csv', 'dataset_type': 'netflow', 'n_chunks': 1, 'dp': False, 'allowed_data_types': ['ip_string', 'integer', 'float', 'string'], 'allowed_data_encodings': ['categorical', 'bit', 'word2vec_port', 'word2vec_proto'], 'pretrain_dir': '../../results/test-dataset/models/chunkid-0/sample_len-10/checkpoint/epoch_id-49.pt', 'skip_chunk0_train': False, 'pretrain_non_dp': True, 'pretrain_non_dp_reduce_time': 4.0, 'pretrain_dp': False, 'run': 0, 'batch_size': 100, 'sample_len': 10, 'sam

100%|##########| 50/50 [00:00<00:00, 50.19it/s]


Finish launching chunk0 experiments ...
Start waiting for other chunks from config_group_id 2 experiments finished ...
Other chunks from config_group_id 2 training finished
objs is  [<netshare.ray.remote.ResultWrapper object at 0x7feff0498940>, <netshare.ray.remote.ResultWrapper object at 0x7fef14c00880>, <netshare.ray.remote.ResultWrapper object at 0x7fefc72be940>]
work folder is  ../../results/test-dataset
Number of valid chunks: 1
Number of configurations after expanded: 3
Start generating attributes ...
DoppelGANgerTorchModel._generate
Currently generating with config: {'overwrite': True, 'original_data_file': '../../traces/Datasets/dataset.csv', 'dataset_type': 'netflow', 'n_chunks': 1, 'dp': False, 'allowed_data_types': ['ip_string', 'integer', 'float', 'string'], 'allowed_data_encodings': ['categorical', 'bit', 'word2vec_port', 'word2vec_proto'], 'pretrain_dir': '../../results/test-dataset/models/chunkid-0/sample_len-1/checkpoint/epoch_id-49.pt', 'skip_chunk0_train': False, 'pre

100%|##########| 3/3 [00:00<00:00,  3.94it/s]


Config group #0: {'dp_noise_multiplier': None, 'dp': False, 'pretrain': True, 'config_ids': [0]}
Chunk_id: 0, # of syn dfs: 10, best_syndf: epoch_id-44.csv
Average truncation ratio: 0.19276209725513502
Big syndf shape: (33, 33)

Config group #1: {'dp_noise_multiplier': None, 'dp': False, 'pretrain': True, 'config_ids': [1]}
Chunk_id: 0, # of syn dfs: 10, best_syndf: epoch_id-49.csv
Average truncation ratio: 0.0
Big syndf shape: (24, 33)

Config group #2: {'dp_noise_multiplier': None, 'dp': False, 'pretrain': True, 'config_ids': [2]}
Chunk_id: 0, # of syn dfs: 10, best_syndf: epoch_id-14.csv
Average truncation ratio: 0.00625
Big syndf shape: (33, 33)

Aggregated final dataset syndf
None (33, 33)
best_syn_df filename: ../../results/test-dataset/post_processed_data/syn_df,dp_noise_multiplier-None,truncate-per_chunk,id-1.csv
Generated data is at ../../results/test-dataset/post_processed_data


True

## Part 3: Introduction to the Post Processor 


![image](./postprocessor.png)

The post processor takes the raw synthetic data produced by the DG model, recovers the columns into their original format, and creates the flow id column. 

#### 1. Recover the original format
In the first part, we recover the processed columns into their original format. This may include deleting some of the columns that were created for the list format.   

| Encoding        | Example column | Recovered format                                                                        |
| --------------- | -------------- | --------------------------------------------------------------------------------------- |
| IPv4            | IP__src_s      | From int to a string like "128.237.82.10"                                               |
| IPv6            | None           | From int to a string like "FE80:CD00:0:CDE:1257:0:211E:729C"                            |
| Timestamp       | packet__time   | From int to a string like "%Y-%m-%d %H:%M:%S.%f"                                        |
| List attributes | packet__layers | Concatenate the per-item columns into one column like "a,b,c,d"                         |
| List values     | DNS__answers   | Concatenate the per-field columns into one column like "fields1 = val1 fields2 = val2"  |
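
The recovery steps in the table above can be sketched as the inverse of the preprocessing conversions. This is a minimal illustration with hypothetical function names; interpreting timestamps as UTC, the `yes` marker for list attributes, and the space between recovered pairs are assumptions.

```python
import ipaddress
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def int_to_ip(value: int, version: int = 4) -> str:
    """Convert an integer back to an IPv4 or IPv6 address string."""
    cls = ipaddress.IPv4Address if version == 4 else ipaddress.IPv6Address
    return str(cls(value))


def ns_to_timestamp(ns: int, fmt: str = "%Y-%m-%d %H:%M:%S.%f") -> str:
    """Convert nanoseconds since the epoch to a formatted UTC timestamp."""
    moment = EPOCH + timedelta(microseconds=ns // 1_000)
    return moment.strftime(fmt)


def join_list_attributes(row, column, names, delimiter=","):
    """Rebuild 'a,b' from the per-item yes/no columns."""
    return delimiter.join(n for n in names if row.get(column + "_" + n) == "yes")


def join_list_values(row, column, fields, delimiter="=", separator=" "):
    """Rebuild 'f1=v1 f2=v2' from the per-field columns."""
    pairs = [f + delimiter + str(row[column + "_" + f]) for f in fields]
    return separator.join(pairs)


print(int_to_ip(3232235521))  # 192.168.0.1
```

After the joined column is rebuilt, the per-item and per-field helper columns can be dropped, which corresponds to the column deletion mentioned above.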

#### 2. Flow id
We also group the rows by all metadata columns and create a new column called "flow_id". Rows that share the same value in every metadata column receive the same flow id, so the flow id distinguishes different metadata combinations. 
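
The grouping idea can be sketched in plain Python. This is an illustration only: the function name is hypothetical, and numbering the flows from 1 in order of first appearance is an assumption about how ids are assigned.

```python
def assign_flow_ids(rows, metadata_columns):
    """Give rows with identical metadata values the same flow_id."""
    ids = {}
    result = []
    for row in rows:
        key = tuple(row[c] for c in metadata_columns)
        # A new metadata combination gets the next unused id.
        flow_id = ids.setdefault(key, len(ids) + 1)
        result.append({**row, "flow_id": flow_id})
    return result


rows = [
    {"IP__src_s": "196.162.194.145", "IP__p": 6, "IP__ttl": 115.4},
    {"IP__src_s": "196.171.194.129", "IP__p": 6, "IP__ttl": 110.9},
    {"IP__src_s": "196.171.194.129", "IP__p": 6, "IP__ttl": 129.9},
]
for r in assign_flow_ids(rows, ["IP__src_s", "IP__p"]):
    print(r["IP__src_s"], r["flow_id"])
```

Time series fields such as `IP__ttl` are left out of the key, so rows that differ only in those fields still belong to the same flow.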

In [5]:
input_path = "../../results/test-dataset/post_processed_data"
output_file = "../../results/test-dataset/post_processed_data/final_output.csv"
Post_processor = post_processor.Post_processor(
    input_path=input_path,
    output_path=output_file,
    configs=input_field_configs)
Post_processor.processor()

['../../results/test-dataset/post_processed_data/syn_df,dp_noise_multiplier-None,truncate-per_chunk,id-1.csv']
{'DNS__answers_name': {'encoding': 'categorical', 'origin': 'name'}, 'DNS__answers_type': {'encoding': 'float', 'origin': 'type'}, 'DNS__answers_cls': {'encoding': 'float', 'origin': 'cls'}, 'DNS__answers_ttl': {'encoding': 'float', 'origin': 'ttl'}, 'DNS__answers_dlen': {'encoding': 'float', 'origin': 'dlen'}, 'DNS__answers_address': {'encoding': 'categorical', 'origin': 'address'}}
item is  DNS__answers_name
0.0
name
item is  DNS__answers_type
-7.100916406292342e-10
type
item is  DNS__answers_cls
-3.81100227692275e-10
cls
item is  DNS__answers_ttl
-7.275188466561347e-10
ttl
item is  DNS__answers_dlen
-9.79461719442203e-10
dlen
item is  DNS__answers_address
0.0
address
{'DNS__answers_name': {'encoding': 'categorical', 'origin': 'name'}, 'DNS__answers_type': {'encoding': 'float', 'origin': 'type'}, 'DNS__answers_cls': {'encoding': 'float', 'origin': 'cls'}, 'DNS__answers_ttl':

#### Look at the final synthetic data 

In [6]:
import pandas as pd
df = pd.read_csv(output_file)
df.head()

Unnamed: 0,IP__src_s,IP__dst_s,IP__p,IP__type,Label,UDP__dport,UDP__sport,TCP__sport,TCP__dport,IP__ttl,packet__len,DNS__query,DNS__an,DNS__dlen,DNS__ttl,DNS__opcode,DNS__type,IEEE__type,IEEE__dsr,MQTT__mlen,TCP__flags,IP__len,TCP__seq,packet__time,packet__layers,DNS__answers,flow_id
0,196.162.194.145,20.44.218.103,6,4040823268,Normal,40795,9088,37434,8281,115.468377,361.027768,6.059265e-10,-1.089073e-09,2.015567e-09,-1.848921e-09,-1.330395e-09,1.26305e-09,9.612198e-11,-1.23194e-09,4.942762e-10,21.180553,209.882082,62351.230169,2019-05-20 08:56:16.000000,"Ethernet,IP,TCP",name=0.0\ntype=-9.929150838011883e-10\ncls=3.3...,1
1,196.171.194.129,92.44.218.103,6,4040692164,Normal,40795,9088,49784,44974,110.918708,306.519783,-5.903638e-10,-5.972827e-10,8.631718e-10,5.153763e-10,-3.840089e-10,-3.090662e-10,-2.561421e-10,-1.904051e-09,1.234076e-09,9.16319,172.576872,114751.166209,2019-05-20 08:56:16.000000,"Ethernet,IP,TCP",name=0.0\ntype=-8.939105768297721e-10\ncls=3.8...,2
2,196.171.194.129,92.44.218.103,6,4040692164,Normal,40795,9088,49784,44974,129.91797,330.726678,-4.743755e-10,-2.585614e-10,8.55292e-10,5.371356e-10,2.229794e-09,8.573766e-11,1.010046e-09,2.005183e-10,-1.348795e-09,13.029426,253.997649,108402.145008,2019-05-20 08:56:16.000000,"Ethernet,IP,TCP",name=0.0\ntype=-6.427228948709683e-10\ncls=-2....,2
3,196.171.194.129,92.44.218.103,6,4040692164,Normal,40795,9088,49784,44974,114.403444,337.084015,5.543816e-10,8.056688e-10,3.308689e-10,-8.784539e-10,7.364833e-10,1.252903e-09,3.100836e-10,-1.608645e-09,5.98644e-10,11.455757,242.909387,25056.330462,2019-05-20 08:56:16.000000,"Ethernet,IP,TCP",name=0.0\ntype=2.648317314223969e-10\ncls=-2.8...,2
4,196.171.194.129,92.44.218.103,6,4040692164,Normal,40795,9088,49784,44974,114.711213,193.249483,1.287302e-09,-1.447273e-10,1.298689e-10,6.59954e-11,-1.518093e-09,-1.100807e-09,3.527867e-10,-1.67029e-10,-8.458937e-10,8.987655,294.95106,13267.524628,2019-05-20 08:56:16.000000,"Ethernet,IP,TCP",name=0.0\ntype=7.165002325442835e-10\ncls=9.64...,2


#### Look at the flow id column

In [7]:
 
df[["IP__src_s", "IP__dst_s", "IP__p", "IP__type", "Label", "packet__layers", "UDP__dport", "UDP__sport", "TCP__sport", "TCP__dport", "flow_id"]].head()

Unnamed: 0,IP__src_s,IP__dst_s,IP__p,IP__type,Label,packet__layers,UDP__dport,UDP__sport,TCP__sport,TCP__dport,flow_id
0,196.162.194.145,20.44.218.103,6,4040823268,Normal,"Ethernet,IP,TCP",40795,9088,37434,8281,1
1,196.171.194.129,92.44.218.103,6,4040692164,Normal,"Ethernet,IP,TCP",40795,9088,49784,44974,2
2,196.171.194.129,92.44.218.103,6,4040692164,Normal,"Ethernet,IP,TCP",40795,9088,49784,44974,2
3,196.171.194.129,92.44.218.103,6,4040692164,Normal,"Ethernet,IP,TCP",40795,9088,49784,44974,2
4,196.171.194.129,92.44.218.103,6,4040692164,Normal,"Ethernet,IP,TCP",40795,9088,49784,44974,2


In [None]:
# After training, we can view the results for the attributes at: http://127.0.0.1:8050/
generator.visualize(work_folder='../../results/test-dataset')

work folder is  ../../results/test-dataset
The filename with the largest ID is: syn_df,dp_noise_multiplier-None,truncate-per_chunk,id-1.csv
columns are  Index(['IP__src_s', 'IP__dst_s', 'IP__p', 'IP__type', 'Label',
       'packet__layers_Ethernet', 'packet__layers_IP', 'packet__layers_TCP',
       'UDP__dport', 'UDP__sport', 'TCP__sport', 'TCP__dport', 'IP__ttl',
       'packet__len', 'DNS__answers_name', 'DNS__answers_type',
       'DNS__answers_cls', 'DNS__answers_ttl', 'DNS__answers_dlen',
       'DNS__answers_address', 'DNS__query', 'DNS__an', 'DNS__dlen',
       'DNS__ttl', 'DNS__opcode', 'DNS__type', 'IEEE__type', 'IEEE__dsr',
       'MQTT__mlen', 'TCP__flags', 'IP__len', 'TCP__seq', 'packet__time'],
      dtype='object')
      IP__src_s   IP__dst_s  IP__p  IP__type   Label packet__layers_Ethernet  \
0    3232235533  3232235536      6         0  Normal                     Yes   
1    3232235533  3232235536      6         0  Normal                     Yes   
2    3232235533  3232


Real or synthetic column DNS__answers_type is a constant list. Not generating plots.


Real or synthetic column DNS__answers_cls is a constant list. Not generating plots.


Real or synthetic column DNS__answers_ttl is a constant list. Not generating plots.


Real or synthetic column DNS__answers_dlen is a constant list. Not generating plots.


Real or synthetic column DNS__query is a constant list. Not generating plots.


Real or synthetic column DNS__an is a constant list. Not generating plots.


Real or synthetic column DNS__dlen is a constant list. Not generating plots.


Real or synthetic column DNS__ttl is a constant list. Not generating plots.


Real or synthetic column DNS__opcode is a constant list. Not generating plots.


Real or synthetic column DNS__type is a constant list. Not generating plots.


Real or synthetic column IEEE__type is a constant list. Not generating plots.


Real or synthetic column IEEE__dsr is a constant list. Not generating plots.


Real or synthetic co

Dash is running on http://127.0.0.1:8050/



07/19/2023 00:37:38:INFO:Dash is running on http://127.0.0.1:8050/



 * Serving Flask app 'sdmetrics.reports.timeseries.quality_report'
 * Debug mode: off


 * Running on http://127.0.0.1:8050
07/19/2023 00:37:38:INFO:Press CTRL+C to quit
07/19/2023 00:37:44:INFO:127.0.0.1 - - [19/Jul/2023 00:37:44] "GET / HTTP/1.1" 200 -
07/19/2023 00:37:44:INFO:127.0.0.1 - - [19/Jul/2023 00:37:44] "GET /_dash-component-suites/dash/deps/polyfill@7.v2_10_2m1685736327.12.1.min.js HTTP/1.1" 200 -
07/19/2023 00:37:44:INFO:127.0.0.1 - - [19/Jul/2023 00:37:44] "GET /_dash-component-suites/dash/deps/react@16.v2_10_2m1685736327.14.0.min.js HTTP/1.1" 200 -
07/19/2023 00:37:44:INFO:127.0.0.1 - - [19/Jul/2023 00:37:44] "GET /_dash-component-suites/dash/deps/react-dom@16.v2_10_2m1685736327.14.0.min.js HTTP/1.1" 200 -
07/19/2023 00:37:44:INFO:127.0.0.1 - - [19/Jul/2023 00:37:44] "GET /_dash-component-suites/dash/deps/prop-types@15.v2_10_2m1685736327.8.1.min.js HTTP/1.1" 200 -
07/19/2023 00:37:44:INFO:127.0.0.1 - - [19/Jul/2023 00:37:44] "GET /_dash-component-suites/dash/dash-renderer/build/dash_renderer.v2_10_2m1685736327.min.js HTTP/1.1" 200 -
07/19/2023 00:

## Other ways to run the code

In [None]:
# Run all the parts at once
%run driver.py