This project builds a Federated Learning based Intrusion Detection System that detects malicious network traffic in IoT devices. The implementation simulates FedAVG using TensorFlow Federated running on Python 3.9. More specifically, the SimpleFederatedAveraging implementation is adapted to take Gower Distance matrices as model input. Although the dataset used is TON_IOT, the system is designed to work equally with other anomaly detection datasets.
Three Federated (FL) systems have been created: a vanilla version, an Autoencoder (AE) version and a version with an Attention Mechanism (AM). In addition, two analogous centralized (CNL) versions, vanilla and AE, have been created as comparison baselines.
- datasets: TON_IOT network dataset directory
- source (main development): source files, initialization script
- source/init (configuration files): CNL and FL systems configuration files
- source/results (results directory): CNL and FL systems output destination
- libs (required packages): federated, gower (modified) and deatf
Install the required libraries listed in deps.req.
Note: Python 3.9 is required and the pip3 package installer is recommended.
Two types of configuration files are accepted, one for the CNL variant and one for the FL variant. Each configuration file is used by both the matrix creation and the model learning modules.
Note: The name must be cnl<n>.ini or fl<n>.ini
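The naming rule can be checked before launching a run. The following is a minimal sketch, assuming that `<n>` stands for one or more decimal digits; the helper `is_valid_config_name` is hypothetical and not part of the project.

```python
import re

# Assumption: <n> is one or more decimal digits (e.g. cnl1.ini, fl12.ini).
CONFIG_NAME = re.compile(r"(cnl|fl)\d+\.ini")


def is_valid_config_name(filename: str) -> bool:
    """Return True when the file name matches cnl<n>.ini or fl<n>.ini."""
    return CONFIG_NAME.fullmatch(filename) is not None


for name in ("cnl1.ini", "fl12.ini", "fl.ini", "cnl1.ini.bak"):
    print(name, is_valid_config_name(name))
```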
Download or create a custom dataset. The implementation is adapted to TON_IOT, which should be downloaded, extracted and placed into the datasets directory.
Located in source/init/cnl, the files have the following structure:
- run_name: run name
- print_scr: visualize output in terminal
- train_size: number of train instances
- test_size: number of test instances
- epochs: max number of training rounds with early stopping (patience 2)
- batch_size: hyperparameter
- learning_rate: hyperparameter
- balance_data: used in module create_matrix_cnl.py
- outliers: isoltion_forest, svm_one_class_classifction or whole_dataset
- seed: added for replicability
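As an illustration of how such a file could be read, here is a minimal sketch using the standard library `configparser`. The section name `[CNL]`, every value shown and the helper `load_cnl_config` are assumptions made for the example; only the key names come from the list above.

```python
import configparser

# Illustrative content only: the section name and all values are assumptions.
EXAMPLE_INI = """
[CNL]
run_name = cnl1
print_scr = True
train_size = 8000
test_size = 2000
epochs = 50
batch_size = 32
learning_rate = 0.001
balance_data = True
outliers = whole_dataset
seed = 42
"""


def load_cnl_config(text: str, section: str = "CNL") -> dict:
    """Parse the keys listed above into typed Python values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    cfg = parser[section]
    return {
        "run_name": cfg.get("run_name"),
        "print_scr": cfg.getboolean("print_scr"),
        "train_size": cfg.getint("train_size"),
        "test_size": cfg.getint("test_size"),
        "epochs": cfg.getint("epochs"),
        "batch_size": cfg.getint("batch_size"),
        "learning_rate": cfg.getfloat("learning_rate"),
        # Kept as raw strings because their accepted values are project specific.
        "balance_data": cfg.get("balance_data"),
        "outliers": cfg.get("outliers"),
        "seed": cfg.getint("seed"),
    }


config = load_cnl_config(EXAMPLE_INI)
print(config)
```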
Located in source/init/fl, the files have the following structure:
- run_name: run name
- total_rounds: total number of averaging (communication) rounds
- rounds_per_eval: validate the model every k rounds
- train_clients_per_round: number of agents taking part in each averaging
- client_epochs_per_round: number of local epochs in each node
- batch_size: hyperparameter
- test_batch_size: hyperparameter
- server_learning_rate: hyperparameter
- client_learning_rate: hyperparameter
- num_clients: total number of agents in the network
- train_size: total number of train instances summing all nodes datasets
- test_size: total number of test instances summing all nodes datasets
- outliers: isoltion_forest, svm_one_class_classifction or whole_dataset
- balance_data: used in module create_matrix_fl.py
- print_scr: visualize output in terminal
- seed: added for replicability
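To show how the round-related parameters interact, the sketch below builds a schedule of communication rounds. It assumes that clients are sampled uniformly without replacement in each round and that evaluation happens whenever the round number is a multiple of rounds_per_eval; the actual scripts may differ. The helper `plan_rounds` is hypothetical.

```python
import random


def plan_rounds(total_rounds, rounds_per_eval, train_clients_per_round,
                num_clients, seed):
    """Return a list of (round number, sampled client ids, evaluate flag)."""
    rng = random.Random(seed)  # fixed seed makes the schedule reproducible
    plan = []
    for round_num in range(1, total_rounds + 1):
        # Assumption: uniform sampling without replacement among all clients.
        clients = sorted(rng.sample(range(num_clients), train_clients_per_round))
        evaluate = round_num % rounds_per_eval == 0
        plan.append((round_num, clients, evaluate))
    return plan


plan = plan_rounds(total_rounds=6, rounds_per_eval=2,
                   train_clients_per_round=3, num_clients=10, seed=42)
for round_num, clients, evaluate in plan:
    print(round_num, clients, "eval" if evaluate else "")
```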
Once again, two execution variants exist depending on which system is being deployed. If data mining steps such as outlier detection or SHAP value analysis need to be performed beforehand, use preprocess.ipynb.
Note: Filtered datasets should be placed in the same location as their raw counterparts inside the dataset directory.
Before running a CNL-IDS, the Gower matrix creation step is compulsory and must finish first.
Note: The configuration file corresponding to the specified run name must exist at source/init/cnl.
python3 source/create_matrix_cnl.py <run_name>
Note: The created matrix is stored in source/mats/cnl.
python3 source/netw_cnl.py <run_name>
python3 source/netw_cnl_AE.py <run_name>
Before running an FL-IDS, the Gower matrices creation step is compulsory and must finish first.
Note: The configuration file corresponding to the specified run name must exist at source/init/fl.
In this case, the dataset is split IID among the selected number of agents, and independent train/test matrices are created for each of them.
python3 source/create_matrix_fl.py <run_name>
Note: The created matrices are stored in source/mats/fl.
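The IID split described above can be sketched as follows. This is not the code of create_matrix_fl.py: the helper `iid_split` is hypothetical, it works on sample indices only, and it assumes that every agent receives an equal share, which the project may not enforce.

```python
import random


def iid_split(train_size, test_size, num_clients, seed):
    """Shuffle sample indices and deal train/test shares to each client."""
    rng = random.Random(seed)
    indices = list(range(train_size + test_size))
    rng.shuffle(indices)  # shuffling makes every share a random (IID) sample
    train, test = indices[:train_size], indices[train_size:]
    return {
        client_id: {
            # Round-robin dealing; equal shares are an assumption.
            "train": train[client_id::num_clients],
            "test": test[client_id::num_clients],
        }
        for client_id in range(num_clients)
    }


parts = iid_split(train_size=1000, test_size=200, num_clients=5, seed=42)
for client_id, part in parts.items():
    print(client_id, len(part["train"]), len(part["test"]))
```

Each client's train and test index lists would then be used to build that client's own pair of Gower matrices.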
python3 source/netw_fl.py <run_name>
python3 source/netw_fl_AE.py <run_name>
python3 source/netw_fl_AM.py <run_name>
Note: The attention percentage is specified in the source code under the fixed constant BEST_PERC.
In each experiment, the .h5 model and the train and validation losses are saved, together with the accuracy, precision, recall, F1-score and ROC-AUC metrics. In the Federated versions, each entry in the results corresponds to the evaluation of the learned model on a specific agent's test partition. Therefore, alongside these metrics, the client ID and its local train dataset size are stored.
To visualize the results, the visualize_results_cnl.ipynb and visualize_results_fl.ipynb notebooks have been developed. They show the accuracy and pr_scores of each experiment as well as the train and validation losses collected during the learning process. In the Federated versions, additional indicators such as dataset size per agent are visualized.
Note: Result visualization modules are pre-configured to work with 6 experiments and plot grids of 2x3. They may need to be adapted for other experimental setups.
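For reference, the threshold-based metrics can be derived from the confusion counts as in the sketch below. It assumes binary labels where 1 marks malicious traffic, and it leaves out ROC-AUC because that metric needs prediction scores rather than hard labels. The helper `binary_metrics` and the sample labels are illustrative, not project code or results.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels for illustration: 1 = malicious, 0 = benign (assumption).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```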
The gower_matrix_limit_cols and sliced_gower_matrix_limit_cols methods have been added to gower_dist.py in order to split the matrices into train/test subsets efficiently.
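The signatures of these methods are not documented here, so the following is only a generic sketch of Gower distance rather than the modified library code. It computes a rectangular block of distances between a set of query rows and a set of reference rows, which is one plausible way to avoid building the full square matrix. The helpers `numeric_ranges` and `gower_block` and the sample rows are hypothetical.

```python
def numeric_ranges(rows, is_categorical):
    """Range (max - min) of every numeric column; 0.0 placeholder otherwise."""
    ranges = []
    for k, categorical in enumerate(is_categorical):
        if categorical:
            ranges.append(0.0)
        else:
            column = [row[k] for row in rows]
            ranges.append(max(column) - min(column))
    return ranges


def gower_block(rows_a, rows_b, is_categorical, ranges):
    """Gower distance between every row of rows_a and every row of rows_b."""
    block = []
    for a in rows_a:
        distances = []
        for b in rows_b:
            total = 0.0
            for k, (x, y) in enumerate(zip(a, b)):
                if is_categorical[k]:
                    total += 0.0 if x == y else 1.0   # simple mismatch
                elif ranges[k] > 0:
                    total += abs(x - y) / ranges[k]   # range-normalized gap
            distances.append(total / len(a))          # unweighted mean
        block.append(distances)
    return block


# Made-up rows: (protocol, duration, ratio).
rows = [("tcp", 10.0, 0.0), ("udp", 20.0, 1.0), ("tcp", 30.0, 0.5)]
is_categorical = [True, False, False]
ranges = numeric_ranges(rows, is_categorical)
block = gower_block(rows[:1], rows, is_categorical, ranges)
print(block)
```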
Taking the original simple_fedavg_tff.py as a baseline, simple_fedavg_tff.py has been implemented. In it, the run_one_round1 and run_one_round2 methods send the temporary weights of each client to the server and select the k best-performing agents. FedAVG is then performed on the models of the selected subset of nodes.
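The selection and averaging step can be illustrated without TensorFlow Federated. The sketch below assumes that "best performing" means lowest local loss and that the average is weighted by local dataset size, as in standard FedAVG; neither detail is confirmed by the text above. The function `select_and_average` and the update records are hypothetical.

```python
def select_and_average(client_updates, k):
    """Keep the k clients with the lowest loss and average their weights.

    client_updates: list of dicts with 'weights' (list of floats),
    'num_examples' (int) and 'loss' (float).
    """
    # Assumption: ranking by loss; the project may use another criterion.
    best = sorted(client_updates, key=lambda u: u["loss"])[:k]
    total = sum(u["num_examples"] for u in best)
    dim = len(best[0]["weights"])
    # FedAVG: average weighted by each client's number of examples.
    return [
        sum(u["weights"][i] * u["num_examples"] for u in best) / total
        for i in range(dim)
    ]


updates = [
    {"weights": [1.0, 2.0], "num_examples": 100, "loss": 0.2},
    {"weights": [3.0, 4.0], "num_examples": 300, "loss": 0.1},
    {"weights": [10.0, 10.0], "num_examples": 50, "loss": 0.9},
]
new_weights = select_and_average(updates, k=2)
print(new_weights)
```

With k=2 the third client is discarded, so its outlying weights do not influence the aggregated model.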
- Aitor Belenguer
MIT