# Colab Setup  
> Make sure the notebook is configured to use a GPU: click Edit -> Notebook settings -> Hardware accelerator -> GPU.
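
Once the runtime has been switched, the setting can be checked from a cell before any training starts. The snippet below is a minimal sketch and is not part of the FedDebug code; `gpu_available` is a hypothetical helper name, and it only assumes that PyTorch may or may not be installed yet.

```python
import importlib.util


def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    # hypothetical helper: PyTorch may not be installed before the setup cell runs
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return bool(torch.cuda.is_available())


print("CUDA GPU available:", gpu_available())
```

If this prints `False` on Colab, the hardware accelerator is probably still set to "None".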

> Uncomment the following cell after opening the notebook in Google Colab. (Leave it commented out in a local setup.)

<a target="_blank" href="https://colab.research.google.com/github/SEED-VT/FedDebug/blob/main/fault-localization/Reproduce_Table1-Table2.ipynb"> <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> </a>


In [2]:
# !pip install pytorch-lightning
# !pip install diskcache
# !pip install dotmap
# !pip install torch torchvision torchaudio
# !pip install matplotlib
# !git clone https://github.com/SEED-VT/FedDebug.git
# # append the repository path so that the `utils` package can be imported
# import sys
# sys.path.append("FedDebug/fault-localization/")

# Description

- This notebook defines the simulation variables, such as the learning rate, batch size, noise rate, number of clients, and number of epochs.

- It then runs the simulation for the given configuration (`args`), covering the Table 1 and Table 2 configurations.

- Finally, it prints the localization accuracy for the faulty client(s), along with the data distribution, number of faulty clients, total number of clients, architecture, and dataset used in the simulation.
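
To make the reported accuracy concrete, here is a small standalone sketch of the metric. It mirrors the logic of `evaluateFaultLocalization` defined later in this notebook, but `localization_accuracy` and the toy predictions are illustrative assumptions, not output from a real run.

```python
def localization_accuracy(predictions_per_input, true_faulty):
    """Mean percentage of true faulty clients recovered per generated input."""
    true_faulty = set(true_faulty)
    per_input = [
        100.0 * len(true_faulty & set(pred)) / len(true_faulty)
        for pred in predictions_per_input
    ]
    return sum(per_input) / len(per_input)


# toy data: client 0 is faulty, and 3 of the 4 generated inputs point at it
toy_predictions = [{0}, {0}, {3}, {0}]
print(localization_accuracy(toy_predictions, [0]))  # 75.0
```

With several faulty clients, each input contributes the fraction of true faulty clients it recovers, so a prediction of `{0, 5}` against true faulty clients `{0, 1}` counts as 50% for that input.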

In [3]:
import logging
import matplotlib.pyplot as plt
import time
from dotmap import DotMap
from pytorch_lightning import seed_everything
from torch.nn.init import kaiming_uniform_ 
from utils.faulty_client_localization.FaultyClientLocalization import FaultyClientLocalization
from utils.faulty_client_localization.InferenceGuidedInputs import InferenceGuidedInputs
from utils.FLSimulation import trainFLMain
from utils.fl_datasets import initializeTrainAndValidationDataset
from utils.util import aggToUpdateGlobalModel
from utils.util import testAccModel



logging.basicConfig(filename='example.log', level=logging.ERROR)
logger = logging.getLogger("pytorch_lightning")
seed_everything(786)



def evaluateFaultLocalization(predicted_faulty_clients_on_each_input, true_faulty_clients):
    """Return the mean percentage of true faulty clients recovered per generated input."""
    true_faulty_clients = set(true_faulty_clients)
    detection_acc = 0
    for pred_faulty_clients in predicted_faulty_clients_on_each_input:
        print(f"+++ Faulty Clients {pred_faulty_clients}")
        correct_localize_faults = len(
            true_faulty_clients.intersection(pred_faulty_clients))
        # share of the true faulty clients found for this input, in percent
        acc = (correct_localize_faults/len(true_faulty_clients))*100
        detection_acc += acc
    # average over all generated inputs
    fault_localization_acc = detection_acc / \
        len(predicted_faulty_clients_on_each_input)
    return fault_localization_acc


def runFaultyClientLocalization(client2models, exp2info, num_bugs, random_generator=kaiming_uniform_, apply_transform=True, k_gen_inputs=10, na_threshold=0.003, use_gpu=True):
    print(">  Running FaultyClientLocalization ..")
    input_shape = list(exp2info['data_config']['single_input_shape'])
    generate_inputs = InferenceGuidedInputs(client2models, input_shape, randomGenerator=random_generator, apply_transform=apply_transform,
                                            dname=exp2info['data_config']['name'], min_nclients_same_pred=5, k_gen_inputs=k_gen_inputs)
    selected_inputs, input_gen_time = generate_inputs.getInputs()

    start = time.time()
    faultyclientlocalization = FaultyClientLocalization(
        client2models, selected_inputs, use_gpu=use_gpu)

    # the caller treats this result as the predicted faulty clients for each input
    potential_faulty_clients_for_each_input = faultyclientlocalization.runFaultLocalization(
        na_threshold, num_bugs=num_bugs)
    fault_localization_time = time.time()-start
    return potential_faulty_clients_for_each_input, input_gen_time, fault_localization_time



results = {}

# ====== Simulation ===== 

args = DotMap()
args.lr = 0.001
args.weight_decay = 0.0001
args.batch_size = 512

args.noise_rate = 1  # noise rate, between 0 and 1
args.clients = 30  # use at most 30 clients with resnet18, resnet34, or densenet121 when evaluating on Colab
args.epochs = 10  # range 10-25
args.faulty_clients_ids = "0"  # comma-separated client IDs, e.g. "0,1,2"; each ID must be below args.clients, with fewer than 7 faulty clients in total


Global seed set to 786
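
The constraints on `args.faulty_clients_ids` are easy to violate when editing the string by hand. The following is a minimal sketch of a check that could be run before training; `parse_faulty_client_ids` and its `max_faulty` parameter are hypothetical and not part of FedDebug, and the sketch assumes client IDs are zero-based.

```python
from typing import List


def parse_faulty_client_ids(ids: str, num_clients: int, max_faulty: int = 6) -> List[int]:
    """Parse a comma-separated ID string and check it against the limits noted above."""
    # hypothetical helper; assumes zero-based client IDs
    parsed = sorted({int(tok) for tok in ids.split(",") if tok.strip()})
    if not parsed:
        raise ValueError("at least one faulty client ID is required")
    if len(parsed) > max_faulty:
        raise ValueError(f"expected at most {max_faulty} faulty clients, got {len(parsed)}")
    out_of_range = [i for i in parsed if not 0 <= i < num_clients]
    if out_of_range:
        raise ValueError(f"client IDs out of range for {num_clients} clients: {out_of_range}")
    return parsed


print(parse_faulty_client_ids("0", 30))
print(parse_faulty_client_ids("0,1,3,4,7", 30))
```

Raising early keeps a malformed ID string from surfacing only after a long training run.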


> Note: You can comment out an entire cell to skip its execution in order to evaluate a particular configuration.

### Table 1: resnet18, cifar10, iid distribution, and 30 clients

In [4]:
args.model = "resnet18" # [resnet18, resnet34, resnet50, densenet121, vgg16]
args.dataset = "cifar10" # ['cifar10', 'femnist']
args.sampling = "iid" # ['iid', 'niid']
args.clients = 30 # use at most 30 clients with resnet18, resnet34, or densenet121 when evaluating on Colab

# FL training
c2ms, exp2info = trainFLMain(args)
client2models = {k: v.model.eval() for k, v in c2ms.items()}


# Fault localization
potential_faulty_clients, _, _ = runFaultyClientLocalization(
    client2models=client2models, exp2info=exp2info, num_bugs=len(exp2info['faulty_clients_ids']))
fault_acc = evaluateFaultLocalization(
    potential_faulty_clients, exp2info['faulty_clients_ids'])
# print(f"Fault Localization Acc: {fault_acc}")

print(f"#Fault Localization Accuracy: {fault_acc}, Distribution: {args.sampling},  Faulty clients: {len(args.faulty_clients_ids.split(','))}, Total Clients: {args.clients}, Architecture: {args.model}, Dataset: {args.dataset}")



  ***Simulating FL setup iid_resnet18_cifar10_clients_30_faulty_[0]_bsize_512_epochs_10_lr_0.001 ***
Files already downloaded and verified
Files already downloaded and verified
Spliting Datasets 50000 into parts:[1686, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666, 1666]
input shape, torch.Size([1, 3, 32, 32])
Training : .storage/checkpoints/iid_resnet18_cifar10_clients_30_faulty_[0]_bsize_512_epochs_10_lr_0.001/faulty_client_0_noise_rate_1_classes.ckpt


Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


Train mod batch = 150, and drop_last = False


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Epoch 9: 100%|██████████| 4/4 [00:00<00:00,  4.27it/s, loss=2.31, train_acc=0.0818, train_loss=2.310, val_acc=0.0879, val_loss=2.420]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 00010: reducing learning rate of group 0 to 2.5000e-04.
Epoch 9: 100%|██████████| 4/4 [00:00<00:00,  4.25it/s, loss=2.31, train_acc=0.0818, train_loss=2.310, val_acc=0.0879, val_loss=2.420]
Training : .storage/checkpoints/iid_resnet18_cifar10_clients_30_faulty_[0]_bsize_512_epochs_10_lr_0.001/client_1.ckpt


Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Train mod batch = 130, and drop_last = False
Epoch 9: 100%|██████████| 4/4 [00:00<00:00,  4.13it/s, loss=0.728, train_acc=0.787, train_loss=0.653, val_acc=0.642, val_loss=1.630]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 00010: reducing learning rate of group 0 to 2.5000e-04.
Epoch 9: 100%|██████████| 4/4 [00:00<00:00,  4.12it/s, loss=0.728, train_acc=0.787, train_loss=0.653, val_acc=0.642, val_loss=1.630]
Training : .storage/checkpoints/iid_resnet18_cifar10_clients_30_faulty_[0]_bsize_512_epochs_10_lr_0.001/client_2.ckpt


Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Train mod batch = 130, and drop_last = False
Epoch 9: 100%|██████████| 4/4 [00:00<00:00,  4.03it/s, loss=0.684, train_acc=0.759, train_loss=0.607, val_acc=0.664, val_loss=1.440]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 9: 100%|██████████| 4/4 [00:00<00:00,  4.01it/s, loss=0.684, train_acc=0.759, train_loss=0.607, val_acc=0.664, val_loss=1.440]
Training : .storage/checkpoints/iid_resnet18_cifar10_clients_30_faulty_[0]_bsize_512_epochs_10_lr_0.001/client_3.ckpt


Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Train mod batch = 130, and drop_last = False
Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.67it/s, loss=0.675, train_acc=0.788, train_loss=0.623, val_acc=0.655, val_loss=1.560]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 00010: reducing learning rate of group 0 to 2.5000e-04.
Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.65it/s, loss=0.675, train_acc=0.788, train_loss=0.623, val_acc=0.655, val_loss=1.560]
Training : .storage/checkpoints/iid_resnet18_cifar10_clients_30_faulty_[0]_bsize_512_epochs_10_lr_0.001/client_4.ckpt


Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Train mod batch = 130, and drop_last = False
Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.61it/s, loss=0.675, train_acc=0.841, train_loss=0.469, val_acc=0.664, val_loss=1.470]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 00010: reducing learning rate of group 0 to 2.5000e-04.
Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.59it/s, loss=0.675, train_acc=0.841, train_loss=0.469, val_acc=0.664, val_loss=1.470]
Training : .storage/checkpoints/iid_resnet18_cifar10_clients_30_faulty_[0]_bsize_512_epochs_10_lr_0.001/client_5.ckpt


Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Train mod batch = 130, and drop_last = False
Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.55it/s, loss=0.656, train_acc=0.887, train_loss=0.397, val_acc=0.674, val_loss=1.430]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.53it/s, loss=0.656, train_acc=0.887, train_loss=0.397, val_acc=0.674, val_loss=1.430]
Training : .storage/checkpoints/iid_resnet18_cifar10_clients_30_faulty_[0]_bsize_512_epochs_10_lr_0.001/client_6.ckpt


Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Train mod batch = 130, and drop_last = False
Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.59it/s, loss=0.703, train_acc=0.761, train_loss=0.671, val_acc=0.671, val_loss=1.490]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.57it/s, loss=0.703, train_acc=0.761, train_loss=0.671, val_acc=0.671, val_loss=1.490]
Training : .storage/checkpoints/iid_resnet18_cifar10_clients_30_faulty_[0]_bsize_512_epochs_10_lr_0.001/client_7.ckpt


Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Train mod batch = 130, and drop_last = False
Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.94it/s, loss=0.738, train_acc=0.746, train_loss=0.723, val_acc=0.628, val_loss=1.540]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.93it/s, loss=0.738, train_acc=0.746, train_loss=0.723, val_acc=0.628, val_loss=1.540]
Training : .storage/checkpoints/iid_resnet18_cifar10_clients_30_faulty_[0]_bsize_512_epochs_10_lr_0.001/client_8.ckpt


Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Train mod batch = 130, and drop_last = False
Epoch 6: 100%|██████████| 4/4 [00:01<00:00,  3.39it/s, loss=1, train_acc=0.695, train_loss=0.746, val_acc=0.554, val_loss=1.780]   Epoch 00007: reducing learning rate of group 0 to 2.5000e-04.
Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.69it/s, loss=0.67, train_acc=0.829, train_loss=0.468, val_acc=0.692, val_loss=1.250] 

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.67it/s, loss=0.67, train_acc=0.829, train_loss=0.468, val_acc=0.692, val_loss=1.250]
Training : .storage/checkpoints/iid_resnet18_cifar10_clients_30_faulty_[0]_bsize_512_epochs_10_lr_0.001/client_9.ckpt


Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Train mod batch = 130, and drop_last = False
Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.31it/s, loss=0.678, train_acc=0.808, train_loss=0.504, val_acc=0.675, val_loss=1.490]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.29it/s, loss=0.678, train_acc=0.808, train_loss=0.504, val_acc=0.675, val_loss=1.490]
Training : .storage/checkpoints/iid_resnet18_cifar10_clients_30_faulty_[0]_bsize_512_epochs_10_lr_0.001/client_10.ckpt


Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Train mod batch = 130, and drop_last = False
Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.53it/s, loss=0.639, train_acc=0.775, train_loss=0.698, val_acc=0.677, val_loss=1.400]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.52it/s, loss=0.639, train_acc=0.775, train_loss=0.698, val_acc=0.677, val_loss=1.400]
Training : .storage/checkpoints/iid_resnet18_cifar10_clients_30_faulty_[0]_bsize_512_epochs_10_lr_0.001/client_11.ckpt


Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Train mod batch = 130, and drop_last = False
Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.43it/s, loss=0.676, train_acc=0.770, train_loss=0.590, val_acc=0.671, val_loss=1.470]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.42it/s, loss=0.676, train_acc=0.770, train_loss=0.590, val_acc=0.671, val_loss=1.470]
Training : .storage/checkpoints/iid_resnet18_cifar10_clients_30_faulty_[0]_bsize_512_epochs_10_lr_0.001/client_12.ckpt


Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Train mod batch = 130, and drop_last = False
Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.61it/s, loss=0.655, train_acc=0.840, train_loss=0.524, val_acc=0.676, val_loss=1.460]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 9: 100%|██████████| 4/4 [00:01<00:00,  3.60it/s, loss=0.655, train_acc=0.840, train_loss=0.524, val_acc=0.676, val_loss=1.460]
Training : .storage/checkpoints/iid_resnet18_cifar10_clients_30_faulty_[0]_bsize_512_epochs_10_lr_0.001/client_13.ckpt


Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Train mod batch = 130, and drop_last = False
Epoch 9:   0%|          | 0/4 [00:00<?, ?it/s, loss=0.782, train_acc=0.753, train_loss=0.715, val_acc=0.642, val_loss=1.610]        
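
The split sizes reported near the top of the log above (1686 samples for the first client, 1666 for each of the others) are consistent with an even split in which the remainder goes to the first part. The sketch below reproduces those numbers; it is inferred from the log only, not taken from `utils.fl_datasets`, and `split_sizes` is a hypothetical name.

```python
def split_sizes(total: int, parts: int) -> list:
    """Split `total` samples into `parts` chunks, giving the remainder to the first chunk."""
    # assumption inferred from the logged sizes, not from the FedDebug source
    base, remainder = divmod(total, parts)
    return [base + remainder] + [base] * (parts - 1)


sizes = split_sizes(50000, 30)
print(sizes[0], sizes[1], sum(sizes))  # 1686 1666 50000
```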

### Table 1: densenet121, cifar10, niid distribution, and 10 clients

In [None]:
args.model = "densenet121" # [resnet18, resnet34, resnet50, densenet121, vgg16]
args.dataset = "cifar10" # ['cifar10', 'femnist']
args.sampling = "niid" # ['iid', 'niid']
args.clients = 10


# FL training
c2ms, exp2info = trainFLMain(args)
client2models = {k: v.model.eval() for k, v in c2ms.items()}



# Fault localization
potential_faulty_clients, _, _ = runFaultyClientLocalization(
    client2models=client2models, exp2info=exp2info, num_bugs=len(exp2info['faulty_clients_ids']))
fault_acc = evaluateFaultLocalization(
    potential_faulty_clients, exp2info['faulty_clients_ids'])
# print(f"Fault Localization Acc: {fault_acc}")

print(f"#Fault Localization Accuracy: {fault_acc}, Distribution: {args.sampling},  Faulty clients: {len(args.faulty_clients_ids.split(','))}, Total Clients: {args.clients}, Architecture: {args.model}, Dataset: {args.dataset}")


### Table 2: Five Faulty Clients, densenet121, cifar10, and 30 clients

In [None]:
args.sampling = "iid"
args.faulty_clients_ids = "0,1,3,4,7" # comma-separated client IDs, e.g. "0,1,2"; each ID must be below args.clients, with fewer than 7 faulty clients in total
args.model = "densenet121" # [resnet18, resnet34, resnet50, densenet121, vgg16]
args.dataset = "cifar10" # ['cifar10', 'femnist']
args.clients = 30 

# FL training
c2ms, exp2info = trainFLMain(args)
client2models = {k: v.model.eval() for k, v in c2ms.items()}

# Fault localization
potential_faulty_clients, _, _ = runFaultyClientLocalization(
    client2models=client2models, exp2info=exp2info, num_bugs=len(exp2info['faulty_clients_ids']))
fault_acc = evaluateFaultLocalization(
    potential_faulty_clients, exp2info['faulty_clients_ids'])
# print(f"Fault Localization Acc: {fault_acc}")

print(f"#Table 2: Fault Localization Accuracy: {fault_acc}, Distribution: {args.sampling},  Faulty clients: {len(args.faulty_clients_ids.split(','))}, Total Clients: {args.clients}, Architecture: {args.model}, Dataset: {args.dataset}")

### Table 2: Five Faulty Clients, resnet50, cifar10, and 30 clients

In [None]:
args.sampling = "iid"
args.faulty_clients_ids = "0,1,3,4,7" # comma-separated client IDs, e.g. "0,1,2"; each ID must be below args.clients, with fewer than 7 faulty clients in total
args.model = "resnet50" # [resnet18, resnet34, resnet50, densenet121, vgg16]
args.dataset = "cifar10" # ['cifar10', 'femnist']
args.clients = 30 

# FL training
c2ms, exp2info = trainFLMain(args)
client2models = {k: v.model.eval() for k, v in c2ms.items()}

# Fault localization
potential_faulty_clients, _, _ = runFaultyClientLocalization(
    client2models=client2models, exp2info=exp2info, num_bugs=len(exp2info['faulty_clients_ids']))
fault_acc = evaluateFaultLocalization(
    potential_faulty_clients, exp2info['faulty_clients_ids'])
# print(f"Fault Localization Acc: {fault_acc}")

print(f"#Table 2: Fault Localization Accuracy: {fault_acc}, Distribution: {args.sampling},  Faulty clients: {len(args.faulty_clients_ids.split(','))}, Total Clients: {args.clients}, Architecture: {args.model}, Dataset: {args.dataset}")