GitHub

Replication package for "A Probabilistic Framework for Mutation Testing in Deep Neural Networks" paper

This replication package contains all the scripts/data necessary to plot figures from our paper and redo experiments of our paper "A Probabilistic Framework for Mutation Testing in Deep Neural Networks?" published in the journal of Information and Software Technology.

A preprint version is available on arxiv, while the published version is available on the publisher website

We also provide a quick way to adapt the framework for custom models/datasets/mutations.

Note that the package was designed to be functional and facilitate replication of the experiment, as such it's not as optimized speed wise as it could be.

A note on Replication

If you wish to replicate completely from scratch, you will need to download the trained instances (here) and rerun the script as described below. Note that, however, some script can take a few hours even with parallelization and the archive for all models instances is quite large (> 100 GB Total).

Note that if you were to retrain the instances, you wouldn't get exactly the same results as us, but it should be pretty similar to us (it is the point of the method after all). Yet, it requires quite some time to train the instances.

In any case, we provide all data necessary to run the script below if one just wish to replicate our figures as well as the ones in plot_results/.

Index

Architecture
Requirements
Training models based on DeepCrime mutations
Generating the accuracy data
Running mutation testing on DeepCrime models
Calculating posterior distributions and plotting them
Calculating Ratio
Calculating the Monte Carlo error over the instances (Bagged posterior stability)
Calculating the sampling effect for a given population size
Generating the figure from the paper
Making it works with your models/mutations/datasets

Architecture

.
├── Datasets/ # directory for datasets
├── README.md
├── comp_deepcrime.py # Comparison script with deepcrime
├── exp.py # Monte-Carlo simulation generation
├── generate_acc_files.py # Generating accuracy file for MNIST
├── generate_acc_files_lenet.py # Generating accuracy file for UnityEyes
├── generate_acc_files_movie.py # Generating accuracy file for MovieRecomm
├── mce_estim.py # Calculate MCE
├── mutated_models # Files for training models
│   ├── lenet
│   ├── mnist
│   └── movie
├── mutations.py # DeepCrime file, necessary for training mutated models
├── operators/ # DeepCrime files, necessary for training mutated models
├── plot_param.py # Calculating estimates
├── plot_posterior.py # Plotting figure such as Figure 3
├── plot_results # Directory with all figures
│   ├── lenet
│   ├── mnist
│   ├── movie_recomm
├── pop_var.py # To generate figure similar to Figure 4/5
├── raw_data # Raw data (accuracy files)
│   ├── deepcrime_comp
│   ├── lenet
│   ├── mnist
│   └── movie_recomm
├── rep_mce # Data of Monte-Carlo simulations (200 instances)
│   ├── lenet
│   ├── mnist
│   └── movie_recomm
├── rep_practicality # Data of 30 repetitions of Monte-Carlo simulation for different population size
│   ├── lenet
│   ├── mnist
│   └── movie_recomm
├── requirements.txt
├── run_mp.py # Getting 30 repetitions of the Monte-Carlo simulation of different population size
├── settings.py # Settings file
├── trained_models/ # Directory to put our trained models
├── trained_models_dc/ # Directory to put DeepCrime directory
├── utils/ # DeepCrime files, necessary for training mutated models
└── utils.py # utils file (plot functions as well as math related functions)

Requirements

The framework uses Python 3.8 and we recommend creating a conda environment to manage the different packages.

The base requirements are the same as DeepCrime provided requirements38.txt.

numpy~=1.18.5 
tensorflow~=2.3.0 
tensorflow-gpu~=2.3.0 
Keras~=2.4.3 
matplotlib~=3.3.0 
progressbar~=2.5 
scikit-learn~=0.23.1 
termcolor~=1.1.0 
h5py~=2.10.0 
pandas~=1.1.0 
statsmodels~=0.11.1 
opencv-python~=4.3.0.36 
networkx~=2.5.1 
patsy~=0.5.1 
scipy~=1.4.1

We extend those requirements with some additional packages needed in our scripts:

tqdm~=4.64.0
colorama~=0.4.4 
tqdm-multiprocess~=0.0.11

All package required are present in the requirements.txt file.

Training models based on DeepCrime mutations

Files/Directory concerned:

utils/
operators/
mutations.py
mutated_models/
Datasets/

We extracted files relevant to train models based on given a given mutation, mainly properties and operators DeepCrime used. We generated the training files according to DeepCrime definition, they all can be found in mutated_models/. Files are by default programmed to train 200 instances, only if instance i-th instance doesn't exist, for the defined mutations. Models used dataset from DeepCrime (not included in package) put into Datasets/.

Generating the accuracy data

Files/Directory concerned:

trained_models/
trained_models_dc/
raw_data/

Trained models can be downloaded as .zip files.

All trained instances are hosted on Zenodo.

Zenodo Packages

zenodo_1 corresponds to part of mnist models (source level mutations)
zenodo_2 corresponds to rest of mnist models, movie models and deepcrime models for comparison. (source level mutations)
zenodo_3 corresponds to lenet models. (source level mutations)
zenodo_4 corresponds to model level mutations of mnist/lenet.

All files inside the trained_models/ directory inside the .zip should then be extracted into relevant directory (trained_models_dc/ for deepcrime_models.zip, trained_models/ for all others). Then, generate_acc_files*.py can be executed (where * can be , _movie or _lenet) to generate the accuracy files for each model/mutation. Usage is:

python generate_acc_files.py --name [mutation] (--comp)

where --comp flag instructs program to use trained_models_dc/ instead of trained_models/. For instance:

python generate_acc_files.py --name 'change_label_mutated0_MP_3.12'

will generate the accuracy file mnist_change_label_mutated0_MP_3.12.csv in raw_data/mnist/.

Running mutation testing on DeepCrime models

Execution time: ~1 min/mutation

Files/Directory concerned:

raw_data/
utils.py
comp_deepcrime.py

If --dc is used, the scripts uses data in raw_data/deepcrime_comp/ to yield the p-value and cohen's d, as well as the decision (Killed or not) for the test when comparing healthy instances against mutated instances. The instances used in that vae are the ones provided in DeepCrime's replication package. Results are already presented in raw_data/deepcrime_comp/ in the [model]_results_kill_DC.txt.

If --dc is NOT used, the script will use DeepCrime's mutation test over multiple experiences using our instances, returning the average number of times the mutation test passed for each magntiude following the protocol we detailled in the Motivating Example of Section 4 in our paper.

--same needs to be used for model level mutation so the healthy instances are compared to their mutated counter-part.

files.

Usage is:

python comp_deepcrime.py --model [model] --mut [mutation] [--dc] [--same]

For instance, using only their instances and their single test:

python comp_deepcrime.py --model 'mnist' --mut 'delete_training_data' --dc

Exp: delete_training_data_3.1
p-value 0.907, effect_size 0.03689417305183907
*********************
Exp: delete_training_data_9.29
p-value 0.048, effect_size 0.6241987215473714
Killed
*********************
Exp: delete_training_data_12.38
p-value 0.387, effect_size 0.2735018636400621
*********************
Exp: delete_training_data_18.57
p-value 0.001, effect_size 1.0074392666953291
Killed
*********************
Exp: delete_training_data_30.93
p-value 0.0, effect_size 1.6320878597970647
Killed
*********************

For instance, using multiple iterations of the test over our instances:

python comp_deepcrime.py --model 'mnist' --mut 'delete_training_data'

Average number of mutation test passed for healthy instances vs healthy instances: 0.06 (0.051)
Average number of mutation test passed for healthy instances vs mutated instances (delete_training_data_3.1): 0.13 (0.093)
Average number of mutation test passed for healthy instances vs mutated instances (delete_training_data_9.29): 0.45 (0.120)
Average number of mutation test passed for healthy instances vs mutated instances (delete_training_data_12.38): 0.47 (0.143)
Average number of mutation test passed for healthy instances vs mutated instances (delete_training_data_18.57): 0.85 (0.091)
Average number of mutation test passed for healthy instances vs mutated instances (delete_training_data_30.93): 1.00 (0.001)

Calculating posterior distributions and plotting them

Execution time: ~1 min/mutation

Files/Directory concerned:

raw_data/
utils.py
plot_posterior.py
plot_results/

To calculate the posterior distribution as we detailed in Section 5.1-3, after calculating/putting accuracy files in the correct directory in raw_data/, one can execute plot_posterior.py. This will generate a figure of the same type as Figure 3 from our paper.

Usage is:

python plot_posterior.py --model [model] --mut [mutation] [--same]

--same needs to be used for model level mutation so the healthy instances are compared to their mutated counter-part.

For instance:

python plot_posterior.py --model 'mnist' --mut 'delete_training_data'

To allow for replication, the seed is fixed in this file. By default, the figures are saved in plot_results/[model]/ as a .png file.

Calculating Ratio

Execution time: ~1 min/mutation

Files/Directory concerned:

raw_data/
utils.py
plot_param.py

This allows to print the Hellinger distances between the considered mutation and the ideal non-mutant/mutant posteriors, as well as the similarity ratio.

Usage is:

python plot_param.py --model [model] --mut [mutation] [--same]

--same needs to be used for model level mutation so the healthy instances are compared to their mutated counter-part.

For instance:

python plot_param.py --model 'mnist' --mut 'delete_training_data'

which will return the value of the ratio and Hellinger distances. For instance:

Original: Hellinger distance to ideal non-mutant posterior: 0.75663, Hellinger distance to ideal mutant posterior: 1.00000
Ratio:  0.7566258775312085
Mut: 3.1, Hellinger distance to ideal non-mutant posterior: 0.91377, Hellinger distance to ideal mutant posterior: 0.9999999994857593
Ratio:  0.9137732097912725
Mut: 9.29, Hellinger distance to ideal non-mutant posterior: 0.99836, Hellinger distance to ideal mutant posterior: 0.9992365636693191
Ratio:  0.9991240438353294
Mut: 12.38, Hellinger distance to ideal non-mutant posterior: 0.99942, Hellinger distance to ideal mutant posterior: 0.999789474624726
Ratio:  0.9996269920003796
Mut: 18.57, Hellinger distance to ideal non-mutant posterior: 1.00000, Hellinger distance to ideal mutant posterior: 0.9544110921908092
Ratio:  1.0477663646846322
Mut: 30.93, Hellinger distance to ideal non-mutant posterior: 1.00000, Hellinger distance to ideal mutant posterior: 0.0035531532521594113
Ratio:  281.44015442966185

Calculating the Monte Carlo error over the instances (Bagged posterior stability)

Execution time: ~ 20 min for exp.py and ~ 10 sec for mce_estim.py

Files/Directory concerned:

raw_data/
utils.py
exp.py
rep_mce/
mce_estim.py

To calculate the Monte-Carlo error (MCE) as we detailed in Section 5.4, after calculating/putting accuracy files in the correct directory in raw_data/, one needs to first execute exp.py to generate monte-carlo simulation data.

Usage is:

python exp.py --model [model] --mut [mutation] [--param [parameter] ] [--proc n] [--same]

The model and mut parameters are the same as before. param controls the mutation magnitude/parameter. By default (if flag is not used), uses original models. proc parameter control the number of core to use (for parallalelisation). By default, only one is used.

--same needs to be used for model level mutation so the healthy instances are compared to their mutated counter-part.

For instance:

python exp.py --model 'mnist' --mut 'change_label' --param 3.12 --proc 8

This will generate a file such as mnist_change_label_3.12_200_pop_size.npy in rep_mce/mnist/ using 8 cores for instance.

Then use mce_estim.py to calculate the jackknife estimates as described in Section 4.5.

Usage is:

python mce_estim.py --model [model] --mut [mutation] [--param [parameter] ]

For instance:

python mce_estim.py --model 'mnist' --mut 'change_label' --param 3.12

Which will returns:

Model mnist, Mutation change_label (Param 3.12)
Mean: Avg 0.3688340049807526, 95% Confidence Interval (0.3657941422169647, 0.37187386774454045)
Variance: Avg 0.017087663244883932, 95% Confidence Interval: (0.01661289712701975, 0.017562429362748114)

Calculating the sampling effect for a given population size

Execution time: ~2 hours with 30 cores for run_mp.py (see note below to improve speed) and ~ 5 min with 8 cores for pop_var.py (instant if data have already been calculated and we just want to plot/print results).

Files/Directory concerned:

raw_data/
utils.py
pop_var.py
rep_practicality/
run_mp.py
plot_results/

To calculate the Sampling Effect after calculating/putting accuracy files in the correct directory in raw_data/, one needs to first execute run_mp.py to generate monte-carlo simulations for different population size.

Usage is:

python run_mp.py --model [model] --mut [mutation] [--size [size] ] [--param [parameter] ] [--proc n] [--same]

The model and mut parameters are the same as before. param controls the mutation magnitude/parameter. By default (if flag is not used), uses original models. size controls the sample size (default 100). proc parameter control the number of core to use (for parallalelisation). By default, only one is used.

--same needs to be used for model level mutation so the healthy instances are compared to their mutated counter-part.

For instance:

python run_mp.py --model 'mnist' --mut 'change_label' --size 25 --param 3.12 --proc 8

This will generate a file such as mnist_change_label_3.12_30_rep_25_size_pop.npy in rep_practicality/mnist/ using 8 cores for instance.

Note: this part can actual be long (around ~2 hours with 30 cores) with base parameters. You can reduce the value of exp_n (number of monte carlo simulation) to reduce the computation time without affecting much the results. One may also reduce the value of B (number of bootstrap) to further decrease it but at the possible cost of increased error.

Then use pop_var.py to calculate confidence interval over the jackknife estimates similarly to Figure 4/5.

Usage is:

python pop_var.py --model [model] --mut [mutation] [--param [parameter] ] [--pop_size [size] ] [--proc n]

All parameters are as before except --pop_size instead of --size. If the flag is not provided, the program will generate the figure similarly to Figure 4/5, except that it will return all estimates for one mutation/parameter instead of one estimate for mutliple models/parameters (see here to replicate figure).

For instance:

python pop_var.py --model 'mnist' --mut 'change_label' --param 3.12 --proc 8

Which will return a figure in plot_results/mnist/std/. To only output the results for a given population size.

For instance:

python pop_var.py --model 'mnist' --mut 'change_label' --param 3.12 --proc 8 --pop_size 25

*** Results ***
Pop size 25

Average and Std of each estimate:
Mean: 0.27597470147127784, Confidence Interval: (0.023650261256531954, 0.9899319769698173)
Variance: 0.0052768355722456525, Confidence Interval: (0.00015252938705970809, 0.0844377655575625)

Average and Std of each estimate lower bound:
Mean: 0.2508633503241399, Confidence Interval: (0.022862401295896513, 0.9898047342083249)
Variance: 0.004110943537826673, Confidence Interval: (0.00014684731827894768, 0.07961359227438368)

Average and Std of each estimate upper bound:
Mean: 0.30108605261841576, Confidence Interval: (0.024484684235331938, 0.9900592197313096)
Variance: 0.005828029840162291, Confidence Interval: (0.0001582114558404685, 0.08926193884074131)

Generating the figure from the paper

In all case, the figures are already present in their respective directory, however here are how to re-generate them:

Motivating example

Run the following script

python comp_deepcrime.py --model mnist --mut delete_training_data

Figure 3

Run the following script

python plot_posterior.py --model 'mnist' --mut 'delete_training_data'
python plot_posterior.py --model 'mnist' --mut 'change_activation_function'
python plot_posterior.py --model 'movie_recomm' --mut 'delete_training_data'
python plot_posterior.py --model 'movie_recomm' --mut 'unbalance_train_data'
python plot_posterior.py --model 'lenet' --mut 'delete_training_data'
python plot_posterior.py --model 'lenet' --mut 'change_label'

Data need to be present in raw_data/{model}/ (see here). Figure will be saved to plot_results/{model}/.

Figure 4

Run the following script

python pop_var.py --model 'mnist' --mut 'delete_training_data' --param 3.1
python pop_var.py --model 'mnist' --mut 'delete_training_data' --param 9.29
python pop_var.py --model 'mnist' --mut 'delete_training_data' --param 30.93

Data need to be present in raw_data/{model}/ (see here). Figure will be saved to plot_results/{model}/data_plot/.

Figure 5

Run the following script

python pop_var.py --model 'mnist' --mut 'change_label' --param 3.12
python pop_var.py --model 'movie_recomm' --mut 'change_label' --param 3.12
python pop_var.py --model 'lenet' --mut 'change_label' --param 3.12

Data need to be present in raw_data/{model}/ (see here). Figure will be saved to plot_results/{model}/data_plot/.

Making it works with your models/mutations/datasets

If the code is intended to be more of a replication package than a general framework, it is pretty easy to change it to have it work with any model/mutation/dataset.

To do so, the only requirements are:

Having trained the instances (sound and mutated) one wish to apply the framework on.
Having generated the accuracy files similarly to here. Please make sure the structure is the same as other files in raw_data/. Note that the files generate_acc_files*.py works only for the models we studied. Yet they can serve as a template for custom models. The files then need to be put in raw_data/[model]/.

Once this is done, you need to edit settings.py and add to main_dict you model as well as the mutation label/parameter similarly to studied models/mutations.

After that, you can plot the posterior for analysis such as Calculating posterior distributions and plotting them or calculate the similarity ratio similarly to Calculating Ratio, in order to decide which mutation is killed. With a sufficient number of training instances, it is not even needed to calculate Monte-Carlo error or Sample size effect as we showed in the paper, so the more time-consuming operations are removed.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
mutated_models		mutated_models
operators		operators
plot_results		plot_results
raw_data		raw_data
rep_mce		rep_mce
rep_practicality		rep_practicality
utils		utils
.gitignore		.gitignore
README.html		README.html
README.md		README.md
comp_deepcrime.py		comp_deepcrime.py
exp.py		exp.py
generate_acc_files.py		generate_acc_files.py
generate_acc_files_lenet.py		generate_acc_files_lenet.py
generate_acc_files_movie.py		generate_acc_files_movie.py
lenet_orig.npy		lenet_orig.npy
lenet_orig_same.npy		lenet_orig_same.npy
mce_estim.py		mce_estim.py
mnist_orig.npy		mnist_orig.npy
mnist_orig_same.npy		mnist_orig_same.npy
movie_recomm_orig.npy		movie_recomm_orig.npy
mutations.py		mutations.py
plot_param.py		plot_param.py
plot_posterior.py		plot_posterior.py
pop_var.py		pop_var.py
requirements.txt		requirements.txt
run_mp.py		run_mp.py
settings.py		settings.py
utils.py		utils.py

FlowSs/PMT

Folders and files

Latest commit

History

Repository files navigation

Replication package for "A Probabilistic Framework for Mutation Testing in Deep Neural Networks" paper

A note on Replication

Index

Architecture

Requirements

Training models based on DeepCrime mutations

Generating the accuracy data

Running mutation testing on DeepCrime models

Calculating posterior distributions and plotting them

Calculating Ratio

Calculating the Monte Carlo error over the instances (Bagged posterior stability)

Calculating the sampling effect for a given population size

Generating the figure from the paper

Making it works with your models/mutations/datasets

About

Resources

Stars

Watchers

Forks

Languages