Comparison of machine learning models applied on anonymized data with different techniques

Authors: Judith Sáinz-Pardo Díaz and Álvaro López García (IFCA - CSIC).

Abstract: Anonymization techniques based on applying different levels of hierarchies to quasi-identifiers are widely used to achieve pre-established levels of privacy. To prevent different types of attacks against database privacy, it is necessary to apply several anonymization techniques beyond the classical ones such as k-anonymity or l-diversity. However, applying these methods reduces the usefulness of the data for prediction tasks, decision making, etc. Four classical Machine Learning methods currently used for classification tasks are studied in order to analyze the results as a function of the anonymization techniques applied and the parameters selected for each of them. First, different values of k for k-anonymity are established; then, with k=5 fixed, three other methods are applied: l-diversity, t-closeness and δ-disclosure privacy. The anonymization process has been carried out using the ARX Software [1].

Note: The paper associated with this work has been accepted for publication in the IEEE International Conference on Cyber Security and Resilience 2023 (IEEE CSR 2023), under the title "Comparison of machine learning models applied on anonymized data with different techniques" (see the preprint).

Brief summary of results

The results of applying four machine learning models (kNN, Random Forest, Adaptive Boosting and Gradient Tree Boosting) to the following five scenarios on the adult dataset are shown by means of ROC curves in Figure 1.

Scenario 1: raw data (adult_raw.ipynb).
Scenario 2: k-anonymity with k=5 (adult_k5.ipynb).
Scenario 3: k-anonymity with k=5 and l-diversity with l=2 (adult_k5_l2.ipynb).
Scenario 4: k-anonymity with k=5 and t-closeness with t=0.7 (adult_k5_t07.ipynb).
Scenario 5: k-anonymity with k=5 and δ-disclosure privacy with δ=1.5 (adult_k5_delta15.ipynb).
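
As a rough illustration of what the k and l parameters in these scenarios mean, the following minimal sketch computes the largest k and the largest (distinct) l satisfied by a small table. It is not the code used in the notebooks: the helper names and the toy rows are hypothetical, and only the Python standard library is assumed.

```python
from collections import defaultdict


def equivalence_classes(rows, quasi_identifiers):
    """Group rows (dicts) that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[qi] for qi in quasi_identifiers)
        classes[key].append(row)
    return list(classes.values())


def k_anonymity(rows, quasi_identifiers):
    """Largest k satisfied: the size of the smallest equivalence class."""
    return min(len(ec) for ec in equivalence_classes(rows, quasi_identifiers))


def l_diversity(rows, quasi_identifiers, sensitive):
    """Largest l for distinct l-diversity: the fewest distinct sensitive
    values found in any equivalence class."""
    return min(
        len({row[sensitive] for row in ec})
        for ec in equivalence_classes(rows, quasi_identifiers)
    )


# Toy generalized table (made-up rows, not taken from the adult dataset).
toy = [
    {"age": "[30, 40)", "sex": "Male", "income": "<=50K"},
    {"age": "[30, 40)", "sex": "Male", "income": ">50K"},
    {"age": "[40, 50)", "sex": "Female", "income": "<=50K"},
    {"age": "[40, 50)", "sex": "Female", "income": "<=50K"},
    {"age": "[40, 50)", "sex": "Female", "income": ">50K"},
]

print(k_anonymity(toy, ["age", "sex"]))            # 2
print(l_diversity(toy, ["age", "sex"], "income"))  # 2
```

A table anonymized for k=5 and l=2, as in Scenario 3, would be expected to return at least 5 and 2 respectively from checks of this kind.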

Using the Python library pyCANON [2], we have checked the anonymity levels that are actually satisfied under each of the anonymity techniques used (see check_anonymity.py). Specifically, it can be checked that when δ=1.5, the value of t for t-closeness is 0.47 (lower than the one set in the scenario where t-closeness is satisfied for t=0.7). The scenario with δ=1.5 is therefore the most restrictive one, and the results in each case go hand in hand with the ranking metric calculated in each case (see anonymity_metrics.py).
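
To make the relation between t and δ above more concrete, here is a minimal sketch of how both values can be measured on the same table. It does not reproduce pyCANON or check_anonymity.py. It assumes a categorical sensitive attribute, uses the equal-distance variant of t-closeness (half the L1 distance between the class and global distributions), and takes δ as the largest absolute natural-log ratio between class and global frequencies. The function names and toy rows are hypothetical.

```python
import math
from collections import Counter, defaultdict


def _distribution(values):
    """Relative frequency of each value in a list."""
    total = len(values)
    return {v: c / total for v, c in Counter(values).items()}


def _classes(rows, quasi_identifiers):
    """Group rows (dicts) by their quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].append(row)
    return list(groups.values())


def t_closeness(rows, quasi_identifiers, sensitive):
    """Smallest t satisfied, assuming equal ground distance between
    categories: max over classes of 0.5 * sum |p_class - p_global|."""
    overall = _distribution([r[sensitive] for r in rows])
    worst = 0.0
    for ec in _classes(rows, quasi_identifiers):
        local = _distribution([r[sensitive] for r in ec])
        dist = 0.5 * sum(abs(local.get(v, 0.0) - p) for v, p in overall.items())
        worst = max(worst, dist)
    return worst


def delta_disclosure(rows, quasi_identifiers, sensitive):
    """Smallest delta satisfied: max |log(p_class / p_global)| over all
    classes and sensitive values (infinite if a value is missing in a class)."""
    overall = _distribution([r[sensitive] for r in rows])
    worst = 0.0
    for ec in _classes(rows, quasi_identifiers):
        local = _distribution([r[sensitive] for r in ec])
        for value, p_global in overall.items():
            p_class = local.get(value, 0.0)
            if p_class == 0.0:
                return math.inf
            worst = max(worst, abs(math.log(p_class / p_global)))
    return worst


# Toy table with two equivalence classes of three rows each (made-up data).
toy = [
    {"zip": "39***", "income": "<=50K"},
    {"zip": "39***", "income": "<=50K"},
    {"zip": "39***", "income": ">50K"},
    {"zip": "28***", "income": "<=50K"},
    {"zip": "28***", "income": ">50K"},
    {"zip": "28***", "income": ">50K"},
]

print(round(t_closeness(toy, ["zip"], "income"), 3))       # 0.167
print(round(delta_disclosure(toy, ["zip"], "income"), 3))  # 0.405
```

Both functions compare each equivalence class with the whole table, which is why enforcing a bound on δ also constrains the measured t.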

An analogous analysis has also been carried out for different values of k for k-anonymity, in particular for k=2, 5, 10, 15, 20, 25, 50, 75, 100 (see ml_varying_k.py). The average equivalence class size metric has also been analyzed together with the accuracy and the AUC obtained with each ML model.
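
For reference, the two performance measures mentioned above can be computed from labels and predicted scores as in the minimal sketch below. It is not the code in ml_varying_k.py: no model is trained, the labels and scores are made up, and the function names are hypothetical. The AUC is computed through its rank interpretation (the probability that a random positive example receives a higher score than a random negative one).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve for binary labels (0/1), computed as the
    share of positive/negative pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and classifier scores for four test records.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [int(s >= 0.5) for s in scores]  # hypothetical 0.5 decision threshold

print(accuracy(y_true, y_pred))  # 0.75
print(roc_auc(y_true, scores))   # 0.75
```

Repeating such a computation for each value of k and each model gives one accuracy and one AUC value per combination, which is the kind of table the analysis above compares against the utility metrics.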

Two utility metrics are also analyzed: the average equivalence class size metric and the classification metric (CM).
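
The sketch below shows one way these two utility metrics can be computed. It follows the usual definitions as an assumption, and the code in anonymity_metrics.py may differ: the average equivalence class size is taken as (number of records / number of equivalence classes) / k, and CM as the fraction of records whose class label is not the majority label of their equivalence class (suppressed records are not modelled here). The function names and toy rows are hypothetical.

```python
from collections import Counter, defaultdict


def _classes(rows, quasi_identifiers):
    """Group rows (dicts) by their quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].append(row)
    return list(groups.values())


def average_class_size(rows, quasi_identifiers, k):
    """(records / equivalence classes) / k; 1.0 means every class has
    exactly k records, larger values mean more generalization than needed."""
    classes = _classes(rows, quasi_identifiers)
    return (len(rows) / len(classes)) / k


def classification_metric(rows, quasi_identifiers, label):
    """Fraction of records whose label differs from the majority label
    of their equivalence class (no suppressed records assumed)."""
    penalty = 0
    for ec in _classes(rows, quasi_identifiers):
        counts = Counter(row[label] for row in ec)
        penalty += len(ec) - max(counts.values())
    return penalty / len(rows)


# Toy generalized table (made-up rows, not taken from the adult dataset).
toy = [
    {"age": "[30, 40)", "sex": "Male", "income": "<=50K"},
    {"age": "[30, 40)", "sex": "Male", "income": ">50K"},
    {"age": "[40, 50)", "sex": "Female", "income": "<=50K"},
    {"age": "[40, 50)", "sex": "Female", "income": "<=50K"},
    {"age": "[40, 50)", "sex": "Female", "income": ">50K"},
]

print(average_class_size(toy, ["age", "sex"], k=2))          # 1.25
print(classification_metric(toy, ["age", "sex"], "income"))  # 0.4
```

Lower values of both metrics indicate that less utility has been lost through anonymization.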

License

This project is licensed under Apache License Version 2.0 (http://www.apache.org/licenses/).

References

[1] Prasser, Fabian, and Florian Kohlmayer. "Putting statistical disclosure control into practice: The ARX data anonymization tool." Medical data privacy handbook (2015): 111-148.

[2] Sáinz-Pardo Díaz, Judith, and Álvaro López García. "A Python library to check the level of anonymity of a dataset." Scientific Data 9.1 (2022): 785.

Acknowledgments

The authors acknowledge the funding received through the European Union - NextGenerationEU (Regulation EU 2020/2094), through CSIC’s Global Health Platform (PTI+ Salud Global), and the support from the project AI4EOSC “Artificial Intelligence for the European Open Science Cloud”, which has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement number 101058593.

IFCA-Advanced-Computing/anonymity-ml
