Anonymization and pseudonymization tools for tabular data.
This library provides tools and methods for anonymizing and protecting the privacy of data held in pandas DataFrames.
To install this package using pip-tools:
Add -e git+https://github.com/Datahel/tabular-anonymizer.git#egg=tabular_anonymizer
to your requirements.in
Run:
$ pip-compile --generate-hashes --allow-unsafe -o requirements.txt requirements.in
$ pip-sync requirements.txt
To install this package using pip:
Run:
$ pip install git+https://github.com/Datahel/tabular-anonymizer.git
Alternatively, you can clone this repository and install the library from the local folder in editable mode with pip's -e flag:
$ git clone https://github.com/Datahel/tabular-anonymizer.git
$ pip install -e tabular-anonymizer
You can then try out the examples found in the examples/ folder.
DataFrameAnonymizer supports K-anonymity, either alone or combined with L-diversity or T-closeness, using the Mondrian algorithm.
In this simplified example there is a mock dataset of 20 persons with their age, salary, and education. We anonymize it using the Mondrian algorithm with K=5.
After the Mondrian partitioning step (with K=5), the data is divided into groups of at least 5 rows, using age and salary as partitioning dimensions.
In the anonymization step a new dataframe is constructed, and each group is split into separate rows by the sensitive attribute (education).
You can test this in practice with: examples/plot_partitions.py
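The partitioning idea can be sketched in plain Python. This is an illustration of the concept only, not the library's actual implementation: records are split recursively on the median of the widest dimension, and a split is rejected if either side would drop below k rows.

```python
# Illustrative Mondrian-style partitioning (concept sketch, not the library code).
def mondrian_partition(records, dims, k):
    # Stop splitting when a further split could not leave k records per side.
    if len(records) < 2 * k:
        return [records]
    # Pick the dimension with the widest value range.
    dim = max(dims, key=lambda d: max(r[d] for r in records) - min(r[d] for r in records))
    values = sorted(r[dim] for r in records)
    median = values[len(values) // 2]
    left = [r for r in records if r[dim] < median]
    right = [r for r in records if r[dim] >= median]
    if len(left) < k or len(right) < k:
        return [records]  # no valid split on this dimension
    return mondrian_partition(left, dims, k) + mondrian_partition(right, dims, k)

# Mock data in the spirit of the 20-person example above.
people = [{"age": 20 + i, "salary": 2000 + 100 * i} for i in range(20)]
parts = mondrian_partition(people, ["age", "salary"], k=5)
```

Every resulting part holds at least 5 records, which is exactly the K-anonymity guarantee the partitioning step provides.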
import pandas as pd
from tabular_anonymizer import DataFrameAnonymizer
# Setup dataframe
df = pd.read_csv("./adult.csv", sep=",")
# Define sensitive attributes
sensitive_columns = ['label']
# Anonymize dataframe with k=10
p = DataFrameAnonymizer(sensitive_columns)
df_anonymized = p.anonymize_k_anonymity(df, k=10)
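To sanity-check a result like df_anonymized, recall that k-anonymity means every combination of quasi-identifier values occurs at least k times. A minimal checker (an illustrative helper written for this README, not part of the library's API):

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    # Count how often each combination of quasi-identifier values occurs.
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(c >= k for c in counts.values())

# Toy generalized rows: each (age, zip) combination appears twice.
rows = [
    {"age": "[20-30]", "zip": "202**", "label": "A"},
    {"age": "[20-30]", "zip": "202**", "label": "B"},
    {"age": "[30-40]", "zip": "201**", "label": "A"},
    {"age": "[30-40]", "zip": "201**", "label": "C"},
]
```

Here is_k_anonymous(rows, ["age", "zip"], 2) holds, while k=3 fails.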
import pandas as pd
from tabular_anonymizer import DataFrameAnonymizer
# Setup dataframe
df = pd.read_csv("./adult.csv", sep=",")
# Define sensitive attributes
sensitive_columns = ['label']
# Anonymize dataframe with k=10 and l=2
p = DataFrameAnonymizer(sensitive_columns)
df_anonymized = p.anonymize_l_diversity(df, k=10, l=2)
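L-diversity additionally requires each group of identical quasi-identifier rows to contain at least l distinct sensitive values. A minimal checker sketch (again an illustrative helper, not the library's API):

```python
from collections import defaultdict

def is_l_diverse(rows, quasi_identifiers, sensitive, l):
    # Collect the distinct sensitive values seen in each equivalence class.
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].add(row[sensitive])
    return all(len(values) >= l for values in groups.values())

# Toy data: the second group has only one distinct label.
rows = [
    {"age": "[20-30]", "label": "A"},
    {"age": "[20-30]", "label": "B"},
    {"age": "[30-40]", "label": "A"},
    {"age": "[30-40]", "label": "A"},
]
```

These rows are 1-diverse but not 2-diverse, because the "[30-40]" group contains only the label "A".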
The pseudonymization tool is intended for combining data from multiple sources. Both datasets share an identifier column; the function combine_and_pseudonymize replaces the identifier with a hash.
import pandas as pd
from tabular_anonymizer import utils
file1 = "examples/adult.csv"
df = pd.read_csv(file1, sep=",", index_col=0)
# Simple way
utils.pseudonymize(df, 'column_name', generate_nonce=True)
# Let's assume we have two dataframes df1 and df2.
# Both dataframes have a common identifier in columns column_name1 and column_name2, for example a birth date.
# If you want to merge these datasets later, you can first hash both columns using a shared salt.
from tabular_anonymizer import utils
# Generate nonces to be used as salt
nonce1 = utils.create_nonce() # Generated random salt #1
nonce2 = utils.create_nonce() # Generated random salt #2
# Pseudonymize given columns using sha3_224 with two salts
utils.pseudonymize(df1, 'column_name1', nonce1, nonce2)
utils.pseudonymize(df2, 'column_name2', nonce1, nonce2)
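Conceptually, this pseudonymization is salted hashing: the identifier is combined with the nonces and digested with SHA3-224, so the same value with the same salts always maps to the same pseudonym and joins across datasets still work. A rough sketch of the idea (the library's exact construction may differ):

```python
import hashlib
import secrets

def create_nonce() -> str:
    # Random salt; a hex token is used here for illustration.
    return secrets.token_hex(16)

def pseudonym(value: str, nonce1: str, nonce2: str) -> str:
    # Salted SHA3-224 digest of the identifier, framed by the two salts.
    return hashlib.sha3_224((nonce1 + value + nonce2).encode()).hexdigest()

n1, n2 = create_nonce(), create_nonce()
```

Because the digest is deterministic, hashing the shared column in both dataframes with the same pair of salts keeps the rows joinable without exposing the raw identifier.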
# Let's assume that dataframes df1 and df2 are of equal size and share a column "id" which is a direct
# identifier (such as a phone number). We can combine (merge) these two datasets and pseudonymize the
# values in the id column so it is no longer sensitive information.
from tabular_anonymizer import utils
# combine (merge) two datasets with common index column and pseudonymize
df_c = utils.combine_and_pseudonymize(df1, df2, 'id')
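The merge-then-hash idea can be sketched with plain pandas. This is an illustration of what a function like combine_and_pseudonymize plausibly does, written independently for this README; the library's implementation may differ in details:

```python
import hashlib
import secrets

import pandas as pd

def combine_and_hash(df1, df2, id_col):
    # Merge on the shared identifier, then replace it with a salted hash.
    merged = pd.merge(df1, df2, on=id_col)
    salt = secrets.token_hex(16)
    merged[id_col] = merged[id_col].map(
        lambda v: hashlib.sha3_224((salt + str(v)).encode()).hexdigest()
    )
    return merged

# Toy data with a shared direct identifier.
df1 = pd.DataFrame({"id": ["0401", "0402"], "age": [34, 51]})
df2 = pd.DataFrame({"id": ["0401", "0402"], "salary": [3200, 4100]})
df_c = combine_and_hash(df1, df2, "id")
```

The merged frame keeps all attribute columns, but the id column now holds opaque digests instead of the original identifiers.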
# Convert interval values to partially masked strings: ['20220', '20210'] => '202**'
generalize(df, 'zip', generalize_partial_masking)
# Original table
# id| zip
# 1 | '20220'
# 2 | '20210'
# Anonymized table (K=2)
# zip
# ['20220', '20210']
# After partial masking
# zip
# '202**'
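Partial masking keeps the longest common prefix of the values in a group and masks the remaining characters. One way to implement the idea (an illustrative sketch, not necessarily how generalize_partial_masking works internally):

```python
import os

def partial_mask(values):
    # Keep the longest common prefix and mask the remaining characters.
    prefix = os.path.commonprefix(values)
    width = max(len(v) for v in values)
    return prefix + "*" * (width - len(prefix))

masked = partial_mask(["20220", "20210"])  # '202**'
```

The masked value still identifies the shared area prefix of the zip codes while hiding the digits that distinguish the rows in the group.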
Besides the example scripts, Jupyter notebooks for testing purposes can be found in the examples folder.
examples/sample_notebook.ipynb # Example of how to use the tabular anonymizer
examples/check_anonymity.ipynb # Example of validating anonymization results
If you use GitHub Codespaces, you can execute the example scripts directly in the VS Code browser interface. The required plugins are included in the Codespaces container configuration.
Codespaces also lets you run notebooks directly in the web interface. However, if you need to run a jupyter-lab server in Codespaces, follow these instructions.
- Start the jupyter-lab server in the Codespaces terminal using the following command:
jupyter-lab --ip 0.0.0.0 --config .devcontainer/jupyter-server-config.py --no-browser
- Observe the jupyter-lab server log and click the link pointing to 127.0.0.1, e.g. http://127.0.0.1:8888/lab?token=... A small popup with a link titled "Follow link using forwarded port" appears. Click it and Codespaces will redirect you to the JupyterLab user interface.
You can also run JupyterLab and experiment with the tabular anonymizer in a Docker container:
docker build . -t tabular-anonymizer && docker run --rm -it -p 8888:8888 tabular-anonymizer
Open http://127.0.0.1:8888 in your web browser and navigate to examples/sample_notebook.ipynb
Press Ctrl+C to stop the container.
The Mondrian algorithm in this library is based on the glassonion1/AnonyPy Mondrian implementation.
The visualization example (plot_partitions.py) is based on the Nuclearstar/K-Anonymity plot implementation.