pg_anon

Overview

pg_anon is an efficient tool for anonymizing PostgreSQL data, designed primarily for IT companies. Such companies often store sensitive data that includes both commercial secrets and personal user information such as contact numbers, passport details, and so on.

The tool comes in handy when it is necessary to transfer the database contents from the production environment to other environments for performance testing or functionality debugging during the development process. With pg_anon, no sensitive data is exposed, preventing potential data leaks.

Features

pg_anon works in several modes:

  • init: Creates the anon_funcs schema with anonymization functions.
  • create_dict: Scans the database contents and creates a metadict with an anonymization profile.
  • dump: Creates a database structure dump using the Postgres pg_dump tool, and data dumps using COPY ... queries with anonymization functions. The data dump step saves data locally in *.bin.gz format; during this step the data is anonymized on the database side by anon_funcs.
  • restore: Restores the database structure using the Postgres pg_restore tool and loads the data from the dump into the target DB. restore mode can also restore the database structure or the data separately.
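
A typical run uses these modes in order. The sketch below omits connection options (--db-host, --db-user, and so on) for brevity; complete commands are shown in the Usage section:

python pg_anon.py --mode init --db-name test_source_db
python pg_anon.py --mode create-dict --db-name test_source_db --dict-file test_meta_dict.py --output-dict-file test_dict_output.py
python pg_anon.py --mode dump --db-name test_source_db --dict-file test_dict_output.py
python pg_anon.py --mode restore --db-name test_target_db --input-dir test_dict_output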

Requirements & Dependencies

pg_anon is based on Python3 and also requires the third-party libraries listed in requirements.txt.

It uses the following tools and technologies:

  • Postgres pg_dump tool for dumping the database structure.
  • Postgres pg_restore tool for restoring the database structure.
  • Postgres functions for the anonymization process.

Installation Guide

Preconditions

The tool supports Python 3.11 and higher. The code is hosted in the pg_anon repository on GitHub: https://github.com/TantorLabs/pg_anon.

Installation Instructions

Installation processes slightly differ depending on your operating system.

macOS

  1. Install Python 3 if it isn't installed (for example, with Homebrew: brew install python@3.11, or from the official Python website).
  2. Clone the repository: git clone https://github.com/TantorLabs/pg_anon.git
  3. Go to the project directory: cd pg_anon
  4. Set up a virtual environment:
    • Install the virtual environment: python3 -m venv venv
    • Activate the virtual environment: source venv/bin/activate
  5. Install the dependencies: pip install -r requirements.txt

Ubuntu/Red Hat/CentOS

  1. Install Python 3 if it isn't installed: sudo apt-get install python3.11 (for Ubuntu), sudo yum install python311 (for Red Hat/CentOS)
  2. Clone the repository: git clone https://github.com/TantorLabs/pg_anon.git
  3. Go to the project directory: cd pg_anon
  4. Set up a virtual environment:
    • Install the virtual environment: python3 -m venv venv
    • Activate the virtual environment: source venv/bin/activate
  5. Install the dependencies: pip install -r requirements.txt

Windows 7/Windows 11

  1. Install Python3 if it isn't installed: Download it from the official Python website
  2. Clone the repository: git clone https://github.com/TantorLabs/pg_anon.git
  3. Go to the project directory: cd pg_anon
  4. Set up a virtual environment:
    • Install the virtual environment: py -m venv venv
    • Activate the virtual environment: .\venv\Scripts\activate
  5. Install the dependencies: pip install -r requirements.txt

Testing

To test pg_anon, you need a local PostgreSQL instance. This section covers installing PostgreSQL and running the test suite.

Setting Up PostgreSQL

To facilitate the testing, here are instructions on how to set up PostgreSQL on Ubuntu:

  1. Add repository configuration:

    echo "deb [arch=amd64] http://apt.postgresql.org/pub/repos/apt focal-pgdg main" >> /etc/apt/sources.list.d/pgdg.list
    wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add -
    
  2. Update packages and install PostgreSQL:

    apt -y update
    apt -y install postgresql-15 postgresql-client-15
    
  3. Allow connections to the PostgreSQL server:

    sed -i '/listen_addresses/s/^#//g' /etc/postgresql/15/main/postgresql.conf
    sed -i -e "s/^listen_addresses.*/listen_addresses = '127.0.0.1'/" /etc/postgresql/15/main/postgresql.conf
    
  4. Restart the PostgreSQL instance for the changes to take effect:

    pg_ctlcluster 15 main restart
    
  5. Create a test user with superuser rights to allow running the COPY commands:

    psql -c "CREATE USER anon_test_user WITH PASSWORD 'mYy5RexGsZ' SUPERUSER;" -U postgres
    

Executing Tests

To validate that your setup is functioning correctly, run the unit tests:

python test/full_test.py -v

Upon successful execution, the output should resemble the following:

Ran N tests in ...
OK

If all tests pass, the application is ready to use.

To run a specific test case, use the following pattern:

python3 test/full_test.py -v PGAnonValidateUnitTest

Test Database Configuration

Test database connection settings can be overridden using environment variables:

set TEST_DB_USER=anon_test_user
set TEST_DB_USER_PASSWORD=mYy5RexGsZ
set TEST_DB_HOST=127.0.0.1
set TEST_DB_PORT=5432
set TEST_SOURCE_DB=test_source_db
set TEST_TARGET_DB=test_target_db
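
The set form above targets the Windows command prompt; in a POSIX shell (bash, zsh) the same variables can be exported instead:

export TEST_DB_USER=anon_test_user
export TEST_DB_USER_PASSWORD=mYy5RexGsZ
export TEST_DB_HOST=127.0.0.1
export TEST_DB_PORT=5432
export TEST_SOURCE_DB=test_source_db
export TEST_TARGET_DB=test_target_db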

Usage

To display the help message for the CLI, run:

python pg_anon.py --help

Common pg_anon options:

Option        Description
--debug       Enable debug mode (default: false)
--verbose     Configure verbose mode: info, debug, error (default: info)
--threads     Number of threads for I/O operations (default: 4)
--processes   Number of processes for multiprocessing operations (default: 4)

Database configuration options:

Option              Description
--db-host           Specifies your database host
--db-port           Specifies your database port
--db-name           Specifies your database name
--db-user           Specifies your database user
--db-user-password  Specifies your database user password
--db-passfile       Path to the file containing the password to be used when connecting to the database
--db-ssl-key-file   Path to the client SSL key file for secure connections to the database
--db-ssl-cert-file  Path to the client SSL certificate file for secure connections to the database
--db-ssl-ca-file    Path to the root SSL certificate file, used to verify the server's certificate
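
These connection options can be combined with any mode. For example, an illustrative init run that reads the password from a file and verifies the server certificate might look like this (the file paths are placeholders):

python pg_anon.py --mode init \
                  --db-host 127.0.0.1 \
                  --db-port 5432 \
                  --db-name test_source_db \
                  --db-user anon_test_user \
                  --db-passfile /path/to/passfile \
                  --db-ssl-ca-file /path/to/root.crt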

Run init mode

To initialize the anon_funcs schema, run pg_anon in init mode:

python pg_anon.py --mode init \
                  --db-user postgres \
                  --db-user-password postgres \
                  --db-name test_source_db

Run create_dict mode

Prerequisites:

  • A generated or manually created dictionary *.py file with an anonymization profile
  • The anon_funcs schema created in init mode

To create the dictionary:

python pg_anon.py --mode create-dict \
                  --db-user postgres \
                  --db-user-password postgres \
                  --db-name test_source_db \
                  --dict-file test_meta_dict.py \
                  --output-dict-file test_dict_output.py \
                  --processes 2

Possible options in create-dict mode:

Option                Description
--dict-file           Specifies the dictionary file describing how the data should be anonymized
--output-dict-file    File to which the resulting dictionary will be saved
--scan-mode           Whether to scan all data or only part of it: full or partial (default: partial)
--scan-partial-rows   Number of rows to scan when --scan-mode is partial (default: 10000)
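
For example, to scan all rows instead of a sample (a full scan can take considerably longer on large tables), the same command can be run with --scan-mode full:

python pg_anon.py --mode create-dict \
                  --db-user postgres \
                  --db-user-password postgres \
                  --db-name test_source_db \
                  --dict-file test_meta_dict.py \
                  --output-dict-file test_dict_output.py \
                  --scan-mode full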

Run dump mode

Prerequisites:

  • anon_funcs schema with anonymization functions should be created. See --mode init example.
  • An output dict file with meta information about database fields and their anonymization rules should be created. See --mode create-dict.

Dump modes:

  1. To create the structure dump and data dump:

    python pg_anon.py --mode dump \
                      --db-host=127.0.0.1 \
                      --db-user postgres \
                      --db-user-password postgres \
                      --db-name test_source_db \
                      --dict-file test_dict_output.py
    
  2. To create only structure dump:

    python pg_anon.py --mode sync-struct-dump \
                      --db-host=127.0.0.1 \
                      --db-user postgres \
                      --db-user-password postgres \
                      --db-name test_source_db \
                      --output-dir test_sync_struct_dump \
                      --dict-file test_dict_output.py
    
  3. To create only data dump:

    python pg_anon.py --mode sync-data-dump \
                      --db-host=127.0.0.1 \
                      --db-user postgres \
                      --db-user-password postgres \
                      --db-name test_source_db \
                      --output-dir test_sync_data_dump \
                      --dict-file test_dict_output.py
    

    This mode can be useful for scheduling database synchronization, for example with cron.

Possible options in mode=dump:

Option              Description
--validate-dict     Validate the dictionary: show the tables and run the SQL queries without exporting data (default: false)
--validate-full     Same as --validate-dict, plus a data export with a limit (default: false)
--clear-output-dir  In dump mode, clears the output directory of files left over from a previous dump or other files (default: true)
--pg-dump           Path to the pg_dump Postgres tool (default: /usr/bin/pg_dump)
--output-dir        Output directory for dump files (default: "")
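
For example, to check a dictionary against the database without exporting any data, a validation run might look like the following (this assumes --validate-dict is accepted as a bare flag):

python pg_anon.py --mode dump \
                  --db-host=127.0.0.1 \
                  --db-user postgres \
                  --db-user-password postgres \
                  --db-name test_source_db \
                  --dict-file test_dict_output.py \
                  --validate-dict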

Run restore mode

Prerequisites:

  • Each restore mode requires a dump produced by the corresponding dump mode.

Restore modes:

  1. Restore structure and data:
    This mode requires the dump output, created in --mode=dump.

    python pg_anon.py --mode restore \
                      --db-host=127.0.0.1 \
                      --db-user postgres \
                      --db-user-password postgres \
                      --db-name test_target_db \
                      --input-dir test_dict_output \
                      --verbose debug
    
  2. Restore structure only:
    This mode requires the dump output, created in --mode=sync-struct-dump.

    python pg_anon.py --mode sync-struct-restore \
                      --db-host=127.0.0.1 \
                      --db-user postgres \
                      --db-user-password postgres \
                      --db-name test_target_db \
                      --input-dir test_sync_struct_dump \
                      --verbose debug
    
  3. Restore data only:
    This mode requires the dump output, created in --mode=sync-data-dump.

    python pg_anon.py --mode sync-data-restore \
                      --db-host=127.0.0.1 \
                      --db-user postgres \
                      --db-user-password postgres \
                      --db-name test_target_db \
                      --input-dir test_sync_data_dump \
                      --verbose debug
    

Possible options in --mode restore:

Option                      Description
--input-dir                 Input directory containing the dump files created in dump mode
--disable-checks            Disable checks of disk space and PostgreSQL version (default: false)
--seq-init-by-max-value     Initialize sequences based on maximum column values; otherwise, sequences are initialized from the values of the source database
--drop-custom-check-constr  Drop all CHECK constraints containing user-defined procedures to avoid performance degradation at the data loading stage
--pg-restore                Path to the pg_restore Postgres tool
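
For example, a restore that drops user-defined CHECK constraints before loading data might look like the following (this assumes --drop-custom-check-constr is accepted as a bare flag):

python pg_anon.py --mode restore \
                  --db-host=127.0.0.1 \
                  --db-user postgres \
                  --db-user-password postgres \
                  --db-name test_target_db \
                  --input-dir test_dict_output \
                  --drop-custom-check-constr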

Generate dictionary from table rows

If you have a table that lists the objects and fields to anonymize, you can use the following SQL query to generate a dictionary in JSON format:

select
	jsonb_pretty(
		json_agg(json_build_object('schema', T.schm, 'table', T.tbl, 'fields', flds ))::jsonb
	)
from (
    select
        T.schm,
        T.tbl,
        JSON_OBJECT_AGG(fld, mrule) as flds
    from (
        select 'schm_1' as schm, 'tbl_a' as tbl, 'fld_1' as fld, 'md5(fld_1)' as mrule
        union all
        select 'schm_1', 'tbl_a', 'fld_2', 'md5(fld_2)'
        union all
        select 'schm_1','tbl_b', 'fld_1', 'md5(fld_1)'
        union all
        select 'schm_1','tbl_b', 'fld_2', 'md5(fld_2)'
    ) T
    group by schm, tbl
) T
>>
	[
	    {
	        "table": "tbl_b",
	        "fields": {
	            "fld_1": "md5(fld_1)",
	            "fld_2": "md5(fld_2)"
	        },
	        "schema": "schm_1"
	    },
	    {
	        "table": "tbl_a",
	        "fields": {
	            "fld_1": "md5(fld_1)",
	            "fld_2": "md5(fld_2)"
	        },
	        "schema": "schm_1"
	    }
	]

How to escape/unescape complex names of objects

import json
j = {"k": "_TBL.$complex#имя;@&* a'2"}
json.dumps(j)
>>
	'{"k": "_TBL.$complex#\\u0438\\u043c\\u044f;@&* a\'2"}'

s = '{"k": "_TBL.$complex#\\u0438\\u043c\\u044f;@&* a\'2"}'
u = json.loads(s)
print(u['k'])
>>
	_TBL.$complex#имя;@&* a'2

Contributing

Dependencies

pg_anon uses the Poetry dependency management tool for managing dependencies and creating packages.
To add a new dependency, install Poetry and run:

poetry add <package_name>

To lock the dependencies, run:

poetry lock --no-update

Additionally, export the pinned dependencies to requirements.txt using the poetry export plugin:

poetry export -f requirements.txt --output requirements.txt

Build package

To build the package, run:

poetry build

Alternatively, the package can be built using setuptools:

python3 setup.py sdist

Future plans:

  • --format: COPY data format; selects the data format to be read or written (text, csv, or binary). Can be overridden by --copy-options.
  • --copy-options: Options for the COPY command, such as "with binary".
  • Support restoring the structure after a full dump: currently the database structure can be restored only from a struct dump, so it is not possible to restore the structure after a full dump.
  • Simplify commands and options to improve the user experience.