# DATA ANONYMIZATION

## Introduction

This notebook shows how to use the anonymization feature with a small example. We will start by setting up the notebook, then create a dummy dataset and its metadata, and finally `model` and `sample` the data, checking the differences in both the data and the internal state of the objects.

## Notebook preparation

In [1]:
import json

import numpy as np
import pandas as pd
import rdt

from sdv.data_navigator import CSVDataLoader, Table
from sdv.modeler import Modeler
from sdv.sampler import Sampler

In [2]:
assert rdt.__version__ >= '0.1.2', 'This notebook needs RDT 0.1.2 or later, the minimal version that supports the feature'
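
The check above compares version strings character by character, which can misorder versions such as `0.1.10` and `0.1.2`. A minimal sketch of a numeric comparison follows; `version_tuple` is a hypothetical helper written for this example, not part of RDT.

```python
import re


def version_tuple(version):
    """Turn a dotted version such as '0.1.10' into (0, 1, 10).

    Hypothetical helper: parsing stops at the first part that does not
    start with a digit, so a suffix like 'dev1' is ignored.
    """
    parts = []
    for part in version.split('.'):
        match = re.match(r'\d+', part)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


# Plain string comparison goes character by character and gets this wrong.
print('0.1.10' >= '0.1.2')                                # False
print(version_tuple('0.1.10') >= version_tuple('0.1.2'))  # True
```

With a helper like this, the assertion could compare `version_tuple(rdt.__version__)` against `(0, 1, 2)` instead of comparing raw strings.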

## Creating dataset and metadata

We are going to create a dataset with a single table containing three columns: `primary_key`, `name` and `credit_card_number`. We will also create two metadata definitions for it: one that uses anonymization and one that doesn't.

In [3]:
# Generating data for table.
table_data = pd.DataFrame([
    {
        'primary_key': 1,
        'name': 'Bill',
        'credit_card_number': '1111-2222-3333-4444'
    },
    {
        'primary_key': 2,
        'name': 'Jeff',
        'credit_card_number': '0000-0000-0000-0000'
    },
    {
        'primary_key': 2,
        'name': 'Warren',
        'credit_card_number': '2222-2222-2222-2222'
    },
    {
        'primary_key': 2,
        'name': 'Bill',
        'credit_card_number': '9999-9999-9999-9999'
    },
    {
        'primary_key': 2,
        'name': 'Jeff',
        'credit_card_number': '8888-8888-8888-8888'
    },

] * 10000)  # We repeat the same values to smooth the modelling

# Storing as CSV
table_data.to_csv('table.csv', index=False, header=True)

Now we are going to generate the metadata. Apart from the anonymization parameters, there are two major differences with respect to the metadata of the other examples:



In [4]:
normal_table_name = 'normal'
normal_table_metadata = {
    'fields': [
        {
            'name': 'name', 
            'type': 'categorical', 
        },
        {
            'name': 'credit_card_number', 
            'type': 'categorical', 
        },
        {
            'name': 'primary_key', 
            'subtype': 'integer', 
            'type': 'number', 
            'regex': '^[0-9]{10}$'
        },

    ],
    'headers': True,
    'name': normal_table_name,
    'path': 'table.csv',
    'primary_key': 'primary_key',
    'use': True
}
normal_table_metadata

{'fields': [{'name': 'name', 'type': 'categorical'},
  {'name': 'credit_card_number', 'type': 'categorical'},
  {'name': 'primary_key',
   'subtype': 'integer',
   'type': 'number',
   'regex': '^[0-9]{10}$'}],
 'headers': True,
 'name': 'normal',
 'path': 'table.csv',
 'primary_key': 'primary_key',
 'use': True}

In [5]:
anon_table_name = 'anon'
anon_table_metadata = {
    'fields': [
        {
            'name': 'name', 
            'type': 'categorical',
            'pii': True,
            'pii_category': 'first_name'
        },
        {
            'name': 'credit_card_number', 
            'type': 'categorical',
            'pii': True,
            'pii_category': 'credit_card_number'
        },
        {
            'name': 'primary_key', 
            'subtype': 'integer', 
            'type': 'number', 
            'regex': '^[0-9]{10}$'
        },

    ],
    'headers': True,
    'name': anon_table_name,
    'path': 'table.csv',
    'primary_key': 'primary_key',
    'use': True
}
anon_table_metadata

{'fields': [{'name': 'name',
   'type': 'categorical',
   'pii': True,
   'pii_category': 'first_name'},
  {'name': 'credit_card_number',
   'type': 'categorical',
   'pii': True,
   'pii_category': 'credit_card_number'},
  {'name': 'primary_key',
   'subtype': 'integer',
   'type': 'number',
   'regex': '^[0-9]{10}$'}],
 'headers': True,
 'name': 'anon',
 'path': 'table.csv',
 'primary_key': 'primary_key',
 'use': True}
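
To make the effect of the `pii` fields more concrete, here is a toy sketch of value replacement. It is not RDT's implementation: `anonymize_values` and the pool of fake names are hypothetical, and it assumes that each distinct original value is replaced by exactly one distinct fake value.

```python
import random


def anonymize_values(values, fake_pool, seed=0):
    """Replace each distinct value with a distinct entry drawn from fake_pool.

    Hypothetical helper for illustration only. The pool must hold at least
    as many entries as there are distinct values.
    """
    rng = random.Random(seed)
    distinct = sorted(set(values))
    replacements = rng.sample(fake_pool, len(distinct))
    mapping = dict(zip(distinct, replacements))
    return [mapping[value] for value in values], mapping


names = ['Bill', 'Jeff', 'Warren', 'Bill', 'Jeff']
fake_names = ['Alice', 'Carmen', 'Dmitri', 'Elena', 'Farid']  # hypothetical pool

anonymized, mapping = anonymize_values(names, fake_names)

# The number of distinct values is preserved, but none of the originals survive.
print(len(set(anonymized)) == len(set(names)))  # True
print(set(anonymized) & set(names))             # set()
```

Keeping the mapping consistent preserves the frequency of each category, which is what a model of a categorical column depends on.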

In [6]:
metadata = {
    'path': '',
    'tables': [
        anon_table_metadata,
        normal_table_metadata
    ]
}

metadata_filename = 'metadata.json'
with open(metadata_filename, 'w') as f:
    json.dump(metadata, f)


Now we have everything we need to model and sample our example dataset, that is:

- A table of data stored as the `table.csv` file.
- Two table metadata definitions, both **pointing to the same table**, but only **one of them anonymizing data**, and each of them using a different name.
- A full metadata specification, including the table metadata mentioned above, stored as `metadata.json`.
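
As a quick sanity check of the points above, the metadata file can be read back to confirm that both entries share one path and that only one of them flags fields as PII. The sketch below uses a trimmed-down copy of the metadata and a temporary file so that it runs on its own; the variable names `pii_fields` and `paths` are introduced here for illustration.

```python
import json
import os
import tempfile

# Trimmed-down copy of the structure built above: two entries, one CSV file.
metadata = {
    'path': '',
    'tables': [
        {
            'name': 'anon',
            'path': 'table.csv',
            'fields': [
                {'name': 'name', 'type': 'categorical',
                 'pii': True, 'pii_category': 'first_name'},
            ],
        },
        {
            'name': 'normal',
            'path': 'table.csv',
            'fields': [
                {'name': 'name', 'type': 'categorical'},
            ],
        },
    ],
}

# Round-trip through JSON, as the notebook does with metadata.json.
with tempfile.TemporaryDirectory() as folder:
    filename = os.path.join(folder, 'metadata.json')
    with open(filename, 'w') as f:
        json.dump(metadata, f)
    with open(filename) as f:
        loaded = json.load(f)

# Map each table name to the names of its fields flagged as PII.
pii_fields = {
    table['name']: [field['name'] for field in table['fields'] if field.get('pii')]
    for table in loaded['tables']
}
paths = {table['path'] for table in loaded['tables']}

print(pii_fields)  # {'anon': ['name'], 'normal': []}
print(paths)       # {'table.csv'}
```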


## Modelling the dataset


Now that we have prepared our data and metadata files, it is time to model and sample them. To do so, we will:

1. Use `CSVDataLoader` to generate a `DataNavigator` instance for our dataset.
2. Transform the data into a format that `Modeler` can model by calling `data_navigator.transform_data()`.
3. Create a `Modeler` instance called `modeler` that receives the `data_navigator` as an argument.
4. **Model** the dataset with `modeler.model_database()`.
5. If the modelling is successful, create a `Sampler` instance ready to sample new rows.

In [7]:
# Loading data
data_loader = CSVDataLoader(metadata_filename)
data_navigator = data_loader.load_data()

# Let's check that everything has gone as expected.
assert anon_table_name in data_navigator.tables
assert normal_table_name in data_navigator.tables

# Let's check that the data has been anonymized at load time.
assert not data_navigator.tables[anon_table_name].data.equals(data_navigator.tables[normal_table_name].data)

# Transform data
data_navigator.transform_data()
modeler = Modeler(data_navigator)

# Model the dataset/database
modeler.model_database()
sampler = Sampler(data_navigator, modeler)

## Sample and compare results

Now we are ready to sample some data.

We will sample data from the tables `anon` and `normal`, which both originate from the same `table.csv` file. The behavior we expect is that, in the anonymized table, the unique values in the columns `credit_card_number` and `name` are not a subset of the unique values of the same columns in the original data table.

In [8]:
anon_sampled = sampler.sample_rows('anon', 10)
anon_sampled

| | credit_card_number | name | primary_key |
|---|---|---|---|
| 0 | 3590859138917017 | Jodi | 0 |
| 1 | 3590859138917017 | Jodi | 1 |
| 2 | 30010139443140 | David | 2 |
| 3 | 3590859138917017 | Darrell | 3 |
| 4 | 5467680614604365 | David | 4 |
| 5 | 3590859138917017 | Jodi | 5 |
| 6 | 30010139443140 | David | 6 |
| 7 | 5467680614604365 | David | 7 |
| 8 | 30010139443140 | David | 8 |
| 9 | 30010139443140 | Jodi | 9 |


Here we can see that `name` and `credit_card_number` have different values than in the original data. For example, in the `name` column, the unique values have changed from `['Bill', 'Jeff', 'Warren']` to `['Jodi', 'David', 'Darrell']`. (Please note that these specific values come from this execution; running this notebook again may yield different results.)
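
The comparison described above can be written as a small set check. The sketch below uses plain lists copied from the original table and from the sampled output of this execution; `is_subset` is a hypothetical helper. With the notebook's dataframes, the same function could be given columns such as `anon_sampled['name']` and `table_data['name']`.

```python
def is_subset(sampled, original):
    """Return True when every distinct sampled value also appears in original."""
    return set(sampled) <= set(original)


# Names from the original table and from the anonymized sample shown above.
original_names = ['Bill', 'Jeff', 'Warren', 'Bill', 'Jeff']
anon_names = ['Jodi', 'Jodi', 'David', 'Darrell', 'David',
              'Jodi', 'David', 'David', 'David', 'Jodi']

print(is_subset(anon_names, original_names))          # False
print(sorted(set(anon_names) & set(original_names)))  # []
```

The second line is a stricter check than the subset test: it confirms that the two sets of names have no value in common at all.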

In `credit_card_number` the difference is even more noticeable, as the values don't keep the same format. This is not an issue, as this data will be transformed before being passed to the `Modeler`, and the transformation of categorical values into numeric ones should yield close enough results.
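
The format difference can be made explicit with a regular expression. This is a minimal sketch: the pattern `dddd-dddd-dddd-dddd` is inferred from the original table, and `matches_original_format` is a hypothetical helper, not part of SDV or RDT.

```python
import re

# Four groups of four digits separated by dashes, as in the original table.
CARD_PATTERN = re.compile(r'\d{4}-\d{4}-\d{4}-\d{4}')


def matches_original_format(value):
    """Return True when the whole value has the dashed 16-digit layout."""
    return CARD_PATTERN.fullmatch(str(value)) is not None


original_cards = ['1111-2222-3333-4444', '0000-0000-0000-0000']
anon_cards = ['3590859138917017', '30010139443140', '5467680614604365']

print([matches_original_format(v) for v in original_cards])  # [True, True]
print([matches_original_format(v) for v in anon_cards])      # [False, False, False]
```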



In [9]:
normal_sampled = sampler.sample_rows('normal', 10)
normal_sampled

| | credit_card_number | name | primary_key |
|---|---|---|---|
| 0 | 1111-2222-3333-4444 | Bill | 0 |
| 1 | 2222-2222-2222-2222 | Jeff | 1 |
| 2 | 2222-2222-2222-2222 | Jeff | 2 |
| 3 | 1111-2222-3333-4444 | Bill | 3 |
| 4 | 8888-8888-8888-8888 | Bill | 4 |
| 5 | 1111-2222-3333-4444 | Bill | 5 |
| 6 | 2222-2222-2222-2222 | Jeff | 6 |
| 7 | 1111-2222-3333-4444 | Warren | 7 |
| 8 | 9999-9999-9999-9999 | Bill | 8 |
| 9 | 8888-8888-8888-8888 | Warren | 9 |
