# Remap PII Tutorial


This tutorial demonstrates how to anonymize sensitive data in Rockfish. It walks through two examples that remap datasets containing several kinds of Personally Identifiable Information (PII).


In [1]:
%%capture
%pip install -U 'rockfish[labs]' -f 'https://packages.rockfish.ai'

In [2]:
import rockfish as rf
import rockfish.actions as ra

In [3]:
# connect locally
conn = rf.Connection.local()

# Example 1


Download sample dataset with PII:


In [4]:
!wget --no-clobber https://raw.githubusercontent.com/tokern/piicatcher/master/tests/samples/sample-data.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4748  100  4748    0     0  41635      0 --:--:-- --:--:-- --:--:-- 42017


Convert into a Rockfish dataset:


In [5]:
dataset = rf.Dataset.from_csv("sample-data", "sample-data.csv")

We can see that this dataset has PII: SSNs, birthdates, email addresses, etc.


In [6]:
dataset.to_pandas()

Unnamed: 0,id,gender,birthdate,maiden_name,lname,fname,address,city,state,zip,phone,email,cc_type,cc_number,cc_cvc,cc_expiredate
0,172-32-1176,m,1958/04/21,Smith,White,Johnson,10932 Bigge Rd,Menlo Park,CA,94025,408 496-7223,jwhite@domain.com,m,5270 4267 6450 5516,123,2010/06/25
1,514-14-8905,f,1944/12/22,Amaker,Borden,Ashley,4469 Sherman Street,Goff,KS,66428,785-939-6046,aborden@domain.com,m,5370 4638 8881 3020,713,2011/02/01
2,213-46-8915,f,1958/04/21,Pinson,Green,Marjorie,309 63rd St. #411,Oakland,CA,94618,415 986-7020,mgreen@domain.com,v,4916 9766 5240 6147,258,2009/02/25
3,524-02-7657,m,1962/03/25,Hall,Munsch,Jerome,2183 Roy Alley,Centennial,CO,80112,303-901-6123,jmunsch@domain.com,m,5180 3807 3679 8221,612,2010/03/01
4,489-36-8350,m,1964/09/06,Porter,Aragon,Robert,3181 White Oak Drive,Kansas City,MO,66215,816-645-6936,raragon@domain.com,v,4929 3813 3266 4295,911,2011/12/01
5,514-30-2668,f,1986/05/27,Nicholson,Russell,Jacki,3097 Better Street,Kansas City,MO,66215,913-227-6106,jrussell@domain.com,a,345389698201044,232,2010/01/01
6,505-88-5714,f,1963/09/23,Mcclain,Venson,Lillian,539 Kyle Street,Wood River,NE,68883,308-583-8759,lvenson@domain.com,d,30204861594838,471,2011/12/01
7,690-05-5315,m,1969/10/02,Kings,Conley,Thomas,570 Nancy Street,Morrisville,NC,27560,919-656-6779,tconley@domain.com,v,4916 4811 5814 8111,731,2010/10/01
8,646-44-9061,M,1978/01/12,Kurtz,Jackson,Charles,1074 Small Street,New York,NY,10011,212-847-4915,cjackson@domain.com,m,5218 0144 2703 9266,892,2011/11/01
9,421-37-1396,f,1980/04/09,Linden,Davis,Susan,4222 Bedford Street,Jasper,AL,35501,205-221-9156,sdavis@domain.com,v,4916 4034 9269 8783,33,2011/04/01


## Remap Action

Users can add remap actions to a Rockfish Workflow.

To use a `remap` function, specify the following:

1. `remap_type`: The type of remap function you want to use. Supported remap types are described below.
2. `select_col`: The name of the field in your dataset that you want to remap.
3. `new_remapped_col`: The name of a new field that holds the remapped values. Use it when you don't want to
   overwrite the original field (e.g., for testing purposes).
4. `options`: An optional dictionary of arguments to customize your remap function.
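
As a rough sketch, the four items above map onto the plain configuration dictionaries used in the cells of this tutorial. The helper `remap_config` below is hypothetical (it is not part of Rockfish); it only assembles the dictionary that the later cells pass to `ra.Transform` or `ra.Apply`, and it assumes that leaving out `append_field` means the original field is overwritten, as the SSN example suggests.

```python
from typing import Any, Optional


def remap_config(
    remap_type: str,
    select_col: str,
    options: Optional[dict[str, Any]] = None,
    new_remapped_col: Optional[str] = None,
) -> dict[str, Any]:
    """Hypothetical helper: build the config dict used in this tutorial."""
    config: dict[str, Any] = {
        "function": {"remap": [remap_type, select_col, options]}
    }
    if new_remapped_col is not None:
        # Keep the original field and write the result to a new field.
        config["append_field"] = new_remapped_col
    return config


date_config = remap_config(
    "date", "birthdate", {"format_str": "%b %Y"}, "remapped_birthdate"
)
print(date_config)
```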

We will look at a few example remap actions below.


## Remapping SSNs

Mask the last 8 characters using "X".

- `remap_type`: "ssn"
- `options` to customize the default function if needed:
  - `mask_char`: Any string that will be used as the masking character.
  - `mask_length`: Number of characters to mask.
  - `from_end`: A boolean to mask from the beginning of the field (False) or from the end (True).


In [7]:
remap_type = "ssn"
select_col = "id"
options = None
remap_ssn = ra.Transform(
    {"function": {"remap": [remap_type, select_col, options]}}
)
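
To make the three options concrete, here is a minimal pure-Python sketch of the masking behavior described above. `mask_value` is a hypothetical stand-in, not Rockfish's implementation, and its defaults ("X", 8 characters, from the end) are assumptions taken from the description of this section.

```python
def mask_value(
    value: str,
    mask_char: str = "X",
    mask_length: int = 8,
    from_end: bool = True,
) -> str:
    """Hypothetical stand-in for the "ssn" remap; not Rockfish's own code."""
    # Never mask more characters than the value contains.
    n = min(mask_length, len(value))
    if from_end:
        return value[: len(value) - n] + mask_char * n
    return mask_char * n + value[n:]


print(mask_value("172-32-1176"))  # 172XXXXXXXX
print(mask_value("172-32-1176", mask_char="*", mask_length=4, from_end=False))  # ****32-1176
```

Note that hyphens count as characters here, which matches the 11-character masked values shown in the outputs of this tutorial.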

## Remapping Dates

By default, timestamps that contain both a time and a date are replaced with just the day, month, and year. To remap fields that only contain dates, you
can specify a coarser `format_str` option (e.g., to keep only month and year, use "%b %Y").

- `remap_type`: "date"
- `options` to customize the default function if needed:
  - `format_str`: A valid datetime format string (see
    [datetime documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) for possible formats).


In [8]:
remap_type = "date"
select_col = "birthdate"
options = {"format_str": "%b %Y"}
new_remapped_col = "remapped_birthdate"
remap_date = ra.Apply(
    {
        "function": {"remap": [remap_type, select_col, options]},
        "append_field": new_remapped_col,
    }
)
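
The effect of `format_str` can be previewed with the standard library alone. This is a sketch, not the Rockfish code path: `coarsen_date` is a hypothetical helper, and the input format `%Y/%m/%d` is an assumption based on how birthdates appear in the sample data.

```python
from datetime import datetime


def coarsen_date(
    value: str,
    format_str: str = "%b %Y",
    input_format: str = "%Y/%m/%d",  # assumed layout of the sample birthdates
) -> str:
    """Parse a date string and re-emit it with less detail."""
    parsed = datetime.strptime(value, input_format)
    # %b is the abbreviated month name in the current (default C) locale.
    return parsed.strftime(format_str)


print(coarsen_date("1958/04/21"))  # Apr 1958
print(coarsen_date("1958/04/21", format_str="%Y"))  # 1958
```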

## Remapping Email Addresses

Replace emails with randomly generated fake email addresses.

- `remap_type`: "email"
- `options` to customize the default function if needed:
  - `gender`: "M" to generate email addresses with male first names (default = female first names).
  - `locale`: See [locale documentation](https://faker.readthedocs.io/en/master/locales.html)
    for supported locale types (default locale = "en_US").
  - `seed`: Seed for the random generator (default = 0).


In [9]:
remap_type = "email"
select_col = "email"
options = None
new_remapped_col = "remapped_email"
remap_email = ra.Apply(
    {
        "function": {"remap": [remap_type, select_col, options]},
        "append_field": new_remapped_col,
    }
)
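
The role of the `seed` option can be illustrated without Rockfish. The sketch below is only an analogy built on the standard library: `fake_emails` and the small name pools are hypothetical, and a real generator would draw from much larger, locale-specific lists. It shows the one property a seed provides, namely that the same seed yields the same replacements.

```python
import random

# Hypothetical name pools, for illustration only.
FIRST_NAMES = ["Sarah", "Jennifer", "Kathy", "Victoria"]
LAST_NAMES = ["Chang", "Bowers", "Campbell", "Patrick"]
DOMAINS = ["example.com", "example.net", "example.org"]


def fake_emails(values: list[str], seed: int = 0) -> list[str]:
    """Return one randomly generated address per input value."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    fakes = []
    for _ in values:
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        fakes.append(f"{first}.{last}@{rng.choice(DOMAINS)}")
    return fakes


originals = ["jwhite@domain.com", "aborden@domain.com"]
# The same seed reproduces the same replacements.
print(fake_emails(originals, seed=0) == fake_emails(originals, seed=0))  # True
```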

## Remapping Phone Numbers

Replace phone numbers with randomly generated fake phone numbers.

- `remap_type`: "phone_number"
- `options` to customize the default function if needed:
  - `locale`: See [locale documentation](https://faker.readthedocs.io/en/master/locales.html)
    for supported locale types (default locale = "en_US").
  - `seed`: Seed for the random generator (default = `None`).


In [10]:
remap_type = "phone_number"
select_col = "phone"
options = None
new_remapped_col = "remapped_phone"
remap_phone = ra.Apply(
    {
        "function": {"remap": [remap_type, select_col, options]},
        "append_field": new_remapped_col,
    }
)

## Remapping CVCs Using Custom Bins

Replace CVC values with the bucket they fall into.

- `remap_type`: "custom_bins"
- `options`:
  - `bins`: A number `n` to split the range of values into `n` buckets, or a list that contains specific intervals.
    For example, to use intervals "[0, 10)" and "[10, 20)", specify `bins = [0, 10, 20]`.
  - `right`: A boolean to make the right side of the interval inclusive (True) or not (False).
  - `labels`: Labels for intervals, if needed (default = None).

See [documentation for pandas.cut()](https://pandas.pydata.org/docs/dev/reference/api/pandas.cut.html) for a more detailed explanation of the arguments.


In [11]:
remap_type = "custom_bins"
select_col = "cc_cvc"
options = {
    "bins": 10,
    "right": False,
    "labels": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
}
new_remapped_col = "remapped_cc_cvc"
remap_cvc = ra.Apply(
    {
        "function": {"remap": [remap_type, select_col, options]},
        "append_field": new_remapped_col,
    }
)
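
The interval semantics of `bins` and `right` can be checked with a small standard-library sketch. `bin_index` is a hypothetical helper that mimics how `pandas.cut` treats a list of edges; it assumes Rockfish forwards these options to `pandas.cut`, as the link above suggests.

```python
from bisect import bisect_left, bisect_right
from typing import Optional, Sequence


def bin_index(
    value: float, edges: Sequence[float], right: bool = True
) -> Optional[int]:
    """Return the index of the interval containing value, or None."""
    if right:
        # Intervals are (edges[i], edges[i + 1]].
        idx = bisect_left(edges, value) - 1
    else:
        # Intervals are [edges[i], edges[i + 1]).
        idx = bisect_right(edges, value) - 1
    return idx if 0 <= idx < len(edges) - 1 else None


print(bin_index(10, [0, 10, 20], right=False))  # 1
print(bin_index(10, [0, 10, 20], right=True))   # 0
print(bin_index(20, [0, 10, 20], right=False))  # None
```

When `bins` is a single number, `pandas.cut` derives equal-width edges from the minimum and maximum of the column, so the bucket a value lands in depends on the full dataset rather than on the value alone.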

## Remapping Gender Using Custom Dict

Replace values according to a dictionary containing mappings from original values to new values.

- `remap_type`: "custom_dict"
- `options`:
  - `new_values_dict`: Dictionary with mappings from original values to new values. Not all original values in a field
    need to have a mapping. Values without a mapping will not be replaced.


In [12]:
remap_type = "custom_dict"
select_col = "gender"
options = {
    "new_values_dict": {
        "m": "Male",
        "M": "Male",
        "f": "Female",
    }
}
new_remapped_col = "remapped_gender"
remap_gender = ra.Apply(
    {
        "function": {"remap": [remap_type, select_col, options]},
        "append_field": new_remapped_col,
    }
)
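
The rule that unmapped values are left unchanged is the same behavior as a dictionary lookup with a fallback. A minimal sketch, with `remap_with_dict` as a hypothetical helper rather than a Rockfish function:

```python
def remap_with_dict(
    values: list[str], new_values_dict: dict[str, str]
) -> list[str]:
    """Replace mapped values; keep values that have no mapping."""
    # dict.get falls back to the original value when no mapping exists.
    return [new_values_dict.get(v, v) for v in values]


mapping = {"m": "Male", "M": "Male", "f": "Female"}
print(remap_with_dict(["m", "M", "f", "x"], mapping))
# ['Male', 'Male', 'Female', 'x']
```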

## Save Remapped Dataset


In [13]:
save_remapped = ra.DatasetSave(name="remapped_dataset")

## Build And Run Workflow


In [14]:
preprocess_builder = rf.WorkflowBuilder()
preprocess_builder.add_dataset(dataset)
preprocess_builder.add_action(remap_ssn, alias="remap_ssn", parents=[dataset])
preprocess_builder.add_action(
    remap_date, alias="remap_date", parents=[remap_ssn]
)
preprocess_builder.add_action(
    remap_email, alias="remap_email", parents=[remap_date]
)
preprocess_builder.add_action(
    remap_phone, alias="remap_phone", parents=[remap_email]
)
preprocess_builder.add_action(
    remap_cvc, alias="remap_cvc", parents=[remap_phone]
)
preprocess_builder.add_action(
    remap_gender, alias="remap_gender", parents=[remap_cvc]
)
preprocess_builder.add_action(save_remapped, parents=[remap_gender])

In [15]:
for action in preprocess_builder.actions:
    print(action)

dataset-load
remap_ssn
remap_date
remap_email
remap_phone
remap_cvc
remap_gender
dataset-save


In [16]:
preprocess_workflow = await preprocess_builder.start(conn)
remapped_dataset = None

print(f"Workflow: {preprocess_workflow.id()}")

Workflow: 1W58Q0QsW0z1kvx2kIRoVI


In [17]:
async for log in preprocess_workflow.logs():
    print(log)

2024-12-12T22:31:30.648394Z dataset-load: INFO Loading dataset '1mjABBrolgVt8fG2VWm5d7' with 30 rows
2024-12-12T22:31:30.686169Z dataset-save: WARN No session metadata provided
2024-12-12T22:31:30.686511Z dataset-save: INFO Saved dataset '7871JhUC33hHd9aaHtbJXb' with 30 rows


In [18]:
async for sds in preprocess_workflow.datasets():
    remapped_dataset = await sds.to_local(conn)

## Outputs


In [19]:
remapped_dataset.to_pandas()[["id"]][:10]

Unnamed: 0,id
0,172XXXXXXXX
1,514XXXXXXXX
2,213XXXXXXXX
3,524XXXXXXXX
4,489XXXXXXXX
5,514XXXXXXXX
6,505XXXXXXXX
7,690XXXXXXXX
8,646XXXXXXXX
9,421XXXXXXXX


In [20]:
remapped_dataset.to_pandas()[["birthdate", "remapped_birthdate"]][:10]

Unnamed: 0,birthdate,remapped_birthdate
0,1958/04/21,Apr 1958
1,1944/12/22,Dec 1944
2,1958/04/21,Apr 1958
3,1962/03/25,Mar 1962
4,1964/09/06,Sep 1964
5,1986/05/27,May 1986
6,1963/09/23,Sep 1963
7,1969/10/02,Oct 1969
8,1978/01/12,Jan 1978
9,1980/04/09,Apr 1980


In [21]:
remapped_dataset.to_pandas()[["email", "remapped_email"]][:10]

Unnamed: 0,email,remapped_email
0,jwhite@domain.com,Sarah.Chang@williams-sheppard.info
1,aborden@domain.com,Jennifer.Bowers@faulkner-howard.com
2,mgreen@domain.com,Kathy.Campbell@montgomery.net
3,jmunsch@domain.com,Victoria.Patrick@collins.com
4,raragon@domain.com,Stephanie.Sutton@castro-gomez.com
5,jrussell@domain.com,Lisa.Durham@woods.net
6,lvenson@domain.com,Sydney.Davis@page-glover.com
7,tconley@domain.com,Lisa.Clayton@sanchez-nguyen.com
8,cjackson@domain.com,Cheryl.Bradley@pratt.net
9,sdavis@domain.com,Jo.Miller@golden-bolton.info


In [22]:
remapped_dataset.to_pandas()[["phone", "remapped_phone"]][:10]

Unnamed: 0,phone,remapped_phone
0,408 496-7223,(460)648-7647x5938
1,785-939-6046,(319)748-9241
2,415 986-7020,281.256.5938x7784
3,303-901-6123,560-597-5351
4,816-645-6936,328.671.1587
5,913-227-6106,+1-641-985-8398
6,308-583-8759,496-259-3423
7,919-656-6779,+1-647-511-2201x868
8,212-847-4915,796.394.7751
9,205-221-9156,295.333.0413x5256


In [23]:
remapped_dataset.to_pandas()[["cc_cvc", "remapped_cc_cvc"]][:10]

Unnamed: 0,cc_cvc,remapped_cc_cvc
0,123,0
1,713,7
2,258,2
3,612,6
4,911,9
5,232,2
6,471,4
7,731,7
8,892,8
9,33,0


In [24]:
remapped_dataset.to_pandas()[["gender", "remapped_gender"]][:10]

Unnamed: 0,gender,remapped_gender
0,m,Male
1,f,Female
2,f,Female
3,m,Male
4,m,Male
5,f,Female
6,f,Female
7,m,Male
8,M,Male
9,f,Female


# Example 2


Download sample dataset with IP addresses:


In [25]:
!wget --no-clobber https://docs142.rockfish.ai/tutorials/pcap.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7043  100  7043    0     0  52584      0 --:--:-- --:--:-- --:--:-- 52954


Convert into Rockfish dataset:


In [26]:
dataset = rf.Dataset.from_csv("pcap", "pcap.csv")

We can see that this dataset has PII: IP addresses.


In [27]:
dataset.to_pandas()

Unnamed: 0,srcip,dstip,srcport,dstport,proto,timestamp,pkt_len
0,244.3.253.224,244.3.160.239,3396,80,6,2009-12-17 16:27:36.075494,40
1,41.177.26.91,68.157.168.194,80,65003,6,2009-12-17 16:27:36.075515,1500
2,41.177.26.91,68.157.168.194,80,65003,6,2009-12-17 16:27:36.075519,940
3,41.177.26.91,68.157.168.194,80,65003,6,2009-12-17 16:27:36.075553,1500
4,41.177.26.91,68.157.168.194,80,65003,6,2009-12-17 16:27:36.075603,1500
...,...,...,...,...,...,...,...
95,68.157.168.194,41.177.26.91,45615,80,6,2009-12-17 16:27:36.099423,60
96,41.177.26.91,68.157.168.194,80,45615,6,2009-12-17 16:27:36.099891,64
97,41.177.3.203,41.177.3.224,58381,1791,6,2009-12-17 16:27:36.100508,40
98,244.3.41.84,244.3.31.67,2626,1592,6,2009-12-17 16:27:36.105025,237


## Remapping IP Addresses

Replace IP addresses with randomly generated fake IP addresses.

- `remap_type`: "ip"
- `options` to customize the default function if needed:
  - `cidr`: A CIDR prefix length, one of ["/0", "/8", "/16", "/24"] (default = "/24"). In the output of this
    example, the default keeps the first three octets of each address.
  - `seed`: Seed for the random generator (default = `None`).


In [28]:
remap_type = "ip"
select_col = "srcip"
options = None
new_remapped_col = "remapped_srcip"
remap_ip = ra.Apply(
    {
        "function": {"remap": [remap_type, select_col, options]},
        "append_field": new_remapped_col,
    }
)
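
One way to picture prefix-preserving replacement is the standard-library sketch below. `remap_ips` is hypothetical and is not how Rockfish is known to implement the "ip" remap. It assumes IPv4 input, keeps the network bits selected by `cidr`, randomizes the remaining host bits, and maps repeated inputs to the same fake address, which is the pattern visible in the outputs of this example.

```python
import ipaddress
import random
from typing import Optional


def remap_ips(
    values: list[str], cidr: str = "/24", seed: Optional[int] = None
) -> list[str]:
    """Replace IPv4 addresses while keeping their network prefix."""
    rng = random.Random(seed)
    prefix_len = int(cidr.lstrip("/"))
    host_bits = 32 - prefix_len
    cache: dict[str, str] = {}  # original address -> fake address
    for value in values:
        if value in cache:
            continue
        network = ipaddress.ip_network(f"{value}/{prefix_len}", strict=False)
        # Randomize only the host bits; the network prefix is kept as-is.
        # Distinct inputs may collide, and a fake may equal its original.
        host = rng.getrandbits(host_bits)
        cache[value] = str(network.network_address + host)
    return [cache[v] for v in values]


print(remap_ips(["41.177.26.91", "244.3.253.224", "41.177.26.91"], seed=1))
```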

In [29]:
save_remapped = ra.DatasetSave(name="remapped_dataset")

## Build And Run Workflow


In [30]:
preprocess_builder = rf.WorkflowBuilder()
preprocess_builder.add_dataset(dataset)
preprocess_builder.add_action(remap_ip, alias="remap_ip", parents=[dataset])
preprocess_builder.add_action(save_remapped, parents=[remap_ip])

In [31]:
for action in preprocess_builder.actions:
    print(action)

dataset-load
remap_ip
dataset-save


In [32]:
preprocess_workflow = await preprocess_builder.start(conn)
remapped_dataset = None

print(f"Workflow: {preprocess_workflow.id()}")

Workflow: 5Cv1ID7yIYm0YjxjFruMwJ


In [33]:
async for log in preprocess_workflow.logs():
    print(log)

2024-12-12T22:31:31.395453Z dataset-load: INFO Loading dataset '3sLFNxc4BuoMBZVUFgoLxS' with 100 rows
2024-12-12T22:31:31.396429Z dataset-save: WARN No session metadata provided
2024-12-12T22:31:31.396649Z dataset-save: INFO Saved dataset '5aiXif7bX42inHUWTtjY62' with 100 rows


In [34]:
async for sds in preprocess_workflow.datasets():
    remapped_dataset = await sds.to_local(conn)

## Outputs


In [35]:
remapped_dataset.to_pandas()[["srcip", "remapped_srcip"]][:10]

Unnamed: 0,srcip,remapped_srcip
0,244.3.253.224,244.3.253.89
1,41.177.26.91,41.177.26.254
2,41.177.26.91,41.177.26.254
3,41.177.26.91,41.177.26.254
4,41.177.26.91,41.177.26.254
5,244.3.160.239,244.3.160.125
6,41.177.26.91,41.177.26.254
7,41.177.26.91,41.177.26.254
8,244.3.160.80,244.3.160.168
9,244.3.160.239,244.3.160.125
