# Carduus Databricks Demo

This notebook demonstrates how to use the `carduus` Python package within the Databricks environment to parallelize tokenization across a Spark cluster.

We use a small 5-row sample dataset of PII; however, `carduus` is capable of distributing the work of tokenizing and transcrypting large datasets across the cluster.

In [None]:
pii = spark.createDataFrame(
    [
        (1, "Jonas", "Salk", "male", "1914-10-28"),
        (1, "jonas", "salk", "M", "1914-10-28"),
        (2, "Elizabeth", "Blackwell", "F", "1821-02-03"),
        (3, "Edward", "Jenner", "m", "1749-05-17"),
        (4, "Alexander", "Fleming", "M", "1881-08-06"),
    ],
    ("label", "first_name", "last_name", "gender", "birth_date")
)
display(pii)

label,first_name,last_name,gender,birth_date
1,Jonas,Salk,male,1914-10-28
1,jonas,salk,M,1914-10-28
2,Elizabeth,Blackwell,F,1821-02-03
3,Edward,Jenner,m,1749-05-17
4,Alexander,Fleming,M,1881-08-06


# One-time Workspace Setup

To get started, see the [Databricks documentation](https://docs.databricks.com/en/security/secrets/index.html) on managing secrets. You will need to install the [Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/index.html) and [authenticate](https://docs.databricks.com/en/dev-tools/cli/authentication.html) as a user that can manage Databricks secrets.

Create a secret scope for `carduus` encryption keys by running the following command with the Databricks CLI.

```
databricks secrets create-scope carduus
```

Add your private key as a secret in the `carduus` scope.

```
(cat << EOF
-----BEGIN PRIVATE KEY-----
...
-----END PRIVATE KEY-----
EOF
) | databricks secrets put-secret carduus PrivateKey
```

Add a secret for the public key of each third-party organization you will send tokenized data to or receive tokenized data from.

```
(cat << EOF
-----BEGIN PUBLIC KEY-----
...
-----END PUBLIC KEY-----
EOF
) | databricks secrets put-secret carduus AcmeCorpPublicKey
```

# Cluster Setup

Add Spark configuration properties to your cluster that read from the secrets created above. Your organization's private key must be stored under the `carduus.token.privateKey` property, and each third-party public key must be stored under a property of the form `carduus.token.publicKey.<recipient>`, where `<recipient>` is the name you want to associate with the public key of a specific recipient that you will send data to.


```
carduus.token.privateKey {{secrets/carduus/PrivateKey}}
carduus.token.publicKey.AcmeCorp {{secrets/carduus/AcmeCorpPublicKey}}
```
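
As a sanity check on the naming convention above, the following is a minimal sketch of a helper that reports which expected properties are not set. The function name `missing_carduus_properties` and the `fake_conf` dictionary are hypothetical and exist only for illustration; they are not part of `carduus`. The helper never prints the property values, only their names.

```python
def missing_carduus_properties(get_conf, recipients):
    """Return the expected Carduus property names that get_conf cannot resolve.

    get_conf is any callable taking (name, default) and returning the value,
    for example dict.get. This helper is hypothetical, not a carduus API.
    """
    expected = ["carduus.token.privateKey"]
    expected += [f"carduus.token.publicKey.{name}" for name in recipients]
    # A property counts as missing when the lookup returns None or an empty value.
    return [name for name in expected if not get_conf(name, None)]


# Stand-in for the cluster configuration; real values come from the secret scope.
fake_conf = {"carduus.token.privateKey": "<redacted>"}

print(missing_carduus_properties(fake_conf.get, ["AcmeCorp"]))
```

In a notebook attached to the cluster, passing `spark.conf.get` in place of `fake_conf.get` should work the same way, assuming the properties are readable from the session configuration.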

# Notebook Setup

Ensure the `carduus` package is installed in your notebook's Python environment. The package can also be installed across the entire cluster.

In [None]:
%pip install carduus

# Tokenization

For more information on tokenization see the Carduus [getting started guide](https://spindle-health.github.io/carduus/guides/getting-started/).

In [None]:
from carduus.token import tokenize, OpprlPii, OpprlToken
tokens = tokenize(
    pii,
    pii_transforms=dict(
        first_name=OpprlPii.first_name,
        last_name=OpprlPii.last_name,
        gender=OpprlPii.gender,
        birth_date=OpprlPii.birth_date,
    ),
    tokens=[OpprlToken.token1, OpprlToken.token2]
)
display(tokens)

label,opprl_token_1,opprl_token_2
1,d2tUj3yRFPIBSwR/ntUi8v1B/A9H+Q0iNwlz0+OVpO54MER9bRnTgHxOO8Q4IM+gdoxKGJGV9STb6DH2hxKIb50v6IFa+StqSnKRy7GJG4U=,DQhKG+AMgrFh16dLiogYy/U9YDLsATVvhMtQ4cHkWmyr8+JmuABYYQV3OJCNLYupaevu9qapeWo80zq+VVumrKwuUXqUCjQzs7ui5qceUVc=
1,d2tUj3yRFPIBSwR/ntUi8v1B/A9H+Q0iNwlz0+OVpO54MER9bRnTgHxOO8Q4IM+gdoxKGJGV9STb6DH2hxKIb50v6IFa+StqSnKRy7GJG4U=,DQhKG+AMgrFh16dLiogYy/U9YDLsATVvhMtQ4cHkWmyr8+JmuABYYQV3OJCNLYupaevu9qapeWo80zq+VVumrKwuUXqUCjQzs7ui5qceUVc=
2,2wnRWhN9Y4DBMeuvwbriLCmz5xVyCNJO1liVdO/bu/nwtTkudQCfmHazpSYftTGVDVfIz8z0gy5eAAap9qKY7D6LmCM4Df6wRttrlh5XBQw=,A7fUKAZi/Ra2T6p8y4nsW0DRuI0P4KGatyuj7XyI4LIU+6lRCYukLDxPepOB52xPLDU2J2B9n8YGO9x0dmK+t4ZEAjJNsBnZrKy3hPvN5PY=
3,I17X+CT3kjqB9l0rAOyiGVYKTzFrSLp4KSHfdUR89rYTyC/d24QcuYZ2VHWVzPawTFNqaIr7oMjgoMCX8gruUSgU42YLoq64KHdRiqTBuQ8=,qGCUVyI7MJLkQ9SSrz0uNIFQXDJBcCNdMVneZ/fLdAfn39ZNZgD+6Vm4l20pM/0zKeLRYAHCpJh/AH7UB21ssqzfbSdOkjrGDL51lYZ9fK8=
4,6Ee3NabHaa/lyKsrpou/DqufDgEOVyq3Hb9RKg7GpkoAY3ZlPn9k1iMj6NhJuo4t35i5aixmOr8hXYnBvql0DVQqJwujrk1rdOnNYwhulvs=,dbwNqy0rN6hFoHJYnWo9G9REKGLLbJKvvnWYS9uR1gj3x8z6oQENN6K0DqU9INgldMikxac7EG5ZR9wicpVSMDBAy2HbqFcBL4ZHoNWWDoM=
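
In the output above, both rows labeled 1 carry identical tokens even though their raw values differ in letter case and gender encoding. The toy sketch below shows the general idea of normalizing PII before deriving a token. It is not the algorithm used by Carduus or OPPRL: the `toy_token` function, the `GENDER_CODES` mapping, and the use of a bare SHA-256 hash are all assumptions made for illustration.

```python
import hashlib

# Hypothetical mapping of raw gender values to a single canonical code.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def toy_token(first_name, last_name, gender, birth_date):
    """Illustrative only: normalize the fields, then hash them together."""
    parts = [
        first_name.strip().lower(),
        last_name.strip().lower(),
        GENDER_CODES.get(gender.strip().lower(), "U"),  # "U" for unrecognized values
        birth_date,
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


a = toy_token("Jonas", "Salk", "male", "1914-10-28")
b = toy_token("jonas", "salk", "M", "1914-10-28")
print(a == b)  # True: both spellings normalize to the same input
```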


# Sender Transcryption

For more information on transcryption see the Carduus [getting started guide](https://spindle-health.github.io/carduus/guides/getting-started/).

In [None]:
from carduus.token import transcrypt_out
tokens_to_send = transcrypt_out(
    tokens, 
    token_columns=("opprl_token_1", "opprl_token_2"), 
    destination="AcmeCorp",
)
display(tokens_to_send)

label,opprl_token_1,opprl_token_2
1,d8JaP02dS0M+COGihZCp+Kr2piXyNU5yvPhFR1/ajN7m+gr63NsQee8v7WMKnEJkY6/BDOjREQ8sf1iD0xWDsvCM8rCqlG8uFlVKQ4FY2atSGuqHUzHcjGf8eNEsbpM2AZnLliFDssvljVgBy9CpYUti20c8L/TgrVMKd1rSgJaDXjXRHS5abWD6EaD77IV3KpELoii7XhE4xqLV3PIH+5TaF6Yaa7FbLaYtG0+wIQ+huC9H4QLHgGam8JE71k/aznve90koz8/hO0vM5/RpLd+dTQ4MMRCPxt/SBVtJp3fuvPtbVdR+rf5KPuCkS3IvUtOGgUlYKVMovNYA2uUDNg==,hyh5cNCkhptFNbbndY/l43GBuTBcDbAdNQWQdznelQVsugwQ9Fip7rC1N0Adj2oBxe+0qL86Mv079BKoDRpHCRIw6aqvHbTJHBV+CuDRmmLUcbFUVTKI3AwGEom2i076wulOX0Wjvs8fXe4vCTX5AZXWwpPCMJUBaMu8zLqgIPtiJhD0UMUsoBwTeFEzxR6sIVBU7E3pFHToYCkT7ISZ55rY+CcsIAI8I5TS0qtYAgDuCWWhp8ZJ2NNRWcWd9qHPMqXWkQ//k75vS0ng018Z4hH0/EVGVVf9iOKGb6ibgTrfmxM0lWDpZqBfjZZ/uExeZsI3EVcQUmzFccWpyB1nYw==
1,Pj3ZIaIj9/i5N8ZnBpe23u6WfeEXR3nka8J+Y+5pawvmgnAnRFzhTekmcx350Viyp4FHsKh4ogO2J6714oSLm4zaELZhcaPAI0slee/7LnaYZNAArXtg4bxqNdz/vifIvcWj8S+TyrEXsxmzIpgS5nKYG9SRbQ0IO5/Ungxx+GouhGeGC54nlL7dt1dHlgUmMkQx+UH89t+D/C55F/lDmRkhddjiSO1LFIZYyo3xT0F/j5o7b7cI0OoZpvN6OYdrE9IfsfqYFh/RoHhMjfpajYDSUcy6Cq4evcjhEy5XvGqNJpQTToC/A1rmTB5WtFO1oCyVF0HGb+FgHwO9ccVIjw==,N3M6t4ZW5/DnzfhYIcUJdCHAorl7lWvrBnZGtInrno3z62V6n1odgQmY6fgXHy5sTTHdujeIn/dRi6PD9rBANjTpi42puSGHf7VLIx9uqr+kSiXGTCBnAHy8C3DNfpxtx4I391/8aM2mbnS/5icuyIKgzZP+fWsFxehfMSLDgW3YMQsgZy0JBSpetk+8Jl1L06pQYHt967zcAvi0JRiUmZHshOGAdkVO1hQGgUDOsHuBrElqkooFfuIw0QC4v1aeqQDtVH3Un/9cJMonmsRtnuRYUZE7aW1Zl/RGYKq1LhegIzL/+JlYuqOHZoSPyB/0O6t7u/USRkkZCUKM+V7ryg==
2,bLkqSZYRsRZ8/7EQsxE/DU2Rf5n2qxjNiJgNUuJHfTcMLVC3C0tPlS+AztVVwWBj2G+GaA8QoeM7jqfxzou/xK13ggOOIiXWAapqOMT96Z5C3y+GG8och8g9w52Mr/3zKINIe5VLaxoMetH67J5d85gxFtixKTJiQq6GPUiJdm4ice/XACuG1bYtUKB3hfoRE3ZUsgOxQDlocSOcvrLvYK2SRgoRTqP3Upk+Cuemc0n8a0fLEcm3nXBrte2LKL2othx0Zost+ev8SL1LG22LJwjAKhRXDf6s7/DooEsoWaQM4mGImpZAp7CNxuLzL5BAdE2njkq8VTeLM6zkFSxw8Q==,N9soErrrifQW4t0ZzwlErWKlbXaSggSve+MT+L0PzZpS+2m/qg+sKSJHDTlM4WWyly/dy0fHgLjANWl4vLAgWl6BwZY51pDVPldBfOD2NfjRmWKhk5vvh0j2BPOEkRn7k0UisUoCqSWglMZMxo6gA5vhUS1s9Gg2e8hg410eCengabReUSEpZpcbyB4yPdPXniwScXLBQX6AjIHEK2lRxtjLI5qWaSwOKeOP+b12Hrxgu3HB9eZ2Yr2Rt1O9+R8z/A5Xvz4NQ2oTxxLAwJ2LBbWbY6CvS/8n5460oSBe28xAG2rl2I96RbzRzoTXni7GNDEjkh27298t5v/HaOD/WA==
3,lyaQetBbfzUuDoK+u6MLs18H6pH4iafWZNrwc8CKjmilA0mTuUHt9+joXXexkGhkKV37WGnpz4+0lzNOAVQ0c6/LMqRi2FImWh4a1xlk7BM4MDKUY9SNfa562yvxvm3Sast3NHfrYbs/9Nwrd5eXsHwI60DmfRkxbc+P3mUMoK7trvnOG9vHPedKqDh3uXGVf0oAK+d6WLOxN5r5PCldEaqFt/ErDsHUgcvy3F0b6W9+Xx/a9s2x2/ZvcLkfY5oF7+td7a8ghQ9xWzeC8hNt35FcbDe0v1vPkV32UiiO3atKIz1rr2Gag29xaAEonXucwtaIaCGGywFzbMV/CvLC/w==,iFz4K/xh2sKLl8dobmZeB1xwtgXySSGmtn8qMPtPk3bJ5rxFKwgM/HxI77td3FFVsPS5sAN1WSDOuVBMwSWkgzX5qGB2qt6FgOdFj4Eh3ihSb9YqKOzTDGD4AzUwTbhYpdF406kji1ERurefGompR+qxgtPWDx9hqBbqwtC+qyp35Wiqp0tMgCn1ZPUC8Bgqb6SC9BakYS/XF8MAnA4mI7ARFvw74VRXdmHmaPox/lFSkUai5mfO2cOZaU6cqlFsLy6AzYTGhJsSQDIooFmyXjnmKq+sNxsliR/p4sCZUCF5nIXQo8uHWFcl+PI0eDidQOop/MVrB/P38E+3dttu4g==
4,CQdw+tk0Z2/bUzRVsTHv4i0DRPNE7ji/WqaiX3BDNlYKGRim96C6IxuBwElbz0gnvzrc19pc68SBUHF6+aNobR2uLEpQpJiOPGtrfFTvr101kQL/WqKsczgNkU9UVNEcNOiB9yBBNbH+3oevbbQ5uOkblUgf6gNbcN1n3mfRr6b0npaET/EiBwjJMAfHCGZ8KBacqfUn9M8Hv4Ie/tCs5B80fDcmm+GLVKZMCVeUa4olTtl8gOSqGlNNhUbX0K0xlFNzrxQ2G6NiQOy8M0Z4my+adW9o37EbcT03ypzHIKbYvyuiEE73lkAZKR0Rf1uhh51Y0pvTi2ELMXgjDcBn9A==,It9sQjB4X15z/1xgmZLg1UrZfmeM7mT++Me5AYQM9weuH8WXs9lQ+AdBeuH2yRly3ETlCz08vEqm/eHl0/z24eLQeHJz0t/S/8Qz2CnO/KX9AKlLr/rj5XVXCFWKkoh/qDPAd28d0OfL5WcbIkVB7uF9UbjZp2SdWEDaPRzFMiBE9OJ2lu/5bZSMjAqRylwwLKDc5jSJNtHzquMvoG+X7b08fzzPe2Nc7ViP9UvqADFGu8vZFhkQ9UiBdglHS05VV0PIcUmpn40pr+evnphX6fOlWgMh/F3T85EVtAtU1AWnH3vqnyqhhMIArHM5W2dflKp657d9x/DytVfa4iiy+Q==
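
In the output above, the two rows labeled 1 no longer share the same values, whereas their tokens were identical before transcryption. This suggests the sender-side encryption is randomized, so these values are presumably not suitable for matching until the recipient has processed them. The sketch below is a small, hypothetical helper (not part of `carduus`) that flags labels whose rows carry differing token values; the row contents are placeholders, not real tokens.

```python
from collections import defaultdict


def labels_with_differing_tokens(rows, label_index=0, token_index=1):
    """Return the labels whose rows do not all share the same token value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[label_index]].add(row[token_index])
    return sorted(label for label, values in seen.items() if len(values) > 1)


# Toy rows shaped like (label, token); the values are placeholders.
before = [(1, "aaa"), (1, "aaa"), (2, "bbb")]
after = [(1, "xxx"), (1, "yyy"), (2, "zzz")]

print(labels_with_differing_tokens(before))  # []
print(labels_with_differing_tokens(after))   # [1]
```

For a small sample like this one, the rows returned by `tokens_to_send.collect()` can be passed to the helper directly, since Spark rows support positional indexing.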


# Recipient Transcryption

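Instead of reading keys from the Spark configuration, you can load secrets via `dbutils` and pass a custom key provider to Carduus. The cells below use this approach to decrypt the received tokens with the recipient's private key.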

In [None]:
from carduus.keys import SimpleKeyProvider

keys = SimpleKeyProvider(
    private_key=dbutils.secrets.getBytes("carduus", "PrivateKey"),
    public_keys={},
)

In [None]:
from carduus.token import transcrypt_in

tokens2 = transcrypt_in(
    tokens_to_send, 
    token_columns=("opprl_token_1", "opprl_token_2"),
    key_service=keys,
)
display(tokens2)

label,opprl_token_1,opprl_token_2
1,O47siK/9rItAv6lwaKgbH9MVuvZe0IwaBzEkQBRgHx9MWi02k8Gn/VvxoTsbA+4NB5DFMcAZeCS44azIKPV6M3tNySHQwSSDCQCPo+vGYPc=,uoPojmvjl3Mk734UlQb/bDHms7vHiptbYKbWCV1VWiE1svY/nAfp+Lm14b6Y/xJViuvqOrNIzpfItlOgU0UN//VCxg54FDLaZIZGfRhW1co=
1,O47siK/9rItAv6lwaKgbH9MVuvZe0IwaBzEkQBRgHx9MWi02k8Gn/VvxoTsbA+4NB5DFMcAZeCS44azIKPV6M3tNySHQwSSDCQCPo+vGYPc=,uoPojmvjl3Mk734UlQb/bDHms7vHiptbYKbWCV1VWiE1svY/nAfp+Lm14b6Y/xJViuvqOrNIzpfItlOgU0UN//VCxg54FDLaZIZGfRhW1co=
2,LfRQxBPEW0tEskwxtnooILPEi6/AtbfKCsWsPKMAS455xUeGfzxXdoTIdVnvKrpmT81aQA85dk7QqF+2K7Vjcg291ov2Skz+eMjF7n8DdZM=,P9PLhpSz+kWSSVmYQ8yCwAbQpohy/Ua8R1sSPytPpYb+CLN1KigfKedlIHZUWhf1tTq5gOjPeZp80rHtjJpsPbXAG0rWe9pjhP/Vjhdp140=
3,QWvpT0wMEezlurVFMa+yJIKJdAJQXuie8/VuqqKsSQDaGMDkV1s0PX9RTumY75oQdAwPGTFZvCoOGWn0jIIrdY93T7s0qK3O0raXgZCKtHg=,W9BxmxZf1/yqIDx3WCwqq0aPLznw7GATRx7ETrxLYfNBiuozxUBTs5Z3Ig9sErG7owJm16suzKHJ4ICYRtteoni8Ystu2ou7ODgnny/BxOs=
4,aS/BfI5qw8UnhOWUXyUNQMAaZRKGCC3/l/htFpHYhVecUv1EtRE0ht893KPkZbVpNsw7nORH9TlRbl8vU1WaNz2Kf04sw2iqcMVHx+VjD3s=,nWfWyxQcrSfeiMEHr98Z+mq7u3DMT1xdSJ0nrw8nTeKR0s/Gh78f+pDJNooeJUv4vfuuJFFqWEI2dkC5Q99ZBkdwiYnp/cj5Htif/Gk9lHY=
