# UN PET Lab Competition: Sandbox 🏜️

Welcome! This is a short walkthrough that steps you through the elements of the competition. You can do what you like here and it won't affect your score. Once you have experimented and come up with a good strategy, you can head over to the competition notebooks and copy in any of the cells you think might be helpful.

## Prerequisites: Proxy Installation and Quick Start

### Step 1: Install the OBLV Proxy via Datalore

We've tried to make life easy for you by adding two scripts to this notebook's file system in the `/data/notebook_files/` folder. The first is `./install-oblv.sh`, which downloads and installs the proxy on Linux, specifically Ubuntu 20.04, which this notebook is running on. We will introduce the second script, `./start-oblv.sh`, in the next subsection.

To run the installation script, first open a terminal: in the top panel of this window, click `Tools >> Terminal`. Your screen will be split and you will have a terminal on the right-hand side. You will already be in the `/data/notebook_files/` folder, so you can go ahead and execute the script:

```
./install-oblv.sh
```

That should have installed the `oblv` proxy. You can test this by running `oblv --help` and you should see the following output:

```
Configuration file stored at: "/home/user/.config/oblv/oblv_config.yaml"

oblv 0.4.0
Oblivious Software Ltd. <oblivious.ai>
Oblivious client app for encrypted connection to secure enclave

USAGE:
    oblv <SUBCOMMAND>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    connect      Connect to enclave
    help         Prints this message or the help of the given subcommand(s)
    keygen       Generate public/private rsa key pair
    reconnect    Reconnect to a previously connected enclave

```

### Step 2: Run the Proxy as a Background Task

Now that we've installed the proxy, we can connect to the enclave. To make a connection, you must both authenticate yourself **and** verify the attestation of the enclave. To do this, the `oblv` CLI proxy needs access to your public/private key pair, which you were emailed (or will be, depending on when you are reading this).

These keys should be labelled `oblv_public.der` and `oblv_private.der`, and their contents will be unique to you. If you look at the sidebar on the left and click the paperclip icon, you can upload those keys as files to the `/data/notebook_files/` folder. In principle, it doesn't matter what you call these files, but we've hardcoded `oblv_public.der` and `oblv_private.der` into the script `./start-oblv.sh` for your convenience.
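Before starting the proxy, it can help to confirm that both key files are where the script expects them. The following is a minimal sketch, not part of the competition tooling; the helper name `missing_key_files` is made up for illustration, and the file names and folder are the ones mentioned above.

```python
from pathlib import Path

# File names that the text above says are hardcoded in ./start-oblv.sh.
EXPECTED_KEYS = ("oblv_public.der", "oblv_private.der")


def missing_key_files(folder, names=EXPECTED_KEYS):
    """Return the expected key file names that are absent or empty in `folder`."""
    folder = Path(folder)
    missing = []
    for name in names:
        path = folder / name
        # A zero-byte file usually means the upload did not complete.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


# An empty list means both files were found and are non-empty.
print(missing_key_files("/data/notebook_files/"))
```

This only checks that the files exist; it does not validate that they are well-formed keys.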

**Note**: The `./start-oblv.sh` script is only a template for you to fill in based on the details you received in an email from the organisers. You will have to copy in the PCR codes (there are 3 of these) and the URL hosting the enclave. The script is a single command, and there is no reason it could not be run directly. However, in testing we found some people had issues pasting into the Datalore terminal in the web console, so we put it into a script to make life easier for you.

Once the keys are in place, go ahead and run `./start-oblv.sh` in the terminal. This starts the proxy, which listens on port 3031 and forwards traffic securely end-to-end into the enclave. Encryption is performed on the fly, so you can just send requests to `localhost:3031` and interact with the enclave as if it were running locally.
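If later cells fail with connection errors, a quick first check is whether anything is listening on the proxy port at all. The sketch below uses only the Python standard library; the helper name `is_port_open` is an assumption for illustration. A successful TCP connection only shows that some process is listening on the port, not that the enclave connection has been attested.

```python
import socket


def is_port_open(host="localhost", port=3031, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


print("proxy reachable:", is_port_open())
```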

## The Remote-Execution Client

### Installation with Pip

Now that we have the proxy connection sorted, we can send traffic to `localhost:3031` and it will be directed to the enclave with secure end-to-end encryption and attestation. But we actually want to use the DP libraries locally and then serialize their objects into JSON to be sent as REST API calls. Having to write this yourself would be clunky, so instead we provide a small client library which lets you pass in objects from the DP libraries; the serialization/deserialization and request handling are taken care of for you. Let's go ahead and install that in the next cell:

In [1]:
# to be replaced with normal pip install when ready
!pip install dp-serial

Collecting dp-serial
  Downloading dp_serial-0.1.3-py3-none-any.whl (7.1 kB)
Collecting diffprivlib==0.6.0
  Downloading diffprivlib-0.6.0-py3-none-any.whl (174 kB)

### Running the Client

Once the library has been installed, we can create a `Client` object to send and receive our requests. The client is agnostic of the enclave (it would work just as well if you were sending the serialized models to a normal server), so we simply parameterize it with the REST API endpoint it should speak to. In our case, this is `localhost:3031`:

In [2]:
# Import the client from the dp-serial library
from dp_serial.client.client import Client

# Ensure the URL is as you've set it in the ./start-oblv.sh script
SANDBOX_URL = "localhost:3031"

# Create the client that sends requests to the enclave through the proxy
client_sandbox = Client(SANDBOX_URL)

### The dataset for this Sandbox

For this sandbox, we are hosting enclaves with the UCI Car Evaluation Dataset as an example. It has columns named `col1` through `col6` and a column named `labels`. The first three columns take values in the range 0 to 3 and the remaining three in the range 0 to 2; we've made the label binary. This has nothing to do with the real competition dataset but is intended as a sandbox to learn from.

### Using the client with SmartNoise SQL

This is probably the simplest interaction you can have with the client. Just call the function below with the SQL query you wish to execute and the epsilon and delta budget per step of execution.

There are two steps to executing an SQL query, and you should always do both. A single SQL query may be executed in multiple steps, so you may spend more privacy budget than you intended. The first step is therefore free and simply returns the epsilon and delta that _would be_ spent if you decided to execute the query for real. The second step then executes the query for you.

The table name you are selecting from is always `comp.comp` and the column names will be as they are documented in the competition notebook. We'll just use the names `col1` and `labels` in the sandbox for demonstration purposes.

In [None]:
# Getting the SQL query's epsilon, delta estimate from the API server
estimate = client_sandbox.sql_privacy_estimate("SELECT col1, COUNT(labels) FROM comp.comp GROUP BY col1", 1, 0.0001)

print(estimate)

Now that you know what the [epsilon, delta] cost would be, you can choose to get the result of the query:

In [None]:
# Running the SQL query using SmartNoise SQL
query_result = client_sandbox.sql("SELECT col1, COUNT(labels) FROM comp.comp GROUP BY col1", 1, 0.0001)

print(query_result)

Hurray! We just learned something about the sensitive data via an SQL command (assuming `col1` was public and `labels` was sensitive, for example).

Seems simple enough, right? The art of the competition, if you were to use the DP-SQL interface, is working out what questions to ask and how much budget you should spend on them, but mastery of that is left up to you.
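One way to keep track of spending is a small ledger that runs the free estimate first and only executes a query if it fits within a limit you set yourself. The sketch below is illustrative only: `BudgetLedger`, `run_if_affordable` and the two `fake_*` functions are hypothetical names, the limits are made up, and it assumes the estimate comes back as an (epsilon, delta) pair and that costs simply add up across queries (basic sequential composition). The competition's own accounting may differ.

```python
class BudgetLedger:
    """Track cumulative (epsilon, delta) spend, assuming costs simply add up."""

    def __init__(self, max_epsilon, max_delta):
        self.max_epsilon = max_epsilon
        self.max_delta = max_delta
        self.spent_epsilon = 0.0
        self.spent_delta = 0.0

    def can_afford(self, epsilon, delta):
        return (self.spent_epsilon + epsilon <= self.max_epsilon
                and self.spent_delta + delta <= self.max_delta)

    def record(self, epsilon, delta):
        self.spent_epsilon += epsilon
        self.spent_delta += delta


def run_if_affordable(ledger, estimate_fn, execute_fn, *args):
    """Call estimate_fn first; call execute_fn only if the estimate fits the ledger.

    Assumes estimate_fn returns an (epsilon, delta) pair.
    Returns the result of execute_fn, or None if the query was skipped.
    """
    epsilon, delta = estimate_fn(*args)
    if not ledger.can_afford(epsilon, delta):
        return None
    result = execute_fn(*args)
    ledger.record(epsilon, delta)
    return result


# Stand-ins for client_sandbox.sql_privacy_estimate and client_sandbox.sql so
# that the sketch runs without a live enclave. The doubling is made up.
def fake_estimate(query, epsilon, delta):
    return (2 * epsilon, 2 * delta)


def fake_execute(query, epsilon, delta):
    return "result of: " + query


ledger = BudgetLedger(max_epsilon=3.0, max_delta=1e-3)
query = "SELECT col1, COUNT(labels) FROM comp.comp GROUP BY col1"

# The first call fits within the limits; the second would exceed max_epsilon.
print(run_if_affordable(ledger, fake_estimate, fake_execute, query, 1, 0.0001))
print(run_if_affordable(ledger, fake_estimate, fake_execute, query, 1, 0.0001))
print(ledger.spent_epsilon, ledger.spent_delta)
```

With a live connection, the two stand-ins would be replaced by `client_sandbox.sql_privacy_estimate` and `client_sandbox.sql`, after checking what the estimate actually returns.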

### Using the client with SmartNoise Synth

The next, similarly straightforward, interface is SmartNoise Synth for synthetic data. To use it, you need to specify which model you would like to use and the privacy budget you would like to apply. There are some additional advanced options you can add, but we'll get to those shortly.

In [None]:
#Generating Data with MWEM Synthesizer 
mwem_synthetic_data = client_sandbox.synth("MWEM", 0.1, 0.00001)

print(mwem_synthetic_data)

Of course, you do not need to use `MWEM` as your synthesizer. You can choose from `[MWEM, DPCTGAN, MST, PATECTGAN]`. The second parameter is the epsilon you would like to spend and the third is the delta, as was the case for the SQL queries.

You can also request to use only a subset of the columns, so that you are not spreading your privacy budget over columns which are not important in your eyes (DP synthetic data in very high dimensions would typically require a lot of epsilon to be accurate). To do this, simply pass in the column names you wish to synthesize like this:

In [None]:
# Generating data with the MWEM synthesizer, using only the selected columns
mwem_synthetic_data = client_sandbox.synth("MWEM", 0.1, 0.00001, select_cols=["col1", "labels"])

print(mwem_synthetic_data)

# now use your newly generated synthetic data any way you like!

**⚠️ Warning:** Some synthetic data methods are slow, but this depends on how many columns you select, the synthetic data model used, etc. If for any reason there is a timeout, your score should be unaffected, as all queries are performed with exception handling.

### Using the client with DiffPrivLib

For DiffPrivLib, on the author's advice, we've restricted functionality to executing pipelines, which may hold one or many models to be applied sequentially. These follow the DiffPrivLib docs exactly, but instead of fitting your pipeline on the data yourself, you pass it to the client; it is executed remotely and the trained model is returned to you. Let's see this in action:

In [None]:
from sklearn.pipeline import Pipeline
from diffprivlib import models

#Diffprivlib LR Pipeline 
lr_pipe = Pipeline([
    ('lr', models.LogisticRegression(data_norm=5))
])

# train the model and get the resulting trained model
trained_model = client_sandbox.diffprivlib(lr_pipe)

Equally, we can build a more complicated pipeline involving scaling, PCA and training a logistic regression model:

In [None]:
#Preparing Diffprivlib SPLR Pipeline
splr_pipe = Pipeline([
    ('scaler', models.StandardScaler(bounds=([0, 0, 0, 0, 0, 0], [3, 3, 3, 2, 2, 2]))), # you might not need a scaler here, it's just an example
    ('pca', models.PCA(2, data_norm=5, centered=True)),
    ('lr', models.LogisticRegression(data_norm=5))
])

# train the model and get the resulting trained model
trained_model = client_sandbox.diffprivlib(splr_pipe)

All valid DiffPrivLib pipelines are accepted, so check out the docs directly. Any model submitted that is not from DiffPrivLib will be rejected.

### Using the client with OpenDP

The last framework available is OpenDP. In one respect it is the most flexible, but in another it is the most complicated for someone who is starting out for the first time.

In OpenDP, we create pipelines of transformations (like clipping values, selecting columns, etc.) and measurements (calculations on the transformed dataset). We typically use the right shift operator, `>>`, to chain these transformations and measurements together end-to-end.

The measurement is the part of the pipeline which applies differential privacy. So if there is no measurement in the pipeline, the remote execution will be rejected. As with the other methods, the request will also fail if the epsilon or delta far exceeds the capped epsilon/delta permitted per query.

**🛑 Important:** When using OpenDP you must import transformations and measurements as `import dp_serial.opendp_logger.trans as trans` and `import dp_serial.opendp_logger.meas as meas` respectively. OpenDP does not natively keep track of the abstract syntax tree (AST) of the pipeline, so a plain OpenDP pipeline could not be parsed for remote execution. The `opendp_logger` module within `dp_serial` wraps every method in OpenDP so that it can store and export the AST required for remote execution.
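To make the idea of recording a pipeline's structure concrete, here is a toy sketch in plain Python. It is not how `dp_serial` or OpenDP are implemented; `LoggedStep`, `logged` and the JSON layout are invented for illustration, and the constructors only log their arguments without doing any computation. It shows how overloading `>>` lets each step carry a record of how it was built, so that the whole chain can be exported as JSON.

```python
import json


class LoggedStep:
    """A toy pipeline step that remembers how it was built."""

    def __init__(self, ast):
        # `ast` is a list of {"func": name, "args": {...}} records, in order.
        self.ast = ast

    def __rshift__(self, other):
        # `a >> b` yields a new pipeline whose record is a's steps, then b's.
        return LoggedStep(self.ast + other.ast)

    def to_json(self):
        return json.dumps({"ast": self.ast})


def logged(name):
    """Build a constructor that records its own name and keyword arguments."""
    def constructor(**kwargs):
        return LoggedStep([{"func": name, "args": kwargs}])
    return constructor


# Hypothetical stand-ins named after OpenDP constructors; they only log.
make_select_column = logged("make_select_column")
make_clamp = logged("make_clamp")
make_bounded_sum = logged("make_bounded_sum")

toy_pipeline = (
    make_select_column(key="labels")
    >> make_clamp(bounds=[0, 1])
    >> make_bounded_sum(bounds=[0, 1])
)

# The exported record lists the steps in the order they were chained.
print(toy_pipeline.to_json())
```

A server receiving such a record could rebuild the real pipeline step by step, which is why the logged imports are needed instead of plain OpenDP.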

In [14]:
import dp_serial.opendp_logger.trans as trans
import dp_serial.opendp_logger.meas as meas
import dp_serial.opendp_logger.comb as comb

# Preparing the OpenDP pipeline
pipeline = comb.make_pureDP_to_fixed_approxDP(
    trans.make_split_dataframe(separator=",", col_names=["col1", "col2", "col3", "col4", "col5", "col6", "labels"]) >>
    trans.make_select_column(key="labels", TOA=int) >>
    trans.make_clamp(bounds=(0, 1)) >>
    trans.make_bounded_sum((0, 1)) >>
    meas.make_base_discrete_laplace(scale=1.)
)

eps, delta = pipeline.map(1)

#opendp_result = client_sandbox.opendp(pipeline)
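For intuition about how the noise scale relates to the privacy cost, the textbook relation for Laplace-style noise is epsilon = sensitivity / scale. The sketch below does not call OpenDP; the function names are made up, and it assumes that adding or removing one record changes a sum of values clamped to [0, 1] by at most 1. The values returned by `pipeline.map` remain the authoritative ones.

```python
def bounded_sum_sensitivity(lower, upper, changed_records=1):
    """Largest change in a sum of values clamped to [lower, upper]
    when `changed_records` rows are added or removed."""
    return changed_records * max(abs(lower), abs(upper))


def laplace_epsilon(sensitivity, scale):
    """Epsilon of a Laplace-style mechanism: sensitivity divided by noise scale."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return sensitivity / scale


sensitivity = bounded_sum_sensitivity(0, 1)

# A larger noise scale gives a smaller epsilon (more privacy, less accuracy).
for scale in (0.5, 1.0, 2.0):
    print(scale, laplace_epsilon(sensitivity, scale))
```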

### Can I only use these outputs?

Yes and no. You absolutely do not need to stop at calling the above functions. There are many ways you can be strategic before and after a query. By really understanding `train_x` via visualization, dimensionality reduction, etc., you can hopefully make more informed queries, spending less epsilon and delta. Equally, when you get an output (for example, some synthetic data), you have the opportunity to figure out how to use it to make informed predictions on `test_x`.