# Antigranular: Sample Notebook

This is a sample notebook for the Antigranular project. It demonstrates how to use the antigranular client package to connect to an Antigranular enclave server running on a network of Oblivious Enclaves.

Inside the enclave runs a restricted, Python-like runtime managed as a Jupyter kernel. It uses static and dynamic analysis to enforce static typing, scoping constraints and a limited API for interacting with private or sensitive data sources.

The two core technical components that enforce our privacy constraints for data scientists are:
- Secure Enclaves (A form of input privacy)
- Differential Privacy (A form of output privacy)

### What are secure enclaves?

Secure enclaves are isolated servers with two very powerful properties:

- They have extremely limited IO and need explicit inbound and outbound connections to receive and send data. No one can simply SSH into an enclave and see data as it is being processed, nor can data end up unexpectedly in log files.
- The underlying infrastructure "attests" what is running inside. When we write software to deploy into an enclave, the physical infrastructure hashes the software and environment and places these values into a document which it digitally signs. In short, the cloud infrastructure implicitly guarantees to those connecting to the enclave the exact processing and behaviour of what is running inside it.

This is extremely powerful, as we can use these characteristics to clearly structure rules around which processes can decrypt which data (not which servers or people, but which actual computation is approved!). As you can imagine, all of the major cloud providers (AWS, Azure, GCP, Alibaba Cloud, IBM, Oracle, OVH Cloud... the list goes on) have developed an enclave offering of one form or another over the past few years, and billions' worth of investment has been poured into the domain.
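
To make the attestation idea concrete, here is a minimal sketch of the hash, sign and verify flow. It is not how any particular cloud provider implements attestation: real platforms use asymmetric signatures chained to a vendor root certificate, whereas this sketch stands in an HMAC with a shared key so that it runs with the standard library alone. All names (`measure`, `sign_document`, `verify_document`, `PLATFORM_KEY`) are hypothetical.

```python
import hashlib
import hmac
import json


def measure(image: bytes) -> str:
    """Hash the deployed software image (the 'measurement')."""
    return hashlib.sha256(image).hexdigest()


def sign_document(measurement: str, platform_key: bytes) -> dict:
    """Platform side: place the measurement in a document and sign it.

    Assumption: HMAC stands in for the platform's real digital signature.
    """
    payload = json.dumps({"measurement": measurement}, sort_keys=True)
    signature = hmac.new(platform_key, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def verify_document(document: dict, expected_measurement: str, platform_key: bytes) -> bool:
    """Client side: check the signature, then check the measurement is the approved one."""
    recomputed = hmac.new(
        platform_key, document["payload"].encode(), hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(recomputed, document["signature"]):
        return False
    return json.loads(document["payload"])["measurement"] == expected_measurement


PLATFORM_KEY = b"hypothetical-platform-key"
approved_image = b"enclave software v1"
document = sign_document(measure(approved_image), PLATFORM_KEY)

print(verify_document(document, measure(approved_image), PLATFORM_KEY))        # True
print(verify_document(document, measure(b"tampered software"), PLATFORM_KEY))  # False
```

A client would only release a decryption key, or send data, once the verification step succeeds for a measurement it has approved in advance.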

### What is Differential Privacy? 
  
While this workshop focuses on synthetic data, there are different approaches to creating synthetic data. The only method which gives a theoretical privacy guarantee is called Differentially Private (DP) synthetic data.
  
Differential privacy, coined in [2006 by Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam D. Smith](https://link.springer.com/chapter/10.1007/11681878_14), is a theoretical statement about the guessing probability of data being present in a dataset given stochastic measurements/queries of the dataset (intuitively: any information gained that is derived from the dataset). Some processes may naturally create stochastic measurements; however, in most cases, calibrated noise is intentionally applied to a result of measurement such that the measurement will be differentially private. 
  
This privacy framework ultimately relies on what you want the guessing probability to be, and there is a natural trade-off between the accuracy of a measurement and the guessing probability. Strictly speaking, this guessing probability is parameterised by a coefficient ε. If ε is 0, the measurement discloses no information about the data being in the dataset, so the guessing probability is uniform (50:50). If ε is infinite, then the measurement discloses with absolute certainty whether the data is present or not in the dataset.

$$ \epsilon \geq \ln \left( \frac{Pr[M(x) \in S]}{Pr[M(x') \in S]} \right) $$
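
As an illustration of the bound above, the following sketch applies the Laplace mechanism to a counting query and checks the log ratio of output densities for two neighbouring datasets. It is a generic textbook construction, not a description of what Antigranular runs internally; the function names and the example data are made up for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Counting query with Laplace noise.

    Adding or removing one record changes a count by at most 1 (sensitivity 1),
    so a noise scale of 1 / epsilon is the standard calibration.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def log_density_ratio(output: float, count_x: int, count_x_prime: int, epsilon: float) -> float:
    """ln( p(output | x) / p(output | x') ) for Laplace noise with scale 1 / epsilon."""
    scale = 1.0 / epsilon
    return (abs(output - count_x_prime) - abs(output - count_x)) / scale


rng = random.Random(0)
ages = [23, 35, 41, 52, 67]
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
print("noisy count:", noisy)

# Neighbouring datasets whose true counts are 3 and 2: the log ratio never exceeds epsilon.
worst = max(abs(log_density_ratio(o / 10, 3, 2, epsilon=0.5)) for o in range(-50, 100))
print("largest |log ratio| observed:", worst)
```

Smaller values of ε widen the noise, which lowers accuracy but makes the two neighbouring datasets harder to tell apart.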
  
There is no golden rule for selecting an acceptable ε which you can deem "safe", and you will likely need to decide internally what you believe to be a reasonable risk.
  
Finally, the ε of multiple measurements/queries can interact in complicated ways. However, it is straightforward to upper bound the worst-case total ε of $n$ queries by the sum of the individual ε values: $\epsilon \leq \sum_{i=1}^{n} \epsilon_i$
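
The additive bound can be tracked with a very small accountant. The class below is a hypothetical sketch of that bookkeeping (the name `PrivacyBudget` and its methods are not part of the antigranular package): each query declares the ε it spends, and a query is refused once the running sum would exceed the total.

```python
class PrivacyBudget:
    """Track cumulative epsilon under basic (additive) composition."""

    def __init__(self, total_epsilon: float) -> None:
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon: float) -> None:
        """Record a query costing `epsilon`, or refuse it if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not reject an exact fit.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exceeded")
        self.spent += epsilon

    @property
    def remaining(self) -> float:
        return max(self.total_epsilon - self.spent, 0.0)


budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.5)  # first query
budget.spend(0.3)  # second query
print("spent:", budget.spent, "remaining:", budget.remaining)

try:
    budget.spend(0.3)  # would bring the total to 1.1
except RuntimeError as error:
    print("refused:", error)
```

The sum is a worst-case bound; tighter composition results exist, but the simple sum is always safe to use.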

## Moving Privacy into the Code Block

In order to combine input and output privacy, we provide a framework for remote code execution that feels somewhat native to the data science user experience. The steps are essentially:

- Connect to an enclave backend.
- Use the `%%ag` magic to send blocks of code for remote execution.
- Manipulate and query data in the safety of the enclave and export only public data (post-DP) to the client.

To get started we import the antigranular package:

In [1]:
import antigranular as ag

### Log In with OAuth

We use standard OAuth to authenticate users, in addition to the full attestation-based handshake that authenticates the software running within the enclave. In the example below, you can ignore `temp_username` and `temp_password`; they are only in place for local testing and demos:

In [2]:
ag_client = ag.login("<user_id>", "<user_secret>", "comp_dataset_id", "temp_username", "temp_password")

Cell magic '%%ag' registered successfully, use `%%ag` in a notebook cell to execute your python code on Antigranular private python server
Started new session with session id:  7b3b831d-6158-4f4e-80c6-5438f6b9ss15


Once you've logged in, you will have a session id associated with your interactions. This gets embedded into the ipynb metadata, so when you share your notebook we can associate it with your score from a competition or your metrics from analysing a dataset.

In [3]:
ag_client.session_id

'b8984214-3c4a-43d8-b19f-f81c61c6f56f'

### Executing Code with AG

The `%%ag` magic lets you toggle between private Python and regular Python. Simply add it to the top of any cell and your code will be executed remotely. Any non-private data type (int, float, list, str, etc.) can be exported back to your current Jupyter instance using the `export` function, as seen below:

In [4]:
%%ag 
from ag_utils import export 

export(2, "x")

Setting up exported variable in local environment: x


In [5]:
print("Look what's in x now in your local Jupyter session:", x)

Look what's in x now in your local Jupyter session: 2
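
Conceptually, an export has to serialise a value on the remote side and rebuild it under the given name on the local side. The sketch below shows one way such a round trip could work, using `pickle` and base64; it is an assumption for illustration only and does not reproduce the actual `ag_utils.export` implementation or its message format. The helper names are hypothetical.

```python
import base64
import pickle


def pack_for_export(value, name: str) -> dict:
    """Remote side (hypothetical): serialise the value into a text-safe message."""
    payload = base64.b64encode(pickle.dumps(value)).decode("ascii")
    return {"name": name, "payload": payload}


def unpack_into(message: dict, namespace: dict) -> None:
    """Local side (hypothetical): rebuild the value and bind it to the requested name.

    Unpickling is only safe when the sender is trusted.
    """
    namespace[message["name"]] = pickle.loads(base64.b64decode(message["payload"]))


local_namespace = {}
unpack_into(pack_for_export(2, "x"), local_namespace)
print(local_namespace["x"])  # 2
```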


### Throwing Errors

Errors can be thrown for a variety of reasons. Antigranular restricts many AST nodes, enforces strict mypy, limits the scopes of variable assignments and much more. If you violate any of these restrictions you will be greeted with an error message, which is forwarded to your local runtime:

In [6]:
%%ag 

raise ConnectionError(f"Error calling /session_status: {str('lalala')}")

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File /code/dependencies/ag-private-kernel/kernel/compiler/restricted.py:20, in RestrictedCachingCompiler.__call__(self, source, filename, symbol)
     18     symbol = "exec"
     19 unparsed_source: str = ast.unparse(source)
---> 20 code_obj = self.parser.parse_and_compile(
     21     code=unparsed_source, filename=filename, mode=symbol
     22

We limit the scope intentionally to limit the side effects of a method call:

In [7]:
%%ag 
r = 1
r = r + 1

def goofie() -> None:
    # same is true for any free, nonlocal, global vars
    # either implicitly or explicitly defined...
    global r
    r = 6
    
goofie()

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File /code/dependencies/ag-private-kernel/kernel/compiler/restricted.py:20, in RestrictedCachingCompiler.__call__(self, source, filename, symbol)
     18     symbol = "exec"
     19 unparsed_source: str = ast.unparse(source)
---> 20 code_obj = self.parser.parse_and_compile(
     21     code=unparsed_source, filename=filename, mode=symbol
     22

### Private DataFrames 

Usually the dataset you are looking for will already be present in the session you connect to. However, we'll first show you how these are constructed, and then how we can use most of the common pandas interfaces in a differentially private manner to gain insights into the underlying dataset:

In [8]:
%%ag
import numpy as np
import pandas as pd
from op_pandas import PrivateDataFrame

# create the private dataframe. This is usually done for you.
df = pd.DataFrame(np.random.randint(10, size=(1000, 2)), columns=["Example", "Example2"])
pdf = PrivateDataFrame(df, metadata={"Example": (0, 10), "Example2": (0, 10)}, _id=1234567890)


Output: ____GETATTR____
obj = <module 'pandas' from '/usr/local/lib/python3.11/site-packages/pandas/__init__.py'>
attr = DataFrame

____GETATTR____
obj = <module 'numpy' from '/usr/local/lib/python3.11/site-packages/numpy/__init__.py'>
attr = random

____GETATTR____
obj = <module 'numpy.random' from '/usr/local/lib/python3.11/site-packages/numpy/random/__init__.py'>
attr = randint




Now we can ask statistical questions of the private dataframe by spending some of our privacy budget:

In [9]:
%%ag
export(pdf.sum(1), "sum_")
export(pdf.mean(1), "mean_")
export(pdf.std(1), "std_")
export(pdf.count(1), "count_")
export(pdf.var(1), "var_")
export(pdf.median(1), "median_")
export(pdf.quantile(0.25), "quantile_")

Setting up exported variable in local environment: sum_
Setting up exported variable in local environment: mean_
Setting up exported variable in local environment: std_
Setting up exported variable in local environment: count_
Setting up exported variable in local environment: var_
Setting up exported variable in local environment: median_
Setting up exported variable in local environment: quantile_


Once exported, we can plot them, print them, or do whatever else we want with them:

In [10]:
print("The sum was:", sum_)
print("The mean was:", mean_)
print("The std was:", std_)
print("The count was:", count_)
print("The var was:", var_)
print("The median was:", median_)
print("The quantile was:", quantile_)

The sum was: Example     4460.312360
Example2    4422.819264
dtype: float64
The mean was: Example     4.457707
Example2    4.422029
dtype: float64
The std was: Example     2.856420
Example2    2.853768
dtype: float64
The count was: Example     888
Example2    890
dtype: int64
The var was: Example     8.228802
Example2    8.198098
dtype: float64
The median was: Example     4.681977
Example2    4.938954
dtype: float64
The quantile was: Example     4.307548
Example2    4.076156
dtype: float64


### Scikit-learn with DiffPrivLib

One of the beauties of the Python ecosystem is its rich landscape of frameworks for data science. Fortunately, there are drop-in replacements for a number of these which preserve differential privacy. A convenient one is DiffPrivLib, written by Naoise Holohan at IBM Research.

Let's go ahead and train a random forest with differential privacy, for example, on our dataset and download the resulting model to our local machine.

In [11]:
%%ag
from op_diffprivlib import PrivateRandomForestClassifier

df = pd.DataFrame(np.random.randint(2, size=(1000, 1)), columns=["targets"])
target_pdf = PrivateDataFrame(df, metadata={"targets": (0, 2)}, _id=1234567890)

clf = PrivateRandomForestClassifier(epsilon=0.2, bounds=(0,10))
clf.fit(pdf, target_pdf)

Output: ____GETATTR____
obj = <module 'pandas' from '/usr/local/lib/python3.11/site-packages/pandas/__init__.py'>
attr = DataFrame

____GETATTR____
obj = <module 'numpy' from '/usr/local/lib/python3.11/site-packages/numpy/__init__.py'>
attr = random

____GETATTR____
obj = <module 'numpy.random' from '/usr/local/lib/python3.11/site-packages/numpy/random/__init__.py'>
attr = randint


---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[6], line 1
----> 1 from op_diffprivlib import PrivateRandomForestClassifier
      3 df = pd.DataFrame(np.random.randint(2, size=(1000, 1)), columns=["targets

In [12]:
%%ag
export(clf, "clf")

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[7], line 1
----> 1 export(clf, "clf")
File /code/dependencies/ag-utils/ag_utils/export.py:19, in export(value, name)
     14     raise ValueError(
     15         "Please ensure export function is called in a valid Jupyter environment."
     16     )
     18 # Pickle value to export - this ensures base64 encoding of the value
---> 19 exp_value = pickle.dumps(value)
     21 # Send a custom message to the Jupyter clien

In [13]:
ag.load_store("")

AttributeError: module 'antigranular' has no attribute 'load_store'
