# Antigranular: Sample Notebook

This is a sample notebook for the Antigranular project. It demonstrates how to use the antigranular client package to connect to an Antigranular enclave server running on a network of Oblivious Enclaves. 

Running inside the enclave is a restricted Python-like runtime managed as a Jupyter kernel. It uses static and dynamic analysis to enforce static typing, scoping constraints, and a limited API for interacting with private or sensitive data sources. 

The two core technical components that enforce our privacy constraints for data scientists are:
- Secure Enclaves (A form of input privacy)
- Differential Privacy (A form of output privacy)

### What are secure enclaves?

Secure enclaves are isolated servers with two very powerful properties:

- They have extremely limited IO and need explicit inbound and outbound connections to receive and send data. No one can simply SSH into an enclave and see data as it is being processed, nor can data end up unexpectedly in log files.
- The underlying infrastructure "attests" what is running inside. When we write software to deploy into an enclave, the physical infrastructure hashes the software and environment and places these values into a document which it digitally signs. In short, the cloud infrastructure implicitly guarantees to those connecting to the enclave the exact processing and behaviour of what is running inside it.

This is extremely powerful, as we can use these characteristics to clearly structure rules around which processes can decrypt which data (not which servers or people, but which actual computation is approved). As you can imagine, all of the major cloud providers (AWS, Azure, GCP, Alibaba Cloud, IBM, Oracle, OVH Cloud, and the list goes on) have developed an enclave offering of one form or another over the past few years, and billions worth of investment have been poured into the domain.
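To make the attestation idea concrete, here is a minimal sketch using only the Python standard library. It is a toy model and not how any cloud provider implements attestation: real enclaves use hardware-rooted asymmetric signatures, whereas this sketch substitutes an HMAC with a shared key. The names `INFRA_KEY`, `measure`, `sign_attestation`, and `release_key_if_trusted` are hypothetical.

```python
import hashlib
import hmac

# Hypothetical stand-in for the hardware signing key (real enclaves use
# asymmetric keys rooted in the cloud provider's hardware).
INFRA_KEY = b"demo-infrastructure-key"


def measure(software: bytes) -> str:
    """Hash the software image, as the infrastructure would."""
    return hashlib.sha256(software).hexdigest()


def sign_attestation(measurement: str) -> str:
    """Sign the measurement so a client can detect tampering."""
    return hmac.new(INFRA_KEY, measurement.encode(), hashlib.sha256).hexdigest()


def release_key_if_trusted(measurement: str, signature: str, approved: set) -> bool:
    """Release a data key only to an approved, correctly signed computation."""
    expected = sign_attestation(measurement)
    return hmac.compare_digest(expected, signature) and measurement in approved


approved_code = b"def mean(xs): return sum(xs) / len(xs)"
tampered_code = b"def mean(xs): print(xs); return sum(xs) / len(xs)"
approved = {measure(approved_code)}

m_ok = measure(approved_code)
m_bad = measure(tampered_code)
print(release_key_if_trusted(m_ok, sign_attestation(m_ok), approved))    # True
print(release_key_if_trusted(m_bad, sign_attestation(m_bad), approved))  # False
```

The decision depends on the hash of the computation rather than on which server or person is asking, which is the property the rules above rely on.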

### What is Differential Privacy? 
  
While this workshop focuses on synthetic data, there are different approaches to creating synthetic data. The only method which gives a theoretical privacy guarantee is called Differentially Private (DP) synthetic data.
  
Differential privacy, coined in [2006 by Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam D. Smith](https://link.springer.com/chapter/10.1007/11681878_14), is a theoretical statement about the guessing probability of data being present in a dataset given stochastic measurements/queries of the dataset (intuitively: any information gained that is derived from the dataset). Some processes may naturally create stochastic measurements; however, in most cases, calibrated noise is intentionally applied to a result of measurement such that the measurement will be differentially private. 
  
This privacy framework ultimately relies on what you want the guessing probability to be, and there is a natural trade-off between the accuracy of a measurement and the guessing probability. Strictly speaking, this guessing probability is parameterised by a coefficient ε. If ε is 0, the measurement discloses no information about the data being in the dataset, so the guessing probability is uniform (50:50). If ε is infinite, then the measurement discloses with absolute certainty whether the data is present or not in the dataset.
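As a rough illustration of how ε maps to a guessing probability, the sketch below assumes an adversary who starts from a 50:50 prior about whether a record is in the dataset. Under that assumption, the standard bound on their chance of guessing correctly after seeing an ε-DP output is $e^{\epsilon} / (1 + e^{\epsilon})$. The function name is hypothetical.

```python
import math


def max_guessing_probability(epsilon: float) -> float:
    """Upper bound on P(correct guess) for a 50:50 prior under epsilon-DP."""
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    if math.isinf(epsilon):
        return 1.0
    # e^eps / (1 + e^eps), written in a numerically stable form.
    return 1.0 / (1.0 + math.exp(-epsilon))


for eps in (0.0, 0.1, 1.0, 5.0, math.inf):
    bound = max_guessing_probability(eps)
    print(f"epsilon={eps}: guess succeeds with probability <= {bound:.3f}")
```

At ε = 0 the bound is 0.5 (no better than a coin flip), and it approaches 1 as ε grows, matching the two extremes described above.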

$$ \epsilon \geq \ln \left( \frac{Pr[M(x) \in S]}{Pr[M(x') \in S]} \right) $$
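One common way to satisfy this inequality is the Laplace mechanism. The sketch below is an illustration under stated assumptions, not Antigranular's implementation: it releases a count (sensitivity 1) with Laplace noise of scale 1/ε, then checks numerically that the output densities for two neighbouring datasets `x` and `x_prime` differ by at most a factor of $e^{\epsilon}$. All function names are hypothetical.

```python
import math
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(data: list, epsilon: float, rng: random.Random) -> float:
    """Counting query (sensitivity 1) released with noise of scale 1/epsilon."""
    return len(data) + laplace_noise(1.0 / epsilon, rng)


def laplace_pdf(x: float, mean: float, scale: float) -> float:
    """Density of a Laplace distribution centred on `mean`."""
    return math.exp(-abs(x - mean) / scale) / (2.0 * scale)


epsilon = 0.5
rng = random.Random(0)
x = list(range(100))        # dataset
x_prime = list(range(99))   # neighbouring dataset: one record removed
print("noisy count:", private_count(x, epsilon, rng))

# For any output o, the densities under x and x' differ by at most e^epsilon.
worst_ratio = max(
    laplace_pdf(o, len(x), 1 / epsilon) / laplace_pdf(o, len(x_prime), 1 / epsilon)
    for o in (90.0, 99.0, 99.5, 100.0, 110.0)
)
print(worst_ratio <= math.exp(epsilon) + 1e-12)  # True
```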
  
There is no golden rule in selecting an acceptable ε which you can deem "safe", and you will likely need to decide internally what you believe to be a reasonable risk.
  
Finally, the ε of multiple measurements/queries can interact in complicated ways. However, it is straightforward to upper bound the worst-case ε by the sum of the individual ε values: $\epsilon \leq \sum_{i=1}^{n} \epsilon_i$
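The additive bound suggests a simple way to think about a privacy budget: every query spends some ε, and the running total is the worst-case ε so far. The class below is a hypothetical sketch of that bookkeeping, not the accounting used by Antigranular.

```python
class PrivacyBudget:
    """Tracks the worst-case epsilon under basic sequential composition."""

    def __init__(self, total: float) -> None:
        self.total = total
        self.spent = 0.0

    def spend(self, epsilon: float) -> None:
        """Record a query, refusing it if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exceeded")
        self.spent += epsilon


budget = PrivacyBudget(total=2.0)
for query_eps in (1.0, 0.5, 0.25):
    budget.spend(query_eps)
print(budget.spent)  # 1.75
```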

## Moving Privacy into the Code Block

To combine input and output privacy, we provide a framework for remote code execution that feels somewhat native to the data science user experience. The steps are essentially:

- Connect to an enclave backend.
- Use the `%%ag` magic to send blocks of code for remote execution.
- Manipulate and query data in the safety of the enclave and export only public data (post-DP) to the client.

To get started we import the antigranular package:

In [11]:
import antigranular as ag

### Log In with OAuth

We use standard OAuth to authenticate users, alongside the full attestation-based handshake that authenticates the software running within the enclave. In the example below, you can ignore the temp_user/password; it is only in place for local testing and demos:

In [12]:
import antigranular as ag
session = ag.login(<client_id>,<client_secret>, dataset = "Iris Dataset")

Connected to Antigranular server session id: 0c1838d0-45b2-4cab-841d-8c10d79f8e15
Cell magic '%%ag' registered successfully, use `%%ag` in a notebook cell to execute your python code on Antigranular private python server


In [13]:
%%ag
x = 6 

def func():
    global x
    x = 7 

UsageError: Restricted, Cannot use ( AST Node : <class 'ast.Global'> ) in private python



In [14]:
from antigranular import config
print(config.config.AG_OAUTH_URL)

https://auth-dev.antigranular.com/oauth/token


Once you've logged in, you will have a session id associated with your interactions. This gets embedded into the ipynb metadata, so when you share your notebook we can associate it with your score from a competition, or your metrics from analysing a dataset.

In [15]:
%%ag
from op_diffprivlib.models import LinearRegression
from typing import Any
def fun()->Any:
    return LinearRegression()

### Executing Code with AG

We use the `%%ag` magic to let the user toggle between private Python and regular Python. Simply add it to the top of any cell and your code will be executed remotely. Any non-private data type (int, float, list, str, etc.) can be exported back to your current Jupyter instance using the `export` method, as seen below:

In [16]:
%%ag 
from ag_utils import export 
8678
export(2, "xx")

In [None]:
print(xx)

2


In [None]:
%%ag

def gh():
    y = 9
    x = y/0
gh()

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[4], line 1
----> 1 def gh():
      2     y = 9
      3     x = y/0
Cell In[4], line 3, in gh()
      1 def gh():
      2     y = 9
----> 3     x = y/0
ZeroDivisionError: division by zero


In [None]:
%%ag

import os 

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[5], line 1
----> 1 import os 
File /code/dependencies/ag_engine/ag_engine/restricted_python/custom_imports.py:14, in CustomImports.get_imports(self, name, _globals, _locals, fromlist, level)
     11 target_level = self.tree.find_node(name)
     13 if target_level is None:
---> 14     raise ImportError(f"Restricted import: {name} is not allowed")
     16 # Filter access to t

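The traceback above shows an import hook raising `ImportError` for a module that is not permitted. A minimal sketch of that general technique, assuming a flat allow-list (the real engine appears to look names up in a tree, whose details are not shown here), is to replace `__import__` in the builtins handed to `exec`. The names `ALLOWED_MODULES`, `restricted_import`, and `run_restricted` are hypothetical, and this sketch is not a secure sandbox on its own.

```python
import builtins

ALLOWED_MODULES = {"math", "typing"}  # hypothetical allow-list for this sketch


def restricted_import(name, _globals=None, _locals=None, fromlist=(), level=0):
    """Delegate to the real import only for allow-listed top-level modules."""
    if name.split(".")[0] not in ALLOWED_MODULES:
        raise ImportError(f"Restricted import: {name} is not allowed")
    return builtins.__import__(name, _globals, _locals, fromlist, level)


def run_restricted(source: str) -> dict:
    """Execute source with the import hook swapped into its builtins."""
    sandbox_builtins = dict(vars(builtins), __import__=restricted_import)
    namespace = {"__builtins__": sandbox_builtins}
    exec(source, namespace)
    return namespace


print(run_restricted("import math\nroot = math.sqrt(16)")["root"])  # 4.0
try:
    run_restricted("import os")
except ImportError as err:
    print(err)  # Restricted import: os is not allowed
```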
In [None]:
print("Look what's in x now in your local Jupyter session:", x)

NameError: name 'x' is not defined

### Throwing Errors

Errors can be thrown for a variety of reasons. Antigranular restricts many AST nodes, enforces strict mypy, limits the scopes of variable assignments, and much more. If your code violates any of these restrictions you will be greeted with an error message, which is forwarded to your local runtime:

In [None]:
%%ag 

raise ConnectionError(f"Error calling /session_status: {str('lalala')}")

We limit the scope intentionally to limit the side effects of a method call:

In [None]:
%%ag 
r = 1
r = r + 1

def goofie() -> None:
    # same is true for any free, nonlocal, global vars
    # either implicitly or explicitly defined...
    global r
    r = 6
    
goofie()
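The error shown earlier for `global x` names `ast.Global`, which suggests the check operates on the syntax tree before anything runs. The sketch below shows how such a static check could look with Python's `ast` module. The set of forbidden nodes and the function name `check_source` are assumptions for illustration; the actual restricted runtime enforces many more rules.

```python
import ast

FORBIDDEN_NODES = (ast.Global, ast.Nonlocal)  # assumed subset, for illustration


def check_source(source: str) -> None:
    """Raise ValueError if the source contains a forbidden AST node."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, FORBIDDEN_NODES):
            raise ValueError(
                f"Restricted, cannot use {type(node).__name__} (line {node.lineno})"
            )


cell = """
r = 1

def goofie() -> None:
    global r
    r = 6
"""

try:
    check_source(cell)
except ValueError as err:
    print(err)  # Restricted, cannot use Global (line 5)
```

Because the check runs on the parsed tree, the offending cell is rejected before any of its statements execute.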

In [None]:
%%ag

x = 6

export(2, "x")

### Private DataFrames 

Usually the dataset you are looking for will already be present in the session you connect to. However, we'll first show you how these are constructed, and then how we can use most of the common pandas interfaces in a differentially private manner to gain insights into the underlying dataset:

In [None]:
%%ag
import numpy as np
import pandas as pd
from op_pandas import PrivateDataFrame

# Create the private dataframe. This is usually done for you.
df = pd.DataFrame(np.random.randint(10, size=(1000, 2)), columns=["Example", "Example2"])
pdf = PrivateDataFrame(df, metadata={"Example": (0, 10), "Example2": (0, 10)}, _id=1234567890)


In [None]:
%%ag

from ag_utils import load_dataset

data = load_dataset("Iris Dataset")

export(type(data), "data")
export(type(data['train_x']), "datax")

Now we can ask statistical questions of the private dataframe by spending some of our privacy budget:

In [None]:
%%ag
export(pdf.sum(1), "sum_")
export(pdf.mean(1), "mean_")
export(pdf.std(1), "std_")
export(pdf.count(1), "count_")
export(pdf.var(1), "var_")
export(pdf.median(1), "median_")
export(pdf.quantile(0.25), "quantile_")

Once exported we can plot them, print them, whatever you want:

In [None]:
print("The sum was:", sum_)
print("The mean was:", mean_)
print("The std was:", std_)
print("The count was:", count_)
print("The var was:", var_)
print("The median was:", median_)
print("The quantile was:", quantile_)
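One common reason DP libraries ask for per-column bounds, such as the `metadata={"Example": (0, 10)}` passed to `PrivateDataFrame` earlier, is that the bounds determine how much noise a query needs. The sketch below illustrates this for a sum using the Laplace mechanism. It is an assumption-laden illustration in plain Python, not the mechanism `op_pandas` uses internally, and `private_sum` is a hypothetical helper.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_sum(values, bounds, epsilon, rng):
    """Clamp each value to the declared bounds, then add calibrated noise."""
    lo, hi = bounds
    clamped = [min(max(v, lo), hi) for v in values]
    # Adding or removing one record changes the sum by at most max(|lo|, |hi|).
    sensitivity = max(abs(lo), abs(hi))
    return sum(clamped) + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)
column = [rng.randrange(10) for _ in range(1000)]  # integers in 0..9
exact = sum(column)
noisy = private_sum(column, bounds=(0, 10), epsilon=1.0, rng=rng)
print("exact sum:", exact)
print("private sum:", round(noisy, 1))
```

Wider bounds mean a larger sensitivity and therefore more noise for the same ε, so tight but honest bounds give more accurate answers.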

### SciKit Learn with DiffPrivLib 

One of the beauties of the Python ecosystem is the rich landscape of frameworks for data science. Fortunately, there are drop-in replacements for a number of these which preserve differential privacy. A convenient one is DiffPrivLib, written by Naoise Holohan at IBM Research. 

Let's go ahead and train a random forest with differential privacy, for example, on our dataset and download the resulting model to our local machine.

In [None]:
%%ag
from op_pandas import PrivateDataFrame , PrivateSeries

In [None]:
import oblv_client; print(oblv_client.__version__)

In [None]:
%%ag
import pandas as pd
s = pd.Series([1,5,8,2,9])
priv_s = PrivateSeries(series=s,metadata=(0,10))

In [None]:
%%ag
import pandas as pd
data = {
    'Age':[20,30,40,25],
    'Salary':[35000,60000,100000,55000],
    'Sex':['M','F','M','F']
}
metadata = {
    'Age':(18,65),
    'Salary':(20000,200000)
}
df = pd.DataFrame(data)
priv_df = PrivateDataFrame(df=df , metadata=metadata)

In [None]:
%%ag
from op_pandas import PrivateDataFrame , PrivateSeries
import numpy as np
import pandas as pd
random_data = {
    'Age': np.random.randint(18, 60, size=100000),
    'Salary': np.random.randint(40000, 200000, size=100000),
    'DOB': pd.date_range('1970-01-01', periods=100000, freq='D').strftime('%Y-%m-%d')
}
df = pd.DataFrame(random_data)
priv_df = PrivateDataFrame(df,metadata={'Age':(18,60) , 'Salary':(40000,200000)})

In [None]:
%%ag
from ag_utils import export
priv_describe = priv_df.describe(eps=1)
# Export information from remote ag kernel to local jupyter server.
export(priv_describe , 'df_info')

In [None]:
%%ag
from ag_utils import export
priv_s = priv_df["Age"]
export(priv_s.describe(eps=1) , "Age_series")

In [None]:
print(Age_series)


In [None]:
%%ag

from ag_utils import export
# maps string to its length if not numerical else divides it by 2.
def func(x: str | int | float) -> float:
    if isinstance(x, str):
        return len(x)
    elif isinstance(x, (int, float)):
        return x / 2
    return 0.0


result = priv_df.applymap(func,eps=1)
export(result.describe(eps=1),'private_result')
result = df.applymap(func)
export(result.describe(),'original_result')

In [None]:
print(private_result)


In [None]:
%%ag
from ag_utils import export
age_series = priv_df['Age']
def series_map(x:float)->float:
    return x/2

result = age_series.map(series_map,eps=1,na_action='ignore')
export( result.describe(eps=1), 'private_result')
export( df['Age'].map(series_map,na_action='ignore') , 'original_result')

In [None]:
%%ag
from ag_utils import export
export(priv_df[['Age','Salary']].describe(eps=1) , 'original')
export((-priv_df[['Age','Salary']]).describe(eps=1) , 'negative')

In [None]:
negative.columns = ["Age_neg","Salary_neg"]
print(original.join(negative , how="left"))

In [None]:
%%ag
from ag_utils import export
pdf = priv_df[['Age','Salary']]
result1 = pdf + (10*pdf)  # expected min-max => Age:(198,660) ,  Salary:(440000,2200000)
result2 = result1/1000 # expected min-max => Age:(0.198,0.66) ,  Salary:(440,2200)
export(result1.describe(eps=1) , 'result1')
export(result2.describe(eps=1) , 'result2')

In [None]:
result2.columns = ["Age2","Salary2"]
result1.columns = ["Age1","Salary1"]
print(result1.join(result2 , how="left"))
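The expected min-max values in the arithmetic cell follow from interval arithmetic on the column bounds. The sketch below works this through with a hypothetical `Bounds` class for bounds-to-bounds addition and multiplication or division by a scalar. With `Age` in (18, 60) and `Salary` in (40000, 200000), `pdf + 10*pdf` scales each bound by 11, giving (198, 660) and (440000, 2200000). How `op_pandas` tracks bounds internally is not shown in this notebook, so treat this as an illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    lo: float
    hi: float

    def __add__(self, other: "Bounds") -> "Bounds":
        return Bounds(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, k: float) -> "Bounds":
        a, b = self.lo * k, self.hi * k
        return Bounds(min(a, b), max(a, b))  # a negative k swaps the ends

    __rmul__ = __mul__

    def __truediv__(self, k: float) -> "Bounds":
        a, b = self.lo / k, self.hi / k
        return Bounds(min(a, b), max(a, b))


age = Bounds(18, 60)
salary = Bounds(40000, 200000)

age1, salary1 = age + 10 * age, salary + 10 * salary
print(age1, salary1)
print(age1 / 1000, salary1 / 1000)
```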

In [None]:
%%ag
import pandas as pd
import numpy as np
from ag_utils import export
# series1 and series2 should have mean roughly equal to 0.5
series1 = PrivateSeries(pd.Series(np.random.randint(0,2,size=100000)),metadata=(0,1))
series2 = PrivateSeries(pd.Series(np.random.randint(0,2,size=100000)),metadata=(0,1))
or_result = series1 | series2 # should have mean around 0.75
export(or_result.describe(eps=0.1),'or_result')
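The expected mean of about 0.75 comes from the probability that at least one of two independent fair bits is 1: $1 - 0.5 \times 0.5 = 0.75$. The following non-private simulation in plain Python (standard library only, fixed seed) checks that arithmetic locally; it does not use the private types.

```python
import random

rng = random.Random(0)
n = 100_000
a = [rng.randint(0, 1) for _ in range(n)]  # fair bits, like np.random.randint(0, 2)
b = [rng.randint(0, 1) for _ in range(n)]

or_mean = sum(x | y for x, y in zip(a, b)) / n
expected = 1 - 0.5 * 0.5

print(expected)                        # 0.75
print(abs(or_mean - expected) < 0.01)  # True
```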

In [None]:
%%ag
import pandas as pd
import numpy as np
from ag_utils import export
# dataframes & series
df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 4)), columns=['A','B','C','D'],
                      index=np.random.randint(-10, 10, size=1000))
df2 = pd.DataFrame(np.random.randint(-100, 100, size=(1000, 2)), columns=['E','F'],
                   index=np.random.randint(-10, 10, size=1000))
s = pd.Series(np.random.randint(0, 100, size=1000), name='SER')

# corresponding Private dataframes & private series
pdf = PrivateDataFrame(df, metadata={'A': (0, 100), 'B': (0, 100), 'C': (0, 100), 'D': (0, 100)})
pdf2 = PrivateDataFrame(df2, metadata={'E': (-100, 100), 'F': (-100, 100)})
ps = PrivateSeries(s)

In [None]:
%%ag #join with dataframe
result = pdf.join(pdf2,how="outer")
export(result.describe(eps=1),'private_result')
result = df.join(df2,how="outer")
export(result.describe(),'original_result')


In [None]:
%%ag # join with series
result = pdf.join(ps,how="inner")
export(result.describe(eps=1),'private_result')
result = df.join(s,how="inner")
export(result.describe(),'original_result')

In [None]:
%%ag
result = pdf.where(pdf > 0)
export(result.describe(eps=1),'private_result')
result = df.where(df > 0)
export(result.describe(),'original_result')

In [None]:
%%ag
from op_pandas import PrivateDataFrame , PrivateSeries
import numpy as np
import pandas as pd
random_data = {
    'Age': np.random.randint(18, 60, size=100000),
    'Salary': np.random.randint(40000, 200000, size=100000),
    'DOB': pd.date_range('1970-01-01', periods=100000, freq='D').strftime('%Y-%m-%d')
}
df = pd.DataFrame(random_data)
priv_df = PrivateDataFrame(df,metadata={'Age':(18,60) , 'Salary':(40000,200000)})

In [None]:
%%ag
from ag_utils import export
age_series = priv_df["Age"]
export(age_series.var(eps=0.1) , 'var')
export(age_series.count(eps=0.1) , 'count')

In [None]:
%%ag
from ag_utils import export
age_series = priv_df["Age"]
export(age_series.percentile(eps=0.1 ,p = 0) , 'min')
export(age_series.percentile(eps=0.1 , p =100) , 'max')

In [None]:
%%ag
from ag_utils import export
result = priv_df.corr(eps=2)
export(result,'private_result')
result = df.corr(numeric_only=True)
export(result,'original_result')

In [None]:
%%ag
from ag_utils import export
age = priv_df["Age"]
salary = priv_df["Salary"]
export(age.cov(salary , eps=1),'private_result')
export(df['Age'].cov(df['Salary']),'original_result')

In [None]:
%%ag
from ag_utils import export
hist_data = priv_df.hist(column='Salary',eps=0.1)
export(hist_data , 'hist_data')

In [None]:
!pip install matplotlib

In [None]:
import matplotlib.pyplot as plt
dp_hist , dp_bins = hist_data
plt.bar(dp_bins[:-1], dp_hist, width=(dp_bins[1] - dp_bins[0]) * 0.5)
plt.show()
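For intuition about what a differentially private histogram can look like, here is a minimal sketch that adds Laplace noise to each bin count and returns a `(counts, edges)` pair like the one unpacked above. It assumes add/remove-one-record neighbours, so a single record changes one bin count by 1. The helper `private_histogram` is hypothetical and is not the implementation behind `priv_df.hist`.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_histogram(values, lo, hi, bins, epsilon, rng):
    """Return (noisy_counts, bin_edges) for values clamped to [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        v = min(max(v, lo), hi)
        counts[min(int((v - lo) / width), bins - 1)] += 1
    # Each record lands in exactly one bin, so noise of scale 1/epsilon suffices.
    noisy = [c + laplace_noise(1.0 / epsilon, rng) for c in counts]
    edges = [lo + i * width for i in range(bins + 1)]
    return noisy, edges


rng = random.Random(7)
salaries = [rng.randrange(40000, 200000) for _ in range(100000)]
dp_hist, dp_bins = private_histogram(salaries, 40000, 200000, bins=10, epsilon=0.1, rng=rng)
print(len(dp_hist), len(dp_bins))  # 10 11
```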

In [None]:
%%ag
from op_pandas import PrivateDataFrame , PrivateSeries
import numpy as np
import pandas as pd

# randomly distributing nans in two columns with prob = 0.5
# prob of a record not having nan = 0.25 
# Hence expected count after dropna should be around 2500.
choice = [1,2,np.nan]
a = np.random.choice(choice,10000,p=[0.25,0.25,0.5])
b = np.random.choice(choice,10000,p=[0.25,0.25,0.5])
priv_df = PrivateDataFrame(pd.DataFrame({'a':a , 'b':b}),metadata={'a':(1,2),'b':(1,2)})


export(priv_df.dropna(axis=0).describe(eps=1), 'result')

In [None]:
print(result)
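The figure of roughly 2500 rows in the cell comments can be checked with a short calculation: each column is non-NaN with probability 0.25 + 0.25 = 0.5, the columns are independent, and a row survives `dropna` only if both values are present. The sketch also computes the binomial standard deviation, which indicates how far the true count typically sits from 2500 before any privacy noise is added.

```python
import math

n = 10_000
p_present = 0.25 + 0.25        # probability that a single column is not NaN
p_row_kept = p_present ** 2    # both columns present, assuming independence

expected_rows = n * p_row_kept
std_rows = math.sqrt(n * p_row_kept * (1 - p_row_kept))

print(expected_rows)       # 2500.0
print(round(std_rows, 1))  # 43.3
```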

In [None]:
%ag 

In [None]:
# Comm
%%bash


In [None]:
yyy = 5 

def yf():
    te = 4
    fr = te/0

In [None]:
vvv = 45

ggg = 45/0

In [None]:
vvv

In [None]:
u = 777
g = u/0

In [None]:
print(u)

In [None]:
global u 
u = 777
g = u/0

def loc():
    global u 
    u = 9999 
    # some = 56/0

loc()

In [None]:
u

In [None]:
try:
    mypy.run()

except Exception as e:
    print(" Type Check Error ")
    print(e)