## Moving Privacy into the Code Block

To combine input and output privacy, we provide a framework for remote code execution that feels close to native in the data science user experience. The steps are essentially:

- Connect to an enclave backend.
- Use the `%%ag` magic to send blocks of code for remote execution.
- Manipulate and query data in the safety of the enclave and export only public data (post-DP) to the client.

To get started we import the antigranular package:

In [1]:
import antigranular as ag

### Log In with OAuth

Users are authenticated with standard OAuth, while a separate attestation-based handshake authenticates the software running within the enclave. In the example below, you can ignore the hard-coded `user_id` and `user_secret`; they are only in place for local testing and demos:

In [2]:
ag_client  = ag.login(user_id="CDl6m0dfjtOPYzZUkvSDznKMKWVaEGCg", user_secret="_l5cUr6fGGm5hNY2mHhSQboMIWWPWnls2bGqsTJ90daCtSMnL1GscxUtmWUCxeX-", dataset="Iris Dataset")

Connected to Antigranular server session id: 3f23b120-07a9-41da-9cf4-8b663821776a
Cell magic '%%ag' registered successfully, use `%%ag` in a notebook cell to execute your python code on Antigranular private python server


Once you've logged in, you will have a session id associated with your interactions. This id is embedded into the ipynb metadata, so when you share your notebook we can associate it with your score from a competition, or with your metrics from analysing a dataset.

In [3]:
ag_client.session_id

'3f23b120-07a9-41da-9cf4-8b663821776a'

### Executing Code with AG

We provide the `%%ag` cell magic to let users toggle between private Python and regular Python. Simply add it to the top of any cell and your code will be executed remotely. Any non-private data type (int, float, list, str, etc.) can be exported back to your current Jupyter instance using the `export` method, as seen below:

In [4]:
%%ag 
from ag_utils import export 

export(2, "x")

Setting up exported variable in local environment: x


In [5]:
print("Look what's in x now in your local Jupyter session:", x)

Look what's in x now in your local Jupyter session: 2
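
The rule that only non-private types may leave the enclave can be pictured as a type check applied before export. The sketch below is an illustration under stated assumptions, not Antigranular's implementation: `is_exportable`, the `PrivateValue` stand-in, and the exact set of allowed types are all hypothetical.

```python
# Hypothetical allow-list of plain types; the real set may differ.
EXPORTABLE_SCALARS = (bool, int, float, str, type(None))


class PrivateValue:
    """Stand-in for a private type that must stay inside the enclave."""


def is_exportable(value: object) -> bool:
    """Return True only if value is built entirely from allowed plain types."""
    if isinstance(value, EXPORTABLE_SCALARS):
        return True
    if isinstance(value, (list, tuple)):
        # Containers are exportable only if every element is.
        return all(is_exportable(item) for item in value)
    return False


print(is_exportable(2))
print(is_exportable([1, 2.5, "a"]))
print(is_exportable([1, PrivateValue()]))
```

Checking containers recursively matters here: a list that holds even one private object would otherwise carry it out of the enclave.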


### Throwing Errors

Errors can be thrown for a variety of reasons. Antigranular restricts many AST nodes, enforces strict mypy, limits the scope of variable assignments, and much more. If you try to do any of these, you will be greeted with an error message, which is forwarded to your local runtime.

We limit the scope intentionally to limit the side effects of a method call:

In [7]:
%%ag 
r = 1
r = r + 1

def goofie() -> None:
    # same is true for any free, nonlocal, global vars
    # either implicitly or explicitly defined...
    global r
    r = 6
    
goofie()

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File /code/dependencies/ag-private-kernel/kernel/compiler/restricted.py:20, in RestrictedCachingCompiler.__call__(self, source, filename, symbol)
     18     symbol = "exec"
     19 unparsed_source: str = ast.unparse(source)
---> 20 code_obj = self.parser.parse_and_compile(
     21     code=unparsed_source, filename=filename, mode=symbol
     22

### Private DataFrames 

Usually the dataset you are looking for will already be present in the session you connect to. However, we'll first show you how these are constructed, and then how most of the common pandas interfaces can be used in a differentially private manner to gain insights into the underlying dataset:

In [8]:
%%ag
import numpy as np
import pandas as pd
from op_pandas import PrivateDataFrame

# Create the private dataframe. This is usually done for you.
df = pd.DataFrame(np.random.randint(10, size=(1000, 2)), columns=["Example", "Example2"])
pdf = PrivateDataFrame(df, metadata={"Example": (0, 10), "Example2": (0, 10)})


Now we can ask statistical questions of the private dataframe by spending some of our privacy budget:

In [9]:
%%ag
export(pdf.sum(1), "sum_")
export(pdf.mean(1), "mean_")
export(pdf.std(1), "std_")
export(pdf.count(1), "count_")
export(pdf.var(1), "var_")
export(pdf.median(1), "median_")
export(pdf.quantile(0.25), "quantile_")



Once exported we can plot them, print them, whatever you want:

In [10]:
print("The sum was:", sum_)
print("The mean was:", mean_)
print("The std was:", std_)
print("The count was:", count_)
print("The var was:", var_)
print("The median was:", median_)
print("The quantile was:", quantile_)

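
To make the idea of spending a privacy budget concrete, here is a minimal sketch of a bounded sum released with Laplace noise and tracked under basic sequential composition. It assumes nothing about how `op_pandas` works internally: `PrivacyBudget`, `BudgetExceeded`, `laplace_noise`, and `dp_sum` are hypothetical helpers written with the standard library only.

```python
import random


class BudgetExceeded(RuntimeError):
    """Raised when a query asks for more epsilon than remains."""


class PrivacyBudget:
    """Tracks spent epsilon under basic sequential composition."""

    def __init__(self, total_eps: float) -> None:
        self.total_eps = total_eps
        self.spent = 0.0

    def spend(self, eps: float) -> None:
        if eps <= 0:
            raise ValueError("eps must be positive")
        if self.spent + eps > self.total_eps:
            raise BudgetExceeded("not enough privacy budget left")
        self.spent += eps


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponential draws is Laplace-distributed.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_sum(values, lower, upper, eps, budget, rng):
    """Noisy sum of values clamped to the declared [lower, upper] bounds."""
    budget.spend(eps)
    clamped = [min(max(v, lower), upper) for v in values]
    # Adding or removing one record changes the sum by at most this much.
    sensitivity = max(abs(lower), abs(upper))
    return sum(clamped) + laplace_noise(sensitivity / eps, rng)


rng = random.Random(0)
budget = PrivacyBudget(total_eps=2.0)
data = [rng.randint(0, 10) for _ in range(1000)]
noisy_total = dp_sum(data, 0, 10, eps=1.0, budget=budget, rng=rng)
print("noisy sum:", noisy_total, "| epsilon spent:", budget.spent)
```

The declared bounds play the same role as the `metadata` ranges above: they cap how much any single record can influence the result, which in turn fixes how much noise is needed for a given epsilon.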

### SciKit Learn with DiffPrivLib 

One of the beauties of the Python ecosystem is its rich landscape of data science frameworks. Fortunately, a number of these have drop-in replacements that preserve differential privacy. A convenient one is DiffPrivLib, written by Naoise Holohan at IBM Research.

Let's go ahead and train a random forest with differential privacy, for example, on our dataset and download the resulting model to our local machine.

In [None]:
%%ag
from op_pandas import PrivateDataFrame , PrivateSeries

In [None]:
%%ag
import pandas as pd
s = pd.Series([1,5,8,2,9])
priv_s = PrivateSeries(series=s,metadata=(0,10))

In [None]:
%%ag
import pandas as pd
data = {
    'Age':[20,30,40,25],
    'Salary':[35000,60000,100000,55000],
    'Sex':['M','F','M','F']
}
metadata = {
    'Age':(18,65),
    'Salary':(20000,200000)
}
df = pd.DataFrame(data)
priv_df = PrivateDataFrame(df=df , metadata=metadata)

In [None]:
%%ag
from op_pandas import PrivateDataFrame , PrivateSeries
import numpy as np
import pandas as pd
random_data = {
    'Age': np.random.randint(18, 60, size=100000),
    'Salary': np.random.randint(40000, 200000, size=100000),
    'DOB': pd.date_range('1970-01-01', periods=100000, freq='D').strftime('%Y-%m-%d')
}
df = pd.DataFrame(random_data)
priv_df = PrivateDataFrame(df,metadata={'Age':(18,60) , 'Salary':(40000,200000)})

In [None]:
%%ag
from ag_utils import export
priv_describe = priv_df.describe(eps=1)
# Export information from remote ag kernel to local jupyter server.
export(priv_describe , 'df_info')

In [None]:
%%ag
from ag_utils import export
priv_s = priv_df["Age"]
export(priv_s.describe(eps=1) , "Age_series")

In [None]:
print(Age_series)


In [None]:
%%ag

from ag_utils import export
# maps string to its length if not numerical else divides it by 2.
def func(x: str | int | float) -> float:
    if isinstance(x, str):
        return len(x)
    elif isinstance(x, (int, float)):
        return x / 2
    return 0.0


result = priv_df.applymap(func,eps=1)
export(result.describe(eps=1),'private_result')
result = df.applymap(func)
export(result.describe(),'original_result')

In [None]:
print(private_result)


In [None]:
%%ag
from ag_utils import export
age_series = priv_df['Age']
def series_map(x:float)->float:
    return x/2

result = age_series.map(series_map,eps=1,na_action='ignore')
export( result.describe(eps=1), 'private_result')
export( df['Age'].map(series_map,na_action='ignore') , 'original_result')

In [None]:
%%ag
from ag_utils import export
export(priv_df[['Age','Salary']].describe(eps=1) , 'original')
export((-priv_df[['Age','Salary']]).describe(eps=1) , 'negative')

In [None]:
negative.columns = ["Age_neg","Salary_neg"]
print(original.join(negative , how="left"))

In [None]:
%%ag
from ag_utils import export
pdf = priv_df[['Age','Salary']]
result1 = pdf + (10*pdf)  # expected min-max => Age:(198,660) ,  Salary:(440000,2200000)
result2 = result1/1000 # expected min-max => Age:(0.198,0.66) ,  Salary:(440,2200)
export(result1.describe(eps=1) , 'result1')
export(result2.describe(eps=1) , 'result2')

In [None]:
result2.columns = ["Age2","Salary2"]
result1.columns = ["Age1","Salary1"]
print(result1.join(result2 , how="left"))
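
The expected bounds for these arithmetic results can be worked out with plain interval arithmetic, starting from the `Age` bounds (18, 60) and `Salary` bounds (40000, 200000) declared for `priv_df`. The `Bounds` class below is a hypothetical helper for checking the numbers by hand; it is an assumption that `op_pandas` propagates metadata this way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    lo: float
    hi: float

    def __add__(self, other: "Bounds") -> "Bounds":
        return Bounds(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, k: float) -> "Bounds":
        # A negative factor flips the interval, so re-order the ends.
        a, b = self.lo * k, self.hi * k
        return Bounds(min(a, b), max(a, b))

    def __truediv__(self, k: float) -> "Bounds":
        a, b = self.lo / k, self.hi / k
        return Bounds(min(a, b), max(a, b))


age = Bounds(18, 60)
salary = Bounds(40000, 200000)

# result1 = pdf + (10 * pdf)
age1 = age + age * 10
salary1 = salary + salary * 10

# result2 = result1 / 1000
age2 = age1 / 1000
salary2 = salary1 / 1000

print(age1, salary1)
print(age2, salary2)
```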

In [None]:
%%ag
import pandas as pd
import numpy as np
from ag_utils import export
# series1 and series2 should have mean roughly equal to 0.5
series1 = PrivateSeries(pd.Series(np.random.randint(0,2,size=100000)),metadata=(0,1))
series2 = PrivateSeries(pd.Series(np.random.randint(0,2,size=100000)),metadata=(0,1))
or_result = series1 | series2 # should have mean around 0.75
export(or_result.describe(eps=0.1),'or_result')
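
The 0.75 figure follows from independence: the OR is 0 only when both bits are 0, which happens with probability 0.5 × 0.5 = 0.25. A quick non-private simulation, using only the standard library and a fixed seed chosen for this sketch, gives the same picture:

```python
import random

rng = random.Random(42)
n = 100_000

# Two independent series of fair 0/1 draws, as in the cell above.
series1 = [rng.randint(0, 1) for _ in range(n)]
series2 = [rng.randint(0, 1) for _ in range(n)]

or_result = [a | b for a, b in zip(series1, series2)]
empirical_mean = sum(or_result) / n

# P(a or b) = 1 - P(a = 0) * P(b = 0)
expected_mean = 1 - (1 - 0.5) * (1 - 0.5)

print("expected:", expected_mean, "empirical:", empirical_mean)
```

The private `describe` result should land near the same value, with some extra spread from the noise added at `eps=0.1`.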

In [None]:
%%ag
import pandas as pd
import numpy as np
from ag_utils import export
# dataframes & series
df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 4)), columns=['A','B','C','D'],
                      index=np.random.randint(-10, 10, size=1000))
df2 = pd.DataFrame(np.random.randint(-100, 100, size=(1000, 2)), columns=['E','F'],
                   index=np.random.randint(-10, 10, size=1000))
s = pd.Series(np.random.randint(0, 100, size=1000), name='SER')

# corresponding Private dataframes & private series
pdf = PrivateDataFrame(df, metadata={'A': (0, 100), 'B': (0, 100), 'C': (0, 100), 'D': (0, 100)})
pdf2 = PrivateDataFrame(df2, metadata={'E': (-100, 100), 'F': (-100, 100)})
ps = PrivateSeries(s)

In [None]:
%%ag
# join with dataframe
result = pdf.join(pdf2,how="outer")
export(result.describe(eps=1),'private_result')
result = df.join(df2,how="outer")
export(result.describe(),'original_result')


In [None]:
%%ag
# join with series
result = pdf.join(ps,how="inner")
export(result.describe(eps=1),'private_result')
result = df.join(s,how="inner")
export(result.describe(),'original_result')

In [None]:
%%ag
result = pdf.where(pdf > 0)
export(result.describe(eps=1),'private_result')
result = df.where(df > 0)
export(result.describe(),'original_result')

In [None]:
%%ag
from op_pandas import PrivateDataFrame , PrivateSeries
import numpy as np
import pandas as pd
random_data = {
    'Age': np.random.randint(18, 60, size=100000),
    'Salary': np.random.randint(40000, 200000, size=100000),
    'DOB': pd.date_range('1970-01-01', periods=100000, freq='D').strftime('%Y-%m-%d')
}
df = pd.DataFrame(random_data)
priv_df = PrivateDataFrame(df,metadata={'Age':(18,60) , 'Salary':(40000,200000)})

In [None]:
%%ag
from ag_utils import export
age_series = priv_df["Age"]
export(age_series.var(eps=0.1) , 'var')
export(age_series.count(eps=0.1) , 'count')

In [None]:
%%ag
from ag_utils import export
age_series = priv_df["Age"]
# Export under names that do not shadow the local built-ins min() and max().
export(age_series.percentile(eps=0.1, p=0), 'age_min')
export(age_series.percentile(eps=0.1, p=100), 'age_max')

In [None]:
%%ag
from ag_utils import export
result = priv_df.corr(eps=2)
export(result,'private_result')
result = df.corr(numeric_only=True)
export(result,'original_result')

In [None]:
%%ag
from ag_utils import export
age = priv_df["Age"]
salary = priv_df["Salary"]
export(age.cov(salary , eps=1),'private_result')
export(df['Age'].cov(df['Salary']),'original_result')

In [None]:
%%ag
from ag_utils import export
hist_data = priv_df.hist(column='Salary',eps=0.1)
export(hist_data , 'hist_data')

In [None]:
!pip install matplotlib

In [None]:
import matplotlib.pyplot as plt
dp_hist , dp_bins = hist_data
plt.bar(dp_bins[:-1], dp_hist, width=(dp_bins[1] - dp_bins[0]) * 0.5)
plt.show()
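
For intuition about what a differentially private histogram involves, the sketch below adds Laplace noise to each bin count. It is a standard textbook construction and only an assumption about what `priv_df.hist` does internally; `dp_histogram` is a hypothetical function whose `(counts, edges)` return shape mirrors how `hist_data` is unpacked above.

```python
import random


def dp_histogram(values, lower, upper, bins, eps, rng):
    """Return (noisy_counts, bin_edges) for values clamped to [lower, upper]."""
    width = (upper - lower) / bins
    edges = [lower + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for v in values:
        v = min(max(v, lower), upper)
        # The top edge belongs to the last bin.
        idx = min(int((v - lower) / width), bins - 1)
        counts[idx] += 1
    # Each record falls in exactly one bin, so adding or removing a record
    # changes a single count by 1: Laplace noise with scale 1/eps per bin.
    noisy = [c + rng.expovariate(eps) - rng.expovariate(eps) for c in counts]
    return noisy, edges


rng = random.Random(7)
salaries = [rng.randint(40000, 199999) for _ in range(10_000)]
dp_hist, dp_bins = dp_histogram(salaries, 40000, 200000, bins=10, eps=0.1, rng=rng)
print(dp_hist)
```

Noisy counts can be fractional or even negative for sparse bins, so it is common to round or clip them before plotting.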

In [None]:
%%ag
from op_pandas import PrivateDataFrame , PrivateSeries
import numpy as np
import pandas as pd

# randomly distributing nans in two columns with prob = 0.5
# prob of a record not having nan = 0.25 
# Hence expected count after dropna should be around 2500.
choice = [1,2,np.nan]
a = np.random.choice(choice,10000,p=[0.25,0.25,0.5])
b = np.random.choice(choice,10000,p=[0.25,0.25,0.5])
priv_df = PrivateDataFrame(pd.DataFrame({'a':a , 'b':b}),metadata={'a':(1,2),'b':(1,2)})


export(priv_df.dropna(axis=0).describe(eps=1), 'result')

In [None]:
print(result)
