# Tutorial Notebook for UN Datathon Participants 🌐 🔐


This notebook introduces UN Datathon participants to Antigranular, your private data exploration toolkit, and showcases its main functions!

## What is Antigranular?

With Antigranular, you can use Python to securely explore and gain insights from your data without ever seeing sensitive information. 🕵️‍♂️🔐

Antigranular leverages AWS enclaves, which are isolated and protected environments that run on the cloud. AWS enclaves ensure that your data and code are safe from unauthorised access, even from AWS itself. 🛡️☁️

Antigranular also integrates with powerful differential privacy libraries, such as OpenDP, SmartNoise, and DiffPrivLib. Differential privacy is a technique that adds controlled noise to your data analysis, preserving the privacy of individual records while still allowing you to draw meaningful conclusions. 📊🔇

With Antigranular, you can work with private sensitive datasets conveniently, knowing that your data analysis remains confidential and insightful! 😊👍
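To make the idea of "controlled noise" concrete, here is a minimal sketch of the textbook Laplace mechanism applied to a count. It is only an illustration and not how Antigranular or the libraries above are implemented; the helper names `laplace_noise` and `dp_count` are made up for this example.

```python
import random

def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale) as a difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def dp_count(records, eps, rng):
    """Noisy count. Adding or removing one record changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / eps."""
    return len(records) + laplace_noise(1.0 / eps, rng)

rng = random.Random(0)        # seeded only so the demo is repeatable
records = list(range(1000))   # stand-in for 1,000 private rows

# Smaller eps means stronger privacy and therefore noisier answers.
for eps in (0.1, 1.0, 10.0):
    print(f"eps={eps}: noisy count = {dp_count(records, eps, rng):.2f}")
```

The trade-off is visible in the scale `1 / eps`: a small epsilon hides individual records better but makes each answer less accurate.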

## How to Use Antigranular


### Install the Package 📦

First, we need to add Antigranular to our local Jupyter kernel. You can do this by installing it directly from PyPI, which is like an app store for Python packages!

#### Installing Antigranular

In [1]:
!pip install antigranular



### Using Antigranular for Secure Computations 🔐

Calling `ag.login` initiates a secure session with Antigranular, where we can work with our data confidentially.

Once you are connected, any cell that begins with `%%ag` is executed securely on the Antigranular platform rather than in your local kernel, ensuring the confidentiality of our computations. Like magic! 😍

In [2]:
import antigranular as ag
session = ag.login("<client_id>", "<client_secret>", competition="UN Datathon PETs Track")

Connected to Antigranular server session id: 444cf1bf-c535-483d-b0bb-a7c83fc16c0f, the session will time out if idle for 60 minutes
Cell magic '%%ag' registered successfully, use `%%ag` in a notebook cell to execute your python code on Antigranular private python server
🚀 Everything's set up and ready to roll!


🔑 Note: Replace `<client_id>` and `<client_secret>` with the credentials you get from antigranular.com.
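To avoid pasting credentials into a notebook that you might share, one option is to read them from environment variables. This is a sketch only: the variable names `AG_CLIENT_ID` and `AG_CLIENT_SECRET` and the helper `load_credentials` are hypothetical, not something Antigranular defines.

```python
import os

def load_credentials(env=os.environ):
    """Read the client ID and secret from environment variables (hypothetical names)."""
    try:
        return env["AG_CLIENT_ID"], env["AG_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Set the {missing.args[0]} environment variable first") from None

# Usage in the notebook (not executed here):
# client_id, client_secret = load_credentials()
# session = ag.login(client_id, client_secret, competition="UN Datathon PETs Track")
```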

## Loading the Data 🚀

For this datathon, we have an extensive private dataset, which we have divided into 19 smaller datasets. Each one can be loaded individually, which keeps execution times short, so there's nothing holding you back. 💥

The [op_pandas](https://docs.antigranular.com/private-python/packages/pandas) library includes functionality for merging and joining datasets, so you can combine several of them when you need to.

You can load a dataset with the `load_dataset` function, which returns an `op_pandas.PrivateDataFrame` object. Here we load two of them:

In [3]:
%%ag
undata_ls = load_dataset("undata_ls")
undata_ls_dif = load_dataset("undata_ls_dif")

## Checking the Privacy Budget 🤑

Differentially private operations consume a "privacy budget", measured in epsilon (and delta), to ensure that data privacy is maintained. Here's how you can check your spending:

In [4]:
session.privacy_odometer()

{'total_epsilon_used': 4.999999999999993,
 'total_delta_used': 0.0,
 'library_costs': {'op_pandas': {'total_delta': 0,
   'total_epsilon': 4.999999999999993,
   'total_requests': 7}},
 'dataset_costs': {'undata_ls': {'delta': 0, 'eps': 2.999999999999995},
  'undata_ls_dif': {'delta': 0, 'eps': 1.9999999999999976}}}
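The odometer is a plain dictionary, so you can summarise it locally. The sketch below copies the output above into a variable and reports how much epsilon is left per dataset. The budget of 10.0 per dataset and the helper `summarise_odometer` are assumptions made for illustration; check the competition rules for the real limit.

```python
def summarise_odometer(odometer, budget_per_dataset):
    """Return {dataset: (epsilon spent, epsilon remaining)} for each dataset listed."""
    summary = {}
    for name, cost in odometer["dataset_costs"].items():
        spent = cost["eps"]
        summary[name] = (spent, budget_per_dataset - spent)
    return summary

# The odometer output shown above, copied into a plain dictionary.
odometer = {
    "total_epsilon_used": 4.999999999999993,
    "total_delta_used": 0.0,
    "library_costs": {
        "op_pandas": {"total_delta": 0, "total_epsilon": 4.999999999999993, "total_requests": 7}
    },
    "dataset_costs": {
        "undata_ls": {"delta": 0, "eps": 2.999999999999995},
        "undata_ls_dif": {"delta": 0, "eps": 1.9999999999999976},
    },
}

# budget_per_dataset=10.0 is a hypothetical figure, not an Antigranular default.
for name, (spent, remaining) in summarise_odometer(odometer, budget_per_dataset=10.0).items():
    print(f"{name}: spent {spent:.2f}, remaining {remaining:.2f}")
```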

## Viewing Data 🔍

To protect privacy, records in PrivateDataFrame and PrivateSeries cannot be viewed directly. But that doesn't mean you can't see anything! 👀 You can still analyse and obtain statistical information about the data using methods that offer differential privacy guarantees.


### Viewing Details About the Data

`ag_print` is a function packaged within Antigranular which can be used to print objects from the ag environment.

Here's how you can print the details about the data, like `columns` and `metadata`:

In [5]:
%%ag

ag_print("undata_ls Details: \n")
ag_print("Columns: \n", undata_ls.columns)
ag_print("Metadata: \n", undata_ls.metadata)
ag_print("Dtypes: \n", undata_ls.dtypes)

undata_ls Details: 

Columns: 
 Index(['objectid', 'ls_main', 'ls_num_lastyr', 'ls_num_now', 'ls_num_diff',
       'ls_num_increased', 'ls_num_decreased', 'ls_num_no_change',
       'ls_num_inc_less_sales', 'ls_num_inc_more_birth',
       'ls_num_inc_more_acquired', 'ls_num_inc_received_free',
       'ls_num_dec_poor_health', 'ls_num_dec_death',
       'ls_num_dec_sales_good_price', 'ls_num_dec_sales_distress',
       'ls_num_dec_escape_stolen', 'ls_num_dec_consumed',
       'ls_num_inc_dec_other', 'ls_num_inc_dec_dk', 'ls_num_inc_dec_ref',
       'ls_feed_open_pasture', 'ls_feed_common_pasture',
       'ls_feed_self_produced', 'ls_feed_purchased', 'ls_feed_free_dist',
       'ls_feed_other', 'ls_feed_dk', 'ls_feed_ref'],
      dtype='object')
Metadata: 
 {'objectid': (3496, 239871), 'ls_main': (1.0, 999.0), 'ls_num_lastyr': (0.0, 200000.0), 'ls_num_now': (0.0, 200000.0), 'ls_num_diff': (-99955.0, 18000.0), 'ls_num_increased': (0.0, 1.0), 'ls_num_decreased': (0.0, 1.0), 'ls_num_no_chan

In [6]:
%%ag

ag_print("undata_ls_dif Details: \n")
ag_print("Columns: \n", undata_ls_dif.columns)
ag_print("Metadata: \n", undata_ls_dif.metadata)
ag_print("Dtypes: \n", undata_ls_dif.dtypes)

undata_ls_dif Details: 

Columns: 
 Index(['objectid', 'ls_proddif', 'ls_proddif_feed_purchase',
       'ls_proddif_access_pasture', 'ls_proddif_access_water',
       'ls_proddif_vet_serv', 'ls_proddif_vet_input', 'ls_proddif_diseases',
       'ls_proddif_theft', 'ls_proddif_access_market',
       'ls_proddif_access_credit', 'ls_proddif_access_labour',
       'ls_proddif_other', 'ls_proddif_dk', 'ls_proddif_ref', 'ls_salesmain',
       'ls_salesdif', 'ls_salesdif_marketing_cost',
       'ls_salesdif_damage_losses', 'ls_salesdif_low_demand',
       'ls_salesdif_pay_delay', 'ls_salesdif_low_price',
       'ls_salesdif_slaughterhouse', 'ls_salesdif_processing',
       'ls_salesdif_competition', 'ls_salesdif_other', 'ls_salesdif_dk',
       'ls_salesdif_ref', 'ls_salesprice'],
      dtype='object')
Metadata: 
 {'objectid': (3496, 239871), 'ls_proddif': (0.0, 999.0), 'ls_proddif_feed_purchase': (0.0, 1.0), 'ls_proddif_access_pasture': (0.0, 1.0), 'ls_proddif_access_water': (0.0, 1.0), 'ls_p

`metadata` holds the `(lower, upper)` bounds of each numerical column.
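Bounds matter because they cap how much a single record can influence a result, which in turn sets how much noise a differentially private query needs. The sketch below uses the textbook Laplace calibration for a bounded sum; whether op_pandas uses this exact mechanism is an assumption, and the helper names are made up for this example.

```python
def clip(value, lower, upper):
    """Force a value into the declared [lower, upper] range."""
    return max(lower, min(upper, value))

def laplace_scale_for_sum(lower, upper, eps):
    """Noise scale for a bounded sum: one record can shift the sum by at most
    max(|lower|, |upper|), so scale = sensitivity / eps."""
    return max(abs(lower), abs(upper)) / eps

# Bounds copied from the metadata printed above.
bounds = {
    "ls_main": (1.0, 999.0),
    "ls_num_increased": (0.0, 1.0),
    "ls_num_now": (0.0, 200000.0),
}

for column, (lower, upper) in bounds.items():
    print(column, laplace_scale_for_sum(lower, upper, eps=1.0))
```

Under this calibration, a wide-ranging column such as `ls_num_now` needs far more noise than a 0/1 indicator for the same epsilon, which is worth keeping in mind when choosing what to query.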

### Quick Statistics 📊

One way to get quick statistics is the `describe()` method. You spend some epsilon (`eps`) and get back rough summary statistics that give you a quick overview of the dataset.



In [7]:
%%ag

undata_ls_describe = undata_ls.describe(eps=1)
ag_print("undata_ls Describe:\n", undata_ls_describe)

undata_ls_dif_describe = undata_ls_dif.describe(eps=1)
ag_print("undata_ls_dif Describe:\n", undata_ls_dif_describe)

undata_ls Describe:
             objectid        ls_main  ...     ls_feed_dk    ls_feed_ref
count  106193.000000  106193.000000  ...  106193.000000  106193.000000
mean    92227.262821       9.501422  ...       0.000000       0.001603
std     73578.981087      77.611551  ...       0.057438       0.045856
min      3849.073600       1.216553  ...       0.000000       0.000000
25%     31932.798061       1.228886  ...       0.000479       0.000578
50%     63092.628568       2.304709  ...       0.000228       0.006400
75%    148721.901942       6.916343  ...       0.006221       0.006253
max    239218.985253     933.050173  ...       0.859875       0.399871

[8 rows x 29 columns]

You can view the statistics by exporting the non-private result to the local Jupyter server using the `export` method:

In [8]:
%%ag

export(undata_ls_describe, name='undata_ls_describe')

Setting up exported variable in local environment: undata_ls_describe


Now, we can access `undata_ls_describe` in our local Jupyter environment.

In [9]:
undata_ls_describe

|  | objectid | ls_main | ls_num_lastyr | ls_num_now | ls_num_diff | ls_num_increased | ls_num_decreased | ls_num_no_change | ls_num_inc_less_sales | ls_num_inc_more_birth | ... | ls_num_inc_dec_dk | ls_num_inc_dec_ref | ls_feed_open_pasture | ls_feed_common_pasture | ls_feed_self_produced | ls_feed_purchased | ls_feed_free_dist | ls_feed_other | ls_feed_dk | ls_feed_ref |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 106193.0 | 106193.0 | 106193.0 | 106193.0 | 106193.0 | 106193.0 | 106193.0 | 106193.0 | 106193.0 | 106193.0 | ... | 106193.0 | 106193.0 | 106193.0 | 106193.0 | 106193.0 | 106193.0 | 106193.0 | 106193.0 | 106193.0 | 106193.0 |
| mean | 92227.262821 | 9.501422 | 0.0 | 2394.844254 | -140.307544 | 0.25446 | 0.606956 | 0.138506 | 0.054788 | 0.213822 | ... | 0.0 | 0.001186 | 0.572027 | 0.146485 | 0.240413 | 0.457689 | 0.01909 | 0.02411 | 0.0 | 0.001603 |
| std | 73578.981087 | 77.611551 | 6631.473486 | 35155.668324 | 6506.479588 | 0.447073 | 0.483209 | 0.340367 | 0.247414 | 0.412234 | ... | 0.065301 | 0.050791 | 0.491773 | 0.373616 | 0.425942 | 0.498414 | 0.124016 | 0.152211 | 0.057438 | 0.045856 |
| min | 3849.0736 | 1.216553 | 0.875247 | 0.31929 | -44173.01437 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 31932.798061 | 1.228886 | 3.629055 | 2.335665 | -8.859395 | 0.00298 | 0.000294 | 0.005638 | 0.006323 | 0.00073 | ... | 0.034747 | 0.021168 | 0.024102 | 0.010437 | 0.006385 | 0.048042 | 0.001503 | 0.004094 | 0.000479 | 0.000578 |
| 50% | 63092.628568 | 2.304709 | 8.113424 | 5.079792 | -1.558456 | 0.009942 | 0.969139 | 0.021323 | 0.008001 | 0.003518 | ... | 0.004047 | 0.004695 | 0.940237 | 0.015741 | 3e-05 | 0.009499 | 0.02242 | 0.021282 | 0.000228 | 0.0064 |
| 75% | 148721.901942 | 6.916343 | 20.914903 | 13.295288 | 0.843116 | 0.782992 | 0.999977 | 0.021275 | 0.004844 | 0.203501 | ... | 0.002599 | 0.076524 | 0.999311 | 0.118947 | 0.679436 | 0.998096 | 0.005327 | 0.014507 | 0.006221 | 0.006253 |
| max | 239218.985253 | 933.050173 | 149617.92023 | 68266.338045 | 15769.360441 | 0.993362 | 0.970454 | 0.991092 | 0.993086 | 0.999756 | ... | 0.79287 | 0.776746 | 0.998816 | 0.983962 | 0.987537 | 0.998874 | 0.887524 | 0.7848 | 0.859875 | 0.399871 |
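Once exported, the result behaves like an ordinary pandas DataFrame, so you can slice and reshape it locally. Since the values are already differentially private, further local processing should not cost additional epsilon. The sketch below rebuilds a small excerpt of the table by hand (three columns, two rows, values copied from the output above) so that it runs without a session.

```python
import pandas as pd

# A small excerpt of the exported table, copied from the output above.
describe_excerpt = pd.DataFrame(
    {
        "ls_num_now": {"mean": 2394.844254, "std": 35155.668324},
        "ls_num_diff": {"mean": -140.307544, "std": 6506.479588},
        "ls_feed_purchased": {"mean": 0.457689, "std": 0.498414},
    }
)

# Transpose so each dataset column becomes a row, then sort by the noisy mean.
per_column = describe_excerpt.T.sort_values("mean", ascending=False)
print(per_column)
```

With a live session you would apply the same calls directly to `undata_ls_describe`.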


## Data Preprocessing 🌐

### Importing External Data

Guess what? You can also import external data and merge it with the data provided to draw further inferences.

Here is an example of how to do that:

In [10]:
'''
Creating mock data

Mock data will just be 100 rows of age and salary information
'''

import pandas as pd
import numpy as np

n_num = 100
df = pd.DataFrame({'age': np.random.randint(0, 80, n_num), 'salary': np.random.randint(100, 100000, n_num)})
session.private_import(data = df, name= 'imported_df')

dataframe cached to server, loading to kernel...
Output: Dataframe loaded successfully to the kernel



In [11]:
%%ag
# Creating a PrivateDataFrame out of the DataFrame imported.
import op_pandas

metadata = {
    'age': (0, 80),
    'salary': (1, 200000)
}

priv_df = op_pandas.PrivateDataFrame(imported_df ,metadata = metadata)

In [12]:
%%ag

ag_print("Private DataFrame Describe:\n", priv_df.describe(eps=1))

Private DataFrame Describe:
               age         salary
count  120.000000     120.000000
mean    56.803482       1.000000
std     24.842094   43532.227591
min      0.000000    1088.455546
25%     10.058633   66269.154215
50%     37.239998   27113.635297
75%     39.126436   46095.590793
max     77.630131  177584.857815



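Now that `priv_df` is within the ag environment, you can combine it with the provided data and extract inferences. Note that the statistics above are noisy: the reported count is 120 although the mock data has 100 rows, because `describe` returns differentially private values rather than exact ones.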
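Before importing your own data, it can help to check locally that it actually fits the bounds you plan to declare in `metadata`. The sketch below is a hypothetical helper (`out_of_bounds` is not part of Antigranular) that runs on an ordinary, non-private DataFrame. How op_pandas treats values outside the declared bounds is not covered in this tutorial.

```python
import pandas as pd

def out_of_bounds(df, metadata):
    """Return {column: number of values outside its declared (lower, upper) bounds}."""
    report = {}
    for column, (lower, upper) in metadata.items():
        outside = (df[column] < lower) | (df[column] > upper)
        report[column] = int(outside.sum())
    return report

# Toy local data; the last salary deliberately exceeds the declared upper bound.
df = pd.DataFrame({"age": [25, 41, 79], "salary": [100, 52000, 250000]})
metadata = {"age": (0, 80), "salary": (1, 200000)}

print(out_of_bounds(df, metadata))
```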

### Combining Datasets: `undata_ls` and `undata_ls_dif`

To facilitate certain analyses and visualisations, it might be useful to combine our two datasets (`undata_ls` and `undata_ls_dif`) into a single dataset.

This can allow us to explore relationships between features more efficiently. 🤝🏼

In [13]:
%%ag

import op_pandas

joined_data = op_pandas.merge(undata_ls, undata_ls_dif, on="objectid")

ag_print("Joined Data Columns: \n", joined_data.columns)

Joined Data Columns: 
 Index(['objectid', 'ls_main', 'ls_num_lastyr', 'ls_num_now', 'ls_num_diff',
       'ls_num_increased', 'ls_num_decreased', 'ls_num_no_change',
       'ls_num_inc_less_sales', 'ls_num_inc_more_birth',
       'ls_num_inc_more_acquired', 'ls_num_inc_received_free',
       'ls_num_dec_poor_health', 'ls_num_dec_death',
       'ls_num_dec_sales_good_price', 'ls_num_dec_sales_distress',
       'ls_num_dec_escape_stolen', 'ls_num_dec_consumed',
       'ls_num_inc_dec_other', 'ls_num_inc_dec_dk', 'ls_num_inc_dec_ref',
       'ls_feed_open_pasture', 'ls_feed_common_pasture',
       'ls_feed_self_produced', 'ls_feed_purchased', 'ls_feed_free_dist',
       'ls_feed_other', 'ls_feed_dk', 'ls_feed_ref', 'ls_proddif',
       'ls_proddif_feed_purchase', 'ls_proddif_access_pasture',
       'ls_proddif_access_water', 'ls_proddif_vet_serv',
       'ls_proddif_vet_input', 'ls_proddif_diseases', 'ls_proddif_theft',
       'ls_proddif_access_market', 'ls_proddif_access_credit',
      

The `merge` function combines the two PrivateDataFrames on their shared `objectid` column.
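If you have not used `merge` before, the plain-pandas version below shows the idea on two toy frames. It assumes that `op_pandas.merge` follows the pandas default of an inner join, which this tutorial does not confirm; the values are invented.

```python
import pandas as pd

left = pd.DataFrame({"objectid": [1, 2, 3], "ls_main": [1.0, 2.0, 4.0]})
right = pd.DataFrame({"objectid": [2, 3, 4], "ls_salesprice": [10.0, 12.5, 9.0]})

# Default inner join: only objectid values present in both frames survive.
joined = pd.merge(left, right, on="objectid")
print(joined)
```

The key column appears once in the result, followed by the remaining columns of the left frame and then those of the right frame.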

## Data Visualisation 🤩

The next function in your arsenal is data visualisation. This is a pivotal step in exploratory data analysis. By visualising our data, we can observe patterns, anomalies, and relationships between variables that might not be apparent from the raw data alone.


### Exploring the Data Using Histograms

We can visualise individual columns using differentially private histograms.

In [14]:
%%ag

hist_data = joined_data.hist(column='ls_salesprice', eps=1)
export(hist_data, 'hist_data')

Setting up exported variable in local environment: hist_data


To visualise the histogram locally, you can use Plotly (as below), matplotlib, or any other plotting library of your choice.

In [15]:
import plotly.graph_objects as go
dp_hist, dp_bins = hist_data
fig = go.Figure(data=[go.Bar(x=dp_bins[:-1], y=dp_hist)])
fig.show()
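Noisy bin counts can look odd when plotted, so a little local tidying may help. The sketch below assumes `hist_data` is a `(counts, bin_edges)` pair with one more edge than counts, which is what the unpacking above suggests. It computes bar centres and widths and clips any negative noisy counts to zero. The helper `prepare_bars` and the sample numbers are hypothetical.

```python
def prepare_bars(counts, bin_edges):
    """Turn (counts, bin_edges) into bar centres, widths and non-negative heights."""
    if len(bin_edges) != len(counts) + 1:
        raise ValueError("expected one more bin edge than counts")
    pairs = list(zip(bin_edges[:-1], bin_edges[1:]))
    centres = [(lo + hi) / 2 for lo, hi in pairs]
    widths = [hi - lo for lo, hi in pairs]
    heights = [max(0.0, c) for c in counts]  # noisy counts may dip below zero
    return centres, widths, heights

# Hypothetical stand-in for the exported hist_data tuple.
dp_hist = [120.4, -3.2, 57.9]
dp_bins = [0.0, 10.0, 20.0, 30.0]

centres, widths, heights = prepare_bars(dp_hist, dp_bins)
print(centres, widths, heights)
```

You can then pass the result to Plotly with `go.Bar(x=centres, y=heights, width=widths)` so that each bar spans its full bin.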

### Splitting the Data

You can use `op_pandas.train_test_split` to randomly split the data into training and testing parts, which you can then use to train any of the models provided in `op_diffprivlib`.

Here is an example of how to remove some columns from the data and split the rest into training and testing parts.

In [16]:
%%ag
# removing 'ls_num_dec_poor_health' and 'ls_num_dec_death' from the joined dataset.

joined_data.drop(['ls_num_dec_poor_health', 'ls_num_dec_death'])

train_data, test_data = op_pandas.train_test_split(joined_data)

ag_print("Train Data Description: \n", train_data.describe(eps = 1))
ag_print("Test Data Description: \n", test_data.describe(eps = 1))

Train Data Description: 
             objectid       ls_main  ...  ls_salesdif_ref  ls_salesprice
count   70738.000000  70738.000000  ...     70738.000000   70738.000000
mean    88029.619385      7.969086  ...         0.005477       6.477655
std     78438.574474    131.884916  ...         0.029346      81.499228
min      3872.451635      1.379914  ...         0.000000       1.913906
25%     29558.126524      1.006835  ...         0.432466       1.260141
50%     54798.710199      2.526993  ...         0.251303     489.258578
75%    136594.747193      4.100764  ...         0.036115     141.882881
max    238731.879559    903.617874  ...         0.217863      48.687546

[8 rows x 55 columns]

Test Data Description: 
             objectid       ls_main  ...  ls_salesdif_ref  ls_salesprice
count   23369.000000  23369.000000  ...     23369.000000   23369.000000
mean    94509.234734      1.000000  ...         0.017743      12.962512
std     83074.722158    159.088466  ...         0.253966     
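For readers more familiar with plain pandas, the drop-and-split step corresponds roughly to the sketch below on toy data. Two things here are assumptions rather than documented op_pandas behaviour: that `drop` removes columns, and that the split is about 75/25, a figure chosen only because it is close to the ratio of the noisy counts above. Note that in plain pandas `drop` returns a new frame, so the result has to be kept.

```python
import pandas as pd

# Toy data with invented values; column names borrowed from the tutorial.
toy = pd.DataFrame(
    {
        "objectid": range(8),
        "ls_num_dec_death": [0, 1, 0, 0, 1, 0, 1, 0],
        "ls_salesprice": [5.0, 7.5, 6.0, 8.0, 4.5, 9.0, 6.5, 7.0],
    }
)

# drop() returns a new frame in plain pandas, so keep the result.
features = toy.drop(columns=["ls_num_dec_death"])

# Sample 75% of the rows for training; the remaining rows form the test set.
train = features.sample(frac=0.75, random_state=0)
test = features.drop(index=train.index)

print(len(train), len(test))
```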

Now you can train any model from `op_diffprivlib`.

Now that we are all done, we can terminate the session. Happy coding! 😎

In [17]:
session.terminate_session()

{'status': 'ok'}