# How to Explore and Clean Sensitive Data You Can't Even See With Antigranular

## Motivation

Differential privacy is a robust framework for working with highly sensitive data such as PII (personally identifiable information), financial data, health information, biometric data and so on. When data science and machine learning methods are applied to these datasets under differential privacy, the output of any analysis changes very little whether or not a given individual's row is included, so there is virtually no chance of extracting individual details from any row.

But the security provided by the framework comes at a cost - you can only use functions and algorithms that are differentially private. Unfortunately, the vast majority of Python libraries do not meet this requirement, which makes them unusable in this setting.
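
To make "differentially private function" concrete, here is a minimal sketch of the Laplace mechanism, one standard way to privatize a count. It is not Antigranular's implementation; the helper names `laplace_noise` and `dp_count` and the toy price list are made up for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at zero.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_count(values, predicate, eps: float, rng: random.Random) -> float:
    """Return a noisy count of the values that satisfy `predicate`."""
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one row changes a count by at most 1 (sensitivity 1),
    # so Laplace noise with scale 1 / eps gives eps-differential privacy.
    return true_count + laplace_noise(scale=1 / eps, rng=rng)


rng = random.Random(0)  # seeded only to make the sketch reproducible
toy_prices = [326, 334, 2757, 18823, 950, 4268]  # hypothetical values
noisy = dp_count(toy_prices, lambda p: p > 1000, eps=1.0, rng=rng)
print(f"noisy count of prices above 1000: {noisy:.2f}")
```

A smaller `eps` means a larger noise scale, so the answer is more private but less accurate. This is the trade-off behind every `eps` argument in the rest of the article.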

For this reason, the open-source community offers alternatives to popular libraries for data manipulation and machine learning. In this tutorial, we will learn how to perform Exploratory Data Analysis (EDA) and data cleaning on a sample differentially private dataset using a platform called Antigranular. 

## What is Antigranular?

I've discussed the Antigranular platform in depth in a [previous article](https://medium.com/towards-artificial-intelligence/antigranular-how-to-access-sensitive-datasets-without-looking-at-them-44090cb22d8a). Here, we will only summarize the main points, since the rest of this article depends on them.

Antigranular is a secure platform where companies can host their sensitive datasets under differential privacy. The platform stores datasets in [AWS Nitro Enclaves](https://aws.amazon.com/ec2/nitro/nitro-enclaves/), which rely on cryptographic attestation for security. The data is then exposed to authenticated users, and only through differentially private operations.

Anyone can use the data, but nobody can actually look at individual rows. This raises the question - why do this?

Well, Antigranular isn't just secure storage - it is a Kaggle-like competition platform where users compete to solve machine learning tasks using DP methods. It also allows users to share their solutions publicly in notebooks, just like on Kaggle.

To use the differentially private libraries and datasets locally, Antigranular comes with a Python library of the same name. Today, we will use all of these to perform a typical EDA on the differentially private version of the open-source Diamonds dataset.

## Setup

In [4]:
%%ag

from op_pandas import PrivateDataFrame, PrivateSeries


## 1. Summary statistics

In [None]:
%%ag

ag_print("Columns: \n", diamonds.columns)
ag_print("Metadata: \n", diamonds.metadata)
ag_print("Dtypes: \n", diamonds.dtypes)

In [None]:
%%ag

results = diamonds.describe(eps=1)

# Print the result from the remote AG kernel in the local Jupyter notebook.
ag_print(results)

In [None]:
%%ag

# Export under a valid Python identifier so the local kernel can use it as `results`.
export(results, "results")

In [None]:
print(results)

In [None]:
import seaborn as sns

diamonds = sns.load_dataset("diamonds")
results_local = diamonds.describe()

In [None]:
results - results_local

In [None]:
%%ag
# ps is a private series

# Avoid shadowing the built-in min() and max() functions.
min_val = ps.percentile(eps=0.1, p=0)
max_val = ps.percentile(eps=0.1, p=100)

ag_print(f"{min_val = } , {max_val = }")
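
Every call that takes an `eps` argument spends privacy budget, and under basic sequential composition the spent values add up. So the `describe` call and the two percentile calls above cost 1 + 0.1 + 0.1 in total. The sketch below is purely local bookkeeping to illustrate that arithmetic; the `BudgetTracker` class and the total budget of 2.0 are hypothetical and not part of Antigranular.

```python
class BudgetTracker:
    """Local bookkeeping of epsilon spent under sequential composition."""

    def __init__(self, total_eps: float):
        self.total_eps = total_eps
        self.spent = []  # list of (label, eps) pairs

    @property
    def used(self) -> float:
        # Sequential composition: the epsilons of all queries simply add up.
        return sum(eps for _, eps in self.spent)

    def spend(self, label: str, eps: float) -> None:
        """Record a query, refusing it if it would exceed the budget."""
        if eps <= 0:
            raise ValueError("eps must be positive")
        if self.used + eps > self.total_eps + 1e-12:
            raise RuntimeError(f"'{label}' would exceed the budget")
        self.spent.append((label, eps))


tracker = BudgetTracker(total_eps=2.0)  # hypothetical total budget
tracker.spend("describe", 1.0)
tracker.spend("percentile p=0", 0.1)
tracker.spend("percentile p=100", 0.1)
print(f"used {tracker.used:.1f} of {tracker.total_eps:.1f}")
```

Keeping a tally like this while exploring makes it easier to decide which statistics are worth a larger `eps`.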

## 2. Cleaning data

In [None]:
diamonds.drop(columns="z", inplace=True)  # Drop the "z" column in place

In [None]:
diamonds = diamonds.dropna(axis=0)  # Drop rows that contain missing values

## 3. Selecting data

In [None]:
%%ag


def series_map(x: int) -> float:
    return x / 2


priv_df["age"] = priv_df["age"].map(series_map, eps=1)  # important
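
The printed metadata is worth checking after a map, because a transformation changes the range of values a column can take. As a rough local illustration (not how `op_pandas` derives its metadata), the sketch below propagates bounds through a monotonic function. The helper `mapped_bounds` and the (0, 100) age bounds are assumptions made for the example.

```python
def mapped_bounds(func, bounds):
    """Bounds of func(x) for x within `bounds`, assuming func is monotonic."""
    lo, hi = bounds
    a, b = func(lo), func(hi)
    # A decreasing function swaps the endpoints, so sort them again.
    return (min(a, b), max(a, b))


def series_map(x: int) -> float:
    return x / 2


age_bounds = (0, 100)  # hypothetical bounds for the "age" column
new_bounds = mapped_bounds(series_map, age_bounds)
print(new_bounds)
```

This shortcut only holds for monotonic functions; for anything else the endpoints alone do not determine the output range.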
ag_print("Metadata:\n", priv_df.metadata)

## 4. General functions

## 5. Statistical methods

## 6. Plotting histograms

In [None]:
%%ag

hist_data = priv_df.hist(column="salary", eps=0.1)
export(hist_data, "hist_data")

In [None]:
import matplotlib.pyplot as plt
import numpy as np

dp_hist, dp_bins = hist_data
# Create a bar plot using Matplotlib
plt.bar(dp_bins[:-1], dp_hist, width=np.diff(dp_bins) * 0.8, align="edge")

# Display the plot
plt.show()
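
Because the counts come from a differentially private query, they are noisy and may be fractional or, depending on the mechanism, slightly negative. Cleaning them locally is post-processing, which does not spend additional privacy budget. The sketch below assumes the exported object unpacks to counts and bin edges as in the cell above; the helper `clean_histogram` and the numbers are placeholders, not real output.

```python
def clean_histogram(counts, bin_edges):
    """Clip negative noisy counts and derive bar widths for plotting."""
    if len(bin_edges) != len(counts) + 1:
        raise ValueError("expected one more bin edge than counts")
    # Negative counts are an artifact of the noise, so clip them to zero.
    clipped = [max(0.0, float(c)) for c in counts]
    total = sum(clipped)
    proportions = [c / total if total > 0 else 0.0 for c in clipped]
    # Width of each bin, the stdlib equivalent of np.diff(bin_edges).
    widths = [hi - lo for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]
    return clipped, proportions, widths


noisy_counts = [12.4, -1.7, 30.2, 5.1]  # placeholder values, not real output
edges = [0, 10, 20, 30, 40]  # placeholder bin edges
clipped, proportions, widths = clean_histogram(noisy_counts, edges)
print(clipped, widths)
```

The clipped counts can be passed to `plt.bar` in place of the raw ones, and the proportions are handy when comparing histograms of differently sized datasets.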

## Plotting a correlation matrix

In [None]:
%%ag

result = priv_df.corr(eps=3)
export(result, "private_result")

## 7. Using GroupBy

In [None]:
%%ag
import op_pandas as opd

# df is a regular pandas DataFrame with "age" and "groups" columns
pdf = opd.PrivateDataFrame(
    df, metadata={"age": (0, 100)}, categorical_metadata={"groups": ["a", "b", "c"]}
)
grouped = pdf.groupby("groups")
ag_print(grouped.sum(eps=1))
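
Note that the group labels are declared up front in `categorical_metadata`. A likely reason is that the set of groups itself must not depend on the private rows, otherwise the mere presence of a label could leak information. The sketch below shows the idea with a hand-rolled noisy group sum; it is an illustration under that assumption, not the `op_pandas` implementation, and `dp_group_sum`, `laplace_noise`, and the toy rows are made up.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_group_sum(rows, groups, bounds, eps, rng):
    """Noisy sum per declared group for (group, value) rows."""
    lo, hi = bounds
    # Start from the declared groups so empty groups still get an answer.
    totals = {g: 0.0 for g in groups}
    for group, value in rows:
        if group in totals:  # rows with undeclared labels are ignored
            totals[group] += min(max(value, lo), hi)  # clamp to the bounds
    # One row can change one group's sum by at most max(|lo|, |hi|).
    sensitivity = max(abs(lo), abs(hi))
    # Each row belongs to a single group, so the same eps applies to every
    # group (parallel composition).
    return {g: t + laplace_noise(sensitivity / eps, rng) for g, t in totals.items()}


rng = random.Random(0)  # seeded only to make the sketch reproducible
toy_rows = [("a", 34), ("b", 51), ("a", 27), ("b", 120)]  # hypothetical data
result = dp_group_sum(toy_rows, ["a", "b", "c"], bounds=(0, 100), eps=1.0, rng=rng)
print(result)
```

Group "c" has no rows but still receives a noisy value, and the out-of-range 120 is clamped to 100 before summing, which is what keeps the sensitivity bounded.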

In [None]:
%%ag

grouped_pdf = pdf.groupby(pdf["age"] > 30)

ag_print(grouped_pdf["salary"].mean(eps=1))

In [None]:
%%ag

ag_print(pdf.groupby(["education", "gender"]).mean(eps=1))

In [None]:
%%ag

ag_print(pdf.groupby(["education", pdf["age"] > 30]).mean(eps=1))

## 8. Visualizing GroupBy results

In [None]:
%%ag

import op_pandas

priv_df_train, priv_df_test = op_pandas.train_test_split(priv_df)

ag_print("Count of train split:\n", priv_df_train.count(eps=1))
ag_print("Count of test split:\n", priv_df_test.count(eps=1))

## Conclusion