# How to Visualize Differentially Private Datasets in Matplotlib with Antigranular

## Introduction

In March 2024, Oblivious AI held a bootcamp on differential privacy at Oxford University. While attending the event, I got to work on a very interesting problem: preserving individual privacy in web analytics datasets using differential privacy.

The question was: can a malicious actor find out a person's age group by analyzing their mouse and keyboard tracking data? Specifically, are older people more at risk of a targeted attack if the speed of their typing and mouse movements is exposed?

On the surface, age group might seem like a harmless piece of information. However, certain parties can exploit it in several ways, such as:

- Targeted scams: Knowing someone's age group can help scammers tailor their approach. For example, they might target older adults with pension scams or younger people with social media phishing attempts.
- Identity theft: Age group can be a puzzle piece in a larger identity theft scheme. Combined with other pieces of information, it could be used to answer security questions or appear more believable when impersonating someone.

So, my teammate, Devyani Gauri, and I had an important problem on our hands. First, by running some clustering and classification algorithms on a sample mouse tracking dataset, we found that by analyzing mouse movement characteristics (speed, velocity, etc.), we could easily tell whether an individual is young or old (within the context of our dataset).

## Differential privacy refresher

Imagine "Einzelnen" (German for "individual") is in some dataset. Differential privacy aims to make Einzelnen's presence or absence imperceptible to any kind of mathematical analysis. This is achieved by adding controlled noise to the results. 

Let's say we're calculating the average age. A differentially private calculation returns a result close to the true average, but with some added noise. This noise ensures that the result looks almost the same whether or not Einzelnen's record (or anyone else's, for that matter) is in the dataset, so their specific age cannot be derived from the final result.

The accuracy of calculations in differential privacy is affected by a parameter called epsilon (ε). A lower ε means stronger privacy guarantees but more noise, which can hurt accuracy. Using too small an epsilon can return a useless answer, like five times the true result, while too large an epsilon might just give the data away.
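To make the trade-off concrete, here is a minimal sketch of a differentially private average using the Laplace mechanism in plain NumPy. It is not Antigranular's API. The function name `dp_mean`, the age bounds (18 to 90), and the synthetic data are all assumptions made for illustration, and the dataset size is treated as public.

```python
import numpy as np


def dp_mean(values, lower, upper, epsilon, rng):
    """Noisy mean via the Laplace mechanism (illustrative helper, not a library function).

    Assumes the number of records is public and that every value
    can be clipped into the range [lower, upper].
    """
    clipped = np.clip(values, lower, upper)
    n = len(clipped)
    # Changing one record can move the mean by at most this much.
    sensitivity = (upper - lower) / n
    # Smaller epsilon -> larger noise scale -> stronger privacy, lower accuracy.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(clipped.mean() + noise)


rng = np.random.default_rng(seed=0)
ages = rng.integers(18, 91, size=1000)  # synthetic ages, not real data

print(f"true mean: {ages.mean():.2f}")
for epsilon in (0.001, 0.1, 10.0):
    noisy = dp_mean(ages, lower=18, upper=90, epsilon=epsilon, rng=rng)
    print(f"epsilon={epsilon}: noisy mean = {noisy:.2f}")
```

Running this a few times with different seeds should show the noisy mean wandering far from the true value at the smallest epsilon and hugging it at the largest, which is exactly the privacy versus accuracy tension described above.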

Because it offers such a strong privacy guarantee, differential privacy has applications in finance, economics, business, healthcare, and any other high-risk domain where even the people who handle and analyze the data can't be allowed to "peek inside".

## What is Antigranular?

## So, how to visualize DP datasets with Matplotlib?

![](images/diamonds.gif)

## Conclusion