# Hypothesis Testing with Antigranular: How to Unveil the Unknown Without Seeing the Data 🕵️‍♀️

This notebook aims to demonstrate effective ways in which you can explore unseen data within the Antigranular platform with hypothesis testing. 🧠

### What's in a Hypothesis? 🤔
When analysing data in a privacy-preserving way, it can feel like exploring a dark room with a flashlight. Antigranular represents your flashlight, a critical tool that allows you to make educated guesses or hypotheses with the information you have, without seeing the full picture.

Hypothesis testing is a systematic way of asking questions about our data to get the bigger picture. But instead of diving deep into specifics, which might be risky in terms of privacy, we ask broad questions relevant to the dataset, such as: "Do cars with higher maintenance costs generally have better safety ratings?", or "Is there a significant number of luxury cars with low valuations?".

With hypothesis testing, we're not pinpointing exact locations or exact numbers. Instead, we're gauging the general landscape, understanding broad patterns, and making informed decisions.

### Why Is This Essential for Data We Can't See? 👀
When working with private datasets using Antigranular, direct visibility is restricted. But that doesn't mean we are completely blind. Instead of seeing every intricate detail, we're getting a "broad strokes" view. It tells us about the major structures, patterns, and anomalies in our data without violating its sanctity and privacy.

## A Simple Approach with Differential Privacy 🚀
In this notebook, we will construct some sample private datasets and then create some tests to learn about their structure. These are mock datasets which you will create locally and can examine in plaintext to build up your intuition. This way, we'll showcase the power of hypothesis testing without compromising on data privacy.
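
To build some intuition for what a differentially private mean involves, here is a minimal local sketch using the Laplace mechanism. It is plain NumPy and an illustration only, not Antigranular's implementation. The helper name `noisy_mean` is hypothetical, and the sketch assumes the row count is public.

```python
import numpy as np

def noisy_mean(values, lower, upper, eps, rng):
    """Laplace-mechanism mean of values clipped to [lower, upper].

    Hypothetical helper for intuition only. Assumes the number of rows n is
    public, so changing one row moves the mean by at most (upper - lower) / n.
    """
    clipped = np.clip(np.asarray(values, dtype=float), lower, upper)
    n = len(clipped)
    sensitivity = (upper - lower) / n
    noise = rng.laplace(loc=0.0, scale=sensitivity / eps)
    return clipped.mean() + noise

rng = np.random.default_rng(0)
flags = rng.integers(0, 2, size=5000)  # mock 0/1 answers to a yes/no question
estimate = noisy_mean(flags, lower=0, upper=1, eps=0.05, rng=rng)
noise_scale = (1 - 0) / (len(flags) * 0.05)  # 0.004 for n=5000, eps=0.05
print(f"true mean={flags.mean():.4f}, noisy mean={estimate:.4f}, noise scale={noise_scale}")
```

With bounds of (0, 1) and thousands of rows, the noise added at a small eps is tiny compared with the proportion being measured, which is why yes/no questions are cheap to ask.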

## Getting Started: Install, Import & Connect to Antigranular






In [None]:
!pip install antigranular

In [None]:
import antigranular as ag
session = ag.login(<client_id>,<client_secret>, dataset = "The Wine Dataset")

Loading dataset "The Wine Dataset" to the kernel...
Dataset "The Wine Dataset" loaded to the kernel as the_wine_dataset
Connected to Antigranular server session id: cb34fa97-07ae-4c20-b432-7085187c2798, the session will time out if idle for 60 minutes
Cell magic '%%ag' registered successfully, use `%%ag` in a notebook cell to execute your python code on Antigranular private python server
🚀 Everything's set up and ready to roll!


OK. Let's create some examples:

## 📅 Example 1: Decoding the Dating Dilemma

The first example shows how messy dates can be to handle. They can be entered in many different formats, including but not limited to:

- {day as a number}-{month as a number}-{year as a number}
- {day as a number}-{month as a 3 char str}-{year as a number}
- {day as a number} {month as a 3 char str} {year as a number}
- {day as a number}/{month as a 3 char str}/{year as a number}
- {month as a number} {day as a number} {year as a number}

Americans often put the month first, while Europeans tend to place the day of the month first. Beyond that, slashes, dashes, spaces, or nothing at all are used to split up the parts of the datetime. All this confusion makes handling dates messy work.

So if you don't know how clean the dates in your dataset are, it is important to check their format before you rely on them.
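
As a small illustration of the day-first versus month-first problem, the sketch below parses the same string under both conventions with the standard library's `datetime.strptime`. The sample strings are made up for this example.

```python
from datetime import datetime

raw = "03-04-2023"  # could be 3 April (day first) or 4 March (month first)

day_first = datetime.strptime(raw, "%d-%m-%Y")
month_first = datetime.strptime(raw, "%m-%d-%Y")

print(day_first.date())    # 2023-04-03
print(month_first.date())  # 2023-03-04

# A string with a day above 12 only parses one way, which removes the ambiguity.
try:
    datetime.strptime("25-04-2023", "%m-%d-%Y")
    parses_month_first = True
except ValueError:
    parses_month_first = False
print(parses_month_first)  # False
```

Both parses succeed without any error for the first string, so a wrong assumption about the format can silently corrupt dates.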

Let's generate some dummy data to illustrate this:

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Number of dates
n = 5000

# Generate n consecutive dates, counting back from today
base = datetime.today()
date_list = [base - timedelta(days=x) for x in range(n)]

# Define a list of date formats
formats = [
    '%d-%m-%Y',   # day-month-year
    '%d%m%Y',     # ddmmyyyy
    '%B %d, %Y',  # month_str day year
    '%m/%d/%Y'    # month/day/year
]

# Convert date_list to a DataFrame with strings of random formats
df = pd.DataFrame({'Dates': [date.strftime(np.random.choice(formats)) for date in date_list]})

print(df.head())  # print first 5 rows for verification

        Dates
0  10/10/2023
1    09102023
2  08-10-2023
3    07102023
4  06-10-2023


OK. Next, let's upload it into Antigranular so we can simulate working with messy private data:

In [None]:
session.private_import(data=df, name='df')

dataframe cached to server, loading to kernel...
Output: Dataframe loaded successfully to the kernel



Pow! Now we have df in the %%ag environment. Next, let's wrap it in a PrivateDataFrame so it simulates something sensitive we would be working with:

In [None]:
%%ag
from op_pandas import PrivateDataFrame
pdf = PrivateDataFrame(df)

Ta da — we are all set!

Now we have this private dataframe, let's pretend we know it includes "Dates" as a column of type str, but we don't know the format. Let's start asking questions until we have a good grasp of what's inside:

In [None]:
%%ag
import pandas as pd

def is_dd_mm_yyyy_format(s):
    # Check whether the value matches '%d-%m-%Y' with a value-by-value test
    # that returns 1 if it's a match and 0 if it is not. This map costs us
    # no eps/delta because it is a transformation of the data with known
    # output bounds :)
    parts = s.split('-')

    # Check if there are exactly three components
    if len(parts) != 3:
        return 0

    day, month, year = parts

    # Check if day and month have two digits and year has four digits
    if len(day) == 2 and len(month) == 2 and len(year) == 4 and day.isdigit() and month.isdigit() and year.isdigit():
        return 1
    return 0

# Apply the function
is_dd_mm_yyyy_pdf = pdf.applymap(is_dd_mm_yyyy_format, output_bounds = {"Dates": (0, 1)})




Now, we can see approximately what proportion of the dates we correctly capture in the above check while spending only a little bit of eps:

In [None]:
%%ag
from ag_utils import ag_print

ag_print(is_dd_mm_yyyy_pdf["Dates"].mean(eps=0.05))

0.2534987652233559



Oh nice, so roughly a quarter of the dates are captured. We learned something neat and spent very little in the process.
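
The digit-count check above would also accept impossible dates such as `99-99-2023`. A stricter variant, sketched here as plain local Python, pins the layout with a regular expression and then lets `datetime.strptime` validate the calendar date. The name `is_valid_dd_mm_yyyy` is hypothetical, and whether `re` and `datetime` can be imported inside a `%%ag` cell is an assumption that has not been verified here.

```python
import re
from datetime import datetime

# Exactly two digits, a dash, two digits, a dash, four digits
DD_MM_YYYY = re.compile(r"\d{2}-\d{2}-\d{4}")

def is_valid_dd_mm_yyyy(s):
    """Return 1 if s is laid out as dd-mm-yyyy and is a real calendar date, else 0."""
    if DD_MM_YYYY.fullmatch(s) is None:
        return 0
    try:
        datetime.strptime(s, "%d-%m-%Y")
    except ValueError:
        return 0
    return 1

samples = ["08-10-2023", "99-99-2023", "31-02-2023", "09102023", "10/10/2023"]
print([is_valid_dd_mm_yyyy(s) for s in samples])  # [1, 0, 0, 0, 0]
```

The output is still 0 or 1, so the same output bounds of (0, 1) would apply if a check like this were mapped over the private column.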

Of course, we can extend this to check some other date formats:

In [None]:
%%ag

# Check if the format is '%d-%m-%Y'
def is_day_month_year_format(s):
    parts = s.split('-')
    if len(parts) != 3:
        return 0
    day, month, year = parts
    if len(day) == 2 and len(month) == 2 and len(year) == 4 and day.isdigit() and month.isdigit() and year.isdigit():
        return 1
    return 0

# Check if the format is '%d%m%Y'
def is_ddmmyyyy_format(s):
    if len(s) != 8:
        return 0
    day, month, year = s[:2], s[2:4], s[4:]
    if day.isdigit() and month.isdigit() and year.isdigit():
        return 1
    return 0

# Check if the format is '%B %d, %Y'
def is_month_str_day_year_format(s):
    if ", " not in s:
        return 0
    parts = s.split(", ")
    if len(parts) != 2:
        return 0
    month_day, year = parts
    if len(year) != 4 or not year.isdigit():
        return 0
    month_parts = month_day.split(" ")
    if len(month_parts) != 2:
        return 0
    month, day = month_parts
    if month in ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'] and day.isdigit() and len(day) <= 2:
        return 1
    return 0

def combine_known_dates(s):
    return is_day_month_year_format(s) + is_ddmmyyyy_format(s) + is_month_str_day_year_format(s)

known_date = pdf.applymap(combine_known_dates, output_bounds = {"Dates": (0, 1)})

Notice we didn't cover all the date formats we originally created, just 3 of the 4 (there is no check for `%m/%d/%Y`). Let's see what our hypothesis test gives us now:

In [None]:
%%ag
ag_print(known_date["Dates"].mean(eps=0.05))

0.7610117253737445



Well, that looks about right! Out of curiosity, let's compare this to the true proportion using the original (non-private) dataframe:

In [None]:
%%ag
known_date_real = df.applymap(combine_known_dates)
ag_print(known_date_real["Dates"].mean())

0.7436






I mean, that is certainly close enough! Now we have a really solid feedback loop, which is ultimately what we wanted.
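
The checks above do not include the `%m/%d/%Y` format that the mock data also contains. A checker for it, written in the same style as the others, might look like the sketch below. The function name `is_mm_slash_dd_yyyy_format` is hypothetical, and it is shown here as plain local Python rather than inside a `%%ag` cell.

```python
# Check if the format is '%m/%d/%Y'
def is_mm_slash_dd_yyyy_format(s):
    parts = s.split('/')
    if len(parts) != 3:
        return 0
    month, day, year = parts
    if len(month) == 2 and len(day) == 2 and len(year) == 4 and month.isdigit() and day.isdigit() and year.isdigit():
        return 1
    return 0

samples = ["10/10/2023", "09102023", "08-10-2023", "October 07, 2023"]
print([is_mm_slash_dd_yyyy_format(s) for s in samples])  # [1, 0, 0, 0]
```

Adding a term like this to `combine_known_dates` and asking for the mean again should push the proportion towards 1, at the cost of another small slice of eps.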

## 🖥️ Example 2: Let's Write an Algorithm!

Now that we've got the basics, we're ready to write an algorithm that estimates the max value in a set of integers.

There are obviously multiple ways we could do this, but I'll opt for a simple approach: a binary search over a linear space that performs `k` checks. The objective is to get as close as reasonably possible to the (differentially private) max value:

In [None]:
import pandas as pd
import numpy as np

# Constants
LOWER_BOUND = -2**32
UPPER_BOUND = 2**32

# Randomly determine the sample space bounds
sample_lower_bound = np.random.randint(LOWER_BOUND, UPPER_BOUND)
sample_upper_bound = np.random.randint(sample_lower_bound, UPPER_BOUND)

# Sample uniformly random numbers in the defined space
n_samples = 1000  # Example sample size
df = pd.DataFrame({
    'Values': np.random.randint(sample_lower_bound, sample_upper_bound, n_samples)
})

print(df.head())

      Values
0 -211027635
1 -162139639
2 -192110595
3 -235198290
4 -166922690


In [None]:
session.private_import(data=df, name='df')

dataframe cached to server, loading to kernel...
Output: Dataframe loaded successfully to the kernel



In [None]:
%%ag
pdf = PrivateDataFrame(df.astype(int), metadata={'Values': (-2**32, 2**32)})

So, we have this mysterious top secret dataset of ints whose true range is unknown. I want to binary search the allowed range, halving the step size at every check, to estimate the max:

In [None]:
%%ag

# Constants
LOWER_BOUND = -2**32
UPPER_BOUND = 2**32

current_value = 0
step_size = (UPPER_BOUND - LOWER_BOUND) / 4.0

# Number of search steps
k = 64

# eps spent per step (used k times, once per step)
eps = 0.02

# Threshold to avoid noise effects (pick something low-ish)
t = 0.02

for i in range(k):
    # Flag every value that is greater than the current estimate
    test = pdf > current_value
    # Noisy fraction of flagged values: near 0 means the estimate is at or above the max
    result = test.mean(eps)

    if result[0] > t:
        # Enough values still lie above the estimate, so move up
        current_value = current_value + step_size
    else:
        # Almost nothing lies above the estimate, so move down
        current_value = current_value - step_size

    # Halve the step, as in binary search
    step_size = step_size / 2.

In [None]:
%%ag
ag_print("max value estimated as: " + str(current_value))

max value estimated as: -151612985.908081
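
It is worth tallying what the loop above costs. The sketch below is plain arithmetic on the constants from that cell. It assumes the per-step budgets simply add up (sequential composition), and it counts how many halvings are needed before the step size falls below 1, the resolution of integer data.

```python
# Values copied from the search cell above
LOWER_BOUND = -2**32
UPPER_BOUND = 2**32
k = 64
eps = 0.02

# Assumption: budgets of successive queries add up (sequential composition)
total_eps = k * eps
initial_step = (UPPER_BOUND - LOWER_BOUND) / 4.0

# Count the steps whose step size is at least 1
steps_to_integer_resolution = 0
step = initial_step
while step >= 1:
    steps_to_integer_resolution += 1
    step /= 2.0

print(f"total eps for k={k} steps: {total_eps:.2f}")
print(steps_to_integer_resolution)  # 32
print(f"eps if stopped at integer resolution: {steps_to_integer_resolution * eps:.2f}")
```

Under these assumptions, every step after the 32nd moves the estimate by less than 1, so a smaller `k` would reach the same integer resolution for about half the budget.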



Let's check out what the true max actually was:

In [None]:
df.max()

Values   -151861469
dtype: int64

That's pretty close, right?
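
The search can also be rehearsed locally before spending any real budget. The sketch below is a simulation under stated assumptions: Laplace noise with scale 1 / (n * eps) stands in for whatever mechanism Antigranular applies, the helper names `noisy_fraction_above` and `estimate_max` are hypothetical, and the mock dataset uses 100,000 rows so that the noise is small compared with the threshold `t`.

```python
import numpy as np

def noisy_fraction_above(values, threshold, eps, rng):
    """Fraction of values above threshold plus Laplace noise (assumed mechanism, public n)."""
    n = len(values)
    true_fraction = np.mean(values > threshold)
    return true_fraction + rng.laplace(loc=0.0, scale=1.0 / (n * eps))

def estimate_max(values, lower, upper, k, eps, t, rng):
    """Binary search with a halving step, driven by noisy yes/no proportions."""
    current = 0.0
    step = (upper - lower) / 4.0
    for _ in range(k):
        if noisy_fraction_above(values, current, eps, rng) > t:
            current += step  # enough values still lie above: move up
        else:
            current -= step  # almost nothing lies above: move down
        step /= 2.0
    return current

rng = np.random.default_rng(42)
values = rng.integers(-1_000_000, 1_000_000, size=100_000)  # mock data
estimate = estimate_max(values, -2**32, 2**32, k=40, eps=0.02, t=0.02, rng=rng)
print(f"estimated max={estimate:.0f}, true max={values.max()}")
```

Because the loop keeps moving up only while more than a fraction `t` of the values lie above the estimate, it homes in on roughly the (1 - t) quantile rather than the strict maximum. Under the assumed mechanism, 1,000 rows at eps = 0.02 give a noise scale of 0.05, which is larger than `t`, so individual runs on a dataset that small can vary noticeably.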

## 🎁 The Wrap Up

**Summary:**

The notebook emphasises the importance of hypothesis testing for private datasets since direct data visibility is often restricted. It also introduces Differential Privacy as a method to ensure data privacy.

The notebook provides two examples to guide readers through the process:

1. **Exploring Date Formats**: The first example dealt with the complexities of different date formats (especially between American and European styles) and the challenges in standardising them. The notebook walked through the steps to generate dummy date data, import it to Antigranular, and conduct hypothesis tests to identify the proportion of data in specific date formats.

2. **Algorithm to Identify Max Value**: The second example showcased the use of yes/no (binary) tests and differential privacy to estimate the maximum value in a dataset of integers. A binary search with a halving step size was employed to estimate this max value.

**Conclusion:**

Hypothesis testing provides a systematic and privacy-preserving way to gather insights from data. The Antigranular tool and the approach of differential privacy help to strike a balance between data exploration and ensuring data confidentiality. The examples provided above serve as practical demonstrations of how one can employ hypothesis testing techniques in real-world scenarios, highlighting its potential applications in diverse fields.