# Example using Subsample and Aggregate

This notebook computes the same aggregate statistics as the `carrots_demo.ipynb` notebook, but uses subsample and aggregate to answer the queries. Subsample and aggregate is a prominent DP mechanism that reduces the effect of any individual data point. It works by splitting the dataset into n samples, computing the desired query on each sample, aggregating the per-sample results, and then adding noise scaled to the sensitivity of those per-sample results.
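The steps described above can be sketched in plain NumPy for a bounded mean. This is a minimal illustration rather than what PyDP does internally: the function name `subsample_and_aggregate_mean`, its parameters, and the choice of Laplace noise are assumptions made for this sketch, and it assumes neighbouring datasets differ by replacing one record.

```python
import numpy as np


def subsample_and_aggregate_mean(values, n_blocks, lower, upper, epsilon, rng=None):
    """Hypothetical sketch: noisy average of per-block means."""
    rng = np.random.default_rng() if rng is None else rng
    # Clamp the inputs so every block mean is guaranteed to lie in [lower, upper].
    data = np.clip(np.asarray(values, dtype=float), lower, upper)
    if n_blocks < 1 or n_blocks > len(data):
        raise ValueError("n_blocks must be between 1 and the number of values")
    # 1. Split into disjoint blocks (sizes differ by at most one row).
    blocks = np.array_split(data, n_blocks)
    # 2. Run the query (here: the mean) on each block.
    block_means = np.array([block.mean() for block in blocks])
    # 3. Aggregate the per-block answers.
    aggregate = block_means.mean()
    # 4. One record touches one block, so the aggregate moves by at most
    #    (upper - lower) / n_blocks; scale the Laplace noise to that bound.
    sensitivity = (upper - lower) / n_blocks
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(aggregate + noise)


# Example call on made-up values; the result is random, so none is shown.
noisy_mean = subsample_and_aggregate_mean(range(1, 101), 10, 1, 100, epsilon=0.7)
```

The rest of this notebook takes a slightly different route: it computes the per-sample results by hand and passes them to PyDP's algorithms, which add the noise.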

In [None]:
# Install the PyDP package
! pip install python-dp

In [3]:
import pydp as dp # by convention our package is to be imported as dp (for Differential Privacy!)
from pydp.algorithms.laplacian import BoundedSum, BoundedMean, Count, Max
import pandas as pd
import statistics # for calculating mean without applying differential privacy

## Data

Each row in `animals_and_carrots.csv` is composed of the name of an animal, and
the number of carrots it has eaten, comma-separated.


In [4]:
# get carrots data from our public github repo
url = 'https://raw.githubusercontent.com/OpenMined/PyDP/dev/examples/carrots_demo/animals_and_carrots.csv'
df = pd.read_csv(url,sep=",", names=["animal", "carrots_eaten"])
df.head()

Unnamed: 0,animal,carrots_eaten
0,Aardvark,1
1,Albatross,88
2,Alligator,35
3,Alpaca,99
4,Ant,69


In [10]:
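def create_samples(dataframe, n):
    """Split `dataframe` into consecutive, non-overlapping samples of len(dataframe) // n rows."""
    samples = []
    length = len(dataframe)
    sample_size = length // n
    # Note: if length is not a multiple of n, the leftover rows form one extra, shorter sample.
    for start in range(0, length, sample_size):
        samples.append(dataframe[start:start + sample_size])
    return samples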

In [12]:
df_samples=create_samples(df,10)
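Because `create_samples` steps through the frame in chunks of `len(dataframe) // n` rows, a row count that is not a multiple of n leaves a short trailing sample. As an alternative, here is a small sketch that always returns exactly n samples whose sizes differ by at most one row. The helper name `split_into_n` and the toy frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def split_into_n(frame, n):
    """Hypothetical helper: split `frame` into exactly n consecutive samples."""
    # Split the row positions rather than the frame itself, then select by position.
    position_blocks = np.array_split(np.arange(len(frame)), n)
    return [frame.iloc[positions] for positions in position_blocks]


# Toy frame with a row count that is not a multiple of 10.
toy = pd.DataFrame({"carrots_eaten": range(182)})
toy_samples = split_into_n(toy, 10)
sizes = [len(sample) for sample in toy_samples]
```

More evenly sized samples keep any single sample from having an outsized influence on an unweighted aggregate such as the mean of means.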

As a baseline, we first take the mean of all the entries without applying the DP library. This is the actual mean of all the records.

## Per-animal Privacy

Notice that each animal owns at most one row in the data. This means that we
provide per-animal privacy. Suppose that some animal appears multiple times in
the csv file. That animal would own more than one row in the data. In this case,
using this DP library would not guarantee per-animal privacy! The animals would
first have to pre-process their data in a way such that each animal doesn't own
more than one row.
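One possible way to do that pre-processing is to collapse repeated animals into a single row before running any query. The sketch below uses a pandas `groupby`; the helper name `one_row_per_animal`, the toy data, and the choice to sum the duplicates are assumptions for illustration only.

```python
import pandas as pd


def one_row_per_animal(frame):
    """Hypothetical helper: merge repeated animals into a single row each."""
    # Summing is one choice; taking the max or the first row would also work.
    return frame.groupby("animal", as_index=False)["carrots_eaten"].sum()


# Toy data in which "Ant" appears twice.
toy = pd.DataFrame(
    {"animal": ["Ant", "Ant", "Alpaca"], "carrots_eaten": [3, 4, 99]}
)
deduped = one_row_per_animal(toy)
```

Note that summing can push a merged row past the bounds used later in the private queries (1 and 100 here), in which case the value would be clamped.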


<h2>Mean</h2>

In [22]:
# calculates mean without applying differential privacy
def mean_carrots() -> float:

    return statistics.mean(list(df["carrots_eaten"]))

In [23]:
# calculates the mean of each subsample without applying differential privacy
def subsample_mean_carrots() -> list:
    sample_means = []
    for sample in df_samples:
        sample_means.append(statistics.mean(list(sample["carrots_eaten"])))
    return sample_means

In [27]:
print("Mean of entire dataframe:",mean_carrots())

Mean of entire dataframe: 53.01648351648352


In [29]:
subsample_mean=subsample_mean_carrots()
print("Mean of means calculated on subsamples:",statistics.mean(subsample_mean))

Mean of means calculated on subsamples: 51.43939393939394
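The mean of means differs from the mean of the entire dataframe. One likely reason is that an unweighted mean of means gives every subsample equal weight, so a subsample with fewer rows counts as much as a full one. The sketch below, on made-up numbers, shows that weighting each subsample mean by its size recovers the overall mean. The function name `weighted_mean_of_means` is hypothetical.

```python
import statistics


def weighted_mean_of_means(samples):
    """Hypothetical helper: average the sample means, weighted by sample size."""
    total_rows = sum(len(sample) for sample in samples)
    weighted_sum = sum(statistics.mean(sample) * len(sample) for sample in samples)
    return weighted_sum / total_rows


# Two full samples and one short trailing sample (made-up values).
toy_samples = [[1, 2, 3, 4], [5, 6, 7, 8], [100]]
all_values = [value for sample in toy_samples for value in sample]

overall = statistics.mean(all_values)
unweighted = statistics.mean(statistics.mean(sample) for sample in toy_samples)
weighted = weighted_mean_of_means(toy_samples)
```

Here `weighted` matches `overall`, while `unweighted` is pulled toward the short sample. This is only meant to explain the gap: in subsample and aggregate the unweighted aggregate is typically kept, because it is what bounds the influence of any single subsample.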


In [34]:
# calculates mean applying differential privacy
def private_mean(privacy_budget: float) -> float:
    x = BoundedMean(privacy_budget, 1, 100,dtype="float")
    return x.quick_result(subsample_mean)

In [35]:
private_mean(0.7)

64.23598258095826

<h2>Count Above</h2>

In [38]:
# Calculates number of animals who ate more than "limit" carrots without applying differential privacy.
def count_above(limit: int) -> int:
    return len(df[df.carrots_eaten > limit])

In [39]:
# counts, per subsample, the animals who ate more than "limit" carrots, without applying differential privacy
def subsample_count_above(limit: int) -> list:
    sample_counts = []
    for sample in df_samples:
        sample_counts.append(len(sample[sample.carrots_eaten > limit]))
    return sample_counts

In [40]:
print("Count above 70: ",count_above(70))

Count above 70:  65


In [44]:
subsample_count=subsample_count_above(70)
print("Subsample count above 70: ",sum(subsample_count))

Subsample count above 70:  65


In [7]:
# Calculates number of animals who ate more than "limit" carrots applying differential privacy.
def private_count_above(privacy_budget: float, limit: int) -> int:
    x = Count(privacy_budget, dtype="int")
    return x.quick_result(list(df[df.carrots_eaten > limit]["carrots_eaten"]))

In [8]:
print("Above 70:\t" + str(count_above(70)))
print("private count above:\t" + str(private_count_above(1, 70)))

Above 70:	65
private count above:	64


<h2>Max</h2>

In [46]:
# Function to return the maximum of the number of carrots eaten by any one animal without applying differential privacy.
# Note: this name shadows Python's built-in max() for the rest of the notebook.
def max() -> int:
    return int(df["carrots_eaten"].max())

In [47]:
# calculates the maximum of each subsample without applying differential privacy
def subsample_count_max() -> list:
    sample_maxes = []
    for sample in df_samples:
        sample_maxes.append(int(sample["carrots_eaten"].max()))
    return sample_maxes

In [48]:
print("Max: ",max())

Max:  100


In [10]:
# Function to return the maximum of the number of carrots eaten by any one animal applying differential privacy.
def private_max(privacy_budget: float) -> int:
    # 0 and 100 are the lower and upper limits for the search bound.
    x = Max(privacy_budget, 0, 100, dtype="int")
    return x.quick_result(list(df["carrots_eaten"]))

In [11]:
print("Max:\t" + str(max()))
print("private max:\t" + str(private_max(1)))

Max:	100
private max:	78.0


<h2>Sum</h2>

In [58]:
# Function to calculate sum of carrots eaten without applying differential privacy.
def sum_carrots() -> int:
    return int(df["carrots_eaten"].sum())

In [59]:
# calculates the sum of each subsample without applying differential privacy
def subsample_sum() -> list:
    sample_sums = []
    for sample in df_samples:
        sample_sums.append(int(sample["carrots_eaten"].sum()))
    return sample_sums

In [60]:
print("Sum:\t" + str(sum_carrots()))

Sum:	9649


In [61]:
samples_sum=subsample_sum()
print("Sum of subsamples",sum(samples_sum))

Sum of subsamples 9649


In [50]:
# Function to calculate sum of carrots eaten applying differential privacy.
def private_sum(privacy_budget: float) -> int:
    x = BoundedSum(privacy_budget,1,100, dtype="float")
    return x.quick_result(list(df["carrots_eaten"]))