An example of how to use this library.

In [1]:
from anonymization import *

The Netflix dataset is a well-known example of a dataset that was de-anonymized (http://arxivblog.com/?p=142), so we will use it to demonstrate our library.
We source the data from Kaggle (https://www.kaggle.com/netflix-inc/netflix-prize-data?select=combined_data_1.txt).

In [2]:
import pandas as pd
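# Read the file as a dataframe, replacing the first row with the column names ['CustomerID','Rating','Date']
df = pd.read_csv('NetflixDataset/combined_data_1.txt', header=0, names=['CustomerID','Rating','Date'])
# Keep only rows with a whole-number Rating; this drops rows without a rating (such as the movie-ID marker lines in this file)
is_int = df.Rating.apply(lambda x: x.is_integer())
# Copy the filtered frame so the assignment below does not trigger a SettingWithCopyWarning
df = df[is_int].copy()
# Convert Rating column to integer
df['Rating'] = df['Rating'].astype(int)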

In [3]:
# Display the dataframe (pandas shows only the first and last few rows)
df

Unnamed: 0,CustomerID,Rating,Date
0,1488844,3,2005-09-06
1,822109,5,2005-05-13
2,885013,4,2005-10-19
3,30878,4,2005-12-26
4,823519,3,2004-05-03
...,...,...,...
24058257,2591364,2,2005-02-16
24058258,1791000,2,2005-02-10
24058259,512536,5,2005-07-27
24058260,988963,3,2005-12-20


In [4]:
# Using ['Rating','Date'] as quasi-identifiers, test the k-anonymity of the dataframe
smallest_k = get_k(df, ['Rating','Date'])
print("The smallest k-anonymity is:", smallest_k)

The smallest k-anonymity is: 1
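
To make the meaning of k concrete, here is a minimal sketch of how such a value can be computed with plain pandas. The helper name `k_anonymity_sketch` and the toy data are hypothetical, and this is not necessarily how the library's `get_k` is implemented; it only illustrates the usual definition, where k is the size of the smallest group of rows sharing the same quasi-identifier values.

```python
import pandas as pd

def k_anonymity_sketch(frame, quasi_identifiers):
    """Hypothetical helper: size of the smallest equivalence class."""
    # Each distinct combination of quasi-identifier values forms one equivalence class
    class_sizes = frame.groupby(quasi_identifiers).size()
    return int(class_sizes.min())

# Toy data (made up for illustration): two classes of size 2 and one of size 1
toy = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5],
    'Rating': [3, 3, 5, 5, 1],
    'Date': ['2005-01-01', '2005-01-01', '2005-01-02', '2005-01-02', '2005-01-03'],
})
toy_k = k_anonymity_sketch(toy, ['Rating', 'Date'])
print("Toy k:", toy_k)
```

In the toy table, the single row with Rating 1 on 2005-01-03 sits alone in its class, so the whole table is only 1-anonymous.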


A k of 1 means that at least one (Rating, Date) combination appears in only a single row, i.e. only one person gave that rating on that date. To see which combinations these are, we can use another function:

In [5]:
# Get the equivalence classes whose size equals the smallest k
equiv_classes_k = smallest_classes(df, ['Rating','Date'])
# Show the classes where KCount == smallest_k
equiv_classes_k

Unnamed: 0,Rating,Date,KCount
2,1,1999-12-17,1
4,1,1999-12-22,1
5,1,1999-12-24,1
6,1,1999-12-25,1
7,1,1999-12-26,1
8,1,1999-12-27,1
9,1,1999-12-29,1
2171,2,1999-11-11,1
2172,2,1999-12-09,1
2174,2,1999-12-12,1
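
A table like the one above can be approximated with a pandas groupby. The following is a rough sketch under assumptions: `smallest_classes_sketch` and the toy data are hypothetical, and the library's `smallest_classes` may be implemented differently. Only the `KCount` column name is taken from the output shown above.

```python
import pandas as pd

def smallest_classes_sketch(frame, quasi_identifiers):
    """Hypothetical helper: equivalence classes whose size equals the minimum."""
    # Count the rows in every equivalence class and store the count as KCount
    counts = (frame.groupby(quasi_identifiers)
                   .size()
                   .reset_index(name='KCount'))
    # Keep only the classes that are as small as the smallest one
    return counts[counts['KCount'] == counts['KCount'].min()]

# Toy data (made up for illustration)
toy = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5],
    'Rating': [3, 3, 5, 5, 1],
    'Date': ['2005-01-01', '2005-01-01', '2005-01-02', '2005-01-02', '2005-01-03'],
})
toy_smallest = smallest_classes_sketch(toy, ['Rating', 'Date'])
print(toy_smallest)
```

Filtering a count table this way keeps the original row labels of the count table, which would also explain the non-consecutive index values in the output above.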


We can also easily get the distinct l-diversity value for the table:

In [6]:
# Get l-distinct of the dataframe
smallest_l = get_l_distinct(df, ['Rating','Date'], 'CustomerID')
print("The table is distinct-", smallest_l, " diverse")

The table is distinct- 1  diverse
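
For reference, distinct l-diversity is usually defined as the smallest number of distinct sensitive values found in any equivalence class. The sketch below illustrates that definition with plain pandas; `distinct_l_sketch` and the toy data are hypothetical, and the library's `get_l_distinct` is not guaranteed to work this way.

```python
import pandas as pd

def distinct_l_sketch(frame, quasi_identifiers, sensitive):
    """Hypothetical helper: minimum number of distinct sensitive values per class."""
    # Count the distinct sensitive values inside each equivalence class
    distinct_per_class = frame.groupby(quasi_identifiers)[sensitive].nunique()
    return int(distinct_per_class.min())

# Toy data (made up): the first class has customers {1, 2}, the second {3, 4, 5}
toy = pd.DataFrame({
    'CustomerID': [1, 2, 2, 3, 4, 5],
    'Rating': [3, 3, 3, 5, 5, 5],
    'Date': ['2005-01-01'] * 3 + ['2005-01-02'] * 3,
})
toy_l = distinct_l_sketch(toy, ['Rating', 'Date'], 'CustomerID')
print("Toy distinct l:", toy_l)
```

Any class containing a single row has exactly one sensitive value, so a table with k = 1 can never be more than distinct-1 diverse.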


We can also get the entropy l-diversity of the table:

In [7]:
# Get l entropy of the dataframe
smallest_l_entropy = get_l_entropy(df, ['Rating','Date'], 'CustomerID')
print("The table is entropy-", smallest_l_entropy," diverse")

The table is entropy- 1.0  diverse
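
Entropy l-diversity requires the entropy of the sensitive values in every equivalence class to be at least log(l), so the largest l a table satisfies is exp of the smallest class entropy. The sketch below follows that definition. It is an assumption that `get_l_entropy` reports the value this way, although a result of 1.0 is consistent with it (a single-row class has entropy 0, and exp(0) = 1). The helper names and toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def class_entropy(values):
    """Shannon entropy (natural log) of the values in one equivalence class."""
    p = values.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log(p)).sum())

def entropy_l_sketch(frame, quasi_identifiers, sensitive):
    """Hypothetical helper: exp of the smallest per-class entropy."""
    entropies = frame.groupby(quasi_identifiers)[sensitive].apply(class_entropy)
    return float(np.exp(entropies.min()))

# Toy data (made up): one class is uniform over 2 customers, the other is skewed
toy = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 4],
    'Rating': [3, 3, 5, 5, 5],
    'Date': ['2005-01-01'] * 2 + ['2005-01-02'] * 3,
})
toy_entropy_l = entropy_l_sketch(toy, ['Rating', 'Date'], 'CustomerID')
print("Toy entropy l:", round(toy_entropy_l, 3))
```

The skewed class pulls the value below 2 even though both classes contain two distinct customers, which is the difference between entropy and distinct l-diversity.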


We can also compute t-closeness using the equal ground distance, here on a 1% random sample of the table:

In [9]:
# Random sample of df without replacement
df_sample = df.sample(frac=.01, replace=False)
t_closeness_egd = get_t_closeness(df_sample, ['Rating','Date'], 'CustomerID',0)
# print(t_closeness_egd)
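
With the equal ground distance, the Earth Mover's Distance between two distributions reduces to half the sum of the absolute differences of their probabilities. The t value of a table is then the largest such distance between any equivalence class and the whole table. The following sketch illustrates that calculation only; `t_closeness_equal_sketch` and the toy data are hypothetical, and neither the internals of `get_t_closeness` nor the meaning of its final argument are documented here.

```python
import pandas as pd

def t_closeness_equal_sketch(frame, quasi_identifiers, sensitive):
    """Hypothetical helper: max equal-ground-distance EMD between a class and the table."""
    # Distribution of the sensitive attribute over the whole table
    overall = frame[sensitive].value_counts(normalize=True)

    def distance(values):
        # Class distribution, aligned to the table-wide values (missing values get 0)
        p = values.value_counts(normalize=True).reindex(overall.index, fill_value=0.0)
        # Equal ground distance: half the L1 distance between the two distributions
        return 0.5 * float((p - overall).abs().sum())

    distances = frame.groupby(quasi_identifiers)[sensitive].apply(distance)
    return float(distances.max())

# Toy data (made up): customer 1 makes up 3/4 of the table, customer 2 makes up 1/4
toy = pd.DataFrame({
    'CustomerID': [1, 2, 1, 1],
    'Rating': [3, 3, 5, 5],
    'Date': ['2005-01-01', '2005-01-01', '2005-01-02', '2005-01-02'],
})
toy_t = t_closeness_equal_sketch(toy, ['Rating', 'Date'], 'CustomerID')
print("Toy t:", toy_t)
```

A smaller t means every class looks more like the table as a whole; a value of 0 would mean each class has exactly the table-wide distribution.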