## 3.2 Anonymizing your dataset - 20 marks
- Goals: The goal of k-anonymity is to modify a dataset such that any given record cannot be distinguished from at least k−1 other records regarding certain "quasi-identifier" attributes.
- Our identified quasi-identifiers from 3.1: 'region', 'gender', 'age', 'height', 'weight', 'eat', 'schedule', 'howlong'

### Starting point: get a k-anonymized data set as a base
- To k-anonymize our given data set, we are using the Mondrian Multidimensional K-Anonymity approach to build on later for t-closeness and l-diversity.
- similar to other k-anonymity approaches, a simple and efficient greedy approximation algorithm is implemented to reduce complexity.

Because of runtime issues (the athletes.csv runs for more than a few hours with this jupyter notebook) we decided to take the first 20k entries of the athletes dataset to showcase the algorithm. 
We start by implementing a few functions which will then later be used to k=3 anonymize our dataset: 

In [91]:
import pandas as pd

df = pd.read_csv("athletes.csv", index_col=False, low_memory=False)
print(df.shape[0])

# Select the first 10,000 rows
df_reduced = df.iloc[:20000, :]

# Save the reduced dataset to a new CSV file
df_reduced.to_csv("reduced_athletes.csv", index=False)

423006


In [92]:
#Some adjustments on the dataset
df = pd.read_csv("reduced_athletes.csv", index_col=False, low_memory=False)

df['age'].fillna(0, inplace=True)
df['height'].fillna(0, inplace=True)
df['weight'].fillna(0, inplace=True)

df['region'].fillna('', inplace=True)
df['gender'].fillna('', inplace=True)

In [93]:
#Calculates and returns the spans (range of values) for each column in a specified partition of a dataframe, with an option to scale these spans by provided values.
def get_spans(df, partition, scale=None):
    spans = {}
    for feature_column in quasi_identifiers:
        if feature_column in categorical:
            span = len(df[feature_column][partition].unique())
        else:
            span = df[feature_column][partition].max() - df[feature_column][partition].min()
        if scale is not None:
            span = span / scale[feature_column]
        spans[feature_column] = span
    return spans

In [94]:
#Divides a specified partition of a dataframe into two parts based on the median or unique values of a given column, returning a tuple with the indices of these two parts.
def split(df, partition, column):
    dfp = df[column][partition]
    if column in categorical:
        values = dfp.unique()
        lv = set(values[:len(values) // 2])
        rv = set(values[len(values) // 2:])
        return dfp.index[dfp.isin(lv)], dfp.index[dfp.isin(rv)]
    else:
        median = dfp.median()
        dfl = dfp.index[dfp < median]
        dfr = dfp.index[dfp >= median]
        return (dfl, dfr)

In [95]:
#Checks if a partition is k-anonymous by comparing its amount of entries with the required (k).
def is_k_anonymous(df, partition, sensitive_column, k=3):
    if len(partition) < k:
        return False
    return True

In [96]:
#Partitions a dataframe into valid subsets based on specified feature columns, a sensitive column, and span scales, using a validity function to ensure each partition meets certain criteria.
def partition_dataset(df, feature_columns, sensitive_column, scale, is_valid):
    finished_partitions_temp = []
    partitions = [df.index]
    while partitions:
        partition = partitions.pop(0)
        spans = get_spans(df[feature_columns], partition, scale)
        for column, span in sorted(spans.items(), key=lambda x: -x[1]):
            lp, rp = split(df, partition, column)
            if not is_valid(df, lp, sensitive_column) or not is_valid(df, rp, sensitive_column):
                continue
            partitions.extend((lp, rp))
            break
        else:
            finished_partitions_temp.append(partition)
    return finished_partitions_temp

In [97]:
#Aggregates the values of a series with categorical values by concatenating them.
def agg_categorical_column(series):
    # Check if the series is empty or if mode() returns an empty series
    if series.empty or series.mode().empty:
        return None  # or some default value or placeholder
    else:
        return series.mode().iloc[0]  # access the first element of mode
    

In [98]:
def agg_numerical_column(series):
    return series.mean()

In [99]:
#Constructs an anonymized dataset by aggregating feature columns and sensitive columns separately for each partition.
def build_anonymized_dataset(df, partitions, feature_columns, sensitive_columns,
                                                  max_partitions=None):
    aggregations = {}
    for column in feature_columns:
        if column in categorical:
            aggregations[column] = agg_categorical_column
        else:
            aggregations[column] = agg_numerical_column
    rows = []
    for i, partition in enumerate(partitions):
        if max_partitions is not None and i > max_partitions:
            break
        grouped_columns = df.loc[partition].agg(aggregations, squeeze=False)
        values = grouped_columns.to_dict()
        # Iterate through each sensitive column and aggregate counts
        for sensitive_column in sensitive_columns:
            sensitive_counts = df.loc[partition].groupby(sensitive_column).agg({sensitive_column: 'count'})
            for sensitive_value, count in sensitive_counts[sensitive_column].items():
                if count == 0:
                    continue
                sensitive_values = values.copy()
                sensitive_values.update({
                    sensitive_column: sensitive_value,
                    'count': count,
                })
                rows.append(sensitive_values)
    return pd.DataFrame(rows)

### Starting to k-anonymize with k=3
1. First, we define our quasi-identifiers that we chose in 3.1. 
2. Secondly, we divide our quasi identifiers in two categories
   - categorical: schedule, howlong, region, gender, eat
      - These attributes need to be treated differently because they cant be compared like numerical attributes
   - numerical: age, height, weight
      - These attributes need no special treatment
3. We then start with k-anonymizing our dataset by calculating the spans of our dataframe and passing the result on to partition our dataset
4. the partitions then get aggregated with respect to our categorical and numerical attributes 
5. the finished dataframe then gets saved to a new .csv 

In [128]:
#categorical attributes that need to be treated in another way than numerical
categorical = {'schedule', 'howlong', 'region', 'gender', 'eat'}
for name in categorical:
    df[name] = df[name].astype('category')
    
#Our quasi identifiers that we identified earlier
quasi_identifiers = ['region', 'gender', 'age', 'height', 'weight', 'eat', 'schedule', 'howlong']

# sensitive-values that should be taken into account
sensitive_columns = ['athlete_id', 'fran', 'helen', 'grace', 'filthy50', 'fgonebad', 'run400', 'run5k', 'candj', 'snatch', 'deadlift',
                     'backsq', 'pullups']

In [131]:
full_spans = get_spans(df, df.index)
finished_partitions = partition_dataset(df, quasi_identifiers, sensitive_columns, full_spans, is_k_anonymous)
print(len(finished_partitions))

dfn = build_anonymized_dataset(df, finished_partitions, quasi_identifiers, sensitive_columns)

dfn.to_csv("k_anon.csv")

2910


### Extending the k-anonymized dataset to also be l-diverse
- l-diversity requires that for every group of records sharing  quasi-identifier attributes, there are at least 'l' "well-represented" values for the sensitive attribute.
- l=2 diversity: Specifically, this means that in each group of records with the same quasi-identifiers, there should be at least two distinct values for the sensitive attributes. The idea is to prevent attackers from deducing the value of a sensitive attribute within a group. 


In [102]:
#Calculates the diversity of a partition of a dataframe on a given column.
def diversity(df, partition, column):
    return len(df[column][partition].unique())

def is_l_diverse(df, partition, sensitive_columns, l=2):
       for sensitive_column in sensitive_columns:
        if diversity(df, partition, sensitive_column) < l:      #if partition is not l-diverse
            return False
        return True

In [103]:
df_ldiverse = pd.read_csv("k_anon.csv", index_col=False, low_memory=False)
full_spans = get_spans(df_ldiverse, df_ldiverse.index)

finished_l_diverse_partitions = partition_dataset(
    df_ldiverse, quasi_identifiers, sensitive_columns, full_spans,
    is_l_diverse)
print(len(finished_l_diverse_partitions))   #How many partitions are there

dfldiverse_finished = build_anonymized_dataset(df_ldiverse, finished_l_diverse_partitions, quasi_identifiers, sensitive_columns)
dfldiverse_finished.to_csv("l_diverse.csv")

2849


## Extending the k-anonymized dataset to also achieve t-closeness
- A dataset is said to have t-closeness if the distribution of a sensitive attribute in any given group is not more than 't' different from the distribution of the attribute in the overall dataset.
- Unlike l-diversity, which focuses on the variety of sensitive attributes, t-closeness concerns itself with the distribution (frequency) of these attributes.
- A smaller t value indicates a stricter requirement for maintaining the distribution of the dataset. (Here: t=0.2)

In [126]:
import random
from scipy.stats import ks_2samp

# calculate t-closeness for numerical values 
def t_closeness_numerical(df, partition, column):
    full_data = df[column]
    partition_data = df.loc[partition, column]

    # Check if either dataset is empty
    if full_data.empty or partition_data.empty:
        # Generate and return a random float between 30 and 200
        random_float = random.uniform(30, 200)
        return random_float

    ks_stat, _ = ks_2samp(full_data, partition_data)
    return ks_stat

# calculate t-closeness for categorical values
def t_closeness_categorical(df, partition, column, global_freqs):
    total_count = float(len(partition))
    d_max = None
    group_counts = df.loc[partition].groupby(column, observed=False)[column].agg('count')
    for value, count in group_counts.to_dict().items():
        p = count / total_count
        d = abs(p - global_freqs[value])
        if d_max is None or d > d_max:
            d_max = d
    return d_max

# check if partition is t-close
def is_t_close(df, partition, sensitive_columns, global_freqs, t=0.2):
    for sensitive_column in sensitive_columns:
        if sensitive_column not in categorical:
            distance = t_closeness_numerical(df, partition, sensitive_column)
        else:
            distance = t_closeness_categorical(df, partition, sensitive_column, global_freqs[sensitive_column])
        if distance > t:
            return False
    return True


In [127]:
df_tclose = pd.read_csv("k_anon.csv", index_col=False, low_memory=False)
full_spans = get_spans(df_tclose, df_tclose.index)

# Get the global frequencies for the sensitive column
global_frequencies = {sensitive_column: {} for sensitive_column in sensitive_columns}
total_count = len(df)

# Determine frequency for every sensitive attribute
for sensitive_column in sensitive_columns:
    group_counts = df_tclose.groupby(sensitive_column, observed=False)[sensitive_column].agg('count')
    for value, count in group_counts.to_dict().items():
        p = count / total_count
        global_frequencies[sensitive_column][value] = p

finished_t_closepartitions = partition_dataset(
    df_tclose, quasi_identifiers, sensitive_columns, full_spans,
    lambda *args: is_t_close(*args, global_frequencies))
print(len(finished_t_closepartitions))


dfclose_finished = build_anonymized_dataset(df_tclose, finished_t_closepartitions, quasi_identifiers, sensitive_columns)
dfclose_finished.to_csv("t_close.csv")

1950


## Discussing the results
Our datasets are characterized by three privacy-preserving measures: 
k=3 anonymity, l=2 diversity, and t=0.2 closeness. These measures are used to ensure that the data can be used for analysis without compromising the privacy of the individuals represented in the data. 
- k = 3 ensures that any individual's data cannot be distinguished from at least two other individuals within the dataset. Protects against identification risks.
- l = 2 diversity: ensures variety in the sensitive attribute in each group of the dataset. In a dataset with l=2 diversity, each group of records must have at least two different values for the sensitive attribute. Protects against attribute disclosure.
- t = 0.2  ensures the preservation of data utility by maintaining a consistent distribution of sensitive attributes.
By combining these three techniques, we can achieve decent privacy protection of our given dataset. Each measure addresses a different aspect of privacy.

The rest of the comparison / discussion of results happen in 3.4. 