## 3.2 Anonymizing your dataset - 20 marks
- Goals: The goal of k-anonymity is to modify a dataset such that any given record cannot be distinguished from at least k−1 other records regarding certain "quasi-identifier" attributes.
- Our identified quasi-identifiers from 3.1: 'region', 'gender', 'age', 'height', 'weight', 'eat', 'schedule', 'howlong'
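
To make the goal concrete, the following minimal sketch checks which k a small table satisfies. The toy values and the helper name `smallest_group_size` are assumptions for illustration only; the column names mirror a few of the attributes used in this notebook.

```python
import pandas as pd

# Toy data: made-up values, column names borrowed from the athletes dataset.
toy = pd.DataFrame({
    "region": ["EU", "EU", "EU", "US", "US", "US", "US"],
    "gender": ["F", "F", "F", "M", "M", "M", "M"],
    "age":    [30, 30, 30, 40, 40, 40, 40],
    "fran":   [210, 250, 300, 190, 220, 260, 280],
})

def smallest_group_size(frame, quasi_identifier_columns):
    """Size of the smallest group of rows sharing all quasi-identifier values."""
    return int(frame.groupby(quasi_identifier_columns).size().min())

# The table is k-anonymous for every k up to this value.
k = smallest_group_size(toy, ["region", "gender", "age"])
print(k)  # 3
```

A table is k-anonymous exactly when its smallest quasi-identifier group has at least k rows, so a single `groupby` is enough to verify the result of any anonymization step.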

### Starting point: get a k-anonymized data set as a base
- To k-anonymize the dataset, we use the Mondrian multidimensional k-anonymity approach. The same partitioning code is reused later for l-diversity and t-closeness.
- Like other practical k-anonymity approaches, Mondrian is a simple and efficient greedy approximation: instead of searching for an optimal partitioning, which would be too expensive, it repeatedly splits the data along one attribute at a time.
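
The greedy idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the notebook's implementation: the function name `mondrian` is hypothetical, it handles numeric tuples only, it does not normalize the spans, and it uses the upper median of the sorted values rather than the pandas median.

```python
def mondrian(rows, k):
    """Greedy Mondrian-style partitioning of numeric tuples (illustrative sketch, k >= 1)."""
    if not rows:
        return []
    dims = range(len(rows[0]))
    # Value range (span) of each dimension inside the current partition.
    spans = {d: max(r[d] for r in rows) - min(r[d] for r in rows) for d in dims}
    # Try the dimensions from widest to narrowest span.
    for d in sorted(dims, key=lambda d: -spans[d]):
        values = sorted(r[d] for r in rows)
        median = values[len(values) // 2]
        left = [r for r in rows if r[d] < median]
        right = [r for r in rows if r[d] >= median]
        # Accept the cut only if both halves still hold at least k rows.
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    # No allowable cut in any dimension: this partition is final.
    return [rows]

# Made-up (age, height) pairs.
rows = [(20, 160), (22, 165), (25, 170), (31, 172), (35, 180), (40, 185), (44, 190)]
partitions = mondrian(rows, k=3)
print([len(p) for p in partitions])
```

Every returned partition contains at least k rows, which is what makes the generalized output k-anonymous once each partition is replaced by aggregated values.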

Because of runtime issues (the full athletes.csv takes more than a few hours to process in this Jupyter notebook), we use only the first 20,000 entries of the athletes dataset to showcase the algorithm.
We start by implementing a few functions that are later used to make the dataset k-anonymous with k=3:

In [256]:
import pandas as pd

df = pd.read_csv("athletes.csv", index_col=False, low_memory=False)
print(df.shape[0])

# Select the first 20,000 rows
df_reduced = df.iloc[:20000, :]

# Save the reduced dataset to a new CSV file
df_reduced.to_csv("reduced_athletes.csv", index=False)

423006
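
The cell above loads the whole file before truncating it. As a possible lighter alternative, `pandas.read_csv` accepts an `nrows` argument that stops parsing after the requested number of rows. The sketch below uses a small in-memory stand-in for athletes.csv (made-up values) so that it runs on its own.

```python
import io
import pandas as pd

# Stand-in for athletes.csv; the real file is not needed for this sketch.
csv_text = "athlete_id,age\n1,30\n2,41\n3,25\n4,37\n"

# nrows limits how many data rows are parsed.
first_rows = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(first_rows.shape[0])  # 2
```

The trade-off is that the total row count of the full file is no longer available, since the remaining rows are never read.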


In [257]:
#Some adjustments on the dataset
df = pd.read_csv("reduced_athletes.csv", index_col=False, low_memory=False)

# Replace missing values; assigning the result back avoids chained in-place modification
df['age'] = df['age'].fillna(0)
df['height'] = df['height'].fillna(0)
df['weight'] = df['weight'].fillna(0)

df['region'] = df['region'].fillna('')
df['gender'] = df['gender'].fillna('')

In [258]:
#Calculates and returns the spans (range of values) for each column in a specified partition of a dataframe, with an option to scale these spans by provided values.
def calculate_spans(data_frame, data_partition, scaling_factors=None):
    column_spans = {}
    for column in quasi_identifiers:
        if column in categorical:
            range_span = len(data_frame[column][data_partition].unique())
        else:
            range_span = data_frame[column][data_partition].max() - data_frame[column][data_partition].min()
        if scaling_factors is not None:
            range_span = range_span / scaling_factors[column]
        column_spans[column] = range_span
    return column_spans

In [259]:
#Divides a specified partition of a dataframe into two parts based on the median or unique values of a given column, returning a tuple with the indices of these two parts.
def split_dataframe_into_two(df, partition_indices, column_name):
    partitioned_data = df[column_name][partition_indices]

    if column_name in categorical:
        unique_values = partitioned_data.unique()
        left_values = set(unique_values[:len(unique_values) // 2])
        right_values = set(unique_values[len(unique_values) // 2:])
        left_partition_indices = partitioned_data.index[partitioned_data.isin(left_values)]
        right_partition_indices = partitioned_data.index[partitioned_data.isin(right_values)]
        return left_partition_indices, right_partition_indices
    else:
        median_value = partitioned_data.median()
        left_partition_indices = partitioned_data.index[partitioned_data < median_value]
        right_partition_indices = partitioned_data.index[partitioned_data >= median_value]
        return left_partition_indices, right_partition_indices
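
One property of the numerical branch above is worth keeping in mind: rows equal to the median always go to the right half. When a column contains many repeated values, for example missing ages that were filled with 0, the left half can end up empty and the split is rejected. A small illustration on made-up values:

```python
import pandas as pd

# Four placeholder zeros and two real ages (made-up values).
ages = pd.Series([0, 0, 0, 0, 25, 31])

median_value = ages.median()                 # 0.0
left = ages.index[ages < median_value]       # nothing is strictly below the median
right = ages.index[ages >= median_value]     # every row lands here
print(len(left), len(right))                 # 0 6
```

In that situation the partitioning loop simply moves on to the attribute with the next-largest span.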

In [260]:
#Checks if a partition is k-anonymous by comparing its amount of entries with the required (k).
def is_k_anonymous(df, partition, sensitive_column, k=3):
    # df and sensitive_column are unused here; they keep the signature shared by all validators
    return len(partition) >= k

In [261]:
#Partitions a dataframe into valid subsets based on specified feature columns, a sensitive column, and span scales, using a validity function to ensure each partition meets certain criteria.
def partition_dataset(df, feature_columns, sensitive_column, scale_factor, validate_partition):
    completed_partitions = []
    pending_partitions = [df.index]
    
    while pending_partitions:
        current_partition = pending_partitions.pop(0)
        column_spans = calculate_spans(df[feature_columns], current_partition, scale_factor)
        
        for column_name, span_value in sorted(column_spans.items(), key=lambda x: -x[1]):
            left_partition_indices, right_partition_indices = split_dataframe_into_two(df, current_partition, column_name)
            
            if not validate_partition(df, left_partition_indices, sensitive_column) or not validate_partition(df, right_partition_indices, sensitive_column):
                continue
            
            pending_partitions.extend((left_partition_indices, right_partition_indices))
            break
        else:
            completed_partitions.append(current_partition)
    
    return completed_partitions

In [262]:
def aggregate_categorical_values(series_categorical):
    # Check if the categorical series is empty or if its mode calculation results in an empty series
    if series_categorical.empty or series_categorical.mode().empty:
        return None  # Return None or a specified default value as a placeholder
    else:
        # Return the most frequent value in the series (the first element in mode)
        return series_categorical.mode().iloc[0]


In [263]:
def aggregate_numerical_series(numerical_series):
    # Calculate and return the mean (average) of the numerical series
    return numerical_series.mean()

In [264]:
def create_anonymized_dataset(dataframe, partition_indices, feature_columns, sensitive_columns, max_partitions=None):
    column_aggregations = {}

    for column in feature_columns:
        if column in categorical:
            column_aggregations[column] = aggregate_categorical_values
        else:
            column_aggregations[column] = aggregate_numerical_series

    anonymized_rows = []

    # Process each partition
    for partition_index, partition in enumerate(partition_indices):
        # Limit the number of partitions processed if max_partitions is set
        if max_partitions is not None and partition_index >= max_partitions:
            break

        # Aggregate feature columns for the current partition
        aggregated_features = dataframe.loc[partition].agg(column_aggregations, squeeze=False)
        feature_values = aggregated_features.to_dict()

        # Aggregate sensitive columns separately and combine with feature columns
        for sensitive_column in sensitive_columns:
            sensitive_value_counts = dataframe.loc[partition].groupby(sensitive_column).agg({sensitive_column: 'count'})
            
            for sensitive_value, count in sensitive_value_counts[sensitive_column].items():
                if count == 0:
                    continue
                
                combined_values = feature_values.copy()
                combined_values.update({
                    sensitive_column: sensitive_value,
                    'count': count,
                })
                anonymized_rows.append(combined_values)

    return pd.DataFrame(anonymized_rows)


### Starting to k-anonymize with k=3
1. First, we define the quasi-identifiers chosen in 3.1.
2. Second, we divide the quasi-identifiers into two categories:
   - categorical: schedule, howlong, region, gender, eat
      - These attributes need to be treated differently because they cannot be ordered or compared like numerical attributes.
   - numerical: age, height, weight
      - These attributes need no special treatment.
3. We then start k-anonymizing the dataset by calculating the spans of the dataframe and passing the result on to partition the dataset.
4. The partitions are then aggregated: categorical attributes by their most frequent value, numerical attributes by their mean.
5. The finished dataframe is saved to a new .csv file.

In [265]:
#categorical attributes that need to be treated in another way than numerical
categorical = {'schedule', 'howlong', 'region', 'gender', 'eat'}
for name in categorical:
    df[name] = df[name].astype('category')
    
#Our quasi identifiers that we identified earlier
quasi_identifiers = ['region', 'gender', 'age', 'height', 'weight', 'eat', 'schedule', 'howlong']

# sensitive-values that should be taken into account
sensitive_columns = ['athlete_id', 'fran', 'helen', 'grace', 'filthy50', 'fgonebad', 'run400', 'run5k', 'candj', 'snatch', 'deadlift',
                     'backsq', 'pullups']

In [266]:
full_spans = calculate_spans(df, df.index)
finished_partitions = partition_dataset(df, quasi_identifiers, sensitive_columns, full_spans, is_k_anonymous)
print(len(finished_partitions))

dfn = create_anonymized_dataset(df, finished_partitions, quasi_identifiers, sensitive_columns)

dfn.to_csv("k_anon.csv")

2910


### Extending the k-anonymized dataset to also be l-diverse
- l-diversity requires that every group of records sharing the same quasi-identifier values contains at least 'l' "well-represented" values for the sensitive attribute.
- l=2 diversity: we use the simplest reading, in which "well-represented" means distinct. Each group of records with the same quasi-identifiers must contain at least two distinct values for every sensitive attribute. The idea is to prevent attackers from deducing the value of a sensitive attribute from group membership alone.
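
As a minimal sketch of this criterion, the following check counts the distinct sensitive values per quasi-identifier group. The toy values are made up, and only one sensitive column (`fran`) is used for brevity.

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["EU", "EU", "EU", "US", "US", "US"],
    "age":    [30, 30, 30, 40, 40, 40],
    "fran":   [210, 250, 300, 220, 220, 220],
})

# Number of distinct sensitive values in each quasi-identifier group.
distinct_per_group = toy.groupby(["region", "age"])["fran"].nunique()

# The table is 2-diverse only if every group holds at least two distinct values.
is_2_diverse = bool((distinct_per_group >= 2).all())
print(distinct_per_group.to_dict(), is_2_diverse)
```

Both groups are 3-anonymous, but the US group holds a single distinct `fran` value, so the check fails: anyone known to be in that group has their result disclosed.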


In [267]:
def calculate_partition_diversity(dataframe, partition_indices, column):
    # Calculate the number of unique values in the specified column of the given partition
    return len(dataframe.loc[partition_indices, column].unique())

def is_l_diverse(dataframe, partition_indices, sensitive_columns_list, l_diversity_threshold=2):
    # Check if each sensitive column in the partition meets the l-diversity criterion
    for sensitive_column in sensitive_columns_list:
        if calculate_partition_diversity(dataframe, partition_indices, sensitive_column) < l_diversity_threshold:
            return False
    return True

In [268]:
df_ldiverse = pd.read_csv("k_anon.csv", index_col=False, low_memory=False)
full_spans = calculate_spans(df_ldiverse, df_ldiverse.index)

finished_l_diverse_partitions = partition_dataset(
    df_ldiverse, quasi_identifiers, sensitive_columns, full_spans,
    is_l_diverse)
print(len(finished_l_diverse_partitions))   #How many partitions are there

dfldiverse_finished = create_anonymized_dataset(df_ldiverse, finished_l_diverse_partitions, quasi_identifiers, sensitive_columns)
dfldiverse_finished.to_csv("l_diverse.csv")

1279


### Extending the k-anonymized dataset to also achieve t-closeness
- A dataset satisfies t-closeness if, in every group, the distance between the distribution of a sensitive attribute within the group and its distribution in the overall dataset is no more than a threshold 't'.
- Unlike l-diversity, which only counts distinct sensitive values, t-closeness considers the distribution (frequency) of these values.
- A smaller t value imposes a stricter requirement: group distributions must stay closer to the overall distribution. (Here: t=0.2)
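
For numerical attributes this notebook measures the distance with the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical cumulative distribution functions. The sketch below computes that quantity by hand on made-up values; the helper name `ks_distance` is hypothetical and is only meant to show what the threshold t is compared against.

```python
def ks_distance(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    points = sorted(set(sample_a) | set(sample_b))

    def ecdf(sample, x):
        # Fraction of the sample that is less than or equal to x.
        return sum(v <= x for v in sample) / len(sample)

    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)

overall = [180, 200, 220, 240, 260, 280, 300, 320]   # all values of the attribute
partition = [180, 200, 220]                           # a group holding only the lowest values

t = 0.2
distance = ks_distance(overall, partition)
print(distance, distance <= t)  # 0.625 False
```

The group fails the t=0.2 requirement because its values are concentrated at one end of the overall distribution, which is exactly the kind of skew t-closeness is meant to rule out.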

In [269]:
import random
from scipy.stats import ks_2samp

# calculate t-closeness for numerical values 
def t_closeness_numerical(df, partition_indices, numerical_column):
    full_dataset_values = df[numerical_column]
    partition_values = df.loc[partition_indices, numerical_column]

    # An empty sample has no distribution to compare. The KS statistic lies in
    # [0, 1], so return infinity to make the partition fail any threshold.
    if full_dataset_values.empty or partition_values.empty:
        return float("inf")

    # Compute the Kolmogorov-Smirnov statistic for the two datasets
    ks_statistic, _ = ks_2samp(full_dataset_values, partition_values)
    return ks_statistic

# calculate t-closeness for categorical values
def t_closeness_categorical(df, partition_indices, categorical_column, global_frequencies):
    total_partition_count = float(len(partition_indices))
    max_deviation = None

    # Calculate the count of each value in the partition for the specified categorical column
    partition_value_counts = df.loc[partition_indices].groupby(categorical_column, observed=False)[categorical_column].agg('count')

    # Iterate through each value and calculate its deviation from the global frequencies
    for value, count in partition_value_counts.to_dict().items():
        partition_frequency = count / total_partition_count
        deviation = abs(partition_frequency - global_frequencies[value])

        # Update max_deviation if this deviation is greater than the current max
        if max_deviation is None or deviation > max_deviation:
            max_deviation = deviation

    return max_deviation

# check if partition is t-close
def is_t_close(df, partition_indices, sensitive_columns_list, global_frequencies, threshold=0.2):
    # Loop through each sensitive column to calculate the t-closeness
    for sensitive_column in sensitive_columns_list:
        if sensitive_column not in categorical:
            # Calculate t-closeness for numerical columns
            closeness_distance = t_closeness_numerical(df, partition_indices, sensitive_column)
        else:
            # Calculate t-closeness for categorical columns
            closeness_distance = t_closeness_categorical(df, partition_indices, sensitive_column, global_frequencies[sensitive_column])
        
        # Check if the closeness distance exceeds the threshold
        if closeness_distance > threshold:
            return False

    return True


In [270]:
df_tclose = pd.read_csv("k_anon.csv", index_col=False, low_memory=False)
full_spans = calculate_spans(df_tclose, df_tclose.index)

# Get the global frequencies for the sensitive column
global_frequencies = {sensitive_column: {} for sensitive_column in sensitive_columns}
total_count = len(df_tclose)  # frequencies must be relative to the dataframe being partitioned

# Determine frequency for every sensitive attribute
for sensitive_column in sensitive_columns:
    group_counts = df_tclose.groupby(sensitive_column, observed=False)[sensitive_column].agg('count')
    for value, count in group_counts.to_dict().items():
        p = count / total_count
        global_frequencies[sensitive_column][value] = p

finished_t_closepartitions = partition_dataset(
    df_tclose, quasi_identifiers, sensitive_columns, full_spans,
    lambda *args: is_t_close(*args, global_frequencies))
print(len(finished_t_closepartitions))


dfclose_finished = create_anonymized_dataset(df_tclose, finished_t_closepartitions, quasi_identifiers, sensitive_columns)
dfclose_finished.to_csv("t_close.csv")

1950


### Discussing the results
Our datasets are characterized by three privacy-preserving measures: k=3 anonymity, l=2 diversity, and t=0.2 closeness. These measures are meant to allow analysis of the data while limiting the privacy risk for the individuals represented in it.
- k = 3 anonymity: ensures that any individual's record cannot be distinguished from at least two other records with respect to the quasi-identifiers. This protects against identity disclosure.
- l = 2 diversity: ensures variety in the sensitive attribute. Each group of records must contain at least two different values for the sensitive attribute. This protects against attribute disclosure.
- t = 0.2 closeness: ensures that the distribution of a sensitive attribute within each group stays within a distance of 0.2 of its distribution in the overall dataset. This limits what an attacker can learn about the sensitive attribute from group membership, for example when the values in a group are skewed or all similar.

By combining these three techniques, we can achieve reasonable privacy protection for the given dataset. Each measure addresses a different aspect of privacy.

The rest of the comparison and discussion of the results follows in 3.4.