### Step-by-step anonymisation of MVR data

#### TODOS:

- Hash number plate, look at: https://towardsdatascience.com/anonymise-sensitive-data-in-a-pandas-dataframe-column-with-hashlib-8e7ef397d91f
- Apply k-anonymity (e.g., Mondrian) to selected columns, such as Age, Year, or
- Put Age into pre-defined groups, e.g., 0-14, 15-24, 25-44, 45-64, 65+
- Apply Faker on names
- Create an abstract ColumnAnonymiser class that takes the column data type and target policy as inputs. Implement AgeAnonymiser and NZNumberPlateAnonymiser
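
For the first TODO, hashing a number plate with `hashlib` might look roughly like the sketch below. The helper name `hash_number_plate`, the key value, and the 12-character truncation are assumptions made for illustration; none of them come from this notebook. The sketch uses a keyed hash (HMAC) rather than a bare digest, because the set of valid plates is small enough that an unkeyed hash could be reversed by trying every plate.

```python
import hashlib
import hmac

def hash_number_plate(plate, secret_key, length=12):
    """Return a keyed SHA-256 pseudonym for a number plate (hypothetical helper)."""
    # Normalise so that "abc 123" and "ABC123" map to the same pseudonym
    normalised = plate.replace(" ", "").upper()
    digest = hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncation is an illustrative choice: shorter output, higher collision risk
    return digest[:length]

secret_key = b"replace-with-a-secret-key"  # assumption: stored outside the notebook
plates = ["ABC123", "abc 123", "XYZ789"]
pseudonyms = [hash_number_plate(p, secret_key) for p in plates]
print(pseudonyms)
```

On a DataFrame column the same helper could be applied with something like `df['NumberPlate'].map(lambda p: hash_number_plate(p, secret_key))`. Unlike random fake values, the same plate always maps to the same pseudonym, so records for one vehicle stay linkable.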

In [0]:
import pandas as pd
import csv
import re
from collections import defaultdict
from abc import ABC, abstractmethod
import math

df = pd.read_csv('mvr_synthetic_data.csv')
df.head()


# Mondrian algorithm

Mondrian is a top-down, greedy partitioning algorithm for achieving k-anonymity.

It works by recursively partitioning the records of a dataset on their quasi-identifiers, ensuring that each resulting group contains at least k individuals. Here's a step-by-step guide on how to use the Mondrian algorithm for k-anonymization:

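1. Define the direct identifiers and sensitive attributes: Identify the attributes in your dataset that need to be protected. Direct identifiers, such as names, addresses, or social security numbers, can uniquely identify an individual on their own; they are removed or replaced (e.g., hashed or faked) rather than partitioned. Sensitive attributes are the values that must not be linked back to an individual.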

2. Define the quasi-identifiers: Quasi-identifiers are non-sensitive attributes that can potentially be combined to identify individuals indirectly. Examples include age, gender, or zip code. Identify the quasi-identifiers in your dataset that will be used for partitioning.

3. Sort the dataset: Sort the dataset based on the quasi-identifiers. Mondrian itself does not require this step, but it keeps similar individuals next to each other, which makes the partitions easier to inspect.

4. Select a partitioning attribute: Choose one quasi-identifier to start the partitioning process. Typically, the attribute with the widest normalized range of values (i.e., the most discriminating attribute) is selected first.

5. Determine the splitting point: Determine the splitting point for the chosen partitioning attribute. Mondrian typically uses the median, which divides the partition into two groups of roughly equal size. A split is only allowed if both groups still contain at least k individuals.

6. Recursively partition the data: Split the dataset at the determined splitting point, creating two new subsets. Repeat the partitioning process on each subset, selecting a new partitioning attribute at each step, until no subset can be split further without producing a group of fewer than k individuals.

7. Generalize the values: Generalize the values of the quasi-identifiers within each final partition so that all of its records look the same. For example, if the quasi-identifier is age, you can replace specific ages with the age range covered by the partition (e.g., 20-29, 30-39).

8. Handle undersized subsets: Splitting a subset only makes it smaller, so a subset with fewer than k individuals cannot be repaired by partitioning it again. If a candidate split would leave a subset with fewer than k individuals, reject that split and either try another quasi-identifier or stop splitting that partition.

9. Evaluate and validate the anonymization: Assess the level of k-anonymity achieved by examining the resulting dataset. Validate that the sensitive attributes have been adequately protected and that the utility of the data is still preserved for the intended analysis.
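
The steps above can be sketched in plain Python on a handful of made-up records. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `mondrian` and `generalise` and the sample records are hypothetical, the attribute is chosen by raw (not normalized) value range, all quasi-identifiers are assumed numeric, and k is assumed to be at least 1.

```python
def mondrian(records, quasi_identifiers, k):
    """Return a list of partitions; a split is accepted only if both halves have >= k records."""
    def value_range(attr):
        values = [r[attr] for r in records]
        return max(values) - min(values)

    # Try the attributes from widest to narrowest value range
    for attr in sorted(quasi_identifiers, key=value_range, reverse=True):
        ordered = sorted(records, key=lambda r: r[attr])
        median = ordered[(len(ordered) - 1) // 2][attr]  # lower median
        left = [r for r in ordered if r[attr] <= median]
        right = [r for r in ordered if r[attr] > median]
        if len(left) >= k and len(right) >= k:
            return (mondrian(left, quasi_identifiers, k)
                    + mondrian(right, quasi_identifiers, k))
    return [records]  # no allowable split: this partition is final

def generalise(partition, quasi_identifiers):
    """Replace each quasi-identifier with the (min, max) range of its partition."""
    ranges = {a: (min(r[a] for r in partition), max(r[a] for r in partition))
              for a in quasi_identifiers}
    return [{**r, **ranges} for r in partition]

# Made-up sample records for illustration only
records = [
    {"Age": 21, "Year": 2005}, {"Age": 23, "Year": 2010},
    {"Age": 35, "Year": 2012}, {"Age": 38, "Year": 2015},
    {"Age": 52, "Year": 2018}, {"Age": 67, "Year": 2020},
]
partitions = mondrian(records, ["Age", "Year"], k=2)
anonymised = [row for p in partitions for row in generalise(p, ["Age", "Year"])]
for row in anonymised:
    print(row)
```

With k=2 the six records end up in two partitions of three records each: the first split on Age is accepted, and every further split is rejected because it would leave a single record on one side.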

In [0]:
import pandas as pd
import hashlib
from faker import Faker
from faker_vehicle import VehicleProvider

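# ABC and abstractmethod are imported in the first cell
class ColumnAnonymiser(ABC):
    """Abstract base class for anonymising a single column."""

    def __init__(self, data_type):
        self.data_type = data_type

    @abstractmethod
    def anonymise(self, column):
        raise NotImplementedError("Anonymise method not implemented.")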

class AgeAnonymiser(ColumnAnonymiser):
    def __init__(self):
        super().__init__("age")

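    def anonymise(self, column):
        # Lower bound of each age group, in ascending order
        age_groups = {
            0: "0-14",
            15: "15-24",
            25: "25-44",
            45: "45-64",
            65: "65+"
        }

        def map_age_group(age):
            # Walk the groups from oldest to youngest and return the first
            # group whose lower bound the age reaches
            for group_start, group_name in reversed(list(age_groups.items())):
                if age >= group_start:
                    return group_name
            # Negative or missing ages do not belong to any group
            return None

        return column.apply(map_age_group)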

class NZNumberPlateAnonymiser(ColumnAnonymiser):
    def __init__(self):
        super().__init__("number_plate")

    def anonymise(self, column):
        # Replace each plate with a random fake in the common NZ format of
        # three letters followed by three digits (e.g. ABC123)
        fake = Faker()
        letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        return column.apply(lambda x: fake.bothify("???###", letters=letters))

class NameAnonymiser(ColumnAnonymiser):
    def __init__(self):
        super().__init__("name")

    def anonymise(self, column):
        faker = Faker()
        return column.apply(lambda x: faker.name())

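# Mondrian algorithm for k-anonymity
def apply_mondrian_algorithm(df, k, quasi_identifiers):
    if k < 1:
        raise ValueError("k must be at least 1.")

    # Sort the dataset based on the quasi-identifiers
    sorted_df = df.sort_values(by=quasi_identifiers)

    # Apply Mondrian algorithm recursively
    k_anonymous_df = partition_dataset(sorted_df, k, quasi_identifiers)

    # Check if the resulting dataset satisfies k-anonymity: every combination
    # of quasi-identifier values must be shared by at least k records
    class_sizes = k_anonymous_df.groupby(quasi_identifiers).size()
    if len(class_sizes) > 0 and class_sizes.min() >= k:
        print("Dataset satisfies k-anonymity.")
    else:
        print("Dataset does not satisfy k-anonymity.")

    return k_anonymous_df

def partition_dataset(df, k, quasi_identifiers):
    # Try the quasi-identifiers one by one, most discriminating first.
    # A partition with fewer than 2 * k records can never be split
    candidates = list(quasi_identifiers)
    while candidates and len(df) >= 2 * k:
        attribute = select_partitioning_attribute(df, candidates)
        splitting_point = determine_splitting_point(df, attribute)

        # Split the dataset into two subsets based on the splitting point
        # (records with a missing value go to the second subset)
        in_lower_half = df[attribute] <= splitting_point
        subset1 = df[in_lower_half]
        subset2 = df[~in_lower_half]

        # Only accept the split if both subsets keep at least k records
        if len(subset1) >= k and len(subset2) >= k:
            return pd.concat([partition_dataset(subset1, k, quasi_identifiers),
                              partition_dataset(subset2, k, quasi_identifiers)])
        candidates.remove(attribute)

    # No allowable split is left: generalise the quasi-identifiers
    return generalise_partition(df, quasi_identifiers)

def select_partitioning_attribute(df, quasi_identifiers):
    # Mondrian normally picks the attribute with the widest normalised range.
    # In this example, we use the number of distinct values as a simple proxy
    cardinalities = [df[attr].nunique() for attr in quasi_identifiers]

    return quasi_identifiers[cardinalities.index(max(cardinalities))]

def determine_splitting_point(df, attribute):
    # Use the (lower) median: the middle element of the sorted values.
    # Unlike Series.median(), this also works for ordered non-numeric columns
    ordered = df[attribute].sort_values()
    return ordered.iloc[(len(ordered) - 1) // 2]

def generalise_partition(df, quasi_identifiers):
    # Give every record in the partition the same quasi-identifier values
    generalised = df.copy()
    for attribute in quasi_identifiers:
        values = df[attribute].dropna()
        if pd.api.types.is_numeric_dtype(values) and values.nunique() > 1:
            # Numeric values become a "min-max" range
            label = f"{values.min()}-{values.max()}"
        else:
            # Other values become the set of values found in the partition
            label = " | ".join(sorted(values.astype(str).unique()))
        generalised[attribute] = label
    return generalised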

In [0]:
# Instantiate the anonymisers
age_anonymiser = AgeAnonymiser()
number_plate_anonymiser = NZNumberPlateAnonymiser()
name_anonymiser = NameAnonymiser()

# Apply anonymisation on specific columns
df['Age'] = age_anonymiser.anonymise(df['Age'])
df['NumberPlate'] = number_plate_anonymiser.anonymise(df['NumberPlate'])
df['Name'] = name_anonymiser.anonymise(df['Name'])

# Selected columns for k-anonymity
selected_columns = ['Age', 'Year']
k = 2  # k-anonymity level

# Helper functions for verifying k-anonymity
def compute_equivalence_classes(df, quasi_identifiers):
    # Count how many records share each combination of quasi-identifier values
    eq_classes = defaultdict(int)

    for _, row in df.iterrows():
        key = tuple(row[qi] for qi in quasi_identifiers)
        eq_classes[key] += 1

    return eq_classes

def is_k_anonymous(dataset, quasi_identifiers, k):
    # k-anonymity holds if every equivalence class has at least k records
    eq_classes = compute_equivalence_classes(dataset, quasi_identifiers)

    for count in eq_classes.values():
        if count < k:
            return False

    return True

# Apply Mondrian algorithm for k-anonymity on selected columns
k_anonymous_df = apply_mondrian_algorithm(df, k, selected_columns)
print("k-anonymous:", is_k_anonymous(k_anonymous_df, selected_columns, k))



In [0]:
k_anonymous_df.head()

# Information loss

To determine the information loss after applying anonymization techniques, we need to compare the uniqueness/distinctiveness of the data before and after anonymization. 

A common metric used to measure information loss is entropy. Entropy measures the amount of uncertainty or randomness in a dataset.
- Higher entropy indicates more diversity and uniqueness in the data.
- Lower entropy suggests more homogeneity and less information.

Therefore, by comparing the entropy of a column in the original dataset with the entropy of the same column in the anonymized dataset, you can get an estimate of the information loss.
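
As a small worked example, the Shannon entropy of a column is $H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the share of records holding the i-th distinct value. The sketch below computes it for four made-up ages before and after grouping; the helper name `shannon_entropy` and the sample values are assumptions for illustration only.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy in bits of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

ages = [16, 20, 30, 40]                             # four distinct values
age_groups = ["15-24", "15-24", "25-44", "25-44"]   # the same ages after grouping

original = shannon_entropy(ages)          # 2.0 bits: four equally likely values
anonymised = shannon_entropy(age_groups)  # 1.0 bit: two equally likely groups
print("Information loss:", original - anonymised)   # 1.0
```

Grouping halves the number of distinct values here, so one bit of information about each record's age is lost.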

In [0]:
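def calculate_entropy(column):
    # Shannon entropy (in bits) of the value distribution of a column
    value_counts = column.value_counts()
    # value_counts() skips missing values, so divide by the number of
    # counted values to keep the probabilities summing to 1
    total_count = value_counts.sum()
    entropy = 0.0

    for count in value_counts:
        probability = count / total_count
        entropy -= probability * math.log2(probability)
    # An entropy of 0 is valid: every value in the column is identical
    return entropy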



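# Calculate entropy of the original dataset.
# df['Age'] was overwritten with age groups above, so reload the untouched data
original_df = pd.read_csv('mvr_synthetic_data.csv')
original_entropy = calculate_entropy(original_df['Age'])
print("Original Entropy:", original_entropy)

# Calculate entropy of the anonymized dataset
anonymized_entropy = calculate_entropy(k_anonymous_df['Age'])
print("Anonymized Entropy:", anonymized_entropy)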

# Calculate information loss
information_loss = original_entropy - anonymized_entropy
print("Information Loss:", information_loss)


# Try other selected columns

In [0]:
# Define the list of quasi-identifiers
quasi_identifiers = [
    ['CarMake', 'CarModel', 'Year'],
    ['Age', 'Gender'],
]

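# Apply the Mondrian algorithm to each list of quasi-identifiers and
# verify the result with is_k_anonymous
k = 3  # Specify the desired k value for k-anonymity
for qi in quasi_identifiers:
    k_anonymous_df = apply_mondrian_algorithm(df, k, qi)
    print(qi, "k-anonymous:", is_k_anonymous(k_anonymous_df, qi, k))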