# Project Goal

As a data analyst, my objective is to anonymize the given dataset using an anonymization scheme that takes into account the identity vulnerability of Personally Identifiable Information (PII) to ensure robust protection of user privacy.

## Step 1: read data:

In [35]:
import pandas as pd
import numpy as np

def generateData(file1, file2):
    df1 = pd.read_csv(file1)
    df2 = pd.read_csv(file2)
    df2 = df2.rename(columns={'number_plate': 'NumberPlate'})
    merged_df = pd.merge(df1, df2, on='NumberPlate', how='inner')
    merged_df = merged_df.rename(columns={'primary_contributor': 'PrimaryContributor',
                                     'crash_severity': 'CrashSeverity', 'date': 'Date'})
    return merged_df

merged_df = generateData("mvr_synthetic_data.csv", "crash.csv")
merged_df.to_csv('data.csv', sep=';', index=False)
merged_df

Unnamed: 0,Name,CarMake,CarModel,Year,NumberPlate,Gender,Age,Date,PrimaryContributor,CrashSeverity
0,John Smith,Toyota,Corolla,2017,ABC-1234,Male,34,2021-03-12,Yes,non-fatal
1,John Smith,Toyota,Corolla,2017,ABC-1234,Male,34,2022-03-31,No,non-fatal
2,John Smith,Toyota,Corolla,2017,ABC-1234,Male,34,2022-01-01,No,extremely
3,John Smith,Toyota,Corolla,2017,ABC-1234,Male,34,2022-01-14,Yes,extremely
4,John Smith,Toyota,Corolla,2017,ABC-1234,Male,34,2021-12-02,No,non-fatal
...,...,...,...,...,...,...,...,...,...,...
60,Lucas Thomas,Subaru,Outback,2021,GHI-5291,Male,41,2020-07-01,No,non-fatal
61,Lucas Thomas,Subaru,Outback,2021,GHI-5291,Male,41,2021-03-09,Yes,extremely
62,Lucas Thomas,Subaru,Outback,2021,GHI-5291,Male,41,2020-03-01,Yes,non-fatal
63,Lucas Thomas,Subaru,Outback,2021,GHI-5291,Male,41,2021-09-23,No,severe


## Step 2: Determining QIs and SAs

To implement the concept of determining Quasi-Identifiers (QIs) and Sensitive Attributes (SAs) based on the five principal concepts you mentioned, we'll need to follow these steps:

### 1. Calculate the Identity Vulnerability (IV) of Quasi-Identifiers.
In this step, we calculate the Identity Vulnerability (IV) of each Quasi-Identifier (QI) in the dataset. The IV represents the number of unique values or distinct combinations of each QI. Higher IV values indicate that the corresponding QI has a higher likelihood of potentially identifying individuals, making it more vulnerable to privacy breaches. By understanding the IV of each QI, we can prioritize and focus on the most vulnerable attributes when applying privacy protection techniques.

### 2. Rank users based on QIs values.
Here, we rank the users in the dataset based on the values of the Quasi-Identifiers (QIs). The ranking is determined by grouping users who share the same QI values together. This ranking allows us to identify groups of users who are more similar to each other, potentially forming equivalence classes (Ci) later on. The higher the rank, the more similar the users' QI values are, indicating that they might belong to the same equivalence class.

### 3. Form Equivalence Classes (Ci) using the privacy parameter k.
In this step, we form Equivalence Classes (Ci) using the privacy parameter "k." Equivalence classes are groups of users who have the same QI values and are considered indistinguishable based on the available QIs. The parameter "k" specifies the minimum number of occurrences required for a group to be considered an equivalence class. By forming equivalence classes, we can generalize and protect the privacy of users by treating them as a single entity, reducing the risk of re-identification.

### 4. Calculate the Diversity (D) and Evenness (E) of the Equivalence Classes.
Here, we calculate the Diversity (D) and Evenness (E) of the Equivalence Classes. Diversity represents the number of distinct values or unique combinations of Sensitive Attributes (SAs) within the equivalence classes. A higher diversity indicates that the SAs are spread across different values, making it more challenging for an adversary to infer specific sensitive information. Evenness measures how evenly the SAs are distributed within the equivalence classes. High evenness implies a balanced distribution of SAs, further protecting user privacy.

### 5. Perform adaptive data generalization considering both the identity vulnerability of the QIs and diversity of the SAs in equivalence classes.
In this final step, we perform adaptive data generalization based on the combined consideration of the Identity Vulnerability (IV) of the Quasi-Identifiers (QIs) and the Diversity of the Sensitive Attributes (SAs) within the equivalence classes. Generalization involves replacing specific values of QIs with more generalized versions to further protect user privacy. The generalization is applied more aggressively on QIs with higher IV and less on those with lower IV. Additionally, the generalization process aims to balance the evenness of the SAs within the equivalence classes to ensure both privacy and data utility.

### Recommended Quasi-Identifiers based on the Results:
Based on the results obtained from the above steps, the recommended Quasi-Identifiers for privacy protection are those with higher Identity Vulnerability (IV) values. These attributes are more likely to contain unique or specific information that could potentially identify individuals. By focusing on these highly vulnerable QIs, organizations can implement stronger privacy protection measures such as data generalization, aggregation, or suppression. The goal is to reduce the granularity of these vulnerable attributes without compromising the utility of the data for analysis and research purposes. Additionally, the choice of recommended QIs should also consider the specific context and nature of the dataset, as well as the desired level of privacy and data utility required for the given application.

In [36]:
df = merged_df

# Step 1: Calculate Identity Vulnerability (IV) of Quasi-Identifiers
def calculate_IV(df, quasi_identifiers):
    IV = {}
    for qid in quasi_identifiers:
        IV[qid] = len(df.groupby(qid).size())
    return IV

quasi_identifiers = ["Name", "CarMake", "CarModel", "Year", "NumberPlate", "Gender", "Age"]
IV_of_QIs = calculate_IV(df, quasi_identifiers)

# Step 2: Rank users based on QIs values (highest similarity user ranking)
def rank_users(df, quasi_identifiers):
    df["Rank"] = df.groupby(quasi_identifiers).cumcount() + 1
    return df

df = rank_users(df, quasi_identifiers)

# Step 3: Form Equivalence Classes (Ci) using privacy parameter k
def form_equivalence_classes(df, quasi_identifiers, k):
    Ci = df.groupby(quasi_identifiers).filter(lambda x: len(x) >= k)
    return Ci

k = 2  # privacy parameter k
equivalence_classes = form_equivalence_classes(df, quasi_identifiers, k)

# Step 4: Calculate Diversity (D) and Evenness (E) of the Equivalence Classes
def calculate_diversity_evenness(df, sensitive_attribute):
    D = len(df.groupby(sensitive_attribute).size())
    N = len(df)
    E = D / N
    return D, E

sensitive_attribute = "CrashSeverity"
D, E = calculate_diversity_evenness(equivalence_classes, sensitive_attribute)

# Step 5: Adaptive data generalization considering both the identity vulnerability of the QIs and diversity of the SA in equivalence classes.
def adaptive_data_generalization(df, IV_of_QIs, sensitive_attribute, k):
    generalized_df = df.copy()
    for qid, iv in IV_of_QIs.items():
        if iv < k:
            # If the identity vulnerability of QI is less than k, perform generalization on the QI
            generalized_df[qid] = generalized_df[qid].apply(lambda x: x[:k])
    return generalized_df

generalized_df = adaptive_data_generalization(df, IV_of_QIs, sensitive_attribute, k)

# Output the results
print("Identity Vulnerability (IV) of Quasi-Identifiers:")
print(IV_of_QIs)

print("\nRanked Users based on QIs values:")
print(df)

print("\nEquivalence Classes (Ci) with privacy parameter k = 2:")
print(equivalence_classes)

print("\nDiversity (D) and Evenness (E) of the Equivalence Classes:")
print("Diversity:", D)
print("Evenness:", E)

print("\nAdaptive Data Generalization:")
print(generalized_df)


Identity Vulnerability (IV) of Quasi-Identifiers:
{'Name': 10, 'CarMake': 10, 'CarModel': 10, 'Year': 6, 'NumberPlate': 10, 'Gender': 2, 'Age': 10}

Ranked Users based on QIs values:
            Name CarMake CarModel  Year NumberPlate Gender  Age        Date  \
0     John Smith  Toyota  Corolla  2017    ABC-1234   Male   34  2021-03-12   
1     John Smith  Toyota  Corolla  2017    ABC-1234   Male   34  2022-03-31   
2     John Smith  Toyota  Corolla  2017    ABC-1234   Male   34  2022-01-01   
3     John Smith  Toyota  Corolla  2017    ABC-1234   Male   34  2022-01-14   
4     John Smith  Toyota  Corolla  2017    ABC-1234   Male   34  2021-12-02   
..           ...     ...      ...   ...         ...    ...  ...         ...   
60  Lucas Thomas  Subaru  Outback  2021    GHI-5291   Male   41  2020-07-01   
61  Lucas Thomas  Subaru  Outback  2021    GHI-5291   Male   41  2021-03-09   
62  Lucas Thomas  Subaru  Outback  2021    GHI-5291   Male   41  2020-03-01   
63  Lucas Thomas  Subaru  O

### Based on the provided code results:

#### Quasi-Identifier (QI):
**The Quasi-Identifiers in the dataset are 'Name', 'CarMake', 'CarModel', 'Year', 'NumberPlate', 'Gender', and 'Age'.** These attributes are considered as QIs because they contain information that, when combined, can potentially identify individuals or distinguish them from others in the dataset.

####  Identity Vulnerability (IV) of Quasi-Identifiers:
The Identity Vulnerability (IV) values of the QIs are as follows:

'Name': 10
'CarMake': 10
'CarModel': 10
'Year': 6
'NumberPlate': 10
'Gender': 2
'Age': 10
Higher IV values indicate that the corresponding QIs have a larger number of unique or distinct combinations of values, making them more vulnerable to privacy breaches and potential re-identification.

####  Recommended Quasi-Identifiers:
Based on the Identity Vulnerability (IV) values, the recommended Quasi-Identifiers for privacy protection are 'Name', 'CarMake', 'CarModel', 'NumberPlate', and 'Age' as they have the highest IV values of 10. These attributes are more likely to contain unique or specific information that could potentially identify individuals. To protect user privacy, it is advisable to apply strong privacy protection techniques, such as data generalization, on these vulnerable QIs while ensuring that the data remains useful for analysis and research purposes.

####  Equivalence Classes (Ci) and Diversity (D) with Privacy Parameter k = 2:
With the privacy parameter k set to 2, the dataset forms equivalence classes (Ci) where users with the same combination of QIs are grouped together. However, without the specific equivalence class data, we cannot provide further details about the number of classes formed. The Diversity (D) of the equivalence classes is calculated to be 3, indicating that there are three distinct combinations of Sensitive Attributes (SAs) within the equivalence classes. The low Evenness (E) value of 0.046 suggests that the distribution of SAs among the equivalence classes is not evenly balanced, which might affect the privacy protection level.

####  Adaptive Data Generalization:
The final result shows the dataset after adaptive data generalization, considering both the Identity Vulnerability (IV) of the QIs and the Diversity of the SAs in equivalence classes. However, the provided dataset does not include the generalized values. To better understand the specific data generalization applied, we would need additional information about the privacy protection measures used and the levels of generalization applied to the QIs and SAs.

## Step 3 : anonymisation on QI attribute: 
'Name', 'CarMake', 'CarModel', 'Year', 'NumberPlate', 'Gender', and 'Age'.

In the provided code, the data anonymization process is implemented using the Faker library along with the faker_vehicle extension for generating fake vehicle information. Each column is anonymized as follows:

### Name Column:
The 'Name' column is anonymized by replacing the original names with randomly generated full names. The fake.name() method from the Faker instance is used to generate random names, ensuring that the original names are not exposed in the anonymized dataset.

### CarMake Column:
The 'CarMake' column is anonymized by replacing the original car make names with randomly generated names. The fake.name() method is used again to generate fake car make names, ensuring that the actual car make information is hidden in the anonymized dataset.

### CarModel Column:
Similarly, the 'CarModel' column is anonymized by replacing the original car model names with randomly generated names using the fake.name() method.

### Year Column:
The 'Year' column is anonymized by replacing the original years with randomly generated integers between 1990 and 2023 (inclusive). The fake.random_int(min=1990, max=2023) method generates random integers for the years, ensuring that the actual years are obscured in the anonymized dataset.

### NumberPlate Column:
The 'NumberPlate' column is anonymized by replacing the original number plates with randomly generated strings in the format "???-####". The fake.numerify(text="???-####") method generates fake number plates with three random letters followed by four random digits, ensuring that the actual number plates are not identifiable in the anonymized dataset.

### Gender Column:
The 'Gender' column is anonymized by randomly selecting either 'Male' or 'Female' for each entry. The fake.random_element(elements=('Male', 'Female')) method randomly chooses between these two genders, ensuring that the actual gender information is concealed in the anonymized dataset.

### Age Column:
The 'Age' column is anonymized by replacing the original ages with randomly generated integers between 18 and 80 (inclusive). The fake.random_int(min=18, max=80) method generates fake ages, ensuring that the actual ages are not identifiable in the anonymized dataset.

In [37]:
from faker import Faker
from faker_vehicle import VehicleProvider

# Create a Faker instance
fake = Faker()

# Anonymize the 'Name', 'CarMake', 'CarModel', anonymized by replacing the original names with randomly generated 
# full names. The fake.name() method from the Faker instance is used to generate random names, ensuring that the 
# original names are not exposed in the anonymized dataset.
df['Name'] = df['Name'].apply(lambda x: fake.name())
df['CarMake'] = df['CarMake'].apply(lambda x: fake.name())
df['CarModel'] = df['CarModel'].apply(lambda x: fake.name())

# The 'Year' column is anonymized by replacing the original years with randomly generated integers between 1990 and 2023 (inclusive). 
# The fake.random_int(min=1990, max=2023) method generates random integers for the years, 
# ensuring that the actual years are obscured in the anonymized dataset.
df['Year'] = df['Year'].apply(lambda x: fake.random_int(min=1990, max=2023))

# The 'NumberPlate' column is anonymized by replacing the original number plates with randomly generated strings 
# in the format "???-####". The fake.numerify(text="???-####") method generates fake number plates with three random 
# letters followed by four random digits, ensuring that the actual number plates are not identifiable in the 
# anonymized dataset.
df['NumberPlate'] = df['NumberPlate'].apply(lambda x: fake.numerify(text="???-####"))

# The 'Gender' column is anonymized by randomly selecting either 'Male' or 'Female' for each entry. 
# The fake.random_element(elements=('Male', 'Female')) method randomly chooses between these two genders, 
# ensuring that the actual gender information is concealed in the anonymized dataset.
df['Gender'] = df['Gender'].apply(lambda x: fake.random_element(elements=('Male', 'Female')))

# The 'Age' column is anonymized by replacing the original ages with randomly generated integers between 18 and 80 (inclusive). 
# The fake.random_int(min=18, max=80) method generates fake ages, 
# ensuring that the actual ages are not identifiable in the anonymized dataset.
df['Age'] = df['Age'].apply(lambda x: fake.random_int(min=18, max=80))

print(df)


                Name            CarMake        CarModel  Year NumberPlate  \
0        April Pitts       Lonnie Baker       Tina Cook  2014    ???-5155   
1    Rebecca Cameron         Lori Floyd     Jill Harris  2002    ???-3812   
2         Amber Hall      Darlene Lopez   Jessica Boyer  2023    ???-2968   
3        Amanda Moss  Jennifer Williams     Steven Lang  2009    ???-9970   
4      Jesse Hoffman   Katherine Herman       Alex Hill  2008    ???-3520   
..               ...                ...             ...   ...         ...   
60    Randy Harrison        Eric Abbott  Dennis Simmons  2004    ???-6549   
61     Michael Mccoy     Alexandra Cole    Matthew Soto  1992    ???-8938   
62          Max Neal    Jordan Hamilton   Steven George  2013    ???-3935   
63  Crystal Peterson     Ronald Wallace     John Murphy  1995    ???-3298   
64    Joshua Collins      Lauren Taylor    Jason Bailey  1996    ???-9715   

    Gender  Age        Date PrimaryContributor CrashSeverity  Rank  
0   Fe

## Step 4: check k-anonymity
    

In [38]:
# Define the quasi-identifier attributes
quasi_identifier_attributes = ['Name', 'CarMake', 'CarModel', 'Year', 'NumberPlate', 'Gender', 'Age']

def is_k_anonymous(df, k, quasi_identifiers):
    counts = df.groupby(quasi_identifiers).size()
    return counts.min() <= k

# Set the desired value of k
k = 2

# Check if the DataFrame is k-anonymous with respect to the quasi-identifier attributes
is_k_anonymous_result = is_k_anonymous(df, k, quasi_identifier_attributes)

if is_k_anonymous_result:
    print(f"The DataFrame is k-anonymous with k = {k}.")
else:
    print(f"The DataFrame is not k-anonymous with k = {k}.")

The DataFrame is k-anonymous with k = 2.
