In [16]:
from ucimlrepo import fetch_ucirepo 
import pandas as pd  
import numpy as np

In [3]:
adult_dataset = fetch_ucirepo(id=2) 
adult_df = adult_dataset.data.features.copy()  # copy so that adding 'income' does not write into the fetched object
adult_df['income'] = adult_dataset.data.targets
# print(adult_dataset.metadata) 
# print(adult_dataset.variables) 
adult_df.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


## Dataset Analysis

In [25]:
categorical_columns = adult_df.select_dtypes(include=['object']).columns.tolist()
numerical_columns = adult_df.select_dtypes(include=['int64']).columns.tolist()

temp_df = adult_df.copy()
temp_df[categorical_columns] = temp_df[categorical_columns].astype('string')

data_type_df = temp_df.dtypes

print("List of Columns and Their Type:")
data_type_df

List of Columns and Their Type:


age                int64
workclass         string
fnlwgt             int64
education         string
education-num      int64
marital-status    string
occupation        string
relationship      string
race              string
sex               string
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    string
income            string
dtype: object

In [26]:
unique_values_count = adult_df[categorical_columns].nunique()
print('Number of unique values in categorical columns:')
unique_values_count

Number of unique values in categorical columns:


workclass          9
education         16
marital-status     7
occupation        15
relationship       6
race               5
sex                2
native-country    42
income             4
dtype: int64

In [29]:
def count_outliers(series):
    """Count values outside the Tukey fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1  # interquartile range
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return ((series < lower_bound) | (series > upper_bound)).sum()


stats = {}

for column in numerical_columns:
    stats[column] = {
        'min': adult_df[column].min(),
        'max': adult_df[column].max(),
        'mean': adult_df[column].mean(),
        'std': adult_df[column].std(),
        'number_of_unique_values': adult_df[column].nunique(),
        'number_of_outliers': count_outliers(adult_df[column])
    }

stats_df = pd.DataFrame(stats).T

stats_df

,min,max,mean,std,number_of_unique_values,number_of_outliers
age,17.0,90.0,38.643585,13.71051,74.0,216.0
fnlwgt,12285.0,1490400.0,189664.134597,105604.025423,28523.0,1453.0
education-num,1.0,16.0,10.078089,2.570973,16.0,1794.0
capital-gain,0.0,99999.0,1079.067626,7452.019058,123.0,4035.0
capital-loss,0.0,4356.0,87.502314,403.004552,99.0,2282.0
hours-per-week,1.0,99.0,40.422382,12.391444,96.0,13496.0


### Preserved Utilities
The Adult Income dataset contains attributes such as age, occupation, education, and income. Anonymizing it should keep individuals from being identified while leaving the data useful for tasks such as predicting income or studying how different factors affect earnings. In practice, attributes such as age, hours per week, and education level are generalized or grouped (for example, exact ages replaced by age ranges) so that they stay accurate enough for analysis but no longer single out one person. The result is a trade-off between re-identification risk and the data's value for studying economic patterns, assessing fairness, and teaching.
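As a rough illustration of grouping values together, the sketch below coarsens `age` into ten-year bands with `pandas.cut`. The sample ages and the bin edges are assumptions chosen for illustration, not settings taken from this notebook.

```python
import pandas as pd

# Hypothetical sample of ages; in the notebook this would be adult_df['age'].
ages = pd.Series([17, 28, 39, 50, 53, 90], name="age")

# Assumed bin edges: ten-year bands covering the observed 17-90 range.
bin_edges = list(range(10, 101, 10))  # 10, 20, ..., 100

# right=False makes each band closed on the left: [10, 20), [20, 30), ...
age_bands = pd.cut(ages, bins=bin_edges, right=False)

print(age_bands)
```

Wider bands hide more about each person but also blur the relationship between age and income, which is the utility cost described above.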

# Identification of Sensitive Information

In [33]:
print('Numerical columns:')
print(numerical_columns)
print('Categorical Columns:')
print(categorical_columns)

Numerical columns:
['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
Categorical Columns:
['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country', 'income']


| Category               | Attributes                                                                                      | Reason for Classification                                                                                                                                               |
|------------------------|-------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Explicit Identifiers** | None provided directly in the dataset.                                                           | Explicit identifiers (like names, SSN) that can directly identify an individual are not included in the dataset for privacy reasons.                                   |
| **Quasi-Identifiers**    | Age, Workclass, Education, Marital Status, Occupation, Relationship, Race, Sex, Native-country, Hours-per-week | These attributes, in combination, can potentially be used to re-identify individuals indirectly, especially when linked with external information.                     |
| **Sensitive Attributes** | Income                                                                                          | Income is considered sensitive as it can reveal personal economic status and potentially be used for discriminatory purposes if linked to an individual.               |
| **Non-sensitive Attributes** | Education-num, Capital-gain, Capital-loss                                                       | These attributes, while informative about an individual's socioeconomic status, are less likely on their own to identify an individual directly or to be used in a discriminatory way. |


# K-Anonymization Technique (Mondrian)
Reference:

LeFevre, Kristen, David J. DeWitt, and Raghu Ramakrishnan. "Mondrian multidimensional k-anonymity." 22nd International Conference on Data Engineering (ICDE'06). IEEE, 2006.

https://github.com/Nuclearstar/K-Anonymity/

Mondrian k-anonymization is a partitioning algorithm used to achieve k-anonymity in datasets, ensuring that each record is indistinguishable from at least \(k-1\) other records regarding certain identifying attributes. This process helps protect privacy by making re-identification more difficult. The basic steps for Mondrian k-anonymization are as follows:
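The k-anonymity property itself can be checked by grouping rows on the quasi-identifiers and looking at the smallest group. The following is a minimal sketch; the function name `is_k_anonymous` and the toy table are illustrative assumptions, not part of the notebook or of `anonymizer.py`.

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """Return True if every observed combination of quasi-identifier
    values appears in at least k rows of df."""
    group_sizes = df.groupby(quasi_identifiers, dropna=False).size()
    return bool(group_sizes.min() >= k)

# Toy table with already-generalized ages (illustrative values only).
toy = pd.DataFrame({
    "age": ["30-39", "30-39", "50-59", "50-59", "50-59"],
    "sex": ["Male", "Male", "Female", "Female", "Female"],
    "income": ["<=50K", ">50K", "<=50K", "<=50K", ">50K"],
})

print(is_k_anonymous(toy, ["age", "sex"], k=2))  # True: smallest group has 2 rows
print(is_k_anonymous(toy, ["age", "sex"], k=3))  # False: one group has only 2 rows
```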

### Preparation:
- Identify the quasi-identifiers in the dataset. These are attributes that, in combination, could be linked with external information to re-identify individuals.
- Determine the value of \(k\) for \(k\)-anonymity, reflecting the privacy level desired.

### Initialization:
- Mondrian is a recursive algorithm. Start by considering the entire dataset as a single group.

### Partitioning:
- Choose one quasi-identifier to partition on. This is typically the attribute with the widest range of values in the current partition, with each range normalized by the attribute's range over the whole dataset so that attributes on different scales are comparable.
- Split the current partition in two at the median value of the chosen attribute, aiming to distribute records as evenly as possible while ensuring that each side has at least \(k\) records.
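The partitioning step can be sketched for numerical quasi-identifiers as below. The helper names (`choose_split_column`, `median_split`), the `full_ranges` dictionary, and the toy data are assumptions made for illustration; the referenced implementation may choose and split differently.

```python
import pandas as pd

def choose_split_column(partition, quasi_identifiers, full_ranges):
    """Pick the quasi-identifier whose range inside the partition is widest
    relative to its range over the whole dataset."""
    def normalized_span(col):
        span = partition[col].max() - partition[col].min()
        return span / full_ranges[col] if full_ranges[col] else 0.0
    return max(quasi_identifiers, key=normalized_span)

def median_split(partition, column, k):
    """Split at the median of `column`.
    Return None if either side would hold fewer than k rows."""
    median = partition[column].median()
    left = partition[partition[column] <= median]
    right = partition[partition[column] > median]
    if len(left) < k or len(right) < k:
        return None
    return left, right

# Toy data (illustrative values only).
toy = pd.DataFrame({
    "age": [17, 28, 39, 50, 53, 90],
    "hours-per-week": [40, 40, 13, 40, 45, 40],
})
qi = ["age", "hours-per-week"]
full_ranges = {c: toy[c].max() - toy[c].min() for c in qi}

col = choose_split_column(toy, qi, full_ranges)
left, right = median_split(toy, col, k=2)
print(col, len(left), len(right))
```

On the full dataset every normalized range equals 1, so the first listed column wins the tie; the normalization only starts to matter inside smaller partitions.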

### Recursion:
- Recursively apply the partitioning step to each new partition created until further partitioning would violate the \(k\)-anonymity constraint (i.e., a partition cannot be split without resulting in a subgroup with fewer than \(k\) records).

### Generalization:
- For each partition, generalize the quasi-identifier values to the smallest range that includes all values in the partition. This step ensures that all records in a partition are indistinguishable based on the quasi-identifiers.
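A minimal sketch of this generalization step follows. The output encoding is an assumption: numeric columns become a `min-max` string and categorical columns become a comma-separated set of the labels present, which is one common convention rather than necessarily the one used by `anonymizer.py`.

```python
import pandas as pd

def generalize_partition(partition, quasi_identifiers):
    """Give every row in the partition the same value per quasi-identifier:
    a 'min-max' range for numbers, a sorted label set for categories."""
    out = partition.copy()
    for col in quasi_identifiers:
        if pd.api.types.is_numeric_dtype(partition[col]):
            lo, hi = partition[col].min(), partition[col].max()
            out[col] = str(lo) if lo == hi else f"{lo}-{hi}"
        else:
            out[col] = ",".join(sorted(partition[col].astype(str).unique()))
    return out

# One partition of three records (illustrative values only).
part = pd.DataFrame({
    "age": [28, 38, 39],
    "sex": ["Female", "Male", "Male"],
    "income": ["<=50K", "<=50K", "<=50K"],
})

print(generalize_partition(part, ["age", "sex"]))
```

The sensitive attribute (`income` here) is left untouched; only the quasi-identifiers are coarsened.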

### Stop Condition:
- The recursion stops when it's no longer possible to split a partition into two groups that both satisfy the \(k\)-anonymity requirement.
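The recursion and its stop condition can be put together in a compact sketch for numerical quasi-identifiers. This is an illustration under simplifying assumptions, not the referenced implementation: the function name `mondrian_partitions` is hypothetical, it ranks columns by raw (un-normalized) range to stay short, and it does not handle categorical attributes.

```python
import pandas as pd

def mondrian_partitions(df, quasi_identifiers, k):
    """Recursively split df on numeric quasi-identifiers and return a list
    of partitions (DataFrames), each holding at least k rows."""
    # Try the widest-ranging column first, then fall back to the others.
    # Simplification: raw ranges are compared, without normalization.
    spans = {c: df[c].max() - df[c].min() for c in quasi_identifiers}
    for col in sorted(quasi_identifiers, key=lambda c: spans[c], reverse=True):
        median = df[col].median()
        left, right = df[df[col] <= median], df[df[col] > median]
        if len(left) >= k and len(right) >= k:
            return (mondrian_partitions(left, quasi_identifiers, k)
                    + mondrian_partitions(right, quasi_identifiers, k))
    # Stop condition: no column allows a split with both sides >= k rows.
    return [df]

# Toy data (illustrative values only).
toy = pd.DataFrame({
    "age": [17, 28, 39, 50, 53, 90],
    "hours-per-week": [40, 40, 13, 40, 45, 40],
})

parts = mondrian_partitions(toy, ["age", "hours-per-week"], k=2)
print([len(p) for p in parts])  # [3, 3]
```

With six rows and k = 2, the first split on `age` succeeds, but neither half can be split again without leaving a side with a single row, so the recursion stops at two partitions of three rows.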

### Post-processing:
- After the partitioning and generalization steps are complete, review the anonymized data to ensure it meets the desired \(k\)-anonymity level. Adjustments can be made if necessary.

Mondrian k-anonymization is particularly useful for numerical data and for categorical data that can be ordered. Its simplicity and effectiveness make it a popular choice for anonymizing datasets while preserving their utility for analysis. However, k-anonymity by itself does not protect against all attacks, especially when attackers have additional background information. For example, if every record in a group shares the same sensitive value, that value is disclosed for everyone in the group even though no single record can be picked out.




---

# Anonymizer Script

Analysis of the Adult dataset is available in the `notebooks/k_anonymization.ipynb` file.

The `anonymizer.py` script applies the Mondrian k-anonymization algorithm to achieve k-anonymity in datasets, such as the Adult dataset, ensuring privacy protection and preventing individual re-identification. This script is designed to handle both numerical and categorical data, making it versatile for various anonymization needs.

## Getting Started

### Prerequisites

Ensure you have Python 3 installed on your system. You can check your Python version by running:

```bash
python3 --version
```

### Usage

The `anonymizer.py` script takes several arguments to specify the level of anonymity (k-value), the quasi-identifiers, the sensitive attribute, and the output file path. You can learn about all available arguments and the script's functionality by using the `--help` option:

```bash
python3 anonymizer.py --help
```

### Arguments

- `-k`, `--k_value`: The k-value for k-anonymity, determining the level of privacy.
- `-q`, `--quasi_identifiers`: A list of column names to be treated as quasi-identifiers, separated by commas.
- `-s`, `--sensitive_attribute`: The name of the sensitive attribute in the dataset.
- `-o`, `--output`: Path to save the anonymized dataset.
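For readers who want to see how such a command line could be parsed, here is a hypothetical sketch using the standard `argparse` module. Only the flag names come from the list above; the types, the `required` settings, the help strings, and the helper `parse_column_list` are assumptions, and the real `anonymizer.py` may be written differently.

```python
import argparse

def parse_column_list(text):
    """Turn 'age,hours-per-week' into ['age', 'hours-per-week']."""
    return [name.strip() for name in text.split(",") if name.strip()]

def build_parser():
    # Hypothetical parser mirroring the documented flags.
    parser = argparse.ArgumentParser(
        description="Apply Mondrian k-anonymization to a dataset.")
    parser.add_argument("-k", "--k_value", type=int, required=True,
                        help="k for k-anonymity")
    parser.add_argument("-q", "--quasi_identifiers", type=parse_column_list,
                        required=True, help="comma-separated column names")
    parser.add_argument("-s", "--sensitive_attribute", required=True,
                        help="name of the sensitive column")
    parser.add_argument("-o", "--output", required=True,
                        help="path for the anonymized dataset")
    return parser

# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["-k", "3", "-q", "age,hours-per-week", "-s", "income", "-o", "out.csv"])
print(args.k_value, args.quasi_identifiers, args.sensitive_attribute, args.output)
```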

### Running Example

To anonymize a dataset with a k-value of 3, treating 'age' and 'hours-per-week' as quasi-identifiers and 'income' as the sensitive attribute, and writing the result to an output file, you would run:

```bash
python3 anonymizer.py --k_value 3 --quasi_identifiers age,hours-per-week --sensitive_attribute income --output path/to/anonymized_dataset.csv
```

Replace `path/to/anonymized_dataset.csv` with the desired path for the anonymized output.

---
