# CS211: Data Privacy
## Homework 2

In [2]:
# Load the data and libraries
import pandas as pd
import numpy as np

adult = pd.read_csv('')
adult = adult.dropna()

## Question 1 (20 points)

Implement a more efficient version of `is_k_anonymous`. The inefficient implementation, taken from the textbook, appears below.

**Hint**: use the [`value_counts`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) or `groupby` functions, and make sure no count is less than $k$.

In [None]:
# Checking for k-Anonymity, taken from the textbook
# def is_k_anonymous(k, qis, df):
#     for index, row in df.iterrows():
#         query = ' & '.join([f'`{col}` == "{row[col]}"' for col in qis])
#         rows = df.query(query)
#         if (rows.shape[0] < k):
#             return False
#     return True
adult['Occupation'].value_counts()

In [None]:
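# Checking for k-Anonymity more efficiently
def is_k_anonymous(k, qis, df):
    """Returns True if df satisfies k-Anonymity for the quasi-identifiers
    qis. Returns False otherwise."""
    # Count the rows in each group of identical quasi-identifier values;
    # the smallest group must contain at least k rows.
    return df.value_counts(qis).min() >= k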
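As an alternative to `value_counts`, the same check can be sketched with pandas `groupby`. The toy dataframe and the helper name `smallest_group_size` below are hypothetical and only illustrate the idea; they are not part of the assignment or the `adult` dataset.

```python
import pandas as pd

# Hypothetical toy data, not the adult dataset
toy = pd.DataFrame({
    'Age': [30, 30, 40, 40, 40, 50],
    'Zip': [100, 100, 200, 200, 200, 300],
})

def smallest_group_size(qis, df):
    """Size of the smallest group of rows that share the same values in qis."""
    return int(df.groupby(qis).size().min())

# The single row with Age == 50 forms a group of size 1,
# so the toy data is 1-anonymous but not 2-anonymous.
smallest = smallest_group_size(['Age', 'Zip'], toy)
is_2_anonymous = smallest >= 2
```

Both approaches make one pass over the data to count groups, instead of running one query per row as the textbook version does.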

In [None]:
# TEST CASES for question 1

assert not is_k_anonymous(2, ['Age'], adult)
assert is_k_anonymous(1, ['Age'], adult)
assert is_k_anonymous(1, ['Age', 'Occupation'], adult)

## Question 2 (10 points)

Consider the definition of `generalize` below, taken from the textbook. The function takes a dataframe `df` and a dictionary `depths` that describes how much to generalize each column of `df`. Generalizing a column to a depth of $n$ replaces the $n$ least-significant digits of each number in that column with zeroes. For example, the following depth specification zeroes out the least-significant digit of column `A` and the two least-significant digits of column `B`:

In [None]:
depths = {
    'A': 1,
    'B': 2
}


In [None]:
def generalize(df, depths):
    # Truncate each value toward zero to a multiple of 10**depth for its column
    return df.apply(lambda x: x.apply(lambda y: int(int(y/(10**depths[x.name]))*(10**depths[x.name]))))
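To make the digit arithmetic concrete, here is a minimal sketch of a vectorized variant applied to a small toy dataframe. The name `generalize_vectorized` and the toy values are hypothetical. The sketch assumes non-negative integer columns, where floor division matches the truncation used by the textbook function, and unlike the textbook version it leaves columns without an entry in `depths` unchanged.

```python
import pandas as pd

def generalize_vectorized(df, depths):
    """Hypothetical variant: zero out the depths[col] least-significant
    digits of each listed column (assumes non-negative integers)."""
    out = df.copy()
    for col, depth in depths.items():
        scale = 10 ** depth
        # Floor-divide to drop the low digits, then scale back up
        out[col] = (out[col] // scale) * scale
    return out

# Hypothetical toy data
toy = pd.DataFrame({'A': [23, 47, 58], 'B': [12345, 67890, 999]})
result = generalize_vectorized(toy, {'A': 1, 'B': 2})
# Column A becomes 20, 40, 50; column B becomes 12300, 67800, 900
```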

Using the `generalize` function, generalize the `Age` column of the `adult` dataset to a depth of 1. Drop the other columns of the dataset. Your result should achieve $k$-Anonymity for $k=20$.

In [None]:
def generalize_adult_age():
    depths = {
        'Age': 1
    }
    
    return generalize(adult[['Age']], depths)

In [None]:
assert is_k_anonymous(20, ['Age'], generalize_adult_age())

## Question 3 (10 points)

Using the `generalize` function, generalize the `Age` and `Zip` columns of the `adult` dataset in order to achieve $k$-Anonymity for $k=5$. Drop all columns other than these two from your result.

In [None]:
def generalize_adult_age_zip():
    depths = {
        'Age': 2,
        'Zip': 2
    }

    return generalize(adult[['Age', 'Zip']], depths)

In [None]:
assert is_k_anonymous(5, ['Age', 'Zip'], generalize_adult_age_zip())

## Question 4 (30 points)

In 1-4 sentences each, answer the following:

1. How much generalization was required to achieve $k=5$ in question 3?
2. Does this level of generalization significantly impact the utility of the $k$-Anonymized data? Why or why not?
3. Why is generalizing the `adult` dataset so challenging? (**Hint**: consider outliers)
4. Is there another approach, in addition to our simple generalization method, that might work better?
5. What is a simple method for generalizing the `Occupation` column?

1) Either enough to make `Age` essentially useless (zeroing out 2 digits) or enough to make `Zip` entirely useless (zeroing out all 5 digits).
2) Yes. If `Age` is an attribute we want to learn from, we need to preserve more information than whether a person is older or younger than 100.
3) People are messy and unique. Very often some individuals are dissimilar to everyone else in the dataset, so they stand out even after heavy generalization.
4) We could remove the outliers so that only the broader trends remain for analysis.
5) Remove or consolidate the armed-forces category.
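
The outlier-removal idea in answer 4 is often called suppression. A minimal sketch, assuming we simply drop every row whose quasi-identifier combination occurs fewer than $k$ times; the function name `suppress_small_groups` and the toy data are hypothetical, not from the textbook.

```python
import pandas as pd

def suppress_small_groups(k, qis, df):
    """Hypothetical helper: keep only rows whose combination of
    quasi-identifier values appears at least k times in df."""
    return df.groupby(qis).filter(lambda group: len(group) >= k)

# Hypothetical toy data: the single 90-year-old is an outlier
toy = pd.DataFrame({'Age': [20, 20, 20, 30, 30, 90]})
suppressed = suppress_small_groups(2, ['Age'], toy)

# Every remaining group now contains at least 2 rows
remaining_min = int(suppressed.value_counts(['Age']).min())
```

Suppression trades a few dropped rows for much finer-grained values in the rows that remain, so it can be combined with a shallower generalization depth.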