# Anonymization and k-Anonymity

## Instructions

The first half of this notebook contains code to read in and preprocess the example dataset. The second half contains questions for you to answer by writing code and describing your solutions.

## Preamble: Read in Adult dataset & Preprocessing

The dataset is based on census data. I have added the columns `Name`, `DOB`, `SSN`, and `Zip` to represent personally identifiable information (PII). The values in these columns are made up.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
# Note: Matplotlib 3.6+ renamed this style to 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
import pandas as pd
import numpy as np

def your_code_here():
    return 1

adult_data = pd.read_csv("adult_with_pii.csv")
adult_data.head()

Unnamed: 0,Name,DOB,SSN,Zip,Age,Workclass,fnlwgt,Education,Education-Num,Martial Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Country,Target
0,Karrie Trusslove,9/7/67,732-14-6110,64152,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,Brandise Tripony,6/7/88,150-19-2766,61523,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,Brenn McNeely,8/6/91,725-59-9860,95668,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,Dorry Poter,4/6/09,659-57-4974,25503,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,Dick Honnan,9/16/51,220-93-3811,75387,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [2]:
# Remove PII
adult_anon = adult_data.drop(columns=['Name', 'SSN'])
adult_anon.head()

Unnamed: 0,DOB,Zip,Age,Workclass,fnlwgt,Education,Education-Num,Martial Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Country,Target
0,9/7/67,64152,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,6/7/88,61523,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,8/6/91,95668,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,4/6/09,25503,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,9/16/51,75387,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [3]:
# PII only
pii = adult_data[['Name', 'DOB', 'SSN', 'Zip']]
pii.head()

Unnamed: 0,Name,DOB,SSN,Zip
0,Karrie Trusslove,9/7/67,732-14-6110,64152
1,Brandise Tripony,6/7/88,150-19-2766,61523
2,Brenn McNeely,8/6/91,725-59-9860,95668
3,Dorry Poter,4/6/09,659-57-4974,25503
4,Dick Honnan,9/16/51,220-93-3811,75387


## END PREAMBLE
-------------

## Collaboration Statement

In the cell below, write your collaboration statement. This statement should describe all collaborations, even high-level ones (e.g. "I discussed my general approach for answering question 3 with Josh"). High-level collaborations of this kind are allowed as long as they are described; copying of answers or code is not allowed.

In [4]:
# In this cell (in markdown or a comment), write your collaboration statement
"""I discussed the meaning of the parameter `iqs` in Question 2 with Lizhen Zhu."""

## Question 1

Using the dataframes `pii` and `adult_anon`, perform a linking attack to recover the names of as many samples in `adult_anon` as possible.

How many names are you able to recover?

In [30]:
# In this cell, write code to perform the linking attack
results = []
ident_cols = ['DOB', 'Zip']

for idx, row in pii.iterrows():
    # Select the rows of adult_anon that agree with this person on every identifying column
    matches = adult_anon[(adult_anon[ident_cols] == row[ident_cols]).all(axis=1)]
    # A name counts as recovered only when exactly one row matches
    if len(matches) == 1:
        # assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning
        results.append(matches.assign(Name=row['Name']))

In [31]:
# In this cell, write code to determine how many names could be recovered
len(results)

32559
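
The row-by-row loop above can also be written as a join on the identifying columns. The following is a minimal sketch on made-up toy data, not the notebook's actual dataset: the names `toy_pii`, `toy_anon`, `unique_anon`, `unique_pii`, and `linked` are hypothetical, and dropping ambiguous keys on the PII side as well is an extra assumption that the loop does not make.

```python
import pandas as pd

# Toy stand-ins for `pii` and `adult_anon` (made-up values, not the real dataset).
toy_pii = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Di"],
    "DOB": ["1/1/70", "2/2/80", "2/2/80", "3/3/90"],
    "Zip": [11111, 22222, 22222, 33333],
})
toy_anon = pd.DataFrame({
    "DOB": ["1/1/70", "2/2/80", "2/2/80", "3/3/90"],
    "Zip": [11111, 22222, 22222, 33333],
    "Occupation": ["Sales", "Tech-support", "Craft-repair", "Sales"],
})

ident_cols = ["DOB", "Zip"]

# Keep only the anonymized rows whose (DOB, Zip) combination is unique:
# a duplicated combination cannot be attributed to a single person.
unique_anon = toy_anon.drop_duplicates(subset=ident_cols, keep=False)

# Assumption: also drop ambiguous keys on the PII side so each match is one-to-one.
unique_pii = toy_pii.drop_duplicates(subset=ident_cols, keep=False)

# An inner join on the identifying columns attaches a name to each unique row.
linked = unique_anon.merge(unique_pii, on=ident_cols, how="inner")
recovered = len(linked)
print(recovered)
```

In the toy data, Bob and Cy share the same date of birth and zip code, so only the other two rows are linked to a name.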

## Question 2

Implement a function `is_k_anonymous` to check (for a given `k`) whether a given dataframe satisfies k-Anonymity.

In [7]:
# In this cell, write code to implement 'is_k_anonymous'

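def is_k_anonymous(k, iqs, df):
    """Check k-anonymity. `iqs` is assumed to be a list of quasi-identifier column names."""
    for index, row in df.iterrows():
        # Compare with a boolean mask rather than a query string, so that string
        # values and column names containing spaces or hyphens are handled correctly
        matches = (df[iqs] == row[iqs]).all(axis=1)
        # Every row must share its quasi-identifier values with at least k rows (itself included)
        if matches.sum() < k:
            return False
    return True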
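
As a point of comparison, the same check can be sketched with a single grouping pass instead of one scan per row. This is only an illustration under stated assumptions: the function name `is_k_anonymous_grouped`, the parameter name `qis`, and the toy dataframe are hypothetical and not part of the assignment.

```python
import pandas as pd

def is_k_anonymous_grouped(k, qis, df):
    """Return True if every combination of values in `qis` occurs at least k times."""
    if df.empty:
        return True  # no rows, so no group can violate the condition
    # Count how many rows fall into each combination of quasi-identifier values.
    group_sizes = df.groupby(qis, dropna=False).size()
    # The dataframe is k-anonymous when the smallest group has at least k rows.
    return bool(group_sizes.min() >= k)

# Made-up example: one group of size 2 and one group of size 3.
toy = pd.DataFrame({
    "Zip": [11111, 11111, 22222, 22222, 22222],
    "Sex": ["Male", "Male", "Female", "Female", "Female"],
})
print(is_k_anonymous_grouped(2, ["Zip", "Sex"], toy))  # True
print(is_k_anonymous_grouped(3, ["Zip", "Sex"], toy))  # False
```

Grouping visits each row once, so it avoids rescanning the whole dataframe for every row.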


## Question 3

In one or two sentences, informally describe how well you expect your implementation of 'is_k_anonymous' to scale with the size of the input data.

In [None]:
# In this cell, describe (in markdown or in a comment) the scaling behavior of your answer in question 2.
"""I expect this implementation to scale poorly: it iterates over every row of the dataframe and, for each one, scans the whole dataframe again, so the running time grows roughly quadratically with the number of rows."""

## Question 4 

Write code to answer the query: "how many participants have never been married?"

*Hint*: filter the `adult_data` dataframe to contain only participants who were never married, then return the 0th element of the `shape` of the filtered dataframe.
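
The hint can be sketched on a small made-up dataframe as follows. The name `toy_adult` and its values are hypothetical; only the column name `Martial Status` (spelled as in the dataset) and the value `Never-married` come from the notebook.

```python
import pandas as pd

# Made-up stand-in for `adult_data`, reduced to the two columns needed here.
toy_adult = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Di"],
    "Martial Status": ["Never-married", "Divorced", "Never-married", "Married-civ-spouse"],
})

# Step 1: keep only the participants who were never married.
never_married = toy_adult[toy_adult["Martial Status"] == "Never-married"]

# Step 2: the 0th element of `shape` is the number of rows in the filtered dataframe.
count = never_married.shape[0]
print(count)
```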

In [9]:
query1 = your_code_here()
query1;

## Question 5 

In 2-5 sentences, answer the following:
- What privacy concerns are brought by query1?
- What could be a simple solution to limit the concern raised by Question 4? 

In [27]:
# Write your answer to Question 4 here
adult_data[adult_data['Martial Status'] == 'Never-married']

Unnamed: 0,Name,DOB,SSN,Zip,Age,Workclass,fnlwgt,Education,Education-Num,Martial Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Country,Target
0,Karrie Trusslove,9/7/67,732-14-6110,64152,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
8,Hedvige Fairpo,8/10/01,691-69-7317,81548,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
12,Massimiliano Plett,3/11/57,870-03-4270,79482,23,Private,122272,Bachelors,13,Never-married,Adm-clerical,Own-child,White,Female,0,0,30,United-States,<=50K
13,Gwenny Micallef,3/11/87,531-39-1676,29389,32,Private,205019,Assoc-acdm,12,Never-married,Sales,Not-in-family,Black,Male,0,0,50,United-States,<=50K
16,Rozella Coulthard,1/16/10,703-66-0223,32345,25,Self-emp-not-inc,176756,HS-grad,9,Never-married,Farming-fishing,Own-child,White,Male,0,0,35,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32537,Meredeth Pickavance,8/11/62,531-67-5318,57639,30,Private,345898,HS-grad,9,Never-married,Craft-repair,Not-in-family,Black,Male,0,0,46,United-States,<=50K
32548,Micheil Ricardou,2/14/68,344-98-3922,89379,65,Self-emp-not-inc,99359,Prof-school,15,Never-married,Prof-specialty,Not-in-family,White,Male,1086,0,60,United-States,<=50K
32553,Leonora Shaxby,2/9/08,288-99-7030,62311,32,Private,116138,Masters,14,Never-married,Tech-support,Not-in-family,Asian-Pac-Islander,Male,0,0,11,Taiwan,<=50K
32555,Gearalt Dodshun,7/22/85,370-97-0761,60940,22,Private,310152,Some-college,10,Never-married,Protective-serv,Not-in-family,White,Male,0,0,40,United-States,<=50K


- What privacy concerns are brought by query1?
    - *Martial Status* can be an important quasi-identifier, since filtering on it narrows the data to about one third of the participants.
- What could be a simple solution to limit the concern raised by Question 4? 
    - We can simply drop this column when releasing the data.
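
Dropping the column before release might look like the sketch below. The dataframe `toy_adult` and the name `released` are hypothetical, and the values are made up.

```python
import pandas as pd

# Made-up stand-in for `adult_data`.
toy_adult = pd.DataFrame({
    "Age": [39, 50, 38],
    "Martial Status": ["Never-married", "Married-civ-spouse", "Divorced"],
    "Target": ["<=50K", "<=50K", "<=50K"],
})

# drop() returns a new dataframe without the column; the original is left unchanged.
released = toy_adult.drop(columns=["Martial Status"])
print(list(released.columns))
```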