# k-anonymity
In this exercise you will practice exploring and linking several fake datasets, then try to de-identify and re-identify the owners of the records. Think like an attacker who wants to gain as much information as possible. The attacker may want to ask for money based on the value of the information found about each person.

## Datasets
There are four datasets:
1. income.csv: The data that an imaginary tax-related organization holds about its clients.
2. ip.csv: A simple example of the data held by an internet service provider (e.g. Shaw).
3. hospital.csv: The data held by an insurance company that provides insurance for travellers.
4. creditcard.csv: The data held by a third-party organization that performs credit checks.

### Load the datasets
Load each dataset as a separate dataframe and explore the data.

In [2]:
import pandas as pd

income = pd.read_csv('income.csv')
# income1 = pd.DataFrame(income)
ip = pd.read_csv('ip.csv')
hospital = pd.read_csv('hospital.csv')
creditcard = pd.read_csv('creditcard.csv')

income.info()
print('')
ip.info()
print('')
hospital.info()
print('')
creditcard.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         20 non-null     object
 1   lastname     20 non-null     object
 2   ID           20 non-null     object
 3   DOB          20 non-null     object
 4   postal_code  20 non-null     int64 
 5   color        20 non-null     object
 6   companies    20 non-null     object
 7   income       20 non-null     int64 
dtypes: int64(2), object(6)
memory usage: 1.4+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        20 non-null     object
 1   lastname    20 non-null     object
 2   DOB         20 non-null     object
 3   ip_address  20 non-null     object
 4   location    20 non-null     object
dtypes: object(5)
memory usage: 928.0+ bytes

<class 'p

### De-identification
For each dataset, classify every column as one of the following and justify your answer:
1. explicit identifier
2. quasi identifiers
3. sensitive data
4. other
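
One rough way to support the classification above is to measure how many rows are unique on a given set of columns. The sketch below uses a small made-up table (the values and the helper name `unique_fraction` are assumptions for illustration, not part of the exercise files): a column set on which most rows are unique behaves like an identifier, even when no single column in it is a name or an ID.

```python
import pandas as pd

# Toy records (hypothetical values, not taken from the exercise files).
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "DOB": ["1990-01-01", "1990-01-01", "1985-05-05", "1985-05-05"],
    "postal_code": [10001, 10001, 10001, 20002],
    "income": [50000, 60000, 70000, 80000],
})


def unique_fraction(df, columns):
    """Fraction of rows whose combination of `columns` occurs exactly once."""
    # keep=False marks every member of a duplicated group, so the
    # negation selects only the rows that are unique on `columns`.
    is_unique = ~df.duplicated(subset=columns, keep=False)
    return float(is_unique.mean())


name_share = unique_fraction(toy, ["name"])
qi_share = unique_fraction(toy, ["DOB", "postal_code"])
print("unique on name:", name_share)
print("unique on DOB + postal_code:", qi_share)
```

The same helper could be pointed at `income`, `ip`, `hospital`, or `creditcard` to compare candidate quasi-identifier sets before deciding which columns to generalize.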

In [1]:
#income.info()
print('Dataset: Income')
print('Explicit Identifier: name, lastname, ID')
print('Quasi Identifier: DOB, postal_code')
print('Sensitive Data: companies, income')
print('Other: color')
print('')
#ip.info()
print('Dataset: IP')
print('Explicit Identifier: name, lastname')
print('Quasi Identifier: location, DOB')
print('Sensitive Data: ip_address')
print('Other:')
print('')
#hospital.info()
print('Dataset: Hospital')
print('Explicit Identifier: name, lastname')
print('Quasi Identifier: DOB')
print('Sensitive Data: medical reason')
print('Other: last_food')
print('')
#creditcard.info()
print('Dataset: Credit card')
print('Explicit Identifier: name, lastname')
print('Quasi Identifier: DOB, postal_code')
print('Sensitive Data: credit_number, credit_security_code, credit_provider')
print('Other:')


Dataset: Income
Explicit Identifier: name, lastname, ID
Quasi Identifier: DOB, postal_code
Sensitive Data: companies, income
Other: color

Dataset: IP
Explicit Identifier: name, lastname
Quasi Identifier: location, DOB
Sensitive Data: ip_address
Other:

Dataset: Hospital
Explicit Identifier: name, lastname
Quasi Identifier: DOB
Sensitive Data: medical reason
Other: last_food

Dataset: Credit card
Explicit Identifier: name, lastname
Quasi Identifier: DOB, postal_code
Sensitive Data: credit_number, credit_security_code, credit_provider
Other:


#### Anonymize the data by removing the explicit identifiers from each dataset

In [97]:
income2 = income.drop(columns = ['name', 'lastname', 'ID'])
income2.info()
print('')
ip2 = ip.drop(columns = ['name', 'lastname'])
ip2.info()
print('')
hospital2 = hospital.drop(columns = ['name','lastname'])
hospital2.info()
print('')
creditcard2 = creditcard.drop(columns = ['name', 'lastname'])
creditcard2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   DOB          20 non-null     object
 1   postal_code  20 non-null     int64 
 2   color        20 non-null     object
 3   companies    20 non-null     object
 4   income       20 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 928.0+ bytes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   DOB         20 non-null     object
 1   ip_address  20 non-null     object
 2   location    20 non-null     object
dtypes: object(3)
memory usage: 608.0+ bytes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   DOB         

### Re-identification by linking
Try to link the records across the datasets to re-identify them. Note that linking may only reveal matching information about a record rather than identify the individual behind it.
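
As a minimal sketch of what such a linkage looks like, assume an attacker holds a small auxiliary table with names and joins it to a release that kept the quasi-identifiers. All values and the frame names `released` and `auxiliary` are hypothetical; only `pd.merge` and its `on`, `how`, and `validate` parameters are real pandas API.

```python
import pandas as pd

# Hypothetical "anonymized" release: names removed, quasi-identifiers kept.
released = pd.DataFrame({
    "DOB": ["1990-10-13", "2000-03-21", "1983-11-25"],
    "postal_code": [11111, 22222, 33333],
    "diagnosis": ["back pain", "fever", "injury"],
})

# Hypothetical auxiliary data the attacker already holds, with names.
auxiliary = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "DOB": ["1990-10-13", "1983-11-25"],
    "postal_code": [11111, 33333],
})

# Join on the shared quasi-identifiers. validate="one_to_one" raises an
# error if a key combination is repeated on either side, so a successful
# merge means every match is unambiguous.
linked = pd.merge(
    auxiliary,
    released,
    on=["DOB", "postal_code"],
    how="inner",
    validate="one_to_one",
)
print(linked)
```

Every row in `linked` attaches a name to a sensitive value, which is exactly the outcome the anonymization steps later in this exercise try to prevent.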


In [98]:
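# Join on the shared columns explicitly (these are the columns pd.merge picks by default here)
d1 = pd.merge(income2, hospital2, on='DOB')
d2 = pd.merge(ip2, creditcard2, on='DOB')
data = pd.merge(d1, d2, on=['DOB', 'postal_code'])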
data

Unnamed: 0,DOB,postal_code,color,companies,income,last_food,medical reason,ip_address,location,credit_number,credit_provider,credit_security_code
0,1990-10-13,92310,DarkBlue,Joseph-Burns,120000,banana,back pain,192.0.8.93,"('53.7446', '-0.33525', 'Kingston upon Hull', ...",4760000000000000.0,VISA 13 digit,8
1,2000-03-21,73196,SeaShell,Nguyen PLC,70000,apple,flue,203.48.10.235,"('48.73218', '11.18709', 'Neuburg an der Donau...",2220000000000000.0,Discover,644
2,1992-03-19,86372,Bisque,Byrd-Walton,223546,steak,vomiting,198.51.98.53,"('35.06544', '1.04945', 'Frenda', 'DZ', 'Afric...",373000000000000.0,JCB 16 digit,542
3,1945-04-02,19557,LightYellow,Pena Group,62345,coffee,fever,192.160.182.167,"('35.85', '117.7', 'Dongdu', 'CN', 'Asia/Shang...",4.42e+18,Maestro,454
4,1983-11-25,94306,DarkSlateBlue,Schneider Inc,146098,mocha,cancer,213.43.91.75,"('32.05971', '34.8732', 'Ganei Tikva', 'IL', '...",4.85e+18,JCB 16 digit,297
5,1951-02-14,29648,LightBlue,Ferguson Group,56000,strawberry,cold,192.29.160.209,"('-20.87306', '-48.29694', 'Viradouro', 'BR', ...",4.66e+18,American Express,188
6,1949-02-24,10124,Fuchsia,"Martin, Alvarez and Young",231456,apple,knee problem,198.51.2.188,"('22.37066', '114.10479', 'Tsuen Wan', 'HK', '...",4450000000000000.0,VISA 16 digit,565
7,1947-01-31,78788,OrangeRed,"Burns, Michael and Collins",210900,gala,accident,198.58.178.92,"('48.52961', '12.16179', 'Landshut', 'DE', 'Eu...",36300000000000.0,American Express,76
8,1958-10-26,77075,MediumAquaMarine,"Miller, Hanson and Roberts",93567,chicken,flue,203.3.238.205,"('48.07667', '8.64409', 'Trossingen', 'DE', 'E...",4010000000000.0,JCB 16 digit,445
9,1983-12-17,82698,IndianRed,Freeman-Perry,90000,chickenpie,injury,192.52.207.100,"('38.37255', '34.02537', 'Aksaray', 'TR', 'Eur...",3510000000000000.0,VISA 19 digit,368


### Anonymize 
Anonymize the income and credit card datasets. Use generalization or suppression on the postal code.
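
Besides binning into numeric ranges, one common way to generalize a postal code is to keep only its leading digits and mask the rest; masking every digit amounts to suppression. The sketch below is an illustration under the assumption that postal codes are five-digit integers; the helper `generalize_postal` and its parameters are made up for this example.

```python
import pandas as pd

# A few postal codes in the same five-digit integer form as the datasets.
postal = pd.Series([92310, 73196, 19557, 94306])


def generalize_postal(codes, keep_digits=2, width=5):
    """Keep the first `keep_digits` digits and mask the rest with '*'."""
    # Zero-pad so that codes with leading zeros keep their full width.
    as_text = codes.astype(str).str.zfill(width)
    return as_text.str[:keep_digits] + "*" * (width - keep_digits)


generalized = generalize_postal(postal)               # partial masking
suppressed = generalize_postal(postal, keep_digits=0)  # full suppression
print(generalized.tolist())
print(suppressed.tolist())
```

Keeping fewer digits makes the equivalence classes larger (higher k) at the cost of less useful data, so `keep_digits` is the knob to tune.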

In [99]:
# Age bands (in years) and postal-code ranges used for generalization
bins_dob = [0,30,60,100]
bins_postal = [0,30000,60000,90000,120000]
# Replace each DOB with the age in 2023, then with its age band
income2['DOB'] = (2023 - pd.to_numeric(income2.DOB.str[:4]))
income2['DOB'] = pd.cut(income2['DOB'],bins_dob)
income2['postal_code'] = pd.cut(income2['postal_code'],bins_postal)
print(income2)
# Apply the same generalization to the credit card data
creditcard2['DOB'] = (2023 - pd.to_numeric(creditcard2.DOB.str[:4]))
creditcard2['DOB'] = pd.cut(creditcard2['DOB'],bins_dob)
creditcard2['postal_code'] = pd.cut(creditcard2['postal_code'],bins_postal)
print(creditcard2)

          DOB      postal_code             color  \
0    (30, 60]  (90000, 120000]          DarkBlue   
1     (0, 30]   (60000, 90000]          SeaShell   
2    (30, 60]   (60000, 90000]            Bisque   
3   (60, 100]       (0, 30000]       LightYellow   
4    (30, 60]  (90000, 120000]     DarkSlateBlue   
5   (60, 100]       (0, 30000]         LightBlue   
6   (60, 100]       (0, 30000]           Fuchsia   
7   (60, 100]   (60000, 90000]         OrangeRed   
8   (60, 100]   (60000, 90000]  MediumAquaMarine   
9    (30, 60]   (60000, 90000]         IndianRed   
10   (30, 60]   (30000, 60000]          DarkGray   
11  (60, 100]   (30000, 60000]       LightSalmon   
12   (30, 60]       (0, 30000]            Yellow   
13   (30, 60]       (0, 30000]   MediumVioletRed   
14    (0, 30]       (0, 30000]        LightCoral   
15   (30, 60]   (60000, 90000]        DarkOrchid   
16   (30, 60]       (0, 30000]    LightSteelBlue   
17  (60, 100]   (30000, 60000]       LightSalmon   
18   (30, 60

In [92]:
#print(income2)
#print(creditcard2)

#### Question: Is it k-anonymized?
What is the maximum k for which each of the credit card and income datasets is k-anonymous?
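
A dataset is k-anonymous for the largest k such that every non-empty combination of quasi-identifier values covers at least k rows, so k is the size of the smallest equivalence class. The sketch below computes this on a made-up table; the helper name `k_of` and the toy values are assumptions. It also shows a pitfall with columns produced by `pd.cut`: they are categorical, and grouping with `observed=False` reports empty bin combinations with size 0.

```python
import pandas as pd


def k_of(df, quasi_identifiers):
    """Size of the smallest non-empty equivalence class."""
    # observed=True drops bin combinations that contain no rows.
    sizes = df.groupby(quasi_identifiers, observed=True).size()
    return int(sizes.min())


# Toy data, binned the same way as in this exercise (hypothetical values).
toy = pd.DataFrame({
    "age": [25, 28, 45, 50, 70, 75],
    "postal_code": [10000, 20000, 70000, 80000, 40000, 50000],
})
toy["age"] = pd.cut(toy["age"], [0, 30, 60, 100])
toy["postal_code"] = pd.cut(toy["postal_code"], [0, 30000, 60000, 90000])

k = k_of(toy, ["age", "postal_code"])

# Without observed=True the empty combinations are counted as size 0,
# which makes a plain .min() misleading.
naive_min = int(toy.groupby(["age", "postal_code"], observed=False).size().min())
print("k =", k, "| naive minimum =", naive_min)
```

If the smallest class has size 1, the release is only 1-anonymous, and the bins would need to be widened (or the outlier rows suppressed) to reach a higher k.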


In [108]:
# Size of each equivalence class (combination of quasi-identifier values)
# income2
print(income2.groupby(['DOB', 'postal_code']).size())
# creditcard2
print(creditcard2.groupby(['DOB', 'postal_code']).size())

DOB        postal_code    
(0, 30]    (0, 30000]         2
           (30000, 60000]     0
           (60000, 90000]     1
           (90000, 120000]    0
(30, 60]   (0, 30000]         4
           (30000, 60000]     1
           (60000, 90000]     3
           (90000, 120000]    2
(60, 100]  (0, 30000]         3
           (30000, 60000]     2
           (60000, 90000]     2
           (90000, 120000]    0
dtype: int64
DOB        postal_code    
(0, 30]    (0, 30000]         2
           (30000, 60000]     0
           (60000, 90000]     1
           (90000, 120000]    0
(30, 60]   (0, 30000]         4
           (30000, 60000]     1
           (60000, 90000]     3
           (90000, 120000]    2
(60, 100]  (0, 30000]         3
           (30000, 60000]     2
           (60000, 90000]     2
           (90000, 120000]    0
dtype: int64




In [None]:
# k = size of the smallest non-empty group; observed=True skips the empty bin combinations
income2.groupby(['DOB', 'postal_code'], observed=True).size().min()

#### Question: Does it need l-diversity?

In [109]:
k = """
Yes, l-diversity is still needed. k-anonymity only hides which record belongs to a person; 
it does not protect the sensitive attributes themselves. If every record in an equivalence class 
shares the same sensitive value, an attacker learns that value without re-identifying anyone. 
l-diversity requires each equivalence class to contain at least l distinct sensitive values.
"""
print(k)


Yes, l-diversity is still needed. k-anonymity only hides which record belongs to a person; 
it does not protect the sensitive attributes themselves. If every record in an equivalence class 
shares the same sensitive value, an attacker learns that value without re-identifying anyone. 
l-diversity requires each equivalence class to contain at least l distinct sensitive values.
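
To make the check concrete, the sketch below computes the simplest variant, distinct l-diversity: the smallest number of distinct sensitive values found in any equivalence class. The helper name `l_of`, the column names, and the toy values are assumptions for illustration.

```python
import pandas as pd


def l_of(df, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values in any equivalence class."""
    distinct = df.groupby(quasi_identifiers, observed=True)[sensitive].nunique()
    return int(distinct.min())


# Toy table that is 2-anonymous on age_band (hypothetical values).
toy = pd.DataFrame({
    "age_band": ["30-60", "30-60", "60-100", "60-100"],
    "reason": ["flu", "flu", "fever", "injury"],
})

k_value = int(toy.groupby("age_band").size().min())
l_value = l_of(toy, ["age_band"], "reason")
print("k =", k_value, "| l =", l_value)
```

Here every class has two rows, yet everyone in the "30-60" band has the same reason, so knowing someone's age band is enough to learn their sensitive value.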



### Try locating the credit card holders
Try to find the location of the credit card holders by linking the credit card dataset to the ip dataset. What do you find?

In [102]:
final = pd.merge(creditcard2, ip2, on='DOB')
final

Unnamed: 0,DOB,postal_code,credit_number,credit_provider,credit_security_code,ip_address,location


In [103]:
bins_dob = [0,30,60,100]
ip2['DOB'] = (2023 - pd.to_numeric(ip2.DOB.str[:4]))
ip2['DOB'] = pd.cut(ip2['DOB'],bins_dob)
final = pd.merge(creditcard2, ip2, on='DOB')
final

Unnamed: 0,DOB,postal_code,credit_number,credit_provider,credit_security_code,ip_address,location
0,"(30, 60]","(90000, 120000]",4.760000e+15,VISA 13 digit,8,192.0.8.93,"('53.7446', '-0.33525', 'Kingston upon Hull', ..."
1,"(30, 60]","(90000, 120000]",4.760000e+15,VISA 13 digit,8,198.51.98.53,"('35.06544', '1.04945', 'Frenda', 'DZ', 'Afric..."
2,"(30, 60]","(90000, 120000]",4.760000e+15,VISA 13 digit,8,213.43.91.75,"('32.05971', '34.8732', 'Ganei Tikva', 'IL', '..."
3,"(30, 60]","(90000, 120000]",4.760000e+15,VISA 13 digit,8,192.52.207.100,"('38.37255', '34.02537', 'Aksaray', 'TR', 'Eur..."
4,"(30, 60]","(90000, 120000]",4.760000e+15,VISA 13 digit,8,192.88.98.113,"('12.74482', '4.52514', 'Argungu', 'NG', 'Afri..."
...,...,...,...,...,...,...,...
153,"(60, 100]","(30000, 60000]",4.270000e+15,VISA 19 digit,845,198.51.2.188,"('22.37066', '114.10479', 'Tsuen Wan', 'HK', '..."
154,"(60, 100]","(30000, 60000]",4.270000e+15,VISA 19 digit,845,198.58.178.92,"('48.52961', '12.16179', 'Landshut', 'DE', 'Eu..."
155,"(60, 100]","(30000, 60000]",4.270000e+15,VISA 19 digit,845,203.3.238.205,"('48.07667', '8.64409', 'Trossingen', 'DE', 'E..."
156,"(60, 100]","(30000, 60000]",4.270000e+15,VISA 19 digit,845,100.38.177.193,"('7.6', '4.18333', 'Olupona', 'NG', 'Africa/La..."


In [104]:
print('Merging credit card and ip on the generalized DOB does not reliably locate a credit card holder: each card matches every ip record in the same age band.')

Merging credit card and ip on the generalized DOB does not reliably locate a credit card holder: each card matches every ip record in the same age band.
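
One way to quantify this is to count how many candidate ip records each card ends up with after the merge. The sketch below uses made-up frames (`cards`, `ips`, and all their values are hypothetical; the addresses come from a range reserved for documentation) and merges on a generalized key, as in the cell above.

```python
import pandas as pd

# Hypothetical generalized releases sharing only an age band.
cards = pd.DataFrame({
    "age_band": ["30-60", "30-60", "60-100"],
    "card": ["c1", "c2", "c3"],
})
ips = pd.DataFrame({
    "age_band": ["30-60", "30-60", "30-60", "60-100"],
    "ip_address": ["198.51.100.1", "198.51.100.2",
                   "198.51.100.3", "198.51.100.4"],
})

# Merging on a generalized key is many-to-many: every card is paired
# with every ip record in the same band.
linked = cards.merge(ips, on="age_band")

# Number of candidate ip records per card; 1 would mean a unique match.
candidates = linked.groupby("card")["ip_address"].nunique()
print(len(linked), "rows")
print(candidates)
```

A card is only located when its candidate count is 1, so wider age bands directly reduce what an attacker can learn from this linkage.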
