# Using ARX to anonymise data sets
Before we can anonymise our data, we have to define our hierarchies and dependencies. ARX uses these hierarchies to anonymise the data through generalisation and suppression.

### Example of a ZIP code hierarchy
![alt text](Images/Gen+Supp.png "Example of a hierarchy")
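To make the idea concrete, here is a minimal Python sketch of generalisation and suppression on a ZIP code. The ZIP code `81667` and the function name `generalise_zip` are made up for illustration, and masking one trailing digit per level is only one possible hierarchy; the hierarchy in the image may be built differently.

```python
def generalise_zip(zip_code: str, level: int) -> str:
    """Mask the last `level` characters of a ZIP code with '*'."""
    if level <= 0:
        # Level 0 is the original, ungeneralised value.
        return zip_code
    if level >= len(zip_code):
        # Nothing is left to show: the value is fully suppressed.
        return "*"
    return zip_code[:-level] + "*" * level


# Walk up the hierarchy for one hypothetical ZIP code.
for level in range(6):
    print(level, generalise_zip("81667", level))
```

Each higher level hides more detail, and the top level corresponds to suppression.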

## Example of a hierarchy for the student set
age,Quasi_identifying,Decimal,arithmic_mean\
Medu,Quasi_identifying,Decimal,arithmic_mean\
G3,Quasi_identifying,Decimal,arithmic_mean

### Code: can be automated in Java
data.getDefinition().setAttributeType("age", Hierarchy.create("data/test_hierarchy_age.csv", StandardCharsets.UTF_8, ';'));
data.getDefinition().setAttributeType("gender", Hierarchy.create("data/test_hierarchy_gender.csv", StandardCharsets.UTF_8, ';'));
data.getDefinition().setAttributeType("zipcode", Hierarchy.create("data/test_hierarchy_zipcode.csv", StandardCharsets.UTF_8, ';'));

## Which results in...

![alt text](Images/Age.png "Example of an age hierarchy")
![alt text](Images/Grades.png "Example of a grading hierarchy")
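The hierarchy files loaded by `Hierarchy.create` are plain CSV files in which every row starts with an original value and continues with its increasingly general replacements. The sketch below shows one way such a file could be generated in Python. The interval widths, the label format and the function names are assumptions for illustration, not the hierarchy used in this notebook.

```python
import csv
import io


def age_hierarchy_rows(ages, widths=(5, 10)):
    """Build one row per distinct age: raw value, one interval per width, then '*'."""
    rows = []
    for age in sorted(set(ages)):
        row = [str(age)]
        for width in widths:
            # Lower bound of the interval of this width that contains the age.
            lower = (age // width) * width
            row.append(f"{lower}-{lower + width - 1}")
        row.append("*")  # top level: fully suppressed
        rows.append(row)
    return rows


def write_hierarchy(rows, handle):
    """Write rows with ';' as delimiter, the separator passed to Hierarchy.create."""
    csv.writer(handle, delimiter=";", lineterminator="\n").writerows(rows)


# Write to an in-memory buffer here; pass an open file handle to create a real file.
buffer = io.StringIO()
write_hierarchy(age_hierarchy_rows(range(15, 23)), buffer)
print(buffer.getvalue())
```

Each width should divide the next one, so that every narrow interval falls inside exactly one wider interval.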

## Unfortunately, ARX is only available as a Java library and a desktop application.
But that does not mean we cannot use Python to run it!
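One way to do this is to call Java from Python with the standard `subprocess` module. The sketch below assumes the Java program has been packaged as a runnable JAR that takes the input file, the output file and k as positional arguments. The JAR name `arx-student.jar` and that argument order are hypothetical; ARX itself does not define such a command line.

```python
import subprocess


def build_arx_command(jar_path, input_csv, output_csv, k):
    """Assemble the command line for a hypothetical runnable JAR built from our Java code."""
    return ["java", "-jar", str(jar_path), str(input_csv), str(output_csv), str(k)]


def run_arx(jar_path, input_csv, output_csv, k):
    """Run the JAR; raises CalledProcessError if Java exits with a non-zero status."""
    command = build_arx_command(jar_path, input_csv, output_csv, k)
    return subprocess.run(command, check=True, capture_output=True, text=True)


# Only print the command here, so the example works without Java installed.
print(build_arx_command("arx-student.jar",
                        "Data/Student/student-por.csv",
                        "Data/Student/output/student_3.csv",
                        3))
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.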

## This will result in the following Java code:
![alt text](Images/Java.png "Our Java code to obtain k-anonymity")
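A data set is k-anonymous when every combination of quasi-identifier values that occurs in it is shared by at least k rows. The sketch below checks that property in plain Python. The example rows are made up, and skipping fully suppressed rows is an assumption about how suppressed records should be treated.

```python
from collections import Counter


def class_sizes(rows, quasi_identifiers, marker="*"):
    """Count how many rows share each combination of quasi-identifier values."""
    sizes = Counter()
    for row in rows:
        key = tuple(str(row[name]) for name in quasi_identifiers)
        if all(value == marker for value in key):
            # Assumption: fully suppressed rows are left out of the check.
            continue
        sizes[key] += 1
    return sizes


def is_k_anonymous(rows, quasi_identifiers, k):
    """True when every remaining equivalence class has at least k rows."""
    return all(size >= k for size in class_sizes(rows, quasi_identifiers).values())


# Made-up rows: three identical records and one fully suppressed record.
example = [
    {"age": "15-19", "Medu": "1", "G3": "10-14"},
    {"age": "15-19", "Medu": "1", "G3": "10-14"},
    {"age": "15-19", "Medu": "1", "G3": "10-14"},
    {"age": "*", "Medu": "*", "G3": "*"},
]
print(is_k_anonymous(example, ["age", "Medu", "G3"], 3))  # True
print(is_k_anonymous(example, ["age", "Medu", "G3"], 4))  # False
```

A pandas DataFrame can be passed in as `df.to_dict("records")`.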


# Comparing the input to anonymised data

In [1]:
import pandas as pd

input_df = pd.read_csv('Data/Student/student-por.csv')
anonymised_df_3 = pd.read_csv('Data/Student/output/student_3.csv', sep=';')
anonymised_df_10 = pd.read_csv('Data/Student/output/student_10.csv', sep=';')
anonymised_df_27 = pd.read_csv('Data/Student/output/student_27.csv', sep=';')

input_df

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,4,0,11,11
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,2,9,11,11
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,6,12,13,12
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,0,14,14,14
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,0,11,13,13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
644,MS,F,19,R,GT3,T,2,3,services,other,...,5,4,2,1,2,5,4,10,11,10
645,MS,F,18,U,LE3,T,3,1,teacher,services,...,4,3,4,1,1,1,4,15,15,16
646,MS,F,18,U,GT3,T,1,1,other,other,...,1,1,1,1,1,5,6,11,12,9
647,MS,M,17,U,LE3,T,3,1,services,services,...,2,4,5,3,4,2,6,10,10,10


In [2]:
columns = []

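# Go through every column except the last and keep those that still have unsuppressed rows.
for column in anonymised_df_3.columns[:-1]:
    # '*' marks a suppressed value; .get avoids a KeyError when a column has none.
    count = anonymised_df_3[column].value_counts().get('*', 0)
    if count < len(anonymised_df_3):
        print(f"Number of suppressed rows for {column}: {count}")
        columns.append(column)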

anonymised_df_3[columns]

Number of suppressed rows for address: 87
Number of suppressed rows for Pstatus: 87
Number of suppressed rows for schoolsup: 87
Number of suppressed rows for famsup: 87
Number of suppressed rows for paid: 87
Number of suppressed rows for nursery: 87
Number of suppressed rows for higher: 87
Number of suppressed rows for internet: 87
Number of suppressed rows for romantic: 87


Unnamed: 0,address,Pstatus,schoolsup,famsup,paid,nursery,higher,internet,romantic
0,*,*,*,*,*,*,*,*,*
1,U,T,no,yes,no,no,yes,yes,no
2,U,T,yes,no,no,yes,yes,yes,no
3,U,T,no,yes,no,yes,yes,yes,yes
4,U,T,no,yes,no,yes,yes,no,no
...,...,...,...,...,...,...,...,...,...
644,R,T,no,no,no,no,yes,yes,no
645,U,T,no,yes,no,yes,yes,yes,no
646,U,T,no,no,no,yes,yes,no,no
647,U,T,no,no,no,no,yes,yes,no


In [3]:
columns = []

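# Go through every column except the last and keep those that still have unsuppressed rows.
for column in anonymised_df_10.columns[:-1]:
    # '*' marks a suppressed value; .get avoids a KeyError when a column has none.
    count = anonymised_df_10[column].value_counts().get('*', 0)
    if count < len(anonymised_df_10):
        print(f"Number of suppressed rows for {column}: {count}")
        columns.append(column)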

anonymised_df_10[columns]

Number of suppressed rows for Pstatus: 115
Number of suppressed rows for schoolsup: 115
Number of suppressed rows for paid: 115
Number of suppressed rows for nursery: 115
Number of suppressed rows for higher: 115
Number of suppressed rows for internet: 115
Number of suppressed rows for romantic: 115


Unnamed: 0,Pstatus,schoolsup,paid,nursery,higher,internet,romantic
0,*,*,*,*,*,*,*
1,T,no,no,no,yes,yes,no
2,T,yes,no,yes,yes,yes,no
3,T,no,no,yes,yes,yes,yes
4,T,no,no,yes,yes,no,no
...,...,...,...,...,...,...,...
644,T,no,no,no,yes,yes,no
645,T,no,no,yes,yes,yes,no
646,T,no,no,yes,yes,no,no
647,T,no,no,no,yes,yes,no


In [4]:
columns = []

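# Go through every column except the last and keep those that still have unsuppressed rows.
for column in anonymised_df_27.columns[:-1]:
    # '*' marks a suppressed value; .get avoids a KeyError when a column has none.
    count = anonymised_df_27[column].value_counts().get('*', 0)
    if count < len(anonymised_df_27):
        print(f"Number of suppressed rows for {column}: {count}")
        columns.append(column)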

anonymised_df_27[columns]

Number of suppressed rows for school: 120
Number of suppressed rows for guardian: 120
Number of suppressed rows for schoolsup: 120
Number of suppressed rows for paid: 120
Number of suppressed rows for internet: 120
Number of suppressed rows for romantic: 120


Unnamed: 0,school,guardian,schoolsup,paid,internet,romantic
0,*,*,*,*,*,*
1,GP,parents,no,no,yes,no
2,GP,parents,yes,no,yes,no
3,GP,parents,no,no,yes,yes
4,GP,parents,no,no,no,no
...,...,...,...,...,...,...
644,MS,parents,no,no,yes,no
645,MS,parents,no,no,yes,no
646,MS,parents,no,no,no,no
647,MS,parents,no,no,yes,no
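The three cells above repeat the same loop for each output file. As a sketch, the logic could be moved into a helper like the one below. The function names are made up, and unlike the cells it looks at every column and keeps only columns with at least one suppressed row. It only relies on iterating over column names and indexing by column, so it accepts a plain dict of lists as well as a pandas DataFrame.

```python
def suppression_counts(table, marker="*"):
    """Map each column name to the number of rows whose value equals the marker."""
    counts = {}
    for column in table:  # iterating a DataFrame or a dict yields column names
        counts[column] = sum(1 for value in table[column] if str(value) == marker)
    return counts


def partially_suppressed(table, marker="*"):
    """Keep the columns where some, but not all, rows are suppressed."""
    result = {}
    for column, count in suppression_counts(table, marker).items():
        if 0 < count < len(table[column]):
            result[column] = count
    return result


# Made-up toy table: 'school' is suppressed in every row, the others only in the first.
toy = {
    "school": ["*", "*", "*"],
    "address": ["*", "U", "R"],
    "paid": ["*", "no", "yes"],
}
print(partially_suppressed(toy))  # {'address': 1, 'paid': 1}
```

Calling `partially_suppressed(anonymised_df_3)` and its counterparts for the other two files would then give the per-column counts side by side.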
