# Hexanonymity - GPS data Anonymization



In [25]:
%pip install -q -r requirements.txt


In [26]:
import pandas as pd
import numpy as np
from numpy import array
from src.application.Hexanonymity.Hexanonymity import Hexanonimity

## Initialize a dataset with sample data

In [27]:
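# Sample dataset: each GPS point is stored as a comma-separated coordinate string.
df = pd.DataFrame(
    {
        # Column holding the positions to anonymize
        "locations": pd.Series(
            array(
                [
                    "-8.7354573,42.2239522",
                    "-8.7357169,42.224499",
                    "-8.8932563,42.1011589",
                    "-8.8910411,42.08599",
                ]
            ),
            dtype=str,
        ),
        # User identifier for each row
        "id": pd.Series(array(["1", "2", "1", "2"]), dtype=str),
        # Additional column that should follow the anonymized position
        "other_locations": pd.Series(array(["a1", "b2", "c3", "d2"]), dtype=str),
    }
)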
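
Each entry in `locations` is a single string, so it can help to inspect the coordinates as numbers before anonymizing. The sketch below is not part of Hexanonymity: `split_coordinates` is a hypothetical helper, and it assumes the strings are in `longitude,latitude` order, which the source does not state.

```python
import pandas as pd


def split_coordinates(series: pd.Series) -> pd.DataFrame:
    """Split comma-separated coordinate strings into two float columns.

    Assumption: the order is "lon,lat"; swap the column names if your
    data uses "lat,lon" instead.
    """
    parts = series.str.split(",", expand=True)
    parts.columns = ["lon", "lat"]
    return parts.astype(float)


sample = pd.Series(["-8.7354573,42.2239522", "-8.8910411,42.08599"])
coords = split_coordinates(sample)
print(coords)
```
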

## Generate Hexanonymity Configuration

The Hexanonymity algorithm is configured with the following parameters:

- `configuration`: `JSON` object with the following fields:
    - `k`: Minimum k (use at least k=2 to provide privacy)
    - `min_p`: Minimum size to apply in the Uber H3 hierarchy
    - `max_p`: Maximum size to apply in the Uber H3 hierarchy
- `fields`: List of column names containing the geo-positioned data points
- `id_col`: Name of the column containing the user identifier
- `sensitive_cols`: An optional list of additional column names to write the anonymised position to. In some datasets, GPS data points appear in multiple columns; listing the extra columns here anonymises all of them at once.
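
Before building the operation, it can be useful to sanity-check the `configuration` values. The following is a minimal sketch, not part of the Hexanonymity API: `validate_config` is a hypothetical helper, and it assumes that `min_p` and `max_p` are H3 resolution levels, which range from 0 to 15.

```python
def validate_config(configuration: dict) -> None:
    """Raise ValueError if the configuration looks inconsistent.

    Assumption: min_p and max_p are H3 resolutions (0 to 15 inclusive).
    """
    k = configuration["k"]
    min_p = configuration["min_p"]
    max_p = configuration["max_p"]

    # k = 1 would let a single point stand alone, which gives no privacy.
    if k < 2:
        raise ValueError(f"k must be at least 2, got {k}")

    # The two bounds must be ordered and stay inside the H3 resolution range.
    if not 0 <= min_p <= max_p <= 15:
        raise ValueError(
            f"expected 0 <= min_p <= max_p <= 15, got min_p={min_p}, max_p={max_p}"
        )


validate_config({"k": 2, "min_p": 0, "max_p": 14})  # passes silently
```
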

In [28]:

operation = Hexanonimity(
    configuration={"k": 2, "min_p": 0, "max_p": 14},
    fields=["locations"],
    id_col="id",
    sensitive_cols=[
        "other_locations",
    ],
)

## Anonymize and verify the results

After anonymization, no location in the resulting DataFrame is unique: every value in `locations` appears at least `k = 2` times, and `other_locations` follows the same grouping.

In [29]:
result = operation.apply(df)
expected = array(
    [
        "-8.7354573,42.2239522",
        "-8.7354573,42.2239522",
        "-8.8932563,42.1011589",
        "-8.8932563,42.1011589",
    ],
    dtype=str,
)
assert (result["locations"].values == expected).all()
expected = array(["a1", "a1", "c3", "c3"], dtype=str)
assert (result["other_locations"].values == expected).all()

result

| | locations | id | other_locations |
|---|---|---|---|
| 0 | -8.7354573,42.2239522 | 1 | a1 |
| 1 | -8.7354573,42.2239522 | 2 | a1 |
| 2 | -8.8932563,42.1011589 | 1 | c3 |
| 3 | -8.8932563,42.1011589 | 2 | c3 |
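
The hard-coded `expected` arrays only cover this sample. A more general check, sketched below with plain pandas, counts how often each anonymized location occurs. `is_k_anonymous` is a hypothetical helper, and it takes k to mean "at least k rows share each location"; a stricter reading would count distinct `id` values per location instead.

```python
import pandas as pd


def is_k_anonymous(frame: pd.DataFrame, column: str, k: int) -> bool:
    """Return True if every value in `column` occurs in at least k rows."""
    counts = frame[column].value_counts()
    return bool((counts >= k).all())


# Values copied from the anonymized output shown above.
anonymized = pd.DataFrame(
    {
        "locations": [
            "-8.7354573,42.2239522",
            "-8.7354573,42.2239522",
            "-8.8932563,42.1011589",
            "-8.8932563,42.1011589",
        ],
        "id": ["1", "2", "1", "2"],
    }
)

print(is_k_anonymous(anonymized, "locations", k=2))
```

In the notebook itself, the same helper could be called on `result` in place of the hand-built `anonymized` frame.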
