In [1]:
import pandas as pd
import query
import folium
from folium.plugins import MarkerCluster

For all attacks, we assume an adversary that is honest but curious. They have access to query metadata (as in queries.csv), namely IP address, location, timestamp and POI type as well as the locations that are returned by the service for each POI type (as in pois.csv). We maintain the handout's assumption that each IP address corresponds to a unique user. The common goal of these attacks is to use this data to undermine a user's privacy and find out sensitive locations (home/work) and interests.

We propose the following attacks:
1) We first show how this data allows us to map a user's movement over a certain period of time (in this case twenty days). In this attack, the adversary knows the IP address of a user and is trying to discover their sensitive locations and interests.
2) Then, we assume the adversary knows a person's sensitive locations (home / work) in real life and is trying to link that to an IP. This attack is more expensive as it requires the adversary to go through the entire dataset.
3) Finally, we use query.py to learn more about the user's interests and places they might frequent.

# Mapping User Movement
This attack relies on the query metadata. We use the ip_address field to isolate a user's data. We then place each query's location on a map, labelled with its timestamp converted to a date and time of day (taking 05/05/2025 at 00:00 as the simulation start time).
To make the map more readable, we include start and end parameters (day numbers, both inclusive) so that only a subset of the data is visualized. For instance, we might only want to see one week for this user, since most people's routines repeat on a weekly cycle.

In [5]:
def make_map(start, end, ip):
    df = pd.read_csv(
        "queries.csv",
        sep=r'\s+',  # raw string avoids an invalid-escape warning
        header=0,
        names=["ip_address", "lat", "lon", "timestamp", "poi_type_query"]
    )
    # Copy the filtered rows so that new columns can be added without a SettingWithCopyWarning.
    df = df[df["ip_address"] == ip].copy()
    sim_start = pd.to_datetime("2025-05-05 00:00")
    df["datetime"] = sim_start + pd.to_timedelta(df["timestamp"], unit="h")

    start_time = sim_start + pd.Timedelta(days=start - 1)
    end_time = sim_start + pd.Timedelta(days=end)
    df = df[(df["datetime"] >= start_time) & (df["datetime"] < end_time)]

    def ordinal(n):
        if 10 <= n % 100 <= 20:
            suffix = "th"
        else:
            suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
        return f"{n}{suffix}"

    df["day_ordinal"] = df["datetime"].dt.day.apply(ordinal)
    df["label"] = (
        df["datetime"].dt.strftime("%A ") +
        df["day_ordinal"] +
        df["datetime"].dt.strftime(" %H:%M")
    )

    # Center the map on the user's average position.
    center = [df['lat'].mean(), df['lon'].mean()]
    m = folium.Map(location=center, zoom_start=13)
    cluster = MarkerCluster().add_to(m)

    for _, row in df.sort_values("datetime").iterrows():
        popup = folium.Popup(
            html=(
                f"<b>Time:</b> {row['label']}<br>"
                f"<b>Query:</b> {row['poi_type_query']}"
            ),
            max_width=200
        )
        folium.Marker(
            location=[row["lat"], row["lon"]],
            popup=popup,
            icon=folium.Icon(icon="info-sign")
        ).add_to(cluster)

    # Build the path once, after the loop, so it is defined even when df is empty.
    coords = df.sort_values("datetime")[["lat","lon"]].values.tolist()
    folium.PolyLine(
        locations=coords,
        weight=3,
        opacity=0.7
    ).add_to(m)
    return m
make_map(1, 1, "146.71.112.211")

Different types of information can be extracted depending on the selected dates. Let's assume we are following the user with IP address 146.71.112.211. By changing the start and end parameters of make_map, we can map the user's movement over one or several days and see where they were at specific dates and times. For example, here we selected day 1, which is a Monday, and we can see that the user was in the Renens Gare area around midday and in Prilly at night.

This same mapping script allows us to guess users' sensitive locations. Let's view the user's locations over a week:

In [3]:
make_map(1, 7, "146.71.112.211")

We can see that many of their requests still originate from the same two locations, around Renens Gare and Prilly. The two places where a person spends most of their time during a week are usually home and work: typically work during the day and home at night. People also tend to search for places to eat lunch while they are at work, and to be at home in the evenings (perhaps searching for places to unwind, such as bars) and on weekend mornings.

This matches our user's data well: they often send queries for "restaurant" and "cafeteria" between 11 am and 1 pm from the location near Renens Gare, specifically at Av. du Tir-Fédéral 15, 1024 Ecublens, or perhaps another building nearby. We can thus reasonably assume that this is where they work. A quick look on Google Maps shows a few places where this person could be working, such as Holinger SA, Unimed SA or a kindergarten.

We also notice that the user sends many queries from the Prilly area on weekday evenings and early weekend mornings, more specifically from Chem. des Charmilles 10, 1004 Lausanne. Judging from the map and a quick sanity check on Google Maps, this looks like a residential area and is most likely where the user lives.
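The reasoning above (work around lunchtime on weekdays, home on weekday evenings and weekend mornings) can be sketched as a small heuristic. This is only an illustration under stated assumptions: guess_home_work is a hypothetical helper that is not part of the notebook, the time windows are the same ones used later in find_ip_by_location, the sample data is made up, and rounding coordinates to three decimals is an arbitrary way of grouping nearby queries into cells of roughly 100 m.

```python
import pandas as pd

def guess_home_work(df, precision=3):
    """Return the most frequent rounded (lat, lon) cell in home-like and work-like time windows."""
    df = df.copy()
    # Group nearby queries by rounding coordinates (3 decimals is roughly a 100 m cell).
    df["lat_cell"] = df["lat"].round(precision)
    df["lon_cell"] = df["lon"].round(precision)

    weekday = df["datetime"].dt.weekday
    hour = df["datetime"].dt.hour
    is_workday = weekday < 5
    work_like = is_workday & hour.between(11, 14)
    home_like = (is_workday & hour.between(18, 23)) | (~is_workday & hour.between(6, 11))

    def top_cell(mask):
        counts = df.loc[mask].groupby(["lat_cell", "lon_cell"]).size()
        return counts.idxmax() if not counts.empty else None

    return {"home": top_cell(home_like), "work": top_cell(work_like)}

# Made-up sample data, for illustration only.
sample = pd.DataFrame({
    "lat": [46.5359, 46.5360, 46.5309, 46.5308, 46.5200],
    "lon": [6.5752, 6.5753, 6.6232, 6.6233, 6.6000],
    "datetime": pd.to_datetime([
        "2025-05-05 12:10",  # Monday lunchtime
        "2025-05-06 12:30",  # Tuesday lunchtime
        "2025-05-05 20:00",  # Monday evening
        "2025-05-10 09:00",  # Saturday morning
        "2025-05-07 16:00",  # Wednesday afternoon, outside both windows
    ]),
})
guess = guess_home_work(sample)
print(guess)
```

On real data, the resulting cells would still need the manual sanity check against a map described above, since a frequent lunchtime location is not necessarily a workplace.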

# Linking real-life identities to IP addresses
In this attack, we know some of the user's sensitive locations (home/work) and we also know that they use this application. We want to identify them in the dataset. We can do this by matching the locations of the queries in queries.csv to the sensitive locations. Once we have the user's IP address, we can find out more about them, such as their interests and preferences, in the next attack.
This attack is essentially the reverse of the previous one. Let's focus on the same user as before (IP = 146.71.112.211) and assume we know that they work at Av. du Tir-Fédéral 15, 1024 Ecublens, which corresponds to (lat: 46.53591906517015, lon: 6.575487739371415), and live at Chem. des Charmilles 10, 1004 Lausanne, which corresponds to (lat: 46.53086483286677, lon: 6.623208941351423). Let's see how we can use the queries.csv file to identify the user in the dataset.
In the following script, we provide the function find_ip_by_location, which takes the user's home and work locations and a radius (in degrees) around these locations. The function filters queries.csv for queries sent from these locations and returns the IP addresses that sent at least one query from each of them, according to a couple of rules: we keep queries around the home location on weekend mornings or weekday evenings, and queries around the work location on weekdays around lunchtime.
We can then use this information to identify the user in the dataset.


In [37]:
def find_ip_by_location(home_lat, home_lon, work_lat, work_lon, radius=0.001):
    df = pd.read_csv(
        "queries.csv",
        sep=r'\s+',  # raw string avoids an invalid-escape warning
        header=0,
        names=["ip_address", "lat", "lon", "timestamp", "poi_type_query"]
    )
    sim_start = pd.to_datetime("2025-05-05 00:00")
    df["datetime"] = sim_start + pd.to_timedelta(df["timestamp"], unit="h")
    df["weekday"] = df["datetime"].dt.weekday
    df["hour"] = df["datetime"].dt.hour

    # filter for queries around home and work locations
    home_queries = df[
        (df["lat"].between(home_lat - radius, home_lat + radius)) &
        (df["lon"].between(home_lon - radius, home_lon + radius)) & (
            # weekend mornings
            ((df["weekday"].isin([5, 6])) & (df["hour"].between(6, 11))) |
            # weekday evenings
            ((df["weekday"].isin([0, 1, 2, 3, 4])) & (df["hour"].between(18, 23)))
        )
    ]
    work_queries = df[
        (df["lat"].between(work_lat - radius, work_lat + radius)) &
        (df["lon"].between(work_lon - radius, work_lon + radius)) & (
            # weekdays around lunchtime
            (df["weekday"].isin([0, 1, 2, 3, 4])) & (df["hour"].between(11, 14))
        )
    ]

    home_counts = home_queries.groupby("ip_address").size().reset_index(name='home_query_count')
    work_counts = work_queries.groupby("ip_address").size().reset_index(name='work_query_count')
    merged_counts = pd.merge(home_counts, work_counts, on="ip_address", how="outer").fillna(0)
    sig = merged_counts[
        (merged_counts['home_query_count'] > 0) & (merged_counts['work_query_count'] > 0)
    ]

    return sig

In [38]:
work_lat = 46.53591906517015
work_lon = 6.575487739371415

home_lat = 46.53086483286677
home_lon = 6.623208941351423

s = find_ip_by_location(home_lat, home_lon, work_lat, work_lon, 0.001)
print("Significant Users:\n", s)

Significant Users:
        ip_address  home_query_count  work_query_count
2  146.71.112.211              35.0              22.0


We have recovered the user with IP address 146.71.112.211. We can adjust the radius parameter depending on how precise our background knowledge of the user is. Here, we set it to 0.001 degrees, which is roughly 110 meters north-south and, at this latitude, about 75 meters east-west.
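Because the radius is expressed in degrees, its size on the ground depends on the latitude: a degree of longitude covers less distance the further one is from the equator. A minimal sketch of the conversion, assuming a spherical Earth with a mean radius of 6,371 km (degrees_to_meters is an illustrative helper, not part of the notebook):

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation

def degrees_to_meters(radius_deg, lat_deg):
    """Approximate north-south and east-west extent (in meters) of an offset given in degrees."""
    meters_per_degree = math.pi * EARTH_RADIUS_M / 180
    north_south = radius_deg * meters_per_degree
    # Meridians converge towards the poles, so longitude offsets shrink by cos(latitude).
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west

ns, ew = degrees_to_meters(0.001, 46.53)
print(f"{ns:.0f} m north-south, {ew:.0f} m east-west")
```

Since find_ip_by_location filters on a box of plus or minus radius around each location, the search area is a rectangle rather than a circle, about twice these values on each side.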

# User Interests

Now that we know the IP address of the user (whether it is part of the background knowledge or a result of the previous attack), we can infer the user's interests from how often, and when, they query for each POI type in queries.csv. For example, if they query for "restaurant" a lot during the week, we can assume that they are interested in food and perhaps even have a favorite restaurant. If they query for "bar" a lot on the weekends, we can assume that they are interested in nightlife. We can also use the pois.csv file to see what types of POIs are available in the area and how often the user queries for them. This allows us to find out more about the user's interests and preferences.
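To look at when a user queries for each POI type, and not only how often, the queries can be cross-tabulated by POI type and by weekday versus weekend. The following is a sketch under assumptions: poi_by_day_type is a hypothetical helper, the timestamps are taken to be hours since the simulation start (as in the other cells), and the sample rows are made up.

```python
import pandas as pd

def poi_by_day_type(df, sim_start="2025-05-05 00:00"):
    """Count queries per POI type, split into weekday and weekend."""
    dt = pd.to_datetime(sim_start) + pd.to_timedelta(df["timestamp"], unit="h")
    day_type = dt.dt.weekday.map(lambda d: "weekend" if d >= 5 else "weekday")
    return pd.crosstab(df["poi_type_query"], day_type.rename("day_type"))

# Made-up sample rows, for illustration only (timestamps in hours since the start).
sample_queries = pd.DataFrame({
    "timestamp": [12.0, 36.5, 130.0, 142.0, 20.0],
    "poi_type_query": ["restaurant", "restaurant", "gym", "club", "gym"],
})
table = poi_by_day_type(sample_queries)
print(table)
```

A POI type that appears almost only in the weekend column (such as "club" in this made-up sample) points to a leisure interest, while one concentrated on weekdays is more likely tied to the work routine.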

In [39]:
def get_poi_types(ip_address):
    df = pd.read_csv(
        "queries.csv",
        sep=r'\s+',  # raw string avoids an invalid-escape warning
        header=0,
        names=["ip_address", "lat", "lon", "timestamp", "poi_type_query"]
    )
    df = df[df["ip_address"] == ip_address]
    poi_counts = df["poi_type_query"].value_counts(normalize=True) * 100
    return poi_counts

poi_types = get_poi_types("146.71.112.211")
print("POI Types and Percentage of Queries:")
for poi_type, percentage in poi_types.items():
    print(f"{poi_type}: {percentage:.2f}%")

POI Types and Percentage of Queries:
supermarket: 19.42%
club: 18.45%
gym: 18.45%
cafeteria: 14.56%
restaurant: 14.56%
dojo: 14.56%


We can see that this user is interested in nightlife (clubbing), probably practices a martial art (as they query for dojos) and goes to the gym.
We can go one step further and use the server's responses to find out which exact clubs, dojos or gyms the user is most likely to frequent, assuming that they end up going to the places that are returned by the server. We can do this by checking the pois.csv file and seeing which POIs are returned for each POI type.

In [40]:
import numpy as np
from privacy_evaluation.query import get_nearby_pois


def load_queries(ip_address):
    try:
        df = pd.read_csv(
            "queries.csv",
            sep=r'\s+',  # raw string avoids an invalid-escape warning
            header=0,
            names=["ip_address", "lat", "lon", "timestamp", "poi_type_query"]
        )
    except FileNotFoundError:
        print("File not found")
        return pd.DataFrame()

    user_df = df[df["ip_address"] == ip_address]
    return user_df

def map_pois(ip_address):
    df = load_queries(ip_address)
    if df.empty:
        print(f"No data found for IP address: {ip_address}")
        return None

    center = [df['lat'].mean(), df['lon'].mean()]
    m = folium.Map(location=center, zoom_start=13)
    cluster = MarkerCluster().add_to(m)

    plotted_pois = set()

    for _, row in df.iterrows():
        loc = np.array([row["lat"], row["lon"]])
        poi_type = row["poi_type_query"]

        nearby_pois = get_nearby_pois(loc, poi_type)

        for poi_id in nearby_pois:
            if poi_id not in plotted_pois:
                plotted_pois.add(poi_id)

                popup = folium.Popup(f"POI ID: {poi_id} - Type: {poi_type}", max_width=200)
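                # NOTE: the marker is placed at the location of the query that
                # returned this POI, not at the POI's own coordinates.
                folium.Marker(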
                    location=loc.tolist(),
                    popup=popup,
                    icon=folium.Icon(icon="info-sign")
                ).add_to(cluster)

    return m

map_pois("146.71.112.211")


Once we get all the POIs that are returned by the server for this user, we can guess that the user is most likely to frequent the places that are closer to their home or work, namely some gyms and dojos in Ecublens and near EPFL and some others near Prilly, where they live. In that area, multiple clubs were also returned.
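The idea of favouring POIs close to home or work can be sketched by ranking the returned POIs by their distance to the nearest known sensitive location. Everything in this example is an assumption made for illustration: the POI identifiers and coordinates are made up, rank_pois and haversine_m are hypothetical helpers, and in practice the POI coordinates would have to be looked up in pois.csv, whose layout is not shown here.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points (spherical Earth)."""
    r = 6_371_000
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def rank_pois(pois, anchors):
    """Sort POIs by their distance to the nearest anchor (e.g. home or work)."""
    ranked = []
    for poi_id, (lat, lon) in pois.items():
        nearest = min(haversine_m(lat, lon, a_lat, a_lon) for a_lat, a_lon in anchors.values())
        ranked.append((poi_id, nearest))
    return sorted(ranked, key=lambda item: item[1])

# Home and work coordinates from the linking attack above.
anchors = {
    "home": (46.53086483286677, 6.623208941351423),
    "work": (46.53591906517015, 6.575487739371415),
}
# Hypothetical POIs with made-up coordinates.
pois = {
    "gym_a": (46.5365, 6.5760),
    "gym_b": (46.5200, 6.6000),
    "club_c": (46.5310, 6.6240),
}
ranked = rank_pois(pois, anchors)
for poi_id, dist in ranked:
    print(f"{poi_id}: {dist:.0f} m from the nearest sensitive location")
```

Distance alone is a weak signal, so such a ranking would only narrow down the candidates rather than confirm which places the user actually visits.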

# Defence using spatial obfuscation
