PrivacyContest

Differential Privacy Temporal Map Challenge - Sprint 3

In the Differential Privacy Temporal Map Challenge (DeID2), the task is to develop algorithms that preserve as much data utility as possible while guaranteeing that individual privacy is protected. The challenge features a series of coding sprints that apply differential privacy methods to temporal map data, where one individual in the data may contribute a sequence of events. The goal is to create a privacy-preserving dashboard map that shows changes across different map segments over time.

Submissions are assessed based on:

their ability to prove they satisfy differential privacy; and
the accuracy of output data as compared with ground truth.

The Data

The main dataset includes quantitative and categorical information about 16 million taxi trips in Chicago, including time, distance, location, payment, and service provider. The data includes several features along with time segments (trip_day_of_week and trip_hour_of_day), map segments (pickup_community_area and dropoff_community_area), and simulated individuals (taxi_id). Solutions in this sprint produce a list of records (i.e. synthetic data) with corresponding time and map segments.

Brief Algorithm Description

The main idea is to combine similar features in the pre-processing phase, create privatized histograms of the combined features, and then create the simulated data in the post-processing phase. The individual taxis are created by counting the number of distinct taxi_ids, adding noise, and then iterating through the privatized count. The number of trips per taxi_id is calculated by counting the distinct taxi_ids with k trips (k = 1 to 200) and adding noise to each bin.
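The two counting steps above can be sketched as follows. This is a minimal illustration, not the code in src: the noise mechanism (Laplace), the split of the privacy budget into `epsilon_count` and `epsilon_hist`, the clamping of taxis with more than 200 trips into the last bin, and all function names are assumptions made for the example.

```python
import random
from collections import Counter

MAX_TRIPS = 200  # k = 1 to 200, as in the description above


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_taxi_counts(taxi_ids, epsilon_count, epsilon_hist, rng):
    """taxi_ids holds one entry per trip; returns (noisy_n_taxis, noisy_hist)."""
    trips_per_taxi = Counter(taxi_ids)

    # Count of distinct taxi_ids: one individual changes it by at most 1.
    noisy_n = len(trips_per_taxi) + laplace_noise(1.0 / epsilon_count, rng)
    noisy_n = max(0, round(noisy_n))

    # Count of taxi_ids with k trips: one individual falls in exactly one bin.
    # Assumption: taxis with more than MAX_TRIPS trips go into the last bin.
    hist = Counter(min(n, MAX_TRIPS) for n in trips_per_taxi.values())
    noisy_hist = {}
    for k in range(1, MAX_TRIPS + 1):
        # Noise is added to every bin, including empty ones.
        noisy = hist.get(k, 0) + laplace_noise(1.0 / epsilon_hist, rng)
        noisy_hist[k] = max(0, round(noisy))
    return noisy_n, noisy_hist
```

Rounding and clipping at zero happen after the noise is added, so they are post-processing and do not affect the privacy guarantee.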

A total of 5 queries are used:

Count of distinct taxi_ids
Count of distinct taxi_ids with k number of trips
Histogram of the proximity-shift-pca-dca feature by a sub-sample of taxi_id
Histogram of the company-payment_type feature by a sub-sample of taxi_id
Histogram of fare_codes feature by a sub-sample of taxi_id
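One possible reading of "by a sub-sample of taxi_id" is that each taxi contributes at most a fixed number of randomly sampled trips to a histogram, which bounds how much one individual can change it. The sketch below follows that reading. The function name, the `max_per_taxi` parameter, the Laplace mechanism, and the representation of trips as dictionaries are assumptions for illustration, and `domain` is assumed to be a fixed, data-independent list of feature values.

```python
import random
from collections import Counter, defaultdict


def noisy_feature_histogram(trips, feature, domain, max_per_taxi, epsilon, rng):
    """trips: list of dicts that each have a 'taxi_id' key and a `feature` key."""
    by_taxi = defaultdict(list)
    for trip in trips:
        by_taxi[trip["taxi_id"]].append(trip[feature])

    # Each taxi contributes at most max_per_taxi sampled trips.
    counts = Counter()
    for values in by_taxi.values():
        k = min(max_per_taxi, len(values))
        counts.update(rng.sample(values, k))

    # One taxi changes the histogram by at most max_per_taxi in L1 norm,
    # so the Laplace scale is max_per_taxi / epsilon.
    scale = max_per_taxi / epsilon
    noisy = {}
    for value in domain:
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy[value] = max(0, round(counts.get(value, 0) + noise))
    return noisy
```

Iterating over `domain` rather than over the observed values matters: bins that are empty in the data still receive noise, so the presence or absence of a value does not leak through the output keys.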

A proximity dictionary is created containing a trip seconds estimate for each pca-dca (pickup_community_area, dropoff_community_area) combination. The proximity dictionary is used in the post-processing phase to align the privatized data.
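A lookup of this kind might be built as in the sketch below. How the estimates are actually derived is not stated here; the example assumes a reference table of trips with a `trip_seconds` column and uses the median per pair, both of which are illustrative choices. If such a table were the private data itself, the lookup would need to be covered by the privacy budget.

```python
from collections import defaultdict
from statistics import median


def build_proximity(trips):
    """Map each (pickup, dropoff) community-area pair to a trip seconds estimate.

    trips: list of dicts with 'pickup_community_area', 'dropoff_community_area'
    and 'trip_seconds' keys (the last column name is an assumption).
    """
    seconds = defaultdict(list)
    for trip in trips:
        key = (trip["pickup_community_area"], trip["dropoff_community_area"])
        seconds[key].append(trip["trip_seconds"])
    # The median is less affected by a few unusually long trips than the mean.
    return {key: median(values) for key, values in seconds.items()}
```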

Setup and Running the Code

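The environment can be set up using requirements.txt, or simply use Python 3.8.5+ with the following packages: pandas, numpy, json, pathlib, loguru, random. Of these, json, pathlib, and random are part of the Python standard library and need no separate installation.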

Assuming you have the data and parameters files (ground_truth.csv and parameters.json) used in the competition, set the paths in the first 21 lines of code in main.py to correspond to your configuration.
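A path block of this kind typically looks something like the sketch below. The variable names, the `data` directory, and the `load_parameters` helper are hypothetical and may not match what main.py actually defines; only the two file names come from the competition.

```python
import json
from pathlib import Path

# Hypothetical names and locations; adjust them to your own layout.
DATA_DIR = Path("data")
GROUND_TRUTH_PATH = DATA_DIR / "ground_truth.csv"
PARAMETERS_PATH = DATA_DIR / "parameters.json"


def load_parameters(path=PARAMETERS_PATH):
    """Read the competition parameters file into a plain dict."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)
```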

Run the code: python main.py

JimKing100/PrivacyContest
