Data can be trivially de-anonymised #1

Open
timbennett opened this issue Jan 18, 2016 · 2 comments

Comments

@timbennett

The data and analysis description contain enough information to de-anonymise players, matches and bookmakers. I will not disclose the method, but the repo maintainer can contact me via email should they wish to check. The horse has probably bolted on mitigating this issue.

@ppaulojr

I thought the same. Any journalist with scripting knowledge and a little patience could de-anonymise players. I don't see any mitigation strategy at this point.

@jaypinho

@timbennett @ppaulojr This would be true, except that the dataset BuzzFeed used to produce this study is only vaguely described. Nowhere in the article, this repo, or the supplementary piece are the criteria for match selection fully detailed.

The closest we get is a reference to a list of 25,993 matches (as mentioned in the README). But other than specifying that this includes ATP and Grand Slam matches in the 2009-2015 period, we know little else about how this data was collected.

After taking the file and aggregating individual player wins and losses by year, the only conclusion I arrived at with reasonable certainty is the true identity of anonymized ID 2ed14b47b1c58532b757d76404dcf1a114b712e50193f0b0a5a05f52e3067134. The others' W-L records were (at times) similar to publicly available W-L data, but (at least in the few hours I spent on this) not immediately verifiable.
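
For anyone who wants to repeat that aggregation, here is a minimal sketch of the kind of per-year W-L tally described above. The column names (`winner`, `loser`, `match_date`) and the ISO date format are assumptions for illustration, not the repo's actual schema, and the sample rows are made-up placeholder IDs rather than real data.

```python
import csv
import io
from collections import Counter


def win_loss_by_year(rows, winner_col="winner", loser_col="loser",
                     date_col="match_date"):
    """Tally (wins, losses) per (player_id, year).

    Column names are hypothetical; adjust them to the real file.
    Assumes dates start with a four-digit year (e.g. 2009-01-05).
    """
    wins, losses = Counter(), Counter()
    for row in rows:
        year = row[date_col][:4]
        wins[(row[winner_col], year)] += 1
        losses[(row[loser_col], year)] += 1
    # Include players who only won or only lost in a given year.
    keys = sorted(set(wins) | set(losses))
    return {key: (wins[key], losses[key]) for key in keys}


# Placeholder rows standing in for the anonymised match file.
SAMPLE = """winner,loser,match_date
aaa,bbb,2009-01-05
aaa,ccc,2009-02-10
bbb,aaa,2010-03-01
"""

records = win_loss_by_year(csv.DictReader(io.StringIO(SAMPLE)))
for (player, year), (w, l) in records.items():
    print(player, year, f"{w}-{l}")
```

The resulting per-year records can then be compared by hand against publicly available W-L data. A match is only suggestive unless the match-selection criteria are known.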

The lack of clarity around the dataset raises the question of which matches were included and which were left out. I was unable to discern any consistent criteria.

