
Single encounter problem #13

Closed
jasisz opened this issue Apr 4, 2020 · 14 comments
Labels
privacy risk (Questions or comments regarding privacy issues and concerns) · protocol (Questions about the protocol/cryptography) · will-close-soon-without-further-input (For discussions that seem resolved or stalled; we do so to be able to handle new issues)

Comments

jasisz commented Apr 4, 2020

If in a given epoch one is sure they've met only one person, and later they find that person's ID in the public repository of infected IDs, they can be sure that person was infected.
This is a privacy concern and a workaround is not trivial... but I believe one is still possible, although the document states this is not possible for any proximity tracing mechanism.
I have two ideas to fix that:

  1. We utilize collisions by design. IDs can collide, so false-positive encounters can happen. Collisions can be introduced at ID-generation time, or the published infected IDs can be the true ones plus randomly generated ones (to create possible collisions); see the sketch after this list. This would also hide the true number of cases.
  2. A user with a single (or low number of) risky encounters during an epoch remembers an encountered ID and may re-use it in a later epoch. This way it is not trivial to know who exactly was infected. Still, only persons at risk would be alarmed.
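
To make idea 1 concrete, here is a minimal sketch (Python, with hypothetical names; this is not part of any DP-3T design) of a server padding its published list of infected IDs with random decoys:

```python
# Hypothetical sketch of idea 1: pad the published infected IDs with
# random decoys so a match is no longer a certain indicator and the
# true case count is hidden. Not part of any DP-3T design.
import secrets

ID_LENGTH = 16  # bytes, e.g. a 128-bit ephemeral ID

def publish_with_decoys(true_infected_ids, decoys_per_true_id=3):
    """Return a shuffled mix of true infected IDs and random decoys."""
    published = list(true_infected_ids)
    for _ in range(decoys_per_true_id * len(true_infected_ids)):
        published.append(secrets.token_bytes(ID_LENGTH))
    # Shuffle so decoys are indistinguishable from true IDs by position.
    secrets.SystemRandom().shuffle(published)
    return published
```

Note that with full-length random IDs, decoys mostly hide the case count; for a single-encounter observer to experience genuine false positives, the published representation must allow collisions with IDs that were actually observed, which is what the Bloom filter suggestion below provides.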
panisson commented Apr 4, 2020

Couldn't this be solved by using a sort of Bloom filter / spatio-temporal Bloom filter to represent the infected IDs?
By sharing with the server only a Bloom filter representation of the infected IDs, operations such as set union/intersection are still feasible, false-positive rates can be mitigated by selecting the right filter size, and exact ID values can't be retrieved directly from the filter. The single encounter problem might still exist, but it could be mitigated by choosing a filter size that strikes a good balance between a small false-positive rate and good privacy guarantees.
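
A minimal pure-Python sketch of this idea (filter parameters are illustrative only; picking them for real would mean tuning the false-positive rate against bandwidth and privacy):

```python
# Sketch of the suggestion above: the server publishes a Bloom filter
# over the infected EphIDs instead of the EphIDs themselves.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1_000_000, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Derive k bit positions from SHA-256 with a per-hash prefix.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))

# Server: add each infected EphID, then publish only `filter.bits`.
# Client: test each locally observed EphID; a hit means "possibly an
# infected contact", with a tunable false-positive probability.
```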

jasisz (Author) commented Apr 4, 2020

@panisson Clever use of Bloom filters would make it easier to publish EphIDs, so this is a good idea, but functionally it is what I meant by collisions.
I just wonder why the docs state that this is not possible for any proximity tracing mechanism, while it clearly is if you sacrifice some false positives for it.

@lbarman added the "privacy risk" and "protocol" labels on Apr 6, 2020
potiuk commented Apr 7, 2020

I think that's the inherent risk of a fully decentralized approach. There is a lot of debate over whether a centralized or decentralized approach is better. I think a hybrid approach (ok, maybe just centralized, depending on how you understand it), where most of the information gathering and exchange happens on the phones and only some centralized data is still stored (fully anonymously), makes the system much less vulnerable to the single-encounter case, and it has the additional benefit of algorithm testability.

I think we can modify the whole approach: the problem can be solved by centralizing the algorithm that detects whether a person is "endangered" or not, taking bits and pieces from the ProteGO app implementation (and discussion, in Polish): ProteGO-Safe/specs#34

Some thoughts:

  1. Each country already has a database of "positive COVID-19" cases. This should be centralized, including the personal data of those people. GDPR protection, medical data security: everything should apply to those cases. The privacy of those people with respect to the state is not a concern (as long as it is sufficiently protected by those regulations). So I personally see no problem with asking sick people to submit their history to a central server, provided that they are not "involuntarily" submitting other people's "private" data.

  2. When a person is diagnosed COVID-positive, the data from that person (and only that person's encounters) can be submitted to the central server with appropriate protection: signed with a code delivered during the diagnosis via QR code/SMS/phone call (a sketch of such a code-verification flow follows this list). There should be no way to discover the identity of the people the sick person encountered.

  3. Algorithms on the server (open and transparent) could analyse the data and mark the (fully anonymous) IDs of people who are endangered.

  4. The application might simply query for its own IDs to determine its status, using a single seed query (the compromise). Those queries do not have to be frequent; once a day is more than enough, which with 500M people makes roughly 6000 calls/s. A lot, but quite possible with modern cloud technology.
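
As an illustration of the code verification in point 2 (a hypothetical sketch, not the actual ProteGO or DP-3T mechanism), the health authority could issue a MAC-protected one-time code at diagnosis time, which the server verifies before accepting an upload:

```python
# Hypothetical upload-authorization flow: the authority issues a
# one-time code (delivered via QR code/SMS/phone call) and the server
# accepts encounter data only with a valid, unused code.
import hmac
import hashlib
import secrets

SERVER_KEY = secrets.token_bytes(32)  # shared by authority and server
used_codes = set()                    # prevents replaying an upload

def issue_authorization_code() -> str:
    """Authority side: a random nonce plus a MAC the server can check."""
    nonce = secrets.token_hex(8)
    tag = hmac.new(SERVER_KEY, nonce.encode(), hashlib.sha256).hexdigest()[:16]
    return f"{nonce}.{tag}"

def accept_upload(code: str, encounter_data: bytes) -> bool:
    """Server side: verify the code, then store the data anonymously."""
    try:
        nonce, tag = code.split(".")
    except ValueError:
        return False
    expected = hmac.new(SERVER_KEY, nonce.encode(), hashlib.sha256).hexdigest()[:16]
    if not hmac.compare_digest(expected, tag) or code in used_codes:
        return False
    used_codes.add(code)
    # ... persist encounter_data without any link to the uploader ...
    return True
```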

Of course it means that the algorithm could potentially be manipulated. However, since the IDs are fully anonymous, I think that might be a nice compromise (you do not even know whom you would be compromising if you are a bad actor on the server side). On the other hand, it has the benefit that the algorithm can be tested: before deploying new algorithms it would be possible to make a dry run on existing data and see if there is an error. A mistake in such algorithms could be disastrous, so I think being able to modify and test the algorithms on the server is valuable.

WDYT?

kasiazjc commented Apr 7, 2020

When a person is diagnosed COVID-positive, the data from that person (and only that person's encounters) can be submitted to the central server with appropriate protection: signed with a code delivered during the diagnosis via QR code/SMS/phone call. There should be no way to discover the identity of the people the sick person encountered.

I'm pretty sure that code verification when sending the history will be implemented in the app. To be confirmed by @jakublipinski, though.

I am wondering how we will know whether a person should be notified. The algorithm is still in progress and we do not know how it will work: should we notify people who were within radius x and spent time y next to a person who has coronavirus? This is really important in terms of transparency, I think. Or maybe I missed something; correct me if I am wrong.

potiuk commented Apr 7, 2020

I am wondering how we will know whether a person should be notified. The algorithm is still in progress and we do not know how it will work: should we notify people who were within radius x and spent time y next to a person who has coronavirus? This is really important in terms of transparency, I think. Or maybe I missed something; correct me if I am wrong.

I think there should be a team of data scientists, doctors, and technologists working on it. I do not know exactly what will work, but I strongly believe something will. That's why I also think keeping FULLY ANONYMOUS data on the server and running/testing various algorithms there is the best approach. I've been involved in machine learning / data science projects (I worked at https://nomagic.ai, a robotics and AI startup) and I know that such algorithms require a lot of iteration and testing.

Having fully anonymous BT encounter data on the server, linked to information about who has been diagnosed (without de-anonymising even the sick people), should provide good test/verification data. That's why I think keeping some data in a central location might make sense (as long as it is not de-anonymisable). The great thing about it is that in order to train/try such algorithms we do not have to know at all who is who. We just have to know that given IDs have been diagnosed, learn the spread pattern from there, and fine-tune the algorithms. Completely anonymously.

If AI/machine learning is involved (I will try to involve some of the best specialists I know from NoMagic), then we will not know the details of such algorithms anyway. AI algorithms are so far mostly "black boxes" that are not easily explainable; however, they can be tested and verified on real anonymous data. And when you run them on historical data, you can verify that your algorithms produce results correlating with reality, so they might predict risk much better than any "well described" algorithm.

But again, I am not a specialist in this area; what I would do is provide the anonymous data to people who know what they are doing and let them work with it.

jasisz (Author) commented Apr 7, 2020

@potiuk Providing such data for research is one thing, but deciding whether you are at risk (as now proposed in ProteGO) is another. It can't be a black box if we want to ensure trust.

potiuk commented Apr 7, 2020

Let's wait and see how it evolves. I think it would be great to see the algorithm, but for me it is not a blocker if the data it operates on is truly anonymous. Maybe I am wrong here, but I do not see a risk, at least from the point of view of "infiltration", "manipulation" and "preserving privacy", which were my main concerns about ProteGO before. Assuming the data will be anonymous (and this is still a big if; we have to observe and shout if not), there might be other risks involved that could make it necessary for the algorithm to be public.

I believe for now we do not even have enough data to make any assumptions about the algorithm, its accuracy or correctness, because... it does not exist yet (no data -> no algorithm). This algorithm will have to be worked out by data scientists, not software engineers, and I think it might be really complex to verify. But let's see what information ProteGO provides. I think at this moment it is important that the app is anonymous by default (you should only optionally add your phone number) and opt-in, not opt-out. Let's see what the UX will be.

potiuk commented Apr 7, 2020

And I hope the algorithm will be made public eventually.

potiuk commented Apr 8, 2020

@cloudyfuel Agreed, pseudonymous != anonymous, and I think at this step it is important to fight for anonymity. Algorithms should come next in line.

nicorikken commented

The documents in this repo mention this case as a remaining risk that cannot be mitigated. Even without an app, having limited contact helps to pinpoint the source of infection. I don't see how this can be solved with technology.

jasisz (Author) commented Apr 9, 2020

@nicorikken The problem is that an attacker can simulate having limited contact by changing his own identity frequently.

Maybe this is an inevitable part of any proximity tracing algorithm of this kind, as stated in the original doc, but I believe that if we allow the system to have some false positives, we can at least say that there is some chance it was not a true contact (and therefore not a true source of infection), but just a false positive. This may not be enough, though, and there might be better ideas.

lbarman (Member) commented Apr 14, 2020

Hi all, thanks for your very interesting inputs! The thread goes in many directions so please bear with me.

I'll try to summarize:

Initial problem: Single encounter problem.

(please start your comment with this if you're answering this point)

This is indeed a valid concern, which I believe is not solvable in the most extreme case:
Alice and Bob live in the same home and both run the app. Alice goes to work, gets infected, and does the upload. Bob learns the 1-bit "you're at risk" signal but never left the flat: he can infer that Alice infected him.

Even without considering such an extreme case, we believe that false positives (@jasisz's first proposal) ultimately cannot prevent the attack, but they do add uncertainty for the attacker (Bob in this case). One counter-argument is that false positives are undesirable for the overall utility of the system.

@jasisz, I'm sorry but I'm not sure I understand your second proposition:

A user with a single (or low number of) risky encounters during an epoch remembers an encountered ID and may re-use it in a later epoch. This way it is not trivial to know who exactly was infected. Still, only persons at risk would be alarmed.

Bloom filters

This is discussed in #24 and in our new design (see the whitepaper); if we could move the discussion to #24 it would be great.

3rd point: Centralized algorithm

(please start your comment with this if you're answering this point).
Discussion about running the algorithms on anonymous data on the server:

Thanks for the many interesting comments on the topic. It is obviously a broad topic, so perhaps just to highlight some of our past decisions: we decided that it is very hard to truly anonymize uploaded "graph" data; this is why our design uploads infected identities and not contacts (also see our FAQ, P1), hence the design we propose. Another comment is that even when it is possible, truly anonymizing uploads is costly (it requires an anonymous communication network), see our FAQ, P5; hence our design avoids this by only uploading non- or less-sensitive data to the backend.

jasisz (Author) commented Apr 14, 2020

@lbarman I could have made it clearer. There is a possibility that the app would re-use some EphIDs it has seen in the past and advertise with them by design. It does not fit into your Design 1, but it somewhat fits into Design 2.

Alice has seen EphID-B belonging to Bob, and the contact lasted long enough to be a valid, potentially infectious one. Alice can present herself with EphID-B in some later epochs. If EphID-B is then reported as infected, it is not clear whether it was Alice's own EphID or an EphID belonging to someone Alice had seen in the past.

Of course it also leads to false positives of two kinds (see the sketch below):

  • Alice was infected right after seeing Bob, so people who have only seen Bob would be false positives
  • Alice was not actually infected (though at serious risk from meeting Bob), yet we notify people one step away from the real infection: those who in fact have only met Alice
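
A rough sketch of how such re-use could look on the client (thresholds, probabilities and names are hypothetical; this mechanism is not part of either DP-3T design):

```python
# Hypothetical client-side EphID re-use: when the previous epoch held
# only a few risky encounters, occasionally advertise an observed EphID
# instead of a fresh one, so a later match on it no longer uniquely
# identifies its original owner.
import random

REUSE_THRESHOLD = 2      # "single (or low number of) risky encounters"
REUSE_PROBABILITY = 0.5  # chance of re-using instead of broadcasting own ID

def choose_advertised_ephid(own_ephid, observed_last_epoch):
    """Pick the EphID to broadcast in the current epoch."""
    few_encounters = 0 < len(observed_last_epoch) <= REUSE_THRESHOLD
    if few_encounters and random.random() < REUSE_PROBABILITY:
        return random.choice(observed_last_epoch)
    return own_ephid
```

This is exactly what produces the two kinds of false positives listed above: a match on a re-used EphID may point one step away from the real infection.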

kennypaterson (Collaborator) commented

@jasisz Thanks for clarifying this issue. We in the DP-3T project are well aware of the issue you highlight; indeed, it is discussed in the whitepaper (see Section 5.3, beginning on page 27). We consider it unavoidable in a system of the type we are aiming for. Proposals for concrete ways of addressing it that we may have missed are positively encouraged.
