# Private Set Intersection

We perform private vertical federated learning using:
- Private Set Intersection (PSI)
    - data holders agree on a set of shared data subjects using a private protocol
- Split neural networks
    - data holders compute an intermediate tensor using part of a neural network
    - data holders send the tensors to a server, which completes the computation
    - there may be multiple data holders, each with a different part of the input data
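
The split neural network flow above can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration: the function names, the weights and the feature values are hypothetical, and a real model would use a tensor library rather than lists.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear_relu(weights: Matrix, inputs: Vector) -> Vector:
    """One dense layer followed by ReLU; stands in for a data holder's network segment."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs))) for row in weights]


def server_head(weights: Vector, activations: Vector) -> float:
    """Final linear layer, run by the server on the concatenated intermediate tensors."""
    return sum(w * a for w, a in zip(weights, activations))


# Hypothetical weights; each data holder only ever sees its own segment.
holder_a_weights = [[1.0, -1.0], [0.5, 0.5]]
holder_b_weights = [[2.0], [-1.0]]
server_weights = [1.0, 1.0, 1.0, 1.0]

# Each data holder has different features for the same data subject.
features_a = [3.0, 1.0]
features_b = [2.0]

# Data holders compute intermediate tensors locally and send only those to the server.
tensor_a = linear_relu(holder_a_weights, features_a)
tensor_b = linear_relu(holder_b_weights, features_b)

# The server concatenates the tensors and completes the computation.
prediction = server_head(server_weights, tensor_a + tensor_b)
```

The sketch only works if both data holders feed in features for the same data subject in the same order, which is why the set intersection step comes first.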
    
The purpose of this notebook is to design the API for users to perform private set intersection using Syft and Grid,
which is a requirement for privacy-preserving
federated analytics
and model training.
This notebook will also define common user stories,
requirements
and outstanding questions.

---
# Requirements and Outstanding Questions

A list of features Syft/Grid should have.
Where known,
each requirement is annotated with the `syft` version
in which it will first be released
and any development teams that will be required.
The first release of `syft` which will include vertical federated learning will be `0.3.2`.

1. Data holders can send _encrypted_ data IDs to another party and receive a list of encrypted IDs which intersect with other data holders' lists
    - `syft 0.3.2`
1. PSI calculation should be asynchronous, i.e. data holders can perform other tasks while waiting for all data IDs to be sent to the server for calculation
1. PSI should work for any `n >= 2`, where `n` is the number of data holders
1. PSI calculation should be independent of a vertical federated learning setting
    - Data holders may want to be able to define a common set of data on which they can perform analytics
    - `syft 0.3.2`
1. Set intersections should be obscured to prevent a majority party from learning too much information
    - See [this paper](https://arxiv.org/pdf/2004.07427.pdf)
    - `> 0.3.2`
    - Protocols should be developed within `PSI`; Differential Privacy team for privacy calculations?
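
To make requirements 1 to 3 concrete, here is a rough sketch of a server-mediated flow for `n >= 2` data holders. It is not the Syft or Grid API: `blind_ids`, `IntersectionServer`, `data_holder` and `run_psi` are hypothetical names. It also swaps in keyed hashing (HMAC-SHA256 with a key the server does not hold) as a stand-in for a real PSI protocol, and assumes the data holders have already agreed on that key out of band.

```python
import asyncio
import hashlib
import hmac
from typing import Dict, List, Set


def blind_ids(ids: List[str], shared_key: bytes) -> Dict[str, str]:
    """Map each blinded (HMAC-SHA256) ID back to the raw ID it came from."""
    return {
        hmac.new(shared_key, raw_id.encode("utf-8"), hashlib.sha256).hexdigest(): raw_id
        for raw_id in ids
    }


class IntersectionServer:
    """Hypothetical server: collects blinded ID sets from n holders, then intersects them."""

    def __init__(self, n_holders: int) -> None:
        if n_holders < 2:
            raise ValueError("PSI needs at least two data holders")
        self._n_holders = n_holders
        self._submissions: Dict[str, Set[str]] = {}
        self._all_submitted = asyncio.Event()

    async def submit(self, holder: str, blinded: Set[str]) -> Set[str]:
        """Store one holder's blinded IDs and wait, without blocking, for the rest."""
        self._submissions[holder] = blinded
        if len(self._submissions) == self._n_holders:
            self._all_submitted.set()
        await self._all_submitted.wait()
        return set.intersection(*self._submissions.values())


async def data_holder(
    name: str, ids: List[str], key: bytes, server: IntersectionServer
) -> List[str]:
    """Send blinded IDs, await the blinded intersection, and map it back to raw IDs."""
    blinded = blind_ids(ids, key)
    common = await server.submit(name, set(blinded))
    return sorted(blinded[b] for b in common)


async def run_psi(datasets: Dict[str, List[str]], key: bytes) -> Dict[str, List[str]]:
    # The server is created inside the running event loop.
    server = IntersectionServer(n_holders=len(datasets))
    results = await asyncio.gather(
        *(data_holder(name, ids, key, server) for name, ids in datasets.items())
    )
    return dict(zip(datasets, results))


# Hypothetical data: three holders with partially overlapping IDs.
datasets = {
    "holder_a": ["id1", "id2", "id3"],
    "holder_b": ["id2", "id3", "id4"],
    "holder_c": ["id3", "id2", "id5"],
}
# Assumption: the key is agreed between data holders and never shown to the server.
results = asyncio.run(run_psi(datasets, key=b"demo-key-agreed-out-of-band"))
```

In this sketch the server still learns the size of every set and of the intersection, and the scheme is only as strong as the shared key, so it illustrates the shape of the API rather than the privacy guarantees the final protocol should provide.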

## Outstanding Questions

1. How can data holders communicate to agree on a data ID scheme? Dataset schema?
1. Who defines the encryption protocol? The server? The data holders?
1. Should data holders be able to elect a data holder to perform PSI, instead of a computational server?
1. What's a reasonable upper bound on the number of data holders, for testing and performance purposes?
1. How do data holders agree on the level of privacy of the intersecting set?
1. How are superfluous data points (introduced by intersection obscuring) handled during vertical federated learning? e.g. must the server hold information on which data points/IDs to ignore?
1. How can we guarantee that data holders act on the results of PSI honestly? e.g. get rid of the data points that are not in the intersection

---
# Scenario 1 - Vertically Federated Analytics

Netflax and FitByte hold non-overlapping data on individuals' daily activities.
A researcher wants to analyse the impact of daily activity on sleep quality.
To do this,
they purchase data from Netflax and FitByte.
To work out which data each company needs to send to
the researcher,
the intersection of the dataset IDs is computed.

Assumptions:
- E-mail addresses are used as unique IDs. Each individual uses the same e-mail address to sign up to both services

![Scenario 1, step 1](images/scenario1-step1.png)

![](images/scenario1-step2.png)

![](images/scenario1-step3.png)
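
One possible way to compute the intersection in this scenario, directly between the two companies, is a Diffie-Hellman style PSI. The sketch below is illustrative only. The `Party` class, the e-mail addresses and the e-mail normalisation are assumptions, and the modulus `2**255 - 19` was picked simply because it is a well-known prime. A real deployment would use a vetted prime-order group or elliptic curve through a maintained PSI library.

```python
import hashlib
import secrets
from typing import List

# Assumption: a well-known prime used purely for illustration, not a vetted PSI group.
PRIME = 2**255 - 19


def hash_to_group(email: str) -> int:
    """Hash an e-mail address to an integer in [1, PRIME - 1]."""
    # Assumption: IDs are normalised by trimming whitespace and lower-casing.
    digest = hashlib.sha256(email.strip().lower().encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (PRIME - 1) + 1


class Party:
    """Hypothetical data holder with a private exponent it never shares."""

    def __init__(self, emails: List[str]) -> None:
        self.emails = emails
        self._secret = secrets.randbelow(PRIME - 3) + 2  # in [2, PRIME - 2]

    def blind_own(self) -> List[int]:
        """Blind this party's hashed IDs with its secret exponent."""
        return [pow(hash_to_group(e), self._secret, PRIME) for e in self.emails]

    def blind_again(self, values: List[int]) -> List[int]:
        """Apply this party's secret to values already blinded by the other party."""
        return [pow(v, self._secret, PRIME) for v in values]


netflax = Party(["ada@example.com", "bob@example.com", "cy@example.com"])
fitbyte = Party(["bob@example.com", "cy@example.com", "dee@example.com"])

# Step 1: each company blinds its own IDs and sends them to the other.
netflax_once = netflax.blind_own()
fitbyte_once = fitbyte.blind_own()

# Step 2: each company blinds what it received; exponentiation commutes,
# so an ID held by both ends up with the same doubly blinded value.
netflax_twice = fitbyte.blind_again(netflax_once)  # returned in the same order
fitbyte_twice = set(netflax.blind_again(fitbyte_once))

# Step 3: Netflax keeps the e-mails whose doubly blinded value also appears
# in FitByte's doubly blinded set.
shared = [
    email for email, value in zip(netflax.emails, netflax_twice) if value in fitbyte_twice
]
```

FitByte can run the same comparison in the other direction. Neither company sees the other's raw e-mail addresses, only blinded values, although each still learns the size of the other's dataset.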