Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ds.summary simplification #257

Open
tombisho opened this issue Sep 8, 2020 · 5 comments
Open

ds.summary simplification #257

tombisho opened this issue Sep 8, 2020 · 5 comments
Milestone

Comments

@tombisho
Copy link
Contributor

tombisho commented Sep 8, 2020

ds.summary was originally written to deal with simple R object types like dataframes and vectors. It tests the object type on the server side, and depending on the result, calls a serverside method that generates a privacy protected summary and returns it to the client.

There are 2 things I would like to raise about this approach:

  1. As we add more exotic object types, the list of if{} then{} statements grows ever longer
  2. If the object type has not been added to ds.summary and a corresponding server side function defined to summarise it, a summary is not available

I have been looking how the native summary() function works. It is written as a generic function, which just provides a placeholder. When new objects are defined, they can provide their own method for summarising the data. For example, an object of class blob could have a method called summary.blob(). When summary() is called with a blob object, the code in summary.blob() is executed. There is also a method summary.default() which looks at the underlying structure of an object - most objects are matrices, lists etc. with some extra attributes - and then summarises it. For example, if the blob object was just a matrix with some extra attributes, and we did not define a method for summarising it, when summary() is called, it falls back to summary.default(). This sees that the blob object is really just a matrix and summarises each column.

I wonder if a similar approach might be useful in DataSHIELD, to save having to explicitly define summary functions for every object type.

This would mean having a generic summaryDS() function which is analogous to the summary() function, but with privacy protection. Objects created with DataSHIELD on the serverside could also define their own version of summaryDS(), for example summaryDS.glm() for GLMs.

@StuartWheater
Copy link
Member

What you describe is a common pattern when building extensible frameworks. Would it be acceptable to scheduled this change to v6.2?

@tombisho
Copy link
Contributor Author

tombisho commented Sep 8, 2020

Certainly v6.2 would be fine - I don't think it is urgent but I feel it becomes more relevant the more that we simply 'wrap' existing package functionality with privacy protection.

It feels like there could be wider applications beyond just the summary function - perhaps with plotting, too?

@StuartWheater StuartWheater added this to the v6.2 milestone Sep 8, 2020
@StuartWheater
Copy link
Member

I was wondering if complicated function, for example 'predict', could also benefit from a similar approach.

@tombisho
Copy link
Contributor Author

tombisho commented Sep 8, 2020

yes - I see what you mean, as it seems to be the same situation there

@tombisho
Copy link
Contributor Author

I don't know the status of this in v6.2, but the current way in which ds.summary works is also causing some challenges in the ds-helper package. At the moment it seems to lack any parallelisation, in that a separate call is made to each server. So when you have a lot of studies, distributed globally, it takes a long time to run. A side point is when I have understood correctly that DataSHIELD calls to the servers are indeed made asynchronously (ie all at the same time) not synchronously (wait for each server to return before going to the next).

Anyway, this effect is amplified in the ds-helper getStats function which produces a tabular set of stats useful for papers. So we would like to address if possible

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants