Supporting anonymous research data #14

Open
tangollama opened this Issue Apr 22, 2016 · 1 comment

@tangollama
Member

tangollama commented Apr 22, 2016

We need an automated strategy for supporting anonymized data sets for research. By anonymized, I specifically mean dropping identifying information from patient records, such as:

  • First and Last Name
  • Street Address
  • Phone number
  • Email
  • Names and addresses of related contacts

It feels to me like this feature ought to be built around filtered replication in CouchDB, so that specific records are replicated into a research copy of the data. That said, I don't have the technical details worked out, which is why someone needs to own this as a feature.
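
As a rough illustration of what "dropping identifier information" could look like, here is a minimal Python sketch. The field names (`firstName`, `lastName`, `address`, `phone`, `email`, `contacts`) are assumptions made for the example and are not taken from the project's actual patient schema.

```python
from copy import deepcopy

# Hypothetical field names standing in for the identifiers listed above;
# the real patient schema may use different keys.
IDENTIFYING_FIELDS = {
    "firstName",
    "lastName",
    "address",
    "phone",
    "email",
    "contacts",  # names and addresses of related contacts
}


def anonymize(doc: dict) -> dict:
    """Return a copy of `doc` with the identifying fields removed.

    The input document is left untouched, and CouchDB metadata such as
    `_id` and `_rev` is kept so the copy can still be stored.
    """
    return {
        key: deepcopy(value)
        for key, value in doc.items()
        if key not in IDENTIFYING_FIELDS
    }
```

Note that this only strips top-level fields. Identifiers embedded in free-text notes or in the document `_id` itself would need separate handling.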

@pgte

pgte May 24, 2016

Contributor

Since filtered replication can only decide whether a document should or shouldn't be replicated, I suggest we do a mapped one-way replication from the main database into an anonymized database.
This mapped replication would listen to changes from the main database and map each passing document on the fly.
This could be a special-purpose Node.js process acting as a replication proxy (invoked on demand by CouchDB, so that we don't have to reimplement replication and can limit ourselves to filtering some documents on the fly).
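
To make the "listen to changes and map each document" idea concrete, here is a minimal Python sketch of a single pass. It is not the replication proxy described above but a simpler stand-alone variant: it reads the source database's `_changes` feed with `include_docs=true`, applies a mapping function, and writes the results to the target through `_bulk_docs` with `new_edits` set to `false` so the original revisions are kept. The function names, the injected `http` helper, and the choice to skip design documents are all assumptions made for the example.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def http_json(url, body=None):
    """Minimal JSON helper: GET when `body` is None, POST otherwise."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    req = Request(url, data=data, headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)


def replicate_once(source, target, since, mapper, http=http_json):
    """Copy one batch of changes from `source` to `target`, mapping each doc.

    Returns the `last_seq` reported by the source so the caller can pass it
    as `since` on the next call.
    """
    seq = quote(str(since), safe="")
    changes = http(f"{source}/_changes?include_docs=true&since={seq}")

    docs = [
        mapper(row["doc"])
        for row in changes["results"]
        # Skip design documents (an assumption: the research copy
        # probably should not inherit the main database's views).
        if "doc" in row and not row["id"].startswith("_design/")
    ]

    if docs:
        # new_edits=False stores the documents with their existing _rev,
        # which is how CouchDB's own replicator writes to a target.
        http(f"{target}/_bulk_docs", {"docs": docs, "new_edits": False})

    return changes["last_seq"]
```

A long-running process would call `replicate_once` in a loop (or switch to a continuous feed) and persist the returned sequence between runs. The mapper must leave `_id`, `_rev`, and `_deleted` in place so that deletions and revisions still propagate.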

A research user (or any other consumer of anonymized data) would point to this database instead of the main one.

Another use for this would be to replicate from the anonymized database into a central database, which could then be used for reporting purposes.

Some ideas for desirable side effects:

(perhaps these should go into separate issues)

  • The "researcher" user role would be forced to use this database instead of the main one.
  • Filter writes made by the researcher role (return an error when it tries to write to anonymized documents).
  • Include some tests in the test suite to validate that a research user only has access to the anonymized database.
  • Separately implement the process of anonymizing a document (Simpsons / Star Wars / * characters replacing real personal data).
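
The last bullet (placeholder characters or `*` replacing real personal data) could be sketched in Python as follows. Everything here is an assumption for illustration: the field names, the placeholder list, and the choice to derive the placeholder from a hash of the document `_id` so that the same patient always maps to the same fictional name.

```python
import hashlib

# Hypothetical placeholder identities; any list of fictional names works.
PLACEHOLDER_NAMES = [
    ("Homer", "Simpson"),
    ("Marge", "Simpson"),
    ("Luke", "Skywalker"),
    ("Leia", "Organa"),
]

# Hypothetical fields that are masked with "*" instead of being renamed.
MASKED_FIELDS = ("address", "phone", "email")


def placeholder_for(doc_id: str):
    """Pick a placeholder name deterministically from the document id."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(PLACEHOLDER_NAMES)
    return PLACEHOLDER_NAMES[index]


def pseudonymize(doc: dict) -> dict:
    """Return a copy of `doc` with names replaced and contact data masked."""
    out = dict(doc)
    first, last = placeholder_for(doc["_id"])
    if "firstName" in out:
        out["firstName"] = first
    if "lastName" in out:
        out["lastName"] = last
    for field in MASKED_FIELDS:
        if isinstance(out.get(field), str):
            # Keep the length so the record still looks plausible in the UI.
            out[field] = "*" * len(out[field])
    return out
```

With a short placeholder list many patients share a name, which is fine for anonymity but means the placeholder cannot be used as a key. Keeping the original length when masking can also leak a little information, so a fixed-length mask may be preferable.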