We need an automated strategy for supporting anonymized data sets for research. By anonymized, I specifically mean dropping identifying information from patient records, such as:
First and Last Name
Names and addresses of related contacts
It feels to me like this feature ought to be built around filtered replication in CouchDB, so that specific records are handled as a research copy of the data. That said, I don't have the technical details worked out, which is why someone needs to own this as a feature.
Since filtered replication can only tell whether a document should or shouldn't be replicated, I suggest we do a mapped one-way replication from the main database into an anonymized database.
This mapped replication would listen to changes from the main database and map each document on the fly as it passes through.
This could be a special-purpose Node process acting as a replication proxy. CouchDB would drive the replication through it on demand, so that we don't have to reimplement replication and only have to filter documents on the fly.
A research user (or any other user of anonymized data) would point to this database instead of the main one.
Another use for this would be to replicate from the anonymized database into a central database, which could then be used for reporting purposes.
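To make the mapping step above more concrete, here is a minimal sketch of what the proxy could do to each batch of changes. It assumes the changes are read with include_docs=true, so each row carries an id and a doc. The field names (firstName, lastName, contacts) and the function names are hypothetical and would need to match the real patient schema. How the mapped documents get written to the anonymized database is left out.

```typescript
// Sketch only: field names below are assumptions, not the project's real schema.
type Doc = { _id: string; [key: string]: unknown };

// Shape of one row from a CouchDB changes feed read with include_docs=true.
interface ChangeRow {
  id: string;
  doc?: Doc;
}

// Hypothetical list of fields that identify a patient or related contacts.
const IDENTIFYING_FIELDS = ["firstName", "lastName", "contacts"];

// Return a copy of the document without the identifying fields.
function anonymize(doc: Doc): Doc {
  const copy: Doc = { _id: doc._id };
  for (const key of Object.keys(doc)) {
    if (IDENTIFYING_FIELDS.indexOf(key) === -1) {
      copy[key] = doc[key];
    }
  }
  return copy;
}

// Map one batch of change rows into documents for the anonymized database.
function mapChanges(rows: ChangeRow[]): Doc[] {
  const mapped: Doc[] = [];
  for (const row of rows) {
    // Skip rows without a document body and design documents.
    if (!row.doc || row.id.indexOf("_design/") === 0) {
      continue;
    }
    mapped.push(anonymize(row.doc));
  }
  return mapped;
}
```

Listing the fields to drop is the simplest option, but it fails open: any identifying field added to the schema later would leak until the list is updated. Listing the fields to keep instead might be the safer default for a research copy.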
Some ideas for desirable side effects:
(perhaps these should go into separate issues)
The "researcher" user role would be forced to use this database instead of the main one.
Filter writes made by the researcher role (return an error when trying to write to anonymized documents).
Include some tests in the test suite to validate that a research user only has access to the anonymized database.
Separately implement the process of anonymizing a document (Simpsons or Star Wars character names, or * characters, replacing real personal data).
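As a rough illustration of the last idea, the sketch below replaces a real name either with a fictional character name or with * characters. The placeholder pool, the function names, and the choice to pick the placeholder from a hash of the document id are all assumptions made for the example. The hash is a simple non-cryptographic one, used only so that the same document always gets the same placeholder.

```typescript
// Hypothetical placeholder pool; any list of obviously fictional names would do.
const PLACEHOLDERS = ["Homer Simpson", "Marge Simpson", "Luke Skywalker", "Leia Organa"];

// Small non-cryptographic string hash (djb2 style), kept within 32 bits.
function hashString(text: string): number {
  let h = 5381;
  for (let i = 0; i < text.length; i++) {
    h = (h * 33 + text.charCodeAt(i)) >>> 0;
  }
  return h;
}

// Pick a placeholder name that stays stable for a given document id.
function placeholderName(docId: string): string {
  return PLACEHOLDERS[hashString(docId) % PLACEHOLDERS.length];
}

// Alternative: replace every character of the value with "*".
function mask(value: string): string {
  return value.replace(/./g, "*");
}
```

With a pool this small, many patients will share the same placeholder, which is fine for display purposes but means the placeholder cannot be used to tell records apart. Note also that masking with * keeps the length of the original value, which may or may not be acceptable for research use.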