What data should be included in a user download feature #1

angusmcleod opened this issue Aug 6, 2020 · 10 comments

@angusmcleod

angusmcleod commented Aug 6, 2020

For the Discourse Legal Tools Plugin we took a "maximalist" approach to what data should be included in a download of user data for the purposes of data portability and access.

I created a class that looped through lists of columns for each table, which were then used to cumulatively add data to an object that was then bundled as a CSV file. This piggybacked off of, and expanded, the existing Discourse user data download feature. The lists of columns included in the download are here. Those lists serve as a decent reference list for the same in Flarum.
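
As a rough illustration of that column-list approach, here is a minimal sketch in Python. It is not the plugin's actual code: the table names, column lists, and the `export_user_csv` helper are all hypothetical and only show the shape of the loop.

```python
import csv
import io

# Hypothetical per-table column lists (assumption: the real lists are the
# ones linked from the plugin, not these).
EXPORT_COLUMNS = {
    "posts": ["id", "topic_id", "raw", "created_at"],
    "user_emails": ["email", "primary"],
}


def export_user_csv(tables, user_id):
    """Return CSV text covering every listed column for one user.

    `tables` maps a table name to a list of row dicts, each with a `user_id`.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["table", "column", "value"])
    for table, columns in EXPORT_COLUMNS.items():
        for row in tables.get(table, []):
            # Only rows that carry this user's identifier are exported.
            if row.get("user_id") != user_id:
                continue
            for column in columns:
                writer.writerow([table, column, row.get(column)])
    return buffer.getvalue()
```

Adding a table to the download then only means adding an entry to the column map, which is what makes the lists usable as a reference.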

In terms of the thinking behind the approach taken here, I would highlight the reasoning I laid out here, which I've copied below so we can discuss them further in the Flarum context.


I would emphasise “for the purposes of this feature”, as the purpose of this feature is to take a ‘maximalist’ approach to possible interpretations of the GDPR. It does not attempt to parse ‘likely’ approaches. I’ve laid out some of my own views on the ‘likely’ approaches in this topic (which remain unchanged).

The specifics of the reasoning behind this ‘maximalist’ approach are:

  • The broadest interpretation of A.4.1 (the definition of ‘Personal Data’ in the GDPR) as it applies to Discourse is any record in the db that contains the user’s user_id, i.e.:

    any information … identified or identifiable natural person … identified, directly or indirectly, in particular by reference to … an identification number

  • Read literally, this definition doesn’t care about how the data is produced (e.g. whether the user is acting or not). It merely requires the data to be related to the user’s identifier in some way.

  • However, applying that literally would produce a fair amount of duplication (e.g. the records in the directory_items table are duplicative of various other entries).

  • The point of the extended download is to guard against even the small risk that Article 4.1 could receive a very broad interpretation by some authority or court in Europe.

  • The factors against including it - size of download, potentially security (?) - do not outweigh the possible benefit of including it.

We also considered whether to include ‘administrative’ records with the user’s user_id such as flags, complaints and staff whispers. We decided against this, reasoning as follows:

  • They’re already in the territory of information associated with the user purely by their identifier. They are not information about the user per se (i.e. name, email, age, location etc). This is already assuming a wide interpretation of A.4.1.

  • Whether administrative records intrude on the privacy of other parties, or other relevant concerns (i.e. R. 63.5 & A.15.4) must be determined on a case-by-case basis.

  • Other parties, such as Facebook, do not include such data in their user download functionality.

angusmcleod changed the title from "How I approached data access and portability in the Discourse Legal Tools Plugin" to "What data should be included in a user download feature" on Aug 6, 2020
@katosdev

katosdev commented Aug 6, 2020

Speaking as somebody who works in law and has assisted in the creation of a number of GDPR policies and extensions, I will add my two cents here and expand on this when I have more time after work.

The broadest interpretation of A.4.1 (the definition of ‘Personal Data’ in the GDPR) as it applies to Discourse is any record in the db that contains the user’s user_id, i.e.:

Honestly, I would list this out in a checklist under an issue - allowing developers and contributors for this extension to tick off as the feature is implemented.

any information … identified or identifiable natural person … identified, directly or indirectly, in particular by reference to … an identification number

Here, this should be linked to the data that Flarum collects at core. However, it becomes trickier when one considers extended use cases. For instance, ignoring extensions entirely, do we classify a username as PII? Particularly when it can (or perhaps previously did) contain a name - I know for example that Phenomlab on the forums used to use his full name.

Read literally, this definition doesn’t care about how the data is produced (e.g. whether the user is acting or not). It merely requires the data to be related to the user’s identifier in some way.
However, applying that literally would produce a fair amount of duplication (e.g. the records in the directory_items table are duplicative of various other entries).
The point of the extended download is to guard against even the small risk that Article 4.1 could receive a very broad interpretation by some authority or court in Europe.

Would argue a point on this one - Data should be personal - therefore the directory would arguably not come into play at all, as this is a list of users. Instead, the data for that specific user should be exported, such as (but not limited to) their topics, posts, avatars, etc.

The factors against including it - size of download, potentially security (?) - do not outweigh the possible benefit of including it

On other implementations that I have supported, data was exported as a JSON file, which (even with our largest exports) was quite minimal in size. We'd have to consider the format in which we export the data. Whilst we have to provide the data, no specific file format is prescribed for doing so.

We also considered whether to include ‘administrative’ records with the user’s user_id such as flags, complaints and staff whispers. We decided against this, reasoning as follows:
They’re already in the territory of information associated with the user purely by their identifier. They are not information about the user per se (i.e. name, email, age, location etc). This is already assuming a wide interpretation of A.4.1.

Partially agree. The only complaint I would have here, is that we should have a hook for extensions. The reason being, if a user was to request a name change for example, then this data should also be exported (once supported by the extension) as this contains the potential identifier of the user's name.

Whether administrative records intrude on the privacy of other parties, or other relevant concerns (i.e. R. 63.5 & A.15.4) must be determined on a case-by-case basis.

Administrative actions such as deletion of posts still retain an entry in the database. This data should also be provided to the end user, as that data is still stored.

Other parties, such as Facebook, do not include such data in their user download functionality.

Agreed, the full administrative action against a user should not be included in the data export report.

@phenomlab

phenomlab commented Aug 6, 2020

The broadest interpretation of A.4.1 (the definition of ‘Personal Data’ in the GDPR) as it applies to Discourse is any record in the db that contains the user’s user_id

This is the correct way to establish a link between the user and the data being requested. The purpose of the data extraction is to provide the requesting owner an exact "dump" of all data that relates to them from the system it is extracted from. The format is open in the sense that there are no physical restrictions on how this should be presented. I agree with @katosdev that JSON is clearly the way to go here, as it is relatively easy to convert and ingest into an Excel spreadsheet, for example, thanks to its simple, well-defined structure.
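
To illustrate the spreadsheet point, here is a minimal sketch in Python that flattens a JSON export into CSV text a spreadsheet can open. The export shape (`{"table": [row, ...]}`) and the `json_export_to_csv` helper are assumptions made for the example, not an agreed format.

```python
import csv
import io
import json


def json_export_to_csv(export_json):
    """Flatten {"table": [row, ...]} JSON into CSV text with a `table` column.

    Assumes no row has its own key named "table".
    """
    data = json.loads(export_json)

    # Collect every column name that appears, in first-seen order.
    columns = ["table"]
    for rows in data.values():
        for row in rows:
            for key in row:
                if key not in columns:
                    columns.append(key)

    buffer = io.StringIO()
    # restval fills the cells a given row has no value for.
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    for table, rows in data.items():
        for row in rows:
            writer.writerow({"table": table, **row})
    return buffer.getvalue()
```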

Read literally, this definition doesn’t care about how the data is produced (e.g. whether the user is acting or not). It merely requires the data to be related to the user’s identifier in some way.

This is also correct. The language used is intentionally broad so that it provides headroom in terms of the format it can be exported in. The main point to be careful of there is that we do not include information relating to another user within the extract as that in itself would be a violation of GDPR.

However, applying that literally would produce a fair amount of duplication (e.g. the records in the directory_items table are duplicative of various other entries).

This isn't relevant in the sense that you are required to provide ALL records that relate to that particular owner, ensuring that you redact any information within that which does not "belong" to the owner.
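
A minimal sketch of that redaction step in Python, assuming a hypothetical set of field names that point at someone other than the record's owner. Which fields actually qualify would have to be decided per table.

```python
REDACTED = "[redacted]"

# Hypothetical field names that refer to a user other than the owner.
OTHER_PARTY_FIELDS = {"reply_to_user_id", "mentioned_username", "flagged_by"}


def redact_record(record, other_party_fields=OTHER_PARTY_FIELDS):
    """Return a copy of `record` with non-empty other-party fields replaced."""
    return {
        key: REDACTED if key in other_party_fields and value is not None else value
        for key, value in record.items()
    }
```

The record itself is still provided, so the owner receives everything that relates to them, while the values that belong to someone else are withheld.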

We also considered whether to include ‘administrative’ records with the user’s user_id such as flags, complaints and staff whispers. We decided against this

I'd agree that provided there was no attribution to the owner in the sense of name, physical address, email address, SSN etc, then ok. However, this would still have to be extracted and manually parsed to ensure that this is indeed not the case. The issue that arises here is one of time and commitment - particularly for large data sets

Any data relating to the user that references a system outside of Flarum is not in scope. The owner of such information would need to request this from that particular source. It is not the responsibility of the forum owner to provide this as they aren't even custodians of that data - it does not reside on this particular system - even if it is a link elsewhere.

Now for @katosdev points

Honestly, I would list this out in a checklist under an issue - allowing developers and contributors for this extension to tick off as the feature is implemented.

100% agree. This is the only way to ensure any level of consistency with each check

Would argue a point on this one - Data should be personal - therefore the directory would arguably not come into play at all, as this is a list of users. Instead, the data for that specific user should be exported, such as (but not limited to) their topics, posts, avatars, etc.

Data should be personal, yes, but if it does not relate in any way to the original owner, it should not (and cannot legally) be provided without the real owner's consent. And taking this route introduces a legal minefield.

Partially agree. The only complaint I would have here, is that we should have a hook for extensions. The reason being, if a user was to request a name change for example, then this data should also be exported (once supported by the extension) as this contains the potential identifier of the user's name.

Hmm - not necessarily. The key point here is attribution. If the name change request comes from a username where there is no correlation between that and the original owner in the sense of a real name or email address, then there is no legal requirement to provide this. However, I do agree in terms of the hook for other extensions to use.

Administrative actions such as deletion of posts still retains an entry in the database. This data should also be provided to the end user, as that data is still stored.

Good point - unless they have been permanently deleted, they would need to be provided

@luceos
Member

luceos commented Aug 6, 2020

@phenomlab small remark: can you move that information on the "right to be forgotten" to a new issue, to keep this one as minimal as possible (we're failing hard here 😁).

To @katosdev and @angusmcleod and @phenomlab ; this is great!!!!!!!

I started work on this feature where:

  • the user can request a download of data
  • this dispatches a queue job and will send the user an email when finished; it stores an entry in the database for reference
  • it will delete zip files and these generated db entries after a specific time
  • users can access the link in the email if logged in to retrieve the zip
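
The clean-up step in that flow could be sketched as follows in Python. The `ExportEntry` record, the `purge_expired` helper, and the 7-day retention period are all assumptions for illustration; the thread only says "after a specific time".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from pathlib import Path


@dataclass
class ExportEntry:
    """Hypothetical stand-in for the database row that tracks one export."""
    user_id: int
    zip_path: Path
    created_at: datetime


def purge_expired(entries, now, max_age=timedelta(days=7)):
    """Delete the zip files of expired exports and return the entries to keep."""
    kept = []
    for entry in entries:
        if now - entry.created_at > max_age:
            # The file may already be gone; the entry is dropped either way.
            entry.zip_path.unlink(missing_ok=True)
        else:
            kept.append(entry)
    return kept
```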

content:

  • avatar
  • all posts

Is there anything from core that I'm missing?

Content in this extension implements an interface, meaning any extension can implement their own data type and register it on this extension. When the queue job is being executed the data type will retrieve the user model and the open zip stream to append information to it.
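
The registration idea can be sketched like this in Python. It is only an illustration of the pattern, not the extension's code: `DataType`, `PostsDataType`, `Exporter`, and the dict-shaped `user` are hypothetical names.

```python
import io
import json
import zipfile
from abc import ABC, abstractmethod


class DataType(ABC):
    """Hypothetical contract for one exportable kind of user data."""

    @abstractmethod
    def export(self, user, archive):
        """Append this data type's files for `user` to the open zip archive."""


class PostsDataType(DataType):
    def export(self, user, archive):
        archive.writestr("posts.json", json.dumps(user["posts"], indent=2))


class Exporter:
    def __init__(self):
        self._data_types = []

    def register(self, data_type):
        """Called by core or by any extension to add its own data type."""
        self._data_types.append(data_type)

    def build(self, user):
        """Run every registered data type against one open zip and return its bytes."""
        buffer = io.BytesIO()
        with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
            for data_type in self._data_types:
                data_type.export(user, archive)
        return buffer.getvalue()
```

The job that builds the archive never needs to know which data types exist, so extensions can add their own without changes to the exporter.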

@angusmcleod
Author

angusmcleod commented Aug 6, 2020

@luceos Good news re the work. The mechanics of the feature sound like the right approach to me. I think we'll definitely get there in terms of the content, but it will help to hash out some of these legal issues too so we're confident on that approach.

@phenomlab @katosdev Thanks guys, some great points. Perhaps we should attempt to formulate a practical heuristic (or heuristics), framed in language relevant to software development (as opposed to legal principles), that largely reflects the legal position, so it can be applied both to this initial implementation and to any extensions of it. The heuristic(s) won't be perfect of course, but if the three of us, with our different backgrounds / approaches, could agree on some, it could be quite a useful framework to branch off of for the edge cases.

A significant element there I think is sustaining this over time as Flarum grows and this is applied to extensions. If it's possible to look back in a few years' time and say "ok these guys decided to include / exclude this data based on these heuristics" that would help, as opposed to folks having to parse a lot of legal debate. Looking back on the debate that led up to the Discourse legal tools plugin I see this issue now, insofar as the thinking that led up to that and the heuristics that were applied in the end are not easy to parse if you're an outsider / looking at it later.

Scope

On the question of scope, i.e. what is "personal" data, whether usernames qualify, whether directory items qualify: @katosdev, I think your position is definitely arguable. However, the way I see the utility of this feature is to guard against a number of possible interpretations. This is a young law and there are a lot of different authorities that will be applying it. I feel it's too early to say with confidence what the character and scope of application of specific articles and principles will be. This is what I meant by the "maximalist" approach. I would err on the side of including more than necessary, with the limiting principle of countervailing rights and responsibilities, i.e. as @phenomlab pointed out re records that actually belong to someone else and would require their consent to provide.

In terms of trying to formulate a scope heuristic, I would kick that off by formulating it as follows:

Include any record that contains a user's
- ip address; or
- name; or
- username; or
- address; or
- image; or
- user_id if the record does not contain data related to another user or 3rd party
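
Expressed as code, the heuristic might look like the following Python sketch. It takes one possible reading, where the "another user or 3rd party" condition applies to the user_id case. The `UserIdentity` fields and the `OTHER_PARTY_COLUMNS` names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class UserIdentity:
    user_id: int
    username: str
    name: str = ""
    address: str = ""
    image: str = ""
    ip_addresses: set = field(default_factory=set)


# Hypothetical column names that would point at another user or third party.
OTHER_PARTY_COLUMNS = {"target_user_id", "reported_user_id", "recipient_id"}


def in_scope(record, identity):
    """Decide whether one record belongs in this user's download."""
    direct = {identity.username, identity.name, identity.address, identity.image}
    direct = {value for value in direct if value} | identity.ip_addresses

    # Any field holding the user's IP, name, username, address or image.
    if any(isinstance(value, str) and value in direct for value in record.values()):
        return True

    # A bare user_id match counts only if nobody else's data is in the record.
    if record.get("user_id") == identity.user_id:
        return not any(record.get(col) is not None for col in OTHER_PARTY_COLUMNS)
    return False
```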

I would note the ip address in particular here. That turned out to be a bit of an issue in Discourse, which logs ips in quite a few different places.

@phenomlab @katosdev Please feel free to propose a different or alternate scope heuristic. Generally, I think we should try and formulate them in practical software terms. It doesn't have to be 100% infallible, and there will be edge cases.

Format

On the format front, I've copied what I said about that when we discussed that re Discourse. I agree JSON is a good format. In fact I think it's arguable that a JSON API (as opposed to a download) qualifies. CSV is probably also fine. I'm not sure much will turn on JSON v CSV tbh, but I would also favour JSON.


Copied from https://meta.discourse.org/t/providing-data-for-gdpr/83595/23?u=angus

Concerning the Article 29 Working Party’s Guidelines on the Right to Data Portability I note:

  • Availability of data via a JSON API is explicitly mentioned (multiple times) as a suitable data format. In fact one might even say it is encouraged vis-a-vis other methods.

  • There is no requirement to provide everything in a single package, or instantly. The data needs to be provided “within a reasonable time not exceeding one month”.

  • The thrust of the regulation is to avoid data “lock-in” and to promote interoperability.

p.s. I've started a separate issue re the right to erasure: #2

@phenomlab

What data should be included in a user download feature?

The simple answer to this is EVERYTHING that relates to the owner of such information, with information concerning others outside the scope of the collection redacted.

@katosdev

katosdev commented Aug 8, 2020

I need to read over this after I've had a coffee and time to read through it properly.
I'll respond to the points raised later today :)

@katosdev

katosdev commented Aug 8, 2020

I started work on this feature where:

  • the user can request a download of data
  • this dispatches a queue job and will send the user an email when finished; it stores an entry in the database for reference
  • it will delete zip files and these generated db entries after a specific time
  • users can access the link in the email if logged in to retrieve the zip

content:

  • avatar
  • all posts

Is there anything from core that I'm missing?

Content in this extension implements an interface, meaning any extension can implement their own data type and register it on this extension. When the queue job is being executed the data type will retrieve the user model and the open zip stream to append information to it.

Based on a quick think:

  • Username
  • IP Address history
  • Email address (and history)
  • Avatar (this may contain a personal image for instance)

We do not ask for address in the core I don't believe?

A significant element there I think is sustaining this over time as Flarum grows and this is applied to extensions. If it's possible to look back in a few years' time and say "ok these guys decided to include / exclude this data based on these heuristics" that would help, as opposed to folks having to parse a lot of legal debate. Looking back on the debate that led up to the Discourse legal tools plugin I see this issue now, insofar as the thinking that led up to that and the heuristics that were applied in the end are not easy to parse if you're an outsider / looking at it later.

This is why the extension should be extendable at core, insofar as it should have hooks to allow developers and web masters to build on top and hook in their own extensions for export, for example.

Scope

On the question of scope, i.e. what is "personal" data, whether usernames qualify, whether directory items qualify: @katosdev, I think your position is definitely arguable. However, the way I see the utility of this feature is to guard against a number of possible interpretations. This is a young law and there are a lot of different authorities that will be applying it. I feel it's too early to say with confidence what the character and scope of application of specific articles and principles will be. This is what I meant by the "maximalist" approach. I would err on the side of including more than necessary, with the limiting principle of countervailing rights and responsibilities, i.e. as @phenomlab pointed out re records that actually belong to someone else and would require their consent to provide.

I agree with the maximalist approach to an extent, but as has been mentioned we must be very careful to not include data that may also incorporate that of another user, which as @phenomlab so rightly said, would in itself constitute a data breach.

In terms of trying to formulate a scope heuristic, I would kick that off by formulating it as follows:

Include any record that contains a user's
- ip address; or
- name; or
- username; or
- address; or
- image; or
- user_id if the record does not contain data related to another user or 3rd party

I would agree, but as I mentioned above, I am unsure (can't honestly remember) whether we ask for an address in the core.

I would note the ip address in particular here. That turned out to be a bit of an issue in Discourse, which logs ips in quite a few different places.

@phenomlab @katosdev Please feel free to propose a different or alternate scope heuristic. Generally, I think we should try and formulate them in practical software terms. It doesn't have to be 100% infallible, and there will be edge cases.

IP addresses are a difficult subject, and a matter of very interesting debate in my day job (I am a helpdesk manager for a law firm) - we see that IP addresses are a 'gray area' in the sense that they are often shared by ISPs. As such, are these really personally identifying information? However, the counterargument is that they are used within the core for the purpose of identifying and tracking a user. I'd be interested to hear what @phenomlab has to say on this one :)

Format

On the format front, I've copied what I said about that when we discussed that re Discourse. I agree JSON is a good format. In fact I think it's arguable that a JSON API (as opposed to a download) qualifies. CSV is probably also fine. I'm not sure much will turn on JSON v CSV tbh, but I would also favour JSON.

As @phenomlab mentioned, JSON is widely accepted as it can be easily converted and imported into a wide range of solutions.

@angusmcleod
Author

This is why the extension should be extendable at core, insofar as it should have hooks to allow developers and web masters to build on top and hook in their own extensions for export, for example.

Yup, sounds good. We should be thinking long term here.

IP addresses are a difficult subject, and a matter of very interesting debate in my day job (I am a helpdesk manager for a law firm) - we see that IP addresses are a 'gray area' in the sense that they are often shared by ISPs. As such, are these really personally identifying information?

While it's an interesting question, under the 'maximalist' approach I don't think we have to resolve it as a user's own IP does not incorporate information about another user.

Ok, heuristic-wise, so far we have:

Scope

Include any record that contains that user's, and only that user's:

  • ip address; or
  • name; or
  • username; or
  • address; or
  • image; or
  • user_id

Format

JSON

@katosdev

katosdev commented Aug 9, 2020

This is why the extension should be extendable at core, insofar as it should have hooks to allow developers and web masters to build on top and hook in their own extensions for export, for example.

Yup, sounds good. We should be thinking long term here.

IP addresses are a difficult subject, and a matter of very interesting debate in my day job (I am a helpdesk manager for a law firm) - we see that IP addresses are a 'gray area' in the sense that they are often shared by ISPs. As such, are these really personally identifying information?

While it's an interesting question, under the 'maximalist' approach I don't think we have to resolve it as a user's own IP does not incorporate information about another user.

Ok, heuristic-wise, so far we have:

Scope

Include any record that contains that user's, and only that user's:

  • ip address; or
  • name; or
  • username; or
  • address; or
  • image; or
  • user_id

Format

JSON

Agreed completely :)

@askvortsov1
Sponsor Member

askvortsov1 commented Mar 9, 2021

A few comments:

  • We don't currently store an exhaustive list mapping users to IPs. In core, posts (and access tokens, starting with the latest release) are associated with IPs. Those records are also associated with the user through user_id, though.
  • For address, name, and image, I don't see how we can link data to users programmatically, especially since we don't collect name or address in core, and since avatars and images are only associated to the user via a user_id relation.

I like the idea of automatically downloading everything with a user id relation, as that will scale better to extensions. However, there should be a way of opting out for certain "administrative record" models.
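
A minimal sketch of that idea in Python, using an in-memory-friendly SQLite connection purely to keep the example self-contained (an assumption; a real forum database and its models would differ). The `EXCLUDED_TABLES` opt-out list and both helper names are hypothetical.

```python
import sqlite3

# Hypothetical opt-out list for "administrative record" tables.
EXCLUDED_TABLES = {"flags"}


def tables_with_user_id(conn):
    """Return the names of all tables that have a `user_id` column."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    found = []
    for name in names:
        # PRAGMA table_info yields one row per column; index 1 is its name.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{name}")')]
        if "user_id" in columns:
            found.append(name)
    return found


def export_user(conn, user_id, excluded=EXCLUDED_TABLES):
    """Collect every row related to `user_id`, skipping opted-out tables."""
    export = {}
    for table in tables_with_user_id(conn):
        if table in excluded:
            continue
        cursor = conn.execute(f'SELECT * FROM "{table}" WHERE user_id = ?', (user_id,))
        columns = [description[0] for description in cursor.description]
        export[table] = [dict(zip(columns, row)) for row in cursor]
    return export
```

Tables added by extensions are picked up automatically because discovery is by column name, and an administrative table is left out only by listing it explicitly.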
