Data preservation #15
Comments
thanks, and expanded to short term storage while working with the data too. |
Actually, I think the discussion drifted off a bit too much into the direction of data archiving. What I would be interested in is an overview of how to prevent data corruption between data collection and data archiving. A lot of researchers temporarily store their data locally before uploading it to some kind of data archive. This is a vulnerability that could lead to corrupt data being uploaded to a data archive. For example, the NLeSC dementia project will collect data, store it on an external hard disk, and may inspect it with the Elan software to check its integrity. Both the software and the hard disk could hypothetically influence the integrity of the data. I know similar examples from other researchers I interacted with. In one case the file headers were corrupted, most likely because they were checking the data integrity with Notepad and accidentally saved the file before closing it. So, a checklist for data preservation may need to include:
|
In this sense, you could also compute a checksum of your data when it is collected, and again when it is processed, to ensure nothing has changed. I wonder though if this is not too prescriptive? Maybe I would not make it a checklist of steps you must take, but more of a tips-and-tricks summary of tools / practices you could use if there are concerns about data integrity. |
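A minimal sketch of the checksum idea in Python, assuming SHA-256 from the standard `hashlib` module. The helper names `sha256_of_file` and `is_unchanged` are hypothetical and only serve as illustration:

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_digest):
    """Compare the current digest of a file with the digest recorded at collection time."""
    return sha256_of_file(path) == recorded_digest
```

You would call `sha256_of_file` once right after data collection, write the result down somewhere safe, and call `is_unchanged` before processing or uploading. This only detects changes; it cannot repair them.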
thanks Carlos, expanding my list of 'tips' on how to avoid getting your data corrupted before it arrives in secure storage location:
Next, it may also be good to add some tips for secure certified data storage via Surfsara/Dans:
|
On storing checksums -- yes, you need to store them somewhere. But usually they are tiny, so they can be provided along with the data. In fact, some Linux distributions provide the checksum of the ISO image so you can check your image when you download it. Some links which might be nice to include in the list of tips: Just as an idea, would it be nice to write a blog post about data integrity, with a kind of 'story' of why, what and how you should handle your data?
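As a rough illustration of shipping checksums along with the data, the sketch below writes one manifest file next to the data and checks it again later. The function names and the manifest name `SHA256SUMS` are assumptions made for this example. Each line is written as digest, two spaces, relative path, which is the layout the GNU `sha256sum` tool uses:

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir, manifest_name="SHA256SUMS"):
    """Write '<digest>  <relative path>' lines for every file below data_dir."""
    data_dir = Path(data_dir)
    manifest = data_dir / manifest_name
    lines = []
    for path in sorted(data_dir.rglob("*")):
        # Skip directories and the manifest itself
        if path.is_file() and path != manifest:
            relative = path.relative_to(data_dir).as_posix()
            lines.append(f"{file_digest(path)}  {relative}\n")
    manifest.write_text("".join(lines), encoding="utf-8")
    return manifest


def verify_manifest(data_dir, manifest_name="SHA256SUMS"):
    """Return the relative paths that are missing or whose digest no longer matches."""
    data_dir = Path(data_dir)
    mismatches = []
    text = (data_dir / manifest_name).read_text(encoding="utf-8")
    for line in text.splitlines():
        recorded, relative = line.split("  ", 1)
        target = data_dir / relative
        if not target.is_file() or file_digest(target) != recorded:
            mismatches.append(relative)
    return mismatches
```

An empty list from `verify_manifest` means every listed file still matches. Files added after the manifest was written are not reported, so this is a sketch rather than a complete integrity tool.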
|
Can checksums be stored in the filename? This was common practice for large video files for a while.. ('90s). |
Hi,
Storing the checksum in the filename is not common practice anymore. Also the short md5sums/sha1 are not considered safe anymore, so you would end up with a gigantic filename going for sha256 or sha512.
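To give a feeling for the filename lengths involved, here is a small sketch using Python's standard `hashlib`. The payload is arbitrary, because the digest length depends only on the algorithm:

```python
import hashlib

payload = b"example data"  # arbitrary input; digest length does not depend on it

# Number of hexadecimal characters each algorithm would add to a filename
hex_lengths = {
    name: len(hashlib.new(name, payload).hexdigest())
    for name in ("md5", "sha1", "sha256", "sha512")
}

for name, length in hex_lengths.items():
    print(f"{name}: {length} hex characters")
```

MD5 and SHA-1 digests are 32 and 40 hexadecimal characters; SHA-256 and SHA-512 digests are 64 and 128.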
Ideally you would also want to sign the archives (which would require us to set up a 'ring of trust'). Only an archive that has been both GPG-signed and checksummed gives you assurance that it has not been altered.
I can provide more details if needed.
Ronald
|
in addition to point 4, mention: https://www.surf.nl/en/services-and-products/surfdrive/surfdrive.html |
Links of interest: We will add a section to the guide about this topic. |
Let's try to make this action point a bit more specific:
|
1 - I think Option B would make more sense |
First crude sketch of the paragraph; I will continue editing this.
Question: Do we need to make a distinction between database and file storage?
4.2 Data storage and preservation
We strongly advise storing your research data in a secure location where regular back-ups of the data are made, before you start working with the data. If it is logistically impossible to store the data in a secure location immediately after data collection, then here are some tips on how to improve data preservation in the time window between data collection and data arrival at a secure location. For example, you collect data on humans in an environment without a (secure) internet connection and need to temporarily store your data offline on a laptop before being able to upload it to a data archive.
4.2.1 Tips for short term storage
Checksum and sign your data archive:
Only an archive that has been both GPG-signed and checksummed gives you assurance that it has not been altered.
File permissions and location:
Specific remarks on human data:
4.2.2 Tips for long term storage
For long term storage we advise researchers based in the Netherlands to explore the services of SURFsara (website), the collaborative organization for ICT in Dutch education and research, including but not limited to:
For researchers outside the Netherlands, alternative data storage platforms include:
Data storage certificates:
...TO BE IMPROVED WITH MORE INFO ON DATA STORAGE VOLUME, FAIR COMPLIANCE, AND COSTS |
link to surfdrive is broken for me https://www.surf.nl/en/service-and-products/surfdrive/surfdrive.html |
Also, a lot of data formats allow storing the checksum in the file, i.e. the metadata part contains the checksum of the data part. Examples are netCDF and FITS. |
Specific remarks on storing human data: Don't do that ;)
|
1 and 2. Note that this text also needs to account for research where storing person-identifiable data is unavoidable, as in some branches of medical research. So, we cannot state that person-identifiable data cannot be stored and needs to be deleted at the end of a project. That is just not realistic and would cause a huge loss of capital investment. Instead, we may need to combine discouraging storage where possible with providing guidance where storage is essential.
|
@vincentvanhees when dealing with personal data we will always and completely follow the GDPR. Period. If that means some research becomes impossible, that cannot be helped. We will not in any way help, facilitate, or advise people or projects that wilfully go against the GDPR. Also note that these are not 'recommendations' based on the GDPR; they are hard (legal) requirements.
In your case, option 2 would apply. This involves partners that routinely deal with personal data and have all facilities set up for handling and storing it. Any changes to their policy should be made in discussion with 'Data protection officers' etc. and should never be decided on the basis of our guide alone.
For your point 3, you are confusing things. Broader than GDPR means not personal data. There is no personal data where the GDPR does not apply. The requirement for consent etc. is, as far as I know, now defined in the GDPR (AVG). Again, what you describe falls under my point 2.
On your last point, I'm not sure I can realistically define cases where research ethics would be less strict than the GDPR. It is formulated very broadly, and applies automatically in cases where there is, or can be, doubt about it applying ;) |
@vincentvanhees To add, if by broader you mean data that has other usage restrictions (licenses, contractual, ...), then there is (should be) a contract defining what we can and must do. |
So does that mean that epidemiology as a field is now extinct? My grandfather would be turning in his grave...
Lourens
|
@LourensVeen i hope not! you can work with personal data, but you should not depend on a page on the internets to prevent issues with data privacy ;) |
@jiskattema I am not suggesting to violate the law. The GDPR provides freedom for storing personal data within a research context: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5243137/. Also mentioned in this blog post. I think it is important that we provide guidance in the guide, or at least have internal expertise on this issue, because we as NLeSC need to be able to work on personal research data within the limits of the GDPR, just like hundreds of research groups around NL need to. Where I wrote "we cannot state that person identifiable data cannot be stored and needs to be deleted at the end of a project. That is just not realistic ...", I meant that if we make such statements then we also need to clarify under what conditions personal data can be stored (within GDPR limits of course). |
Taking a step back from this discussion: linking to the paper you cited, and noting that if you are working with personal data you really should get expert advice, seems the best option to me. Especially because we are dealing with a default prohibition, with exemptions only under very specific conditions, involving trained privacy experts. To illustrate how hard it is to write something useful and valid, let's look at your three remarks: If person identifiable information needs to be stored as part of the dataset then make sure the data and data carrier (e.g. hard-drive) are encrypted and the storage procedure complies with a data management plan approved by an ethics committee. For all human data make sure that only data is stored for which consent was given by the participant or their guardian following the protocol approved by an ethics committee. |
Thanks Jisk, I agree. In the data-sig meeting a month ago we agreed that I would sketch a draft for this paragraph and that the sig as a whole would then help to optimize. I am still on a learning curve for most of these topics, so it is great to have your input. |
@vincentvanhees i know i am not qualified to work as a data protection officer ;) |
@vincentvanhees, @jiskattema -- I've created a PR to add this section to the guide. Do you have any concrete suggestions on what/how we should update this section before merging it into the guide? |
How about we try to sit down with two or three people to go over it and make some decisions about what to leave out and what to improve? I missed Monday's data-sig meeting because of the last minute talk by Aletta which messed up my timetable, otherwise we could have done this as part of the sig. |
Sitting down with two/three people sounds like a good idea to make a first proposal -- afterwards the rest of the sig can add comments/make suggestions as required (nothing we put in the guide is set in stone anyway). Could I let you and @jiskattema do this first proposal? |
See NLeSC/guide#135 |
As suggested by @vincentvanhees
What solutions are available for long term storage of data? Can we make an inventory of archive options (and add this to the guide)?