
datalad project for 1000GenomesProject links, and scripts for generating DATS files #14

Merged

merged 8 commits into CONP-PCNO:master on Apr 11, 2019

Conversation

@emmetaobrien (Collaborator) commented Apr 4, 2019

This update creates a new subdataset, projects/1000GenomesProject, using DataLad. The dataset contains 24 files named '1KGP_chrNN.vcf.gz', each linking to the gzipped .vcf file of 1000 Genomes Project sequence-variation data for the corresponding chromosome, hosted at the Canadian Centre for Computational Genomics.

It also contains 4 files in the scripts sub-directory under 1000GenomesProject, used to generate the DATS-formatted JSON data uploaded on April 1, 2019, as requested by JB. These are:

1000genomes_table.txt: Text file containing the SQL CREATE statement used to generate a table for holding .vcf data. Has been used with MySQL.

1000genomes_vcf_read.pl: Perl script that retrieves a zipped .vcf file, unzips it locally, and loads the headers and a couple of summaries of the file body into the table specified above.

1000genomes_output_json.pl: Perl script that reads from that table and writes DATS-format 1KGP JSON files, as uploaded to conp-dataset/metadata/example on April 1, 2019.

1000genomes_validate_json.pl: Perl script that validates user-specified JSON files against a user-specified schema; it has been used to validate all the 1KGP JSON files against the DATS dataset_schema.json.

@glatard (Contributor) commented Apr 4, 2019

Hi @emmetaobrien, this looks great! I cloned the dataset and installed the sub-module, but the data files don't seem to have any location attached to them:

$ datalad get 1KGP_chr10.vcf.gz
[WARNING] Running get resulted in stderr output: git-annex: get: 1 failed
 
[ERROR  ] not available; No other repository is known to contain the file.; (Note that these git remotes have annex-ignore set: origin) [get(/home/glatard/conp-dataset/projects/1000GenomesProject/1KGP_chr10.vcf.gz)] 
get(error): /home/glatard/conp-dataset/projects/1000GenomesProject/1KGP_chr10.vcf.gz (file) [not available; No other repository is known to contain the file.; (Note that these git remotes have annex-ignore set: origin)]

$ git annex whereis 1KGP_chr10.vcf.gz
whereis 1KGP_chr10.vcf.gz (0 copies) failed
git-annex: whereis: 1 failed

If the files are accessible by URL, would it be possible to declare them using the following command?

git annex addurl <url> --file <local_path>

Note: this doesn't mean that anybody would be able to access the files; you would still handle permissions on your platform.
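For a set of 24 files this can be scripted. The sketch below assumes a hypothetical two-column mapping file (local path, then URL; only one placeholder entry is shown) and only echoes the commands as a dry run; drop the `echo` to run git-annex for real inside the dataset:

```shell
# Build a one-entry mapping file (placeholder content; the real mapping
# would list all 24 chromosome files and their URLs).
cat > url_mapping.txt <<'EOF'
1KGP_chr10.vcf.gz https://datahub-khvul4ng.udes.genap.ca/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
EOF

# Dry run: echo each addurl command instead of executing it.
while read -r path url; do
    echo git annex addurl "$url" --file "$path"
done < url_mapping.txt > addurl_commands.txt
```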

glatard (Contributor) left a comment

Please add file locations to your sub-module if available.

@emmetaobrien (Collaborator, Author) commented Apr 4, 2019

I'm not at all sure why the added locations aren't working; I will try again.

@emmetaobrien (Collaborator, Author) commented Apr 5, 2019

I have now corrected the file locations and confirmed they are present in projects/1000GenomesProject/.git/annex.

@paiva paiva added the New Dataset label Apr 5, 2019
@paiva paiva self-requested a review Apr 5, 2019
@glatard (Contributor) commented Apr 8, 2019

Hi @emmetaobrien, thanks!
Somehow DataLad still can't transfer your files:

(datalad) [glatard@sapajou 1000GenomesProject]$ git annex whereis 1KGP_chrY.vcf.gz
whereis 1KGP_chrY.vcf.gz (1 copy) 
  	00000000-0000-0000-0000-000000000001 -- web

  web: https://datahub-khvul4ng.udes.genap.ca/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz
ok
(datalad) [glatard@sapajou 1000GenomesProject]$ git annex get --verbose 1KGP_chrY.vcf.gz
get 1KGP_chrY.vcf.gz (from web...) 

download failed: InvalidHeader "preload"

  Unable to access these remotes: web

  Try making some of these repositories available:
  	00000000-0000-0000-0000-000000000001 -- web
failed
git-annex: get: 1 failed

I'm still not sure what's going on here; this needs to be investigated.

@emmetaobrien (Collaborator, Author) commented Apr 9, 2019

glatard: I have looked into the error message you are getting. It appears to relate to git-annex, by default, expecting write access to the remote directory in order to generate checksums. Since obtaining write access does not seem like a viable general assumption for resources we might want to add, I have regenerated the links using the "--relaxed" flag to "git annex addurl", which is the suggested usage for avoiding this problem when there is no expectation of the dataset changing mid-download. Can you confirm whether these datasets now download successfully at your end?
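One plausible command sequence for that regeneration, sketched as a dry run (the exact rmurl/addurl pair is an assumption about the steps taken, and the file name and URL are taken from the whereis output earlier in this thread; drop the `echo` prefixes to execute inside the dataset):

```shell
# File name and URL as recorded by "git annex whereis" in this thread.
f=1KGP_chrY.vcf.gz
url=https://datahub-khvul4ng.udes.genap.ca/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz

# Dry run: remove the old checksummed URL, then re-add it with --relaxed
# so git-annex records only the URL, not a size/checksum to verify.
{
    echo git annex rmurl "$f" "$url"
    echo git annex addurl --relaxed "$url" --file "$f"
} > relaxed_commands.txt
```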

@glatard (Contributor) commented Apr 9, 2019

I still have the same error:

$ datalad get -v 1KGP_chr10.vcf.gz 
[WARNING] Running get resulted in stderr output: download failed: InvalidHeader "preload"
git-annex: get: 1 failed
 
[ERROR  ] from web...; Unable to access these remotes: web; Try making some of these repositories available:; 	00000000-0000-0000-0000-000000000001 -- web [get(/tmp/conp-dataset/projects/1000GenomesProject/1KGP_chr10.vcf.gz)] 
get(error): /tmp/conp-dataset/projects/1000GenomesProject/1KGP_chr10.vcf.gz (file) [from web...; Unable to access these remotes: web; Try making some of these repositories available:; 	00000000-0000-0000-0000-000000000001 -- web]

@yarikoptic would you have any idea of what's going wrong here?
Here's a summary of the issue:

  1. A file in a DataLad repo is available on the Web:
$ git annex whereis 1KGP_chr10.vcf.gz
whereis 1KGP_chr10.vcf.gz (1 copy) 
  	00000000-0000-0000-0000-000000000001 -- web

  web: https://datahub-khvul4ng.udes.genap.ca/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
ok
  2. Its URL is accessible:
$ wget https://datahub-khvul4ng.udes.genap.ca/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
--2019-04-09 13:28:46--  https://datahub-khvul4ng.udes.genap.ca/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
Resolving datahub-khvul4ng.udes.genap.ca (datahub-khvul4ng.udes.genap.ca)... 204.19.23.238
Connecting to datahub-khvul4ng.udes.genap.ca (datahub-khvul4ng.udes.genap.ca)|204.19.23.238|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 773788987 (738M) [application/octet-stream]
Saving to: ‘ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz’
...
  3. But git-annex can't get it:
$ git annex get -d 1KGP_chr10.vcf.gz 
[2019-04-09 13:29:46.072656158] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","ls-files","--cached","-z","--","1KGP_chr10.vcf.gz"]
get 1KGP_chr10.vcf.gz [2019-04-09 13:29:46.074744841] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","git-annex"]
[2019-04-09 13:29:46.076743336] process done ExitSuccess
[2019-04-09 13:29:46.076885364] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","--hash","refs/heads/git-annex"]
[2019-04-09 13:29:46.079161299] process done ExitSuccess
[2019-04-09 13:29:46.07942225] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","log","refs/heads/git-annex..e53a52c2d35c9f0787590278555d5b28022f1e80","--pretty=%H","-n1"]
[2019-04-09 13:29:46.081188993] process done ExitSuccess
[2019-04-09 13:29:46.081516782] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch"]
[2019-04-09 13:29:46.081922629] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)"]
(from web...) 

[2019-04-09 13:29:46.091792828] Request {
  host                 = "datahub-khvul4ng.udes.genap.ca"
  port                 = 443
  secure               = True
  requestHeaders       = [("Accept-Encoding","identity"),("User-Agent","git-annex/6.20181011")]
  path                 = "/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
  queryString          = ""
  method               = "GET"
  proxy                = Nothing
  rawBody              = False
  redirectCount        = 10
  responseTimeout      = ResponseTimeoutDefault
  requestVersion       = HTTP/1.1
}

download failed: InvalidHeader "preload"

  Unable to access these remotes: web

  Try making some of these repositories available:
  	00000000-0000-0000-0000-000000000001 -- web
failed
[2019-04-09 13:29:46.258717267] process done ExitSuccess
[2019-04-09 13:29:46.258937987] process done ExitSuccess
git-annex: get: 1 failed
@yarikoptic commented Apr 10, 2019

Smells like some shortcoming in the Haskell library used by git-annex. Filed a report for @joeyh at http://git-annex.branchable.com/bugs/Unable_to_get__47__addurl_to_http_link__58___download_failed__58___InvalidHeader___34__preload__34__/

@kyleam commented Apr 10, 2019

Smells like some shortcoming in haskell library used by git-annex.

Because curl and wget work, I suppose it should be considered a shortcoming of http-conduit. But does anyone know what that preload bit in the header is for?

$ curl --head https://datahub-khvul4ng.udes.genap.ca/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz
HTTP/1.1 200 OK                                     
Server: openresty/1.7.10.1
Date: Wed, 10 Apr 2019 14:11:22 GMT
Content-Type: application/octet-stream
Content-Length: 5719126
Connection: keep-alive
Last-Modified: Mon, 11 Mar 2019 16:37:52 GMT
ETag: "5c868ee0-574456"
Accept-Ranges: bytes
Strict-Transport-Security: max-age=15768000; includeSubdomains;
preload
X-Frame-Options: DENY
X-Content-Type-Options: nosniff

This is not a good or safe solution, but it might be worth noting for temporary testing: you can force git-annex to use curl by setting annex.web-options to any curl option and disabling the annex.security.allowed-http-addresses safeguard:

# give some innocuous option so annex uses curl
$ git config annex.web-options --progress-bar
$ git config annex.security.allowed-http-addresses all
$ git annex get 1KGP_chrY.vcf.gz
get 1KGP_chrY.vcf.gz (from web...) 
ok                             
(recording state in git...)
@kyleam commented Apr 10, 2019

But does anyone know what that preload bit in the header is for?

Ah, it should be part of the Strict-Transport-Security value, so I think http-conduit may be right to complain that this is a malformed header.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Strict-Transport-Security
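Assuming the server (openresty/nginx, per the `Server:` header above) sets HSTS via the `add_header` directive, the likely fix is quoting the value so `preload` stays inside it. This is a hypothetical config sketch, not the site's actual configuration:

```nginx
# Hypothetical fix: quoting keeps the whole HSTS value in one header,
# so "preload" cannot end up emitted as a bogus separate header line.
add_header Strict-Transport-Security "max-age=15768000; includeSubdomains; preload";
```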

@joeyh commented Apr 10, 2019

@emmetaobrien (Collaborator, Author) commented Apr 10, 2019

I have contacted the provider of this dataset (David Bujold at the Canadian Centre for Computational Genomics) and alerted him to this problem; I shall post an update here when I have feedback.

@glatard (Contributor) commented Apr 11, 2019

Thanks @yarikoptic @kyleam and @joeyh for all the help!
I think we can merge the PR now, as the fix is required on the Web server side and not in the dataset.

@glatard glatard merged commit 5f74e35 into CONP-PCNO:master Apr 11, 2019
@joeyh commented Apr 11, 2019
