[BUG] Missing metadata in read_metadata #5

Closed
ben-heil opened this issue Jul 13, 2021 · 5 comments

Comments


ben-heil commented Jul 13, 2021

I'm working on downloading metadata from recount3, and there are a number of studies for which read_metadata does not appear to be working. If I open up the files in the cache, the metadata exists, so I imagine the issue is somewhere after the file_retrieve step.

The example below uses SRP103067, but there are around 400 studies that have the same issue. (I can paste them in if you want, I just didn't want to flood the issue text).

Code

url <- recount3::locate_url(project='SRP103067', project_home='data_sources/sra', type='metadata')
metadata_files <- recount3::file_retrieve(url)
recount3::read_metadata(metadata_files)

Small reproducible example

If you copy the lines of code that lead to your error, you can then run reprex::reprex() which will create a small website with code you can then easily copy-paste here in a way that will be easy to work with later on.

url <- recount3::locate_url(project='SRP103067', project_home='data_sources/sra', type='metadata')
metadata_files <- recount3::file_retrieve(url)
#> 2021-07-13 12:21:58 caching file sra.sra.SRP103067.MD.gz.
#> 2021-07-13 12:21:59 caching file sra.recount_project.SRP103067.MD.gz.
#> 2021-07-13 12:21:59 caching file sra.recount_qc.SRP103067.MD.gz.
#> 2021-07-13 12:22:00 caching file sra.recount_seq_qc.SRP103067.MD.gz.
#> 2021-07-13 12:22:01 caching file sra.recount_pred.SRP103067.MD.gz.
recount3::read_metadata(metadata_files)
#>   [1] rail_id                                                           
#>   [2] external_id                                                       
#>   [3] study                                                             
...                       
#> [172] recount_pred.pred.type                                            
#> [173] recount_pred.curated.cell_type                                    
#> [174] recount_pred.curated.cell_line                                    
#> <0 rows> (or 0-length row.names)

Meanwhile, the cached metadata for the study shows up fine:
[image: cached metadata for SRP103067]
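
One way to narrow down which of the cached files is the odd one out is to count the data rows in each file separately. The following is only a sketch in base R: `count_rows()` is a hypothetical helper, not a recount3 function, and the temporary files are made-up stand-ins for the cached `.MD.gz` files.

```r
# Hypothetical helper (not part of recount3): count the data rows in each
# tab-separated file. Any read error (for example a zero-byte file, but also
# a missing file) is reported as 0 rows.
count_rows <- function(paths) {
    vapply(paths, function(p) {
        tryCatch(
            nrow(utils::read.delim(p, check.names = FALSE, quote = "", comment.char = "")),
            error = function(e) 0L
        )
    }, integer(1))
}

# Toy stand-ins for cached metadata files (made-up contents).
full_file <- tempfile(fileext = ".MD.gz")
con <- gzfile(full_file, "w")
writeLines(c("rail_id\tstudy", "1\tSRP_TOY"), con)
close(con)

header_only_file <- tempfile(fileext = ".MD.gz")
con <- gzfile(header_only_file, "w")
writeLines("rail_id\tstudy", con)
close(con)

zero_byte_file <- tempfile(fileext = ".MD.gz")
invisible(file.create(zero_byte_file))

counts <- count_rows(c(
    full = full_file,
    header_only = header_only_file,
    zero_byte = zero_byte_file
))
print(counts)
```

Assuming `metadata_files` from the example above holds the local paths of the cached files, `count_rows(metadata_files)` would give one row count per file.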

R session information

─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 4.1.0 (2021-05-18)
 os       Ubuntu 18.04.5 LTS          
 system   x86_64, linux-gnu           
 ui       RStudio                     
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       America/New_York            
 date     2021-07-13                  

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
 package              * version  date       lib source        
 assertthat             0.2.1    2019-03-21 [1] CRAN (R 4.1.0)
 backports              1.2.1    2020-12-09 [1] CRAN (R 4.1.0)
 Biobase                2.52.0   2021-05-19 [1] Bioconductor  
 BiocFileCache          2.0.0    2021-05-19 [1] Bioconductor  
 BiocGenerics           0.38.0   2021-05-19 [1] Bioconductor  
 BiocIO                 1.2.0    2021-05-19 [1] Bioconductor  
 BiocParallel           1.26.1   2021-07-04 [1] Bioconductor  
 Biostrings             2.60.1   2021-06-06 [1] Bioconductor  
 bit                    4.0.4    2020-08-04 [1] CRAN (R 4.1.0)
 bit64                  4.0.5    2020-08-30 [1] CRAN (R 4.1.0)
 bitops                 1.0-7    2021-04-24 [1] CRAN (R 4.1.0)
 blob                   1.2.1    2020-01-20 [1] CRAN (R 4.1.0)
 cachem                 1.0.5    2021-05-15 [1] CRAN (R 4.1.0)
 callr                  3.7.0    2021-04-20 [1] CRAN (R 4.1.0)
 cli                    3.0.0    2021-06-30 [1] CRAN (R 4.1.0)
 clipr                  0.7.1    2020-10-08 [1] CRAN (R 4.1.0)
 crayon                 1.4.1    2021-02-08 [1] CRAN (R 4.1.0)
 curl                   4.3.2    2021-06-23 [1] CRAN (R 4.1.0)
 data.table             1.14.0   2021-02-21 [1] CRAN (R 4.1.0)
 DBI                    1.1.1    2021-01-15 [1] CRAN (R 4.1.0)
 dbplyr                 2.1.1    2021-04-06 [1] CRAN (R 4.1.0)
 DelayedArray           0.18.0   2021-05-19 [1] Bioconductor  
 digest                 0.6.27   2020-10-24 [1] CRAN (R 4.1.0)
 dplyr                * 1.0.7    2021-06-18 [1] CRAN (R 4.1.0)
 ellipsis               0.3.2    2021-04-29 [1] CRAN (R 4.1.0)
 evaluate               0.14     2019-05-28 [1] CRAN (R 4.1.0)
 fansi                  0.5.0    2021-05-25 [1] CRAN (R 4.1.0)
 fastmap                1.1.0    2021-01-25 [1] CRAN (R 4.1.0)
 filelock               1.0.2    2018-10-05 [1] CRAN (R 4.1.0)
 fs                     1.5.0    2020-07-31 [1] CRAN (R 4.1.0)
 generics               0.1.0    2020-10-31 [1] CRAN (R 4.1.0)
 GenomeInfoDb           1.28.1   2021-07-01 [1] Bioconductor  
 GenomeInfoDbData       1.2.6    2021-07-06 [1] Bioconductor  
 GenomicAlignments      1.28.0   2021-05-19 [1] Bioconductor  
 GenomicRanges          1.44.0   2021-05-19 [1] Bioconductor  
 glue                   1.4.2    2020-08-27 [1] CRAN (R 4.1.0)
 highr                  0.9      2021-04-16 [1] CRAN (R 4.1.0)
 htmltools              0.5.1.1  2021-01-22 [1] CRAN (R 4.1.0)
 httr                   1.4.2    2020-07-20 [1] CRAN (R 4.1.0)
 IRanges                2.26.0   2021-05-19 [1] Bioconductor  
 knitr                  1.33     2021-04-24 [1] CRAN (R 4.1.0)
 lattice                0.20-44  2021-05-02 [4] CRAN (R 4.1.0)
 lifecycle              1.0.0    2021-02-15 [1] CRAN (R 4.1.0)
 magrittr               2.0.1    2020-11-17 [1] CRAN (R 4.1.0)
 Matrix                 1.3-4    2021-06-01 [1] CRAN (R 4.1.0)
 MatrixGenerics         1.4.0    2021-05-19 [1] Bioconductor  
 matrixStats            0.59.0   2021-06-01 [1] CRAN (R 4.1.0)
 memoise                2.0.0    2021-01-26 [1] CRAN (R 4.1.0)
 pillar                 1.6.1    2021-05-16 [1] CRAN (R 4.1.0)
 pkgconfig              2.0.3    2019-09-22 [1] CRAN (R 4.1.0)
 processx               3.5.2    2021-04-30 [1] CRAN (R 4.1.0)
 ps                     1.6.0    2021-02-28 [1] CRAN (R 4.1.0)
 purrr                  0.3.4    2020-04-17 [1] CRAN (R 4.1.0)
 R.methodsS3            1.8.1    2020-08-26 [1] CRAN (R 4.1.0)
 R.oo                   1.24.0   2020-08-26 [1] CRAN (R 4.1.0)
 R.utils                2.10.1   2020-08-26 [1] CRAN (R 4.1.0)
 R6                     2.5.0    2020-10-28 [1] CRAN (R 4.1.0)
 rappdirs               0.3.3    2021-01-31 [1] CRAN (R 4.1.0)
 Rcpp                   1.0.6    2021-01-15 [1] CRAN (R 4.1.0)
 RCurl                  1.98-1.3 2021-03-16 [1] CRAN (R 4.1.0)
 recount3               1.2.1    2021-05-25 [1] Bioconductor  
 reprex                 2.0.0    2021-04-02 [1] CRAN (R 4.1.0)
 restfulr               0.0.13   2017-08-06 [1] CRAN (R 4.1.0)
 rjson                  0.2.20   2018-06-08 [1] CRAN (R 4.1.0)
 rlang                  0.4.11   2021-04-30 [1] CRAN (R 4.1.0)
 rmarkdown              2.9      2021-06-15 [1] CRAN (R 4.1.0)
 Rsamtools              2.8.0    2021-05-19 [1] Bioconductor  
 RSQLite                2.2.7    2021-04-22 [1] CRAN (R 4.1.0)
 rstudioapi             0.13     2020-11-12 [1] CRAN (R 4.1.0)
 rtracklayer            1.52.0   2021-05-19 [1] Bioconductor  
 S4Vectors              0.30.0   2021-05-19 [1] Bioconductor  
 sessioninfo            1.1.1    2018-11-05 [1] CRAN (R 4.1.0)
 styler                 1.5.1    2021-07-13 [1] CRAN (R 4.1.0)
 SummarizedExperiment   1.22.0   2021-05-19 [1] Bioconductor  
 tibble               * 3.1.2    2021-05-16 [1] CRAN (R 4.1.0)
 tidyr                  1.1.3    2021-03-03 [1] CRAN (R 4.1.0)
 tidyselect             1.1.1    2021-04-30 [1] CRAN (R 4.1.0)
 utf8                   1.2.1    2021-03-12 [1] CRAN (R 4.1.0)
 vctrs                  0.3.8    2021-04-29 [1] CRAN (R 4.1.0)
 withr                  2.4.2    2021-04-18 [1] CRAN (R 4.1.0)
 xfun                   0.24     2021-06-15 [1] CRAN (R 4.1.0)
 XML                    3.99-0.6 2021-03-16 [1] CRAN (R 4.1.0)
 XVector                0.32.0   2021-05-19 [1] Bioconductor  
 yaml                   2.2.1    2020-02-01 [1] CRAN (R 4.1.0)
 zlibbioc               1.38.0   2021-05-19 [1] Bioconductor  

[1] ~/R/x86_64-pc-linux-gnu-library/4.1
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library
@lcolladotor
Member

Hi,

Hmmm... I'm not sure what the issue is. But thanks to your small reproducible example I can reproduce the error. I'll look into this soon.

Best,
Leo

lcolladotor self-assigned this Aug 3, 2021
lcolladotor added a commit that referenced this issue Aug 3, 2021
@lcolladotor
Member

http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/67/SRP103067/sra.recount_pred.SRP103067.MD.gz has no rows (it's empty). So I updated read_metadata() to deal with a situation like that and give a warning.
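
The zero-row result reported above is consistent with an inner join in which one of the tables is empty. The sketch below illustrates that effect and one way to guard against it, using toy data frames in base R. It is not the actual recount3 implementation: `merge_nonempty()`, the key columns used for joining, and all values are assumptions made for illustration.

```r
# Hypothetical helper (not part of recount3): inner-join a list of tables on
# shared key columns, skipping any table that has no rows and warning about it.
merge_nonempty <- function(tables, keys = c("rail_id", "external_id", "study")) {
    empty <- vapply(tables, function(x) nrow(x) == 0L, logical(1))
    if (any(empty)) {
        warning(
            "Skipping empty table(s): ",
            paste(names(tables)[empty], collapse = ", "),
            call. = FALSE
        )
    }
    Reduce(function(x, y) merge(x, y, by = keys), tables[!empty])
}

# Toy stand-ins for the per-source metadata tables (made-up values).
sra <- data.frame(
    rail_id = 1:2,
    external_id = c("SRR_A", "SRR_B"),
    study = "SRP_TOY",
    sra.attr = c("x", "y")
)
qc <- data.frame(
    rail_id = 1:2,
    external_id = c("SRR_A", "SRR_B"),
    study = "SRP_TOY",
    recount_qc.metric = c(0.9, 0.8)
)
# An empty table that still carries the key columns.
pred <- data.frame(
    rail_id = integer(0),
    external_id = character(0),
    study = character(0),
    recount_pred.pred.type = character(0)
)
tables <- list(sra = sra, recount_qc = qc, recount_pred = pred)

# A plain inner join across all tables keeps the columns but loses every row.
keys <- c("rail_id", "external_id", "study")
naive <- Reduce(function(x, y) merge(x, y, by = keys), tables)
print(dim(naive))

# Skipping the empty table keeps the rows from the other sources.
fixed <- suppressWarnings(merge_nonempty(tables))
print(dim(fixed))
```

The trade-off in this sketch is that the columns of the skipped table are dropped from the result, which is why a warning is raised rather than failing silently.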

@sjczheng @ChristopherWilks do you expect sra.recount_pred files to sometimes be empty? Or could this be a matter of re-making this file and uploading it to SciServer?

lcolladotor removed their assignment Aug 3, 2021
@lcolladotor
Member

By the way @ben-heil, recount3 v1.2.2 and 1.3.2 for bioc-release and bioc-devel, respectively, have the updated read_metadata() function implemented.
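
To check whether an installed version includes this update, the version numbers above can be compared in base R. This is a sketch only: `has_empty_table_fix()` is a hypothetical helper, and it assumes that every later release (1.4.0 onward) also carries the change.

```r
# Hypothetical helper (not part of recount3): TRUE if `version` is at least
# 1.2.2 on the 1.2.x release branch, or at least 1.3.2 otherwise.
# Assumption: all versions from 1.3.2 onward include the updated function.
has_empty_table_fix <- function(version) {
    v <- package_version(version)
    (v >= "1.2.2" && v < "1.3.0") || v >= "1.3.2"
}

# The session info in this issue lists recount3 1.2.1.
print(has_empty_table_fix("1.2.1"))
print(has_empty_table_fix("1.2.2"))
```

With recount3 installed, `has_empty_table_fix(utils::packageVersion("recount3"))` applies the same check to the local installation.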

@lcolladotor
Member

@sjczheng and @ChristopherWilks, actually the issue in #6 is also due to an empty sra.recount_pred file: http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/19/SRP102119/sra.recount_pred.SRP102119.MD.gz

@sjczheng

sjczheng commented Aug 3, 2021

Hi @lcolladotor

I checked the big pred file, and the samples of SRP102119 do exist in it.
So it seems that the pred file for that project was simply never generated.

LieberInstitute locked as resolved and limited conversation to collaborators Mar 23, 2024