Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in [.data.table(x, r, vars, with = FALSE) : column(s) not found: prob #74

Closed
AMCalejandro opened this issue Feb 2, 2022 · 9 comments
Assignees

Comments

@AMCalejandro
Copy link
Contributor

AMCalejandro commented Feb 2, 2022

This issue is a continuation somehow #72
I am showing an example in which we have snps with a prob_col value higher than the threshold but echolocatoR fails in assigning CS and PP to SNPs

acarrasco@kronos:/mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/FINEMAP$ cat ../../../../../nohup.out 
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
Bioconductor version '3.12' is out-of-date; the current release version '3.14'
  is available with R version '4.1'; see https://bioconductor.org/install
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.5
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: 'data.table'

The following objects are masked from 'package:dplyr':

    between, first, last

The following object is masked from 'package:purrr':

    transpose

    Locus   Gene CHR     POS       SNP         P Effect
1: MAD1L1 MAD1L1   7 2153071 rs4721411 1.657e-07 0.5318
[1] "+ CONDA:: Activating conda env 'echoR'"
[1] "Checking for tabix installation..."
[1] "Checking for bcftools installation..."

)   )  ) ))))))}}}}}}}} {{{{{{{{{(((((( (  (   (
MAD1L1 (1 / 1)
)   )  ) ))))))}}}}}}}} {{{{{{{{{(((((( (  (   (
[1] "Determining chrom type from file header"
[1] "LD:: Standardizing summary statistics subset."
[1] "++ Preparing Gene col"
[1] "++ Preparing A1,A1 cols"
[1] "++ Preparing MAF,Freq cols"
[1] "++ Inferring MAF from frequency column..."
[1] "++ Preparing N_cases,N_controls cols"
[1] "++ Preparing `proportion_cases` col"
[1] "++ 'proportion_cases' not included in data subset."
[1] "++ Preparing N col"
[1] "+ Preparing sample_size (N) column"
[1] "++ Using `sample_size = 3572` TRUE"
[1] "++ Preparing t-stat col"
[1] "+ Calculating t-statistic from Effect and StdErr..."
[1] "++ Assigning lead SNP"
[1] "++ Ensuring Effect, StdErr, P are numeric"
[1] "++ Ensuring 1 SNP per row"
[1] "++ Removing extra whitespace"
[1] "++ Saving subset ==> /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/MAD1L1_earlymotorPD_axial_subset.tsv.gz"
[1] "LD:: Using UK Biobank LD reference panel."
[1] "+ UKB LD file name: chr7_1000001_4000001"
[1] "+ LD:: Downloading full .gz/.npz UKB files and saving to disk."
[1] "+ Overwriting pre-existing file."
[1] "+ CONDA:: Identified axel executable in echoR env."
[1] "+ Overwriting pre-existing file."
[1] "+ CONDA:: Identified axel executable in echoR env."
[1] "+ LD:: load_ld() python function input: /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/LD/chr7_1000001_4000001"
[1] "+ LD:: Reading LD matrix into memory. This could take some time..."
Processed URL: /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/LD/chr7_1000001_4000001
Some other message at the end
[1] "+ Full UKB LD matrix: 30690 x 30690"
[1] "+ Full UKB LD SNP data.table: 30690 x 5"
[1] "+ LD:: Saving LD matrix ==> /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/LD/MAD1L1.UKB_LD.RDS"
[1] "5872 x 5872 LD_matrix (sparse)"
vvvvv-- ABF --vvvvv
✅ All required columns present.
✅ All suggested columns present.
vvvvv-- FINEMAP --vvvvv
✅ All required columns present.
✅ All suggested columns present.
vvvvv-- SUSIE --vvvvv
✅ All required columns present.
✅ All suggested columns present.
vvvvv-- POLYFUN_SUSIE --vvvvv
✅ All required columns present.
✅ All suggested columns present.

+++ Multi-finemap:: ABF +++

+++ Multi-finemap:: FINEMAP +++
[1] "++ FINEMAP:: Constructing master file."
[1] "++ FINEMAP:: Constructing data.z file."
[1] "++ FINEMAP:: Constructing data.ld file."
[1] "+ Using FINEMAP v1.4"

|--------------------------------------|
| Welcome to FINEMAP v1.4              |
|                                      |
| (c) 2015-2020 University of Helsinki |
|                                      |
| Help :                               |
| - ./finemap --help                   |
| - www.finemap.me                     |
| - www.christianbenner.com            |
|                                      |
| Contact :                            |
| - christian.benner@helsinki.fi       |
| - matti.pirinen@helsinki.fi          |
|--------------------------------------|

--------
SETTINGS
--------
- dataset            : all
- corr-config        : 0.95
- n-causal-snps      : 5
- n-configs-top      : 50000
- n-conv-sss         : 100
- n-iter             : 100000
- n-threads          : 1
- prior-k0           : 0
- prior-std          : 0.05 
- prob-conv-sss-tol  : 0.001
- prob-cred-set      : 0.95

------------
FINE-MAPPING (1/1)
------------
- GWAS summary stats               : FINEMAP/data.z
- SNP correlations                 : FINEMAP/data.ld
- Causal SNP stats                 : FINEMAP/data.snp
- Causal configurations            : FINEMAP/data.config
- Credible sets                    : FINEMAP/data.cred
- Log file                         : FINEMAP/data.log_sss
- Reading input                    : done!   

- Number of GWAS samples           : 3572
- Number of SNPs                   : 5868
- Prior-Pr(# of causal SNPs is k)  : 
  (0 -> 0)
   1 -> 0.583
   2 -> 0.291
   3 -> 0.0971
   4 -> 0.0243
   5 -> 0.00485
- 154411 configurations evaluated (0.113/100%) : converged after 113 iterations
- Computing causal SNP statistics  : done!   
- Regional SNP heritability        : 0.124 (SD: 0.0667 ; 95% CI: [0.0383,0.293])
- Log10-BF of >= one causal SNP    : 169
- Post-expected # of causal SNPs   : 5
- Post-Pr(# of causal SNPs is k)   : 
  (0 -> 0)
   1 -> 0
   2 -> 0
   3 -> 0
   4 -> 4.67e-19
   5 -> 1
- Computing credible sets          : done!
- Writing output                   : done!   
- Run time                         : 0 hours, 1 minutes, 2 seconds
[1] ".cred not detected. Using .snp instead."
[1] "+ FINEMAP:: Importing prob (.snp)..."
Error in `[.data.table`(x, r, vars, with = FALSE) : 
  column(s) not found: prob
In addition: Warning message:
In sdY.est(d$varbeta, d$MAF, d$N) :
  estimating sdY from maf and varbeta, please directly supply sdY if known

+++ Multi-finemap:: SUSIE +++
For large R or large XtX, consider installing the  Rfast package for better performance.

+++ Multi-finemap:: POLYFUN_SUSIE +++
[1] "PolyFun:: Preparing SNP input file..."
[1] "+ PolyFun:: 5868 SNPs identified."
[1] "+ PolyFun:: Writing SNP file ==> /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/PolyFun/snps_to_finemap.txt.gz"
[1] "/home/acarrasco/.conda/envs/echoR/bin/python /home/acarrasco/.conda/envs/echoR/lib/R/library/echolocatoR/tools/polyfun//extract_snpvar.py --snps /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/PolyFun/snps_to_finemap.txt.gz --out /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/PolyFun/snps_with_priors.snpvar.tsv.gz"
[INFO]  Writing output file to /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/PolyFun/snps_with_priors.snpvar.tsv.gz
[1] "++ Remove tmp file."
[1] "+ SUSIE:: sample_size= 3572"
[1] "+ SUSIE:: max_causal = 5"
[1] "+ SUSIE:: Utilizing prior_weights for 5868 SNPs."
[1] "+ SUSIE:: Rescaling priors"
[1] "+ Subsetting LD matrix and finemap_dat to common SNPs..."
[1] "+ LD:: Removing unnamed rows/cols"
[1] "+ LD:: Replacing NAs with 0"
[1] "+ LD_matrix = 5868 SNPs."
[1] "+ finemap_dat = 5868 SNPs."
[1] "+ 5868 SNPs in common."
[1] "+ SUSIE:: Using `susie_suff_stat()` from susieR v0.11.92"
For large R or large XtX, consider installing the  Rfast package for better performance.
[1] "+ SUSIE:: Extracting Credible Sets..."
Time difference of 8.18 mins
+-------- Locus Plot: MAD1L1--------+
[1] "+ Filling NAs in CS cols with 0"
[1] "+ Filling NAs in PP cols with 0"
[1] "+ LD:: LD_matrix detected. Coloring SNPs by LD with lead SNP."
max_transcripts=1. 
37 transcripts from 37 genes returned.
Fetching data...OK
Parsing exons...OK
Defining introns...OK
Defining UTRs...OK
Defining CDS...OK
aggregating...
Done
Constructing graphics...
[1] "+ NOTT_2019:: Importing... [1] exvivo_H3K27ac_tbp"
[1] "+ NOTT_2019:: Importing... [2] microglia_H3K27ac"
[1] "+ NOTT_2019:: Importing... [3] neurons_H3K27ac"
[1] "+ NOTT_2019:: Importing... [4] oligodendrocytes_H3K27ac"
[1] "+ NOTT_2019:: Importing... [5] astrocytes_H3K27ac"
[1] "+ NOTT_2019:: Importing... [6] exvivo_atac_tbp"
[1] "+ NOTT_2019:: Importing... [7] microglia_atac"
[1] "+ NOTT_2019:: Importing... [8] neurons_atac"
[1] "+ NOTT_2019:: Importing... [9] oligodendrocytes_atac"
[1] "+ NOTT_2019:: Importing... [10] astrocytes_atac"
[1] "+ NOTT_2019:: Importing... [11] microglia_H3K4me3"
[1] "+ NOTT_2019:: Importing... [12] neurons_H3K4me3"
[1] "+ NOTT_2019:: Importing... [13] oligodendrocytes_H3K4me3"
[1] "+ NOTT_2019:: Importing... [14] astrocytes_H3K4me3"
[1] "+ Inferring genomic limits for window: 1x"
3164573 query SNP(s) detected with reference overlap.
[1] "+ PLOT:: Calculating max histogram height"
[1] "+ NOTT_2019:: Getting interactome data."
[1] "Importing Astrocyte enhancers ..."
[1] "Importing Astrocyte promoters ..."
[1] "Importing Neuronal enhancers ..."
[1] "Importing Neuronal promoters ..."
[1] "Importing Oligo enhancers ..."
[1] "Importing Oligo promoters ..."
[1] "Importing Microglia enhancers ..."
[1] "Importing Microglia promoters ..."
[1] "Importing Microglia interactome ..."
[1] "Importing Neuronal interactome ..."
[1] "Importing Oligo interactome ..."
2755 query SNP(s) detected with reference overlap.
2959 query SNP(s) detected with reference overlap.
[1] "++ NOTT_2019:: Returning PLAC-seq track."
+>+>+>+>+ plot.zoom = all +<+<+<+<+
[1] "+ PLOT:: Get window suffix..."
[1] "+ Ensuring last track shows genomic units..."
[1] "+ PLOT:: Saving plot ==> /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/multiview.MAD1L1.UKB.1x.jpg"
+>+>+>+>+ plot.zoom = 4x +<+<+<+<+
[1] "+ PLOT:: Get window suffix..."
[1] "+ Constructing zoom polygon..."
[1] "+ Highlighting zoom origin..."
[1] "+ Ensuring last track shows genomic units..."
[1] "+ PLOT:: Saving plot ==> /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/multiview.MAD1L1.UKB.4x.jpg"
+>+>+>+>+ plot.zoom = 10x +<+<+<+<+
[1] "+ PLOT:: Get window suffix..."
[1] "+ Constructing zoom polygon..."
[1] "+ Highlighting zoom origin..."
[1] "+ Ensuring last track shows genomic units..."
[1] "+ PLOT:: Saving plot ==> /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/multiview.MAD1L1.UKB.10x.jpg"
  
Fine-mapping complete in:
Time difference of 1.4 hours
There were 50 or more warnings (use warnings() to see the first 50)
/mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/LD/chr7_1000001_4000001.gz
/mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/LD/chr7_1000001_4000001.npz

We see on the error message how "prob" column cannot fe found

Even though the cred5 file is present, and the tool should have used FINEMAP.import_data.cred(),
I am going through

  FINEMAP.import_data.snp <- function(locus_dir,
                                      credset_thresh=.95,
                                      prob_col="prob",
                                      verbose=T){
    # NOTES:
    ## .snp files: Posterior probabilities in this file are the marginal posterior probability
    ## that a given variant is causal.

    # Prob column descriptions:
    ## prob: column the marginal Posterior Inclusion Probabilities (PIP). The PIP for the l-th SNP is the posterior probability that this SNP is causal.
    ## prob_group: the posterior probability that there is at least one causal signal among SNPs in the same group with this SNP.
    ##
    printer("+ FINEMAP:: Importing",prob_col,"(.snp)...", v=verbose)
    data.snp <- data.table::fread(file.path(locus_dir,"FINEMAP/data.snp"), nThread = 1)
    data.snp <- data.snp[data.snp[[prob_col]] > credset_thresh,] %>%
      plyr::mutate(CS=1)%>%
      dplyr::rename(PP=dplyr::all_of(prob_col))
    return(data.snp)
  }

When I run this manually, we see how the prob column is present, and how the code works

r$> locus_dir                                                                                      
[1] "/mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/"
r$> data.table::fread(file.path(locus_dir,"FINEMAP/data.snp"), nThread = 1)                        
      index       rsid chromosome position allele1 allele2     maf    beta     se        z
   1:  2535 rs10479762          7  2045351       T       C 0.02180  0.4526 0.1002  4.51697
   2:  3022  rs3800908          7  2159437       T       C 0.02300  0.4925 0.1005  4.90050
   3:  2597 rs11764212          7  2067593       A       C 0.02380  0.5244 0.1022  5.13112
   4:  3027  rs3778978          7  2159817       A       G 0.02760 -0.5088 0.1019 -4.99313
   5:  2495 rs11765549          7  2027311       T       G 0.02270  0.5291 0.1037  5.10222
  ---                                                                                     
5864:  2941  rs4719432          7  2140330       A       G 0.02550 -0.5066 0.1004 -5.04582
5865:  2925  rs3778965          7  2138296       A       G 0.02545  0.4901 0.1019  4.80962
5866:  2947  rs4719436          7  2141239       A       G 0.02495  0.4780 0.1016  4.70472
5867:  2547 rs13227554          7  2048220       C       G 0.02210 -0.4523 0.1002 -4.51397
5868:  2923  rs3778964          7  2138109       T       C 0.02560  0.4868 0.1020  4.77255
          prob  log10bf      mean       sd mean_incl  sd_incl
   1: 1.000000 13.43200 -0.310494 0.534601 -0.310494 0.534601
   2: 0.999531  6.89911 -0.301709 0.260602 -0.301850 0.260582
   3: 0.997378  6.15053 -0.307319 0.532777 -0.308127 0.533244
   4: 0.815520  4.21586 -0.232915 0.444014 -0.285603 0.476128
   5: 0.747752  4.04231 -0.262132 0.497410 -0.350560 0.547614
  ---                                                        
5864: 0.000000     -Inf  0.000000 0.000000  0.000000 0.000000
5865: 0.000000     -Inf  0.000000 0.000000  0.000000 0.000000
5866: 0.000000     -Inf  0.000000 0.000000  0.000000 0.000000
5867: 0.000000     -Inf  0.000000 0.000000  0.000000 0.000000
5868: 0.000000     -Inf  0.000000 0.000000  0.000000 0.000000
r$> data.snp[data.snp[[prob_col]] > credset_thresh, ] %>% 
          plyr::mutate(CS=1) %>% 
          dplyr::rename(PP=dplyr::all_of(prob_col)) -> data.snp                                    

r$> data.snp                                                                                       
   index       rsid chromosome position allele1 allele2    maf   beta     se       z       PP
1:  2535 rs10479762          7  2045351       T       C 0.0218 0.4526 0.1002 4.51697 1.000000
2:  3022  rs3800908          7  2159437       T       C 0.0230 0.4925 0.1005 4.90050 0.999531
3:  2597 rs11764212          7  2067593       A       C 0.0238 0.5244 0.1022 5.13112 0.997378
    log10bf      mean       sd mean_incl  sd_incl CS
1: 13.43200 -0.310494 0.534601 -0.310494 0.534601  1
2:  6.89911 -0.301709 0.260602 -0.301850 0.260582  1
3:  6.15053 -0.307319 0.532777 -0.308127 0.533244  1

@mkoromina
Copy link

Hi, I also receive the same error message even when trying to run the vignette. Did you manage to find a workaround this issue?

@bschilder
Copy link
Member

bschilder commented Feb 4, 2022

Hi @AMCalejandro and @mkoromina, thanks for bringing this to my attention. I haven't had much time to work on this project in a while but I will try to get back to it soon (possibly this weekend). In the meantime, PRs are more than welcome!

I should also note that I'm working on a long-term project to modularize echolocatoR into different subpackages (with proper unit tests) to help minimize errors.

Thank you for your patience.

@mkoromina
Copy link

Hi @bschilder, may I also note to this end, that apart from the above mentioned error message ([Error in [.data.table(x, r, vars, with = FALSE) : column(s) not found: SNP]), I also get this one "Error in cDict[[chrom_col]] : subscript out of bounds". However, stats have been munged beforehand (.parquet format). Any advice as why this error message pops up? Thank you very much in advance.

@bschilder
Copy link
Member

Ok, so i think I've fixed this issue with reading in .cred files, as well as with reading in .snp files (when .cred is not available).
See here:
#72 (comment)

@bschilder
Copy link
Member

@mkoromina are you trying to feed in a parquet file to echolocatoR? It currently only supports whatever formats are supported by data.table::fread, so .tsv.gz for example.

@mkoromina
Copy link

Hi @bschilder, I am loading .gz files which are actually munged sumstats produced by ldsc. Do you suggest doing any amendments to it? Thanks a lot!

@bschilder
Copy link
Member

bschilder commented Mar 7, 2022

@mkoromina I don't recommend using LDSC's python script for munging sumstats since it makes a lot of assumptions of column identities (e.g. A1/A2), doesn't have as many colname mappings, doesn't perform any QC or genome build validation, and doesn't map SNPs RSIDs to a standard nomenclature (amongst other limitations).

Please use MungeSumstats which is much more robust. This is what the munged=TRUE flag is referring to specifically in echolocatoR::finemap_loci.

Here's the docs onfinemap_loci:
https://rajlabmssm.github.io/echolocatoR/reference/finemap_loci.html

@mkoromina
Copy link

@bschilder , thanks so much for this. May I ask you if munged sumstats via polyfun's respective python script will work on echolocatoR? If yes, is there a way of converting them to .tsv.gz files? Will try your MungeSumstats recommendation too as well. Thanks a lot!

@bschilder
Copy link
Member

bschilder commented Mar 7, 2022

sure, it can still potentially work. you just use pandas to read the parquets into python and then write them as tab-delimited files.

import pandas as pd
dat = pd.read_parquet("<file_path>")
dat.to_csv("<new_path>.tsv.gz", sep="\t")

You could also try out the new read/write_parquet functions I've added to echodata, though this does depend on a functioning echoR conda environment. So might be simpler to just use python directly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

No branches or pull requests

3 participants