Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Zero INDELs to analyze after dropping 'complex' labeled records!" error #370

Closed
tucker-bower-psjh opened this issue Jul 6, 2021 · 14 comments
Labels
bug Something isn't working enhancement New feature or request

Comments

@tucker-bower-psjh
Copy link

Hello Dr. Wang,

Firstly, thank you for making such a powerful tool for mutational signature analysis. I am trying to use sigminer sig_fit to fit COSMIC SBS, DBS, and INDEL signatures to my .vcf data, but I get the error: ""Zero INDELs to analyze after dropping 'complex' labeled records!" when I run the sig_tally function. However, when I run the same .vcf through SigProfiler Matrix Generator separately, I see that there are many indel variants not labeled as complex. Is there something wrong with my input or my script that I can rectify?

Sincerely,

Tucker Bower

Here's my Rscript

vcf <- "/Users/tucker.bowerprovidence.org/codeworld/mut-sigs/old_test_runs/argtest/XXX_DNA.filtered.genome.vcf"
maf <- read_vcf(vcf)

mt_tally <- sig_tally(
  maf,
  ref_genome = "BSgenome.Hsapiens.UCSC.hg19",
  useSyn = TRUE,
  mode = "ID",
  genome_build = "hg19",
  add_trans_bias = TRUE
)

And the output file from SigProfiler Matrix Generator (XXX_DNA.ID96.all)

MutationType    20200903_2023005257_TSO500_DNA
1:Del:C:0       170
1:Del:C:1       94
1:Del:C:2       40
1:Del:C:3       44
1:Del:C:4       45
1:Del:C:5       48
1:Del:T:0       70
1:Del:T:1       54
1:Del:T:2       46
1:Del:T:3       57
1:Del:T:4       31
1:Del:T:5       261
1:Ins:C:0       35
1:Ins:C:1       15
1:Ins:C:2       5
1:Ins:C:3       3
1:Ins:C:4       0
1:Ins:C:5       16
1:Ins:T:0       50
1:Ins:T:1       23
1:Ins:T:2       4
1:Ins:T:3       4
1:Ins:T:4       3
1:Ins:T:5       198
2:Del:R:0       38
2:Del:R:1       50
2:Del:R:2       4
2:Del:R:3       6
2:Del:R:4       11
2:Del:R:5       111
3:Del:R:0       8
3:Del:R:1       9
3:Del:R:2       10
3:Del:R:3       39
3:Del:R:4       38
3:Del:R:5       58
4:Del:R:0       5
4:Del:R:1       2
4:Del:R:2       9
4:Del:R:3       20
4:Del:R:4       5
4:Del:R:5       13
5:Del:R:0       12
5:Del:R:1       15
5:Del:R:2       25
5:Del:R:3       16
5:Del:R:4       7
5:Del:R:5       6
2:Ins:R:0       48
2:Ins:R:1       8
2:Ins:R:2       4
2:Ins:R:3       0
2:Ins:R:4       3
2:Ins:R:5       42
3:Ins:R:0       21
3:Ins:R:1       2
3:Ins:R:2       2
3:Ins:R:3       3
3:Ins:R:4       1
3:Ins:R:5       10
4:Ins:R:0       16
4:Ins:R:1       7
4:Ins:R:2       2
4:Ins:R:3       1
4:Ins:R:4       0
4:Ins:R:5       5
5:Ins:R:0       13
5:Ins:R:1       5
5:Ins:R:2       0
5:Ins:R:3       0
5:Ins:R:4       1
5:Ins:R:5       2
2:Del:M:1       75
3:Del:M:1       12
3:Del:M:2       8
4:Del:M:1       6
4:Del:M:2       1
4:Del:M:3       3
5:Del:M:1       10
5:Del:M:2       2
5:Del:M:3       6
5:Del:M:4       1
5:Del:M:5       14
2:Ins:M:1       0
3:Ins:M:1       0
3:Ins:M:2       0
4:Ins:M:1       0
4:Ins:M:2       0
4:Ins:M:3       0
5:Ins:M:1       0
5:Ins:M:2       0
5:Ins:M:3       0
5:Ins:M:4       0
5:Ins:M:5       0
complex 128
non_matching    0

Finally, a few entries from my input .vcf file so you can see the format:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  20200903_2023005257_TSO500_DNA.bam
chr1    2488102 .       G       .       100     LowDP   DP=155  GT:GQ:AD:DP:VF:NL:SB:NC:US      0/0:0:155:155:0.00000:65:-100.0000:0.1436:0,0,0,0,0,0,31,26,23,24,28,23
chr1    2488103 .       C       .       100     LowDP   DP=154  GT:GQ:AD:DP:VF:NL:SB:NC:US      0/0:0:154:154:0.00000:65:-100.0000:0.1492:0,0,0,0,0,0,32,26,25,22,27,22
chr1    2488104 .       A       .       100     LowDP   DP=157  GT:GQ:AD:DP:VF:NL:SB:NC:US      0/0:0:157:157:0.00000:65:-100.0000:0.1278:0,0,0,0,0,0,32,26,26,23,28,22
chr1    2488105 .       T       .       100     LowDP   DP=159  GT:GQ:AD:DP:VF:NL:SB:NC:US      0/0:0:159:159:0.00000:65:-100.0000:0.1264:0,0,0,0,0,0,32,25,26,24,30,22
@ShixiangWang
Copy link
Owner

@tucker-bower-psjh Hi, thanks for reporting. I don't know which version of sigminer you are using. Could you retry read_vcf() with option keep_only_pass = FALSE?

Besides, we implemented the classification method based on the following reference, which was proposed by sigprofiler. I remember I tested the code before, the (sigminer and sigprofiler) results were basically same but with a little difference in some cases (I couldn't figure out why).

Bergstrom EN, Huang MN, Mahto U, Barnes M, Stratton MR, Rozen SG, Alexandrov LB: SigProfilerMatrixGenerator: a tool for visualizing and exploring patterns of small mutational events. BMC Genomics 2019, 20:685 https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6041-2

If the error still exists, could you send me an example dataset for debugging (to w_shixiang@163.com)? At the same time, I recommend you directly use matrix generated from sigprofiler for downstream analysis currently.

@ShixiangWang ShixiangWang added bug Something isn't working question Further information is requested labels Jul 6, 2021
@tucker-bower-psjh
Copy link
Author

The SigMiner version is 2.0.2

using keep_only_pass = FALSE option for read_vcf did not prevent the error.

I will look into sending you a sample dataset. I'll have to find an open source one as all of my data is PHI.

Thanks,

Tucker

@ShixiangWang ShixiangWang removed the question Further information is requested label Jul 7, 2021
@ShixiangWang
Copy link
Owner

@tucker-bower-psjh Thanks for your detail info. You could randomly select 10 INDEL records (without sample ID) that can reproduce the error.

@tucker-bower-psjh
Copy link
Author

OK, I've created a fake vcf with 10 fabricated INDEL variants, and I am still getting the "Zero INDELs to analyze after dropping 'complex' labeled records!" error when running this file through sig_tally. In the meantime, I will work on getting around this error by getting my matrices from SigProfiler.
fake.vcf.txt

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 20210507-2111817031-51722-THT-DNA.bam
chr1 2488102 . G AT 100 PASS DP=815 GT:GQ:AD:DP:VF:NL:SB:NC:US 0/0:0:815:815:0.00000:30:-100.0000:0.0204:0,0,0,0,0,0,37,12,186,178,233,169
chr1 2488103 . C CT 100 PASS DP=805 GT:GQ:AD:DP:VF:NL:SB:NC:US 0/0:0:805:805:0.00000:30:-100.0000:0.0301:0,0,0,0,0,0,37,12,188,175,232,161
chr1 2488104 . A AA 100 PASS DP=798 GT:GQ:AD:DP:VF:NL:SB:NC:US 0/0:0:798:798:0.00000:30:-100.0000:0.0374:0,0,0,0,0,0,37,12,187,167,233,162
chr1 2488105 . TT T 100 PASS DP=799 GT:GQ:AD:DP:VF:NL:SB:NC:US 0/0:0:799:799:0.00000:30:-100.0000:0.0208:0,0,0,0,0,0,37,12,191,168,227,164
chr1 2488106 . G GCG 100 PASS DP=789 GT:GQ:AD:DP:VF:NL:SB:NC:US 0/0:0:789:789:0.00000:30:-100.0000:0.0295:0,0,0,0,0,0,37,12,187,170,224,159
chr1 2488107 . GAG G 100 PASS DP=782 GT:GQ:AD:DP:VF:NL:SB:NC:US 0/0:0:782:782:0.00000:30:-100.0000:0.0346:0,0,0,0,0,0,35,12,188,167,223,157
chr1 2488108 . AA A 100 PASS DP=780 GT:GQ:AD:DP:VF:NL:SB:NC:US 0/0:0:780:780:0.00000:30:-100.0000:0.0274:0,0,0,0,0,0,35,12,192,165,222,154
chr1 2488109 . G GGG 100 PASS DP=764 GT:GQ:AD:DP:VF:NL:SB:NC:US 0/0:0:764:764:0.00000:30:-100.0000:0.0402:0,0,0,0,0,0,35,12,187,164,215,151
chr1 2488110 . CT A 3 q20 DP=770 GT:GQ:AD:DP:VF:NL:SB:NC:US 0/1:3:769,1:770:0.00130:30:-3.1471:0.0204:0,0,1,0,0,0,35,12,192,158,217,156
chr1 2488111 . C CCCC 100 PASS DP=763 GT:GQ:AD:DP:VF:NL:SB:NC:US 0/0:0:763:763:0.00000:30:-100.0000:0.0280:0,0,0,0,0,0,34,12,183,166,215,153

@ShixiangWang
Copy link
Owner

@tucker-bower-psjh Thanks, I will test it ASAP.

@ShixiangWang
Copy link
Owner

I found why these variants are labelled as 'complex'.

data:

Browse[3]> query
     Hugo_Symbol Chromosome Start_Position End_Position Reference_Allele Tumor_Seq_Allele2
 1: TNFRSF14-AS1       chr1        2488102      2488103                G                AT
 2: TNFRSF14-AS1       chr1        2488103      2488104                C                CT
 3: TNFRSF14-AS1       chr1        2488104      2488105                A                AA
 4: TNFRSF14-AS1       chr1        2488105      2488106               TT                 T
 5: TNFRSF14-AS1       chr1        2488106      2488108                G               GCG
 6: TNFRSF14-AS1       chr1        2488107      2488109              GAG                 G
 7: TNFRSF14-AS1       chr1        2488108      2488109               AA                 A
 8: TNFRSF14-AS1       chr1        2488109      2488111                G               GGG
 9: TNFRSF14-AS1       chr1        2488110      2488111               CT                 A
10: TNFRSF14-AS1       chr1        2488111      2488114                C              CCCC
    Variant_Classification Variant_Type Tumor_Sample_Barcode
 1:                Unknown          Ins                 fake
 2:                Unknown          Ins                 fake
 3:                Unknown          Ins                 fake
 4:                Unknown          Del                 fake
 5:                Unknown          Ins                 fake
 6:                Unknown          Del                 fake
 7:                Unknown          Del                 fake
 8:                Unknown          Ins                 fake
 9:                Unknown          Del                 fake
10:                Unknown          Ins                 fake

In my code, an acceptable INDEL should have '-' label in either ref position or mut position. For example, the 2nd variant should be - > T instead of C -> CT.

  ## Seach 'complex' motif
  query[, ID_motif := ifelse(Reference_Allele != "-" & Tumor_Seq_Allele2 != "-",
    "complex", NA_character_
  )]

I will try to add a preprocessing code to transform data format like you provided before labelling the 'complex' variant.

@ShixiangWang ShixiangWang added the enhancement New feature or request label Jul 17, 2021
@ShixiangWang
Copy link
Owner

An standard input maf labelling INSs/DELs we think is like below:

> laml.maf <- system.file("extdata", "tcga_laml.maf.gz", package = "maftools")
> laml <- read_maf(maf = laml.maf)
-Reading
-Validating
-Silent variants: 475 
-Summarizing
-Processing clinical data
--Missing clinical data
-Finished in 0.343s elapsed (0.271s cpu) 
> laml@data[Variant_Type %in% c("INS", "DEL")][, .(Chromosome, Start_Position, End_Position, Reference_Allele, Tumor_Seq_Allele2)]
     Chromosome Start_Position End_Position Reference_Allele Tumor_Seq_Allele2
  1:         15       86262345     86262351          AAGATCA                 -
  2:          4       36149165     36149168             CTTA                 -
  3:          6      129950550    129950553             TCTT                 -
  4:         23      135772783    135772783                C                 -
  5:         20       31022727     31022727                G                 -
 ---                                                                          
201:         11       32417941     32417942                -                 A
202:         11       32417909     32417910                -             GACCG
203:         11       32417846     32417847               AT                 -
204:         11       32417909     32417910                -             GACCG
205:          4      146807298    146807298                G                 -

@ShixiangWang
Copy link
Owner

ShixiangWang commented Jul 17, 2021

I have added code (f2fc994) to preprocess your provided data format. Could you install the latest version from GitHub and try it with real data to check if the result is similar to SigProfiler?

> maf <- read_vcf("~/Downloads/fake.vcf.txt")
Reading file(s): ~/Downloads/fake.vcf.txt
Annotating Variant Type...
Annotating mutations to first matched gene based on database /Users/wsx/Documents/GitHub/sigminer/inst/extdata/human_hg19_gene_info.rds...
Transforming into a MAF object...
-Validating
--Non MAF specific values in Variant_Classification column:
  Unknown
-Summarizing
-Processing clinical data
--Missing clinical data
-Finished in 0.038s elapsed (0.036s cpu) 
> maf@data
    Tumor_Sample_Barcode Chromosome Start_Position Reference_Allele Tumor_Seq_Allele2 End_Position
 1:                 fake       chr1        2488102                G                AT      2488103
 2:                 fake       chr1        2488103                C                CT      2488104
 3:                 fake       chr1        2488104                A                AA      2488105
 4:                 fake       chr1        2488105               TT                 T      2488106
 5:                 fake       chr1        2488106                G               GCG      2488108
 6:                 fake       chr1        2488107              GAG                 G      2488109
 7:                 fake       chr1        2488108               AA                 A      2488109
 8:                 fake       chr1        2488109                G               GGG      2488111
 9:                 fake       chr1        2488110               CT                 A      2488111
10:                 fake       chr1        2488111                C              CCCC      2488114
    Variant_Type Variant_Classification  Hugo_Symbol
 1:          INS                Unknown TNFRSF14-AS1
 2:          INS                Unknown TNFRSF14-AS1
 3:          INS                Unknown TNFRSF14-AS1
 4:          DEL                Unknown TNFRSF14-AS1
 5:          INS                Unknown TNFRSF14-AS1
 6:          DEL                Unknown TNFRSF14-AS1
 7:          DEL                Unknown TNFRSF14-AS1
 8:          INS                Unknown TNFRSF14-AS1
 9:          DEL                Unknown TNFRSF14-AS1
10:          INS                Unknown TNFRSF14-AS1
> mt_tally <- sig_tally(
+   maf,
+   ref_genome = "BSgenome.Hsapiens.UCSC.hg19",
+   useSyn = TRUE,
+   mode = "ID",
+   genome_build = "hg19",
+   add_trans_bias = TRUE
+ )
ℹ [2021-07-17 16:58:06]: Started.
✓ [2021-07-17 16:58:06]: Reference genome loaded.
✓ [2021-07-17 16:58:06]: Variants from MAF object queried.
✓ [2021-07-17 16:58:06]: Chromosome names checked.
✓ [2021-07-17 16:58:06]: Sex chromosomes properly handled.
✓ [2021-07-17 16:58:06]: Only variants located in standard chromosomes (1:22, X, Y, M/MT) are kept.
✓ [2021-07-17 16:58:06]: Variant start and end position checked.
✓ [2021-07-17 16:58:06]: Variant data for matrix generation preprocessed.
ℹ [2021-07-17 16:58:06]: INDEL matrix generation - start.
✓ [2021-07-17 16:58:06]: Reference sequences queried from genome.
✓ [2021-07-17 16:58:06]: INDEL length extracted.
✓ [2021-07-17 16:58:06]: Adjacent copies counted.
✓ [2021-07-17 16:58:06]: Microhomology size calculated.
✓ [2021-07-17 16:58:06]: INDEL records classified into different components (types).
✓ [2021-07-17 16:58:06]: ID-28 matrix created.
✓ [2021-07-17 16:58:06]: ID-83 matrix created.
✓ [2021-07-17 16:58:06]: ID-415 matrix created.
ℹ [2021-07-17 16:58:06]: Return ID-415 as major matrix.
✓ [2021-07-17 16:58:06]: Done.
ℹ [2021-07-17 16:58:06]: 0.71 secs elapsed.
> sum(mt_tally$nmf_matrix)
[1] 8

The 1st and 9th variants are labelled as 'complex' here and removed from final result.

@tucker-bower-psjh
Copy link
Author

I can confirm that I can now run mt_tally and get ID28 and ID83 matrices for my real data, which is great. However, the results appear to be vastly different from those of SigProfiler when I ran both on the same vcf.

SigProfiler:
MutationType XXX_DNA
1 1:Del:C:0 170
2 1:Del:C:1 94
3 1:Del:C:2 40
4 1:Del:C:3 44
5 1:Del:C:4 45
6 1:Del:C:5 48
7 1:Del:T:0 70
8 1:Del:T:1 54
9 1:Del:T:2 46
10 1:Del:T:3 57
11 1:Del:T:4 31
12 1:Del:T:5 261
13 1:Ins:C:0 35
14 1:Ins:C:1 15
15 1:Ins:C:2 5
16 1:Ins:C:3 3
17 1:Ins:C:4 0
18 1:Ins:C:5 16
19 1:Ins:T:0 50
20 1:Ins:T:1 23
21 1:Ins:T:2 4
22 1:Ins:T:3 4
23 1:Ins:T:4 3
24 1:Ins:T:5 198
25 2:Del:R:0 38
26 2:Del:R:1 50
27 2:Del:R:2 4
28 2:Del:R:3 6
29 2:Del:R:4 11
30 2:Del:R:5 111
31 3:Del:R:0 8
32 3:Del:R:1 9
33 3:Del:R:2 10
34 3:Del:R:3 39
35 3:Del:R:4 38
36 3:Del:R:5 58
37 4:Del:R:0 5
38 4:Del:R:1 2
39 4:Del:R:2 9
40 4:Del:R:3 20
41 4:Del:R:4 5
42 4:Del:R:5 13
43 5:Del:R:0 12
44 5:Del:R:1 15
45 5:Del:R:2 25
46 5:Del:R:3 16
47 5:Del:R:4 7
48 5:Del:R:5 6
49 2:Ins:R:0 48
50 2:Ins:R:1 8
51 2:Ins:R:2 4
52 2:Ins:R:3 0
53 2:Ins:R:4 3
54 2:Ins:R:5 42
55 3:Ins:R:0 21
56 3:Ins:R:1 2
57 3:Ins:R:2 2
58 3:Ins:R:3 3
59 3:Ins:R:4 1
60 3:Ins:R:5 10
61 4:Ins:R:0 16
62 4:Ins:R:1 7
63 4:Ins:R:2 2
64 4:Ins:R:3 1
65 4:Ins:R:4 0
66 4:Ins:R:5 5
67 5:Ins:R:0 13
68 5:Ins:R:1 5
69 5:Ins:R:2 0
70 5:Ins:R:3 0
71 5:Ins:R:4 1
72 5:Ins:R:5 2
73 2:Del:M:1 75
74 3:Del:M:1 12
75 3:Del:M:2 8
76 4:Del:M:1 6
77 4:Del:M:2 1
78 4:Del:M:3 3
79 5:Del:M:1 10
80 5:Del:M:2 2
81 5:Del:M:3 6
82 5:Del:M:4 1
83 5:Del:M:5 14

SigMiner Results attached as photo, because Rstudio presented them in a weird format that was hard to copy.
Screen Shot 2021-07-20 at 8 36 11 AM

@ShixiangWang
Copy link
Owner

Thanks for your feedback. Thats strange. Could you provide the result of sigprofiler for your fake vcf. I want to compare it with sigminer one by one.

@ShixiangWang
Copy link
Owner

@tucker-bower-psjh I am setting up the SigProfilerMatrixGenerator to figure out why inconsistent INDEL classes obtained from the two tools with your provided fake vcf data. I will let you know when I get a conclusion.

@ShixiangWang
Copy link
Owner

Finally, I have obtained the result of SigProfilerMatrixGenerator with

Python 3.9.1 (default, Dec 11 2020, 14:32:07) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from SigProfilerMatrixGenerator.scripts import SigProfilerMatrixGeneratorFunc as matGen
>>> matrices = matGen.SigProfilerMatrixGeneratorFunc("test", "GRCh37", "./",plot=False, exome=False, bed_file=None, chrom_based=False, tsb_stat=False, seqInfo=False, cushion=100)
Starting matrix generation for INDELs...Completed! Elapsed time: 5.93 seconds.
Matrices generated for 1 samples with 0 errors. Total of 0 SNVs, 0 DINUCs, and 10 INDELs were successfully analyzed.

It's not easy to install the reference genome as the internet is not good.

I will check the 10 variants one by one.

@ShixiangWang
Copy link
Owner

@tucker-bower-psjh The latest version is very consitent with sigprofiler now, however, I cannot figure out why/how the two complex variants are classified.

You provided fake variants:

G > AT unknown
CT > A unknown
- > GG insertion
- > CCC insertion
- > T insertion
- > A insertion
- > CG insertion
GAG > G deletion
AA > A deletion
TT > T deletion

The following is how sigprofiler classify them:

1:Del:T:1  3
1:Ins:T:0  2
1:Ins:T:1  1
MH  4 (28 classifications)
  2:Ins:R:0  2
  3:Ins:R:0  1
  2:Del:M:1  1

Now sigminer how to classify them:

1:Del:T:1  2
1:Ins:T:0  1
1:Ins:T:1  1
2:Ins:R:0  2
3:Ins:R:0  1
2:Del:M:1  1
complex  2

1:Del:T:1 and 1:Ins:T:0 assigned to G > AT or CT > A, I cannot figure how to do this and this is not documented in sigprofiler wiki or the paper https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6041-2#additional-information.


Please note that the sigprofiler seems has wrong classifications on ID28 classification strategy, as it classify 2:Ins:R:0 3:Ins:R:0 to MH.

image

https://osf.io/s93d5/wiki/5.%20Output%20-%20ID/


To sum up, the sigminer has the correct logical and implementation for ID-83 classification now.

I am closing this issue, thanks for your bug report, this help me fix my bugs.

@tucker-bower
Copy link

I see. Thank you so much for looking into this. I look forward to watching SigMiner grow, it has been a pleasure to work with.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants