"Zero INDELs to analyze after dropping 'complex' labeled records!" error #370
Comments
@tucker-bower-psjh Hi, thanks for reporting. I don't know which version of sigminer you are using. Could you retry Besides, we implemented the classification method based on the following reference, which was proposed by sigprofiler. I remember testing the code before: the sigminer and sigprofiler results were basically the same, with small differences in some cases (I couldn't figure out why).
If the error still exists, could you send me an example dataset for debugging (to w_shixiang@163.com)? In the meantime, I recommend that you directly use the matrix generated by sigprofiler for downstream analysis.
The SigMiner version is 2.0.2. Using the keep_only_pass = FALSE option for read_vcf did not prevent the error. I will look into sending you a sample dataset. I'll have to find an open-source one, as all of my data is PHI. Thanks, Tucker
@tucker-bower-psjh Thanks for the detailed info. You could randomly select 10 INDEL records (without sample ID) that can reproduce the error.
OK, I've created a fake vcf with 10 fabricated INDEL variants, and I am still getting the "Zero INDELs to analyze after dropping 'complex' labeled records!" error when running this file through sig_tally. In the meantime, I will work around this error by getting my matrices from SigProfiler.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 20210507-2111817031-51722-THT-DNA.bam
@tucker-bower-psjh Thanks, I will test it ASAP.
I found why these variants are labelled as 'complex'. Data:
Browse[3]> query
Hugo_Symbol Chromosome Start_Position End_Position Reference_Allele Tumor_Seq_Allele2
1: TNFRSF14-AS1 chr1 2488102 2488103 G AT
2: TNFRSF14-AS1 chr1 2488103 2488104 C CT
3: TNFRSF14-AS1 chr1 2488104 2488105 A AA
4: TNFRSF14-AS1 chr1 2488105 2488106 TT T
5: TNFRSF14-AS1 chr1 2488106 2488108 G GCG
6: TNFRSF14-AS1 chr1 2488107 2488109 GAG G
7: TNFRSF14-AS1 chr1 2488108 2488109 AA A
8: TNFRSF14-AS1 chr1 2488109 2488111 G GGG
9: TNFRSF14-AS1 chr1 2488110 2488111 CT A
10: TNFRSF14-AS1 chr1 2488111 2488114 C CCCC
Variant_Classification Variant_Type Tumor_Sample_Barcode
1: Unknown Ins fake
2: Unknown Ins fake
3: Unknown Ins fake
4: Unknown Del fake
5: Unknown Ins fake
6: Unknown Del fake
7: Unknown Del fake
8: Unknown Ins fake
9: Unknown Del fake
10: Unknown Ins fake
In my code, an acceptable INDEL should have a '-' label in either the ref position or the mut position. For example, the 2nd variant should be
## Search 'complex' motif
query[, ID_motif := ifelse(Reference_Allele != "-" & Tumor_Seq_Allele2 != "-",
  "complex", NA_character_
)]
I will try to add preprocessing code that transforms data in the format you provided before the 'complex' variants are labelled.
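To illustrate the kind of preprocessing described above, here is a minimal base-R sketch. It is not the code that was actually added to sigminer; the helper name `strip_common_prefix` is hypothetical, and the approach (dropping the shared leading bases of a VCF-style allele pair and writing an empty allele as "-") is an assumption about what such a step could look like. Positions are left untouched here, although a real implementation would also need to shift them.

```r
# Hypothetical helper, NOT sigminer's implementation: convert one VCF-style
# allele pair (with a shared anchor base) into MAF-style alleles using "-".
strip_common_prefix <- function(ref, alt) {
  n <- min(nchar(ref), nchar(alt))
  k <- 0
  # Count how many leading bases the two alleles share
  while (k < n && substr(ref, k + 1, k + 1) == substr(alt, k + 1, k + 1)) {
    k <- k + 1
  }
  ref <- substr(ref, k + 1, nchar(ref))
  alt <- substr(alt, k + 1, nchar(alt))
  # An allele that becomes empty is written as "-"
  c(ref = if (nzchar(ref)) ref else "-",
    alt = if (nzchar(alt)) alt else "-")
}

# The 10 fabricated variants from this thread (alleles only)
fake <- data.frame(
  Reference_Allele  = c("G", "C", "A", "TT", "G", "GAG", "AA", "G", "CT", "C"),
  Tumor_Seq_Allele2 = c("AT", "CT", "AA", "T", "GCG", "G", "A", "GGG", "A", "CCCC"),
  stringsAsFactors = FALSE
)

# Apply the helper to every row; the result is a 10 x 2 character matrix
norm <- t(mapply(strip_common_prefix,
                 fake$Reference_Allele, fake$Tumor_Seq_Allele2,
                 USE.NAMES = FALSE))
fake$Reference_Allele  <- norm[, "ref"]
fake$Tumor_Seq_Allele2 <- norm[, "alt"]

# Same 'complex' rule as in the snippet above, written in base R
fake$ID_motif <- ifelse(
  fake$Reference_Allele != "-" & fake$Tumor_Seq_Allele2 != "-",
  "complex", NA_character_
)

complex_rows <- as.integer(which(fake$ID_motif == "complex"))
print(fake)
print(complex_rows)
```

With this normalization, only variants whose alleles share no leading base (G to AT, CT to A) still lack a "-" and so stay labelled 'complex'; the other eight become plain insertions or deletions.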
A standard input MAF, as we understand it, labels INSs/DELs as shown below:
> laml.maf <- system.file("extdata", "tcga_laml.maf.gz", package = "maftools")
> laml <- read_maf(maf = laml.maf)
-Reading
-Validating
-Silent variants: 475
-Summarizing
-Processing clinical data
--Missing clinical data
-Finished in 0.343s elapsed (0.271s cpu)
> laml@data[Variant_Type %in% c("INS", "DEL")][, .(Chromosome, Start_Position, End_Position, Reference_Allele, Tumor_Seq_Allele2)]
Chromosome Start_Position End_Position Reference_Allele Tumor_Seq_Allele2
1: 15 86262345 86262351 AAGATCA -
2: 4 36149165 36149168 CTTA -
3: 6 129950550 129950553 TCTT -
4: 23 135772783 135772783 C -
5: 20 31022727 31022727 G -
---
201: 11 32417941 32417942 - A
202: 11 32417909 32417910 - GACCG
203: 11 32417846 32417847 AT -
204: 11 32417909 32417910 - GACCG
205: 4 146807298 146807298 G -
I have added code (f2fc994) to preprocess the data format you provided. Could you install the latest version from GitHub and try it with real data to check whether the result is similar to SigProfiler's?
> maf <- read_vcf("~/Downloads/fake.vcf.txt")
Reading file(s): ~/Downloads/fake.vcf.txt
Annotating Variant Type...
Annotating mutations to first matched gene based on database /Users/wsx/Documents/GitHub/sigminer/inst/extdata/human_hg19_gene_info.rds...
Transforming into a MAF object...
-Validating
--Non MAF specific values in Variant_Classification column:
Unknown
-Summarizing
-Processing clinical data
--Missing clinical data
-Finished in 0.038s elapsed (0.036s cpu)
> maf@data
Tumor_Sample_Barcode Chromosome Start_Position Reference_Allele Tumor_Seq_Allele2 End_Position
1: fake chr1 2488102 G AT 2488103
2: fake chr1 2488103 C CT 2488104
3: fake chr1 2488104 A AA 2488105
4: fake chr1 2488105 TT T 2488106
5: fake chr1 2488106 G GCG 2488108
6: fake chr1 2488107 GAG G 2488109
7: fake chr1 2488108 AA A 2488109
8: fake chr1 2488109 G GGG 2488111
9: fake chr1 2488110 CT A 2488111
10: fake chr1 2488111 C CCCC 2488114
Variant_Type Variant_Classification Hugo_Symbol
1: INS Unknown TNFRSF14-AS1
2: INS Unknown TNFRSF14-AS1
3: INS Unknown TNFRSF14-AS1
4: DEL Unknown TNFRSF14-AS1
5: INS Unknown TNFRSF14-AS1
6: DEL Unknown TNFRSF14-AS1
7: DEL Unknown TNFRSF14-AS1
8: INS Unknown TNFRSF14-AS1
9: DEL Unknown TNFRSF14-AS1
10: INS Unknown TNFRSF14-AS1
> mt_tally <- sig_tally(
+ maf,
+ ref_genome = "BSgenome.Hsapiens.UCSC.hg19",
+ useSyn = TRUE,
+ mode = "ID",
+ genome_build = "hg19",
+ add_trans_bias = TRUE
+ )
ℹ [2021-07-17 16:58:06]: Started.
✓ [2021-07-17 16:58:06]: Reference genome loaded.
✓ [2021-07-17 16:58:06]: Variants from MAF object queried.
✓ [2021-07-17 16:58:06]: Chromosome names checked.
✓ [2021-07-17 16:58:06]: Sex chromosomes properly handled.
✓ [2021-07-17 16:58:06]: Only variants located in standard chromosomes (1:22, X, Y, M/MT) are kept.
✓ [2021-07-17 16:58:06]: Variant start and end position checked.
✓ [2021-07-17 16:58:06]: Variant data for matrix generation preprocessed.
ℹ [2021-07-17 16:58:06]: INDEL matrix generation - start.
✓ [2021-07-17 16:58:06]: Reference sequences queried from genome.
✓ [2021-07-17 16:58:06]: INDEL length extracted.
✓ [2021-07-17 16:58:06]: Adjacent copies counted.
✓ [2021-07-17 16:58:06]: Microhomology size calculated.
✓ [2021-07-17 16:58:06]: INDEL records classified into different components (types).
✓ [2021-07-17 16:58:06]: ID-28 matrix created.
✓ [2021-07-17 16:58:06]: ID-83 matrix created.
✓ [2021-07-17 16:58:06]: ID-415 matrix created.
ℹ [2021-07-17 16:58:06]: Return ID-415 as major matrix.
✓ [2021-07-17 16:58:06]: Done.
ℹ [2021-07-17 16:58:06]: 0.71 secs elapsed.
> sum(mt_tally$nmf_matrix)
[1] 8
The 1st and 9th variants are labelled as 'complex' here and removed from the final result, which leaves 8 of the 10 variants in the matrix.
Thanks for your feedback. That's strange. Could you provide the sigprofiler result for your fake vcf? I want to compare it with the sigminer result one by one.
@tucker-bower-psjh I am setting up SigProfilerMatrixGenerator to figure out why the two tools produce inconsistent INDEL classes for the fake vcf data you provided. I will let you know when I reach a conclusion.
Finally, I have obtained the result of SigProfilerMatrixGenerator with
It was not easy to install the reference genome because my internet connection is poor. I will check the 10 variants one by one.
@tucker-bower-psjh The latest version is now very consistent with sigprofiler; however, I cannot figure out why or how the two complex variants are classified. The fake variants you provided:
This is how sigprofiler classifies them:
And this is how sigminer now classifies them:
Please note that sigprofiler seems to produce wrong classifications under the ID28 classification strategy, as it classifies
To sum up, sigminer now has the correct logic and implementation for ID-83 classification. I am closing this issue. Thanks for your bug report; it helped me fix my bugs.
I see. Thank you so much for looking into this. I look forward to watching SigMiner grow; it has been a pleasure to work with.
Hello Dr. Wang,
Firstly, thank you for making such a powerful tool for mutational signature analysis. I am trying to use sigminer's sig_fit to fit COSMIC SBS, DBS, and INDEL signatures to my .vcf data, but I get the error "Zero INDELs to analyze after dropping 'complex' labeled records!" when I run the sig_tally function. However, when I run the same .vcf through SigProfiler Matrix Generator separately, I see that there are many indel variants not labeled as complex. Is there something wrong with my input or my script that I can rectify?
Sincerely,
Tucker Bower
Here's my Rscript
And the output file from SigProfiler Matrix Generator (XXX_DNA.ID96.all)
Finally, a few entries from my input .vcf file so you can see the format: