Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Showing mutational load #453

Closed
beginner984 opened this issue Jan 20, 2020 · 33 comments
Closed

Showing mutational load #453

beginner984 opened this issue Jan 20, 2020 · 33 comments

Comments

@beginner984
Copy link

beginner984 commented Jan 20, 2020

Hi

I use tcgaCompare function to look at mutational load in my data vs TCGA but is there any way to do the same for mine different types of data? Please let's say I have separate maf objects for patients before and after chemotherapy, so that would be nice to look at mutational load in two data side by side

Thank you

@PoisonAlien
Copy link
Owner

Yes, you can do that. tcgaCompare can accept multiple maf objects.

For example if you have two maf objects x and y, you could do like below..

tcgaCompare(maf = c(x, y), cohortName = c("x", "y"))

@beginner984
Copy link
Author

Thank you, is it possible to say statistically if mutational burden differs between two data?

@PoisonAlien
Copy link
Owner

Not at the moment. But I think it can be done. I will let you know if I can implement it..

@beginner984
Copy link
Author

beginner984 commented Jan 21, 2020

Sorry
Is it possible to show p-value for Transition and Transversions in samples?

Something like part of this figure

Untitled

@PoisonAlien
Copy link
Owner

No. But, one can easily do it with multiple titv outputs.

@beginner984
Copy link
Author

Sorry for too questioning
I am see number of different types of mutations in summary of maf

But can we get total number of SNV and INDEL separately for each sample?

@PoisonAlien
Copy link
Owner

Check out mafSummary function which will get you these numbers.
For example:

> ms = mafSummary(maf = laml)
> ms$variant.type.summary
 Tumor_Sample_Barcode DEL INS SNP total
  1:         TCGA-AB-3009   0   6  36    42
  2:         TCGA-AB-2807   2   0  27    29
  3:         TCGA-AB-2927   0   2  25    27
  4:         TCGA-AB-2959   0   0  27    27
  5:         TCGA-AB-3002   0   0  27    27
 ---                                       
189:         TCGA-AB-2903   0   0   1     1
190:         TCGA-AB-2909   0   1   0     1
191:         TCGA-AB-2933   0   0   1     1
192:         TCGA-AB-2942   0   1   0     1
193:         TCGA-AB-2954   0   0   1     1

However, it returns summaries for everything (synonymous and nonsynonymous).

@beginner984
Copy link
Author

Sorry tcgaCompare gives median of mutations per Mb for each maf, is there any way to access to number of mutations per Mb for each sample in maf object?

@PoisonAlien
Copy link
Owner

I am working on it. I will make the function to return pairwise t-test results and mutation load per sample. Should be done soon..

@beginner984
Copy link
Author

Great job and the best of lucks

PoisonAlien added a commit that referenced this issue Jan 22, 2020
@PoisonAlien
Copy link
Owner

Okay, could you try again? tcgaCompare output should contain pairwise t-test results and mutation load per sample.

@beginner984
Copy link
Author

Thanks; this is simply amazing

@beginner984
Copy link
Author

Check out mafSummary function which will get you these numbers.
For example:

ms = mafSummary(maf = laml)
ms$variant.type.summary
Tumor_Sample_Barcode DEL INS SNP total
1: TCGA-AB-3009 0 6 36 42
2: TCGA-AB-2807 2 0 27 29
3: TCGA-AB-2927 0 2 25 27
4: TCGA-AB-2959 0 0 27 27
5: TCGA-AB-3002 0 0 27 27


189: TCGA-AB-2903 0 0 1 1
190: TCGA-AB-2909 0 1 0 1
191: TCGA-AB-2933 0 0 1 1
192: TCGA-AB-2942 0 1 0 1
193: TCGA-AB-2954 0 0 1 1
However, it returns summaries for everything (synonymous and nonsynonymous).

Sorry for my data


> ms = mafSummary(maf = laml_res)
> ms$variant.type.summary
    Tumor_Sample_Barcode  DEL  INS   SNP MNP total
 1:    LP6008337-DNA_H06  927  773 40756   0 42456
 2:    LP6008334-DNA_D02 1049  799 31009   0 32857

What is MNP?

@PoisonAlien
Copy link
Owner

They are Multi Nucleotide Polymorphisms. They are rare but do often show up.

@beginner984
Copy link
Author

I am working on it. I will make the function to return pairwise t-test results and mutation load per sample. Should be done soon..

Sorry

The newly implemented pairwise t-test in tcgaCompare function compares Median_Mutations , Median_Mutations_log10 or total ?

@PoisonAlien
Copy link
Owner

Hi,
Test is performed based on the capture_size argument. If you provide capture_size then differences in mutation rate per mb is estimated. Or it would simply be total. You can check the messages while running.

> x_log10 = tcgaCompare(maf = laml, cohortName = "AML")
Performing pairwise t-test for differences in mutation burden..

> x_log10_mb = tcgaCompare(maf = laml, cohortName = "AML", capture_size = 50)
Capture size [TCGA]:  50
Capture size [Input]: 50
Performing pairwise t-test for differences in mutation burden (per MB)..

Log10 is used only for visualisation and nothing else.

@beginner984
Copy link
Author

Thank you. My actual question roots from here whether t test is suitable for total count comparison among groups or not, I then asked about log because I thought by that you try to make the distribution of number of mutations per Mb close to a normal distribution for using t test. Somewhere I have seen people for comparing such a raw count use wilcox test that is why I was seeking for a clue for type of test. However thank you again

@beginner984
Copy link
Author

beginner984 commented Jul 2, 2020

Hello @PoisonAlien

For getting the number of non synonymous mutations per sample

getSampleSummary(laml) gives fields like

 [1] "Tumor_Sample_Barcode" "Frame_Shift_Del"      "Frame_Shift_Ins"      "In_Frame_Del"        
 [5] "In_Frame_Ins"         "Missense_Mutation"    "Nonsense_Mutation"    "Nonstop_Mutation"    
 [9] "Splice_Site"          "total"               
> 

Does total give non synonymous mutations or I should add up "Missense_Mutation" "Nonsense_Mutation" "Nonstop_Mutation" "Splice_Site"

@PoisonAlien
Copy link
Owner

Output from getSampleSummary includes only nonsynonymous mutations. You don;t have to add up anything - unless you;re interested in specific types.

@beginner984
Copy link
Author

beginner984 commented Jul 15, 2020

Hello @PoisonAlien

I am struggling with mutation per sample

Please have a look

by

plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

Rplot

And by

plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = 'median', dashboard = FALSE, titvRaw = FALSE)
Rplot01

You are seeing the median of mutation per sample differs if dashboard argument gets FALSE or TRUE

Am I right?

@PoisonAlien
Copy link
Owner

Hi,
I have hard time reproducing the issue. I have tried it on several tcga cohorts and it seems ok to me. Are you sure they are the same objects?

@beginner984
Copy link
Author

beginner984 commented Jul 16, 2020

Yes @PoisonAlien the same objects as I only have one maf object in environment

I also put set.seed(100)

This is my R

Screenshot 2020-07-16 at 10 15 48

Screenshot 2020-07-16 at 10 14 53

@PoisonAlien
Copy link
Owner

Seems like you have some faulty values in ref and alt alleles. Please do check and fix them.

@beginner984
Copy link
Author

Sorry @PoisonAlien , I got worry about my data, what should I check and correct in my data?

@PoisonAlien
Copy link
Owner

For SNVs, your data seems to have non A,T,G,C values. Hence the warning..

@beginner984
Copy link
Author

beginner984 commented Jul 16, 2020

Sorry @PoisonAlien you are right

> laml1=laml1[laml1$Variant_Type=="SNP",]
> View(laml1)
> unique(laml1$Reference_Allele)
[1] "G" "C" "T" "A" "-"
> unique(laml1$Tumor_Seq_Allele2)
[1] "A" "T" "G" "C" "-"

You think what is the possible source of problem where I have done wrong? I really put a hug amount of time to gather this data and this made me really panicked if I have to do everything from scratch and throw away any conclusion

Can I simply remove any -?

@PoisonAlien
Copy link
Owner

I am not sure how you generated your MAFs. I would investigate the source data used for generating these maf files.
Maybe go back to your original annotation files and check ref/alt alleles for these mis-classified alleles.
I would advice against removing them without properly checking them.

@beginner984
Copy link
Author

I have used Annovar for annotation them annovar2maf function :(

@PoisonAlien
Copy link
Owner

Yes, check annovar output files for these loci. You can share an example if you think annovar2maf is the culprit.

@beginner984
Copy link
Author

beginner984 commented Jul 16, 2020

Thank you @PoisonAlien

This is one of my Annovar files directly fed into annovar2maf

annovar.txt

However, before getting Andover I had to do some manipulation one Stelka .vcf files for compatibility like

grep -v "##" file.vcf | awk '{print $1"\t"$2"\t"$2"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11}' > snp.txt

Then

table_annovar.pl snp.txt /temp/hgig/fi1d18/annovar/humandb/ -buildver hg19 -out output -remove -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a -operation gx,r,f,f,f

@beginner984
Copy link
Author

beginner984 commented Jul 16, 2020

I just checked one Annovar sample having - in maf, but there was not any - in Annovar

@PoisonAlien
Copy link
Owner

Your example file looks fine after processing with annovarToMaf

> x = maftools::annovarToMaf(annovar = "~/Downloads/annovar.txt")
> x[Variant_Type %in% "SNP"][,.N,.(Reference_Allele, Tumor_Seq_Allele2)]
    Reference_Allele Tumor_Seq_Allele2    N
 1:                C                 T 3912
 2:                G                 A 3870
 3:                T                 A 1619
 4:                C                 A 2328
 5:                A                 G 3050
 6:                T                 G 3800
 7:                T                 C 3054
 8:                G                 T 2344
 9:                G                 C  757
10:                A                 T 1620
11:                C                 G  752
12:                A                 C 3569

@beginner984
Copy link
Author

beginner984 commented Jul 16, 2020

Sorry @PoisonAlien is it possible that happened at this step?

annovar_mafs = data.table::rbindlist(l = annovar_mafs, fill = TRUE) #Merge into single MAF

I remember I was doing something like

> list.files()
[1] "indel" "snp"  
anno_files = list.files(path = "/New folder", pattern = "*a\\.txt$", recursive = TRUE, full.names = TRUE)
anno_maf = maftools::annovarToMaf(annovar = anno_files, MAFobj = FALSE)

:(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants