This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

TCGA Mutect2 and Strelka2 files appear to be identical? #483

Closed
cansavvy opened this issue Jan 28, 2020 · 11 comments
Labels
data snv Related to or requires SNV data

Comments

@cansavvy
Collaborator

What data file(s) does this issue pertain to?

pbta-tcga-snv-mutect2.vep.maf.gz and pbta-tcga-snv-strelka2.vep.maf.gz

What release are you using?

release-v13-20200116

Put your question or report your issue here.

Both files have exactly 38296 mutation calls (rows), and both of the following comparisons return TRUE:

strelka_df <- data.table::fread("data/pbta-tcga-snv-strelka2.vep.maf.gz", data.table = FALSE)
mutect_df <- data.table::fread("data/pbta-tcga-snv-mutect2.vep.maf.gz", data.table = FALSE)

# This returns TRUE
all.equal(strelka_df, mutect_df)

# This returns TRUE as well
dplyr::all_equal(strelka_df, mutect_df)

Is there something I'm missing? Or did these files accidentally get mixed up?

@cansavvy cansavvy added data snv Related to or requires SNV data labels Jan 28, 2020
@cansavvy
Collaborator Author

I deleted the files and their redirects, re-ran bash download-data.sh, and retried the above, and I am still getting TRUE. But the md5sum.txt file does show they are different:

93a91212e90b8c2533e8dfef9e72f7c2  pbta-tcga-snv-mutect2.vep.maf.gz
3c1f3fce62e4224fbb5466bd65516171  pbta-tcga-snv-strelka2.vep.maf.gz

@jharenza
Collaborator

@yuankunzhu @tkoganti @migbro - can you look into this?

@cansavvy
Collaborator Author

In Terminal, diff pbta-tcga-snv-mutect2.vep.maf.gz pbta-tcga-snv-strelka2.vep.maf.gz doesn't spit back anything, but as a positive control, I tried diff pbta-tcga-snv-lancet.vep.maf.gz pbta-tcga-snv-strelka2.vep.maf.gz and it showed me all the lines that were different.

@jharenza jharenza mentioned this issue Jan 28, 2020
3 tasks
@yuankunzhu
Collaborator

yuankunzhu commented Jan 28, 2020

@cansavvy I think we might have merged the same content into those two files. @tkoganti is checking and redoing it now. But a quick ls shows that the two file sizes differ by 1 byte:

2020-01-16 18:31:25    8805609 data/release-v13-20200116/pbta-tcga-snv-mutect2.vep.maf.gz
2020-01-16 18:31:26    8805610 data/release-v13-20200116/pbta-tcga-snv-strelka2.vep.maf.gz

That would explain the md5 difference despite exactly the same mutation content. Sorry for the confusion; we will fix this in a new release.
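One way two .gz files can hold identical mutation content yet have different md5 sums and sizes one byte apart is that the gzip header can store the original file name, and "mutect2" is one character shorter than "strelka2". Whether that is what happened here is an assumption; the thread does not confirm it. The Python sketch below uses a made-up payload and the hypothetical helpers gzip_bytes and md5_hex to show the effect, and to show that hashing the decompressed bytes is the comparison that reflects content.

```python
import gzip
import hashlib
import io


def gzip_bytes(payload: bytes, stored_name: str) -> bytes:
    """Gzip payload in memory, recording stored_name in the gzip header."""
    buffer = io.BytesIO()
    # mtime=0 fixes the header timestamp so only the stored name varies.
    with gzip.GzipFile(filename=stored_name, mode="wb", fileobj=buffer, mtime=0) as handle:
        handle.write(payload)
    return buffer.getvalue()


def md5_hex(data: bytes) -> str:
    """Return the hex md5 digest of data."""
    return hashlib.md5(data).hexdigest()


# Made-up stand-in for MAF content; not real data from the release.
payload = b"Hugo_Symbol\tChromosome\tStart_Position\nGENE1\tchr1\t12345\n"

mutect_gz = gzip_bytes(payload, "pbta-tcga-snv-mutect2.vep.maf")
strelka_gz = gzip_bytes(payload, "pbta-tcga-snv-strelka2.vep.maf")

# Hash of the compressed bytes (what md5sum.txt would record).
compressed_match = md5_hex(mutect_gz) == md5_hex(strelka_gz)
# Hash of the decompressed bytes (what the R comparison effectively sees).
content_match = md5_hex(gzip.decompress(mutect_gz)) == md5_hex(gzip.decompress(strelka_gz))
# Difference in compressed size caused by the one-character-longer stored name.
size_gap = len(strelka_gz) - len(mutect_gz)

print(compressed_match, content_match, size_gap)
```

On the real files, the same idea would be to hash the decompressed stream of each file rather than the .gz file itself.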

@tkoganti
Collaborator

Hi @cansavvy, I used the Strelka2 files for Mutect2 by mistake. Sorry about that. I uploaded new Strelka2 and Mutect2 files under the V14 folder here:
https://cavatica.sbgenomics.com/u/cavatica/pbta/files/#q?path=processed-data-merge%2FV14-data&search=pbta-tcga

The Mutect2 file is 3.3 MB and the Strelka2 file is 8.4 MB (file sizes for the new maf.gz files).

@cansavvy
Collaborator Author

@tkoganti, this seems to be telling me I don't have access to the link you posted?

@cansavvy
Collaborator Author

Also, the files before were 8.8 MB, but now neither file is that size?

@tkoganti
Collaborator

tkoganti commented Jan 28, 2020

Both the new and old Strelka2 files have a file size of 8.4 MB (8,805,610 bytes; it showed as 8.4M when I looked at the files using ls -lh).

I did not realize you did not have access to CAVATICA. These files are currently in a staging area, and we will let you know when they are released.
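The 8.8 versus 8.4 figures for the same 8,805,610-byte file are consistent with a difference in units: ls -lh reports sizes in 1024-based units, while many file browsers divide by 1,000,000. Which tool produced the 8.8 figure is not stated in the thread, so that part is an assumption. A quick check of the arithmetic in Python:

```python
size_bytes = 8_805_610  # Strelka2 file size quoted in this thread

decimal_mb = size_bytes / 1000**2  # megabytes, 10^6 bytes each
binary_mib = size_bytes / 1024**2  # mebibytes, 2^20 bytes each

print(f"{decimal_mb:.1f} MB, {binary_mib:.1f} MiB")
```

Both numbers describe the same file, so a size that reads 8.8 in one tool and 8.4 in another does not by itself indicate that the file changed.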

@jharenza
Collaborator

@cansavvy if you have a CAVATICA username, we can add you to the project so you can preview the files. @tkoganti I am seeing that the Mutect2 gzipped maf is 3.3 MB and the Strelka2 gzipped maf is 8.4 MB - is this correct, or should Mutect2 be larger, as alluded to above?

@tkoganti
Collaborator

@jharenza, see below for file sizes:

New and old Strelka2 gzipped MAF files: ~8.4M
Old Mutect2 file: ~8.4M (this was an error)
New Mutect2 file: ~3.3M

@jharenza
Collaborator

jharenza commented Feb 8, 2020

closed via #507

@jharenza jharenza closed this as completed Feb 8, 2020