KAT comp - issue with big genome #165

Open
matryoskina opened this issue Mar 15, 2022 · 7 comments

@matryoskina

Hi, I am trying to calculate the k-mer profile of a 5.0 Gb genome. Here's the command:

kat comp -t 32 -m 17 -o genome1VSgenome2 -h 'fastq1_R1.fastq.gz fastq1_R.fastq.gz fastq2_R1.fastq.gz fastq2_R.fastq.gz fastq3_R1.fastq.gz fastq3_R3.fastq.gz' genome1.fa genome2.fa

The problem is that the genome statistics are not correct: the final genome size estimate ends up being 0.90 Mb, and the plot looks wrong (no peak detected).
I tried different k-mer sizes (17, 21, 51), and I tried setting -H and -I to 1000000000, but neither changed anything.
Do you have any suggestions?
I have attached the log file.
Thanks!
slurm-6387284.txt

@jonwright99
Contributor

Hi, I think you have a problem with your command line. The reads should be the first parameter and the genome the second. You are including a third parameter, which makes comp behave very differently. From the log file, it looks like you are passing one assembly as the first parameter, another assembly as the second, and the reads as the third, which will give odd results.

@matryoskina
Author

Hi,
Thanks for your help! I reran the analysis with only the FASTQ files and one genome, but the problem is still there: no peak was found. Should I increase the k-mer size, or is there something else I am missing? I am attaching the new log file.
Thanks!
slurm-6522577.txt

@jonwright99
Contributor

jonwright99 commented Mar 17, 2022

Is there a plot created? If so, can you post it?

Also, can you rerun without -h? And if you set -H, you will speed up the run, as it won't need to double the hash size many times to find the correct size. I use -H100000000000.

So your command line above should read:
kat comp -t 32 -m 17 -H100000000000 -o genome1VSgenome2 'fastq1_R1.fastq.gz fastq1_R.fastq.gz fastq2_R1.fastq.gz fastq2_R.fastq.gz fastq3_R1.fastq.gz fastq3_R3.fastq.gz' genome1.fa
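To illustrate why a large -H value saves time, here is a minimal Python sketch of the doubling described above. It assumes only that the hash starts at some initial size and doubles until it is large enough. The function name and the starting size in the comment are hypothetical and are not taken from KAT itself.

```python
def doublings_needed(initial_size, required_size):
    """Count how many times a table must double to reach required_size.

    Both arguments are illustrative; this does not model KAT's real
    hashing internals, only the doubling behaviour mentioned above.
    """
    if initial_size <= 0 or required_size <= 0:
        raise ValueError("sizes must be positive")
    count = 0
    size = initial_size
    while size < required_size:
        size *= 2  # one resize step
        count += 1
    return count


# Hypothetical starting size of 100,000,000 versus the -H value used above.
# doublings_needed(100_000_000, 100_000_000_000)
```

Starting at the target size means no resize steps are needed at all, which is the point of passing -H up front.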

@matryoskina
Author

There is no plot created from this job, but I have one from a previous run:
[Image: osph0 7 plot]

@jonwright99
Contributor

There's something very odd about your reads here. Are they paired-end reads? Also, were all the FASTQ files you included in the analysis the ones used to generate the assembly? I've seen this type of plot with no peak where the libraries either were not paired-end or had gone through multiple rounds of PCR before sequencing.

@matryoskina
Author

Yes, the reads are all paired-end. Regarding the assembly, the genome was assembled with long reads, and those short reads were used for misassembly correction. Then I used a Hi-C library (Illumina paired-end) to get the chromosomes. Do you think I should use this library instead? Also, could I just compare two genomes without Illumina reads?
Thanks

@jonwright99
Contributor

Ah, that makes sense now. Do you know roughly the coverage of the paired-end reads that you used for misassembly correction? I'm guessing it's quite low and not enough to generate a peak on the plot. KAT is designed to compare an Illumina read dataset to an assembly generated from that dataset, to show how the k-mer content of the reads is represented in the assembly. Because your datasets were used differently to generate the assembly, the plots are not working as intended.
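A rough way to check the read coverage asked about above is to divide the total number of sequenced bases by the genome size. The Python sketch below is one way to do that, not part of KAT. The function names are hypothetical, and it assumes standard four-line FASTQ records (no wrapped sequence lines), either gzip-compressed or plain.

```python
import gzip


def total_bases(fastq_paths):
    """Sum sequence lengths across FASTQ files (assumes 4-line records)."""
    total = 0
    for path in fastq_paths:
        opener = gzip.open if str(path).endswith(".gz") else open
        with opener(path, "rt") as handle:
            for line_number, line in enumerate(handle):
                if line_number % 4 == 1:  # sequence is the 2nd line of each record
                    total += len(line.strip())
    return total


def estimated_coverage(n_bases, genome_size):
    """Approximate fold coverage: sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_bases / genome_size


# Usage (file names are placeholders; 5.0e9 is the genome size from this thread):
# bases = total_bases(["reads_R1.fastq.gz", "reads_R2.fastq.gz"])
# print(estimated_coverage(bases, 5.0e9))
```

If the result is only a few fold, a missing peak in the k-mer spectrum would be consistent with the low-coverage explanation above.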
