Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WARNING: Please check the final result to confirm whether they are simply flip-flop configurations! #5

Closed
agroppi opened this issue Mar 5, 2019 · 9 comments

Comments

@agroppi
Copy link

agroppi commented Mar 5, 2019

I'm trying to assemble the chloroplast of a prunus.
I have set a prunus_cp file in GetOrganelle/Library/SeqReference
and generated the corresponding blast db (makeblastdb in /gpfs/home/ag/GetOrganelle/Library/NotationReference
But i'm facing this error at the desentangling step :
Disentangling failed: list index out of range
Here is the full log :
get_org.log.txt.zip

@Kinggerm
Copy link
Owner

Kinggerm commented Mar 6, 2019

Hi,

I thought I had made it impossible to use a choice like "prunus_cp", it seemed that I missed adding "exit()" after the printing the ERROR. Thanks for debugging!
The suggested way for you to use your customized blast db is "-F anonym --genes prunus.genes.fasta -s prunus.fasta". The "prunus.genes.fasta" is the fasta file you used to make the blast db. For looking into the error you faced, I need the fasta file you used to make the blast db.
Actually, for species within angiosperms, especially for taxa in Rosaceae which I had a bunch of tests, there is generally no need to using a customized blast db. You can easily use the default plant_cp. Let me know if it does not work for you.
Besides, the automatic word size seemed too small for your data. You can try "-w 65" or "-w 71". Since you used a closely related reference, maximum rounds is no need to be 15. You can try "-R 4" or "-R 5".

Let me know the progress!

@agroppi
Copy link
Author

agroppi commented Mar 6, 2019

Hi,
I first try to run the script with the default options and plant_cp :

get_organelle_reads.py \
-t 20 \
-1 reads_1.fq \
-2 reads_2.fq \
-o /outdir \
-R 15 \
-k 75,85,95,105 \
-F plant_cp

But the resulting assembly was bad :
graph

I will try to
1/ to run the script on plant_cp with w 65 or w 71
2/ to run on prunus_cp by adding the -F option you described

Here is my prunus_cp.fasta
prunus_cp.zip

@Kinggerm
Copy link
Owner

Kinggerm commented Mar 6, 2019

Actually, even if you use the rbcL gene of an angiosperm species Amborella trichopoda (Genbank: NC_005086) as the reference, and try to assemble the dataset sequencing a gymnosperm species Gnetum gnemon (Genbank: ERR268421), you would easily get the right complete circular plastome for Gnetum gnemon. We included similar tests on our upcoming paper. I mean if the data is good enough, the reference should have little effects. However, You may still have the chance to find some reads to close the gap with a more closely related reference, although not a big chance.
If this assembly graph is based on k-mer=95, the base coverage of plastome is really high. It is unusual, but it happens for some data samples (~ 10%). Given this graph, I think you do not need to adjust the -w nor the -R (which is previously suggested for efficiency more than success). You should change the "-k 75,85,95,105" in your options into "-k 35,45,55,65,75".

Let me know the updates!

@Kinggerm
Copy link
Owner

Kinggerm commented Mar 6, 2019

A bit more explanation for using "-F anonym --genes prunus.genes.fasta -s prunus.fasta". (Should be included when preparing for a GetOrganelle manual):

The prunus.fasta file followed with "-s" should be a fasta format sequence file containing your reference(s). Your prunus_cp.fasta file is just in the right form.

The prunus.genes.fasta, however, should be a fasta format file in the following form:

>rbcL gene
ATGC
>matK gene
ATGC

@agroppi
Copy link
Author

agroppi commented Mar 6, 2019

I have run the script on plant_cp with
-k 35,45,55,65,75
I have now a lot of warning like this :
Disentangling failed: Incomplete/Complicated graph

graph in bandage :
graph

Log :
get_org.log.txt.zip

@Kinggerm
Copy link
Owner

Kinggerm commented Mar 7, 2019

Thanks for the graph and log file. There is a big improvement now. Only one break point left (the grey line in the following image).
53895117-9b06cc80-4031-11e9-8363-c6a44f062a69 modified

You could try
further decrease the minimum k-mer, such as "-k 21,35,45,55,65,75" and increase the data used by "--max-reads 3E7" or larger (default is 1.5E7, that is why "Number of reads exceeded 15000000 in NG-6201_A2067_lib13395_1_sequence.good.fq, only top 15000000 reads are used in downstream analysis" is logged).

@agroppi
Copy link
Author

agroppi commented Mar 7, 2019

results with -k 21,35,45,55,65,75 and --max-reads 4E7

graph :
graph
and the log :
get_org.log.txt

One last question :
in the final step I have :
WARNING: More than one circular genome structure produced ...
WARNING: Please check the final result to confirm whether they are simply flip-flop configurations!
(PATH1 and PATH2)
How can I choose the best one ?

Thanks again

@Kinggerm
Copy link
Owner

Kinggerm commented Mar 8, 2019

Looks great now!
For this graph, they are surely flip-flop configurations (Walker et al. 2015). Both of them are right and coexist in plant. For publication, people tend to use just one of them. You can pick the configuration of which the SSC region is in the same direction with most of your data or outgroup, which makes downstream annotation/analysis easier.

Good luck with you study!

@agroppi
Copy link
Author

agroppi commented Mar 8, 2019

Thanks you very much for your help !
I will apply this to chloroplasts close species

@Kinggerm Kinggerm changed the title Disentangling failed: list index out of range A broad range of k-mer gradient would be preferred .. Mar 13, 2019
@Kinggerm Kinggerm changed the title A broad range of k-mer gradient would be preferred .. WARNING: Please check the final result to confirm whether they are simply flip-flop configurations! Jun 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants