Low Busco and Gene count on Cannabis genome #660

Closed
megahitokiri opened this issue Aug 19, 2023 · 7 comments

@megahitokiri

Hi Braker Team,

I am running BRAKER3 to annotate new Cannabis haplotypes, but I am getting a very low BUSCO score (~3%) and only 11,000 to 14,000 predicted genes, where around 40,000 are expected.

I also tried MAKER3 and got the expected number of genes, but the BUSCO score was very low there too. In genome mode, the BUSCO score is around 98%.

Do you have any recommendation on how to increase the BUSCO score and the number of predicted genes?

Here is the script I am running:

PROJECT="AGS106_Hap2"
GENOME="/DATA/home/jmlazaro/Projects/Annotations_Cannabis/CAP_Snakemake_Sundance/MAIN_FASTAs/AGS106_Hap2.GAP.CHR_ID.reviewed.chr_assembled.fasta"
RNA_DATA="/DATA/home/jmlazaro/Projects/Annotations_Cannabis/AGS106_Hap2/AGS106_Hap2/Minimap_Aligned/minimap2.sorted.MAPQ20.dedup.bam"
CPU_Number="48"
PROTEIN_DATASET="/DATA/home/jmlazaro/github/orthodb-clades/BRAKER3_Clades/Viridiplantae.fa"

#export sif
export BRAKER_SIF=$PWD/braker3.sif

wd=$PROJECT

singularity instance start -B ${PWD}:${PWD} ${BRAKER_SIF} Hap2

singularity exec instance://Hap2 braker.pl --genome=$GENOME --bam=$RNA_DATA --softmasking --workingdir=${wd} --GENEMARK_PATH=${ETP}/gmes --prot_seq=$PROTEIN_DATASET --threads $CPU_Number --skip_fixing_broken_genes --gff3 --verbosity 4

singularity instance stop Hap2

My AUGUSTUS hints file is here: https://sunflowergenome.org/annotations-data/assets/data/annotations/Cannabis/augustus.hints.gtf.gz

Thanks for your help or insights.

@KatharinaHoff
Member

It is hard to say what is going wrong here.

However, Cannabis sativa is a case where a reference annotation for another strain exists. Proteins should map well.

I recommend downloading the annotated proteins of Cannabis sativa (all 3 strains that have an annotation), Trema orientale, and Parasponia andersonii, concatenating them, and using the combined file as input for GALBA. If BUSCO scores are still low, visualize the predictions in a genome browser and check what is going wrong. (Do that with the BRAKER and MAKER predictions, too.) Try to visualize the BUSCOs as well.
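The concatenation step can be sketched as follows. This is a minimal sketch, not part of GALBA: the function name `concat_proteins` and the idea of prefixing each header with the source file name (to avoid ID clashes between species) are assumptions made for illustration.

```shell
# Concatenate several protein FASTA files into one stream, prefixing every
# header with the source file name so that identical IDs from different
# annotations stay distinguishable. (Hypothetical helper, not a GALBA tool.)
concat_proteins() {
    for f in "$@"; do
        tag=$(basename "$f")   # e.g. trema.fa
        tag=${tag%%.*}         # strip extensions: trema
        awk -v tag="$tag" '/^>/ { sub(/^>/, ">" tag "|") } { print }' "$f"
    done
}
```

Usage would look like `concat_proteins cannabis1.fa cannabis2.fa trema.fa parasponia.fa > proteins.fa` (file names are placeholders). The combined file could then be given to GALBA as its protein input; the GALBA README documents `--genome` and `--prot_seq` for `galba.pl`, but check the options against the installed version.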

BUSCO doesn't care about repeat masking. Maybe your genome is overmasked? You should see that when you look in a browser at the BUSCOs.
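A quick way to get a first impression of the masking level, before opening a browser, is to measure the share of lowercase bases in the soft-masked assembly. The sketch below assumes soft-masking is encoded as lowercase `a`, `c`, `g`, `t`, `n`; the function name `masked_fraction` is made up for this example, and what counts as "overmasked" is left to the reader's judgment.

```shell
# Print the fraction of soft-masked (lowercase) bases in a FASTA file.
# Assumption: masked bases are lowercase a/c/g/t/n; other lowercase IUPAC
# codes are not counted.
masked_fraction() {
    awk '!/^>/ {
        total += length($0)              # count all bases on this line
        masked += gsub(/[acgtn]/, "")    # gsub returns the number of lowercase hits
    }
    END {
        if (total == 0) { print "0.0000"; exit }
        printf "%.4f\n", masked / total
    }' "$1"
}
```

For example, `masked_fraction genome.softmasked.fa` prints a value between 0 and 1.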

Did you run the new BUSCO with miniprot support? Or compleasm? I would give compleasm a test run to see whether it comes anywhere close to reproducing the genome-level BUSCO scores.
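A compleasm test run on the assembly might look like the sketch below. The wrapper name `compleasm_genome` and the `DRY_RUN` switch are assumptions added for this example; the `run` subcommand and the `-a`, `-l`, `-o`, `-t` options are taken from the compleasm README, and the lineage name has to be chosen by the user.

```shell
# Run compleasm on a genome assembly, or only print the command when
# DRY_RUN=1 is set. (Hypothetical wrapper; lineage is a user choice.)
compleasm_genome() {
    genome=$1; lineage=$2; outdir=$3; threads=${4:-8}
    set -- compleasm run -a "$genome" -l "$lineage" -o "$outdir" -t "$threads"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"    # show what would be executed
    else
        "$@"         # execute compleasm
    fi
}
```

Calling `DRY_RUN=1 compleasm_genome genome.fa eudicots compleasm_out 48` prints the command for inspection before it is run for real.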

KatharinaHoff added the "question" (Further information is requested) label on Aug 21, 2023
@megahitokiri
Author

Thanks for your reply, Katharina. I am still testing the results. I will let you know once it is finished.

@lovelynewGao

Hello, have you resolved this issue? I ran into the same problem when annotating my genome with BRAKER3. Could you please share any advice on how to improve the BUSCO score or the number of predicted genes?

@KatharinaHoff
Member

The number of predicted genes can be changed by re-running TSEBRA manually and enforcing the best previous gene set (e.g. the GeneMark or the AUGUSTUS gene set).
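To decide which gene set to enforce, it helps to compare gene counts across the intermediate GTF files first. The sketch below counts distinct `gene_id` values; the function name `count_genes` is made up here, and it assumes GTF attribute syntax of the form `gene_id "g1";` in column 9.

```shell
# Count distinct gene_id values in a GTF file.
# Assumption: column 9 carries attributes like: transcript_id "g1.t1"; gene_id "g1";
count_genes() {
    awk -F'\t' '!/^#/ && NF >= 9 {
        if (match($9, /gene_id "[^"]+"/)) {
            # skip the 9 characters of: gene_id "   and drop the closing quote
            ids[substr($9, RSTART + 9, RLENGTH - 10)] = 1
        }
    }
    END { n = 0; for (k in ids) n++; print n }' "$1"
}
```

Running it on each candidate file (for example the GeneMark and AUGUSTUS outputs in the BRAKER working directory) gives the numbers to compare. The chosen set can then be passed to TSEBRA as a gene set to keep; the TSEBRA README describes a `-k`/`--keep_gtf` option for this, which should be verified against the installed version.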

The BUSCOs may or may not improve with that.

In the case originally reported here, protein-level BUSCO scores were too low for several pipelines compared to the genome-level BUSCO scores. It is important to be aware that BUSCO (and compleasm) are not gene predictors. They will report the presence of a conserved protein sequence in the genome regardless of whether splice sites are valid, and regardless of whether a valid start and stop codon are associated with it. You can check this by visualizing the BUSCO or compleasm hits in a genome browser, next to a track with gene predictions.
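The difference between "a conserved sequence is present" and "a valid gene model exists" can be illustrated with a simple check on predicted coding sequences. The sketch below is an illustration only and not part of BUSCO, compleasm, or BRAKER: the function name `check_cds` is made up, it assumes a nucleotide FASTA of CDS sequences with the standard stop codons, and it does not look for internal stop codons or splice sites.

```shell
# Label each coding sequence in a FASTA file as complete or incomplete.
# "complete" here means: length divisible by 3, starts with ATG, and ends
# with TAA, TAG, or TGA. Internal stops and splice sites are NOT checked.
check_cds() {
    awk '
    function report() {
        if (id == "") return
        s = toupper(seq)
        ok = (length(s) % 3 == 0) && (substr(s, 1, 3) == "ATG") &&
             (substr(s, length(s) - 2) ~ /^(TAA|TAG|TGA)$/)
        print id "\t" (ok ? "complete" : "incomplete")
    }
    /^>/ { report(); id = substr($1, 2); seq = ""; next }
    { seq = seq $0 }
    END { report() }' "$1"
}
```

A region can contain a conserved protein match and still fail such a check, which is one reason genome-level and protein-level scores can diverge.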

KatharinaHoff self-assigned this on Nov 20, 2023
@KatharinaHoff
Member

In addition to the advice above, I have today added functionality so that BRAKER runs compleasm at the genome level to generate hints for prediction with AUGUSTUS. These changes are currently only in the branch https://github.com/Gaius-Augustus/BRAKER/tree/compleasm . However, this will only pick up complete or duplicated BUSCOs without frameshifts.

I am also working on automating running TSEBRA in a way that minimizes missing BUSCOs, but that's still ongoing work.

@KatharinaHoff
Member

That branch was merged into master. The solution is documented on a poster: https://github.com/Gaius-Augustus/BRAKER/blob/master/docs/posters/poster_PAG2024.pdf

@lovelynewGao

Thank you for your detailed suggestions. I will try them on my data and let you know if I get better results.
