Custom reference for non-human, non-mouse genome #3

Closed
hswhitbeck opened this issue Mar 19, 2021 · 12 comments

Comments

@hswhitbeck

Hi, the ReadMe file says "If you want to use your customs reference, you can use the -gene -te options:". We understood this as being able to use your code on other genomes than the mouse and the human. We tried this command to build the index:
scTE_build -te /path/to/hsal_v8.5_filtered_unique_ids.bed -gene /path/to/hsal_v8.5_genes_update16.gtf -o /path/to/scTE_build_1.idx
and we got the following error message:
scTE_build: error: the following arguments are required: -g/--genome
In the ReadMe file example the -g argument is not supplied for building a custom index. Why is it required? Any tips are appreciated. Thank you.

@jphe
Contributor

jphe commented Mar 20, 2021

We have updated scTE to include genomes for more species, and -g is now optional if the bed/gtf files are given.

@bsierieb1

Thanks for your reply @jphe
We have downloaded the updated version of scTE and now get another error:
ERROR : Counting genome other not supported
We work with an exotic non-model species of insects. Could you please help us generate a custom index for our genome? Would it be possible to share the genome with you so that you could include it in the next update? If this is too much work for you, maybe you could guide us through the process and let us do it ourselves?
Thanks a lot!

@jphe
Contributor

jphe commented Mar 23, 2021

For non-model species, you need to make sure well-annotated files are available for both TEs and genes.

GitHub has a strict file size limit of 100 MB, and genome indices are usually much bigger than that, so we cannot upload the genome indices to GitHub.

If you have a web-accessible FTP server or any other web-accessible tool, you can share the annotation files with us; we will then build the indices and send them to you.

@bsierieb1

bsierieb1 commented Mar 29, 2021

here are the genome and the annotation files.

thank you so much for your help!

@bsierieb1

P.S. you should be able to use the same link to upload the indices. please let us know if there is any issue!

@jphe
Contributor

jphe commented Mar 30, 2021

The FTP folder contains only the gtf file for genes, while scTE also needs an annotation file for TEs.

The gene annotation gtf file seems to be derived from a transcript assembly. We do not recommend such files, because they contain many TE-derived transcripts. scTE assigns reads to genes/transcripts first and only then to TEs, so using such a file for quantification will lead to underestimation of TE expression.

Besides, transcript assembly usually depends heavily on bulk RNA-seq data, while developmental and disease processes are highly heterogeneous. Transcripts from rare cell types are often masked in bulk RNA-seq, which means a transcript assembly built from bulk RNA-seq data may be unreliable for analyzing rare cell types in single-cell data.

If you want to use the assembled transcripts, maybe you can try the strategy of this paper, which quantifies the expression of TE-derived transcripts: https://genome.cshlp.org/content/early/2020/12/21/gr.265173.120.abstract
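
The gene-first assignment order described in this comment can be illustrated with a toy sketch. This is not scTE's implementation; the interval representation, the `assign` function, and all feature names are hypothetical and only show why an annotation containing TE-derived transcripts can absorb reads that would otherwise be counted for a TE.

```python
def overlaps(a, b):
    """True if two half-open intervals (chrom, start, end) overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def assign(read, genes, tes):
    """Toy rule: a read goes to a gene if it overlaps one, otherwise to a TE."""
    for name, interval in genes.items():
        if overlaps(read, interval):
            return ("gene", name)
    for name, interval in tes.items():
        if overlaps(read, interval):
            return ("TE", name)
    return ("unassigned", None)


# Hypothetical features: one TE copy and one read that falls inside it.
tes = {"L1_copy": ("chr1", 1000, 1500)}
read = ("chr1", 1100, 1200)

# A curated annotation with no transcript covering the TE copy.
curated = {"geneA": ("chr1", 5000, 6000)}
# An assembled annotation that also contains a TE-derived transcript.
assembled = dict(curated, TE_derived_tx=("chr1", 900, 1600))

print(assign(read, curated, tes))    # ('TE', 'L1_copy')
print(assign(read, assembled, tes))  # ('gene', 'TE_derived_tx')
```

With the curated annotation the read is counted for the TE; with the assembled annotation the same read is counted for the transcript, so the TE count drops.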

@bsierieb1

sorry, i accidentally copied a link to one of the files instead of the link to the entire drive folder. here is the correct link.

the gene annotations file is not derived from a transcriptome assembly, but i wonder what made you think that? the gene annotations were generated by the NCBI annotation pipeline and further updated by incorporating additional RNA-seq data. the TE annotations are simply the output of RepeatMasker (edited to remove some classes of short features).

@bsierieb1

hi @jphe, do you think you have everything you need? thank you for offering help!

@jphe jphe mentioned this issue Apr 7, 2021
@jphe
Contributor

jphe commented Apr 23, 2021

Sorry for the late reply. We cannot interpret the file properly: we don't know what each column means, as it does not seem to be a standard gtf file. Basically, you need to convert it into canonical gtf format. Alternatively, you can check whether Ensembl has a gtf file for the genome; files from Ensembl should be in canonical gtf format.
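
As a rough aid for the conversion mentioned above, here is a minimal sketch that checks whether lines follow the standard nine-column, tab-separated GTF layout (seqname, source, feature, start, end, score, strand, frame, attribute). The function names are hypothetical, and requiring only `gene_id` in the attribute column is an assumption; which attributes scTE itself reads is not stated in this thread.

```python
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def check_gtf_line(line):
    """Return a list of problems found in one GTF data line (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return [f"expected 9 tab-separated columns, found {len(fields)}"]
    rec = dict(zip(GTF_COLUMNS, fields))
    problems = []
    if not (rec["start"].isdigit() and rec["end"].isdigit()):
        problems.append("start/end are not integers")
    elif int(rec["start"]) > int(rec["end"]):
        problems.append("start is greater than end")
    if rec["strand"] not in {"+", "-", "."}:
        problems.append(f"unexpected strand {rec['strand']!r}")
    # Assumption: only gene_id is checked here.
    if 'gene_id "' not in rec["attribute"]:
        problems.append('attribute column lacks gene_id "..."')
    return problems


def check_gtf(lines, max_report=10):
    """Return up to max_report (line_number, problem) pairs, skipping comments."""
    report = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        for problem in check_gtf_line(line):
            report.append((number, problem))
            if len(report) >= max_report:
                return report
    return report
```

Calling `check_gtf(open("annotation.gtf"))` on a file (the file name is a placeholder) returns the first few problems found, which can help pinpoint which columns need to be rebuilt.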

@akui113

akui113 commented Jul 16, 2021

Hi, the ReadMe file says "If you want to use your customs reference, you can use the -gene -te options:". We understood this as being able to use your code on other genomes than the mouse and the human. We tried this command to build the index:
scTE_build -te /path/to/hsal_v8.5_filtered_unique_ids.bed -gene /path/to/hsal_v8.5_genes_update16.gtf -o /path/to/scTE_build_1.idx
and we got the following error message:
scTE_build: error: the following arguments are required: -g/--genome
In the ReadMe file example the -g argument is not supplied for building a custom index. Why is it required? Any tips are appreciated. Thank you.

@jphe
I also encountered the same problem; the species is Macaca mulatta.
The gene annotation file was downloaded from http://ftp.ensembl.org/pub/release-104/gtf/macaca_mulatta/Macaca_mulatta.Mmul_10.104.gtf.gz,
and the RepeatMasker file was downloaded from http://hgdownload.soe.ucsc.edu/goldenPath/rheMac10/database/rmsk.txt.gz .

Then I converted the RepeatMasker file into a six-column bed file with awk 'BEGIN{FS=OFS="\t"}{print $6,$7,$8,$11,$3,$10}' rmsk.txt > mmul10rmsk.bed and made sure the chromosome names were consistent with the gene annotation file.
Lastly, I built the index with scTE_build -te mmul10rmsk.bed -gene Macaca_mulatta.Mmul_10.104.gtf -o Mmul_10scTE.idx.
However, I got the error ERROR : Counting genome other not supported.

Any tips are appreciated !
Thank you for your generous help!
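
For readers who prefer to do the conversion described in this comment in Python, the sketch below selects the same columns as the awk command (genoName, genoStart, genoEnd, repName, milliDiv, strand from the UCSC rmsk table) and renames chromosomes in the same pass. The `strip_chr` rule is a hypothetical example of making UCSC names match Ensembl-style names; it does not handle cases such as chrM versus MT or unplaced scaffolds, and nothing here addresses the scTE_build error itself.

```python
import csv
import io

# 0-based positions in the UCSC rmsk table:
# 0 bin, 1 swScore, 2 milliDiv, 3 milliDel, 4 milliIns, 5 genoName,
# 6 genoStart, 7 genoEnd, 8 genoLeft, 9 strand, 10 repName, ...


def rmsk_to_bed(rows, rename=lambda chrom: chrom):
    """Yield six-column BED records: chrom, start, end, name, score, strand."""
    for row in rows:
        if len(row) < 11:
            continue  # skip malformed or truncated lines
        yield [rename(row[5]), row[6], row[7], row[10], row[2], row[9]]


def strip_chr(chrom):
    """Hypothetical renaming rule: UCSC 'chr1' becomes Ensembl-style '1'."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def convert(src, dst, rename=strip_chr):
    """Read rmsk rows from src, write BED rows to dst, return the row count."""
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    count = 0
    for record in rmsk_to_bed(reader, rename):
        writer.writerow(record)
        count += 1
    return count


if __name__ == "__main__":
    # One made-up rmsk row, used only to show the column mapping.
    demo = ("585\t1500\t100\t0\t0\tchr1\t1000\t1300\t-5000\t+\t"
            "AluSx\tSINE\tAlu\t1\t300\t0\t1\n")
    out = io.StringIO()
    convert(io.StringIO(demo), out)
    print(out.getvalue(), end="")
```

For real data, `convert` can be given file handles opened on rmsk.txt and the output bed file instead of the in-memory strings used in the demo.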

@antecede

Hello team of authors, and thank you for your beautiful work! Could you please write a guide so that others can create their own custom references for non-model species? That way we could get our results in a timely manner while reducing your workload. Thanks again!
best wishes!

@antecede

Sorry for the late reply, we can not interpretate properly, we don't know what it means for each column, as it seems not a classical gtf file. Basically you need to convert it into a canonical gtf format for the gtf file. Or you can check if Ensemble has the gtf file for the genome, it should be canonical gtf format in Ensemble.

If the research is on a non-model species, there is no canonical gtf in Ensembl. If convenient, please provide an example of a non-Ensembl gtf, or explain how to fill in the missing column content, so that we can build a custom reference.

BR
