Feedback on HiTE Conda Version and Performance Improvement #11

Closed
WangZhSi opened this issue Jul 26, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@WangZhSi

Dear Author,

Thank you for developing such excellent software. It has significantly improved my repetitive sequence annotation workflow. I have a few suggestions and observations that I would like to share:

  1. Conda Version Directory Issue: It seems that the conda version requires entering the software deployment directory due to the use of os.getcwd() in the main script to locate relative paths for subsequent functions. Can I assume this is a compromise for Singularity or Docker versions? However, this approach cannot distinguish between the software environment and the production environment. While I can modify this part myself without affecting the final results, it's just a small suggestion.

  2. Performance Testing: To test the software performance, I ran HiTE on a published Leymus chinensis genome (~8G) using 32 threads, and the analysis took 20 days to complete. This time information is provided for your reference. I noticed in other issues that you do not recommend splitting chromosomes for subsequent analysis. If I want to improve the running speed, is increasing the number of threads the only option?

  3. Annotation Types: HiTE's DNA transposon annotation types are relatively few. Are there plans to add more annotation types in future updates? (White background images are from published articles, black background images are from HiTE annotation results).

image
  4. Custom Repeat Library: I was considering raising a question about using my own repeat library, but I noticed in issue Construction a panTElib #8 that the --curated_lib option is mentioned. Can I use this option to input my own repeat library?

  5. Interruption Handling: I did not encounter any interruptions while using the software. Thank you for providing such a smooth workflow. However, are there any flags or checkpoints that would allow the software to resume from the middle if interrupted unexpectedly, instead of starting over?

Thank you again for your hard work!

@CSU-KangHu
Owner

CSU-KangHu commented Jul 27, 2024

Hello @WangZhSi,

Thank you very much for using HiTE and for your valuable suggestions and feedback, which help make HiTE better. I will respond to your five points one by one:

> 1. Conda Version Directory Issue: It seems that the conda version requires entering the software deployment directory due to the use of os.getcwd() in the main script to locate relative paths for subsequent functions. Can I assume this is a compromise for Singularity or Docker versions? However, this approach cannot distinguish between the software environment and the production environment. While I can modify this part myself without affecting the final results, it's just a small suggestion.

This is indeed an imperfect part of the code implementation, and I will improve this in the next version. Thank you very much for your suggestion.
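One common way to decouple the installation directory from the working directory is to resolve bundled resources relative to the script file rather than to os.getcwd(). The following is a minimal sketch of that general pattern, not HiTE's actual code; the function names and the "tools" / "some_helper" path components are hypothetical.

```python
from pathlib import Path


def install_dir(anchor_file: str) -> Path:
    """Return the directory containing `anchor_file` (typically __file__).

    The result does not depend on the current working directory.
    """
    return Path(anchor_file).resolve().parent


def resource_path(anchor_file: str, *parts: str) -> Path:
    """Build the path to a bundled resource relative to the script location."""
    return install_dir(anchor_file).joinpath(*parts)


# Hypothetical usage inside a main script:
#   helper = resource_path(__file__, "tools", "some_helper")
# The user can then launch the script from any data directory.
```

With this approach the software environment (where the scripts live) and the production environment (where the data and outputs live) can be different directories.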

> 2. Performance Testing: To test the software performance, I ran HiTE on a published Leymus chinensis genome (~8G) using 32 threads, and the analysis took 20 days to complete. This time information is provided for your reference. I noticed in other issues that you do not recommend splitting chromosomes for subsequent analysis. If I want to improve the running speed, is increasing the number of threads the only option?
>
> 4. Custom Repeat Library: I was considering raising a question about using my own repeat library, but I noticed in issue Construction a panTElib #8 that the --curated_lib option is mentioned. Can I use this option to input my own repeat library?

The runtime of HiTE is related not only to the genome size but also to the complexity of the repeat regions. Thank you very much for your patience in waiting 20 days for HiTE to complete. Our previous efforts were mainly focused on genomes of typical sizes. After the publication of HiTE, we realized that many users need to analyze larger genomes, and longer runtimes undoubtedly impact their research progress. We will investigate ways to improve HiTE for larger genomes in the future.

Increasing the number of threads is undoubtedly the most direct way to speed up the process. However, there are two other ways that might help reduce the runtime:

  1. You can try using the --curated_lib parameter to add your own TE library. HiTE will use the curated library to pre-mask the genome, which should reduce the runtime to some extent.
  2. Change the --chunk_size parameter. This will split the input genome into smaller chunks to reduce the computational load of a single HiTE run. The default is 400 MB. You can reduce this value, for example, to 200 MB. The risk is that you might miss low-copy TEs that span across different chunks. Therefore, we recommend avoiding small chunk sizes if the runtime is acceptable.
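To make the chunk-size trade-off concrete, here is a small sketch of the arithmetic. It assumes the "~8G" genome is roughly 8000 MB, and it uses made-up copy positions for a single hypothetical TE family. It only illustrates one reading of the risk described above (copies of a low-copy TE ending up in different chunks); it does not reproduce how HiTE actually processes chunks.

```python
import math
from collections import Counter
from typing import Iterable


def n_chunks(genome_mb: float, chunk_mb: float) -> int:
    """Number of chunks needed to cover a genome of `genome_mb` megabases."""
    return math.ceil(genome_mb / chunk_mb)


def copies_per_chunk(copy_starts_mb: Iterable[float], chunk_mb: float) -> Counter:
    """Count how many copies of one TE family start in each chunk index."""
    return Counter(int(pos // chunk_mb) for pos in copy_starts_mb)


genome_mb = 8000  # assumption: "~8G" taken as about 8000 MB
for chunk_mb in (400, 200):
    print(f"chunk size {chunk_mb} MB -> {n_chunks(genome_mb, chunk_mb)} chunks")

# Hypothetical start positions (in MB) of three copies of a low-copy TE.
copies = [150, 350, 3900]
for chunk_mb in (400, 200):
    print(chunk_mb, dict(copies_per_chunk(copies, chunk_mb)))
```

In this toy case, two of the three copies share a chunk at 400 MB, while at 200 MB every copy lands in a different chunk, so no single chunk sees the family more than once.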

> 3. Annotation Types: HiTE's DNA transposon annotation types are relatively few. Are there plans to add more annotation types in future updates? (White background images are from published articles, black background images are from HiTE annotation results).

The output types in HiTE.tbl are fixed formats from RepeatMasker, providing an overall distribution. I recommend following the method mentioned in #7 to obtain more detailed annotation proportion information. I will consider updating this information in the README to help others with similar needs.

> 5. Interruption Handling: I did not encounter any interruptions while using the software. Thank you for providing such a smooth workflow. However, are there any flags or checkpoints that would allow the software to resume from the middle if interrupted unexpectedly, instead of starting over?

Yes, we provide a --recover parameter, which checks the output directory for existing intermediate files and determines which steps do not need to be rerun.
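The general idea of recovering by checking for existing intermediate files can be sketched as follows. This is a generic illustration of the pattern, not HiTE's implementation: the function name, the mapping of output file names to step functions, and the write-to-temp-then-rename detail are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Dict, List


def run_with_recovery(
    out_dir: Path,
    steps: Dict[str, Callable[[Path], None]],
    recover: bool = True,
) -> List[str]:
    """Run pipeline steps in order, skipping those whose output already exists.

    `steps` maps an output file name to a function that writes that file.
    Returns the names of the output files that were (re)generated.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for filename, step in steps.items():
        target = out_dir / filename
        # A non-empty output file is treated as proof the step completed.
        if recover and target.is_file() and target.stat().st_size > 0:
            continue
        # Write to a temporary name first, so an interrupted step never
        # leaves a partial file that looks like a finished one.
        tmp = out_dir / (filename + ".tmp")
        step(tmp)
        tmp.replace(target)
        executed.append(filename)
    return executed
```

On a rerun after an interruption, only the steps whose outputs are missing are executed again.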

Thank you again for your support of HiTE. We will continue our efforts to make HiTE more useful.

Best regards,

Kang

@CSU-KangHu CSU-KangHu added the enhancement New feature or request label Jul 27, 2024
@WangZhSi
Author

@CSU-KangHu
Hi, me again.

I noticed that the output files do not include the masked genome, and I couldn't find any corresponding settings in the parameters. If I want to obtain a masked genome for further analysis, such as gene prediction, do I need to manually mask the genome again using HiTE.gff?

I don't have any other questions so far; feel free to close this issue whenever you wish.

Thanks!

@CSU-KangHu
Owner

Hi @WangZhSi,

Thank you very much for your suggestions and support. After using --annotate 1, HiTE does generate genome annotations, but we hadn't been retaining the masked genome.

Based on your feedback, we’ve now set the genome.fa.masked file to be kept by default. I’ve just submitted a new commit with this update.
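For anyone on a version that predates this commit, masking by hand from a GFF file, as asked above, could look roughly like the sketch below. It assumes HiTE.gff follows the standard tab-separated GFF layout (sequence ID in column 1, 1-based inclusive start and end in columns 4 and 5). The function names are hypothetical, and the thread does not say whether genome.fa.masked is soft- or hard-masked, so both modes are shown.

```python
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]


def read_gff_intervals(lines: Iterable[str]) -> Dict[str, List[Interval]]:
    """Collect (start, end) intervals per sequence ID from GFF lines."""
    intervals: Dict[str, List[Interval]] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a feature line
        seqid, start, end = fields[0], int(fields[3]), int(fields[4])
        intervals.setdefault(seqid, []).append((start, end))
    return intervals


def mask_sequence(seq: str, intervals: Iterable[Interval], soft: bool = True) -> str:
    """Mask 1-based inclusive intervals: lowercase if soft, otherwise 'N'."""
    chars = list(seq)
    for start, end in intervals:
        lo = max(start - 1, 0)       # convert to 0-based, clamp to sequence
        hi = min(end, len(chars))
        for i in range(lo, hi):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)
```

Gene predictors differ in whether they expect soft-masked (lowercase) or hard-masked (N) input, so the `soft` flag should be chosen to match the downstream tool.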

Thanks again for your support.

Best regards,
Kang
