Module completeness as stand-alone package #19

Open
Alxdu opened this issue Oct 10, 2023 · 4 comments

Comments

@Alxdu

Alxdu commented Oct 10, 2023

First of all, thank you for putting together this really great package.
I find the module completeness assessment really unique, with only a few other lesser options out there (e.g., KeggDecoder). I also like the way you break down the module definitions in .kk files for improved completeness assessment. I therefore look forward to seeing continued support and development for this function.

In my case, I use KO annotations produced by a different pipeline to assess module completeness with KEMET. In theory I would only need the annotation .txt file, but I also have to provide the genome assembly .fasta file to run the script (which is not really needed when running with the --skip_hmm and --skip_gsmm arguments).

If I could make a feature request/suggestion, it would be to separate out the module completeness functionality so that it accepts just KO annotation files (either a path to a single file or a path to a folder for batch operation).

It would also be great to have a stand-alone tool to create module definition .kk files from the official KEGG module .txt files, in case KEMET is no longer continuously supported and the current .kk files become obsolete.
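
To make the request above more concrete, here is a rough Python sketch of the first step such a tool might take: pulling the DEFINITION field out of a KEGG-style module flat file and splitting it into top-level steps. This is not KEMET's code and it does not produce .kk files, whose format is not described in this thread. The function names `extract_definition` and `split_steps`, the `TOY_ENTRY` text, and its KO ids are all hypothetical placeholders, and the handling of wrapped DEFINITION lines is only a precaution.

```python
def extract_definition(flat_text):
    """Return the DEFINITION field of a KEGG-style flat-file entry.

    Assumes field keywords start in the first column and that any
    continuation lines start with whitespace.
    """
    parts = []
    in_field = False
    for line in flat_text.splitlines():
        if line.startswith("DEFINITION"):
            in_field = True
            parts.append(line[len("DEFINITION"):].strip())
        elif in_field and line.startswith(" "):
            parts.append(line.strip())  # continuation of DEFINITION
        elif in_field:
            break  # next field keyword reached
    return " ".join(parts)


def split_steps(definition):
    """Split a definition into steps on spaces that are outside parentheses."""
    steps, current, depth = [], [], 0
    for ch in definition:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == " " and depth == 0:
            if current:
                steps.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        steps.append("".join(current))
    return steps


# Toy entry with placeholder KO ids; NOT a real KEGG module.
TOY_ENTRY = (
    "ENTRY       M_TOY             Module\n"
    "NAME        Toy module for illustration\n"
    "DEFINITION  (K00001,K00002) (K00003+K00004)\n"
    "            K00005\n"
    "ORTHOLOGY   K00001  placeholder\n"
)

definition = extract_definition(TOY_ENTRY)
print(split_steps(definition))
```

Each resulting step string would still need its alternatives (commas), complexes (plus signs), and optional components (minus signs) interpreted, which is presumably where the manual curation effort lies.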

Thank you for giving these some consideration.

@jolespin

I would also like this feature.

@Alxdu have you found any alternatives?

@Matteopaluh
Owner

Matteopaluh commented Oct 17, 2023

Hello both of you,

Indeed, KEMET was originally conceived and structured as 3 separate scripts, but when the manuscript was first submitted to a journal, one reviewer suggested bundling all functions into a single package.

Because of this, the main script was reworked into its present form, but lines 2444-2495 are remnants of the initial concept of Module annotation alone.

I've briefly checked the code of kemet.py and found that a minor code rework would permit a workaround in bash, allowing batch annotation of KOs without FASTA sequences, given the presence of suitable annotation files.

The script does not specifically require FASTA files as input; it uses the names of those files to keep a consistent flow across all operations connected to the same MAGs/genomes.

That is to say, if the --skip_hmm parameter is added, the mandatory FASTA_file argument can be a path to the annotation folder.
A couple of lines of code would be sufficient to rename the variable file_name checked in lines 2452, 2458, and 2464. That way a simple

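# Loop over the annotation files with a glob instead of parsing ls output
for f in PATH/TO/ANNOTATION-FILES/*;
do
    # Pass only the file name, as the original ls-based loop did
    ./kemet.py "$(basename "$f")" -a ANNOTATION_FORMAT --skip_hmm;
done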

should work for batch annotation.

In the meantime, I guess another workaround could be to truncate the names of the KO annotation files with code like:

for f in PATH/TO/ANNOTATION-FILES/*;
do
    # Keep the file name up to its first dot (drops every extension)
    f1="$(basename "$f")";
    f1="${f1%%.*}";
    ./kemet.py "$f1" -a ANNOTATION_FORMAT --skip_hmm;
done

For single-file annotation, instead of pointing to the FASTA file path, it is possible to point to an annotation file, provided the extension is left out.
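
The name truncation described above can also be sketched in Python, which makes it easier to check what will be passed to the script before running anything. This is only an illustration: `stem_before_first_dot` and `build_commands` are hypothetical helper names, the sketch only builds argument lists (it never calls kemet.py), and the annotation format is left as a parameter because the valid values are not listed in this thread.

```python
from pathlib import Path


def stem_before_first_dot(file_name):
    """Mirror the shell expansion ${f%%.*}: drop everything from the first dot."""
    return file_name.split(".", 1)[0]


def build_commands(annotation_dir, annotation_format):
    """Return one kemet.py argument list per annotation file.

    Nothing is executed here; the lists can be inspected first and then
    passed to subprocess.run if they look right.
    """
    commands = []
    for path in sorted(Path(annotation_dir).iterdir()):
        if path.is_file():
            commands.append([
                "./kemet.py",
                stem_before_first_dot(path.name),
                "-a", annotation_format,
                "--skip_hmm",
            ])
    return commands


# A name with several dots is cut at the FIRST dot.
print(stem_before_first_dot("MAG01.v2.annotation.txt"))
```

Because the cut happens at the first dot, genome names that themselves contain dots (for example a version suffix) would be shortened more than intended, so it is worth checking the generated names against the expected MAG/genome identifiers.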

I'll work on the solution I mentioned in this reply, covering both the single-file and batch use cases, as soon as I'm available!

@Alxdu regarding the tool to create module definitions: it could be made available, but it would take a while longer. I already have some code for that, which was used as the backbone for most of the .kk files, but it still needs some manual curation for a minority of them. I was therefore trying to figure out a way to eliminate this manual curation in the code, and in the meantime I updated to the second-to-last KEGG version. I'll also try to do the same for the latest one in the near future.

Best,

Matteo

@jolespin

jolespin commented Oct 17, 2023

@Matteopaluh this is great news. Would it also be possible to include some functionality that takes in something like just a list of KO ids? Something like this:

# Suggested usage: one KO id list per genome, read line by line
while read -r GENOME_ID;
do
    KO_IDS="kofam_results/${GENOME_ID}.ko_ids.list"
    kemet.py "$KO_IDS" -a ko_list > "kemet_results/${GENOME_ID}.mcr.tsv"
done < genomes.list

If you're able to implement this functionality and add the module as a conda package I will incorporate it into my https://github.com/jolespin/veba package. I'm working on the v2 publication right now so your package of course would be cited and properly referenced.

It would be very useful to give KEMET a list of the KO ids present in a particular genome and get an output listing each KEGG module and its module completion ratio (plus any extra data).
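
As an illustration of the kind of computation being requested, here is a minimal Python sketch that takes a list of KO ids and reports a completion ratio per module. It is not KEMET's implementation and does not read .kk files: the `module_completeness` function, the `TOY_MODULES` table, and the step/alternative data layout are all assumptions made for this example, and the KO ids are arbitrary placeholders rather than a real KEGG module definition.

```python
def module_completeness(steps, ko_ids):
    """Return (completed_steps, total_steps, ratio) for one module.

    `steps` is a list of steps; each step is a list of alternatives;
    each alternative is a set of KO ids that must ALL be present.
    A step counts as complete if at least one alternative is satisfied.
    """
    present = set(ko_ids)
    completed = sum(
        1
        for alternatives in steps
        if any(set(alt) <= present for alt in alternatives)
    )
    total = len(steps)
    ratio = completed / total if total else 0.0
    return completed, total, ratio


# Toy module table with placeholder KO ids; NOT real KEGG definitions.
TOY_MODULES = {
    "M_TOY": [
        [{"K00001"}, {"K00002"}],   # step 1: either KO is enough
        [{"K00003", "K00004"}],     # step 2: both subunits required
        [{"K00005"}],               # step 3: single KO
    ],
}

# Hypothetical KO ids annotated in one genome.
genome_kos = ["K00002", "K00003", "K00005"]

for module_id, steps in TOY_MODULES.items():
    done, total, ratio = module_completeness(steps, genome_kos)
    print(f"{module_id}\t{done}/{total}\t{ratio:.2f}")
```

With the placeholder genome above, two of the three toy steps are satisfied, so the sketch prints a ratio of 0.67 for M_TOY. A real tool would load curated module definitions rather than a hard-coded table.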

How difficult would it be on your end to make this type of update?

@Alxdu
Author

Alxdu commented Oct 19, 2023

@Matteopaluh it is excellent to hear you intend to revisit and improve upon the module completeness functionality. I will have a go at your suggested code modifications as a workaround, but I also look forward to your own implementation in upcoming updates. The same goes for the module definition tooling (i.e., rebuilding .kk files).
What you have is original and a fairly unique offering for KEGG users. Nice work.
