Write pickle files in defined location #170
In response to:

From Chase (https://github.com/KwanLab/Autometa/pull/157/files#r638097651)

As @WiscEvan mentioned in a PR, the process below only takes a long time the first time it is run, so it may need to be modified when this issue is addressed.
* Start fixing issue #170: 🎨 lca entrypoint now has updated output parameter; 🍏🎨 update lca.nf with updated entrypoint param
* 🎨🍏🐍 WIP
* 🎨🐛 Update entrypoint parameters; fixes #170: stores pickled data structures for LCA/RMQ to specified directory; 🐛 update entrypoint parameters in autometa.sh workflow for autometa-taxonomy-lca; 🎨🍏 include meta.id in LCA outputs
* 🐛 Replace LCA(...) instantiation outdir param with cache
* 🐛 Replace incorrect variable to prevent passing pd.DataFrame to load(...) func in markers.get(...)
* 🎨 Change process_low to process_medium in prepare_lca.nf
Yes, during the lca.py step a few different data structures are constructed from the NCBI database files. These are used to quickly look up LCA values from the precomputed sparse array in precomputed_lcas.pkl.gz.

Definition of serialized files
These filepaths are defined here:
Autometa/autometa/taxonomy/lca.py
Lines 102 to 108 in 50f7a60
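The change tracked by this issue routes these serialized-file paths through a user-specified cache directory rather than fixed locations. A minimal sketch of that idea (not Autometa's actual code; precomputed_lcas comes from the text above, while the other file names are placeholders for illustration):

```python
import os

def cached_pickle_paths(cache: str) -> dict:
    """Sketch: derive serialized-file locations from a user-supplied
    cache directory instead of hard-coded paths."""
    os.makedirs(cache, exist_ok=True)  # ensure the cache directory exists
    # "precomputed_lcas" is named in the issue; the rest are placeholder names
    names = ("tour", "level", "occurrence", "precomputed_lcas")
    return {name: os.path.join(cache, f"{name}.pkl.gz") for name in names}
```

With a scheme like this, every pickled data structure lands under the one directory the user chose, which is what "stores pickled data structures for LCA/RMQ to specified directory" amounts to.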
Data structure preparation
These are prepared with the prepare_lca method:
Autometa/autometa/taxonomy/lca.py
Lines 276 to 291 in 50f7a60
Use/Look-up of serialized files
Autometa/autometa/taxonomy/lca.py
Lines 293 to 347 in 50f7a60
Additional Notes
Here is a reference from Topcoder that gives a more detailed explanation of this algorithm.
Tree of Life Creation
The tree of life is constructed from the nodes.dmp file in NCBI's taxonomy database, distributed in the compressed taxdump.tar.gz archive. Branches stemming from the root are constructed until the entire tree has been built. Paths between nodes are then built by traversing the tree with an Eulerian tour; during the tour, features of each tax ID (its depth and position in the tour) are stored for sparse table creation.
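The Eulerian tour step can be sketched as follows (a generic illustration of the technique, not Autometa's implementation): every entry into and return to a node is recorded, along with the depth at each visit and each node's first occurrence index in the tour.

```python
def euler_tour(children, root):
    """Euler tour of a rooted tree given as {parent: [children, ...]}.
    Returns (tour, depth, first): the visit sequence, the depth at each
    visit, and each node's first index in the tour."""
    tour, depth, first = [], [], {}
    stack = [(root, 0, iter(children.get(root, ())))]
    while stack:
        node, d, it = stack[-1]
        if node not in first:         # first visit: record entry
            first[node] = len(tour)
            tour.append(node)
            depth.append(d)
        child = next(it, None)
        if child is None:             # all children done: leave the node
            stack.pop()
            if stack:                 # record the return to the parent
                pnode, pd, _ = stack[-1]
                tour.append(pnode)
                depth.append(pd)
        else:                         # descend into the next child
            stack.append((child, d + 1, iter(children.get(child, ()))))
    return tour, depth, first
```

For a tree with n nodes the tour has 2n - 1 entries, and the LCA of two nodes is the shallowest entry between their first occurrences, which is exactly what the sparse table below answers quickly.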
Sparse Table
The depth of each taxonomic ID relative to the rest of the tree is used to store the tree compactly in memory for quick lookup of taxonomic information. The algorithm employs dynamic programming, evaluating each tax ID over progressively larger ranges: starting with the range spanning the tax ID and its closest relative, and doubling until the range spans the tax ID and its furthest relative.
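A minimal sketch of that doubling construction (again generic, not Autometa's code): each row of the table answers "index of the minimum depth" for windows twice as wide as the previous row, built from two overlapping half-windows.

```python
def build_sparse_table(depth):
    """Sparse table over the Euler-tour depth array.
    sparse[j][i] holds the index of the minimum value in depth[i : i + 2**j].
    Dynamic programming: each row combines two half-windows of the last."""
    n = len(depth)
    sparse = [list(range(n))]  # windows of length 1: each index is its own min
    j = 1
    while (1 << j) <= n:
        prev, half = sparse[j - 1], 1 << (j - 1)
        row = []
        for i in range(n - (1 << j) + 1):
            a, b = prev[i], prev[i + half]      # the two half-windows
            row.append(a if depth[a] <= depth[b] else b)
        sparse.append(row)
        j += 1
    return sparse
```

The table costs O(n log n) space and time to build, in exchange for O(1) range-minimum queries afterwards.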
RMQ
Range Minimum Query (RMQ): After the sparse table is generated, the tax IDs of the respective ORFs from the BLAST query are reduced by the RMQ algorithm to determine the LCA. The RMQ algorithm uses the generated tree of tax IDs, the sparse table, and the features of each tax ID, i.e., its depth and location within the tree. Upon receiving the ORF list, the RMQ algorithm examines ORFs in pairs: each pair has an array of tax IDs linking the two, and this array is searched for the tax ID closest to the root, which is their lowest common ancestor. The returned LCA is then paired with the next ORF for a subsequent RMQ, repeating until a final LCA is reached. As more divergent ORFs are introduced, the LCA moves higher in the tree, until in the limit the lowest common ancestor is the root.
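The pairwise reduction described above can be sketched like this (a generic RMQ-based LCA reduction, assuming the tour, depth, first-occurrence, and sparse-table structures described earlier; not Autometa's actual functions):

```python
from functools import reduce

def rmq(sparse, depth, lo, hi):
    """Index of the minimum depth in depth[lo..hi] (inclusive), in O(1),
    using two overlapping power-of-two windows from the sparse table."""
    if lo > hi:
        lo, hi = hi, lo
    j = (hi - lo + 1).bit_length() - 1          # largest 2**j <= range length
    a, b = sparse[j][lo], sparse[j][hi - (1 << j) + 1]
    return a if depth[a] <= depth[b] else b

def lca(u, v, tour, depth, first, sparse):
    """LCA of u and v: the shallowest tour entry between their first visits."""
    i = rmq(sparse, depth, first[u], first[v])
    return tour[i]

def reduce_lca(taxids, tour, depth, first, sparse):
    """Fold a list of tax IDs pairwise down to a single LCA, as described:
    each returned LCA is paired with the next tax ID for another query."""
    return reduce(lambda a, b: lca(a, b, tour, depth, first, sparse), taxids)
```

Because each pairwise query is O(1) after preprocessing, reducing an ORF's full list of hit tax IDs is linear in the number of hits.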
Originally posted by @WiscEvan in #157 (comment)