
Write pickle files in defined location #170

Closed
chasemc opened this issue Jun 3, 2021 · 3 comments · Fixed by #211
Labels: enhancement (New feature or request), nextflow (Nextflow related issues/code), python (Python related issues/code)

Comments

chasemc commented Jun 3, 2021

Yes, during the lca.py step a few different data structures are constructed from the NCBI database files. These are used to quickly look up LCA values from the precomputed sparse array in precomputed_lcas.pkl.gz.

Definition of serialized files

These filepaths are defined here:

self.tour_fp = os.path.join(self.dbdir, "tour.pkl.gz")
self.tour = None
self.level_fp = os.path.join(self.dbdir, "level.pkl.gz")
self.level = None
self.occurrence_fp = os.path.join(self.dbdir, "occurrence.pkl.gz")
self.occurrence = None
self.sparse_fp = os.path.join(self.dbdir, "precomputed_lcas.pkl.gz")
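
For context, these files are ordinary gzip-compressed pickles; below is a minimal sketch of writing and loading one using only the standard library (the dump_gz/load_gz helper names are hypothetical, not Autometa's API):

import gzip
import pickle

def dump_gz(obj, filepath):
    # Serialize obj to a gzip-compressed pickle file.
    with gzip.open(filepath, "wb") as fh:
        pickle.dump(obj, fh)

def load_gz(filepath):
    # Load an object back from a gzip-compressed pickle file.
    with gzip.open(filepath, "rb") as fh:
        return pickle.load(fh)

# e.g. dump_gz(self.tour, self.tour_fp) during preparation, then
# self.tour = load_gz(self.tour_fp) on subsequent runs.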

Data structure preparation

These are prepared with the prepare_lca method:

def prepare_lca(self):
    """Prepare LCA internal data structures for :func:`~lca.LCA.lca`.

    e.g. self.tour, self.level, self.occurrence, self.sparse are all ready.

    Returns
    -------
    NoneType
        Prepares all LCA internals and if successful sets `self.lca_prepared` to True.
    """
    self.prepare_tree()
    self.preprocess_minimums()
    self.lca_prepared = True
    # tour, level, occurrence, sparse all ready
    return

Use/Look-up of serialized files

def lca(self, node1, node2):
    """Performs Range Minimum Query between 2 taxids.

    Parameters
    ----------
    node1 : int
        taxid
    node2 : int
        taxid

    Returns
    -------
    int
        LCA taxid

    Raises
    -------
    ValueError
        Provided taxid is not in the nodes.dmp tree.
    """
    if not self.lca_prepared:
        self.prepare_lca()
    # Handle missing taxids before the membership checks below;
    # otherwise `None not in self.occurrence` raises prematurely.
    if node1 is None and node2 is None:
        return 1
    if node1 is None:
        return node2
    if node2 is None:
        return node1
    if node1 not in self.occurrence:
        raise ValueError(f"{node1} not in tree")
    if node2 not in self.occurrence:
        raise ValueError(f"{node2} not in tree")
    if node1 == node2:
        return node1
    if self.occurrence[node1] < self.occurrence[node2]:
        low = self.occurrence[node1]
        high = self.occurrence[node2]
    else:
        low = self.occurrence[node2]
        high = self.occurrence[node1]
    # equipartition range b/w both nodes.
    cutoff_range = int(np.floor(np.log2(high - low + 1)))
    lower_index = self.sparse[low, cutoff_range]
    upper_index = self.sparse[(high - (2 ** cutoff_range) + 1), cutoff_range]
    lower_index, upper_index = map(int, [lower_index, upper_index])
    lower_range = self.level[lower_index]
    upper_range = self.level[upper_index]
    if lower_range <= upper_range:
        lca_range = lower_range
    else:
        lca_range = upper_range
    lca_node = self.tour[self.level.index(lca_range, low, high)]
    # (parent, child)
    return lca_node[1]
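
A minimal usage sketch (the constructor call is illustrative, not the class's actual signature):

# Assuming `lca` is an LCA instance with dbdir pointing at the NCBI files:
# lca = LCA(dbdir="/path/to/ncbi")  # hypothetical constructor call
ancestor = lca.lca(node1=562, node2=511145)  # two example E. coli taxids
print(ancestor)  # taxid of the pair's lowest common ancestor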

Additional Notes

For a fuller explanation of this algorithm, see the Topcoder tutorial "Range Minimum Query and Lowest Common Ancestor".

Tree of Life Creation

The tree of life is constructed from the nodes.dmp file in NCBI's taxonomy database, distributed in the compressed taxdump.tar.gz archive. Branches stemming from the root are constructed until the entire tree has been built. Paths between nodes are built by traversing the tree with an Eulerian tour. During the Eulerian tour, features of each taxid (its depth and first occurrence in the tour) are stored for sparse table creation.
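
A minimal sketch of the Eulerian tour bookkeeping described above, over a toy tree (the dict-based tree and variable names are illustrative, not Autometa's internals):

# Toy tree: parent -> children; 1 is the root taxid in NCBI's taxonomy.
children = {1: [2, 3], 2: [4, 5], 3: [], 4: [], 5: []}

tour = []        # (parent, node) pairs in visitation order
level = []       # depth of each tour step
occurrence = {}  # first index in the tour at which each node appears

def euler_tour(node, parent, depth):
    occurrence.setdefault(node, len(tour))
    tour.append((parent, node))
    level.append(depth)
    for child in children[node]:
        euler_tour(child, node, depth + 1)
        # record the return to `node` after finishing each subtree
        tour.append((parent, node))
        level.append(depth)

euler_tour(1, 1, 0)
# tour, level and occurrence now play the roles of the pickled
# tour.pkl.gz, level.pkl.gz and occurrence.pkl.gz structures above.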

Sparse Table

Sparse Table: the depth of each taxonomic ID relative to the rest of the tree is used to store the tree efficiently in memory for quick lookup of taxonomic information. The construction uses dynamic programming, assessing each taxid over ranges of other taxids that double in size, starting with the range spanning the taxid and its closest relative and growing to the range spanning the taxid and its furthest relative.
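
A minimal sketch of that doubling construction, building the table over the level array from the Eulerian tour (numpy usage and names are illustrative):

import numpy as np

def build_sparse_table(level):
    # sparse[i, j] holds the index of the minimum of level[i : i + 2**j].
    n = len(level)
    max_j = int(np.floor(np.log2(n))) + 1
    sparse = np.zeros((n, max_j), dtype=int)
    sparse[:, 0] = np.arange(n)  # base case: ranges of length 1
    for j in range(1, max_j):
        width = 2 ** j
        for i in range(n - width + 1):
            # the minimum of a 2**j range is the min of its two 2**(j-1) halves
            left = sparse[i, j - 1]
            right = sparse[i + width // 2, j - 1]
            sparse[i, j] = left if level[left] <= level[right] else right
    return sparse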

RMQ

Range Minimum Query (RMQ): following generation of the sparse table, the respective ORFs from the BLAST query are reduced by the RMQ algorithm to determine the LCA. The RMQ algorithm uses the generated tree of taxids, the sparse table, and the features of each taxid, i.e. its depth and location within the tree. Upon receiving the ORF list, the algorithm examines ORFs in pairs: the array of taxids linking each pair is searched, and the taxid closest to the root, their lowest common ancestor, is returned. A subsequent RMQ is then performed between the returned LCA and the next ORF, repeating until a final LCA is reached. As more divergent ORFs are introduced the LCA climbs higher in the tree, until at the extreme the lowest common ancestor is the root.
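
A minimal sketch of that pairwise reduction, assuming an lca object like the one shown earlier (taxid values are illustrative):

from functools import reduce

# taxids assigned to an ORF's BLAST hits (illustrative values)
orf_taxids = [562, 511145, 316407]

# Fold pairwise: lca(lca(t0, t1), t2), ...; each step is one RMQ, and
# more divergent taxids push the running LCA toward the root.
final_lca = reduce(lca.lca, orf_taxids)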

Originally posted by @WiscEvan in #157 (comment)

chasemc commented Jun 3, 2021

In response to:

I think the Python code pickles some stuff and puts it in the same directory as the NCBI databases, and if I remember correctly that's why some of the Docker calls in Nextflow required write access to the volume?

chasemc added the enhancement, nextflow and python labels Jun 3, 2021
chasemc commented Jun 3, 2021

From Chase (https://github.com/KwanLab/Autometa/pull/157/files#r638097651)

I want to discuss the pickling. I don't remember if I already mentioned it, but I don't think Autometa should be writing the pickle files to the same directory as the database (if I remember correctly, that is what it does). There should probably be a storeDir that it writes to, keeping track of what version/date of a file it has pickled. This would also allow us to change the Docker permissions to read-only.

This should be done in the download database step which is currently in another local branch of mine.
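
A hedged sketch of the suggested behavior: write pickles to a separate cache directory, tagged with the source database's modification date, so the database volume can stay read-only (paths and the naming scheme are hypothetical, not necessarily what the linked PR implements):

import os
import time

def cached_pickle_path(cache_dir, db_fp, name):
    # Place pickles in cache_dir, tagged with the source database's mtime,
    # so a refreshed database yields a new set of pickles.
    os.makedirs(cache_dir, exist_ok=True)
    mtime = time.strftime("%Y%m%d", time.gmtime(os.path.getmtime(db_fp)))
    return os.path.join(cache_dir, f"{name}.{mtime}.pkl.gz")

# e.g. tour_fp = cached_pickle_path("/scratch/autometa_cache", nodes_dmp, "tour")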

chasemc commented Jun 3, 2021

As @WiscEvan mentioned in a PR, the process below is only slow the first time it is run, so it may need to be modified when this issue is addressed:

label 'process_medium'
label 'process_long'

evanroyrees self-assigned this Sep 28, 2021
evanroyrees added a commit that referenced this issue Dec 21, 2021
🎨 lca entrypoint now has updated output parameter
🍏🎨 update lca.nf with updated entrypoint param
evanroyrees added a commit that referenced this issue Jan 11, 2022
fixes #170 stores pickled data structures for LCA/RMQ to specified directory
:bug: Update entrypoint parameters in autometa.sh workflow for autometa-taxonomy-lca
:art::green_apple: Include meta.id in LCA outputs
evanroyrees linked a pull request Jan 11, 2022 that will close this issue
evanroyrees added a commit that referenced this issue Jan 12, 2022
* start to fixing issue-#170
🎨 lca entrypoint now has updated output parameter
🍏🎨 update lca.nf with updated entrypoint param

* 🎨:green-apple:🐍 WIP

* 🎨🐛 Update entrypoint parameters

fixes #170 stores pickled data structures for LCA/RMQ to specified directory
:bug: Update entrypoint parameters in autometa.sh workflow for autometa-taxonomy-lca
:art::green_apple: Include meta.id in LCA outputs

* :bug: Replace LCA(...) instantiation outdir param to cache

* :bug: Replace incorrect variable to prevent passing pd.DataFrame to load(...) func in markers.get(...)

* :art: change process_low to process_medium in prepare_lca.nf