
Write pickle files in defined location #170

Closed
chasemc opened this issue Jun 3, 2021 · 3 comments · Fixed by #211
Labels: enhancement (New feature or request), nextflow (Nextflow related issues/code), python (Python related issues/code)

Comments

chasemc commented Jun 3, 2021

Yes, during the lca.py step a few different data structures are constructed from the NCBI database files. These are used to quickly look up LCA values from the precomputed sparse array in precomputed_lcas.pkl.gz.

Definition of serialized files

These filepaths are defined here:

self.tour_fp = os.path.join(self.dbdir, "tour.pkl.gz")
self.tour = None
self.level_fp = os.path.join(self.dbdir, "level.pkl.gz")
self.level = None
self.occurrence_fp = os.path.join(self.dbdir, "occurrence.pkl.gz")
self.occurrence = None
self.sparse_fp = os.path.join(self.dbdir, "precomputed_lcas.pkl.gz")
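
For context, these files are ordinary gzip-compressed pickles; below is a minimal sketch of writing and loading one using only the standard library (the dump_gz/load_gz helper names are hypothetical, not Autometa's API):

import gzip
import pickle

def dump_gz(obj, filepath):
    # Serialize obj to a gzip-compressed pickle file.
    with gzip.open(filepath, "wb") as fh:
        pickle.dump(obj, fh)

def load_gz(filepath):
    # Load an object back from a gzip-compressed pickle file.
    with gzip.open(filepath, "rb") as fh:
        return pickle.load(fh)

# e.g. dump_gz(self.tour, self.tour_fp) during preparation, then
# self.tour = load_gz(self.tour_fp) on subsequent runs.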

Data structure preparation

These are prepared with the prepare_lca method:

def prepare_lca(self):
    """Prepare LCA internal data structures for :func:`~lca.LCA.lca`.

    e.g. self.tour, self.level, self.occurrence, self.sparse are all ready.

    Returns
    -------
    NoneType
        Prepares all LCA internals and if successful sets `self.lca_prepared` to True.
    """
    self.prepare_tree()
    self.preprocess_minimums()
    self.lca_prepared = True
    # tour, level, occurrence, sparse all ready
    return

Use/Look-up of serialized files

def lca(self, node1, node2):
    """Performs Range Minimum Query between 2 taxids.

    Parameters
    ----------
    node1 : int
        taxid
    node2 : int
        taxid

    Returns
    -------
    int
        LCA taxid

    Raises
    -------
    ValueError
        Provided taxid is not in the nodes.dmp tree.
    """
    if not self.lca_prepared:
        self.prepare_lca()
    # Handle missing taxids before the membership checks below;
    # otherwise `None not in self.occurrence` raises prematurely.
    if node1 is None and node2 is None:
        return 1
    if node1 is None:
        return node2
    if node2 is None:
        return node1
    if node1 not in self.occurrence:
        raise ValueError(f"{node1} not in tree")
    if node2 not in self.occurrence:
        raise ValueError(f"{node2} not in tree")
    if node1 == node2:
        return node1
    if self.occurrence[node1] < self.occurrence[node2]:
        low = self.occurrence[node1]
        high = self.occurrence[node2]
    else:
        low = self.occurrence[node2]
        high = self.occurrence[node1]
    # equipartition range b/w both nodes.
    cutoff_range = int(np.floor(np.log2(high - low + 1)))
    lower_index = self.sparse[low, cutoff_range]
    upper_index = self.sparse[(high - (2 ** cutoff_range) + 1), cutoff_range]
    lower_index, upper_index = map(int, [lower_index, upper_index])
    lower_range = self.level[lower_index]
    upper_range = self.level[upper_index]
    if lower_range <= upper_range:
        lca_range = lower_range
    else:
        lca_range = upper_range
    lca_node = self.tour[self.level.index(lca_range, low, high)]
    # (parent, child)
    return lca_node[1]
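
A minimal usage sketch (the constructor call is illustrative, not the class's actual signature):

# Assuming `lca` is an LCA instance with dbdir pointing at the NCBI files:
# lca = LCA(dbdir="/path/to/ncbi")  # hypothetical constructor call
ancestor = lca.lca(node1=562, node2=511145)  # two example E. coli taxids
print(ancestor)  # taxid of the pair's lowest common ancestor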

Additional Notes

For a fuller explanation of this algorithm, see the Topcoder tutorial "Range Minimum Query and Lowest Common Ancestor".

Tree of Life Creation

The tree of life is constructed from the nodes.dmp file in NCBI's taxonomy database, distributed in the compressed taxdump.tar.gz archive. Branches stemming from the root are constructed until the entire tree has been built. Paths between nodes are built by traversing the tree with an Eulerian tour. During the Eulerian tour, features of each taxid (its depth and first occurrence in the tour) are stored for sparse table creation.
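
A minimal sketch of the Eulerian tour bookkeeping described above, over a toy tree (the dict-based tree and variable names are illustrative, not Autometa's internals):

# Toy tree: parent -> children; 1 is the root taxid in NCBI's taxonomy.
children = {1: [2, 3], 2: [4, 5], 3: [], 4: [], 5: []}

tour = []        # (parent, node) pairs in visitation order
level = []       # depth of each tour step
occurrence = {}  # first index in the tour at which each node appears

def euler_tour(node, parent, depth):
    occurrence.setdefault(node, len(tour))
    tour.append((parent, node))
    level.append(depth)
    for child in children[node]:
        euler_tour(child, node, depth + 1)
        # record the return to `node` after finishing each subtree
        tour.append((parent, node))
        level.append(depth)

euler_tour(1, 1, 0)
# tour, level and occurrence now play the roles of the pickled
# tour.pkl.gz, level.pkl.gz and occurrence.pkl.gz structures above.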

Sparse Table

Sparse Table: the depth of each taxonomic ID relative to the rest of the tree is used to store the tree efficiently in memory for quick lookup of taxonomic information. The construction uses dynamic programming, assessing each taxid over ranges of other taxids that double in size, starting with the range spanning the taxid and its closest relative and growing to the range spanning the taxid and its furthest relative.
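
A minimal sketch of that doubling construction, building the table over the level array from the Eulerian tour (numpy usage and names are illustrative):

import numpy as np

def build_sparse_table(level):
    # sparse[i, j] holds the index of the minimum of level[i : i + 2**j].
    n = len(level)
    max_j = int(np.floor(np.log2(n))) + 1
    sparse = np.zeros((n, max_j), dtype=int)
    sparse[:, 0] = np.arange(n)  # base case: ranges of length 1
    for j in range(1, max_j):
        width = 2 ** j
        for i in range(n - width + 1):
            # the minimum of a 2**j range is the min of its two 2**(j-1) halves
            left = sparse[i, j - 1]
            right = sparse[i + width // 2, j - 1]
            sparse[i, j] = left if level[left] <= level[right] else right
    return sparse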

RMQ

Range Minimum Query (RMQ): following generation of the sparse table, the respective ORFs from the BLAST query are reduced by the RMQ algorithm to determine the LCA. The RMQ algorithm uses the generated tree of taxids, the sparse table, and the features of each taxid, i.e. its depth and location within the tree. Upon receiving the ORF list, the algorithm examines ORFs in pairs: the array of taxids linking each pair is searched, and the taxid closest to the root, their lowest common ancestor, is returned. A subsequent RMQ is then performed between the returned LCA and the next ORF, repeating until a final LCA is reached. As more divergent ORFs are introduced the LCA climbs higher in the tree, until at the extreme the lowest common ancestor is the root.
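
A minimal sketch of that pairwise reduction, assuming an lca object like the one shown earlier (taxid values are illustrative):

from functools import reduce

# taxids assigned to an ORF's BLAST hits (illustrative values)
orf_taxids = [562, 511145, 316407]

# Fold pairwise: lca(lca(t0, t1), t2), ...; each step is one RMQ, and
# more divergent taxids push the running LCA toward the root.
final_lca = reduce(lca.lca, orf_taxids)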

Originally posted by @WiscEvan in #157 (comment)

chasemc commented Jun 3, 2021

In response to:

I think the Python code pickles some stuff and puts it in the same directory as the NCBI databases, and if I remember correctly that's why some of the Docker calls in Nextflow required write access to the volume?

chasemc added the enhancement, nextflow and python labels Jun 3, 2021
chasemc commented Jun 3, 2021

From Chase (https://github.com/KwanLab/Autometa/pull/157/files#r638097651)

I want to discuss the pickling. I don't remember if I already mentioned it, but I don't think Autometa should be writing the pickle files to the same directory as the database (if I remember correctly, that is what it does). There should probably be a storeDir that it writes to, keeping track of what version/date of a file it has pickled. This would also allow us to change the Docker permissions to read-only.

This should be done in the download database step which is currently in another local branch of mine.
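
A hedged sketch of the suggested behavior: write pickles to a separate cache directory, tagged with the source database's modification date, so the database volume can stay read-only (paths and the naming scheme are hypothetical, not necessarily what the linked PR implements):

import os
import time

def cached_pickle_path(cache_dir, db_fp, name):
    # Place pickles in cache_dir, tagged with the source database's mtime,
    # so a refreshed database yields a new set of pickles.
    os.makedirs(cache_dir, exist_ok=True)
    mtime = time.strftime("%Y%m%d", time.gmtime(os.path.getmtime(db_fp)))
    return os.path.join(cache_dir, f"{name}.{mtime}.pkl.gz")

# e.g. tour_fp = cached_pickle_path("/scratch/autometa_cache", nodes_dmp, "tour")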

chasemc commented Jun 3, 2021

As @WiscEvan mentioned in a PR, the process below is only slow the first time it is run, so it may need to be modified when this issue is addressed:

label 'process_medium'
label 'process_long'

evanroyrees self-assigned this Sep 28, 2021
evanroyrees added a commit that referenced this issue Dec 21, 2021
🎨 lca entrypoint now has updated output parameter
🍏🎨 update lca.nf with updated entrypoint param
evanroyrees added a commit that referenced this issue Jan 11, 2022
fixes #170 stores pickled data structures for LCA/RMQ to specified directory
:bug: Update entrypoint parameters in autometa.sh workflow for autometa-taxonomy-lca
:art::green_apple: Include meta.id in LCA outputs
evanroyrees linked a pull request Jan 11, 2022 that will close this issue
evanroyrees added a commit that referenced this issue Jan 12, 2022
* start to fixing issue-#170
🎨 lca entrypoint now has updated output parameter
🍏🎨 update lca.nf with updated entrypoint param

* 🎨:green-apple:🐍 WIP

* 🎨🐛 Update entrypoint parameters

fixes #170 stores pickled data structures for LCA/RMQ to specified directory
:bug: Update entrypoint parameters in autometa.sh workflow for autometa-taxonomy-lca
:art::green_apple: Include meta.id in LCA outputs

* :bug: Replace LCA(...) instantiation outdir param to cache

* :bug: Replace incorrect variable to prevent passing pd.DataFrame to load(...) func in markers.get(...)

* :art: change process_low to process_medium in prepare_lca.nf