31 implement family addition import by nilsoberg · Pull Request #151 · EnzymeFunctionInitiative/EST

nilsoberg · 2025-04-03T16:59:27Z

To fully close this issue, changes were made that close issues #31, #29, and #147.

…dition
- Remove sunburst and stats code
- Remove filter-specific code (moved in previous commit)
- Metadata is managed by EFI::Sequence::Collection and not individual Sources
- Sequences are added to a sequence collection by the source (i.e. the source no longer returns a sequence collection), which allows family addition

- get_sequence_ids outputs 'source' files -> filter_ids
- filter_ids outputs main sequence_metadata and accession_ids files
- Add central sequence typing
- Provide extra error message output to printHelp in EFI::Options
- Config subclasses can update options for use by calling scripts

- get_sequence_ids outputs a metadata file and a UniRef mapping table (used to be accession_ids.txt)
- Filtering removes from metadata as well as UniRef mapping
- Support dual file approach and UniRef in EFI::Sequence::Collection
- Update filter_ids filter order

- Previous accession_ids file is now accession_table to better reflect its purpose
- Output of step 1/step 2 is a sequence IDs file (to replace accession IDs), which is used in the split/get_sequence processes
- import_fasta outputs a file that contains only the sequence IDs that were added from families; this file is passed to the split/get processes
- Support empty files in split/get_sequence processes

rbdavid

Overall, very good. More documentation would be great but isn't the highest priority at the moment. I left a whole bunch of questions and comments; most are low importance. I only noticed two major things:

Major issue: in pipelines/est/import/import_fasta.pl, the boolean checks in the if/elsif/else block may result in no filtering actually being applied.
Major question: in pipelines/est/import/filter_ids.pl, the order in which the filters are applied affects which sequences make it through the "import" step of the Nextflow pipeline. Should that order of operations be changed?

Great work Nils!

Edit: Should have marked this as "Request Changes". Sorry!
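
The order-of-operations question raised in this review can be illustrated with a small sketch. It is not taken from filter_ids.pl (which is a Perl script); the two filter functions, the "keep every Nth ID" fraction rule, and the sample IDs are all hypothetical, chosen only to show that two filters need not commute.

```python
# Hypothetical illustration only: these are NOT the filters implemented in
# filter_ids.pl. The IDs, taxonomy labels, and fraction rule are assumptions.

def fraction_filter(ids, fraction):
    """Keep every `fraction`-th ID, starting with the first one (assumed rule)."""
    return [seq_id for index, seq_id in enumerate(ids) if index % fraction == 0]


def taxonomy_filter(ids, taxonomy, wanted):
    """Keep only the IDs whose taxonomy label equals `wanted`."""
    return [seq_id for seq_id in ids if taxonomy[seq_id] == wanted]


ids = ["A1", "A2", "A3", "A4", "A5", "A6"]
taxonomy = {
    "A1": "Bacteria",
    "A2": "Archaea",
    "A3": "Archaea",
    "A4": "Bacteria",
    "A5": "Archaea",
    "A6": "Bacteria",
}

# Same two filters, applied in opposite orders.
fraction_then_taxonomy = taxonomy_filter(fraction_filter(ids, 2), taxonomy, "Archaea")
taxonomy_then_fraction = fraction_filter(taxonomy_filter(ids, taxonomy, "Archaea"), 2)

print(fraction_then_taxonomy)  # ['A3', 'A5']
print(taxonomy_then_fraction)  # ['A2', 'A5']
```

Under these assumptions the two orderings keep different sequences, which is why the order in which filters are applied has to be a deliberate choice rather than an accident of code layout.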

nilsoberg · 2025-04-15T17:31:25Z

Regarding your comment about the placement of documentation: I made a decision early on to put documentation for public functions in the POD at the end of the file, and to write documentation at the start of the subroutine for internal functions (i.e. those called only within the module). I think that embedding POD at the top of every function would make things cluttered (Perl's POD is very verbose). Admittedly, it is not convenient to have to search the POD for the function documentation (although for those of us who use Vim, the # character and :hsplit work wonders).

rbdavid

Looks good to me. I think all the new issues referred to in the comments have been created.

rbdavid · 2025-04-18T14:39:47Z

I just noticed that test 01d is failing before the nextflow run call starts. This happens because the output directory associated with the test isn't created before the user-defined taxonomy filter file is written. Let's include this user-provided taxonomy filter file in the suite of testing files instead of writing it during the test.

nilsoberg · 2025-04-18T17:42:39Z

> I just noticed that test 01d is failing before the nextflow run call starts. This happens because the output directory associated with the test isn't created before the user-defined taxonomy filter file is written. Let's include this user-provided taxonomy filter file in the suite of testing files instead of writing it during the test.

The advantage of including the filter file in a temporary (e.g. test) directory is that we can write different test cases without having to add files to source control. I will fix the code so that the file is written to the test directory using the test name.
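
A minimal sketch of that approach follows. It is written in Python purely for illustration (the actual test scripts may differ), and the function name, the file name taxonomy_filter.txt, and the placeholder contents are assumptions rather than anything from the repository. The point it shows is that the per-test directory is created before the filter file is written, which is the step that was missing for test 01d.

```python
import tempfile
from pathlib import Path


def write_taxonomy_filter(base_dir, test_name, contents):
    """Write a filter file into a directory named after the test.

    The directory is created first, so the write cannot fail because the
    test's output directory does not exist yet.
    """
    test_dir = Path(base_dir) / test_name
    test_dir.mkdir(parents=True, exist_ok=True)  # create before writing
    filter_path = test_dir / "taxonomy_filter.txt"  # hypothetical file name
    filter_path.write_text(contents)
    return filter_path


# Usage: a temporary base directory stands in for the test output root.
with tempfile.TemporaryDirectory() as base_dir:
    path = write_taxonomy_filter(base_dir, "01d", "placeholder filter contents\n")
    created = path.exists()
    parent_name = path.parent.name
```

Because the file lives under a directory derived from the test name, different test cases can write different filter files without adding any of them to source control.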

nilsoberg added 30 commits February 21, 2025 14:31

Create robust metadata modules

2da5634

- Reader for parsing, with an API rather than raw hash ref access
- Writer for saving, with an API for saving rather than raw hash ref access

Refactor EST import config for clarity

c00eecf

- Code reuse by inheriting from EFI::Options
- Pass necessary hash data to EFI::Import::Source modules rather than Config object
- Removed a lot of unnecessary code in the Config modules

Add sequence collection module for a single file to replace accession…

283ab55

… and metadata files

Centralize output directory argument retrieval for import utilities

b6d7344

Move filter code into separate script

6f56850

Filter using new EFI::Sequence::Collection dual file approach

3aba25c

- Fix filter bugs
- Fix EFI::Sequence API bug

Add sunburst json file output scripts

f51a43d

Add predefined taxonomy filters

0708c9d

Make UniRef sequence collection API consistent

7b7ea2b

Move sequence attribute formatting to EFI::Sequence

fd01b92

Support removal of members of UniRef clusters from a sequence collection

730ae03

Share sequence filtering code between modules

ee5991f

Add UniRef support to filtering and sequence collection

2a7def3

Refactor Nextflow EST pipeline to support UniRef, metadata, and filte…

53af57a

…ring
- Filtering moved to separate process
- UniRef moved to get_sequences process
- Metadata retrieval occurs in filtering
- Add sunburst ID retrieval

Add centralized unknown ID detection capability

b23c5f6

Add unmatched IDs as output from sequence ID retrieval process

3f21cc1

Centralize database schema constants

aeef1b8

Validate UniProt IDs by default when reverse mapping IDs

ea23411

Support parsing option groups

d43cea8

Add documentation and minor code cleanup

2f7ff06

Centralize batch retrieval by using grouped conditions in SQL

6f02a0a

Name test module output directories based on the script file name

18d606c

Add tests for new import features and combinations

6838f7f

Support filter stats in import stats file output

98a59e0

- Statistics can now read/write
- Source outputs a stats file, filter modifies it to the expected import_stats.json

Cleanup to remove unnecessary code

939de8e

Remove family source import option and update tests to reflect that c…

a311858

…hange

nilsoberg linked an issue Apr 3, 2025 that may be closed by this pull request

Implement support for adding sequences from families to import processes #31

Closed

5 tasks

rbdavid reviewed Apr 11, 2025

nilsoberg mentioned this pull request Apr 15, 2025

Document the taxonomy filtering file format #157

Open

rbdavid mentioned this pull request Apr 15, 2025

154 add gnn outputs into color ssn #155

Merged

nilsoberg mentioned this pull request Apr 16, 2025

Use custom character for annotation row separators #159

Open

nilsoberg added 10 commits April 16, 2025 12:56

Code and code comment cleanup

265a0de

Add the invalid path to error messages on import scripts

8bf6e05

Allow user-specified field separator for sequence metadata

339be0e

Fix FASTA import issue

9162787

- import_fasta.pl was importing all sequences regardless of filtering
- Fix this by fixing logic

Remove unnecessary emits in workflow

e90808b

Remove redundant code

422fd14

Clarify UniRef ID association functions

ed458a8

Update and add POD

191b8b8

Add clarifying comments

b9e19ef

Update help text for test environment script

f60d3e0

nilsoberg requested a review from rbdavid April 16, 2025 20:38

Fix undefined reference issue added in previous commit

18a64db

rbdavid mentioned this pull request Apr 17, 2025

Order of operations for applying filters. #161

Closed

rbdavid approved these changes Apr 17, 2025

rbdavid reviewed Apr 18, 2025

Comment thread pipelines/est/est.nf

nilsoberg added 2 commits April 18, 2025 12:53

Update rST documentation for Perl updates

0d29793

Reorganize tests to distinguish between deployment and development tests

c38f288

rbdavid mentioned this pull request Apr 18, 2025

Apply the fraction filter to families only #166

Closed

1 task

nilsoberg linked an issue Apr 18, 2025 that may be closed by this pull request

Add sunburst data to EST pipeline #147

Closed

Merge branch 'nextflow-test' into 31-implement-family-addition-import

2a3b81c

nilsoberg merged commit d4a6278 into nextflow-test Apr 18, 2025

nilsoberg deleted the 31-implement-family-addition-import branch April 18, 2025 19:33
