Skip to content

31 implement family addition import#151

Merged
nilsoberg merged 54 commits into
nextflow-testfrom
31-implement-family-addition-import
Apr 18, 2025
Merged

31 implement family addition import#151
nilsoberg merged 54 commits into
nextflow-testfrom
31-implement-family-addition-import

Conversation

@nilsoberg
Copy link
Copy Markdown
Collaborator

In order to fully close this issue changes were made that close issues #31 , #29 , and #147 .

- Reader for parsing, with an API rather than raw hash ref access
- Writer for saving, with an API for saving rather than raw hash ref access
- Code reuse by inheriting from EFI::Options
- Pass necessary hash data to EFI::Import::Source modules rather than Config object
- Removed a lot of unnecessary code in the Config modules
…dition

- Remove sunburst and stats code
- Remove filter-specific code (moved in previous commit)
- Metadata is managed by EFI::Sequence::Collection and not individual Sources
- Sequences are added to a sequence collection by the source, (i.e. the source
  no longer returns a sequence collection), which allows family addition
- get_sequence_ids outputs 'source' files -> filter_ids
- filter_ids outputs main sequence_metadata and accession_ids files
- Add central sequence typing
- Provide extra error message output to printHelp in EFI::Options
- Config subclasses can update options for use by calling scripts
- get_sequence_ids outputs a metadata file and a UniRef mapping table (used to be accession_ids.txt)
- Filtering removes from metadata as well as UniRef mapping
- Support dual file approach and UniRef in EFI::Sequence::Collection
- Update filter_ids filter order
- Fix filter bugs
- Fix EFI::Sequence api bug
…ring

- Filtering moved to separate process
- UniRef moved to get_sequences process
- Metadata retrieval occurs in filtering
- Add sunburst ID retrieval
- Previous accession_ids file is now accession_table to better reflect purpose
- Output of step 1/step 2 is sequence IDs file (to replace accession IDs), which is
  used in the split/get_sequence processes
- import_fasta outputs a file that contains only sequence IDs that were added from
  families - this gets passed to split/get processes
- Support empty files in split/get_sequence processes
- Statistics can now read/write
- Source outputs a stats file, filter modifies it to the expected import_stats.json
@nilsoberg nilsoberg linked an issue Apr 3, 2025 that may be closed by this pull request
5 tasks
Copy link
Copy Markdown
Contributor

@rbdavid rbdavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, very good. More documentation would be great but isn't highest priority at the moment. I left a whole bunch of questions and comments; most are low importance. I only noticed two major things:

  • Major issue: pipelines/est/import/import_fasta.pl, boolean checks in if, elsif, else block may result in no filtering actually being applied.
  • Major question: pipelines/est/import/filter_ids.pl, order of applying the filters will affect the sequences that make it through the "import" step of the nextflow pipeline. Should that order of operations be changed?

Great work Nils!

Edit: Should have marked this as "Request Changes". Sorry!

Comment thread tests/test_env.sh
Comment thread tests/test_env.sh
Comment thread pipelines/est/est.nf
Comment thread pipelines/est/est.nf Outdated
Comment thread pipelines/est/est.nf Outdated
Comment thread pipelines/est/import/filter_ids.pl
Comment thread pipelines/est/import/get_sunburst_data.pl
Comment thread pipelines/est/import/get_sunburst_data.pl
Comment thread lib/EFI/Import/Config.pm
Comment thread lib/EFI/Import/Filter/Taxonomy.pm Outdated
@nilsoberg
Copy link
Copy Markdown
Collaborator Author

Regarding your comment regarding placement of documentation: I made a decision early on to put documentation for public functions in the POD at the end of the file, and write documentation at the start of a subroutine for internal functions (i.e. those called within the module only). I think that embedding POD at the top of every function would make things cluttered (Perl's POD is very verbose). It is not convenient to have to search the POD for the function documentation (although for us that use Vim using the # character and :hsplit work wonders).

@nilsoberg nilsoberg requested a review from rbdavid April 16, 2025 20:38
Copy link
Copy Markdown
Contributor

@rbdavid rbdavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I think all the new issues referred to in the comments have been created.

Comment thread pipelines/est/est.nf
Comment thread bin/create_est_nextflow_params.py
Comment thread lib/EFI/Sequence/Collection.pm
Comment thread lib/EFI/Sequence/Collection.pm
Comment thread lib/EFI/Sequence/Collection.pm
Comment thread pipelines/est/import/filter_ids.pl
Comment thread lib/EFI/Sunburst/Data.pm
Comment thread lib/EFI/Annotations/Fields.pm
@rbdavid
Copy link
Copy Markdown
Contributor

rbdavid commented Apr 18, 2025

Just noticing, test 01d is failing before starting the nextflow run call. This is happening because the output directory associated with the test isn't created before the user-defined taxonomy filter file is created. Instead, lets include this user-provided taxonomy filter file in the suite of testing files instead of writing it during the test.

Comment thread pipelines/est/est.nf
@nilsoberg
Copy link
Copy Markdown
Collaborator Author

Just noticing, test 01d is failing before starting the nextflow run call. This is happening because the output directory associated with the test isn't created before the user-defined taxonomy filter file is created. Instead, lets include this user-provided taxonomy filter file in the suite of testing files instead of writing it during the test.

The advantage of including the filter file in a temporary (e.g. test) directory is that we can write different test cases without having to add files to source control. I will fix the code so that the file is written to the test directory using the test name.

@nilsoberg nilsoberg linked an issue Apr 18, 2025 that may be closed by this pull request
@nilsoberg nilsoberg merged commit d4a6278 into nextflow-test Apr 18, 2025
@nilsoberg nilsoberg deleted the 31-implement-family-addition-import branch April 18, 2025 19:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add sunburst data to EST pipeline Implement support for adding sequences from families to import processes

2 participants