31 implement family addition import #151
- Reader for parsing, with an API rather than raw hash ref access
- Writer for saving, with an API for saving rather than raw hash ref access

- Code reuse by inheriting from EFI::Options
- Pass necessary hash data to EFI::Import::Source modules rather than Config object
- Removed a lot of unnecessary code in the Config modules
… and metadata files
…dition
- Remove sunburst and stats code
- Remove filter-specific code (moved in previous commit)
- Metadata is managed by EFI::Sequence::Collection and not individual Sources
- Sequences are added to a sequence collection by the source (i.e. the source no longer returns a sequence collection), which allows family addition

- get_sequence_ids outputs 'source' files -> filter_ids
- filter_ids outputs main sequence_metadata and accession_ids files
- Add central sequence typing
- Provide extra error message output to printHelp in EFI::Options
- Config subclasses can update options for use by calling scripts

- get_sequence_ids outputs a metadata file and a UniRef mapping table (used to be accession_ids.txt)
- Filtering removes from metadata as well as UniRef mapping
- Support dual file approach and UniRef in EFI::Sequence::Collection
- Update filter_ids filter order

- Fix filter bugs
- Fix EFI::Sequence API bug

…ring
- Filtering moved to separate process
- UniRef moved to get_sequences process
- Metadata retrieval occurs in filtering
- Add sunburst ID retrieval

- Previous accession_ids file is now accession_table to better reflect purpose
- Output of step 1/step 2 is sequence IDs file (to replace accession IDs), which is used in the split/get_sequence processes
- import_fasta outputs a file that contains only sequence IDs that were added from families; this gets passed to split/get processes
- Support empty files in split/get_sequence processes

- Statistics can now read/write
- Source outputs a stats file, filter modifies it to the expected import_stats.json
Overall, very good. More documentation would be great but isn't highest priority at the moment. I left a whole bunch of questions and comments; most are low importance. I only noticed two major things:
- Major issue: pipelines/est/import/import_fasta.pl, boolean checks in the if/elsif/else block may result in no filtering actually being applied.
- Major question: pipelines/est/import/filter_ids.pl, the order in which the filters are applied will affect which sequences make it through the "import" step of the nextflow pipeline. Should that order of operations be changed?
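To make the ordering question concrete, here is a minimal Python sketch (not the code in filter_ids.pl; the filter functions, IDs, and the "keep every Nth" filter are all hypothetical). It only illustrates that when one filter depends on position in the list rather than on each ID alone, swapping the order changes which sequences survive.

```python
def keep_every_nth(ids, n):
    """Hypothetical position-based filter: keep every nth ID, starting with the first."""
    return ids[::n]


def keep_allowed(ids, allowed):
    """Hypothetical membership filter: keep only IDs found in the allowed set."""
    return [seq_id for seq_id in ids if seq_id in allowed]


# Made-up sequence IDs and a made-up allow list, for illustration only.
ids = ["A1", "A2", "A3", "A4", "A5", "A6"]
allowed = {"A2", "A3", "A4", "A6"}

# Same two filters, applied in opposite orders.
nth_then_allowed = keep_allowed(keep_every_nth(ids, 2), allowed)
allowed_then_nth = keep_every_nth(keep_allowed(ids, allowed), 2)

print(nth_then_allowed)  # ['A3']
print(allowed_then_nth)  # ['A2', 'A4']
```

Filters that test each ID independently can be applied in any order with the same result; the order only matters once a filter depends on what earlier filters left behind.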
Great work Nils!
Edit: Should have marked this as "Request Changes". Sorry!
Regarding your comment on the placement of documentation: I made a decision early on to put documentation for public functions in the POD at the end of the file, and to write documentation at the start of a subroutine for internal functions (i.e. those called within the module only). I think that embedding POD at the top of every function would make things cluttered (Perl's POD is very verbose). It is not convenient to have to search the POD for the function documentation (although for those of us who use Vim, using the …
- import_fasta.pl was importing all sequences regardless of filtering - Fix this by fixing logic
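For readers unfamiliar with this class of bug, the following Python sketch shows one way a branch chain can end up importing everything. It is a hypothetical reconstruction for illustration only: the function names, parameters, and branch conditions are assumptions and do not reproduce the actual logic in import_fasta.pl.

```python
def import_ids_buggy(ids, family_ids=None, taxonomy_ids=None):
    """Hypothetical buggy chain: a single configured filter is silently ignored."""
    if family_ids is None and taxonomy_ids is None:
        return list(ids)
    elif family_ids is not None and taxonomy_ids is not None:
        return [i for i in ids if i in family_ids and i in taxonomy_ids]
    else:
        # Bug: when exactly one filter is configured, this fallback imports all IDs.
        return list(ids)


def import_ids_fixed(ids, family_ids=None, taxonomy_ids=None):
    """Apply each configured filter independently instead of branching on combinations."""
    kept = list(ids)
    if family_ids is not None:
        kept = [i for i in kept if i in family_ids]
    if taxonomy_ids is not None:
        kept = [i for i in kept if i in taxonomy_ids]
    return kept


ids = ["A1", "A2", "A3"]
print(import_ids_buggy(ids, family_ids={"A1"}))  # ['A1', 'A2', 'A3']
print(import_ids_fixed(ids, family_ids={"A1"}))  # ['A1']
```

Checking each filter on its own avoids having to enumerate every combination of options in a single chain.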
rbdavid left a comment
Looks good to me. I think all the new issues referred to in the comments have been created.
I just noticed that test 01d is failing before it reaches the nextflow run call. This happens because the output directory associated with the test isn't created before the user-defined taxonomy filter file is written. Instead, let's include this user-provided taxonomy filter file in the suite of testing files rather than writing it during the test.
The advantage of including the filter file in a temporary (e.g. test) directory is that we can write different test cases without having to add files to source control. I will fix the code so that the file is written to the test directory using the test name.
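The approach described above can be sketched as follows. This is a minimal Python illustration, not the project's test code: the helper name, the file name taxonomy_filter.txt, and the file contents are all assumptions. The point is only that the per-test directory is created before the filter file is written into it.

```python
import tempfile
from pathlib import Path


def write_filter_file(base_dir, test_name, lines):
    """Create the per-test directory if needed, then write the filter file into it."""
    test_dir = Path(base_dir) / test_name
    # Creating the directory first avoids a failure on a fresh test run.
    test_dir.mkdir(parents=True, exist_ok=True)
    filter_file = test_dir / "taxonomy_filter.txt"  # hypothetical file name
    filter_file.write_text("\n".join(lines) + "\n")
    return filter_file


# Usage: a throwaway base directory stands in for the test output root.
with tempfile.TemporaryDirectory() as base:
    path = write_filter_file(base, "test_01d", ["example_filter_line"])
    print(path.parent.name)  # test_01d
```

Because the directory name comes from the test name, each test case gets its own filter file without adding anything to source control.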
In order to fully close this issue, changes were made that close issues #31, #29, and #147.