-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Big review branch before 0.10b #37
base: review-branch
Are you sure you want to change the base?
Conversation
Add license, cmake makefiles, clion project files and the basic application folder layout. Sources: ======================================================================== Add main.cpp/h: - Will create a Starter and a Runner instance, check premises and display errors Add Starter class: - Will evaluate command line options - Will either select the Indexer/ExtractorRunner or the ShowStopperRunner depending on the cli opts - Will also check premises for the selected runner and fail, if the premises could not be validated Add the Runner class: - This is the base class for other Runners (Indexer, Extractor, ShowStopper) - run() must be overriden by specific Runners. This method will be called by the Starter instance - errorMessages / getErrorMessages() can be used to store errors will e.g. be used by the main method to display the errors on stderr - checkPremises() can be overriden to check if everything is setup for a Runner instance (e.g. file existence for the ActualRunner) - Various isXXXRunner() to mimic instancof Add the ActualRunner class: - Base class for IndexerRunner and ExtractorRunner - Has the two fields fastqFile and indexFile - Both fields will be checked in checkPremises(); checkPremises() also adds error entries to the error vector Add the IndexerRunner class: - Will be used to run the fastq indexing Add the ExtractorRunner class: - Has two more fields startLine and lineCount which are needed for extraction Add the CommonStructs.h file: - Contains IndexHeader and IndexEntry structs - Both structs will be used for index files, maybe this will be changed to a custom class IndexFile or so Add the ErrorMessages.h file, which contains error message strings Tests: ======================================================================== Unit tests are based on UnitTest++ which needs to be installed on your local machine Add the TestResourcesAndFunctions class: - Class for helper methods and test related tasks - Can be initialized with the name of the test suite and must be initialized with the test id - This class can be used to create a temp folder which will be deleted when the class destructor is called. Use getTestPath() to get (and create) the folder - filePath() will return the temporary folder extended by the given filename - createEmptyFile() will create the given file in the temporary folder - CreateEmptyFile() will create the given path object. Does not use the temporary folder Add the TestResourcesAndFunctionsTest suite Add test suites for the Starter class and the Runner classes
=== Application files === Add the IndexReader class: - It will be used to read (safely) from a fastq index file. - It uses IndexProcessor as a base class to perform concurrency safe file operations. - To create an instance of IndexReader, use IndexReader::create() - The class is tested with the IndexReaderTest test cases. - When a file is opened, its size is checked and for some issues with ifstream, the current implementation precalculates the amount of stored entries. Add the IndexWriter class: - Used to write (safely) to a fastq index file. - Like the IndexReader, it uses IndexProcessor for safe concurrent file access. - It cannot override existing files. - Write access is exclusive! The IndexWriter will try to aquire exclusive access or fail! - Tested with IndexWriter test cases. Add the IndexProcessor class: - Base class for IndexWriter and IndexReader. - Tries to guarantee safe thread and process concurrent file access to the specified index file. - It allows multiple readers and exclusive writer access. Thus said, if a reader is active, write is not allowed but other readers can read as well. If a write is active, no other writer or reader is allowed. - Lacks an integration test for now. The current impementation relies on boosts interprocess locks functionality. Thread-wide access tests are testes in with the IndexProcessor test cases. Update the Indexer class: - Added the createIndex() method, which will later use the IndexWriter class to write to index files and which is used to create the fastq index. - The class contains some fields, which are exclusively used in tests. To use them, debuggingEnabled needs to be set to true upon creation. Added the ErrorMessages file, which will hold various strings containing error messages. === Test files === In all test files, replace the const string test identifiers by const char*. clang-tidy complained about possible problems with const string. The solution with const char* seems safe. Added more tests to CommonStructsTest. Added the IndexerTest class which currently tests the constructor, createHeader() and createIndex(). The tests all fail at the moment, as the Indexer is not yet functional. Added test files for IndexProcessor, IndexReader and IndexWriter. Added the getResource() method to TestResourcsAndFunctions. Added a test for the method.
Renamed ShowStopperRunner to PrintCLIOptionsRunner. This name is more meaningful. Modify the main method to return the exit code of runner->run() or 1, if the runner could not be started. ErrorAccumulator: - Introduce class ErrorAccumulator to provide basic functionality for error string collection. Make Runner, Indexer and IndexWriter inherit from this class. - The method mergeToNewVector can be used to do a copy merge of two vectors. Rename CommonStructsh to CommonStructsAndConstants.h IndexHeader: - Change order of elements to reduce Byte padding. - Increase total size of header to 512 Byte. - Implement equals and non-equals operator. IndexEntryV1 - Rename struct IndexEntry to IndexEntryV1, as we will keep individual versions when its necessary. - Change order of elements to reduce Byte padding. - Rename variables to match our camel case syntax. - Implement equals and non-equals operator. IndexerRunner: - Add enableDebugging parameter to constructor, will be passed to the indexer - Add a checkPremises method, which will call the super.checkPremises and the one from the indexer - Add getErrorMessages, which will merge its own and the messages from the indexer Indexer: - Create and IndexWriter instance when the object is constructed. - checkPremises will call tryOpen on the indexWriter - One could think of moving the checks in tryOpen to a custom method. Not sure about this yet. - Add getErrorMessages, which will merge its own and the messages from the indexWriter IndexProcessor: - Add the hasLock() method, which will return true if either a read or a write lock was aquired. IndexWriter: - Renamed open() to tryOpen() - Remove the create method. An instance of Indexer needs to be opened with tryOpen() after creation to use it. - Add a writerIsOpen field, which is set when the writer is successfully locked or unlocked. The field is used before write operations to make sure, that no write ops to the ofstream are done when it wasn't opened - Added a destructor, which will flush and close the output stream. - Added some error checks and error messages. ///// Tests Modified tests for IndexHeader and IndexEntryV1, they now contain some sanity checks, which will check the whole structs and each field in them for their size. This way, we hopefully will notice, if someone accidentally changed the structs. Some sort of md5 / hash mechanism could be added, but I do not know how yet. Added tests for the equals methods and made the other tests use == / !=. Added another test for IndexerRunner. The tests will also validate checkPremises(). Added hasLock() checks to the IndexProcessor tests. Removed the compareHeader and compareEntry methods from IndexReaderTest, they are now part of other tests. Added a path variable to the test binary path to main.cpp. Added a test constants file.
Add ZLIB dependency to make files. Reduces CXX_STANDARD to 14, cmake 3.5.3 complained about 17. Change layout of IndexEntryV1 to consume only 8Byte instead of 24. This is possible due to some assumptions: - Entry data will be relative, you need the process the whole file to calculate the absolute values. Index files are small, this is not a real fuzz. - Relative offsets are below 65k, thus an unsigned short is sufficient. - 2Byte overhead is overhead due to memory padding. We store and read the struct and not the individual fields. Adapted tests and methods correspondingly. Extended the Indexer class and make it work: - Tried to modularize the createIndex() method by extracting methods and by pulling fields to the class level. The Indexer is only for single-use, this is safe. - Added some methods: initializeZStream(),readCompressedDataFromStream() checkStreamForBlockEnd(), finalizeProcessingForCurrentBlock(), storeLinesOfCurrentBlockForDebugMode() - Changed the debugging field types. ////// Tests IndexerTest - Added tests for a small and a "large" test file. Manual testing with a 1.2GB FASTQ also works. Or at least it does not crash :D - Added two comments for more tests. TestResourcesAndFunctions - Fixed a bug in getResource(), the method will now access the correct path. - Extended the checks for this.
Made indexing and extraction work. However, to make things work, we need large indices. This is because we need to store a dictionary for each entry in the index. This is currently of size 32kByte and can be reduced by using zlibs compress2/uncompress methods by around 60%. This is not a final version and needs a lot more heavy grinding, e.g: - Code cleanups and unification - More tests and checks on startup - Improve the CLI interface - Change access to FASTQ record access (currently it is working on lines, which might still be kept, but the main purpose of this tool is FASTQ processing) - Improve code to use (much) less memory (Just need to work on different things atm, so at least push it to the repo) Raise minimum cmake to 3.13 Also tried to upgrade Boost from 1.54 to 1.69, which horribly failed. Methods like e.g. exists(path), is_symlink(path) left horrible pointer mess in calling objects. Though we could track this down a bit, we do not know, where the real bug source is located. /////// Source code In CommonStructsAndConstants: - Move several constants to CommonStructsAndConstants. - Add equality (== !=) not (!) and bool() operators to IndexHeader - In IndexEntryV1: * Layout changes from relative to absolute values. * Add the blockID() field to IndexEntryV1. * Rename relativeBlockOffsetInRawFile to blockOffsetInRawFile. * Change types for offsets from ushort to ulong. * Add the 32kByte sized dictionary field which is needed for inflate init in extraction mode. This value should be compressed in a future release (ca 60% storage decrease). The field compressedDictionarySize is already planned but not functional. Added the IndexLine struct, which holds a decompressed and processed IndexEntry. Also this is currently brute force and the whole index is loaded fully to memory. This will need to change in the future to decrease memory consumption. The Starter will now create an ExtractorRunner with proper startline and linecount. In ActualRunner, extend checkPremises() with more checks. Also here, we ran into a lot of trouble when trying to upgrade Boost from 1.54 to 1.69 In IndexerRunner, change indexer to Indexer* type and delete in in the constructor. This and the other changes in the class were done due to the upgrade conflicts of Boost. In ExtractorRunner: - Add checkPremises() and run() - checkPremises() will check the Runner premises + the used extractor premises. - run() will call extractor->extractReadsToCout() In IndexReader: - Add comments. - Add the readerIsOpen field. - Add the readHeader field and make readIndexHeader() private. - Remove the create() method and make the class instantiable via a public constructor. - Rename open() to tryOpenAndReadHeader() and make it public. Add checks and error messages. - Add readIndexFile() method. This will most likely be removed in future versions with a more intelligent index loader approach. - Add readIndexFileV1() method an comments for V2/V3 In Indexer: - This class really needs to be optimized! - Remove constants and use global constants. - Add the calculateIndexBlockInterval() which will calculate some "smart" interval depending on the FASTQ file size. - Change a lot of code. Tried to put it comments, but this really needs improvement. In Extractor: - First: Added the class - Copied some code from Indexer, this should later be merged somehow. - Most important: Add the extractReadsToCout() method which currently lacks checks. The method will jump into the gz file, initialize the inflater, search for the first line to extract and then extract until EOF or extractedLines == lineCount. Also here, the extractor is based on lines rather than records. This should be changed. /////// Tests Change a lot of the tests to match new constructors. In CommonStructsTest: - Change tests for IndexHeader and IndexEntry to match new sizes. - Add tests for new operators() Extend ExtractorRunnerTest to cover one more test case. In IndexerTest: - Add test for CalculateBlockInterval. - Change most tests so, that the Indexer is created with -1 for the block interval to request block interval calculation. (Test create large index will use a block interval of 4.) - Extend test create large index to check some index entries for validity. Values for this were calculated using the original zran.c code. Also put in an extraction test which compares decompressed with original data. Add ExtractorTest: - Add two non test methods runRangedExtractionTest() and runComplexDecompressionTest() which will be called by TEST_CREATE_EXTRACTOR_AND_EXTRACT_SMALL_TO_COUT to run a series of index / extract processes and compare extracted data to decompressed original data. Extend StarterTest to use a complete set of parameters for extraction.
Started to use "using namespace::class" in some cases. Change a lot of minor things due to clang-tidy complaints. Changed ulong to u_int64_t in the whole application to guarantee a length of 8 Byte unsigned values. ulong itself is not guaranteed to be of length 8 Byte (e.g. Microsoft says it is 4 Bytes long). Also changed some of the uint's to u_int32_t though this is not necessary. Changed the WINDOW_SIZE / CHUNK_SIZE to #defines instead of const uint, with the old code, the chunk[CHUNK_SIZE]{0} syntax did not work and needed an additional memset. Renamed IndexLine to IndexEntry. IndexEntry is a decompressed and processed index entry representation for all the versioned IndexEntryV[n] structs. This is to docouple index reading and writing from internal processes. Added the toIndexEntry() method to IndexEntryV1. Created the ZLibBasedFASTQProcessorBaseClass class as a mutually used base class for the Indexer and Extractor classes. The class stores a list of variables and methods used for both subclasses. Cleanup Indexer and Extractor and make them more readable. Though a lot more work could be done here. Also unify the buildIndex() and extract() methods so they look a lot alike. This should be base for further refactoring. Moved a lot of values from both classes to the new base class. The Extractor will now read in IndexEntry-wise until the proper IndexEntry is found. This way, we do not need to read in the whole index which makes sense for memory consumption. Big FASTQ files will have big index files and the Extractor shall use only few memory. Changed readIndexFileV1() to return a vector of IndexEntryV1 and create the readIndexEntry() method. /// Tests Add ZLibBasedFASTQProcessorBaseClassTest file with a mock class to enable testing of the class. Move the test for z_stream init to this file.
Add installation / compilation instructions. Usage will follow later. Add a description and links to libraries.
Extend installation and build instructions. Build was tested on my local machine and on our HPC env. Though I could get cmake running with all the custom libraries, g++ always failed because 'inflateGetDictionary' seems to be missing (it is available in zlib 1.2.8 / 1.2.11).
Tested the code several times with valgrind to get rid of memory leaks. Extend installation and build instructions. Build was tested on my local machine and on our HPC environment. After a lot of tries, it finally builds and runs. Removed Boost dependency. * Replace boost::filesystem by experimental::filesystem classes and functions where necessary. * Replace boost command line interface with tclap library. * Replace boost shared pointers with stdlib shared pointers. * Replace the boost splitstring stuff with a custom method splitStr() in ZLibBasedFASTQProcessorBaseClass. * Replace the boost::mutex class with std::mutex and boost::mutex::scoped_lock with std::lock_guard. Use Linux's flock to lock files shared or exclusive also over NFS. Override getErrorMessages() in ExtractorRunner class. Add blockInterval to IndexerRunner constructor. Add some more error messages, when file open / close fails. Rename the PrintCLIOptionsRunner to DoNothingRunner. It actually does nothing, so the name is more precise. Fully implement new command line interface using tclap. We tried to mimic the samtools command line interface as far as possible. This is because we work with FASTQ files. Modify and extend tests where necessary.
The conda environment contains all necessary tools and libraries except UnitTest++, this still needs to be downloaded and installed manually. Adapted the CMake files to make them aware of an existing (and active) environment. Use TCLAP 1.2.1 instead of 1.2.2. This version is in Conda.
Testing TCLAP (which is eventually done in the Starter tests) fails, because TCLAP calls exit() effectively stopping all tests, when the command line parameters cannot be parsed. This leads to the bad effect, that we can only run valid tests but no tests with invalid command line parameters. To at least get a message, that something went wrong and that the test binary exited unintentionally, I put in a switch after the tests are finished. The switch is recognized by a trap to exit() and a message is printed to the command line to indicate an error, if the switch is not set properly.
-D enables the indexer or extractor debug mode (e.g. for IDE's) -w allows overwriting of an existing .fqi file Add one integration test, which calls the application itself with parameters. Rename the .idx to .fqi to look more like .fai or .bai
The files are working on our LSF environments. Basically you call fqimasstestLSF.sh with a list of FASTQ files and an output directory to where the index files will be written. The script fqitest.sh is called for each FASTQ. Note, that the script will extract the FASTQ to the local scratch storage. Keep in mind, that these scripts are more an example and will not necessarily work on your platform.
I renamed .idx to .fqi in an earlier commit. This lead to SigSevs in the tests, which is now corrected by renaming the test.idx files. Fixed some typos in the test scripts.
Index and extract will now work with concatenated gz files, tests are included. Also reworked the test structure for the extractor tests. TODO Indexer and Extractor need a lot of rework to make the code more readable.
Add the possibility to stream in a FASTQ file for indexing. Internally, I created the InputSource class for this, which is inherited by the PathInputSource and the StreamInputSource classes. This way, both classes share a common interface and can be treated equally. As we are working with (char/byte)buffers, it is more convenient to have a C-like interface with methods like read(), skip() and so on. As the previous code was based on path instances, a lot of changes were necessary to adapt to the new structure. Note, that there are some tests, which still fail. They need to be implemented. Sources ======= ActualRunner: - Add additional constructors with a path and a shared pointer of type InputSource. - Change the checkPremises method to be aware of the different input source types. - Add allowsReadFromStreamedSource() which needs to be overriden, if necessary. ExtractorRunner: - Make it use a PathInputSource for input instead of a path. NOTE: In difference to the indexer, the extractor is not allowed to read from a stream (it would also not make any sense.) IndexerRunner: - Reformated the constructor code and added a second constructor. - Override the allowsReadFromStreamedSource() method to return true. Extractor: - Add the resultFile, useFile and forceOverwrite fields. This allows the extractor to either write to stdout or to a result file. - Extend the constructor to be aware of the new fields. - Extend checkPremises to be aware of the changes. - Adapt the extractor to use the new (Path)InputSource instead of path. Indexer: - Make the Indexer use an InputSource instance instead of a path. - Make the indexing process use the new InputSource methods. Starter: - The Starter will now recognize "-" as a valid parameter for a piped FASTQ file in index mode and for the extracted text in extract mode. Added the InputSource, PathInputSource and StreamInputSource classes. Tests ===== Made the test classes use the new Indexer/Extractor(Runner) constructors Added tests for the new InputSource classes and updated the Indexer tests to also test with streamed data. Replaced several duplicate strings by constants. Disabled the integration tests for now. Streamed data can be tested with e.g. ifstream instances. TestResourcesAndFunctions: - Add a format string method. However, this method is not very safe and really only for test purposes. It should not be used in production code! It is ok for tests though. - Add the runCommand(), runIndexerBinary(), runExtractorBinary(), extractGZFile() and createConcatenatedFile() methods to remove code duplication and simplify tests.
Allow to set some verbosity with -v/-verbosity from 0(default) to 3. Actually only level 3 is used. Add one more message to the end of the indexing process. Fix the StreamInputSource by make it using different methods for reading and checking eof() (sync(), read(), gcount(), peek(), eof()). Add some logger methods to ErrorAccumulator. This is nothing spectacular but offers a very small amount of logging. Moved method bodies from ErrorAccumulator header to source file. Added some more tests and set verbosity for the testapp to 3.
** NO TESTS YET! ** The stats mode will display some basic information about a given index file. You can also select a range of index entries to display.
Fix the binary in such a way, that it can handle "super-concatenated" FASTQ files. E.g. we had FASTQ files with a size of around 3,3GB which consisted of approximately 170k concatenated gzip files. Extractor class: - In some cases, the extractor did not properly extract the first line in the selected range of lines. To prevent this, the extractor will now extract the amount of [multiplier] previous lines and skip them on output. - Fixed a big mem leak: I did not reinitialize the inflate stream properly and a lot of memory was allocated but not freed. - Fixed some issues, when no line was in an extracted block. - The seek into the concatenated file was moved before the check for readability to ensure, that it is properly detected, if a file still can be read. Indexer class: - Refactor the index method and extract some methods. - The Indexer can now process the mentioned super-concatenated files where e.g. thousands of single files of less than 40k were concatenated. - Count number of concatenated files and output this at the end of the Indexing step. - The seek into the concatenated file was moved before the check for readability to ensure, that it is properly detected, if a file still can be read. The IndexStatsRunner also shows the starting record id (effectively the starting line number divided by 4. Modify the PathInputSource to use C++ streams instead of the C FILE API. In IndexerTest: - Change some CHECK_EQUAL tests to CHECK. I do not why, but C_E throws segfaults now. Might depend on the compiler version. - Added a test for concatenated FASTQ files with really small files, each smaller than a block.
Add two new indexer options: - -F will tell the Indexer to NOT write out the index file. This is to get some basic statistics about what could happen. - -S will disable the new failsafe distance. Use this, if you think, that you do not need this. However, it is more for testing. Fix Indexer behaviour for files with very small blocks by introducing a failsafe distance. It is calculated by (block distance) * 16kByte and will tell the indexer, that it should not only look at the block distance but also on the offset distance between the current and the latest written index entry. Also do some code refactorings to make the Indexer and Extractor code a bit more readable.
The Starter now parses the variables for extract BEFORE we access the indexFile and debugMode parameters. Extract a method to print an index entry to the console and put it in IndexStatsRunner. Output some minor statistics when extraction gets started. Also fix a bug in the indexer. The first index entry was not stored, when failsafe distance was on. Add A LOT of build options for debug mode. These should be handled with care to reduce warnings and increase code robustness.
Remove debug() output commands from a lot of places. They are actually more confusing than helpful. Indexer class - Changed the default block interval from 512 to 2048 - Added the postponeWrite field. This comes into play, when an empty block occurs. Empty blocks will not be used for index entries anymore as they caused a lot of problems leading to faulty results. - Added a lot of debug code which can help a developer to better understand, how the index is created. Fields and methods are: - writeOutOfPartialDecompressedBlocks - storageForDecompressedBlocks - writeOutOfPartialDecompressedBlocks - storageForPartialDecompressedBlocks - partialBlockinfoStream - enableWriteOutOfDecompressedBlocksAndStatistics() - enableWriteOutOfPartialDecompressedBlocks() - Extracted the methods createIndexEntryFromBlockData(), storeDictionaryForEntry() and writeIndexEntryIfPossible() to make the code more readable. - Reworked finalizeProcessingForCurrentBlock() to make it more readable and also make it use and therefore output the above mentioned debug information. Extractor class - Added the roundtripBuffer field, which will be used later to enable: - error recognition, if the extract data is not made up of extractionMultiplier sized records - complete output with at least empty entries for a record. However, this is not yet used. - Removed the skipLines field, as the whole mechanism was reworked and both index and extract should now produce proper results. - Reworked processDecompressedChunkOfData(), it is now much easier to read and a lot less complicated. Reworked the Starter to accept more command line options and pass them to the specific runners. Changed ZLibBasedFASTQProcessorBaseClass.splitStr so, that it does no longer add an empty line to the back of the list, if the last char is a newline character. TestResourcesAndFunctions - Add readLinesOfFile(), readFile(), readLinesOfResourceFile(), readResourceFile(), compareVectorContent() and getTestVectorWithSimulatedBlockData() to reduce code duplications and improve test readability. Code is taken from IndexerTest and ExtractorTest. - The method getTestVectorWithSimulatedBlockData() will return a vector with files for simulating decompressed gzip block data. This tries to mimic common pitfalls like small blocks, empty blocks, blocks without newlines and so on. IndexerTest - Add a test for correct block line counting which uses the aforementioned block data from getTestVectorWithSimulatedBlockData() - Move code to the aforementioned methods readLinesOfFile() and so on... ExtractorTest - Analogous to IndexerTest, add a test for correct line from block data extraction. This will also use getTestVectorWithSimulatedBlockData() - Like in IndexerTest, use the extracted methods for file reading and vector comparison. Add the blockAndLineCalculations test resource directory. This contains several files with "decompressed" block data and a file which describes the layout of the joined block files.
to allow easy publication of the standalone binary.
These strategies are used to decided whether an IndexEntry will be stored or not. The current block based (safeguarded) approach will still be usable and is encapsulated in the BlockDistanceStorageStrategy class. Add the IndexEntryStorageStrategy base class. Add the ByteDistanceStorageStrategy class to allow the user to set a Byte value like 1T, 2g, 3m and 4K as a minimum distance between index entries. Rename the field offsetInRawFile to blockOffsetInRawFile in IndexEntry to match IndexEntryV1. Move the method calculateIndexBlockInterval from Indexer to the new IndexEntryStorageStrategy class. Add a test class for the new strategy classes.
Add s3 support
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these are suggestions.
@@ -0,0 +1,33 @@ | |||
name: FastqIndEx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The env doesn't work.
S3 is not included.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The included library did not work and I needed to download the AWS SDK. This is described in the README
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt this is acceptible for a deployment via Conda.
}; | ||
|
||
/** | ||
* Represents an entry in a gz index file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think for explanation would it be helpful to have e.g. a plant-UML diagram of the the blocks, and all other concepts related to gzip format and indexes and thus explain all the terms? Maybe we can make a sketch that I can translate into digital diagram (or we just scan the sketch!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense
In CommonStructsAndConstants: - Add constants for Byte sizes like k(ilo)B(yte), M(ega)B(yte) In IndexEntryStorageStrategy: - Change the type of blockInterval and minIndexEntryByteDistance to signed. - Add a getDefault() and a from() method for both strategies - Add calculateDistanceBasedOnFileSize to ByteDistanceStorageStrategy - Implemente useFilesizeForCalculation in ByteDistanceStorageStrategy In Indexer and IndexerRunner: - Remove the blockInterval and disableFailsafeDistance parameters from the constructor and add the storageStrategy parameter. In ModeCLIParser: - Remove argumentToPath() - Add createCommandLineParser() - Add createAllowedModeArg() and use this method in the parse methods of the subclasses. In ModeCLIParser subclasses: - Rework the parse() methods - Use the new createAllowedModeArg() method. In IndexModeCLIParser: - Inline some strings back to the create..Args methods. - Add an argument to parse the storage strategy and the byte distance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent error checking and reporting! Consider making the messages shorter and with more contextual information, e.g. the condition including some values that created the error and maybe the filename (in fatal errors). Consider filename only at the end, when the program actually stops (filenames can get reallly long). Avoid duplicate messages for different conditions -- messages should be uniquely findable in code (that's how I navigate to the erring code).
We need to discuss the zlib stuff -- after I am through the tests. I am going to pull to tests forward in the review, maybe I'm to reduce the number of stupid comments by me :-P
There are some TODOs left which are marked with // TODO Reenabled compilation flags for debug code. Changed a lot of uint64 values to signed type. The signed type still offers us the possibility to work with tremendously huge files and it is easier in some operations to have a signed value like -1 for a file size query (not existing, stream, whatever). Removed unnecessary imports throughout the project. Changed a lot of code so, that equal operations are correct. See to it: - that compared types are both signed or unsigned - compared types have the right sizes 32Bit < 64Bit and 64Bit<64Bit Changed method parameter types to const ref where feasible. Changed auto loop parameters to const ref where necessary. Refactored Extractor::extract() and pulled out some methods to make it more readable. Renamed checkPremises throughout the whole project to fulfillsPremises. Changed "old-style" casts to e.g. static_cast / reinterpret_cast over all files. Added more tests for the index entry storage strategies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly skimmed over the recently changed parts but ignored for now the new parts. I will continue in the big review with these.
src/process/index/IndexWriter.h
Outdated
/** | ||
* Must be set after the the index process via setNumberOfLinesInFile | ||
*/ | ||
u_int64_t numberOfLinesInFile{0}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really a field of the IndexWriter
? Should it not be in the index header? Same for numberOfWrittenEntries
(or numberOfEntries
-- this concerns the index entries, right? Not the FASTQ records.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be continued ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going to continue with the tests! Yeah! But not today.
usedCredentialsFile = finalCredentialsFile; | ||
usedConfigFile = finalConfigurationFile; | ||
} else { | ||
addErrorMessage("Could not validate configuration for S3."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which actual configuration was not validated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole configuration. But you're right, we can improve this.
return Starter::instance; | ||
} | ||
|
||
DoNothingRunner *Starter::assembleSmallCmdLineParserAndParseOpts(int argc, const char *argv[]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me its not obvious what the method name has to do with the function of the method? You're not only assembling but actually run the parser. Assembling actually is a method internal technical detail. Maybe parseSmallCmdLineOpts
? Or parseBaseCmdLine
(not 100% sure what the "small" means)?
0, cmdLineParser.get()); | ||
|
||
auto numberOfReadsArg = _makeUIntValueArg( | ||
"n", "numberofentries", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"n", "numberofentries", | |
"n", "numberOfEntries", |
src/startup/ModeCLIParser.h
Outdated
} | ||
|
||
/** | ||
* Processes several FASTQ related parameters and create a FASTQ PathSource from the input. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Processes several FASTQ related parameters and create a FASTQ PathSource from the input. | |
* Processes several FASTQ related parameters and create a FASTQ Source from the input. |
Added a notes section to Contributing.md Added the s3iomain.cpp and entries to the CMakefile to compile the GetObject helper binary. Starter is not a Singleton anymore. Added IOHelper::getApplicationPath() to get the absolute path of the running application. This is used to run the helper binary. Added the S3GetObjectProcessWrapper, which is used to run the helper binary in a separate process and thread. Moved the async GetObject code from S3Source to the helper binary and made it a bit more readable. Added unsigned (U) suffixes to some numbers in tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, we should try to remove all references to FASTQ from the tool code, if it is useful for any data.
// Utilizing current_path() will result in the path of current test-binary, which will be e.g.: | ||
// ~/Projects/FastqIndEx/cmake-build-debug/test/testapp | ||
// As we do want resources to be loaded from: | ||
// ~/Projects/FastqIndEx/cmake-build-debug/test/testapp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line does not differ from the previous path line. What do you mean?
// ~/Projects/FastqIndEx/cmake-build-debug/test/testapp | |
// ~/Projects/FastqIndEx/test/resources/ |
???
|
||
void TestResourcesAndFunctions::CreateEmptyFile(const path &_path) { | ||
|
||
ofstream streamTest("/tmp/sometestfile"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. Not going to accept this. A hard-coded path for test data ...
std::string TestResourcesAndFunctions::format(std::string format, ...) { | ||
va_list args; | ||
va_start(args, format); | ||
size_t len = std::vsnprintf(nullptr, 0, format.c_str(), args);// TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What TODO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are all the CHECK calls indented much more?
Phuhhh. What a tour de force. I am through (except for the most recent changes in S3 stuff.
*/ | ||
bool TestResourcesAndFunctions::runIndexerBinary(const path &fastq, const path &index, bool pipeFastq) { | ||
if (!pipeFastq) { | ||
return runCommand( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
o we really have to run external commands for testing? Can this not be avoided by encapsulating the top-level code, which is usually called from main, but in the test configured to run from the test?
The parsing of CLI parameters should be testable independently by real unit tests.
return str; | ||
} | ||
|
||
bool TestResourcesAndFunctions::compareVectorContent(const vector<string> &reference, const vector<string> &actual, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this there is certainly a C++ standard-lib function.
// Utilizing current_path() will result in the path of current test-binary, which will be e.g.: | ||
// ~/Projects/FastqIndEx/cmake-build-debug/test/testapp | ||
// As we do want resources to be loaded from: | ||
// ~/Projects/FastqIndEx/cmake-build-debug/test/testapp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dito
BTW: The code is not DRY, here.
StreamSource Source(&testData); | ||
CHECK(Source.isStream()); | ||
|
||
int64_t expectedSize = -1; // Use overflow to get the max value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be a constant in the C++ standard-lib. I am not sure whether this overflow approach is portable.
CHECK(!p->canRead()); | ||
CHECK(p->canWrite()); | ||
|
||
// auto p2 = S3Sink(res.createEmptyFile("noOverwrite")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented?
CHECK(p->tell() == static_cast<int64_t>(2 * test1.length() + 5)); | ||
} | ||
|
||
// TEST (S3_SINK_WRITE_OVERWRITEBYTES) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... done. Round 1.
* - Due to this, it is not possible to seek within an S3 file stream! Fortunately, the StreamSource has some built-in | ||
* buffer to deal with this. | ||
* - It is not possible to abort a running request! What we did first was download with GetObjectAsync. However this | ||
* lead to errors, when we closed the pipe or tried to close the stream. Worst case: The application crashed, best |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also elsewhere
* lead to errors, when we closed the pipe or tried to close the stream. Worst case: The application crashed, best | |
* led to errors, when we closed the pipe or tried to close the stream. Worst case: The application crashed, best |
|
||
vector<string> IndexReadingRunner::getErrorMessages() { | ||
if (fastqFile.get()) { | ||
auto a = ErrorAccumulator::getErrorMessages(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the value of the intermediate variables?
|
||
bool fastqIsValid = ActualRunner::fulfillsPremises(); | ||
bool indexIsValid = indexFile->fulfillsPremises(); | ||
// // Index files are automatically overwritten but need to have write access! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this commented?
} | ||
|
||
/** | ||
* The main function will 6 parameters (+1 for the application path in parameter 0). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* The main function will 6 parameters (+1 for the application path in parameter 0). | |
* The main function has 6 parameters (+1 for the application path in parameter 0). |
int main(int argc, char **argv) { | ||
auto [parentPathRetrieved, parentPath] = getParentPath(); | ||
if(!parentPathRetrieved) { | ||
std::cerr << "Could not get the parent processes path.\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess these will interleave with the messages of the parent process, right?
std::cerr << "Could not get the parent processes path.\n"; | ||
return -1; | ||
} | ||
if (parentPath != "fastqindex") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think the name of also this binary should be camel-case. All lower-case is also less readable in Bash (although Bash itself may be the larger burden).
There are a lot of changes in this commit and I will only give a more detailed description if necessary. Update the README. Update the CONTRIBUTING file. Over all files: - Renamed variables from fastqFile to sourceFile. - Renamed and changed a lot of variables. - Remove obsolete code and comments. - Changed comments and fixed typos. - Also renamed methods related to fastqFile accordingly. - Changed error messages to be more precise and helpful. - Renamed some functions to better express, what they are doing. Renamed: - PathSink/Source => FileSink/Source - PathLockHandler => FileLockHandler - IndexEntryStorageStrategy => IndexEntryStorageDecisionStrategy - (Also test classes!) Extracted IndexHeader and (Base)IndexEntry(V1) from CommonStructsAndConstants to individual files in process/base. Add the template Result class and tests to have a generic class to return results, the succeeding status and an (error) message from method calls. In CommonStructsAndConstants add the stoui method plus tests, which does not come with the C++ standard libraries. In ErrorAccumulator rename mergeToNewVector() to concatenateVectors(). In IOHelper add checkFileWriteability(). In StringHelper added/moved parseStringValue() from ByteDistanceBasedStorageDecisionStrategy. In FQIS3Client.h: - Rename ObjectListEntry to S3Object. - make FQIS3ClientRequestBooleanResult a typedef of Result<bool>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
o.k. discussed.
Bug #11: Make IOHelper::fullPath() aware of a starting ~. This will allow the application to resolve ~/ to your home directory. PR #37: Fix several typos throughout the whole application. Bug #16: Fix a bug, where the application will fail, when the input file exists but is empty. Did some cleanup. Renamed fastqindexs3iohelper binary to S3IOHelperForFastqIndEx to allow correct tab complextion for "fastqindex ".
No description provided.