Big review branch before 0.10b #37

dankwart-de · 2019-07-31T13:57:03Z

No description provided.

Add license, cmake makefiles, clion project files and the basic application folder layout. Sources: ======================================================================== Add main.cpp/h: - Will create a Starter and a Runner instance, check premises and display errors Add Starter class: - Will evaluate command line options - Will either select the Indexer/ExtractorRunner or the ShowStopperRunner depending on the cli opts - Will also check premises for the selected runner and fail, if the premises could not be validated Add the Runner class: - This is the base class for other Runners (Indexer, Extractor, ShowStopper) - run() must be overriden by specific Runners. This method will be called by the Starter instance - errorMessages / getErrorMessages() can be used to store errors will e.g. be used by the main method to display the errors on stderr - checkPremises() can be overriden to check if everything is setup for a Runner instance (e.g. file existence for the ActualRunner) - Various isXXXRunner() to mimic instancof Add the ActualRunner class: - Base class for IndexerRunner and ExtractorRunner - Has the two fields fastqFile and indexFile - Both fields will be checked in checkPremises(); checkPremises() also adds error entries to the error vector Add the IndexerRunner class: - Will be used to run the fastq indexing Add the ExtractorRunner class: - Has two more fields startLine and lineCount which are needed for extraction Add the CommonStructs.h file: - Contains IndexHeader and IndexEntry structs - Both structs will be used for index files, maybe this will be changed to a custom class IndexFile or so Add the ErrorMessages.h file, which contains error message strings Tests: ======================================================================== Unit tests are based on UnitTest++ which needs to be installed on your local machine Add the TestResourcesAndFunctions class: - Class for helper methods and test related tasks - Can be initialized with the name of the test suite and must be initialized with the test id - This class can be used to create a temp folder which will be deleted when the class destructor is called. Use getTestPath() to get (and create) the folder - filePath() will return the temporary folder extended by the given filename - createEmptyFile() will create the given file in the temporary folder - CreateEmptyFile() will create the given path object. Does not use the temporary folder Add the TestResourcesAndFunctionsTest suite Add test suites for the Starter class and the Runner classes

=== Application files === Add the IndexReader class: - It will be used to read (safely) from a fastq index file. - It uses IndexProcessor as a base class to perform concurrency safe file operations. - To create an instance of IndexReader, use IndexReader::create() - The class is tested with the IndexReaderTest test cases. - When a file is opened, its size is checked and for some issues with ifstream, the current implementation precalculates the amount of stored entries. Add the IndexWriter class: - Used to write (safely) to a fastq index file. - Like the IndexReader, it uses IndexProcessor for safe concurrent file access. - It cannot override existing files. - Write access is exclusive! The IndexWriter will try to aquire exclusive access or fail! - Tested with IndexWriter test cases. Add the IndexProcessor class: - Base class for IndexWriter and IndexReader. - Tries to guarantee safe thread and process concurrent file access to the specified index file. - It allows multiple readers and exclusive writer access. Thus said, if a reader is active, write is not allowed but other readers can read as well. If a write is active, no other writer or reader is allowed. - Lacks an integration test for now. The current impementation relies on boosts interprocess locks functionality. Thread-wide access tests are testes in with the IndexProcessor test cases. Update the Indexer class: - Added the createIndex() method, which will later use the IndexWriter class to write to index files and which is used to create the fastq index. - The class contains some fields, which are exclusively used in tests. To use them, debuggingEnabled needs to be set to true upon creation. Added the ErrorMessages file, which will hold various strings containing error messages. === Test files === In all test files, replace the const string test identifiers by const char*. clang-tidy complained about possible problems with const string. The solution with const char* seems safe. Added more tests to CommonStructsTest. Added the IndexerTest class which currently tests the constructor, createHeader() and createIndex(). The tests all fail at the moment, as the Indexer is not yet functional. Added test files for IndexProcessor, IndexReader and IndexWriter. Added the getResource() method to TestResourcsAndFunctions. Added a test for the method.

Renamed ShowStopperRunner to PrintCLIOptionsRunner. This name is more meaningful. Modify the main method to return the exit code of runner->run() or 1, if the runner could not be started. ErrorAccumulator: - Introduce class ErrorAccumulator to provide basic functionality for error string collection. Make Runner, Indexer and IndexWriter inherit from this class. - The method mergeToNewVector can be used to do a copy merge of two vectors. Rename CommonStructsh to CommonStructsAndConstants.h IndexHeader: - Change order of elements to reduce Byte padding. - Increase total size of header to 512 Byte. - Implement equals and non-equals operator. IndexEntryV1 - Rename struct IndexEntry to IndexEntryV1, as we will keep individual versions when its necessary. - Change order of elements to reduce Byte padding. - Rename variables to match our camel case syntax. - Implement equals and non-equals operator. IndexerRunner: - Add enableDebugging parameter to constructor, will be passed to the indexer - Add a checkPremises method, which will call the super.checkPremises and the one from the indexer - Add getErrorMessages, which will merge its own and the messages from the indexer Indexer: - Create and IndexWriter instance when the object is constructed. - checkPremises will call tryOpen on the indexWriter - One could think of moving the checks in tryOpen to a custom method. Not sure about this yet. - Add getErrorMessages, which will merge its own and the messages from the indexWriter IndexProcessor: - Add the hasLock() method, which will return true if either a read or a write lock was aquired. IndexWriter: - Renamed open() to tryOpen() - Remove the create method. An instance of Indexer needs to be opened with tryOpen() after creation to use it. - Add a writerIsOpen field, which is set when the writer is successfully locked or unlocked. The field is used before write operations to make sure, that no write ops to the ofstream are done when it wasn't opened - Added a destructor, which will flush and close the output stream. - Added some error checks and error messages. ///// Tests Modified tests for IndexHeader and IndexEntryV1, they now contain some sanity checks, which will check the whole structs and each field in them for their size. This way, we hopefully will notice, if someone accidentally changed the structs. Some sort of md5 / hash mechanism could be added, but I do not know how yet. Added tests for the equals methods and made the other tests use == / !=. Added another test for IndexerRunner. The tests will also validate checkPremises(). Added hasLock() checks to the IndexProcessor tests. Removed the compareHeader and compareEntry methods from IndexReaderTest, they are now part of other tests. Added a path variable to the test binary path to main.cpp. Added a test constants file.

Add ZLIB dependency to make files. Reduces CXX_STANDARD to 14, cmake 3.5.3 complained about 17. Change layout of IndexEntryV1 to consume only 8Byte instead of 24. This is possible due to some assumptions: - Entry data will be relative, you need the process the whole file to calculate the absolute values. Index files are small, this is not a real fuzz. - Relative offsets are below 65k, thus an unsigned short is sufficient. - 2Byte overhead is overhead due to memory padding. We store and read the struct and not the individual fields. Adapted tests and methods correspondingly. Extended the Indexer class and make it work: - Tried to modularize the createIndex() method by extracting methods and by pulling fields to the class level. The Indexer is only for single-use, this is safe. - Added some methods: initializeZStream(),readCompressedDataFromStream() checkStreamForBlockEnd(), finalizeProcessingForCurrentBlock(), storeLinesOfCurrentBlockForDebugMode() - Changed the debugging field types. ////// Tests IndexerTest - Added tests for a small and a "large" test file. Manual testing with a 1.2GB FASTQ also works. Or at least it does not crash :D - Added two comments for more tests. TestResourcesAndFunctions - Fixed a bug in getResource(), the method will now access the correct path. - Extended the checks for this.

Made indexing and extraction work. However, to make things work, we need large indices. This is because we need to store a dictionary for each entry in the index. This is currently of size 32kByte and can be reduced by using zlibs compress2/uncompress methods by around 60%. This is not a final version and needs a lot more heavy grinding, e.g: - Code cleanups and unification - More tests and checks on startup - Improve the CLI interface - Change access to FASTQ record access (currently it is working on lines, which might still be kept, but the main purpose of this tool is FASTQ processing) - Improve code to use (much) less memory (Just need to work on different things atm, so at least push it to the repo) Raise minimum cmake to 3.13 Also tried to upgrade Boost from 1.54 to 1.69, which horribly failed. Methods like e.g. exists(path), is_symlink(path) left horrible pointer mess in calling objects. Though we could track this down a bit, we do not know, where the real bug source is located. /////// Source code In CommonStructsAndConstants: - Move several constants to CommonStructsAndConstants. - Add equality (== !=) not (!) and bool() operators to IndexHeader - In IndexEntryV1: * Layout changes from relative to absolute values. * Add the blockID() field to IndexEntryV1. * Rename relativeBlockOffsetInRawFile to blockOffsetInRawFile. * Change types for offsets from ushort to ulong. * Add the 32kByte sized dictionary field which is needed for inflate init in extraction mode. This value should be compressed in a future release (ca 60% storage decrease). The field compressedDictionarySize is already planned but not functional. Added the IndexLine struct, which holds a decompressed and processed IndexEntry. Also this is currently brute force and the whole index is loaded fully to memory. This will need to change in the future to decrease memory consumption. The Starter will now create an ExtractorRunner with proper startline and linecount. In ActualRunner, extend checkPremises() with more checks. Also here, we ran into a lot of trouble when trying to upgrade Boost from 1.54 to 1.69 In IndexerRunner, change indexer to Indexer* type and delete in in the constructor. This and the other changes in the class were done due to the upgrade conflicts of Boost. In ExtractorRunner: - Add checkPremises() and run() - checkPremises() will check the Runner premises + the used extractor premises. - run() will call extractor->extractReadsToCout() In IndexReader: - Add comments. - Add the readerIsOpen field. - Add the readHeader field and make readIndexHeader() private. - Remove the create() method and make the class instantiable via a public constructor. - Rename open() to tryOpenAndReadHeader() and make it public. Add checks and error messages. - Add readIndexFile() method. This will most likely be removed in future versions with a more intelligent index loader approach. - Add readIndexFileV1() method an comments for V2/V3 In Indexer: - This class really needs to be optimized! - Remove constants and use global constants. - Add the calculateIndexBlockInterval() which will calculate some "smart" interval depending on the FASTQ file size. - Change a lot of code. Tried to put it comments, but this really needs improvement. In Extractor: - First: Added the class - Copied some code from Indexer, this should later be merged somehow. - Most important: Add the extractReadsToCout() method which currently lacks checks. The method will jump into the gz file, initialize the inflater, search for the first line to extract and then extract until EOF or extractedLines == lineCount. Also here, the extractor is based on lines rather than records. This should be changed. /////// Tests Change a lot of the tests to match new constructors. In CommonStructsTest: - Change tests for IndexHeader and IndexEntry to match new sizes. - Add tests for new operators() Extend ExtractorRunnerTest to cover one more test case. In IndexerTest: - Add test for CalculateBlockInterval. - Change most tests so, that the Indexer is created with -1 for the block interval to request block interval calculation. (Test create large index will use a block interval of 4.) - Extend test create large index to check some index entries for validity. Values for this were calculated using the original zran.c code. Also put in an extraction test which compares decompressed with original data. Add ExtractorTest: - Add two non test methods runRangedExtractionTest() and runComplexDecompressionTest() which will be called by TEST_CREATE_EXTRACTOR_AND_EXTRACT_SMALL_TO_COUT to run a series of index / extract processes and compare extracted data to decompressed original data. Extend StarterTest to use a complete set of parameters for extraction.

Started to use "using namespace::class" in some cases. Change a lot of minor things due to clang-tidy complaints. Changed ulong to u_int64_t in the whole application to guarantee a length of 8 Byte unsigned values. ulong itself is not guaranteed to be of length 8 Byte (e.g. Microsoft says it is 4 Bytes long). Also changed some of the uint's to u_int32_t though this is not necessary. Changed the WINDOW_SIZE / CHUNK_SIZE to #defines instead of const uint, with the old code, the chunk[CHUNK_SIZE]{0} syntax did not work and needed an additional memset. Renamed IndexLine to IndexEntry. IndexEntry is a decompressed and processed index entry representation for all the versioned IndexEntryV[n] structs. This is to docouple index reading and writing from internal processes. Added the toIndexEntry() method to IndexEntryV1. Created the ZLibBasedFASTQProcessorBaseClass class as a mutually used base class for the Indexer and Extractor classes. The class stores a list of variables and methods used for both subclasses. Cleanup Indexer and Extractor and make them more readable. Though a lot more work could be done here. Also unify the buildIndex() and extract() methods so they look a lot alike. This should be base for further refactoring. Moved a lot of values from both classes to the new base class. The Extractor will now read in IndexEntry-wise until the proper IndexEntry is found. This way, we do not need to read in the whole index which makes sense for memory consumption. Big FASTQ files will have big index files and the Extractor shall use only few memory. Changed readIndexFileV1() to return a vector of IndexEntryV1 and create the readIndexEntry() method. /// Tests Add ZLibBasedFASTQProcessorBaseClassTest file with a mock class to enable testing of the class. Move the test for z_stream init to this file.

Add installation / compilation instructions. Usage will follow later. Add a description and links to libraries.

Extend installation and build instructions. Build was tested on my local machine and on our HPC env. Though I could get cmake running with all the custom libraries, g++ always failed because 'inflateGetDictionary' seems to be missing (it is available in zlib 1.2.8 / 1.2.11).

Tested the code several times with valgrind to get rid of memory leaks. Extend installation and build instructions. Build was tested on my local machine and on our HPC environment. After a lot of tries, it finally builds and runs. Removed Boost dependency. * Replace boost::filesystem by experimental::filesystem classes and functions where necessary. * Replace boost command line interface with tclap library. * Replace boost shared pointers with stdlib shared pointers. * Replace the boost splitstring stuff with a custom method splitStr() in ZLibBasedFASTQProcessorBaseClass. * Replace the boost::mutex class with std::mutex and boost::mutex::scoped_lock with std::lock_guard. Use Linux's flock to lock files shared or exclusive also over NFS. Override getErrorMessages() in ExtractorRunner class. Add blockInterval to IndexerRunner constructor. Add some more error messages, when file open / close fails. Rename the PrintCLIOptionsRunner to DoNothingRunner. It actually does nothing, so the name is more precise. Fully implement new command line interface using tclap. We tried to mimic the samtools command line interface as far as possible. This is because we work with FASTQ files. Modify and extend tests where necessary.

The conda environment contains all necessary tools and libraries except UnitTest++, this still needs to be downloaded and installed manually. Adapted the CMake files to make them aware of an existing (and active) environment. Use TCLAP 1.2.1 instead of 1.2.2. This version is in Conda.

Testing TCLAP (which is eventually done in the Starter tests) fails, because TCLAP calls exit() effectively stopping all tests, when the command line parameters cannot be parsed. This leads to the bad effect, that we can only run valid tests but no tests with invalid command line parameters. To at least get a message, that something went wrong and that the test binary exited unintentionally, I put in a switch after the tests are finished. The switch is recognized by a trap to exit() and a message is printed to the command line to indicate an error, if the switch is not set properly.

-D enables the indexer or extractor debug mode (e.g. for IDE's) -w allows overwriting of an existing .fqi file Add one integration test, which calls the application itself with parameters. Rename the .idx to .fqi to look more like .fai or .bai

The files are working on our LSF environments. Basically you call fqimasstestLSF.sh with a list of FASTQ files and an output directory to where the index files will be written. The script fqitest.sh is called for each FASTQ. Note, that the script will extract the FASTQ to the local scratch storage. Keep in mind, that these scripts are more an example and will not necessarily work on your platform.

I renamed .idx to .fqi in an earlier commit. This lead to SigSevs in the tests, which is now corrected by renaming the test.idx files. Fixed some typos in the test scripts.

Index and extract will now work with concatenated gz files, tests are included. Also reworked the test structure for the extractor tests. TODO Indexer and Extractor need a lot of rework to make the code more readable.

Add the possibility to stream in a FASTQ file for indexing. Internally, I created the InputSource class for this, which is inherited by the PathInputSource and the StreamInputSource classes. This way, both classes share a common interface and can be treated equally. As we are working with (char/byte)buffers, it is more convenient to have a C-like interface with methods like read(), skip() and so on. As the previous code was based on path instances, a lot of changes were necessary to adapt to the new structure. Note, that there are some tests, which still fail. They need to be implemented. Sources ======= ActualRunner: - Add additional constructors with a path and a shared pointer of type InputSource. - Change the checkPremises method to be aware of the different input source types. - Add allowsReadFromStreamedSource() which needs to be overriden, if necessary. ExtractorRunner: - Make it use a PathInputSource for input instead of a path. NOTE: In difference to the indexer, the extractor is not allowed to read from a stream (it would also not make any sense.) IndexerRunner: - Reformated the constructor code and added a second constructor. - Override the allowsReadFromStreamedSource() method to return true. Extractor: - Add the resultFile, useFile and forceOverwrite fields. This allows the extractor to either write to stdout or to a result file. - Extend the constructor to be aware of the new fields. - Extend checkPremises to be aware of the changes. - Adapt the extractor to use the new (Path)InputSource instead of path. Indexer: - Make the Indexer use an InputSource instance instead of a path. - Make the indexing process use the new InputSource methods. Starter: - The Starter will now recognize "-" as a valid parameter for a piped FASTQ file in index mode and for the extracted text in extract mode. Added the InputSource, PathInputSource and StreamInputSource classes. Tests ===== Made the test classes use the new Indexer/Extractor(Runner) constructors Added tests for the new InputSource classes and updated the Indexer tests to also test with streamed data. Replaced several duplicate strings by constants. Disabled the integration tests for now. Streamed data can be tested with e.g. ifstream instances. TestResourcesAndFunctions: - Add a format string method. However, this method is not very safe and really only for test purposes. It should not be used in production code! It is ok for tests though. - Add the runCommand(), runIndexerBinary(), runExtractorBinary(), extractGZFile() and createConcatenatedFile() methods to remove code duplication and simplify tests.

Allow to set some verbosity with -v/-verbosity from 0(default) to 3. Actually only level 3 is used. Add one more message to the end of the indexing process. Fix the StreamInputSource by make it using different methods for reading and checking eof() (sync(), read(), gcount(), peek(), eof()). Add some logger methods to ErrorAccumulator. This is nothing spectacular but offers a very small amount of logging. Moved method bodies from ErrorAccumulator header to source file. Added some more tests and set verbosity for the testapp to 3.

** NO TESTS YET! ** The stats mode will display some basic information about a given index file. You can also select a range of index entries to display.

Fix the binary in such a way, that it can handle "super-concatenated" FASTQ files. E.g. we had FASTQ files with a size of around 3,3GB which consisted of approximately 170k concatenated gzip files. Extractor class: - In some cases, the extractor did not properly extract the first line in the selected range of lines. To prevent this, the extractor will now extract the amount of [multiplier] previous lines and skip them on output. - Fixed a big mem leak: I did not reinitialize the inflate stream properly and a lot of memory was allocated but not freed. - Fixed some issues, when no line was in an extracted block. - The seek into the concatenated file was moved before the check for readability to ensure, that it is properly detected, if a file still can be read. Indexer class: - Refactor the index method and extract some methods. - The Indexer can now process the mentioned super-concatenated files where e.g. thousands of single files of less than 40k were concatenated. - Count number of concatenated files and output this at the end of the Indexing step. - The seek into the concatenated file was moved before the check for readability to ensure, that it is properly detected, if a file still can be read. The IndexStatsRunner also shows the starting record id (effectively the starting line number divided by 4. Modify the PathInputSource to use C++ streams instead of the C FILE API. In IndexerTest: - Change some CHECK_EQUAL tests to CHECK. I do not why, but C_E throws segfaults now. Might depend on the compiler version. - Added a test for concatenated FASTQ files with really small files, each smaller than a block.

Add two new indexer options: - -F will tell the Indexer to NOT write out the index file. This is to get some basic statistics about what could happen. - -S will disable the new failsafe distance. Use this, if you think, that you do not need this. However, it is more for testing. Fix Indexer behaviour for files with very small blocks by introducing a failsafe distance. It is calculated by (block distance) * 16kByte and will tell the indexer, that it should not only look at the block distance but also on the offset distance between the current and the latest written index entry. Also do some code refactorings to make the Indexer and Extractor code a bit more readable.

The Starter now parses the variables for extract BEFORE we access the indexFile and debugMode parameters. Extract a method to print an index entry to the console and put it in IndexStatsRunner. Output some minor statistics when extraction gets started. Also fix a bug in the indexer. The first index entry was not stored, when failsafe distance was on. Add A LOT of build options for debug mode. These should be handled with care to reduce warnings and increase code robustness.

Remove debug() output commands from a lot of places. They are actually more confusing than helpful. Indexer class - Changed the default block interval from 512 to 2048 - Added the postponeWrite field. This comes into play, when an empty block occurs. Empty blocks will not be used for index entries anymore as they caused a lot of problems leading to faulty results. - Added a lot of debug code which can help a developer to better understand, how the index is created. Fields and methods are: - writeOutOfPartialDecompressedBlocks - storageForDecompressedBlocks - writeOutOfPartialDecompressedBlocks - storageForPartialDecompressedBlocks - partialBlockinfoStream - enableWriteOutOfDecompressedBlocksAndStatistics() - enableWriteOutOfPartialDecompressedBlocks() - Extracted the methods createIndexEntryFromBlockData(), storeDictionaryForEntry() and writeIndexEntryIfPossible() to make the code more readable. - Reworked finalizeProcessingForCurrentBlock() to make it more readable and also make it use and therefore output the above mentioned debug information. Extractor class - Added the roundtripBuffer field, which will be used later to enable: - error recognition, if the extract data is not made up of extractionMultiplier sized records - complete output with at least empty entries for a record. However, this is not yet used. - Removed the skipLines field, as the whole mechanism was reworked and both index and extract should now produce proper results. - Reworked processDecompressedChunkOfData(), it is now much easier to read and a lot less complicated. Reworked the Starter to accept more command line options and pass them to the specific runners. Changed ZLibBasedFASTQProcessorBaseClass.splitStr so, that it does no longer add an empty line to the back of the list, if the last char is a newline character. TestResourcesAndFunctions - Add readLinesOfFile(), readFile(), readLinesOfResourceFile(), readResourceFile(), compareVectorContent() and getTestVectorWithSimulatedBlockData() to reduce code duplications and improve test readability. Code is taken from IndexerTest and ExtractorTest. - The method getTestVectorWithSimulatedBlockData() will return a vector with files for simulating decompressed gzip block data. This tries to mimic common pitfalls like small blocks, empty blocks, blocks without newlines and so on. IndexerTest - Add a test for correct block line counting which uses the aforementioned block data from getTestVectorWithSimulatedBlockData() - Move code to the aforementioned methods readLinesOfFile() and so on... ExtractorTest - Analogous to IndexerTest, add a test for correct line from block data extraction. This will also use getTestVectorWithSimulatedBlockData() - Like in IndexerTest, use the extracted methods for file reading and vector comparison. Add the blockAndLineCalculations test resource directory. This contains several files with "decompressed" block data and a file which describes the layout of the joined block files.

to allow easy publication of the standalone binary.

These strategies are used to decided whether an IndexEntry will be stored or not. The current block based (safeguarded) approach will still be usable and is encapsulated in the BlockDistanceStorageStrategy class. Add the IndexEntryStorageStrategy base class. Add the ByteDistanceStorageStrategy class to allow the user to set a Byte value like 1T, 2g, 3m and 4K as a minimum distance between index entries. Rename the field offsetInRawFile to blockOffsetInRawFile in IndexEntry to match IndexEntryV1. Move the method calculateIndexBlockInterval from Indexer to the new IndexEntryStorageStrategy class. Add a test class for the new strategy classes.

Add s3 support

vinjana

All these are suggestions.

CMakeLists.txt

README.md

LICENSE.txt

vinjana · 2019-07-31T14:16:58Z

env/environment.yml

@@ -0,0 +1,33 @@
+name: FastqIndEx


The env doesn't work.
S3 is not included.

The included library did not work and I needed to download the AWS SDK. This is described in the README

I doubt this is acceptible for a deployment via Conda.

README.md

src/common/CommonStructsAndConstants.h

vinjana · 2019-07-31T16:06:31Z

src/common/CommonStructsAndConstants.h

+};
+
+/**
+ * Represents an entry in a gz index file


Do you think for explanation would it be helpful to have e.g. a plant-UML diagram of the the blocks, and all other concepts related to gzip format and indexes and thus explain all the terms? Maybe we can make a sketch that I can translate into digital diagram (or we just scan the sketch!)

makes sense

src/common/CommonStructsAndConstants.h

src/common/ErrorAccumulator.cpp

src/common/ErrorMessages.cpp

src/common/IOHelper.h

src/common/IOHelper.cpp

src/main.cpp

In CommonStructsAndConstants: - Add constants for Byte sizes like k(ilo)B(yte), M(ega)B(yte) In IndexEntryStorageStrategy: - Change the type of blockInterval and minIndexEntryByteDistance to signed. - Add a getDefault() and a from() method for both strategies - Add calculateDistanceBasedOnFileSize to ByteDistanceStorageStrategy - Implemente useFilesizeForCalculation in ByteDistanceStorageStrategy In Indexer and IndexerRunner: - Remove the blockInterval and disableFailsafeDistance parameters from the constructor and add the storageStrategy parameter. In ModeCLIParser: - Remove argumentToPath() - Add createCommandLineParser() - Add createAllowedModeArg() and use this method in the parse methods of the subclasses. In ModeCLIParser subclasses: - Rework the parse() methods - Use the new createAllowedModeArg() method. In IndexModeCLIParser: - Inline some strings back to the create..Args methods. - Add an argument to parse the storage strategy and the byte distance.

src/process/base/ZLibBasedFASTQProcessorBaseClass.h

src/process/base/ZLibBasedFASTQProcessorBaseClass.cpp

src/process/extract/Extractor.h

src/process/extract/Extractor.cpp

vinjana

Excellent error checking and reporting! Consider making the messages shorter and with more contextual information, e.g. the condition including some values that created the error and maybe the filename (in fatal errors). Consider filename only at the end, when the program actually stops (filenames can get reallly long). Avoid duplicate messages for different conditions -- messages should be uniquely findable in code (that's how I navigate to the erring code).

We need to discuss the zlib stuff -- after I am through the tests. I am going to pull to tests forward in the review, maybe I'm to reduce the number of stupid comments by me :-P

src/process/extract/Extractor.cpp

src/process/extract/IndexReader.cpp

There are some TODOs left which are marked with // TODO Reenabled compilation flags for debug code. Changed a lot of uint64 values to signed type. The signed type still offers us the possibility to work with tremendously huge files and it is easier in some operations to have a signed value like -1 for a file size query (not existing, stream, whatever). Removed unnecessary imports throughout the project. Changed a lot of code so, that equal operations are correct. See to it: - that compared types are both signed or unsigned - compared types have the right sizes 32Bit < 64Bit and 64Bit<64Bit Changed method parameter types to const ref where feasible. Changed auto loop parameters to const ref where necessary. Refactored Extractor::extract() and pulled out some methods to make it more readable. Renamed checkPremises throughout the whole project to fulfillsPremises. Changed "old-style" casts to e.g. static_cast / reinterpret_cast over all files. Added more tests for the index entry storage strategies.

vinjana

Mostly skimmed over the recently changed parts but ignored for now the new parts. I will continue in the big review with these.

vinjana · 2019-08-06T11:48:21Z

src/process/index/IndexWriter.h

+    /**
+     * Must be set after the the index process via setNumberOfLinesInFile
+     */
+    u_int64_t numberOfLinesInFile{0};


Is this really a field of the IndexWriter? Should it not be in the index header? Same for numberOfWrittenEntries (or numberOfEntries -- this concerns the index entries, right? Not the FASTQ records.)

src/process/index/IndexWriter.cpp

src/process/index/Indexer.cpp

src/process/index/IndexEntryStorageStrategy.h

src/process/io/ConsoleSink.h

src/process/io/s3/S3Config.h

src/process/io/s3/S3Service.cpp

vinjana

to be continued ...

src/common/CommonStructsAndConstants.h

src/process/index/IndexEntryStorageStrategy.h

src/process/io/StreamSource.cpp

src/process/io/StreamSource.h

src/process/io/locks/PathLockHandler.cpp

src/process/io/locks/PathLockHandler.h

src/process/io/s3/S3Config.h

vinjana

Going to continue with the tests! Yeah! But not today.

src/process/io/s3/S3Config.cpp

vinjana · 2019-08-12T11:23:53Z

src/process/io/s3/S3Config.cpp

+        usedCredentialsFile = finalCredentialsFile;
+        usedConfigFile = finalConfigurationFile;
+    } else {
+        addErrorMessage("Could not validate configuration for S3.");


Which actual configuration was not validated?

The whole configuration. But you're right, we can improve this.

vinjana · 2019-08-12T11:39:34Z

src/startup/Starter.cpp

+    return Starter::instance;
+}
+
+DoNothingRunner *Starter::assembleSmallCmdLineParserAndParseOpts(int argc, const char *argv[]) {


To me its not obvious what the method name has to do with the function of the method? You're not only assembling but actually run the parser. Assembling actually is a method internal technical detail. Maybe parseSmallCmdLineOpts? Or parseBaseCmdLine (not 100% sure what the "small" means)?

vinjana · 2019-08-12T13:40:58Z

src/startup/IndexStatsModeCLIParser.h

+                0, cmdLineParser.get());
+
+        auto numberOfReadsArg = _makeUIntValueArg(
+                "n", "numberofentries",


Suggested change

"n", "numberofentries",

"n", "numberOfEntries",

src/startup/IndexStatsModeCLIParser.h

src/startup/ModeCLIParser.h

vinjana · 2019-08-12T13:52:26Z

src/startup/ModeCLIParser.h

+    }
+
+    /**
+     * Processes several FASTQ related parameters and create a FASTQ PathSource from the input.


Suggested change

* Processes several FASTQ related parameters and create a FASTQ PathSource from the input.

* Processes several FASTQ related parameters and create a FASTQ Source from the input.

src/startup/ModeCLIParser.h

Added a notes section to Contributing.md Added the s3iomain.cpp and entries to the CMakefile to compile the GetObject helper binary. Starter is not a Singleton anymore. Added IOHelper::getApplicationPath() to get the absolute path of the running application. This is used to run the helper binary. Added the S3GetObjectProcessWrapper, which is used to run the helper binary in a separate process and thread. Moved the async GetObject code from S3Source to the helper binary and made it a bit more readable. Added unsigned (U) suffixes to some numbers in tests.

vinjana

Generally, we should try to remove all references to FASTQ from the tool code, if it is useful for any data.

CONTRIBUTING.md

src/process/io/s3/FQIS3Client.h