Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Big review branch before 0.10b #37

Open
wants to merge 54 commits into
base: review-branch
Choose a base branch
from
Open

Big review branch before 0.10b #37

wants to merge 54 commits into from

Conversation

dankwart-de
Copy link
Contributor

No description provided.

heinold added 30 commits December 19, 2018 12:53
Add license, cmake makefiles, clion project files and the basic
application folder layout.

Sources:
========================================================================

Add main.cpp/h:
- Will create a Starter and a Runner instance, check premises and
  display errors

Add Starter class:
- Will evaluate command line options
- Will either select the Indexer/ExtractorRunner or the
  ShowStopperRunner depending on the cli opts
- Will also check premises for the selected runner and fail, if the
  premises could not be validated

Add the Runner class:
- This is the base class for other Runners (Indexer, Extractor,
  ShowStopper)
- run() must be overriden by specific Runners. This method will be
  called by the Starter instance
- errorMessages / getErrorMessages() can be used to store errors
  will e.g. be used by the main method to display the errors on stderr
- checkPremises() can be overriden to check if everything is setup
  for a Runner instance (e.g. file existence for the ActualRunner)
- Various isXXXRunner() to mimic instancof

Add the ActualRunner class:
- Base class for IndexerRunner and ExtractorRunner
- Has the two fields fastqFile and indexFile
- Both fields will be checked in checkPremises(); checkPremises() also
  adds error entries to the error vector

Add the IndexerRunner class:
- Will be used to run the fastq indexing

Add the ExtractorRunner class:
- Has two more fields startLine and lineCount which are needed for
  extraction

Add the CommonStructs.h file:
- Contains IndexHeader and IndexEntry structs
- Both structs will be used for index files, maybe this will be changed
  to a custom class IndexFile or so

Add the ErrorMessages.h file, which contains error message strings

Tests:
========================================================================

Unit tests are based on UnitTest++ which needs to be installed on your
local machine

Add the TestResourcesAndFunctions class:
- Class for helper methods and test related tasks
- Can be initialized with the name of the test suite and must be
  initialized with the test id
- This class can be used to create a temp folder which will be deleted
  when the class destructor is called. Use getTestPath() to get (and
  create) the folder
- filePath() will return the temporary folder extended by the given
  filename
- createEmptyFile() will create the given file in the temporary folder
- CreateEmptyFile() will create the given path object. Does not use
  the temporary folder

Add the TestResourcesAndFunctionsTest suite

Add test suites for the Starter class and the Runner classes
=== Application files ===

Add the IndexReader class:
- It will be used to read (safely) from a fastq index file.
- It uses IndexProcessor as a base class to perform concurrency safe
  file operations.
- To create an instance of IndexReader, use IndexReader::create()
- The class is tested with the IndexReaderTest test cases.
- When a file is opened, its size is checked and for some issues with
  ifstream, the current implementation precalculates the amount of
  stored entries.

Add the IndexWriter class:
- Used to write (safely) to a fastq index file.
- Like the IndexReader, it uses IndexProcessor for safe concurrent
  file access.
- It cannot override existing files.
- Write access is exclusive! The IndexWriter will try to aquire
  exclusive access or fail!
- Tested with IndexWriter test cases.

Add the IndexProcessor class:
- Base class for IndexWriter and IndexReader.
- Tries to guarantee safe thread and process concurrent file access
  to the specified index file.
- It allows multiple readers and exclusive writer access. Thus said,
  if a reader is active, write is not allowed but other readers can
  read as well. If a write is active, no other writer or reader is
  allowed.
- Lacks an integration test for now. The current impementation relies
  on boosts interprocess locks functionality. Thread-wide access tests
  are testes in with the IndexProcessor test cases.

Update the Indexer class:
- Added the createIndex() method, which will later use the IndexWriter
  class to write to index files and which is used to create the fastq
  index.
- The class contains some fields, which are exclusively used in tests.
  To use them, debuggingEnabled needs to be set to true upon creation.

Added the ErrorMessages file, which will hold various strings containing
error messages.

=== Test files ===

In all test files, replace the const string test identifiers by
const char*. clang-tidy complained about possible problems with const
string. The solution with const char* seems safe.

Added more tests to CommonStructsTest.

Added the IndexerTest class which currently tests the constructor,
createHeader() and createIndex(). The tests all fail at the moment, as
the Indexer is not yet functional.

Added test files for IndexProcessor, IndexReader and IndexWriter.

Added the getResource() method to TestResourcsAndFunctions. Added a test
for the method.
Renamed ShowStopperRunner to PrintCLIOptionsRunner. This name is more
meaningful.

Modify the main method to return the exit code of runner->run() or 1,
if the runner could not be started.

ErrorAccumulator:
- Introduce class ErrorAccumulator to provide basic functionality for
  error string collection. Make Runner, Indexer and IndexWriter inherit
  from this class.
- The method mergeToNewVector can be used to do a copy merge of two
  vectors.

Rename CommonStructsh to CommonStructsAndConstants.h

IndexHeader:
- Change order of elements to reduce Byte padding.
- Increase total size of header to 512 Byte.
- Implement equals and non-equals operator.

IndexEntryV1
- Rename struct IndexEntry to IndexEntryV1, as we will keep individual
  versions when its necessary.
- Change order of elements to reduce Byte padding.
- Rename variables to match our camel case syntax.
- Implement equals and non-equals operator.

IndexerRunner:
- Add enableDebugging parameter to constructor, will be passed to
  the indexer
- Add a checkPremises method, which will call the super.checkPremises
  and the one from the indexer
- Add getErrorMessages, which will merge its own and the messages from
  the indexer

Indexer:
- Create and IndexWriter instance when the object is constructed.
- checkPremises will call tryOpen on the indexWriter - One could think
  of moving the checks in tryOpen to a custom method. Not sure about
  this yet.
- Add getErrorMessages, which will merge its own and the messages from
  the indexWriter

IndexProcessor:
- Add the hasLock() method, which will return true if either a read or a
  write lock was aquired.

IndexWriter:
- Renamed open() to tryOpen()
- Remove the create method. An instance of Indexer needs to be opened
  with tryOpen() after creation to use it.
- Add a writerIsOpen field, which is set when the writer is successfully
  locked or unlocked. The field is used before write operations to make
  sure, that no write ops to the ofstream are done when it wasn't opened
- Added a destructor, which will flush and close the output stream.
- Added some error checks and error messages.

///// Tests

Modified tests for IndexHeader and IndexEntryV1, they now contain some
sanity checks, which will check the whole structs and each field in them
for their size. This way, we hopefully will notice, if someone
accidentally changed the structs. Some sort of md5 / hash mechanism
could be added, but I do not know how yet.

Added tests for the equals methods and made the other tests use == / !=.

Added another test for IndexerRunner. The tests will also validate
checkPremises().

Added hasLock() checks to the IndexProcessor tests.

Removed the compareHeader and compareEntry methods from IndexReaderTest,
they are now part of other tests.

Added a path variable to the test binary path to main.cpp.

Added a test constants file.
Add ZLIB dependency to make files.

Reduces CXX_STANDARD to 14, cmake 3.5.3 complained about 17.

Change layout of IndexEntryV1 to consume only 8Byte instead of 24. This
is possible due to some assumptions:
- Entry data will be relative, you need the process the whole file to
  calculate the absolute values. Index files are small, this is not a
  real fuzz.
- Relative offsets are below 65k, thus an unsigned short is sufficient.
- 2Byte overhead is overhead due to memory padding. We store and read
  the struct and not the individual fields.
Adapted tests and methods correspondingly.

Extended the Indexer class and make it work:
- Tried to modularize the createIndex() method by extracting methods and
  by pulling fields to the class level. The Indexer is only for
  single-use, this is safe.
- Added some methods: initializeZStream(),readCompressedDataFromStream()
  checkStreamForBlockEnd(), finalizeProcessingForCurrentBlock(),
  storeLinesOfCurrentBlockForDebugMode()
- Changed the debugging field types.

////// Tests

IndexerTest
- Added tests for a small and a "large" test file. Manual testing with
  a 1.2GB FASTQ also works. Or at least it does not crash :D
- Added two comments for more tests.

TestResourcesAndFunctions
- Fixed a bug in getResource(), the method will now access the correct
  path.
- Extended the checks for this.
Made indexing and extraction work. However, to make things work, we need
large indices. This is because we need to store a dictionary for each
entry in the index. This is currently of size 32kByte and can be reduced
by using zlibs compress2/uncompress methods by around 60%.

This is not a final version and needs a lot more heavy grinding, e.g:
- Code cleanups and unification
- More tests and checks on startup
- Improve the CLI interface
  - Change access to FASTQ record access (currently it is working on
    lines, which might still be kept, but the main purpose of this tool
    is FASTQ processing)
- Improve code to use (much) less memory
(Just need to work on different things atm, so at least push it to the
 repo)

Raise minimum cmake to 3.13

Also tried to upgrade Boost from 1.54 to 1.69, which horribly failed.
Methods like e.g. exists(path), is_symlink(path) left horrible pointer
mess in calling objects. Though we could track this down a bit, we do
not know, where the real bug source is located.

/////// Source code

In CommonStructsAndConstants:
- Move several constants to CommonStructsAndConstants.
- Add equality (== !=) not (!) and bool() operators to IndexHeader
- In IndexEntryV1:
  * Layout changes from relative to absolute values.
  * Add the blockID() field to IndexEntryV1.
  * Rename relativeBlockOffsetInRawFile to blockOffsetInRawFile.
  * Change types for offsets from ushort to ulong.
  * Add the 32kByte sized dictionary field which is needed for inflate
    init in extraction mode. This value should be compressed in a future
    release (ca 60% storage decrease). The field
    compressedDictionarySize is already planned but not functional.

Added the IndexLine struct, which holds a decompressed and processed
IndexEntry. Also this is currently brute force and the whole index is
loaded fully to memory. This will need to change in the future to
decrease memory consumption.

The Starter will now create an ExtractorRunner with proper startline and
linecount.

In ActualRunner, extend checkPremises() with more checks. Also here, we
ran into a lot of trouble when trying to upgrade Boost from 1.54 to 1.69

In IndexerRunner, change indexer to Indexer* type and delete in in the
constructor. This and the other changes in the class were done due to
the upgrade conflicts of Boost.

In ExtractorRunner:
- Add checkPremises() and run()
- checkPremises() will check the Runner premises + the used extractor
  premises.
- run() will call extractor->extractReadsToCout()

In IndexReader:
- Add comments.
- Add the readerIsOpen field.
- Add the readHeader field and make readIndexHeader() private.
- Remove the create() method and make the class instantiable via a
  public constructor.
- Rename open() to tryOpenAndReadHeader() and make it public. Add checks
  and error messages.
- Add readIndexFile() method. This will most likely be removed in future
  versions with a more intelligent index loader approach.
- Add readIndexFileV1() method an comments for V2/V3

In Indexer:
- This class really needs to be optimized!
- Remove constants and use global constants.
- Add the calculateIndexBlockInterval() which will calculate some
  "smart" interval depending on the FASTQ file size.
- Change a lot of code. Tried to put it comments, but this really needs
  improvement.

In Extractor:
- First: Added the class
- Copied some code from Indexer, this should later be merged somehow.
- Most important: Add the extractReadsToCout() method which currently
  lacks checks. The method will jump into the gz file, initialize the
  inflater, search for the first line to extract and then extract until
  EOF or extractedLines == lineCount. Also here, the extractor is based
  on lines rather than records. This should be changed.

/////// Tests

Change a lot of the tests to match new constructors.

In CommonStructsTest:
- Change tests for IndexHeader and IndexEntry to match new sizes.
- Add tests for new operators()

Extend ExtractorRunnerTest to cover one more test case.

In IndexerTest:
- Add test for CalculateBlockInterval.
- Change most tests so, that the Indexer is created with -1 for the
  block interval to request block interval calculation.
  (Test create large index will use a block interval of 4.)
- Extend test create large index to check some index entries for
  validity. Values for this were calculated using the original zran.c
  code. Also put in an extraction test which compares decompressed with
  original data.

Add ExtractorTest:
- Add two non test methods runRangedExtractionTest() and
  runComplexDecompressionTest() which will be called by
  TEST_CREATE_EXTRACTOR_AND_EXTRACT_SMALL_TO_COUT to run a series of
  index / extract processes and compare extracted data to decompressed
  original data.

Extend StarterTest to use a complete set of parameters for extraction.
Started to use "using namespace::class" in some cases.

Change a lot of minor things due to clang-tidy complaints.

Changed ulong to u_int64_t in the whole application to guarantee a
length of 8 Byte unsigned values. ulong itself is not guaranteed to be
of length 8 Byte (e.g. Microsoft says it is 4 Bytes long). Also changed
some of the uint's to u_int32_t though this is not necessary.

Changed the WINDOW_SIZE / CHUNK_SIZE to #defines instead of const uint,
with the old code, the chunk[CHUNK_SIZE]{0} syntax did not work and
needed an additional memset.

Renamed IndexLine to IndexEntry. IndexEntry is a decompressed and
processed index entry representation for all the versioned
IndexEntryV[n] structs. This is to docouple index reading and writing
from internal processes.

Added the toIndexEntry() method to IndexEntryV1.

Created the ZLibBasedFASTQProcessorBaseClass class as a mutually used
base class for the Indexer and Extractor classes. The class stores a
list of variables and methods used for both subclasses.

Cleanup Indexer and Extractor and make them more readable. Though a lot
more work could be done here. Also unify the buildIndex() and extract()
methods so they look a lot alike. This should be base for further
refactoring. Moved a lot of values from both classes to the new base
class.

The Extractor will now read in IndexEntry-wise until the proper
IndexEntry is found. This way, we do not need to read in the whole index
which makes sense for memory consumption. Big FASTQ files will have big
index files and the Extractor shall use only few memory.

Changed readIndexFileV1() to return a vector of IndexEntryV1 and create
the readIndexEntry() method.

/// Tests

Add ZLibBasedFASTQProcessorBaseClassTest file with a mock class to
enable testing of the class. Move the test for z_stream init to this
file.
Add installation / compilation instructions. Usage will follow later.

Add a description and links to libraries.
Extend installation and build instructions. Build was tested on my local
machine and on our HPC env. Though I could get cmake running with all
the custom libraries, g++ always failed because 'inflateGetDictionary'
seems to be missing (it is available in zlib 1.2.8 / 1.2.11).
Tested the code several times with valgrind to get rid of memory leaks.

Extend installation and build instructions. Build was tested on my local
machine and on our HPC environment. After a lot of tries, it finally
builds and runs.

Removed Boost dependency.
* Replace boost::filesystem by experimental::filesystem classes and
  functions where necessary.
* Replace boost command line interface with tclap library.
* Replace boost shared pointers with stdlib shared pointers.
* Replace the boost splitstring stuff with a custom method splitStr() in
  ZLibBasedFASTQProcessorBaseClass.
* Replace the boost::mutex class with std::mutex and
  boost::mutex::scoped_lock with std::lock_guard. Use Linux's flock to
  lock files shared or exclusive also over NFS.

Override getErrorMessages() in ExtractorRunner class.

Add blockInterval to IndexerRunner constructor.

Add some more error messages, when file open / close fails.

Rename the PrintCLIOptionsRunner to DoNothingRunner. It actually does
nothing, so the name is more precise.

Fully implement new command line interface using tclap. We tried to
mimic the samtools command line interface as far as possible. This is
because we work with FASTQ files.

Modify and extend tests where necessary.
The conda environment contains all necessary tools and libraries except
UnitTest++, this still needs to be downloaded and installed manually.

Adapted the CMake files to make them aware of an existing (and active)
environment.

Use TCLAP 1.2.1 instead of 1.2.2. This version is in Conda.
Testing TCLAP (which is eventually done in the Starter tests) fails,
because TCLAP calls exit() effectively stopping all tests, when the
command line parameters cannot be parsed. This leads to the bad effect,
that we can only run valid tests but no tests with invalid command line
parameters.

To at least get a message, that something went wrong and that the test
binary exited unintentionally, I put in a switch after the tests are
finished. The switch is recognized by a trap to exit() and a message is
printed to the command line to indicate an error, if the switch is not
set properly.
-D enables the indexer or extractor debug mode (e.g. for IDE's)

-w allows overwriting of an existing .fqi file

Add one integration test, which calls the application itself with
parameters.

Rename the .idx to .fqi to look more like .fai or .bai
The files are working on our LSF environments.

Basically you call fqimasstestLSF.sh with a list of FASTQ files and an
output directory to where the index files will be written.

The script fqitest.sh is called for each FASTQ. Note, that the script
will extract the FASTQ to the local scratch storage.

Keep in mind, that these scripts are more an example and will not
necessarily work on your platform.
I renamed .idx to .fqi in an earlier commit. This lead to SigSevs in the
tests, which is now corrected by renaming the test.idx files.

Fixed some typos in the test scripts.
Index and extract will now work with concatenated gz files, tests are
included.

Also reworked the test structure for the extractor tests.

TODO Indexer and Extractor need a lot of rework to make the code more
readable.
Add the possibility to stream in a FASTQ file for indexing. Internally,
I created the InputSource class for this, which is inherited by the
PathInputSource and the StreamInputSource classes. This way, both
classes share a common interface and can be treated equally. As we are
working with (char/byte)buffers, it is more convenient to have a C-like
interface with methods like read(), skip() and so on.

As the previous code was based on path instances, a lot of changes were
necessary to adapt to the new structure.

Note, that there are some tests, which still fail. They need to be
implemented.

Sources
=======

ActualRunner:
- Add additional constructors with a path and a shared pointer of type
  InputSource.
- Change the checkPremises method to be aware of the different input
  source types.
- Add allowsReadFromStreamedSource() which needs to be overriden, if
  necessary.

ExtractorRunner:
- Make it use a PathInputSource for input instead of a path.
NOTE: In difference to the indexer, the extractor is not allowed to
read from a stream (it would also not make any sense.)

IndexerRunner:
- Reformated the constructor code and added a second constructor.
- Override the allowsReadFromStreamedSource() method to return true.

Extractor:
- Add the resultFile, useFile and forceOverwrite fields. This allows the
  extractor to either write to stdout or to a result file.
- Extend the constructor to be aware of the new fields.
- Extend checkPremises to be aware of the changes.
- Adapt the extractor to use the new (Path)InputSource instead of path.

Indexer:
- Make the Indexer use an InputSource instance instead of a path.
- Make the indexing process use the new InputSource methods.

Starter:
- The Starter will now recognize "-" as a valid parameter for a piped
  FASTQ file in index mode and for the extracted text in extract mode.

Added the InputSource, PathInputSource and StreamInputSource classes.

Tests
=====

Made the test classes use the new Indexer/Extractor(Runner) constructors

Added tests for the new InputSource classes and updated the Indexer
tests to also test with streamed data.

Replaced several duplicate strings by constants.

Disabled the integration tests for now. Streamed data can be tested with
e.g. ifstream instances.

TestResourcesAndFunctions:
- Add a format string method. However, this method is not very safe and
  really only for test purposes. It should not be used in production
  code! It is ok for tests though.
- Add the runCommand(), runIndexerBinary(), runExtractorBinary(),
  extractGZFile() and createConcatenatedFile() methods to remove code
  duplication and simplify tests.
Allow to set some verbosity with -v/-verbosity from 0(default) to 3.
Actually only level 3 is used.

Add one more message to the end of the indexing process.

Fix the StreamInputSource by make it using different methods for reading
and checking eof() (sync(), read(), gcount(), peek(), eof()).

Add some logger methods to ErrorAccumulator. This is nothing spectacular
but offers a very small amount of logging.

Moved method bodies from ErrorAccumulator header to source file.

Added some more tests and set verbosity for the testapp to 3.
** NO TESTS YET! **

The stats mode will display some basic information about a given index
file. You can also select a range of index entries to display.
Fix the binary in such a way, that it can handle "super-concatenated"
FASTQ files. E.g. we had FASTQ files with a size of around 3,3GB which
consisted of approximately 170k concatenated gzip files.

Extractor class:
- In some cases, the extractor did not properly extract the first line
  in the selected range of lines. To prevent this, the extractor will
  now extract the amount of [multiplier] previous lines and skip them on
  output.
- Fixed a big mem leak: I did not reinitialize the inflate stream
  properly and a lot of memory was allocated but not freed.
- Fixed some issues, when no line was in an extracted block.
- The seek into the concatenated file was moved before the check for
  readability to ensure, that it is properly detected, if a file still
  can be read.

Indexer class:
- Refactor the index method and extract some methods.
- The Indexer can now process the mentioned super-concatenated files
  where e.g. thousands of single files of less than 40k were
  concatenated.
- Count number of concatenated files and output this at the end of the
  Indexing step.
- The seek into the concatenated file was moved before the check for
  readability to ensure, that it is properly detected, if a file still
  can be read.

The IndexStatsRunner also shows the starting record id (effectively the
starting line number divided by 4.

Modify the PathInputSource to use C++ streams instead of the C FILE API.

In IndexerTest:
- Change some CHECK_EQUAL tests to CHECK. I do not why, but C_E throws
  segfaults now. Might depend on the compiler version.
- Added a test for concatenated FASTQ files with really small files,
  each smaller than a block.
Add two new indexer options:
- -F will tell the Indexer to NOT write out the index file. This is to
  get some basic statistics about what could happen.
- -S will disable the new failsafe distance. Use this, if you think,
  that you do not need this. However, it is more for testing.

Fix Indexer behaviour for files with very small blocks by introducing
a failsafe distance. It is calculated by (block distance) * 16kByte and
will tell the indexer, that it should not only look at the block
distance but also on the offset distance between the current and the
latest written index entry.

Also do some code refactorings to make the Indexer and Extractor code a
bit more readable.
The Starter now parses the variables for extract BEFORE we access the
indexFile and debugMode parameters.

Extract a method to print an index entry to the console and put it in
IndexStatsRunner.

Output some minor statistics when extraction gets started.
Also fix a bug in the indexer. The first index entry was not stored,
when failsafe distance was on.

Add A LOT of build options for debug mode. These should be handled with
care to reduce warnings and increase code robustness.
Remove debug() output commands from a lot of places. They are actually
more confusing than helpful.

Indexer class
- Changed the default block interval from 512 to 2048
- Added the postponeWrite field. This comes into play, when an empty
  block occurs. Empty blocks will not be used for index entries anymore
  as they caused a lot of problems leading to faulty results.
- Added a lot of debug code which can help a developer to better
  understand, how the index is created. Fields and methods are:
  - writeOutOfPartialDecompressedBlocks
  - storageForDecompressedBlocks
  - writeOutOfPartialDecompressedBlocks
  - storageForPartialDecompressedBlocks
  - partialBlockinfoStream
  - enableWriteOutOfDecompressedBlocksAndStatistics()
  - enableWriteOutOfPartialDecompressedBlocks()
- Extracted the methods createIndexEntryFromBlockData(),
  storeDictionaryForEntry() and writeIndexEntryIfPossible() to make the
  code more readable.
- Reworked finalizeProcessingForCurrentBlock() to make it more readable
  and also make it use and therefore output the above mentioned debug
  information.

Extractor class
- Added the roundtripBuffer field, which will be used later to enable:
  - error recognition, if the extract data is not made up of
    extractionMultiplier sized records
  - complete output with at least empty entries for a record.
  However, this is not yet used.
- Removed the skipLines field, as the whole mechanism was reworked and
  both index and extract should now produce proper results.
- Reworked processDecompressedChunkOfData(), it is now much easier to
  read and a lot less complicated.

Reworked the Starter to accept more command line options and pass them
to the specific runners.

Changed ZLibBasedFASTQProcessorBaseClass.splitStr so, that it does no
longer add an empty line to the back of the list, if the last char is a
newline character.

TestResourcesAndFunctions
- Add readLinesOfFile(), readFile(), readLinesOfResourceFile(),
  readResourceFile(), compareVectorContent() and
  getTestVectorWithSimulatedBlockData() to reduce code duplications and
  improve test readability. Code is taken from IndexerTest and
  ExtractorTest.
- The method getTestVectorWithSimulatedBlockData() will return a vector
  with files for simulating decompressed gzip block data. This tries to
  mimic common pitfalls like small blocks, empty blocks, blocks without
  newlines and so on.

IndexerTest
- Add a test for correct block line counting which uses the
  aforementioned block data from getTestVectorWithSimulatedBlockData()
- Move code to the aforementioned methods readLinesOfFile() and so on...

ExtractorTest
- Analogous to IndexerTest, add a test for correct line from block data
  extraction. This will also use getTestVectorWithSimulatedBlockData()
- Like in IndexerTest, use the extracted methods for file reading and
  vector comparison.

Add the blockAndLineCalculations test resource directory. This contains
several files with "decompressed" block data and a file which describes
the layout of the joined block files.
to allow easy publication of the standalone binary.
heinold and others added 2 commits July 26, 2019 18:48
These strategies are used to decided whether an IndexEntry will be
stored or not. The current block based (safeguarded) approach will still
be usable and is encapsulated in the BlockDistanceStorageStrategy class.

Add the IndexEntryStorageStrategy base class.

Add the ByteDistanceStorageStrategy class to allow the user to set a
Byte value like 1T, 2g, 3m and 4K as a minimum distance between index
entries.

Rename the field offsetInRawFile to blockOffsetInRawFile in IndexEntry
to match IndexEntryV1.

Move the method calculateIndexBlockInterval from Indexer to the new
IndexEntryStorageStrategy class.

Add a test class for the new strategy classes.
@dankwart-de dankwart-de requested a review from vinjana July 31, 2019 13:57
Copy link
Member

@vinjana vinjana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these are suggestions.

CMakeLists.txt Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
LICENSE.txt Show resolved Hide resolved
@@ -0,0 +1,33 @@
name: FastqIndEx
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The env doesn't work.
S3 is not included.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The included library did not work and I needed to download the AWS SDK. This is described in the README

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt this is acceptible for a deployment via Conda.

README.md Outdated Show resolved Hide resolved
src/common/CommonStructsAndConstants.h Outdated Show resolved Hide resolved
};

/**
* Represents an entry in a gz index file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think for explanation would it be helpful to have e.g. a plant-UML diagram of the the blocks, and all other concepts related to gzip format and indexes and thus explain all the terms? Maybe we can make a sketch that I can translate into digital diagram (or we just scan the sketch!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense

src/common/CommonStructsAndConstants.h Outdated Show resolved Hide resolved
src/common/CommonStructsAndConstants.h Outdated Show resolved Hide resolved
src/common/CommonStructsAndConstants.h Outdated Show resolved Hide resolved
src/common/ErrorAccumulator.cpp Outdated Show resolved Hide resolved
src/common/ErrorAccumulator.cpp Outdated Show resolved Hide resolved
src/common/ErrorMessages.cpp Show resolved Hide resolved
src/common/IOHelper.h Outdated Show resolved Hide resolved
src/common/IOHelper.cpp Outdated Show resolved Hide resolved
src/common/IOHelper.cpp Show resolved Hide resolved
src/common/IOHelper.cpp Outdated Show resolved Hide resolved
src/common/IOHelper.cpp Outdated Show resolved Hide resolved
src/main.cpp Outdated Show resolved Hide resolved
src/main.cpp Outdated Show resolved Hide resolved
In CommonStructsAndConstants:
- Add constants for Byte sizes like k(ilo)B(yte), M(ega)B(yte)

In IndexEntryStorageStrategy:
- Change the type of blockInterval and minIndexEntryByteDistance to
  signed.
- Add a getDefault() and a from() method for both strategies
- Add calculateDistanceBasedOnFileSize to ByteDistanceStorageStrategy
- Implemente useFilesizeForCalculation in ByteDistanceStorageStrategy

In Indexer and IndexerRunner:
- Remove the blockInterval and disableFailsafeDistance parameters from
  the constructor and add the storageStrategy parameter.

In ModeCLIParser:
- Remove argumentToPath()
- Add createCommandLineParser()
- Add createAllowedModeArg() and use this method in the parse methods
  of the subclasses.

In ModeCLIParser subclasses:
- Rework the parse() methods
- Use the new createAllowedModeArg() method.

In IndexModeCLIParser:
- Inline some strings back to the create..Args methods.
- Add an argument to parse the storage strategy and the byte distance.
src/process/base/ZLibBasedFASTQProcessorBaseClass.h Outdated Show resolved Hide resolved
src/process/base/ZLibBasedFASTQProcessorBaseClass.h Outdated Show resolved Hide resolved
src/process/base/ZLibBasedFASTQProcessorBaseClass.cpp Outdated Show resolved Hide resolved
src/process/extract/Extractor.h Outdated Show resolved Hide resolved
src/process/extract/Extractor.cpp Outdated Show resolved Hide resolved
src/process/extract/Extractor.cpp Outdated Show resolved Hide resolved
src/process/extract/Extractor.cpp Outdated Show resolved Hide resolved
src/process/extract/Extractor.cpp Outdated Show resolved Hide resolved
Copy link
Member

@vinjana vinjana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent error checking and reporting! Consider making the messages shorter and with more contextual information, e.g. the condition including some values that created the error and maybe the filename (in fatal errors). Consider filename only at the end, when the program actually stops (filenames can get reallly long). Avoid duplicate messages for different conditions -- messages should be uniquely findable in code (that's how I navigate to the erring code).

We need to discuss the zlib stuff -- after I am through the tests. I am going to pull to tests forward in the review, maybe I'm to reduce the number of stupid comments by me :-P

src/process/extract/Extractor.cpp Outdated Show resolved Hide resolved
src/process/extract/Extractor.cpp Show resolved Hide resolved
src/process/extract/Extractor.cpp Show resolved Hide resolved
src/process/extract/Extractor.cpp Outdated Show resolved Hide resolved
src/process/extract/Extractor.cpp Outdated Show resolved Hide resolved
src/process/extract/IndexReader.cpp Outdated Show resolved Hide resolved
src/process/extract/IndexReader.cpp Outdated Show resolved Hide resolved
src/process/extract/IndexReader.cpp Show resolved Hide resolved
src/process/extract/IndexReader.cpp Outdated Show resolved Hide resolved
src/process/extract/IndexReader.cpp Outdated Show resolved Hide resolved
There are some TODOs left which are marked with // TODO

Reenabled compilation flags for debug code.

Changed a lot of uint64 values to signed type. The signed type still
offers us the possibility to work with tremendously huge files and it
is easier in some operations to have a signed value like -1 for a file
size query (not existing, stream, whatever).

Removed unnecessary imports throughout the project.

Changed a lot of code so, that equal operations are correct. See to it:
- that compared types are both signed or unsigned
- compared types have the right sizes 32Bit < 64Bit and 64Bit<64Bit

Changed method parameter types to const ref where feasible.

Changed auto loop parameters to const ref where necessary.

Refactored Extractor::extract() and pulled out some methods to make it
more readable.

Renamed checkPremises throughout the whole project to fulfillsPremises.

Changed "old-style" casts to e.g. static_cast / reinterpret_cast over
all files.

Added more tests for the index entry storage strategies.
Copy link
Member

@vinjana vinjana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly skimmed over the recently changed parts but ignored for now the new parts. I will continue in the big review with these.

/**
* Must be set after the the index process via setNumberOfLinesInFile
*/
u_int64_t numberOfLinesInFile{0};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really a field of the IndexWriter? Should it not be in the index header? Same for numberOfWrittenEntries (or numberOfEntries -- this concerns the index entries, right? Not the FASTQ records.)

src/process/index/IndexWriter.cpp Outdated Show resolved Hide resolved
src/process/index/IndexWriter.cpp Outdated Show resolved Hide resolved
src/process/index/IndexWriter.cpp Outdated Show resolved Hide resolved
src/process/index/Indexer.cpp Outdated Show resolved Hide resolved
src/process/index/IndexEntryStorageStrategy.h Outdated Show resolved Hide resolved
src/process/index/IndexEntryStorageStrategy.h Outdated Show resolved Hide resolved
src/process/io/ConsoleSink.h Show resolved Hide resolved
src/process/io/s3/S3Config.h Outdated Show resolved Hide resolved
src/process/io/s3/S3Service.cpp Show resolved Hide resolved
Copy link
Member

@vinjana vinjana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be continued ...

src/common/CommonStructsAndConstants.h Outdated Show resolved Hide resolved
src/process/index/IndexEntryStorageStrategy.h Outdated Show resolved Hide resolved
src/process/index/IndexEntryStorageStrategy.h Outdated Show resolved Hide resolved
src/process/index/IndexEntryStorageStrategy.h Outdated Show resolved Hide resolved
src/process/index/IndexEntryStorageStrategy.h Outdated Show resolved Hide resolved
src/process/io/StreamSource.cpp Show resolved Hide resolved
src/process/io/StreamSource.h Show resolved Hide resolved
src/process/io/locks/PathLockHandler.cpp Outdated Show resolved Hide resolved
src/process/io/locks/PathLockHandler.h Outdated Show resolved Hide resolved
src/process/io/s3/S3Config.h Show resolved Hide resolved
Copy link
Member

@vinjana vinjana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to continue with the tests! Yeah! But not today.

src/process/io/s3/S3Config.cpp Outdated Show resolved Hide resolved
src/process/io/s3/S3Config.cpp Outdated Show resolved Hide resolved
src/process/io/s3/S3Config.cpp Show resolved Hide resolved
usedCredentialsFile = finalCredentialsFile;
usedConfigFile = finalConfigurationFile;
} else {
addErrorMessage("Could not validate configuration for S3.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which actual configuration was not validated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole configuration. But you're right, we can improve this.

return Starter::instance;
}

DoNothingRunner *Starter::assembleSmallCmdLineParserAndParseOpts(int argc, const char *argv[]) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me its not obvious what the method name has to do with the function of the method? You're not only assembling but actually run the parser. Assembling actually is a method internal technical detail. Maybe parseSmallCmdLineOpts? Or parseBaseCmdLine (not 100% sure what the "small" means)?

0, cmdLineParser.get());

auto numberOfReadsArg = _makeUIntValueArg(
"n", "numberofentries",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"n", "numberofentries",
"n", "numberOfEntries",

src/startup/IndexStatsModeCLIParser.h Show resolved Hide resolved
src/startup/ModeCLIParser.h Outdated Show resolved Hide resolved
}

/**
* Processes several FASTQ related parameters and create a FASTQ PathSource from the input.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Processes several FASTQ related parameters and create a FASTQ PathSource from the input.
* Processes several FASTQ related parameters and create a FASTQ Source from the input.

src/startup/ModeCLIParser.h Show resolved Hide resolved
Added a notes section to Contributing.md

Added the s3iomain.cpp and entries to the CMakefile to compile the
GetObject helper binary.

Starter is not a Singleton anymore.

Added IOHelper::getApplicationPath() to get the absolute path of the
running application. This is used to run the helper binary.

Added the S3GetObjectProcessWrapper, which is used to run the helper
binary in a separate process and thread.

Moved the async GetObject code from S3Source to the helper binary and
made it a bit more readable.

Added unsigned (U) suffixes to some numbers in tests.
Copy link
Member

@vinjana vinjana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, we should try to remove all references to FASTQ from the tool code, if it is useful for any data.

CONTRIBUTING.md Outdated Show resolved Hide resolved
src/process/io/s3/FQIS3Client.h Outdated Show resolved Hide resolved
src/process/io/s3/FQIS3Client.h Outdated Show resolved Hide resolved
src/process/io/s3/FQIS3Client.h Outdated Show resolved Hide resolved
src/process/io/s3/FQIS3Client.h Outdated Show resolved Hide resolved
test/TestResourcesAndFunctions.cpp Show resolved Hide resolved
test/TestResourcesAndFunctions.cpp Show resolved Hide resolved
// Utilizing current_path() will result in the path of current test-binary, which will be e.g.:
// ~/Projects/FastqIndEx/cmake-build-debug/test/testapp
// As we do want resources to be loaded from:
// ~/Projects/FastqIndEx/cmake-build-debug/test/testapp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line does not differ from the previous path line. What do you mean?

Suggested change
// ~/Projects/FastqIndEx/cmake-build-debug/test/testapp
// ~/Projects/FastqIndEx/test/resources/

???


void TestResourcesAndFunctions::CreateEmptyFile(const path &_path) {

ofstream streamTest("/tmp/sometestfile");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. Not going to accept this. A hard-coded path for test data ...

std::string TestResourcesAndFunctions::format(std::string format, ...) {
va_list args;
va_start(args, format);
size_t len = std::vsnprintf(nullptr, 0, format.c_str(), args);// TODO
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What TODO?

Copy link
Member

@vinjana vinjana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are all the CHECK calls indented much more?

Phuhhh. What a tour de force. I am through (except for the most recent changes in S3 stuff.

*/
bool TestResourcesAndFunctions::runIndexerBinary(const path &fastq, const path &index, bool pipeFastq) {
if (!pipeFastq) {
return runCommand(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

o we really have to run external commands for testing? Can this not be avoided by encapsulating the top-level code, which is usually called from main, but in the test configured to run from the test?

The parsing of CLI parameters should be testable independently by real unit tests.

return str;
}

bool TestResourcesAndFunctions::compareVectorContent(const vector<string> &reference, const vector<string> &actual,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this there is certainly a C++ standard-lib function.

test/TestResourcesAndFunctions.h Show resolved Hide resolved
test/TestResourcesAndFunctionsTest.cpp Show resolved Hide resolved
// Utilizing current_path() will result in the path of current test-binary, which will be e.g.:
// ~/Projects/FastqIndEx/cmake-build-debug/test/testapp
// As we do want resources to be loaded from:
// ~/Projects/FastqIndEx/cmake-build-debug/test/testapp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dito

BTW: The code is not DRY, here.

test/process/io/PathSinkTest.cpp Outdated Show resolved Hide resolved
test/process/io/PathSourceTest.cpp Outdated Show resolved Hide resolved
StreamSource Source(&testData);
CHECK(Source.isStream());

int64_t expectedSize = -1; // Use overflow to get the max value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a constant in the C++ standard-lib. I am not sure whether this overflow approach is portable.

CHECK(!p->canRead());
CHECK(p->canWrite());

// auto p2 = S3Sink(res.createEmptyFile("noOverwrite"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented?

CHECK(p->tell() == static_cast<int64_t>(2 * test1.length() + 5));
}

// TEST (S3_SINK_WRITE_OVERWRITEBYTES) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented?

Copy link
Member

@vinjana vinjana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... done. Round 1.

src/process/io/s3/S3GetObjectProcessWrapper.h Show resolved Hide resolved
src/process/io/s3/S3GetObjectProcessWrapper.h Show resolved Hide resolved
src/process/io/s3/S3GetObjectProcessWrapper.h Show resolved Hide resolved
src/process/io/s3/S3Sink.h Show resolved Hide resolved
* - Due to this, it is not possible to seek within an S3 file stream! Fortunately, the StreamSource has some built-in
* buffer to deal with this.
* - It is not possible to abort a running request! What we did first was download with GetObjectAsync. However this
* lead to errors, when we closed the pipe or tried to close the stream. Worst case: The application crashed, best
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also elsewhere

Suggested change
* lead to errors, when we closed the pipe or tried to close the stream. Worst case: The application crashed, best
* led to errors, when we closed the pipe or tried to close the stream. Worst case: The application crashed, best


vector<string> IndexReadingRunner::getErrorMessages() {
if (fastqFile.get()) {
auto a = ErrorAccumulator::getErrorMessages();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the value of the intermediate variables?


bool fastqIsValid = ActualRunner::fulfillsPremises();
bool indexIsValid = indexFile->fulfillsPremises();
// // Index files are automatically overwritten but need to have write access!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this commented?

}

/**
* The main function will 6 parameters (+1 for the application path in parameter 0).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* The main function will 6 parameters (+1 for the application path in parameter 0).
* The main function has 6 parameters (+1 for the application path in parameter 0).

int main(int argc, char **argv) {
auto [parentPathRetrieved, parentPath] = getParentPath();
if(!parentPathRetrieved) {
std::cerr << "Could not get the parent processes path.\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess these will interleave with the messages of the parent process, right?

std::cerr << "Could not get the parent processes path.\n";
return -1;
}
if (parentPath != "fastqindex") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think the name of also this binary should be camel-case. All lower-case is also less readable in Bash (although Bash itself may be the larger burden).

There are a lot of changes in this commit and I will only give a more
detailed description if necessary.

Update the README.

Update the CONTRIBUTING file.

Over all files:
- Renamed variables from fastqFile to sourceFile.
- Renamed and changed a lot of variables.
- Remove obsolete code and comments.
- Changed comments and fixed typos.
- Also renamed methods related to fastqFile accordingly.
- Changed error messages to be more precise and helpful.
- Renamed some functions to better express, what they are doing.

Renamed:
- PathSink/Source => FileSink/Source
- PathLockHandler => FileLockHandler
- IndexEntryStorageStrategy => IndexEntryStorageDecisionStrategy
- (Also test classes!)

Extracted IndexHeader and (Base)IndexEntry(V1) from
CommonStructsAndConstants to individual files in process/base.

Add the template Result class and tests to have a generic class to
return results, the succeeding status and an (error) message from method
calls.

In CommonStructsAndConstants add the stoui method plus tests, which does
not come with the C++ standard libraries.

In ErrorAccumulator rename mergeToNewVector() to concatenateVectors().

In IOHelper add checkFileWriteability().

In StringHelper added/moved parseStringValue() from
ByteDistanceBasedStorageDecisionStrategy.

In FQIS3Client.h:
- Rename ObjectListEntry to S3Object.
- make FQIS3ClientRequestBooleanResult a typedef of Result<bool>
Copy link
Member

@vinjana vinjana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

o.k. discussed.

src/process/index/IndexEntryStorageStrategy.h Outdated Show resolved Hide resolved
src/process/index/IndexEntryStorageStrategy.h Outdated Show resolved Hide resolved
src/process/index/IndexEntryStorageStrategy.h Outdated Show resolved Hide resolved
src/process/index/IndexEntryStorageStrategy.h Outdated Show resolved Hide resolved
src/process/io/PathSink.cpp Outdated Show resolved Hide resolved
src/process/io/StreamSource.cpp Show resolved Hide resolved
dankwart-de pushed a commit that referenced this pull request Sep 19, 2019
Bug #11: Make IOHelper::fullPath() aware of a starting ~. This will
allow the application to resolve ~/ to your home directory.

PR #37: Fix several typos throughout the whole application.

Bug #16: Fix a bug, where the application will fail, when the input file
exists but is empty.

Did some cleanup.

Renamed fastqindexs3iohelper binary to S3IOHelperForFastqIndEx to allow
correct tab complextion for "fastqindex ".
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants