Changes for PR #37

There are a lot of changes in this commit and I will only give a more detailed description if necessary. Update the README. Update the CONTRIBUTING file. Over all files: - Renamed variables from fastqFile to sourceFile. - Renamed and changed a lot of variables. - Remove obsolete code and comments. - Changed comments and fixed typos. - Also renamed methods related to fastqFile accordingly. - Changed error messages to be more precise and helpful. - Renamed some functions to better express, what they are doing. Renamed: - PathSink/Source => FileSink/Source - PathLockHandler => FileLockHandler - IndexEntryStorageStrategy => IndexEntryStorageDecisionStrategy - (Also test classes!) Extracted IndexHeader and (Base)IndexEntry(V1) from CommonStructsAndConstants to individual files in process/base. Add the template Result class and tests to have a generic class to return results, the succeeding status and an (error) message from method calls. In CommonStructsAndConstants add the stoui method plus tests, which does not come with the C++ standard libraries. In ErrorAccumulator rename mergeToNewVector() to concatenateVectors(). In IOHelper add checkFileWriteability(). In StringHelper added/moved parseStringValue() from ByteDistanceBasedStorageDecisionStrategy. In FQIS3Client.h: - Rename ObjectListEntry to S3Object. - make FQIS3ClientRequestBooleanResult a typedef of Result<bool>
DKFZ-ODCF · Aug 15, 2019 · 5d92837 · 5d92837
1 parent 1c078fd
commit 5d92837
Show file tree

Hide file tree

Showing 79 changed files with 1,273 additions and 1,111 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -1,17 +1,21 @@
 # Contributing to FastqIndEx
 
-If you would like to contribute code you can do so through GitHub by forking the repository and sending us a pull request.
+If you would like to contribute code you can do so through GitHub by
+ forking the repository and sending us a pull request.
 
-When submitting code, please make every effort to follow existing conventions and style in order to keep the code as readable as possible.
+When submitting code, please make every an effort to follow already used
+conventions and style of the current code in order to keep the code as 
+readable as possible.
 
 Please also try to write unit tests wherever it is possible.
 
 ## License
 
-By contributing your code, you agree to license your contribution under the terms of the MIT License:
+By contributing your code, you agree to license your contribution under 
+the terms of the MIT License:
 
-https://opensource.org/licenses/mit-license.html
-https://github.com/DKFZ-ODCF/FastqIndEx/blob/master/LICENSE.txt
+* [MIT License](https://opensource.org/licenses/mit-license.html)
+* [FastqIndEx License](https://github.com/DKFZ-ODCF/FastqIndEx/blob/master/LICENSE.txt)
 
 If you are adding a new file it should have a header like this:
 
@@ -40,27 +44,31 @@ The version 1 header is exactly 512 Byte wide and can be described like:
 
 ``` bash
 |                                                                              (IndexHeader)                                                                             | 
-|    (u_int32_t)     |    (u_int32_t)   | (u_int32_t) |  (u_int32_t)  |   (int64_t)   |     (int64_t)    |           (bool)          |   (u_char)  | (int64_t)[59] |
+|    (u_int32_t)     |    (u_int32_t)   | (u_int32_t) |  (u_int32_t)  |    (int64_t)    |      (int64_t)     |           (bool)          |   (u_char)  |  (int64_t)[59]  |
 | indexWriterVersion | sizeOfIndexEntry | magicNumber | blockInterval | numberOfEntries | linesInIndexedFile | dictionariesAreCompressed | placeholder |     reserved    |
 ```
 
-The version 1 index entry has an extracted width of 32800 Byte index entry can be described like:
+The version 1 index entry has an extracted width of 32800 Byte index 
+entry can be described like:
 
 ```bash
 |                                                                          IndexEntry                                                                         |
-|   (int64_t)   |      (int64_t)     |     (int64_t)     |      (u_int32_t)       | (u_int32_t) | (u_char) |        (u_int16_t)       | (u_char)[32768] |
+|    (int64_t)    |       (int64_t)      |      (int64_t)      |      (u_int32_t)       | (u_int32_t) | (u_char) |        (u_int16_t)       | (u_char)[32768] |
 |     blockID     | blockOffsetInRawFile | startingLineInEntry | offsetOfFirstValidLine |     bits    | reserved | compressedDictionarySize |    dictionary   |
 ```
 
-If compression is enabled, this looks a bit different (note the last field differs then and depends on compressedDictionarySize!), when it is stored in the FQI file:
+If compression is enabled, this looks a bit different (note the last 
+field differs then and depends on compressedDictionarySize!), when it is
+ stored in the FQI file:
 
 ```bash
 |                                                                                   IndexEntry                                                                                   |
-|   (int64_t)   |      (int64_t)     |     (int64_t)     |      (u_int32_t)       | (u_int32_t) | (u_char) |        (u_int16_t)       | (u_char)[compressedDictionarySize] |
+|    (int64_t)    |       (int64_t)      |      (int64_t)      |      (u_int32_t)       | (u_int32_t) | (u_char) |        (u_int16_t)       | (u_char)[compressedDictionarySize] |
 |     blockID     | blockOffsetInRawFile | startingLineInEntry | offsetOfFirstValidLine |     bits    | reserved | compressedDictionarySize |             dictionary             |
 ```
 
-If compression is enabled, you need to read out the IndexEntry without the dictionary first. 
+If compression is enabled, you need to read out the IndexEntry without 
+the dictionary first. 
 
 ## Development setup
 
@@ -80,6 +88,20 @@ as safe as possible.
 **<span style="color:orange;">=>  We will not get the application a 100%
   safe, but we try to minimize the risk of data corruption.</span>**
 
+### Error handling
+
+In our code base, you will not find assert and throws but collect errors
+as soon as possible. The class used for this is the ErrorAccumulator. 
+Most classes inherit this class. Now, before anything is done, we try to
+collect errors as soon as possible and abort before index / extract or 
+stats is running. 
+
+In short:
+* Do not use throws or assert, when not necessary. Document it, when you 
+  plan to diverge from this.
+* Try to keep your error messages precise and helpful.
+* Mark programming errors with the prefix "BUG: "
+
 ### Code validation with clang-tidy
 
 We use CLion to develop FastqIndEx. CLion has clang-tidy support built 

diff --git a/README.md b/README.md
@@ -9,22 +9,20 @@ files.
 
 ## Features at a glance
 
-* Support for accessing:
-  - Local files (either by piped or streamed access)
-  - Files safely over NFS (using flock(), also with either piped or
-    streamed access)
+* Support for:
+  - Local file access (either by piped or streamed access)
+  - Safe concurrent file access over NFS (using flock(), also with 
+    either piped or streamed access)
   - Files in an S3 bucket (experimental, without locking yet!)
 * Many checks to make sure, that the application does what you expect
   and runs safely.
 * A number of options to tell the indexer how you like to get things 
   done.
-* Different configurable indexing strategies:
-  - Block distance based
-  - Byte distance based (default)
+* A flexible and configurable index size.
 * Different extraction strategies:
   - Extract a range of records 
-  - Extract one of n-segements, where you define the number of segments 
-    on the fly.
+  - (Virtually) divide the FASTQ on the fly into n segments and extract 
+    one segement of your choice.
 
 ## License and Contributing 
 
@@ -129,12 +127,12 @@ FastqIndEx has the following dependencies, which should be met before building i
 
 ##### Use Conda to manage your dependencies
 
-* Install Miniconda / Anaconda
-* The conda recipe is contained in env/environment.yml, use conda-env 
-  to import it:
-  conda env create -n FastqIndEx -f env/environment.yml
-* Download and install UnitTest++ like described in the next section, 
-  it is not available in Conda.
+1. Install Miniconda / Anaconda
+2. The conda recipe is contained in env/environment.yml, use conda-env 
+   to import it:
+   conda env create -n FastqIndEx -f env/environment.yml
+3. Download and install UnitTest++ like described in the next section, 
+   it is not available in Conda.
 
 ##### Compilation with manual installation of dependencies
 
@@ -181,83 +179,79 @@ activate the Conda environment of FastqIndEx.
 
 #### FastqIndEx
 
-First, you need to clone the FastqIndEx Git repository. You can do this
-by executing:
-
-``` bash
-cd ~/Projects           # Or any other desired location, we'll stick to this
-git clone https://github.com/DKFZ-ODCF/FastqIndEx.git
-git checkout master     # Or any other version you like
-```
-
-To compile it, create a CMake build directory and run CMake afterwards:
-
-``` Bash
-export CONDA_DIR="~/miniconda2/envs/FastqIndEx"     # Miniconda is normally installed in <home>/miniconda2
-export CONDA_LIB=${CONDA_DIR}/lib
-export CONDA_LIB64=${CONDA_DIR}/lib64
-export CONDA_INCLUDE=${CONDA_DIR}/include
-cmake -D CMAKE_BUILD_TYPE=Debug -D BUILD_ONLY="s3;config;transfer" -D BUILD_SHARED_LIBS=ON  -D OPENSSL_ROOT_DIR=${CONDA_DIR}  -D OPENSSL_INCLUDE_DIR=${CONDA_INCLUDE} OPENSSL_LIBRARIES=${CONDA_LIB}  -D ZLIB_INCLUDE_DIR=${CONDA_INCLUDE} -DZLIB_LIBRARY=${CONDA_LIB}/libz.a -D CURL_INCLUDE_DIR=${CONDA_INCLUDE} -DCURL_LIBRARY=${CONDA_LIB}/libcurl.so  -D CMAKE_INSTALL_PREFIX=$PWD/../install_shared -D ENABLE_TESTING=OFF ..
-cmake -D CMAKE_BUILD_TYPE=Debug -D BUILD_ONLY="s3;config;transfer" -D BUILD_SHARED_LIBS=OFF  -D OPENSSL_ROOT_DIR=${CONDA_DIR}  -D OPENSSL_INCLUDE_DIR=${CONDA_INCLUDE} OPENSSL_LIBRARIES=${CONDA_LIB}  -D ZLIB_INCLUDE_DIR=${CONDA_INCLUDE} -DZLIB_LIBRARY=${CONDA_LIB}/libz.a -D CURL_INCLUDE_DIR=${CONDA_INCLUDE} -DCURL_LIBRARY=${CONDA_LIB}/libcurl.so  -D CMAKE_INSTALL_PREFIX=$PWD/../install_shared -D ENABLE_TESTING=OFF ..
-
-
-AWS_DIR="/path/to/aws-sdk-cpp/install_shared"
-cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Debug \
-    -D "UnitTest++_DIR":PATH="/path/to/unittest-cpp/install/lib/cmake/UnitTest++" \
-    -D "AWSSDK_DIR":PATH="${AWS_DIR}/lib64/cmake/AWSSDK" \
-    -D "aws-cpp-sdk-core_DIR":PATH="${AWS_DIR}/lib64/cmake/aws-cpp-sdk-core" \
-    -D "aws-c-event-stream_DIR":PATH="${AWS_DIR}/lib64/aws-c-event-stream/cmake" \
-    -D "aws-c-common_DIR":PATH="${AWS_DIR}/lib64/aws-c-common/cmake" \
-    -D "aws-checksums_DIR":PATH="${AWS_DIR}/lib64/aws-checksums/cmake" \
-    -D "aws-cpp-sdk-s3_DIR":PATH="${AWS_DIR}/lib64/cmake/aws-cpp-sdk-s3" ..
-
-
-cd FastqIndEx
-mkdir release                                                             # Or also debug, in case you want to develop
-cd release
-cmake -G "Unix Makefiles" \
-    -D "UnitTest++_DIR":PATH=/path/to/unittest-cpp/lib/cmake/UnitTest++ \ # If necessary
-    -D ZLIB_LIBRARY=/path/to/zlib-1.2.11/libz.a \                         # If necessary
-    -D ZLIB_INCLUDE_DIR=/path/to/zlib-1.2.11 \                            # If necessary
-    -D AWSSDK_DIR:PATH=<aws installation>/lib64/cmake/AWSSDK                          \  # For S3 support, you need to do this.                          
-    -D aws-cpp-sdk-core_DIR:PATH=<aws installation>/lib64/cmake/aws-cpp-sdk-core      \                    
-    -D aws-c-event-stream_DIR:PATH=<aws installation>/lib64/aws-c-event-stream/cmake  \                        
-    -D aws-c-common_DIR:PATH=<aws installation>/lib64/aws-c-common/cmake              \            
-    -D aws-checksums_DIR:PATH=<aws installation>/lib64/aws-checksums/cmake            \            
-    -D aws-cpp-sdk-s3_DIR:PATH=<aws installation>/lib64/cmake/aws-cpp-sdk-s3          \                
-    -DCMAKE_BUILD_TYPE=Release                                            # Or =Debug, if you plan to develop 
-    ..
-cd ..
-cmake --build release --target all -- -j 2                                # Or --build debug
-```
-
-**<span style="color:ffffa0;">
-Note, that the -D flags for the includes are only necessary, if you
-installed the libraries manually. If they are already installed on your 
-system or (e.g. for UnitTest++) you installed them to the system folders
-or if you use the Conda environment, you can omit these flags.
-</span>**
-
-To clean the build directory use:
-
-``` Bash
-cmake --build build --target clean -- -j 2
-```
-
-To run the tests, run the test binary like:
-
-``` Bash
-(cd build/test && ./testapp)
-```
-
-If you want, you can add the release or debug directory to your PATH
-variable. E.g. in your local .bashrc file add the following:
-
-``` bash
-# This assumes, that you cloned the repo to ~/Projects/FastqIndEx and 
-# created the release sub directory like described above.
-export PATH=~/Projects/FastqIndEx/release/src:$PATH
-```
+1. First, you need to clone the FastqIndEx Git repository. You can do this
+   by executing:
+
+    ``` bash
+    cd ~/Projects           # Or any other desired location, we'll stick to this
+    git clone https://github.com/DKFZ-ODCF/FastqIndEx.git
+    git checkout master     # Or any other version you like
+    ```
+
+2. To compile it, create a CMake build directory and run CMake afterwards:
+
+    ``` Bash
+    export CONDA_DIR="~/miniconda2/envs/FastqIndEx"     # Miniconda is normally installed in <home>/miniconda2
+    export CONDA_LIB=${CONDA_DIR}/lib
+    export CONDA_LIB64=${CONDA_DIR}/lib64
+    export CONDA_INCLUDE=${CONDA_DIR}/include
+    cmake -D CMAKE_BUILD_TYPE=Debug -D BUILD_ONLY="s3;config;transfer" -D BUILD_SHARED_LIBS=ON  -D OPENSSL_ROOT_DIR=${CONDA_DIR}  -D OPENSSL_INCLUDE_DIR=${CONDA_INCLUDE} OPENSSL_LIBRARIES=${CONDA_LIB}  -D ZLIB_INCLUDE_DIR=${CONDA_INCLUDE} -DZLIB_LIBRARY=${CONDA_LIB}/libz.a -D CURL_INCLUDE_DIR=${CONDA_INCLUDE} -DCURL_LIBRARY=${CONDA_LIB}/libcurl.so  -D CMAKE_INSTALL_PREFIX=$PWD/../install_shared -D ENABLE_TESTING=OFF ..
+    cmake -D CMAKE_BUILD_TYPE=Debug -D BUILD_ONLY="s3;config;transfer" -D BUILD_SHARED_LIBS=OFF  -D OPENSSL_ROOT_DIR=${CONDA_DIR}  -D OPENSSL_INCLUDE_DIR=${CONDA_INCLUDE} OPENSSL_LIBRARIES=${CONDA_LIB}  -D ZLIB_INCLUDE_DIR=${CONDA_INCLUDE} -DZLIB_LIBRARY=${CONDA_LIB}/libz.a -D CURL_INCLUDE_DIR=${CONDA_INCLUDE} -DCURL_LIBRARY=${CONDA_LIB}/libcurl.so  -D CMAKE_INSTALL_PREFIX=$PWD/../install_shared -D ENABLE_TESTING=OFF ..
+
+
+    export AWS_DIR="/path/to/aws-sdk-cpp/install_shared"
+    export UNITTESTPP_DIR="/path/to/unittest-cpp"
+    export ZLIB_DIR="/path/to/zlib-1.2.11"                                    # If necessary
+
+    # The following instructions are for a release build of FastqIndEx. 
+    # If you want to create a debug build, change "release" to "debug"
+    export MODE=release
+    cd FastqIndEx
+    mkdir ${MODE}                                                             
+    cd ${MODE}
+    cmake -G "Unix Makefiles" \
+        -D "UnitTest++_DIR":PATH="${UNITTESTPP_DIR}/install/lib/cmake/UnitTest++" \ # If necessary
+        -D ZLIB_LIBRARY="${ZLIB_DIR}/libz.a" \                                      # If necessary
+        -D ZLIB_INCLUDE_DIR="${ZLIB_DIR}"" \                                        # If necessary
+        -D "AWSSDK_DIR":PATH="${AWS_DIR}/lib64/cmake/AWSSDK" \                      # For S3 support, you need to do this.
+        -D "aws-cpp-sdk-core_DIR":PATH="${AWS_DIR}/lib64/cmake/aws-cpp-sdk-core" \
+        -D "aws-c-event-stream_DIR":PATH="${AWS_DIR}/lib64/aws-c-event-stream/cmake" \
+        -D "aws-c-common_DIR":PATH="${AWS_DIR}/lib64/aws-c-common/cmake" \
+        -D "aws-checksums_DIR":PATH="${AWS_DIR}/lib64/aws-checksums/cmake" \
+        -D "aws-cpp-sdk-s3_DIR":PATH="${AWS_DIR}/lib64/cmake/aws-cpp-sdk-s3" \                
+        -DCMAKE_BUILD_TYPE=Release                                             
+        ..
+    cd ..
+    cmake --build ${MODE} --target all -- -j 2                                
+    ```
+
+    **<span style="color:ffffa0;">
+    Note, that the -D flags for the includes are only necessary, if you
+    installed the libraries manually. If they are already installed on your 
+    system or (e.g. for UnitTest++) you installed them to the system folders
+    or if you use the Conda environment, you can omit these flags.
+    </span>**
+
+3. To clean the build directory use:
+
+    ``` Bash
+    cmake --build build --target clean -- -j 2
+    ```
+
+4. To run the tests, run the test binary like:
+
+    ``` Bash
+    (cd build/test && ./testapp)
+    ```
+
+5. If you want, you can add the release or debug directory to your PATH
+    variable. E.g. in your local .bashrc file add the following:
+
+    ``` bash
+    # This assumes, that you cloned the repo to ~/Projects/FastqIndEx and 
+    # created the release sub directory like described above.
+    export PATH=~/Projects/FastqIndEx/release/src:$PATH
+    ```
 
 ## Links
 

diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
@@ -33,16 +33,18 @@ add_library(
         fastqindexlib
         common/CommonStructsAndConstants.cpp common/CommonStructsAndConstants.h
         common/ErrorAccumulator.cpp common/ErrorAccumulator.h
-        common/IndexHeader.cpp common/IndexHeader.h
         common/ErrorMessages.cpp common/ErrorMessages.h
-        common/IndexEntry.cpp common/IndexEntry.h
-        common/IndexEntryV1.h
         common/IOHelper.cpp common/IOHelper.h
+        common/Result.h
         common/StringHelper.cpp common/StringHelper.h
+        process/base/BaseIndexEntry.h
+        process/base/IndexHeader.cpp process/base/IndexHeader.h
+        process/base/IndexEntry.cpp process/base/IndexEntry.h
+        process/base/IndexEntryV1.h
         process/base/ZLibBasedFASTQProcessorBaseClass.cpp process/base/ZLibBasedFASTQProcessorBaseClass.h
         process/extract/Extractor.cpp process/extract/Extractor.h
         process/extract/IndexReader.cpp process/extract/IndexReader.h
-        process/index/IndexEntryStorageStrategy.h
+        process/index/IndexEntryStorageDecisionStrategy.h
         process/index/Indexer.cpp process/index/Indexer.h
         process/index/IndexWriter.cpp process/index/IndexWriter.h
         process/io/s3/FQIS3Client.h
@@ -53,11 +55,11 @@ add_library(
         process/io/s3/S3Sink.h
         process/io/s3/S3Source.cpp process/io/s3/S3Source.h
         process/io/locks/LockHandler.h
-        process/io/locks/PathLockHandler.cpp process/io/locks/PathLockHandler.h
+        process/io/locks/FileLockHandler.cpp process/io/locks/FileLockHandler.h
         process/io/locks/S3LockHandler.h
         process/io/IOBase.h
-        process/io/PathSink.cpp process/io/PathSink.h
-        process/io/PathSource.cpp process/io/PathSource.h
+        process/io/FileSink.cpp process/io/FileSink.h
+        process/io/FileSource.cpp process/io/FileSource.h
         process/io/Sink.h
         process/io/Source.h
         process/io/ConsoleSink.h

diff --git a/src/common/CommonStructsAndConstants.cpp b/src/common/CommonStructsAndConstants.cpp
@@ -29,3 +29,14 @@ const int64_t TB = GB * 1024;
 
 const int DEFAULT_RECORD_SIZE = 4;
 
+uint stoui(std::string value) {
+    // Unfortunately, C++ does not offer the stoui method (or stou). Why? Nobody knows. I found
+    // the following code snippet here: https://stackoverflow.com/questions/8715213/why-is-there-no-stdstou
+
+    unsigned long intermediate = std::stoul(value);
+    if (intermediate > std::numeric_limits<unsigned>::max()) {
+        throw std::out_of_range("stou");
+    }
+    return static_cast<uint>(intermediate);
+}
+
diff --git a/src/common/CommonStructsAndConstants.h b/src/common/CommonStructsAndConstants.h
@@ -52,10 +52,14 @@ extern const int DEFAULT_RECORD_SIZE;
  */
 #define CLEAN_WINDOW_SIZE (WINDOW_SIZE + 1)
 
-struct VirtualIndexEntry {
-    bool operator==(const VirtualIndexEntry &rhs) const { return true; };
-};
-
+/**
+ * stoui is not available as a core method (unlike stoi or stol ...). Mimic it here.
+ *
+ * Throws an out_of_range excpetion, if the value is too large
+ * @param value A string
+ * @return The hopefully converted value.
+ */
+uint stoui(std::string value);
 
 
 #endif //FASTQINDEX_COMMONSTRUCTS_H
diff --git a/src/common/ErrorAccumulator.cpp b/src/common/ErrorAccumulator.cpp
@@ -57,7 +57,7 @@ string ErrorAccumulator::join(_cstr s0, _cstr s1, _cstr s2, _cstr s3, _cstr s4,
 }
 
 
-vector<string> ErrorAccumulator::mergeToNewVector(const vector<string> &l, const vector<string> &r) {
+vector<string> ErrorAccumulator::concatenateVectors(const vector<string> &l, const vector<string> &r) {
     std::vector<string> merged;
     merged.reserve(l.size() + r.size());
     merged.insert(merged.end(), l.begin(), l.end());
@@ -66,7 +66,7 @@ vector<string> ErrorAccumulator::mergeToNewVector(const vector<string> &l, const
 }
 
 vector<string>
-ErrorAccumulator::mergeToNewVector(const vector<string> &a, const vector<string> &b, const vector<string> &c) {
+ErrorAccumulator::concatenateVectors(const vector<string> &a, const vector<string> &b, const vector<string> &c) {
     std::vector<string> merged;
     merged.reserve(a.size() + b.size() + c.size());
     merged.insert(merged.end(), a.begin(), a.end());

diff --git a/src/common/ErrorAccumulator.h b/src/common/ErrorAccumulator.h
@@ -76,12 +76,12 @@ class ErrorAccumulator {
      * @param r The right vector
      * @return A new vector<string> with both vectors merged. The messages of l will be placed before the messages of r.
      */
-    static vector<string> mergeToNewVector(const vector<string> &l, const vector<string> &r);
+    static vector<string> concatenateVectors(const vector<string> &l, const vector<string> &r);
 
     /**
      * Similar to its version with two input parameters.
      */
-    static vector<string> mergeToNewVector(const vector<string> &a, const vector<string> &b, const vector<string> &c);
+    static vector<string> concatenateVectors(const vector<string> &a, const vector<string> &b, const vector<string> &c);
 
 };