Skip to content

Commit

Permalink
Changes for PR #37
Browse files Browse the repository at this point in the history
There are a lot of changes in this commit and I will only give a more
detailed description if necessary.

Update the README.

Update the CONTRIBUTING file.

Over all files:
- Renamed variables from fastqFile to sourceFile.
- Renamed and changed a lot of variables.
- Remove obsolete code and comments.
- Changed comments and fixed typos.
- Also renamed methods related to fastqFile accordingly.
- Changed error messages to be more precise and helpful.
- Renamed some functions to better express, what they are doing.

Renamed:
- PathSink/Source => FileSink/Source
- PathLockHandler => FileLockHandler
- IndexEntryStorageStrategy => IndexEntryStorageDecisionStrategy
- (Also test classes!)

Extracted IndexHeader and (Base)IndexEntry(V1) from
CommonStructsAndConstants to individual files in process/base.

Add the template Result class and tests to have a generic class to
return results, the succeeding status and an (error) message from method
calls.

In CommonStructsAndConstants add the stoui method plus tests, which does
not come with the C++ standard libraries.

In ErrorAccumulator rename mergeToNewVector() to concatenateVectors().

In IOHelper add checkFileWriteability().

In StringHelper added/moved parseStringValue() from
ByteDistanceBasedStorageDecisionStrategy.

In FQIS3Client.h:
- Rename ObjectListEntry to S3Object.
- make FQIS3ClientRequestBooleanResult a typedef of Result<bool>
  • Loading branch information
heinold committed Aug 15, 2019
1 parent 1c078fd commit 5d92837
Show file tree
Hide file tree
Showing 79 changed files with 1,273 additions and 1,111 deletions.
44 changes: 33 additions & 11 deletions CONTRIBUTING.md
@@ -1,17 +1,21 @@
# Contributing to FastqIndEx

If you would like to contribute code you can do so through GitHub by forking the repository and sending us a pull request.
If you would like to contribute code you can do so through GitHub by
forking the repository and sending us a pull request.

When submitting code, please make every effort to follow existing conventions and style in order to keep the code as readable as possible.
When submitting code, please make every an effort to follow already used
conventions and style of the current code in order to keep the code as
readable as possible.

Please also try to write unit tests wherever it is possible.

## License

By contributing your code, you agree to license your contribution under the terms of the MIT License:
By contributing your code, you agree to license your contribution under
the terms of the MIT License:

https://opensource.org/licenses/mit-license.html
https://github.com/DKFZ-ODCF/FastqIndEx/blob/master/LICENSE.txt
* [MIT License](https://opensource.org/licenses/mit-license.html)
* [FastqIndEx License](https://github.com/DKFZ-ODCF/FastqIndEx/blob/master/LICENSE.txt)

If you are adding a new file it should have a header like this:

Expand Down Expand Up @@ -40,27 +44,31 @@ The version 1 header is exactly 512 Byte wide and can be described like:

``` bash
| (IndexHeader) |
| (u_int32_t) | (u_int32_t) | (u_int32_t) | (u_int32_t) | (int64_t) | (int64_t) | (bool) | (u_char) | (int64_t)[59] |
| (u_int32_t) | (u_int32_t) | (u_int32_t) | (u_int32_t) | (int64_t) | (int64_t) | (bool) | (u_char) | (int64_t)[59] |
| indexWriterVersion | sizeOfIndexEntry | magicNumber | blockInterval | numberOfEntries | linesInIndexedFile | dictionariesAreCompressed | placeholder | reserved |
```

The version 1 index entry has an extracted width of 32800 Byte index entry can be described like:
The version 1 index entry has an extracted width of 32800 Byte index
entry can be described like:

```bash
| IndexEntry |
| (int64_t) | (int64_t) | (int64_t) | (u_int32_t) | (u_int32_t) | (u_char) | (u_int16_t) | (u_char)[32768] |
| (int64_t) | (int64_t) | (int64_t) | (u_int32_t) | (u_int32_t) | (u_char) | (u_int16_t) | (u_char)[32768] |
| blockID | blockOffsetInRawFile | startingLineInEntry | offsetOfFirstValidLine | bits | reserved | compressedDictionarySize | dictionary |
```

If compression is enabled, this looks a bit different (note the last field differs then and depends on compressedDictionarySize!), when it is stored in the FQI file:
If compression is enabled, this looks a bit different (note the last
field differs then and depends on compressedDictionarySize!), when it is
stored in the FQI file:

```bash
| IndexEntry |
| (int64_t) | (int64_t) | (int64_t) | (u_int32_t) | (u_int32_t) | (u_char) | (u_int16_t) | (u_char)[compressedDictionarySize] |
| (int64_t) | (int64_t) | (int64_t) | (u_int32_t) | (u_int32_t) | (u_char) | (u_int16_t) | (u_char)[compressedDictionarySize] |
| blockID | blockOffsetInRawFile | startingLineInEntry | offsetOfFirstValidLine | bits | reserved | compressedDictionarySize | dictionary |
```

If compression is enabled, you need to read out the IndexEntry without the dictionary first.
If compression is enabled, you need to read out the IndexEntry without
the dictionary first.

## Development setup

Expand All @@ -80,6 +88,20 @@ as safe as possible.
**<span style="color:orange;">=> We will not get the application a 100%
safe, but we try to minimize the risk of data corruption.</span>**

### Error handling

In our code base, you will not find assert and throws but collect errors
as soon as possible. The class used for this is the ErrorAccumulator.
Most classes inherit this class. Now, before anything is done, we try to
collect errors as soon as possible and abort before index / extract or
stats is running.

In short:
* Do not use throws or assert, when not necessary. Document it, when you
plan to diverge from this.
* Try to keep your error messages precise and helpful.
* Mark programming errors with the prefix "BUG: "

### Code validation with clang-tidy

We use CLion to develop FastqIndEx. CLion has clang-tidy support built
Expand Down
178 changes: 86 additions & 92 deletions README.md
Expand Up @@ -9,22 +9,20 @@ files.

## Features at a glance

* Support for accessing:
- Local files (either by piped or streamed access)
- Files safely over NFS (using flock(), also with either piped or
streamed access)
* Support for:
- Local file access (either by piped or streamed access)
- Safe concurrent file access over NFS (using flock(), also with
either piped or streamed access)
- Files in an S3 bucket (experimental, without locking yet!)
* Many checks to make sure, that the application does what you expect
and runs safely.
* A number of options to tell the indexer how you like to get things
done.
* Different configurable indexing strategies:
- Block distance based
- Byte distance based (default)
* A flexible and configurable index size.
* Different extraction strategies:
- Extract a range of records
- Extract one of n-segements, where you define the number of segments
on the fly.
- (Virtually) divide the FASTQ on the fly into n segments and extract
one segement of your choice.

## License and Contributing

Expand Down Expand Up @@ -129,12 +127,12 @@ FastqIndEx has the following dependencies, which should be met before building i

##### Use Conda to manage your dependencies

* Install Miniconda / Anaconda
* The conda recipe is contained in env/environment.yml, use conda-env
to import it:
conda env create -n FastqIndEx -f env/environment.yml
* Download and install UnitTest++ like described in the next section,
it is not available in Conda.
1. Install Miniconda / Anaconda
2. The conda recipe is contained in env/environment.yml, use conda-env
to import it:
conda env create -n FastqIndEx -f env/environment.yml
3. Download and install UnitTest++ like described in the next section,
it is not available in Conda.

##### Compilation with manual installation of dependencies

Expand Down Expand Up @@ -181,83 +179,79 @@ activate the Conda environment of FastqIndEx.

#### FastqIndEx

First, you need to clone the FastqIndEx Git repository. You can do this
by executing:

``` bash
cd ~/Projects # Or any other desired location, we'll stick to this
git clone https://github.com/DKFZ-ODCF/FastqIndEx.git
git checkout master # Or any other version you like
```

To compile it, create a CMake build directory and run CMake afterwards:

``` Bash
export CONDA_DIR="~/miniconda2/envs/FastqIndEx" # Miniconda is normally installed in <home>/miniconda2
export CONDA_LIB=${CONDA_DIR}/lib
export CONDA_LIB64=${CONDA_DIR}/lib64
export CONDA_INCLUDE=${CONDA_DIR}/include
cmake -D CMAKE_BUILD_TYPE=Debug -D BUILD_ONLY="s3;config;transfer" -D BUILD_SHARED_LIBS=ON -D OPENSSL_ROOT_DIR=${CONDA_DIR} -D OPENSSL_INCLUDE_DIR=${CONDA_INCLUDE} OPENSSL_LIBRARIES=${CONDA_LIB} -D ZLIB_INCLUDE_DIR=${CONDA_INCLUDE} -DZLIB_LIBRARY=${CONDA_LIB}/libz.a -D CURL_INCLUDE_DIR=${CONDA_INCLUDE} -DCURL_LIBRARY=${CONDA_LIB}/libcurl.so -D CMAKE_INSTALL_PREFIX=$PWD/../install_shared -D ENABLE_TESTING=OFF ..
cmake -D CMAKE_BUILD_TYPE=Debug -D BUILD_ONLY="s3;config;transfer" -D BUILD_SHARED_LIBS=OFF -D OPENSSL_ROOT_DIR=${CONDA_DIR} -D OPENSSL_INCLUDE_DIR=${CONDA_INCLUDE} OPENSSL_LIBRARIES=${CONDA_LIB} -D ZLIB_INCLUDE_DIR=${CONDA_INCLUDE} -DZLIB_LIBRARY=${CONDA_LIB}/libz.a -D CURL_INCLUDE_DIR=${CONDA_INCLUDE} -DCURL_LIBRARY=${CONDA_LIB}/libcurl.so -D CMAKE_INSTALL_PREFIX=$PWD/../install_shared -D ENABLE_TESTING=OFF ..


AWS_DIR="/path/to/aws-sdk-cpp/install_shared"
cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Debug \
-D "UnitTest++_DIR":PATH="/path/to/unittest-cpp/install/lib/cmake/UnitTest++" \
-D "AWSSDK_DIR":PATH="${AWS_DIR}/lib64/cmake/AWSSDK" \
-D "aws-cpp-sdk-core_DIR":PATH="${AWS_DIR}/lib64/cmake/aws-cpp-sdk-core" \
-D "aws-c-event-stream_DIR":PATH="${AWS_DIR}/lib64/aws-c-event-stream/cmake" \
-D "aws-c-common_DIR":PATH="${AWS_DIR}/lib64/aws-c-common/cmake" \
-D "aws-checksums_DIR":PATH="${AWS_DIR}/lib64/aws-checksums/cmake" \
-D "aws-cpp-sdk-s3_DIR":PATH="${AWS_DIR}/lib64/cmake/aws-cpp-sdk-s3" ..


cd FastqIndEx
mkdir release # Or also debug, in case you want to develop
cd release
cmake -G "Unix Makefiles" \
-D "UnitTest++_DIR":PATH=/path/to/unittest-cpp/lib/cmake/UnitTest++ \ # If necessary
-D ZLIB_LIBRARY=/path/to/zlib-1.2.11/libz.a \ # If necessary
-D ZLIB_INCLUDE_DIR=/path/to/zlib-1.2.11 \ # If necessary
-D AWSSDK_DIR:PATH=<aws installation>/lib64/cmake/AWSSDK \ # For S3 support, you need to do this.
-D aws-cpp-sdk-core_DIR:PATH=<aws installation>/lib64/cmake/aws-cpp-sdk-core \
-D aws-c-event-stream_DIR:PATH=<aws installation>/lib64/aws-c-event-stream/cmake \
-D aws-c-common_DIR:PATH=<aws installation>/lib64/aws-c-common/cmake \
-D aws-checksums_DIR:PATH=<aws installation>/lib64/aws-checksums/cmake \
-D aws-cpp-sdk-s3_DIR:PATH=<aws installation>/lib64/cmake/aws-cpp-sdk-s3 \
-DCMAKE_BUILD_TYPE=Release # Or =Debug, if you plan to develop
..
cd ..
cmake --build release --target all -- -j 2 # Or --build debug
```

**<span style="color:ffffa0;">
Note, that the -D flags for the includes are only necessary, if you
installed the libraries manually. If they are already installed on your
system or (e.g. for UnitTest++) you installed them to the system folders
or if you use the Conda environment, you can omit these flags.
</span>**

To clean the build directory use:

``` Bash
cmake --build build --target clean -- -j 2
```

To run the tests, run the test binary like:

``` Bash
(cd build/test && ./testapp)
```

If you want, you can add the release or debug directory to your PATH
variable. E.g. in your local .bashrc file add the following:

``` bash
# This assumes, that you cloned the repo to ~/Projects/FastqIndEx and
# created the release sub directory like described above.
export PATH=~/Projects/FastqIndEx/release/src:$PATH
```
1. First, you need to clone the FastqIndEx Git repository. You can do this
by executing:

``` bash
cd ~/Projects # Or any other desired location, we'll stick to this
git clone https://github.com/DKFZ-ODCF/FastqIndEx.git
git checkout master # Or any other version you like
```

2. To compile it, create a CMake build directory and run CMake afterwards:

``` Bash
export CONDA_DIR="~/miniconda2/envs/FastqIndEx" # Miniconda is normally installed in <home>/miniconda2
export CONDA_LIB=${CONDA_DIR}/lib
export CONDA_LIB64=${CONDA_DIR}/lib64
export CONDA_INCLUDE=${CONDA_DIR}/include
cmake -D CMAKE_BUILD_TYPE=Debug -D BUILD_ONLY="s3;config;transfer" -D BUILD_SHARED_LIBS=ON -D OPENSSL_ROOT_DIR=${CONDA_DIR} -D OPENSSL_INCLUDE_DIR=${CONDA_INCLUDE} OPENSSL_LIBRARIES=${CONDA_LIB} -D ZLIB_INCLUDE_DIR=${CONDA_INCLUDE} -DZLIB_LIBRARY=${CONDA_LIB}/libz.a -D CURL_INCLUDE_DIR=${CONDA_INCLUDE} -DCURL_LIBRARY=${CONDA_LIB}/libcurl.so -D CMAKE_INSTALL_PREFIX=$PWD/../install_shared -D ENABLE_TESTING=OFF ..
cmake -D CMAKE_BUILD_TYPE=Debug -D BUILD_ONLY="s3;config;transfer" -D BUILD_SHARED_LIBS=OFF -D OPENSSL_ROOT_DIR=${CONDA_DIR} -D OPENSSL_INCLUDE_DIR=${CONDA_INCLUDE} OPENSSL_LIBRARIES=${CONDA_LIB} -D ZLIB_INCLUDE_DIR=${CONDA_INCLUDE} -DZLIB_LIBRARY=${CONDA_LIB}/libz.a -D CURL_INCLUDE_DIR=${CONDA_INCLUDE} -DCURL_LIBRARY=${CONDA_LIB}/libcurl.so -D CMAKE_INSTALL_PREFIX=$PWD/../install_shared -D ENABLE_TESTING=OFF ..


export AWS_DIR="/path/to/aws-sdk-cpp/install_shared"
export UNITTESTPP_DIR="/path/to/unittest-cpp"
export ZLIB_DIR="/path/to/zlib-1.2.11" # If necessary

# The following instructions are for a release build of FastqIndEx.
# If you want to create a debug build, change "release" to "debug"
export MODE=release
cd FastqIndEx
mkdir ${MODE}
cd ${MODE}
cmake -G "Unix Makefiles" \
-D "UnitTest++_DIR":PATH="${UNITTESTPP_DIR}/install/lib/cmake/UnitTest++" \ # If necessary
-D ZLIB_LIBRARY="${ZLIB_DIR}/libz.a" \ # If necessary
-D ZLIB_INCLUDE_DIR="${ZLIB_DIR}"" \ # If necessary
-D "AWSSDK_DIR":PATH="${AWS_DIR}/lib64/cmake/AWSSDK" \ # For S3 support, you need to do this.
-D "aws-cpp-sdk-core_DIR":PATH="${AWS_DIR}/lib64/cmake/aws-cpp-sdk-core" \
-D "aws-c-event-stream_DIR":PATH="${AWS_DIR}/lib64/aws-c-event-stream/cmake" \
-D "aws-c-common_DIR":PATH="${AWS_DIR}/lib64/aws-c-common/cmake" \
-D "aws-checksums_DIR":PATH="${AWS_DIR}/lib64/aws-checksums/cmake" \
-D "aws-cpp-sdk-s3_DIR":PATH="${AWS_DIR}/lib64/cmake/aws-cpp-sdk-s3" \
-DCMAKE_BUILD_TYPE=Release
..
cd ..
cmake --build ${MODE} --target all -- -j 2
```
**<span style="color:ffffa0;">
Note, that the -D flags for the includes are only necessary, if you
installed the libraries manually. If they are already installed on your
system or (e.g. for UnitTest++) you installed them to the system folders
or if you use the Conda environment, you can omit these flags.
</span>**
3. To clean the build directory use:
``` Bash
cmake --build build --target clean -- -j 2
```
4. To run the tests, run the test binary like:
``` Bash
(cd build/test && ./testapp)
```
5. If you want, you can add the release or debug directory to your PATH
variable. E.g. in your local .bashrc file add the following:
``` bash
# This assumes, that you cloned the repo to ~/Projects/FastqIndEx and
# created the release sub directory like described above.
export PATH=~/Projects/FastqIndEx/release/src:$PATH
```
## Links
Expand Down
16 changes: 9 additions & 7 deletions src/CMakeLists.txt
Expand Up @@ -33,16 +33,18 @@ add_library(
fastqindexlib
common/CommonStructsAndConstants.cpp common/CommonStructsAndConstants.h
common/ErrorAccumulator.cpp common/ErrorAccumulator.h
common/IndexHeader.cpp common/IndexHeader.h
common/ErrorMessages.cpp common/ErrorMessages.h
common/IndexEntry.cpp common/IndexEntry.h
common/IndexEntryV1.h
common/IOHelper.cpp common/IOHelper.h
common/Result.h
common/StringHelper.cpp common/StringHelper.h
process/base/BaseIndexEntry.h
process/base/IndexHeader.cpp process/base/IndexHeader.h
process/base/IndexEntry.cpp process/base/IndexEntry.h
process/base/IndexEntryV1.h
process/base/ZLibBasedFASTQProcessorBaseClass.cpp process/base/ZLibBasedFASTQProcessorBaseClass.h
process/extract/Extractor.cpp process/extract/Extractor.h
process/extract/IndexReader.cpp process/extract/IndexReader.h
process/index/IndexEntryStorageStrategy.h
process/index/IndexEntryStorageDecisionStrategy.h
process/index/Indexer.cpp process/index/Indexer.h
process/index/IndexWriter.cpp process/index/IndexWriter.h
process/io/s3/FQIS3Client.h
Expand All @@ -53,11 +55,11 @@ add_library(
process/io/s3/S3Sink.h
process/io/s3/S3Source.cpp process/io/s3/S3Source.h
process/io/locks/LockHandler.h
process/io/locks/PathLockHandler.cpp process/io/locks/PathLockHandler.h
process/io/locks/FileLockHandler.cpp process/io/locks/FileLockHandler.h
process/io/locks/S3LockHandler.h
process/io/IOBase.h
process/io/PathSink.cpp process/io/PathSink.h
process/io/PathSource.cpp process/io/PathSource.h
process/io/FileSink.cpp process/io/FileSink.h
process/io/FileSource.cpp process/io/FileSource.h
process/io/Sink.h
process/io/Source.h
process/io/ConsoleSink.h
Expand Down
11 changes: 11 additions & 0 deletions src/common/CommonStructsAndConstants.cpp
Expand Up @@ -29,3 +29,14 @@ const int64_t TB = GB * 1024;

const int DEFAULT_RECORD_SIZE = 4;

uint stoui(std::string value) {
// Unfortunately, C++ does not offer the stoui method (or stou). Why? Nobody knows. I found
// the following code snippet here: https://stackoverflow.com/questions/8715213/why-is-there-no-stdstou

unsigned long intermediate = std::stoul(value);
if (intermediate > std::numeric_limits<unsigned>::max()) {
throw std::out_of_range("stou");
}
return static_cast<uint>(intermediate);
}

12 changes: 8 additions & 4 deletions src/common/CommonStructsAndConstants.h
Expand Up @@ -52,10 +52,14 @@ extern const int DEFAULT_RECORD_SIZE;
*/
#define CLEAN_WINDOW_SIZE (WINDOW_SIZE + 1)

struct VirtualIndexEntry {
bool operator==(const VirtualIndexEntry &rhs) const { return true; };
};

/**
* stoui is not available as a core method (unlike stoi or stol ...). Mimic it here.
*
* Throws an out_of_range excpetion, if the value is too large
* @param value A string
* @return The hopefully converted value.
*/
uint stoui(std::string value);


#endif //FASTQINDEX_COMMONSTRUCTS_H
4 changes: 2 additions & 2 deletions src/common/ErrorAccumulator.cpp
Expand Up @@ -57,7 +57,7 @@ string ErrorAccumulator::join(_cstr s0, _cstr s1, _cstr s2, _cstr s3, _cstr s4,
}


vector<string> ErrorAccumulator::mergeToNewVector(const vector<string> &l, const vector<string> &r) {
vector<string> ErrorAccumulator::concatenateVectors(const vector<string> &l, const vector<string> &r) {
std::vector<string> merged;
merged.reserve(l.size() + r.size());
merged.insert(merged.end(), l.begin(), l.end());
Expand All @@ -66,7 +66,7 @@ vector<string> ErrorAccumulator::mergeToNewVector(const vector<string> &l, const
}

vector<string>
ErrorAccumulator::mergeToNewVector(const vector<string> &a, const vector<string> &b, const vector<string> &c) {
ErrorAccumulator::concatenateVectors(const vector<string> &a, const vector<string> &b, const vector<string> &c) {
std::vector<string> merged;
merged.reserve(a.size() + b.size() + c.size());
merged.insert(merged.end(), a.begin(), a.end());
Expand Down
4 changes: 2 additions & 2 deletions src/common/ErrorAccumulator.h
Expand Up @@ -76,12 +76,12 @@ class ErrorAccumulator {
* @param r The right vector
* @return A new vector<string> with both vectors merged. The messages of l will be placed before the messages of r.
*/
static vector<string> mergeToNewVector(const vector<string> &l, const vector<string> &r);
static vector<string> concatenateVectors(const vector<string> &l, const vector<string> &r);

/**
* Similar to its version with two input parameters.
*/
static vector<string> mergeToNewVector(const vector<string> &a, const vector<string> &b, const vector<string> &c);
static vector<string> concatenateVectors(const vector<string> &a, const vector<string> &b, const vector<string> &c);

};

Expand Down

0 comments on commit 5d92837

Please sign in to comment.