Changes according to PR #37; Merge new Byte distance storage strategy

DKFZ-ODCF · Aug 7, 2019 · bcfcc57 · bcfcc57
1 parent c4b008b
commit bcfcc57
Show file tree

Hide file tree

Showing 67 changed files with 893 additions and 707 deletions.
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -33,6 +33,7 @@ set(aws-checksums_DIR ${AWS_DIR}/lib64/aws-checksums/cmake)
 set(aws-cpp-sdk-s3_DIR ${AWS_DIR}/lib64/cmake/aws-cpp-sdk-s3)
 
 # Make cmake aware of an existing conda environment and try to load zlib and tclap from there.
+# The variable CONDA_DEFAULT_ENV is set, if a Conda environment is active.
 set(CONDA_DEFAULT_ENV $ENV{CONDA_DEFAULT_ENV})
 set(CONDA_PREFIX $ENV{CONDA_PREFIX})
 set(TCLAP_INCLUDE_DIR ../tclap-1.2.1/include)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -23,3 +23,111 @@ If you are adding a new file it should have a header like this:
  */
  ```
 
+## FASTQ Index (FQI) Format, Version 1
+
+FQI files are binary files with the following general structure:
+
+```
+|HEADER|ENTRY 0|ENTRY 1|ENTRY 2|ENTRY ...|
+```
+
+The version 1 header is exactly 512 Byte wide and can be described like:
+
+``` bash
+|                                                                              (IndexHeader)                                                                             | 
+|    (u_int32_t)     |    (u_int32_t)   | (u_int32_t) |  (u_int32_t)  |   (u_int64_t)   |     (u_int64_t)    |           (bool)          |   (u_char)  | (u_int64_t)[59] |
+| indexWriterVersion | sizeOfIndexEntry | magicNumber | blockInterval | numberOfEntries | linesInIndexedFile | dictionariesAreCompressed | placeholder |     reserved    |
+```
+
+The version 1 index entry has an extracted width of 32800 Byte index entry can be described like:
+
+```bash
+|                                                                          IndexEntry                                                                         |
+|   (u_int64_t)   |      (u_int64_t)     |     (u_int64_t)     |      (u_int32_t)       | (u_int32_t) | (u_char) |        (u_int16_t)       | (u_char)[32768] |
+|     blockID     | blockOffsetInRawFile | startingLineInEntry | offsetOfFirstValidLine |     bits    | reserved | compressedDictionarySize |    dictionary   |
+```
+
+If compression is enabled, this looks a bit different (note the last field differs then and depends on compressedDictionarySize!), when it is stored in the FQI file:
+
+```bash
+|                                                                                   IndexEntry                                                                                   |
+|   (u_int64_t)   |      (u_int64_t)     |     (u_int64_t)     |      (u_int32_t)       | (u_int32_t) | (u_char) |        (u_int16_t)       | (u_char)[compressedDictionarySize] |
+|     blockID     | blockOffsetInRawFile | startingLineInEntry | offsetOfFirstValidLine |     bits    | reserved | compressedDictionarySize |             dictionary             |
+```
+
+If compression is enabled, you need to read out the IndexEntry without the dictionary first. 
+
+## Development setup
+
+**<span style="color:ffffa0;">If you use an IDE like CLion, make sure to 
+activate the environment before running the IDE.</span>**
+
+**<span style="color:ffffa0;">Also make sure to use the right compilers
+and tools. They are named a bit differently in Conda. CLion recognizes
+them, if you use the environment like mentioned.</span>**
+
+## Code stability and safety
+
+There are several things we consider and do to make usage of FastqIndEx
+as safe as possible.
+
+
+**<span style="color:orange;">=>  We will not get the application a 100%
+  safe, but we try to minimize the risk of data corruption.</span>**
+
+### Code validation with clang-tidy
+
+We use CLion to develop FastqIndEx. CLion has clang-tidy support built 
+in, which we use to eliminate well known issues and problems.
+
+### Runtime checks with Valgrind
+
+We use [Valgrind](http://valgrind.org/) to check for memory leaks which 
+e.g. lead to SIGSEV or SIGABRT.
+
+### Index file access considerations
+
+When it comes to safely accessing the index file, we want to have it so
+that:
+- Writing to the index is exclusive. No other reader or writer is 
+  allowed at the same time.
+- It is allowed to read multiple times from an index file, if no writer 
+  is active. 
+
+As we work with network file systems, we need to deal with several 
+problems:
+- flock only works correctly with NFS on newer Linux kernels.
+- However, we cannot guarantee, that a file which was not locked, will
+  be overriden during our read.
+
+To overcome these problems:
+- We can use flock for our writer. So writing will be safe, as long as
+  no other process starts writing to the file. 
+  <br><span style="color:green;">=> Implemented</span>
+- We can write out a md5sum file for the index after it was created. 
+  <br><span style="color:red;">Not implemented</span>
+- We can check for an existing lockfile before we read from a file and 
+  abort early. 
+  <br><span style="color:green;">Implemented</span>
+- We can read in the md5sum file and calculate our md5sum during our 
+  read. 
+  <br><span style="color:red;">Not implemented</span> 
+- We can also constantly perform sanity check for changes in timestamp,
+  file size or file existence. 
+  <br><span style="color:red;">Not implemented</span>
+- We can store md5sums for IndexEntries, e.g. for the first valid 
+  line(s) in each compressed block. 
+  <br><span style="color:red;">Not implemented</span>
+
+## Links
+
+* [AWS](https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/setup.html)
+* [CMake](https://cmake.org/)
+* [CMake Tutorial 1](https://preshing.com/20170511/how-to-build-a-cmake-based-project/)
+* [CMake Tutorial 2](http://wiki.ogre3d.org/Getting+Started+With+CMake)
+* [Miniconda](https://docs.conda.io/en/latest/miniconda.html)
+* [UnitTest++](https://github.com/unittest-cpp/unittest-cpp)
+* [Valgrind](http://valgrind.org/)
+* [zindex](https://mattgodbolt.github.io/zindex/#/)
+* [zran.c](https://github.com/madler/zlib/blob/master/examples/zran.c) 
+  random gz file access example.