SWGTS comes packaged as an out-of-the box docker compose application.
In order to start the software you need to:
- Generate the folder structure (see below)
- Provide a TLS certificate (see below)
- Configure/Provide a database (see below)
The software expects the following folders to exist:
- input
- output
By design we also require encrypted communication via TLS, this means that you need to have a valid certificate. You can create a self-signed certificate (for testing purposes) using:
openssl req -new -newkey rsa:2048 -sha256 -days 365 -nodes -x509 -keyout server.pem -out server.crt
The paths to server.pem and server.crt can be adjusted in docker-compose.yml.
You can use the fetchExampleDatabase.sh script to generate a combined database of human (T2T) and hCoV-19. This requires minimap2 to be installed (also takes a significant amount of memory).
The software consists of the following containers:
- Traefik, used as a reverse proxy to forward the incoming requests to the API/Frontend. Also serves as a load balancer
- Redis is used as a fast and efficient key,value storage and maintains the currently pending reads that need to be filtered.
- SWGTS-API handles the communication with clients
- SWGTS-Filter performs the actual host depletion with mappy
An example implementation of a CLI client is provided in the subfolder swgts-submit. To run it create a virtual environment for example using
mamba env create -f swgts-submit/requirements.txt -n swgts-submit
(Make sure that conda-forge is your default channel which should be the case in newer conda installations)
Then execute using:
cd swgts-submit
python -m swgts-submit --server https://example.com/swgts/api example_data/example_cov_ont_reads.fq
to send an example file to a swgts server running on the domain www.example.com
. The chunk size, i.e., the amount of bases sent to the server at one time, can be adjusted with the --count
argument.
As a default setting this is a fraction of the server's buffer size to allow efficient parallelization.
Note that if the chunk size exceeds the buffer size the server will reject the transmission.
A website frontend is also included in the docker application and can be found in the folder swgts-frontend. Requests are forwarded to localhost by default (assuming the backend is located on the same machine), this can be adjusted in the package.json
proxy directive.
Configuration of the software is done via two configuration files that are generated on first launch in the input folder: config_api.py
and config_filter.py
.
In addition, you can edit the docker-compose.yml
file to scale up individual components.
Since the entire system is stateless it is for example possible to use multiple API containers or even distribute them to different machines. For the filter component it is recommended
to utilize multiple threads (see below) instead of using multiple containers since this allows to take advantage of shared memory, drastically reducing the memory footprint in many scenarios.
Configurable options are:
If this is set to true the server will not save anything to disk. This can be useful if the server is just used to guarantee uniform host depletion quality or to take load off client machines. The clients are still able to reconstruct filtered read files based on the server responses.
This is the buffer size and limits how many bases can be held in RAM per context at any time. Decreasing this increases transmission times but reduces the risk of personal identification.
If this timeout in seconds is exceeded a upload that was inactive will be removed.
Configurable options are:
Can be set to POSITIVE (retention) or NEGATIVE (host subtraction).
This is used for POSITIVE (retention) based modes and is used to identify the reference to which reads should align in order to be retained.
Can be used to adjust for short reads (sr) or long reads (map-ont).
This is used for the NEGATIVE mode and requires an additional mapping quality to filter a read.
The amount of worker threads that are used in each filter container. The set of worker threads share the same database and thus require it to only be loaded into memory once.
Alice is a hosting provider and wants to collect reads of a target pathogen potentially contaminated with reads of human hosts.
Two collaborating parties, Bob and Christine, want to send their reads to Alice.
Since Alice wants to guarantee that the reads are depleted of host reads with sufficient quality, she hosts the SWGTS software under her domain
www.example.com
. Bob has no experience with CLI-based tools and just opens the website www.example.com/swgts/frontend
in his browser.
He drags the .fastq
file containing the reads onto the screen and starts the upload process. Christine wants to upload data while it is being
sequenced and uses the CLI client to initiate an upload. She has previously followed the installation instructions listed above and created a conda environment
for the CLI client. Now, she invokes
python3 -m swgts-submit --server https://example.com/swgts/api christines_reads.fastq --outfolder local_out
while the file christines_reads.fastq is still being generated by her connected sequencing machine. Since she also wants to have a copy of the host-depleted reads, she provides the outfolder argument which results in the CLI client writing a depleted .fastq to the given folder.
In the following section we document the REST API. This can be of relevance if you are interested in implementing your own client.
HTTP response code: 200 (OK) HTTP response body:
{
"commit": "String",
"version": "String",
"date": "String",
"uptime": 23.14,
"maximum pending bytes": 100000
}
HTTP request body: List of all the files that are to be transmitted.
{
"filenames": [
"pair1.fastq",
"pair2.fastq"
]
}
HTTP response code: 200 (OK)
HTTP request body:
# For each read index, for each file contains a list of the 4 lines making up one read
[
[
[
"...",
"...",
"...",
"...."
]
]
]
HTTP response code: 200 (OK) HTTP response body:
{
"processed reads": 901000,
"pending bytes": 9000
}
or
HTTP response code: 400 (Bad Request) HTTP response body:
{
"message": "Error message"
}
or
HTTP response code: 404 (Not Found) HTTP response body:
{
"message": "Error message"
}
or
HTTP response code: 422 (Unprocessable Content) HTTP response header includes 'Retry-After'. HTTP response body:
{
"message": "Error message",
"processed reads": 47119000,
"pending bytes": 18000
}
HTTP response code: 200 (OK) HTTP response body:
{
"processed reads": 15000001,
"pending bytes": 234568
}
or
HTTP response code: 503 (Service Unavailable) HTTP response header includes 'Retry-After'. HTTP response body:
{
"message": "Error message"
}
We provide a Jupyter Notebook with benchmarking scripts (designed for two machines) in the benchmarking folder. A conda environment definition file is included that contains the required python packages.
SWGTS uses minimap2's (https://doi.org/10.1093/bioinformatics/bty191) in-memory version mappy as the default mapping tool.