3. Workflow Demos
The container image of KaMRaT integrates a snakemake workflow for constructing and pre-processing the input k-mer count matrix required by KaMRaT.
The scripts can be found in the folder related-tools/. Some use cases on a toy dataset are also provided in the folder toyroom/usecases/MakeTab.usecases/.
The matrix construction takes fastq files as input. To run the snakemake workflow, all fastq files must be placed in the same folder, with suffixes following the same pattern. For example, in the showcase, all fastq files of the two toy samples are placed in the folder toyroom/data/fastq_dir/, with the shared suffix patterns .R1.fastq.gz and .R2.fastq.gz.
Users are expected to prepare and provide a config.json file for the snakemake workflow, containing the following keys:
- samples_tsv: a file listing the samples to be analysed. A multi-column tab-separated table is allowed, but a header line whose first field is "sample" is mandatory.
- lib_type: sequencing strandedness, one of "rf", "fr" or unstranded.
- fastq_dir: the folder containing the fastq files of the samples listed in samples_tsv.
- r1_suffix: suffix pattern of the first-read files.
- r2_suffix: suffix pattern of the second-read files.
- output_dir: output folder.
- kmer_length: k-mer length.
- min_rec: minimum number of samples in which a k-mer must occur for it to be reported.
- min_rec_abd: minimum abundance for a k-mer to be counted as occurring in a sample.
- n_cores: number of CPU cores to be used for computation.
An example of config.json can be found here.
An example value for samples_tsv can be found here.
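For orientation, the keys listed above could be filled in along the following lines. This is only a sketch: the file name samples.tsv, the second column "condition", and all values are placeholders chosen for illustration (the container paths mirror the bind mounts used in the launch command), and whether numeric values should be written as JSON numbers or as strings is an assumption that should be checked against the example config.json shipped with KaMRaT.

```shell
#!/usr/bin/env bash
# Write an illustrative samples table: the header's first field must be "sample".
# The "condition" column and the sample names are placeholders.
printf 'sample\tcondition\nsample1\tA\nsample2\tB\n' > samples.tsv

# Write an illustrative config.json; every value below is a placeholder.
cat > config.json <<'EOF'
{
  "samples_tsv": "/sif_pwd/samples.tsv",
  "lib_type": "rf",
  "fastq_dir": "/sif_data/",
  "r1_suffix": ".R1.fastq.gz",
  "r2_suffix": ".R2.fastq.gz",
  "output_dir": "/sif_out/",
  "kmer_length": 31,
  "min_rec": 1,
  "min_rec_abd": 1,
  "n_cores": 1
}
EOF

echo "wrote config.json and samples.tsv"
```

Writing both files in the working directory keeps them reachable from inside the container when $PWD is bound, as in the launch command.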
To launch the workflow, please run:
apptainer exec -B /in_dir/:/sif_data/ -B /out_dir/:/sif_out/ -B $PWD:/sif_pwd/ KaMRaT.sif \
snakemake -s /usr/KaMRaT/related-tools/make-matrix/Snakefile \
--configfile /sif_pwd/config.json --cores 1
Users are free to prepare the input feature count matrix in their own way, either with an independent tool or with the companion tools provided along with KaMRaT.
Independent tools can be used to build the matrix; one interesting example is kmtricks.
Please make sure that the rows of the matrix represent features and the columns represent samples: the first column should contain the features, and the remaining columns the counts in each sample.
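A quick sanity check of this layout can be sketched with awk. The function name check_matrix and the toy file below are hypothetical and not part of KaMRaT; the check only verifies that every row has as many tab-separated fields as the header line, and it assumes an uncompressed table.

```shell
#!/usr/bin/env bash
# Succeed if every row of a tab-separated matrix has as many fields as its header.
# Offending lines are reported on standard output.
check_matrix() {
    awk -F '\t' '
        NR == 1 { n = NF; next }
        NF != n { printf "line %d: %d fields, expected %d\n", NR, NF, n; bad = 1 }
        END     { exit bad }
    ' "$1"
}

# Toy matrix: feature column first, then one count column per sample.
printf 'tag\tsample1\tsample2\n' > toy-matrix.tsv
printf 'AAAC\t3\t0\n' >> toy-matrix.tsv
printf 'AAAG\t1\t5\n' >> toy-matrix.tsv

if check_matrix toy-matrix.tsv; then
    echo "layout looks consistent"
fi
```

For a gzip-compressed table, the same check could be applied to a decompressed copy.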
Users are also free to build the input matrix with their own workflow.
One way to do this is to use the companion tools jellyfish and DE-kupl joinCounts provided in the container, as in the following example:
# Bash variables
indir="demo-data/inputs"
outdir="demo-data/outputs"
sample_list=(sample1 sample2) # names of the samples to process
dsgnfile="$indir/rank-design.txt"
kmer_tab_path="$outdir/kmer-counts.tsv.gz"
mkdir -p "$outdir"
# jellyfish count & dump
for s in "${sample_list[@]}"
do
    jellyfish count -m 31 -s 1000000 -C -o "$outdir/$s.jf" -F 2 <(zcat "$indir/$s.R1.fastq.gz") <(zcat "$indir/$s.R2.fastq.gz")
    jellyfish dump -c "$outdir/$s.jf" | sort -k 1 > "$outdir/$s.txt" # <= the sort here is important!
done
# DE-kupl joinCounts
# Write the header line first ("tag", then one column name per sample)
echo -n "tag" | gzip -c > "$kmer_tab_path"
for s in "${sample_list[@]}"
do
    echo -ne "\t$s" | gzip -c >> "$kmer_tab_path"
done
echo "" | gzip -c >> "$kmer_tab_path"
# Append the joined counts below the header
apptainer exec --bind /src:/des kamrat.sif joinCounts -r 1 -a 1 "$outdir"/*.txt | gzip -c >> "$kmer_tab_path" # no recurrence filter
Note: please keep in mind that the sort after jellyfish dump is important for joinCounts.
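Whether the per-sample files are in the expected order can be verified before joining. The sketch below uses the standard `sort -c` check with the same key as the example above; the helper name is_sorted and the toy files are hypothetical and only illustrate the idea.

```shell
#!/usr/bin/env bash
# Return success (exit status 0) if the file is already ordered the way
# `sort -k 1` would order it; diagnostics from sort are discarded.
is_sorted() {
    sort -c -k 1 "$1" 2>/dev/null
}

# Toy files in the "k-mer count" layout written by `jellyfish dump -c`.
printf 'AAAC 3\nAAAG 1\nAACT 7\n' > sorted-demo.txt
printf 'AAAG 1\nAAAC 3\nAACT 7\n' > unsorted-demo.txt

for f in sorted-demo.txt unsorted-demo.txt
do
    if is_sorted "$f"; then
        echo "$f: sorted"
    else
        echo "$f: NOT sorted, re-run sort before joinCounts"
    fi
done
```

The check should be run in the same locale as the original sort, since the collation order depends on it.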
Users are free to combine KaMRaT operations according to their analysis design, as shown in the figure above.
The first step of a KaMRaT workflow is always index. For example, to process the k-mer count table given in toyroom/:
mkdir $outdir/kamrat.idx
# Make index for k-mer matrix with k=31, unstranded mode, and with a count per billion normalization
apptainer exec --bind /src:/des kamrat.sif kamrat index -intab $kmer_tab_path -outdir $outdir/kamrat.idx -klen 31 -unstrand -nfbase 1000000000
KaMRaT's functional operations can be arranged in various orders, as shown in the figure above, including:
- a single operation of filter, mask, merge, score or query;
- filter-merge or filter-rank;
- mask-merge or mask-rank;
- merge-rank;
- rank-merge.
When a functional operation is the terminal one (e.g., the merge operation in a filter-merge workflow), the -withcounts option is required in order to produce a human-readable count table. Otherwise, a binary intermediate file is produced, to be taken by the terminal operation (indicated by the -with argument).
Some example use cases are provided in the folder toyroom/usecases/KaMRaT.usecases.