3. Workflow Demos

Haoliang Xue edited this page Mar 17, 2024 · 1 revision

Step 1. Build k-mer count matrix

Simple option: use the companion Snakemake workflow provided in the container

The KaMRaT container image includes a Snakemake workflow that builds and pre-processes the input k-mer count matrix required by KaMRaT.

The scripts are located in the folder related-tools/. Some use cases on a toy dataset are also provided in the folder toyroom/usecases/MakeTab.usecases/.

The matrix construction takes fastq files as input. To run the Snakemake workflow, all fastq files must be placed in the same folder and their names must share the same suffix pattern. For example, in the showcase, all fastq files of the two toy samples are in the folder toyroom/data/fastq_dir/, and they share the suffix patterns .R1.fastq.gz and .R2.fastq.gz.
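Before launching the workflow, it can help to verify that both read files exist for every sample. The following is a minimal sketch, not part of KaMRaT: the function name check_fastq_pairs and the sample names in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# Pre-flight check (hypothetical helper, not shipped with KaMRaT):
# every sample must have both read files in the same folder.
# Usage: check_fastq_pairs <fastq_dir> <r1_suffix> <r2_suffix> <sample>...
check_fastq_pairs() {
    local fastq_dir=$1 r1_suffix=$2 r2_suffix=$3
    shift 3
    local missing=0 s f
    for s in "$@"; do
        for f in "$fastq_dir/$s$r1_suffix" "$fastq_dir/$s$r2_suffix"; do
            if [ ! -f "$f" ]; then
                echo "missing: $f" >&2   # report each absent file
                missing=1
            fi
        done
    done
    return "$missing"   # 0 if all files exist, 1 otherwise
}
```

For the toy layout described above, a call could look like: check_fastq_pairs toyroom/data/fastq_dir .R1.fastq.gz .R2.fastq.gz sample1 sample2.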

Users must provide a config.json file for the Snakemake workflow, containing the following keys:

  • samples_tsv: a file listing the samples to be analysed. A multi-column tab-separated table is allowed, but a header line whose first field is "sample" is mandatory.
  • lib_type: sequencing strandedness, one of "rf", "fr" or unstranded.
  • fastq_dir: the folder containing the fastq files of the samples listed in samples_tsv.
  • r1_suffix: suffix pattern of the first-read files.
  • r2_suffix: suffix pattern of the second-read files.
  • output_dir: output folder.
  • kmer_length: k-mer length.
  • min_rec: minimum number of samples in which a k-mer must recur to be reported.
  • min_rec_abd: minimum abundance for a k-mer to be counted as occurring in a sample.
  • n_cores: number of CPU cores to use for computation.

An example config.json can be found here.

An example samples_tsv file can be found here.
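For orientation, the snippet below writes a minimal samples.tsv and config.json using the keys listed above. It is only a sketch: the sample names, the "condition" column and the threshold values are hypothetical, the paths assume the bind mount points used in the launch command of this section (/sif_data/, /sif_out/, /sif_pwd/), and writing the numeric values as JSON numbers rather than strings is an assumption. The linked example files remain the reference.

```shell
#!/usr/bin/env bash
# Write a hypothetical samples.tsv and config.json into the current directory.
# All values are placeholders for illustration, not the toyroom files.

# samples_tsv: header line whose first field is "sample", then one sample per line.
printf 'sample\tcondition\n'  > samples.tsv
printf 'sample1\tnormal\n'   >> samples.tsv
printf 'sample2\ttumor\n'    >> samples.tsv

# config.json: one entry per key expected by the workflow.
cat > config.json <<'EOF'
{
    "samples_tsv": "/sif_pwd/samples.tsv",
    "lib_type": "rf",
    "fastq_dir": "/sif_data/",
    "r1_suffix": ".R1.fastq.gz",
    "r2_suffix": ".R2.fastq.gz",
    "output_dir": "/sif_out/",
    "kmer_length": 31,
    "min_rec": 1,
    "min_rec_abd": 1,
    "n_cores": 1
}
EOF
```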

To launch the workflow, please run:

apptainer exec -B /in_dir/:/sif_data/ -B /out_dir/:/sif_out/ -B $PWD:/sif_pwd/ KaMRaT.sif \
               snakemake -s /usr/KaMRaT/related-tools/make-matrix/Snakefile \
                         --configfile /sif_pwd/config.json --cores 1

Advanced options:

Users are free to prepare the input feature count matrix in their own way. This can be done either using an independent tool or with the companion tools provided along with KaMRaT.

Using other independent tools

Users can use other independent tools to build the matrix. One interesting example is kmtricks.

Please make sure that the rows of the matrix represent features and the columns represent samples. The first column must contain the features, and each remaining column must contain the counts in one sample.
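A matrix built with an external tool can be checked against this layout before indexing. The sketch below assumes a tab-separated table with a header line; the function name check_matrix_shape is hypothetical and the check is not part of KaMRaT.

```shell
#!/usr/bin/env bash
# Expected layout (tab-separated), features in rows and samples in columns:
#   tag       sample1  sample2
#   <k-mer>   <count>  <count>
#
# Check that the table is rectangular: every row must have as many fields
# as the header line, and there must be at least one sample column.
# Usage: check_matrix_shape <matrix.tsv>    (reads stdin if no file is given)
check_matrix_shape() {
    awk -F '\t' '
        NR == 1 { ncol = NF; if (ncol < 2) bad = 1; next }
        NF != ncol {
            printf "line %d: %d fields, expected %d\n", NR, NF, ncol > "/dev/stderr"
            bad = 1
        }
        END { exit bad }
    ' "${1:--}"
}
```

A gzip-compressed matrix can be piped in, for example: zcat kmer-counts.tsv.gz | check_matrix_shape.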

Using companion tools along with KaMRaT

Users are also free to build the input matrix with their own workflow.

One way to do this is to use the companion tools jellyfish and DE-kupl joinCounts provided in the container. An example is shown below:

# Bash variables
indir="demo-data/inputs"
outdir="demo-data/outputs"
sample_list=(sample1 sample2) # names of the samples to process
dsgnfile="$indir/rank-design.txt"
kmer_tab_path="$outdir/kmer-counts.tsv.gz"

mkdir -p "$outdir"

# jellyfish count & dump
for s in "${sample_list[@]}"
do
    jellyfish count -m 31 -s 1000000 -C -o "$outdir/$s.jf" -F 2 <(zcat "$indir/$s.R1.fastq.gz") <(zcat "$indir/$s.R2.fastq.gz")
    jellyfish dump -c "$outdir/$s.jf" | sort -k 1 > "$outdir/$s.txt" # <= here sort is important !
done

# DE-kupl joinCounts
# Header line: "tag" followed by one column name per sample
echo -n "tag" | gzip -c > "$kmer_tab_path"
for s in "${sample_list[@]}"
do
    echo -ne "\t$s" | gzip -c >> "$kmer_tab_path"
done
echo "" | gzip -c >> "$kmer_tab_path"
# The glob below expands in alphabetical order, so sample_list must be in that same order for the header to match
apptainer exec --bind /src:/des kamrat.sif joinCounts -r 1 -a 1 "$outdir"/*.txt | gzip -c >> "$kmer_tab_path" # no filter of recurrence

Note: keep in mind that sorting the output of jellyfish dump is important for joinCounts.
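If dump files come from an earlier run, their order can be verified before calling joinCounts. The sketch below uses sort -c with the same key as the dump step, which only checks the order and does not rewrite the file. The function name is_kmer_sorted is hypothetical, and the check assumes it runs under the same locale as the original sort.

```shell
#!/usr/bin/env bash
# Return 0 if the file is already ordered as `sort -k 1` would order it,
# non-zero otherwise. Hypothetical helper, not part of KaMRaT or DE-kupl.
# Usage: is_kmer_sorted <dump.txt>
is_kmer_sorted() {
    sort -c -k 1 "$1" 2>/dev/null
}

# Possible use with the variables of the example above:
#   for s in "${sample_list[@]}"; do
#       is_kmer_sorted "$outdir/$s.txt" || echo "not sorted: $outdir/$s.txt" >&2
#   done
```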

Step 2. KaMRaT

Users are free to combine KaMRaT operations according to their analysis design, as shown in the figure above.

KaMRaT index

The first step of a KaMRaT workflow is always index. For example, to index the k-mer count table given in toyroom/:

mkdir -p "$outdir/kamrat.idx"
# Make index for k-mer matrix with k=31, unstranded mode, and with a count per billion normalization
apptainer exec --bind /src:/des kamrat.sif kamrat index -intab "$kmer_tab_path" -outdir "$outdir/kamrat.idx" -klen 31 -unstrand -nfbase 1000000000
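To give an intuition for what a "count per billion" normalization means, the sketch below rescales each sample column of a small tab-separated matrix so that it sums to a chosen base. The formula used, normalized = base × raw / column total, is an assumption made for illustration, and the helper normalize_counts is hypothetical. KaMRaT performs its own normalization during indexing, so this step is not needed in practice.

```shell
#!/usr/bin/env bash
# Illustration only: scale every sample column so that it sums to <base>.
# Assumed formula: normalized = base * raw / column_total.
# Usage: normalize_counts <matrix.tsv> <base>
normalize_counts() {
    # The file is read twice: first to get column totals, then to rescale.
    awk -F '\t' -v OFS='\t' -v base="$2" '
        NR == FNR {                       # first pass: accumulate column totals
            if (FNR > 1) for (i = 2; i <= NF; i++) total[i] += $i
            next
        }
        FNR == 1 { print; next }          # second pass: keep the header as-is
        {
            for (i = 2; i <= NF; i++)
                $i = (total[i] > 0) ? base * $i / total[i] : 0
            print
        }
    ' "$1" "$1"
}
```

With -nfbase 1000000000, the base would be one billion. Note that awk prints non-integer results with six significant digits by default, which is enough for illustration but not for production use.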

KaMRaT functional operations in various combinations

KaMRaT's functional operations can be arranged in various orders, as shown in the figure above, including:

  • a single operation of filter, mask, merge, rank or query;
  • filter-merge or filter-rank;
  • mask-merge or mask-rank;
  • merge-rank;
  • rank-merge.

When a functional operation is the terminal one (e.g., the merge operation in a filter-merge workflow), the -withcounts option is required in order to produce a human-readable count table. Without it, the operation writes a binary intermediate file, which is meant to be taken by the terminal operation through its -with argument.

Some example use cases are provided in the folder toyroom/usecases/KaMRaT.usecases.