John Lees edited this page Oct 31, 2017 · 23 revisions

An example wrapper script is given as example_wrapper.py, in the top level directory. Main programs are installed into bin. It is best to run the first two steps (kmer counting, kmds) on all the samples you have, then seer on the subset for which you have a phenotype.

See the test directory for example files.

  1. Pipeline summary
  2. Phenotype file
  3. Counting k-mers
  4. kmds
  5. seer

Once complete, see the page for interpreting the output.

Pipeline summary

1. Count k-mers using one of three programs; the choice depends on the size of your dataset and your available resources.
2. Run kmds once, for most datasets with no filtering. This produces .dsm files, which contain a subsample of all the k-mers.
3. Run kmds a second time, giving it a list of the .dsm files from the first step. This calculates a distance matrix and projects it into three dimensions (or as many as specified by --pc).
4. Use this projection when running seer, which filters k-mers and calculates association statistics for those that pass.

Phenotype file

This is equivalent to a [plink](https://www.cog-genomics.org/plink2/input#pheno) .pheno file. There are three columns, tab separated with no header row. The first two, FID and IID, are identical and are the sample names used in dsm/fsm-lite/dsk.

The third field is the phenotype. Entirely 0/1 will be interpreted as binary control/case, anything else as quantitative.

Samples with a missing phenotype should be excluded.
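As a sketch, a minimal binary phenotype file could be created like this (the sample names are invented; use the names from your k-mer counting step):

```shell
# Minimal example phenotype file (sample names are invented).
# Columns: FID, IID (identical) and phenotype; tab separated, no header.
printf 'sample1\tsample1\t0\n' >  metadata.pheno
printf 'sample2\tsample2\t1\n' >> metadata.pheno
printf 'sample3\tsample3\t1\n' >> metadata.pheno
```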

Count your k-mers

First you'll need to count the k-mers present in your samples, which is best done from fasta files of assembled contigs. Three methods can be used, and their output can be read directly by kmds and seer.

If possible, we recommend the use of distributed string mining (dsm), as this is memory efficient and counts k-mers at all lengths, rather than at a single arbitrary fixed length. If you do not have a cluster suitable for dsm, then you should try the single core version fsm-lite.

Once installed, run fsm-lite using a tab-separated list of your sample names and corresponding fasta files in fsm_list.txt:

fsm-lite -v -s 5 -S 95 -l fsm_list.txt -t tmp_index > fsm_kmers.txt

The -s and -S options control the frequency of k-mers to report. In the above example there are 100 samples and only k-mers appearing in between 5 and 95 samples are output. This is equivalent to the recommended 5% minor allele frequency cutoff.
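The list file itself can be generated with a short loop. A sketch, assuming your assemblies sit in an assemblies/ directory named after each sample (both the directory layout and names are hypothetical):

```shell
# Build fsm_list.txt from a directory of assemblies
# (the assemblies/ layout is an assumption; point the glob at your own files).
mkdir -p assemblies
touch assemblies/sample1.fa assemblies/sample2.fa   # placeholders for real assemblies
for fa in assemblies/*.fa; do
    printf '%s\t%s\n' "$(basename "$fa" .fa)" "$fa"
done > fsm_list.txt
```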

dsk

If this is not possible, use dsk to count your kmers, followed by the combineKmers program we provide:

dsk -file sample1_contigs.fa -abundance-min 1 -out sample1_dsk
dsk2ascii -file sample1_dsk.h5 -out sample1_dsk.txt
combineKmers -r 21mer_samples.txt -o all_21mers_new --min_samples 2

The final step is fairly memory intensive. Taking the union of 21mers in 3000 samples took around 25Gb of RAM and about 90 minutes, however this step only needs to be completed once.

If using dsk, we've found that k-mer lengths between 21 and 41 work well.
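combineKmers needs a list of the per-sample counts files. A sketch of building 21mer_samples.txt from the dsk2ascii output (the name-then-path, tab-separated format and the file naming are assumptions; substitute your real per-sample files):

```shell
# Build the input list for combineKmers from per-sample dsk output
# (file naming and the name<TAB>path list format are assumptions).
touch sample1_dsk.txt sample2_dsk.txt   # placeholders for real dsk2ascii output
for f in *_dsk.txt; do
    printf '%s\t%s\n' "${f%_dsk.txt}" "$f"
done > 21mer_samples.txt
```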

kmds

**Update**

It may be easier and faster to use mash to perform this step. See this page for details.

Next, you'll create a matrix representing population structure. This is done in two steps using kmds. If you haven't run dsm, first split your k-mer file into pieces and gzip them to facilitate parallelisation:

split -d -n l/16 fsm_kmers.txt fsm_out
rm fsm_kmers.txt
gzip fsm_out*

Run kmds once on all these files to subsample k-mers for use in projection. A larger size will give a more accurate projection, but slower runtime of the second step. Between 0.1-1% of the total number has worked for us.

kmds -k fsm_out{1..16}.gz --pheno metadata.pheno --no_filtering --no_mds --size 10000

In this command {1..16} is shorthand for running the command 16 times, once for each of the files from the above step (note that split -d actually names the files fsm_out00 through fsm_out15, so adjust the pattern accordingly). You can do this in bash with a for loop, or with GNU parallel if you have multiple cores.
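For example, with a bash for loop (a sketch over placeholder files; the kmds invocations are echoed here rather than executed):

```shell
# Placeholder chunk files standing in for the real split output:
touch fsm_out00.gz fsm_out01.gz
# One kmds run per chunk; remove the echo to actually execute:
for f in fsm_out*.gz; do
    echo kmds -k "$f" --pheno metadata.pheno --no_filtering --no_mds --size 10000
done
# Equivalent with GNU parallel, one job per core:
# parallel kmds -k {} --pheno metadata.pheno --no_filtering --no_mds --size 10000 ::: fsm_out*.gz
```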

To run the projection, use the subsampled .dsm output files from this first kmds step:

ls *.dsm > subsampled_matrices.txt
kmds -p metadata.pheno --mds_concat subsampled_matrices.txt -o all_structure --threads 16 --write_distances

See the troubleshooting section if this stage causes problems.

seer

Type seer with no options to get brief usage, or run the following for a full option listing:

seer -h

Example usage is as follows:

seer -k fsm_out{1..16}.gz --pheno metadata.pheno --struct all_structure --threads 4 --print_samples > significant_kmers.txt

See above for how to correctly run the {1..16} commands.

Various filtering options are available. Note that the defaults are a MAF > 1%, an unadjusted p-value < 10^-5 and an adjusted p-value < 10^-8.

Output columns are: sequence of the element, frequency in the sample set, unadjusted p-value (chi-squared), adjusted p-value (Wald test), adjusted p-value (likelihood ratio test), effect size (beta), standard error, p-values of covariates, comments on the regression fit, and (optionally) the samples the sequence element appears in.

In general, you should use the likelihood ratio test (lrt) p-value for downstream analysis. P-values are not multiple test corrected in the output.
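For instance, to rank the output by the lrt p-value (column 5 in the layout described above) and keep only hits below a stricter threshold, something like the following works; the excerpt rows here are invented:

```shell
# Invented two-row excerpt of seer output (columns as described above):
printf 'ATGCA\t0.30\t1e-06\t1e-09\t1e-10\t2.1\t0.4\tNA\tNA\n'  > significant_kmers.txt
printf 'GGCTA\t0.12\t1e-05\t1e-07\t1e-06\t1.2\t0.3\tNA\tNA\n' >> significant_kmers.txt
# Sort by lrt p-value (column 5) and keep k-mers below 1e-8:
sort -g -k5,5 significant_kmers.txt | awk -F'\t' '$5 < 1e-8' > top_kmers.txt
```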

Using covariates

To use covariates, add --covar_file and --covar_list. The latter lists the columns of the --covar_file to use, comma separated. Covariates are assumed to be categorical unless a q is appended. The covar_file should be tab separated, with the first column containing the sample name. For example, to use columns 2, 3 and 5, where column 2 is quantitative, run:

seer -k fsm_out{1..16}.gz --pheno metadata.pheno --struct all_structure --threads 4 --print_samples --covar_file covariates.txt --covar_list 2q,3,5
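A covariates.txt matching the command above might look like the following (all values are invented; column 2 is the quantitative one):

```shell
# Invented covariates.txt: column 1 the sample name, column 2
# quantitative (e.g. year of isolation), the rest categorical.
printf 'sample1\t2005\tST1\tUK\tward\n'  > covariates.txt
printf 'sample2\t2011\tST22\tUS\tICU\n' >> covariates.txt
```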