Releases: katholt/srst2

SRST2 v0.2.0 - Short Read Sequence Typing for Bacterial Pathogens

30 Jul 05:18

Updates in v0.2.0

  1. Some improvements to allele calling, particularly for Klebsiella MLST locus mdh, kindly contributed by andreyto. Includes rejection of read alignments that are clipped on both ends (likely to be spurious) and minor bug fixes associated with depth calculations.
  2. Updated E. coli serotype database to remove duplicate sequences.
  3. Added mcr-2 colistin resistance gene to ARGannot.r1.fasta resistance gene database.
  4. Added a --threads option, which makes SRST2 call Bowtie and Samtools with their threading options. The resulting speedup is mostly due to the Bowtie mapping step, which parallelises very well.
  5. The VFDB_cdhit_to_csv.py script was updated to work with the new VFDB FASTA format.
  6. Versions of Bowtie2 up to 2.2.9 are now supported. Samtools v1.3 can now be used as well; however, v0.1.18 is still the recommended version (for reasons discussed below).
  7. Added scripts/qsub_srst2.py to generate SRST2 jobs for the Grid Engine (qsub) scheduling system (http://gridscheduler.sourceforge.net/). Thanks to Ramon Fallon from the University of St Andrews for putting this together. Some of the specifics are set up for his cluster, so modifications may be necessary to make it run properly on a different cluster using Grid Engine.
  8. Various other small bug fixes!
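
The rejection of alignments clipped on both ends (item 1) can be illustrated with a minimal sketch. This is not SRST2's actual code: the helper name `clipped_on_both_ends` is hypothetical, and the sketch assumes the check is made on the CIGAR string of each SAM record, treating both soft (S) and hard (H) clips as clipping.

```python
import re

# Matches one CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clipped_on_both_ends(cigar):
    """Return True if the CIGAR string both starts and ends with a clip (S or H)."""
    ops = [op for _, op in CIGAR_OP.findall(cigar)]
    if len(ops) < 2:
        # Unaligned reads ("*") or single-operation alignments cannot be
        # clipped on both ends.
        return False
    return ops[0] in "SH" and ops[-1] in "SH"


# Hypothetical CIGAR strings for illustration only.
for cigar in ["100M", "5S95M", "5S90M5S"]:
    status = "reject" if clipped_on_both_ends(cigar) else "keep"
    print(cigar, status)
```

An alignment clipped on only one end is kept in this sketch, since that pattern is expected for reads overhanging the end of an allele sequence.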

SRST2 v0.1.8 - Short Read Sequence Typing for Bacterial Pathogens

15 Mar 02:57

Updates in v0.1.8

The /data directory includes files for subtyping the LEE pathogenicity island of E. coli, as per Ingle et al., 2016, Nature Microbiology. Instructions below.

Resistance gene database updates:

Fixed ARGannot.r1.fasta to include proper mcr1 DNA sequence.
Added columns to the ARGannot_clustered80.csv table, to indicate classes of beta-lactamases included in the ARGannot.r1.fasta database according to the NCBI beta-lactamase resource (new location for the Lahey list).

Fixed some issues with handling of missing data (i.e. where there were no hits to MLST and/or no hits to genes) when compiling results into a table via --prev_output. This could result in misalignment of gene columns in previous versions.

SRST2 v0.1.7 - Short Read Sequence Typing for Bacterial Pathogens

12 Jan 05:48

Updates in v0.1.7

  1. Use the following environment variables to specify your preferred samtools and bowtie2 executables (thanks to Ben Taylor for this):
  • SRST2_SAMTOOLS
  • SRST2_BOWTIE2
  • SRST2_BOWTIE2_BUILD
  2. Added mcr1, the plasmid-borne colistin resistance gene, to the included ARG-Annot-based resistance gene DB (ARGannot.r1.fasta)
  3. Fixed a problem with writing consensus files that occurred when a directory structure was specified using --output (bug introduced in v0.1.6)
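
As a rough illustration of how the environment variables in item 1 might be consulted, the sketch below looks up each variable and falls back to the plain command name when it is unset. The helper `resolve_executable` and the fallback names are assumptions made for this example, not SRST2's actual implementation.

```python
import os


def resolve_executable(env_var, default):
    """Return the path stored in env_var, or the default command name if unset or empty."""
    return os.environ.get(env_var) or default


# Variable names are from the release notes; the defaults are assumed.
samtools = resolve_executable("SRST2_SAMTOOLS", "samtools")
bowtie2 = resolve_executable("SRST2_BOWTIE2", "bowtie2")
bowtie2_build = resolve_executable("SRST2_BOWTIE2_BUILD", "bowtie2-build")

print(samtools, bowtie2, bowtie2_build)
```

Setting, for example, SRST2_SAMTOOLS to the full path of a samtools v0.1.18 binary before running SRST2 would let a newer samtools stay first on the PATH for other tools.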

SRST2 v0.1.6 - Short Read Sequence Typing for Bacterial Pathogens

19 Nov 23:10

Updates in v0.1.6:

  1. The original validation of SRST2 (see paper) was performed with bowtie2 version 2.1.0 and samtools v0.1.18.
  2. bowtie2: SRST2 has now been tested on the tutorial example and other test data sets using the latest versions of bowtie2, 2.2.3 and 2.2.4, which gave identical results to those obtained with bowtie2 v2.1.0. Therefore, the SRST2 code will now run if any of these versions of bowtie2 are available: 2.1.0, 2.2.3 or 2.2.4.
  3. samtools: SRST2 has now been tested on the Staph & Salmonella test data sets used in the paper, and will work with newer samtools versions (tested up to v1.1). Note, however, that SRST2 still works best with samtools v0.1.18, due to small changes in the mapping algorithms in later versions that result in some loss of reads at the ends of alleles. This has most impact at low read depths; however, we do recommend using v0.1.18 for optimum results.
  4. Minor fixes to the ARG-Annot database of resistance genes, including removal of duplicate sequences and fixes to gene names (thanks to Wan Yu for this). Old version remains unchanged for backwards compatibility, but we recommend using the revised version (located in data/ARGannot.r1.fasta).
  5. Added EcOH database for serotyping E. coli (thanks to Danielle Ingle for this). See Using the EcOH database for serotyping E. coli with SRST2.
  6. Fixed a problem where, when analysing multiple read sets in one SRST2 call against a gene database in which cluster ids don't match gene symbols, individual gene clusters appear multiple times in the output. The compile function was unaffected and remains unchanged.
  7. Fixed behaviour so that including directory paths in --output parameter works (thanks to nyunyun for contributing most of this fix). E.g. --output test_dir/test will create output files prefixed with 'test', located in test_dir/, and all SRST2 functions should work correctly including consensus allele calling. If test_dir/ doesn't exist, we attempt to create it; if this is not possible the user is alerted and SRST2 stops.
  8. Fixed a problem when using a gene database with a simple fasta header (i.e. not clustered for SRST2; note that best results are achieved by pre-clustering your sequence database beforehand) (thanks to cglambert for this one).
  9. Fixes contributed by ppcherng (thanks!):
    • Fixed KeyErrors that occurred when a given seqID was not found in the seq2cluster dictionary, which tended to happen if the FASTA file (gene database) contained empty entries that only have a header and no sequence.
    • Note v0.1.5 included addition of ppcherng's utility scripts to help automate creation of SRST2-compatible gene databases from VFDB.
  10. Added new parameter '--samtools_args' to pass additional options to samtools mpileup (e.g. SionBayliss requested this in order to use '-A' option in samtools mpileup to include anomalous reads).
  11. Fixed problem with consensus sequence reporting of truncated alleles (issue #39).
  12. Added basic instructions for the R scripts provided for plotting output. See Plotting output in R
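
The --output behaviour described in this release (a value such as test_dir/test yields files prefixed with 'test' inside test_dir/, and the directory is created if missing) could be sketched as follows. The function name `prepare_output_prefix` is hypothetical and this is an illustration of the described behaviour, not the code SRST2 ships.

```python
import os
import sys


def prepare_output_prefix(output):
    """Split an --output value into (directory, prefix), creating the directory if needed."""
    out_dir, prefix = os.path.split(output)
    if out_dir and not os.path.isdir(out_dir):
        try:
            os.makedirs(out_dir)
        except OSError as err:
            # Alert the user and stop, as the release notes describe.
            sys.exit("Could not create output directory %s: %s" % (out_dir, err))
    return out_dir, prefix


# A bare prefix has no directory component, so nothing is created.
print(prepare_output_prefix("test"))
```

Calling `prepare_output_prefix("test_dir/test")` would return the pair `("test_dir", "test")` after creating test_dir/ if it did not already exist.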

SRST2 v0.1.5 - Short Read Sequence Typing for Bacterial Pathogens

29 Dec 02:35

Updates in v0.1.5:

  1. Optionally switch on reporting of pileups and consensus sequences (fasta) for novel alleles (--report_new_consensus) or for all alleles (--report_all_consensus). See Printing consensus sequences
  2. Post-process consensus sequences from a set of strains, to generate one file per locus containing all/new consensus sequences. See Collate consensus sequences
  3. Some enhancements to getmlst.py script to handle some more unusual scheme names (force download of specific schemes that have non-unique names, handle forward slashes in names).
  4. Fixed an issue where, if multiple read sets were analysed in series in a single SRST2 run, the fullgenes report would only contain the results for the last read set. The fullgenes report now contains gene output for all read sets.
  5. Added option (--merge_paired) to accommodate cases where users have multiple read sets for the same sample. If this flag is used, SRST2 will assume that all the input reads belong to the same sample, and outputs will be named as [prefix]combined.xxx, where srst2 was run using "--output [prefix]". If the flag is not used, SRST2 will operate as usual and assume that each read pair is a new sample, with output files named as [prefix][sample].xxx, where [sample] is taken from the base name of the reads fastq files. Note that if you have lots of multi-run read sets to analyse, the ease of job submission will depend heavily on how your files are named and you will need to figure out your own approach to manage this (i.e. there is no way to submit multiple sets of multiple reads).
  6. The original validation of SRST2 (see paper) was performed with bowtie2 version 2.1.0. SRST2 has now been tested on the tutorial example using the latest versions of bowtie2, 2.2.3 and 2.2.4, which gave identical results to those obtained with bowtie2 v2.1.0. Therefore, the SRST2 code will now run if any of these versions of bowtie2 are available: 2.1.0, 2.2.3 or 2.2.4. (Note however that there are still incompatibilities with the recent release of samtools, so you will need to stick to samtools v0.1.18 unless you want to modify the SRST2 code to allow later versions, and are happy with a dramatic loss in accuracy!)

SRST2 v0.1.4 - Short Read Sequence Typing for Bacterial Pathogens

26 Jun 09:55

Note that the pre-print of the paper will shortly be available on bioRxiv.

  1. No longer store sam and unsorted bam (can be retained via the --keep_interim_alignment flag)
  2. Added options to specify a maximum number of mismatches to allow during mapping; this is specified separately for mlst and genes, so that it is possible to relax the stringency of gene detection in the same run as a high-accuracy MLST test.
    Default value for both is 10 mismatches.
    --mlst_max_mismatch
    --gene_max_mismatch
  3. The highest minor allele frequency (MAF) of variants encountered in the alignment is now calculated and reported for each allele (in the scores file) and also at the gene level and ST level, to facilitate checking for mixed/contaminated read sets.

This value is in the range 0 -> 0.5; with e.g. 0 indicating no variation between reads at any aligned base (i.e. at all positions in the alignment, all aligned reads agree on the same base call; although this agreed base may be different from the reference); and 0.25 indicating there is at least one position in the alignment at which all reads do not agree, and the least common variant (either match or mismatch to the reference) is present in 25% of reads. This value is printed, for all alleles, to the scores file. Note this is different to the ‘LeastConfident’ information printed to scores, which presents the strongest evidence for mismatch compared to the reference, i.e. between 0 -> 1.

The highest such value for each gene/cluster/locus is reported in the fullgenes output table.

The highest such value across all MLST loci is reported in the mlst output table.

Note that all compiled reports will now include a maxMAF column; if you provide MLST or compiled reports from previous versions without this column, the value “NC” will be inserted in the maxMAF column to indicate “not calculated”. This ensures the updated SRST2 (v0.1.4+) is backwards compatible with previous SRST2 outputs; do be aware though that the older versions of SRST2 (<v0.1.4) will not be forwards-compatible with output generated by more recent versions (v0.1.4 onwards).

  4. Added R code for plotting SRST2 output in R (plotSRST2data.R).
    Instructions will be added to the README.
  5. Added srst2-formatted ARG-Annot resistance gene database and plasmid replicon databases to /data.
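
The minor allele frequency (MAF) reporting described in this release can be illustrated with a minimal sketch. It assumes that, for each aligned position, we already have the number of reads matching and mismatching the reference, and takes the per-position MAF as the smaller of the two fractions, following the "either match or mismatch to the reference" wording above. The function names and input layout are hypothetical, not SRST2's internals.

```python
def position_maf(n_match, n_mismatch):
    """Frequency of the less common class (match or mismatch) at one position."""
    total = n_match + n_mismatch
    if total == 0:
        return 0.0  # no reads cover this position
    return min(n_match, n_mismatch) / float(total)


def max_maf(pileup_counts):
    """Highest per-position MAF across an allele; always in the range 0 to 0.5."""
    return max((position_maf(m, mm) for m, mm in pileup_counts), default=0.0)


# Hypothetical (match, mismatch) counts for three positions:
# all reads match, a 30/10 split, and all reads agree on a non-reference base.
counts = [(40, 0), (30, 10), (0, 40)]
print(max_maf(counts))  # 0.25
```

The third position contributes 0 because all reads agree, even though the agreed base differs from the reference; only the 30/10 split raises the value, to 0.25.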

SRST2 v0.1.3 - Short Read Sequence Typing for Bacterial Pathogens

07 Feb 01:20

Updates in this release

  • fixed a bug that occurred while trying to type genes from a user-supplied database (see issue #5, thanks to Scott Long)
  • fixed a bug in gene detection reporting - genes are now correctly reported by cluster, rather than by gene symbol (see issue #7)
  • added maximum divergence option for reporting; the default is now to report only hits with <10% divergence from the database (see issue #8)
  • added the '--stop_after' parameter, which is passed to bowtie2 as '-u N' to stop mapping after the first N reads. Default behaviour remains to map all reads. However, for large read sets (e.g. >100x), extra reads do not help and merely increase the time taken for mapping and scoring, so you may want to limit mapping to the first million reads (100x of a 2 Mbp genome) using '--stop_after 1000000'.
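
The "100x of a 2 Mbp genome" figure can be reproduced with a short calculation. The release notes do not state the read length, so this sketch assumes paired-end reads of 100 bp each and that the first million counts read pairs; both are assumptions made for illustration, as is the helper name `approx_depth`.

```python
def approx_depth(n_reads, read_length, genome_size, paired=True):
    """Approximate mean depth: total sequenced bases divided by genome size."""
    bases = n_reads * read_length * (2 if paired else 1)
    return bases / float(genome_size)


# One million read pairs of 2 x 100 bp against a 2 Mbp genome (assumed values).
print(approx_depth(1000000, 100, 2000000))  # 100.0
```

With longer reads or a smaller genome, a lower value for '--stop_after' would reach the same depth.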

SRST2 v0.1.3 - Short Read Sequence Typing for Bacterial Pathogens

06 Feb 06:56

Updates in this release

  • fixed a bug that occurred while trying to type genes from a user-supplied database (see issue #5, thanks to Scott Long)
  • fixed a bug in gene detection reporting - genes are now correctly reported by cluster, rather than by gene symbol (see issue #7)
  • added maximum divergence option for reporting; the default is now to report only hits with <10% divergence from the database (see issue #8)
  • added the '--stop_after' parameter, which is passed to bowtie2 as '-u N' to stop mapping after the first N reads. Default behaviour remains to map all reads. However, for large read sets (e.g. >100x), extra reads do not help and merely increase the time taken for mapping and scoring, so you may want to limit mapping to the first million reads (100x of a 2 Mbp genome) using '--stop_after 1000000'.

SRST2 v0.1.2-beta - Short Read Sequence Typing for Bacterial Pathogens

29 Sep 00:18

Changes in this release:

  • Updated srst2.py and scores_vs_expected.py to ensure compatible generation and parsing of scores files.
  • Updated the readme to include more detailed descriptions of features, the example, and details of creating/parsing gene databases.

SRST2 Short Read Sequence Typing for Bacterial Pathogens

26 Sep 04:00

Fixed a bug in identification of forward/reverse read pairs.