Skip to content

Releases: FelixKrueger/Bismark

Version 0.24.2

27 Sep 09:01
acf965c
Compare
Choose a tag to compare

Just a few fixes, also added two flavours of scripts for merging coverage files (e.g. for when R1 and R2 had been run in single-end mode)

Bismark

  • removed an exit 0 that would terminate runs after processing a single (set of) input file(s).

deduplicate_bismark

  • Changed the path to Samtools to custom variable (#609)

coverage2cytosine

  • set threshold reads to 1 (if it was 0) for --gc_context as intended and mentioned in the help text. Fixes #621

monolithic beast no more

29 May 08:48
7288cb6
Compare
Choose a tag to compare
  • Added entirely new documentation website, built using Material for Mkdocs. Thanks to @ewels for a fantastic (late-night) effort to break up and restructure what had become a fairly unwieldy monolithic beast of markdown document...

  • Added docs for cytosine context summary, useful for GpC methylation or filtering for specific C context (e.g. CpA)

  • Updated docs for the dovetailing

Bismark

  • Warning messages for closing ambiguous and unmapped file handles only occur when these options were specified see here

0.24.0 - long read support with minimap2

06 Oct 16:04
430df15
Compare
Choose a tag to compare

Bismark

  • Added new option --strandID which reports the alignment strand identity for paired-end, non-directional libraries, e.g. YS:Z:CTOT. This information may be difficult to obtain if third party tools interfered with the read ordering (admittedly there is a fine balance of read reporting position, FLAG, Read 1 and Genome conversion state to make it work in the first place. More information can be found in this thread).

  • runs with --parallel/--multicore > 1 specified will now terminate with an error message whenever one of the child processes fails. This prevents potentially incomplete result files making it through to the end unnoticed (more #494)

  • runs with --parallel/--multicore > 1 as well as --unmapped and/or --ambiguous specified will no longer produce potentially corrupt FastQ files (more #495)

  • Added option --mm2/--minimap2 to use minimap2 as the underlying aligner. The minimap2 alignment modes include Oxford Nanopore, PacBio and accurate short reads. In its current implementation, minimap2 can be invoked in one of the following ways:

  • --mm2_nanopore: Sets preset settings for Oxford Nanopore vs reference mapping '-x map-ont' [default]

  • --mm2_pacbio: Sets preset settings for PacBio vs. reference mapping '-x map-pb'

  • --mm2_short_reads: Sets preset settings for accurate short reads '-x sr'

  • added option --mm2_maximum_length <int> to set a maximum length cutoff, which might be required for very long reads exceeding the maximum number of CIGAR operations tolerated by the BAM formatted reads (>65535). The default is 10,000 bp.

Other options that are currently set within Bismark include '-a' (SAM output), '--MD' (MD tag), '--secondary=no'.

Prompted by fairly slow alignment speeds with the minimap2 default settings, we set out to improve the performance of the alignment process by tweaking several different parameters

Speed optimisiation: optimisation of minimap2 parameters

k-mer size
Due to the reduced DNA alphabet the minimap2 default k-mer size of 15 leads to substantially higher alignment times. Based on our tests we settled for a new default of ‘-k 20’
minibatch size
The minimap2 default minibatch size of 500 million bp means that a substantial amount of data is aligned and held in memory before additional alignment threads can be started. Reducing the minibatch size to 250K reads seemed to be a good compromise (‘-K 250K’).
minimap2 multi-threading
minimap2 alignments may utilize multiple cores for each alignment process; we found that ‘-t 2’ offered a good speed-up, while allowing additional resources had diminishing returns.
Bismark multi-threading
We also tested the potential of using additional resources for Bismark itself (--parallel), which appeared to result in a speed-up of the alignment process as expected; however this comes at the cost of requiring additional system resources.

As a result of these tests, we changed the default settings for minimap2 alignment parameters to ‘-t 2 -k 20 -K 250K’.

methylation_consistency

  • Added new option --chh to use cytosines in CHH instead of CpG context to enable some trouble shooting and method development

bismark2report

  • The CHH/CHG labels for the Cytosine Methylation after Extraction plot now appear in the correct order

bismark_methylation_extractor

  • removed a print statement that would flood STDOUT the logfile if --merge_non_CG (but not --comprehensive) had been selected

  • runs with --parallel/--multicore specified will now terminate with an error message whenever one of the child processes fails. This prevents potentially incomplete result files making it through to the end unnoticed

  • changed the option -o/--output to -o/--output_dir for consistency reasons...

bismark_genome_preparation

  • Added option --mm2/--minimap2. The genome indexing process (bismark_genome_preparation) writes out a minimap2 index to the genome folder, using the optimized k-mer size of -k 20 (see comments for bismark itself). This pre-generated minimap2 index takes precedence over indexing options that would otherwise happen as part of the alignment procedure.

deduplicate_bismark

  • when using an output filename -o customname the deduplication report will also be derived from customname.

Added a sentence to the Docs that Genozip 14 and above supports Bismark BAM files (with a substantial gain in compression).

fix auto-detection

26 Jul 07:40
Compare
Choose a tag to compare

filter_non_conversion

  • fixed global setting of --paired or --single mode. Auto-detection now works by only looking at the @PG ID:Bismark line of the SAM header.

methylation_consistency

  • Auto-detection now works by only looking at the @PG ID:Bismark line of the SAM header.

coverage2cytosine

  • Swapped the columns for count methylated and count unmethylated for the context summary report to match the header line.

v0.23.0

09 Nov 13:29
Compare
Choose a tag to compare

Bismark Release v0.23.0

  • Migrated CI tests from Travis to Github Actions

deduplicate_bismark

  • the command deduplicate_bismark --barcode *bam now works again. Previously the output file names were accidentally all derived from the first supplied file in --barcode (= UMI) mode (it had been fixed for normal files in 0.22.2).

  • Changed the way the library auto-detection works to only look at the @PG ID:Bismark line of the SAM header (to only look for the Bismark command)

bismark_methylation_extractor / bismark2bedGraph

  • Added a new option --ucsc to bismark2bedGraph and bismark_methylation_extractor that will produce a UCSC-ready bedGraph file if the genome version used came from Ensembl. This option (i) prefixes chromosome names with 'chr', and (ii) changes the mitochondrial chromosome from 'MT' to 'chrM'. In addition, it will also write out a new file ending in .chromosome_sizes.txt for easier use of bedGraphToBigWig. More here.

  • Changed the way the library auto-detection works to only look at the @PG ID:Bismark line of the SAM header.

coverage2cytosine

  • Added a new output file for all cytosine context methylation totals. More information here: #321.

  • Added new option --drach/--m6A. Most m6A sites are found in the conserved sequence motif DRACH (where D=G/A/U, R=G/A, H=A/U/C), and if bound by anti-m6A antibody, it causes the reverse transcriptase to introduce C to T transitions at the cytosine which follows A in the DRACH motif. This option also sets a coverage threshold of at 1 unless specified explicitly. This is a very specialised option and should only be used by experimentalists looking at m6A methylation (where the C to T transition acts as a proxy of m6A).

bismark2summary

  • Samples with absolutely 0 methylation calls in some context are now excluded from the graphical HTML output (as they break rendering the entire summary graph section). These samples and their statistics do still appear in the file bismark_summary_report.txt. More information here: #315.

v0.22.3

19 Nov 11:35
Compare
Choose a tag to compare

Bismark

  • Accepted pull request to fix the MAPQ score calculation in local mode.

methylation_consistency

0.22.2

16 Oct 14:41
f960b3a
Compare
Choose a tag to compare
  • Added FAQ document for questions that keep coming up. Will be populated over time.

Bismark

  • the option --non_bs_mm is now only allowed in end-to-end mode

  • Fixed the calculation of non bisulfite mismatches for paired-end data which happened correctly only when R2 had an InDel (see here)

  • When the option -u was used in conjunction with --parallel, only -u sequences will be written to the temporary subset files for each spawn of Bismark (previously, the entire file was split for --parallel, but then only a small subset of those files was used for -u, which resulted in very long runs even for a small number of analysed sequences)

deduplicate_bismark

  • the command deduplicate_bismark *bam now works again. Previously the output file names were accidentally all derived from the first supplied file.

coverage2cytosine

  • Added new option --coverage_threshold INT. Positions have to be covered by at least INT calls (irrespective of their methylation state) before they get reported. For NOMe-seq, the minimum threshold is automatically set to 1 unless specified explicitly. Setting a coverage threshold does not work in conjunction with --merge_CpGs (as all genomix CpGs are required for this). Default: 0 (i.e. all genomic positions get reported)

bismark2report

  • added seconds to the timestamp report statement (which caused a warning on certain, but not all, platforms)

bismark2summary

  • Now reads splitting reports even for non-deduplicated files (such as RRBS).

Essential Easter Performance Release [EEPR]

21 Apr 15:52
Compare
Choose a tag to compare

Bismark

  • Hot-fixed (read: removed) the cause of delay during the MD:Z: field computation for reads containing a deletion (which was roughly equal to 1 second per read). Apologies, I did it again...

  • Changed the default --score_min function for HISAT2 in --local mode back to a linear function (instead of using the logarithmic model that is employed by Bowtie 2). The default is now --score_min L,0,-0.2 for both end-to-end (default) and --local mode. It should be mentioned that we currently don't understand how exactly the scoring mode in HISAT2 works (even though the scores appear to be all negative with a maximum value of 0), so this might change somewhat in the future. See here for more info.

Easter Release - local alignments for single-cell and scNMT-seq

16 Apr 15:42
Compare
Choose a tag to compare

Expanding on our observation that single-cell BS-seq, or PBAT libraries in general, can generate chimeric read pairs, a recent publication by Wu et al. described in further detail that intra-fragment chimeras can hinder the efficient alignment of single-cell BS-seq libraries. In there, the authors described a pipeline that uses paired-end alignments first, followed by a second, single-end alignment step that uses local alignments in a bid to improve the mapping of intra-molecular chimeras. To allow this type of improvement for single-cell or PBAT libraries, we have been experimenting with allowing local alignments.

Please note that we still do not recommend using local alignments as a means to magically increase mapping efficiencies (please see here), but we do acknowledge that PBAT/scBSs-seq/scNMT-seq are exceptional applications where local alignments might indeed make a difference (there is only so much data to be had from a single cell...).
We didn't have the time yet to set more appropriate or stringent default values for local alignments (suggestions welcome), nor did we investigate whether the methylation extraction will require an additional --ignore flag if a read was found to the be soft-clipped (the so called 'micro-homology domains'). This might be added in the near future.

Bismark

  • Added support for local alignments by introducing the new option --local. This means that the CIGAR operation S (soft-clipping) is now supported

  • fixed typo in option --path_to_bowtie2 (a single missing 2 was preventing the specified path to be accepted)

  • fixed typo in option --no-spliced-alignment in HISAT2 mode

  • fixed missing end-of-line character for unmapped or ambiguous FastQ sequences in paired-end FastQ mode

  • fixed output file naming in --hisat2 and --parallel mode (_hisat2 was missing in --parallel mode). Thanks to @phue for spotting this.

bismark_genome_preparation

  • Added option --large-index to force the generation of LARGE genome indexes. This may be required for indexing extremely large genomes (e.g. the Axolotl (32 GigaBases)) in --parallel mode. For more information on why the indexing was failing previously see here

bismark_methylation_extractor

  • Now supporting reads containing soft-clipped bases (CIGAR operation S)

bam2nuc

  • Now supporting reads containing soft-clipped bases (CIGAR operation S)

deduplicate_bismark

  • Now supporting reads containing soft-clipped bases (CIGAR operation S)

[New]: HISAT2 and SLAM-mode; [Retired]: Bowtie 1

14 Mar 13:14
Compare
Choose a tag to compare

For the upcoming version Bismark has undergone some substantial changes, which sometimes affect more than one module within the Bismark suite. Here is a short description of the major changes:

[Retired]: Bowtie 1 support

  • Bowtie (1) support, and all of its options, has been completely dropped from bismark_genome_preparation and bismark. This decision was not made lightly, but it seems no one is using the original Bowtie short read aligner anymore, even short reads have moved on...
  • Consequently, the option --vanilla and its handling has been removed from a number of modules (bismark_genome_preparation, bismark, bismark_methylation_extractor and deduplicate_bismark). Too bad, I liked that name...

[Added]: HISAT2 support

  • Instead, the DNA and RNA aligner HISAT2 has been added as a new choice of aligner. The reason for this is not necessarily that RNA methylation is now a thing, but certain alignment modes (see below) do require splice-aware mapping if we don't want to miss out on a whole class of (spliced) alignments. Bowtie 2 is the default mode, HISAT2 alignments can be enabled with the option --hisat2

  • Similar to the Bowtie2 mode, alignments with HISAT2 are restricted to global (end-to-end) alignments, i.e. soft-clipping is disabled. Furthermore, in paired-end mode, the options --no-mixed and --no-discordant are permanently enabled, meaning that only properly aligned read pairs are put out.

  • As the --hisat2 mode supports spliced alignments, the new CIGAR operation N is now supported in all Bismark modules (this includes bismark_genome_preparation, bismark, bismark_methylation_extractor, deduplicate_bismark and some others).

At the time of writing this, the --hisat2 mode appears to be working as expected. It should be mentioned however that we have not done a lot of testing of these new files, so comments and feedback are welcome.

SLAM-seq mode

We also added a new, experimental and completely different type of alignment for SLAM-seq type data (option --slam). This fairly recent method to interrogate newly synthesized messenger RNA is akin to bisulfite conversion, in that newly synthesized RNA may contain T to C conversions following an alkylation reaction (original publication and https://www.nature.com/articles/nmeth.4435). The new Bismark alignment mode --slam performs T>C conversions of both the genome (in the genome preparation step) and the subsequent alignment steps (Bismark alignment step). Currently, the rest of the processing of SLAM-seq data hijacks the standard methylation pipeline:

  • T>C conversions are written out as methylation events in CpG context, while T-T matches are scored as unmethylated events in CpG context. Other cytosine contexts are not being used.

So in a nut-shell: methylation calls in --slam mode are either Ts (unmethylated calls = matches at T positions), or T to C mismatches (methylated calls = C mismatches at T positions).

It should be noted that this is currently an experimental workflow. One might argue that T/C conversion aware (or T/C mis-mapping agnostic) mapping is currently not necessary for SLAM-seq, NASC-Seq, or scSLAM-seq data as the labeling reaction is very inefficient (1 in only 50 to 200 newly incorporated Ts is a 4sU, which may get alkylated). This might be true - for now. If and when the conversion reaction improves over time, C/T agnostic mapping, similar to bisulfite-Seq data, might very well become necessary.

Here is a screenshot of a comparison of aligning the same data (SLAM-seq-like) with Bismark in Bowtie 2 mode (top track) and HISAT2 mode (middle track). Alignments with HISAT2 recover a lot more alignments to short exons, as well as exon-exon spanning reads (evidenced in bottom track):

Bowtie2 HISAT2 aligment comparison

  • Added documentation for NOMe-seq or scNMT-seq processing.

bismark

  • Dropped support for Bowtie

  • Removed all traces of --vanilla

  • Added support for HISAT2 with option --hisat2.

  • Added HISAT2 option --no-spliced-aligments to disable spliced alignments altogether

  • Added HISAT2 option --known-splicesite-infile <path> to provide a list of known splice sites.

  • Added option --slam to allow T/C mismatch agnostic mapping (3-letter alignment). More here.

  • Added a new option --icpc to truncate read IDs at the first space (or tab) it encounters in the (FastQ) read ID, which are sometimes used to add comments to a FastQ entry (instead of replacing them with underscores which is the default behaviour).

bismark_genome_preparation

  • Dropped support for Bowtie

  • Added support for HISAT2 with option --hisat2.

  • Added option --slam. Instead of performing an in-silico bisulfite conversion, this mode transforms T to C (forward strand), or A to G (reverse strand). The folder structure and rest of the indexing process is currently exactly the same as for bisulfite sequences, but this might change at some point. This means that a genome prepared in --slam mode is currently indistinguishable from a true Bisulfite Genome (until the alignments are in) so please make sure you name the genome folder appropriately to avoid confusion.

deduplicate_bismark

  • Removed all traces of --vanilla

  • --bam mode is now the default. Uncompressed SAM output may still be obtained using the new option --sam

  • Added new option -o/--outfile <basename>. This basename is then modified to remove file endings such as .bam, .sam, .txt or .gz, and .deduplicated.bam, or .multiple.deduplicated.bam in --multiple mode, is then appended for consistency reasons.

  • Added support for new CIGAR operation N

bismark_methylation_extractor

  • Added support for new CIGAR operation N for all extraction modes

  • Removed all traces of --vanilla

bismark2summary/bismark2report

  • Adapted to work with Bismark HISAT2 reports instead of Bowtie 1 reports.

bam2nuc

  • Reads containing spliced reads are now also skipped when determining the genomic base composition (as are reads with InDels).