Skip to content

ribokit/M2seq

Repository files navigation

M2seq

##Analysis of 2D mutate-and-map signal in next-generation-sequencing data

Installation

Ensure that you have the following installed:

Add the M2seq, novobarcode, etc. folders to your PATH. For example in your .bashrc add lines like PATH=$PATH:$HOME/src/M2seq.

Running the code

To test on example sequencing data, download the two example FASTQs from this link and move them to the Tutorial folder. Then, run:

m2seq.py P4P6.fa RTBbarcodes.fa Sample1_S1_L001_R1_001.fastq Sample1_S1_L001_R2_001.fastq --config example.cfg --offset 89
  • P4P6.fa [required] is a fasta-formatted file with the name and sequence of the RNA. Note that the name of the fasta file and the sequence in the fasta file must match the name of the reference sequence given in the config file.
  • RTBbarcodes.fa [required] is a Novobarcode-formatted file with the names and sequences of the barcodes in the RTB primers.
  • The read 1 and read 2 FASTQs are required inputs.
  • example.cfg [optional] is a config file in the format required by ShapeMapper. If a config file is provided, the 2D dataset will be generated by ShapeMapper analysis followed by conversion of binary mutation-counted files to 2D data that are output as RDAT files. The correct names of the conditions in the pilot experiment used for this tutorial are already input in the [alignments] and [profiles] sections of the config file. These sections must be edited for different experiments.
  • Make sure no spaces are in any of the folder names parent to this directory -- you may see fails of bowtie2 otherwise.
  • Useful tip: Especially if this is your first run, tou may want to do a 'pilot run' with test fastq files that hold just a few test lines from your full FASTQ files. Prepare these as head -n 10000 Sample1_S1_L001_R1_001.fastq > test/Sample1_S1_L001_R1_001.fastq and head -n 10000 Sample1_S1_L001_R2_001.fastq > test/Sample1_S1_L001_R2_001.fastq and supply test/RTBbarcodes.fa Sample1_S1_L001_R1_001.fastq test/Sample1_S1_L001_R2_001.fastq in the m2seq.py command line.
  • If you need to rerun, you'll need to cleanly remove the previous output directories: rm -rf 1_Demultiplex/ 2_ShapeMapper/ 3_M2seq/.

The m2seq.py command performs the following:

  • First, it uses Novobarcode to demultiplex the pairs of sequencing reads using the barcodes provided in RTBbarcodes.fa
  • Then, it performs ShapeMapper analysis with the settings in the config file to compare the reference sequence to the paired sequencing reads for each barcode
  • ShapeMapper outputs a mutation strings text file for each barcode recording the mutations per read compared to the reference sequence, which is then converted to binary format
  • The binary-formatted mutation strings are analyzed with simple_to_rdat.py to generate 2D datasets

The output of the pipeline includes, for each barcode:

  • 2D reactivity RDAT
  • 2D raw counts RDAT
  • counted mutations over all sequencing reads
  • log files The ExampleResults archive contains these expected outputs from running the analysis on the example sequencing data at the link above.

Visualizing the mutate-and-map output

The key modes of visualizing the M2seq data are as a 2D plot (mutate-and-map style) and as mutation spectra (rates of mutation/deletion both across the sequence and on average).

To view the 2D plot, open MATLAB, make sure the matlab folder is in your MATLAB path, and run:

m2seqplot('3_M2seq/simple_files/RTB005_P4P6_P4P6.reactivity.rdat');

Compare the resulting plot to the 2D plot in the ExampleResults folder (you may need to unzip ExampleResults.zip): Example2D.eps

To view the mutation spectra, open MATLAB and run:

mut_heatmap('2_ShapeMapper/output/counted_mutations/RTB005_P4P6.csv',103:260,'',13);

Compare the resulting plot to the mutation spectra in the Figures folder: ExampleSpectra.eps, ExampleSpectra_mut_avg.eps, ExampleSpectra_mut_max.eps

Structural analysis

Two modes of structural analysis are available:

  • M2-net is a neural-net-inspired analysis that detects helix signatures in the mutate-and-map data without making assumptions about RNA folding energy models. It largely reproduces visual analysis.

  • The RNAstructure secondary structure modeling package (version 6.0, expected to be released in mid-2017) can take in M2seq data to guide modeling of all the helices in a model and estimate confidence in the structures.

Both styles of analysis can be run through MATLAB scripts available in the Biers repository – check out installation instructions and some examples on the P4-P6 and GIR1 RNAs on RiboKit.

Citation

If you use this software, please cite:

Cheng, C. Kladwang, W. Yesselman, J. Das, R. (2017) "RNA structure inference through chemical mapping after accidental or intentional mutations" BioRxiv https://doi.org/10.1101/169953.