##Analysis of 2D mutate-and-map signal in next-generation-sequencing data
Ensure that you have the following installed:
- Novobarcode, part of the Novoalign software package, which is freely available for educational and not-for-profit use. Download the latest version of Novoalign at http://www.novocraft.com/support/download/
- ShapeMapper 1.2, software developed by the Weeks lab at UNC Chapel Hill for 1D analysis of mutational profiling data. Available at http://www.chem.unc.edu/rna/software.html (Make sure you go into that directory and run
make
.) - BowTie2 is needed for ShapeMapper. Available here: https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.9/. Version 2.2.9 works.
- the
nwalign
python package, which you can install viasudo pip install nwalign
- numpy
- The RDATkit for handling RDAT data files. Available at https://github.com/ribokit/RDATKit
Add the M2seq, novobarcode, etc. folders to your PATH. For example in your .bashrc
add lines like PATH=$PATH:$HOME/src/M2seq
.
To test on example sequencing data, download the two example FASTQs from this link and move them to the Tutorial folder. Then, run:
m2seq.py P4P6.fa RTBbarcodes.fa Sample1_S1_L001_R1_001.fastq Sample1_S1_L001_R2_001.fastq --config example.cfg --offset 89
P4P6.fa
[required] is a fasta-formatted file with the name and sequence of the RNA. Note that the name of the fasta file and the sequence in the fasta file must match the name of the reference sequence given in the config file.RTBbarcodes.fa
[required] is a Novobarcode-formatted file with the names and sequences of the barcodes in the RTB primers.- The read 1 and read 2
FASTQ
s are required inputs. example.cfg
[optional] is a config file in the format required by ShapeMapper. If a config file is provided, the 2D dataset will be generated by ShapeMapper analysis followed by conversion of binary mutation-counted files to 2D data that are output as RDAT files. The correct names of the conditions in the pilot experiment used for this tutorial are already input in the [alignments] and [profiles] sections of the config file. These sections must be edited for different experiments.- Make sure no spaces are in any of the folder names parent to this directory -- you may see fails of bowtie2 otherwise.
- Useful tip: Especially if this is your first run, tou may want to do a 'pilot run' with test fastq files that hold just a few test lines from your full
FASTQ
files. Prepare these ashead -n 10000 Sample1_S1_L001_R1_001.fastq > test/Sample1_S1_L001_R1_001.fastq
andhead -n 10000 Sample1_S1_L001_R2_001.fastq > test/Sample1_S1_L001_R2_001.fastq
and supplytest/RTBbarcodes.fa Sample1_S1_L001_R1_001.fastq test/Sample1_S1_L001_R2_001.fastq
in them2seq.py
command line. - If you need to rerun, you'll need to cleanly remove the previous output directories:
rm -rf 1_Demultiplex/ 2_ShapeMapper/ 3_M2seq/
.
The m2seq.py
command performs the following:
- First, it uses Novobarcode to demultiplex the pairs of sequencing reads using the barcodes provided in RTBbarcodes.fa
- Then, it performs ShapeMapper analysis with the settings in the config file to compare the reference sequence to the paired sequencing reads for each barcode
- ShapeMapper outputs a mutation strings text file for each barcode recording the mutations per read compared to the reference sequence, which is then converted to binary format
- The binary-formatted mutation strings are analyzed with simple_to_rdat.py to generate 2D datasets
The output of the pipeline includes, for each barcode:
- 2D reactivity RDAT
- 2D raw counts RDAT
- counted mutations over all sequencing reads
- log files The ExampleResults archive contains these expected outputs from running the analysis on the example sequencing data at the link above.
The key modes of visualizing the M2seq data are as a 2D plot (mutate-and-map style) and as mutation spectra (rates of mutation/deletion both across the sequence and on average).
To view the 2D plot, open MATLAB, make sure the matlab
folder is in your MATLAB path, and run:
m2seqplot('3_M2seq/simple_files/RTB005_P4P6_P4P6.reactivity.rdat');
Compare the resulting plot to the 2D plot in the ExampleResults
folder (you may need to unzip ExampleResults.zip
): Example2D.eps
To view the mutation spectra, open MATLAB and run:
mut_heatmap('2_ShapeMapper/output/counted_mutations/RTB005_P4P6.csv',103:260,'',13);
Compare the resulting plot to the mutation spectra in the Figures folder: ExampleSpectra.eps
, ExampleSpectra_mut_avg.eps
, ExampleSpectra_mut_max.eps
Two modes of structural analysis are available:
-
M2-net
is a neural-net-inspired analysis that detects helix signatures in the mutate-and-map data without making assumptions about RNA folding energy models. It largely reproduces visual analysis. -
The
RNAstructure
secondary structure modeling package (version 6.0, expected to be released in mid-2017) can take in M2seq data to guide modeling of all the helices in a model and estimate confidence in the structures.
Both styles of analysis can be run through MATLAB
scripts available in the Biers repository – check out installation instructions and some examples on the P4-P6 and GIR1 RNAs on RiboKit.
If you use this software, please cite:
Cheng, C. Kladwang, W. Yesselman, J. Das, R. (2017) "RNA structure inference through chemical mapping after accidental or intentional mutations" BioRxiv https://doi.org/10.1101/169953.