GitHub - DasLab/BPPalign

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
RNAstructure		RNAstructure
add		add
b12		b12
gly		gly
scripts		scripts
README.txt		README.txt

Repository files navigation

########################
BPPalign v0.5
########################
(c) Rhiju Das, Stanford University 2011
rhiju@stanford.edu

Used in publication: Kladwang, W, Chou, F.C., Das, R., "Automated RNA structure prediction uncovers a missing link in double glycine riboswitches", under review. (2011).

This tool provides basic scripts to run the partition executable RNAstructure on hundreds of homolog sequences from RFAM, and to then visualize the results by averaging across these matrices. Examples from purine-binding and double glycine-binding riboswitches are included.

Requirements:
1. Python 2.x or later.
2. MATLAB
3. This SVN archive includes a version of the RNAstructure executable 'partition' that you need to compile for UNIX machines. This executable has been modiifed to output the base pair probability matrix into a text file bpp.txt.
4. Internet connection (for Entrez queries to get nucleotide sequences)

BPPalign has been tested on Mac and some Linux variants.

##########################################
Before you run, compile RNAstructure
##########################################
This follows the standard commands:

cd RNAstructure
make all

But then you must also set the $DATAPATH variable to the data_tables/ directory inside RNAstructure. E.g., in bash:

export DATAPATH=RNAstructure/data_tables [replace this with the path to the RNAstructure directory]

This command can also go into your ~/.bashrc file.

#################################
EXAMPLE 1 [purine riboswitch]
#################################

Go into the directory add/.

An example starting alignment, taken from the RFAM database, is there:

add_rfam.txt

First, we want to get sequence files that correspond these sequences. One of the main uses of bpp_align is to look for structure in flanking sequences that may have been missed in the RFAM alignment. So we retrieve the original sequences along with 5' and 3' sequences, through queries to Entrez Pubmed:

../scripts/get_original_sequences.py add_rfam.txt add_seqs 10 70

The numbers 10 and 70 mean look 10 nucleotides upstream (5') of the RFAM alignment boundary and 70 nucleotides downstream (3') of the boundary. This command should retrieve the sequences from the Entrez nucleotide database into the folder add_seqs/ as .fasta files and also as .seq files appropriate for input to the RNAstructure executable for predicting secondary structure. There is an additional file 'new_align.txt' that includes alignment information for these sequences.

Then, let's run RNAstructure.

../scripts/run_partition.py add_seqs/*seq

This requires having a working executable in RNAstructure/exe/partition -- please check directions for compiling RNAstructure. Note that the version included with BPPalign has a modification to automatically output the base pairing probability matrix bpp to a text file.

To visualize the data go into the scripts directory, and start MATLAB. Run the commands in

plot_bpp_ADD_script.m

Check out that script for comments on what input is needed to get the plot. The plot should look like BPPalign_add_example.eps

##########################################
EXAMPLE 2 [double glycine riboswitch]
##########################################
Go into directory gly/. Run

get_double_sequences.py gly_rfam.txt

This looks at the alignment of single-riboswitch domains in gly_rfam.txt and finds pairs of single aptamers that can go together into double riboswitches. Some nucleotides in the linker are tagged 'X'. The output is in gly2_rfam.txt.

For the paper, I also just took to the first 360 sequences (the rest are marine metagenome shotgun sequences.):

head -n 360 gly2_rfam.txt > gly2_rfam_nometagenome.txt

Then, again. retrieve sequences with downstream and upstream flanking regions -- also fillin 'X' nucleotides in the linker:

../scripts/get_original_sequences.py gly2_rfam_nometagenome.txt gly_seqs 100 100

Here, 100 100 means get 100 nucleotides both upstream and downstream of the double-riboswitch domain. Then run RNAstructure on all these sequences:

../scripts/run_partition.py gly_seqs/*seq

Go back up and into the scripts/ directory. In MATLAB, run the commands in

plot_bpp_GLY_script.m

The default commands in this script select out sequences 1-360 (the rest are marine metagenome sequences and appear pretty redundant), and align them to the F. nucleatum sequence.