### RNA folding kinetics control riboswitch sensitivity in vivo 
#### David Z. Bushhouse1,3 , Jiayu Fu1,3, & Julius B. Lucks1,2,3,4,5,6* 

 

1 Interdisciplinary Biological Sciences Graduate Program, Northwestern University, Evanston, Illinois 60208, USA 

2 Department of Chemical and Biological Engineering, Northwestern University, Evanston, Illinois 60208, USA 

3 Center for Synthetic Biology, Northwestern University, Evanston, Illinois 60208, USA 

4 Center for Water Research, Northwestern University, Evanston, Illinois 60208, USA 

5 Center for Engineering Sustainability and Resilience, Northwestern University, Evanston, Illinois 60208, USA 

6 International Institute for Nanotechnology, Northwestern University, Evanston, Illinois 60208, USA 


* To whom correspondence should be addressed. Tel: 1-847-467-2943; Email: jblucks@northwestern.edu  

#### Import Packages

In [1]:
import Bio.Entrez as entrez
import Bio.SeqIO as SeqIO
import pandas as pd
import numpy as np
import regex as re
import ViennaRNA as RNA

#### Using Rfam FASTA to Query NCBI and get downstream sequence for each aptamer
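Rfam provides a FASTA file of all sequences for a given family. Before running this notebook, that file must be converted to a two-column `.csv` (FASTA header, sequence) with no header row, saved here as `RF01750.csv`. The ZTP riboswitch family can be found at http://rfam.org/family/RF01750.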
-  Note that Rfam has some duplicate entries that come from different genome entries of the same organism (e.g. C. botulinum has several duplicates). These are resolved manually after all other analysis by checking for identical riboswitch sequences.
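The FASTA-to-CSV conversion itself is not shown in this notebook. A minimal sketch of one way to do it, assuming a standard multi-line FASTA file and the header-less two-column layout read by the next cell. The function names `fasta_to_rows` and `fasta_to_csv` and the input filename `RF01750.fa` are hypothetical:

```python
import csv

def fasta_to_rows(lines):
    """Yield (entry, sequence) pairs from FASTA-formatted lines.

    The leading '>' is kept on the entry because the next cell strips it.
    """
    entry, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new header closes the previous record
            if entry is not None:
                yield entry, ''.join(chunks)
            entry, chunks = line, []
        else:
            chunks.append(line)
    if entry is not None:
        yield entry, ''.join(chunks)

def fasta_to_csv(fasta_path, csv_path):
    """Write one row per FASTA record: column 0 = header, column 1 = sequence."""
    with open(fasta_path) as fin, open(csv_path, 'w', newline='') as fout:
        csv.writer(fout).writerows(fasta_to_rows(fin))

# Hypothetical usage (filenames are assumptions):
# fasta_to_csv('RF01750.fa', 'RF01750.csv')
```

The `csv` module quotes any header that contains a comma, so `pd.read_csv(..., header=None)` should still see exactly two columns.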

In [2]:
align_df = pd.read_csv('RF01750.csv', header=None).rename(columns={0:'entry', 1:'sequence'})

# Filter out entries with underscores, which result in errors when querying NCBI
filtered_df = align_df[~align_df['entry'].str.contains('_')]

# Initialize lists for data
species_list = []
start_list = []
stop_list = []

entrez.email = 'AnneExample@gmail.com'
# Extract data from the 'entry' column
for entry in filtered_df['entry']:
    split_strings = entry.split('/')
    species = split_strings[0].strip('>')
    species_list.append(species)
    start, stop = split_strings[1].split(' ')[0].split('-')
    start_list.append(int(start))
    stop_list.append(int(stop))


align_df = pd.DataFrame({'species': species_list, 'start': start_list, 'stop': stop_list, 'sequence': filtered_df['sequence']})
align_df.reset_index(drop=True, inplace=True)
pd.DataFrame(align_df)


Unnamed: 0,species,start,stop,sequence
0,MLQM01000024.1,37782,37700,TGATGTGGTCGTGACTGGCGCGGAAGGTGGAGCACCACCGGGGAGC...
1,JJMM01000002.1,68149,68234,AGAGTTTTATGCGACTGGCGGAAATAGTGGAATCAACCACGTGGAG...
2,MJEH01000064.1,71685,71607,TATTGTATGTGTGACTGGCGAAATGTGGAATACCACAGGGGAGCAC...
3,CP002400.1,1795690,1795605,TCAGGCTCTTATGACTGACGGAATTGTGGAGCACACCACAAGGGAG...
4,ADVL01000269.1,3206,3290,GCCCTCCCGCGTGACTGGCGAGAGAGATGGAGCACCATCGGGGAGC...
...,...,...,...,...
1750,ABAX03000018.1,2486,2565,UGAAAGCUGUGCGACUGGCGGAUAGUGGAGAAACCACGGGGAGCAC...
1751,AAVO02000015.1,35242,35324,AAAGUCUCUGGUAACUGGCGGAGUAGUGGAAAUCACCACAUGGGAA...
1752,AE000513.1,880781,880696,CUGCGGAAGGUGACUGGCGCGAAAACGUGGGAAGACCACGGGGAAG...
1753,CP000667.1,2571803,2571720,UGGAAACGAUGCGACUGGCGGCACCGAGGAUGGAACGACCAUCGGG...


In [3]:
# This is the step that queries NCBI. It takes a while. 
# It grabs the 200 nt after the Rfam aptamer entry, which should include the expression platform

seq_strings = []
for index, row in align_df.iterrows():
    if row['start'] < row['stop']:
        strand = 'forward'
        start = row['start']
        stop = row['stop'] + 200
    else:
        strand = 'reverse'
        start = row['stop'] - 200
        stop = row['start']
        
    handle = entrez.efetch(db="nucleotide", 
                           id=row['species'], 
                           rettype="gb", 
                           retmode="text", 
                           seq_start=start, 
                           seq_stop=stop)
    
    record = SeqIO.read(handle, "genbank")
    seq_strings.append(str(record.seq))
    print(index, "/", len(align_df))


0 / 1755
1 / 1755
2 / 1755
...
100 / 1755

In [4]:
# Cleaning up results and dumping to CSV

# If source is RNA sequence, convert 'U' to 'T'
def convert_U_to_T(sequence):
    return sequence.replace('U', 'T')

# If source is oriented in reverse, provide the reverse complement
def reverse_complement(sequence):
    complement_dict = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C', 'N': 'N', 'R': 'Y', 'Y': 'R'}
    return ''.join(complement_dict[base] for base in sequence[::-1])

# Write seq strings collected from NCBI to the df
align_df['locus'] = seq_strings

# Convert any 'U' to 'T' in the fetched sequences
align_df['locus'] = align_df['locus'].apply(convert_U_to_T)

# Replace seq with reverse complement if the output from NCBI is backwards
align_df['locus'] = align_df.apply(lambda row: reverse_complement(row['locus'])
                                               if row['start'] > row['stop'] else row['locus'], axis=1)

# Mark if sequence contains N values
align_df['has_N'] = align_df['locus'].str.contains('N')

# Write to CSV while filtering out seqs with Ns
align_df[align_df['has_N'] == False].to_csv('RF01750_output.csv', index=False)

#### Filtering out sequences that do not seem to have a polyU tract downstream
We scan the region downstream of the aptamer with a sliding window (8 nt, requiring at least 6 T's). The window stringency was tuned so that the filter neither keeps nor removes too many sequences.
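The windowing idea can be restated compactly. This is a simplified sketch, not the cell below: it only considers full windows that start at or after the end of the aptamer, and it returns the window start rather than the extracted sequence. The function name `find_polyU` is hypothetical; the defaults mirror the 8 nt window and 6 T minimum used in this notebook:

```python
def find_polyU(locus, aptamer_length, window_length=8, min_T_count=6):
    """Return the start index of the first T-rich window downstream of the
    aptamer, or None if no window reaches the threshold."""
    for start in range(aptamer_length, len(locus) - window_length + 1):
        window = locus[start:start + window_length]
        if window.count('T') >= min_T_count:
            return start
    return None

# Toy locus: 8 nt "aptamer", 4 nt linker, then a T-rich tract
locus = 'ACGTACGT' + 'GGCC' + 'TTTTGTTT' + 'AA'
print(find_polyU(locus, aptamer_length=8))
```

Raising `min_T_count` or shortening `window_length` makes the filter stricter, which is the stringency knob described above.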

In [5]:
seqs_df = pd.read_csv('RF01750_output.csv')

# Column index of 'locus' (aptamer plus 200 nt downstream) in RF01750_output.csv
SeqIndex = 4

# Iterate through the 'locus' column by position
for index, value in enumerate(seqs_df.iloc[:, SeqIndex]):
    aptamer = seqs_df['sequence'][index]
    # Length of the aptamer sequence
    aptamer_length = len(aptamer)

    # Length of the polyU tract length window
    window_length = 8

    # Minimum number of 'T's required
    min_T_count = 6

    # Initialize counter
    t_count = 0

    # Initialize variables to store extracted data
    polyU_found = False
    start_to_polyU = ""

    # Iterate through the DNA sequence, starting after the aptamer
    for i in range(aptamer_length, len(value)):
        if value[i] == 'T':
            t_count += 1

        if i - aptamer_length >= window_length:
            # If the window has moved past the desired length, remove the first base from the count
            if value[i - window_length] == 'T':
                t_count -= 1
        
        if t_count >= min_T_count and not polyU_found:
            start_position = i - window_length + 1
            t_rich_region = value[start_position:i+1]
            polyU_found = True
            start_to_polyU = value[:i+3]  # Extract from the start of the locus through the T-rich window plus 2 nt downstream
    

    # Update DataFrame columns
    seqs_df.at[index, 'polyU_found'] = polyU_found
    seqs_df.at[index, 'start_to_polyU'] = start_to_polyU

txal_seqs_df = seqs_df.loc[seqs_df['polyU_found'] == True]

print((len(txal_seqs_df)/len(seqs_df)))
txal_seqs_df.to_csv('RF01750_with_polyU.csv', index=False)

0.43896848137535815


### Identifying the spacer region in between P3 and the invader


This isolates the sequence extending from P3 through to the polyU tract identified above.
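As a small worked example of the P3 search, the sketch below applies the P3 consensus pattern used in this notebook to a toy sequence and keeps the start of the last match. It uses the standard-library `re` module rather than the third-party `regex` package imported above, and the function name `last_p3_start` is hypothetical:

```python
import re

# P3 consensus pattern as used in this notebook
PATTERN_P3 = r'G[CT][CT].{5,8}TGGG[CT]'

def last_p3_start(aptamer):
    """Return the start index of the last (3'-most) non-overlapping P3 match,
    or None if the pattern is not found."""
    start = None
    for match in re.finditer(PATTERN_P3, aptamer):
        start = match.start()
    return start

# Toy aptamer: 4 nt of padding, a 15 nt P3-like segment, 4 nt of padding
toy = 'AAAA' + 'GCCGACCGCCTGGGC' + 'ACTT'
print(last_p3_start(toy))          # index where the P3 match begins
print(last_p3_start('AAAAAAAA'))   # None: no P3 match
```

Returning `None` when nothing matches makes the no-match case explicit, so a caller cannot accidentally reuse a start position from a previous sequence.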

In [9]:
txal_seqs_df = pd.read_csv('RF01750_with_polyU.csv')

# Column index of 'start_to_polyU' in RF01750_with_polyU.csv
toPolyU = 7
noMatch = 0

# This regex pattern is the consensus sequence for P3 from Rfam
patternP3 = r'G[CT][CT].{5,8}TGGG[CT]'

for index, value in enumerate(txal_seqs_df.iloc[:, toPolyU]):
    aptamer = txal_seqs_df['sequence'][index]
    aptamer_length = len(aptamer)

    # Restrict the P3 search to the aptamer portion of the locus
    aptamer_real = value[:aptamer_length]

    # Keep the start of the last (3'-most) match; None if P3 is not found
    p3_start = None
    for match in re.finditer(patternP3, aptamer_real):
        p3_start = match.start()

    if p3_start is None:
        # Guard against reusing a stale start position from a previous row
        noMatch += 1
        P3toPolyU = ''
    else:
        P3toPolyU = value[p3_start:]

    txal_seqs_df.at[index, 'P3_to_polyU'] = P3toPolyU

print(f"{noMatch} of {len(txal_seqs_df)} have no match")

37 of 766 have no match


Checking secondary structure prediction to identify the terminator hairpin. P3+EP should form a single stem-loop structure.

In [12]:
P3toPolyU = 8

# Define a function to check for a single stem-loop
def has_single_stem_loop(dot_bracket):

    # This regex pattern identifies stem-loop patterns
    pattern = r'\(\.*\)'
    matches = re.findall(pattern, dot_bracket)
    
    # Check if there's only one stem-loop
    return len(matches) == 1

# Iterate through sequences in the DataFrame
for index, value in enumerate(txal_seqs_df.iloc[:, P3toPolyU]):
    term = False
    # Drop the last 7 nt (the end of the T-rich window and the 2 nt after it) before folding
    rna_sequence = value[:-7]
    
    if len(rna_sequence) > 0:
        # Create a ViennaRNA sequence object
        sequence = RNA.fold_compound(rna_sequence)

        # Predict the secondary structure
        (ss, mfe) = sequence.mfe()
        dot_bracket = ss

        # Check if the structure has a single stem-loop
        if has_single_stem_loop(dot_bracket):
            term = True
            print(value, dot_bracket)

        # Update the 'term' column in the DataFrame
        txal_seqs_df.loc[index, 'term'] = term

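# Take an explicit copy so that adding columns later does not trigger SettingWithCopyWarning
term_seqs_df = txal_seqs_df.loc[txal_seqs_df['term'] == True].copy()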

GCCGACCGCCTGGGCACTTAGTCCGGGCGGTCTGTATTTTTT ((.((((((((((((.....)))))))))))).))
GCCGACCGTCTGGGCAGCATTTGGTTATGACCGAATGCTGCCCTTTTGTCTTT ...........((((((((((((((....))))))))))))))...
GCCGACCGCCTGGGCAAAGATTAGTCCAGGGGGTCTTTTTTT ...((((.(((((((........))))))).))))
GCCGTTCGCCTGGGCATCTTGATCAATCTCTTCGAGACAGCCCATGACGATTGCCAAAGTTCTTTCTTT (.((.((((.(((((.((((((.........))))))..))))).).))).)).).......
GCCGACCGCCTGGGCAATCAGTGCAATGCTTTTGTCATTGTGTGATGGCCCGGGCGGTTATTTATTT ...((((((((((((.((((..(((((((....).)))))).)))).)))))))))))).
GCCGACCGTCTGGGCAGCATCATTTTCTGGTGCTGCTCTTTTTTT ...........(((((((((((.....)))))))))))
GTCGACCGTCTGGGCAGAGCGTGGCAGGGGAATGCCTCGCAAGGCGTCCAGGCGGTTTTTTAT ....(((((((((((...(((.((((......)))).))).....)))))))))))
GCCGACCGCCTGGGCAGTTTTATAAACTACCCGGGTTTTTTTA .......(((((((.(((((...))))).)))))))
GCCGGCCGTCTGGGCACACTTACTGCAGGCGGTAAGTGTGCCCTTTTTCATTT ...........((((((((((((((....))))))))))))))...
GCCGGCCGTCTGGGCACACTTACTGCAGGCGGTAAGTGTGCCCTTTTTCATTT ...........((((((((((((((....))))))))

GCCGACCGCCTGGGCAATCATCACAATGCTTTTGCCATTGTGTGATGGCCCGGGCGGTTATTTATTT ...((((((((((((.((((.((((((((....).))))))))))).)))))))))))).
GCCGACCGTCTGGGCAGATAGTTATAATTGCTATCTGTTCAGGCGGTTTTTTAT ....((((((((((((((((((.......))))))))))))))))))
GCCGACCGCCTGGGCGATAAGTCTGGGGGTCTATCGTTTTTTT ...((((.(((((((.....))))))))))).....
GTCGACCGCCTGGGCAATCAAGTTAAGGATTGCTCAGGCTTTTCTTTT .......((((((((((((........))))))))))))..
GCCGACCGCCTGGGCAAAATAAAATTTTGTCCGGGCGGTTTTGTTTT ...(((((((((((((((((...)))))))))))))))))
GCCGTTCGCCTGGGCAGCCACGATGGAAACGTGCGTTTCGGCTGCCCTGTCATTTTTTG ...........((((((((((((((....))).)))...)))))))).....
GCCGACCGTCTGGGCAACATTTGAGCTTCTCGAATGTTGTCCTTTTTTA ...........((((((((((((((...))))))))))))))
GCCGTACGCCTGGGCAGAGTATCGGTTCTCTTTTA (((((((...........))).))))..
GCCGGCCGCCTGGGCAACTTTTATTTAGTGTGCCCAGGCGGCTTATTTTTA ...((((((((((((((((.......))).))))))))))))).
GCCGACCGTCTGGGCAGCATTTTGGTTTTTAC (((.........)))..........
GCCGGCCGCCTGGGCAAATTGCGGAATTTGCTCAGGCGGCTTTTGCTTTT ...(((((((((((((((((....)))))

Finding putative invaders and isolating the spacer region between P3 and the invader.

- For some sequences there are multiple regions in an EP that match the invader regex pattern. To determine which is the genuine invader, manual validation was performed using NUPACK secondary structure prediction.
- For some sequences there are deletions or insertions that result in there being no match to the invader regex pattern (e.g. G108∆). This is resolved by manually inspecting secondary structure prediction to identify the spacer region.
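To make the spacer definition concrete, here is a hedged sketch that applies the invader pattern used in this notebook to the first P3-to-polyU sequence printed above. It uses the standard-library `re` module, and the function name `extract_spacer` and its parameters `p3_length` and `tail` are hypothetical names for the 15 nt and 8 nt offsets used in the cell below:

```python
import re

# Invader pattern as used in this notebook
PATTERN_INVADER = r'G[CT][CT][CT][GA]'

def extract_spacer(p3_to_polyU, p3_length=15, tail=8):
    """Return the sequence between the end of P3 and the first invader match,
    or None if no invader is found in the search region."""
    search_region = p3_to_polyU[p3_length:-tail]
    match = re.search(PATTERN_INVADER, search_region)
    if match is None:
        return None
    # The spacer stops just before the invader begins
    return search_region[:match.start()]

example = 'GCCGACCGCCTGGGCACTTAGTCCGGGCGGTCTGTATTTTTT'
print(extract_spacer(example))
```

Because only the first match is used, sequences with several candidate invaders still need the manual validation described above.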

In [13]:
P3toPolyU = 8
total = 0

# Initialize a list to store extracted sequences
extracted_sequences = []

for index, value in enumerate(term_seqs_df.iloc[:, P3toPolyU]):

    # The regex pattern here is the complement of the consensus 3' side of P3
    patternInvader = r'G[CT][CT][CT][GA]'
    # P3 is 15 nt long, so the search begins after P3 and stops 8 nt before the end
    search_region = value[15:-8]

    # Use the first (5'-most) match as the putative invader; None if there is no match
    match = re.search(patternInvader, search_region)
    # The spacer runs from the end of P3 up to, but not including, the invader
    extracted_sequence = search_region[:match.start()] if match else None

    extracted_sequences.append(extracted_sequence)  # Append extracted_sequence to the list
    total += 1

# Add the list of extracted sequences as a new column in the DataFrame
term_seqs_df['spacer'] = extracted_sequences
term_seqs_df.reset_index(drop=True, inplace=True)

term_seqs_df.to_csv('RF01750_term-spacers.csv', index=False)




### Manual analysis
The output of this pipeline was manually processed.

1. Sequences were removed if:
    - Secondary structure prediction showed ≥6 unpaired bases at the beginning of the P3-polyU region (indicating not all of the pseudoknot (PK) is involved in terminator formation)
    - Secondary structure prediction showed the presence of only the P3 stem
    - Secondary structure prediction showed ≥3 unpaired bases preceding the polyU tract, or large unpaired regions within the putative terminator
    - Duplicate sequences originated from the same organism
2. Spacer region predictions were manually verified for all sequences by performing NUPACK secondary structure prediction
3. Verified spacer regions were manually classified
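The two unpaired-base criteria above were applied by hand. As an illustration only, a pre-screen along the following lines could flag candidates for inspection. This is an assumption about how one might automate it, not the procedure that was used; the function names and thresholds-as-parameters are hypothetical. Note that the structures in this notebook are predicted on sequences with the last 7 nt removed, so trailing dots are counted relative to that trimmed end:

```python
def unpaired_ends(dot_bracket):
    """Count unpaired bases ('.') at the 5' and 3' ends of a dot-bracket string."""
    leading = len(dot_bracket) - len(dot_bracket.lstrip('.'))
    trailing = len(dot_bracket) - len(dot_bracket.rstrip('.'))
    return leading, trailing

def needs_review(dot_bracket, max_leading=5, max_trailing=2):
    """Flag structures with >=6 leading or >=3 trailing unpaired bases."""
    leading, trailing = unpaired_ends(dot_bracket)
    return leading > max_leading or trailing > max_trailing

print(unpaired_ends('...((((....))))..'))
print(needs_review('...........((((....))))...'))
```

A flag from this sketch would only mark a sequence for manual inspection; it does not replace the NUPACK-based verification.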
