Author: Thomas Lane

Purpose: This notebook generates a random sequence containing motifs at specified frequencies and uses SiteOut to make sure
    the linking sequences between them are motif-free.

Use:
 - Change the path passed to motif_reader to the folder that holds the motif .pfm files.
 - Specify the frequency of each motif to insert, as a list.
 - Specify the length of the sequence you want.
 - The .pfm files must be modified first: delete the letters and symbols that precede the numbers on each row.

Note: siteout.py is written in Python 2, so Python 2 must be installed to run this notebook. Change the command in
siteout_call() to the path of your Python 2 interpreter.

In [1]:
import os
import random

from Bio import SeqIO, motifs
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


motif_reader looks in the folder that contains the .pfm files for the motifs (downloaded from JASPAR), reads each file, and saves each motif's consensus sequence in a list.

In [2]:
def motif_reader(path_name):
    motif_list = []
    for filename in os.listdir(path_name):
        print(filename)
        # os.path.join works whether or not path_name ends with a separator
        with open(os.path.join(path_name, filename)) as handle:
            word = motifs.read(handle, "pfm")
        # Keep only the consensus sequence of each motif
        motif = str(word.consensus)
        print(motif)
        motif_list.append(motif)

    return motif_list

motifs_test = motif_reader("/mnt/c/users/Jesse Woo/Documents/GitHub/team_neural_network/code/utility/PWM_generator/motif_pfm/")
motifs_test

FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/Lanes/ResearchLab/team_neural_network/data/input/motif_pfm/'
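
To make the consensus step concrete, here is a minimal sketch in plain Python that does not depend on Biopython. It assumes a stripped .pfm file holds four rows of counts in A, C, G, T order, and the names parse_pfm and consensus as well as the counts themselves are made up for illustration. It is a simplified stand-in: Biopython's own consensus may break ties differently.

```python
def parse_pfm(text):
    """Parse four whitespace-separated rows of counts (A, C, G, T order is assumed)."""
    rows = [[float(x) for x in line.split()] for line in text.strip().splitlines()]
    if len(rows) != 4 or len({len(r) for r in rows}) != 1:
        raise ValueError("expected 4 rows of equal length")
    return dict(zip("ACGT", rows))


def consensus(counts):
    """Pick the most frequent base in every column."""
    width = len(counts["A"])
    # max() keeps the first candidate on ties, so ties resolve in A, C, G, T order here.
    return "".join(max("ACGT", key=lambda base: counts[base][i]) for i in range(width))


# Made-up counts for illustration only, not taken from a real JASPAR file.
example_pfm = """
0 10 10  0 1 0
2  0  0  0 8 9
1  0  0  0 1 1
7  0  0 10 0 0
"""

example_consensus = consensus(parse_pfm(example_pfm))
print(example_consensus)
```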

Creates a shuffled list of motifs in which each motif appears round(probability * length) times.

In [87]:
def motif_dist(motifs, probabilities, length):
    dist = []
    for motif, probability in zip(motifs, probabilities):
        # Each motif appears round(probability * length) times
        repeats = round(probability * length)
        dist.extend([motif] * repeats)

    random.shuffle(dist)
    return dist

motif_dist(motifs_test,[.001,.003,.01,.007], 1000)

['ATTACGTAAT',
 'GGCCATAAAA',
 'ATTACGTAAT',
 'ATTACGTAAT',
 'TAATCC',
 'ATTACGTAAT',
 'ATTACGTAAT',
 'TAATCC',
 'GGCCATAAAA',
 'GCATAAAAAA',
 'GGCCATAAAA',
 'GGCCATAAAA',
 'GGCCATAAAA',
 'GGCCATAAAA',
 'GGCCATAAAA',
 'GGCCATAAAA',
 'GGCCATAAAA',
 'TAATCC',
 'GGCCATAAAA',
 'ATTACGTAAT',
 'ATTACGTAAT']
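
The number of copies per motif can be checked by hand. The sketch below repeats the round(probability * length) arithmetic on its own; expected_counts is a hypothetical helper, not part of the notebook. It also shows an edge case worth knowing about: a frequency small enough that probability * length rounds to zero yields no copies at all.

```python
def expected_counts(probabilities, length):
    """Number of copies of each motif, using the same rounding as motif_dist."""
    return [round(p * length) for p in probabilities]


counts = expected_counts([.001, .003, .01, .007], 1000)
print(counts, sum(counts))

# A motif whose probability * length is below 0.5 is never inserted.
too_rare = expected_counts([.0004], 1000)
print(too_rare)
```

For the call above this gives 1, 3, 10 and 7 copies, 21 motifs in total, which matches the length of the list in the output.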

Writes the Sequences.txt file used by siteout.py. Each line pairs one spacer length from dist with the next motif popped from the list, or with a single 'A' once the motifs run out.

In [88]:
def write_seq_txt(dist, motifs):
    # Each line has the form "<motif>,<spacer length>"
    with open("Sequences.txt", "w") as seq_txt:
        for i in dist:
            try:
                mot = str(motifs.pop())
            except IndexError:
                # No motifs left: write a single 'A' in place of a motif
                mot = 'A'
            seq_txt.write(mot + "," + str(i) + "\n")
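
To see what ends up in Sequences.txt without touching the disk, the same pairing logic can be run in memory. This is only a sketch: sequence_lines is a hypothetical helper, and the inputs are a shortened, made-up example.

```python
def sequence_lines(spacer_lengths, motif_queue):
    """Build the lines of Sequences.txt as strings instead of writing them."""
    motif_queue = list(motif_queue)  # copy, so the caller's list is left untouched
    lines = []
    for n in spacer_lengths:
        # pop() takes motifs from the end of the list; 'A' is used once it is empty
        motif = motif_queue.pop() if motif_queue else "A"
        lines.append(f"{motif},{n}")
    return lines


lines = sequence_lines([53, 33, 57], ["TAATCC", "GGCCATAAAA"])
print("\n".join(lines))
```

Because there is one more spacer than there are motifs in this example, the last line falls back to 'A'.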

Writes the motifs.txt file used by siteout.py

In [89]:
def write_motif_txt(motifs):
    # One FASTA-style record per motif: a "> motifN" header, then the sequence
    with open("motifs.txt", "w") as mot_txt:
        for i, motif in enumerate(motifs, start=1):
            mot_txt.write("> motif" + str(i) + "\n")
            mot_txt.write(motif + "\n")
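
The resulting file layout is easiest to see on a small input. The sketch below builds the same text as a string; motif_records is a hypothetical name used only for this illustration.

```python
def motif_records(motif_list):
    """Return the text of motifs.txt: a '> motifN' header followed by the motif."""
    out = []
    for i, motif in enumerate(motif_list, start=1):
        out.append(f"> motif{i}")
        out.append(motif)
    return "\n".join(out) + "\n"


records = motif_records(["TAATCC", "GGCCATAAAA"])
print(records, end="")
```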

Uses Python 2 to run siteout.py.

Change the first argument to the location of your Python 2 python.exe file.

Estrada J, Ruiz-Herrero T, Scholes C, Wunderlich Z, DePace AH (2016)
SiteOut: An Online Tool to Design Binding Site-Free DNA Sequences.
PLoS ONE 11(3): e0151740. https://doi.org/10.1371/journal.pone.0151740

In [90]:
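import subprocess
def siteout_call(GC):
    # An argument list avoids shell quoting problems; check=True raises if siteout.py fails
    command = ["C:/Python27/python.exe", "siteout.py", ".05", ".5", GC, "Sequences.txt", "motifs.txt"]
    subprocess.run(command, check=True)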
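
If the interpreter path differs between machines, it can help to build the command separately from running it. The following is a sketch under that assumption: build_siteout_command and its python2 parameter are hypothetical, and the fixed arguments are copied unchanged from the call above.

```python
def build_siteout_command(gc_percent, python2="C:/Python27/python.exe"):
    """Return the argument list for siteout.py without running anything."""
    # The .05 and .5 arguments are copied as-is from the notebook's call.
    return [python2, "siteout.py", ".05", ".5", str(gc_percent),
            "Sequences.txt", "motifs.txt"]


cmd = build_siteout_command(50.0)
print(cmd)
# To run it: subprocess.run(cmd, check=True)
```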

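Creates a list of random spacer lengths that add up (approximately, because of rounding) to the amount of neutral sequence
needed to reach the desired length with the desired number of motifs. 8 bases are subtracted per division before the remainder is split.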


In [91]:
def distribution(length, probability):
    # One more spacer than the expected number of motifs
    divisions = (length * probability) + 1
    # Subtract 8 bases per division before splitting what remains
    length = length - (divisions * 8)
    rand_col = [random.random() for _ in range(int(divisions))]

    # Scale the random weights so they sum to the remaining length
    total = sum(rand_col)
    return [round((j / total) * length) for j in rand_col]

distribution(1000,sum([.001,.003,.01,.007]))

[53,
 33,
 57,
 70,
 63,
 31,
 20,
 43,
 66,
 60,
 3,
 66,
 52,
 20,
 13,
 4,
 71,
 51,
 15,
 6,
 16,
 10]
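
Because each spacer is rounded on its own, the lengths do not have to sum exactly to the target. The sketch below isolates the normalize-and-round step to show how large the drift can get; spacer_lengths is a hypothetical helper, and the values 824 and 22 correspond to the call above (1000 - 22 * 8 bases split into 22 parts).

```python
import random


def spacer_lengths(total, parts, rng=random):
    """Split `total` into `parts` random lengths by normalizing random weights."""
    weights = [rng.random() for _ in range(parts)]
    scale = total / sum(weights)
    return [round(w * scale) for w in weights]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
lengths = spacer_lengths(824, 22, rng)
drift = sum(lengths) - 824
print(len(lengths), sum(lengths), drift)
```

Each rounding step moves a value by at most 0.5, so the total can differ from the target by at most half the number of parts. The output above, for instance, sums to 823 rather than 824.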

This function builds one neutral sequence using all of the previous functions.

Use:

    - length: integer, e.g. 10000
    - probabilities: a list of floats, one per motif in the motif folder, e.g. [.001,.003,.004,.0008]
        For best results, keep the sum of the probabilities below .1
    - GC_percent: float (must be a float, not an integer), e.g. 50.0

In [92]:
def main(length, probabilities, GC_percent=50.0):

    # motifs_test is the list returned by motif_reader earlier in the notebook
    motifs = motifs_test
    motif_list = motif_dist(motifs, probabilities, length)
    numbers = distribution(length, sum(probabilities))
    write_seq_txt(numbers, motif_list)
    write_motif_txt(motifs)
    siteout_call(str(GC_percent))
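
The usage notes for main can be turned into a quick check that runs before the slow SiteOut step. This is a sketch only: check_inputs is a hypothetical helper, and n_motifs stands for the number of motifs read from the motif folder.

```python
def check_inputs(length, probabilities, n_motifs, gc_percent):
    """Return a list of problems with the arguments; an empty list means they look fine."""
    problems = []
    if not isinstance(length, int) or length <= 0:
        problems.append("length should be a positive integer")
    if len(probabilities) != n_motifs:
        problems.append("need one probability per motif")
    if sum(probabilities) >= 0.1:
        problems.append("sum of probabilities should be below .1")
    if not isinstance(gc_percent, float):
        problems.append("GC_percent should be a float, e.g. 50.0")
    return problems


ok = check_inputs(10000, [.001, .003, .004, .0008], 4, 50.0)
bad = check_inputs(10000, [.05, .06], 2, 50)
print(ok)
print(bad)
```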

This brings together all of the functions and creates a FASTA file containing 24 randomly generated sequences, each tagged
with a random enhancer label (0 or 1) in its record id.

In [93]:
def sequence_generator(length, motif_frequencies):

    write = []
    for i in range(0, 24):
        # main() runs siteout.py; its result is then read back from neutralseq.fa
        main(length, motif_frequencies)
        generated_sequence = SeqIO.read("neutralseq.fa", "fasta").seq

        # Random 0/1 enhancer label, stored in the record id as "<index>|<label>"
        enhancer = str(random.randint(0, 1))
        output = SeqRecord(generated_sequence, id=str(i)+'|'+enhancer, name="random", description="This is a randomly generated sequence")
        write.append(output)

    SeqIO.write(write, "my_example.fa", "fasta")

In [94]:
# Be patient, this may take a while (approx. 30 sec)
sequence_generator(10000,[.001,.005,.002,.005])
print("result saved to my_example.fa")

result saved to my_example.fa
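
Downstream code will usually need the enhancer label back out of each record id. The sketch below parses FASTA headers with plain string handling, assuming the header has the form ">index|label description" as produced by the record ids above. parse_labels and the sample text are hypothetical and the sequences are placeholders.

```python
def parse_labels(fasta_text):
    """Map each sequence index to its 0/1 enhancer label, read from the FASTA headers."""
    labels = {}
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            record_id = line[1:].split()[0]      # text before the first space
            index, enhancer = record_id.split("|")
            labels[int(index)] = int(enhancer)
    return labels


# Placeholder records for illustration only.
sample = (
    ">0|1 This is a randomly generated sequence\n"
    "ACGT\n"
    ">1|0 This is a randomly generated sequence\n"
    "TTGA\n"
)

labels = parse_labels(sample)
print(labels)
```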
