Author: Thomas Lane

Purpose: This notebook generates a random sequence with motifs at a specified frequency.

Use: - Change the path passed to motif_reader to the folder that holds the motif .pfm files.
     - Specify the frequency at which motifs should be inserted.
     - Specify the length of the sequence you want.
     - If you set save=True, the generated sequences are saved to the file "my_example.faa".
     - The .pfm files had to be modified by deleting the letters and symbols before the numbers.
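
The manual clean-up of the .pfm files described in the last point could also be scripted. The sketch below is only an illustration: it assumes the downloaded files use a JASPAR-style layout (a ">" header line, then one row per base with the counts in square brackets), and strip_pfm_labels is a hypothetical helper name, not part of this notebook or of Biopython.

```python
import re

def strip_pfm_labels(text):
    """Keep only the numbers of each matrix row (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        if line.startswith(">"):
            continue  # skip the header line, e.g. ">MA0000.1 example"
        # collect the integer or decimal counts, dropping the base letter and brackets
        numbers = re.findall(r"\d+(?:\.\d+)?", line)
        if numbers:
            rows.append(" ".join(numbers))
    return "\n".join(rows)

# Made-up matrix in the assumed layout (not a real JASPAR entry)
example = """>MA0000.1 example
A  [ 3  0  1 ]
C  [ 0  4  1 ]
G  [ 1  0  1 ]
T  [ 0  0  1 ]"""

print(strip_pfm_labels(example))
```

If the files on disk are laid out differently, the regular expression and the header check would need to be adjusted.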


In [84]:
import os
import random

from Bio import SeqIO, motifs
from Bio.Alphabet import IUPAC  # requires Biopython < 1.78; Bio.Alphabet was removed in 1.78
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

motif_reader looks in the folder that contains the motif .pfm files (downloaded from JASPAR), reads each file, and returns the consensus sequence of every motif as a list of strings.

In [85]:
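def motif_reader(path_name):
    """Read every .pfm file in path_name and return each motif's consensus as a string."""
    motif_list = []
    # sorted() makes the order reproducible; os.listdir returns files in arbitrary order
    for filename in sorted(os.listdir(path_name)):
        print(filename)
        # os.path.join works whether or not path_name ends with a separator
        with open(os.path.join(path_name, filename)) as handle:
            word = motifs.read(handle, "pfm")  # the with block closes the file
        motif = str(word.consensus)
        print(motif)
        motif_list.append(motif)

    return motif_list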
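
To make the idea of a consensus sequence concrete, here is a minimal sketch that picks the most frequent base at each position of a count matrix. It is not Biopython's implementation (tie handling in particular may differ), and consensus_from_counts and the toy matrix are made up for illustration.

```python
def consensus_from_counts(counts):
    """Return the most frequent base at each position.

    counts maps each base to a list of counts, one entry per position.
    Ties go to whichever base comes first in the dict (an arbitrary choice).
    """
    n_positions = len(next(iter(counts.values())))
    consensus = []
    for pos in range(n_positions):
        best_base = max(counts, key=lambda base: counts[base][pos])
        consensus.append(best_base)
    return "".join(consensus)

# Toy count matrix with four positions (not taken from a real motif)
toy_counts = {
    "A": [0, 9, 9, 0],
    "C": [0, 0, 0, 8],
    "G": [1, 0, 1, 1],
    "T": [9, 1, 0, 1],
}

print(consensus_from_counts(toy_counts))  # TAAC
```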

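sequence_generator creates 24 random sequences (one per species) and accepts the following options:

- motif_frequency: (float, 0 <= x <= 1) the probability, at each step, that the function inserts a randomly chosen motif instead of a single base
- length: (positive integer) the number of generation steps; because each inserted motif adds several bases, the final sequence can be longer than this
- save: (bool) whether to write the generated sequences to a FASTA file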
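
As a rough guide to what these options produce, the sketch below works out the expected number of inserted motifs and the expected final sequence length. It assumes every step is independent and that each motif is equally likely to be chosen; the two function names are hypothetical, and the motif lengths are those of the four consensus strings printed in this notebook.

```python
def expected_insertions(motif_frequency, length):
    """Expected number of motif insertions over `length` independent steps."""
    return motif_frequency * length

def expected_sequence_length(motif_frequency, length, motif_lengths):
    """Expected final length: each step adds a motif or a single base."""
    mean_motif_length = sum(motif_lengths) / len(motif_lengths)
    return length * ((1 - motif_frequency) + motif_frequency * mean_motif_length)

# Lengths of GCATAAAAAA, TAATCC, GGCCATAAAA and ATTACGTAAT
motif_lengths = [10, 6, 10, 10]

# About 3.6 insertions and about 1228.8 bases for frequency 0.003 and 1200 steps
print(expected_insertions(0.003, 1200))
print(expected_sequence_length(0.003, 1200, motif_lengths))
```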

In [86]:
def sequence_generator(motif_frequency=.002, length=1000, save=False):
    # named motif_list so the Bio.motifs module is not shadowed
    motif_list = motif_reader("C:/Users/Lanes/ResearchLab/team_neural_network/data/input/motif_pfm/")
    bases = ['A', 'G', 'C', 'T']
    records = []
    sequences = []

    # create 24 sequences to mirror our 24 species
    for _ in range(24):
        generated_sequence = ""

        # run `length` steps; each step appends either a whole motif or a single base
        for _ in range(length):
            # insert a random motif with probability motif_frequency
            if random.random() < motif_frequency:
                generated_sequence += random.choice(motif_list)
            else:
                generated_sequence += random.choice(bases)

        # wrap the sequence in a SeqRecord and randomly assign it a positive or negative enhancer label (0 or 1)
        seq = Seq(generated_sequence, IUPAC.unambiguous_dna)
        enhancer = str(random.randint(0, 1))
        output = SeqRecord(seq, id="000000|" + enhancer, name="random", description="This is a randomly generated sequence")
        records.append(output)
        sequences.append(generated_sequence)

    # save the sequences to a FASTA file
    if save:
        SeqIO.write(records, "my_example.faa", "fasta")

    return sequences
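
For testing without the .pfm folder, the core loop can be tried on its own. The following is a sketch of a variant, not the function above: generate_with_motifs is a hypothetical name, it takes the motif list as an argument, and it adds a seed parameter (an assumption, not in the original) so that runs can be repeated.

```python
import random

def generate_with_motifs(motif_list, motif_frequency=0.002, length=1000, seed=None):
    """Build one sequence; each step appends a whole motif or a single base."""
    rng = random.Random(seed)  # private generator, so a seed gives repeatable output
    parts = []
    for _ in range(length):
        if rng.random() < motif_frequency:
            parts.append(rng.choice(motif_list))
        else:
            parts.append(rng.choice("AGCT"))
    return "".join(parts)

example_motifs = ["TAATCC", "ATTACGTAAT"]
first = generate_with_motifs(example_motifs, motif_frequency=0.003, length=1200, seed=42)
second = generate_with_motifs(example_motifs, motif_frequency=0.003, length=1200, seed=42)

print(first == second)     # the same seed gives the same sequence
print(len(first) >= 1200)  # inserted motifs can only make the sequence longer
```

Collecting the pieces in a list and joining them once avoids repeated string concatenation, which matters little at this size but scales better for long sequences.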

In [87]:
i = sequence_generator(motif_frequency=0.003, save=False, length=1200)

MA0049.1.pfm
GCATAAAAAA
MA0212.1.pfm
TAATCC
MA0216.2.pfm
GGCCATAAAA
MA0447.1.pfm
ATTACGTAAT


This cell counts how many times the first motif appears in each sequence. It is a quick check that the function worked.

In [88]:
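# use a new name: assigning to `motifs` would overwrite the Bio.motifs module that motif_reader needs
motif_list = motif_reader("C:/Users/Lanes/ResearchLab/team_neural_network/data/input/motif_pfm/")
for k in i:
    counter = k.count(motif_list[0])
    print(counter)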

MA0049.1.pfm
GCATAAAAAA
MA0212.1.pfm
TAATCC
MA0216.2.pfm
GGCCATAAAA
MA0447.1.pfm
ATTACGTAAT
1
1
0
1
3
3
1
2
0
2
3
0
2
1
0
1
0
2
1
1
1
0
3
1
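
The same check can be extended to every motif at once. The sketch below uses a hypothetical helper, count_motifs, and toy sequences rather than the generated ones. Two caveats apply to any count made with str.count: it counts non-overlapping occurrences only, and it cannot tell an inserted motif from one that arose by chance in the random background.

```python
def count_motifs(sequences, motif_list):
    """Return one dict per sequence mapping each motif to its occurrence count."""
    return [{motif: seq.count(motif) for motif in motif_list} for seq in sequences]

# Toy data for illustration only
toy_sequences = ["AATAATCCGGTAATCC", "ACGTACGT"]
toy_motifs = ["TAATCC", "ACGT"]

for counts in count_motifs(toy_sequences, toy_motifs):
    print(counts)
```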
