# Motif Detection

Once we have obtained our DNA sequences, we need to identify the motifs that are most likely to be transcription factor binding sites. Generating every possible candidate motif of $n$ nucleotides is easy (there are $4^n$ of them), but not all of them are going to be biologically significant or pertinent to our search. For example, motifs made up of repeating 'T's and 'A's most likely correspond to the TATA box, a known transcription factor binding site that is not specific to any one gene or family of genes.

Therefore, we must eliminate motifs from the generated set based on the following criteria:
- size
- complexity (homopolymers, heteropolymers)
- presence of a "TA"-rich sub-motif
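
The third criterion has no dedicated function in the cells that follow. A minimal sketch of one possible filter is given here, assuming that "TA-rich" means a run of at least `min_len` consecutive positions containing only 'A' or 'T'. The names `has_ta_rich_run` and `remove_ta_rich`, and the default threshold of 4, are hypothetical and not part of the original notebook.

```python
def has_ta_rich_run(motif: str, min_len: int) -> bool:
    """Return True if the motif contains at least min_len consecutive A/T bases."""
    run = 0  # length of the current uninterrupted A/T run
    for base in motif:
        run = run + 1 if base in "AT" else 0
        if run >= min_len:
            return True
    return False


def remove_ta_rich(motifs, min_len: int = 4) -> set:
    """Keep only the motifs that have no A/T run of length min_len or more (assumed definition)."""
    return {m for m in motifs if not has_ta_rich_run(m, min_len)}


kept = remove_ta_rich({"TATAAG", "GCGTAC", "ACGTTT"}, min_len=4)
print(sorted(kept))
```

Under this assumed definition, "TATAAG" is dropped because it starts with five consecutive A/T bases, while the other two motifs are kept.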

In [39]:
# import
from itertools import product, combinations
from collections import Counter

# initializations
nucleotides = {'A','C','G','T'}

## Candidate Elimination Functions

In [23]:
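def generate_motifs(n:int) : # n : length in nucleotides of the motifs to be generated
    # product() yields tuples such as ('A','C'); join them into strings so they can be sliced and compared with sequence windows
    return {''.join(p) for p in product(nucleotides, repeat=n)}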

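The size of the candidate set grows as $4^n$, which is the reason the elimination steps matter. The following self-contained sketch illustrates this growth; `all_kmers` and `NUCLEOTIDES` are hypothetical names used only for this illustration.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def all_kmers(n: int) -> set:
    """Return every string of length n over the DNA alphabet."""
    return {"".join(p) for p in product(NUCLEOTIDES, repeat=n)}


# The number of candidates is 4**n, so it grows quickly with the motif length.
for n in (2, 4, 6):
    print(n, len(all_kmers(n)))
```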

In [54]:
def remove_homo(motifs, limit) : # limit : a motif is eliminated as soon as one nucleotide appears {limit} or more times in it
    toRem = set() # set of motifs to be eliminated
    
    for motif in motifs :
        for _,freq in Counter(motif).items() : # total count of each nucleotide in the motif (occurrences need not be consecutive)
            if freq >= limit : toRem.add(motif)
    return set(motifs)-toRem
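
`remove_homo` counts the total occurrences of each nucleotide, so a motif such as "ACACA" is eliminated with `limit=3` even though it contains no repeated run. If the intent is to reject only consecutive runs, an alternative can be sketched with `itertools.groupby`. The names `longest_run` and `remove_long_runs` are hypothetical, and this is a different criterion from the one above, not a replacement for it.

```python
from itertools import groupby


def longest_run(motif: str) -> int:
    """Length of the longest stretch of one repeated nucleotide (0 for an empty motif)."""
    return max((sum(1 for _ in group) for _, group in groupby(motif)), default=0)


def remove_long_runs(motifs, limit: int) -> set:
    """Keep the motifs whose longest single-nucleotide run is shorter than limit."""
    return {m for m in motifs if longest_run(m) < limit}


print(remove_long_runs({"AAAAC", "ACACA"}, 3))
```

Here "AAAAC" is removed because of its run of four 'A's, while "ACACA" is kept.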

In [37]:
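def remove_hetero(motifs, limit) : # limit : a motif is eliminated as soon as one dinucleotide appears {limit} or more times in it
    toRem = set()

    for motif in motifs :
        motif_to_dinuc = list()
        for i in range(len(motif)-1) : # a motif of length n contains n-1 overlapping dinucleotides
            motif_to_dinuc.append(motif[i:i+2])
        for _,freq in Counter(motif_to_dinuc).items() :
            if freq >= limit : toRem.add(motif)

    return set(motifs)-toRem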


In [None]:
dinuc = {a + b for a,b in product(nucleotides, repeat=2) if a != b} # set of dinucleotides whose two nucleotides differ; identical pairs are handled by {remove_homo}

In [53]:
motif = "AGTGCGTCAGTC"

motif_to_dinuc = list()
for i in range(len(motif)-1) : # len(motif)-1 windows, so that the last dinucleotide is included
    motif_to_dinuc.append(motif[i:i+2])

print(motif_to_dinuc)


['AG', 'GT', 'TG', 'GC', 'CG', 'GT', 'TC', 'CA', 'AG', 'GT', 'TC']


## Algorithms

In [1]:
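def hashtable(sequences:list, motifs) : # motifs : candidate motifs, all of the same length (list or set)
    candidates = set(motifs) # set gives constant-time membership tests
    n = len(next(iter(candidates))) # length of the candidate motifs
    motifs_freqs = dict()

    for seq in sequences :
        for i in range(len(seq)-n+1) :
            aux = seq[i:i+n] # window of length n starting at position i
            if aux not in candidates : continue # ignore windows that are not candidate motifs
            if aux in motifs_freqs : motifs_freqs[aux] += 1
            else : motifs_freqs[aux] = 1
    
    return motifs_freqs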
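
The same sliding-window count can also be sketched with `collections.Counter`, which removes the need for the explicit `if`/`else` on missing keys. The function name `count_candidates` and the example sequences are hypothetical. The sketch assumes a non-empty collection of candidates that all share one length, and it counts only windows that belong to that collection.

```python
from collections import Counter


def count_candidates(sequences, candidates) -> Counter:
    """Count how often each candidate motif occurs across all sequences."""
    candidates = set(candidates)
    n = len(next(iter(candidates)))  # assumes every candidate has the same length
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            window = seq[i:i + n]
            if window in candidates:
                counts[window] += 1
    return counts


sequences = ["ACGTACGT", "TTACGA"]  # toy sequences, for illustration only
counts = count_candidates(sequences, {"ACG", "CGT", "GGG"})
print(counts.most_common())
```

A candidate that never occurs (here "GGG") is simply absent from the result, and looking it up in the `Counter` returns 0.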