# Motif Detection

Once we've obtained our DNA sequences, we need to identify the motifs most likely to be transcription factor binding sites. Technically, it is very easy to generate all $4^n$ possible candidate motifs of $n$ nucleotides, but we know that not all of them are going to be biologically significant or pertinent to our search. For example, motifs made up of multiple repeating 'T's and 'A's are most likely the TATA box, a known transcription factor binding site that is not specific to any one gene or family of genes.

Therefore, we must eliminate motifs from the generated set based on the following criteria:
- size
- complexity (homopolymers, heteropolymers) 
- presence of "TA"-rich sub-motif

In [39]:
# import
from itertools import product, combinations
from collections import Counter

# initializations
nucleotides = {'A','C','G','T'}

## Candidate Elimination Functions

In [23]:
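# to generate candidate motifs
def generate_motifs(n:int) : # n : length in nucleotides of the motifs to be generated
    # product() yields tuples; join them into strings so they can be compared with slices of the sequences
    return {''.join(p) for p in product(nucleotides, repeat=n)}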
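
As a quick sanity check on the size of the candidate space, the sketch below enumerates all motifs of a given length and confirms that there are $4^n$ of them. The names `NUCLEOTIDES` and `all_kmers` are hypothetical and only used for this illustration.

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # assumed alphabet: the four DNA bases

def all_kmers(n):
    """Return every length-n string over the DNA alphabet (hypothetical helper)."""
    return {"".join(p) for p in product(NUCLEOTIDES, repeat=n)}

kmers = all_kmers(3)
print(len(kmers))         # 64, i.e. 4**3
print(sorted(kmers)[:4])  # ['AAA', 'AAC', 'AAG', 'AAT']
```

The count grows exponentially with `n`, which is one reason to prune uninformative candidates before scanning the sequences.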


In [54]:
# to eliminate motifs in which a single nucleotide appears `limit` times or more
def remove_homo(motifs, limit) : # limit : number of occurrences of one nucleotide from which a motif is rejected
    toRem = set() # set of motifs to be eliminated
    
    for motif in motifs :
        # count the frequency of each nucleotide present in the motif
        if any(freq >= limit for freq in Counter(motif).values()) :
            toRem.add(motif)
    return set(motifs)-toRem
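
To make the effect of the threshold concrete, here is a minimal, self-contained sketch of the same idea on a handful of hand-picked motifs. The helper name `drop_homopolymer_rich` is hypothetical, and it assumes, as in the function above, that a motif is rejected as soon as one nucleotide occurs `limit` times or more.

```python
from collections import Counter

def drop_homopolymer_rich(motifs, limit):
    """Keep motifs in which no single nucleotide occurs `limit` times or more.

    Hypothetical helper written for this example only.
    """
    return {m for m in motifs
            if max(Counter(m).values(), default=0) < limit}

candidates = {"AAAAC", "ACGTA", "GGGGG", "ACGTC"}
kept = drop_homopolymer_rich(candidates, 3)
print(sorted(kept))  # ['ACGTA', 'ACGTC']
```

'AAAAC' and 'GGGGG' are dropped because A (4 times) and G (5 times) reach the limit of 3, while the other two motifs contain each nucleotide at most twice.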

In [62]:
# to eliminate motifs in which a dinucleotide made of two different nucleotides appears `limit` times or more
def remove_hetero(motifs, limit) : 
    toRem = set()

    for motif in motifs :
        # overlapping dinucleotides of the motif, including the last one (start positions 0..len(motif)-2)
        motif_to_dinuc = [motif[i:i+2] for i in range(len(motif)-1)]
        for dinuc,freq in Counter(motif_to_dinuc).items() :
            (a,b) = dinuc
            # identical pairs such as 'AA' are left to remove_homo
            if a != b and freq >= limit : toRem.add(motif)

    return set(motifs)-toRem


In [63]:
mine = remove_hetero(['GGTTTGG', 'TGAGTTA', 'TGCCGTG', 'AGAGAGA', 'TCACCGA', 'TTGGTAT', 'AGGGTGG', 'TGGCTTA', 'AGAGTAG', 'GCCCCTC'], 3)

print(mine)

ref = ['GGTTTGG', 'TGAGTTA', 'TGCCGTG', 'AGAGAGA', 'TCACCGA', 'TTGGTAT', 'TGGCTTA', 'AGAGTAG']
[x in mine for x in ref], [x in ref for x in mine]

{'AGGGTGG', 'TGGCTTA', 'GGTTTGG', 'TGAGTTA', 'TGCCGTG', 'GCCCCTC', 'TCACCGA', 'TTGGTAT'}


([True, True, True, False, True, True, True, False],
 [False, True, True, True, True, False, True, True])

In [None]:
dinuc = {a + b for a,b in product(nucleotides, repeat=2) if a != b} # set of the 12 dinucleotides made of two different nucleotides; identical pairs are handled by remove_homo
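
One possible use of such a set is sketched below: counting, with a sliding window of width 2, how often each dinucleotide of two different nucleotides occurs in a motif. The names `HETERO_DINUCS` and `hetero_dinuc_counts` are hypothetical, and counting overlapping occurrences is an assumption of this sketch.

```python
from collections import Counter
from itertools import product

# The 12 dinucleotides whose two nucleotides differ (hypothetical constant).
HETERO_DINUCS = {a + b for a, b in product("ACGT", repeat=2) if a != b}

def hetero_dinuc_counts(motif):
    """Count overlapping occurrences of each hetero-dinucleotide in `motif`."""
    windows = (motif[i:i + 2] for i in range(len(motif) - 1))
    return Counter(w for w in windows if w in HETERO_DINUCS)

print(len(HETERO_DINUCS))                   # 12
print(hetero_dinuc_counts("AGAGTAG")["AG"])  # 3
print(hetero_dinuc_counts("GGTTTGG")["GG"])  # 0, identical pairs are ignored
```

The windows of 'AGAGTAG' are AG, GA, AG, GT, TA, AG, so 'AG' is counted three times.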

In [79]:
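# to eliminate "TA"-rich motifs : motifs that contain both A and T and of which A and T make up more than a certain proportion
def remove_ta(motifs, limit) : # limit is a proportion (0 < limit < 1)
    result = set()
    
    for motif in motifs :
        cpt = 0 # number of A's and T's in the motif
        a,t = False, False # whether A, respectively T, occurs in the motif
        for x in motif :
            if x == 'A' :
                cpt += 1
                a = True
            elif x == 'T' :
                cpt += 1
                t = True
        # keep the motif unless it exceeds the A/T proportion limit and contains both A and T
        if not (cpt/len(motif) > limit and a and t) : 
            result.add(motif)

    return result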
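
The following sketch illustrates the intended filter on a few hand-picked motifs. It assumes that a motif counts as "TA"-rich only when it contains both A and T and their combined proportion exceeds `limit`; the helper names `at_fraction` and `drop_ta_rich` are hypothetical.

```python
def at_fraction(motif):
    """Proportion of positions in `motif` that are A or T."""
    return sum(base in "AT" for base in motif) / len(motif)

def drop_ta_rich(motifs, limit):
    """Drop motifs that contain both A and T and whose A/T proportion exceeds `limit`."""
    return {m for m in motifs
            if not (at_fraction(m) > limit and "A" in m and "T" in m)}

candidates = {"TATAAA", "GCGCTA", "AAAAAA", "GGCCGG"}
print(sorted(drop_ta_rich(candidates, 0.5)))  # ['AAAAAA', 'GCGCTA', 'GGCCGG']
```

Under this assumption 'AAAAAA' survives the filter because it contains no T; a pure homopolymer like this one is meant to be caught by the nucleotide-frequency filter instead.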

## Algorithms

In [1]:
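# to count the occurrences of every window of the candidates' length in the sequences
def hashtable(sequences:list, motifs:list) :
    motifs_freqs = dict()
    n = len(next(iter(motifs))) # length of the candidate motifs (also works when motifs is a set)

    for seq in sequences :
        for i in range(len(seq)-n+1) : # slide a window of n nucleotides along the sequence
            aux = seq[i:i+n]
            motifs_freqs[aux] = motifs_freqs.get(aux, 0) + 1
    
    return motifs_freqs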
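
The function above records every window of length `n`, whether or not it is one of the candidates. If only the candidate motifs are of interest, the lookup can be restricted to them, as in the sketch below. This is one possible variant, not necessarily the intended design; the name `count_candidates` is hypothetical, and all candidates are assumed to have the same length.

```python
from collections import Counter

def count_candidates(sequences, motifs):
    """Count occurrences of each candidate motif across all sequences.

    Hypothetical variant: windows that are not candidates are ignored.
    """
    motifs = set(motifs)           # set membership tests are O(1) on average
    n = len(next(iter(motifs)))    # assumed common length of the candidates
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            window = seq[i:i + n]
            if window in motifs:
                counts[window] += 1
    return counts

seqs = ["ACGTACGT", "TTACGA"]
counts = count_candidates(seqs, {"ACG", "CGT", "GGG"})
print(counts["ACG"], counts["CGT"], counts["GGG"])  # 3 2 0
```

A `Counter` returns 0 for motifs that never occur, so candidates absent from the sequences need no special handling.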