## Find Tandem Repeats

In some DNA sequences there are repeating patterns that are located directly next to one another. These are called [Tandem Repeats](https://en.wikipedia.org/wiki/Tandem_repeat) and are of some interest to biologists.


`ACGTACGTAAACGTAAACGTAAACGTGTGTACACCCAAAGTCA`

In the above sequence, `ACGTAA` is repeated three times in a row, starting at the fifth base. A pattern like this might warrant a closer look.
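As a quick sanity check of that claim, here is a minimal sketch that counts back-to-back copies of a pattern. The helper name `count_tandem_copies` is made up for this illustration and is not used elsewhere in the notebook.

```python
def count_tandem_copies(sequence: str, unit: str, start: int) -> int:
    """Count back-to-back copies of `unit` in `sequence`, beginning at `start`."""
    if not unit:
        raise ValueError("unit must be a non-empty string")
    copies = 0
    # str.startswith accepts a start offset, so no slicing is needed
    while sequence.startswith(unit, start + copies * len(unit)):
        copies += 1
    return copies


sequence = "ACGTACGTAAACGTAAACGTAAACGTGTGTACACCCAAAGTCA"
first = sequence.find("ACGTAA")  # 0-based index of the first occurrence
print(first, count_tandem_copies(sequence, "ACGTAA", first))  # prints "4 3"
```

The 0-based index 4 corresponds to the fifth base of the sequence.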

Let's create a model that can detect repeated patterns within these sequences.

### Creating a dataset

We will create a dataset by:
 - Randomly generating sequences of 100 bases
 - Overwriting part of each sequence with a repeated pattern
   - The pattern is 2-20 bases long.
   - The pattern is repeated 2-4 times.

In [28]:
import random
import numpy as np

In [34]:
DNA_BASE_LENGTH = 100
NUM_EXAMPLES = 100

train_seq = []
train_idx = np.zeros((NUM_EXAMPLES, DNA_BASE_LENGTH), dtype=int)

# Create NUM_EXAMPLES random sequences of DNA bases
for i in range(NUM_EXAMPLES):

    dna_read = ''.join(random.choice('ACGT') for _ in range(DNA_BASE_LENGTH))
    
    # Build the repeated sequence
    repetition_length = random.randint(2, 20)
    number_of_repeats = random.randint(2, 4)
    repeated_sequence = ''.join([random.choice('ACGT') for n in range(repetition_length)]) * number_of_repeats
      
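    # Choose where to place the repetition (randint includes both endpoints,
    # so the latest valid start is len(dna_read) - len(repeated_sequence))
    repeated_location = random.randint(0, len(dna_read) - len(repeated_sequence))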
    
    # Insert the string
    new_dna_read = dna_read[0:repeated_location] + repeated_sequence + dna_read[repeated_location + len(repeated_sequence):]
    
    # Mark the index where each copy of the repeated pattern starts
    for j in range(number_of_repeats):
        train_idx[i, repeated_location + (j * repetition_length)] = 1
   
    train_seq.append(new_dna_read)

**TODO:** Handle accidentally repeated sequences, which `train_idx` does not currently mark. (For example, we might have "CC" at the beginning or end of our sequence by chance.)
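One possible starting point for this TODO is to scan each finished sequence for every tandem repeat it contains, whether inserted or accidental. The sketch below is a brute-force scan, not an optimized algorithm. The function name `find_tandem_repeats` and its parameters `min_unit`, `max_unit`, and `min_copies` are hypothetical, and the defaults simply mirror the pattern lengths used above.

```python
def find_tandem_repeats(sequence, min_unit=2, max_unit=20, min_copies=2):
    """Return (start, unit_length, copies) tuples found by a brute-force scan."""
    found = []
    n = len(sequence)
    for unit_len in range(min_unit, max_unit + 1):
        start = 0
        # Stop once min_copies units can no longer fit in the remaining bases
        while start + unit_len * min_copies <= n:
            unit = sequence[start:start + unit_len]
            copies = 1
            while sequence.startswith(unit, start + copies * unit_len):
                copies += 1
            if copies >= min_copies:
                found.append((start, unit_len, copies))
                start += copies * unit_len  # jump past the run just recorded
            else:
                start += 1
    return found


example = "ACGTACGTAAACGTAAACGTAAACGTGTGTACACCCAAAGTCA"
for start, unit_len, copies in find_tandem_repeats(example):
    print(start, example[start:start + unit_len], copies)
```

On the example sequence from the introduction, this reports the `ACGTAA` run (three copies starting at index 4) and also shorter repeats such as `ACGT` twice at index 0. The output could be used to flag or relabel generated sequences, but how chance repeats should be treated in `train_idx` remains an open design choice.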