## Find Tandem Repeats

Some DNA sequences contain a pattern that is repeated back to back, with each copy directly adjacent to the next. These are called [tandem repeats](https://en.wikipedia.org/wiki/Tandem_repeat) and are of interest to biologists.


`ACGTACGTAAACGTAAACGTAAACGTGTGTACACCCAAAGTCA`

In the above sequence, `ACGTAA` is repeated three times in a row, starting at the fifth base. A pattern like this might warrant a closer look.
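To make the idea concrete before building a model, here is a minimal brute-force sketch that searches a sequence for its longest tandem repeat. The function name `longest_tandem_repeat` and its parameters are hypothetical, and the 2-20 unit-length bounds are an assumption chosen for illustration. It tries every start position and unit length, so it is only suitable for short sequences.

```python
def longest_tandem_repeat(seq, min_unit=2, max_unit=20):
    """Return (start, unit, copies) for the tandem repeat covering the most bases.

    Brute-force sketch: tries every start position and unit length.
    Returns None if no unit occurs at least twice in a row.
    """
    best = None
    best_span = 0
    for start in range(len(seq)):
        for unit_len in range(min_unit, max_unit + 1):
            unit = seq[start:start + unit_len]
            if len(unit) < unit_len:
                break  # ran off the end of the sequence
            # Count how many back-to-back copies of the unit follow
            copies = 1
            pos = start + unit_len
            while seq[pos:pos + unit_len] == unit:
                copies += 1
                pos += unit_len
            span = copies * unit_len
            # Strict ">" keeps the leftmost hit when spans tie
            if copies >= 2 and span > best_span:
                best, best_span = (start, unit, copies), span
    return best


example = "ACGTACGTAAACGTAAACGTAAACGTGTGTACACCCAAAGTCA"
print(longest_tandem_repeat(example))  # (4, 'ACGTAA', 3)
```

The start index is 0-based, so a start of 4 corresponds to the fifth base.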

Let's create a model that can detect repeated patterns within these sequences.

### Creating a dataset

We will create a dataset by:
 - Generating sequences of 100 random bases
 - Overwriting a randomly placed stretch of each sequence with a repeated pattern, where
   - the pattern is 2-20 bases long
   - the pattern is repeated 2-4 times

In [2]:
import random
import numpy as np

In [7]:
DNA_BASE_LENGTH = 100
NUM_EXAMPLES = 100

train_seq = []
# np.int was removed in NumPy 1.24; the built-in int is the equivalent dtype
train_idx = np.zeros((NUM_EXAMPLES, DNA_BASE_LENGTH), dtype=int)

# Create NUM_EXAMPLES random sequences of DNA bases
for i in range(NUM_EXAMPLES):
    
    dna_read = ''.join([random.choice('ACGT') for n in range(DNA_BASE_LENGTH)])
    
    # Build the repeated sequence
    repetition_length = random.randint(2, 20)
    number_of_repeats = random.randint(2, 4)
    repeated_sequence = ''.join([random.choice('ACGT') for n in range(repetition_length)]) * number_of_repeats
      
    # Choose where to place the repetition
    # randint is inclusive at both ends, so the last valid start is len(dna_read) - len(repeated_sequence)
    repeated_location = random.randint(0, len(dna_read) - len(repeated_sequence))
    
    # Insert the string
    new_dna_read = dna_read[0:repeated_location] + repeated_sequence + dna_read[repeated_location + len(repeated_sequence):]
    
    # Mark the index at which each repetition starts
    for j in range(number_of_repeats):
        train_idx[i, repeated_location + (j * repetition_length)] = 1
   
    train_seq.append(new_dna_read)

In [8]:
train_seq[0]

'TCCTGACATTCATCAATCCCAATTGAGCCGGTAGTGCCGGTAGTGCCGGTAGTGCCGGTAGTGCCGGTTATTGTGACCAAGATATCGGGATGTGTCTCTC'

In [9]:
train_idx[0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
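As a sanity check that the labels line up with the sequence, the repeated unit can be read back from the marked positions. The sketch below is self-contained: it copies the example read shown above and rebuilds its label row as a plain list. The helper name `repeat_units` is hypothetical, and it assumes at least two marked starts that are evenly spaced by the unit length.

```python
def repeat_units(seq, labels):
    """Return (starts, units) for a read whose repeat starts are marked with 1."""
    starts = [i for i, flag in enumerate(labels) if flag]
    # Assumption: consecutive starts are exactly one unit length apart
    unit_len = starts[1] - starts[0]
    units = [seq[s:s + unit_len] for s in starts]
    return starts, units


seq = 'TCCTGACATTCATCAATCCCAATTGAGCCGGTAGTGCCGGTAGTGCCGGTAGTGCCGGTAGTGCCGGTTATTGTGACCAAGATATCGGGATGTGTCTCTC'

# Rebuild the label row shown above: ones at positions 31, 40, 49 and 58
labels = [0] * len(seq)
for s in (31, 40, 49, 58):
    labels[s] = 1

starts, units = repeat_units(seq, labels)
print(starts)  # [31, 40, 49, 58]
print(units)   # ['TAGTGCCGG', 'TAGTGCCGG', 'TAGTGCCGG', 'TAGTGCCGG']
```

The same helper also accepts a row of `train_idx` directly, since it only iterates over the label values.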

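**TODO:** Handle accidentally repeated sequences. Random bases can form repeats by chance, either elsewhere in the sequence or directly next to the inserted pattern (for example, an unplanned "CC"), and these are currently not labelled.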
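One possible starting point for this TODO is to measure how far the periodic pattern happens to continue into the random bases on either side of the inserted block. The sketch below is an assumption about how that could be done, not part of the dataset code above; `periodic_extension` and its arguments are hypothetical names. It uses the example read shown earlier, where the inserted unit is 9 bases long, starts at index 31, and is repeated 4 times.

```python
def periodic_extension(seq, start, unit_len, n_repeats):
    """Count flanking bases that continue the repeat's period by chance.

    Returns (left, right): the number of bases immediately before and after
    the inserted block that match the base one unit length further in.
    """
    end = start + unit_len * n_repeats  # first index after the inserted block

    left = 0
    i = start - 1
    while i >= 0 and seq[i] == seq[i + unit_len]:
        left += 1
        i -= 1

    right = 0
    j = end
    while j < len(seq) and seq[j] == seq[j - unit_len]:
        right += 1
        j += 1

    return left, right


seq = 'TCCTGACATTCATCAATCCCAATTGAGCCGGTAGTGCCGGTAGTGCCGGTAGTGCCGGTAGTGCCGGTTATTGTGACCAAGATATCGGGATGTGTCTCTC'
print(periodic_extension(seq, start=31, unit_len=9, n_repeats=4))  # (5, 1)
```

Here the five bases before the inserted block (`GCCGG`) and the one base after it (`T`) match the pattern by chance, so the repeat appears longer than what the labels record. How to relabel such cases is left open.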


## Train a model

In [1]:
from torch.utils.data import Dataset, DataLoader

In [37]:
class TandemRepeatDataset(Dataset):
    """
    A PyTorch Dataset that handles simple tokenization of our DNA base sequences.
    """
    
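    def __init__(self, train_seq, train_idx):
        # Map each DNA base to an integer token
        self.mapping = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
        self.train_seq = self.tokenize(train_seq)
        self.train_idx = train_idx

    def tokenize(self, train_seq):
        # Return an array rather than nested lists so that DataLoader's default
        # collate function stacks items into a (batch, length) tensor
        tokenized_seq = []

        for seq in train_seq:
            token_seq = [self.mapping[char] for char in seq]
            tokenized_seq.append(token_seq)

        return np.array(tokenized_seq, dtype=np.int64)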
    
    def __getitem__(self, idx):
        return self.train_seq[idx], self.train_idx[idx]
    
    def __len__(self):
        return len(self.train_seq)
    


In [38]:
train_dataset = TandemRepeatDataset(train_seq, train_idx)
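As a next step toward training, a dataset like this can be wrapped in a `DataLoader` to produce mini-batches. The following is a minimal sketch rather than the notebook's own code: `ToyRepeatDataset`, the four short toy reads, the batch size of 2, and the choice of float labels are all assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyRepeatDataset(Dataset):
    """Hypothetical stand-in for TandemRepeatDataset that returns tensors."""

    def __init__(self, sequences, labels):
        mapping = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
        # All reads must have the same length so they form a 2-D tensor
        self.tokens = torch.tensor(
            [[mapping[base] for base in seq] for seq in sequences],
            dtype=torch.long,
        )
        # Float labels are an assumption (convenient for a per-position binary loss)
        self.labels = torch.tensor(labels, dtype=torch.float32)

    def __getitem__(self, idx):
        return self.tokens[idx], self.labels[idx]

    def __len__(self):
        return len(self.tokens)


# Toy data: each read is a 4-base unit repeated twice, with starts at 0 and 4
sequences = ["ACGTACGT", "GGTTGGTT", "ACCAACCA", "TTGATTGA"]
labels = [[1, 0, 0, 0, 1, 0, 0, 0] for _ in sequences]

loader = DataLoader(ToyRepeatDataset(sequences, labels), batch_size=2, shuffle=False)

tokens, targets = next(iter(loader))
print(tokens.shape, targets.shape)  # torch.Size([2, 8]) torch.Size([2, 8])
```

Each batch holds one row of tokens and one row of labels per read, which is the shape a per-position classifier would consume.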