## Find Tandem Repeats

Some DNA sequences contain patterns that repeat directly next to one another. These are called [tandem repeats](https://en.wikipedia.org/wiki/Tandem_repeat) and are of interest to biologists.


`ACGTACGTAAACGTAAACGTAAACGTGTGTACACCCAAAGTCA`

In the sequence above, `ACGTAA` is repeated three times in a row, starting at the fifth base. A region like this might warrant a closer look.
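
As a point of reference, the repeat in this example can be confirmed without any model. The following is a minimal brute-force sketch; the helper name `count_tandem_copies` is hypothetical and not part of the notebook, and it assumes the start position and unit length are already known.

```python
def count_tandem_copies(seq, start, unit_len):
    """Count back-to-back copies of seq[start:start + unit_len] (hypothetical helper)."""
    if unit_len <= 0:
        raise ValueError("unit_len must be positive")
    unit = seq[start:start + unit_len]
    copies = 0
    pos = start
    # Step forward one unit at a time while the next slice still matches the unit.
    # The length check stops the loop when the slice runs off the end of the sequence.
    while len(unit) == unit_len and seq[pos:pos + unit_len] == unit:
        copies += 1
        pos += unit_len
    return unit, copies


seq = "ACGTACGTAAACGTAAACGTAAACGTGTGTACACCCAAAGTCA"
unit, copies = count_tandem_copies(seq, start=4, unit_len=6)
print(unit, copies)  # ACGTAA 3
```

A model is useful precisely because, in real data, neither the start position nor the unit length is known in advance.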

Let's create a model that can detect repeated patterns within these sequences.

### Creating a dataset

We will create a dataset by:
 - Generating sequences of 100 random base pairs.
 - Overwriting part of each sequence with a repeated pattern, where
   - the pattern is 5-20 base pairs long, and
   - the pattern is repeated 2-4 times.

In [1]:
import torch
import random
import numpy as np

In [2]:
DNA_BASE_LENGTH = 100
NUM_EXAMPLES = 100

train_seq = []
train_idx = np.zeros((NUM_EXAMPLES, DNA_BASE_LENGTH), dtype=np.int64)  # np.int was removed in NumPy 1.24

# Create NUM_EXAMPLES random sequences of DNA bases
for i in range(NUM_EXAMPLES):
    
    dna_read = ''.join([random.choice('ACGT') for n in range(DNA_BASE_LENGTH)])
    
    # Build the repeated sequence
    repetition_length = random.randint(5, 20)
    number_of_repeats = random.randint(2, 4)
    repeated_sequence = ''.join([random.choice('ACGT') for n in range(repetition_length)]) * number_of_repeats
      
    # Choose where to place the repetition
    repeated_location = random.randint(0, len(dna_read) - 1 - len(repeated_sequence))
    
    # Insert the string
    new_dna_read = dna_read[0:repeated_location] + repeated_sequence + dna_read[repeated_location + len(repeated_sequence):]
    
    # Mark the start index of each repetition
    for j in range(number_of_repeats):
        train_idx[i, repeated_location + (j * repetition_length)] = 1
   
    train_seq.append(new_dna_read)

In [3]:
train_seq[0]

'CGTCCCGAAGTTGGCAAATGTGAGGTCAAGACGACTCACGTGGGTTGACGCAAACTGCGCCGCAGCCTGGGGGCCTGGGGGCCTGGGGGTTTTGATCGTA'

In [4]:
train_idx[0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
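
To check that the labels line up with the sequence, the marked positions can be mapped back to the repeated unit. This is a small sketch using the example printed above (ones at indices 65, 73 and 81); the helper name `repeat_unit_from_labels` is hypothetical, and it assumes each sequence contains exactly one inserted repeat with at least two marked starts.

```python
def repeat_unit_from_labels(seq, labels):
    """Recover the marked start positions and the repeated unit (hypothetical helper)."""
    starts = [i for i, flag in enumerate(labels) if flag == 1]
    if len(starts) < 2:
        raise ValueError("need at least two marked starts to infer the unit length")
    # The unit length is the spacing between consecutive marked starts.
    unit_len = starts[1] - starts[0]
    return starts, seq[starts[0]:starts[0] + unit_len]


seq = ("CGTCCCGAAGTTGGCAAATGTGAGGTCAAGACGACTCACGTGGGTTGACGCAAACTGCGC"
       "CGCAGCCTGGGGGCCTGGGGGCCTGGGGGTTTTGATCGTA")
labels = [0] * len(seq)
for pos in (65, 73, 81):  # positions of the ones in train_idx[0] above
    labels[pos] = 1

starts, unit = repeat_unit_from_labels(seq, labels)
print(starts, unit)  # [65, 73, 81] CCTGGGGG
```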

**TODO:** Handle accidentally repeated sequences. (For example, we might have "CC" at the beginning or end of our sequence by chance.)
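
One possible reading of this TODO is that the random bases flanking the inserted repeat can continue its pattern by chance, so the labeled region is slightly shorter than the true periodic region. The sketch below measures that under this assumption; `flank_extension` is a hypothetical helper, not part of the notebook, and it does not address chance repeats elsewhere in the random background.

```python
def flank_extension(seq, start, unit_len, n_repeats):
    """Count flanking bases that continue the period-`unit_len` pattern by chance."""
    end = start + unit_len * n_repeats  # first index after the inserted repeat
    left = 0
    # Walk left: a base extends the repeat if it equals the base one period to its right.
    while start - 1 - left >= 0 and seq[start - 1 - left] == seq[start - 1 - left + unit_len]:
        left += 1
    right = 0
    # Walk right: a base extends the repeat if it equals the base one period to its left.
    while end + right < len(seq) and seq[end + right] == seq[end + right - unit_len]:
        right += 1
    return left, right


seq = ("CGTCCCGAAGTTGGCAAATGTGAGGTCAAGACGACTCACGTGGGTTGACGCAAACTGCGC"
       "CGCAGCCTGGGGGCCTGGGGGCCTGGGGGTTTTGATCGTA")
print(flank_extension(seq, start=65, unit_len=8, n_repeats=3))  # (1, 0)
```

In the example sequence shown earlier, the `G` at index 64 happens to match the last base of the unit, so the periodic region really begins one base before the first labeled start.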

### Training a model

In [5]:
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2LMHeadModel, GPT2Config, GPT2PreTrainedModel, GPT2Model
from transformers import BertConfig, BertModel

In [6]:
class TandemRepeatDataset(Dataset):
    """
    A PyTorch Dataset that handles simple tokenization of our DNA base sequences.
    """
    
    def __init__(self, train_seq, train_idx):
        
        self.mapping = {'A':0, 'C': 1, 'G': 2, 'T': 3}
        self.train_seq = self.tokenize(train_seq)
        self.train_idx = train_idx
        
    def tokenize(self, train_seq):
        tokenized_seq = []
        
        for seq in train_seq:
            token_seq = [self.mapping[char] for char in seq]
            tokenized_seq.append(token_seq)
       
        return tokenized_seq
    
    def __getitem__(self, idx):
        
        x = torch.tensor(self.train_seq[idx], dtype=torch.long)
        y = torch.tensor(self.train_idx[idx], dtype=torch.long)
        return x, y
    
    def __len__(self):
        return len(self.train_seq)


In [7]:
BATCH_SIZE = 4

In [8]:
train_dataset = TandemRepeatDataset(train_seq, train_idx)
train_dl = DataLoader(train_dataset, batch_size=BATCH_SIZE)

In [10]:
# GPT2 Model
#config = GPT2Config.from_pretrained('gpt2', cache_dir=None)
#model = GPT2Model.from_pretrained('gpt2', from_tf=False, config=config, cache_dir='.')

In [31]:
class TandemRepeatModel(torch.nn.Module):
    """
    Wrapper around a BERT model that outputs two class logits (repeat start or not) for each input token
    """
    
    def __init__(self):
        super().__init__()
        
        self.config = BertConfig.from_pretrained('bert-base-uncased', cache_dir=None)
        self.bert_model = BertModel.from_pretrained('bert-base-uncased', from_tf=False, config=self.config, cache_dir='.')
        self.bert_model.resize_token_embeddings(new_num_tokens=4)

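        self.linear = torch.nn.Linear(in_features=768, out_features=2)

    def forward(self, x):
        # Index [0] selects the last hidden state, shape (batch, seq_len, 768),
        # whether the model returns a tuple or a ModelOutput object
        out = self.bert_model(x)[0]
        out = self.linear(out)
        return out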

In [32]:
# BERT Model
model = TandemRepeatModel()

In [33]:
# Allow our model to accept only 4 input words (A, C, G or T)
# Allow our model to only output 2 possibilities (0 or 1)
#model.lm_head = torch.nn.Linear(in_features=768, out_features=2, bias=False)

In [34]:
# NUM_EPOCHS = 1

# for i in range(NUM_EPOCHS):
    
#     for x, y in train_dl:
#         print(x,y)
        
#         model.zero_grad()
#         loss = model(x, y)
        
#         loss.backward()
        
#         break
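
The commented-out loop above calls `model(x, y)`, but `TandemRepeatModel.forward` takes only `x` and returns per-token logits. One way to turn those logits into a loss is token-level cross-entropy. The following is a minimal sketch of a single training step under that assumption; to stay small and avoid downloading BERT it uses a hypothetical stand-in, `TinyTokenClassifier`, with the same input and output shapes, and the optimizer and learning rate are illustrative choices, not settings from the notebook.

```python
import torch

torch.manual_seed(0)


class TinyTokenClassifier(torch.nn.Module):
    """Hypothetical stand-in for TandemRepeatModel: token ids in, two logits per token out."""

    def __init__(self, vocab_size=4, hidden=16, num_classes=2):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden)
        self.linear = torch.nn.Linear(hidden, num_classes)

    def forward(self, x):
        return self.linear(self.embed(x))


model = TinyTokenClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer settings
loss_fn = torch.nn.CrossEntropyLoss()

# Fake batch with the same shapes as one batch from train_dl: (batch, seq_len)
x = torch.randint(0, 4, (4, 100))
y = torch.zeros(4, 100, dtype=torch.long)
y[:, 10] = 1  # pretend a repeat starts at position 10 in every sequence

optimizer.zero_grad()
logits = model(x)  # shape (4, 100, 2)
# CrossEntropyLoss expects (N, classes) and (N,), so flatten batch and position together
loss = loss_fn(logits.reshape(-1, 2), y.reshape(-1))
loss.backward()
optimizer.step()

preds = logits.argmax(dim=-1)  # shape (4, 100), predicted label per base
```

Because almost every position is labeled 0, an unweighted loss can be minimized by predicting 0 everywhere; passing a `weight` tensor to `CrossEntropyLoss` is one option worth considering.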

In [35]:
x,y = next(iter(train_dl))

In [36]:
out1 = model(x)

In [38]:
out1.shape

torch.Size([4, 100, 2])