<a href="https://colab.research.google.com/github/MartinekV/DL-for-bio-course/blob/master/04_DNA_enhancers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

In [2]:
!pip install -q genomic-benchmarks
!pip install torchmetrics -q


## Text preprocessing

In [13]:
import torch
example_seq = 'ACCCTGCCAACACGGGACTTTAC'
vocab = {'A':0,'C':1,'T':2,'G':3}

In [15]:
numericalized = [vocab[c] for c in example_seq]
numericalized

[0, 1, 1, 1, 2, 3, 1, 1, 0, 0, 1, 0, 1, 3, 3, 3, 0, 1, 2, 2, 2, 0, 1]

In [18]:
numericalized_tensor = torch.tensor(numericalized)
ohe_seq = torch.nn.functional.one_hot(numericalized_tensor, num_classes=4)
ohe_seq

tensor([[1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 1, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1],
        [0, 1, 0, 0],
        [0, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [0, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 0, 1],
        [0, 0, 0, 1],
        [1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 1, 0],
        [0, 0, 1, 0],
        [1, 0, 0, 0],
        [0, 1, 0, 0]])

In [30]:
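vocab = {'A':0,'C':1,'T':2,'G':3}

def tokenize_batch(batch):
    """One-hot encode equal-length DNA strings into a (batch, length, 4) float tensor."""
    res = []
    for seq in batch:
        # Map each nucleotide to its integer index, then one-hot encode it
        numericalized = [vocab[c] for c in seq]
        numericalized_tensor = torch.tensor(numericalized)
        ohe_seq = torch.nn.functional.one_hot(numericalized_tensor, num_classes=4).float()
        res.append(ohe_seq)
        print('Appending tensor of size', ohe_seq.size())

    # torch.stack needs identical shapes, so all sequences must have the same length
    print('Stacking tensors along a new dimension')
    out = torch.stack(res)
    print('Resulting tensor size', out.size())
    return out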

batch = ['ACTGATCACG','GGAATAAACG','CTGATCATAG','TCGAGAATCG', 'AACAGAATCG', 'TGGAGAATCG']
tokenized_batch = tokenize_batch(batch)

Appending tensor of size torch.Size([10, 4])
Appending tensor of size torch.Size([10, 4])
Appending tensor of size torch.Size([10, 4])
Appending tensor of size torch.Size([10, 4])
Appending tensor of size torch.Size([10, 4])
Appending tensor of size torch.Size([10, 4])
Stacking tensors along a new dimension
Resulting tensor size torch.Size([6, 10, 4])


In [34]:
# Swap the length and channel dimensions: (batch, length, channels) -> (batch, channels, length)
swapped_batch = tokenized_batch.permute(0,2,1)
swapped_batch.size()

torch.Size([6, 4, 10])
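
The channels-first layout above is the one `nn.Conv1d` takes as input: (batch, channels, length). Here is a minimal sketch of a convolution over such a batch. The random one-hot batch is a stand-in for `swapped_batch`, and the 8 filters and kernel width 3 are arbitrary choices made only for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for swapped_batch: 6 random sequences of length 10, one-hot encoded
indices = torch.randint(0, 4, (6, 10))
one_hot = torch.nn.functional.one_hot(indices, num_classes=4).float()  # (6, 10, 4)
x = one_hot.permute(0, 2, 1)  # (6, 4, 10): batch, channels, length

# in_channels must equal the one-hot dimension (4 nucleotides)
conv = nn.Conv1d(in_channels=4, out_channels=8, kernel_size=3)
out = conv(x)

# Without padding, the length shrinks by kernel_size - 1: 10 - 3 + 1 = 8
print(out.shape)  # torch.Size([6, 8, 8])
```

Each of the 8 output channels is one learned filter that scans 3 neighboring nucleotides at a time.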

## Enhancers Project 

Your task is to:

1.   **Create a model that classifies DNA sequences by whether they contain an enhancer (label 1) or not (label 0)**
2.   **Show that your model generalizes to new, unseen data**

Tips
*   Use the PyTorch documentation and ChatGPT for help
*   Use a validation set for hyperparameter tuning (use torch.utils.data.random_split to split the dataset)
*   Do NOT use the test set for any adjustments; use it only for the final evaluation
*   You can use the nn.Conv1d layer to perform convolution over 1D data. Use the one-hot encoding dimension as the channel dimension, e.g. 4 letters of DNA = 4 channels.
*   Explore the data (e.g. class balance and sequence lengths) and use metrics appropriate for it
*   You can use one-hot tokenization or any other tokenization (e.g. k-mers)
*   Feel free to use any other improvements you find on the internet
*   Feel free to use any other libraries, e.g. PyTorch Lightning
*   Use a GPU for training
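
As an illustration of the k-mer tokenization mentioned in the tips, here is a minimal sketch in plain Python. The helper names `build_kmer_vocab` and `kmer_tokenize` are hypothetical, and the choice of k = 3 with overlapping windows (stride 1) is only an assumption for the example.

```python
from itertools import product

def build_kmer_vocab(k, alphabet="ACTG"):
    """Map every possible k-mer over the alphabet to an integer id."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}

def kmer_tokenize(seq, k, vocab, stride=1):
    """Slide a window of width k over seq and look up the id of each k-mer."""
    return [vocab[seq[i:i + k]] for i in range(0, len(seq) - k + 1, stride)]

vocab_3mer = build_kmer_vocab(3)  # 4**3 = 64 possible 3-mers
tokens = kmer_tokenize("ACTGATCACG", 3, vocab_3mer)

# A length-10 sequence yields 10 - 3 + 1 = 8 overlapping 3-mers
print(len(vocab_3mer), len(tokens))  # 64 8
```

Integer k-mer ids like these are typically fed to an `nn.Embedding` layer rather than one-hot encoded, since the vocabulary grows as 4^k.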






## Data preparation and exploration

In [41]:
from genomic_benchmarks.dataset_getters.pytorch_datasets import HumanEnhancersCohn

train_dset = HumanEnhancersCohn('train')
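
A possible next step is to hold out a validation set and batch the data for training. The sketch below assumes the dataset yields (sequence string, integer label) pairs, that all sequences share one length, and that they contain only A, C, T and G. A small toy list stands in for `train_dset` so the example runs without a download, and `collate_one_hot` is a hypothetical helper, not part of genomic-benchmarks.

```python
import torch
from torch.utils.data import DataLoader, random_split

vocab = {'A': 0, 'C': 1, 'T': 2, 'G': 3}

# Toy stand-in for train_dset: (sequence, label) pairs of equal length
toy_dset = [('ACTGATCACG', 1), ('GGAATAAACG', 0), ('CTGATCATAG', 1),
            ('TCGAGAATCG', 0), ('AACAGAATCG', 1)] * 4  # 20 examples

def collate_one_hot(batch):
    """Turn a list of (sequence, label) pairs into model-ready tensors."""
    seqs, labels = zip(*batch)
    idx = torch.tensor([[vocab[c] for c in s] for s in seqs])    # (N, L)
    x = torch.nn.functional.one_hot(idx, num_classes=4).float()  # (N, L, 4)
    y = torch.tensor(labels, dtype=torch.float)                  # (N,)
    return x.permute(0, 2, 1), y                                 # x: (N, 4, L)

# Hold out 20% for validation; a seeded generator makes the split reproducible
n_val = len(toy_dset) // 5
train_part, val_part = random_split(
    toy_dset, [len(toy_dset) - n_val, n_val],
    generator=torch.Generator().manual_seed(42))

train_loader = DataLoader(train_part, batch_size=8, shuffle=True,
                          collate_fn=collate_one_hot)
val_loader = DataLoader(val_part, batch_size=8, collate_fn=collate_one_hot)

x, y = next(iter(train_loader))
print(x.shape, y.shape)  # torch.Size([8, 4, 10]) torch.Size([8])
```

Doing the encoding in `collate_fn` keeps the dataset itself untouched, so a different tokenization can be swapped in later without changing the split.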

## Testing

In [40]:
test_dset = HumanEnhancersCohn('test')
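
For the final evaluation, one option is to accumulate confusion-matrix counts over the test batches and derive accuracy, precision and recall from them. This is a minimal sketch under stated assumptions: the small untrained model and the single random batch are hypothetical stand-ins for a trained model and a test `DataLoader`, and the model is assumed to output one logit per sequence.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins: an untrained model and one batch of random test data
model = nn.Sequential(
    nn.Conv1d(4, 8, kernel_size=3), nn.ReLU(),
    nn.AdaptiveMaxPool1d(1), nn.Flatten(), nn.Linear(8, 1))
test_batches = [(torch.rand(16, 4, 10), torch.randint(0, 2, (16,)))]

def evaluate(model, batches, threshold=0.5):
    """Accumulate confusion-matrix counts, then derive the metrics."""
    model.eval()  # put layers such as dropout and batch norm in inference mode
    tp = fp = fn = tn = 0
    with torch.no_grad():  # gradients are not needed for evaluation
        for x, y in batches:
            logits = model(x).squeeze(1)                     # (N,)
            pred = (torch.sigmoid(logits) > threshold).long()
            tp += ((pred == 1) & (y == 1)).sum().item()
            fp += ((pred == 1) & (y == 0)).sum().item()
            fn += ((pred == 0) & (y == 1)).sum().item()
            tn += ((pred == 0) & (y == 0)).sum().item()
    total = tp + fp + fn + tn
    return {
        'accuracy': (tp + tn) / total,
        # Guard against division by zero when a class is never predicted or present
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }

metrics = evaluate(model, test_batches)
print(metrics)
```

Reporting precision and recall alongside accuracy helps when the two classes are not equally frequent.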