In [None]:
!pip install numpy

In [None]:
import os
import joblib
import matplotlib.pyplot as plt
import random
import seaborn as sns
plt.rc('axes', facecolor='white')

In [None]:
DATA_DIR = '/kaggle/input/joblib-data'
if not os.path.exists(DATA_DIR):
    DATA_DIR = 'data'

In [None]:
OUTPUT_DIR_PATH = 'outputs'
# Create the output directory if it does not exist yet
os.makedirs(OUTPUT_DIR_PATH, exist_ok=True)
to_output = lambda file_path: os.path.join(OUTPUT_DIR_PATH, file_path)

# Ex1: One-hot encoding DNA

Write a function called **onehot_dna(dna_str)** that one-hot encodes a DNA segment: each base becomes a vector of zeros with a single 1 at the position assigned to that base (column order A, C, G, T). The function returns a NumPy array of shape (len(dna_str), 4). DNA is a long chain of bases strung together. There are 4 bases: A, C, G, T. For example, "AACCCAAATCGGGGG" is a DNA segment.



For example, **onehot_dna('AAT')** should return

array([[1, 0, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1]])






In [None]:
import numpy as np

In [None]:
DNA_TABLE = "ACGT"
def onehot_dna(dna_str):
    one_hot = np.zeros((len(dna_str), 4),dtype=int)
    
    for i, base in enumerate(dna_str):
        if base in DNA_TABLE:
            one_hot[i, DNA_TABLE.index(base)] = 1
    
    return one_hot


In [None]:
onehot_dna('AAT')

# Deep learning to classify Transcription Factor Binding


In the next exercises, we will learn how to use deep learning to predict whether a segment of DNA includes a site where JUND binds. (JUND is a particular transcription factor.)

For this purpose, we will use data taken from chapter 6 of the book 'Deep Learning for the Life Sciences' by B. Ramsundar, P. Eastman, P. Walters and V. Pande.


The data consist of DNA segments that have been split up from a full chromosome. Each segment is 101 bases long and has been labeled to indicate whether or not it includes a site where JUND binds.


This is a binary classification problem.
The process of creating a PyTorch neural network binary classifier consists of several steps:

1. Prepare the training and test data

2. Implement a Dataset object to serve up the data

3. Design and implement a neural network

4. Write code to train the network

5. Write code to evaluate the model (the trained network)


# Ex 2:  Load Data

1. With the help of the joblib library, load the following files for the training set: **y_train.joblib**, **X_train.joblib**, and store the results in the variables **y_train**, **X_train**, respectively.

2. Do the same thing for the test set: load **y_test.joblib**, **X_test.joblib**, and store the results in the variables **y_test**, **X_test**, respectively.

3. What are the shapes of **X_train** and **y_train**? How many DNA segments are there in the training set?

4. Display a DNA segment from **X_train** (using matplotlib.pyplot.imshow).

5. Plot the histogram of **y_train** to see whether the classes are imbalanced.


In [None]:
## 1.
X_train = joblib.load(os.path.join(DATA_DIR,'X_train.joblib'))
y_train = joblib.load(os.path.join(DATA_DIR,'y_train.joblib'))

In [None]:
## 2.
X_test = joblib.load(os.path.join(DATA_DIR,'X_test.joblib'))
y_test = joblib.load(os.path.join(DATA_DIR,'y_test.joblib'))

In [None]:
## 3.
print(f'{X_train.shape =}')
print(f'{y_train.shape =}')
print(f'There are {X_train.shape[0]} DNA segments in the training set.')

In [None]:
## 4.
dna_segment = random.choice(X_train)
fig = plt.figure()
ax = fig.add_subplot(111)
# A segment is stored as (101, 4); transpose so that the x axis is the position
ax.imshow(dna_segment.T, cmap='binary', aspect='auto')
ax.set_xlabel('Position in the sequence')
ax.set_ylabel('Base (one-hot channel)')

ax.set_title('One-hot encoding of a DNA segment')

plt.show()

In [None]:
## 5.
fig = plt.figure()
ax = fig.add_subplot(111)
# Count how many segments carry each label to check the class balance
labels, counts = np.unique(np.ravel(y_train), return_counts=True)
ax.bar([str(label) for label in labels], counts, label='Number of segments', color=(0, 0, 0.8))
ax.set_xlabel('Class label')
ax.set_ylabel('Count')
ax.set_title('Class distribution in y_train')
ax.legend()
plt.show()
fig.savefig(to_output('y_train_histogram.png'))


# Ex 3: Convert NumPy arrays to PyTorch tensors

As you saw in the previous exercise, **X_train** consists of 4672 segments. Each segment is encoded with 0s and 1s (one-hot encoding).


1. Convert the NumPy arrays **X_train** and **y_train** into PyTorch tensors. Rearrange the axes of **X_train** so that its shape is (4672, 4, 101); this is a transposition of the last two axes, not a plain reshape. Note that **X_train** and **y_train** should both have a float dtype (torch.float32, the default dtype of PyTorch layers).

2. Do the same thing for **X_test** and **y_test**.
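
The difference between transposing and reshaping can be checked on a tiny example. The sketch below is only an illustration on a made-up batch (the names `toy`, `transposed`, `reshaped` and `tensor` are not part of the exercise); it assumes the same (segments, positions, bases) layout as **X_train**.

```python
import numpy as np
import torch

# Toy batch: 2 segments, 3 positions, 4 bases (layout: N, length, bases)
toy = np.zeros((2, 3, 4), dtype=int)
toy[0, [0, 1, 2], [0, 0, 3]] = 1  # 'AAT' with column order A, C, G, T
toy[1, [0, 1, 2], [1, 2, 3]] = 1  # 'CGT'

# Swapping the last two axes keeps each base attached to its position
transposed = toy.transpose(0, 2, 1)  # shape (2, 4, 3)

# reshape gives the same shape but reads the values in a different order
reshaped = toy.reshape(2, 4, 3)

# float32 matches the default dtype of PyTorch layers
tensor = torch.from_numpy(transposed.astype(np.float32))
print(tensor.shape, tensor.dtype)
print(np.array_equal(transposed, reshaped))
```

In the transposed array, column `j` of each segment is still the one-hot vector of base `j`, which is the property the convolution relies on.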


In [None]:
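## 1.
import torch
# float32 matches the default dtype of PyTorch layers (astype('float') would give float64)
X_train_ = X_train.astype(np.float32)
# Move the base axis before the position axis: (N, 101, 4) -> (N, 4, 101)
X_train_ = X_train_.transpose(0, 2, 1)
assert X_train_.shape == (4672, 4, 101)
X_train_tensor = torch.from_numpy(X_train_)
y_train_tensor = torch.from_numpy(y_train.astype(np.float32))
X_train_tensor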

In [None]:
## 2.

# Ex4: Create Dataset
To train a deep learning model with PyTorch, we need a PyTorch dataset.
The DNADataset class below creates a PyTorch Dataset from DNA segments and their labels.

1. Using this class, create a dataset for the training set. Call it **train_dataset**.

2. Create a **DataLoader** from **train_dataset**. Call it **train_loader**.

3. Do the same thing for the test set.

In [None]:
class DNADataset(torch.utils.data.Dataset):
    def __init__(self, dna, labels):
        self.labels = labels
        self.dna = dna


    def __len__(self):
        return len(self.labels)


    def __getitem__(self, idx):
        label = self.labels[idx]
        frag_dna = self.dna[idx]

        sample = {'DNA': frag_dna, 'Class': label}

        return sample
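
To see what a loader built on such a dataset yields, here is a minimal sketch on synthetic data. `ToyDNADataset` is a compact stand-in that mirrors DNADataset so the block runs on its own; the tensors `toy_X` and `toy_y` and the batch size of 4 are assumptions chosen for illustration, not values from the exercise.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDNADataset(Dataset):
    """Stand-in mirroring DNADataset: each sample is a dict."""

    def __init__(self, dna, labels):
        self.dna = dna
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {'DNA': self.dna[idx], 'Class': self.labels[idx]}


# Synthetic stand-ins for the real tensors: 10 segments, 4 channels, 101 bases
toy_X = torch.rand(10, 4, 101)
toy_y = torch.randint(0, 2, (10, 1)).float()

toy_dataset = ToyDNADataset(toy_X, toy_y)
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

# The default collate function stacks the dict entries key by key
batch = next(iter(toy_loader))
print(batch['DNA'].shape)    # torch.Size([4, 4, 101])
print(batch['Class'].shape)  # torch.Size([4, 1])
```

Shuffling is usually enabled for the training loader only; for a test loader the order does not affect the accuracy.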

In [None]:
## 1.

In [None]:
## 2.

In [None]:
## 3.

# Design and implement a convolutional neural network

Now it's time to build your model. This is a binary classification problem, so we can use a convolutional neural network, just as for image classification. However, since a DNA segment has size (4, 101), that is 4 channels along a single sequence axis, we will use 1D convolution instead of 2D convolution.



First, we will test how a 1D convolution works on our data.



# EX 5: 1D Convolution

1. With the help of the torch.nn.Conv1d class, create a 1D convolutional layer. You need to choose values for the following parameters: **in_channels**, **out_channels**, **kernel_size**.


2. Apply this layer to **dna_seg** below. What is the size of the output ?


3. [Optional] Display the output by using matplotlib.pyplot.imshow
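
As a hedged illustration of how the parameters affect the output size, the sketch below applies two Conv1d layers to a random tensor. The name `toy_seg` and the values out_channels=16, kernel_size=5 and padding=2 are assumptions made for this example. With stride 1 and no dilation, the output length is L_in + 2 * padding - kernel_size + 1.

```python
import torch
import torch.nn as nn

# Random stand-in for one DNA segment: (batch, channels, length)
toy_seg = torch.rand(1, 4, 101)

# in_channels must equal the number of bases (4)
conv_valid = nn.Conv1d(in_channels=4, out_channels=16, kernel_size=5)
conv_same = nn.Conv1d(in_channels=4, out_channels=16, kernel_size=5, padding=2)

out_valid = conv_valid(toy_seg)
out_same = conv_same(toy_seg)

print(out_valid.shape)  # torch.Size([1, 16, 97]):  101 - 5 + 1
print(out_same.shape)   # torch.Size([1, 16, 101]): 101 + 2*2 - 5 + 1
```

Choosing padding = (kernel_size - 1) / 2 with an odd kernel size keeps the sequence length at 101.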




In [None]:
## 1.

In [None]:
## 2.

In [None]:
## 3.

# EX 6: Build a model

The following code builds a CNN model for a classification problem. This model consists of:

1. 3 layers of 1D convolution, each followed by a ReLU activation.

2. 2 linear layers.


Complete the lines marked # TODO below to finish the definition of this network.

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class DeepDNA(nn.Module):

  def __init__(self, sequence_length):
    """
    Parameters
    -----------
    sequence_length: int
        Number of bases per DNA segment (101 for this dataset).

    """
    super(DeepDNA,self).__init__()

    #### TO DO ####


    self.lin1 = nn.Linear(101*64, 32)
    self.lin2 = nn.Linear(32, 1)


  def forward(self, x):

    # 1/ pass the first convolutional layer
    #### TODO #####

    # 2/ Pass the second convolutional layer
    #### TODO #####

    # 3/ Pass the third convolution layer
    #### TODO #####

    x = x.view(x.size(0), 101*64)

    # TODO
    # Linear Classifier

    # Sigmoid
    x = nn.Sigmoid()(x)

    return x
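
For reference, one possible way to fill in such a skeleton is sketched below under a separate, hypothetical name (`DeepDNASketch`). The channel counts (16, 32, 64), kernel_size=5 and padding=2 are assumptions; the only constraint taken from the skeleton is that the last convolution must output 64 channels of length 101, so that the flattened size is 101*64.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepDNASketch(nn.Module):
    """Hypothetical completion: 3 Conv1d layers followed by 2 linear layers."""

    def __init__(self, sequence_length=101):
        super().__init__()
        # kernel_size=5 with padding=2 keeps the sequence length unchanged
        self.conv1 = nn.Conv1d(4, 16, kernel_size=5, padding=2)
        self.conv2 = nn.Conv1d(16, 32, kernel_size=5, padding=2)
        self.conv3 = nn.Conv1d(32, 64, kernel_size=5, padding=2)
        self.lin1 = nn.Linear(sequence_length * 64, 32)
        self.lin2 = nn.Linear(32, 1)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = x.view(x.size(0), -1)       # flatten to (batch, 101*64)
        x = F.relu(self.lin1(x))
        x = torch.sigmoid(self.lin2(x))  # probability of the positive class
        return x


net_sketch = DeepDNASketch()
toy_batch = torch.rand(8, 4, 101)  # random stand-in for 8 DNA segments
out = net_sketch(toy_batch)
print(out.shape)  # torch.Size([8, 1])
```

Other kernel sizes or channel counts work equally well as long as the flattened size matches the input size of the first linear layer.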


# Ex 7 Test the model


1. Create an instance of the DeepDNA class named **net**.

2. Print out the variable **net** to see detailed information about the model.

3. Pass **dna_seg** below to **net** in order to  test if your model **net** works well.

4. What is the size of the output ?



In [None]:
## 1.

In [None]:
## 2.

In [None]:
## 3.

In [None]:
## 4.

# Ex 8: Define loss function and optimizer


1. Define an SGD optimizer for the model. You need to choose the learning rate for your model.

2. Define a Binary Cross Entropy (BCE) Loss  function.
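
The sketch below shows both objects on a small stand-in model, together with a worked value of the loss. The model `toy_model` and the learning rate 0.01 are assumptions for illustration. For a single prediction p with label y, BCE is -(y * log(p) + (1 - y) * log(1 - p)), so p = 0.8 with y = 1 gives -log(0.8).

```python
import math

import torch
import torch.nn as nn

# Small stand-in model; in the notebook this role is played by the CNN
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(4 * 101, 1), nn.Sigmoid())

# SGD over the model parameters; lr=0.01 is an assumed starting value
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.01)

# BCELoss expects probabilities in [0, 1] and float targets of the same shape
criterion = nn.BCELoss()

prob = torch.tensor([[0.8]])
target = torch.tensor([[1.0]])
loss = criterion(prob, target)
print(loss.item(), -math.log(0.8))  # both approximately 0.2231
```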


In [None]:
## 1.
## 2.


# Ex 9: Training your model

The following function trains the model for one epoch and returns the total loss accumulated over that epoch.
Implement the training pass for this function.



With PyTorch, one learning step generally consists of the following (after clearing old gradients with optimizer.zero_grad(), because gradients accumulate across backward passes):

1. Make a forward pass through the network
2. Use the network output to calculate the loss
3. Perform a backward pass through the network with loss.backward() to calculate the gradients
4. Take a step with the optimizer to update the weights
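
These steps can be sketched as a one-epoch loop. This is a minimal illustration on synthetic data, not the notebook's own function: `train_one_epoch_sketch`, `ToyDictDataset`, the small model and the learning rate are all assumptions. Batches are dicts with the keys 'DNA' and 'Class', as produced by DNADataset.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset


class ToyDictDataset(Dataset):
    """Stand-in dataset returning dict samples like DNADataset."""

    def __init__(self, dna, labels):
        self.dna, self.labels = dna, labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {'DNA': self.dna[idx], 'Class': self.labels[idx]}


def train_one_epoch_sketch(model, loader, criterion, optimizer):
    model.train()
    total_loss = 0.0
    for batch in loader:
        dna, y = batch['DNA'], batch['Class']
        optimizer.zero_grad()      # clear gradients left from the previous step
        out = model(dna)           # 1. forward pass
        loss = criterion(out, y)   # 2. loss from the network output
        loss.backward()            # 3. backward pass computes the gradients
        optimizer.step()           # 4. optimizer updates the weights
        total_loss += loss.item()
    return total_loss


torch.manual_seed(0)
toy_X = torch.rand(32, 4, 101)
toy_y = torch.randint(0, 2, (32, 1)).float()
toy_loader = DataLoader(ToyDictDataset(toy_X, toy_y), batch_size=8, shuffle=True)

toy_model = nn.Sequential(
    nn.Conv1d(4, 8, kernel_size=5, padding=2), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 101, 1), nn.Sigmoid(),
)
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.05)

# Repeating the call gives a multi-epoch training loop
epoch_losses = [
    train_one_epoch_sketch(toy_model, toy_loader, criterion, optimizer)
    for _ in range(3)
]
print(epoch_losses)
```

The returned value is the sum of the batch losses; dividing it by the number of batches gives a mean loss that is easier to compare across batch sizes.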



# Ex 11: Accuracy Calculation

Write a function named **compute_num_correct_pred(y_prob, y_label)** that computes the number of correct predictions. **y_prob** and **y_label** should be PyTorch tensors.

For example,
y_prob = [[0.3],[0.4], [0.8], [0.7]].

y = [[0], [1], [1], [0]].

This function should return 2.
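
The example above can be checked with a short sketch. It assumes a decision threshold of 0.5 (the exercise does not state one) and uses a hypothetical name, `count_correct_sketch`, to keep it separate from the function you are asked to write.

```python
import torch


def count_correct_sketch(y_prob, y_label, threshold=0.5):
    # Assumed rule: predict class 1 when the probability reaches the threshold
    y_pred = (y_prob >= threshold).float()
    # Both tensors must have the same shape, otherwise == would broadcast
    return int((y_pred == y_label.float()).sum().item())


y_prob = torch.tensor([[0.3], [0.4], [0.8], [0.7]])
y = torch.tensor([[0], [1], [1], [0]])

# Predictions are 0, 0, 1, 1; the first and third match the labels
print(count_correct_sketch(y_prob, y))  # 2
```

If the labels have shape (N,) while the probabilities have shape (N, 1), the comparison broadcasts to (N, N) and the count is wrong, so it is worth checking the shapes first.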

In [None]:
### TODO ####
def compute_num_correct_pred(y_prob, y_label):
  pass




The function below computes the accuracy of the model on the data served by a loader. Run it to check that your compute_num_correct_pred implementation is correct.

In [None]:
def test(loader):
  net.eval()

  correct = 0
  with torch.no_grad():
    for data in loader:
      dna = data['DNA']
      y = data['Class']

      out = net(dna)
      correct += compute_num_correct_pred(out, y)

  return correct / len(loader.dataset)

# Ex 12: Training the model

Write code to train your model for 10 epochs to check that everything works, then try adding more epochs.


# Ex13 (optional)

1. If we use torch.nn.BCEWithLogitsLoss(), what do we need to change in the definition of the model?


2. Same question for the torch.nn.CrossEntropyLoss() loss.
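
As a hint for these two questions, the sketch below compares the input each loss expects, on random values (all tensor names are made up for this example). BCEWithLogitsLoss applies the sigmoid internally, so it takes raw scores; CrossEntropyLoss takes one raw score per class and integer class indices as targets.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(6, 1)                     # raw scores, no sigmoid applied
targets = torch.randint(0, 2, (6, 1)).float()

# BCELoss on sigmoid(logits) and BCEWithLogitsLoss on the raw logits
loss_bce = nn.BCELoss()(torch.sigmoid(logits), targets)
loss_bce_logits = nn.BCEWithLogitsLoss()(logits, targets)
print(loss_bce.item(), loss_bce_logits.item())

# CrossEntropyLoss: two raw scores per sample, targets are class indices (long)
two_class_logits = torch.randn(6, 2)
class_indices = targets.squeeze(1).long()
loss_ce = nn.CrossEntropyLoss()(two_class_logits, class_indices)
print(loss_ce.item())
```

The first two values agree up to floating-point error, which suggests what happens to the final sigmoid when the loss is swapped. For CrossEntropyLoss, the shape of the last layer and the dtype of the labels both differ from the BCE setup.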