<a href="https://colab.research.google.com/github/Gyuheon-Song/Bioinformatics/blob/main/Bioinformatics_Tensorflow_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to use this tutorial



This tutorial uses a Colab notebook, an interactive computational environment that combines live code, visualizations, and explanatory text. To run this notebook, you may first need to **sign in with your Google account** and make a copy by choosing **File > Save a Copy in Drive** from the menu bar (saving may take a few moments).

The most powerful feature of Google Colab is free access to a cloud GPU. First, enable the GPU under **Runtime > Change runtime type > Hardware accelerator**. Then **click the Connect button at the top right of the page** to allocate server resources.
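
Once the runtime is connected, a quick check can confirm that TensorFlow actually sees the GPU. The sketch below relies on `tf.config.list_physical_devices`; the helper name `available_gpus` is introduced here for illustration and is not part of the original notebook. It returns an empty list when no GPU is attached or when TensorFlow is not installed.

```python
def available_gpus():
    """Return the names of the GPUs visible to TensorFlow (empty list if none)."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is preinstalled on Colab; this branch only matters elsewhere.
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = available_gpus()
print("GPUs visible to TensorFlow:", gpus if gpus else "none (training will run on the CPU)")
```

If the list is empty on Colab, the runtime type was probably not switched to GPU before connecting.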

Once you are connected to a runtime, **upload the sample data** to the server. Click the **'Files'** tab on the left side of the page and press the **'upload'** button at the top. Note that all uploaded data is deleted when the runtime disconnects, so you will need to upload it again after reconnecting.
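
Because uploaded files disappear when the runtime disconnects, it can help to verify that they are present before running the later cells. This is a minimal sketch using only the standard library; the helper name `missing_files` is hypothetical, and the two paths are the ones used later in this notebook (adjust them if you upload to a different location).

```python
from pathlib import Path

# File names follow the paths used later in this notebook; adjust if yours differ.
EXPECTED_FILES = [
    "/content/sample_data/cfp1_train.txt",
    "/content/sample_data/cfp1_test.txt",
]


def missing_files(paths):
    """Return the subset of `paths` that does not exist as a regular file."""
    return [str(p) for p in paths if not Path(p).is_file()]


missing = missing_files(EXPECTED_FILES)
if missing:
    print("Please upload (or re-upload) these files:", missing)
else:
    print("All dataset files are present.")
```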

The notebook is organized into a series of cells. You can modify the Python code and execute each cell as you would in a Jupyter notebook. To execute a cell, **click the run button at the top left of the code block.**



# 0. Background

In this tutorial, you will train a **recurrent neural network (RNN)** model that can recognize **Cfp1 endonuclease binding sites** (Cfp1 is a component of the CRISPR system) in given DNA sequences.

# 1. Set up the environment

In [70]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import Adam
from sklearn.utils import shuffle
import numpy as np


# 2. Set hyperparameters

In [91]:
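LEARNING_RATE = 0.01   # step size for the Adam optimizer
TOTAL_EPOCH = 30       # number of passes over the training set
BATCH_SIZE = 128       # sequences per gradient update

N_INPUT = 4            # vocabulary size: A, T, G, C
N_STEP = 34            # number of time steps (not referenced by the cells below)
N_HIDDEN = 32          # embedding dimension and number of LSTM units
N_CLASS = 2            # two output classes, one per one-hot label position

DISPLAY_STEP = 200     # not referenced by the cells below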
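
To make the interplay of `BATCH_SIZE` and `TOTAL_EPOCH` concrete, the sketch below computes how many gradient updates one epoch and the whole training run involve. The dataset size of 10,000 is a hypothetical figure chosen for illustration (the real size is returned by `load_dataset` later), and `steps_per_epoch` is a helper name introduced here.

```python
import math

BATCH_SIZE = 128
TOTAL_EPOCH = 30


def steps_per_epoch(num_examples, batch_size=BATCH_SIZE):
    """Number of batches needed to cover the dataset once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


num_examples = 10_000  # hypothetical size, for illustration only
steps = steps_per_epoch(num_examples)
print(steps)                # 79
print(steps * TOTAL_EPOCH)  # 2370
```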

# 3. Load datasets

In [92]:

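def load_dataset(dataset_file_path):
    """
    Read a tab-separated dataset with one "<sequence><TAB><activity>" record per line.

    Returns the integer-encoded sequences, the one-hot labels, and the number
    of records read.
    """
    dna_mapping = {"A": 0, "T": 1, "G": 2, "C": 3}
    data = []
    labels = []

    with open(dataset_file_path) as dataset_file:
        for line in dataset_file:
            line = line.strip()
            if not line:
                continue  # skip blank lines, e.g. a trailing empty line
            sequence, activity = line.split("\t")
            # encode each nucleotide as an integer index for the Embedding layer
            data.append([dna_mapping[base] for base in sequence])
            # one-hot label: [0, 1] when activity == 1, otherwise [1, 0]
            if int(activity) == 1:
                labels.append([0.0, 1.0])
            else:
                labels.append([1.0, 0.0])

    return data, labels, len(data)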

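def load_next_batch(train_x, train_y, batch_size, step):
    """
    Return the step-th mini-batch (0-based) of inputs and labels.
    The final batch may contain fewer than batch_size examples.
    """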
    start = batch_size * step
    end = start + batch_size
    batch_xs = train_x[start:end]
    batch_ys = train_y[start:end]

    return batch_xs, batch_ys
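
The later cells let `model.fit` handle batching, so `load_next_batch` is never called in this notebook. As a sketch of how the helper could be used in a manual loop, the example below repeats its slicing logic and iterates over a tiny toy dataset; the toy values are made up for illustration.

```python
def load_next_batch(train_x, train_y, batch_size, step):
    """Same slicing logic as the helper defined above."""
    start = batch_size * step
    end = start + batch_size
    return train_x[start:end], train_y[start:end]


# Toy data for illustration only: 5 encoded sequences and their one-hot labels.
toy_x = [[0, 1, 2, 3], [3, 2, 1, 0], [0, 0, 1, 1], [2, 2, 3, 3], [1, 0, 3, 2]]
toy_y = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

batch_size = 2
num_steps = -(-len(toy_x) // batch_size)  # ceiling division -> 3 steps

batch_sizes = []
for step in range(num_steps):
    batch_xs, batch_ys = load_next_batch(toy_x, toy_y, batch_size, step)
    batch_sizes.append(len(batch_xs))

print(batch_sizes)  # [2, 2, 1]: slicing past the end yields a shorter final batch
```

Requesting a step beyond the end of the data returns empty batches rather than raising an error, so a manual loop has to compute the number of steps itself.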


In [87]:
data_dir = "/"

train_file_path = data_dir + "content/sample_data/cfp1_train.txt"  # write your own file path
test_file_path = data_dir + "content/sample_data/cfp1_test.txt"

train_x, train_y, num_train = load_dataset(train_file_path)

train_x, train_y = shuffle(train_x, train_y)
train_x = np.array(train_x)
train_y = np.array(train_y)
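
`np.array(train_x)` only produces a regular 2-D integer array when every sequence has the same length. The sketch below checks this and counts the examples per class. It assumes that `N_STEP = 34` is the expected sequence length, which the notebook does not state explicitly; `summarize_dataset` is a helper name introduced here, and the toy input is made up.

```python
from collections import Counter

N_STEP = 34  # assumed to be the expected sequence length


def summarize_dataset(data, labels, expected_length=N_STEP):
    """Return (indices of sequences with the wrong length, count per class index)."""
    bad_indices = [i for i, seq in enumerate(data) if len(seq) != expected_length]
    # the class index is the position of the 1.0 in each one-hot label
    class_counts = Counter(list(label).index(1.0) for label in labels)
    return bad_indices, dict(class_counts)


# Toy input for illustration only; in the notebook, this is most useful on the
# lists returned by load_dataset, before they are converted with np.array.
toy_data = [[0] * 34, [1] * 34, [2] * 33]
toy_labels = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

bad, counts = summarize_dataset(toy_data, toy_labels)
print(bad)     # [2]
print(counts)  # {0: 2, 1: 1}
```

A strongly imbalanced class count is worth knowing before interpreting the accuracy reported in the evaluation step.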

# 4. Construct the model

In [88]:
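model = Sequential()
# map each nucleotide index (0-3) to a trainable N_HIDDEN-dimensional vector
model.add(Embedding(N_INPUT, N_HIDDEN))
# return_sequences=False keeps only the LSTM output at the last time step
model.add(LSTM(N_HIDDEN, return_sequences=False))
# softmax over the N_CLASS output classes
model.add(Dense(N_CLASS, activation='softmax'))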

# compile the model
optimizer = Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()

Model: "sequential_12"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_12 (Embedding)    (None, None, 8)           32        
                                                                 
 lstm_12 (LSTM)              (None, 8)                 544       
                                                                 
 dense_12 (Dense)            (None, 2)                 18        
                                                                 
Total params: 594 (2.32 KB)
Trainable params: 594 (2.32 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
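
The parameter counts in the summary can be reproduced by hand, which is a useful check on the layer sizes. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the function names are introduced here for illustration. The printed counts (32, 544, 18) correspond to a hidden size of 8 rather than the `N_HIDDEN = 32` set in section 2, which suggests the summary was captured from a run with a smaller hidden size.

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim  # one vector per token


def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    return input_dim * units + units  # weights plus biases


def total_params(vocab_size, hidden, n_class):
    return (embedding_params(vocab_size, hidden)
            + lstm_params(hidden, hidden)
            + dense_params(hidden, n_class))


print(embedding_params(4, 8), lstm_params(8, 8), dense_params(8, 2))  # 32 544 18
print(total_params(4, 8, 2))   # 594, matches the summary above
print(total_params(4, 32, 2))  # 8514, expected for N_HIDDEN = 32
```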


# 5. Train the model

In [89]:
model.fit(train_x, train_y, batch_size=BATCH_SIZE, epochs=TOTAL_EPOCH)

Epoch 1/30
...
Epoch 30/30


<keras.src.callbacks.History at 0x7bf7f98c5c30>
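
The `History` object returned by `model.fit` stores the per-epoch metrics in its `history` attribute, a dictionary that maps metric names (here `loss` and `accuracy`) to lists with one value per epoch. The sketch below shows one way to find the best epoch from such a dictionary; `best_epoch` is a hypothetical helper and the numbers are made up for illustration, not taken from this training run.

```python
def best_epoch(history_dict, metric="accuracy", higher_is_better=True):
    """Return (1-based epoch number, value) of the best entry for `metric`."""
    values = history_dict[metric]
    pick = max if higher_is_better else min
    best_index = pick(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]


# Made-up numbers for illustration; in the notebook you would keep the return
# value, e.g. history = model.fit(...), and pass history.history instead.
example_history = {
    "loss": [0.69, 0.52, 0.41, 0.43],
    "accuracy": [0.55, 0.74, 0.83, 0.82],
}

print(best_epoch(example_history))                                  # (3, 0.83)
print(best_epoch(example_history, "loss", higher_is_better=False))  # (3, 0.41)
```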

# 6. Evaluate the model

In [90]:
test_xs, test_ys, num_test = load_dataset(test_file_path)
test_xs = np.array(test_xs)
test_ys = np.array(test_ys)

_, accuracy = model.evaluate(test_xs, test_ys, batch_size=BATCH_SIZE)
print("Avg. accuracy: %.5f" % accuracy)

Avg. accuracy: 0.87209
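
Accuracy alone does not show which class the errors fall into. `model.predict(test_xs)` returns one row of softmax probabilities per sequence; the sketch below turns such rows into class predictions with an argmax and tallies a confusion matrix. The helper names are introduced here and the probabilities are made up for illustration, so the printed numbers say nothing about the model above.

```python
def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])


def confusion_counts(probabilities, one_hot_labels, n_class=2):
    """counts[t][p] = number of examples with true class t predicted as class p."""
    counts = [[0] * n_class for _ in range(n_class)]
    for probs, label in zip(probabilities, one_hot_labels):
        counts[argmax(label)][argmax(probs)] += 1
    return counts


# Made-up values for illustration; in the notebook, use
# model.predict(test_xs) and test_ys instead.
toy_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
toy_labels = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

counts = confusion_counts(toy_probs, toy_labels)
accuracy = sum(counts[i][i] for i in range(2)) / len(toy_labels)
print(counts)    # [[1, 0], [1, 2]]
print(accuracy)  # 0.75
```

The diagonal entries are the correct predictions, so their sum divided by the number of examples reproduces the accuracy.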
