## Recurrent Neural Network 

The third model is a Recurrent Neural Network (RNN) implemented in PyTorch. An RNN was chosen because it has a couple of advantages when searching a sequence for binding sites or other information the sequence carries. Chief among them, an RNN is designed to handle temporal dependencies, that is, the effect of previous inputs on the current decision. This is vital for sequence data, because the surrounding base pairs change the information conveyed by the base being observed.
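To make the idea of a temporal dependency concrete, here is a minimal pure-Python sketch of an Elman-style recurrence, the same form of update that `nn.RNN` applies by default with a tanh nonlinearity. The weights, the hidden size of 2, and the A, C, G, T column order are all made up for illustration and are not taken from our data or model.

```python
import math

# Hypothetical one-hot encoding; the column order A, C, G, T is an assumption.
ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
           "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

# Small fixed weights chosen only for illustration (hidden size 2, input size 4).
W_IH = [[0.5, -0.3, 0.8, 0.1],
        [-0.2, 0.7, 0.4, -0.6]]
W_HH = [[0.9, -0.4],
        [0.3, 0.6]]
BIAS = [0.0, 0.1]

def rnn_step(x, h_prev):
    """One recurrent step: h_t = tanh(W_ih x_t + W_hh h_{t-1} + b)."""
    h_new = []
    for i in range(len(h_prev)):
        total = BIAS[i]
        total += sum(w * xi for w, xi in zip(W_IH[i], x))
        total += sum(w * hj for w, hj in zip(W_HH[i], h_prev))
        h_new.append(math.tanh(total))
    return h_new

def final_hidden(sequence):
    """Run the recurrence over a DNA string and return the last hidden state."""
    h = [0.0, 0.0]  # zero initial hidden state
    for base in sequence:
        h = rnn_step(ONE_HOT[base], h)
    return h

# Both sequences end in G, but the bases before it differ, so the hidden
# state after reading the final G differs as well.
h_a = final_hidden("AAG")
h_b = final_hidden("TCG")
print(h_a, h_b)
```

Because each hidden state is computed from the previous one, the representation of the final base depends on everything that came before it, which is the property we wanted for binding-site detection.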

##### Imports

In [1]:
import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score
import pickle
import numpy as np
from sklearn.model_selection import train_test_split

##### Loading in data

In [2]:
def load_data(filename):
    with open(filename, 'rb') as f:
        data = pickle.load(f)
    return data

##### Making data usable 

In [3]:
pkl_file = "labeled_df.pkl"
test_df = load_data(pkl_file)
Seqs_X = test_df['One_Hot_Sequence'].values
Labels_y = test_df['Label'].values

# Stack the per-sequence one-hot encodings into one float32 array, then a tensor
Seqs_X_array = np.array([np.array(seq) for seq in Seqs_X])
Seqs_X_array = Seqs_X_array.astype(np.float32)
Seqs_X_tensor = torch.tensor(Seqs_X_array)

# Read each binary label vector as a base-2 number to get one integer class per sequence
Labels_y_array = np.array([int(''.join(map(str, label)), 2) for label in Labels_y])
Labels_y_array = Labels_y_array.astype(np.int64)
Labels_y_tensor = torch.tensor(Labels_y_array)


In [7]:
type(Labels_y_tensor)

torch.Tensor

#### Model

In [None]:
import torch
import torch.nn as nn
import numpy as np

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

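    def forward(self, x):
        # x has shape (batch, seq_len, input_size); start from a zero hidden state
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        out, _ = self.rnn(x, h0)
        # Classify from the output at the last time step only
        out = self.fc(out[:, -1, :])
        return out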

def train_model(model, criterion, optimizer, input_data, labels, num_epochs=10):
    model.train()
    for epoch in range(num_epochs):
        # Full-batch update: the entire dataset goes through the network in one pass
        optimizer.zero_grad()
        outputs = model(input_data)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')


input_size = 4  # one-hot channels per base
hidden_size = 64
num_layers = 2
# CrossEntropyLoss needs every target in [0, output_size). The labels are integers
# decoded from binary vectors, so size the output by the largest label rather than
# by the number of distinct labels.
output_size = int(Labels_y_array.max()) + 1

model = RNN(input_size, hidden_size, num_layers, output_size)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

num_epochs = 1
train_model(model, criterion, optimizer, Seqs_X_tensor, Labels_y_tensor, num_epochs)

### Results and issues 

We encountered several challenges with this model that made it impossible to obtain results for the recurrent neural network. The first hurdle was getting the output of the data preparation step to work with the model architecture. This required converting the data from a pandas.core.series.Series to a numpy.ndarray and finally to a torch.Tensor. Once those steps were understood, we moved on to the model itself.
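The conversion chain can be sketched on toy data as follows. This is an illustration under assumptions rather than our exact pipeline: the object array stands in for what `test_df['One_Hot_Sequence'].values` returns, and the two three-base sequences are invented. Using `np.stack` and the final shape check are additions that help surface ragged or mis-shaped sequences before they reach the model.

```python
import numpy as np
import torch

# Toy stand-in for test_df['One_Hot_Sequence'].values: an object array whose
# elements are lists of one-hot rows (2 sequences, 3 bases each). Hypothetical data.
toy_seqs = np.empty(2, dtype=object)
toy_seqs[0] = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
toy_seqs[1] = [[0, 0, 0, 1], [0, 0, 1, 0], [1, 0, 0, 0]]

# Stack into one dense float32 array. np.stack raises a ValueError if the
# sequences have different lengths, which exposes ragged input early.
dense = np.stack([np.asarray(seq, dtype=np.float32) for seq in toy_seqs])

# torch.from_numpy shares memory with the array instead of copying it.
seq_tensor = torch.from_numpy(dense)

# nn.RNN with batch_first=True expects input of shape (batch, seq_len, input_size).
assert seq_tensor.ndim == 3 and seq_tensor.shape[-1] == 4
print(seq_tensor.shape, seq_tensor.dtype)
```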

The model was built around the PyTorch implementation of a recurrent neural network and brought its own set of challenges, which we could not overcome. We experimented with many different parameters, but on multiple computers the training run froze the machine and left it unusable until the program closed. Once the code could be run again, there was no message indicating what had frozen or why, which left us unsure how to proceed.
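We did not identify the cause of the freezes, so the following is a guess rather than a diagnosis. One plausible contributor is that `train_model` sends the entire dataset through the network in a single forward pass, so memory use grows with the number of sequences times their length. A minimal sketch of mini-batch training with `torch.utils.data.DataLoader` is shown below. The function name `train_in_batches`, the `TinyRNN` class, the batch size of 32, and the random toy tensors are all hypothetical and not part of our original code.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_in_batches(model, criterion, optimizer, inputs, labels,
                     num_epochs=1, batch_size=32):
    """Train on mini-batches so only batch_size sequences are processed at once."""
    loader = DataLoader(TensorDataset(inputs, labels),
                        batch_size=batch_size, shuffle=True)
    epoch_losses = []
    for epoch in range(num_epochs):
        model.train()
        running = 0.0
        for batch_x, batch_y in loader:
            optimizer.zero_grad()
            loss = criterion(model(batch_x), batch_y)
            loss.backward()
            optimizer.step()
            # Weight each batch loss by its size to get a per-sample average
            running += loss.item() * batch_x.size(0)
        epoch_losses.append(running / len(loader.dataset))
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {epoch_losses[-1]:.4f}')
    return epoch_losses

# Tiny stand-in model for the check below (not the RNN class defined earlier).
class TinyRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(4, 8, batch_first=True)
        self.fc = nn.Linear(8, 3)

    def forward(self, x):
        out, _ = self.rnn(x)  # hidden state defaults to zeros
        return self.fc(out[:, -1, :])

# Synthetic check on random data: 20 sequences, 15 steps, 4 channels, 3 classes.
torch.manual_seed(0)
toy_x = torch.randn(20, 15, 4)
toy_y = torch.randint(0, 3, (20,))
toy_model = TinyRNN()
losses = train_in_batches(toy_model, nn.CrossEntropyLoss(),
                          torch.optim.Adam(toy_model.parameters(), lr=0.001),
                          toy_x, toy_y, num_epochs=2, batch_size=8)
```

If the freezes were memory related, lowering `batch_size` would be the first thing to try. If they persisted even at a batch size of 1, the cause would more likely lie elsewhere, for example in very long sequences or in the environment itself.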

These challenges highlight the complexity of developing and training neural network models, particularly for tasks involving large and complex datasets. While networks like this offer promising opportunities for addressing complex problems in bioinformatics, putting them into practice is difficult, to say the least. Future research in this field would benefit from collaboration between biochemistry experts, machine learning practitioners, and computer hardware specialists to mitigate technical challenges and optimize model performance.


