# Accurately predicting human essential genes based on deep learning

This section presents a comparative analysis of PyTorch models applied to a sequence-based prediction problem: identifying human essential genes.

We attempt to replicate the [DeepHE: Accurately predicting human essential genes based on deep learning](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008229) deep learning model, evaluate its performance, and compare other models against it.

DeepHE's model is a multilayer perceptron with one input layer, three hidden layers, and one output layer. All hidden layers use the ReLU activation function, and a dropout layer follows each of them. The output layer uses a sigmoid activation function to perform binary classification, and the loss function is binary cross-entropy.
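
The architecture described above can be sketched in PyTorch as follows. This is only an illustration, not the `Net` class used later in this notebook: the class name `MLPSketch`, the hidden layer sizes, and the input width of 20 are assumptions made for the example, and the dropout rate of 0.2 simply mirrors the value in the parameter dictionary further down.

```python
import torch
from torch import nn


class MLPSketch(nn.Module):
    """Illustrative MLP: three hidden layers (ReLU + dropout) and a sigmoid output."""

    def __init__(self, n_features, hidden=(128, 256, 512), dropout=0.2):
        super().__init__()
        layers = []
        in_dim = n_features
        # hidden sizes are hypothetical; each hidden layer is Linear -> ReLU -> Dropout
        for out_dim in hidden:
            layers += [nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Dropout(dropout)]
            in_dim = out_dim
        # single output unit with sigmoid: probability of the positive class
        layers += [nn.Linear(in_dim, 1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)


torch.manual_seed(0)
sketch_model = MLPSketch(n_features=20)
sketch_x = torch.randn(8, 20)                  # toy batch of 8 samples
sketch_y = torch.randint(0, 2, (8,)).float()   # toy binary labels
sketch_probs = sketch_model(sketch_x)
# binary cross-entropy pairs with the sigmoid output
sketch_loss = nn.BCELoss()(sketch_probs, sketch_y)
```

With a sigmoid output the targets must be floats in {0, 1}; a two-logit output combined with `nn.CrossEntropyLoss` is an alternative formulation of the same binary problem.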

In [None]:
%load_ext autoreload

import sys
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.preprocessing import StandardScaler

sys.path.append('../../../../src/')
from propythia.shallow_ml import ShallowML

from descriptors import DNADescriptor

## Classification using DNA descriptors

The dataset was already built and preprocessed in the "essential_genes.ipynb" notebook.

In [None]:
fps_x = pd.read_csv("fps_x.csv")
print(fps_x.shape)
print(fps_x.head())

print("-" * 30)

fps_y = pd.read_csv("fps_y.csv")
print(fps_y.shape)
print(fps_y.head())

As we can see, this dataset contains the sequences and their corresponding positive/negative class labels, where a positive label indicates an essential gene. The examples are not evenly distributed across the two classes.

In [None]:
# plot the distribution of each class
fps_y.groupby('label').size().plot(kind='bar')
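
Class imbalance like this is often handled by weighting the rarer class more heavily in the loss. The sketch below shows one common way to derive such weights from label counts. The toy labels and the names `toy_y` and `toy_class_weight` are assumptions for illustration; the ratio in the real `fps_y` may differ.

```python
import pandas as pd

# Toy labels standing in for fps_y (assumption: 1 = essential, 0 = non-essential).
toy_y = pd.DataFrame({"label": [0] * 8 + [1] * 2})

# Number of examples per class, ordered by class label.
counts = toy_y["label"].value_counts().sort_index()

# Weight each class by how much rarer it is than the majority class.
toy_class_weight = (counts.max() / counts).to_dict()

print(counts.to_dict())
print(toy_class_weight)
```

In PyTorch, weights of this kind can be passed as a tensor through the `weight` argument of `nn.CrossEntropyLoss`.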

We now need to split the dataset into training and test sets, then standardize the features with a scaler fitted on the training set only.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(fps_x, fps_y, stratify=fps_y)

scaler = StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

## Classification using One Hot Encoding

In [None]:
import torch
import torch.utils.data as data_utils
import src.encoding as enc
import os
from torch import nn
from torch.optim import Adam

In [None]:
fps_x = pd.read_csv("fps_x.csv")
print(fps_x.shape)
print(fps_x.head())

print("-" * 30)

fps_y = pd.read_csv("fps_y.csv")
print(fps_y.shape)
print(fps_y.head())

In [None]:
# plot the distribution of each class
fps_y.groupby('label').size().plot(kind='bar')

We now need to split the dataset into training, validation, and test sets (60%, 20%, and 20% of the data, respectively).

In [None]:
x, x_test, y, y_test = train_test_split(
    fps_x, fps_y,
    test_size=0.2,
    train_size=0.8,
    stratify=fps_y
)
x_train, x_cv, y_train, y_cv = train_test_split(
    x, y,
    test_size=0.25,
    train_size=0.75,
    stratify=y
)
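
The two-step split yields a 60/20/20 partition because the second call takes 25% of the remaining 80%. A small check on toy data makes the arithmetic concrete. The arrays and the `random_state` value below are assumptions for the example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 20% positives (illustrative only).
toy_x = np.arange(100).reshape(-1, 1)
toy_labels = np.array([0] * 80 + [1] * 20)

# First split: hold out 20% as the test set.
rest_x, test_x, rest_y, test_y = train_test_split(
    toy_x, toy_labels, test_size=0.2, stratify=toy_labels, random_state=0
)
# Second split: 25% of the remaining 80% becomes the validation set.
train_x, val_x, train_y, val_y = train_test_split(
    rest_x, rest_y, test_size=0.25, stratify=rest_y, random_state=0
)

split_sizes = (len(train_x), len(val_x), len(test_x))
print(split_sizes)
```

Stratifying both calls keeps the positive/negative ratio the same in all three subsets.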

Now we need to one-hot encode the sequences.

In [None]:
x_train_enc = enc.DNAEncoding(x_train)
x_train = x_train_enc.one_hot_encode()

x_test_enc = enc.DNAEncoding(x_test)
x_test = x_test_enc.one_hot_encode()

x_cv_enc = enc.DNAEncoding(x_cv)
x_cv = x_cv_enc.one_hot_encode()

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
print(x_cv.shape)
print(y_cv.shape)
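
For readers unfamiliar with one-hot encoding, the idea is that each nucleotide becomes a vector with a single 1 in the position assigned to that base. The function below is a minimal sketch of that idea and is not the implementation of `enc.DNAEncoding`; the name `one_hot_sketch`, the `ACGT` column order, and the all-zero treatment of unknown bases are assumptions.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed column order; the real encoder may use a different one


def one_hot_sketch(seq):
    """Return a (len(seq), 4) array with one 1 per row; unknown bases stay all-zero."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded


toy_encoded = one_hot_sketch("ACGTN")
print(toy_encoded.shape)
```

A batch of equal-length sequences encoded this way stacks into an array of shape `(n_sequences, sequence_length, 4)`.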

In [None]:
# convert to torch.tensor; labels are taken from the 'label' column as 1-D arrays,
# since torch.tensor cannot consume a DataFrame directly
train_data = data_utils.TensorDataset(
    torch.tensor(x_train, dtype=torch.float),
    torch.tensor(y_train['label'].to_numpy(), dtype=torch.long)
)
test_data = data_utils.TensorDataset(
    torch.tensor(x_test, dtype=torch.float),
    torch.tensor(y_test['label'].to_numpy(), dtype=torch.long)
)
valid_data = data_utils.TensorDataset(
    torch.tensor(x_cv, dtype=torch.float),
    torch.tensor(y_cv['label'].to_numpy(), dtype=torch.long)
)

batch_size = 16

# Data loader
trainloader = data_utils.DataLoader(
    train_data,
    shuffle=True,
    batch_size=batch_size
)
# evaluation loaders do not need shuffling
testloader = data_utils.DataLoader(
    test_data,
    shuffle=False,
    batch_size=batch_size
)
validloader = data_utils.DataLoader(
    valid_data,
    shuffle=False,
    batch_size=batch_size
)

Next, we build a model equivalent to the one in the paper.

In [None]:
from src.models import Net
from src.train import traindata
from src.test import test

In [None]:
torch.manual_seed(2022)
os.environ["CUDA_VISIBLE_DEVICES"] = '4,5'
# fall back to the CPU when no GPU is available
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model = Net().to(device)

paramDict = {
    'epoch': 200,
    'batchSize': 32,
    'dropOut': 0.2,
    'learning_rate': 0.004,
    'loss': nn.CrossEntropyLoss(),
    'metrics': ['accuracy'],
    'activation1': 'relu',
    'activation2': 'sigmoid',
    'monitor': 'val_accuracy',
    'save_best_only': True,
    'mode': 'max'
}

class_weight = {0: 1.0, 1: 4.0}

optimizerDict = {
    # PyTorch's Adam takes `lr` and `betas`, not the Keras-style argument names
    'adam': Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999)),
}

# epochs = 100
# lr = 0.004
# loss_function = nn.CrossEntropyLoss()

#optimizer = Adam(model.parameters(), lr=lr)

# the key defined in paramDict is 'epoch'
model = traindata(device, model, paramDict['epoch'], optimizerDict['adam'], paramDict['loss'], trainloader, validloader)

# Test
acc, mcc, report = test(device, model, testloader)
print('Accuracy: %.3f' % acc)
print('MCC: %.3f' % mcc)
print(report)
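
Reporting MCC alongside accuracy matters here because the classes are imbalanced: a classifier that always predicts the majority class can reach high accuracy while learning nothing. The toy example below illustrates this with scikit-learn. The labels and variable names are made up for the illustration and are not results from the model above.

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Toy ground truth with 80% negatives (illustrative only).
toy_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A degenerate classifier that always predicts the negative class.
always_negative = [0] * 10

toy_acc = accuracy_score(toy_true, always_negative)
toy_mcc = matthews_corrcoef(toy_true, always_negative)

print('Accuracy: %.3f' % toy_acc)
print('MCC: %.3f' % toy_mcc)
```

The accuracy looks respectable while the MCC stays at zero, which is why MCC is the more informative number for this dataset.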