### Padding Comparison

According to Reddy and Reddy (2019), there is little difference in performance between pre- and post-padding in LSTMs, unlike with CNNs. This notebook makes a more comprehensive comparison by building models for upstream, downstream, and split padding. All input data is used in these models except for rows whose spacer sequence ('spacs' column) has a length other than 16, 17, or 18. The excluded spacer sequences were synthetically designed and range in length from 0 to 31.

Padding makes all the inputs equal in length by adding zeros or other "filler" data outside the actual data in an input matrix. In convolutional networks, the primary purpose of padding is to preserve the spatial size of the input so that the output after applying filters (kernels) remains the same size, or to adjust it to the desired output dimensions.

In a CNN, padding is applied before the convolution operation. As a filter scans the input data, padding ensures that the filter properly covers the border areas, which allows more accurate feature extraction. This lets the network learn from the entire input without a bias towards the center of each image.

The amount of padding needed depends on the size of the filter (also known as the kernel) and the desired output size. For a filter of size F×F (with F odd), an input of size N×N, and a stride of 1, 'same' padding typically adds (F-1)/2 rows of zeros to both the top and bottom of the input and the same number of columns of zeros to the left and right sides.
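
The arithmetic above can be checked with a short sketch. The helper names `same_padding` and `conv_output_size` are hypothetical and not part of this notebook, and the sketch assumes a stride of 1 and an odd filter size.

```python
def same_padding(filter_size):
    """Zeros to add on each side for 'same' padding (assumes stride 1, odd filter size)."""
    if filter_size % 2 == 0:
        raise ValueError("this sketch only handles odd filter sizes")
    return (filter_size - 1) // 2


def conv_output_size(input_size, filter_size, padding, stride=1):
    """Output length along one dimension of a convolution."""
    return (input_size + 2 * padding - filter_size) // stride + 1


pad = same_padding(3)               # one row/column of zeros on each side
out = conv_output_size(28, 3, pad)  # output size matches the input size
print(pad, out)                     # 1 28
```

Without padding, the same 3×3 filter would shrink a 28×28 input to 26×26.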

Source: https://deepai.org/machine-learning-glossary-and-terms/padding


In [None]:
import pandas as pd
import numpy as np

In [None]:
# import dataset into a pandas data frame

df = pd.read_csv('41467_2022_32829_MOESM5_ESM.csv')
df.head()

In [None]:
# All input and output data

X = df[['UP', 'h35', 'spacs', 'h10', 'disc', 'ITR']]
y = df['Observed log(TX/Txref)']

X.head()

In [None]:
# stores the various input approaches
X_dict = {}

# stores split training/testing
train_test = {}

# stores the results
results = {}

# stores the models
models = {}

# stores the model history
model_history = {}

In [None]:
# Import the Keras components used to build and train the models
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.callbacks import EarlyStopping
from keras.optimizers import Adam

def build_model(current_model):
    # Define RNN model architecture
    models[current_model] = Sequential()
    models[current_model].add(LSTM(64, input_shape=X_dict[current_model].shape[1:])) # dynamically generated input shape based on X data
    models[current_model].add(Dense(1, activation='linear'))

    # Compile the model
    optimizer = Adam(learning_rate=0.001)
    models[current_model].compile(optimizer=optimizer, loss='mean_squared_error')

    # Early stopping to prevent overfitting
    early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

    # Train the model
    history = models[current_model].fit(train_test[current_model]['X_train'],
                                    train_test[current_model]['y_train'],
                                    epochs=150,
                                    batch_size=32,
                                    validation_data=(train_test[current_model]['X_test'], train_test[current_model]['y_test']),
                                    callbacks=[early_stopping])

    # Evaluate the model
    loss = models[current_model].evaluate(train_test[current_model]['X_test'], train_test[current_model]['y_test'])

    return models[current_model], loss, history

In [None]:
# remove all rows with spacer sequences that are not 16-18 nucleotides long


_df = df[(df['spacs'].str.len() >= 16) & (df['spacs'].str.len() <= 18)]

X = _df[['UP', 'h35', 'spacs', 'h10', 'disc', 'ITR']]
y = _df['Observed log(TX/Txref)']

print(f'Removed {df.shape[0] - _df.shape[0]} rows')


In [None]:
# Function to one-hot encode DNA sequences

def padded_one_hot_encode(sequence):
    # Channels are A, C, G, T, plus a fifth channel for the '0' padding character
    mapping = {'A': [1,0,0,0,0], 'C': [0,1,0,0,0], 'G': [0,0,1,0,0], 'T': [0,0,0,1,0], '0': [0,0,0,0,1]}
    encoding = []
    for nucleotide in sequence:
        encoding.append(mapping[nucleotide])
    return encoding

## Upstream Padding
Add padding upstream (before) each sequence to standardize the length, then concatenate the sequences.
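
As a minimal illustration of upstream padding on plain strings, the sketch below pads a few toy sequences to a common length before encoding. The helper `pad_upstream` and the toy sequences are hypothetical; the `'0'` pad character matches the one used by `padded_one_hot_encode`.

```python
def pad_upstream(seq, target_len, pad_char='0'):
    """Prepend pad_char until seq reaches target_len (hypothetical helper)."""
    return pad_char * (target_len - len(seq)) + seq


seqs = ['ACGT', 'AC', 'GTA']  # toy sequences, not taken from the dataset
max_len = max(len(s) for s in seqs)
padded = [pad_upstream(s, max_len) for s in seqs]
print(padded)  # ['ACGT', '00AC', '0GTA']
```

Every padded sequence ends with its original nucleotides, so the real signal sits at the last time steps seen by the LSTM.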

In [None]:
upstream_padding_full = {}

for col in X.columns:
    max_len = X[col].apply(len).max()
    upstream_padding_full[col] = np.array([padded_one_hot_encode('0' * (max_len - len(seq)) + seq) for seq in X[col]])

# Concatenate the one-hot encoded, upstream-padded sequences
X_dict['upstream_padding_full'] = np.concatenate([upstream_padding_full[col] for col in X.columns], axis=1)
X_dict['upstream_padding_full'].shape

In [None]:
# split the data in training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_dict['upstream_padding_full'], y, test_size=0.2, random_state=1, shuffle=True)
train_test['upstream_padding_full'] = {'X_train': X_train, 'X_test': X_test, 'y_train': y_train, 'y_test': y_test}

In [None]:
# Call the function to build the model, save the model, and store the results

m = 'upstream_padding_full'

models[m], results[m], model_history[m] = build_model(m)
models[m].save(m + '.keras')

## Downstream Padding
Add padding downstream (after) each sequence to standardize the length, then concatenate the sequences.
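
A comparable sketch for downstream padding uses Python's built-in `str.ljust`, which fills the right-hand side of a string up to a given width. The helper `pad_downstream` and the toy sequences are hypothetical.

```python
def pad_downstream(seq, target_len, pad_char='0'):
    """Append pad_char until seq reaches target_len (hypothetical helper)."""
    # str.ljust keeps seq at the left and fills the remaining width with pad_char
    return seq.ljust(target_len, pad_char)


seqs = ['ACGT', 'AC', 'GTA']  # toy sequences, not taken from the dataset
max_len = max(len(s) for s in seqs)
padded = [pad_downstream(s, max_len) for s in seqs]
print(padded)  # ['ACGT', 'AC00', 'GTA0']
```

Here the original nucleotides come first and the filler occupies the final positions of each sequence.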

In [None]:
downstream_padding_full = {}

for col in X.columns:
    max_len = X[col].apply(len).max()
    downstream_padding_full[col] = np.array([padded_one_hot_encode(seq + '0' * (max_len - len(seq))) for seq in X[col]])

# Concatenate the one-hot encoded, downstream-padded sequences
X_dict['downstream_padding_full'] = np.concatenate([downstream_padding_full[col] for col in X.columns], axis=1)
X_dict['downstream_padding_full'][0]

In [None]:
# split the data in training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_dict['downstream_padding_full'], y, test_size=0.2, random_state=1, shuffle=True)
train_test['downstream_padding_full'] = {'X_train': X_train, 'X_test': X_test, 'y_train': y_train, 'y_test': y_test}

In [None]:
# Call the function to build the model, save the model, and store the results

m = 'downstream_padding_full'

models[m], results[m], model_history[m] = build_model(m)
models[m].save(m + '.keras')

## Split Padding
Add half the padding upstream (before) and half the padding downstream (after) each sequence to standardize the length, then concatenate the sequences. When the amount of padding is odd, the extra position goes downstream.
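
The split can be illustrated on plain strings, including the case where the amount of padding is odd. The helper `pad_split` is hypothetical; it uses floor division for the upstream share, so any leftover position lands downstream.

```python
def pad_split(seq, target_len, pad_char='0'):
    """Pad both ends of seq to target_len (hypothetical helper)."""
    total = target_len - len(seq)
    left = total // 2     # upstream share, rounded down
    right = total - left  # downstream share takes any odd remainder
    return pad_char * left + seq + pad_char * right


print(pad_split('ACG', 5))  # '0ACG0' (two pads, split evenly)
print(pad_split('AC', 5))   # '0AC00' (three pads, extra one downstream)
```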

In [None]:
split_padding_full = {}

for col in X.columns:
    max_len = X[col].apply(len).max()
    split_padding_full[col] = np.array([padded_one_hot_encode('0' * ((max_len - len(seq)) // 2) +
                                                              seq + '0' * ((max_len - len(seq) + 1) // 2)) for seq in X[col]])

# Concatenate the one-hot encoded, split-padded sequences
X_dict['split_padding_full'] = np.concatenate([split_padding_full[col] for col in X.columns], axis=1)
X_dict['split_padding_full'][0]

In [None]:
# split the data in training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_dict['split_padding_full'], y, test_size=0.2, random_state=1, shuffle=True)
train_test['split_padding_full'] = {'X_train': X_train, 'X_test': X_test, 'y_train': y_train, 'y_test': y_test}

In [None]:
# Call the function to build the model, save the model, and store the results

m = 'split_padding_full'

models[m], results[m], model_history[m] = build_model(m)
models[m].save(m + '.keras')