### The Task

In this challenge, we want to train a classifier for sequences of genetic code.

Each sequence is a string over the letters ['A', 'C', 'G', 'T'] and belongs to one of five classes labelled [0, …, 4].
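As a minimal sketch of how such a string can be turned into numeric input for a network, each letter can be mapped to an index and then to a one-hot row. The names `ALPHABET`, `CHAR_TO_INDEX` and `one_hot_encode` are illustrative assumptions, as is the letter order A, C, G, T.

```python
import numpy as np

# Hypothetical helper names; the index order A, C, G, T is an assumption.
ALPHABET = "ACGT"
CHAR_TO_INDEX = {char: i for i, char in enumerate(ALPHABET)}

def one_hot_encode(sequence):
    """Return an array of shape (len(sequence), 4) with a single 1.0 per row."""
    indices = np.array([CHAR_TO_INDEX[char] for char in sequence])
    # Row i of the identity matrix is the one-hot vector for index i.
    return np.eye(len(ALPHABET), dtype=np.float32)[indices]

example = one_hot_encode("GATTACA")
print(example.shape)  # one row per character, one column per letter
print(example[0])     # the row for 'G'
```

Because every position becomes a vector of length 4, sequences of different lengths differ only in their first dimension.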

For training purposes, you will find 400 labelled sequences, each of length 400 characters (sequences: data_x, labels: data_y).

To validate your model, you have a further 100 labelled sequences (val_x, val_y) with 1200 characters each.

Finally, you have 250 unlabeled sequences (test_x, 2000 characters) which need to be classified.

Hint: Training recurrent networks is very expensive! Do not start working on this challenge too late or you will not manage to finish in time.

Your task is to train an RNN-based classifier and make a prediction for the missing labels of the test set (test_x in the attached archive). Store your prediction as a one-dimensional numpy.ndarray, save this array as prediction.npy, and upload this file to the KVV.
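The required file format can be checked with a small round trip before uploading. The sketch below uses randomly generated dummy labels in place of real model output and writes to a temporary directory, so it does not touch an actual submission; `dummy_predictions` is a hypothetical name.

```python
import os
import tempfile

import numpy as np

# Dummy labels stand in for real model output (assumption for illustration only).
rng = np.random.default_rng(0)
dummy_predictions = rng.integers(0, 5, size=250)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "prediction.npy")
    np.save(path, dummy_predictions)
    reloaded = np.load(path)

# The reloaded array should be one-dimensional, have one entry per test
# sequence, and contain only the class labels 0 to 4.
assert reloaded.ndim == 1
assert reloaded.shape == (250,)
assert set(reloaded.tolist()) <= set(range(5))
```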

You will receive points based on the accuracy you achieve:

| accuracy | points |
|----------|--------|
| ≥ 95%    | 10     |
| ≥ 90%    | 7      |
| ≥ 85%    | 5      |
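To estimate the score from the validation set, the thresholds above can be written as a small lookup. This is a sketch with hypothetical helper names (`accuracy`, `points_for_accuracy`). It assumes that an accuracy below 85% earns 0 points, which the table does not state explicitly.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def points_for_accuracy(acc):
    """Map an accuracy in [0, 1] to points using the thresholds from the table."""
    if acc >= 0.95:
        return 10
    if acc >= 0.90:
        return 7
    if acc >= 0.85:
        return 5
    return 0  # assumption: the table lists no points below 85%

# Toy example: three of four labels match.
toy_accuracy = accuracy([0, 1, 2, 3], [0, 1, 2, 4])
print(toy_accuracy, points_for_accuracy(toy_accuracy))
```

Validation accuracy is only an estimate here, since the validation and test sequences have different lengths.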

### Imports

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf  # the module is named `tensorflow`; `tensorflow-gpu` is not a valid import name
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, BatchNormalization, Dropout, Bidirectional
from tensorflow.keras.utils import to_categorical


### Solution

In [None]:

# 1. Data Loading
with np.load('rnn-challenge-data.npz') as fh:
    x_train, y_train = fh['data_x'], fh['data_y']
    x_val, y_val     = fh['val_x'], fh['val_y']
    x_test           = fh['test_x']

# 2. Vectorization (integer-encode each letter, then one-hot with tf.one_hot)
mapping = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
def encode_sequences(data):
    # Converts 'A' -> 0, 'C' -> 1, etc., then to one-hot
    encoded = np.array([[mapping[char] for char in seq] for seq in data])
    return tf.one_hot(encoded, depth=4).numpy()

x_train_vec = encode_sequences(x_train)
x_val_vec   = encode_sequences(x_val)
x_test_vec  = encode_sequences(x_test)

# Convert labels to categorical (One-Hot)
y_train_ohe = to_categorical(y_train, num_classes=5)
y_val_ohe   = to_categorical(y_val, num_classes=5)

# 3. Model Architecture
model = Sequential([
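    # The bidirectional LSTM reads each sequence forwards and backwards.
    # input_shape=(None, 4) leaves the sequence length open, so the same model
    # accepts the 400-, 1200- and 2000-character sequences.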
    Bidirectional(LSTM(64, return_sequences=False), input_shape=(None, 4)),
    BatchNormalization(), # Stabilizes training and allows higher learning rates
    Dropout(0.2),         # Prevents overfitting
    Dense(32, activation='relu'),
    Dense(5, activation='softmax') # Use softmax for multi-class classification
])

model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy']
)

# 4. Training with EarlyStopping (stop once val_accuracy has not improved for 20 epochs)
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_accuracy', 
    patience=20, 
    restore_best_weights=True
)

history = model.fit(
    x_train_vec, y_train_ohe,
    validation_data=(x_val_vec, y_val_ohe),
    epochs=200, # upper bound; EarlyStopping may end training earlier
    batch_size=32,
    callbacks=[early_stop]
)

# 5. Prediction & Saving
predictions = np.argmax(model.predict(x_test_vec), axis=1)
np.save('prediction.npy', predictions)

In [None]:
# Predict class labels for x_test
predictions = model.predict(x_test_vec)       # class probabilities, one row per test sequence
predictions = np.argmax(predictions, axis=1)  # most likely class per sequence
print(predictions.shape)
print(predictions)


In [None]:
# MAKE SURE THAT YOU HAVE THE RIGHT FORMAT
assert predictions.ndim == 1
assert predictions.shape[0] == 250

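# AND SAVE EXACTLY AS SHOWN BELOW
# np.save does not create missing directories, so make sure 'results' exists first
import os
os.makedirs('results', exist_ok=True)
np.save('results/prediction.npy', predictions)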