### The Task

In this challenge, we want to train a classifier for sequences of genetic code.

Each sequence is represented by a string of letters ['A', 'C', 'G', 'T'] and belongs to one of five categories/classes labelled [0,…,4].

For training purposes, you will find 400 labelled sequences, each of length 400 characters (sequences: data_x, labels: data_y).

To validate your model, you have a further 100 labelled sequences (val_x, val_y) with 1200 characters each.

Finally, you have 250 unlabeled sequences (test_x, 2000 characters) which need to be classified.
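Because the three splits use different sequence lengths (400, 1200 and 2000 characters), the encoded input cannot have a fixed time dimension. The following is a minimal sketch of one-hot encoding with NumPy. The alphabet order, the helper name `one_hot` and the toy sequences are assumptions made for illustration and are not part of the challenge data.

```python
import numpy as np

# Assumed alphabet order; any fixed order works if it is used consistently.
ALPHABET = "ACGT"
LOOKUP = {ch: i for i, ch in enumerate(ALPHABET)}

def one_hot(seqs):
    """Encode equal-length strings as an array of shape (n, length, 4)."""
    idx = np.array([[LOOKUP[ch] for ch in s] for s in seqs])
    # Indexing the identity matrix with integer codes yields one-hot rows.
    return np.eye(len(ALPHABET), dtype=np.float32)[idx]

toy_short = ["ACGT", "TTGA"]   # hypothetical stand-ins for short training sequences
toy_long = ["ACGTACGTAC"]      # hypothetical stand-in for a longer test sequence

short_vec = one_hot(toy_short)
long_vec = one_hot(toy_long)
print(short_vec.shape, long_vec.shape)
```

Only the middle (time) dimension differs between the two arrays, which is why a recurrent model that accepts inputs of shape `(None, 4)` can be trained on short sequences and applied to longer ones.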

Hint: Training recurrent networks is very expensive! Do not start working on this challenge too late or you will not manage to finish in time.

Your task is to train an RNN-based classifier and make a prediction for the missing labels of the test set (test_x in the attached archive). Store your prediction as a one-dimensional numpy.ndarray, save this array as prediction.npy, and upload this file to the KVV.

You will receive points based on the accuracy you achieve, as shown in the following table:

| accuracy | points |
|----------|--------|
| ≥ 95%    | 10     |
| ≥ 90%    | 7      |
| ≥ 85%    | 5      |
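To estimate where a model stands before submitting, the thresholds can be applied to the validation accuracy. This is a small sketch. The function name `points_for_accuracy` is hypothetical, and returning 0 points below 85% is an assumption, since the task does not state what happens in that case.

```python
def points_for_accuracy(acc):
    """Map an accuracy in [0, 1] to points using the thresholds from the task."""
    for threshold, points in ((0.95, 10), (0.90, 7), (0.85, 5)):
        if acc >= threshold:
            return points
    return 0  # assumption: no points below 85%

for acc in (0.96, 0.91, 0.85, 0.80):
    print(acc, points_for_accuracy(acc))
```

Validation accuracy on 1200-character sequences is only a proxy for the score on the 2000-character test sequences, so the estimate should be read with some caution.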

### Imports

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, BatchNormalization, Dropout, Bidirectional
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau



### Solution

In [2]:

# 1. Data Loading
with np.load('rnn-challenge-data.npz') as fh:
    x_train, y_train = fh['data_x'], fh['data_y']
    x_val, y_val     = fh['val_x'], fh['val_y']
    x_test           = fh['test_x']

# 2. Vectorization (one-hot encoding via tf.one_hot)
mapping = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
def encode_sequences(data):
    # Converts 'A' -> 0, 'C' -> 1, etc., then to one-hot
    encoded = np.array([[mapping[char] for char in seq] for seq in data])
    return tf.one_hot(encoded, depth=4).numpy()

x_train_vec = encode_sequences(x_train)
x_val_vec   = encode_sequences(x_val)
x_test_vec  = encode_sequences(x_test)

# Convert labels to categorical (One-Hot)
y_train_ohe = to_categorical(y_train, num_classes=5)
y_val_ohe   = to_categorical(y_val, num_classes=5)

# 3. Model Architecture
model = Sequential([
    # Bidirectional LSTMs often capture genetic patterns better
    Bidirectional(LSTM(64, return_sequences=False), input_shape=(None, 4)),
    BatchNormalization(), # Stabilizes training and allows higher learning rates
    Dropout(0.2),         # Prevents overfitting
    Dense(32, activation='relu'),
    Dense(5, activation='softmax') # Use softmax for multi-class classification
])

model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy']
)

# 4. Training with EarlyStopping (Better than custom threshold)
early_stop = EarlyStopping(
    monitor='val_accuracy', 
    patience=50, 
    restore_best_weights=True,
    verbose=1
)

# Reduce the learning rate when the validation loss stops improving
reduce_lr = ReduceLROnPlateau(
    monitor='val_loss', 
    factor=0.2,   # Multiply learning rate by 0.2 (divide by 5)
    patience=15,  # Wait 15 epochs of no improvement before dropping LR
    min_lr=1e-6,  # Don't let it go lower than this
    verbose=1
)

history = model.fit(
    x_train_vec, y_train_ohe,
    validation_data=(x_val_vec, y_val_ohe),
    epochs=60, # Note: with patience=50, EarlyStopping cannot stop before epoch 51
    batch_size=32,
    callbacks=[early_stop, reduce_lr]
)

# 5. Prediction & Saving
predictions = np.argmax(model.predict(x_test_vec), axis=1)
np.save('prediction.npy', predictions)

2026-01-22 23:55:27.730980: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2026-01-22 23:55:27.731349: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2256] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...


Epoch 1/60
Epoch 2/60
Epoch 3/60
Epoch 4/60
Epoch 5/60
Epoch 6/60
Epoch 7/60
Epoch 8/60
Epoch 9/60
Epoch 10/60
Epoch 11/60
Epoch 12/60
Epoch 13/60
Epoch 14/60
Epoch 15/60
Epoch 16/60
Epoch 17/60
Epoch 18/60
Epoch 19/60
Epoch 20/60
Epoch 21/60
Epoch 22/60
Epoch 23/60
Epoch 24/60
Epoch 25/60
Epoch 26/60
Epoch 26: ReduceLROnPlateau reducing learning rate to 0.00020000000949949026.
Epoch 27/60
Epoch 28/60
Epoch 29/60
Epoch 30/60
Epoch 31/60
Epoch 32/60
Epoch 33/60
Epoch 34/60
Epoch 35/60
Epoch 36/60
Epoch 37/60
Epoch 38/60
Epoch 39/60
Epoch 40/60
Epoch 41/60
Epoch 42/60
Epoch 43/60
Epoch 44/60
Epoch 45/60
Epoch 46/60
Epoch 47/60
Epoch 48/60
Epoch 49/60
Epoch 50/60
Epoch 51/60
Epoch 52/60
Epoch 53/60
Epoch 54/60
Epoch 55/60
Epoch 55: ReduceLROnPlateau reducing learning rate to 4.0000001899898055e-05.
Epoch 56/60
Epoch 57/60
Epoch 58/60
Epoch 59/60
Epoch 60/60
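The two `ReduceLROnPlateau` messages in the log can be reproduced by hand. The sketch below assumes the starting learning rate is 0.001, which is the Keras default for the `'adam'` optimizer string. It only recomputes the arithmetic and does not use Keras.

```python
initial_lr = 1e-3  # assumed: Keras default learning rate for Adam
factor = 0.2       # from the ReduceLROnPlateau configuration
min_lr = 1e-6      # lower bound from the same configuration

lr = initial_lr
schedule = []
for _ in range(2):  # two reductions appear in the training log
    lr = max(lr * factor, min_lr)
    schedule.append(lr)

print(schedule)
```

The results are approximately 2e-4 and 4e-5. The trailing digits in the logged values (for example `0.00020000000949949026`) arise because the learning rate is stored as a 32-bit float.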


In [5]:
print(predictions)

"""
(250,)
[2 4 1 1 0 4 2 0 4 2 4 3 3 2 0 3 3 2 3 2 0 4 2 4 0 3 2 0 1 4 1 1 1 1 0 0 4
 3 1 3 2 2 2 4 3 4 1 0 1 0 1 2 4 4 3 0 0 4 4 2 1 2 3 0 3 1 2 2 4 3 3 4 2 3
 3 1 1 4 4 0 1 0 0 1 2 0 4 0 4 2 2 3 2 3 2 3 4 1 2 1 2 4 2 1 0 3 3 1 3 3 0
 1 1 0 4 4 2 0 1 4 2 0 4 2 3 2 4 0 1 0 2 4 0 1 2 0 4 2 2 1 3 0 1 0 0 0 2 2
 2 2 0 0 0 3 3 4 4 4 2 1 1 0 3 1 1 1 2 2 1 3 4 4 1 3 1 3 4 0 1 2 4 3 0 4 2
 1 3 1 4 3 2 3 1 0 0 0 4 2 3 2 4 3 2 1 1 4 3 1 4 0 1 1 1 1 0 3 4 3 1 3 4 3
 1 3 1 0 2 4 2 3 0 4 4 3 0 2 3 3 3 3 0 4 0 4 3 0 2 2 0 0]
"""

[2 4 2 1 0 4 2 0 3 2 4 3 2 2 0 3 3 2 3 2 0 4 2 4 0 3 2 0 1 3 2 1 1 1 0 0 4
 3 1 3 2 2 2 4 3 3 1 0 1 0 2 2 3 3 2 0 0 4 4 2 1 2 3 0 3 1 2 2 3 3 3 3 2 3
 3 2 1 3 4 0 1 0 0 1 2 0 3 0 4 2 2 3 2 3 2 2 4 1 2 1 2 4 2 2 0 3 3 2 2 3 0
 1 1 0 4 4 2 0 1 4 2 0 4 2 3 2 4 0 1 0 2 4 0 1 2 0 4 2 2 1 3 0 1 0 0 0 2 2
 2 2 0 0 0 3 3 4 4 4 2 2 1 0 0 1 1 1 2 2 1 3 4 4 1 3 1 3 4 0 1 2 4 3 0 3 2
 1 3 1 4 3 2 0 2 0 0 0 4 2 3 2 3 3 2 1 1 3 3 1 4 0 2 1 1 1 0 3 3 3 1 3 4 3
 1 2 1 0 2 4 2 3 0 3 4 0 0 2 3 3 2 3 0 4 0 3 2 0 2 2 0 0]



In [4]:
# MAKE SURE THAT YOU HAVE THE RIGHT FORMAT
assert predictions.ndim == 1
assert predictions.shape[0] == 250

# AND SAVE EXACTLY AS SHOWN BELOW
# (np.save does not create missing directories, so create 'results' first)
import os
os.makedirs('results', exist_ok=True)
np.save('results/prediction.npy', predictions)