# CAFA 6 Protein Function Prediction: A Hybrid TPU-Accelerated Transformer and CPU-Based Framework for High-Performance Annotation

**Team:** DTU Proteomics Core

**Author:**     
Olaf Yunus Laitinen Imanov     
*Data Scientist in Proteomics*     
*DTU Proteomics Core*     
*DTU Bioengineering*     
*Department of Biotechnology and Biomedicine*     

*Danmarks Tekniske Universitet*     
*Søltofts Plads*     
*Building 224, Room 016*     
*2800 Kgs. Lyngby*     

**Date:** October 16, 2025

---

### **Abstract**

The accurate computational prediction of protein function is a cornerstone of modern bioinformatics, essential for annotating the vast and rapidly growing corpus of sequenced proteins. This study presents a robust and computationally efficient hybrid framework for the Critical Assessment of Functional Annotation (CAFA 6) challenge, designed to leverage a heterogeneous hardware environment comprising a high-core-count CPU and a Tensor Processing Unit (TPU). Our methodology addresses the multi-label classification of protein functions by synergistically combining a classical machine learning baseline with a deep learning approach. The pipeline consists of: (1) a massively parallelized feature extraction stage on a **96-core CPU**, where protein sequences are converted into high-dimensional sparse vectors using a **Term Frequency-Inverse Document Frequency (TF-IDF)** vectorizer; (2) a baseline prediction model using a **LightGBM** classifier trained on these TF-IDF features; (3) a high-performance classification stage where a pre-trained **ProtBERT** Transformer model is fine-tuned on the **TPU v5e-8** accelerator using a `tf.data` pipeline for optimal throughput; and (4) a final ensembling stage where the predictions from the LightGBM and Transformer models are blended. This hybrid approach is designed to capture both local sequence motifs (via TF-IDF) and long-range contextual dependencies (via ProtBERT), with performance evaluated using the official $F_{max}$ metric.

**Keywords:** Protein Function Prediction, Gene Ontology, CAFA, TPU, Transformer, ProtBERT, TF-IDF, LightGBM, Hybrid Models.

---
### **1. Introduction and Methodological Framework**

**1.1. Theoretical Context and Significance**
The deluge of protein sequence data generated by high-throughput sequencing technologies has created a substantial gap between the number of known sequences and the number of proteins with experimentally validated functional annotations [1]. Computational protein function prediction (PFP) aims to bridge this gap by assigning biological functions, typically represented by terms from the Gene Ontology (GO), to uncharacterized proteins [2, 3]. The CAFA challenge provides a critical, time-delayed benchmark for evaluating the efficacy of these computational methods in a real-world, prospective setting [4]. While recent advancements have leveraged deep learning and protein language models (pLMs) [5, 6], this study focuses on establishing a strong, computationally tractable baseline using classical, yet powerful, NLP and machine learning techniques, and then enhancing it with a state-of-the-art Transformer model.

**1.2. Mathematical Formulation of the Task**
Let the training dataset be denoted by $\mathcal{D} = \{(\mathbf{s}_i, Y_i)\}_{i=1}^{N}$, where $\mathbf{s}_i$ is the amino acid sequence for the $i$-th protein, and $Y_i \subseteq \mathcal{T}$ is the set of GO terms associated with that protein from a total vocabulary of terms $\mathcal{T}$. This is a multi-label classification problem. Our approach decomposes this into a series of binary classification problems for each GO term $t \in \mathcal{T}$.
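
As a minimal illustration of this decomposition, the sketch below turns per-protein label sets $Y_i$ into a binary indicator matrix whose columns are the targets of the individual binary classifiers. The helper name `to_indicator_matrix` and the toy proteins and GO terms are made up for the example; they are not taken from the notebook's data.

```python
def to_indicator_matrix(label_sets, vocabulary):
    """Turn per-protein GO term sets into 0/1 rows, one column per vocabulary term."""
    column = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for terms in label_sets:
        row = [0] * len(vocabulary)
        for term in terms:
            if term in column:  # terms outside the vocabulary are ignored
                row[column[term]] = 1
        matrix.append(row)
    return matrix


# Hypothetical toy data: three proteins, three GO terms.
vocabulary = ["GO:0003674", "GO:0005575", "GO:0008150"]
label_sets = [{"GO:0003674", "GO:0008150"}, {"GO:0005575"}, set()]
Y = to_indicator_matrix(label_sets, vocabulary)

# Column j is the binary target vector for the classifier of vocabulary[j].
binary_targets = {term: [row[j] for row in Y] for j, term in enumerate(vocabulary)}
print(Y)
```

Each entry of `binary_targets` is what one independent binary classifier would be trained on.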

**1.3. Evaluation Metric: Maximum F1-Score ($F_{max}$)**
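The primary metric for this task is the maximum F1-measure, computed over a range of prediction confidence thresholds ($\tau$). For a given threshold $\tau$, precision and recall are weighted by the information accretion (IA) of each GO term [8]. Here $T_i$ is the set of true terms for protein $i$, $P_i(\tau)$ is the set of terms predicted for protein $i$ with confidence at least $\tau$, $n$ is the number of benchmark proteins, and $m(\tau)$ is the number of proteins with at least one prediction at threshold $\tau$.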

The precision at threshold $\tau$ is defined as:
$$
Pr(\tau) = \frac{\sum_{i=1}^{m(\tau)} \sum_{t \in T_i \cap P_i(\tau)} IA(t)}{\sum_{i=1}^{m(\tau)} \sum_{t \in P_i(\tau)} IA(t)}
$$

The recall at threshold $\tau$ is:
$$
Rc(\tau) = \frac{\sum_{i=1}^{n} \sum_{t \in T_i \cap P_i(\tau)} IA(t)}{\sum_{i=1}^{n} \sum_{t \in T_i} IA(t)}
$$

The F1-score is the harmonic mean: $F1(\tau) = \frac{2 \cdot Pr(\tau) \cdot Rc(\tau)}{Pr(\tau) + Rc(\tau)}$, and the final metric is $F_{max} = \max_{\tau} \{F1(\tau)\}$. This is computed independently for each of the three GO sub-ontologies: Biological Process (BP), Cellular Component (CC), and Molecular Function (MF).
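
The definitions above can be sketched in a few lines of plain Python. This is a simplified illustration, not the official CAFA evaluator: it does not propagate terms through the ontology, it scores only proteins that have ground truth, and the function name `weighted_fmax`, the threshold grid, and the toy data are assumptions made for the example.

```python
def weighted_fmax(true_terms, scores, ia, thresholds=None):
    """IA-weighted F_max using the pooled precision and recall defined above.

    true_terms: dict protein -> set of true GO terms (T_i)
    scores:     dict protein -> dict GO term -> confidence in [0, 1]
    ia:         dict GO term -> information accretion weight
    """
    if thresholds is None:
        thresholds = [k / 100 for k in range(1, 100)]
    total_true = sum(ia.get(t, 0.0) for terms in true_terms.values() for t in terms)
    best_f1, best_tau = 0.0, None
    for tau in thresholds:
        hit = 0.0        # IA mass of correct predictions (shared numerator)
        predicted = 0.0  # IA mass of all predictions (precision denominator)
        for protein, terms in true_terms.items():
            p_i = {t for t, s in scores.get(protein, {}).items() if s >= tau}
            predicted += sum(ia.get(t, 0.0) for t in p_i)
            hit += sum(ia.get(t, 0.0) for t in p_i & terms)
        if predicted == 0.0 or total_true == 0.0:
            continue  # precision or recall is undefined at this threshold
        precision, recall = hit / predicted, hit / total_true
        if precision + recall > 0:
            f1 = 2 * precision * recall / (precision + recall)
            if f1 > best_f1:
                best_f1, best_tau = f1, tau
    return best_f1, best_tau


# Hypothetical toy example with two proteins and three terms.
ia = {"A": 1.0, "B": 2.0, "C": 1.0}
true_terms = {"p1": {"A", "B"}, "p2": {"C"}}
scores = {"p1": {"A": 0.9, "B": 0.4, "C": 0.2}, "p2": {"C": 0.8}}
fmax, tau = weighted_fmax(true_terms, scores, ia)
print(fmax, tau)
```

Proteins without any prediction at a given threshold add nothing to either precision sum, so summing over all proteins is equivalent to summing over the $m(\tau)$ proteins in the precision formula. In practice the function would be called once per sub-ontology.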

**1.4. Proposed Hybrid CPU/TPU Framework**
Our framework is explicitly designed to maximize the utilization of the available heterogeneous hardware.

| Stage | Hardware | Technique/Model | Objective |
| :--- | :--- | :--- | :--- |
| **1. Data & Preprocessing** | **CPU** (96 cores) | Pandas & Parallel Processing | Ingest and structure the large dataset efficiently. |
| **2. Baseline Model** | **CPU** (96 cores) | TF-IDF + LightGBM | Generate fast, strong baseline predictions based on lexical features. |
| **3. Advanced Model** | **TPU** (v5e-8) | ProtBERT Fine-Tuning | Generate high-accuracy predictions based on deep contextual embeddings. |
| **4. Ensembling** | **CPU** | Weighted Blending | Combine predictions from both models to create a robust final submission. |

**Hardware Justification:**
*   **CPU:** Parsing large text files and constructing a very large sparse TF-IDF matrix are CPU-bound tasks that suit a high-core-count machine. LightGBM training is also heavily optimized for multi-core CPU architectures.
*   **TPU:** Tensor Processing Units (TPUs) are specifically designed to accelerate the large-scale matrix multiplications that are the computational core of Transformer models like ProtBERT, which makes them a natural fit for fine-tuning.

In [None]:
# --- 2. Library Installation, Imports, and TPU Initialization ---
# This section handles all environment setup: installing packages, imports, TPU setup,
# and defining the configuration class used across the notebook.

# --- Install required libraries ---
# This command installs all third-party dependencies needed for:
# - Natural Language Processing (transformers, datasets)
# - Machine Learning (scikit-learn, lightgbm)
# - Bioinformatics (biopython)
# - Progress tracking (tqdm)
# The flags used:
#   --quiet                   → hides long pip logs
#   --root-user-action ignore → suppresses warnings about running pip as root
#   --disable-pip-version-check → disables pip version update notifications
print("Installing libraries...")
!pip install -q --root-user-action ignore --disable-pip-version-check transformers datasets scikit-learn lightgbm tqdm biopython


# --- Import core Python libraries ---
import os                    # File and directory handling
import numpy as np           # Numerical computations
import pandas as pd          # Data manipulation and tabular operations
from tqdm.notebook import tqdm  # Progress bars for loops (optimized for Jupyter)
import gc                    # Garbage collection (manual memory cleanup)
import time                  # Timing code blocks (performance tracking)

# --- Import TensorFlow and Transformers (for deep learning) ---
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# --- Import machine learning utilities ---
from sklearn.feature_extraction.text import TfidfVectorizer  # Converts text to numerical TF-IDF features
from sklearn.preprocessing import LabelEncoder               # Encodes string labels into integers
from sklearn.multiclass import OneVsRestClassifier           # Enables multi-label classification
import lightgbm as lgb                                       # Gradient boosting framework
from Bio import SeqIO                                        # Biopython: parsing FASTA protein sequence files

# --- Warnings ---
import warnings
warnings.filterwarnings("ignore")  # Suppress all warnings for a cleaner output


# --- TPU Initialization --- 
# TPU (Tensor Processing Unit) is a hardware accelerator used for deep learning.
# This section tries to detect and initialize a TPU. If TPU is unavailable,
# it automatically falls back to CPU or GPU (depending on availability).

print("Initializing TPU...")
try:
    # Attempt to connect to a TPU cluster (works in Kaggle TPU runtime)
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
    # Define the TPU strategy to distribute training across TPU cores
    strategy = tf.distribute.TPUStrategy(tpu)
    print(f"✅ TPU initialized successfully. Number of replicas: {strategy.num_replicas_in_sync}")
except Exception as e:
    # If TPU is not found, fallback to CPU/GPU
    print("⚠️ TPU not found, using CPU/GPU instead. Error:", e)
    strategy = tf.distribute.get_strategy()


# --- Configuration Class --- 
# The Config class stores all constants and hyperparameters in one place.
# This improves readability and allows easy tuning of parameters later.

class Config:
    """Centralized configuration class for all constants and parameters."""

    # --- Data Directory ---
    # Main input path for CAFA-6 dataset (Kaggle environment default)
    DATA_DIR = "/kaggle/input/cafa-6-protein-function-prediction"

    # --- TF-IDF + LightGBM baseline parameters ---
    # TF-IDF: Converts protein sequences into sparse feature vectors
    TFIDF_MAX_FEATURES = 20000   # Max n-gram features to keep (lowered from 25000 to avoid kernel crashes)
    LGBM_N_ESTIMATORS = 250      # Number of boosting rounds for LightGBM model

    # --- Transformer (TPU-based) parameters ---
    TRANSFORMER_MODEL = "Rostlab/prot_bert_bfd"  # Pretrained protein language model (very large)
    MAX_LENGTH = 256                             # Max tokenized sequence length
    BATCH_SIZE_PER_REPLICA = 32                  # Batch size per TPU core

    # Compute the global batch size = per-replica batch size * number of TPU replicas
    GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

    EPOCHS = 3                                   # Number of training epochs
    LEARNING_RATE = 1e-4                         # Learning rate for fine-tuning


# --- Config Object Instantiation --- 
# Create a single instance of Config to use throughout the notebook
cfg = Config()

# Print a quick summary of key settings
print(f"Global batch size for TPU training: {cfg.GLOBAL_BATCH_SIZE}")
print(f"Transformer model: {cfg.TRANSFORMER_MODEL}")
print("✅ Environment setup completed successfully.")

---
### **3. CPU-Based Data Ingestion and Structuring**

This initial stage is executed on the CPU, leveraging its flexibility for handling large files and complex data structures. We use `Biopython` for its optimized FASTA parsing capabilities and prepare our data structures for both the TF-IDF and Transformer pipelines.

**3.1. Data Ingestion**
The primary data files—`train_terms.tsv`, `train_sequences.fasta`, and `testsuperset.fasta`—are loaded into memory. This step is timed to monitor I/O performance.

**3.2. Label Space Management**
Given the extreme cardinality of the label space ($C \approx 40{,}000$ total GO terms), training a classifier for every term is computationally infeasible for a baseline. We employ a common strategy:
*   **Target Subsetting:** We select the 1,500 most frequent GO terms in the training set (`NUM_LABELS_TO_TRAIN = 1500`). This subset covers a large share of the annotations while keeping the problem computationally tractable.
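
How much of the annotation set a given cutoff retains can be checked directly. The sketch below is one way to do so, using a hypothetical helper `top_term_coverage` and made-up `(protein, term)` pairs standing in for rows of `train_terms.tsv`.

```python
from collections import Counter


def top_term_coverage(annotations, n_terms):
    """Return the n most frequent terms and the fraction of annotation rows they cover."""
    counts = Counter(term for _, term in annotations)
    top = [term for term, _ in counts.most_common(n_terms)]
    covered = sum(counts[term] for term in top)
    return top, covered / len(annotations)


# Hypothetical (protein, term) pairs.
annotations = [
    ("P1", "GO:A"), ("P1", "GO:B"), ("P2", "GO:A"),
    ("P3", "GO:A"), ("P3", "GO:C"), ("P4", "GO:B"),
]
top, coverage = top_term_coverage(annotations, n_terms=2)
print(top, round(coverage, 3))
```

Running the same calculation on the real annotation table with `n_terms=1500` would give the share of annotations retained by the subsetting step.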

**3.3. Sparse Target Matrix Construction**
For the Binary Relevance approach, we construct a sparse binary matrix $\mathbf{Y} \in \{0, 1\}^{N \times C'}$, where $N$ is the number of unique proteins in our subset and $C'=1500$. An entry $\mathbf{Y}_{i,j} = 1$ indicates that protein $i$ is annotated with the $j$-th most frequent GO term. We use `scipy.sparse.lil_matrix` for efficient construction of this matrix.

In [None]:
# --- 3. CPU-Based Data Ingestion and Structuring (Memory-Efficient) ---
section_start_time = time.time()
print("Loading data and structuring it on the CPU with a memory-efficient approach...")

# --- Load Training Term Annotations ---
# Load only the columns we need.
train_terms = pd.read_csv(f"{cfg.DATA_DIR}/Train/train_terms.tsv", sep='\t', usecols=[0, 1])
train_terms.columns = ['Protein Id', 'term']
print(f"Loaded {len(train_terms)} annotations.")

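# --- Load Protein Sequences ---
def parse_fasta_biopython(fasta_path):
    """Parse a FASTA file into a {protein_id: sequence} dictionary."""
    sequences = {}
    for record in SeqIO.parse(fasta_path, "fasta"):
        # Assumption: UniProt-style IDs such as "sp|P12345|NAME_HUMAN" carry the accession
        # in the second field; any other ID is used unchanged.
        parts = record.id.split("|")
        protein_id = parts[1] if len(parts) >= 3 else record.id
        sequences[protein_id] = str(record.seq)
    return sequences

train_sequences = parse_fasta_biopython(f"{cfg.DATA_DIR}/Train/train_sequences.fasta")
test_sequences = parse_fasta_biopython(f"{cfg.DATA_DIR}/Test/testsuperset.fasta")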

# --- Label Space Reduction ---
NUM_LABELS_TO_TRAIN = 1500
top_labels = train_terms['term'].value_counts().nlargest(NUM_LABELS_TO_TRAIN).index.tolist()
top_labels_set = set(top_labels)
print(f"Selected the top {NUM_LABELS_TO_TRAIN} most frequent GO terms for training.")

# --- KERNEL CRASH FIX: Memory-Efficient Data Structuring ---
# Instead of a massive merge, we will work with dictionaries and smaller dataframes.
print("Structuring data in a memory-efficient manner...")

# 1. Filter the terms DataFrame first. This is a much smaller object.
train_terms_subset = train_terms[train_terms['term'].isin(top_labels_set)].copy()
del train_terms # Free up memory
gc.collect()

# 2. Get the set of unique protein IDs that we actually need sequences for.
unique_proteins_in_subset = set(train_terms_subset['Protein Id'].unique())
print(f"Found {len(unique_proteins_in_subset)} unique proteins in the subset.")

# 3. Create a list of sequences only for these required proteins.
# Sorting fixes the row order, which set iteration does not guarantee across runs.
X_train_sequences = []
protein_id_order = []
for pid in tqdm(sorted(unique_proteins_in_subset), desc="Filtering sequences"):
    if pid in train_sequences:
        X_train_sequences.append(train_sequences[pid])
        protein_id_order.append(pid)

# 4. Create the sparse target matrix.
protein_to_idx = {protein: i for i, protein in enumerate(protein_id_order)}
label_encoder = LabelEncoder().fit(top_labels)
train_terms_subset['label_idx'] = label_encoder.transform(train_terms_subset['term'])

from scipy.sparse import lil_matrix
y_train_sparse = lil_matrix((len(protein_id_order), NUM_LABELS_TO_TRAIN), dtype=np.int8)

# Group by Protein ID to make matrix population faster.
grouped_terms = train_terms_subset.groupby('Protein Id')
for protein_id, group in tqdm(grouped_terms, desc="Building target matrix"):
    if protein_id in protein_to_idx:
        protein_idx = protein_to_idx[protein_id]
        label_indices = group['label_idx'].values
        y_train_sparse[protein_idx, label_indices] = 1

# Create the test sequence list
test_seq_df = pd.DataFrame(test_sequences.items(), columns=['Protein Id', 'Sequence'])
X_test_sequences = test_seq_df['Sequence'].tolist()

print("Data structuring complete.")
del train_terms_subset, train_sequences # Free up more memory
gc.collect()

---
### **4. CPU-Based Baseline: TF-IDF + LightGBM**

This section details the training of our fast and robust baseline model, executed entirely on the **96-core CPU**.

**4.1. Feature Engineering: TF-IDF on Amino Acid N-grams**
We convert protein sequences into numerical vectors using `TfidfVectorizer`. Character n-grams of length 3 to 5 (e.g., the trigrams 'AVG' and 'LIA') serve as features, capturing local sequence motifs without the need for complex alignments. Note that `TfidfVectorizer` itself runs single-threaded, so this step does not by itself use all 96 cores; the multi-core benefit comes mainly from the model training that follows.
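
To make the feature definition concrete, the sketch below counts overlapping character n-grams for a single toy sequence and applies the sublinear term-frequency scaling. It covers only the counting step: scikit-learn additionally lowercases the text by default, applies the inverse-document-frequency weight, and normalizes each row. The helper names are hypothetical.

```python
from collections import Counter
import math


def char_ngrams(sequence, n_min=3, n_max=5):
    """Count all overlapping character n-grams with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(sequence) - n + 1):
            counts[sequence[start:start + n]] += 1
    return counts


def sublinear_tf(count):
    """Sublinear term-frequency scaling: 1 + ln(count)."""
    return 1.0 + math.log(count)


counts = char_ngrams("AVGAVGL")
print(counts["AVG"], sum(counts.values()))
print(round(sublinear_tf(counts["AVG"]), 3))
```

A repeated motif such as `AVG` is counted twice here, but sublinear scaling dampens its weight so that long, repetitive sequences do not dominate the feature vector.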

**4.2. Model Training: Binary Relevance with LightGBM**
We use the Binary Relevance method, training an independent `LGBMClassifier` for each of the 1500 target GO terms. `scikit-learn`'s `OneVsRestClassifier` with `n_jobs=-1` is employed to parallelize this training process across all available CPU cores.

In [None]:
# --- 4. CPU-Based Baseline: TF-IDF + LightGBM ---
from sklearn.multiclass import OneVsRestClassifier

section_start_time = time.time()
print("Starting CPU-based TF-IDF + LightGBM pipeline...")

# --- TF-IDF Vectorization on CPU (with adjusted max_features) ---
print("Applying TF-IDF vectorization...")
tfidf_vectorizer = TfidfVectorizer(
    analyzer='char',
    ngram_range=(3, 5),
    max_features=20000, # KERNEL CRASH FIX: Reduced from 25000
    sublinear_tf=True
)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_sequences)
X_test_tfidf = tfidf_vectorizer.transform(X_test_sequences)
print(f"TF-IDF vectorization completed. Shape: {X_train_tfidf.shape}")

del X_train_sequences, X_test_sequences # Free memory
gc.collect()

# --- LightGBM Training on CPU ---
# This is another highly parallelizable CPU task.
print("Training One-vs-Rest LightGBM model...")
# Define the base LightGBM classifier.
lgbm = lgb.LGBMClassifier(
    objective='binary',
    n_estimators=cfg.LGBM_N_ESTIMATORS,
    learning_rate=0.05,
    num_leaves=31,
    n_jobs=1,  # One thread per model: OneVsRestClassifier below already uses every core.
    random_state=42
)
# Wrap the base classifier in OneVsRestClassifier to handle the multi-label task.
# n_jobs=-1 here parallelizes the training of the 1500 independent models across the CPU cores.
# Requesting all cores at both levels would oversubscribe the CPU.
ovr_classifier = OneVsRestClassifier(lgbm, n_jobs=-1)
# Train the model. This will train 1500 binary classifiers in parallel.
ovr_classifier.fit(X_train_tfidf, y_train_sparse)

# --- Generate Baseline Predictions ---
# Predict probabilities on the test set.
print("Generating baseline predictions with LightGBM...")
lgbm_test_preds = ovr_classifier.predict_proba(X_test_tfidf)

# Log the time taken for this entire section.
print(f"CPU-based baseline pipeline completed in {(time.time() - section_start_time)/60:.2f} minutes.")
# Free up memory (only names that were actually defined above).
del X_train_tfidf, X_test_tfidf, y_train_sparse
gc.collect()

---
### **5. TPU-Accelerated Model: Fine-Tuning ProtBERT**

This is the core deep learning stage. Training and inference run on the **TPU v5e-8**, while tokenization runs on the CPU.

**5.1. Data Pipeline for TPU: `tf.data`**
To achieve high throughput on the TPU, we construct an input pipeline using `tf.data`. Raw text sequences are first tokenized on the CPU using the `ProtBERT` tokenizer. These tokenized sequences are then converted into a `tf.data.Dataset`, which is batched and prefetched. The `.prefetch(tf.data.AUTOTUNE)` operation matters because it lets the CPU prepare the next batch while the TPU is busy with the current one, which reduces accelerator idle time.

**5.2. Model Architecture: ProtBERT**
We use **ProtBERT** [37], a BERT-based Transformer pre-trained on a massive corpus of protein sequences. The checkpoint configured here, `Rostlab/prot_bert_bfd`, was pre-trained on the BFD database (the sibling `Rostlab/prot_bert` checkpoint was trained on UniRef100). The model learns deep, contextual embeddings for each amino acid, capturing long-range dependencies related to protein structure and function. We fine-tune this model using `TFAutoModelForSequenceClassification`.
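
ProtBERT's model card describes its expected input as uppercase residues separated by single spaces, with the rare amino acids U, Z, O, and B mapped to X. The sketch below shows that formatting step and how many residues survive truncation once the `[CLS]` and `[SEP]` tokens occupy two of the `MAX_LENGTH` slots. Both helper names are hypothetical.

```python
import re


def format_for_protbert(sequence):
    """Uppercase, map rare residues (U, Z, O, B) to X, and separate residues with spaces."""
    return " ".join(re.sub(r"[UZOB]", "X", sequence.upper()))


def residues_kept(sequence_length, max_length=256):
    """Residues that survive truncation once [CLS] and [SEP] take two token slots."""
    return min(sequence_length, max_length - 2)


print(format_for_protbert("MKTUAZ"))  # M K T X A X
print(residues_kept(1000))            # 254
```

With `MAX_LENGTH = 256`, a 1,000-residue protein is therefore represented by its first 254 residues only, which is worth keeping in mind when interpreting predictions for long proteins.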

**5.3. Distributed Training with `TPUStrategy`**
The model is compiled and trained within the `strategy.scope()`. This automatically handles:
*   **Model Replication:** A copy of the model is placed on each of the 8 TPU cores.
*   **Data Parallelism:** Each global batch of data is split evenly among the 8 replicas.
*   **All-Reduce Gradient Aggregation:** After the forward and backward passes, the gradients from each replica are efficiently aggregated across the high-speed interconnect of the TPU pod before the optimizer step is applied.
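
The batch arithmetic behind this setup is simple enough to sketch. The figure of 80,000 training examples below is a made-up number used only to illustrate the calculation, and `batch_plan` is a hypothetical helper.

```python
import math


def batch_plan(num_examples, per_replica_batch, num_replicas):
    """Return (global batch size, steps per epoch) for synchronous data parallelism."""
    global_batch = per_replica_batch * num_replicas
    steps_per_epoch = math.ceil(num_examples / global_batch)
    return global_batch, steps_per_epoch


global_batch, steps = batch_plan(num_examples=80_000, per_replica_batch=32, num_replicas=8)
print(global_batch, steps)  # 256 313
```

With 32 examples per replica and 8 replicas, each optimizer step consumes 256 examples, so the learning rate and number of epochs should be judged against that effective batch size.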

In [None]:
# --- 5. TPU-Accelerated Model: Fine-Tuning ProtBERT ---
# This section handles the training of our high-performance Transformer model on the TPU.
section_start_time = time.time()
print("Starting TPU-accelerated ProtBERT pipeline...")

# Earlier cells free `train_terms_subset` and `train_sequences` to save memory, so the
# training inputs are rebuilt here from the files on disk.
import re

def load_sequences(fasta_path):
    """Read a FASTA file into a {protein_id: sequence} dictionary."""
    sequences = {}
    for record in SeqIO.parse(fasta_path, "fasta"):
        # Assumption: UniProt-style IDs ("sp|P12345|NAME") carry the accession in field 2.
        parts = record.id.split("|")
        sequences[parts[1] if len(parts) >= 3 else record.id] = str(record.seq)
    return sequences

def to_protbert_input(sequence):
    """ProtBERT expects space-separated residues, with rare amino acids mapped to X."""
    return " ".join(re.sub(r"[UZOB]", "X", sequence.upper()))

bert_sequences = load_sequences(f"{cfg.DATA_DIR}/Train/train_sequences.fasta")
bert_terms = pd.read_csv(f"{cfg.DATA_DIR}/Train/train_terms.tsv", sep='\t', usecols=[0, 1])
bert_terms.columns = ['Protein Id', 'term']
bert_terms = bert_terms[bert_terms['term'].isin(top_labels_set)
                        & bert_terms['Protein Id'].isin(set(bert_sequences))]

# Multi-hot targets: one row per protein, one column per GO term, in the same column
# order as `label_encoder` so the outputs line up with the LightGBM predictions.
bert_protein_ids = sorted(bert_terms['Protein Id'].unique())
row_of = {pid: i for i, pid in enumerate(bert_protein_ids)}
train_labels = np.zeros((len(bert_protein_ids), len(top_labels)), dtype=np.float32)
train_labels[bert_terms['Protein Id'].map(row_of).to_numpy(),
             label_encoder.transform(bert_terms['term'])] = 1.0

# --- Tokenization on CPU ---
# This step prepares the text data for the Transformer model.
print("Tokenizing data for Transformer...")
# Load the tokenizer specific to the ProtBERT model.
tokenizer = AutoTokenizer.from_pretrained(cfg.TRANSFORMER_MODEL, do_lower_case=False)

# Prepare the data lists for tokenization.
train_texts = [to_protbert_input(bert_sequences[pid]) for pid in bert_protein_ids]
test_texts = [to_protbert_input(seq) for seq in test_seq_df['Sequence']]
del bert_sequences, bert_terms
gc.collect()

# Define the tokenization function.
def tokenize_sequences(sequences):
    """Tokenizes a list of sequences."""
    return tokenizer(
        sequences,
        padding='max_length',      # Pad all sequences to the same length.
        truncation=True,           # Truncate sequences longer than max_length.
        max_length=cfg.MAX_LENGTH,
        return_tensors='np'        # Return NumPy arrays for TensorFlow.
    )

# Apply tokenization to both train and test sets.
train_encodings = tokenize_sequences(train_texts)
test_encodings = tokenize_sequences(test_texts)

# --- Create tf.data Pipeline ---
# This function creates the input pipeline for the TPU.
def create_tf_dataset(encodings, labels=None, shuffle=False):
    """Creates a tf.data.Dataset from tokenized encodings."""
    # Create a dictionary of the tokenized inputs.
    dataset_dict = {'input_ids': encodings['input_ids'], 'attention_mask': encodings['attention_mask']}

    if labels is not None:
        # If labels are provided, create a dataset of (features, labels).
        dataset = tf.data.Dataset.from_tensor_slices((dataset_dict, labels))
    else:
        # If no labels (for test set), create a dataset of features only.
        dataset = tf.data.Dataset.from_tensor_slices(dataset_dict)

    if shuffle:
        # Shuffle the dataset for training.
        dataset = dataset.shuffle(10000)

    # Batch the data. The global batch size is used for distributed training.
    dataset = dataset.batch(cfg.GLOBAL_BATCH_SIZE)
    # Prefetch the next batch so the TPU spends less time waiting for data.
    dataset = dataset.prefetch(tf.data.AUTOTUNE)
    return dataset

# Create the training and test datasets.
tf_train_dataset = create_tf_dataset(train_encodings, train_labels, shuffle=True)
tf_test_dataset = create_tf_dataset(test_encodings)

# --- Model Definition and Training on TPU ---
# All model creation and training must happen inside the 'strategy.scope()'.
with strategy.scope():
    # Load the pre-trained ProtBERT model for sequence classification.
    model = TFAutoModelForSequenceClassification.from_pretrained(
        cfg.TRANSFORMER_MODEL,
        num_labels=len(top_labels) # One output logit per target GO term.
    )
    # Define the Adam optimizer.
    optimizer = tf.keras.optimizers.Adam(learning_rate=cfg.LEARNING_RATE)
    # Multi-label task: an independent binary cross-entropy per GO term on the raw logits.
    loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)

    # threshold=0.0 because the model outputs logits, not probabilities.
    model.compile(optimizer=optimizer, loss=loss,
                  metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)])

print("Starting ProtBERT fine-tuning on TPU...")
# Train the model using the efficient tf.data pipeline.
model.fit(tf_train_dataset, epochs=cfg.EPOCHS)

# --- Generate Transformer Predictions on TPU ---
print("Generating predictions with ProtBERT on TPU...")
# Get the raw logit outputs from the model.
bert_test_preds = model.predict(tf_test_dataset, verbose=1).logits

# Log the time taken for this section.
print(f"TPU-based pipeline completed in {(time.time() - section_start_time)/60:.2f} minutes.")
gc.collect()

---
### **6. Ensembling and Submission**

In the final stage, we combine the predictions from our two models and format the output for submission.

**6.1. Hybrid Prediction Strategy**
We use a simple blending strategy. Both models score the same 1,500 GO terms in the same column order, so their probability matrices can be combined element-wise.

Let $\mathbf{p}_{lgbm}$ be the prediction vector from LightGBM and $\mathbf{p}_{bert}$ be the prediction vector from ProtBERT. The final prediction vector $\mathbf{p}_{final}$ is a weighted average:

$$
\mathbf{p}_{final} = \alpha \cdot \mathbf{p}_{bert} + (1 - \alpha) \cdot \mathbf{p}_{lgbm}
$$

The hyperparameter $\alpha$ is a blending weight, which can be tuned on a validation set. For this baseline, we will use a simple average ($\alpha=0.5$).
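
A minimal sketch of the blend and of a grid search for $\alpha$ is given below. The function names, the 11-point grid, and the negative mean squared error used as a stand-in score are assumptions for illustration; on a real validation split the score function would be the $F_{max}$ metric.

```python
import numpy as np


def blend(p_bert, p_lgbm, alpha=0.5):
    """Weighted average of two probability matrices with identical shape and column order."""
    p_bert = np.asarray(p_bert, dtype=float)
    p_lgbm = np.asarray(p_lgbm, dtype=float)
    if p_bert.shape != p_lgbm.shape:
        raise ValueError("prediction matrices must have the same shape")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_bert + (1.0 - alpha) * p_lgbm


def tune_alpha(p_bert_val, p_lgbm_val, y_val, score_fn, grid=None):
    """Return the grid value of alpha that maximises score_fn(y_val, blended)."""
    grid = np.linspace(0.0, 1.0, 11) if grid is None else grid
    scored = [(score_fn(y_val, blend(p_bert_val, p_lgbm_val, a)), float(a)) for a in grid]
    return max(scored)[1]


def neg_mse(y_true, y_prob):
    """Stand-in validation score (higher is better)."""
    return -float(np.mean((np.asarray(y_true, dtype=float) - y_prob) ** 2))


# Hypothetical one-protein, two-term validation example.
best_alpha = tune_alpha([[1.0, 0.0]], [[0.0, 1.0]], [[1, 0]], neg_mse)
print(blend([[0.8, 0.2]], [[0.4, 0.6]]), best_alpha)
```

In this toy case the ProtBERT predictions match the labels exactly, so the search selects $\alpha = 1$; with real data the optimum usually lies strictly between 0 and 1.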

**6.2. Submission Formatting**
The blended probability matrix is converted into the required long, tab-separated format, applying a confidence threshold to manage the submission file size.
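
The conversion from a score table to the long format can be sketched as follows. The function name, the optional `max_terms_per_protein` cap, and the toy scores are assumptions for illustration; the threshold of 0.05 matches the value used in the notebook cell.

```python
import csv
import io


def format_submission(scores, threshold=0.05, max_terms_per_protein=None):
    """Return (protein, term, score) rows for predictions above the threshold."""
    rows = []
    for protein, term_scores in scores.items():
        kept = sorted(
            ((score, term) for term, score in term_scores.items() if score > threshold),
            reverse=True,  # highest-confidence terms first
        )
        if max_terms_per_protein is not None:
            kept = kept[:max_terms_per_protein]
        rows.extend((protein, term, f"{score:.3f}") for score, term in kept)
    return rows


# Hypothetical blended scores for a single protein.
scores = {"P1": {"GO:A": 0.91, "GO:B": 0.03, "GO:C": 0.5}}
rows = format_submission(scores)

buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
submission_text = buffer.getvalue()
print(submission_text)
```

Sorting by score before applying a cap ensures that, if a per-protein limit is needed, the lowest-confidence terms are the ones dropped.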

In [None]:
# --- 6. Ensembling and Submission ---
# This final section combines the predictions from both models.
section_start_time = time.time()
print("Ensembling predictions and creating final submission file...")

# Convert ProtBERT's raw logit outputs to probabilities using the sigmoid function.
# Sigmoid is used instead of softmax because this is a multi-label problem.
bert_test_probs = tf.nn.sigmoid(bert_test_preds).numpy()

# ProtBERT was configured with one output per entry of `top_labels`, and its label indices
# come from `label_encoder`, so its columns already line up with the LightGBM predictions.
bert_full_preds = bert_test_probs
assert bert_full_preds.shape == lgbm_test_preds.shape, "Model outputs must share a shape."

# --- Blending the Predictions ---
# We use a simple average of the predictions from the two models as a strong starting point.
alpha = 0.5
final_predictions_matrix = alpha * bert_full_preds + (1 - alpha) * lgbm_test_preds

# --- Format for Submission ---
# Define a confidence threshold to filter out low-probability predictions.
CONFIDENCE_THRESHOLD = 0.05

# Get the protein IDs from the test DataFrame.
protein_ids = test_seq_df['Protein Id'].values
# Get the GO term labels from the label encoder.
labels = label_encoder.classes_

# Use numpy's advanced indexing to find where predictions are above the threshold.
# This is much faster than iterating row by row.
protein_indices, label_indices = np.where(final_predictions_matrix > CONFIDENCE_THRESHOLD)

# Create a list of dictionaries to build the final submission DataFrame efficiently.
submission_data = []
for p_idx, l_idx in tqdm(zip(protein_indices, label_indices), total=len(protein_indices), desc="Formatting submission"):
    submission_data.append({
        'Protein ID': protein_ids[p_idx],
        'GO Term': labels[l_idx],
        'Score': f"{final_predictions_matrix[p_idx, l_idx]:.3f}" # Format the score to 3 decimal places.
    })

# Create the final submission DataFrame.
submission_df = pd.DataFrame(submission_data)

# Save the dataframe to the required tab-separated format without a header.
submission_df.to_csv("submission.tsv", sep='\t', header=False, index=False)

# Log the time taken for this section.
print(f"Final submission created in {(time.time() - section_start_time)/60:.2f} minutes.")
# Display the first few rows of the submission file for verification.
display(submission_df.head())

**References**

1.  Li, M., Cao, Y., Liu, X., & Ji, H. (2024). Structure-Aware Graph Attention Diffusion Network for Protein-Ligand Binding Affinity Prediction. *IEEE Transactions on Neural Networks and Learning Systems*, 35(12), 18370–18380. [DOI: 10.1109/TNNLS.2023.3314928](https://doi.org/10.1109/TNNLS.2023.3314928)
2.  Funk, C. S., Kahanda, I., Ben-Hur, A., & Verspoor, K. M. (2015). Evaluating a variety of text-mined features for automatic protein function prediction with GOstruct. *Journal of Biomedical Semantics*, 6(1). [DOI: 10.1186/s13326-015-0006-4](https://doi.org/10.1186/s13326-015-0006-4)
3.  Song, Y., Yuan, Q., & Yang, Y. (2023). Application of deep learning in protein function prediction. *Synthetic Biology Journal*, 4(3), 488–506. [DOI: 10.12211/2096-8280.2022-078](https://doi.org/10.12211/2096-8280.2022-078)
4.  Zheng, R., Huang, Z., & Deng, L. (2023). Large-scale predicting protein functions through heterogeneous feature fusion. *Briefings in Bioinformatics*, 24(4). [DOI: 10.1093/bib/bbad243](https://doi.org/10.1093/bib/bbad243)
5.  Zhang, E., Harada, T., & Thawonmas, R. (2019). Using Graph Convolution Network for Predicting Performance of Automatically Generated Convolution Neural Networks. *2019 IEEE Asia-Pacific Conference on Computer Science and Data Engineering, CSDE 2019*. [DOI: 10.1109/CSDE48274.2019.9162354](https://doi.org/10.1109/CSDE48274.2019.9162354)
6.  Rouhi, A., & Nezamabadi-Pour, H. (2020). Feature selection in high-dimensional data. *Advances in Intelligent Systems and Computing*, 1123, 85–128. [DOI: 10.1007/978-3-030-34094-0_5](https://doi.org/10.1007/978-3-030-34094-0_5)
7.  Chen, X., et al. (2024). DEAttentionDTA: protein–ligand binding affinity prediction based on dynamic embedding and self-attention. *Bioinformatics*, 40(6). [DOI: 10.1093/bioinformatics/btae319](https://doi.org/10.1093/bioinformatics/btae319)
8.  Homayouni, H., & Mansoori, E. G. (2017). A novel density-based ensemble learning algorithm with application to protein structural classification. *Intelligent Data Analysis*, 21(1), 167–179. [DOI: 10.3233/IDA-150357](https://doi.org/10.3233/IDA-150357)
9.  Song, C., et al. (2025). DeepMVD: A Novel Multiview Dynamic Feature Fusion Model for Accurate Protein Function Prediction. *Journal of Chemical Information and Modeling*, 65(6), 3077–3089. [DOI: 10.1021/acs.jcim.4c02216](https://doi.org/10.1021/acs.jcim.4c02216)
10. Charron, N. E., et al. (2025). Navigating protein landscapes with a machine-learned transferable coarse-grained model. *Nature Chemistry*, 17(8), 1284–1292. [DOI: 10.1038/s41557-025-01874-0](https://doi.org/10.1038/s41557-025-01874-0)
11. Yan, H., et al. (2024). GORetriever: Reranking protein-description-based GO candidates by literature-driven deep information retrieval for protein function annotation. *Bioinformatics*, 40, ii53–ii61. [DOI: 10.1093/bioinformatics/btae401](https://doi.org/10.1093/bioinformatics/btae401)
12. Rodrigues, B. N., et al. (2015). Quantitative assessment of protein function prediction programs. *Genetics and Molecular Research*, 14(4), 17555–17566. [DOI: 10.4238/2015.December.21.28](https://doi.org/10.4238/2015.December.21.28)
13. Ding, X., Zou, Z., & Brooks III, C. L. (2019). Deciphering protein evolution and fitness landscapes with latent space models. *Nature Communications*, 10(1). [DOI: 10.1038/s41467-019-13633-0](https://doi.org/10.1038/s41467-019-13633-0)
14. Kahanda, I., et al. (2015). A close look at protein function prediction evaluation protocols. *GigaScience*, 4(1). [DOI: 10.1186/s13742-015-0082-5](https://doi.org/10.1186/s13742-015-0082-5)
15. Krishna, N. V., & Manikavelan, D. (2025). Use of graph convolutional neural networks for human action recognition is compared with convolutional neural networks. *AIP Conference Proceedings*, 3267(1). [DOI: 10.1063/5.0266019](https://doi.org/10.1063/5.0266019)
16. Zhang, L., et al. (2024). MSA Generation with Seqs2Seqs Pretraining: Advancing Protein Structure Predictions. *Advances in Neural Information Processing Systems*, 37.
17. Vu, M. H., et al. (2023). Linguistically inspired roadmap for building biologically reliable protein language models. *Nature Machine Intelligence*, 5(5), 485–496. [DOI: 10.1038/s42256-023-00637-1](https://doi.org/10.1038/s42256-023-00637-1)
18. Singh, J., et al. (2022). Reaching alignment-profile-based accuracy in predicting protein secondary and tertiary structural properties without alignment. *Scientific Reports*, 12(1). [DOI: 10.1038/s41598-022-11684-w](https://doi.org/10.1038/s41598-022-11684-w)
19. Ansong, C., et al. (2013). Identification of widespread adenosine nucleotide binding in Mycobacterium tuberculosis. *Chemistry and Biology*, 20(1), 123–133. [DOI: 10.1016/j.chembiol.2012.11.008](https://doi.org/10.1016/j.chembiol.2012.11.008)
20. Chowdhury, R., et al. (2021). Single-sequence protein structure prediction using language models from deep learning. *AIChE Annual Meeting, Conference Proceedings*. [DOI: 10.1101/2021.08.02.454840](https://doi.org/10.1101/2021.08.02.454840)
21. Sangar, V., et al. (2007). Quantitative sequence-function relationships in proteins based on gene ontology. *BMC Bioinformatics*, 8. [DOI: 10.1186/1471-2105-8-294](https://doi.org/10.1186/1471-2105-8-294)
22. Weissenow, K., & Rost, B. (2025). Are protein language models the new universal key? *Current Opinion in Structural Biology*, 91. [DOI: 10.1016/j.sbi.2025.102997](https://doi.org/10.1016/j.sbi.2025.102997)
23. Hu, M., et al. (2022). Exploring evolution-aware & -free protein language models as protein function predictors. *Advances in Neural Information Processing Systems*, 35.
24. Alanazi, W., Meng, D., & Pollastri, G. (2025). Advancements in one-dimensional protein structure prediction using machine learning and deep learning. *Computational and Structural Biotechnology Journal*, 27, 1416–1430. [DOI: 10.1016/j.csbj.2025.04.005](https://doi.org/10.1016/j.csbj.2025.04.005)
25. Yan, Q., & Ding, Y. (2025). Integrating reduced amino acid with language models for prediction of protein thermostability. *Food Bioscience*, 69. [DOI: 10.1016/j.fbio.2025.106934](https://doi.org/10.1016/j.fbio.2025.106934)
26. Roel-Touris, J., et al. (2019). Less is more: Coarse-grained integrative modeling of large biomolecular assemblies with HADDOCK. *Journal of Chemical Theory and Computation*, 15(11), 6358–6367. [DOI: 10.1021/acs.jctc.9b00310](https://doi.org/10.1021/acs.jctc.9b00310)
27. Yunes, J. M., & Babbitt, P. C. (2019). Effusion: Prediction of protein function from sequence similarity networks. *Bioinformatics*, 35(3), 442–451. [DOI: 10.1093/bioinformatics/bty672](https://doi.org/10.1093/bioinformatics/bty672)
28. Reumann, S., Buchwald, D., & Lingner, T. (2012). PredPlantPTS1: A web server for the prediction of plant peroxisomal proteins. *Frontiers in Plant Science*, 3. [DOI: 10.3389/fpls.2012.00194](https://doi.org/10.3389/fpls.2012.00194)
29. Kulmanov, M., & Hoehndorf, R. (2020). DeepPheno: Predicting single gene loss-of-function phenotypes using an ontology-aware hierarchical classifier. *PLoS Computational Biology*, 16(11). [DOI: 10.1371/journal.pcbi.1008453](https://doi.org/10.1371/journal.pcbi.1008453)
30. Zhao, X., Chen, L., & Aihara, K. (2008). Protein function prediction with high-throughput data. *Amino Acids*, 35(3), 517–530. [DOI: 10.1007/s00726-008-0077-y](https://doi.org/10.1007/s00726-008-0077-y)
31. Robinson, S. L., Piel, J., & Sunagawa, S. (2021). A roadmap for metagenomic enzyme discovery. *Natural Product Reports*, 38(11), 1994–2023. [DOI: 10.1039/d1np00006c](https://doi.org/10.1039/d1np00006c)
32. Kulmanov, M., & Hoehndorf, R. (2025). Computational prediction of protein functional annotations. *Methods in Molecular Biology*, 2947, 3–28. [DOI: 10.1007/978-1-0716-4662-5_1](https://doi.org/10.1007/978-1-0716-4662-5_1)
33. Kolinski, A. (2011). *Multiscale approaches to protein modeling: Structure prediction, dynamics, thermodynamics and macromolecular assemblies*. Springer. [DOI: 10.1007/978-1-4419-6889-0](https://doi.org/10.1007/978-1-4419-6889-0)
34. Piovesan, D., et al. (2015). INGA: Protein function prediction combining interaction networks, domain assignments and sequence similarity. *Nucleic Acids Research*, 43(W1), W134–W140. [DOI: 10.1093/nar/gkv523](https://doi.org/10.1093/nar/gkv523)
35. Soleymani, F., et al. (2024). Structure-based protein and small molecule generation using EGNN and diffusion models: A comprehensive review. *Computational and Structural Biotechnology Journal*, 23, 2779–2797. [DOI: 10.1016/j.csbj.2024.06.021](https://doi.org/10.1016/j.csbj.2024.06.021)
36. Sadasivan, A., et al. (2025). A systematic survey of graph convolutional networks for artificial intelligence applications. *Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery*, 15(2). [DOI: 10.1002/widm.70012](https://doi.org/10.1002/widm.70012)
37. Altartouri, H., & Glasmachers, T. (2021). Improved protein function prediction by combining clustering with ensemble classification. *Journal of Advances in Information Technology*, 12(3), 197–205. [DOI: 10.12720/jait.12.3.197-205](https://doi.org/10.12720/jait.12.3.197-205)
38. Böhm, C., & Totaro, M. (2026). Analysis of sequence diversity in subfamilies of phytochrome-linked effectors. *Methods in Molecular Biology*, 2970, 19–27. [DOI: 10.1007/978-1-0716-4791-2_2](https://doi.org/10.1007/978-1-0716-4791-2_2)
39. Wu, J., et al. (2023). CurvAGN: Curvature-based adaptive graph neural networks for predicting protein-ligand binding affinity. *BMC Bioinformatics*, 24(1). [DOI: 10.1186/s12859-023-05503-w](https://doi.org/10.1186/s12859-023-05503-w)
40. Barrios-Núñez, I., et al. (2024). Decoding functional proteome information in model organisms using protein language models. *NAR Genomics and Bioinformatics*, 6(3). [DOI: 10.1093/nargab/lqae078](https://doi.org/10.1093/nargab/lqae078)
41. Sibli, S. A., Panagiotou, V. P., & Makris, C. (2025). Enhancing protein structure predictions: DeepSHAP as a tool for understanding AlphaFold2. *Expert Systems with Applications*, 286. [DOI: 10.1016/j.eswa.2025.127853](https://doi.org/10.1016/j.eswa.2025.127853)
42. Wang, S., et al. (2025). DualGOFiller: A dual-channel graph neural network with contrastive learning for enhancing function prediction in partially annotated proteins. *Lecture Notes in Computer Science*, 15647, 49–67. [DOI: 10.1007/978-3-031-90252-9_4](https://doi.org/10.1007/978-3-031-90252-9_4)
43. Cases, I., et al. (2025). Functional annotation of proteomes using protein language models. *Methods in Molecular Biology*, 2941, 127–137. [DOI: 10.1007/978-1-0716-4623-6_8](https://doi.org/10.1007/978-1-0716-4623-6_8)
44. Zhao, Z., & Rosen, G. (2020). Visualizing and annotating protein sequences using a deep neural network. *Conference Record - Asilomar Conference on Signals, Systems and Computers*, 2020, 506–510. [DOI: 10.1109/IEEECONF51394.2020.9443364](https://doi.org/10.1109/IEEECONF51394.2020.9443364)
45. Kulmanov, M., & Hoehndorf, R. (2020). DeepGOPlus: Improved protein function prediction from sequence. *Bioinformatics*, 36(2), 422–429. [DOI: 10.1093/bioinformatics/btz595](https://doi.org/10.1093/bioinformatics/btz595)
46. Tiwari, A. K., & Srivastava, R. (2014). A survey of computational intelligence techniques in protein function prediction. *International Journal of Proteomics*, 2014. [DOI: 10.1155/2014/845479](https://doi.org/10.1155/2014/845479)
47. Schneider, G., & Fechner, U. (2004). Advances in the prediction of protein targeting signals. *Proteomics*, 4(6), 1571–1580. [DOI: 10.1002/pmic.200300786](https://doi.org/10.1002/pmic.200300786)
48. Dickinson, Q., & Meyer, J. G. (2022). Positional SHAP (PoSHAP) for interpretation of machine learning models trained from biological sequences. *PLoS Computational Biology*, 18(1). [DOI: 10.1371/journal.pcbi.1009736](https://doi.org/10.1371/journal.pcbi.1009736)
49. Singh, P., & Kumar, A. (2020). Deciphering the function of unknown Leishmania donovani cytosolic proteins using hyperparameter-tuned random forest. *Network Modeling Analysis in Health Informatics and Bioinformatics*, 9(1). [DOI: 10.1007/s13721-019-0208-2](https://doi.org/10.1007/s13721-019-0208-2)
50. Piovesan, D., et al. (2024). CAFA-evaluator: A Python tool for benchmarking ontological classification methods. *Bioinformatics Advances*, 4(1). [DOI: 10.1093/bioadv/vbae043](https://doi.org/10.1093/bioadv/vbae043)