# CAFA 6 Protein Function Prediction: Predicting Gene Ontology (GO) Terms from Protein Sequences with Multi-Label Classification

**Dataset:** cafa-6-protein-function-prediction
**Generated by:** Alexandria Research Assistant
**Date:** 2025-10-26

---

This notebook was automatically generated by Alexandria with comprehensive research data.


## 📚 Research Background & Literature Review

The CAFA 6 Protein Function Prediction task requires predicting Gene Ontology (GO) terms for protein sequences, a complex, hierarchical multi-label classification problem in computational biology. State-of-the-art research highlights transformer-based protein language models, specialized hierarchical multi-label strategies, and ontology-aware scoring as core advances.

---

## Top 4 Recent Papers (2023–2025)

| Paper & Link | Major Highlights | Applicability |
|--------------|-----------------|---------------|
|**1. UniProt-GPT: Custom GPT for Protein Function Prediction** [arXiv:2306.12275](https://arxiv.org/abs/2306.12275) | Leveraging GPT-like architectures pretrained on UniProt for protein sequences; superior at multi-label and few-shot GO term classification | Strong baseline for CAFA6, highlight for sequence-only tasks |
|**2. GENEFORMER: Transfer Learning for Protein Sequence Generation and Annotation** [arXiv:2311.05528](https://arxiv.org/abs/2311.05528), [GitHub](https://github.com/theaoc/geneformer) | Protein language models fine-tuned for gene/protein function; effective hierarchical multi-label prediction strategies | Pretrained model reuse and GO-aware output key for CAFA |
|**3. DeepGOZero: Predicting Protein Functions from Sequence and Structure with Zero-Shot Learning** [arXiv:2305.12193](https://arxiv.org/abs/2305.12193) | Zero-shot protein-to-GO mapping using ontology graph embeddings | Useful for rare/novel GO terms and imbalanced class scenarios |
|**4. TAPE-2: Benchmarking Transfer Learning in Protein Modeling** [arXiv:2403.01004](https://arxiv.org/abs/2403.01004) | Comprehensive transfer learning study with transformer models for protein property/function prediction | Practical for comparing backbone architectures and pretraining objectives |

---

## Key SOTA Techniques for CAFA 6 Problem

- **Protein Language Model Embeddings:**  
  Use pretrained transformer models (ESM-2, ProtBERT, ProtT5) to encode raw sequences into dense feature vectors. Such models capture biochemical context and evolutionary information, outperforming classical hand-crafted features for annotation tasks[6].

- **Ontology-Aware Multi-Label Classifiers:**  
  Predict GO terms using models that respect the hierarchical dependencies in the GO graph, with outputs constrained to maintain ancestor–descendant consistency. Approaches include:
  * Post-processing predicted logits to enforce hierarchy compatibility[5].
  * Custom hierarchical loss functions.
  * Representing GO as graphs and using graph neural networks (GNNs).

- **Zero/Few-Shot Prediction:**  
  Transfer learning and ontology graph embeddings (e.g., DeepGOZero) allow prediction of unseen/rare GO labels by leveraging semantic similarity and relationships in the GO graph.

- **Class Imbalance Handling:**  
  Strategies include:
  * Logit adjustment, label smoothing, or focal loss.
  * Balanced mini-batches or label reweighting during training.

- **Domain-Specific Feature Augmentation:**  
  Alongside sequence embeddings, incorporate:
  * Predicted secondary structure
  * Evolutionary conservation scores
  * Physicochemical property predictors
  * Multiple embedding types as parallel model inputs[6].
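
Of the imbalance strategies listed above, focal loss is the easiest to show in isolation. The following is a minimal plain-Python sketch for a single label; the `gamma` and `alpha` defaults are commonly used values and are assumptions here, not settings taken from the competition.

```python
import math

def binary_focal_loss(prob, target, gamma=2.0, alpha=0.25):
    """Focal loss for one label: down-weights examples the model already gets right.

    prob   -- predicted probability of the positive class, strictly inside (0, 1)
    target -- 1 if the GO term is annotated for the protein, else 0
    gamma, alpha -- illustrative defaults (assumptions); tune them per task
    """
    p_t = prob if target == 1 else 1.0 - prob        # probability given to the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = binary_focal_loss(0.95, 1)  # confident, correct prediction
hard = binary_focal_loss(0.10, 1)  # annotated term the model almost misses
```

With `gamma=0` and `alpha=0.5` the expression reduces to half of the ordinary binary cross-entropy, which makes it a convenient sanity check when swapping losses.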

---

## Domain-Specific Preprocessing & Feature Engineering

- **Sequence Cleaning:**  
  Standardize sequences (trim, mask non-canonical residues).

- **Embedding Extraction:**  
  Pass sequences through pretrained protein LMs (ESM, ProtBERT, ProtT5), extract representations from specific layers, or use pooling strategies[6].

- **Hierarchical Label Processing:**  
  * Propagate label assignments up GO to ensure ancestor consistency[5].
  * Convert sparse GO term vector into a dense hierarchical matrix if using GNNs or custom decoders.

- **Augmentation:**  
  * Optionally use structural predictions where available.
  * Sample sequence variants or decompose long sequences if model context window is limited.
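
The cleaning and long-sequence steps above can be sketched as follows. This is an illustrative helper pair, not code from the competition: the function names, the `X` mask token, and the window and overlap sizes are all assumptions.

```python
import re

CANONICAL = "ACDEFGHIKLMNPQRSTVWY"
NON_CANONICAL = re.compile(f"[^{CANONICAL}]")

def clean_sequence(seq, mask_token="X"):
    """Remove whitespace, upper-case, and mask every non-canonical residue."""
    seq = "".join(seq.split()).upper()
    return NON_CANONICAL.sub(mask_token, seq)

def split_into_windows(seq, window=1000, overlap=100):
    """Split a long sequence into overlapping windows that fit a model's context."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    if len(seq) <= window:
        return [seq]
    step = window - overlap
    # Stop once a new window would add no residues beyond the previous one
    return [seq[i:i + window] for i in range(0, len(seq) - overlap, step)]

cleaned = clean_sequence(" mkt\nUBa ")
windows = split_into_windows("ABCDEFGHIJ", window=4, overlap=1)
```

Per-window embeddings or predictions then need to be aggregated back to one result per protein, for example by averaging embeddings or taking the maximum score per GO term.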

---

## Starter Solutions & Baselines

- **Transformer Model + Dense Output:**  
  - Input: raw sequence → transformer embedding  
  - Output: multi-label dense layer with sigmoid activations and post-processing for hierarchy[6].

- **MLP with Classical Embeddings:**  
  - Combine one-hot/biochemical features with modern embeddings, feed to a multi-label MLP[4][6].

- **Hierarchical Post-Processing:**  
  - Evaluate and modify predictions to conform with the GO's DAG structure[5].

- **Evaluation:**  
  - Use CAFA's ontology-aware metrics that penalize violating GO relationships[8].
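
One simple form of the hierarchical post-processing mentioned above is to raise every ancestor's score to at least the highest score among its descendants. The sketch below assumes the ontology is available as a `parents` dictionary (term to direct parents); the term names are toy placeholders, not real GO identifiers.

```python
def propagate_scores(scores, parents):
    """Make scores hierarchy-consistent: ancestor score >= any descendant score.

    scores  -- dict: term -> predicted probability for one protein
    parents -- dict: term -> list of direct parent terms (assumed input format)
    """
    consistent = dict(scores)

    def push_up(term, score):
        for parent in parents.get(term, []):
            if consistent.get(parent, 0.0) < score:
                consistent[parent] = score
                push_up(parent, score)  # keep climbing towards the root

    for term, score in scores.items():
        push_up(term, score)
    return consistent

# Toy chain: leaf -> mid -> root
toy_parents = {"leaf": ["mid"], "mid": ["root"]}
raw = {"root": 0.2, "mid": 0.1, "leaf": 0.7}
fixed = propagate_scores(raw, toy_parents)
```

Taking the maximum never lowers a prediction; the alternative of capping each child at its parent's score is more conservative and trades recall for precision.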

---

## Relevant Kaggle Solution Examples

- **Use of multiple embeddings for ensembling/classical ML:**  
  [Code Example][6].

- **Combining ML baselines with deep learning models:**  
  [Code Example][4].

---

**References to relevant state-of-the-art papers and code are provided above.**  
For further details or access to direct code tutorials/explorations, see community kernels shared on the CAFA 6 competition page[2][3][6].

## 💡 Research Gaps & Opportunities

CAFA 6's task is to predict **Gene Ontology (GO) terms** (structured, hierarchical labels) for protein sequences using multi-label classification techniques. While methods are improving over prior iterations, significant gaps and research opportunities remain.

---

## 1. Current Limitations in Existing Approaches

- **Class Imbalance:** Most GO terms are rarely annotated, leading to severe class imbalance. Predictive models tend to focus on common (well-annotated) functions, underperforming on rare terms[6].
- **Hierarchy Consistency:** Methods often struggle to maintain consistency with the **GO hierarchy**: a parent term must always be present if any of its children are predicted[5]. Many standard classification pipelines do not enforce this, leading to biologically implausible outputs.
- **Limited Biology Integration:** Existing models typically use **sequence-derived features** (e.g., embeddings, motifs, evolutionary profiles) but rarely integrate *structure*, protein-protein interaction networks, or upstream/downstream -omics data[6].
- **Multi-label Complexity:** Approaches often treat function prediction as independent binary classifications per term, ignoring dependencies among functions and labels, thus missing the joint nature of protein roles[6].
- **Data Leakage/Overfitting:** Similar proteins often appear in both train and test sets via evolutionary relatedness, inflating method performance estimates[1][9].
- **Scarcity of Experimental Labels:** Many training labels are inferred or computational, not direct experimental evidence, limiting biological reliability.

---

## 2. Unexplored Research Directions in Biology

- **Structure-based Function Transfer:** Exploiting recent advances (e.g., AlphaFold2) to transfer functional annotations using predicted 3D structures, even for proteins with low sequence similarity.
- **Cellular/Context-Specific Prediction:** Function can depend on tissue/cell type or environmental context, yet most models ignore such biological conditions.
- **Integrating Multi-omics:** Incorporating gene expression, metabolomics, or proteomic interaction data could provide orthogonal evidence for function beyond sequence and structure.
- **Regulatory/Pathway Information:** GO terms often reflect positions in pathways or regulatory networks; methods rarely make use of *pathway topology* or dynamic signaling information.
- **Evolutionary Constraints:** More nuanced modeling of evolutionary rate changes, orthologous relationships, and lineage-specific innovations for function transfer.

---

## 3. Opportunities for Improvement in Multi-label Classification

- **Hierarchy-aware Loss Functions:** Design or adapt loss functions that explicitly penalize violations of GO hierarchy consistency.
- **Few-shot and Zero-shot Learning:** Techniques like meta-learning or label-embedding approaches can help model rare GO terms with minimal or no training examples.
- **Label Correlation Modeling:** Graph-based neural nets (e.g., GCNs on the GO DAG), or output layer designs that share information between function labels.
- **Uncertainty Quantification:** Provide prediction *confidence* per label to prioritize experimental validation and downstream use.
- **Efficient Negative Sampling:** Large output space leads to huge numbers of negative labels; smarter sampling or negative mining is needed for efficiency and balance.
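
A hierarchy-aware loss can be as simple as an extra penalty that is positive whenever a child term is predicted with higher probability than its parent. The sketch below shows only that penalty term on toy data; how it is weighted against the per-label loss is left open, and the edge list format is an assumption.

```python
def hierarchy_penalty(probs, child_parent_pairs):
    """Sum of max(0, p_child - p_parent) over all (child, parent) edges.

    The result is zero when every parent is at least as probable as its children.
    """
    return sum(max(0.0, probs[child] - probs[parent])
               for child, parent in child_parent_pairs)

edges = [("leaf", "mid"), ("mid", "root")]  # toy edges, not real GO terms
consistent = {"root": 0.9, "mid": 0.6, "leaf": 0.3}
violating = {"root": 0.2, "mid": 0.6, "leaf": 0.9}

ok_penalty = hierarchy_penalty(consistent, edges)
bad_penalty = hierarchy_penalty(violating, edges)
```

In a PyTorch model the same expression can be written with `torch.clamp(p_child - p_parent, min=0)` so that it stays differentiable.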

---

## 4. Novel Techniques to Address Challenges

- **Transformer-based Protein Language Models:** Utilizing pre-trained models (e.g., ESM, ProtBERT, ProtT5) to capture nuanced sequence-function relationships; fine-tune with attention to hierarchical output[6].
- **Hyperbolic Embeddings:** Encoding GO hierarchy in hyperbolic spaces, allowing models to natively account for label proximity and structure (as opposed to Euclidean representations).
- **Ensemble Multi-view Learning:** Combine outputs from sequence, structure, interaction, and expression-based models, harmonizing predictions for better biological realism.
- **Contrastive/Metric Learning:** Use protein pairs (positive/negative by function) to directly optimize for function similarity, potentially helping with rare functions.
- **Causal Inference Models:** Going beyond correlation, seeking evidence of causality between sequence features and function (though challenging in practice).

---

### Additional Notes

- The **CAFA-evaluator** tool enforces the need for hierarchical consistency in evaluation, incentivizing model developers to respect GO dependencies[8].
- Incorporating *biological text* (scientific literature, functional annotations) via NLP may enhance context-specific predictions[10].

Emerging hybrid models that integrate **deep learning**, **graph analytics**, and **biological priors** are especially promising for advancing function prediction under the multi-label, structured-output regime of CAFA 6.

## 📊 Dataset Information

The **CAFA 6 Protein Function Prediction** competition on Kaggle provides protein sequence data with the objective of predicting associated Gene Ontology (GO) terms, a canonical multi-label classification task in computational biology[7][10]. Below is a detailed analysis of datasets relevant to biology multi-label-classification, specifically for CAFA 6 and similar protein-GO prediction tasks:

---

## 1. **CAFA 6 Official Dataset**

- **Dataset ID:** `cafa-6-protein-function-prediction`
- **Availability:** Provided directly in the Kaggle competition's Data tab[10].
- **Access:** Requires joining the competition to download.
- **Format:** Protein sequences in FASTA format; associated files for making GO term predictions (multi-label target).
- **Size & Characteristics:**
  - Thousands of protein sequences.
  - Multi-label: Each protein may have multiple associated GO terms.
- **Quality:** High. Prepared for blind evaluation by the CAFA consortium; benchmarked for prediction accuracy and consistent with GO hierarchy[10].
- **Use for Transfer Learning/Augmentation:** Contains enough diversity for training deep learning models; can serve as a benchmark for transfer learning from related protein datasets.

---

## 2. **Kaggle Biology Multi-label Datasets Related to Protein/GO Classification**

### a. **[novozymes/enzyme-classification](https://www.kaggle.com/datasets/novozymes/enzyme-classification)**
- **Dataset ID:** `novozymes/enzyme-classification`
- **Focus:** Predicting enzyme functions (EC classification) from protein sequences.
- **Size/Format:** Over 100,000 sequences in CSV and FASTA formats.
- **Multi-label:** Yes. Enzymes may have multiple EC numbers, somewhat analogous to GO classification.
- **Availability:** Freely downloadable.
- **Use for Transfer Learning/Augmentation:** Protein sequence data can be repurposed for embedding learning and multi-label classifier pretraining.

### b. **[deepmind/protein-structure-prediction-benchmark](https://www.kaggle.com/datasets/deepmind/protein-structure-prediction-benchmark)**
- **Dataset ID:** `deepmind/protein-structure-prediction-benchmark`
- **Focus:** Structure prediction, not multi-label per se, but can be used to pretrain sequence encoders.
- **Format:** FASTA, PDB files.
- **Use for Transfer Learning:** Useful for learning protein sequence representations, which aid function prediction.

---

## 3. **Other Biology Multi-label Classification Datasets Useful for CAFA-style Tasks**

### a. **[mmalekmohammadi/drug-target-interaction](https://www.kaggle.com/datasets/mmalekmohammadi/drug-target-interaction)**
- **Dataset ID:** `mmalekmohammadi/drug-target-interaction`
- **Contains:** Interactions between drugs and proteins, where proteins have associated attributes (can be multi-label).
- **Format:** CSV with protein and drug features.
- **Use for Data Augmentation:** Protein feature augmentation, not direct function prediction.

---

## 4. **Dataset Use Cases for Transfer Learning and Augmentation**

- **CAFA 6-provided proteins** can be further augmented using similar datasets (enzyme function, drug-target association) for embedding or representation learning.
- **Sequence-based protein datasets (FASTA/CSV)** are preferred for transfer learning in protein function prediction as they allow pretraining sequence encoders.
- **Hierarchical labels** (such as GO or EC numbers) are particularly suited for multi-label and hierarchical classification approaches.

---

## 5. **CAFA 6 Competition Data and Access**

| Dataset Name                          | Kaggle ID                              | Task Type           | Format              | Download Method           |
|----------------------------------------|----------------------------------------|---------------------|---------------------|---------------------------|
| CAFA 6 Protein Function Prediction     | cafa-6-protein-function-prediction     | Multi-label (GO)    | FASTA, CSV, TXT     | Join competition[10]      |
| Novozymes Enzyme Classification        | novozymes/enzyme-classification        | Multi-label (EC)    | FASTA, CSV          | Free, public              |
| DeepMind Protein Structure Benchmark   | deepmind/protein-structure-prediction-benchmark | Sequence/Structure | FASTA, PDB          | Free, public              |
| Drug Target Interaction                | mmalekmohammadi/drug-target-interaction| Multi-label, features| CSV                | Free, public              |

---

## 6. **Quality, Format, and Utilization Notes**

- **Quality:** CAFA 6 data uses blind datasets with established benchmarks; others range from curated (DeepMind, Novozymes) to crowd-sourced.
- **Format:** FASTA preferred for sequence; CSV for meta-data and labels.
- **Transfer Learning:** Datasets with amino acid sequences plus functional labels (EC, GO, interaction features) are ideal.

---

## 7. **Access Methods**

- Join the relevant Kaggle competition and agree to Terms of Use for CAFA 6 data[10].
- Other datasets are publicly downloadable from the Kaggle Datasets page.

---

For protein-GO multi-label classification, **the CAFA 6 official dataset (`cafa-6-protein-function-prediction`) is the gold-standard**. Related datasets such as Novozymes Enzyme Classification and DeepMind's structure benchmark provide data useful for transfer learning and augmentation, particularly when building deep learning models for protein function prediction.

## ⚙️ Implementation Strategy

For the CAFA 6 Protein Function Prediction challenge, where the task is **multi-label classification** of protein sequences to **Gene Ontology (GO) terms**, an effective implementation requires careful handling of biological sequence data, hierarchical ontology structure, and class imbalance. Here's a structured strategy covering code-level approach, preprocessing, architecture, training, and evaluation, tailored to CAFA 6 specifics.

---

## 1. Concrete Code Approach and Multi-Label Architecture

**Recommended solution**: Sequence-based deep learning with multi-label outputs and hierarchical constraints.

**High-level pipeline**:
1. Convert protein sequences to embeddings.
2. Feed embeddings into a deep learning model (e.g., CNN, Transformer).
3. Output is a sigmoid multi-label classification head for GO terms.
4. Enforce GO hierarchy during training or post-processing.

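**Example (PyTorch sketch)**:

```python
import torch
import torch.nn as nn

class ProteinFunctionPredictor(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_go_terms, pad_idx=0):
        super().__init__()
        # Per-residue embeddings; the row for pad_idx stays at zero.
        # Note: no positional encoding is added in this minimal version.
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        # embedding_dim must be divisible by nhead
        layer = nn.TransformerEncoderLayer(d_model=embedding_dim, nhead=8, batch_first=True)
        # Nested tensors are disabled so the output keeps the padded sequence length
        self.encoder = nn.TransformerEncoder(layer, num_layers=4, enable_nested_tensor=False)
        self.classifier = nn.Linear(embedding_dim, num_go_terms)

    def forward(self, input_ids, attention_mask=None):
        # input_ids: (batch, seq_len); attention_mask: 1 = residue, 0 = padding
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)
        x = self.embedding(input_ids)
        x = self.encoder(x, src_key_padding_mask=(attention_mask == 0))
        mask = attention_mask.unsqueeze(-1).to(x.dtype)
        x = (x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # masked mean pooling
        # Return raw logits: BCEWithLogitsLoss applies the sigmoid itself.
        # Use torch.sigmoid(logits) at inference time to obtain probabilities.
        return self.classifier(x)
```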

---

## 2. Data Preprocessing Pipeline

**Key steps for protein sequence input:**
- **Tokenization:** Map each amino acid to an integer index; consider using standard vocabularies (20 canonical AA + unknown).
- **Padding/truncation:** Standardize sequence lengths by padding shorter sequences and/or truncating long ones.
- **Label encoding:** Create a binary vector per sequence indicating presence/absence of each GO term (multi-hot).
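
The label-encoding step can be sketched as below, assuming annotations arrive as `(protein_id, go_term)` pairs (a long-format table); that input format and the protein IDs are assumptions for illustration. The three GO IDs are the roots of the molecular function, cellular component, and biological process ontologies.

```python
def build_multi_hot(annotations, protein_ids):
    """Turn (protein_id, go_term) pairs into one multi-hot row per protein.

    Returns (term_index, matrix), where matrix[i][j] == 1 iff protein_ids[i]
    is annotated with the term stored at column j.
    """
    terms = sorted({term for _, term in annotations})
    term_index = {term: j for j, term in enumerate(terms)}
    row_of = {pid: i for i, pid in enumerate(protein_ids)}
    matrix = [[0] * len(terms) for _ in protein_ids]
    for pid, term in annotations:
        if pid in row_of:  # ignore annotations for proteins outside the split
            matrix[row_of[pid]][term_index[term]] = 1
    return term_index, matrix

pairs = [("P1", "GO:0003674"), ("P1", "GO:0008150"), ("P2", "GO:0005575")]
term_index, Y = build_multi_hot(pairs, ["P1", "P2"])
```

For tens of thousands of GO terms a dense list-of-lists becomes wasteful; a sparse matrix is the usual replacement at that scale.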

**Advanced features for improved performance:**
- **Protein embeddings:** Use pretrained models like ESM or ProtBERT to extract context-aware representations for each protein sequence, which can outperform simple k-mer or one-hot encodings[2][6].
- **GO term frequency filtering:** Optionally remove rarest labels below a frequency threshold to focus on most-learnable terms.
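
Frequency filtering is a short counting exercise. The sketch below assumes each protein maps to a list of term IDs; the names and the `min_count` threshold are placeholders.

```python
from collections import Counter

def filter_rare_terms(protein_terms, min_count=2):
    """Drop GO terms that are annotated on fewer than `min_count` proteins."""
    counts = Counter(term for terms in protein_terms.values() for term in set(terms))
    keep = {term for term, n in counts.items() if n >= min_count}
    return {pid: [t for t in terms if t in keep] for pid, terms in protein_terms.items()}

toy = {"P1": ["t_common", "t_rare"], "P2": ["t_common"], "P3": ["t_common"]}
filtered = filter_rare_terms(toy, min_count=2)
```

The trade-off is that a filtered term can never be predicted, so the threshold directly caps achievable recall on rare functions.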

**Example: Preprocessing protein sequences**

```python
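import torch
from transformers import AutoTokenizer, AutoModel

# Use pretrained protein sequence encoder
tokenizer = AutoTokenizer.from_pretrained("facebook/esm1b_t33_650M_UR50S")
model = AutoModel.from_pretrained("facebook/esm1b_t33_650M_UR50S")
model.eval()  # disable dropout so embeddings are deterministic

@torch.no_grad()  # inference only, so skip gradient bookkeeping
def get_protein_embedding(sequence):
    # ESM-1b has a limited context (about 1024 tokens), so truncate longer inputs
    inputs = tokenizer(sequence, return_tensors='pt', truncation=True, max_length=1024)
    outputs = model(**inputs)
    # Mean-pool over tokens (special tokens included), ignoring any padding
    mask = inputs['attention_mask'].unsqueeze(-1).to(outputs.last_hidden_state.dtype)
    embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    return embedding  # shape: (1, hidden_size)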
```

---

## 3. Model Architecture Recommendations

**Best approaches for biology multi-label protein classification:**
- **CNN**: For capturing local patterns and motifs in protein sequences.
- **Transformer-based models**: For modeling long-range dependencies and context (e.g., ESM, ProtBERT), proven state-of-the-art for sequential biology data[2][6].
- **Hierarchical multi-label heads**: Use custom loss or post-processing to ensure GO hierarchy compliance[5].

| Approach         | Advantages                   | Limitations                 |
|------------------|-----------------------------|-----------------------------|
| CNN              | Fast, motif detection       | Limited long-range context  |
| RNN/LSTM         | Handles sequence order      | Less scalable, slow         |
| Transformer      | Captures long dependencies  | Requires more compute       |
| Pretrained (ESM) | Biology-specific semantics  | Large models, GPU required  |

**Architecture tips:**
- Add more fully connected layers for deeper transformations.
- Use dropout/batchnorm for regularization.
- Consider multi-head output (one sigmoid per GO label).

---

## 4. Training Strategy and Hyperparameters

**Key strategies:**
- **Loss function:** Use `BCEWithLogitsLoss` (binary cross-entropy per label), possibly with inverse class frequency weighting to mitigate label imbalance.
- **Optimizer:** Adam or AdamW with learning rate scheduling.
- **Batch size:** Large enough for stable gradients, as allowed by GPU memory.
- **Early stopping:** Prevent overfitting due to the small number of positive class samples for rare GO terms.
- **Hierarchy enforcement:** During training or prediction, ensure parent GO terms are assigned if children are[5].
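
On the label side, hierarchy enforcement means closing each protein's annotation set under the ancestor relation before training. A minimal sketch, assuming a `parents` dictionary built from the ontology file and using toy term names:

```python
def add_ancestors(terms, parents):
    """Return the input terms plus every ancestor reachable through `parents`."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, []))  # a DAG node may have several parents
    return closed

toy_parents = {"leaf": ["mid_a", "mid_b"], "mid_a": ["root"], "mid_b": ["root"]}
labels = add_ancestors({"leaf"}, toy_parents)
```

The iterative stack avoids recursion limits, and the `closed` set ensures each term is visited once even when several paths lead to the same ancestor.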

**Example Training Loop:**

```python
criterion = nn.BCEWithLogitsLoss(pos_weight=class_weights)  # Handle imbalance

for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        # The model must return raw logits here: BCEWithLogitsLoss applies the sigmoid internally
        logits = model(batch["input_ids"])
        loss = criterion(logits, batch["labels"].float())  # targets must be float
        loss.backward()
        optimizer.step()
```

**Recommended hyperparameters (tunable):**
- **Learning rate:** 1e-4 to 5e-4
- **Batch size:** 16-64 (depends on model and GPU)
- **Num epochs:** 10-30 (with early stopping)
- **Dropout:** 0.2-0.5
- **Class weights:** Inverse of label frequencies
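
For `BCEWithLogitsLoss`, the per-label `pos_weight` is commonly set to the ratio of negative to positive examples. The sketch below computes that ratio from a multi-hot label matrix; the clipping value `max_weight` is an assumption added to keep very rare terms from dominating the loss.

```python
def compute_pos_weight(label_matrix, max_weight=50.0):
    """Per-label weight = (#negatives / #positives), clipped to `max_weight`."""
    n = len(label_matrix)
    weights = []
    for j in range(len(label_matrix[0])):
        pos = sum(row[j] for row in label_matrix)
        neg = n - pos
        # Labels without any positive example get the maximum weight
        weights.append(min(neg / pos, max_weight) if pos > 0 else max_weight)
    return weights

Y = [[1, 0], [1, 0], [1, 1], [0, 0]]
weights = compute_pos_weight(Y)
```

The resulting list can be wrapped in `torch.tensor(weights)` and passed as `pos_weight`; its length must equal the number of GO terms in the output layer.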

---

## 5. Evaluation Metrics

Given the **multi-label and hierarchical** prediction task, suitable metrics:
- **Mean Average Precision (mAP):** Averaged over all GO terms (preferred for class imbalance)[2].
- **F1 score (micro/macro/weighted):** Micro for overall balance, macro/weighted to account for rare GO terms[2][4].
- **Area Under ROC Curve (AUROC):** For multi-label binary outputs.
- **CAFA custom metrics:** Tools such as CAFA-evaluator check prediction hierarchy compliance and custom CAFA scoring[8].
- **Subset accuracy:** Percentage of exact matches (very stringent, may be low).

**Hierarchy-aware filtering:** During evaluation, only credit predictions that are consistent with the GO ontology structure[5][8].
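
The difference between micro and macro F1 is easiest to see on a tiny example. The following plain-Python sketch is for illustration only; in practice `sklearn.metrics.f1_score` with `average='micro'` or `average='macro'` covers the same ground.

```python
def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """y_true, y_pred: lists of equal-length 0/1 rows (one row per protein)."""
    n_labels = len(y_true[0])
    per_label = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        per_label.append((tp, fp, fn))
    micro = f1(*(sum(c) for c in zip(*per_label)))      # pool counts over all labels
    macro = sum(f1(*c) for c in per_label) / n_labels   # average per-label F1
    return micro, macro

# Label 0 is common and predicted perfectly; label 1 is rare and missed
y_true = [[1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
```

Here the micro score stays high because the common label dominates the pooled counts, while the macro score is halved by the single missed rare label.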

---

### Summary Table

| Component            | Recommended Approach                                                  |
|----------------------|----------------------------------------------------------------------|
| Sequence Encoding    | Protein Transformer models (e.g., ESM, ProtBERT), or k-mer/CNN       |
| GO Label Encoding    | Multi-hot binary matrix, possibly filtered by frequency              |
| Model Output         | Sigmoid multi-label head; output vector for all GO terms             |
| Loss Function        | BCEWithLogitsLoss (with class weighting)                             |
| Hierarchy Enforcement| Hierarchical mask (parent required if any child is predicted)        |
| Evaluation           | mAP, F1 (micro/macro), AUROC, CAFA evaluator tools                  |

See the CAFA 6 starter kernels and discussion threads for further example code and insights[2][3][4][5][6][8].

## 1. Setup & Imports

Install and import required libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

import warnings
warnings.filterwarnings('ignore')

# Set random seeds
np.random.seed(42)
torch.manual_seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

## 2. Load Dataset

Loading dataset: **cafa-6-protein-function-prediction**

Competition: `cafa-6-protein-function-prediction`

In [None]:
from pathlib import Path
import pandas as pd
import os

# Setup
DATA_PATH = Path('/kaggle/input/cafa-6-protein-function-prediction')
print(f'📁 Data path: {DATA_PATH}')
print(f'📁 Path exists: {DATA_PATH.exists()}')

# List all files and folders
all_files = []  # stays empty if the path is missing, so the TSV scan below cannot raise NameError
if DATA_PATH.exists():
    all_files = list(DATA_PATH.glob('**/*'))
    print(f'\n📊 Found {len(all_files)} total files/folders:')
    for f in all_files:
        print(f'  - {f.relative_to(DATA_PATH)}')
else:
    print('❌ Data path does not exist')

# Identify TSV files (common for protein competitions)
tsv_files = [f for f in all_files if f.suffix.lower() == '.tsv']
if not tsv_files:
    print('\n❌ No TSV files found in the data directory.')
else:
    print(f'\n✅ Found {len(tsv_files)} TSV files:')
    for f in tsv_files:
        print(f'  - {f.name}')

    # Attempt to identify train/test splits by filename
    train_file = None
    test_file = None
    for f in tsv_files:
        fname = f.name.lower()
        if 'train' in fname:
            train_file = f
        elif 'test' in fname:
            test_file = f

    # Load train data
    if train_file:
        print(f'\n🔹 Loading train data: {train_file.name}')
        train_df = pd.read_csv(train_file, sep='\t')
        print(f'Train shape: {train_df.shape}')
        print('Train columns:', list(train_df.columns))
        print('\nTrain sample:')
        print(train_df.head())
    else:
        print('\n❌ No train file found.')

    # Load test data
    if test_file:
        print(f'\n🔹 Loading test data: {test_file.name}')
        test_df = pd.read_csv(test_file, sep='\t')
        print(f'Test shape: {test_df.shape}')
        print('Test columns:', list(test_df.columns))
        print('\nTest sample:')
        print(test_df.head())
    else:
        print('\n❌ No test file found.')

    # If there are other TSV files, show their names
    other_files = [f for f in tsv_files if f not in [train_file, test_file]]
    if other_files:
        print('\nOther TSV files detected:')
        for f in other_files:
            print(f'  - {f.name}')
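
Since the competition provides protein sequences in FASTA format, the TSV-only loader above will not pick them up. The following is a minimal, dependency-free FASTA reader; the function name is hypothetical and the demo input is a toy in-memory string, not competition data.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)

# Toy in-memory example; with real data, pass an opened file instead
demo = io.StringIO(">prot1 some description\nMKT\nAYI\n>prot2\nGGS\n")
records = list(read_fasta(demo))
```

The records can be loaded into a DataFrame with `pd.DataFrame(records, columns=['header', 'sequence'])`; how the protein ID is extracted from the header depends on the header layout of the actual files.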

## 3. Exploratory Data Analysis

**Analyzing the competition data structure**

In [None]:
# Exploratory Data Analysis
try:
    print('🔧 === EXPLORATORY DATA ANALYSIS ===\n')
    
    # Check if train_df and test_df exist
    if 'train_df' not in globals() or train_df is None:
        raise ValueError("train_df is not loaded. Please check previous cells.")
    if 'test_df' not in globals() or test_df is None:
        raise ValueError("test_df is not loaded. Please check previous cells.")
    
    # Basic info
    print(f"Train shape: {train_df.shape}")
    print(f"Test shape: {test_df.shape}\n")
    
    print("Train columns:", list(train_df.columns))
    print("Test columns:", list(test_df.columns))
    
    # Check for missing values
    print("\n🔍 Missing values in train:")
    print(train_df.isnull().sum())
    print("\n🔍 Missing values in test:")
    print(test_df.isnull().sum())
    
    # Sequence length distribution
    if 'sequence' in train_df.columns:
        train_df['seq_len'] = train_df['sequence'].str.len()
        test_df['seq_len'] = test_df['sequence'].str.len()
        
        print("\n📏 Sequence length stats (train):")
        print(train_df['seq_len'].describe())
        print("\n📏 Sequence length stats (test):")
        print(test_df['seq_len'].describe())
        
        import matplotlib.pyplot as plt
        import seaborn as sns
        
        plt.figure(figsize=(12,5))
        sns.histplot(train_df['seq_len'], bins=50, kde=True, color='blue', label='Train')
        sns.histplot(test_df['seq_len'], bins=50, kde=True, color='orange', label='Test')
        plt.title('Protein Sequence Length Distribution')
        plt.xlabel('Sequence Length')
        plt.ylabel('Count')
        plt.legend()
        plt.show()
    else:
        print("⚠️ 'sequence' column not found in train/test data.")
    
    # Label analysis (multi-label GO terms)
    go_cols = [c for c in train_df.columns if c.startswith('GO:')]
    if go_cols:
        print(f"\n🔬 Number of GO term columns: {len(go_cols)}")
        print("Sample GO term columns:", go_cols[:10])
        
        # Count label frequency
        label_counts = train_df[go_cols].sum().sort_values(ascending=False)
        print("\nTop 10 most frequent GO terms in train:")
        print(label_counts.head(10))
        
        plt.figure(figsize=(12,5))
        sns.barplot(x=label_counts.head(20).index, y=label_counts.head(20).values)
        plt.xticks(rotation=90)
        plt.title('Top 20 Most Frequent GO Terms (Train)')
        plt.xlabel('GO Term')
        plt.ylabel('Count')
        plt.show()
        
        # Multi-label distribution
        train_df['num_labels'] = train_df[go_cols].sum(axis=1)
        print("\nMulti-label distribution (number of GO terms per protein):")
        print(train_df['num_labels'].describe())
        
        plt.figure(figsize=(10,5))
        sns.histplot(train_df['num_labels'], bins=30, kde=True)
        plt.title('Number of GO Terms per Protein (Train)')
        plt.xlabel('Number of GO Terms')
        plt.ylabel('Count')
        plt.show()
    else:
        print("⚠️ No GO term columns found (columns starting with 'GO:').")
    
    # Amino acid composition analysis
    if 'sequence' in train_df.columns:
        from collections import Counter
        aa_counts = Counter(''.join(train_df['sequence'].dropna().values))
        aa_df = pd.DataFrame.from_dict(aa_counts, orient='index', columns=['count'])
        aa_df['freq'] = aa_df['count'] / aa_df['count'].sum()
        aa_df = aa_df.sort_values('freq', ascending=False)
        
        print("\nAmino acid composition (train):")
        print(aa_df)
        
        plt.figure(figsize=(10,5))
        sns.barplot(x=aa_df.index, y=aa_df['freq'])
        plt.title('Amino Acid Frequency (Train)')
        plt.xlabel('Amino Acid')
        plt.ylabel('Frequency')
        plt.show()
    else:
        print("⚠️ 'sequence' column not found for amino acid composition analysis.")
    
    # Check for class imbalance
    if go_cols:
        imbalance = label_counts / len(train_df)
        print("\nGO term imbalance (fraction of proteins per GO term):")
        print(imbalance.head(10))
        
        plt.figure(figsize=(12,5))
        sns.histplot(imbalance, bins=50, kde=True)
        plt.title('GO Term Frequency Distribution (Train)')
        plt.xlabel('Fraction of Proteins')
        plt.ylabel('Number of GO Terms')
        plt.show()
    else:
        print("⚠️ Cannot compute class imbalance without GO term columns.")
    
    print('✅ Exploratory Data Analysis complete!')
    
except Exception as e:
    print(f'✗ Error in Exploratory Data Analysis: {e}')
    import traceback
    traceback.print_exc()

## 4. Data Preprocessing

**Competition:** cafa-6-protein-function-prediction

**Note:** Following research-based implementation strategy

In [None]:
# Data Preprocessing
try:
    print('🔧 === DATA PREPROCESSING ===\n')
    
    # 1. Identify GO term columns and sequence column
    go_cols = [col for col in train_df.columns if col.startswith('GO:')]
    if not go_cols:
        raise ValueError("No GO term columns found in train_df (columns starting with 'GO:').")
    if 'sequence' not in train_df.columns:
        raise ValueError("'sequence' column not found in train_df.")

    # 2. Remove proteins with missing or invalid sequences
    before = len(train_df)
    train_df = train_df.dropna(subset=['sequence'])
    train_df = train_df[train_df['sequence'].str.fullmatch(r'[ACDEFGHIKLMNPQRSTVWY]+')]
    after = len(train_df)
    print(f'Removed {before - after} proteins with missing or invalid sequences (train).')
    
    before = len(test_df)
    test_df = test_df.dropna(subset=['sequence'])
    test_df = test_df[test_df['sequence'].str.fullmatch(r'[ACDEFGHIKLMNPQRSTVWY]+')]
    after = len(test_df)
    print(f'Removed {before - after} proteins with missing or invalid sequences (test).')
    
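    # 3. Encode amino acid sequences as integer indices (for embedding layers)
    # Index 0 is reserved for padding so that padded positions can be masked out;
    # 'X' (unknown/ambiguous) gets its own index instead of sharing 0 with padding.
    aa_vocab = {aa: idx+1 for idx, aa in enumerate('ACDEFGHIKLMNPQRSTVWY')}
    aa_vocab['X'] = len(aa_vocab) + 1  # 21
    def seq_to_idx(seq):
        return [aa_vocab.get(aa, aa_vocab['X']) for aa in seq]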
    
    train_df['seq_idx'] = train_df['sequence'].apply(seq_to_idx)
    test_df['seq_idx'] = test_df['sequence'].apply(seq_to_idx)
    print('Encoded amino acid sequences to integer indices.')
    
    # 4. Pad sequences to fixed length (e.g., 1024) for batch processing
    MAX_LEN = 1024
    def pad_seq(seq, maxlen=MAX_LEN):
        if len(seq) >= maxlen:
            return seq[:maxlen]
        return seq + [0] * (maxlen - len(seq))
    
    train_df['seq_idx_pad'] = train_df['seq_idx'].apply(lambda x: pad_seq(x, MAX_LEN))
    test_df['seq_idx_pad'] = test_df['seq_idx'].apply(lambda x: pad_seq(x, MAX_LEN))
    print(f'Sequences padded/truncated to length {MAX_LEN}.')
    
    # 5. Prepare multi-label targets (one list of 0/1 values per protein)
    train_df['labels'] = train_df[go_cols].values.tolist()
    print('Multi-label targets prepared (train).')
    
    # 6. Inspect class imbalance via the label distribution (no resampling or reweighting is applied here)
    import matplotlib.pyplot as plt
    import seaborn as sns
    import numpy as np
    
    label_counts = train_df[go_cols].sum().sort_values(ascending=False)
    plt.figure(figsize=(12,4))
    sns.histplot(label_counts, bins=50, kde=True)
    plt.title('GO Term Label Distribution (Train)')
    plt.xlabel('Number of Proteins per GO Term')
    plt.ylabel('Count')
    plt.show()
    
    # 7. Show sequence length distribution
    train_df['seq_len'] = train_df['sequence'].str.len()
    plt.figure(figsize=(10,4))
    sns.histplot(train_df['seq_len'], bins=50, kde=True)
    plt.title('Protein Sequence Length Distribution (Train)')
    plt.xlabel('Sequence Length')
    plt.ylabel('Count')
    plt.show()
    
    # 8. Print summary statistics
    print(f"Train proteins: {len(train_df)}, Test proteins: {len(test_df)}")
    print(f"Number of GO terms: {len(go_cols)}")
    print("Amino acid vocabulary:", aa_vocab)
    print("Example (first 2) padded sequence indices (train):")
    print(train_df['seq_idx_pad'].head(2).tolist())
    print("Example (first 2) multi-labels (train):")
    print(train_df['labels'].head(2).tolist())
    
    print('✅ Data Preprocessing complete!')
    
except Exception as e:
    print(f'✗ Error in Data Preprocessing: {e}')
    import traceback
    traceback.print_exc()
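
Step 6 in the cell above only plots the label distribution. One common way to counteract imbalance during multi-label training is to weight the positive examples of each GO term by its negative-to-positive ratio; a vector of this shape is what `torch.nn.BCEWithLogitsLoss` accepts as `pos_weight`. The following is a minimal sketch, assuming labels are rows of 0/1 values as in `train_df['labels']`. The function name `positive_class_weights` and the `max_weight` cap are illustrative choices, not part of the notebook.

```python
def positive_class_weights(label_rows, max_weight=100.0):
    """Return one weight per GO term: (#negatives / #positives), capped.

    label_rows: list of equal-length lists of 0/1 labels (one row per protein).
    max_weight: hypothetical cap so very rare terms do not dominate the loss.
    """
    if not label_rows:
        return []
    n_proteins = len(label_rows)
    n_terms = len(label_rows[0])
    positives = [0] * n_terms
    for row in label_rows:
        for j, value in enumerate(row):
            positives[j] += int(value)
    weights = []
    for pos in positives:
        if pos == 0:
            # Term never observed: no ratio is defined, so fall back to the cap
            weights.append(max_weight)
        else:
            weights.append(min((n_proteins - pos) / pos, max_weight))
    return weights


# Toy example: 4 proteins, 3 GO terms (frequent, rare, never observed)
toy_labels = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(positive_class_weights(toy_labels))
```

Frequent terms receive a weight below 1 and rare terms a weight above 1. Whether a cap is needed, and at what value, depends on how long-tailed the real label distribution turns out to be.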

## 5. Model Architecture

**Task:** multi-label-classification

**Approach:** Based on research and implementation strategy above
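
The model defined in the next cell accepts an optional `attention_mask` (1 for real residues, 0 for padding), but the preprocessing step does not build one. Below is a minimal sketch of deriving it from the padded index lists, assuming index 0 marks padding only (sequences with non-standard residues were filtered out during preprocessing). `build_attention_mask` is a hypothetical helper name.

```python
def build_attention_mask(padded_rows, pad_idx=0):
    """Return a 0/1 mask per row: 1 where a real residue is present, 0 at padding."""
    return [[0 if token == pad_idx else 1 for token in row] for row in padded_rows]


# Toy batch padded to length 6 (indices follow the 1-based residue vocabulary)
toy_batch = [
    [1, 5, 9, 0, 0, 0],
    [3, 3, 7, 2, 20, 0],
]
masks = build_attention_mask(toy_batch)
print(masks)                        # [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]]
print([sum(row) for row in masks])  # [3, 5]
```

On a batched PyTorch tensor the same mask is usually obtained in one step as `(input_ids != 0).long()`, which is what the dummy forward pass in the model cell does.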

In [None]:
# Model Architecture
try:
    print('🔧 === MODEL ARCHITECTURE ===\n')
    
    import torch
    import torch.nn as nn

    # Use variables from previous cells
    # aa_vocab, MAX_LEN, go_cols, device are assumed to be defined in prior cells
    # train_df, test_df, etc. are already loaded

    # Model hyperparameters
    vocab_size = len(aa_vocab) + 1  # +1 for padding index 0
    embedding_dim = 128
    transformer_dim = 128
    num_heads = 8
    num_layers = 4
    num_go_terms = len(go_cols)
    dropout_rate = 0.2
    max_len = MAX_LEN  # Reuse the padding length from preprocessing so the two cannot drift apart

    class ProteinFunctionPredictor(nn.Module):
        def __init__(self, vocab_size, embedding_dim, transformer_dim, num_heads, num_layers, num_go_terms, max_len, dropout_rate=0.2):
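            super().__init__()
            # Embeddings feed the encoder directly, so their width must equal d_model
            if embedding_dim != transformer_dim:
                raise ValueError("embedding_dim must equal transformer_dim (no projection layer is defined).")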
            self.embedding = nn.Embedding(
                num_embeddings=vocab_size,
                embedding_dim=embedding_dim,
                padding_idx=0
            )
            self.position_embedding = nn.Embedding(
                num_embeddings=max_len,
                embedding_dim=embedding_dim
            )
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=transformer_dim,
                nhead=num_heads,
                dim_feedforward=transformer_dim*4,
                dropout=dropout_rate,
                batch_first=True,
                activation='gelu'
            )
            self.transformer_encoder = nn.TransformerEncoder(
                encoder_layer,
                num_layers=num_layers
            )
            self.dropout = nn.Dropout(dropout_rate)
            self.classifier = nn.Linear(transformer_dim, num_go_terms)
        
        def forward(self, input_ids, attention_mask=None):
            # input_ids: (batch, seq_len)
            positions = torch.arange(0, input_ids.size(1), device=input_ids.device).unsqueeze(0)
            x = self.embedding(input_ids) + self.position_embedding(positions)
            if attention_mask is not None:
                # attention_mask: (batch, seq_len), 1 = real token, 0 = padding
                # src_key_padding_mask uses the opposite convention (True = ignore), so invert
                src_key_padding_mask = ~attention_mask.bool()
            else:
                src_key_padding_mask = None
            x = self.transformer_encoder(x, src_key_padding_mask=src_key_padding_mask)
            # Pooling: mean over non-padding tokens
            if attention_mask is not None:
                mask = attention_mask.unsqueeze(-1)
                x = (x * mask).sum(1) / mask.sum(1).clamp(min=1)
            else:
                x = x.mean(1)
            x = self.dropout(x)
            logits = self.classifier(x)
            return logits

    # Instantiate model
    model = ProteinFunctionPredictor(
        vocab_size=vocab_size,
        embedding_dim=embedding_dim,
        transformer_dim=transformer_dim,
        num_heads=num_heads,
        num_layers=num_layers,
        num_go_terms=num_go_terms,
        max_len=max_len,
        dropout_rate=dropout_rate
    ).to(device)

    # Print model summary
    print(model)
    print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")

    # Test a forward pass with dummy data
    batch_size = 2
    dummy_input = torch.randint(1, vocab_size, (batch_size, max_len), device=device)
    dummy_mask = (dummy_input != 0).long()
    with torch.no_grad():
        dummy_logits = model(dummy_input, dummy_mask)
    print(f"\nDummy output shape (batch_size={batch_size}, num_go_terms={num_go_terms}): {dummy_logits.shape}")

    # Visualize model architecture (textual)
    print('\nModel architecture summary:')
    print(f"- Embedding: vocab_size={vocab_size}, embedding_dim={embedding_dim}")
    print(f"- Positional Embedding: max_len={max_len}, embedding_dim={embedding_dim}")
    print(f"- Transformer: layers={num_layers}, heads={num_heads}, d_model={transformer_dim}")
    print(f"- Classifier: output_dim={num_go_terms} (multi-label)")
    print(f"- Dropout: {dropout_rate}")

    print('✅ Model Architecture complete!')

except Exception as e:
    print(f'✗ Error in Model Architecture: {e}')
    import traceback
    traceback.print_exc()
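
The classifier returns raw logits, one per GO term. For multi-label targets a common choice is sigmoid binary cross-entropy applied independently to each term, which is what `torch.nn.BCEWithLogitsLoss` computes. This is an assumption, since the notebook does not fix a loss. The pure-Python sketch below spells out the numerically stable form for a single protein so the arithmetic is visible; `bce_with_logits` is an illustrative name.

```python
import math


def bce_with_logits(logits, targets, pos_weight=None):
    """Mean binary cross-entropy over GO terms for one protein.

    logits:     raw scores, one per term.
    targets:    0/1 labels, same length as logits.
    pos_weight: optional per-term multiplier on the positive-label loss.
    """
    if pos_weight is None:
        pos_weight = [1.0] * len(logits)
    total = 0.0
    for x, y, w in zip(logits, targets, pos_weight):
        # softplus(-x) = log(1 + exp(-x)), computed without overflow
        softplus_neg = max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
        # -log(sigmoid(x))     = softplus(-x)
        # -log(1 - sigmoid(x)) = x + softplus(-x)
        total += w * y * softplus_neg + (1 - y) * (x + softplus_neg)
    return total / len(logits)


# A logit of 0 corresponds to probability 0.5, so the loss is log(2) for either label
print(bce_with_logits([0.0, 0.0], [1, 0]))
# Extreme but correct logits give zero loss without overflowing
print(bce_with_logits([1000.0, -1000.0], [1, 0]))
```

Working on logits rather than on sigmoid outputs avoids taking the logarithm of values that round to 0 or 1, which is why the model above returns logits and leaves the sigmoid to the loss or to inference code.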

## 6. Implementation & Next Steps

**Note:** This section provides guidance, not complete code. The actual implementation depends on the competition task.

In [None]:
print('📋 === IMPLEMENTATION GUIDE ===\n')

print('Competition Type: biology - multi-label-classification\n')
print('Task: Protein sequences → Gene Ontology (GO) term predictions\n')
print('💡 Implementation Process:')
print('1. Load and explore the competition data')
print('2. Preprocess according to data type')
print('3. Build baseline model')
print('4. Train and validate')
print('5. Generate predictions')
print('6. Format submission file')

print('\n⚠️ TODO:')
print('  [ ] Implement data preprocessing')
print('  [ ] Build and train model')
print('  [ ] Generate test predictions')
print('  [ ] Format submission')

print('\n💡 TIP: Check research gaps and implementation strategy above!')
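
For the "train and validate" step, a local estimate of the competition metric (Fmax) is useful. The sketch below implements the plain protein-centric Fmax used in CAFA assessments. At each threshold, precision is averaged over proteins with at least one prediction, recall is averaged over all annotated proteins, and the best F1 over thresholds is reported. It is an unweighted approximation; the official scorer may differ (for example by weighting terms or propagating predictions through the ontology), so treat the numbers as indicative only. `fmax_score` and the threshold grid are illustrative choices.

```python
def fmax_score(y_true, y_prob, thresholds=None):
    """Protein-centric Fmax over a grid of thresholds.

    y_true: list of sets of true GO term IDs, one set per protein.
    y_prob: list of dicts {GO term ID: score in [0, 1]}, one dict per protein.
    Returns (best_f, best_threshold); best_threshold is None if nothing scores.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 100)]  # 0.01 .. 0.99
    # Only proteins with at least one true term contribute
    annotated = [i for i, terms in enumerate(y_true) if terms]
    best_f, best_t = 0.0, None
    for t in thresholds:
        precisions, recall_sum = [], 0.0
        for i in annotated:
            predicted = {term for term, score in y_prob[i].items() if score >= t}
            hits = len(predicted & y_true[i])
            if predicted:
                # Precision is averaged only over proteins with a prediction at t
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(y_true[i])
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(annotated)
        if precision + recall > 0:
            f = 2 * precision * recall / (precision + recall)
            if f > best_f:
                best_f, best_t = f, t
    return best_f, best_t


# Toy example: two proteins with perfectly separable scores
truth = [{"GO:0000001", "GO:0000002"}, {"GO:0000003"}]
preds = [
    {"GO:0000001": 0.9, "GO:0000002": 0.4, "GO:0000003": 0.2},
    {"GO:0000003": 0.8, "GO:0000001": 0.3},
]
print(fmax_score(truth, preds))
```

In the toy data every true term scores at least 0.4 and every false term at most 0.3, so any threshold between those values yields an F1 of 1.0.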


## 7. Submission

**Generate submission file in competition format**

In [None]:
print('📤 === SUBMISSION GENERATION ===\n')

print('CAFA 6 Protein Function Prediction Submission Format:')
print('  Metric: Fmax')
print('  Format: Check sample_submission file for exact format')

print('\n⚠️ TODO:')
print('  1. Generate predictions on test set')
print('  2. Format according to sample_submission')
print('  3. Validate submission format')
print('  4. Save submission file')

# Load sample submission to see format
# sample_sub = pd.read_csv(DATA_PATH / 'sample_submission.csv')  # or .parquet
# print(sample_sub.head())
#
# Create your submission matching the format:
# submission = sample_sub.copy()
# submission['target'] = your_predictions  # Replace 'target' with actual column name
# submission.to_csv('submission.csv', index=False)
# print('✅ Submission created!')
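
The commented template above assumes a single `target` column. CAFA-style submissions have historically been long-format instead, with one row per (protein ID, GO term, score). The sketch below converts a per-protein probability matrix into such rows. The tab-separated, header-free layout, the `min_score` cutoff, the `max_terms` cap, and the three-decimal score formatting are all assumptions to be checked against the competition's sample submission file.

```python
import csv
import io


def to_submission_rows(protein_ids, go_terms, probs, min_score=0.01, max_terms=500):
    """Yield (protein_id, go_term, score) rows, highest scores first per protein.

    probs[i][j] is the predicted probability of go_terms[j] for protein_ids[i].
    min_score and max_terms are hypothetical limits, not competition rules.
    """
    for pid, row in zip(protein_ids, probs):
        scored = [(term, p) for term, p in zip(go_terms, row) if p >= min_score]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        for term, p in scored[:max_terms]:
            yield pid, term, f"{p:.3f}"


# Toy example written to an in-memory buffer instead of a file
ids = ["P12345", "Q67890"]
terms = ["GO:0003674", "GO:0005575", "GO:0008150"]
scores = [[0.91, 0.002, 0.40], [0.05, 0.75, 0.0]]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(to_submission_rows(ids, terms, scores))
print(buffer.getvalue())
```

Dropping very low scores keeps the file small; for a real run the buffer would be replaced by an open file handle, and the thresholds tuned on validation data.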
