# CAFA-6 Protein Function Prediction: Complete Experiments Summary

This notebook documents all approaches tried, results obtained, and key insights from our extensive experimentation on the CAFA-6 Kaggle competition.

**Competition Goal**: Predict Gene Ontology (GO) terms for 224,309 test proteins

**Metric**: F-max (maximum protein-centric F-measure across thresholds)

**Our Best Score**: 0.378 (ESM-2 Embedding approach)

**Top Leaderboard**: 0.456 (Mixture of Experts team)

## Table of Contents

1. [Executive Summary](#1.-Executive-Summary)
2. [Approaches That Worked](#2.-Approaches-That-Worked)
3. [Approaches That Failed](#3.-Approaches-That-Failed)
4. [Complete Submission History](#4.-Complete-Submission-History)
5. [Key Technical Insights](#5.-Key-Technical-Insights)
6. [Gap Analysis vs Top Teams](#6.-Gap-Analysis-vs-Top-Teams)
7. [Code Examples](#7.-Code-Examples)
8. [Conclusions & Lessons Learned](#8.-Conclusions-&-Lessons-Learned)
9. [Citations, Credits & References](#9.-Citations,-Credits-&-References)

---
## 1. Executive Summary

### Timeline Overview
- **Start Date**: November 2025
- **End Date**: January 2026
- **Total Scripts Created**: 135
- **Total Submissions**: 100+
- **Best Score Achieved**: 0.378 F-max

### Key Milestones

| Date | Score | Approach | Significance |
|------|-------|----------|-------------|
| Nov 2025 | 0.340 | Multi-Model Blend V2 | Early best |
| Dec 2025 | 0.336 | GOA Baseline | Realized GOA is strong |
| Dec 2025 | 0.338 | ESM2 90/10 Blend | Found optimal blend ratio |
| Jan 2026 | 0.374 | GOA 60% + ProtT5 40% | Major improvement |
| Jan 2026 | 0.378 | ESM-2 Embedding Notebook | Current best |

### Final Status
- **Our Best**: 0.378
- **Top Leaderboard**: 0.456
- **Gap**: 0.078 (20.6% improvement needed)

---
## 2. Approaches That Worked

### 2.1 GOA Baseline (0.336)

**What is GOA?**
- UniProt Gene Ontology Annotations
- Direct transfer of known annotations to test proteins
- Pre-computed predictions available as Kaggle dataset

**Why it works:**
- Well-calibrated confidence scores (not all 1.0)
- Comprehensive coverage (all 224,309 test proteins)
- Based on evidence codes from UniProt

**Key insight:** The GOA baseline is extremely hard to beat. Most ML approaches score worse.

---

### 2.2 ESM2 90/10 Weighted Blend (0.338)

```python
# Optimal blending formula
final_score = 0.9 * GOA_score + 0.1 * ESM2_ML_score
```

**Why it works:**
- 90% GOA preserves the strong baseline
- 10% ESM2 provides calibration adjustment
- Only blend where BOTH have predictions (don't add novel)
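
A minimal sketch of this rule, assuming each protein's predictions are held as a `{go_term: score}` dict. The function name `blend_intersection_only` and the toy scores are illustrative, not taken from our scripts:

```python
def blend_intersection_only(goa, ml, w_goa=0.9, w_ml=0.1):
    """Blend two {go_term: score} dicts without adding terms absent from GOA."""
    blended = {}
    for term, g in goa.items():
        m = ml.get(term)
        if m is None:
            # The ML model has no score for this term: keep GOA untouched.
            blended[term] = g
        else:
            # Both sources scored the term: weighted average.
            blended[term] = w_goa * g + w_ml * m
    return blended


goa = {"GO:0005515": 0.80, "GO:0003677": 0.40}
ml = {"GO:0005515": 0.60, "GO:0016301": 0.90}  # GO:0016301 is ML-only
result = blend_intersection_only(goa, ml)
print(result)
```

The ML-only term `GO:0016301` is dropped even though its score is high, which is the "don't add novel" part of the rule.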

**Experiments with different ratios:**

| Ratio (GOA/ML) | Score | Result |
|----------------|-------|--------|
| 95/5 | 0.337 | Slightly worse |
| **90/10** | **0.338** | **Optimal** |
| 80/20 | 0.337 | Worse |
| 70/30 | 0.335 | Worse |

---

### 2.3 GOA + ProtT5 Blend (0.374)

**The breakthrough approach:**

```python
WEIGHT_GOA = 0.60
WEIGHT_PROTT5 = 0.40

# Blending rule: only average if BOTH have predictions
if s_goa > 0 and s_prott5 > 0:
    merged[t] = WEIGHT_GOA * s_goa + WEIGHT_PROTT5 * s_prott5
else:
    merged[t] = s_goa if s_goa > 0 else s_prott5
```

**Key improvements over ESM2 blend:**
- ProtT5 predictions have better coverage
- 40% weight (vs 10% for ESM2) works better for ProtT5
- ProtT5 provides complementary signal, not just calibration

---

### 2.4 ESM-2 Embedding Approach (0.378) - CURRENT BEST

This score came from a Kaggle notebook that predicts directly from ESM-2 embeddings. We have not fully replicated its inference procedure.

**Why it beat our blending approaches:**
- Direct embedding-based predictions
- Better calibration than simple blending
- Likely uses techniques we haven't fully replicated

---
## 3. Approaches That Failed

### 3.1 GO Hierarchy Propagation (0.005 - 0.263)

**What we tried:**
- Propagate predictions to parent GO terms
- Depth-based decay scoring
- Negative propagation

**Why it failed:**
- **CAFA evaluation handles propagation internally**
- Our propagation creates duplicates
- Depth decay destroys calibration

**CRITICAL LESSON:** Never do GO propagation - the evaluation system does it.
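
For reference, the kind of parent propagation we mean can be sketched as follows. The toy three-term hierarchy and the helper name `propagate_to_ancestors` are hypothetical; a real run would read parent links from the GO ontology file, and the scorer's own handling may differ in detail:

```python
def propagate_to_ancestors(scores, parents):
    """Give every ancestor at least the max score of its descendants."""
    out = dict(scores)
    stack = list(scores.items())
    while stack:
        term, score = stack.pop()
        for parent in parents.get(term, ()):
            if out.get(parent, 0.0) < score:
                out[parent] = score
                # The parent's score changed, so revisit its own parents.
                stack.append((parent, score))
    return out


# Hypothetical hierarchy: child -> mid -> root
parents = {"child": ["mid"], "mid": ["root"]}
propagated = propagate_to_ancestors({"child": 0.7, "root": 0.2}, parents)
print(propagated)
```

Two predicted terms become three rows here, and the root's original score is overwritten. If the scorer already applies an equivalent step, doing it in the submission adds rows without adding information.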

---

### 3.2 Adding Novel Predictions (No improvement)

**What we tried:**
- Add ML predictions not in GOA
- Use consensus (ESM2 + ProtT5 both agree)
- Various confidence thresholds (0.35, 0.5, 0.8)

**Results:**

| Novel Predictions Added | Score | Result |
|------------------------|-------|--------|
| 16,502 (threshold 0.5) | 0.338 | No change |
| 469,260 (threshold 0.35) | 0.332 | WORSE |
| 0 (baseline blend) | 0.338 | Best |

**Why it failed:**
- F-max heavily penalizes false positives
- Even high-confidence consensus predictions are mostly wrong
- ML predictions not in GOA are usually false positives
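
The consensus filter can be sketched like this, assuming per-protein `{go_term: score}` dicts. Taking the minimum of the two model scores as the consensus score is our illustrative choice here, and `consensus_novel_terms` is a hypothetical name:

```python
def consensus_novel_terms(goa, esm2, prott5, threshold=0.5):
    """Return terms missing from GOA that both models score >= threshold."""
    novel = {}
    for term in esm2.keys() & prott5.keys():
        if term in goa:
            continue  # already covered by GOA, so not novel
        score = min(esm2[term], prott5[term])  # conservative consensus
        if score >= threshold:
            novel[term] = score
    return novel


goa = {"GO:0005515": 0.9}
esm2 = {"GO:0005515": 0.8, "GO:0003677": 0.7, "GO:0016301": 0.9}
prott5 = {"GO:0003677": 0.6, "GO:0016301": 0.3, "GO:0004672": 0.95}
novel = consensus_novel_terms(goa, esm2, prott5)
print(novel)
```

Only `GO:0003677` survives in this toy case. In our experiments, even terms that passed this kind of filter did not improve the score.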

---

### 3.3 GOA as Input Features (0.318)

**The idea:**
Based on CAFA-5 winner insight that GOA can be used as input features.

```python
import torch.nn as nn

class GOAAwareModel(nn.Module):
    """Refines GOA predictions using ESM2 embeddings (layers omitted here)."""
    # Input: ESM2 embeddings + GOA predictions
    # Output: Refined predictions
    # Learned goa_residual_weight (~0.67)
```

**Validation F-max: 0.5262** (amazing!)

**Test F-max: 0.318** (WORSE than baseline!)

**Why it failed:**
- Model overfits to training GOA patterns
- Test GOA distribution differs from training
- Learned to "copy and adjust" GOA, which doesn't transfer
- Circular dependency: using GOA to predict GO terms

---

### 3.4 GCN with GOA (ProtBoost approach) (0.197)

**The idea:**
Following CAFA-5 2nd place (ProtBoost), use GCN to refine predictions.

**Our implementation:**
- GCN stacker with GOA as input feature (1 of 3 features)
- GO ontology graph structure
- SAGEConv layers

**Result: 0.197 (CATASTROPHIC - 41% worse than baseline)**

**Why it failed:**
- GOA was 1 of only 3 input features (roughly 33% of the input), so it dominated
- ProtBoost uses GOA as 1 of 29 features (roughly 3.4% of the input)
- Our GCN learned to copy GOA, not improve it
- Graph convolutions spread predictions → false positives

---

### 3.5 Evidence Weighting (0.179 - 0.234)

**The idea:**
Weight GOA predictions by evidence code quality.

**Evidence code hierarchy:**
- EXP (experimental) = 1.0
- IDA, IMP = 0.95
- ISS (sequence similarity) = 0.8
- IEA (electronic) = 0.7

**Why it failed:**
- GOA already has well-calibrated scores
- Our weighting broke the calibration
- It pushed scores toward 1.0 almost everywhere (mean 0.991 vs 0.308 in the original)
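
One way this collapse can happen is sketched below, under the assumption that the evidence weight replaces the original score (our scripts may have combined the two differently). The toy scores, the evidence assignments, and the fallback weight for unlisted codes are all made up for illustration:

```python
# Weights from the hierarchy above.
EVIDENCE_WEIGHT = {"EXP": 1.0, "IDA": 0.95, "IMP": 0.95, "ISS": 0.8, "IEA": 0.7}


def evidence_score(codes, default=0.7):
    """Score a term by its strongest evidence code (default is an assumption)."""
    return max((EVIDENCE_WEIGHT.get(c, default) for c in codes), default=default)


# Toy data, not real GOA scores.
original = {"GO:0005515": 0.12, "GO:0003677": 0.31, "GO:0016301": 0.55}
evidence = {
    "GO:0005515": ["IEA"],
    "GO:0003677": ["ISS", "IDA"],
    "GO:0016301": ["EXP"],
}

reweighted = {term: evidence_score(evidence[term]) for term in original}
mean_before = sum(original.values()) / len(original)
mean_after = sum(reweighted.values()) / len(reweighted)
print(f"mean before: {mean_before:.3f}, mean after: {mean_after:.3f}")
```

Because the lowest weight is 0.7, every term ends up with a high score and the ranking information carried by the original scores is lost.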

---

### 3.6 Filtering Low-Confidence Predictions (0.323)

**Hypothesis:** GOA over-predicts (20.3 GO terms/protein vs 6.5 in training)

**What we tried:** Keep only predictions with score >= 0.2

**Result:** WORSE (0.323 vs 0.336 baseline)

**Why:** Low-confidence predictions provide valuable recall.
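
A small sketch of the two quantities involved, the average number of terms per protein and the score cutoff, assuming predictions are stored as `{protein_id: {go_term: score}}`. The helper names and the toy proteins are illustrative:

```python
def mean_terms_per_protein(preds):
    """Average number of predicted GO terms per protein."""
    if not preds:
        return 0.0
    return sum(len(terms) for terms in preds.values()) / len(preds)


def filter_by_score(preds, min_score=0.2):
    """Drop every prediction whose score is below min_score."""
    return {
        pid: {go: s for go, s in terms.items() if s >= min_score}
        for pid, terms in preds.items()
    }


preds = {
    "P1": {"GO:0005515": 0.90, "GO:0003677": 0.15, "GO:0016301": 0.05},
    "P2": {"GO:0005515": 0.25},
}
filtered = filter_by_score(preds)
print(mean_terms_per_protein(preds), mean_terms_per_protein(filtered))
```

The cutoff halves the term count in this toy case. On the real data the same step cost recall at low thresholds, which is where the score dropped.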

---

### 3.7 ML-Only Approaches

| Approach | Score | Issue |
|----------|-------|-------|
| LightGBM multi-target | 0.132 | False positives |
| Practical Model | 0.077 | Massive overfitting |
| DIAMOND only | 0.027 | Homology insufficient |

**Key insight:** Every pure-ML approach we tried without GOA scored far below the GOA baseline.

---
## 4. Complete Submission History

### Score Distribution Visualization

```python
import matplotlib.pyplot as plt
import numpy as np

# Key submissions (selected for clarity)
submissions = [
    ('ESM-2 Embedding', 0.378),
    ('GOA 60% + ProtT5 40%', 0.374),
    ('Simple blend 60/40', 0.364),
    ('Compact blend', 0.343),
    ('ESM2 90/10 Blend', 0.338),
    ('GOA Baseline', 0.336),
    ('ProtBoost MLP', 0.330),
    ('GOA filtered', 0.323),
    ('GOA-aware model', 0.318),
    ('Evidence weighted', 0.234),
    ('GCN with GOA', 0.197),
    ('LightGBM ML-only', 0.132),
    ('Practical model', 0.077),
    ('DIAMOND only', 0.027),
    ('Bad propagation', 0.005),
]

names, scores = zip(*submissions)
colors = ['green' if s >= 0.36 else 'blue' if s >= 0.33 else 'orange' if s >= 0.2 else 'red' for s in scores]

fig, ax = plt.subplots(figsize=(12, 8))
bars = ax.barh(names, scores, color=colors)
ax.set_xlabel('F-max Score')
ax.set_title('CAFA-6 Submission History: Key Approaches')
ax.axvline(x=0.336, color='gray', linestyle='--', label='GOA Baseline')
ax.axvline(x=0.456, color='purple', linestyle='--', label='Top Leaderboard')
ax.legend()

# Add value labels
for bar, score in zip(bars, scores):
    ax.text(score + 0.01, bar.get_y() + bar.get_height()/2, f'{score:.3f}', va='center')

plt.tight_layout()
plt.show()
```

### Chronological Summary Table

| Date | Score | Method | Key Insight |
|------|-------|--------|-------------|
| 2026-01-25 | 0.364 | Simple blend (no power scaling) | Power scaling hurts |
| 2026-01-25 | 0.343 | Compact blend (top 100, t=0.1) | Too aggressive filtering |
| 2026-01-20 | **0.378** | **ESM-2 Embedding Notebook** | **CURRENT BEST** |
| 2026-01-20 | **0.374** | **GOA 60% + ProtT5 40%** | **Major breakthrough** |
| 2026-01-19 | 0.197 | GCN with GOA | CATASTROPHIC |
| 2026-01-17 | 0.318 | GOA-aware model | Overfitting |
| 2026-01-16 | 0.234 | Evidence weighted | Broke calibration |
| 2026-01-09 | 0.338 | ESM2 90/10 blend | Optimal ratio |
| 2026-01-08 | 0.336 | Py-Boost experiments | No improvement |
| 2025-12-27 | 0.338 | ESM2 90/10 blend | First best |
| 2025-12-17 | 0.336 | GOA baseline | Strong baseline |
| 2025-12-04 | 0.005 | Bad propagation | WORST score |
| 2025-11-14 | 0.340 | Multi-Model Blend V2 | Early best |

---
## 5. Key Technical Insights

### 5.1 The F-max Metric is Unforgiving

F-max is the maximum of the F-measure over all score thresholds $t$:
$$F_{\max} = \max_{t} F_1(t), \qquad F_1(t) = \frac{2 \cdot precision(t) \cdot recall(t)}{precision(t) + recall(t)}$$

**Key properties:**
- False positives hurt precision dramatically
- Adding predictions usually hurts more than helps
- Calibration matters more than raw model performance
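
To make the false-positive penalty concrete, here is a sketch of the plain protein-centric F-max in the form given by Radivojac et al. (2013). The competition scorer may additionally weight terms and score each ontology separately; none of that is modelled here, and the proteins and terms are toy data:

```python
def f_max(predictions, truth, thresholds=None):
    """Unweighted protein-centric F-max over a grid of thresholds."""
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for pid, true_terms in truth.items():
            predicted = {go for go, s in predictions.get(pid, {}).items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                # Precision averages only proteins with a prediction at t.
                precisions.append(hits / len(predicted))
            # Recall averages over every benchmark protein.
            recalls.append(hits / len(true_terms))
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = sum(recalls) / len(recalls)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


truth = {"P1": {"A", "B"}, "P2": {"C"}}
preds = {"P1": {"A": 0.9, "B": 0.4, "X": 0.3}, "P2": {"C": 0.6, "Y": 0.5}}
base = f_max(preds, truth)

# Same predictions plus one confident false positive for P1.
preds_fp = {"P1": {**preds["P1"], "Z": 0.95}, "P2": preds["P2"]}
with_fp = f_max(preds_fp, truth)
print(f"{base:.3f} -> {with_fp:.3f}")
```

In this toy case a single high-scoring wrong term lowers F-max from 0.857 to 0.750, because it survives every threshold at which the correct terms are still predicted.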

### 5.2 Validation-Test Gap is Massive

| Model | Validation F-max | Test F-max | Gap |
|-------|-----------------|------------|-----|
| GOA-aware | 0.5262 | 0.318 | -0.208 |
| ProtBoost MLP | 0.4298 | 0.330 | -0.100 |
| Practical model | 0.3486 | 0.077 | -0.272 |
| ProtT5 | 0.33 | 0.165 | -0.165 |

**Lesson:** Don't trust validation scores. Test distribution is fundamentally different.

### 5.3 What the Top Teams Are Doing (0.456)

Based on analysis of CAFA-5 winners and CAFA-6 top notebooks:

1. **GOCurator (CAFA-5 1st)**: Text mining + literature retrieval
2. **ProtBoost (CAFA-5 2nd)**: 29 features per GO term, GCN stacking
3. **Top CAFA-6 notebooks**: Negative propagation, power scaling

### 5.4 The GOA Paradox

**Paradox:** GOA is both our best baseline AND resistant to every learned refinement we tried.

Every attempt to use GOA in a learnable way failed:
- GOA as input features: -5.4%
- GCN with GOA: -41.4%
- Evidence weighting: -46.7%
- Pure GOA baseline: BEST

**The only improvement came from blending with ProtT5 predictions.**
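
The percentages above are relative changes against the 0.336 GOA baseline. A short check, with the evidence-weighting figure taken from its worst run (0.179) and the ProtT5 blend added for comparison:

```python
GOA_BASELINE = 0.336


def relative_change(score, baseline=GOA_BASELINE):
    """Relative change versus the baseline, in percent."""
    return 100.0 * (score - baseline) / baseline


runs = [
    ("GOA as input features", 0.318),
    ("GCN with GOA", 0.197),
    ("Evidence weighting (worst run)", 0.179),
    ("GOA 60% + ProtT5 40%", 0.374),
]
for name, score in runs:
    print(f"{name}: {relative_change(score):+.1f}%")
```

This prints -5.4%, -41.4%, -46.7% and +11.3% respectively.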

---
## 6. Gap Analysis vs Top Teams

### Current Leaderboard (Jan 2026)

| Rank | Team | Score | Gap from Us |
|------|------|-------|-------------|
| 1 | Mixture of Experts | 0.456 | +0.078 (20.6%) |
| 2 | WePredictProteins | 0.441 | +0.063 |
| 3 | Guoliang&Born4 | 0.440 | +0.062 |
| 4 | chya | 0.429 | +0.051 |
| 5 | mirrandax | 0.423 | +0.045 |
| ... | ... | ... | ... |
| ~100+ | **Us** | **0.378** | - |

### What We're Missing

| Missing Component | Used By | Estimated Impact | Difficulty |
|-------------------|---------|------------------|------------|
| Literature/text mining | GOCurator | +0.05-0.08 | High |
| Learning to Rank | GOCurator, NetGO | +0.03-0.05 | Medium |
| Proper GCN (29 features) | ProtBoost | +0.02-0.04 | Medium |
| ESM2 fine-tuning (LoRA) | Various | +0.01-0.03 | Medium |
| AlphaFold structures | ESM-GNN | +0.01-0.02 | High |

---
## 7. Code Examples

### 7.1 Optimal Blending (0.374 approach)

```python
# Simplified example of the 0.374 blending approach
def blend_goa_prott5(goa_preds, prott5_preds, test_proteins):
    """
    Blend GOA and ProtT5 predictions.

    Parameters:
    - goa_preds: dict of {protein_id: {go_term: score}}
    - prott5_preds: dict of {protein_id: {go_term: score}}
    - test_proteins: set of protein IDs

    Returns:
    - blended: dict of {protein_id: {go_term: score}}
    """
    GOA_WEIGHT = 0.60
    PROTT5_WEIGHT = 0.40

    blended = {}

    for pid in test_proteins:
        goa = goa_preds.get(pid, {})
        prott5 = prott5_preds.get(pid, {})

        all_terms = set(goa.keys()) | set(prott5.keys())

        scores = {}
        for go in all_terms:
            g = goa.get(go, 0)
            p = prott5.get(go, 0)

            if g > 0 and p > 0:
                # Both have predictions - weighted blend
                scores[go] = GOA_WEIGHT * g + PROTT5_WEIGHT * p
            elif g > 0:
                # GOA only - keep as is
                scores[go] = g
            elif p > 0:
                # ProtT5 only - slight discount
                scores[go] = p * 0.85

        if scores:
            blended[pid] = scores

    return blended

print("Blending function defined. Key: only blend when BOTH have predictions.")
```

### 7.2 Why Power Scaling Hurts

```python
import numpy as np

# Power scaling was tried but hurt performance
def apply_power_scaling(scores, power=0.8, max_cap=0.95):
    """
    Power scaling: score^power (boosts lower scores)

    Problem: Distorts calibration that F-max relies on
    """
    max_score = max(scores.values()) if scores else 0
    if max_score <= 0:
        return scores

    scaled = {}
    for go, score in scores.items():
        normalized = (score / max_score) ** power
        scaled[go] = normalized * min(max_score, max_cap)

    return scaled

# Example showing the distortion
original = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
scaled = original ** 0.8

print("Original scores:", original)
print("After power=0.8:", np.round(scaled, 3))
print("\nNote: Low scores boosted disproportionately - breaks calibration")
```

### 7.3 The Asymmetric Focal Loss

```python
import torch
import torch.nn as nn

class AsymmetricFocalLoss(nn.Module):
    """
    Asymmetric focal loss for multi-label classification.

    Parameters:
    - gamma_neg: Focusing exponent for negatives (default 4.0)
    - gamma_pos: Focusing exponent for positives (default 0.0, i.e. plain BCE)
    - clip: Probabilities are clamped to [clip, 1 - clip] before the log,
      which avoids log(0). This is a clamp, not the probability shifting
      (margin) that Ridnik et al. apply to negatives.

    Why it works:
    - Extreme class imbalance (~6 positives per 1500 labels)
    - gamma_neg=4 down-weights easy negatives
    - Allows model to focus on rare positive labels

    Reference: Ridnik et al. (2021) ICCV
    """
    def __init__(self, gamma_neg=4.0, gamma_pos=0.0, clip=0.05):
        super().__init__()
        self.gamma_neg = gamma_neg
        self.gamma_pos = gamma_pos
        self.clip = clip

    def forward(self, logits, targets):
        # Sigmoid probabilities
        probs = torch.sigmoid(logits)

        # Separate positive and negative
        pos_mask = targets == 1
        neg_mask = targets == 0

        # Positive loss (standard BCE, focal only if gamma_pos > 0)
        pos_probs = probs[pos_mask].clamp(min=self.clip)
        pos_loss = -torch.log(pos_probs)
        if self.gamma_pos > 0:
            pos_loss = pos_loss * ((1 - pos_probs) ** self.gamma_pos)

        # Negative loss (asymmetric focal)
        neg_probs = probs[neg_mask].clamp(max=1 - self.clip)
        neg_loss = -torch.log(1 - neg_probs)
        neg_loss = neg_loss * (neg_probs ** self.gamma_neg)  # Down-weight easy negatives

        # Guard against batches with no positives (or no negatives):
        # .mean() of an empty tensor is NaN.
        zero = logits.new_zeros(())
        pos_term = pos_loss.mean() if pos_loss.numel() > 0 else zero
        neg_term = neg_loss.mean() if neg_loss.numel() > 0 else zero
        return pos_term + neg_term

print("AsymmetricFocalLoss: gamma_neg=4 is critical for extreme class imbalance")
```

---
## 8. Conclusions & Lessons Learned

### What We Learned

1. **GOA is king**: The UniProt GOA baseline is extremely hard to beat
2. **Don't do propagation**: CAFA handles it internally
3. **F-max punishes false positives**: Be conservative with predictions
4. **Validation doesn't transfer**: Expect absolute F-max drops of 0.10-0.27 on test (see Section 5.2)
5. **Simple blending works**: 60/40 GOA/ProtT5 better than complex ML
6. **More ML weight hurts**: 90/10 better than 80/20 for ESM2
7. **Novel predictions don't help**: Even high-confidence consensus is mostly wrong

### What We Would Do Differently

1. **Start with GOA + ProtT5 blend** from day one
2. **Skip propagation experiments** entirely
3. **Focus on calibration** over raw model performance
4. **Plan for the validation-test gap** - expect a drop of at least ~0.1 F-max
5. **Study top notebooks first** before trying novel ideas

### Remaining Gap to Top

To reach 0.456 from 0.378, we would need:
- Text mining / literature features (like GOCurator)
- Learning to Rank ensemble
- Proper GCN stacking with 29+ features
- ESM-2 fine-tuning with LoRA
- Possibly structural features from AlphaFold

### Final Takeaway

**The CAFA protein function prediction task is fundamentally about annotation transfer, not ML.**

The best approaches leverage existing biological knowledge (GOA, ProtT5 pre-training, homology) rather than trying to learn from scratch. Pure ML approaches consistently fail because:
- Extreme class imbalance (~6 positives per protein out of 40,000+ GO terms)
- Test distribution differs from training
- F-max metric heavily penalizes false positives

**Our journey: From 0.005 (catastrophic) to 0.378 (respectable)**

---
## 9. Citations, Credits & References

### Competition & Data Sources

1. **CAFA-6 Competition**: Kaggle CAFA 6 Protein Function Prediction
   - https://www.kaggle.com/competitions/cafa-6-protein-function-prediction

2. **GOA Predictions Dataset**: ymuroya47/cafa6-goa-predictions
   - https://www.kaggle.com/datasets/ymuroya47/cafa6-goa-predictions
   - Contains `goa_submission.tsv` and `prott5_interpro_predictions.tsv`

3. **Gene Ontology Consortium**
   - Ashburner, M., et al. (2000). Gene Ontology: tool for the unification of biology. *Nature Genetics*, 25(1), 25-29.
   - http://geneontology.org/

4. **UniProt-GOA Database**
   - Huntley, R. P., et al. (2015). The GOA database: gene ontology annotation updates for 2015. *Nucleic Acids Research*, 43(D1), D1057-D1063.
   - https://www.ebi.ac.uk/GOA

### Protein Language Models

5. **ESM-2** (Meta AI)
   - Lin, Z., et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. *Science*, 379(6637), 1123-1130.
   - https://github.com/facebookresearch/esm

6. **ProtT5** (Rostlab)
   - Elnaggar, A., et al. (2022). ProtTrans: Toward understanding the language of life through self-supervised learning. *IEEE TPAMI*, 44(10), 7112-7127.
   - https://github.com/agemagician/ProtTrans

7. **Ankh** (ElnaggarLab)
   - Elnaggar, A., et al. (2023). Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling.
   - https://github.com/agemagician/Ankh

### Key Methods & Papers

8. **CAFA Evaluation**
   - Radivojac, P., et al. (2013). A large-scale evaluation of computational protein function prediction. *Nature Methods*, 10(3), 221-227.

9. **Asymmetric Focal Loss**
   - Ridnik, T., et al. (2021). Asymmetric loss for multi-label classification. *ICCV 2021*.
   - https://github.com/Alibaba-MIIL/ASL

10. **DeepGOPlus**
    - Kulmanov, M., & Hoehndorf, R. (2020). DeepGOPlus: improved protein function prediction from sequence. *Bioinformatics*, 36(2), 422-429.

11. **DIAMOND** (Homology Search)
    - Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. *Nature Methods*, 12(1), 59-60.

### CAFA-5 Winning Solutions (Referenced)

12. **GOCurator** (1st Place, CAFA-5)
    - Yuan, Q., et al. (2024). GORetriever: A Two-Stage Retrieval Framework for Gene Ontology Annotation.
    - https://pmc.ncbi.nlm.nih.gov/articles/PMC11520413/

13. **ProtBoost** (2nd Place, CAFA-5)
    - Chervov, A., et al. (2024). ProtBoost: protein function prediction with Py-Boost and Graph Neural Networks - CAFA5 top2 solution.
    - https://arxiv.org/abs/2412.04529

### Top CAFA-6 Notebooks Referenced

14. **GOA + ProtT5 Ensemble (0.375)**
    - Ibrahim Qasimi
    - https://www.kaggle.com/code/ibrahimqasimi/0-375-biological-function-of-a-protein

15. **GOA + ProtT5 Ensemble (0.370)**
    - Nikita Kuznetsov
    - https://www.kaggle.com/code/nikitakuznetsof/cafa-6-goa-prott5-ensemble-0-370-f2fcb6

16. **KTDK Final**
    - Khoa Tran
    - https://www.kaggle.com/code/khoatran512/ktdk-int3405e2-final

### Tools & Libraries

17. **PyTorch**: https://pytorch.org/
18. **Hugging Face Transformers**: https://huggingface.co/
19. **scikit-learn**: https://scikit-learn.org/
20. **Py-Boost (SketchBoost)**: https://github.com/sb-ai-lab/Py-Boost

### Acknowledgments

- **Meta AI** for ESM-2 protein language models
- **Rostlab** for ProtT5 models
- **ElnaggarLab** for Ankh models
- **Gene Ontology Consortium** for GO ontology
- **UniProt-GOA** for annotation data

---
## Appendix: Project Statistics

| Metric | Value |
|--------|-------|
| Scripts created | 135 |
| Output files | 119 |
| Git commits | 50+ |
| Submissions | 100+ |
| Best score | 0.378 |
| Worst score | 0.005 |
| Time spent | ~3 months |
| GOA baseline | 0.336 |
| Improvement over baseline | +12.5% |