In [None]:
!pip install mermaidian


# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

#hide_input
#hide_output


## 🧬 **FANTASIA – CAFA-6 Functional Annotation Framework**

---

### **1. Introduction**

This notebook documents the system developed for **CAFA-6**, integrating two complementary components designed to perform large-scale, temporally consistent protein function prediction:

* **[Protein Information System (PIS)](https://github.com/CBBIO/protein-information-system):**
  A structured and versioned protein information framework capable of maintaining multiple **temporal instances** containing proteins, Gene Ontology (GO) annotations, structural data, and embeddings.
  The system can ingest data from the **UniProt API** or directly from **Gene Annotation Files (GAFs)**, preserving all associated metadata and evidence codes.

* **[FANTASIA](https://github.com/CBBIO/FANTASIA):**
  *Functional ANnoTAtion based on embedding space SImilArity*, a configurable inference engine that predicts protein function by measuring similarity in embedding space.
  FANTASIA supports **taxonomy-based** and **redundancy-based** filtering strategies and integrates seamlessly with PIS for reproducible analysis.

Additional exploratory work was carried out using the Kaggle notebook
**[CAFA5 Private Test Set Discovery](https://www.kaggle.com/code/xaxipirulazo/cafa5-private-test-set-discovery)**,
where Gene Annotation Files from **releases 214 to 226**, spanning nearly **six years of biological knowledge evolution**, were processed to define temporal validation environments.

---

### **2. Current System State**

After parsing and filtering, only **experimentally supported annotations** (as defined by the CAFA committee's holdout protocol) were retained.
The system now contains:

| Metric             |                                                            Value |
| ------------------ | ---------------------------------------------------------------: |
| Total proteins     |                                                      **224 309** |
| Annotated proteins |                                                      **216 275** |
| Total annotations  | **1 711 334** (including IBA + IEA; both can be disabled in PIS) |

---

### **3. Embedding Infrastructure**

Embeddings have been generated and stored using **ESM-3c** and **Ankh-3**, applying **multi-layer mean pooling** across model layers.
These representations are linked to their corresponding proteins in the PIS graph, enabling **embedding-space inference** by transferring GO annotations from nearest neighbors.

Each inference result includes not only predicted GO terms but also **rich metadata** about the source proteins and alignment metrics:

| Category                | Key Attributes                                                                      |
| ----------------------- | ----------------------------------------------------------------------------------- |
| **Embedding metadata**  | `accession`, `model_name`, `embedding_type_id`, `layer_index`                       |
| **Functional metrics**  | `distance`, `go_id`, `category`, `evidence_code`                                    |
| **Protein context**     | `protein_id`, `organism`, `gene_name`                                               |
| **Alignment scores**    | `reliability_index`, `identity`, `similarity`, `alignment_score`, `gaps_percentage` |
| **Distance parameters** | `distance_metric`, `distance_threshold`                                             |
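
The nearest-neighbor transfer described above can be illustrated with a minimal, dependency-free sketch. It is not FANTASIA's implementation: the function names, the `references` layout, and the toy two-dimensional embeddings are assumptions made for illustration only.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def transfer_go_terms(query, references, k=3):
    """Return (go_id, accession, distance) rows from the k nearest references.

    `references` is a hypothetical list of dicts with the keys
    'accession', 'embedding' and 'go_ids'.
    """
    scored = [(cosine_distance(query, ref["embedding"]), ref) for ref in references]
    scored.sort(key=lambda pair: pair[0])  # closest reference first
    rows = []
    for distance, ref in scored[:k]:
        for go_id in ref["go_ids"]:
            rows.append((go_id, ref["accession"], distance))
    return rows


# Toy reference set (invented accessions and embeddings).
references = [
    {"accession": "P1", "embedding": [1.0, 0.0], "go_ids": ["GO:0003677"]},
    {"accession": "P2", "embedding": [0.0, 1.0], "go_ids": ["GO:0005524"]},
    {"accession": "P3", "embedding": [1.0, 1.0], "go_ids": ["GO:0003677", "GO:0005634"]},
]

hits = transfer_go_terms([1.0, 0.1], references, k=2)
for go_id, accession, distance in hits:
    print(go_id, accession, round(distance, 4))
```

Each returned row carries the source accession and the distance, mirroring the idea that a predicted GO term keeps the metadata of the neighbor it was transferred from.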

---

### **4. Embedding Generation Configuration**

The embedding generation pipeline defines how protein sequences are encoded using large protein language models (PLMs).
This configuration controls **truncation**, **batch size**, **device allocation**, and **layer selection**, which together determine computational performance and representation quality.

#### **Configuration Overview**

```yaml
embedding:
  device: cuda                 # enum{cuda,cpu} | Primary device for PLMs; set to cpu to force CPU execution
  queue_batch_size: 100        # int | Number of sequences per published batch to RabbitMQ
  max_sequence_length: 1516    # int | 0 disables truncation; otherwise sequences are truncated to this length

  # Per-model configuration.
  # name: logical model identifier used across the pipeline (case-sensitive).
  # enabled: if false, model is ignored at runtime.
  # batch_size: PLM forward batch size (beware of VRAM limits).
  # layer_index: list[int] of layers to export; multiple indices will produce several representations (multi-layer mean pooling).
  # LAYER INDEXING NOTE: 0 = last (output) layer, 1 = penultimate (second-to-last), 2 = third-to-last, and so on.

  models:
    ESM: # 34 layers: 0..33
      enabled: False
      batch_size: 1
      layer_index: [ 0 ]

    ESM3c: # 36 layers: 0..35
      enabled: True
      batch_size: 1
      layer_index: [ 0 ]
      distance_threshold: 0

    Ankh3-Large: # 49 layers: 0..48
      enabled: False
      batch_size: 1
      layer_index: [ 0 ]
      distance_threshold: 0

    Prot-T5: # 25 layers: 0..24
      enabled: False
      batch_size: 1
      layer_index: [ 0 ]
      distance_threshold: 0

    Prost-T5: # 25 layers: 0..24 (same backbone as Prot-T5)
      enabled: False
      batch_size: 1
      layer_index: [ 0, 1, 2, 11, 12, 13 ]
      distance_threshold: 0
```

#### **Technical Notes**

* **Device selection (`device`)** – Configures the main execution device. CUDA is preferred when available for PLM inference; CPU fallback is supported for debugging or low-resource environments.
* **Truncation (`max_sequence_length`)** – Sequences exceeding this length are truncated to avoid GPU memory overflow; `0` disables truncation entirely.
* **Batch parameters (`batch_size`, `queue_batch_size`)** – Control throughput and memory footprint. Small batches (1–4) ensure stable performance within limited VRAM; the queue batch size governs the number of sequences sent per processing batch.
* **Layer indexing (`layer_index`)** – Allows selective extraction of representations from deeper or intermediate PLM layers. Combining multiple indices yields averaged ("mean-pooled") embeddings that capture both structural and functional information.
* **Model selection** – Only models marked as `enabled: True` are loaded. In this configuration, **ESM3c** is the active model with its final layer exported.
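
The reverse layer indexing and the pooling step can be sketched as follows. This is an illustrative assumption about how the indices are resolved, not the pipeline's actual code: `resolve_layer`, `mean_pool`, `embed`, and the `hidden_states[layer][residue]` layout are hypothetical, and the sketch assumes that several selected layers are averaged into one vector, as the layer indexing note suggests.

```python
def resolve_layer(layer_index, num_layers):
    """Map a reverse index (0 = last layer) to a forward index (0 = first layer)."""
    if not 0 <= layer_index < num_layers:
        raise ValueError(f"layer_index {layer_index} out of range for {num_layers} layers")
    return num_layers - 1 - layer_index


def mean_pool(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def embed(hidden_states, layer_indices):
    """Pool over residues within each selected layer, then average the layers.

    `hidden_states[layer][residue]` is a feature vector (hypothetical layout).
    """
    num_layers = len(hidden_states)
    per_layer = []
    for layer_index in layer_indices:
        residues = hidden_states[resolve_layer(layer_index, num_layers)]
        per_layer.append(mean_pool(residues))  # one vector per selected layer
    return mean_pool(per_layer)                # one vector per protein


# Toy model output: 3 layers, 2 residues, 2 features.
hidden_states = [
    [[0.0, 0.0], [2.0, 2.0]],  # first layer
    [[1.0, 1.0], [3.0, 3.0]],  # middle layer
    [[4.0, 0.0], [0.0, 4.0]],  # last (output) layer
]

print(resolve_layer(0, 36))            # forward index of the output layer in a 36-layer model
print(embed(hidden_states, [0]))       # last layer only
print(embed(hidden_states, [0, 2]))    # average of last and first layers
```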

---

### **5. Lookup Configuration**

The lookup stage consumes precomputed embeddings (from Stage A) and reference tables loaded in memory from the Protein Information System (PIS).
It performs pairwise similarity searches, redundancy and taxonomy filtering, and exports per-query results in CSV/TSV format.

```yaml
# ==============================================================================
# Stage B: Lookup
# Consumes embeddings.h5 + in-memory references (IDs, vectors, GO) to produce CSV/TSV.
# ==============================================================================
lookup:
  use_gpu: True               # bool | If true, run vector distances on GPU when available
  batch_size: 216             # int  | Vector distance batch size (tune to GPU memory)
  distance_metric: cosine     # enum{cosine,euclidean} | Distance for nearest-neighbor search
  limit_per_entry: 3          # int  | k neighbors returned per query (a.k.a. "k")
  lookup_cache_max: 4         # int  | Max entries per (model,layer) in-memory cache (tune to RAM)
  topgo: true                 # bool | If true, emit TopGO-compatible TSV alongside CSV outputs

  precision: 4                # int  | Number of decimal places used when exporting results

  # Redundancy filtering (optional pre-filter on reference side, e.g., MMseqs2).
  redundancy:
    identity: 0               # float in [0,1] | 0 disables; 1.0 = 100% identity (strict deduplication)
    coverage: 0.7             # float in (0,1]  | Alignment coverage threshold used in deduplication
    threads: 10               # int  | CPU threads for redundancy filtering tools

  # Taxonomy filters (applied after NN retrieval to prune/keep specific taxa).
  taxonomy:
    exclude: [ ]              # list[str] | Taxonomy IDs to exclude (e.g., ["559292","6239"])
    include_only: [ ]         # list[str] | If non-empty, restrict results to these IDs (takes precedence)
    get_descendants: false    # bool | If true, expand filters to include descendants
```

#### **Key Parameters**

| Parameter          | Description                                                                            |
| ------------------ | -------------------------------------------------------------------------------------- |
| `use_gpu`          | Enables GPU-based nearest neighbor computation for accelerated lookup.                 |
| `batch_size`       | Controls memory usage and parallelism during distance calculations.                    |
| `distance_metric`  | Defines similarity function (`cosine` or `euclidean`).                                 |
| `limit_per_entry`  | Number of top hits per protein (k).                                                    |
| `lookup_cache_max` | Controls RAM caching per model/layer.                                                  |
| `topgo`            | Generates additional `.topgo` files compatible with downstream GO enrichment analysis. |
| `redundancy`       | Optional prefiltering step to remove redundant reference proteins.                     |
| `taxonomy`         | Applies organism-level filters after similarity retrieval.                             |
| `precision`        | Number of decimals in exported numeric results.                                        |
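
To make the interaction between `limit_per_entry`, `precision`, and the taxonomy filters concrete, here is a small sketch. It assumes the order stated in the configuration comments (neighbors are retrieved first, taxonomy filters are applied afterwards, and a non-empty `include_only` takes precedence over `exclude`). The function names and the hit layout are hypothetical, and descendant expansion (`get_descendants`) is not modeled.

```python
def top_hits(hits, limit_per_entry=3, precision=4):
    """Keep the k closest hits and round their distances for export."""
    nearest = sorted(hits, key=lambda hit: hit["distance"])[:limit_per_entry]
    return [{**hit, "distance": round(hit["distance"], precision)} for hit in nearest]


def apply_taxonomy_filter(hits, exclude=(), include_only=()):
    """Filter hits by taxonomy ID; a non-empty include_only overrides exclude."""
    include_only = set(include_only)
    exclude = set(exclude)
    if include_only:
        return [hit for hit in hits if hit["taxonomy_id"] in include_only]
    return [hit for hit in hits if hit["taxonomy_id"] not in exclude]


# Invented hits for one query protein.
hits = [
    {"accession": "A", "taxonomy_id": "559292", "distance": 0.12345678},
    {"accession": "B", "taxonomy_id": "6239", "distance": 0.05},
    {"accession": "C", "taxonomy_id": "9606", "distance": 0.2},
    {"accession": "D", "taxonomy_id": "9606", "distance": 0.9},
]

nearest = top_hits(hits, limit_per_entry=3, precision=4)
kept = apply_taxonomy_filter(nearest, exclude=["6239"])
only = apply_taxonomy_filter(nearest, exclude=["9606"], include_only=["9606"])

print([hit["accession"] for hit in nearest])
print([hit["accession"] for hit in kept])
print([hit["accession"] for hit in only])
```

Because filtering happens after retrieval in this sketch, a query can end up with fewer than `limit_per_entry` hits once taxa are excluded.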

---

### **5b. Post-Processing Configuration**

The aggregation of prediction attributes is governed by a weighted scoring schema defined as follows:

```yaml
postprocess:
  keep_sequences: true
  summary:
    normalize_count_by_limit_per_entry: true
    export_raw_count: true
    metrics:
      reliability_index: [max]
      identity: [max]
      identity_sw: [max]
    aliases:
      reliability_index: ri
      identity: id_g
      identity_sw: id_l
    weights:
      reliability_index: { max: 0.3 }
      mean_id_g: 0.35
      mean_id_l: 0.35
      count: 0
    weighted_prefix: "w_"
```

The **post-processing stage** integrates results across multiple **layers and models** by applying a *weighted aggregation schema*.
For each query, FANTASIA combines the **maximum reliability index** and **mean global/local identities** from all active models and exported layers.
Scores are normalized per entry, aliased for clarity (`ri`, `id_g`, `id_l`), and merged using the specified weights
*(0.3 × RI + 0.35 × ID₍g₎ + 0.35 × ID₍l₎)* to produce a unified prediction output (`w_*`).
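
As a worked example of the weighting, the sketch below combines the three aliased metrics with the weights from the configuration. It assumes all three metrics are already on a common 0–1 scale; the function name, the `final_score` key, and the input values are invented for illustration and are not taken from FANTASIA's output.

```python
# Weights from the post-processing configuration, keyed by metric alias.
WEIGHTS = {"ri": 0.3, "id_g": 0.35, "id_l": 0.35}


def weighted_score(summary, weights=WEIGHTS, prefix="w_"):
    """Add one weighted column per metric plus their sum to a summary row."""
    row = dict(summary)
    total = 0.0
    for name, weight in weights.items():
        contribution = weight * summary[name]
        row[f"{prefix}{name}"] = contribution
        total += contribution
    row["final_score"] = total  # hypothetical column name
    return row


# Invented summary for one (protein, GO term) pair.
row = weighted_score({"ri": 0.8, "id_g": 0.6, "id_l": 0.4})
print(round(row["final_score"], 4))
```

With these inputs the contributions are 0.3 × 0.8, 0.35 × 0.6, and 0.35 × 0.4, so the combined score is approximately 0.59.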

---

### **6. Research framework schema**



In [None]:
import mermaidian as mm

# Base configuration (identical to the official example)
out_path = '/kaggle/working/'

config0 = {'fontSize': '24px'}
options0 = {'bgColor': '#ffffff', 'width': '900'}
pad_data0 = {'pad_top': 80, 'pad_bottom': 40, 'border_thickness': 6, 'border_color': "#aaaaaa", 'pad_color': '#ffffff'}
title_data0 = {'position': 'tc', 'font_scale': 1.0, 'font_thickness': 1, 'font_color': "#000000", 'font_name': 'duplex'}

def show_mermaid(diagram_code, file_name, title, theme='forest', config=config0, options=options0, pad_data=pad_data0, title_data=title_data0):
    # Work on a copy so the shared default dict (options0) is not mutated between calls
    options = {**options, 'bgColor': options['bgColor'].replace('#', '')}
    diagram = mm.get_mermaid_diagram('svg', diagram_code, theme, config, options)
    title_data_svg = {
        'title': title,
        'position': 'tc',
        'font_name': 'Arial, sans-serif',
        'font_size': 24,
        'font_color': '#000000',
        'font_bg_color': '',
        'font_weight': 'bold'
    }
    diagramPBT = mm.add_paddings_border_and_title_to_svg(diagram, pad_data, title_data_svg)
    mm.show_svg_ipython_centered(diagramPBT)
    # out_path already ends with '/', so no extra separator is needed
    mm.save_diagram_as_svg(f'{out_path}{file_name}.svg', diagramPBT)


research_framework_code = """
flowchart LR
    subgraph PIS["Protein Information System (PIS)"]
        B1["T₀ – CAFA5 baseline:<br>proteins, annotations, structures, embeddings"]
        B2["T₁ – CAFA6 baseline:<br>updated annotations and embeddings"]
    end

    subgraph FANTASIA_T0["FANTASIA (T₀ instance)"]
        C1["Compute embedding distances<br>(cosine, L2)"]
        C2["Transfer GO terms from nearest neighbors"]
        C3["Weighted post-processing<br>(RI·0.3 + ID₍g₎·0.35 + ID₍l₎·0.35)"]
    end

    subgraph FANTASIA_T1["FANTASIA (T₁ instance)"]
        C4["Compute embedding distances<br>(cosine, L2)"]
        C5["Transfer GO terms from nearest neighbors"]
        C6["Weighted post-processing<br>(RI·0.3 + ID₍g₎·0.35 + ID₍l₎·0.35)"]
    end

    subgraph PIPELINE["Validation & Submission Pipeline"]
        D1["Holdout generation between T₀ → T₁"]
        D2["Internal validation on multiple temporal holdouts"]
        D3["Model consolidation"]
        D4["CAFA6 final submission"]
    end

    A1["UniProt & GOA GAF archives (2014–2026)"] --> A2["Parsing & filtering by evidence codes<br>(EXP, IDA, IMP, IPI, IBA, IEA)"]
    A2 --> B1 & B2
    B1 --> C1 --> C2 --> C3 --> D1 --> D2 --> D3
    B2 --> C4 --> C5 --> C6 --> D4

    classDef data fill:#202830,stroke:#89b4fa,stroke-width:2px,color:#f8f9fa
    classDef pis fill:#1e1e2e,stroke:#f38ba8,stroke-width:2px,color:#f8f9fa
    classDef fantasia fill:#282a36,stroke:#94e2d5,stroke-width:2px,color:#f8f9fa
    classDef pipe fill:#2b303b,stroke:#a6e3a1,stroke-width:2px,color:#f8f9fa
    class A1,A2 data
    class B1,B2 pis
    class C1,C2,C3,C4,C5,C6 fantasia
    class D1,D2,D3,D4 pipe
"""

theme = {'primaryColor': '#1e1e2e',
         'primaryTextColor': '#ffffff',
         'secondaryColor': '#11111b',
         'tertiaryColor': '#4c4f69',
         'lineColor': '#aaaaaa',
         'fontSize': '20px'}

show_mermaid(
    research_framework_code,
    file_name='fantasia_cafa6_schema',
    title='FANTASIA–CAFA6 Research Framework',
    theme=theme
)



---

### **7. Future Work**

* Implement multiple **temporal holdouts** leveraging the historical sequence of GAF releases to rigorously evaluate model generalization.
* Establish a benchmark milestone: **surpass the CAFA-5 #1 score** under both *leakage-free* and *standard* conditions.
* Extend validation beyond FANTASIA to other prediction pipelines and embeddings.
* Conduct **local hyperparameter optimization**, prioritizing robust validation over leaderboard fitting.
* Explore integration of **structural data** to enhance annotation precision.
* Continue investigating **multifunctionality and embedding-based inference reliability** within the PIS–FANTASIA framework.

---

### **8. Immediate Tasks**

This section summarizes the short-term objectives to consolidate the CAFA-6 system and prepare the official validation and submission pipeline.

#### **1. Repository Creation – GAF Filtering Mechanism**

* **Objective:** Create a dedicated GitHub repository for the **massive GAF filtering pipeline**.
* **Purpose:** Centralize the code responsible for parsing, cleaning, and filtering historical GAFs (releases 214–226).
* **Actions:**

  * Publish the preprocessing scripts with version control and clear documentation.
  * Include configuration templates for evidence-based filtering (`EXP`, `IDA`, `IMP`, `IPI`, `IBA`, `IEA`).
  * Ensure reproducibility via environment files (`pyproject.toml` or `environment.yml`).
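
A minimal sketch of evidence-based filtering is shown below. It relies only on the standard GAF layout (tab-separated rows, header lines starting with `!`, DB Object ID in column 2, GO ID in column 5, evidence code in column 7). The function name, the code sets, and the sample rows are invented for illustration and do not reproduce the planned repository's code.

```python
EXPERIMENTAL_CODES = {"EXP", "IDA", "IMP", "IPI"}
OPTIONAL_CODES = {"IBA", "IEA"}  # can be switched on or off, as in PIS


def filter_gaf_lines(lines, allowed_codes=EXPERIMENTAL_CODES):
    """Yield (accession, go_id, evidence_code) for rows with an allowed code."""
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and GAF header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip malformed rows
        accession, go_id, evidence = fields[1], fields[4], fields[6]
        if evidence in allowed_codes:
            yield accession, go_id, evidence


# Invented sample rows in GAF column order (only the first nine columns shown).
sample = [
    "!gaf-version: 2.2",
    "UniProtKB\tX00001\tGENE1\tenables\tGO:0003677\tPMID:1\tIDA\t\tF",
    "UniProtKB\tX00001\tGENE1\tinvolved_in\tGO:0006355\tGO_REF:0000002\tIEA\t\tP",
    "UniProtKB\tX00002\tGENE2\tlocated_in\tGO:0005634\tPMID:2\tIBA\t\tC",
]

strict = list(filter_gaf_lines(sample))
relaxed = list(filter_gaf_lines(sample, EXPERIMENTAL_CODES | OPTIONAL_CODES))
print(strict)
print(len(relaxed))
```

Passing the allowed codes as a parameter keeps the same parser usable for both the experimental-only holdout definition and the wider reference set that includes `IBA` and `IEA`.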

#### **2. FANTASIA Results Dataset – CAFA-6**

* **Objective:** Upload the **FANTASIA inference results** as a Kaggle dataset for CAFA-6.
* **Actions:**

  * Share a sample of the FANTASIA results.
  * Compress the results and upload them as a Kaggle Dataset.
  * Include metadata and a README with some exploratory analysis.
  * Provide an **example submission notebook** demonstrating leaderboard formatting and metric calculation from raw results.

#### **3. Information System Instance – Pre-Evaluation Snapshot**

* **Objective:** Generate a **PIS instance** representing the biological knowledge *prior to the CAFA-6 evaluation period*.
* **Actions:**

  * Freeze the temporal database snapshot (proteins, annotations, structures, embeddings).
  * Tag it as `PIS_T₀_local` for internal validation use.
  * Store version metadata (GAF source, release IDs, evidence filters).

#### **3b. Information System Instance – Latest Snapshot**

* **Objective:** Generate a **PIS instance** representing the most up-to-date biological knowledge, used to predict the current holdout.
* **Actions:**

  * Freeze the temporal database snapshot (proteins, annotations, structures, embeddings).
  * Tag it as `PIS_T₁_local` for predicting the current holdout.
  * Store version metadata (GAF source, release IDs, evidence filters).

#### **4. Automated Evaluation and Hyperparameter Search**

* **Objective:** Implement an **automated evaluation engine** to explore hyperparameters across different holdouts.
* **Scope:**

  * Automate internal validation loops using subsets of proteins with recent GO updates.
  * Log metrics such as F-score, precision, recall, and semantic similarity.
  * Integrate search strategies (grid/random/Bayesian) for model parameter tuning.
  * Output results to structured reports or dashboards.
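
The evaluation loop could start from something as simple as the sketch below, which scores one protein's predicted GO terms against its true terms and grid-searches a score threshold. It is a simplified set-based F-score, not the official CAFA metric (no ontology propagation or term weighting), and all function names and data are hypothetical.

```python
def precision_recall_f1(predicted, true):
    """Set-based precision, recall and F1 for one protein's GO terms."""
    predicted, true = set(predicted), set(true)
    true_positives = len(predicted & true)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true) if true else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def best_threshold(scored_predictions, true, thresholds):
    """Grid-search the score cutoff that maximizes F1.

    `scored_predictions` maps GO IDs to confidence scores.
    """
    best = None
    for threshold in thresholds:
        selected = [go for go, score in scored_predictions.items() if score >= threshold]
        f1 = precision_recall_f1(selected, true)[2]
        if best is None or f1 > best[1]:
            best = (threshold, f1)
    return best


# Invented scores and ground truth for one protein.
scored = {"GO:0003677": 0.9, "GO:0005634": 0.6, "GO:0005524": 0.2}
truth = {"GO:0003677", "GO:0005634"}

best = best_threshold(scored, truth, thresholds=[0.1, 0.5, 0.8])
print(best)
```

The same loop structure extends to random or Bayesian search by replacing the fixed threshold list with a sampler, and to multiple holdouts by averaging the per-protein scores within each one.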

#### **5. Controlled Holdout Evaluation**

* **Rationale:**
  Changes between GAF releases are **modest compared to the test supersets**, so validation should rely on **smaller, well-defined sets**.
* **Plan:**

  * Focus on reduced and traceable subsets where annotation updates are verified.
  * Use these controlled holdouts to test new post-processing, weighting schemes, and model comparison consistency.
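
One way to obtain such small, traceable holdouts is to diff two annotation snapshots, an earlier one (T₀) and a later one (T₁). The sketch below is a deliberate simplification: it treats every new (protein, GO term) pair as a holdout candidate and ignores ontology aspects and the CAFA target categories. The function name and the toy data are invented.

```python
def temporal_holdout(annotations_t0, annotations_t1):
    """Group (accession, go_id) pairs present at T1 but absent at T0 by protein."""
    new_pairs = set(annotations_t1) - set(annotations_t0)
    holdout = {}
    for accession, go_id in sorted(new_pairs):
        holdout.setdefault(accession, []).append(go_id)
    return holdout


# Invented snapshots of experimentally supported annotations.
t0 = {("P1", "GO:0003677")}
t1 = {("P1", "GO:0003677"), ("P1", "GO:0005634"), ("P2", "GO:0005524")}

holdout = temporal_holdout(t0, t1)
print(holdout)
```

Keeping the result keyed by protein makes it straightforward to check each update by hand before it is used for validation.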

**Outcome:**
Completion of these tasks will finalize the **CAFA-6 validation framework**, enabling fully reproducible experiments, model optimization, and submission deployment within the PIS–FANTASIA ecosystem.
