# Topics Covered in This Notebook

This notebook presents an exploratory data analysis (EDA) for the CAFA 6 Protein Function Prediction competition. It validates the dataset schema and integrity, characterizes label imbalance across GO subontologies (MF, BP, CC), examines information accretion (IA) weights, analyzes protein sequence length distributions, investigates taxonomic skew, and summarizes the properties of the Gene Ontology (GO) graph relevant to modeling.

## Objectives
- Establish a reliable understanding of data formats, sizes, and constraints.
- Quantify class imbalance and its effects on evaluation and calibration.
- Investigate IA weights to predict metric sensitivity to rare/deep terms.
- Evaluate sequence length distributions to inform feature extraction and clustering.
- Examine taxonomic composition to motivate taxon-aware priors and validation splits.
- Summarize the GO DAG structure (depth, relationships) to guide propagation and thresholding.

## Data Sources
- Training: sequences (`train_sequences.fasta`), labels (`train_terms.tsv`), taxonomy (`train_taxonomy.tsv`)
- Ontology: GO graph (`go-basic.obo`)
- Evaluation weights: information accretion per GO term (`IA.tsv`)
- Test superset: sequences (`testsuperset.fasta`) and taxa (`testsuperset-taxon-list.tsv`)
- Reference: `sample_submission.tsv` for submission format

## Key Questions Addressed
- How imbalanced are GO term labels overall and per subontology?
- How do IA weights relate to term frequency and ontology depth?
- What are typical and extreme sequence lengths?
- How skewed is the taxonomy, and how much train–test taxon overlap exists?
- What ontology characteristics (depth/co-occurrence) should influence modeling choices?

## Methods (High-Level)
- Lightweight parsing of TSV/FASTA/OBO to extract schema and counts
- Frequency and density plots for labels, IA, and sequence lengths
- Overlap and distribution comparisons for taxonomy and IDs
- Approximate GO depth via shortest distance from subontology roots

## Outputs and Takeaways
- Clear summary of label imbalance and IA-weighted evaluation implications
- Practical guidance for batching/featurization from sequence length analysis
- Rationale for taxon-aware priors and stratified validation
- Motivation for ontology-aware prediction propagation and joint thresholding

## Reproducibility
- Code cells are lightly commented; markdown cells explain outputs
- Figures and tables generated directly from the provided competition data

In [None]:
import os
import gc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
from itertools import islice

In [None]:
base_dir = "/kaggle/input/cafa-6-protein-function-prediction"
train_dir = os.path.join(base_dir, "Train")
test_dir = os.path.join(base_dir, "Test")

paths = {
    "go_obo": os.path.join(train_dir, "go-basic.obo"),
    "train_fasta": os.path.join(train_dir, "train_sequences.fasta"),
    "train_terms": os.path.join(train_dir, "train_terms.tsv"),
    "train_tax": os.path.join(train_dir, "train_taxonomy.tsv"),
    "ia": os.path.join(base_dir, "IA.tsv"),
    "test_fasta": os.path.join(test_dir, "testsuperset.fasta"),
    "selected_taxa": os.path.join(test_dir, "testsuperset-taxon-list.tsv"),
    "sample_sub": os.path.join(base_dir, "sample_submission.tsv"),
}

{key: (os.path.exists(p), p) for key, p in paths.items()}

In [None]:
sample_sub = pd.read_csv(paths["sample_sub"], sep="\t", header=None, names=["EntryID","term_or_text","score","text_opt"], nrows=1000)
sample_sub.head(8)

Sample submission has three primary columns: protein ID, GO term ID or the literal `Text`, and a probability score in (0, 1]. Text predictions include a fourth field for the description. Scores should have up to three significant figures, and each protein can have at most 1500 terms across MF/BP/CC.
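The constraints above can be checked mechanically before writing a submission file. The following is a minimal sketch, assuming a frame with columns `EntryID`, `term`, and `score`; the helper name `check_submission` and the toy rows are illustrative and not part of the competition tooling.

```python
import pandas as pd

MAX_TERMS_PER_PROTEIN = 1500  # per-protein limit quoted in the text above

def check_submission(df: pd.DataFrame) -> dict:
    """Return simple diagnostics for the GO-term rows of a submission-like frame."""
    go_rows = df[df["term"].str.startswith("GO:")]
    scores = go_rows["score"].astype(float)
    per_protein = go_rows.groupby("EntryID")["term"].nunique()
    return {
        "scores_in_range": bool(((scores > 0) & (scores <= 1)).all()),
        "max_terms_per_protein": int(per_protein.max()) if len(per_protein) else 0,
        "within_term_limit": bool((per_protein <= MAX_TERMS_PER_PROTEIN).all()),
    }

# Toy rows (hypothetical protein IDs) to exercise the checks.
toy = pd.DataFrame({
    "EntryID": ["P1", "P1", "P2"],
    "term": ["GO:0003674", "GO:0008150", "GO:0005575"],
    "score": [0.912, 0.5, 1.0],
})
report = check_submission(toy)

# Format scores with at most three significant figures.
formatted = [f"{s:.3g}" for s in toy["score"]]
```

The `.3g` format keeps at most three significant figures and drops trailing zeros, which is one way to satisfy the formatting rule without rounding scores down to zero.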

In [None]:
terms_df = pd.read_csv(paths["train_terms"], sep="\t")
terms_df.shape, terms_df.head(10)

Training label table loaded. Columns are expected to be `EntryID`, `term`, and `aspect` where aspect ∈ {F, P, C} corresponding to MF, BP, and CC subontologies.

In [None]:
n_proteins = terms_df["EntryID"].nunique()
n_terms = terms_df["term"].nunique()
aspect_counts = terms_df["aspect"].value_counts().rename(index={"F":"MF","P":"BP","C":"CC"})
n_rows = len(terms_df)
summary_basic = {
    "rows": n_rows,
    "unique_proteins": n_proteins,
    "unique_terms": n_terms,
    "labels_per_protein_mean": terms_df.groupby("EntryID")["term"].nunique().mean(),
}
summary_basic, aspect_counts.to_dict()

Basic summary shows the size of the multi-label problem, unique proteins and GO terms, and the count of label rows per subontology. The average number of unique terms per protein indicates label cardinality.

In [None]:
plt.figure(figsize=(5,3))
sns.countplot(x=terms_df["aspect"].map({"F":"MF","P":"BP","C":"CC"}))
plt.title("Label Rows by GO Subontology")
plt.xlabel("Subontology")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

Label distribution by subontology provides an initial view of which branch (MF/BP/CC) dominates annotations, informing sampling strategies and evaluation slicing.

In [None]:
term_freq = terms_df["term"].value_counts()
term_freq_desc = term_freq.describe(percentiles=[0.5,0.9,0.99]).to_dict()
term_freq_desc

Term frequency statistics quantify class imbalance and long-tail behavior. High skew suggests that naive baselines such as global priors will favor head classes and rarely predict rare terms.

In [None]:
top_terms = term_freq.head(20)
plt.figure(figsize=(7,4))
sns.barplot(x=top_terms.values, y=top_terms.index, orient="h")
plt.title("Top 20 Most Frequent GO Terms")
plt.xlabel("Frequency")
plt.ylabel("GO Term")
plt.tight_layout()
plt.show()

The top terms contribute a sizable fraction of labels. Downstream models should control for dominance of high-frequency terms to avoid inflated precision on shallow ontology nodes.
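To put a number on "a sizable fraction", one option is the share of all label rows covered by the k most frequent terms. This is a small sketch on toy counts; `head_share` is an illustrative helper, and the term names are placeholders rather than real GO IDs.

```python
import pandas as pd

def head_share(term_counts: pd.Series, k: int) -> float:
    """Fraction of all label rows accounted for by the k most frequent terms."""
    ordered = term_counts.sort_values(ascending=False)
    return float(ordered.head(k).sum() / ordered.sum())

# Placeholder term names with toy frequencies.
toy_counts = pd.Series({"GO:A": 50, "GO:B": 30, "GO:C": 15, "GO:D": 5})
share_top2 = head_share(toy_counts, 2)  # (50 + 30) / 100
```

In the notebook, the same helper could be applied to `term_freq` with `k=20` to match the bar plot.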

In [None]:
ia_df = pd.read_csv(paths["ia"], sep="\t", header=None, names=["term","ia"])
ia_df["ia"].describe(percentiles=[0.5,0.9,0.99]).to_dict()

Information accretion weights characterize the relative importance of terms. The long tail at higher IA values often corresponds to deeper, rarer, and harder-to-predict terms, which are emphasized in weighted metrics.

In [None]:
plt.figure(figsize=(6,3.8))
sns.histplot(ia_df["ia"], bins=80, log_scale=(False, False))
plt.title("Distribution of IA Weights")
plt.xlabel("IA")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

The IA weight distribution helps anticipate how sensitive the evaluation is to rare, specific terms. Calibration and thresholding should account for IA, since the competition metric is an IA-weighted F-measure rather than a plain F1.
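To make the effect of IA weighting concrete, here is a simplified sketch of IA-weighted precision, recall, and F for a single protein at a single threshold. The official evaluator is more involved (it aggregates over proteins and scans thresholds), so treat this only as an illustration; the term names and IA values are made up.

```python
def weighted_prf(pred: set, true: set, ia: dict) -> tuple:
    """IA-weighted precision, recall and F for one protein (simplified)."""
    def weight(terms):
        # Terms without an IA value contribute zero weight (an assumption).
        return sum(ia.get(t, 0.0) for t in terms)

    tp = weight(pred & true)
    precision = tp / weight(pred) if weight(pred) > 0 else 0.0
    recall = tp / weight(true) if weight(true) > 0 else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f_score

# Toy ontology path root -> mid -> leaf with made-up IA values.
ia_toy = {"root": 0.0, "mid": 1.0, "leaf": 3.0}
true_terms = {"root", "mid", "leaf"}

p_shallow, r_shallow, f_shallow = weighted_prf({"root", "mid"}, true_terms, ia_toy)
p_full, r_full, f_full = weighted_prf(true_terms, true_terms, ia_toy)
```

Predicting only the shallow terms keeps weighted precision at 1 but recovers just a quarter of the weighted recall, because most of the IA mass sits on the leaf.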

In [None]:
# Efficient FASTA header parsing and sequence length distribution
def fasta_lengths(path, max_records=None):
    lengths = []
    ids = []
    cnt = 0
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        seq_len = 0
        cur_id = None
        for line in f:
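            if line.startswith(">"):
                # Flush the record that just ended before starting a new one.
                if cur_id is not None:
                    lengths.append(seq_len)
                    ids.append(cur_id)
                    cur_id = None
                # Stop before opening a record beyond the limit, so no empty record is emitted.
                if max_records and cnt >= max_records:
                    break
                cur_id = line.strip().split("|")[1] if "|" in line else line.strip()[1:].split()[0]
                seq_len = 0
                cnt += 1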
            else:
                seq_len += len(line.strip())
        if cur_id is not None:
            lengths.append(seq_len)
            ids.append(cur_id)
    return pd.DataFrame({"EntryID": ids, "seq_len": lengths})

train_len_df = fasta_lengths(paths["train_fasta"])
train_len_df["seq_len"].describe(percentiles=[0.5,0.9,0.99]).to_dict()

The sequence length summary bounds the input sizes that sequence encoders and similarity tools must handle. Extreme lengths may require length-aware batching or truncation before embedding.
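One common way to handle a wide length range is to sort sequences by length and fill batches up to a padded-token budget. The sketch below is one possible scheme, not something the notebook itself implements; `max_tokens` and `max_len` are hypothetical knobs, and truncation to `max_len` is an assumption about the downstream encoder.

```python
def length_bucketed_batches(lengths, max_tokens=4000, max_len=1024):
    """Group sequence indices so that (longest effective length) x (batch size)
    stays within max_tokens. Lengths are capped at max_len (truncation)."""
    order = sorted(range(len(lengths)), key=lambda i: min(lengths[i], max_len))
    batches, current, current_max = [], [], 0
    for i in order:
        eff = min(lengths[i], max_len)
        new_max = max(current_max, eff)
        # Close the batch if adding this sequence would exceed the budget.
        if current and new_max * (len(current) + 1) > max_tokens:
            batches.append(current)
            current, new_max = [], eff
        current.append(i)
        current_max = new_max
    if current:
        batches.append(current)
    return batches

toy_lengths = [100, 120, 3000, 90, 1100]
batches = length_bucketed_batches(toy_lengths, max_tokens=400, max_len=1024)
```

A sequence whose capped length alone exceeds the budget still gets a batch of its own, so the budget is a soft limit for outliers.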

In [None]:
tax_df = pd.read_csv(paths["train_tax"], sep="\t", header=None, names=["EntryID","taxon_id"])
tax_counts = tax_df["taxon_id"].value_counts()
tax_counts.head(10).to_dict()

Top taxa by protein counts indicate potential benefits of taxon-aware priors or stratified evaluation splits. Species imbalance may bias learned representations.

In [None]:
top_taxon = tax_counts.head(20)
plt.figure(figsize=(7,4))
sns.barplot(x=top_taxon.values, y=top_taxon.index, orient="h")
plt.title("Top 20 Taxa by Training Proteins")
plt.xlabel("Count")
plt.ylabel("Taxon ID")
plt.tight_layout()
plt.show()

Dominant taxa may warrant per-taxon calibration or separate baselines to avoid conflating phylogenetic signal with functional signal.
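A taxon-aware prior can be as simple as the within-taxon frequency of each term, shrunk toward the global frequency so that small taxa do not produce extreme values. The following is a sketch under that assumption; `taxon_term_priors` and the smoothing strength `alpha` are illustrative choices, and the toy IDs are placeholders.

```python
import pandas as pd

def taxon_term_priors(terms: pd.DataFrame, tax: pd.DataFrame, alpha: float = 1.0) -> pd.DataFrame:
    """Smoothed P(term | taxon) for every (taxon, term) pair observed in training."""
    labeled = terms.drop_duplicates(["EntryID", "term"]).merge(tax, on="EntryID", how="inner")
    n_total = labeled["EntryID"].nunique()
    global_prior = labeled.groupby("term")["EntryID"].nunique() / n_total
    n_taxon = labeled.groupby("taxon_id")["EntryID"].nunique()
    counts = (
        labeled.groupby(["taxon_id", "term"])["EntryID"]
        .nunique().rename("count").reset_index()
    )
    counts["n"] = counts["taxon_id"].map(n_taxon)
    counts["global"] = counts["term"].map(global_prior)
    # Shrink the within-taxon rate toward the global rate with pseudo-count alpha.
    counts["prior"] = (counts["count"] + alpha * counts["global"]) / (counts["n"] + alpha)
    return counts[["taxon_id", "term", "prior"]]

toy_terms = pd.DataFrame({
    "EntryID": ["P1", "P1", "P2", "P3"],
    "term": ["T1", "T2", "T1", "T2"],
})
toy_tax = pd.DataFrame({"EntryID": ["P1", "P2", "P3"], "taxon_id": [9606, 9606, 10090]})
priors = taxon_term_priors(toy_terms, toy_tax)
prior_lookup = priors.set_index(["taxon_id", "term"])["prior"]
```

Only observed (taxon, term) pairs appear in the output; unseen pairs would need a fallback such as the global prior.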

In [None]:
labels_per_protein = terms_df.groupby("EntryID")["term"].nunique()
labels_per_protein.describe(percentiles=[0.5,0.9,0.99]).to_dict()

Label cardinality per protein summarizes how many GO terms are typically associated with a single protein. This impacts thresholding and the expected sparsity of predictions.

In [None]:
sample_entries = set(terms_df["EntryID"].drop_duplicates().sample(5000, random_state=42)) if terms_df["EntryID"].nunique() > 5000 else set(terms_df["EntryID"].unique())
sub = terms_df[terms_df["EntryID"].isin(sample_entries)]
pair_counts = Counter()
for eid, grp in sub.groupby("EntryID"):
    terms = grp["term"].values
    if len(terms) <= 1:
        continue
    for i in range(len(terms)):
        for j in range(i+1, len(terms)):
            a, b = terms[i], terms[j]
            if a > b:
                a, b = b, a
            pair_counts[(a,b)] += 1
co_top = sorted(pair_counts.items(), key=lambda x: x[1], reverse=True)[:20]
co_top[:10]

Frequent co-occurring terms reveal ontology or biological relationships commonly annotated together. This can motivate structured predictors or joint calibration strategies.

In [None]:
freq_df = term_freq.rename("freq").reset_index().rename(columns={"index":"term"})
ia_join = freq_df.merge(ia_df, on="term", how="left").dropna()
corr = ia_join[["freq","ia"]].corr().iloc[0,1]
corr

Correlation between term frequency and IA weights typically trends negative: rarer (deeper) terms tend to have higher IA. This highlights the need to avoid shallow-term bias.

In [None]:
test_ids = []
with open(paths["test_fasta"], "r", encoding="utf-8", errors="ignore") as f:
    for line in f:
        if line.startswith(">"):
            pid = line.strip().split("|")[1] if "|" in line else line.strip()[1:].split()[0]
            test_ids.append(pid)
test_ids = pd.Series(test_ids, name="EntryID")
train_ids = terms_df["EntryID"].drop_duplicates()
overlap = len(set(train_ids).intersection(set(test_ids)))
total_test = test_ids.nunique()
{"test_unique": total_test, "overlap_with_train": overlap, "overlap_rate": round(overlap / max(1,total_test), 4)}

Training–test superset overlap indicates how often the same UniProt IDs appear across splits. Although evaluations are prospective, overlap patterns can influence similarity-based baselines.

In [None]:
def parse_obo_ids(path, max_nodes=None):
    parents = defaultdict(list)
    aspects = {}
    cur_id = None
    cur_ns = None
    cnt = 0
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            s = line.strip()
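            if s.startswith("[") and s.endswith("]"):
                # Any stanza header ([Term], [Typedef], ...) closes the current term,
                # so Typedef fields cannot overwrite the last term's namespace.
                if cur_id is not None:
                    cnt += 1
                    if max_nodes and cnt >= max_nodes:
                        break
                cur_id = None
                cur_ns = None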
            elif s.startswith("id: GO:"):
                cur_id = s.split("id: ")[1]
            elif s.startswith("namespace:"):
                cur_ns = s.split("namespace: ")[1]
                if cur_id:
                    aspects[cur_id] = cur_ns
            elif s.startswith("is_a: GO:") and cur_id:
                p = s.split("is_a: ")[1].split()[0]
                parents[cur_id].append(p)
            elif s.startswith("relationship: part_of GO:") and cur_id:
                p = s.split("relationship: part_of ")[1].split()[0]
                parents[cur_id].append(p)
    return parents, aspects

parents, aspects_ns = parse_obo_ids(paths["go_obo"])
num_nodes = len(aspects_ns)
num_edges = sum(len(v) for v in parents.values())
{"nodes": num_nodes, "edges": num_edges}

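The GO graph summary gives the scale of the DAG and supports simple diagnostics without pulling in a full ontology toolchain. The parsed relationships are `is_a` and `part_of`. Note that the node count includes obsolete terms, since the parser does not check `is_obsolete`.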

In [None]:
from collections import deque

roots = {
    "biological_process": "GO:0008150",
    "cellular_component": "GO:0005575",
    "molecular_function": "GO:0003674",
}

def shortest_depth_from_roots(parents, roots_set):
    depth = {}
    indeg = defaultdict(int)
    children = defaultdict(list)
    for child, pars in parents.items():
        for p in pars:
            children[p].append(child)
            indeg[child] += 1
        if child not in indeg:
            indeg[child] = indeg.get(child, 0)
    q = deque()
    for r in roots_set:
        depth[r] = 0
        q.append(r)
    while q:
        u = q.popleft()
        for v in children.get(u, []):
            dv = depth.get(v, None)
            nu = depth[u] + 1
            if dv is None or nu < dv:
                depth[v] = nu
            indeg[v] -= 1
            if indeg[v] <= 0:
                q.append(v)
    return depth

depth_bp = shortest_depth_from_roots(parents, {roots["biological_process"]})
depth_cc = shortest_depth_from_roots(parents, {roots["cellular_component"]})
depth_mf = shortest_depth_from_roots(parents, {roots["molecular_function"]})

depth_stats = {
    "BP_median_depth": float(pd.Series([d for k,d in depth_bp.items()]).median()) if depth_bp else None,
    "CC_median_depth": float(pd.Series([d for k,d in depth_cc.items()]).median()) if depth_cc else None,
    "MF_median_depth": float(pd.Series([d for k,d in depth_mf.items()]).median()) if depth_mf else None,
}
depth_stats

Approximate depth distributions per subontology indicate hierarchical granularity. Deeper terms are typically rarer, with higher IA, and drive weighted evaluation sensitivity.

In [None]:
del sample_sub
gc.collect()

## EDA Summary and Modeling Implications

- Significant class imbalance with a heavy head and a long tail of rare terms.
- IA prioritizes deeper, rarer GO terms; calibration should consider IA-weighted objectives.
- Wide sequence length variation suggests careful batching and featurization choices.
- Taxonomic skew indicates benefits from taxon-aware priors and stratified validation.
- GO DAG depth and co-occurrence patterns motivate ancestor propagation and joint thresholding.

In [None]:
aspect_map = {"F":"MF", "P":"BP", "C":"CC"}
terms_df["_aspect_name"] = terms_df["aspect"].map(aspect_map)

freq_by_aspect = (
    terms_df.groupby("_aspect_name")["term"]
    .value_counts()
    .rename("freq")
    .reset_index()
)

g = sns.FacetGrid(freq_by_aspect, col="_aspect_name", sharex=False, sharey=False, height=3.2)
g.map_dataframe(sns.histplot, x="freq", bins=80)
g.set_axis_labels("Term frequency", "Count")
g.set_titles("{col_name}")
plt.tight_layout()
plt.show()

Aspect-wise term frequency histograms illustrate class imbalance per subontology. Each subontology shows a heavy head with a long tail of rare terms, implying that naive global priors will be insufficient without balancing or calibration.

In [None]:
freq_df = terms_df["term"].value_counts().rename("freq").reset_index().rename(columns={"index":"term"})
ia_join = freq_df.merge(ia_df, on="term", how="left").dropna()

corr_pearson = ia_join[["freq","ia"]].corr(method="pearson").iloc[0,1]
corr_spearman = ia_join[["freq","ia"]].corr(method="spearman").iloc[0,1]
{"pearson_freq_IA": float(corr_pearson), "spearman_freq_IA": float(corr_spearman)}

IA generally increases as terms get rarer (negative correlation with frequency). This reinforces the need to optimize for IA-weighted objectives or apply thresholds that avoid shallow-term bias.

In [None]:
plt.figure(figsize=(6,4))
sns.scatterplot(data=ia_join.sample(min(50000, len(ia_join)), random_state=42), x="freq", y="ia", alpha=0.25, s=10)
plt.xscale("log")
plt.title("IA vs Term Frequency")
plt.xlabel("Term frequency (log scale)")
plt.ylabel("IA")
plt.tight_layout()
plt.show()

The IA–frequency scatter (log-scaled x-axis) highlights that many high-IA terms are very rare. Models should be attentive to recall on these rare but important terms.

In [None]:
ns_map = {
    "biological_process": "BP",
    "cellular_component": "CC",
    "molecular_function": "MF",
}

def term_depth(go_id):
    if go_id in depth_bp:
        return depth_bp.get(go_id, np.nan)
    if go_id in depth_cc:
        return depth_cc.get(go_id, np.nan)
    if go_id in depth_mf:
        return depth_mf.get(go_id, np.nan)
    return np.nan

unique_terms = terms_df["term"].drop_duplicates()
depth_df = pd.DataFrame({"term": unique_terms})
depth_df["depth"] = depth_df["term"].map(term_depth)

aspect_ns_df = pd.DataFrame({"term": list(aspects_ns.keys()), "ns": [aspects_ns[k] for k in aspects_ns.keys()]})
aspect_ns_df["aspect"] = aspect_ns_df["ns"].map(ns_map)

depth_ns = depth_df.merge(aspect_ns_df[["term","aspect"]], on="term", how="left")
depth_ns = depth_ns.dropna(subset=["depth","aspect"])
depth_ns.head(3)

We derive an approximate depth for labeled terms by shortest distance from each subontology root. This provides a coarse measure of hierarchical specificity per term.
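For reference, shortest distance from a root can also be computed with a plain breadth-first search over child edges. The sketch below runs on a toy DAG with placeholder node names; `bfs_depth` is an illustrative helper that takes the same child-to-parents mapping shape as `parents`.

```python
from collections import defaultdict, deque

def bfs_depth(parents: dict, root: str) -> dict:
    """Shortest number of edges from root to each reachable node."""
    children = defaultdict(list)
    for child, pars in parents.items():
        for p in pars:
            children[p].append(child)
    depth = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in children[u]:
            # The first visit in BFS order is along a shortest path.
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    return depth

# Toy DAG: E has a long path via D and a direct edge to the root A.
toy_parents = {"B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["D", "A"]}
toy_depth = bfs_depth(toy_parents, "A")
```

Node `E` gets depth 1 rather than 3, which shows why shortest-path depth is only a coarse proxy for specificity in a DAG with shortcut edges.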

In [None]:
plt.figure(figsize=(7,3.5))
sns.kdeplot(data=depth_ns, x="depth", hue="aspect", common_norm=False, fill=True)
plt.title("Approximate Term Depth by Subontology")
plt.xlabel("Depth from subontology root (approx.)")
plt.ylabel("Density")
plt.tight_layout()
plt.show()

Depth distributions differ across MF/BP/CC. Deeper terms are more specific and tend to carry higher IA, which shifts evaluation sensitivity toward accurate predictions in the tail.

In [None]:
ia_depth = ia_join.merge(depth_ns[["term","depth"]], on="term", how="left").dropna()
sample = ia_depth.sample(min(100000, len(ia_depth)), random_state=42)

plt.figure(figsize=(6.2,4.2))
plt.hexbin(sample["depth"], sample["ia"], gridsize=50, cmap="viridis")
cb = plt.colorbar()
cb.set_label("Counts")
plt.xlabel("Depth")
plt.ylabel("IA")
plt.title("IA vs Depth (Hexbin)")
plt.tight_layout()
plt.show()

IA tends to grow with depth, so deeper ontology nodes usually carry more weight in the metric. This motivates ancestor propagation and careful score calibration, because predictions confined to shallow terms recover little of the IA-weighted recall.
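Ancestor propagation is commonly implemented by giving each ancestor at least the maximum score of its descendants, which keeps predictions consistent with the hierarchy. This is a minimal sketch of that max rule on a toy chain; `propagate_to_ancestors` is an illustrative helper and the max rule is one choice among several.

```python
def propagate_to_ancestors(scores: dict, parents: dict) -> dict:
    """Raise every ancestor's score to at least the max score of its descendants."""
    out = dict(scores)
    stack = list(scores.items())
    while stack:
        term, s = stack.pop()
        for p in parents.get(term, []):
            # Only push upward when the parent's score actually increases.
            if out.get(p, 0.0) < s:
                out[p] = s
                stack.append((p, s))
    return out

# Toy chain leaf -> mid -> root (placeholder names).
toy_parents = {"leaf": ["mid"], "mid": ["root"]}
toy_scores = {"leaf": 0.8, "root": 0.3}
propagated = propagate_to_ancestors(toy_scores, toy_parents)
```

With the notebook's `parents` mapping, the same function would apply per protein; using `.get` avoids inserting new keys into a `defaultdict`.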

In [None]:
entry_aspects = terms_df.groupby("EntryID")["_aspect_name"].unique().reset_index()
entry_aspects = entry_aspects.explode("_aspect_name").rename(columns={"_aspect_name":"aspect"})
len_by_aspect = train_len_df.merge(entry_aspects, on="EntryID", how="left").dropna(subset=["aspect"])

plt.figure(figsize=(6.5,4))
sns.boxplot(data=len_by_aspect, x="aspect", y="seq_len", showfliers=False)
plt.title("Sequence Length by Subontology Presence")
plt.xlabel("Aspect")
plt.ylabel("Sequence length (aa)")
plt.tight_layout()
plt.show()

Sequence lengths vary across proteins associated with different subontologies. While not causal, this can affect batching and memory usage for sequence encoders.

In [None]:
test_taxa = pd.read_csv(paths["selected_taxa"], sep="\t", header=None, names=["taxon_id"])
train_tax_counts = tax_df["taxon_id"].value_counts()
test_tax_counts = test_taxa["taxon_id"].value_counts()

overlap_taxa = set(train_tax_counts.index).intersection(set(test_tax_counts.index))
overlap_stats = {
    "train_taxa_unique": int(train_tax_counts.size),
    "test_taxa_unique": int(test_tax_counts.size),
    "overlap_taxa": int(len(overlap_taxa)),
    "overlap_rate": round(len(overlap_taxa) / max(1, test_tax_counts.size), 4),
}
overlap_stats

Taxonomic overlap between train and test-superset indicates potential transferability of taxon-aware priors and nearest-neighbor baselines. High overlap supports leveraging taxonomy-conditioned statistics.

In [None]:
top_train_tax = train_tax_counts.head(20)
top_test_tax = test_tax_counts.head(20)

fig, axes = plt.subplots(1, 2, figsize=(12,4), sharey=True)
sns.barplot(x=top_train_tax.values, y=top_train_tax.index, ax=axes[0], orient="h")
axes[0].set_title("Top 20 Train Taxa")
axes[0].set_xlabel("Count")
axes[0].set_ylabel("Taxon ID")

sns.barplot(x=top_test_tax.values, y=top_test_tax.index, ax=axes[1], orient="h")
axes[1].set_title("Top 20 Test-superset Taxa")
axes[1].set_xlabel("Count")
axes[1].set_ylabel("Taxon ID")

plt.tight_layout()
plt.show()

Comparing top taxa surfaces distribution shifts. If the test-superset emphasizes different taxa, priors and similarity searches should account for this to maintain generalization.
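One way to keep validation representative of the taxonomic mix is to hold out a fixed fraction of proteins within every taxon. The sketch below uses pandas `groupby(...).sample`; the helper name and the toy frame are illustrative, and whether stratifying or holding out whole taxa is more appropriate depends on the shift one wants to simulate.

```python
import pandas as pd

def taxon_stratified_holdout(tax: pd.DataFrame, frac: float = 0.2, seed: int = 0):
    """Hold out roughly `frac` of the proteins within every taxon."""
    valid = tax.groupby("taxon_id").sample(frac=frac, random_state=seed)
    train = tax.drop(valid.index)
    return train, valid

# Toy frame: 10 proteins in one taxon, 5 in another.
toy_tax = pd.DataFrame({
    "EntryID": [f"P{i}" for i in range(15)],
    "taxon_id": [9606] * 10 + [10090] * 5,
})
train_part, valid_part = taxon_stratified_holdout(toy_tax, frac=0.2)
```

Taxa with very few proteins may contribute no validation rows at small fractions, so rare taxa might need to be pooled or handled separately.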

In [None]:
terms_per_protein_aspect = (
    terms_df.groupby(["EntryID","_aspect_name"])["term"].nunique().rename("n_terms").reset_index()
)

plt.figure(figsize=(6.5,4))
sns.violinplot(data=terms_per_protein_aspect, x="_aspect_name", y="n_terms", inner="quartile", cut=0)
plt.title("Number of Terms per Protein by Aspect")
plt.xlabel("Aspect")
plt.ylabel("Distinct GO terms")
plt.tight_layout()
plt.show()

Proteins often carry multiple terms per subontology. This density guides expectations for the number of predictions per protein and informs per-aspect thresholding.
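Per-aspect thresholding can be combined with a per-protein cap on the number of predicted terms. The following sketch assumes a prediction frame with columns `EntryID`, `term`, `aspect`, and `score`; the threshold values and the helper name `select_predictions` are hypothetical.

```python
import pandas as pd

def select_predictions(preds: pd.DataFrame, thresholds: dict, max_terms: int = 1500) -> pd.DataFrame:
    """Keep rows at or above their aspect's threshold, then the top max_terms per protein."""
    thr = preds["aspect"].map(thresholds)
    # Rows whose aspect has no threshold compare against NaN and are dropped.
    kept = preds[preds["score"] >= thr]
    kept = kept.sort_values(["EntryID", "score"], ascending=[True, False])
    return kept.groupby("EntryID").head(max_terms).reset_index(drop=True)

# Toy predictions with placeholder term IDs.
toy_preds = pd.DataFrame({
    "EntryID": ["P1", "P1", "P1", "P1"],
    "term": ["GO:1", "GO:2", "GO:3", "GO:4"],
    "aspect": ["MF", "BP", "BP", "CC"],
    "score": [0.9, 0.2, 0.6, 0.4],
})
toy_thresholds = {"MF": 0.5, "BP": 0.3, "CC": 0.5}
selected = select_predictions(toy_preds, toy_thresholds, max_terms=2)
```

In practice the thresholds would be tuned on a validation split, for example using the per-aspect label densities shown in the violin plot as a sanity check.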

In [None]:
missing_ia = set(terms_df["term"].unique()) - set(ia_df["term"].unique())
len(missing_ia)

A small number of labeled terms may not have IA values due to ontology version or filtering. During submission assembly, ensure that parent propagation and scoring handle such cases gracefully.
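One simple way to handle missing IA values is an explicit lookup with a default. The sketch below assigns weight 0.0 to terms absent from the IA table, which is an assumption made for illustration rather than a documented rule of the evaluator; the term names and values are placeholders.

```python
import pandas as pd

# Toy IA table and a label list that contains one term missing from it.
ia_toy = pd.DataFrame({"term": ["GO:A", "GO:B"], "ia": [0.0, 2.5]})
labeled_terms = ["GO:A", "GO:B", "GO:X"]

ia_lookup = ia_toy.set_index("term")["ia"]
# reindex introduces NaN for unknown terms; fill with the assumed default weight.
ia_for_labels = ia_lookup.reindex(labeled_terms).fillna(0.0)
missing_terms = [t for t in labeled_terms if t not in ia_lookup.index]
```

Keeping the list of missing terms around makes it easy to report how many labels were affected by the default.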