## Credits
This analysis builds on the work of @leonidkulyk, who published the acclaimed notebook "[EDA] 🧬CAFA5-PFP ~ 🕸️Interactive DAGs | 📊Plotly" for the CAFA 5 competition. I am grateful for his insightful contribution.

In this notebook, I have adapted and utilized his original code to analyze this year's CAFA competition data.

<!-- Dark Mode Header -->
<div style="background-color:#1e1e1e; padding: 30px; border-radius: 12px; text-align:center;">
    <h1 style="font-family: 'Consolas', monospace; font-size: 32px; font-weight: bold; color:#f5f5f5; margin-bottom:10px;">
        🧬 CAFA 6 Protein Function Prediction
    </h1>
    <p style="color:#949494; font-family: 'Consolas', monospace; font-size: 20px; text-align:center; margin-top:0;">
        Predict the biological function of a protein
    </p>
</div>
<hr style="border:1px solid #333; margin:20px 0;">



<!-- Dark Mode Overview Section -->
<div style="background-color:#1e1e1e; padding: 30px; border-radius: 12px; color:#f5f5f5; font-family: 'Consolas', monospace;">

  <h2 style="font-size:32px; font-weight:bold; text-align:center; margin-bottom:20px;">
    (ಠಿ_ಠ) Overview
  </h2>

  <p style="font-size:16px; margin-bottom:12px;">
    🔴 <b>Goal</b>: Predict the function of proteins based on their amino acid sequences and other data.
  </p>

  <p style="font-size:16px; margin-bottom:12px;">
    ⚪ <b>Importance</b>: Accurate protein function assignment is <span style="color:#ffa500;">crucial for understanding molecular biology</span>, discovering cellular mechanisms, and developing new therapies.
  </p>

  <p style="font-size:16px; margin-bottom:12px;">
    ⚪ <b>Context</b>: Proteins are composed of <span style="color:#00bfff;">20 types of amino acids</span>. Humans have tens of thousands of proteins, each a chain of amino acids forming unique structures.
  </p>

  <p style="font-size:16px; margin-bottom:12px;">
    ⚪ <b>Challenges</b>: Proteins often have multiple functions and interact with many partners. Predicting functions is complex due to ambiguity, data integration, and structural diversity.
  </p>

  <p style="font-size:16px; margin-bottom:12px;">
    ⚪ <b>Host</b>: Function-COSI organizes the competition, bringing together computational biologists, experimental biologists, and biocurators.
  </p>

  <p style="font-size:16px; margin-bottom:12px;">
    ⚪ <b>Co-organizers</b>: Iowa State University, Northeastern University, University of Padova, and UniProt.
  </p>

  <p style="font-size:16px; margin-bottom:0;">
    ⚪ <b>Acknowledgments</b>: Supported by the organizers above and the International Society for Computational Biology.
  </p>

</div>
<hr style="border:1px solid #333; margin:25px 0;">


<!-- Dark Mode Table of Contents -->
<a id="top"></a>

<div style="background-color:#1e1e1e; padding:25px 20px; border-radius:15px; text-align:center; font-family:'Consolas', monospace; color:#3c79f5; font-size:28px; font-weight:bold; box-shadow: 0 0 10px rgba(60,121,245,0.6); margin-bottom:20px;">
    Table of Contents
</div>

<div style="background-color:#2a2a2a; padding:25px 30px; border-radius:12px; font-family:'Consolas', monospace; font-size:16px; color:#f5f5f5; line-height:1.8;">
    <ul style="list-style-type:none; padding-left:0;">
        <li>🔹 <a href="#iid" style="color:#61dafb; text-decoration:none;">Install & Import & Define</a></li>
        <li>🔹 <a href="#1" style="color:#61dafb; text-decoration:none;">1. Data overview</a></li>
        <li>🔹 <a href="#2" style="color:#61dafb; text-decoration:none;">2. Training Set</a>
            <ul style="list-style-type:none; padding-left:20px; margin-top:5px;">
                <li>⚪ <a href="#2.1" style="color:#61dafb; text-decoration:none;">2.1 Gene Ontology</a></li>
                <li>⚪ <a href="#2.2" style="color:#61dafb; text-decoration:none;">2.2 Training sequences</a></li>
                <li>⚪ <a href="#2.3" style="color:#61dafb; text-decoration:none;">2.3 Labels</a></li>
                <li>⚪ <a href="#2.4" style="color:#61dafb; text-decoration:none;">2.4 Taxonomy</a></li>
                <li>⚪ <a href="#2.5" style="color:#61dafb; text-decoration:none;">2.5 Information accretion</a></li>
            </ul>
        </li>
    </ul>
</div>



<div style="
    background-color:#1e1e1e; 
    color:#00ffff; 
    padding:15px 25px; 
    border-radius:12px; 
    font-family:'Consolas', monospace; 
    font-size:18px; 
    font-weight:bold; 
    text-align:center;
    box-shadow: 0 0 10px rgba(0,255,255,0.3);
">
🛠 Install & Import & Define
</div>



In [None]:
!pip install obonet -q
!pip install pyvis -q

In [None]:
!pip install biopython -q

In [None]:
import os
import json
from PIL import Image
from typing import Dict
from collections import Counter

import random
import cv2
import obonet
import networkx
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import matplotlib.patches as mpatch
from Bio import SeqIO
from pyvis.network import Network
import plotly.io as pio
pio.renderers.default = 'iframe_connected'

<div style="
    background-color:#1e1e1e; 
    color:#00ffff; 
    padding:15px 25px; 
    border-radius:12px; 
    font-family:'Consolas', monospace; 
    font-size:18px; 
    font-weight:bold; 
    text-align:center;
    box-shadow: 0 0 10px rgba(0,255,255,0.3);
">
🛠 Define config
</div>


In [None]:
class CFG:
    train_go_obo_path: str = "/kaggle/input/cafa-6-protein-function-prediction/Train/go-basic.obo"
    train_seq_fasta_path: str = "/kaggle/input/cafa-6-protein-function-prediction/Train/train_sequences.fasta"
    train_terms_path: str = "/kaggle/input/cafa-6-protein-function-prediction/Train/train_terms.tsv"
    train_taxonomy_path: str = "/kaggle/input/cafa-6-protein-function-prediction/Train/train_taxonomy.tsv"
    train_ia_path: str = "/kaggle/input/cafa-6-protein-function-prediction/IA.tsv"

<div style="
    background-color:#1e1e1e; 
    color:#00ffff; 
    padding:15px 25px; 
    border-radius:12px; 
    font-family:'Consolas', monospace; 
    font-size:18px; 
    font-weight:bold; 
    text-align:center;
    box-shadow: 0 0 10px rgba(0,255,255,0.3);
">
🛠 Define utility methods
</div>


In [None]:
def plot_dag(graph, term, radius=1):
    # Create a smaller subgraph: radius includes all neighbors at distance <= radius
    # from the term (increase it to add further parent branches).
    ng_graph = networkx.ego_graph(graph, term, radius=radius)

    for n in ng_graph.nodes(data=True):
        # concatenate label of the node with its attribute
        n[1]["label"] = n[0] + " " +n[1]["name"]

    nt = Network(directed=True, notebook=True, cdn_resources="in_line")
    nt.from_nx(ng_graph)
    return nt.show("network.html")

<a id="1"></a>
<div style="
    background-color:#1e1e1e;
    color:#61dafb;
    padding:20px;
    font-size:32px;
    font-family:'Consolas', monospace;
    text-align:center;
    border-radius:15px;
    box-shadow: 0 0 10px rgba(97,218,251,0.5) inset;
">
<b>1. Data overview</b>
</div>


<div style="background-color:#1e1e1e; color:#e0e0e0; padding:25px; border-radius:12px; font-family:'Consolas', monospace; font-size:16px; line-height:1.6;">

🔴 **The [Gene Ontology (GO)](http://geneontology.org/docs/ontology-documentation/)** is a concept hierarchy that describes the biological function of genes and gene products at different levels of abstraction (Ashburner et al., 2000). It is a good model to describe the multi-faceted nature of protein function.  

⚪ GO is a `directed acyclic graph`. The nodes in this graph are functional descriptors (terms or classes) connected by relational ties (is_a, part_of, etc.). For example, terms *protein binding activity* and *binding activity* are related by an is_a relationship; however, the edge is often reversed to point from binding towards protein binding. This graph contains **three subgraphs (subontologies)**: **Molecular Function (MF), Biological Process (BP), and Cellular Component (CC)**. Each represents a different aspect of a protein's function: what it does on a molecular level (MF), which biological processes it participates in (BP), and where in the cell it is located (CC).  

⚪ The protein's function is therefore represented by a subset of one or more subontologies. Annotations are supported by evidence codes, divided into experimental (from research papers) and non-experimental (inferred computationally). Read more about [GO evidence codes](http://geneontology.org/docs/guide-go-evidence-codes/).  

🔴 In this competition, **experimentally determined term-protein assignments** are used as class labels. That is, if a protein is labeled with a term, this function has been validated by experimental evidence. By processing these annotated terms, we can generate a dataset of proteins and their ground truth labels. The absence of a term does not mean the protein lacks the function, only that it has not been annotated yet. Proteins may have annotations from multiple subontologies.  

</div>
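<p style="font-family: consolas; font-size: 16px;">⚪ To make the hierarchy concrete, here is a minimal sketch on a toy graph. The term IDs (<code>T:...</code>) and the helper <code>ancestors_of</code> are hypothetical and only for illustration. The sketch assumes edges point from child to parent, which is how obonet stores GO relationships.</p>

```python
import networkx as nx

# Toy ontology with hypothetical term IDs; edges point from child to parent.
toy_go = nx.DiGraph()
toy_go.add_edge("T:protein_binding", "T:binding", relation="is_a")
toy_go.add_edge("T:binding", "T:molecular_function", relation="is_a")
toy_go.add_edge("T:kinase_binding", "T:protein_binding", relation="is_a")


def ancestors_of(graph, terms):
    """Return the given terms plus every more general term reachable from them."""
    full = set(terms)
    for t in terms:
        # nx.descendants follows outgoing edges, i.e. child -> parent here.
        full |= nx.descendants(graph, t)
    return full


labels = ancestors_of(toy_go, {"T:kinase_binding"})
print(sorted(labels))
```

<p style="font-family: consolas; font-size: 16px;">⚪ A specific term such as the toy "kinase binding" implies all of its more general parents, which is why one annotation can correspond to several nodes in the graph.</p>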


<div style="background-color:#1e1e1e; color:#f0f0f0; padding:25px; border-radius:12px; 
            font-family:'Consolas', monospace; font-size:28px; text-align:center; 
            box-shadow: rgba(0, 0, 0, 0.2) 0px 2px 6px inset;">
    <b>2. Training Set</b>
</div>


<p style="font-family: 'Consolas', monospace; font-size:16px; color:#e0e0e0; line-height:1.6;">
⚪ For the <i>training set</i>, we include all proteins with annotated terms that have been validated by experimental or high-throughput evidence, traceable author statement (<code>TAS</code>), or inferred by curator (<code>IC</code>). We use annotations from the UniProtKB release of 2022-11-17. Participants are not required to use these data and are also welcome to use any other data available to them.
</p>



<p style="font-family: 'Consolas', monospace; font-size:16px; color:#e0e0e0; line-height:1.6;">
❔ Let's consider each <i>training file</i> in turn.
</p>


<div style="background-color:#1e1e1e; color:#f0f0f0; padding:25px; border-radius:12px; 
            font-family:'Consolas', monospace; font-size:28px; text-align:center; 
            box-shadow: rgba(0, 0, 0, 0.2) 0px 2px 6px inset;">
    <b>2.1 Gene Ontology </b>
</div>


In [None]:
%%time
graph = obonet.read_obo(CFG.train_go_obo_path)

<p style="font-family: consolas; font-size: 16px;">❔ Number of nodes.</p>

In [None]:
print(f"Number of nodes: {len(graph)}")

<p style="font-family: consolas; font-size: 16px;">❔ Number of edges.</p>

In [None]:
print(f"Number of edges: {graph.number_of_edges()}")

<p style="font-family: consolas; font-size: 16px;">🔴 To display a graph, we need to focus on a specific term. Let's take <code>GO:0034655</code> as an example.</p>

In [None]:
term = "GO:0034655"

<p style="font-family: consolas; font-size: 16px;">❔ Look up the node properties of <code>nucleobase-containing compound catabolic process</code> (term GO:0034655).</p>

In [None]:
graph.nodes[term]

<p style="font-family: consolas; font-size: 16px;">❔ Let's plot the DAG for the term GO:0034655.</p>


In [None]:
def plot_dag(graph, term, radius=1):
    """
    Plots a DAG subgraph around the specified term using pyvis.
    Saves a separate HTML file for each term and radius.
    """
    # Create smaller subgraph
    ng_graph = networkx.ego_graph(graph, term, radius=radius, undirected=False)

    # Prepare node labels
    for n in ng_graph.nodes(data=True):
        n[1]["label"] = n[0] + " " + n[1].get("name", "")

    # Initialize pyvis network
    nt = Network(
        height="800px",
        width="100%",
        directed=True,
        notebook=True,
        bgcolor="#1e1e1e",
        font_color="white",
        cdn_resources="in_line"
    )
    nt.from_nx(ng_graph)

    # Optional styling
    nt.set_options("""
    var options = {
      "nodes": {"color":{"background":"#1f78b4","border":"#ffffff"},"font":{"color":"white","size":14,"face":"Consolas"}},
      "edges": {"color":"white","arrows":{"to":{"enabled":true}}},
      "physics": {"enabled":true,"stabilization":{"iterations":200}}
    }
    """)

    # Dynamic filename
    filename = f"network_{term}_radius{radius}.html"
    return nt.show(filename)


In [None]:
plot_dag(graph, term, radius=1)

<p style="font-family: consolas; font-size: 16px;">❔ Now let's see what the full DAG looks like for the selected term. To do that, just increase the <code>radius</code> parameter, which includes all neighbors at distance ≤ radius from the selected node.</p>

In [None]:
plot_dag(graph, term, radius=50000)

<p style="font-family: consolas; font-size: 16px;">⚪ With <code>radius=1</code>, the first graph contains only the selected term and the nodes directly linked to it (for example, through <code>is_a</code>). With a very large radius, the second graph contains every node reachable from the term, so you can see how it connects to the rest of the ontology.</p>
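<p style="font-family: consolas; font-size: 16px;">⚪ The effect of <code>radius</code> can be checked on a tiny toy graph. This is only a sketch: the node names are made up, and it assumes child-to-parent edges like the GO graph loaded above.</p>

```python
import networkx as nx

# Toy chain with made-up node names; edges point from child to parent.
chain = nx.DiGraph()
nx.add_path(chain, ["leaf", "parent", "grandparent", "root"])
chain.add_edge("sibling", "parent")  # another child of "parent"

# radius=1 keeps the center node and whatever is one outgoing edge away.
near = nx.ego_graph(chain, "leaf", radius=1)
# A large radius keeps everything reachable by following outgoing edges.
far = nx.ego_graph(chain, "leaf", radius=10)

print(sorted(near.nodes))
print(sorted(far.nodes))
```

<p style="font-family: consolas; font-size: 16px;">⚪ Note that "sibling" never appears: with a directed graph, <code>ego_graph</code> only follows outgoing edges from the center node.</p>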

<p style="font-family: consolas; font-size: 16px;">🔴 Let's look at another term, taking <code>GO:0004842</code> as an example.</p>

In [None]:
term = "GO:0004842"

<p style="font-family: consolas; font-size: 16px;">❔ Look up the node properties of the selected term (GO:0004842).</p>

In [None]:
graph.nodes[term]

In [None]:
plot_dag(graph, term, radius=3)

<p style="font-family: consolas; font-size: 16px;">❔ Now let's see what the full DAG looks like for the selected term. To do that, just increase the <code>radius</code> parameter, which includes all neighbors at distance ≤ radius from the selected node.</p>

In [None]:
plot_dag(graph, term, radius=1000)

<div style="background-color:#1e1e1e; color:#f0f0f0; padding:25px; border-radius:12px; 
            font-family:'Consolas', monospace; font-size:28px; text-align:center; 
            box-shadow: rgba(0, 0, 0, 0.2) 0px 2px 6px inset;">
    <b>2.2 Training Sequences </b>
</div>


<div style="background-color:#1e1e1e; 
            color:#c5c5c5; 
            font-family:'Consolas', monospace; 
            font-size:16px; 
            line-height:1.6; 
            padding:20px; 
            border-radius:12px;
            box-shadow: rgba(0,0,0,0.5) 0px 2px 6px inset;">

<p>⚪ <code>Training sequences</code>: <b>train_sequences.fasta</b> contains the protein sequences for the training dataset.</p>

<p>⚪ These files are in <a href="https://en.wikipedia.org/wiki/FASTA_format" style="color:#61dafb;"><strong>FASTA format</strong></a>, a standard format for describing protein sequences. The proteins were all retrieved from the <a href="https://www.uniprot.org/" style="color:#61dafb;"><strong>UniProt dataset</strong></a> curated at the European Bioinformatics Institute.</p>

<p>⚪ The header contains the protein's UniProt accession ID and additional information about the protein. Most protein sequences were extracted from the Swiss-Prot database, but a subset of proteins not represented in Swiss-Prot were extracted from the TrEMBL database. In both cases, sequences come from the 2022_05 release (14-Dec-2022). More info <a href="https://www.uniprot.org/help/uniprotkb_sections" style="color:#61dafb;"><strong>here</strong></a>.</p>

<p>⚪ The <code>train_sequences.fasta</code> file indicates the database source. For example, <code>sp|P9WHI7|RECN_MYCT</code> in the FASTA header indicates a Swiss-Prot protein (<code>sp</code>) with UniProt ID <code>P9WHI7</code> and entry name <code>RECN_MYCT</code>. TrEMBL sequences use <code>tr</code> instead. Both Swiss-Prot and TrEMBL are part of UniProtKB.</p>

<p>⚪ This file contains only sequences for proteins with annotations (labeled proteins). To get the full set of protein sequences, including unlabeled proteins, download Swiss-Prot and TrEMBL <a href="https://www.uniprot.org/help/downloads" style="color:#61dafb;"><strong>here</strong></a>.</p>

</div>
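<p style="font-family: consolas; font-size: 16px;">⚪ The header layout described above can be split with plain string operations. This is a minimal sketch: <code>parse_uniprot_header</code> is a hypothetical helper, and it assumes every header starts with a <code>db|accession|name</code> token, which may not hold for every record in the competition file.</p>

```python
def parse_uniprot_header(header: str) -> dict:
    """Split a 'db|accession|name ...' FASTA header into its three parts."""
    # Drop the leading '>' if present and keep only the first whitespace-separated token.
    first_token = header.lstrip(">").split()[0]
    db, accession, name = first_token.split("|")
    return {"db": db, "accession": accession, "name": name}


# The description after the first token is invented for this example.
parsed = parse_uniprot_header(">sp|P9WHI7|RECN_MYCT example description")
print(parsed)
```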


<p style="font-family: consolas; font-size: 16px;">🔴 To read and analyze the protein sequences from the <b>train_sequences.fasta</b> file, we can use the <code>Biopython</code> package.</p>

In [None]:
print("Sequence example:\n\n", next(iter(SeqIO.parse(CFG.train_seq_fasta_path, "fasta"))))

<p style="font-family: consolas; font-size: 16px;">❔ Let's count the number of sequences.</p>

In [None]:
sequences = SeqIO.parse(CFG.train_seq_fasta_path, "fasta")
num_sequences = sum(1 for seq in sequences)

print("Number of sequences:", num_sequences)

<p style="font-family: consolas; font-size: 16px;">❔ Let's plot the length distribution of the protein sequences.</p>

In [None]:
from Bio import SeqIO
import plotly.express as px

# Load sequences
sequences = SeqIO.parse(CFG.train_seq_fasta_path, "fasta")

# Get the length of each sequence
lengths = [len(seq) for seq in sequences]

# Create histogram
fig = px.histogram(
    x=lengths,
    nbins=1000,
    color_discrete_sequence=['goldenrod']
)

# Dark mode layout
fig.update_layout(
    template='plotly_dark',  # Enables dark mode
    title={
        'text': "Distribution of Protein Sequence Lengths",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {'size': 24, 'color': 'goldenrod'}
    },
    xaxis_title="Sequence Length",
    yaxis_title="Count",
    xaxis=dict(showgrid=True, gridcolor='gray', zeroline=False),
    yaxis=dict(showgrid=True, gridcolor='gray', zeroline=False),
    plot_bgcolor='#1e1e1e',  # Dark background
    paper_bgcolor='#1e1e1e', # Dark surrounding
)

fig.show()


In [None]:
np.percentile(lengths, 99)

<p style="font-family: consolas; font-size: 16px;">⚪ In other words, only <b>1%</b> of the sequences are longer than <b>2375</b> residues.</p>
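<p style="font-family: consolas; font-size: 16px;">⚪ As a quick sanity check of how the 99th percentile relates to the share of longer sequences, here is a sketch on toy lengths (the numbers 1 to 100, not the real data).</p>

```python
import numpy as np

# Toy "sequence lengths": the integers 1..100.
toy_lengths = np.arange(1, 101)

# Value below which 99% of the toy lengths fall (default linear interpolation).
p99 = np.percentile(toy_lengths, 99)

# Fraction of toy lengths strictly above that value.
share_longer = np.mean(toy_lengths > p99)

print(p99, share_longer)
```

<p style="font-family: consolas; font-size: 16px;">⚪ A cutoff like this is one possible way to choose a maximum sequence length when truncating or padding inputs for a model.</p>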

<p style="font-family: consolas; font-size: 16px;">❔ Let's calculate <b>the amino acid composition</b> of the protein sequences. Amino acid composition is the frequency distribution of amino acids in a protein sequence. It can provide valuable information about the protein's structure and function.</p>

In [None]:
from Bio import SeqIO
from collections import Counter
import plotly.express as px

# Load sequences
records = SeqIO.parse(CFG.train_seq_fasta_path, "fasta")

# Count amino acids one record at a time to avoid building one huge list
aa_count = Counter()
for record in records:
    aa_count.update(str(record.seq))

# Plot horizontal bar chart
fig = px.bar(
    x=list(aa_count.values()),
    y=list(aa_count.keys()),
    orientation='h',
    color_discrete_sequence=['darkslateblue'],
    height=700
)

# Dark mode layout
fig.update_layout(
    template='plotly_dark',
    title={
        'text': "Amino Acid Composition",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {'size': 24, 'color': 'darkorange'}
    },
    xaxis_title="Frequency",
    yaxis_title="Amino Acid",
    xaxis=dict(showgrid=True, gridcolor='gray', zeroline=False),
    yaxis=dict(showgrid=False),
    plot_bgcolor='#1e1e1e',
    paper_bgcolor='#1e1e1e'
)

fig.show()


<p style="font-family: consolas; font-size: 16px;">⚪ Here are a few observations that can be made from the obtained frequency values:</p>

* <p style="font-family: consolas; font-size: 16px;">The most common amino acids in this dataset are leucine (L), serine (S), alanine (A), glutamic acid (E), and tyrosine (Y). These amino acids are known to be abundant in proteins and play important roles in protein structure and function.</p>
* <p style="font-family: consolas; font-size: 16px;">The least common amino acids in this dataset are cysteine (C), methionine (M), tryptophan (W), and histidine (H). These amino acids are typically less abundant in proteins, but they can be important for specific functions, such as catalysis, metal binding, or protein-protein interactions.</p>
* <p style="font-family: consolas; font-size: 16px;">The absence of ambiguous amino acid codes (X, B, Z) and rare amino acids (O, U) suggests that, unlike in CAFA 5, the sequences are neither incomplete nor erroneous.</p>
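<p style="font-family: consolas; font-size: 16px;">⚪ The same ideas can be sketched for a single sequence: relative frequencies instead of raw counts, plus a check for non-standard residue codes. The helpers <code>aa_composition</code> and <code>nonstandard_residues</code> and the toy sequence are hypothetical, introduced only for illustration.</p>

```python
from collections import Counter

# The 20 standard amino acid one-letter codes.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def aa_composition(sequence: str) -> dict:
    """Return the relative frequency of each residue in the sequence."""
    counts = Counter(sequence)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def nonstandard_residues(sequence: str) -> set:
    """Return residue codes outside the 20 standard amino acids (e.g. X, B, Z, O, U)."""
    return set(sequence) - STANDARD_AA


toy_seq = "MKLLSAAX"  # made-up sequence with one ambiguous residue
comp = aa_composition(toy_seq)
odd = nonstandard_residues(toy_seq)
print(comp, odd)
```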

<div style="background-color:#1e1e1e; color:#f0f0f0; padding:25px; border-radius:12px; 
            font-family:'Consolas', monospace; font-size:28px; text-align:center; 
            box-shadow: rgba(0, 0, 0, 0.2) 0px 2px 6px inset;">
    <b>2.3 Labels </b>
</div>


<p style="font-family: consolas; font-size: 16px;">⚪ <code>Labels</code>: <b>train_terms.tsv</b> contains the list of annotated terms (ground truth) for the proteins in train_sequences.fasta.</p>

* <p style="font-family: consolas; font-size: 16px;">The first column indicates the protein's UniProt accession ID.</p>
* <p style="font-family: consolas; font-size: 16px;">The second is the GO term ID.</p>
* <p style="font-family: consolas; font-size: 16px;">The third indicates in which ontology the term appears (BPO, CCO or MFO). BPO, CCO, and MFO are abbreviations for different categories of gene ontology terms. <code>BPO</code>: Biological Process Ontology, which describes biological processes, functions, and pathways. <code>CCO</code>: Cellular Component Ontology, which describes the components of a cell or its extracellular environment. <code>MFO</code>: Molecular Function Ontology, which describes the biochemical activities or capabilities of proteins and other molecules. These categories are used in gene ontology to classify genes and gene products based on their biological roles and functions. By using these categories, researchers can better understand the functions and interactions of different genes and gene products within a biological system.</p>
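<p style="font-family: consolas; font-size: 16px;">⚪ To illustrate the three-column layout, here is a sketch on a toy table. The column names <code>EntryID</code> and <code>term</code> are assumptions (only <code>aspect</code> is used later in this notebook), and the IDs are made up.</p>

```python
import pandas as pd

# Toy labels table; EntryID/term column names and all IDs are assumptions.
toy_terms = pd.DataFrame(
    {
        "EntryID": ["P1", "P1", "P1", "P2"],
        "term": ["GO:A", "GO:B", "GO:C", "GO:A"],
        "aspect": ["BPO", "MFO", "BPO", "CCO"],
    }
)

# Number of distinct terms annotated to each protein.
terms_per_protein = toy_terms.groupby("EntryID")["term"].nunique()

# Which subontologies each protein has annotations in.
aspects_per_protein = toy_terms.groupby("EntryID")["aspect"].agg(
    lambda s: ", ".join(sorted(set(s)))
)

print(terms_per_protein)
print(aspects_per_protein)
```

<p style="font-family: consolas; font-size: 16px;">⚪ One protein can appear on many rows, one per annotated term, so grouping by protein is usually the first step before building a label matrix.</p>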

<p style="font-family: consolas; font-size: 16px;">⚪ Load the train terms dataframe.</p>

In [None]:
train_terms_df = pd.read_csv(CFG.train_terms_path, sep="\t")

<p style="font-family: consolas; font-size: 16px;">❔ Let's see what the data looks like.</p>

In [None]:
train_terms_df.head()

<p style="font-family: consolas; font-size: 16px;">❔ Display summary statistics for the dataframe columns.</p>

In [None]:
train_terms_df.describe()

<p style="font-family: consolas; font-size: 16px;">❔ Now let's plot the distribution of aspect values as a pie chart.</p>

In [None]:
import plotly.express as px

# Aspect counts
aspect_counts = train_terms_df.aspect.value_counts()

# Pie chart
fig = px.pie(
    values=aspect_counts.values,
    names=aspect_counts.index,
    color_discrete_sequence=px.colors.sequential.Viridis
)

# Dark mode layout
fig.update_traces(
    textposition='inside',
    textfont_size=14,
    textinfo='percent+label',
    marker=dict(line=dict(color='#1e1e1e', width=2))
)
fig.update_layout(
    template='plotly_dark',
    title={
        'text': "Pie Distribution of Aspect Values",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {'size': 22, 'color': 'lightblue'}
    },
    legend_title_text='Aspect:',
    paper_bgcolor='#1e1e1e',
    plot_bgcolor='#1e1e1e'
)

fig.show()


<div style="background-color:#1e1e1e; color:#f0f0f0; padding:25px; border-radius:12px; 
            font-family:'Consolas', monospace; font-size:28px; text-align:center; 
            box-shadow: rgba(0, 0, 0, 0.2) 0px 2px 6px inset;">
    <b>2.4 Taxonomy </b>
</div>

<p style="font-family: consolas; font-size: 16px;">⚪ <code>Taxonomy</code>: <b>train_taxonomy.tsv</b> contains the list of proteins and the species to which they belong, represented by a "taxonomic identifier" (taxon ID) number. The first column is the protein UniProt accession ID and the second is the taxon ID. More information about taxonomies can be found <a href="https://www.uniprot.org/help/taxonomic_identifier"><strong>here</strong></a>.</p>
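<p style="font-family: consolas; font-size: 16px;">⚪ A two-column table like this is typically summarized by counting proteins per taxon. The sketch below uses a toy table; the column names <code>EntryID</code> and <code>taxonomyID</code> are assumptions, and the real file may have different headers or none at all.</p>

```python
import pandas as pd

# Toy taxonomy table; column names and protein IDs are assumptions.
toy_taxonomy = pd.DataFrame(
    {
        "EntryID": ["P1", "P2", "P3"],
        "taxonomyID": [9606, 9606, 10090],
    }
)

# Number of proteins per taxon ID, most frequent first.
proteins_per_taxon = toy_taxonomy["taxonomyID"].value_counts()

# Share of all proteins that belong to the most frequent taxon.
top_share = proteins_per_taxon.iloc[0] / len(toy_taxonomy)

print(proteins_per_taxon)
print(top_share)
```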

In [None]:
train_taxonomy_df = pd.read_csv(CFG.train_taxonomy_path, sep="\t")

<p style="font-family: consolas; font-size: 16px;">❔ Let's see what the data looks like.</p>

In [None]:
train_taxonomy_df.head()

<p style="font-family: consolas; font-size: 16px;">❔ Display dataframe length.</p>

In [None]:
len(train_taxonomy_df)

<div style="background-color:#1e1e1e; color:#f0f0f0; padding:25px; border-radius:12px; 
            font-family:'Consolas', monospace; font-size:28px; text-align:center; 
            box-shadow: rgba(0, 0, 0, 0.2) 0px 2px 6px inset;">
    <b>2.5 Information Accretion </b>
</div>


<p style="font-family: consolas; font-size: 16px;">⚪ <code>Information accretion</code>: <b>IA.tsv</b> contains the information accretion (weights) for each GO term. These weights are used to compute weighted precision and recall, as described in the Evaluation section of the competition.</p>

In [None]:
limit = 10

# Each line holds a GO term ID and its weight, separated by a tab
with open("/kaggle/input/cafa-6-protein-function-prediction/IA.tsv") as f:
    ia_weights = [line.rstrip("\n").split("\t") for line in f]

ia_weights[:limit]
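
<p style="font-family: consolas; font-size: 16px;">⚪ To show how such weights could enter precision and recall, here is a simplified sketch for a single protein. The term IDs, weights, and the helper <code>weighted_precision_recall</code> are made up for illustration. The official evaluation is more involved, so treat this as an approximation of the idea rather than the competition metric.</p>

```python
# Toy rows in the same [term, weight] shape as the list loaded from the IA file.
toy_rows = [["GO:A", "0.0"], ["GO:B", "1.5"], ["GO:C", "2.5"], ["GO:D", "4.0"]]
toy_ia = {term: float(weight) for term, weight in toy_rows}


def weighted_precision_recall(predicted: set, true: set, ia: dict) -> tuple:
    """IA-weighted precision and recall for one protein (simplified sketch)."""
    hit_weight = sum(ia[t] for t in predicted & true)
    pred_weight = sum(ia[t] for t in predicted)
    true_weight = sum(ia[t] for t in true)
    precision = hit_weight / pred_weight if pred_weight else 0.0
    recall = hit_weight / true_weight if true_weight else 0.0
    return precision, recall


precision, recall = weighted_precision_recall(
    predicted={"GO:A", "GO:B", "GO:D"},
    true={"GO:A", "GO:B", "GO:C"},
    ia=toy_ia,
)
print(precision, recall)
```

<p style="font-family: consolas; font-size: 16px;">⚪ In this toy case the term with weight 0 contributes nothing either way, while missing or wrongly predicting a high-weight term costs the most.</p>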