- This notebook was forked from https://www.kaggle.com/code/pauljef/data-analysis-cafa-6-protein-function-prediction.
- The officially provided data files are loaded into pandas DataFrames and then merged.

I hope this helps improve the understanding and analysis of the data. Please let me know if you have any feedback. Thank you!

In [None]:
!pip install biopython
from Bio import SeqIO
import pandas as pd
import re

!pip install goatools
from goatools.obo_parser import GODag

# Load each dataset

## `train_sequences.fasta`

In [None]:
fasta_path = "/kaggle/input/cafa-6-protein-function-prediction/Train/train_sequences.fasta"
seq_records = list(SeqIO.parse(fasta_path, "fasta"))
seq_records[0]

In [None]:
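def extract_meta(description, key):
    """Return the value of a header field such as OS, OX, GN, PE or SV, or None if absent."""
    if key == "OS":
        # Organism names contain spaces, so capture everything up to the next "XX=" tag.
        match = re.search(r"OS=(.+?)(?=\s[A-Z]{2}=|$)", description)
    else:
        # All other fields are single tokens without spaces.
        match = re.search(rf"{key}=(\S+)", description)
    return match.group(1).strip() if match else None
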
sequences_df = pd.DataFrame({
    "protein_id": [rec.id for rec in seq_records],
    "sequence": [str(rec.seq) for rec in seq_records],
    "sequence_length": [len(rec.seq) for rec in seq_records],
    "OS": [extract_meta(rec.description, "OS") for rec in seq_records],
    "OX": [extract_meta(rec.description, "OX") for rec in seq_records],
    "GN": [extract_meta(rec.description, "GN") for rec in seq_records],
    "PE": [extract_meta(rec.description, "PE") for rec in seq_records],
    "SV": [extract_meta(rec.description, "SV") for rec in seq_records],
})
sequences_df["protein_id"] = sequences_df["protein_id"].apply(lambda x: x.split("|")[1])
sequences_df
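To make the header parsing concrete, here is a minimal sketch run on a single made-up description line in UniProt style. The header string and the `parse_fields` helper are illustrative assumptions and do not come from the competition files. The helper reads all `XX=` tags in one pass instead of searching once per key.

```python
import re

# Hypothetical header in UniProt style (not taken from the competition data).
header = "sp|P00001|EXAMPLE_HUMAN Example protein OS=Homo sapiens OX=9606 GN=EX1 PE=1 SV=2"

def parse_fields(description):
    """Split a UniProt-style description into a {tag: value} dict."""
    # Each field starts with a two-letter tag followed by '='. The value runs
    # until the next tag or the end of the line, so multi-word values survive.
    pattern = r"\b([A-Z]{2})=(.*?)(?=\s[A-Z]{2}=|$)"
    return {tag: value.strip() for tag, value in re.findall(pattern, description)}

fields = parse_fields(header)
protein_id = header.split("|")[1]  # the accession sits between the first two pipes
print(protein_id, fields)
```

Capturing up to the next tag keeps organism names with more than two words intact, such as names that carry a strain suffix.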

↑ `taxon_id` in `train_taxonomy.tsv` and `OS` and `OX` in `train_sequences.fasta` have a one-to-one correspondence.

## `train_taxonomy.tsv`

In [None]:
tax_path = "/kaggle/input/cafa-6-protein-function-prediction/Train/train_taxonomy.tsv"
tax_df = pd.read_csv(tax_path, sep="\t", names=["protein_id", "taxon_id"])
tax_df

↑ `taxon_id` in `train_taxonomy.tsv` and `OS` and `OX` in `train_sequences.fasta` have a one-to-one correspondence.
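One way to test a one-to-one claim like this is to drop duplicate pairs and confirm that each side is then unique. The sketch below runs on a small toy table. The `is_one_to_one` helper and the toy values are assumptions for illustration; on the real data the same call would be applied to the sequence and taxonomy tables after joining them on `protein_id`.

```python
import pandas as pd

def is_one_to_one(df, left, right):
    """True if every `left` value pairs with exactly one `right` value and vice versa."""
    pairs = df[[left, right]].drop_duplicates()
    return bool(pairs[left].is_unique and pairs[right].is_unique)

# Toy stand-in for the joined sequence/taxonomy table (values are illustrative).
toy = pd.DataFrame({
    "taxon_id": [9606, 9606, 10090, 559292],
    "OX": ["9606", "9606", "10090", "559292"],
    "OS": ["Homo sapiens", "Homo sapiens", "Mus musculus", "Saccharomyces cerevisiae"],
})

print(is_one_to_one(toy, "taxon_id", "OX"))
print(is_one_to_one(toy, "taxon_id", "OS"))
```

If the check fails for `OS`, a likely cause is an organism name that was cut short during parsing, so that several taxa share one name.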

## `train_terms.tsv`

In [None]:
go_path = "/kaggle/input/cafa-6-protein-function-prediction/Train/train_terms.tsv"
go_df = pd.read_csv(go_path, sep="\t", header=0, names=["protein_id", "go_id", "aspect"])
go_df

↑ `namespace` in `go-basic.obo` and `aspect` in `train_terms.tsv` have a one-to-one correspondence.

In [None]:
# All GO terms per protein, regardless of aspect.
go_agg_df = go_df.groupby("protein_id")["go_id"].apply(list).reset_index()

# Add one list column per aspect: go_id_f, go_id_p and go_id_c.
for aspect in ["F", "P", "C"]:
    aspect_agg_df = (
        go_df[go_df["aspect"] == aspect]
        .groupby("protein_id")["go_id"].apply(list)
        .rename(f"go_id_{aspect.lower()}")
        .reset_index()
    )
    go_agg_df = go_agg_df.merge(aspect_agg_df, on="protein_id", how="left")

go_agg_df

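↑ `go_id` lists every term annotated to a protein; it is the union of `go_id_f`, `go_id_p` and `go_id_c`. A protein with no terms in an aspect has `NaN` in that column.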
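The relationship between the combined list and the three per-aspect lists can be checked directly. This is a minimal sketch on toy annotations; the protein IDs, GO IDs and the `terms_by_protein` helper are placeholders, and the aspect codes `F`, `P` and `C` follow the filters used in the cell above.

```python
import pandas as pd

# Toy annotations in the same three-column layout as go_df (placeholder values).
toy_go = pd.DataFrame({
    "protein_id": ["A", "A", "A", "B"],
    "go_id": ["GO:0000001", "GO:0000002", "GO:0000003", "GO:0000002"],
    "aspect": ["F", "P", "C", "P"],
})

def terms_by_protein(df):
    """Map each protein_id to the list of its GO IDs."""
    return df.groupby("protein_id")["go_id"].apply(list).to_dict()

all_terms = terms_by_protein(toy_go)
by_aspect = {a: terms_by_protein(toy_go[toy_go["aspect"] == a]) for a in ["F", "P", "C"]}

mismatches = []
for protein_id, terms in all_terms.items():
    # A protein with no terms in an aspect contributes an empty list.
    combined = [t for a in ["F", "P", "C"] for t in by_aspect[a].get(protein_id, [])]
    if sorted(combined) != sorted(terms):
        mismatches.append(protein_id)

print(mismatches)  # an empty list means the aspect lists cover go_id exactly
```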

## `IA.tsv`

In [None]:
ia_path = "/kaggle/input/cafa-6-protein-function-prediction/IA.tsv"
ia_df = pd.read_csv(ia_path, sep="\t", names=["go_id", "weight"])
ia_df

## `go-basic.obo`

In [None]:
go = GODag("/kaggle/input/cafa-6-protein-function-prediction/Train/go-basic.obo")
first_term_obj = go[next(iter(go))]
keys = list(first_term_obj.__dict__.keys())
keys

In [None]:
go_data = []

for go_id, term_obj in go.items():
    data = {
        "go_id": go_id,
        "name": term_obj.name,
        "namespace": term_obj.namespace,
        "level": term_obj.level,
        "depth": term_obj.depth,
        "is_obsolete": term_obj.is_obsolete,
        "parents_ids": ",".join(term_obj._parents),
    }
    go_data.append(data)

obo_df = pd.DataFrame(go_data)
obo_df

↑ `namespace` in `go-basic.obo` and `aspect` in `train_terms.tsv` have a one-to-one correspondence.
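The correspondence can be written as an explicit mapping and checked row by row. The sketch below assumes the one-letter codes `F`, `P` and `C` used earlier in this notebook and the three standard GO namespace names. The toy rows stand in for `go_df` joined with `obo_df` on `go_id`, and the GO IDs are placeholders.

```python
import pandas as pd

# Assumed mapping between the one-letter aspect codes and GO namespace names.
ASPECT_TO_NAMESPACE = {
    "F": "molecular_function",
    "P": "biological_process",
    "C": "cellular_component",
}

# Toy rows standing in for go_df merged with obo_df (placeholder GO IDs).
toy = pd.DataFrame({
    "go_id": ["GO:0000001", "GO:0000002", "GO:0000003"],
    "aspect": ["F", "P", "C"],
    "namespace": ["molecular_function", "biological_process", "cellular_component"],
})

expected = toy["aspect"].map(ASPECT_TO_NAMESPACE)
disagreements = toy[expected != toy["namespace"]]
print(len(disagreements))
```

Any rows left in `disagreements` would point to terms whose aspect and namespace do not line up, for example obsolete terms that are missing from the ontology file.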

# Merge DataFrames

## on `protein_id`

In [None]:
merge_on_protein_df = sequences_df.merge(tax_df, on="protein_id", how="left") \
                        .merge(go_agg_df, on="protein_id", how="left")
merge_on_protein_df

- This includes data from `train_sequences.fasta`, `train_taxonomy.tsv` and `train_terms.tsv`.
- `taxon_id`, `OS` and `OX` have a one-to-one correspondence.
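A left merge keeps one row per protein only if `protein_id` is unique in the right-hand table. The sketch below shows how pandas can enforce that assumption and report unmatched keys, using the `validate` and `indicator` parameters of `DataFrame.merge`. The toy tables and their values are made up for illustration.

```python
import pandas as pd

# Toy tables sharing the protein_id key (values are made up).
toy_sequences = pd.DataFrame({"protein_id": ["A", "B", "C"], "sequence_length": [120, 85, 300]})
toy_taxonomy = pd.DataFrame({"protein_id": ["A", "B"], "taxon_id": [9606, 10090]})

merged = toy_sequences.merge(
    toy_taxonomy,
    on="protein_id",
    how="left",
    validate="one_to_one",  # raises pandas.errors.MergeError if a key repeats on either side
    indicator=True,         # adds a _merge column: "both" or "left_only"
)

# Proteins that found no partner in the right-hand table.
unmatched = merged.loc[merged["_merge"] == "left_only", "protein_id"].tolist()
print(len(merged) == len(toy_sequences), unmatched)
```

The `_merge` column can be dropped once the check has passed.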

## on `go_id`

In [None]:
merge_on_go_df = go_df.merge(ia_df, on="go_id", how="left") \
                    .merge(obo_df, on="go_id", how="left") \
                    .merge(sequences_df, on="protein_id", how="left") \
                    .merge(tax_df, on="protein_id", how="left")

# Move protein_id to position 9 so it sits between the GO-term columns and the protein columns.
col = merge_on_go_df.pop("protein_id")
merge_on_go_df.insert(9, "protein_id", col)

merge_on_go_df

- This includes data from `train_terms.tsv`, `IA.tsv`, `go-basic.obo`, `train_sequences.fasta` and `train_taxonomy.tsv`.
- `aspect` and `namespace` have a one-to-one correspondence.
- `taxon_id`, `OS` and `OX` have a one-to-one correspondence.