This starter code takes a naive approach to the CAFA 5 Protein Function Prediction problem. It reads the training annotations (`train_terms.tsv`), which list the Gene Ontology (GO) terms assigned to the training proteins, and extracts the most frequently occurring GO terms. It then assigns those same top terms (10 in this notebook) to every protein sequence in the test dataset. The confidence score for each term is its relative frequency in the training annotations, computed as the term's count divided by the total number of annotation rows.

In [1]:
import numpy as np
import pandas as pd
from Bio import SeqIO
from tqdm import tqdm

In [2]:
def read_fasta(fasta_path):
    """Read a FASTA file into a DataFrame with 'Id' and 'Sequence' columns."""
    ids = []
    sequences = []
    # SeqIO.parse accepts a path directly, so no file handle is left open.
    for record in SeqIO.parse(fasta_path, 'fasta'):
        ids.append(record.id)
        sequences.append(str(record.seq))
    return pd.DataFrame({'Id': ids, 'Sequence': sequences})

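def get_top_go_terms(data, num_terms):
    """Return the num_terms most frequent GO terms with their relative frequencies."""
    term_counts = data['term'].value_counts()
    # Normalize by the number of annotation rows, not by the number of proteins.
    freq_counts = term_counts / len(data)
    return freq_counts.nlargest(num_terms)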
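
Because `get_top_go_terms` divides by the number of annotation rows, the resulting scores are shares of all annotations, not the fraction of proteins that carry a term. The sketch below contrasts the two on a toy table. The column name `EntryID` and the helper `top_terms_by_protein_share` are assumptions introduced for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy annotation table; 'EntryID' is an assumed name for the protein ID column.
toy_terms = pd.DataFrame({
    'EntryID': ['P1', 'P1', 'P2', 'P2', 'P3', 'P3'],
    'term': ['GO:0005575', 'GO:0008150', 'GO:0005575',
             'GO:0003674', 'GO:0005575', 'GO:0008150'],
})

def top_terms_by_row_share(data, num_terms):
    # Same computation as get_top_go_terms: share of annotation rows.
    return (data['term'].value_counts() / len(data)).nlargest(num_terms)

def top_terms_by_protein_share(data, num_terms, id_col='EntryID'):
    # Hypothetical variant: fraction of distinct proteins annotated with each term.
    per_term = data.drop_duplicates([id_col, 'term'])['term'].value_counts()
    return (per_term / data[id_col].nunique()).nlargest(num_terms)

row_share = top_terms_by_row_share(toy_terms, 2)
protein_share = top_terms_by_protein_share(toy_terms, 2)
print(row_share)
print(protein_share)
```

In this toy table `GO:0005575` appears in 3 of 6 rows but on 3 of 3 proteins, so the two normalizations give 0.5 and 1.0 respectively. Which one is the better confidence score for the competition metric is a modeling choice this notebook leaves open.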

In [3]:
# Training annotations: one row per (protein, GO term) pair.
train_terms = pd.read_csv('/kaggle/input/cafa-5-protein-function-prediction/Train/train_terms.tsv', sep='\t')
top_terms = get_top_go_terms(train_terms, 10)

# Test proteins to predict for.
test_data = read_fasta('/kaggle/input/cafa-5-protein-function-prediction/Test (Targets)/testsuperset.fasta')

# Assign the same top terms, with the same confidences, to every test protein.
results = []
for _, row in tqdm(test_data.iterrows(), total=test_data.shape[0], position=0):
    for term, freq in top_terms.items():
        results.append((row['Id'], term, freq))

final_results = pd.DataFrame(results, columns=['Id', 'GO term', 'Confidence'])
final_results.to_csv('submission.tsv', sep='\t', index=False)
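
The nested loop above produces one row per (test protein, top term) pair. As a minimal sketch, the same table can be built without a Python-level loop using `np.repeat` and `np.tile`. The helper name `build_submission` and the toy inputs are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def build_submission(protein_ids, top_terms):
    # protein_ids: sequence of test protein IDs.
    # top_terms: Series mapping GO term -> confidence (as returned by get_top_go_terms).
    n_ids, n_terms = len(protein_ids), len(top_terms)
    return pd.DataFrame({
        # Each ID is repeated once per term: Q1, Q1, ..., Q2, Q2, ...
        'Id': np.repeat(np.asarray(protein_ids), n_terms),
        # The term list and its confidences are cycled once per protein.
        'GO term': np.tile(top_terms.index.to_numpy(), n_ids),
        'Confidence': np.tile(top_terms.to_numpy(), n_ids),
    })

toy_ids = ['Q1', 'Q2']
toy_top = pd.Series({'GO:0005575': 0.5, 'GO:0008150': 0.25})
submission = build_submission(toy_ids, toy_top)
print(submission)
```

With the real data this would be called as `build_submission(test_data['Id'], top_terms)`, and the row order matches the loop version. Whether the competition expects a header row in `submission.tsv` is worth checking against its submission format; `to_csv` accepts `header=False` if it does not.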

100%|██████████| 141865/141865 [00:15<00:00, 8924.39it/s]
