# Let's use Foldseek to give us an edge

I've added a Foldseek-based model to [my library ProFun](https://github.com/SamusRam/ProFun).

![](https://raw.githubusercontent.com/steineggerlab/foldseek/master/.github/foldseek.png)


Foldseek is a recent tool from the dream teams of the amazing [Johannes Söding](https://www.mpinat.mpg.de/642011/cv_soeding) and [Martin Steinegger](https://steineggerlab.com/en/). It "enables fast and sensitive comparisons of large structure sets" ([from the official Foldseek GitHub repository](https://github.com/steineggerlab/foldseek)).
"Foldseek aligns the structure of a query protein against a database by describing tertiary amino acid interactions within proteins as sequences over a structural alphabet." [[van Kempen, Michel, et al. "Fast and accurate protein structure search with Foldseek." Nature Biotechnology (2023): 1-4.](https://www.nature.com/articles/s41587-023-01773-0)]


I have created a k-NN classifier based on Foldseek and shared it on GitHub:
https://github.com/SamusRam/ProFun
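
To make the k-NN idea concrete, here is a minimal sketch of how GO terms could be transferred from structural neighbours. It is an illustration under stated assumptions, not ProFun's actual implementation: the `knn_term_scores` helper, the `hits` list of `(target_id, e_value)` pairs, and the toy IDs are all hypothetical, and the two parameters only mirror the `n_neighbours` and `e_threshold` settings of the config.

```python
from collections import Counter


def knn_term_scores(hits, target_terms, n_neighbours=5, e_threshold=1e-4):
    """Score GO terms for one query protein (hypothetical helper).

    Each term's score is the fraction of the nearest structural
    neighbours that are annotated with that term.
    """
    # Keep only significant hits, then take the best ones by E-value
    significant = [(target, e) for target, e in hits if e <= e_threshold]
    neighbours = sorted(significant, key=lambda hit: hit[1])[:n_neighbours]
    if not neighbours:
        return {}
    votes = Counter()
    for target_id, _ in neighbours:
        # set() so that a duplicated annotation counts once per neighbour
        votes.update(set(target_terms.get(target_id, ())))
    return {term: count / len(neighbours) for term, count in votes.items()}


# Toy example: three hits, the last one is above the E-value threshold
hits = [("P1", 1e-30), ("P2", 1e-12), ("P3", 0.5)]
target_terms = {
    "P1": ["GO:0005515", "GO:0003677"],
    "P2": ["GO:0005515"],
    "P3": ["GO:0016020"],
}
scores = knn_term_scores(hits, target_terms)
```

In this toy case only `P1` and `P2` pass the threshold, so a term shared by both gets a score of 1.0 and a term seen in one of them gets 0.5.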

This notebook shows how to use the initial version of [my GitHub library](https://github.com/SamusRam/ProFun) to improve on [the currently best public notebook](https://www.kaggle.com/code/kirilldubovik/cafa5-tuning-merge-datasets) by [@Kirill Dubovik](https://www.kaggle.com/kirilldubovik), which is a version of the nice [notebook](https://www.kaggle.com/code/mtinti/merge-datasets) by [@MT](https://www.kaggle.com/mtinti).

In [1]:
import pandas as pd
from pathlib import Path
from tqdm.auto import tqdm, trange
from Bio import SeqIO
import numpy as np
import os

from profun.models import FoldseekMatching, FoldseekConfig
from profun.utils.project_info import ExperimentInfo



## Obtaining train data

In [2]:
# Forward slashes work on every OS and avoid invalid escape sequences such as '\d'
data_root = Path('../preprocessing/data/cafa-5-protein-function-prediction')
train_terms = pd.read_csv(data_root/"Train/train_terms.tsv",sep="\t")

ids = []
seqs = []
with open(data_root/"Train/train_sequences.fasta") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        ids.append(record.id)
        seqs.append(str(record.seq))
train_seqs_df = pd.DataFrame({'EntryID': ids, 'Seq': seqs})
train_df_long = train_terms.merge(train_seqs_df, on='EntryID')
# Subsample 200 annotation rows for this demonstration
train_df_long_sample = train_df_long.sample(200)

## Init model

In [4]:
experiment_info = ExperimentInfo(validation_schema='public_lb', 
                                 model_type='foldseek', model_version='5nn')

config = FoldseekConfig(experiment_info=experiment_info, 
                        id_col_name='EntryID', 
                        target_col_name='term',
                        seq_col_name='Seq',
                        class_names=list(train_df_long_sample['term'].unique()), 
                        optimize_hyperparams=False, 
                        n_calls_hyperparams_opt=None,
                        hyperparam_dimensions=None,
                        per_class_optimization=None,
                        class_weights=None,
                        n_neighbours=5,
                        e_threshold=0.0001,
                        n_jobs=1,
                        pred_batch_size=10,
                        local_pdb_storage_path=None  # None: structures are stored in the working directory
                    )

model = FoldseekMatching(config)

## Train model

During training, the predicted AlphaFold2 structures are downloaded automatically from [https://alphafold.ebi.ac.uk/](https://alphafold.ebi.ac.uk/). Any proteins missing from the database of predicted structures are omitted.
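
The download-and-skip behaviour can be sketched roughly as follows. This is not the ProFun code: `alphafold_pdb_url`, `split_available`, and `fake_fetch` are hypothetical names, and the file naming scheme with its version number is an assumption about the AlphaFold DB that may change over time. The fetcher is injected so that the example runs without network access.

```python
def alphafold_pdb_url(uniprot_id, version=4):
    """Build a structure download URL (assumed AlphaFold DB naming scheme)."""
    return f"https://alphafold.ebi.ac.uk/files/AF-{uniprot_id}-F1-model_v{version}.pdb"


def split_available(uniprot_ids, fetch):
    """Call `fetch(url)` for each ID and separate the IDs that could be
    downloaded from those that are missing in the database."""
    available, missing = [], []
    for uniprot_id in uniprot_ids:
        try:
            fetch(alphafold_pdb_url(uniprot_id))
            available.append(uniprot_id)
        except OSError:
            # Covers FileNotFoundError here and urllib's HTTPError/URLError
            missing.append(uniprot_id)
    return available, missing


def fake_fetch(url):
    """Offline stand-in for a real downloader: pretends Q00000 has no structure."""
    if "Q00000" in url:
        raise FileNotFoundError(url)
    return b"ATOM ..."


available, missing = split_available(["P12345", "Q00000"], fake_fetch)
```

With a real downloader in place of `fake_fetch` (for example a thin wrapper around `urllib.request.urlopen`), proteins that end up in `missing` would simply be left out of the training set.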

In [5]:
model.fit(train_df_long_sample)

ERROR:profun.models.foldseek_model:AlphaFold2 structures downloading failed


TypeError: __init__() missing 2 required positional arguments: 'returncode' and 'cmd'

## Predict on test
This cell is only an illustration: I computed the predictions for the whole test set offline. Unfortunately, on Kaggle I ran into the MMseqs2 error reported [here](https://github.com/soedinglab/metaeuk/issues/48) (there's an OpenMP-related [check in MMseqs2](https://github.com/soedinglab/metaeuk/blob/1da320a9daa75dce5539442b5674f69951a2fe4f/lib/mmseqs/src/commons/CommandCaller.cpp#L17)). If you experience the same error on your local machine, please refer to [this thread](https://github.com/soedinglab/metaeuk/issues/48).

In [None]:
ids = []
seqs = []
with open(data_root/"Test (Targets)/testsuperset.fasta") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        ids.append(record.id)
        seqs.append(str(record.seq))
test_seqs_df = pd.DataFrame({'EntryID': ids, 'Seq': seqs})
# test_pred_df = model.predict_proba(test_seqs_df.sample(42).drop_duplicates('EntryID'), return_long_df=True)

## Combining with the best public result

In [None]:
# Offline Foldseek predictions; keep only the confident ones
test_pred_df_foldseek = pd.read_csv('/kaggle/input/foldseek-cafa/foldseek_submission.tsv',
    sep='\t', header=None, names=['Id', 'GO term', 'Confidence_foldseek'])
test_pred_df_foldseek = test_pred_df_foldseek[test_pred_df_foldseek['Confidence_foldseek'] > 0.6]

submission_best_public = pd.read_csv('/kaggle/input/cafa5-tuning-merge-datasets/submission.tsv',
    sep='\t', header=None, names=['Id', 'GO term', 'Confidence'])

# Outer merge on shared key names keeps (protein, GO term) pairs from either submission
submissions_merged = submission_best_public.merge(test_pred_df_foldseek,
                                                  on=['Id', 'GO term'], how='outer')
# Prefer the public confidence; fall back to Foldseek where the public submission has none
submissions_merged['confidence_combined'] = submissions_merged['Confidence'].fillna(
    submissions_merged['Confidence_foldseek'])
submissions_merged[['Id', 'GO term', 'confidence_combined']].to_csv('submission.tsv',
    sep='\t', header=False, index=False)
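
To show what the combination rule does, here is a small self-contained example on made-up data. The protein IDs, GO terms, confidences, and the `Confidence_foldseek` column name are all hypothetical; only the rule itself (keep Foldseek predictions above 0.6, and use them only where the public submission has no prediction) follows the cell above.

```python
import pandas as pd

# Made-up public submission
public = pd.DataFrame({
    "Id": ["A", "A"],
    "GO term": ["GO:1", "GO:2"],
    "Confidence": [0.9, 0.3],
})
# Made-up Foldseek predictions
foldseek = pd.DataFrame({
    "Id": ["A", "B", "B"],
    "GO term": ["GO:1", "GO:3", "GO:4"],
    "Confidence_foldseek": [0.8, 0.7, 0.5],
})

# Keep only confident Foldseek predictions
foldseek = foldseek[foldseek["Confidence_foldseek"] > 0.6]

# Outer merge keeps pairs that occur in either table
merged = public.merge(foldseek, on=["Id", "GO term"], how="outer")

# The public confidence wins; Foldseek only fills the gaps
merged["confidence_combined"] = merged["Confidence"].fillna(merged["Confidence_foldseek"])

combined = dict(zip(zip(merged["Id"], merged["GO term"]), merged["confidence_combined"]))
```

Here `("A", "GO:1")` keeps the public value 0.9 even though Foldseek also predicts it, `("B", "GO:3")` is added with the Foldseek value 0.7, and `("B", "GO:4")` is dropped by the 0.6 threshold.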