## Goal of the Competition:

Enzymes are proteins that act as catalysts in the chemical reactions of living organisms. **The goal of this competition is to predict the thermostability of enzyme variants**. The experimentally measured thermostability data (melting temperature, the `tm` column) include natural sequences as well as engineered sequences with single or multiple mutations relative to the natural sequences.

Understanding and accurately predicting protein stability is a fundamental problem in biotechnology. Its applications include enzyme engineering to address the world’s challenges in sustainability, carbon neutrality, and more. Improvements to enzyme stability could lower costs and increase the speed at which scientists can iterate on concepts.

<center><img src="https://storage.googleapis.com/kaggle-competitions/kaggle/37190/logos/header.png?t=2022-08-30-15-34-26" width=1000></center>

In [1]:
!pip install biopandas

Collecting biopandas
  Downloading biopandas-0.4.1-py2.py3-none-any.whl (878 kB)
Installing collected packages: biopandas
Successfully installed biopandas-0.4.1

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import pandas_profiling
import time
import torch
import torch.nn as nn

In [3]:
TRAIN = "/kaggle/input/novozymes-enzyme-stability-prediction/train.csv"
TEST = "/kaggle/input/novozymes-enzyme-stability-prediction/test.csv"
SUBMISSION = "/kaggle/input/novozymes-enzyme-stability-prediction/sample_submission.csv"
PDB_FILE = "/kaggle/input/novozymes-enzyme-stability-prediction/wildtype_structure_prediction_af2.pdb"

In [4]:
train_df = pd.read_csv(TRAIN)
test_df = pd.read_csv(TEST)
#df = pd.concat([train_df, test_df])

In [5]:
display(train_df.head(n=3))
display(test_df.head(n=3))

Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm
0,0,AAAAKAAALALLGEAPEVVDIWLPAGWRQPFRVFRLERKGDGVLVG...,7.0,doi.org/10.1038/s41592-020-0801-4,75.7
1,1,AAADGEPLHNEEERAGAGQVGRSLPQESEEQRTGSRPRRRRDLGSR...,7.0,doi.org/10.1038/s41592-020-0801-4,50.5
2,2,AAAFSTPRATSYRILSSAGSGSTRADAPQVRRLHTTRDLLAKDYYA...,7.0,doi.org/10.1038/s41592-020-0801-4,40.5


Unnamed: 0,seq_id,protein_sequence,pH,data_source
0,31390,VPVNPEPDATSVENVAEKTGSGDSQSDPIKADLEVKGQSALPFDVD...,8,Novozymes
1,31391,VPVNPEPDATSVENVAKKTGSGDSQSDPIKADLEVKGQSALPFDVD...,8,Novozymes
2,31392,VPVNPEPDATSVENVAKTGSGDSQSDPIKADLEVKGQSALPFDVDC...,8,Novozymes


In [6]:
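from biopandas.pdb import PandasPdb

# Load the AlphaFold2-predicted wild-type structure; biopandas splits the
# PDB records into one DataFrame per record type.
pdb_df = PandasPdb().read_pdb(PDB_FILE)
pdb_df.df.keys()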

dict_keys(['ATOM', 'HETATM', 'ANISOU', 'OTHERS'])

In [7]:
atom_df = pdb_df.df['ATOM']
hetatm_df = pdb_df.df['HETATM']
anisou_df = pdb_df.df['ANISOU']
others_df = pdb_df.df['OTHERS']
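
For orientation, the `ATOM` table holds one row per atom, parsed from the fixed-width columns of the PDB format. The sketch below is not how biopandas is implemented; it is a minimal standard-library illustration, assuming the standard PDB column positions, of how the per-residue B-factor could be read from the C-alpha records. AlphaFold writes its per-residue confidence (pLDDT) into that B-factor field. The helper names `make_atom_line` and `ca_b_factors` are hypothetical, and the sample records are made up for the demonstration.

```python
def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z, occ, b_factor):
    """Format one ATOM record using the PDB fixed-width column layout."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b_factor:6.2f}"
    )


def ca_b_factors(pdb_lines):
    """Return {residue_number: B-factor} taken from C-alpha ATOM records."""
    scores = {}
    for line in pdb_lines:
        # Columns 13-16 hold the atom name, 23-26 the residue number,
        # and 61-66 the temperature (B) factor.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


# Made-up records for illustration only (not taken from the competition file).
demo_lines = [
    make_atom_line(1, " N  ", "VAL", "A", 1, 1.0, 2.0, 3.0, 1.0, 88.00),
    make_atom_line(2, " CA ", "VAL", "A", 1, 1.5, 2.5, 3.5, 1.0, 90.12),
    make_atom_line(3, " CA ", "PRO", "A", 2, 4.0, 5.0, 6.0, 1.0, 45.50),
]
print(ca_b_factors(demo_lines))
```

In the notebook itself, the same values should be available without manual parsing from the `atom_name`, `residue_number`, and `b_factor` columns of `atom_df`.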

In [8]:
# Sequence length as a simple per-protein feature.
train_df["len_seq"] = train_df["protein_sequence"].str.len()
test_df["len_seq"] = test_df["protein_sequence"].str.len()
train_df.head()

Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,len_seq
0,0,AAAAKAAALALLGEAPEVVDIWLPAGWRQPFRVFRLERKGDGVLVG...,7.0,doi.org/10.1038/s41592-020-0801-4,75.7,341
1,1,AAADGEPLHNEEERAGAGQVGRSLPQESEEQRTGSRPRRRRDLGSR...,7.0,doi.org/10.1038/s41592-020-0801-4,50.5,286
2,2,AAAFSTPRATSYRILSSAGSGSTRADAPQVRRLHTTRDLLAKDYYA...,7.0,doi.org/10.1038/s41592-020-0801-4,40.5,497
3,3,AAASGLRTAIPAQPLRHLLQPAPRPCLRPFGLLSVRAGSARRSGLL...,7.0,doi.org/10.1038/s41592-020-0801-4,47.2,265
4,4,AAATKSGPRRQSQGASVRTFTPFYFLVEPVDTLSVRGSSVILNCSA...,7.0,doi.org/10.1038/s41592-020-0801-4,49.5,1451


In [9]:
# Keep only sequences of at most 221 residues so every row fits a 221-column matrix.
train_df = train_df[train_df["len_seq"] <= 221].reset_index(drop=True)
# One column per residue position; shorter sequences are padded with missing values.
sequences = [list(seq) for seq in train_df["protein_sequence"]]
sequences_train = pd.DataFrame(sequences)
sequences_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,211,212,213,214,215,216,217,218,219,220
0,A,A,F,Q,V,T,S,N,E,I,...,,,,,,,,,,
1,A,A,G,G,Q,P,Q,G,A,T,...,A,Q,Q,Q,C,N,,,,
2,A,A,I,G,I,G,I,L,G,G,...,,,,,,,,,,
3,A,A,K,S,G,D,A,E,E,A,...,,,,,,,,,,
4,A,A,L,A,L,G,L,P,A,F,...,,,,,,,,,,


In [10]:
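from sklearn.preprocessing import LabelEncoder

# Note: the encoder is refitted on every column, so the integer assigned to a
# given residue (or to padding) can differ from one position to the next.
sequences_train = sequences_train.apply(LabelEncoder().fit_transform)
sequences_train["tm"] = train_df["tm"]
sequences_train.head()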

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,212,213,214,215,216,217,218,219,220,tm
0,0,0,4,13,17,16,15,11,3,7,...,20,19,20,20,20,20,20,20,18,49.7
1,0,0,5,5,13,12,13,5,0,16,...,13,13,13,1,11,20,20,20,18,45.1
2,0,0,7,5,7,5,7,9,5,5,...,20,19,20,20,20,20,20,20,18,62.8
3,0,0,8,15,5,2,0,3,3,0,...,20,19,20,20,20,20,20,20,18,36.3
4,0,0,9,0,9,5,9,12,0,4,...,20,19,20,20,20,20,20,20,18,83.0
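
Because `LabelEncoder` is refitted per column here, the same residue may receive different codes at different positions, and again different codes when the test set is encoded separately. One alternative, sketched below under the assumption that sequences use only the 20 standard amino-acid letters, is a single fixed mapping shared by every column and by both data sets. The names `AA_TO_INT`, `PAD`, `UNKNOWN`, and `encode_sequence` are illustrative and not part of the original notebook.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = 0                                # fills positions past the sequence end
UNKNOWN = len(AMINO_ACIDS) + 1         # any letter outside the vocabulary
AA_TO_INT = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}


def encode_sequence(seq, max_len=221):
    """Map a protein string to a fixed-length list of integer codes."""
    codes = [AA_TO_INT.get(aa, UNKNOWN) for aa in seq[:max_len]]
    return codes + [PAD] * (max_len - len(codes))


print(encode_sequence("ACDX", max_len=6))  # [1, 2, 3, 21, 0, 0]
```

Applying the same function to both frames, for example `pd.DataFrame([encode_sequence(s) for s in train_df["protein_sequence"]])`, would give train and test matrices whose codes are directly comparable. Whether this improves the score is not something this notebook measures.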


In [11]:
from sklearn.model_selection import train_test_split
import xgboost

# Features are the encoded residue positions; the target is the melting temperature.
X = sequences_train.drop(columns="tm")
y = sequences_train["tm"]

# Hold out 20% of the rows for validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and fit an XGBoost regression model.
model = xgboost.XGBRegressor(n_estimators=500, max_depth=15)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [12]:
from scipy import stats

stats.spearmanr(y_test, y_pred)

SpearmanrResult(correlation=0.3547764232104556, pvalue=9.909469529327458e-55)
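
Spearman's rank correlation, used above to score the hold-out predictions, is the Pearson correlation computed on ranks, so it rewards getting the ordering of melting temperatures right rather than their exact values. The pure-Python sketch below illustrates that definition, using average ranks for ties. The function names are hypothetical and the inputs are toy values; `scipy.stats.spearmanr` remains the practical choice.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# A monotonic but non-linear relationship still has perfectly matching ranks.
print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))
```

One consequence is that any monotonic rescaling of the predictions leaves the score unchanged.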

In [13]:
test_df = test_df[test_df["len_seq"] <= 221]
sequences = [list(seq) for seq in test_df["protein_sequence"]]
sequences_test = pd.DataFrame(sequences)
# Caution: this refits the encoder on the test columns, so the integer codes
# are not guaranteed to match the ones the model saw during training.
sequences_test = sequences_test.apply(LabelEncoder().fit_transform)
sequences_test.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,211,212,213,214,215,216,217,218,219,220
0,0,0,0,0,0,0,0,0,0,0,...,7,11,7,5,1,8,13,15,2,6
1,0,0,0,0,0,0,0,0,0,0,...,7,11,7,5,1,8,13,15,2,6
2,0,0,0,0,0,0,0,0,0,0,...,10,11,6,2,5,16,11,4,4,13
3,0,0,0,0,0,0,0,0,0,0,...,7,11,7,5,1,8,13,15,2,6
4,0,0,0,0,0,0,0,0,0,0,...,7,11,7,5,1,8,13,15,2,6


In [14]:
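# Pair each prediction with its seq_id by position; .values avoids index
# misalignment after test_df was filtered without resetting its index.
submission = pd.DataFrame({
    "seq_id": test_df["seq_id"].values,
    "tm": model.predict(sequences_test),
})
submission.to_csv("submission.csv", index=False)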