# Novozymes Enzyme Stability Prediction

This notebook trains and evaluates models that predict the thermal stability of enzymes, measured as melting temperature (`tm`), from their amino acid sequences.

Competition details are available [here](https://www.kaggle.com/competitions/novozymes-enzyme-stability-prediction/overview).

Prepared for SCS3546 - Deep Learning

<pre> Christopher Eeles </pre>

<pre> X361483 </pre>

The dependencies for this notebook are listed in a Conda environment file on
GitHub under `ChristopherEeles/enzyme_thermal_stability_prediction/env`.

## Dataset Download

Retrieve the dataset from Kaggle using the Kaggle API command-line utility.

In [1]:
from pathlib import Path
from shlib import Cmd
import zipfile as zip

In [2]:
# Path constants
DATA_DIR = Path("rawdata")
METADATA_DIR = Path("metadata")
LOG_DIR = Path("logs")
RESULT_DIR = Path("results")

# Kaggle constants
COMPETITION_NAME = "novozymes-enzyme-stability-prediction"

In [3]:
# Initialize project directories
for d in (DATA_DIR, METADATA_DIR, LOG_DIR, RESULT_DIR):
    d.mkdir(parents=True, exist_ok=True)

In [4]:
# Download competition data
download_competition_files = Cmd(["kaggle", "competitions", "download", "-c", 
    COMPETITION_NAME, "-p", DATA_DIR])
download_competition_files.run()
dataset_file = sorted(DATA_DIR.glob(f"{COMPETITION_NAME}.*"))
dataset_file

Downloading novozymes-enzyme-stability-prediction.zip to rawdata



100%|██████████| 7.06M/7.06M [00:00<00:00, 55.3MB/s]


[PosixPath('rawdata/novozymes-enzyme-stability-prediction.zip')]

In [5]:
# Extract the archive, then delete it once the file handle has been closed
with zip.ZipFile(dataset_file[0].resolve()) as z:
    z.extractall(path=DATA_DIR)
dataset_file[0].unlink()

In [6]:
dataset_files = sorted(DATA_DIR.glob("*"))
dataset_files

[PosixPath('rawdata/sample_submission.csv'),
 PosixPath('rawdata/test.csv'),
 PosixPath('rawdata/train.csv'),
 PosixPath('rawdata/train_updates_20220929.csv'),
 PosixPath('rawdata/wildtype_structure_prediction_af2.pdb')]
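
The extract-then-delete step can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own code: `extract_and_remove` is a hypothetical helper, and the archive is a throwaway created in a temporary directory rather than the competition download. The archive is deleted only after the `with` block has closed it, which avoids errors on platforms that refuse to delete open files.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_remove(archive: Path, dest: Path) -> list[Path]:
    """Extract `archive` into `dest`, delete the archive, return the extracted paths."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(path=dest)
        names = zf.namelist()
    # The handle is closed here, so the archive can be removed safely
    archive.unlink()
    return sorted(dest / name for name in names)


# Demonstration on a throwaway archive (file names and contents are made up)
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    demo_zip = tmp_dir / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("train.csv", "seq_id,tm\n0,75.7\n")
        zf.writestr("test.csv", "seq_id\n1\n")
    extracted = extract_and_remove(demo_zip, tmp_dir)
    extracted_names = [p.name for p in extracted]
    archive_removed = not demo_zip.exists()
```

Returning the extracted paths makes it easy to check that the expected files arrived before moving on to loading them.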

## Data Exploration

Before modelling, we inspect the files provided for the Novozymes competition to see which features are available for the task.

In [19]:
from biopandas.pdb import PandasPdb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [20]:
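# Refer to files by name rather than by position in the sorted listing
TRAIN_PATH = DATA_DIR / "train.csv"
TEST_PATH = DATA_DIR / "test.csv"
SAMPLE_SUBMISSION = DATA_DIR / "sample_submission.csv"
TRAIN_UPDATE_PATH = DATA_DIR / "train_updates_20220929.csv"
TRAIN_PDB_PATH = DATA_DIR / "wildtype_structure_prediction_af2.pdb"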

In [21]:
# Load available csv files
train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)
sample_submission_df = pd.read_csv(SAMPLE_SUBMISSION)
train_update_df = pd.read_csv(TRAIN_UPDATE_PATH)

In [22]:
# Use BioPandas to load the PDB protein structure file
pdb_df = PandasPdb().read_pdb(str(TRAIN_PDB_PATH))

In [23]:
# Get the PDB file contents as a dict of pandas DataFrames
protein_struct_df_dict = pdb_df.df
protein_struct_df_dict.keys()

dict_keys(['ATOM', 'HETATM', 'ANISOU', 'OTHERS'])

In [24]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31390 entries, 0 to 31389
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   seq_id            31390 non-null  int64  
 1   protein_sequence  31390 non-null  object 
 2   pH                31104 non-null  float64
 3   data_source       28043 non-null  object 
 4   tm                31390 non-null  float64
dtypes: float64(2), int64(1), object(2)
memory usage: 1.2+ MB


In [26]:
# NOTE: the tm column is the melting temperature in degrees Celsius
train_df.head()

|   | seq_id | protein_sequence | pH | data_source | tm |
|---|---|---|---|---|---|
| 0 | 0 | AAAAKAAALALLGEAPEVVDIWLPAGWRQPFRVFRLERKGDGVLVG... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 75.7 |
| 1 | 1 | AAADGEPLHNEEERAGAGQVGRSLPQESEEQRTGSRPRRRRDLGSR... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 50.5 |
| 2 | 2 | AAAFSTPRATSYRILSSAGSGSTRADAPQVRRLHTTRDLLAKDYYA... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 40.5 |
| 3 | 3 | AAASGLRTAIPAQPLRHLLQPAPRPCLRPFGLLSVRAGSARRSGLL... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 47.2 |
| 4 | 4 | AAATKSGPRRQSQGASVRTFTPFYFLVEPVDTLSVRGSSVILNCSA... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 49.5 |


In [27]:
# Inspect the update file; it may contain corrections to errors in the original training data
train_update_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2434 entries, 0 to 2433
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   seq_id            2434 non-null   int64  
 1   protein_sequence  25 non-null     object 
 2   pH                25 non-null     float64
 3   data_source       0 non-null      float64
 4   tm                25 non-null     float64
dtypes: float64(3), int64(1), object(1)
memory usage: 95.2+ KB


In [31]:
# There were some data quality issues, need to drop NaN rows and update some pH and tm values
# See: https://www.kaggle.com/competitions/novozymes-enzyme-stability-prediction/discussion/356251
train_update_df.head()

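|   | seq_id | protein_sequence | pH | data_source | tm |
|---|---|---|---|---|---|
| 0 | 69 | NaN | NaN | NaN | NaN |
| 1 | 70 | NaN | NaN | NaN | NaN |
| 2 | 71 | NaN | NaN | NaN | NaN |
| 3 | 72 | NaN | NaN | NaN | NaN |
| 4 | 73 | NaN | NaN | NaN | NaN |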


In [34]:
# Extract rows which need updating in train data
bad_seq_ids = train_update_df.seq_id.values
bad_seq_ids

array([   69,    70,    71, ..., 30740, 30741, 30742])

In [47]:
# Drop those rows from the training data and append the updated rows
train_df_fix = train_df.loc[~train_df.seq_id.isin(bad_seq_ids), :]
train_df_fix = (pd.concat([train_df_fix, train_update_df])
    .sort_values(by="seq_id"))

In [48]:
train_df_fix.head()

|   | seq_id | protein_sequence | pH | data_source | tm |
|---|---|---|---|---|---|
| 0 | 0 | AAAAKAAALALLGEAPEVVDIWLPAGWRQPFRVFRLERKGDGVLVG... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 75.7 |
| 1 | 1 | AAADGEPLHNEEERAGAGQVGRSLPQESEEQRTGSRPRRRRDLGSR... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 50.5 |
| 2 | 2 | AAAFSTPRATSYRILSSAGSGSTRADAPQVRRLHTTRDLLAKDYYA... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 40.5 |
| 3 | 3 | AAASGLRTAIPAQPLRHLLQPAPRPCLRPFGLLSVRAGSARRSGLL... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 47.2 |
| 4 | 4 | AAATKSGPRRQSQGASVRTFTPFYFLVEPVDTLSVRGSSVILNCSA... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 49.5 |


In [74]:
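# Sanity check: the non-NaN pH values were updated correctly.
# Look rows up by seq_id label rather than by position, so the check
# does not depend on seq_id matching the row order.
fixed_ph = train_df_fix.set_index("seq_id").loc[bad_seq_ids, "pH"].dropna()
expected_ph = train_update_df.set_index("seq_id")["pH"].dropna()
assert fixed_ph.equals(expected_ph)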
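
The correction procedure can be sketched on a toy example: remove every row named in the update file, append the update rows, then discard rows that have no target. The frames `train` and `updates` below are hypothetical stand-ins for `train_df` and `train_update_df`, and all values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for train_df and train_update_df (values are made up)
train = pd.DataFrame({
    "seq_id": [0, 1, 2, 3],
    "pH": [7.0, 7.0, 48.4, 7.0],    # seq_id 2 carries a hypothetical bad entry
    "tm": [75.7, 50.5, 7.0, 40.5],
})
updates = pd.DataFrame({
    "seq_id": [1, 2],
    "pH": [float("nan"), 7.0],      # all-NaN row: seq_id 1 is to be dropped
    "tm": [float("nan"), 48.4],     # corrected values for seq_id 2
})

# 1. Remove every row that the update file mentions
kept = train.loc[~train["seq_id"].isin(updates["seq_id"])]

# 2. Append the update rows and restore seq_id order
fixed = (pd.concat([kept, updates], ignore_index=True)
    .sort_values(by="seq_id")
    .reset_index(drop=True))

# 3. Rows without a target value cannot be used for training
clean = fixed.dropna(subset=["tm"]).reset_index(drop=True)

print(clean)
```

Passing `ignore_index=True` to `pd.concat` avoids duplicate index labels, which otherwise arise because both frames start counting from zero.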

In [79]:
# Drop rows with NaN in the tm column, since that is our modelling target
train_df2 = train_df_fix.dropna(subset=["tm"])

In [81]:
train_df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28981 entries, 0 to 31389
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   seq_id            28981 non-null  int64  
 1   protein_sequence  28981 non-null  object 
 2   pH                28695 non-null  float64
 3   data_source       28001 non-null  object 
 4   tm                28981 non-null  float64
dtypes: float64(2), int64(1), object(2)
memory usage: 1.3+ MB


In [29]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2413 entries, 0 to 2412
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   seq_id            2413 non-null   int64 
 1   protein_sequence  2413 non-null   object
 2   pH                2413 non-null   int64 
 3   data_source       2413 non-null   object
dtypes: int64(2), object(2)
memory usage: 75.5+ KB
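
The `info()` outputs show that `pH` is `float64` with missing values in the training data but `int64` with no missing values in the test data. A minimal sketch of checking for gaps and aligning the dtypes follows. It assumes that downstream code benefits from matching dtypes; the toy frames and their values are made up.

```python
import pandas as pd

# Toy frames mirroring the dtypes reported by info() (values are made up)
train_toy = pd.DataFrame({"pH": [7.0, None, 5.5]})   # float64, one gap
test_toy = pd.DataFrame({"pH": [8, 8, 8]})           # int64, no gaps

# Count missing pH values in each frame
missing = {
    "train": int(train_toy["pH"].isna().sum()),
    "test": int(test_toy["pH"].isna().sum()),
}

# Cast the test column so both frames share a dtype
test_toy["pH"] = test_toy["pH"].astype("float64")
same_dtype = train_toy["pH"].dtype == test_toy["pH"].dtype
```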


In [30]:
test_df.head()

|   | seq_id | protein_sequence | pH | data_source |
|---|---|---|---|---|
| 0 | 31390 | VPVNPEPDATSVENVAEKTGSGDSQSDPIKADLEVKGQSALPFDVD... | 8 | Novozymes |
| 1 | 31391 | VPVNPEPDATSVENVAKKTGSGDSQSDPIKADLEVKGQSALPFDVD... | 8 | Novozymes |
| 2 | 31392 | VPVNPEPDATSVENVAKTGSGDSQSDPIKADLEVKGQSALPFDVDC... | 8 | Novozymes |
| 3 | 31393 | VPVNPEPDATSVENVALCTGSGDSQSDPIKADLEVKGQSALPFDVD... | 8 | Novozymes |
| 4 | 31394 | VPVNPEPDATSVENVALFTGSGDSQSDPIKADLEVKGQSALPFDVD... | 8 | Novozymes |
