# Kaggle Competition: Novozymes Enzyme Stability Prediction, Help identify the thermostable mutations in enzymes

by: Tianxiong Yu, uploaded at <b><a href="https://github.com/Lecter314/MLDM_2022_YuTianxiong_EEP/tree/main/Novozymes%20Enzyme%20Stability%20Prediction">github_MLDM_2022_YuTianxiong_EEP</a></b>

* Goal of the Competition

    * Enzymes are proteins that act as catalysts in the chemical reactions of living organisms. The goal of this competition is to **predict the thermostability of enzyme variants**. The experimentally measured thermostability (melting temperature) data includes natural sequences, as well as engineered sequences with single or multiple mutations upon the natural sequences.

* Prize Money: \$ 25,000

* Timeline:
    * September 21, 2022 - Start Date;
    * December 27, 2022 - Entry Deadline. You must accept the competition rules before this date in order to compete;
    * December 27, 2022\* - Team Merger Deadline. This is the last day participants may join or merge teams;
    * January 3, 2023 - Final Submission Deadline;
    
All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition organizers reserve the right to update the contest timeline if they deem it necessary.

See https://www.kaggle.com/competitions/novozymes-enzyme-stability-prediction/overview.

## 1 Intro

### 1.1 Competition background

In this competition, you are asked to develop models that can predict the ranking of protein stability (as measured by melting point, $t_m$) after single-point amino acid mutation and deletion.

The text below is taken from https://www.kaggle.com/competitions/novozymes-enzyme-stability-prediction/overview/description, as formatted by <b><a href="https://www.kaggle.com/code/dschettler8845/novo-esp-eda-baseline">@dschettler8845</a></b>.

<b><a href="https://www.britannica.com/science/enzyme">Enzymes</a></b> are <b><a href="https://www.britannica.com/science/protein">proteins</a></b> that act as <b><a href="https://www.britannica.com/science/catalyst">catalysts</a></b> in the chemical reactions of living organisms. 

<b><a href="https://www.novozymes.com/en">Novozymes</a></b> finds enzymes in nature and optimizes them for use in industry. 
* In industry, enzymes replace chemicals and accelerate production processes. 
* They help our customers make more from less, while saving energy and generating less waste. 
* Enzymes are widely used in laundry and dishwashing detergents where they remove stains and enable low-temperature washing and concentrated detergents. 
* Other enzymes improve the quality of bread, beer and wine, or increase the nutritional value of animal feed. 
* Enzymes are also used in the production of biofuels where they turn starch or cellulose from biomass into sugars which can be fermented to ethanol. 

These are just a few examples as we sell enzymes to more than <b>40 different industries</b>. Like enzymes, microorganisms have natural properties that can be put to use in a variety of processes. 
* Novozymes supplies a range of microorganisms for use in agriculture, animal health and nutrition, industrial cleaning and wastewater treatment.

<b><mark>However, many enzymes are only marginally stable, which limits their performance under harsh application conditions.</mark></b> 
* Instability also decreases the amount of protein that can be produced by the cell. 
* Therefore, the development of efficient computational approaches to predict protein stability carries enormous technical and scientific interest. 

Computational protein stability prediction based on physics principles has made remarkable progress thanks to advanced physics-based methods such as <b><a href="https://foldxsuite.crg.eu/">FoldX</a></b>, <b><a href="https://www.rosettacommons.org/software">Rosetta</a></b>, and others. Recently, many machine learning methods were proposed to predict the stability impact of mutations on proteins based on the pattern of variation in natural sequences and their three-dimensional structures. More and more protein structures are being solved thanks to the recent breakthrough of <b><a href="https://www.deepmind.com/research/highlighted-research/alphafold">AlphaFold2</a></b>. <b><mark>However, accurate prediction of protein thermal stability remains a great challenge.</mark></b>

**AlphaFold2 prediction of the wildtype 3D structure (from @cdeotte)**

<center><img src="https://raw.githubusercontent.com/cdeotte/Kaggle_Images/main/Sep-2022/test-image.png"></center>

### 1.2 Related knowledge needed in this work

Since we are not specialists in biology, there is no need to examine each mechanism in detail. Here we only highlight some fundamental knowledge related to the topic, enough to get a feel for the problem and to work more smoothly in the later parts. If we run into problems afterwards, we will come back here for more ideas.

* **Proteins** are large, complex molecules that play many critical roles in the body. They do most of the work in cells and are required for the structure, function, and regulation of the body’s tissues and organs (see: https://medlineplus.gov/genetics/understanding/howgeneswork/protein/);

* Protein is made from twenty-plus basic building blocks called **amino acids** (see: https://www.hsph.harvard.edu/nutritionsource/what-should-you-eat/protein/);

* An **amino acid** is an organic molecule that is made up of a basic amino group (−NH2), an acidic carboxyl group (−COOH), and an organic R group (or side chain) that is unique to each amino acid (see: https://www.britannica.com/science/amino-acid);
    * Amino Acid structure (particularly the R-Group) is determined by a particular codon (triplet of Nucleotides).

* There are only 5 types of **nucleotides** (see: https://www.genome.gov/genetics-glossary/Nucleotide);
    * 3 are found in both DNA and RNA (Adenine (A), Cytosine (C), Guanine (G));
    * Thymine (T) is also used in DNA while Uracil (U) is used in RNA.


* Graphical explanation

**Image comparing proteins with language**
<center><img src="https://www.ptglab.com/media/3301/1503677_complexity-of-proteins-blog-diagram_v1.jpg"></center>


**Twenty different types of side chains**
<center><img src="https://personal.psu.edu/staff/m/b/mbt102/bisci4online/chemistry/charges.gif"></center>


**How it works (sorta)**
<center><img src="https://cdn.britannica.com/80/780-050-CC40AEDF/Synthesis-protein.jpg"></center>

Conclusion: To make a **protein** we use instructions (DNA/RNA, Nucleotides), to build up the protein chain by adding one **amino acid** at a time (in our instructions each codon (triplet of nucleotides) tells us what amino acid comes next). At some point the instructions will also tell us when to stop (stop codon). (see https://www.kaggle.com/code/dschettler8845/novo-esp-eli5-performant-approaches-lb-0-451)
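The codon-by-codon reading described above can be sketched in a few lines of Python. This is only an illustration: the codon table below is deliberately partial (5 of the 64 codons plus the three stop codons), and the function name `translate` and the example RNA string are made up for this sketch.

```python
# Partial codon table (RNA codon -> one-letter amino acid code).
# Only a handful of the 64 codons are listed; this is for illustration only.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "GCU": "A",  # alanine
    "AAA": "K",  # lysine
    "GGC": "G",  # glycine
    "UUU": "F",  # phenylalanine
}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def translate(rna: str) -> str:
    """Read the RNA three nucleotides at a time until a stop codon appears."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        if codon in STOP_CODONS:
            break  # the instructions say "stop"
        protein.append(CODON_TABLE[codon])  # add one amino acid to the chain
    return "".join(protein)


print(translate("AUGGCUAAAGGCUUUUAA"))  # MAKGF
```

The resulting string of one-letter codes has the same form as the `protein_sequence` column used later in this notebook.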

## 2 Import related libraries and datasets

In [1]:
import pandas as pd
import numpy as np

In [28]:
data_old = pd.read_csv("train.csv")
data_update = pd.read_csv("train_updates_20220929.csv")

### 2.1 File description


* **train.csv** - the training data, with columns as follows:

    * seq_id: unique identifier of each protein variant
    * protein_sequence: amino acid sequence of each protein variant. The stability (as measured by tm) of protein is determined by its protein sequence. (Please note that most of the sequences in the test data have the same length of 221 amino acids, but some of them have 220 because of amino acid deletion.)
    * pH: the scale used to specify the acidity of an aqueous solution under which the stability of protein was measured. Stability of the same protein can change at different pH levels.
    * data_source: source where the data was published
    * tm: target column. Since only the spearman correlation will be used for the evaluation, the correct prediction of the relative order is more important than the absolute tm values. (Higher tm means the protein variant is more stable.)


* **train_updates_20220929.csv** - corrected rows in train, please see this forum post for details


* **test.csv** - the test data; your task is to predict the target tm for each protein_sequence (indicated by a unique seq_id)


* **sample_submission.csv** - a sample submission file in the correct format, with seq_id values corresponding to test.csv


* **wildtype_structure_prediction_af2.pdb** - the 3 dimensional structure of the enzyme listed above, as predicted by AlphaFold
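Because only the Spearman correlation is used for evaluation, a prediction is scored on the order of its values and not on their scale. A minimal sketch of that idea, assuming nothing beyond pandas: the helper `spearman` and the two prediction lists are hypothetical, and the `tm` values are the first five rows of train.csv shown below.

```python
import pandas as pd


def spearman(y_true, y_pred) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return pd.Series(y_true).rank().corr(pd.Series(y_pred).rank())


tm_true = [75.7, 50.5, 40.5, 47.2, 49.5]      # measured melting temperatures
pred_scaled = [t / 100 - 3 for t in tm_true]  # wrong scale, same order
pred_swapped = [0.9, 0.8, 0.1, 0.5, 0.4]      # last two variants ranked in the wrong order

print(spearman(tm_true, pred_scaled))
print(spearman(tm_true, pred_swapped))
```

The first score is 1.0 even though the predicted values are far from the real temperatures, while swapping the order of two variants lowers the score to 0.9. This is why a model for this competition only needs to rank the variants correctly.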

In [29]:
data_old.head()

| | seq_id | protein_sequence | pH | data_source | tm |
|---|---|---|---|---|---|
| 0 | 0 | AAAAKAAALALLGEAPEVVDIWLPAGWRQPFRVFRLERKGDGVLVG... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 75.7 |
| 1 | 1 | AAADGEPLHNEEERAGAGQVGRSLPQESEEQRTGSRPRRRRDLGSR... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 50.5 |
| 2 | 2 | AAAFSTPRATSYRILSSAGSGSTRADAPQVRRLHTTRDLLAKDYYA... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 40.5 |
| 3 | 3 | AAASGLRTAIPAQPLRHLLQPAPRPCLRPFGLLSVRAGSARRSGLL... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 47.2 |
| 4 | 4 | AAATKSGPRRQSQGASVRTFTPFYFLVEPVDTLSVRGSSVILNCSA... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 49.5 |


In [30]:
data_update.head()

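| | seq_id | protein_sequence | pH | data_source | tm |
|---|---|---|---|---|---|
| 0 | 69 | NaN | NaN | NaN | NaN |
| 1 | 70 | NaN | NaN | NaN | NaN |
| 2 | 71 | NaN | NaN | NaN | NaN |
| 3 | 72 | NaN | NaN | NaN | NaN |
| 4 | 73 | NaN | NaN | NaN | NaN |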


## 3 Data preprocessing

### 3.1 Substitute the old dataset with the updated one


As has been pointed out, there are some data issues in the training data. A file has been added to the Data page which contains the rows that should **not** be used due to data quality issues (2409 rows, with all features marked as NaN), as well as the rows where the pH and tm were **transposed** (25 rows, with corrected features in this dataset).


The original train.csv has not been modified. Please use this file to make adjustments as necessary.

In [36]:
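# Index both frames by seq_id so that update() aligns rows by sequence ID.
data_new = data_old.set_index('seq_id')
# Note: DataFrame.update() only copies non-NaN values, so this corrects the
# 25 transposed pH/tm rows but leaves the 2409 all-NaN (flagged) rows unchanged.
data_new.update(data_update.set_index('seq_id'))
data_new.to_csv("data_new.csv")
data_new = data_new.reset_index()
data_updated = data_new.copy()
data_updated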

| | seq_id | protein_sequence | pH | data_source | tm |
|---|---|---|---|---|---|
| 0 | 0 | AAAAKAAALALLGEAPEVVDIWLPAGWRQPFRVFRLERKGDGVLVG... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 75.7 |
| 1 | 1 | AAADGEPLHNEEERAGAGQVGRSLPQESEEQRTGSRPRRRRDLGSR... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 50.5 |
| 2 | 2 | AAAFSTPRATSYRILSSAGSGSTRADAPQVRRLHTTRDLLAKDYYA... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 40.5 |
| 3 | 3 | AAASGLRTAIPAQPLRHLLQPAPRPCLRPFGLLSVRAGSARRSGLL... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 47.2 |
| 4 | 4 | AAATKSGPRRQSQGASVRTFTPFYFLVEPVDTLSVRGSSVILNCSA... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 49.5 |
| ... | ... | ... | ... | ... | ... |
| 31385 | 31385 | YYMYSGGGSALAAGGGGAGRKGDWNDIDSIKKKDLHHSRGDEKAQG... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 51.8 |
| 31386 | 31386 | YYNDQHRLSSYSVETAMFLSWERAIVKPGAMFKKAVIGFNCNVDLI... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 37.2 |
| 31387 | 31387 | YYQRTLGAELLYKISFGEMPKSAQDSAENCPSGMQFPDTAIAHANV... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 64.6 |
| 31388 | 31388 | YYSFSDNITTVFLSRQAIDDDHSLSLGTISDVVESENGVVAADDAR... | 7.0 | doi.org/10.1038/s41592-020-0801-4 | 50.7 |
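One detail worth keeping in mind: `DataFrame.update` only copies non-NaN values, so it fixes the transposed pH/tm rows but does not remove the rows that the update file marks as all-NaN. A possible way to handle both cases is sketched below on toy data. The helper name `apply_updates` and the four toy rows are assumptions made for this illustration, not part of the competition files.

```python
import pandas as pd

# Toy stand-ins for train.csv and train_updates_20220929.csv (made-up values).
train = pd.DataFrame({
    "seq_id": [0, 1, 2, 3],
    "protein_sequence": ["AAA", "CCC", "DDD", "EEE"],
    "pH": [7.0, 7.0, 7.0, 48.4],          # row 3 has pH and tm transposed
    "data_source": ["a", "b", "c", "d"],
    "tm": [75.7, 50.5, 40.5, 7.0],
})
updates = pd.DataFrame({
    "seq_id": [1, 3],
    "protein_sequence": [None, "EEE"],    # row 1 is flagged: every feature is NaN
    "pH": [None, 7.0],
    "data_source": [None, "d"],
    "tm": [None, 48.4],
})


def apply_updates(train: pd.DataFrame, updates: pd.DataFrame) -> pd.DataFrame:
    """Drop the rows flagged as all-NaN, then apply the corrected values."""
    train = train.set_index("seq_id")
    updates = updates.set_index("seq_id")
    to_drop = updates.index[updates.isna().all(axis=1)]  # flagged rows
    fixed = train.drop(index=to_drop)
    fixed.update(updates.drop(index=to_drop))            # corrected pH/tm rows
    return fixed.reset_index()


# Plain update() keeps the flagged row, because NaN values are skipped.
naive = train.set_index("seq_id")
naive.update(updates.set_index("seq_id"))
print(len(naive))                       # 4 rows, seq_id 1 is still present

cleaned = apply_updates(train, updates)
print(cleaned[["seq_id", "pH", "tm"]])  # seq_id 1 removed, seq_id 3 corrected
```

Whether the flagged rows should be dropped or only down-weighted is a modelling choice; the sketch simply shows how to tell the two kinds of update rows apart.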


### 3.2 Consider whether some of the data_source providers are unreliable

In [51]:
data_updated['data_source'].value_counts()

doi.org/10.1038/s41592-020-0801-4                                  24525
10.1021/acscatal.9b05223                                             211
10.1016/j.bpc.2006.10.014                                            185
10.7554/eLife.54639                                                  151
10.1007/s00253-018-8872-1                                             84
                                                                   ...  
10.1002/prot.10216                                                     1
10.1002/(sici)1097-0134(19990215)34:3<303::aid-prot4>3.0.co;2-h        1
10.1021/bi025807d                                                      1
10.1016/j.jmb.2005.10.066                                              1
10.1016/s0022-2836(03)00028-7                                          1
Name: data_source, Length: 324, dtype: int64
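Beyond counting rows, one way to start judging the sources is to look at the spread of `tm` within each of them. The snippet below is a rough sketch on toy data; the helper `summarize_sources` and the toy source names are hypothetical, and the statistics chosen (count, mean, standard deviation) are just one reasonable starting point.

```python
import pandas as pd


def summarize_sources(df: pd.DataFrame) -> pd.DataFrame:
    """Per-source row count, mean tm and tm spread; NaN sources form their own group."""
    return (
        df.groupby("data_source", dropna=False)["tm"]
        .agg(["count", "mean", "std"])
        .sort_values("count", ascending=False)
    )


# Toy data with two named sources and one row without a source.
toy = pd.DataFrame({
    "data_source": ["source_a", "source_a", "source_b", None],
    "tm": [75.7, 50.5, 40.5, 47.2],
})
summary = summarize_sources(toy)
print(summary)
```

On the real data, `summarize_sources(data_updated)` would show at a glance which sources contribute only a single measurement and which ones have unusual `tm` ranges.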

In [50]:
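# data_source counts are identical before and after the update.
# Caution: update() skips NaN values, so this check alone does not show
# that the questionable rows all come from NaN sources.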
sum(data_updated['data_source'].value_counts() == data_old['data_source'].value_counts()) == len(data_updated['data_source'].value_counts())

True

In [52]:
# rows with a NaN data_source could be given a lower weight during training
data_updated['data_source'].isna().sum()

3347
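If the rows without a source are to count less during training, one simple option is a per-row weight vector. This is a minimal sketch under assumptions: the helper `source_weights` is hypothetical, and the value 0.5 is an arbitrary placeholder that would need tuning.

```python
import numpy as np
import pandas as pd


def source_weights(df: pd.DataFrame, nan_weight: float = 0.5) -> np.ndarray:
    """Weight 1.0 for rows with a known data_source, nan_weight otherwise."""
    return np.where(df["data_source"].isna(), nan_weight, 1.0)


# Toy frame: the middle row has no data_source.
toy = pd.DataFrame({
    "data_source": ["doi.org/10.1038/s41592-020-0801-4", None, "10.7554/eLife.54639"],
    "tm": [75.7, 50.5, 40.5],
})
weights = source_weights(toy)
print(weights)
```

Many estimators accept such a vector through a `sample_weight` argument when fitting, so the weights can be passed along without removing any rows.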