# Protein Structure Prediction Using MLP 
# 1. Introduction
This project aims to predict protein secondary structure using *DSSP* encoding and *Multi-Layer Perceptron* (MLP) models. We based our project on an article by Burkhard Rost and Chris Sander (Prediction of Protein Secondary Structure at Better than 70% Accuracy, Journal of Molecular Biology, 1993), in which the authors describe a model that reaches better than 70% accuracy. Essentially, their model was a two-layered feed-forward neural network containing a single hidden layer, trained on a database of 130 water-soluble protein chains. Our project differs by using a modern deep learning library, Keras, and a DSSP database. Moreover, we ran the computations on an *NVIDIA GTX 1650* GPU.

* Keras : <br>
Keras is a software library that provides a Python interface for artificial neural networks. It acts as an abstraction layer for building and training deep learning models, making it easier for developers and researchers to implement complex neural networks without needing to delve into the underlying mathematical details. Keras is designed to be user-friendly, modular, and extensible, allowing for rapid experimentation and prototyping (Jaya Gupta et al., 2022).

* DSSP : <br>
DSSP, which stands for "Define Secondary Structure of Proteins," is a widely used algorithm and software tool for assigning secondary structure to protein structures based on their three-dimensional coordinates. It was developed by Wolfgang Kabsch and Chris Sander in the 1980s and has since become a standard method in structural biology (Shaowen Yao et al., 2017).

Finally, the key steps are:
   - Feature extraction from protein sequence files.
   - Preprocessing using one-hot encoding and frequency encoding.
   - Resampling using SMOTE for class balancing.
   - Implementing MLP for prediction.

# 2. Data Preprocessing and Feature Extraction
   
The original data file, as shown in this example, contained the protein residue sequences (RES) along with DSSP, DSSPACC (an extension of DSSP that incorporates accessibility information), STRIDE (Structural Identification), and the alignment.

                                        RES:M,F,K,V,Y,G,Y,D,S,N,I,H,K,C,V
                                        DSSP:_,E,E,E,E,E,_,_,T,T,T,S,_,_
                                        DSSPACC:e,b,e,b,b,b,b,e,b,e,b,e,e
                                        STRIDE:C,E,E,E,E,E,C,T,T,T,T,T,T
                                        RsNo:1,2,3,4,5,6,7,8,9,10,11,12,13,14
                                        DEFINE:E,E,E,E,E,E,_,_,_,_,_,_,_
                                        align1:M,F,K,V,Y,G,Y,D,S,N,I,H,K
                                        align2:K,I,E,V,Y,G,I,P,D,E,V,G,R
<div style="text-align: center;">
    Example of aazb-1 protein used in the dataset 
</div>
<br>
Since Keras only processes numerical values, we needed to encode our sequences using one-hot encoding and frequency encoding.

* One-Hot encoding : <br>
    One-hot encoding converts categorical variables into a binary matrix representation. Since protein sequences are built from 20 standard amino acids, each residue is encoded as a vector of 20 binary digits with a single 1 marking the amino acid. Gaps and unrecognized amino acids are encoded as 20 zeros. Consequently, the resulting DataFrame is large: a peptide of 13 amino acids yields 13 × 20 = 260 columns.

* Frequency encoding : <br>
    In frequency encoding, we encode each peptide by the proportion of each amino acid it contains. The final dataset is much smaller than with one-hot encoding: 20 columns instead of 260.
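
A minimal sketch of both encodings for a single 13-residue peptide is given here for illustration. The function names `one_hot_encode` and `frequency_encode` are hypothetical and are not necessarily those used in our script. The residue ordering (alphabetical by three-letter code) is an assumption; it appears consistent with the frequency values shown in the sample output of this section.

```python
import numpy as np

# Assumed residue ordering (alphabetical by three-letter code: Ala, Arg, Asn, ...).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(window):
    """Return 20 binary values per residue, flattened; gaps and unknown residues stay all-zero."""
    encoded = np.zeros((len(window), len(AMINO_ACIDS)))
    for pos, aa in enumerate(window):
        if aa in INDEX:
            encoded[pos, INDEX[aa]] = 1.0
    return encoded.ravel()

def frequency_encode(window):
    """Return the proportion of each of the 20 amino acids in the window."""
    counts = np.zeros(len(AMINO_ACIDS))
    for aa in window:
        if aa in INDEX:
            counts[INDEX[aa]] += 1.0
    return counts / len(window)

window = "CDAFVGTWKLVSS"
ohe = one_hot_encode(window)
freq = frequency_encode(window)
print(ohe.shape)   # (260,)
print(freq.shape)  # (20,)
```

The one-hot vector keeps the position of every residue, whereas the frequency vector only keeps the composition of the peptide, which is why it is 13 times shorter.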

Furthermore, the corresponding secondary structure is encoded as follows: *{'H': 0, 'E': 1, 'C': 2}*. Note that every character other than H and E is coded as 2.
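
As an illustration, the mapping can be applied to the DSSP line of the example above. The helper name `encode_structure` is hypothetical.

```python
# Three-state mapping described above; any symbol other than H or E falls back to 2.
STRUCTURE_CODES = {"H": 0, "E": 1, "C": 2}

def encode_structure(symbol):
    """Map one DSSP symbol to its class index."""
    return STRUCTURE_CODES.get(symbol, STRUCTURE_CODES["C"])

dssp = "_,E,E,E,E,E,_,_,T,T,T,S,_,_".split(",")
labels = [encode_structure(s) for s in dssp]
print(labels)  # [2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2]
```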

The first step is to parse the files to retrieve the needed information and encode the sequences. We created the script *Extract_features.py* to handle all of the preprocessing. Additionally, if required, this script can perform further sampling via the *SMOTE()* function from the imbalanced-learn package to balance class proportions.
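
The parsing step can be sketched as follows. This is an illustrative reconstruction, not the code of our script: the function names are hypothetical, and labelling each 13-residue window with the DSSP symbol of its central residue is an assumption.

```python
def parse_record(text):
    """Parse 'KEY:v1,v2,...' lines into a dict mapping each key to a list of values."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if ":" not in line:
            continue
        key, _, values = line.partition(":")
        record[key] = values.split(",")
    return record

def sliding_windows(residues, labels, size=13):
    """Yield (window, label) pairs; the label is that of the central residue (assumption)."""
    half = size // 2
    n = min(len(residues), len(labels))  # guard against lines of unequal length
    for start in range(n - size + 1):
        window = "".join(residues[start:start + size])
        yield window, labels[start + half]

example = """RES:M,F,K,V,Y,G,Y,D,S,N,I,H,K,C,V
DSSP:_,E,E,E,E,E,_,_,T,T,T,S,_,_"""

record = parse_record(example)
pairs = list(sliding_windows(record["RES"], record["DSSP"]))
print(pairs)  # [('MFKVYGYDSNIHK', '_'), ('FKVYGYDSNIHKC', '_')]
```

Each window then goes through the sequence encoding, and each label through the three-state mapping.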

    # Feature extraction example
    from Exctract_features import create_dataset

    pwd = "./513_distribute"

    df1 = create_dataset(pwd, method='ohe', rsp=True)  # One-hot encoding with resampling
    df2 = create_dataset(pwd, method='freq', rsp=True)  # Frequency encoding with resampling

    # Displaying a sample of the dataset
    df1.head()
    df2.head()

Output:

    Precessing files ...
    Processing files: 100%|██████████| 513/513 [00:01<00:00, 344.50it/s]
    Encoding ...
    OneHot Encoding ...
    Processing files: 100%|██████████| 77963/77963 [00:02<00:00, 28062.86it/s]
    Dataframe creation...
    Precessing files ...
    Processing files: 100%|██████████| 513/513 [00:01<00:00, 351.39it/s]
    Encoding ...
    Frequences calculation ...
    Processing files: 100%|██████████| 77963/77963 [00:00<00:00, 112163.48it/s]
    Dataframe creation...

Unnamed: 0_level_0,DSSP,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
RES,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
CDAFVGTWKLVSS,2,0.076923,0.0,0.0,0.076923,0.076923,0.0,0.0,0.076923,0.0,0.0,0.076923,0.076923,0.0,0.076923,0.0,0.153846,0.076923,0.076923,0.0,0.153846
DAFVGTWKLVSSE,2,0.076923,0.0,0.0,0.076923,0.0,0.0,0.076923,0.076923,0.0,0.0,0.076923,0.076923,0.0,0.076923,0.0,0.153846,0.076923,0.076923,0.0,0.153846
AFVGTWKLVSSEN,2,0.076923,0.0,0.076923,0.0,0.0,0.0,0.076923,0.076923,0.0,0.0,0.076923,0.076923,0.0,0.076923,0.0,0.153846,0.076923,0.076923,0.0,0.153846
FVGTWKLVSSENF,2,0.0,0.0,0.076923,0.0,0.0,0.0,0.076923,0.076923,0.0,0.0,0.076923,0.076923,0.0,0.153846,0.0,0.153846,0.076923,0.076923,0.0,0.153846
VGTWKLVSSENFD,2,0.0,0.0,0.076923,0.076923,0.0,0.0,0.076923,0.076923,0.0,0.0,0.076923,0.076923,0.0,0.076923,0.0,0.153846,0.076923,0.076923,0.0,0.153846


# 3.  Multi-Layer Perceptron (MLP) Implementation

To implement our model using Keras (in *keras_MLP.ipynb*), we start by splitting the dataset into training and testing sets using *train_test_split()*, reserving 20% of the data for testing while ensuring reproducibility with a fixed random state. Next, we convert the target labels into a one-hot encoded format using *to_categorical()*, which is suitable for classification (the resulting format is *1.,0.,0.* for *H*, *0.,1.,0.* for *E*, and *0.,0.,1.* for *C*).
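
To make the label format concrete, the following NumPy-only sketch reproduces the layout that *to_categorical()* yields and performs an 80/20 shuffled split with a fixed seed. It is a stand-in for illustration, not the scikit-learn and Keras calls used in the notebook; the toy arrays and the seed value are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility (value is arbitrary)

# Toy data: 10 samples with 20 features each, labels in {0: H, 1: E, 2: C}.
X = rng.random((10, 20))
y = np.array([0, 1, 2, 2, 1, 0, 2, 1, 0, 2])

# One-hot labels with the same layout as to_categorical(y, num_classes=3).
y_onehot = np.eye(3)[y]

# Shuffle the indices, then hold out the first 20% of them for testing.
indices = rng.permutation(len(X))
n_test = int(0.2 * len(X))
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y_onehot[train_idx], y_onehot[test_idx]

print(y_onehot[:3])                 # rows for H, E and C
print(X_train.shape, X_test.shape)  # (8, 20) (2, 20)
```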

Our model architecture consists of four dense layers, with the first three employing *ReLU* activation functions and the final layer using softmax to output class probabilities. We compile the model with the *Adam* optimizer and *categorical crossentropy loss*, enabling it to effectively learn and classify inputs based on the provided features. Finally, we added an early stopping mechanism to prevent the model from overfitting.
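
For readers unfamiliar with the output layer and the loss, here is a small NumPy sketch of softmax and categorical cross-entropy. It illustrates the formulas only and is not Keras's internal implementation; the scores are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-(y_true * np.log(y_prob)).sum(axis=-1).mean())

logits = np.array([[2.0, 0.5, 0.1],   # hypothetical output-layer scores
                   [0.2, 0.3, 3.0]])
y_true = np.array([[1.0, 0.0, 0.0],   # true class H
                   [0.0, 0.0, 1.0]])  # true class C

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs.sum(axis=1))  # each row sums to 1
print(loss)
```

The loss shrinks as the probability assigned to the true class approaches 1, which is what the optimizer exploits during training.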

We tested two architectures: the first for one-hot encoding and the second for frequency encoding. For the combined DataFrame (one-hot + frequency), we kept the architecture used for one-hot encoding. The first architecture is simpler than the second and is designed to learn quickly. Furthermore, we tested various layer sizes and learning rates to obtain the results discussed in the next section.

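    # Imports assumed here; use tensorflow.keras instead if that is how Keras is installed
    import keras
    from keras.callbacks import EarlyStopping

    # Stop training once the validation loss has not improved for 15 epochs
    early_stopping = EarlyStopping(monitor='val_loss', min_delta=0, patience=15, verbose=1, mode='min')

    # First architecture: three hidden ReLU layers and a 3-class softmax output
    model = keras.Sequential([
        keras.layers.Dense(500, activation='relu'),
        keras.layers.Dense(256, activation='relu'),
        keras.layers.Dense(128, activation='relu'),
        keras.layers.Dense(3, activation='softmax')
    ])
    # For the second and third models we use 5 hidden layers (1000, 500, 500, 256, 128) with ReLU activation

    # For frequency encoding we use learning_rate=0.001, and the same for the last case
    opt = keras.optimizers.Adam(learning_rate=0.008)

    model.compile(optimizer=opt,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    # X_train, y_train, X_test and y_test come from the split described above
    history = model.fit(X_train, y_train,
                        epochs=3, batch_size=32,
                        validation_data=(X_test, y_test),
                        callbacks=[early_stopping])  # callbacks are passed as a list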

# 4. Results and Analysis
   
We implemented our model in three different ways: with one-hot encoding, frequency encoding, and a combination of both. We evaluated our model to identify which approach performed the best.

## One-hot encoding
As mentioned, one-hot encoding generates the largest amount of data for analysis, and our results showed an accuracy of 65.68%. As expected, our model learned very quickly; we achieved 60.18% accuracy after the first epoch and stopped after 3 epochs.

<div style="text-align: center;">
    <img src="Imges/plot_ohe.png">
</div>
<div style="text-align: center;">
    <img src="Imges/matrix_ohe.png">
</div>
<br>

## Frequency encoding
Even though the DataFrame is smaller, training takes longer, and the model reaches a better accuracy of 80.31% after 30 epochs.

<div style="text-align: center;">
    <img src="Imges/plot_freq.png">
</div>
<div style="text-align: center;">
    <img src="Imges/matrix_freq.png">
</div>

## Combined encoding
Combining the two datasets for training produced a larger DataFrame but lower accuracy than either encoding alone. Like the one-hot model, this model started at around 60% accuracy, but its learning curve fluctuated strongly throughout training. Ultimately, the model clearly overfitted, achieving an accuracy of only 64.10%.

<div style="text-align: center;">
    <img src="Imges/plot_bith.png">
</div>
<div style="text-align: center;">
    <img src="Imges/matrix_bith.png">
</div>


# 5. Discussion and Conclusions

Our three models did not perform the same. One-hot encoding allowed the model to learn very quickly but resulted in low accuracy as a drawback. Frequency encoding yielded the best results, albeit with a relatively long learning time. Finally, combining the two methods led to overfitting in the model. We also observed that precision and recall for each class were quite balanced across all models.
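
For reference, per-class precision and recall can be read off a confusion matrix as sketched here. The matrix values are hypothetical and are not our measured results; the function name is illustrative.

```python
import numpy as np

def per_class_precision_recall(confusion):
    """Rows are true classes, columns are predicted classes."""
    confusion = np.asarray(confusion, dtype=float)
    true_positives = np.diag(confusion)
    precision = true_positives / confusion.sum(axis=0)  # column sums = predicted counts
    recall = true_positives / confusion.sum(axis=1)     # row sums = true counts
    return precision, recall

# Hypothetical confusion matrix for classes H, E, C (not our measured results).
confusion = [[80, 5, 15],
             [10, 70, 20],
             [10, 25, 65]]

precision, recall = per_class_precision_recall(confusion)
print(precision)  # [0.8  0.7  0.65]
print(recall)     # [0.8  0.7  0.65]
```

When precision and recall are close for every class, the model is not trading one class off against another, which is what we mean by balanced here.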

We tested our models using balanced data. When faced with imbalanced data, accuracy dropped in all cases, most sharply for frequency encoding, with respective accuracies of 60.44%, 60.64%, and 60.49% (with overfitting) for one-hot encoding, frequency encoding, and the combined approach. On unbalanced data, the model tended to favor the majority class.

# Suggestions for Further Improvement

We could have tested an autoencoding model using ProtBERT and the raw amino acid sequences, but we did not do so due to time constraints.