This notebook is a basic implementation of regression methods for proteins: it uses each protein's molecular weight and species to predict the `Tm` value recorded in the dataset.

Import the libraries the model needs. Here I use Pandas and Lazy Predict, and for the regression analysis I have selected three models: Linear Regression, Random Forest, and Support Vector Regression (SVR). I also use Biopython, since the features are derived from protein sequences, including their molecular weights.

In [None]:
# Install third-party packages (notebook shell commands)
!pip install scikit-learn biopython
!pip install lazypredict

# Standard library
import logging
import time
import warnings

# Data handling and progress bars
import numpy as np
import pandas as pd
from tqdm import tqdm

# Biopython
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqUtils.ProtParam import ProteinAnalysis

# Lazy Predict
from lazypredict.Supervised import LazyRegressor

# Regression models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Preprocessing, train/test split, and metrics
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error, accuracy_score, classification_report

Collecting biopython
  Using cached biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
Installing collected packages: biopython
Successfully installed biopython-1.84
Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12


Defining the function that computes the features of each protein sequence: its molecular weight and its amino acid composition.

In [None]:
def calculate_features(sequence):
    analysis = ProteinAnalysis(sequence)
    features = {
        'molecular_weight': analysis.molecular_weight()
    }
    amino_acid_percent = analysis.get_amino_acids_percent()
    features.update(amino_acid_percent)
    return features
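
To make the composition part of these features concrete, here is a minimal sketch in plain Python of how a per-residue composition dictionary can be computed. It is not Biopython's implementation; `amino_acid_fractions` and the toy peptide are hypothetical names used only for illustration, and the values are expressed as fractions between 0 and 1. Note that in this notebook the composition columns are added to `data`, but only `Species_encoded` and `molecular_weight` are later passed to the models.

```python
from collections import Counter

# The 20 standard amino acids, as one-letter codes.
STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_fractions(sequence):
    """Return the fraction of each standard amino acid in `sequence`."""
    sequence = sequence.upper()
    length = len(sequence)
    if length == 0:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    # Counter returns 0 for residues that never occur in the sequence.
    return {aa: counts[aa] / length for aa in STANDARD_AMINO_ACIDS}


# Hypothetical toy peptide, not taken from the dataset.
example = amino_acid_fractions("MKKLA")
print(example["K"])  # lysine makes up 2 of the 5 residues
```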

Loading the dataset.

In [None]:
file_path2 = '/content/test2_dataset.csv'
# Read the datasets
data2 = pd.read_csv(file_path2, encoding='latin1', delimiter=',', on_bad_lines='skip')

Extracting the features, encoding the species, and splitting the data into training and test sets.

In [None]:
# Feature Engineering: Extract features from simple_fasta
features_df = data2['simple_fasta'].apply(calculate_features).apply(pd.Series)
data = pd.concat([data2, features_df], axis=1)

# Label encoding for 'Species'
le = LabelEncoder()
data['Species_encoded'] = le.fit_transform(data['Species'])

# Select only the required columns for the model
features = data[['Species_encoded', 'molecular_weight']]
target = data['Tm']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
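
The integer codes in `Species_encoded` (such as the 0, 4, 5 and 7 that appear in the training data) come from label encoding. The sketch below shows the idea in plain Python, assuming the usual behaviour of assigning codes in sorted order of the distinct labels; `label_encode` and the species names are hypothetical and only serve as an illustration.

```python
def label_encode(labels):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(labels))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[label] for label in labels], classes


# Hypothetical species names, used only for illustration.
species = ["Mus musculus", "Homo sapiens", "Escherichia coli", "Homo sapiens"]
codes, classes = label_encode(species)
print(codes)    # one integer per input row
print(classes)  # position in this list is the code of each species
```

One design point to keep in mind: the codes carry no biological ordering, yet models such as Linear Regression and SVR treat them as ordinary numbers. One-hot encoding the species may be worth trying as an alternative.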

Visualizing the selected features of the training set.

In [None]:
X_train.head()

      Species_encoded  molecular_weight
2401                4         255493.02
3366                7          46536.31
2615                0          47660.57
999                 5          32926.87
2382                4          53326.54


Visualizing the target values of the training set.

In [None]:
y_train.head()

2401   52.40
3366   68.70
2615   52.90
999    65.54
2382   52.82
Name: Tm, dtype: float64

Implementing the regression model, then training and testing it.

In [None]:
# Initialize the logger
LOGGER = logging.getLogger(__name__)

# Custom LazyRegressor with specific models
class CustomLazyRegressor(LazyRegressor):
    def __init__(self, verbose=0, ignore_warnings=True, custom_metric=None):
        super().__init__(verbose=verbose, ignore_warnings=ignore_warnings, custom_metric=custom_metric)
        self.regressors = {
            'Linear Regression': LinearRegression(),
            'Random Forest': RandomForestRegressor(),
            'SVR': SVR()
        }
        self.predictions = {}
        self.mse_scores = {}  # To store the Mean Squared Errors scores

    # Override fit to train and evaluate each (name, model) pair in self.regressors
    def fit(self, X_train, X_test, y_train, y_test):
        for name, model in tqdm(self.regressors.items()):
            try:
                model.fit(X_train, y_train)
                self.predictions[name] = model.predict(X_test)
                self.mse_scores[name] = mean_squared_error(y_test, self.predictions[name])
            except Exception as e:
                LOGGER.exception('Fit failed: %s', e)
                self.predictions[name] = None
                self.mse_scores[name] = None
        return self.regressors, self.predictions, self.mse_scores
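
The values stored in `mse_scores` are mean squared errors: the average of the squared differences between the true and predicted targets. As a minimal sketch of that calculation, written by hand rather than with scikit-learn, the example below uses made-up numbers; `manual_mse` and both lists are hypothetical.

```python
def manual_mse(y_true, y_pred):
    """Average of squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Made-up target values and predictions, for illustration only.
y_true = [50.0, 60.0, 70.0]
y_pred = [52.0, 59.0, 67.0]

# Squared errors are 4, 1 and 9, so the mean is 14 / 3.
mse = manual_mse(y_true, y_pred)
print(mse)
```

Because the errors are squared, a lower MSE means a better fit, and a few large misses weigh more heavily than many small ones.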

Printing the results of the model for Linear Regression, Random Forest, and SVR.

In [None]:
# Initialize and fit the custom lazy regressor
reg = CustomLazyRegressor(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions, mse_scores = reg.fit(X_train, X_test, y_train, y_test)

# Print the model performance
for model_name in mse_scores:
    print(f"Model: {model_name}, MSE: {mse_scores[model_name]}")

100%|██████████| 3/3 [00:01<00:00,  2.49it/s]

Model: Linear Regression, MSE: 86.85893551746672
Model: Random Forest, MSE: 30.672025099177393
Model: SVR, MSE: 125.43301363478567





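Since a lower MSE means a better fit, these results show that Random Forest performs best on this dataset (MSE of about 30.67), followed by Linear Regression (about 86.86), while SVR has the highest error (about 125.43).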
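
MSE is expressed in squared units of the target, which makes it hard to read directly. A small sketch of one way to interpret the printed scores is to take their square root (RMSE), which is in the same units as `Tm`, and rank the models from lowest to highest error. The MSE values are copied from the output above; the variable names are my own.

```python
import math

# MSE values copied from the printed model output.
mse_scores = {
    "Linear Regression": 86.85893551746672,
    "Random Forest": 30.672025099177393,
    "SVR": 125.43301363478567,
}

# RMSE is the square root of MSE, so it is in the same units as Tm.
rmse_scores = {name: math.sqrt(mse) for name, mse in mse_scores.items()}

# Rank the models from lowest (best) to highest (worst) error.
for name, rmse in sorted(rmse_scores.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {rmse:.2f}")

# The model with the smallest MSE also has the smallest RMSE.
best_model = min(mse_scores, key=mse_scores.get)
print(f"Lowest error: {best_model}")
```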