<span style="font-size: 24px;"> Multi-label Classification of enzyme substrates </span> 

Key skills covered: Multilabel classification, Random Forests, bagging, data cleaning methods (drop, concatenate, splitting columns, dropping empty values), dealing with imbalanced class sizes (weighting of classes, oversampling), hyperparameter tuning with grid search.

<span style="font-size: 20px;"> Dataset </span> 

There are three files named mixed_desc.csv, mixed_ecfp.csv and mixed_fcfp.csv, containing chemical, structural and connectivity information. Each CSV lists all samples by a unique ID, with a different collection of variables in the columns and a final column specifying which classes the sample belongs to, encoded as follows: 1_1_1_1_0_1.

<span style="font-size: 20px;"> Problem space </span> 

First we combine them into a single dataset.

We have a dataset of substrates and a range of features, and we want to predict the EC classes of the substrates. There are six classes, and a substrate can belong to several of them, since the same molecule can participate in different types of reactions.

The dataset is highly imbalanced in its labels (the smallest class has 1 sample, the largest 248), so we need an algorithm that can cope with label imbalance. Note that we don't have to scale the features, as scaling doesn't matter for tree-based models.

In [10]:
import pandas as pd
from scipy.stats import shapiro, zscore
path_desc = "/home/frances/Documents/Getting a JOB/ML preparation/Multilabel_Substrate_Classification/mixed_desc.csv"
path_ecfp = "/home/frances/Documents/Getting a JOB/ML preparation/Multilabel_Substrate_Classification/mixed_ecfp.csv"
path_fcfp = "/home/frances/Documents/Getting a JOB/ML preparation/Multilabel_Substrate_Classification/mixed_fcfp.csv"

desc = pd.read_csv(path_desc)
desc.drop(['EC1_EC2_EC3_EC4_EC5_EC6'],axis=1,inplace=True)
ecfp = pd.read_csv(path_ecfp)
ecfp.drop(['EC1_EC2_EC3_EC4_EC5_EC6'],axis=1,inplace=True)
fcfp = pd.read_csv(path_fcfp)

merged_data = pd.concat([desc, ecfp, fcfp], axis=1) # concatenate the three files column-wise
merged_data.drop(['CIDs'],axis=1,inplace=True)

data = {'EC1_EC2_EC3_EC4_EC5_EC6': merged_data['EC1_EC2_EC3_EC4_EC5_EC6']}
df = pd.DataFrame(data)

# Split the 'Column' values by '_' and expand into separate columns
df[['EC1', 'EC2', 'EC3', 'EC4', 'EC5', 'EC6']] = df['EC1_EC2_EC3_EC4_EC5_EC6'].str.split('_', expand=True).astype(int)

# Drop the original 'Column' after extracting the EC values
df.drop(columns=['EC1_EC2_EC3_EC4_EC5_EC6'], inplace=True)

merged_data = merged_data.drop('EC1_EC2_EC3_EC4_EC5_EC6',axis=1)
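# Recombine features and labels so that rows with missing values are dropped from both
cleaned_data = pd.concat([merged_data, df], axis=1).dropna()
# Keep the label frame aligned with the cleaned rows
df = cleaned_data[['EC1', 'EC2', 'EC3', 'EC4', 'EC5', 'EC6']]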

In [13]:
import numpy as np

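columns_to_normalise = []
# Loop through each column
for column_name in merged_data.columns:
    # Count null values
    null_count = merged_data[column_name].isnull().sum()
    print(f"{column_name} null count: {null_count}")
    # Shapiro-Wilk test: a small p-value suggests the column is not normally distributed
    statistic, p_value = shapiro(merged_data[column_name].dropna())
    if p_value < 0.05:
        columns_to_normalise.append(column_name)

print(columns_to_normalise)

# Log-transform the non-normal columns. Columns containing zeros or negative
# values are skipped, since the log is undefined there.
for column in columns_to_normalise:
    if (merged_data[column].dropna() > 0).all():
        merged_data[column] = np.log(merged_data[column])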

In [None]:

# Z-score outlier detection for each column
threshold = 3
for column_name in merged_data.columns:
    z_scores = zscore(merged_data[column_name].dropna())
    # Flag values more than `threshold` standard deviations from the mean
    outliers = np.abs(z_scores) > threshold
    print(f"{column_name}: {outliers.sum()} outliers detected")

<span style="font-size: 20px;"> Random Forest </span> 

Random Forest is an ensemble learning method that combines multiple decision trees. Averaging predictions across many trees reduces the risk of overfitting, and class weights can be used to stop the majority class from dominating on imbalanced datasets. It is also less sensitive to outliers and can manage high-dimensional data well. Random forests can also be used for feature selection, since they provide a measure of feature importance.

<b> How it works:</b>

Random Forest uses a few techniques:
- "Bagging" (Bootstrap Aggregating) is used to create multiple decision trees. Bootstrapping means that a random sample is drawn with replacement from the original data to train each individual decision tree.
- Random feature selection introduces further randomness by considering only a random subset of features at each split.
- Each tree provides its own prediction. For classification, the final result is the most frequent prediction; for regression, it is the average of the predicted values.
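
The three ingredients above can be sketched by hand with plain decision trees. This is only an illustration on synthetic data from make_classification (not the enzyme dataset); the names X_demo, y_demo, trees and forest_pred are made up for the example, and the tree count of 25 is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data, purely for illustration (not the enzyme dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
rng = np.random.default_rng(0)

trees = []
for i in range(25):
    # Bagging: draw a bootstrap sample (same size as the data, with replacement)
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # Random feature selection: only a random subset of features is considered at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# Aggregation: majority vote over the trees (25 trees, so no ties)
all_votes = np.stack([t.predict(X_demo) for t in trees])  # shape (n_trees, n_samples)
forest_pred = (all_votes.mean(axis=0) > 0.5).astype(int)
print("Training accuracy of the hand-rolled forest:", (forest_pred == y_demo).mean())
```

RandomForestClassifier does all of this internally. One difference worth knowing: scikit-learn averages the trees' predicted class probabilities rather than counting hard votes.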

In [333]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X = cleaned_data.drop(['EC1','EC2','EC3','EC4','EC5','EC6'], axis=1)
Y = df

X_train, X_temp, Y_train, Y_temp = train_test_split(X, Y, test_size=0.4, random_state=42) 
X_eval, X_test, Y_eval, Y_test = train_test_split(X_temp, Y_temp, test_size=0.5, random_state=42)

model = RandomForestClassifier() # initialise the model

X_train['Ipc'] = X_train['Ipc'].astype(int) # had an issue with the values being too large for float

model.fit(X_train,Y_train) # fit the model
y_pred_eval = model.predict(X_eval) # make predictions on evaluation data
accuracy_eval = accuracy_score(Y_eval, y_pred_eval)  

print(f"Accuracy on the validation dataset: {accuracy_eval}")

Accuracy on the validation dataset: 0.19230769230769232
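
A note on how to read this number: for multi-label targets, accuracy_score computes subset accuracy, so a sample only counts as correct if all six labels match. The toy arrays below (made up for illustration, not taken from the model) show how far this can sit from the per-label accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Four toy samples with six labels each
y_true_demo = np.array([[1, 0, 1, 0, 0, 0],
                        [0, 1, 0, 0, 0, 0],
                        [1, 1, 0, 0, 0, 1],
                        [0, 0, 0, 1, 0, 0]])

# Copy the truth and flip a single label in three of the four rows
y_pred_demo = y_true_demo.copy()
y_pred_demo[0, 2] = 0
y_pred_demo[1, 0] = 1
y_pred_demo[2, 5] = 0

subset_acc = accuracy_score(y_true_demo, y_pred_demo)            # exact-match rows only
per_label_acc = 1 - hamming_loss(y_true_demo, y_pred_demo)       # fraction of correct labels

print(subset_acc)     # 0.25
print(per_label_acc)  # 0.875
```

Only 3 of the 24 individual labels are wrong, yet three of the four rows count as failures under subset accuracy.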


It's clear that the accuracy is very low, likely due to the highly imbalanced class sizes. Let's use class weights to see if we can improve it.

We'll pass one weight dictionary per label, built from the class sizes in the training set, to see if it leads to an improvement.

In [329]:
EC1_no = Y_train['EC1'].value_counts()
EC2_no = Y_train['EC2'].value_counts()
EC3_no = Y_train['EC3'].value_counts()
EC4_no = Y_train['EC4'].value_counts()
EC5_no = Y_train['EC5'].value_counts()
EC6_no = Y_train['EC6'].value_counts()

class_counts_asc = [{0: EC1_no[0], 1: EC1_no[1]},
                    {0: EC2_no[0], 1: EC2_no[1]},
                    {0: EC3_no[0], 1: EC3_no[1]},
                    {0: EC4_no[0], 1: EC4_no[1]},
                    {0: EC5_no[0], 1: EC5_no[1]},
                    {0: EC6_no[0], 1: EC6_no[1]}]

model_bal = RandomForestClassifier(class_weight=class_counts_asc)

model_bal.fit(X_train, Y_train)

y_pred_eval_bal = model_bal.predict(X_eval) # make predictions on evaluation data
accuracy_eval_bal = accuracy_score(Y_eval, y_pred_eval_bal)  

print(f"Accuracy on the validation dataset with weight balancing: {accuracy_eval_bal}")

Accuracy on the validation dataset with weight balancing: 0.21634615384615385


The accuracy is barely different. It could be that I did something weird when I was cleaning the data. There are a lot of input features, which can lead to overfitting, but usually Random Forest is a method that is well equipped to deal with this. If that is the problem, then hyperparameter tuning should help matters. 
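
One possible culprit is the direction of the weights. In scikit-learn a larger class_weight value makes a class count for more, so passing raw class counts gives the biggest class the biggest weight. Below is a minimal sketch of inverse-frequency weights instead, on a toy label frame standing in for Y_train; the helper name inverse_frequency_weights is hypothetical.

```python
import pandas as pd

def inverse_frequency_weights(labels: pd.DataFrame) -> list:
    """Hypothetical helper: one {class: weight} dict per label column, using
    n_samples / (n_classes * class_count), the formula behind class_weight='balanced'."""
    weights = []
    for name in labels.columns:
        counts = labels[name].value_counts()
        weights.append({int(cls): len(labels) / (len(counts) * n) for cls, n in counts.items()})
    return weights

# Toy labels: EC1 is mostly 1, EC2 is mostly 0
toy_labels = pd.DataFrame({"EC1": [1, 1, 1, 0], "EC2": [0, 0, 0, 1]})
weights = inverse_frequency_weights(toy_labels)
print(weights)  # the rare class in each column gets the larger weight
```

The resulting list has one dictionary per label column, which is the shape RandomForestClassifier(class_weight=...) expects for multi-output targets. Passing class_weight='balanced' applies the same formula without any manual counting.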

<b> Problems </b>

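I would like to look at the confusion matrix, F1 score, precision and recall, and I'd like to perform oversampling with the imbalanced-learn library (from imblearn.over_sampling import SMOTE). However, confusion_matrix and SMOTE do not support multi-label targets directly. The confusion matrix has to be computed per label (scikit-learn provides multilabel_confusion_matrix for this), f1_score, precision_score and recall_score need an explicit average argument for multi-label targets, and oversampling has to be handled label by label and the results combined. I don't have time for that now, so it will have to wait.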
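
For the evaluation side, the per-label route is less work than it sounds. Here is a minimal sketch on toy arrays with three labels (made up for illustration, not the real predictions), using multilabel_confusion_matrix and f1_score with an explicit average argument.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix, f1_score

label_names = ["EC1", "EC2", "EC3"]
y_true_toy = np.array([[1, 0, 1],
                       [1, 1, 0],
                       [0, 1, 0],
                       [1, 0, 0]])
y_pred_toy = np.array([[1, 0, 0],
                       [1, 1, 0],
                       [0, 0, 0],
                       [1, 0, 1]])

# One 2x2 matrix per label, laid out as [[tn, fp], [fn, tp]]
per_label_cm = multilabel_confusion_matrix(y_true_toy, y_pred_toy)

# average=None returns one score per label; "macro" is their unweighted mean
per_label_f1 = f1_score(y_true_toy, y_pred_toy, average=None, zero_division=0)
macro_f1 = f1_score(y_true_toy, y_pred_toy, average="macro", zero_division=0)

for name, cm, f1 in zip(label_names, per_label_cm, per_label_f1):
    tn, fp, fn, tp = cm.ravel()
    print(f"{name}: tp={tp} fp={fp} fn={fn} tn={tn} f1={f1:.2f}")
print(f"macro F1: {macro_f1:.2f}")
```

precision_score and recall_score take the same average argument. Macro averaging treats every label equally, which is usually what you want when some labels are rare.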

<span style="font-size: 20px;"> Hyperparameter Tuning </span> 

The general approach to hyperparameter tuning is to use grid search or random search. Grid search involves defining a dictionary of hyperparameter values to explore: each key is a hyperparameter name and each value is a list of candidate values. You then create an instance of the GridSearchCV class with the following input parameters:
- estimator=model (the instantiated RandomForestClassifier)
- param_grid=param_grid (the dictionary of hyperparameters)
- cv=5 (the number of cross-validation folds, usually 5 or 10)

Then we do grid_search.fit(X_train,Y_train) to fit the training data.

To access the best hyperparameters (values that yielded the best performance) and best estimator (the model object tuned with the best hyperparameters), we use grid_search.best_params_ and grid_search.best_estimator_.

Grid search is computationally expensive because it fits every combination in the grid. Cheaper alternatives include random search, which samples a fixed number of combinations, and Bayesian optimisation.
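
As a rough sketch of the random search alternative, scikit-learn's RandomizedSearchCV takes distributions instead of fixed lists and tries only n_iter combinations. The data here is synthetic (make_multilabel_classification with six labels), and the ranges, n_iter=5 and cv=3 are arbitrary choices to keep the example quick, not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic multi-label data with six labels, standing in for the real dataset
X_toy, Y_toy = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=6, random_state=0
)

# Distributions (or lists) to sample from, rather than a full grid
param_distributions = {
    "n_estimators": randint(20, 100),
    "max_depth": [None, 10, 20],
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 5),
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # only 5 sampled combinations are fitted
    cv=3,
    random_state=0,
)
random_search.fit(X_toy, Y_toy)

print(random_search.best_params_)
print(random_search.best_score_)
```

The fitted object exposes best_params_ and best_estimator_ just like GridSearchCV, so the rest of the workflow stays the same.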

In [340]:
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200], # number of trees
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Instantiate GridSearchCV
grid_search = GridSearchCV(estimator=model_bal, param_grid=param_grid, cv=5)

# Fit GridSearchCV on training data
grid_search.fit(X_train, Y_train)

# Access best hyperparameters and best estimator
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate the best model on validation data
accuracy = best_model.score(X_eval, Y_eval)

print(accuracy)

0.20192307692307693


<span style="font-size: 20px;"> Conclusion </span> 

Well, that was a fail. If I had more time I would re-do the pre-processing. It would help to have a better understanding of the data, but I'm not a biology expert. I would also look at some of the other evaluation methods and perform oversampling. I would then try other algorithms, such as XGBoost, Support Vector Machines, and maybe even a deep neural net.

<span style="font-size: 20px;"> Next steps </span> 

Next I will switch to regression, and try to find a neater dataset so that data cleaning doesn't detract from the model building. I'll spend some time on feature engineering and data cleaning properly at some point.