<a href="https://colab.research.google.com/github/Ash100/CADD_Project/blob/main/QSAR_Model_Generation%20and%20Testing_Part_15.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# QSAR - MODEL Generation and Testing

My name is **Dr. Ashfaq Ahmad** and I work in the field of Structural Biology and Bioinformatics. A step-by-step video demonstration can be found in this [**Video Tutorial**](https://youtu.be/KSz0sQM13K0).

These files are prepared for teaching and research purposes. If you want to use them for commercial purposes, please **contact us**.


**Quantitative structure–activity relationship (QSAR)** models are built by users to predict the activity of their compounds. We have already generated descriptors for our compound dataset. If you have not prepared your descriptors, please [**Prepare your dataset here**](https://youtu.be/NyAiwGwIPCM).

Here we will generate a model, train and optimize it, and finally use it to predict the activity of unknown compounds. Once you have generated and trained your model, you can save it and reuse it in the future.

I suggest that you read some literature from the field of your choice.

In [None]:
#@title Install necessary libraries
!pip install scikit-learn matplotlib seaborn

In [2]:
#@title Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np  # needed for np.nan when saving the results table

In [3]:
# Load the dataset
file_path = '/content/merged_descriptors.csv'  # Update this to your file path
data = pd.read_csv(file_path)

In [None]:
# Display the first few rows of the dataset
print(data.head())

In [6]:
#@title Define features and target variable
# Assume 'Activity' is the target variable and all other columns are features
features = data.drop(columns=['SMILES', 'Activity'])  # Drop SMILES and Activity columns
target = data['Activity']

In [7]:
#@title Handle missing values (optional, depending on your dataset)
features = features.fillna(features.mean())
target = target.fillna(target.mean())

In [8]:
#@title Normalize/Scale features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

In [9]:
#@title Split the data into training and test sets (80:20)
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.2, random_state=42)
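
The cells above impute and scale the full dataset before splitting it, so statistics from the test rows influence the training data. A minimal sketch of one alternative, assuming synthetic data in place of `merged_descriptors.csv`: wrap imputation and scaling in a scikit-learn `Pipeline` so that both are fitted on the training split only. The names `demo_X`, `demo_y`, and `leak_free_model` are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the descriptor table (assumption: 100 compounds, 5 descriptors)
rng = np.random.default_rng(42)
demo_X = rng.normal(size=(100, 5))
demo_y = demo_X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)
demo_X[rng.random(demo_X.shape) < 0.05] = np.nan  # sprinkle in some missing values

# Split first, so the test rows never influence imputation or scaling
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)

# Each step is fitted on the training split only when .fit() is called
leak_free_model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("regress", LinearRegression()),
])
leak_free_model.fit(Xd_train, yd_train)

demo_r2 = leak_free_model.score(Xd_test, yd_test)
print(f"Test R^2: {demo_r2:.3f}")
```

A fitted pipeline can also be saved as a single object, which keeps the preprocessing and the model together.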

Build and train the simple QSAR models

In [None]:
#@title Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

In [None]:
#@title Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

In [None]:
#@title Predict and evaluate the models - Linear Regression
# Linear Regression
try:
    y_pred_lr = lr_model.predict(X_test)
    lr_mse = mean_squared_error(y_test, y_pred_lr)
    lr_r2 = r2_score(y_test, y_pred_lr)
    print("\nLinear Regression Results:")
    print(f"Mean Squared Error: {lr_mse}")
    print(f"R^2 Score: {lr_r2}")
except Exception as e:
    print(f"Error in Linear Regression evaluation: {e}")
    lr_mse = None
    lr_r2 = None

In [None]:
#@title Predict and evaluate the models - Random Forest Regressor
try:
    y_pred_rf = rf_model.predict(X_test)
    rf_mse = mean_squared_error(y_test, y_pred_rf)
    rf_r2 = r2_score(y_test, y_pred_rf)
    print("\nRandom Forest Regressor Results:")
    print(f"Mean Squared Error: {rf_mse}")
    print(f"R^2 Score: {rf_r2}")
except Exception as e:
    print(f"Error in Random Forest evaluation: {e}")
    rf_mse = None
    rf_r2 = None
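
To make the two metrics above concrete, here is a small worked example that computes MSE and R^2 by hand and compares them with scikit-learn. The arrays `y_true` and `y_hat` are made-up numbers for illustration only, not results from this dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up activities and predictions (illustration only)
y_true = np.array([5.0, 6.0, 7.0, 8.0])
y_hat = np.array([5.5, 5.5, 7.5, 7.5])

# MSE: average squared difference between actual and predicted values
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE is in the same units as the activity, which makes it easier to interpret
rmse_manual = np.sqrt(mse_manual)

print(f"MSE:  {mse_manual} (sklearn: {mean_squared_error(y_true, y_hat)})")
print(f"R^2:  {r2_manual} (sklearn: {r2_score(y_true, y_hat)})")
print(f"RMSE: {rmse_manual}")
```

A lower MSE and an R^2 closer to 1 both indicate a better fit.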

In [None]:
#@title Save performance metrics to CSV
metrics_data = {
    'Model': ['Linear Regression', 'Random Forest'],
    'Mean Squared Error': [lr_mse, rf_mse],
    'R^2 Score': [lr_r2, rf_r2]
}

metrics_df = pd.DataFrame(metrics_data)
metrics_df.to_csv('/content/model_performance_metrics.csv', index=False)
print("Model performance metrics saved to /content/model_performance_metrics.csv")

In [None]:
#@title save output result file in CSV
output_file_path = '/content/qsar_results.csv'
results = pd.DataFrame({
    'SMILES': data['SMILES'],
    'Actual Activity': target,
    'Predicted Activity (LR)': lr_model.predict(features_scaled) if lr_mse is not None else np.nan,
    'Predicted Activity (RF)': rf_model.predict(features_scaled) if rf_mse is not None else np.nan
})
results.to_csv(output_file_path, index=False)
print(f"QSAR results saved to {output_file_path}")

In [None]:
#@title Plot Actual vs. Predicted Activity for Linear Regression
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.scatterplot(x=y_test, y=y_pred_lr, alpha=0.7)
plt.xlabel('Actual Activity')
plt.ylabel('Predicted Activity (LR)')
plt.title('Actual vs. Predicted Activity (Linear Regression)')

# Add the ideal line where y = x
min_val = min(y_test.min(), y_pred_lr.min())
max_val = max(y_test.max(), y_pred_lr.max())
plt.plot([min_val, max_val], [min_val, max_val], color='red', linestyle='--', label='Ideal Line')
plt.legend()

In [None]:
#@title Plot Residuals for Linear Regression
plt.subplot(1, 2, 2)
residuals_lr = y_test - y_pred_lr
sns.histplot(residuals_lr, kde=True)
plt.xlabel('Residuals')
plt.title('Residuals Plot (Linear Regression)')

plt.tight_layout()
plt.show()

In [None]:
#@title Feature Importance Plot for Linear Regression
importance_df = pd.DataFrame({
    'Feature': features.columns,
    'Coefficient': lr_model.coef_
})
importance_df = importance_df.sort_values(by='Coefficient', ascending=False)

plt.figure(figsize=(12, 8))
sns.barplot(x='Coefficient', y='Feature', data=importance_df)
plt.title('Feature Importance (Linear Regression)')
plt.show()
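
Sorting by the signed coefficient places strongly negative descriptors at the bottom of the plot, even though they may influence the prediction as much as the positive ones. Because the features are standardized, one option is to rank by absolute coefficient and keep the sign for interpretation. The sketch below uses synthetic data; the descriptor names `desc_a`, `desc_b`, and `desc_c` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic descriptors (assumption): desc_b has the strongest, negative effect
rng = np.random.default_rng(0)
demo_features = pd.DataFrame(
    rng.normal(size=(200, 3)), columns=["desc_a", "desc_b", "desc_c"]
)
demo_target = (
    0.5 * demo_features["desc_a"]
    - 4.0 * demo_features["desc_b"]
    + rng.normal(scale=0.1, size=200)
)

# Standardize so that coefficient magnitudes are comparable across descriptors
demo_scaled = StandardScaler().fit_transform(demo_features)
demo_lr = LinearRegression().fit(demo_scaled, demo_target)

coef_df = pd.DataFrame({
    "Feature": demo_features.columns,
    "Coefficient": demo_lr.coef_,
})
# Rank by magnitude; the sign still shows the direction of the effect
coef_df["AbsCoefficient"] = coef_df["Coefficient"].abs()
coef_df = coef_df.sort_values("AbsCoefficient", ascending=False).reset_index(drop=True)
print(coef_df)
```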

In [None]:
#@title Feature Importance Plot for Random Forest
importances = rf_model.feature_importances_
feature_names = features.columns
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10, 8))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Feature Importance (Random Forest)')
plt.show()

### **Optimization strategies of the model**
The baseline metrics above leave room for improvement (ideally a lower MSE and a higher R^2). So, why not perform some optimization?

In [20]:
#@title Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

In [21]:
#@title Define different models for testing with some parameter settings
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
    'Random Forest': RandomForestRegressor(),
    'Gradient Boosting': GradientBoostingRegressor()
}

In [22]:
#@title Hyperparameter grids
param_grids = {
    'Ridge Regression': {'alpha': [0.1, 1.0, 10.0, 100.0]},
    'Lasso Regression': {'alpha': [0.1, 1.0, 10.0, 100.0]},
    'Random Forest': {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20, 30]},
    'Gradient Boosting': {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 0.2], 'max_depth': [3, 5, 7]}
}
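
An exhaustive grid search fits every combination in a grid, which grows quickly (the Gradient Boosting grid above has 27 combinations per fold). `RandomizedSearchCV`, which is imported above, samples only a fixed number of combinations instead. The following is a minimal sketch on synthetic data; the small grid, `n_iter=4`, and `cv=3` are illustrative assumptions chosen to run quickly, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the scaled descriptors (assumption)
rng = np.random.default_rng(1)
demo_X = rng.normal(size=(60, 4))
demo_y = 2.0 * demo_X[:, 0] - demo_X[:, 1] + rng.normal(scale=0.2, size=60)

# Small illustrative search space (9 combinations in total)
demo_param_dist = {
    "n_estimators": [10, 20, 30],
    "max_depth": [None, 3, 5],
}

# Try only 4 randomly sampled combinations instead of all 9
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=demo_param_dist,
    n_iter=4,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
random_search.fit(demo_X, demo_y)

print("Best parameters:", random_search.best_params_)
print("Best CV MSE:", -random_search.best_score_)
```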

In [23]:
#@title Function to evaluate models
def evaluate_model(model_name, model, X_train, X_test, y_train, y_test, param_grid=None):
    if param_grid:
        search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')
        search.fit(X_train, y_train)
        best_model = search.best_estimator_
        best_params = search.best_params_
        print(f"Best parameters for {model_name}: {best_params}")
    else:
        best_model = model
        best_params = None

    best_model.fit(X_train, y_train)
    y_pred = best_model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f"\n{model_name} Results:")
    print(f"Mean Squared Error: {mse}")
    print(f"R^2 Score: {r2}")

    return best_model, mse, r2
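
`evaluate_model` reports metrics on a single 80:20 split, which can vary noticeably on small datasets. `cross_val_score`, imported above, gives a more stable estimate by averaging over several folds. The sketch below uses synthetic data and a `Ridge` model with `alpha=1.0` purely as an example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the scaled descriptors (assumption)
rng = np.random.default_rng(7)
demo_X = rng.normal(size=(100, 5))
demo_y = demo_X @ np.array([1.5, 0.0, -2.0, 0.5, 1.0]) + rng.normal(scale=0.3, size=100)

# 5-fold cross-validated R^2: each fold serves once as the held-out set
cv_r2 = cross_val_score(Ridge(alpha=1.0), demo_X, demo_y, cv=5, scoring="r2")

print("R^2 per fold:", np.round(cv_r2, 3))
print(f"Mean R^2: {cv_r2.mean():.3f} (std: {cv_r2.std():.3f})")
```

A large spread between folds suggests that a single test-set score should be interpreted with caution.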

In [None]:
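#@title Evaluate models
# Note: after this loop, best_model holds the last model evaluated, not necessarily
# the one with the lowest MSE; the "Save the Best Model" section tracks the true best.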
results = []
for model_name, model in models.items():
    param_grid = param_grids.get(model_name, None)
    best_model, mse, r2 = evaluate_model(model_name, model, X_train, X_test, y_train, y_test, param_grid)
    results.append({'Model': model_name, 'Mean Squared Error': mse, 'R^2 Score': r2})


In [None]:
#@title Save performance metrics to CSV
metrics_df = pd.DataFrame(results)
metrics_df.to_csv('/content/model_performance_metrics_advanced.csv', index=False)
print("Model performance metrics saved to /content/model_performance_metrics_advanced.csv")

In [None]:
#@title Save or output results as needed
output_file_path = '/content/qsar_results_advanced.csv'
results_df = pd.DataFrame({
    'SMILES': data['SMILES'],
    'Actual Activity': target,
    'Predicted Activity (Best Model)': best_model.predict(features_scaled) if 'best_model' in locals() else np.nan
})
results_df.to_csv(output_file_path, index=False)
print(f"QSAR results saved to {output_file_path}")

In [None]:
#@title Plot Actual vs. Predicted Activity for Best Model
plt.figure(figsize=(14, 7))

plt.subplot(1, 2, 1)
sns.scatterplot(x=y_test, y=best_model.predict(X_test), alpha=0.7)
plt.xlabel('Actual Activity')
plt.ylabel('Predicted Activity')
plt.title(f'Actual vs. Predicted Activity ({model_name})')

# Add the ideal line where y = x
min_val = min(y_test.min(), best_model.predict(X_test).min())
max_val = max(y_test.max(), best_model.predict(X_test).max())
plt.plot([min_val, max_val], [min_val, max_val], color='red', linestyle='--', label='Ideal Line')
plt.legend()

In [None]:
#@title Plot Residuals for Best Model
plt.subplot(1, 2, 2)
residuals = y_test - best_model.predict(X_test)
sns.histplot(residuals, kde=True)
plt.xlabel('Residuals')
plt.title('Residuals Plot')

plt.tight_layout()
plt.savefig('/content/best_model_plots.png')
print("Best model plots saved to /content/best_model_plots.png")

**Save the Best Model**

In [None]:
#@title Evaluate models and track the best model
best_model = None
best_model_name = None
best_mse = float('inf')  # Initialize to infinity for finding the minimum
results = []

for model_name, model in models.items():
    param_grid = param_grids.get(model_name, None)
    model, mse, r2 = evaluate_model(model_name, model, X_train, X_test, y_train, y_test, param_grid)
    results.append({'Model': model_name, 'Mean Squared Error': mse, 'R^2 Score': r2})

    if mse < best_mse:
        best_mse = mse
        best_model = model
        best_model_name = model_name

In [None]:
#@title Save performance metrics to CSV
metrics_df = pd.DataFrame(results)
metrics_df.to_csv('/content/model_performance_metrics_best.csv', index=False)
print("Model performance metrics saved to /content/model_performance_metrics_best.csv")

# Save or output results as needed
output_file_path = '/content/qsar_results_best.csv'
results_df = pd.DataFrame({
    'SMILES': data['SMILES'],
    'Actual Activity': target,
    'Predicted Activity (Best Model)': best_model.predict(features_scaled) if best_model is not None else np.nan
})
results_df.to_csv(output_file_path, index=False)
print(f"QSAR results saved to {output_file_path}")

In [None]:
#@title Save the best model
import joblib  # Ensure this line is included at the beginning of your script

# Save the best model
if best_model is not None:
    model_filename = '/content/best_model.pkl'
    joblib.dump(best_model, model_filename)
    print(f"Best model ({best_model_name}) saved to {model_filename}")
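
The saved model expects inputs scaled with the same `StandardScaler` used during training, but only the model is written to disk above. One possible approach, sketched here on synthetic data, is to save the model, the scaler, and the feature names together. The dictionary keys, the file name `qsar_bundle.pkl`, and the descriptor names are assumptions made for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic training table (assumption): two hypothetical descriptors
rng = np.random.default_rng(3)
demo_train = pd.DataFrame(rng.normal(size=(50, 2)), columns=["desc_a", "desc_b"])
demo_activity = 3.0 * demo_train["desc_a"] - demo_train["desc_b"]

demo_scaler = StandardScaler().fit(demo_train)
demo_model = Ridge(alpha=1.0).fit(demo_scaler.transform(demo_train), demo_activity)
original_pred = demo_model.predict(demo_scaler.transform(demo_train))

# Bundle everything needed to reproduce predictions in a later session
bundle = {
    "model": demo_model,
    "scaler": demo_scaler,
    "feature_names": list(demo_train.columns),
}
bundle_path = os.path.join(tempfile.mkdtemp(), "qsar_bundle.pkl")
joblib.dump(bundle, bundle_path)

# Later: restore the bundle and predict with the same preprocessing
restored = joblib.load(bundle_path)
restored_input = demo_train[restored["feature_names"]]
restored_pred = restored["model"].predict(restored["scaler"].transform(restored_input))
print("Predictions match:", np.allclose(original_pred, restored_pred))
```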

In [None]:
#@title Plot Actual vs. Predicted Activity for Best Model
plt.figure(figsize=(14, 7))

plt.subplot(1, 2, 1)
sns.scatterplot(x=y_test, y=best_model.predict(X_test), alpha=0.7)
plt.xlabel('Actual Activity')
plt.ylabel('Predicted Activity')
plt.title(f'Actual vs. Predicted Activity ({best_model_name})')

# Add the ideal line where y = x
min_val = min(y_test.min(), best_model.predict(X_test).min())
max_val = max(y_test.max(), best_model.predict(X_test).max())
plt.plot([min_val, max_val], [min_val, max_val], color='red', linestyle='--', label='Ideal Line')
plt.legend()

In [None]:
#@title Plot Residuals for Best Model
plt.subplot(1, 2, 2)
residuals = y_test - best_model.predict(X_test)
sns.histplot(residuals, kde=True)
plt.xlabel('Residuals')
plt.title('Residuals Plot')

plt.tight_layout()
plt.savefig('/content/best_model_plots.png')
print("Best model plots saved to /content/best_model_plots.png")

In [None]:
#@title Feature Importance Plot for Best Model
if best_model_name in ['Random Forest', 'Gradient Boosting']:
    importances = best_model.feature_importances_
    importance_df = pd.DataFrame({
        'Feature': features.columns,
        'Importance': importances
    })
    importance_df = importance_df.sort_values(by='Importance', ascending=False)

    plt.figure(figsize=(12, 8))
    sns.barplot(x='Importance', y='Feature', data=importance_df)
    plt.title(f'Feature Importance ({best_model_name})')
    plt.tight_layout()
    plt.savefig(f'/content/{best_model_name.lower().replace(" ", "_")}_feature_importance.png')
    print(f"Feature importance plot for {best_model_name} saved to /content/{best_model_name.lower().replace(' ', '_')}_feature_importance.png")

### **Testing Unknown Dataset with the best model**

In [None]:
#@title Load the best model
model_filename = '/content/best_model.pkl'
best_model = joblib.load(model_filename)
print(f"Best model loaded from {model_filename}")

Now calculate the same features for the unknown dataset as we calculated for the training dataset in the previous section. Bring that data here and load it below.

In [37]:
#@title Import required libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
import joblib

In [38]:
#@title Load new compounds
new_data = pd.read_csv('/content/unknown.csv')

In [39]:
#@title Extract feature columns (assuming all columns except SMILES and Activity are features)
feature_columns = [col for col in new_data.columns if col not in ['SMILES', 'Activity']]
new_data_features = new_data[feature_columns]
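
The scaler and the model expect the same descriptors, in the same order, as during training. Selecting "every column except SMILES and Activity" works only if the unknown file has exactly the same layout. A minimal sketch of a guard follows; `align_features` is a hypothetical helper, and the column names are made up for illustration.

```python
import pandas as pd

def align_features(new_df, training_columns):
    """Return new_df restricted to training_columns, in training order.

    Raises ValueError if any training descriptor is missing.
    """
    missing = [col for col in training_columns if col not in new_df.columns]
    if missing:
        raise ValueError(f"Missing descriptors in new data: {missing}")
    return new_df[list(training_columns)]

# Hypothetical training layout and an unknown file with shuffled, extra columns
training_columns = ["desc_a", "desc_b", "desc_c"]
demo_unknown = pd.DataFrame({
    "SMILES": ["CCO", "CCN"],
    "desc_c": [0.3, 0.4],
    "desc_a": [1.0, 2.0],
    "extra": [9, 9],
    "desc_b": [5.0, 6.0],
})

aligned = align_features(demo_unknown, training_columns)
print(aligned.columns.tolist())
```

In the notebook, `features.columns` from the training section could serve as `training_columns`.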


In [40]:
#@title Scale the new data if applicable
new_data_scaled = scaler.transform(new_data_features)

In [41]:
#@title Perform predictions with the best model
predictions = best_model.predict(new_data_scaled)

In [42]:
#@title Add predictions to the DataFrame
new_data['Predicted_Activity'] = predictions

In [None]:
#@title Save results to a CSV file
new_data.to_csv('/content/predicted_activities_unknown.csv', index=False)
print("Predictions saved to /content/predicted_activities_unknown.csv")
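
Once predictions are available, a typical follow-up is to rank the compounds and keep the most promising ones. The sketch below assumes that larger predicted values mean more active compounds (for example pIC50); if your activity is an IC50, sort in ascending order instead. The table `demo_predictions` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical predictions table standing in for `new_data`
demo_predictions = pd.DataFrame({
    "SMILES": ["CCO", "c1ccccc1", "CC(=O)O", "CCN"],
    "Predicted_Activity": [5.2, 7.8, 4.1, 6.5],
})

# Assumption: larger values mean more active compounds (e.g. pIC50)
top_n = 2
top_hits = (
    demo_predictions.sort_values("Predicted_Activity", ascending=False)
    .head(top_n)
    .reset_index(drop=True)
)
print(top_hits)
```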

## **Research Directions and Applications**
These are discussed in the [**Tutorial Video**](https://youtu.be/KSz0sQM13K0).

1. Virtual Screening

2. Exploratory Data Analysis

3. Generation of a Compound Library of Unknown Compounds



**Congratulations**, you did it.

I am sure you have learned something new in this one hour.
If you are happy, please subscribe to my YouTube channel [**Bioinformatics Insights**](https://www.youtube.com/@Bioinformaticsinsights).

Also, if you want to stay connected for updates, courses, and computational services, you can follow this WhatsApp channel [**BinfoLab**](https://whatsapp.com/channel/0029VajkwkdCHDydS6Y2lM36).