# Task 4.3+ Supervised Learning - Regression and hyperparameter tuning
### Module 12: Application of Machine Learning in Health Care
**Author:** Markus Schwaiger

**Date:** May 21, 2024

---

- Load the Blood-Brain Barrier dataset.
- Split the dataset into a training (75%) and a test (25%) set.
- Select a learning method such as random forest. Use preprocessing (scaling/centering) if necessary.
- Perform a 10-fold cross-validation using the trainControl parameter of the train method.
- Analyze the performance values and feature importances.
- Apply the final model to the test set and calculate performance measures.
  IMPORTANT: If you use preprocessing, you need to apply the same transformation to the test set by using the predict function.
- Update your Git repository.

In [12]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

## Load the Blood-Brain Barrier dataset

In [13]:
X = pd.read_csv("../data/BloodBrain_descriptors.csv", index_col=0)
y = pd.read_csv("../data/BloodBrain_logBBB.csv").squeeze()

## Split the dataset into training (75%) and test (25%) sets

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

- Check the split

In [15]:
print(X_train.shape, X_test.shape)

(156, 134) (52, 134)


## Set up preprocessing (scaling/centering) and apply it to the training data

In [16]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
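
Fitting the scaler on the whole training set before cross-validation means every validation fold contributes to the scaling statistics. Tree-based models such as random forests are largely insensitive to feature scaling, so the effect here is probably negligible, but for scale-sensitive learners a `Pipeline` keeps the preprocessing inside each fold. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and `pipe` are illustrative names and do not come from the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the descriptor matrix (assumption: not the real data)
X_demo, y_demo = make_regression(n_samples=120, n_features=10, noise=0.5, random_state=0)

# The scaler is re-fitted on the training part of every fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=123)),
])

scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")
pipe_cv_rmse = np.sqrt(-scores)  # scores are negative MSE values, one per fold
print(f"Pipeline CV RMSE: {pipe_cv_rmse.mean():.3f} (± {pipe_cv_rmse.std():.3f})")
```

A pipeline like this can also be passed to `GridSearchCV`; the parameter names in the grid then need the step prefix, for example `rf__max_depth`.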

## Train and evaluate the model
- Learning method: random forest

In [17]:
rf_model = RandomForestRegressor(random_state=123)
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [4, 6, 8, 10, None]
}
# Perform a 10-fold cross-validation using GridSearchCV
grid_search = GridSearchCV(
    estimator=rf_model,
    param_grid=param_grid,
    cv=10,
    scoring='neg_mean_squared_error',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
best_rf = grid_search.best_estimator_
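
The fitted search object keeps the winning hyperparameters and the per-candidate fold scores, which can be worth inspecting before moving on. The sketch below shows one way to do that on synthetic data with a deliberately small grid; `X_demo`, `y_demo`, `small_grid`, and `demo_search` are illustrative names, and the grid is not the one used above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the Blood-Brain Barrier data)
X_demo, y_demo = make_regression(n_samples=100, n_features=8, noise=0.5, random_state=0)

small_grid = {"n_estimators": [20, 40], "max_features": ["sqrt", None]}
demo_search = GridSearchCV(
    RandomForestRegressor(random_state=123),
    param_grid=small_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
demo_search.fit(X_demo, y_demo)

# Best hyperparameter combination and its mean CV score (negative MSE)
print("Best parameters:", demo_search.best_params_)
print("RMSE of the best mean CV MSE:", np.sqrt(-demo_search.best_score_))

# One row per candidate, sorted so that rank 1 comes first
results = pd.DataFrame(demo_search.cv_results_)
summary = results[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary.head())
```

Note that `np.sqrt(-best_score_)` is the square root of the mean fold MSE, which is not quite the same number as the mean of the per-fold RMSE values.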

## Analyze the performance values and feature importances

In [18]:
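# Get cross-validation scores for the best model
# (the same training data was used to select the hyperparameters, so this estimate is somewhat optimistic)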
cv_scores = cross_val_score(best_rf, X_train, y_train, cv=10, scoring='neg_mean_squared_error')
cv_rmse = np.sqrt(-cv_scores)
print(f"Cross-validated RMSE: {cv_rmse.mean()} (± {cv_rmse.std()})")

# Feature importances from the best model
feature_names = X.columns # Get the column names of the features
importances = best_rf.feature_importances_
indices = importances.argsort()[::-1]
print("Top 10 Feature Importances:")
for f in range(10):
    print(f"{f + 1}. Feature '{feature_names[indices[f]]}' ({importances[indices[f]]:.4f})")

Cross-validated RMSE: 0.5037022429795923 (± 0.08936662284154512)
Top 10 Feature Importances:
1. Feature 'tcnp' (0.0406)
2. Feature 'fnsa3' (0.0400)
3. Feature 'polar_area' (0.0340)
4. Feature 'scaa3' (0.0334)
5. Feature 'most_positive_charge' (0.0318)
6. Feature 'psa_npsa' (0.0285)
7. Feature 'tpsa' (0.0239)
8. Feature 'scaa1' (0.0211)
9. Feature 'tcsa' (0.0208)
10. Feature 'rpcg' (0.0205)
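
The values above come from `feature_importances_`, which for a scikit-learn random forest is the impurity-based importance computed on the training data and normalized to sum to 1. With 134 descriptors, many of which may be correlated, the importance gets spread thinly, which would explain why even the top feature reaches only about 0.04. Permutation importance on held-out data is a common cross-check. The sketch below runs on synthetic data; all names in it (`X_demo`, `demo_rf`, `perm`, and so on) are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with 2 informative features out of 6 (assumption: stand-in only)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, n_informative=2, noise=0.5, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

demo_rf = RandomForestRegressor(n_estimators=50, random_state=123).fit(Xtr, ytr)

# Shuffle each feature 10 times on held-out data and record the change in score
perm = permutation_importance(
    demo_rf, Xte, yte, n_repeats=10, random_state=0, scoring="neg_mean_squared_error"
)

order = perm.importances_mean.argsort()[::-1]
for i in order:
    print(
        f"feature {i}: permutation {perm.importances_mean[i]:.3f} "
        f"(± {perm.importances_std[i]:.3f}), impurity {demo_rf.feature_importances_[i]:.3f}"
    )
```

Because the scoring is negative MSE, each permutation value is the increase in MSE caused by shuffling that feature, so larger values indicate features the model relies on more.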


## Apply preprocessing to the test data

In [19]:
X_test = scaler.transform(X_test)

## Predict on the test data using the final model (best_rf)

In [20]:
y_pred = best_rf.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"RMSE: {rmse}")
print(f"R-squared: {r2}")
print(f"MAE: {mae}")

RMSE: 0.49840272245724876
R-squared: 0.4058076559864354
MAE: 0.3820643203669778
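
To make the three measures concrete, the sketch below recomputes them by hand on a tiny made-up example and checks them against the scikit-learn functions used above. The arrays `y_true` and `y_hat` are invented for illustration and have nothing to do with the logBBB values. By definition, R-squared compares the model's squared error with that of always predicting the mean of the evaluated targets, so the value of about 0.41 above means the model removes roughly 41% of that baseline error on the test set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions (assumption: illustration only)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.5])

residuals = y_true - y_hat

rmse_manual = np.sqrt(np.mean(residuals ** 2))   # root of the mean squared residual
mae_manual = np.mean(np.abs(residuals))          # mean absolute residual
ss_res = np.sum(residuals ** 2)                  # error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
r2_manual = 1.0 - ss_res / ss_tot

# The manual values agree with scikit-learn's implementations
assert np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
assert np.isclose(mae_manual, mean_absolute_error(y_true, y_hat))
assert np.isclose(r2_manual, r2_score(y_true, y_hat))

print(f"RMSE: {rmse_manual:.4f}")       # RMSE: 0.4330
print(f"MAE: {mae_manual:.4f}")         # MAE: 0.3750
print(f"R-squared: {r2_manual:.4f}")    # R-squared: 0.8500
```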
