# Task 4.4+ Supervised Learning - Classification and hyperparameter tuning
### Module 12: Application of Machine Learning in Health Care
**Author:** Markus Schwaiger

**Date:** May 21, 2024

---

- Load the COX-2 Activity Data dataset.
- Split the dataset into a training (75%) and test (25%) set.
- Select a learning method such as random forest. Use preprocessing (scaling/centering) if necessary.
- Perform a 10-fold cross-validation using the trainControl parameter of the train method (in scikit-learn: `cv=10` in `GridSearchCV`).
- Analyze the performance values and feature importances.
- Apply the final model to the test set and calculate performance measures.
IMPORTANT: If you use preprocessing, you need to apply the fitted transformation to the test set by using the predict function (in scikit-learn: the fitted scaler's `transform` method).
- Update your git-repository.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

## Load dataset COX-2 Activity

In [2]:
X = pd.read_csv("../data/cox2Descr.csv")
y = pd.read_csv("../data/cox2Class.csv").squeeze()

# Split the dataset into training (75%) and test (25%) sets

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

- Check the split

In [4]:
print(X_train.shape, X_test.shape)

(346, 255) (116, 255)
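
The split above is purely random. When the classes are unevenly represented, as the test-set confusion matrix further down suggests (26 vs. 90 samples), a stratified split keeps the class proportions equal in both subsets. The following is a minimal sketch on synthetic stand-in data; the names `X_demo`, `y_demo` and the 0/1 labels are assumptions for illustration, not the COX-2 data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: NOT the real COX-2 set)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = np.array([0] * 40 + [1] * 160)  # 20% minority class

# stratify=y_demo preserves the class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

train_share = np.mean(y_tr == 0)
test_share = np.mean(y_te == 0)
print(f"Minority share, train: {train_share:.2f}, test: {test_share:.2f}")
```

In the notebook itself this would amount to passing `stratify=y` to `train_test_split`; note that doing so changes the split and therefore the reported results.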


# Set up preprocessing (scaling/centering) and apply to the training data

In [5]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
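
Because the scaler is fitted on the full training set before cross-validation, every validation fold has already contributed to the scaling statistics. For a random forest the effect is likely negligible, since tree splits are insensitive to monotonic rescaling of features, but for scale-sensitive learners it can bias the cross-validation estimate. One possible alternative, sketched here on synthetic data (`X_demo`, `y_demo` and the step names `"scaler"` and `"rf"` are illustrative assumptions), is to wrap scaler and model in a `Pipeline` so the scaler is refitted inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the real COX-2 descriptors)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# The pipeline refits the scaler on the training part of every fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=123)),
])

# 10-fold cross-validation of the whole pipeline
scores = cross_val_score(pipe, X_demo, y_demo, cv=10, scoring="accuracy")
print("Mean CV accuracy:", scores.mean())
```

A pipeline can also be passed to `GridSearchCV`; parameter names in the grid are then prefixed with the step name, for example `rf__max_depth`.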

# Train and evaluate the model
- learning method: random forest

In [6]:
label_encoder = LabelEncoder() 
y_train_encoded = label_encoder.fit_transform(y_train) # Encode categorical labels

# Train the RandomForestClassifier
rf_model = RandomForestClassifier(random_state=123)
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [4, 6, 8, 10, None]
}
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=10, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train_encoded)
best_rf = grid_search.best_estimator_
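
After fitting, a `GridSearchCV` object exposes the selected combination in `best_params_` and the per-combination fold statistics in `cv_results_`. The sketch below shows how these could be inspected; it uses a deliberately small grid on synthetic data, and `demo_grid`, `demo_search`, `X_demo` and `y_demo` are hypothetical names rather than the notebook's objects.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: NOT the real COX-2 descriptors)
X_demo, y_demo = make_classification(n_samples=150, n_features=10, random_state=0)

# Small illustrative grid: 2 x 2 = 4 combinations
demo_grid = {"n_estimators": [20, 40], "max_depth": [4, None]}
demo_search = GridSearchCV(
    RandomForestClassifier(random_state=123),
    param_grid=demo_grid, cv=5, scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

# One row per parameter combination, ordered by rank
results = pd.DataFrame(demo_search.cv_results_)
summary = results[["params", "mean_test_score", "std_test_score", "rank_test_score"]]
print(summary.sort_values("rank_test_score"))
print("Best parameters:", demo_search.best_params_)
```

Looking at `std_test_score` alongside the mean helps judge whether the winning combination is clearly better than its neighbours or only within fold-to-fold noise.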

# Analyze the performance values and feature importance

In [7]:
accuracy = grid_search.best_score_
print("Best Accuracy:", accuracy)

# Feature importances from the best model
feature_names = X.columns # Get the column names of the features
importances = best_rf.feature_importances_
indices = importances.argsort()[::-1]
print("Top 10 Feature Importances:")
for f in range(10):
    print(f"{f + 1}. Feature '{feature_names[indices[f]]}' ({importances[indices[f]]:.4f})")

Best Accuracy: 0.8646218487394958
Top 10 Feature Importances:
1. Feature 'QikProp_QPlogKhsa' (0.0170)
2. Feature 'QikProp_QPPCaco' (0.0162)
3. Feature 'QikProp_QPlogS' (0.0158)
4. Feature 'QikProp_QPlogPo.w' (0.0125)
5. Feature 'moe2D_logS' (0.0116)
6. Feature 'QikProp_QPPMDCK' (0.0115)
7. Feature 'moe2D_logP.o.w.' (0.0115)
8. Feature 'QikProp_IP.eV.' (0.0111)
9. Feature 'QikProp_accptHB' (0.0108)
10. Feature 'QikProp_QPlogKp' (0.0106)
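
The impurity-based importances of a scikit-learn random forest are normalized to sum to 1. With 255 features a uniform share would be about 0.0039, so the top value of 0.0170 is roughly four times that, and the importance is spread over many descriptors. A labeled, sorted view is often easier to read than index arithmetic; the following sketch illustrates this on synthetic data, with `feat_0` to `feat_7` as made-up column names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names
X_arr, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

rf_demo = RandomForestClassifier(n_estimators=100, random_state=123)
rf_demo.fit(X_demo, y_demo)

# Attach feature names to the importances and sort in descending order
importance_series = pd.Series(
    rf_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

print(importance_series.head(5))
print("Sum of all importances:", importance_series.sum())
```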


# Apply preprocessing to the test data

In [8]:
X_test = scaler.transform(X_test)

# Predict on the test data using the final model (best_rf)

In [9]:
y_pred = best_rf.predict(X_test)
y_test_encoded = label_encoder.transform(y_test) # Encode test labels with the encoder fitted on the training labels
conf_matrix = confusion_matrix(y_test_encoded, y_pred)
accuracy = accuracy_score(y_test_encoded, y_pred)

print("Confusion Matrix:")
print(conf_matrix)
print("\nAccuracy:", accuracy)

Confusion Matrix:
[[ 7 19]
 [ 6 84]]

Accuracy: 0.7844827586206896
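
Accuracy alone hides how differently the two classes are handled. Per-class recall and precision, the balanced accuracy, and the majority-class baseline can all be derived from the confusion matrix printed above (rows are true classes, columns are predicted classes, following scikit-learn's convention). The sketch below recomputes them with NumPy only.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class
cm = np.array([[7, 19],
               [6, 84]])

accuracy = np.trace(cm) / cm.sum()                   # correct / total
recall_per_class = np.diag(cm) / cm.sum(axis=1)      # correct / true count per class
precision_per_class = np.diag(cm) / cm.sum(axis=0)   # correct / predicted count per class
balanced_accuracy = recall_per_class.mean()          # mean of the per-class recalls
majority_baseline = cm.sum(axis=1).max() / cm.sum()  # always predicting the larger class

print("Accuracy:", round(accuracy, 4))
print("Recall per class:", recall_per_class.round(4))
print("Precision per class:", precision_per_class.round(4))
print("Balanced accuracy:", round(balanced_accuracy, 4))
print("Majority-class baseline:", round(majority_baseline, 4))
```

Only 7 of the 26 samples of class 0 are recovered (recall of about 0.27), while class 1 reaches about 0.93. The balanced accuracy is therefore about 0.60, and the overall accuracy of 0.784 is only marginally above the 0.776 obtained by always predicting the majority class.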
