# Supervised Machine Learning

This notebook demonstrates a streamlined classical machine learning (ML) workflow. We will build a selection of models using multiple algorithms and techniques, then compare their performance. As in the [previous episode](1_data_explore.ipynb), we will use the [Indian Liver Patient Dataset](https://www.kaggle.com/datasets/jeevannagaraj/indian-liver-patient-dataset).

## Key Objectives
- Apply **multiple classical ML algorithms**
- Perform **feature selection** and **preprocessing**
- Use **cross-validation** and **hyperparameter tuning**
- Compare model **performance** and **interpretability**

## 1. Importing Packages and Loading Data

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn essentials
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, roc_auc_score, classification_report, confusion_matrix

# Classical ML algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8')

In [None]:
# load in the augmented data


## Data Exploration

A reminder of the shape and distribution of the data:

In [None]:
# View the shape, check whether any missing values snuck in, and see how the data is distributed:
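
A minimal sketch of these checks, run on a small made-up DataFrame rather than the real data. The column names `age`, `bilirubin` and `target` are hypothetical stand-ins, so swap in your own DataFrame and column names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded data (not the real dataset)
df = pd.DataFrame({
    "age": [45, 62, 38, 51, 29, 70],
    "bilirubin": [0.7, 3.1, np.nan, 1.2, 0.8, 5.4],
    "target": [0, 1, 0, 1, 0, 1],
})

print(df.shape)                           # (rows, columns)

missing_per_column = df.isnull().sum()    # missing values in each column
print(missing_per_column)

print(df.describe())                      # summary statistics per numeric column

class_counts = df["target"].value_counts()  # class balance of the target
print(class_counts)
```

Checking the class balance here is worthwhile because an imbalanced target changes how accuracy should be read later on.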

## 2. Feature Analysis & Selection

In [None]:
# Separate the data from the target column
X = 
y = 
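
One possible way to fill in this step, sketched on synthetic data from `make_classification`. The column names (`feature_0` ... `target`) and the choice of `k=3` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data; column names are hypothetical
X_arr, y_arr = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=42
)
df = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])
df["target"] = y_arr

# Separate the features from the target column
X = df.drop(columns="target")
y = df["target"]

# Score every feature with the ANOVA F-test and rank them
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
feature_scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(feature_scores)
```

Scoring on the full dataset like this is fine for a first look at the features. When the selector feeds a model, fit it on the training split only so that no information leaks from the test set.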

## 3. Data Preprocessing & Splitting

In [None]:
# Split the data with train_test_split
X_train, X_test, y_train, y_test = 

# Scale the features with StandardScaler
scaler = 
X_train_scaled = 
X_test_scaled = 
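
A hedged sketch of the split, scale and select sequence on synthetic data. The `test_size=0.2`, `stratify=y` and `k=5` settings are assumptions, as is the idea that the `X_train_selected` and `X_test_selected` variables used in later cells come from a `SelectKBest` step like this one.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Hold out 20% for testing, keeping the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Keep the k highest-scoring features, again fitted on the training data only
selector = SelectKBest(score_func=f_classif, k=5)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

print(X_train_selected.shape, X_test_selected.shape)
```

Calling `fit_transform` on the training split and only `transform` on the test split keeps test-set statistics out of the preprocessing.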

## 4. Model Training & Evaluation

![Grid search cross-validation](../assets/grid_search_cross_validation.png)

In [None]:
# Define models in a dictionary:

models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(random_state=42, probability=True),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42)
}

In [None]:
# how to train and predict with a model

# initialise model
rf_model = RandomForestClassifier()

# train the model with .fit()

# predict on new data with .predict()

# get the probability with .predict_proba()
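
The three steps above might look like this on synthetic data. The data and the 80/20 split are stand-ins, not the notebook's real variables.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialise the model
rf_model = RandomForestClassifier(random_state=42)

# Train the model with .fit()
rf_model.fit(X_train, y_train)

# Predict class labels on new data with .predict()
y_pred = rf_model.predict(X_test)

# Get per-class probabilities with .predict_proba():
# one column per class, each row sums to 1
y_proba = rf_model.predict_proba(X_test)

print(y_pred[:5])
print(y_proba[:5])
```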


In [None]:
# Iterate through, train and evaluate each model:
results = {}

for name, model in models.items():
    # Train model
    model.fit(X_train_selected, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test_selected)

    # Get class-1 probabilities where the model supports them
    if hasattr(model, 'predict_proba'):
        y_pred_proba = model.predict_proba(X_test_selected)[:, 1]
    else:
        y_pred_proba = None
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    # ROC-AUC needs probabilities; record NaN when they are unavailable
    if y_pred_proba is not None:
        roc_auc = roc_auc_score(y_test, y_pred_proba)
    else:
        roc_auc = np.nan
    
    # Cross-validation score
    cv_score = cross_val_score(model, X_train_selected, y_train, cv=5, scoring='accuracy').mean()
    
    results[name] = {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'ROC-AUC': roc_auc,
        'CV Score': cv_score
    }

# Display results
results_df = pd.DataFrame(results).T
print("Model Performance Comparison:")
print("=" * 60)
print(results_df.round(3))

![Scores diagram](../assets/Confusion-matrix-Precision-Recall-Accuracy-and-F1-score.png)
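
To connect the diagram to the numbers, here is a small worked example that computes each score by hand from the confusion matrix. The labels are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Accuracy={accuracy:.3f}, Precision={precision:.3f}, Recall={recall:.3f}, F1={f1:.3f}")
```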

## 4b. Model Comparison & Visualisation

In [None]:
# Visualise model performance:
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Accuracy comparison:
axes[0, 0].bar(results_df.index, results_df['CV Score'], color='skyblue')
axes[0, 0].set_title('Model Accuracy (CV) Comparison')
axes[0, 0].set_ylabel('Accuracy')
axes[0, 0].tick_params(axis='x', rotation=45)
axes[0, 0].set_ylim(0, 1)

# F1-Score comparison:
axes[0, 1].bar(results_df.index, results_df['F1-Score'], color='lightgreen')
axes[0, 1].set_title('F1-Score Comparison')
axes[0, 1].set_ylabel('F1-Score')
axes[0, 1].tick_params(axis='x', rotation=45)
axes[0, 1].set_ylim(0, 1)

# Generate line indicating random:
axes[1, 0].plot([0, 1], [0, 1], 'k--', label='Random Classifier')

# Loop through models :
for model_name, model in models.items():
    
    # Get probability predictions:
    if hasattr(model, "predict_proba"):
        y_proba = model.predict_proba(X_test_selected)[:, 1]
    elif hasattr(model, "decision_function"):
        y_proba = model.decision_function(X_test_selected)
    else:
        continue
    
    # Calculate ROC curve:
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    
    # Plot:
    # Double quotes outside so the nested single quotes also parse on Python < 3.12
    axes[1, 0].plot(fpr, tpr, label=f"{model_name} (AUC = {results[model_name]['ROC-AUC']:.3f})")

axes[1, 0].set_xlabel('False Positive Rate')
axes[1, 0].set_ylabel('True Positive Rate')
axes[1, 0].set_title('ROC Curves Comparison')
axes[1, 0].legend(loc='lower right')
axes[1, 0].set_xlim([0.0, 1.0])
axes[1, 0].set_ylim([0.0, 1.05])

# Overall metrics heatmap:
metrics_heatmap = results_df[['CV Score', 'Precision', 'Recall', 'F1-Score']]
sns.heatmap(metrics_heatmap, annot=True, cmap='Blues', fmt='.3f', ax=axes[1, 1]) # vmax = 1, vmin = 0,
axes[1, 1].set_title('Performance Metrics Heatmap')

plt.tight_layout()
plt.show()

In [None]:
# Why do some models have zero precision and recall?
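
One common cause, which may or may not be what is happening with this dataset, is a model that never predicts the positive class. The made-up labels below show how that gives a respectable accuracy alongside zero precision and recall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up imbalanced labels: 8 negatives, 2 positives
y_true = np.array([0] * 8 + [1] * 2)

# A model that predicts the majority class for every sample
y_pred = np.zeros(10, dtype=int)

accuracy = accuracy_score(y_true, y_pred)

# With no predicted positives, precision is 0/0; zero_division=0 reports it as 0
precision = precision_score(y_true, y_pred, zero_division=0)

# No actual positive is found, so recall is 0/2
recall = recall_score(y_true, y_pred, zero_division=0)

print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, Recall={recall:.2f}")
```

Comparing `np.unique(y_pred, return_counts=True)` against the class counts in `y_test` is a quick way to check whether a model has collapsed onto one class.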


## 5. Understanding the ROC curve

In [None]:
# Predictions from the Random Forest model 

In [None]:
# Probabilities for class 1 with Random Forest

y_proba = 

print(y_proba)

print(f"Lowest prediction probability: {min(y_proba)}")

print(f"Highest prediction probability: {max(y_proba)}")

In [None]:
# Making our own binary threshold


In [None]:
thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]

for thresh in thresholds:

    # Get the classification given the threshold
    y_pred_thresh = (y_proba >= thresh).astype(int)
    
    # Calculate confusion matrix
    cm = confusion_matrix(y_test, y_pred_thresh)
    tn, fp, fn, tp = cm.ravel()

    # Calculate TPR and FPR 
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    
    print(f"\nThreshold = {thresh}:")
    print(f"  True Positive Rate: {tpr:.3f}")
    print(f"  False Positive Rate: {fpr:.3f}")

Further reading: [ROC and AUC (Google Machine Learning Crash Course)](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)

## 6. Hyperparameter Tuning - Refining your best model

In [None]:
# Find the best model Pythonically:
best_model_name = 
print(f"\nBest performing model: {best_model_name}")
print(f"CV Accuracy: {results_df.loc[best_model_name, 'CV Score']:.3f}")
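
One way to pick the best model is `idxmax()` on the score column, sketched here with made-up scores and placeholder model names. Ranking by `'CV Score'` is an assumption; another metric may suit your problem better.

```python
import pandas as pd

# Made-up results table with the same layout as results_df
results_df = pd.DataFrame(
    {"CV Score": {"Model A": 0.71, "Model B": 0.74, "Model C": 0.69}}
)

# idxmax() returns the index label (the model name) of the highest value
best_model_name = results_df["CV Score"].idxmax()

print(best_model_name)
```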

In [None]:
# Some suggested hyperparameters; check their documentation pages for what they do and their typical ranges:
if best_model_name == 'Random Forest':
    param_grid = {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10]
    }
    best_model = RandomForestClassifier(random_state=42)
elif best_model_name == 'SVM':
    param_grid = {
        'C': [0.1, 1, 10],
        'gamma': ['scale', 'auto', 0.1, 1],
        'kernel': ['rbf', 'linear']
    }
    best_model = SVC(random_state=42, probability=True)
elif best_model_name == 'Logistic Regression':
    param_grid = {
        'C': [0.1, 1, 10, 100],
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear', 'saga']
    }
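    best_model = LogisticRegression(random_state=42)
# Illustrative grids for the remaining models, so every entry in `models` is covered
elif best_model_name == 'K-Nearest Neighbors':
    param_grid = {
        'n_neighbors': [3, 5, 7, 11],
        'weights': ['uniform', 'distance']
    }
    best_model = KNeighborsClassifier()
elif best_model_name == 'Gradient Boosting':
    param_grid = {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 0.2],
        'max_depth': [2, 3, 5]
    }
    best_model = GradientBoostingClassifier(random_state=42)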

In [None]:
# Grid search:
grid_search = 


print(f"Best parameters for {best_model_name}:")
print(grid_search.best_params_)
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")

In [None]:
# pd.DataFrame(grid_search.cv_results_)

In [None]:
# Final evaluation:
final_model = 
final_predictions = 
final_accuracy = 

print(f"\nFinal optimised model accuracy: {final_accuracy:.3f}")
print("\nDetailed Classification Report:")
# Classes are reported as encoded in y; add target_names once you have confirmed which label is which
print(classification_report(y_test, final_predictions))
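
A hedged sketch of the grid search and final evaluation steps on synthetic data. The deliberately small grid, `cv=5` and `scoring='accuracy'` are assumptions chosen to keep the example quick.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# A small illustrative grid: 2 x 2 = 4 candidate settings
param_grid = {"n_estimators": [25, 50], "max_depth": [None, 5]}

# Each candidate is scored with 5-fold cross-validation on the training data
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="accuracy"
)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")

# One row per candidate, with its mean test-fold score
cv_table = pd.DataFrame(grid_search.cv_results_)[["params", "mean_test_score"]]

# best_estimator_ is refitted on the whole training set (refit=True is the default)
final_model = grid_search.best_estimator_
final_predictions = final_model.predict(X_test)
final_accuracy = accuracy_score(y_test, final_predictions)

print(f"Final accuracy: {final_accuracy:.3f}")
```

The test set is touched only once, at the very end, so the final accuracy stays an honest estimate of performance on unseen data.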

In [None]:
# Visualise model performance:
fig, axes = plt.subplots(figsize=(15, 10))

# Generate line indicating random:
axes.plot([0, 1], [0, 1], 'k--', label='Random Classifier')

y_proba = final_model.predict_proba(X_test_selected)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
axes.plot(fpr, tpr, label=f'Tuned Model (AUC = {roc_auc_score(y_test, y_proba):.3f})')

base_model = RandomForestClassifier(random_state=42).fit(X_train_selected, y_train)
y_proba = base_model.predict_proba(X_test_selected)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
axes.plot(fpr, tpr, label=f'Base Model (AUC = {roc_auc_score(y_test, y_proba):.3f})')

axes.set_xlabel('False Positive Rate')
axes.set_ylabel('True Positive Rate')
axes.set_title('ROC Curves Comparison')
axes.legend(loc='lower right')
axes.set_xlim([0.0, 1.0])
axes.set_ylim([0.0, 1.05])

plt.show()

## 7. Models Explained

This table summarises the models we built here.

| **Model** | **How it Works** | **Pros** | **Cons** |
|-----------|------------------|----------|----------|
| **Logistic Regression** | Fits a linear equation to the features and applies a **sigmoid** to predict probability of class membership. Learns coefficients that describe how each feature affects log-odds of outcome. | - Very fast and interpretable<br>- Produces probabilities<br>- Works well on linearly separable problems | - Only linear boundaries<br>- Sensitive to multicollinearity<br>- Limited with complex feature interactions |
| **Random Forest** | An **ensemble of decision trees** trained on bootstrapped subsets of the data and features. Prediction = majority vote (classification). | - Handles non-linearities<br>- Robust to noise/outliers<br>- Works well without scaling<br>- Provides feature importance | - Can be slower on very large datasets<br>- Less interpretable<br>- May overfit if trees are deep |
| **Support Vector Machine (SVM)** | Finds a **hyperplane** that maximises the margin between classes. With kernels, can model non-linear decision boundaries. | - Effective in high-dimensional spaces<br>- Works well with clear margin separation<br>- Flexible with kernels | - Sensitive to parameter choice<br>- Can be slower on larger datasets<br>- Harder to interpret |
| **K-Nearest Neighbours (KNN)** | Classifies a new point by looking at the **majority class of its k nearest neighbours** in feature space. | - Simple, intuitive<br>- No training phase<br>- Captures local structure | - Slow at prediction on larger datasets<br>- Sensitive to scaling and irrelevant features<br>- Struggles in high dimensions |
| **Gradient Boosting** | Builds an **ensemble of weak learners (shallow trees)** sequentially, where each tree corrects errors of the previous. | - High accuracy<br>- Captures complex patterns<br>- Robust with tuning | - Computationally intensive<br>- Sensitive to hyperparameters<br>- Can overfit small datasets |
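
To make the Logistic Regression row concrete, this sketch rebuilds the predicted probabilities by hand from the learned coefficients and the sigmoid function. The data is synthetic, so only the mechanics carry over to the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with four features
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)

model = LogisticRegression().fit(X, y)

# Linear part: log-odds = X . coefficients + intercept
log_odds = X @ model.coef_.ravel() + model.intercept_[0]

# The sigmoid maps log-odds to a probability between 0 and 1
manual_proba = 1 / (1 + np.exp(-log_odds))

# Should match scikit-learn's own class-1 probabilities
sklearn_proba = model.predict_proba(X)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))

# exp(coefficient) = multiplicative change in the odds per unit increase in a feature
odds_ratios = np.exp(model.coef_.ravel())
print(odds_ratios)
```

Reading coefficients as odds ratios is what makes logistic regression comparatively easy to interpret. On scaled features, one unit corresponds to one standard deviation.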