# Machine Learning Summary: Concepts, Code, and Visualizations

This instructional notebook walks through core machine learning concepts covered in the 06_machine_learning module. It combines explanations, code, and visualizations with practical use cases.

## Objectives
- Understand regression (simple/multiple), classification (KNN, Decision Trees, Logistic Regression, SVM), and clustering (K-Means)
- Learn multi-class strategies (Softmax, One-vs-All, One-vs-One)
- Apply essential preprocessing (scaling, encoding)
- Evaluate models with appropriate metrics
- Explore regularization and hyperparameter tuning
- Connect concepts to real-world applications

## Table of Contents
0. Setup and Utilities
1. Data Preprocessing (Scaling, One-Hot Encoding)
2. Linear Regression (Simple & Multiple)
3. K-Nearest Neighbors (KNN)
4. Decision Trees (Classification)
5. Regression Trees
6. Logistic Regression
7. Support Vector Machines (SVM) & Kernels
8. Multi-class: Softmax, One-vs-All, One-vs-One
9. Clustering with K-Means + Customer Segmentation
10. Model Evaluation Metrics
11. Regularization (L1/L2) & Effects
12. Hyperparameter Tuning (Grid Search)
13. Ensemble Snapshot (Bagging/Random Forest)
14. Best Practices & Workflow

In [None]:
# Setup & Utilities
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, normalize
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.svm import SVC
from sklearn.cluster import KMeans

from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score,
    accuracy_score, confusion_matrix, classification_report,
    f1_score, jaccard_score, log_loss
)

np.random.seed(42)
sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (7, 4)

## 1) Data Preprocessing

Preprocessing puts features on comparable scales and in numeric form, which makes model comparisons fairer and training more stable.

- Feature scaling: Standardization (mean 0, std 1) or Min-Max scaling (0-1)
- One-hot encoding: Convert categorical features to indicator columns
- Train/test split: Hold out unseen data for unbiased evaluation
- Normalization: Row-wise normalization (e.g., L1) for some models
- Handle missing values and outliers appropriately
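
As a minimal sketch of the scaling and normalization bullets above, the scalers can be fit on the training split only and then reused on the held-out split. The toy array `X_demo` and every variable name below are illustrative assumptions, not part of the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, normalize

# Hypothetical toy data: 200 rows, 3 numeric features centred near 50
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

# Hold out 25% of the rows before fitting any scaler
X_tr, X_te = train_test_split(X_demo, test_size=0.25, random_state=0)

# Standardization: statistics come from the training split only
std = StandardScaler().fit(X_tr)
X_tr_std = std.transform(X_tr)
X_te_std = std.transform(X_te)  # reuses the training mean and std

# Min-Max scaling: training columns are mapped onto [0, 1]
mm = MinMaxScaler().fit(X_tr)
X_tr_mm = mm.transform(X_tr)

# Row-wise L1 normalization: absolute values in each row sum to 1
X_l1 = normalize(X_demo, norm="l1")

print("train mean after standardization:", X_tr_std.mean(axis=0).round(3))
print("test mean after standardization :", X_te_std.mean(axis=0).round(3))
```

The test-split mean is usually close to, but not exactly, zero, which is expected because the scaler never saw those rows.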

We will generate a small synthetic dataset to demonstrate scaling and encoding.

In [None]:
# Preprocessing demo: scaling and one-hot encoding on a mixed-type dataset
from sklearn.datasets import make_classification

# Create a small mixed dataset
X_num, y_bin = make_classification(n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=42)
# Add a simple categorical column with 3 categories
cats = np.random.choice(["A", "B", "C"], size=X_num.shape[0])
X_df = pd.DataFrame(X_num, columns=["feat1","feat2","feat3","feat4"]).assign(cat=cats)

num_features = ["feat1","feat2","feat3","feat4"]
cat_features = ["cat"]

# OneHotEncoder compatibility: scikit-learn 1.2 renamed `sparse` to `sparse_output`
try:
    ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
except TypeError:
    ohe = OneHotEncoder(sparse=False, handle_unknown="ignore")

preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_features),
    ("cat", ohe, cat_features)
])

X_proc = preprocess.fit_transform(X_df)
print("Original shape:", X_df.shape, "/ Processed shape:", X_proc.shape)

# Visualize distributions before/after scaling for one feature
fig, ax = plt.subplots(1,2, figsize=(10,3))
ax[0].hist(X_df["feat1"], bins=30, color="#88c")
ax[0].set_title("feat1 (raw)")
ax[1].hist(X_proc[:,0], bins=30, color="#c88")
ax[1].set_title("feat1 (standardized)")
plt.show()

In [None]:
# Utility: Confusion matrix plotter
import itertools

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix'):
    """Plot a confusion matrix; with normalize=True, each row (true class) is scaled to sum to 1."""
    if normalize:
        cm = cm.astype('float') / (cm.sum(axis=1, keepdims=True) + 1e-12)
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()

## 2) Linear Regression (Simple & Multiple)

Linear Regression models the relationship between a continuous target y and one or more features X.

- Simple Linear Regression: y = θ₀ + θ₁ x₁
- Multiple Linear Regression: y = θ₀ + Σ θⱼ xⱼ

Key points:
- Parameters estimated by Ordinary Least Squares (minimize sum of squared residuals).
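- Assumptions: linearity, independent errors, homoscedasticity (constant error variance), and approximately normal residuals (the last matters mainly for confidence intervals and tests, not for the fit itself).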
- Evaluation: MAE, MSE, RMSE, R².
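
To make the simple-regression formula and the metrics above concrete, here is a small NumPy sketch of Ordinary Least Squares for one feature. The generating values θ₀ = 2 and θ₁ = 3 and all variable names are assumptions chosen for illustration; the metrics are computed in-sample, so they are not a substitute for a held-out evaluation.

```python
import numpy as np

# Hypothetical data generated from y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=100)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=100)

# Closed-form OLS for one feature:
#   theta1 = cov(x, y) / var(x),  theta0 = mean(y) - theta1 * mean(x)
theta1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
theta0 = y.mean() - theta1 * x.mean()

# Cross-check with a generic least-squares solve on the design matrix [1, x]
A = np.column_stack([np.ones_like(x), x])
theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

# Evaluation metrics from the residuals (in-sample)
residuals = y - (theta0 + theta1 * x)
mae = np.mean(np.abs(residuals))
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print("closed form:", round(theta0, 3), round(theta1, 3))
print("lstsq      :", theta_lstsq.round(3))
print({"MAE": round(mae, 3), "MSE": round(mse, 3), "RMSE": round(rmse, 3), "R2": round(r2, 3)})
```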

We will fit a multiple linear regression on five synthetic features, compare predictions with the true values, inspect the residuals, and report metrics.

In [None]:
# Linear Regression: multiple regression on synthetic data (5 features)
from sklearn.datasets import make_regression

# Create regression dataset
X_reg, y_reg = make_regression(n_samples=400, n_features=5, n_informative=4, noise=15.0, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

lin_reg = LinearRegression()
lin_reg.fit(Xr_train, yr_train)
yr_pred = lin_reg.predict(Xr_test)

# Metrics
mae = mean_absolute_error(yr_test, yr_pred)
mse = mean_squared_error(yr_test, yr_pred)
rmse = np.sqrt(mse)
r2 = r2_score(yr_test, yr_pred)
print({"MAE": round(mae,2), "MSE": round(mse,2), "RMSE": round(rmse,2), "R2": round(r2,3)})

# Visualization: True vs Predicted and Residuals
fig, ax = plt.subplots(1,2, figsize=(12,4))
ax[0].scatter(yr_test, yr_pred, alpha=0.6)
ax[0].plot([yr_test.min(), yr_test.max()], [yr_test.min(), yr_test.max()], 'r--')
ax[0].set_title('True vs Predicted')
ax[0].set_xlabel('True y')
ax[0].set_ylabel('Predicted y')

residuals = yr_test - yr_pred
ax[1].hist(residuals, bins=30, color="#6aa")
ax[1].set_title('Residuals distribution')
plt.show()

# Coefficients insight
plt.bar(range(X_reg.shape[1]), lin_reg.coef_)
plt.title('Linear Regression Coefficients')
plt.xlabel('Feature index')
plt.ylabel('Coefficient value')
plt.show()

## 3) K-Nearest Neighbors (KNN)

KNN classifies a sample based on the majority label among its k closest training points in feature space.

Key ideas:
- Distance metric (Euclidean by default) defines “closeness”.
- Scaling features is critical because KNN is distance-based.
- k controls bias-variance: small k can overfit; large k can underfit.
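
The voting rule can be sketched from scratch in a few lines of NumPy. This is an illustrative toy, not how scikit-learn implements `KNeighborsClassifier`; the helper `knn_predict` and the seven training points are made up for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict labels by majority vote among the k nearest training points."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training points for each query row
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]
    # Majority vote per query (ties go to the smaller label)
    return np.array([np.bincount(row).argmax() for row in votes])

# Hypothetical training set: four points of class 0, three of class 1
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1])
X_query = np.array([[0.5, 0.5], [5.5, 5.5]])

pred_small = knn_predict(X_train, y_train, X_query, k=3)
pred_large = knn_predict(X_train, y_train, X_query, k=7)
print(pred_small)  # [0 1]: each query follows its local neighbourhood
print(pred_large)  # [0 0]: with k = n, every query gets the global majority class
```

Setting k to the full training-set size shows the underfitting extreme: the prediction no longer depends on where the query point lies.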

We will visualize decision boundaries in 2D and plot accuracy vs k.

In [None]:
# KNN: decision boundary and accuracy vs k
from sklearn.datasets import make_classification
from matplotlib.colors import ListedColormap

X_knn, y_knn = make_classification(n_samples=500, n_features=2, n_redundant=0, n_informative=2,
                                   n_clusters_per_class=1, class_sep=1.2, random_state=42)
Xk_train, Xk_test, yk_train, yk_test = train_test_split(X_knn, y_knn, test_size=0.3, random_state=42)

scaler_knn = StandardScaler().fit(Xk_train)
Xk_train_s = scaler_knn.transform(Xk_train)
Xk_test_s = scaler_knn.transform(Xk_test)

# Accuracy vs k
k_values = list(range(1, 21))
accs = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(Xk_train_s, yk_train)
    accs.append(accuracy_score(yk_test, knn.predict(Xk_test_s)))

plt.plot(k_values, accs, marker='o')
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.title('KNN Accuracy vs k')
plt.show()

# Decision boundary for k=5 (fixed for illustration; not selected from the curve above)
k_best = 5
knn = KNeighborsClassifier(n_neighbors=k_best).fit(Xk_train_s, yk_train)

x_min, x_max = Xk_train_s[:, 0].min() - 1, Xk_train_s[:, 0].max() + 1
y_min, y_max = Xk_train_s[:, 1].min() - 1, Xk_train_s[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

cmap_light = ListedColormap(['#FFBBBB', '#BBFFBB'])
cmap_bold = ['#FF0000', '#00AA00']
plt.contourf(xx, yy, Z, cmap=cmap_light, alpha=0.7)
plt.scatter(Xk_train_s[:, 0], Xk_train_s[:, 1], c=yk_train, cmap=ListedColormap(cmap_bold), edgecolor='k', s=25)
plt.title(f'KNN Decision Boundary (k={k_best})')
plt.xlabel('Feature 1 (scaled)')
plt.ylabel('Feature 2 (scaled)')
plt.show()

# Confusion matrix
y_pred_knn = knn.predict(Xk_test_s)
cm = confusion_matrix(yk_test, y_pred_knn)
plot_confusion_matrix(cm, classes=['Class 0','Class 1'], normalize=False, title='KNN Confusion Matrix')

## 4) Decision Trees (Classification)

Decision Trees split the feature space into regions using if-else rules.

- Nodes: decision points on features; Leaves: final class
- Splitting criteria: Gini impurity or Entropy (information gain)
- Control overfitting via max_depth, min_samples_split, min_samples_leaf
- Pros: interpretable, handles non-linear boundaries; Cons: can overfit
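
The two splitting criteria can be computed by hand for a single candidate split. The sketch below uses made-up label arrays and hypothetical helper names (`gini`, `entropy`, `impurity_decrease`); it mirrors the textbook definitions rather than scikit-learn's internal implementation.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]  # skip empty classes to avoid log2(0)
    return -np.sum(p * np.log2(p))

def impurity_decrease(parent, left, right, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

# Hypothetical node with 4 samples per class, split into two children
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left = np.array([0, 0, 0, 1])
right = np.array([0, 1, 1, 1])

print("Gini(parent)     :", gini(parent))     # 0.5
print("Entropy(parent)  :", entropy(parent))  # 1.0
print("Information gain :", round(impurity_decrease(parent, left, right), 4))
print("Gini decrease    :", impurity_decrease(parent, left, right, impurity=gini))  # 0.125
```

A tree learner evaluates many candidate splits this way and keeps the one with the largest decrease.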

We will train a tree, visualize its decision boundary, and review metrics and feature importance.

In [None]:
# Decision Tree: classification with decision boundary and metrics
from sklearn.datasets import make_classification
from matplotlib.colors import ListedColormap

X_dt, y_dt = make_classification(n_samples=600, n_features=2, n_redundant=0, n_informative=2,
                                 n_clusters_per_class=1, class_sep=1.2, random_state=7)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_dt, y_dt, test_size=0.3, random_state=7)

clf_dt = DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=7)
clf_dt.fit(Xd_train, yd_train)

# Decision boundary
x_min, x_max = X_dt[:, 0].min() - 1, X_dt[:, 0].max() + 1
y_min, y_max = X_dt[:, 1].min() - 1, X_dt[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300))
Z = clf_dt.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

cmap_light = ListedColormap(['#FFEEEE', '#EEFFEE'])
plt.contourf(xx, yy, Z, cmap=cmap_light, alpha=0.7)
plt.scatter(Xd_train[:, 0], Xd_train[:, 1], c=yd_train, cmap=ListedColormap(['#FF0000','#00AA00']), edgecolor='k', s=20)
plt.title('Decision Tree Decision Boundary (depth=4)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

# Metrics
yd_pred = clf_dt.predict(Xd_test)
print(classification_report(yd_test, yd_pred))
cm = confusion_matrix(yd_test, yd_pred)
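plot_confusion_matrix(cm, classes=['Class 0','Class 1'], normalize=True, title='Decision Tree (Normalized CM)')

# Feature importance (impurity-based; values sum to 1 across features)
for feat_name, imp in zip(['Feature 1', 'Feature 2'], clf_dt.feature_importances_):
    print(f'{feat_name}: importance = {imp:.3f}')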

## 5) Regression Trees

Decision Trees can also predict continuous targets (regression trees).

- Split criterion: minimize within-node error, typically squared error (equivalent to variance reduction); absolute error is an alternative
- Leaves output continuous values (e.g., mean of training targets in leaf)
- Evaluation: MAE, MSE, RMSE, R²
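
A single regression-tree split can be sketched as a search over thresholds on one feature, keeping the threshold with the smallest total squared error. The helper `best_split` and the six data points are hypothetical; real implementations are considerably more optimized.

```python
import numpy as np

def best_split(x, y):
    """Return the threshold on x that minimizes the summed squared error of both children."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_thr, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate identical feature values
        left, right = y_sorted[:i], y_sorted[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_thr = (x_sorted[i - 1] + x_sorted[i]) / 2.0  # midpoint threshold
            best_sse = sse
    return best_thr, best_sse

# Hypothetical data with two clearly separated groups
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 7.0, 20.0, 21.0, 22.0])

thr, sse = best_split(x, y)
left_pred = y[x <= thr].mean()   # leaf value: mean target on the left side
right_pred = y[x > thr].mean()   # leaf value: mean target on the right side
print(thr, sse)               # 6.5 4.0
print(left_pred, right_pred)  # 6.0 21.0
```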

We will train a regression tree, compare to linear regression, and visualize predictions vs truth.

In [None]:
# Regression Trees vs Linear Regression
from sklearn.datasets import make_regression

X_reg2, y_reg2 = make_regression(n_samples=600, n_features=4, n_informative=3, noise=20.0, random_state=0)
Xrt_train, Xrt_test, yrt_train, yrt_test = train_test_split(X_reg2, y_reg2, test_size=0.3, random_state=0)

# Linear Regression baseline
lin = LinearRegression().fit(Xrt_train, yrt_train)
y_pred_lin = lin.predict(Xrt_test)

# Regression Tree
rt = DecisionTreeRegressor(max_depth=5, random_state=0)
rt.fit(Xrt_train, yrt_train)
y_pred_rt = rt.predict(Xrt_test)

# Metrics
metrics_lin = {
    'MAE': mean_absolute_error(yrt_test, y_pred_lin),
    'MSE': mean_squared_error(yrt_test, y_pred_lin),
    'RMSE': np.sqrt(mean_squared_error(yrt_test, y_pred_lin)),
    'R2': r2_score(yrt_test, y_pred_lin)
}
metrics_rt = {
    'MAE': mean_absolute_error(yrt_test, y_pred_rt),
    'MSE': mean_squared_error(yrt_test, y_pred_rt),
    'RMSE': np.sqrt(mean_squared_error(yrt_test, y_pred_rt)),
    'R2': r2_score(yrt_test, y_pred_rt)
}
print('Linear Regression:', {k: round(v,3) for k,v in metrics_lin.items()})
print('Regression Tree   :', {k: round(v,3) for k,v in metrics_rt.items()})

# Plot predictions vs truth
plt.scatter(yrt_test, y_pred_lin, alpha=0.5, label='Linear')
plt.scatter(yrt_test, y_pred_rt, alpha=0.5, label='Tree')
plt.plot([yrt_test.min(), yrt_test.max()], [yrt_test.min(), yrt_test.max()], 'k--')
plt.legend()
plt.title('Regression: Predictions vs Truth')
plt.xlabel('True')
plt.ylabel('Predicted')
plt.show()

## 6) Logistic Regression

Logistic Regression models the probability of class membership using the logistic (sigmoid) function.

- Sigmoid: σ(z) = 1 / (1 + e^{-z}) with z = θᵀx
- Outputs probabilities; threshold (e.g., 0.5) maps to class labels
- Trained by maximizing the likelihood (equivalently, minimizing log loss); scikit-learn adds an L2 penalty by default
- Metrics: Accuracy, Precision/Recall/F1, Jaccard, Log Loss
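
The sigmoid, the 0.5 threshold, and log loss can be traced on three hand-picked samples. The parameter vector `theta` and the data below are assumptions made up for this sketch; they are not fitted values.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters [intercept, slope]; not fitted to anything
theta = np.array([-1.0, 2.0])

# Each row is [1, x1]; the leading 1 multiplies the intercept
X = np.array([[1.0, -1.0],
              [1.0, 0.5],
              [1.0, 2.0]])
y_true = np.array([0, 1, 1])

z = X @ theta                    # linear scores: [-3, 0, 3]
p = sigmoid(z)                   # P(y = 1 | x) for each row
y_hat = (p >= 0.5).astype(int)   # threshold at 0.5

# Binary log loss: mean negative log-likelihood of the true labels
log_loss_value = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print("probabilities:", p.round(4))
print("predictions  :", y_hat)
print("log loss     :", round(log_loss_value, 4))
```

The middle sample sits exactly on the decision boundary (z = 0, p = 0.5), so it contributes the largest share of the log loss even though its predicted label is correct.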

We will train a logistic model, visualize the decision boundary, and evaluate metrics.

In [None]:
# Logistic Regression: decision boundary and metrics
from sklearn.datasets import make_classification
from matplotlib.colors import ListedColormap

X_lr, y_lr = make_classification(n_samples=500, n_features=2, n_redundant=0, n_informative=2,
                                 n_clusters_per_class=1, class_sep=1.2, random_state=12)
Xl_train, Xl_test, yl_train, yl_test = train_test_split(X_lr, y_lr, test_size=0.3, random_state=12)

scaler_lr = StandardScaler().fit(Xl_train)
Xl_train_s = scaler_lr.transform(Xl_train)
Xl_test_s = scaler_lr.transform(Xl_test)

log_reg = LogisticRegression(solver='lbfgs')
log_reg.fit(Xl_train_s, yl_train)

# Decision boundary
grid_x, grid_y = np.meshgrid(
    np.linspace(Xl_train_s[:,0].min()-1, Xl_train_s[:,0].max()+1, 300),
    np.linspace(Xl_train_s[:,1].min()-1, Xl_train_s[:,1].max()+1, 300)
)
Z = log_reg.predict(np.c_[grid_x.ravel(), grid_y.ravel()]).reshape(grid_x.shape)
plt.contourf(grid_x, grid_y, Z, cmap=ListedColormap(['#FEE','#EEF']), alpha=0.7)
plt.scatter(Xl_train_s[:,0], Xl_train_s[:,1], c=yl_train, cmap=ListedColormap(['#F00','#00A']), edgecolor='k', s=20)
plt.title('Logistic Regression Decision Boundary')
plt.xlabel('Feature 1 (scaled)')
plt.ylabel('Feature 2 (scaled)')
plt.show()

yl_pred = log_reg.predict(Xl_test_s)
yl_proba = log_reg.predict_proba(Xl_test_s)
print(classification_report(yl_test, yl_pred))
print('Log Loss:', round(log_loss(yl_test, yl_proba), 4))
plot_confusion_matrix(confusion_matrix(yl_test, yl_pred), ['0','1'], True, 'Logistic Regression (Normalized CM)')

## 7) Support Vector Machines (SVM) & Kernels

SVM finds the hyperplane that maximizes the margin between classes.

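- Support vectors: training samples that lie on the margin (or, with a soft margin, inside it or on the wrong side); they alone determine the boundary
- Margin: width of the band separating the classes, i.e., twice the distance from the hyperplane to the closest training samples
- C (regularization):
  - Large C: approaches a hard margin, penalizing violations heavily (low bias, high variance)
  - Small C: softer margin that tolerates more violations (higher bias, lower variance)
- Kernels:
  - Linear: for data that is (approximately) linearly separable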
  - RBF: non-linear boundaries; controlled by gamma
  - Polynomial, Sigmoid (less common)
- gamma (RBF):
  - Small gamma: smoother boundary (underfit)
  - Large gamma: wiggly boundary (overfit)
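
Two of the ideas above can be checked numerically on toy inputs: the margin width of a linear SVM, which equals 2/‖w‖, and the way gamma controls how quickly RBF similarity decays with distance. The six training points and the helper `rbf_kernel_value` are illustrative assumptions; the very large `C` only approximates a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical separable data: the closest opposing points are (0, 0) and (2, 0)
X_toy = np.array([[0.0, 0.0], [-1.0, 1.0], [-1.0, -1.0],
                  [2.0, 0.0], [3.0, 1.0], [3.0, -1.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Very large C approximates a hard-margin linear SVM
clf = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)
w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)
print("support vectors:\n", clf.support_vectors_)
print("margin width   :", round(margin_width, 3))  # close to 2.0 for this toy data

def rbf_kernel_value(a, b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Similarity of two points at distance 1 under a small and a large gamma
a, b = np.array([0.0, 0.0]), np.array([1.0, 0.0])
k_small = rbf_kernel_value(a, b, gamma=0.5)
k_large = rbf_kernel_value(a, b, gamma=5.0)
print("gamma=0.5:", round(k_small, 4), "| gamma=5.0:", round(k_large, 4))
```

With the larger gamma, a point one unit away already has almost no influence, which is why large values tend to produce tightly fitted, wiggly boundaries.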

We will compare linear vs RBF kernels and visualize decision boundaries.

In [None]:
# SVM: compare linear vs RBF kernels
from sklearn.datasets import make_classification
from matplotlib.colors import ListedColormap

X_svm, y_svm = make_classification(n_samples=600, n_features=2, n_redundant=0, n_informative=2,
                                   n_clusters_per_class=1, class_sep=1.2, random_state=21)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_svm, y_svm, test_size=0.3, random_state=21)
scaler_svm = StandardScaler().fit(Xs_train)
Xs_train_s = scaler_svm.transform(Xs_train)
Xs_test_s = scaler_svm.transform(Xs_test)

models = {
    'linear': SVC(kernel='linear', C=1.0, probability=True, random_state=21),
    'rbf_low_gamma': SVC(kernel='rbf', C=1.0, gamma=0.5, probability=True, random_state=21),
    'rbf_high_gamma': SVC(kernel='rbf', C=1.0, gamma=5.0, probability=True, random_state=21)
}

for name, mdl in models.items():
    mdl.fit(Xs_train_s, ys_train)
    y_pred = mdl.predict(Xs_test_s)
    acc = accuracy_score(ys_test, y_pred)
    print(f"{name}: accuracy = {acc:.3f}")

    # Decision boundary
    x_min, x_max = Xs_train_s[:, 0].min() - 1, Xs_train_s[:, 0].max() + 1
    y_min, y_max = Xs_train_s[:, 1].min() - 1, Xs_train_s[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300))
    Z = mdl.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    plt.contourf(xx, yy, Z, cmap=ListedColormap(['#FFECEC','#ECFFEC']), alpha=0.7)
    plt.scatter(Xs_train_s[:, 0], Xs_train_s[:, 1], c=ys_train, cmap=ListedColormap(['#FF0000','#00AA00']), edgecolor='k', s=20)
    plt.title(f'SVM Decision Boundary: {name}')
    plt.xlabel('Feature 1 (scaled)')
    plt.ylabel('Feature 2 (scaled)')
    plt.show()

    cm = confusion_matrix(ys_test, y_pred)
    plot_confusion_matrix(cm, classes=['Class 0','Class 1'], normalize=True, title=f'{name} (Normalized CM)')