[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/bads/blob/master/tutorial_notebooks/8_ml_theory_and_practice_tasks.ipynb) 

# Tutorial 8 - Machine Learning Theory & Practice
In this tutorial, we revisit the ML Theory & Practice session of our BADS lecture. 

# Preliminaries



In [None]:
# Imports
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, mean_squared_error
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Load preprocessed HMEQ data from GitHub
data_url = 'https://raw.githubusercontent.com/Humboldt-WI/bads/master/data/hmeq_prepared.csv'
df = pd.read_csv(data_url)
X = df.copy() # Separate features and target
y = X.pop("BAD")

# Data partitioning using the holdout method
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Train a simple logistic regression model as benchmark
lr = LogisticRegression(max_iter=1000, random_state=123).fit(X_train, y_train)
yhat_lr = lr.predict_proba(X_test)[:, 1]
auc_lr = roc_auc_score(y_test, yhat_lr)
# MSE of the predicted probabilities (also known as the Brier score)
mse_lr = mean_squared_error(y_test, yhat_lr)
print(f"Data set size (samples x features): {X.shape[0]} x {X.shape[1]}.")
print(f"Logistic Regression test set AUC (MSE): {auc_lr:.3f} ({mse_lr:.3f})")


Data set size (samples x features): 5960 x 29.
Logistic Regression test set AUC (MSE): 0.906 (0.081)


# Bias, Variance and Overfitting
<p align="left">
  <img src="https://raw.githubusercontent.com/Humboldt-WI/demopy/main/overfitting.png" alt="Bias, variance, and overfitting" width="640" />
</p>

Image source: [GeeksforGeeks](https://www.geeksforgeeks.org/machine-learning/ml-bias-variance-trade-off/)


## Overfitting in neural networks

### An arbitrary neural network model
- Train a neural network and assess its performance on training and test data. 
- Compare the results to those of the logistic regression benchmark model.
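
One possible way to approach this task is sketched below. It is a minimal example, not a reference solution: the helper `auc_and_mse`, the architecture `(32, 16)`, and the synthetic data from `make_classification` are assumptions made so that the snippet runs on its own. In the notebook, you would pass `X_train`, `X_test`, `y_train`, and `y_test` from the preliminaries instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier


def auc_and_mse(model, X_part, y_part):
    """Return (AUC, MSE) of a fitted classifier's predicted probabilities."""
    p = model.predict_proba(X_part)[:, 1]
    return roc_auc_score(y_part, p), mean_squared_error(y_part, p)


# Synthetic stand-in data (assumption); in the notebook, use the HMEQ splits instead
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=8, random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=123),
    # Arbitrary architecture (illustrative): two hidden layers with 32 and 16 units
    "Neural Network": MLPClassifier(
        hidden_layer_sizes=(32, 16), max_iter=500, random_state=123
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = {
        "train": auc_and_mse(model, X_tr, y_tr),
        "test": auc_and_mse(model, X_te, y_te),
    }
    for part, (auc, mse) in results[name].items():
        print(f"{name} {part} set AUC (MSE): {auc:.3f} ({mse:.3f})")
```

A training score that is clearly better than the corresponding test score hints at overfitting, so it is worth reading the two rows of each model side by side.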



### Training error evolution

#### The influence of training epochs
In this part, we try to reproduce the above illustration, approximating the complexity of the NN by the **number of training iterations**. 

- Vary the number of training epochs of the neural network from 1 to 1000 in steps of 50.
- For each configuration, train the model on the training data and evaluate its performance (AUC and MSE) on both training and test data.
- Plot the training and test performance against the number of training iterations.
  - Display your results in a 1 x 2 grid of charts:
  - Let the first chart measure model performance by MSE
  - Let the second chart measure model performance by 1-AUC (i.e., AUC transformed into an error measure)
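
The steps above could be sketched as follows. This is one possible approach under stated assumptions: the helper `error_by_epochs`, the single hidden layer of 16 units, and the synthetic data are illustrative and not part of the tutorial. The sketch uses `max_iter` of `MLPClassifier` as the epoch budget; note that scikit-learn may stop training earlier once the loss stops improving, so `max_iter` is an upper bound.

```python
import warnings

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import mean_squared_error, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier


def error_by_epochs(X_tr, y_tr, X_te, y_te, epoch_grid, hidden_layer_sizes=(16,)):
    """Refit an MLP for each epoch budget; return train/test MSE and 1-AUC."""
    rows = []
    for n_epochs in epoch_grid:
        nn = MLPClassifier(
            hidden_layer_sizes=hidden_layer_sizes,
            max_iter=n_epochs,  # number of epochs for the default 'adam' solver
            random_state=123,
        )
        with warnings.catch_warnings():
            # Reaching max_iter before convergence is intended in this experiment
            warnings.simplefilter("ignore", category=ConvergenceWarning)
            nn.fit(X_tr, y_tr)
        row = {"epochs": n_epochs}
        for part, X_part, y_part in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
            p = nn.predict_proba(X_part)[:, 1]
            row[f"mse_{part}"] = mean_squared_error(y_part, p)
            row[f"one_minus_auc_{part}"] = 1 - roc_auc_score(y_part, p)
        rows.append(row)
    return pd.DataFrame(rows)


# Synthetic stand-in data (assumption); in the notebook, use the HMEQ splits instead
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=8, random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

# 1, 51, 101, ..., 951 epochs
epoch_errors = error_by_epochs(X_tr, y_tr, X_te, y_te, range(1, 1001, 50))

# 1 x 2 grid: MSE on the left, 1 - AUC on the right
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, metric, label in zip(axes, ["mse", "one_minus_auc"], ["MSE", "1 - AUC"]):
    ax.plot(epoch_errors["epochs"], epoch_errors[f"{metric}_train"], label="Training")
    ax.plot(epoch_errors["epochs"], epoch_errors[f"{metric}_test"], label="Test")
    ax.set_xlabel("Training epochs (max_iter)")
    ax.set_ylabel(label)
    ax.legend()
plt.tight_layout()
plt.show()
```

Refitting from scratch for every epoch budget keeps the code simple but repeats a lot of work. Training a single network incrementally and recording the errors at checkpoints would be a cheaper alternative.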

#### The influence of model complexity
In this part, we try to reproduce the above illustration, approximating the complexity of the NN by the **number of weights**, which is a function of the number of layers and the size of those layers. We will therefore have to train multiple NNs with different architectures.

Apart from this modification, the steps are similar to the previous task.
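
A minimal sketch of this variant is given below. The list of architectures, the helpers `count_weights` and `error_by_architecture`, the fixed `max_iter=300`, and the synthetic data are all assumptions chosen for illustration. The number of weights is read from the fitted model's `coefs_` and `intercepts_` attributes, so it includes the bias terms.

```python
import warnings

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import mean_squared_error, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier


def count_weights(fitted_mlp):
    """Number of trainable parameters (connection weights plus biases)."""
    n_coefs = sum(w.size for w in fitted_mlp.coefs_)
    n_biases = sum(b.size for b in fitted_mlp.intercepts_)
    return n_coefs + n_biases


def error_by_architecture(X_tr, y_tr, X_te, y_te, architectures, max_iter=300):
    """Fit one MLP per architecture; return train/test MSE and 1-AUC."""
    rows = []
    for layers in architectures:
        nn = MLPClassifier(hidden_layer_sizes=layers, max_iter=max_iter, random_state=123)
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=ConvergenceWarning)
            nn.fit(X_tr, y_tr)
        row = {"layers": str(layers), "n_weights": count_weights(nn)}
        for part, X_part, y_part in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
            p = nn.predict_proba(X_part)[:, 1]
            row[f"mse_{part}"] = mean_squared_error(y_part, p)
            row[f"one_minus_auc_{part}"] = 1 - roc_auc_score(y_part, p)
        rows.append(row)
    return pd.DataFrame(rows).sort_values("n_weights").reset_index(drop=True)


# Synthetic stand-in data (assumption); in the notebook, use the HMEQ splits instead
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=8, random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

# Illustrative architectures, from very small to fairly large
architectures = [(2,), (8,), (32,), (32, 16), (64, 32), (128, 64, 32)]
arch_errors = error_by_architecture(X_tr, y_tr, X_te, y_te, architectures)
print(arch_errors)

# Same 1 x 2 layout as before, with the number of weights on a log-scaled x-axis
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, metric, label in zip(axes, ["mse", "one_minus_auc"], ["MSE", "1 - AUC"]):
    ax.plot(arch_errors["n_weights"], arch_errors[f"{metric}_train"], marker="o", label="Training")
    ax.plot(arch_errors["n_weights"], arch_errors[f"{metric}_test"], marker="o", label="Test")
    ax.set_xscale("log")
    ax.set_xlabel("Number of weights")
    ax.set_ylabel(label)
    ax.legend()
plt.tight_layout()
plt.show()
```

Keeping `max_iter` and `random_state` fixed across architectures is a deliberate choice in this sketch, so that differences between the curves can be attributed mainly to the change in model size.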