**Author:** Shahab Fatemi

**Email:** shahab.fatemi@umu.se   ;   shahab.fatemi@amitiscode.com

**Created:** 2024-04-19

**Last update:** 2025-09-10

**MIT License** — Shahab Fatemi (2025); For use in the *Machine Learning in Physics* course, Umeå University, Sweden; See the full license text in the parent folder.

<hr>

📢 <span style="color:red"><strong> Note for Students:</strong></span>

* Before working on the labs, review your lecture notes.

* Please read all sections, code blocks, and comments **carefully** to fully understand the material. Throughout the labs, my instructions are provided to you in written form, guiding you through the materials step-by-step.

* All concepts covered in this lab are part of the course and may be included in the final exam.

* I strongly encourage you to work in pairs and discuss your findings, observations, and reasoning with each other.

* If something is unclear, don't hesitate to ask.

* Exercise submission is not required; these tasks are designed to help you practice, explore the concepts, and learn by doing.

* I have done my best to make the lab files as bug-free (and error-free) as possible, but remember: *there is no such thing as bug-free code.* If you find any bugs, errors, typos, or other issues, I would greatly appreciate it if you reported them to me by email. Verbal notifications do not work, as I will likely forget 🙂

ENJOY WORKING ON THIS LAB.
***

# 🛠️ Purpose and Learning Outcomes:

In this lab, you will build upon your previous lab, with a focus on avoiding overfitting by selecting better model parameters using regularization techniques:

* L1 (Lasso)
* L2 (Ridge)
* Elastic Net

At the end, you will also learn how "learning curves" are used for diagnosing underfitting and overfitting. We did not discuss them in class, so you will learn about them here.
***

In [None]:
import numpy as np
import matplotlib.pyplot as plt

"""
    Creating a list of colors based on the "tab10" colormap.
    I want to use the color set in the "tab10" colormap for my plotting.
"""
cmap = plt.colormaps["tab10"]
# "tab10" only has 10 distinct colors, so wrap around for larger indices
colors = [cmap(i % cmap.N) for i in range(21)]

# Code from the previous lab

The following code sections should look familiar to you as they are the same functions and setups used in your previous lab (Polynomials).

Previously, you explored how increasing model complexity (e.g., using high-degree polynomials) can lead to overfitting. In this lab, we will start by regenerating the data, visualizing it, and observing the behavior of high-order polynomial models once again. These steps are required for the later parts of the lab.

Compared to the previous lab, I've written a new `fit_and_plot` function to fit and visualize data for different models. This new function is a modification of `plot_multiple_fits` used in the previous lab.

In [None]:
from sklearn.model_selection import train_test_split

# Function to generate data
def generate_new_data(a=+1, b=-5, c=+3, x_range=(-2, 2), 
                      num_points=100, noise_level=3.0, seed=42):
    np.random.seed(seed)
    x = np.linspace(x_range[0], x_range[1], num_points)
    y_true = a * x**5 + b * x**3 + c * x
    y_noisy = y_true + np.random.normal(0, noise_level, num_points)
    return x, y_true, y_noisy

# Function to plot data
def plot_data(t, y_true, y_noisy, x_fit=None, y_fit=None, poly_degree=None, title="Projectile Motion"):
    plt.figure(figsize=(6, 4), dpi=200)
    plt.scatter(t, y_noisy, color=colors[1], s=40, edgecolors="k", alpha=0.6, label=f"Noisy data")
    if (y_true is not None):
        plt.scatter(t, y_true, color=colors[0], s=5, label="True y")

    if (x_fit is not None and 
        y_fit is not None):
        if (poly_degree is not None):
            label_text = f"Polynomial degree {poly_degree}"
        else:
            label_text = "Regression fit"
        plt.scatter(x_fit, y_fit, color=colors[2], s=2, label=label_text)

    # Customize plot appearance
    plt.xlabel("x", fontsize=14)
    plt.ylabel("y", fontsize=14)
    plt.title(title, fontsize=16)
    plt.legend()
    plt.grid(True, linestyle="--", color="grey", linewidth=0.5, alpha=0.6)
    plt.tight_layout()
    plt.show()
    
# ========== MAIN ==========
# Parameters for a non-linear model
n = 50    # number of points
a = +1    # 1st coefficient
b = -7    # 2nd coefficient
c = +3    # 3rd coefficient
x_range = (-2.5, 2.5)
noise_level = 7.0  # Noise level

# Generate data and add some noise
x, y_true, y_noisy = generate_new_data(a, b, c, x_range, n, noise_level)

# Randomly split data (both true and noisy) into training and validation/test sets
x_train, x_test, y_train_noisy, y_test_noisy, y_train_true, y_test_true = train_test_split(
    x, y_noisy, y_true, test_size=0.3, random_state=42)

# Plot only the training set
plot_data(x_train, y_train_true, y_train_noisy, title="Training Data")

In the following code section, we explore how the model fits the data when using different polynomial degrees. Focus on the overfitted lines.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression

# Fit (train) polynomial regression model using sklearn pipeline
def fit_polynomial_sklearn(x, y, degree=2, normalize=True):
    if(normalize):
        # Create a pipeline with normalization
        model = make_pipeline(StandardScaler(), 
                              PolynomialFeatures(degree, include_bias=False), 
                              LinearRegression())
    else:
        # Create a pipeline without normalization
        model = make_pipeline(PolynomialFeatures(degree, include_bias=False), 
                              LinearRegression())

    # We need to use x.reshape, as required by sklearn
    model.fit(x.reshape(-1, 1), y)
    return model

# Fits multiple models and plots their results
def fit_and_plot(t, y_true, y_noisy, models, labels=None):
    plt.figure(figsize=(6, 4), dpi=200)
    plt.scatter(t, y_noisy, color=colors[1], s=40, edgecolors="k", alpha=0.6, label="Noisy data")
    plt.scatter(t, y_true , color=colors[0], s=5 , label="True y")

    # Generate dense grid for smooth predictions
    t_vals = np.linspace(min(t), max(t), 2*t.size)
    
    for idx, model in enumerate(models):
        y_vals = model.predict(t_vals[:, np.newaxis])
        label = labels[idx] if labels else f"Model {idx+1}"
        # Start at colors[2]: colors[0] and colors[1] are already used by the scatter plots
        plt.plot(t_vals, y_vals, color=colors[idx+2], linewidth=2, label=label)

    plt.xlabel("x", fontsize=14)
    plt.ylabel("y", fontsize=14)
    plt.title("Polynomial Regression w. Regularization", fontsize=16)
    plt.legend()
    plt.grid(True, linestyle="--", color="grey", alpha=0.6)
    #plt.tight_layout()
    plt.show()

# ========= MAIN ==========
# Fit polynomial models of different degrees to the training data
degrees = [6, 12, 21]  # Selected polynomial degrees
models  = []
labels  = []

for d in degrees:
    model = fit_polynomial_sklearn(x_train, y_train_noisy, degree=d, normalize=True)
    models.append(model)
    labels.append(f"Poly deg {d}")

# Plot data and fits
fit_and_plot(x_train, y_train_true, y_train_noisy, models, labels=labels)  

***
### ✅ Check your understanding
- Examine the plots generated above. How does changing the polynomial degree affect the model's ability to fit the data? What behavior do you observe for low vs. high-degree polynomials?

- Go through both functions implemented above line by line. Make sure you understand the purpose and logic of every part of the code.

***

# Regularization

Regularization is a fundamental ML technique used to prevent overfitting, especially when working with flexible models like high-degree polynomials. By adding a penalty term to the model's cost function, regularization prevents the model from fitting noise and less important features in the training data. In this lab, you will work with the Ridge (L2), Lasso (L1), and Elastic Net methods, and see how they influence model generalization. Before moving forward, review your lecture notes and make sure you understand how Ridge, Lasso, and Elastic Net work.
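To make the idea of a penalty term concrete, here is a minimal NumPy sketch of the three penalized costs. The function name `penalized_cost`, the argument `l1_ratio`, and the toy numbers are assumptions made for illustration only. The exact scaling constants (e.g., factors of 1/2 or 1/n) differ between textbooks and between sklearn estimators, so check your lecture notes and the sklearn documentation for the precise definitions.

```python
import numpy as np

def penalized_cost(y, y_pred, weights, reg_lambda, kind="ridge", l1_ratio=0.5):
    """Mean squared error plus a regularization penalty (illustrative only)."""
    mse = np.mean((y - y_pred) ** 2)
    l1 = np.sum(np.abs(weights))  # L1 norm: sum of absolute weights
    l2 = np.sum(weights ** 2)     # squared L2 norm: sum of squared weights

    if kind == "ridge":
        penalty = reg_lambda * l2
    elif kind == "lasso":
        penalty = reg_lambda * l1
    elif kind == "elasticnet":
        # Weighted mix of both penalties; l1_ratio=1 gives Lasso, l1_ratio=0 gives Ridge
        penalty = reg_lambda * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)
    else:
        raise ValueError(f"Unknown kind: {kind}")
    return mse + penalty

# Toy numbers (hypothetical): three targets, three predictions, two model weights
y_toy      = np.array([1.0, 2.0, 3.0])
y_toy_pred = np.array([1.0, 2.0, 5.0])
w_toy      = np.array([3.0, -4.0])

cost_ridge = penalized_cost(y_toy, y_toy_pred, w_toy, reg_lambda=0.1, kind="ridge")
cost_lasso = penalized_cost(y_toy, y_toy_pred, w_toy, reg_lambda=0.1, kind="lasso")
cost_elnet = penalized_cost(y_toy, y_toy_pred, w_toy, reg_lambda=0.1, kind="elasticnet")

print(f"Ridge cost:      {cost_ridge:.4f}")
print(f"Lasso cost:      {cost_lasso:.4f}")
print(f"ElasticNet cost: {cost_elnet:.4f}")
```

Note how the same set of weights is penalized differently: the squared L2 term is dominated by the largest weights, while the L1 term grows linearly with each weight.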

The code section below fits polynomial regression models to noisy data, both with and without regularization. To streamline the process, I use two functions. The first one, `polynomial_regularization`, builds a modeling pipeline using sklearn. It first normalizes the input features with `StandardScaler`, then generates polynomial features of a chosen degree, and finally applies the chosen regularized regression method (Ridge, Lasso, or Elastic Net). The regularization parameter is `reg_lambda`. The unregularized (plain linear regression) fit is done with `fit_polynomial_sklearn` from above.

Once the models are trained, `fit_and_plot` is called to visualize the results. These functions allow you to explore how regularization affects polynomial regression in practice.


In [None]:
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Function to fit polynomial regression with regularization
# Here I also show you how to document your functions so that they are 
# more informative and can easily be processed by documentation tools, like Sphinx. 
# I strongly recommend that you follow this style in your future work.
def polynomial_regularization(t, y, regularization, reg_lambda, poly_degree):
    """
    Parameters:
    - t (array): Input values (1D), e.g., time.
    - y (array): Measured (noisy) target values.
    - regularization (class): sklearn estimator class to fit (e.g., Ridge, Lasso, ElasticNet).
    - reg_lambda (float): Regularization parameter, passed to the estimator as `alpha`.
    - poly_degree (int): Degree of the polynomial to fit.
    """
    # Create a pipeline with normalization and polynomial regression
    model = make_pipeline( StandardScaler(),                 # Normalize features
                           PolynomialFeatures(poly_degree, include_bias=False),  # Create polynomial features
                           regularization(alpha=reg_lambda, max_iter=5000)  # Apply regularization
                        )

    # Reshape t to be a column vector
    # As said earlier, reshape is necessary because sklearn expects input features to 
    # be 2D (n_samples, n_features)
    # If t is a 1D array, reshape it to a 2D array with one column. 
    # Instead of reshape, one can use np.newaxis.
    model.fit( t.reshape(-1,1), y ) # Fit the model

    return model

⚠️ When you run the code block below, you may get warnings. Ignore them and move on for now. They are related to parameters of the regularization models.

In [None]:
# ========= MAIN ==========
# Define degrees of polynomials to fit
poly_degree = 21 # polynomial degree

ridge_lambda = 1.0  # Ridge regularization factor
lasso_lambda = 1.0  # Lasso regularization factor
elnet_lambda = 1.0  # ElasticNet regularization factor

models  = [fit_polynomial_sklearn   (x_train, y_train_noisy, poly_degree),
           polynomial_regularization(x_train, y_train_noisy, Ridge, 
                                     reg_lambda=ridge_lambda, poly_degree=poly_degree),
           polynomial_regularization(x_train, y_train_noisy, Lasso, 
                                     reg_lambda=lasso_lambda, poly_degree=poly_degree),
           polynomial_regularization(x_train, y_train_noisy, ElasticNet, 
                                     reg_lambda=elnet_lambda, poly_degree=poly_degree)]

labels  = [f"Polynom (d={poly_degree})", 
           rf"Ridge (d={poly_degree}, $\lambda={ridge_lambda}$)", 
           rf"Lasso (d={poly_degree}, $\lambda={lasso_lambda}$)", 
           rf"ElasticNet (d={poly_degree}, $\lambda={elnet_lambda}$)"]

# Plot data and fits
fit_and_plot(x_train, y_train_true, y_train_noisy, models, labels=labels)

In [None]:
# Print model coefficients
for i, model in enumerate(models):
    # Get the regression model (the last step in the pipeline)
    mymodel   = model.steps[-1][1]  # e.g., LinearRegression, Ridge, etc.
    coefs     = mymodel.coef_
    intercept = mymodel.intercept_

    print(f"Model {i+1}:")
    print(f"  Intercept: {intercept:.4f}")
    print(f"  Coefficients: {coefs}")
    print(35*"-")

***
### ✅ Check your understanding

- Study the figure above and make sure you understand the impact of different regularization techniques on polynomial regression.
- What are the differences between Ridge, Lasso, and ElasticNet regression? See your lecture notes.
- Did regularization help reduce overfitting in your model? How can you verify this without visual inspection? For multi-dimensional data, visualization is often not feasible.
- Examine the model coefficients printed in the previous code block. Compare how the coefficients differ for different models. Do the results align with your expectations, especially regarding the impact of regularization?
- Why is it important to use a `pipeline` in our code developments?

***
### 💡 Reflect and Run
- Change the `poly_degree` to 5, 13, and 19 to see how it affects the model fits.

- We have a few more hyperparameters to tune: `ridge_lambda`, `lasso_lambda`, and `elnet_lambda`. Experiment with different values for these parameters to see how they impact the model performance.

Side note: You are not limited to the data I've used for training. You can create your own dataset that follows any function. For example, generate data for a damped oscillator: 

${y = A \cdot \exp{(-ct)} \cos(\omega t + \phi)}$, 
where `A`, `c`, $\omega$, and $\phi$ are parameters you can set.
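As a starting point, the damped oscillator formula can be turned into a data generator along the following lines. This is only a sketch: the function name `generate_damped_oscillator`, the default parameter values, and the noise level are all assumptions chosen for illustration. It returns the same three arrays as `generate_new_data`, so the output can be passed to `train_test_split` and the fitting functions used in this lab.

```python
import numpy as np

def generate_damped_oscillator(A=1.0, c=0.5, omega=2.0 * np.pi, phi=0.0,
                               t_range=(0.0, 5.0), num_points=100,
                               noise_level=0.05, seed=42):
    """Noisy samples of y = A * exp(-c t) * cos(omega t + phi) (illustrative defaults)."""
    rng = np.random.default_rng(seed)  # local random generator for reproducibility
    t = np.linspace(t_range[0], t_range[1], num_points)
    y_true = A * np.exp(-c * t) * np.cos(omega * t + phi)
    y_noisy = y_true + rng.normal(0.0, noise_level, num_points)
    return t, y_true, y_noisy

# Example call with the (hypothetical) default parameters
t_osc, y_osc_true, y_osc_noisy = generate_damped_oscillator()
print(t_osc.shape, y_osc_true[0])
```

Keep in mind that an oscillating signal with several periods is much harder to approximate with a polynomial than the data used above, so expect to revisit your choice of polynomial degree and regularization strength.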

***
# ⛷️ Exercise

Focus on L1 and L2 norm regularizations: Lasso and Ridge Regression, respectively. Write code that performs a complete evaluation to determine the optimal value of ${\lambda}$ (the regularization hyperparameter) for both models. Use all suitable metrics and techniques you have learned so far (e.g., cross-validation, cost function, ...) and justify your choice of ${\lambda}$ based on your findings.

***

# Learning Curve

We move to a different topic which was left from the previous lab, i.e., **learning curve**. 

A learning curve visually illustrates how a model's performance evolves as the size of the training dataset increases. It shows the connection between the model's effectiveness (e.g., metrics like accuracy, loss, or RMSE) and the amount of training data used. This analysis again helps us to identify underfitting and overfitting conditions for our model.

In the code below, we calculate the learning curve for a model by evaluating its performance across varying training set sizes. The code uses the `learning_curve` function from sklearn to compute training and validation scores through cross-validation. The scores are returned as negative RMSE (this is how `learning_curve` reports them for this scoring option), so we convert them to positive RMSE values. We then analyze how the model's performance evolves as more data is used for training, which provides insight into underfitting or overfitting behavior of the model. We test it on the same noisy polynomial data used earlier in this lab.
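The sign convention deserves a short illustration. sklearn scorers follow a "greater is better" rule, so error metrics such as RMSE are reported with a negative sign. The sketch below uses `cross_val_score` on a small synthetic dataset; the data, the variable names, and the shuffled `KFold` split are assumptions made for this example and are not part of the lab code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Small synthetic dataset (hypothetical): a noisy straight line
rng = np.random.default_rng(0)
X_demo = np.linspace(-1.0, 1.0, 40).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel() + rng.normal(0.0, 0.3, 40)

# 5-fold cross-validation with shuffling, one score per fold
cv = KFold(n_splits=5, shuffle=True, random_state=0)
neg_rmse = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv,
                           scoring="neg_root_mean_squared_error")

# Flip the sign to recover the usual (non-negative) RMSE
rmse = -neg_rmse

print("Scores returned by sklearn:", np.round(neg_rmse, 3))
print("RMSE per fold:             ", np.round(rmse, 3))
```

The same sign flip is applied to the arrays returned by `learning_curve` when this scoring option is used.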

Before you run the code, read about the learning curve: 
https://en.wikipedia.org/wiki/Learning_curve_(machine_learning)

and study:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html

In [None]:
from sklearn.model_selection import learning_curve

def plot_learning_curve(t, y_noisy, 
                        model, 
                        cv=5, y_lim=(-1, 100)):
    # Reshape the input data into a feature matrix
    t = t.reshape(-1, 1)  # sklearn expects 2D input: (n_samples, n_features)

    # Generate learning curves
    # Learning_curve is a sklearn function that computes training and test scores for different training set sizes
    # Read about it here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html
    train_sizes, train_scores, test_scores = learning_curve(
        estimator=model,    # model
        X=t,                # Time feature
        y=y_noisy,          # Noisy target variable
        train_sizes=np.linspace(0.01, 1.0, 30),  # Training set sizes from 1% to 100%
        cv=cv,              # Cross-validation splitting strategy.
        scoring='neg_root_mean_squared_error', # Negative RMSE for scoring in regression
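        shuffle=True,       # Shuffle the training data before taking subsets
        random_state=42 )   # Fix the seed so the curve is reproducible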
    
    # Calculate the mean and standard deviation of the training and validation scores
    train_scores_mean = -np.mean(train_scores, axis=1)
    test_scores_mean  = -np.mean(test_scores , axis=1)
    train_scores_std  =  np.std (train_scores, axis=1)
    test_scores_std   =  np.std (test_scores , axis=1)

    # Plotting the learning curve
    plt.figure(figsize=(6, 4), dpi=200)
       
    plt.plot(train_sizes, train_scores_mean, "--", marker='o', color=colors[0], label="Training RMSE")
    plt.plot(train_sizes, test_scores_mean , "-" , marker='s', color=colors[1], label="Validation RMSE")

    # Plot the standard deviation as a shaded region
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std, 
                                  train_scores_mean + train_scores_std, 
                                  color=colors[0], alpha=0.2)
    plt.fill_between(train_sizes, test_scores_mean  - test_scores_std , 
                                  test_scores_mean  + test_scores_std , 
                                  color=colors[1], alpha=0.2)

    # Customize plot appearance
    plt.xlabel("Training Set Size", fontsize=14)
    plt.ylabel("RMSE", fontsize=14)
    plt.ylim(y_lim)
    plt.title("Learning Curve", fontsize=16)
    plt.legend(loc="best")
    plt.grid(True, linestyle="--", color="grey", alpha=0.6)
    plt.show()

# ========== MAIN ==========
# Parameters for a non-linear model
a = +1    # 1st coefficient
b = -7    # 2nd coefficient
c = +3    # 3rd coefficient
x_range = (-2.5, 2.5)
noise_level = 7.0  # Noise level

num_samples = 50    # number of points
poly_degree = 5     # polynomial degree

# Generate data and add some noise
x, y_true, y_noisy = generate_new_data(a, b, c, x_range, num_samples, noise_level)

# Develop a pipeline to use polynomial model of degree N
model = make_pipeline(StandardScaler(), 
                      PolynomialFeatures(poly_degree, include_bias=False), 
                      LinearRegression())

# Plot learning curve, using linear regression model
plot_learning_curve(x, y_noisy, model=model, cv=5, y_lim=[-1, 100])

#### ⚠️ Important:

Let's interpret the Learning Curve above together:

In general, if both training and validation errors are high, the model is too simple to capture the true patterns in the data, thus underfitting. On the other hand, if the training error is low but the validation error remains high, the model is too complex and is fitting noise in the training data, thus overfitting.

A model that generalizes well will have both training and validation errors low and close to each other. The point where the two curves begin to converge typically indicates a good balance between bias and variance.

In our plot, the gap between training and validation RMSE suggests that the model still exhibits some variance (i.e., overfitting). Increasing the size of the training data helps reduce this gap, indicating improved generalization. However, if the gap persists and the validation error plateaus, this suggests that the model also suffers from some bias, which means there is a limit to how much the error can be improved by simply adding more data.
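If you prefer numbers over reading values off a plot, the gap between the two curves can be computed directly from the arrays returned by `learning_curve`. The sketch below is self-contained and uses its own synthetic cubic dataset, a degree-3 pipeline, and a shuffled `KFold` split; all of these, together with the variable names, are assumptions for illustration and differ from the settings used in the lab code above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (hypothetical): noisy cubic polynomial
rng = np.random.default_rng(1)
x_demo = np.linspace(-2.0, 2.0, 200)
y_demo = x_demo**3 - 2.0 * x_demo + rng.normal(0.0, 1.0, x_demo.size)

demo_model = make_pipeline(StandardScaler(),
                           PolynomialFeatures(3, include_bias=False),
                           LinearRegression())

sizes, train_sc, val_sc = learning_curve(
    demo_model, x_demo.reshape(-1, 1), y_demo,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
    scoring="neg_root_mean_squared_error")

# Average over the CV folds and flip the sign to get positive RMSE
train_rmse = -train_sc.mean(axis=1)
val_rmse   = -val_sc.mean(axis=1)
gap        = val_rmse - train_rmse  # validation minus training error

for n, tr, va, g in zip(sizes, train_rmse, val_rmse, gap):
    print(f"n={n:4d}  train RMSE={tr:6.3f}  val RMSE={va:6.3f}  gap={g:6.3f}")
```

A gap that shrinks as `n` grows points toward reduced variance, while a validation RMSE that levels off well above the noise level hints at bias. Treat these as rough indicators rather than strict rules.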

***
### 💡 Reflect and Run

- How large is the generalization error in the experiment above, and what does that suggest about your model's performance?

- Now, explore how the model behaves under different conditions by running the learning curve experiment for the following combinations of number of samples and polynomial degree:

    | Number of samples |  Polynomial degree |
    |-------------------|--------------------|
    |   50              |        2           |
    |   50              |        5           |
    |   50              |        7           |
    |   500             |        2           |
    |   500             |        5           |
    |   500             |        10          |
    |   50000           |        2           |
    |   50000           |        5           |
    |   50000           |        10          |

For each setting run the experiment and adjust the `y_lim` parameter in your `plot_learning_curve` function if needed. It is a good idea to place each run in a separate code block so you can easily compare the plots. After running the experiments, analyze the results:

- How large is the generalization error in each case?

- How does it change as you increase the number of training samples or the model complexity (here, polynomial degree)?

- Do you see any signs of underfitting or overfitting, or does the model achieve a good fit?

***

# ⛷️ Exercise (compulsory)

Take a pen and paper and based on your observations, sketch three different learning curves, one for each of the following scenarios:
- A model that underfits the data
- A model that overfits the data
- A model that generalizes well (good fit)

Make your sketch for a generic model.

***
END
***