In [0]:
from IPython.display import HTML
HTML("""
<style>
@import "https://cdn.jsdelivr.net/npm/bulma@0.9.4/css/bulma.min.css";
</style>
""")

# Model Complexity and Overfitting
In this exercise, you will show that simpler models with fewer features can perform better on unseen data because they are less prone to overfitting.

<article class="message">
    <div class="message-body">
        <strong>List of tasks</strong>
        <ul style="list-style: none;">
            <li>
            <a href="#poly_re">Task 1:  Re-use polynomial regression</a>
            </li>
            <li>
            <a href="#occam_train">Task 2: Train and evaluate linear models with po…</a>
            </li>
            <li>
            <a href="#plot_results">Task 3: Plot the polynomials (models)</a>
            </li>
            <li>
            <a href="#reflection">Task 4: Reflection</a>
            </li>
            <li>
            <a href="#Different_data_func">Task 5: Changing the data generating function</a>
            </li>
            <li>
            <a href="#reflection_failure">Task 6: Complex underlying true function</a>
            </li>
        </ul>
    </div>
</article>



In [0]:
import numpy as np
import matplotlib.pyplot as plt
# Set a random seed for reproducibility
np.random.seed(42)

# Generate synthetic data
n_samples = 100
X = np.linspace(0, 10, n_samples).reshape(-1, 1)
y_true = 2.5 * X.ravel() + 1.5
noise = np.random.normal(0, 1, n_samples)
y = y_true + noise

# Split the data into training and test sets (first 70% for training, last 30% for testing)
split_index = int(0.7 * n_samples)
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]


# Plot the training and test data
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.scatter(X_test, y_test, color='red', label='Test data')
plt.legend()

<article class="message task"><a class="anchor" id="poly_re"></a>
    <div class="message-header">
        <span>Task 1:  Re-use polynomial regression</span>
        <span class="has-text-right">
          <i class="bi bi-code"></i><i class="bi bi-stoplights easy"></i>
        </span>
    </div>
<div class="message-body">


Insert or re-use the least-squares polynomial regression functions you implemented last week in the cell below.


</div></article>
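
One possible way to fill in the two functions is sketched here. It assumes that NumPy's `np.vander` and `np.linalg.lstsq` are acceptable substitutes for whatever you wrote last week (for example, an explicit normal-equations solve). The names `X_demo`, `y_demo` and `w_demo` are illustrative only and are not part of the exercise.

```python
import numpy as np

def create_design_matrix(X, degree):
    # Columns are x^0, x^1, ..., x^degree (increasing powers).
    x = np.asarray(X, dtype=float).ravel()
    return np.vander(x, N=degree + 1, increasing=True)

def polynomial_regression(X, y, degree):
    # Least-squares solution of A w = y (assumption: lstsq instead of
    # solving the normal equations by hand).
    A = create_design_matrix(X, degree)
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

# Illustrative check on noise-free data from a straight line.
X_demo = np.linspace(0, 10, 20).reshape(-1, 1)
y_demo = 2.5 * X_demo.ravel() + 1.5
w_demo = polynomial_regression(X_demo, y_demo, degree=1)
print(w_demo)
```

On noise-free linear data, the fitted weights should match the intercept and slope used to generate `y_demo`, which is a quick sanity check before moving on to the noisy data.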



In [0]:
#### Re-use / create design matrix and polynomial regression functions
def create_design_matrix(X, degree):
    """
    Create a design matrix for polynomial regression.
    
    Parameters:
    - X: np.ndarray
        Input dataset (samples, 1).
    - degree: int
        Degree of the polynomial.
        
    Returns:
    - A: np.ndarray
        Design matrix with columns corresponding to X^0, X^1, ..., X^degree.
    """
    #write code/solution here ... 

def polynomial_regression(X, y, degree):
    """
    Compute the weights for polynomial regression using the least squares method.
    
    Parameters:
    - X: np.ndarray
        Input dataset (samples, 1).
    - y: np.ndarray
        Target values.
    - degree: int
        Degree of the polynomial.
        
    Returns:
    - w: np.ndarray
        Weights/coefficients for the polynomial regression.
    """
    #write code/solution here ...

<article class="message task"><a class="anchor" id="occam_train"></a>
    <div class="message-header">
        <span>Task 2: Train and evaluate linear models with polynomial features</span>
        <span class="has-text-right">
          <i class="bi bi-code"></i><i class="bi bi-stoplights medium"></i>
        </span>
    </div>
<div class="message-body">


1. Use the functions `polynomial_regression` and `create_design_matrix` to perform least-squares polynomial regression for each order in `degrees`.
2. Implement the function `compute_mse`, which returns the _mean-squared-error_ given a model's predictions and the ground-truth targets.


$$ MSE = \frac{1}{m}\sum_{i=1}^{m}(f_{\mathbf{w}}(x_{i})-y_{i})^2$$
3. For each polynomial model, calculate the _mean-squared-error_ on the training and test data (use `polynomial_regression`, `create_design_matrix`, and `compute_mse`).



</div></article>
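
The steps above can be sketched as follows. This is a minimal version, not the reference solution: the two Task 1 functions are replaced by compact stand-ins built on `np.vander` and `np.linalg.lstsq` (an assumption), and the data generation is copied from the top of the notebook so that the block runs on its own. In the MSE formula, $m$ is the number of samples in the set being evaluated.

```python
import numpy as np

def compute_mse(y_true, y_pred):
    """Mean of the squared differences between predictions and targets."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean((y_pred - y_true) ** 2))

# Stand-ins for the Task 1 functions (assumption: your own versions behave the same).
def create_design_matrix(X, degree):
    return np.vander(np.asarray(X, dtype=float).ravel(), N=degree + 1, increasing=True)

def polynomial_regression(X, y, degree):
    w, *_ = np.linalg.lstsq(create_design_matrix(X, degree), y, rcond=None)
    return w

# Same data generation and split as at the top of the notebook.
np.random.seed(42)
n_samples = 100
X = np.linspace(0, 10, n_samples).reshape(-1, 1)
y = 2.5 * X.ravel() + 1.5 + np.random.normal(0, 1, n_samples)
split_index = int(0.7 * n_samples)
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]

train_errors, test_errors, w_s = [], [], []
degrees = [1, 2, 3, 4, 5, 6]
for degree in degrees:
    # Fit on the training set only, then evaluate on both sets.
    w = polynomial_regression(X_train, y_train, degree)
    w_s.append(w)
    train_errors.append(compute_mse(y_train, create_design_matrix(X_train, degree) @ w))
    test_errors.append(compute_mse(y_test, create_design_matrix(X_test, degree) @ w))

for degree, tr, te in zip(degrees, train_errors, test_errors):
    print(f"degree {degree}: train MSE = {tr:.3f}, test MSE = {te:.3f}")
```

Because each higher-degree model contains the lower-degree ones as special cases, the training error cannot increase with the degree. The test error carries no such guarantee, which is what the reflection in Task 4 builds on.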



In [0]:
def compute_mse(y_true, y_pred):
    """Compute Mean Squared Error between true and predicted values."""
    #write code/solution here ... 

# Train and evaluate linear models with different polynomial features
train_errors = []
test_errors = []
w_s = []

degrees = [1, 2, 3, 4, 5, 6]

#write code/solution here ...

<article class="message task"><a class="anchor" id="plot_results"></a>
    <div class="message-header">
        <span>Task 3: Plot the polynomials (models)</span>
        <span class="has-text-right">
          <i class="bi bi-code"></i><i class="bi bi-stoplights easy"></i>
        </span>
    </div>
<div class="message-body">


1. Plot the predictions of the polynomial models using `X` as input.
2. Make scatterplots with both the training and test data points (in different colors).



</div></article>
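
A possible layout for this plot is sketched below. To keep the block self-contained, it regenerates the data and refits each model with `np.vander` and `np.linalg.lstsq` (an assumption). In the notebook you would instead reuse the weights already stored in `w_s`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Same data generation and split as at the top of the notebook.
np.random.seed(42)
n_samples = 100
X = np.linspace(0, 10, n_samples).reshape(-1, 1)
y = 2.5 * X.ravel() + 1.5 + np.random.normal(0, 1, n_samples)
split_index = int(0.7 * n_samples)
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]

degrees = [1, 2, 3, 4, 5, 6]
fig, ax = plt.subplots(figsize=(10, 6))

for degree in degrees:
    # Stand-in fit; in the notebook, use the weights stored in w_s instead.
    A_train = np.vander(X_train.ravel(), N=degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    # Predict over the full input range, including the test region.
    A_all = np.vander(X.ravel(), N=degree + 1, increasing=True)
    ax.plot(X.ravel(), A_all @ w, label=f"degree {degree}")

ax.scatter(X_train.ravel(), y_train, color="blue", s=15, label="Training data")
ax.scatter(X_test.ravel(), y_test, color="red", s=15, label="Test data")
ax.set_ylim(0, 30)  # keep the y-axis fixed so diverging fits do not rescale the plot
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

Plotting over all of `X` rather than only the training inputs makes it visible how each polynomial behaves in the region where it was not fitted.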



In [0]:
# Plot the results
# Insert code for question 1
# The following line keeps the y-axis limits fixed in the plot
plt.ylim(0,30)
# Insert code for question 2

<article class="message task"><a class="anchor" id="reflection"></a>
    <div class="message-header">
        <span>Task 4: Reflection</span>
        <span class="has-text-right">
          <i class="bi bi-lightbulb-fill"></i><i class="bi bi-stoplights medium"></i>
        </span>
    </div>
<div class="message-body">


Reflect upon the following questions:
1. Which model performed best on the training data?
2. Which model performed best on the test data?
3. How does the complexity (degree) of the model affect the performance on the training and test data?
4. Which model(s) show signs of overfitting? How can you tell?



</div></article>



In [0]:
# Write reflection here

<article class="message task"><a class="anchor" id="Different_data_func"></a>
    <div class="message-header">
        <span>Task 5: Changing the data generating function</span>
        <span class="has-text-right">
          <i class="bi bi-code"></i><i class="bi bi-lightbulb-fill"></i><i class="bi bi-stoplights medium"></i>
        </span>
    </div>
<div class="message-body">


How do the results change if the underlying function generating the data is changed to a second-order polynomial?
1. Re-generate the data by replacing `y_true` with $y=f(x)=x^2+1.5x-3$ in the data generation step, and rerun the other code blocks.
2. Does it still make sense to follow the strategy of Occam's razor?



</div></article>
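
The change to the data generation step could look like the sketch below. Only the `y_true` line differs from the original cell. The comparison of a degree-1 and a degree-2 fit uses `np.vander` and `np.linalg.lstsq` as stand-ins for the Task 1 functions, and the dictionary `test_mse` is an illustrative name that does not appear elsewhere in the notebook.

```python
import numpy as np

np.random.seed(42)
n_samples = 100
X = np.linspace(0, 10, n_samples).reshape(-1, 1)
x = X.ravel()

# Second-order data generating function: f(x) = x^2 + 1.5x - 3
y_true = x**2 + 1.5 * x - 3
y = y_true + np.random.normal(0, 1, n_samples)

split_index = int(0.7 * n_samples)
x_train, x_test = x[:split_index], x[split_index:]
y_train, y_test = y[:split_index], y[split_index:]

# Compare a straight line with a quadratic on the held-out data.
test_mse = {}
for degree in (1, 2):
    A_train = np.vander(x_train, N=degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    A_test = np.vander(x_test, N=degree + 1, increasing=True)
    test_mse[degree] = float(np.mean((A_test @ w - y_test) ** 2))

print(test_mse)
```

A straight line cannot follow the curvature of the new function, so its test error is expected to be much larger than that of the quadratic. The simplest model is therefore only preferable among models that are flexible enough to describe the data.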



In [0]:
# Write reflection

<article class="message task"><a class="anchor" id="reflection_failure"></a>
    <div class="message-header">
        <span>Task 6: Complex underlying true function <em>(optional)</em></span>
        <span class="has-text-right">
          <i class="bi bi-code"></i><i class="bi bi-lightbulb-fill"></i><i class="bi bi-stoplights hard"></i>
        </span>
    </div>
<div class="message-body">


Repeat tasks 1-4 using the function:

$$ f(x) = \sin(x^2) + 1.5 $$
as the data generating function (see the code below).
1. How do the polynomial models perform compared to how they performed on the data generated from a linear trend?

2. How does the complexity of the model (in terms of the degree of the polynomial features) affect its performance on the training and test data?

3. Choose `np.sin`-based kernels for the least squares fit instead. How do the models perform now?

4. Reflect upon potential issues with using Occam's razor for model selection in machine learning related tasks.




</div></article>
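
For item 3, one reading of "`np.sin`-based kernels" is to swap the polynomial columns of the design matrix for sine-based basis functions. The sketch below assumes a basis made of a constant column and $\sin(x^2)$, which presumes knowledge of the true function. Other choices, such as $\sin(kx)$ for several values of $k$, are equally valid interpretations. The helper `sine_design_matrix` and the seed passed to `default_rng` are hypothetical and not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: any seed would do
n_samples = 1000
X = np.linspace(0, 10, n_samples).reshape(-1, 1)
x = X.ravel()
y = np.sin(x**2) + 1.5 + rng.normal(0, 0.1, n_samples)

split_index = int(0.7 * n_samples)
x_train, x_test = x[:split_index], x[split_index:]
y_train, y_test = y[:split_index], y[split_index:]

def sine_design_matrix(x):
    """Hypothetical basis: a constant column and sin(x^2)."""
    return np.column_stack([np.ones_like(x), np.sin(x**2)])

# Least-squares fit with the sine-based basis instead of polynomial columns.
w, *_ = np.linalg.lstsq(sine_design_matrix(x_train), y_train, rcond=None)

y_pred_test = sine_design_matrix(x_test) @ w
test_mse = float(np.mean((y_pred_test - y_test) ** 2))
print(w, test_mse)
```

With a basis that matches the form of the data generating function, two weights suffice and the fit carries over to the test region. This is worth contrasting with the polynomial fits when reflecting on item 4.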



In [0]:
# Generate synthetic data from the sine-based function
n_samples = 1000
X = np.linspace(0, 10, n_samples).reshape(-1, 1)
y_true =  np.sin(X.ravel()**2) + 1.5
noise = np.random.normal(0, .1, n_samples)
y = y_true + noise

... 
# repeat exercise here