**Author:** Shahab Fatemi

**Email:** shahab.fatemi@umu.se   ;   shahab.fatemi@amitiscode.com

**Created:** 2024-11-xx

**Last update:** 2025-09-11

**MIT License** — Shahab Fatemi (2025); For use in the *Machine Learning in Physics* course, Umeå University, Sweden; See the full license text in the parent folder.

<hr>

📢 <span style="color:red"><strong> Note for Students:</strong></span>

* Before working on the labs, review your lecture notes.

* Please read all sections, code blocks, and comments **carefully** to fully understand the material. Throughout the labs, my instructions are provided to you in written form, guiding you through the materials step-by-step.

* All concepts covered in this lab are part of the course and may be included in the final exam.

* I strongly encourage you to work in pairs and discuss your findings, observations, and reasoning with each other.

* If something is unclear, don't hesitate to ask.

* Exercise submission is not required; these tasks are designed to help you practice, explore the concepts, and learn by doing.

* I have done my best to make the lab files as bug-free (and error-free) as possible, but remember: *there is no such thing as bug-free code.* If you notice any bugs, errors, typos, or other issues, I would greatly appreciate it if you reported them to me by email. Verbal notifications do not work, as I will likely forget 🙂

ENJOY WORKING ON THIS LAB.
***

# 🛠️ Purpose and Learning Outcomes:

This section builds on what you learned about Gradient Descent earlier. Here, I have developed longer code examples that use animations to show how GD works. You will see how the algorithm moves toward a solution, both in simple cases and in problems that involve local minima.

***

In [None]:
import sys
import os
sys.path.append(os.path.abspath('../utils'))
from notebook_config import *

## Gradient Descent Implementation from Scratch

In the code below, I've implemented the Gradient Descent algorithm to optimize the parameters of a function by iteratively minimizing its cost. The code, while simple in principle, might not be so easy for Python beginners. However, I have included it here because I use it to demonstrate GD visually. Since I know that some of you have little or no experience with Python, I will explain the code step by step. If you do not want to learn the code itself, you can skip ahead to running it and playing with it.

### Step by step:
- **Goal:** To develop a full visualization of GD optimization for a nonlinear hypothesis model.

1. First, the data is generated using a known function with two parameters, $h(x)=w_0 / ((x - w_1)^2 + 1)$. This model is based on a Cauchy-like function, and noise is added to simulate real-world examples.

2. A cost function is defined using MSE between the model predictions and noisy data. The script computes a grid of cost values over a wide range of possible parameters to visualize the loss surface.

3. The gradient of the cost function is calculated with respect to both parameters. These gradients are then used to update the parameters iteratively in the gradient descent loop. At each step, the updated parameter values are recorded.

4. During the optimization process, a live plot is updated to show the path of the parameters over the contour map of the cost function. Below the contour plot, a dynamic gauge shows the current cost value, scaled relative to the initial cost. 

5. Finally, the script prints both the true parameters used to generate the data and the optimized parameters found by gradient descent.
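
The steps above can be condensed into a minimal sketch that leaves out all plotting. The helper names `h`, `cost`, and `grad` are stand-ins chosen for this illustration (they mirror the ideas of the class in the next cell but are not part of it), and the noise is drawn with `np.random.default_rng`, so the numbers will differ slightly from the class output.

```python
import numpy as np

def h(w, x):
    # Cauchy-like hypothesis with amplitude w[0] and centre w[1]
    return w[0] / ((x - w[1])**2 + 1)

def cost(w, x, y):
    # Mean squared error between predictions and data
    return np.mean((h(w, x) - y)**2)

def grad(w, x, y):
    # Analytic gradient of the MSE cost with respect to w[0] and w[1]
    denom = (x - w[1])**2 + 1
    residual = h(w, x) - y
    g0 = 2 * np.mean(residual / denom)
    g1 = 2 * np.mean(residual * 2 * w[0] * (x - w[1]) / denom**2)
    return np.array([g0, g1])

# Step 1: generate noisy data from known parameters
rng = np.random.default_rng(42)
w_true = np.array([1.60, 2.25])
x = np.linspace(0, 5, 100)
y = h(w_true, x) + rng.normal(0, 0.1, size=x.size)

# Steps 2-3: start from an initial guess and follow the negative gradient
w = np.array([0.75, 1.75])
learning_rate = 0.1
cost_start = cost(w, x, y)
for _ in range(100):
    w -= learning_rate * grad(w, x, y)
cost_end = cost(w, x, y)

# Step 5: compare the result with the parameters that generated the data
print(f"True parameters:      {w_true}")
print(f"Optimized parameters: {w}")
print(f"Cost: {cost_start:.4f} -> {cost_end:.4f}")
```

Because the data are noisy, the optimized parameters should land close to, but not exactly on, the true ones.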

### In short:
First, we initialize the parameters and repeatedly update them by moving in the direction of the negative gradient of the cost function with respect to each parameter, scaled by a learning rate. This iterative process continues for a fixed number of iterations. The goal is to find the parameters that minimize the cost function. In this example, our hypothesis function is $h(x)=w_0 / ((x - w_1)^2 + 1)$, and the gradient of the cost function has been derived analytically rather than approximated numerically.
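
For reference, the analytic gradient can be written out as follows. The symbols $N$ (number of data points), $y_i$ (noisy targets), and $\eta$ (learning rate) are notation introduced here for the derivation. With the MSE cost

$$
J(w_0, w_1) = \frac{1}{N}\sum_{i=1}^{N}\bigl(h(x_i) - y_i\bigr)^2,
\qquad
h(x) = \frac{w_0}{(x - w_1)^2 + 1},
$$

the chain rule gives

$$
\frac{\partial J}{\partial w_0} = \frac{2}{N}\sum_{i=1}^{N}\bigl(h(x_i) - y_i\bigr)\,\frac{1}{(x_i - w_1)^2 + 1},
$$

$$
\frac{\partial J}{\partial w_1} = \frac{2}{N}\sum_{i=1}^{N}\bigl(h(x_i) - y_i\bigr)\,\frac{2\,w_0\,(x_i - w_1)}{\bigl((x_i - w_1)^2 + 1\bigr)^2},
$$

and each iteration applies the update

$$
w_k \leftarrow w_k - \eta\,\frac{\partial J}{\partial w_k}, \qquad k = 0, 1.
$$

The code writes the same expressions with the error defined as $y_i - h(x_i)$, which is why a factor of $-2$ appears there instead of $+2$.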

#### NOTE: 
- When you run the code, watch the animated figure.
- It was not necessary to write the code in a class. However, since I am going to have many similar functions in the next code sections, and to avoid choosing odd names for them, I decided to encapsulate the code in classes.
- If you have a hard time understanding the code, I'd be happy to help :)

In [None]:
from bisect import bisect_right  # For the cost gauge
from matplotlib.patches import Wedge

# Gradient Descent Visualizer class
class GDVisualizer:
    # Initialize the class.
    def __init__(self, w_true, w_init, x_range=(0, 5), num_points=100, noise_level=0.1, learning_rate=0.05, epochs=100):
        self.w_true        = w_true             # True parameters used to generate the data
        self.w_init        = w_init             # Initial guess for the parameters
        self.num_points    = num_points         # Number of data points to generate
        self.x_range       = x_range            # Range of x values for data generation
        self.noise_level   = noise_level        # Noise level for data generation
        self.learning_rate = learning_rate      # Learning rate for gradient descent
        self.epochs        = epochs             # Number of iterations for optimization

        # Generate data
        self.x_data, self.y_true, self.y_noisy = self.generate_data()

        # Generate the cost function on a grid for a wide range of w0 and w1
        self.w0_range = np.linspace(0.5, 2.5, 100)
        self.w1_range = np.linspace(1.5, 3.5, 100)
        self.w0_vals, self.w1_vals, self.cost_grid = self.compute_cost_grid()

        # Initialize cost
        self.cost_init = self.cost_function(self.w_init, self.x_data, self.y_noisy)

    # Hypothesis function h(w, x) in the form of a Cauchy probability density function
    def h(self, w, x):
        return w[0] / ((x - w[1])**2 + 1)

    # Generate noisy data based on the hypothesis function h(w, x)
    def generate_data(self, seed=42):
        np.random.seed(seed)  # For reproducibility
        x_data  = np.linspace(self.x_range[0], self.x_range[1], self.num_points)
        y_true  = self.h(self.w_true, x_data)
        y_noisy = y_true + np.random.normal(0, self.noise_level, size=self.num_points)
        return x_data, y_true, y_noisy

    # MSE cost function.
    def cost_function(self, w, x, y):
        predictions = self.h(w, x)
        return np.mean((predictions - y)**2)

    # Compute gradients of the cost function with respect to w values
    def gradients(self, w, x, y):
        predictions = self.h(w, x)
        error = y - predictions
        grad_w0 = -2 * np.mean(error / ((x - w[1])**2 + 1))   # Analytic dJ/dw0 (J is the MSE cost)
        grad_w1 = -2 * np.mean(error * w[0] * 2 * (x - w[1]) / (((x - w[1])**2 + 1)**2))  # Analytic dJ/dw1
        return np.array([grad_w0, grad_w1])

    # Generate a cost grid over a range of w values for visualization.
    # This is used to generate the contours in the visualization function.
    def compute_cost_grid(self):
        w0_vals, w1_vals = np.meshgrid(self.w0_range, self.w1_range)
        cost_grid = np.zeros_like(w0_vals)
        for i in range(w0_vals.shape[0]):
            for j in range(w0_vals.shape[1]):
                cost_grid[i, j] = self.cost_function([w0_vals[i, j], w1_vals[i, j]], self.x_data, self.y_noisy)
        return w0_vals, w1_vals, cost_grid

    # Perform gradient descent with real-time plotting.
    def gradient_descent(self):
        w = np.array(self.w_init, dtype=float)
        trajectory = [w.copy()]   # Store the data points for the trajectory of the descent

        # this is the main loop for gradient descent
        for epoch in range(self.epochs):
            grad = self.gradients(w, self.x_data, self.y_noisy)
            w -= self.learning_rate * grad
            trajectory.append(w.copy())
            
            cost_current = self.cost_function(w, self.x_data, self.y_noisy)
            self.plot_realtime(trajectory, cost_current)

        return w

    # Update the plot with the current trajectory of the Gradient Descent and plot the cost gauge.
    def plot_realtime(self, trajectory, cost_current):
        clear_output(wait=True)   # Clear the output for real-time plotting
        plt.figure(figsize=(5, 5), dpi=150)

        # Cost grid and trajectory of the Gradient Descent
        ax1 = plt.subplot(2, 1, 1)
        contour = ax1.contourf(self.w0_vals, self.w1_vals, np.log10(self.cost_grid), levels=50, cmap="magma")
        ax1.contour(self.w0_vals, self.w1_vals, np.log10(self.cost_grid), levels=10, linewidths=1.0, colors="white", alpha=0.6)
        
        ## All lines below are visualization touchups
        # Set the color bar and its ranges
        cbar = plt.colorbar(contour, ax=ax1)
        cbar.set_label('Cost')
        min_tick = int(np.floor(np.min(np.log10(self.cost_grid)))) + 1
        max_tick = int(np.ceil(np.max(np.log10(self.cost_grid))))
        ticks = range(min_tick, max_tick + 1)
        cbar.set_ticks(ticks)
        cbar.set_ticklabels([f'$10^{{{tick}}}$' for tick in ticks])

        # Plot the trajectory.
        ax1.plot([pt[0] for pt in trajectory], [pt[1] for pt in trajectory], marker='.', markersize=2, linewidth=1.0, color='cyan')
        
        # Mark the true values for w.
        ax1.scatter(self.w_true[0], self.w_true[1], marker='*', color='#2ecc71', edgecolor='#00bcd4', label='True Ws')
        
        ax1.set_title('Gradient Descent')
        ax1.set_xlabel('$w_0$')
        ax1.set_ylabel('$w_1$')
        ax1.set_xlim([0.5, 2.5])
        ax1.set_ylim([1.5, 3])
        ax1.legend()

        # Show the cost gauge
        ax2 = plt.subplot(2, 1, 2)
        ax2.set_aspect('equal')
        ax2.set_xlim([-1.5, 1.5])
        ax2.set_ylim([-1.5, 1.5])
        ax2.axis('off')

        gauge_max = self.cost_init
        normalized_cost = cost_current / gauge_max

        # Draw the gauge background with a Wedge patch.
        wedge_background = Wedge((0, 0), 1.0, 0, 180, facecolor="lightgray", edgecolor="black")
        ax2.add_patch(wedge_background)

        angle = 180 * normalized_cost
        thresholds = [0, 10, 30, 90]
        colors = ["forestgreen", "yellow", "orange", "red"]

        color_index = bisect_right(thresholds, angle) - 1
        needle_color = colors[max(0, min(color_index, len(colors) - 1))]
        wedge_needle = Wedge((0, 0), 1.0, angle - 2, angle + 2, facecolor=needle_color, edgecolor=needle_color)
        ax2.add_patch(wedge_needle)

        ax2.text(0.0, -0.3, f"Cost: {cost_current:.4f}", ha='center')
        ax2.text(+1.1, +0.1, f"0", ha='center')
        ax2.text(-1.3, +0.1, f"{gauge_max:.2f}", ha='center')
        ax2.text(0.0, +1.2, "Cost Gauge", ha='center', fontsize=14, fontweight='bold')
        #plt.tight_layout()
        plt.show()

In [None]:
if __name__ == "__main__":
    w_true = [1.60, 2.25]  # True parameters
    w_init = [0.75, 1.75]  # Initial guess
    epochs = 100           # Number of iterations
    learning_rate = 0.1    # Learning rate

    # Create an instance of GDVisualizer and initialize its parameters
    gd_visualizer = GDVisualizer(w_true, w_init, learning_rate=learning_rate, epochs=epochs)

    # Run gradient descent and visualize
    w_optimized = gd_visualizer.gradient_descent()

    print(f"True parameters: {gd_visualizer.w_true}")
    print(f"Optimized parameters: {w_optimized}")

***
### 💡 Reflect and Run

- In lines 20-21 (`self.w0_range` and `self.w1_range`), I've defined a range for the w-space to explore. How do we know the range for `w0` and `w1`?

- Add a third subplot that displays the cost function's history over the epochs.

- Next, modify the learning rate: first set it to 1.0, then to 3.0. For each case, run the full training and observe how both the cost function and the parameter updates evolve. Does the model still converge? Does it overshoot? Explain what you observe. 

- In the example above, $h(x) = w_0/((x - w_1)^2 + 1)$. Change the code such that it handles $h(x) = w_0/((x - w_1)^2 + w_2)$. If you do not have time to actually do it, think about the required changes in the `gradients` function. Changing only `h(self, w, x)` in the class is not sufficient. Why?
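
Whenever you derive a gradient by hand, for example after changing the hypothesis function, one way to test it is to compare it against a central finite-difference approximation of the cost. The sketch below does this for the two-parameter model already used above; `numerical_gradient`, `analytic_gradient`, and the step size `eps` are names and values chosen for this illustration and are not part of `GDVisualizer`.

```python
import numpy as np

def h(w, x):
    # Two-parameter Cauchy-like hypothesis from the example above
    return w[0] / ((x - w[1])**2 + 1)

def cost(w, x, y):
    # Mean squared error
    return np.mean((h(w, x) - y)**2)

def analytic_gradient(w, x, y):
    # Hand-derived gradient of the MSE cost
    denom = (x - w[1])**2 + 1
    residual = h(w, x) - y
    return np.array([
        2 * np.mean(residual / denom),
        2 * np.mean(residual * 2 * w[0] * (x - w[1]) / denom**2),
    ])

def numerical_gradient(cost_fn, w, eps=1e-6):
    # Central differences: perturb one parameter at a time by +/- eps
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for k in range(w.size):
        step = np.zeros_like(w)
        step[k] = eps
        grad[k] = (cost_fn(w + step) - cost_fn(w - step)) / (2 * eps)
    return grad

x = np.linspace(0, 5, 100)
y = h([1.60, 2.25], x)             # noise-free targets are enough for this check
w_test = np.array([0.75, 1.75])    # any point away from the minimum works

g_analytic = analytic_gradient(w_test, x, y)
g_numeric = numerical_gradient(lambda w: cost(w, x, y), w_test)
max_diff = np.max(np.abs(g_analytic - g_numeric))

print("analytic:", g_analytic)
print("numeric: ", g_numeric)
print("largest difference:", max_diff)
```

If the two gradients disagree by much more than the finite-difference error, the hand-derived expression is the first place to look. The same checker works for any number of parameters, since it only needs the cost function.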

***

## GD for nonlinear models with multiple minima

In both of the examples above, we used cost functions with a single minimum in the region we explored. The code below implements a GD method to optimize the parameters of a model with multiple minima, using a different hypothesis function $h(w, x) = \sin(w_0 x + w_1)$. Because $h(w, x)$ is nonlinear and periodic, the cost function has multiple minima, which means that the optimization process may converge to different local minima depending on the initial parameter values.

The class I've written below is fundamentally similar to the `GDVisualizer` you worked on and studied earlier in this session.
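
To see the dependence on the starting point without any plotting, the following sketch runs plain gradient descent on the sine model from two different initial guesses and reports where each run ends. The function names and the `results` dictionary are assumptions made for this illustration; the parameter values are the ones used in the cells below, and the noise is drawn with `np.random.default_rng`, so the numbers will not match the class output exactly.

```python
import numpy as np

def h(w, x):
    # Sine hypothesis with frequency w[0] and phase w[1]
    return np.sin(w[0] * x + w[1])

def cost(w, x, y):
    # Mean squared error
    return np.mean((h(w, x) - y)**2)

def grad(w, x, y):
    # Analytic gradient of the MSE cost for the sine hypothesis
    residual = h(w, x) - y
    c = np.cos(w[0] * x + w[1])
    return np.array([2 * np.mean(residual * c * x), 2 * np.mean(residual * c)])

def run_gd(w_init, x, y, learning_rate=0.2, epochs=100):
    # Plain gradient descent; returns the final parameter vector
    w = np.array(w_init, dtype=float)
    for _ in range(epochs):
        w -= learning_rate * grad(w, x, y)
    return w

rng = np.random.default_rng(42)
w_true = np.array([1.0, 0.5])
x = np.linspace(0.0, np.pi, 100)
y = h(w_true, x) + rng.normal(0, 0.1, size=x.size)

# Same data, same learning rate, two different starting points
results = {}
for w_init in ([0.0, 0.0], [0.5, 2.0]):
    w_final = run_gd(w_init, x, y)
    results[tuple(w_init)] = (w_final, cost(w_final, x, y))
    print(f"start {w_init} -> end {w_final}, cost {results[tuple(w_init)][1]:.4f}")
```

If the two runs finish with clearly different parameters or costs, they have settled in different basins of the cost surface, which is the behaviour the animated class below lets you watch step by step.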

In [None]:
# Gradient Descent visualizer for the sine hypothesis (a cost surface with multiple minima)
class GDOscillator:
    # Initialize the GDOscillator class.
    def __init__(self, w_true, w_init, x_range=(0.0, np.pi), num_points=100, noise_level=0.1, learning_rate=0.05, epochs=100):
        self.w_true        = w_true             # True parameters used to generate the data
        self.w_init        = w_init             # Initial guess for the parameters
        self.num_points    = num_points         # Number of data points to generate
        self.x_range       = x_range            # Range of x values for data generation
        self.noise_level   = noise_level        # Noise level for data generation
        self.learning_rate = learning_rate      # Learning rate for gradient descent
        self.epochs        = epochs             # Number of iterations for optimization

        # Generate data
        self.x_data, self.y_true, self.y_noisy = self.generate_data()

        # Generate the cost function on a grid for a wide range of w0 and w1
        self.w0_range = np.linspace(-np.pi/2,    np.pi/2, 100)
        self.w1_range = np.linspace(-1.5*np.pi  , 2.*np.pi  , 100)
        self.w0_vals, self.w1_vals, self.cost_grid = self.compute_cost_grid()

        # Initialize cost
        self.cost_init = self.cost_function(self.w_init, self.x_data, self.y_noisy)

    # Hypothesis function h(w, x) = sin(w[0] * x + w[1])
    def h(self, w, x):
        return np.sin(w[0] * x + w[1])

    # Generate noisy data based on the hypothesis function h(w, x)
    def generate_data(self, seed=42):
        np.random.seed(seed)  # For reproducibility
        x_data = np.linspace(self.x_range[0], self.x_range[1], self.num_points)
        y_true = self.h(self.w_true, x_data)
        y_noisy = y_true + np.random.normal(0, self.noise_level, size=self.num_points)
        return x_data, y_true, y_noisy

    # MSE cost function.
    def cost_function(self, w, x, y):
        predictions = self.h(w, x)
        return np.mean((predictions - y)**2)

    # Compute gradients of the cost function with respect to w values
    def gradients(self, w, x, y):
        predictions = self.h(w, x)
        error = y - predictions
        grad_w0 = -2 * np.mean(error * np.cos(w[0] * x + w[1]) * x)  # Derivative with respect to w0
        grad_w1 = -2 * np.mean(error * np.cos(w[0] * x + w[1]))      # Derivative with respect to w1
        return np.array([grad_w0, grad_w1])

    # Generate a cost grid over a range of w values for visualization.
    # This is used to generate the contours in the visualization function.
    def compute_cost_grid(self):
        w0_vals, w1_vals = np.meshgrid(self.w0_range, self.w1_range)
        cost_grid = np.zeros_like(w0_vals)
        for i in range(w0_vals.shape[0]):
            for j in range(w0_vals.shape[1]):
                cost_grid[i, j] = self.cost_function([w0_vals[i, j], w1_vals[i, j]], self.x_data, self.y_noisy)
        return w0_vals, w1_vals, cost_grid

    # Perform gradient descent with real-time plotting.
    def gradient_descent(self):
        w = np.array(self.w_init, dtype=float)
        trajectory = [w.copy()]   # Store the data points for the trajectory of the descent

        # this is the main loop for gradient descent
        for epoch in range(self.epochs):
            grad = self.gradients(w, self.x_data, self.y_noisy)
            w -= self.learning_rate * grad
            trajectory.append(w.copy())
            
            cost_current = self.cost_function(w, self.x_data, self.y_noisy)
            self.plot_realtime(trajectory, cost_current)

        return w

    # Update the plot with the current trajectory of the Gradient Descent and plot the cost gauge.
    def plot_realtime(self, trajectory, cost_current):
        clear_output(wait=True)   # Clear the output for real-time plotting
        plt.figure(figsize=(5, 5), dpi=150)

        # Cost grid and trajectory of the Gradient Descent
        ax1 = plt.subplot(2, 1, 1)
        contour = ax1.contourf(self.w0_vals, self.w1_vals, np.log10(self.cost_grid), levels=50, cmap="magma")
        ax1.contour(self.w0_vals, self.w1_vals, 
                    np.log10(self.cost_grid), levels=10, 
                    linewidths=1.0, colors="white", alpha=0.6)
        
        # Set the color bar and its ranges
        cbar = plt.colorbar(contour, ax=ax1)
        contour.set_clim(-2, 1)
        cbar.set_label('Cost')
        min_tick = int(np.floor(np.min(np.log10(self.cost_grid))))+1
        max_tick = int(np.ceil(np.max(np.log10(self.cost_grid))))-1
        ticks = range(min_tick, max_tick + 1)
        cbar.set_ticks(ticks)
        cbar.set_ticklabels([f'$10^{{{tick}}}$' for tick in ticks])

        # Plot the trajectory.
        ax1.plot(self.w_init[0], self.w_init[1], 's', markersize=4, color='w')
        ax1.plot([pt[0] for pt in trajectory], [pt[1] for pt in trajectory], marker='.', markersize=2, linewidth=1.0, color='cyan')
        
        # Mark the true values for w.
        ax1.scatter(self.w_true[0], self.w_true[1], marker='*', color='#2ecc71', edgecolor='#00bcd4', label='True Ws')
        
        ax1.set_title('Gradient Descent')
        ax1.set_xlabel('$w_0$')
        ax1.set_ylabel('$w_1$')
        #ax1.set_xlim([0.5, 2.5])
        #ax1.set_ylim([1.5, 3])
        ax1.legend()

        # Show the cost gauge
        ax2 = plt.subplot(2, 1, 2)
        ax2.set_aspect('equal')
        ax2.set_xlim([-1.5, 1.5])
        ax2.set_ylim([-1.5, 1.5])
        ax2.axis('off')

        gauge_max = self.cost_init
        normalized_cost = cost_current / gauge_max

        # Draw the gauge background with a Wedge patch.
        wedge_background = Wedge((0, 0), 1.0, 0, 180, facecolor="lightgray", edgecolor="black")
        ax2.add_patch(wedge_background)

        angle = 180 * normalized_cost
        thresholds = [0, 10, 30, 90]
        colors = ["forestgreen", "yellow", "orange", "red"]

        color_index = bisect_right(thresholds, angle) - 1
        needle_color = colors[max(0, min(color_index, len(colors) - 1))]
        wedge_needle = Wedge((0, 0), 1.0, angle - 2, angle + 2, facecolor=needle_color, edgecolor=needle_color)
        ax2.add_patch(wedge_needle)

        ax2.text(0.0, -0.3, f"Cost: {cost_current:.4f}", ha='center')
        ax2.text(+1.1, +0.1, f"0", ha='center')
        ax2.text(-1.3, +0.1, f"{gauge_max:.2f}", ha='center')
        ax2.text(0.0, +1.2, "Cost Gauge", ha='center', fontsize=14, fontweight='bold')
        #plt.tight_layout()
        plt.show()


In [None]:
if __name__ == "__main__":
    w_true = [1.0, 0.5]
    w_init = [0., 0.]  # Alternative starting point used in the exercises: [0.5, 2.0]
    epochs = 100
    learning_rate = 0.2

    # Create an instance of GDOscillator and initialize its parameters
    gd_visualizer = GDOscillator(w_true, w_init, learning_rate=learning_rate, epochs=epochs)

    # Run gradient descent and visualize
    w_optimized = gd_visualizer.gradient_descent()

    print(f"True parameters: {gd_visualizer.w_true}")
    print(f"Optimized parameters: {w_optimized}")

***
### 💡 Reflect and Run

- Increase the number of epochs and observe the effect on convergence.

- Experiment with different learning rates and observe their effect on convergence speed and stability. For example, set `epochs = 200` and `learning_rate = 0.5` and re-run the code. What do you observe? Why?

- Change the learning rate to 0.2, and set `w_init = [0.5, 2.0]`. Re-run the code and explain your observations.

- Change the initial weights to different values, re-run the code, and explain what you observe.

***
END
***