# Stellar luminosity:
## Linear Model for Regression

In this notebook, linear regression will be implemented to model the mass-luminosity relationship of stars. 

### Objective

Model the mass-luminosity relationship of stars using linear regression with an explicit bias term. The fit in this notebook predicts mass from luminosity: M_hat = w * L + b.

### Dataset and Notation

We will use the following notation throughout the notebook:

- **M:** stellar mass (in units of solar mass, M⊙)
- **L:** stellar luminosity (in units of solar luminosity, L⊙)

And the dataset for this first part is:

M = [0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4]
L = [0.15, 0.35, 1.00, 2.30, 4.10, 7.00, 11.2, 17.5, 25.0, 35.0]


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
#Set variables in arrays (one feature)
M = np.array([0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4])
L = np.array([0.15, 0.35, 1.00, 2.30, 4.10, 7.00, 11.2, 17.5, 25.0, 35.0])

N = len(L)  # number of data samples

#### 1. Dataset visualization: M vs L

The scatter plot shows whether the relationship between L and M is approximately linear, and therefore whether a linear model is plausible.

In [None]:
plt.figure()
plt.scatter(L, M)
plt.xlabel("L (Luminosity)")
plt.ylabel("M (Mass)")
plt.title("Mass vs Luminosity of Stars")
plt.show()
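
A quick numerical complement to the plot is the Pearson correlation coefficient, which measures how close the points lie to a straight line. The snippet below is a minimal sketch that repeats the dataset so it runs on its own; the variable name `r` is illustrative and not part of the notebook.

```python
import numpy as np

# Same data as above, repeated so the snippet is self-contained.
M = np.array([0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4])
L = np.array([0.15, 0.35, 1.00, 2.30, 4.10, 7.00, 11.2, 17.5, 25.0, 35.0])

# Pearson correlation: values near +1 or -1 indicate a nearly linear trend,
# values near 0 indicate no linear trend.
r = np.corrcoef(L, M)[0, 1]
print(f"Pearson r = {r:.3f}, r^2 = {r**2:.3f}")
```

A value of `r` close to 1 supports a linear first approximation, but it does not rule out curvature, so the plot should still be inspected.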

#### 2. Model and loss (MSE)

##### Model

In this notebook, a **simple linear regression model** is used, which assumes a linear relationship between luminosity \(L\) and mass \(M\):


$$\hat{M} = wL + b$$


where:


- \(w\) is the **slope**: the change in predicted mass (in M⊙) per unit increase in luminosity (in L⊙).
- \(b\) is the **intercept**: the predicted mass when \(L = 0\).


This model serves as a simple first approximation to describe the relationship between these two physical variables.

##### Loss Function (Mean Squared Error)

To evaluate how well the model fits the data, the **Mean Squared Error (MSE)** is used:


$$J(w,b) = \frac{1}{N} \sum_{i=1}^{N} (\hat{M}_i - M_i)^2$$


This function measures the average of the squared differences between the predicted values and the true values.


**Reasons for using MSE:**
- It strongly penalizes large errors.
- It is always non-negative.
- It is differentiable, which allows the use of **gradient descent**.
- It is a standard loss function for regression problems.


The goal of the training process is to find the values of \(w\) and \(b\) that **minimize this loss function**.

In [None]:
def predict(L, w, b):
    return w * L + b

def mse(L, M, w, b):
    M_pred = predict(L, w, b)
    return np.mean((M_pred - M) ** 2)
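
To make the loss concrete, the sketch below evaluates the MSE for two simple baselines. It redefines `predict` and `mse` so it runs on its own; the names `loss_zero` and `loss_mean` are illustrative.

```python
import numpy as np

M = np.array([0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4])
L = np.array([0.15, 0.35, 1.00, 2.30, 4.10, 7.00, 11.2, 17.5, 25.0, 35.0])

def predict(L, w, b):
    return w * L + b

def mse(L, M, w, b):
    return np.mean((predict(L, w, b) - M) ** 2)

# Baseline 1: w = 0, b = 0 predicts zero mass for every star,
# so the loss is the mean of M squared.
loss_zero = mse(L, M, 0.0, 0.0)

# Baseline 2: w = 0, b = mean(M) always predicts the average mass,
# so the loss equals the (population) variance of M.
loss_mean = mse(L, M, 0.0, M.mean())

print(f"MSE(w=0, b=0)       = {loss_zero:.2f}")
print(f"MSE(w=0, b=mean(M)) = {loss_mean:.2f}")
```

Any useful fit should reach a loss below the second baseline, since that baseline ignores luminosity entirely.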

#### 3. Cost surface (mandatory)

To better understand the behavior of the loss function, we evaluate the cost
\( J(w, b) \) over a grid of possible values of the parameters \( w \) (slope) and
\( b \) (intercept).


For each pair \((w, b)\), the Mean Squared Error (MSE) is computed using the training
data. This allows us to visualize how the error changes depending on the model
parameters.

In [None]:
w_vals = np.linspace(-10, 10, 100)
b_vals = np.linspace(-10, 10, 100)

W, B = np.meshgrid(w_vals, b_vals)
J = np.zeros_like(W)

In [None]:
for i in range(len(w_vals)):
    for j in range(len(b_vals)):
        J[j, i] = mse(L, M, W[j, i], B[j, i])
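
The same grid can be filled without Python loops by letting NumPy broadcast the samples against every grid point. This is a sketch of one possible approach, not a replacement for the loop above; `J_fast` and `residuals` are illustrative names.

```python
import numpy as np

M = np.array([0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4])
L = np.array([0.15, 0.35, 1.00, 2.30, 4.10, 7.00, 11.2, 17.5, 25.0, 35.0])

w_vals = np.linspace(-10, 10, 100)
b_vals = np.linspace(-10, 10, 100)
W, B = np.meshgrid(w_vals, b_vals)

# Add a trailing axis so each grid point is paired with every sample:
# W[..., None] has shape (100, 100, 1) and L has shape (10,),
# so the result has shape (100, 100, 10).
residuals = W[..., None] * L + B[..., None] - M

# Average the squared residuals over the sample axis.
J_fast = np.mean(residuals ** 2, axis=-1)
```

`J_fast[j, i]` holds the cost for `W[j, i]` and `B[j, i]`, the same indexing convention as the loop.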

In [None]:
plt.figure()
plt.contour(W, B, J, levels=50)
plt.xlabel("w")
plt.ylabel("b")
plt.title("Cost Surface (Contours)")
plt.show()

The minimum of this surface is the pair \((w, b)\) that minimizes the mean squared error.
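
For a linear model, that minimum can also be computed in closed form by ordinary least squares, which gives a reference point for the gradient descent results. The sketch below uses `np.polyfit` with degree 1; the names `w_star`, `b_star`, and `J_star` are illustrative.

```python
import numpy as np

M = np.array([0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4])
L = np.array([0.15, 0.35, 1.00, 2.30, 4.10, 7.00, 11.2, 17.5, 25.0, 35.0])

# Degree-1 least-squares fit; polyfit returns [slope, intercept].
w_star, b_star = np.polyfit(L, M, 1)

# Loss at the least-squares solution: the lowest value on the cost surface.
J_star = np.mean((w_star * L + b_star - M) ** 2)

print(f"w* = {w_star:.4f}, b* = {b_star:.4f}, J* = {J_star:.4f}")
```

Gradient descent should approach these values as the number of iterations grows, provided the learning rate is small enough to be stable.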

#### 4. Gradients: dJ/dw and dJ/db

The gradients of the MSE with respect to \( w \) and \( b \) indicate how to update the
parameters in order to minimize the cost function using gradient descent.
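
Differentiating the MSE defined earlier, with $\hat{M}_i = wL_i + b$, gives the two partial derivatives:

$$\frac{\partial J}{\partial w} = \frac{2}{N} \sum_{i=1}^{N} (\hat{M}_i - M_i)\, L_i$$

$$\frac{\partial J}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} (\hat{M}_i - M_i)$$

Gradient descent then moves each parameter against its gradient. Writing the learning rate as $\alpha$ (a symbol introduced here; the code calls it `lr`):

$$w \leftarrow w - \alpha \frac{\partial J}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial J}{\partial b}$$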

In [None]:
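def gradients(L, M, w, b):
    """Return (dJ/dw, dJ/db) of the MSE for the model M_hat = w * L + b."""
    n = len(L)  # use the size of the arrays passed in, not the global N
    M_pred = predict(L, w, b)
    error = M_pred - M

    dw = (2 / n) * np.sum(error * L)
    db = (2 / n) * np.sum(error)

    return dw, db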
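
One way to gain confidence in analytic gradients is to compare them with a finite-difference approximation. The sketch below does this at an arbitrary test point; the helper names `loss` and `grad`, the test point `(w0, b0)`, and the step `eps` are assumptions made for illustration.

```python
import numpy as np

M = np.array([0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4])
L = np.array([0.15, 0.35, 1.00, 2.30, 4.10, 7.00, 11.2, 17.5, 25.0, 35.0])

def loss(w, b):
    return np.mean((w * L + b - M) ** 2)

def grad(w, b):
    error = w * L + b - M
    return 2 * np.mean(error * L), 2 * np.mean(error)

w0, b0, eps = 0.5, -1.0, 1e-6  # arbitrary test point and step size

# Central differences approximate each partial derivative.
dw_num = (loss(w0 + eps, b0) - loss(w0 - eps, b0)) / (2 * eps)
db_num = (loss(w0, b0 + eps) - loss(w0, b0 - eps)) / (2 * eps)

dw, db = grad(w0, b0)
print(f"dw: analytic={dw:.6f}, numeric={dw_num:.6f}")
print(f"db: analytic={db:.6f}, numeric={db_num:.6f}")
```

If the analytic and numeric values disagree beyond rounding error, the gradient code most likely has a sign or scaling mistake.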

#### 5. Gradient descent (non-vectorized)

- For each iteration, the algorithm loops over all samples.
- The prediction error is computed one data point at a time.
- Gradients for `w` and `b` are accumulated using an explicit loop.
- Parameters are updated in the direction that reduces the loss.

In [None]:
def gradient_descent_loop(L, M, lr, iterations):
    w, b = 0.0, 0.0
    n = len(L)  # number of samples in the arrays passed in
    loss_history = []

    for _ in range(iterations):
        dw, db = 0.0, 0.0

        # Accumulate the gradient one sample at a time
        for i in range(n):
            error = (w * L[i] + b) - M[i]
            dw += error * L[i]
            db += error

        dw *= (2 / n)
        db *= (2 / n)

        # Step against the gradient
        w -= lr * dw
        b -= lr * db

        loss_history.append(mse(L, M, w, b))

    return w, b, loss_history

#### 6. Gradient descent (vectorized)

In the vectorized implementation, gradients are computed using NumPy operations instead of explicit loops over the samples.
This approach is mathematically equivalent to the non-vectorized version but is significantly more efficient and concise.


Vectorization leverages optimized linear algebra routines, making it the standard approach in practical machine learning implementations.
It also improves code readability and reduces the risk of implementation errors.

In [None]:
def gradient_descent_vectorized(L, M, lr, iterations):
    w, b = 0.0, 0.0
    loss_history = []

    for _ in range(iterations):
        dw, db = gradients(L, M, w, b)
        
        w -= lr * dw
        b -= lr * db
        
        loss_history.append(mse(L, M, w, b))
    
    return w, b, loss_history
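
The claim that the two implementations are mathematically equivalent can be checked directly. The sketch below uses compact stand-ins for the two training functions (named `gd_loop` and `gd_vec` here, without the loss history) and compares their final parameters; the learning rate and iteration count are arbitrary choices for the test.

```python
import numpy as np

M = np.array([0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4])
L = np.array([0.15, 0.35, 1.00, 2.30, 4.10, 7.00, 11.2, 17.5, 25.0, 35.0])

def gd_loop(L, M, lr, iterations):
    """Gradient descent with an explicit loop over the samples."""
    w, b, n = 0.0, 0.0, len(L)
    for _ in range(iterations):
        dw = db = 0.0
        for i in range(n):
            error = (w * L[i] + b) - M[i]
            dw += error * L[i]
            db += error
        w -= lr * (2 / n) * dw
        b -= lr * (2 / n) * db
    return w, b

def gd_vec(L, M, lr, iterations):
    """Gradient descent with NumPy array operations."""
    w, b = 0.0, 0.0
    for _ in range(iterations):
        error = w * L + b - M
        dw, db = 2 * np.mean(error * L), 2 * np.mean(error)
        w -= lr * dw
        b -= lr * db
    return w, b

w_loop, b_loop = gd_loop(L, M, lr=0.001, iterations=200)
w_vec, b_vec = gd_vec(L, M, lr=0.001, iterations=200)
print(np.allclose([w_loop, b_loop], [w_vec, b_vec]))
```

Small differences in the last digits are expected because the two versions sum the samples in a different order.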

#### 7. Convergence (mandatory)

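When the learning rate is small enough, the loss decreases monotonically with the number of iterations and the curve is smooth, with no oscillations or divergence.


The stable range is narrow for this dataset. The luminosities are not rescaled and reach 35 L⊙, so the step size must stay below roughly 0.004. Above that limit each update overshoots the minimum and the loss grows instead of shrinking.


A steadily decreasing curve therefore indicates that the algorithm is moving toward the minimum of the cost function.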

In [None]:
# lr must stay below about 0.004 for this unscaled data; lr=0.01 diverges
w, b, loss = gradient_descent_vectorized(L, M, lr=0.001, iterations=100)

plt.figure()
plt.plot(loss)
plt.xlabel("Iterations")
plt.ylabel("MSE")
plt.title("Convergence of Gradient Descent")
plt.show()

The curve shows how the error progressively decreases, indicating convergence of the algorithm.
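
Whether a given learning rate will converge can be estimated before training. The MSE of a linear model is quadratic in \((w, b)\), so gradient descent is stable only when the learning rate is below 2 / λ_max, where λ_max is the largest eigenvalue of the Hessian of J. The sketch below computes that bound for this dataset; `H`, `lam_max`, and `lr_max` are illustrative names.

```python
import numpy as np

L = np.array([0.15, 0.35, 1.00, 2.30, 4.10, 7.00, 11.2, 17.5, 25.0, 35.0])

# Hessian of J(w, b) for the model w * L + b.
# It is constant and depends only on the inputs L.
H = 2 * np.array([[np.mean(L ** 2), np.mean(L)],
                  [np.mean(L),      1.0]])

# eigvalsh handles symmetric matrices and returns real eigenvalues.
lam_max = np.linalg.eigvalsh(H).max()
lr_max = 2 / lam_max

print(f"largest eigenvalue = {lam_max:.1f}, stable for lr < {lr_max:.5f}")
```

The bound is small because the mean of L squared is large. Rescaling the luminosities would shrink the largest eigenvalue and widen the range of usable learning rates.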

#### 8. Experiments with different learning rates

We train the linear regression model with three learning rates, using 200 iterations each, to analyze their effect on convergence.


- **Very low learning rate (0.00001):**
Stable but slow. 200 iterations are not enough to approach the minimum, so the final loss remains comparatively high.


- **Moderate learning rate (0.001):**
Stable and faster. It reaches a lower loss in the same number of iterations, although the parameters are still approaching their optimal values.


- **High learning rate (0.01):**
Too large for this unscaled data. Each update overshoots the minimum, so the loss grows from one iteration to the next instead of decreasing.


Overall, the experiment shows that the choice of learning rate strongly affects both convergence speed and training stability.

In [None]:
learning_rates = [0.001, 0.01, 0.00001]
results = {}

for lr in learning_rates:
    w, b, loss = gradient_descent_vectorized(L, M, lr, 200)
    results[lr] = (w, b, loss[-1])

In [None]:
for lr, (w, b, final_loss) in results.items():
    print(f"LR={lr} → w={w:.4f}, b={b:.4f}, loss={final_loss:.4f}")

Large learning rates converge faster but can be unstable; small values are more stable but slow.

#### 9. Final fit plot

The regression line captures the overall trend of the data, indicating that a linear relationship between L and M is a reasonable first approximation.


However, systematic errors are visible in certain regions, where the model tends to overestimate or underestimate the true values. This suggests that the underlying relationship may not be perfectly linear and that more complex models could provide a better fit.

In [None]:
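# Standardized copies of the data (zero mean, unit variance).
# They are not used by the fit below, which trains on the raw values.
L_mean, L_std = np.mean(L), np.std(L)
M_mean, M_std = np.mean(M), np.std(M)

L_norm = (L - L_mean) / L_std
M_norm = (M - M_mean) / M_std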

best_lr = 0.001
w, b, _ = gradient_descent_vectorized(L, M, best_lr, 2000)

plt.figure()
plt.scatter(L, M, label="Observed data")
plt.plot(L, predict(L, w, b), label="Linear model")
plt.xlabel("L")
plt.ylabel("M")
plt.legend()
plt.title("Final model fit")
plt.show()
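
Standardizing both variables removes the large difference in scale between L and M, which lets gradient descent take much larger steps. The sketch below is one way to train on standardized data and convert the result back to the original units. It assumes z-score scaling and a hypothetical learning rate of 0.1; the names `w_n`, `b_n`, `w_orig`, and `b_orig` are illustrative.

```python
import numpy as np

M = np.array([0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4])
L = np.array([0.15, 0.35, 1.00, 2.30, 4.10, 7.00, 11.2, 17.5, 25.0, 35.0])

# z-score scaling: zero mean and unit variance for both variables.
L_norm = (L - L.mean()) / L.std()
M_norm = (M - M.mean()) / M.std()

# Plain gradient descent on the standardized data.
w_n, b_n = 0.0, 0.0
lr = 0.1  # assumed value; far too large for the raw data
for _ in range(200):
    error = w_n * L_norm + b_n - M_norm
    dw, db = 2 * np.mean(error * L_norm), 2 * np.mean(error)
    w_n -= lr * dw
    b_n -= lr * db

# Undo the scaling so the parameters apply to L and M in solar units.
w_orig = w_n * M.std() / L.std()
b_orig = M.mean() + M.std() * b_n - w_orig * L.mean()

print(f"w = {w_orig:.4f}, b = {b_orig:.4f}")
```

With standardized inputs the cost surface is close to circular, so both parameters converge at a similar rate.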

#### 10. Conceptual questions

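1. **Astrophysical meaning of w:**
    It is the rate of change of mass with respect to luminosity: the predicted increase in mass (in M⊙) for each additional unit of luminosity (in L⊙).

2. **Limitations of the linear model:**
    Many astrophysical processes are nonlinear and cannot be accurately described by a simple linear relationship. The mass-luminosity relation of main-sequence stars is closer to a power law than to a straight line, so a single slope cannot describe both faint and bright stars.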
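
As a hedged illustration of this limitation, the sketch below compares the straight-line fit with a power-law fit, assuming the data follow M ≈ C · L^a. Under that assumption log M is linear in log L, so the same least-squares machinery applies in log space. The power-law form and the names `a`, `c`, `mse_lin`, and `mse_pow` are assumptions made for this example, not part of the notebook.

```python
import numpy as np

M = np.array([0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4])
L = np.array([0.15, 0.35, 1.00, 2.30, 4.10, 7.00, 11.2, 17.5, 25.0, 35.0])

# Straight line in the original units: M_hat = w * L + b.
w_lin, b_lin = np.polyfit(L, M, 1)
mse_lin = np.mean((w_lin * L + b_lin - M) ** 2)

# Straight line in log-log space: log M = a * log L + c,
# which corresponds to M_hat = exp(c) * L ** a.
a, c = np.polyfit(np.log(L), np.log(M), 1)
M_pow = np.exp(c) * L ** a
mse_pow = np.mean((M_pow - M) ** 2)

print(f"linear fit MSE    = {mse_lin:.4f}")
print(f"power-law fit MSE = {mse_pow:.4f} (exponent a = {a:.3f})")
```

Both errors are measured in the original units of M, so they can be compared directly.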