# 1

## 1.1 a

### Step 1: Probability distribution of y
Since the noise $\varepsilon$ is normally distributed with zero mean and variance $\sigma^2$, the conditional distribution of $y$ given $x$ is:

$$
y \mid x \sim \mathcal{N} \left( \theta_0 + \theta_1 x + \dots + \theta_P x^P, \sigma^2 \right)
$$

The probability density function (PDF) for a normal distribution is:

$$
p(y \mid x, \theta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left( -\frac{(y - \mu(x))^2}{2\sigma^2} \right)
$$

where $\mu(x)$ is the mean of $y$ given $x$, i.e. the value predicted by the polynomial model:

$$
\mu(x) = \theta_0 + \theta_1 x + \dots + \theta_P x^P.
$$

---

### Step 2: Likelihood function

Given $N$ independent observations $\mathbf{y} = (y_1, y_2, \dots, y_N)^T$ at inputs $x_1, x_2, \dots, x_N$, the likelihood function is the product of the individual densities:

$$
L(\theta, \sigma^2) = \prod_{i=1}^{N} p(y_i \mid x_i, \theta, \sigma^2)
$$

Substituting the normal PDF:

$$
L(\theta, \sigma^2) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left( -\frac{(y_i - \mu(x_i))^2}{2\sigma^2} \right)
$$

---

### Step 3: Log-likelihood function

Taking the natural logarithm turns the product into a sum:

$$
\log L(\theta, \sigma^2) = \sum_{i=1}^{N} \left[ -\frac{1}{2} \log (2\pi\sigma^2) - \frac{(y_i - \mu(x_i))^2}{2\sigma^2} \right]
$$

Simplifying:

$$
\log L(\theta, \sigma^2) = -\frac{N}{2} \log (2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \mu(x_i))^2
$$

where, as before,

$$
\mu(x_i) = \theta_0 + \theta_1 x_i + \dots + \theta_P x_i^P.
$$
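
As a quick numerical check of Step 3, the following Julia sketch evaluates the log-likelihood both as a sum of log-densities and with the simplified closed form. The function names (`mu`, `normal_pdf`, `log_likelihood`, `log_likelihood_direct`) and the three data points are made up for illustration and are not part of the assignment.

```julia
# Polynomial mean mu(x) = theta_0 + theta_1 x + ... + theta_P x^P.
# Julia arrays are 1-based, so theta[j + 1] stores theta_j.
mu(x, theta) = sum(theta[j + 1] * x^j for j in 0:length(theta) - 1)

# Normal density of one observation y with mean m and variance sigma2.
normal_pdf(y, m, sigma2) = exp(-(y - m)^2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

# Simplified closed form from Step 3.
function log_likelihood(theta, sigma2, xs, ys)
    N = length(ys)
    rss = sum((ys[i] - mu(xs[i], theta))^2 for i in 1:N)
    return -N / 2 * log(2 * pi * sigma2) - rss / (2 * sigma2)
end

# Direct version: sum of the log of each density (log of the product in Step 2).
log_likelihood_direct(theta, sigma2, xs, ys) =
    sum(log(normal_pdf(ys[i], mu(xs[i], theta), sigma2)) for i in eachindex(ys))

# Made-up example values (assumptions, not the assignment's data).
xs = [-0.5, 0.0, 0.2]
ys = [0.47, 0.31, 0.30]
theta = [0.3, -0.1, 0.5]
sigma2 = 0.01

ll_closed = log_likelihood(theta, sigma2, xs, ys)
ll_direct = log_likelihood_direct(theta, sigma2, xs, ys)
println("closed form: ", ll_closed)
println("direct sum:  ", ll_direct)
```

Both values should agree up to floating-point rounding. The closed form is preferable in practice because it avoids evaluating very small or very large densities before taking the logarithm.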

## 1.1 b 

To find the maximum likelihood estimates, we differentiate the log-likelihood with respect to each parameter $\theta_j$ and set the derivative equal to zero:

$$
\frac{\partial}{\partial \theta_j}\log L(\theta, \sigma^2) = 0
$$

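The first term of the log-likelihood, $-\frac{N}{2}\log(2\pi\sigma^2)$, does not depend on $\theta$, so the condition reduces to: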

$$
\frac{\partial}{\partial \theta_j}\left[ -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - \mu(x_i))^2 \right] = 0
$$

Applying the chain rule with $\partial \mu(x_i) / \partial \theta_j = x_i^j$ gives:

$$
\frac{1}{\sigma^2}\sum_{i=1}^{N}(y_i - \mu(x_i)) x_i^j = 0
$$

Since $\sigma^2 > 0$, multiplying both sides by $\sigma^2$ gives:

$$
\sum_{i=1}^{N}(y_i - \mu(x_i)) x_i^j = 0 \quad \text{for each } j=0,1,\dots,P
$$

Substituting $\mu(x_i)$ gives a linear system of $P+1$ equations in the $P+1$ unknowns $\theta_0, \dots, \theta_P$ (the normal equations):

$$
\sum_{i=1}^{N}\left(y_i - (\theta_0 + \theta_1 x_i + \dots + \theta_P x_i^P)\right)x_i^j = 0 \quad \text{for each } j=0,1,\dots,P.
$$

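In matrix form, the normal equations read $X^T(y - X\theta) = 0$, that is, $X^TX\theta = X^Ty$. If $X^TX$ is invertible, the solution is:

$$
\hat{\theta} = (X^TX)^{-1}X^Ty
$$

where $X$ is the $N \times (P+1)$ design matrix with entries $X_{ij} = x_i^j$ for $i = 1, \dots, N$ and $j = 0, \dots, P$, and $y = (y_1, \dots, y_N)^T$.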
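
The normal equations can be checked numerically: at $\hat{\theta}$, the sums $\sum_i (y_i - \mu(x_i)) x_i^j$ should vanish for every $j$. The Julia sketch below does this for a quadratic fit. The observations in `y` are made-up numbers, and the names `design_matrix`, `theta_hat` and `score` are illustrative assumptions.

```julia
# Design matrix with columns [1, x, x^2, ..., x^P].
design_matrix(xvals, P) = [xi^p for xi in xvals, p in 0:P]

# Made-up observations, for illustration only.
x = collect(-0.5:0.1:0.2)
y = [0.48, 0.42, 0.37, 0.34, 0.32, 0.30, 0.29, 0.30]

X = design_matrix(x, 2)

# Solve X'X * theta = X'y (the normal equations in matrix form).
theta_hat = (X' * X) \ (X' * y)

# Left-hand side of the normal equations, one entry per j = 0, ..., P:
# score[j + 1] = sum_i (y_i - mu(x_i)) * x_i^j
score = X' * (y .- X * theta_hat)

println("theta_hat = ", theta_hat)
println("largest |score| entry = ", maximum(abs, score))
```

The entries of `score` should be zero up to rounding error. Julia's `X \ y` solves the same least-squares problem without forming $X^TX$ explicitly, which is usually the more robust choice when the columns of $X$ are nearly collinear, as happens for high polynomial orders.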



## 1.1 c 

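In [None]:
import Pkg
Pkg.add("Plots")   # one-time install; can be skipped if Plots is already available
using Plots
using Random       # provides Random.seed!

# True parameters of the generating model y = theta_0 + theta_1*x + theta_2*x^2 + noise
theta_0 = 0.3
theta_1 = -0.1
theta_2 = 0.5
variance = 0.0001                     # noise variance sigma^2
standard_deviation = sqrt(variance)

x = -0.5:0.1:0.2                      # 8 equally spaced inputs

Random.seed!(42) # seed for reproducibility
epsilon = randn(length(x))            # standard normal draws, scaled below
y = theta_0 .+ theta_1 .* x .+ theta_2 .* x.^2 .+ standard_deviation .* epsilon

scatter(x, y, label="data", xlabel="x", ylabel="y", title="Data")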

## 1.1 d

In [None]:
# Helper function: create design matrix for polynomial of order P
function design_matrix(xvals, P)
    # X will have columns [1, x, x^2, ..., x^P]
    X = [xi^p for xi in xvals, p in 0:P]
    return X
end

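# Compute the ML estimate theta_hat via the normal equations, plus the log-likelihood
function ml_estimate(xvals, yvals, P)
    X = design_matrix(xvals, P)
    # ML parameter estimate: solve X'X * theta = X'y.
    # Forming X'X squares the condition number of X, so for large P the result
    # can lose accuracy; X \ yvals is a more stable alternative.
    theta_hat = (X'X) \ (X'yvals)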

    # Fitted values
    y_hat = X * theta_hat

    # Residual sum of squares (RSS)
    rss = sum((yvals .- y_hat).^2)

    # ML estimate of sigma^2 is RSS / N
    N = length(yvals)
    sigma2_hat = rss / N

    # Log-likelihood under Gaussian noise, evaluated at (theta_hat, sigma2_hat):
    # logL = -N/2 * log(2π * sigma2_hat) - RSS / (2 * sigma2_hat)
    # Note: if the fit is exact (RSS ≈ 0), sigma2_hat ≈ 0 and logL becomes very large.
    logL = -N/2 * log(2*π*sigma2_hat) - (rss / (2*sigma2_hat))

    return theta_hat, logL
end

# Compare parameter estimates and log-likelihood for P=1, P=2, P=7
println("-------------------------------------------------")
println("True theta = [$theta_0, $theta_1, $theta_2]")
for p in (1, 2, 7)
    θhat, logL = ml_estimate(x, y, p)
    println("Polynomial order P = $p")
    println("  ML parameter estimates = ", θhat)
    println("  Log-likelihood         = $logL")
    println("-------------------------------------------------\n")
end
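
The function `ml_estimate` plugs in $\hat{\sigma}^2 = \text{RSS}/N$, which the derivation in 1.1 b does not cover because it only treats $\theta$. A short sketch of where this estimate comes from, using the log-likelihood from 1.1 a and writing $\hat{\mu}(x_i)$ for the fitted mean at $\hat{\theta}$ (this notation is introduced here for illustration):

$$
\frac{\partial}{\partial \sigma^2}\log L(\theta, \sigma^2) = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{N}(y_i - \mu(x_i))^2 = 0
\quad\Longrightarrow\quad
\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{\mu}(x_i))^2 = \frac{\text{RSS}}{N}
$$

Substituting $\hat{\sigma}^2$ back into the log-likelihood, the second term becomes $\text{RSS}/(2\hat{\sigma}^2) = N/2$, so:

$$
\log L(\hat{\theta}, \hat{\sigma}^2) = -\frac{N}{2}\log(2\pi\hat{\sigma}^2) - \frac{N}{2}
$$

The maximized log-likelihood therefore depends on the data only through the RSS: the smaller the RSS, the larger the log-likelihood.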

The estimates of $\theta_0$, $\theta_1$ and $\theta_2$ are closest to their true values for $P = 2$, where the model has the same order as the function that generated the data. The $P = 1$ model (2 parameters) underfits, and the $P = 7$ model (8 parameters) overfits: with only 8 data points it can pass through every observation.

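The maximized log-likelihood increases with the number of parameters, because a higher-order polynomial contains every lower-order polynomial as a special case and so can never fit the training data worse. A high training log-likelihood is therefore not evidence of a better model on its own: for noisy data, a very high value suggests that the model is fitting the noise, i.e. overfitting.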
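
One way to see why the training log-likelihood cannot decrease as the order grows: any order-2 fit is also a valid order-4 fit with the extra coefficients set to zero. With $\hat{\sigma}^2 = \text{RSS}/N$ as in `ml_estimate`, a smaller RSS means a larger log-likelihood. The Julia sketch below checks this on a fixed, made-up noise vector, so no random seed is involved. `design_matrix` mirrors the helper from 1.1 d, while `rss` and the other names are illustrative assumptions.

```julia
# Design matrix with columns [1, x, x^2, ..., x^P], as in 1.1 d.
design_matrix(xvals, P) = [xi^p for xi in xvals, p in 0:P]

# Residual sum of squares for a given parameter vector.
rss(X, theta, y) = sum(abs2, y .- X * theta)

x = collect(-0.5:0.1:0.2)
# Made-up noise values, fixed so that the example is deterministic.
noise = 0.01 .* [0.3, -1.1, 0.7, 0.2, -0.5, 1.4, -0.8, 0.1]
y = 0.3 .- 0.1 .* x .+ 0.5 .* x.^2 .+ noise

X2 = design_matrix(x, 2)
X4 = design_matrix(x, 4)

# Least-squares fits for both orders.
theta2 = X2 \ y
theta4 = X4 \ y

# The order-2 solution, padded with zeros, is a candidate for the order-4 model
# and produces the same residuals.
theta2_padded = vcat(theta2, zeros(2))

rss2 = rss(X2, theta2, y)
rss2_as_order4 = rss(X4, theta2_padded, y)
rss4 = rss(X4, theta4, y)

println("RSS, order 2:                 ", rss2)
println("RSS, order 2 padded to order 4: ", rss2_as_order4)
println("RSS, order 4 (optimal):         ", rss4)
```

Because the order-4 fit minimizes the RSS over a set that includes the padded order-2 solution, its RSS can only be equal or smaller. This says nothing about how well either model predicts new data.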

# 2

# 3