## (a) Supervised or Unsupervised Learning?

This problem is in a **supervised learning** setting. In supervised learning, we are given input-output pairs and we learn a mapping from inputs to a known target. Here each training sample is a pair $(\mathbf{x}_i, y_i)$ where $\mathbf{x}_i \in \mathbb{R}^d$ is the feature vector (e.g., a sequence of measurements like velocities over time) and $y_i \in \mathbb{R}$ is the target scalar (e.g., the final location of the car). The learning algorithm uses these **labeled** examples to fit a model.

**Mathematical explanation:** The objective function explicitly involves the known targets $y_i$. For instance, the ordinary least squares cost is:

$$ J(\mathbf{w}) = \sum_{i=1}^{n} (\mathbf{x}_i^T \mathbf{w} - y_i)^2 = \|X\mathbf{w} - \mathbf{y}\|_2^2, $$

where $X \in \mathbb{R}^{n\times d}$ is the matrix of features and $\mathbf{y} = (y_1,\dots,y_n)^T$ is the vector of targets. The presence of $y_i$ in the cost means the learning is guided by **observed outputs**, which is the hallmark of supervised learning. In contrast, an unsupervised setting would have no such target values and would typically involve finding structure in the feature data alone (e.g., clustering or density estimation). Here we clearly use known outcomes $y_i$ to train the model, so the problem is supervised.
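As a quick numerical sanity check, the two forms of the cost above (the sum over samples and the squared norm) can be compared on toy data. This is only a sketch: the data, the sizes, and the variable names `J_sum` and `J_norm` are assumptions made for illustration, not part of the problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.standard_normal((n, d))   # hypothetical feature matrix (n x d)
y = rng.standard_normal(n)        # hypothetical targets
w = rng.standard_normal(d)        # arbitrary weight vector

# Sum-over-samples form of the cost: sum_i (x_i^T w - y_i)^2
J_sum = sum((X[i] @ w - y[i]) ** 2 for i in range(n))

# Matrix-norm form of the same cost: ||Xw - y||_2^2
J_norm = np.linalg.norm(X @ w - y) ** 2

print(np.isclose(J_sum, J_norm))
```

Both expressions depend on the targets `y`, which is exactly what makes the setting supervised.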

## (b) Method to Obtain $\hat{y}_t$ for Each $t$ (Least Squares Models with First $t$ Measurements)

We want to develop a predictive model using only the **first $t$ features** (measurements) of each sample to predict the target $y$. For each $t = 1,2,\dots,d$, we derive a separate least-squares model (with its own weight vector and predictions) that uses the truncated feature vector $(x_{i,1}, \dots, x_{i,t})$ of each sample $i$. Denote the prediction at time $t$ by $\hat{y}_t$. For each $t$ we solve a smaller least squares problem restricted to the first $t$ components of the feature vectors. Each $t$ yields a **different model** (different parameters), because the feature set changes.

**Setup:** Let $X^{(t)} \in \mathbb{R}^{n \times t}$ be the matrix containing only the first $t$ features of each of the $n$ training examples. In other words, if $\mathbf{x}_i = (x_{i,1}, x_{i,2}, \dots, x_{i,d})^T$ is the full feature vector for sample $i$, then $X^{(t)}$ has rows $\mathbf{x}_i^{(t)} = (x_{i,1}, \dots, x_{i,t})$ for $i=1,\dots,n$. We continue to use $\mathbf{y} = (y_1,\dots,y_n)^T$ for the vector of target outputs.
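In NumPy terms, $X^{(t)}$ is a column slice of the full feature matrix. The sketch below assumes a small made-up matrix, and the helper name `prefix_matrix` is hypothetical.

```python
import numpy as np

# Hypothetical toy data: n = 4 samples, d = 3 measurements each
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [10.0, 11.0, 12.0]])

def prefix_matrix(X, t):
    """Return X^{(t)}: the first t columns of X (t counts from 1)."""
    return X[:, :t]

X2 = prefix_matrix(X, 2)
print(X2.shape)  # (4, 2): still n rows, but only the first 2 features
```

The number of rows (samples) never changes; only the number of columns grows with $t$.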

**Steps to compute** $\hat{y}_t$ **using the first $t$ features:**

1. **Formulate the least squares problem for $t$ features:** We restrict the input to the first $t$ measurements. The prediction model at time $t$ will be a linear function of the first $t$ components:
   $$\hat{y}_t^{(i)} = \mathbf{x}_{i,1:t}^T \mathbf{w}^{(t)},$$
   where $\mathbf{x}_{i,1:t} \in \mathbb{R}^t$ is the feature vector of the $i$-th training sample truncated to components 1 through $t$, and $\mathbf{w}^{(t)} \in \mathbb{R}^t$ is the weight vector for the model using $t$ features. The training error (sum of squared residuals) using $t$ features is:
   $$
   J_t(\mathbf{w}^{(t)}) = \sum_{i=1}^n \Big( \mathbf{x}_{i,1:t}^T \mathbf{w}^{(t)} - y_i \Big)^2 = \big\|X^{(t)} \mathbf{w}^{(t)} - \mathbf{y}\big\|_2^2\,.
   $$
   This is the **least-squares cost** restricted to the first $t$ measurements.

2. **Optimize to find the weight vector $\mathbf{\hat{w}}^{(t)}$:** We minimize $J_t(\mathbf{w}^{(t)})$ with respect to $\mathbf{w}^{(t)}$. This is an ordinary least squares problem in $\mathbb{R}^t$. Since $J_t$ is convex, it suffices to compute the gradient and set it to zero:
   $$
   \frac{\partial J_t}{\partial \mathbf{w}^{(t)}} = 2\,X^{(t)T}\!\big(X^{(t)}\mathbf{w}^{(t)} - \mathbf{y}\big) = \mathbf{0}.
   $$
   Dividing by 2 and rearranging yields the **normal equations** for the first $t$ features:
   $$
   X^{(t)T} X^{(t)} \,\mathbf{\hat{w}}^{(t)} = X^{(t)T} \mathbf{y}\,.
   $$
   Provided $X^{(t)T}X^{(t)}$ is invertible (which holds exactly when the $t$ columns of $X^{(t)}$ are linearly independent, and this in turn requires $n \ge t$), we solve for the unique minimizer:
   $$
   \mathbf{\hat{w}}^{(t)} = \big(X^{(t)T} X^{(t)}\big)^{-1}\,X^{(t)T}\mathbf{y}\,.
   $$
   This is the **least-squares estimator** using the first $t$ measurements. (If $X^{(t)T}X^{(t)}$ is not invertible, the minimizer is not unique; the pseudoinverse then gives the minimum-norm solution. Assuming the features are in general position, we proceed with the inverse.)

3. **Obtain the prediction $\hat{y}_t$:** Now that we have the weight vector $\mathbf{\hat{w}}^{(t)}$, the model's prediction at time $t$ for any input's first $t$ measurements is given by the linear combination of those features with the learned weights. In particular, for a **new** sample (or any given sample) with feature vector $\mathbf{x}_{1:t}\in \mathbb{R}^t$ (the first $t$ measurements), the predicted target is
   $$
   \hat{y}_t = \mathbf{x}_{1:t}^T\,\mathbf{\hat{w}}^{(t)}\,.
   $$
   This $\hat{y}_t$ is the output of the model that uses $t$ features. We compute such a $\hat{y}_t$ for each $t=1,2,\dots,d$. Effectively, we train $d$ different models: one using 1 feature, another using 2 features, and so on up to all $d$ features. Each model contributes one prediction, so a given sample receives the sequence of predictions $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_d$.

4. **Different model for each $t$:** As $t$ increases, the feature set expands, so $\mathbf{\hat{w}}^{(t)} \in \mathbb{R}^t$ has one more component than $\mathbf{\hat{w}}^{(t-1)} \in \mathbb{R}^{t-1}$, and its first $t-1$ entries generally differ from those of $\mathbf{\hat{w}}^{(t-1)}$ as well. Thus, we obtain a sequence of models $\mathbf{\hat{w}}^{(1)}, \mathbf{\hat{w}}^{(2)}, \dots, \mathbf{\hat{w}}^{(d)}$, each trained on a longer prefix of the measurements, and for each time step $t$ a corresponding prediction $\hat{y}_t$. Intuitively, as $t$ grows (i.e., we use more measurements), the model has more information and can fit $y$ at least as well: the training error is non-increasing in $t$, since the larger model can always ignore the new feature. This guarantee concerns training error only; error on new data need not decrease. In particular, the final model at $t=d$ uses all features and coincides with the original least-squares solution on the full data.
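
The monotonicity of the training error mentioned in step 4 can be made precise with a short zero-padding argument, sketched here in the notation above. Any weight vector for $t-1$ features can be extended with a zero weight on feature $t$ without changing its predictions, so the optimum over $\mathbb{R}^t$ can be no worse:

$$
J_t\big(\mathbf{\hat{w}}^{(t)}\big) = \min_{\mathbf{w}\in\mathbb{R}^t} J_t(\mathbf{w}) \;\le\; J_t\!\left(\begin{pmatrix}\mathbf{\hat{w}}^{(t-1)} \\ 0\end{pmatrix}\right) = \big\|X^{(t-1)}\mathbf{\hat{w}}^{(t-1)} - \mathbf{y}\big\|_2^2 = J_{t-1}\big(\mathbf{\hat{w}}^{(t-1)}\big).
$$

The middle equality holds because the zero weight removes the contribution of column $t$ of $X^{(t)}$, leaving exactly the predictions of the $(t-1)$-feature model.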

**Explicit formula summary:** To be clear, using only the first $t$ features of all training samples, the least squares estimator (weight vector) is
$$
\mathbf{\hat{w}}^{(t)} = \arg\min_{\mathbf{w}\in \mathbb{R}^t} \sum_{i=1}^n ( \mathbf{x}_{i,1:t}^T \mathbf{w} - y_i)^2 = \big(X^{(t)T}X^{(t)}\big)^{-1} X^{(t)T} \mathbf{y},
$$
and the resulting prediction for any example's first $t$ measurements $\mathbf{x}_{1:t}$ is
$$
\hat{y}_t = \mathbf{x}_{1:t}^T\,\mathbf{\hat{w}}^{(t)}.
$$
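
The summary formulas can be turned into a small sketch that fits all $d$ prefix models and then produces the sequence $\hat{y}_1,\dots,\hat{y}_d$ for one new sample. The function names `fit_prefix_models` and `predict_sequence`, the synthetic data, and the choice to solve the normal equations directly with `np.linalg.solve` are assumptions made for illustration; the direct solve presumes every $X^{(t)}$ has full column rank.

```python
import numpy as np

def fit_prefix_models(X, y):
    """Return [w^(1), ..., w^(d)]: one least-squares weight vector per prefix length t."""
    n, d = X.shape
    weights = []
    for t in range(1, d + 1):
        X_t = X[:, :t]  # first t features of every sample
        # Solve the normal equations X_t^T X_t w = X_t^T y
        # (assumes X_t has full column rank, so the system is nonsingular)
        w_t = np.linalg.solve(X_t.T @ X_t, X_t.T @ y)
        weights.append(w_t)
    return weights

def predict_sequence(x, weights):
    """Return [y_hat_1, ..., y_hat_d] for a single sample x with d measurements."""
    return np.array([x[:len(w)] @ w for w in weights])

# Hypothetical synthetic data for demonstration only
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(40)

weights = fit_prefix_models(X, y)
x_new = rng.standard_normal(4)
y_hats = predict_sequence(x_new, weights)
print(y_hats)  # one prediction per prefix length t = 1, ..., 4
```

For ill-conditioned or rank-deficient data, a routine such as `np.linalg.lstsq` is the safer choice, since it avoids forming $X^{(t)T}X^{(t)}$ explicitly.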

### Verification with a Synthetic Example

To reinforce the correctness of these formulas, we can simulate a small dataset and compute the least squares solution for each prefix of features. We expect that: (1) the normal equations are satisfied for each $\hat{\mathbf{w}}^{(t)}$, and (2) as $t$ increases, the training error (sum of squared errors) should not increase (indeed, it typically decreases as more features are used).

Let's create a random dataset with $n=50$ samples and $d=6$ total features, then compute $\hat{\mathbf{w}}^{(t)}$ and training mean squared error (MSE) for each $t=1$ to $6$:


```python
import numpy as np

# Generate a random dataset
np.random.seed(0)
n, d = 50, 6
X_full = np.random.randn(n, d)             # feature matrix (n x d)
true_w = np.random.randn(d)               # underlying true weights (for simulation)
y = X_full.dot(true_w) + 0.5*np.random.randn(n)  # generate targets with some noise

# Solve least squares for each prefix of features 1..t
for t in range(1, d+1):
    X_t = X_full[:, :t]  # first t columns of X_full
    # Compute least-squares solution for X_t (using numpy lstsq which gives the minimizer)
    w_t, *_ = np.linalg.lstsq(X_t, y, rcond=None)
    # Compute training predictions and errors
    pred = X_t.dot(w_t)
    mse = np.mean((pred - y)**2)
    grad_norm = np.linalg.norm(X_t.T.dot(pred - y))  # should be close to 0 if normal eq holds
    print(f"t={t}: training MSE = {mse:.4f},  ||X^T (Xw - y)|| = {grad_norm:.2e}")
```

```
t=1: training MSE = 3.9356,  ||X^T (Xw - y)|| = 1.51e-14
t=2: training MSE = 1.2751,  ||X^T (Xw - y)|| = 2.82e-14
t=3: training MSE = 1.1937,  ||X^T (Xw - y)|| = 7.92e-14
t=4: training MSE = 0.6511,  ||X^T (Xw - y)|| = 4.95e-14
t=5: training MSE = 0.2930,  ||X^T (Xw - y)|| = 1.13e-13
t=6: training MSE = 0.2012,  ||X^T (Xw - y)|| = 9.05e-14
```


We can observe that for each $t$, the quantity $\|X^{(t)T}(X^{(t)}\hat{\mathbf{w}}^{(t)} - \mathbf{y})\|$ is essentially $0$ (up to numerical precision), confirming that the normal equations are satisfied by our solution $\hat{\mathbf{w}}^{(t)}$. Furthermore, the training MSE decreases as $t$ increases (from 3.9356 at $t=1$ down to 0.2012 at $t=6$ in this example), which is expected because using more features gives the model more flexibility to fit the data (the model with all 6 features has the lowest error on training data). This illustrates that each additional measurement can improve the fit to $y$ on the training data, which aligns with our intuition and the requirement that we get a different (typically better) model for each $t$.