# Linear Basis Function Models

## Function Models
$$\begin{align*}y(\mathbf{x},\mathbf{w})
&=w_0+\sum_{j=1}^{M-1}w_j\phi_j(\mathbf{x}) \tag{3.2}\\
&=\sum_{j=0}^{M-1}w_j\phi_j(\mathbf{x})\qquad let\ \phi_0(\mathbf{x})=1 \\
&=\mathbf{w}^T\phi(\mathbf{x}) \tag{3.3}
\end{align*}$$
where
- $\mathbf{x}=(x_1,\cdots,x_D)^T$ denotes the $D$-dimensional input of the model.
- $\mathbf{w}=(w_0,\cdots,w_{M-1})^T$ denotes the parameters of the model.
- $\phi = (\phi_0,\cdots,\phi_{M-1})^T$ collects the functions $\phi_j(\mathbf{x})$, which are known as *basis functions*.

## Basis Functions
1. $\phi(\mathbf{x})=\mathbf{x}$, i.e. $\phi_j(\mathbf{x})=x_j$ for $j\geq 1$. This is the simplest choice of basis functions. The number of inputs is then $M-1$, and the function model takes the form
$$y(\mathbf{x},\mathbf{w})=w_0+w_1x_1+\cdots+w_{M-1}x_{M-1} \tag{3.1}$$
which is often simply known as *linear regression*.
2. $\phi_j(x)=x^j$. For a single input variable $x$, the basis functions take the powers of $x$. Then the function model takes the form
$$y(x,\mathbf{w})=w_0+w_1x+w_2x^2+\cdots+w_{M-1}x^{M-1}$$
3. $\displaystyle{\phi_j(x)=\exp\left\{-\frac{(x-\mu_j)^2}{2s^2}\right\} }$. This is usually referred to as the 'Gaussian' basis function, where $\mu_j$ sets its location and $s$ its spatial scale.
4. $\displaystyle{\phi_j(x)=\sigma\left(\frac{x-\mu_j}{s}\right)\quad where\ \sigma(a)=\frac{1}{1+\exp(-a)}}$. This is the sigmoidal basis function. Equivalently, we can use the 'tanh' function, because it is related to the logistic sigmoid by $\tanh(a)=2\sigma(2a)-1$. A general linear combination of logistic sigmoids can therefore be rewritten as a linear combination of 'tanh' functions by adjusting the parameters $w_j$ and rescaling $s$.
5. Fourier basis.
6. Wavelet basis.
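
The first few basis functions in the list above can be sketched in NumPy as follows. This is only an illustrative sketch for a single input variable: the function names (`polynomial_basis`, `gaussian_basis`, `sigmoid_basis`, `with_bias`), the sample inputs, the centres and the scale $s=0.3$ are all assumptions made for the example, not part of the notes.

```python
import numpy as np

def polynomial_basis(x, M):
    """Columns are x**0, x**1, ..., x**(M-1); column 0 is phi_0(x) = 1."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, M, increasing=True)

def gaussian_basis(x, centres, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), one column per centre mu_j."""
    x = np.asarray(x, dtype=float)[:, None]
    mu = np.asarray(centres, dtype=float)[None, :]
    return np.exp(-((x - mu) ** 2) / (2.0 * s ** 2))

def sigmoid_basis(x, centres, s):
    """phi_j(x) = sigma((x - mu_j) / s) with the logistic sigmoid sigma."""
    x = np.asarray(x, dtype=float)[:, None]
    mu = np.asarray(centres, dtype=float)[None, :]
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))

def with_bias(Phi):
    """Prepend the constant basis function phi_0(x) = 1 as the first column."""
    return np.hstack([np.ones((Phi.shape[0], 1)), Phi])

# Hypothetical inputs and centres, chosen only for illustration.
x = np.linspace(-1.0, 1.0, 5)
centres = np.array([-0.5, 0.0, 0.5])

Phi_poly = polynomial_basis(x, M=4)                        # shape (5, 4)
Phi_gauss = with_bias(gaussian_basis(x, centres, s=0.3))   # shape (5, 4)
Phi_sig = with_bias(sigmoid_basis(x, centres, s=0.3))      # shape (5, 4)
```

Each row of these arrays is $\phi(x_n)^T$ for one input, so the model output for all inputs at once is `Phi @ w`.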

Most of the discussion in this chapter is independent of the particular choice of basis function set. Indeed, much of our discussion will be equally applicable to the situation in which the vector $\phi(\mathbf{x})$ of basis functions is simply the identity $\phi(\mathbf{x})=\mathbf{x}$. Furthermore, in order to keep the notation simple, we shall focus on the case of a single target variable $t$.


# Maximum Likelihood

## Least Squares

Assume that the target variable $t$ is given by a deterministic function $y(\mathbf{x},\mathbf{w})$ with additive Gaussian noise so that
$$t = y(\mathbf{x},\mathbf{w})+\epsilon \tag{3.7}$$
where $\epsilon$ is a zero mean Gaussian random variable with precision (inverse variance) $\beta$. Thus we can write
$$p(t|\mathbf{x}, \mathbf{w},\beta)=\mathcal{N}(t|y(\mathbf{x},\mathbf{w}), \beta^{-1}) \tag{3.8}$$
Now consider a data set of inputs $\mathbf{X}=\{\mathbf{x}_1,\cdots,\mathbf{x}_N\}$ with corresponding target values $\mathbb{t}=\{t_1,\cdots,t_N\}$. With the assumption that these data points are drawn independently from the distribution (3.8), we obtain the following expression for the likelihood function, which is a function of the adjustable parameters $\mathbf{w}$ and $\beta$, in the form
$$p(\mathbb{t}|\mathbf{X}, \mathbf{w},\beta)=\prod_{n=1}^N\mathcal{N}(t_n|\mathbf{w}^T\phi(\mathbf{x}_n),\beta^{-1})\tag{3.10}$$
<font color='red'>*Note that in supervised learning problems such as regression and classification, we are not seeking to model the distribution of the input variables. Thus $\mathbf{x}$ will always appear in the set of conditioning variables, and so from now on we will drop the explicit $\mathbf{x}$ from the expressions such as $p(\mathbb{t}|\mathbf{x},\mathbf{w},\beta)$ in order to keep the notation uncluttered.*</font>  
Taking the logarithm of the likelihood function gives
$$\begin{align*}\ln p(\mathbb{t}|\mathbf{w},\beta)
&=\sum_{n=1}^N\ln\mathcal{N}(t_n|\mathbf{w}^T\phi(\mathbf{x}_n),\beta^{-1})\\
&=\frac{N}{2}\ln\beta-\frac{N}{2}\ln(2\pi)-\beta E_D(\mathbf{w})\qquad  where\ E_D(\mathbf{w})=\frac{1}{2}\sum_{n=1}^N\{t_n-\mathbf{w}^T\phi(\mathbf{x}_n)\}^2 \tag{3.11,3.12}
\end{align*}$$
The terms that do not involve $\mathbf{w}$ are constant, so maximizing the log likelihood function with respect to $\mathbf{w}$ is equivalent to minimizing the error function $E_D(\mathbf{w})$. The gradient of the log likelihood function with respect to $\mathbf{w}$ takes the form
$$\begin{align*}\nabla\ln p(\mathbb{t}|\mathbf{w},\beta)
&=-\beta\nabla E_D(\mathbf{w})\\
&=\beta\sum_{n=1}^N\{t_n-\mathbf{w}^T\phi(\mathbf{x}_n)\}\phi(\mathbf{x}_n)^T \tag{3.13}
\end{align*}$$


### The parameters $\mathbf{w}$
Maximizing the likelihood function with respect to $\mathbf{w}$ requires setting this gradient to zero, which gives
$$\begin{align*}0
&=\sum_{n=1}^N\{t_n-\mathbf{w}^T\phi(\mathbf{x}_n)\}\phi(\mathbf{x}_n)^T\\
&=\sum_{n=1}^N t_n\phi(\mathbf{x}_n)^T-\sum_{n=1}^N\mathbf{w}^T\phi(\mathbf{x}_n)\phi(\mathbf{x}_n)^T\\
\Rightarrow \mathbf{w}^T\sum_{n=1}^N\phi(\mathbf{x}_n)\phi(\mathbf{x}_n)^T&=\sum_{n=1}^N t_n\phi(\mathbf{x}_n)^T\\
\Rightarrow \mathbf{w}^T \sum_{n=1}^N\begin{bmatrix}\phi_0(\mathbf{x}_n)\\ \vdots \\ \phi_{M-1}(\mathbf{x}_n) \end{bmatrix}
\begin{bmatrix}\phi_0(\mathbf{x}_n) &\cdots & \phi_{M-1}(\mathbf{x}_n) \end{bmatrix}
&=\sum_{n=1}^N t_n\begin{bmatrix}\phi_0(\mathbf{x}_n) &\cdots & \phi_{M-1}(\mathbf{x}_n) \end{bmatrix}\\
\Rightarrow \mathbf{w}^T \sum_{n=1}^N
\begin{bmatrix}
\phi_0(\mathbf{x}_n)\phi_0(\mathbf{x}_n) &\phi_0(\mathbf{x}_n)\phi_1(\mathbf{x}_n) & \cdots &\phi_0(\mathbf{x}_n)\phi_{M-1}(\mathbf{x}_n)\\
\phi_1(\mathbf{x}_n)\phi_0(\mathbf{x}_n) &\phi_1(\mathbf{x}_n)\phi_1(\mathbf{x}_n) & \cdots &\phi_1(\mathbf{x}_n)\phi_{M-1}(\mathbf{x}_n)\\
\vdots &\vdots &\ddots &\vdots \\
\phi_{M-1}(\mathbf{x}_n)\phi_0(\mathbf{x}_n) &\phi_{M-1}(\mathbf{x}_n)\phi_1(\mathbf{x}_n) &\cdots &\phi_{M-1}(\mathbf{x}_n)\phi_{M-1}(\mathbf{x}_n) 
\end{bmatrix}
&=\begin{bmatrix}t_1 &\cdots &t_N\end{bmatrix}
\begin{bmatrix}
\phi_0(\mathbf{x}_1) &\phi_1(\mathbf{x}_1) &\cdots &\phi_{M-1}(\mathbf{x}_1)\\
\phi_0(\mathbf{x}_2) &\phi_1(\mathbf{x}_2) &\cdots &\phi_{M-1}(\mathbf{x}_2)\\
\vdots &\vdots &\ddots &\vdots\\
\phi_0(\mathbf{x}_N) &\phi_1(\mathbf{x}_N) &\cdots &\phi_{M-1}(\mathbf{x}_N)\\
\end{bmatrix}\\
\Rightarrow \mathbf{w}^T 
\begin{bmatrix}
\sum_{n=1}^N \phi_0(\mathbf{x}_n)\phi_0(\mathbf{x}_n) &\sum_{n=1}^N \phi_0(\mathbf{x}_n)\phi_1(\mathbf{x}_n) & \cdots &\sum_{n=1}^N\phi_0(\mathbf{x}_n)\phi_{M-1}(\mathbf{x}_n)\\
\sum_{n=1}^N \phi_1(\mathbf{x}_n)\phi_0(\mathbf{x}_n) &\sum_{n=1}^N \phi_1(\mathbf{x}_n)\phi_1(\mathbf{x}_n) & \cdots &\sum_{n=1}^N \phi_1(\mathbf{x}_n)\phi_{M-1}(\mathbf{x}_n)\\
\vdots &\vdots &\ddots &\vdots \\
\sum_{n=1}^N \phi_{M-1}(\mathbf{x}_n)\phi_0(\mathbf{x}_n) &\sum_{n=1}^N \phi_{M-1}(\mathbf{x}_n)\phi_1(\mathbf{x}_n) &\cdots &\sum_{n=1}^N \phi_{M-1}(\mathbf{x}_n)\phi_{M-1}(\mathbf{x}_n)
\end{bmatrix}
&=\mathbb{t}^T
\begin{bmatrix}
\phi_0(\mathbf{x}_1) &\phi_1(\mathbf{x}_1) &\cdots &\phi_{M-1}(\mathbf{x}_1)\\
\phi_0(\mathbf{x}_2) &\phi_1(\mathbf{x}_2) &\cdots &\phi_{M-1}(\mathbf{x}_2)\\
\vdots &\vdots &\ddots &\vdots\\
\phi_0(\mathbf{x}_N) &\phi_1(\mathbf{x}_N) &\cdots &\phi_{M-1}(\mathbf{x}_N)\\
\end{bmatrix}\\
\Rightarrow \mathbf{w}^T
\begin{bmatrix}
\phi_0(\mathbf{x}_1) &\phi_0(\mathbf{x}_2) &\cdots &\phi_0(\mathbf{x}_N)\\
\phi_1(\mathbf{x}_1) &\phi_1(\mathbf{x}_2) &\cdots &\phi_1(\mathbf{x}_N)\\
\vdots &\vdots &\ddots &\vdots\\
\phi_{M-1}(\mathbf{x}_1) &\phi_{M-1}(\mathbf{x}_2) &\cdots &\phi_{M-1}(\mathbf{x}_N)\\
\end{bmatrix}
\begin{bmatrix}
\phi_0(\mathbf{x}_1) &\phi_1(\mathbf{x}_1) &\cdots &\phi_{M-1}(\mathbf{x}_1)\\
\phi_0(\mathbf{x}_2) &\phi_1(\mathbf{x}_2) &\cdots &\phi_{M-1}(\mathbf{x}_2)\\
\vdots &\vdots &\ddots &\vdots\\
\phi_0(\mathbf{x}_N) &\phi_1(\mathbf{x}_N) &\cdots &\phi_{M-1}(\mathbf{x}_N)\\
\end{bmatrix}
&=\mathbb{t}^T
\begin{bmatrix}
\phi_0(\mathbf{x}_1) &\phi_1(\mathbf{x}_1) &\cdots &\phi_{M-1}(\mathbf{x}_1)\\
\phi_0(\mathbf{x}_2) &\phi_1(\mathbf{x}_2) &\cdots &\phi_{M-1}(\mathbf{x}_2)\\
\vdots &\vdots &\ddots &\vdots\\
\phi_0(\mathbf{x}_N) &\phi_1(\mathbf{x}_N) &\cdots &\phi_{M-1}(\mathbf{x}_N)\\
\end{bmatrix}\\
\Rightarrow \mathbf{w}^T\Phi^T\Phi =\mathbb{t}^T\Phi\qquad let\ 
\Phi&=\begin{bmatrix}
\phi_0(\mathbf{x}_1) &\phi_1(\mathbf{x}_1) &\cdots &\phi_{M-1}(\mathbf{x}_1)\\
\phi_0(\mathbf{x}_2) &\phi_1(\mathbf{x}_2) &\cdots &\phi_{M-1}(\mathbf{x}_2)\\
\vdots &\vdots &\ddots &\vdots\\
\phi_0(\mathbf{x}_N) &\phi_1(\mathbf{x}_N) &\cdots &\phi_{M-1}(\mathbf{x}_N)\\
\end{bmatrix}\\
\Rightarrow \mathbf{w}^T&=\mathbb{t}^T\Phi(\Phi^{T}\Phi)^{-1}\\
\Rightarrow \mathbf{w}_{ML} &= ((\Phi^{T}\Phi)^{-1})^{T}\Phi^T\mathbb{t}\\
&= ((\Phi^{T}\Phi)^{T})^{-1}\Phi^T\mathbb{t}\\
&=(\Phi^{T}\Phi)^{-1}\Phi^T\mathbb{t}\\
&=\Phi^{\dagger }\mathbb{t} \qquad let\ \Phi^{\dagger }\equiv (\Phi^{T}\Phi)^{-1}\Phi^T
\end{align*}$$

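where $\Phi$ is an $N\times M$ matrix called the *design matrix* and $\Phi^{\dagger }$ is known as the *Moore-Penrose pseudo-inverse* of $\Phi$. It generalizes the matrix inverse to non-square matrices: if $\Phi$ is square and invertible (which requires $M=N$), then $\Phi^{\dagger }=(\Phi^{T}\Phi)^{-1}\Phi^T=\Phi^{-1}(\Phi^{T})^{-1}\Phi^T=\Phi^{-1}$.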
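
As a numerical check of the result $\mathbf{w}_{ML}=\Phi^{\dagger}\mathbb{t}$, the sketch below computes the solution in three ways that should agree: by solving the normal equations, by applying the pseudo-inverse, and with a least-squares solver. The synthetic data (a noisy sine curve), the cubic polynomial basis and the random seed are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: t = sin(2*pi*x) + Gaussian noise.
N, M = 30, 4
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.1, size=N)

# N x M design matrix with polynomial basis functions phi_j(x) = x**j.
Phi = np.vander(x, M, increasing=True)

# 1) Normal equations: solve (Phi^T Phi) w = Phi^T t.
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# 2) Moore-Penrose pseudo-inverse: w = Phi^dagger t.
w_pinv = np.linalg.pinv(Phi) @ t

# 3) Least-squares solver, which avoids forming Phi^T Phi explicitly.
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# At the solution, sum_n (t_n - w^T phi_n) phi_n should vanish.
gradient_term = Phi.T @ (t - Phi @ w_lstsq)
```

In practice the least-squares solver or the pseudo-inverse is usually preferred over forming $\Phi^T\Phi$ directly, because the latter squares the condition number of the problem.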


### The parameter $w_0$
The parameter $w_0$ is the bias of our model. To analyze it separately, we make it explicit in the error function, which becomes
$$E_D(\mathbf{w})=\frac{1}{2}\sum_{n=1}^N\{t_n-w_0-\sum_{j=1}^{M-1}w_j\phi_j(\mathbf{x}_n)\}^2 \tag{3.18}$$
Setting the derivative with respect to $w_0$ equal to zero in order to maximize the likelihood function, we obtain
$$w_0=\bar{t}-\sum_{j=1}^{M-1}w_j\bar{\phi_j}\qquad where\quad \bar{t}=\frac{1}{N}\sum_{n=1}^Nt_n,\bar{\phi_j}=\frac{1}{N}\sum_{n=1}^N\phi_j(\mathbf{x}_n)\tag{3.19,3.20}$$
Thus the bias $w_0$ compensates for the difference between the average (over the training set) of the target values and the weighted sum of the averages of the basis function values.
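
The relation (3.19) can be verified numerically. In the sketch below, the data, the two non-constant basis functions ($x$ and $x^2$) and the seed are hypothetical choices made for illustration; the check is that the fitted $w_0$ equals $\bar{t}-\sum_j w_j\bar{\phi_j}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data set.
N = 50
x = rng.normal(size=N)
t = 0.5 + 2.0 * x - 1.0 * x ** 2 + rng.normal(scale=0.2, size=N)

# Non-constant basis functions phi_1(x) = x and phi_2(x) = x**2.
Phi_rest = np.column_stack([x, x ** 2])
# Full design matrix with phi_0(x) = 1 in the first column.
Phi = np.column_stack([np.ones(N), Phi_rest])

# Maximum likelihood (least squares) fit of all parameters at once.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Bias reconstructed from the averages, as in (3.19).
t_bar = t.mean()
phi_bar = Phi_rest.mean(axis=0)
w0_from_means = t_bar - phi_bar @ w[1:]
```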

### The parameter $\beta$
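Maximizing the log likelihood function with respect to the noise precision parameter $\beta$ gives
$$\frac{1}{\beta_{ML}}=\frac{1}{N}\sum_{n=1}^N\{t_n-\mathbf{w}_{ML}^{T}\phi(\mathbf{x}_n)\}^2 \tag{3.21}$$
so the inverse noise precision is the residual variance of the target values around the regression function.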
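
The estimate of $\beta$ can be illustrated with a short sketch. The data are synthetic, generated with an assumed true precision of $25$ (noise standard deviation $0.2$), and the helper `log_likelihood` is a hypothetical name that simply evaluates (3.11) at $\mathbf{w}_{ML}$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: linear model with known noise precision.
N = 2000
true_beta = 25.0
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.column_stack([np.ones(N), x])
t = 1.0 - 3.0 * x + rng.normal(scale=true_beta ** -0.5, size=N)

# Maximum likelihood weights, then the residuals t_n - w_ML^T phi(x_n).
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
residuals = t - Phi @ w_ml

# 1 / beta_ML is the mean squared residual.
beta_ml = 1.0 / np.mean(residuals ** 2)

# Sum-of-squares error E_D(w_ML), as in (3.12).
E_D = 0.5 * np.sum(residuals ** 2)

def log_likelihood(beta):
    """Log likelihood (3.11) evaluated at w_ML as a function of beta."""
    return 0.5 * N * np.log(beta) - 0.5 * N * np.log(2.0 * np.pi) - beta * E_D
```

Evaluating `log_likelihood` slightly above and below `beta_ml` gives smaller values, which is consistent with `beta_ml` being the maximizer.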


## Geometry of Least Squares
Back to the error function 
$$\begin{align*}
E_D(\mathbf{w})&=\frac{1}{2}\sum_{n=1}^N\{t_n-\mathbf{w}^T\phi(\mathbf{x}_n)\}^2 \tag{3.12}\\
\Rightarrow 2E_D(\mathbf{w})&=\sum_{n=1}^N\{t_n-\mathbf{w}^T\phi(\mathbf{x}_n)\}^2 \\
&=\left\| 
\begin{bmatrix}
t_1\\
t_2\\
\vdots\\
t_N
\end{bmatrix}
-
\begin{bmatrix}
\mathbf{w}^T\phi(\mathbf{x}_1)\\
\mathbf{w}^T\phi(\mathbf{x}_2)\\
\vdots\\
\mathbf{w}^T\phi(\mathbf{x}_N)
\end{bmatrix}
\right\|^2\\
&=\left\| 
\begin{bmatrix}
t_1\\
t_2\\
\vdots\\
t_N
\end{bmatrix}
-
\begin{bmatrix}
\phi_0(\mathbf{x}_1) &\phi_1(\mathbf{x}_1) &\cdots &\phi_{M-1}(\mathbf{x}_1)\\
\phi_0(\mathbf{x}_2) &\phi_1(\mathbf{x}_2) &\cdots &\phi_{M-1}(\mathbf{x}_2)\\
\vdots &\vdots &\ddots &\vdots\\
\phi_0(\mathbf{x}_N) &\phi_1(\mathbf{x}_N) &\cdots &\phi_{M-1}(\mathbf{x}_N)\\
\end{bmatrix}
\begin{bmatrix}
w_0\\
w_1\\
\vdots\\
w_{M-1}
\end{bmatrix}
\right\|^2\\
&=\left\| 
\begin{bmatrix}
t_1\\
t_2\\
\vdots\\
t_N
\end{bmatrix}
-
\left(
w_0
\begin{bmatrix}\phi_0(\mathbf{x}_1)\\\phi_0(\mathbf{x}_2)\\ \vdots\\ \phi_0(\mathbf{x}_N)\end{bmatrix}
+
w_1
\begin{bmatrix}\phi_1(\mathbf{x}_1)\\\phi_1(\mathbf{x}_2)\\ \vdots\\ \phi_1(\mathbf{x}_N)\end{bmatrix}
+\cdots
+
w_{M-1}
\begin{bmatrix}\phi_{M-1}(\mathbf{x}_1)\\\phi_{M-1}(\mathbf{x}_2)\\ \vdots\\ \phi_{M-1}(\mathbf{x}_N)\end{bmatrix}
\right)
\right\|^2
\end{align*}$$
This expression is the squared Euclidean distance between the target vector $\mathbb{t}=\begin{bmatrix}t_1\\ t_2\\ \vdots \\t_N\end{bmatrix}$ and the linear combination of $M$ basis function vectors $\mathbb{y}=w_0\phi_0+\cdots+w_{M-1}\phi_{M-1}$, where $\phi_j=\begin{bmatrix}\phi_j(\mathbf{x}_1)\\ \phi_j(\mathbf{x}_2)\\ \vdots\\ \phi_j(\mathbf{x}_N)\end{bmatrix}$ is the $j$-th column of $\Phi$. From a geometric perspective, minimizing the error function falls into three cases:
1. If $M < N$, the vectors $\phi_j$ span a subspace of dimension at most $M$ in the $N$-dimensional space of targets. The distance is smallest when $\mathbb{y}$ is the orthogonal projection of $\mathbb{t}$ onto this subspace.
2. If $M\geq N$ and the vectors $\phi_j$ span the whole $N$-dimensional space, then $\mathbb{y}$ can equal $\mathbb{t}$ exactly and the error is zero.
3. If $\Phi^T\Phi$ is singular or close to singular, where $\Phi=(\phi_0, \cdots, \phi_{M-1})$, the resulting numerical difficulties can be addressed using the singular value decomposition (SVD).
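
The projection picture for the case $M<N$ can be checked numerically. The sketch below uses a random design matrix and random targets (both hypothetical) and confirms that the residual $\mathbb{t}-\mathbb{y}$ is orthogonal to every column of $\Phi$, which is exactly what an orthogonal projection onto the column space means.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical problem with fewer basis functions than data points (M < N).
N, M = 6, 3
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)

# Least-squares solution and the fitted vector y = Phi w_ML.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
y = Phi @ w_ml
residual = t - y

# Inner products of the residual with each basis vector phi_j (should be ~0).
inner_products = Phi.T @ residual

# Orthogonal projector onto the column space of Phi: P = Phi Phi^dagger.
P = Phi @ np.linalg.pinv(Phi)
```

Because `P` is a projector, applying it twice changes nothing (`P @ P` equals `P`), and `P @ t` reproduces `y`.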


## Sequential learning
In sequential (on-line) learning, the model parameters are updated after each data point is presented.

We can obtain a sequential learning algorithm by applying the technique of *stochastic gradient descent*, also known as *sequential gradient descent*. If the error function comprises a sum over data points, $E=\sum_nE_n$, then after presentation of pattern $n$ the parameters are updated using
$$\mathbf{w}^{(\tau+1)}=\mathbf{w}^{(\tau)}-\eta\nabla E_n \tag{3.22}$$
- $\tau$ denotes the iteration number.
- $\eta$ denotes the learning rate.
- $\nabla E_n$ denotes the gradient of the error term $E_n$ for the $n$-th data point with respect to the parameter $\mathbf{w}$, evaluated at $\mathbf{w}^{(\tau)}$.

For the case of the sum-of-squares error function (3.12), this gives
$$\mathbf{w}^{(\tau+1)}=\mathbf{w}^{(\tau)}+\eta(t_n-\mathbf{w}^{(\tau)T}\phi_n)\phi_n\qquad where\quad \phi_n=\phi(\mathbf{x}_n) \tag{3.23}$$
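
The update (3.23) can be sketched as a simple loop. Everything specific here is an assumption for illustration: the synthetic linear data, the learning rate $\eta=0.05$, the number of passes over the data, and the random visiting order. The batch least-squares solution is computed only as a reference for comparison.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: a noisy straight line.
N = 200
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.column_stack([np.ones(N), x])   # phi(x) = (1, x)^T
t = 0.5 + 2.0 * x + rng.normal(scale=0.05, size=N)

eta = 0.05            # learning rate (assumed value)
w = np.zeros(2)       # initial parameters w^(0)

for epoch in range(50):
    # Visit the data points in a fresh random order on each pass.
    for n in rng.permutation(N):
        phi_n = Phi[n]
        # w <- w + eta * (t_n - w^T phi_n) * phi_n, as in (3.23)
        w = w + eta * (t[n] - w @ phi_n) * phi_n

# Batch solution for reference.
w_batch, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```

With a fixed learning rate the sequential estimate keeps fluctuating slightly around the batch solution; too large a value of $\eta$ can make the iteration diverge.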


## Regularized least squares
### Quadratic regularizer
The regularized error function (least squares) takes the form
$$E_D(\mathbf{w})+\lambda E_W(\mathbf{w})=\frac{1}{2}\sum_{n=1}^N\{t_n-\mathbf{w}^T\phi(\mathbf{x}_n)\}^2+\frac{\lambda}{2}\mathbf{w}^T\mathbf{w}\tag{3.24,3.27}$$
Setting the gradient of this regularized error function with respect to $\mathbf{w}$ to zero and solving for $\mathbf{w}$, we obtain
$$\mathbf{w}=(\lambda\mathbf{I}+\Phi^T\Phi)^{-1}\Phi^T\mathbb{t} \tag{3.28}$$
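
A minimal sketch of the closed-form solution (3.28) follows. The helper name `ridge_fit`, the synthetic data, the polynomial basis and the two values of $\lambda$ are assumptions made for the example. It illustrates that a larger $\lambda$ yields a weight vector with a smaller norm.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data: noisy sine curve with a polynomial basis.
N, M = 20, 6
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=N)
Phi = np.vander(x, M, increasing=True)

def ridge_fit(Phi, t, lam):
    """Solve (lam * I + Phi^T Phi) w = Phi^T t, i.e. equation (3.28)."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

w_small = ridge_fit(Phi, t, lam=1e-6)   # almost unregularized
w_large = ridge_fit(Phi, t, lam=1.0)    # strongly regularized

# Gradient of the regularized error at w_large (should be ~0).
grad_large = 1.0 * w_large - Phi.T @ (t - Phi @ w_large)
```

Note that this sketch penalizes every component of $\mathbf{w}$, including the bias $w_0$, exactly as written in (3.27).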

### General regularizer
A more general regularizer is sometimes used, for which the regularized error function takes the form
$$\frac{1}{2}\sum_{n=1}^N\{t_n-\mathbf{w}^T\phi(\mathbf{x}_n)\}^2+\frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q \tag{3.29}$$
where
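- $q = 1$ is known as the *lasso* in the statistics literature. If $\lambda$ is sufficiently large, it drives some of the coefficients $w_j$ to exactly zero, which gives a sparse model.
- $q = 2$ corresponds to the quadratic regularizer (3.27) used above.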

Using the method of *Lagrange multipliers*, we can interpret the optimal parameters geometrically.  
Minimizing (3.29) is equivalent to minimizing $E_D(\mathbf{w})$ subject to the constraint
$$\sum_{j=1}^M|w_j|^q\leq \eta \tag{3.30}$$
for an appropriate value of the parameter $\eta$. When the constraint is active, this amounts to finding the point on the boundary $\sum_{j=1}^M|w_j|^q=\eta$ at which $E_D(\mathbf{w})$ is smallest.
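
The link between the penalized form (3.29) and the constrained form (3.30) can be sketched with a Lagrangian. This is only a brief outline under the usual assumptions of the Lagrange multiplier method for inequality constraints, with the multiplier written as $\lambda/2$ so that it matches the coefficient in (3.29):
$$\begin{align*}
L(\mathbf{w},\lambda)&=E_D(\mathbf{w})+\frac{\lambda}{2}\left(\sum_{j=1}^{M}|w_j|^q-\eta\right),\qquad \lambda\geq 0\\
&=\underbrace{E_D(\mathbf{w})+\frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q}_{\text{regularized error (3.29)}}-\frac{\lambda\eta}{2}
\end{align*}$$
For a fixed $\lambda$ the last term does not depend on $\mathbf{w}$, so minimizing $L$ over $\mathbf{w}$ is the same as minimizing (3.29). The complementary slackness condition $\lambda\left(\sum_{j=1}^{M}|w_j|^q-\eta\right)=0$ then says that whenever $\lambda>0$ the solution lies on the boundary of the constraint region.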

### Limitation of regularization
<font color='red'>Regularization shrinks the parameters toward zero, which limits the effective model complexity and helps to avoid over-fitting on the training data. However, the problem of determining the optimal model complexity is then shifted from one of finding the appropriate number of basis functions to one of determining a suitable value of the regularization coefficient $\lambda$.</font>

## Multiple outputs
Now suppose the input $\mathbf{x}$ is the same as before, but the output is a vector $\mathbf{t}$ with $K$ elements instead of a single variable $t$. In this case, the function model takes the form
$$y(\mathbf{x},\mathbf{W})=\mathbf{W}^T\phi(\mathbf{x}) \tag{3.31}$$
and the Gaussian conditional distribution takes the form
$$p(\mathbf{t}|\mathbf{x},\mathbf{W},\beta)=\mathcal{N}(\mathbf{t}|\mathbf{W}^T\phi(\mathbf{x}),\beta^{-1}\mathbf{I}) \tag{3.32}$$
where
- $\mathbf{x}$ is the same as we discussed before.
- $y(\mathbf{x},\mathbf{W})$ is a function model whose output is a vector with $K$ elements.
- $\mathbf{W}$ is an $M\times K$ matrix of parameters $\mathbf{W}=\begin{bmatrix}\mathbf{w}_1 &\cdots &\mathbf{w}_K\end{bmatrix}=\begin{bmatrix}w_{01} &\cdots &w_{0K}\\ \vdots &\ddots &\vdots\\ w_{(M-1)1} &\cdots &w_{(M-1)K}\end{bmatrix}$.
- $\phi(\mathbf{x})$ is an $M$-dimensional column vector, which takes the same form as we discussed before.
- $\mathbf{t}$ is the $K$-dimensional target vector.

If we have a set of observations for which 
- the inputs are $\mathbf{X}=\begin{bmatrix}\mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_N^T\end{bmatrix}$.
- the targets are $\mathbf{T}=\begin{bmatrix}\mathbf{t}_1^T \\ \vdots \\ \mathbf{t}_N^T\end{bmatrix}=\begin{bmatrix}t_{11} &\cdots &t_{1K} \\ \vdots &\ddots &\vdots \\ t_{N1} &\cdots &t_{NK}\end{bmatrix}$.

The log likelihood function is then given by
$$\begin{align*}
\ln p(\mathbf{T}|\mathbf{X},\mathbf{W},\beta)
&=\sum_{n=1}^N\ln \mathcal{N}(\mathbf{t}_n|\mathbf{W}^T\phi(\mathbf{x}_n),\beta^{-1}\mathbf{I})\\
&=\frac{NK}{2}\ln\left(\frac{\beta}{2\pi}\right)-\frac{\beta}{2}\sum_{n=1}^N\|\mathbf{t}_n-\mathbf{W}^T\phi(\mathbf{x}_n)\|^2 \tag{3.33}
\end{align*}$$
Maximizing this function with respect to $\mathbf{W}$ gives
$$\mathbf{W}_{ML}=(\Phi^T\Phi)^{-1}\Phi^T\mathbf{T} \tag{3.34}$$

This solution decouples across the $K$ target components:
$$\begin{align*}
\begin{bmatrix}w_{01} &\cdots &w_{0K}\\ \vdots &\ddots &\vdots\\ w_{(M-1)1} &\cdots &w_{(M-1)K}\end{bmatrix}
&=(\Phi^T\Phi)^{-1}\Phi^T\begin{bmatrix}t_{11} &\cdots &t_{1K} \\ \vdots &\ddots &\vdots \\ t_{N1} &\cdots &t_{NK}\end{bmatrix}\\
\Rightarrow \begin{bmatrix}\mathbf{w}_1 &\cdots &\mathbf{w}_K\end{bmatrix}
&=(\Phi^T\Phi)^{-1}\Phi^T \begin{bmatrix}\mathbb{t}_1 &\cdots &\mathbb{t}_K\end{bmatrix} 
\qquad let\ \mathbb{t}_k=\begin{bmatrix}t_{1k}\\ \vdots\\ t_{Nk}\end{bmatrix}\\
\Rightarrow \mathbf{w}_k&=(\Phi^T\Phi)^{-1}\Phi^T\mathbb{t}_k\\
&=\Phi^{\dagger}\mathbb{t}_k \qquad let\ \Phi^{\dagger}=(\Phi^T\Phi)^{-1}\Phi^T \tag{3.35}
\end{align*}$$
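
The decoupling in (3.35) can be checked numerically. The sketch below uses random, purely hypothetical data with $K=2$ target components. It computes the pseudo-inverse once, applies it to the whole target matrix $\mathbf{T}$, and compares the result with the column-by-column solutions $\mathbf{w}_k=\Phi^{\dagger}\mathbb{t}_k$.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical sizes: N data points, M basis functions, K target components.
N, M, K = 40, 3, 2
Phi = np.column_stack([np.ones(N), rng.normal(size=(N, M - 1))])  # N x M
W_true = rng.normal(size=(M, K))
T = Phi @ W_true + rng.normal(scale=0.1, size=(N, K))             # N x K

# The pseudo-inverse depends only on Phi, so it is computed once.
Phi_pinv = np.linalg.pinv(Phi)          # M x N

# Joint solution (3.34): all K outputs at once.
W_ml = Phi_pinv @ T                     # M x K

# Decoupled solution (3.35): one single-output regression per column t_k.
w_columns = [Phi_pinv @ T[:, k] for k in range(K)]
W_decoupled = np.column_stack(w_columns)
```

Since $\Phi^{\dagger}$ is shared by all target components, the multi-output problem costs little more than a single-output one.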