# Adaptive basis functions
In the discussion of linear regression and logistic regression, the models are based on linear combinations of **fixed** nonlinear basis functions $\phi_j(\mathbf{x})$ and take the form

$$y(\mathbf{x},\mathbf{w}) = f\left(\sum_{j=1}^M w_j\phi_j(\mathbf{x})\right) \tag{5.1}$$

Our goal is to extend this model by making the basis functions $\phi_j(\mathbf{x})$ depend on parameters and then to allow these parameters to be adjusted, along with the coefficients $\{w_j\}$, during training. There are, of course, many ways to construct parametric nonlinear basis functions. <font color='red'>Neural networks use basis functions that follow the same form as (5.1),</font> so that each basis function is itself a nonlinear function of a linear combination of the inputs, where the coefficients in the linear combination are adaptive parameters.

The model therefore can be partitioned into two layers.

#### First layer

$$z_j = \phi_j(\mathbf{x}) = h\left(a_j\right),\qquad \text{where}\ a_j=\sum_{i=0}^D w_{ji}^{(1)}x_i,\quad x_{0} = 1 \tag{5.2,5.3}$$

where
- $\phi_j(\mathbf{x})$ denotes the $j^{th}$ basis function.
- $\mathbf{x}$ is a $D$-dimensional input vector.
- $\mathbf{z}$ is the $M$-dimensional output vector of the first layer. Its elements $z_j$ are called *hidden units*.
- $a_j$ are known as *activations*; each is a linear combination of the elements of the input vector.
- $w_{ji}^{(1)}$ denotes the first-layer weight connecting the $i^{th}$ input unit to the $j^{th}$ hidden unit; $w_{j0}^{(1)}$ is the bias.
- $h(\cdot)$ is the *activation function*, generally chosen to be a sigmoidal function such as the logistic sigmoid or the 'tanh' function.
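
As a rough illustration of (5.2) and (5.3), the first layer can be sketched in plain Python. The function name `first_layer` and the example weights are assumptions made for this sketch, not part of the original text; 'tanh' is used for $h$.

```python
import math

def first_layer(x, W1, h=math.tanh):
    """Return hidden units z_j = h(a_j), with a_j = sum_i W1[j][i] * x_i and x_0 = 1."""
    x_ext = [1.0] + list(x)  # prepend x_0 = 1 so that W1[j][0] acts as the bias
    activations = [sum(w * xi for w, xi in zip(row, x_ext)) for row in W1]
    return [h(a) for a in activations]

# Hypothetical weights: M = 2 hidden units, D = 2 inputs (each row has D + 1 entries).
W1 = [[0.0, 1.0, -1.0],
      [0.5, 0.0, 0.0]]
z = first_layer([2.0, 2.0], W1)
print(z)
```

Each row of `W1` holds the weights feeding one hidden unit, so the row index plays the role of $j$ and the column index the role of $i$.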


#### Second layer

$$y_k = f\left(a_k\right),\qquad \text{where}\ a_k=\sum_{j=0}^M w_{kj}^{(2)}z_j, \quad z_0 = 1 \tag{5.4,5.5}$$

where

- $\mathbf{z}$ is the $M$-dimensional input vector of the second layer.
- $\mathbf{y}$ is the $K$-dimensional output vector of the second layer.
- $a_k$ are the *activations* of the second layer.
- $w_{kj}^{(2)}$ denotes the second-layer weight connecting the $j^{th}$ hidden unit to the $k^{th}$ output unit; $w_{k0}^{(2)}$ is the bias.
- $f(\cdot)$ is the *activation function* of the second layer. For a two-class classification model, $f$ is the logistic sigmoid function; for a multiclass classification model, $f$ is the softmax function.
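
The second layer and its two output activation functions might be sketched as follows. The names `second_layer`, `logistic_sigmoid` and `softmax` are chosen for this example, and the code is not numerically hardened (the sigmoid can overflow for very negative activations).

```python
import math

def logistic_sigmoid(a):
    """Element-wise logistic sigmoid over a list of activations."""
    return [1.0 / (1.0 + math.exp(-v)) for v in a]

def softmax(a):
    """Softmax over a list of activations."""
    shift = max(a)  # subtracting the maximum avoids overflow and leaves the result unchanged
    exps = [math.exp(v - shift) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def second_layer(z, W2, f):
    """Return y = f(a), with a_k = sum_j W2[k][j] * z_j and z_0 = 1."""
    z_ext = [1.0] + list(z)  # prepend z_0 = 1 so that W2[k][0] acts as the bias
    a = [sum(w * zj for w, zj in zip(row, z_ext)) for row in W2]
    return f(a)

# Hypothetical hidden-unit values and weights (M = 2 hidden units).
z = [0.3, -0.1]
y_binary = second_layer(z, [[0.0, 1.0, 1.0]], logistic_sigmoid)           # K = 1
y_multi = second_layer(z, [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], softmax)    # K = 2
print(y_binary, y_multi)
```

Here `f` receives the whole activation vector, because the softmax needs all $a_k$ at once, whereas the sigmoid acts on each $a_k$ separately.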

# Multilayer perceptron

We can combine these two layers to give the overall network function that takes the form

$$y_k(\mathbf{x},\mathbf{w}) = \sigma\left(\sum_{j=0}^M w_{kj}^{(2)} h\left(\sum_{i=0}^D w_{ji}^{(1)}x_i\right)\right) \tag{5.9}$$

where the set of all weight and bias parameters has been grouped together into a vector $\mathbf{w}$, and the output activation function $f$ has been taken to be the logistic sigmoid $\sigma$.

We can see that the neural network model comprises two stages of processing, each of which resembles the perceptron model of Section 4.1.7, and for this reason the neural network is also known as the *multilayer perceptron*, or MLP. <font color='red'>A key difference compared to the perceptron, however, is that the neural network uses continuous sigmoidal nonlinearities in the hidden units, whereas the perceptron uses step-function nonlinearities.</font> This means that the neural network function is differentiable with respect to the network parameters, and this property will play a central role in network training.
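
To make the differentiability point concrete, the two sigmoidal functions mentioned earlier have simple closed-form derivatives (standard results, stated here without derivation):

$$\frac{d\sigma}{da} = \sigma(a)\left(1-\sigma(a)\right),\qquad \frac{d}{da}\tanh(a) = 1-\tanh^2(a)$$

Both are defined and nonzero for every finite $a$. A step function, by contrast, has zero derivative everywhere except at the step itself, where it is not differentiable, so its gradient gives no information about how to adjust the weights.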


# Neural network forms

## Additional layers

The number of layers of a neural network is counted as its number of layers of adaptive weights.

The neural network discussed above has only two layers. However, we can easily add further layers, each consisting of a weighted linear combination of the previous layer's outputs followed by an element-wise transformation using a nonlinear activation function.
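
A minimal sketch of this idea, assuming 'tanh' hidden units and an identity output activation by default: the forward pass loops over a list of weight matrices, one per layer of adaptive weights. The function name `forward` and the example weights are hypothetical.

```python
import math

def forward(x, weights, h=math.tanh, f=lambda a: a):
    """Forward pass through len(weights) layers of adaptive weights.

    weights[l][j][i] connects unit i of layer l's input (index 0 is the constant
    bias unit) to unit j of its output. h is applied element-wise at hidden
    layers and f element-wise at the output layer.
    """
    units = list(x)
    for layer_index, W in enumerate(weights):
        ext = [1.0] + units  # prepend the bias unit
        a = [sum(w * u for w, u in zip(row, ext)) for row in W]
        is_output = layer_index == len(weights) - 1
        units = [f(v) if is_output else h(v) for v in a]
    return units

# Hypothetical three-layer network: 2 inputs -> 2 hidden -> 1 hidden -> 1 output.
W1 = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W2 = [[0.0, 1.0, 1.0]]
W3 = [[1.0, 2.0]]
y = forward([0.0, 0.0], [W1, W2, W3])
print(y)
```

Passing a list with two matrices recovers the two-layer network above, and appending another matrix adds a layer without changing the function.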

## Skip-layer

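Skip-layer connections bypass one or more layers, and each is associated with its own adaptive parameter. For instance, in a two-layer network these skip-layer connections would go directly from inputs to outputs.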
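
In the notation of (5.4) and (5.5), one way to write the output activation of a two-layer network with skip-layer connections is shown below. The symbol $w_{ki}^{(\text{skip})}$ is introduced here purely for illustration and does not appear in the original text.

$$a_k = \sum_{j=0}^M w_{kj}^{(2)}z_j + \sum_{i=1}^D w_{ki}^{(\text{skip})}x_i,\qquad y_k = f\left(a_k\right)$$

The second sum starts at $i=1$ because the bias is already carried by $w_{k0}^{(2)}$ with $z_0 = 1$.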

## Sparse network

A sparse network is one in which not all possible connections within a layer are present.

## General feed-forward architecture

A feed-forward network must have no closed directed cycles, which ensures that the outputs are deterministic functions of the inputs.
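
One way to see why the no-cycle condition matters is to evaluate a general network unit by unit in topological order, so that every unit is computed only after all units feeding into it. The sketch below uses the standard-library `graphlib` module (Python 3.9+). The unit names, the `evaluate` function and the dictionary layout are assumptions made for this example, and every unit is given the same activation `h` for simplicity.

```python
import math
from graphlib import TopologicalSorter, CycleError

def evaluate(inputs, incoming, h=math.tanh):
    """Evaluate a feed-forward network given as a directed graph.

    inputs:   dict mapping each input unit to its value (a bias can be
              modelled as an input unit fixed at 1.0).
    incoming: dict mapping each non-input unit to {source unit: weight}.
    Raises graphlib.CycleError if the connections contain a directed cycle.
    """
    predecessors = {unit: set(sources) for unit, sources in incoming.items()}
    values = dict(inputs)
    for unit in TopologicalSorter(predecessors).static_order():
        if unit in values:
            continue  # input units already have their values
        a = sum(w * values[src] for src, w in incoming[unit].items())
        values[unit] = h(a)
    return values

# Hypothetical network with a skip connection from x1 straight to y.
incoming = {"z1": {"x1": 1.0, "x2": -1.0},
            "y": {"z1": 2.0, "x1": 0.5}}
values = evaluate({"x1": 1.0, "x2": 1.0}, incoming)
print(values["y"])
```

If the connections contained a cycle, no valid evaluation order would exist, and `static_order()` signals this by raising `CycleError`.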


# Weight-space symmetries

Consider a two-layer neural network with $M$ hidden units having 'tanh' activation functions and full connectivity in both layers. If we change the sign of all of the weights and the bias feeding into a particular hidden unit, then, for a given input pattern, the sign of the activation of the hidden unit will be reversed, because $\tanh(-a) = -\tanh(a)$. This change is exactly compensated by also changing the sign of all of the weights leading out of that hidden unit, so the input-output mapping of the network is unchanged. Thus, for a single hidden unit, there are two different weight vectors that give the same output.

$$\color{red}{\left .
\begin{array}{l}
w_{m0}^{(1)}\\
w_{m1}^{(1)}\\
\vdots\\
w_{mD}^{(1)}
\end{array}
\right\}
\xrightarrow[]{z_m = \tanh(a_m)}
\left\{
\begin{array}{l}
w_{1m}^{(2)}\\
\vdots\\
w_{Km}^{(2)}
\end{array}
\right .}
\overset{\text{equivalent}}{\Leftrightarrow}
\color{blue}{\left .
\begin{array}{l}
-w_{m0}^{(1)}\\
-w_{m1}^{(1)}\\
\vdots\\
-w_{mD}^{(1)}
\end{array}
\right\}
\xrightarrow[]{-z_m = \tanh(-a_m)}
\left\{
\begin{array}{l}
-w_{1m}^{(2)}\\
\vdots\\
-w_{Km}^{(2)}
\end{array}
\right .}
$$

Since there are $M$ hidden units, any given weight vector will be one of a set of $2^M$ equivalent weight vectors. Moreover, because the ordering of the $M$ hidden units is arbitrary, interchanging the weights into and out of any two hidden units leaves the mapping unchanged, and there are $M!$ different orderings of the hidden units. The network therefore has an overall weight-space symmetry factor of $M!\,2^M$.
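
The symmetry argument can be checked numerically on a small example. The sketch below assumes 'tanh' hidden units and linear outputs, and uses hypothetical weights with $M = 3$, $D = 2$ and $K = 1$. The helper names `network` and `transform` are introduced only for this illustration.

```python
import math
from itertools import permutations, product

def network(x, W1, W2):
    """Two-layer network with tanh hidden units and linear outputs."""
    x_ext = [1.0] + list(x)
    z = [math.tanh(sum(w * xi for w, xi in zip(row, x_ext))) for row in W1]
    z_ext = [1.0] + z
    return [sum(w * zj for w, zj in zip(row, z_ext)) for row in W2]

def transform(W1, W2, order, signs):
    """Reorder hidden units by `order`; flip unit m (weights in and out) when signs[m] == -1."""
    new_W1 = [[signs[m] * w for w in W1[m]] for m in order]
    new_W2 = [[row[0]] + [signs[m] * row[m + 1] for m in order] for row in W2]
    return new_W1, new_W2

# Hypothetical weights; rows of W1 are hidden units, column 0 holds the biases.
W1 = [[0.1, 0.2, -0.3], [0.4, -0.5, 0.6], [-0.7, 0.8, 0.9]]
W2 = [[0.05, 1.0, -2.0, 3.0]]
x = [0.3, -1.2]

M = len(W1)
reference = network(x, W1, W2)
equivalents = set()
all_match = True
for order in permutations(range(M)):
    for signs in product((1, -1), repeat=M):
        W1_t, W2_t = transform(W1, W2, order, signs)
        out = network(x, W1_t, W2_t)
        all_match = all_match and all(abs(a - b) < 1e-12 for a, b in zip(out, reference))
        equivalents.add((tuple(map(tuple, W1_t)), tuple(map(tuple, W2_t))))

print(all_match, len(equivalents), math.factorial(M) * 2 ** M)
```

For generic weights such as these, the number of distinct weight settings collected in `equivalents` should equal $M!\,2^M$. With special weights (for example a hidden unit whose weights are all zero) some transformed settings would coincide and the count would be smaller.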