# Neural Networks: Representation

## Motivations

### Non-linear Hypotheses

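Suppose we train a model on small $50 \times 50$ pixel images and treat every pixel as a feature. That gives $2500$ features. If we then combine the features pairwise to build a polynomial model, we end up with about $2500^2 / 2$ (roughly three million) features. Ordinary logistic regression cannot handle this many features efficiently, which is where neural networks come in.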
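The feature count above can be checked with a few lines of arithmetic. This is only a sketch of the counting argument; the variable names are illustrative, and whether the squared terms $x_i^2$ are counted is an assumption that changes the total only slightly.

```python
from math import comb

n_pixels = 50 * 50                 # one feature per pixel: 2500

# Products x_i * x_j with i < j (distinct pairs only).
n_pairs = comb(n_pixels, 2)

# Assumption: also counting the squared terms x_i^2 adds n_pixels more.
n_quadratic = n_pairs + n_pixels

print(n_pixels, n_pairs, n_quadratic)
```

Either way the total lands near $2500^2 / 2 \approx 3.1$ million.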

### Neurons and the Brain

![image.png](https://i.loli.net/2020/01/16/atLOIrjifHu9TFC.png)

## Neural Networks

### Model Representation I

![image.png](https://i.loli.net/2020/02/27/neEyou8xs3PmfFc.png)

Input nodes (layer 1), also known as the "input layer", go into another node (layer 2), which finally outputs the hypothesis function, known as the "output layer".

$a_{i}^{(j)}=$ "activation" of unit $i$ in layer $j$  
$\Theta^{(j)}=$ matrix of weights controlling function mapping from layer $j$ to layer $j+1$

$\begin{aligned} a_{1}^{(2)} &=g\left(\Theta_{10}^{(1)} x_{0}+\Theta_{11}^{(1)} x_{1}+\Theta_{12}^{(1)} x_{2}+\Theta_{13}^{(1)} x_{3}\right) \\ a_{2}^{(2)} &=g\left(\Theta_{20}^{(1)} x_{0}+\Theta_{21}^{(1)} x_{1}+\Theta_{22}^{(1)} x_{2}+\Theta_{23}^{(1)} x_{3}\right) \\ a_{3}^{(2)} &=g\left(\Theta_{30}^{(1)} x_{0}+\Theta_{31}^{(1)} x_{1}+\Theta_{32}^{(1)} x_{2}+\Theta_{33}^{(1)} x_{3}\right) \\ h_{\Theta}(x)=a_{1}^{(3)}=& g\left(\Theta_{10}^{(2)} a_{0}^{(2)}+\Theta_{11}^{(2)} a_{1}^{(2)}+\Theta_{12}^{(2)} a_{2}^{(2)}+\Theta_{13}^{(2)} a_{3}^{(2)}\right) \end{aligned}$

If a network has $s_{j}$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_{j}+1)$.  
> The +1 comes from the addition in $\Theta^{(j)}$ of the "bias nodes," $x_0$ and $\Theta_{0}^{(j)}$. In other words, the output nodes will not include the bias nodes while the inputs will.

The following image summarizes our model representation:
![image.png](https://i.loli.net/2020/02/27/tAqioFUYc91QBGV.png)
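To make the dimension rule concrete, here is a minimal sketch that lists the shape of each $\Theta^{(j)}$ from the number of units per layer. The helper name `theta_shapes` is hypothetical, and the unit counts exclude the bias node.

```python
def theta_shapes(layer_sizes):
    """Return the shape of each Theta^(j) given units per layer (bias excluded).

    Theta^(j) maps layer j to layer j+1, so its shape is
    (s_{j+1}, s_j + 1); the +1 column multiplies the bias unit.
    """
    return [
        (s_next, s_cur + 1)
        for s_cur, s_next in zip(layer_sizes, layer_sizes[1:])
    ]

# The network from the equations above: 3 inputs, 3 hidden units, 1 output.
shapes = theta_shapes([3, 3, 1])
print(shapes)
```

For the 3-3-1 network, $\Theta^{(1)}$ is $3 \times 4$ and $\Theta^{(2)}$ is $1 \times 4$, which matches the four terms in each equation above.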

### Model Representation II

We now define a new variable $z_k^{(j)}$ that stands for the weighted sum inside our $g$ function. In the previous example, replacing each weighted sum with the variable $z$ gives:  
$a_{1}^{(2)}=g\left(z_{1}^{(2)}\right)$  
$a_{2}^{(2)}=g\left(z_{2}^{(2)}\right)$  
$a_{3}^{(2)}=g\left(z_{3}^{(2)}\right)$  

In other words, for layer j=2 and node k, the variable z will be:  
$z_{k}^{(2)}=\Theta_{k, 0}^{(1)} x_{0}+\Theta_{k, 1}^{(1)} x_{1}+\cdots+\Theta_{k, n}^{(1)} x_{n}$

The vector representation of $x$ and $z^{(j)}$ is:  
$x=\left[\begin{array}{c}{x_{0}} \\ {x_{1}} \\ {\vdots} \\ {x_{n}}\end{array}\right] \quad z^{(j)}=\left[\begin{array}{c}{z_{1}^{(j)}} \\ {z_{2}^{(j)}} \\ {\vdots} \\ {z_{s_j}^{(j)}}\end{array}\right]$

Setting $x = a^{(1)}$ we can rewrite the equation as:  
$z^{(j)}=\Theta^{(j-1)} a^{(j-1)}$

We are multiplying our matrix $\Theta^{(j-1)}$ with dimensions $s_j\times (n+1)$ (where $s_j$ is the number of our activation nodes) by our vector $a^{(j-1)}$ with height $(n+1)$. This gives us our vector $z^{(j)}$ with height $s_j$. Now we can get a vector of our activation nodes for layer $j$ as follows:  
$a^{(j)}=g\left(z^{(j)}\right)$

Where our function g can be applied element-wise to our vector $z^{(j)}$.

We can then add a bias unit (equal to 1) to layer j after we have computed $a^{(j)}$. This will be element $a_0^{(j)}$ and will be equal to 1. To compute our final hypothesis, let's first compute another z vector:  
$z^{(j+1)}=\Theta^{(j)} a^{(j)}$

We get this final z vector by multiplying the next theta matrix after $\Theta^{(j-1)}$ with the values of all the activation nodes we just got. This last theta matrix $\Theta^{(j)}$ will have only **one row** which is multiplied by one column $a^{(j)}$ so that our result is a single number. We then get our final result with:  
$h_{\Theta}(x)=a^{(j+1)}=g\left(z^{(j+1)}\right)$

Notice that in this **last step**, between layer j and layer j+1, we are doing **exactly the same thing** as we did in logistic regression. Adding all these intermediate layers in neural networks allows us to more elegantly produce interesting and more complex non-linear hypotheses.
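The vectorized steps in this section can be sketched in NumPy as follows. This is an illustration rather than a reference implementation: it assumes $g$ is the sigmoid function (as in logistic regression), the function names are made up for the example, and the weights are random placeholders, not trained values.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Forward propagation for one example.

    x      : 1-D array of input features, without the bias unit
    thetas : list of weight matrices; thetas[j] has shape (s_next, s_cur + 1)
    """
    a = np.asarray(x, dtype=float)
    for theta in thetas:
        a = np.concatenate(([1.0], a))  # prepend bias unit a_0 = 1
        z = theta @ a                   # z^(j+1) = Theta^(j) a^(j)
        a = sigmoid(z)                  # a^(j+1) = g(z^(j+1))
    return a

# Placeholder weights for a 3-3-1 network (random, for illustration only).
rng = np.random.default_rng(0)
thetas = [rng.normal(size=(3, 4)), rng.normal(size=(1, 4))]
h = forward([0.5, -1.2, 2.0], thetas)
print(h.shape)
```

Because the last matrix has a single row, `h` holds one number between 0 and 1, just like the output of logistic regression.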

## Applications

### Examples and Intuitions I

#### Non-linear classification example: XOR/XNOR

![image.png](https://i.loli.net/2020/02/27/nOFqsowctAbRdhG.png)

#### Simple example: AND

![image.png](https://i.loli.net/2020/02/27/3q9lPTK7fzoxX1i.png)

#### Example: OR

![image.png](https://i.loli.net/2020/02/27/aSoL8Y1e7xZMHqA.png)

### Examples and Intuitions II

#### Negation: NOT

![image.png](https://i.loli.net/2020/03/01/wd5SqMDAQyvKPjO.png)

#### Putting together: $x_1 \text{ XNOR } x_2$ 

Since  
$x_{1} \mathrm{XNOR} x_{2}=\left[x_{1} \mathrm{AND} x_{2}\right] \mathrm{OR}\left[\left(\mathrm{NOT} x_{1}\right) \mathrm{AND}\left(\mathrm{NOT} x_{2}\right)\right]$  
the networks above can be combined to compute XNOR:
![image.png](https://i.loli.net/2020/02/27/uNZJYyzlsabLk5D.png)
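The composition can be checked numerically. The weights below are an assumption: they are the values commonly used in this course example (AND: $-30, 20, 20$; (NOT $x_1$) AND (NOT $x_2$): $10, -20, -20$; OR: $-10, 20, 20$), and the figures here may use different numbers. The function name `xnor` and the 0.5 threshold are also illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed weights; each row is [bias, weight for input 1, weight for input 2].
THETA1 = np.array([
    [-30.0,  20.0,  20.0],   # hidden unit 1: x1 AND x2
    [ 10.0, -20.0, -20.0],   # hidden unit 2: (NOT x1) AND (NOT x2)
])
THETA2 = np.array([[-10.0, 20.0, 20.0]])  # output: hidden 1 OR hidden 2

def xnor(x1, x2):
    """Evaluate the two-layer network and threshold the output at 0.5."""
    a1 = np.array([1.0, x1, x2])              # input layer with bias
    a2 = sigmoid(THETA1 @ a1)                 # hidden activations
    a2 = np.concatenate(([1.0], a2))          # add bias unit
    h = sigmoid(THETA2 @ a2)                  # output activation
    return int(h[0] >= 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xnor(x1, x2))
```

The large weight magnitudes push the sigmoid close to 0 or 1, so each unit behaves almost like a Boolean gate.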

> [Handwritten digit classification [courtesy of Yann LeCun]](http://yann.lecun.com/exdb/mnist/)

### Multiclass Classification

To classify data into multiple classes, we let our hypothesis function return a vector of values. Say we wanted to classify our data into one of four categories. We will use the following example to see how this classification is done. This algorithm takes as input an image and classifies it accordingly:
![](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/9Aeo6bGtEea4MxKdJPaTxA_4febc7ec9ac9dd0e4309bd1778171d36_Screenshot-2016-11-23-10.49.05.png?expiry=1579305600000&hmac=bhabjPZNSU6IM_1L75kotipX5KV84nS2M4aoqeU24Pk)

We can define our set of resulting classes as y:  
![](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/KBpHLXqiEealOA67wFuqoQ_95654ff11df1261d935ab00553d724e5_Screenshot-2016-09-14-10.38.27.png?expiry=1579305600000&hmac=0PLKWj8zT7sXGGG-WVFkTHwrEoW_2V9YuXpDEfXHadE)

Each $y^{(i)}$ represents a different image corresponding to either a car, pedestrian, truck, or motorcycle. The inner layers each provide us with some new information, which leads to our final hypothesis function. The setup looks like:
![](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/VBxpV7GvEeamBAoLccicqA_3e7f67888330b131426ecffd27936f61_Screenshot-2016-11-23-10.59.19.png?expiry=1579305600000&hmac=hV1XupDrjoMttEAs_c5CwIa5Ubwq2_jFXSnbn7Cuugo)

Our resulting hypothesis for one set of inputs may look like:  
$h_{\Theta}(x)=\left[\begin{array}{l}{0} \\ {0} \\ {1} \\ {0}\end{array}\right]$  
In which case our resulting class is the third one down, or $h_{\Theta}(x)_3$, which represents the motorcycle.
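Reading a class off the output vector can be sketched as below. The class ordering is an assumption: the text only states that the third entry is the motorcycle, so the other positions in `CLASSES` are placeholders. The helper names and the sample activations are likewise made up for illustration.

```python
import numpy as np

# Assumed ordering; only "motorcycle" in third position comes from the text.
CLASSES = ["pedestrian", "car", "motorcycle", "truck"]

def predict_class(h):
    """Return the 0-based index of the largest output activation."""
    return int(np.argmax(h))

def one_hot(index, num_classes):
    """Build the target vector y with a 1 at the given 0-based index."""
    y = np.zeros(num_classes)
    y[index] = 1.0
    return y

# In practice the outputs are real values in (0, 1), not exact 0s and 1s.
h = np.array([0.02, 0.10, 0.85, 0.03])
k = predict_class(h)
print(k, CLASSES[k], one_hot(k, len(CLASSES)))
```

Note that NumPy indices are 0-based, so index 2 corresponds to $h_{\Theta}(x)_3$ in the notation above.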