# 1. Introduction
In the past we have looked at both logistic regression and binary classification. There, we would collect some data and then try to predict one of two possible labels. For example, if we were dealing with an e-commerce site, we could collect _**time spent on site**_ and _**number of pages viewed**_, and then try to predict whether someone is going to buy something on the site.

In this case, we only have 2 dimensions. We can plot the data, and then try to use a straight line to separate the classes (buy or not buy):

$$\sigma \big( w_1 (\text{time spent on site}) + w_2 (\text{number of pages viewed})\big)$$

If we are able to find a line that goes between the classes, they are _linearly separable_. When we are dealing with data that is linearly separable, logistic regression is fine, since it is a linear classifier. So, linearly separable data can be separated by a line in 2 dimensions, by a plane in 3 dimensions, and by a hyperplane in 4+ dimensions. The point is, no matter how many dimensions we are dealing with, our decision boundary is going to be straight, not curved.
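To make the straight decision boundary concrete, here is a minimal sketch of the two-feature classifier above. The weights and visitor data are made up purely for illustration (they are not fitted to anything), and `sigmoid` is a small helper defined here.

```python
import numpy as np

def sigmoid(a):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical weights, chosen only for illustration (not learned from data).
w = np.array([0.8, -0.5])

# Two made-up visitors: [time spent on site, number of pages viewed]
visitors = np.array([[3.0, 1.0],
                     [1.0, 4.0]])

# Linear combination followed by the sigmoid, as in the formula above.
p_buy = sigmoid(visitors @ w)

# sigmoid(a) > 0.5 exactly when a > 0, so the boundary is the line w^T x = 0.
predicted_class = (p_buy > 0.5).astype(int)
```

Because the sigmoid crosses 0.5 exactly where its argument is zero, the set of points where the prediction flips is the straight line $w^Tx = 0$, however the weights are chosen.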

## 1.1 Neural Networks Add Non-linearity
Now, as we get into the realm of Neural Networks, things begin to change. We can have data that is not linearly separable, such as:

<img src="images2/nonlinear-data.png" width="400">

Logistic regression would _not_ be appropriate for this, while neural networks would! Recall, a linear function has the form:

$$w_1x_1 + w_2x_2+...+w_nx_n$$

$$w^T x$$

Where, just a reminder, in the vector notation $w^T x$, the weights are transposed because by convention they are stored as a column vector, but we need to be able to perform matrix-vector multiplication (akin to the dot product in this case) with the input vector $x$.

So, we can see that anything that cannot be simplified into $w^Tx$ is nonlinear. Now, $x^2$ and $x^3$ are both nonlinear, but neural networks are nonlinear in a _very specific way_. Neural Networks achieve nonlinearity by:

> _Being a combination of multiple logistic regression units (neurons) put together._

That is going to be the focus of this section: determining how we can build a nonlinear classifier (in this case a neural network) by combining logistic regression units (neurons). We will then use this nonlinear classifier to make _**predictions**_.

---

# 2. Logistic Regression $\rightarrow$ Neural Networks
We are now ready to start the transition from logistic regression to neural networks. Recall that a logistic regression unit can be viewed as a single neuron, and we are going to be connecting many of them together to make a network of neurons. The most basic way to do this is the _**feed forward method**_. For logistic regression, we have a weight corresponding to every input:

<img src="images2/logistic-reg-unit.png" width="500">

This is seen clearly in the image above. We have two input features, $x_1$ and $x_2$, but of course there can be many more. Each input feature has a corresponding weight, $w_1$ and $w_2$. In order to determine the output $y$, we multiply each input by its weight, sum them all together, add a bias term, and put it through a sigmoid function:

$$z = x_1w_1 + x_2w_2 + bw_0$$

$$y = p(y \mid x) = \frac{1}{1 + e^{-z}}$$

$$\text{prediction} = \text{round} \big( p(y \mid x)\big)$$

If the probability $p(y \mid x)$ is greater than 0.5, we predict class 1, otherwise we predict class 0.
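The single logistic unit described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `logistic_unit`, the example numbers, and the choice to store the bias as a single scalar `w0` are all assumptions made here.

```python
import numpy as np

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit(x, w, w0):
    """Return (probability, predicted class) for one input vector x.

    x  : input features
    w  : one weight per input feature
    w0 : bias term, stored here as a single scalar (an assumption)
    """
    z = np.dot(x, w) + w0      # weighted sum of the inputs plus the bias
    p = sigmoid(z)             # squash to a probability in (0, 1)
    return p, int(p > 0.5)     # threshold at 0.5 to get the class label

# Made-up example values.
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
w0 = 0.1
p, label = logistic_unit(x, w, w0)
```

Thresholding with `p > 0.5` plays the role of the rounding step and makes the behaviour at exactly 0.5 explicit (it maps to class 0).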

## 2.1 Extend to a Neural Network
Now, in order to extend this concept to that of a neural network, the main thing we need to do is add more logistic regression units (i.e. neurons), arranged in layers:

<img src="images2/multilayer-neurons.png" width="500">

We will be working mainly with 1 extra layer, but an arbitrary number can be added. In recent years, researchers have found amazing success with deeper networks, hence the term _deep learning_. The first step is of course to just add 1 layer, and our calculations are exactly the same!

> We multiply each input by its weight (linear combination), add the bias, and pass it through a sigmoid function. 

That is how we get the value at each hidden node $z_j$:

$$z_j = \sigma \big(\sum_i (W_{ij}x_i) + b_j\big)$$

Where in the above equation our $x$ inputs are indexed by $i$, and the $z$ nodes are indexed by $j$. Also, notice that our set of weights $W$ is now a matrix. This is because there needs to be a weight for each input-output pair. Hence, since we have 2 inputs and 3 nodes in the next layer, we need 6 weights in total.
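As a sketch of the per-node calculation, the loop below computes each hidden value one at a time, reading the bias $b_j$ as being added once per node (outside the sum over inputs). The input, weight, and bias values are made up for illustration.

```python
import numpy as np

def sigmoid(a):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-a))

D, M = 2, 3                      # 2 inputs and 3 hidden nodes, as in the diagram
x = np.array([1.0, -1.0])        # made-up input vector
W = np.array([[0.1, 0.2, 0.3],   # W[i, j]: weight from input i to hidden node j
              [0.4, 0.5, 0.6]])
b = np.zeros(M)                  # one bias per hidden node (zeros for simplicity)

z = np.zeros(M)
for j in range(M):               # one pass per hidden node
    a = b[j]                     # start from the node's bias
    for i in range(D):           # accumulate the weighted inputs
        a += W[i, j] * x[i]
    z[j] = sigmoid(a)            # apply the nonlinearity
```

Here `W` has shape `(D, M)`, so it holds exactly $2 \times 3 = 6$ weights, one per input-node pair.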

## 2.2 Nonlinearities
We have already spoken about the fact that one of the main things that makes neural networks so powerful is that they are nonlinear classifiers. We have already gone over one of the most common nonlinear functions utilized in this architecture, the sigmoid:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

There are also several other nonlinearities that we will cover soon, such as _tanh_ and the _ReLU_.
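For a quick feel of how these functions behave, here is a small sketch evaluating all three on the same inputs. The `sigmoid` and `relu` helpers are defined here for illustration, using the standard definition $\max(0, x)$ for the ReLU; `np.tanh` is NumPy's built-in.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: output lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Rectified linear unit: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])

s = sigmoid(x)   # squashes to (0, 1), equal to 0.5 at x = 0
t = np.tanh(x)   # squashes to (-1, 1), equal to 0 at x = 0
r = relu(x)      # zero for negative inputs, identity for positive inputs
```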

## 2.3 Vectorization
From an implementation standpoint, remember that it is faster to use NumPy matrix and vector operations than Python for loops. Hence, we are going to treat the layers of our system as follows:

> * $x$ is going to be a **_D dimensional vector_** (D is 2 in the above diagram)
* $z$ is going to be an **_M dimensional vector_** (M is 3 in the above diagram)

This means that Z is going to look like: 

$$z_j = \sigma \big(\sum_i (W_{ij}x_i) + b_j\big)$$

$$\downarrow$$

$$z_j = \sigma \big(W_{j}^Tx + b_j\big)$$

$$\downarrow$$

$$z = \sigma \big(W^Tx + b\big)$$

And $p(y \mid x)$ will look like:

$$p(y \mid x) = \sigma \big(\sum_j (v_{j}z_j) + c\big)$$

$$\downarrow$$

$$p(y \mid x) = \sigma \big(v^Tz + c\big)$$
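The vectorized forms can be checked against the element-wise sums with a short sketch. The random values below are placeholders (any values would do), and `W` is stored with shape `(D, M)` so that `W[i, j]` matches $W_{ij}$.

```python
import numpy as np

def sigmoid(a):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
D, M = 2, 3

x = rng.normal(size=D)           # input vector, D-dimensional
W = rng.normal(size=(D, M))      # input-to-hidden weights
b = rng.normal(size=M)           # hidden biases, one per hidden node
v = rng.normal(size=M)           # hidden-to-output weights
c = 0.5                          # output bias (scalar)

# Vectorized forward pass for a single sample.
z = sigmoid(W.T @ x + b)         # hidden layer, M-dimensional
p = sigmoid(v @ z + c)           # p(y | x), a single number

# The same hidden layer written as explicit sums, for comparison.
z_loop = np.array([
    sigmoid(sum(W[i, j] * x[i] for i in range(D)) + b[j])
    for j in range(M)
])
```

Both versions produce the same hidden values; the vectorized one simply hands the summation to NumPy.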

## 2.4 Matrix Notation
The above is pretty good, but we can do even better! In general, we have _many_ data points, and we want to consider more than one sample at a time. So, we can further vectorize our calculations, by using the full input matrix of data! 

This will be an $N \times D$ matrix, where $N$ is the number of samples and $D$ is the dimensionality. Because we will then be calculating everything at the same time, $Z$ will be an $N \times M$ matrix. The output $Y$ will be an $N \times 1$ matrix (for binary classification; for $k$ classes it will be $N \times k$). It is important that our weights all have the correct shape. Hence:

> * $W$ is $D \times M$
* The first bias, $b$, is an $M$-dimensional vector (it is added to every row of $XW$)
* The output weight $v$ is an $M \times 1$ vector
* The output bias $c$ is a scalar

We end up with:

$$Z = \sigma \big(XW + b\big) $$

$$Y = \sigma \big(Zv + c\big)$$
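A minimal sketch of this batched forward pass is shown below. The function name `forward` and the random data are illustrative assumptions. The bias `b` is stored as a length-`M` array so that NumPy broadcasting adds it to every row of `X @ W`; an `(M, 1)` column would not broadcast correctly against the `(N, M)` product.

```python
import numpy as np

def sigmoid(a):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(X, W, b, v, c):
    """Forward pass for a whole batch of samples.

    X : (N, D) input matrix, one sample per row
    W : (D, M) input-to-hidden weights
    b : (M,)   hidden biases, broadcast across the N rows
    v : (M, 1) hidden-to-output weights
    c : scalar output bias
    """
    Z = sigmoid(X @ W + b)   # hidden layer values, shape (N, M)
    Y = sigmoid(Z @ v + c)   # output probabilities, shape (N, 1)
    return Z, Y

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
N, D, M = 5, 2, 3
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, M))
b = rng.normal(size=M)
v = rng.normal(size=(M, 1))
c = 0.0

Z, Y = forward(X, W, b, v, c)
```

Each row of `Z` equals the single-sample result $\sigma(W^Tx + b)$ for the corresponding row of `X`, so the batched version computes the same values for all $N$ samples at once.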