# Section Introduction and Outline 
* Recall: in previous studies we have looked at logistic regression and binary classification
* This is where we collect some data, and then try to predict 1 of 2 possible labels
* For example, we could collect **time spent on site** and **number of pages viewed**, and then try to predict whether a visitor is going to buy something on the site

![lr%20plot.png](attachment:lr%20plot.png)

* this may look like the image above
* in the case that we only have 2 dimensions, we will plot the information, and then try to use a straight line to classify the classes 
### $$\sigma\Big(w_1*(time\;spent\;on\;site) + w_2*(number\;pages\;viewed)\Big)$$
* if we can find a line that goes between the classes, then we say that they are "linearly separable"
* Recall that when we have a linearly separable problem, logistic regression is fine, since it's a linear classifier
* Note, we are not limited to just two inputs! 
    * our ecommerce data for instance has more than 2 inputs
    * in 3 dimensions, our decision boundary is a plane
    * in 4+ dimensions our decision boundary is a hyperplane
* The point is, no matter how many dimensions we have, our decision boundary is always going to be flat (a line, plane, or hyperplane), never curved
## This changes with Neural Networks
* neural networks are a little more advanced
* we can have classes that are not linearly separable

![nonlinearly%20seperable.png](attachment:nonlinearly%20seperable.png)

* logistic regression would not be appropriate for data like this, but a neural network would
* Remember, a linear function has the form:
### $$w^Tx$$
* anything that cannot be simplified into $w^Tx$ is nonlinear
* neural networks are nonlinear
* $x^2$ and $x^3$ are nonlinear, but neural networks are nonlinear in a very specific way
* They achieve nonlinearity simply by being a combination of multiple logistic regression units (neurons) put together 
* So that is what we are going to do in this first section: **we are going to see how we can form a nonlinear classifier (neural network), which we can build by combining multiple logistic regression units (neurons)**
## Multi-class classification
* in the logistic regression notes, we talk about binary classification
* Now, we are talking about multi-class classification 
* we are going to specifically talk about how to classify more than two things when looking at **sigmoid vs. softmax**
* note that we can apply the softmax retroactively to logistic regression as well
* conversely, if you only want to do binary classification in a neural network, then the sigmoid will work! 

## Code
* we are then going to put all of these concepts into code
* we will first put softmax into code
* then we will use what we know about softmax and build an entire neural network in code
* finally we will apply it to a real world problem

## Prediction
* This section focuses on how to do prediction in a neural network
* AKA, given an input vector, how do I calculate the output
* How do we interpret the numbers that come out
* As you will see, the numbers that come out represent the probability that the input belongs to each class
* in this section, the outputs from our neural network will not be meaningful yet, because its weights have not been trained
* remember, logistic regression involves a set of weights 
* because neural networks are made up of logistic regression units, they also have weights
* In logistic regression, we used gradient descent to determine those weights
* This is known as **training**
* This section will focus on prediction, the next section will focus on training
* Our weights will not make sense until we train the neural networks

## Typical ML process
* note that this is the typical process in machine learning
* you first create a brain, but initially that brain isn't smart since it has not been trained yet
* Hence the predictions are not accurate
* then we train
* after training, the brain will make accurate predictions
* this is because we trained it to 

---
# Logistic Regression to Neural Networks
* we are now going to transition from logistic regression to neural networks
* recall that a logistic regression unit can be viewed as a single neuron, and we are going to connect several of them together to make a network of neurons
* the most basic way to do this is the feedforward method (which we will go over)
* For logistic regression, we have a weight corresponding to every input

![logistic%20regression%20unit.png](attachment:logistic%20regression%20unit.png)

* note the image above only has 2 input features (x1 and x2), but it can have many more
* and to get the output y, we multiply each input by its weight, sum them all together, add a bias term, and put it through a sigmoid

### $$a = x_1w_1 + x_2w_2 + b$$
### $$y = p(y\;|\;x) = \frac{1}{1+e^{-a}}$$
### $$prediction = round\Big(p(y|x)\Big)$$
* in other words, if $p(y|x) > 0.5$ we predict class 1, and if it is less than 0.5 we predict class 0
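
The three equations above can be sketched in a few lines of NumPy. This is only an illustration: the function name `logistic_unit` and the input, weight, and bias values are made up for the example and do not come from any trained model.

```python
import numpy as np

def sigmoid(a):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

def logistic_unit(x, w, b):
    # a = x1*w1 + x2*w2 + ... + b
    a = np.dot(x, w) + b
    # p(y | x): probability that the input belongs to class 1
    p = sigmoid(a)
    # Rounding the probability is the same as thresholding at 0.5
    prediction = int(p > 0.5)
    return p, prediction

# Hypothetical values chosen only for illustration
x = np.array([2.0, 1.0])
w = np.array([0.5, -1.0])
b = 0.25

p, prediction = logistic_unit(x, w, b)
print(p, prediction)
```

Here the activation is `2.0*0.5 + 1.0*(-1.0) + 0.25 = 0.25`, which is positive, so the probability is above 0.5 and the unit predicts class 1.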

## Extend to neural network
* basically all we do is just add more logistic regression layers

![neural%20network%20diagram.png](attachment:neural%20network%20diagram.png)

* we are going to mainly work with 1 extra layer, but an arbitrary number can be added
* in recent years, researchers have found more success with deeper networks, hence the term deep learning 
* the first step of course is to just add 1 layer, and the calculations are exactly the same
* we multiply each input by its weight, add the bias, and pass it through a sigmoid 
* that is how we get each value at the node z 
### $$z_j = \sigma\Big(\sum_i W_{ij}x_i+b_j\Big)$$
* note that in the equation above and diagram, the x inputs are indexed with i, and the z nodes are indexed with j
* also, notice that w (our weights) is now a matrix. There needs to be a weight for each input output pair
* Hence, if there are 2 inputs and 3 hidden nodes, then there will be 6 weights in total
* notice how each node z has its own bias, $b_j$
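
To make the indexing concrete, here is a rough sketch of the hidden layer computed with explicit loops, assuming 2 inputs and 3 hidden nodes. The random values are placeholders, since nothing has been trained. Note that the bias for node `j` is added once, after the sum over the inputs.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

D, M = 2, 3  # assumed sizes: 2 inputs, 3 hidden nodes

# Placeholder values, drawn at random purely for illustration
rng = np.random.default_rng(0)
x = rng.standard_normal(D)        # inputs, indexed by i
W = rng.standard_normal((D, M))   # W[i, j] connects input i to hidden node j
b = rng.standard_normal(M)        # one bias per hidden node, indexed by j

z = np.zeros(M)
for j in range(M):
    total = 0.0
    for i in range(D):
        total += W[i, j] * x[i]   # weighted sum over the inputs
    z[j] = sigmoid(total + b[j])  # bias added once, then the sigmoid

print(W.size)  # number of weights: D * M = 6
print(z)
```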

## Nonlinearities
* These are what make neural networks so powerful, because they are nonlinear classifiers
* We have already seen the sigmoid (S-curve), which goes from 0 to 1 

![sigmoid.png](attachment:sigmoid.png)

### $$sigmoid(x) = \frac{1}{1+e^{-x}}$$

## Example
* our inputs will be x1 = 0, and x2 = 1, or x=[0,1]
* we will have one hidden layer, with two hidden units
* All of our weights will be 1, and our biases will be 0 
### $$w_{1,1} = w_{1,2} = w_{2,1} = w_{2,2} = 1$$
### $$b = c = 0$$
### $$v_1 = v_2 = 1$$
* first we need to calculate our z's
### $$z_1 = \sigma(0*1+1*1) = 0.731 $$
### $$z_2 = \sigma(0*1+1*1) = 0.731 $$
* now we need to calculate our output prediction, y (a probability)
### $$p(y|x) = \sigma(0.731*1 + 0.731*1) = 0.812$$
* soon we will see how to choose the correct weights and biases so that the neural networks we build will be able to perform accurately
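
The arithmetic in this example can be checked with a short script. This sketch assumes that every weight, including the output weights `v`, equals 1 and every bias equals 0, which is what the calculation of $p(y|x)$ above uses.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.0, 1.0])   # inputs x1 = 0, x2 = 1
W = np.ones((2, 2))        # input-to-hidden weights, all 1
b = np.zeros(2)            # hidden biases, all 0
v = np.ones(2)             # hidden-to-output weights, all 1 (assumed)
c = 0.0                    # output bias

z = sigmoid(x @ W + b)     # hidden values z1, z2
p = sigmoid(z @ v + c)     # output probability p(y | x)

print(np.round(z, 3))      # [0.731 0.731]
print(round(float(p), 3))  # 0.812
```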

## Vector Notation
* remember that in NumPy it is faster to use the built-in matrix and vector operations than it is to use Python for loops
* so we can treat X as a **D dimensional vector** (D = 2 in diagram above), and Z as an **M dimensional vector** (M = 3 in diagram above)
* So Z would look like:

![z%20vector%20notation.png](attachment:z%20vector%20notation.png)

* and p(y|x) would look like:

![p%20vector%20notation.png](attachment:p%20vector%20notation.png)

## Matrix notation
* we can go even further than this though! 
* we generally have many data points, and we want to consider more than one sample at a time 
* so we can further vectorize our calculations, by using the full input matrix of data 
* this is an N x D matrix, where N is the number of samples and D is the dimensionality
* because we will then be calculating everything at the same time, z will then be an N x M matrix
* and the output Y will be an N x 1 matrix (for binary classification; for K classes it will be N x K)
* because all of the shapes have to be valid for matrix multiplication, all of the weights must be the correct shape
* hence, W is D x M, the first bias b is a vector of length M (added to every row of XW), the output weight v is an M x 1 vector, and the output bias c is a scalar
### $$Z = \sigma(XW+b)$$
### $$Y = \sigma(Zv + c)$$
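
A minimal NumPy sketch of these two equations follows. The function name `forward` and the sizes are assumptions for illustration, and the weights are random placeholders. Here `b` and `v` are stored as 1-D arrays of length M, so NumPy broadcasting adds `b` to every row of `XW`, and `Y` comes out with shape `(N,)` rather than `(N, 1)`.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(X, W, b, v, c):
    # Hidden layer for all N samples at once: (N, D) @ (D, M) -> (N, M)
    Z = sigmoid(X @ W + b)
    # Output layer: (N, M) @ (M,) -> (N,)
    Y = sigmoid(Z @ v + c)
    return Z, Y

N, D, M = 5, 2, 3  # hypothetical sizes

# Random placeholder data and weights; nothing here is trained
rng = np.random.default_rng(0)
X = rng.standard_normal((N, D))
W = rng.standard_normal((D, M))
b = np.zeros(M)
v = rng.standard_normal(M)
c = 0.0

Z, Y = forward(X, W, b, v, c)
print(Z.shape, Y.shape)  # (5, 3) (5,)
```

If an explicit N x 1 column is needed, `Y.reshape(-1, 1)` gives that shape.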

---
# Interpreting the Weights of a Neural Network
* in linear regression and logistic regression, interpreting the weights is pretty straightforward
* Neural networks have faced criticism for not being interpretable
* However, we will learn that this lack of interpretability is not a limitation of neural networks, but a limitation in our understanding of geometry
* In other words, it's not that neural networks are not interpretable
* People just don't understand geometry well enough to realize it is their own thinking that is limited

## Interpreting the Weights of a Single Neuron 

![single%20neuron.png](attachment:single%20neuron.png)

* we are going to start with interpreting the weights of a single neuron, since a neural network is just layers of neurons
* suppose we are trying to predict whether or not a person is at risk for disease X 
* the inputs to this model are obesity, whether the person smokes or not, and how often they exercise
* Suppose the weights for each of these factors are +1, +0.5, and -0.8
* what does that tell us? 
* Let's try it!
* Take a person who is not obese, who does smoke, and who does exercise daily, meaning our input vector is [0, 1, 1]
* And our prediction is:
### $$\sigma(0*1+1*0.5-1*0.8) = \sigma(-0.3)= 0.426$$
* so the probability that this person is at risk for the disease is 42.6%. Because this is less than 50%, we classify this person as not being at risk for disease X 

### At this point it should be clear how these weights affect the output
* we care about 2 things: 
    * sign 
    * magnitude
* smokes = bad (positive weight), leading to a higher chance of disease X
* exercise = good (negative weight), leading to a lower chance of disease X
* Since the exercise weight (-0.8) has a larger magnitude than the smoking weight (+0.5), it overpowers it!
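
As a quick check of the sign and magnitude argument, the sketch below reuses the weights +1, +0.5, and -0.8 and assumes a bias of zero, as the calculation above does. The second person (a smoker who does not exercise) is an extra case added only for comparison.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Weights for [obesity, smokes, exercise]; bias assumed to be 0
w = np.array([1.0, 0.5, -0.8])

# Not obese, smokes, exercises daily
x = np.array([0.0, 1.0, 1.0])
p = sigmoid(x @ w)
print(round(float(p), 3))  # 0.426 -> below 0.5, classified as not at risk

# Added comparison case: same person without the exercise
x_no_exercise = np.array([0.0, 1.0, 0.0])
p_no_exercise = sigmoid(x_no_exercise @ w)
print(round(float(p_no_exercise), 3))  # 0.622 -> above 0.5, classified as at risk
```

Removing the exercise input flips the prediction, which is the "exercise overpowers smoking" effect in numbers.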

## What about a Neural network 

![neural%20network.png](attachment:neural%20network.png)

* the problem here is that past the first layer, the weights do not have any physical meaning
* For example: 
    * the first hidden unit could be: +0.5 smoking, -0.8 exercise
    * the second hidden unit could be: -0.1 smoking, +0.3 exercise 
* The problem is that because neural networks are so expressive, they lose the simple interpretation past the first layer
* Remember that passing a value through the sigmoid function flattens out extreme values; in other words, the sigmoid is **nonlinear**
* This means that we cannot combine the two smoking weights, +0.5 - 0.1, to get +0.4, since this disregards the nonlinearity
* removing the nonlinearities would be pointless, since that would just bring us back to a linear neuron
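
The following sketch illustrates why the two smoking weights cannot simply be added. It uses the hidden-unit weights from the example above and assumes, purely for illustration, output weights of 1 and no biases. Without the sigmoid, the effect of smoking is exactly +0.4 regardless of exercise. With the sigmoid, the effect of smoking depends on the value of the other input.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Rows are inputs (smoking, exercise); columns are the two hidden units
W = np.array([[0.5, -0.1],
              [-0.8, 0.3]])
v = np.array([1.0, 1.0])  # hypothetical output weights, not from the notes

def linear_net(x):
    # No nonlinearity: collapses to a single linear neuron with weights W @ v
    return x @ W @ v

def nonlinear_net(x):
    # Sigmoid applied at the hidden layer
    return sigmoid(x @ W) @ v

def smoking_effect(net, exercise):
    # Change in output when smoking goes from 0 to 1, with exercise held fixed
    return net(np.array([1.0, exercise])) - net(np.array([0.0, exercise]))

lin_effects = [smoking_effect(linear_net, e) for e in (0.0, 1.0)]
non_effects = [smoking_effect(nonlinear_net, e) for e in (0.0, 1.0)]

print(lin_effects)  # both are 0.4 (up to floating point)
print(non_effects)  # two different values
```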
### What makes neural networks so powerful...
* is that beyond the first layer, the weights no longer represent the initial inputs: obesity, smoking, exercise
* they are powerful because beyond the first layer the weights represent something different entirely 
    * they represent something that cannot be expressed as just "this much smoking", "this much obesity", and "this much exercise" 
* in other words, the very fact that you cannot express the meaning of the features beyond the first layer, in terms of the real inputs, is what makes neural networks good at what they do

## Geometry
* when people talk about "interpretability", they are usually referring to two types of models
    * the first is the kind we just talked about, the linear neuron
    * this is where each weight can be interpreted as how much that input affects the output
    * the second kind is a decision tree, which essentially just boils down to a bunch of if statements
        * ex: if obesity > 0.5 and smoking == true, return 0.75

### Linear Decision Boundary

![linear%20boundary.png](attachment:linear%20boundary.png)

* geometrically, a linear model is called a linear model because the classes are separated by a line or plane
* we call this line or plane a decision boundary

### Decision Tree Boundary

![dtree%20boundary.png](attachment:dtree%20boundary.png)

* a decision tree works using if statements
* because of this, the decision boundary always has to line up with the coordinate axes
* we can see that neither of these boundaries is very expressive

## Nonlinear Decision Boundary
* what if we wanted to find a decision boundary like this:

![nonlinear%20boundary%202.png](attachment:nonlinear%20boundary%202.png)

* this is something that a neural network could do, but a single neuron can't do
* **Thus, trying to interpret a model in terms of its raw features is the wrong question to ask**
* if you do that, you are essentially limiting the model's geometry to be a straight line
* A better question to ask, is what does the decision boundary look like?


---
# Softmax
* let's now cover what happens if your output has more than 2 categories
* previously we have only discussed binary classification, for which there are many real world applications
* (humidity, ground is wet, month, location) -> (rain, no rain)
* in the above case, there are only two possible outputs

## K Classes
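* Suppose you are Facebook and you want to tag faces, cars, wedding dresses, etc. in photos
* Another example is the famous MNIST data set, where each image must be classified as one of the 10 digits (0 through 9)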

## Logistic Unit Extension
* Let's talk about how we can extend the logistic unit to be able to handle more than 2 classes
* recall that when we have two classes, we only need 1 output, P(Y=1|X)
* this is because P(Y=0|X) = 1 - P(Y=1|X)
* this can also be done in another way, however

## Binary output with 2 output nodes

![binary%20output%202%20nodes.png](attachment:binary%20output%202%20nodes.png)

* we could have two output nodes, and just normalize them so that they sum to 1
* we will also exponentiate them first, to make sure they are both positive
* Notice that the weights are no longer a vector as they were in logistic regression 
* since every input node has to be connected to every output node, you have D input nodes and 2 output nodes
* so the total number of weights is 2*D, and they are stored in a (D x 2) matrix
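
A small sketch of the two-node version follows, with random placeholder inputs and weights and an assumed D = 3. It exponentiates the two activations, normalizes them so they sum to 1, and then checks that the probability of class 1 equals a sigmoid applied to the difference of the two activations, which is why this setup is equivalent to the single-output logistic unit.

```python
import numpy as np

D = 3  # hypothetical number of inputs

# Random placeholder values for illustration
rng = np.random.default_rng(0)
x = rng.standard_normal(D)
W = rng.standard_normal((D, 2))  # one column of weights per output node

a = x @ W              # two activations, one per output node
expa = np.exp(a)       # exponentiate so both values are positive
p = expa / expa.sum()  # normalize so the two outputs sum to 1

# Same probability via a sigmoid on the difference of the activations
p1_sigmoid = 1.0 / (1.0 + np.exp(-(a[1] - a[0])))

print(W.size)              # 2 * D = 6 weights
print(p, p.sum())
print(p[1], p1_sigmoid)    # these two agree
```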

## Softmax for K classes
* notice that this is very easy to extend to K classes
 
![softmax.png](attachment:softmax.png)

* the quantity we have been calling $a$ (the weighted sum before the sigmoid or softmax is applied) is usually called the activation
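
One possible way to write the softmax for K classes in NumPy is sketched below, operating on an N x K matrix of activations. Subtracting the row maximum before exponentiating is a common numerical-stability step that is not part of these notes; it does not change the result. The activation values are made up for the example.

```python
import numpy as np

def softmax(A):
    # A has shape (N, K): one row of K activations per sample.
    # Subtracting the row max avoids overflow in exp and leaves the output unchanged.
    A = A - A.max(axis=1, keepdims=True)
    expA = np.exp(A)
    # Normalize each row so its K outputs sum to 1
    return expA / expA.sum(axis=1, keepdims=True)

# Hypothetical activations for N = 2 samples and K = 3 classes
A = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 0.0]])

P = softmax(A)
pred = P.argmax(axis=1)  # predicted class = the one with the highest probability

print(P)
print(P.sum(axis=1))  # each row sums to 1
print(pred)
```

The second row has equal activations, so each of its three classes gets probability 1/3.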

## Sigmoid vs. Softmax
