# Logistic Regression Overview

Logistic Regression uses trends in the data in order to predict which class a new data-point belongs to.

In order to find the line that cuts the data in the "best" place (i.e. splits the classes most clearly) we can use **Gradient Descent**. A line is drawn randomly, and the number of errors (or later, the error value) is measured. Then, we perform gradient descent on this error value in order to minimise it, resulting in a line that separates the data well.

The **Error Function** that was mentioned is more complex than simply counting the number of incorrectly classified points. Instead, it works on a penalty system, where correctly classified points receive a small penalty, and incorrectly classified points receive a very large penalty. Because every point contributes, the error reflects the whole data space rather than only the misclassified points.

Gradient Descent is now applied to the system, minimising the result of the error function. A simple implementation of the error function could be:

In [1]:
def error_function(dataspace):
    return sum(d.error for d in dataspace)
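
One common way to realise such a penalty system is the log-loss (cross-entropy). The sketch below is an assumption rather than the exact error function meant above; `point_error` and the example probabilities are hypothetical.

```python
import math

def point_error(label, predicted_prob):
    """Log-loss penalty for one point.

    predicted_prob is the model's probability that the point belongs to class 1.
    The penalty is small when the model is confident and right,
    and large when it is confident and wrong.
    """
    p_correct = predicted_prob if label == 1 else 1.0 - predicted_prob
    return -math.log(p_correct)

def log_loss_error(labels, predicted_probs):
    # Sum the penalty over every point in the data space
    return sum(point_error(y, p) for y, p in zip(labels, predicted_probs))

well_classified = point_error(1, 0.9)   # roughly 0.105
misclassified = point_error(1, 0.1)     # roughly 2.303
```

Note that even a correctly classified point contributes a small penalty, which is what lets gradient descent keep improving the line after every point is on the right side.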

However, it is unusual for a large dataspace to be split into the correct classes by a single straight line. Often, multiple dividing lines, curves, and circles may be needed to correctly classify a new point.

Neural Networks are good at solving these problems.

---
# Basic Neural Network

In the case of a dataspace being classified by two distinct lines, in order to classify a new point we must check whether it lies in the area bounded by both lines. This is a complex operation, so it helps to break it down.

We could instead ask three simple questions, and arrange them into a Neural Network.

1. Is the new point above the first line?
2. Is the new point above the second line?
3. Were the previous answers **BOTH** true?
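
The three questions above can be sketched directly in code. The helper names and the two example lines are hypothetical, chosen only for illustration.

```python
def above_line(point, slope, intercept):
    """Return True if the point (x, y) lies above the line y = slope * x + intercept."""
    x, y = point
    return y > slope * x + intercept

def in_region(point, line_1, line_2):
    answer_1 = above_line(point, *line_1)   # Question 1
    answer_2 = above_line(point, *line_2)   # Question 2
    return answer_1 and answer_2            # Question 3: were BOTH true?

line_1 = (1.0, 0.0)    # y = x
line_2 = (-1.0, 2.0)   # y = -x + 2

inside = in_region((1.0, 3.0), line_1, line_2)    # above both lines
outside = in_region((3.0, 1.0), line_1, line_2)   # below the first line
```

Each question is a simple yes/no test, which is exactly the kind of decision a single node can make.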

This problem is now well suited to a neural network, as shown in the diagram below.

INSERT DIAGRAM HERE

## Perceptrons

Perceptrons, or *Neurons*, are the simple nodes within a neural network. Each perceptron takes in some number of inputs, and decides what to send as a single output.

It quickly becomes apparent that not all of the inputs to the Perceptron hold the same importance, or ***weight***.

## Weight

When a perceptron has several inputs, it must be able to know which inputs are the most important and hold the most weight over the output. Weights are initialised to a **random** value and then altered based on feedback; the weights are what changes during training.

## Combining the Inputs

Each perceptron sums the value of each input multiplied by its input weight. This forms the single value that the perceptron operates on. This process is known as *linear combination*.

$$
total\_input = \sum{x_{i} w_{i}}
$$

or alternatively, in code:

In [2]:
def total_input(list_of_inputs, list_of_weights):
    total = 0
    for i in range(len(list_of_inputs)):
        total += list_of_inputs[i]*list_of_weights[i]
    return total
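
The same linear combination can be written as a dot product. This is a minimal sketch assuming NumPy is available; `total_input_vectorised` is a hypothetical name, not part of the original notebook.

```python
import numpy as np

def total_input_vectorised(inputs, weights):
    # np.dot multiplies the two sequences element-wise and sums the results
    return float(np.dot(inputs, weights))

example = total_input_vectorised([1.0, 2.0, 3.0], [0.5, -1.0, 2.0])
# 1.0 * 0.5 + 2.0 * -1.0 + 3.0 * 2.0 = 4.5
```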

## The Activation Function

The Activation function takes in this single value and decides whether the neuron should activate or not. In this case, it should return a $1$ or a $0$.

Due to this abstracted nature, the activation function can be any function that takes a single input and returns a single value. A simple example is the **Heaviside Step** function, which returns zero if the input is less than zero, and one if it is greater than or equal to zero.

In [3]:
def heavy_side(x):
    if (x<0):
        return 0
    else:
        return 1

### Bias

A Bias is added to the linear combination before the activation function is applied, shifting the point at which the neuron activates. Like the weights, the Bias can be initialised to a random value and then trained by the network.

An example:

$$
f(x_1, \dots, x_m) = \begin{cases} 0 & \text{if } b + \sum_i w_i x_i < 0 \\ 1 & \text{if } b + \sum_i w_i x_i \ge 0 \end{cases}
$$

This could be used to classify the university admission example: the activation function above would return $1$ if the applicant should be accepted, and $0$ otherwise.

The weights and the Biases can then be updated to fit the data better with a learning algorithm such as Gradient Descent.
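
Putting the linear combination, the Bias and the step function together gives a complete Perceptron. The sketch below uses hypothetical weights, Bias and applicant scores purely for illustration; they are not taken from any real admissions data.

```python
def perceptron(inputs, weights, bias):
    # Linear combination of the inputs, shifted by the bias
    total = bias + sum(x * w for x, w in zip(inputs, weights))
    # Heaviside step: activate when the total is zero or greater
    return 1 if total >= 0 else 0

# Hypothetical values: two scores scaled to the range 0..1
weights = [1.0, 1.0]
bias = -1.5

accepted = perceptron([0.9, 0.8], weights, bias)   # total is about  0.2
rejected = perceptron([0.4, 0.6], weights, bias)   # total is about -0.5
```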

---
# The AND Gate Neural Net


Here we will create a Perceptron that mimics the rules of an AND gate for two inputs. As the solution is well defined and simple, we will manually set the correct weights and Bias for the Perceptron.

### The Activation Function

In this example we will use a *Heaviside Step* as the activation function. Remember that as there is a Bias, the formula looks like:

$$
f(x_1, \dots, x_m) = \begin{cases} 0 & \text{if } b + \sum_i w_i x_i < 0 \\ 1 & \text{otherwise} \end{cases}
$$

### The Weights

As this is a straight AND gate, all of the input weights should be the same (the value can be arbitrary, as long as the Bias compensates). For simplicity, we will set both weights to $1$.

### The Bias

Knowing that the only correct solution is when both inputs are 1, and as the weights are 1, the total value of the Perceptron in that case will be $b + 2$. We need this to be at least zero, while the result must be negative in every other state (10, 01, 00), where the total is at most $b + 1$. Any Bias with $-2 \le b < -1$ works; we take the Bias to be $-2$.
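
As a quick check of that reasoning, the sketch below tries a few candidate Bias values with both weights fixed at 1. The helper `and_gate_ok` and the candidate list are hypothetical.

```python
def and_gate_ok(bias, weight=1.0):
    """Return True if this bias reproduces the AND truth table with equal weights."""
    for x1 in (0, 1):
        for x2 in (0, 1):
            output = int(weight * x1 + weight * x2 + bias >= 0)
            if output != (x1 and x2):
                return False
    return True

candidates = [-2.5, -2.0, -1.5, -1.0, -0.5]
working = [b for b in candidates if and_gate_ok(b)]
# working == [-2.0, -1.5]
```

A Bias of $-2.5$ fails because the (1, 1) case stays negative, while $-1$ fails because (0, 1) and (1, 0) reach zero and activate.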

### Summary

In [4]:
import pandas as pd


weight1 = 1.0   # Arbitrary, but the same
weight2 = 1.0   # Arbitrary, but the same
bias = -2.0     # In order to ensure all incorrect pairings are negative

# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, False, False, True]
outputs = []

# Generate and check output
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == correct_output else 'No'
    outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])

# Print output
num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', '  Input 2', '  Linear Combination', '  Activation Output', '  Is Correct'])
if not num_wrong:
    print('Nice!  You got it all correct.\n')
else:
    print('You got {} wrong.  Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))


Nice!  You got it all correct.

Input 1    Input 2    Linear Combination    Activation Output   Is Correct
      0          0                  -2.0                    0          Yes
      0          1                  -1.0                    0          Yes
      1          0                  -1.0                    0          Yes
      1          1                   0.0                    1          Yes


---
# The OR Gate

Similarly, the weights of the input values must be equal; however, this time we only want the result to be negative in the case where the inputs are (00). Starting from the AND gate, we can either increase the input weights (to at least 2 if the Bias stays at $-2$) or decrease the magnitude of the Bias (to $-1$ with weights of 1).

In [5]:
import pandas as pd


weight1 = 1.0   # Arbitrary, but the same
weight2 = 1.0   # Arbitrary, but the same
bias = -1.0     # Ensures that only the (0, 0) pairing is negative

# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, True, True, True]
outputs = []

# Generate and check output
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == correct_output else 'No'
    outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])

# Print output
num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', '  Input 2', '  Linear Combination', '  Activation Output', '  Is Correct'])
if not num_wrong:
    print('Nice!  You got it all correct.\n')
else:
    print('You got {} wrong.  Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))


Nice!  You got it all correct.

Input 1    Input 2    Linear Combination    Activation Output   Is Correct
      0          0                  -1.0                    0          Yes
      0          1                   0.0                    1          Yes
      1          0                   0.0                    1          Yes
      1          1                   1.0                    1          Yes


---
# The NOT Gate
This Perceptron has a single input (any inputs other than the first are ignored), and simply inverts whatever it is given.

To do this, we can simply use a weight of $-1$ and a Bias of $0$. An input of 0 then gives a linear combination of 0, which the Heaviside step maps to 1, while an input of 1 gives $-1$, which the Heaviside step maps to 0.


In [6]:
import pandas as pd


weight1 = -1.0  # Arbitrary value, must be negative
weight2 = 0.0   # All other inputs are ignored
bias = 0.0     

# Inputs and outputs
test_inputs = [(1, 0), (0, 0)]
correct_outputs = [False, True]
outputs = []

# Generate and check output
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == correct_output else 'No'
    outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])

# Print output
num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', '  Input 2', '  Linear Combination', '  Activation Output', '  Is Correct'])
if not num_wrong:
    print('Nice!  You got it all correct.\n')
else:
    print('You got {} wrong.  Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))

Nice!  You got it all correct.

Input 1    Input 2    Linear Combination    Activation Output   Is Correct
      1          0                  -1.0                    0          Yes
      0          0                   0.0                    1          Yes





---
# A Simple Network

# Other Activation Functions

As we mentioned earlier, the architecture of the Perceptron means that we can define any *Activation Function* that simply takes the result from the linear combination of the inputs, and returns a single value. There are several common ones that are more sophisticated than the Heaviside Step that we used earlier.

The one that we will be using most is called a **Sigmoid Function**. A sigmoid smoothly squashes any input into the range between 0 and 1, approaching those limits for large negative and large positive inputs.


In [7]:

import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))


## The difference between Regression and Neural Networks?

Up until this point, there has been very little difference between the capability of the Regression model and the Neural nets that we have been looking at; to an extent, a single Perceptron with a sigmoid Activation Function is essentially the same as a Logistic Regression model. However, when we start to stack multiple Perceptrons together to form networks and layers, we can begin to handle *Linearly Inseparable* data, which is something that a single Logistic Regression model cannot do.
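
XOR is the classic example of linearly inseparable data: no single line separates (0, 1) and (1, 0) from (0, 0) and (1, 1). The sketch below stacks the AND, OR and NOT Perceptrons from earlier, with the same weights and Biases, into a small network that computes XOR. The wiring is one possible arrangement, given as an illustration only.

```python
def step(x):
    # Heaviside step activation
    return 1 if x >= 0 else 0

def and_gate(a, b):
    return step(1.0 * a + 1.0 * b - 2.0)

def or_gate(a, b):
    return step(1.0 * a + 1.0 * b - 1.0)

def not_gate(a):
    return step(-1.0 * a)

def xor_network(a, b):
    # Hidden layer: two perceptron outputs computed from the raw inputs
    hidden_or = or_gate(a, b)
    hidden_nand = not_gate(and_gate(a, b))
    # Output layer: a single perceptron combining the hidden outputs
    return and_gate(hidden_or, hidden_nand)

truth_table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
```

No choice of weights and Bias lets a single Perceptron produce this truth table, but the layered version does.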

A further advantage is the flexibility that the Activation Function allows us. If our net uses continuous and differentiable functions, then we can use learning algorithms to train the network, for example using Gradient Descent.



---
# Learning Weights

For any system to learn, it must have a sense of **Trial and Error**. This error is needed to know when the network is wrong, and to know whether it is getting better, or diverging from the desired result.

## Error

The Error that we are referring to must be mathematically defined, and it turns out that a good measure is the **Sum of Squared Errors**. This formula is good for several reasons: it takes into account the whole dataset, and it penalises larger errors more than smaller errors, as the error value is squared. Squaring the value also means that all errors are positive.

$$
E = \frac{1}{2} \sum_{\mu} \sum_{j} \left[ y_j^{\mu} - \hat{y}_j^{\mu} \right]^2
$$
Where $\hat{y}$ in the above formula is the result from the network, and $y$ is the expected value.


In words, the above equation takes the difference between each output node of the network and its expected value, squares it, and sums over the output nodes ($j$). It then does the same for every data point ($\mu$) and sums those values too.


This results in an error that encompasses the error for all of the output nodes under all datapoints.
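
A minimal sketch of this error in code, assuming NumPy and using hypothetical targets and predictions (two data points, two output nodes each):

```python
import numpy as np

def sum_of_squared_errors(targets, predictions):
    """Half the sum of squared differences over all data points and output nodes."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return 0.5 * np.sum((targets - predictions) ** 2)

# Rows are data points (mu), columns are output nodes (j)
targets = [[1.0, 0.0], [0.0, 1.0]]
predictions = [[0.8, 0.1], [0.3, 0.6]]

sse = sum_of_squared_errors(targets, predictions)
# 0.5 * (0.04 + 0.01 + 0.09 + 0.16) = 0.15
```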

### Error vs Weights

If you recall, the output from a perceptron is dependent on the weights of its inputs. In turn, the error of the Perceptron is hence also indirectly dependent on these weights.

This is exactly what we want! 

When we need to reduce the error, we know that the knobs to turn are the weights of the inputs.

This can be seen when we rewrite the net's output $\hat{y}$ in terms of the weights and inputs.

$$
E = \frac{1}{2} \sum_{\mu} \sum_{j} \left[ y_j^{\mu} - f\left( \sum_{i} w_{ij} x_i^{\mu} \right) \right]^2
$$



# Gradient Descent

The main idea behind Gradient Descent is to take lots of small steps in the direction that minimises the desired variable.

In this case, the term 'gradient' means the slope of the function at the current point.

As we know, the slope of a function is calculated using the original function's derivative.

The main problem with the gradient descent method is that by definition it will never go uphill, so it is susceptible to getting stuck in local minima.

## What can be done about this?

We can run the network on data with a known output, and tailor the input weights to suit that outcome.


## How do I change the weight?

It is known that the Error is a function of the weights of the input nodes. Consider a case where there is only one input (and hence one weight) involved; then it can be thought that:

$$
\Delta w \propto -gradient
$$

The change in the weight will be in the opposite direction to the gradient. This makes sense: if the gradient is positive, then moving left (decreasing $w$) goes downhill.

It follows that:
$$
\begin{align}
\Delta w &\propto -\frac{\partial E}{\partial w} \\
\Delta w &= -\eta \frac{\partial E}{\partial w} \\
\end{align}
$$

The scaling constant $\eta$ is known as the learning rate, and it simply changes how quickly the weights change for a given error. A smaller value takes longer to converge, but a larger value can overshoot and struggle to settle.
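
The effect of the learning rate can be seen on a toy problem. The sketch below minimises the hypothetical error $E(w) = (w - 3)^2$, which is not the network error from this notebook, using three different learning rates.

```python
def gradient_descent(learnrate, steps=20, w=0.0):
    """Minimise E(w) = (w - 3)^2, whose gradient is dE/dw = 2 * (w - 3)."""
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)
        # Step in the opposite direction to the gradient
        w = w - learnrate * gradient
    return w

slow = gradient_descent(0.01)      # still far from the minimum at w = 3
good = gradient_descent(0.3)       # very close to w = 3
too_big = gradient_descent(1.1)    # overshoots further on every step and diverges
```

With 20 steps the smallest rate has not yet arrived, the middle one has settled, and the largest has moved further away than where it started.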








### What is that Partial Derivative?

In order to calculate the change in the weight for each input, we need to evaluate the derivative:

$$
\begin{align}
\frac{\partial E}{\partial w} &= \frac{\partial}{\partial w} \frac{1}{2}(y-\hat{y})^2 \\
&= \frac{\partial}{\partial w} \frac{1}{2}(y-\hat{y}(w))^2 && \text{as } \hat{y} \text{ is a function of } w
\end{align}
$$

Using the chain rule, the power of two comes down and cancels with the half, and we multiply by the derivative of $y-\hat{y}$ with respect to $w$:

$$
\begin{align}
\frac{\partial E}{\partial w} &= (y-\hat{y}) \frac{\partial}{\partial w} (y-\hat{y}) \\
\end{align}
$$

As $y$ does not depend on the weight (the variable we are differentiating with respect to), its derivative vanishes, and the minus sign in front of $\hat{y}$ is brought outside:
$$
\begin{align}
\frac{\partial E}{\partial w} &= -(y-\hat{y}) \frac{\partial \hat{y}}{\partial w} \\
\end{align}
$$



Recall that the prediction $\hat{y}$ is the result of applying the activation function $f$ to the linear combination $h$:
$$
\begin{align}
\hat{y} &= f(h) \\
h &= \sum{w_i x_i} \\
\end{align}
$$

Then:

$$
\begin{align}
\frac{\partial E}{\partial w} &= -(y-\hat{y}) \frac{\partial \hat{y}}{\partial w} \\
\end{align}
$$

by the chain rule, becomes:

$$
\begin{align}
\frac{\partial E}{\partial w_i} &= -(y-\hat{y}) f'(h) \frac{\partial}{\partial w_i} \sum{w_i x_i} \\
\end{align}
$$

Finally, since only the $w_i x_i$ term of the sum depends on $w_i$, the derivative of the sum is simply $x_i$:


$$
\begin{align}
\frac{\partial E}{\partial w_i} &= -(y-\hat{y}) f'(h) x_i \\
\end{align}
$$
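
The result of this derivation can be checked numerically by comparing it against a finite-difference estimate of the gradient. The sketch below assumes a sigmoid activation, for which $f'(h) = f(h)(1 - f(h))$, and uses hypothetical inputs, weights and target.

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

def error(w, x, y):
    # E = 1/2 * (y - y_hat)^2 for a single record and a single output
    return 0.5 * (y - sigmoid(np.dot(w, x))) ** 2

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, -0.5, 0.3, 0.1])
y = 0.5

# Analytic gradient from the derivation: -(y - y_hat) * f'(h) * x_i
y_hat = sigmoid(np.dot(w, x))
analytic = -(y - y_hat) * y_hat * (1 - y_hat) * x

# Numerical gradient: nudge each weight up and down by a small amount
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    nudge = np.zeros_like(w)
    nudge[i] = eps
    numeric[i] = (error(w + nudge, x, y) - error(w - nudge, x, y)) / (2 * eps)
```

The two arrays should agree to several decimal places, which is a useful sanity check whenever a gradient is derived by hand.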



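### What do we do with the partial derivative?

Now that we have differentiated the Error with respect to each weight, we can substitute the result into the update rule from earlier (the change in weight is $-\eta$ times the gradient). The two minus signs cancel, giving: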
$$
\Delta w_i = \eta (y-\hat{y}) f'(h) x_i
$$

So we now have the change in the weight for each input $x_i$.

To simplify the above, we define an **Error Term** $\delta$:
$$
\delta = (y-\hat{y}) f'(h) 
$$

So we now see that 

$$
\Delta w_i = \eta \delta x_i
$$

So **FINALLY** :

In order to update the weight

$$
w_i = w_i + \eta \delta x_i
$$




---
# Gradient Descent in Code

Compiling the above:

The change in weight is $\Delta w_i = \eta \delta  x_i$

Where the error term is $\delta = (y-\hat{y}) f'(h)$

Which is equivalent to $\delta = (y-\hat{y}) f'(\sum{w_i x_i})$


In the above, $(y-\hat{y})$ is the output error.

We will now define the system in code, assuming a single output unit and a sigmoid activation function.

## Sigmoid

Here is the sigmoid function, and the function that returns its derivative.

In [9]:
import numpy as np

# Defining the sigmoid function for activations
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of the sigmoid function
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))
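
For reference, the form of `sigmoid_prime` follows from differentiating the sigmoid directly. Writing $\sigma(x)$ for the sigmoid (a symbol introduced here only for this derivation):

$$
\begin{align}
\sigma(x) &= \frac{1}{1+e^{-x}} \\
\sigma'(x) &= \frac{e^{-x}}{(1+e^{-x})^2} \\
&= \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} \\
&= \sigma(x) \left( 1 - \sigma(x) \right)
\end{align}
$$

The last step uses $1 - \sigma(x) = \frac{e^{-x}}{1+e^{-x}}$.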

## Change in Weight

This is where we use gradient descent to calculate a change in weight.



In [10]:
learnrate = 0.5
inputs = np.array([1, 2, 3, 4])
target = np.array(0.5)

# Initial weights
weights = np.array([0.5, -0.5, 0.3, 0.1])

### Calculate one gradient descent step for each weight


# The node's linear combination of inputs and weights
#     This is a single value
linear_comb = np.dot(inputs, weights) 

# The output of neural network
nn_output = sigmoid(linear_comb)

# The error of neural network
#      This is y - y-hat
error = target - nn_output

# Output Gradient
output_gradient = sigmoid_prime(linear_comb)

# The error term
error_term = error * output_gradient

#The final change in weights
del_w = learnrate * error_term * inputs

print('Neural Network output:')
print(nn_output)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)

Neural Network output:
0.689974481128
Amount of Error:
-0.189974481128
Change in Weights:
[-0.02031869 -0.04063738 -0.06095608 -0.08127477]
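
The cell above computes a single step. As a sketch of what happens when that step is applied repeatedly, the loop below reuses the same inputs, target, starting weights and learning rate; the step count of 50 is an arbitrary choice.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

learnrate = 0.5
inputs = np.array([1, 2, 3, 4])
target = 0.5
weights = np.array([0.5, -0.5, 0.3, 0.1])

errors = []
for _ in range(50):
    output = sigmoid(np.dot(inputs, weights))
    error = target - output
    errors.append(abs(error))
    # For a sigmoid, f'(h) = output * (1 - output)
    error_term = error * output * (1 - output)
    # Apply the update w_i = w_i + eta * delta * x_i
    weights = weights + learnrate * error_term * inputs
```

The first entry of `errors` matches the magnitude of the error printed above, and later entries shrink towards zero as the weights settle.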


---
# Multi updates

For this example we will be using the university admissions example.

The data has 3 columns:

1. GRE Score
2. GPA
3. School Rank

## Data Clean Up

The `rank` of the school is a category label rather than a measured quantity, so treating it as a number would encode relationships (for example, that rank 4 is twice rank 2) that mean nothing.

Because of this, we use **Dummy Variables** to represent the school instead.

We will use 4 dummy columns to represent the 4 school ranks. The value in the column matching the school's rank will be $1$, and all other `rank` columns will be $0$.

As we are again using a **Sigmoid activation function**, we must normalise the data to have a mean of $0$ and a standard deviation of $1$. This is because the sigmoid function squashes large positive and large negative values to 1 and 0; when this happens the gradient is nearly 0, meaning that the network will struggle to train.
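
A minimal sketch of this clean-up with pandas. The four rows and the column names `admit`, `gre` and `gpa` are hypothetical stand-ins; only `rank` is named in the text above.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
data = pd.DataFrame({
    'admit': [0, 1, 1, 0],
    'gre': [380, 660, 800, 520],
    'gpa': [3.61, 3.67, 4.00, 2.93],
    'rank': [3, 2, 1, 4],
})

# One dummy column per rank: 1 in the matching column, 0 elsewhere
dummies = pd.get_dummies(data['rank'], prefix='rank', dtype=int)
data = pd.concat([data.drop(columns='rank'), dummies], axis=1)

# Standardise the continuous columns to mean 0 and standard deviation 1
for field in ['gre', 'gpa']:
    mean, std = data[field].mean(), data[field].std()
    data[field] = (data[field] - mean) / std
```

The dummy columns are left unscaled, since they are already either 0 or 1.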

## Mean of Square Error

Previously we have used the *Sum of Squared Errors* to calculate the Error. However, in this example we will use the ***Mean* of Squared Errors**. The only difference is that we divide by the number of data points, $m$.

The reason is that with a large number of data points the summed error, and therefore the summed gradient, is large, resulting in large steps that are problematic for Gradient Descent. To combat this, it would be possible to just use a smaller learning rate. However, as the number of points is constant, division by $m$ has the same effect without us having to change the learning rate.

$$
E = \frac{1}{2m} \sum_{\mu} \sum_{j} \left[ y_j^{\mu} - \hat{y}_j^{\mu} \right]^2
$$


## Implementation

The general steps are as follows.

1. Set the initial $\Delta w_i = 0$
2. For each record in the data
 i. Make a forward pass through the network, calculating the output $\hat{y}$
 ii. Calculate the *Error Term*, $\delta = (y - \hat{y}) f'(h)$
 iii. Update $\Delta w_i = \Delta w_i + \delta x_i$
    
3. Update the weight, $w_i = w_i + \frac{\eta \Delta w_i}{m}$
4. Repeat for all epochs


**Note**: this will not run without the `data_prep` module, but the code itself is correct.

In [None]:
import numpy as np
from data_prep import features, targets, features_test, targets_test


def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))


# The Random Seed
np.random.seed(42)

n_records, n_features = features.shape
last_loss = None

# Initialize weights
weights = np.random.normal(scale=1 / n_features**.5, size=n_features)

# Neural Network hyperparameters
epochs = 1000                               # Number of full passes over the training data
learnrate = 0.5

for e in range(epochs):
    # Initialise the accumulated change in weights to zero
    del_w = np.zeros(weights.shape)
    
    for inpt, target in zip(features.values, targets):
        # Loop through all records
        
        # Linear combination of inputs and weights
        linear_comb = np.dot(inpt, weights)

        # Network output
        output = sigmoid(linear_comb)

        # Output error, y - y_hat
        error = (target - output)

        # Error term; for a sigmoid, f'(h) = output * (1 - output)
        error_term = error * (output*(1-output))

        # Change in weights for this record, added to the running total
        del_w += error_term*inpt

    # Update weights using the learning rate and the average change in weights
    weights += learnrate*del_w/n_records

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        out = sigmoid(np.dot(features, weights))
        loss = np.mean((out - targets) ** 2)
        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss


# Calculate accuracy on test data
tes_out = sigmoid(np.dot(features_test, weights))
predictions = tes_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))
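
Since `data_prep` is not included, the sketch below runs the same update rule on synthetic data so that it can be executed as-is. The data, `true_weights` and the epoch count are all hypothetical, and the per-record loop is replaced by an equivalent vectorised calculation.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(42)

# Synthetic stand-in for the admissions data: 200 records, 3 features
n_records, n_features = 200, 3
features = rng.normal(size=(n_records, n_features))
true_weights = np.array([1.5, -2.0, 0.5])   # hypothetical rule generating the labels
targets = (sigmoid(features @ true_weights) > 0.5).astype(float)

weights = rng.normal(scale=1 / n_features**0.5, size=n_features)
learnrate = 0.5
losses = []

for e in range(200):
    # Forward pass for every record at once
    output = sigmoid(features @ weights)
    # Error term for every record: (y - y_hat) * f'(h)
    error_term = (targets - output) * output * (1 - output)
    # Sum of error_term * input over all records
    del_w = error_term @ features
    # Average the accumulated change and scale by the learning rate
    weights += learnrate * del_w / n_records
    losses.append(np.mean((sigmoid(features @ weights) - targets) ** 2))

accuracy = np.mean((sigmoid(features @ weights) > 0.5) == targets)
```

Here the accuracy is measured on the training data itself, so it only shows that the update rule is fitting; it says nothing about held-out data.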

---
# The Hidden Layer

As discussed earlier, stacking multiple layers of perceptrons allows us to classify **Linearly Inseparable** data (XOR is the classic example). This is where Deep Learning gets its name: the networks are multiple layers deep.

Our weights are no longer a vector, but rather a matrix, where the indices $i$ and $j$ refer to the input and hidden units respectively.

For example, assuming that `features` is the 2D matrix of input data:

```

# Number of records and input units
n_records, n_inputs = features.shape

# Number of hidden units
n_hidden = 2

weights_input_to_hidden = np.random.normal(0, n_inputs**-0.5, size=(n_inputs, n_hidden))
```

So now, we need to do matrix multiplication to find the linear combinations.

```
hidden_inputs = np.matmul(inputs, weights_input_to_hidden)
```

**Pro-Tip**: to create a column vector from a 1D array, use `arr[:, None]`
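
A short sketch of why this helps: multiplying a column vector by a 1D array broadcasts into a full matrix, which is the shape needed for an input-to-hidden weight update. The values in `hidden_term` are hypothetical.

```python
import numpy as np

row = np.array([0.5, 0.1, -0.2])      # shape (3,)
column = row[:, None]                 # shape (3, 1)

hidden_term = np.array([0.1, -0.3])   # hypothetical, one value per hidden unit

# Broadcasting (3, 1) against (2,) gives a (3, 2) matrix:
# entry [i, j] is row[i] * hidden_term[j]
outer = hidden_term * column
```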

**TODO**

1. Calculate the input to the hidden layer.
2. Calculate the hidden layer output.
3. Calculate the input to the output layer.
4. Calculate the output of the network.

In [12]:

# Network size
N_input = 4
N_hidden = 3
N_output = 2

np.random.seed(42)

# Make some fake data
Input = np.random.randn(4)

weights_input_to_hidden = np.random.normal(0, scale=0.1, size=(N_input, N_hidden))
weights_hidden_to_output = np.random.normal(0, scale=0.1, size=(N_hidden, N_output))


# TODO: Make a forward pass through the network

hidden_layer_in = np.dot(Input, weights_input_to_hidden)
hidden_layer_out = sigmoid(hidden_layer_in)

print('Hidden-layer Output:')
print(hidden_layer_out)

output_layer_in = np.dot(hidden_layer_out, weights_hidden_to_output)
output_layer_out = sigmoid(output_layer_in)

print('Output-layer Output:')
print(output_layer_out)

Hidden-layer Output:
[ 0.41492192  0.42604313  0.5002434 ]
Output-layer Output:
[ 0.49815196  0.48539772]


---
## Back Propagation

This is the idea that, even with multiple hidden layers, the error can be propagated back through the network using the weights, so that the whole network can be trained.
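
For a network with one hidden layer and a single output unit, this can be written out as follows. The symbols $W_j$ (weight from hidden unit $j$ to the output), $a_j$ (output of hidden unit $j$), $h_j$ (input to hidden unit $j$) and $h^o$ (input to the output unit) are notation introduced here for illustration.

$$
\begin{align}
\delta^{o} &= (y - \hat{y}) f'(h^{o}) \\
\delta^{h}_{j} &= \delta^{o} \, W_{j} \, f'(h_{j}) \\
\Delta W_{j} &= \eta \, \delta^{o} \, a_{j} \\
\Delta w_{ij} &= \eta \, \delta^{h}_{j} \, x_{i}
\end{align}
$$

The hidden error term is the output error term scaled by the weight connecting the two units, which is what "propagating the error back using the weights" means in practice.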



In [14]:


Input = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5

weights_input_hidden = np.array([[0.5, -0.6],
                                 [0.1, -0.2],
                                 [0.1, 0.7]])

weights_hidden_output = np.array([0.1, -0.3])

## Forward pass
hidden_layer_input = np.dot(Input, weights_input_hidden)
hidden_layer_output = sigmoid(hidden_layer_input)

output_layer_in = np.dot(hidden_layer_output, weights_hidden_output)
output = sigmoid(output_layer_in)

## Backwards pass
## TODO: Calculate output error
error = target-output

# TODO: Calculate error term for output layer
output_error_term = error*(output*(1-output))

# TODO: Calculate error term for hidden layer
hidden_error_term = np.dot(output_error_term, weights_hidden_output) * hidden_layer_output*(1-hidden_layer_output)

# TODO: Calculate change in weights for hidden layer to output layer
delta_w_h_o = learnrate * output_error_term * hidden_layer_output

# TODO: Calculate change in weights for input layer to hidden layer
delta_w_i_h = learnrate * hidden_error_term * Input[:, None]

print('Change in weights for hidden layer to output layer:')
print(delta_w_h_o)
print('Change in weights for input layer to hidden layer:')
print(delta_w_i_h)

Change in weights for hidden layer to output layer:
[ 0.00804047  0.00555918]
Change in weights for input layer to hidden layer:
[[  1.77005547e-04  -5.11178506e-04]
 [  3.54011093e-05  -1.02235701e-04]
 [ -7.08022187e-05   2.04471402e-04]]
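
The cell above computes the weight changes for one step without applying them. As a sketch, the loop below applies the same updates repeatedly to the same starting network, to show the output error shrinking; the step count of 100 is an arbitrary choice.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5

w_input_hidden = np.array([[0.5, -0.6],
                           [0.1, -0.2],
                           [0.1, 0.7]])
w_hidden_output = np.array([0.1, -0.3])

errors = []
for _ in range(100):
    # Forward pass
    hidden = sigmoid(np.dot(x, w_input_hidden))
    output = sigmoid(np.dot(hidden, w_hidden_output))

    # Backward pass
    error = target - output
    errors.append(abs(error))
    output_error_term = error * output * (1 - output)
    hidden_error_term = output_error_term * w_hidden_output * hidden * (1 - hidden)

    # Apply both updates, using the error terms computed before either weight changed
    w_hidden_output = w_hidden_output + learnrate * output_error_term * hidden
    w_input_hidden = w_input_hidden + learnrate * hidden_error_term * x[:, None]
```

Both error terms are calculated before any weight is changed, so the hidden layer update uses the same hidden-to-output weights as the forward pass did.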
