# Network prediction and measuring its effectiveness

## The algorithm

### 1. Find the correct weights by descending the gradient of the error

* If $f$ is the activation function, the output of the network is:

(__1__): $$y_j^m = f(\sum_i w_{ij} \cdot x_i^m)$$

(the dot product of the tensor containing the weights and the tensor containing the inputs, passed through $f$). The activation function used here is the **sigmoid**.
* The learning rate $\eta$ is a hyperparameter chosen by trial and error to fit the model.
* To be effective, a model has to minimize its error. The error measure used here is the _sum of squared errors_ (SSE).
* To find the right values for the weights, each weight is adjusted by a step proportional to the error and the learning rate, so that the error goes down (**descent**). The mathematical tool for this is the **gradient** of the error with respect to the weights; by the chain rule, it contains the first derivative of the activation function.
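
Equation (1) can be sketched for several output units at once. The layer sizes, the weight values and the `(inputs, outputs)` layout of `W` below are assumptions made for illustration only; they do not come from the example in the code section.

```python
import numpy as np

def sigmoid(h):
    """Logistic sigmoid, applied element-wise."""
    return 1 / (1 + np.exp(-h))

# Hypothetical sizes: 3 inputs (index i) feeding 2 output units (index j).
x = np.array([1.0, 2.0, -1.0])      # inputs x_i
W = np.array([[0.5, -0.2],
              [-0.5, 0.1],
              [0.3, 0.4]])          # weights w_ij, shape (inputs, outputs)

# Equation (1): y_j = f(sum_i w_ij * x_i), computed for every j at once.
h = x @ W                           # weighted sums, one per output unit
y = sigmoid(h)                      # network outputs, each strictly between 0 and 1
print(y)
```

Each column of `W` holds the weights of one output unit, so a single matrix product yields all the weighted sums.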


### 2. Minimize the error

The quantity to minimize is the difference between the target output $y_j$ and the output $y_j^m$ calculated in (1):

(__2__): $$(y_j - y_j^m)$$

Using SSE:

(__3__): $$E = \frac{1}{2} \sum_m \sum_j [y_j - y_j^m] ^2$$
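
As a minimal sketch, equation (3) might be computed as follows. The helper name `sse` and the target and output values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def sse(targets, outputs):
    """Equation (3): half the sum of squared differences over all entries."""
    targets = np.asarray(targets, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    return 0.5 * np.sum((targets - outputs) ** 2)

# Hypothetical data: 2 rows and 2 output units per row.
targets = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
outputs = np.array([[0.8, 0.1],
                    [0.3, 0.6]])

E = sse(targets, outputs)
print(E)
```

The squared differences here are 0.04, 0.01, 0.09 and 0.16, so `E` comes out at about 0.15. The factor $\frac{1}{2}$ only exists to cancel the 2 that appears when the square is differentiated.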


### 3. Find the change in weight ($\Delta w_{ij}$), the amount of the descent

The next step is to find the right amount by which to change each weight.

The change in weight is the negative gradient of the error, scaled by the learning rate $\eta$:

(__4__): $$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}} $$

Calculating the partial derivative on the right-hand side requires the [chain rule](https://www.khanacademy.org/math/ap-calculus-ab/product-quotient-chain-rules-ab/chain-rule-ab/v/chain-rule-introduction). The result is:

(__5__): $$\Delta w_{ij} = \eta \cdot (y_j - y_j^m) \cdot f'(\sum_i w_{ij} \cdot x_i) \cdot x_i$$

where the __change in weight__ is proportional to the error (scaled by the learning rate $\eta$), to the input $x_i$, and to the first derivative of the activation function evaluated at the dot product of the weights tensor and the input tensor.
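
The chain-rule step can be spelled out as follows. This is a sketch for a single record; the symbol $h_j$ for the weighted input is shorthand introduced here and is not part of the notation above.

$$h_j = \sum_i w_{ij} x_i, \qquad y_j^m = f(h_j)$$

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_j^m} \cdot \frac{\partial y_j^m}{\partial h_j} \cdot \frac{\partial h_j}{\partial w_{ij}} = -(y_j - y_j^m) \cdot f'(h_j) \cdot x_i$$

Substituting this into (4), the two minus signs cancel, which gives (5).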



### 4. Define the Gradient Descent

Equation (5) can be written more compactly as:

(__6__): $$\Delta w_{ij} = \eta \delta_j x_i$$

where the __error gradient__ $\delta_j$ is:

(__7__): $$\delta_j = (y_j - y_j^m) \cdot f'(\sum_i w_{ij} x_i) $$

The first factor on the right-hand side is the error; the second is the __gradient (the first derivative) of the activation function__, evaluated at the dot product of weights and inputs.
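
One way to gain confidence in equations (6) and (7) is to compare them with a numerical derivative of the error. The sketch below assumes a single output unit and a single record; `error_fn`, `eps` and the target value are illustrative choices, not part of the derivation.

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

def sigmoid_prime(h):
    return sigmoid(h) * (1 - sigmoid(h))

def error_fn(w, x, target):
    """SSE for one record and one output unit: E = 1/2 * (target - output)^2."""
    return 0.5 * (target - sigmoid(np.dot(w, x))) ** 2

w = np.array([0.5, -0.5])
x = np.array([1.0, 2.0])
target = 0.5

# Analytic form: -dE/dw_i = delta * x_i, with delta from equation (7).
h = np.dot(w, x)
delta = (target - sigmoid(h)) * sigmoid_prime(h)
analytic = delta * x

# Numerical form: central finite differences on E, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = eps
    numeric[i] = -(error_fn(w + step, x, target) - error_fn(w - step, x, target)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-8))
```

If the two vectors agree, the sign and the factors in (7) are consistent with the error defined in (3).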

## The code

### 1

In [10]:
import numpy as np

# define sigmoid
def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

# set initial values for w and x tensors
w = np.array([0.5, -0.5])
x = np.array([1, 2])
learningrate_eta = 0.5

# find the output of the network
output_y = sigmoid(np.dot(w, x))
output_y

0.37754066879814541

### 2

In [13]:
# define the target output (an arbitrary value chosen for this example)
target_y = np.array(0.5)

# find the actual error: target minus network output
error = target_y - output_y
print(round(float(error), 4))  # rounded for display

0.1225

In [14]:
# define sigmoid prime
# Derivative of the sigmoid function
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

# find the error gradient (equation 7): the error times the derivative
# of the activation function at the weighted input
error_gradient = error * sigmoid_prime(np.dot(w, x))

In [17]:
# find the appropriate change in weight for every input (equation 6)
delta_w = [
    learningrate_eta * error_gradient * x[0],
    learningrate_eta * error_gradient * x[1]
]

print('This is the right gradient descent step', np.round(delta_w, 4))

This is the right gradient descent step [0.0144 0.0288]
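
A single step only nudges the weights. As a rough sketch, the update from equations (6) and (7) can be repeated in a loop to watch the error shrink. The number of steps (`n_steps`) is an arbitrary assumption, and the starting values are the ones used in the cells above.

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

def sigmoid_prime(h):
    return sigmoid(h) * (1 - sigmoid(h))

w = np.array([0.5, -0.5])
x = np.array([1.0, 2.0])
target_y = 0.5
learningrate_eta = 0.5
n_steps = 100  # hypothetical number of repetitions

errors = []
for _ in range(n_steps):
    h = np.dot(w, x)                           # weighted input
    output_y = sigmoid(h)                      # equation (1)
    error = target_y - output_y                # equation (2)
    errors.append(abs(error))
    error_gradient = error * sigmoid_prime(h)  # equation (7)
    w = w + learningrate_eta * error_gradient * x  # equation (6)

print(errors[0], errors[-1])
```

With these values the absolute error should fall from roughly 0.12 towards zero, because the weighted input moves towards 0, where the sigmoid equals the target of 0.5.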
