# How Do Machines Learn?

In the last talk on machine learning, we trained a random forest classifier to classify digitized, handwritten images of digits into numerical symbols from 0 to 9.

However, in that discussion, we glossed over a few important details about how these models 'learn'. 

## Following patterns found in nature

In answering the question about how machines learn, one thing that might be helpful to ask is how people, or other biological systems, learn.

A biological neuron accumulates electrical charge on the cell membrane next to the axon. The charge builds up through the action of the multiple presynaptic neurons that provide input to the cell.

_Neurotransmitters_ in the _synaptic cleft_ instruct the cell to open or close _ion channels_ in the neuron. Ions travelling in or out of the cell change the membrane potential until an _action potential_ is reached, which sends electrical charge down the axon towards the _postsynaptic neuron_, starting the process anew for downstream neurons.


<img src="Neuron.svg" style="background-color:#fff;" />

## Neuron as Model

If we distill what a neuron does into a single functional unit, we can see it as a function that has several parts:

- an input vector representing the set of inputs from presynaptic neurons
- a weight for each input, representing how strongly that input influences the neuron
- a bias term representing the sensitivity of the neuron itself
- an activation function

<img src="modelneuron.png" />


This simplified mathematical representation of a neuron, used for computing, is called a _perceptron_, and it has been around for a long time (Rosenblatt, 1957).
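Written as a formula, the unit described above looks roughly like this. The symbols are notation chosen here for illustration (they are not taken from Rosenblatt's paper): $x_i$ are the inputs, $w_i$ the weights, $b$ the bias, and $\theta$ the activation threshold.

$$
y =
\begin{cases}
1 & \text{if } \sum_{i=1}^{n} w_i x_i + b \ge \theta \\
0 & \text{otherwise}
\end{cases}
$$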



In [None]:
# Given a mathematical representation of the neuron, we can compute it 
import numpy as np

example_input = [1, .2, .1, .05, .2]

# these are arbitrary
example_weights = [.2, .12, .4, .6, .90]

def perceptron(inputs, weights, threshold):
    """
    Aggregates the inputs by applying the weights, compares the result
    against a threshold, and returns 1 or 0.
    """
    input_vector = np.array(inputs)
    weights = np.array(weights)
    bias_weight = .2  # fixed bias for this example

    # the weighted sum of the inputs is a dot product; the bias input is always 1
    activation_level = np.dot(input_vector, weights) + (bias_weight * 1)

    if activation_level >= threshold:
        perceptron_output = 1
    else:
        perceptron_output = 0
    return perceptron_output

In [None]:
output = perceptron(example_input, example_weights, 0.5)

print(output)


## But Where's the Learning?

So far, this explains a bit about how to represent a neuron mathematically, but it doesn't provide us with a lot of information about how it learns. 

It learns by changing the weights, based on whether the output of the system is correct or incorrect.

Because this is supervised learning, we know the correct answers to the question that the model is trying to solve beforehand. Let's add a very basic supervised learning algorithm.

In [None]:
def adjust_weights(perceptron_output, 
    expected_output, 
    given_input,
    input_weights,
    learning_rate=1):
    """
    Takes the observed perceptron output, the expected output, the input, and the
    current weights, and returns a new set of weights.
    """
    new_weights = []
    for i, x in enumerate(given_input):
        # if the expected output is greater than the observed output, make the weight bigger
        # if the expected output is less than the observed output, make the weight smaller
        new_weight = input_weights[i] + (expected_output - perceptron_output) * x * learning_rate
        new_weights.append(new_weight)
    return np.array(new_weights)
 

Notice that the function has a "learning rate" parameter.  We'll come back to that in a little while.
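The same update can also be written without the explicit loop. The sketch below is one possible vectorized form; the name `adjust_weights_vectorized` is made up for this illustration, and the input values are just the first three from the earlier example.

```python
import numpy as np

def adjust_weights_vectorized(perceptron_output, expected_output,
                              given_input, input_weights, learning_rate=1):
    """Hypothetical loop-free variant: w + (expected - output) * x * rate."""
    error = expected_output - perceptron_output
    inputs = np.asarray(given_input, dtype=float)
    weights = np.asarray(input_weights, dtype=float)
    return weights + error * inputs * learning_rate

# The output was 1 but 0 was expected, so each weight drops by its input value.
print(adjust_weights_vectorized(1, 0, [1, .2, .1], [.2, .12, .4]))
```

When the output is already correct, the error is zero and the weights come back unchanged.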

In [None]:

expected_output = 0
new_weights = adjust_weights(output, expected_output, example_input, example_weights)

print(f'old weights are: {example_weights}')
print(f'new weights are: {new_weights}')

In [None]:
# Now let's put our learning algorithm and our perceptron together:
# Remember the correct output for the algorithm is 0

output = perceptron(example_input, example_weights, 0.5)
print(f'output before learning is {output}')
new_weights = adjust_weights(output, 0, example_input, example_weights)

output_again = perceptron(example_input, new_weights, 0.5)
print(f'output after learning is {output_again}')

## Now let's try to use perceptrons to do a very simple classifier task

We can train a perceptron with two inputs and one output to "understand" logical OR:

In [None]:
# example problem for learning logical OR

sample_data = [[0, 0],  # False, False
               [0, 1],  # False, True
               [1, 0],  # True, False
               [1, 1]]  # True, True

expected_results = [0,  # (False OR False) gives False
                    1,  # (False OR True ) gives True
                    1,  # (True  OR False) gives True
                    1]  # (True  OR True ) gives True

activation_threshold = 0.5

In [None]:
import numpy as np

weights = np.random.random(2) / 1000  # small random floats, 0 <= w < .001
weights

In [None]:
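# Note: perceptron() above uses its own fixed bias weight (.2), so this value is not used in training.
bias_weight = np.random.random() / 1000
bias_weight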

In [None]:
output = []
for datum in sample_data:
    output.append(perceptron(datum, weights, activation_threshold))

# carry the updated weights forward from one example to the next
new_weights = weights
for idx, datum in enumerate(sample_data):
    new_weights = adjust_weights(output[idx], expected_results[idx], datum, new_weights)

new_weights

# This represents one iteration, or "batch", of training our very simple neural network.



In [None]:
# Let's make a function that represents this training.
def train_perceptron(input_features, 
    input_labels, 
    initial_weights,
    activation_threshold,
    num_batches, 
    learning_rate=1):
    """
    Trains a perceptron on the given features and labels for num_batches passes,
    printing the weights before and after each pass, and returns the final weights.
    """

    weights = initial_weights
    for batch_num in range(num_batches):
        print(f'pre-batch weights are: {weights}')
        output = []
        for datum in input_features:
            output.append(perceptron(datum, weights, activation_threshold))

        for idx, datum in enumerate(input_features):
            weights = adjust_weights(output[idx], input_labels[idx], datum, weights, learning_rate)

        print(f'post-batch weights are: {weights}, output is: {output}, expected is {input_labels}')

    return weights



In [None]:
# Let's train our simple model with a few batches

train_perceptron(sample_data, expected_results, weights, 0.5, 5)

### What do these results mean?

We can see here that the simple "network" (which was really just one neuron) arrived at a state where it was not "learning" at all any more, because the network eventually got all the answers correct.

In this case, our toy network converged after only one batch! Its error rate dropped to zero.

When a model reaches a point after which it no longer learns from the given training data set, its weights stop changing. We call this _convergence_.

What if we had tried a learning rate that was not the default (1)? Presumably, this would affect how many iterations it takes to reach convergence.

In [None]:
train_perceptron(sample_data, expected_results, weights, 0.5, 10, learning_rate=0.05)

As expected, adjusting the learning rate to be lower than 1 makes the network converge more slowly.

But why would you want a network to converge more slowly? Let's keep that question in mind. The parameters that we've created for the `train_perceptron` function (learning rate, number of batches, etc.) are standard _hyperparameters_ that apply to most neural network models. Tweaking these hyperparameters has an effect on model performance.
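To see the effect of one hyperparameter in isolation, here is a small self-contained sketch that counts how many batches a single perceptron needs on the OR data for a few learning rates. The helper name `batches_until_correct` is made up for this example, and it assumes zero starting weights (rather than small random ones) and the same fixed bias of .2 and threshold of .5 used above.

```python
import numpy as np

OR_INPUTS = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
OR_LABELS = np.array([0, 1, 1, 1])

def batches_until_correct(learning_rate, threshold=0.5, bias=0.2, max_batches=100):
    """Count the batches needed before every OR example is classified correctly."""
    weights = np.zeros(2)  # assumption: start from zero weights
    for batch in range(max_batches):
        outputs = (OR_INPUTS @ weights + bias >= threshold).astype(int)
        if np.array_equal(outputs, OR_LABELS):
            return batch
        # Same update rule as adjust_weights, applied once per example.
        for x, label, out in zip(OR_INPUTS, OR_LABELS, outputs):
            weights = weights + (label - out) * x * learning_rate
    return None  # not all correct within max_batches

for rate in (1, 0.5, 0.05):
    print(f'learning rate {rate}: {batches_until_correct(rate)} batches')
```

Smaller learning rates take smaller steps per batch, so they need more batches to reach weights that classify every example correctly.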

## Recapping: What did we just do?

1. Start with a problem that is of a predefined dimensionality 
    - input is a vector (1x2 vector of 0 or 1) in the case of our "logical OR" problem
    - output is scalar (0 or 1)

2. Create a processing unit that can handle a problem that is of that dimensionality 
    - 1x2 vector of randomly initialized weights
    - 1 bias weight 
    - a step activation function

3. Create a way to adjust the weights after data are processed
    - in this case, we are taking the error output and changing the weights that contributed to the error, as defined in our `adjust_weights` function: 
    ```python
        # if the expected output is greater than the observed output, make the weight bigger
        # if the expected output is less than the observed output, make the weight smaller
        new_weight = input_weights[i] + (expected_output - perceptron_output) * x * learning_rate
    ```

    The error, `expected_output - perceptron_output`, is what drives this update. A function that measures this kind of error is called a _cost function_, or _loss function_. A neural network (or other classifier) _learns_ by minimizing a cost function. In the network's case, it does this by making incremental changes to the weight parameters.
    
    How does this work, in more complex, less trivial networks?
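To make the idea of a cost function concrete, here is a minimal sketch of one very simple cost for the perceptron above: the number of training examples it gets wrong. The function name `count_errors` is made up for this illustration, and real networks usually use smoother cost functions than a raw error count.

```python
import numpy as np

def count_errors(weights, bias, threshold, features, labels):
    """Hypothetical cost: the number of examples the perceptron misclassifies."""
    activations = np.asarray(features, dtype=float) @ np.asarray(weights, dtype=float) + bias
    outputs = (activations >= threshold).astype(int)
    return int(np.sum(outputs != np.asarray(labels)))

or_features = [[0, 0], [0, 1], [1, 0], [1, 1]]
or_labels = [0, 1, 1, 1]

print(count_errors([0.0, 0.0], 0.2, 0.5, or_features, or_labels))  # 3: untrained weights
print(count_errors([2.0, 2.0], 0.2, 0.5, or_features, or_labels))  # 0: weights that solve OR
```

Learning, in this picture, is the search for weights that push this number as low as possible.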



## Some problems with simple perceptrons (single neurons with step or linear activation functions)


The basic perceptron as we implemented it above is very old. In the decades between 1957 and now, there has been a lot of evolution in neural networks, but a few of the big, early problems with neural networks were as follows:

1. A single perceptron can't solve problems whose answers aren't linearly separable, that is, problems where the classes can't be divided by a straight line on a chart (or by a plane, in higher dimensions).

2. There was no simple way to adjust the weights during learning when a network had more than one layer of them.
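The first problem can be illustrated with a brute-force search. The sketch below tries every combination of two weights and a bias on a coarse grid (the grid range and the name `separable_on_grid` are arbitrary choices for this example) and checks whether any of them classifies all four examples. It is an illustration rather than a proof, although for XOR it can be shown that no such combination exists.

```python
import itertools
import numpy as np

FEATURES = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
GRID = np.linspace(-2, 2, 21)  # candidate values for each weight and the bias

def separable_on_grid(labels, threshold=0.5):
    """Return True if some (w1, w2, bias) on the grid classifies every example."""
    labels = np.asarray(labels)
    for w1, w2, bias in itertools.product(GRID, repeat=3):
        outputs = (FEATURES @ np.array([w1, w2]) + bias >= threshold).astype(int)
        if np.array_equal(outputs, labels):
            return True
    return False

print('OR :', separable_on_grid([0, 1, 1, 1]))
print('XOR:', separable_on_grid([0, 1, 1, 0]))
```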




In [None]:
# example of an "XOR" problem, which does NOT have a linearly separable solution:
xor_features = np.array([[0, 0],
                    [0, 1],
                    [1, 0],
                    [1, 1]])
xor_labels = np.array([0, 1, 1, 0])  # flat labels, one per row of xor_features

train_perceptron(xor_features, xor_labels, [0.02, 0.34], 0.5, 10, learning_rate=0.05)

Notice how this model never gets all four answers right, no matter how many batches we run: it does _not_ converge on a solution!


## Solutions to some of the early problems with neural nets:

- __Add more neurons, and layers of neurons to the network__

    <img src="Colored_neural_network.svg.png" style="background-color: #fff"/>

- __Backpropagation algorithm__

- __Nonlinear activation functions__

    Implementing the backpropagation algorithm efficiently means being able to differentiate the activation function, and the step function has no useful derivative. The sigmoid activation function was one of the first to be adopted. More generally, the choice of activation function can be an important hyperparameter when training a model.

    <img src="Logistic-curve.svg" style="background-color: #fff"/>

    ![image](ReLU_and_GELU.svg "Rectified and Gaussian Error Linear Units")
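As a rough sketch, a few of these activation functions can be written in NumPy as follows. The function names are chosen for this example, and the step threshold of 0.5 simply mirrors the one used for the perceptron earlier.

```python
import numpy as np

def step(z, threshold=0.5):
    """The perceptron's activation: jumps from 0 to 1 at the threshold."""
    return (np.asarray(z) >= threshold).astype(float)

def sigmoid(z):
    """Smooth S-shaped curve between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Slope of the sigmoid, which backpropagation needs."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, z)

z = np.linspace(-4, 4, 9)
print(step(z))
print(sigmoid(z))
print(sigmoid_derivative(z))
print(relu(z))
```

The sigmoid has a non-zero slope everywhere, which gives a learning algorithm a direction to move in. The step function is flat on both sides of its threshold, so its slope carries no such information.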


## End Result:

The combination of multiple layers of neurons and nonlinear activation functions is what differentiates modern neural networks from the earlier perceptrons. The backpropagation algorithm allows for meaningful updates to multiple layers of neurons, based on the error in the output. It allows us to meaningfully change the weights of hidden layers and not just a single vector of weights.
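A minimal NumPy sketch of backpropagation on the XOR data is shown below. It assumes one hidden layer of 8 `tanh` units, a sigmoid output, a cross-entropy loss, and a learning rate of 0.5. All of these sizes and names are illustrative choices, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Assumed architecture: 2 inputs -> 8 tanh hidden units -> 1 sigmoid output.
w_hidden = rng.normal(0, 1, size=(2, 8))
b_hidden = np.zeros(8)
w_out = rng.normal(0, 1, size=(8, 1))
b_out = np.zeros(1)
learning_rate = 0.5
eps = 1e-12  # keeps log() away from zero

def forward(inputs):
    hidden = np.tanh(inputs @ w_hidden + b_hidden)
    output = 1.0 / (1.0 + np.exp(-(hidden @ w_out + b_out)))
    return hidden, output

losses = []
for _ in range(2000):
    hidden, output = forward(x)
    loss = -np.mean(y * np.log(output + eps) + (1 - y) * np.log(1 - output + eps))
    losses.append(float(loss))

    # Backward pass: push the output error back through each layer.
    grad_out = (output - y) / len(x)                        # sigmoid + cross-entropy
    grad_hidden = (grad_out @ w_out.T) * (1 - hidden ** 2)  # tanh'(z) = 1 - tanh(z)^2

    w_out -= learning_rate * (hidden.T @ grad_out)
    b_out -= learning_rate * grad_out.sum(axis=0)
    w_hidden -= learning_rate * (x.T @ grad_hidden)
    b_hidden -= learning_rate * grad_hidden.sum(axis=0)

print(f'loss: {losses[0]:.3f} -> {losses[-1]:.3f}')
print(forward(x)[1].round(2).ravel())
```

The key step is `grad_hidden`: the error at the output is passed backwards through the output weights, which is what lets the hidden layer's weights be updated too. How low the loss ends up depends on the random starting weights.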

[short video on optimizing a loss function](https://www.youtube.com/watch?v=IHZwWFHWa-w&t=310s)

If this kind of iterative optimization is visualized in 3 dimensions, it can be thought of as traversing a slope, trying to find the deepest possible 'trough', or as zigzagging across a topographical map:

[Some 2- and 3-d visualizations of gradient descent](https://smunix.github.io/en.wikipedia.org/wiki/Gradient_descent.html)

This kind of iterative traversal of the parameter space, with the vertical axis being the value of the cost function that we are trying to minimize, is called _gradient descent_. A fully connected neural network has converged when its position on the error surface ceases to change significantly over successive learning batches.
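Here is a small sketch of gradient descent on a made-up, bowl-shaped cost surface with two parameters. The cost function, starting point, and learning rate are all invented for illustration; a real network's cost surface has many more dimensions and is far less regular.

```python
import numpy as np

def cost(w):
    """A made-up bowl-shaped cost surface with its minimum at (1, -0.5)."""
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 0.5) ** 2

def gradient(w):
    """Partial derivatives of the cost with respect to each parameter."""
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)])

w = np.array([-2.0, 2.0])  # arbitrary starting point
learning_rate = 0.1
for _ in range(50):
    # Step downhill: move against the gradient, scaled by the learning rate.
    w = w - learning_rate * gradient(w)

print(w, cost(w))
```

After enough steps the parameters settle close to the bottom of the bowl, and further steps barely change them.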

## Revisiting a question from before

Above, one of the things we asked is "why would we want a neural network to learn more slowly"?  

- One answer is that one of the problems we sometimes see when we train ANNs is usually called the "exploding gradient" problem. It can occur if learning rates are set too high. Conceptually, it means that we are "jumping around" the error surface in big leaps rather than traversing it slowly. The steps either take us away from one of the deep 'troughs' that we want to be in, or are too large for us to make the fine-grained parameter changes we need to reach the bottom.

- There are other variations on gradient descent algorithms, and some are more complex than others:
    - Randomization in gradient descent (stochastic gradient descent)
    - Manipulation of batch size and overall number of batches as hyperparameters
    - Manipulation of the underlying loss function to be optimized
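The effect of the step size can be seen even on a one-parameter toy problem. The sketch below runs gradient descent on f(x) = x² with three different learning rates. The function `descend` and the specific rates are chosen purely for illustration.

```python
def descend(learning_rate, start=1.0, steps=20):
    """Gradient descent on f(x) = x**2, whose gradient is 2*x and minimum is at 0."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * 2 * x
    return x

for rate in (0.1, 0.9, 1.1):
    print(f'learning rate {rate}: x after 20 steps = {descend(rate):.4f}')
```

With a rate of 0.1 the value shrinks steadily towards the minimum. At 0.9 it overshoots and flips sign on every step but still closes in. At 1.1 each step lands further from the minimum than the last, so the value grows instead of shrinking.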



# What are modern neural network APIs like? 

[Keras](https://keras.io/) is a layer of abstraction that provides a fluent API for building neural nets.

Keras, and other machine-learning APIs, give us the ability to manipulate hyperparameters such as learning rates and network architecture at a high level.

Below, we'll try to train a net with multiple neurons.  We'll see if we get a better result by changing the topology of our net.

In [None]:
import numpy as np
from keras.models import Sequential                 
from keras.layers import Dense, Activation          
from keras.optimizers import SGD      


# Our examples for an exclusive OR.
x_train = np.array([[0, 0],
                    [0, 1],
                    [1, 0],
                    [1, 1]])                        
y_train = np.array([[0],
                    [1],
                    [1],
                    [0]])       

model = Sequential()
num_neurons = 10
model.add(Dense(num_neurons, input_dim=2))
model.add(Activation('tanh'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.summary()

In [None]:
# SGD is stochastic gradient descent, which shuffles the order of the training data and updates the weights frequently

sgd = SGD(learning_rate=0.1)  # older Keras versions spelled this argument `lr`
model.compile(loss='binary_crossentropy', 
    optimizer=sgd,
    metrics=['accuracy'])

In [None]:
# now let's see if a model can learn XOR
model.fit(x_train, y_train, epochs=500)

# NOTES:

Several of the notes on loss functions and gradient descent come from [3Blue1Brown's great video](https://www.youtube.com/watch?v=IHZwWFHWa-w) on the topic.

Several of the images and code examples in this notebook are adapted from Hobson Lane et al.'s text:

[Natural Language Processing in Action](https://www.manning.com/books/natural-language-processing-in-action)

If you are interested in NLP and machine learning in general, this is a great book.  It was written by members of Portland's own local Data Science Meetup group.

### References

- https://en.wikipedia.org/wiki/Activation_function
- Rosenblatt, 1957 [The Perceptron: A Perceiving and Recognizing Automaton](https://blogs.umass.edu/brain-wars/files/2016/03/rosenblatt-1957.pdf)
- Minsky and Papert, 1969 [Perceptrons: An Introduction to Computational Geometry](https://direct.mit.edu/books/book/3132/PerceptronsAn-Introduction-to-Computational)


### Image attributions

By User:Dhp1080 - "Anatomy and Physiology" by the US National Cancer Institute's Surveillance, Epidemiology and End Results (SEER) Program, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=1474927

By Qef (talk) - Created from scratch with gnuplot, Public Domain, https://commons.wikimedia.org/w/index.php?curid=4310325

By Ringdongdang - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=95947821

By Mayranna - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=30128320