# How Do Machines Learn?

In the last talk on machine learning, we trained a random forest classifier to map digitized images of handwritten digits to the numerals 0 through 9.

However, in that discussion, we glossed over a few important details about how these models 'learn'. 

## Following patterns found in nature

In answering the question about how machines learn, one thing that might be helpful to ask is how people, or other biological systems, learn.

A biological neuron accumulates electrical charge on its cell membrane near the axon. That charge builds up through the action of the presynaptic neurons that provide input to the cell.

_Neurotransmitters_ in the _synaptic cleft_ instruct the cell to open or close _ion channels_ in the neuron. Ions travelling in or out of the cell change the membrane potential until an _action potential_ is reached, which sends electrical charge down the axon toward the _postsynaptic neuron_ and starts the process anew for downstream neurons.

![image](bioneuron.png "a biological neuron")

## Neuron as Model

If we distill what a neuron does into a single functional unit, we can treat it as a function with several parts:

- an input vector representing the set of inputs from presynaptic neurons
- a weight for each input, representing how strongly that input influences the neuron
- a bias term representing the sensitivity of the neuron itself
- an activation function

![image](modelneuron.png "a model neuron")



This simplified mathematical representation of a neuron, used for computing, is called a _Perceptron_, and it has been around for a long time (Rosenblatt, 1957).
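Written out as a formula, the parts listed above combine roughly as follows. The symbol names are chosen here for illustration only: $x_i$ for the inputs, $w_i$ for their weights, $b$ for the bias term and $\theta$ for the activation threshold, with a simple step function as the activation.

```latex
y =
\begin{cases}
1 & \text{if } \sum_{i} w_i x_i + b \ge \theta \\
0 & \text{otherwise}
\end{cases}
```

The sum $\sum_i w_i x_i$ is the dot product of the weight vector and the input vector.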



In [11]:
# Given a mathematical representation of the neuron, we can compute it 
import numpy as np

example_input = [1, .2, .1, .05, .2]

# these are arbitrary
example_weights = [.2, .12, .4, .6, .90]

def perceptron(inputs, weights, threshold):
    """
    Aggregates the inputs, applying the weights, then compares the
    aggregated value against a threshold to produce an output of 0 or 1.
    """
    input_vector = np.array(inputs)
    weights = np.array(weights)
    # the bias weight is fixed here; it multiplies a constant bias input of 1
    bias_weight = .2

    # summing the activation uses the dot product
    activation_level = np.dot(input_vector, weights) + (bias_weight * 1)

    if activation_level >= threshold:
        perceptron_output = 1
    else:
        perceptron_output = 0
    return perceptron_output

In [10]:
output = perceptron(example_input, example_weights, 0.5)

print(output)


1


## But Where's the Learning?

So far, this explains a bit about how to represent a neuron mathematically, but it doesn't provide us with a lot of information about how it learns. 

It learns by changing the weights, based on whether the output of the system is correct or incorrect (so, in this case, it's supervised learning).

Because this is supervised learning, we know the correct answers to the problem the model is trying to solve beforehand.  Let's add a very basic supervised learning algorithm.

In [10]:
def adjust_weights(perceptron_output,
    expected_output,
    given_input,
    input_weights,
    learning_rate=1):
    """
    Takes the observed perceptron output, the expected output, the input, and
    the input weights, and returns a new set of weights.
    """
    new_weights = []
    for i, x in enumerate(given_input):
        # each weight moves in proportion to the error and to its own input
        new_weight = input_weights[i] + (expected_output - perceptron_output) * x * learning_rate
        new_weights.append(new_weight)
    return np.array(new_weights)
 

Notice that the function has a "learning rate" parameter.  We'll come back to that in a little while.
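The same update rule can also be written without the explicit loop. The sketch below is one possible NumPy version; the name `adjust_weights_vectorized` is made up for this illustration, and it assumes 0/1 perceptron outputs as above.

```python
import numpy as np

def adjust_weights_vectorized(perceptron_output, expected_output,
                              given_input, input_weights, learning_rate=1):
    """Hypothetical NumPy version of the per-weight update loop."""
    given_input = np.asarray(given_input, dtype=float)
    input_weights = np.asarray(input_weights, dtype=float)
    # error is +1, 0 or -1 for a perceptron with 0/1 outputs
    error = expected_output - perceptron_output
    # w_i <- w_i + learning_rate * error * x_i, for every i at once
    return input_weights + learning_rate * error * given_input

# Same numbers as the example input and weights used earlier:
# the perceptron answered 1 but the expected answer is 0.
updated = adjust_weights_vectorized(1, 0,
                                    [1, .2, .1, .05, .2],
                                    [.2, .12, .4, .6, .90])
print(updated)
```

When the answer is already correct the error is 0, so the weights are returned unchanged.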

In [17]:

expected_output = 0
new_weights = adjust_weights(output, expected_output, example_input, example_weights)

print(f'old weights are: {example_weights}')
print(f'new weights are {new_weights}')

old weights are: [0.2, 0.12, 0.4, 0.6, 0.9]
new weights are [-0.8  -0.08  0.3   0.55  0.7 ]


In [11]:
# Now let's put our learning algorithm and our perceptron together:
# Remember the correct output for the algorithm is 0

output = perceptron(example_input, example_weights, 0.5)
print(f'output before learning is {output}')
new_weights = adjust_weights(output, 0, example_input, example_weights)

output_again = perceptron(example_input, new_weights, 0.5)
print(f'output_after_learning is {output_again}')

output before learning is 1
output_after_learning is 0


## Now let's try to use perceptrons to do a very simple classifier task

We can train a perceptron with 2 inputs and one output to "understand" logical OR:

In [13]:
# example problem for learning logical OR

sample_data = [[0, 0],  # False, False
               [0, 1],  # False, True
               [1, 0],  # True, False
               [1, 1]]  # True, True

expected_results = [0,  # (False OR False) gives False
                    1,  # (False OR True ) gives True
                    1,  # (True  OR False) gives True
                    1]  # (True  OR True ) gives True

activation_threshold = 0.5


In [35]:
import numpy as np

weights = np.random.random(2)/1000  # Small random float 0 < w < .001
weights

array([0.00074888, 0.00065487])

In [36]:
# NOTE: perceptron() above hard-codes its own bias weight of .2, so it does not use this value
bias_weight = np.random.random() / 1000
bias_weight

0.0004419447072414342

In [21]:
output = []
for datum in sample_data:
    output.append(perceptron(datum, weights, activation_threshold))

# apply each example's correction on top of the previous one
new_weights = weights
for idx, datum in enumerate(sample_data):
    new_weights = adjust_weights(output[idx], expected_results[idx], datum, new_weights)

new_weights

# This represents one iteration, or "batch", of training our very simple neural network.



array([2.00074888, 2.00065487])

In [14]:
# Let's make a function that represents this training.
def train_perceptron(input_features,
    input_labels,
    initial_weights,
    activation_threshold,
    num_batches,
    learning_rate=1):
    """
    Trains a perceptron on the given features and labels for num_batches
    passes over the data, printing the weights before and after each pass.
    """

    weights = initial_weights
    for batch_num in range(num_batches):
        print(f'pre-batch weights are: {weights}')
        # compute the output for every example using the pre-batch weights
        output = []
        for datum in input_features:
            output.append(perceptron(datum, weights, activation_threshold))

        # then apply the corrections one example at a time
        for idx, datum in enumerate(input_features):
            weights = adjust_weights(output[idx], input_labels[idx], datum, weights, learning_rate)

        print(f'post-batch weights are: {weights}, output is: {output}, expected is {input_labels}')



In [38]:
# Let's train our simple model with a few batches

train_perceptron(sample_data, expected_results, weights, 0.5, 5)

pre-batch weights are: [0.00074888 0.00065487]
post-batch weights are: [2.00074888 2.00065487], output is: [0, 0, 0, 0], expected is [0, 1, 1, 1]
pre-batch weights are: [2.00074888 2.00065487]
post-batch weights are: [2.00074888 2.00065487], output is: [0, 1, 1, 1], expected is [0, 1, 1, 1]
pre-batch weights are: [2.00074888 2.00065487]
post-batch weights are: [2.00074888 2.00065487], output is: [0, 1, 1, 1], expected is [0, 1, 1, 1]
pre-batch weights are: [2.00074888 2.00065487]
post-batch weights are: [2.00074888 2.00065487], output is: [0, 1, 1, 1], expected is [0, 1, 1, 1]
pre-batch weights are: [2.00074888 2.00065487]
post-batch weights are: [2.00074888 2.00065487], output is: [0, 1, 1, 1], expected is [0, 1, 1, 1]


We can see here that the simple "network" (which was really just one neuron) reached a state where it stopped "learning", because it eventually got all the answers correct.

In this case, our toy network converged after only one batch! Its error rate dropped to zero.

When a model reaches a point after which it learns nothing more from the given training data set, its weights stop changing. We call this _convergence_.
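One way to make "the weights stop changing" concrete is to stop training as soon as a full pass over the data leaves the weights untouched. The sketch below is self-contained and only illustrative: `step_perceptron` and `train_until_converged` are hypothetical names, the bias weight of 0.2 mirrors the perceptron above, and the starting weights of 0.001 are an assumption standing in for the small random values used earlier.

```python
import numpy as np

def step_perceptron(inputs, weights, threshold, bias_weight=0.2):
    """Step-activation neuron with a fixed bias weight (0.2, as in the perceptron above)."""
    activation = np.dot(inputs, weights) + bias_weight
    return 1 if activation >= threshold else 0

def train_until_converged(features, labels, weights, threshold,
                          learning_rate=1, max_passes=100):
    """Return (weights, number of passes that changed the weights)."""
    features = np.asarray(features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    for passes_with_updates in range(max_passes):
        # outputs are computed with the weights from the start of the pass
        outputs = [step_perceptron(x, weights, threshold) for x in features]
        new_weights = weights.copy()
        for x, y, target in zip(features, outputs, labels):
            new_weights = new_weights + learning_rate * (target - y) * x
        if np.array_equal(new_weights, weights):
            # nothing changed during this pass: converged
            return weights, passes_with_updates
        weights = new_weights
    return weights, max_passes

or_features = [[0, 0], [0, 1], [1, 0], [1, 1]]
or_labels = [0, 1, 1, 1]
start = [0.001, 0.001]  # assumed small starting weights

_, fast = train_until_converged(or_features, or_labels, start, 0.5, learning_rate=1)
_, slow = train_until_converged(or_features, or_labels, start, 0.5, learning_rate=0.05)
print(fast, slow)
```

The `max_passes` cap is there because unchanged weights are only guaranteed to show up when the problem is one the perceptron can actually solve.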

What if we had tried a learning rate that was not the default (1)? Presumably, this would affect how many iterations it takes to reach convergence.

In [48]:
train_perceptron(sample_data, expected_results, weights, 0.5, 10, learning_rate=0.05)

pre-batch weights are: [0.00074888 0.00065487]
post-batch weights are: [0.10074888 0.10065487], output is: [0, 0, 0, 0], expected is [0, 1, 1, 1]
pre-batch weights are: [0.10074888 0.10065487]
post-batch weights are: [0.20074888 0.20065487], output is: [0, 0, 0, 0], expected is [0, 1, 1, 1]
pre-batch weights are: [0.20074888 0.20065487]
post-batch weights are: [0.25074888 0.25065487], output is: [0, 0, 0, 1], expected is [0, 1, 1, 1]
pre-batch weights are: [0.25074888 0.25065487]
post-batch weights are: [0.30074888 0.30065487], output is: [0, 0, 0, 1], expected is [0, 1, 1, 1]
pre-batch weights are: [0.30074888 0.30065487]
post-batch weights are: [0.30074888 0.30065487], output is: [0, 1, 1, 1], expected is [0, 1, 1, 1]
pre-batch weights are: [0.30074888 0.30065487]
post-batch weights are: [0.30074888 0.30065487], output is: [0, 1, 1, 1], expected is [0, 1, 1, 1]
pre-batch weights are: [0.30074888 0.30065487]
post-batch weights are: [0.30074888 0.30065487], output is: [0, 1, 1, 1], exp

As expected, adjusting the learning rate to be lower than 1 makes the network converge more slowly.

But why would you want a network to converge more slowly? Smaller steps make it less likely that an update overshoots a good set of weights, which matters more on harder problems than this one. Learning rate, number of batches, and the other parameters that we've created for the `train_perceptron` function are standard _hyperparameters_ that apply to most neural network models.  Tweaking these hyperparameters has an effect on model performance.

## Recapping: What did we just do?

__TODO put visualization of our toy network here__


## Some problems with Simple Perceptrons (simple neurons with stepwise or linear activation functions)

The basic perceptron as we implemented it above is very old. In the decades between 1957 and now, there has been a lot of evolution in neural networks, but a few of the big, early problems with neural networks were as follows:

1. Perceptrons can't solve problems whose answers aren't linearly separable. That is, they miss more complicated patterns in the data that can't be delineated by a single straight line on a chart.

2. There's no simple way to adjust the weights during learning when there is more than one layer of them.
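To make "linearly separable" concrete, the sketch below searches a small grid of candidate lines for one that splits the classes. The function name `is_linearly_separable` and the grid of candidate values are assumptions made for this illustration. A grid search that finds nothing is not a proof in general, although for XOR it is known that no separating line exists.

```python
from itertools import product

def is_linearly_separable(samples, labels, grid):
    """Brute-force search for a line w1*x1 + w2*x2 + b = 0 that splits the labels."""
    for w1, w2, b in product(grid, repeat=3):
        predictions = [1 if w1 * x1 + w2 * x2 + b > 0 else 0
                       for x1, x2 in samples]
        if predictions == list(labels):
            return True
    return False

# Candidate weights and biases: -2.0, -1.5, ..., 2.0 (an arbitrary, coarse grid)
grid = [i / 2 for i in range(-4, 5)]

samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
or_labels = [0, 1, 1, 1]
xor_labels = [0, 1, 1, 0]

print(is_linearly_separable(samples, or_labels, grid))   # a line exists for OR
print(is_linearly_separable(samples, xor_labels, grid))  # none is found for XOR
```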




In [15]:
# example of an "XOR" problem, which does NOT have a linearly separable solution:

xor_features = np.array([[0, 0],
                    [0, 1],
                    [1, 0],
                    [1, 1]])
# labels are a flat array, one label per row of features
xor_labels = np.array([0, 1, 1, 0])



train_perceptron(xor_features, xor_labels, [0.02, 0.34], 0.5, 10, learning_rate=0.05)

pre-batch weights are: [0.02, 0.34]
post-batch weights are: [0.02 0.29], output is: [0, 1, 0, 1], expected is [0 1 1 0]
pre-batch weights are: [0.02 0.29]
post-batch weights are: [0.02 0.29], output is: [0, 0, 0, 1], expected is [0 1 1 0]
pre-batch weights are: [0.02 0.29]
post-batch weights are: [0.02 0.29], output is: [0, 0, 0, 1], expected is [0 1 1 0]
pre-batch weights are: [0.02 0.29]
post-batch weights are: [0.02 0.29], output is: [0, 0, 0, 1], expected is [0 1 1 0]
pre-batch weights are: [0.02 0.29]
post-batch weights are: [0.02 0.29], output is: [0, 0, 0, 1], expected is [0 1 1 0]
pre-batch weights are: [0.02 0.29]
post-batch weights are: [0.02 0.29], output is: [0, 0, 0, 1], expected is [0 1 1 0]
pre-batch weights are: [0.02 0.29]
post-batch weights are: [0.02 0.29], output is: [0, 0, 0, 1], expected is [0 1 1 0]
pre-batch weights are: [

The weights end every batch back at [0.02, 0.29], and the output [0, 0, 0, 1] never matches the expected [0, 1, 1, 0]. More training does not help, because no single line separates the XOR classes.


## Solutions to these problems:
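1. Changing the activation function. Smooth, nonlinear functions such as sigmoid or tanh replace the hard threshold step.
    
2. Backpropagation algorithm. It passes the output error backward through the network, so the weights in every layer can be adjusted.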
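To see why more than one layer matters for XOR, here is a small two-layer network whose weights were picked by hand for this illustration (they are not learned, and `two_layer_xor` is a hypothetical name). It still uses the hard step activation. Finding such weights automatically is the job of backpropagation, which relies on a smooth activation function instead of the step.

```python
import numpy as np

def step(z):
    """Hard threshold: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

def two_layer_xor(x):
    """Two step-activation layers with hand-picked (not learned) weights."""
    x = np.asarray(x)
    # hidden layer: first unit acts like OR, second like AND
    hidden_weights = np.array([[1.0, 1.0],
                               [1.0, 1.0]])
    hidden_bias = np.array([-0.5, -1.5])
    hidden = step(x @ hidden_weights + hidden_bias)
    # output unit fires when OR is on and AND is off
    output_weights = np.array([1.0, -1.0])
    output_bias = -0.5
    return step(hidden @ output_weights + output_bias)

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(two_layer_xor(inputs))  # [0 1 1 0]
```

The hidden layer turns the four input points into a representation that a single output neuron can separate with one line.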


# What are modern neural network APIs like? 

In [4]:
import numpy as np
from keras.models import Sequential                 
from keras.layers import Dense, Activation          
from keras.optimizers import SGD      


# Our examples for an exclusive OR.
x_train = np.array([[0, 0],
                    [0, 1],
                    [1, 0],
                    [1, 1]])                        
y_train = np.array([[0],
                    [1],
                    [1],
                    [0]])       

model = Sequential()
num_neurons = 10
model.add(Dense(num_neurons, input_dim=2))
model.add(Activation('tanh'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 10)                30        
                                                                 
 activation (Activation)     (None, 10)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 11        
                                                                 
 activation_1 (Activation)   (None, 1)                 0         
                                                                 
Total params: 41
Trainable params: 41
Non-trainable params: 0
_________________________________________________________________
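The `Param #` column in the summary can be checked by hand: a dense layer has one weight per input per neuron, plus one bias per neuron. The helper name `dense_param_count` below is made up for this check.

```python
def dense_param_count(num_inputs, num_neurons):
    """Weights (one per input per neuron) plus one bias per neuron."""
    return num_inputs * num_neurons + num_neurons

hidden_params = dense_param_count(2, 10)   # 2 inputs feeding 10 neurons
output_params = dense_param_count(10, 1)   # 10 hidden outputs feeding 1 neuron
print(hidden_params, output_params, hidden_params + output_params)  # 30 11 41
```

The activation layers add no parameters of their own, which is why they show 0 in the summary.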




# NOTES:

The notes on loss functions and gradient descent come from 3Blue1Brown's great video on the topic.

Several of the images and code examples in this notebook come from, or are adapted from, Hobson Lane et al.'s text:

[Natural Language Processing in Action](https://www.manning.com/books/natural-language-processing-in-action)

If you are interested in NLP and machine learning in general, this is a great book.  It was written by members of Portland's own local Data Science Meetup group.  