# What is a Perceptron

**This tutorial is adapted from [Damir Cavar](http://damir.cavar.me/) and Callie Federer tutorial**

Let's imagine we have to make a decision about going for a workout or not. We have the following things to take into account:

- #### Weather (Sunny or Rainy) ?
- #### Time of Day (Morning or Evening) ?
- #### Energy Level (Energized) ?

![Perceptron_Cartoon](perceptron_cartoon.png)

A perceptron is simple model that can help us solve such a problem given enough data. **The perceptron can be defined as an algorithm for supervised binary classification.**

## Tutorial Example: To Gym or Not To Gym

We import *numpy* and define our *activation* function.

In [7]:
import numpy as np

def bactivation(z):
    if z == 0.5:
        return 1
    else: return 0

Here, we define our example data **input** $x$ 

In [8]:
weather = 0
evening = 1
energized = 1
x = np.array([weather, evening, energized])

Furthermore, we define the corresponding **weights** $w$, and **bias** $b$ of our perceptron $p$:

In [3]:
w = np.array([0.5, 0.5, 0])
b = 0

Our perceptron would compute $p$ as the **dot-product** $w \cdot x$ and add the **bias** $b$ to it. Subsequently, the sigmoid function defined above will convert this $p$ value to the **activation value** $a$ of the unit:

In [5]:
p = w.dot(x) + b
a = bactivation(p)

Let's see what we get:

In [9]:
print("Perceptron Value (p):", p)
print("Activation Value of Perceptron (a):", a)

Perceptron Value (p): 0.5
Activation Value of Perceptron (a): 1


Plot a viz with the different values here.

## Logic Gates (AND, OR) with Perceptrons

The task is to implement a simple **perceptron** to compute logical operations like AND, and OR.

- Input: $x_1$ and $x_2$
- Bias: $b = -1$ for AND; $b = 0$ for OR
- Weights: $w = [1, 1]$

with the following activation function:

$$
y = \begin{cases}
    \ 0 & \quad \text{if } w \cdot x + b \leq 0\\
    \ 1 & \quad \text{if } w \cdot x + b > 0
  \end{cases}
$$

We can define this threshold function in Python as:

In [6]:
def activation(z):
    if z > 0:
        return 1
    return 0

For AND we could implement a perceptron as:

In [7]:
w = np.array([1, 1])
b = -1
x = np.array([0, 0])
print("0 AND 0:", activation(w.dot(x) + b))
x = np.array([1, 0])
print("1 AND 0:", activation(w.dot(x) + b))
x = np.array([0, 1])
print("0 AND 1:", activation(w.dot(x) + b))
x = np.array([1, 1])
print("1 AND 1:", activation(w.dot(x) + b))

0 AND 0: 0
1 AND 0: 0
0 AND 1: 0
1 AND 1: 1


For OR we could implement a perceptron as:

In [8]:
w = np.array([1, 1])
b = 0
x = np.array([0, 0])
print("0 OR 0:", activation(w.dot(x) + b))
x = np.array([1, 0])
print("1 OR 0:", activation(w.dot(x) + b))
x = np.array([0, 1])
print("0 OR 1:", activation(w.dot(x) + b))
x = np.array([1, 1])
print("1 OR 1:", activation(w.dot(x) + b))

0 OR 0: 0
1 OR 0: 1
0 OR 1: 1
1 OR 1: 1


## The Famous XOR Problem

The power of neural units comes from combining them into larger networks. Minsky and Papert (1969): A single neural unit cannot compute the simple logical function XOR.

With this narrow definition of a perceptron, it seems not possible to implement an XOR logic perceptron. The restriction is that there is a threshold function that is binary and piecewise linear.

As one student in my 2020 L645 class, Kazuki Yabe, points out, with a different activation function and a different weight vector, one unit can of course handle XOR. If we use the following activation function:

$$
y = \begin{cases}
    \ 0 & \quad \text{if } w \cdot x + b \neq 0.5\\
    \ 1 & \quad \text{if } w \cdot x + b = 0.5
  \end{cases}
$$

In [11]:
def bactivation(z):
    if z == 0.5:
        return 1
    else: return 0

If we assume the weights to be set to 0.5 and the bias to 0, one unit can handle the XOR logic:

- Input: $x_1$ and $x_2$
- Bias: $b = 0$ for XOR
- Weights: $w = [0.5, 0.5]$

In [12]:
w = np.array([0.5, 0.5])
b = 0
x = np.array([0, 0])
print("0 OR 0:", bactivation(w.dot(x) + b))
x = np.array([1, 0])
print("1 OR 0:", bactivation(w.dot(x) + b))
x = np.array([0, 1])
print("0 OR 1:", bactivation(w.dot(x) + b))
x = np.array([1, 1])
print("1 OR 1:", bactivation(w.dot(x) + b))

0 OR 0: 0
1 OR 0: 1
0 OR 1: 1
1 OR 1: 0


This particular activation function is of course not differentiable, and it remains to be shown that the weights can be learned, but nevertheless, a single unit can be identified that solves the XOR problem.

The difference between Minsky and Papert's (1969) definition of a perceptron and this unit is that - as Julia Hockenmaier pointed out - a perceptron is defined to have a decision function that would be binary and piecewise linear. This means that the unit that solves the XOR problem is not compatible with the definition of perceptron as in Minsky and Papert (1969) (p.c. Julia Hockenmaier).

# (WIP) Training Perceptrons

Up until this point, we have been solely performing inference with these models. How do we get the weights? In this section, we delve into how to train these perceptrons with data (Weight Estimation). **We will be classifying between apples and oranges!**

In [17]:
%matplotlib inline
import pandas as pd

fruits = pd.read_table('apples_n_oranges.txt')
fruits.head()

Unnamed: 0,fruit_label,fruit_name,mass,width,height,color_score
0,0,apple,140,7.3,7.1,0.87
1,1,orange,140,6.7,7.1,0.72
2,0,apple,152,7.6,7.3,0.69
3,1,orange,142,7.6,7.8,0.75
4,1,orange,144,6.8,7.4,0.75


In [16]:
import numpy as np 
%matplotlib inline
import pandas as pd
import math

Let's first define our activation function for our perceptron:

In [13]:
def step_fxn(x):
    ''' an implementation of the step function'''
    if summation > 0:
        activation = 1   ### step function
    else:
        activation = 0 
    return activation 

In [20]:
def MinMaxScaler(x):
    ''' Transforms features by scaling each feature to range (0,1)'''
    scaled_xs = [] 
    x_min = np.min(x)
    x_max = np.max(x)
    for v in x:
        x_std = (v - x_min) / (x_max - x_min)
        x_scaled = x_std * (1 - 0) + 0
        scaled_xs.append(x_scaled)
    return scaled_xs 

In [29]:
def accuracy(n_correct, n_total):
    accuracy = n_correct / n_total
    print('Accuracy is ' + str(100*accuracy) + '%')
    return accuracy 

In [21]:
### 
unique_fruit_cat = fruits['fruit_name'].unique()
for fruit in unique_fruit_cat:
    print(fruit, ": ", len(fruits[fruits['fruit_name'] == fruit]))

apple :  19
orange :  19


In [22]:
###
training_data = fruits.sample(frac=0.8, replace=False)
test_data = fruits.drop(training_data.index)
print(len(training_data), len(test_data))

30 8


In [23]:
feature_names = ['mass', 'width', 'height', 'color_score']
train = fruits.sample(frac=0.8, random_state=200)
test = fruits.drop(train.index)
X_train, y_train = train[feature_names], train['fruit_label']
X_test, y_test = test[feature_names], test['fruit_label']

In [24]:
X_train_scaled = pd.DataFrame()
X_test_scaled = pd.DataFrame()
X_train_scaled['mass'] = MinMaxScaler(X_train['mass'])
X_train_scaled['width'] = MinMaxScaler(X_train['width'])
X_train_scaled['height'] = MinMaxScaler(X_train['height'])
X_train_scaled['color_score'] = MinMaxScaler(X_train['color_score'])
X_test_scaled['mass'] = MinMaxScaler(X_test['mass'])
X_test_scaled['width'] = MinMaxScaler(X_test['width'])
X_test_scaled['height'] = MinMaxScaler(X_test['height'])
X_test_scaled['color_score'] = MinMaxScaler(X_test['color_score'])

In [25]:
X_train_scaled.iloc[0]

mass           0.054054
width          0.310345
height         0.192308
color_score    0.368421
Name: 0, dtype: float64

In [26]:
learning_rate = 0.001 ## how large of updates should we make at each step
weights = [0.46, 1.22, 1.04, -0.23] 
epochs = 1000 ## number of times to iterate through all the training set examples and update the weights

In [27]:
for _ in range(epochs):
    n_correct = 0 
    for idx in range(len(X_train_scaled)):
        ### select the training sample 
        X = X_train_scaled.iloc[idx]
        Y = y_train.iloc[idx]
        
        ### forward pass (1) calculate the weighted sum over all the inputs
        summation = sum(X*weights) 
        ### forward pass (2) calculate the activation function
        activation = step_fxn(summation)

        ### calculate the error 
        error = Y - activation
        if (error==0):
            n_correct +=1
        ### backwards pass: calculate the update to the weights. w = w + learning_rate *error*X
        d_weights = learning_rate * error* X
        weights = weights + d_weights
        
    print('Epoch : ' + str(_))
    accuracy(n_correct, len(X_train_scaled))
    

Epoch : 0


NameError: name 'accuracy' is not defined

In [28]:
n_correct = 0 
for idx in range(len(X_test_scaled)):
    
    ### select testing sample 
    X = X_test_scaled.iloc[idx]
    Y = y_test.iloc[idx]

    ### calculate the model output 
    summation = sum(X*weights)#+ bias
    activation = step_fxn(summation)
    
    ### add one to number correct if they match 
    if(activation == Y):
        n_correct += 1
    
accuracy = n_correct / len(X_test)
print('Accuracy is ' + str(100*accuracy) + '%')

Accuracy is 37.5%


## Revisiting Minsky and Papert (1969): Tri-Perceptron XOR Solution

There is a proposed solution in [Goodfellow et al. (2016)](https://www.deeplearningbook.org/) for the XOR problem, using a network with two layers of ReLU-based units.

![XOR Network](XOR_Network.png)

This two layer and three perceptron network solves the problem.

For more deiscussion on this problem, consult:

- [Wikipedia on the XOR problem](https://en.wikipedia.org/wiki/Perceptron)
- [Solving XOR with a single Perceptron](https://medium.com/@lucaspereira0612/solving-xor-with-a-single-perceptron-34539f395182)

# (WIP) Multi-Layer Perceptron

Essentially, the multi-layer perceptron is the solution to the **Famous XOR Problem** and addresses Minsky and Papert's concerns.

In [32]:
def Sigmoid(Z):
    ''' the sigmoid function'''
    return 1/(1+np.exp(-Z))

def dSigmoid(Z):
    ''' the derivative of sigmoid function'''
    s = 1/(1+np.exp(-Z))
    dZ = s * (1-s)
    return dZ

In [33]:
nn = {
'x1': 70, 
'x2': 16 ,
'w1': 0.15,
'w2': 0.20,
'w3': 0.25,
'w4': 0.30,
'w5': 0.40,
'w6': 0.45,
'w7': 0.50,
'w8': 0.55, 
'target1': 1.0,
'target2': 0.0,
'eta' : 0.1
}

<img src="network.png" width="350">

### The Forward pass

Calculate the weighted sum and output for both hidden layers and both outputs. 

$net_{h1} = w_1 * x_1 + w_2 * x_2$

$out_{h1} = $
$\frac{1}{1 + e^{-net_{h1}}}$

$net_{h2} = w_3 * x_1 + w_4 * x_2$

$out_{h2} = $
$\frac{1}{1 + e^{-net_{h2}}}$

$net_{o1} = w_5 * out_{h1} + w_6 * out_{h2}$

$out_{o1} = $
$\frac{1}{1 + e^{-net_{o1}}}$

$net_{o2} = w_7 * out_{h1} + w_8 * out_{h2}$

$out_{o2} = $
$\frac{1}{1 + e^{-net_{o2}}}$



In [34]:
def forward(nn):
    nn['net_h1'] = nn['w1'] * nn['x1'] + nn['w2'] * nn['x2']
    nn['net_h2'] = nn['w3'] * nn['x1'] + nn['w4'] * nn['x2']
    nn['out_h1'] = Sigmoid(nn['net_h1'])
    nn['out_h2'] = Sigmoid(nn['net_h2'])
    nn['net_o1'] = nn['w5'] * nn['out_h1']  + nn['w6'] * nn['out_h2']
    nn['net_o2'] = nn['w7'] * nn['out_h1']  + nn['w8'] * nn['out_h2']
    nn['out_o1'] = Sigmoid(nn['net_o1'])
    nn['out_o2'] = Sigmoid(nn['net_o2']) 
    return nn

In [35]:
forward(nn)

{'x1': 70,
 'x2': 16,
 'w1': 0.15,
 'w2': 0.2,
 'w3': 0.25,
 'w4': 0.3,
 'w5': 0.4,
 'w6': 0.45,
 'w7': 0.5,
 'w8': 0.55,
 'target1': 1.0,
 'target2': 0.0,
 'eta': 0.1,
 'net_h1': 13.7,
 'net_h2': 22.3,
 'out_h1': 0.9999988775548947,
 'out_h2': 0.9999999997933511,
 'net_o1': 0.849999550928966,
 'net_o2': 1.0499994386637905,
 'out_o1': 0.7005670482710666,
 'out_o2': 0.7407747913901797}

### Error calculation 

The squared error function

$E_{total} = \sum\frac{1}{2}(target - output)^2$

$E_{o1} = \frac{1}{2}(target_{o1} - output_{o1})^2$

$E_{o2} = \frac{1}{2}(target_{o2} - output_{o2})^2$

$E_{total} = E_{o1} + E_{o2}$

In [None]:
def calc_error(nn):
    nn['err1'] = (1/2) * (nn['target1'] - nn['out_o1'])**2
    nn['err2'] = (1/2) * (nn['target2'] - nn['out_o2'])**2
    nn['total_error'] = nn['err1'] + nn['err2']
    return nn

In [None]:
calc_error(nn)

### The backwards pass

output weight updates: 

$\frac{\partial E_{total}}{\partial w_5} = \frac{\partial E_{total}}{\partial out_{o1}} * \frac{\partial out_{o1}}{\partial net_{o1}} * \frac{\partial net_{o1}}{\partial w_5}$

$\frac{\partial E_{total}}{\partial w_6} = \frac{\partial E_{total}}{\partial out_{o1}} * \frac{\partial out_{o1}}{\partial net_{o1}} * \frac{\partial net_{o1}}{\partial w_6}$

$\frac{\partial E_{total}}{\partial w_7} = \frac{\partial E_{total}}{\partial out_{o2}} * \frac{\partial out_{o2}}{\partial net_{o2}} * \frac{\partial net_{o2}}{\partial w_7}$

$\frac{\partial E_{total}}{\partial w_8} = \frac{\partial E_{total}}{\partial out_{o2}} * \frac{\partial out_{o2}}{\partial net_{o2}} * \frac{\partial net_{o2}}{\partial w_8}$


hidden weight updates: 

$\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial out_{h1}} * \frac{\partial out_{h1}}{\partial net_{h1}} * \frac{\partial net_{h1}}{\partial w_1}$

$\frac{\partial E_{total}}{\partial w_2} = \frac{\partial E_{total}}{\partial out_{h1}} * \frac{\partial out_{h1}}{\partial net_{h1}} * \frac{\partial net_{h1}}{\partial w_2}$

$\frac{\partial E_{total}}{\partial w_3} = \frac{\partial E_{total}}{\partial out_{h2}} * \frac{\partial out_{h2}}{\partial net_{h2}} * \frac{\partial net_{h2}}{\partial w_3}$

$\frac{\partial E_{total}}{\partial w_4} = \frac{\partial E_{total}}{\partial out_{h2}} * \frac{\partial out_{h2}}{\partial net_{h2}} * \frac{\partial net_{h2}}{\partial w_4}$

In [36]:
def backward(nn):
    ############ output weights 
    
    ### w5 
    nn['dErr_outo1'] = -(nn['target1'] - nn['out_o1'])
    nn['douto1_neto1'] = dSigmoid(nn['out_o1'])
    nn['dneto1_w5'] = nn['out_h1']
    w5 = nn['w5'] - nn['eta'] * nn['dErr_outo1'] * nn['douto1_neto1'] * nn['dneto1_w5']
    
    ### w6
    nn['dneto1_w6'] = nn['out_h2']
    w6 = nn['w6'] - nn['eta'] * nn['dErr_outo1'] * nn['douto1_neto1'] * nn['dneto1_w6']

    
    ### w7 
    nn['dErr_outo2'] = -(nn['target2'] - nn['out_o2'])
    nn['douto2_neto2'] = dSigmoid(nn['out_o2'])
    nn['dneto2_w7'] = nn['out_h1']
    w7 = nn['w7'] - nn['eta'] * nn['dErr_outo2'] * nn['douto2_neto2'] * nn['dneto2_w7']
    
    ### w8
    nn['dneto2_w8'] = nn['out_h2']
    w8 = nn['w8'] - nn['eta'] * nn['dErr_outo2'] * nn['douto2_neto2'] * nn['dneto2_w8']
    
    ############ hidden weights 
    
    ### w1 
    nn['dErr_neto1'] = nn['dErr_outo1'] * nn['douto1_neto1']
    nn['dneto1_outh1'] = nn['w5']
    nn['dErr1_outh1'] = nn['dErr_neto1'] * nn['dneto1_outh1']
    nn['dErr_neto2'] = nn['dErr_outo2'] * nn['douto2_neto2']
    nn['dneto2_outh1'] = nn['w7']
    nn['dErr2_outh1'] = nn['dErr_neto2'] * nn['dneto2_outh1']
    nn['dErr_outh1'] = nn['dErr1_outh1'] + nn['dErr2_outh1']
    nn['douth1_neth1'] = dSigmoid(nn['out_h1'])
    nn['dneth1_w1'] = nn['x1']
    w1 = nn['w1'] - nn['eta'] * nn['dErr_outh1'] * nn['douth1_neth1'] * nn['dneth1_w1']
    
    ### w2
    nn['dneth1_w2'] = nn['x2']
    w2 = nn['w2'] - nn['eta'] * nn['dErr_outh1'] * nn['douth1_neth1'] * nn['dneth1_w2']
    
    ### w3
    nn['dneto1_outh2'] = nn['w6']
    nn['dErr1_outh2'] = nn['dErr_neto1'] * nn['dneto1_outh2']
    nn['dneto2_outh2'] = nn['w8'] 
    nn['dErr2_outh2'] = nn['dErr_neto2'] * nn['dneto2_outh2']
    nn['dErr_outh2'] = nn['dErr1_outh2'] + nn['dErr2_outh2']
    nn['douth2_neth2'] = dSigmoid(nn['out_h2'])
    nn['dneth2_w3'] = nn['x1']
    w3 = nn['w3'] - nn['eta'] * nn['dErr_outh2'] * nn['douth2_neth2'] * nn['dneth2_w3']
    
    ### w4
    nn['dneth2_w4'] = nn['x2']
    w4 = nn['w4'] - nn['eta'] * nn['dErr_outh2'] * nn['douth2_neth2'] * nn['dneth2_w4']
    
    ### update all weights simultaneously
    nn['w1'] = w1
    nn['w2'] = w2
    nn['w3'] = w3
    nn['w4'] = w4
    nn['w5'] = w5
    nn['w6'] = w6
    nn['w7'] = w7
    nn['w8'] = w8

    return nn

In [37]:
backward(nn)

{'x1': 70,
 'x2': 16,
 'w1': 0.07510107754711555,
 'w2': 0.18288024629648356,
 'w3': 0.16852474274223866,
 'w4': 0.2813770840553688,
 'w5': 0.4066375399108027,
 'w6': 0.45663754735971357,
 'w7': 0.48380575740490334,
 'w8': 0.5338057392310812,
 'target1': 1.0,
 'target2': 0.0,
 'eta': 0.1,
 'net_h1': 13.7,
 'net_h2': 22.3,
 'out_h1': 0.9999988775548947,
 'out_h2': 0.9999999997933511,
 'net_o1': 0.849999550928966,
 'net_o2': 1.0499994386637905,
 'out_o1': 0.7005670482710666,
 'out_o2': 0.7407747913901797,
 'dErr_outo1': -0.29943295172893336,
 'douto1_neto1': 0.22167057175103325,
 'dneto1_w5': 0.9999988775548947,
 'dneto1_w6': 0.9999999997933511,
 'dErr_outo2': 0.7407747913901797,
 'douto2_neto2': 0.2186124711651475,
 'dneto2_w7': 0.9999988775548947,
 'dneto2_w8': 0.9999999997933511,
 'dErr_neto1': -0.06637547361085219,
 'dneto1_outh1': 0.4,
 'dErr1_outh1': -0.02655018944434088,
 'dErr_neto2': 0.16194260772265381,
 'dneto2_outh1': 0.5,
 'dErr2_outh1': 0.08097130386132691,
 'dErr_outh1': 0

In [38]:
def run_network(nn, niters = 10):
    for i in range(niters):
        nn = forward(nn)
        print('output1: ' + str(nn['out_o1']) + ' target1: ' + str(nn['target1']))
        print('output2: ' + str(nn['out_o2']) + ' target2: ' + str(nn['target2']))
        nn = calc_error(nn)
        print('Error: ' + str(nn['total_error']))

        nn = backward(nn)

In [39]:
run_network(nn, niters = 200)

output1: 0.703320758742449 target1: 1.0
output2: 0.7344807324321174 target2: 0.0


NameError: name 'calc_error' is not defined

# (WIP) Adaptive Optimization with MLP

# Assignment: Reproducing Rosenblatt's NYT Experiment

![Times_July_13_Rosenblatt_Perceptron](Times_July_13_Rosenblatt_Perceptron.png)