#### Machine Learning 2: predictive models, deep learning, neural network 2022Z

### This material may not be copied or published elsewhere (including Facebook and other social media) without the permission of the author!


# Theano

- Python library for defining mathematical functions (operating over vectors and matrices)
- computes the gradients of the defined functions
- low-level library for building and training neural networks
- the foundational layer for many deep learning packages (e.g. Keras, Lasagne)

# Installation of Theano

Hint: Check which Anaconda version suits the architecture of your system (32-bit or 64-bit).
<br><br>
<b>Using Anaconda Prompt </b><br>
1. Windows search -> Anaconda Prompt
2. conda update conda [enter]
3. conda update anaconda [enter]
4. conda install theano [enter]
5. At the "Proceed ([y]/n)?" prompt, type y [enter]

or <br><br>
<b>Using Jupyter</b><br>
1. !pip install theano 

### Optimizing loss functions = use of stochastic gradient descent (SGD)

Loss functions in deep learning are complicated, so deriving their gradients by hand is inconvenient. Theano lets you define functions as symbolic mathematical expressions and computes their gradients for you. <br>

A typical workflow for using Theano:

1. Define the loss function as a symbolic expression
2. Calculate the gradients of the loss function using Theano
3. Pass this gradient function as a parameter to an SGD optimization routine to optimize the loss function
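
The three steps above can be sketched without Theano in plain Python. In this minimal sketch the dataset, the one-parameter model, and the names `loss`, `grad_loss_single`, and `sgd` are all assumptions made for illustration; the gradient is written by hand as a stand-in for what Theano would derive from the symbolic expression.

```python
# Step 1: the loss of a one-parameter model y = w * x on a tiny made-up dataset
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, y) pairs generated with y = 2 * x

def loss(w):
    # Mean squared error over the whole dataset
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# Step 2 (stand-in): gradient of the per-sample squared error, derived by hand here;
# Theano would derive it from the symbolic loss expression
def grad_loss_single(w, x, y):
    return 2.0 * (w * x - y) * x

# Step 3: an SGD routine that receives the gradient function as a parameter.
# For simplicity the samples are visited in a fixed order rather than shuffled.
def sgd(grad_fn, w, samples, l_rate=0.05, n_epoch=50):
    for _ in range(n_epoch):
        for x, y in samples:
            w -= l_rate * grad_fn(w, x, y)
    return w

w_fit = sgd(grad_loss_single, 0.0, data)
print("fitted w:", w_fit, "loss:", loss(w_fit))
```

The optimizer only ever sees the gradient function, which is why generating that function automatically removes most of the manual work.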

The strengths of Theano:
- seamless integration with NumPy, so NumPy objects (vectors and matrices) can be used in the definition of loss functions
- generates optimized code for both CPU and GPU
- optimized automatic/symbolic differentiation
- numerical stability of the generated code via automatic/symbolic differentiation
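
To illustrate what numerical stability means in practice, the sketch below compares a direct transcription of log(1 + exp(x)) with an algebraically identical but stable form. It is a hand-written illustration in plain Python, not Theano's generated code; the function names are assumptions.

```python
import math

def softplus_naive(x):
    # Direct transcription of log(1 + exp(x)); exp overflows for large x
    return math.log(1.0 + math.exp(x))

def softplus_stable(x):
    # Algebraically identical rewrite: max(x, 0) + log(1 + exp(-|x|))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Both agree for moderate inputs
print(softplus_naive(1.0), softplus_stable(1.0))

# For a large input only the stable form returns a value
try:
    softplus_naive(1000.0)
    naive_failed = False
except OverflowError as err:
    naive_failed = True
    print("naive version failed:", err)
print(softplus_stable(1000.0))
```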

### Check Theano configuration

In [1]:
#install theano
#!pip install theano 

# theano 
import theano
print('theano: %s' % theano.__version__) 

theano: 1.0.4


### Example: Function with Scalars

1. Scalars must be defined before they can be used in a mathematical expression.
2. Every scalar is given a unique name.
3. Once defined, scalars can be combined with operations such as +, -, * and /.
4. The function construct in Theano relates inputs to outputs. Below, the function g takes a, b, c, d, and e as inputs and produces f as the output.
5. We can then evaluate g on given inputs and check that the result matches the plain Python expression.

In [4]:
import theano.tensor as T
from theano import function
a = T.dscalar('a')
b = T.dscalar('b')
c = T.dscalar('c')
d = T.dscalar('d')
e = T.dscalar('e')
f = ((a - b + c) * d )/e
g = function([a, b, c, d, e], f)
print("Expected: ((1 - 2 + 3) * 4)/5.0 = ", ((1 - 2 + 3) * 4)/5.0)
print("Via Theano: ((1 - 2 + 3) * 4)/5.0 = ", g(1, 2, 3, 4, 5))
# Expected: ((1 - 2 + 3) * 4)/5.0 = 1.6
# Via Theano: ((1 - 2 + 3) * 4)/5.0 = 1.6

Expected: ((1 - 2 + 3) * 4)/5.0 =  1.6
Via Theano: ((1 - 2 + 3) * 4)/5.0 =  1.6


### Example: Functions with Vectors

1. Vectors/matrices must be defined before they can be used in a mathematical expression.
2. Every vector/matrix is given a unique name.
3. The shapes of the vectors/matrices are not specified at definition time; only the number of dimensions and the data type are fixed.
4. Once defined, vectors/matrices can be combined with operations such as addition, subtraction, and multiplication (in the example below, * is element-wise multiplication, not the matrix product).
5. As before, a function can be defined from the resulting expression. Here the function f takes a, b, c, and d as inputs and produces e as the output.
6. NumPy arrays can be passed to the function to compute the output.

In [5]:
import numpy
import theano.tensor as T
from theano import function
a = T.dmatrix('a')
b = T.dmatrix('b')
c = T.dmatrix('c')
d = T.dmatrix('d')

e = (a + b - c) * d
f = function([a,b,c,d], e)
a_data = numpy.array([[1,1],[1,1]])
b_data = numpy.array([[2,2],[2,2]])
c_data = numpy.array([[5,5],[5,5]])
d_data = numpy.array([[3,3],[3,3]])
print("Expected:", (a_data + b_data - c_data) * d_data)
print("Via Theano:", f(a_data,b_data,c_data,d_data))
# Expected: [[-6 -6]
# [-6 -6]]
# Via Theano: [[-6. -6.]
# [-6. -6.]]

Expected: [[-6 -6]
 [-6 -6]]
Via Theano: [[-6. -6.]
 [-6. -6.]]
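
The `*` operator in the example above multiplies entry by entry. The NumPy sketch below, using the same data, contrasts this with the row-by-column matrix product (NumPy's `@`; in Theano the counterpart is `T.dot`).

```python
import numpy as np

a = np.array([[1, 1], [1, 1]])
b = np.array([[2, 2], [2, 2]])
c = np.array([[5, 5], [5, 5]])
d = np.array([[3, 3], [3, 3]])

elementwise = (a + b - c) * d   # entry-by-entry product, as in the example above
matrix_prod = (a + b - c) @ d   # row-by-column matrix product

print(elementwise)
print(matrix_prod)
```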


### Functions with Scalars and Vectors

1. Scalars and vectors/matrices can be used together in expressions.
2. The user must make sure that the vector/matrix shapes are compatible, both when defining the expressions and when passing inputs to the compiled function.

In [6]:
import numpy
import theano.tensor as T
from theano import function

a = T.dmatrix('a')
b = T.dmatrix('b')
c = T.dmatrix('c')
d = T.dmatrix('d')
p = T.dscalar('p')
q = T.dscalar('q')
r = T.dscalar('r')
s = T.dscalar('s')
u = T.dscalar('u')

e = (((a * p) + (b - q) - (c + r )) * d/s) * u

f = function([a,b,c,d,p,q,r,s,u], e)
a_data = numpy.array([[1,1],[1,1]])
b_data = numpy.array([[2,2],[2,2]])
c_data = numpy.array([[5,5],[5,5]])
d_data = numpy.array([[3,3],[3,3]])
print("Expected:", (((a_data * 1.0) + (b_data - 2.0) - (c_data + 3.0 )) * d_data/4.0) * 5.0)
print("Via Theano:", f(a_data,b_data,c_data,d_data,1,2,3,4,5))
# Expected: [[-26.25 -26.25]
# [-26.25 -26.25]]
# Via Theano: [[-26.25 -26.25]
# [-26.25 -26.25]]

Expected: [[-26.25 -26.25]
 [-26.25 -26.25]]
Via Theano: [[-26.25 -26.25]
 [-26.25 -26.25]]
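
The shape rules can be tried out directly in NumPy. This short sketch (the arrays `m` and `wrong` are made up for illustration) shows a scalar being applied to every entry of a matrix, and an element-wise operation failing when the shapes are incompatible.

```python
import numpy as np

m = np.array([[1.0, 1.0], [1.0, 1.0]])   # shape (2, 2)

# A scalar is broadcast to every entry of the matrix
scaled = m * 5.0 - 2.0
print(scaled)

# Two matrices must have compatible shapes for element-wise operations
wrong = np.ones((3, 2))                  # shape (3, 2)
try:
    m + wrong
    shapes_ok = True
except ValueError as err:
    shapes_ok = False
    print("shape mismatch:", err)
```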


### Example: Activation Functions

1. The nnet package in Theano provides a number of common activation functions (tanh is available directly in theano.tensor).
2. Activation functions used below:
    - sigmoid
    - tanh
    - ultra_fast_sigmoid (a faster, approximate sigmoid)
    - softplus
    - relu
    - softmax

In [10]:
import theano.tensor as T
from theano import function

# sigmoid
a = T.dmatrix('a')
f_a = T.nnet.sigmoid(a)
f_sigmoid = function([a],[f_a])
print("sigmoid:", f_sigmoid([[-1,0,1]]))

# tanh
b = T.dmatrix('b')
f_b = T.tanh(b)
f_tanh = function([b],[f_b])
print("tanh:", f_tanh([[-1,0,1]]))

# fast sigmoid
c = T.dmatrix('c')
f_c = T.nnet.ultra_fast_sigmoid(c)
f_fast_sigmoid = function([c],[f_c])
print("fast sigmoid:", f_fast_sigmoid([[-1,0,1]]))

# softplus
d = T.dmatrix('d')
f_d = T.nnet.softplus(d)
f_softplus = function([d],[f_d])
print("soft plus:",f_softplus([[-1,0,1]]))

# relu
e = T.dmatrix('e')
f_e = T.nnet.relu(e)
f_relu = function([e],[f_e])
print("relu:",f_relu([[-1,0,1]]))

# softmax
f = T.dmatrix('f')
f_f = T.nnet.softmax(f)
f_softmax = function([f],[f_f])
print("soft max:",f_softmax([[-1,0,1]]))
                     

sigmoid: [array([[0.26894142, 0.5       , 0.73105858]])]
tanh: [array([[-0.76159416,  0.        ,  0.76159416]])]
fast sigmoid: [array([[0.25, 0.5 , 0.75]])]
soft plus: [array([[0.31326169, 0.69314718, 1.31326169]])]
relu: [array([[0., 0., 1.]])]
soft max: [array([[0.09003057, 0.24472847, 0.66524096]])]
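
For reference, the exact activation functions can be written out in plain Python and compared with the values printed above. This is a minimal sketch of the textbook formulas, not of Theano's implementation; ultra_fast_sigmoid is left out because it is an approximation whose formula is not given here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    return math.log(1.0 + math.exp(x))

def relu(x):
    return max(0.0, x)

def softmax(row):
    # Subtract the maximum before exponentiating to avoid overflow
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

row = [-1.0, 0.0, 1.0]
print("sigmoid:", [sigmoid(v) for v in row])
print("tanh:", [math.tanh(v) for v in row])
print("softplus:", [softplus(v) for v in row])
print("relu:", [relu(v) for v in row])
print("softmax:", softmax(row))
```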


### Example: Shared Variables

1. All models (deep learning or otherwise) involve functions with internal state, typically the weights that need to be learned or fitted.
2. A shared variable is defined using the shared construct in Theano.
3. A shared variable can be initialized with NumPy constructs.
4. Once the shared variable is defined and initialized, it can be used in the definition of expressions and functions, just like scalars and vectors/matrices.
5. A user can get the value of the shared variable using the <b>get_value method</b>.
6. A user can set the value of the shared variable using the <b>set_value method</b>.
7. A function defined using the shared variable computes its output based on the current value of the shared variable.

In [11]:
import theano.tensor as T
from theano import function
from theano import shared
import numpy
x = T.dmatrix('x')
y = shared(numpy.array([[4, 5, 6]]))
z = x + y
f = function(inputs = [x], outputs = [z])
print("Original Shared Value:", y.get_value())
print("Original Function Evaluation:", f([[1, 2, 3]]))
y.set_value(numpy.array([[5, 6, 7]]))
print("Updated Shared Value:", y.get_value())
print("Updated Function Evaluation:", f([[1, 2, 3]]))
# Original Shared Value: [[4 5 6]]
# Original Function Evaluation: [array([[ 5., 7., 9.]])]
# Updated Shared Value: [[5 6 7]]
# Updated Function Evaluation: [array([[ 6., 8., 10.]])]

Original Shared Value: [[4 5 6]]
Original Function Evaluation: [array([[5., 7., 9.]])]
Updated Shared Value: [[5 6 7]]
Updated Function Evaluation: [array([[ 6.,  8., 10.]])]
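
The behaviour in point 7 can be mimicked in plain Python. The class `SharedValue` below is a hypothetical stand-in written for this illustration (it is not how Theano implements shared variables); it only shows that a function reading the container at call time picks up later changes.

```python
class SharedValue:
    """Hypothetical stand-in for a shared variable: a mutable container."""

    def __init__(self, value):
        self._value = list(value)

    def get_value(self):
        return list(self._value)

    def set_value(self, value):
        self._value = list(value)

y = SharedValue([4, 5, 6])

def f(x):
    # Reads y at call time, so later set_value calls change the result
    return [xi + yi for xi, yi in zip(x, y.get_value())]

before = f([1, 2, 3])
y.set_value([5, 6, 7])
after = f([1, 2, 3])
print(before, after)
```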


### Example: Gradients

1. A function needs to be defined using expressions before the gradient of the
function can be generated.
2. The grad construct in Theano allows the user to generate the gradient of a
function (as an expression).

In [12]:
import theano.tensor as T
from theano import function
from theano import shared
import numpy
x = T.dmatrix('x')
y = shared(numpy.array([[4, 5, 6]]))
z = T.sum(((x * x) + y) * x)
f = function(inputs = [x], outputs = [z])
g = T.grad(z,[x])
g_f = function([x], g)

print("Original:", f([[1, 2, 3]]))
print("Original Gradient:", g_f([[1, 2, 3]]))
y.set_value(numpy.array([[1, 1, 1]]))
print("Updated:", f([[1, 2, 3]]))
print("Updated Gradient", g_f([[1, 2, 3]]))
# Original: [array(68.0)]
# Original Gradient: [array([[ 7., 17., 33.]])]
# Updated: [array(42.0)]
# Updated Gradient [array([[ 4., 13., 28.]])]

Original: [array(68.)]
Original Gradient: [array([[ 7., 17., 33.]])]
Updated: [array(42.)]
Updated Gradient [array([[ 4., 13., 28.]])]
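
The gradients printed above can be checked by hand: each term of z is x_i^3 + y_i * x_i, so the derivative with respect to x_i is 3 * x_i^2 + y_i. The sketch below evaluates this formula in plain Python and compares it with a central finite-difference estimate; the helper names are assumptions made for illustration.

```python
def z(x, y):
    # Same expression as above: sum(((x * x) + y) * x), written for plain lists
    return sum((xi * xi + yi) * xi for xi, yi in zip(x, y))

def analytic_grad(x, y):
    # d/dx_i of (x_i**3 + y_i * x_i) = 3 * x_i**2 + y_i
    return [3.0 * xi * xi + yi for xi, yi in zip(x, y)]

def numerical_grad(x, y, h=1e-5):
    # Central difference in each coordinate
    grads = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += h
        minus[i] -= h
        grads.append((z(plus, y) - z(minus, y)) / (2.0 * h))
    return grads

x = [1.0, 2.0, 3.0]
for y in ([4.0, 5.0, 6.0], [1.0, 1.0, 1.0]):
    print(z(x, y), analytic_grad(x, y), numerical_grad(x, y))
```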


### Neural network model with 2 layers

In [13]:
import numpy
import theano
import theano.tensor as T
import sklearn.metrics

def l2(x):
    return T.sum(x**2)
examples = 1000
features = 100
hidden = 10

D = (numpy.random.randn(examples, features), numpy.random.randint(size=examples,low=0, high=2))
training_steps = 1000

x = T.dmatrix("x")
y = T.dvector("y")
w1 = theano.shared(numpy.random.randn(features, hidden), name="w1")
b1 = theano.shared(numpy.zeros(hidden), name="b1")
w2 = theano.shared(numpy.random.randn(hidden), name="w2")
b2 = theano.shared(0., name="b2")
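p1 = T.tanh(T.dot(x, w1) + b1)
# Sigmoid output so that p2 lies in (0, 1), as binary cross-entropy expects
# (tanh would give values in (-1, 1))
p2 = T.nnet.sigmoid(T.dot(p1, w2) + b2)
prediction = p2 > 0.5

error = T.nnet.binary_crossentropy(p2,y)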

loss = error.mean() + 0.01 * (l2(w1) + l2(w2))

gw1, gb1, gw2, gb2 = T.grad(loss, [w1, b1, w2, b2])

train = theano.function(inputs=[x,y],outputs=[p2, error], updates=((w1, w1 - 0.1 * gw1),
(b1, b1 - 0.1 * gb1), (w2, w2 - 0.1 * gw2), (b2, b2 - 0.1 * gb2)))

predict = theano.function(inputs=[x], outputs=[prediction])

print("Accuracy before Training:", sklearn.metrics.accuracy_score(D[1], numpy.array(predict(D[0])).ravel()))
for i in range(training_steps):
    # Each call performs one gradient-descent update of w1, b1, w2 and b2
    p2_values, error_values = train(D[0], D[1])
print("Accuracy after Training:", sklearn.metrics.accuracy_score(D[1],numpy.array(predict(D[0])).ravel()))
# Accuracy before Training: 0.51
# Accuracy after Training: 0.716


Accuracy before Training: 0.521
Accuracy after Training: 0.733
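
The error term in the model above is binary cross-entropy, which is defined for predicted probabilities strictly between 0 and 1. The sketch below writes out the per-example formula in plain Python (a hand-written version for illustration, not Theano's implementation) and shows what happens for a value outside that range.

```python
import math

def binary_crossentropy(p, t):
    # p: predicted probability of class 1, must lie strictly between 0 and 1
    # t: true label, 0 or 1
    return -(t * math.log(p) + (1 - t) * math.log(1.0 - p))

print(binary_crossentropy(0.9, 1))   # confident and correct: small loss
print(binary_crossentropy(0.1, 1))   # confident and wrong: large loss

try:
    binary_crossentropy(-0.2, 1)     # a value outside (0, 1), e.g. a tanh output
    defined = True
except ValueError as err:
    defined = False
    print("undefined:", err)
```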


# Gradient descent

- <b>Numerical gradient</b>: slow :(, approximate :(, easy to write :)
- <b>Analytic gradient </b>: fast :), exact :), error-prone :(

$$\frac{df(x)}{dx}=\lim_{h \to 0}\frac{f(x+h)-f(x)}{h}$$
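
The trade-off between the two kinds of gradient can be seen on a one-dimensional example. The numerical gradient replaces the limit by the difference quotient (f(x + h) - f(x)) / h for a small but finite h, so it is only approximate. In this sketch the function f(x) = x^3 and the step sizes are chosen purely for illustration.

```python
def f(x):
    return x ** 3

def analytic_derivative(x):
    # Exact derivative of x**3, derived by hand
    return 3.0 * x ** 2

def numerical_derivative(func, x, h):
    # Forward difference: (f(x + h) - f(x)) / h
    return (func(x + h) - func(x)) / h

x0 = 2.0
for h in (1e-1, 1e-3, 1e-5):
    approx = numerical_derivative(f, x0, h)
    print(h, approx, abs(approx - analytic_derivative(x0)))
```

The numerical version also needs one extra function evaluation per input dimension, which is why it is slow for models with many parameters.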


## Simple example:
<br>
<img src="backprop_0.png">


## Local gradient 
<img src="backprop_1.png">


## More complex example:

<img src="backprop_2a.png">

### Step_1
<img src="backprop_2.png">

### Step_2
<img src="backprop_3.png">


### Step_3
<img src="backprop_4.png">

### Step_4
<img src="backprop_5.png">

### Step_5
<img src="backprop_6.png">

### Step_6
<img src="backprop_7.png">

## Alternative approach:
<img src="backprop_8.png">
<br>
<img src="backprop_9.png">
<br>
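
Backpropagation on a small computational graph can also be traced in code. The function f(x, y, z) = (x + y) * z and the input values below are chosen for illustration and do not necessarily match the figures; the point is that each node multiplies its local gradient by the gradient arriving from above.

```python
# Forward pass through f(x, y, z) = (x + y) * z, one node at a time
x, y, z = -2.0, 5.0, -4.0
q = x + y            # intermediate node
f = q * z            # output node

# Backward pass: start from df/df = 1 and multiply by local gradients
df_df = 1.0
df_dq = z * df_df    # local gradient of q * z with respect to q is z
df_dz = q * df_df    # local gradient of q * z with respect to z is q
df_dx = 1.0 * df_dq  # local gradient of x + y with respect to x is 1
df_dy = 1.0 * df_dq  # local gradient of x + y with respect to y is 1

print(f, df_dx, df_dy, df_dz)
```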

## Neural network from scratch

In [14]:
from math import exp
from random import seed
from random import random
 
# Initialize a network
def initialize_network(n_inputs, n_hidden, n_outputs):
    network = list()
    hidden_layer = [{'weights':[random() for i in range(n_inputs + 1)]} for i in range(n_hidden)]
    network.append(hidden_layer)
    output_layer = [{'weights':[random() for i in range(n_hidden + 1)]} for i in range(n_outputs)]
    network.append(output_layer)
    return network
 
# Calculate neuron activation for an input
def activate(weights, inputs):
    activation = weights[-1]
    for i in range(len(weights)-1):
        activation += weights[i] * inputs[i]
    return activation
 
# Transfer neuron activation
def transfer(activation):
    return 1.0 / (1.0 + exp(-activation))

# Forward propagate input to a network output
def forward_propagate(network, row):
    inputs = row
    for layer in network:
        new_inputs = []
        for neuron in layer:
            activation = activate(neuron['weights'], inputs)
            neuron['output'] = transfer(activation)
            new_inputs.append(neuron['output'])
        inputs = new_inputs
    return inputs
 
# Calculate the derivative of a neuron's output
def transfer_derivative(output):
    return output * (1.0 - output)
 
# Backpropagate error and store in neurons
def backward_propagate_error(network, expected):
    for i in reversed(range(len(network))):
        layer = network[i]
        errors = list()
        if i != len(network)-1:
            for j in range(len(layer)):
                error = 0.0
                for neuron in network[i + 1]:
                    error += (neuron['weights'][j] * neuron['delta'])
                errors.append(error)
        else:
            for j in range(len(layer)):
                neuron = layer[j]
                errors.append(expected[j] - neuron['output'])
        for j in range(len(layer)):
            neuron = layer[j]
            neuron['delta'] = errors[j] * transfer_derivative(neuron['output'])
            
# Update network weights with error
def update_weights(network, row, l_rate):
    for i in range(len(network)):
        inputs = row[:-1]
        if i != 0:
            inputs = [neuron['output'] for neuron in network[i - 1]]
        for neuron in network[i]:
            for j in range(len(inputs)):
                neuron['weights'][j] += l_rate * neuron['delta'] * inputs[j]
            neuron['weights'][-1] += l_rate * neuron['delta']
            
# Train a network for a fixed number of epochs
def train_network(network, train, l_rate, n_epoch, n_outputs):
    for epoch in range(n_epoch):
        sum_error = 0
        for row in train:
            outputs = forward_propagate(network, row)
            expected = [0 for i in range(n_outputs)]
            expected[row[-1]] = 1
            sum_error += sum([(expected[i]-outputs[i])**2 for i in range(len(expected))])
            backward_propagate_error(network, expected)
            update_weights(network, row, l_rate)
        print('>epoch=%d, lrate=%.3f, error=%.3f' % (epoch, l_rate, sum_error))

# Test training with the backprop algorithm
seed(1)
dataset = [[2.7810836,2.550537003,0],
    [1.465489372,2.362125076,0],
    [3.396561688,4.400293529,0],
    [1.38807019,1.850220317,0],
    [3.06407232,3.005305973,0],
    [7.627531214,2.759262235,1],
    [5.332441248,2.088626775,1],
    [6.922596716,1.77106367,1],
    [8.675418651,-0.242068655,1],
    [7.673756466,3.508563011,1]]
n_inputs = len(dataset[0]) - 1
n_outputs = len(set([row[-1] for row in dataset]))
network = initialize_network(n_inputs, 2, n_outputs)
train_network(network, dataset, 0.5, 20, n_outputs)
for layer in network:
    print(layer)

>epoch=0, lrate=0.500, error=6.350
>epoch=1, lrate=0.500, error=5.531
>epoch=2, lrate=0.500, error=5.221
>epoch=3, lrate=0.500, error=4.951
>epoch=4, lrate=0.500, error=4.519
>epoch=5, lrate=0.500, error=4.173
>epoch=6, lrate=0.500, error=3.835
>epoch=7, lrate=0.500, error=3.506
>epoch=8, lrate=0.500, error=3.192
>epoch=9, lrate=0.500, error=2.898
>epoch=10, lrate=0.500, error=2.626
>epoch=11, lrate=0.500, error=2.377
>epoch=12, lrate=0.500, error=2.153
>epoch=13, lrate=0.500, error=1.953
>epoch=14, lrate=0.500, error=1.774
>epoch=15, lrate=0.500, error=1.614
>epoch=16, lrate=0.500, error=1.472
>epoch=17, lrate=0.500, error=1.346
>epoch=18, lrate=0.500, error=1.233
>epoch=19, lrate=0.500, error=1.132
[{'weights': [-1.4688375095432327, 1.850887325439514, 1.0858178629550297], 'output': 0.029980305604426185, 'delta': -0.0059546604162323625}, {'weights': [0.37711098142462157, -0.0625909894552989, 0.2765123702642716], 'output': 0.9456229000211323, 'delta': 0.0026279652850863837}]
[{'weights
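
Once a network of this form has been trained, a class prediction is the index of the output neuron with the largest value. The sketch below is self-contained: it repeats the forward-pass helpers in compact form, adds a hypothetical `predict` helper, and uses hand-picked weights for illustration only (they are not the trained values printed above).

```python
from math import exp

def activate(weights, inputs):
    # The last weight is the bias
    activation = weights[-1]
    for i in range(len(weights) - 1):
        activation += weights[i] * inputs[i]
    return activation

def transfer(activation):
    return 1.0 / (1.0 + exp(-activation))

def forward_propagate(network, row):
    inputs = row
    for layer in network:
        inputs = [transfer(activate(neuron['weights'], inputs)) for neuron in layer]
    return inputs

def predict(network, row):
    # Hypothetical helper: index of the output neuron with the largest value
    outputs = forward_propagate(network, row)
    return outputs.index(max(outputs))

# Hand-picked weights for illustration only (not the trained values above)
network = [
    [{'weights': [1.0, 0.0, -4.5]}],                       # one hidden neuron
    [{'weights': [-5.0, 2.5]}, {'weights': [5.0, -2.5]}],  # two output neurons
]
for row in [[2.7810836, 2.550537003], [7.627531214, 2.759262235]]:
    print(row, '->', predict(network, row))
```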

# Regularization

* any modification to a learning algorithm intended to reduce its generalization error but not its training error
* generalization error may be reduced even at the expense of increased training error:
    - e.g., limiting model capacity is a regularization method
    
## Generalization error
- the error on inputs not previously seen (also called test error)
    
## Goals of regularization
1. Encode prior knowledge
2. Express preference for simpler model
3. Make an underdetermined problem determined

## Methods
1. Limiting capacity: number of hidden units
2. Norm penalties: L2 and L1 regularization
3. Early stopping
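
Two of these methods are easy to sketch in plain Python: norm penalties, which are added to the loss, and early stopping, which picks the epoch with the lowest validation error. The function names, the coefficient `lam`, the `patience` parameter, and the validation errors below are all assumptions made for illustration.

```python
def l2_penalty(weights, lam):
    # lam * sum of squared weights
    return lam * sum(w * w for w in weights)

def l1_penalty(weights, lam):
    # lam * sum of absolute weights
    return lam * sum(abs(w) for w in weights)

def early_stopping_epoch(val_errors, patience):
    # Return the epoch with the lowest validation error seen before the error
    # has failed to improve for `patience` consecutive epochs
    best_epoch, best_error, waited = 0, float('inf'), 0
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_epoch, best_error, waited = epoch, err, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

weights = [0.5, -1.0, 2.0]
print(l2_penalty(weights, 0.01), l1_penalty(weights, 0.01))

# Made-up validation errors: they fall, then rise as the model starts to overfit
val_errors = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70, 0.55]
print(early_stopping_epoch(val_errors, patience=3))
```

With a patience of 3 the scan stops before reaching the last value, so a later improvement is never seen; a larger patience trades extra training time for a lower risk of stopping too early.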

<br>
<img src="regularization_0.png">
