
# Backpropagation Practice

Using TensorFlow Keras, implement a multilayer perceptron with 3 inputs, one hidden layer of 4 nodes, and 1 output node on the following dataset:

| x1 | x2 | x3 | y |
|----|----|----|---|
| 0  | 0  | 1  | 0 |
| 0  | 1  | 1  | 1 |
| 1  | 0  | 1  | 1 |
| 0  | 1  | 0  | 1 |
| 1  | 0  | 0  | 1 |
| 1  | 1  | 1  | 0 |
| 0  | 0  | 0  | 0 |

If you look at the data you'll notice that the first two columns behave like an XOR gate while the third input column (x3) is mostly just noise. Remember that XOR is the function the single-layer perceptron was criticized for being unable to learn.
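
The XOR claim can be checked directly against the table. The snippet below is a small sketch that retypes the seven rows by hand (the array name `data` is just for illustration) and compares `x1 XOR x2` with `y` row by row.

```python
import numpy as np

# The seven rows of the table above; columns are x1, x2, x3, y.
data = np.array([
    [0, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
])
x1, x2, x3, y = data.T

# XOR of the first two inputs, compared with the target for every row.
xor_matches_target = np.array_equal(np.bitwise_xor(x1, x2), y)
print(xor_matches_target)
```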

This is your "Hello World!" of TensorFlow.

### Example TensorFlow Starter Code

```python 
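import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# The seven rows of the table above: three inputs per row, one binary target.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [0, 1, 0],
              [1, 0, 0], [1, 1, 1], [0, 0, 0]])
y = np.array([0, 1, 1, 1, 1, 0, 0])

# 3 inputs -> 4 sigmoid hidden nodes -> 1 sigmoid output
model = Sequential([
    Dense(4, activation='sigmoid', input_dim=3),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['acc'])

results = model.fit(X, y, epochs=100)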

```

### Additional Written Tasks:
1. Investigate the various [loss functions](https://www.tensorflow.org/api_docs/python/tf/keras/losses). Which is best suited for the task at hand (predicting 1 / 0) and why? 
2. What is the difference between a loss function and a metric? Why might we need both in Keras? 
3. Investigate the various [optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers). Stochastic Gradient Descent (`sgd`) is not the learning algorithm du jour anymore. Why is that? What do newer optimizers such as `adam` have to offer? 

# Loss Functions

Backpropagation is the essence of neural net training. It is the practice of fine-tuning the weights of a neural net based on the error (i.e. the loss) obtained in the previous iteration: the error is propagated backwards through the network to work out how much each weight contributed to it. Proper tuning of the weights lowers the error on the training data; whether the model also generalizes has to be checked on data it was not trained on.

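Categorical cross-entropy is the loss function to use when there are two or more label classes, with the labels provided in a one-hot representation. For the task at hand, a single output predicting 1 / 0, binary cross-entropy (`binary_crossentropy`) is the better fit.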

Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. So predicting a probability of 0.012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.
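
To make the 0.012 example concrete, here is a minimal NumPy sketch of binary cross-entropy. The helper name `binary_cross_entropy` and the clipping constant `eps` are assumptions made for this illustration; they are not the Keras implementation.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean log loss for 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs for fully confident predictions.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

# True label is 1 in both cases.
bad = binary_cross_entropy([1], [0.012])   # -ln(0.012), roughly 4.42
good = binary_cross_entropy([1], [0.95])   # -ln(0.95), roughly 0.05
print(bad, good)
```

The loss grows without bound as the predicted probability for the true class approaches 0, which is why confident wrong answers are punished so heavily.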

# Difference between a loss function and a metric

The loss function is the parameter passed to Keras `model.compile` that is actually optimized while training the model; the optimizer works to minimize it.

Unlike the loss function, the metrics are a separate list passed to `model.compile` and are used only for judging the performance of the model. They do not drive the weight updates.

For example, you may want to minimize the MSE loss for a regression model while also checking the AUC for the model. In this case MSE is the loss function and AUC is the metric. The metric is the performance measure you see reported while the model is evaluated on the validation set after each epoch of training. Metrics also matter for some Keras callbacks, such as `EarlyStopping`, when you want to stop training once the metric has not improved for a certain number of epochs. (Source: Stack Overflow.)
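
A small NumPy sketch of why both are useful: two hypothetical sets of predictions for the XOR-ish targets (the arrays `confident` and `hesitant` are made up for this illustration) reach the same accuracy, yet their log loss differs a lot. The loss is smooth and tells the optimizer which direction to move; the metric is what a person would usually report.

```python
import numpy as np

y_true = np.array([0, 1, 1, 1, 1, 0, 0])

# Two made-up prediction vectors that fall on the correct side of 0.5 everywhere.
confident = np.array([0.05, 0.95, 0.90, 0.95, 0.90, 0.10, 0.05])
hesitant = np.array([0.45, 0.55, 0.60, 0.55, 0.60, 0.40, 0.45])

def log_loss(y, p):
    """Binary cross-entropy: the quantity an optimizer would minimize."""
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def accuracy(y, p):
    """Fraction of correct 0/1 decisions at a 0.5 threshold: a metric."""
    return float(np.mean((p >= 0.5).astype(int) == y))

acc_confident, acc_hesitant = accuracy(y_true, confident), accuracy(y_true, hesitant)
loss_confident, loss_hesitant = log_loss(y_true, confident), log_loss(y_true, hesitant)
print(acc_confident, acc_hesitant, loss_confident, loss_hesitant)
```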

# The various optimizers: Stochastic Gradient Descent

Optimizers that build on vanilla stochastic gradient descent change the weight update in one of three ways:

1. Adapt the “gradient component” (∂L/∂w).
   Instead of using only the single current gradient to update the weight, as vanilla stochastic gradient descent does, take an aggregate of multiple gradients. Specifically, these optimisers use an exponential moving average of gradients.
2. Adapt the “learning rate component” (α).
   Instead of keeping a constant learning rate, adapt the learning rate according to the magnitude of the gradient(s).
3. Both (1) and (2).
   Adapt both the gradient component and the learning rate component.


https://towardsdatascience.com/10-gradient-descent-optimisation-algorithms-86989510b5e9
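
The first two ideas in the list above can be sketched as single update steps. This is an illustrative NumPy sketch, not the Keras implementation: the function names, the hyperparameter defaults, and the toy loss `L(w) = w**2` are all assumptions, and library optimizers differ in details (Keras momentum, for instance, is not written in this moving-average form).

```python
import numpy as np

def sgd_step(w, grad, lr=0.1):
    """Vanilla SGD: step against the current gradient only."""
    return w - lr * grad

def momentum_step(w, grad, m, lr=0.1, beta=0.9):
    """(1) Adapt the gradient component: exponential moving average of gradients."""
    m = beta * m + (1 - beta) * grad
    return w - lr * m, m

def rmsprop_step(w, grad, v, lr=0.1, beta=0.9, eps=1e-8):
    """(2) Adapt the learning rate: divide by the running gradient magnitude."""
    v = beta * v + (1 - beta) * grad ** 2
    return w - lr * grad / (np.sqrt(v) + eps), v

# One step on the toy loss L(w) = w**2 (gradient 2*w), starting from w = 1.
w0 = 1.0
grad = 2 * w0
w_sgd = sgd_step(w0, grad)
w_mom, m = momentum_step(w0, grad, m=0.0)
w_rms, v = rmsprop_step(w0, grad, v=0.0)
print(w_sgd, w_mom, w_rms)
```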

# What do newer optimizers such as adam have to offer?

Adam can be viewed as a combination of RMSprop and Stochastic Gradient Descent with momentum. Like RMSprop, it uses the squared gradients to scale the learning rate, and like SGD with momentum, it uses a moving average of the gradient instead of the gradient itself. Let’s take a closer look at how it works.

Adam is an adaptive learning rate method, which means it computes individual learning rates for different parameters. Its name is derived from adaptive moment estimation: Adam uses estimates of the first and second moments of the gradient to adapt the learning rate for each weight of the neural network. What is a moment? The n-th moment of a random variable is defined as the expected value of that variable raised to the power n, so the first moment is the mean and the second moment is the mean of the squared values.

https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c
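
As a rough illustration of the description above, the sketch below writes one Adam update in NumPy. The function name `adam_step` and the toy loss `L(w) = w**2` are assumptions for this example; the default hyperparameters are the ones suggested in the Adam paper, and Keras' own defaults may differ slightly.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a weight w at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Three steps on the toy loss L(w) = w**2 (gradient 2*w), starting from w = 1.
w, m, v = 1.0, 0.0, 0.0
history = []
for t in range(1, 4):
    grad = 2 * w
    w, m, v = adam_step(w, grad, m, v, t)
    history.append(w)
print(history)
```

On the first step the bias-corrected moments reduce to the gradient and its square, so the weight moves by almost exactly `lr` regardless of how large the gradient is. That per-weight normalisation is the "adaptive learning rate" part.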

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
import pandas as pd
import numpy as np

In [None]:
#DataSet
df = pd.DataFrame({
    'x1': [0, 0, 1, 0, 1, 1, 0],
    'x2': [0, 1, 0, 1, 0, 1, 0],
    'x3': [1, 1, 1, 0, 0, 1, 0],
    'y': [0, 1, 1, 1, 1, 0, 0,]
})
df

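|   | x1 | x2 | x3 | y |
|---|----|----|----|---|
| 0 | 0  | 0  | 1  | 0 |
| 1 | 0  | 1  | 1  | 1 |
| 2 | 1  | 0  | 1  | 1 |
| 3 | 0  | 1  | 0  | 1 |
| 4 | 1  | 0  | 0  | 1 |
| 5 | 1  | 1  | 1  | 0 |
| 6 | 0  | 0  | 0  | 0 |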


In [None]:
A,B = check_X_y(df[['x1', 'x2', 'x3']], df['y'])
A.shape, B.reshape(-1, 1).shape

((7, 3), (7, 1))

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
from sklearn.metrics import accuracy_score

In [None]:
#Multilayer Perceptron
class Perceptron(BaseEstimator, ClassifierMixin):
    def __init__(self, epochs=1000):
        self.epochs = epochs
        
        #Neural net
        self.inputs = 3
        self.hidden_nodes = 4
        self.outputs = 1
        
        # input -> hidden weights
        self.weights_1 = np.random.randn(self.inputs, self.hidden_nodes) -1
        # hidden -> output weights
        self.weights_2 = np.random.randn(self.hidden_nodes, self.outputs) -1 
    
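    # Sigmoid activation: squashes any real input into (0, 1)
    def sigmoid(self, s):
        return 1 / (1 + np.exp(-s))
    
    # Derivative of the sigmoid: s'(x) = s(x) * (1 - s(x)).
    # Dividing instead (sx / (1 - sx)) grows without bound as sx approaches 1,
    # which is consistent with the NaN outputs recorded further down.
    def sigmoid_prime(self, s):
        sx = self.sigmoid(s)
        return sx * (1 - sx)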
    
    def feed_forward(self, X):
        """Calculate the NN inference using feed forward"""
        
        self.hidden_sum = np.dot(X, self.weights_1) # Weighted sum
        self.activated_hidden = self.sigmoid(self.hidden_sum) # Activation
        self.output_sum = np.dot(self.activated_hidden, self.weights_2) # Weighted sum 2
        self.activated_output = self.sigmoid(self.output_sum) #Output
        
        return self.activated_output
    
    def back_prop(self, X, y, o):
        """ back prop through the network"""
        self.o_error = y - o
                
        self.o_delta = self.o_error * self.sigmoid_prime(self.output_sum) # apply derivative of sigmoid to error
        self.z2_error = self.o_delta.dot(self.weights_2.T) # z2 error: how much were our output layer weights off
        self.z2_delta = self.z2_error * self.sigmoid_prime(self.hidden_sum)  #z2 delta: how much were the weights off
        
        # calculate partial gradient
        self.weights_1 += X.T.dot(self.z2_delta) # adj 1 set (input => hidden) weights
        self.weights_2 += self.activated_hidden.T.dot(self.o_delta) # adj 2 set (hidden => output) weights

    def train(self, X, y):
        o = self.feed_forward(X)
        self.back_prop(X, y, o)

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        y = y.reshape(-1, 1)
        self.X_ = X
        self.y_ = y
        
        for i in range(self.epochs):
            self.train(X, y)
        
        return self
    
    #Calculate the NN inference using feed forward
    def predict_proba(self, X):
        check_is_fitted(self)
        X = check_array(X)

        hidden_sum = np.dot(X, self.weights_1) # Weighted sum
        activated_hidden = self.sigmoid(hidden_sum) # Activation
        output_sum = np.dot(activated_hidden, self.weights_2) # Weighted sum 2
        activated_output = self.sigmoid(output_sum)  #Final Output
        
        return activated_output.reshape(1, -1)

    def predict(self, X):
        check_is_fitted(self)
        pred_proba = self.predict_proba(X)
        return np.round(pred_proba).astype(int)
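
The backward pass depends on the derivative of the sigmoid, σ'(s) = σ(s)(1 − σ(s)). A quick finite-difference check is one way to confirm a hand-written derivative before trusting it inside a training loop. The sketch below uses standalone functions rather than the class methods, and the step size `h` is an arbitrary choice.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_prime(s):
    """Analytic derivative of the sigmoid: s(x) * (1 - s(x))."""
    sx = sigmoid(s)
    return sx * (1.0 - sx)

# Compare the analytic derivative with a central finite difference.
s = np.linspace(-5, 5, 11)
h = 1e-5
numeric = (sigmoid(s + h) - sigmoid(s - h)) / (2 * h)
max_error = float(np.max(np.abs(numeric - sigmoid_prime(s))))
print(max_error)
```

If the two disagree by more than rounding error, the analytic formula is the first place to look.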

In [None]:
perceptron = Perceptron(epochs=1000).fit(df[['x1', 'x2', 'x3']], df['y'])
y_pred = perceptron.predict(df[['x1', 'x2', 'x3']])
score = accuracy_score(df['y'], y_pred[0])
print(f'Accuracy score: {score * 100:.2f}%')

Accuracy score: 0.00%




In [None]:
#Predict by perceptron
perceptron.predict_proba(df[['x1', 'x2', 'x3']])[0]

array([nan, nan, nan, nan, nan, nan, nan])

In [None]:
y_pred[0]

array([-9223372036854775808, -9223372036854775808, -9223372036854775808,
       -9223372036854775808, -9223372036854775808, -9223372036854775808,
       -9223372036854775808])

### Build a TensorFlow Keras Perceptron

Try to match the architecture we used on Monday: input nodes feeding a single output node. Apply this architecture to the XOR-ish dataset above. 

After fitting your model answer these questions: 

Are you able to achieve the same results as the bigger architecture from the first part of the assignment? Why does this disparity exist? What properties of the XOR dataset cause it? 

Now extrapolate this behavior on a much larger dataset in terms of features. What kind of architecture decisions could we make to avoid the problems the XOR dataset presents at scale? 

*Note:* The bias term is baked in by default in the Dense layer.
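
Keras is not required to see the effect. A single sigmoid output unit trained with binary cross-entropy is logistic regression, which is the same model as one `Dense(1, activation='sigmoid')` layer. The sketch below is a NumPy stand-in for that layer, bias included; the learning rate, step count, and seed are arbitrary choices for the illustration. Because the XOR-ish targets are not linearly separable, no choice of weights and bias can classify all seven rows correctly.

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [0, 1, 0],
              [1, 0, 0], [1, 1, 1], [0, 0, 0]], dtype=float)
y = np.array([0, 1, 1, 1, 1, 0, 0], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w = rng.normal(size=3)  # one weight per input
b = 0.0                 # bias term
lr = 0.5

# Full-batch gradient descent on binary cross-entropy.
for _ in range(5000):
    p = sigmoid(X @ w + b)
    grad = p - y                      # gradient of the loss w.r.t. the pre-activation
    w -= lr * (X.T @ grad) / len(y)
    b -= lr * grad.mean()

pred = (sigmoid(X @ w + b) >= 0.5).astype(float)
accuracy = float((pred == y).mean())
print(accuracy)
```

A hidden layer gets around this by letting the network combine several linear boundaries into one non-linear decision.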

In [None]:
# Colab-only magic: it has to run before TensorFlow is first imported.
%tensorflow_version 2.x
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.datasets import mnist

In [None]:
# Load the MNIST training split and wrap it in a batched tf.data pipeline
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(32)

#Model
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(100),
  tf.keras.layers.ReLU(),
  tf.keras.layers.Dense(10)
])

#https://www.tensorflow.org/api_docs/python/tf/keras/datasets/mnist/load_data

In [None]:
x_train[0].shape

(28, 28)

In [None]:
X = x_train.reshape(x_train.shape[0], 784)

In [None]:
X.shape

(60000, 784)

In [None]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_2 (Flatten)          (None, 784)               0         
_________________________________________________________________
dense_10 (Dense)             (None, 100)               78500     
_________________________________________________________________
re_lu_2 (ReLU)               (None, 100)               0         
_________________________________________________________________
dense_11 (Dense)             (None, 10)                1010      
Total params: 79,510
Trainable params: 79,510
Non-trainable params: 0
_________________________________________________________________


## Try building/training a more complex MLP on a bigger dataset.

Use TensorFlow Keras & the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) to build the canonical handwritten digit recognizer and see what kind of accuracy you can achieve. 

If you need inspiration, the Internet is chock-full of tutorials, but I want you to see how far you can get on your own first. I've linked to the original MNIST dataset above but it will probably be easier to download data through a neural network library. If you reference outside resources make sure you understand every line of code that you're using from other sources, and share with your fellow students helpful resources that you find.


### Parts
1. Gathering & Transforming the Data
2. Making MNIST a Binary Problem
3. Estimating your Neural Network (the part you focus on)

### Gathering the Data 

`keras` has a handy method to pull the MNIST dataset for you. You'll notice that each observation is a 28x28 array which represents an image. Although most neural network frameworks can handle higher dimensional data, that is more overhead than necessary for us. We need to flatten each image into one long row of 784 values (28x28). Basically, you will be appending each row of the image to the previous one to make one really long row. 
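
The "appending each row" description corresponds to a row-major reshape. The sketch below uses a made-up stack of two 28x28 arrays filled with counters (`images` is a stand-in, not real MNIST data) so the mapping from pixel position to flat index is easy to check.

```python
import numpy as np

# Stand-in for two 28x28 images; the values are just running counters.
images = np.arange(2 * 28 * 28).reshape(2, 28, 28)

# Flatten each image into a single row of 784 values.
flat = images.reshape(images.shape[0], 28 * 28)
print(flat.shape)

# Row-major order: pixel (r, c) of image i lands at flat[i, r * 28 + c].
r, c = 3, 5
same_pixel = flat[1, r * 28 + c] == images[1, r, c]
print(same_pixel)
```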

In [None]:
import numpy as np
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

In [None]:
# input image dimensions
img_rows, img_cols = 28, 28

In [None]:
# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

In [None]:
x_train = x_train.reshape(x_train.shape[0], img_rows * img_cols)
x_test = x_test.reshape(x_test.shape[0], img_rows * img_cols)

# Normalize Our Data
x_train = x_train / 255
x_test = x_test / 255

In [None]:
# Now the data should be in a format you're more familiar with
x_train.shape

(60000, 784)

### Making MNIST a Binary Problem 
MNIST is a multiclass classification problem; however, we haven't covered all the necessary techniques to handle this yet. You would need to one-hot encode the target, use a different loss function, and use softmax activations for the last layer. This is all stuff we'll cover later this week, but let us simplify the problem for now: zero or all else.

In [None]:
import numpy as np

# Binary target: 1.0 where the digit is zero, 0.0 for every other digit.
y_train = (y_train == 0).astype(float)
y_test = (y_test == 0).astype(float)

In [None]:
# A Nice Binary target for ya to work with
y_train

array([0., 1., 0., ..., 0., 0., 0.])

In [None]:
x_train

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
print(x_test.shape)
print(y_test.shape)

(10000, 784)
(10000,)


In [None]:
#https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense

model = Sequential()
model.add(Dense(30, input_dim= 784, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

In [None]:
model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['acc'])

In [None]:
model.fit(x_train, y_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f4b3327f9b0>



- Make MNIST a multiclass problem using cross entropy & soft-max
- Implement Cross Validation model evaluation on your MNIST implementation 
- Research different [Gradient Descent Based Optimizers](https://keras.io/optimizers/)
  - [Siraj Raval the evolution of gradient descent](https://www.youtube.com/watch?v=nhqo0u1a6fw)
- Build a housing price estimation model using a neural network. How does its accuracy compare with the regression models that we fit earlier on in class?
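
For the multiclass item in the list above, the three pieces mentioned earlier (one-hot targets, softmax outputs, categorical cross-entropy) can be sketched in NumPy before wiring them into Keras. The helper names and the random logits below are assumptions for illustration; in Keras the `to_categorical` utility imported earlier covers the one-hot step.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels into rows with a single 1 at the label index."""
    return np.eye(num_classes)[labels]

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    log_probs = np.log(np.clip(probs, eps, 1.0))
    return float(-np.mean(np.sum(y_onehot * log_probs, axis=1)))

labels = np.array([5, 0, 4])          # three example digit labels
targets = one_hot(labels, 10)

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 10))     # stand-in for the last layer's raw outputs
probs = softmax(logits)
loss = categorical_cross_entropy(targets, probs)
print(probs.sum(axis=1), loss)
```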