
# Backpropagation Practice

## *Data Science Unit 4 Sprint 2 Assignment 2*

Using TensorFlow Keras, implement a Multilayer Perceptron with 3 inputs, one 4-node hidden layer, and 1 output node on the following dataset:

| x1 | x2 | x3 | y |
|----|----|----|---|
| 0  | 0  | 1  | 0 |
| 0  | 1  | 1  | 1 |
| 1  | 0  | 1  | 1 |
| 0  | 1  | 0  | 1 |
| 1  | 0  | 0  | 1 |
| 1  | 1  | 1  | 0 |
| 0  | 0  | 0  | 0 |

If you look at the data, you'll notice that `y` is the XOR of the first two columns, while the third column is essentially noise. Recall that XOR is the classic function that a single-layer perceptron was criticized for being unable to learn.
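
The XOR pattern can be checked directly. This is a minimal sketch with the table rows copied into a plain list; the names `rows` and `xor_matches` are only for illustration.

```python
# Rows copied from the table above as (x1, x2, x3, y).
rows = [
    (0, 0, 1, 0),
    (0, 1, 1, 1),
    (1, 0, 1, 1),
    (0, 1, 0, 1),
    (1, 0, 0, 1),
    (1, 1, 1, 0),
    (0, 0, 0, 0),
]

# y should equal x1 XOR x2 on every row, whatever x3 is.
xor_matches = all((x1 ^ x2) == y for x1, x2, _x3, y in rows)
print(xor_matches)
```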

This is your "Hello World!" of TensorFlow.

### Example TensorFlow Starter Code

```python 
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(3, activation='sigmoid', input_dim=2),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['acc'])

results = model.fit(X,y, epochs=100)

```

### Additional Written Tasks:
1. Investigate the various [loss functions](https://www.tensorflow.org/api_docs/python/tf/keras/losses). Which is best suited for the task at hand (predicting 1 / 0) and why? 

**The best fit for this task is `BinaryCrossentropy`. The target is binary (1 / 0) and the output node is a sigmoid that produces a probability. Binary cross-entropy scores that predicted probability against the true label: it is small when the model is confident and correct, and it grows sharply when the model is confident and wrong.**

2. What is the difference between a loss function and a metric? Why might we need both in Keras? 

**A loss function is what the optimizer minimizes during training, so it has to be differentiable with respect to the model's weights. A metric is only reported, during training and evaluation, to judge the model's performance in terms we care about (for example accuracy), and it never affects the weight updates. We need both because the quantity that is easy to optimize (binary cross-entropy) is not always the quantity that is easy to interpret (accuracy).**
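
The difference between the two can be seen by computing both by hand. The sketch below is plain Python, not Keras's implementation; the helper names, the clipping value `eps`, and the toy predictions are all assumptions made for illustration. Both sets of predictions get every label right, so accuracy cannot tell them apart, but the loss can.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum((p > threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

y_true = [0, 1, 1, 0]
barely_right = [0.45, 0.55, 0.55, 0.45]   # correct, but unsure
clearly_right = [0.05, 0.95, 0.95, 0.05]  # correct and confident

for name, probs in [("barely", barely_right), ("clearly", clearly_right)]:
    print(name, accuracy(y_true, probs), round(binary_crossentropy(y_true, probs), 3))
```

The loss still has room to improve after accuracy has reached 100%, which is one reason training is driven by the loss rather than by the metric.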

3. Investigate the various [optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers). Stochastic Gradient Descent (`sgd`) is not the learning algorithm du jour anymore. Why is that? What do newer optimizers such as `adam` have to offer? 

**Plain Stochastic Gradient Descent is no longer the learning algorithm du jour because it applies a single, fixed learning rate to every parameter and estimates the gradient from one example or a small batch at a time. Those estimates are noisy, so the updates can bounce back and forth around a minimum without settling, and the learning rate usually has to be tuned or decayed by hand to get convergence. Newer optimizers such as Adam keep running averages of the gradient and of its square and use them to give each parameter its own adaptive step size. According to the TensorFlow documentation, Adam is computationally efficient, has little memory requirement, is invariant to diagonal rescaling of gradients, and is well suited for problems that are large in terms of data and/or parameters.**
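
The rescaling point can be illustrated with the two update rules written out on a one-dimensional toy problem. This is a sketch under stated assumptions, not the Keras implementation: the function names, the toy loss `f(w) = scale * w**2 / 2`, and the hyperparameter values (`lr=0.1`, `eps=1e-8`) are chosen only for illustration.

```python
import math

def sgd_step(w, grad, lr=0.1):
    """Plain gradient descent: the step is proportional to the raw gradient."""
    return w - lr * grad

def adam_step(w, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` holds the step count and both moving averages."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected mean
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected mean square
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

def first_step_sizes(scale):
    """Size of the first update on f(w) = scale * w**2 / 2, starting at w = 1."""
    w0 = 1.0
    grad = scale * w0
    sgd = abs(sgd_step(w0, grad) - w0)
    adam = abs(adam_step(w0, grad, {"t": 0, "m": 0.0, "v": 0.0}) - w0)
    return sgd, adam

for scale in (1.0, 100.0):
    print(scale, first_step_sizes(scale))
```

Multiplying the gradient by 100 makes the SGD step 100 times larger, while the Adam step stays close to `lr` because the gradient is divided by the square root of its own running mean square.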

### Build a TensorFlow Keras Perceptron

Try to match the architecture we used on Monday: input nodes and one output node. Apply this architecture to the XOR-ish dataset above. 

After fitting your model answer these questions: 

Are you able to achieve the same results as a bigger architecture from the first part of the assignment? Why is this disparity the case? What properties of the XOR dataset would cause this disparity? 

Now extrapolate this behavior on a much larger dataset in terms of features. What kind of architecture decisions could we make to avoid the problems the XOR dataset presents at scale? 

*Note:* The bias term is baked in by default in the Dense layer.

In [16]:
# Import pandas
import pandas as pd

df = pd.DataFrame({'x1': [0, 0, 1, 0, 1, 1, 0],
                   'x2': [0, 1, 0, 1, 0, 1, 0],
                   'x3': [1, 1, 1, 0, 0, 1, 0],
                   'y': [0, 1, 1, 1, 1, 0, 0]
                  })

feats = list(df)[:-1]  # Extracting the x features columns
features = df[feats].values  # Getting the values from the features columns
target = df['y'].values  # This is our target we're trying to predict

print('Features (X values):', feats)  # Making sure features are correct.
print("Values to predict (y value):", target)  # Making sure the target we're attempting to predict is correct.

print(f'Shape of Features: {features.shape}')  # Confirming shape of features to pass into the input dimensions for our input layer.

Features (X values): ['x1', 'x2', 'x3']
Values to predict (y value): [0 1 1 1 1 0 0]
Shape of Features: (7, 3)


#### Building Keras Perceptron (With Hidden Layer)

In [17]:
# Keras Imports
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model1 = Sequential([
    # Note: this is not the input layer. `input_dim=3` declares the 3 inputs;
    # the Dense layer itself is an extra 3-node layer with no activation (linear).
    Dense(3, input_dim=3),
    Dense(4, activation='sigmoid'),  # This is our hidden layer with four nodes.
    Dense(1, activation='sigmoid'),  # This will be our output layer. Because this is binary classification, only 1 output is needed.
])

model1.compile(optimizer='RMSProp', loss='binary_crossentropy', metrics=['accuracy'])

results = model1.fit(features, target, epochs=25)  # Fitting our model (25 iterations / epochs)
score1 = model1.evaluate(features, target)  # Note: This will spit out both our loss and our metric (accuracy here)

Epoch 1/25
...
Epoch 25/25


In [18]:
print('Model Evaluation:')
print(f'Metric is {model1.metrics_names[1]}: {round(score1[1], 3)}')

Model Evaluation:
Metric is accuracy: 0.429


#### Building Keras Perceptron (Without Hidden Layer)

In [19]:
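model2 = Sequential([
    # Note: `input_dim=3` declares the inputs; this Dense layer is a linear 3-node layer.
    # With no nonlinearity it adds no expressive power, so the model is still
    # effectively a single-layer perceptron.
    Dense(3, input_dim=3),
    Dense(1, activation='sigmoid'),  # Output layer
])

model2.compile(optimizer='RMSProp', loss='binary_crossentropy', metrics=['accuracy'])

results = model2.fit(features, target, epochs=25)  # Fitting our model (25 iterations / epochs again)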
score2 = model2.evaluate(features, target)  # Note: This will spit out both our loss and our metric (accuracy here)

Epoch 1/25
...
Epoch 25/25


In [20]:
print('Model Evaluation:')
print(f'Metric is {model2.metrics_names[1]}: {round(score2[1], 3)}')

Model Evaluation:
Metric is accuracy: 0.286


#### Evaluations For Both Models

In [21]:
print(f'Metric is {model2.metrics_names[1]}: {round(score2[1], 3)} (Without Hidden Layer - Basic Perceptron)')
print(f'Metric is {model1.metrics_names[1]}: {round(score1[1], 3)} (With Hidden Layer - Multilayer Perceptron)')

if score2[1] > score1[1]:
    print('(Basic Perceptron Outperformed Multilayer Perceptron)')
elif score2[1] == score1[1]:
    print('(Both Models Performed Equally)')
else:
    print('(Multilayer Perceptron Outperformed Basic Perceptron)')

Metric is accuracy: 0.286 (Without Hidden Layer - Basic Perceptron)
Metric is accuracy: 0.429 (With Hidden Layer - Multilayer Perceptron)
(Multilayer Perceptron Outperformed Basic Perceptron)


#### Answered Questions

Are you able to achieve the same results as a bigger architecture from the first part of the assignment? Why is this disparity the case? What properties of the XOR dataset would cause this disparity?

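**By adding a hidden layer, I was able to achieve better results (0.429 vs. 0.286 accuracy), although after only 25 epochs both models still score below the 4/7 ≈ 0.571 you would get by always predicting 1, which suggests neither had converged. The underlying disparity comes from the nature of XOR. XOR stands for exclusive or: the output is true only if exactly *one* of the inputs is true. The four XOR points are not linearly separable, meaning no single straight line divides the 1s from the 0s, and a perceptron without a nonlinear hidden layer can only learn a linear decision boundary. This was a common criticism of perceptrons for a long time, and it was resolved by multilayer perceptrons, whose hidden layers let the network combine several linear boundaries into a nonlinear one.**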
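
The separability argument can be made concrete without any training. The sketch below uses step activations and hand-picked weights, both of which are assumptions for illustration rather than anything Keras would learn. It searches a grid of weights for a single threshold unit and compares the best result with a two-layer network built from an OR unit and an AND unit.

```python
from itertools import product

def step(z):
    return 1 if z > 0 else 0

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def single_unit(x1, x2, w1, w2, b):
    """One threshold unit: a single linear decision boundary."""
    return step(w1 * x1 + w2 * x2 + b)

def two_layer(x1, x2):
    """Two hidden units (OR, AND) combined by one output unit."""
    h_or = step(x1 + x2 - 0.5)       # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)      # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # OR but not AND, which is XOR

grid = [i / 2 for i in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
best_single = max(
    sum(single_unit(x1, x2, w1, w2, b) == y for (x1, x2), y in XOR)
    for w1, w2, b in product(grid, repeat=3)
)
two_layer_correct = sum(two_layer(x1, x2) == y for (x1, x2), y in XOR)

print(best_single, two_layer_correct)
```

No weight setting on the grid gets more than 3 of the 4 points right with one unit, while the two-layer network gets all 4.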

Now extrapolate this behavior on a much larger dataset in terms of features. What kind of architecture decisions could we make to avoid the problems the XOR dataset presents at scale?

**At scale, the same issue appears whenever the target depends on interactions between features rather than on each feature independently, as with XOR, where the output is true only if exactly one input is true. A model with no hidden layer cannot represent those interactions, but hidden layers with nonlinear activations can. Thus, the primary architecture decisions are the number of hidden layers, the number of nodes within those layers, and the activation functions they use.**

## Try building/training a more complex MLP on a bigger dataset.

Use TensorFlow Keras & the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) to build the canonical handwritten digit recognizer and see what kind of accuracy you can achieve. 

If you need inspiration, the Internet is chock-full of tutorials, but I want you to see how far you can get on your own first. I've linked to the original MNIST dataset above, but it will probably be easier to download the data through a neural network library. If you reference outside resources, make sure you understand every line of code that you're using from other sources, and share helpful resources that you find with your fellow students.


### Parts
1. Gathering & Transforming the Data
2. Making MNIST a Binary Problem
3. Estimating your Neural Network (the part you focus on)

### Gathering the Data 

`keras` has a handy method to pull the MNIST dataset for you. You'll notice that each observation is a 28x28 array which represents an image. Although most neural network frameworks can handle higher-dimensional data, that is more overhead than necessary for us. We need to flatten each image into one long row of 784 values (28x28). Basically, you will be appending each row of the image to the previous one to make one really long row. 

In [0]:
import numpy as np
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

In [0]:
# Input image dimensions
img_rows, img_cols = 28, 28

In [0]:
# Load Data & Train-Test-Split
(X_train, y_train), (X_test, y_test) = mnist.load_data()

In [0]:
# Reshaping data in accordance to image dimensions
X_train = X_train.reshape(X_train.shape[0], img_rows * img_cols)
X_test = X_test.reshape(X_test.shape[0], img_rows * img_cols)

# Normalize Our Data
X_train = X_train / 255
X_test = X_test / 255

In [26]:
# Now the data should be in a format you're more familiar with
X_train.shape

(60000, 784)
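
To see what the reshape does on a small scale, here is a sketch with three made-up 2x3 "images" standing in for the 28x28 MNIST arrays; the array contents are arbitrary.

```python
import numpy as np

# Three tiny 2x3 "images" standing in for MNIST's 28x28 arrays.
images = np.arange(18).reshape(3, 2, 3)

# Flatten each image into one row; -1 lets NumPy infer 2 * 3 = 6.
flat = images.reshape(images.shape[0], -1)

print(images[0])   # the first image as a 2x3 grid
print(flat[0])     # the same values, rows appended end to end
print(flat.shape)
```

Passing `-1` instead of `img_rows * img_cols` is only a convenience; both produce the same result when the dimensions are known.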

### Making MNIST a Binary Problem 
MNIST is a multiclass classification problem; however, we haven't covered all the necessary techniques to handle this yet. You would need to one-hot encode the target, use a different loss function, and use a softmax activation for the last layer. This is all stuff we'll cover later this week, but let us simplify the problem for now: zero or everything else.

In [0]:
import numpy as np

# 1.0 where the digit is zero, 0.0 for every other digit
y_train = (y_train == 0).astype(float)
y_test = (y_test == 0).astype(float)

In [28]:
# Binary target to work with.
y_train

array([0., 1., 0., ..., 0., 0., 0.])

### Estimating Your Net

In [29]:
model3 = Sequential([
    # Hidden layer 1 (linear, no activation); `input_dim=784` declares the flattened-image input.
    # There is no fixed rule for the number of nodes; it is a hyperparameter tuned by experiment.
    Dense(125, input_dim=784),
    Dense(10, activation='sigmoid'),  # Hidden layer 2
    Dense(5, activation='sigmoid'),  # Hidden layer 3
    Dense(1, activation='sigmoid'),  # Output layer
])

model3.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])  # Compile

model3.fit(X_train, y_train)  # Uses the default of 1 epoch
score3 = model3.evaluate(X_test, y_test)



In [30]:
print('For MNIST Dataset Neural Network:')
print(f'Metric ({model3.metrics_names[1]}): {score3[1] * 100}%')

For MNIST Dataset Neural Network:
Metric (accuracy): 99.26999807357788%
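
When reading an accuracy figure on an imbalanced binary target, it helps to compare it with what always predicting the most common class would score. A minimal sketch follows; `majority_baseline` is a hypothetical helper, and `toy_labels` is made-up data that only mimics a target where roughly one label in ten is positive.

```python
def majority_baseline(labels):
    """Accuracy of always predicting the most common class (labels are 0 or 1)."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)

# Made-up labels: 1 positive in every 10.
toy_labels = [1] + [0] * 9
print(majority_baseline(toy_labels))

# The XOR-ish target from earlier in the notebook.
xor_target = [0, 1, 1, 1, 1, 0, 0]
print(round(majority_baseline(xor_target), 3))
```

On the real data the same idea could be applied to `y_test` to see how much of the reported accuracy comes from the model rather than from class imbalance.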


## Stretch Goals: 

- Make MNIST a multiclass problem using cross-entropy & softmax
- Implement Cross Validation model evaluation on your MNIST implementation 
- Research different [Gradient Descent Based Optimizers](https://keras.io/optimizers/)
  - [Siraj Raval: The Evolution of Gradient Descent](https://www.youtube.com/watch?v=nhqo0u1a6fw)
- Build a housing price estimation model using a neural network. How does its accuracy compare with the regression models that we fit earlier on in class?
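
As a starting point for the first stretch goal, the three pieces of the multiclass setup (one-hot targets, softmax outputs, categorical cross-entropy) can be sketched in NumPy. The helper names and the toy `labels` and `logits` below are assumptions for illustration, not Keras code; in the notebook, the already imported `to_categorical` covers the one-hot step.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_onehot, probs, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    clipped = np.clip(probs, eps, 1.0)
    return float(-np.mean(np.sum(y_onehot * np.log(clipped), axis=1)))

labels = np.array([0, 2, 1])
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.3, 3.0],
                   [1.0, 1.0, 1.0]])

targets = one_hot(labels, 3)
probs = softmax(logits)
print(targets)
print(probs.sum(axis=1))
print(categorical_crossentropy(targets, probs))
```

Each row of `probs` sums to 1, so the output layer can be read as a probability distribution over the classes.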