<a href="https://colab.research.google.com/github/pragmatizt/DS-Unit-4-Sprint-2-Neural-Networks/blob/master/ira_Unit_4_Sprint_Challenge_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 4 Sprint 2*

# Sprint Challenge - Neural Network Foundations

Table of Problems

1. [Defining Neural Networks](#Q1)
2. [Chocolate Gummy Bears](#Q2)
    - Perceptron
    - Multilayer Perceptron
3. [Keras MLP](#Q3)

<a id="Q1"></a>
## 1. Define the following terms:

- **Neuron:** The basic unit of a neural network. Each neuron (node) takes a weighted sum of its inputs, adds a bias, and passes the result through an activation function. Neurons are simple on their own but highly interconnected and organized in layers. (Artificial neural networks are loosely modeled on how biological brains work.)

- **Input Layer:** The first layer, which presents the data to the network, with one node per feature. These nodes are passive (they don't change the data).

   Each input node receives a single value and sends a copy of it to every node in the first hidden layer.

- **Hidden Layer:** A layer of neurons found between the input layer and the output layer. Hidden layers allow the function of a neural network to be broken down into a sequence of specific transformations of the data.

   Each hidden neuron applies weights to its inputs and passes the weighted sum through an activation function; the result becomes an input to the next layer.

- **Output Layer:** The final layer, which receives connections from the last hidden layer. It returns an output value that corresponds to the prediction of the response variable.

   In binary classification there is usually only one output node. In multiclass classification there is typically one node per class. Unlike the passive input nodes, the active nodes of the output layer combine and transform their inputs to produce the output values.

- **Activation:** Each neuron has an activation function, which defines the output of that node given its weighted input (or set of inputs).

   There are a number of activation functions available. The ones we used for this week's sprint were sigmoid and ReLU.

- **Backpropagation:** An algorithm for supervised learning of artificial neural networks, used together with gradient descent.

   The method calculates the gradient of the error function with respect to the neural network's weights, and gradient descent then uses that gradient to update the weights.

   The calculation of the gradient proceeds "backwards" through the network via the chain rule: the gradient of the final layer of weights is calculated first, and the gradient of the first layer of weights is calculated last.
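
To make the definitions above concrete, here is a minimal NumPy sketch of a single neuron: a forward pass (weighted sum, bias, activation) followed by one backward pass using the chain rule. The inputs, weights, bias, target, and the squared-error loss are made-up values chosen for illustration, not anything from the sprint data.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Pass positive values through unchanged; clip negatives to 0."""
    return np.maximum(0.0, z)

def loss_fn(w, x, b, target):
    """Half squared error of a single sigmoid neuron (illustrative choice)."""
    return 0.5 * (sigmoid(np.dot(x, w) + b) - target) ** 2

# Illustrative values only
x = np.array([1.0, 0.0])     # inputs
w = np.array([0.5, -0.3])    # weights
b = -0.5                     # bias
target = 1.0

# Forward pass
z = np.dot(x, w) + b         # weighted sum: 0.5*1 + (-0.3)*0 - 0.5
a = sigmoid(z)               # activation of the neuron
print(z, a, relu(z))         # 0.0 0.5 0.0

# Backward pass (chain rule): dL/dw = dL/da * da/dz * dz/dw
grad_w = (a - target) * a * (1 - a) * x

# Sanity check with central finite differences
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = eps
    numeric[i] = (loss_fn(w + step, x, b, target)
                  - loss_fn(w - step, x, b, target)) / (2 * eps)

print(np.allclose(grad_w, numeric))   # True
```

The finite-difference check is a handy habit when writing backpropagation by hand: if the analytic and numeric gradients disagree, the derivative code has a bug.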


# Referenced links:
(for when I refer to this sprint in the future)
*   https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
*   https://en.wikipedia.org/wiki/Activation_function
*   https://deepai.org/machine-learning-glossary-and-terms/hidden-layer-machine-learning
*   https://brilliant.org/wiki/backpropagation/





## 2. Chocolate Gummy Bears <a id="Q2"></a>

Right now, you're probably thinking, "yuck, who the hell would eat that?". Great question. Your candy company wants to know too. And you thought I was kidding about the [Chocolate Gummy Bears](https://nuts.com/chocolatessweets/gummies/gummy-bears/milk-gummy-bears.html?utm_source=google&utm_medium=cpc&adpos=1o1&gclid=Cj0KCQjwrfvsBRD7ARIsAKuDvMOZrysDku3jGuWaDqf9TrV3x5JLXt1eqnVhN0KM6fMcbA1nod3h8AwaAvWwEALw_wcB). 

Let's assume that a candy company has gone out and collected information on the types of Halloween candy kids ate. Our candy company wants to predict the eating behavior of witches, warlocks, and ghosts -- aka costumed kids. They shared a sample dataset with us. Each row represents a piece of candy that a costumed child was presented with during "trick" or "treat". We know if the candy was `chocolate` (or not chocolate) or `gummy` (or not gummy). Your goal is to predict if the costumed kid `ate` the piece of candy. 

If both chocolate and gummy equal one, you've got a chocolate gummy bear on your hands!?!?!
![Chocolate Gummy Bear](https://ed910ae2d60f0d25bcb8-80550f96b5feb12604f4f720bfefb46d.ssl.cf1.rackcdn.com/3fb630c04435b7b5-2leZuM7_-zoom.jpg)

In [0]:
import pandas as pd
candy = pd.read_csv('chocolate_gummy_bears.csv')

In [0]:
candy.head()

Unnamed: 0,chocolate,gummy,ate
0,0,1,1
1,1,0,1
2,0,1,1
3,0,0,0
4,1,1,0


In [0]:
# The necessities
import numpy as np

# Sklearn packages
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

### Perceptron

To make predictions on the `candy` dataframe, build and train a Perceptron using numpy. Your target column is `ate` and your features are `chocolate` and `gummy`. Do not do any feature engineering. :P

Once you've trained your model, report your accuracy. You will not be able to achieve more than ~50% with the simple perceptron. Explain why you could not achieve a higher accuracy with the *simple perceptron* architecture, given that it's possible to achieve ~95% accuracy on this dataset. Provide your answer in markdown (and *optional* data analysis code) after your perceptron implementation. 

In [0]:
print(candy.shape)
candy.head()

(10000, 3)


Unnamed: 0,chocolate,gummy,ate
0,0,1,1
1,1,0,1
2,0,1,1
3,0,0,0
4,1,1,0


In [0]:
# Start your candy perceptron here

X = candy[['chocolate', 'gummy']].values
y = candy['ate'].values

In [0]:
X.shape, y.shape

((10000, 2), (10000,))

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=85)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((8000, 2), (2000, 2), (8000,), (2000,))

In [0]:
class Perceptron(object):
  def __init__(self, rate=0.1, niter=1000):
    self.rate = rate
    self.niter = niter

  def fit(self, X, y):
    """ fit training data
    X : training vectors, X.shape : [#samples, #features]
    y : target values, y.shape : [#samples]
    """

    # Initialize weights
    self.weight = np.zeros(1 + X.shape[1])

    # Number of misclassifications in each pass over the data
    self.errors = []

    for i in range(self.niter):
      err = 0
      for xi, target in zip(X, y):
        delta_w = self.rate * (target - self.predict(xi))
        self.weight[1:] += delta_w * xi
        self.weight[0] += delta_w
        err += int(delta_w != 0.0)
      self.errors.append(err)
    return self

  def net_input(self, X):
    return np.dot(X, self.weight[1:]) + self.weight[0]

  def predict(self, X):
    """Return class label after unit step"""
    return np.where(self.net_input(X) >= 0.0, 1, 0)

In [0]:
pn = Perceptron()
pn.fit(X_train, y_train)

<__main__.Perceptron at 0x7f84f49788d0>

In [0]:
# predict() is vectorized, so the whole test set can be passed at once
predictions = pn.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))

accuracy: 0.507


In [0]:
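# The accuracy is right in line with where the instructions
# for this part said it would be, ~50%.
#
# Why it can't do better: the rows shown by candy.head() follow an XOR
# pattern (ate is 1 when exactly one of chocolate / gummy is 1). XOR is not
# linearly separable, and a single perceptron can only draw one straight
# decision boundary, so no choice of weights classifies all four cases.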
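
The linear-separability point can be checked by brute force. The sketch below is not part of the assignment: it uses the four clean corner cases rather than the candy data, and tries every weight and bias combination on a coarse grid (the grid range and step are arbitrary choices). A single threshold unit can fit OR perfectly but never gets more than three of the four XOR cases right.

```python
import itertools
import numpy as np

# The four possible (chocolate, gummy) combinations
X_corners = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_or = np.array([0, 1, 1, 1])    # linearly separable
y_xor = np.array([0, 1, 1, 0])   # not linearly separable

def best_linear_accuracy(X, y, grid=np.linspace(-2, 2, 21)):
    """Best accuracy any single threshold unit achieves over a weight/bias grid."""
    best = 0.0
    for w1, w2, b in itertools.product(grid, repeat=3):
        preds = (X @ np.array([w1, w2]) + b >= 0).astype(int)
        best = max(best, float(np.mean(preds == y)))
    return best

print("OR :", best_linear_accuracy(X_corners, y_or))    # OR : 1.0
print("XOR:", best_linear_accuracy(X_corners, y_xor))   # XOR: 0.75
```

On top of that ceiling, the perceptron update rule does not converge on non-separable data: the weights keep shifting with every misclassified row, so the accuracy at whatever iteration training stops can land well below the best achievable.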

### Multilayer Perceptron <a id="Q3"></a>

Using the sample candy dataset, implement a Neural Network Multilayer Perceptron class that uses backpropagation to update the network's weights. Your Multilayer Perceptron should be implemented in Numpy. 
Your network must have one hidden layer.

Once you've trained your model, report your accuracy. Explain why your MLP's performance is considerably better than your simple perceptron's on the candy dataset. 

In [0]:
class NeuralNetwork:
  def __init__(self):
    # Set up the architecture of the neural network
    self.inputs = 2
    self.hiddenNodes = 3
    self.outputNodes = 1

    # initial weights: 2 x 3 matrix array for the 1st layer 
    self.weights1 = np.random.rand(self.inputs, self.hiddenNodes)

    # 3 x 1 matrix array for the hidden to output pathway
    self.weights2 = np.random.rand(self.hiddenNodes, self.outputNodes)

  def sigmoid(self, s):
    return 1 / (1+np.exp(-s))

  def sigmoidPrime(self, s):
    return s * (1 - s)

  def feed_forward(self, X):
    """
    Calculate the NN inference using feed forward. aka "predict"
    """

    # weighted sum of inputs => hidden layer
    self.hidden_sum = np.dot(X, self.weights1)

    # activations of weighted sum
    self.activated_hidden = self.sigmoid(self.hidden_sum)

    # weighted sum between hidden and output
    self.output_sum = np.dot(self.activated_hidden, self.weights2)

    # final activation of output
    self.activated_output = self.sigmoid(self.output_sum)

    return self.activated_output

  def backward(self, X, y, o):
    """
    Backward propagate through the network
    """

    # error in output
    self.o_error = y - o

    # apply derivative of sigmoid to error
    # how far off are we in relation to the sigmoid f(x) of the output
    # ^- aka hidden => output
    self.o_delta = self.o_error * self.sigmoidPrime(o)

    # z2 error
    self.z2_error = self.o_delta.dot(self.weights2.T)

    # how much of that "far off" can be explained by the input => hidden
    self.z2_delta = self.z2_error * self.sigmoidPrime(self.activated_hidden)

    # adjustment to first set of weights (input => hdiden)
    self.weights1 += X.T.dot(self.z2_delta)

    # adjustment to second set of weights (hidden => output)
    self.weights2 += self.activated_hidden.T.dot(self.o_delta)

  def train(self, X, y):
    o = self.feed_forward(X)
    self.backward(X, y, o)



P.S. Don't try candy gummy bears. They're disgusting. 

### the three cells below reference Lecture & Assignment from module 2
(reference for self)

In [0]:
import matplotlib.pyplot as plt

In [0]:
# reshaping y_train and y_test
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

In [0]:
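nn = NeuralNetwork()

# y has shape (10000,) but feed_forward returns (10000, 1). Reshape so the
# subtraction is row-by-row instead of broadcasting to a 10000 x 10000 matrix.
y_col = y.reshape(-1, 1)

cost = []
for i in range(1000):
    cost.append(np.mean(np.square(y_col - nn.feed_forward(X))))
    nn.train(X_train, y_train)


print("Predicted Output: \n" + str(nn.feed_forward(X))) 
print("Loss: \n" + str(np.mean(np.square(y_col - nn.feed_forward(X)))))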

Predicted Output: 
[[2.49337724e-94]
 [2.49333457e-94]
 [2.49337724e-94]
 ...
 [2.49337724e-94]
 [2.49337724e-94]
 [2.49333457e-94]]
Loss: 
0.5


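# Explanation: 
*(answer to the question)*
  - The hidden layer and its non-linear activation are what allow an MLP to outperform the simple perceptron on this dataset.

 A single perceptron can only learn a linear decision boundary, and the candy data is not linearly separable (the rows shown by `candy.head()` follow an XOR pattern). It is not that a perceptron can't learn from its errors: its update rule also adjusts the weights after every mistake. The problem is that no single straight line separates the two classes.

 The multilayer perceptron passes the inputs through a hidden layer of sigmoid units, so it can combine several linear boundaries into a non-linear one. Backpropagation is what makes training that hidden layer possible, because it carries the output error back to the first set of weights.

 Note that the training run above ended with a loss of 0.5 and near-zero predictions for every row, so this particular implementation has not yet reached the ~95% accuracy that the dataset allows.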

*reference*: https://ujjwalkarn.me/2016/08/09/quick-intro-neural-networks/
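
The run above ended with near-zero predictions for every row. One plausible cause (an assumption, not something verified against the candy data) is that the `NeuralNetwork` class sums its weight updates over all 8,000 training rows with no learning rate and has no bias terms, which can push the sigmoids into saturation. Below is a minimal sketch of a variant with biases, a learning rate, and batch-averaged gradients, trained on the four clean XOR cases. The hidden size, learning rate, seed, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean XOR data (illustrative; the candy data adds label noise on top)
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 2 inputs -> 4 hidden units -> 1 output, each layer with a bias
W1 = rng.normal(size=(2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(size=(4, 1))
b2 = np.zeros((1, 1))
lr = 0.5   # learning rate (assumed value)

losses = []
for _ in range(5000):
    # Forward pass
    h = sigmoid(X_xor @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((y_xor - out) ** 2)))

    # Backward pass: gradient of 0.5 * squared error,
    # using sigmoid'(z) = a * (1 - a) on the activations
    d_out = (out - y_xor) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient descent step, averaged over the batch
    n = len(X_xor)
    W2 -= lr * (h.T @ d_out) / n
    b2 -= lr * d_out.mean(axis=0, keepdims=True)
    W1 -= lr * (X_xor.T @ d_h) / n
    b1 -= lr * d_h.mean(axis=0, keepdims=True)

print("first loss:", losses[0])
print("last loss: ", losses[-1])
```

How closely the network ends up fitting XOR depends on the random initialisation and the number of iterations, so treat the printed losses as something to inspect rather than a guaranteed result.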

## 3. Keras MLP <a id="Q3"></a>

Implement a Multilayer Perceptron architecture of your choosing using the Keras library. Train your model and report its baseline accuracy. Then hyperparameter tune at least two parameters and report your model's accuracy.

- Use the Heart Disease Dataset (binary classification)
- Use an appropriate loss function for a binary classification task
- Use an appropriate activation function on the final layer of your network.
- Train your model using verbose output for ease of grading.
- Use GridSearchCV or RandomizedSearchCV to hyperparameter tune your model (for at least two hyperparameters).
- When hyperparameter tuning, show your work by adding code cells for each new experiment.
- Report the accuracy for each combination of hyperparameters as you test them so that we can easily see which resulted in the highest accuracy.
- You must hyperparameter tune at least 3 parameters in order to get a 3 on this section.

In [0]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/heart.csv')
df = df.sample(frac=1)
print(df.shape)
df.head()

(303, 14)


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
34,51,1,3,125,213,0,0,125,1,1.4,2,1,2,1
293,67,1,2,152,212,0,0,150,0,0.8,1,0,3,0
98,43,1,2,130,315,0,1,162,0,1.9,2,1,2,1
139,64,1,0,128,263,0,1,105,1,0.2,1,1,3,1
53,44,0,2,108,141,0,1,175,0,0.6,1,0,2,1


### Step 1: importing packages

In [0]:
import tensorflow
from tensorflow import keras
from sklearn.preprocessing import MinMaxScaler, Normalizer, OrdinalEncoder
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Dropout, Activation

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

from tensorflow.keras.optimizers import Adam, SGD
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

from tensorflow.keras.utils import to_categorical

### Step 2: Train, test, split

In [0]:
scaler = MinMaxScaler()
df_transform = scaler.fit_transform(df)

In [0]:
# Split the values into X and y components:
X_train, X_test, y_train, y_test = train_test_split(df_transform[:, :-1], df_transform[:, -1], 
test_size=0.20, random_state=85)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(242, 13) (61, 13) (242,) (61,)
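
The cells above fit `MinMaxScaler` on the whole dataframe (target column included) before splitting. A common alternative is to split first and fit the scaler on the training features only, so that no test-set statistics leak into training. A minimal sketch of that order of operations, using a small made-up array in place of the heart data (all variable names here are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up data standing in for the heart dataset: 3 features + binary target
rng = np.random.default_rng(85)
features = rng.normal(loc=50, scale=10, size=(100, 3))
target = rng.integers(0, 2, size=100)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.20, random_state=85)

# ... then learn min/max from the training rows only
scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)

# Reuse the training min/max on the test rows (values may fall outside [0, 1])
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)   # (80, 3) (20, 3)
```

The target is left out of the scaler entirely; it is already 0/1, so scaling it changes nothing and only makes the split harder to read.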


### Step 3: Arriving at our baseline model

In [0]:
# Fixing random seed for reproducibility
seed = 85
np.random.seed(seed)

# Function to create model, required for KerasClassifier
def create_model():
  # create model
  model = Sequential()
  model.add(Dense(13, input_dim=13, activation='relu'))
  #model.add(Dropout(0.2))  # <-- added these later to see effect on score
  model.add(Dense(12, activation='sigmoid'))
  #model.add(Dropout(0.2))  # <-- added these later to see effect on score, omitted from baseline.
  model.add(Dense(1, activation='sigmoid'))

  # compile model
  adam = Adam(lr=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
  model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
  print(model.summary())
  return model

# create model
model = KerasClassifier(build_fn=create_model, verbose=0)

In [0]:
model.fit(X_train, y_train,
          validation_data=(X_test,y_test),
          epochs=50,
          batch_size=20,
          verbose=1)

Model: "sequential_18"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_54 (Dense)             (None, 13)                182       
_________________________________________________________________
dense_55 (Dense)             (None, 12)                168       
_________________________________________________________________
dense_56 (Dense)             (None, 1)                 13        
Total params: 363
Trainable params: 363
Non-trainable params: 0
_________________________________________________________________
None
Train on 242 samples, validate on 61 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29

<tensorflow.python.keras.callbacks.History at 0x7f84adc53940>

In [0]:
print(f"Accuracy: {accuracy_score(y_test, np.round(model.predict(X_test)))}")

Accuracy: 0.9344262295081968


### Results, with tuning (Keras):
- *Version 1.0*: *no dropout*, learning rate: 0.01, epochs = 20, batch size = 10. ***Accuracy: 0.7868852459016393***
- *Version 1.1*: *with dropout*, learning rate: 0.01, epochs = 50, batch size = 10. ***Accuracy: 0.8852459016393442***
- *Version 1.2*: *with dropout*, same learning rate and epochs, *batch size = 20*. ***Accuracy: 0.9344262295081968***

### Step 4: Using Grid Search to find the best parameters

In [0]:
# Function to create model, required for KerasClassifier:
def gridsearch_create_model():
  # create model
  model = Sequential()
  model.add(Dense(13, input_dim=13, activation='relu'))
  model.add(Dense(12, activation='sigmoid'))
  model.add(Dense(1, activation='sigmoid'))

  # compile model
  adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
  model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])

  return model

# create model
model2 = KerasClassifier(build_fn=gridsearch_create_model, verbose=0)

# define the grid search parameters
param_grid = {'batch_size': [32, 64, 128, 256, 512],
              'epochs': [100]}

# see the "Results, with tuning (Grid Search)" cell below for version changes
# note: original batch_size: [10, 20, 40, 60, 80, 100]


In [0]:
# Create grid search
grid = GridSearchCV(estimator=model2, param_grid=param_grid)
grid_result = grid.fit(X_train, y_train)



In [0]:
# Report Results
print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
  # print this row's own parameter dict (param), not the whole params list
  print(f"Means: {mean}, Stdev: {stdev} with: {param}")

Best: 0.8057851306170472 using {'batch_size': 128, 'epochs': 100}
Means: 0.8057851259373436, Stdev: 0.01634438450914547 with: {'batch_size': 32, 'epochs': 100}
Means: 0.7933884391114732, Stdev: 0.006529184827057151 with: {'batch_size': 64, 'epochs': 100}
Means: 0.8057851306170472, Stdev: 0.042027961140042376 with: {'batch_size': 128, 'epochs': 100}
Means: 0.7644628104099558, Stdev: 0.06307058614582445 with: {'batch_size': 256, 'epochs': 100}

### Results, with tuning (Grid Search):
*(adjustments made on: learning rate, batch size, epochs)*
- *Version 1.0:* Reduced learning rate to 0.001, kept batch_size: 10, epochs: 20. **Best: 0.8057851259373436**
- *Version 1.1:* Increased epochs to 40. **Best: 0.8016528969953868 using {'batch_size': 10, 'epochs': 40}**
- *Version 1.2:* Increased epochs to 50. **Best: 0.8181818282801258 using {'batch_size': 10, 'epochs': 50}**
- *Version 1.3:* Increased batch size to 32, 64, 128, etc. ***Best: 0.7520661083134738 using {'batch_size': 32, 'epochs': 50}***
- *Version 1.4:* increased epochs to 100. ***Best: 0.8057851306170472 using {'batch_size': 64, 'epochs': 100}***
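
The versions above changed one setting at a time by hand. For reference, here is a sketch of searching three hyperparameters in a single grid. To keep it runnable without TensorFlow it swaps in scikit-learn's `MLPClassifier` for the Keras model and synthetic data for the heart dataset, so the parameter names (`learning_rate_init`, `max_iter`) are scikit-learn's, not Keras's, and the scores say nothing about the heart data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in with 13 features, like the heart dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=13,
                                     random_state=85)

# Three hyperparameters searched together: 2 x 2 x 2 = 8 combinations
param_grid = {
    'learning_rate_init': [0.001, 0.01],
    'batch_size': [10, 20],
    'max_iter': [50, 100],   # number of epochs for the default adam solver
}

# Stand-in model with two hidden layers of 13 and 12 units
mlp = MLPClassifier(hidden_layer_sizes=(13, 12), random_state=85)
grid = GridSearchCV(estimator=mlp, param_grid=param_grid, cv=3)
grid_result = grid.fit(X_demo, y_demo)

# Report the best combination, then every combination with its own params
print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
results = grid_result.cv_results_
for mean, stdev, param in zip(results['mean_test_score'],
                              results['std_test_score'],
                              results['params']):
    print(f"Mean: {mean:.3f}, Stdev: {stdev:.3f} with: {param}")
```

scikit-learn may emit convergence warnings for the shorter `max_iter` settings; they are expected here and do not stop the search.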