# Intro

This notebook follows a tutorial from [Machine Learning Mastery](https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/), just to get familiar with Keras.

I am going to dump a lot of the tips I found useful into this notebook.
There will also be a lot of links to the author's other posts.

# Load the data

In [None]:
import numpy as np
from numpy import loadtxt
from keras.models import Sequential
from keras.layers import Dense

In [None]:
# Breakdown of the data

# Input Variables (X):
# 1. Number of times pregnant
# 2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
# 3. Diastolic blood pressure (mm Hg)
# 4. Triceps skin fold thickness (mm)
# 5. 2-Hour serum insulin (mu U/ml)
# 6. Body mass index (weight in kg/(height in m)^2)
# 7. Diabetes pedigree function
# 8. Age (years)

# Output Variables (y):
# 1. Class variable (0 or 1)

# The tutorial wants us to use numpy's loadtxt
dataset = loadtxt("pima-indians-diabetes.csv", delimiter=',')
X = dataset[:, 0:8]
y = dataset[:, 8]

# Here is the equivalent in pandas - just to prove a point
# (.loc slices by label and includes the end label, hence 0:7 rather than 0:8)
# import pandas as pd
# pandas_dataset = pd.read_csv("pima-indians-diabetes.csv", delimiter=',', header=None)
# pandas_X = pandas_dataset.loc[:, 0:7]
# pandas_y = pandas_dataset.loc[:, 8]

# Define the Keras Model

The first thing to get right is to ensure the input layer has the right number of input features.
This can be specified when creating the first layer with the `input_dim` argument, setting it to 8 for the 8 input variables.
The tutorial asks how to determine the number of nodes in each layer, then links to [another of their posts](https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/).

It used to be the case that Sigmoid and Tanh activation functions were preferred for all layers.
These days, better performance is achieved using the ReLU activation function.
We use a sigmoid on the output layer so that the network output is between 0 and 1. That makes it easy to read as the probability of class 1, or to snap to a hard classification of either class with a default threshold of 0.5.
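To make the two activation functions concrete, here is a minimal numpy sketch of ReLU, sigmoid and the 0.5 threshold. The function names and the input values are made up for illustration; Keras applies its own implementations internally.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: keeps positive values, clips negatives to 0."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Logistic function: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])        # made-up pre-activation values

hidden = relu(z)                      # what a hidden layer would pass on
probs = sigmoid(z)                    # what the output layer would report
labels = (probs > 0.5).astype(int)    # hard classification at threshold 0.5

print(hidden)
print(probs)
print(labels)
```

Note that a probability of exactly 0.5 falls into class 0 here, because the comparison is a strict `>`.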

In [None]:
model = Sequential()
model.add(Dense(12, input_dim=8, activation="relu"))
model.add(Dense(8, activation="relu"))
model.add(Dense(1, activation="sigmoid"))

***Note***,
the most confusing thing here is that the shape of the input to the model is defined as an argument on the first hidden layer.
This means that the line of code that adds the first Dense layer is doing two things: defining the input (visible) layer and the first hidden layer.
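One way I found to convince myself why the input size has to live on the first layer: its weight matrix is 8 x 12, so it cannot be built until the number of inputs is known. Below is a small sketch that counts the trainable parameters of the 8 -> 12 -> 8 -> 1 network by hand. The helper name `dense_params` is my own; the totals should agree with what `model.summary()` reports.

```python
# Layer widths: 8 inputs -> 12 hidden -> 8 hidden -> 1 output
layer_sizes = [8, 12, 8, 1]

def dense_params(n_in, n_out):
    """A dense layer has one weight per input-output pair plus one bias per output."""
    return n_in * n_out + n_out

# Pair each layer width with the next one to get (inputs, outputs) per Dense layer
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)
```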

# Compile the Keras Model

When compiling, we must specify some additional properties required when training the network.
Remember, training a network means finding the best set of weights to map inputs to outputs in our dataset.

We must specify the **loss function** used to evaluate a set of weights, the optimizer used to search through different weights for the network, and any optional metrics we would like to collect and report during training.

[How to Choose Loss Functions When Training Deep Learning Neural Networks](https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/)
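For a binary classification problem the usual loss is binary cross-entropy. Here is a minimal numpy sketch of what that loss measures, assuming labels of 0 or 1 and predicted probabilities of class 1. The function name, the clipping value `eps` and the example numbers are my own choices, not taken from the tutorial.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    # Clip so that log(0) can never occur
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    per_row = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(per_row))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
confident_and_right = np.array([0.9, 0.1, 0.8, 0.2])   # made-up predictions
leaning_wrong = np.array([0.4, 0.6, 0.3, 0.7])         # made-up predictions

loss_good = binary_crossentropy(y_true, confident_and_right)
loss_bad = binary_crossentropy(y_true, leaning_wrong)
print(loss_good, loss_bad)
```

The better the predicted probabilities match the labels, the lower the loss, which is what the optimizer tries to achieve.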

We will define the **optimizer** as the efficient stochastic gradient descent algorithm "adam".
This is a popular version of gradient descent because it automatically adapts its step size for each weight and gives good results in a wide range of problems.
[Gentle Introduction to the Adam Optimization Algorithm for Deep Learning](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)
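To get a feel for how adam "tunes itself", here is a rough sketch of the update rule on a toy one-dimensional problem, minimising (w - 3)^2. The function name `adam_minimise` is hypothetical, and the step size of 0.1 is deliberately large for this toy example rather than a recommended setting.

```python
import numpy as np

def adam_minimise(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimise a 1-D function with the Adam update rule, given its gradient."""
    w = float(w0)
    m = 0.0   # running average of the gradient (first moment)
    v = 0.0   # running average of the squared gradient (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction: both averages start at 0 and are too small early on
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        # The step is scaled by the recent gradient magnitude
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Gradient of (w - 3)^2 is 2 * (w - 3); the minimum is at w = 3
w_best = adam_minimise(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_best)
```

In a real network the same update is applied to every weight, each with its own `m` and `v`, which is why the effective step size differs per weight.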

Finally, because it is a classification problem, we will collect and report the classification accuracy, defined via the **metrics** argument.

In [None]:
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Fit the Keras Model

Training occurs over epochs and each epoch is split into batches.

***Epoch***: One pass through all of the rows in the training dataset.

***Batch***: One or more samples considered by the model within an epoch before weights are updated.

One epoch consists of one or more batches, depending on the chosen batch size, and the model is fit for many epochs.
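As a quick sanity check on the arithmetic, here is how the number of weight updates works out for the settings used in this notebook (150 epochs, batch size 10). The row count of 768 is an assumption based on the usual copy of this dataset; swap in `len(X)` to be sure.

```python
import math

n_rows = 768       # assumption: number of rows in the CSV
batch_size = 10
epochs = 150

# The final batch of each epoch is smaller when the rows do not divide evenly
batches_per_epoch = math.ceil(n_rows / batch_size)
last_batch_size = n_rows - (batches_per_epoch - 1) * batch_size
total_updates = batches_per_epoch * epochs

print(batches_per_epoch, last_batch_size, total_updates)
```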

These settings (the number of epochs and the batch size) can be chosen experimentally by trial and error.
We want to train the model enough so that it learns a good (or good enough) mapping of rows of input data to the output classification.
The model will always have some error, but the amount of error will level out after some point for a given model configuration.

*This is called **model convergence**.*
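A rough way to spot this levelling out is to compare the average loss over the most recent epochs with the average over the epochs before that. The sketch below uses a hypothetical helper `has_levelled_out` with made-up loss values; with Keras the per-epoch losses should be available from the object returned by `model.fit`, under `history.history["loss"]`.

```python
def has_levelled_out(losses, window=5, tol=1e-3):
    """True when the mean loss of the last `window` epochs improved by less
    than `tol` compared with the `window` epochs before them."""
    if len(losses) < 2 * window:
        return False   # not enough history to compare two windows
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    return (previous - recent) < tol

# Made-up loss curves: one still dropping, one that has gone flat
still_improving = [1.0 / (epoch + 1) for epoch in range(10)]
gone_flat = still_improving + [0.1] * 10

print(has_levelled_out(still_improving))
print(has_levelled_out(gone_flat))
```

The `window` and `tol` values are arbitrary defaults; a noisy loss curve would need a wider window.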

In [None]:
model.fit(X, y, epochs=150, batch_size=10)

**Note**:
Neural nets require ALL inputs to be numeric. Make sure that you encode your categoricals in some form before throwing your dataset at the model.
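This dataset is already fully numeric, so no encoding is needed here. For reference, this is a minimal sketch of one common option, one-hot encoding, using only numpy. The `colours` column is made up for illustration.

```python
import numpy as np

colours = np.array(["red", "green", "blue", "green"])   # made-up categorical column

# np.unique returns the sorted categories and, for each row, the index of its category
categories, codes = np.unique(colours, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(one_hot)
```

Each category becomes its own 0/1 column, so the model never sees an artificial ordering between categories.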

# Evaluate the Keras Model

The tutorial was super lazy here and just evaluated the model on the same dataset it was trained on.
**DO NOT DO THIS IN THE REAL WORLD.**
To their credit, they did say that in a real example you should split the data into train and test sets.
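For completeness, here is a minimal sketch of such a split using a shuffled index in numpy. The helper name `train_test_split_rows`, the 25 % test fraction and the demo arrays are my own choices; scikit-learn's `train_test_split` does the same job if you have it installed.

```python
import numpy as np

def train_test_split_rows(X, y, test_fraction=0.25, seed=0):
    """Shuffle the row indices once, then carve off a held-out test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Made-up stand-in data: 20 rows, 2 features
X_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = np.arange(20) % 2

X_train, X_test, y_train, y_test = train_test_split_rows(X_demo, y_demo)
print(X_train.shape, X_test.shape)
```

With the real data you would then fit on the training rows only and call `model.evaluate` on the test rows.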

In [None]:
_, accuracy = model.evaluate(X, y)
print(f"Accuracy: {accuracy * 100.0:.2f} %")

# Make predictions

In [None]:
predictions = model.predict(X)

# If instead you want your predictions as labels
threshold = 0.5
predictions_crisp = (predictions > threshold).astype(int)  # reuse the probabilities computed above

In [None]:
predictions

In [None]:
predictions_crisp

In [None]:
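# Flatten the (n, 1) predictions first, otherwise comparing with y (shape (n,)) broadcasts to (n, n)
print(f"The single model gives us an accuracy of: {100.0 * np.mean(predictions_crisp.ravel() == y.astype(int)):.2f} %")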

# My improvements / experiments

Because the weights start from random values and the rows are shuffled into batches during training, every training run produces a slightly different model.
This can be an advantage: as in ensemble machine learning, we can combine the outputs of several models.

In [None]:
models = {}
predictions = {}

# Generate multiple models
for i in range(0, 5):
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation="relu"))
    model.add(Dense(8, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))

    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

    model.fit(X, y, epochs=150, batch_size=10, verbose=0)

    _, accuracy = model.evaluate(X, y)
    print(f"Model {i} has accuracy: {accuracy * 100.0:.2f} %")
    key = f"model_{i}"
    models[key] = model
    predictions[key] = (model.predict(X) > threshold).astype(int)

In [None]:
from statistics import mode

# Majority vote: for each row, take the most common label across all models
group_prediction = []
for i in range(len(X)):
    group_prediction.append(mode([predictions[key][i][0] for key in predictions]))

In [None]:
print(f"Group think gives us an accuracy of: {100.0 * np.mean(np.array(group_prediction) == y.astype(int)):.2f} %")

By taking the **mode prediction** of a bunch of models for each data point, we can hopefully make a model that is better than the average single model.
As you can see, if we had just trained a single model we might have been unlucky and got one of the underperforming models.
Bear in mind that all of these accuracies are still measured on the training data.

In essence, this is the idea behind ensembling, which a lot of fancy research neural nets rely on.
By making a swarm of slightly different models that can cover each other's weaknesses, we end up with a prevailing group model that on average makes better choices.
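The same majority vote can be written in a vectorised way. Here is a small sketch with made-up votes from five hypothetical models on four rows; the numbers are contrived so that the effect is easy to see and are not results from the models above.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1])   # made-up true labels

# One row per model, one column per data point (made-up hard predictions)
votes = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
])

# With an odd number of binary voters there are no ties:
# the majority label is 1 exactly when more than half the models vote 1
majority = (votes.mean(axis=0) > 0.5).astype(int)

single_accuracies = (votes == y_true).mean(axis=1)
group_accuracy = (majority == y_true).mean()

print(majority)
print(single_accuracies)
print(group_accuracy)
```

In this contrived case every model gets at least one row wrong, but they are wrong on different rows, so the vote recovers all four labels.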