## Here we will explore the basics of neural network architecture for the widest audience possible. Machine learning tends to use complex jargon for concepts that can be defined in simpler terms, which I will try to do in this exercise.

We will look at the Pima Indians Diabetes dataset (https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes). First, we want to get the data onto our local machine and load it. To load it I like to use NumPy, which stores the data in an array of rows and columns, similar to an Excel spreadsheet, but without Excel's limits on sheet size.

First we want to install Anaconda, specifically with Python 3.6: https://www.anaconda.com/download/

Next, we want to install the packages that our work depends on.
In the terminal, we get these packages by typing: conda install -c conda-forge tensorflow keras

To download the file from the UC Irvine server, we go to our working directory in the terminal and use the command: wget "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"



In [266]:
# We want to import the scientific packages...
import numpy as np # here is the standard abbreviation for numpy
import keras
import tensorflow


In [267]:
# It is common practice to set a seed, so that any random assignment can be recreated by others.
# Uncomment the next line to fix NumPy's seed:
# np.random.seed(1)
# We use NumPy's loadtxt function to read the file into an Excel-like array that Python can use.
file = np.loadtxt("pima-indians-diabetes.data", delimiter=',')
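
To make the seeding comment above concrete, here is a minimal sketch of what a seed buys us. The variable names are my own, and it only covers NumPy's random numbers; Keras and its backend keep their own random state, so treat full reproducibility of training as something to check for your versions.

```python
import numpy as np

np.random.seed(1)            # fix the seed
first = np.random.rand(3)    # draw three pseudo-random numbers

np.random.seed(1)            # reset to the same seed
second = np.random.rand(3)   # draw three numbers again

# With the same seed, both draws are identical.
print(np.array_equal(first, second))
```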


In [268]:
print(file.shape)  # .shape lets us see the dimensions (rows x columns) of a NumPy array
# Much like head in R, slicing with file[:10] would display the first 10 records to get a feel for the data.


(768, 9)
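
If you want to try loadtxt before downloading anything, here is a small sketch that reads two made-up rows from an in-memory string instead of a file. The values and names are placeholders of mine, not rows from the real dataset.

```python
import io
import numpy as np

# Two made-up rows in the same comma-separated layout as the data file.
toy_text = "6,148,72\n1,85,66\n"

# np.loadtxt accepts a file name or a file-like object such as StringIO.
toy = np.loadtxt(io.StringIO(toy_text), delimiter=",")

print(toy.shape)  # rows x columns
print(toy.dtype)  # loadtxt parses values as floats by default
```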


We want to split the data between our eight inputs (Xs) and the output (Y). Column 8 (the ninth and last column, since Python counts from zero) has two possible outputs (classes): one or zero. Looking over the documentation (https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes), we see that this is the output that we are trying to predict. Normally we would explore the data and discuss hypotheses about the relationships between the inputs and the output, but let's make a neural network instead!

In [269]:
# Let's now split the data, and break down the operation.
# Think of the "[]" as looking up the values we specify, in the form
# file[row(s), column(s)]; a bare ":" means we want all records along that dimension.
# We want all the rows, but for X we exclude the last column, which is the output we are trying to predict.
# Thus we use 0:8 for the columns: everything from column 0 up to but excluding column 8.
X = file[:, 0:8]
Y = file[:, 8]

In [270]:
X.shape  # the dimensions of the X matrix: 768 records by 8 inputs

(768, 8)

In [271]:
Y.shape  # the dimensions of the Y vector: one output per record

(768,)
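
The same slicing pattern is easier to see on a tiny array. This is only a sketch with made-up numbers (three feature columns plus a label column), not the real data.

```python
import numpy as np

# A made-up 3 x 4 array: three feature columns followed by a label column.
toy = np.array([
    [1.0, 2.0, 3.0, 0.0],
    [4.0, 5.0, 6.0, 1.0],
    [7.0, 8.0, 9.0, 1.0],
])

features = toy[:, 0:3]  # all rows, columns 0, 1 and 2 (column 3 is excluded)
labels = toy[:, 3]      # all rows, column 3 only

print(features.shape)  # a 2-D matrix
print(labels.shape)    # a 1-D vector
print(labels)
```

Notice that picking a single column with a plain integer drops a dimension, which is why Y above has shape (768,) rather than (768, 1).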

Great! Now we want to define the general architecture of our network. Feel free to refer to this documentation religiously (https://keras.io/getting-started/sequential-model-guide/). Let's define what a neural network is. Broadly speaking, a neural network is composed of an input layer followed by layers that transform the data before the output layer, which in this case predicts the onset of diabetes given certain conditions. In simple terms, we can think of each unit in a layer as a logistic regression equation, which I have written in pseudocode: z = dot(transpose(weights), inputs) + bias

This should look familiar to any practitioner of linear regression; if not, here are some helpful refreshers: https://www.khanacademy.org/math/precalculus/precalc-matrices/multiplying-matrices-by-matrices/v/multiplying-a-matrix-by-a-matrix, http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf
With this equation we then calculate the sigmoid activation function: sigmoid(z) = 1/(1+e^(-z)). If z is large and positive we get a number very close to 1, and if z is large and negative we get a number very close to zero, which is what lets us use it as a binary classifier (0 or 1). Concretely, I like to define neural networks as stacks of logistic regression layers. This isn't technically correct, but it is a good starting point before exploring the other components.
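
Here is a minimal NumPy sketch of the two formulas above, the weighted sum z and the sigmoid. The weights, bias, and inputs are hypothetical numbers I picked so the arithmetic is easy to follow.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and inputs for a single unit with three inputs.
weights = np.array([0.5, -0.25, 0.1])
bias = 0.2
inputs = np.array([2.0, 4.0, 10.0])

z = np.dot(weights, inputs) + bias  # weighted sum of the inputs plus the bias
print(z)
print(sigmoid(z))      # the unit's output, between 0 and 1

print(sigmoid(10.0))   # large positive z: very close to 1
print(sigmoid(-10.0))  # large negative z: very close to 0
print(sigmoid(0.0))    # z of zero sits exactly in the middle
```
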
Some more housekeeping: I have spoken to several practicing data scientists, and the common practice is to use Keras as the front end with TensorFlow as the backend (they have found TensorFlow to be much faster than its competitor Theano in most applications). Keras is very useful because it automates many of the lower-level tasks of TensorFlow.

We want to minimize our cost function. In the case of logistic regression, that is the cross-entropy cost function, which penalizes incorrect classifications. We can think of this process as taking steps towards the bottom of a hill: each epoch (one pass through the entire data set) reports the loss, which is the value of the cost function, and we want to see it decrease.
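
To see how cross entropy penalizes incorrect classifications, here is a hand-rolled sketch. It is not Keras's own implementation; the function name, the eps clipping value, and the example predictions are assumptions of mine.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy between 0/1 labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # keep log() away from zero
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
mostly_right = np.array([0.9, 0.1, 0.8, 0.2])  # probabilities leaning the right way
mostly_wrong = np.array([0.1, 0.9, 0.2, 0.8])  # probabilities leaning the wrong way

loss_right = binary_cross_entropy(y_true, mostly_right)
loss_wrong = binary_cross_entropy(y_true, mostly_wrong)
print(loss_right)  # small loss
print(loss_wrong)  # much larger loss
```

The more confident a wrong prediction is, the faster the loss grows, which is the pressure that pushes the network towards better answers.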

Now let's get to Keras! The Sequential model is Keras's way of defining a network as a stack of layers.

In [272]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

network = Sequential()# Here is our starting point.

Now let's add a layer to the network. A Dense layer is defined by its number of units and an activation function. The number of units sets the number of nodes in the layer, and the activation function can be the same one that we defined for logistic regression. There is also a bevy of other useful parameters (https://keras.io/layers/core/). Moreover, we should note that it performs the operation: output = activation(dot(input, weight) + bias)
which should look familiar! Let's now make the layers in our network.
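
The Dense operation above can be sketched in a few lines of NumPy. This is a stand-in I wrote to show the shapes involved, not what Keras runs internally; the random inputs and weights are placeholders.

```python
import numpy as np

def relu(z):
    """ReLU activation: negative values become zero, positive values pass through."""
    return np.maximum(0.0, z)

def dense(inputs, weights, bias, activation):
    """output = activation(dot(input, weight) + bias)"""
    return activation(np.dot(inputs, weights) + bias)

rng = np.random.RandomState(0)
batch = rng.rand(5, 8)       # 5 made-up records with 8 inputs each
weights = rng.randn(8, 12)   # one column of 8 weights for each of the 12 units
bias = np.zeros(12)          # one bias per unit

hidden = dense(batch, weights, bias, relu)
print(hidden.shape)          # one row per record, one column per unit
print((hidden >= 0).all())   # ReLU never outputs a negative number
```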


In [273]:
network.add(Dense(12, input_dim = 8, activation = 'relu' )) # ReLU tends to reach a minimum faster than sigmoid activation

In [274]:
network.add(Dense(8, activation = 'relu')) # I picked rather arbitrary values, feel free to tinker!


Technically, these are two hidden layers, which are defined as the layers between the input and the output layer. So the fun has to end, and we are going to define our output layer. For binary classification (0 or 1), best practice is to use the sigmoid function in the output layer. Here the output layer takes the data as transformed by the two hidden layers and produces the prediction of our classifier, which is called a multi-layer perceptron.

In [275]:
network.add(Dense(1, activation='sigmoid'))
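
With all three layers defined, we can count how many numbers the network has to learn. The helper below is my own back-of-the-envelope sketch; it assumes each Dense unit has one weight per input plus one bias.

```python
def dense_params(n_inputs, n_units):
    """Weights (one per input per unit) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 8 inputs -> 12 units -> 8 units -> 1 output unit
layer_sizes = [8, 12, 8, 1]
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(per_layer)       # [108, 104, 9]
print(sum(per_layer))  # 221
```

You can compare this against the parameter counts that network.summary() prints.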

Now we want to configure the learning process, which means calling the compile method. We specify the loss, which is the cost function we want to minimize (in our case binary_crossentropy), and the optimizer, which decides how we take steps towards minimizing it (in our case adam). We also list the metrics we want reported during training.

In [276]:
network.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'binary_accuracy'])
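
To get a feel for what an optimizer does, here is a sketch of plain gradient descent on a one-parameter toy cost. Note that this is not Adam, which adapts its step sizes as it goes; the cost function, starting point, and learning rate are all made up for illustration.

```python
def cost(w):
    """A toy bowl-shaped cost with its lowest point at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Slope of the toy cost at w."""
    return 2.0 * (w - 3.0)

w = 0.0              # arbitrary starting point
learning_rate = 0.1  # size of each step downhill
costs = [cost(w)]

for step in range(50):
    w -= learning_rate * gradient(w)  # step against the slope
    costs.append(cost(w))

print(w)          # ends up very close to 3
print(costs[0])   # cost at the start
print(costs[-1])  # cost after 50 steps, close to zero
```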

After compiling the network we can now run it! We specify our inputs and outputs, then the number of full passes (epochs) we want to make through the data. One thing to note is that training such a classifier on the same data over and over means that it may overfit and fail to generalize to new data. Luckily, there are ways to rectify that; however, those tools are beyond the scope of this tutorial. For now let's run this model and see what happens to our loss and accuracy during the iterations.

In [265]:
network.fit(X, Y, epochs=100, batch_size=30)

Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x120c32f28>
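
The fit call above uses batch_size=30 on 768 records, so it is worth working out how many weight updates that implies. This is just arithmetic of mine, assuming the last, smaller batch in each epoch is still used.

```python
import math

n_records = 768
batch_size = 30
epochs = 100

steps_per_epoch = math.ceil(n_records / batch_size)               # batches per epoch
last_batch = n_records - (steps_per_epoch - 1) * batch_size      # size of the final batch
total_updates = steps_per_epoch * epochs                          # updates over the whole run

print(steps_per_epoch)  # 26
print(last_batch)       # 18
print(total_updates)    # 2600
```

The History object printed above is what fit returns. In Keras, its history attribute is a dictionary of per-epoch values such as the loss, so assigning the result of fit to a variable lets you inspect or plot them afterwards.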

Not the highest accuracy, but we are also working with very little data, and we did not explore the art of tweaking the knobs in Keras or validating our model with test data. If you liked this, I would recommend checking out a few resources from notable AI rockstars who are driving innovation across industries: http://cs.stanford.edu/people/karpathy/, https://www.quora.com/Whats-the-most-effective-way-to-get-started-with-deep-learning

I will try to make more of these that explore Keras in more depth!


Credit for this code goes to me, Mark Conrad
https://www.linkedin.com/in/mark-conrad/ 
