## Artificial Neural Networks Applications Using TensorFlow
<br> Prepared for the Artificial Neural Network Seminar at Columbia University, Sep 2017
>Author: Tristan Eisenhart
<br>te2252@columbia.edu


If you haven't installed TensorFlow on your machine yet, please follow these instructions: https://www.tensorflow.org/install/

In [2]:
# Introduction to Neural Networks and TensorFlow -- currently using version 1.2 of TensorFlow
import tensorflow as tf

##### Graphs, variables and operations
We'll learn through a couple of examples, but before jumping into the practice, let's have a look at a couple of features that should be understood before starting to develop in TensorFlow:

1. Graphs -- a computational graph in TensorFlow is a series of operations arranged into a graph of nodes. The development of neural network algorithms in TensorFlow is done in two distinct steps: 
> 1. building the computational graph, which focuses on creating and defining the nodes of the graph
> 1. running the computational graph, which focuses on evaluating the resulting graph through what is called a session

1. Variables -- multiple types of variables exist in TensorFlow -- the most common ones are presented below:

    > Constant variables: float or int variables that will remain constant and that we wish to declare in step 1 <br>
    > Zeros: tensors with a specific shape that are initialized with zeros <br>
    > Placeholders: tensors for which we will pass a value in the future. We need to specify a shape when declaring placeholders <br>

1. Operations -- TensorFlow has built-in functions for basic and complex operations
> Addition: tf.add() <br>
> Matrix multiplication: tf.matmul() <br>
> ... <br>
> But also prebuilt functions, optimization methods, network designs, cells (LSTM), etc. <br>
> The TensorFlow documentation can be found at https://www.tensorflow.org/api_docs/python/
    
Now that we have (very) quickly defined these three elements of TensorFlow, let's look at an example that illustrates the concept of the session. Once again, a session is used to evaluate the graph that is built in the first step mentioned above.
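
Before turning to TensorFlow itself, the build-then-run split can be mimicked in a few lines of plain Python. This is only a toy sketch: the `Node` class and the `constant`, `placeholder`, `add`, `multiply` and `run` helpers are hypothetical names made up for this illustration, and it is not how TensorFlow is implemented internally. It only shows that building the graph computes nothing, and that values appear once the graph is run with a feed dictionary.

```python
# Toy illustration of "build first, run later"; NOT TensorFlow's implementation.
class Node:
    """A graph node: a constant, a placeholder, or an operation on other nodes."""
    def __init__(self, op=None, inputs=(), value=None):
        self.op = op          # function applied to the evaluated inputs, or None
        self.inputs = inputs  # upstream nodes
        self.value = value    # fixed value for constants, None otherwise

def constant(value):
    return Node(value=value)

def placeholder():
    return Node()

def add(a, b):
    return Node(op=lambda x, y: x + y, inputs=(a, b))

def multiply(a, b):
    return Node(op=lambda x, y: x * y, inputs=(a, b))

def run(node, feed_dict=None):
    """Evaluate `node` recursively, reading placeholder values from feed_dict."""
    feed_dict = feed_dict or {}
    if node in feed_dict:
        return feed_dict[node]
    if node.op is None:
        if node.value is None:
            raise ValueError("placeholder was not fed a value")
        return node.value
    return node.op(*(run(n, feed_dict) for n in node.inputs))

# Step 1: build the graph. Nothing is computed yet.
a = constant(7.0)
x = placeholder()
y = add(multiply(a, x), constant(1.0))

# Step 2: run the graph, feeding a value into the placeholder.
result = run(y, feed_dict={x: 3.0})
print(result)  # 7.0 * 3.0 + 1.0 = 22.0
```

The same graph can be run again with a different `feed_dict` without being rebuilt, which is the role placeholders play in the TensorFlow code that follows.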
<br>
##### Step 1: building the computational graph
Here, we are simply passing variables to the graph and multiplying the matrices node3 and node4. Notice that when you run the cell below, TensorFlow does not display the values of the variables but their characteristics (name, shape and type). To see their values, you will have to evaluate the graph that you have created in step 1. 

In [7]:
# equivalent to node1 = 7.0
node1 = tf.constant(7.0, dtype = tf.float32) 

# tensor of shape (2,3) initialized with zeros
node2 = tf.zeros((2,3), dtype = tf.int32) 

# a placeholder is an empty tensor with a specified shape and type for which we will pass values in the future (when evaluating the graph)
node3 = tf.placeholder(dtype = tf.float32, shape = (2,3)) 

# a matrix with size [3,2] and random gaussian variables
node4 = tf.random_normal([3, 2], seed=1234)

# the matrix multiplication operator
node5 = tf.matmul(node3,node4)

# Displaying the type of our variables
print(node1)
print(node2)
print(node3)
print(node4)
print(node5)

Tensor("Const_4:0", shape=(), dtype=float32)
Tensor("zeros_4:0", shape=(2, 3), dtype=int32)
Tensor("Placeholder_4:0", shape=(2, 3), dtype=float32)
Tensor("random_normal_4:0", shape=(3, 2), dtype=float32)
Tensor("MatMul_4:0", shape=(2, 2), dtype=float32)


##### Step 2: evaluating the computational graph through a session
When we want to evaluate the variables passed into our graph, we call a session using tf.Session(). Notice that in step 1 we created a placeholder variable. A placeholder is an empty tensor with a prespecified shape and type, for which we pass a value later, when evaluating the graph. When you develop more advanced neural networks that take training data, you will need placeholders to train your graph in batches (more on this later). For now, notice how the actual values of the variables are displayed when you run the cell below.

In [8]:
# Step 2: Running the computational graph

# Calling a session to evaluate the graph
sess = tf.Session()

# Creating some data to feed into our placeholder 
x = [[0,1,2],[3,4,5]]

# Displaying the values of our variables by running the graph using sess.run(..., feed_dict={...})
print("node1:\n", sess.run(node1))
print("node2:\n", sess.run(node2))

# This is how you will want to feed values into your placeholders when running your graph
print("node3:\n", sess.run(node3, feed_dict={node3:x}))
print("node4:\n", sess.run(node4))
print("node5:\n", sess.run(node5, feed_dict={node3:x}))

node1:
 7.0
node2:
 [[0 0 0]
 [0 0 0]]
node3:
 [[ 0.  1.  2.]
 [ 3.  4.  5.]]
node4:
 [[ 0.51340485 -0.25581399]
 [ 0.65199131  1.39236379]
 [ 0.37256798  0.20336303]]
node5:
 [[  3.55750608   2.70003891]
 [ 12.15140343   9.36564636]]
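
One detail in this output is easy to miss: the printed `node5` is not the product of `x` with the printed `node4`. This is consistent with `tf.random_normal` being a random op that draws fresh values on every `sess.run` call, so the two print statements saw two different draws. The NumPy sketch below simply redoes the multiplication with the numbers copied from the output above; nothing in it comes from TensorFlow itself.

```python
import numpy as np

# Values copied from the printed output above.
x = np.array([[0, 1, 2], [3, 4, 5]], dtype=np.float32)
node4_printed = np.array([[0.51340485, -0.25581399],
                          [0.65199131,  1.39236379],
                          [0.37256798,  0.20336303]], dtype=np.float32)
node5_printed = np.array([[3.55750608, 2.70003891],
                          [12.15140343, 9.36564636]], dtype=np.float32)

# Product of x with the node4 values that were printed.
product = x @ node4_printed
print(product)

# Compare with the node5 values that were printed.
matches = np.allclose(product, node5_printed, atol=1e-4)
print("printed node5 equals x @ printed node4:", matches)
```

To see a `node4` and a `node5` that agree with each other, both can be fetched in a single call, for example `sess.run([node4, node5], feed_dict={node3: x})`, since each op is evaluated once per call.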


This concludes our very short introduction to using TensorFlow. Many prebuilt functions and operations are described in the TensorFlow documentation at https://www.tensorflow.org/api_docs/python/. Let's now look at our first application, the exclusive-or (XOR) function. This is a nonlinear problem that a linear model cannot solve. For a reminder of what the XOR function is, you can visit https://en.wikipedia.org/wiki/Exclusive_or. As you will notice in the 2D graph below, linear functions cannot approximate the XOR function, as there is no way to separate the data linearly (try to draw a line that separates the black dots from the grey dots and you will see that this is not possible). 

<img src="https://www.researchgate.net/profile/Michael_Siomau/publication/232642531/figure/fig1/AS:300360385220613@1448622904913/FIG-1-The-feature-space-of-XOR-function-is-two-dimensional-and-discrete-each-feature.png">

<i> source: https://www.researchgate.net/profile/Michael_Siomau/publication/232642531/figure/fig1/AS:300360385220613@1448622904913/FIG-1-The-feature-space-of-XOR-function-is-two-dimensional-and-discrete-each-feature.png </i>

That is why Neural Networks (that rely on a nonlinear activation function) are very good at approximating the XOR function. Let's take a look at how a simple Feed-Forward Neural Net can be developed and applied in the context of the XOR function using TensorFlow.
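
The claim that no straight line separates the two XOR classes can be checked numerically. The sketch below uses the classic two-input XOR truth table (not the three-feature dataset used later) and tries every linear rule `w1*x1 + w2*x2 + b > 0` on a coarse grid. The grid bounds and step are arbitrary choices made for this illustration, and a grid search is a sanity check rather than a proof.

```python
from itertools import product

# Two-input XOR truth table: (x1, x2) -> label
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def accuracy(w1, w2, b):
    """Fraction of XOR points classified correctly by the rule w1*x1 + w2*x2 + b > 0."""
    correct = 0
    for (x1, x2), label in xor_data:
        predicted = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
        correct += int(predicted == label)
    return correct / len(xor_data)

# Assumed search grid: weights and bias from -2.0 to 2.0 in steps of 0.25.
grid = [i / 4 for i in range(-8, 9)]
best = max(accuracy(w1, w2, b) for w1, w2, b in product(grid, repeat=3))
print("best accuracy of a linear rule on the grid:", best)  # 0.75
```

No linear rule gets more than three of the four points right. A short argument confirms this holds beyond the grid: XOR would require `b <= 0`, `w1 + b > 0`, `w2 + b > 0` and `w1 + w2 + b <= 0`, and adding the two middle inequalities contradicts the other two.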

### Application 1 - The XOR function

Step 0: Let's create some data for our first application

In [14]:
# Importing relevant packages
import numpy as np
import pandas as pd

# Our dataset consists of 7 samples with 3 binary features each and a binary output value (either 0 or 1)
df = pd.DataFrame({'x1':np.array([0,1,0,0,1,1,1]),'x2':np.array([0,0,0,1,1,1,0]),
                   'x3':np.array([1,1,0,0,1,1,1]),'output':np.array([1,1,0,1,0,0,1])})

print("Let's take a look at our data:\n",df)

Let's take a look at our data:
    output  x1  x2  x3
0       1   0   0   1
1       1   1   0   1
2       0   0   0   0
3       1   0   1   0
4       0   1   1   1
5       0   1   1   1
6       1   1   0   1


A typical neural network is made of layers, nodes, weights and biases (see the image below). In the feed-forward architecture, information flows forward without going backwards. The image below displays an input layer, two hidden layers and an output layer.
<img src="https://camo.githubusercontent.com/269f47b8185a2ca349ead57db511250553fd918b/687474703a2f2f63733233316e2e6769746875622e696f2f6173736574732f6e6e312f6e657572616c5f6e6574322e6a706567"> </img>

<i> source: <a href="https://camo.githubusercontent.com/269f47b8185a2ca349ead57db511250553fd918b/687474703a2f2f63733233316e2e6769746875622e696f2f6173736574732f6e6e312f6e657572616c5f6e6574322e6a706567"> link </a> </i>

At each layer, you will encounter an activation function (adding nonlinearity), and a matrix multiplication of the sort:

<br>

$$ \text{layer output} = \sigma(XW + \text{biases}),\ \text{where } \sigma \text{ is a nonlinear activation function}$$

<br>
For a list of all activation functions supported in TensorFlow, visit https://www.tensorflow.org/versions/r0.12/api_docs/python/nn/activation_functions_ <br>
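
To make the layer formula concrete, here is a minimal NumPy sketch of a single layer with a sigmoid activation. The input values, the random seed and the variable names are assumptions chosen for illustration; the shapes (3 input features, 4 nodes) mirror the network built later in this tutorial.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)           # arbitrary seed, for illustration only

X = np.array([[0, 0, 1],
              [1, 0, 1]], dtype=float)   # 2 observations, 3 features
W = rng.normal(size=(3, 4))              # 3 input features -> 4 nodes
biases = rng.normal(size=4)              # one bias per node

layer_output = sigmoid(X @ W + biases)   # biases are broadcast across the rows
print(layer_output.shape)  # (2, 4)
```

Each row of `layer_output` holds the 4 node activations for one observation, which is why the weight matrix has shape [number of input features, number of nodes].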

When optimizing a neural network's parameters, it is necessary to come up with a cost function. This cost function, which normally represents the error associated with the classification or prediction task (such as the number of wrongly classified observations, or how far our prediction misses the observed output), will be minimized using an optimization algorithm. Typically, the Adam algorithm or Stochastic Gradient Descent can be used to come up with optimal (not necessarily globally optimal) values for our weight and bias parameters.

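In our example, we will use the Mean Square Error. Note that, as written here and in the code, the squared errors are summed over the observations rather than averaged: <br>

$$ MSE = \sum_i (y_i - \widehat{y}_i)^2$$ <br>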

In other words, the closer our predictions are to the true outputs, the smaller our mean squared error will be. To optimize our parameters, we will use the stochastic gradient descent method (because the whole dataset is fed at every iteration, each step here is in effect a full-batch gradient descent step). We'll run 10,000 iterations to minimize our Mean Squared Error. The architecture of the network will be a single hidden layer with 4 nodes, and we will predict a continuous variable. At the end of our while loop, we hope to have output values close to 1 for observations that have a class = 1 in our input data and values close to 0 for observations that have a class = 0. We will use a sigmoid activation function and we will initialize our weights and biases using Gaussian random variables. The learning rate that we will use is equal to 0.01 (think of the learning rate as the size of the step that gradient descent takes at each iteration). Notice that we will have a matrix of size [number of input features, number of nodes] to represent our weights in the hidden layer and a matrix of size [number of nodes, number of classes] for our output layer. 
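
For readers who want to see what the optimizer does under the hood, the sketch below writes out the same kind of network and gradient descent update by hand in NumPy. It is an illustrative re-implementation under assumptions, not TensorFlow's code: the variable names and the NumPy seed are chosen here, the random draws will not match TensorFlow's, and the loop is shortened to 2,000 iterations to keep it quick.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The 7 observations of the tutorial's dataset (columns x1, x2, x3) and their outputs.
X = np.array([[0, 0, 1], [1, 0, 1], [0, 0, 0], [0, 1, 0],
              [1, 1, 1], [1, 1, 1], [1, 0, 1]], dtype=float)
y = np.array([[1], [1], [0], [1], [0], [0], [1]], dtype=float)

rng = np.random.default_rng(7)   # NumPy seed; does not reproduce TensorFlow's draws
W_hidden = rng.normal(size=(3, 4))
b_hidden = rng.normal(size=4)
W_output = rng.normal(size=(4, 1))
b_output = rng.normal(size=1)
learning_rate = 0.01

def forward(inputs):
    """Return the hidden activations and the network output."""
    hidden = sigmoid(inputs @ W_hidden + b_hidden)
    output = sigmoid(hidden @ W_output + b_output)
    return hidden, output

def loss(output):
    """Sum of squared errors, as in the formula above."""
    return float(np.sum((output - y) ** 2))

initial_loss = loss(forward(X)[1])

for _ in range(2000):
    hidden, output = forward(X)
    # Backpropagation: chain rule through the squared error and both sigmoids.
    d_z_out = 2.0 * (output - y) * output * (1.0 - output)      # shape (7, 1)
    d_z_hid = (d_z_out @ W_output.T) * hidden * (1.0 - hidden)  # shape (7, 4)
    # Gradient descent step: move each parameter against its gradient.
    W_output -= learning_rate * (hidden.T @ d_z_out)
    b_output -= learning_rate * d_z_out.sum(axis=0)
    W_hidden -= learning_rate * (X.T @ d_z_hid)
    b_hidden -= learning_rate * d_z_hid.sum(axis=0)

final_loss = loss(forward(X)[1])
print("loss before:", initial_loss, "loss after:", final_loss)
```

In the TensorFlow version, the two backpropagation lines and the four update lines are replaced by a single call to the optimizer's `minimize`, which derives the gradients from the graph automatically.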

With all that said, let's dig into how we build our graph.

##### Step 1: building the computational graph

In [26]:
# Resetting to default graph -- especially useful when running multiple sessions
tf.reset_default_graph()

# Declaring parameters / architecture of neural network
num_input_features = 3 # represents the number of features in the input data
num_hidden_nodes = 4 # the number of nodes used in the 1st (and only) hidden layer of our network
num_classes = 1 # the number of features in the output data -- this is equivalent to a regression problem, we are not trying to predict a class but a number, therefore there is only 1 class
learning_rate = 0.01 # parameter used in the optimization process
seed = 7 # to replicate results

# Declaring placeholders for input data and true outputs
inputs = tf.placeholder(tf.float32, shape=[None, 3]) # inputs size will be size of dataset * num_input_features
true_outputs = tf.placeholder(tf.float32, shape=[None, 1]) # output size will be size of dataset * num_classes

# Randomly initializing weights and biases using normal distribution
weights = {
    'hidden': tf.Variable(tf.random_normal([num_input_features, num_hidden_nodes], seed=seed)),
    'output': tf.Variable(tf.random_normal([num_hidden_nodes, num_classes], seed=seed))}

biases = {
    'hidden': tf.Variable(tf.random_normal([num_hidden_nodes], seed=seed)),
    'output': tf.Variable(tf.random_normal([num_classes], seed=seed))}

# Computing layer_1 and the output layer (this is a single-layer feed forward neural net) with a sigmoid activation function
# The introduction of an activation function allows for non-linearity
# Layers are simply equal to activation_function(XW + biases)

layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(inputs,weights['hidden']),biases['hidden']))
output_layer = tf.nn.sigmoid(tf.add(tf.matmul(layer_1,weights['output']),biases['output']))

# Now that the architecture is designed, let's look at the optimization process -- our objective / cost / error function is the mean square error 
# We use an iterative optimization process, here the Stochastic Gradient Descent Method that learns at a predefined learning_rate
error = tf.subtract(output_layer, true_outputs)
mean_square_error = tf.reduce_sum(tf.square(error))
train = tf.train.GradientDescentOptimizer(learning_rate).minimize(mean_square_error)

Notice that all we have done is create some weights and biases variables, declared input placeholders and expressed the equations that will be used to formulate the network as well as to optimize the parameters. Now let's jump to step 2, in which we will run our 10,000 iterations and display our results.

##### Step 2: Running our computational graph

In [28]:
# Creating a session to run the graph
sess = tf.Session()

# Initializing all variables - you need to initialize your variables before running your graph
init = tf.global_variables_initializer()
sess.run(init)

# Let's limit the number of iterations 
iter_ = 0
print("Starting optimization")

while iter_ <= 10000:
    
    # Here we are running the optimization using the Stochastic Gradient Descent Method. Remember that we 
    # need to feed input data to our placeholders
    _ = sess.run(train, feed_dict={inputs:np.array(df[['x1','x2','x3']]),true_outputs:np.array(df[['output']])})
    
    # Evaluating the Mean Squared Error
    mse = sess.run(mean_square_error, feed_dict={inputs:np.array(df[['x1','x2','x3']]),true_outputs:np.array(df[['output']])})
    
    # Displaying results every 2000 iterations
    if iter_ % 2000 == 0:
        # Evaluating the output layer -- what is predicted for each observation
        out = sess.run(output_layer, feed_dict={inputs:np.array(df[['x1','x2','x3']])})
        
        # Displaying the mean square error
        print("Iteration:",iter_, "Mean_square_error:",mse, "\nOutput\n",out)
    
    iter_ += 1

print("Very cool, we are finished with the optimization!")

Starting optimization
Iteration: 0 Mean_square_error: 2.26706 
Output
 [[ 0.83837086]
 [ 0.88519567]
 [ 0.80585104]
 [ 0.77517498]
 [ 0.87023813]
 [ 0.87023813]
 [ 0.88519567]]
Iteration: 2000 Mean_square_error: 1.40898 
Output
 [[ 0.66873139]
 [ 0.70373935]
 [ 0.59839165]
 [ 0.3687821 ]
 [ 0.42848191]
 [ 0.42848191]
 [ 0.70373935]]
Iteration: 4000 Mean_square_error: 1.06147 
Output
 [[ 0.76707333]
 [ 0.8082152 ]
 [ 0.41577893]
 [ 0.26200432]
 [ 0.32874194]
 [ 0.32874194]
 [ 0.8082152 ]]
Iteration: 6000 Mean_square_error: 0.797579 
Output
 [[ 0.85751957]
 [ 0.88369256]
 [ 0.28574413]
 [ 0.2936894 ]
 [ 0.29128924]
 [ 0.29128924]
 [ 0.88369256]]
Iteration: 8000 Mean_square_error: 0.425461 
Output
 [[ 0.89129716]
 [ 0.89529473]
 [ 0.25673971]
 [ 0.53540641]
 [ 0.23447354]
 [ 0.23447354]
 [ 0.89529473]]
Iteration: 10000 Mean_square_error: 0.179643 
Output
 [[ 0.90059078]
 [ 0.91688555]
 [ 0.17294322]
 [ 0.72849256]
 [ 0.16173878]
 [ 0.16173878]
 [ 0.91688555]]
Very cool, we are finished wi

Notice that we start with a Mean Squared Error greater than 2.2. As iterations are run, and as the network learns the XOR function, our model makes predictions that are closer and closer to the real output values of our observations. In turn, this leads to a cost function that decreases with iterations. In the final run, notice how our predictions separate: observations 1, 2, 4 and 7 are converging to their true value of 1, while observations 3, 5 and 6 converge to their true value of 0. Our error function is also at its smallest at iteration 10,000.
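
The error reported at iteration 10,000 can be recomputed by hand from the printed predictions, which is a useful sanity check on what the cost function measures. The sketch below copies the numbers from the output above; the 0.5 cutoff used to turn predictions into classes is an assumption added for illustration and is not part of the original code.

```python
# Predictions printed at iteration 10000 and the true outputs from the dataset.
predictions = [0.90059078, 0.91688555, 0.17294322, 0.72849256,
               0.16173878, 0.16173878, 0.91688555]
true_outputs = [1, 1, 0, 1, 0, 0, 1]

# Sum of squared errors, matching tf.reduce_sum(tf.square(error)) in the graph.
sum_squared_error = sum((p - t) ** 2 for p, t in zip(predictions, true_outputs))
print(round(sum_squared_error, 6))

# Thresholding at 0.5 (assumed cutoff) turns the predictions into classes.
predicted_classes = [1 if p > 0.5 else 0 for p in predictions]
print(predicted_classes == true_outputs)  # True
```

The recomputed value agrees with the reported 0.179643 up to rounding, and every thresholded prediction lands on the correct class.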

We have successfully approximated the XOR function using a single-layer feed-forward neural network. 

In the next application, we will take a look at the very famous digit recognition problem using the MNIST dataset. We'll create a network that recognizes handwritten digits and classifies them from 0 to 9. I'll try to publish an update with that application soon.

Hope you enjoyed this short tutorial on Artificial Neural Networks in TensorFlow :)

# THAT'S IT FOR NOW