# Feed Forward Neural Networks

In the previous notebook, I implemented a convolutional neural network by following an example from the Keras GitHub repository. As a first exposure to neural networks and to Keras, that experience introduced me to the language and structure of building Keras models. At the same time, I worked to understand the basics of convolutional neural networks through research and instruction.

Now I would like to take a step back and test a simple, non-convolutional neural network. Such a network may be called "feed forward", "dense", or "fully connected". I would like to see what happens with a single densely connected layer, and then perhaps two or three densely connected layers.

### Softmax Activation

To make the network classify images into one of the ten classes, a softmax activation is required after the final output layer. Each dense node on its own outputs an unbounded float, which suits regression but cannot be read as a class probability. Applying softmax to the outputs is similar to turning a linear model into a logistic model: the raw scores become probabilities that sum to one.
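As a minimal sketch of what softmax does to the raw outputs, here is a plain Python version. It is not Keras's own implementation, and the scores are made up for illustration.

```python
import math

def softmax(logits):
    """Map raw scores to positive values that sum to 1."""
    peak = max(logits)
    # Subtracting the largest score avoids overflow and does not change the result.
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs of a 3-node Dense layer (not taken from the model here).
raw_scores = [2.0, 1.0, 0.1]
probs = softmax(raw_scores)

print(probs)
print(sum(probs))
```

The largest raw score still receives the largest probability, but the outputs are now positive and sum to one, so they can be compared against one-hot class labels.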

In [1]:
%run __initremote__.py

Using TensorFlow backend.


x_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples


In [2]:
def print_summary(model):
    for l in model.layers:
        print (l.name, l.input_shape,'==>',l.output_shape)
    print (model.summary())

In [18]:
early_stop = keras.callbacks.EarlyStopping(monitor='val_acc', 
                                           min_delta=0, 
                                           patience=5, 
                                           verbose=0, 
                                           mode='auto')

In [3]:
model = Sequential()

In [4]:
model.add(Flatten(input_shape=x_train.shape[1:]))

In [5]:
model.add(Dense(3072))

In [6]:
model.add(Dense(10))

In [7]:
model.add(Activation('softmax'))

In [8]:
opt = keras.optimizers.RMSprop(lr=0.0001, decay=1e-6)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

In [9]:
print_summary(model)

flatten_1 (None, 32, 32, 3) ==> (None, 3072)
dense_1 (None, 3072) ==> (None, 3072)
dense_2 (None, 3072) ==> (None, 10)
activation_1 (None, 10) ==> (None, 10)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 3072)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 3072)              9440256   
_________________________________________________________________
dense_2 (Dense)              (None, 10)                30730     
_________________________________________________________________
activation_1 (Activation)    (None, 10)                0         
Total params: 9,470,986
Trainable params: 9,470,986
Non-trainable params: 0
_________________________________________________________________
None


In [10]:
model.fit(x_train, y_train,
          batch_size=32,
          epochs=5,
          validation_data=(x_test, y_test),
          callbacks=[early_stop],
          shuffle=True)

Train on 50000 samples, validate on 10000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f92acce2828>

Clearly, a single dense layer with as many neurons as inputs, followed by one output layer, does not produce extraordinary results. However, at 35% validation accuracy, this model already does much better than a simple guess. A baseline of 10% accuracy corresponds to picking one of the ten classes and guessing it every time (always picking truck, for example). That is not what we see here. This kind of network is evidently sophisticated enough to pick up on some features and to start making naive guesses as to which class each image belongs to.
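The 10% baseline can be checked with a few lines of Python. This is only a sketch: the label list is synthetic and assumes the 10,000 test images are split evenly across the ten classes, and `majority_baseline_accuracy` is a helper name I made up.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Synthetic stand-in for the test labels: 10 classes, 1000 examples each (assumed balanced).
labels = [class_id for class_id in range(10) for _ in range(1000)]

baseline = majority_baseline_accuracy(labels)
print(baseline)
```

Any model worth keeping should beat this number on the validation set.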

This makes sense. By unraveling the image data into a single 3072x1 vector, I am training on a one-dimensional band of data whose spatial structure has been discarded. It is unlikely a human could extract any meaningful value from this vector. Moreover, without any activation layer between the dense layers, no non-linearity is introduced, so the current model is nothing but a linear classifier with 3072 features (betas) feeding each of the 10 outputs.
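To see why two dense layers without an activation collapse into one linear map, here is a small sketch with toy 2x2 weight matrices. The weights are invented and biases are omitted for brevity. With biases the result would be affine rather than linear, but it would still collapse in the same way.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]

# Toy weights standing in for two stacked Dense layers (illustrative values only).
W1 = [[1.0, -2.0], [0.5, 3.0]]
W2 = [[2.0, 1.0], [-1.0, 4.0]]
x = [3.0, -1.0]

two_layers = matvec(W2, matvec(W1, x))       # layer 2 applied to layer 1's output
one_layer = matvec(matmul(W2, W1), x)        # a single layer with weights W2 @ W1
with_relu = matvec(W2, relu(matvec(W1, x)))  # same stack with a ReLU in between

print(two_layers, one_layer, with_relu)
```

The first two results are identical, so the stack adds parameters but no expressive power. Once a ReLU sits between the layers, the output can no longer be reproduced by a single weight matrix.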

What happens when we add an activation layer to this model? Let's try it below.

In [14]:
model = Sequential()
model.add(Flatten(input_shape=x_train.shape[1:]))
model.add(Dense(3072))
model.add(Activation('relu'))
model.add(Dense(10))
model.add(Activation('softmax'))
opt = keras.optimizers.RMSprop(lr=0.0001, decay=1e-6)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

In [15]:
print_summary(model)

flatten_2 (None, 32, 32, 3) ==> (None, 3072)
dense_3 (None, 3072) ==> (None, 3072)
activation_2 (None, 3072) ==> (None, 3072)
dense_4 (None, 3072) ==> (None, 10)
activation_3 (None, 10) ==> (None, 10)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_2 (Flatten)          (None, 3072)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 3072)              9440256   
_________________________________________________________________
activation_2 (Activation)    (None, 3072)              0         
_________________________________________________________________
dense_4 (Dense)              (None, 10)                30730     
_________________________________________________________________
activation_3 (Activation)    (None, 10)                0         
Total params: 9,470,986
Trainable params: 9,470,986
Non-trainable params:

In [17]:
model.fit(x_train, y_train,
              batch_size=32,
              epochs=100,
              validation_data=(x_test, y_test),
              callbacks=[early_stop],
              shuffle=True)

Train on 50000 samples, validate on 10000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100


<keras.callbacks.History at 0x7f92ab257208>

Incredible. With just a single hidden layer, a Rectified Linear Unit activation layer, an output layer and a softmax activation layer, the neural network built above reaches 50% accuracy on validation in 12 epochs. 

Would adding a second fully connected layer help? What if the second layer has fewer neurons, allowing the model to start to build some selectivity over the input features?

In [19]:
model = Sequential()
model.add(Flatten(input_shape=x_train.shape[1:]))
model.add(Dense(3072))
model.add(Activation('relu'))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(10))
model.add(Activation('softmax'))
opt = keras.optimizers.RMSprop(lr=0.0001, decay=1e-6)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

In [20]:
print_summary(model)

flatten_3 (None, 32, 32, 3) ==> (None, 3072)
dense_5 (None, 3072) ==> (None, 3072)
activation_4 (None, 3072) ==> (None, 3072)
dense_6 (None, 3072) ==> (None, 512)
activation_5 (None, 512) ==> (None, 512)
dense_7 (None, 512) ==> (None, 10)
activation_6 (None, 10) ==> (None, 10)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_3 (Flatten)          (None, 3072)              0         
_________________________________________________________________
dense_5 (Dense)              (None, 3072)              9440256   
_________________________________________________________________
activation_4 (Activation)    (None, 3072)              0         
_________________________________________________________________
dense_6 (Dense)              (None, 512)               1573376   
_________________________________________________________________
activation_5 (Activation)    (None, 512)               0      

In [22]:
model.fit(x_train, y_train,
              batch_size=32,
              epochs=100,
              validation_data=(x_test, y_test),
              callbacks=[early_stop],
              shuffle=True)

Train on 50000 samples, validate on 10000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100


<keras.callbacks.History at 0x7f92842f48d0>

Surprisingly, adding a second fully connected hidden layer didn't significantly improve model performance.
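For a sense of how large these layers are, the parameter counts reported by `model.summary()` can be reproduced by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_params` is mine, not part of Keras.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

first_hidden = dense_params(3072, 3072)   # Dense(3072) on the flattened image
second_hidden = dense_params(3072, 512)   # Dense(512) on top of it
direct_output = dense_params(3072, 10)    # Dense(10) fed directly by 3072 units

print(first_hidden, second_hidden, direct_output)
```

These values match the `Param #` column in the summaries above (9440256, 1573376 and 30730), so the second hidden layer adds roughly 1.5 million weights without a matching gain in validation accuracy.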

Luckily, convolutional neural networks have proven to be extremely effective at image classification. In the next notebook, I will start to build some CNN architectures that have proven useful in the past.

### Multi-Layer Perceptron

What happens if I don't flatten the data before sending it into the network? How would a fully connected network do if the data is flattened after the first layer?

Let's test it out. 

In [24]:
model = Sequential()
model.add(Dense(64, input_shape=x_train.shape[1:]))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(32))
model.add(Activation('relu'))
model.add(Dense(10))
model.add(Activation('softmax'))
opt = keras.optimizers.RMSprop(lr=0.0001, decay=1e-6)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

In [25]:
print_summary(model)

dense_8 (None, 32, 32, 3) ==> (None, 32, 32, 64)
activation_7 (None, 32, 32, 64) ==> (None, 32, 32, 64)
flatten_4 (None, 32, 32, 64) ==> (None, 65536)
dense_9 (None, 65536) ==> (None, 32)
activation_8 (None, 32) ==> (None, 32)
dense_10 (None, 32) ==> (None, 10)
activation_9 (None, 10) ==> (None, 10)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_8 (Dense)              (None, 32, 32, 64)        256       
_________________________________________________________________
activation_7 (Activation)    (None, 32, 32, 64)        0         
_________________________________________________________________
flatten_4 (Flatten)          (None, 65536)             0         
_________________________________________________________________
dense_9 (Dense)              (None, 32)                2097184   
_________________________________________________________________
activation_8 (Activation)    (None, 32)

In [26]:
model.fit(x_train, y_train,
              batch_size=32,
              epochs=100,
              validation_data=(x_test, y_test),
              callbacks=[early_stop],
              shuffle=True)

Train on 50000 samples, validate on 10000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100


<keras.callbacks.History at 0x7f92acafe5c0>

This network does just as well as the larger fully connected networks above. Using fewer nodes in the hidden layers does not impair model performance, and validation scores top out at about 50%, as with the previous models.
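The summary above hints at why the first layer is so small (256 parameters). When a Dense layer receives an unflattened image, it acts only on the last axis, so the same weights are applied to every pixel's channel vector. The sketch below mimics that behaviour in plain Python on a made-up 2x2 image with 3 channels and 2 units. It illustrates the idea and is not Keras code.

```python
def dense_on_last_axis(image, weights, biases):
    """Apply one shared weight matrix to each pixel's channel vector.

    image:   H x W x C nested lists
    weights: C x units nested lists
    biases:  list of length units
    """
    return [[[sum(c * weights[k][u] for k, c in enumerate(pixel)) + biases[u]
              for u in range(len(biases))]
             for pixel in row]
            for row in image]

# Made-up inputs: a 2x2 "image" with 3 channels, mapped to 2 units per pixel.
image = [[[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
         [[1.0, 2.0, 3.0], [4.0, 0.0, 0.0]]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
biases = [0.5, -0.5]

out = dense_on_last_axis(image, weights, biases)

# Sizes for the layer in this notebook: 3 channels -> 64 units, shared by all pixels.
shared_params = 3 * 64 + 64
flattened_size = 32 * 32 * 64

print(out)
print(shared_params, flattened_size)
```

Identical pixels produce identical outputs wherever they sit in the image, and the spatial dimensions pass through unchanged. That is why the summary shows an output shape of `(None, 32, 32, 64)` and why flattening afterwards yields 65536 features.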

Next, let's look at 2D convolutional layers and see how they can help the neural net classify images by creating filters and pooling features together.