# Feed Forward Neural Networks

In the previous notebook, I implemented a convolutional neural network by following an example from the Keras GitHub repository. As my first exposure to neural networks and to Keras, that exercise introduced me to the vocabulary and structure of Keras models. At the same time, I worked to understand the basics of convolutional neural networks through research and instruction.

Now I would like to take a step back and test a simple, non-convolutional neural network. Such a network is variously called "feed forward", "dense", or "fully connected". I would like to see what happens with a single densely connected layer, and then perhaps with two or three.

In [5]:
%run __init__.py

Using TensorFlow backend.


X_train: (50000, 32, 32, 3), y_train: (50000, 10)
X_test: (10000, 32, 32, 3), y_test: (10000, 10)
Class labels: ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']


In [6]:
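def print_summary(model):
    # Print each layer's name with its input and output shapes.
    for l in model.layers:
        print(l.name, l.input_shape, '==>', l.output_shape)
    # model.summary() prints the table itself and returns None,
    # which is why a trailing "None" appears in the output.
    print(model.summary())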

In [7]:
model = Sequential()

In [9]:
model.add(Flatten(input_shape=X_train.shape[1:]))

In [10]:
model.add(Dense(3072))

In [11]:
model.add(Dense(10))

In [12]:
opt = keras.optimizers.RMSprop(lr=0.0001, decay=1e-6)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

In [16]:
print_summary(model)

flatten_2 (None, 32, 32, 3) ==> (None, 3072)
dense_1 (None, 3072) ==> (None, 3072)
dense_2 (None, 3072) ==> (None, 10)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_2 (Flatten)          (None, 3072)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 3072)              9440256   
_________________________________________________________________
dense_2 (Dense)              (None, 10)                30730     
Total params: 9,470,986
Trainable params: 9,470,986
Non-trainable params: 0
_________________________________________________________________
None
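
The parameter counts in the summary above can be checked by hand. A dense layer with `n_inputs` inputs and `n_units` units holds one weight per input-unit pair plus one bias per unit. The helper name `dense_params` below is my own, introduced only for this arithmetic sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

flat_size = 32 * 32 * 3                  # Flatten turns (32, 32, 3) into 3072 values
hidden = dense_params(flat_size, 3072)   # first Dense layer
output = dense_params(3072, 10)          # output Dense layer

print(flat_size, hidden, output, hidden + output)
# 3072 9440256 30730 9470986
```

Almost all of the roughly 9.5 million parameters sit in the first dense layer, since it connects every one of the 3072 inputs to every one of the 3072 units.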


In [17]:
model.fit(X_train, y_train,
              batch_size=32,
              epochs=5,
              validation_data=(X_test, y_test),
              shuffle=True)

Train on 50000 samples, validate on 10000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f91986f7cf8>

Clearly, a single dense layer with as many neurons as inputs, followed by one output layer, produces extraordinarily bad results. At 10% accuracy, the model performs no better than a baseline that picks one of the ten classes and predicts it every time (always picking truck, for example).
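
To make the baseline concrete, here is a small sketch of the "always guess one class" strategy. It assumes the 10,000 test labels are split evenly across the ten classes (1,000 each). That balance is an assumption on my part, and the `labels` list is a stand-in rather than the real `y_test`.

```python
# Assumption: 10,000 test labels, 1,000 per class (balanced), standing in for y_test.
n_classes = 10
labels = [c for c in range(n_classes) for _ in range(1000)]

# Always predict the last class (index 9, 'truck' in the label list printed earlier).
constant_prediction = 9
correct = sum(1 for y in labels if y == constant_prediction)
baseline_accuracy = correct / len(labels)

print(baseline_accuracy)  # 0.1
```

Under that assumption, any model scoring around 10% has learned nothing it could not get from a constant guess.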

This makes a lot of sense, however. By unraveling the image data into a single 3072x1 vector, I discard the spatial arrangement of the pixels and train on nothing but a one-dimensional band of values. It is unlikely a human could extract any meaningful information from this row vector. Not only that, but without any activation layer, no non-linearity is introduced to the model, meaning the current model is nothing but a linear classifier with 3072 features (betas) feeding 10 output units.
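
The claim that the model is only a linear classifier can be checked numerically. The NumPy sketch below uses small made-up sizes in place of 3072 → 3072 → 10, and random weights rather than anything trained. It shows that two dense layers with no activation between them compute exactly the same function as one dense layer whose weights are `W1 @ W2`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes standing in for 3072 -> 3072 -> 10 (hypothetical, kept small for speed).
n_in, n_hidden, n_out = 12, 12, 4
W1, b1 = rng.normal(size=(n_in, n_hidden)), rng.normal(size=n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_out)), rng.normal(size=n_out)

x = rng.normal(size=(5, n_in))  # batch of 5 flattened inputs

# Two Dense layers with no activation in between.
two_layers = (x @ W1 + b1) @ W2 + b2

# The same map written as a single Dense layer.
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))  # True
```

So however wide the hidden layer is, it adds no representational power until a non-linearity sits between the two matrix multiplications.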

What happens when we add an activation layer to this model? Let's try it below.

In [20]:
model = Sequential()
model.add(Flatten(input_shape=X_train.shape[1:]))
model.add(Dense(3072))
model.add(Activation('relu'))
model.add(Dense(10))
opt = keras.optimizers.RMSprop(lr=0.0001, decay=1e-6)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

In [21]:
print_summary(model)

flatten_4 (None, 32, 32, 3) ==> (None, 3072)
dense_5 (None, 3072) ==> (None, 3072)
activation_2 (None, 3072) ==> (None, 3072)
dense_6 (None, 3072) ==> (None, 10)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_4 (Flatten)          (None, 3072)              0         
_________________________________________________________________
dense_5 (Dense)              (None, 3072)              9440256   
_________________________________________________________________
activation_2 (Activation)    (None, 3072)              0         
_________________________________________________________________
dense_6 (Dense)              (None, 10)                30730     
Total params: 9,470,986
Trainable params: 9,470,986
Non-trainable params: 0
_________________________________________________________________
None


In [22]:
model.fit(X_train, y_train,
              batch_size=32,
              epochs=5,
              validation_data=(X_test, y_test),
              shuffle=True)

Train on 50000 samples, validate on 10000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f918a44ed30>

The model now starts to eke out a fraction of a percent of better performance, but it is still pretty much just guessing. Would adding a second fully connected layer help? What if the second layer has fewer neurons, allowing the model to start building some selectivity over the input features?

In [23]:
model = Sequential()
model.add(Flatten(input_shape=X_train.shape[1:]))
model.add(Dense(3072))
model.add(Activation('relu'))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(10))
opt = keras.optimizers.RMSprop(lr=0.0001, decay=1e-6)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

In [25]:
print_summary(model)

flatten_5 (None, 32, 32, 3) ==> (None, 3072)
dense_7 (None, 3072) ==> (None, 3072)
activation_3 (None, 3072) ==> (None, 3072)
dense_8 (None, 3072) ==> (None, 512)
activation_4 (None, 512) ==> (None, 512)
dense_9 (None, 512) ==> (None, 10)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_5 (Flatten)          (None, 3072)              0         
_________________________________________________________________
dense_7 (Dense)              (None, 3072)              9440256   
_________________________________________________________________
activation_3 (Activation)    (None, 3072)              0         
_________________________________________________________________
dense_8 (Dense)              (None, 512)               1573376   
_________________________________________________________________
activation_4 (Activation)    (None, 512)               0         
___________________________________

In [26]:
model.fit(X_train, y_train,
              batch_size=32,
              epochs=5,
              validation_data=(X_test, y_test),
              shuffle=True)

Train on 50000 samples, validate on 10000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f9198fd19b0>

Surprisingly, adding a second fully connected hidden layer hurt model performance rather than helping it.
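
One detail may be worth ruling out before reading too much into these runs. The final `Dense(10)` layer in the models above has no softmax activation, while Keras's `categorical_crossentropy` by default expects each output row to be a probability distribution. I did not test whether this contributes to the near-chance accuracy. The plain-Python sketch below, with made-up logits, only illustrates what a softmax does to raw outputs and how the cross-entropy against a one-hot label is then computed.

```python
import math

def softmax(logits):
    """Map raw scores to positive values that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot, probs):
    """Categorical cross-entropy for a single example."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

logits = [2.0, -1.0, 0.5]   # hypothetical raw outputs of a final Dense layer
target = [1, 0, 0]          # one-hot label for class 0

probs = softmax(logits)
loss = cross_entropy(target, probs)
print(probs, loss)
```

In Keras terms, this would correspond to following the last dense layer with `Activation('softmax')`. Raw outputs can be negative, so without that step the logarithm in the loss is not applied to a proper probability.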

Clearly, feed forward neural networks alone do terribly on image classification. I could try adding layer after layer or training for countless epochs, but that seems like a pointless pursuit, especially with accuracy decreasing after just two layers.

Luckily, convolutional neural networks (CNNs) have proven to be extremely effective at image classification. In the next notebook, I will start to build some CNN architectures that have historically proven useful.

### Multi-Layer Perceptron

What happens if I don't flatten the data before sending it into the network? How would a fully connected network do if the data is flattened after the first layer?

Let's test it out. 

In [3]:
32*32 / 4 / 4

64.0

In [7]:
model = Sequential()
model.add(Dense(256, input_shape=X_train.shape[1:]))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dense(10))
opt = keras.optimizers.RMSprop(lr=0.0001, decay=1e-6)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

In [8]:
print_summary(model)

dense_1 (None, 32, 32, 3) ==> (None, 32, 32, 256)
activation_1 (None, 32, 32, 256) ==> (None, 32, 32, 256)
flatten_1 (None, 32, 32, 256) ==> (None, 262144)
dense_2 (None, 262144) ==> (None, 64)
activation_2 (None, 64) ==> (None, 64)
dense_3 (None, 64) ==> (None, 10)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 32, 32, 256)       1024      
_________________________________________________________________
activation_1 (Activation)    (None, 32, 32, 256)       0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 262144)            0         
_________________________________________________________________
dense_2 (Dense)              (None, 64)                16777280  
_________________________________________________________________
activation_2 (Activation)    (None, 64)                0         
_______
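
The shapes in the summary above follow from how `Dense` treats inputs with more than two dimensions: the kernel is applied along the last axis only, so every pixel's 3 channel values are mapped to 256 values by the same weights. The NumPy sketch below mimics that behaviour with random data. The arrays `x`, `W` and `b` are stand-ins I made up, not the model's actual weights.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(2, 32, 32, 3))   # 2 hypothetical images, shaped like rows of X_train
W = rng.normal(size=(3, 256))         # kernel maps 3 channels -> 256 units
b = np.zeros(256)

out = x @ W + b                       # matmul broadcasts over the batch and pixel axes
print(out.shape)                      # (2, 32, 32, 256)

# Every pixel is transformed independently by the same kernel.
pixel = x[0, 5, 7]                    # one pixel's 3 channel values
print(np.allclose(out[0, 5, 7], pixel @ W + b))  # True

dense_1_params = 3 * 256 + 256        # 1024, matching dense_1 in the summary
flat_size = 32 * 32 * 256             # 262144, matching flatten_1
dense_2_params = flat_size * 64 + 64  # 16777280, matching dense_2
print(dense_1_params, flat_size, dense_2_params)
```

The first layer therefore never mixes information between pixels, and it is cheap at 1,024 parameters. The cost moves to the layer after `Flatten`, which has to connect 262,144 values to 64 units.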

In [9]:
model.fit(X_train, y_train,
              batch_size=32,
              epochs=5,
              validation_data=(X_test, y_test),
              shuffle=True)

Train on 50000 samples, validate on 10000 samples
Epoch 1/5
 1504/50000 [..............................] - ETA: 510s - loss: 9.3459 - acc: 0.0898

KeyboardInterrupt: 