In [1]:
# dataset
from keras.datasets import mnist

# NN architecture
from keras import models
from keras import layers

# set up labels
from keras.utils import to_categorical

# helper functions
import numpy as np
import matplotlib.pyplot as plt

Using TensorFlow backend.


In [2]:
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

In [3]:
# train image dataset
train_images.shape

(60000, 28, 28)

In [4]:
# label dataset
train_labels

array([5, 0, 4, ..., 5, 6, 8], dtype=uint8)

In [5]:
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape = (28 * 28,)))
network.add(layers.Dense(10, activation = 'softmax'))

In [6]:
train_images = train_images.reshape((60000, 28*28))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255

In [7]:
# compile the model: choose the optimizer, loss, and metric
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy']
               )

In [8]:
# one-hot encode the categorical labels
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
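
For intuition, here is a minimal NumPy sketch of what one-hot encoding does to labels such as `5, 0, 4`. The helper `one_hot` is a hypothetical stand-in written for illustration; it is not the Keras implementation of `to_categorical`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) float32 matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=np.int64)
    encoded = np.zeros((labels.shape[0], num_classes), dtype=np.float32)
    # Row i gets a 1 in the column given by labels[i].
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

sample = one_hot([5, 0, 4], num_classes=10)
print(sample.shape)
print(sample[0])  # the single 1 sits at index 5
```

Each row matches the 10-unit softmax output of the network, which is the shape `categorical_crossentropy` expects for the targets.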

In [22]:
network.fit(train_images, train_labels, epochs = 5, batch_size=32)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0xb34d129b0>

## batch size and epoch

```python
batch = train_images[:128]
```
This training process is called SGD, short for stochastic gradient descent. If the batch size equals the size of the whole training set, it is **batch** gradient descent. If the batch size equals 1, it is **true** SGD. If the batch size is in between, we call it **mini-batch** SGD.

Batch size is the number of training instances seen by the model during each training iteration.

Since we cannot process all the data at once, we divide the dataset into small pieces, feed them to the computer one by one, and update the weights of the neural network at the end of every step.
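
As a quick check of what this means for the 60000 MNIST training images, the sketch below counts the weight updates in one epoch for a few batch sizes. The helper name `steps_per_epoch` is made up for this note.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of weight updates needed to see every sample once."""
    # The last batch may be smaller than batch_size, hence the ceiling.
    return math.ceil(num_samples / batch_size)

for batch_size in (1, 32, 128, 60000):
    print(batch_size, steps_per_epoch(60000, batch_size))
```

With `batch_size=32`, as in the `fit` call above, one epoch is 1875 updates; with the full dataset as one batch it is a single update.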

### why do we use more than one epoch?
It gives the NN more passes through the data, so the weights are updated more times, and the fit moves from underfitting to optimal to overfitting.

### so what is a good number of epochs?
It depends on how diverse your data is. As a rule of thumb, the more diverse the data, the more epochs are needed.
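
One common way to pick the number in practice is to watch the loss on held-out data and stop once it no longer improves. The sketch below is only an illustration of that idea: `best_epoch` is a hypothetical helper, and the loss values are made up, not measured on this model.

```python
def best_epoch(val_losses, patience=2):
    """Return the 1-based epoch with the lowest validation loss, giving up
    once the loss has failed to improve for `patience` epochs in a row."""
    best_loss = float("inf")
    best = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # the curve has turned towards overfitting
    return best

# Made-up validation losses, for illustration only.
made_up_losses = [0.30, 0.15, 0.10, 0.09, 0.11, 0.12, 0.08]
chosen = best_epoch(made_up_losses, patience=2)
print(chosen)
```

The scan stops after two epochs without improvement, so later values in the list are never looked at.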

### how does batch size affect accuracy? (see Question 2 for details)
For now, we only know empirically that a larger batch results in quicker training but usually lower accuracy.

### how do we update a batch of data at once?
It doesn't matter much whether you use a batch size of 100, 256, 2048, or 10000 images, as long as the batch fits in the memory of your (GPU) hardware.
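
A minimal sketch of why a whole batch can be handled in one step, assuming a plain linear model with squared-error loss rather than the Keras network in this notebook: the batch gradient is the average of the per-example gradients, and one matrix product computes it for all examples at once. The data here is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 4))   # one batch: 128 examples, 4 features
y = rng.normal(size=128)
theta = np.zeros(4)

# Loop version: compute one gradient per example and accumulate.
grad_loop = np.zeros(4)
for x_i, y_i in zip(X, y):
    grad_loop += (x_i @ theta - y_i) * x_i
grad_loop /= len(X)

# Vectorised version: the same average gradient in one matrix product.
grad_vec = X.T @ (X @ theta - y) / len(X)

print(np.allclose(grad_loop, grad_vec))
```

The vectorised form is what makes large batches fast on a GPU: the cost is one big matrix operation instead of many small ones.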


### Question here
1. How is a batch of data updated at once?

2. Empirically, in this example there is not much difference in the results between a large and a small batch size. However, training is significantly faster with a larger batch size. So what is the real difference between them?

### Answer
1. The gradients of all examples in the batch are summed, and the result is applied to the weights in a single update.

2. From Andrew Ng's CS229 notes: the update for a single training example is known as the LMS (least mean squares) update rule. Summing it over all $m$ training examples gives

   $\theta_j := \theta_j + \alpha\sum_{i=1}^{m}(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}$

   This method looks at every example in the entire training set on every step, so it is called **batch gradient descent**.

   $\theta_j := \theta_j + \alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}$ (for each $i$)

   Here the parameters are updated based on a single example, so it is called **stochastic gradient descent**. Per update, a batch update takes longer than an SGD update; per pass over the whole dataset, however, SGD takes much longer to finish.

   ***In batch gradient descent***, you compute the gradient over the entire dataset, averaging over potentially a vast amount of information. That takes a lot of memory. But the real handicap is that the batch gradient trajectory can land you in a bad spot (a saddle point).

   ***In pure SGD***, on the other hand, you update your parameters by subtracting the gradient computed on a single instance of the dataset. Since it is based on one random data point, it is very noisy and may go off in a direction far from the batch gradient. However, the noisiness is exactly what you want in non-convex optimization, because it helps you escape from saddle points or local minima. The disadvantage is that it is terribly inefficient, and you need to loop over the entire dataset many times to find a good solution.

   ***The mini-batch methodology*** is a compromise that injects enough noise into each gradient update while still achieving relatively speedy convergence.
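
The two update rules above differ only in how many examples feed each step, so one loop can cover batch, stochastic, and mini-batch descent. The sketch below is a toy under stated assumptions: a linear hypothesis $h_\theta(x) = \theta^T x$, noise-free synthetic data, and a hypothetical helper `lms_fit`. It also divides the summed update by the batch size, which the formulas above do not, so that the same `alpha` works for every batch size.

```python
import numpy as np

def lms_fit(X, y, batch_size, alpha=0.1, epochs=200, seed=0):
    """Fit theta with the LMS rule, using `batch_size` examples per update."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)          # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = y[idx] - X[idx] @ theta  # y^(i) - h_theta(x^(i))
            # Averaged (not summed) update: an assumption of this sketch.
            theta += alpha * X[idx].T @ error / len(idx)
    return theta

# Synthetic, noise-free data generated from a known theta.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 2))
true_theta = np.array([2.0, -3.0])
y = X @ true_theta

for batch_size in (1, 32, 200):  # pure SGD, mini-batch, full batch
    print(batch_size, lms_fit(X, y, batch_size))
```

On a convex toy problem like this all three settings reach the same answer; the differences discussed above (noise, saddle points, memory) only show up on larger, non-convex models.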

In [23]:
# test our model
network.evaluate(test_images, test_labels)



[0.13549849958815474, 0.9787]

In [11]:
# visualise a sample image; the method written in the book does not work here

#digit = train_images[0]
#plt.imshow(digit.reshape(28,28), cmap='Greys', interpolation='None')
#plt.show()

In [12]:
batch = train_images[:128]

In [13]:
batch.shape

(128, 784)
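
The slice above takes only the first batch. A minimal sketch of walking through a whole array batch by batch follows; `iterate_batches` is a hypothetical helper, and a small all-zero array stands in for `train_images` to keep the example light.

```python
import numpy as np

def iterate_batches(data, batch_size):
    """Yield consecutive slices of `data` with at most `batch_size` rows."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

# Stand-in for train_images: 1000 flattened 28x28 images.
stand_in = np.zeros((1000, 28 * 28), dtype=np.float32)

shapes = [batch.shape for batch in iterate_batches(stand_in, 128)]
print(len(shapes), shapes[0], shapes[-1])
```

The final batch is smaller than the rest whenever the number of samples is not a multiple of the batch size.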