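The [Universal Approximation Theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem) states that a feed-forward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of $\mathbb{R}^n$ to arbitrary accuracy, under mild assumptions on the activation function.  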
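
To make the statement concrete, here is a minimal sketch (not part of the original experiment) of a one-hidden-layer ReLU network approximating `sin(x)` on `[-pi, pi]`. The target function, the evenly spaced kink positions, and the helper name `fit_one_hidden_layer` are assumptions chosen for illustration, and the output weights are solved by least squares rather than trained by gradient descent.

```python
import numpy as np

def fit_one_hidden_layer(num_hidden, num_samples=2000):
    """Fit sin(x) on [-pi, pi] with one hidden ReLU layer; return the RMS error."""
    x = np.linspace(-np.pi, np.pi, num_samples)
    target = np.sin(x)
    # Hidden layer: unit k computes relu(x - t_k), with kinks t_k evenly spaced (an assumption).
    knots = np.linspace(-np.pi, np.pi, num_hidden, endpoint=False)
    hidden = np.maximum(x[:, None] - knots[None, :], 0.0)
    # Output layer: linear map plus bias, solved in closed form by least squares.
    design = np.hstack([hidden, np.ones((num_samples, 1))])
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    prediction = design @ weights
    return float(np.sqrt(np.mean((prediction - target) ** 2)))

for n in (5, 50):
    print(n, fit_one_hidden_layer(n))
```

In this sketch, widening the hidden layer shrinks the approximation error. It only illustrates that a good set of weights exists; it says nothing about whether a given optimizer will find them.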

This notebook checks how well a neural network with one hidden layer classifies the samples in MNIST.

In [1]:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline


In [3]:
mnist = input_data.read_data_sets('./mnist',one_hot=True)
train_images = mnist.train.images
test_images = mnist.test.images
print("Shape of the training images: ", train_images.shape)
print("Shape of the test images: ", test_images.shape)

Extracting ./mnist/train-images-idx3-ubyte.gz
Extracting ./mnist/train-labels-idx1-ubyte.gz
Extracting ./mnist/t10k-images-idx3-ubyte.gz
Extracting ./mnist/t10k-labels-idx1-ubyte.gz
Shape of the training images:  (55000, 784)
Shape of the test images:  (10000, 784)


Let's train the network twice, first with 20 hidden neurons and then with 1000, and see what happens.
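
For a sense of scale, the number of trainable parameters in each configuration can be worked out from the layer shapes used in the cells below (784 inputs, 10 classes). The helper `count_parameters` is a hypothetical name introduced only for this arithmetic.

```python
def count_parameters(num_inputs, num_hidden, num_outputs):
    """Weights plus biases of a network with one fully connected hidden layer."""
    hidden_layer = num_inputs * num_hidden + num_hidden    # W1 and b1
    output_layer = num_hidden * num_outputs + num_outputs  # W2 and b2
    return hidden_layer + output_layer

for num_hidden in (20, 1000):
    print(num_hidden, count_parameters(28 * 28, num_hidden, 10))
# prints:
# 20 15910
# 1000 795010
```

The wider network has roughly 50 times as many parameters to fit.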

In [5]:
num_of_hidden_neurons = 20
num_of_pixels = 28*28
num_of_classes = 10
epochs = 20000  # number of mini-batch training steps, not full passes over the data
batch_size = 32
learning_rate = 0.01

tf.reset_default_graph()
X = tf.placeholder(tf.float32,shape=[None,num_of_pixels])
y_ = tf.placeholder(tf.float32,shape=[None,num_of_classes])

W1 = tf.get_variable(name='W1',shape=[num_of_pixels,num_of_hidden_neurons],initializer=tf.contrib.layers.variance_scaling_initializer())
b1 = tf.get_variable(name='b1',shape=[num_of_hidden_neurons],initializer=tf.constant_initializer(0.1))
h1 = tf.matmul(X,W1) + b1
h1 = tf.nn.relu(h1)

W2 = tf.get_variable(name='W2',shape=[num_of_hidden_neurons,num_of_classes],initializer=tf.contrib.layers.variance_scaling_initializer())
b2 = tf.get_variable(name='b2',shape=[num_of_classes],initializer=tf.constant_initializer(0.1))
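y = tf.matmul(h1,W2) + b2
# Note: this ReLU clips the logits to be non-negative before the softmax loss.
# Logits are usually left unactivated; the line is kept because the output below was produced with it.
y = tf.nn.relu(y)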

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=y,labels=y_))
optimizer = tf.train.AdamOptimizer(learning_rate)
train_step = optimizer.minimize(loss)

correct_prediction = tf.equal(tf.argmax(y,1),tf.argmax(y_,1))
num_of_correct_prediction = tf.reduce_sum(tf.cast(correct_prediction,tf.int32))
accuracy = tf.reduce_mean(tf.cast(correct_prediction,tf.float32))

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

for epoch in range(epochs):
    batch_x,batch_y = mnist.train.next_batch(batch_size)
    # Note: read_data_sets already scales pixels to [0, 1] by default, so /255.0 shrinks them further.
    _,training_loss = sess.run([train_step,loss],feed_dict={X:batch_x/255.0,y_:batch_y})
    
    if epoch%2000 == 0:
        training_accuracy = accuracy.eval(session=sess,feed_dict={X:batch_x/255.0,y_:batch_y})
        print("step {0}/{1}, training accuracy {2}".format(epoch,epochs,training_accuracy))

testing_accuracy = accuracy.eval(session=sess,feed_dict={X:mnist.test.images/255.0,y_:mnist.test.labels})
num_of_correct = sess.run(num_of_correct_prediction,feed_dict={X:mnist.test.images/255.0,y_:mnist.test.labels})
print("Testing accuracy by the neural network with one hidden layer containing 20 neurons: %g"%(testing_accuracy))
print('Number of digits in test set being correctly classified {0}/{1}'.format(num_of_correct,test_images.shape[0]))

step 0/20000, training accuracy 0.125
step 2000/20000, training accuracy 0.90625
step 4000/20000, training accuracy 0.96875
step 6000/20000, training accuracy 0.9375
step 8000/20000, training accuracy 0.875
step 10000/20000, training accuracy 0.9375
step 12000/20000, training accuracy 0.90625
step 14000/20000, training accuracy 0.96875
step 16000/20000, training accuracy 0.96875
step 18000/20000, training accuracy 1.0
Testing accuracy by the neural network with one hidden layer containing 20 neurons: 0.9493
Number of digits in test set being correctly classified 9493/10000


In [4]:
num_of_hidden_neurons = 1000
num_of_pixels = 28*28
num_of_classes = 10
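epochs = 20000  # number of mini-batch training steps, not full passes over the data
batch_size = 632  # larger than the batch size of 32 used in the 20-neuron run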
learning_rate = 0.01

tf.reset_default_graph()
X = tf.placeholder(tf.float32,shape=[None,num_of_pixels])
y_ = tf.placeholder(tf.float32,shape=[None,num_of_classes])

W1 = tf.get_variable(name='W1',shape=[num_of_pixels,num_of_hidden_neurons],initializer=tf.contrib.layers.variance_scaling_initializer())
b1 = tf.get_variable(name='b1',shape=[num_of_hidden_neurons],initializer=tf.constant_initializer(0.1))
h1 = tf.matmul(X,W1) + b1
h1 = tf.nn.relu(h1)

W2 = tf.get_variable(name='W2',shape=[num_of_hidden_neurons,num_of_classes],initializer=tf.contrib.layers.variance_scaling_initializer())
b2 = tf.get_variable(name='b2',shape=[num_of_classes],initializer=tf.constant_initializer(0.1))
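y = tf.matmul(h1,W2) + b2
# Note: this ReLU clips the logits to be non-negative before the softmax loss.
# Logits are usually left unactivated; the line is kept because the output below was produced with it.
y = tf.nn.relu(y)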

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=y,labels=y_))
optimizer = tf.train.AdamOptimizer(learning_rate)
train_step = optimizer.minimize(loss)

correct_prediction = tf.equal(tf.argmax(y,1),tf.argmax(y_,1))
num_of_correct_prediction = tf.reduce_sum(tf.cast(correct_prediction,tf.int32))
accuracy = tf.reduce_mean(tf.cast(correct_prediction,tf.float32))

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

for epoch in range(epochs):
    batch_x,batch_y = mnist.train.next_batch(batch_size)
    # Note: read_data_sets already scales pixels to [0, 1] by default, so /255.0 shrinks them further.
    _,training_loss = sess.run([train_step,loss],feed_dict={X:batch_x/255.0,y_:batch_y})
    
    if epoch%2000 == 0:
        training_accuracy = accuracy.eval(session=sess,feed_dict={X:batch_x/255.0,y_:batch_y})
        print("step {0}/{1}, training accuracy {2}".format(epoch,epochs,training_accuracy))

testing_accuracy = accuracy.eval(session=sess,feed_dict={X:mnist.test.images/255.0,y_:mnist.test.labels})
num_of_correct = sess.run(num_of_correct_prediction,feed_dict={X:mnist.test.images/255.0,y_:mnist.test.labels})
print("Testing accuracy by the neural network with one hidden layer containing 1000 neurons: %g"%(testing_accuracy))
print('Number of digits in test set being correctly classified {0}/{1}'.format(num_of_correct,test_images.shape[0]))

step 0/20000, training accuracy 0.21360759437084198
step 2000/20000, training accuracy 0.09810126572847366
step 4000/20000, training accuracy 0.1139240488409996
step 6000/20000, training accuracy 0.0901898741722107
step 8000/20000, training accuracy 0.10601265728473663
step 10000/20000, training accuracy 0.10126582533121109
step 12000/20000, training accuracy 0.0917721539735794
step 14000/20000, training accuracy 0.08227848261594772
step 16000/20000, training accuracy 0.10601265728473663
step 18000/20000, training accuracy 0.10443037748336792
Testing accuracy by the neural network with one hidden layer containing 1000 neurons: 0.098
Number of digits in test set being correctly classified 980/10000


_______________
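These results show that a simple neural network with one hidden layer can perform well on MNIST classification: with 20 hidden neurons it reaches 94.93% test accuracy. However, more neurons in the hidden layer do not automatically mean better performance. With 1000 hidden neurons the test accuracy is only 9.8%, which is chance level for ten classes, so this network failed to train rather than merely generalizing worse. The two runs also differ in batch size (32 versus 632), so the width of the hidden layer is not the only thing that changed. The results are consistent with the universal approximation theorem, which guarantees that a sufficiently wide network able to approximate the target function exists; it does not guarantee that a particular training setup will find those weights.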
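
One possible contributor to the failed 1000-neuron run, not verified here, is the ReLU applied to the output layer. If every pre-activation output becomes negative, the logits are all zero, the loss sits at `log(10)`, and no gradient flows back through that ReLU. The sketch below shows this with NumPy; the pre-activation values are made up for illustration. The 980 correct predictions match the number of images labelled 0 in the MNIST test set, which is what an argmax that picks the first index on ties would give (NumPy does this; the same behaviour is assumed for `tf.argmax`).

```python
import numpy as np

def softmax_cross_entropy(logits, one_hot_label):
    """Cross-entropy between softmax(logits) and a one-hot label."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-np.sum(one_hot_label * log_probs))

# Hypothetical pre-activation outputs that have all drifted below zero.
pre_activation = np.array([-3.0, -1.0, -0.5, -2.0, -4.0, -0.1, -1.5, -2.5, -0.7, -3.3])
logits = np.maximum(pre_activation, 0.0)  # mirrors y = tf.nn.relu(y)
label = np.eye(10)[3]                     # arbitrary true class

loss = softmax_cross_entropy(logits, label)
relu_gradient = (pre_activation > 0).astype(float)  # derivative of ReLU w.r.t. its input
predicted_class = int(np.argmax(logits))            # ties resolve to the first index

print(loss, np.log(10))      # the two values agree
print(relu_gradient.sum())   # 0.0: nothing propagates back through the output ReLU
print(predicted_class)       # 0
```

Checking whether this is what happened in the run above would require inspecting the logits during training, or retraining with the output ReLU removed.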