### Objective 

Develop a skeletal framework for long-term, indefinite training that is conducted in multiple, discrete sessions spread out over time.  We want to be able to train a network like this:
- Hit the switch to start training.
- Open up TensorBoard and see the graph of the loss function, check out the kernels, etc.
- Wait for a while, refresh TensorBoard, and see how things are coming along.
- Hit some kind of "pause button" and shut everything down for a while.
- Come back later, fire it up again, and it picks up right where it left off without missing a beat.


### Step 1: Build the Graph 

We just need dummy operations.  No need to saddle the framework with actual machine learning.  So, we'll use just one trainable parameter called $p$ and we won't bother feeding any input to it.

The "network output" is simply the constant function $f(x) = p$.

In [1]:
import tensorflow as tf
graph = tf.Graph()
with graph.as_default():
    p = tf.get_variable(name="tf_p", 
                        shape=[1], 
                        initializer=tf.constant_initializer(0),
                       dtype=tf.float32)

### Training Operation

To make this at least somewhat realistic (not *too* far off from a real application), we'll make a proper **train_op** using a tf.train.Optimizer.

tf.train.**GradientDescentOptimizer** is the simplest one, so let's go with that.

To simulate training, we'll just increment our trainable parameter $p$ by (oh, let's say) 2 at every training step.  The straightforward way to do this would be to use a loss function with a constant gradient of -2 (e.g. $\mbox{loss}(x) = -2x$).  That would require setting up an input placeholder for $x$, though, and we don't want to clutter up the framework with unnecessary variables.

So, instead of calling tf.train.GradientDescentOptimizer.**minimize(loss)**, which takes the gradient of the loss function and then applies it to the trainable parameters, we'll take a detour around the gradient-taking part and call tf.train.GradientDescentOptimizer.**apply_gradients(grads_and_vars)** directly.  It expects a list of (gradient, variable) pairs and assumes each gradient is the gradient of the loss with respect to its variable.  It'll be none the wiser if we just give it -2 for $p$.

In [2]:
with graph.as_default():
    
    # Step counter
    global_step = tf.Variable(0, name="global_step", trainable=False)
    
    # Fake the gradient of a loss function to make the optimizer think p always needs to be adjusted by +2
    fake_loss_gradient = -2 * tf.ones([1])

    # Simulate calling compute_gradients() (the first half of minimize())
    grads_and_vars = [(fake_loss_gradient, p)]
    
    # Set train_op to be the apply_gradients part of minimize()
    train_op = tf.train.GradientDescentOptimizer(learning_rate=1.0).apply_gradients(grads_and_vars, global_step=global_step)
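
As a check on the arithmetic: plain gradient descent updates a parameter as $p \leftarrow p - \eta g$, so with learning rate $\eta = 1.0$ and a fake gradient $g = -2$, every step should add 2 to $p$. The sketch below reproduces that rule in plain Python. `sgd_step` is a hypothetical helper written for illustration, not a TensorFlow function.

```python
def sgd_step(p, grad, learning_rate=1.0):
    """One plain gradient-descent update: p <- p - learning_rate * grad."""
    return p - learning_rate * grad

p_value = 0.0       # matches tf.constant_initializer(0)
fake_grad = -2.0    # the constant "gradient" handed to the optimizer
step = 0            # plays the role of global_step

for _ in range(5):
    p_value = sgd_step(p_value, fake_grad)
    step += 1

print(step, p_value)
```

After $k$ steps we should expect $p = 2k$, which is the pattern to look for when running the real train_op.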

#### Test train_op in a simple session

Just for a sanity check, let's run our train_op in an old-fashioned tf.Session.  For the real training, we'll use a tf.train.Supervisor.managed_session().

In [3]:
with tf.Session(graph=graph) as s:
    
    # Initialize the network parameters
    tf.global_variables_initializer().run()
    
    for i in range(1000):
        _, y, gstep = s.run([train_op, p, global_step])
        if (i % 100 == 0):
            print("step %3d:\tp = %f" % (gstep,y))

step   1:	p = 2.000000
step 101:	p = 202.000000
step 201:	p = 402.000000
step 301:	p = 602.000000
step 401:	p = 802.000000
step 501:	p = 1002.000000
step 601:	p = 1202.000000
step 701:	p = 1402.000000
step 801:	p = 1602.000000
step 901:	p = 1802.000000


### Training Session

In [10]:
import time
training_dir = "/tmp/pausible_training/"
pause_file = training_dir + "pause"
with graph.as_default():
    sv = tf.train.Supervisor(logdir=training_dir)
    with sv.managed_session() as s:

        # Supervisor calls tf.global_variables_initializer().run() for us

        for i in range(5000):
            if tf.gfile.Exists(pause_file):
                print("Pause command received.  Saving checkpoint and shutting down.")
                tf.gfile.Remove(pause_file)
                sv.saver.save(s, training_dir + 'model.ckpt', global_step=global_step)
                break
            _, y, gstep = s.run([train_op, p, global_step])
            if sv.should_stop():
                break
            if (i % 100 == 0):
                print("%d step %3d:\tp = %f" % (i, gstep,y))
                time.sleep(1)

INFO:tensorflow:global_step/sec: 0
0 step 5804:	p = 11608.000000
100 step 5904:	p = 11808.000000
200 step 6004:	p = 12008.000000
300 step 6104:	p = 12208.000000
400 step 6204:	p = 12408.000000
500 step 6304:	p = 12608.000000
600 step 6404:	p = 12808.000000
700 step 6504:	p = 13008.000000
You paused!


Cool!  It can resume from where it left off!

The catch is that the Supervisor only saves at regular checkpoints.  That's great, and we want to keep it, but we also want to be able to force a save on command.  That way, when we deliberately interrupt training, we can save immediately before halting and not lose anything.  Regular checkpoints are great for unexpected shutdowns, but when we know we're about to shut down, we shouldn't have to rely on the luck of the checkpoint schedule.

So, I think the way you have to do a deliberate save is like this:
sv.saver.save(s, 'model', global_step=global_step)
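
To make the save-then-resume idea concrete without TensorFlow, here is a minimal sketch that uses a JSON file as a stand-in for the Saver. The helpers `save_state` and `load_latest_state` are hypothetical names made up for this example. The `-<step>` suffix on the file name imitates what the Saver does when it is given a global_step.

```python
import json
import os
import re
import tempfile

def save_state(ckpt_dir, step, p_value):
    """Write the state to '<ckpt_dir>/model.ckpt-<step>' (JSON stand-in for a checkpoint)."""
    path = os.path.join(ckpt_dir, "model.ckpt-%d" % step)
    with open(path, "w") as f:
        json.dump({"global_step": step, "p": p_value}, f)
    return path

def load_latest_state(ckpt_dir):
    """Return the saved state with the highest step, or a fresh state if none exists."""
    steps = []
    for name in os.listdir(ckpt_dir):
        match = re.fullmatch(r"model\.ckpt-(\d+)", name)
        if match:
            steps.append(int(match.group(1)))
    if not steps:
        return {"global_step": 0, "p": 0.0}
    with open(os.path.join(ckpt_dir, "model.ckpt-%d" % max(steps))) as f:
        return json.load(f)

ckpt_dir = tempfile.mkdtemp()

# First session: train 3 dummy steps, then save on command.
state = load_latest_state(ckpt_dir)
for _ in range(3):
    state["p"] += 2.0
    state["global_step"] += 1
save_state(ckpt_dir, state["global_step"], state["p"])

# Second session: resume from the latest save and train 2 more steps.
resumed = load_latest_state(ckpt_dir)
for _ in range(2):
    resumed["p"] += 2.0
    resumed["global_step"] += 1

print(resumed)
```

The second session never re-initializes $p$; it starts from whatever the last save recorded, which is the behavior we want from the real framework.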

We can use tf.train.Supervisor to run a managed session.  

There should be some kind of *save-and-quit* button.  Not sure how best to do this, but an easy way would be for the code to check for the existence of a file called "pause" in the training dir.  To pause the training, you'd go to a bash prompt and do "touch pause".  The training code would notice it, delete it (to prepare for the next pause), save a checkpoint, and shut down.
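
The pause-file check on its own can be sketched in plain Python, with the training step replaced by a counter. `run_until_paused` and the `on_step` callback are hypothetical names for this example; the callback only exists so the demo can create the file partway through, the way "touch pause" would from a shell.

```python
import os
import tempfile

def run_until_paused(training_dir, max_steps, on_step=None):
    """Run up to max_steps dummy steps; stop early if a 'pause' file appears.

    Returns (steps_completed, paused).
    """
    pause_file = os.path.join(training_dir, "pause")
    steps_done = 0
    for _ in range(max_steps):
        if os.path.exists(pause_file):
            os.remove(pause_file)   # clear the flag to prepare for the next pause
            # A real loop would save a checkpoint here before stopping.
            return steps_done, True
        steps_done += 1             # stand-in for one training step
        if on_step is not None:
            on_step(steps_done)
    return steps_done, False

demo_dir = tempfile.mkdtemp()

def touch_pause_at_step_3(step):
    # Simulates running "touch pause" while training is in progress.
    if step == 3:
        open(os.path.join(demo_dir, "pause"), "w").close()

result = run_until_paused(demo_dir, max_steps=10, on_step=touch_pause_at_step_3)
print(result)
```

Checking for the file at the top of each iteration means a pause request is honored before the next step runs, so the saved state always matches a completed step.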

#### Next Objective
Provide a way to load up the state of the network from earlier points in time, such as local minima in the training loss.  Then we can test or validate those and package up the best one for "release" (e.g. Kaggle submission).
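
One possible shape for the selection step, as a rough sketch: keep a record of (global_step, loss) pairs and pick the step with the lowest loss. The numbers below are made up purely for illustration, and `best_checkpoint` is a hypothetical helper.

```python
# Hypothetical (global_step, validation_loss) records, invented for illustration.
history = [(1000, 0.92), (2000, 0.61), (3000, 0.48), (4000, 0.53), (5000, 0.57)]

def best_checkpoint(records):
    """Return the (step, loss) pair with the lowest loss."""
    return min(records, key=lambda record: record[1])

best_step, best_loss = best_checkpoint(history)
print(best_step, best_loss)
```

Note that tf.train.Saver keeps only the most recent checkpoints by default (max_to_keep=5), so holding on to earlier states would likely mean raising that limit or copying the chosen checkpoints somewhere else.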