### Objective 

Develop a skeletal framework for long-term, indefinite training that is conducted in multiple, discrete sessions spread out over time.  We want to be able to train a network like this:
- Hit the switch to start training.
- Open up TensorBoard and see the graph of the loss function, check out the kernels, etc.
- Wait for a while, refresh TensorBoard, and see how things are coming along.
- Hit some kind of "pause button" and shut everything down for a while.
- Come back later, fire it up again, and it picks up right where it left off without missing a beat.
- Furthermore, if it ever crashes, we want to be able to resume from a recent checkpoint so we don't lose too much work.

### Build the Graph 

We just need dummy operations.  No need to saddle the framework with actual machine learning.  So, we'll use just one trainable parameter called $p$ and we won't bother feeding any input to it.  The "network output" is simply the constant function $f() = p$.

In [1]:
```python
import tensorflow as tf
graph = tf.Graph()
with graph.as_default():
    p = tf.get_variable(name="p",
                        shape=[1],
                        initializer=tf.constant_initializer(0),
                        dtype=tf.float32)
```

### Training Operation

To make this at least somewhat realistic (not *too* far off from a real application), we'll make a proper **train_op** using a `tf.train.Optimizer`.

`tf.train.GradientDescentOptimizer` is the simplest one, so let's go with that.

To simulate training, we'll just increment our trainable parameter $p$ by (oh, let's say) 2 at every training step.  The straightforward way to do this would be to use a loss function with a constant gradient of -2 (e.g. loss$(x) = -2x + c$).  That would require setting up an input placeholder for $x$, though, and we don't want to clutter up the framework with unnecessary variables.

So, instead of calling `tf.train.GradientDescentOptimizer.minimize(loss)`--which computes the gradient of the loss function and then applies it to the trainable parameters--we'll take a detour around the gradient-taking part and call `tf.train.GradientDescentOptimizer.apply_gradients(grads_and_vars)` directly.  It expects a list of (gradient, variable) pairs and assumes each gradient is the gradient of the loss with respect to its variable.  It'll be none the wiser if we just pair $p$ with a constant -2.

In [2]:
```python
with graph.as_default():

    # Step counter
    global_step = tf.Variable(0, name="global_step", trainable=False)

    # Fake the gradient of a loss function to make the optimizer think p always needs to be adjusted by +2
    fake_loss_gradient = -2 * tf.ones([1])

    # Simulate calling compute_gradients() (the first half of minimize())
    grads_and_vars = [(fake_loss_gradient, p)]

    # Set train_op to be the apply_gradients part of minimize()
    train_op = tf.train.GradientDescentOptimizer(learning_rate=1.0).apply_gradients(grads_and_vars, global_step=global_step)
```
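
To see why a gradient of -2 produces a step of +2, here is a minimal plain-Python sketch of the gradient descent update rule, $p \leftarrow p - \text{learning rate} \times \text{gradient}$. It doesn't use TensorFlow at all; the names `learning_rate`, `fake_gradient`, and `history` are just for illustration and stand in for the values used in the cell above.

```python
# Plain-Python stand-in for the update the optimizer performs; no TensorFlow involved.
learning_rate = 1.0
fake_gradient = -2.0   # the constant we hand over in place of a real loss gradient

p = 0.0                # same starting value as the constant initializer
history = {}           # step -> value of p, recorded every 100 steps
for step in range(1, 501):
    # Gradient descent moves *against* the gradient: p <- p - lr * grad
    p = p - learning_rate * fake_gradient
    if step % 100 == 0:
        history[step] = p

print(history)
```

With a learning rate of 1.0, each step adds exactly 2 to `p`, so after $n$ steps we should see $p = 2n$.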

#### Test train_op in a simple session

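Just for a sanity check, let's run our train_op in an old-fashioned `tf.Session`.  With a learning rate of 1.0 and a fake gradient of -2, $p$ should always be twice the step count.  For the real training, we'll use a `tf.train.Supervisor.managed_session()`.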

In [3]:
```python
with tf.Session(graph=graph) as s:

    # Initialize the network parameters
    tf.global_variables_initializer().run()

    for i in range(500):
        _, y, gstep = s.run([train_op, p, global_step])
        if (gstep % 100 == 0):
            print("step %3d:\tp = %f" % (gstep,y))
```

```
step 100:	p = 200.000000
step 200:	p = 400.000000
step 300:	p = 600.000000
step 400:	p = 800.000000
step 500:	p = 1000.000000
```


Looks good.  Now let's do it with a pausable session.

### Training Session

TensorFlow supports saving and loading the network parameters using the `tf.train.Saver` class. Since we want checkpoints, however, we can make this easier by using a `tf.train.Supervisor.managed_session` instead of the usual `tf.Session`.  

A `managed_session` has its own Saver; it saves checkpoints automatically and reloads from them.  The only thing it leaves for us to implement is the "pause button."  To do that, we'll periodically check the training directory for a file named "pause".  If one exists, we'll delete it (in preparation for the next use), save a checkpoint manually, and then shut down the training.
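
The pause-file check can be sketched on its own, without any TensorFlow. This is only an illustration: the helper `pause_requested` is a hypothetical name, it uses `os` where the notebook uses `tf.gfile`, and it works in a throwaway temp directory rather than the real training directory.

```python
import os
import tempfile

def pause_requested(pause_path):
    """Return True if the pause file exists, deleting it so the next run starts clean."""
    if os.path.exists(pause_path):
        os.remove(pause_path)
        return True
    return False

# Demo in a throwaway directory (hypothetical; not the notebook's training directory)
demo_dir = tempfile.mkdtemp()
demo_pause = os.path.join(demo_dir, "pause")

before = pause_requested(demo_pause)   # no file yet, so no pause
open(demo_pause, "w").close()          # the Python equivalent of `touch`
after = pause_requested(demo_pause)    # file found and removed
again = pause_requested(demo_pause)    # request already consumed

print(before, after, again)
```

Deleting the file as soon as it's seen means one `touch` pauses exactly one session, so the next run doesn't stop immediately.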

**BEFORE RUNNING THE NEXT CELL** get a terminal open and have this command ready to go:
```
touch /tmp/pausable_training/pause
```

**NB**: Only the *values* of the network parameters are saved and reloaded--not the network topology itself.  We still have to build the graph and, if there's input (which there usually is), load it.  The step we skip is the network initialization step (`tf.global_variables_initializer().run()`).  We have to leave that to `managed_session()`, since we want it to handle the task of choosing whether to reload values from storage or initialize them from scratch.
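
The restore-or-initialize decision is easy to picture with a toy stand-in. The sketch below is not how the Supervisor works internally; it just keeps the two values we care about in a JSON file. The names `load_or_init`, `run_session`, and `toy_ckpt` are made up for this example.

```python
import json
import os
import tempfile

def load_or_init(ckpt_path):
    """Restore saved values if a checkpoint exists; otherwise initialize from scratch."""
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            return json.load(f)
    return {"p": 0.0, "global_step": 0}

def run_session(ckpt_path, num_steps):
    """One training session: restore or initialize, train for a while, then save."""
    state = load_or_init(ckpt_path)
    for _ in range(num_steps):
        state["p"] += 2.0            # same fake "training" as the notebook: +2 per step
        state["global_step"] += 1
    with open(ckpt_path, "w") as f:
        json.dump(state, f)
    return state

# Toy checkpoint in a throwaway directory (hypothetical path)
toy_ckpt = os.path.join(tempfile.mkdtemp(), "toy_ckpt.json")

first = run_session(toy_ckpt, 300)    # no checkpoint yet: starts from p = 0
second = run_session(toy_ckpt, 200)   # checkpoint found: resumes from step 300

print(first)
print(second)
```

Note that the code that does the "training" is rebuilt on every call; only the values survive between sessions, which mirrors the point made above about saving values but not topology.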

In [None]:
```python
import time
training_dir = "/tmp/pausable_training/"
pause_file = training_dir + "pause"
checkpoint_file = training_dir + "pt.ckpt"

with graph.as_default():
    sv = tf.train.Supervisor(logdir=training_dir)
    with sv.managed_session() as s:

        # The Supervisor either restores the variables from a checkpoint in logdir
        # or runs the variable initializer for us

        while not sv.should_stop():
            _, y, gstep = s.run([train_op, p, global_step])
            if (gstep % 100 == 0):
                print("step %3d:\tp = %f" % (gstep,y))
                # Check for the pause file once every 100 steps
                if (tf.gfile.Exists(pause_file)):
                    sv.Stop()
                time.sleep(1)

        # If we stopped because of the pause file, consume it and save a final checkpoint
        if tf.gfile.Exists(pause_file):
            print("Pause command received.  Saving checkpoint and shutting down.")
            tf.gfile.Remove(pause_file)
            sv.saver.save(s, checkpoint_file, global_step=global_step)
```

```
INFO:tensorflow:Restoring parameters from /tmp/pausable_training/pt.ckpt-1900
INFO:tensorflow:global_step/sec: 0
step 2000:	p = 4000.000000
step 2100:	p = 4200.000000
step 2200:	p = 4400.000000
step 2300:	p = 4600.000000
step 2400:	p = 4800.000000
step 2500:	p = 5000.000000
Pause command received.  Saving checkpoint and shutting down.
```


### Now, we shut down and continue with more training later

In [None]:
```python
# Kill the kernel, forcing it to restart - NB: You'll have to step manually from here on.
import os
os._exit(0)
```

Re-run the notebook at this point to perform subsequent training sessions.  Each time it will pick up where it left off (unless the training directory has been wiped).  Note the reported "step" numbers above to see it resuming.

### Next Objectives
#### Make this a little less skeletal 
Use input data and actually train some simple parameters (call `minimize`) so we have a real train_op.  Try logistic regression on a randomly-generated dataset.
#### Get a better handle on TensorBoard reporting  
We want to see the loss and the trainable parameters at the very least.  Eventually we'll add learning rates, kernels, etc.
#### Keep a roster of the top performers
Keep snapshots of the networks corresponding to local minima in the training loss.  Make it easy to load those snapshots.  Once we suspect we're beginning to overtrain the network, we can load up those "training highlights", cross-validate them, and package up the best one for "release" (e.g. Kaggle submission).
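
One possible shape for such a roster, as a rough sketch only: keep the `k` lowest-loss (loss, step) entries in a heap and save a snapshot whenever a new entry makes the cut. This simplifies "local minima" to "lowest losses seen so far"; the `Roster` class and the sample losses are invented for illustration, and nothing here touches TensorFlow.

```python
import heapq

class Roster:
    """Keep the k lowest-loss (loss, step) entries seen so far."""

    def __init__(self, k):
        self.k = k
        self._heap = []   # holds (-loss, step), so the worst kept entry sits at index 0

    def offer(self, loss, step):
        """Return True if this entry made the roster (i.e. a snapshot is worth saving)."""
        entry = (-loss, step)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
            return True
        if entry > self._heap[0]:          # lower loss than the worst entry we kept
            heapq.heapreplace(self._heap, entry)
            return True
        return False

    def best(self):
        """Return the kept entries as (loss, step), lowest loss first."""
        return sorted((-neg_loss, step) for neg_loss, step in self._heap)

# Made-up losses, purely for demonstration
roster = Roster(k=3)
for step, loss in [(100, 0.9), (200, 0.5), (300, 0.7), (400, 0.4), (500, 0.6)]:
    roster.offer(loss, step)

print(roster.best())
```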
#### Use a file input queue
We want to operate on lots of data: more than will fit in main memory.  See this page: https://www.tensorflow.org/programmers_guide/reading_data