### Objective 

Develop a skeletal framework for long-running, open-ended training that is conducted in multiple discrete sessions spread out over time.  We want to be able to train a network like this:
- Hit the switch to start training.
- Open up TensorBoard and see the graph of the loss function, check out the kernels, etc.
- Wait for a while, refresh TensorBoard, and see how things are coming along.
- Hit some kind of *save and quit* button.  Not sure how best to do this, but an easy way would be for the training loop to check for a file called "pause" in the training directory.  To pause the training, you'd go to a bash prompt and run "touch pause".
- Come back later, fire it up again, and hit a *continue* button that resumes training from where we left off.
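
The pause-file idea from the list above can be sketched in plain Python. This is only a rough illustration, not the framework itself: the function name `train_until_paused`, its parameters, and the `run_one_step` callback are all hypothetical, and a real version would save a checkpoint before quitting.

```python
import os
import tempfile


def train_until_paused(train_dir, run_one_step, max_steps=1000, pause_name="pause"):
    """Run steps until a pause file shows up in train_dir or max_steps is reached.

    Hypothetical helper: returns the number of steps completed in this session.
    """
    pause_path = os.path.join(train_dir, pause_name)
    steps_done = 0
    while steps_done < max_steps:
        if os.path.exists(pause_path):
            # A real framework would save the network state here before quitting.
            os.remove(pause_path)  # clear the flag so the next session can start
            break
        run_one_step()
        steps_done += 1
    return steps_done


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as train_dir:
        state = {"p": 0}

        def fake_step():
            # Stand-in for one training step; asks for a pause once p reaches 10.
            state["p"] += 2
            if state["p"] == 10:
                open(os.path.join(train_dir, "pause"), "w").close()  # like "touch pause"

        print(train_until_paused(train_dir, fake_step), state["p"])
```

Checking the file once per step keeps the loop simple; if a step is expensive, that is also the natural granularity for pausing.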

#### Next Objective
Provide a way to load the state of the network from earlier points in time, such as local minima in the training loss.  Then we can test or validate those and package up the best one for "release" (e.g. a Kaggle submission).
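
As a rough sketch of what "local minima in the training loss" could mean in practice, the snippet below picks candidate points from a recorded loss history. It assumes one loss value is recorded per saved checkpoint, which the notebook does not specify; `local_minima` and `best_checkpoint` are hypothetical names.

```python
def local_minima(losses):
    """Return indices whose loss is strictly lower than both neighbours."""
    return [
        i
        for i in range(1, len(losses) - 1)
        if losses[i] < losses[i - 1] and losses[i] < losses[i + 1]
    ]


def best_checkpoint(losses):
    """Return the index of the lowest local minimum, or None if there is none."""
    candidates = local_minima(losses)
    if not candidates:
        return None
    return min(candidates, key=lambda i: losses[i])


if __name__ == "__main__":
    history = [5.0, 3.0, 4.0, 2.0, 2.5, 1.0, 1.5]
    print(local_minima(history), best_checkpoint(history))
```

Training loss alone is a weak selection criterion, so the candidates would still need to be tested or validated as described above.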

In [1]:
import tensorflow as tf

### Graph 

We just need dummy operations.  No need to saddle the framework with actual machine learning.  So, we'll simulate training with a looping counter called $p$.  The "network output" is simply the constant function $f(x) = p$.  It has one trainable parameter: $p$.  We won't actually feed it any input.

In [2]:
graph = tf.Graph()
with graph.as_default():
    p = tf.get_variable(name="tf_p", 
                        shape=[1], 
                        initializer=tf.constant_initializer(0),
                       dtype=tf.float32)

### Training Operation

Actually...  Instead of doing addition modulo n, what happens when a variable overflows in TensorFlow?  Let's find out (while we're here anyway).  **Ans: "Inf"**, at least for our float32 $p$ *(increase the learning rate and/or $p$ to see)*

We need to follow TF conventions here, so we'll use a proper tf.train.Optimizer to do our "training."  The simplest one seems to be GradientDescentOptimizer, so we'll use that and fake the "loss" function in order to make our parameter increment.

In [36]:
with graph.as_default():
    
    # Fake "loss": this constant -2 is actually passed to the optimizer as the gradient,
    # so with learning_rate=1.0 each step adjusts p by +2
    loss = -2 * tf.ones([1])

    # Step counter
    global_step = tf.Variable(0, name="global_step", trainable=False)
    
    # Normally we'd run minimize(), which first computes the gradients and then applies them.
    # Here, we want to fool the optimizer with a fake (gradient, variable-to-train) pair and then apply that.
    # So, we'll run just the second half of minimize() and fake the first half.
    grads_and_vars = [(loss, p)]
    train_op = tf.train.GradientDescentOptimizer(learning_rate=1.0).apply_gradients(grads_and_vars, global_step=global_step)
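
As a quick sanity check on the sign convention, plain gradient descent updates a parameter as $p \leftarrow p - \text{learning rate} \times \text{gradient}$, so a constant gradient of $-2$ with a learning rate of $1.0$ adds $2$ per step. The sketch below reproduces that arithmetic in plain Python without TensorFlow; `simulate_updates` is a hypothetical helper, not part of the framework.

```python
def simulate_updates(num_steps, gradient=-2.0, learning_rate=1.0, p0=0.0):
    """Apply the plain gradient-descent update num_steps times and return p."""
    p = p0
    for _ in range(num_steps):
        p = p - learning_rate * gradient  # same rule the fake gradient relies on
    return p


if __name__ == "__main__":
    for steps in (1, 101, 201):
        print("step %3d:\tp = %f" % (steps, simulate_updates(steps)))
```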

#### Test train_op in a simple session

In [40]:
with tf.Session(graph=graph) as s:
    
    # Initialize the network parameters
    tf.global_variables_initializer().run()
    
    for i in range(1000):
        _, y, gstep = s.run([train_op, p, global_step])
        if i % 100 == 0:
            # p has shape [1], so index it to get a plain scalar for formatting
            print("step %3d:\tp = %f" % (gstep, y[0]))

step   1:	p = 2.000000
step 101:	p = 202.000000
step 201:	p = 402.000000
step 301:	p = 602.000000
step 401:	p = 802.000000
step 501:	p = 1002.000000
step 601:	p = 1202.000000
step 701:	p = 1402.000000
step 801:	p = 1602.000000
step 901:	p = 1802.000000


### Training Session