# tensorflow debugger

following [this](https://www.tensorflow.org/programmers_guide/debugger)

the `tensorflow` debugger is called `tfdbg`, and is a `curses`-based cli. *because* it is `curses`-based, I won't be able to do much of the walkthrough here in the shell. I'll put my extra thoughts in here

In [1]:
import tensorflow as tf

import utils


## wrapping `tensorflow` sessions with `tfdbg`

to use `tfdbg`, the first step is to wrap the session object in a debugger wrapper:

```python
from tensorflow.python import debug as tf_debug

sess = tf_debug.LocalCLIDebugWrapperSession(sess)
```

or, using the full context manager:

```python
with tf_debug.LocalCLIDebugWrapperSession(tf.Session()) as sess:
    ...
```

In [4]:
from tensorflow.python import debug as tf_debug

for an example of this in action, check out [`debug_mnist.py`, L127](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/debug/examples/debug_mnist.py#L127)

some debugging checks are so common that they ship with the library in the `tensorflow.python.debug.lib.debug_data` module (e.g.: `has_inf_or_nan`, reachable as `tf_debug.has_inf_or_nan`)

## debugging model training with `tfdbg`

the [`debug_mnist.py`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/debug/examples/debug_mnist.py) code has a cli flag built in to activate `--debug` mode. under the hood, this is a switch for using the `LocalCLIDebugWrapperSession` we discussed above, and thereby launching the `curses` interface for interactive debugging.

once this has been activated, we are dropped into the program at the [*first* invocation of `sess.run()`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/debug/examples/debug_mnist.py#L136) post-wrapper (note: there could be many such invocations and we'd be dropped into the program for each).

### `tfdbg` `cli` frequently-used commands

this is a very good basic summary table of commands

### other features of the `tfdbg` `cli`

nothing to add

### finding `nan`s and `inf`s

the `run` subcommand allows you to apply a condition filter after every step (analogous to conditional breakpoints):

```
run -f has_inf_or_nan
```

the `has_inf_or_nan` filter *exists* because it was explicitly written out as `python` code (in the debug library itself, apparently; I can't see it anywhere) and was registered as a tensor filter in the default `LocalCLIDebugWrapperSession`. to add *your own* filter:

```python
def my_filter_callable(datum, tensor):
    return len(tensor.shape) == 0 and tensor == 0.0

sess.add_tensor_filter('my_filter', my_filter_callable)
```

will allow you to write `run -f my_filter` in the `tfdbg` interface. follow the next two docstrings down the rabbit hole to understand what `datum` and `tensor` are in the above (hint: `tensor` is a `np` array, which is generally good enough)

In [5]:
tf_debug.LocalCLIDebugWrapperSession.add_tensor_filter?

In [6]:
tf_debug.DebugDumpDir.find?
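for intuition, here's a rough `numpy`-only sketch of what a filter in the spirit of `has_inf_or_nan` might look like. this is *not* the library's implementation: `sketch_has_inf_or_nan` is a made-up name, and I'm assuming `tfdbg` hands the callable a `np` array as its second argument (with `datum` carrying metadata this predicate ignores)

```python
import numpy as np


def sketch_has_inf_or_nan(datum, tensor):
    """Return True if `tensor` holds at least one inf or nan entry."""
    del datum  # metadata about the dumped tensor; unused in this predicate
    # assumption: anything that is not a float ndarray cannot contain inf / nan
    if not isinstance(tensor, np.ndarray):
        return False
    if not np.issubdtype(tensor.dtype, np.floating):
        return False
    return bool(np.isnan(tensor).any() or np.isinf(tensor).any())


clean = np.array([0.5, 1.0, 2.0])
dirty = np.array([0.5, np.nan, np.inf])
print(sketch_has_inf_or_nan(None, clean))
print(sketch_has_inf_or_nan(None, dirty))
```

registering it would follow the same pattern as above, e.g. `sess.add_tensor_filter('my_nan_filter', sketch_has_inf_or_nan)`, where `'my_nan_filter'` is just an illustrative name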

the discussion here walks the user through how to navigate toward the source of a *known* problem (obviously, that's taking as granted the greater half of the battle). the basic steps are:

1. **filter tensors for problem**: use the `run -f` or `lt -f has_inf_or_nan` command to identify the problematic tensors
1. **loop** to find the problematic input / operation:
    1. **debug tensors**: use `pt` on the "first" / originating offending tensor, and within that the regex searching command `/(inf|nan)` to find the offending entries
    1. **debug operations that created problem tensor**: this tensor was the output of a node operation; investigate that operation with `node_info`
        1. in particular, identify the inputs to that operation
    1. **debug inputs to problematic operation**: use `pt` on the input(s) of that operation
    1. **repeat**: iterate the above steps until you think you know which input was a problem and why
1. **identify problematic source code**: once you've identified the problematic input / operation, find the origin in the source code with `node_info -t` (traceback)
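the loop in the steps above is really just a walk upstream through the graph. here's a toy, pure-`python` sketch of that idea (no `tensorflow` involved): the `inputs_of` and `outputs` dicts are hypothetical stand-ins for what `node_info` and `pt` would show you, and the node names are only illustrative

```python
import math


def is_bad(values):
    """True if any entry is nan or inf (what `/(inf|nan)` hunts for in `pt` output)."""
    return any(math.isnan(v) or math.isinf(v) for v in values)


def find_origin(node, inputs_of, outputs):
    """Walk upstream from a bad node until all of a node's inputs are clean."""
    while True:
        bad_inputs = [n for n in inputs_of.get(node, []) if is_bad(outputs[n])]
        if not bad_inputs:
            return node  # inputs are fine, so this op introduced the problem
        node = bad_inputs[0]  # otherwise keep following the bad input


# toy graph: node name -> names of its inputs (what `node_info` would list)
inputs_of = {
    "cross_entropy/mul": ["y_", "cross_entropy/Log"],
    "cross_entropy/Log": ["softmax"],
}
# toy dumped values: node name -> flat list of floats (what `pt` would print)
neg_inf = float("-inf")
outputs = {
    "y_": [0.0, 1.0],
    "softmax": [0.0, 1.0],
    "cross_entropy/Log": [neg_inf, 0.0],
    "cross_entropy/mul": [0.0 * neg_inf, 1.0 * 0.0],
}

print(find_origin("cross_entropy/mul", inputs_of, outputs))
```

in this toy case the walk stops at the log node: its input contains a plain `0.0`, and taking the log of that is where the `-inf` (and downstream `nan`) comes from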

### fixing the problem

source code had manual calculation of crossentropy:

```python
diff = -(y_ * tf.log(y))
```

use the builtin instead:

```python
diff = tf.losses.sparse_softmax_cross_entropy(labels=y_, logits=logits)
```
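to see *why* the manual version blows up, here's a small `numpy` sketch (not `tensorflow`'s implementation; both function names are made up). the naive version takes `log` of a softmax output that has underflowed to exactly `0`, while the stable version works on the logits directly with the usual max-shift trick. note that I'm assuming one-hot labels for the naive version and an integer class id for the stable one, since the "sparse" builtin takes integer labels

```python
import numpy as np


def naive_cross_entropy(logits, onehot):
    """softmax first, then -(y_ * log(y)): breaks once a probability hits 0."""
    y = np.exp(logits) / np.sum(np.exp(logits))
    # silence the log(0) and 0 * -inf warnings; the bad values still propagate
    with np.errstate(divide="ignore", invalid="ignore"):
        return float(np.sum(-(onehot * np.log(y))))


def stable_cross_entropy(logits, label):
    """log-softmax computed from the logits, never forming log(0)."""
    shifted = logits - np.max(logits)
    log_softmax = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_softmax[label])


logits = np.array([0.0, -800.0])  # exp(-800) underflows to 0.0
onehot = np.array([1.0, 0.0])     # true class is index 0

print(naive_cross_entropy(logits, onehot))
print(stable_cross_entropy(logits, 0))
```

the naive result is `nan` because of the `0 * -inf` term for the *wrong* class, which is exactly the kind of entry the `/(inf|nan)` search turns up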

## debugging `tf-learn` estimators and experiments

the whole point of the `tf_debug.LocalCLIDebugWrapperSession` is that it directly wraps the `tensorflow` session object. this is a problem for some of the higher-level apis where the `session` is obscured from the user -- how do we insert the debugger into those programs?

the answer is `tfdbg` hooks

### sidebar about difference between `tf-learn` and the "regular" estimators

this documentation presents a discussion about the `tf-learn` elements -- these are located in the `tf.contrib.learn` package. I *think* that estimator modules from this package are "graduated" to the core library in `tf.estimator` once they reach a stable point, so we should be able to treat them interchangeably in the long run, but in the short run some of the things you might find yourself using are `tf.contrib.learn` estimators.

here's a quick diversion on the types of classifiers / estimators / regressors available in each:

In [13]:
tflearn_cers = {
    _
    for _ in dir(tf.contrib.learn)
    if any(kw in _ for kw in ['Classifier', 'Estimator', 'Regressor'])
}

In [14]:
tfe_cers = {
    _
    for _ in dir(tf.estimator)
    if any(kw in _ for kw in ['Classifier', 'Estimator', 'Regressor'])
}

In [18]:
print('items in tf-learn but not in estimators:')
for module in sorted(tflearn_cers.difference(tfe_cers)):
    print('\t{}'.format(module))

print('\nitems in estimators but not in tf-learn:')
for module in sorted(tfe_cers.difference(tflearn_cers)):
    print('\t{}'.format(module))

print('\nitems in both:')
for module in sorted(tfe_cers.intersection(tflearn_cers)):
    print('\t{}'.format(module))

items in tf-learn but not in estimators:
	BaseEstimator
	DNNEstimator
	DNNLinearCombinedEstimator
	DynamicRnnEstimator
	LinearEstimator
	LogisticRegressor

items in estimators but not in tf-learn:
	BaselineClassifier
	BaselineRegressor
	BoostedTreesClassifier
	BoostedTreesRegressor
	EstimatorSpec

items in both:
	DNNClassifier
	DNNLinearCombinedClassifier
	DNNLinearCombinedRegressor
	DNNRegressor
	Estimator
	LinearClassifier
	LinearRegressor


of course, having a named object in both modules doesn't mean the code is identical -- just a suggestion that the two are related

### debugging `tf.contrib.learn` estimators

`tfdbg` can hook into the `fit` and `evaluate` methods of `tf-learn` `Estimator` objects because those methods accept hooks: `fit` via its `monitors` argument and `evaluate` via its `hooks` argument:

```python
from tensorflow.python import debug as tf_debug

# Create a LocalCLIDebugHook and use it as a monitor when calling fit().
hooks = [tf_debug.LocalCLIDebugHook()]

# `classifier` is an instance of one of the classifier
# classes in `tf.contrib.learn`
classifier.fit(x=training_set.data,
               y=training_set.target,
               steps=1000,
               monitors=hooks)

accuracy_score = classifier.evaluate(x=test_set.data,
                                     y=test_set.target,
                                     hooks=hooks)["accuracy"]
```

the example lives in [`debug_tflearn_iris.py`](https://github.com/tensorflow/tensorflow/blob/r1.8/tensorflow/python/debug/examples/debug_tflearn_iris.py) and can be investigated via the command

```
python -m tensorflow.python.debug.examples.debug_tflearn_iris --debug
```

### debugging `tf.contrib.learn` experiments

we have a lot of experience so far with the `experiments` api, but there is a different api available in `tf.contrib.learn`: `Experiment`

In [20]:
# tf.contrib.learn.Experiment?

directly from the docs:

> THIS CLASS IS DEPRECATED. See
[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
for general migration instructions.

looks like experiments has been migrated to `tf.estimator.train_and_evaluate`

basically, this section of the docs is too old and has been deprecated. I am inferring (I hope correctly) that the `hooks` are now passed to the `tf.estimator.TrainSpec` and `tf.estimator.EvalSpec`.

the new "experiment" interface is `tf.estimator.train_and_evaluate`, and that takes as arguments an `estimator`, and then a `TrainSpec` and `EvalSpec`:

In [25]:
tf.estimator.train_and_evaluate?

those specs themselves take `hooks`, which would indicate to me that they are debug-able:

In [26]:
tf.estimator.TrainSpec?
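if that inference holds, wiring the debugger in might look something like the sketch below. this is untested guesswork on my part and assumes a `tensorflow` 1.x install: `train_and_evaluate_with_tfdbg` is a made-up helper name, and `estimator`, `train_input_fn`, and `eval_input_fn` are whatever you already have in your program

```python
def train_and_evaluate_with_tfdbg(estimator, train_input_fn, eval_input_fn,
                                  max_steps=1000):
    """Run train_and_evaluate with the tfdbg cli hook attached to both specs."""
    # imported lazily so that merely defining this helper does not need tensorflow
    import tensorflow as tf
    from tensorflow.python import debug as tf_debug

    # assumption: the same hook object used for `fit` / `evaluate` also works here
    hooks = [tf_debug.LocalCLIDebugHook()]
    train_spec = tf.estimator.TrainSpec(
        input_fn=train_input_fn, max_steps=max_steps, hooks=hooks)
    eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn, hooks=hooks)
    return tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```

calling it should drop you into the `curses` interface on the first `run` call made during training, the same way the wrapped session does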

## debugging `keras` models with `tfdbg`

and what if we want to use the `keras` api? simple: tell `keras` to use a wrapped session:

```python
import tensorflow as tf
from tensorflow.python import debug as tf_debug

tf.keras.backend.set_session(tf_debug.LocalCLIDebugWrapperSession(tf.Session()))
```

## debugging tf-slim with `tfdbg`

what if you're using yet another fucking higher level api, `tf-slim`, defined in `tf.contrib.slim`

In [29]:
tf.contrib.slim.learning?

**I WILL SKIP THIS SECTION**: per this GitHub issue comment by a tf developer, slim is basically deprecated and should be fully avoided: https://github.com/tensorflow/tensorflow/issues/16182#issuecomment-372397483

### debugging training in `tf-slim`

**I WILL SKIP THIS SECTION**: per this GitHub issue comment by a tf developer, slim is basically deprecated and should be fully avoided: https://github.com/tensorflow/tensorflow/issues/16182#issuecomment-372397483

### debugging evaluation in `tf-slim`

**I WILL SKIP THIS SECTION**: per this GitHub issue comment by a tf developer, slim is basically deprecated and should be fully avoided: https://github.com/tensorflow/tensorflow/issues/16182#issuecomment-372397483

## offline debugging of remotely-running sessions

what to do if you don't have terminal access to a running session? use the `offline_analyzer` binary of `tfdbg`

### debugging remote `tf.sessions`

suppose you have a `tf.Session` already connected to a remote service. every time you `run` that session, you can pass a `tf.RunOptions` object. `tfdbg` provides a function (`tf_debug.watch_graph`) that updates that object so the graph is watched as it executes and the tensors are saved to a directory where they can be opened and examined after the fact (I believe that is what is going on, at least!)

this is done with the following code:

```python
from tensorflow.python import debug as tf_debug

# ... Code where your session and graph are set up...

run_options = tf.RunOptions()
tf_debug.watch_graph(
      run_options,
      session.graph,
      debug_urls=["file:///shared/storage/location/tfdbg_dumps_1"]
)
# Be sure to specify different directories for different run() calls.

session.run(fetches, feed_dict=feeds, options=run_options)
```
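since each `run()` call wants its own dump directory, one way to keep them apart is to generate the `file://` url from a counter. a minimal sketch, where `dump_url_for_run` is a hypothetical helper of mine and the `tfdbg_dumps_<n>` naming just mirrors the path used above:

```python
import posixpath


def dump_url_for_run(dump_root, run_index):
    """Build a distinct file:// debug url for one run() call."""
    dump_dir = posixpath.join(dump_root, "tfdbg_dumps_{}".format(run_index))
    return "file://" + dump_dir


for step in range(1, 4):
    print(dump_url_for_run("/shared/storage/location", step))
```

the result would then be passed as `debug_urls=[dump_url_for_run(root, i)]` when calling `tf_debug.watch_graph` for the `i`-th run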

now we are presupposing that this was done for multiple run calls and the program has run its course (probably incorrectly at that). those files were written to the server which remotely executed the graph.

**here's hoping you actually have file access to those debug directories!**

you actually need to access those directories to run `tfdbg` against them. this means that if you don't have shared directory access, you're kinda effed.

if you *do* have access,

```
python -m tensorflow.python.debug.cli.offline_analyzer \
    --dump_dir=/shared/storage/location/tfdbg_dumps_1
```

### `c++` and other languages

blah blah modify `debug_options` field of `RunOptions` blah blah

### debugging remotely-running `tf-learn` estimators and experiments

above we debugged *local* estimators using the `tf_debug.LocalCLIDebugHook` `hooks`. for a *remote* estimator we can use the `DumpingDebugHook`, which will do the same sort of thing as the session dumps: write outputs to files and then post-facto ingest them:

```python
# Let your BUILD target depend on "//tensorflow/python/debug:debug_py"
# (You don't need to worry about the BUILD dependency if you are using a pip
#  install of open-source TensorFlow.)
from tensorflow.python import debug as tf_debug

hooks = [tf_debug.DumpingDebugHook("/shared/storage/location/tfdbg_dumps_1")]
```

and after files have been written to **some shared location**:

```
python -m tensorflow.python.debug.cli.offline_analyzer \
    --dump_dir="/shared/storage/location/tfdbg_dumps_1/run_<epoch_timestamp_microsec>_<uuid>"
```
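typing out the `run_<epoch_timestamp_microsec>_<uuid>` folder name by hand gets old fast. here's a stdlib-only sketch for picking the most recent run folder under the dump root. `latest_run_dir` is a made-up helper, the fake folder names in the demo are invented, and I'm assuming the folder names really do follow the pattern shown in the command above

```python
import os
import re
import tempfile

# assumed naming scheme: run_<epoch_timestamp_microsec>_<uuid>
RUN_DIR_PATTERN = re.compile(r"^run_(\d+)_")


def latest_run_dir(dump_root):
    """Return the run_* subdirectory with the largest timestamp, or None."""
    candidates = []
    for name in os.listdir(dump_root):
        match = RUN_DIR_PATTERN.match(name)
        if match and os.path.isdir(os.path.join(dump_root, name)):
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    return os.path.join(dump_root, max(candidates)[1])


# demo against a throwaway directory with two fake run folders
demo_root = tempfile.mkdtemp()
for fake in ["run_1526000000000000_aaaa", "run_1526000001000000_bbbb"]:
    os.mkdir(os.path.join(demo_root, fake))

newest = latest_run_dir(demo_root)
print("python -m tensorflow.python.debug.cli.offline_analyzer "
      "--dump_dir={}".format(newest))
```

the printed line is the `offline_analyzer` command pointed at the newest run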

## frequently asked questions

a bunch of more or less interesting stuff, but one big one: there is a `tensorboard` plugin for `tfdbg`

# summary

`tfdbg` is a pretty well-featured debugging console application that provides you with tools to step through individual `tf.Session.run` calls and investigate the produced tensors at each stage. this documentation provides a quick overview of the most relevant commands and outlines how to use them in some probably-common use cases