# Train and visualize a model in TensorFlow - Part 3: TensorFlow DNNClassifier

In [1]:
import numpy as np
import tensorflow as tf

from sklearn.metrics import classification_report

## Data management

Before creating the model, we need to specify what its input and output are going to be. For that, we use the document matrix obtained in the previous part as input to the classifier.

However, most optimization algorithms in the family of Stochastic Gradient Descent consume the data in small portions (mini-batches), updating the model after each one. On top of that, the training cycle goes through the entire dataset several times (epochs) before converging to a good solution.

Fortunately, TensorFlow provides a way to iterate over a dataset several times in small batches. These functions are called input functions, and they can take a NumPy array or a pandas DataFrame. It's worth noting that recent TensorFlow updates have been adding more functions to transform the input data in batches, handling the encoding of categorical features, embeddings, etc., although we won't use those functions here.
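To make the idea of batches and epochs concrete, here is a minimal NumPy sketch of what such an iteration amounts to conceptually. It is not how TensorFlow implements its input functions; `iterate_minibatches` and the toy arrays are hypothetical names made up for illustration.

```python
import numpy as np

def iterate_minibatches(data, target, batch_size, num_epochs, shuffle=True, seed=0):
    """Yield (data, target) mini-batches, passing over the dataset `num_epochs` times."""
    rng = np.random.RandomState(seed)
    num_examples = data.shape[0]
    for _ in range(num_epochs):
        # Visit the examples in a new random order on every epoch
        order = rng.permutation(num_examples) if shuffle else np.arange(num_examples)
        for start in range(0, num_examples, batch_size):
            batch_idx = order[start:start + batch_size]
            yield data[batch_idx], target[batch_idx]

# Toy data (assumption): 10 examples with 3 features each
toy_data = np.arange(30, dtype=np.float32).reshape(10, 3)
toy_target = np.arange(10)

batches = list(iterate_minibatches(toy_data, toy_target, batch_size=4, num_epochs=2))
print(len(batches))                      # number of batches over both epochs
print([x.shape[0] for x, _ in batches])  # the last batch of each epoch is smaller
```

With 10 examples and a batch size of 4, each epoch yields batches of 4, 4 and 2 examples, and every example is seen exactly once per epoch.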

We load our dataset and create the input function to handle it with the following code:

In [2]:
# Load the dataset into a numpy keyed structure
newsgroups = np.load('./resources/newsgroup.npz')

# Define the batch size
batch_size = 100

def dataset_input_fn(dataset):
    """
    Creates an input function using the `numpy_input_fn` method from
    tensorflow, based on the dataset we want to use.
    
    Args:
        dataset: String that represents the dataset (should be `train` or `test`)
    
    Returns:
        A `numpy_input_fn` function to feed to an estimator
    """
    assert dataset in ('train', 'test'), "The selected dataset should be `train` or `test`"
    
    return tf.estimator.inputs.numpy_input_fn(
        # A dictionary of numpy arrays that matches each array with the corresponding column in the model.
        # In this case we have only one column, which holds all the dimensions of the embeddings.
        x={'input_data': newsgroups['%s_data' % dataset]},
        # The target array
        y=newsgroups['%s_target' % dataset],
        # The batch size to iterate the data in small fractions
        batch_size=batch_size,
        # If the dataset is `test` only run once
        num_epochs=1 if dataset == 'test' else None,
        # Only shuffle the dataset for the `train` data
        shuffle=dataset == 'train'
    )

## Defining the model

The classifier to train is a `tf.estimator.DNNClassifier`, TensorFlow's wrapper for a feed-forward neural network classifier. Here it has a single hidden layer of 5000 units.

The constructor takes as input a list of feature columns (i.e. `feature_columns`) that must match the keys of the dictionary fed to the input function. As the input function provides only one column, with a number of dimensions equal to the number of dimensions in the embeddings, there is only one feature column of that shape.

In [3]:
input_size = newsgroups['train_data'].shape[1]
num_classes = newsgroups['labels'].shape[0]

feature_columns = [tf.feature_column.numeric_column('input_data', shape=(input_size,))]

model = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=(5000,),
    n_classes=num_classes,
    model_dir="/tmp/ng_model")

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x14828a693fd0>, '_num_worker_replicas': 1, '_master': '', '_num_ps_replicas': 0, '_model_dir': '/tmp/ng_model', '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_secs': 600, '_service': None, '_is_chief': True, '_session_config': None, '_save_summary_steps': 100, '_log_step_count_steps': 100, '_task_id': 0, '_tf_random_seed': None}
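As a rough illustration of what a network with one hidden layer computes at prediction time, the following NumPy sketch runs a forward pass with random weights. It assumes a ReLU hidden activation (the default for `DNNClassifier`) and a softmax output; the `forward` function, the weight names and the toy sizes are hypothetical and far smaller than the 5000 hidden units used above.

```python
import numpy as np

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer classifier: ReLU hidden layer, softmax output."""
    hidden = np.maximum(0.0, x @ w_hidden + b_hidden)  # ReLU activation
    logits = hidden @ w_out + b_out
    # Numerically stable softmax: subtract the row maximum before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Toy sizes (assumption), standing in for `input_size`, the hidden units and `num_classes`
rng = np.random.RandomState(0)
toy_input_size, toy_hidden_units, toy_num_classes = 8, 16, 4

x = rng.randn(5, toy_input_size)  # a batch of 5 made-up examples
w_hidden = rng.randn(toy_input_size, toy_hidden_units) * 0.1
b_hidden = np.zeros(toy_hidden_units)
w_out = rng.randn(toy_hidden_units, toy_num_classes) * 0.1
b_out = np.zeros(toy_num_classes)

probs = forward(x, w_hidden, b_hidden, w_out, b_out)
print(probs.shape)           # one row of class probabilities per example
print(probs.argmax(axis=1))  # predicted class id for each example
```

Each row of `probs` sums to one, and the predicted class is the index of the largest probability in the row.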


## Training cycle

Now that the model is defined, we can run the training cycle.

In [4]:
model.train(input_fn=dataset_input_fn("train"), steps=2000)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/ng_model/model.ckpt.
INFO:tensorflow:loss = 299.54712, step = 1
INFO:tensorflow:global_step/sec: 37.6368
INFO:tensorflow:loss = 46.64124, step = 101 (2.659 sec)
INFO:tensorflow:global_step/sec: 38.5234
INFO:tensorflow:loss = 16.212477, step = 201 (2.596 sec)
INFO:tensorflow:global_step/sec: 38.8718
INFO:tensorflow:loss = 10.498708, step = 301 (2.574 sec)
INFO:tensorflow:global_step/sec: 37.5528
INFO:tensorflow:loss = 1.0421454, step = 401 (2.662 sec)
INFO:tensorflow:global_step/sec: 38.6393
INFO:tensorflow:loss = 0.9621756, step = 501 (2.587 sec)
INFO:tensorflow:global_step/sec: 38.4342
INFO:tensorflow:loss = 0.6435354, step = 601 (2.603 sec)
INFO:tensorflow:global_step/sec: 37.7656
INFO:tensorflow:loss = 0.5751495, step = 701 (2.649 sec)
INFO:tensorflow:global_step/sec: 35.9498
INFO:tensorflow:loss = 0.4871225, step = 801 (2.780 sec)
INFO:tensorflow:global_step/sec: 35.7682
INFO:tensorflow:l

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x14828a66c160>
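Because the training input function is created with `num_epochs=None`, it keeps producing batches indefinitely, so the `steps` argument is what bounds training. The short sketch below relates steps to epochs; the training set size is a hypothetical placeholder, since the real value depends on `newsgroups['train_data'].shape[0]`.

```python
def epochs_covered(steps, batch_size, num_examples):
    """Number of passes over the dataset made by `steps` batches of `batch_size`."""
    return steps * batch_size / num_examples

steps, batch_size = 2000, 100      # values used in this notebook
hypothetical_train_size = 10000    # assumption: replace with the real number of training rows

examples_seen = steps * batch_size
epochs = epochs_covered(steps, batch_size, hypothetical_train_size)
print(examples_seen)  # total examples processed during training
print(epochs)         # passes over the (hypothetical) training set
```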

## Evaluation

As seen before, it is also quite easy to get the evaluation metrics defined in the model after training:

In [5]:
# Evaluate the model and print results
eval_results = model.evaluate(input_fn=dataset_input_fn("test"))
print("Accuracy: %.2f" % eval_results['accuracy'])

INFO:tensorflow:Starting evaluation at 2018-01-19-13:50:37
INFO:tensorflow:Restoring parameters from /tmp/ng_model/model.ckpt-2000
INFO:tensorflow:Finished evaluation at 2018-01-19-13:50:38
INFO:tensorflow:Saving dict for global step 2000: accuracy = 0.9065321, average_loss = 0.33596954, global_step = 2000, loss = 33.29635
Accuracy: 0.91
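The accuracy reported above is the fraction of test examples whose predicted class matches the target. A minimal sketch of that computation, using made-up arrays (`toy_target` and `toy_predicted` are hypothetical, not the notebook's data):

```python
import numpy as np

# Made-up targets and predictions for 8 examples (assumption, for illustration only)
toy_target = np.array([0, 2, 1, 1, 3, 0, 2, 3])
toy_predicted = np.array([0, 2, 1, 0, 3, 0, 1, 3])

correct = toy_predicted == toy_target  # boolean array, True where the prediction is right
accuracy = correct.mean()              # fraction of correct predictions
print("Accuracy: %.2f" % accuracy)
```

Here 6 of the 8 predictions are correct, giving an accuracy of 0.75.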


In [6]:
test_predictions = list(model.predict(input_fn=dataset_input_fn("test")))
test_predictions_classes = np.array([p['class_ids'][0] for p in test_predictions])

print(classification_report(newsgroups['test_target'], test_predictions_classes))

INFO:tensorflow:Restoring parameters from /tmp/ng_model/model.ckpt-2000
             precision    recall  f1-score   support

          0       0.93      0.96      0.94       160
          1       0.80      0.87      0.83       195
          2       0.86      0.84      0.85       197
          3       0.83      0.78      0.80       196
          4       0.91      0.83      0.87       192
          5       0.88      0.91      0.89       196
          6       0.88      0.78      0.83       194
          7       0.89      0.93      0.91       198
          8       0.99      0.95      0.97       199
          9       0.95      0.97      0.96       199
         10       0.98      0.95      0.97       200
         11       0.98      0.94      0.96       198
         12       0.76      0.90      0.83       196
         13       0.92      0.96      0.94       198
         14       0.98      0.95      0.97       197
         15       0.94      0.94      0.94       200
         16       0.95    