# Train and visualize a model in TensorFlow - Part 3: TensorFlow DNNClassifier

In [1]:
import numpy as np
import tensorflow as tf

from sklearn.metrics import classification_report

## Data management

Before creating the model, we need to specify what the input and output are going to be. For that, we use the document matrix obtained in the previous part as input to the classifier.

However, optimization algorithms such as Stochastic Gradient Descent consume the data in small batches rather than all at once. On top of that, the training cycle goes through the entire dataset several times (epochs) before converging to a good solution.
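To make the idea of batches and epochs concrete, here is a minimal sketch in plain Python. The helper `iterate_batches` and its parameters are hypothetical names chosen for illustration; this is not how TensorFlow implements its input pipeline, only the general pattern of going through a dataset several times in small pieces.

```python
import random


def iterate_batches(examples, batch_size, num_epochs=1, shuffle=False, seed=0):
    """Yield lists of at most `batch_size` examples, `num_epochs` times over.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    for _ in range(num_epochs):
        # Visit every example exactly once per epoch
        order = list(range(len(examples)))
        if shuffle:
            rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            yield [examples[i] for i in order[start:start + batch_size]]


# Toy dataset of 10 "examples", read twice in batches of 4
toy_examples = list(range(10))
batches = list(iterate_batches(toy_examples, batch_size=4, num_epochs=2))

for batch in batches:
    print(batch)
```

In this sketch the last batch of each epoch is smaller than the others because 10 is not a multiple of 4. Real input pipelines may handle that boundary differently.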

Fortunately, TensorFlow provides a way to iterate over a dataset several times in small batches. These functions are called input functions, and they can take a numpy array or a pandas dataframe. It's worth noting that recent TensorFlow updates have added more functions to transform the input data in batches, handling the encoding of categorical features, embeddings, etc., although we won't use those functions here.

We load our dataset and create the input function to handle it with the following code:

In [2]:
# Load the dataset into a numpy keyed structure
newsgroups = np.load('./resources/newsgroup.npz')

# Define the batch size
batch_size = 100

def dataset_input_fn(dataset):
    """
    Creates an input function using the `numpy_input_fn` method from
    tensorflow, based on the dataset we want to use.
    
    Args:
        dataset: String that represents the dataset (should be `train` or `test`)
    
    Returns:
        A `numpy_input_fn` function to feed to an estimator
    """
    assert dataset in ('train', 'test'), "The selected dataset should be `train` or `test`"
    
    return tf.estimator.inputs.numpy_input_fn(
        # A dictionary of numpy arrays that match each array with the
        # corresponding column in the model. For this case we only
        # have "one" column which represents the whole array.
        x={'input_data': newsgroups['%s_data' % dataset]},
        # The target array
        y=newsgroups['%s_target' % dataset],
        # The batch size to iterate the data in small fractions
        batch_size=batch_size,
        # If the dataset is `test` only run once
        num_epochs=1 if dataset == 'test' else None,
        # Only shuffle the dataset for the `train` data
        shuffle=dataset == 'train'
    )

## Defining the model

The classifier to train is a `tf.estimator.DNNClassifier`, which is TensorFlow's canned estimator for a feed-forward neural network with fully connected hidden layers.

The object instantiation takes as input an iterable (i.e. `feature_columns`) whose entries match the keys of the dictionary fed to the input function. As the input function only provides one column, with a number of dimensions equal to the number of dimensions of the input data, there is only one feature column with that number of dimensions.

In [3]:
input_size = newsgroups['train_data'].shape[1]
num_classes = newsgroups['labels'].shape[0]

feature_columns = [tf.feature_column.numeric_column(
    'input_data', shape=(input_size,))]

model = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=(5000, 2000,),
    n_classes=num_classes,
    model_dir="/tmp/ng_model")

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_num_worker_replicas': 1, '_save_checkpoints_steps': None, '_task_id': 0, '_service': None, '_log_step_count_steps': 100, '_keep_checkpoint_max': 5, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_model_dir': '/tmp/ng_model', '_session_config': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x1480a7829ef0>, '_task_type': 'worker', '_tf_random_seed': None, '_num_ps_replicas': 0, '_keep_checkpoint_every_n_hours': 10000, '_is_chief': True, '_master': ''}
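
To give an intuition of what a model like the one defined above computes, here is a minimal sketch of a forward pass through a fully connected network in plain Python. It assumes ReLU activations in the hidden layers and a softmax over the output layer, which, as far as we know, matches the defaults of `DNNClassifier`. The layer sizes are toy values, the weights are random and untrained, and all the helper names (`dense`, `relu`, `softmax`, `init_layer`, `forward`) are hypothetical.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: `weights` holds one weight vector per output unit."""
    return [
        sum(x * w for x, w in zip(inputs, unit_weights)) + b
        for unit_weights, b in zip(weights, biases)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(logits):
    # Subtract the maximum for numerical stability
    highest = max(logits)
    exps = [math.exp(v - highest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random (untrained) weights and zero biases, for illustration only."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(inputs, layers):
    activations = inputs
    # Hidden layers: dense + ReLU
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    # Output layer: dense + softmax gives one probability per class
    out_weights, out_biases = layers[-1]
    return softmax(dense(activations, out_weights, out_biases))


rng = random.Random(0)
# Toy sizes standing in for input_size, hidden_units and n_classes
toy_input_size, toy_hidden_units, toy_num_classes = 6, (4, 3), 5
sizes = (toy_input_size,) + toy_hidden_units + (toy_num_classes,)
layers = [init_layer(rng, n_in, n_out) for n_in, n_out in zip(sizes, sizes[1:])]

probabilities = forward([1.0, 0.0, 2.0, 0.0, 0.0, 1.0], layers)
predicted_class = max(range(toy_num_classes), key=probabilities.__getitem__)
print(probabilities, predicted_class)
```

Training adjusts the weights and biases of every layer so that the probability assigned to the correct class increases. That is the part the estimator handles for us.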


## Training cycle

Now that we have built the model, we can run the training cycle.

In [4]:
model.train(input_fn=dataset_input_fn("train"), steps=4000)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/ng_model/model.ckpt.
INFO:tensorflow:step = 1, loss = 299.67517
INFO:tensorflow:global_step/sec: 36.4463
INFO:tensorflow:step = 101, loss = 84.466484 (2.744 sec)
INFO:tensorflow:global_step/sec: 34.1879
INFO:tensorflow:step = 201, loss = 43.79915 (2.926 sec)
INFO:tensorflow:global_step/sec: 35.2797
INFO:tensorflow:step = 301, loss = 9.824969 (2.834 sec)
INFO:tensorflow:global_step/sec: 34.3661
INFO:tensorflow:step = 401, loss = 1.3690943 (2.911 sec)
INFO:tensorflow:global_step/sec: 35.0466
INFO:tensorflow:step = 501, loss = 0.45615438 (2.853 sec)
INFO:tensorflow:global_step/sec: 35.0685
INFO:tensorflow:step = 601, loss = 0.24632725 (2.850 sec)
INFO:tensorflow:global_step/sec: 34.9918
INFO:tensorflow:step = 701, loss = 0.22747818 (2.859 sec)
INFO:tensorflow:global_step/sec: 34.8299
INFO:tensorflow:step = 801, loss = 0.11410611 (2.871 sec)
INFO:tensorflow:global_step/sec: 33.9194
INFO:tensorflo

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x1480a78050f0>

## Evaluation

As seen before, it is also quite easy to get the evaluation metrics defined in the model after training:

In [5]:
# Evaluate the model and print results
eval_results = model.evaluate(input_fn=dataset_input_fn("test"))
print("Accuracy: %.2f" % eval_results['accuracy'])

INFO:tensorflow:Starting evaluation at 2018-01-19-14:23:21
INFO:tensorflow:Restoring parameters from /tmp/ng_model/model.ckpt-4000
INFO:tensorflow:Finished evaluation at 2018-01-19-14:23:22
INFO:tensorflow:Saving dict for global step 4000: accuracy = 0.9020181, average_loss = 0.47777027, global_step = 4000, loss = 47.34955
Accuracy: 0.90
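
Accuracy summarizes the performance in a single number: the fraction of examples whose predicted class matches the target. Per-class precision, recall and F1 break that number down by label. The following is a minimal sketch of those definitions in plain Python, using made-up toy labels rather than the newsgroups data. The function names are hypothetical.

```python
def accuracy(targets, predictions):
    """Fraction of examples where the prediction matches the target."""
    correct = sum(t == p for t, p in zip(targets, predictions))
    return correct / len(targets)


def per_class_scores(targets, predictions, label):
    """Precision, recall and F1 for a single class label."""
    true_positives = sum(
        t == label and p == label for t, p in zip(targets, predictions))
    predicted = sum(p == label for p in predictions)  # times we predicted `label`
    actual = sum(t == label for t in targets)         # times `label` was correct
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy labels, made up for illustration
toy_targets = [0, 0, 1, 1, 2, 2, 2, 1]
toy_predictions = [0, 1, 1, 1, 2, 0, 2, 1]

print("Accuracy: %.2f" % accuracy(toy_targets, toy_predictions))
for label in sorted(set(toy_targets)):
    print(label, "%.2f %.2f %.2f" % per_class_scores(
        toy_targets, toy_predictions, label))
```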


We can even use the same scikit-learn tools that we use for any other model once we have the array of predictions:

In [6]:
test_predictions = list(model.predict(input_fn=dataset_input_fn("test")))
test_predictions_classes = np.array(
    [p['class_ids'][0] for p in test_predictions])

print(classification_report(
    newsgroups['test_target'], test_predictions_classes))

INFO:tensorflow:Restoring parameters from /tmp/ng_model/model.ckpt-4000
             precision    recall  f1-score   support

          0       0.92      0.96      0.94       160
          1       0.88      0.81      0.84       195
          2       0.82      0.87      0.84       197
          3       0.77      0.80      0.78       196
          4       0.92      0.80      0.86       192
          5       0.89      0.92      0.90       196
          6       0.89      0.82      0.86       194
          7       0.91      0.91      0.91       198
          8       0.97      0.96      0.97       199
          9       0.96      0.96      0.96       199
         10       0.98      0.94      0.96       200
         11       0.97      0.94      0.96       198
         12       0.77      0.89      0.82       196
         13       0.89      0.97      0.93       198
         14       0.95      0.94      0.95       197
         15       0.94      0.94      0.94       200
         16       0.92    