# Train and visualize a model in TensorFlow - Part 4: TensorFlow DNNClassifier

This is the last part of the tutorial on training a model with TensorFlow. In [Part 3](https://github.com/PLN-FaMAF/tensorflowTutorial2018/blob/master/tensorflow_tutorial_3.ipynb) we saw how to design, train and evaluate a neural network using TensorFlow's API. That method is highly customizable and flexible, but it is also tedious to work with.

When TensorFlow first came out, the previous method was the only way to create a model. Luckily, newer versions of the library ship with packages that are simpler to work with. In particular, the [estimator API](https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator) provides a much simpler way to define a model without having to work with the math behind it.

In [1]:
import numpy as np
import tensorflow as tf

from sklearn.metrics import classification_report

## Data management

Before creating the model, we need to specify what the input and output are going to be. For that, we use the document matrix obtained in the previous part as input to the classifier.

However, most optimization algorithms in the family of Stochastic Gradient Descent consume the data in small portions (mini-batches) rather than all at once. On top of that, the training cycle goes through the entire dataset several times (epochs) before converging to a good solution.

Fortunately, TensorFlow has a solution for iterating over a dataset several times in small batches. These functions are called input functions, and they can take a numpy array or a pandas dataframe. It's worth noting that recent updates of TensorFlow have been adding more functions to transform the input data in batches, handling the encoding of categorical features, embeddings, etc., although we won't use those functions here.

We load our dataset and create the input function to handle it with the following code:

In [2]:
# Load the dataset into a numpy keyed structure
newsgroups = np.load('./resources/newsgroup.npz')

# Define the batch size
batch_size = 100

def dataset_input_fn(dataset):
    """
    Creates an input function using the `numpy_input_fn` method from
    tensorflow, based on the dataset we want to use.
    
    Args:
        dataset: String that represents the dataset
        (should be `train` or `test`)
    
    Returns:
        A `numpy_input_fn` function to feed to an estimator
    """
    assert dataset in ('train', 'test'),\
        "The selected dataset should be `train` or `test`"
    
    return tf.estimator.inputs.numpy_input_fn(
        # A dictionary of numpy arrays that matches each array with the
        # corresponding column in the model. In this case we only
        # have one column, which holds the whole array.
        x={'input_data': newsgroups['%s_data' % dataset]},
        # The target array
        y=newsgroups['%s_target' % dataset],
        # The batch size to iterate the data in small fractions
        batch_size=batch_size,
        # If the dataset is `test` only run once
        num_epochs=1 if dataset == 'test' else None,
        # Only shuffle the dataset for the `train` data
        shuffle=dataset == 'train'
    )
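
To make the role of `batch_size`, `num_epochs` and `shuffle` more concrete, here is a minimal pure-numpy sketch of what an input function conceptually does. The `batch_generator` helper and the toy arrays are hypothetical and only for illustration; the real `numpy_input_fn` is queue-based and may differ in details such as how it handles epoch boundaries.

```python
import numpy as np


def batch_generator(features, targets, batch_size,
                    num_epochs=None, shuffle=False, seed=0):
    """Yield (features_dict, targets) mini-batches.

    Hypothetical helper that loosely mimics an input function:
    `num_epochs=None` repeats forever, an integer stops after
    that many passes over the data.
    """
    rng = np.random.RandomState(seed)
    n = targets.shape[0]
    epoch = 0
    while num_epochs is None or epoch < num_epochs:
        # Visit the rows in random order only when shuffling is on
        order = rng.permutation(n) if shuffle else np.arange(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # The dictionary key plays the role of the column name
            yield {'input_data': features[idx]}, targets[idx]
        epoch += 1


# Toy data standing in for the newsgroups arrays (made-up sizes)
toy_data = np.arange(50, dtype=np.float32).reshape(10, 5)
toy_target = np.arange(10)

# One ordered pass, as we do for `test`: batches of 4, 4 and 2 rows
test_batches = list(batch_generator(
    toy_data, toy_target, batch_size=4, num_epochs=1))
print([y.shape[0] for _, y in test_batches])  # [4, 4, 2]
```

With `num_epochs=None` the generator never stops by itself, which is why the number of training iterations has to be bounded elsewhere (for an estimator, through the `steps` argument of `train`).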

## Defining the model

The classifier to train is a `tf.estimator.DNNClassifier`, which is TensorFlow's ready-made estimator for a feed-forward neural network with fully connected hidden layers.

The constructor takes an iterable of feature columns (i.e. `feature_columns`) that must match the dictionary fed to the input function. As the input function only provides one column, with a number of dimensions equal to the number of dimensions in the embeddings, there is only one feature column of that size. The `hidden_units` argument sets the size of each hidden layer: here, two layers of 5000 and 2000 units.

In [3]:
input_size = newsgroups['train_data'].shape[1]
num_classes = newsgroups['labels'].shape[0]

feature_columns = [tf.feature_column.numeric_column(
    'input_data', shape=(input_size,))]

model = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=(5000, 2000,),
    n_classes=num_classes,
    model_dir="/tmp/ng_model")

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_num_ps_replicas': 0, '_tf_random_seed': None, '_num_worker_replicas': 1, '_keep_checkpoint_every_n_hours': 10000, '_session_config': None, '_task_type': 'worker', '_model_dir': '/tmp/ng_model', '_save_checkpoints_steps': None, '_task_id': 0, '_service': None, '_keep_checkpoint_max': 5, '_master': '', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x146cc7350eb8>, '_log_step_count_steps': 100, '_save_checkpoints_secs': 600, '_save_summary_steps': 100}
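
To relate this to the hand-written network of Part 3, the following is a minimal numpy sketch of the forward pass that a classifier with two hidden layers performs. It is an illustration under assumptions, not the estimator's actual implementation: the layer sizes are toy values, the weights are random, and we assume ReLU activations (which, as far as we know, is the default of `DNNClassifier`) followed by a softmax over the classes.

```python
import numpy as np


def relu(x):
    # Element-wise rectified linear unit
    return np.maximum(x, 0.0)


def softmax(logits):
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def dnn_forward(x, weights, biases):
    """Apply every hidden layer with ReLU, then a softmax output layer."""
    hidden = x
    for w, b in zip(weights[:-1], biases[:-1]):
        hidden = relu(hidden @ w + b)
    logits = hidden @ weights[-1] + biases[-1]
    return softmax(logits)


rng = np.random.RandomState(0)
# Toy sizes: input, two hidden layers, number of classes
layer_sizes = [8, 6, 4, 3]
weights = [rng.normal(scale=0.1, size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

probabilities = dnn_forward(rng.normal(size=(5, 8)), weights, biases)
print(probabilities.shape)  # (5, 3)
```

Each row of `probabilities` is a distribution over the classes for one example; the estimator additionally takes care of the loss, the optimizer and the checkpoints for us.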


## Training cycle

Now that we have built the model, we can run the training cycle.

In [4]:
model.train(input_fn=dataset_input_fn("train"), steps=2000)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from /tmp/ng_model/model.ckpt-2
INFO:tensorflow:Saving checkpoints for 3 into /tmp/ng_model/model.ckpt.
INFO:tensorflow:loss = 375.80908, step = 3
INFO:tensorflow:global_step/sec: 30.59
INFO:tensorflow:loss = 82.62279, step = 103 (3.271 sec)
INFO:tensorflow:global_step/sec: 32.1716
INFO:tensorflow:loss = 23.374432, step = 203 (3.109 sec)
INFO:tensorflow:global_step/sec: 31.548
INFO:tensorflow:loss = 5.9647384, step = 303 (3.170 sec)
INFO:tensorflow:global_step/sec: 31.3147
INFO:tensorflow:loss = 1.7345433, step = 403 (3.193 sec)
INFO:tensorflow:global_step/sec: 31.7272
INFO:tensorflow:loss = 0.7357566, step = 503 (3.152 sec)
INFO:tensorflow:global_step/sec: 31.6862
INFO:tensorflow:loss = 0.6507288, step = 603 (3.156 sec)
INFO:tensorflow:global_step/sec: 31.8988
INFO:tensorflow:loss = 0.24774258, step = 703 (3.134 sec)
INFO:tensorflow:global_step/sec: 32.1994
INFO:tensorflow:loss = 0.25066963, step = 803 (3

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x146cc7350e10>
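
Note that the log starts at step 3 because the estimator found a previous checkpoint in `model_dir` and resumed from it; `steps` counts additional iterations on top of that. Since the training input function repeats the data indefinitely, it can help to translate `steps` into epochs. The sketch below does that arithmetic; `num_train_examples` is a made-up value, and in the notebook it would be `newsgroups['train_data'].shape[0]`.

```python
import math

batch_size = 100
steps = 2000
# Hypothetical dataset size, only for illustration
num_train_examples = 10000

# Every step consumes one batch
examples_seen = steps * batch_size
# Steps needed to go through the whole dataset once
steps_per_epoch = math.ceil(num_train_examples / batch_size)
# Number of passes over the data performed by this training call
epochs = examples_seen / num_train_examples

print(examples_seen, steps_per_epoch, epochs)  # 200000 100 20.0
```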

## Evaluation

As seen before, it is also quite easy to get the evaluation metrics defined in the model after training:

In [5]:
# Evaluate the model and print results
eval_results = model.evaluate(
    input_fn=dataset_input_fn("test"))
print("Accuracy: %.2f" % eval_results['accuracy'])

INFO:tensorflow:Starting evaluation at 2018-02-05-21:15:10
INFO:tensorflow:Restoring parameters from /tmp/ng_model/model.ckpt-2002
INFO:tensorflow:Finished evaluation at 2018-02-05-21:15:11
INFO:tensorflow:Saving dict for global step 2002: accuracy = 0.8951142, average_loss = 0.45907655, global_step = 2002, loss = 45.496902
Accuracy: 0.90
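
The accuracy reported above is simply the fraction of examples whose predicted class matches the target. As a sanity check, it can be recomputed by hand; the following sketch uses made-up toy arrays and a hypothetical `accuracy` helper. Regarding the other values in the log, `average_loss` appears to be the mean loss per example, while `loss` seems to be aggregated per batch, which would explain why it is close to `batch_size` times larger.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and target agree."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


# Toy targets and predictions (4 of 5 are correct)
toy_true = np.array([0, 1, 2, 2, 1])
toy_pred = np.array([0, 1, 1, 2, 1])
print("Accuracy: %.2f" % accuracy(toy_true, toy_pred))  # Accuracy: 0.80
```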


We can even use the same tools from scikit-learn that we use for any other model, once we have the array with the predictions:

In [6]:
test_predictions =\
    list(model.predict(input_fn=dataset_input_fn("test")))
test_predictions_classes = np.array(
    [p['class_ids'][0] for p in test_predictions])

print("Classification Report\n=====================")
print(classification_report(
    newsgroups['test_target'], test_predictions_classes))

INFO:tensorflow:Restoring parameters from /tmp/ng_model/model.ckpt-2002
Classification Report
             precision    recall  f1-score   support

          0       0.92      0.95      0.93       160
          1       0.83      0.85      0.84       195
          2       0.81      0.85      0.83       197
          3       0.78      0.73      0.75       196
          4       0.87      0.80      0.83       192
          5       0.90      0.89      0.89       196
          6       0.84      0.80      0.82       194
          7       0.85      0.91      0.88       198
          8       0.97      0.92      0.95       199
          9       0.96      0.97      0.96       199
         10       0.98      0.96      0.97       200
         11       0.99      0.93      0.96       198
         12       0.78      0.90      0.83       196
         13       0.87      0.95      0.91       198
         14       0.95      0.95      0.95       197
         15       0.98      0.92      0.95       200
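
To read the columns of the report, it helps to see how they are computed for each class: precision is the fraction of predictions of that class that are correct, recall is the fraction of examples of that class that were found, and the F1-score is their harmonic mean. The sketch below reproduces those definitions with numpy on made-up toy arrays; `per_class_scores` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np


def per_class_scores(y_true, y_pred, num_classes):
    """Return (precision, recall, f1, support) for each class id."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(num_classes):
        true_positives = np.sum((y_pred == c) & (y_true == c))
        predicted = np.sum(y_pred == c)  # times class c was predicted
        support = np.sum(y_true == c)    # times class c actually occurs
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / support if support else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        scores.append(
            (float(precision), float(recall), float(f1), int(support)))
    return scores


toy_true = np.array([0, 0, 1, 1, 2, 2])
toy_pred = np.array([0, 1, 1, 1, 2, 0])
for class_id, row in enumerate(per_class_scores(toy_true, toy_pred, 3)):
    print(class_id, "%.2f %.2f %.2f %d" % row)
```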
    