# The estimator API
* [Tensorflow documentation](https://www.tensorflow.org/guide/estimators)
* [Paper](https://arxiv.org/pdf/1708.02637.pdf)
* `tensorflow.estimator`
* **Don't use the deprecated** `tf.contrib.learn.Estimator` class

* High level API for machine learning tasks:
    - Abstraction of Graphs and Sessions
    - Training
    - Prediction
    - Evaluation
    - Export for Serving
    
### Conceptual idea
* Disentangle **data input pipeline** and **model**

### [Data input pipeline:](https://www.tensorflow.org/guide/datasets)
1. **Write data import function:**
~~~~(.python)
def train_input_fn(dataset):
    ...  # build feature_dict (feature name -> tensor) and labels from dataset
    return feature_dict, labels
~~~~
2. **Define Feature columns:**
 * Each feature column has to be created with a function from the `tf.feature_column` module
 * It identifies the feature name, its type, and any input pre-processing.
~~~~(.python)
feature_1 = tf.feature_column.numeric_column('feature_1')
feature_2 = tf.feature_column.numeric_column('feature_2', 
                                              normalizer_fn=lambda x: x * 42)
~~~~ 

### Model Fitting:
1. **Instantiate model** with the feature columns from above
~~~~(.python)
classifier = tf.estimator.LinearClassifier(feature_columns=[feature_1, feature_2])
~~~~
2. **Call train method** with data import function from above
~~~~(.python)
classifier.train(input_fn=train_input_fn, steps=2000)
~~~~


### Prediction
1. **Define a data import function for the prediction data**, as for training above
~~~~(.python)
def predict_input_fn(dataset):
    ...  # build feature_dict from dataset
    return feature_dict  # no labels are needed for prediction
~~~~
2. **Run predict method on trained model**
~~~~(.python)
predictions = classifier.predict(input_fn=predict_input_fn)
~~~~

# Get the data

In [1]:
import tensorflow as tf
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from collections import namedtuple
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
supervised = namedtuple("supervised", ["features", "target"])


def split_test_train(data):
    X_train, X_test, Y_train, Y_test = train_test_split(data.features, data.target, test_size = 0.2, random_state=5)
    #return supervised(X_train, Y_train.reshape(-1, 1)), supervised(X_test, Y_test.reshape(-1, 1))
    return supervised(X_train, Y_train), supervised(X_test, Y_test)

housing = fetch_california_housing()
data = supervised(pd.DataFrame(housing.data, columns=housing.feature_names), 
                  pd.DataFrame(housing.target,columns=["price"]))
train, test = split_test_train(data)


# Specify model

In [39]:
# Define the input feature house age
feature = train.features[["HouseAge"]]
# Configure a numeric feature column for house age.
feature_columns = [tf.feature_column.numeric_column("HouseAge")] # -> linear regressor
# Define target variable
targets = train.target["price"]

In [40]:
# Use gradient descent as the optimizer for training the model.
# The loss function is built into the estimator, so none is specified here.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
# Clip the gradients to a norm of at most 5.0 before they are applied.
optimizer = tf.contrib.estimator.clip_gradients_by_norm(optimizer, 5.0)

# Configure the linear regression model with our feature columns and
# the optimizer defined above (gradient descent, learning rate 0.0000001).
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=optimizer
)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpwdvcmomd', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f04385c5b70>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [41]:
def input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """Builds the input pipeline that feeds batches of data to an estimator.
  
    Args:
      features: pandas DataFrame of features
      targets: pandas Series (or DataFrame) of targets
      batch_size: Size of batches to be passed to the model
      shuffle: True or False. Whether to shuffle the data.
      num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely
    Returns:
      Tuple of (features, labels) for next data batch
    """
  
    # Convert pandas data into a dict of np arrays.
    features = {key:np.array(value) for key,value in dict(features).items()}                                           
 
    # Construct a dataset, and configure batching/repeating.
    ds = tf.data.Dataset.from_tensor_slices((features,targets)) # warning: 2GB limit
    ds = ds.batch(batch_size).repeat(num_epochs)
    
    # Shuffle, if specified. Because this runs after batching, whole batches are shuffled.
    if shuffle:
        ds = ds.shuffle(buffer_size=10000)
    
    # Return the next batch of data.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels
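
Note that `input_fn` batches before it shuffles, so the shuffle step reorders whole batches rather than individual rows (with the default `batch_size=1` the two coincide). The plain-Python sketch below illustrates the difference. `batch` and `shuffled` are hypothetical helpers, not part of `tf.data`, and a full shuffle stands in for the buffered shuffle of `tf.data`.

```python
import random


def batch(items, batch_size):
    """Group consecutive items into lists of batch_size (the last may be shorter)."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def shuffled(items, seed):
    """Return a shuffled copy of items, leaving the input untouched."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return items


rows = list(range(8))

# Order used in input_fn above: batch first, then shuffle the batches.
batch_then_shuffle = shuffled(batch(rows, 4), seed=0)

# Alternative order: shuffle the rows first, then batch them.
shuffle_then_batch = batch(shuffled(rows, seed=0), 4)

print(batch_then_shuffle)  # each batch still holds consecutive rows
print(shuffle_then_batch)  # the rows are shuffled before being grouped
```

If rows should be mixed across batches, shuffling before batching is the usual choice.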

# Train model

In [43]:
linear_regressor.train(input_fn = lambda:input_fn(feature, targets),
                       steps=1000)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from /tmp/tmpwdvcmomd/model.ckpt-1000
INFO:tensorflow:Saving checkpoints for 1001 into /tmp/tmpwdvcmomd/model.ckpt.
INFO:tensorflow:loss = 3.1064703, step = 1001
INFO:tensorflow:global_step/sec: 1074.22
INFO:tensorflow:loss = 7.228814, step = 1101 (0.094 sec)
INFO:tensorflow:global_step/sec: 985.891
INFO:tensorflow:loss = 2.8049595, step = 1201 (0.101 sec)
INFO:tensorflow:global_step/sec: 917.275
INFO:tensorflow:loss = 0.8651004, step = 1301 (0.109 sec)
INFO:tensorflow:global_step/sec: 971.47
INFO:tensorflow:loss = 1.0381328, step = 1401 (0.103 sec)
INFO:tensorflow:global_step/sec: 887.554
INFO:tensorflow:loss = 4.7686625, step = 1501 (0.114 sec)
INFO:tensorflow:global_step/sec: 754.104
INFO:tensorflow:loss = 3.3241026, step = 1601 (0.131 sec)
INFO:tensorflow:global_step/sec: 1064.73
INFO:tensorflow:loss = 4.298497, step = 1701 (0.094 sec)
INFO:tensorflow:global_step/sec: 1163.01
INFO:tensorflow:loss = 11.

<tensorflow.python.estimator.canned.linear.LinearRegressor at 0x7f04385c59e8>

# Let's make a larger model with more features

In [44]:
extended_features = train.features
extended_features_columns = [tf.feature_column.numeric_column("HouseAge"), 
                     tf.feature_column.numeric_column("MedInc"),
                     tf.feature_column.numeric_column("AveRooms"),
                     tf.feature_column.numeric_column("AveBedrms"),
                     tf.feature_column.numeric_column("Population"),
                     tf.feature_column.numeric_column("AveOccup"),
                     tf.feature_column.numeric_column("Latitude"),
                     tf.feature_column.numeric_column("Longitude")                             
                    ]

In [48]:
linear_regressor = tf.estimator.LinearRegressor(feature_columns=extended_features_columns, 
                                                optimizer=optimizer)

linear_regressor.train(input_fn = lambda: input_fn(extended_features, targets),
                       steps=1000)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmp_9p2k0d8', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f0411dd94e0>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmp_9p2k0d8/model.ckpt.
INFO:tensorflow:loss = 4.0561957, step = 1
INFO:tensorflow:global_step/sec: 746.608
INFO:tensorflow:loss = 0.40761775, step = 101 (0.135 sec)
INFO:tensorflow:global_step/sec: 999.405
INFO:tensorflow:loss = 5.962312, step = 201 (0.100 sec)
INFO:tensorflow:global_step/sec: 1021.46
INFO:tensorflow:loss =

### Questions regarding feature columns
* If I comment out feature columns for features that occur in the training set, TensorFlow still runs.
* $\Rightarrow$ Are only the features named in the feature columns used for the computation?

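
As far as I can tell, the answer is yes: the canned estimators look up only the keys named in `feature_columns` and ignore every other entry of the feature dict. The plain-Python sketch below mimics that lookup for a linear model. It is not TensorFlow's implementation; `linear_prediction` and the weights are hypothetical and only serve as illustration.

```python
def linear_prediction(features, feature_columns, weights, bias=0.0):
    """Weighted sum over the declared feature columns only (hypothetical helper)."""
    return bias + sum(weights[name] * features[name] for name in feature_columns)


# One sample with three features, of which only two are declared as columns.
sample = {"HouseAge": 30.0, "MedInc": 4.0, "Population": 1000.0}
feature_columns = ["HouseAge", "MedInc"]
weights = {"HouseAge": 0.5, "MedInc": 2.0}  # made-up values for illustration

prediction = linear_prediction(sample, feature_columns, weights)
print(prediction)  # "Population" never enters the result
```
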
# Making Predictions

In [247]:
def prediction_data_input_fn(features):
    features = {key:np.array(value) for key,value in dict(features).items()}                                           

    # Construct a dataset, and configure batching/repeating.
    ds = tf.data.Dataset.from_tensor_slices(features) # warning: 2GB limit
    ds = ds.batch(1).repeat(1)  
    
    # Return the next batch of data.
    features = ds.make_one_shot_iterator().get_next()
    return features

In [248]:
forecast = linear_regressor.predict(input_fn=lambda: prediction_data_input_fn(test.features))

In [249]:
# What type is the forecast?
type(forecast)

generator

In [128]:
# Let's have a look at an element the iterator yields
next(forecast)

INFO:tensorflow:Restoring parameters from /tmp/tmp_9p2k0d8/model.ckpt-1000


{'predictions': array([0.57823205], dtype=float32)}

In [129]:
def get_prediction_result(forecast):
    """Collect the scalar prediction of every item the forecast generator yields."""
    targets = [t['predictions'][0] for t in forecast]
    return np.array(targets)

In [130]:
predictions = get_prediction_result(forecast)

### Warning when using generators:

In [132]:
print("Number of prediction results:", predictions.shape[0])
print("Number of samples in the test set", test.target.shape[0])

Number of prediction results: 4127
Number of samples in the test set 4128


These numbers differ!
The reason is that we already consumed one item when we looked into the generator via `next`.
A generator can be iterated only once and does not keep the items it has already yielded.
So we should turn it directly into a numpy array in order not to lose any data.
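
The effect can be reproduced without TensorFlow. In the sketch below, `fake_forecast` is a hypothetical stand-in for the generator returned by `predict`; it only mimics the shape of the yielded items.

```python
def fake_forecast(n):
    """Stand-in for the generator returned by predict (hypothetical)."""
    for i in range(n):
        yield {"predictions": [float(i)]}


forecast = fake_forecast(5)

first = next(forecast)  # peeking consumes the first item
rest = [item["predictions"][0] for item in forecast]
print(len(rest))  # 4, one fewer than the 5 items produced

leftover = list(forecast)  # the generator is exhausted by now
print(leftover)  # []
```

Creating a fresh generator and converting it to a list or array right away avoids the loss.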

In [134]:
forecast = linear_regressor.predict(input_fn=lambda: prediction_data_input_fn(test.features))
predictions = get_prediction_result(forecast)

INFO:tensorflow:Restoring parameters from /tmp/tmp_9p2k0d8/model.ckpt-1000


In [136]:
# Check
predictions.shape[0] == test.target.shape[0]

True

# Compute errors
* As an exercise we compute the errors using TensorFlow
* And compare the results with scikit-learn and numpy
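
For reference, the mean squared error is the average of the squared differences between targets and predictions, and the root mean squared error is its square root. A minimal plain-Python sketch with made-up values; `mse` and `rmse` are hypothetical helpers written for this illustration:

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


y_true = [3.0, 1.0, 2.0]  # made-up targets
y_pred = [2.0, 1.0, 4.0]  # made-up predictions

# Squared differences are 1, 0 and 4, so the MSE is 5/3.
print(mse(y_true, y_pred))
print(rmse(y_true, y_pred))
```
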

In [235]:
"""
As there are several graph objects around at this point (tf estimators) we write the evaluation in its own graph
"""
Errors = namedtuple("Errors", ["mse", "rmse"])

new_graph = tf.Graph()
with new_graph.as_default():
    preds = tf.constant(predictions, dtype=tf.float64, name="prediction_labels")
    real = tf.constant(test.target.values, dtype=tf.float64, name="real_labels")
    mse = tf.metrics.mean_squared_error(real, preds)[1]
    rmse = tf.sqrt(mse)
    errors = Errors(mse, rmse) 
    

with tf.Session(graph=new_graph) as sess:
    print("Does session use defined graph?", sess.graph is new_graph)
    #sess.run(tf.global_variables_initializer()) # Initialize variables (not necessary for constants)
    sess.run(tf.local_variables_initializer()) # Needed because tf.metrics keeps its running totals in local variables
    tf_errors = sess.run(errors)
    
tf_errors

Does session use defined graph? True


Errors(mse=3.7333066, rmse=1.9321767)

In [236]:
from sklearn import metrics
mean_squared_error = metrics.mean_squared_error(test.target.values, predictions)  # argument order: (y_true, y_pred)
sk_errors = Errors(mean_squared_error, np.sqrt(mean_squared_error))
sk_errors

Errors(mse=3.733306339134602, rmse=1.932176580733397)

# Accessing model parameters

In [246]:
names = linear_regressor.get_variable_names()
for name in names:
    print(name, linear_regressor.get_variable_value(name))

global_step 1000
linear/linear_model/AveBedrms/weights [[5.8685885e-07]]
linear/linear_model/AveOccup/weights [[1.4909873e-06]]
linear/linear_model/AveRooms/weights [[2.9085577e-06]]
linear/linear_model/HouseAge/weights [[1.6422522e-05]]
linear/linear_model/Latitude/weights [[1.9167783e-05]]
linear/linear_model/Longitude/weights [[-6.4031934e-05]]
linear/linear_model/MedInc/weights [[2.1488702e-06]]
linear/linear_model/Population/weights [[0.00046557]]
linear/linear_model/bias_weights [5.349638e-07]


In [266]:
# Getting rid of the strange lambda expression via a callable class.
# Alternatively, this could be done via currying (partial application).
class InputProvider:

    def __init__(self, features, targets, batch_size=1, shuffle=True, num_epochs=None):
        self.features = {key:np.array(value) for key,value in dict(features).items()} 
        self.targets = targets
        self.batch_size=batch_size
        self.shuffle = shuffle
        self.num_epochs = num_epochs
 
    def __call__(self):
        ds = tf.data.Dataset.from_tensor_slices((self.features, self.targets)) # warning: 2GB limit
        ds = ds.batch(self.batch_size).repeat(self.num_epochs)
        if self.shuffle:
            ds = ds.shuffle(buffer_size=10000)
        features, labels = ds.make_one_shot_iterator().get_next()
        return features, labels
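
The currying alternative mentioned in the comment can be sketched with `functools.partial`, which binds the arguments up front and returns a zero-argument callable. `demo_input_fn` below is a hypothetical stand-in that only echoes its arguments. That the estimator accepts a `partial` object as `input_fn` is an assumption here and should be checked against the TensorFlow version in use.

```python
from functools import partial


def demo_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """Stand-in for the real input function; it only echoes its arguments."""
    return features, targets, batch_size, shuffle, num_epochs


# Bind the data and options now, call without arguments later.
train_input_fn = partial(
    demo_input_fn,
    {"HouseAge": [41.0, 21.0]},  # made-up feature values
    [4.5, 3.6],                  # made-up targets
    batch_size=2,
)

result = train_input_fn()
print(result)
```

Compared with the callable class, this needs no extra type, but it also offers no place to preprocess the features once at construction time.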

In [268]:

linear_regressor = tf.estimator.LinearRegressor(feature_columns=extended_features_columns, 
                                    optimizer=optimizer)

linear_regressor.train(input_fn = InputProvider(extended_features, targets),
                       steps=10000)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmplnd6q4n3', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f03c0ba0c50>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmplnd6q4n3/model.ckpt.
INFO:tensorflow:loss = 0.32262403, step = 1
INFO:tensorflow:global_step/sec: 888.509
INFO:tensorflow:loss = 1.6980609, step = 101 (0.113 sec)
INFO:tensorflow:global_step/sec: 1120.29
INFO:tensorflow:loss = 24.407393, step = 201 (0.089 sec)
INFO:tensorflow:global_step/sec: 895.562
INFO:tensorflow:loss 

INFO:tensorflow:loss = 0.20109978, step = 7501 (0.129 sec)
INFO:tensorflow:global_step/sec: 610.902
INFO:tensorflow:loss = 5.7284136, step = 7601 (0.161 sec)
INFO:tensorflow:global_step/sec: 1234.17
INFO:tensorflow:loss = 0.063685626, step = 7701 (0.081 sec)
INFO:tensorflow:global_step/sec: 1110.38
INFO:tensorflow:loss = 0.98918515, step = 7801 (0.090 sec)
INFO:tensorflow:global_step/sec: 1231.36
INFO:tensorflow:loss = 0.10935179, step = 7901 (0.081 sec)
INFO:tensorflow:global_step/sec: 1003.35
INFO:tensorflow:loss = 5.2714276, step = 8001 (0.100 sec)
INFO:tensorflow:global_step/sec: 750.5
INFO:tensorflow:loss = 0.08102282, step = 8101 (0.133 sec)
INFO:tensorflow:global_step/sec: 603.251
INFO:tensorflow:loss = 7.0408173, step = 8201 (0.165 sec)
INFO:tensorflow:global_step/sec: 1132.22
INFO:tensorflow:loss = 0.93612725, step = 8301 (0.089 sec)
INFO:tensorflow:global_step/sec: 648.762
INFO:tensorflow:loss = 5.0673966, step = 8401 (0.156 sec)
INFO:tensorflow:global_step/sec: 964.146
INFO:

<tensorflow.python.estimator.canned.linear.LinearRegressor at 0x7f03c17b1cf8>