# Machine Learning with TensorFlow

TensorFlow (TF) can be used not only for deep learning, but also for numerical computation in general. It is especially well suited to handling matrices and tensors.

In this notebook we will explore how to use TF to build a *traditional* machine learning (ML) model, and how that compares to solving the same problem with a deep learning model.

In [1]:
%pylab inline
plt.style.use('seaborn-talk')
import tensorflow as tf

print(tf.__version__)

Populating the interactive namespace from numpy and matplotlib
1.11.0


## Getting familiar with TensorFlow

TF uses lazy evaluation. When we define an operation, we only create the execution graph for that operation. We need to explicitly *run* the graph to obtain the result.
* TF also supports eager execution in Python, which runs operations immediately, but we will not use it in this notebook
* Eager execution is intended for running small tests in an interactive notebook or prompt

In NumPy, operations are executed as soon as they are written.
However, in TF we need to explicitly run the graph to obtain a result:

In [4]:
x = tf.constant(5)
y = tf.constant(6)
z = tf.multiply(x,y)

In [6]:
x

<tf.Tensor 'Const_3:0' shape=() dtype=int32>

In [7]:
with tf.Session() as sess:
    z_ = sess.run(z)

In [8]:
z_

30
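
To make the idea of deferred evaluation more concrete, here is a minimal pure-Python analogy. It is not TensorFlow code and says nothing about how TF is implemented; the names `Node`, `constant`, `multiply` and `run` are hypothetical helpers invented for this sketch. Building `z` only records the operation, and the value is computed when `run` is called, much like `sess.run(z)` above.

```python
# Pure-Python analogy of deferred (graph-style) evaluation; not TensorFlow code.
class Node:
    """A node in a tiny computation graph (hypothetical helper)."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op          # "const" or "mul"
        self.inputs = inputs  # upstream nodes this node depends on
        self.value = value    # only set for constants


def constant(value):
    return Node("const", value=value)


def multiply(a, b):
    # Only records the operation; nothing is computed here.
    return Node("mul", inputs=(a, b))


def run(node):
    # Walks the graph and computes the result on demand.
    if node.op == "const":
        return node.value
    left, right = (run(n) for n in node.inputs)
    return left * right


x = constant(5)
y = constant(6)
z = multiply(x, y)   # z is a graph node, not the number 30
result = run(z)      # evaluation happens only here
print(result)
```

Until `run(z)` is called, `z` holds only a description of the computation, which is why inspecting a tensor in graph mode shows its name, shape and dtype rather than a value.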

In [None]:
# Sum two NumPy arrays

### Exercise

To obtain the area of a triangle given the lengths of its sides $a$, $b$ and $c$, you can use Heron's formula:

$\sqrt{s(s-a)(s-b)(s-c)}$ where $\displaystyle s=\frac{a+b+c}{2}$

* For more context, see https://en.wikipedia.org/wiki/Heron%27s_formula
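
As a quick sanity check of the formula before vectorizing it, the following sketch evaluates it for a single triangle in plain Python. The 3-4-5 triangle is chosen here only for illustration: it is right-angled, so its area can also be computed as base times height divided by two.

```python
import math

# Side lengths of a 3-4-5 right triangle (chosen for illustration).
a, b, c = 3.0, 4.0, 5.0

s = (a + b + c) / 2  # semi-perimeter
area = math.sqrt(s * (s - a) * (s - b) * (s - c))

# Cross-check with base * height / 2, valid because the triangle is right-angled.
expected = a * b / 2
print(area, expected)
```

The exercise below asks for the same computation applied to every row of an array at once, without an explicit Python loop.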

Assume that $a$, $b$ and $c$ are given as an array named *sides*, with 3 columns and an arbitrary number of rows. Each row is a possible triangle.

Write two functions that take the array *sides* as input and return an array with one component per triangle, each component being the area of the corresponding triangle:
* One function using NumPy
* One function using TensorFlow
  * The function should return the *tf.Tensor* without running it. The caller will need to run the output afterwards to get the result
  * How do you compute a square root in TF? Look up the available operations at https://www.tensorflow.org/api_docs/python/tf

In [5]:
# Solution with Numpy
def heron_np(sides):
    return 

In [None]:
# Check results with numpy
sides = np.array([[5, 3, 7.1],[ 2.3, 4.1, 4.8]])
print("Input triangles:")
print(sides)

print("Output with multiple rows:")
multi = heron_np(sides)
print(multi)
print("Outputs with single rows:")
a1 = heron_np(np.array([sides[0,:]]))
a2 = heron_np(np.array([sides[1,:]]))
print(a1)
print(a2)
assert np.allclose(multi, np.append(a1,a2))  # tolerant comparison for floats

In [7]:
# Solution with TF: handling tensors in TF is quite similar to handling numpy arrays
def heron_tf(sides):
    return 

In [None]:
# Check the TF solution

## Our first model

For this model, we will use the following data:
* http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

In the data subdirectory, you will find three files. We have divided the dataset into three subsets:
* Training
* Validation
* Test

**Question**: Why three different sets? What will we use the validation set for? Couldn't we use just two subsets?

In [13]:
! ls -hl ../data/taxi*

-rw-rw-r-- 1 juan juan  84K nov  4 01:52 ../data/taxi-test.csv
-rw-rw-r-- 1 juan juan 394K nov  4 01:52 ../data/taxi-train.csv
-rw-rw-r-- 1 juan juan  84K nov  4 01:52 ../data/taxi-valid.csv


In [14]:
# We will use Pandas to explore the data
import pandas as pd

In [15]:
# The CSV files come without a header, so let's name the columns for clarity
CSV_COLUMNS = ['fare_amount', 'pickuplon', 'pickuplat', 'dropofflon', 'dropofflat', 'passengers', 'key']

In [None]:
# Read the train, validation and test data



### Transforming input data
We will train a simple linear model using `tf.estimator`, a higher-level TF API.

Notice that we have the data in Pandas dataframes. How can we feed a Pandas dataframe to TF?

One option is to convert the dataframe to a tensor using `tf.convert_to_tensor`.

In [None]:
# Convert train, validation and test pd dataframes to tensors 

However, keeping the Pandas dataframe information (column names, etc.) makes it easier to evaluate the model later on. Instead, we can create an input function from the Pandas dataframe, to use with `tf.estimator`.

We will write a function so that we can run several tests while changing the number of epochs (one of the hyperparameters).

In [None]:
# Use tf.estimator.inputs.pandas_input_fn to create a TF dataset from Pandas
def pandas2tf(df, epochs):
    return 

In [None]:
tf_train = pandas2tf(df_train, 1)
tf_valid = pandas2tf(df_valid, 1)
tf_test = pandas2tf(df_test, 1)

### Feature columns

For the model, we need to select the feature columns. We will use all columns, except the *key* one, which is just an index.

Also, the first column is in fact the target variable that we will predict, so we will remove it from the features too.

In [None]:
# Define feature_cols using tf.feature_column.numeric_column

### Training the model

In [None]:
tf.logging.set_verbosity(tf.logging.INFO)

import shutil
# WARNING!!!! THIS DIRECTORY WILL BE REMOVED, DON'T PUT ANYTHING THERE
OUTDIR = 'taxi_trained'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time

In [None]:
# Linear Regression model using tf.estimator.LinearRegressor

### Evaluate the model

Is this model good? How can we interpret the average loss in the evaluation metrics dict? What's the *physical meaning* of that number?
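
One way to make the number easier to read is to take its square root. The sketch below assumes that *average_loss* is a mean squared error, which is the default loss for `tf.estimator` regressors; the `metrics` dict and its values are made up for illustration and are not results from this dataset.

```python
import math

# Hypothetical metrics dict, shaped like what an estimator's evaluate() might return.
# The numbers are made up for illustration only.
metrics = {"average_loss": 100.0, "global_step": 1000}

# Assuming average_loss is a mean squared error, its square root (RMSE)
# is expressed in the same units as the target, here fare_amount.
rmse = math.sqrt(metrics["average_loss"])
print("RMSE: %.2f" % rmse)
```

Under that assumption, the RMSE can be read as a typical prediction error in the units of *fare_amount*, which can then be compared with the typical fare in the data.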

In [None]:
# Evaluate the model with the help of the validation set

#### Question

What is the average loss if we call `evaluate` with `tf_train`? Is it correct to use that number to evaluate the model?

### Exercise: plot the average loss for the training and validation datasets, over the number of epochs

What is the impact of the number of epochs on the results of the model?

Repeat the training process with epochs ranging from 1 to 10, and plot the *average_loss* for the training and validation datasets.
* What conclusions can you draw from that plot?

In [None]:
tf.logging.set_verbosity(tf.logging.ERROR)
OUTDIR = 'taxi_trained'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
    
def train_for_epochs(nepochs):
    return #train_loss, valid_loss

In [None]:
# Plot the results
epochs = np.arange(1,11)
train_loss = []
valid_loss = []
for e in epochs:
    print("Training with %d epochs..." % e)
    t, v = train_for_epochs(e)
    train_loss.append(t)
    valid_loss.append(v)
    
plt.figure(figsize=(10,7))
plt.plot(epochs, train_loss, label='Training')
plt.plot(epochs, valid_loss, label='Validation')
plt.legend()
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.grid()

So the model does not look exceptionally good. This is to be expected: we have only tried a linear model, without any kind of feature engineering or transformation. We don't know, for instance, whether the relationship between *fare_amount* and the rest of the features is linear or not.

Let's try another model using TF, this time a more complex one.

## A more complex model

We are going to reuse a lot of code from the previous model; we will only change the kind of estimator we are using.

Let's train the same linear estimator again, this time for 20 epochs. Then we will compare it with the new one:

Is this new model better? Why doesn't the linear model change with the number of epochs?

What would you do next to improve the deep learning model?
* Possible answers: increase the number of epochs, maybe try a more complex model. Any other hyperparameters?

Let's try with 150 epochs, and see if we can improve on the training and validation losses.

## Final decision about the models

So far, we have been ignoring the test set. We have used the validation dataset to tune the hyperparameters of the model. It is now time to use the test set to finally decide which model is better: the linear regressor or the neural network.

For this, we can retrain the models using both the train and validation sets, with the hyperparameters that we have already chosen. Then we will evaluate both models using only the test set, to find out which one is better at predicting data that has not been used in any way to tune the model. (The validation set has been used to tune the hyperparameters, so information from the validation set is already reflected in the model.)

In [None]:
# Evaluate the LinearRegressor model

In [None]:
# Evaluate tf.estimator.DNNRegressor model