# Intro to Neural Nets


In this exercise, we'll explore using neural nets (NNs) to predict `median_house_value`.

First, let's load and prepare the data.

In [1]:
import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.contrib.learn.python.learn import learn_io

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/ml_universities/california_housing_train.csv", sep=",")

california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))

In [2]:
def preprocess_features(california_housing_dataframe):
  """Prepares input features from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the features to be used for the model, including
    synthetic features.
  """
  selected_features = california_housing_dataframe[
    ["latitude",
     "longitude",
     "housing_median_age",
     "total_rooms",
     "total_bedrooms",
     "population",
     "households",
     "median_income"]]
  processed_features = selected_features.copy()
  # Create a synthetic feature.
  processed_features["rooms_per_person"] = (
    california_housing_dataframe["total_rooms"] /
    california_housing_dataframe["population"])
  return processed_features

def preprocess_targets(california_housing_dataframe):
  """Prepares target features (i.e., labels) from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the target feature.
  """
  output_targets = pd.DataFrame()
  # Scale the target to be in units of thousands of dollars.
  output_targets["median_house_value"] = (
    california_housing_dataframe["median_house_value"] / 1000.0)
  return output_targets

Partitioning the dataset into **training** and **test** sets lets us train on one set of examples and then test the model against a different set of examples.

A better approach adds an intermediate **validation** set, which we can use to tune the model's parameters without exposing the test set. The test set should always stay independent of the model selection phase.
We use the validation set to evaluate the model trained on the training set. We then use the test set to double-check that evaluation once the model has "passed" validation.
In other words, the workflow is:

1. Train the model on the training set.
2. Evaluate the model on the validation set.
3. Pick the model that does best on the validation set (this includes choices of hyperparameters).
4. Double-check that model against the test set.

This is a better workflow because it creates fewer exposures to the test set.
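
The four steps above can be sketched in plain Python. This is only an illustration under stated assumptions: `split_indices` and `select_best` are hypothetical helpers, the validation scores are made-up numbers, and all three sets are carved from one shuffled index range (in this notebook the test set actually comes from a separate file).

```python
import random


def split_indices(n_examples, n_train, n_validation, seed=0):
    """Shuffle example indices and split them into train/validation/test."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    validation = indices[n_train:n_train + n_validation]
    test = indices[n_train + n_validation:]
    return train, validation, test


def select_best(validation_errors):
    """Return the name of the candidate with the lowest validation error."""
    return min(validation_errors, key=validation_errors.get)


train_idx, validation_idx, test_idx = split_indices(100, 70, 20)

# Hypothetical validation RMSE values for three candidate models.
validation_errors = {"model_a": 112.0, "model_b": 104.5, "model_c": 108.3}
best = select_best(validation_errors)

# Only the selected candidate would then be checked against the test indices.
print(len(train_idx), len(validation_idx), len(test_idx), best)
```

The test indices are never consulted while choosing `best`, which is the point of the workflow.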


For the **training set**, we'll choose the first 12000 examples, out of the total of 17000.

In [145]:
training_examples = preprocess_features(california_housing_dataframe.head(12000))
training_examples.describe()

12000


| | latitude | longitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | rooms_per_person |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 12000.0 | 12000.0 | 12000.0 | 12000.0 | 12000.0 | 12000.0 | 12000.0 | 12000.0 | 12000.0 |
| mean | 35.6 | -119.6 | 28.6 | 2652.7 | 540.7 | 1430.2 | 502.1 | 3.9 | 2.0 |
| std | 2.1 | 2.0 | 12.6 | 2171.8 | 420.9 | 1161.5 | 384.7 | 1.9 | 1.0 |
| min | 32.5 | -124.3 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.5 | 0.1 |
| 25% | 33.9 | -121.8 | 18.0 | 1464.0 | 296.0 | 789.0 | 281.0 | 2.6 | 1.5 |
| 50% | 34.3 | -118.5 | 28.5 | 2131.0 | 435.0 | 1167.5 | 410.0 | 3.5 | 1.9 |
| 75% | 37.7 | -118.0 | 37.0 | 3165.0 | 652.0 | 1728.0 | 609.0 | 4.8 | 2.3 |
| max | 42.0 | -114.3 | 52.0 | 37937.0 | 6445.0 | 35682.0 | 6082.0 | 15.0 | 41.3 |


In [4]:
training_targets = preprocess_targets(california_housing_dataframe.head(12000))
training_targets.describe()

| | median_house_value |
| --- | --- |
| count | 12000.0 |
| mean | 208.8 |
| std | 117.0 |
| min | 15.0 |
| 25% | 120.4 |
| 50% | 181.3 |
| 75% | 266.0 |
| max | 500.0 |


For the **validation set**, we'll choose the last 5000 examples, out of the total of 17000. 

In [5]:
validation_examples = preprocess_features(california_housing_dataframe.tail(5000))
validation_examples.describe()

| | latitude | longitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | rooms_per_person |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 35.6 | -119.5 | 28.7 | 2622.1 | 536.3 | 1428.1 | 499.2 | 3.9 | 2.0 |
| std | 2.1 | 2.0 | 12.6 | 2199.5 | 422.9 | 1114.4 | 384.1 | 1.9 | 1.4 |
| min | 32.6 | -124.3 | 1.0 | 12.0 | 3.0 | 8.0 | 5.0 | 0.5 | 0.0 |
| 25% | 33.9 | -121.7 | 18.0 | 1453.0 | 298.0 | 790.8 | 283.0 | 2.5 | 1.5 |
| 50% | 34.2 | -118.5 | 29.0 | 2122.0 | 429.5 | 1166.0 | 407.0 | 3.5 | 1.9 |
| 75% | 37.7 | -118.0 | 37.0 | 3122.2 | 640.0 | 1705.0 | 595.0 | 4.7 | 2.3 |
| max | 42.0 | -114.6 | 52.0 | 30405.0 | 4957.0 | 15037.0 | 4339.0 | 15.0 | 55.2 |


In [6]:
validation_targets = preprocess_targets(california_housing_dataframe.tail(5000))
validation_targets.describe()

| | median_house_value |
| --- | --- |
| count | 5000.0 |
| mean | 203.8 |
| std | 113.5 |
| min | 26.6 |
| 25% | 117.9 |
| 50% | 177.9 |
| 75% | 262.2 |
| max | 500.0 |


### Building a neural network

The NN is defined by the [DNNRegressor](https://www.tensorflow.org/versions/master/api_docs/python/contrib.learn.html#DNNRegressor) class.

Use **`hidden_units`** to define the structure of the NN. The `hidden_units` argument takes a list of ints, where each int corresponds to a hidden layer and indicates the number of nodes in it. For example, consider the following assignment:

`hidden_units=[3,10]`

The preceding assignment specifies a neural net with two hidden layers:

* The first hidden layer contains 3 nodes.
* The second hidden layer contains 10 nodes.

If we wanted to add more layers, we'd add more ints to the list. For example, `hidden_units=[10,20,30,40]` would create four layers with ten, twenty, thirty, and forty units, respectively.

By default, all hidden layers use the ReLU activation function and are fully connected.
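
To make the effect of `hidden_units` concrete, here is a minimal plain-Python sketch of a fully connected network with ReLU hidden layers and a linear output. It does not reproduce `DNNRegressor` internals; `init_layers`, `forward`, and `count_parameters` are hypothetical helpers, the uniform weight range is an arbitrary choice, and the 9 inputs are assumed to match the 9 feature columns prepared earlier.

```python
import random


def relu(x):
    """Rectified linear unit: pass positives through, clamp negatives to 0."""
    return max(0.0, x)


def init_layers(n_inputs, hidden_units, n_outputs=1, seed=0):
    """Create one (weights, biases) pair per layer with small random weights."""
    rng = random.Random(seed)
    sizes = [n_inputs] + list(hidden_units) + [n_outputs]
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        biases = [0.0] * fan_out
        layers.append((weights, biases))
    return layers


def forward(layers, inputs):
    """Fully connected forward pass: ReLU on hidden layers, linear output."""
    activations = list(inputs)
    for i, (weights, biases) in enumerate(layers):
        outputs = [sum(w * a for w, a in zip(row, activations)) + b
                   for row, b in zip(weights, biases)]
        is_hidden = i < len(layers) - 1
        activations = [relu(o) for o in outputs] if is_hidden else outputs
    return activations


def count_parameters(layers):
    """Total number of weights and biases across all layers."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


layers = init_layers(n_inputs=9, hidden_units=[3, 10])
prediction = forward(layers, [1.0] * 9)
print(count_parameters(layers), len(prediction))
```

With 9 inputs and `hidden_units=[3, 10]`, the sketch has (9×3 + 3) + (3×10 + 10) + (10×1 + 1) = 81 trainable parameters, so adding ints to the list grows the model quickly.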

In [58]:
def train_nn_regression_model(
    learning_rate,
    steps,
    batch_size,
    hidden_units,
    training_examples,
    training_targets,
    validation_examples,
    validation_targets):
  """Trains a neural network regression model.
  
  In addition to training, this function also prints training progress information,
  as well as a plot of the training and validation loss over time.
  
  Args:
    learning_rate: A `float`, the learning rate.
    steps: A non-zero `int`, the total number of training steps. A training step
      consists of a forward and backward pass using a single batch.
    batch_size: A non-zero `int`, the batch size.
    hidden_units: A `list` of int values, specifying the number of neurons in each layer.
    training_examples: A `DataFrame` containing one or more columns from
      `california_housing_dataframe` to use as input features for training.
    training_targets: A `DataFrame` containing exactly one column from
      `california_housing_dataframe` to use as target for training.
    validation_examples: A `DataFrame` containing one or more columns from
      `california_housing_dataframe` to use as input features for validation.
    validation_targets: A `DataFrame` containing exactly one column from
      `california_housing_dataframe` to use as target for validation.
      
  Returns:
    A `DNNRegressor` object trained on the training data.
  """

  periods = 10
  # Integer division keeps the per-period step count an int.
  steps_per_period = steps // periods

  # Create a DNN regressor object.
  feature_columns = set([tf.contrib.layers.real_valued_column(my_feature) for my_feature in training_examples])
  dnn_regressor = tf.contrib.learn.DNNRegressor(
      feature_columns=feature_columns,
      hidden_units=hidden_units,
      optimizer=tf.train.GradientDescentOptimizer(learning_rate=learning_rate),
      gradient_clip_norm=5.0
  )
  
  # Create input functions
  training_input_fn = learn_io.pandas_input_fn(
     x=training_examples, y=training_targets["median_house_value"],
     num_epochs=None, batch_size=batch_size)
  predict_training_input_fn = learn_io.pandas_input_fn(
     x=training_examples, y=training_targets["median_house_value"],
     num_epochs=1, shuffle=False)
  predict_validation_input_fn = learn_io.pandas_input_fn(
      x=validation_examples, y=validation_targets["median_house_value"],
      num_epochs=1, shuffle=False)

  # Train the model, but do so inside a loop so that we can periodically assess
  # loss metrics.
  print("Training model...")
  print("RMSE (on training data):")
  training_rmse = []
  validation_rmse = []
  for period in range(periods):
    # Train the model, starting from the prior state.
    dnn_regressor.fit(
        input_fn=training_input_fn,
        steps=steps_per_period
    )
    # Take a break and compute predictions.
    training_predictions = list(dnn_regressor.predict(input_fn=predict_training_input_fn))
    validation_predictions = list(dnn_regressor.predict(input_fn=predict_validation_input_fn))
    # Compute training and validation loss.
    training_root_mean_squared_error = math.sqrt(
        metrics.mean_squared_error(training_predictions, training_targets))
    validation_root_mean_squared_error = math.sqrt(
        metrics.mean_squared_error(validation_predictions, validation_targets))
    # Occasionally print the current loss.
    print("  period %02d : %0.2f" % (period, training_root_mean_squared_error))
    # Add the loss metrics from this period to our list.
    training_rmse.append(training_root_mean_squared_error)
    validation_rmse.append(validation_root_mean_squared_error)
  print("Model training finished.")

  # Output a graph of loss metrics over periods.
  plt.ylabel("RMSE")
  plt.xlabel("Periods")
  plt.title("Root Mean Squared Error vs. Periods")
  plt.tight_layout()
  plt.plot(training_rmse, label="training")
  plt.plot(validation_rmse, label="validation")
  plt.legend()

  print("Final RMSE (on training data):   %0.2f" % training_root_mean_squared_error)
  print("Final RMSE (on validation data): %0.2f" % validation_root_mean_squared_error)

  return dnn_regressor
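
The metric tracked in the training loop above is root mean squared error (RMSE). As a hedged illustration, the plain-Python helper below computes the same quantity as `math.sqrt(metrics.mean_squared_error(...))` for one-dimensional inputs. The name `rmse` and the sample values are made up for this sketch.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up house values in thousands of dollars, matching the target scaling.
example_rmse = rmse([200.0, 150.0, 310.0], [210.0, 150.0, 300.0])
print(round(example_rmse, 2))
```

Because the targets are scaled to thousands of dollars, an RMSE of 100 corresponds to a typical error of roughly $100,000.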

### Task 1: Train a NN model

**Adjust hyperparameters, aiming to drop RMSE below 110.**

Run the following block to train a NN model.  

Recall that in the linear regression exercise with many features, an RMSE of 110 or so was pretty good.  We'll aim to beat that.

Your task here is to modify various learning settings to improve accuracy on validation data.

Overfitting is a real potential hazard for NNs. You can look at the gap between the loss on training data and the loss on validation data to help judge whether your model is starting to overfit. If the gap starts to grow, that is usually a sign of overfitting.
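
One rough way to watch that gap is to compute it per period from the two RMSE lists that the training function collects. The sketch below is an assumption-laden illustration: `generalization_gaps` and `gap_is_growing` are hypothetical helpers, the three-period window is an arbitrary choice, and the curves are made-up numbers rather than results from this notebook.

```python
def generalization_gaps(training_rmse, validation_rmse):
    """Per-period gap: validation RMSE minus training RMSE."""
    return [v - t for t, v in zip(training_rmse, validation_rmse)]


def gap_is_growing(gaps, window=3):
    """True if the gap strictly increased over each of the last `window` periods."""
    recent = gaps[-(window + 1):]
    if len(recent) < window + 1:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Made-up curves: training keeps improving while validation stalls.
training_rmse = [120.0, 110.0, 104.0, 100.0, 97.0, 95.0]
validation_rmse = [121.0, 112.0, 107.0, 105.0, 104.0, 104.5]

gaps = generalization_gaps(training_rmse, validation_rmse)
print(gaps, gap_is_growing(gaps))
```

A check like this is only a heuristic; noisy curves can trigger it without real overfitting, so it is best read alongside the plot.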

Because of the number of different possible settings, it's strongly recommended that you take notes on each trial to help guide your development process.

Also, when you get a good setting, try running it multiple times and see how repeatable your result is. NN weights are typically initialized to small random values, so you should see differences from run to run.

**Note:** This is by no means a way to find the parameters that lead to the best RMSE; it is just a way of experimenting and gaining some intuition about the effects of the various knobs in our possession. If your aim is to find the model that attains the lowest error, you'll want to use a more rigorous process, such as a parameter search.
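
As one example of a more rigorous process, a grid search tries every combination of candidate settings and keeps the one with the lowest validation error. The sketch below is illustrative only: `grid_search` is a hypothetical helper, and `fake_validation_rmse` is a made-up stand-in for training a model and returning its validation RMSE.

```python
import itertools


def grid_search(evaluate, grid):
    """Try every combination in `grid`; return (best_params, best_score)."""
    names = sorted(grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def fake_validation_rmse(learning_rate, batch_size):
    """Made-up stand-in for training a model and measuring validation RMSE."""
    return (100.0
            + 1000.0 * abs(learning_rate - 0.01)
            + abs(batch_size - 256) / 64.0)


grid = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [64, 256]}
best_params, best_score = grid_search(fake_validation_rmse, grid)
print(best_params, best_score)
```

In practice `evaluate` would wrap a real training run, and the score should come from validation data so that the test set stays untouched during the search.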


In [174]:
dnn_regressor = train_nn_regression_model(
    learning_rate=0.01,
    steps=6500,
    batch_size=256,
    hidden_units=[10, 2],
    training_examples=training_examples,
    training_targets=training_targets,
    validation_examples=validation_examples,
    validation_targets=validation_targets)

Training model...
RMSE (on training data):
  period 00 : 127.73
  period 01 : 116.65
  period 02 : 111.22
  period 03 : 110.42
  period 04 : 106.61
  period 05 : 103.24
  period 06 : 103.82
  period 07 : 103.41
  period 08 : 100.48
  period 09 : 101.22
Model training finished.
Final RMSE (on training data):   101.22
Final RMSE (on validation data): 97.91


### Task 2: Evaluate on test data

**Confirm that your validation performance results hold up on test data.**

Once you have a model you're happy with, evaluate it on test data and compare the result with validation performance.

As a reminder, the test data set is located [here](https://storage.googleapis.com/ml_universities/california_housing_test.csv).

In [182]:
# Load the test data set.
california_housing_dataframe_test = pd.read_csv("https://storage.googleapis.com/ml_universities/california_housing_test.csv", sep=",")
# Shuffle using the test frame's own index. Reindexing with the training
# frame's index would add all-NaN rows for labels the test set lacks.
california_housing_dataframe_test = california_housing_dataframe_test.reindex(
    np.random.permutation(california_housing_dataframe_test.index))

# Drop any rows that consist entirely of NaN values.
california_housing_dataframe_test = california_housing_dataframe_test.dropna(axis=0, how='all')

#Create vectors
test_examples = preprocess_features(california_housing_dataframe_test.head(275))
test_targets = preprocess_targets(california_housing_dataframe_test.head(275))
validation_examples_test = preprocess_features(california_housing_dataframe_test.tail(75))
validation_targets_test = preprocess_targets(california_housing_dataframe_test.tail(75))

#Prepare input vectors for the model
predict_test_input_fn = learn_io.pandas_input_fn(x=test_examples, y=test_targets["median_house_value"],
     num_epochs=1, shuffle=False)

predict_validation_test_input_fn = learn_io.pandas_input_fn(
    x=validation_examples_test, y=validation_targets_test["median_house_value"],
      num_epochs=1, shuffle=False)

#Evaluate on test data
test_predictions = list(dnn_regressor.predict(input_fn=predict_test_input_fn))
validation_test_predictions = list(dnn_regressor.predict(input_fn=predict_validation_test_input_fn))

# Compute performance
test_root_mean_squared_error = math.sqrt(metrics.mean_squared_error(test_predictions, test_targets))
test_validation_root_mean_squared_error = math.sqrt(metrics.mean_squared_error(validation_test_predictions, validation_targets_test))

print(test_root_mean_squared_error)
print(test_validation_root_mean_squared_error)

100.296694893
103.359908667
