# Validation
In this exercise, we'll dive a bit more deeply into the training and evaluation of a model.

As in the prior exercises, we're working with the California housing data set to predict `median_house_value` at the city block level from 1990 census data.

In this exercise, we'll use multiple features (instead of a single feature), and also get familiar with the train / validation / test split methodology.

First off, let's load up and prepare our data.

In [None]:
import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/ml_universities/california_housing_train.csv", sep=",")

# california_housing_dataframe = california_housing_dataframe.reindex(
#     np.random.permutation(california_housing_dataframe.index))

In [None]:
def preprocess_features(california_housing_dataframe):
  """Prepares input features from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the features to be used for the model, including
    synthetic features.
  """
  selected_features = california_housing_dataframe[
    ["latitude",
     "longitude",
     "housing_median_age",
     "total_rooms",
     "total_bedrooms",
     "population",
     "households",
     "median_income"]]
  processed_features = selected_features.copy()
  # Create a synthetic feature.
  processed_features["rooms_per_person"] = (
    california_housing_dataframe["total_rooms"] /
    california_housing_dataframe["population"])
  return processed_features

def preprocess_targets(california_housing_dataframe):
  """Prepares target features (i.e., labels) from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the target feature.
  """
  output_targets = pd.DataFrame()
  # Scale the target to be in units of thousands of dollars.
  output_targets["median_house_value"] = (
    california_housing_dataframe["median_house_value"] / 1000.0)
  return output_targets

For the **training set**, we'll choose the first 12000 examples, out of the total of 17000.

In [None]:
training_examples = preprocess_features(california_housing_dataframe.head(12000))
training_examples.describe()

In [None]:
training_targets = preprocess_targets(california_housing_dataframe.head(12000))
training_targets.describe()

For the **validation set**, we'll choose the last 5000 examples, out of the total of 17000.

In [None]:
validation_examples = preprocess_features(california_housing_dataframe.tail(5000))
validation_examples.describe()

In [None]:
validation_targets = preprocess_targets(california_housing_dataframe.tail(5000))
validation_targets.describe()

### Examine the data
Okay, let's look at the data above. We have `9` input features that we can use.

**Take a quick skim over the table of values. Do they pass a quick sanity check?**

Take a look at the data on your own. Everything look okay? See how many issues you can spot. Don't worry if you don't have a background in statistics; common sense is often enough.

After you've had a chance to look over the data yourself, check the solution for some additional thoughts on how to sanity check data.

Let's take a close look at two features in particular: **`latitude`** and **`longitude`**. These are geographical coordinates of the city block in question.

This might make a nice visualization. Let's plot `latitude` and `longitude`, and use color to show the `median_house_value`.

In [None]:
plt.figure(figsize=(13, 8))

ax = plt.subplot(1, 2, 1)
ax.set_title("Validation Data")

ax.set_autoscaley_on(False)
ax.set_ylim([32, 43])
ax.set_autoscalex_on(False)
ax.set_xlim([-126, -112])
plt.scatter(validation_examples["longitude"],
            validation_examples["latitude"],
            cmap="coolwarm",
            c=validation_targets["median_house_value"] / validation_targets["median_house_value"].max())

ax = plt.subplot(1, 2, 2)
ax.set_title("Training Data")

ax.set_autoscaley_on(False)
ax.set_ylim([32, 43])
ax.set_autoscalex_on(False)
ax.set_xlim([-126, -112])
plt.scatter(training_examples["longitude"],
            training_examples["latitude"],
            cmap="coolwarm",
            c=training_targets["median_house_value"] / training_targets["median_house_value"].max())
_ = plt.plot()

Wait a second ... this should have given us a nice map of the state of California, with red showing up in expensive areas like San Francisco and Los Angeles.

The training set sort of does, compared to a [real map](https://www.google.com/maps/place/California/@37.1870174,-123.7642688,6z/data=!3m1!4b1!4m2!3m1!1s0x808fb9fe5f285e3d:0x8b5109a227086f55), but the validation set clearly doesn't.

**Go back up and look at the sanity check data again.**

Do you see any other differences in the distributions of features or targets between the training and validation data?

Check the solution to view the key issue.
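One way to make this comparison less dependent on eyeballing two `describe()` tables is to put the per-feature means side by side. The following is a minimal sketch, not part of the original exercise: the helper name `compare_feature_means` and the toy data are made up for illustration, and scaling by the standard deviation of the combined data is just one reasonable choice.

```python
import pandas as pd


def compare_feature_means(train_df, validation_df):
  """Returns per-feature means for both sets and their scaled difference.

  Hypothetical helper for illustration; not part of the exercise code.
  """
  summary = pd.DataFrame({
      "train_mean": train_df.mean(),
      "validation_mean": validation_df.mean(),
  })
  # Standard deviation of each feature over both sets combined, used to put
  # the mean differences of differently scaled features on a common footing.
  combined_std = pd.concat([train_df, validation_df]).std()
  summary["std_difference"] = (
      (summary["train_mean"] - summary["validation_mean"]).abs() / combined_std)
  # Features with the largest mismatch come first.
  return summary.sort_values("std_difference", ascending=False)


# Toy data (made up): "a" has the same mean in both sets, "b" does not.
toy_train = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                          "b": [10.0, 11.0, 12.0, 13.0]})
toy_validation = pd.DataFrame({"a": [1.5, 2.5, 3.5, 2.5],
                               "b": [20.0, 21.0, 22.0, 23.0]})
toy_summary = compare_feature_means(toy_train, toy_validation)
print(toy_summary)
```

In the notebook, the same helper could be called on `training_examples` and `validation_examples`. A large `std_difference` for a feature suggests the two sets were not drawn from the same distribution.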

### Task 1:  Go back up to the data importing and pre-processing code, and see if you spot any bugs there.
If you do, go ahead and fix the bug. Don't spend more than a minute or two looking. If you can't find the bug, check the solution for a hint.

When you've found and fixed the issue, re-run the `latitude` / `longitude` plotting cell above and confirm that our sanity checks look better.

By the way, there's an important lesson here.

**Debugging in ML is often *data debugging* rather than code debugging.**

If the data is wrong, even the most advanced ML code can't save things.

### Task 2: Train and evaluate a model.

**Spend 5 minutes or so trying different hyperparameter settings.  Try to get the best validation performance you can.**

Go ahead and write some code to set up a linear_regressor, using the `LinearRegressor` interface provided by the TensorFlow Estimators library.

It's okay to use the code in the previous exercises, but you'll want to call `fit()` and `predict()` on the appropriate data sets.

Using multiple input features instead of a single feature doesn't require anything special; the Estimators interface accepts Pandas `DataFrame` objects.

If the `DataFrame` has multiple features defined (as ours does) these will all be used.

Compare the losses on training data and validation data.

With a single raw feature, our best root mean squared error (RMSE) was about 180.

See how much better you can do now that we can use multiple features.
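As a reminder of what the RMSE measures, here is a minimal sketch that computes it with the standard library only. The function name `root_mean_squared_error` and the toy numbers are made up for illustration. Because our targets are scaled to thousands of dollars, the RMSE is in thousands of dollars too.

```python
import math


def root_mean_squared_error(predictions, targets):
  """Computes the square root of the mean of squared prediction errors."""
  if len(predictions) != len(targets):
    raise ValueError("predictions and targets must have the same length")
  squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
  return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values (made up), in thousands of dollars like our targets.
toy_targets = [200.0, 150.0, 300.0]
toy_predictions = [210.0, 140.0, 300.0]
toy_rmse = root_mean_squared_error(toy_predictions, toy_targets)
print("RMSE: %.2f" % toy_rmse)
```

With the imports at the top of this notebook, `math.sqrt(metrics.mean_squared_error(predictions, targets))` should give the same quantity. Compute it once on the training data and once on the validation data to compare the two losses.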

Use some of the sanity-checking methods we've looked at before.  These might include:

   * comparing distributions of predictions and actual target values

   * creating a scatter plot of predictions vs. target values

   * creating two scatter plots of validation data using `latitude` and `longitude`:
      * one plot mapping color to actual target `median_house_value`
      * a second plot mapping color to predicted `median_house_value` for side-by-side comparison.
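The first check in the list, comparing the distributions of predictions and targets, can be sketched as follows. This is only an illustration: the helper name `summarize_calibration` and the toy values are assumptions, and in the notebook you would pass in your model's validation predictions and `validation_targets["median_house_value"]` instead.

```python
import pandas as pd


def summarize_calibration(predictions, targets):
  """Returns summary statistics and the correlation of predictions vs. targets.

  Hypothetical helper for illustration; not part of the exercise code.
  """
  calibration = pd.DataFrame({
      "predictions": pd.Series(predictions, dtype=float),
      "targets": pd.Series(targets, dtype=float),
  })
  # describe() shows count, mean, std, min, quartiles and max for each column,
  # so mismatched ranges or means stand out quickly.
  summary = calibration.describe()
  # Pearson correlation: closer to 1.0 means predictions track the targets.
  correlation = calibration["predictions"].corr(calibration["targets"])
  return summary, correlation


# Toy values (made up), in thousands of dollars.
toy_predictions = [100.0, 200.0, 300.0, 400.0]
toy_targets = [110.0, 190.0, 310.0, 390.0]
toy_summary, toy_correlation = summarize_calibration(toy_predictions, toy_targets)
print(toy_summary)
print("Correlation: %.3f" % toy_correlation)
```

If the `predictions` column has a much narrower range than the `targets` column, the model may be predicting close to the mean for everything, which a single RMSE number would not reveal on its own.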


In [None]:
#
# Your code here
#

### Task 3: Evaluate on test data.

**In the cell below, load in the test data set and evaluate your model on it.**

We've done a lot of iteration on our validation data.  Let's make sure we haven't overfit to the peculiarities of that particular sample.

The test data set is located [here](https://storage.googleapis.com/ml_universities/california_housing_test.csv).

How does your test performance compare to the validation performance?  What does this say about the generalization performance of your model?

In [None]:
#
# Your code here
#