# Trying out features

**Learning Objectives:**
  * Improve the accuracy of a model by adding new features with the appropriate representation

The data is based on 1990 census data from California. It is recorded at the city block level, so count features such as total_rooms and population are totals for an entire block (all the rooms in that block, or all the people who live on it), not values for a single household.
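
Because these columns are block totals, they are often more informative once normalized by the number of households in the block. The following is a minimal sketch of that idea using made-up values, not rows from the dataset. The input column names match the CSV used in this notebook, while the derived column names are illustrative assumptions.

```python
import pandas as pd

# Made-up block-level rows, not real census data.
blocks = pd.DataFrame({
  'total_rooms': [5000.0, 1200.0],
  'population': [2000.0, 900.0],
  'households': [1000.0, 300.0],
})

# Divide block totals by the household count to get per-household averages.
blocks['rooms_per_household'] = blocks['total_rooms'] / blocks['households']
blocks['people_per_household'] = blocks['population'] / blocks['households']

print(blocks[['rooms_per_household', 'people_per_household']])
```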

## Set Up
In this first cell, we'll load the necessary libraries.

In [None]:
import math
import shutil
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow.contrib.learn as estimators
import tensorflow.contrib.layers as tflayers
from tensorflow.contrib.learn.python.learn.learn_io import pandas_input_fn

tf.logging.set_verbosity(tf.logging.INFO)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

Next, we'll load our data set.

In [None]:
df = pd.read_csv("https://storage.googleapis.com/ml_universities/california_housing_train.csv", sep=",")

## Examine and split the data

It's a good idea to get to know your data a little bit before you work with it.

We'll print out a quick summary of a few useful statistics on each column.

This will include things like mean, standard deviation, max, min, and various quantiles.

In [None]:
df.describe()

Now, split the data into two parts: training (roughly 80% of the rows, chosen at random) and evaluation (the remaining rows).

In [None]:
msk = np.random.rand(len(df)) < 0.8
traindf = df[msk]
evaldf = df[~msk]
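
The mask above is drawn from NumPy's global random state, so each run produces a different split. If a repeatable split is wanted, one option is to seed a dedicated generator. Here is a minimal sketch; the helper name split_mask and the seed value are illustrative assumptions and not part of the original notebook.

```python
import numpy as np

def split_mask(num_rows, train_fraction=0.8, seed=42):
  """Return a boolean mask that is True for rows assigned to training."""
  # A private generator leaves NumPy's global random state untouched.
  rng = np.random.RandomState(seed)
  return rng.rand(num_rows) < train_fraction

mask = split_mask(1000)
print(mask.sum(), (~mask).sum())  # training rows, evaluation rows
```

With the DataFrame above, this could be used as msk = split_mask(len(df)), followed by the same df[msk] and df[~msk] indexing.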

## Training and Evaluation

In this exercise, we'll be trying to predict median_house_value. It will be our label (sometimes also called a target).

We'll modify create_feature_cols and the input function below to represent the features we want to use.
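
One representation choice in the next cell is to bucketize latitude instead of feeding it in as a raw number, so that a linear model can learn a separate weight for each latitude band. The sketch below approximates that idea with NumPy alone: np.digitize stands in for the TensorFlow column, and the helper name one_hot_bucket is an illustrative assumption.

```python
import numpy as np

# Same boundaries as the latitude column: 32.0, 33.0, ..., 41.0
boundaries = np.arange(32.0, 42, 1)

def one_hot_bucket(values, boundaries):
  """Map each value to a one-hot vector marking its bucket."""
  # digitize returns 0 for values below the first boundary and
  # len(boundaries) for values at or above the last one.
  idx = np.digitize(values, boundaries)
  return np.eye(len(boundaries) + 1)[idx]

latitudes = np.array([31.5, 34.2, 37.7, 41.9])
print(np.digitize(latitudes, boundaries))
print(one_hot_bucket(latitudes, boundaries).shape)
```

Ten boundaries produce eleven buckets, because values below the first boundary and at or above the last one each get a bucket of their own.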

In [None]:
# The other libraries used here were already imported in the Set Up cell.
from tensorflow.contrib.learn.python.learn import learn_runner

def create_feature_cols():
  return [
    tflayers.real_valued_column('age'),
    tflayers.bucketized_column(tflayers.real_valued_column('latitude'), boundaries=np.arange(32.0, 42, 1).tolist()),
    tflayers.real_valued_column('num_rooms'),
    tflayers.real_valued_column('income')
  ]

def create_input_fn(df):
  def _impl():
    features = {
      'age' : tf.constant(df['housing_median_age']),
      'num_rooms' : tf.constant(df['total_rooms'] / df['households']),
      'latitude' : tf.constant(df['latitude']),
      'income' : tf.constant(df['median_income']),
    }
    label = tf.constant(df['median_house_value']/100000) # scale the label to units of $100,000; the reason is covered later in the course
    return features, label

  return _impl


def train_and_eval(output_dir):
  tf.logging.set_verbosity(tf.logging.INFO)

  # train and eval input functions
  train_input_fn = create_input_fn(traindf)
  eval_input_fn = create_input_fn(evaldf)

  def _experiment_fn(output_dir):
    # create estimator
    model = estimators.LinearRegressor(model_dir=output_dir,
                                       feature_columns=create_feature_cols())

    experiment = estimators.Experiment(model,
      train_input_fn=train_input_fn,
      eval_input_fn=eval_input_fn,
      #eval_metrics = {'rmse': estimators.MetricSpec(metric_fn=tf.metrics.root_mean_squared_error)},
      train_steps=100
    )
    return experiment

  learn_runner.run(_experiment_fn, output_dir=output_dir)
    

outdir = './trained_model'
shutil.rmtree(outdir, ignore_errors=True)
train_and_eval(outdir)
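
Because the label was divided by 100000, any loss reported during evaluation is in those scaled units. Assuming the reported loss is a mean squared error (an assumption worth checking against the actual evaluation output), it can be converted to an RMSE in dollars as sketched below. The function name and the example loss value are made up for illustration.

```python
import math

LABEL_SCALE = 100000  # the label was divided by this in the input function

def rmse_in_dollars(mse_scaled, scale=LABEL_SCALE):
  """Convert an MSE on the scaled label to an RMSE in original units."""
  return math.sqrt(mse_scaled) * scale

# Hypothetical loss value, not a result from this notebook.
print(rmse_in_dollars(0.64))
```

Reporting the error in dollars makes it easier to judge whether a new feature or representation has improved the model in a meaningful way.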