# A TensorFlow regression model to predict house values

This notebook guides you through the basic concepts to construct a linear regression model with the
TensorFlow library in Watson Studio, including how to import the predictive data, train the model to predict
median housing value, and save the model to use for future inference.

Some familiarity with Python is recommended. This notebook runs on Python and Spark.

## Table of contents

1. [Import the Data](#read-data)<br>
2. [Train the Model](#train)<br>
    a. [Save the Machine Learned Model to a local data set](#save)<br>
    b. [Restore the Machine Learned Model from the local data set](#restore)<br>
3. [Infer by using the Restored Machine Learned Model](#infer)<br>
4. [Measure the Quality of the Trained Model](#measurement)<br>
5. [Summary](#summary)<br>
    a. [Related Links](#links)<br>
    b. [Author Information](#author)<br>

<div class="alert alert-block alert-info">
<b>Term:</b> A 'tensor' is an n-dimensional array, and TensorFlow is a library
that makes it easy to specify a computational 'flow' of tensors and then
run that flow as efficiently as possible given the available compute power. </div>

<a id="read-data"></a>
## Import the Data

As a prerequisite, you must prepare a CSV file that contains the data you want
the regression model to learn from.

As an example, a CSV file was downloaded from the same source that the
scikit-learn `fetch_california_housing()` function uses. The result is a CSV
file, `cal_housing_data with headers.csv`, with a sample data set that relates house
prices to several predictor variables
such as house age, number of bedrooms, and municipal population. The sample
CSV file is available for download at [John
Boyer's GitHub repo](https://github.com/john-boyer-phd/TensorFlow-Samples/tree/master/Linear%20Regression).

To load the CSV file, open the **Files** window by clicking the binary icon in the upper right corner, and upload the file. Then select the empty cell below, click **Insert to code**, and choose **Insert Pandas DataFrame**. After the code is inserted, run the cell to read the CSV file.

In [1]:
# @empty_cell: delete this comment and insert the CSV file dataframe in this cell

#import pandas as pd
#df_data_1 = pd.read_csv('../../cal_housing_data with headers.csv')
#Please change last two lines to 'df_data_1...' rather than any other dataframe name


Unnamed: 0,Longitude,Latitude,HousingMedianAge,TotalRooms,TotalBedrooms,Population,Households,MedianIncomeValue,MedianHouseValue
0,-122.23,37.88,41,880,129,322,126,8.3252,452600
1,-122.22,37.86,21,7099,1106,2401,1138,8.3014,358500
2,-122.24,37.85,52,1467,190,496,177,7.2574,352100
3,-122.25,37.85,52,1274,235,558,219,5.6431,341300
4,-122.25,37.85,52,1627,280,565,259,3.8462,342200


This first snippet imports the library `numpy`, and then extracts data from the
previously loaded dataframe into numpy arrays. 

Because this is a simple model that is trained on
only 20,640 rows of data, the data is loaded all at once. For larger training sets, you would typically load the data in several smaller batches.
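
Loading data in smaller pieces can be sketched as follows. This is only an illustration under the assumption that the rows are already available as a NumPy array; the helper name `iter_batches` and the batch size are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np

def iter_batches(data, batch_size):
    """Yield consecutive row slices of `data`, each with at most `batch_size` rows."""
    for start in range(0, data.shape[0], batch_size):
        yield data[start:start + batch_size]

# Toy stand-in for a large data set: 10 rows, 3 columns
toy_data = np.arange(30, dtype=np.float64).reshape(10, 3)

# Collect the shape of each batch; the last batch holds the leftover rows
batch_shapes = [batch.shape for batch in iter_batches(toy_data, batch_size=4)]
print(batch_shapes)
```

A real pipeline would read each batch from disk rather than slice an in-memory array, but the iteration pattern is the same.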

The
`housing_data` array contains the 20,640 values for each of the eight predictor variables,
and `housing_target` is the vector of 20,640 house values that this model
is trained to predict.

If the following code cells generate an error message, please rectify the last two lines of the previous import code to 
<br><br>
*df_data_1 = pd.read_csv(body)<br>
df_data_1.head()*

In [2]:
import numpy as np

# Make a numpy array from the dataframe
data = np.array([x for x in df_data_1.values])

# Separate the 'predictors' (aka 'features') from the dependent variable (aka 'label') 
# that we will learn how to predict
housing_data = np.delete(data, 8, axis=1)
housing_target = np.delete(data, slice(0, 8), axis=1)
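
To see what the two `np.delete` calls do, here is a small sketch on a made-up array with two predictor columns and one label column. The names `toy`, `toy_features`, and `toy_labels` are illustrative only.

```python
import numpy as np

# Toy table: two predictor columns followed by one label column (made-up numbers)
toy = np.array([[1.0, 2.0, 10.0],
                [3.0, 4.0, 20.0]])

label_col = 2
toy_features = np.delete(toy, label_col, axis=1)           # drop the label column
toy_labels = np.delete(toy, slice(0, label_col), axis=1)   # drop every column before it

# Plain slicing selects the same columns
same_features = np.array_equal(toy_features, toy[:, :label_col])
same_labels = np.array_equal(toy_labels, toy[:, label_col:])
print(toy_features.shape, toy_labels.shape, same_features, same_labels)
```

Note that the labels keep a second dimension of size 1, which is why the notebook can later reshape or flatten them as needed.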

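The two lines of code here are a little housekeeping to prepare for the
machine learning step. They record the number of rows (`m`) and columns (`n`) and prepend a column of ones so that the model can learn a constant (bias) term: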

In [3]:
m, n = housing_data.shape
housing_data_plus_bias = np.c_[np.ones((m, 1)), housing_data]
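
The purpose of the column of ones is easier to see on a tiny example. In this sketch the feature values and the coefficients in `toy_theta` are made up; the point is only that one matrix product then covers both the intercept and the feature weights.

```python
import numpy as np

toy_features = np.array([[2.0, 5.0],
                         [4.0, 1.0]])
rows = toy_features.shape[0]

# Prepend a column of ones, as np.c_ does in the cell above
toy_plus_bias = np.c_[np.ones((rows, 1)), toy_features]

# Hypothetical coefficients: intercept 10, then one weight per feature
toy_theta = np.array([[10.0], [1.0], [2.0]])

# Each prediction is 10*1 + 1*x1 + 2*x2
toy_predictions = toy_plus_bias @ toy_theta
print(toy_predictions.ravel())
```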

<a id="train"></a>
## Train the Model

Next, a relatively simple machine learning model is implemented as a quick
demonstration of the elements of TensorFlow-based machine learning, with as
little computing complexity as possible for clarity. In essence, the data scientist describes what computations must
occur, and then TensorFlow determines how to do the computations efficiently.

You start by defining the 'flow', or computation graph, that
TensorFlow runs. In this particular case, the graph describes the training of a
multiple linear regression that uses the eight predictor variables to predict the housing
value variable. Here's what the code
looks like:

In [4]:
import tensorflow as tf

# Make the compute graph
X = tf.constant(housing_data_plus_bias, dtype=tf.float64, name="X")
XT = tf.transpose(X)
y = tf.constant(housing_target.reshape(-1, 1), dtype=tf.float64, name="y")

theta = tf.matmul(tf.matmul(tf.matrix_inverse(tf.matmul(XT, X)), XT), y)

The `X` variable is the matrix of the 20,640 samples by the eight predictors, plus the
leading column of ones for the bias term. `XT` is the
transpose of `X`, which is needed in the linear regression computation. The `y` variable
is the dependent variable, and it is assigned the 20,640 housing values in
the training data. The `theta` variable is the vector of linear regression
coefficients that results from the series of matrix operations on the
right side of the assignment, which implements the normal equation $\theta = (X^T X)^{-1} X^T y$.
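
The same closed-form computation can be sketched in plain NumPy on synthetic data. This is an illustration rather than part of the notebook's TensorFlow flow: the data is generated from known, made-up coefficients so that the result can be checked, and `np.linalg.lstsq` is shown as an alternative that avoids forming an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(50, 2))

# Noise-free synthetic target: y = 3 + 2*x1 - 1*x2
target = 3.0 + 2.0 * features[:, [0]] - 1.0 * features[:, [1]]

# Add the bias column, as in the notebook
X_toy = np.c_[np.ones((features.shape[0], 1)), features]

# Normal equation, mirroring the TensorFlow graph: theta = (X^T X)^-1 X^T y
theta_normal = np.linalg.inv(X_toy.T @ X_toy) @ X_toy.T @ target

# Least-squares solver that does not compute the inverse explicitly
theta_lstsq, *_ = np.linalg.lstsq(X_toy, target, rcond=None)

print(theta_normal.ravel())
print(theta_lstsq.ravel())
```

Both approaches should recover the coefficients 3, 2, and -1 up to floating-point error. On poorly conditioned data, the solver-based form is generally the safer choice.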

<div class="alert alert-block alert-info"> The previous code specifies the compute graph only, that is, the tensor flow. </div>

To perform the flow, run the following code. If you then run a line of code to
output `theta_value`, you get output similar to what is shown here:

In [5]:
# Run the compute graph
with tf.Session() as sess:
    theta_value = theta.eval()
    
# For fun, show the linear regression model (i.e. the coefficients of the linear equation)
theta_value

array([[ -3.59402294e+06],
       [ -4.28237438e+04],
       [ -4.25767219e+04],
       [  1.15630387e+03],
       [ -8.18164928e+00],
       [  1.13410689e+02],
       [ -3.85350953e+01],
       [  4.83082868e+01],
       [  4.02485142e+04]])

<a id="save"></a>
## Save the Machine Learned Model to a Local data set

This is the machine learned linear regression model. It gives the coefficients
of a linear equation that is best fit to the training data. Values for the eight
predictor variables such as the age of the house and number of bedrooms can be
used to predict a house value.

Before moving on to prediction, you need to learn how to save and reload the
model in TensorFlow. Only after you save the model can you transport it to a
production deployment environment, where you can restore it so that it can be used for inference (prediction).

If this is the first time that you run this notebook, it is recommended that you
use this line to create a subdirectory under `../datasets` in which to save the TensorFlow model
from this notebook:

In [6]:
# Make a subdirectory in which to save the model
!mkdir "../datasets/Linear Regression"

Then, to save the model, define a second simple TensorFlow compute graph that
assigns the `theta_value` vector to a variable named `model`. The following code
creates and runs this simple tensor flow, and then saves the result in
the subdirectory that was created previously.

In [7]:
# Save the model
model = tf.Variable(tf.constant(theta_value, dtype=tf.float64), name="model")

init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as saver_sess:
    init.run()
    theta_value = model.eval()
    save_path = saver.save(saver_sess, "../datasets/Linear Regression/Linear Regression.ckpt")

The save method that is used here is practical because it is the same 'checkpoint'
method that you would use if you were incrementally training a larger model in
epochs. It's also useful to understand that what you are saving is the compute
graph, the `tf.Variable` TensorFlow variables, and the values that are defined in the model
that you are checkpointing. In other words, what gets saved is specific to the
type of model that you are training because the type of model affects the
compute graph, or tensor flow, that you specified. In a neural net, for example,
you must save the structure of the net in addition to the weights and biases.
For a linear regression, you already know that the structure is a linear
equation, so saving the coefficients is sufficient. Regardless of what is
being saved, TensorFlow saves four files, as shown by the following line of code and
its output:

In [8]:
# List the files that comprise the saved model
!ls "../datasets/Linear Regression"

checkpoint				    Linear Regression.ckpt.index
Linear Regression.ckpt.data-00000-of-00001  Linear Regression.ckpt.meta
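
Because only the coefficients are needed for a linear regression, a plain NumPy file is one possible lightweight alternative to a checkpoint. The sketch below is not the TensorFlow checkpoint method used in this notebook. It uses a made-up stand-in for `theta_value`, a hypothetical file name, and a temporary directory so that nothing is left behind.

```python
import os
import tempfile

import numpy as np

# Stand-in coefficient vector; in the notebook this role is played by theta_value
toy_theta = np.array([[1.5], [-2.0], [0.25]])

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, chosen only for this illustration
    coeff_path = os.path.join(tmp_dir, "linear_regression_theta.npy")
    np.save(coeff_path, toy_theta)          # write the coefficients to disk
    restored_theta = np.load(coeff_path)    # read them back into memory

print(np.array_equal(toy_theta, restored_theta))
```

The checkpoint approach remains the better fit when the same save and restore code must also work for larger models whose graph structure has to be preserved.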


<a id="restore"></a>
## Restore the Machine Learned Model from the Local data set

Now, suppose that you were to move these four files to a production deployment
environment. The following is the code that you can use to reload the model to use for
inference:

In [9]:
# Restore the saved model 
# NOTE: This should run on inference service initialization, not on every inference

sess_restore = tf.Session()

saver = tf.train.import_meta_graph('../datasets/Linear Regression/Linear Regression.ckpt.meta')
saver.restore(sess_restore,tf.train.latest_checkpoint('../datasets/Linear Regression/'))

theta_value = sess_restore.run('model:0')

sess_restore.close()

INFO:tensorflow:Restoring parameters from ../datasets/Linear Regression/Linear Regression.ckpt


In [10]:
# For fun, show the linear regression model again
theta_value

array([[ -3.59402294e+06],
       [ -4.28237438e+04],
       [ -4.25767219e+04],
       [  1.15630387e+03],
       [ -8.18164928e+00],
       [  1.13410689e+02],
       [ -3.85350953e+01],
       [  4.83082868e+01],
       [  4.02485142e+04]])

<a id="infer"></a>
## Infer by using the Restored Machine Learned Model

Finally, you can perform inferences by using the `theta_value` vector. To simulate
making a prediction, the zeroth row of `housing_data` is used for the
values of the predictor variables. The `predicted_value` is initialized to
the constant coefficient of the linear equation, and then the remaining
coefficients of `theta_value` are placed in `coefficients` to make the
loop easier to read. The loop then multiplies each predictor variable value
`housing_data[0][j]` by the corresponding coefficient. (Each coefficient `c` in
the for loop iteration over `coefficients` is, unfortunately, an array of
size 1, so `c[0]` is used to get the actual value of the coefficient.)

It's worth noting that, for a larger model, you can also use
TensorFlow to perform the inference. But because this is a linear regression
that involves only nine coefficients, using TensorFlow might make the inference slower. 
Still, it is an easy tensor flow to write... an exercise for the reader!

In [11]:
# Now we'll do an inference to predict a value with the model
# We will use housing_data[0] as if it had been received as input to the inference service

# NOTE: This could be rewritten as TensorFlow code, though that would be more typical of 
#       larger models. At only 9 iterations, this would likely be slower as TensorFlow code

# Start by setting the predicted value equal to the linear equation's constant term
predicted_value = theta_value[0][0]

# Get the coefficients of the features (i.e. exclude the constant term accounted for above)
coefficients = theta_value[1:]

# For each feature (independent variable), add to the predicted value the product
# of the coefficient for the feature (c = theta_value[j+1]) and the j^th feature of
# the inference service input data (represented by housing_data[0])
for j, c in enumerate(coefficients):
    predicted_value += c[0] * housing_data[0][j]

If you now run a line of Python code to see the value of `predicted_value`, you
get output like the following:

In [12]:
# For fun, show the predicted value
predicted_value

411111.09606514324
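
The loop in the inference cell computes an intercept plus a dot product, so it can also be written as a single NumPy expression. The following sketch uses made-up coefficients and a made-up input row (not the housing data) to show that the two forms agree.

```python
import numpy as np

# Made-up coefficients (intercept first) and one made-up input row
toy_theta = np.array([[100.0], [2.0], [-3.0], [0.5]])
toy_row = np.array([10.0, 4.0, 8.0])

# Loop form, following the structure of the inference cell
loop_prediction = toy_theta[0][0]
for j, c in enumerate(toy_theta[1:]):
    loop_prediction += c[0] * toy_row[j]

# Vectorized form: intercept plus the dot product of features and weights
dot_prediction = toy_theta[0, 0] + toy_row @ toy_theta[1:, 0]

print(loop_prediction, dot_prediction)
```

Here both forms evaluate 100 + 2*10 - 3*4 + 0.5*8.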

Use the following code when you want to erase the trained model and start
again with a clean slate. Remove the `#` in front of the line of code and run the
cell.

In [13]:
## For when you want to wipe out the training and do it again
# !rm -rf "../datasets/Linear Regression"

<a id="measurement"></a>
## Measure the Quality of the Trained Model

Now you can measure the regression model quality with R squared. Begin by generating predictions for all of the housing data.


In [14]:
# Now we'll see how to compute predictions for a batch of items by using all of the training data
predicted_values = np.full((m), theta_value[0][0])

# Get the coefficients of the features (i.e. exclude the constant term accounted for above)
coefficients = theta_value[1:]

# For each of the m rows of housing data, update the predicted value (y) as follows:
    # For each feature (independent variable), add to the predicted value the product
    # of the coefficient for the feature (c = theta_value[j+1]) and the i^th row's
    # housing data value for the jth feature

for i, x in enumerate(housing_data):
    for j, c in enumerate(coefficients):
        predicted_values[i] += c[0] * x[j]
        
predicted_values

array([ 411111.09606514,  416144.49078677,  380432.65417531, ...,
         25026.16974547,   37991.19625605,   55550.98309601])
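
For a batch of rows, the nested loops amount to one matrix-vector product plus the intercept. The sketch below checks this on made-up rows and coefficients; the names `toy_rows` and `toy_theta` are illustrative only.

```python
import numpy as np

# Made-up coefficients (intercept first) and three made-up input rows
toy_theta = np.array([[100.0], [2.0], [-3.0]])
toy_rows = np.array([[1.0, 1.0],
                     [2.0, 0.0],
                     [0.0, 5.0]])

# Nested-loop form, following the structure of the cell above
loop_predictions = np.full(toy_rows.shape[0], toy_theta[0][0])
for i, x in enumerate(toy_rows):
    for j, c in enumerate(toy_theta[1:]):
        loop_predictions[i] += c[0] * x[j]

# Vectorized form: one matrix-vector product plus the intercept
vector_predictions = toy_theta[0, 0] + toy_rows @ toy_theta[1:, 0]

print(loop_predictions, vector_predictions)
```

On the full 20,640-row data set, the vectorized form would avoid roughly 165,000 Python-level loop iterations.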

Now we'll get the actual dependent variable values into a flat array.

In [15]:
y_actual = np.ndarray.flatten(housing_target)
y_actual

array([ 452600.,  358500.,  352100., ...,   92300.,   84700.,   89400.])

Now, we can calculate R squared by using the scikit-learn function, as this is how you'd normally do it.

In [16]:
from sklearn.metrics import r2_score
R2 = r2_score(y_actual, predicted_values)
R2

0.63710562292234463

To give more insight into how R squared characterizes model quality, we can also do the math manually. We start with taking the average of the dependent variable.

In [17]:
y_bar = np.mean(y_actual)
y_bar

206855.81690891474

Next, compute the data set variance from the mean (total sum of the squared
differences).

In [18]:
SStot = 0.0
for y_i in y_actual:
    diff = float(y_i - y_bar)
    SStot += (diff * diff)
SStot

274831981936881.9

Now you can compute the amount by which the regression model's predicted values vary
from the mean. The total is the sum of squared differences between the predicted
values and the mean.

In [19]:
SSreg = 0.0
for f_i in predicted_values:
    diff = float(f_i - y_bar)
    SSreg += (diff * diff)
SSreg

175097001050335.3

The R squared is just the ratio of these two sums. It gives the percentage of the variance from
the mean that is accounted for by using the regression model to predict values, rather than using
the mean as the predicted value for any observation in the group.

In [20]:
R_squared = SSreg / SStot
R_squared

0.6371056229203638

A second way to think about this is to consider the amount of remaining error,
that is, the amount of remaining or 'residual' variance between the actual data
points and the regression model's predicted values.

In [21]:
SSres = 0.0
for i, f_i in enumerate(predicted_values):
    diff = float(f_i - y_actual[i])
    SSres += (diff * diff)
SSres

99734980886003.83

So R squared can also be computed based on the percentage of leftover (residual)
variance.

In [22]:
R_squared = 1.0 - SSres / SStot
R_squared

0.6371056229223386
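
The two ways of computing R squared agree because, for a least-squares fit that includes a constant term, the total sum of squares splits into the explained part plus the residual part. The following sketch checks this on synthetic data with made-up coefficients and noise; it uses vectorized NumPy sums in place of the explicit loops.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)

# Synthetic target built from made-up coefficients plus noise
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 1] + noise

# Least-squares fit with a bias column, then predictions for every row
X_toy = np.c_[np.ones((x.shape[0], 1)), x]
theta_toy, *_ = np.linalg.lstsq(X_toy, y, rcond=None)
y_hat = X_toy @ theta_toy

ss_tot = np.sum((y - y.mean()) ** 2)      # total variation around the mean
ss_reg = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the model
ss_res = np.sum((y - y_hat) ** 2)         # leftover (residual) variation

r2_explained = ss_reg / ss_tot
r2_residual = 1.0 - ss_res / ss_tot
print(r2_explained, r2_residual)
```

If the model were fit without a constant term, or evaluated on data it was not fit to, the two formulas could give different values, and the residual-based form is the one that `r2_score` corresponds to.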

<a id="summary"></a>
## Summary

You learned how to train a linear regression model by using the TensorFlow library and teaching it to predict house values with several predictor variables. Unlike a classification model that predicts a nominal variable (for example, classifying an input image as being one of several possible classes), data scientists train and use a regression model to predict the value of a continuous variable or high-valued ordinal variable (like a property valuation or a number of hours a patient needs in an intensive care unit). 

<a id="links"></a>
## Related Links

- <a href="https://datascience.ibm.com/" target="_blank">See Watson Studio</a><br>
- <a href="https://www.ibm.com/developerworks/community/profiles/html/profileView.do?userid=060000VMNY&lang=en" target="_blank">Author's Blog on IBM Developer Works</a>



<a id="author"></a>
### Author

John M. Boyer, IBM Global Chief Data Office

Copyright © IBM Corp. 2018. This notebook and its source code are released under the terms of the MIT License.
