# Logistic Regression
by Andrew E. Davidson

This notebook is an introduction to TensorFlow. To illustrate how basic TensorFlow works, we implement a binary classifier using logistic regression. Logistic regression can be thought of as a neural network with a single neuron, so it is simple enough that we can implement the algorithm by hand. In the interest of learning how TensorFlow works, we do not use automatic differentiation or TensorFlow optimizers for training, even though they would dramatically reduce the amount of code required.

Note: [train.ipynb](train.ipynb) is a version of logistic regression using Keras. It's a fraction of the code a TensorFlow-only solution requires.

<span style="color:red">There are a couple of places where I used NumPy to reshape or scale the feature matrix. These should be changed to use TensorFlow. NumPy works if you are using a single machine; TensorFlow will scale across a cluster.</span>
    
Supervised learning applications implemented with TensorFlow typically have the following structure:
1. Load some data

2. Construct our 'computation graph'
    1. Initialize our input variables, model parameters, and placeholders
    2. Forward propagation
        * create a graph that makes predictions given our features and model
    3. Calculate cost
        * create a graph to measure how well our predictions match our labels
    4. Back propagation
        * calculate the gradients (i.e. the partial derivatives of our cost function with respect to the parameters our model learns)
    5. Use the gradients to update the learned parameters of our model
3. Train our model
    1. Use an optimizer or gradient descent with mini-batches
    2. Save checkpoints and the final model
    3. Save performance statistics
4. Evaluate our model's performance on the training data set
5. Evaluate our model's performance on the test data set
6. Tune hyperparameters
7. Go to 1.
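
The steps above can be sketched end to end in plain NumPy before any TensorFlow is involved. This is a minimal illustration, not the notebook's implementation: the function name `train_logistic_regression`, the toy data, and the hyperparameter values are all assumptions made for the example. Unlike the TensorFlow graph in this notebook, the sketch divides each gradient by the mini-batch size rather than by the full training-set size.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, learning_rate=0.1, epochs=200, batch_size=4):
    """X has shape (n, m), one column per sample; y has shape (1, m)."""
    n, m = X.shape
    w = np.zeros((n, 1))   # model parameters start at zero
    b = 0.0
    costs = []
    for epoch in range(epochs):
        for start in range(0, m, batch_size):
            xb = X[:, start:start + batch_size]
            yb = y[:, start:start + batch_size]
            a = sigmoid(w.T @ xb + b)          # forward propagation
            dz = a - yb                        # back propagation
            dw = (xb @ dz.T) / xb.shape[1]
            db = dz.sum() / xb.shape[1]
            w = w - learning_rate * dw         # parameter update
            b = b - learning_rate * db
        # cost over the whole training set after this epoch
        a = sigmoid(w.T @ X + b)
        cost = -np.mean(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))
        costs.append(float(cost))
    return w, b, costs

# Made-up, linearly separable toy data (one feature, eight samples).
X_toy = np.array([[-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]])
y_toy = np.array([[0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]])

w_fit, b_fit, costs = train_logistic_regression(X_toy, y_toy)
yhat = (sigmoid(w_fit.T @ X_toy + b_fit) > 0.5).astype(float)
error_rate = 1.0 - np.mean(yhat == y_toy)
print("first cost:", costs[0], "last cost:", costs[-1], "error rate:", error_rate)
```

The rest of the notebook builds the same pieces as TensorFlow graph nodes, which makes each one individually testable and visible in TensorBoard.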

In [1]:
import logRegTestFunc as lrTest
import h5py
import numpy as np
import os
import pandas as pd
import tensorflow as tf
import keras
#from keras.utils.io_utils import HDF5Matrix
from scipy import stats

# import local files
import sys
sys.path.append('pyDevProj/src')
import utils.load as utils
import utils.preprocess as pre

# fix random seed for reproducibility
np.random.seed(42)

Using TensorFlow backend.


# 1) Load RNA Seq data

In [2]:
%%time
input_file = "data/tumor_normal.h5"

with h5py.File(input_file, "r") as f:
    print("Datasets:", list(f.keys()))
    
DEBUG = True
if DEBUG:
    # this data set is simple enough we can do the math on paper
    X_train, y_train = utils.test_train_data_set()
    epochs = 500
    batch_size = 3
    log_frequence = 100
    
else:
    X_train, y_train = utils.small_train_data_set(input_file)
    #X_train, y_train = utils.full_train_data_set(input_file)
    
    n,m = np.transpose(X_train).shape
    
    # reshape the labels into a row vector of shape (1, m)
    y_train = np.reshape(y_train, (1,m))

    epochs=100
    batch_size=128
    log_frequence = epochs // 5 # integer division keeps the modulo tests exact
    
# transpose, each column should be a separate sample
X_train = np.transpose( X_train )
n,m = X_train.shape

Datasets: ['X_test', 'X_train', 'class_labels', 'classes_test', 'classes_train', 'features', 'genes', 'labels', 'y_test', 'y_train']
Training on trival (debug) data set
CPU times: user 1.45 ms, sys: 617 µs, total: 2.06 ms
Wall time: 1.87 ms


In [3]:
print("n: num features: ", n)
print("m: num training samples: ", m)
print( "epochs: ", epochs )
print( "batch_size: ", batch_size )

utils.check_data(X_train, y_train, DEBUG)

n: num features:  2
m: num training samples:  7
epochs:  500
batch_size:  3


## Explore RNA Seq data
The sigmoid function saturates for inputs with large magnitude, so with raw (unnormalized) features the model spends a lot of time learning from gradients near zero. We can speed up learning by normalizing the data.
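
The implementation of `pre.normalizeData` is not shown in this notebook. The sketch below is one plausible version, assuming it scales every feature (row) to zero mean and unit variance, which is what the assertions in the next cell check. The name `normalize_rows` and the `eps` guard are assumptions made for illustration.

```python
import numpy as np

def normalize_rows(X, eps=1e-12):
    """Scale each feature (row) of X to zero mean and unit variance.

    X has shape (n, m): one row per feature, one column per sample.
    eps guards against division by zero for constant features.
    """
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    return (X - mean) / (std + eps)

X_demo = np.arange(1.0, 13.0).reshape(4, 3)
X_norm = normalize_rows(X_demo)
print(X_norm.mean(axis=1))
print(X_norm.var(axis=1))
```

A common refinement is to compute `mean` and `std` on the training set only and reuse them for the test set, so that both sets are scaled the same way.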

In [4]:
if DEBUG:
    testND = np.arange(start=1., stop=13.).reshape(4,3)
    result = pre.normalizeData(testND)

    assert np.array_equal( np.mean(result, axis=1), 
                          np.array([0.,0.,0.,0.]) )
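    # every row should have unit variance. sum(..., 4) always passed 
    # because 4 was used as the start value, so use np.all() instead
    assert np.all( np.isclose( np.var(result, axis=1), 
                          np.array([1.,1.,1.,1.]) ) )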

if not DEBUG:
    # explore some of the X_train data
    # do we need to normalize?
    print("******** Explore RNA Seq data *******")
    print("Raw Data, rows are features")
    print("\nX_train[0:2,0:8]\n", X_train[0:2,0:8])
    print("\nX_train[100:102,0:8]\n", X_train[100:102,0:8])
    
    print("\ny_train[0, 0:8]\n", y_train[0, 0:8])

    print("\nSummary stats for two genes")
    s = stats.describe(X_train[0,:])
    print("\nX_train[0,:]\n minmax:{} mean:{} variance:{}"
              .format(s.minmax, s.mean, s.variance) )
    
    s = stats.describe(X_train[99,:])
    print("\nX_train[99,:]\n minmax:{} mean:{} variance:{}"
              .format(s.minmax, s.mean, s.variance) )  
    
    print("\n********** normalizing data ********")
    X_train = pre.normalizeData( X_train )
    print("AEDWIP !!!!! do not load test data")
    X_test = pre.normalizeData( X_test )

# 2) Construct Computation Graph 
If you want to use TensorBoard to visualize your graph, make sure you give all ops a name and group related ops using "name scopes".

In [5]:
# clear out any nodes from previous runs
tf.reset_default_graph()

## Forward Propagation
Create a graph that makes predictions given our features and model

### Init place holders and variables

In [6]:
# We use tf.name_scope() to group tensor operations in the graph image 
# created by TensorBoard. Use either the "name" argument or tf.identity to 
# label nodes in the TensorBoard image

with tf.name_scope("input_data"):
    # shape=(n, None): None means the dimension is not known in advance.
    # This is because the last mini batch may have fewer samples, 
    # i.e. m / batch_size may have a remainder

    # X is our feature matrix
    X = tf.placeholder( tf.float64, shape=(n, None), name="X")
    print("X.shape: ", X.shape)

    # y is our label vector
    y = tf.placeholder( tf.float64, shape=(1, None), name="y")
    print("y.shape:", y.shape)

# initialize our model parameters.
# w is our model's weight vector
# b is the bias (the y intercept for a linear model)
with tf.name_scope("model_parameters"):
    if DEBUG:
        # ones() makes it easy to write unit tests
        w = tf.Variable( np.ones(( n, 1 )), name="w")
        b = tf.Variable( 1.0, name="b", dtype=tf.float64)
    else:
        # Many machine learning algorithms initialize parameters from a
        # random distribution. Logistic regression typically starts at
        # zero. This should make learning faster because z = 0 sits on
        # the steepest part of the sigmoid function, where the
        # gradients are largest
        w = tf.Variable( np.zeros(( n, 1 )), name="w")
        b = tf.Variable( 0., name="b", dtype=tf.float64)
    
print("w.shape: ", w.shape)
print("tf.transpose(w).shape: ", tf.transpose(w).shape)

print("b.dtype: ", b.dtype)

X.shape:  (2, ?)
y.shape: (1, ?)
w.shape:  (2, 1)
tf.transpose(w).shape:  (1, 2)
b.dtype:  <dtype: 'float64_ref'>


### Prediction
Compute z: the input to Sigmoid()
$$ 
z = w^{T} X + b \tag{1}
$$

Compute the "neuron activation"
$$ 
a = \sigma(z) = \frac{1}{1 + e^{-z}} \tag{2}
$$
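
Equations (1) and (2) can be checked by hand in NumPy. The sketch below uses the debug data set as reconstructed from the assertions in this notebook, which is an assumption: the feature columns follow from the expected values of z, and the labels of the fifth and sixth samples are not shown anywhere, so they are inferred from the reported error rate. The names `X_dbg` and `y_dbg` are made up for the example.

```python
import numpy as np

# Debug data reconstructed from the notebook's assertions (an assumption).
X_dbg = np.array([[1., 3., 5., 7.,  9., 11., 13.],
                  [2., 4., 6., 8., 10., 12., 14.]])
y_dbg = np.array([[0., 0., 0., 1., 1., 1., 1.]])

w = np.ones((2, 1))   # same initial values as the DEBUG branch
b = 1.0

z = w.T @ X_dbg + b                    # eq. (1), shape (1, m)
a = 1.0 / (1.0 + np.exp(-z))           # eq. (2), shape (1, m)
yhat = (a > 0.5).astype(np.float64)    # threshold the activation
error_rate = 1.0 - np.mean(yhat == y_dbg)

print("z:", z)
print("error rate:", error_rate)
```

Every z is positive with these initial parameters, so every sample is predicted as class 1 and the three class 0 samples are misclassified, giving an error rate of 3/7.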

In [7]:
with tf.name_scope("Predict"):
    # define forward propagation
    # z has shape [1,m]
    z = tf.matmul(tf.transpose(w), X) + b
    tf.identity(z, "z")

    # sigmoid has shape [1,m]
    sigmoid = 1.0 / (1.0 + tf.exp( -1.0 * z ))
    tf.identity(sigmoid, "sigmoid")
    
    a = sigmoid
    tf.identity(a, "a")
    
    # define the error_rate test statistic and a TensorBoard summary 
    # yhat is our estimate of y given our current model
    yhat = tf.cast(a > 0.5, a.dtype)
    tf.identity(yhat, "yhat")
    
    match = tf.cast( tf.equal(yhat, y), yhat.dtype ) 
    error_rate = 1.0 - tf.reduce_mean( match, axis=1 )
    tf.identity(error_rate, "error_rate")
    error_rate_summary = tf.summary.scalar('Error_rate', error_rate[0])

In [8]:
if DEBUG: # test eq. (1)
    with tf.Session() as sess:
        init = tf.global_variables_initializer()
        init.run() # actual init of variable    
        result = z.eval( feed_dict={X:X_train , y:y_train} )
        
    assert ( result.shape == (1,m) )
        
    def debugExpectedZ():
        return  np.array( [[ 4.,  8., 12., 16., 20., 24., 28.]] )
    
    assert np.array_equal( result, debugExpectedZ() )

In [9]:
lrTest.debugZ(m, z, {X:X_train, y:y_train}, DEBUG)

debugZ() passed


In [10]:
if DEBUG: # test eq. (2)
    with tf.Session() as sess:
        init = tf.global_variables_initializer()
        init.run() # actual init of variable    
        result = a.eval( feed_dict={X:X_train , y:y_train} )
        
    assert (result.shape == (1,m) )

    def debugExpectedSigmoid():
        zz = debugExpectedZ()
        expected = 1.0 / ( 1.0 + np.exp(-1.0 * zz) )
        return expected
        
    assert np.array_equal( result, debugExpectedSigmoid() )

In [11]:
if DEBUG: # test error_rate
    with tf.Session() as sess:
        init = tf.global_variables_initializer()
        init.run() # actual init of variable    
        #result = error_rate.eval( feed_dict={X:X_train , y:y_train} )
        fetchResults = {
            "yhat":yhat,
            "match":match,
            "error_rate":error_rate
        }
        
        results = sess.run(fetchResults, 
                          feed_dict={X:X_train , y:y_train} )
        aa = a.eval(feed_dict={X:X_train , y:y_train})
        expected = 1 - np.sum((aa > 0.5) == y_train) / m
        assert results['error_rate'] == expected

## Evaluate Model
Create a graph to measure how well our predictions match our labels

### Loss Equation
$$ 
L(a, y) = -\left( y \log(a) + (1 - y) \log(1 - a) \right) \tag{3}
$$

### Cost Equations

$$ 
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(a^{(i)}, y^{(i)}) \tag{4}
$$
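
Evaluating eq. (3) literally can break down when the sigmoid saturates: once `a` rounds to exactly 1.0 in floating point, `log(1 - a)` becomes `-inf`. A common workaround, sketched here in NumPy as an assumption rather than as part of this notebook, is to compute the loss directly from z using the algebraically equivalent form max(z, 0) - z*y + log(1 + exp(-|z|)). TensorFlow's `tf.nn.sigmoid_cross_entropy_with_logits` is documented as using the same formulation. The function names below are made up for the example.

```python
import numpy as np

def cost_naive(z, y):
    """Eq. (3) and (4) evaluated literally from the activation a."""
    a = 1.0 / (1.0 + np.exp(-z))
    loss = -(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))
    return float(loss.mean())

def cost_from_logits(z, y):
    """Same cost computed from z, without taking log() of a value near 0."""
    loss = np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return float(loss.mean())

z_small = np.array([[-3.0, -1.0, 0.0, 2.0, 4.0]])
y_small = np.array([[0.0, 1.0, 0.0, 1.0, 0.0]])
print(cost_naive(z_small, y_small), cost_from_logits(z_small, y_small))

# A confidently wrong prediction: z = 40 while the label is 0.
z_big = np.array([[40.0]])
y_big = np.array([[0.0]])
big_cost = cost_from_logits(z_big, y_big)
print(big_cost)
```

For moderate values of z the two functions agree. For the saturated case the stable form stays finite, while the literal form would evaluate `log(0)`.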

In [12]:
with tf.name_scope("Cost"):
    # define loss function. Shape will be [1,m]

    # use element-wise multiplication
    propY = tf.multiply( y, tf.log(a) )
    propNotY = tf.multiply( (1.0 - y), tf.log(1.0 - a) )
    loss = - ( propY + propNotY )
    tf.identity(loss, "loss")
    
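    # define cost, i.e. the average loss, and a TensorBoard summary.
    # cost is a scalar. Note that m is the size of the full training set,
    # so for a mini batch this is the batch's share of the epoch's cost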

    cost = 1/m * tf.reduce_sum( loss )
    tf.identity(cost, "cost")
    cost_summary = tf.summary.scalar('Cost', cost)

In [13]:
if DEBUG:
    # test eq. (3)
    with tf.Session() as sess:
        init = tf.global_variables_initializer()
        init.run() # actual init of variable   
        result = loss.eval( feed_dict={X:X_train , y:y_train} )
        
    assert (result.shape == (1,m) )
        
    def debugExpectedLoss():
        aa = debugExpectedSigmoid()
        prob = y_train * np.log( aa )
        probNot = (1.0 - y_train) * np.log( 1.0 - aa )
        expected = -1.0 * ( prob + probNot )
        return expected
        
    expected = debugExpectedLoss()
    assert np.array_equal( result, expected )

    # test eq. (4)
    with tf.Session() as sess:
        init = tf.global_variables_initializer()
        init.run() # actual init of variable   
        result = cost.eval( feed_dict={X:X_train , y:y_train} )
    
    def debugCost():
        ll = debugExpectedLoss()
        expected = 1.0/m * np.sum(ll, axis=1)
        return expected
        
    r = np.round( result, decimals=6 )
    d = np.round( debugCost(), decimals=6 )
    assert (r == d)

## Backward propagation

### Partial derivatives of cost function with respect to weights

$$ \frac{\partial J(w,b)}{\partial w} = \frac{1}{m}X(A-Y)^T\tag{7}$$

### Partial derivatives of cost function with respect to bias
$$ \frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum_{i=1}^m (a^{(i)}-y^{(i)})\tag{8}$$
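
Both gradients follow from the chain rule: differentiating eq. (3) through the sigmoid gives dL/dz = a - y, and z is linear in w and b. The NumPy sketch below computes eq. (7) and eq. (8) and compares them with a two-sided finite-difference estimate of the cost. The data values and the helper names (`gradients`, `numeric_gradients`) are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    a = sigmoid(w.T @ X + b)
    return float(np.mean(-(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))))

def gradients(w, b, X, y):
    m = X.shape[1]
    dz = sigmoid(w.T @ X + b) - y    # shape (1, m)
    dw = (X @ dz.T) / m              # eq. (7), shape (n, 1)
    db = float(np.sum(dz)) / m       # eq. (8), scalar
    return dw, db

def numeric_gradients(w, b, X, y, eps=1e-5):
    """Two-sided (central) difference estimate of dJ/dw and dJ/db."""
    dw_est = np.zeros_like(w)
    for i in range(w.shape[0]):
        step = np.zeros_like(w)
        step[i, 0] = eps
        dw_est[i, 0] = (cost(w + step, b, X, y)
                        - cost(w - step, b, X, y)) / (2.0 * eps)
    db_est = (cost(w, b + eps, X, y) - cost(w, b - eps, X, y)) / (2.0 * eps)
    return dw_est, db_est

# Small made-up data set: two features, four samples.
X = np.array([[0.5, -1.2, 0.3, 2.0],
              [1.5, 0.7, -0.4, -1.0]])
y = np.array([[1.0, 0.0, 0.0, 1.0]])
w = np.array([[0.1], [-0.2]])
b = 0.05

dw, db = gradients(w, b, X, y)
dw_est, db_est = numeric_gradients(w, b, X, y)
print(dw.ravel(), dw_est.ravel())
print(db, db_est)
```

The finite-difference estimate is only approximate, so the comparison should use a tolerance rather than exact equality.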

In [14]:
with tf.name_scope("back_prop"):
    # dz has shape (1,m)
    dz = a - y
    tf.identity(dz, name="dz")
    
    # dw has shape (n,1)
    dw = 1.0 / m * tf.matmul(X, tf.transpose(dz))
    tf.identity(dw, name="dw")

    # db is a scalar
    db = 1.0 / m * tf.reduce_sum( dz )
    tf.identity(db, name="db")

    learning_rate = 0.001
    update_w = tf.assign(w, w - learning_rate * dw, name="update_w")
    update_b = tf.assign(b, b - learning_rate * db, name="update_b")
    

In [15]:
if DEBUG: 
    with tf.name_scope("debug_gradient"):
        # test eq. (7)
        debug_reset_w = tf.assign( w, np.ones(( n, 1 )) )
        epsilon = 0.0001
        batch = {X:X_train, y:y_train}

        def debugGradientCheck():
            """uses the definition of the derivative to compute a two-sided
            estimate of the gradient of the cost, used to check dw"""

            with tf.Session() as sess:
                init = tf.global_variables_initializer()
                init.run()
                # get the partial derivative as calculated by our 
                # computation graph
                dwResult = dw.eval( feed_dict=batch )

            def debugEstimateCost(rowIdx):
                default = 1.
                posAdj = (default + epsilon/2.0)
                negAdj = (default + -1. * epsilon/2.0)
                if rowIdx == 0:
                    plusEps  = np.array( [[posAdj], [default]] )
                    minusEps = np.array( [[negAdj], [default]] )
                else:
                    plusEps  = np.array( [[default], [posAdj]] )
                    minusEps = np.array( [[default], [negAdj]] )

                def debugCost(adj):
                    with tf.Session() as sess:
                        init = tf.global_variables_initializer()
                        init.run() 
                        # tweak the values of w, then evaluate the cost.
                        # The assignment runs in its own sess.run() call
                        # because TensorFlow does not guarantee that an
                        # assign op executes before other ops fetched
                        # in the same call
                        sess.run( tf.assign(w, adj) )
                        new_cost = sess.run(cost, feed_dict=batch)
                        return new_cost

                upper_cost = debugCost( plusEps )
                lower_cost = debugCost( minusEps )

                estimate = (upper_cost - lower_cost) / epsilon
                return estimate

            g1 = debugEstimateCost(0)
            g2 = debugEstimateCost(1)
            estimated_gradients = np.array( [[g1], [g2]] )
            return [dwResult, estimated_gradients]


        dwResult, estimate = debugGradientCheck()
        rr = np.round(dwResult, decimals=7)
        gr = np.round(estimate, decimals=7)
        assert (np.array_equal(rr, gr))

        # as an additional check, use TensorFlow automatic
        # differentiation to test the gradient by computing the
        # gradient of cost with respect to w and b  
    with tf.name_scope("debug_gradient"):
        with tf.Session() as sess:
            init = tf.global_variables_initializer()
            init.run()        

            grad_w, grad_b = tf.gradients(xs=[w, b], ys=cost)
            gradWResult = grad_w.eval(feed_dict={X:X_train, y:y_train})

            # notice auto diff has greater precision than 2 sided estimate 
            rr = np.round(dwResult, decimals=11)
            gwr = np.round(gradWResult, decimals=11)
            assert (np.array_equal(rr, gwr))

In [16]:
if DEBUG: # test eq. (8)
    with tf.Session() as sess:
        init = tf.global_variables_initializer()
        init.run() # actual init of variable  
        batch = {X:X_train, y:y_train}
        result = db.eval( feed_dict=batch )
        
        # Use TensorFlow automatic differentiation to test the gradient
        # by computing the gradient of cost with respect to w and b
        grad_w, grad_b = tf.gradients(xs=[w, b], ys=cost)
        gBResult = grad_b.eval(feed_dict=batch)
        

        # notice greater precision than 2 sided estimate test
        rr = np.round(result, decimals=11)
        gbr = np.round(gBResult, decimals=11)
        assert (np.array_equal(rr, gbr))        

# Execute graph

### fetchBatch() TODO
When we run on the full training set, the run is really long and memory pressure is very high. Test to see whether the memory pressure comes from fetchBatch or from the computation.

If the pressure comes from batching, then instead of keeping the whole training set in memory we could read each mini batch from disk. Does the data package support reads with a start and end? If not, we could preprocess the training set into mini batches.

Given that the data set is composed from three different sources, we should consider selecting random mini batches so that each mini batch is more likely to look like the overall data set.
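
One way to get random mini batches, sketched here as an assumption rather than as this notebook's plan, is to draw a fresh permutation of the sample indices each epoch and slice it into batches. The generator name `shuffled_batches` and the demo arrays are made up for the example. Samples are stored as columns, as in the rest of the notebook.

```python
import numpy as np

def shuffled_batches(X, y, batch_size, rng):
    """Yield (xBatch, yBatch) pairs that cover every column once, in random order.

    X has shape (n, m) and y has shape (1, m). The last batch may be smaller.
    """
    m = X.shape[1]
    order = rng.permutation(m)
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]
        yield X[:, idx], y[:, idx]

# Demo data: the first feature row doubles as a sample id so that the
# pairing of features and labels is easy to verify.
X_ids = np.vstack([np.arange(7.0), np.arange(7.0) + 100.0])
y_ids = np.arange(7.0).reshape(1, 7)

rng = np.random.default_rng(42)
batches = list(shuffled_batches(X_ids, y_ids, 3, rng))
for xb, yb in batches:
    print(xb.shape, yb.ravel())
```

Calling the generator once per epoch gives a different ordering each time. If memory turns out to be the bottleneck, h5py datasets can also be sliced like NumPy arrays, which reads only the requested region from disk; whether that fits this data layout would need to be checked.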

In [17]:
def fetchBatch(X, y, batchIndex, batchSize ):
    start = batchIndex * batchSize
    end = start + batchSize
    xBatch = X[:,start:end]
    yBatch = y[:,start:end] 
    return xBatch, yBatch

In [18]:
if DEBUG:
    with tf.Session() as sess, tf.name_scope("train"):
        init = tf.global_variables_initializer()
        init.run() # actual init of variable   
        
        def testFetchBatch( batchIndex, batchSize, debug=False ):
            Xb, yb = fetchBatch(X_train, y_train, batchIndex,
                                batchSize )
            return Xb, yb

        Xb, yb = testFetchBatch( batchIndex=0, batchSize=2)
        assert np.array_equal( Xb, [[1., 3.], [2., 4.]] )
        assert np.array_equal( yb, [[0., 0.]] )

        Xb, yb = testFetchBatch( batchIndex=1, batchSize=2)
        assert np.array_equal( Xb, [[5., 7.], [6., 8.]] )
        assert np.array_equal( yb, [[0., 1.]] )

        # there are 4 batches; the last batch has index 3 and only 
        # has 1 column (sample)
        Xb, yb = testFetchBatch( batchIndex=3, batchSize=2)
        assert np.array_equal( Xb, [[13.], [14.]] )
        assert np.array_equal( yb, [[1.]])

## Train Model

In [19]:
# ease of use: combine all the summaries 
merged_summary = tf.summary.merge_all()

model_saver = tf.train.Saver()
init = tf.global_variables_initializer()

In [20]:
# set up logging so we can use TensorBoard to analyze results
from datetime import datetime

now = datetime.utcnow().strftime("%Y_%m_%d_%H:%M:%S")
root_logdir = "tf_logs"
logdir = "{}/run-{}/".format(root_logdir, now)

model_dir = "models/logisticRegression"
model_name = "{}/simpleLogisticRegression/{}/model".format(model_dir, now)
print("model_name:", model_name)

model_name: models/logisticRegression/simpleLogisticRegression/2018_02_17_01:24:48/model


In [21]:
%%time

# it appears that stats are collected for each mini batch;
# however, they are only written to disk when 
# summary_writer.add_summary() is executed.
summary_writer = tf.summary.FileWriter(logdir)
summary_writer.add_graph( tf.get_default_graph() )

CPU times: user 134 ms, sys: 3.12 ms, total: 137 ms
Wall time: 136 ms


In [22]:
%%time

with tf.Session() as sess:
    init.run() # actual init of variable
    previousCost = 0.
    for epoch in range( epochs ):
        totalCost = 0.0
        numBatches = int( np.ceil(m / batch_size) )
        for i in range( numBatches):   
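            if i % log_frequence == 0:
                # i is the batch index, so when log_frequence is at
                # least numBatches this fires on the first batch of
                # every epoch
                # save a check point
                model_saver.save(sess, model_name + ".modelCkPt")

                # save training stats for the entire data set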
                stat_data = { X:X_train, y:y_train }
                step = epoch * numBatches + i
                log_entry = sess.run(merged_summary, 
                                     feed_dict=stat_data)
                summary_writer.add_summary(log_entry, step)
                
            # train using next mini batch
            xBatch, yBatch = fetchBatch( X_train, y_train,
                                        batchIndex=i,
                                        batchSize=batch_size )
            
            data = { X:xBatch, y:yBatch}

            fetchResults ={
                            "cost":cost,
                            "error_rate":error_rate,
                            # the update ops are the root of our graph
                            "update_w":update_w,
                            "update_b":update_b
                            }
                
            results = sess.run(fetchResults, feed_dict=data)
            totalCost += results["cost"]
                    
        # we expect the change to be negative if our model 
        # is improving
        change = totalCost - previousCost 
        previousCost = totalCost
        
        if ((epoch % log_frequence) == 0):             
            data = { X:X_train, y:y_train}
            answer = {"error_rate":error_rate}
            
            totalResults = sess.run(answer, feed_dict=data)
            fmt = "epoch:{:>4} totalCost:{:,.8f} change:{:,.8f}" + \
                    " error rate:{:,.5f}"
            print(fmt.format(epoch, totalCost, change, 
                             totalResults["error_rate"][0]) )
        
    print("\nm:", m, " epochs:", epochs)
    print("final error_rate:", totalResults["error_rate"] )
    print("logdir:", logdir)
    
    # save final model
    final_model_path = model_saver.save(sess, model_name + ".tfModel")
    print("path to file model:", final_model_path)

# close the log file
summary_writer.close()
   
if DEBUG:    
#     msg = "check epochs == 5000, learning_rate = 0.001"
#     assert np.isclose(totalResults["error_rate"], 0.28571429),msg
    msg = "check epochs == 500, learning_rate = 0.001"
    assert np.isclose(totalResults["error_rate"], 0.42857),msg    

epoch:   0 totalCost:3.43121309 change:3.43121309 error rate:0.42857
epoch: 100 totalCost:2.95728493 change:-0.00472883 error rate:0.42857
epoch: 200 totalCost:2.48597448 change:-0.00469424 error rate:0.42857
epoch: 300 totalCost:2.01952291 change:-0.00462772 error rate:0.42857
epoch: 400 totalCost:1.56311102 change:-0.00448054 error rate:0.42857

m: 7  epochs: 500
final error_rate: [0.42857143]
logdir: tf_logs/run-2018_02_17_01:24:48/
path to file model: models/logisticRegression/simpleLogisticRegression/2018_02_17_01:24:48/model.tfModel
CPU times: user 1min 32s, sys: 1.23 s, total: 1min 33s
Wall time: 1min 32s
