# TensorFlow in Practice: A Titanic Walkthrough

## I. Loading and Preprocessing the Data

### 1. Read the CSV file with pandas into a pandas.DataFrame object

In [1]:
import os
import numpy as np
import pandas as pd
import tensorflow as tf

# read data from file
data = pd.read_csv('data/train.csv')
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None


### 2. Preprocessing

1. Fill missing values (NaN) with 0
2. Convert the 'Sex' field to an integer
3. Keep the numeric fields and discard the string fields

In [2]:
# fill nan values with 0
data = data.fillna(0)
# convert ['male', 'female'] values of Sex to [1, 0]
data['Sex'] = data['Sex'].apply(lambda s: 1 if s == 'male' else 0)
# 'Survived' is the label of one class,
# add 'Deceased' as the other class
data['Deceased'] = data['Survived'].apply(lambda s: 1 - s)

# select features and labels for training
dataset_X = data[['Sex', 'Age', 'Pclass', 'SibSp', 'Parch', 'Fare']]
dataset_Y = data[['Deceased', 'Survived']]

print(dataset_X)
print(dataset_Y)

     Sex   Age  Pclass  SibSp  Parch      Fare
0      1  22.0       3      1      0    7.2500
1      0  38.0       1      1      0   71.2833
2      0  26.0       3      0      0    7.9250
3      0  35.0       1      1      0   53.1000
4      1  35.0       3      0      0    8.0500
5      1   0.0       3      0      0    8.4583
6      1  54.0       1      0      0   51.8625
7      1   2.0       3      3      1   21.0750
8      0  27.0       3      0      2   11.1333
9      0  14.0       2      1      0   30.0708
10     0   4.0       3      1      1   16.7000
11     0  58.0       1      0      0   26.5500
12     1  20.0       3      0      0    8.0500
13     1  39.0       3      1      5   31.2750
14     0  14.0       3      0      0    7.8542
15     0  55.0       2      0      0   16.0000
16     1   2.0       3      4      1   29.1250
17     1   0.0       2      0      0   13.0000
18     0  31.0       3      1      0   18.0000
19     0   0.0       3      0      0    7.2250
20     1  35.

### 3. Split the training data into a training set and a validation set

In [3]:
from sklearn.model_selection import train_test_split

# split the labeled data into a training set and a validation set
# (.values returns the underlying NumPy array; as_matrix() was removed in pandas 1.0)
X_train, X_val, y_train, y_val = train_test_split(dataset_X.values, dataset_Y.values,
                                                  test_size=0.2,
                                                  random_state=42)
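
To make the effect of `test_size=0.2` concrete, here is a minimal NumPy sketch of a shuffled hold-out split. It is an illustration only, not scikit-learn's implementation: the helper name `holdout_split` is hypothetical, the validation size is assumed to be rounded up, and the row order will not match what `train_test_split` produces for the same seed.

```python
import numpy as np


def holdout_split(features, labels, val_fraction=0.2, seed=42):
    """Shuffle the rows, then hold out a fraction of them for validation."""
    n_samples = len(features)
    # assumption: the validation size is rounded up
    n_val = int(np.ceil(n_samples * val_fraction))
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_samples)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return features[train_idx], features[val_idx], labels[train_idx], labels[val_idx]


# stand-in arrays with the same shapes as dataset_X and dataset_Y (891 rows)
demo_X = np.arange(891 * 6, dtype=np.float32).reshape(891, 6)
demo_Y = np.zeros((891, 2), dtype=np.float32)

demo_X_train, demo_X_val, demo_Y_train, demo_Y_val = holdout_split(demo_X, demo_Y)
print(demo_X_train.shape, demo_X_val.shape)
```

With 891 rows this sketch keeps 712 rows for training and holds out 179 for validation.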

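# II. Building the Computation Graph

### Logistic Regression

Logistic regression is one of the simplest and most easily understood classifiers. Mathematically, its prediction function can be written as: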

  *y = softmax(xW + b)*

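Here *x* is the input, a *1×d* row vector where *d* is the number of features. *W* is a *d×c* weight matrix where *c* is the number of classes, and *b* is a bias vector of length *c*. These shapes match the code below, where each row of the input is one sample. Mathematically, *softmax* is a normalized exponential function. Following the formula

![softmax](./images/tf_softmax.jpg)

it maps each element of a *k*-dimensional vector *x* into the interval *(0, 1)*, and the resulting elements sum to 1. In machine learning it is commonly used to turn confidence-like scores from a discriminant function (such as the distance to a separating hyperplane) into probabilities. The *softmax* function is typically used in the output layer to select a single class as the output.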
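
As a rough illustration of that normalization, the sketch below implements softmax in NumPy, assuming the standard definition in which each element is exponentiated and divided by the sum of the exponentials. The function name and the sample scores are chosen here for illustration; subtracting the row maximum is a common stability trick and is not part of the definition.

```python
import numpy as np


def softmax(scores):
    """Normalized exponential over the last axis."""
    # subtracting the row maximum leaves the result unchanged but avoids overflow in exp
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)


scores = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(scores)
print(probs)
print(probs.sum(axis=1))
```

Each output row lies strictly between 0 and 1 and sums to 1. Equal scores, as in the second row, give equal probabilities.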


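### 1. Declare input placeholders with placeholder
TensorFlow uses a data feed mechanism. The program is not executed interactively statement by statement; the declaration phase only builds the computation graph. No real data is touched at this point. Instead, the placeholder operator declares a stand-in for the input data, and the placeholder is replaced with real data only when the computation is actually run.

Declaring a placeholder takes three arguments: the element type of the input data (dtype), its shape (shape), and a name for the placeholder (name).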

In [4]:
# declare placeholders for the input data
# the first element of shape is None, so any number of records can be fed at once
X = tf.placeholder(tf.float32, shape=[None, 6], name='input')
y = tf.placeholder(tf.float32, shape=[None, 2], name='label')

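### 2. Declare the parameter variables
Variables are declared by constructing tf.Variable() objects directly.

A variable can be initialized in two ways: by deserializing it from a VariableDef protocol buffer, or by passing an initial value as an argument. The simplest way is the one used in the program below, which passes an initial value to the variable. The initial value must be a tensor, or a Python object that convert_to_tensor() can convert to a tensor. TensorFlow provides several ways to construct such initial tensors, including all-zero tensors and tensors drawn from a normal distribution. The variable keeps the shape of its initial value.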

In [5]:
# declare the variables
weights = tf.Variable(tf.random_normal([6, 2]), name='weights')
bias = tf.Variable(tf.zeros([2]), name='bias')

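### 3. Build the forward-propagation graph

Use operators to build the computation that produces the label from the input.

While the graph is being built, TensorFlow automatically infers the input and output shapes of every node. If an operation cannot be performed, for example adding two matrices with incompatible shapes, it raises an error immediately.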

In [6]:
y_pred = tf.nn.softmax(tf.matmul(X, weights) + bias)
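
The shape bookkeeping can be checked outside TensorFlow. The following NumPy sketch mirrors the forward pass with made-up data (the sample count and random values are assumptions for illustration) and shows that incompatible shapes fail right away, loosely analogous to the graph-time shape check described above.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features, n_classes = 4, 6, 2

batch = rng.rand(n_samples, n_features)   # plays the role of X
W = rng.randn(n_features, n_classes)      # plays the role of weights
b = np.zeros(n_classes)                   # plays the role of bias

# (4, 6) times (6, 2) gives (4, 2); b is broadcast across the rows
logits = np.dot(batch, W) + b
exp_logits = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp_logits / exp_logits.sum(axis=1, keepdims=True)
print(logits.shape, probs.shape)

# a weight matrix with the wrong number of rows cannot be multiplied
try:
    np.dot(batch, np.zeros((5, 2)))
    shape_error = None
except ValueError as err:
    shape_error = err
print(type(shape_error).__name__)
```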

### 4. Declare the cost function

Use cross entropy as the cost function.

In [7]:
# use cross entropy as the cost function
cross_entropy = - tf.reduce_sum(y * tf.log(y_pred + 1e-10),
                                axis=1)
# the cost of a batch is the mean cross entropy over its samples
cost = tf.reduce_mean(cross_entropy)

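#### NOTE
When computing the cross entropy, a very small value (1e-10 in the program above) is added to the model output y_pred. Without it, whenever an element of y_pred reaches exactly 0, which happens when the model is extremely confident in one class, log(y_pred) evaluates to negative infinity (-inf). The cost then becomes invalid (inf or nan), the gradient cannot be computed, and training breaks down. There are three ways to handle this:

1. Add a tiny value inside the computation so that it is always valid. This avoids computing log(0), but after the addition the effective value of y_pred can exceed 1. The example code uses this approach;
2. Use a clip() function so that y_pred is set to a tiny value whenever it gets close to 0. In other words, the range of y_pred is bounded below by that tiny value;
3. When the cross entropy evaluates to nan, explicitly set the cost to 0. This approach does not address the problem in the log computation itself, and instead adds fault tolerance to the final cost function.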
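
The first two options can be compared on a few hand-picked probabilities. This is a NumPy sketch with made-up values rather than the notebook's graph code; in a TensorFlow graph, option 2 would typically be expressed with `tf.clip_by_value`. The bounds `[eps, 1]` used for clipping are an assumption made here.

```python
import numpy as np

y_true = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 0.0]])
# row 1 puts probability 0 on the true class; row 2 puts probability 0 on the wrong class
y_prob = np.array([[0.9, 0.1],
                   [1.0, 0.0],
                   [1.0, 0.0]])
eps = 1e-10

# no safeguard: log(0) is -inf, and 0 * -inf is nan
with np.errstate(divide='ignore', invalid='ignore'):
    naive = -np.sum(y_true * np.log(y_prob), axis=1)

# option 1: add a small constant inside the log (the approach used in the example code)
with_eps = -np.sum(y_true * np.log(y_prob + eps), axis=1)

# option 2: clip the probabilities into [eps, 1] before taking the log
clipped = -np.sum(y_true * np.log(np.clip(y_prob, eps, 1.0)), axis=1)

print(naive)
print(with_eps)
print(clipped)
```

In the last row the added constant pushes the probability slightly above 1, so option 1 yields a tiny negative loss, while clipping yields exactly 0.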


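### 5. Add an optimization algorithm

TensorFlow has many classic optimization algorithms built in, such as stochastic gradient descent (SGD), Momentum, Adagrad, Adam, and RMSProp. The optimizer automatically builds the gradient computation and backpropagation parts of the graph.

For most optimization algorithms the most important parameter is the learning rate, and choosing it well takes some skill. Different optimizers may also converge at different speeds on different problems, so it is worth trying several when solving a real problem.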

In [8]:
# use the gradient descent optimizer to minimize the cost; the backpropagation part of the graph is built automatically
train_op = tf.train.GradientDescentOptimizer(0.001).minimize(cost)
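
To show what a single gradient descent update does for this model, here is a hand-written NumPy sketch. It is not what TensorFlow executes internally; the random data and the learning rate of 0.1 are assumptions chosen so that the effect of one step is visible.

```python
import numpy as np

rng = np.random.RandomState(0)
features = rng.rand(8, 6)                      # 8 samples, 6 features
labels = np.eye(2)[rng.randint(0, 2, size=8)]  # one-hot labels, shape (8, 2)
W = rng.randn(6, 2)
b = np.zeros(2)


def mean_cross_entropy(W, b):
    """Return the mean cross entropy and the predicted probabilities."""
    logits = np.dot(features, W) + b
    exp_logits = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp_logits / exp_logits.sum(axis=1, keepdims=True)
    loss = -np.mean(np.sum(labels * np.log(probs + 1e-10), axis=1))
    return loss, probs


learning_rate = 0.1  # illustrative value, larger than the 0.001 used above
loss_before, probs = mean_cross_entropy(W, b)

# gradient of the mean softmax cross entropy with respect to the logits,
# ignoring the tiny 1e-10 term: (probs - labels) / n
grad_logits = (probs - labels) / len(features)
grad_W = np.dot(features.T, grad_logits)
grad_b = grad_logits.sum(axis=0)

# one gradient descent update: move each parameter against its gradient
W = W - learning_rate * grad_W
b = b - learning_rate * grad_b

loss_after, _ = mean_cross_entropy(W, b)
print(loss_before, loss_after)
```

With a small enough learning rate, the loss after the step is lower than before. A rate that is too large can overshoot, which is one reason the learning rate needs tuning.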

### 6. (optional) Compute the accuracy

In [9]:
# compute the accuracy
correct_pred = tf.equal(tf.argmax(y, 1), tf.argmax(y_pred, 1))
acc_op = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

# III. Building the Training Loop & Running Training

### Start a Session and feed in the data. After training, evaluate the result on the validation set

In [10]:
with tf.Session() as sess:
    # variables have to be initialized at the first place
    tf.global_variables_initializer().run()

    # training loop
    for epoch in range(10):
        total_loss = 0.
        for i in range(len(X_train)):
            # prepare feed data and run
            feed_dict = {X: [X_train[i]], y: [y_train[i]]}
            _, loss = sess.run([train_op, cost], feed_dict=feed_dict)
            total_loss += loss
        # display loss per epoch
        print('Epoch: %04d, total loss=%.9f' % (epoch + 1, total_loss))

    print('Training complete!')
    
    # Accuracy calculated by TensorFlow
    accuracy = sess.run(acc_op, feed_dict={X: X_val, y: y_val})
    print("Accuracy on validation set: %.9f" % accuracy)

    # Accuracy calculated by NumPy
    pred = sess.run(y_pred, feed_dict={X: X_val})
    correct = np.equal(np.argmax(pred, 1), np.argmax(y_val, 1))
    numpy_accuracy = np.mean(correct.astype(np.float32))
    print("Accuracy on validation set (numpy): %.9f" % numpy_accuracy)

Epoch: 0001, total loss=2575.566684773
Epoch: 0002, total loss=1825.624264524
Epoch: 0003, total loss=1160.571353403
Epoch: 0004, total loss=1239.010027034
Epoch: 0005, total loss=1150.387452601
Epoch: 0006, total loss=1177.490610881
Epoch: 0007, total loss=1143.255524297
Epoch: 0008, total loss=1119.519565795
Epoch: 0009, total loss=1104.524123580
Epoch: 0010, total loss=1095.065451600
Training complete!
Accuracy on validation set: 0.597765386
Accuracy on validation set (numpy): 0.597765386


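# IV. Saving and Loading Model Parameters

Variables are saved and loaded through the tf.train.Saver class. When a Saver object is constructed, it adds operators for saving and loading variables to the computation graph, and an argument can specify which variables to save. The Saver object's save() and restore() methods are the entry points that trigger those operators.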


In [15]:
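# training step counter
# NOTE: re-running this cell in the same kernel adds another variable to the
# default graph under a new name (global_step_1, global_step_2, ...); a Saver
# built afterwards then looks for that name in the checkpoint when restoring
global_step = tf.Variable(0, name='global_step', trainable=False)
# checkpoint entry point
saver = tf.train.Saver()

# variables defined after the Saver is declared will not be saved
# non_storable_variable = tf.Variable(777)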

ckpt_dir = './ckpt_dir'
if not os.path.exists(ckpt_dir):
    os.makedirs(ckpt_dir)

with tf.Session() as sess:
    tf.global_variables_initializer().run()

    # load the model checkpoint
    ckpt = tf.train.get_checkpoint_state(ckpt_dir)
    if ckpt and ckpt.model_checkpoint_path:
        print('Restoring from checkpoint: %s' % ckpt.model_checkpoint_path)
        saver.restore(sess, ckpt.model_checkpoint_path)

    start = global_step.eval()
    for epoch in range(start, start + 10):
        total_loss = 0.
        for i in range(0, len(X_train)):
            feed_dict = {
                X: [X_train[i]],
                y: [y_train[i]]
            }
            _, loss = sess.run([train_op, cost], feed_dict=feed_dict)
            total_loss += loss
        print('Epoch: %04d, loss=%.9f' % (epoch + 1, total_loss))


        # save a model checkpoint
        global_step.assign(epoch).eval()
        saver.save(sess, ckpt_dir + '/logistic.ckpt',
                   global_step=global_step)
    print('Training complete!')

Restoring from checkpoint: ./ckpt_dir/logistic.ckpt-9
INFO:tensorflow:Restoring parameters from ./ckpt_dir/logistic.ckpt-9


NotFoundError: Key global_step_4 not found in checkpoint
	 [[Node: save_4/RestoreV2_5 = RestoreV2[dtypes=[DT_INT32], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save_4/Const_0_0, save_4/RestoreV2_5/tensor_names, save_4/RestoreV2_5/shape_and_slices)]]

Caused by op u'save_4/RestoreV2_5', defined at:
  File "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py", line 3, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python2.7/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/usr/local/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 442, in start
    ioloop.IOLoop.instance().start()
  File "/usr/local/lib/python2.7/site-packages/zmq/eventloop/ioloop.py", line 177, in start
    super(ZMQIOLoop, self).start()
  File "/usr/local/lib/python2.7/site-packages/tornado/ioloop.py", line 887, in start
    handler_func(fd_obj, events)
  File "/usr/local/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
    self._handle_recv()
  File "/usr/local/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 276, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 228, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 391, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/lib/python2.7/site-packages/ipykernel/ipkernel.py", line 199, in do_execute
    shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2717, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2821, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-15-12dcba6e0028>", line 4, in <module>
    saver = tf.train.Saver()
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1139, in __init__
    self.build()
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1170, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 691, in build
    restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 247, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 640, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

NotFoundError (see above for traceback): Key global_step_4 not found in checkpoint
	 [[Node: save_4/RestoreV2_5 = RestoreV2[dtypes=[DT_INT32], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save_4/Const_0_0, save_4/RestoreV2_5/tensor_names, save_4/RestoreV2_5/shape_and_slices)]]
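
The traceback above comes from restoring by variable name. Each run of the cell in the same notebook kernel adds another variable to the default graph, TensorFlow makes its name unique with a suffix (here `global_step_4`), and the newly built Saver then asks the checkpoint for a key that was never written. The plain-Python sketch below mimics that lookup. It is a toy model, not TensorFlow's implementation: `restore_by_name` and the checkpoint contents are hypothetical.

```python
def restore_by_name(checkpoint, graph_variable_names):
    """Toy model of name-keyed restoring: every graph variable must exist in the checkpoint."""
    restored = {}
    for name in graph_variable_names:
        if name not in checkpoint:
            raise KeyError('Key %s not found in checkpoint' % name)
        restored[name] = checkpoint[name]
    return restored


# hypothetical checkpoint contents, keyed by variable name
checkpoint = {'weights': [[0.1, 0.2]], 'bias': [0.0, 0.0], 'global_step': 9}

# a graph whose variable names match the checkpoint restores cleanly
restored = restore_by_name(checkpoint, ['weights', 'bias', 'global_step'])

# a graph that gained a renamed duplicate asks for a name the checkpoint lacks
try:
    restore_by_name(checkpoint, ['weights', 'bias', 'global_step', 'global_step_4'])
    failure = None
except KeyError as err:
    failure = str(err)
print(failure)
```

In a notebook, one common way to avoid this situation is to restart the kernel, or to call `tf.reset_default_graph()` and rebuild the whole graph, before re-running the cell.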


# TensorBoard

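TensorBoard is the visualization tool that accompanies TensorFlow. It helps with understanding complex models and with finding errors in an implementation.

TensorBoard works by starting a web server. The server process reads summary data from the event files produced by running a TensorFlow program and renders that data as charts in a web page. The main kinds of summary data are:
1. Scalar data, such as accuracy and cost, recorded by adding a tf.summary.scalar operator;
2. Parameter data, such as the weight matrix weights and the bias vector bias, usually recorded with tf.summary.histogram;
3. Image data, recorded by adding a tf.summary.image operator;
4. Audio data, recorded by adding a tf.summary.audio operator;
5. The computation graph structure, recorded automatically when the tf.summary.FileWriter object is created.

The complete program, whose output can be displayed in TensorBoard: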

In [None]:
################################
# Constructing Dataflow Graph
################################

# arguments that can be set in command line
tf.app.flags.DEFINE_integer('epochs', 10, 'Training epochs')
tf.app.flags.DEFINE_integer('batch_size', 10, 'size of mini-batch')
FLAGS = tf.app.flags.FLAGS

with tf.name_scope('input'):
    # create symbolic variables
    X = tf.placeholder(tf.float32, shape=[None, 6])
    y_true = tf.placeholder(tf.float32, shape=[None, 2])

with tf.name_scope('classifier'):
    # weights and bias are the variables to be trained
    weights = tf.Variable(tf.random_normal([6, 2]))
    bias = tf.Variable(tf.zeros([2]))
    y_pred = tf.nn.softmax(tf.matmul(X, weights) + bias)

    # add histogram summaries for weights, view on tensorboard
    tf.summary.histogram('weights', weights)
    tf.summary.histogram('bias', bias)

# Minimise cost using cross entropy
# NOTE: add an epsilon (1e-10) when calculating log(y_pred),
# otherwise the result can be -inf
with tf.name_scope('cost'):
    cross_entropy = - tf.reduce_sum(y_true * tf.log(y_pred + 1e-10),
                                    axis=1)
    cost = tf.reduce_mean(cross_entropy)
    tf.summary.scalar('loss', cost)

# use gradient descent optimizer to minimize cost
train_op = tf.train.GradientDescentOptimizer(0.001).minimize(cost)

with tf.name_scope('accuracy'):
    correct_pred = tf.equal(tf.argmax(y_true, 1), tf.argmax(y_pred, 1))
    acc_op = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
    # Add scalar summary for accuracy
    tf.summary.scalar('accuracy', acc_op)

global_step = tf.Variable(0, name='global_step', trainable=False)
# use saver to save and restore model
saver = tf.train.Saver()

# this variable won't be stored, since it is declared after tf.train.Saver()
non_storable_variable = tf.Variable(777)

ckpt_dir = './ckpt_dir'
if not os.path.exists(ckpt_dir):
    os.makedirs(ckpt_dir)

################################
# Training the model
################################

# use session to run the calculation
with tf.Session() as sess:
    # create a log writer. run 'tensorboard --logdir=./logs'
    writer = tf.summary.FileWriter('./logs', sess.graph)
    merged = tf.summary.merge_all()

    # variables have to be initialized at the first place
    tf.global_variables_initializer().run()

    # restore variables from checkpoint if exists
    ckpt = tf.train.get_checkpoint_state(ckpt_dir)
    if ckpt and ckpt.model_checkpoint_path:
        print('Restoring from checkpoint: %s' % ckpt.model_checkpoint_path)
        saver.restore(sess, ckpt.model_checkpoint_path)

    start = global_step.eval()
    # training loop
    for epoch in range(start, start + FLAGS.epochs):
        total_loss = 0.
        for i in range(0, len(X_train), FLAGS.batch_size):
            # train with mini-batch
            feed_dict = {
                X: X_train[i: i + FLAGS.batch_size],
                y_true: y_train[i: i + FLAGS.batch_size]
            }
            _, loss = sess.run([train_op, cost], feed_dict=feed_dict)
            total_loss += loss
        # display loss per epoch
        print('Epoch: %04d, loss=%.9f' % (epoch + 1, total_loss))

        summary, accuracy = sess.run([merged, acc_op],
                                     feed_dict={X: X_val, y_true: y_val})
        writer.add_summary(summary, epoch)  # Write summary
        print('Accuracy on validation set: %.9f' % accuracy)

        # set and update(eval) global_step with epoch
        global_step.assign(epoch).eval()
        saver.save(sess, ckpt_dir + '/logistic.ckpt',
                   global_step=global_step)
    print('Training complete!')

![graph](./images/tf_graph.png)
![accuracy](./images/tf_accuracy.png)
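
The training loop above steps through the data with `range(0, len(X_train), FLAGS.batch_size)`. A small sketch with a stand-in array (25 rows and a batch size of 10, both chosen here for illustration) shows how that slicing behaves when the data does not divide evenly.

```python
import numpy as np

samples = np.arange(25).reshape(25, 1)  # stand-in for X_train with 25 rows
batch_size = 10

batch_sizes = []
for start in range(0, len(samples), batch_size):
    # slicing past the end is safe: the last batch is simply shorter
    batch = samples[start:start + batch_size]
    batch_sizes.append(len(batch))

print(batch_sizes)
```

Because `cost` is a mean over the batch, `total_loss` in the mini-batch loop sums one mean per batch. Its scale is therefore not directly comparable with the total from the earlier loop that fed one sample at a time.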