# What you will learn

- Load training data and make it available to Keras
- Design and train a neural network for tabular data
- Evaluate the performance of a neural network model in Keras on unseen data
- Perform data preparation to improve skill when using neural networks
- Tune the topology and configuration of neural networks in Keras

# 1. Sonar dataset

Source: [Index of /ml/machine-learning-databases/undocumented/connectionist-bench/sonar](https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/?C=D;O=D)

Data description: [sonar_names](https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.names)

The task is to discriminate between two classes of sonar signals:

- M: sonar signals bounced off a metal cylinder, 111 samples in total
- R: sonar signals bounced off a roughly cylindrical rock, 97 samples in total

The dataset has 208 samples and 60 features, and every feature is a number in the range 0.0-1.0.

> A benefit of using this dataset is that it is a standard benchmark problem. This means that we have some idea of the expected skill of a good model. Using cross validation, a neural network should be able to achieve performance around 84% with an upper bound on accuracy for custom models at around 88%.

Let's first take a quick look at the dataset.

In [1]:
# Load the data
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data"
df = pd.read_csv(url, header=None)

In [2]:
df.head(4)

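| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.02 | 0.0371 | 0.0428 | 0.0207 | 0.0954 | 0.0986 | 0.1539 | 0.1601 | 0.3109 | 0.2111 | ... | 0.0027 | 0.0065 | 0.0159 | 0.0072 | 0.0167 | 0.018 | 0.0084 | 0.009 | 0.0032 | R |
| 1 | 0.0453 | 0.0523 | 0.0843 | 0.0689 | 0.1183 | 0.2583 | 0.2156 | 0.3481 | 0.3337 | 0.2872 | ... | 0.0084 | 0.0089 | 0.0048 | 0.0094 | 0.0191 | 0.014 | 0.0049 | 0.0052 | 0.0044 | R |
| 2 | 0.0262 | 0.0582 | 0.1099 | 0.1083 | 0.0974 | 0.228 | 0.2431 | 0.3771 | 0.5598 | 0.6194 | ... | 0.0232 | 0.0166 | 0.0095 | 0.018 | 0.0244 | 0.0316 | 0.0164 | 0.0095 | 0.0078 | R |
| 3 | 0.01 | 0.0171 | 0.0623 | 0.0205 | 0.0205 | 0.0368 | 0.1098 | 0.1276 | 0.0598 | 0.1264 | ... | 0.0121 | 0.0036 | 0.015 | 0.0085 | 0.0073 | 0.005 | 0.0044 | 0.004 | 0.0117 | R |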


In [3]:
df.tail(4)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 204 | 0.0323 | 0.0101 | 0.0298 | 0.0564 | 0.076 | 0.0958 | 0.099 | 0.1018 | 0.103 | 0.2154 | ... | 0.0061 | 0.0093 | 0.0135 | 0.0063 | 0.0063 | 0.0034 | 0.0032 | 0.0062 | 0.0067 | M |
| 205 | 0.0522 | 0.0437 | 0.018 | 0.0292 | 0.0351 | 0.1171 | 0.1257 | 0.1178 | 0.1258 | 0.2529 | ... | 0.016 | 0.0029 | 0.0051 | 0.0062 | 0.0089 | 0.014 | 0.0138 | 0.0077 | 0.0031 | M |
| 206 | 0.0303 | 0.0353 | 0.049 | 0.0608 | 0.0167 | 0.1354 | 0.1465 | 0.1123 | 0.1945 | 0.2354 | ... | 0.0086 | 0.0046 | 0.0126 | 0.0036 | 0.0035 | 0.0034 | 0.0079 | 0.0036 | 0.0048 | M |
| 207 | 0.026 | 0.0363 | 0.0136 | 0.0272 | 0.0214 | 0.0338 | 0.0655 | 0.14 | 0.1843 | 0.2354 | ... | 0.0146 | 0.0129 | 0.0047 | 0.0039 | 0.0061 | 0.004 | 0.0036 | 0.0061 | 0.0115 | M |


In [4]:
df.shape

(208, 61)

In [5]:
df_group = df.groupby(60)

In [6]:
df_group.size()

60
M    111
R     97
dtype: int64
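The class counts above also give a useful floor for judging any classifier: always predicting the more frequent class. The short sketch below works this out from the two counts printed by `df_group.size()`. The variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the df_group.size() output above.
class_counts = {"M": 111, "R": 97}

total = sum(class_counts.values())

# The class a "predict the most frequent label" model would always output.
majority_label = max(class_counts, key=class_counts.get)

# Accuracy of that trivial model on the full dataset.
majority_accuracy = class_counts[majority_label] / total

print("Total samples: {0}".format(total))
print("Majority class: {0}".format(majority_label))
print("Majority-class accuracy: {0:.2f}%".format(majority_accuracy * 100))
```

Any model worth keeping should score clearly above this figure.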

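# 2. Baseline neural network model performance

We start with a baseline: a three-layer neural network with a 60-60-1 topology (60 inputs, one hidden layer of 60 units, and a single output unit).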
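To make the 60-60-1 topology concrete, here is a minimal sketch of a single forward pass in plain Python: 60 inputs, a hidden layer of 60 ReLU units, and one sigmoid output. The weights are random and untrained, the input sample is made up, and the helper names (`dense`, `random_layer`) are hypothetical. It only illustrates the shapes involved, not how Keras implements `Dense`.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """Apply one fully connected layer: activation(w . x + b) for each unit."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs


def random_layer(n_inputs, n_units, rng):
    """Small random weights and zero biases (illustrative, untrained)."""
    weights = [[rng.gauss(0.0, 0.05) for _ in range(n_inputs)]
               for _ in range(n_units)]
    biases = [0.0] * n_units
    return weights, biases


rng = random.Random(42)
hidden_w, hidden_b = random_layer(60, 60, rng)   # 60 inputs -> 60 hidden units
output_w, output_b = random_layer(60, 1, rng)    # 60 hidden units -> 1 output

x = [rng.random() for _ in range(60)]            # one made-up sample in [0, 1)
hidden = dense(x, hidden_w, hidden_b, relu)
probability = dense(hidden, output_w, output_b, sigmoid)[0]

print("inputs:", len(x), "hidden units:", len(hidden))
print("output (read as a class probability):", probability)
```

The sigmoid output lies between 0 and 1, which is why the model is trained with binary cross-entropy and thresholded to obtain a class.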

## ref

1. [Scikit-learn API - Keras Documentation](https://keras.io/scikit-learn-api/)
2. [verbose: Model (functional API) - Keras Documentation](https://keras.io/models/model/)
3. [sklearn.model_selection.StratifiedKFold — scikit-learn 0.19.0 documentation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html)
4. [Losses - Keras Documentation](https://keras.io/losses/)

In [7]:
# Baseline NN model performance
import numpy as np
import pandas as pd

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Fix the random seed for reproducibility
seed = 42
np.random.seed(seed)
# Load the data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data"
df = pd.read_csv(url, header=None)
dataset = df.values
# Split into features (first 60 columns) and class labels (last column)
X = dataset[:, 0:60].astype(float)
Y = dataset[:, 60]
# Encode the class labels as integers (0 or 1)
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

# Build the neural network model
def creat_baseline():
    # Create the model
    model = Sequential()
    # Hidden layer: 60 units, ReLU activation
    # (`init` is the Keras 1 argument name; Keras 2 calls it `kernel_initializer`)
    model.add(Dense(60, input_dim=60, init='normal', activation='relu'))
    # Output layer: 1 unit, sigmoid activation
    model.add(Dense(1, init='normal', activation='sigmoid'))
    # Compile the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# Evaluate the model with stratified 10-fold cross-validation
# (`nb_epoch` is the Keras 1 argument name; Keras 2 calls it `epochs`)
estimator = KerasClassifier(build_fn=creat_baseline, nb_epoch=100, batch_size=5, verbose=0)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, encoded_Y, cv=kfold)
print("Baseline: {0}% ({1}%)".format(results.mean()*100, results.std()*100))

Using Theano backend.


Baseline: 76.0367974258% (7.84685282946%)


The result shows that the neural network reaches an accuracy of 76.04% on unseen data.
A plain three-layer network already scores above 70%. What happens if we improve on it?
Let's find out.
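The "unseen data" figure above comes from stratified 10-fold cross-validation: each sample is held out exactly once, and every fold keeps roughly the same M/R ratio as the full dataset. The sketch below shows that idea in plain Python using the class counts of this dataset. It is a simplified round-robin assignment without shuffling, not scikit-learn's exact `StratifiedKFold` algorithm, and `stratified_folds` is a hypothetical helper.

```python
from collections import Counter


def stratified_folds(labels, n_splits):
    """Assign every sample index to one test fold, dealing each class round-robin.

    Simplified illustration: no shuffling, and not scikit-learn's exact assignment.
    """
    folds = [[] for _ in range(n_splits)]
    position = 0
    for cls in sorted(set(labels)):
        for index, label in enumerate(labels):
            if label == cls:
                folds[position % n_splits].append(index)
                position += 1
    return folds


# Same class balance as the sonar dataset: 111 M and 97 R.
labels = ["M"] * 111 + ["R"] * 97
folds = stratified_folds(labels, n_splits=10)

for fold_number, test_indices in enumerate(folds):
    counts = Counter(labels[i] for i in test_indices)
    print(fold_number, len(test_indices), dict(counts))
```

Each fold holds about 21 samples, so a single misclassified sample shifts that fold's accuracy by roughly 5 points, which helps explain the large standard deviation reported next to the mean.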

# 3. Data preparation

We standardize the features, so that each feature has a mean of 0 and a standard deviation of 1.

> This is where the data is rescaled such that the mean value for each attribute is 0 and the standard deviation is 1. 

## ref

1. [sklearn.preprocessing.StandardScaler — scikit-learn 0.19.0 documentation](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)  
2. [sklearn.pipeline.Pipeline — scikit-learn 0.19.0 documentation](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
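To make the standardization step concrete, the sketch below rescales one feature in plain Python: subtract the mean and divide by the (population) standard deviation, which is the same formula `StandardScaler` applies per column. The feature values and helper names are made up for illustration. The point to notice is that the mean and standard deviation are learned from the training part only and then reused on the held-out part. Wrapping the scaler in a `Pipeline` and passing the pipeline to `cross_val_score` gives that behaviour for every fold.

```python
import statistics


def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return statistics.mean(column), statistics.pstdev(column)


def transform(column, mean, std):
    """Rescale each value to (x - mean) / std."""
    return [(x - mean) / std for x in column]


# Made-up values for a single feature, split into training and held-out parts.
train_feature = [0.02, 0.05, 0.11, 0.26, 0.31]
test_feature = [0.04, 0.20]

# The statistics come from the training data only.
mean, std = fit_scaler(train_feature)

train_scaled = transform(train_feature, mean, std)
# The held-out data reuses the training statistics; it is never used for fitting.
test_scaled = transform(test_feature, mean, std)

print("train mean after scaling:", statistics.mean(train_scaled))
print("train std after scaling:", statistics.pstdev(train_scaled))
print("held-out values after scaling:", test_scaled)
```

Fitting the scaler on the full dataset before cross-validation would leak information from the test folds into training, which is what the pipeline avoids.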

In [1]:
# Binary classification with standardized data
import numpy as np
import pandas as pd

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Fix the random seed for reproducibility
seed = 42
np.random.seed(seed)
# Load the data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data"
df = pd.read_csv(url, header=None)
dataset = df.values
# Split into features (first 60 columns) and class labels (last column)
X = dataset[:, 0:60].astype(float)
Y = dataset[:, 60]
# Encode the class labels as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

# Model function
def creat_baseline():
    # Create the model
    model = Sequential()
    model.add(Dense(60, input_dim=60, init='normal', activation='relu'))
    model.add(Dense(1, init='normal', activation='sigmoid'))
    # Compile the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# Evaluate the model
np.random.seed(seed)
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=creat_baseline, nb_epoch=100,
                                          batch_size=5, verbose=0)))
# Wrap both steps in a pipeline so the scaler is fit on the training folds only
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Standardized: {0}% ({1}%)".format(results.mean()*100, results.std()*100))

Using Theano backend.


Standardized: 85.5887454477% (5.27619115623%)


With standardized features, the neural network reaches an accuracy of 85.59% on unseen data.

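# 4. Changing the number of layers and units

Next we change the network topology and observe the effect on model performance.

1. Smaller: a smaller neural network with three layers, 60-30-1;
2. Larger: a larger neural network with four layers, 60-60-30-1.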
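One way to quantify "smaller" and "larger" is to count trainable parameters. For fully connected layers, each layer contributes one weight per input-output pair plus one bias per output unit. The sketch below applies that formula to the three topologies discussed here; `dense_parameter_count` is a hypothetical helper, not a Keras function.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the input size followed by the size of every Dense layer,
    e.g. [60, 60, 1] for the 60-60-1 topology.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weights
        total += n_out         # biases
    return total


topologies = {
    "baseline (60-60-1)": [60, 60, 1],
    "smaller (60-30-1)": [60, 30, 1],
    "larger (60-60-30-1)": [60, 60, 30, 1],
}

parameter_counts = {name: dense_parameter_count(sizes)
                    for name, sizes in topologies.items()}

for name, count in parameter_counts.items():
    print("{0}: {1} parameters".format(name, count))
```

With only 208 samples, all three networks have far more parameters than training examples, so differences between them should be read with some caution.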

In [9]:
# Smaller neural network model: 60-30-1
import numpy as np
import pandas as pd

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Fix the random seed for reproducibility
seed = 42
np.random.seed(seed)
# Load the data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data"
df = pd.read_csv(url, header=None)
dataset = df.values
# Split into features (first 60 columns) and class labels (last column)
X = dataset[:, 0:60].astype(float)
Y = dataset[:, 60]
# Encode the class labels as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

# smaller_model
def creat_smaller():
    # Create the model: the hidden layer has 30 units instead of 60
    model = Sequential()
    model.add(Dense(30, input_dim=60, init='normal', activation='relu'))
    model.add(Dense(1, init='normal', activation='sigmoid'))
    # Compile the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

np.random.seed(seed)
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=creat_smaller, nb_epoch=100,
                                          batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Smaller: {0:.2f}% ({1:.2f}%)".format(results.mean()*100, results.std()*100))



Smaller: 81.30% (5.70%)


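Compared with the baseline model (60-60-1), the smaller model has the same number of layers and differs only in the hidden layer: 30 units instead of 60.

The smaller network scores 81.30%, which is 4.29 percentage points lower than the 85.59% of the baseline trained on standardized data.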

In [10]:
# Larger neural network model: 60-60-30-1
import numpy as np
import pandas as pd

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Fix the random seed for reproducibility
seed = 42
np.random.seed(seed)
# Load the data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data"
df = pd.read_csv(url, header=None)
dataset = df.values
# Split into features (first 60 columns) and class labels (last column)
X = dataset[:, 0:60].astype(float)
Y = dataset[:, 60]
# Encode the class labels as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

# larger_model
def creat_larger():
    # Create the model: an extra hidden layer of 30 units after the 60-unit layer
    model = Sequential()
    model.add(Dense(60, input_dim=60, init='normal', activation='relu'))
    model.add(Dense(30, init='normal', activation='relu'))
    model.add(Dense(1, init='normal', activation='sigmoid'))
    # Compile the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

np.random.seed(seed)
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=creat_larger, nb_epoch=100,
                                          batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Larger: {0:.2f}% ({1:.2f}%)".format(results.mean()*100, results.std()*100))



Larger: 88.49% (3.76%)


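Compared with the baseline model (60-60-1), the larger model (60-60-30-1) has a different depth: it adds one more layer.

The larger network scores 88.49%, which is 2.90 percentage points higher than the 85.59% of the baseline trained on standardized data.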
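The comparisons above can be reproduced from the reported means, and it helps to set each difference next to the fold-to-fold standard deviation. The sketch below uses only the numbers printed by the three standardized runs (rounded to two decimals); the dictionary layout and variable names are illustrative.

```python
# (mean accuracy %, standard deviation %) as printed by the cells above.
results = {
    "standardized baseline (60-60-1)": (85.59, 5.28),
    "smaller (60-30-1)": (81.30, 5.70),
    "larger (60-60-30-1)": (88.49, 3.76),
}

reference_name = "standardized baseline (60-60-1)"
reference_mean, reference_std = results[reference_name]

differences = {}
for name, (mean, std) in results.items():
    if name == reference_name:
        continue
    # Difference in percentage points relative to the standardized baseline.
    differences[name] = round(mean - reference_mean, 2)
    within_one_std = abs(differences[name]) < reference_std
    print("{0}: {1:+.2f} points (within one baseline std: {2})".format(
        name, differences[name], within_one_std))
```

Both differences are smaller than the baseline's own standard deviation across folds, so on a dataset of 208 samples they are suggestive rather than conclusive. Repeating the cross-validation with several seeds would give a firmer picture.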