Random Forest 本身并不复杂，就是对数据进行多次采样（含特征采样），重要的是观察这个思路的效果。所以用Notebook来记录多个比对的结果。基本决策器采用Sklearn提供的库。而不再自行编制。<br>

- 从样本中用放回采样(Uniformly)选出n个样本
- 从所有属性中随机选择k个属性，构建决策器
- 重复以上2步m次，即构建了m棵决策树
- 通过投票表决结果。

我们尝试使用一个多维度的数据集，声呐判定mine/rock的数据集 sonar_all_data.csv. 数据集中最后一列M表示mine，R表示rock.

一下是数据样例, 一共60个特征和1个标签<br>
0.0200,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,
0.1609,0.1582,0.2238,0.0645,0.0660,0.2273,0.3100,0.2999,0.5078,0.4797,
0.5783,0.5071,0.4328,0.5550,0.6711,0.6415,0.7104,0.8080,0.6791,0.3857,
0.1307,0.2604,0.5121,0.7547,0.8537,0.8507,0.6692,0.6097,0.4943,0.2744,
0.0510,0.2834,0.2825,0.4256,0.2641,0.1386,0.1051,0.1343,0.0383,0.0324,
0.0232,0.0027,0.0065,0.0159,0.0072,0.0167,0.0180,0.0084,0.0090,0.0032,R

文件可以在https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks) 上下载. 本实验中文件名做了改变，(txt->csv)，内容没有变化。



In [233]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### 读取数据，shuffle, 转换标签M==>1.0, R==>0.0, 设定数据格式为float

In [234]:
filename = "sonar_all_data.csv"
df = pd.read_csv(filename, index_col=None, header=None)
print(df.shape)

(208, 61)


In [235]:
colnames = ['c' + str(i) for i in range(60)]
colnames.append('type')
df = pd.read_csv(filename, index_col=None, header=None, names=colnames)
df = df.sample(frac=1).reset_index(drop=True)
# df.head()

In [236]:
df['lbl']=1.0
df.loc[df['type']=='R', 'lbl'] = 0.0
# df.head()

In [237]:
df.drop('type', axis=1, inplace=True)
df.astype(np.float32, inplace=True)
print(type(df.iloc[0]['c8']),type(df.iloc[0]['lbl']))
print(df.shape)

<class 'numpy.float64'> <class 'numpy.float64'>
(208, 61)


In [238]:
df.head()

Unnamed: 0,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,...,c51,c52,c53,c54,c55,c56,c57,c58,c59,lbl
0,0.0522,0.0437,0.018,0.0292,0.0351,0.1171,0.1257,0.1178,0.1258,0.2529,...,0.016,0.0029,0.0051,0.0062,0.0089,0.014,0.0138,0.0077,0.0031,1.0
1,0.0368,0.0279,0.0103,0.0566,0.0759,0.0679,0.097,0.1473,0.2164,0.2544,...,0.0105,0.0024,0.0018,0.0057,0.0092,0.0009,0.0086,0.011,0.0052,1.0
2,0.0392,0.0108,0.0267,0.0257,0.041,0.0491,0.1053,0.169,0.2105,0.2471,...,0.0083,0.008,0.0026,0.0079,0.0042,0.0071,0.0044,0.0022,0.0014,1.0
3,0.0096,0.0404,0.0682,0.0688,0.0887,0.0932,0.0955,0.214,0.2546,0.2952,...,0.0237,0.0078,0.0144,0.017,0.0012,0.0109,0.0036,0.0043,0.0018,1.0
4,0.115,0.1163,0.0866,0.0358,0.0232,0.1267,0.2417,0.2661,0.4346,0.5378,...,0.0099,0.0065,0.0085,0.0166,0.011,0.019,0.0141,0.0068,0.0086,1.0


### 70条记录做验证，剩下的做训练

In [239]:
feature_names = ['c'+str(i) for i in range(60)]
label_name = ['lbl']
test_x = df[:70][feature_names].get_values()
test_y = df[:70][label_name].get_values().ravel()
train_x = df[70:][feature_names].get_values()
train_y = df[70:][label_name].get_values().ravel()

#### DecisionTreeClassifier/SVM/LR for complete dataset

In [240]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(train_x, train_y)
clf.score(test_x, test_y)

0.75714285714285712

In [241]:
# 这段代码不重要，就是验证一下Score函数
count = 0
for x, y in zip(test_x, test_y):
    y_ = clf.predict([x])
    if y_ == y: count += 1
print(count/len(test_y))

0.7571428571428571


In [242]:
del clf
from sklearn import svm
clf = svm.SVC()
clf.fit(train_x, train_y.ravel())
clf.score(test_x, test_y.ravel())

0.59999999999999998

In [243]:
del clf
clf = svm.NuSVC()
clf.fit(train_x, train_y.ravel())
clf.score(test_x, test_y.ravel())

0.84285714285714286

In [244]:
del clf
clf = svm.LinearSVC()
clf.fit(train_x, train_y.ravel())
clf.score(test_x, test_y.ravel())

0.7857142857142857

In [245]:
del clf
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(max_iter=600, tol=1e-3)
clf.fit(train_x, train_y.ravel())
clf.score(test_x, test_y.ravel())

0.81428571428571428

### Random Forest
- 从样本中用放回采样(Uniformly)选出n个样本
- 从所有属性中随机选择k个属性，构建决策器
- 重复以上2步m次，即构建了m棵决策树
- 通过投票表决结果。

In [246]:
clf_candidates=[DecisionTreeClassifier, 
                svm.SVC, 
                svm.NuSVC, 
                svm.LinearSVC, 
                SGDClassifier]
clf_candidates

[sklearn.tree.tree.DecisionTreeClassifier,
 sklearn.svm.classes.SVC,
 sklearn.svm.classes.NuSVC,
 sklearn.svm.classes.LinearSVC,
 sklearn.linear_model.stochastic_gradient.SGDClassifier]

In [247]:
import sklearn

m = 3
votes = [1/m] * m

num_train = len(test_y)
num_feat  = len(test_x[0])

n = int(num_train * 0.6)
k = int(np.sqrt(num_feat))

index_of_train_data = np.arange(num_train)
index_of_train_feat = np.arange(num_feat)

clfs = []
feats = []

for i in range(m):
    clf = None
    np.random.shuffle(index_of_train_data)
    np.random.shuffle(index_of_train_feat)
    row_idx = index_of_train_data[:n]
    feat_idx = index_of_train_feat[:k]
    sub_test_x = test_x[row_idx,:][:, feat_idx]
    sub_test_y = test_y[row_idx]
    func = np.random.choice(clf_candidates)
    print(func)
    if func==sklearn.linear_model.stochastic_gradient.SGDClassifier:
        clf = func(max_iter=600, tol=1e-3)
    else:
        clf = func()
    clf.fit(sub_test_x, sub_test_y)
    clfs.append(clf)
    feats.append(feat_idx)
    del clf

<class 'sklearn.svm.classes.LinearSVC'>
<class 'sklearn.svm.classes.LinearSVC'>
<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'>


In [248]:
predict = np.zeros(test_y.shape)
predict   

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.])

In [249]:
for clf, feat, vote in zip(clfs, feats, votes):
    predict += clf.predict(test_x[:,feat])*vote
      

In [250]:
predict[predict>0.5] = 1.0
predict[predict<=0.5] = 0.0
print(sum(predict==test_y)/len(test_y))

0.528571428571
