Random Forest 本身并不复杂，就是对数据进行多次采样（含特征采样），重要的是观察这个思路的效果。所以用Notebook来记录多个比对的结果。基本决策器采用Sklearn提供的库。而不再自行编制。<br>

- 从样本中用放回采样(Uniformly)选出n个样本
- 从所有属性中随机选择k个属性，构建决策器
- 重复以上2步m次，即构建了m棵决策树
- 通过投票表决结果。

我们尝试使用一个多维度的数据集，声呐判定mine/rock的数据集 sonar_all_data.csv. 数据集中最后一列M表示mine，R表示rock.

一下是数据样例, 一共60个特征和1个标签<br>
0.0200,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,
0.1609,0.1582,0.2238,0.0645,0.0660,0.2273,0.3100,0.2999,0.5078,0.4797,
0.5783,0.5071,0.4328,0.5550,0.6711,0.6415,0.7104,0.8080,0.6791,0.3857,
0.1307,0.2604,0.5121,0.7547,0.8537,0.8507,0.6692,0.6097,0.4943,0.2744,
0.0510,0.2834,0.2825,0.4256,0.2641,0.1386,0.1051,0.1343,0.0383,0.0324,
0.0232,0.0027,0.0065,0.0159,0.0072,0.0167,0.0180,0.0084,0.0090,0.0032,R

文件可以在https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks) 上下载. 本实验中文件名做了改变，(txt->csv)，内容没有变化。



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### 读取数据，shuffle, 转换标签M==>1.0, R==>0.0, 设定数据格式为float

In [2]:
filename = "sonar_all_data.csv"
df = pd.read_csv(filename, index_col=None, header=None)
print(df.shape)

(208, 61)


In [3]:
colnames = ['c' + str(i) for i in range(60)]
colnames.append('type')
df = pd.read_csv(filename, index_col=None, header=None, names=colnames)
df = df.sample(frac=1).reset_index(drop=True)
# df.head()

In [4]:
df['lbl']=1.0
df.loc[df['type']=='R', 'lbl'] = 0.0
# df.head()

In [5]:
df.drop('type', axis=1, inplace=True)
df.astype(np.float32, inplace=True)
print(type(df.iloc[0]['c8']),type(df.iloc[0]['lbl']))
print(df.shape)

<class 'numpy.float64'> <class 'numpy.float64'>
(208, 61)


In [6]:
df.head()

Unnamed: 0,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,...,c51,c52,c53,c54,c55,c56,c57,c58,c59,lbl
0,0.0409,0.0421,0.0573,0.013,0.0183,0.1019,0.1054,0.107,0.2302,0.2259,...,0.0028,0.0036,0.0105,0.012,0.0087,0.0061,0.0061,0.003,0.0078,0.0
1,0.0132,0.008,0.0188,0.0141,0.0436,0.0668,0.0609,0.0131,0.0899,0.0922,...,0.0044,0.0028,0.0021,0.0022,0.0048,0.0138,0.014,0.0028,0.0064,0.0
2,0.0126,0.0149,0.0641,0.1732,0.2565,0.2559,0.2947,0.411,0.4983,0.592,...,0.0092,0.0035,0.0098,0.0121,0.0006,0.0181,0.0094,0.0116,0.0063,0.0
3,0.026,0.0192,0.0254,0.0061,0.0352,0.0701,0.1263,0.108,0.1523,0.163,...,0.0118,0.012,0.0051,0.007,0.0015,0.0035,0.0008,0.0044,0.0077,0.0
4,0.0201,0.0178,0.0274,0.0232,0.0724,0.0833,0.1232,0.1298,0.2085,0.272,...,0.0131,0.0049,0.0104,0.0102,0.0092,0.0083,0.002,0.0048,0.0036,1.0


### 70条记录做验证，剩下的做训练

In [7]:
feature_names = ['c'+str(i) for i in range(60)]
label_name = ['lbl']
test_x = df[:70][feature_names].get_values()
test_y = df[:70][label_name].get_values().ravel()
train_x = df[70:][feature_names].get_values()
train_y = df[70:][label_name].get_values().ravel()

#### DecisionTreeClassifier/SVM/LR for complete dataset

In [8]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(train_x, train_y)
clf.score(test_x, test_y)

0.72857142857142854

In [9]:
# 这段代码不重要，就是验证一下Score函数
count = 0
for x, y in zip(test_x, test_y):
    y_ = clf.predict([x])
    if y_ == y: count += 1
print(count/len(test_y))

0.7285714285714285


In [10]:
del clf
from sklearn import svm
clf = svm.SVC()
clf.fit(train_x, train_y.ravel())
clf.score(test_x, test_y.ravel())

0.61428571428571432

In [11]:
del clf
clf = svm.NuSVC()
clf.fit(train_x, train_y.ravel())
clf.score(test_x, test_y.ravel())

0.81428571428571428

In [12]:
del clf
clf = svm.LinearSVC()
clf.fit(train_x, train_y.ravel())
clf.score(test_x, test_y.ravel())

0.7857142857142857

In [13]:
del clf
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(max_iter=600, tol=1e-3)
clf.fit(train_x, train_y.ravel())
clf.score(test_x, test_y.ravel())

0.75714285714285712

### Random Forest
- 从样本中用放回采样(Uniformly)选出n个样本
- 从所有属性中随机选择k个属性，构建决策器
- 重复以上2步m次，即构建了m棵决策树
- 通过投票表决结果。

In [14]:
# clf_candidates=[DecisionTreeClassifier, 
#                 svm.SVC, 
#                 svm.NuSVC, 
#                 svm.LinearSVC, 
#                 SGDClassifier]

clf_candidates=[DecisionTreeClassifier]
clf_candidates

[sklearn.tree.tree.DecisionTreeClassifier]

In [15]:
import sklearn

m = 100
votes = [1/m] * m

num_train = len(test_y)
num_feat  = len(test_x[0])

n = int(num_train * 0.6)
k = int(np.sqrt(num_feat))

index_of_train_data = np.arange(num_train)
index_of_train_feat = np.arange(num_feat)

clfs = []
feats = []

for i in range(m):
    clf = None
    np.random.shuffle(index_of_train_data)
    np.random.shuffle(index_of_train_feat)
    row_idx = index_of_train_data[:n]
    feat_idx = index_of_train_feat[:k]
    sub_test_x = test_x[row_idx,:][:, feat_idx]
    sub_test_y = test_y[row_idx]
    func = np.random.choice(clf_candidates)
    if func==sklearn.linear_model.stochastic_gradient.SGDClassifier:
        clf = func(max_iter=600, tol=1e-3)
    else:
        clf = func()
    clf.fit(sub_test_x, sub_test_y)
    clfs.append(clf)
    feats.append(feat_idx)
    del clf

In [16]:
predict = np.zeros(test_y.shape) 

In [17]:
for clf, feat, vote in zip(clfs, feats, votes):
    predict += clf.predict(test_x[:,feat])*vote

predict      

array([ 0.48,  0.5 ,  0.45,  0.59,  0.42,  0.38,  0.42,  0.43,  0.42,
        0.45,  0.46,  0.5 ,  0.55,  0.47,  0.54,  0.45,  0.35,  0.38,
        0.35,  0.51,  0.5 ,  0.48,  0.53,  0.54,  0.5 ,  0.5 ,  0.55,
        0.5 ,  0.49,  0.47,  0.53,  0.5 ,  0.52,  0.45,  0.46,  0.51,
        0.55,  0.52,  0.47,  0.49,  0.47,  0.44,  0.49,  0.46,  0.55,
        0.42,  0.49,  0.52,  0.44,  0.43,  0.43,  0.47,  0.5 ,  0.48,
        0.52,  0.49,  0.52,  0.55,  0.49,  0.53,  0.49,  0.5 ,  0.48,
        0.44,  0.52,  0.52,  0.54,  0.51,  0.49,  0.49])

In [18]:
predict[predict>0.5] = 1.0
predict[predict<=0.5] = 0.0
print(sum(predict==test_y)/len(test_y))

0.614285714286


In [19]:
predict == test_y

array([ True, False,  True, False, False, False,  True, False,  True,
       False, False,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True,  True, False,  True, False,  True,  True,
        True, False,  True,  True, False, False,  True,  True,  True,
       False,  True, False,  True, False,  True, False,  True,  True,
       False,  True,  True, False,  True,  True, False,  True, False,
        True, False,  True,  True, False,  True,  True,  True, False,
        True,  True, False, False,  True,  True, False], dtype=bool)

In [20]:
sum(predict==test_y)

43

In [21]:
len(test_y)

70

In [22]:
test_y

array([ 0.,  0.,  0.,  0.,  1.,  1.,  0.,  1.,  0.,  1.,  1.,  1.,  1.,
        0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,
        1.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,
        0.,  1.,  0.,  1.,  0.,  1.,  1.,  0.,  1.,  1.,  0.,  0.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,
        0.,  0.,  1.,  0.,  1.])