## Random Forest

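Because a single **Decision Tree** tends to **overfit**, a Random Forest builds several Decision Trees and lets them vote on the result:
1. Decide how many Decision Trees to build.
2. For each tree, randomly draw a subset of the training data and build a Decision Tree from it.
3. Each Decision Tree produces its own classification.
4. Tally the votes over all classifications; the class with the most votes is the final result.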
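
The four steps above can be sketched in plain Python. This is only an illustration, not the `tree` module used later in this notebook: `train_stump`, `predict_stump`, `train_forest`, and `predict_forest` are hypothetical names, the toy data is made up, and a one-level "stump" that splits on a single random feature stands in for a full Decision Tree.

```python
import random
from collections import Counter

def train_stump(rows, labels, rng):
    """Hypothetical stand-in for a Decision Tree: split once on a random feature."""
    feature = rng.randrange(len(rows[0]))
    counts_by_value = {}
    for row, label in zip(rows, labels):
        counts_by_value.setdefault(row[feature], Counter())[label] += 1
    # For each value of the chosen feature, remember its most common label.
    table = {value: counts.most_common(1)[0][0]
             for value, counts in counts_by_value.items()}
    default = Counter(labels).most_common(1)[0][0]  # used for unseen values
    return feature, table, default

def predict_stump(stump, row):
    feature, table, default = stump
    return table.get(row[feature], default)

def train_forest(rows, labels, n_trees, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):  # step 1: the number of trees is chosen up front
        # Step 2: draw a subset of the rows, with replacement, for this tree.
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        forest.append(train_stump(sample_rows, sample_labels, rng))
    return forest

def predict_forest(forest, row):
    votes = [predict_stump(stump, row) for stump in forest]  # step 3
    return Counter(votes).most_common(1)[0][0]               # step 4

# Toy data (made up): the label depends only on the first feature.
rows = [["low", "small"], ["low", "big"], ["high", "small"], ["high", "big"]] * 5
labels = ["acc", "acc", "unacc", "unacc"] * 5
forest = train_forest(rows, labels, n_trees=15)
print(predict_forest(forest, ["low", "big"]))
```

A real forest would use full trees instead of stumps, but the control flow (sample, train, collect votes, take the majority) stays the same.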

**Training set: values must be numeric.**
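
The car rows used later in this notebook are stored as strings such as `'low'` or `'5more'`, so they would need to be turned into numbers before being passed to a numeric-only model. One possible approach is an ordinal encoding with plain dictionaries, sketched below. The table `LEVELS`, the helper `encode_row`, and the integer codes themselves are assumptions made for illustration, not the encoding used by the `cars` module.

```python
# Hypothetical integer codes for each categorical column (an assumption).
LEVELS = [
    {"low": 0, "med": 1, "high": 2, "vhigh": 3},  # buying price
    {"low": 0, "med": 1, "high": 2, "vhigh": 3},  # price of maintenance
    {"2": 2, "3": 3, "4": 4, "5more": 5},         # number of doors
    {"2": 2, "4": 4, "more": 5},                  # person capacity
    {"small": 0, "med": 1, "big": 2},             # size of luggage boot
    {"low": 0, "med": 1, "high": 2},              # estimated safety
]

def encode_row(row):
    """Turn one row of category strings into a list of integers."""
    return [LEVELS[col][value] for col, value in enumerate(row)]

print(encode_row(['low', 'med', '5more', '4', 'small', 'med']))
```

An unknown category raises a `KeyError` here, which makes unexpected values visible instead of silently mapping them to a number.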

### Bagging

#### Dataset

<img src="https://github.com/MiaZhang17/MachineLearning/blob/main/piecures/bagging_1.png?raw=true" style="width:200px;">

#### 1. Randomly draw subsets and classify with each one

<img src="https://github.com/MiaZhang17/MachineLearning/blob/main/piecures/bagging_2.png?raw=true" style="width:600px;">

#### 2. Take a vote over all classification results; the class with the most votes is the final result

<img src="https://github.com/MiaZhang17/MachineLearning/blob/main/piecures/bagging_3.png?raw=true" style="width:400px;">

## Bagging training set

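- Rows are sampled with replacement (`replacement=True`), so the same row can be selected more than once.
- The subset usually has the same size as the training set: `n` rows.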
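
Because rows are drawn with replacement, a subset of size `n` contains duplicates and leaves some rows out: on average about $1 - 1/e \approx 63\%$ of the distinct rows appear in it. The sketch below checks this empirically; `bootstrap_indices` is a hypothetical helper and the seed is arbitrary.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n row indices from range(n) with replacement."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(4)  # arbitrary seed, only for reproducibility
n = 1000
indices = bootstrap_indices(n, rng)

# Fraction of distinct rows that made it into the subset.
unique_fraction = len(set(indices)) / n
print(len(indices), round(unique_fraction, 2))
```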

In [1]:
# Build one Decision Tree on a randomly drawn subset of the training set
from tree import build_tree, print_tree, car_data, car_labels
import random
random.seed(4)
print(len(car_data), car_data[0])

# Tree trained on the full data set, for comparison
tree = build_tree(car_data, car_labels)

# Draw as many row indices as there are rows, with replacement
n = len(car_data)
indices = [random.randint(0, n - 1) for _ in range(n)]
print(indices[:2])

# Collect the sampled rows and their labels
data_subset = [car_data[i] for i in indices]
labels_subset = [car_labels[i] for i in indices]

# Tree trained on the sampled subset
subset_tree = build_tree(data_subset, labels_subset)
print_tree(subset_tree)

1000 ['low', 'med', '5more', '4', 'small', 'med']
[241, 310]
Person Capacity
--> Branch 2:
  Predict Counter({'unacc': 354})
--> Branch 4:
  Estimated Saftey
  --> Branch high:
    Buying Price
    --> Branch high:
      Price of maintenance
      --> Branch high:
        Predict Counter({'acc': 3})
      --> Branch low:
        Predict Counter({'acc': 6})
      --> Branch med:
        Predict Counter({'acc': 6})
      --> Branch vhigh:
        Predict Counter({'unacc': 6})
    --> Branch low:
      Price of maintenance
      --> Branch high:
        Size of luggage boot
        --> Branch big:
          Predict Counter({'vgood': 6})
        --> Branch med:
          Number of doors
          --> Branch 2:
            Predict Counter({'acc': 1})
          --> Branch 5more:
            Predict Counter({'vgood': 2})
        --> Branch small:
          Predict Counter({'acc': 1})
      --> Branch low:
        Size of luggage boot
        --> Branch big:
          Predict Counter({'vgood':

## Bagging features

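- Each time a split is chosen, only a random subset of the features is considered.
- As a result, each Decision Tree ends up using a different mix of features.
- Typically $\sqrt{n}$ features are drawn, where $n$ is the total number of features.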
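
For the 6 car features, $\sqrt{6} \approx 2.45$, which the sketch below rounds up to 3. Rounding up is an assumption (other rounding conventions exist), and `pick_features` is a hypothetical helper.

```python
import math
import random

def pick_features(n_features, rng):
    """Pick about sqrt(n_features) distinct feature indices."""
    k = math.ceil(math.sqrt(n_features))  # rounding up is an assumption
    return rng.sample(range(n_features), k)

rng = random.Random(1)  # arbitrary seed
print(pick_features(6, rng))  # three distinct indices from 0..5
```

Sampling without replacement matters here: drawing the same feature twice would only shrink the set of candidates for the split.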

In [2]:
from tree import car_data, car_labels, split, information_gain
import random
import numpy as np
np.random.seed(1)
random.seed(4)

def find_best_split(dataset, labels):
    best_gain = 0
    best_feature = 0
    # Consider only 3 randomly chosen feature indices (no repeats)
    features = np.random.choice(len(dataset[0]), 3, replace=False)

    for feature in features:
        data_subsets, label_subsets = split(dataset, labels, feature)
        gain = information_gain(labels, label_subsets)
        # Keep the feature with the highest information gain so far
        if gain > best_gain:
            best_gain, best_feature = gain, feature
    return best_gain, best_feature

# Draw 1000 row indices with replacement
indices = [random.randint(0, 999) for _ in range(1000)]

data_subset = [car_data[index] for index in indices]
labels_subset = [car_labels[index] for index in indices]

print(find_best_split(data_subset, labels_subset))

(0.010225712539814483, 4)


In [3]:
from tree2 import build_tree, print_tree, car_data, car_labels, classify
from collections import Counter
import random
random.seed(4)

print(len(car_data))
# Feature order: buying price, cost of maintenance, number of doors,
# person capacity, size of the trunk, safety rating
unlabeled_point = ['high', 'vhigh', '3', 'more', 'med', 'med']
predictions = []
for i in range(20):
    # Note: this draws 1000 indices from 0..999, so only the first 1000 rows can be sampled
    indices = [random.randint(0, 999) for _ in range(1000)]
    data_subset = [car_data[index] for index in indices]
    labels_subset = [car_labels[index] for index in indices]
    # Train one tree per subset and record its vote
    subset_tree = build_tree(data_subset, labels_subset)
    result = classify(unlabeled_point, subset_tree)
    predictions.append(result)

print(predictions)
# Majority vote: the label that occurs most often wins
final_prediction = max(predictions, key=predictions.count)
print(final_prediction)

print(Counter(predictions))

1728
['acc', 'unacc', 'acc', 'unacc', None, 'acc', 'acc', 'unacc', 'unacc', None, 'acc', 'unacc', 'acc', 'acc', 'acc', 'acc', 'acc', 'unacc', None, 'acc']
acc
Counter({'acc': 11, 'unacc': 6, None: 3})
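
The `None` entries in the output above are votes where `classify` returned `None`. `max(predictions, key=predictions.count)` counts `None` like any other label, so with enough such votes `None` itself could win. One possible variant, sketched below with a hypothetical helper `majority_vote`, drops `None` before counting; whether ignoring those votes is appropriate depends on the application.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common non-None label, or None if every vote is None."""
    counts = Counter(p for p in predictions if p is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# The 20 votes printed above.
predictions = ['acc', 'unacc', 'acc', 'unacc', None, 'acc', 'acc', 'unacc',
               'unacc', None, 'acc', 'unacc', 'acc', 'acc', 'acc', 'acc',
               'acc', 'unacc', None, 'acc']
print(majority_vote(predictions))
```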


In [4]:
from tree3 import training_data, training_labels, testing_data, testing_labels, make_random_forest, make_single_tree, classify
import numpy as np
import random
np.random.seed(1)
random.seed(1)

# Train one tree and a forest of 40 trees on the same training data
tree = make_single_tree(training_data, training_labels)
forest = make_random_forest(40, training_data, training_labels)
single_tree_correct = 0
forest_correct = 0

for i in range(len(testing_data)):
    # Single-tree prediction
    prediction = classify(testing_data[i], tree)
    if prediction == testing_labels[i]:
        single_tree_correct += 1
    # Forest prediction: majority vote over all trees
    predictions = [classify(testing_data[i], forest_tree) for forest_tree in forest]
    forest_prediction = max(predictions, key=predictions.count)
    if forest_prediction == testing_labels[i]:
        forest_correct += 1

# Accuracy on the test set
print('single_tree_correct:', single_tree_correct / len(testing_data))
print('forest_correct:', forest_correct / len(testing_data))

single_tree_correct: 0.8815028901734104
forest_correct: 0.9219653179190751


## sklearn

In [5]:
import warnings

# Silence library warnings by replacing warnings.warn with a no-op
def warn(*args, **kwargs):
    pass
warnings.warn = warn

from cars import training_points, training_labels, testing_points, testing_labels
from sklearn.ensemble import RandomForestClassifier

# Forest of 2000 trees; random_state fixes the randomness for reproducibility
classifier = RandomForestClassifier(n_estimators=2000, random_state=0)
classifier.fit(training_points, training_labels)
# Mean accuracy on the test set
print(classifier.score(testing_points, testing_labels))

0.9884393063583815
