# Random forests

**Random forests** can be viewed as a variant of bagged trees that tries to decorrelate the base models further by randomizing not only the training data but also the input variables. At node $i$, the split feature $j_i$ is therefore optimized only over a random subset of the features, $S_i\subseteq\{1,\dotsc,D\}$:
$$(j_i,t_i)=\operatorname*{arg}
\min_{j\in S_i}\min_{t\in\mathcal{T}_j}\;%
\frac{\lvert\mathcal{D}_i^L(j,t)\rvert}{\lvert\mathcal{D}_i\rvert}\,c(\mathcal{D}_i^L(j,t))+%
\frac{\lvert\mathcal{D}_i^R(j,t)\rvert}{\lvert\mathcal{D}_i\rvert}\,c(\mathcal{D}_i^R(j,t))$$
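Here $\mathcal{D}_i^L(j,t)$ and $\mathcal{D}_i^R(j,t)$ are the left and right subsets obtained by splitting the data $\mathcal{D}_i$ at node $i$ on feature $j$ with threshold $t$, and $c$ is the node cost. In general, random forests tend to be more accurate than bagged trees, an effect usually attributed to many of the input features being irrelevant. In addition, the base learners can be trained in parallel, which is not possible with boosting, where each learner depends on the previous ones.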
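
The split rule above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the scikit-learn implementation: the helper names `gini` and `best_split` are hypothetical, the cost $c$ is assumed to be the Gini impurity, and the candidate thresholds $\mathcal{T}_j$ are assumed to be the midpoints between consecutive distinct values of feature $j$.

```python
import numpy as np

def gini(y):
    """Assumed node cost c(D): Gini impurity of an integer label vector."""
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, S):
    """Return (j, t, cost) minimizing the weighted child cost over features j in S."""
    n = len(y)
    best = (None, None, np.inf)
    for j in S:
        values = np.unique(X[:, j])
        # Assumed candidate thresholds T_j: midpoints between distinct sorted values.
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, j] <= t
            cost = (left.sum() / n) * gini(y[left]) \
                 + ((~left).sum() / n) * gini(y[~left])
            if cost < best[2]:
                best = (j, t, cost)
    return best

rng = np.random.default_rng(0)
# Toy data: feature 1 determines the label, features 0 and 2 are pure noise.
X = rng.normal(size=(200, 3))
y = (X[:, 1] > 0).astype(int)

D = X.shape[1]
S = rng.choice(D, size=2, replace=False)  # random feature subset S_i for this node
j, t, cost = best_split(X, y, S)
print(f'S={S}, chosen feature j={j}, threshold t={t:.3f}, cost={cost:.3f}')
```

Calling `best_split(X, y, range(D))` instead searches all features, which corresponds to the split rule of an ordinary bagged tree. With a random $S_i$, the informative feature is sometimes unavailable at a node, and that is what makes the individual trees differ from one another.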


**Example:** classifying emails as spam or not spam

In [1]:
import pandas as pd
from matplotlib import transforms, pyplot as plt
import numpy as np
from sklearn.metrics import accuracy_score

df = pd.read_csv("https://github.com/empathy87/The-Elements-of-Statistical-Learning-Python-Notebooks/blob/master/data/Spam.txt?raw=True")
df.head()

| Unnamed: 0 | test | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | spam |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 1 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |


In [2]:
target = 'spam'
columns = ['word_freq_make', 'word_freq_address', 'word_freq_all',
           'word_freq_3d', 'word_freq_our', 'word_freq_over',
           'word_freq_remove', 'word_freq_internet', 'word_freq_order',
           'word_freq_mail', 'word_freq_receive', 'word_freq_will',
           'word_freq_people', 'word_freq_report', 'word_freq_addresses',
           'word_freq_free', 'word_freq_business', 'word_freq_email',
           'word_freq_you', 'word_freq_credit', 'word_freq_your',
           'word_freq_font', 'word_freq_000', 'word_freq_money',
           'word_freq_hp', 'word_freq_hpl', 'word_freq_george',
           'word_freq_650', 'word_freq_lab', 'word_freq_labs',
           'word_freq_telnet', 'word_freq_857', 'word_freq_data',
           'word_freq_415', 'word_freq_85', 'word_freq_technology',
           'word_freq_1999', 'word_freq_parts', 'word_freq_pm',
           'word_freq_direct', 'word_freq_cs', 'word_freq_meeting',
           'word_freq_original', 'word_freq_project', 'word_freq_re',
           'word_freq_edu', 'word_freq_table', 'word_freq_conference',
           'char_freq_;', 'char_freq_(', 'char_freq_[', 'char_freq_!',
           'char_freq_$', 'char_freq_#', 'capital_run_length_average',
           'capital_run_length_longest', 'capital_run_length_total']
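X, y = df[columns].values, df[target].values
# The `test` column flags the predefined split: rows with 1 go to the test set, rows with 0 to training.
is_test = df.test.values
X_train, X_test = X[is_test == 0], X[is_test == 1]
y_train, y_test = y[is_test == 0], y[is_test == 1]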

In [3]:
from sklearn.ensemble import BaggingClassifier

ntrees_list = [10, 50, 100, 200, 300, 400, 500]
for ntrees in ntrees_list:
    # BaggingClassifier uses a decision tree as its default base estimator, so this bags trees.
    bag_clf = BaggingClassifier(n_estimators=ntrees, random_state=10, bootstrap=True).fit(X_train, y_train)
    y_test_hat = bag_clf.predict(X_test)
    bag_acc = accuracy_score(y_test, y_test_hat)
    print(f'Bagged {ntrees} trees, test err {1 - bag_acc:.1%}')

Bagged 10 trees, test err 5.9%
Bagged 50 trees, test err 5.5%
Bagged 100 trees, test err 5.4%
Bagged 200 trees, test err 5.5%
Bagged 300 trees, test err 5.5%
Bagged 400 trees, test err 5.4%
Bagged 500 trees, test err 5.6%


In [4]:
from sklearn.ensemble import RandomForestClassifier

for ntrees in ntrees_list:
    rf_clf = RandomForestClassifier(n_estimators=ntrees, random_state=10).fit(X_train, y_train)
    y_test_hat = rf_clf.predict(X_test)
    rf_acc = accuracy_score(y_test, y_test_hat)
    print(f'RF {ntrees} trees, test err {1 - rf_acc:.1%}')

RF 10 trees, test err 6.3%
RF 50 trees, test err 5.0%
RF 100 trees, test err 4.9%
RF 200 trees, test err 4.8%
RF 300 trees, test err 4.9%
RF 400 trees, test err 4.8%
RF 500 trees, test err 4.8%
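
In `RandomForestClassifier`, the size of the random feature subset $S_i$ is set by the `max_features` parameter. By default about $\sqrt{D}$ features are considered at each split, while `max_features=None` lets every split see all $D$ features, which makes the forest behave much like bagged trees. The sketch below compares the two settings. To stay self-contained it uses a synthetic dataset from `make_classification` as a stand-in for the spam data; the dataset sizes are arbitrary assumptions, so its error rates are not comparable to the ones printed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 57 features, of which only 8 are informative.
X_syn, y_syn = make_classification(n_samples=1000, n_features=57, n_informative=8,
                                   n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

errors = {}
for max_features in ['sqrt', None]:  # None: every split sees all features (bagging-like)
    clf = RandomForestClassifier(n_estimators=100, max_features=max_features,
                                 random_state=10).fit(X_tr, y_tr)
    errors[max_features] = 1 - clf.score(X_te, y_te)
    print(f'max_features={max_features}: test err {errors[max_features]:.1%}')
```

On the spam data, the same comparison could be run by replacing `X_tr, X_te, y_tr, y_te` with `X_train, X_test, y_train, y_test` from the cells above.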
