# 1. Build your own Random Forest

![it's easy!!!](https://media.makeameme.org/created/its-easy-just-5bd65c.jpg)


Have you ever thought about planting, I mean, implementing your own Random Forest? Yea? No? Well... let's implement one here!


Before starting, I would like to present some preliminary concepts:

## Please upvote me if you like, ok? (this is really  really important to me)

## 1.1 Preliminary concepts

* **Every model** (here, a model is a machine learning method that has already been trained on data) has some kind of **weakness**, whether from overfitting, underfitting, the complexity or dimensionality of the problem, data quality, noise, ...
* There is a ceiling on what a single model can achieve (by whatever metric we are using, whether for regression, classification, time series...). For example, there is a limit to what a linear regression can represent, and the same goes for a decision tree...
* There is still another component, which is the **irreducible error**: the one that we cannot overcome and that is inherent to the problem, regardless of the model...
* So... what is the idea of the **ensemble**? Combine **several weak models** to **create a strong model** (the ensemble) that performs better than any of them alone.
* For those who enjoy competitions on Kaggle, for example, the best ranked results are produced by this type of approach...

## 1.2 Ensemble

* The word ensemble comes from French and means "together"... (cool, huh?)
* ... and the ensemble approach depends on the type of problem we are trying to address...
* If we think about the bias and variance problem, we have 4 combinations
    * What we want: a model that has a **good balance** between bias (low) and variance (low)
    * If we have a weak model with high variance and low bias, then this model has learned too much and does not generalize (overfitting), so it is recommended to use ensembles that tend to reduce variance (**bagging**, for example)
    * In the opposite case, where we have low variance but high bias (underfitting), it is recommended to use an ensemble that tends to decrease the bias (**stacking or boosting**, for example)
    * If the model has high bias and high variance, then it is incoherent/inconsistent...
* Basically there are 3 forms of ensemble: Bagging, Boosting and Stacking

### 1.2.1 Bagging (bootstrap aggregating)

* Helps to create a model that is more robust than models alone
* Each isolated model sees only part of the data (hence the name bootstrap: each model is trained on a sample drawn from the original dataset, usually with replacement)
* As each isolated model sees only part of the data, it tends to overfit less
* The idea is to generate an ensemble model whose result is the "average" of the outputs of the isolated models: for classification, this can be a majority vote (hard voting) or the average of the class probabilities predicted by each isolated model (soft voting)
* Just remember from statistics that if we have $n$ i.i.d. (independent and identically distributed) variables and we calculate their mean, we get an estimator that preserves the expected value and has a lower variance (the variance of a single variable divided by $n$)
* We have a very good advantage here, which is the issue of **parallelism** (the training of each isolated model)
* **Random Forest** is a successful example of bagging: you sample your database for some decision trees, and then average the outputs of each tree, so you have simpler trees (with less overfitting), reducing variance and overall more robust performance
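
The statistical argument in the list above (averaging i.i.d. variables keeps the expected value and shrinks the variance) can be checked numerically. This is only a minimal sketch: the normal distribution, `n_models = 25` and the other constants are assumptions chosen for illustration, and real trees in a forest are correlated, so the reduction seen in practice is smaller than in this idealized case.

```python
import numpy as np

rng = np.random.default_rng(123)

n_models = 25      # hypothetical number of models being averaged
n_trials = 20_000  # how many times we repeat the experiment
sigma = 2.0        # standard deviation of a single model's output

# Each row is one experiment: n_models i.i.d. outputs with mean 5 and variance sigma**2
outputs = rng.normal(loc=5.0, scale=sigma, size=(n_trials, n_models))

var_single = outputs[:, 0].var()        # variance of one model alone
var_mean = outputs.mean(axis=1).var()   # variance of the averaged output
mean_of_means = outputs.mean(axis=1).mean()

print(f"variance of a single model : {var_single:.3f}")
print(f"variance of the average    : {var_mean:.3f}")
print(f"theoretical sigma^2 / n    : {sigma**2 / n_models:.3f}")
print(f"mean of the average        : {mean_of_means:.3f}")
```

The average keeps the expected value (about 5) while its variance drops to roughly `sigma**2 / n_models`.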

### 1.2.2 Boosting

* Boosting is a **sequential** improvement technique, which aims to **reduce bias**
* The idea is to sequentially adjust several weak models (iterative way), so that a model in a given step depends on the models of the previous steps
* Each model in the sequence is fitted by giving more importance to observations in the data set that were **difficult to fit** by previous models in the sequence. Intuitively, each new model focuses its efforts on the most difficult-to-fit observations so far
* We have a set of weights, where each weight is associated with an estimator
* Here we have an iterative optimization process, **which is not easy to solve**, where we want to find the best set of weights
* The most famous boosting methods are either **gradient** based (XGBoost, LightGBM) or adaptive (AdaBoost); at each iteration, the worst estimators receive a smaller weight so that they don't mess up the final result.
* The **disadvantage** is that boosting is sequential by nature: each model depends on the previous ones, so the models cannot be trained in parallel (which can be computationally expensive)
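
To make the reweighting idea above concrete, here is a minimal sketch of discrete AdaBoost with decision stumps. The toy dataset (a positive class that forms an interval on a line) and the helper names `adaboost_fit` and `adaboost_predict` are assumptions made for illustration; this is not the implementation used by any library.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy 1-D problem: the positive class is an interval, so no single
# threshold (a depth-1 tree, or "stump") can separate it.
idx = np.arange(200)
X = (idx / 200).reshape(-1, 1)
y = np.where((idx >= 60) & (idx < 140), 1, -1)  # labels in {-1, +1}


def adaboost_fit(X, y, n_rounds=10):
    """Fit stumps sequentially, reweighting the observations after each round."""
    n = len(y)
    w = np.full(n, 1.0 / n)  # start with uniform observation weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        # weighted error of this stump (clipped to avoid division by zero)
        err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
        # estimator weight: the lower the error, the louder its vote
        alpha = 0.5 * np.log((1 - err) / err)
        # misclassified observations gain weight, the others lose weight
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas


def adaboost_predict(stumps, alphas, X):
    """Weighted vote of all stumps."""
    score = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.where(score >= 0, 1, -1)


stumps, alphas = adaboost_fit(X, y, n_rounds=10)
stump_acc = np.mean(stumps[0].predict(X) == y)
ensemble_acc = np.mean(adaboost_predict(stumps, alphas, X) == y)
print(f"first stump alone : {stump_acc:.2f}")
print(f"boosted ensemble  : {ensemble_acc:.2f}")
```

Note how the loop cannot be parallelized: the weights `w` used in one round depend on the predictions of the previous round.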

### 1.2.3 Stacking

* So far we have seen ensembles that are homogeneous (all the estimators are of the **same nature**)
* But there is a way to join heterogeneous weak models, which is the case of Stacking, which is nothing more than layers of weak models
* We can build a classifier that is, for example, formed by the first layer where we have (SVM, KNN, Decision Tree, Bayesian)
* And in the second layer (meta-model) we can have a logistic regression
* The second layer will combine the output of each of the models from the first layer, and generate a new result
* And that's why it is called stacking (a stack of miscellaneous estimators)
* It is possible to create stacking of different levels, 3, 4, 5, ...
* The problem is that with each layer, it gets more **computationally expensive**

### In conclusion...

* An ensemble is a good way to help solve **bias** and **variance** problems
* Each technique attacks a type of problem
* And always be careful with the computational cost
* Chapters 15 and 16 of The Elements of Statistical Learning (Hastie) deal with this subject in great depth

# 2. Now let's build our own Random Forest

![Random, Forest, Radom!!!](https://memegenerator.net/img/instances/55591858.jpg)


## 2.1 What will we need?

1. Data (of course)
2. Some decision trees 🌳🌳🌳🌳🌳🌳🌳
3. Water 💧💧💧💧💧💧 (for the trees)
4. Some python code 🐍



**Hey ho, let's go! Hey ho, let's go!**





In [None]:
# general imports
import numpy as np
import pandas as pd
from tqdm import tqdm
import os
import seaborn as sns
import matplotlib.pyplot as plt


# sklearn
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.base import BaseEstimator
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import VotingClassifier

from mlxtend.feature_selection import ColumnSelector

from yellowbrick.model_selection import ValidationCurve

In [None]:
# SOME CONSTANTS
SEED = 123

# so that you can reproduce the notebook
np.random.seed(SEED)

**Importing the data**

In [None]:
df = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
print(df.shape)
df.head()

**Let's do a little exploratory data analysis + data engineering**

In [None]:
count = 1
for c in df.columns:
    print(f'{count} - {c}')
    print(f'- # of unique elements: {df[c].nunique()}')
    print(f'- Sample: {df[c].unique()[0:20]}')
    print(f'- Dtype: {df[c].dtype}')
    print(f'- # of missing values: {df[c].isnull().sum()} of {df.shape[0]}')
    print(f'- % of missing values: {np.round(100 * df[c].isnull().sum() / df.shape[0], 1)}')
    
    
    if df[c].dtype == int or df[c].dtype == float:
        s = "- Statistics:\n"

        me = np.round(df[c].mean(), 2)
        st = np.round(df[c].std(), 2)
        s += f"-- Mean (std): {me} ({st})\n"

        q1 = np.round(df[c].quantile(0.25), 2)
        q2 = np.round(df[c].quantile(0.5), 2)
        q3 = np.round(df[c].quantile(0.75), 2)
        s += f"-- Quantiles: q1={q1}, q2={q2}, q3={q3}\n"
        s += f"-- Min {df[c].min()}\n"
        s += f"-- Max {df[c].max()}"    
        print(s)
        
    print('='*30)
    count += 1


What do we have?

* Numerical variables:
    * Age
    * RoomService
    * FoodCourt
    * ShoppingMall
    * Spa
    * VRDeck
* Categorical variables:
    * HomePlanet
    * Destination
* Binary variables:
    * CryoSleep
    * VIP
* Columns to drop:
    * PassengerId (Not interesting for the model)
    * Cabin (I don't know if it's relevant to the predictive model)
    * Name (I don't know if it's relevant to the predictive model)
* Outcome:
    * Transported
    


In [None]:
# drop some columns
df.drop(columns=['PassengerId', 'Cabin', 'Name'], inplace=True)
print(df.shape)
df.head()

Let's create a (very very very simple) pipeline for the predictor variables, which performs the following tasks:

* Numeric features:
    * Imputer: KNNImputer (k=5)
    * Scaler: StandardScaler
* Categorical features:
    * Imputer: Most frequent
    * Encoder: One Hot Encoder
* Binary features:
    * Imputer: Most frequent
    * Encoder: Ordinal Encoder

This pipeline will be used later, at the time of the experiments, ok?

In [None]:
# numerical features
numeric_features = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
numeric_transformer = Pipeline(
    steps=[("imputer", KNNImputer(n_neighbors=5)), 
           ("scaler", StandardScaler())]
)

# Alternative kept for reference (not used below):
# SimpleImputer(strategy="median") followed by StandardScaler()

# categorical features
categorical_features = ["HomePlanet", "Destination"]
categorical_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="most_frequent")), 
           ("ohe", OneHotEncoder(handle_unknown="ignore"))])
    
# binary features
binary_features = ["CryoSleep", "VIP"]
binary_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="most_frequent")), 
           ("ordinal", OrdinalEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
        ("bin", binary_transformer, binary_features),
    ]
)

Now let's separate the predictor variables ($X$) and the outcome ($y$)...

In [None]:
X = df.drop(columns=['Transported'])
y = df['Transported']

# As the outcome is with boolean type, I must change it to 0 and 1
y = LabelEncoder().fit_transform(y)

Now let's separate training and test sets, being 70% and 30% respectively. These sets will be used by the following experiments...

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED, stratify=y)
print(f'X_train shape {X_train.shape}')
print(f'y_train shape {y_train.shape}')
print('-'*20)
print(f'X_test shape {X_test.shape}')
print(f'y_test shape {y_test.shape}')

# 3. Weak Model - Decision Tree


Now we are going to train several decision trees (from sklearn itself) with different depths (2, 3, 4, ..., 18, 19, 20) so that we can observe **how the accuracy** is affected by this parameter (all other decision tree parameters will be kept at their default settings).


In [None]:
acc_dt = {
  'max_depth':[],
  'acc':[],
  'train_test':[],
}

max_depth = np.arange(2, 21, 1)
for md in tqdm(max_depth):
    pipe_dt = Pipeline(
        [('preprocessor', preprocessor), 
         ('estimator', DecisionTreeClassifier(random_state=SEED, max_depth=md))])    
    
    pipe_dt.fit(X_train, y_train)
    # first predict on X_train set
    y_pred = pipe_dt.predict(X_train)
    # store the result
    acc_dt['max_depth'].append(md)
    acc_dt['acc'].append(accuracy_score(y_train, y_pred))
    acc_dt['train_test'].append('train_dt')
    

    # now predict on X_test set
    y_pred = pipe_dt.predict(X_test)
    # store the result
    acc_dt['max_depth'].append(md)
    acc_dt['acc'].append(accuracy_score(y_test, y_pred))
    acc_dt['train_test'].append('test_dt')


In [None]:
result_dt = pd.DataFrame(acc_dt)
fig = plt.figure(figsize=(12, 8))
sns.lineplot(
    data=result_dt,
    x="max_depth", 
    y="acc", 
    hue="train_test"
)
plt.show()

In [None]:
result_dt[result_dt['train_test'] == 'test_dt'][['acc']].describe()

**What can we see in the graph above?**

Note in the graph that the training and testing accuracy curves grow "together" until, at a certain depth (around 5 or 6), they start to diverge... and from there they get more and more separated: training accuracy keeps growing, while test accuracy drops and then stagnates. This phenomenon is known as **overfitting** and is characterized by the loss of generalization of the predictive model. In other words: the model learns so much about the training set that it becomes unable to do well on the test set.

Another phenomenon that we can observe here is the ceiling effect (which I mentioned at the beginning of the text). The "increase in the complexity" of the tree does not necessarily translate into a performance increase on the test set, so it's better that we have simpler trees... you know? **simpler trees**....

Good, now what? what do we do?

How about we build our own random forest and see what happens...


# 4. Ensemble Model: My Own Random Forest

**In essence: what is a random forest?**

In a very very very simplified way, a random forest can be thought of as a set of simple trees, that is, shallower trees that know only part of the data.

Speaking of shallower trees, how about listening to a beautiful song?  

**Lady Gaga, Bradley Cooper - Shallow (from A Star Is Born) (Official Music Video)**


[![Lady Gaga, Bradley Cooper - Shallow (from A Star Is Born) (Official Music Video)](http://img.youtube.com/vi/bo_efYhYU2A/0.jpg)](https://www.youtube.com/watch?v=bo_efYhYU2A)


**Curiosities**: here in Brazil we have this version of the song...

**Paula Fernandes - Juntos (Ao Vivo Em Sete Lagoas, Brazil / 2019 / Origens)**


[![Paula Fernandes - Juntos (Ao Vivo Em Sete Lagoas, Brazil / 2019 / Origens)](http://img.youtube.com/vi/xd8xxl222Ls/0.jpg)](https://www.youtube.com/watch?v=xd8xxl222Ls)



ok... let's move on...


And how do we make each tree know **only part** of the data?


In the same way that we can build a pipeline to transform the data, we can build a pipeline that, after transforming the columns, **randomly selects** some of them to present to the trees.


Then we will build a committee (**VotingClassifier**) so that each tree can vote at the time of prediction what the outcome is. **The majority vote wins.**
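
To see what "the majority vote wins" means in practice, here is a tiny sketch that compares hard voting (what `VotingClassifier` does by default) with soft voting. The probabilities below are made-up numbers for three hypothetical trees, not outputs of the models in this notebook.

```python
import numpy as np

# Rows = observations, columns = trees; each value is P(class = 1)
# predicted by one (hypothetical) tree.
proba = np.array([
    [0.9, 0.4, 0.4],
    [0.2, 0.6, 0.8],
    [0.6, 0.7, 0.1],
])

# Hard voting: each tree casts a 0/1 vote and the majority wins
votes = (proba >= 0.5).astype(int)
hard = (votes.sum(axis=1) > proba.shape[1] / 2).astype(int)

# Soft voting: average the probabilities first, then threshold
soft = (proba.mean(axis=1) >= 0.5).astype(int)

print("hard voting:", hard.tolist())
print("soft voting:", soft.tolist())
```

The two rules disagree on the first and third observations: one very confident tree can flip a soft vote, while a hard vote only counts heads.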

## 4.1 Now it's hands-on!!!

The code below is a simplification of what a Random Forest actually is, but the main ideas are contained here: a majority vote among trees that each know only part of the dataset.

Each tree will only know 70% of the columns (I chose 70% arbitrarily, it could be any amount). But how many columns do we have after transforming the data? Let's check!!!

In [None]:
preprocessor.fit_transform(X_train).shape

There are $14$ columns (since some of them undergo the One Hot Encoding transformation). Now let's calculate: 70% of 14 columns is 9.8... let's round up to 10. That is, of the 14 possible columns, each tree will only know 10.

Don't think this is not enough... with this configuration we would be able to create a forest of up to 1001 different combinations (just think that it is a combinatorial analysis of 14, grouped 10 in 10, without repetition in which the order does not matter).
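
The two numbers above (10 columns per tree, 1001 possible subsets) are easy to double-check with the standard library. The variable names are mine, chosen only for this check.

```python
import math

n_columns = 14   # columns after preprocessing
fraction = 0.7   # share of columns each tree gets to see

# 0.7 * 14 = 9.8, rounded up to 10
n_selected = math.ceil(fraction * n_columns)

# combinations of 14 columns taken 10 at a time (no repetition, order ignored)
n_subsets = math.comb(n_columns, n_selected)

print(n_selected, n_subsets)
```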

Let's use the way sklearn works (with the fit and predict methods). As our forest will be very simple, we will only have 2 parameters: number of estimators and maximum depth of each tree. All other parameters will be kept at their default settings. 

In [None]:
class MyOwnRandomForest(BaseEstimator):
    # Class responsible for simulating a RandomForest.

    def __init__(self, n_estimators, max_depth):
        """
        Here we will define the pipeline for each tree.
        
        :params:
        n_estimators: The number of trees in our own forest.
        
        max_depth: The maximum depth of each tree. 
        """
               
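        # keep the constructor arguments as attributes (sklearn's BaseEstimator convention,
        # needed for get_params() and for printing the estimator)
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.estimators_list = []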
        # for each estimator we will create a pipeline that selects the columns and then do its work
        for i in range(n_estimators):
            # the numpy package itself has a function that randomly selects numbers from an array
            # here of the 14, let's randomly select 10, (without repetition)
            columns = tuple(np.random.choice(14, 10, replace=False))
            
            # building the pipeline
            pipe_dt = make_pipeline(ColumnSelector(cols=columns),
                                     DecisionTreeClassifier(random_state=SEED, max_depth=max_depth))
            self.estimators_list.append((f'pipe_dt_{i}', pipe_dt))
            
        # Here we use the "voting classifier" itself    
        self.voting = VotingClassifier(estimators=self.estimators_list)
        
    
    def fit(self, X, y):
        # fits all the trees and returns self, as sklearn estimators do
        self.voting.fit(X, y)
        return self

    def predict(self, X):
        # predicts all
        return self.voting.predict(X)
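

The class above only samples **columns**. Bagging, as described in section 1.2.1, also resamples the **rows** (the bootstrap), which this simplified forest leaves out. A minimal sketch of what one bootstrap draw looks like, assuming a hypothetical dataset of `n_rows = 10_000` observations:

```python
import numpy as np

rng = np.random.default_rng(123)
n_rows = 10_000  # hypothetical number of training observations

# One bootstrap sample: draw n_rows row indices WITH replacement
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

# Some rows are drawn several times, others are never drawn
unique_fraction = np.unique(bootstrap_idx).size / n_rows
oob_fraction = 1 - unique_fraction  # "out-of-bag" rows this tree never sees

print(f"rows seen by this tree : {unique_fraction:.3f}")
print(f"rows left out          : {oob_fraction:.3f}")
```

For large datasets each tree ends up seeing roughly 63% of the distinct rows (1 - 1/e), which is another way of making every tree know only part of the data.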
    

**Does it work? Let's try!!!**

In our first test we are going to create a forest with only 10 trees, and with the same settings as in the previous experiment. Let's see what happens...

In [None]:
acc_rn = {
  'max_depth':[],
  'acc':[],
  'train_test':[],
}

max_depth = np.arange(2, 21, 1)
for md in tqdm(max_depth):
    pipe_dt = Pipeline(
        [('preprocessor', preprocessor), 
         ('estimator', MyOwnRandomForest(max_depth=md, n_estimators=10))])    
    
    pipe_dt.fit(X_train, y_train)
    # first predict on X_train set
    y_pred = pipe_dt.predict(X_train)
    # store the result
    acc_rn['max_depth'].append(md)
    acc_rn['acc'].append(accuracy_score(y_train, y_pred))
    acc_rn['train_test'].append('train_rn')
    

    # now predict on X_test set
    y_pred = pipe_dt.predict(X_test)
    # store the result
    acc_rn['max_depth'].append(md)
    acc_rn['acc'].append(accuracy_score(y_test, y_pred))
    acc_rn['train_test'].append('test_rn')

In [None]:
result_rn = pd.DataFrame(acc_rn)
fig = plt.figure(figsize=(12, 8))
sns.lineplot(
    data=result_rn,
    x="max_depth", 
    y="acc", 
    hue="train_test", 
    dashes=True
)
plt.show()

In [None]:
result_rn[result_rn['train_test'] == 'test_rn'][['acc']].describe()

**What do we perceive here?**

Apparently our Random Forest took a little longer than the decision tree to go into overfitting (and has a more irregular behavior too). Is it possible to see this graphically?

In [None]:
result = pd.concat([result_dt, result_rn], axis=0).reset_index().drop(columns=['index'])
fig = plt.figure(figsize=(12, 8))
sns.lineplot(
    data=result[result['train_test'].isin(['test_dt', 'test_rn'])],
    x="max_depth", 
    y="acc",
    hue='train_test',
    dashes=True
)
plt.show()

**What if** we simultaneously change the number of estimators and the maximum depth? What will be the result of this?

In [None]:
max_depth = np.arange(2, 21, 1)
n_estimators = np.arange(2, 41, 2)

results = {
    'n_estimators':[],
    'max_depth':[],
    'acc':[]
}

for est in tqdm(n_estimators):
    for md in max_depth:
        pipe_dt = Pipeline(
            [('preprocessor', preprocessor), 
             ('estimator', MyOwnRandomForest(max_depth=md, n_estimators=est))])    

        pipe_dt.fit(X_train, y_train)
        # now predict on X_test set
        y_pred = pipe_dt.predict(X_test)
        # store the result
        results['n_estimators'].append(est)
        results['max_depth'].append(md)
        results['acc'].append(accuracy_score(y_test, y_pred))
    


In [None]:
results = pd.DataFrame(results)
# keyword arguments are required by recent pandas versions
results_pivot = results.pivot(index="n_estimators", columns="max_depth", values="acc")
fig = plt.figure(figsize=(20, 8))
sns.heatmap(results_pivot, annot=True, fmt=".3f")
plt.show()

Which is the best result?

In [None]:
results.sort_values(['acc'], ascending=False).head(1)

Which is the worst result?

In [None]:
results.sort_values(['acc'], ascending=False).tail(1)

# 5. Some conclusions

1. A simple way to build a Random Forest solution was presented...
1. Obviously the purpose here is not to build something competitive or ready for production; my purpose is always to present complex ideas in a simple way...
1. Empirically, it was possible to observe that an ensemble model like this one can delay the onset of overfitting compared with a single decision tree...
1. There are certainly a lot of improvements to be made in the code above, like parallelism and other things... if you have any ideas, leave them in the comments, ok?
1. If you want to know the origin of Random Forest, read the original article at https://link.springer.com/article/10.1023/A:1010933404324 or https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
1. And finally...

![](https://memecreator.org/static/images/memes/5153568.jpg)



Thank you very much for your attention, and feel free to make suggestions!!! Bye!!!

![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQeTlBWPzFucVr0vMMgbWtuF1iX_Ja16zMiuUzpx41OHktuj_PeeGQht8qiof2LWZZfv4g&usqp=CAU)