# Ensemble learning
**Definition**:
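Ensemble learning is a machine learning technique that combines several base models to produce a single, stronger predictive model.
- Final prediction: more robust and less prone to errors than that of any single model.
- Best results: obtained when the base models are skillful in different ways, so their errors are not strongly correlated.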

**Types**:
- **Voting**:
    - Train different models on the same dataset.
    - Let each model make its predictions. 
    - Meta-model: aggregates predictions of individual models by majority voting. 
- **Bagging**:
    - Base estimator: a single algorithm, e.g. Decision Tree, Logistic Regression, Neural Net, etc.
    - Each estimator is trained on a distinct bootstrap sample of the training set (drawn with replacement).
    - Estimators use all features for training and prediction.
- **Boosting**:
    - Estimators are trained sequentially, each one focusing on the errors of its predecessors.
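
The majority-voting step listed under **Voting** can be sketched in plain Python. This is a minimal illustration, not how `VotingClassifier` is implemented: the function name `majority_vote`, the three prediction lists, and the tie-breaking rule (first label encountered wins) are all assumptions made for the example.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Return the most common label per sample across models.

    predictions_per_model: list of equal-length label lists, one per model.
    Ties go to the label encountered first (an assumption of this sketch).
    """
    n_samples = len(predictions_per_model[0])
    voted = []
    for i in range(n_samples):
        # Collect the i-th prediction of every model
        votes = [preds[i] for preds in predictions_per_model]
        # most_common(1) returns [(label, count)] for the top label
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted


# Hypothetical predictions from three models on four samples
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)  # [1, 0, 1, 1]
```

No single model agrees with the ensemble on every sample, yet each sample's final label is backed by at least two of the three models.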

![title](https://drive.google.com/uc?export=view&id=1B4gM0ChrAk3C41BSjU4QasfSkFLzwAfk)

### Concepts:
**Out-of-bag evaluation**: Most random forests use a technique called out-of-bag (OOB) evaluation to assess the quality of the model. OOB evaluation treats the training examples that a tree did not see during training as a validation set for that tree, much like the held-out fold in cross-validation.
- With bootstrap sampling, each decision tree in a random forest is typically trained on ~63% of the distinct training examples. Therefore, each decision tree does not see the remaining ~37%.
- Each example is scored only by the trees that did not train on it, which estimates how well the model performs on unseen data without a separate cross-validation run.
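
The in-bag and out-of-bag shares follow from sampling with replacement: the chance that a given example is never picked in n draws is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The sketch below checks this empirically; the helper `bootstrap_indices`, the seed, and the training-set size are hypothetical values chosen for illustration.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n indices from range(n) uniformly, with replacement."""
    return [rng.randrange(n) for _ in range(n)]


rng = random.Random(0)  # fixed seed, chosen arbitrarily
n = 10_000              # hypothetical training-set size

in_bag = set(bootstrap_indices(n, rng))  # distinct examples the tree sees
oob = set(range(n)) - in_bag             # examples the tree never sees

in_bag_fraction = len(in_bag) / n
oob_fraction = len(oob) / n
expected_in_bag = 1 - (1 - 1 / n) ** n   # theoretical share, about 0.632

print(f"in-bag: {in_bag_fraction:.3f}, out-of-bag: {oob_fraction:.3f}")
print(f"expected in-bag: {expected_in_bag:.3f}")
```

Because every tree draws its own bootstrap sample, each training example ends up out-of-bag for roughly a third of the trees, and those trees can score it as unseen data.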


### Voting Classifier Example 

In [1]:
from sklearn.datasets import load_breast_cancer

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.ensemble import VotingClassifier

# Load data
data = load_breast_cancer()
# Keep only the first two features (mean radius, mean texture)
X = data.data[:, :2]
y = data.target

# Split the data
seed = 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=seed)

# Load classifiers
lr = LogisticRegression(random_state=seed)
dt = DecisionTreeClassifier(random_state=seed)
knn = KNeighborsClassifier()

# Define a list with classifiers
classifiers = [('Logistic Regression', lr),
               ('Classification Tree', dt),
               ('K Nearest Neighbors', knn)]

# Evaluate models
for clf_name, clf in classifiers:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    score = accuracy_score(y_test, y_pred)
    print(f"{clf_name}: {score}")

# Voting classifier
vc = VotingClassifier(estimators=classifiers)
vc.fit(X_train, y_train)
y_pred = vc.predict(X_test)
score = accuracy_score(y_test, y_pred)
print(f"Voting Classifier: {score}")

Logistic Regression: 0.8713450292397661
Classification Tree: 0.8128654970760234
K Nearest Neighbors: 0.8596491228070176
Voting Classifier: 0.8713450292397661


### Bagging Classifier Example

In [2]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Split the data (X and y are reused from the voting example above)
seed = 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=seed)

# Instantiate models
dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=0.16, random_state=seed)
# `estimator` was named `base_estimator` before scikit-learn 1.2
bc = BaggingClassifier(estimator=dt, n_estimators=300, n_jobs=-1)

# Train
dt.fit(X_train, y_train)
bc.fit(X_train, y_train)

# Predict
y_pred_dt = dt.predict(X_test)
y_pred = bc.predict(X_test)

# Evaluate
accuracy_dt = accuracy_score(y_test, y_pred_dt)
accuracy = accuracy_score(y_test, y_pred)
print(f"The accuracy of the Decision Tree Classifier is: {accuracy_dt}")
print(f"The accuracy of the Bagging Classifier is: {accuracy}")

The accuracy of the Decision Tree Classifier is: 0.8771929824561403
The accuracy of the Bagging Classifier is: 0.8830409356725146
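
The out-of-bag evaluation described under Concepts can be tried on the same data by passing `oob_score=True` to `BaggingClassifier`. This is a sketch under assumptions: the variable names `bc_oob` and `oob_accuracy`, the choice of 100 estimators, and the seed are illustrative, and no output is shown because it was not run as part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

seed = 1

# Same data and the same two features as the examples above
X, y = load_breast_cancer(return_X_y=True)
X = X[:, :2]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=seed
)

dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=0.16, random_state=seed)

# The base estimator is passed positionally because its keyword name
# changed from `base_estimator` to `estimator` in scikit-learn 1.2.
# oob_score=True requires bootstrap sampling, which is the default.
bc_oob = BaggingClassifier(dt, n_estimators=100, oob_score=True, random_state=seed)
bc_oob.fit(X_train, y_train)

# Accuracy estimated only from examples each tree did not train on
oob_accuracy = bc_oob.oob_score_
test_accuracy = bc_oob.score(X_test, y_test)

print(f"OOB accuracy: {oob_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

If the two numbers are close, the OOB score is acting as a reasonable stand-in for a held-out test set, which is useful when the dataset is too small to spare one.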
