# Ensemble Learning
Combining the predictions of many classifiers (even if each one is only a weak classifier) often produces a result that is better than the strongest classifier among them.

## Ensemble Algorithms:
### Voting:
- Hard Voting:
    - Predict the **most frequent** class among the classifiers' votes, **without** considering probabilities
- Soft Voting (often better):
    - Average the **predicted class probabilities** of all classifiers and predict the class with the highest average, which gives more weight to **highly confident votes**
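
To make the difference concrete, here is a minimal sketch in plain Python (no scikit-learn) of the two voting rules applied to a single sample. The probability values are made up for illustration, and the variable names are hypothetical, not part of any library.

```python
from collections import Counter

# Hypothetical class-probability outputs of three classifiers for ONE sample,
# written as [P(class 0), P(class 1)]. The numbers are invented for illustration.
probas = [
    [0.55, 0.45],  # classifier A: weakly prefers class 0
    [0.60, 0.40],  # classifier B: weakly prefers class 0
    [0.05, 0.95],  # classifier C: strongly prefers class 1
]

# Hard voting: each classifier casts one vote for its most likely class.
votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
hard_winner = Counter(votes).most_common(1)[0][0]

# Soft voting: average the probabilities per class, then take the argmax.
n_classes = len(probas[0])
mean_proba = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
soft_winner = max(range(n_classes), key=mean_proba.__getitem__)

print("hard vote:", hard_winner)
print("soft vote:", soft_winner)
```

With these numbers the two rules disagree: two weak votes for class 0 win the hard vote, while the single confident vote for class 1 wins the soft vote.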


In [19]:
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [120]:
random_state = 25
decision_tree = DecisionTreeClassifier(max_depth=3, random_state=random_state)
svc = SVC(random_state=random_state)
logestic_reg = LogisticRegression(random_state=random_state,max_iter=2000)

# combine all the classifiers into a single hard-voting ensemble
voting_clf = VotingClassifier(
    estimators=[
        ("svc", svc),
        ("log_reg", logestic_reg),
        ("decision_tree", decision_tree)
    ],
    
    voting="hard",
)

In [121]:
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()
# dataset['feature_names'], dataset['target_names']
X, y = dataset['data'], dataset['target']

In [122]:
X.shape, y.shape

((569, 30), (569,))

In [123]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((426, 30), (143, 30), (426,), (143,))

In [125]:
from sklearn.metrics import accuracy_score
# train and test each individual classifier, then the voting classifier
def test_voting():
    for clf in (decision_tree, svc, logestic_reg, voting_clf):
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_test)
        # accuracy_score expects (y_true, y_pred)
        score = accuracy_score(y_test, predictions)
        print(clf.__class__.__name__, ": ", score)

test_voting()
# On this split the voting classifier beats the best individual classifier!

DecisionTreeClassifier :  0.965034965034965
SVC :  0.9300699300699301
LogisticRegression :  0.958041958041958
VotingClassifier :  0.9790209790209791


In [129]:
# set probability=True so that SVC estimates class probabilities (this slows down training)
svc = SVC(random_state=random_state, probability=True)

# every estimator in the ensemble must have a predict_proba() method for soft voting
voting_clf = VotingClassifier(
    estimators=[
        ("svc", svc),
        ("log_reg", logestic_reg),
        ("decision_tree", decision_tree)
    ],
    
    voting="soft",
)

In [128]:
test_voting()

DecisionTreeClassifier :  0.965034965034965
SVC :  0.9300699300699301
LogisticRegression :  0.958041958041958
VotingClassifier :  0.965034965034965


# Bagging and Pasting
Instead of using many different classifiers, we train the same classifier on different random samples of the training data.

- Bagging: training data is sampled **with replacement** (the same instance can appear several times in one predictor's sample)
- Pasting: training data is sampled **without** replacement

Both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.
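
The two sampling schemes can be sketched with Python's standard `random` module. This is only an illustration of the idea, not how `BaggingClassifier` is implemented internally; the 10-instance "training set" and the sample size are assumptions chosen to keep the output short.

```python
import random

rng = random.Random(42)          # fixed seed so the sketch is repeatable
train_indices = list(range(10))  # stand-in for a 10-instance training set
sample_size = 6                  # plays the role of max_samples

# Bagging: draw WITH replacement, so an index may repeat within one sample.
bagging_sample = rng.choices(train_indices, k=sample_size)

# Pasting: draw WITHOUT replacement, so every index in one sample is distinct.
pasting_sample = rng.sample(train_indices, k=sample_size)

print("bagging:", sorted(bagging_sample))
print("pasting:", sorted(pasting_sample))
```

Drawing a fresh sample like this for every predictor is what lets the same instance show up in several predictors under both schemes.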

![bagging](images/ensemble.png)


- The aggregation function is typically the statistical mode (i.e., the **most frequent** prediction, just like a hard voting classifier) **for classification**, or the **average** for **regression**
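
As a small illustration of the two aggregation functions, the sketch below combines made-up predictions from a handful of predictors using only the standard library. The prediction lists are hypothetical.

```python
from statistics import mean, mode

# Hypothetical predictions from five classifiers for one sample.
class_predictions = [1, 0, 1, 1, 0]

# Hypothetical predictions from four regressors for one sample.
regression_predictions = [2.0, 2.5, 3.0, 2.5]

# Classification: the ensemble predicts the most frequent class.
ensemble_class = mode(class_predictions)

# Regression: the ensemble predicts the average value.
ensemble_value = mean(regression_predictions)

print("ensemble class:", ensemble_class)
print("ensemble value:", ensemble_value)
```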


In [131]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# max_samples=100: each tree is trained on 100 training instances randomly sampled from the training set with replacement
# (to use pasting instead, just set bootstrap=False)
# n_jobs: the number of CPU cores to use for training and predictions (-1 means all available cores)
# BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier
# can estimate class probabilities (i.e., if it has a predict_proba() method)
# note: scikit-learn 1.2 renamed base_estimator to estimator; use estimator= on recent versions
bagging = BaggingClassifier(base_estimator=DecisionTreeClassifier(), 
                            n_estimators=500, 
                            max_samples=100, 
                            bootstrap=True,
                            n_jobs=-1,
                           )


In [132]:
bagging.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), max_samples=100,
                  n_estimators=500, n_jobs=-1)

In [134]:
predictions = bagging.predict(X_test)
print(accuracy_score(y_test, predictions))

0.951048951048951


In [142]:
# a larger max_samples gives each tree more training data; on this split it improved test accuracy
bagging = BaggingClassifier(base_estimator=DecisionTreeClassifier(), 
                            n_estimators=500, max_samples=200, bootstrap=True, n_jobs=-1)
bagging.fit(X_train, y_train)
predictions = bagging.predict(X_test)
print(accuracy_score(y_test, predictions))

0.965034965034965


# Random Patches and Random Subspaces
The BaggingClassifier class supports sampling the features as well. This is controlled by two hyperparameters: **max_features and bootstrap_features**. They work the same way as **max_samples and bootstrap**, but for **feature sampling** instead of **instance sampling**. Thus, each predictor will be trained on a **random subset of the input features**.

- Random Patches: sample both the training instances and the features
- Random Subspaces: keep all training instances (bootstrap=False and max_samples=1.0) but sample the features

This is particularly useful when dealing with high-dimensional data (such as images).
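
The sketch below shows one way the two schemes might be configured, assuming the usual definitions (Random Patches samples instances and features; Random Subspaces keeps every instance and samples only features). It relies on the default base estimator of `BaggingClassifier` (a decision tree), and the fractions and `n_estimators=20` are arbitrary choices to keep it fast.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier

X, y = load_breast_cancer(return_X_y=True)

# Random Patches: sample both instances and features for every predictor.
random_patches = BaggingClassifier(
    n_estimators=20, max_samples=0.5, bootstrap=True,
    max_features=0.5, bootstrap_features=True, random_state=0)

# Random Subspaces: keep every instance, sample only the features.
random_subspaces = BaggingClassifier(
    n_estimators=20, max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=True, random_state=0)

for name, model in [("patches", random_patches), ("subspaces", random_subspaces)]:
    model.fit(X, y)
    # Inspect what the first predictor of each ensemble was trained on.
    n_rows = len(set(model.estimators_samples_[0]))
    n_feats = len(model.estimators_features_[0])
    print(name, "-> distinct rows:", n_rows, "| sampled features:", n_feats)
```

The `estimators_samples_` and `estimators_features_` attributes make it easy to check that the subspaces model saw every row while the patches model saw only a subset.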


In [143]:
bagging = BaggingClassifier(base_estimator=DecisionTreeClassifier(), 
                            n_estimators=500, max_samples=200, bootstrap=True, n_jobs=-1,
                            max_features=10, bootstrap_features=True)
bagging.fit(X_train, y_train)
predictions = bagging.predict(X_test)
print(accuracy_score(y_test, predictions))
# better !

0.972027972027972


# Random Forests
A Random Forest is an ensemble of decision trees, generally trained with bagging and typically with max_samples set to the size of the training set.

- A RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown), plus the hyperparameters of a BaggingClassifier that control the ensemble itself (with a few exceptions).

- Instead of searching for the very best feature when splitting a node, the Random Forest algorithm searches for the best feature among a random subset of features. This extra randomness makes the trees more diverse, trading a higher bias for a lower variance, which generally yields a better model.
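
To illustrate the relationship, the sketch below builds a bagging ensemble of decision trees that is roughly (not exactly) comparable to a `RandomForestClassifier`. The split seed, `n_estimators`, and `max_leaf_nodes` values are assumptions chosen for the example, and the base tree is passed positionally so the code does not depend on the parameter's name in a given scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A random forest: bagged trees that consider a random feature subset at each split.
forest = RandomForestClassifier(
    n_estimators=100, max_leaf_nodes=16, random_state=0)

# Roughly the same idea spelled out with BaggingClassifier:
# bootstrap samples as large as the training set, trees restricted to
# sqrt(n_features) candidate features per split.
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=100, max_samples=1.0, bootstrap=True, random_state=0)

forest_acc = forest.fit(X_train, y_train).score(X_test, y_test)
bagged_acc = bagged_trees.fit(X_train, y_train).score(X_test, y_test)
print("random forest:", forest_acc)
print("bagged trees :", bagged_acc)
```

The two models will not make identical predictions, since their random number streams differ, but their accuracy should usually land in the same range.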

In [144]:
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=100,
                                       n_jobs=-1,
                                       max_leaf_nodes=16)

In [145]:
random_forest.fit(X_train, y_train)

RandomForestClassifier(max_leaf_nodes=16, n_jobs=-1)

In [146]:
predictions = random_forest.predict(X_test)
accuracy_score(y_test, predictions) # matches the best result so far (the hard-voting ensemble)

0.9790209790209791

# Feature Importance
- Random Forests can be used to measure the relative importance of each feature in the data

- Scikit-Learn measures a feature’s importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). The scores are scaled so that the importances of all features sum to 1
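
To show what "reducing impurity" means for a single node, here is a simplified sketch in plain Python that computes the Gini impurity decrease of one split. It is an illustration of the idea only: the helper names are hypothetical, and the real computation in Scikit-Learn additionally weights each node by the share of samples reaching it and normalizes over all features and trees.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

parent = [0, 0, 0, 1, 1, 1]

# A split that separates the classes perfectly removes all impurity.
perfect = impurity_decrease(parent, [0, 0, 0], [1, 1, 1])

# A split that barely separates the classes removes very little.
poor = impurity_decrease(parent, [0, 0, 1], [0, 1, 1])

print("perfect split:", perfect)
print("poor split   :", poor)
```

A feature that is repeatedly chosen for splits like the first one accumulates a large importance, while a feature used only for splits like the second one contributes little.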

In [147]:
# get the importance of each feature in the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
rnd_forest = RandomForestClassifier(n_estimators=500, n_jobs=-1)

In [148]:
rnd_forest.fit(iris["data"], iris["target"])

RandomForestClassifier(n_estimators=500, n_jobs=-1)

In [151]:
iris["feature_names"], rnd_forest.feature_importances_

(['sepal length (cm)',
  'sepal width (cm)',
  'petal length (cm)',
  'petal width (cm)'],
 array([0.09582688, 0.025055  , 0.42632722, 0.4527909 ]))

# Boosting
Boosting trains predictors sequentially, each one trying to correct the errors of its predecessor.
## AdaBoost
One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases.
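
The sketch below walks through one round of instance reweighting in plain Python, assuming the common formulation in which the predictor weight is `learning_rate * log((1 - r) / r)` for a weighted error rate `r`. The labels, predictions, and variable names are hypothetical, and this is an illustration of the idea rather than the code any library runs.

```python
import math

# Hypothetical true labels and first-predictor outputs for five instances.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]   # the third instance is misclassified
learning_rate = 1.0        # assumed value

n = len(y_true)
weights = [1.0 / n] * n    # every instance starts with the same weight

# Weighted error rate of the predictor.
missed = [t != p for t, p in zip(y_true, y_pred)]
error_rate = sum(w for w, m in zip(weights, missed) if m) / sum(weights)

# Predictor weight: large when the predictor is accurate.
alpha = learning_rate * math.log((1.0 - error_rate) / error_rate)

# Boost the weights of misclassified instances, then normalize to sum to 1.
boosted = [w * math.exp(alpha) if m else w for w, m in zip(weights, missed)]
total = sum(boosted)
new_weights = [w / total for w in boosted]

print("error rate :", error_rate)
print("new weights:", new_weights)
```

After this round the single misclassified instance carries half of the total weight, so the next predictor is pushed to get it right.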
