## Voting Classifier

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

# using the moons dataset that we used in chapter 5
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

#training a voting classifier using 3 classifiers

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators = [('lr', log_clf),('rf',rnd_clf),('svc', svm_clf)],
    voting='hard')

voting_clf.fit(X_train,y_train)

VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()), ('svc', SVC())])

In [3]:
# let's look at each classifier's accuracy on the test set
# voting_clf here is an example of hard voting

from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train,y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.88
SVC 0.896
VotingClassifier 0.904


What we notice here is that the voting classifier outperforms each of the individual classifiers. That is because it combines all of their predictions by majority vote, so it can still be right when one classifier is wrong, as long as the classifiers make different mistakes. Asking a crowd and aggregating the answers often beats asking a single expert.
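
To make the majority vote concrete, here is a minimal sketch of hard voting written by hand with NumPy. The helper name `hard_vote` and the toy prediction arrays are made up for illustration; this is not scikit-learn's internal code, just the same idea on a tiny example.

```python
import numpy as np

def hard_vote(predictions):
    """Return the majority class for each instance.

    predictions: integer class labels, shape (n_classifiers, n_instances).
    """
    predictions = np.asarray(predictions)
    n_classes = int(predictions.max()) + 1
    # Row c holds the number of classifiers that voted for class c, per instance.
    votes = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)])
    # argmax picks the class with the most votes (the lowest label wins a tie).
    return votes.argmax(axis=0)

# Toy predictions from three hypothetical classifiers on four instances.
toy_preds = [
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
]
majority = hard_vote(toy_preds)
print(majority)
```

Each classifier is wrong somewhere relative to the majority, but no single disagreement is enough to flip the final answer.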

In [4]:
# let's look at an example of soft voting

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)

voting_clf = VotingClassifier(
    estimators = [('lr', log_clf),('rf',rnd_clf),('svc', svm_clf)],
    voting='soft')

voting_clf.fit(X_train,y_train)

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train,y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.904


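In this run, soft voting scored 0.904, the same as hard voting above, so we did not see an improvement here. Soft voting often does better than hard voting because it gives more weight to confident votes, but the exact numbers change between runs since the classifiers are created without a random_state.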
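
As a rough sketch of what soft voting does, we can average the class probabilities of each classifier and pick the class with the highest mean. The helper name `soft_vote` and the probabilities below are invented for illustration; this is not scikit-learn's internal code.

```python
import numpy as np

def soft_vote(probas):
    """Average class probabilities over classifiers and pick the best class.

    probas: shape (n_classifiers, n_instances, n_classes).
    """
    probas = np.asarray(probas, dtype=float)
    mean_proba = probas.mean(axis=0)          # shape (n_instances, n_classes)
    return mean_proba.argmax(axis=1), mean_proba

# Made-up probabilities from three hypothetical classifiers on two instances.
toy_probas = [
    [[0.55, 0.45], [0.20, 0.80]],   # classifier A
    [[0.60, 0.40], [0.45, 0.55]],   # classifier B
    [[0.10, 0.90], [0.70, 0.30]],   # classifier C
]
labels, mean_proba = soft_vote(toy_probas)
print(labels)
print(mean_proba)
```

On the first instance, two classifiers lean slightly towards class 0, so a hard vote would pick class 0, but the third classifier is very confident about class 1 and the averaged probabilities pick class 1 instead.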

## Bagging and Pasting in SkLearn

In [5]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# example of bagging using sklearn (bootstrap=True); setting bootstrap=False would give pasting instead.
# most of the time we can just go for bagging as the default since it usually gives better models overall,
# but if we have the time and CPU power, we can cross-validate both and keep the one with better results
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, 
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
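
The comparison between bagging and pasting mentioned in the comment can be sketched with `cross_val_score`. This is only one way to do it: the 5 folds, the 50 estimators (fewer than above, to keep it quick) and the fixed `random_state` are choices made for this sketch, not something from the cell above.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

scores = {}
for name, bootstrap in (("bagging", True), ("pasting", False)):
    # bootstrap=True samples with replacement (bagging),
    # bootstrap=False samples without replacement (pasting).
    clf = BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50,
        max_samples=100, bootstrap=bootstrap, random_state=42)
    scores[name] = cross_val_score(clf, X, y, cv=5).mean()
    print(name, scores[name])
```

Whichever variant gets the higher mean score is the one we would keep for this dataset.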

## Out of Bag Evaluation

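When using bagging, each predictor is trained on a bootstrap sample, so some training instances are never sampled for that predictor (roughly 37% of them when the sample is as large as the training set). The instances a predictor did not see are called its "out of bag" (oob) instances, and they are not the same for every predictor.
We can use them to evaluate the ensemble without a separate validation set, by setting oob_score=True.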
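
To see where the roughly 37% comes from, here is a small simulation. It assumes a training set of m = 375 instances (75% of the 500 moons samples) and draws m indices with replacement, the way a bootstrap sample is built. The chance that a given instance is never drawn is (1 - 1/m)^m, which tends to exp(-1) ≈ 0.368 as m grows.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 375          # assumed training-set size
n_trials = 1000  # number of simulated bootstrap samples

oob_fractions = []
for _ in range(n_trials):
    sample = rng.integers(0, m, size=m)      # m draws with replacement
    n_unique = np.unique(sample).size        # instances that were sampled at least once
    oob_fractions.append(1 - n_unique / m)   # fraction left out of the bag

simulated = float(np.mean(oob_fractions))
theoretical = (1 - 1 / m) ** m
print(simulated, theoretical, np.exp(-1))
```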

In [9]:
#we create a bagging classifier and turn oob_score on 
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators = 500, 
    bootstrap=True, n_jobs=-1, oob_score=True)

bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.8906666666666667

In [10]:
#we compare the oob_score to the accuracy score

from sklearn.metrics import accuracy_score 
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test,y_pred)

0.912

Turns out they are fairly close: 0.891 for the oob score vs 0.912 on the test set.

In [11]:
# the oob class probabilities for each training instance
bag_clf.oob_decision_function_

array([[0.40306122, 0.59693878],
       [0.36416185, 0.63583815],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.09473684, 0.90526316],
       [0.33333333, 0.66666667],
       [0.01621622, 0.98378378],
       [0.98963731, 0.01036269],
       [0.98019802, 0.01980198],
       [0.74074074, 0.25925926],
       [0.01098901, 0.98901099],
       [0.76966292, 0.23033708],
       [0.88172043, 0.11827957],
       [0.98387097, 0.01612903],
       [0.06842105, 0.93157895],
       [0.        , 1.        ],
       [0.98360656, 0.01639344],
       [0.91428571, 0.08571429],
       [0.9939759 , 0.0060241 ],
       [0.01515152, 0.98484848],
       [0.34871795, 0.65128205],
       [0.84482759, 0.15517241],
       [1.        , 0.        ],
       [0.97340426, 0.02659574],
       [0.        , 1.        ],
       [1.        , 0.        ],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.62903226, 0.37096774],
       [0.
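
As a sanity check, the oob score can be rebuilt from these probabilities: take the most probable class for each training instance and compare it with `y_train`. This is a sketch under a few assumptions: it retrains a smaller ensemble (100 trees, fixed `random_state`) so it runs on its own, and it assumes every training instance was out of bag for at least one tree, which is practically always the case with this many trees.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    bootstrap=True, oob_score=True, random_state=42)
bag_clf.fit(X_train, y_train)

# One row per training instance, one column per class.
oob_proba = bag_clf.oob_decision_function_
# Most probable class according to the trees that did not see the instance.
oob_pred = bag_clf.classes_[oob_proba.argmax(axis=1)]
manual_oob_accuracy = accuracy_score(y_train, oob_pred)
print(manual_oob_accuracy, bag_clf.oob_score_)
```

The two printed numbers should agree, since the oob score is the accuracy of these out of bag predictions.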

## Random Patches and Random Subspaces

BaggingClassifier supports sampling features as well, which is useful when dealing with high-dimensional inputs (such as images). Sampling both training instances and features is called Random Patches. Keeping all training instances (bootstrap=False and max_samples=1.0) but sampling features (bootstrap_features=True and/or max_features set to a value smaller than 1.0) is called Random Subspaces.
Sampling features makes the predictors more diverse, which trades a bit more bias for lower variance.
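
Here is a minimal sketch of both setups. The moons dataset only has 2 features, so this uses a synthetic 20-feature dataset from `make_classification` instead. The dataset, the 0.5 and 0.7 ratios and the 100 estimators are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset with 20 features (assumed, just for this sketch).
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Random Subspaces: keep all training instances, sample only the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5,
    random_state=42)
subspaces_clf.fit(X_train, y_train)

# Random Patches: sample both training instances and features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5,
    random_state=42)
patches_clf.fit(X_train, y_train)

subspaces_score = subspaces_clf.score(X_test, y_test)
patches_score = patches_clf.score(X_test, y_test)
print("random subspaces", subspaces_score)
print("random patches", patches_score)

# Each tree only sees the features drawn for it (10 draws out of 20 here).
n_features_per_tree = len(subspaces_clf.estimators_features_[0])
print(n_features_per_tree)
```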

# Random Forests

In [12]:
#training a random forest classifier with 500 trees

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train,y_train)

y_pred_rf = rnd_clf.predict(X_test)