# HOMEWORK 2.2: Ensemble Learning

# Description of HW2.2

This assignment uses Ensemble Learning to solve a classification problem. The goals are to understand the benefits of Ensemble Learning and to practice the different ensemble approaches available in the Scikit-Learn library.

The following methods are implemented in this assignment:
- Voting Classifier (hard & soft voting)
- Bagging and pasting
- Out-of-Bag evaluation
- Random Forests
- A custom ensemble model


In [1]:
# Common imports
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn

In [2]:
# to make this notebook's output stable across runs
np.random.seed(42)

In [3]:
# generate moon dataset
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# TODO 2.2.1 Voting classifier in Scikit-Learn
- Train voting classifiers in Scikit-Learn, composed of at least three diverse classifiers


In [4]:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


clf1 = RandomForestClassifier(n_estimators=50, random_state=42)
clf2 = SVC(probability=True, kernel='linear', C=0.5, random_state=42)
clf3 = KNeighborsClassifier(n_neighbors=5)


# Define the hard voting classifier
voting_clf_hard = VotingClassifier(
    estimators=[('rf', clf1), ('svc', clf2), ('knn', clf3)],
    voting='hard')



#train your voting classifier using X_train, y_train
voting_clf_hard.fit(X_train, y_train)

# Make predictions on the test set
y_pred_hard = voting_clf_hard.predict(X_test)

# Evaluate the performance of the hard voting classifier
accuracy_hard = accuracy_score(y_test, y_pred_hard)
print("Hard Voting Classifier Accuracy:", accuracy_hard)


# Fit each individual classifier and compare its accuracy on X_test/y_test with the hard voting classifier
print("Individual Classifiers:")
for clf in (clf1, clf2, clf3, voting_clf_hard):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{clf.__class__.__name__} Accuracy: {accuracy}")


Hard Voting Classifier Accuracy: 0.904
Individual Classifiers:
RandomForestClassifier Accuracy: 0.896
SVC Accuracy: 0.856
KNeighborsClassifier Accuracy: 0.912
VotingClassifier Accuracy: 0.904
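
The cell above only evaluates hard voting. A soft-voting counterpart could be sketched as follows, assuming the same three base classifiers and the same data split; the name `voting_clf_soft` is introduced here for illustration. Soft voting averages the class probabilities returned by `predict_proba`, which is why `SVC` needs `probability=True`.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Same data and split as earlier in the notebook
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Soft voting averages predict_proba outputs and picks the most probable class
voting_clf_soft = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("svc", SVC(probability=True, kernel="linear", C=0.5, random_state=42)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="soft",
)
voting_clf_soft.fit(X_train, y_train)

accuracy_soft = accuracy_score(y_test, voting_clf_soft.predict(X_test))
print("Soft Voting Classifier Accuracy:", accuracy_soft)
```

Whether soft voting beats hard voting depends on how well calibrated the individual probability estimates are, so the two scores are worth comparing side by side.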


# TODO 2.2.2 Bagging/Pasting
- One way to get a diverse set of classifiers is to use very different training algorithms
- Another approach is to use the same training algorithm for every predictor, but to train them on different random subsets of the training set
    - When sampling is performed with replacement, this method is called ***bagging*** 
        - short for ***bootstrap aggregating***
    - When sampling is performed without replacement, it is called pasting
- Both bagging and pasting allow training instances to be sampled several times across multiple predictors
- Only bagging allows training instances to be sampled several times for the same predictor
- Predictors can all be trained in parallel, via different CPU cores or even different servers.
- Similarly, predictions can be made in parallel.
- This is one of the reasons why bagging and pasting are such popular methods: they scale very well.
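
The difference between the two sampling schemes can be illustrated with plain NumPy index sampling. This is a toy sketch: the sizes `n_train` and `n_draw` are arbitrary values chosen for illustration and are not taken from the assignment.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, n_draw = 10, 8  # toy sizes chosen for illustration

# Bagging: sample indices with replacement, so repeats are possible
bagging_idx = rng.choice(n_train, size=n_draw, replace=True)

# Pasting: sample indices without replacement, so every index is distinct
pasting_idx = rng.choice(n_train, size=n_draw, replace=False)

print("bagging sample:", np.sort(bagging_idx))
print("pasting sample:", np.sort(pasting_idx))
print("pasting indices all distinct:", len(np.unique(pasting_idx)) == n_draw)
```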

In [5]:
# Scikit-Learn offers a simple API for both bagging and pasting with the BaggingClassifier class.
# The example below uses bagging; to use pasting instead, set bootstrap=False.
# The n_jobs parameter tells Scikit-Learn the number of CPU cores to use for training and predictions
#   (-1 tells Scikit-Learn to use all available cores):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Define the base classifier (Decision Tree in this case)
base_classifier = DecisionTreeClassifier()


# Build a BaggingClassifier with an ensemble of 500 Decision Tree classifiers,
#  each trained on 100 training instances randomly sampled from the training set with replacement,
#  and train it

bag_clf = BaggingClassifier(base_classifier, n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)



# bagging accuracy score
# Using X_test dataset calculate/print Bagging Accuracy Score (using accuracy_score metric)
from sklearn.metrics import accuracy_score
y_pred_bagging = bag_clf.predict(X_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
print("Bagging Accuracy Score:", accuracy_bagging)


Bagging Accuracy Score: 0.904
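
The comments in the cell above note that pasting only requires `bootstrap=False`. A minimal sketch of that variant, assuming the same data split and otherwise identical settings, might look like this; the name `paste_clf` is introduced here for illustration, and `n_jobs` is left at its default to keep the example simple.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data and split as earlier in the notebook
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pasting: each tree sees 100 distinct training instances (no replacement)
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=False,
    random_state=42,
)
paste_clf.fit(X_train, y_train)

accuracy_pasting = accuracy_score(y_test, paste_clf.predict(X_test))
print("Pasting Accuracy Score:", accuracy_pasting)
```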


In [6]:
# Without bagging accuracy score
tree_clf = DecisionTreeClassifier(random_state=42)
# Train Tree Classifier
tree_clf.fit(X_train, y_train)



# Calculate "Tree Classifier" accuracy score
# Using X_test dataset calculate/print Tree Classifier Accuracy Score (using accuracy_score metric)
# Tree Classifier accuracy score
y_pred_tree = tree_clf.predict(X_test)
accuracy_tree = accuracy_score(y_test, y_pred_tree)
print("Tree Classifier Accuracy Score:", accuracy_tree)


Tree Classifier Accuracy Score: 0.856



# TODO 2.2.3 Out-of-Bag evaluation
- With bagging, some instances may be sampled several times for any given predictor, 
 while others may not be sampled at all. 
- By default a BaggingClassifier samples m training instances with replacement (bootstrap=True), 
 where m is the size of the training set
- With this default, only about 63% of the training instances are sampled on average for each predictor
- The remaining 37% of the training instances that are not sampled are called out-of-bag (oob) instances
- Since a predictor never sees the oob instances during training, it can be evaluated on these instances, 
without the need for a separate validation set or cross-validation
- In Scikit-Learn, you can set oob_score=True when creating a BaggingClassifier
 to request an automatic oob evaluation after training

In [7]:
#write your out-of-bag (oob) classifier and train it. 
oob_bag_clf = BaggingClassifier(base_classifier, n_estimators=500, max_samples=100, bootstrap=True, oob_score=True, n_jobs=-1, random_state=42)
oob_bag_clf.fit(X_train, y_train)



# Print the oob score. It is estimated on the training instances that each predictor
# did not see during training, not on the test dataset
print("Out-of-Bag (oob) Score:", oob_bag_clf.oob_score_)



# Calculate the accuracy of the same classifier on the test set for comparison
y_pred_oob_bagging = oob_bag_clf.predict(X_test)
accuracy_oob_bagging = accuracy_score(y_test, y_pred_oob_bagging)
print("Out-of-Bag (oob) Bagging Accuracy Score:", accuracy_oob_bagging)


Out-of-Bag (oob) Score: 0.9253333333333333
Out-of-Bag (oob) Bagging Accuracy Score: 0.904
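
Beyond the single `oob_score_` number, a fitted `BaggingClassifier` with `oob_score=True` also exposes `oob_decision_function_`, which holds the out-of-bag class probabilities for each training instance. The sketch below assumes the same data split and settings as the cell above (with `n_jobs` left at its default) and recomputes the oob accuracy from those probabilities; the names `oob_proba` and `oob_pred` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data and split as earlier in the notebook
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

oob_bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    oob_score=True,
    random_state=42,
)
oob_bag_clf.fit(X_train, y_train)

# One row per training instance, one column per class
oob_proba = oob_bag_clf.oob_decision_function_
print("oob probabilities for the first 3 training instances:")
print(oob_proba[:3])

# Turning the oob probabilities into labels reproduces the oob score
oob_pred = oob_bag_clf.classes_[np.argmax(oob_proba, axis=1)]
print("accuracy from oob probabilities:", accuracy_score(y_train, oob_pred))
print("oob_score_:", oob_bag_clf.oob_score_)
```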


# TODO 2.2.4 Random Forests
- Random Forest is an ensemble of Decision Trees
- It is generally trained via the bagging method, typically with max_samples set to the size of the training set
- Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, 
you can use the RandomForestClassifier class, which is more convenient and optimized for Decision Trees

- Train a Random Forest classifier with 500 trees (each limited to a maximum of 16 leaf nodes), using all available CPU cores:

In [8]:
# RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier


# Write your RandomForestClassifier classifier using sklearn RandomForestClassifier class and train it. 
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)

# Calculate/print the prediction accuracy of the rnd_clf for Test Data
y_pred_rnd = rnd_clf.predict(X_test)
accuracy_rnd = accuracy_score(y_test, y_pred_rnd)
print("Random Forest Classifier Accuracy Score:", accuracy_rnd)

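# Write a BaggingClassifier counterpart (n_estimators=500, max_leaf_nodes=16) and train it.
# Note: RandomForestClassifier also considers a random subset of features at each split;
# passing max_features="sqrt" to the tree would mirror that behavior more closely.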
bag_rnd_clf = BaggingClassifier(DecisionTreeClassifier(max_leaf_nodes=16), n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1, random_state=42)
bag_rnd_clf.fit(X_train, y_train)

# Calculate/print the prediction accuracy of the bag_rnd_clf for Test Data
y_pred_bag_rnd = bag_rnd_clf.predict(X_test)
accuracy_bag_rnd = accuracy_score(y_test, y_pred_bag_rnd)
print("Bagging Random Forest Classifier Accuracy Score:", accuracy_bag_rnd)



Random Forest Classifier Accuracy Score: 0.912
Bagging Random Forest Classifier Accuracy Score: 0.912
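
A Random Forest also restricts each split to a random subset of features, so a closer `BaggingClassifier` analogue would pass `max_features="sqrt"` to the tree. The sketch below assumes that recent Scikit-Learn versions use `max_features="sqrt"` as the `RandomForestClassifier` default, and it measures how often the two models agree on the test set. The name `bag_sqrt_clf` is introduced here for illustration, and the two models are not expected to match exactly because their random streams differ.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data and split as earlier in the notebook
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, random_state=42)
rnd_clf.fit(X_train, y_train)

# Bagged trees that also pick a random feature subset at every split
bag_sqrt_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500,
    random_state=42,
)
bag_sqrt_clf.fit(X_train, y_train)

# Fraction of test instances on which both models predict the same class
agreement = np.mean(rnd_clf.predict(X_test) == bag_sqrt_clf.predict(X_test))
print("Prediction agreement on the test set:", agreement)
```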


# TODO 2.2.5 Write Your Novel Ensemble Model 
- Write your own customized Ensemble Classifier
- Try to get the highest score for the same dataset
- There are no limits or specific constraints, except that the classifier you write must be an Ensemble Classifier.

Note: All submitted scores will be ranked in descending order, and the top 5 students will be awarded an additional 10 points for HW2.2.

In [9]:
# Define your classifiers
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
knn_clf = KNeighborsClassifier(n_neighbors=5)
svm_clf = SVC(probability=True, kernel='linear', C=0.5, random_state=42)

# Write your Ensemble Classifier and train it.

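# Note: the 'svm' slot below is filled with knn_clf, so KNN is counted twice and svm_clf is unused;
# pass svm_clf there to include the SVM in the vote.
my_ensemble_clf = VotingClassifier(estimators=[('svm', knn_clf), ('rf', rf_clf), ('knn', knn_clf)], voting='soft')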

# Train your custom ensemble classifier
my_ensemble_clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred_ensemble = my_ensemble_clf.predict(X_test)


# Calculate accuracy score
from sklearn.metrics import accuracy_score
accuracy_ensemble = accuracy_score(y_test, y_pred_ensemble)
print("Custom Ensemble Classifier Accuracy Score:", accuracy_ensemble)




Custom Ensemble Classifier Accuracy Score: 0.912
