<h1 style="color:black">Ensemble Techniques</h1>

<h3 style="color:orange;line-height:1.2">An ensemble technique is a machine learning technique that combines multiple models to improve the accuracy of predictions. The goal is to create a more robust model by capturing the strengths of each model while reducing their weaknesses.</h3>

In [19]:
# There are 3 main methods to do so

<div style="border:1px solid black;padding:20px">

<h3>Bagging - </h3>

- Often considers __homogeneous weak learners (models)__, i.e. models of the same type
- Learns them __independently of each other, in parallel__
- And __combines them following some kind of deterministic averaging process__

____

<h3>Boosting -</h3>

- Often considers __homogeneous weak learners (models) too__
- Learns them __sequentially in a very adaptive way (a base model depends on the previous one)__
- And __combines them following a deterministic strategy__

____

<h3>Stacking - </h3>

- Often considers __heterogeneous weak learners__, i.e. different models
- Learns them __in parallel, independently__
- And combines them by __training a meta-model to output a prediction based on the different weak models' predictions__
</div>

<img src="images\\Ensemble_Techniques_D1.png" alt="image" width="800px">

- First, __input training data__ is used to build a number of models
- The __allocation function__ dictates how much of the training data each model receives
  - Do they each receive the full training dataset or merely a sample?
  - Do they each receive every feature or a subset?
- After the models are constructed, they can be used to generate a set of predictions, which must be managed in some way
- The __combination function__ governs how disagreements among the predictions are reconciled
  - e.g. The ensemble might use a majority vote to determine the final prediction, or it could use a more complex strategy such as
    weighting each model's votes based on its prior performance 

<img src="images\\Ensemble_Techniques_D2.png" alt="image" width="800px" border="2px">
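
The two combination strategies mentioned above (majority vote and performance-weighted vote) can be sketched in a few lines. This is only an illustration: the helper names `majority_vote` and `weighted_vote`, the three predictions, and the weights are hypothetical and do not come from any library.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Return the label whose supporting models have the largest total weight."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


# Three hypothetical models disagree about a single instance.
predictions = ["yes", "no", "no"]
# Hypothetical prior accuracies, used here as vote weights.
weights = [0.95, 0.40, 0.45]

majority_result = majority_vote(predictions)            # "no": 2 of 3 votes
weighted_result = weighted_vote(predictions, weights)   # "yes": 0.95 outweighs 0.40 + 0.45
print(majority_result, weighted_result)
```

The same set of predictions can therefore lead to different final answers depending on the combination function that is chosen.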

<h1 style="color:red">1. Bagging -</h1>



- One of the first ensemble methods to gain widespread acceptance is a technique called __Bootstrap Aggregating__ or __Bagging__
- Bagging generates a number of training datasets by __bootstrap sampling(sampling with replacement)__ the original training data
- These datasets are then used to generate a set of models using a __single learning algorithm__
- The models' predictions are __combined using voting(for classification)__ or __averaging(for regression)__

<h2 style="color:limegreen">Bootstrap Method - </h2>

- It is a powerful statistical method for estimating a quantity from a data sample
- It is easiest to understand when the quantity is a descriptive statistic such as the mean or standard deviation

____

__Let's assume we have sample of 100 values(x) and we'd like to get an estimate of the mean of the sample__

- We can directly calculate the mean as __mean = sum(x)/100__
- We know our sample is small and that our mean has error in it. We can improve the estimate using the following __Bootstrap Procedure:__
  - Create many (say 1000) random sub-samples of our dataset with replacement (values can repeat)
  - Calculate the mean of each sub-sample
  - Calculate the average of all the sub-sample means
  - Use this as our estimate of the mean
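
The procedure above can be sketched with NumPy. The sample `x` below is synthetic (drawn from a normal distribution purely for illustration), and the standard error and percentile interval are extras that the resampled means make available; they are not part of the steps listed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample of 100 values (synthetic data, for illustration only).
x = rng.normal(loc=50.0, scale=10.0, size=100)

n_resamples = 1000
# Each resample has the same size as x and is drawn with replacement.
boot_means = np.array([
    rng.choice(x, size=x.size, replace=True).mean()
    for _ in range(n_resamples)
])

sample_mean = x.mean()                  # direct estimate: sum(x) / 100
bootstrap_mean = boot_means.mean()      # average of the sub-sample means
bootstrap_se = boot_means.std(ddof=1)   # spread of the means, i.e. the error in the estimate
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"sample mean    : {sample_mean:.3f}")
print(f"bootstrap mean : {bootstrap_mean:.3f} (standard error {bootstrap_se:.3f})")
print(f"95% interval   : [{ci_low:.3f}, {ci_high:.3f}]")
```

In practice the bootstrap mean lands very close to the plain sample mean; the main benefit of resampling is the estimate of how much that mean could vary.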


____

__Let's assume now, we have a sample dataset of 1000 instances(x) and we are using the C5.0 algorithm__


Bagging would work as follows:

- Create many (say 100) random sub-samples with replacement
- Train a C5.0 model on each sub-sample
- Given new data, calculate the average prediction across all models (or take a majority vote for classification)
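
A minimal hand-rolled version of these three steps is sketched below. Two assumptions are made: scikit-learn does not ship C5.0, so `DecisionTreeClassifier` (a CART implementation) stands in for it, and the dataset is synthetic, generated with `make_classification`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary dataset with 1000 instances (assumption, for illustration only).
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

rng = np.random.default_rng(0)
n_models = 100
trees = []
for _ in range(n_models):
    # Step 1: bootstrap sample (row indices drawn with replacement).
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: train one tree per sub-sample (CART used in place of C5.0).
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Step 3: combine. Averaging 0/1 predictions and thresholding at 0.5
# is equivalent to a majority vote for binary labels.
all_preds = np.array([tree.predict(X_test) for tree in trees])
bagged_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)
bagged_accuracy = (bagged_pred == y_test).mean()
print(f"bagged accuracy: {bagged_accuracy:.3f}")
```

`BaggingClassifier` in scikit-learn wraps the same idea behind a single estimator, which is usually the more convenient choice.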


## ...continued

- When bagging with decision trees, __we are less concerned about individual trees overfitting the training data__
- For this reason, the __individual trees are grown deep without pruning__
- These trees will have both __high variance and low bias__


<h1 style="color:red">1.1. Random Forest -</h1>

<h3>The Random Forest is a model made up of many decision trees.</h3>

<div style="background-color:beige;padding:20px;border:1px solid black">
    
#### Rather than simply averaging predictions of trees, this model uses 2 key concepts that give it the name random :

- Random sampling of training data points when building trees
- Random subsets of features considered when splitting nodes
</div>

- When training, each tree in a random forest learns from a random sample of data points
- The samples are drawn __with replacement__, known as __bootstrapping__, which means that some samples will be used multiple times in a single tree

- The other main concept in random forest is that only a __subset of all the features is considered for splitting each node__ in each decision tree
- Generally this is set to __sqrt(n_features) for classification__
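
Both sources of randomness are easy to inspect numerically. The sketch below assumes a hypothetical training set of 1000 rows and 8 features, and assumes the square root is rounded down to an integer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000  # hypothetical training set size

# Bootstrap sample: row indices drawn with replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
unique_fraction = np.unique(bootstrap_idx).size / n_samples
print(f"fraction of distinct rows in the sample: {unique_fraction:.3f}")

# Number of features considered at each split with the sqrt rule.
n_features = 8
k = max(1, int(np.sqrt(n_features)))
print(f"features considered per split: {k} of {n_features}")
```

Because rows repeat, each tree typically sees only around 63% of the distinct training rows (roughly 1 - 1/e), which is part of what makes the trees differ from one another.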

<img src="images\\Ensemble_Techniques_D4.png" alt="image" width="800px" border="2px">

In [120]:
# One way of choosing how to split a node is to use Gini Impurity (Gini Index)

<img src="images\\Ensemble_Techniques_D5.png" alt="image" width="500px" border="2px">
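
As a small worked example, the sketch below uses the standard definition of Gini impurity, 1 minus the sum of squared class proportions, and scores a candidate split by the size-weighted impurity of its two child nodes. The function names and the toy labels are illustrative only.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum(p_c ** 2) over the classes c."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


parent = [0, 0, 0, 0, 1, 1, 1, 1]
parent_gini = gini_impurity(parent)                     # 0.5 for a 50/50 node
pure_split = split_gini([0, 0, 0, 0], [1, 1, 1, 1])      # 0.0, both children are pure
mixed_split = split_gini([0, 0, 1, 1], [0, 0, 1, 1])     # 0.5, no improvement
print(parent_gini, pure_split, mixed_split)
```

A tree would prefer the first split, since it reduces the impurity from 0.5 to 0.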

<div style="border:3px dotted red;padding:30px">

<h2 style="color:limegreen"> Random Forest Pseudocode - </h2>

- Randomly select __"k"__ features from total __"m"__ features, where __k << m__
- Among the __"k"__ features, calculate the node __"d"__ using the __best split point__
- Split the node into __Daughter Nodes__ using the __best split__
- Repeat steps __1 to 3__ until __"i"__ number of nodes has been reached
- Build the forest by repeating steps __1 to 4__ __"n"__ times to create __"n"__ trees

__For a detailed explanation, check out this PDF__: [Random Forest Steps](Files\\Random_Forest_Pseudocode_with_Examples_Fixed.pdf)

</div>

<h3 style="color:slateblue">Advantages of Random Forest Algorithm - </h3>

- The same __Random Forest Algorithm__ can be used for both __classification and regression__ tasks
- Some Random Forest implementations can __handle missing values__; it is still good practice to treat them during EDA
- __Overfitting is much less of a problem__ than with a single decision tree, because averaging many decorrelated trees reduces variance (the risk is reduced, not eliminated)
- The random forest algorithm __can be used for Feature Engineering__ (more precisely, feature selection)
  - which means identifying the most important features out of the available features in the training dataset
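
One way to see this in scikit-learn is the `feature_importances_` attribute of a fitted forest. The sketch below uses synthetic data from `make_classification`; with `shuffle=False` the three informative features are placed in the first three columns, which is an assumption made here only so the ranking can be checked by eye.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: columns 0-2 are informative, columns 3-7 are noise.
X, y = make_classification(
    n_samples=500,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = forest.feature_importances_   # impurity-based, sums to 1
ranking = importances.argsort()[::-1]       # most important feature first

for i in ranking:
    print(f"feature {i}: {importances[i]:.3f}")
```

These impurity-based scores are a quick first look; they can be biased towards high-cardinality features, so they are best treated as a guide rather than a final answer.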

<h1 style="color:red">2. Boosting -</h1>

<h4>Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers</h4>


- This is done by building a model from the training data, then creating a second model that __attempts to correct the errors from the <br>
    first model__.
- Models are added until the training set is predicted perfectly, or until a __maximum number of models is reached__

- __AdaBoost__ was the first successful boosting algorithm developed for binary classification and later on extended to multiclass problems

- It can be used to boost performance of any ML algorithm. It is best used with __weak learners__

In [168]:
# Let's first build a rough intuition

<img src="images\\Ensemble_Techniques_D6.png" alt="image" width="800px" border="2px">

<h2 style="color:limegreen"> Boosting Algorithm - </h2>

<img src="images\\Ensemble_Techniques_D7.png" alt="image" width="700px">

- __Samples__ that are difficult to classify receive increasingly larger weights until the algorithm identifies a model that correctly <br>
    classifies these examples

- At each iteration, a __stage weight [ln((1-err)/err)]__ is computed based on the error rate at that iteration

- At the initial stage it assigns equal weight __1/N__ to all observations

- The overall sequence of weighted classifiers is then __combined__ into an ensemble and has a __strong potential__ to classify better <br>
    than any of the individual classifiers

<img src="images\\Ensemble_Techniques_D8.png" alt="image" width="700px">

<h2 style="color:limegreen"> Learning an AdaBoost Model from Data - </h2>



<img src="images\\Ensemble_Techniques_D9.png" alt="image" width="500px">

- Each sample has the same starting weight __1/N__ initially
  
- Fit a weak classifier using the weighted samples and __compute the k'th model's misclassification error(errk)__
- Compute the kth stage value as __ln((1-errk)/errk)__
- __Update the sample weights giving more weight to incorrectly predicted samples__ and less weight to correctly predicted samples


- e.g. If the model predicted 78 of 100 training instances correctly, the error or misclassification rate would be __(100-78)/100 = 0.22__

  <img src="images\\Ensemble_Techniques_D10.png" alt="image" width="200px" style="margin-left:30%">

- The above formula is modified to use the weightage of the training instances:

  <img src="images\\Ensemble_Techniques_D11.png" alt="image" width="200px" style="margin-left:30%">
  

<img src="images\\Ensemble_Techniques_D12.png" alt="image" width="700px" border="2px">
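
One round of the AdaBoost bookkeeping described above can be traced by hand. The labels and predictions below are made up, and the weight update `w * exp(stage * misclassified)` followed by renormalisation is one common formulation, assumed here because the text does not spell out the exact update rule.

```python
import numpy as np

# Hypothetical true labels and the output of one weak classifier (2 mistakes out of 10).
y_true = np.array([1, 1, 1, -1, -1, -1, 1, -1, 1, -1])
y_pred = np.array([1, 1, -1, -1, -1, 1, 1, -1, 1, -1])

n = len(y_true)
weights = np.full(n, 1.0 / n)   # every sample starts with weight 1/N

# Weighted misclassification error of this model.
misclassified = (y_pred != y_true).astype(float)
err = np.sum(weights * misclassified) / np.sum(weights)

# Stage value: ln((1 - err) / err).
stage = np.log((1.0 - err) / err)

# Assumed update rule: scale up the weights of misclassified samples,
# then renormalise so the weights sum to 1 again.
new_weights = weights * np.exp(stage * misclassified)
new_weights /= new_weights.sum()

print(f"error = {err:.2f}, stage = {stage:.3f}")
print("updated weights:", np.round(new_weights, 4))
```

With an error of 0.2 the stage value is ln(4), and after renormalisation each of the two misclassified samples carries a weight of 0.25 while each correctly classified sample drops to 0.0625.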

<h1 style="color:red">3. Stacking -</h1>

<h4 style="line-height:1.5">Stacking builds multiple models (typically of differing types) and a supervisor model that learns how to best combine the predictions <br>
    of the primary models
</h4>

- Suppose for __C1 we used a Decision Tree (C5.0)__, for __C2 -> CART__ and for __C3 -> Random Forest__
- Then from these classifiers we will get outputs 1, 2 and 3 respectively

<img src="images\\Ensemble_Techniques_D13.png" alt="image" width="500px" border="2px">

____

# Example Code

In [215]:
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

import warnings
warnings.filterwarnings('ignore')

In [219]:
# Importing file

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv("Datasets\\pima-indians-diabetes.data.csv",names=names)   # the file has no header row, so column names are passed explicitly
df

Unnamed: 0,preg,plas,pres,skin,test,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [223]:
X = df.iloc[:,0:8]
y = df.iloc[:,8]
X.head()

Unnamed: 0,preg,plas,pres,skin,test,mass,pedi,age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33


In [233]:
# Bagged Decision Trees for classification

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

kfold = KFold(n_splits=10,random_state=42,shuffle=True)
# shuffle=True indicates that the data should be shuffled before splitting it into folds.
# Shuffling ensures that the data is randomly distributed across the folds.


cart = DecisionTreeClassifier()
num_trees = 100


model = BaggingClassifier(estimator=cart,n_estimators=num_trees,random_state=42)
results = cross_val_score(model,X,y,cv=kfold)
results

array([0.72727273, 0.75324675, 0.7012987 , 0.85714286, 0.79220779,
       0.71428571, 0.77922078, 0.76623377, 0.67105263, 0.80263158])

In [234]:
results.mean()

0.7564593301435407

In [239]:
# Random Forest Classification

from sklearn.ensemble import RandomForestClassifier

# using same k_fold

num_trees = 100
max_features = 3 #Total features m=8, K=3 i.e. randomly select 3 features

model = RandomForestClassifier(n_estimators=num_trees,max_features=max_features)   # n_estimators=100 by default (10 before scikit-learn 0.22); criterion='gini' by default
results = cross_val_score(model,X,y,cv=kfold)
results

array([0.67532468, 0.76623377, 0.72727273, 0.83116883, 0.83116883,
       0.71428571, 0.79220779, 0.72727273, 0.67105263, 0.80263158])

In [243]:
results.mean()

0.753861927546138

In [245]:
# AdaBoost Classification

from sklearn.ensemble import AdaBoostClassifier

num_trees = 10

model = AdaBoostClassifier(n_estimators=num_trees,random_state=42)  #n_estimators=50 by default
results = cross_val_score(model,X,y,cv=kfold)
results

array([0.74025974, 0.80519481, 0.71428571, 0.79220779, 0.79220779,
       0.67532468, 0.77922078, 0.79220779, 0.67105263, 0.81578947])

In [249]:
results.mean()

0.7577751196172249

In [253]:
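# Voting Ensemble for Classification
# Note: VotingClassifier combines the sub-models by majority (hard) vote by default;
# it does not train a meta-model, so it is a simpler strategy than the stacking described earlier.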

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC   # support vector classifier
from sklearn.ensemble import VotingClassifier

# create sub-models
estimators=[]

model1 = LogisticRegression(max_iter=500)
estimators.append(('logistic', model1))

model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))

model3 = SVC()
estimators.append(('svc', model3))

# create ensemble model

model = VotingClassifier(estimators)
results = cross_val_score(model,X,y,cv=kfold)
results

array([0.75324675, 0.77922078, 0.71428571, 0.81818182, 0.81818182,
       0.7012987 , 0.81818182, 0.76623377, 0.71052632, 0.81578947])

In [255]:
estimators

[('logistic', LogisticRegression(max_iter=500)),
 ('cart', DecisionTreeClassifier()),
 ('svc', SVC())]

In [257]:
results.mean()

0.7695146958304853
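
The cell above combines the sub-models with `VotingClassifier`, which takes a vote rather than training a meta-model. A sketch of stacking proper, using scikit-learn's `StackingClassifier` with a logistic regression as the meta-model, could look like the following. The data is synthetic (from `make_classification`) so the block runs on its own, and the choice of base models and meta-model is only an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data (assumption, for illustration only).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Heterogeneous base models (level 0).
base_models = [
    ("cart", DecisionTreeClassifier(random_state=42)),
    ("svc", SVC(random_state=42)),
]

# The meta-model (level 1) is trained on out-of-fold predictions of the base models.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=500),
    cv=5,
)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
stack_scores = cross_val_score(stack, X, y, cv=kfold)
print(stack_scores)
print(stack_scores.mean())
```

The inner `cv=5` controls how the meta-model's training features are produced, while the outer `kfold` is only used to evaluate the whole stacked model.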

# **Additional Code: "random_state = any int"**

In [261]:
from sklearn.model_selection import train_test_split
import numpy as np

# Generate data for the example (values from 0 to 9)
data = np.arange(10)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data,data, test_size=0.3, random_state=20)

# Display the training and testing sets
print("Training set X:", X_train)
print("Testing set X:", X_test)
print("Training labels y:", y_train)
print("Testing labels y:", y_test)

'''
First time Output:
Training set X: [0 7 2 9 4 3 6]
Testing set X: [8 1 5]
Training labels y: [0 7 2 9 4 3 6]
Testing labels y: [8 1 5]

Second time Output
Same as above

TRY TO CHANGE random_state = ANY OTHER INTEGER
'''

Training set X: [5 0 2 6 9 4 3]
Testing set X: [7 1 8]
Training labels y: [5 0 2 6 9 4 3]
Testing labels y: [7 1 8]



# **If random_state is not passed as an argument**

In [276]:
from sklearn.model_selection import train_test_split
import numpy as np

# Generate data for the example (values from 0 to 9)
data = np.arange(10)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, data, test_size=0.3)

# Display the training and testing sets
print("Training set X:", X_train)
print("Testing set X:", X_test)
print("Training labels y:", y_train)
print("Testing labels y:", y_test)

'''
First time Output
Training set X: [4 3 7 9 1 6 8]
Testing set X: [0 2 5]
Training labels y: [4 3 7 9 1 6 8]
Testing labels y: [0 2 5]

Second time Output
Training set X: [6 8 2 3 7 1 9]
Testing set X: [5 0 4]
Training labels y: [6 8 2 3 7 1 9]
Testing labels y: [5 0 4]
'''

Training set X: [7 1 4 3 5 8 2]
Testing set X: [0 6 9]
Training labels y: [7 1 4 3 5 8 2]
Testing labels y: [0 6 9]

