In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<img src="https://repository-images.githubusercontent.com/206526243/c6c83680-d003-11e9-8eda-4296871ca94a">

## This Notebook is going to be a deep dive into Ensemble techniques 🦈

#### Ensemble techniques leverage the strengths of multiple Machine Learning models to improve the overall performance.

Lately ensemble techniques have been heavily used in the Notebooks of Kaggle Competition winners 🏅🥇🥈, and they are now used extensively in Hackathons as well.

# What is going to be covered

<b>This is an exhaustive list of what is going to be covered in this Notebook. The Notebook will cover the theory, intuition as well as the Implementation of all of the following points in great depth. This notebook can be used as your cheatsheet to Ensemble techniques </b>

### Simple Ensemble Techniques
1. Max Voting (Hard Voting)
2. Averaging (Soft Voting)

## Slightly advanced concepts 🎯

### Bagging 💰

1. Bagging Meta Estimator
2. Random Forest


### Boosting  🚄

1. Adaptive Boosting (AdaBoost)
2. CatBoost
3. XGBoost
4. GBM
5. Light GBM
6. XGBM (Xtreme GBM) <br>* GBM -> Gradient Boosting Machines

### Stacking 🌈

1. Stacking widely used Models like Decision trees, KNN, SVM, SGDs, Logistic Regressors

### Blending 

1. Applying Blending techniques to the stacked Models by breaking down the data into a validation set.

##  How are ensemble techniques this good !!! 💪

<b>Mean squared error can be decomposed into bias and variance (plus irreducible noise). One of the main objectives of building an ensemble is a final model with both low bias and low variance. We get there by combining individual models so that the variance drops while the bias stays fairly low. The individual models are called base learners. How we create the base learners is up to us, and there are multiple ways to go about it:
</b>
1. We can tune the hyperparameters for the same model to create different base learners
2. We can choose different models like SVM, Decision Trees, KNN, Logistic Regression to be individual base learners.
3. Different data subsets sent to the same model
4. Giving different feature sets to different base learners.
    
#### If the base learners make their errors independently of each other, it can be shown both intuitively and mathematically that the ensemble does better than any one of them standing alone.

<b>
Suppose we have 3 Base Learners, each with an accuracy of 80%.
Let's merge them into an Ensemble and see how the accuracy is boosted (here we assume that the Ensemble classifies a datapoint correctly when at least 2 of the 3 Base Learners are correct, and that the Base Learners make their errors independently).
    </b>

Pr --> Probability, BL --> Base Learner <br>
Pr(All 3 BLs are correct) = 0.8 * 0.8 * 0.8 = 0.512 <br>
Pr(Only BL 1 and BL 2 are correct) = 0.8 * 0.8 * 0.2 = 0.128 <br>
Pr(Only BL 1 and BL 3 are correct) = 0.8 * 0.2 * 0.8 = 0.128 <br>
Pr(Only BL 2 and BL 3 are correct) = 0.2 * 0.8 * 0.8 = 0.128 <br>
Adding them to get the Pr that the Ensemble is correct: <br>
<b> 0.512 + 0.384  = 0.896  </b> <br>
This figure relies on the independence assumption. In practice the errors of the Base Learners are usually correlated, so the real gain tends to be smaller.   😎😎😎  <br>
Still, the Ensemble goes from 80% to 89.6%, and as we increase the number of diverse Base Learners and move on to more sophisticated techniques this number can be increased even further.
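
The same calculation can be generalised to any odd number of Base Learners. Here is a minimal sketch using only the standard library. The function name `majority_vote_accuracy` is made up for this illustration, and it assumes that every Base Learner has the same accuracy and that their errors are independent.

```python
from math import comb

def majority_vote_accuracy(n_learners, p):
    """Probability that a strict majority of n independent learners,
    each correct with probability p, gets a datapoint right."""
    k_min = n_learners // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(
        comb(n_learners, k) * p**k * (1 - p) ** (n_learners - k)
        for k in range(k_min, n_learners + 1)
    )

# 3 base learners at 80% accuracy reproduce the 0.896 computed by hand
print(round(majority_vote_accuracy(3, 0.8), 3))

# Under the same assumptions, adding more learners keeps raising the figure
for n in (1, 3, 5, 7, 9):
    print(n, round(majority_vote_accuracy(n, 0.8), 4))
```

The numbers are an upper bound of sorts for real data, because real models rarely make fully independent errors.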

## Voting 💂‍
#### The Models vote, and the majority result wins, so the final prediction is the mode of the predictions received from the individual models.
#### Basically each model casts a vote on what the Prediction should be, and the class that gets the most votes is taken as the Final Prediction.


**What do we need to ensure:**
The Models need to be different and varied. This diversity gives us votes from a range of algorithms that make different kinds of mistakes, which leads to a better-informed decision. Once we have the votes from the individual models we can derive the final prediction.

In [2]:
# Importing some Important library dependencies and Datasets
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [3]:
# Load the stroke prediction dataset (KNN, Decision Trees, SVM and Logistic Regression will be used as Base Learners later)
data = pd.read_csv("../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")
print(data.shape)
data.head()

In [4]:
# The only null values are in bmi, and since they are a small fraction of the 5110 rows,
# those rows can be removed
data.isnull().sum()

In [5]:
# The cleaning step only involves dropping the rows with a null bmi and the 'Other' gender rows, which are too rare in the data
# The id column is also dropped since it has no role to play here
data = data.dropna(subset=['bmi'])
data = data[data['gender'] != 'Other']
data = data.drop('id', axis=1)

In [6]:
data.head()

In [7]:
# Encoding the Categorical Features, object type to Labels
cat_feat = data.dtypes[data.dtypes == 'O'].index.values
le = LabelEncoder()

for i in cat_feat:
    data[i] = le.fit_transform(data[i])
data

In [8]:
# Defining the data and creating the Training and Test Set
X = data.drop('stroke', axis=1)
y = data['stroke']
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
X = pd.DataFrame(X_scaled, columns=X.columns)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [9]:
# Handling the imbalanced data by generating synthetic examples of the minority class via SMOTE
# SMOTE is applied to the training split only, so no synthetic points leak into the test set
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

In [10]:
# Here we will use some standalone models: LR, KNN, Decision Trees, Naive Bayes and SVM Classifier
all_model = [LogisticRegression(), KNeighborsClassifier(), DecisionTreeClassifier(),
            BernoulliNB(), SVC()]

recall = []
precision = []
f1=[]
balanced_accuracy=[]
accuracy=[]

for model in all_model:
    
    
    cv = cross_val_score(model, X_train, y_train, scoring='recall', cv=10).mean()
    recall.append(cv)
    
    cv = cross_val_score(model, X_train, y_train, scoring='precision', cv=10).mean()
    precision.append(cv)
    
    cv = cross_val_score(model, X_train, y_train, scoring='f1', cv=10).mean()
    f1.append(cv)
    
    cv = cross_val_score(model, X_train, y_train, scoring='balanced_accuracy', cv=10).mean()
    balanced_accuracy.append(cv)

model = ['LogisticRegression', 'KNeighborsClassifier', 'DecisionTreeClassifier',
        'BernoulliNB', 'SVC']

score = pd.DataFrame({'Model': model, 'Precision': precision, 'Recall': recall, 'F1':f1, 'balanced_accuracy':balanced_accuracy})
score.style.background_gradient(high=1,axis=0)

## Hard vs Soft Voting:

In classification problems, there are two types of voting: hard voting and soft voting. Hard voting entails picking the prediction with the highest number of votes, whereas soft voting entails combining the probabilities of each prediction in each model and picking the prediction with the highest total probability.


1. Hard voting is a simple Majority Voting scheme
2. In soft voting, every individual classifier provides a probability value that a specific data point belongs to a particular target class. The predictions are weighted by the classifier's importance and summed up. Then the target label with the greatest sum of weighted probabilities wins the vote.
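
To see that the two schemes can disagree, here is a small pure-Python sketch for a binary problem. The probabilities are invented for illustration and the helper names `hard_vote` and `soft_vote` are not part of any library. This is only meant to show the idea, not how scikit-learn implements `VotingClassifier` internally.

```python
def hard_vote(probas, weights, threshold=0.5):
    """Each classifier casts a (weighted) vote for class 1 or class 0."""
    votes_for_1 = sum(w for p, w in zip(probas, weights) if p >= threshold)
    votes_for_0 = sum(w for p, w in zip(probas, weights) if p < threshold)
    return 1 if votes_for_1 > votes_for_0 else 0

def soft_vote(probas, weights, threshold=0.5):
    """Average the (weighted) class-1 probabilities, then threshold the average."""
    avg = sum(p * w for p, w in zip(probas, weights)) / sum(weights)
    return 1 if avg >= threshold else 0

# Hypothetical class-1 probabilities from three classifiers for one datapoint:
# two are mildly against class 1, one is strongly in favour of it
probas = [0.45, 0.45, 0.90]
weights = [1, 1, 1]

print("hard voting ->", hard_vote(probas, weights))  # two votes against one
print("soft voting ->", soft_vote(probas, weights))  # the confident model tips the average
```

Soft voting lets one confident classifier outweigh two hesitant ones, which is why it needs base learners whose probabilities are reasonably well calibrated.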


In [11]:
# Now we will use a Voting Technique to see how we are able to boost our results

from sklearn.ensemble import VotingClassifier

clf1 = LogisticRegression(random_state=42)
clf2 = RandomForestClassifier(random_state=42)
clf3 = GaussianNB()
clf4 = SVC(probability=True, random_state=42)
clf5 = KNeighborsClassifier(5)
clf6 = DecisionTreeClassifier()
eclf = VotingClassifier(estimators=[('LR', clf1), ('RF', clf2), ('GNB', clf3), ('SVC', clf4),("KNN",clf5),("DT",clf6)],
                        voting='hard', weights=[1,2,1,1,1,2])
eclf.fit(X_train, y_train)


In [12]:
y_pred=eclf.predict(X_test)
accuracy_score(y_test, y_pred)

In [13]:
# Trying the same with Soft voting/ Averaging technique
# Creating Individual Classifiers again:
from sklearn.ensemble import VotingClassifier

clf1 = LogisticRegression(random_state=42)
clf2 = RandomForestClassifier(random_state=42)
clf3 = GaussianNB()
clf4 = SVC(probability=True, random_state=42)
clf5 = KNeighborsClassifier(5)
clf6 = DecisionTreeClassifier()
eclf = VotingClassifier(estimators=[('LR', clf1), ('RF', clf2), ('GNB', clf3), ('SVC', clf4),("KNN",clf5),("DT",clf6)],
                        voting='soft', weights=[1,2,1,1,1,2])
eclf.fit(X_train, y_train)

In [14]:
y_pred=eclf.predict(X_test)
accuracy_score(y_test, y_pred)

## Bagging Techniques

As we saw, the Simple Ensemble Techniques are quite intuitive and easy to apply. Now we move to more sophisticated techniques that are not as trivial as the previous ones, starting with Bagging. These are the bagging techniques we will look at:

1. Random Forest Algorithms  (Bagging Decision Tree)
2. Bagging Meta Estimator
3. Bagging Individual Algorithms like SVM, KNN to see how they work


## Theory - Bagging
#### Also known as Bootstrap Aggregating

**So what  do we do in Bagging, really 🤔**

1. We create n different Base Learning Models
2. We give each Model a subset of the available Dataset. The subsets are drawn by row sampling with replacement, so a record may appear more than once within a subset and in the subsets fed to different models. PS: we can also sample the features for each model.
3. The Models are trained on their subsets independently of each other, so training can run in parallel
4. Then we pass the data we want predictions for (e.g. the test data) to each Model and get the Predictions
5. The next step involves using a voting classifier to take the majority vote and decide which prediction to go ahead with

So each Base Learner is like a Bag with some samples from the dataset.
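
The row sampling with replacement can be sketched with the standard library alone. The row ids, the number of bags and the helper name `bootstrap_sample` are all assumptions for this illustration. It only shows how the bags are drawn, not how the models are trained on them.

```python
import random

rng = random.Random(42)          # fixed seed so the sketch is reproducible
n_rows = 1000
row_ids = list(range(n_rows))    # stand-in for the row indices of a dataset

def bootstrap_sample(row_ids, rng):
    """Draw len(row_ids) rows with replacement: one 'bag' for one base learner."""
    return rng.choices(row_ids, k=len(row_ids))

bags = [bootstrap_sample(row_ids, rng) for _ in range(5)]

for i, bag in enumerate(bags, start=1):
    unique_fraction = len(set(bag)) / n_rows
    print(f"bag {i}: {len(bag)} rows, {unique_fraction:.1%} of them distinct")
```

Each bag has the same size as the original data but contains only roughly 63% of the distinct rows, because some rows are drawn several times and others not at all. The rows left out of a bag are what Random Forests use for their out-of-bag estimate.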

### Random Forest 🌳🌲🌴
* <img src="https://www.freecodecamp.org/news/content/images/2020/08/how-random-forest-classifier-work.PNG" heigth=630 width=630>


In a Random Forest Classifier each Base Learner is a Decision Tree, and each Decision Tree is fed a different sample of the data. A decision tree grown to its full depth has low bias and high variance. Low bias means it makes few errors on the data it was trained on. High variance means its predictions change a lot when the training data changes, so it is prone to errors on new data.

#### So here is why aggregation step becomes important

1. We have a set of n Predictions from n Base Learners, a.k.a. Decision Trees in this case
2. Each of them has Low Bias and High Variance on the Test Data
3. When we aggregate all of them with Majority Voting, the High Variance gets converted to Low Variance

### A quick difference between Random Forest and a Decision Tree Bagging Ensemble
<img src="https://i.stack.imgur.com/sYR7y.png">

**Some Maths to Prove this 😃**

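Suppose we have a random variable (RV) $X$.
It is normally distributed with mean $\mu$ and variance $\sigma^2$.
If we sample the RV once, the mean and the variance remain the same.
Now we sample it $n$ times independently and construct a new RV from the average of the samples (one sample per Base Learner here):

$$X' = \frac{x_1 + x_2 + \dots + x_n}{n}$$

1. The mean is still $\mu$, since $E[X'] = \frac{1}{n}\sum_{i=1}^{n} E[x_i] = \frac{n\mu}{n} = \mu$
2. The variance drops to $\sigma^2/n$, since $\mathrm{Var}(X') = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(x_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$

So this becomes a Low Bias and a "Low Variance" Model in this way. The trees in a real ensemble are correlated with each other, so in practice the variance shrinks by less than a factor of $n$.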
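
A quick simulation makes the variance argument concrete. This is a sketch with the standard library only, and the mean, standard deviation, number of samples and number of trials are arbitrary values picked for illustration. It assumes the samples are independent, which real Base Learners are not.

```python
import random
import statistics

rng = random.Random(0)     # fixed seed for reproducibility
mu, sigma = 10.0, 2.0      # assumed mean and standard deviation of the RV
n = 25                     # number of samples averaged together
trials = 4000              # how many times the experiment is repeated

# Variance of a single draw
single_draws = [rng.gauss(mu, sigma) for _ in range(trials)]

# Variance of the average of n independent draws
averages = [
    statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
    for _ in range(trials)
]

var_single = statistics.pvariance(single_draws)
var_average = statistics.pvariance(averages)

print(f"variance of one draw      : {var_single:.3f} (theory {sigma**2:.3f})")
print(f"variance of the average   : {var_average:.3f} (theory {sigma**2 / n:.3f})")
print(f"mean of the averages      : {statistics.fmean(averages):.3f} (theory {mu:.3f})")
```

The mean stays where it was while the variance of the average comes out close to the single-draw variance divided by `n`.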

## What happens if this turns out to be a regression Problem?
If that is the case, the prediction output is not a class but a continuous numerical value. We then take the mean of the values predicted by all the Base Learners, and that mean becomes our final Prediction.

### Bagging with Decision Trees as Base Learners

In [15]:
# Bagging ensemble with the default base learner (a Decision Tree), evaluated with repeated stratified 10-fold CV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import BaggingClassifier

model = BaggingClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
n_scores = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# Report the mean and the spread of the accuracy across all folds
print('Accuracy: %.3f (%.3f)' % (n_scores.mean(), n_scores.std()))

### Random Forest Classifiers

### A Support Vector Machine Bagging Example

### A KNN Bagging Ensemble with Selective Features

### Boosting Techniques

Boosting techniques build the Base Learners sequentially, with each new one trained on the datapoints that the previous ones got wrong, which makes the overall performance robust. This will be the general workflow.
1. Create a base Learner
2. Pass a subset of data onto the Base Learner 
3. Now pass the full data to evaluate the wrong Predictions by the Model.
4. Now create a base learner 2 and only pass on the data points that were incorrectly predicted by the Base Learner 1 as training data to the Base Learner 2.
5. Now again pass the Entire dataset on Base Learner 2 to evaluate the wrong predictions.
6. This base learner will probably classify some other datapoints incorrectly.
7. These datapoints specifically will be passed on as training data to Base Learner 3.
8. This is a sequential process and it goes on until we reach the number of BLs or steps that we specified.
9. Here is a good image to understand Boosting in a pictorial way.
<img src="https://pluralsight2.imgix.net/guides/a9a5ff4e-b617-4afe-b27b-d96793defa87_6.jpg">
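
The sequential idea can be sketched in pure Python with an AdaBoost-style loop. Note that this swaps in a different mechanism from the simplified workflow above: instead of passing only the misclassified datapoints to the next Base Learner, AdaBoost keeps the full dataset and increases the weights of the misclassified datapoints. The toy 1-D dataset, the decision stumps and all function names are assumptions made for this illustration.

```python
import math

def stump_predict(x, threshold, polarity):
    """A decision stump: predicts `polarity` left of the threshold, the opposite right of it."""
    return polarity if x <= threshold else -polarity

def train_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n              # start with uniform weights
    ensemble = []
    for _ in range(n_rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = max(err, 1e-10)            # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)   # say of this base learner
        ensemble.append((alpha, threshold, polarity))
        # Misclassified points get heavier, correctly classified points lighter
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can separate (labels are +1 / -1)
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost(xs, ys, n_rounds=3)
for rounds in (1, 2, 3):
    correct = sum(predict(ensemble[:rounds], x) == y for x, y in zip(xs, ys))
    print(f"{rounds} base learner(s): {correct}/{len(xs)} correct")
```

Every stump on its own is a weak learner, yet the weighted vote of three of them fits this toy dataset, because each round concentrates on what the earlier rounds got wrong.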
