## Ensemble Learning
- Ensemble learning combines several machine learning models to solve a single problem.
- The individual models are often called weak learners. The intuition is that combining several weak learners can produce a strong learner.




## Basic ensemble learning techniques
### Max voting

- In classification, the prediction from each model counts as a vote. In max voting, the final prediction is the class with the most votes.

- Let’s take an example where you have three classifiers with the following predictions:

        classifier 1 – class A
        classifier 2 – class B
        classifier 3 – class B

The final prediction here would be class B since it has the most votes.
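
The vote count above can be reproduced in a few lines of Python. This is only a sketch: the helper name `max_vote` is made up for illustration, and ties are broken in favor of the label seen first.

```python
from collections import Counter

def max_vote(predictions):
    """Return the label predicted most often (ties go to the label seen first)."""
    return Counter(predictions).most_common(1)[0][0]

# Predictions from the three classifiers in the example above
votes = ["A", "B", "B"]
final_prediction = max_vote(votes)
print(final_prediction)  # B
```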

### Averaging

- In averaging, the final output is the average of all predictions. This applies to regression problems. For example, in random forest regression, the final result is the average of the predictions from the individual decision trees.

- Let’s take an example of three regression models that predict the price of a commodity as follows:

        regressor 1 – 200
        regressor 2 – 300 
        regressor 3 – 400

The final prediction would be the average of 200, 300, and 400, which is 300.
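
As a quick check of the arithmetic, here is a minimal sketch. The helper name `average_predictions` is an illustrative assumption, not part of any library.

```python
def average_predictions(predictions):
    """Return the arithmetic mean of the individual model predictions."""
    return sum(predictions) / len(predictions)

# Predictions from the three regressors in the example above
prices = [200, 300, 400]
final_prediction = average_predictions(prices)
print(final_prediction)  # 300.0
```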

## Advanced ensemble learning techniques
### Bagging

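- Bagging, short for bootstrap aggregating, draws random samples of the training data with replacement, fits a model on each sample, and combines their outputs: by averaging for regression, or by voting (or averaging predicted probabilities) for classification. Aggregating the results from several models in this way mainly reduces variance and gives a more generalized result.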

- The method involves:

        Creating multiple subsets from the original dataset with replacement,
        Building a base model for each of the subsets,
        Running all the models in parallel,
        Combining predictions from all models to obtain final predictions.
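
The steps above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements bagging. The base model is a hypothetical one-feature threshold rule (`fit_stump`) and the toy data is made up. The models are trained one after another for simplicity, even though they are independent of each other and could be trained in parallel.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a one-feature rule 'predict 1 when x > threshold' with the fewest errors."""
    best_threshold, best_errors = None, len(xs) + 1
    for threshold in sorted(set(xs)):
        errors = sum((x > threshold) != bool(y) for x, y in zip(xs, ys))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(xs, ys, n_models=25, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        # Step 1: bootstrap sample, drawn with replacement from the original data
        idx = [rng.randrange(n) for _ in range(n)]
        # Step 2: build a base model on that sample
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds

def bagging_predict(thresholds, x):
    # Final step: combine the individual predictions by majority vote
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Made-up toy data: one numeric feature, label 1 for the larger values
xs = [85, 89, 95, 100, 110, 137, 148, 160, 170, 183]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

models = bagging_fit(xs, ys)
print(bagging_predict(models, 80), bagging_predict(models, 190))  # 0 1
```

Each bootstrap sample leaves out some rows, so the individual thresholds differ from model to model. The vote smooths out those differences.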


### Boosting

- Boosting is a machine learning ensemble technique that primarily reduces bias, and can also reduce variance, by combining weak learners into a strong learner. The weak learners are applied to the dataset sequentially. The first step is building an initial model and fitting it to the training set.

- A second model that tries to fix the errors generated by the first model is then fitted. Here’s what the entire process looks like:

        Create a subset from the original data,
        Build an initial model with this data,
        Run predictions on the whole data set,
        Calculate the error using the predictions and the actual values,
        Assign more weight to the incorrect predictions,
        Create another model that attempts to fix errors from the last model,
        Run predictions on the entire dataset with the new model,
        Create several models with each model aiming at correcting the errors generated by the previous one,
        Obtain the final model as a weighted mean of all the models.
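
One common way to realize this process is AdaBoost. The sketch below is a simplified, assumption-laden version: it trains every model on the full dataset with per-sample weights instead of on a subset, uses hypothetical one-feature threshold rules as weak learners, and runs on made-up toy data with labels in {-1, +1}.

```python
import math

def stump_predict(x, threshold, polarity):
    """Predict `polarity` for x below the threshold and `-polarity` otherwise."""
    return polarity if x < threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best threshold rule."""
    best = None
    for threshold in [x + 0.5 for x in xs]:
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost_fit(xs, ys, n_rounds=3):
    n = len(xs)
    weights = [1 / n] * n                      # start with equal weights
    ensemble = []
    for _ in range(n_rounds):
        err, threshold, polarity = fit_stump(xs, ys, weights)
        err = max(err, 1e-10)                  # guard against log(1 / 0)
        alpha = 0.5 * math.log((1 - err) / err)  # this model's say in the final vote
        ensemble.append((alpha, threshold, polarity))
        # Give more weight to the samples this model got wrong
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    # Final model: sign of the alpha-weighted sum of the weak learners
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1

# Made-up toy data that no single threshold rule can separate
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost_fit(xs, ys, n_rounds=3)
predictions = [adaboost_predict(ensemble, x) for x in xs]
print(predictions == ys)  # True
```

On this toy data the best single rule misclassifies three of the ten points, while the weighted combination of three rules classifies all ten correctly.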


#### In today's session, I will use the Pima Indians Diabetes dataset to predict whether a person has diabetes based on features such as blood pressure, skin thickness, and age. I will train a standalone model first and then use the bagging ensemble technique to see how much it improves the model's performance.

In [1]:
import pandas as pd

df = pd.read_csv("diabetes.csv")
df.head()

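|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |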


In [2]:
# check missing values
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

### There are no missing values, so we can move on.

In [4]:
# Statistics overview.
df.describe()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [5]:
df.Outcome.value_counts()


Outcome
0    500
1    268
Name: count, dtype: int64

- There is a slight class imbalance in our dataset (500 negative vs. 268 positive cases), but since it is not severe we will not worry about it.



In [6]:
# Separate the features (X) from the target column (y)
X = df.drop("Outcome",axis="columns")
y = df.Outcome

In [7]:
from sklearn.preprocessing import StandardScaler

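# Note: fitting the scaler on all rows before splitting lets test-set statistics
# leak into training; in practice, fit it on the training split only.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)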
X_scaled[:5]

array([[ 0.63994726,  0.84832379,  0.14964075,  0.90726993, -0.69289057,
         0.20401277,  0.46849198,  1.4259954 ],
       [-0.84488505, -1.12339636, -0.16054575,  0.53090156, -0.69289057,
        -0.68442195, -0.36506078, -0.19067191],
       [ 1.23388019,  1.94372388, -0.26394125, -1.28821221, -0.69289057,
        -1.10325546,  0.60439732, -0.10558415],
       [-0.84488505, -0.99820778, -0.16054575,  0.15453319,  0.12330164,
        -0.49404308, -0.92076261, -1.04154944],
       [-1.14185152,  0.5040552 , -1.50468724,  0.90726993,  0.76583594,
         1.4097456 ,  5.4849091 , -0.0204964 ]])

In [8]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, stratify=y, random_state=42)

In [9]:
X_train

array([[-0.84488505,  0.00330087,  0.45982725, ...,  0.8893767 ,
        -0.63687146, -0.44593516],
       [ 2.42174604, -1.02950492,  0.25303625, ...,  0.41977549,
        -0.28351757,  1.2558199 ],
       [-0.84488505, -0.40356202, -0.47073225, ...,  0.44515934,
        -0.17177318, -0.78628618],
       ...,
       [-0.84488505, -0.74783062,  0.04624525, ...,  0.77514938,
        -0.76673656, -0.27575966],
       [ 1.53084665,  1.09870096,  0.87340925, ...,  0.29285624,
         2.16579867,  0.74529338],
       [ 0.04601433,  0.72313521, -0.57412775, ..., -0.31635613,
        -0.55834837,  0.31985461]])

In [10]:
y_train

751    0
358    0
718    0
536    0
651    0
      ..
676    1
113    0
556    0
152    1
107    0
Name: Outcome, Length: 576, dtype: int64

In [11]:
X_train.shape

(576, 8)

In [12]:
y_train.value_counts()

Outcome
0    375
1    201
Name: count, dtype: int64

### Train using a standalone model


In [13]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
scores

array([0.67532468, 0.68831169, 0.66883117, 0.79738562, 0.7124183 ])

In [14]:
scores.mean()


0.708454290807232

### Apply bagging

In [17]:
from sklearn.ensemble import BaggingClassifier

base_estimator = DecisionTreeClassifier()


bag_model = BaggingClassifier(
    estimator=base_estimator, 
    n_estimators=100, 
    max_samples=0.8, 
    oob_score=True,
    random_state=0
)
bag_model.fit(X_train, y_train)
bag_model.oob_score_

0.7621527777777778

In [18]:
bag_model.score(X_test, y_test)


0.734375

In [21]:
base_estimator = DecisionTreeClassifier()

bag_model = BaggingClassifier(
    estimator=base_estimator, 
    n_estimators=100, 
    max_samples=0.8, 
    oob_score=True,
    random_state=0
)
scores = cross_val_score(bag_model, X, y, cv=5)
scores

array([0.75324675, 0.72727273, 0.74675325, 0.82352941, 0.73856209])

In [22]:
scores.mean()


0.7578728461081402

#### The mean cross-validation accuracy improves from about 0.71 with a standalone decision tree to about 0.76 with the bagging classifier



In [23]:
# Let's try a random forest classifier


from sklearn.ensemble import RandomForestClassifier

scores = cross_val_score(RandomForestClassifier(n_estimators=50), X, y, cv=5)
scores.mean()

0.7565486800780918