## Introduction

In [1]:
"""
Ensemble methods combine several models, often decision trees, into a single predictor.
The idea is that the combined model performs better than any of the individual models.
Each individual model is a weak learner, and together they form a strong learner.

What Ensemble Method Solves
===========================
It reduces the problem of overfitting.

What is overfitting
===================
Overfitting means the model fits the training data too closely, so testing accuracy is noticeably lower than training accuracy.

Different Ensemble Methods
==========================
1.) Random forest
2.) Bagging
3.) Boosting
4.) Stacking
"""
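
The claim that weak learners combine into a stronger one can be checked with a small calculation. The sketch below assumes the learners make independent errors and vote by simple majority, which real ensembles only approximate; `majority_vote_accuracy` is a hypothetical helper name, not part of any library.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n independent voters, each correct
    with probability p, gives the right answer (n is assumed to be odd)."""
    need = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


# each weak learner is right 60% of the time (an assumed figure)
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions the accuracy of the vote rises as more learners are added, even though each learner on its own stays at 60%.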



## Method 1: Random Forest

In [2]:
"""
How Random Forest works
=======================
Technique Used: Bootstrap sampling (a statistical technique that randomly selects data points with replacement)
Algorithm Used: Decision trees

How it works
============
Each tree is trained on a set of samples drawn with replacement, that is, the same sample can be drawn more than once
For each set of samples, the number of features considered is sqrt(total features or variables)

Advantage of Random Forest
==========================
It reduces overfitting compared with a single decision tree and can help with the missing value problem.

Cross Validation Techniques
===========================
Common cross validation techniques include
 a.) Leave one out
 b.) Leave p out
 c.) K-fold
 d.) Repeated random sub sampling validation

We are implementing K-fold for random forest
============================================
1.) In K-fold, the value k denotes the number of groups (folds) into which the data is split
2.) Accuracy or another performance parameter is calculated k times, for example 10 times when k is 10
3.) In each calculation one group is chosen as the testing dataset and the rest as the training dataset, and a group
once chosen for testing is not chosen again in later steps

The average of the accuracies over these k iterations is taken as the final accuracy

Some Key points with respect to random forest method
=====================================================
For classification: The majority vote of the decision trees is chosen as the answer
For regression: The average of the outputs of the decision trees is chosen as the answer
"""
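
The sampling and voting steps described above can be sketched with the standard library alone. This is an illustration, not scikit-learn's implementation: the helper names are hypothetical, no trees are actually trained, and the feature subset is drawn once per sample set for brevity, whereas scikit-learn's `RandomForestClassifier` draws it at each split.

```python
import math
import random
from collections import Counter
from statistics import mean


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(feature_names, rng):
    """Pick about sqrt(total features) features, without repeats."""
    k = max(1, int(math.sqrt(len(feature_names))))
    return rng.sample(feature_names, k)


def aggregate_classification(votes):
    """Classification: the majority vote across the trees."""
    return Counter(votes).most_common(1)[0][0]


def aggregate_regression(outputs):
    """Regression: the average of the tree outputs."""
    return mean(outputs)


rng = random.Random(0)
features = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

rows = bootstrap_indices(10, rng)
print(rows)                                        # indices may repeat
print(random_feature_subset(features, rng))        # 2 of the 8 features
print(aggregate_classification([1, 0, 1, 1, 0]))   # made-up tree votes
print(aggregate_regression([2.0, 3.0, 4.0]))       # made-up tree outputs
```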




## Implementation of Random Forest on Diabetes Dataset

In [26]:
# importing the libraries to read the dataset and build the models
import pandas as pd

from statistics import mean 

from sklearn.svm import SVC
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

In [4]:
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class_name']

In [5]:
dataframe_read = pd.read_csv(
           'pima-indians-diabetes.csv', 
            names = columns
)

In [6]:
dataframe_read.head(5)

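|   | preg | plas | pres | skin | test | mass | pedi | age | class_name |
|---|------|------|------|------|------|------|-------|-----|------------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |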


In [7]:
# splitting the dataset into data and target
# by data I mean all the columns except the last one
# by target I mean only the target column
X = dataframe_read.iloc[:, 0:dataframe_read.shape[1]-1]
y = dataframe_read.iloc[:, dataframe_read.shape[1]-1]

# getting the features from the data
features =  X.columns

In [8]:
# creating the model
model = RandomForestClassifier(
                    max_depth=2, 
                    random_state=0
)

In [9]:
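# bagging classifier
# the base estimator is passed positionally because its keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2
model_bagging = BaggingClassifier(
                SVC(),
                n_estimators=10, 
                random_state=0
)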

In [19]:
# ada boosting
model_adaboost = AdaBoostClassifier(
      n_estimators=100, 
      random_state=0
)

## For K fold we do not need to split the dataset into data and target

In [10]:
# suppose we are working with 3-fold cross validation
kf3 = KFold(n_splits=3, shuffle=False)

i = 1
accuracy_total = list()
for train_index, test_index in kf3.split(dataframe_read):
        
        X_train = dataframe_read.iloc[train_index].loc[:, features]
        X_test = dataframe_read.iloc[test_index].loc[:, features]
        y_train = dataframe_read.iloc[train_index].loc[:,'class_name']
        y_test = dataframe_read.iloc[test_index].loc[:, 'class_name']
        
        model.fit(X_train, y_train)
        
        accuracy = accuracy_score(
            y_test, model.predict(X_test)
        )
        
        print(
            "Accuracy during {} iterations is: {}".format(
             i, accuracy)
        )
        i +=1
        accuracy_total.append(accuracy) 
print(
    "Overall accuracy achieved using K fold is: {}".format(
        mean(accuracy_total))
)                                              

Accuracy during 1 iterations is: 0.71875
Accuracy during 2 iterations is: 0.7421875
Accuracy during 3 iterations is: 0.76953125
Overall accuracy achieved using K fold is: 0.7434895833333334
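
To make the fold mechanics concrete, here is a minimal pure-Python sketch of how an unshuffled K-fold split hands out row indices. The helper `k_fold_indices` is hypothetical, and the 768-row count is an assumption about the size of the dataset. The sketch follows the documented behaviour of `KFold` with `shuffle=False` (consecutive folds); it is not scikit-learn's own code.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for an unshuffled K-fold split.

    Every row lands in exactly one test fold; when n_rows is not divisible
    by k, the first folds get one extra row.
    """
    base, extra = divmod(n_rows, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size


# assumed dataset size of 768 rows, split into 3 folds
for train_idx, test_idx in k_fold_indices(768, 3):
    print(len(train_idx), len(test_idx), test_idx[0], test_idx[-1])
```

Because the test folds are contiguous blocks, any ordering in the file (for example, rows sorted by class) carries straight into the folds, which is the usual reason to try `shuffle=True` as well.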


## K fold with shuffle

In [11]:
kf3 = KFold(n_splits=3, shuffle=True, random_state=42)

i = 1
accuracy_total = list()
for train_index, test_index in kf3.split(dataframe_read):
        
        X_train = dataframe_read.iloc[train_index].loc[:, features]
        X_test = dataframe_read.iloc[test_index].loc[:, features]
        y_train = dataframe_read.iloc[train_index].loc[:,'class_name']
        y_test = dataframe_read.iloc[test_index].loc[:, 'class_name']
        
        model.fit(X_train, y_train)
        
        accuracy = accuracy_score(
            y_test, model.predict(X_test)
        )
        
        print(
            "Accuracy during {} iterations is: {}".format(
             i, accuracy)
        )
        i +=1
        accuracy_total.append(accuracy) 
print(
    "Overall accuracy achieved using K fold is: {}".format(
        mean(accuracy_total))
)                          

Accuracy during 1 iterations is: 0.71875
Accuracy during 2 iterations is: 0.7109375
Accuracy during 3 iterations is: 0.73046875
Overall accuracy achieved using K fold is: 0.7200520833333334


In [12]:
"""
Bagging Method: It works like the Random Forest discussed above, except that all features are used
for each sample set

Salient features of Bagging Method
==================================
1.) Homogeneous weak learners are combined to create a better model

By homogeneous weak learners I mean that all the algorithms or classifiers used are the same
All these models can be trained in parallel

Let us implement the bagging method on the above dataset with K fold
"""
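
Before moving to scikit-learn, the idea of bagging can be sketched by hand. The example below is a toy under stated assumptions: the data is a made-up one-dimensional set, the weak learner is a threshold rule (a decision stump) rather than the SVC used in this notebook, and all function names are hypothetical.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Return the threshold t with the fewest errors for the rule 'x > t -> 1'."""
    best_t, best_err = None, None
    # min(xs) - 1 allows the rule 'always predict 1'
    for t in [min(xs) - 1] + sorted(set(xs)):
        err = sum(int(x > t) != y for x, y in zip(xs, ys))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def bagging_fit(xs, ys, n_estimators, rng):
    """Train the same kind of weak learner on n_estimators bootstrap samples."""
    n = len(xs)
    thresholds = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds


def bagging_predict(thresholds, x):
    """Each stump votes; the majority label wins."""
    votes = [int(x > t) for t in thresholds]
    return Counter(votes).most_common(1)[0][0]


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
stumps = bagging_fit(xs, ys, n_estimators=11, rng=random.Random(0))
print(bagging_predict(stumps, 1), bagging_predict(stumps, 8))
```

The stumps do not depend on each other, which is why the individual models in bagging can be trained in parallel.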



## K Fold without shuffle : Bagging Method

In [13]:
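# random_state only applies when shuffle=True; recent scikit-learn versions
# raise a ValueError if it is set together with shuffle=False
kf3 = KFold(n_splits=3, shuffle=False)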

i = 1
accuracy_total = list()
for train_index, test_index in kf3.split(dataframe_read):
        
        X_train = dataframe_read.iloc[train_index].loc[:, features]
        X_test = dataframe_read.iloc[test_index].loc[:, features]
        y_train = dataframe_read.iloc[train_index].loc[:,'class_name']
        y_test = dataframe_read.iloc[test_index].loc[:, 'class_name']
        
        model_bagging.fit(X_train, y_train)
        
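        # score the model fitted in this fold; the output recorded below was
        # produced by model.predict (the random forest), so rerun for bagging figures
        accuracy = accuracy_score(
            y_test, model_bagging.predict(X_test)
        )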
        
        print(
            "Accuracy during {} iterations is: {}".format(
             i, accuracy)
        )
        i +=1
        accuracy_total.append(accuracy) 
print(
    "Overall accuracy achieved using K fold is: {}".format(
        mean(accuracy_total))
)                                  



Accuracy during 1 iterations is: 0.74609375
Accuracy during 2 iterations is: 0.73828125
Accuracy during 3 iterations is: 0.76953125
Overall accuracy achieved using K fold is: 0.7513020833333334


## K fold with Shuffle: Bagging Method

In [14]:
kf3 = KFold(n_splits=3, shuffle=True, random_state=42)

i = 1
accuracy_total = list()
for train_index, test_index in kf3.split(dataframe_read):
        
        X_train = dataframe_read.iloc[train_index].loc[:, features]
        X_test = dataframe_read.iloc[test_index].loc[:, features]
        y_train = dataframe_read.iloc[train_index].loc[:,'class_name']
        y_test = dataframe_read.iloc[test_index].loc[:, 'class_name']
        
        model_bagging.fit(X_train, y_train)
        
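        # score the model fitted in this fold; the output recorded below was
        # produced by model.predict (the random forest), so rerun for bagging figures
        accuracy = accuracy_score(
            y_test, model_bagging.predict(X_test)
        )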
        
        print(
            "Accuracy during {} iterations is: {}".format(
             i, accuracy)
        )
        i +=1
        accuracy_total.append(accuracy) 
print(
    "Overall accuracy achieved using K fold is: {}".format(
        mean(accuracy_total))
)                             

Accuracy during 1 iterations is: 0.7578125
Accuracy during 2 iterations is: 0.765625
Accuracy during 3 iterations is: 0.73046875
Overall accuracy achieved using K fold is: 0.7513020833333334


In [16]:
"""
Boosting
========

This is another ensemble technique. Here also homogeneous weak learners are used, but the
process is sequential, that is, the models are trained one after another.

Let us understand Sequential Process
====================================

1.) The first model is applied, such as a decision tree or support vector classifier
2.) Its error is calculated and passed to the next model along with the sample set, so that the next
model focuses on the samples the previous one got wrong

This will take more time as models are not running in parallel

Examples
========
1.) AdaBoost
2.) Gradient Boosting
"""
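
The sequential process can be illustrated with a small hand-written AdaBoost. This is a sketch under assumptions, not the code behind `AdaBoostClassifier`: it uses labels -1/+1, a made-up one-dimensional dataset that no single threshold rule can fit, and hypothetical function names.

```python
import math


def stump_output(x, t, sign):
    """Threshold rule: sign * (+1 if x > t else -1)."""
    return sign * (1 if x > t else -1)


def fit_weighted_stump(xs, ys, w):
    """Return (weighted error, t, sign) of the best stump under weights w."""
    best = None
    for t in [min(xs) - 1] + sorted(set(xs)):
        for sign in (1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if stump_output(x, t, sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best


def adaboost_fit(xs, ys, n_rounds):
    n = len(xs)
    w = [1.0 / n] * n                      # start with equal sample weights
    ensemble = []
    for _ in range(n_rounds):
        err, t, sign = fit_weighted_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # say of this stump in the vote
        ensemble.append((alpha, t, sign))
        # misclassified samples gain weight, correct ones lose weight
        w = [wi * math.exp(-alpha * y * stump_output(x, t, sign))
             for x, y, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, x):
    score = sum(alpha * stump_output(x, t, sign) for alpha, t, sign in ensemble)
    return 1 if score > 0 else -1


xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]                  # no single threshold separates this
ensemble = adaboost_fit(xs, ys, n_rounds=3)
print([adaboost_predict(ensemble, x) for x in xs])
```

Each round depends on the weights left by the previous one, so the rounds cannot run in parallel the way bagging's models can.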



In [20]:
kf3 = KFold(n_splits=3, shuffle=True, random_state=42)

i = 1
accuracy_total = list()
for train_index, test_index in kf3.split(dataframe_read):
        
        X_train = dataframe_read.iloc[train_index].loc[:, features]
        X_test = dataframe_read.iloc[test_index].loc[:, features]
        y_train = dataframe_read.iloc[train_index].loc[:,'class_name']
        y_test = dataframe_read.iloc[test_index].loc[:, 'class_name']
        
        model_adaboost.fit(X_train, y_train)
        
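        # score the model fitted in this fold; the output recorded below was
        # produced by model.predict (the random forest), so rerun for AdaBoost figures
        accuracy = accuracy_score(
            y_test, model_adaboost.predict(X_test)
        )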
        
        print(
            "Accuracy during {} iterations is: {}".format(
             i, accuracy)
        )
        i +=1
        accuracy_total.append(accuracy) 
print(
    "Overall accuracy achieved using K fold is: {}".format(
        mean(accuracy_total))
)                             

Accuracy during 1 iterations is: 0.7578125
Accuracy during 2 iterations is: 0.765625
Accuracy during 3 iterations is: 0.73046875
Overall accuracy achieved using K fold is: 0.7513020833333334


## Stacking

In [21]:
"""
Let us understand Stacking

Stacking
========
The base models are heterogeneous and can be trained in parallel, for example svc,
random forest and naive bayes. Their predictions are then passed to a final estimator
that produces the answer

Using different models can improve model accuracy
"""
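
The data flow in stacking can be sketched with toy models. Everything here is an assumption made for illustration: the classes and data are invented, the final estimator is a simple lookup table rather than logistic regression, and the final estimator is trained on a single hold-out split, whereas scikit-learn's `StackingClassifier` uses cross-validated predictions.

```python
from collections import Counter, defaultdict


class Stump:
    """Base learner 1: a threshold on one feature."""
    def __init__(self, feature):
        self.feature = feature

    def fit(self, X, y):
        values = [row[self.feature] for row in X]
        candidates = [min(values) - 1] + sorted(set(values))
        self.t = min(candidates, key=lambda t: sum(
            int(v > t) != label for v, label in zip(values, y)))
        return self

    def predict(self, X):
        return [int(row[self.feature] > self.t) for row in X]


class OneNearestNeighbour:
    """Base learner 2: copy the label of the closest training row."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        def closest(row):
            dists = [sum((a - b) ** 2 for a, b in zip(row, other))
                     for other in self.X]
            return self.y[dists.index(min(dists))]
        return [closest(row) for row in X]


class LookupMeta:
    """Final estimator: majority label seen for each tuple of base predictions."""
    def fit(self, Z, y):
        table = defaultdict(Counter)
        for z, label in zip(Z, y):
            table[tuple(z)][label] += 1
        self.table = {z: c.most_common(1)[0][0] for z, c in table.items()}
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, Z):
        return [self.table.get(tuple(z), self.default) for z in Z]


def stack_fit(X_base, y_base, X_meta, y_meta):
    base = [Stump(feature=0).fit(X_base, y_base),
            OneNearestNeighbour().fit(X_base, y_base)]
    # base predictions become the features of the final estimator
    Z = list(zip(*(m.predict(X_meta) for m in base)))
    return base, LookupMeta().fit(Z, y_meta)


def stack_predict(base, meta, X):
    Z = list(zip(*(m.predict(X) for m in base)))
    return meta.predict(Z)


X_base, y_base = [[1, 1], [2, 9], [8, 2], [9, 8]], [0, 0, 1, 1]
X_meta, y_meta = [[3, 5], [7, 5], [1, 8], [9, 1]], [0, 1, 0, 1]
base, meta = stack_fit(X_base, y_base, X_meta, y_meta)
print(stack_predict(base, meta, [[2, 2], [8, 8]]))
```

The final estimator never sees the original features, only what the base learners predicted, so it learns how much to trust each base learner.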



In [27]:
# implementation of stacking
# creating base estimators
base_learners = [
                 ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
                 ('knn', KNeighborsClassifier(n_neighbors=5))
]             

# creating the final estimator
stacking_model = StackingClassifier(
    estimators = base_learners, 
    final_estimator = LogisticRegression()
)


In [28]:
## Implementing Stacking classifier
kf3 = KFold(n_splits=3, shuffle=True, random_state=42)

i = 1
accuracy_total = list()
for train_index, test_index in kf3.split(dataframe_read):
        
        X_train = dataframe_read.iloc[train_index].loc[:, features]
        X_test = dataframe_read.iloc[test_index].loc[:, features]
        y_train = dataframe_read.iloc[train_index].loc[:,'class_name']
        y_test = dataframe_read.iloc[test_index].loc[:, 'class_name']
        
        stacking_model.fit(X_train, y_train)
        
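        # score the model fitted in this fold; the output recorded below was
        # produced by model.predict (the random forest), so rerun for stacking figures
        accuracy = accuracy_score(
            y_test, stacking_model.predict(X_test)
        )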
        
        print(
            "Accuracy during {} iterations is: {}".format(
             i, accuracy)
        )
        i +=1
        accuracy_total.append(accuracy) 
print(
    "Overall accuracy achieved using K fold is: {}".format(
        mean(accuracy_total))
)                             

Accuracy during 1 iterations is: 0.7578125
Accuracy during 2 iterations is: 0.765625
Accuracy during 3 iterations is: 0.73046875
Overall accuracy achieved using K fold is: 0.7513020833333334
