# Introduction
Machine learning offers many techniques for building models and improving their performance. Ensemble learning methods, which combine the predictions of several models, can improve the accuracy of classification and regression models.


## What is Bagging?
Bagging, also known as **Bootstrap aggregating**, is an ensemble learning technique commonly used in classification and regression tasks to improve the performance and robustness of predictive models.<br>
It addresses the bias-variance trade-off by reducing the variance of a prediction model: several models are trained on bootstrap samples of the training data, and their predictions are averaged (regression) or combined by voting (classification). Bagging helps reduce overfitting and is especially common with decision tree algorithms.

![Alt text](image.png)
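
To make the two steps in the name concrete, here is a minimal hand-rolled sketch of bagging. It assumes scikit-learn's built-in Iris data, an ensemble size of 25, and majority voting as the aggregation rule; these choices (and names such as `n_models` and `votes`) are illustrative and not taken from the notebook below.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_models = 25  # hypothetical ensemble size

trees = []
for _ in range(n_models):
    # Bootstrap step: draw len(X) row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregation step: collect every tree's prediction, shape (n_models, n_samples)
all_preds = np.stack([tree.predict(X) for tree in trees])

# Majority vote per sample (Iris has 3 classes, labelled 0, 1 and 2)
votes = np.apply_along_axis(
    lambda col: np.bincount(col, minlength=3).argmax(), 0, all_preds
)

print("Training accuracy of the voted ensemble:", (votes == y).mean())
```

Majority voting is only one way to aggregate; averaging predicted class probabilities is another common choice.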

## What is Bootstrapping?
Bootstrapping is the method of repeatedly drawing random samples from the available data with replacement, typically to estimate a population parameter.

![Alt text](image-1.png)
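
As a small illustration of sampling with replacement, the sketch below estimates the uncertainty of a sample mean. The data are randomly generated for the example, and the number of resamples (1000) is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up observations standing in for a sample drawn from some population
data = rng.normal(loc=50.0, scale=10.0, size=200)

n_resamples = 1000  # arbitrary choice for this sketch
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    # Draw a resample of the same size, with replacement
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means[i] = resample.mean()

# Percentile interval for the mean, based on the bootstrap distribution
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"Sample mean: {data.mean():.2f}")
print(f"95% bootstrap interval for the mean: [{low:.2f}, {high:.2f}]")
```

Because each resample is drawn with replacement, some observations appear several times and others not at all, which is what makes the resamples differ from one another.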

## Advantages of Bagging in Machine Learning
- It reduces overfitting by lowering the variance of the model. <br>
- It often improves the model's accuracy. <br>
- It can handle high-dimensional data well. <br>
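
One way to check these claims on a given dataset is to cross-validate a single decision tree and a bagged ensemble side by side. The sketch below assumes the Iris data and 50 trees, both chosen only for illustration; on a small, easy dataset like this the two models may well score similarly.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=7)

single_tree = DecisionTreeClassifier(random_state=7)
# The base estimator is passed positionally, which works regardless of
# whether the installed version calls the parameter estimator or base_estimator
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=7), n_estimators=50, random_state=7
)

tree_scores = cross_val_score(single_tree, X, y, cv=cv)
bag_scores = cross_val_score(bagged_trees, X, y, cv=cv)

# Compare mean accuracy and the spread across folds
print(f"Single tree:  mean={tree_scores.mean():.3f}, std={tree_scores.std():.3f}")
print(f"Bagged trees: mean={bag_scores.mean():.3f}, std={bag_scores.std():.3f}")
```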



# Bagging in Python using IRIS Dataset

In [4]:
# Import libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [6]:
# Load the dataset

url = 'https://raw.githubusercontent.com/kevincwu0/rnn-google-stock-prediction/master/Google_Stock_Price_Train.csv'
df = pd.read_csv(url)

In [7]:
df.head(10)

| | Date | Open | High | Low | Close | Volume |
|---|---|---|---|---|---|---|
| 0 | 1/3/2012 | 325.25 | 332.83 | 324.97 | 663.59 | 7380500 |
| 1 | 1/4/2012 | 331.27 | 333.87 | 329.08 | 666.45 | 5749400 |
| 2 | 1/5/2012 | 329.83 | 330.75 | 326.89 | 657.21 | 6590300 |
| 3 | 1/6/2012 | 328.34 | 328.77 | 323.68 | 648.24 | 5405900 |
| 4 | 1/9/2012 | 322.04 | 322.29 | 309.46 | 620.76 | 11688800 |
| 5 | 1/10/2012 | 313.7 | 315.72 | 307.3 | 621.43 | 8824000 |
| 6 | 1/11/2012 | 310.59 | 313.52 | 309.4 | 624.25 | 4817800 |
| 7 | 1/12/2012 | 314.43 | 315.26 | 312.08 | 627.92 | 3764400 |
| 8 | 1/13/2012 | 311.96 | 312.3 | 309.37 | 623.28 | 4631800 |
| 9 | 1/17/2012 | 314.81 | 314.81 | 311.67 | 626.86 | 3832800 |


In [24]:
from sklearn.model_selection import train_test_split

# X: feature columns, Y: target column
X = df.drop('Volume', axis=1)
Y = df['Volume']

# Split the dataset into training and testing sets
X_fit, X_eval, y_fit, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)

**Explanation** <br>

- **X** : feature matrix (input data) <br>
- **Y** : target vector (the values to predict) <br>

The **test_size** parameter specifies the proportion of the dataset that should be allocated to the testing set. In this case, 30% of the data will be used for testing, and 70% will be used for training.<br>

The **random_state** parameter is used to seed the random number generator, ensuring that the split is reproducible. If you use the same value for random_state in future runs, you should get the same split.<br>

- **X_fit**: Training features <br>
- **X_eval**: Testing features <br>
- **y_fit**: Training labels <br>
- **y_test**: Testing labels <br>
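
The effect of **test_size** and **random_state** can be checked on a toy array. The sketch below uses ten made-up rows rather than the stock data, and the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Same arguments as in the cell above: 30% test, seed 1
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.30, random_state=1)

# Repeating the call with the same seed reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.30, random_state=1)

print(X_tr.shape, X_te.shape)        # (7, 2) (3, 2)
print(np.array_equal(X_te, X_te2))   # True
```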

In [12]:
from sklearn.model_selection import KFold

# Define 10 shuffled cross-validation folds for evaluating the model
seed = 7
kf = KFold(n_splits=10, shuffle=True, random_state=seed)

**Explanation** <br>
- The random seed is set to 7 to ensure reproducibility in the random shuffling of the data.<br>
- The KFold object **kf** is created with **n_splits=10**, so the dataset will be split into 10 folds for cross-validation.<br>
- The **shuffle=True** parameter indicates that the data will be shuffled before splitting.<br>
- The **random_state=seed** ensures consistent shuffling across different runs.<br>
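
To see what the folds look like, the sketch below applies the same KFold settings to 20 made-up rows. Each row lands in exactly one test fold, so the folds partition the data without replacement (unlike bootstrap samples).

```python
import numpy as np
from sklearn.model_selection import KFold

# Twenty made-up rows, one feature each
X_toy = np.arange(20).reshape(20, 1)

kf_demo = KFold(n_splits=10, shuffle=True, random_state=7)

tested = []
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(X_toy), start=1):
    tested.extend(test_idx.tolist())
    print(f"Fold {fold}: {len(train_idx)} training rows, test rows {test_idx.tolist()}")

# Every row index appears in exactly one test fold
print(sorted(tested) == list(range(20)))  # True
```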



In [16]:
# Define decision tree classifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Create a decision tree classifier instance
clf = DecisionTreeClassifier(random_state=seed)
num_trees = 100

In [20]:
from sklearn.ensemble import BaggingClassifier

# Create a classification model for bagging
model = BaggingClassifier(estimator=clf, n_estimators=num_trees, random_state=seed)

**Explanation** <br>
- **num_trees**  is the number of base estimators (trees) you want to include in the ensemble. It determines how many individual models will be trained and combined.<br>

We created a **BaggingClassifier** instance named **model** and passed the following parameters: <br>

- **estimator=clf**: This specifies that the base estimator for the ensemble is the clf instance, which is a **DecisionTreeClassifier**. In scikit-learn versions before 1.2 this parameter was named **base_estimator**. <br>
- **n_estimators=num_trees**: This sets the number of base estimators (trees) in the ensemble to the value of num_trees.<br>
- **random_state=seed**: This sets the random seed for reproducibility to the value of seed.<br>

In summary, the code creates a BaggingClassifier model for classification. It uses a DecisionTreeClassifier as the base estimator and builds an ensemble of num_trees decision trees, each trained on a bootstrap sample of the training data. Fixing random_state makes the results reproducible.
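
For reference, here is a minimal end-to-end sketch of the same configuration on scikit-learn's built-in Iris data, using the `load_iris` and `accuracy_score` imports from the cell above. The choice of Iris and the variable names are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris features are all numeric and the target is a class label
X_iris, y_iris = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.30, random_state=1
)

seed = 7
base_tree = DecisionTreeClassifier(random_state=seed)

# The base estimator is passed positionally so the call does not depend on
# whether the parameter is named estimator or base_estimator
bagging = BaggingClassifier(base_tree, n_estimators=100, random_state=seed)
bagging.fit(X_train, y_train)

y_pred = bagging.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Number of fitted trees: {len(bagging.estimators_)}")
print(f"Test accuracy: {acc:.2f}")
```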

In [22]:
from sklearn.model_selection import cross_val_score

# Perform cross-validation
cv = KFold(n_splits=num_trees, shuffle=True, random_state=seed)
results = cross_val_score(model, X_fit, y_fit, cv=cv)

# Print the accuracy for each fold
for i, accuracy in enumerate(results):
    print(f'Fold {i+1} Accuracy: {accuracy:.2f}')

ValueError: 
All the 100 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
99 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\gitom\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\gitom\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\ensemble\_bagging.py", line 337, in fit
    return self._fit(X, y, self.max_samples, sample_weight=sample_weight)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\gitom\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\ensemble\_bagging.py", line 472, in _fit
    all_results = Parallel(
                  ^^^^^^^^^
  File "c:\Users\gitom\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\utils\parallel.py", line 63, in __call__
    return super().__call__(iterable_with_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\gitom\AppData\Local\Programs\Python\Python311\Lib\site-packages\joblib\parallel.py", line 1085, in __call__
    if self.dispatch_one_batch(iterator):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\gitom\AppData\Local\Programs\Python\Python311\Lib\site-packages\joblib\parallel.py", line 901, in dispatch_one_batch
    self._dispatch(tasks)
  File "c:\Users\gitom\AppData\Local\Programs\Python\Python311\Lib\site-packages\joblib\parallel.py", line 819, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\gitom\AppData\Local\Programs\Python\Python311\Lib\site-packages\joblib\_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
             ^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\gitom\AppData\Local\Programs\Python\Python311\Lib\site-packages\joblib\_parallel_backends.py", line 597, in __init__
    self.results = batch()
                   ^^^^^^^
  File "c:\Users\gitom\AppData\Local\Programs\Python\Python311\Lib\site-packages\joblib\parallel.py", line 288, in __call__
    return [func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\gitom\AppData\Local\Programs\Python\Python311\Lib\site-packages\joblib\parallel.py", line 288, in <listcomp>
    return [func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\gitom\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\utils\parallel.py", line 123, in __call__
    return self.function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\gitom\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\ensemble\_bagging.py", line 141, in _parallel_build_estimators
    estimator_fit(X_, y, sample_weight=curr_sample_weight)
  File "c:\Users\gitom\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\tree\_classes.py", line 889, in fit
    super().fit(
  File "c:\Users\gitom\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\tree\_classes.py", line 186, in fit
    X, y = self._validate_data(
           ^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\gitom\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py", line 560, in _validate_data
    X = check_array(X, input_name="X", **check_X_params)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\gitom\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\utils\validation.py", line 879, in check_array
    array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\gitom\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\utils\_array_api.py", line 185, in _asarray_with_order
    array = numpy.asarray(array, order=order, dtype=dtype)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: could not convert string to float: '8/24/2015'

--------------------------------------------------------------------------------
1 fits failed with the following error:
Traceback (most recent call last):
  ... (remaining frames are identical to the traceback above) ...
ValueError: could not convert string to float: '7/6/2015'
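
The last line of each traceback points at the cause: **X** still contains the **Date** column, whose values are strings such as '8/24/2015', and the decision tree cannot convert them to floats. The sketch below shows one possible way to get past the error; it is not the notebook's original method. It rebuilds the ten rows printed by `df.head(10)` as a small stand-in for the full CSV, drops **Date** from the features, and, as an assumption, swaps in `BaggingRegressor` with a `DecisionTreeRegressor` because **Volume** is a continuous quantity rather than a class label. It also uses 5 folds instead of 100, since the stand-in has only 10 rows.

```python
import pandas as pd
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# The ten rows printed by df.head(10), used as a small stand-in for the full CSV
sample = pd.DataFrame({
    "Date": ["1/3/2012", "1/4/2012", "1/5/2012", "1/6/2012", "1/9/2012",
             "1/10/2012", "1/11/2012", "1/12/2012", "1/13/2012", "1/17/2012"],
    "Open": [325.25, 331.27, 329.83, 328.34, 322.04, 313.7, 310.59, 314.43, 311.96, 314.81],
    "High": [332.83, 333.87, 330.75, 328.77, 322.29, 315.72, 313.52, 315.26, 312.3, 314.81],
    "Low": [324.97, 329.08, 326.89, 323.68, 309.46, 307.3, 309.4, 312.08, 309.37, 311.67],
    "Close": [663.59, 666.45, 657.21, 648.24, 620.76, 621.43, 624.25, 627.92, 623.28, 626.86],
    "Volume": [7380500, 5749400, 6590300, 5405900, 11688800,
               8824000, 4817800, 3764400, 4631800, 3832800],
})

# Date holds text, which is the column the estimator could not convert
date_is_numeric = pd.api.types.is_numeric_dtype(sample["Date"])
print("Date is numeric:", date_is_numeric)

# Keep only the numeric feature columns; Volume is the target
X_num = sample.drop(columns=["Date", "Volume"])
y_vol = sample["Volume"]

# Assumption: a regressor replaces the classifier because Volume is continuous
reg_model = BaggingRegressor(
    DecisionTreeRegressor(random_state=7), n_estimators=10, random_state=7
)

# 5 folds here because the stand-in data has only 10 rows
cv_small = KFold(n_splits=5, shuffle=True, random_state=7)
scores = cross_val_score(
    reg_model, X_num, y_vol, cv=cv_small, scoring="neg_mean_absolute_error"
)

# scikit-learn reports the negated error, so flip the sign for display
for i, score in enumerate(scores, start=1):
    print(f"Fold {i} mean absolute error: {-score:,.0f}")
```

With the full dataset, the same idea applies: remove or numerically encode any text columns before fitting, and pick a classifier or regressor to match the type of the target.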
