Q1. What is an ensemble technique in machine learning?

An ensemble technique combines the predictions of several models (often called base learners) into a single prediction, with the aim of being more accurate or more stable than any one of those models on its own. Bagging and boosting are the best-known ensemble methods, and both are easiest to understand in terms of bias and variance. Before getting into the what, why, and how of ensembles, a real-world example helps make the core idea concrete.

Example 1: Suppose you are planning to buy an air conditioner. Would you walk into a showroom and buy whichever model the salesperson shows you? Probably not. You are more likely to ask friends, family, and colleagues for their opinions, compare models on a few portals, and read review sites before deciding. In a nutshell, you would not rely on a single source; you would make a more informed decision after weighing diverse opinions and reviews. Ensemble learning applies the same principle: instead of trusting one model, it consults several and combines their answers.

Q2. Why are ensemble techniques used in machine learning?

Ensemble techniques are used to improve predictive accuracy beyond what a single model achieves. Bagging, boosting, and stacking are the three most popular ensemble learning techniques, and each takes a different approach: bagging mainly reduces variance, boosting mainly reduces bias by correcting earlier errors, and stacking combines different kinds of models through a meta-model. Which one to use depends on the problem and on how the base models fail. Because the three are easy to confuse, knowing when and why to use each takes some care.

Q3. What is bagging?

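We use bagging to combine weak learners that have high variance. Bagging aims to produce a model with lower variance than the individual weak models. These weak learners are homogeneous, meaning they are all of the same type.

Bagging is also known as bootstrap aggregating. It consists of two steps: bootstrapping (drawing random samples with replacement from the training data and training one model on each sample) and aggregation (combining the models' predictions, by averaging for regression or by majority vote for classification).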
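The two steps can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the noisy sine data, the degree-5 polynomial used as a high-variance base learner, and the helper names `fit_base_model` and `bagged_predict` are all hypothetical choices, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a noisy sine curve
x = np.sort(rng.uniform(0, 1, 60))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)

def fit_base_model(x_train, y_train, degree=5):
    """Base learner (assumed for illustration): a polynomial least-squares fit."""
    return np.polynomial.Polynomial.fit(x_train, y_train, degree)

def bagged_predict(x_train, y_train, x_new, n_models=50):
    """Train n_models base learners on bootstrap samples and average them."""
    preds = np.empty((n_models, x_new.size))
    for m in range(n_models):
        # Bootstrapping: draw n row indices with replacement
        idx = rng.integers(0, x_train.size, x_train.size)
        model = fit_base_model(x_train[idx], y_train[idx])
        preds[m] = model(x_new)
    # Aggregation: average the individual predictions (regression case)
    return preds.mean(axis=0)

x_grid = np.linspace(0.05, 0.95, 20)
bagged = bagged_predict(x, y, x_grid)
print(bagged[:5])
```

For a classification problem, the averaging in the last line of `bagged_predict` would typically be replaced by a majority vote over the predicted labels.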

Q4. What is boosting?

Boosting trains weak learners sequentially, with each learner trying to correct the errors of the ones before it. A sample is first drawn from the initial dataset and used to train the first model, which then makes its predictions. Each observation is predicted either correctly or incorrectly. The incorrectly predicted observations are reused (in practice, usually by giving them more weight) when training the next model. In this way, each subsequent model improves on the errors of its predecessors.
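One common way to realize this idea is AdaBoost-style reweighting, sketched below with one-dimensional decision stumps as the weak learners. The choice of AdaBoost, the toy data, and the helper names (`fit_stump`, `stump_predict`, `adaboost`, `predict`) are assumptions made for illustration; the text above does not prescribe a specific algorithm.

```python
import numpy as np

def stump_predict(x, thr, polarity):
    """Predict -polarity to the left of thr and +polarity elsewhere."""
    return np.where(x < thr, -polarity, polarity)

def fit_stump(x, y, w):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = (np.inf, None, None)
    for thr in np.unique(x):
        for polarity in (1, -1):
            err = w[stump_predict(x, thr, polarity) != y].sum()
            if err < best[0]:
                best = (err, thr, polarity)
    return best

def adaboost(x, y, n_rounds=20):
    w = np.full(x.size, 1.0 / x.size)          # start with equal weights
    ensemble = []
    for _ in range(n_rounds):
        err, thr, pol = fit_stump(x, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # this learner's vote strength
        pred = stump_predict(x, thr, pol)
        # Misclassified points (y * pred == -1) get larger weights,
        # so the next stump concentrates on them
        w = w * np.exp(-alpha * y * pred)
        w = w / w.sum()
        ensemble.append((alpha, thr, pol))
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return np.sign(score)

# Hypothetical data: label +1 inside (0.3, 0.7) and -1 outside,
# a pattern that no single stump can represent
x = np.linspace(0, 1, 50)
y = np.where((x > 0.3) & (x < 0.7), 1, -1)

model = adaboost(x, y)
accuracy = np.mean(predict(model, x) == y)
print("training accuracy:", accuracy)
```

The reported accuracy is on the training data only, so it illustrates how the sequence of learners corrects earlier mistakes rather than how well the model generalizes.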

Q5. What are the benefits of using ensemble techniques?

Ensemble methods offer several advantages over single models, such as improved accuracy and performance, especially for complex and noisy problems. They can also reduce the risk of overfitting and underfitting by balancing the trade-off between bias and variance, and by using different subsets and features of the data. Furthermore, ensemble methods can handle different types of data and tasks, such as classification, regression, clustering, and anomaly detection, by using different types of base models and aggregation methods. Additionally, they can provide more confidence and reliability by measuring the diversity and agreement of the base models, and by providing confidence intervals and error estimates for the predictions.

Q6. Are ensemble techniques always better than individual models?

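There is no absolute guarantee that an ensemble performs better than an individual model. However, if you combine many models, and each individual model is a weak learner that does at least slightly better than chance, the ensemble should generally outperform any single one of them, particularly when the models' errors are not strongly correlated.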
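A small calculation illustrates both halves of this answer. Under the simplifying assumption (not stated in the original) that the classifiers make independent errors and each is correct with probability `p`, the accuracy of a majority vote follows from the binomial distribution. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent classifiers,
    each correct with probability p, is correct (n should be odd)."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# Slightly better than chance: the vote improves as models are added
for n in (1, 5, 25, 101):
    print("p=0.6, n =", n, "->", round(majority_vote_accuracy(0.6, n), 3))

# Worse than chance: adding models makes the vote worse
for n in (1, 5, 25, 101):
    print("p=0.4, n =", n, "->", round(majority_vote_accuracy(0.4, n), 3))
```

Real base models are rarely independent, so the gains in practice tend to be smaller than this idealized calculation suggests.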

Q7. How is the confidence interval calculated using bootstrap?

In statistics, a confidence interval is a range of values that is used to estimate a population parameter, such as a mean, median, or variance, along with a level of confidence that the true parameter falls within that range. Bootstrap is a resampling technique used to estimate the sampling distribution of a statistic and, in turn, to calculate confidence intervals. Here's how you can calculate a confidence interval using bootstrap:

1. Data Resampling: Start with your original sample data, and create a large number of bootstrap samples by randomly sampling with replacement. Each bootstrap sample should have the same size as the original data.

2. Calculate Statistic: For each of the bootstrap samples, compute the statistic of interest. This could be the mean, median, standard deviation, or any other statistic you want to estimate.

3. Create a Sampling Distribution: After calculating the statistic for each bootstrap sample, you will have a distribution of the statistic. This distribution approximates the sampling distribution of the statistic.

4. Quantiles: Calculate the desired quantiles from the sampling distribution. The quantiles you choose depend on the level of confidence you want for your interval. For example, for a 95% confidence interval, you would typically calculate the 2.5th and 97.5th percentiles of the sampling distribution.

5. Confidence Interval: The quantiles obtained in step 4 represent the lower and upper bounds of the confidence interval. You can use them to estimate the population parameter.
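The procedure above works for any statistic, not only the mean. The sketch below wraps it in a small helper and applies it to the median of a skewed sample. The function name `bootstrap_ci`, its parameters, and the simulated `waiting_times` data are hypothetical and chosen only for illustration.

```python
import numpy as np

def bootstrap_ci(data, stat_fn=np.mean, n_resamples=5000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat_fn(data)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = data.size
    # Data resampling and statistic calculation: one statistic per
    # bootstrap sample, each sample drawn with replacement at size n
    stats = np.array([
        stat_fn(rng.choice(data, size=n, replace=True))
        for _ in range(n_resamples)
    ])
    # Quantiles of the bootstrap distribution give the interval bounds
    alpha = 1.0 - confidence
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

# Hypothetical skewed sample (simulated, not real measurements)
rng = np.random.default_rng(1)
waiting_times = rng.exponential(scale=10.0, size=80)

low, high = bootstrap_ci(waiting_times, stat_fn=np.median)
print(f"95% CI for the median: ({low:.2f}, {high:.2f})")
```

Passing `confidence=0.90` would use the 5th and 95th percentiles instead of the 2.5th and 97.5th.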

Q8. How does bootstrap work and what are the steps involved in bootstrap?

The bootstrap works by repeatedly resampling the available data with replacement and recomputing the quantity of interest on each resample. The procedure for using it to estimate the skill of a model can be summarized as follows:
1. Choose a number of bootstrap samples to perform.
2. Choose a sample size.
3. For each bootstrap sample:
   a. Draw a sample with replacement with the chosen size.
   b. Fit a model on the data sample.
   c. Estimate the skill of the model on the out-of-bag sample (the observations that were not drawn into that bootstrap sample).
4. Calculate the mean of the sample of model skill estimates.
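The procedure can be sketched as follows. The linear toy data, the straight-line model fitted with `np.polyfit`, and mean squared error as the skill measure are all assumptions made for this illustration; any model and metric could be substituted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 3x + 2 plus unit-variance noise
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, x.size)

n_bootstraps = 200       # number of bootstrap samples
sample_size = x.size     # size of each bootstrap sample
scores = []

for _ in range(n_bootstraps):
    # Draw row indices with replacement
    idx = rng.integers(0, x.size, sample_size)
    # Out-of-bag rows: those never drawn into this bootstrap sample
    oob = np.setdiff1d(np.arange(x.size), idx)
    if oob.size == 0:
        continue  # extremely unlikely, but then there is nothing to test on
    # Fit the model on the bootstrap sample
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    # Estimate skill (mean squared error) on the out-of-bag rows
    pred = slope * x[oob] + intercept
    scores.append(np.mean((y[oob] - pred) ** 2))

# Average the skill estimates over all bootstrap samples
mean_mse = np.mean(scores)
print(f"Mean out-of-bag MSE over {len(scores)} resamples: {mean_mse:.3f}")
```

Because each model is scored only on rows it did not see during fitting, the averaged score serves as an estimate of performance on unseen data.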

Q9. A researcher wants to estimate the mean height of a population of trees. They measure the height of a
sample of 50 trees and obtain a mean height of 15 meters and a standard deviation of 2 meters. Use
bootstrap to estimate the 95% confidence interval for the population mean height.

In [2]:
import numpy as np

# Sample data
sample_mean = 15
sample_std = 2
sample_size = 50

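# Number of resamples
num_resamples = 10000

# The raw heights are not given, so simulate ONE sample (assumption: heights
# are roughly normal) and rescale it to match the reported mean and std exactly.
rng = np.random.default_rng(42)
sample = rng.normal(size=sample_size)
sample = sample_mean + sample_std * (sample - sample.mean()) / sample.std(ddof=1)

# Create an array to store the resample means
resample_means = np.zeros(num_resamples)

# Perform the bootstrap
for i in range(num_resamples):
    # Resample the same observed sample with replacement
    resample = rng.choice(sample, size=sample_size, replace=True)

    # Calculate the mean of the resample
    resample_means[i] = np.mean(resample)

# Calculate the 95% confidence interval
lower_bound = np.percentile(resample_means, 2.5)
upper_bound = np.percentile(resample_means, 97.5)

print("95% Confidence Interval: ({:.2f}, {:.2f})".format(lower_bound, upper_bound))

The exact bounds vary slightly with the random seed, but they should land close to 15 ± 0.55 meters, in line with the normal-theory interval 15 ± 1.96 · 2/√50.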
