In [None]:
# Q1. What is an ensemble technique in machine learning?
Ans.
An ensemble technique in machine learning involves combining multiple individual models to create a stronger
and more robust predictive model. The idea behind ensemble methods is that by combining the predictions of 
multiple models, the overall performance can be better than that of any individual model. This can lead to
improved generalization, better accuracy, and increased stability.
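
As a minimal sketch of what "combining the predictions" can mean in practice, the snippet below aggregates the outputs of three models per sample: a majority vote for class labels and a plain average for numeric predictions. The model outputs are hypothetical values made up for illustration, and the helper names `majority_vote` and `average_prediction` are not from any library.

```python
from collections import Counter
from statistics import mean

def majority_vote(predictions):
    """Return the most common label among the models' predictions for one sample."""
    return Counter(predictions).most_common(1)[0][0]

def average_prediction(predictions):
    """Return the mean of the models' numeric predictions for one sample."""
    return mean(predictions)

# Hypothetical outputs of three classifiers (one row per sample).
class_preds = [
    ["cat", "cat", "dog"],
    ["dog", "dog", "dog"],
    ["cat", "dog", "dog"],
    ["cat", "cat", "cat"],
]
ensemble_labels = [majority_vote(row) for row in class_preds]
print(ensemble_labels)  # ['cat', 'dog', 'dog', 'cat']

# Hypothetical outputs of three regressors for a single sample.
print(average_prediction([14.0, 15.5, 15.5]))  # 15.0
```

A single model's mistake (the lone "dog" in the first row) is outvoted by the other two models, which is the basic mechanism behind the stability gain described above.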

In [None]:
# Q2. Why are ensemble techniques used in machine learning?
Ans.
Ensemble techniques are used in machine learning for the following reasons:
1. Improved Accuracy: Ensemble methods can often achieve higher accuracy than individual models. By combining
the predictions of multiple models, the ensemble can better capture complex patterns in the data, leading to 
improved generalization.

2. Reduction of Overfitting: Ensemble techniques, especially bagging, help reduce overfitting by averaging or
combining predictions from multiple models. This is particularly beneficial when dealing with high-variance 
models that may perform well on the training data but poorly on new, unseen data.

3. Handling Different Aspects of Data: Different models may excel at capturing different aspects or patterns
within the data. Ensemble techniques allow combining these strengths, leading to a more comprehensive 
understanding of the underlying relationships in the data.

4. Increased Stability: Ensemble methods tend to provide more stable and reliable predictions. This is crucial
in scenarios where small changes in the training data can lead to significant variations in the model's output.

5. Easy Parallelization: Some ensemble methods, like bagging, are highly parallelizable. Training individual 
models can be done independently, making ensembles suitable for distributed computing environments.

In [None]:
# Q3. What is bagging?
Ans. 
Bagging (bootstrap aggregating) involves training multiple instances of the same learning algorithm on different bootstrap
samples of the training data, i.e. random subsets drawn with replacement. The final prediction is usually an average (for
regression) or a majority vote (for classification) of the predictions from the individual models. Bagging aims to reduce
overfitting and variance by combining models that differ because each one was trained on a different sample of the data.
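
The idea can be sketched in a few lines of plain Python. This is an illustrative sketch, not a production implementation: the base learner is a 1-nearest-neighbour regressor (chosen here only because it is a high-variance model), the data are synthetic, and the function names `fit_1nn`, `bagging_fit` and `bagging_predict` are made up for this example.

```python
import random
from statistics import mean

def fit_1nn(xs, ys):
    """Return a 1-nearest-neighbour regressor, a deliberately high-variance base model."""
    points = list(zip(xs, ys))
    def predict(x):
        # Predict the target of the closest training point.
        return min(points, key=lambda p: abs(p[0] - x))[1]
    return predict

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train n_models base models, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # indices drawn with replacement
        models.append(fit_1nn([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Average the base models' predictions (a vote would be used for classification)."""
    return mean(m(x) for m in models)

# Synthetic data (an assumption for illustration): y = 2x plus Gaussian noise.
data_rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2 * x + data_rng.gauss(0, 1) for x in xs]

models = bagging_fit(xs, ys)
print(f"single 1-NN at x=2.5: {fit_1nn(xs, ys)(2.5):.2f}")
print(f"bagged 1-NN at x=2.5: {bagging_predict(models, 2.5):.2f}")
```

Because each base model sees a slightly different sample, their individual noise-driven errors partly cancel when averaged. In practice a library implementation such as scikit-learn's `BaggingRegressor` or `BaggingClassifier` would normally be used instead.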

In [None]:
# Q4. What is boosting?
Ans.
Boosting trains multiple weak learners (models only slightly better than random chance) sequentially, so that each new
model concentrates on the mistakes of the models before it. In AdaBoost this is done by giving more weight to instances
that were misclassified by the previous models; in gradient boosting each new model is fitted to the remaining errors of
the current ensemble. Boosting aims to improve accuracy by iteratively emphasizing difficult-to-learn examples, leading to
a strong predictive model. Popular algorithms include AdaBoost and gradient boosting (implemented in libraries such as
XGBoost and LightGBM).
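
The reweighting idea can be illustrated with a simplified AdaBoost sketch that uses one-dimensional decision stumps as weak learners. The toy dataset and all function names below are assumptions made for this example; labels are encoded as +1 and -1, and no single stump can classify this data perfectly.

```python
import math

def stump_predict(x, threshold, polarity):
    """Decision stump: predict `polarity` when x >= threshold, otherwise `-polarity`."""
    return polarity if x >= threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost_fit(xs, ys, n_rounds=3):
    """Train n_rounds stumps sequentially, reweighting the samples after each round."""
    n = len(xs)
    weights = [1 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, threshold, polarity))
        # Misclassified samples (y * prediction == -1) get larger weights.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data (hypothetical): no single threshold separates the two classes.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost_fit(xs, ys)
predictions = [adaboost_predict(ensemble, x) for x in xs]
accuracy = sum(p == y for p, y in zip(predictions, ys)) / len(ys)
print(f"training accuracy after {len(ensemble)} rounds: {accuracy:.2f}")
```

The best single stump misclassifies three of the ten points, while the weighted vote of three stumps fits this toy training set, which shows how weak learners can combine into a stronger one.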

In [None]:
# Q5. What are the benefits of using ensemble techniques?
Ans.
Ensemble techniques offer several benefits in machine learning:
1. Improved Accuracy: Ensembles often achieve higher accuracy than individual models by combining the strengths of 
multiple models, capturing a broader range of patterns in the data.

2. Reduction of Overfitting: Ensembles, especially bagging, help reduce overfitting by averaging or combining predictions
from multiple models. This is beneficial when dealing with high-variance models that may perform well on training data 
but poorly on new data.

3. Enhanced Robustness: Ensembles are more robust to outliers and noise in the data. Individual models might make errors on
certain instances, but by aggregating predictions, the impact of these errors is mitigated.

4. Easy Parallelization: Some ensemble methods, like bagging, are highly parallelizable. Training individual models can be
done independently, making ensembles suitable for distributed computing environments.

5. Complementary Errors: When the individual models in an ensemble are diverse, they tend to make different errors, so
combining them helps compensate for individual weaknesses.

In [None]:
# Q6. Are ensemble techniques always better than individual models?
Ans. 
Ensemble techniques are not always better than individual models, but they often outperform single models, especially when
dealing with complex or noisy data. The effectiveness of an ensemble depends on factors such as the diversity of the base
models, the quality of the individual models, and the nature of the data. Ensembles shine in situations where different models
capture different aspects of the underlying patterns, leading to improved overall performance. However, on simpler or
well-behaved datasets, where a single strong model may suffice, the benefits of ensembles might be less pronounced, and the
extra training cost and reduced interpretability may not be worth it. It's essential to consider the specific characteristics
of the problem at hand when deciding whether to use ensemble techniques.
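
A small calculation makes the dependence on base-model quality concrete. The sketch below assumes a binary classification task and, importantly, that the models' errors are independent (an idealization that real ensembles only approximate). Under those assumptions, the probability that a majority vote is correct follows a binomial sum. The function name `majority_vote_accuracy` is made up for this example.

```python
from math import comb

def majority_vote_accuracy(p, n_models):
    """Probability that a majority of n_models independent classifiers is correct.

    Each classifier is assumed to be correct with probability p; n_models should be odd
    so that ties cannot occur.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n_models, k) * p**k * (1 - p)**(n_models - k)
               for k in range(k_min, n_models + 1))

for p in (0.45, 0.50, 0.60, 0.90):
    print(f"single-model accuracy {p:.2f} -> 11-model vote {majority_vote_accuracy(p, 11):.3f}")
```

Voting amplifies whatever the base models tend to do: models better than chance become more accurate together, models worse than chance become less accurate, and models that are already very accurate leave little room for improvement.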

In [None]:
# Q7. How is the confidence interval calculated using bootstrap?
Ans.
Calculating a confidence interval using bootstrap involves the following steps:

1. Data Resampling:
Start with your original sample data. Generate multiple bootstrap samples by randomly sampling with replacement from the 
original data. Each bootstrap sample should have the same size as the original dataset.

2. Statistic Calculation:
For each bootstrap sample, calculate the statistic of interest. In the case of estimating the population mean, this would be the
sample mean.

3. Create Bootstrap Distribution:
Create a distribution of the calculated statistics from the bootstrap samples.

4. Percentile Method:
Determine the confidence interval by finding the desired percentiles of the bootstrap distribution.
For a 95% confidence interval, you would typically use the 2.5th percentile as the lower bound and the 97.5th percentile as the
upper bound.

Let's say we have a sample of 50 tree heights with a mean of 15 meters and a standard deviation of 2 meters.

Now, perform the following steps:
1. Data Resampling:
Generate, let's say, 1000 bootstrap samples by randomly selecting 50 tree heights with replacement from the original sample.

2. Statistic Calculation:
Calculate the mean for each of the 1000 bootstrap samples.

3. Create Bootstrap Distribution:
You now have a distribution of 1000 sample means from the bootstrap samples.

4. Percentile Method:
Determine the 95% confidence interval by finding the 2.5th and 97.5th percentiles of the bootstrap distribution.

5. The resulting confidence interval would be the range of values from the lower bound to the upper bound.

In [None]:
# Q8. How does bootstrap work and what are the steps involved in bootstrap?
Ans.
Bootstrap is a resampling technique used to estimate the sampling distribution of a statistic by repeatedly 
resampling with replacement from the observed data. 

Here are the steps involved in the bootstrap procedure:
    
1. Start with your original dataset, which consists of observations or measurements.
2. Generate multiple bootstrap samples by randomly selecting data points from the original dataset with replacement.
Each bootstrap sample should have the same size as the original dataset, but individual data points may be repeated.
3. For each bootstrap sample, calculate the statistic of interest. This could be the mean, median, standard deviation,
or any other relevant statistic.
4. Repeat steps 2 and 3 a large number of times (e.g., 1000 or more) to create a distribution of the calculated statistics.
5. Estimate of Sampling Distribution: The distribution of these calculated statistics forms an empirical approximation of the
sampling distribution of the statistic of interest.
6. Confidence Interval: Determine the confidence interval for the statistic by finding the desired percentiles of the bootstrap
distribution. For example, a 95% confidence interval would be determined by the 2.5th and 97.5th percentiles of the bootstrap
distribution.
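
The procedure above can be sketched as a small generic function that accepts any statistic. This is a minimal illustration using only the standard library: the names `bootstrap_ci` and `percentile` are made up for this example, the percentile uses simple linear interpolation, and the measurements are hypothetical.

```python
import random
from statistics import median

def percentile(sorted_values, q):
    """Linearly interpolated percentile (q between 0 and 100) of an already sorted list."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)

def bootstrap_ci(data, statistic, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for `statistic` evaluated on `data`."""
    rng = random.Random(seed)
    n = len(data)
    # Steps 2-4: resample with replacement and recompute the statistic each time.
    estimates = sorted(statistic(rng.choices(data, k=n)) for _ in range(n_resamples))
    # Steps 5-6: read the interval off the empirical bootstrap distribution.
    alpha = 1 - confidence
    return (percentile(estimates, 100 * alpha / 2),
            percentile(estimates, 100 * (1 - alpha / 2)))

# Hypothetical measurements, used only to exercise the function.
data = [12.1, 13.4, 13.9, 14.2, 14.8, 15.0, 15.3, 15.9, 16.4, 17.2, 18.5]

lower, upper = bootstrap_ci(data, median)
print(f"sample median: {median(data)}")
print(f"95% bootstrap CI for the median: ({lower:.2f}, {upper:.2f})")
```

Passing `statistics.mean` or `statistics.stdev` in place of `median` gives an interval for that statistic without changing anything else, which is the main practical appeal of the bootstrap.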

In [6]:
# Q9. A researcher wants to estimate the mean height of a population of trees. They measure the height of a
# sample of 50 trees and obtain a mean height of 15 meters and a standard deviation of 2 meters. Use
# bootstrap to estimate the 95% confidence interval for the population mean height.

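import math
import scipy.stats as stats

# Given data (summary statistics only; the raw tree heights are not available)
samples = 50
sample_mean = 15
sample_std = 2
confidence_level = 0.95

# Note: a bootstrap needs the raw observations to resample from. With only
# summary statistics available, this cell computes the analytic t-based
# confidence interval for the mean rather than a resampling-based one.

# Critical t value for the desired level of confidence
alpha = 1 - confidence_level
dof = samples - 1
t_value = stats.t.ppf(1 - alpha/2, dof)

# Standard error of the mean and margin of error
std_error = sample_std / math.sqrt(samples)
margin_of_error = t_value * std_error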

# calculate the confidence interval bounds
lower_bound = sample_mean - margin_of_error
upper_bound = sample_mean + margin_of_error

# print 95% confidence interval
print(f'Sample mean height for {samples} Trees is {sample_mean} and Sample Standard Deviation is {sample_std}')
print(f'T-Statistic with {confidence_level*100}% confidence interval for dof {dof} : {t_value:.4f}')
print(f'Standard Error : {std_error:.4f}')
print(f'Margin of error : {margin_of_error:.4f}')
print(f'\nEstimated Population mean with 95% confidence interval is ({lower_bound:.2f} , {upper_bound:.2f})')

Sample mean height for 50 Trees is 15 and Sample Standard Deviation is 2
T-Statistic with 95.0% confidence interval for dof 49 : 2.0096
Standard Error : 0.2828
Margin of error : 0.5684

Estimated Population mean with 95% confidence interval is (14.43 , 15.57)
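
The interval above comes from the t-distribution formula, not from resampling. A bootstrap needs the 50 raw heights, which the question does not provide. The sketch below therefore simulates a stand-in sample from a normal distribution with mean 15 and standard deviation 2. That simulation is an assumption made purely for illustration, so the resulting interval depends on the random seed and is not the researcher's actual result.

```python
import random
from statistics import mean, quantiles

rng = random.Random(0)

# ASSUMPTION: the raw data are unavailable, so 50 heights are simulated from
# Normal(mean=15, sd=2) to stand in for the researcher's measurements.
heights = [rng.gauss(15, 2) for _ in range(50)]

# Resample with replacement and record the mean of each bootstrap sample.
n_resamples = 1000
boot_means = [mean(rng.choices(heights, k=len(heights))) for _ in range(n_resamples)]

# quantiles(..., n=40) returns 39 cut points; the first is the 2.5th percentile
# and the last is the 97.5th percentile of the bootstrap distribution.
cuts = quantiles(boot_means, n=40)
lower_bound, upper_bound = cuts[0], cuts[-1]

print(f"Mean of simulated sample: {mean(heights):.2f}")
print(f"95% bootstrap confidence interval: ({lower_bound:.2f} , {upper_bound:.2f})")
```

For a roughly normal sample of this size, the percentile bootstrap interval is expected to have a width similar to the t-based interval, although it is centred on the simulated sample's mean rather than exactly on 15.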
