In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 10)
plt.rcParams['figure.figsize'] = (10, 6)

# Bootstrapping

The central idea of bootstrapping is that you can treat your original sample as a proxy for the underlying population. By repeatedly resampling from the observed sample with replacement, each time drawing as many observations as the sample contains, you simulate the process of drawing multiple samples from the unknown population.
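As a minimal sketch of this idea, the cell below resamples a small synthetic sample with replacement and records the mean of each resample. The data, the seeds, the helper name `bootstrap_statistic`, and the choice of 2,000 resamples are illustrative assumptions rather than part of the original analysis.

```python
import numpy as np

def bootstrap_statistic(sample, statistic=np.mean, n_resamples=2000, seed=0):
    """Evaluate `statistic` on `n_resamples` bootstrap resamples of `sample`.

    Each resample has the same size as the original sample and is drawn
    with replacement. (Hypothetical helper, for illustration only.)
    """
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = sample.size
    # One row of indices per bootstrap resample.
    idx = rng.integers(0, n, size=(n_resamples, n))
    return np.array([statistic(sample[row]) for row in idx])

# Synthetic, skewed data standing in for an observed sample (assumption).
rng = np.random.default_rng(42)
observed = rng.exponential(scale=2.0, size=50)

boot_means = bootstrap_statistic(observed)
print("sample mean:         ", observed.mean())
print("mean of boot means:  ", boot_means.mean())
print("bootstrap std. error:", boot_means.std(ddof=1))
```

The spread of `boot_means` serves as an estimate of how much the sample mean would vary across repeated samples from the population.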

Why It Works (Intuitively)

Law of Large Numbers: As the original sample grows, its empirical distribution converges to the population distribution, so resampling from the sample increasingly resembles sampling from the population. Drawing more bootstrap resamples only reduces the simulation (Monte Carlo) noise in the bootstrap distribution of a statistic (mean, median, etc.); it cannot add information beyond what the original sample contains.
Empirical Distribution Function: Bootstrapping samples from the empirical distribution function of the data, which acts as an estimate of the population's distribution. This is useful when the theoretical distribution is unknown or difficult to work with mathematically.
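The sketch below illustrates both points on synthetic data: the empirical distribution function (ECDF) of a sample tracks the population's distribution function, and the bootstrap standard error of the mean settles down as the number of resamples grows. The standard normal population, the sample size of 200, and the helper `ecdf` are assumptions made for illustration.

```python
from math import erf, sqrt

import numpy as np

rng = np.random.default_rng(0)
# Synthetic sample from a known population (standard normal), so the
# ECDF can be compared with the true CDF. Purely illustrative.
n = 200
sample = rng.normal(loc=0.0, scale=1.0, size=n)

def ecdf(data, x):
    """Fraction of observations in `data` that are <= x."""
    return np.mean(np.asarray(data) <= x)

for x in (-1.0, 0.0, 1.0):
    true_cdf = 0.5 * (1.0 + erf(x / sqrt(2.0)))  # standard normal CDF
    print(f"x={x:+.1f}  ECDF={ecdf(sample, x):.3f}  true CDF={true_cdf:.3f}")

# Bootstrap standard error of the mean for increasing numbers of resamples.
se_by_b = {}
for n_resamples in (100, 1000, 10000):
    idx = rng.integers(0, n, size=(n_resamples, n))
    boot_means = sample[idx].mean(axis=1)
    se_by_b[n_resamples] = boot_means.std(ddof=1)
    print(f"B={n_resamples:>6}  bootstrap SE={se_by_b[n_resamples]:.4f}")

# Plug-in reference value that the bootstrap SE settles towards.
print("s / sqrt(n) =", sample.std(ddof=1) / np.sqrt(n))
```

Increasing the number of resamples stabilizes the estimate, but how closely the ECDF matches the population is governed by the size and quality of the original sample.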

Statistical Rigor

While the intuitive explanation is helpful, there is also a more rigorous statistical justification for why bootstrapping provides valid results in many cases:

Consistency: For many statistics (such as means and other smooth functions of the data), the bootstrap distribution converges to the true sampling distribution as the original sample size grows, with the number of bootstrap resamples controlling only the simulation error. This property is called consistency.
Central Limit Theorem: The sampling distribution of the mean approaches normality as the sample size grows, even when the population is not normal (provided it has finite variance), and the bootstrap distribution of the mean inherits this behavior. This supports the use of familiar normal-based tools and confidence interval calculations.
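A rough check of the Central Limit Theorem point, assuming synthetic exponential data: the sample itself is strongly skewed, while the bootstrap distribution of its mean is much closer to symmetric. The helper `skewness`, the sample size of 100, and the 5,000 resamples are illustrative choices.

```python
import numpy as np

def skewness(x):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5

rng = np.random.default_rng(1)
# Skewed synthetic data standing in for a non-normal sample (assumption).
sample = rng.exponential(scale=1.0, size=100)
n = sample.size

# Bootstrap distribution of the mean: one resample per row.
idx = rng.integers(0, n, size=(5000, n))
boot_means = sample[idx].mean(axis=1)

print("skewness of the sample:         ", skewness(sample))
print("skewness of the bootstrap means:", skewness(boot_means))
```

For statistics other than the mean, or for very small or heavy-tailed samples, the bootstrap distribution may remain visibly non-normal, so it is worth inspecting before relying on normal-based formulas.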

Assumptions and Limitations

Representativeness: Your original sample should be reasonably representative of the underlying population. Bootstrapping can't fix fundamental biases in your data.
Independence: The standard bootstrap assumes independent, identically distributed observations. Adaptations such as the block bootstrap exist for time series data, where observations are dependent over time.
Computation: For complex statistics or large datasets, bootstrapping can become computationally intensive.
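To illustrate the independence assumption, here is a hedged sketch of a moving block bootstrap, one common adaptation for dependent data, applied to a synthetic AR(1) series. The helper `moving_block_bootstrap`, the block length of 20, and the series parameters are assumptions chosen for illustration; in practice the block length is a tuning choice.

```python
import numpy as np

def moving_block_bootstrap(series, block_length, rng):
    """Return one resample of `series` built from overlapping blocks.

    Blocks of `block_length` consecutive observations are drawn with
    replacement, concatenated, and trimmed to the original length.
    (Hypothetical helper, for illustration only.)
    """
    series = np.asarray(series)
    n = series.size
    n_blocks = int(np.ceil(n / block_length))
    # Valid starting positions for a full block.
    starts = rng.integers(0, n - block_length + 1, size=n_blocks)
    # Broadcasting: row i holds the indices of block i.
    idx = starts[:, None] + np.arange(block_length)[None, :]
    return series[idx.ravel()[:n]]

rng = np.random.default_rng(7)
# Synthetic AR(1) series with positive autocorrelation (assumption).
n, phi = 300, 0.7
noise = rng.normal(size=n)
series = np.empty(n)
series[0] = noise[0]
for t in range(1, n):
    series[t] = phi * series[t - 1] + noise[t]

block_means = np.array(
    [moving_block_bootstrap(series, 20, rng).mean() for _ in range(2000)]
)
# Ordinary bootstrap that ignores the time ordering, for comparison.
iid_idx = rng.integers(0, n, size=(2000, n))
iid_means = series[iid_idx].mean(axis=1)

print("block bootstrap SE: ", block_means.std(ddof=1))
print("i.i.d. bootstrap SE:", iid_means.std(ddof=1))
```

With positively autocorrelated data, resampling individual observations tends to understate the variability of the mean, which is why the block version usually reports the larger standard error here.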

Applications in Depth

Confidence Intervals: The 2.5th and 97.5th percentiles of the bootstrap distribution of a statistic are commonly used as the endpoints of a 95% confidence interval for that statistic (the percentile interval).
Hypothesis Testing: Instead of traditional parametric tests that rely on distributional assumptions, you can resample in a way that is consistent with the null hypothesis and use the resulting bootstrap distribution to estimate a p-value.
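Both applications can be sketched in a few lines. The example below assumes synthetic measurements, a hypothesized mean `mu0` of 5.5, and 5,000 resamples, all chosen purely for illustration. It computes a percentile interval for the median, then a two-sided bootstrap p-value for the mean by shifting the data so that the null hypothesis holds before resampling.

```python
import numpy as np

rng = np.random.default_rng(123)
# Synthetic measurements standing in for real data (assumption).
data = rng.normal(loc=5.8, scale=0.8, size=60)
n, n_resamples = data.size, 5000

# --- Percentile confidence interval for the median ---
idx = rng.integers(0, n, size=(n_resamples, n))
boot_medians = np.median(data[idx], axis=1)
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
print(f"median = {np.median(data):.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")

# --- Bootstrap test of H0: the population mean equals mu0 ---
mu0 = 5.5  # hypothesized value, chosen for illustration
observed_diff = abs(data.mean() - mu0)
# Shift the data so that H0 holds exactly in the resampling population.
null_data = data - data.mean() + mu0
null_idx = rng.integers(0, n, size=(n_resamples, n))
null_diffs = np.abs(null_data[null_idx].mean(axis=1) - mu0)
# Add-one correction keeps the estimated p-value strictly positive.
p_value = (np.sum(null_diffs >= observed_diff) + 1) / (n_resamples + 1)
print(f"two-sided bootstrap p-value = {p_value:.4f}")
```

The percentile interval is the simplest bootstrap interval; other constructions exist and may behave better for skewed or biased statistics.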

In [3]:
from sklearn.datasets import load_iris

iris = load_iris()