# 9. We will now consider the Boston housing data set, from the ISLP library.
## (a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate \(\hat{\mu}\).

In [1]:
!pip install ISLP

from ISLP import load_data
import pandas as pd

boston = load_data('Boston')

mu_hat = boston['medv'].mean()
mu_hat



22.532806324110677

## (b) Provide an estimate of the standard error of \(\hat{\mu}\). Interpret this result.

In [2]:
import numpy as np

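# Compute the sample standard deviation (pandas uses ddof=1 by default)
sample_std_dev = boston['medv'].std()

# Number of observations
n = len(boston['medv'])

# Standard error of the mean: s / sqrt(n)
se_mu_hat = sample_std_dev / np.sqrt(n)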
se_mu_hat


0.4088611474975351

The standard error of \(\hat{\mu}\), approximately 0.409, estimates the standard deviation of the sample mean across repeated samples of the same size from the population. In other words, the sample mean typically differs from the true population mean by about 0.409 units (`medv` is recorded in thousands of dollars, so roughly 409 dollars). A smaller standard error means the sample mean is a more precise estimate of the population mean.
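As a rough check on this interpretation, the sketch below simulates repeated sampling from a made-up population and compares the spread of the sample means with \(s/\sqrt{n}\) computed from a single sample. The gamma population and its parameters are assumptions chosen only to give skewed, positive values; they are not fitted to the Boston data.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed so the sketch is reproducible
n = 506            # sample size, matching the Boston data
n_samples = 2000   # number of simulated samples

# Hypothetical population: gamma-distributed values (an assumption, not the Boston data).
samples = rng.gamma(shape=6.0, scale=4.0, size=(n_samples, n))

# Spread of the sample mean across the repeated samples.
empirical_se = samples.mean(axis=1).std(ddof=1)

# Formula-based estimate computed from just one of those samples.
formula_se = samples[0].std(ddof=1) / np.sqrt(n)

print(f"SD of sample means:      {empirical_se:.3f}")
print(f"s / sqrt(n), one sample: {formula_se:.3f}")
```

The two numbers should be close, which is what justifies using the formula on a single sample.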


## (c) Now estimate the standard error of \(\hat{\mu}\) using the bootstrap. How does this compare to your answer from (b)?

In [3]:
import numpy as np

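# Number of bootstrap resamples
n_bootstraps = 1000
bootstrap_means = []

# Draw bootstrap samples: same size as the data, sampled with replacement
for _ in range(n_bootstraps):
    bootstrap_sample = boston['medv'].sample(frac=1, replace=True)
    bootstrap_means.append(bootstrap_sample.mean())

# Bootstrap standard error: standard deviation of the resampled means
# (np.std uses ddof=0 by default)
bootstrap_se_mu_hat = np.std(bootstrap_means)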
bootstrap_se_mu_hat


0.4221100955168359

The standard error estimated using the bootstrap is approximately 0.422, slightly higher than the 0.409 calculated from the formula in (b). The gap of about 0.013 is small enough to be attributable to resampling randomness with 1,000 bootstrap samples; no random seed is set, so the value changes slightly from run to run. The bootstrap does not rely on a formula for the standard error or on a normality assumption, which makes it useful for statistics where no simple formula exists.

The close agreement between the two values suggests that the formula-based standard error is adequate for the mean of `medv`.
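The resampling loop above gives a different answer on every run. A minimal sketch of a seeded, vectorized variant follows; the function name `bootstrap_se_mean`, its `seed` parameter, and the synthetic `demo` array are illustrative assumptions so that the block runs without the ISLP data.

```python
import numpy as np

def bootstrap_se_mean(values, n_boot=1000, seed=0):
    """Bootstrap SE of the mean; `seed` makes the result reproducible."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n = values.size
    # Each row holds the indices of one bootstrap sample (drawn with replacement).
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = values[idx].mean(axis=1)
    return boot_means.std(ddof=1)

# Hypothetical stand-in for boston['medv'] (synthetic, not the real data).
demo = np.random.default_rng(1).gamma(shape=6.0, scale=4.0, size=506)

se_boot = bootstrap_se_mean(demo)
se_formula = demo.std(ddof=1) / np.sqrt(demo.size)
print(f"bootstrap SE: {se_boot:.3f}, formula SE: {se_formula:.3f}")
```

With the real data, passing `boston['medv'].to_numpy()` in place of `demo` should work the same way.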


## (d) Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained by using Boston['medv'].std() and the two standard error rule (3.9).

In [4]:
# Calculate the sample mean of medv as the point estimate
mu_hat = boston['medv'].mean()

# Calculate the 95% confidence interval using the bootstrap standard error
ci_lower = mu_hat - 2 * bootstrap_se_mu_hat
ci_upper = mu_hat + 2 * bootstrap_se_mu_hat
(ci_lower, ci_upper)


(21.688586133077006, 23.377026515144347)

The 95% confidence interval for the mean of `medv` based on the bootstrap standard error is approximately:

\[
(21.69, 23.38)
\]

### Comparison to the Two Standard Error Rule

Using the two standard error rule from (3.9) with the formula-based standard error of 0.409 obtained in (b), the confidence interval would be:

\[
(\hat{\mu} - 2 \times 0.409,\ \hat{\mu} + 2 \times 0.409) \approx (21.71, 23.35)
\]

The two intervals are very similar. The bootstrap interval is slightly wider because the bootstrap standard error (0.422) is slightly larger than 0.409. This agreement indicates that the bootstrap and the formula give consistent measures of uncertainty for the mean of `medv`.
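The intervals in this part both use the two standard error rule. Another common bootstrap interval, not required by the exercise, takes the 2.5th and 97.5th percentiles of the bootstrap means directly. The sketch below computes both on synthetic data; the `demo` array is an assumed stand-in for `boston['medv']`.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
# Hypothetical stand-in for boston['medv'] (synthetic, not the real data).
demo = rng.gamma(shape=6.0, scale=4.0, size=506)

n_boot = 2000
idx = rng.integers(0, demo.size, size=(n_boot, demo.size))
boot_means = demo[idx].mean(axis=1)

# Two standard error interval around the sample mean.
mu_hat = demo.mean()
se = boot_means.std(ddof=1)
two_se_ci = (mu_hat - 2 * se, mu_hat + 2 * se)

# Percentile interval read directly from the bootstrap distribution.
pct_lower, pct_upper = np.percentile(boot_means, [2.5, 97.5])

print("two-SE interval:    ", two_se_ci)
print("percentile interval:", (pct_lower, pct_upper))
```

For a statistic like the mean, whose bootstrap distribution is close to symmetric, the two intervals should nearly coincide.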


## (e) Based on this data set, provide an estimate, \(\hat{\mu}_{med}\), for the median value of medv in the population.

In [5]:
# Calculate the median of medv as an estimate of the population median
mu_med_hat = boston['medv'].median()
mu_med_hat


21.2

## (f) We now would like to estimate the standard error of \(\hat{\mu}_{med}\). Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.

In [6]:
# Set number of bootstrap samples
n_bootstraps = 1000
bootstrap_medians = []

# Perform bootstrap sampling
for _ in range(n_bootstraps):
    bootstrap_sample = boston['medv'].sample(frac=1, replace=True)
    bootstrap_medians.append(bootstrap_sample.median())

# Calculate the standard error of the median from bootstrap samples
bootstrap_se_mu_med = np.std(bootstrap_medians)
bootstrap_se_mu_med


0.36825995913213233

The bootstrap estimate of the standard error for the median is approximately 0.368.

### Interpretation

This standard error indicates that, across repeated samples from the population, the sample median of `medv` would typically differ from the population median by about 0.368 (in thousands of dollars). That is small relative to the estimate of 21.2 and slightly smaller than the standard error of the mean found in (b) and (c). Since there is no simple formula for the standard error of the median, the bootstrap provides a practical estimate by resampling the data many times. The small value suggests that the sample median is a fairly precise estimate of the population median of `medv`.


## (g) Based on this data set, provide an estimate for the tenth percentile of medv in Boston census tracts. Call this quantity \(\hat{\mu}_{0.1}\). (You can use the np.percentile() function.)

In [7]:
# Calculate the 10th percentile of medv
mu_0_1_hat = np.percentile(boston['medv'], 10)
mu_0_1_hat


12.75

The estimated tenth percentile of `medv` in the Boston housing dataset is approximately 12.75.

### Interpretation

This value of 12.75 indicates that about 10% of the Boston census tracts in the data set have a median home value below this amount (12,750 dollars, since `medv` is in thousands). It marks the threshold for the bottom 10% of the `medv` distribution.
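To make the "10% fall below" reading concrete, here is a toy illustration of how `np.percentile` behaves with its default linear interpolation. The ten-value array is made up for the example.

```python
import numpy as np

values = np.arange(1, 11)  # toy data: 1, 2, ..., 10

# Default method interpolates linearly between neighboring order statistics.
p10 = np.percentile(values, 10)

# Share of observations strictly below the estimated percentile.
share_below = np.mean(values < p10)

print(p10, share_below)  # about 1.9 and 0.1
```

Because of the interpolation, the returned value need not be one of the observed data points.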


## (h) Use the bootstrap to estimate the standard error of \(\hat{\mu}_{0.1}\). Comment on your findings.

In [8]:
# Set number of bootstrap samples
n_bootstraps = 1000
bootstrap_percentiles = []

# Perform bootstrap sampling to estimate the standard error of the 10th percentile
for _ in range(n_bootstraps):
    bootstrap_sample = boston['medv'].sample(frac=1, replace=True)
    bootstrap_percentiles.append(np.percentile(bootstrap_sample, 10))

# Calculate the standard error of the 10th percentile from bootstrap samples
bootstrap_se_mu_0_1 = np.std(bootstrap_percentiles)
bootstrap_se_mu_0_1


0.5073257705853311

The bootstrap estimate of the standard error for the tenth percentile of `medv` is approximately 0.507.

### Interpretation

This standard error of 0.507 indicates that, across repeated samples from the population, the estimated tenth percentile of `medv` (12.75) would typically vary by about 0.507 units. Relative to the estimate, this is roughly 4%, compared with under 2% for the median, which is consistent with a tail quantile being harder to pin down than a central one. As with the median, there is no simple formula for the standard error of a percentile, so the bootstrap is a convenient way to quantify this estimate's variability.
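Parts (c), (f), and (h) repeat the same loop with a different statistic each time. One way to factor that out is sketched below; the helper name `bootstrap_se` and its parameters are hypothetical, and the synthetic `demo` array stands in for `boston['medv']`.

```python
import numpy as np

def bootstrap_se(values, statistic, n_boot=1000, seed=0):
    """Bootstrap standard error of an arbitrary statistic of a 1-D sample."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Resample the data with replacement, keeping the original size.
        resample = rng.choice(values, size=values.size, replace=True)
        stats[b] = statistic(resample)
    return stats.std(ddof=1)

# Hypothetical stand-in for boston['medv'] (synthetic, not the real data).
demo = np.random.default_rng(1).gamma(shape=6.0, scale=4.0, size=506)

se_mean = bootstrap_se(demo, np.mean)
se_median = bootstrap_se(demo, np.median)
se_p10 = bootstrap_se(demo, lambda x: np.percentile(x, 10))
print(se_mean, se_median, se_p10)
```

Passing the statistic as a function keeps the resampling logic in one place, so any other quantile or summary can be handled without writing a new loop.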
