9. We will now consider the Boston housing data set, from the ISLP
library.

In [1]:
# Install the ISLP library if it's not already installed
!pip install ISLP

# Import necessary libraries
import ISLP
import pandas as pd

# Load the Boston housing dataset
boston = ISLP.load_data('Boston')



(a) Based on this data set, provide an estimate for the population
mean of medv. Call this estimate $\hat{\mu}$.

In [2]:
mu_hat = boston['medv'].mean()
print(f"The estimate for the population mean of medv (mu_hat) is: {mu_hat}")

The estimate for the population mean of medv (mu_hat) is: 22.532806324110677


(b) Provide an estimate of the standard error of $\hat{\mu}$. Interpret this
result.

Hint: We can compute the standard error of the sample mean by
dividing the sample standard deviation by the square root of the
number of observations.

In [4]:
import numpy as np

# Calculate the standard error of mu_hat
standard_error = boston['medv'].std() / np.sqrt(len(boston))

print(f"The estimated standard error of mu_hat is: {standard_error}")



The estimated standard error of mu_hat is: 0.4088611474975351


Interpretation:
The standard error of $\hat{\mu}$ estimates how much the sample mean would
vary from one sample of this size to another, that is, the uncertainty in
$\hat{\mu}$ as an estimate of the true population mean of medv.
A smaller standard error means a more precise estimate.
Here the standard error is about 0.41, against a sample mean of about 22.53
(medv is recorded in thousands of dollars), so it is roughly 1.8% of the
estimate. The sample mean is therefore a fairly precise estimate of the
population mean.
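As a minimal sketch of the formula in the hint, the helper below computes $s/\sqrt{n}$ for any one-dimensional array. The function name `standard_error_of_mean` and the synthetic gamma-distributed data are assumptions made for illustration; the synthetic values only stand in for `boston['medv']` so that the snippet runs without ISLP. It also illustrates that quadrupling the sample size roughly halves the standard error.

```python
import numpy as np

def standard_error_of_mean(values):
    """Return s / sqrt(n), using the sample standard deviation (ddof=1)."""
    values = np.asarray(values, dtype=float)
    return values.std(ddof=1) / np.sqrt(values.size)

# Synthetic stand-in data (an assumption for illustration, not the Boston data)
rng = np.random.default_rng(0)
small_sample = rng.gamma(shape=6.0, scale=3.75, size=500)
large_sample = rng.gamma(shape=6.0, scale=3.75, size=2000)

se_small = standard_error_of_mean(small_sample)
se_large = standard_error_of_mean(large_sample)
print(f"n=500:  SE = {se_small:.3f}")
print(f"n=2000: SE = {se_large:.3f} (ratio = {se_small / se_large:.2f})")
```

pandas' `Series.std()` also defaults to `ddof=1`, so the helper should agree with the calculation in the cell above when applied to `boston['medv']`.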

(c) Now estimate the standard error of $\hat{\mu}$ using the bootstrap. How
does this compare to your answer from (b)?

In [5]:
# Number of bootstrap samples
n_bootstrap = 1000

# Initialize an array to store the bootstrap sample means
bootstrap_means = np.empty(n_bootstrap)

# Perform bootstrapping
for i in range(n_bootstrap):
    # Sample with replacement from the original data
    bootstrap_sample = boston['medv'].sample(n=len(boston), replace=True)
    # Calculate the mean of the bootstrap sample
    bootstrap_means[i] = bootstrap_sample.mean()

# Estimate the standard error of mu_hat using the bootstrap
bootstrap_standard_error = np.std(bootstrap_means)

print(f"The estimated standard error of mu_hat using the bootstrap is: {bootstrap_standard_error}")


The estimated standard error of mu_hat using the bootstrap is: 0.4130365394388512


Comparison:
The bootstrap estimate (about 0.413) is very close to the formula-based
estimate from (b) (about 0.409); the two differ by about 1%.
The formula $s/\sqrt{n}$ is valid for the sample mean of independent
observations whether or not the data are normally distributed, so close
agreement is expected here. The small gap reflects the randomness of the
1000 resamples and will change slightly from run to run, since no random
seed is set.
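The bootstrap loop can also be written as a reusable, seeded helper so that the estimate is repeatable. The sketch below is one possible version, not the notebook's own code: the name `bootstrap_se`, its parameters, and the synthetic data are assumptions, and it assumes `statistic` accepts an `axis` argument (as `np.mean` and `np.median` do).

```python
import numpy as np

def bootstrap_se(values, statistic=np.mean, n_bootstrap=1000, seed=0):
    """Bootstrap standard error of `statistic` for a 1-D sample."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)  # fixed seed makes the result repeatable
    n = values.size
    # Each row of `indices` selects one bootstrap sample, drawn with replacement
    indices = rng.integers(0, n, size=(n_bootstrap, n))
    replicates = statistic(values[indices], axis=1)
    return replicates.std(ddof=1)

# Synthetic stand-in data (an assumption for illustration, not the Boston data)
data = np.random.default_rng(1).gamma(shape=6.0, scale=3.75, size=506)

se_formula = data.std(ddof=1) / np.sqrt(data.size)
se_boot = bootstrap_se(data, np.mean)
print(f"formula SE:   {se_formula:.4f}")
print(f"bootstrap SE: {se_boot:.4f}")
```

With the real data, a call such as `bootstrap_se(boston['medv'].to_numpy())` should give a value close to the one printed above.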

(d) Based on your bootstrap estimate from (c), provide a 95% confidence
interval for the mean of medv. Compare it to the results
obtained by using Boston['medv'].std() and the two standard
error rule (3.9).

Hint: You can approximate a 95% confidence interval using the
formula $[\hat{\mu} - 2\,\mathrm{SE}(\hat{\mu}),\ \hat{\mu} + 2\,\mathrm{SE}(\hat{\mu})]$.

In [6]:
# Calculate the 95% confidence interval using the bootstrap standard error
confidence_interval_bootstrap = [mu_hat - 2 * bootstrap_standard_error, mu_hat + 2 * bootstrap_standard_error]

print(f"The 95% confidence interval for the mean of medv (using bootstrap) is: {confidence_interval_bootstrap}")


# Calculate the 95% confidence interval using Boston['medv'].std() and the two standard error rule
confidence_interval_standard_error = [mu_hat - 2 * standard_error, mu_hat + 2 * standard_error]

print(f"The 95% confidence interval for the mean of medv (using standard error) is: {confidence_interval_standard_error}")


The 95% confidence interval for the mean of medv (using bootstrap) is: [21.706733245232975, 23.35887940298838]
The 95% confidence interval for the mean of medv (using standard error) is: [21.715084029115605, 23.35052861910575]


The two intervals are nearly identical: about [21.71, 23.36] from the
bootstrap and [21.72, 23.35] from the formula. This is expected, because
both use the same centre $\hat{\mu}$ and standard errors that differ by only
about 0.004. The formula-based standard error is therefore a good
approximation for the sample mean in this case.
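An alternative to the two-standard-error rule is the percentile interval, which takes the 2.5th and 97.5th percentiles of the bootstrap means directly. The sketch below compares the two on synthetic stand-in data; the data, the seed, and the variable names are assumptions for illustration, and the exercise itself only asks for the two-standard-error version.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in data (an assumption for illustration, not the Boston data)
data = rng.gamma(shape=6.0, scale=3.75, size=506)

# Draw all bootstrap samples at once: one row of indices per resample
n_bootstrap = 2000
indices = rng.integers(0, data.size, size=(n_bootstrap, data.size))
bootstrap_means = data[indices].mean(axis=1)

mu_hat = data.mean()
se_boot = bootstrap_means.std(ddof=1)

# Two-standard-error interval, as in the hint
ci_two_se = (mu_hat - 2 * se_boot, mu_hat + 2 * se_boot)
# Percentile interval taken from the bootstrap distribution itself
lower, upper = np.percentile(bootstrap_means, [2.5, 97.5])
ci_percentile = (lower, upper)

print(f"two-SE interval:     [{ci_two_se[0]:.2f}, {ci_two_se[1]:.2f}]")
print(f"percentile interval: [{ci_percentile[0]:.2f}, {ci_percentile[1]:.2f}]")
```

When the bootstrap distribution of the mean is close to symmetric, the two intervals should nearly coincide.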

(e) Based on this data set, provide an estimate, $\hat{\mu}_{med}$, for the median
value of medv in the population.

In [7]:
mu_med_hat = boston['medv'].median()

print(f"The estimate for the population median of medv (mu_med_hat) is: {mu_med_hat}")

The estimate for the population median of medv (mu_med_hat) is: 21.2


(f) We now would like to estimate the standard error of $\hat{\mu}_{med}$. Unfortunately,
there is no simple formula for computing the standard
error of the median. Instead, estimate the standard error of the
median using the bootstrap. Comment on your findings.

In [8]:
# Number of bootstrap samples
n_bootstrap = 1000

# Initialize an array to store the bootstrap sample medians
bootstrap_medians = np.empty(n_bootstrap)

# Perform bootstrapping
for i in range(n_bootstrap):
    # Sample with replacement from the original data
    bootstrap_sample = boston['medv'].sample(n=len(boston), replace=True)
    # Calculate the median of the bootstrap sample
    bootstrap_medians[i] = bootstrap_sample.median()

# Estimate the standard error of mu_med_hat using the bootstrap
bootstrap_standard_error_median = np.std(bootstrap_medians)

print(f"The estimated standard error of mu_med_hat using the bootstrap is: {bootstrap_standard_error_median}")



The estimated standard error of mu_med_hat using the bootstrap is: 0.38131246701360255


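Comparing with parts (b) and (c), the bootstrap standard error of the median (about 0.381) is slightly
smaller than the standard error of the mean (about 0.409 from the formula and 0.413 from the bootstrap),
so in this sample the median is estimated at least as precisely as the mean.
This is consistent with the median being more robust than the mean: it is less affected by outliers and skewness.
It is not a general rule, however; for normally distributed data the median has a larger standard error than the mean.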
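How the mean and the median respond to an extreme value can be checked directly on a toy example. The numbers below are made up for illustration and have nothing to do with the Boston data.

```python
import numpy as np

# Toy data (hypothetical values chosen for easy arithmetic)
values = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
with_outlier = np.append(values, 500.0)  # add a single extreme observation

mean_shift = with_outlier.mean() - values.mean()
median_shift = np.median(with_outlier) - np.median(values)

print(f"mean shift: {mean_shift:.1f}, median shift: {median_shift:.1f}")
# prints "mean shift: 81.0, median shift: 1.0"
```

A single outlier moves the mean from 14 to 95 but the median only from 14 to 15, which is the sense in which the median is the more robust of the two.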

(g) Based on this data set, provide an estimate for the tenth percentile
of medv in Boston census tracts.

Call this quantity $\hat{\mu}_{0.1}$.
(You can use the np.percentile() function.)



In [10]:
mu_01_hat = np.percentile(boston['medv'], 10)

print(f"The estimate for the tenth percentile of medv (mu_0.1_hat) is: {mu_01_hat}")

The estimate for the tenth percentile of medv (mu_0.1_hat) is: 12.75


(h) Use the bootstrap to estimate the standard error of $\hat{\mu}_{0.1}$. Comment
on your findings.

In [11]:
# Number of bootstrap samples
n_bootstrap = 1000

# Initialize an array to store the bootstrap sample 10th percentiles
bootstrap_percentiles_01 = np.empty(n_bootstrap)

# Perform bootstrapping
for i in range(n_bootstrap):
    # Sample with replacement from the original data
    bootstrap_sample = boston['medv'].sample(n=len(boston), replace=True)
    # Calculate the 10th percentile of the bootstrap sample
    bootstrap_percentiles_01[i] = np.percentile(bootstrap_sample, 10)

# Estimate the standard error of mu_0.1_hat using the bootstrap
bootstrap_standard_error_percentile_01 = np.std(bootstrap_percentiles_01)

print(f"The estimated standard error of mu_0.1_hat using the bootstrap is: {bootstrap_standard_error_percentile_01}")


The estimated standard error of mu_0.1_hat using the bootstrap is: 0.5065491067014134


Comment on findings:
The bootstrap standard error of the 10th percentile is about 0.51, against
an estimate of 12.75, so it is roughly 4% of the estimate.
This is larger than the standard errors of the mean (about 0.41) and the
median (about 0.38), in both absolute and relative terms.
A tail quantile is determined by the comparatively few observations near
that tail, so it is often estimated less precisely than a central quantity
such as the mean or median.
Even so, the standard error is small relative to the estimate, so the 10th
percentile is still estimated with reasonable precision.
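One way to put the three estimates on a common scale is to divide each bootstrap standard error by its estimate. The sketch below does this with the values printed in the cells above; the dictionary layout and the name `relative_se` are illustrative choices, and the exact figures would differ slightly on a rerun because the bootstrap is not seeded.

```python
# Estimates and bootstrap standard errors copied from the outputs above
results = {
    "mean": (22.532806324110677, 0.4130365394388512),
    "median": (21.2, 0.38131246701360255),
    "10th percentile": (12.75, 0.5065491067014134),
}

# Relative standard error: standard error as a fraction of the estimate
relative_se = {name: se / est for name, (est, se) in results.items()}

for name, ratio in relative_se.items():
    print(f"{name:>15}: SE / estimate = {ratio:.1%}")
```

On these numbers the mean and median both come out at about 1.8%, and the 10th percentile at about 4.0%.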