<blockquote>
    <h1>Exercise 5.9</h1>
    <p>We will now consider the <code>Boston</code> housing data set, from the <code>MASS</code> library.</p>
    <ol>
        <li>Based on this data set, provide an estimate for the population mean of <code>medv</code>. Call this estimate $\hat{\mu}$.</li>
        <li>Provide an estimate of the standard error of $\hat{\mu}$. Interpret this result. <br>
            <i>Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.</i></li>
        <li>Now estimate the standard error of $\hat{\mu}$ using the bootstrap. How does this compare to your answer from 2?</li>
        <li>Based on your bootstrap estimate from 3, provide a $95 \%$ confidence interval for the mean of <code>medv</code>. Compare it to the results obtained using <code>t.test(Boston\$medv)</code>. <br>
            <i>Hint: You can approximate a $95 \%$ confidence interval using the formula $[\hat{\mu}-2SE(\hat{\mu}), \hat{\mu}+2SE(\hat{\mu})]$.</i></li>
        <li>Based on this data set, provide an estimate, $\hat{\mu}_{med}$, for the median value of <code>medv</code> in the population.</li>
        <li>We now would like to estimate the standard error of $\hat{\mu}_{med}$. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.</li>
        <li>Based on this data set, provide an estimate for the tenth percentile of <code>medv</code> in Boston suburbs. Call this quantity $\hat{\mu}_{0.1}$. (You can use the <code>quantile()</code> function.)</li>
        <li>Use the bootstrap to estimate the standard error of $\hat{\mu}_{0.1}$. Comment on your findings.</li>
    </ol>
</blockquote>

In [1]:
import pandas as pd
import numpy as np

# https://stackoverflow.com/questions/34398054/ipython-notebook-cell-multiple-outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from sklearn.utils import resample
import scipy.stats

In [2]:
df = pd.read_csv("../../DataSets/Boston/Boston.csv")

<h3>Exercise 5.9.1</h3>
<blockquote>
    <i>Based on this data set, provide an estimate for the population mean of <code>medv</code>. Call this estimate $\hat{\mu}$.</i>
</blockquote>

In [3]:
df = df[['medv']]
mu_hat = df.mean().iloc[0]
mu_hat

22.532806324110677

<p>So $\hat{\mu} = 22.53$.</p>

<h3>Exercise 5.9.2</h3>
<blockquote>
    <i>Provide an estimate of the standard error of $\hat{\mu}$. Interpret this result. <br>
            <i>Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.</i></i>
</blockquote>

In [4]:
n = df.shape[0]
stderr_est = ((df.std(ddof=1).iloc[0])**2 / n)**0.5
stderr_est

0.40886114749753505

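<p>So $\widehat{SE}(\hat{\mu}) = 0.41$. Interpretation: across repeated samples of this size, the sample mean of <code>medv</code> would typically deviate from the population mean by about 0.41 (in thousands of dollars, the unit of <code>medv</code>).</p>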
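<p>As a sanity check on the hint formula, the following minimal sketch compares the manual computation with <code>scipy.stats.sem</code>. It runs on synthetic data rather than the Boston file, and the names <code>values</code>, <code>se_manual</code> and <code>se_scipy</code> are illustrative assumptions, not part of the notebook.</p>

```python
import numpy as np
import scipy.stats

# Synthetic stand-in for the medv column; the values are illustrative only.
rng = np.random.default_rng(0)
values = rng.normal(loc=22.5, scale=9.0, size=500)

n = values.size
# Hint formula: sample standard deviation (ddof=1) divided by sqrt(n).
se_manual = values.std(ddof=1) / np.sqrt(n)
# scipy.stats.sem computes the standard error of the mean (default ddof=1).
se_scipy = scipy.stats.sem(values)

print(se_manual, se_scipy)
```

<p>Both numbers should agree up to floating-point rounding, so either form could be used in the cell above.</p>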

<h3>Exercise 5.9.3</h3>
<blockquote>
    <i>Now estimate the standard error of $\hat{\mu}$ using the bootstrap. How does this compare to your answer from 2?</i>
</blockquote>

In [5]:
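# Each bootstrap sample has the same size as the original data set.
sample_size = df.shape[0]
B = 1000  # number of bootstrap resamples
# Mean of medv in each resample; random_state=r makes resample r reproducible.
mu_arr = np.array([resample(
                df,
                replace=True,
                n_samples=sample_size,
                random_state=r
            ).mean().iloc[0] for r in range(0, B)])

# Bootstrap standard error: sample standard deviation of the B bootstrap means.
mean_boot = np.mean(mu_arr)
((1/(B-1))*np.sum((mu_arr - mean_boot)**2))**0.5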


0.426642674393897

<p>The bootstrap estimate of the standard error, 0.427, is close to the formula-based estimate of 0.409 from exercise 2.</p>
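<p>For reference, the last line of the bootstrap cell computes the sample standard deviation of the $B$ bootstrap means. Written out, with $\hat{\mu}^{*r}$ denoting the mean of the $r$-th resample (this notation is an assumption introduced here, not taken from the exercise):</p>

$$\widehat{SE}_B(\hat{\mu}) = \sqrt{\frac{1}{B-1}\sum_{r=1}^{B}\left(\hat{\mu}^{*r} - \bar{\mu}^{*}\right)^2}, \qquad \bar{\mu}^{*} = \frac{1}{B}\sum_{r=1}^{B}\hat{\mu}^{*r}$$

<p>Here $\bar{\mu}^{*}$ corresponds to <code>mean_boot</code> and the $\hat{\mu}^{*r}$ to the entries of <code>mu_arr</code>.</p>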

<h3>Exercise 5.9.4</h3>
<blockquote>
    <i>Based on your bootstrap estimate from 3, provide a $95 \%$ confidence interval for the mean of <code>medv</code>. Compare it to the results obtained using <code>t.test(Boston\$medv)</code>. <br>
            <i>Hint: You can approximate a $95 \%$ confidence interval using the formula $[\hat{\mu}-2SE(\hat{\mu}), \hat{\mu}+2SE(\hat{\mu})]$.</i></i>
</blockquote>

In [6]:
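# Hint interval: center +/- 2 SE. Note: this uses the formula-based stderr_est from
# 5.9.2 and centers on mean_boot (the mean of the bootstrap means), not on mu_hat.
mean_boot - 2*stderr_est, mean_boot + 2*stderr_est

# Analogue of R's t.test interval: t quantile with n-1 degrees of freedom.
h = stderr_est * scipy.stats.t.ppf((1 + 0.95) / 2, n-1)
mean_boot - h, mean_boot + h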

(21.71050142041995, 23.345946010410092)

(21.724945405882938, 23.331502024947103)
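<p>The exercise asks for an interval built from the bootstrap standard error. A minimal sketch of that variant follows, on synthetic data and with NumPy's <code>Generator.choice</code> swapped in for <code>sklearn.utils.resample</code>. The names <code>values</code>, <code>se_boot</code>, <code>ci_boot</code> and <code>ci_t</code> are assumptions for illustration, and the numbers it prints are not the Boston results.</p>

```python
import numpy as np
import scipy.stats

# Synthetic stand-in for the medv column; the values are illustrative only.
rng = np.random.default_rng(0)
values = rng.normal(loc=22.5, scale=9.0, size=500)
n = values.size
mu_hat = values.mean()

# Bootstrap standard error of the mean: resample with replacement B times.
B = 1000
boot_means = np.array([
    rng.choice(values, size=n, replace=True).mean() for _ in range(B)
])
se_boot = boot_means.std(ddof=1)

# Approximate 95% interval from the hint: estimate +/- 2 bootstrap SEs.
ci_boot = (mu_hat - 2 * se_boot, mu_hat + 2 * se_boot)

# Analogue of R's t.test interval: t quantiles scaled by the formula-based SE.
ci_t = scipy.stats.t.interval(0.95, n - 1, loc=mu_hat, scale=scipy.stats.sem(values))

print(ci_boot)
print(ci_t)
```

<p>With a sample of this size the two intervals should nearly coincide, which matches the pattern in the two outputs above.</p>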

<h3>Exercise 5.9.5</h3>
<blockquote>
    <i>Based on this data set, provide an estimate, $\hat{\mu}_{med}$, for the median value of <code>medv</code> in the population.</i>
</blockquote>

In [7]:
med_hat = df.median().iloc[0]
med_hat

21.2

<h3>Exercise 5.9.6</h3>
<blockquote>
    <i>We now would like to estimate the standard error of $\hat{\mu}_{med}$. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.</i>
</blockquote>

In [8]:
med_arr = np.array([resample(
                df, 
                replace=True, 
                n_samples=sample_size, 
                random_state=r
            ).median().iloc[0] for r in range(0, B)])

# Bootstrap standard error: sample standard deviation of the B bootstrap medians.
mean_boot = np.mean(med_arr)
((1/(B-1))*np.sum((med_arr - mean_boot)**2))**0.5

0.39204966897440774

<h3>Exercise 5.9.7</h3>
<blockquote>
    <i>Based on this data set, provide an estimate for the tenth percentile of <code>medv</code> in Boston suburbs. Call this quantity $\hat{\mu}_{0.1}$. (You can use the <code>quantile()</code> function.)</i>
</blockquote>

In [9]:
tenth_hat = np.percentile(df, 10)
tenth_hat

12.75

<h3>Exercise 5.9.8</h3>
<blockquote>
    <i>Use the bootstrap to estimate the standard error of $\hat{\mu}_{0.1}$. Comment on your findings.</i>
</blockquote>


In [10]:
tenth_arr = np.array([np.percentile(resample(
                df, 
                replace=True, 
                n_samples=sample_size, 
                random_state=r
            ), 10) for r in range(0, B)])

# Bootstrap standard error: sample standard deviation of the B bootstrap tenth percentiles.
mean_boot = np.mean(tenth_arr)
((1/(B-1))*np.sum((tenth_arr - mean_boot)**2))**0.5

0.4992672283109169
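<p>The three bootstrap cells in this notebook differ only in the statistic applied to each resample. One way to factor that out is sketched below. The helper <code>bootstrap_se</code> is hypothetical (it is not part of the notebook or of any library), it uses NumPy's <code>Generator.choice</code> instead of <code>sklearn.utils.resample</code>, and it is demonstrated on synthetic data.</p>

```python
import numpy as np

def bootstrap_se(values, statistic, B=1000, seed=0):
    """Bootstrap standard error of statistic(values) from B resamples."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    n = values.size
    # Evaluate the statistic on B resamples drawn with replacement.
    estimates = np.array([
        statistic(rng.choice(values, size=n, replace=True)) for _ in range(B)
    ])
    # Sample standard deviation (ddof=1) of the bootstrap estimates.
    return estimates.std(ddof=1)

# Synthetic stand-in for the medv column; the values are illustrative only.
data_rng = np.random.default_rng(1)
values = data_rng.normal(loc=22.5, scale=9.0, size=500)

se_median = bootstrap_se(values, np.median)
se_tenth = bootstrap_se(values, lambda x: np.percentile(x, 10))
print(se_median, se_tenth)
```

<p>Passing the statistic as a function keeps the resampling logic in one place, so the mean, the median and any percentile can share the same code path.</p>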