# Estimation

# Reading

[Chapter 13: 13.1 - 13.4](https://inferentialthinking.com/chapters/13/Estimation.html)<br><br>

In the previous notebooks we used statistical analysis to draw conclusions about a parameter of a dataset. In this notebook we learn how to use a sample of data to estimate a parameter of the entire population.

An example of _estimation_ is surveying a sample of voters about their choice of candidate, and then using the sample's data to estimate the percentage of voters nationwide who will vote for that candidate. Because the voters in the sample are chosen at random, different random samples can give slightly different measurements purely by chance. The estimate needs to account for this variation in the sampled data.

We start with percentiles: what a percentile is, and what it means for a value to be at the 75th percentile, for example.

---

## Percentile

A dataset that contains numerical data can be sorted in increasing or decreasing order. After sorting, each number in the dataset has a particular position, or rank, and a _percentile_ is the value at a particular rank. The rank is expressed as a percentage between 0 and 100, while the percentile itself is one of the values in the dataset.

A simple definition of a percentile:<br>
The 80th percentile is the smallest value in the dataset that is greater than or equal to 80% of the data.

If we have a sequence of n numbers, to find the number that's at the _pth_ percentile:
- Sort the numbers in increasing order.
- k = (p / 100) * n
- If k is an integer, take the kth element of the sorted numbers.
- If k is not an integer, round it up to the next integer, and take that element of the sorted numbers.

Example:<br>
- A sorted sequence is 5, 8, 20, 24, 30
- The 80th percentile is:<br>
k = (80 / 100) * 5 = 4, the 4th element is 24


- The 35th percentile is:<br>
k = (35/100) * 5 = 1.75, rounding 1.75 to 2, the 2nd element is 8
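
The steps above can be sketched as a small function. This is a minimal illustration that assumes 0 < p <= 100; the name `percentile_by_definition` is made up for this example and is not a numpy function.

```python
import math

def percentile_by_definition(p, numbers):
    """Return the pth percentile using the rank steps above (assumes 0 < p <= 100)."""
    sorted_numbers = sorted(numbers)
    # k is the 1-based rank; math.ceil keeps an integer k and rounds a fractional k up
    k = math.ceil(p * len(sorted_numbers) / 100)
    return sorted_numbers[k - 1]

sequence = [5, 8, 20, 24, 30]
print("80th percentile:", percentile_by_definition(80, sequence))  # 24
print("35th percentile:", percentile_by_definition(35, sequence))  # 8
```

Both results agree with the values worked out by hand in the example.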

Now that we have a general idea of a percentile, we apply it to a larger dataset that is also used by the textbook: exam scores by class section.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

url = "https://raw.githubusercontent.com/DeAnzaDataScience/CIS11/refs/heads/main/datasets_notes/scores_by_section.csv"
scores = pd.read_csv(url)
print("First 5 rows:")
scores.head()

To have an overview of the data we plot the score distribution.

In [None]:
plt.figure(figsize=(4,3))
plt.hist(scores.Midterm, bins=np.arange(-0.5, 25.6, 1), edgecolor='black')
plt.xlabel('Score')
plt.title('Distribution of Midterm Scores')
plt.grid()
plt.show()

To find the 85th percentile, we can use the same calculation steps as above, or we can use the `percentile` function of numpy:<br>
`value = np.percentile(data_sequence, p)`<br>
where `p` is the percentile rank between 0 and 100.

In [None]:
print("85th percentile:", np.percentile(scores.Midterm, 85))
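
One detail worth knowing: by default `np.percentile` interpolates between neighboring sorted values, so its result may not be one of the values in the data. A short check on the 5-number sequence from the earlier example (the variable names are only for illustration):

```python
import numpy as np

sequence = [5, 8, 20, 24, 30]

# Default behavior: linear interpolation between the 4th and 5th sorted values
interpolated = np.percentile(sequence, 80)
print("np.percentile default:", interpolated)  # about 25.2, not the 24 found by hand
```

The gap is most noticeable on a tiny dataset like this one, where neighboring values are far apart.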

### Quartile

The sorted data can be divided into 4 parts of equal size. The percentiles that mark the boundaries between these parts are called _quartiles_.

- The _first quartile_ is the 25th percentile
- The _second quartile_ is the 50th percentile. The 50th percentile is also called the _median_, the midpoint of the sorted data.
- The _third quartile_ is the 75th percentile.

In a distribution, the values between the 25th percentile and the 75th percentile make up the _middle 50%_ interval.

For the midterm scores, we calculate:

In [None]:
print("First quartile:", np.percentile(scores.Midterm, 25))
print("Second quartile:", np.percentile(scores.Midterm, 50))
print("Third quartile:", np.percentile(scores.Midterm, 75))
print("Middle 50% is between:", np.percentile(scores.Midterm, 25), "and", np.percentile(scores.Midterm, 75))
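
As a side note, `np.percentile` also accepts a list of percentiles, so the three quartiles can be computed in one call. The sketch below uses a small made-up array in place of `scores.Midterm`.

```python
import numpy as np

# Made-up scores standing in for scores.Midterm
example_scores = np.array([12, 15, 9, 20, 17, 22, 14, 18, 11, 16, 19])

# Passing a list of percentiles returns an array with one value per percentile
q1, q2, q3 = np.percentile(example_scores, [25, 50, 75])
print("Quartiles:", q1, q2, q3)
print("Second quartile equals the median:", q2 == np.median(example_scores))
```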

---

## The Bootstrap

We now tackle the problem of how to estimate a parameter for the entire population when the sample of data on which we base our estimation will vary slightly each time we sample. The method we will use to overcome the problem is called a _bootstrap_.

The bootstrap generates new random samples by a method called _resampling_: the new samples are drawn at random from the original sample.

To see how the bootstrap works, we use the same dataset as the textbook: salary data from the city of San Francisco.


In [None]:
url = "https://raw.githubusercontent.com/DeAnzaDataScience/CIS11/refs/heads/main/datasets_notes/san_francisco_2019.csv"
salaries = pd.read_csv(url)
print("First 5 rows:")
salaries.head()

In [None]:
print("Total number of salaries in the dataset:", len(salaries))

- <u>Data Overview</u>

The `Total Compensation` column has the amount paid for each position in 2019. It reflects a full year's salary only if the person worked the entire year. If someone worked for only 2 months in 2019, then the amount covers only those 2 months.

We now calculate the minimum that someone would earn by working at least half a year, at 40 hours a week, given that the minimum wage in 2019 was about $15/hr.

In [None]:
min_halfyear_salary = 15 * 40 * (52 / 2)
print("Minimum half-year salary:", min_halfyear_salary)

Next we remove the rows where the `Total Compensation` column is below the minimum half year salary of about 15,600.

In [None]:
# keep only the rows at or above the minimum half-year salary
salaries = salaries[salaries['Total Compensation'] >= min_halfyear_salary]
print("Number of salaries in the dataset:", len(salaries))

Visualizing the compensation we have:

In [None]:
plt.figure(figsize=(4,3))
plt.hist(salaries['Total Compensation'], bins=np.arange(0, 726000, 25000), edgecolor='black')
plt.xlabel('Salary')
plt.title('Distribution of Salaries')
plt.grid()
plt.show()

Observing the histogram, we notice that most of the salaries are under 400,000, but there are some salaries that are above 600,000.

We inspect these high salaries:

In [None]:
salaries[salaries['Total Compensation'] > 600000]

There are 3 positions that have the high salaries.

- <u>The Parameter</u>

For this exercise, we want to estimate the median salary of the city employees. Our parameter is the median salary.

Recall that the median is also the 50th percentile, which means about 50% of the salaries are at or below it. It is the middle value of the sorted salaries, not the midpoint between the lowest and highest salary.

For this exercise, the dataset happens to be the entire payroll of city employees, so we don't really need to estimate the median. We can simply calculate it.

In [None]:
median_salary = np.median(salaries['Total Compensation'])
print("Median salary:", median_salary)

But to demonstrate how bootstrapping works, we will draw a random sample of 500 salaries. Then we'll use the median of the sample to estimate the median salary of all city employees, and compare the estimate with the actual median salary above to see how good it is.

- <u>The Sample</u>

We sample 500 salaries without replacement, plot their distribution, and find the median.

In [None]:
# the original sample is drawn without replacement, so no employee appears twice
sample_salaries = salaries['Total Compensation'].sample(n=500, replace=False)

plt.figure(figsize=(4,3))
plt.hist(sample_salaries, bins=np.arange(0, 726000, 25000), edgecolor='black')
plt.xlabel('Salary')
plt.title('Distribution of Sampled Salaries')
plt.grid()
plt.show()

sample_median_salary = np.median(sample_salaries)
print("Median salary of the sample:", sample_median_salary)

- <u>The Bootstrap</u>

We now _resample_ from the sample of 500 salaries under these 2 conditions:
1. Create a sample of the same size: 500<br>
because we want to find the median of 500 salaries.
2. But _with_ replacement<br>
so that we have different data in the resample.

With replacement, some of the data in the original sample will be re-used and put in the resample more than once, and some of the data in the original sample might not be put in the resample at all.
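
The effect of drawing with replacement can be seen in a small sketch. It uses numpy's random generator on 500 stand-in values instead of the salary sample, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed so the run is repeatable

# A stand-in "original sample" of 500 distinct values (labels 0..499)
original = np.arange(500)

# Resample: same size as the original, drawn with replacement
a_resample = rng.choice(original, size=len(original), replace=True)

distinct = len(np.unique(a_resample))
print("Size of resample:", len(a_resample))
print("Distinct original values that appear:", distinct)
print("Original values left out:", len(original) - distinct)
```

On average, a resample of this kind contains roughly 63% of the distinct original values. The rest of its entries are repeats.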

This resampling process helps data scientists create new samples that are similar to the original sample. They can now run simulations with many different samples without having to repeat the actual sampling of data, which could be prohibitively costly in both time and money. Rather than being stuck with one actual sample, data scientists pull themselves up by their own _bootstraps_ so they can continue with their data analysis work.

The bootstrap works because the law of averages says that a large random sample is likely to have a distribution similar to that of the population from which it is drawn. So the original sample of 500 resembles the population, and each resample of 500 drawn from the original sample in turn resembles the original sample. The resamples should therefore have similar distributions and similar medians.

We now write a function to resample the original sample and find its median.

In [None]:
def resample():
    # draw 500 salaries from the original sample, with replacement
    a_resample = sample_salaries.sample(n=500, replace=True)
    return np.median(a_resample)

print(resample())

Running the Code cell above several times, we observe that the median is fairly close to the median of the original sample.

We run the `resample()` function 5000 times, collect the resampled medians, and plot their distribution.

In [None]:
L = []
for i in range(5000):
    L.append(resample())
median_array = np.array(L)

In [None]:
plt.figure(figsize=(4,3))
plt.hist(median_array, bins=np.arange(120000, 160000, 2000), edgecolor='black')
plt.xlabel('Median Salary')
plt.title('Distribution of Resampled Medians')
plt.scatter(median_salary, 0, color='red', s=40, zorder=2)
plt.grid()
plt.show()

The red dot is the actual median salary.
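
For larger simulations, the Python loop can be replaced by a single numpy call that draws all the resamples at once. The sketch below uses made-up, salary-like numbers rather than the San Francisco data, and the helper name `bootstrap_medians` is hypothetical.

```python
import numpy as np

def bootstrap_medians(sample, repetitions, rng):
    """Return an array holding the median of each of `repetitions` resamples."""
    sample = np.asarray(sample)
    # Each row is one resample of the same size as the sample, drawn with replacement
    resamples = rng.choice(sample, size=(repetitions, len(sample)), replace=True)
    return np.median(resamples, axis=1)

rng = np.random.default_rng(seed=1)  # arbitrary seed
# Made-up, right-skewed stand-in for a sample of 500 salaries
fake_sample = rng.lognormal(mean=11.8, sigma=0.4, size=500)

medians = bootstrap_medians(fake_sample, 5000, rng)
print("Number of resampled medians:", len(medians))
print("Middle 95%:", np.percentile(medians, 2.5), "to", np.percentile(medians, 97.5))
```

The trade-off is memory: all 5000 resamples are held in one array, which is fine at this size but may not be for much larger samples.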

- <u>The Result</u>

To see if the estimate is good enough, we first define "good enough" as when the actual median is in the middle 95% of the resampled medians.

In [None]:
lower_limit = np.percentile(median_array, 2.5)
upper_limit = np.percentile(median_array, 97.5)
print("Population median is in the middle 95%:", lower_limit < median_salary < upper_limit)

To check that the result above is not just a lucky outcome, we repeat the entire process 100 times: draw a new original sample of 500 salaries from the population, bootstrap it 5000 times, and check whether the population median is within the middle 95% range.

In [None]:
## Note that this Code cell takes minutes to run

# list to store results
middle_95_percent_results = []

# repeat the entire process 100 times
for i in range(100):
    # draw a new original sample from the population
    new_sample = salaries['Total Compensation'].sample(n=500, replace=False)

    # one simulation: bootstrap the new sample 5000 times
    L = []
    for j in range(5000):
        a_resample = new_sample.sample(n=500, replace=True)
        L.append(np.median(a_resample))
    median_array = np.array(L)

    # limits of the middle 95%
    lower_limit = np.percentile(median_array, 2.5)
    upper_limit = np.percentile(median_array, 97.5)

    # store the result of whether the population median is in the middle 95%
    middle_95_percent_results.append(lower_limit < median_salary < upper_limit)

result_array = np.array(middle_95_percent_results)

In [None]:
print("Number of True values in result_array:", np.sum(result_array))

The count printed above is the number of runs, out of 100, in which the middle 95% range of the resampled medians contained the population median.

Statistical theory about the bootstrap states that in about 95 out of 100 such simulations, the population median will be within the middle 95% range. Note that this doesn't mean the population median is at the middle of the range, only that it is somewhere within it. In any particular set of 100 simulations the count could be more or less than 95, but if we repeat the experiment many times, it is close to 95 out of 100 on average.

---

## Confidence Interval

In the previous example, the middle 95% range of the resampled medians from each simulation is called an _interval of estimates_. Such an interval will be correct, meaning it will contain the actual median, about 95% of the time. Therefore the interval of estimates is called a _95% confidence interval_.

We will now use the bootstrap to work with a sample where we don't know the actual population parameter. We'll estimate the population parameter from the sample with a 95% confidence interval.

We go back to the dataset with mothers and babies' birth weights.

In [None]:
url = "https://raw.githubusercontent.com/DeAnzaDataScience/CIS11/refs/heads/main/datasets_notes/baby.csv"
births = pd.read_csv(url)
births.head()

- <u>The Parameter</u>

A baby's birth weight is important because a healthy birth weight generally means the baby is healthy. The birth weight is related to the gestational days: babies who are full term tend to be heavier than babies who are born prematurely.

For the parameter, we will find the ratio of birth weight to gestational days, and calculate the median of this ratio.

In [None]:
ratios = births['Birth Weight'] / births['Gestational Days']
median_ratio = np.median(ratios)
print("Median ratio:", median_ratio)

Now that we have the median birth weight to gestational days ratio for the sample, we will estimate this ratio for the population.

- <u>The Bootstrap</u>

1. We write a function to find the median ratio for one resample.

In [None]:
def resample_ratio():
    a_resample = ratios.sample(n=len(ratios), replace=True)
    return np.median(a_resample)

print(resample_ratio())

2. We simulate by running `resample_ratio()` 5000 times to get a range of ratios and find the 95% confidence interval.

In [None]:
L = []
for i in range(5000):
    L.append(resample_ratio())
median_array = np.array(L)

lower_limit = np.percentile(median_array, 2.5)
upper_limit = np.percentile(median_array, 97.5)
print("95% confidence interval:", round(lower_limit, 3), "to", round(upper_limit, 3))

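We can see that the original sample's median ratio of 0.429 is within this 95% confidence interval. We can also say that we are 95% confident that the population's median ratio is within the interval, because the process that produced the interval captures the population parameter about 95% of the time.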

If we change the confidence level to 80%, the interval will be narrower because we're looking at only the middle 80% of the resampled medians.

In [None]:
lower_limit = np.percentile(median_array, 10)
upper_limit = np.percentile(median_array, 90)
print("80% confidence interval:", round(lower_limit, 3), "to", round(upper_limit, 3))

The interval is narrower, so it seems more precise than the 95% confidence interval. The trade-off is lower confidence: intervals built this way contain the population's ratio only about 80% of the time.
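
The percentiles used for each confidence level follow one rule: leave out an equal share on each side of the middle. A minimal sketch, with a helper name (`percentile_bounds`) that is made up for illustration:

```python
def percentile_bounds(confidence_level):
    """Return the (lower, upper) percentiles that bound the middle `confidence_level` percent."""
    tail = (100 - confidence_level) / 2  # percent left out on each side
    return tail, 100 - tail

print(percentile_bounds(95))  # (2.5, 97.5)
print(percentile_bounds(80))  # (10.0, 90.0)
```

The two returned numbers can be passed to `np.percentile` as the lower and upper limits.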

### Conditions Needed by the Bootstrap

We can see that the bootstrap is a relatively simple method to estimate a parameter of a population with a confidence interval. But for the bootstrap to work well, there are a few points to keep in mind:

1. The original sample must be a large random sample, since we depend on the law of averages. The bootstrap is unreliable when the original sample is very small, such as 10 or 15 data records.
2. The simulation must have a loop that runs a large number of times to create many resamples. Repeating the resampling 10,000 times is recommended in general.
3. The parameter that we want to estimate must not be at the edge of the population distribution. Specifically, the bootstrap would not work well for:
> - The max or min value, or a very high or low percentile, or a parameter that is determined by only a small number of data in the sample.
> - A statistic whose distribution across resamples is _not_ roughly bell shaped.
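
To see why the maximum is a poor target for the bootstrap, the sketch below resamples the max of a made-up sample. The population, sample size, and seed are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed

# Made-up population and a random sample of 200 drawn from it
population = rng.exponential(scale=50000, size=100000)
sample = rng.choice(population, size=200, replace=False)

# Bootstrap the maximum: 2000 resamples, each the same size as the sample
resamples = rng.choice(sample, size=(2000, len(sample)), replace=True)
resampled_maxes = resamples.max(axis=1)

upper_limit = np.percentile(resampled_maxes, 97.5)
print("Population max:", population.max())
print("Sample max:", sample.max())
print("Upper limit of the bootstrap interval:", upper_limit)
# A resample only contains values from the sample, so no resampled max can exceed the sample max
print("Any resampled max above the sample max?", np.any(resampled_maxes > sample.max()))
```

Unless the sample happens to include the largest value in the population, the interval cannot reach the population maximum, no matter how many resamples are drawn.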

---

## Using the Confidence Interval

In addition to estimating a parameter of the population, the confidence interval has other uses. We look at two examples.

### Test the hypotheses

The confidence interval can be used to quickly test hypotheses.

First we go back to the `births` DataFrame.

In [None]:
births.head()

- <u>The Parameter</u>

We want to estimate the average age of the mothers in the population, so the parameter to estimate is the average age.

We look at the average age for the original sample.

In [None]:
print(round(np.mean(births['Maternal Age']),2))

- <u>The Bootstrap</u>

Next we run a simulation with 5000 resamples to find the 95% confidence interval.

In [None]:
def resample_age():
    a_resample = births['Maternal Age'].sample(n=len(births), replace=True)
    return np.mean(a_resample)

L = []
for i in range(5000):
    L.append(resample_age())
average_array = np.array(L)

lower_limit = np.percentile(average_array, 2.5)
upper_limit = np.percentile(average_array, 97.5)
print("95% confidence interval:", round(lower_limit,2), "to", round(upper_limit,2))

- <u>The Hypotheses</u>

> - The null hypothesis: The average age is 30.
> - The alternative hypothesis: The average age is not 30.

- <u>The Conclusion</u>

Since the 95% confidence interval is 26.91 to 27.57, we can see that 30 is outside the interval, well above the upper limit of 27.57 (the 97.5th percentile of the resampled averages). If we use the standard p-value cut off of 5%, then the data rejects the null hypothesis.

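As a review, we can also approximate the p-value of 30 as the percentage of the resampled averages that are at least 30:

In [None]:
# divide by the number of resampled averages, not the number of rows in births
np.count_nonzero(average_array >= 30) / len(average_array) * 100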

We see that the very small p-value provides evidence to support the alternative hypothesis.
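
The decision rule used here can be written as a one-line check: reject the null hypothesis when its value falls outside the confidence interval. The function name `rejects_null` is made up for this sketch, and the limits are the ones reported above.

```python
def rejects_null(null_value, lower_limit, upper_limit):
    """Return True if the null value falls outside the confidence interval."""
    return not (lower_limit <= null_value <= upper_limit)

# Interval limits reported above for the average maternal age
print(rejects_null(30, 26.91, 27.57))  # True: 30 is outside the interval
print(rejects_null(27, 26.91, 27.57))  # False: 27 is inside the interval
```

With a 95% confidence interval this corresponds to a p-value cut off of 5%. A 99% interval would correspond to a cut off of 1%.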

### Compare the Baseline

For the second use of the confidence interval, we look at the dataset from the textbook on patients with Hodgkin's disease who received chemotherapy.

In [None]:
url = "https://raw.githubusercontent.com/DeAnzaDataScience/CIS11/refs/heads/main/datasets_notes/hodgkins.csv"
hodgkins = pd.read_csv(url)
hodgkins.head()


- <u>The Parameter</u>

The 2 columns we're interested in are `base` and `month15`, which are the scores for the patient's lungs at the baseline (before treatment) and after 15 months of treatment, respectively. The higher the score, the healthier the lungs.

We calculate the difference in lung scores (baseline score minus 15-month score) and add it as a column of the DataFrame.

In [None]:
hodgkins['diff'] = hodgkins['base'] - hodgkins['month15']
hodgkins.head()

The average difference in lung scores for the sample is:

In [None]:
print(round(np.mean(hodgkins['diff']),2))

What about the average difference in lung scores for the entire population?

- <u>The Hypotheses</u>

> - The null hypothesis: the average difference in lung scores is 0.
> - The alternative hypothesis: the average difference in lung scores is not 0.

We want a p-value cut off of 1%, which means we use a wider interval of differences than we would with the standard cut off of 5%.

- <u>The Bootstrap</u>

Since we want a p-value cut off of 1%, we need to resample the difference in lung scores, and run the simulation to find the 99% confidence interval.




In [None]:
def resample_diff():
    a_resample = hodgkins['diff'].sample(n=len(hodgkins), replace=True)
    return np.mean(a_resample)

L = []
for i in range(5000):
    L.append(resample_diff())
average_array = np.array(L)

lower_limit = np.percentile(average_array, 0.5)
upper_limit = np.percentile(average_array, 99.5)
print("99% confidence interval:", round(lower_limit,2), "to", round(upper_limit,2))

We see that even with the wider range of 17.61 to 40.1, the 0 difference of the null hypothesis is not even close to the 99% confidence interval. This means the data rejects the null hypothesis at the 1% cut off, and we conclude that the average difference in lung scores is not 0.

Thanks to the confidence interval calculation, we can do better than just making the decision to reject the null hypothesis, which we first did in a previous notebook. Now we can also say how far the null hypothesis is from our estimate. The null hypothesis value of 0 is far from our lowest estimate of 17.61.

---

In this notebook we saw that by relying on the law of averages with a large sample, we can pull ourselves up by our bootstraps: we resample the original sample that we have and run a simulation with the resamples. The simulation gives us a confidence interval for our estimate of a parameter of the population.