# Statistical Inference

## Terms

**Statistical Inference** is the process of using a sample to make a conclusion about the broader population from which it is taken.

- **Population** - Pool of individuals from which a statistical sample is drawn for a study.
- **Population Parameter** - A numerical characteristic of the entire population, such as its mean or standard deviation.
- **Sample** - Subset of individuals collected from the population.
- **Sample Estimate** - A numerical characteristic of the sample that estimates the population parameter.
- **Point Estimate** - A single value, calculated from the sample, that estimates the population parameter.
- **Sampling Distribution** - A distribution of point estimates, where each point estimate was calculated from a different random sample from the same population.
- **Confidence Interval** - A range of plausible values for the population parameter.
- **Sampling Variability** - The variation in estimates from one sample to the next, which arises because each random sample contains different observations.

Different Types of Sampling
- **Random Sampling** - Selecting a subset of observations from a population where each observation is equally likely to be selected at any point during the selection process
- **Representative Sampling** - Selecting a subset of observations from a population where the sample’s characteristics are a good representation of the population’s characteristics

Different Types of Parameters
- **Variance** - The mean of the squared distances of each observation from the mean value of all observations.
- **Standard Deviation** - The square root of the variance.
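
A small worked example may make these two definitions concrete. The sketch below uses base R and a made-up vector of ages (hypothetical values, not from any data set in these notes). Note that R's built-in ```var()``` and ```sd()``` divide by $n - 1$ rather than $n$, so they give slightly larger values than the "mean of squared distances" definition.

```r
# Hypothetical ages, made up for illustration
ages <- c(21, 25, 30, 34, 40)
n <- length(ages)

# Variance as defined above: the mean of the squared distances from the mean
pop_variance <- mean((ages - mean(ages))^2)

# Standard deviation: the square root of the variance
pop_sd <- sqrt(pop_variance)

# R's var() divides by n - 1 instead of n
sample_variance <- var(ages)

pop_variance
pop_sd
sample_variance
```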

We will be focussing on two types of statistical inference:
1. Using categorical observations to estimate the proportion of a category.
2. Using quantitative observations to estimate the average (or mean)

## Random Functions

```plot_grid``` (from the ```cowplot``` package) can be used to plot sampling distributions side by side.
```r
sampling_distribution_panel <- plot_grid(sampling_distribution_20,
                                         sampling_distribution,
                                         sampling_distribution_100,
                                         ncol = 3)
```


```summarize``` can be used to get parameters such as mean, median, and standard deviation. (Note whether the data is categorical or quantitative)
```r
sample_1_estimates <- sample_1 %>%
                      summarize(
                          sample_1_mean = mean(age),
                          sample_1_med = median(age),
                          sample_1_sd = sd(age)
                      )
```

## Sampling Distributions

### Sampling Distribution for Categorical Variables (Proportions)

Using ```summarize``` we can find the proportion of observations that fall in a given category.
```r
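airbnb %>%
  summarize(
    # count the listings in the category (comparison needs `==`, not `=`)
    n = sum(room_type == "Entire home/apt"),
    # divide by the total number of listings to get the proportion
    proportion = sum(room_type == "Entire home/apt") / nrow(airbnb)
  )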
```

Instead of getting the population parameter, we can use a sample to approximate the proportion of rooms that are listed with room type "Entire home/apt". We can do this by using the ```rep_sample_n``` function from the ```infer``` package.

```r
library(infer)

sample_1 <- rep_sample_n(tbl = airbnb, size = 40)

airbnb_sample_1 <- summarize(sample_1,
  n = sum(room_type == "Entire home/apt"),
  prop = sum(room_type == "Entire home/apt") / 40
)

airbnb_sample_1
```

***PROBLEM:*** Estimates vary from sample to sample due to **sampling variability**.

By getting the sampling distribution, we can see how much we would expect our sample proportions from this population to vary.

```r
samples <- rep_sample_n(airbnb, size = 40, reps = 20000)
```
Here, we get 20000 samples of size 40.
- ```size = 40``` refers to each sample size
- ```reps = 20000``` refers to the number of samples we are taking.

This returns a tibble:

<img src="media/replicate.png" width="400px">

Notice that the ```replicate``` column indicates the replicate (sample) to which each listing belongs.

Now we calculate the proportion for each replicate (sample of 40):

```r
sample_estimates <- samples |>
                    group_by(replicate) |>
                    summarize(sample_proportion = sum(room_type == "Entire home/apt") / 40)
```

<img src="media/replicate_sample_estimates.png" width="200px">

We can now visualize the distribution of the sample proportions for samples of size 40 using a histogram.
```r
sampling_distribution <- ggplot(sample_estimates, aes(x = sample_proportion)) +
                             geom_histogram(fill = "dodgerblue3", color = "lightgrey", bins = 12) +
                             labs(x = "Sample proportions", y = "Count") +
                             theme(text = element_text(size = 12))
```

<img src="media/sample_distribution.png" width="250px">

We can also get the mean of the sample proportions.
```r
sample_estimates |>
  summarize(mean = mean(sample_proportion))
```
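
To see what the pipeline above is doing, here is a rough base R sketch of the same idea on simulated data. The ```room_type``` vector and its 75% share of entire homes are made-up assumptions, not the real ```airbnb``` data, and ```replicate()``` stands in for ```rep_sample_n()``` followed by ```group_by()``` and ```summarize()```.

```r
set.seed(1)

# Simulated population of 5000 listings (hypothetical, not the airbnb data)
room_type <- sample(c("Entire home/apt", "Private room"),
                    size = 5000, replace = TRUE, prob = c(0.75, 0.25))

# Population parameter: the true proportion of entire homes
population_proportion <- mean(room_type == "Entire home/apt")

# Draw 2000 samples of size 40 and record each sample proportion
sample_proportions <- replicate(2000, {
  one_sample <- sample(room_type, size = 40)
  mean(one_sample == "Entire home/apt")
})

# The mean of the sample proportions should sit close to the population value
population_proportion
mean(sample_proportions)
```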

### Sampling Distribution for Quantitative Variables (Means)

```r
samples <- rep_sample_n(airbnb, size = 40, reps = 20000)

sample_estimates <- samples |>
                    group_by(replicate) |>
                    summarize(sample_mean = mean(price))
                    
sampling_distribution_40 <- ggplot(sample_estimates, aes(x = sample_mean)) +
                                geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
                                labs(x = "Sample mean price per night (Canadian dollars)", y = "Count") +
                                theme(text = element_text(size = 12))
```

Comparing the population distribution, the distribution of a single sample, and the sampling distribution shows that all three are centered around the same price (about $150).

<div style="display: flex; flex-direction: row; width: 1000px;">
    <ul >
        <li>The original distribution is skewed right</li>
        <li>The sample distribution has a similar shape</li>
        <li>The sampling distribution is bell shaped and has lower spread than the population or sample distributions. The sample means vary less than the individual observations because each sample contains both high and small values, which keep the average from being too extreme.</li>
    </ul>
    <img src="media/distribution_comparisons.png" width="200px">
</div>

One way to improve the point estimate is to take a larger sample.

<img src="media/sample_size_comparisons.png" width="300px">

Notice:
1. The mean of the sample mean (across samples) is equal to the population mean. (The sampling distribution is centered at the population mean.)
2. Increasing the size of the sample decreases the spread (variability) of the sampling distribution.
3. The distribution of the sample mean is roughly bell-shaped.
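
These three observations can be checked with a small simulation. The sketch below is base R on a simulated right-skewed population (an exponential distribution with mean 150, chosen only for illustration); ```sample_means()``` is a helper defined here, not a function from any package.

```r
set.seed(2)

# Simulated right-skewed "prices" (hypothetical population)
population <- rexp(10000, rate = 1 / 150)

# Illustrative helper: the means of `reps` random samples of size `n`
sample_means <- function(n, reps = 2000) {
  replicate(reps, mean(sample(population, size = n)))
}

means_20  <- sample_means(20)
means_100 <- sample_means(100)

# Both sampling distributions are centered near the population mean
c(population = mean(population), n20 = mean(means_20), n100 = mean(means_100))

# The larger sample size gives the smaller spread
c(n20 = sd(means_20), n100 = sd(means_100))
```

Calling ```hist(means_20)``` and ```hist(means_100)``` should show two roughly bell-shaped histograms, the second one narrower.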

## Bootstrapping

In real life, we usually only have access to one sample from the population. Therefore, we cannot construct the sampling distribution from the previous section. However, we can try to approximate the sampling distribution. We discuss ***interval estimation*** and construct ***confidence intervals*** using just a single sample from a population.

When you take a large enough sample from the population, it *looks* like the population. So, by taking many samples from our single sample (***bootstrapping***), we can get an approximation of the true sampling distribution, the ***bootstrap distribution***.

***NOTE:*** We must sample ***WITH REPLACEMENT***. If we had a sample of size $n$ and drew a sample of size $n$ from it *without* replacement, we would just get our original sample back (possibly in a different order), so every point estimate would be identical.
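
A quick way to convince yourself of this is to try both options on a tiny vector. This is a minimal base R sketch; the five prices are made up.

```r
set.seed(3)

# A tiny made-up sample of prices
original <- c(80, 95, 120, 150, 300)

# Without replacement: the original values in a shuffled order
without_replacement <- sample(original, size = length(original), replace = FALSE)

# With replacement: some values may repeat and others may be left out
with_replacement <- sample(original, size = length(original), replace = TRUE)

sort(without_replacement)   # same values as sort(original)
mean(without_replacement)   # same as mean(original)
with_replacement
mean(with_replacement)      # usually differs from mean(original)
```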

### Process

$n$ is the number of observations in the original sample.

***EACH BOOTSTRAP SAMPLE has the same number of observations as the original sample***

1. Randomly select an observation from the original sample, which was drawn from the population.
2. Record the observation’s value.
3. Replace that observation.
4. Repeat steps 1–3 (sampling with replacement) until you have $n$ observations, which form a bootstrap sample. 
5. Calculate the bootstrap point estimate (e.g., mean, median, proportion, slope, etc.) of the $n$ observations in your bootstrap sample.
6. Repeat steps 1–5 many times to create a distribution of point estimates (the bootstrap distribution).
7. Calculate the plausible range of values around our observed point estimate.
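
The steps above can be sketched in a few lines of base R. This is only an illustration on simulated data: ```original_sample``` is a made-up sample of 40 prices, and ```one_bootstrap_mean()``` is a helper defined here.

```r
set.seed(4)

# Hypothetical original sample of n = 40 prices (simulated, right-skewed)
original_sample <- rexp(40, rate = 1 / 150)
n <- length(original_sample)

# Steps 1-5: draw n observations with replacement, then compute the mean
one_bootstrap_mean <- function() {
  bootstrap_sample <- sample(original_sample, size = n, replace = TRUE)
  mean(bootstrap_sample)
}

# Step 6: repeat many times to build the bootstrap distribution
bootstrap_means <- replicate(5000, one_bootstrap_mean())

# Step 7: the middle 95% of the bootstrap distribution
bounds <- quantile(bootstrap_means, c(0.025, 0.975))
bounds
```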

<img src="media/bootstrapping_process.png" width="400px">

Getting the original sample:
```r
one_sample <- airbnb |>
              rep_sample_n(40)
```

Doing steps 1-4 listed above generates a single bootstrap sample. In R, ```reps = 20000``` repeats this so that we take 20,000 bootstrap samples from the original sample:
```r
boot20000 <- one_sample |>
             rep_sample_n(size = 40, replace = TRUE, reps = 20000)
```

Let's see the histograms of the first six replicates of our bootstrap samples.
```r
six_bootstrap_samples <- boot20000 |>
                         filter(replicate <= 6)

ggplot(six_bootstrap_samples, aes(price)) +
    geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
    labs(x = "Price per night (Canadian dollars)", y = "Count") +
    facet_wrap(~replicate) +
    theme(text = element_text(size = 12))
```
<img src="media/six_bootstrap_samples.png" width="400px">

We can calculate the point estimates for our 20,000 bootstrap samples and generate a bootstrap distribution of our point estimates.
```r
boot20000_means <- boot20000 %>%
                   group_by(replicate) %>%
                   summarize(mean = mean(price))

boot_est_dist <- ggplot(boot20000_means, aes(x = mean)) +
                     geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
                     labs(x = "Sample mean price per night \n (Canadian dollars)", y = "Count") +
                     theme(text = element_text(size = 12))
```

Comparing the bootstrap and sampling distribution:

<img src="media/sampling_bootstrap_dist.png" width="400px">

Notice:
1. The shape and spread of the true sampling distribution and the bootstrap distribution are similar; the bootstrap distribution lets us get a sense of the point estimate's variability.
2. The means of the two distributions are different. Because we are resampling from the original sample repeatedly, the bootstrap distribution is centered at the original sample's mean value (unlike the sampling distribution of the sample mean, which is centered at the population parameter value).

Overall:
- **Bootstrap (sampling) distribution** estimates the (true) sampling distribution.
- **Standard Deviation** of the bootstrap sampling distribution estimates the **standard error**, which is the standard deviation of the (true) sampling distribution.
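
As a rough check of the second point, the sketch below compares the standard deviation of a bootstrap distribution with the usual formula for the standard error of a mean, $s / \sqrt{n}$. The formula is not used elsewhere in these notes and is included only as a reference value; the data are simulated.

```r
set.seed(5)

# Hypothetical sample of 40 prices (simulated)
one_sample <- rexp(40, rate = 1 / 150)

# Bootstrap distribution of the sample mean
boot_means <- replicate(5000, mean(sample(one_sample, replace = TRUE)))

# Bootstrap estimate of the standard error
boot_se <- sd(boot_means)

# Formula-based standard error of the mean, for comparison
formula_se <- sd(one_sample) / sqrt(length(one_sample))

c(bootstrap = boot_se, formula = formula_se)
```

The two numbers should be close but not identical, since the bootstrap value depends on the random resamples.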

Summary of the bootstrap process:

<img src="media/bootstrap_process.png" width="400px">

### Calculating a Plausible Range

We will find the range of values covering the middle 95% of the bootstrap distribution, giving us a 95% confidence interval. This means that if we took 100 random samples and calculated a 95% confidence interval from each one, then about 95 of the 100 intervals would capture the population parameter's value.
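
This interpretation can be illustrated with a simulation in which the population mean is known. The sketch below is base R on a simulated population (hypothetical data); ```one_interval()``` is a helper defined here. With small, skewed samples the share of intervals that capture the true mean may come out a little below 95%.

```r
set.seed(6)

# Simulated population with a known mean (hypothetical)
population <- rexp(10000, rate = 1 / 150)
population_mean <- mean(population)

# One 95% percentile bootstrap interval from one fresh sample of size 40
one_interval <- function() {
  one_sample <- sample(population, size = 40)
  boot_means <- replicate(500, mean(sample(one_sample, replace = TRUE)))
  quantile(boot_means, c(0.025, 0.975))
}

# 100 intervals, stored as a 2 x 100 matrix (row 1 = lower, row 2 = upper)
intervals <- replicate(100, one_interval())

# Fraction of the intervals that capture the population mean
captured <- intervals[1, ] <= population_mean & population_mean <= intervals[2, ]
mean(captured)
```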

A higher confidence level corresponds to a wider interval and a lower confidence level corresponds to a narrower one. Therefore, the level we choose is based on what chance we are willing to take of being wrong.
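
The trade-off between confidence level and width is easy to see on a single bootstrap distribution. The sketch below uses simulated data and an illustrative helper, ```interval_width()```, defined here.

```r
set.seed(7)

# Hypothetical bootstrap distribution of sample means (simulated)
one_sample <- rexp(40, rate = 1 / 150)
boot_means <- replicate(5000, mean(sample(one_sample, replace = TRUE)))

# Width of the percentile interval at a given confidence level
interval_width <- function(level) {
  alpha <- 1 - level
  bounds <- quantile(boot_means, c(alpha / 2, 1 - alpha / 2))
  unname(bounds[2] - bounds[1])
}

# Widths for 90%, 95% and 99% confidence
widths <- sapply(c(0.90, 0.95, 0.99), interval_width)
names(widths) <- c("90%", "95%", "99%")
widths
```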

To calculate a 95% percentile bootstrap confidence interval, we will do the following:
1. Arrange the observations in the bootstrap distribution in ascending order.
2. Find the value such that 2.5% of observations fall below it (the 2.5th percentile). Use that value as the lower bound of the interval.
3. Find the value such that 97.5% of observations fall below it (the 97.5th percentile). Use that value as the upper bound of the interval.
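
Done by hand, these steps might look like the following base R sketch. The bootstrap distribution here is simulated (hypothetical data), and picking the value at position ```round(0.025 * B)``` is just one simple way to locate a percentile.

```r
set.seed(8)

# Hypothetical bootstrap distribution of 2000 sample means (simulated)
one_sample <- rexp(40, rate = 1 / 150)
boot_means <- replicate(2000, mean(sample(one_sample, replace = TRUE)))

# Step 1: arrange the bootstrap means in ascending order
sorted_means <- sort(boot_means)
B <- length(sorted_means)

# Step 2: the value with about 2.5% of the observations below it
lower <- sorted_means[round(0.025 * B)]

# Step 3: the value with about 97.5% of the observations below it
upper <- sorted_means[round(0.975 * B)]

c(lower = lower, upper = upper)

# Share of bootstrap means inside the interval (about 0.95)
inside <- mean(boot_means >= lower & boot_means <= upper)
inside
```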

The ```quantile()``` function handles all of this for us.
```r
bounds <- boot20000_means %>%
          select(mean) %>%
          pull() %>%
          quantile(c(0.025, 0.975))
```

<img src="media/c_interval.png" width="400px">

Our interval, 119.28 to 203.63, captures the middle 95% of the sample mean prices in the bootstrap distribution.

We can, then, report:

"Here the sample mean price per night of 40 Airbnb listings was $155.8 and we are 95% "confident" that the true population mean price per night for all Airbnb listings in Vancouver is between 119.28 and 203.63."

> "The sample mean ___ of (number of observations in the sample) was ___ and we are ___ confident that the true population mean ___ of ___ is between (confidence interval)."