# 📊 PLS 120 - Week 5: Sampling and Estimation

**Binder Developer:** Mohammadreza Narimani  
**Lab Content Creator:** Parastoo Farajpoor  
**Date:** October 29, 2025  
**Course:** Applied Statistics in Agricultural Sciences

---

## 🔧 Setup: Load Required Packages

In [None]:
# Load required packages
library(ggplot2)
library(tigerstats)

---

## 🎲 set.seed() function

For today's lab, let's first discuss the set.seed() function. It sets the seed of R's random number generator, so any code that relies on random numbers (sampling, simulation, random splits) returns the same result every time it is run. That makes it essential for reproducible research.

**Usage:**
```r
set.seed(value)
```

- **value**: An integer used to initialize the random number generator.

**Example:**

In [None]:
dataset <- iris

# Sample 10 row indices from the iris data set. Without a seed, each run of sample() selects different row indices.
random_rows <- sample(nrow(dataset), size = 10, replace = FALSE)
random_rows

The value inside the set.seed() function is called the 'seed'. The seed can be any integer; the specific number does not matter, as long as you reuse the same one whenever you need to reproduce a result exactly. Often, people use easy-to-remember numbers like 123, 42, or 2023.

In [None]:
# Set the seed
set.seed(123)

#Let's sample again
random_rows <- sample(nrow(dataset), size = 10, replace = FALSE)
random_rows

We can also change the seed value. A different seed produces a different sample, but that sample is just as reproducible.

In [None]:
# Set the seed
set.seed(42)

#Let's sample again
random_rows <- sample(nrow(dataset), size = 10, replace = FALSE)
random_rows
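
To check that a seed really makes sampling reproducible, the sketch below resets the same seed before two identical sample() calls and compares the results. The variable names (`first_draw`, `second_draw`, `third_draw`) are illustrative and not part of the lab.

```r
# Same seed followed by the same call: the two draws are identical
set.seed(123)
first_draw <- sample(nrow(iris), size = 10, replace = FALSE)

set.seed(123)
second_draw <- sample(nrow(iris), size = 10, replace = FALSE)

identical(first_draw, second_draw)  # TRUE

# A different seed starts the generator in a different state
set.seed(42)
third_draw <- sample(nrow(iris), size = 10, replace = FALSE)

identical(first_draw, third_draw)
```

Note that the seed has to be set again right before each call you want to reproduce. Calling sample() twice after a single set.seed() gives two different draws, because the generator keeps advancing.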

---

## 📏 Z-score

Z-scores help us **standardize** data by converting values to standard deviations from the mean. This makes it easier to:
- Compare values from different datasets
- Identify outliers
- Calculate probabilities

### Formula:
$$z = \frac{x - \mu}{\sigma}$$

Where:
- `x` = individual value
- `μ` = mean
- `σ` = standard deviation
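
As a quick worked example, take a distribution with mean 50 and standard deviation 25 (the values used in this section's simulation); the observation x = 75 is chosen only for illustration:

$$z = \frac{75 - 50}{25} = 1$$

A value of 75 therefore lies one standard deviation above the mean, and a value of 25 would give z = -1.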

In [None]:
set.seed(12)

#generate 100 random numbers that have a normal distribution with a mean of 50 and standard deviation of 25
normal <- rnorm(100, mean=50, sd=25)
normal

df_normal <- data.frame(normal)
ggplot(df_normal,aes(x=normal))+geom_density()


z_normal <- (normal - mean(normal)) / sd(normal)
z_normal

df_z_normal <- data.frame(z_normal)
ggplot(df_z_normal,aes(x=z_normal))+geom_density()

# Notice that the y-axis scale also changes when we plot the standardized data. The z-scores span a much narrower range than the original values, and because the area under a density curve must equal 1, the curve becomes taller.
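
A quick sanity check on the standardization: after converting to z-scores, the sample mean should be 0 and the sample standard deviation should be 1. The sketch below regenerates the same data so it can run on its own; `z_scaled` is an illustrative name, and the comparison with base R's scale() is an addition, not part of the lab.

```r
set.seed(12)
normal <- rnorm(100, mean = 50, sd = 25)
z_normal <- (normal - mean(normal)) / sd(normal)

mean(z_normal)  # 0, up to floating-point rounding error
sd(z_normal)    # 1

# scale() performs the same centering and scaling; it returns a one-column matrix
z_scaled <- as.numeric(scale(normal))
all.equal(z_normal, z_scaled)  # TRUE
```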

To find the area under the curve of a standard normal distribution for z ≤ 2, you can use the cumulative distribution function (CDF). In R, this can be accomplished using the pnorm() function, which provides the probability that a standard normal variable is less than or equal to a given value.

### pnorm() Function:
$$P(Z \leq z) = \text{pnorm}(z)$$

In [None]:
# Define the z-value
z_value <- 2

# Calculate the area under the curve for z <= 2
area_under_curve <- pnorm(z_value)

# Print the area under the curve
print(area_under_curve)


# Visualize this using pnormGC() function inside tigerstats package for better understanding (no need to go into details for this for now)
library(tigerstats)
pnormGC(z_value, graph = TRUE)
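
Because pnorm() returns the lower-tail area, other regions follow by subtraction. The sketch below shows the upper tail, the central area between -2 and 2, and the same probability computed on the raw scale for a normal variable with mean 50 and standard deviation 25 (where x = 100 corresponds to z = 2). The variable names are illustrative.

```r
z_value <- 2

# Lower tail: P(Z <= 2)
lower_tail <- pnorm(z_value)

# Upper tail: P(Z > 2), computed two equivalent ways
upper_tail <- 1 - pnorm(z_value)
upper_tail_direct <- pnorm(z_value, lower.tail = FALSE)

# Central area: P(-2 <= Z <= 2)
central_area <- pnorm(z_value) - pnorm(-z_value)

# Raw scale: P(X <= 100) when X is normal with mean 50 and sd 25
raw_scale <- pnorm(100, mean = 50, sd = 25)

round(c(lower_tail, upper_tail, central_area, raw_scale), 4)
# 0.9772 0.0228 0.9545 0.9772
```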

To find the z-score related to a specific area under the curve (cumulative probability) of a standard normal distribution, you can use the qnorm() function in R.

### qnorm() Function:
$$z = \text{qnorm}(p) \text{ where } P(Z \leq z) = p$$

In [None]:
# Define the cumulative probability
area_under_curve <- 0.98

# Calculate the z for the area under the curve of 0.98
z_value <- qnorm(area_under_curve)

# Print the z
print(z_value)


# Visualize this using qnormGC() function inside tigerstats package for better understanding (no need to go into details for this for now)
library(tigerstats)
qnormGC(area_under_curve, graph = TRUE)
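
qnorm() is the inverse of pnorm(): feeding its result back into pnorm() recovers the original probability. The sketch below checks this and also converts the percentile to the raw scale of a normal variable with mean 50 and standard deviation 25; the names `p`, `z`, and `x_98` are illustrative.

```r
p <- 0.98

# z such that P(Z <= z) = 0.98
z <- qnorm(p)
round(z, 3)  # 2.054

# Round trip: pnorm() undoes qnorm()
pnorm(z)     # 0.98

# 98th percentile on the raw scale, two equivalent ways
x_98 <- qnorm(p, mean = 50, sd = 25)
all.equal(x_98, 50 + z * 25)  # TRUE
```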

---

## 🎯 Confidence Interval

Confidence intervals provide an estimated range that is likely to include the true value of an unknown population parameter. This range, along with the confidence level, gives more context to the estimate than a single point estimate like the sample mean. It tells you not only about what you think the true value might be, but also how uncertain you are.

When constructing confidence intervals, z-scores are used to determine how wide the interval should be to contain the true population parameter (like a mean or proportion) with a certain level of confidence. This is crucial when you are trying to understand how extreme or typical a particular value is within a distribution. The z-score corresponds to the desired confidence level:

- **95% confidence level** typically corresponds to a z-score of approximately **1.96**.
- **99% confidence level** corresponds to a z-score of about **2.58**.

The above values are derived from the properties of the normal distribution, as confidence intervals often assume that the means of samples are normally distributed around the population mean.

### Formula for Confidence Interval:
$$CI = \bar{x} \pm z_{\alpha/2} \times \frac{s}{\sqrt{n}}$$

Where:
- `x̄` = sample mean
- `z` = critical z-value
- `s` = sample standard deviation
- `n` = sample size

**Alpha (α)** is the significance level: the probability that the procedure produces a confidence interval that does not contain the true population parameter. A 95% confidence level means that if you repeatedly drew samples from the same population and built an interval from each one, about 95% of those intervals would contain the true parameter. Correspondingly, alpha is 100% - 95% = 5%, or 0.05. This 5% is the risk you accept that your interval misses the true value. Similarly, for a 99% confidence level, alpha is 0.01.

In [None]:
# Set the significance level at 5% (alpha = 0.05)
alpha <- 0.05

# Calculate the z-score for a 95% confidence level
# We use 1 - alpha / 2 because we are interested in the two-tailed confidence interval
z_score <- qnorm(1 - alpha / 2)

# Print the z-score
print(z_score)
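
qnorm() is vectorized, so the critical values for several confidence levels can be computed in one call. This sketch reproduces the 1.96 and 2.58 values quoted above and adds the 90% level; the variable names are illustrative.

```r
conf_levels <- c(0.90, 0.95, 0.99)
alphas <- 1 - conf_levels

# Two-tailed critical value: alpha / 2 is left in each tail
z_critical <- qnorm(1 - alphas / 2)

data.frame(
  conf_level = conf_levels,
  alpha = alphas,
  z_critical = round(z_critical, 3)  # 1.645, 1.960, 2.576
)
```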

**Example:**
Imagine you are a researcher studying the average amount of time students spend doing homework each week. You randomly select a sample of 100 students and find that they spend an average of 5 hours per week on homework, with a standard deviation of 1 hour.

If you wanted to create a 95% confidence interval to estimate the true average for the entire student population, you would:

In [None]:
# Sample parameters
sample_mean <- 5  # sample mean in hours
sample_sd <- 1     # sample standard deviation in hours
sample_size <- 100 # number of students in the sample

# Z-score for a 95% confidence level
alpha <- 0.05
z_score <- qnorm(1 - alpha / 2)  # Finds the critical z-value

# Calculate the margin of error
margin_of_error <- z_score * (sample_sd / sqrt(sample_size))
margin_of_error

# Calculate the confidence interval
lower_bound <- sample_mean - margin_of_error
lower_bound

upper_bound <- sample_mean + margin_of_error
upper_bound
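
Worked by hand with the rounded critical value 1.96, the numbers in the cell above come out as:

$$ME = z_{\alpha/2} \times \frac{s}{\sqrt{n}} \approx 1.96 \times \frac{1}{\sqrt{100}} = 0.196$$

$$CI = \bar{x} \pm ME = 5 \pm 0.196 = (4.804,\ 5.196)$$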

**Margin of error** is the half-width of the confidence interval: the maximum distance you expect, at the chosen confidence level, between the sample statistic and the true population parameter. In this example, we are 95% confident that the mean weekly homework time of all students falls between the lower bound and the upper bound.

### Margin of Error Formula:
$$ME = z_{\alpha/2} \times \frac{s}{\sqrt{n}}$$
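
One way to see what a 95% confidence level means in practice is to simulate it. The sketch below assumes a normally distributed population with a true mean of 5 hours and a standard deviation of 1 hour (assumptions made for illustration, borrowing the homework example's numbers). It draws 1,000 samples of 100 students, builds an interval from each sample, and reports the share of intervals that contain the true mean. The seed and all variable names are arbitrary choices.

```r
set.seed(2025)

true_mean <- 5   # assumed population mean (hours)
true_sd <- 1     # assumed population standard deviation (hours)
n <- 100         # students per sample
n_sim <- 1000    # number of simulated samples
z_crit <- qnorm(1 - 0.05 / 2)

# For each simulated sample, record whether its interval contains the true mean
covers <- replicate(n_sim, {
  x <- rnorm(n, mean = true_mean, sd = true_sd)
  me <- z_crit * sd(x) / sqrt(n)
  (mean(x) - me <= true_mean) && (true_mean <= mean(x) + me)
})

# Proportion of intervals that captured the true mean; expected to be close to 0.95
mean(covers)
```

The proportion will not be exactly 0.95 in any single run, but it should land near it, which is the long-run guarantee the confidence level describes.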

---

## 📏 Calculating sample size

When designing any type of experiment, calculating the sample size is a critical first step. A sufficiently large sample reduces sampling error and makes it less likely that a result is a false positive produced by random chance in the data.

### Formula for Sample Size (Proportions):
$$n = \frac{z^2 \times p \times (1-p)}{d^2}$$

Where:
- `n` = required sample size
- `z` = z-score for desired confidence level
- `p` = expected proportion
- `d` = desired margin of error

**Example:** You are designing a survey of cat ownership on campus, and you want to know whether the proportion of students who own a cat is above or below the national average. Nationally, approximately 20% of people have a cat as a pet. To answer the question, you first need to know how many students to survey.

In [None]:
# Step 1: Define the national average of cat ownership as our prevalence rate.
prev <- 0.2  # 20% of the national population owns cats


# Step 2: Determine the Confidence Level and Calculate Z-Score. Let's say we want to be 90% confident.
alpha <- 0.1  # This gives us a 90% confidence level

# Calculate the z-score: the number of standard deviations on either side of the mean that covers the central 90% of the normal distribution.
z_score <- qnorm(1 - alpha / 2)  # Use 1 - alpha/2 to find the upper percentile
z_score


# Step 3: Decide on a margin of error. A smaller margin of error means more precision, but requires a larger sample size. Let's say we want 5% margin of error.
d <- 0.05 


# Step 4: Calculate the sample size
sample_size <- z_score^2 * prev * (1 - prev) / (d^2)
sample_size

# Round up the calculated sample size to the nearest whole number
ceiling(sample_size)

This calculation tells you how many students you need to survey to be 90% confident that your estimate is within 5 percentage points of the true proportion, assuming cat ownership on campus is close to the national average of 20%.
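
Substituting the rounded values into the formula makes the arithmetic explicit:

$$n = \frac{z^2 \times p \times (1-p)}{d^2} \approx \frac{1.645^2 \times 0.2 \times 0.8}{0.05^2} = \frac{2.706 \times 0.16}{0.0025} \approx 173.2$$

Since you cannot survey a fraction of a student, the result is rounded up to 174.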

Now, if we increase the prevalence rate:

In [None]:
prev <- 0.3
alpha <- 0.1

z_score <- qnorm(1 - alpha / 2)
z_score

d <- 0.05 

sample_size <- z_score^2 * prev * (1 - prev) / (d^2)
sample_size

ceiling(sample_size)

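The sample size also increases. The product p × (1 - p) rises from 0.16 to 0.21 as p moves from 0.2 to 0.3, which means the proportion is more variable from sample to sample, so more observations are needed to estimate it with the same margin of error.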
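
To see how the required sample size depends on the expected proportion, the sketch below wraps the formula in a helper and evaluates it across a range of p values at 90% confidence and a 5% margin of error. The function `sample_size_for_proportion()` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Hypothetical helper: sample size for estimating a proportion
sample_size_for_proportion <- function(p, conf_level = 0.90, d = 0.05) {
  alpha <- 1 - conf_level
  z <- qnorm(1 - alpha / 2)
  ceiling(z^2 * p * (1 - p) / d^2)
}

p_values <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
n_required <- sample_size_for_proportion(p_values)

data.frame(
  p = p_values,
  variability = p_values * (1 - p_values),  # largest at p = 0.5
  n = n_required
)
```

The required size peaks at p = 0.5 and is symmetric around it, so p = 0.5 is the conservative choice when the expected proportion is unknown.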

Now, if we want to have more precision and reduce the effect of random chance further, we might want to increase the confidence level and decrease the margin of error.

In [None]:
prev <- 0.3
alpha <- 0.01  # Increase confidence level to 99%

# Recalculate the z-score for a 99% confidence level
z_score <- qnorm(1 - alpha / 2)
z_score

d <- 0.01  # Reduce the margin of error to 1%

# Recalculate the sample size with the new parameters
sample_size <- z_score^2 * prev * (1 - prev) / (d^2)
sample_size


# Round up the calculated sample size to the nearest whole number
ceiling(sample_size)

When we raised the confidence level and tightened the margin of error, the required sample size increased sharply. There is always a **trade-off** between the number of samples and the precision we are looking for. You should decide based on what is practical for your experiment.

### Key Takeaways:
- **Higher confidence level** → Larger sample size needed
- **Lower margin of error** → Larger sample size needed  
- **Higher expected proportion** (up to 0.5) → Larger sample size needed
- **Cost vs. Precision**: Balance practical constraints with statistical requirements
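
The first two takeaways can be checked side by side by computing the sample size over a small grid of confidence levels and margins of error, holding the expected proportion at 0.3. As before, `sample_size_for_proportion()` is a hypothetical helper defined here for illustration only.

```r
# Hypothetical helper: sample size for estimating a proportion
sample_size_for_proportion <- function(p, conf_level, d) {
  alpha <- 1 - conf_level
  z <- qnorm(1 - alpha / 2)
  ceiling(z^2 * p * (1 - p) / d^2)
}

# Every combination of confidence level and margin of error
grid <- expand.grid(conf_level = c(0.90, 0.95, 0.99), d = c(0.05, 0.01))
grid$n <- sample_size_for_proportion(p = 0.3,
                                     conf_level = grid$conf_level,
                                     d = grid$d)
grid
```

Because the margin of error enters the formula squared, cutting it from 5% to 1% multiplies the required sample size by about 25, a much larger effect than moving from 90% to 99% confidence.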

---

## 📧 Need Help?

**Mohammadreza Narimani** (Teaching Assistant)  
📧 mnarimani@ucdavis.edu  
🏫 Department of Biological and Agricultural Engineering, UC Davis  
⏰ Office Hours: Thursdays 10 AM - 12 PM (Zoom)  
🔗 [Join Zoom Office Hours](https://ucdavis.zoom.us/j/99533096447)

---

*Last updated: October 2025 | PLS 120 - Applied Statistics in Agricultural Sciences | UC Davis*