# 📊 PLS 120 Lab 6: Confidence Intervals and T-Tests

**Binder Environment Developer:** Mohammadreza Narimani  
**Lab Content Developer:** Parastoo Farajpoor  
**Date:** November 5, 2024  
**Course:** PLS 120 - Applied Statistics in Agricultural Sciences  
**Institution:** UC Davis

In [None]:
# Load required libraries
suppressPackageStartupMessages({
  library(ggplot2)
  library(dplyr)
  library(knitr)
  library(tigerstats)
})

## 🖨️ Printing Values in R

Before we start, let's quickly review the ways to print the value of an object in R.

In [None]:
#1. Basic method: write down the name of the object and you'll see the value it holds.
x <- 42
x

In [None]:
#2. Basic Printing with print() function: The print() function is the standard way to display basic data types and objects in R. It outputs the content to the console.
x <- 2
print(x)

In [None]:
x <- c(1,2,3)
print(x)

In [None]:
x <- c("Alice", "Bob")
print(x)

In [None]:
#3. cat() function: cat() converts its arguments to character and writes them to the console, separated by a space or another specified character. It prints the text but does not return it.

x <- c(4,6,8)
cat("The values of x are:", x)

In [None]:
x <- c(4,20,21)
cat("The three values of x are:", x[1], x[2], x[3])

In [None]:
## You can't store the output of cat() in an object: cat() returns NULL, so `statement` holds NULL.
statement <- cat("The three values of x are:", x[1], x[2], x[3])
print(statement)

In [None]:
#4. paste() function: paste() converts its arguments to character and joins them into a single string, separated by a space or another specified character (see the sep argument). Unlike cat(), it returns the result.
x <- c(1,2,3)
paste("The three values of x are:", x[1], x[2], x[3])

In [None]:
## You can store the result of paste() in an object and print it later.
statement <- paste("The three values of x are:", x[1], x[2], x[3])
print(statement)
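
The difference between `cat()` and `paste()` is easiest to see side by side. The sketch below is only an illustration: the names `printed` and `stored` are made up for this example and are not used elsewhere in the lab.

```r
x <- c(4, 20, 21)

# cat() writes to the console and returns NULL, so nothing useful is stored
printed <- cat("The values of x are:", x, "\n")
is.null(printed)

# paste() returns a character vector that can be stored and reused;
# collapse joins the elements of x into one string first
stored <- paste("The values of x are:", paste(x, collapse = ", "))
print(stored)

# sep controls the separator placed between the arguments
paste("Sepal", "Length", sep = ".")
```

Use `cat()` when you only want to show a message, and `paste()` when you need the text again later.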

In this lab, we'll learn how to construct confidence intervals, which give a range of plausible values for a population mean, and how to run hypothesis tests that compare two means and let us make claims about whether they are the same or different.

We'll use the iris data set again. Recall that it contains three species of iris and four measurements of each flower. We'll explore how these species are the same or different using the tests available to us.

In [None]:
#These are the libraries we need.
library(ggplot2)
library(dplyr)

#Next, we'll load the data.
data <- iris

str(data)

In [None]:
#You can see that the variable "Species" is a factor. If it were not, we could use as.factor() to convert it.
data$Species <- as.factor(data$Species)
str(data)

In [None]:
#Let's see how many observations we have for each level of the factor (each species).
table(data$Species)

In [None]:
#Let's start with sepal length. First, we'll visualize the data so we can form some hypotheses before we build the confidence intervals and run t-tests. What hypotheses can we make based on the boxplot?
ggplot(data, aes(x=Species, y=Sepal.Length, fill=Species))+geom_boxplot()

In [None]:
ggplot(data, aes(x=Species, y=Sepal.Length, fill=Species))+geom_boxplot()+geom_point()

In [None]:
ggplot(data, aes(x=Species, y=Sepal.Length, fill=Species))+geom_boxplot()+geom_jitter()

#Reminder: data is the name of our data frame, Species is the variable mapped to the x axis, Sepal.Length is the variable mapped to the y axis, and fill colors the boxplots by Species. geom_boxplot() adds the boxplots, and geom_jitter() adds the individual data points with a small random horizontal and vertical offset. Compared with geom_point(), geom_jitter() shows all the data points here because identical values no longer sit exactly on top of each other.
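
One possible refinement of the jittered boxplot is sketched below. It assumes you want the points to keep their exact sepal lengths and the plot to look the same on every run. The seed value 120, the jitter width, and the transparency are arbitrary choices for this illustration, not settings from the lab.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  # Hide the boxplot's own outlier dots, since the jitter layer shows every point
  geom_boxplot(outlier.shape = NA) +
  # Jitter only horizontally (height = 0) so the y values stay exact;
  # a fixed seed makes the random offsets reproducible
  geom_jitter(
    position = position_jitter(width = 0.2, height = 0, seed = 120),
    alpha = 0.6
  )
p
```

Hiding the boxplot outliers avoids drawing the same observation twice, once as an outlier dot and once as a jittered point.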

In [None]:
#Next, we'll use filter() from dplyr (part of the tidyverse) to make three data sets, each containing only one species. This will simplify comparisons between them later on.
setosa_df <- data %>% filter(Species == "setosa")
versicolor_df <- data %>% filter(Species == "versicolor")
virginica_df <- data %>% filter(Species == "virginica")

## 📊 Sample Means

**Formula:** $\bar{x} = \frac{\sum x_i}{n}$

Now let's calculate the sample mean of sepal length for each of the three data sets:

In [None]:
mean_setosa <- mean(setosa_df$Sepal.Length)
mean_versicolor <- mean(versicolor_df$Sepal.Length)
mean_virginica <- mean(virginica_df$Sepal.Length)


#Let's create a vector containing all of these 3 means and print the values.
all_means <- c(mean_setosa, mean_versicolor, mean_virginica)
print(all_means)
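
To connect the formula to the code, the sketch below computes the setosa mean directly as $\sum x_i / n$ and compares it with `mean()`. It then computes all three means in one call with base R's `tapply()`, which is an alternative to the three separate data frames rather than the lab's own approach. The names `manual_mean` and `species_means` are introduced only for this example.

```r
# Sample mean by the formula: sum of the values divided by their count
x <- iris$Sepal.Length[iris$Species == "setosa"]
manual_mean <- sum(x) / length(x)
all.equal(manual_mean, mean(x))

# All three species at once: apply mean() to Sepal.Length within each species
species_means <- tapply(iris$Sepal.Length, iris$Species, mean)
species_means
```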

The sample means look different. But remember, each one is only an estimate of the population mean based on a limited number of observations, so how close is it likely to be to the true mean? To answer that, we'll construct a confidence interval for each species. There are several ways to do this; we'll start by calculating the individual parts and then build the confidence interval from them.

## 📐 Standard Deviations

**Formula:** $s = \sqrt{\frac{\sum(x_i - \bar{x})^2}{n-1}}$

In [None]:
#we already have the means calculated, but we will also need the standard deviation and the standard error.
sd_setosa <- sd(setosa_df$Sepal.Length)
sd_versicolor <- sd(versicolor_df$Sepal.Length)
sd_virginica <- sd(virginica_df$Sepal.Length)
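
As a check on the formula, here is a minimal sketch that computes the setosa standard deviation step by step and compares it with `sd()`. The names `n_x` and `manual_sd` are made up for this example.

```r
x <- iris$Sepal.Length[iris$Species == "setosa"]
n_x <- length(x)

# Squared deviations from the mean, summed, divided by n - 1, then square-rooted
manual_sd <- sqrt(sum((x - mean(x))^2) / (n_x - 1))

# sd() uses the same n - 1 denominator, so the two should agree
all.equal(manual_sd, sd(x))
```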

In [None]:
#you can use cat() function to concatenate strings and variables into a single output.
cat("The standard deviation of Setosa is", sd_setosa,
    ", Versicolor is", sd_versicolor,
    ", and Virginica is", sd_virginica)

In [None]:
#you can also use paste function for this purpose.
paste("The standard deviation of Setosa is", sd_setosa,
    ", Versicolor is", sd_versicolor,
    ", and Virginica is", sd_virginica)

## 📏 Standard Error

**Formula:** $SE = \frac{s}{\sqrt{n}}$

In [None]:
#To calculate the standard error, we will divide the SD by the square root of the sample size, n, which is 50 for each species.
n <- 50

SE_setosa <- sd_setosa/sqrt(n)
SE_versicolor <- sd_versicolor/sqrt(n)
SE_virginica <- sd_virginica/sqrt(n)


#Each species has a sample size greater than 30, but the population variance is unknown, so we will use a t score instead of a z score (a z score also works for samples this large; refer to the slides for more explanation). To calculate the t score, we need to set an alpha, determine the degrees of freedom, and then get the t score with the qt() function, which returns quantiles of the t distribution.

#We want fairly high confidence, so let's set alpha to 0.05, which gives a 95% confidence interval.
a <- 0.05
DoF <- n-1
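
The cell above hard-codes `n <- 50`. A slightly more general sketch, assuming you would rather read the sample size from the data, wraps the formula $SE = s / \sqrt{n}$ in a small helper. `standard_error` and `se_by_species` are hypothetical names for this example, not part of the lab or of any package.

```r
# Standard error of the mean, using the length of x as the sample size
standard_error <- function(x) sd(x) / sqrt(length(x))

# Apply the helper to Sepal.Length within each species
se_by_species <- tapply(iris$Sepal.Length, iris$Species, standard_error)
se_by_species
```

Because the helper uses `length(x)`, it stays correct if the groups ever have different sizes.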

## 🎯 Critical T-Value

**Formula:** $t_{\alpha/2, df}$ where $\alpha = 0.05$

In [None]:
#Now we'll do the t score itself. We'll divide the alpha by two to encompass both sides of the distribution, and set the degrees of freedom.
t_score <- qt(1-a/2, DoF) #Reminder: z score was calculated using qnorm(1-(alpha/2))
t_score


#Now that we have a t score and a standard error, we can calculate the margin of error, which will tell us the bounds of our confidence interval.

#setosa
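
To see how the choice between a t score and a z score plays out, the sketch below compares the two critical values at a few degrees of freedom. The values in `dfs` are arbitrary apart from 49, which matches this lab.

```r
alpha <- 0.05
dfs <- c(5, 30, 49, 1000)

# qt() is vectorized over df, so this returns one critical value per entry
t_crit <- qt(1 - alpha / 2, df = dfs)
z_crit <- qnorm(1 - alpha / 2)

data.frame(df = dfs, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3))
```

The t critical value is always a little larger than the z value and approaches it as the degrees of freedom grow, which is why both give similar intervals for samples of 50.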

## 🌸 Setosa Confidence Interval

**Margin of Error:** $ME = t \times SE$  
**Confidence Interval:** $\bar{x} \pm ME$

In [None]:
ME_setosa <- t_score*SE_setosa

LB_setosa <- mean_setosa - ME_setosa
UB_setosa <- mean_setosa + ME_setosa

CI_setosa <- c(LB_setosa, mean_setosa, UB_setosa)
CI_setosa

## 🌺 Versicolor Confidence Interval

In [None]:
#versicolor
ME_versicolor <- t_score * SE_versicolor

LB_versicolor <- mean_versicolor - ME_versicolor
UB_versicolor <- mean_versicolor + ME_versicolor

CI_versicolor <- c(LB_versicolor, mean_versicolor, UB_versicolor)
CI_versicolor

## 🌷 Virginica Confidence Interval

In [None]:
#virginica
ME_virginica <- t_score * SE_virginica

LB_virginica <- mean_virginica - ME_virginica
UB_virginica <- mean_virginica + ME_virginica

CI_virginica <- c(LB_virginica, mean_virginica, UB_virginica)
CI_virginica
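
One way to check a hand-built interval is to compare it with the one reported by R's built-in `t.test()`, which returns a confidence interval for the mean of a single sample in its `conf.int` element. The sketch below does this for setosa; `margin`, `manual_ci`, and `builtin_ci` are names introduced only for this example.

```r
x <- iris$Sepal.Length[iris$Species == "setosa"]
n_x <- length(x)

# Manual interval: mean plus or minus t critical value times the standard error
margin <- qt(0.975, df = n_x - 1) * sd(x) / sqrt(n_x)
manual_ci <- c(mean(x) - margin, mean(x) + margin)

# Built-in interval from a one-sample t-test (as.numeric drops the attribute)
builtin_ci <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

all.equal(manual_ci, builtin_ci)
```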

Now that we have the confidence intervals, we are in a better position to judge whether the population mean sepal lengths truly differ among the species, rather than only the sample means.

In [None]:
#Let's take a look at the confidence intervals all at once. Run all three lines at the same time. 
CI_setosa
CI_versicolor
CI_virginica
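
Reading three printed vectors side by side is error-prone, so here is a sketch that collects the intervals into one table and checks a pair for overlap programmatically. It uses `t.test()` to get the bounds. `by_species`, `ci_table`, and `overlaps` are hypothetical names for this illustration.

```r
# One numeric vector of sepal lengths per species
by_species <- split(iris$Sepal.Length, iris$Species)

# One row per species: lower bound, sample mean, upper bound
ci_table <- t(sapply(by_species, function(x) {
  ci <- t.test(x, conf.level = 0.95)$conf.int
  c(lower = ci[1], mean = mean(x), upper = ci[2])
}))
round(ci_table, 3)

# Two intervals overlap when each lower bound is at or below the other's upper bound
overlaps <- function(a, b) {
  unname(a["lower"] <= b["upper"] && b["lower"] <= a["upper"])
}
overlaps(ci_table["setosa", ], ci_table["versicolor", ])
overlaps(ci_table["versicolor", ], ci_table["virginica", ])
```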

None of the three confidence intervals overlap, which is strong evidence that the population means are different.

Conversely, if the confidence intervals of two groups overlap, the intervals alone do not let us claim a difference at the given confidence level: the populations could be similar in the measured feature (sepal length, in this case). Overlap does not prove that the means are equal, though; a t-test comparing the two groups gives the formal answer.
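
A formal comparison of two means can be sketched with `t.test()`. The example below compares versicolor and virginica. It assumes R's default Welch test, which does not require equal variances, and it may not be the variant covered in the slides. `two_species` and `result` are names made up for this example.

```r
# Keep only the two species being compared and drop the unused factor level
two_species <- droplevels(subset(iris, Species %in% c("versicolor", "virginica")))

# Two-sample t-test of Sepal.Length by Species (Welch test by default)
result <- t.test(Sepal.Length ~ Species, data = two_species)

result$p.value    # small values are evidence that the means differ
result$conf.int   # confidence interval for the difference in means
```

If the confidence interval for the difference excludes zero, the test rejects equal means at the matching significance level.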

---

## 📧 Need Help?

**Mohammadreza Narimani** (Teaching Assistant)  
📧 mnarimani@ucdavis.edu  
🏫 Department of Biological and Agricultural Engineering, UC Davis  
⏰ Office Hours: Thursdays 10 AM - 12 PM (Zoom)  
🔗 [Join Zoom Office Hours](https://ucdavis.zoom.us/j/99533096447)

---

*Week 6 - Confidence Intervals and T-Tests | PLS 120 | UC Davis*