06 T‐Test

Serena Kim edited this page Feb 29, 2024 · 3 revisions

Download and Import the Dataset

We will use General Social Survey (GSS) data (https://gss.norc.org/get-the-data) for this chapter. The page offers three download formats. Here we learn how to import Stata files, so select “STATA” and save the downloaded .dta file (GSS2021.dta in the code below) in your working directory.

Install the package:

install.packages("rio")

Now that you have installed the package, you can load it and import the data.

library(rio)
GSS2021 <- import("GSS2021.dta")

View(GSS2021) #view the dataset

You will see the data frame with 739 columns (variables) and 4,032 entries (survey respondents).

Arrange the variables in alphabetical order

We'll need to use the dplyr package:

install.packages("dplyr")
library(dplyr)
GSS2021 <- GSS2021 %>% select(sort(names(.)))
  • GSS2021 is the data frame containing various columns (variables) of data.
  • %>% is the pipe operator (from the magrittr package, made available when you load dplyr). It passes the data on its left side to the function on its right side, making it easier to chain multiple operations together.
  • select() is a dplyr function that selects specific columns from a data frame, in the order you list them.
  • sort() is a built-in R function that sorts the elements of a vector in ascending order.
  • names(.) returns the column names of the data frame. The . (dot) stands for the data frame passed in by the pipe (here GSS2021), so sort(names(.)) gives all of its column names in alphabetical order.
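To see what this pipeline does on something small, here is a minimal sketch using a made-up three-column data frame. The data frame toy and its values are hypothetical; only the column names mimic GSS variables.

```r
library(dplyr)

# Hypothetical toy data frame (not GSS data)
toy <- data.frame(
  realinc = c(18000, 42500, 31000),
  age     = c(34, 51, 27),
  childs  = c(0, 2, 1)
)

# Column names before sorting: realinc, age, childs
names(toy)

# Same pattern as above: sort the names, then select the columns in that order
toy_sorted <- toy %>% select(sort(names(.)))

# Column names after sorting
names(toy_sorted)
```

Only the column order changes; the rows and values are left untouched.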

Descriptive Statistics Table

Before we perform any tests, let's obtain the descriptive statistics (see Ch 05 for a detailed explanation of descriptive statistics):

install.packages("psych")
library(psych)

Now we can summarize characteristics of the variable of interest:

# Use a name other than "summary" so it does not clash with base R's summary()
realinc_summary <- describe(GSS2021$realinc)
realinc_summary

  • vars (Variables): This column gives the index (position) of each variable among those being summarized. The variable names themselves appear as row names when a data frame is summarized.
  • n (Sample Size): This column displays the number of non-missing (non-NA) values in each variable. It represents the sample size for each variable.
  • mean (Mean): This column shows the arithmetic mean (average) of the values in each variable.
  • sd (Standard Deviation): This column displays the standard deviation of the values in each variable. It measures the spread or dispersion of the data.
  • median (Median): This column shows the median (middle) value of the data in each variable. It represents the central tendency of the data.
  • trimmed (Trimmed Mean): This column shows the mean after dropping a fraction of the lowest and highest values (10% from each end by default), which makes it less sensitive to outliers.
  • mad (Median Absolute Deviation): This column represents the median absolute deviation, a robust measure of data variability that is less affected by outliers compared to standard deviation.
  • min (Minimum): This column displays the minimum (smallest) value in each variable.
  • max (Maximum): This column shows the maximum (largest) value in each variable.
  • range (Range): This column shows the difference between the maximum and the minimum.
  • skew (Skewness): This column provides a measure of the skewness of the data distribution. Positive values indicate right skew (long right tail), while negative values indicate left skew (long left tail).
  • kurtosis (Kurtosis): This column measures how heavy the tails of the distribution are relative to a normal distribution. Values near 0 are expected for normally distributed data; positive values suggest heavier tails (often described as a more peaked distribution), while negative values suggest lighter tails (a flatter distribution).
  • se (Standard Error of the Mean): This column displays the standard error of the mean for each variable. It represents the standard deviation of the sample means if random samples were repeatedly taken from the population.
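Several of these columns can be reproduced with base R, which may help in reading the table. The sketch below uses a small made-up vector x (hypothetical values, not GSS data) and assumes missing values are dropped before computing, as the n column implies.

```r
# Hypothetical values with one missing observation
x <- c(2, 4, 4, 4, 5, 5, 7, 9, NA)

x_obs <- x[!is.na(x)]        # keep non-missing values only

n      <- length(x_obs)      # n: number of non-missing values
avg    <- mean(x_obs)        # mean
s      <- sd(x_obs)          # sd: sample standard deviation
med    <- median(x_obs)      # median
lo     <- min(x_obs)         # min
hi     <- max(x_obs)         # max
se     <- s / sqrt(n)        # se: standard error of the mean

# Collect the hand-computed values in one row
data.frame(n = n, mean = avg, sd = s, median = med, min = lo, max = hi, se = se)
```

The last line makes the relationship between sd and se explicit: the standard error shrinks as the number of non-missing observations grows.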

In case you want to include multiple variables, such as realinc and age:

variable_names <- c("realinc", "age")
summary_table <- do.call(rbind, lapply(GSS2021[, variable_names], describe))
print(summary_table)

One-sample t-test

A one-sample t-test is used to determine whether the mean of a single sample differs significantly from a known or hypothesized population mean.

Positive one-tailed one-sample t-test

In a positive (upper) one-tailed test, the alternative hypothesis is that the population mean is greater than the hypothesized value. For example, I'm testing whether the mean family income in constant dollars (base = 1986; variable name: realinc) of American adults aged 18 and over is greater than $40,000.

result_pos_onetail <- t.test(GSS2021$realinc, mu = 40000, alternative = "greater")
result_pos_onetail

  • t.test(): This R function performs several kinds of t-tests. Here it compares the mean of one sample to a hypothesized value.
  • GSS2021$realinc: This specifies the realinc column of the GSS2021 data frame, the variable you want to test.
  • mu = 40000: Here, mu is the value of the population mean under the null hypothesis. In this case, it's set to 40,000.
  • alternative = "greater": This argument specifies the alternative hypothesis. Here, it indicates that the alternative hypothesis (your research hypothesis) is that the true population mean is greater than 40,000.
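The arguments above map directly onto the textbook formula t = (sample mean - mu) / (sd / sqrt(n)). The following sketch checks this by hand on simulated data; the vector income and the value mu0 are made up for illustration and are not GSS results.

```r
set.seed(1)

# Simulated incomes (hypothetical, for illustration only)
income <- rnorm(200, mean = 42000, sd = 15000)
mu0 <- 40000                      # hypothesized population mean

n <- length(income)
t_manual <- (mean(income) - mu0) / (sd(income) / sqrt(n))

# Upper-tail p-value: probability of a t value at least this large under H0
p_manual <- pt(t_manual, df = n - 1, lower.tail = FALSE)

res <- t.test(income, mu = mu0, alternative = "greater")

# Both comparisons should return TRUE
all.equal(unname(res$statistic), t_manual)
all.equal(res$p.value, p_manual)
```

With alternative = "greater", only large positive t values count as evidence against the null hypothesis, which is why the p-value comes from the upper tail.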

Negative one-tailed one-sample t-test

In a negative (lower) one-tailed test, the alternative hypothesis is that the population mean is less than the hypothesized value. For example, I'm testing whether the mean family income in constant dollars (base = 1986; variable name: realinc) of American adults aged 18 and over is less than $41,000.

result_onetail_neg <- t.test(GSS2021$realinc, mu = 41000, alternative = "less")
result_onetail_neg

Two-tailed one-sample t-test

In a two-tailed test, the hypothesis is non-directional: you are testing whether the population mean differs from the hypothesized value, regardless of the direction of the difference. For example, the research hypothesis here is simply that mean family income (realinc) is different from (either greater or less than) $40,500, the hypothesized population mean.

result_two_tailed <- t.test(GSS2021$realinc, mu = 40500, alternative = "two.sided")
result_two_tailed
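The three alternatives use the same t statistic and differ only in which tail(s) the p-value comes from. A small sketch on simulated data (the vector income is hypothetical, not GSS data) makes the relationship visible:

```r
set.seed(2)

# Simulated incomes (hypothetical, for illustration only)
income <- rnorm(150, mean = 41000, sd = 12000)

p_greater <- t.test(income, mu = 40500, alternative = "greater")$p.value
p_less    <- t.test(income, mu = 40500, alternative = "less")$p.value
p_two     <- t.test(income, mu = 40500, alternative = "two.sided")$p.value

# The two one-tailed p-values are complementary
all.equal(p_greater + p_less, 1)

# The two-tailed p-value is twice the smaller one-tailed p-value
all.equal(p_two, 2 * min(p_greater, p_less))
```

This is why a directional hypothesis should be chosen before looking at the data: picking the tail afterwards effectively halves the two-tailed p-value.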

Two-sample t-test

A two-sample t-test is a statistical test used to determine whether the means of two independent groups are significantly different from each other. It is often used when comparing the means of two different populations or groups to assess whether there is a statistically significant difference between their means.

Here, we want to know whether average family income is higher for respondents with children than for respondents without. The "childs" variable indicates the number of children. We'll create a "children" column based on it, coded 1 for respondents with at least one child and 2 for respondents with none:

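GSS2021 <- GSS2021 %>%
  mutate(children = case_when(
    is.na(childs) ~ NA_real_,  # keep missing values missing (otherwise they would be coded 2)
    childs != 0 ~ 1,           # at least one child
    TRUE ~ 2                   # no children
  ))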
  • The case_when() function evaluates its conditions in order for each observation and uses the first one that is TRUE.
    • The condition childs != 0 checks whether the childs variable is not equal to 0. If it is not 0, it assigns 1 to 'children' (has children).
    • TRUE ~ 2 is the fallback: any observation not matched by an earlier condition is assigned 2 (no children).
  • mutate() adds children as a new column, and the result is stored back in the GSS2021 dataframe.
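It is worth checking a recode like this on a few rows before using it in a test. The sketch below uses a made-up data frame toy (hypothetical values) and assumes that respondents with a missing childs value should stay missing rather than be counted as having no children.

```r
library(dplyr)

# Hypothetical toy data, including one missing value
toy <- data.frame(childs = c(0, 2, 1, NA, 0))

toy <- toy %>%
  mutate(children = case_when(
    is.na(childs) ~ NA_real_,  # assumption: missing stays missing
    childs != 0   ~ 1,         # at least one child
    TRUE          ~ 2          # no children
  ))

# Cross-tabulate the original and recoded columns to verify the mapping
table(childs = toy$childs, children = toy$children, useNA = "ifany")
```

Without the is.na() condition, the comparison NA != 0 evaluates to NA, so a missing value would fall through to the TRUE fallback and be coded 2.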

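Positive one-tailed two-sample t-test

My hypothesis is that individuals with children (=1) have a greater realinc than individuals without (=2). Because group 1 comes first, alternative = "greater" tests whether the mean of group 1 minus the mean of group 2 is greater than zero.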

positive_one_tail_two_sample <- t.test(realinc ~ children, data = GSS2021, alternative = "greater")
positive_one_tail_two_sample
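The direction of a one-tailed two-sample test depends on the order of the groups, so it can help to confirm it by hand. The sketch below uses simulated data (the data frame toy is hypothetical) and relies on the fact that t.test() does not assume equal variances unless you set var.equal = TRUE, so the statistic is the Welch t.

```r
set.seed(3)

# Simulated data (hypothetical): group 1 = has children, group 2 = no children
toy <- data.frame(
  realinc  = c(rnorm(60, mean = 45000, sd = 10000),
               rnorm(60, mean = 40000, sd = 10000)),
  children = rep(c(1, 2), each = 60)
)

res <- t.test(realinc ~ children, data = toy, alternative = "greater")

# Group means, variances and sizes, indexed by group label
m <- tapply(toy$realinc, toy$children, mean)
v <- tapply(toy$realinc, toy$children, var)
n <- tapply(toy$realinc, toy$children, length)

# Welch t statistic: (mean of group 1 - mean of group 2) / standard error
t_manual <- (m[["1"]] - m[["2"]]) /
  sqrt(v[["1"]] / n[["1"]] + v[["2"]] / n[["2"]])

all.equal(unname(res$statistic), t_manual)   # should be TRUE
res$estimate                                 # group 1 mean listed first
```

If the groups were coded the other way round, the same hypothesis would need alternative = "less".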

Negative one-tailed two-sample t-test

My hypothesis is that individuals with children (=1) have a smaller realinc than individuals without (=2).

negative_one_tail_two_sample <- t.test(realinc ~ children, data = GSS2021, alternative = "less")
negative_one_tail_two_sample

Two-tailed two-sample t-test

I am testing whether family income differs between people who have kids and people who do not. This uses the children column created above from childs (1 = has kids, 2 = no kids). Leaving out the alternative argument gives the default, "two.sided".

result_two_tail_two_sample <- t.test(realinc ~ children, data = GSS2021)
result_two_tail_two_sample
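Printing the result shows everything at once, but the individual pieces can also be pulled out of the object that t.test() returns. A minimal sketch on simulated data (the data frame toy and the 0.05 threshold are assumptions for illustration):

```r
set.seed(4)

# Simulated data (hypothetical, not GSS data)
toy <- data.frame(
  realinc  = rnorm(100, mean = 40000, sd = 10000),
  children = rep(c(1, 2), times = 50)
)

res <- t.test(realinc ~ children, data = toy)

res$estimate    # mean of each group
res$statistic   # t value
res$parameter   # degrees of freedom
res$p.value     # two-tailed p-value
res$conf.int    # 95% confidence interval for the difference in means

# Decision at the 5% significance level (threshold chosen for illustration)
reject_null <- res$p.value < 0.05
reject_null
```

Extracting values this way is convenient when you want to report them in a table or reuse them in later calculations.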