07 Chi‐Square Test

Serena Kim edited this page Apr 22, 2024 · 6 revisions

You can download the dataset from our shared drive.

Import dataset

Install the package:

install.packages("rio")

Now that you have installed the package, load it and import the data:

library(rio)
GSS2021 <- import("GSS2021.dta")

To reorder the variables alphabetically:

install.packages("dplyr")
library(dplyr)
GSS2021 <- GSS2021 %>% select(sort(names(.)))

Frequency of each value of a variable

Here, we use the abany variable:

#Calculate frequency of the variable
frequency_table <- table(GSS2021$abany, useNA = "no")
frequency_table

The frequency table does not show the proportion of each group, so I want to add the proportions to the table:

# Calculate proportions using prop.table
proportion_table <- prop.table(frequency_table)

# Combine the frequency and proportion tables using cbind
combined_table <- cbind(Frequency = frequency_table, Proportion = proportion_table)
  • cbind is used to combine the frequency and proportion tables into a single table, which includes both frequency and proportion columns.
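To see what the combined table looks like without loading the survey data, here is a minimal sketch on a made-up vector. The name abany_toy and its values are hypothetical and only illustrate the steps above.

```r
# Hypothetical responses: 1 = yes, 2 = no, NA = missing (made-up values)
abany_toy <- c(1, 2, 2, 1, NA, 2, 1, 2)

# Count each value, dropping missing responses
frequency_table <- table(abany_toy, useNA = "no")

# Convert the counts to proportions of the non-missing total
proportion_table <- prop.table(frequency_table)

# One row per value, one column per statistic
combined_table <- cbind(Frequency = frequency_table, Proportion = proportion_table)
print(round(combined_table, 3))
```

Because the missing response is dropped before the proportions are computed, the Proportion column sums to 1 over the seven valid answers.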

Convert the labels of variables

It would be helpful to see what "1" and "2" mean when we print the results. So, we will re-label the variable:

GSS2021 <- GSS2021 %>%
    mutate(abany = ifelse (abany %in% 1, "Yes",
                           ifelse(abany %in% 2, "No", NA)))
  • GSS2021 %>%: This part of the code takes the data frame GSS2021 and pipes it into the subsequent operation using %>%. This operator allows you to chain operations together.
  • The mutate function is used to create or modify variables in a data frame.
  • The ifelse function is used to conditionally assign values to the new abany variable based on the values of the existing abany variable.
  • ifelse(abany %in% 1, "Yes", ...) checks if the value of abany is equal to 1. If it is, it assigns the character "Yes" to the abany variable.
  • If the condition is not met (...), another ifelse is used: ifelse(abany %in% 2, "No", NA): This checks if the value of abany is equal to 2. If it is, it assigns the character "No" to the abany variable.
  • If neither condition is met, it assigns NA (missing value) to the abany variable.
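The nested ifelse logic can be tried on a small made-up vector. This is only a sketch: the vector codes is hypothetical, and the value 8 stands in for any code other than 1 or 2. It also shows one difference between %in% and ==: a comparison with == returns NA for a missing value, while %in% always returns TRUE or FALSE.

```r
# Hypothetical codes: 1, 2, a missing value, and some other code (8)
codes <- c(1, 2, NA, 8)

# `==` passes the missing value through; `%in%` never returns NA
print(codes == 1)
print(codes %in% 1)

# Same nested recode as above, applied to the toy vector
recoded <- ifelse(codes %in% 1, "Yes",
                  ifelse(codes %in% 2, "No", NA))
print(recoded)
```

Both the missing value and the unexpected code end up as NA in the recoded vector.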

Subset the dataset for the chi-square test

The chi-square (χ²) test is a statistical method used to determine whether there is a significant association between two categorical variables. In this demo, we will test the statistical independence between individuals' political ideology and their confidence in different institutions, including the press (conpress), the scientific community (consci), Congress (conlegis), and the U.S. Supreme Court (conjudge).

The "confidence" variables are defined as below:

"Would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in [the institutions below]?"

  • (a) Press (conpress)
  • (b) Scientific community (consci)
  • (c) Congress (conlegis)
  • (d) U.S. Supreme Court (conjudge)

chi_subset <- GSS2021 %>% select(id, polviews, conpress, consci, conlegis, conjudge)

Create a categorical variable based on a numeric variable

The variable polviews ranges from 1 (extremely liberal) to 7 (extremely conservative).

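To simplify the comparison between individuals with different political views, I recode polviews into three groups. Note that the code overwrites polviews rather than creating a separate variable, and that the party labels are a rough stand-in for ideology (liberal, moderate, conservative):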

chi_subset <- chi_subset %>%
  mutate(polviews = ifelse(polviews %in% 1:3, "Democrats",
                           ifelse(polviews %in% 4, "Independent",
                                  ifelse(polviews %in% 5:7, "Republican", NA))))
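A quick way to check a recode like this is to cross-tabulate the original codes against the new groups. The sketch below uses a made-up vector (polviews_toy is a hypothetical name) and base R only, so it runs without the dataset.

```r
# Hypothetical 1-7 codes with one missing value
polviews_toy <- c(1, 3, 4, 5, 7, NA, 2, 6)

# Same grouping rule as above
group <- ifelse(polviews_toy %in% 1:3, "Democrats",
                ifelse(polviews_toy %in% 4, "Independent",
                       ifelse(polviews_toy %in% 5:7, "Republican", NA)))

# Each original code should fall in exactly one column
check <- table(original = polviews_toy, recoded = group, useNA = "ifany")
print(check)
```

If a code lands in an unexpected column, or a valid code shows up under NA, the recoding rule needs another look.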

Level of Confidence - From Numeric Values to Characters

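# Recode within chi_subset (the data frame used for the test) so the labels appear in the tables.
# Storing the result as a factor keeps the rows in this order instead of alphabetical order.
chi_subset <- chi_subset %>% mutate(consci = factor(case_when(
  consci == 1 ~ "A great deal",
  consci == 2 ~ "Somewhat",
  consci == 3 ~ "Hardly any"
), levels = c("A great deal", "Somewhat", "Hardly any")))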
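One thing to watch when numeric codes become text: table() and chisq.test() order character values alphabetically, so "Hardly any" would be listed before "Somewhat". A minimal sketch of the difference, using made-up codes (consci_toy and conf_labels are hypothetical names):

```r
# Hypothetical confidence codes (1 = a great deal, 2 = somewhat, 3 = hardly any)
consci_toy <- c(1, 2, 3, 3, 2, 1, 2)
conf_labels <- c("A great deal", "Somewhat", "Hardly any")

# Plain character labels: tables sort them alphabetically
as_text <- conf_labels[consci_toy]
print(names(table(as_text)))

# Factor with explicit levels: tables follow the level order
as_factor <- factor(as_text, levels = conf_labels)
print(names(table(as_factor)))
```

Keeping the intended order matters later, when row labels are assigned by position.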

Perform a chi-square test and create the contingency tables

I want to test whether individuals' political views (the three groups stored in polviews) are associated with their confidence in the scientific community.

chi_result <- chisq.test(chi_subset$consci, chi_subset$polviews)

The chi_result object stores the chi-square test result. Now let's print the observed and expected frequency tables:

chi_result$observed  
chi_result$expected
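The expected frequencies are what we would see if the two variables were independent: for each cell, (row total × column total) / grand total. As a sketch, the calculation can be reproduced by hand on a made-up 3 × 3 table (the counts below are hypothetical, not GSS results):

```r
# Hypothetical observed counts (rows: confidence levels, columns: groups)
observed <- matrix(c(30, 20, 10,
                     20, 20, 20,
                     10, 20, 30),
                   nrow = 3, byrow = TRUE)

# Expected count per cell under independence
n <- sum(observed)
expected_by_hand <- outer(rowSums(observed), colSums(observed)) / n
print(expected_by_hand)

# chisq.test() stores the same values in $expected
chi_toy <- chisq.test(observed)
print(all.equal(expected_by_hand, chi_toy$expected, check.attributes = FALSE))
```

In this toy table every row and column total is 60 out of 180, so every expected count is 20.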

If we want to add the margins (row and column totals), we can use the addmargins function:

addmargins(chi_result$observed)
addmargins(chi_result$expected)

Explain the residuals

chi_result

This prints the result of Pearson's chi-squared test: the test statistic (X-squared), the degrees of freedom (df), and the p-value.
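The numbers in that printout can be reproduced by hand, which helps when interpreting them. The sketch below uses a made-up 3 × 3 table (hypothetical counts, not GSS results): the statistic is the sum of (observed - expected)² / expected over all cells, and the degrees of freedom are (rows - 1) × (columns - 1).

```r
# Hypothetical observed counts
observed <- matrix(c(30, 20, 10,
                     20, 20, 20,
                     10, 20, 30),
                   nrow = 3, byrow = TRUE)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic, degrees of freedom, and p-value
stat_by_hand <- sum((observed - expected)^2 / expected)
df_by_hand <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_by_hand <- pchisq(stat_by_hand, df = df_by_hand, lower.tail = FALSE)

# Compare with chisq.test()
chi_toy <- chisq.test(observed)
print(c(by_hand = stat_by_hand, chisq_test = unname(chi_toy$statistic)))
print(c(by_hand = p_by_hand, chisq_test = chi_toy$p.value))
```

A small p-value means the observed table would be unlikely if the two variables were independent.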

To understand the differences between observed and expected frequencies, you can use the residuals() function:

residuals(chi_result)

The residuals function returns the Pearson residuals, which measure the difference between the observed and expected counts in each cell of a contingency table: (observed - expected) / sqrt(expected).

  • Positive residuals (values greater than 0) indicate that there are more observations than expected for that combination.
  • Negative residuals (values less than 0) indicate that there are fewer observations than expected for that combination.
  • Residuals near 0 suggest that the observed and expected counts are similar.

For example,

  • The residual of 9.6480567 for "A great deal" and "Democrats" means that there are far more observations than expected in that cell. As a rough rule of thumb, residuals beyond about ±2 point to cells that contribute heavily to the chi-square statistic.
  • The residual of -2.7550663 for "A great deal" and "Independent" means that there are fewer observations than expected in that cell.
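To make the definition concrete, here is a sketch that computes Pearson residuals by hand on a made-up table (hypothetical counts, not the GSS results quoted above) and compares them with what chisq.test() stores.

```r
# Hypothetical observed counts
observed <- matrix(c(30, 20, 10,
                     20, 20, 20,
                     10, 20, 30),
                   nrow = 3, byrow = TRUE)

chi_toy <- chisq.test(observed)

# Pearson residual per cell: (observed - expected) / sqrt(expected)
resid_by_hand <- (observed - chi_toy$expected) / sqrt(chi_toy$expected)
print(round(resid_by_hand, 2))

# Matches the residuals stored in the test result
print(all.equal(resid_by_hand, chi_toy$residuals, check.attributes = FALSE))

# The squared residuals add up to the chi-square statistic
print(sum(resid_by_hand^2))
```

In this toy table the top-left cell has 30 observations against 20 expected, giving a residual of about 2.24. The test result also stores standardized residuals in chi_toy$stdres, which adjust for the row and column totals.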

Visualizing the residuals

residuals_matrix <- as.matrix(residuals(chi_result))
rownames(residuals_matrix) <- c("A great deal", "Somewhat", "Hardly any")  
colnames(residuals_matrix) <- c("Democrats", "Independent", "Republican")

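We created a matrix, residuals_matrix, from the residuals above and labeled its rows and columns. The labels must be listed in the same order as the rows and columns of the residuals table. Now we are ready to create a heatmap: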

heatmap(residuals_matrix,
        Rowv = NA, Colv = NA,   # No clustering
        col = cm.colors(256),   # Color scheme
        scale = "none",         # No scaling
        margins = c(10, 10),    # Increase margins
        xlab = "Political Views",
        ylab = "Confidence in the scientific community",
        cexRow = 0.9,           # Adjust row label font size
        cexCol = 0.9)           # Adjust column label font size
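As an alternative to the heatmap (a different technique, not the one used above), base R's mosaicplot can shade each cell by its residual from the independence model when shade = TRUE. The sketch below uses a made-up table with hypothetical counts.

```r
# Hypothetical observed counts with labeled rows and columns
observed <- matrix(c(30, 20, 10,
                     20, 20, 20,
                     10, 20, 30),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(
                     Confidence = c("A great deal", "Somewhat", "Hardly any"),
                     Group = c("Democrats", "Independent", "Republican")))

# Tile area reflects the cell count; shading reflects the size and sign of the residual
mosaicplot(observed, shade = TRUE,
           main = "Confidence in the scientific community by group")
```

Unlike the heatmap, the mosaic plot shows cell sizes and residuals in one figure, at the cost of being harder to read for large tables.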

Your turn - Test whether abany and polviews are statistically independent