07 Chi‐Square Test
You can download the dataset from our shared drive.
Install the package:
install.packages("rio")
Now that the package is installed, load it and import the data:
library(rio)
GSS2021 <- import("GSS2021.dta")
To reorder the variables alphabetically:
install.packages("dplyr")
library(dplyr)
GSS2021 <- GSS2021 %>% select(sort(names(.)))
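To see what the reordering does, here is a minimal sketch on a small made-up data frame. The data frame `toy_df` and its column names are hypothetical, and the sketch uses base R indexing instead of `select`, which should give the same column order.

```r
# Hypothetical toy data frame with columns in non-alphabetical order
toy_df <- data.frame(zeta = 1:2, alpha = 3:4, mid = 5:6)

# sort(names(...)) returns the column names in alphabetical order;
# indexing by those names reorders the columns
toy_sorted <- toy_df[, sort(names(toy_df))]
names(toy_sorted)
```

The values in each column are unchanged; only the column order differs.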
Here, we're using the `abany` variable:
# Calculate the frequency of the variable
frequency_table <- table(GSS2021$abany, useNA = "no")
frequency_table
The frequency table doesn't show the proportion of each group, so I want to add the proportions to the table:
# Calculate proportions using prop.table
proportion_table <- prop.table(frequency_table)
# Combine the frequency and proportion tables using cbind
combined_table <- cbind(Frequency = frequency_table, Proportion = proportion_table)
- `cbind` is used to combine the frequency and proportion tables into a single table, which includes both frequency and proportion columns.
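The same steps can be tried on a small made-up vector to see the shape of the result. The vector `toy_abany` below is hypothetical and is not GSS data; the rounding is only there to make the output easier to read.

```r
# Hypothetical toy responses (1 = yes, 2 = no, NA = missing); not GSS data
toy_abany <- c(1, 2, 2, 1, 2, NA, 2, 1)

# Frequency table, excluding missing values
toy_freq <- table(toy_abany, useNA = "no")

# Proportions are computed over the non-missing responses only
toy_prop <- prop.table(toy_freq)

# One row per category, one column each for frequency and proportion
toy_combined <- cbind(Frequency = toy_freq, Proportion = round(toy_prop, 3))
toy_combined
```

Because `useNA = "no"` drops the missing value before the proportions are calculated, the proportions add up to 1 across the two answer categories.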
It would be helpful to see what "1" and "2" mean when we print the table, so we will relabel the variable:
GSS2021 <- GSS2021 %>%
  mutate(abany = ifelse(abany %in% 1, "Yes",
                 ifelse(abany %in% 2, "No", NA)))
- `GSS2021 %>%`: This part of the code takes the data frame `GSS2021` and pipes it into the next operation. The `%>%` operator allows you to chain operations together.
- The `mutate` function is used to create or modify variables in a data frame.
- The `ifelse` function is used to conditionally assign values to the new `abany` variable based on the values of the existing `abany` variable.
  - `ifelse(abany %in% 1, "Yes", ...)` checks whether the value of `abany` is 1. If it is, it assigns the character string "Yes" to the `abany` variable.
  - If the condition is not met (`...`), another `ifelse` is used: `ifelse(abany %in% 2, "No", NA)`. This checks whether the value of `abany` is 2. If it is, it assigns the character string "No" to the `abany` variable.
  - If neither condition is met, it assigns `NA` (missing value) to the `abany` variable.
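The nested `ifelse` logic can be checked on a short made-up vector. The vector `toy_codes` and the code 8 are hypothetical and only illustrate how values other than 1 and 2 are handled.

```r
# Hypothetical toy codes: 1 = yes, 2 = no, 8 = some other code, NA = missing
toy_codes <- c(1, 2, NA, 8, 1)

# %in% returns FALSE (not NA) for missing values, so every element
# falls through the nested ifelse() calls in a predictable way
toy_labels <- ifelse(toy_codes %in% 1, "Yes",
              ifelse(toy_codes %in% 2, "No", NA))
toy_labels

# With == instead, the comparison itself is NA for the missing value
toy_codes == 1
```

Both the missing value and the unlisted code 8 end up as `NA`, which is what the last branch of the nested `ifelse` is for.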
The chi-square (χ²) test is a statistical method used to determine whether there is a significant association between two categorical variables. In this demo, we will test the statistical independence between individuals' political ideology and their confidence in different institutions, including the press (`conpress`), the scientific community (`consci`), Congress (`conlegis`), and the U.S. Supreme Court (`conjudge`).
The "confidence" variables are defined as follows:
**"Would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in each of the following?"**
- (a) Press (`conpress`)
- (b) Scientific community (`consci`)
- (c) Congress (`conlegis`)
- (d) U.S. Supreme Court (`conjudge`)
chi_subset <- GSS2021 %>% select(id, polviews, conpress, consci, conlegis, conjudge)
The variable `polviews` ranges from 1 to 7. In the GSS coding, 1 is extremely liberal, 4 is moderate, and 7 is extremely conservative.
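To simplify the comparison between individuals with different political views, I want to recode `polviews` into three groups. Note that the code below overwrites `polviews` rather than creating a separate `polparty` variable: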
chi_subset <- chi_subset %>%
  mutate(polviews = ifelse(polviews %in% 1:3, "Democrats",
                    ifelse(polviews %in% 4, "Independent",
                    ifelse(polviews %in% 5:7, "Republican", NA))))
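One way to check a recode like this is to cross-tabulate the original codes against the new labels. The sketch below does that on a made-up vector; `toy_polviews` is hypothetical, and the recode is written in base R so the example runs without `dplyr`.

```r
# Hypothetical ideology scores on the 1-7 scale, plus one missing value
toy_polviews <- c(1, 3, 4, 4, 5, 7, 2, 6, NA)

# Same nested ifelse() logic as the recode above, in base R
toy_group <- ifelse(toy_polviews %in% 1:3, "Democrats",
             ifelse(toy_polviews %in% 4, "Independent",
             ifelse(toy_polviews %in% 5:7, "Republican", NA)))

# Each original code should fall into exactly one recoded group
check <- table(original = toy_polviews, recoded = toy_group, useNA = "ifany")
check
```

In the cross-tabulation, every row (original code) should have counts in only one column (recoded group).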
I want to test whether individuals' political party affiliation is associated with their confidence in the scientific community.
chi_result <- chisq.test(chi_subset$consci, chi_subset$polviews)
The `chi_result` object stores the chi-square test result. Now let's print out the observed frequency and expected frequency tables:
chi_result$observed
chi_result$expected
If we want to add the margins (row and column totals), we can use the `addmargins` function:
addmargins(chi_result$observed)
addmargins(chi_result$expected)
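The margins are also what the expected frequencies are built from: under independence, each expected count is the row total times the column total, divided by the grand total. The sketch below checks this on a made-up table; the counts in `toy_observed` are hypothetical and are not GSS results.

```r
# Hypothetical 3 x 3 table of counts (not GSS results)
toy_observed <- matrix(c(30, 20, 10,
                         25, 25, 20,
                          5, 15, 30),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(
                         confidence = c("A great deal", "Only some", "Hardly any"),
                         group = c("Democrats", "Independent", "Republican")))

# Expected count = (row total * column total) / grand total
toy_expected <- outer(rowSums(toy_observed), colSums(toy_observed)) / sum(toy_observed)
toy_expected

# chisq.test() stores the same expected counts
toy_result <- chisq.test(toy_observed)
all.equal(unname(toy_expected), unname(toy_result$expected))
```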
chi_result
This prints the result of Pearson's chi-square test: the test statistic (`X-squared`), the degrees of freedom (`df`), and the p-value.
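To see where these three numbers come from, they can be recomputed by hand. This is a minimal sketch on a made-up 2 x 3 table (the counts are hypothetical, not GSS results). It uses a table larger than 2 x 2 on purpose: for 2 x 2 tables `chisq.test` applies a continuity correction by default, so the hand calculation would not match unless `correct = FALSE` is set.

```r
# Hypothetical 2 x 3 table of counts (not GSS results)
toy_observed <- matrix(c(40, 30, 20,
                         20, 30, 40),
                       nrow = 2, byrow = TRUE)

toy_result <- chisq.test(toy_observed)

# Statistic: sum over cells of (observed - expected)^2 / expected
x_squared <- sum((toy_observed - toy_result$expected)^2 / toy_result$expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(toy_observed) - 1) * (ncol(toy_observed) - 1)

# p-value: upper tail of the chi-square distribution
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

c(x_squared = x_squared, df = df, p_value = p_value)
```

The three values should agree with `toy_result$statistic`, `toy_result$parameter`, and `toy_result$p.value`.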
To understand the differences between observed and expected frequencies, you can use the `residuals()` function:
residuals(chi_result)
The `residuals` function returns the Pearson residuals, calculated as (observed - expected) / sqrt(expected) for each cell, which measure the differences between the observed and expected counts in a contingency table.
- Positive residuals (values greater than 0) indicate that there are more observations than expected for that combination.
- Negative residuals (values less than 0) indicate that there are fewer observations than expected for that combination.
- Residuals near 0 suggest that the observed and expected counts are similar.
For example,
- The residual of 9.6480567 for "A Great Deal" and "Democrats" is large and positive, suggesting that there are many more observations than expected in that cell.
- The residual of -2.7550663 for "A Great Deal" and "Independent" is negative, suggesting that there are fewer observations than expected in that cell.
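The Pearson residuals can also be recomputed by hand, which makes their link to the test statistic visible. The sketch below uses a made-up table (the counts are hypothetical, not the GSS results above).

```r
# Hypothetical 2 x 3 table of counts (not GSS results)
toy_observed <- matrix(c(40, 30, 20,
                         20, 30, 40),
                       nrow = 2, byrow = TRUE)
toy_result <- chisq.test(toy_observed)

# Pearson residual = (observed - expected) / sqrt(expected)
toy_pearson <- (toy_observed - toy_result$expected) / sqrt(toy_result$expected)
all.equal(toy_pearson, residuals(toy_result), check.attributes = FALSE)

# The squared Pearson residuals add up to the chi-square statistic
sum(toy_pearson^2)

# chisq.test() also stores standardized residuals
toy_result$stdres
```

The standardized residuals in `stdres` are scaled to be roughly standard normal when the variables are independent, so they may be the easier ones to judge against a rule of thumb such as an absolute value above 2.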