## The logic of chi-square tests with "toast" sample data

In [1]:
# We begin by creating a function that generates a 2x2 matrix for our silly sample data on toast.

make2x2table <- function(ul) # The user supplies the count for the upper left cell
{
  ll <- 50 - ul # Calculate the lower left cell
  ur <- 30 - ul # Calculate the upper right cell
  lr <- 50 - ur # Calculate the lower right cell
  # Put all of the cells into a 2x2 matrix
  matrix(c(ul, ur, ll, lr), nrow=2, ncol=2, byrow=TRUE)
}

In [2]:
make2x2table(15) # Should be like Table 7.2 in Stanton
make2x2table(0)   # Should be like Table 7.3
make2x2table(30) 	# Should be like Table 7.4

     [,1] [,2]
[1,]   15   15
[2,]   35   35

     [,1] [,2]
[1,]    0   30
[2,]   50   20

     [,1] [,2]
[1,]   30    0
[2,]   20   50
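As a quick check (a sketch, not part of the original exercise), the snippet below shows why one upper-left count is enough to determine the whole table: the function holds the row totals at 30 and 70 and the column totals at 50 and 50. The function is repeated so the snippet runs on its own.

```r
make2x2table <- function(ul) {
  ll <- 50 - ul
  ur <- 30 - ul
  lr <- 50 - ur
  matrix(c(ul, ur, ll, lr), nrow = 2, ncol = 2, byrow = TRUE)
}

# The marginal totals are the same for every valid upper-left count
for (ul in c(0, 15, 30)) {
  tab <- make2x2table(ul)
  cat("ul =", ul,
      "| row totals:", rowSums(tab),
      "| column totals:", colSums(tab), "\n")
}
```

Because the margins never change, every one of these tables shares the same expected counts under independence.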


In [3]:
# Next, we write a function to calculate the chi-square score so that we understand the logic.
# In the future, you can just run the chi-square test with R's built-in function (see below).
calcChiSquared <- function(actual, expected) # Calculate chi-squared
{
  diffs <- actual - expected          # Take the raw difference for each cell
  diffsSq <- diffs ^ 2                # Square each difference
  diffsSqNorm <- diffsSq / expected   # Normalize by the expected count in each cell
  sum(diffsSqNorm)                    # Return the sum over all cells
}

In [4]:
# This makes a matrix that is just like Table 7.2
# This table represents the null hypothesis of independence
expectedValues <- matrix(c(15,15,35,35), nrow=2, ncol=2, byrow=TRUE)
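The expected values do not have to be typed in by hand. As a sketch, they can be derived from the marginal totals of any observed table: each expected count is the row total times the column total, divided by the grand total. The names `observed` and `expectedFromMargins` are illustrative and do not appear in the original notebook.

```r
# Observed counts for one of the toast tables (upper-left count of 20)
observed <- matrix(c(20, 10, 30, 40), nrow = 2, ncol = 2, byrow = TRUE)

# Expected count for each cell = (row total * column total) / grand total
expectedFromMargins <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expectedFromMargins
```

Since every toast table has the same margins, this reproduces the 15, 15, 35, 35 layout used for the null hypothesis.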

In [9]:
#To find the critical value for different degrees of freedom and alpha values, use a reference table, like the one found at 
# https://www.mun.ca/biology/scarr/4250_Chi-square_critical_values.html
calcChiSquared(make2x2table(15),expectedValues)

This result (chi-sq = 0, which is not greater than our critical value of 3.84) means that the observed values (ul=15) are exactly what we would expect by chance. We fail to reject the null hypothesis that the variables (toast topping and landing side) are independent; there is no evidence of an association between them.
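The critical value can also be computed in R instead of being read from a reference table. This minimal sketch uses the quantile function `qchisq()`; the variable names are illustrative.

```r
alpha <- 0.05
df <- 1   # (rows - 1) * (columns - 1) for a 2x2 table

# The critical value is the point that cuts off the upper alpha tail
criticalValue <- qchisq(1 - alpha, df)
round(criticalValue, 2)   # 3.84
```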

In [10]:
calcChiSquared(make2x2table(0),expectedValues)

The result, 42.86, is greater than the critical value of 3.84. Observations this far from the expected values are unlikely to be due to chance alone, so there is likely an association between the variables.
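To see where the 42.86 comes from, the sketch below prints the contribution of each cell before summing. It repeats the calculation inside `calcChiSquared` step by step; `contributions` is an illustrative name.

```r
actual <- matrix(c(0, 30, 50, 20), nrow = 2, ncol = 2, byrow = TRUE)
expectedValues <- matrix(c(15, 15, 35, 35), nrow = 2, ncol = 2, byrow = TRUE)

# Each cell contributes (observed - expected)^2 / expected
contributions <- (actual - expectedValues)^2 / expectedValues
round(contributions, 2)

# The chi-square score is the sum of the four contributions
sum(contributions)
```

Every cell is off by 15, but the top row contributes more because its expected counts are smaller.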

In [11]:
calcChiSquared(make2x2table(30),expectedValues)

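The result is again 42.86, which is greater than 3.84. This table mirrors the previous one, so every cell differs from its expected value by the same amount (15) and the score is identical. Once more, the observations are unlikely to be due to chance alone, so there is likely an association between the variables.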

In [12]:
# Run the chi-square test on Table 7.1 data
chisq.test(make2x2table(20), correct=FALSE)


	Pearson's Chi-squared test

data:  make2x2table(20)
X-squared = 4.7619, df = 1, p-value = 0.0291


The chi-square score is 4.76. To identify the critical value, we use df = 1 and alpha = 0.05 (5%, the alpha value that corresponds to a 95% confidence level), which gives 3.84. Because 4.76 is greater than 3.84, the variables (toast topping and landing side) are likely to be associated. The p-value leads to the same conclusion: 0.0291 < 0.05, so the result is statistically significant.
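The p-value reported by `chisq.test` is the upper-tail probability of the chi-square distribution at the observed score. As a sketch, it can be recovered with `pchisq()`, using the rounded score from the output above.

```r
chiSq <- 4.7619   # X-squared from the chisq.test output
pValue <- pchisq(chiSq, df = 1, lower.tail = FALSE)
round(pValue, 4)
```

Comparing the score with the critical value and comparing the p-value with alpha are two views of the same decision.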

In [6]:
# Run the chi-square test on Table 7.1 data
# correct = FALSE means that the calculation will not consider Yates' correction. 
# Yates' correction makes the chi-square test more conservative to handle small samples.
chisq.test(make2x2table(20), correct=FALSE)


	Pearson's Chi-squared test

data:  make2x2table(20)
X-squared = 4.7619, df = 1, p-value = 0.0291
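To see what the `correct` argument changes, the sketch below runs the test on the same counts with and without Yates' correction. The table is typed in directly so the snippet runs on its own; `toast`, `uncorrected`, and `corrected` are illustrative names.

```r
toast <- matrix(c(20, 10, 30, 40), nrow = 2, ncol = 2, byrow = TRUE)

uncorrected <- chisq.test(toast, correct = FALSE)
corrected <- chisq.test(toast, correct = TRUE)

# The correction shrinks each |observed - expected| by 0.5 before squaring,
# so the corrected score is smaller and its p-value is larger
c(uncorrected = unname(uncorrected$statistic), corrected = unname(corrected$statistic))
c(uncorrected = uncorrected$p.value, corrected = corrected$p.value)
```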


In [7]:
# Run the chi-square test on R's built-in Titanic data
# ftable() extracts a split of survivors and nonsurvivors by sex, which we then test for independence.
badBoatMF <- ftable(Titanic, row.vars=2, col.vars="Survived")
badBoatMF
chisq.test(badBoatMF, correct=FALSE)

       Survived   No  Yes
Sex                      
Male            1364  367
Female           126  344


	Pearson's Chi-squared test

data:  badBoatMF
X-squared = 456.87, df = 1, p-value < 2.2e-16
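A score this large says that sex and survival are associated, but not where the association lies. As a sketch, the object returned by `chisq.test` also carries the expected counts and the Pearson residuals, which show which cells depart most from independence. Here `margin.table()` is swapped in for `ftable()` to build the same 2x2 table, and `sexBySurvival` is an illustrative name.

```r
# Collapse the 4-way Titanic table to Sex (dimension 2) by Survived (dimension 4)
sexBySurvival <- margin.table(Titanic, c(2, 4))

result <- chisq.test(sexBySurvival, correct = FALSE)
round(result$expected, 1)    # counts expected if sex and survival were independent
round(result$residuals, 1)   # (observed - expected) / sqrt(expected)
```

A positive residual marks a cell with more people than independence would predict, and a negative residual marks a cell with fewer.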


## How to run chi-square tests on your own data with matrices larger than 2x2

In [13]:
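# Example of creating a contingency table first and then running the chi-square test
# MASS is normally installed along with R, so install it only if it is missing
if (!requireNamespace("MASS", quietly = TRUE)) install.packages("MASS")
library(MASS)
data(survey)
head(survey)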


     Sex Wr.Hnd NW.Hnd W.Hnd    Fold Pulse    Clap Exer Smoke Height      M.I    Age
1 Female   18.5   18.0 Right  R on L    92    Left Some Never 173.00   Metric 18.250
2   Male   19.5   20.5  Left  R on L   104    Left None Regul 177.80 Imperial 17.583
3   Male   18.0   13.3 Right  L on R    87 Neither None Occas     NA     <NA> 16.917
4   Male   18.8   18.9 Right  R on L    NA Neither None Never 160.00   Metric 20.333
5   Male   20.0   20.0 Right Neither    35   Right Some Never 165.00   Metric 23.667
6 Female   18.0   17.7 Right  L on R    64   Right Some Never 172.72 Imperial 21.000


In the survey data set from the MASS package, the Smoke column records each student's smoking habit, while the Exer column records their exercise level. The allowed values in Smoke are "Heavy", "Regul" (regularly), "Occas" (occasionally) and "Never". The allowed values in Exer are "Freq" (frequently), "Some" and "None".

Test the hypothesis that the students' smoking habits are independent of their exercise levels at the .05 significance level.

In [14]:
# Create a contingency table of smoking habit (rows) by exercise level (columns).
smoke <- table(survey$Smoke, survey$Exer)
smoke # View the contingency table

       
        Freq None Some
  Heavy    7    1    3
  Never   87   18   84
  Occas   12    3    4
  Regul    9    1    7

In [15]:
# Test the hypothesis that the students' smoking habits are independent of their exercise levels at the .05 significance level.
chisq.test(smoke, correct=FALSE)

"Chi-squared approximation may be incorrect"



	Pearson's Chi-squared test

data:  smoke
X-squared = 5.4885, df = 6, p-value = 0.4828


As the p-value 0.4828 is greater than the .05 significance level, we do not reject the null hypothesis that the smoking habit is independent of the exercise level of the students.
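R prints the warning "Chi-squared approximation may be incorrect" with this result when any expected cell count is below 5. As a sketch, the expected counts can be inspected directly; this assumes the MASS package is available, and `expectedCounts` is an illustrative name.

```r
library(MASS)  # provides the survey data set

smoke <- table(survey$Smoke, survey$Exer)

# suppressWarnings() hides the warning while we pull out the expected counts
expectedCounts <- suppressWarnings(chisq.test(smoke))$expected
round(expectedCounts, 2)

# Number of cells whose expected count is below 5
sum(expectedCounts < 5)
```

The small expected counts sit mostly in the "None" column, because few students report no exercise at all.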

<b>ENHANCED SOLUTION:</b> The warning message in the solution above is due to the small cell values in the contingency table. To avoid this warning, we combine the second and third columns of smoke ("None" and "Some") and save the result in a new table named ctbl. Then we apply the chisq.test function to ctbl instead.


In [16]:
# Keep "Freq" as the first column and merge "None" and "Some" into the second
ctbl <- cbind(smoke[,"Freq"], smoke[,"None"] + smoke[,"Some"])
ctbl
chisq.test(ctbl)

      [,1] [,2]
Heavy    7    4
Never   87  102
Occas   12    7
Regul    9    8



	Pearson's Chi-squared test

data:  ctbl
X-squared = 3.2328, df = 3, p-value = 0.3571


We still fail to reject the null hypothesis that the two variables are independent because the chi-squared value is 3.23, which is *not* greater than the critical value of 7.815 for df=3 and alpha = 0.05, and the p-value of 0.36 is *not* less than the alpha value of 0.05.
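For tables larger than 2x2, the degrees of freedom are (rows - 1) * (columns - 1). The sketch below derives df and the critical value for the combined table. The counts are typed in from the output above so that the snippet runs without the MASS package.

```r
ctbl <- matrix(c(7, 4, 87, 102, 12, 7, 9, 8), ncol = 2, byrow = TRUE,
               dimnames = list(c("Heavy", "Never", "Occas", "Regul"), NULL))

# Degrees of freedom for an r x c contingency table
df <- (nrow(ctbl) - 1) * (ncol(ctbl) - 1)

# Critical value at alpha = 0.05 for that many degrees of freedom
criticalValue <- qchisq(0.95, df)
round(criticalValue, 3)

result <- chisq.test(ctbl)
unname(result$statistic) > criticalValue   # FALSE means we fail to reject independence
```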