## EXAMPLE 1 -- simple statistical test

Let's *test* if students choose seating in a non-random way. \
A null hypothesis is that students choose seats at random, so we would expect a 50/50 split between the left and the right of the classroom. \
However, let us consider 100 students, where 36 sit on the left and 64 sit on the right.

In [None]:
cat("Binomial test p =",binom.test(64,100,.5)$p.value,"\n\n")

# trivially this is the same as considering those who sat on the other side
cat("Binomial test p =",binom.test(36,100,.5)$p.value,"\n\n")
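The p-value reported by `binom.test` can be reproduced by hand as a sanity check. The sketch below assumes the symmetric null of 0.5, in which case the two-sided p-value is simply twice the upper-tail probability of seeing 64 or more students on one side. The variable names are illustrative only.

```r
n <- 100   # total number of students
k <- 64    # students sitting on the right

# P(X >= 64) under Binomial(100, 0.5).
# pbinom(k - 1, ..., lower.tail = FALSE) gives P(X > k - 1) = P(X >= k).
upper_tail <- pbinom(k - 1, n, 0.5, lower.tail = FALSE)

# With p = 0.5 the distribution is symmetric, so the two-sided p-value doubles the tail.
p_manual  <- 2 * upper_tail
p_builtin <- binom.test(k, n, 0.5)$p.value

cat("Manual p   =", p_manual, "\n")
cat("Built-in p =", p_builtin, "\n")
```

For a null probability other than 0.5 the distribution is no longer symmetric, so this doubling shortcut should not be expected to match `binom.test`.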

Let's now create a 2x2 "observation" matrix by reshaping a vector of four values (R fills the matrix column by column).

In [None]:
Seating <- matrix(c(36, 50, 64, 50),
       nrow = 2,
       dimnames = list(c("Observed", "Expectation"),
                       c("Left", "Right")))

# View the matrix to check it is right

print(Seating)

We'll now perform Fisher's exact test on the matrix. \
This is by default a two-sided test, checking if there is *a* difference. \
What if the right side is closer to the door, so we think people might prefer that side? \
We can also use a one-sided version to test if "more people sit on the left than expected" or "fewer people sit on the left than expected".

In [None]:
# Print out the test results.
# Using `cat` allows us to print more complex statements with multiple parts (the initial text and then also the p value).
# The `\n` are special characters to add new lines and make the printing easier to read.

result <- fisher.test(Seating)
cat("\nFisher's exact test (Two-sided)\np =",result$p.value,"\n")

result <- fisher.test(Seating, alternative="greater")
cat("\nFisher's exact test (One-sided, greater)\np =",result$p.value,"\n")

result <- fisher.test(Seating, alternative="less")
cat("\nFisher's exact test (One-sided, less)\np =",result$p.value,"\n")
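For `fisher.test`, the `alternative` argument refers to the odds ratio of the 2x2 table rather than to a raw count, which is easy to misread. The following sketch rebuilds the seating matrix from above and compares the simple sample odds ratio with the estimate that `fisher.test` reports; `sample_or`, `p_less` and `p_greater` are illustrative names.

```r
Seating <- matrix(c(36, 50, 64, 50),
       nrow = 2,
       dimnames = list(c("Observed", "Expectation"),
                       c("Left", "Right")))

# Sample odds ratio: (Observed Left * Expectation Right) / (Observed Right * Expectation Left)
sample_or <- (Seating[1, 1] * Seating[2, 2]) / (Seating[1, 2] * Seating[2, 1])
cat("Sample odds ratio    =", sample_or, "\n")

# fisher.test reports a conditional maximum-likelihood estimate, which is usually
# close to, but not identical to, the sample odds ratio.
cat("fisher.test estimate =", fisher.test(Seating)$estimate, "\n")

# "less" asks whether the odds ratio is below 1 (fewer on the left than expected),
# "greater" asks whether it is above 1.
p_less    <- fisher.test(Seating, alternative = "less")$p.value
p_greater <- fisher.test(Seating, alternative = "greater")$p.value
cat("p (less) =", p_less, "  p (greater) =", p_greater, "\n")
```

Because the observed odds ratio is below 1, the "less" alternative is the one that can come out small here.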

In [None]:
# Let's look at the "marginals" in more detail.
# addmargins appends the row and column sums to a matrix.

print(addmargins(Seating))

# Consider a slightly different seating arrangement.

Seating <- matrix(c(45, 50, 55, 50),
       nrow = 2,
       dimnames = list(c("Observed", "Expectation"),
                       c("Left", "Right")))

# Now look at the margins compared to before.
# The row totals (the "Sum" column on the right side) are still equal to 100.
# These should be fixed: we have 100 students no matter what (unless the lecture is that bad and they leave halfway through).
# However, the column totals (the "Sum" row at the bottom) can change depending on where students sit, and so these are NOT fixed.

cat("\n")
print(addmargins(Seating))
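The margins can also be pulled out directly, which makes it easier to see which totals stay fixed. This is a small sketch using the second seating arrangement from above; `row_totals` and `col_totals` are illustrative names.

```r
Seating <- matrix(c(45, 50, 55, 50),
       nrow = 2,
       dimnames = list(c("Observed", "Expectation"),
                       c("Left", "Right")))

# Row totals: the number of students in each row of the table (fixed at 100)
row_totals <- rowSums(Seating)
print(row_totals)

# Column totals: how many entries fall on each side (these move with the observed data)
col_totals <- colSums(Seating)
print(col_totals)
```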

In [None]:
## EXAMPLE 2

# Let's consider (randomly generated) dairy yield from two different cattle breeds.
# 1000 normally distributed yield variates, with slightly different means but both with a standard deviation of 2.

holstein <- rnorm(1000,90,2)
brown_swiss <- rnorm(1000,88,2)

# Let's plot the distribution just to help visualise.
# Plotting these as histograms with different colours in the same plot (add=T) and a title (main="Milk yield").

hist(holstein,col=rgb(0,0,1,1/4), xlim=c(80,100),main="Milk yield",xlab="yield (L)")
hist(brown_swiss,col=rgb(1,0,0,1/4), xlim=c(80,100),add=T)

# Manually add a legend to the plot for the two classes

legend("topright",c("Holstein","Brown Swiss"),fill=c(rgb(0,0,1,1/4),rgb(1,0,0,1/4)))

In [None]:
# Now let's test if two different cattle breeds have the same milk yield.
# Our null hypothesis is that the two breeds have the same yield.
# We can use the t-test, which assumes normally distributed data and (with var.equal=T) equal variances.

result <- t.test(holstein,brown_swiss,var.equal=T)
cat("p =",result$p.value,"\ndof =",result$parameter,"\nt statistic =",result$statistic,"\n")

# Given the degrees of freedom and the t statistic, we can also calculate the p-value from the theoretical Student's t distribution.
# Because we used a two-sided test, multiply the upper-tail probability by 2 (abs() keeps this correct if the t statistic is negative).

cat("Distribution calculated p =",2*pt(abs(result$statistic),result$parameter, lower.tail=FALSE),"\n\n\n")

cor.test(sort(brown_swiss),sort(holstein))
summary(lm(sort(brown_swiss) ~ sort(holstein)))
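To see where the t statistic and degrees of freedom come from, the equal-variance t-test can be computed by hand. This is a minimal sketch that simulates its own data with a fixed seed (the seed and the names `x`, `y`, `sp2` are assumptions for illustration) and compares the result against `t.test`.

```r
set.seed(1)                 # fixed seed so the numbers are reproducible
x <- rnorm(1000, 90, 2)     # stand-in for the holstein yields
y <- rnorm(1000, 88, 2)     # stand-in for the brown_swiss yields
nx <- length(x)
ny <- length(y)

# Pooled variance: weighted average of the two sample variances
sp2 <- ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)

# t statistic: difference in means divided by its standard error
t_manual  <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))
df_manual <- nx + ny - 2

builtin <- t.test(x, y, var.equal = TRUE)
cat("manual t =", t_manual, " built-in t =", builtin$statistic, "\n")
cat("manual dof =", df_manual, " built-in dof =", builtin$parameter, "\n")
```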


In [None]:
## EXAMPLE 4

# Consider the crop harvest at many different locations over two different years

year_2022 <- c(180,161,188,151,178,160,190,172,118,164,222,217,155,165,129,193,210,191,176,205,170,167,189)
year_2023 <- c(133,211,168,137,144,172,143,174,139,156,170,162,212,170,171,161,177,162,151,186,132,158,97)

# What do the distributions look like?
# We wouldn't expect anything too obvious since the locations could all be quite different

hist(year_2022,col=rgb(1,0,0,1/4),breaks = seq(50,250,length.out = 16),main="Crop yield",xlab="yield (kg)")
hist(year_2023,col=rgb(0,0,1,1/4),breaks = seq(50,250,length.out = 16), add=T) 
# Legend colours match the histograms above: 2022 in red, 2023 in blue
legend("right",c("2022","2023"),fill=c(rgb(1,0,0,1/4),rgb(0,0,1,1/4)))

plot(year_2022,year_2023,col=rgb(0,0,1),xlab="2022 yield (kg)",ylab="2023 yield (kg)")

In [None]:
# Let's check whether it is appropriate to assume they are normally distributed (a small Shapiro-Wilk p-value suggests they are not)

cat("2022 p =",shapiro.test(year_2022)$p,"\n")
cat("2023 p =",shapiro.test(year_2023)$p,"\n")

# And let's look at their mean values

cat("2022 mean =",mean(year_2022),"\n")
cat("2023 mean =",mean(year_2023),"\n")

# And finally let's test that there is a difference in these samples.
# Since we cannot safely assume normality for such a small sample, we should use a non-parametric test.
# Technically, the null hypothesis is that the distributions are identical, so you would not expect to draw larger values from one population than the other.

cat("Mann–Whitney U test p =",wilcox.test(year_2022, year_2023, paired = FALSE,exact=F)$p.value,"\n")
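The statistic behind the Mann–Whitney U test is just a count of how often a value from the first sample exceeds a value from the second. The sketch below uses two short made-up vectors (`x` and `y` are hypothetical, not the crop data) so the counting is easy to follow, and checks it against the `W` statistic that `wilcox.test` reports.

```r
x <- c(1.1, 2.3, 3.8, 5.0)   # hypothetical sample 1
y <- c(0.9, 2.0, 2.9)        # hypothetical sample 2

# Compare every x with every y: count pairs where x > y, with ties counting as half
u_manual <- sum(outer(x, y, ">")) + 0.5 * sum(outer(x, y, "=="))

# Equivalent rank-sum form: sum of the ranks of x in the pooled sample, minus its minimum
ranks <- rank(c(x, y))
u_ranks <- sum(ranks[seq_along(x)]) - length(x) * (length(x) + 1) / 2

u_builtin <- wilcox.test(x, y)$statistic
cat("manual U =", u_manual, " rank-sum U =", u_ranks, " wilcox.test W =", u_builtin, "\n")
```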

In [None]:
## Further investigations

## EXAMPLE 1
# Does the sample size make much difference? Consider looking at the chi-square test (chisq).
#
# Can we generalise to more than 2 categories (first row of seats, second row of seats, third row of seats, etc.)?

## EXAMPLE 2
# Does changing var.equal in EXAMPLE 2 to false change the results much?
#
# What if we change the standard deviations in the milk yields from 2 to something different for each class?
#
# Can you find an example where a TRUE/FALSE value for var.equal would give a significant or not-significant result? What would that mean?

## EXAMPLE 3
# Does increasing/decreasing N or B make it more or less likely to get significant hits even after p value correction?
#
# What sort of conditions would lead to one method being too conservative/lenient? Is it meaningful if you have a significant hit after one correction
# but not another?

## EXAMPLE 4
# Consider if the crop yields from 2022 and 2023 were measured at the same set of locations, so the first yield from 2022 and 2023 are from the same field.
# The yields are now "paired" and can be compared more directly rather than just a list of yields.
#
# We can then use a paired wilcox test, but how does this affect the results? What is the interpretation?
#
# What if we ignored the fact the data is not normally distributed, and used something like a paired t-test? How does that look? How does that relate
# to the power of the test?

In [None]:
## EXAMPLE 1
# Does the sample size make much difference? Consider looking at the chi-square test (chisq).

Seating <- matrix(c(360, 500, 640, 500),
       nrow = 2,
       dimnames = list(c("Observed", "Expectation"),
                       c("Left", "Right")))
result <- fisher.test(Seating)
cat("\nFisher's exact test (Two-sided)\np =",result$p.value,"\n")

# Yes: with a larger sample (but the same proportions) the difference is highly statistically significant. With smaller samples it is much harder to reach significance.

cat("\nChi Square test (Two-sided)\np =",chisq.test(Seating)$p.value,"\n")

# The chi-square test gives a very similar result. It relies on a large-sample approximation, whereas Fisher's test is exact, but it is also more general.



# Can we generalise to more than 2 categories (first row of seats, second row of seats, third row of seats, etc.)?

# Make a larger matrix with a few extra values

Seating <- matrix(c(10, 25, 40, 25,28,22, 30,20, 24,26, 15, 35),
       nrow = 2,       
       dimnames = list(c("Observed", "Expectation"),
                       c("Row 1", "Row 2", "Row 3","Row 4", "Row 5","Row 6")))

# chi square test works fine, but Fisher's exact test gives an error (as the exact test becomes too hard to calculate, it is mostly for 2x2 matrices).
print(Seating)
cat("\nChi Square test (Two-sided)\np =",chisq.test(Seating)$p.value,"\n")
cat("\nFisher's exact test (Two-sided)\np =",fisher.test(Seating)$p,"\n")

# can approximate the larger Fisher test using "simulate.p.value=TRUE"
#cat("\nFisher's exact test (Two-sided)\np =",fisher.test(Seating,simulate.p.value=TRUE)$p,"\n")
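The effect of sample size can be made explicit by holding the proportion at 36% on the left and varying the class size. This sketch assumes three illustrative class sizes (50, 100 and 1000) and builds the same observed-versus-expected table as above for each one.

```r
# Observed "left" counts at a fixed 36% proportion for three class sizes
class_sizes <- c(50, 100, 1000)
left_counts <- c(18, 36, 360)

p_values <- sapply(seq_along(class_sizes), function(i) {
  n    <- class_sizes[i]
  left <- left_counts[i]
  # Rows: Observed, Expectation; columns: Left, Right (filled column by column)
  m <- matrix(c(left, n / 2, n - left, n / 2), nrow = 2)
  fisher.test(m)$p.value
})

print(data.frame(n = class_sizes, p = p_values))
```

The p-value shrinks as the class grows even though the proportion never changes, which is why a p-value on its own says little about the size of an effect.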

In [None]:
## EXAMPLE 2
# Does changing var.equal in EXAMPLE 2 to false change the results much?
#
 
holstein <- rnorm(1000,90,2)
brown_swiss <- rnorm(1000,88,2)

result <- t.test(holstein,brown_swiss,var.equal=F)
cat("p =",result$p.value,"\ndof =",result$parameter,"\nt statistic =",result$statistic,"\n")

# There is a tiny change in degrees of freedom when the variances are not assumed equal (it is now "Welch's t-test"), but with equal sample sizes there is no change to the t statistic and the p-value is still significant.


# What if we change the standard deviations in the milk yield from 2 to something different for each class?

holstein <- rnorm(1000,90,3)
brown_swiss <- rnorm(1000,88,4)

result <- t.test(holstein,brown_swiss,var.equal=F)
cat("p =",result$p.value,"\ndof =",result$parameter,"\nt statistic =",result$statistic,"\n")

# Still significant, but there is a larger drop in degrees of freedom and t statistic, as the distributions of milk yields are more likely to overlap with larger variance.

# Can you find an example where a TRUE/FALSE value for var.equal would give a significant or not-significant result? What would that mean?

# Some combinations of parameters and random seed will do this (similar to the case in Example 1).
# If the variances truly are equal, then picking var.equal=F and getting a non-significant result means you used an "under-powered" test and will incorrectly say the difference is non-significant (false negative error).
# Alternatively, picking var.equal=T when they are not equal might mean your test is "over-powered" and you are incorrectly reporting the difference as significant (false positive error).
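The change in degrees of freedom under `var.equal=F` comes from the Welch-Satterthwaite approximation, which can be checked by hand. This is a sketch on freshly simulated data with a fixed seed (the seed and the names `x`, `y`, `vx`, `vy` are assumptions for illustration).

```r
set.seed(42)                # fixed seed so the numbers are reproducible
x <- rnorm(1000, 90, 3)     # stand-in for holstein with sd 3
y <- rnorm(1000, 88, 4)     # stand-in for brown_swiss with sd 4
nx <- length(x)
ny <- length(y)

# Squared standard error contributed by each sample
vx <- var(x) / nx
vy <- var(y) / ny

# Welch-Satterthwaite degrees of freedom
df_welch <- (vx + vy)^2 / (vx^2 / (nx - 1) + vy^2 / (ny - 1))

# Welch t statistic uses the unpooled standard error
t_welch <- (mean(x) - mean(y)) / sqrt(vx + vy)

builtin <- t.test(x, y, var.equal = FALSE)
cat("manual dof =", df_welch, " built-in dof =", builtin$parameter, "\n")
cat("manual t   =", t_welch, " built-in t   =", builtin$statistic, "\n")
```

The Welch degrees of freedom can never exceed the pooled value of `nx + ny - 2`, and they fall further below it as the two sample variances diverge.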

In [None]:
## EXAMPLE 3
# Does increasing/decreasing N or B make it more or less likely to get significant hits even after p value correction?
#
# The corrections are strict, so in general they will report few (if any) significant hits.
# However, increasing N/B (so more flips per person) can result in a person getting a very significant outlier, which may still be significant after correction.
# Similarly, decreasing N/B (so more people flipping fewer coins) means the correction factor is higher, so a significant result after correction is even less likely.


# What sort of conditions would lead to one method being too conservative/lenient? Is it meaningful if you have a significant hit after one correction
# but not another?

# If your probabilities are very unevenly distributed (so a few highly significant values and a few non-significant values), then the Bonferroni correction will be much stricter than B-H.
# They are both valid choices, so there is not any meaningful conclusion if p-values are significant after one correction but not the other.
# However, it is important not to change your method because you didn't like the outcome (p-value hacking!).
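The difference between the two corrections is easiest to see on a small set of p-values. The values in `raw_p` below are made up purely for illustration; `p.adjust` is the base R function that applies either correction.

```r
# Hypothetical raw p-values from six independent tests
raw_p <- c(0.001, 0.008, 0.012, 0.04, 0.3, 0.6)

# Bonferroni multiplies every p-value by the number of tests (capped at 1)
bonf <- p.adjust(raw_p, method = "bonferroni")

# Benjamini-Hochberg scales each p-value by (number of tests / rank), so it is gentler
bh <- p.adjust(raw_p, method = "BH")

print(data.frame(raw = raw_p, bonferroni = bonf, BH = bh))
cat("Significant at 0.05 after Bonferroni:", sum(bonf < 0.05), "\n")
cat("Significant at 0.05 after B-H:       ", sum(bh < 0.05), "\n")
```

In this made-up case the third p-value survives B-H but not Bonferroni, which is the kind of disagreement the question above is asking about.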

In [None]:
## EXAMPLE 4
# Consider if the crop yields from 2022 and 2023 were measured at the same set of locations, so the first yield from 2022 and 2023 are from the same field.
# The yields are now "paired" and can be compared more directly rather than just a list of yields.
#
# We can then use a paired wilcox test, but how does this affect the results? What is the interpretation?
#

cat("Mann–Whitney U test (unpaired) p =",wilcox.test(year_2022, year_2023, paired = F,exact=F)$p.value,"\n")
cat("Wilcoxon signed-rank test (paired) p =",wilcox.test(year_2022, year_2023, paired = T,exact=F)$p.value,"\n")

# In this case, the test is still significant, but a lot weaker and the effect size is also lower (see lecture 6).
# Before we were considering if one year had higher yield than the other in general, but now we can also say one year yielded more crops from the same locations than the other year (so it is easier to support a claim of "better").


# What if we ignored the fact the data is not normally distributed, and used something like a paired t-test? How does that look? How does that relate
# to the power of the test?

# Note: t.test defaults to var.equal=FALSE, so the unpaired version here is Welch's t-test.
cat("Welch's t test (unpaired) p =",t.test(year_2022, year_2023, paired = F)$p.value,"\n")
cat("Student's t test (paired) p =",t.test(year_2022, year_2023, paired = T)$p.value,"\n")

# Interestingly, the p-value is higher for the unpaired test and lower for the paired test.
# In general, since the test's assumptions are not appropriate here (the data are not normally distributed and the sample is too small to approximate normality), it may be over- or under-powered, depending on which part of the t-distribution is being tested.
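A paired t-test is the same thing as a one-sample t-test on the per-location differences, which helps explain why pairing changes the result. The sketch below uses only the first six values of each year from the vectors above, purely to keep it short; `y22` and `y23` are illustrative names.

```r
# First six locations from the 2022 and 2023 yield vectors
y22 <- c(180, 161, 188, 151, 178, 160)
y23 <- c(133, 211, 168, 137, 144, 172)

# Paired test on the two vectors
paired_res <- t.test(y22, y23, paired = TRUE)

# One-sample test on the differences, asking whether their mean is zero
diff_res <- t.test(y22 - y23, mu = 0)

cat("paired p     =", paired_res$p.value, "\n")
cat("difference p =", diff_res$p.value, "\n")
```

Pairing removes the location-to-location variation from the comparison, so only the spread of the differences matters.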

## Additional example

This is a slightly more realistic case, where we are examining some actual data. \
The goal here is to manipulate the data into a useful shape, and then assess statistical significance.

In [None]:
# read in the data from the csv, which uses ';' as the separator rather than ','
# the first row of the file is also some download information rather than data, so we skip the first row
slaughter_weights <- read.csv('../../cattle-slaughterWeights.csv',sep=';',skip=1)
print(head(slaughter_weights,n=20))

# we now want to reshape the data to sum over the 12 months in a year for each TypeOfUse and Year
grouped_weights <- aggregate(cbind(tot_number_animals=slaughter_weights$tot_number_animals), by=list(TypeOfUse=slaughter_weights$TypeOfUse,Year=slaughter_weights$Year), FUN=sum)

# we can also do this more compactly with the '~' formula syntax, which produces the same output 
grouped_weights <- aggregate(tot_number_animals ~ TypeOfUse + Year,data=slaughter_weights, FUN=sum)

# print out the new dataframe so we can see it
print(grouped_weights)
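If the CSV is not to hand, the behaviour of `aggregate` can be checked on a toy data frame. The column names below are taken from the code above, but the rows in `toy` are invented for illustration only.

```r
# Invented rows: two "months" per group in 2020, one per group in 2021
toy <- data.frame(
  TypeOfUse = c("Beef", "Beef", "Dairy", "Dairy", "Beef", "Dairy"),
  Year = c(2020, 2020, 2020, 2020, 2021, 2021),
  tot_number_animals = c(10, 12, 20, 22, 15, 25)
)

# Sum tot_number_animals within each TypeOfUse/Year combination
toy_grouped <- aggregate(tot_number_animals ~ TypeOfUse + Year, data = toy, FUN = sum)
print(toy_grouped)

# Pull out a single group to check the sum
beef_2020 <- toy_grouped$tot_number_animals[toy_grouped$TypeOfUse == "Beef" & toy_grouped$Year == 2020]
cat("Beef 2020 total =", beef_2020, "\n")
```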

In [None]:
# We can now use a t-test to see if the total number of cattle slaughtered is statistically different by their type of use (beef or dairy)
# The data for each type of use is from the same time periods, so we want to use a paired t-test.

t.test(grouped_weights[grouped_weights$TypeOfUse == 'Beef',]$tot_number_animals,grouped_weights[grouped_weights$TypeOfUse == 'Dairy',]$tot_number_animals,paired=TRUE)

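# or again using the more compact formula syntax
# (note: recent versions of R, 4.4.0 onwards, reject `paired` in the formula method; if this errors, use the vector form above)
t.test(tot_number_animals ~ TypeOfUse, data = grouped_weights,paired=TRUE)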

# We can also explore the one-sided tests, where we can change default setting from "alternative='two.sided'" to "alternative='greater'" or "alternative='less'".
# Based on the data, which one-sided test would you expect to be significant?
# How does the p-value compare for that one-sided test compared to the two-sided test?
# Why does that depend on the symmetry of the distribution?

t.test(tot_number_animals ~ TypeOfUse, data = grouped_weights,paired=TRUE,alternative='two.sided')
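The relationship between the one-sided and two-sided p-values can be explored on a small made-up example before trying it on the real data. The vectors `beef` and `dairy` below are hypothetical counts, chosen so that every paired difference is positive.

```r
# Hypothetical paired counts (not the real slaughter data)
beef  <- c(120, 135, 128, 140, 132)
dairy <- c(100, 110, 105, 118, 109)

p_two     <- t.test(beef, dairy, paired = TRUE, alternative = "two.sided")$p.value
p_greater <- t.test(beef, dairy, paired = TRUE, alternative = "greater")$p.value
p_less    <- t.test(beef, dairy, paired = TRUE, alternative = "less")$p.value

cat("two-sided p =", p_two, "\n")
cat("greater p   =", p_greater, "\n")
cat("less p      =", p_less, "\n")
```

Because the t distribution is symmetric, the two-sided p-value splits evenly between the two tails, so the one-sided test in the observed direction gives half the two-sided value and the opposite direction gives one minus that half.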

In [None]:
# The t-test we were using implicitly assumes unequal variances in the two sets of samples (var.equal=FALSE).
# Was this the correct assumption to make?

var.test(slaughter_weights[slaughter_weights$TypeOfUse == 'Beef',]$std, slaughter_weights[slaughter_weights$TypeOfUse == 'Dairy',]$std)

# again with the ~ syntax
var.test(std ~ TypeOfUse, data=slaughter_weights)
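The F statistic reported by `var.test` is simply the ratio of the two sample variances, which can be confirmed on simulated data. The seed and the vectors `a` and `b` below are assumptions made for illustration and are unrelated to the slaughter data.

```r
set.seed(7)                          # fixed seed so the numbers are reproducible
a <- rnorm(50, mean = 0, sd = 1)     # hypothetical sample with sd 1
b <- rnorm(50, mean = 0, sd = 2)     # hypothetical sample with sd 2

# F statistic: ratio of the sample variances
f_manual <- var(a) / var(b)

res <- var.test(a, b)
cat("manual F =", f_manual, " built-in F =", res$statistic, "\n")
cat("numerator dof =", res$parameter[1], " denominator dof =", res$parameter[2], "\n")
```

A ratio far from 1 (with a small p-value) suggests the equal-variance assumption is not appropriate.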