# Statistical Tests III
---

Solutions are provided below.  
Each example contains easier parts at the start, plus more challenging extensions.  
The extensions are useful to understand the concepts more generally, and are likely close to the difficulty of the final parts of an exam question.

---
#### Example 1

Let's look now at some example data for average annual temperature.
 - Make a scatterplot of average temperature over time. \
   Considering the spread of data, which correlation test might be most appropriate?
 - Calculate the other type of correlation as well, and compare the significances. \
   How does that match the interpretation for linear versus monotonic relationships?
 
**Extension**
 - What happens if you change the year values slightly (by less than 1.0 years)? \
   How much do the correlations change? \
   What if you change by more than 1.0 years?
 - What happens to the correlations if you change one temperature value to be a huge outlier?
 - There is another correlation metric implemented in R called [Kendall rank correlation](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient). \
   See how the correlation coefficient for this test compares to Pearson and Spearman correlations.

---

In [None]:
# The years have been relabelled to start from 1 for simplicity.
Temperatures <- c(51.5,52.0,52.5,52.7,48.6,52.3,49.6,50.8,51.0,52.8,52.0,52.6,53.0,52.9,51.4,50.8,51.2,50.3,51.0,50.4,51.6,50.6,49.7,51.0,53.9,53.5,52.1,50.6,51.8,51.7,51.2,52.4,50.1,53.6,50.3,54.7,53.9,54.3,53.4,52.9,53.3,53.7,53.8,52.0,55.0,52.1,53.4,53.8,53.8,51.9,52.1,52.7,51.8,56.6,53.3,55.6,56.3,56.2,56.1,56.2,53.6,55.7,56.3)
Years <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63)

# Plot the data.
plot(Years, Temperatures, type = "p",col="darkred",cex=3,lwd=4)

# Since the data has no obvious outliers, we can calculate the Pearson correlation, which is the default.
cor.test(Years,Temperatures)

# We'll also calculate the Spearman correlation.
# We have "ties" in our data, where the temperatures are the same across two years, so we have to use an approximation to handle these ties rather than the exact method.
cor.test(Years,Temperatures,method='spearman',exact=FALSE)

# The data is quite linear, but has a lot of minor fluctuations that disrupt the monotonic relationship, so the Pearson correlation is larger than the Spearman correlation (but both are significant).
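
One way to see the linear versus monotonic distinction is that the Spearman coefficient is the Pearson coefficient computed on the ranks of the data. The sketch below uses made-up values (`x` and `y` are illustrative and are not the temperature data) with a perfectly monotonic but curved relationship.

```r
# Hypothetical toy data: y always increases with x, but not in a straight line.
x <- 1:10
y <- exp(x / 2)

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# Spearman is Pearson applied to the ranks of each variable.
pearson_on_ranks <- cor(rank(x), rank(y), method = "pearson")

print(c(pearson = pearson, spearman = spearman, pearson_on_ranks = pearson_on_ranks))
```

Here the Spearman coefficient is exactly 1 because the ordering is perfectly preserved, while the Pearson coefficient is below 1 because the points do not lie on a straight line.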

In [None]:
## Example 1 extension
# What happens if you change year values slightly, how much do the correlations change?

# We can add a small amount of normally distributed random noise (mean=0 and sd=0.2) to Years as Years+rnorm(length(Years),0,0.2)
Years_noise <- Years+rnorm(length(Years),0,0.2)
plot(Years_noise, Temperatures, type = "p",col="darkred",cex=3,lwd=4)

# We can recalculate the correlations.
cor.test(Years_noise,Temperatures,method="pearson")
cor.test(Years_noise,Temperatures,method="spearman",exact=FALSE)

# There is a slight change to the Pearson correlation and p-value, but the conclusion stays the same.
# As long as the noise is too small to swap the order of any two years (guaranteed if every shift is below 0.5 years, and very likely with sd=0.2), the ranks are unchanged and so is the Spearman correlation.
# Increasing the spread of the noise (try rerunning with Years+rnorm(length(Years),0,5)) will affect both correlations.

# What happens to the correlations if you change one value to be a huge outlier?
Temperatures_noise = Temperatures
Temperatures_noise[15] = 80.7

cor.test(Years_noise,Temperatures_noise,method="pearson")
cor.test(Years_noise,Temperatures_noise,method="spearman",exact=FALSE)

# Pearson correlation can be totally changed (depending on how big the outlier is), but there will be a much smaller effect for Spearman, as the rank of the datapoint matters, not the exact value.

# We can also calculate the Kendall correlation, which is based on how many pairs of points are ordered the same way in both variables.
# Although the method is very different to Pearson or Spearman, the correlation coefficient usually tells a similar story (Kendall's tau tends to be somewhat smaller in magnitude than Spearman's rho).
cor.test(Years_noise,Temperatures_noise,method='kendall',exact=FALSE)
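
To make the Kendall coefficient less of a black box, the sketch below counts concordant and discordant pairs by hand and compares the result with R's built-in value. The data are hypothetical and contain no ties, in which case the simple formula (concordant minus discordant, divided by the number of pairs) should agree with `cor(..., method = "kendall")`.

```r
# Hypothetical toy data without ties, chosen only for illustration.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 1.9, 3.5, 3.0, 5.2, 6.8)

# Every pair of observations (i < j), one pair per column.
pairs <- combn(length(x), 2)

# +1 if a pair is ordered the same way in x and y (concordant), -1 otherwise (discordant).
signs <- sign((x[pairs[1, ]] - x[pairs[2, ]]) * (y[pairs[1, ]] - y[pairs[2, ]]))

tau_manual  <- sum(signs) / ncol(pairs)
tau_builtin <- cor(x, y, method = "kendall")

print(c(manual = tau_manual, builtin = tau_builtin))
```

With these values there are 13 concordant and 2 discordant pairs out of 15, so tau is 11/15. When ties are present (as in the temperature data), R applies a tie correction and the simple formula no longer matches exactly.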

---
#### Example 2

Let's investigate the "spurious correlation of ratios" effect.
 - Generate three sets of normally distributed data, which can all have the same mean and std.
 - Make a scatterplot using two of the datasets as X versus Y and the third dataset as the colour of each point.
 - Test the Pearson correlation of X and Y, and then test the Pearson correlation of X/Z and Y/Z.

**Extension**
 - What happens if you change the standard deviation for X/Y/Z? \
   Does the correlation match the equation from the slides?

---

In [None]:
# Consider 3 random variables that are normally distributed (with identical parameters).
N_samples = 100
mean = 50

X <- rnorm(N_samples,mean,5)
Y <- rnorm(N_samples,mean,5)
Z <- rnorm(N_samples,mean,5)

In [None]:
# We can make a colour palette ("viridis"-like) that goes from yellow (low Z) to dark purple (high Z).
YlOrBrRdBu <- c("#FDE725", "#21908C", "#3B528B", "#440154")
col <- colorRampPalette(YlOrBrRdBu)(50)

# Roughly bin the Z values so we can colour each scatter point by the Z value, and then plot.
cols = col[cut(Z,50)]
plot(X, Y, col=cols,type = "p",cex=3,lwd=4)

# Let's calculate the correlation as well.
# This can be significant by chance, but it is unlikely (about a 5% chance at the 0.05 level).
cor.test(X,Y)

# Now let's divide X & Y by Z.
# This may be some normalisation procedure or similar approach.
# We are using the same colours as determined before, just now plotting X/Z versus Y/Z.
plot(X/Z, Y/Z, col=cols,type = "p",cex=3,lwd=4)

# And calculate the new correlation.
# This should be significant now, but only due to the "spurious correlation of ratios" effect.
cor.test(X/Z,Y/Z)

In [None]:
## Example 2 extension
# What happens if you alter the standard deviation for X/Y/Z?

# If the standard deviation is huge relative to the mean, the result is less reliably significant, while a lower deviation is more reliably significant.
# With a large spread, some Z values fall close to zero and produce extreme ratios (huge outliers) that skew the correlation, so this is expected.

# According to the equation from the slides, we expect a correlation of roughly 0.5 if the variances are equal.

# Now let's make the variance in X and Y much greater than in Z, and we should see a low correlation value.
X <- rnorm(N_samples,mean,20)
Y <- rnorm(N_samples,mean,15)
Z <- rnorm(N_samples,mean,2)
cor.test(X/Z,Y/Z)

# On the other hand, if we increase the variance in Z to be large, the correlation should approach 1. 
X <- rnorm(N_samples,mean,2)
Y <- rnorm(N_samples,mean,3)
Z <- rnorm(N_samples,mean,20)
cor.test(X/Z,Y/Z)
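
The slide equation is not reproduced in this notebook, so the sketch below assumes it is Pearson's first-order approximation for uncorrelated X, Y and Z, written in terms of the coefficients of variation (sd divided by mean). The helper names `spurious_r` and `simulate_r` are made up for this illustration. It compares the predicted correlation with a large simulation for two of the settings used above.

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible

# Assumed first-order approximation of cor(X/Z, Y/Z) for uncorrelated X, Y, Z.
spurious_r <- function(sd_x, sd_y, sd_z, mu) {
  cv_x <- sd_x / mu
  cv_y <- sd_y / mu
  cv_z <- sd_z / mu
  cv_z^2 / sqrt((cv_x^2 + cv_z^2) * (cv_y^2 + cv_z^2))
}

# Simulate the ratio correlation with many samples to reduce sampling noise.
simulate_r <- function(sd_x, sd_y, sd_z, mu, n = 100000) {
  X <- rnorm(n, mu, sd_x)
  Y <- rnorm(n, mu, sd_y)
  Z <- rnorm(n, mu, sd_z)
  cor(X / Z, Y / Z)
}

# Equal standard deviations: the approximation gives exactly 0.5.
predicted_equal <- spurious_r(5, 5, 5, 50)
observed_equal  <- simulate_r(5, 5, 5, 50)

# Z much less variable than X and Y: the predicted correlation is close to 0.
predicted_smallz <- spurious_r(20, 15, 2, 50)
observed_smallz  <- simulate_r(20, 15, 2, 50)

print(c(predicted_equal = predicted_equal, observed_equal = observed_equal))
print(c(predicted_smallz = predicted_smallz, observed_smallz = observed_smallz))
```

The approximation relies on the spread of Z being small compared with its mean. In the last setting above (mean 50, sd 20 for Z), some Z values land near zero, the ratios become extreme, and the simulated correlation can differ noticeably from the prediction.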

---
#### Example 3

We can also examine correlations across many sets of data at once, rather than between just a pair of observations.
 - Create a 10x10 matrix of random (exponentially distributed) variates.
 - Calculate the correlation coefficient and significance across all the pairwise combinations.
 - Plot the results using the `corrplot` package.

**Extension**
 - Verify that the correlation matrix values match calculating the correlations between some pairs of columns.
 - If we change the distribution from exponential to normal, how many correlations are "falsely" significant?
 - If we change the correlation to spearman (from pearson), would we expect to see more significant correlations with the exponential distribution?
 - Consider how many of these "random" sets of data are significantly correlated. \
   If we account for "multiple hypothesis testing", how many would we expect by chance?
   
---

In [None]:
# Install the corrplot package if necessary.
#install.packages("corrplot")

# Load the corrplot package.
library(corrplot)

In [None]:
# Make a matrix with random exponential values, calculate the (pairwise) correlation matrix.
A <- matrix(rexp(100,5),nrow=10)

# Add some names to the columns to make it look a bit nicer.
colnames(A) <- c("A","B","C","D","E","F","G","H","I","J")

# Test for correlations across the matrix.
correlations <- cor(A,method="pearson")

# Also calculate the p-values for the correlations.
p = cor.mtest(A)$p

# Plot the correlation matrix, indicate which correlations are not significant.
corrplot(correlations,type = "upper",p.mat=p,sig.level=0.05)

In [None]:
## Example 3 extension
# Verify that the correlation matrix values match calculating the correlations between each column.
# We can access each column like A[,"A"] or A[,"G"], and each element of the correlation matrix like correlations["A","G"].
cor.test(A[,"A"],A[,"G"])
correlations["A","G"]

# If we change the distribution from exponential to normal, how many correlations are "falsely" significant?
A <- matrix(rnorm(100,50,2),nrow=10)
colnames(A) <- c("A","B","C","D","E","F","G","H","I","J")
correlations <- cor(A,method="pearson")
p = cor.mtest(A)$p
corrplot(correlations,type = "upper",p.mat=p,sig.level=0.05)

# If we change the correlation to spearman (from pearson), would we expect to see more significant correlations with the exponential distribution?
A <- matrix(rexp(100),nrow=10)
colnames(A) <- c("A","B","C","D","E","F","G","H","I","J")
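correlations <- cor(A,method="spearman")
# Pass the method on to cor.mtest as well (it forwards extra arguments to cor.test), otherwise the p-values would still come from Pearson tests.
p = cor.mtest(A,method="spearman")$p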
corrplot(correlations,type = "upper",p.mat=p,sig.level=0.05)

# An exponential distribution generally has more extreme values (outliers) than a normal distribution, which could lead to either a larger or a smaller correlation.
# Since Spearman is more robust to outliers (and we don't expect any significance from a random set of values), we would expect it to have fewer significant correlations compared to Pearson.


# We conducted n(n-1)/2 tests, so for n=10, that is 45 tests.
# Given a false positive rate of 5% (significance level of 0.05), we would expect 45 x 0.05 = 2.25, so around 2 to 3 tests, to be significant by chance.

# We can calculate this from the "p" variable, removing the 10 diagonal tests (data always correlates with itself perfectly) and dividing by two (we double test A correlating with B and B correlating with A).
(sum(p<0.05)-10)/2

# Or a bit more fancy, we can consider only the "upper triangle" of the matrix, which removes the diagonal elements and the duplicate tests above/below the diagonal.
sum(p[upper.tri(p)]<0.05)
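
To account for multiple hypothesis testing more formally, the p-values can be adjusted before counting. The sketch below is self-contained (it builds its own random matrix and computes the 45 p-values with `cor.test` rather than `cor.mtest`) and applies two standard corrections available through `p.adjust`. The seed and the choice of Bonferroni and Benjamini-Hochberg are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed, only so the run is reproducible

# 10 columns of unrelated normal data, as in the extension above.
A <- matrix(rnorm(100, 50, 2), nrow = 10)

# One Pearson test per pair of columns: n(n-1)/2 = 45 tests.
idx <- combn(ncol(A), 2)
p_raw <- apply(idx, 2, function(ij) cor.test(A[, ij[1]], A[, ij[2]])$p.value)

# Expected number of false positives at the 0.05 level if nothing is truly correlated.
expected_false_positives <- 0.05 * length(p_raw)

# Adjust the p-values for the number of tests.
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_bh   <- p.adjust(p_raw, method = "BH")

print(c(raw = sum(p_raw < 0.05),
        bonferroni = sum(p_bonf < 0.05),
        BH = sum(p_bh < 0.05)))
```

After adjustment, we would usually expect no significant correlations in purely random data. The 45 tests share columns and so are not fully independent, which is worth keeping in mind when interpreting the expected count.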

---
## Real world example

This is a slightly more realistic case, where we are examining some [actual data](https://opendata.swiss/en/dataset/rinder-verteilung-pro-gemeinde).  
The goal here is to manipulate the data into a useful shape, and then assess statistical significance.  
These examples are harder than you would encounter in an exam and use more advanced data analysis techniques, but are useful to try.

This data is the number of cattle for each commune in Switzerland.
 - Filter out communes that have values of 0 for the count of cattle, as these are likely incomplete records.
 - Plot the count of cattle against the countPerSurfacekm2 (and also against countPer100Inhabitants).
 - Test for correlation between these variables.
   -  Does this match your interpretation about which types of communes (in terms of land area or population density) would have large amounts of cattle?
 - Try different levels of filtering (N=100, 1000, etc). \
   How does that impact the results/correlations? \
   In particular, notice the correlation for the threshold N=750 and how this relates to correlation transitivity.
   
---

In [None]:
# Read in the data from the csv, which uses ';' as the separator rather than ','.
# The first row of the file is also some download information rather than data, so we skip the first row.
cattle_by_communes <- read.csv('cattle-map-commune.csv',sep=';',skip=1)

# Print out the first few rows so we can see what the data looks like.
head(cattle_by_communes)

# Filter out communes that have fewer than N cattle.
N=1
cattle_by_communes <- cattle_by_communes[(cattle_by_communes$count >= N),] 

# Plot the data.
plot(countPerSurfacekm2 ~ count,data=cattle_by_communes)

# And test for Spearman correlations.
cor.test(~ countPerSurfacekm2 + count,data=cattle_by_communes,method="spearman",exact=FALSE)
cor.test(~ countPer100Inhabitants + count,data=cattle_by_communes,method="spearman",exact=FALSE)
cor.test(~ countPer100Inhabitants + countPerSurfacekm2,data=cattle_by_communes,method="spearman",exact=FALSE)
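
Correlation is not transitive in general: two variables can each correlate positively with a third while being uncorrelated with each other. The sketch below uses simulated data (not the cattle data, whose behaviour at N=750 is not reproduced here) where `Y` is built as the sum of two independent variables. It also computes the standard lower bound on the third correlation given the other two.

```r
set.seed(7)  # arbitrary seed, only so the run is reproducible

n <- 100000
X <- rnorm(n)
Z <- rnorm(n)   # independent of X
Y <- X + Z      # Y is related to both X and Z

r_xy <- cor(X, Y)
r_yz <- cor(Y, Z)
r_xz <- cor(X, Z)

# Smallest value r_xz can take given r_xy and r_yz.
# It is only guaranteed to be positive when r_xy^2 + r_yz^2 > 1.
lower_bound <- r_xy * r_yz - sqrt((1 - r_xy^2) * (1 - r_yz^2))

print(c(r_xy = r_xy, r_yz = r_yz, r_xz = r_xz, lower_bound = lower_bound))
```

Both `r_xy` and `r_yz` are around 0.7 here, yet `r_xz` is close to zero, which sits right at the lower bound for this construction. A positive correlation between the first and third variables is only forced when the other two correlations are strong enough.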

---
## Optional example 1


We discussed the datasaurus dozen in the lecture, where obviously different patterns end up with incredibly similar statistical properties.  
 - Plot out some of the different datasets and calculate the correlation coefficients. How does the coefficient compare to what you would estimate?
 - How do the values compare across different datasets?
   
---

In [None]:
# Install and load the datasauRus package, which contains the data for the "datasaurus dozen" set.
#install.packages("datasauRus")
library(datasauRus)

# "Slice" out the points for the 'dino' dataset.
# We'll plot them and then calculate the correlation coefficient.
dino <- datasaurus_dozen[datasaurus_dozen$dataset == 'dino',]
plot(y ~ x, data=dino, type = "p",col="darkgreen",cex=3,lwd=4)
cor.test(dino$x,dino$y)

# Repeat for the slant_down and slant_up datasets.
slant_down <- datasaurus_dozen[datasaurus_dozen$dataset == 'slant_down',]
plot(y ~ x, data=slant_down, type = "p",col="darkgreen",cex=3,lwd=4)
cor.test(slant_down$x,slant_down$y)

slant_up <- datasaurus_dozen[datasaurus_dozen$dataset == 'slant_up',]
plot(y ~ x, data=slant_up, type = "p",col="darkgreen",cex=3,lwd=4)
cor.test(slant_up$x,slant_up$y)
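
The same effect can be seen without installing any package by using Anscombe's quartet, a smaller and older set of four datasets that ships with base R as `anscombe`. This is a swapped-in illustration rather than part of the datasaurus dozen: the sketch computes the Pearson coefficient for each of the four x/y pairs.

```r
# The built-in anscombe data frame has columns x1..x4 and y1..y4.
r_values <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(r_values) <- paste0("set", 1:4)

print(round(r_values, 3))

# Plot the four datasets side by side to see how different they look.
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i), col = "darkgreen")
}
par(old_par)
```

All four coefficients are about 0.816, even though the scatterplots show a linear trend, a curve, a line with one outlier, and a vertical cluster with one extreme point.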

---
## Optional example 2

Consider some crop yield data from Frossard 2019, in particular the number of ears of wheat and the number of ears with grains.
 - Load in the data and plot *nr.ears.m2* against *DW.ears_with_grains.g.m2*
 - Test if the above variables are correlated using a Pearson correlation test (`cor.test()`).
   - What is the reported statistical significance?
   - What is the r2 effect size?  
 - Repeat the two tasks above but using *nr.ears.m2* and *soil.cover.morning*. \
   Do the differences in correlation make sense based on your interpretation of these variables?

**Extension**
 - What will sorting the data (like `ears <- sort(crop_data$nr.ears.m2)`) first do to the correlation coefficients and significance? \
   Is this a valid form of data preprocessing?

---

In [None]:
# Load in the data and plot it.
crop_data <- read.table('Frossard_2019.csv',header=TRUE,sep=",")
plot(DW.ears_with_grains.g.m2 ~ nr.ears.m2,data=crop_data, type = "p",col="darkgreen",cex=3,lwd=4)

# Let's calculate the correlation coefficient and p-value.
# By default, it is the Pearson method.
cor.test(crop_data$nr.ears.m2,crop_data$DW.ears_with_grains.g.m2)

# We can also use the "formula" notation.
# It is slightly different here: we write "~ Y + X" rather than the "Y ~ X" we saw before.
#cor.test(~ DW.ears_with_grains.g.m2 + nr.ears.m2,data=crop_data)

# And repeat for the *soil.cover.morning* variable
plot(soil.cover.morning ~ nr.ears.m2,data=crop_data, type = "p",col="red",cex=3,lwd=4)
cor.test(crop_data$nr.ears.m2,crop_data$soil.cover.morning)

In [None]:
## Optional example 2 extension
# What will sorting the data (like ears <- sort(crop_data$nr.ears.m2)) first do to the correlation coefficients and significance?

ears <- sort(crop_data$nr.ears.m2)
weight <- sort(crop_data$DW.ears_with_grains.g.m2)
plot(ears, weight, type = "p",col="royalblue",cex=3,lwd=4)
cor.test(ears,weight)

# Sorting the two variables independently breaks the pairing between observations, so the original relationship is lost.
# Instead, the smallest value of one variable is now paired with the smallest value of the other, and so on, which makes the plot look very correlated and gives a strong and significant (but artificial) correlation.

# Preprocessing data without careful thought can introduce or remove many important correlations, so be careful!
# Instead, if we sort row by row, preserving the paired relationship, the plot looks the same as before.
sorted_crop <- crop_data[order(crop_data$nr.ears.m2),]
plot(DW.ears_with_grains.g.m2 ~ nr.ears.m2,data=sorted_crop, type = "p",col="darkgreen",cex=3,lwd=4)
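
To show how strongly independent sorting can mislead, the sketch below uses two simulated variables that are unrelated by construction (not the crop data). Sorting each one separately manufactures a near-perfect correlation, while reordering both with the same `order()` index leaves the correlation unchanged.

```r
set.seed(3)  # arbitrary seed, only so the run is reproducible

n <- 10000
x <- rnorm(n)
y <- rnorm(n)   # generated independently of x

r_original <- cor(x, y)

# Sorting each variable on its own pairs smallest with smallest, largest with largest.
r_sorted_separately <- cor(sort(x), sort(y))

# Reordering both variables with the same index keeps each pair together.
o <- order(x)
r_sorted_together <- cor(x[o], y[o])

print(c(original = r_original,
        sorted_separately = r_sorted_separately,
        sorted_together = r_sorted_together))
```

The original correlation is close to zero, the separately sorted version is close to one, and the jointly reordered version matches the original, since the correlation does not depend on row order.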