# Statistical Tests III
---

Solutions are provided below.  
Each example contains easier parts at the start, plus more challenging extensions.  
The extensions are useful to understand the concepts more generally, and are likely close to the difficulty of the final parts of an exam question.

---

#### Warmup

We discussed the datasaurus dozen in the lecture, where obviously different patterns end up with incredibly similar statistical properties.  
 - Plot some of the different datasets and calculate the correlation coefficients. How does each coefficient compare to what you would estimate by eye?
 - How do the values compare across different datasets?
   
---

In [None]:
# Load the datasaurus package.
library(datasauRus)

# Slice out the 'dino' entries. Plot them and find the correlation coefficient.
dino <- datasaurus_dozen[datasaurus_dozen$dataset == 'dino',]
plot(y ~ x, data=dino, type = "p",col="darkgreen",cex=3,lwd=4)
result <- cor.test(dino$x,dino$y)

# print out the results in a nicer way.
cat("\nCorrelation\n-------\nr =",result$estimate,"\nr2 =",result$estimate**2,"\np =",result$p.value,"\n")

# Repeat for the slant_down and slant_up datasets.
slant_down <- datasaurus_dozen[datasaurus_dozen$dataset == 'slant_down',]
plot(y ~ x, data=slant_down, type = "p",col="darkgreen",cex=3,lwd=4)
result <- cor.test(slant_down$x,slant_down$y)
cat("\nCorrelation\n-------\nr =",result$estimate,"\nr2 =",result$estimate**2,"\np =",result$p.value,"\n")

slant_up <- datasaurus_dozen[datasaurus_dozen$dataset == 'slant_up',]
plot(y ~ x, data=slant_up, type = "p",col="darkgreen",cex=3,lwd=4)
result <- cor.test(slant_up$x,slant_up$y)
cat("\nCorrelation\n-------\nr =",result$estimate,"\nr2 =",result$estimate**2,"\np =",result$p.value,"\n")
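
---

The "similar statistics, different patterns" comparison can also be done for all datasets at once rather than one slice at a time. The sketch below is only an illustration: it uses R's built-in `anscombe` data (a stand-in, not the lecture data), reshaped into the same `dataset`/`x`/`y` layout as `datasaurus_dozen`. The name `long` is made up for this example.

```r
# Reshape the built-in 'anscombe' data into a long layout with the
# columns dataset, x, y (stand-in data, not the datasaurus dozen).
long <- data.frame(
  dataset = rep(c("I", "II", "III", "IV"), each = nrow(anscombe)),
  x = c(anscombe$x1, anscombe$x2, anscombe$x3, anscombe$x4),
  y = c(anscombe$y1, anscombe$y2, anscombe$y3, anscombe$y4)
)

# Split the rows by dataset and compute the Pearson correlation in each group.
r_by_dataset <- sapply(split(long, long$dataset), function(d) cor(d$x, d$y))
print(round(r_by_dataset, 3))
```

All four coefficients come out at roughly 0.82 even though the scatterplots look nothing alike. The same `split`/`sapply` pattern should carry over to `datasaurus_dozen`, since it has the same three columns.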

---
#### Example 1

Consider some crop yield data from Frossard 2019, in particular the number of ears of wheat and the number of ears with grains.
 - Load in the data and plot *nr.ears.m2* against *DW.ears_with_grains.g.m2*
 - Test if the above variables are correlated.
 - Repeat the two tasks above but using *nr.ears.m2* and *soil.cover.morning*. Do the differences in correlation make sense based on your expectations for these variables?

**Extension**
 - What will sorting the data (like `ears <- sort(crop_data$nr.ears.m2)`) first do to the correlation coefficients and significance?

---

In [None]:
# Load in the data and plot it.
crop_data <- read.table('Frossard_2019.csv',header=TRUE,sep=",")
plot(crop_data$nr.ears.m2,crop_data$DW.ears_with_grains.g.m2, type = "p",col="darkgreen",cex=3,lwd=4)

# Let's calculate the correlation coefficient and p-value.
# By default, cor.test uses the Pearson method.
result <- cor.test(crop_data$nr.ears.m2,crop_data$DW.ears_with_grains.g.m2)
result

# print out the results in a nicer way.
cat("\nCorrelation\n-------\nr =",result$estimate,"\nr2 =",result$estimate**2,"\np =",result$p.value,"\n")

# And repeat for the *soil.cover.morning* variable
plot(crop_data$nr.ears.m2,crop_data$soil.cover.morning, type = "p",col="pink",cex=3,lwd=4)
cor.test(crop_data$nr.ears.m2,crop_data$soil.cover.morning)

In [None]:
## Example 1 extension
# What will sorting the data (like ears <- sort(crop_data$nr.ears.m2)) first do to the correlation coefficients and significance?

ears <- sort(crop_data$nr.ears.m2)
weight <- sort(crop_data$DW.ears_with_grains.g.m2)
plot(ears, weight, type = "p",col="darkgreen",cex=3,lwd=4)
result <- cor.test(ears,weight)
cat("Correlation of independently sorted crop data\nr =",result$estimate,"\nr2 =",result$estimate**2,"\np =",result$p.value,"\n")

# Sorting each variable independently breaks the pairing between observations, so the original relationship is lost.
# Both sorted vectors now increase together, so the plot looks highly correlated, and the test reports a strong, significant (but artificial) correlation.


# Preprocessing data without careful thought can introduce or remove many important correlations, so be careful!
# Instead, if we sort row by row, preserving the relationship, the plot looks the same as before.
sorted_crop <- crop_data[order(crop_data$nr.ears.m2),]
plot(sorted_crop$nr.ears.m2, sorted_crop$DW.ears_with_grains.g.m2, type = "p",col="darkgreen",cex=3,lwd=4)

---
#### Example 2

Let's look now at the average annual temperature.
 - Make a scatterplot for temperature over time and calculate the Pearson correlation.
 - Try calculating the Spearman and Kendall correlations as well, and compare the significances. How does that match the interpretation for linear versus monotonic relationships?

**Extension**
 - If you change the year values slightly (by less than 1.0 years), how much do the correlations change? What if you change them by more than 1.0 years?
 - What happens to the correlations if you change one temperature value to be a huge outlier?

---

In [None]:
# The years have been relabelled for simplicity to start from 1.
Temperatures <- c(51.5,52.0,52.5,52.7,48.6,52.3,49.6,50.8,51.0,52.8,52.0,52.6,53.0,52.9,51.4,50.8,51.2,50.3,51.0,50.4,51.6,50.6,49.7,51.0,53.9,53.5,52.1,50.6,51.8,51.7,51.2,52.4,50.1,53.6,50.3,54.7,53.9,54.3,53.4,52.9,53.3,53.7,53.8,52.0,55.0,52.1,53.4,53.8,53.8,51.9,52.1,52.7,51.8,56.6,53.3,55.6,56.3,56.2,56.1,56.2,53.6,55.7,56.3)
Years <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63)

# Plot the data and calculate the pearson (default) correlation.
plot(Years, Temperatures, type = "p",col="darkred",cex=3,lwd=4)
result <- cor.test(Years,Temperatures)
cat("\nCorrelation of annual temperatures\nr =",result$estimate,"\nr2 =",result$estimate**2,"\np =",result$p.value,"\n")

# Let's also look at the Spearman and Kendall correlations.
# We have "ties" in our data, where the temperatures are the same across two years.
# We use an approximation to handle these ties rather than the exact method.
result <- cor.test(Years,Temperatures,method='spearman',exact=F)
cat("\nCorrelation of annual temperatures (Spearman)\nr =",result$estimate,"\nr2 =",result$estimate**2,"\np =",result$p.value,"\n")
result <- cor.test(Years,Temperatures,method='kendall',exact=F)
cat("\nCorrelation of annual temperatures (Kendall)\nr =",result$estimate,"\nr2 =",result$estimate**2,"\np =",result$p.value,"\n")

In [None]:
## Example 2 extension
# What happens if you change year values slightly, how much do the correlations change?

# We can add normally distributed random noise (mean=0 and sd=0.05) to Years as Years+rnorm(length(Years),0,.05)
Years <- Years+rnorm(length(Years),0,.05)
plot(Years, Temperatures, type = "p",col="darkred",cex=3,lwd=4)

# We can recalculate the correlation.
result <- cor.test(Years,Temperatures,method="pearson",exact=F)
cat("\nCorrelation of annual temperatures\nr =",result$estimate,"\nr2 =",result$estimate**2,"\np =",result$p.value,"\n")

# With Pearson's r there is a slight, practically negligible change to the correlation and p-value.
# Since the years don't change enough to swap ranks, Spearman's rho is completely unchanged.
# Increasing the standard deviation of the noise to, say, 2 will affect both correlations.

# What happens to the correlations if you change one value to be a huge outlier?
Temperatures[15] = 80.7

result_p <- cor.test(Years,Temperatures,method="pearson",exact=F)
cat("\nCorrelation of annual temperatures (Pearson)\nr =",result_p$estimate,"\nr2 =",result_p$estimate**2,"\np =",result_p$p.value,"\n")

result_s <- cor.test(Years,Temperatures,method="spearman",exact=F)
cat("\nCorrelation of annual temperatures (Spearman)\nr =",result_s$estimate,"\nr2 =",result_s$estimate**2,"\np =",result_s$p.value,"\n")

# The Pearson correlation can change completely (depending on how big the outlier is), but Spearman is almost unaffected, as only the rank of each value matters, not its magnitude.
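
---

Both extension points come down to ranks versus raw values, which can be checked on a small synthetic series. This is a sketch under assumed numbers (trend, noise level, seed and outlier value are all invented here); the jitter is bounded below 0.5 so that neighbouring years can never swap order.

```r
set.seed(2)  # arbitrary seed for reproducibility

# Synthetic stand-in for a slowly warming temperature record.
year <- 1:60
temp <- 50 + 0.05 * year + rnorm(60, 0, 1)

# 1) Small shifts in x: ranks are preserved, so Spearman cannot change.
rho_before <- cor(year, temp, method = "spearman")
year_jittered <- year + runif(60, -0.4, 0.4)
rho_after <- cor(year_jittered, temp, method = "spearman")

# 2) One huge outlier in y: compare the change in Pearson and Spearman.
temp_outlier <- temp
temp_outlier[15] <- 80
r_before <- cor(year, temp)
r_outlier <- cor(year, temp_outlier)
rho_outlier <- cor(year, temp_outlier, method = "spearman")

cat("Spearman before/after jitter:", rho_before, rho_after, "\n")
cat("Pearson  before/after outlier:", r_before, r_outlier, "\n")
cat("Spearman before/after outlier:", rho_before, rho_outlier, "\n")
```

The outlier moves a single point to the top rank, which shifts Spearman only a little, while its distance from the rest of the data dominates the Pearson calculation.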

---
#### Example 3

Let's investigate the "spurious correlation of ratios" effect.
 - Generate three sets of normally distributed data, which can have the same mean and std.
 - Make a scatterplot using two of the datasets as X/Y and the third dataset as the colour of each point.
 - Test the correlation of X and Y, and then check the correlation of X/Z and Y/Z.


Let's consider (randomly generated) dairy yield from two different cattle breeds.
 - Generate two sets of normally distributed data, use the same sample size and std, but a different mean.
 - Plot the distributions using histograms.
 - Use an appropriate statistical test to see if the two cattle breeds have similar milk yields.

**Extension**
 - What happens if you change the standard deviation for X/Y/Z? Does the correlation match the equation from the slides?
 - What could we change to make the results match the expectations better?

---

In [None]:
# Consider 3 random variables that are normally distributed (with identical parameters).
N_samples = 50

X <- rnorm(N_samples,100,5)
Y <- rnorm(N_samples,100,5)
Z <- rnorm(N_samples,100,5)

In [None]:
# Make a colour palette ("viridis"-like) that goes from yellow (low Z) to dark purple (high Z).
YlOrBrRdBu <- c("#FDE725", "#21908C", "#3B528B", "#440154")
col <- colorRampPalette(YlOrBrRdBu)(50)

# Roughly bin the Z values so we can colour each scatter point by the Z value, and then plot.
cols = col[cut(Z,50)]
plot(X, Y, col=cols,type = "p",cex=3,lwd=2)

# Let's calculate the correlation as well.
result <- cor.test(X,Y)
cat("Correlation of X and Y\nr =",result$estimate,"\nr2 =",result$estimate**2,"\np =",result$p.value,"\n")

In [None]:
# Now let's divide X & Y by Z.
# This may be some normalisation procedure or similar approach.
# We are using the same colours as determined in the previous panel.
plot(X/Z, Y/Z, col=cols,type = "p",cex=3,lwd=2)
result <- cor.test(X/Z,Y/Z)
cat("Correlation of X/Z and Y/Z\nr =",result$estimate,"\nr2 =",result$estimate**2,"\np =",result$p.value,"\n")

In [None]:
## Example 3 extension
# What happens if you alter the standard deviation for X/Y/Z?

# If the standard deviation is huge the result is less significant, while a lower deviation is more reliably significant.
# Huge outliers with high deviation can skew the data much more, so this is expected.

# Does it match the equation expectations from the slides?

X <- rnorm(200,50,10)
Y <- rnorm(200,50,10)
Z <- rnorm(200,50,10)
result <- cor.test(X/Z,Y/Z)
cat("Correlation of X/Z and Y/Z\nr =",result$estimate,"\nr2 =",result$estimate**2,"\np =",result$p.value,"\n")

X <- rnorm(200,50,5)
Y <- rnorm(200,50,5)
Z <- rnorm(200,50,20)
result <- cor.test(X/Z,Y/Z)
cat("Correlation of X/Z and Y/Z\nr =",result$estimate,"\nr2 =",result$estimate**2,"\np =",result$p.value,"\n")

# Yes, we can calculate this.
# In the first case, the equation predicts r=0.5, and in the second it is about 0.941


# What could we do to make the results match expectations better?

X <- rnorm(200,10000,5)
Y <- rnorm(200,10000,5)
Z <- rnorm(200,10000,20)
result <- cor.test(X/Z,Y/Z)
cat("Correlation of X/Z and Y/Z\nr =",result$estimate,"\nr2 =",result$estimate**2,"\np =",result$p.value,"\n")

# The equation shown in the lectures is a slight approximation, as there is also a dependence on the mean in the full expression.
# If we increase the mean, we can get r values much closer to 0.941.
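
---

The predicted values quoted above can be reproduced with a few lines of R. This sketch assumes the equation from the slides is Pearson's first-order approximation for uncorrelated X, Y and Z, written in terms of the coefficients of variation $V = \sigma / \mu$, namely $r \approx V_z^2 / \sqrt{(V_x^2 + V_z^2)(V_y^2 + V_z^2)}$. That form, and the helper name `predicted_ratio_cor`, are assumptions made for illustration.

```r
# Predicted correlation between X/Z and Y/Z for uncorrelated X, Y, Z,
# using the first-order approximation in the coefficients of variation.
predicted_ratio_cor <- function(mean_x, sd_x, mean_y, sd_y, mean_z, sd_z) {
  vx <- sd_x / mean_x
  vy <- sd_y / mean_y
  vz <- sd_z / mean_z
  vz^2 / sqrt((vx^2 + vz^2) * (vy^2 + vz^2))
}

# The three parameter sets used in the extension.
r_case1 <- predicted_ratio_cor(50, 10, 50, 10, 50, 10)          # 0.5
r_case2 <- predicted_ratio_cor(50, 5, 50, 5, 50, 20)            # about 0.941
r_case3 <- predicted_ratio_cor(10000, 5, 10000, 5, 10000, 20)   # about 0.941

cat("case 1:", r_case1, "\ncase 2:", r_case2, "\ncase 3:", r_case3, "\n")
```

Cases 2 and 3 give the same prediction because only the relative sizes of the coefficients of variation enter the formula. Raising the mean to 10000 makes every coefficient of variation tiny, which is where a first-order approximation works best. With a mean of 50 and a standard deviation of 20, Z occasionally lands near zero and the ratios become extreme.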

---
#### Example 4

We can also examine correlations across many sets of data at once, rather than between just a pair of observations.
 - Create a 10x10 matrix of randomly distributed variates.
 - Calculate the correlation coefficient and significance across all the pairwise combinations.
 - Plot the results using the `corrplot` package.

**Extension**
 - Verify that the correlation matrix values match calculating the correlations between some pairs of columns.
 - If we change the distribution from exponential to normal, how many correlations are "falsely" significant?
 - If we change the correlation to spearman (from pearson), would we expect to see more significant correlations with the exponential distribution?
 - Consider how many of these "random" sets of data are correlated. We will investigate this "multiple hypothesis testing" problem in detail next lecture.

---

In [None]:
# Install the corrplot package if necessary.
install.packages("corrplot")

# Load the corrplot package.
library(corrplot)

In [None]:
# Make a matrix with random exponential values, calculate the (pairwise) correlation matrix.
A <- matrix(rexp(100,5),nrow=10)

# Add some names to the columns to make it look a bit nicer.
colnames(A) <- c("Alpha","Bravo","Charlie","Delta","Echo","Foxtrot","Golf","Hotel","India","Juliett")

# Test for correlations across the matrix.
correlations <- cor(A,method="pearson")

# Also calculate the p-values for the correlations.
p = cor.mtest(A)$p

# Plot the correlation matrix, indicate which correlations are not significant.
corrplot(correlations,type = "upper",p.mat=p,sig.level=0.05)

In [None]:
## Example 4 extension

# Verify that the correlation matrix values match calculating the correlations between each column.
# We can access each column like A[,"Alpha"] or A[,"Golf"], and each element of the correlation matrix like correlations["Alpha","Golf"].
cat("Direct test\nr = ",cor.test(A[,"Alpha"],A[,"Golf"])$estimate,"\nMatrix correlation\nr =",correlations["Alpha","Golf"],"\n")

# If we change the distribution from exponential to normal, how many correlations are "falsely" significant?
A <- matrix(rnorm(100,50,2),nrow=10)
colnames(A) <- c("Alpha","Bravo","Charlie","Delta","Echo","Foxtrot","Golf","Hotel","India","Juliett")
correlations <- cor(A,method="pearson")
p = cor.mtest(A)$p
corrplot(correlations,type = "upper",p.mat=p,sig.level=0.05)


# If we change the correlation to spearman (from pearson), would we expect to see more significant correlations with the exponential distribution?
A <- matrix(rexp(100),nrow=10)
colnames(A) <- c("Alpha","Bravo","Charlie","Delta","Echo","Foxtrot","Golf","Hotel","India","Juliett")
correlations <- cor(A,method="spearman")
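# Pass the method through to cor.mtest so the p-values match the Spearman coefficients (the default is Pearson).
p = cor.mtest(A,method="spearman")$p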
corrplot(correlations,type = "upper",p.mat=p,sig.level=0.05)

# An exponential distribution generally has more extreme values (outliers) than a normal distribution, which can either inflate or deflate the Pearson correlation.
# Since Spearman is more robust to outliers (and we don't expect any significance from a random set of values), we would expect it to have fewer significant correlations compared to Pearson.
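
---

To put a number on how many "falsely" significant correlations to expect, the pairwise tests can be counted directly and repeated over many random matrices. This is a rough sketch using base R only; the helper name `count_significant`, the seed and the 200 repetitions are arbitrary choices for illustration.

```r
set.seed(3)  # arbitrary seed for reproducibility

# Count how many of the pairwise Pearson tests between columns have p < alpha.
count_significant <- function(A, alpha = 0.05) {
  pairs <- combn(ncol(A), 2)  # each column holds one pair of column indices
  p_values <- apply(pairs, 2, function(ij) {
    cor.test(A[, ij[1]], A[, ij[2]])$p.value
  })
  sum(p_values < alpha)
}

# A 10x10 matrix has choose(10, 2) distinct pairs of columns.
n_pairs <- choose(10, 2)

# Repeat the experiment on many independent random normal matrices.
counts <- replicate(200, count_significant(matrix(rnorm(100, 50, 2), nrow = 10)))

cat("Pairs tested per matrix:", n_pairs, "\n")
cat("Mean number significant:", mean(counts),
    "(expected about", 0.05 * n_pairs, ")\n")
```

With 45 tests at the 5% level, around two significant correlations per matrix are expected purely by chance, which is the multiple hypothesis testing problem mentioned in the exercise.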

---
## Real world example

This is a slightly more realistic case, where we are examining some [actual data](https://opendata.swiss/en/dataset/rinder-verteilung-pro-gemeinde).  
The goal here is to manipulate the data into a useful shape, and then assess statistical significance.  
These examples are harder than you would encounter in an exam and use more advanced data analysis techniques, but are useful to try.

This data is the number of cattle for each commune in Switzerland.
 - Filter out communes that have values of 0 for the count of cattle, as these are likely incomplete records.
 - Plot the count of cattle against the countPerSurfacekm2 (and also against countPer100Inhabitants).
 - Test for correlation between these variables.
   -  See if you can make an interpretation about which types of communes (in terms of land area or population density) have large numbers of cattle.
 - How does changing the filtering in the first step (filtering out communes with fewer than 100/500/1000/etc. cattle) change these results?

---

In [None]:
# Read in the data from the csv, which uses ';' as the separator rather than ','.
# The first row of the file is also some download information rather than data, so we skip the first row.
cattle_by_communes <- read.csv('cattle-map-commune.csv',sep=';',skip=1)

# Filter out communes that have fewer than N cattle.
# Try changing the value of N and seeing how the correlation analyses change.
# What does that imply about the data?

N=1
cattle_by_communes <- cattle_by_communes[(cattle_by_communes$count >= N),] 

# Plot the data.
plot(countPerSurfacekm2 ~ count,data=cattle_by_communes)

# And test for correlation.
cor.test(cattle_by_communes$count,cattle_by_communes$countPerSurfacekm2,method="spearman",exact=FALSE)
cor.test(cattle_by_communes$count,cattle_by_communes$countPer100Inhabitants,method="spearman",exact=FALSE)

# We can use the ~ syntax as well.
cor.test(~ countPer100Inhabitants + count, data=cattle_by_communes,method="spearman",exact=FALSE)