## Correlation

The goal of this exercise is to perform a correlation analysis. We will use dataset 2, analysing patient Weight with respect to the blood LDL level.


Let's take a closer look at how body weight depends on the other variables in our patient dataset 2.

In [1]:
require(testthat, quietly = TRUE)

# Load dataset 2 ("./data/DATA_SET_REFERENCE_2.csv")
# and make sure the quality control is performed.

ds2 <- read.csv("./data/DATA_SET_REFERENCE_2.csv", row.names = 1)
summary(ds2)

ds2 <- ds2[complete.cases(ds2),]

   Hours_sun           Age             Weight        Sugar_blood    
 Min.   : 0.000   Min.   : 16.00   Min.   : 12.00   Min.   : 50.00  
 1st Qu.: 5.000   1st Qu.: 33.00   1st Qu.: 64.00   1st Qu.: 71.78  
 Median : 7.000   Median : 49.00   Median : 81.00   Median : 95.25  
 Mean   : 6.857   Mean   : 49.35   Mean   : 80.38   Mean   : 95.43  
 3rd Qu.: 9.000   3rd Qu.: 66.00   3rd Qu.: 95.00   3rd Qu.:119.80  
 Max.   :23.000   Max.   :105.00   Max.   :300.00   Max.   :140.90  
                                   NA's   :1                        
      LDL        Color_house  Sleep_hours    Hospital_times   Minutes_Reading 
 Min.   : 21.0   Blue : 24   Min.   :6.033   Min.   : 0.000   Min.   :-36.53  
 1st Qu.:116.8   Brown:427   1st Qu.:7.033   1st Qu.: 1.000   1st Qu.:108.79  
 Median :125.4   Red  :809   Median :8.000   Median : 2.000   Median :135.31  
 Mean   :124.8               Mean   :7.854   Mean   : 1.946   Mean   :135.49  
 3rd Qu.:132.5               3rd Qu.:9.033   3rd Qu.:
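
The quality-control step above relies on `complete.cases()` to drop rows with missing values (the summary reports one `NA` in `Weight`). As a minimal sketch of what that step does, here is the same idea on a small made-up data frame; `df` and its values are purely illustrative and are not taken from dataset 2.

```r
# Toy data frame with missing values (illustrative only, not dataset 2)
df <- data.frame(
  Weight = c(70, NA, 85, 92),
  LDL    = c(120, 130, NA, 110)
)

# Count the missing values in each column
na_per_column <- colSums(is.na(df))
print(na_per_column)

# Keep only the rows without any missing value
df_clean <- df[complete.cases(df), ]

# Compare the number of rows before and after the filter
print(nrow(df))
print(nrow(df_clean))
```

Checking the row count before and after the filter is a cheap way to confirm that only the expected number of rows was removed.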

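A scatterplot suggests a relationship between our variables Weight and LDL. How do we quantify such a relationship? Let's calculate the covariance in R. The formula below is the population covariance (denominator $N$); R's `cov()` uses the sample version with denominator $N - 1$, which gives nearly the same value for large $N$: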

$ cov(x,y) = \frac{\sum(x_i - \mu_x) (y_i - \mu_y)}{ N }$
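
To see how the formula relates to R's built-in `cov()`, here is a minimal sketch on two made-up vectors (`x` and `y` are illustrative, not taken from dataset 2). It computes the covariance once with denominator N and once with N - 1, and checks which of the two `cov()` matches.

```r
# Toy vectors, purely illustrative (not from dataset 2)
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)
n <- length(x)

# Sum of the products of the deviations from the means
cross_products <- sum((x - mean(x)) * (y - mean(y)))

# Population covariance: divide by N, as in the formula
cov_pop <- cross_products / n

# Sample covariance: divide by N - 1
cov_sample <- cross_products / (n - 1)

print(cov_pop)
print(cov_sample)

# cov() agrees with the N - 1 version
print(all.equal(cov_sample, cov(x, y)))
```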

In [2]:
# Calculate and return the covariance between the variables "Weight" and "LDL". Round the obtained value
# to three decimal places and assign it to the requested variable.

# Weight_LDL_cov <- 

# your code here
Weight_LDL_cov <- round(cov(ds2$Weight, ds2$LDL), 3)


# check the obtained value:
print(Weight_LDL_cov)

[1] 163.769


In [3]:
test_that("The covariance variable type needs to be 'numeric'", {
    expect_equal(class(Weight_LDL_cov), 'numeric')
})


Test passed 🌈


Now let's calculate the correlation instead:

$ cor(x,y) = \rho_{x,y} = \frac{cov_{x,y}}{ \sigma_x  \sigma_y } $
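
The formula can be checked directly in R. The sketch below uses two made-up vectors (`x` and `y` are illustrative, not taken from dataset 2) and divides their covariance by the product of their standard deviations. Because `cov()` and `sd()` both use the N - 1 denominator, that factor cancels and the result should agree with `cor()`.

```r
# Toy vectors, purely illustrative (not from dataset 2)
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# Pearson correlation: covariance divided by the product of the standard deviations
cor_manual <- cov(x, y) / (sd(x) * sd(y))

print(cor_manual)

# Compare with the built-in function
print(all.equal(cor_manual, cor(x, y)))
```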

In [4]:
# Now calculate and return the Pearson correlation to see for ourselves whether the correlation is
# independent of the scale of the data. Again, round the correlation value to three decimal places:

# Weight_LDL_cor <-

# your code here
Weight_LDL_cor <- round(cor(ds2$Weight, ds2$LDL), 3)

# First let's make sure the variable has the correct type:
print(class(Weight_LDL_cor))

# and then check the actual value
print(Weight_LDL_cor)

[1] "numeric"
[1] 0.822
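
The covariance obtained earlier (163.769) depends on the units of both variables, whereas the correlation (0.822) does not. A minimal sketch of that difference, using made-up values (`weight_kg` and `ldl` are hypothetical and not taken from dataset 2): converting the weights from kilograms to grams multiplies the covariance by 1000 but leaves the correlation unchanged.

```r
# Toy values, purely illustrative (not from dataset 2)
weight_kg <- c(60, 72, 81, 95, 110)
ldl       <- c(100, 118, 121, 135, 140)

# The same weights expressed in grams
weight_g <- weight_kg * 1000

# The covariance scales with the unit of measurement
cov_kg <- cov(weight_kg, ldl)
cov_g  <- cov(weight_g, ldl)
print(cov_g / cov_kg)

# The correlation is unaffected by the rescaling
cor_kg <- cor(weight_kg, ldl)
cor_g  <- cor(weight_g, ldl)
print(all.equal(cor_kg, cor_g))
```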


In [5]:
test_that("The correlation variable type needs to be 'numeric'", {
    expect_equal(class(Weight_LDL_cor), 'numeric')
})

test_that("The correlation 'Weight_LDL_cor' must be in the range [-1,1]", {
    expect_true(Weight_LDL_cor >= -1 & Weight_LDL_cor <= 1)
})

Test passed 🥳
Test passed 🎊


In [6]:
# Calculate the p-value for the correlation between Weight and LDL, allowing either 
# positive or negative correlation as alternative hypothesis:

# weight.ldl.test.pval <- 

# your code here
weight.ldl.test <- cor.test(ds2$Weight, ds2$LDL, alternative = "two.sided")
weight.ldl.test.pval <- weight.ldl.test$p.value


# Let's check the type of the obtained value; remember, we are looking for the p-value of the correlation:
print(class(weight.ldl.test.pval))

# And now let's check the value itself:
print(weight.ldl.test.pval)

[1] "numeric"
[1] 1.904958e-309
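
For Pearson's correlation, `cor.test()` bases its p-value on a t statistic with n - 2 degrees of freedom. The sketch below reproduces that calculation by hand on two made-up vectors (`x` and `y` are illustrative, not taken from dataset 2) and compares it with the output of `cor.test()`.

```r
# Toy vectors, purely illustrative (not from dataset 2)
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)
n <- length(x)

# Pearson correlation coefficient
r <- cor(x, y)

# t statistic for the null hypothesis of zero correlation
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_manual <- 2 * pt(-abs(t_stat), df = n - 2)

# Same test with the built-in function
res <- cor.test(x, y, alternative = "two.sided")

print(all.equal(t_stat, unname(res$statistic)))
print(all.equal(p_manual, res$p.value))
```

With a correlation as strong as the one in the exercise and a large sample, the t statistic becomes very large, which is why the printed p-value is so close to zero.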


In [7]:
test_that("The correlation p-value needs to be 'numeric'", {
    expect_equal(class(weight.ldl.test.pval), 'numeric')
})

test_that("The correlation p-value 'weight.ldl.test.pval' must be in the range [0,1]", {
    expect_true(weight.ldl.test.pval >= 0 & weight.ldl.test.pval <= 1)
})


Test passed 😀
Test passed 🎉


In [8]:
# Now calculate and return the p-value for the alternative hypothesis of a positive correlation between Weight and LDL

# weight.ldl.pos.cor.pval <- 


# your code here
weight.ldl.pos.cor.test <- cor.test(ds2$Weight, ds2$LDL, alternative = "greater")
weight.ldl.pos.cor.pval <- weight.ldl.pos.cor.test$p.value


print(class(weight.ldl.pos.cor.pval))
print(weight.ldl.pos.cor.pval)

[1] "numeric"
[1] 9.52479e-310


In [9]:
test_that("The correlation p-value 'weight.ldl.pos.cor.pval' needs to be 'numeric'", {
    expect_equal(class(weight.ldl.pos.cor.pval), 'numeric')
})

test_that("The correlation p-value 'weight.ldl.pos.cor.pval' must be in the range [0,1]", {
    expect_true(weight.ldl.pos.cor.pval >= 0 & weight.ldl.pos.cor.pval <= 1)
})


Test passed 🥇
Test passed 🥳


In [10]:
# And finally, calculate the p-value for the alternative hypothesis of a negative correlation between Weight and LDL

# weight.ldl.neg.cor.pval <- 


# your code here
weight.ldl.neg.cor.test <- cor.test(ds2$Weight, ds2$LDL, alternative = "less")
weight.ldl.neg.cor.pval <- weight.ldl.neg.cor.test$p.value


print(class(weight.ldl.neg.cor.pval))
print(weight.ldl.neg.cor.pval)

[1] "numeric"
[1] 1
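
The three p-values obtained above are related: the two-sided value (1.904958e-309) is twice the one-sided value in the direction of the observed correlation (9.52479e-310), and the one-sided value in the opposite direction is essentially 1. A minimal sketch of that relationship on two made-up vectors (`x` and `y` are illustrative, not taken from dataset 2):

```r
# Toy vectors, purely illustrative (not from dataset 2)
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# The same Pearson test with the three possible alternative hypotheses
p_two     <- cor.test(x, y, alternative = "two.sided")$p.value
p_greater <- cor.test(x, y, alternative = "greater")$p.value
p_less    <- cor.test(x, y, alternative = "less")$p.value

# The two one-sided p-values are complementary
print(all.equal(p_greater + p_less, 1))

# The two-sided p-value is twice the smaller one-sided p-value
print(all.equal(p_two, 2 * min(p_greater, p_less)))
```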


In [11]:
test_that("The correlation p-value 'weight.ldl.neg.cor.pval' needs to be 'numeric'", {
    expect_equal(class(weight.ldl.neg.cor.pval), 'numeric')
})

test_that("The correlation p-value 'weight.ldl.neg.cor.pval' must be in the range [0,1]", {
    expect_true(weight.ldl.neg.cor.pval >= 0 & weight.ldl.neg.cor.pval <= 1)
})


Test passed 🥇
Test passed 🎊
