# Significance Tests

A significance test answers the question: how probable is it that a sample like the one observed would be drawn by chance from a given population? If the probability is low, we can conclude that the sample is unlikely to belong to that population. However, a high probability does not mean the sample comes from that population; we have merely failed to show that it doesn't. We can never prove that it does.
By convention the following probability levels are used:

* \> 5% - the likelihood that the sample could have been drawn by chance from the population is high, so we can't say anything about the relationship between the two
* 1% < x < 5% - the sample probably doesn't belong to the population
* < 1% - the sample very likely belongs to another population

**Based on "Practical Statistics - Simply Explained" by Russell Langley**
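
The conventional levels listed above can be expressed as a small helper. This is only a sketch: the function name `interpret_p` and the returned labels are hypothetical, and assigning the exact boundaries (1% and 5%) to the middle band is an assumption, since the list leaves them open.

```r
# Hypothetical helper: map a p-value given in percent to the conventional bands.
# Assumption: the boundary values 1 and 5 fall into the middle band.
interpret_p <- function(p_percent) {
    if (p_percent > 5) {
        "not significant"        # nothing can be said about sample vs. population
    } else if (p_percent >= 1) {
        "probably significant"   # the sample probably doesn't belong to the population
    } else {
        "significant"            # the sample very likely belongs to another population
    }
}

interpret_p(10.96)   # above 5%
interpret_p(3.5)     # between 1% and 5%
interpret_p(0.0063)  # below 1%
```

The helper takes percentages because the `zm` function in this notebook reports its result in percent; p-values returned by `t.test` or `wilcox.test` would have to be multiplied by 100 first.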

## zM Test (1733)
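
The statistic computed by the `zm` function in this section can be written as the equation below. The symbols are chosen here for illustration: $\bar{x}$ is the sample mean, $n$ the sample size, and $\mu$ and $\sigma$ the population mean and standard deviation.

\begin{equation*}
z = \frac{|\bar{x} - \mu| \sqrt{n}}{\sigma}
\end{equation*}

For a single measurement of 1072 against a population with mean 1060 and standard deviation 3, this gives $z = |1072 - 1060| \cdot \sqrt{1} / 3 = 4$. The probability is then read from the upper tail of the standard normal distribution and doubled for a two-sided test.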

In [1]:
zm <- function(sample, population_mean, population_sd, sides=2) {
    result <- abs(mean(sample) - population_mean) * sqrt(length(sample)) / population_sd
    return(100 * pnorm(result, lower.tail=FALSE) * sides)
}
# note the use of 2-sided probabilities by default because we ask whether there is a
# difference in result, not whether the difference is smaller or greater 
# (we use absolute value calculation) 

In [2]:
# Problem 1: 
# Gold melts at 1060 degrees; assume an sd of 3 degrees obtained by running many melting experiments.
# An unknown metal was found to melt at 1072 degrees. 
# What is the probability that the metal is gold ?

cat(sprintf('Prob(metal=gold)=%.6f%%', zm(1072, 1060, 3)))

Prob(metal=gold)=0.006334%

In [3]:
# Problem 2:
# 40 patients with a certain disease have a mean pulse rate of 73
# Mean pulse rate for healthy people = 70 with an sd = 5
# How likely is it that the increased pulse rate is just random variation in a set of healthy people
# rather than an effect of the illness ?

cat(sprintf('Prob(random variation among healthy people)=%.6f%%', zm(rep(73, 40), 70, 5)))

# The difference in pulse rate is statistically significant but not practically significant (only 3 extra beats per minute).

Prob(random variation among healthy people)=0.014780%

In [4]:
# Problem 3:
# Copper melts at 1080 degrees with an sd = 5.
# How likely is it that the metal from Problem 1 is copper ?

cat(sprintf('Prob(metal=copper)=%.6f%%\n', zm(1072, 1080, 5)))

# and what if we repeat the measurement, obtaining 4 melting temperatures of the unknown metal ?
cat(sprintf('Prob(metal=copper)=%.6f%%', zm(c(1072, 1071, 1072, 1073), 1080, 5)))

Prob(metal=copper)=10.959858%
Prob(metal=copper)=0.137428%

## Student's _t_ Test (1908)
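
As a rough sketch of what `t.test` does for a single sample, the statistic can be computed by hand and compared with the built-in result. The function name `t_one_sample` is hypothetical and only covers the two-sided, one-sample case used in this section.

```r
# Sketch: one-sample t statistic computed by hand (hypothetical helper).
t_one_sample <- function(sample, population_mean) {
    n <- length(sample)
    # Same form as the zM statistic, but with the standard deviation of the sample.
    t_stat <- (mean(sample) - population_mean) * sqrt(n) / sd(sample)
    # Two-sided p-value from the t distribution with n - 1 degrees of freedom.
    p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)
    list(statistic = t_stat, p.value = p_value)
}

manual <- t_one_sample(c(46, 47, 48), 45)
builtin <- t.test(c(46, 47, 48), mu = 45)

# The hand-computed p-value should agree with the one reported by t.test.
all.equal(manual$p.value, builtin$p.value)
```

Using `pt` with `n - 1` degrees of freedom is what makes the result depend on the sample size: for small samples the same statistic yields a larger p-value than `pnorm` would.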

In [5]:
# same formula as for the zM test with 2 exceptions:
# 1. The standard deviation of the sample is used (the sd of the population is unknown)
# 2. The p-value is read from the t distribution, which depends on the sample size (n - 1 degrees
# of freedom): the smaller the sample, the larger the statistic needed to reach 
# the same p-value

# Problem 1:
# A printing press can make 45 copies per minute
# After an adjustment to increase its throughput we obtain: 46, 47, 48 copies per minute.
# How likely is it that the change is due to random variation rather than the effect of the adjustment ?

result <- t.test(x=c(46, 47, 48), mu=45, conf.level=0.95);
cat(
    sprintf(
        'Possible values for the mean of a 3-item sample with confidence interval %s %%: [%s]\np-value for current sample: %s\n', 
        attr(result$conf.int, 'conf.level'),
        toString(result$conf.int),
        result$p.value
    )
)

Possible values for the mean of a 3-item sample with confidence interval 0.95 %: [44.5158622882497, 49.4841377117503]
p-value for current sample: 0.0741799002274485


In [6]:
# Problem 2:
# If the next 2 trials in Problem 1 yield 47 and 47, how does the p-value change ?
result <- t.test(x=c(46, 47, 48, 47, 47), mu=45, conf.level=0.95);
cat(
    sprintf(
        'Possible values for the mean for the 5-item sample with confidence interval %s %%: [%s]\np-value for current sample: %s\n', 
        attr(result$conf.int, 'conf.level'),
        toString(result$conf.int),
        result$p.value
    )
)

# note that the second confidence interval does not contain the original mean of 45 -> the variation is very likely
# due to the improvement rather than chance.

Possible values for the mean for the 5-item sample with confidence interval 0.95 %: [46.1220109669149, 47.8779890330851]
p-value for current sample: 0.00319820215233531


In [7]:
# Problem 3:
# Between 1945 and 1962 the telephone bill of a CEO averaged $48 per year. 
# With the new secretary the values were: $56, $51, $63, $60
# Should the secretary be fired for over-using the phone ?
result <- t.test(x=c(56, 51, 63, 60), mu=48, conf.level=0.95);
cat(
    sprintf(
        'Possible values for the mean for the 4-item sample with confidence interval %s %%: [%s]\np-value for current sample: %s\n', 
        attr(result$conf.int, 'conf.level'),
        toString(result$conf.int),
        result$p.value
    )
)
# Probably yes, unless the phone service became more expensive on its own.

Possible values for the mean for the 4-item sample with confidence interval 0.95 %: [49.2317619603331, 65.7682380396669]
p-value for current sample: 0.0353301023027491


In [8]:
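# Problem 4:
# The average income of a population is $20560
# A random sample of 49 people from a neighbouring region has an average income of $18505
# Is the difference statistically significant ?

# Not enough info -> the standard deviation of the sample is not known.
# Two synthetic samples with mean 18505 (seq() yields 50 values each) show how the answer depends on the spread.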

sample1 <- seq(from=18505-49, to=18505+49, by=2)
result <- t.test(x=sample1, mu=20560, conf.level=0.95);
cat(
    sprintf(
        'Incomes are almost the same [%s, %s] (sd=%s): confidence interval %s %%: [%s]\np-value for current sample: %s\n', 
        min(sample1), max(sample1),
        sd(sample1),
        attr(result$conf.int, 'conf.level'),
        toString(result$conf.int),
        result$p.value
    )
)

sample2 <- seq(from=18505-49*300, to=18505+49*300, by=600)
result <- t.test(x=sample2, mu=20560, conf.level=0.95);
cat(
    sprintf(
        'Incomes vary widely [%s, %s] (sd=%s): confidence interval %s %%: [%s]\np-value for current sample: %s\n', 
        min(sample2), max(sample2),
        sd(sample2),
        attr(result$conf.int, 'conf.level'),
        toString(result$conf.int),
        result$p.value
    )
)

Incomes are almost the same [18456, 18554] (sd=29.1547594742265): confidence interval 0.95 %: [18496.7143090347, 18513.2856909653]
p-value for current sample: 1.90805983449625e-92
Incomes vary widely [3805, 33205] (sd=8746.42784226795): confidence interval 0.95 %: [16019.2927104071, 20990.7072895929]
p-value for current sample: 0.103025307777696


## Wilcoxon's Sum of Ranks Test (1945)

The test compares 2 samples of measurements under the assumption that they come from 2 populations **with equal means** (or from the same population).
Instead of using the measured values directly, we pool the measurements, rank them in order and, from the smallest sum of ranks, determine the probability of obtaining that sum by chance.

Instead of testing whether the parent groups have the same mean, we can test whether the sample groups have the **same standard deviation**. For that we need to rank differently, using Siegel and Tukey's "paired alternation" ranking:

* smallest measurement - rank 1
* largest and second largest measurements - ranks 2 and 3
* second and third smallest measurements - ranks 4 and 5
* ...

This way the extreme values get the smaller ranks. To increase the accuracy of the Siegel-Tukey test,
first make the means of the samples equal (e.g. add the difference between the means to every measurement in the sample with the smaller mean).
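
The paired alternation ranking described above can be sketched as follows. The function name `siegel_tukey_ranks` is hypothetical, the input is assumed to be the pooled measurements of both samples, and ties are not treated specially (an assumption; a full implementation would average the ranks of tied values).

```r
# Sketch of Siegel-Tukey "paired alternation" ranks for a vector of pooled measurements.
# Assumption: no special handling of ties.
siegel_tukey_ranks <- function(values) {
    n <- length(values)
    ranks_sorted <- numeric(n)  # ranks in sorted order of the values
    lo <- 1                     # next unranked position at the low end
    hi <- n                     # next unranked position at the high end
    next_rank <- 1
    from_low <- TRUE            # which end receives the next rank(s)
    take <- 1                   # the first step ranks one value, later steps rank two
    while (lo <= hi) {
        for (i in seq_len(take)) {
            if (lo > hi) break
            if (from_low) {
                ranks_sorted[lo] <- next_rank
                lo <- lo + 1
            } else {
                ranks_sorted[hi] <- next_rank
                hi <- hi - 1
            }
            next_rank <- next_rank + 1
        }
        from_low <- !from_low
        take <- 2
    }
    # Map the ranks back to the original order of `values`.
    ranks <- numeric(n)
    ranks[order(values)] <- ranks_sorted
    ranks
}

# Illustrative data: 10 gets rank 1, 70 and 60 get ranks 2 and 3, 20 and 30 get ranks 4 and 5, ...
siegel_tukey_ranks(c(30, 10, 70, 50, 20, 60, 40))
```

Once the ranks are assigned, the sums of ranks per sample are compared exactly as in the ordinary Sum of Ranks Test.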

**The test can be applied to measurements from any probability distribution.**

The procedure depends on the sample sizes:
1. The samples have different numbers of measurements and both have fewer than 20 -> rank in both directions (smallest to largest and largest to smallest) and select the smallest of the 4 resulting sums of ranks.
2. The samples have the same number of measurements, or at least one sample has 20 or more measurements -> apply the formula below:

\begin{equation*}
z = \frac{n_R(n_A + n_B + 1) - 2R}{\sqrt{\frac{n_A n_B (n_A + n_B + 1)}{3}}}
\end{equation*}

* $n_A$ / $n_B$ - number of measurements in sample A / B
* $n_R$ - number of measurements in the sample with smallest sum of ranks
* $R$ - smallest sum of ranks
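
A minimal sketch of the formula above, assuming two numeric vectors as input. The function name `rank_sum_z` and the sample data are hypothetical; the two-sided p-value is taken from the normal distribution.

```r
# Sketch: z statistic of the Sum of Ranks Test (hypothetical helper).
rank_sum_z <- function(a, b) {
    ranks <- rank(c(a, b))                      # pooled ranking, ties get the average rank
    n_a <- length(a)
    n_b <- length(b)
    sum_a <- sum(ranks[seq_len(n_a)])
    sum_b <- sum(ranks[n_a + seq_len(n_b)])
    # R is the smallest sum of ranks, n_r the size of the sample that produced it.
    if (sum_a <= sum_b) {
        R <- sum_a; n_r <- n_a
    } else {
        R <- sum_b; n_r <- n_b
    }
    z <- (n_r * (n_a + n_b + 1) - 2 * R) / sqrt(n_a * n_b * (n_a + n_b + 1) / 3)
    list(R = R, z = z, p.value = 2 * pnorm(abs(z), lower.tail = FALSE))
}

# Illustrative data with equal sample sizes and no ties.
a <- c(1, 2, 3, 4)
b <- c(5, 6, 7, 8)
res <- rank_sum_z(a, b)
res$R  # smallest sum of ranks
res$z

# Without ties and without continuity correction, wilcox.test's normal approximation
# is expected to give the same p-value.
wilcox.test(a, b, exact = FALSE, correct = FALSE)$p.value
```

By default `wilcox.test` applies a continuity correction (`correct = TRUE`), so its p-values in the problems below are slightly larger than those of the plain formula.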

In [9]:
# Problem 1:
# Examine whether there is a significant difference in the recovery times of 2 groups of patients
# on which a specific operation to remove the gallbladder is performed.

group1 <- c(16,20,25,19,22,15,22,19)
group2 <- c(18,19,15,16,21,17,17,14)

# exact=FALSE turns off the warning issued when measurements with equal values (ties) are present;
# in general ties make the test less accurate.
# conf.int is a logical flag; the confidence level itself is set with conf.level.
w <- wilcox.test(group1, group2, exact=FALSE, conf.int=TRUE, conf.level=0.95)

cat(
    sprintf(
        'There is no significant difference in operations on patient recovery: %.4f,\nmean1=%.4f, mean2=%.4f', 
        w$p.value, 
        mean(group1),
        mean(group2)
    )
)


There is no significant difference in operations on patient recovery: 0.1015,
mean1=19.7500, mean2=17.1250

In [10]:
# Problem 2:
# A man recorded the number of advertisements aired by 2 radio stations. Is there sufficient evidence that
# one station airs more ads than the other ?
station1 <- c(341, 326, 360, 305, 326)
station2 <- c(352, 382, 347)

w <- wilcox.test(station1, station2, exact=FALSE)
cat(
    sprintf(
        'There is no significant difference in number of ads between the stations: %.4f,\nmean1=%.4f, mean2=%.4f', 
        w$p.value, 
        mean(station1),
        mean(station2)
    )
)

cat(sprintf('\n\n'))

t <- t.test(x=station1, y=station2, mu=0)  # mu is the difference of means -> i.e. no difference

cat(
    sprintf(
        'There is no significant difference in number of ads between the stations: %.4f,\nmean1=%.4f, mean2=%.4f', 
        t$p.value, 
        mean(station1),
        mean(station2)
    )
)

cat(sprintf('\nThe value of the t-test is more precise but does not change the overall conclusion.'))


There is no significant difference in number of ads between the stations: 0.1337,
mean1=331.6000, mean2=360.3333

There is no significant difference in number of ads between the stations: 0.1041,
mean1=331.6000, mean2=360.3333
The value of the t-test is more precise but does not change the overall conclusion.

In [11]:
# Perform the Siegel-Tukey test on data in Problem 2 to determine whether there is a significant
# difference in dispersions of the samples.

if(!require(jmuOutlier)) {
    install.packages("jmuOutlier"); require(jmuOutlier)
}

siegel_tukey <- function(x, y){
    # Equalize the means first by shifting the sample with the smaller mean upwards
    m1 <- mean(x)
    m2 <- mean(y)
    if (m1 < m2){
        x <- x + (m2 - m1)
    }
    else {
        y <- y + (m1 - m2)
    }
    if (length(x) != length(y) && length(x) < 20 && length(y) < 20) {
        s1 <- siegel.test(x, y, reverse=FALSE)
        s2 <- siegel.test(x, y, reverse=TRUE)
        if (min(c(sum(s1$rank.x), sum(s1$rank.y))) < min(c(sum(s2$rank.x), sum(s2$rank.y)))){
            s <- s1
        }
        else {
            s <- s2
        }
    }
    else {
        s <- siegel.test(x, y)
    }
    return(s)
}

s <- siegel_tukey(station1, station2)

cat(
    sprintf(
        'Probability that the 2 stations are consistent in playing ads: %.4f\nsd1=%.4f, sd2=%.4f',
        s$p.value,
        sd(station1),
        sd(station2)
    )
)

Loading required package: jmuOutlier


Probability that the 2 stations are consistent in playing ads: 0.6964
sd1=20.4034, sd2=18.9297

In [12]:
# Problem 3
# The first sample consists of the number of passengers travelling each week on a certain route.
# The second sample represents the same quantity for another type of aeroplane (with the same number of seats
# as the previous model). Is the second model more popular than the first ?
sample1 <- c(3204, 2967, 3053, 3267, 3370, 3492, 3105, 3330)
sample2 <- c(3568, 3299, 3618, 3494)
w <- wilcox.test(sample1, sample2, exact=FALSE)
cat(
    sprintf(
        'The probability of no significant difference between the aeroplane models is: %.4f (W=%.2f),\nmean1=%.4f, mean2=%.4f', 
        w$p.value,
        w$statistic,
        mean(sample1),
        mean(sample2)
    )
)

The probability of no significant difference between the aeroplane models is: 0.0338 (W=3.00),
mean1=3223.5000, mean2=3494.7500

In [13]:
# Problem 4:
# The problem is to determine whether domestic cats are better fed than stray cats. 
# There are 30 weight measurements of domestic cats and 15 of stray cats. The smallest rank total is 234.
# Is there a significant difference in cat weights ?

# Solution
# We apply the formula for the case when one sample has 20 or more measurements.
# nr is taken as the smaller sample size, i.e. we assume the smallest rank total belongs to the smaller
# sample (here the 15 stray cats). The returned p-value is one-tailed.
wilcox2.test <- function(na, nb, R) {
    nr <- min(na, nb)
    z <- sqrt(3) * (nr * (na + nb + 1) - 2 * R) / (sqrt(na*nb*(na + nb + 1)))
    return(pnorm(z, lower.tail=FALSE))
}

cat(sprintf('Probability of no significant difference in weights: %.4f', wilcox2.test(30, 15, 234)))

Probability of no significant difference in weights: 0.0038

## Wilcoxon's Signed Ranks Test (1945)

The test is applied to samples that consist of "matched measurements". 
A pair of matched measurements is obtained either from the same subject by applying different treatments to it 
or from 2 distinct subjects that are carefully matched on their relevant characteristics.

E.g. you can treat the same patient with different medicines at different times and observe the effect but you can't remove the
appendix from a patient twice, so you would compare the rate of recovery from different operations for 2 patients matched 
by age, sex, lifestyle, etc.

Restrictions
* Matched measurements with equal values should be removed (or repeat the measurements with greater precision)
* At least 6 pairs of matched measurements should be left (after removing equal pairs)

The procedure is as follows:
1. Compute the difference between the matched measurements of each pair
2. Arrange the differences by absolute value in increasing order, keeping positive and negative differences with the same absolute value together
3. Assign sequential ranks to the ordered differences (use the average of the ranks if several absolute values are equal)
4. Sum the ranks of the positive differences and, separately, the ranks of the negative differences
5. Pick the smaller of the two sums as $R$, as in the Sum of Ranks Test

The smaller the value of $R$, the less likely it is that there is no difference between the samples.

If we have more than 20 matched measurements the following formula applies:

\begin{equation*}
z = \frac{\frac{1}{2} n (n + 1) - 2R}{\sqrt{\frac{n (n + 1) (2n + 1)}{6}}}
\end{equation*}

* $n$ - number of matched measurements
* $R$ - smallest sum of ranks
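
The procedure and the formula above can be sketched as follows. The function name `signed_rank_z` and the data are hypothetical, and the example has fewer than 20 pairs, so the resulting $z$ only illustrates the arithmetic.

```r
# Sketch: smallest sum of signed ranks and the z approximation (hypothetical helper).
signed_rank_z <- function(x, y) {
    d <- x - y                  # step 1: differences of the matched measurements
    d <- d[d != 0]              # restriction: drop pairs with equal values
    n <- length(d)
    ranks <- rank(abs(d))       # steps 2-3: rank by absolute value, ties get average ranks
    sum_pos <- sum(ranks[d > 0])  # step 4: rank sum of the positive differences
    sum_neg <- sum(ranks[d < 0])  #         rank sum of the negative differences
    R <- min(sum_pos, sum_neg)  # step 5: smallest sum of ranks
    z <- (0.5 * n * (n + 1) - 2 * R) / sqrt(n * (n + 1) * (2 * n + 1) / 6)
    list(n = n, R = R, z = z)
}

# Illustrative data: the differences are 2, -1, 0, 4, 3, 5, 6, 7.
x <- c(10, 12, 9, 15, 11, 14, 13, 16)
y <- c(8, 13, 9, 11, 8, 9, 7, 9)
res <- signed_rank_z(x, y)
res$n  # 7 pairs remain after the pair with equal values is dropped
res$R  # only the difference -1 is negative, and it has rank 1
res$z
```

The one-tailed probability would then be `pnorm(res$z, lower.tail = FALSE)`, as in Problem 3 of this section.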



In [14]:
# Problem 1:
# Determine whether the new sedative "Nockout" is better than the standard "Phenobarbitone" by testing them on 10 people 
# suffering from insomnia.
# The experiment is performed by giving half of them one treatment and the other half the other treatment on the same night, then, 
# after sufficient time has passed, swapping the treatments.

phenobarbitone <- c(7.5, 7, 7, 5.75, 4.25, 9.25, 8, 7.25, 8.5, 7.75)
nockout <- c(8, 6, 6.75, 5, 4.5, 8, 7.5, 6.25, 8, 7.75)

w_paired <- wilcox.test(phenobarbitone, nockout, paired=TRUE, exact=FALSE)
w_independent <- wilcox.test(phenobarbitone, nockout, paired=FALSE, exact=FALSE)

cat(
    sprintf(
        'Prob of no difference between the treatments: %.4f (w sum of ranks =%.4f) (mean_p=%.2f, mean_n=%.2f)', 
        w_paired$p.value, 
        w_independent$p.value,
        mean(phenobarbitone),
        mean(nockout)
    )
)

# The null hypothesis is probably false. Judging by the mean sleep times, the new treatment appears to be less effective.
# Note how the paired test provides evidence of statistical significance, whereas if we treat the data as independent groups
# (assuming the treatments were given to different groups of 10 patients each) such evidence is clearly missing. 
# Conclusion: by changing the design of the experiment and how we compare the data we can get more information from the same 
# number of measurements.

Prob of no difference between the treatments: 0.0494 (w sum of ranks =0.5949) (mean_p=7.22, mean_n=6.78)

In [15]:
# Problem 2:
# A research chemist tests the effect of hair bleach on hair strength. Seven women are picked for the experiment.
# The hair strength is tested before and after the use of the bleach by measuring the breaking
# point of 6 hairs and taking the average weight in grams of the load at which they break. Note that an average is used because
# different hairs have naturally different breaking points.

before <- c(105, 105, 93, 120, 111, 80, 91)
after <- c(97, 95, 93, 117, 108, 85, 86)

w <- wilcox.test(before, after, paired=TRUE, exact=FALSE)

cat(
    sprintf(
        'Prob of no effect on hair due to bleach: %.4f (mean_b=%.2f, mean_a=%.2f)', 
        w$p.value, 
        mean(before),
        mean(after)
    )
)

# The null hypothesis is not proven false.

cat('\n')
paste(c('differences', d <- before - after), collapse=', ')

# The differences (8, 10, 0, 3, 3, -5, 5) contain a zero and tied absolute values, which the chemist
# removes by repeating the measurements with greater precision

before <- c(105, 105, 93.2, 120.1, 111.4, 80.1, 91.3)
after <- c(97, 95, 93, 117.1, 108.3, 84.7, 86)

w <- wilcox.test(before, after, paired=TRUE, exact=FALSE)

cat(
    sprintf(
        'Prob of no effect on hair due to bleach: %.4f (mean_b=%.2f, mean_a=%.2f)', 
        w$p.value, 
        mean(before),
        mean(after)
    )
)

# The null hypothesis is still not proven false. Note the decrease in the p-value though.

Prob of no effect on hair due to bleach: 0.1706 (mean_b=100.71, mean_a=97.29)


Prob of no effect on hair due to bleach: 0.1083 (mean_b=100.87, mean_a=97.30)

In [16]:
# Problem 3: 
# We want to know whether a chemical solution prolongs the life of cut flowers. We picked a pair of flowers of each of 25 types
# and put the two flowers of a pair in vases close to each other (to exclude variations in environmental conditions such as the
# amount of light), one vase with plain water, the other with the solution added.

R <- 50
n <- 25

z <- (0.5 * n * (n + 1) - 2 * R) / sqrt(1/6 * n * (n + 1) * (2 * n + 1))
pvalue <- pnorm(z, lower.tail=FALSE)

cat(
    sprintf(
        'Prob of no effect of chemical compound on flower life: %.4f', pvalue
    )
)

# The difference is statistically significant.

Prob of no effect of chemical compound on flower life: 0.0012

In [17]:
# Problem 4:
# Test the effect of a motor lubricant on gasoline consumption. 
# Take 8 different cars and test each of them on a 100 mile run with and without the lubricant (using the same brand of gasoline both times).

gasoline <- c(17.1, 29.5, 23.8, 37.3, 19.6, 24.2, 30, 20.9)
gasoline_and_lubricant <- c(14.2, 30.3, 21.5, 36.3, 19.6, 24.5, 26.7, 20.6)

w <- wilcox.test(gasoline, gasoline_and_lubricant, paired = TRUE, exact = FALSE)

cat(
    sprintf(
        'Prob of no effect on fuel consumption: %.4f (mean_g=%.2f, mean_gl=%.2f)', 
        w$p.value, 
        mean(gasoline),
        mean(gasoline_and_lubricant)
    )
)

# The effectiveness of lubricant is not proven.

Prob of no effect on fuel consumption: 0.1508 (mean_g=25.30, mean_gl=24.21)

In [19]:
# Problem 5:
# Darwin's experiment of comparing the growth of cross-fertilized plants vs self-fertilized.
f_cross <- c(23.5, 12, 21, 22, 19.125, 21.5, 22.125, 20.375, 18.25, 21.625, 23.25, 21, 22.125, 23, 12)
f_self <- c(17.375, 20.375, 20, 20, 18.375, 18.625, 18.625, 15.25, 16.5, 18, 16.25, 18, 12.75, 15.5, 18) 

t <- t.test(f_cross, f_self, paired = TRUE)
w <- wilcox.test(f_cross, f_self, paired = TRUE)

cat(
    sprintf(
        'Prob of no effect of fertilization method: %.4f (t.test=%.4f) (mean_c=%.2f, mean_s=%.2f)', 
        w$p.value,
        t$p.value,
        mean(f_cross),
        mean(f_self)
    )
)

# The effect is likely statistically significant.


Prob of no effect of fertilization method: 0.0413 (t.test=0.0497) (mean_c=20.19, mean_s=17.57)

## Wilcoxon's Stratified Test (1946)

The test compares 2 independent stratified random samples in which, within each stratum, both samples have the same number of measurements. It is an extension of the Sum of Ranks Test to samples that can be split into distinct strata.

Restrictions
* Within each stratum both samples should have the same number of measurements

The procedure is as follows:
1. Split the samples into strata
2. Rank the measurements of each stratum separately
3. Compute the sum of ranks of each sample within each stratum
4. For each sample, add up its sums of ranks over all strata
5. Pick the smallest total as $R$
6. Use the formula below:

\begin{equation*}
z = \frac{
    n_1(2n_1 + 1) + n_2(2n_2 + 1) + ... + n_k(2n_k + 1) - 2R
}{
      \sqrt{
          \frac{
              n_1^2(2n_1 + 1) + n_2^2(2n_2+1) + ... + n_k^2(2n_k + 1)
          }{3}
      }
}
\end{equation*}

* $k$ - number of strata
* $n_i$ - number of measurements per sample in stratum $i$
* $R$ - smallest sum of ranks

When every stratum has the same number of measurements $n$ per sample the formula simplifies to:
\begin{equation*}
z = \frac{kn(2n + 1) - 2R}{\sqrt{\frac{kn^2(2n + 1)}{3}}}
\end{equation*}

And when there is a single stratum ($k = 1$) with the same number of measurements $n$ in each sample we obtain the formula for the Sum of Ranks Test with $n_A = n_B = n_R = n$:

\begin{equation*}
z = \frac{n(n + n + 1) - 2R}{\sqrt{\frac{n \cdot n(n + n + 1)}{3}}}
\end{equation*}
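
As a quick numerical check of this reduction, the two formulas can be coded side by side. This is a sketch: both function names are hypothetical and the values of $n$ and $R$ are chosen only for illustration.

```r
# Stratified formula: n is a vector with the per-sample size of each stratum.
stratified_z <- function(n, R) {
    (sum(n * (2 * n + 1)) - 2 * R) / sqrt(sum(n^2 * (2 * n + 1)) / 3)
}

# Sum of Ranks formula with sample sizes n_a, n_b and n_r for the sample with the smallest rank sum.
sum_of_ranks_z <- function(n_a, n_b, n_r, R) {
    (n_r * (n_a + n_b + 1) - 2 * R) / sqrt(n_a * n_b * (n_a + n_b + 1) / 3)
}

# Three strata with 4 measurements per sample and a smallest total rank sum of 32.
z_three_strata <- stratified_z(c(4, 4, 4), 32)
z_three_strata

# A single stratum with 8 measurements per sample: both formulas should agree.
all.equal(stratified_z(8, 40), sum_of_ranks_z(8, 8, 8, 40))
```

Passing the stratum sizes as a vector lets one function cover both the general formula and its equal-size simplification.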






In [23]:
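wilcox_stratified.test <- function(s1, s2) {
  # s1, s2: lists with one numeric vector per stratum, given in the same stratum order
  sR1 <- 0   # total sum of ranks of sample 1 over all strata
  sR2 <- 0   # total sum of ranks of sample 2 over all strata
  nA <- c()  # per-stratum sizes of sample 1
  nB <- c()  # per-stratum sizes of sample 2
  for (stratum in mapply(list, s1, s2, SIMPLIFY = FALSE)) {
    g1 <- stratum[[1]]
    g2 <- stratum[[2]]
    r <- rank(c(g1, g2))  # rank within the stratum; ties get the average rank
    l1 <- length(g1)
    l2 <- length(g2)
    sR1 <- sR1 + sum(r[1:l1])
    sR2 <- sR2 + sum(r[(l1 + 1):(l1 + l2)])
    nA <- c(nA, l1)
    nB <- c(nB, l2)
  }
  
  R <- min(sR1, sR2)  # smallest total sum of ranks

  # z = numerator / denominator, written in the n_A / n_B form of the Sum of Ranks Test
  numerator <- sqrt(3) * (sum(pmin(nA, nB) * (nA + nB + 1)) - 2 * R)
  denominator <- sqrt(sum(nA * nB * (nA + nB + 1)))
  
  return(data.frame(statistic=R, p.value=2 * pnorm(numerator / denominator, lower.tail=FALSE)))  # two-tailed value
}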

In [24]:
# Problem 1
# Two DDT preparations are used in 3 different concentrations to measure how well they kill flour beetles. 
# For each concentration 4 experiments are performed. Is there a significant difference between the preparations ?
# Values are given as the percentage of beetles killed in the sample.

conc_25mg_A <- c(18, 26, 30, 50)
conc_25mg_B <- c(34, 42, 53, 63)

conc_50mg_A <- c(33, 42, 44, 44)
conc_50mg_B <- c(60, 62, 66, 80)

conc_100mg_A <- c(44, 50, 56, 64)
conc_100mg_B <- c(74, 77, 84, 92)

preparation_A <- list(conc_25mg_A, conc_50mg_A, conc_100mg_A)
preparation_B <- list(conc_25mg_B, conc_50mg_B, conc_100mg_B)

cat(
    sprintf(
        'Prob that there is no significant difference between preparations: %.6f, Mean(A)=%.2f, Mean(B)=%.2f', 
        wilcox_stratified.test(preparation_A, preparation_B)$p.value,
        mean(unlist(preparation_A)),
        mean(unlist(preparation_B))
    )
)

# Preparation B is clearly more effective in killing flour beetles.

Prob that there is no significant difference between preparations: 0.000246, Mean(A)=41.75, Mean(B)=65.58

In [26]:
# Problem 2
# We want to compare 2 treatments for curing acne. 
# We split the patients into 2 groups and, within each group, separate the patients into 4 strata based on the severity of the disease.
# We measure the number of weeks it takes until the acne is 90% cured.

mild_A <- c(2, 3)
mild_B <- c(2, 4)

moderate_A <- c(3, 5, 6, 10)
moderate_B <- c(4, 6, 7, 9)

severe_A <- c(6, 8, 11)
severe_B <- c(9, 14, 14)

very_severe_A <- c(8, 10, 11)
very_severe_B <- c(12, 14, 15)

treatment_A <- list(mild_A, moderate_A, severe_A, very_severe_A)
treatment_B <- list(mild_B, moderate_B, severe_B, very_severe_B)

r <- wilcox_stratified.test(treatment_A, treatment_B)

cat(
    sprintf(
        'Prob that there is no significant difference between treatments: (%d) %.4f, Mean(A)=%.2f, Mean(B)=%.2f', 
        r$statistic,
        r$p.value,
        mean(unlist(treatment_A)),
        mean(unlist(treatment_B))
    )
)

# The difference is probably significant 1% < x < 5%

Prob that there is no significant difference between treatments: (34) 0.0419, Mean(A)=6.92, Mean(B)=9.17

In [28]:
# Problem 3
# A man challenges Houdini over who can escape faster from standard police handcuffs. 
# The experiment is performed with 2 handcuff models; each competitor tries to escape 3 times from each model.
# Each measurement is the number of seconds needed to unlock the handcuffs.
# Is the contender less or more skilled ?

h1_Houdini <- c(10.9, 11.3, 10.2)
h1_Anon <- c(10.3, 10.6, 12.2)

h2_Houdini <- c(13.8, 15.1, 14.3)
h2_Anon <- c(16.3, 15.2, 15.8)

houdini <- list(h1_Houdini, h2_Houdini)
anon <- list(h1_Anon, h2_Anon)

r <- wilcox_stratified.test(houdini, anon)

cat(
    sprintf(
        'Prob that there is no significant difference between contestants: (%d) %.4f, Mean(Houdini)=%.2f, Mean(Anon)=%.2f', 
        r$statistic,
        r$p.value,
        mean(unlist(houdini)),
        mean(unlist(anon))
    )
)

# Statistical difference between contestants is not proved.

Prob that there is no significant difference between contestants: (16) 0.1228, Mean(Houdini)=12.60, Mean(Anon)=13.40

In [30]:
# Problem 4
# Joe is testing a 'magic powder' from Bolivia which, the natives claim, prevents clothes from shrinking after laundering.
# He tested 9 different cloths, cutting each into 4 strips: 2 strips are washed with magic powder + detergent and 2 with detergent only.
# The measurements indicate the percentage of shrinkage of the cloth.

A_mp <- c(0, 1)
A_od <- c(3, 3)

B_mp <- c(2, 5)
B_od <- c(4, 6)

C_mp <- c(6, 4)
C_od <- c(5, 7)

D_mp <- c(10, 8)
D_od <- c(7, 11)

E_mp <- c(4, 1)
E_od <- c(3, 2)

F_mp <- c(1, 2)
F_od <- c(1, 4)

G_mp <- c(6, 5)
G_od <- c(9, 9)

H_mp <- c(0, 2)
H_od <- c(2, 3)

I_mp <- c(4, 7)
I_od <- c(3, 5)


magic_powder <- list(A_mp, B_mp, C_mp, D_mp, E_mp, F_mp, G_mp, H_mp, I_mp)
ordinary_detergent <- list(A_od, B_od, C_od, D_od, E_od, F_od, G_od, H_od, I_od)

r <- wilcox_stratified.test(magic_powder, ordinary_detergent)

cat(
    sprintf(
        'Prob that there is no significant difference between chemicals: (%d) %.4f, Mean(Magic Powder)=%.2f, Mean(Detergent)=%.2f', 
        r$statistic,
        r$p.value,
        mean(unlist(magic_powder)),
        mean(unlist(ordinary_detergent))
    )
)

# Significant difference not proven

Prob that there is no significant difference between chemicals: (38) 0.0707, Mean(Magic Powder)=3.78, Mean(Detergent)=4.83