# Significance Tests

A significance test answers the question: how likely is it that a sample was drawn from a given population? If the probability is low, we can conclude that the sample is unlikely to belong to that population. A high probability, however, does not show that the sample comes from that population; it only means we failed to show that it does not. A test can never prove that it does.
By convention, the following probability levels are used:

* \> 5% - the likelihood that the sample was drawn by chance from the population is high, so we cannot say anything about the relationship between the two
* 1% < x < 5% - the sample probably does not belong to the population
* < 1% - the sample very likely belongs to another population
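
The three bands above can be sketched as a small lookup. The function name `classify_p` and the labels it returns are illustrative assumptions, not part of the original notes. The convention above does not say which band the exact values 1% and 5% belong to, so assigning them to the middle band is also an assumption.

```r
# Hypothetical helper: maps a p-value given in percent to the conventional bands.
# Assumption: the boundary values 1% and 5% are placed in the middle band.
classify_p <- function(p_percent) {
    if (p_percent > 5) {
        "not significant"
    } else if (p_percent >= 1) {
        "significant"
    } else {
        "highly significant"
    }
}

for (p in c(10.96, 3.5, 0.14)) {
    cat(sprintf("p = %.2f%% -> %s\n", p, classify_p(p)))
}
```

The helper only labels a result. Whether a significant difference also matters in practice is a separate question.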

## zM Test (1733)

In [30]:
# zM test: returns, in percent, the probability of observing a sample mean at least
# this far from the population mean if the sample really came from that population.
zm <- function(sample, population_mean, population_sd, sides=2) {
    result <- abs(mean(sample) - population_mean) * sqrt(length(sample)) / population_sd
    return(100 * pnorm(result, lower.tail=FALSE) * sides)
}
# Note the use of 2-sided probabilities by default: we ask whether there is a
# difference at all, not whether the result is smaller or greater
# (hence the absolute value in the calculation).
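
To make the calculation easier to follow, here is the same computation split into steps. The measurements and population parameters are made-up values chosen only for illustration.

```r
# Step-by-step version of the z computation (all input values are hypothetical).
sample_values   <- c(101, 99, 104, 102)  # made-up measurements
population_mean <- 100                   # assumed to be known
population_sd   <- 2                     # assumed to be known

n <- length(sample_values)
standard_error <- population_sd / sqrt(n)   # sd of the mean of n observations
z <- abs(mean(sample_values) - population_mean) / standard_error

p_one_sided <- pnorm(z, lower.tail = FALSE) # area in one tail beyond z
p_two_sided <- 2 * p_one_sided              # both tails: "different", not "greater"

cat(sprintf("z = %.3f, one-sided p = %.4f%%, two-sided p = %.4f%%\n",
            z, 100 * p_one_sided, 100 * p_two_sided))
```

Dividing by the standard error is the same as multiplying by `sqrt(n) / population_sd`, which is the form used in the function.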

In [27]:
# Problem 1: 
# Gold melts at 1060 degrees; assume an sd of 3 degrees, obtained by running many melting experiments.
# An unknown metal was found to melt at 1072 degrees.
# What is the probability that the metal is gold?

cat(sprintf('Prob(metal=gold)=%.6f%%', zm(1072, 1060, 3)))

Prob(metal=gold)=0.006334%

In [28]:
# Problem 2:
# 40 patients with a certain disease have a mean pulse rate of 73
# Mean pulse rate for healthy people = 70 with an sd = 5
# How likely is it that the increased pulse rate is just random variation within a group of healthy people rather than an effect of the illness?

cat(sprintf('Prob(random variation among healthy people)=%.6f%%', zm(rep(73, 40), 70, 5)))

# The difference in pulse rate is statistically significant but not practically significant (only 3 extra beats per minute).

Prob(random variation among healthy people)=0.014780%

In [32]:
# Problem 3:
# Copper melts at 1080 degrees with an sd = 5.
# How likely is it that the metal from Problem 1 is copper?

cat(sprintf('Prob(metal=copper)=%.6f%%\n', zm(1072, 1080, 5)))

# And what if we measure the melting temperature of the unknown metal several times?
cat(sprintf('Prob(metal=copper)=%.6f%%', zm(c(1072, 1071, 1072, 1073), 1080, 5)))

Prob(metal=copper)=10.959858%
Prob(metal=copper)=0.137428%
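
The drop from about 11% to about 0.14% comes from the `sqrt(length(sample))` factor: the same 8-degree gap counts for more as the number of measurements grows. A minimal sketch, assuming every repeated measurement is exactly 1072 (a simplification; the sample sizes are arbitrary):

```r
# Same zM function as above, repeated so that this block runs on its own.
zm <- function(sample, population_mean, population_sd, sides = 2) {
    result <- abs(mean(sample) - population_mean) * sqrt(length(sample)) / population_sd
    return(100 * pnorm(result, lower.tail = FALSE) * sides)
}

# Assumption: all n measurements equal 1072, so the sample mean never changes.
sizes <- c(1L, 2L, 4L, 8L)
p_values <- sapply(sizes, function(n) zm(rep(1072, n), 1080, 5))

for (i in seq_along(sizes)) {
    cat(sprintf("n = %d: p = %.6f%%\n", sizes[i], p_values[i]))
}
```

A single measurement cannot rule out copper at the 5% level, while four consistent measurements can.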

## Student's _t_ Test (1908)

In [23]:
# Same formula as for the zM test, with 2 differences:
# 1. The standard deviation of the sample is used (the sd of the population is unknown).
# 2. The p-value obtained from the result depends on the sample size: the smaller the sample,
# the larger the result has to be to reach the same p-value.

# Problem 1:
# A printing press can make 45 copies per minute
# After an adjustment to increase its throughput we obtain: 46, 47, 48 copies per minute.
# How likely is it that the change is due to random variation rather than to the adjustment?

result <- t.test(x=c(46, 47, 48), mu=45, conf.level=0.95)
cat(
    sprintf(
        '%s%% confidence interval for the true mean (3-item sample): [%s]\np-value for current sample: %s\n', 
        100 * attr(result$conf.int, 'conf.level'),
        toString(result$conf.int),
        result$p.value
    )
)

95% confidence interval for the true mean (3-item sample): [44.5158622882497, 49.4841377117503]
p-value for current sample: 0.0741799002274485
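
To connect `t.test` with the description above, the statistic and p-value for Problem 1 can be computed by hand. This is a sketch of the textbook one-sample formula. The variable names are illustrative, and the normal-based p-value is included only for comparison.

```r
x  <- c(46, 47, 48)   # copies per minute after the adjustment (Problem 1)
mu <- 45              # rate before the adjustment
n  <- length(x)

# Same shape as the zM formula, but with the sample sd instead of the population sd.
t_stat <- (mean(x) - mu) / (sd(x) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_t <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# For comparison only: the p-value if the same statistic were read off the normal curve.
p_normal <- 2 * pnorm(abs(t_stat), lower.tail = FALSE)

p_builtin <- t.test(x, mu = mu)$p.value

cat(sprintf("t = %.4f\np (t, df = %d) = %.6f\np (normal) = %.6f\np (t.test) = %.6f\n",
            t_stat, n - 1, p_t, p_normal, p_builtin))
```

With only 3 observations the t-based p-value is far larger than the normal-based one, which is why this sample does not reach the 5% level.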


In [24]:
# Problem 2:
# If the next 2 trials in Problem 1 yield 47 and 47, how does the p-value change?
result <- t.test(x=c(46, 47, 48, 47, 47), mu=45, conf.level=0.95)
cat(
    sprintf(
        '%s%% confidence interval for the true mean (5-item sample): [%s]\np-value for current sample: %s\n', 
        100 * attr(result$conf.int, 'conf.level'),
        toString(result$conf.int),
        result$p.value
    )
)

# Note that this confidence interval no longer contains the original rate of 45 copies per minute,
# so the change is very likely due to the adjustment rather than chance.

95% confidence interval for the true mean (5-item sample): [46.1220109669149, 47.8779890330851]
p-value for current sample: 0.00319820215233531
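
The confidence interval reported by `t.test` can also be reproduced by hand, which shows where its width comes from. A minimal sketch using the standard formula (mean plus or minus the critical t value times the standard error); the variable names are illustrative.

```r
x <- c(46, 47, 48, 47, 47)   # the 5 trials from Problems 1 and 2
conf_level <- 0.95
n <- length(x)

standard_error <- sd(x) / sqrt(n)
# Critical value that leaves (1 - conf_level) / 2 in each tail, with n - 1 degrees of freedom.
t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)

ci_manual  <- mean(x) + c(-1, 1) * t_crit * standard_error
ci_builtin <- as.numeric(t.test(x, mu = 45, conf.level = conf_level)$conf.int)

cat(sprintf("manual: [%.4f, %.4f]\nt.test: [%.4f, %.4f]\n",
            ci_manual[1], ci_manual[2], ci_builtin[1], ci_builtin[2]))
```

More observations shrink both the standard error and the critical value, so the interval narrows from both effects.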


In [26]:
# Problem 3:
# Between 1945 and 1962 the telephone bill of a CEO averaged $48 per year. 
# With the new secretary the values were: $56, $51, $63, $60
# Should the secretary be fired for over-using the phone?
result <- t.test(x=c(56, 51, 63, 60), mu=48, conf.level=0.95)
cat(
    sprintf(
        '%s%% confidence interval for the true mean (4-item sample): [%s]\np-value for current sample: %s\n', 
        100 * attr(result$conf.int, 'conf.level'),
        toString(result$conf.int),
        result$p.value
    )
)
# Probably yes, unless the phone service became more expensive on its own.

95% confidence interval for the true mean (4-item sample): [49.2317619603331, 65.7682380396669]
p-value for current sample: 0.0353301023027491


In [35]:
# Problem 4:
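# The average income of a population is $20560.
# A random sample of 49 people from a neighbouring region has an average income of $18505.
# Is the difference statistically significant?

# Not enough information: the standard deviation of the sample is not known.
# The two made-up samples below have the same mean (18505) but very different spreads,
# and they lead to opposite conclusions.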

sample1 <- seq(from=18505-49, to=18505+49, by=2)
result <- t.test(x=sample1, mu=20560, conf.level=0.95)
cat(
    sprintf(
        'Nearly identical incomes [%s, %s] (sd=%s): %s%% confidence interval: [%s]\np-value for current sample: %s\n', 
        min(sample1), max(sample1),
        sd(sample1),
        100 * attr(result$conf.int, 'conf.level'),
        toString(result$conf.int),
        result$p.value
    )
)

sample2 <- seq(from=18505-49*300, to=18505+49*300, by=600)
result <- t.test(x=sample2, mu=20560, conf.level=0.95)
cat(
    sprintf(
        'Widely spread incomes [%s, %s] (sd=%s): %s%% confidence interval: [%s]\np-value for current sample: %s\n', 
        min(sample2), max(sample2),
        sd(sample2),
        100 * attr(result$conf.int, 'conf.level'),
        toString(result$conf.int),
        result$p.value
    )
)

Nearly identical incomes [18456, 18554] (sd=29.1547594742265): 95% confidence interval: [18496.7143090347, 18513.2856909653]
p-value for current sample: 1.90805983449625e-92
Widely spread incomes [3805, 33205] (sd=8746.42784226795): 95% confidence interval: [16019.2927104071, 20990.7072895929]
p-value for current sample: 0.103025307777696
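
Since the answer to Problem 4 hinges on the missing standard deviation, it can help to run the test directly from summary statistics. The helper `t_from_summary` below is a hypothetical function written for this sketch, not something base R provides. The two sd values (30 and 9000) are assumptions picked to be close to the spreads of the two made-up samples.

```r
# Hypothetical helper: two-sided one-sample t test from summary statistics only.
t_from_summary <- function(sample_mean, sample_sd, n, mu) {
    t_stat  <- (sample_mean - mu) / (sample_sd / sqrt(n))
    p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)
    list(t = t_stat, p.value = p_value)
}

# The means and n = 49 come from the problem statement; both sd values are assumed.
narrow <- t_from_summary(sample_mean = 18505, sample_sd = 30,   n = 49, mu = 20560)
wide   <- t_from_summary(sample_mean = 18505, sample_sd = 9000, n = 49, mu = 20560)

cat(sprintf("sd = 30:   t = %.2f, p = %.3g\n", narrow$t, narrow$p.value))
cat(sprintf("sd = 9000: t = %.2f, p = %.3g\n", wide$t, wide$p.value))
```

The same $2055 gap is overwhelming evidence when incomes barely vary and inconclusive when they vary widely, so the question cannot be answered without the sample sd.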
