<div style="text-align: center; font-size: 30px;">
Statistics Labs<br/>
</div>
<div style="text-align: center; font-size: 30px;">
Inference and Hypothesis Testing
</div>
<div style="text-align: center; font-size: 16px; font-style: italic">
Material prepared by M. Dolores Frías, Jesús Fernández, and Carmen M. Sordo, senior lecturers from the Department of Applied Mathematics and Computer Science at the University of Cantabria.
</div>

# Objectives

In this practice, we will use R functions to carry out inferential analysis, either to calculate confidence
intervals or to solve hypothesis tests. As we have already seen in class, we will study a small,
representative sample from the population to draw conclusions about population parameters. Specifically,
we will focus on obtaining information about the main characteristics of the population, such
as proportion, mean, and variance.

Regarding the calculation of confidence intervals, there is no specific function in R for this purpose;
instead, the interval is part of the output generated when running a hypothesis test.

# Meaning of the confidence interval

We will begin by analyzing the meaning of the confidence interval. The purpose of the confidence
interval is to provide a certain guarantee regarding the presence of a population parameter within an
interval constructed from a sample.

Let’s consider the following question: Is the probability of getting heads when flipping a coin 0.5?

To answer this question, we will conduct the following experiment:

We toss a coin 30 times and estimate the probability of getting heads (P) from the
proportion of heads obtained in those 30 tosses (p, stored in `pest`). As we’ve seen in previous practices, we
can simulate the coin tosses using the `rbinom` function, since we have 30 independent Bernoulli
trials with the same success probability (P=0.5):

In [None]:
n <- 30 # 30 coin tosses
P <- 0.5 # Probability of getting heads on a single toss
pest <- rbinom(1,n,P)/n; pest # Proportion of the sample

We repeat 50 times (m=50) the experiment of tossing the coin 30 times, and calculate the
proportion of heads obtained for each of the m samples:

In [None]:
m <- 50  # number of times the experiment is repeated
pest <- rbinom(m,n,P)/n; pest # Sample proportion for each experiment

The variable `pest` stores the proportion of heads obtained when tossing a coin 30 times in each of the
50 experiments.

We set the confidence level to $1-\alpha = 0.95$ and, assuming that the distribution of sample proportions follows a Normal distribution (since n is sufficiently large), we calculate the corresponding
confidence interval for each of the 50 samples using the following expression:

$p \pm z_{\alpha/2} \sqrt {{{p(1-p)} \over n}}$ 

Let’s do this with R:

In [None]:
alfa <- 0.05 
z <- qnorm(1-alfa/2)
e <- z*sqrt(pest*(1-pest)/n); e # Values of the error for each experiment. 

We represent the m = 50 resulting confidence intervals with the command:

In [None]:
# Confidence intervals
matplot(rbind(pest-e, pest+e),rbind(1:m,1:m),type="l",lty=1, xlim=c(0,1),
        ylab="m", xlab="[p-e, p+e]", main="Confidence intervals of P")

We mark the value P=0.5 with a vertical line:

In [None]:
# Confidence intervals
matplot(rbind(pest-e, pest+e),rbind(1:m,1:m),type="l",lty=1, xlim=c(0,1),
        ylab="m", xlab="[p-e, p+e]", main="Confidence intervals of P")
# Vertical line
abline(v=P)

What do you observe? Does the confidence interval depend on the chosen sample?

Do all the confidence intervals drawn capture the population proportion (P)?

<div class="alert alert-block alert-info">
<strong>PRACTICE ON YOUR OWN</strong>

Perform the same experiment 10000 times and check with what probability the value of the population proportion falls within the confidence interval calculated for each sample. You should
program this calculation with R, determining the number of intervals obtained that contain P.
Try for different confidence levels (90% and 99%). 

What happens to the confidence intervals obtained if the confidence level is increased or decreased?

What percentage of intervals contain the value of the proportion 0.5? What relationship exists between that percentage and the considered confidence level?

Now try the following. For a given confidence level (for example, 0.95), what happens to the confidence interval when you increase the sample size? Try with values such as n = 100 and n = 300, for example.

</div>

# Inference and hypothesis testing based on a single sample. 

We are going to see how to perform parameter estimation and hypothesis testing for a single population using R. We will focus specifically on three parameters: the population proportion, mean, and variance.

## 1. Proportions

### Confidence interval for a population proportion

As we know, in order to calculate proportions, we need a dataset with a categorical variable. This
allows us to group individuals based on that factor or category. It is also possible to calculate proportions for a numerical variable by determining how many elements meet a specific condition, such as
whether the variable is less than or greater than a certain value.
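
As a minimal sketch of this last idea, the proportion of elements meeting a condition can be obtained from a logical vector. The vector `heights` below is made up for illustration and is not taken from any data file used in this practice.

```r
# Hypothetical heights in cm, used only to illustrate the idea
heights <- c(172, 185, 168, 191, 177, 183, 165, 180)

# A logical condition yields TRUE/FALSE for each element
tall <- heights > 180

# Proportion = number of TRUE values divided by the sample size
p.tall <- sum(tall) / length(tall)
p.tall

# mean() of a logical vector gives the same proportion
p.tall.mean <- mean(heights > 180)
p.tall.mean
```

Both expressions return the same value, because R treats `TRUE` as 1 and `FALSE` as 0 in arithmetic.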

The data file *pulsations.rda*, which we have worked with before, contains several columns that
are categorical variables and allow us to compute values for the proportion. Specifically, the columns
*Run*, *Smoke* and *Sex* classify individuals based on certain properties (e.g., having participated in the race or not, being a smoker or not, being male or female). With this information, we can estimate
the proportion of people with a specific property in our sample (point estimate) and calculate the
confidence interval for that proportion in the population.

To do so, you must first correctly load the data from that file, as we have done in previous practical sessions. Once the data is loaded in R, we will calculate the point estimate of the proportion of individuals in the population who smoke (P).

In [None]:
# Set the working directory
setwd("data/") 
# Load the data
load("pulsations.rda")
attach(pulsations)
# Calculate the sample proportion
smokers <- sum(Smoke=="Yes")
n <- length(Smoke)
p <- smokers/n; p

From the previous result, we obtain the point estimate of the proportion of smokers in the population, which is 0.304.

As we know, the point estimate does not provide information about the accuracy of the estimate for P. To address this, we will estimate P using a confidence interval, which does provide information about the confidence we have in estimating the population parameter.

We will now program in R the calculation of the confidence interval for the proportion of individuals in the population who smoke, with a 95% confidence level, using the formula covered in class:

$p \pm z_{\alpha/2} \sqrt {{{p(1-p)} \over n}}$

In [None]:
# Confidence interval for the proportion of smokers in the population
alfa <- 0.05
z <- qnorm(1-alfa/2)
e <- z*sqrt(p*(1-p)/n)
conclusion <- "The proportion of smokers in the population lies within the interval"
sprintf("%s [%6.3f, %6.3f] with a confidence of %2d%s", conclusion, p-e, p+e, (1-alfa)*100, "%")

With the `sprintf` function, we can write a final conclusion by mixing character strings (placeholder `%s`) with numerical values, whether real numbers (`%6.3f`) or integers (`%2d`). For example, `%6.3f` inserts a floating-point number (the f stands for float) with a minimum width of 6 characters, 3 of which are decimal places. The values assigned to each placeholder (those starting with %) are the arguments listed after the format string, separated by commas and in the same order.
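
The following short sketch shows each placeholder on its own. The numbers are arbitrary illustrative values, and `%%` (a literal percent sign) is an alternative to passing `"%"` through `%s`.

```r
# %s inserts a character string
s1 <- sprintf("%s", "mean")

# %6.3f: at least 6 characters wide, 3 decimals (padded with spaces on the left)
f1 <- sprintf("%6.3f", 0.3043478)
f1

# %2d expects an integer-valued number, at least 2 characters wide
d1 <- sprintf("%2d", 95L)
d1

# %% prints a literal percent sign inside the format string
msg <- sprintf("Interval [%6.3f, %6.3f] at %d%%", 0.2, 0.4, 95L)
msg
```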

R provides the function `prop.test`, which calculates the point estimate and
confidence interval for P. Additionally, this function is used to perform hypothesis tests of proportions, as we will see later.

In [None]:
prop.test(smokers, n, conf.level=1-alfa)

The result provides a large amount of information. However, since we are interested in the point
estimate and confidence intervals of P, we are only concerned with the outcome obtained from the
confidence interval for the given confidence level (`95 percent confidence interval:`), which turns
out to be [0.2149697, 0.4102950], and the point estimate of the proportion of smokers in the population
(`sample estimates: p`) with a value of 0.3043478.

Compare this result obtained for the confidence interval with the value calculated earlier using the
formula provided in class. The difference you observe is because in R, the formula implemented in
the `prop.test` function for calculating the confidence interval of a proportion is given by the *Wilson
interval*:

$\frac{p + \frac{z_{\alpha/2}^2}{2n} \pm z_{\alpha/2} \sqrt{\frac{p(1-p)}{n} + \frac{z_{\alpha/2}^2}{4n^2}}}{1 + \frac{z_{\alpha/2}^2}{n}}$

This is an improved version of the normal approximation we have seen in theory. Whereas the estimate discussed in theory should only be considered under the conditions we are aware of, the *Wilson
interval* yields good results even with small sample sizes. 

Additionally, as shown in the `prop.test` documentation, this function applies *Yates’s continuity
correction* by default (`correct=TRUE`), which we also did not consider in theory.
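
To see where the difference comes from, the sketch below computes the normal-approximation interval and the Wilson interval by hand and compares them with `prop.test`, with and without the continuity correction. The counts `x` and `n` are illustrative values chosen for this example, not taken from *pulsations.rda*.

```r
# Illustrative counts (hypothetical, not taken from pulsations.rda)
x <- 30     # number of individuals with the property
n <- 100    # sample size
alfa <- 0.05

p <- x / n
z <- qnorm(1 - alfa / 2)

# Normal-approximation interval seen in class
wald <- p + c(-1, 1) * z * sqrt(p * (1 - p) / n)

# Wilson interval, written out term by term from the Wilson formula
centre <- p + z^2 / (2 * n)
half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2))
wilson <- (centre + c(-1, 1) * half) / (1 + z^2 / n)

# prop.test without and with Yates's continuity correction
ci.plain <- as.numeric(prop.test(x, n, conf.level = 1 - alfa, correct = FALSE)$conf.int)
ci.yates <- as.numeric(prop.test(x, n, conf.level = 1 - alfa, correct = TRUE)$conf.int)

# One interval per row: lower and upper bounds
rbind(wald, wilson, ci.plain, ci.yates)
```

With `correct=FALSE` the interval returned by `prop.test` coincides with the hand-computed Wilson interval, while the default `correct=TRUE` gives a slightly wider one.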

### Hypothesis test on a population proportion

As mentioned before, `prop.test` also allows us to conduct hypothesis tests for proportions by appropriately configuring the different arguments of the function. Consult the `prop.test` help documentation
to see what other arguments it accepts.

In this regard, we can determine whether, based on the sample, the null hypothesis that the
proportion of smokers is equal to that of non-smokers is accepted or not, i.e., $H_0: P=0.5$ and
therefore the alternative hypothesis is that the proportion of smokers in the population is different
from that value $H_1: P\neq 0.5$. It is therefore a two-sided test of proportions, which is programmed
in the `prop.test` function with the argument `alternative="two.sided"`. Additionally, we must
include that the value to be tested is 0.5 with the argument `p=0.5`.

In [None]:
prop.test(smokers, n, alternative = "two.sided", p=0.5, conf.level=1-alfa)

The arguments `alternative="two.sided"` and `p=0.5` are taken by default by the function, so the result we obtain is exactly the same as before when we calculated the confidence interval. In other words, by calculating the previous confidence interval, we were also resolving that two-sided test.

Let’s focus now on the result obtained for the p-value, which will allow us to decide whether
to accept or reject the null hypothesis. According to the hypothesis test proposed, which questions
whether the proportion of smokers in the population is equal to that of non-smokers ($H_0:P=0.5$, $H_1:P \neq 0.5$), we observe that the sample we have worked with has provided sufficient evidence to
reject the null hypothesis with a confidence level of 95%, since the p-value (`p-value = 0.0002633`) is less than the considered $\alpha$, and therefore the alternative hypothesis is accepted. In other words, we accept that the proportion of smokers in the
population differs from 0.5 (smokers and non-smokers are not equally represented) with a confidence level of 95%.

We reach the same conclusion if, instead of analyzing the p-value, we consider the result obtained
for the confidence interval, since in this example a two-sided test is proposed. It is observed that the
confidence interval obtained does not include the value $P=0.5$, so, as it could not be otherwise, the
same conclusion is reached. The null hypothesis is rejected with a confidence level of 95%.
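
Both decision rules can be checked programmatically, since the object returned by `prop.test` stores the p-value and the confidence interval in its `p.value` and `conf.int` components. The sketch below uses illustrative counts (hypothetical, not the ones in *pulsations.rda*).

```r
# Illustrative counts (hypothetical)
x <- 30; n <- 100; alfa <- 0.05

res <- prop.test(x, n, p = 0.5, alternative = "two.sided", conf.level = 1 - alfa)

# Decision based on the p-value: reject H0 if it is below alfa
reject.by.pvalue <- res$p.value < alfa

# Decision based on the confidence interval: reject H0 if 0.5 lies outside it
ci <- res$conf.int
reject.by.ci <- (0.5 < ci[1]) || (0.5 > ci[2])

c(p.value = res$p.value, lower = ci[1], upper = ci[2])
c(by.pvalue = reject.by.pvalue, by.ci = reject.by.ci)
```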

In the same way that we have applied the test for this example in which the alternative hypothesis
is one of inequality, we could define hypotheses for a one-sided test (either left-sided or right-sided),
according to the problem analyzed, simply by defining the type of alternative hypothesis appropriately
in the `alternative` option and assigning the corresponding value to be tested. 

Consult the `prop.test` help to determine the possible values of the `alternative` option depending on the type of test. In particular, the `alternative` argument should be set to `"less"` to indicate a
left-sided test or to `"greater"` to indicate a right-sided test.
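
As a brief sketch of how the three options behave, the same illustrative counts (hypothetical values, not from *pulsations.rda*) are tested below against $P=0.5$ with each type of alternative hypothesis.

```r
# Illustrative counts (hypothetical)
x <- 30; n <- 100; alfa <- 0.05

two   <- prop.test(x, n, p = 0.5, alternative = "two.sided", conf.level = 1 - alfa)
less  <- prop.test(x, n, p = 0.5, alternative = "less",      conf.level = 1 - alfa)
great <- prop.test(x, n, p = 0.5, alternative = "greater",   conf.level = 1 - alfa)

# p-values for the three alternatives
c(two.sided = two$p.value, less = less$p.value, greater = great$p.value)

# For a one-sided test the reported interval is also one-sided
less$conf.int
```

Here the sample proportion is below 0.5, so the left-sided p-value is half the two-sided one and the right-sided p-value is close to 1. Note that the confidence interval reported for a one-sided test is one-sided as well.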

<div class="alert alert-block alert-info">
<strong>PRACTICE ON YOUR OWN</strong>
    
- Using the data in *pulsations.rda*:
  1. Calculate the confidence interval for the proportion of women who smoke with a confidence level of 95%.
  2. According to the previous result, can it be assumed that the proportion of women who smoke is the same as the proportion of those who do not? Justify your answer.
  3. Calculate the confidence interval for the proportion of individuals with a Pulse2 exceeding 100 beats among those who ran, with a confidence level of 95%.
  4. Calculate the confidence interval for the proportion of individuals with a height exceeding 180 cm and weight exceeding 85 kg, with a confidence level of 99%.
   
<br>

- A certain tablet medicine has proven effective in relieving an allergy in at least 60% of patients. The manufacturer has developed a soluble version of the product and wants to check if the medicine in this form is equally effective. A sample of 40 people with the allergy is taken. The new product relieved 19 of them. Is there enough evidence to suggest that the introduction of the soluble version has altered the effectiveness of the medicine? Perform the test using $\alpha=0.01$ and find the p-value of the test.

</div>


## 2. Means

### Confidence interval for a population mean

The confidence interval for a mean is also obtained in R with the function that solves hypothesis
tests for means, `t.test`, by setting up a two-sided test. Consult the help documentation for this
function to learn about its accepted arguments.

To familiarize ourselves with the `t.test` function, let’s continue working with the data from the *pulsations.rda* file to estimate the confidence interval for the population mean height at a confidence level of $1-\alpha=0.95$ (that is, $\alpha=0.05$). The command to execute in this case would be:

In [None]:
alfa <- 0.05
t.test(Height, conf.level=1-alfa)

Among other calculated values, the two that interest us are: the confidence interval for the population
mean (*95 percent confidence interval*) and its point estimator (the sample mean, *mean of x*).

R computes the confidence interval using the expression $\bar{x} \pm t_{n-1,\alpha/2}\frac{S}{\sqrt{n}}$, where $S$ is the sample standard deviation. This is the appropriate expression when the population variance is unknown (the most common case).

The quantile $t_{n-1,\alpha/2}$ can be calculated as `qt(1-alfa/2,n-1)`. In other words, we could have obtained the same confidence interval result by executing the following commands:

In [None]:
# Confidence interval for a mean
n <- length(Height)
t <- qt(1-alfa/2,n-1)
lim.inf <- mean(Height) - t*sd(Height)/sqrt(n)
lim.sup <- mean(Height) + t*sd(Height)/sqrt(n)
conclusion <- "The population mean height lies within the interval"
sprintf("%s [%6.3f, %6.3f] with a confidence of %2d%s", conclusion, lim.inf, lim.sup, (1-alfa)*100, "%")

However, the `t.test` function simplifies the calculation (just one line of code) and also provides us
with much more information.
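
The individual results can also be extracted from the object returned by `t.test`, for example its `estimate` and `conf.int` components. The sketch below uses a simulated sample (the vector `heights` is hypothetical, generated with `rnorm`, and is not the `Height` column of the data file) and checks that the interval matches the manual formula.

```r
# Simulated sample, used only for illustration
set.seed(1)
heights <- rnorm(40, mean = 175, sd = 8)
alfa <- 0.05

# t.test returns a list-like object whose components can be accessed with $
res <- t.test(heights, conf.level = 1 - alfa)
res$estimate   # point estimate (sample mean)
res$conf.int   # confidence interval for the population mean

# Same interval from the formula with the Student t quantile
n <- length(heights)
half <- qt(1 - alfa / 2, n - 1) * sd(heights) / sqrt(n)
manual <- mean(heights) + c(-1, 1) * half

all.equal(manual, as.numeric(res$conf.int))
```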

### Hypothesis test on a population mean

Let’s analyze an example now to see how to solve hypothesis tests for a mean using the `t.test` function, based on the p-value.

*Recent studies claim that the mean height of the population under study is greater than 180 cm. Given
the sample collected in the pulsations.rda file, can we accept this hypothesis with a 99% confidence
level?*

In this case, the test being considered is as follows:

$H_0: \mu\geq 180$

$H_1: \mu< 180$

As we can see, this is a one-sided left-tailed test in which the value of $\mu$ to be tested is $180$ cm.
Therefore, the command to execute will be:

In [None]:
alfa <- 0.01
t.test(Height, alternative='less', mu=180, conf.level=1-alfa)

The `alternative` argument is now set to `"less"` to indicate that this is a one-tailed test on the left side. Additionally, the argument `mu=180` has been included with the value to be tested.

In the output, we obtain the value of the test statistic (t), the number of degrees of freedom
(df), and the p-value, among others. Based on the obtained p-value, we can conclude that the analyzed sample provides sufficient evidence to reject the null hypothesis with 99% confidence. Specifically, the p-value (`p-value = 6.845e-08`) is less than the significance level considered in this example (0.01). Therefore, we accept that the mean height of the population under study is less than 180 cm with 99% confidence.
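
To make the link between the test statistic and the p-value explicit, the following sketch reproduces both by hand for a left-sided test and compares them with `t.test`. The sample is simulated (the vector `heights` and the value `mu0` are illustrative assumptions), so the decision it prints says nothing about the real data.

```r
# Simulated sample, used only for illustration
set.seed(2)
heights <- rnorm(40, mean = 175, sd = 8)
alfa <- 0.01
mu0 <- 180   # value tested under H0

res <- t.test(heights, alternative = "less", mu = mu0, conf.level = 1 - alfa)

# Test statistic: standardized distance between the sample mean and mu0
n <- length(heights)
t.stat <- (mean(heights) - mu0) / (sd(heights) / sqrt(n))

# Left-sided p-value: probability of a t value at least this small under H0
p.val <- pt(t.stat, df = n - 1)

c(t = t.stat, p.value = p.val)

# Decision rule based on the p-value
decision <- if (res$p.value < alfa) "reject H0" else "do not reject H0"
decision
```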

<div class="alert alert-block alert-info">
<strong>PRACTICE ON YOUR OWN</strong>
 
- With the data in *pulsations.rda*, calculate:
  1. The confidence interval for the average weight of women with $\alpha=0.05$.
  2. The confidence interval for the mean pulse increment (Pulse2 - Pulse1) for individuals who ran with $\alpha=0.1$.
  3. Additionally, recent studies claim that the average height of women in this population is $\mu=167$
cm. Based on our sample, can we accept this hypothesis with a confidence level of 99%? Justify
the result.
     
<br>

- Over fifty consecutive school days and at the same time, the number of terminals connected to
the internet at a university has been observed. The results are in the file *terminales.dat*. Based on this data,
  1. Provide 95% and 99.5% confidence intervals for the mean number of terminals connected to the
internet. Comment on the results.
  2. Assuming the population follows a Normal distribution, calculate 90% and 95% confidence intervals for the variance of the number of terminals connected to the internet. **Note**: Base R
has no function that calculates confidence intervals for a variance, so you should
program a function that returns the lower and upper bounds of the confidence interval according
to the formulation seen in class. Have it available to apply in any other example.

<br>

- The soil pH is an important variable when designing structures that will be in contact with the
ground. The owner of a potential construction site claims that the soil pH is not higher than 6.5.
Nine soil samples have been taken from the land: 7.3  6.5  6.4  6.1  6  6.5  6.2  5.8  6.7.

   Assuming that the pH variable follows a Normal distribution, answer the following questions:
  1. Find a confidence interval for the mean pH with a significance level of 10%.
  2. Is the owner’s claim accepted with a risk of $\alpha=0.05$?

</div>