## 12.8 Confidence intervals 

### 12.8.1 Confidence intervals for the regression coefficients

Confidence intervals give us more information than the estimated regression coefficient alone, because they take into account the uncertainty in our estimates. They are constructed so that they contain the true population parameter a specified proportion of the time, typically 95%.

Remember that a formal interpretation of the 95% confidence interval is as follows: if the study were repeated many times on independent samples from the same population, and a 95% confidence interval were obtained each time, then approximately 95% of those confidence intervals (about 95 in every 100) would contain the true population parameter. More informally, confidence intervals are often interpreted as a range of plausible values for the true population parameter.
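The repeated-sampling interpretation can be checked by simulation. The sketch below is illustrative only: the true coefficients, sample size and noise level are hypothetical values chosen for the demonstration, not taken from the birthweight data. It uses R's `confint` function, which is introduced later in this section.

```r
# Simulation sketch: long-run coverage of 95% confidence intervals.
# All settings below are hypothetical choices for illustration.
set.seed(1)
true_beta0 <- 2
true_beta1 <- 0.5
n <- 50          # sample size in each simulated study
n_reps <- 1000   # number of simulated studies

covered <- logical(n_reps)
for (r in seq_len(n_reps)) {
  # Draw a new sample from the same population and refit the model
  x <- runif(n, min = 0, max = 10)
  y <- true_beta0 + true_beta1 * x + rnorm(n, sd = 1)
  ci <- confint(lm(y ~ x), parm = "x", level = 0.95)
  # Record whether this interval contains the true slope
  covered[r] <- ci[1, 1] <= true_beta1 && true_beta1 <= ci[1, 2]
}

# Proportion of intervals containing the true slope
coverage <- mean(covered)
print(coverage)
```

The printed proportion should be close to, but not exactly, 0.95, since it is itself estimated from a finite number of simulated studies.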

A 95% confidence interval for the parameter $\beta_1$ is given by:

$$
\hat{\beta}_1 \pm t_{0.025, n-2} SE(\hat{\beta}_1)
$$

where $t_{0.025, n-2}$ represents the $0.975^{th}$ centile (that is, the upper 2.5% point) of a t-distribution with $(n-2)$ degrees of freedom.

If 0 lies in the confidence interval, then it is plausible that $\beta_1=0$, and we would conclude that there is insufficient evidence of an association between the independent variable and the outcome. Note that this is not the same as showing that no association exists. If 0 does not lie within the confidence interval, then there is evidence of an association.

Note that if $n$ is sufficiently large, the t-distribution is well approximated by a normal distribution. In this case, a 95% confidence interval can be found by:

$$
\hat{\beta}_1 \pm 1.96 \times SE(\hat{\beta}_1)
$$
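As a rough check on this approximation, the sketch below compares the $0.975^{th}$ centile of the t-distribution with that of the standard normal distribution for a few degrees of freedom. The particular degrees of freedom are arbitrary illustrative choices.

```r
# 0.975 quantile of the t-distribution for increasing degrees of freedom
dfs <- c(5, 10, 30, 100, 1000)
t_crit <- qt(0.975, df = dfs)

# 0.975 quantile of the standard normal distribution (approximately 1.96)
z_crit <- qnorm(0.975)

print(data.frame(df = dfs,
                 t_crit = round(t_crit, 3),
                 z_crit = round(z_crit, 3)))
```

The t-based critical value is always larger than 1.96, so the normal approximation gives slightly narrower intervals; the difference becomes negligible as the degrees of freedom grow.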

 
*Example:* Calculate a 95% confidence interval for $\beta_1$ (using the values given in the R output above):

Click the button to reveal the solution.

```{toggle}
Solution: 

$$
\begin{aligned}
&\hat{\beta}_1 \pm 1.96 \times SE(\hat{\beta}_1) \\
&= 0.46656 \pm 1.96 \times 0.03054
\end{aligned}
$$

which gives a 95% CI from 0.407 to 0.526.

Since 0 does not lie within the interval, we conclude that there is evidence of an association between birthweight and length of pregnancy (consistent with the results of the hypothesis test).
```

Alternatively, we can obtain confidence intervals using `confint` in R:

```r
# Confidence interval for beta_1
data <- read.csv('https://www.inferentialthinking.com/data/baby.csv')
model1 <- lm(Birth.Weight ~ Gestational.Days, data = data)
confint(model1, parm = 2, level = 0.95)
```

| | 2.5 % | 97.5 % |
|---|---|---|
| Gestational.Days | 0.4066435 | 0.5264702 |
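To see how `confint` relates to the formula for the t-based interval given earlier, the following sketch computes the interval by hand and compares it with the built-in result. It uses simulated data rather than the birthweight data, so that it runs without a download; the simulation settings and variable names are hypothetical.

```r
# Hypothetical simulated data (not the birthweight data used above)
set.seed(42)
n <- 40
x <- rnorm(n, mean = 280, sd = 15)
y <- -10 + 0.45 * x + rnorm(n, sd = 16)
fit <- lm(y ~ x)

# Estimate and standard error of the slope from the coefficient table
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]

# Critical value: 0.975 quantile of the t-distribution with n - 2 df
t_crit <- qt(0.975, df = n - 2)

# Interval from the formula, and the interval returned by confint
manual_ci  <- est + c(-1, 1) * t_crit * se
builtin_ci <- confint(fit, parm = "x", level = 0.95)

print(manual_ci)
print(builtin_ci)
```

The two results should agree, because for a linear model `confint` uses the same t-based calculation.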


We have looked at creating 95% confidence intervals, but there is nothing special about 95%. We may instead be interested in 99% or 90% confidence intervals. A similar approach is used to construct these intervals.
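As an illustration of how the level changes the interval, the sketch below applies the large-sample (normal) formula to the estimate and standard error quoted in the worked example, replacing 1.96 with the critical value for each level. Using the normal approximation here is a simplifying assumption; the t-based interval would be slightly wider.

```r
# Estimate and standard error of the slope, as quoted in the worked example
est <- 0.46656
se  <- 0.03054

levels <- c(0.90, 0.95, 0.99)

# Normal critical value for each level: the (1 - alpha/2) quantile,
# where alpha = 1 - level
z_crit <- qnorm(1 - (1 - levels) / 2)

intervals <- data.frame(
  level = levels,
  lower = est - z_crit * se,
  upper = est + z_crit * se
)
print(round(intervals, 3))
```

A higher confidence level requires a larger critical value and therefore gives a wider interval.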

*Exercise:* Edit the above code to find 95% and 90% confidence intervals for $\alpha_1$ in Model 2. **Hint**: Use the [R documentation](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/confint) for the command `confint` as a guide.

So far we have only discussed conducting inference on the estimated regression coefficients. However, it may also be of interest to determine **confidence intervals for the fitted outcomes**, or **prediction intervals**. The subsequent two sections describe and illustrate these two concepts, respectively. 

### 12.8.2 Confidence intervals for a fitted value 

Confidence intervals can be obtained for the fitted value of $Y$ given a particular value of $X$. We denote the fitted value when $X=x$ as $\hat{y}_x=\hat{\beta}_0+\hat{\beta}_1x$, which estimates the true mean outcome $y_x=\beta_0+\beta_1x$. The expected value of $\hat{y}_x$ is equal to $\beta_0+\beta_1x$ and its variance is given by:

$$
V(\hat{y}_x) = \sigma^2 \left(\frac{1}{n}+\frac{(x-\bar{x})^2}{SS_{xx}}\right)
$$

where $SS_{xx}=\sum_{i=1}^n(x_i-\bar{x})^2$, i.e. the sum of squares of $X$.  

Therefore, replacing the unknown $\sigma$ with its estimate $\hat{\sigma}$ (the residual standard error), the 95% confidence interval is given by:

$$
\hat{y}_x \pm t_{0.025, n-2} \, \hat{\sigma} \sqrt{\frac{1}{n}+ \frac{(x-\bar{x})^2}{SS_{xx}}}
$$
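The formula can be evaluated directly in R. The sketch below does so on simulated data (the simulation settings and variable names are hypothetical, not the birthweight data) and compares the result with R's `predict` function, which is used in the example further on.

```r
# Hypothetical simulated data
set.seed(7)
n <- 40
x <- rnorm(n, mean = 280, sd = 15)
y <- -10 + 0.45 * x + rnorm(n, sd = 16)
fit <- lm(y ~ x)

x0 <- 280                           # value of X at which to evaluate the fit
y_hat <- sum(coef(fit) * c(1, x0))  # fitted value: intercept + slope * x0
sigma_hat <- summary(fit)$sigma     # residual standard error
ss_xx <- sum((x - mean(x))^2)       # sum of squares of X

# Standard error of the fitted value, then the t-based interval
se_fit <- sigma_hat * sqrt(1 / n + (x0 - mean(x))^2 / ss_xx)
manual_ci <- y_hat + c(-1, 1) * qt(0.975, df = n - 2) * se_fit

# The same interval from predict()
builtin <- predict(fit, newdata = data.frame(x = x0),
                   interval = "confidence", level = 0.95)
print(manual_ci)
print(builtin)
```

The lower and upper limits computed by hand should match the `lwr` and `upr` columns returned by `predict`.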

95% confidence intervals can be obtained for values of the independent variable that do not appear in the data. However, the width of the confidence interval increases with the distance of $x$ from the mean $\bar{x}$ (as can be seen from the formula and figure given below). Care must be taken when extrapolating outside the range of the observed data, as this relies on the untestable assumption that linearity continues outside the observed data range.
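The widening of the interval away from the mean can be illustrated numerically. The sketch below uses simulated data (hypothetical settings, not the birthweight data) and evaluates the interval width at the sample mean of $X$ and at points progressively further from it.

```r
# Hypothetical simulated data
set.seed(7)
n <- 40
x <- rnorm(n, mean = 280, sd = 15)
y <- -10 + 0.45 * x + rnorm(n, sd = 16)
fit <- lm(y ~ x)

# Evaluate at the sample mean and at points increasingly far from it;
# the largest offset (80 units) is far outside the simulated data range
x_new <- mean(x) + c(0, 10, 20, 40, 80)
ci <- predict(fit, newdata = data.frame(x = x_new),
              interval = "confidence", level = 0.95)

# Width of each 95% confidence interval
widths <- ci[, "upr"] - ci[, "lwr"]
print(data.frame(x = round(x_new, 1), width = round(widths, 2)))
```

The interval is narrowest at $\bar{x}$ and widens steadily as $x$ moves away from it. The widest interval here corresponds to an extrapolation, where the linearity assumption itself cannot be checked from the data.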

*Example*. The R code below calculates a 95% confidence interval for the fitted value of birthweight of a baby born after 280 gestational days. 
 

```r
# Confidence interval for a fitted value
new.data <- data.frame(Gestational.Days = 280)
predict(model1, newdata = new.data, interval = "confidence", level = 0.95)
```

| fit | lwr | upr |
|---|---|---|
| 119.8818 | 118.9215 | 120.8421 |


The 95% confidence interval for $y_{280}$ is (118.9, 120.8). Informally, we can interpret this as: it is plausible that the true mean birthweight of babies born after 280 gestational days, $y_{280}$, lies between 118.9 and 120.8.
