## 12.5 Examples

We now consider how estimates of the population parameters can be obtained in R using our two examples. Recall that we are interested in investigating the association between birthweight and (1) length of pregnancy and (2) mother's smoking status.

In both examples, birthweight is the outcome. In Example 1, the independent variable is length of pregnancy, $L$ (i.e. number of gestational days) and in Example 2, the independent variable is an indicator variable for whether or not the mother smokes, $S$. 

We will use the ```lm()``` function to perform simple linear regressions in R. See the [documentation](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/lm) for details of how this function works.

### 12.5.1 Example 1

The following model defines our assumed relationship between the length of pregnancy ($L$) and a baby's birthweight ($Y$): 

$$ \text{Model 1: }y_i = \beta_0 + \beta_1 l_i +  \epsilon_i $$

The following code can be used to perform this linear regression in R: 

```r
# Example 1: Investigating the relationship between birthweight and length of pregnancy
# Read the birthweight data, then regress birthweight on gestational days
data <- read.csv('https://www.inferentialthinking.com/data/baby.csv')
model1 <- lm(Birth.Weight ~ Gestational.Days, data = data)
summary(model1)
```


```
Call:
lm(formula = Birth.Weight ~ Gestational.Days, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.348 -11.065   0.218  10.101  57.704 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -10.75414    8.53693   -1.26    0.208    
Gestational.Days   0.46656    0.03054   15.28   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.74 on 1172 degrees of freedom
Multiple R-squared:  0.1661,	Adjusted R-squared:  0.1654 
F-statistic: 233.4 on 1 and 1172 DF,  p-value: < 2.2e-16
```


+ The estimated intercept, $\hat{\beta_0}$, is equal to -10.75. This is interpreted as: the estimated mean birthweight of a child born after 0 gestational days is -10.75oz. Since the study contains no observations with 0 gestational days, this is an extrapolation that relies on the observed data and the assumption of linearity. Estimates based on extrapolation should be interpreted with caution, and here the result makes little sense because the estimated weight is negative. Moreover, no child is born after 0 gestational days, so this intercept is of little interest. Later in the lesson, we will discuss a technique called **centering**, which is often used to make intercepts more interpretable.

+ The estimated slope, $\hat{\beta_1}$, is equal to 0.47. This is interpreted as: the mean birthweight of a baby is estimated to increase by 0.47oz for each additional day of gestation.

+ The estimated residual standard error, $\hat{\sigma}$ is equal to 16.74 (the estimated residual variance is equal to $16.74^2$). This means that the observed outcomes are scattered around the fitted regression line with a standard deviation of 16.74oz.  
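
The interpretations above can be checked with a little arithmetic. The following is a minimal sketch, not part of the original analysis: the coefficient values are copied from the `summary(model1)` output, while the gestation length of 280 days and the small `toy` data set are hypothetical values chosen only for illustration.

```r
# Estimates copied from the summary(model1) output above (oz and days).
b0 <- -10.75414   # estimated intercept
b1 <- 0.46656     # estimated slope

# Hypothetical gestation length, chosen for illustration only.
days <- 280
fitted_280 <- b0 + b1 * days   # estimated mean birthweight at 280 days
week_effect <- b1 * 7          # estimated change in mean birthweight per extra week

print(round(fitted_280, 2))    # 119.88
print(round(week_effect, 2))   # 3.27

# Toy data, invented purely to show how the residual standard error is formed.
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))
fit <- lm(y ~ x, data = toy)

# Residual sum of squares divided by the residual degrees of freedom (n - 2),
# then square-rooted, gives the "Residual standard error" reported by summary().
rss <- sum(residuals(fit)^2)
sigma_by_hand <- sqrt(rss / df.residual(fit))
print(all.equal(sigma_by_hand, summary(fit)$sigma))   # TRUE
```

In the real data the same calculation divides by the 1172 residual degrees of freedom shown in the output.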


### 12.5.2 Example 2

In our second example, the independent variable is binary. To include this in the model, we use a **dummy** variable that takes the value 1 if the mother smokes and 0 if the mother doesn't smoke: 

$$ s_{i} =
\begin{cases}
    1 & \text{ if the $i^{th}$ baby's mother smokes} \\
    0 & \text{ if the $i^{th}$ baby's mother does not smoke}
\end{cases} $$

We then define the following linear regression model:

$$ \text{Model 2: }y_i = \alpha_0 + \alpha_1 s_i + \epsilon_i$$
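
To make the dummy coding concrete, here is a minimal sketch using a hypothetical vector of smoking statuses (the five values are invented for illustration). It builds $s_i$ by hand and then compares it with the 0/1 column that R constructs when a factor appears in a model formula.

```r
# Hypothetical smoking statuses for five mothers (illustrative values only).
smoker <- c("False", "True", "True", "False", "True")

# s_i = 1 if the i-th baby's mother smokes, 0 otherwise.
s <- as.integer(smoker == "True")
print(s)   # 0 1 1 0 1

# The design matrix for a formula containing factor(smoker) holds an
# intercept column of 1s and a 0/1 column for the "True" level.
X <- model.matrix(~ factor(smoker))
print(colnames(X))           # "(Intercept)" "factor(smoker)True"
print(all(X[, 2] == s))      # TRUE
```

The first level in alphabetical order (`"False"` here) is treated as the reference group, which is why the dummy column is labelled with `True`.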

When including a binary (or categorical) variable in a linear regression in R, we can tell R to treat it as a factor variable using ```factor()```: 


```r
# Example 2: Investigating the relationship between birthweight and mother's smoking status.
# factor() makes R treat Maternal.Smoker as a categorical variable
model2 <- lm(Birth.Weight ~ factor(Maternal.Smoker), data = data)
summary(model2)
```


```
Call:
lm(formula = Birth.Weight ~ factor(Maternal.Smoker), data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-68.085 -11.085   0.915  11.181  52.915 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 123.0853     0.6645 185.221   <2e-16 ***
factor(Maternal.Smoker)True  -9.2661     1.0628  -8.719   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.77 on 1172 degrees of freedom
Multiple R-squared:  0.06091,	Adjusted R-squared:  0.06011 
F-statistic: 76.02 on 1 and 1172 DF,  p-value: < 2.2e-16
```


+ $\hat{\alpha_0} = 123.09$. This is interpreted as the estimated mean birthweight (in oz) of a child with "dummy" variable equal to 0, i.e. it is the estimated mean birthweight of children whose mothers do not smoke. 

+ $\hat{\alpha_1}=-9.27$. The mean birthweight is estimated to decrease by 9.27oz per unit increase in the "dummy" variable. A unit increase in the dummy variable equates to moving from the non-smoking group to the smoking group, so we can interpret this as the estimated difference in mean birthweight between the two groups (smoking minus non-smoking). 

+ $\hat{\sigma}=17.77$. The observed outcomes are scattered around the fitted values (here, the two estimated group means) with a standard deviation of 17.77oz. 
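
With a single binary predictor, the fitted model simply reproduces the two group means: $\hat{\alpha_0}$ is the mean of the reference group and $\hat{\alpha_0} + \hat{\alpha_1}$ is the mean of the other group. The sketch below illustrates this under stated assumptions: the coefficient values are copied from the `summary(model2)` output, and the six-row `toy` data set is invented solely to demonstrate the equivalence.

```r
# Estimates copied from the summary(model2) output above (oz).
a0 <- 123.0853   # estimated mean birthweight, non-smoking group
a1 <- -9.2661    # estimated difference in means (smoking minus non-smoking)

# Estimated mean birthweight for babies whose mothers smoke.
mean_smokers <- a0 + a1
print(round(mean_smokers, 2))   # 113.82

# Toy data, invented purely to check that lm() returns the group means.
toy <- data.frame(
  weight = c(120, 125, 118, 110, 112, 115),
  smoker = c("False", "False", "False", "True", "True", "True")
)
fit <- lm(weight ~ factor(smoker), data = toy)
est <- unname(coef(fit))

mean_false <- mean(toy$weight[toy$smoker == "False"])
mean_true  <- mean(toy$weight[toy$smoker == "True"])

print(all.equal(est[1], mean_false))            # intercept = reference-group mean
print(all.equal(est[1] + est[2], mean_true))    # intercept + slope = other group mean
```

Running `tapply(data$Birth.Weight, data$Maternal.Smoker, mean)` on the lesson's data should therefore agree with the two values derived from the regression output.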

Note that the outputs for Models 1 and 2 consist of a number of other values we have yet to discuss. We address these in the subsequent sections. In Section 3.4 we will learn how to conduct statistical inference on the estimated parameters, which will help us to interpret the standard errors, $t$-values and $p$-values in the above output. Later in Section 5 we will discuss analysis of variance which will help us to interpret the "R-squared" values and the $F$-test.  