# Chapter 4 Linear Regression

Post-menopausal women who exercise less tend to have lower bone mineral density
(BMD), putting them at increased risk for fractures. But they also tend to be older,
frailer, and heavier, which may explain the association between exercise and BMD.
People whose diet is high in fat have, on average, higher low-density lipoprotein (LDL)
cholesterol, a risk factor for coronary heart disease (CHD). But they are also more likely to smoke and be
overweight, factors which are also strongly associated with CHD risk. Increasing
body mass index (BMI) predicts higher levels of hemoglobin $HbA_{1c}$, a marker
for poor control of glucose levels; however, older age and ethnic background also
predict higher $HbA_{1c}$.

These are all examples of potentially complex relationships in observational data
where a continuous outcome of interest, such as BMD, LDL, or $HbA_{1c}$, is related
to a risk factor in analyses that do not take account of other factors. But in each case
the risk factor of interest is associated with a number of other factors, or potential
<font color=red>confounders</font>, which also predict the outcome. So the simple association we observe
between the factor of interest and the outcome may be explained by the other factors.

Similarly, in experiments, including clinical trials, factors other than treatment
may need to be taken into account. If the randomization is properly implemented,
treatment assignment is on average not associated with any prognostic variable,
so confounding is usually not an issue. However, in stratified and other complex
study designs, multipredictor analysis is used to ensure that CIs, hypothesis tests,
and P-values are valid. For example, it is now standard practice to account for
clinical center in the analysis of multisite clinical trials, often using the random
effects methodology to be introduced in Chap. 7. And with continuous outcomes,
stratifying on a strong predictor in both design and analysis can account for a
substantial proportion of outcome variability, increasing the efficiency of the study. Multipredictor analysis may also be used when baseline differences are apparent
between the randomized groups, to account for potential confounding of treatment
assignment.

Another way the predictor–outcome relationship can depend on other factors
is that an association may not be the same in all parts of the population. For
example, hormone therapy (HT) has a smaller beneficial effect on LDL levels among
postmenopausal women who are also taking statins, and its effect on BMD may
be greater in younger postmenopausal women. These are examples of interaction,
where the association of a factor of primary interest with an outcome is modified by
another factor.

The problem of sorting out complex relationships is not restricted to continuous
outcomes; the same issues arise with the binary outcomes covered in Chap. 5,
survival times in Chap. 6, and repeated measures in Chap. 7. A general statistical
approach to these problems is needed. 

The topic of this chapter is the multipredictor linear regression model, a flexible
and widely used tool for assessing the joint relationships of multiple predictors
with a continuous outcome variable. We begin by illustrating some basic ideas
in a simple example (Sect. 4.1). Then in Sect. 4.2, we present the assumptions of
the multipredictor linear regression model and show how the simple linear model
reviewed in Chap. 3 is extended to accommodate multiple predictors. Section 4.3
shows how categorical predictors with multiple levels are coded and interpreted.
Sections 4.4–4.6 describe how multipredictor regression models can be used to deal
with confounding, mediation, and interaction, respectively. Section 4.7 introduces
some simple methods for assessing the fit of the model to the data and how well the
data conform to the underlying assumptions of the model. Section 4.8 introduces
sample size, power, and minimum detectable effect calculations for the multiple
linear model. In Chap. 9, we use a potential outcomes view of causal effects to show
how and under what conditions multipredictor regression models might be used to
estimate them, and in Chap. 10 we discuss the difficult problem of which variables
and how many to include in a multipredictor model.

## 4.1 Example: Exercise and Glucose

Glucose levels above 125 mg/dL are diagnostic of diabetes, while levels in the range
from 100 to 125 mg/dL signal increased risk of progressing to this serious and
increasingly widespread condition. So it is of interest to determine whether exercise,
a modifiable lifestyle factor, would help people reduce their glucose levels and thus
avoid diabetes.

To answer this question definitively would require a randomized clinical trial,
a difficult and expensive undertaking. As a result, research questions like this are
often initially looked at using observational data. But this is complicated by the fact
that people who exercise differ in many ways from those who do not, and some of
the other differences might explain any unadjusted association between exercise and
glucose level.

##### Load the required libraries

In [1]:
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 
plt.style.use('seaborn-white')  # renamed 'seaborn-v0_8-white' in matplotlib 3.6+
warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression as sk_lm
from sklearn.metrics import explained_variance_score, mean_squared_error,r2_score
from sklearn.preprocessing import scale

##### Load the data `Hers` and check the dimension and NAs

In [3]:
hers=pd.read_stata('./Data/Chapter4/hersdata.dta')
hers.head(5)

Unnamed: 0,HT,age,raceth,nonwhite,smoking,drinkany,exercise,physact,globrat,poorfair,...,LDL,HDL,TG,tchol1,LDL1,HDL1,TG1,SBP,DBP,age10
0,placebo,70,African American,yes,no,no,no,much more active,good,no,...,122.400002,52.0,73.0,201.0,137.600006,48.0,77.0,138,78.0,7.0
1,placebo,62,African American,yes,no,no,no,much less active,good,no,...,241.600006,44.0,107.0,216.0,150.600006,48.0,87.0,118,70.0,6.2
2,hormone therapy,69,White,no,no,no,no,about as active,good,no,...,166.199997,57.0,154.0,254.0,156.0,66.0,160.0,134,78.0,6.9
3,placebo,64,White,no,yes,yes,no,much less active,good,no,...,116.199997,56.0,159.0,207.0,122.599998,57.0,137.0,152,72.0,6.4
4,placebo,65,White,no,no,no,no,somewhat less active,good,no,...,150.600006,42.0,107.0,235.0,172.199997,35.0,139.0,175,95.0,6.5


In [4]:
hers.shape

(2763, 37)

In [5]:
hers.isna().sum()

HT            0
age           0
raceth        0
nonwhite      0
smoking       0
drinkany      2
exercise      0
physact       0
globrat       3
poorfair      3
medcond       0
htnmeds       0
statins       0
diabetes      0
dmpills       0
insulin       0
weight        2
BMI           5
waist         2
WHR           3
glucose       0
weight1     150
BMI1        153
waist1      151
WHR1        151
glucose1    150
tchol         4
LDL          11
HDL          11
TG            4
tchol1      150
LDL1        155
HDL1        155
TG1         150
SBP           0
DBP           1
age10         0
dtype: int64

In [6]:
hers.diabetes.value_counts()

no     2032
yes     731
Name: diabetes, dtype: int64

In [7]:
hers.exercise.value_counts()

no     1695
yes    1068
Name: exercise, dtype: int64

##### Table 4.1 Unadjusted regression of glucose on exercise

In [8]:
glureg=smf.ols('glucose~exercise', data=hers, subset=hers['diabetes']=='no').fit()
print(glureg.summary())

                            OLS Regression Results                            
Dep. Variable:                glucose   R-squared:                       0.007
Model:                            OLS   Adj. R-squared:                  0.007
Method:                 Least Squares   F-statistic:                     14.97
Date:                Fri, 25 Jan 2019   Prob (F-statistic):           0.000113
Time:                        08:55:04   Log-Likelihood:                -7502.4
No. Observations:                2032   AIC:                         1.501e+04
Df Residuals:                    2030   BIC:                         1.502e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          97.3610      0.282    3

Table 4.1 shows a simple linear model using a measure of exercise to predict
baseline glucose levels among 2,032 participants without diabetes in the HERS clinical trial of hormone therapy (HT) (Hulley et al. 1998). Women with diabetes
are excluded because the research question is whether exercise might help to prevent
progression to diabetes among women at risk, and because the causal determinants
of glucose may be different in that group. Furthermore, glucose levels are far more
variable among diabetics, a violation of the assumption of homoscedasticity, as we
show in Sect. 4.7.3 below. <font color=red>The coefficient estimate (coef) for exercise shows
that average baseline glucose levels were about 1.7 mg/dL lower among women who
exercised at least three times a week than among women who exercised less. This
difference is statistically significant ($t=-3.87$; $P<0.0005$).</font>
    
However, women who exercise are slightly younger, a little more likely to use
alcohol, and in particular have lower average BMI, all factors associated with
glucose levels. This implies that the lower average glucose we observe among
women who exercise could be due at least in part to differences in these other
predictors. Under these conditions, it is important that our estimate of the difference
in average glucose levels associated with exercise be “adjusted” for the effects
of these potential confounders of the unadjusted association. Ideally, adjustment
using a multipredictor regression model provides an estimate of the causal effect
of exercise on average glucose levels, by holding the other variables constant. In
Chap. 9, the rationale for estimation of causal effects using multipredictor regression
models is explained in more detail.

In [9]:
glures=smf.ols('glucose~exercise+age+drinkany+BMI', data=hers, subset=hers['diabetes']=='no').fit()
print(glures.summary())

                            OLS Regression Results                            
Dep. Variable:                glucose   R-squared:                       0.072
Model:                            OLS   Adj. R-squared:                  0.070
Method:                 Least Squares   F-statistic:                     39.22
Date:                Fri, 25 Jan 2019   Prob (F-statistic):           1.14e-31
Time:                        08:55:04   Log-Likelihood:                -7416.8
No. Observations:                2028   AIC:                         1.484e+04
Df Residuals:                    2023   BIC:                         1.487e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          78.9624      2.593     

From Table 4.2, we see that in a multiple regression model that also includes
(that is, adjusts for) age, alcohol use (drinkany), and BMI, average glucose is
estimated to be only about 1 mg/dL lower among women who exercise (95% CI 0.1–1.8,
$P=0.027$), holding the other three factors constant. The multipredictor model
also shows that average glucose levels are about 0.7 mg/dL higher among alcohol
users than among nonusers. Average levels also increase by about 0.5 mg/dL per unit
increase in BMI, and by 0.06 mg/dL for each additional year of age. Each of these
associations is statistically significant after adjustment for the other predictors in
the model. Furthermore, the association of each of the four predictors with glucose
levels is adjusted for the effects of the other three, in the sense of taking account of
its correlation with the other predictors and their adjusted associations with glucose levels.

In summary, the multipredictor model for glucose levels shows that the
unadjusted association between exercise and glucose is partly but not completely
explained by BMI, age, and alcohol use, and that exercise remains a statistically
significant predictor of glucose levels after adjustment for these three other factors—
that is, when they are held constant by the multipredictor regression model.
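
The attenuation described above can be reproduced in a small simulation. The following is a minimal sketch on synthetic data, not the HERS data: the data-generating values (a true adjusted exercise effect of $-1$ mg/dL, a BMI slope of 0.5) and the helper `ols` are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data-generating process (not HERS): BMI confounds the
# exercise-glucose association.
bmi = rng.normal(27.0, 5.0, n)
# Women with higher BMI are less likely to exercise.
p_exercise = 1.0 / (1.0 + np.exp(0.15 * (bmi - 27.0)))
exercise = rng.binomial(1, p_exercise)
# Assumed true values: exercise lowers glucose by 1 mg/dL at fixed BMI,
# and each unit of BMI raises glucose by 0.5 mg/dL.
glucose = 85.0 - 1.0 * exercise + 0.5 * bmi + rng.normal(0.0, 9.0, n)


def ols(y, *predictors):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


unadjusted = ols(glucose, exercise)[1]
adjusted = ols(glucose, exercise, bmi)[1]
print(f"unadjusted exercise coefficient: {unadjusted:.2f}")
print(f"adjusted exercise coefficient:   {adjusted:.2f}")
```

Because exercisers in this simulation have lower BMI on average, the unadjusted coefficient absorbs part of the BMI effect and lies further from zero than the adjusted one.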

Still, we have been careful to retain the language of association rather than cause
and effect, and in Chaps. 9 and 10 we will suggest that adjustment for additional potential
confounders would be needed before we could consider a causal interpretation
of the result.

## 4.2 Multiple Linear Regression Model
Confounding thus motivates models in which the average value of the outcome is
allowed to depend on multiple predictors instead of just one. Many basic elements
of the multiple linear model carry over from the simple linear model, which was
reviewed in Sect. 3.3. In Sect. 9.1, we show how this model is potentially suited to
estimating causal relationships between predictors and outcomes.

### 4.2.1 Systematic Part of the Model

For the simple linear model with a single predictor, the regression line is defined by

\begin{equation*} 
\begin{split}
E[y\mid x]&=\text{average value of outcome y given predictor value x}\\
&=\beta_0+\beta_1 x
\end{split}
\end{equation*}
In the multiple regression model, this generalizes to
\begin{equation*}
E[y\mid x]=\beta_0+\beta_1 x_1+\beta_2 x_2+\cdots+\beta_px_p \tag{4.2}
\end{equation*} 
where $x$ represents the collection of $p$ predictors $x_1, x_2, \dots, x_p$ in the model, and $\beta_1, \beta_2, \dots, \beta_p$ are the corresponding regression coefficients. 

The right-hand side of model (4.2) has a relatively simple form, <font color=red>a linear
combination of the predictors and coefficients.</font> Analogous linear combinations of
predictors and coefficients, often referred to as the linear predictor, are used in
all the other regression models covered in this book. Despite the simple form of
(4.2), the multipredictor linear regression model is a flexible tool, and with the
elaborations to be introduced later in this chapter, usually allows us to represent with
considerable realism how the average value of the outcome varies systematically
with the predictors. In Sect. 4.7, we will consider methods for examining the
adequacy of this part of the model and for improving it.

#### Interpretation of Adjusted Regression Coefficients
In (4.2), the coefficient $\beta_j, j=1,\dots,p$ gives <font color=red>the change in $E[y\mid x]$ for an increase
of one unit in predictor $x_j$, holding other factors in the model constant</font>; each of the estimates is adjusted for the effects of all the other predictors. As in the simple linear
model, the intercept $\beta_0$ gives the value of $E[y\mid x]$ when all the predictors are equal to
zero; 'centering' of the continuous predictors can make the intercept interpretable. If confounding has been persuasively ruled out, we may be willing to interpret the
adjusted coefficient estimates as representing causal effects.
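
As a minimal sketch of this interpretation, the code below fits a two-predictor model to synthetic data and checks two things: that raising one predictor by one unit, with the other held fixed, changes the fitted mean by exactly its coefficient, and that centering the predictors leaves the slopes unchanged while turning the intercept into the fitted mean at the average predictor values. The variable names and simulated values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical predictors and outcome (illustrative values only).
age = rng.normal(67.0, 6.0, n)
bmi = rng.normal(27.0, 5.0, n)
y = 60.0 + 0.1 * age + 0.5 * bmi + rng.normal(0.0, 9.0, n)


def ols(y, *predictors):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


b0, b_age, b_bmi = ols(y, age, bmi)


def fitted_mean(age_value, bmi_value):
    return b0 + b_age * age_value + b_bmi * bmi_value


# One-unit increase in BMI with age held at 70: the change equals b_bmi.
change = fitted_mean(70.0, 28.0) - fitted_mean(70.0, 27.0)

# Centering: same slopes, intercept equals the sample mean of y.
c0, c_age, c_bmi = ols(y, age - age.mean(), bmi - bmi.mean())

print(f"b_bmi = {b_bmi:.3f}, change in fitted mean = {change:.3f}")
print(f"uncentered intercept = {b0:.2f}, centered intercept = {c0:.2f}")
```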

### 4.2.2 Random Part of the Model
As before, individual observations of the outcome $y_i$ are modeled as varying by an
error term $\varepsilon_i$ about an average determined by their predictor values $x_i$:
\begin{equation*}
\begin{split}
y_i&=E[y_i\mid x_i]+\varepsilon_i\\
&=\beta_0+\beta_1x_{1i}+\beta_2x_{2i}+\cdots+\beta_p x_{pi}+\varepsilon_i
\end{split}
\end{equation*}
where $x_{ji}$ is the value of predictor variable $x_j$ for observation $i$. We again assume
that $\varepsilon_i \sim \text{i.i.d. } \mathscr{N}(0, \sigma^2_{\varepsilon})$; that is, $\varepsilon_i$ is normally distributed with mean zero and the
same standard deviation $\sigma_\varepsilon$ at every value of $x$, and its values are statistically
independent.

#### Fitted Values, Sums of Squares, and Variance Estimators
From (4.2), it is clear that the fitted values $\hat{y}_i$, defined for the simple linear model in
(3.4), now depend on all p predictors and the corresponding regression coefficient
estimates, rather than just one predictor and two coefficients. The resulting sums of
squares and variance estimators introduced in Sect. 3.3 are otherwise unchanged in
the multipredictor model.

#### Variance of Adjusted Regression Coefficients
Including multiple predictors does affect the variance of $\hat{\beta}_j$, which now depends on
an additional factor $r_j$, the multiple correlation of $x_j$ with the other predictors in the
model. Specifically,
\begin{equation}
Var(\hat{\beta}_j)=\dfrac{\sigma^2_{y\mid x}}{(n-1)\sigma^2_{x_j}(1-r^2_j)} \tag{4.4}
\end{equation}
where, as before, $\sigma^2_{y\mid x}$ is the residual variance of the outcome and $\sigma^2_{x_j}$ is the variance
of $x_j$; $r_j$ is equivalent to $r=\sqrt{R^2}$ from a multiple linear model in which $x_j$ is regressed on all the other predictors. The term $1/(1-r_j^2)$ is known as the <font color=DodgerBlue>variance inflation factor (VIF)</font>, since $Var(\hat{\beta}_j)$ is increased to the extent that $x_j$ is correlated with other predictors in the model.

However, inclusion of other predictors, especially powerful ones, also tends to decrease $\sigma^2_{y\mid x}$, the residual or unexplained variance of the outcome. Thus, the overall impact of including other predictors on $Var(\hat{\beta}_j)$ depends on both the correlation of $x_j$ with the other predictors and how much additional variability they explain. In the glucose example, the standard error of the coefficient estimate for exercise declines
slightly, from 0.44 to 0.43, after adjustment for age, alcohol use, and BMI. This
reflects the reduction in residual standard deviation previously described, as well as
a variance inflation factor in the adjusted model of only 1.03.
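
Formula (4.4) can be checked numerically. The sketch below, on synthetic data with assumed coefficients, computes the estimated variance of one coefficient in two ways: from the usual matrix expression $\hat{\sigma}^2_{y\mid x}(X'X)^{-1}$, and from (4.4) using the sample variance of $x_j$ and the $R^2$ from regressing $x_j$ on the other predictors. The helper `r_squared` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400

# Synthetic predictors; x2 is deliberately correlated with x1.
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
x3 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x3 + rng.normal(scale=1.5, size=n)

X = np.column_stack([np.ones(n), x1, x2, x3])
p = X.shape[1] - 1  # number of predictors, excluding the intercept

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - (p + 1))  # estimated residual variance

# Matrix form of the coefficient variances.
var_matrix = sigma2 * np.linalg.inv(X.T @ X)


def r_squared(target, others):
    """R^2 from regressing target on the other predictors plus an intercept."""
    Z = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    res = target - Z @ coef
    return 1.0 - res @ res / np.sum((target - target.mean()) ** 2)


# Formula (4.4) for the coefficient of x1.
r2_j = r_squared(x1, np.column_stack([x2, x3]))
vif_j = 1.0 / (1.0 - r2_j)
var_44 = sigma2 / ((n - 1) * np.var(x1, ddof=1) * (1.0 - r2_j))

print(f"VIF for x1: {vif_j:.3f}")
print(f"variance from (4.4): {var_44:.6f}, matrix form: {var_matrix[1, 1]:.6f}")
```

The two variance estimates agree to numerical precision, and the VIF exceeds 1 because `x1` is correlated with `x2`.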

#### t-Tests and Confidence Intervals
The t-tests of the null hypothesis $H_0: \beta_j=0$ and CIs for $\beta_j$ carry over almost unchanged for each of the $\beta$s estimated by the model, only using (4.4) rather than (3.11) to compute the standard error of the regression coefficient, and comparing the
t-statistic to a t-distribution with $n-(p+1)$ degrees of freedom ($p$ is the number of predictors in the model, and an extra degree of freedom is used in estimation of
the intercept $\beta_0$).

However, there is a substantial difference in interpretation, since the results are
now adjusted for other predictors. Thus in rejecting the null hypothesis $H_0: \beta_j=0$ we would be making the stronger claim that, in the population, $x_j$ predicts $y$, holding the other factors in the model constant. Similarly, the CI for $\beta_j$ refers to the parameter which takes account of the other $p-1$ predictors in the model.

We have just seen that $Var(\hat{\beta}_j)$ may not be increased by adjustment. However, in
Sect. 4.4 we will see that including other predictors in order to control confounding
commonly has the effect of attenuating the unadjusted estimate of the association
of $x_j$ with $y$. This reflects the fact that the population parameter being estimated
in the adjusted model is often closer to zero than the parameter estimated in the
unadjusted model, since some of the unadjusted association is explained by other
predictors. If this is the case, then even if $Var(\hat{\beta}_j)$ is unchanged, it may be more difficult to reject $H_0:\beta_j=0$ in the adjusted model. In the glucose example, the
adjusted coefficient estimate for exercise is considerably smaller than the unadjusted
estimate. As a result the t-statistic is reduced from $-3.87$ to $-2.22$, which is still statistically
significant, but less highly so.

### 4.2.3 Generalization of $R^2$ and r

The coefficient of determination $R^2=\dfrac{MSS}{TSS}$ retains its interpretation as <font color=red>the
proportion of the total variability of the outcome that can be accounted for by the
predictor variables.</font> Under the model, the fitted values summarize all the information
that the predictors supply about the outcome. Thus, the multiple correlation
coefficient $r=\sqrt{R^2}$ now represents <font color=red>the correlation between the outcome y and the
fitted values $\hat{y}$.</font> It is easy to confirm this identity by extracting the fitted values from
a regression model and computing their correlation with the outcome (Problem 4.3). In the glucose example, $R^2$ increases from less than $1\%$ in the unadjusted model to
$7\%$ after inclusion of age, alcohol use, and BMI, a substantial increase in relative if
not absolute terms.
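
A minimal sketch of this identity on synthetic data (the simulated coefficients are assumptions for illustration): we compute $R^2$ as $MSS/TSS$ and compare it with the squared correlation between the outcome and the fitted values.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300

# Synthetic two-predictor example.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.0 * x1 + 0.5 * x2 + rng.normal(scale=2.0, size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
mss = np.sum((fitted - y.mean()) ** 2)   # model sum of squares
r_squared = mss / tss

# Multiple correlation coefficient: correlation of outcome with fitted values.
r = np.corrcoef(y, fitted)[0, 1]

print(f"R^2 = {r_squared:.4f}, corr(y, fitted)^2 = {r ** 2:.4f}")
```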

### 4.2.4 Standardized Regression Coefficients

In Sect. 3.3.9, we saw that the slope coefficient $\beta_1$ in a simple linear model is
systematically related to the Pearson correlation coefficient (3.12); specifically, $r=\beta_1\frac{\sigma_x}{\sigma_y}$, where $\sigma_x$ and $\sigma_y$ are the standard deviations of the predictor and
outcome. Moreover, we pointed out that the scale-free correlation coefficient makes
it easier to compare the strength of association between the outcome and various
predictors across single-predictor models. In the context of a multipredictor model, <font color=DodgerBlue>standardized regression coefficients</font> play this role. The standardized regression coefficient $\beta_j^s$ for predictor $x_j$ is defined in analogy to (3.12) as
$$
\beta_j^s=\beta_j \dfrac{\sigma_{x_j}}{\sigma_y}, \tag{4.5}
$$
where $\sigma_{x_j}$ and $\sigma_y$ are the standard deviations of predictor $x_j$ and the outcome $y$. <font color=red>These standardized coefficient estimates are what would be obtained from the
regression if the outcome and all the predictors were first rescaled to have standard
deviation 1.</font> Thus, they give the change in standard deviation units in the average
value of $y$ per standard deviation increase in the predictor. Standardized coefficients
make it easy to compare the strength of association of different continuous
predictors with the outcome within the same model.
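
The equivalence stated above can be verified directly. In this sketch on synthetic data (all values are illustrative assumptions), the standardized coefficients are computed once from (4.5) and once by refitting the model after rescaling the outcome and predictors to standard deviation 1.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300

# Synthetic predictors on very different scales.
x1 = rng.normal(50.0, 10.0, n)
x2 = rng.normal(0.0, 0.2, n)
y = 5.0 + 0.3 * x1 + 8.0 * x2 + rng.normal(0.0, 4.0, n)


def ols(y, *predictors):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def standardize(v):
    """Rescale to mean 0 and standard deviation 1."""
    return (v - v.mean()) / v.std(ddof=1)


beta = ols(y, x1, x2)

# Formula (4.5): beta_j * sd(x_j) / sd(y).
sd_x = np.array([x1.std(ddof=1), x2.std(ddof=1)])
beta_std_formula = beta[1:] * sd_x / y.std(ddof=1)

# Refit on standardized outcome and predictors.
beta_std_refit = ols(standardize(y), standardize(x1), standardize(x2))[1:]

print("from (4.5):", np.round(beta_std_formula, 4))
print("from refit:", np.round(beta_std_refit, 4))
```

The raw coefficients (about 0.3 and 8) are not comparable because the predictors are on different scales; the standardized versions are.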

For binary predictors, however, the unstandardized regression coefficients may
be more directly interpretable than the standardized estimates, since the unstandardized
coefficients for such predictors simply estimate the differences in the average
value of the outcome between the two groups defined by the predictor, holding the
other predictors in the model constant.

## 4.3 Categorical Predictors

In Chap. 3, the simple regression model was introduced with a single continuous
predictor. However, predictors in both simple and multipredictor regression models
can be binary, categorical, or discrete numeric, as well as continuous numeric.

### 4.3.1 Binary Predictors
The exercise variable in the model for glucose levels shown in Table 4.1 is an example
of a binary predictor. A good way to code such a variable is as an indicator or dummy
variable, taking the value 1 for the group with the characteristic of interest, and 0
for the group without the characteristic. With this coding, the regression coefficient
corresponding to this variable has a straightforward interpretation as the increase or
decrease in average outcome levels in the group with the characteristic, with respect
to the reference group. 

To see this, consider the simple regression model for average glucose values:
\begin{equation}
E[\text{glucose}\mid x]=\beta_0+\beta_1\text{exercise} \tag{4.6}
\end{equation}
With the indicator coding of exercise (1 = yes, 0 = no), the average value of
glucose is $\beta_0+\beta_1$ among women who do exercise, and $\beta_0$ among the rest. It follows directly that $\beta_1$ is the difference in average glucose levels between the two groups. This is consistent with our more general definition of $\beta_j$ as <font color=red>the change in $E[y\mid x]$ for a one unit increase in $x_j$.</font> Furthermore, the t-test of the null hypothesis $H_0: \beta_1=0$ is a test of whether the between-group difference in average glucose levels differs
from zero. In fact, this unadjusted model is equivalent to a t-test comparing glucose
levels in women who do and do not exercise. A final point: when coded this way, the
average value of the exercise variable gives the proportion of women who exercise.

A commonly used alternative coding for binary variables is (1 = yes, 2 = no).
With this coding, the coefficient $\beta_1$ retains its interpretation as the between-group
difference in average glucose levels, but now among women who do not exercise as
compared to those who do, a less intuitive way to think of the difference. Furthermore,
with this coding the coefficient $\beta_0$ has no straightforward interpretation, and
the average value of the binary variable is not equal to the proportion of the sample
in either group. However, overall model fit, including fitted values of the outcome,
standard errors, and P-values, are the same with either coding (Problem 4.1).
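
The two codings can be compared in a short sketch. The data below are simulated (the group means and sample size are assumptions, not HERS values); the code checks that with 0/1 coding the intercept and slope reproduce the reference-group mean and the between-group difference, and that 1/2 coding flips the sign of the slope while leaving the fitted values unchanged.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300

# Simulated binary predictor and outcome (illustrative values only).
exercise = rng.binomial(1, 0.4, n)            # indicator coding: 1 = yes, 0 = no
glucose = 97.0 - 1.7 * exercise + rng.normal(0.0, 9.0, n)


def ols(y, x):
    """Intercept and slope of a simple least-squares regression."""
    X = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


mean_yes = glucose[exercise == 1].mean()
mean_no = glucose[exercise == 0].mean()

# Indicator coding: intercept = mean of "no" group, slope = yes minus no.
b0, b1 = ols(glucose, exercise)

# Alternative coding: 1 = yes, 2 = no.
alt = np.where(exercise == 1, 1, 2)
a0, a1 = ols(glucose, alt)

print(f"0/1 coding: intercept {b0:.2f}, slope {b1:.2f}")
print(f"1/2 coding: intercept {a0:.2f}, slope {a1:.2f}")
print(f"proportion who exercise: {exercise.mean():.2f}")
```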

### 4.3.2 Multilevel Categorical Predictors

The 2,763 women in the HERS cohort also responded to a question about how
physically active they considered themselves compared to other women their age.
The five-level response variable physact ranged from “much less active” to
“much more active,” and was coded in order from 1 to 5. This is an example of
an ordinal variable, as described in Chap. 2, with categories that are meaningfully
ordered, but separated by increments that may not be accurately reflected in the
numerical codes used to represent them. For example, responses “much less active”
and “somewhat less active” may represent a larger difference in physical activity
than “somewhat less active” and “about as active.”

Multilevel categorical variables can also be nominal, in the sense that there is
no intrinsic ordering in the categories. Examples include ethnicity, marital status,
occupation, and geographic region. With nominal variables, it is even clearer that
the numeric codes often used to represent the variable in the database cannot be
treated like the values of a numeric variable such as glucose.

Categories are usually set up to be mutually exclusive and exhaustive, so that
every member of the population falls into one and only one category. In that case,
both ordinal and nominal categories define subgroups of the population.

Both types of categorical variables are easily accommodated in multipredictor
linear and other regression models, using indicator or dummy variables. As with
binary variables, where two categories are represented in the model by a single
indicator variable, categorical variables with $K\geq 2$ levels are represented by $K-1$ indicators, one for each level of the variable except a baseline or reference level.
Suppose level 1 is chosen as the baseline level. Then, for $k=2,3, \dots, K,$ indicator
variable k has value 1 for observations belonging to the category k, and 0 for
observations belonging to any of the other categories. Note that for $K=2$, this also describes the binary case, in which the “no” response defines the baseline or
reference group and the indicator variable takes on value 1 only for the “yes” group.

The four indicator variables, which can be created in Python with the pandas
function `get_dummies()`, are named `2.physact` through `5.physact` here. Table 4.3 shows the values of the four indicator variables
corresponding to the five response levels of physact. Each level of physact is
defined by a unique pattern in the four indicator variables.

In [10]:
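hers['physact'] = hers['physact'].astype('category')
# Relabel the five ordered levels as 1.physact ... 5.physact. rename_categories
# replaces direct assignment to .cat.categories, which pandas 2.0 no longer allows.
hers['i.physact'] = hers['physact'].cat.rename_categories(
    ['1.physact', '2.physact', '3.physact', '4.physact', '5.physact'])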
hers.head(5)

Unnamed: 0,HT,age,raceth,nonwhite,smoking,drinkany,exercise,physact,globrat,poorfair,...,HDL,TG,tchol1,LDL1,HDL1,TG1,SBP,DBP,age10,i.physact
0,placebo,70,African American,yes,no,no,no,much more active,good,no,...,52.0,73.0,201.0,137.600006,48.0,77.0,138,78.0,7.0,5.physact
1,placebo,62,African American,yes,no,no,no,much less active,good,no,...,44.0,107.0,216.0,150.600006,48.0,87.0,118,70.0,6.2,1.physact
2,hormone therapy,69,White,no,no,no,no,about as active,good,no,...,57.0,154.0,254.0,156.0,66.0,160.0,134,78.0,6.9,3.physact
3,placebo,64,White,no,yes,yes,no,much less active,good,no,...,56.0,159.0,207.0,122.599998,57.0,137.0,152,72.0,6.4,1.physact
4,placebo,65,White,no,no,no,no,somewhat less active,good,no,...,42.0,107.0,235.0,172.199997,35.0,139.0,175,95.0,6.5,2.physact


##### Table 4.3 Coding of indicators for a multilevel categorical variable
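
| physact | 2.physact | 3.physact | 4.physact | 5.physact |
|---|---|---|---|---|
| 1 (much less active) | 0 | 0 | 0 | 0 |
| 2 (somewhat less active) | 1 | 0 | 0 | 0 |
| 3 (about as active) | 0 | 1 | 0 | 0 |
| 4 (somewhat more active) | 0 | 0 | 1 | 0 |
| 5 (much more active) | 0 | 0 | 0 | 1 |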

Furthermore, the corresponding $\beta$s have a straightforward interpretation. For the
moment, consider a simple regression model in which the five levels of physact
are the only predictors. Then,
\begin{equation}
E[\text{glucose}\mid x]=\beta_0+\beta_2 \text{2.physact}+\cdots+\beta_5\text{5.physact} \tag{4.7}
\end{equation}

For clarity, the $\beta$s in (4.7) are indexed in accord with the levels of physact, so $\beta_1$ does not appear in the model. Letting the four indicators take on values of $0$ or $1$ as appropriate for the five groups defined by physact, we obtain
\begin{equation}
E[\text{glucose}\mid x]=\begin{cases}
\beta_0 & physact=1\\
\beta_0+\beta_2 & physact=2\\
\beta_0+\beta_3 & physact=3\\
\beta_0+\beta_4 & physact=4\\
\beta_0+\beta_5 & physact=5
\end{cases}\tag{4.8}
\end{equation}
From (4.8), it is clear that the intercept $\beta_0$ gives <font color=red>the value of $E[\text{glucose}\mid x]$ in the reference or much less active group (physact=1)</font>. Then it is just a matter of
subtracting the first line of (4.8) from the second to see that $\beta_2$ gives <font color=red>the difference in the average glucose in the somewhat less active group (physact=2) as compared
to the much less active group</font>. Accordingly, the t-test of $H_0: \beta_2=0$ is a test of whether average glucose levels are the same in the much less and somewhat less active groups (physact=1 and 2). And similarly for $\beta_3, \beta_4$, and $\beta_5$.
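
The interpretation of (4.7) and (4.8) can be illustrated with a short sketch on simulated data. The group means in `true_means` and the numeric `codes` are hypothetical, not HERS values. The indicators are built with `pd.get_dummies(..., drop_first=True)` so that level 1 serves as the reference, and the columns are then renamed to match the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n = 600

# Simulated five-level predictor coded 1..5 and a continuous outcome.
codes = rng.integers(1, 6, size=n)
true_means = np.array([100.0, 99.0, 98.0, 97.0, 96.0])  # hypothetical group means
glucose = true_means[codes - 1] + rng.normal(0.0, 9.0, n)

# K - 1 = 4 indicators; drop_first=True omits the reference level (1).
indicators = pd.get_dummies(pd.Series(codes), drop_first=True, dtype=float)
indicators.columns = [f"{k}.physact" for k in indicators.columns]

# Fit model (4.7) by least squares.
X = np.column_stack([np.ones(n), indicators.to_numpy()])
beta, *_ = np.linalg.lstsq(X, glucose, rcond=None)

# Observed group means, for comparison with (4.8).
group_means = np.array([glucose[codes == k].mean() for k in range(1, 6)])

print("intercept vs. mean of reference group:", round(beta[0], 3), round(group_means[0], 3))
for name, b, m in zip(indicators.columns, beta[1:], group_means[1:]):
    print(f"{name}: coefficient {b:.3f}, mean difference {m - group_means[0]:.3f}")
```

Each coefficient matches the difference between the corresponding group mean and the mean of the reference group, as (4.8) implies.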

### 4.3.4 Multiple Pairwise Comparisons Between Categories

### 4.3.5 Testing for Trend Across Categories

#### Departures from Linear Trend

## 4.4 Confounding

### 4.4.1 Range of Confounding Patterns

### 4.4.2 Confounding Is Difficult to Rule Out

### 4.4.3 Adjusted Versus Unadjusted $\hat{\beta}$s

### 4.4.4 Example: BMI and LDL

## 4.5 Mediation

### 4.5.1 Indirect Effects via the Mediator

### 4.5.2 Overall and Direct Effects

### 4.5.3 Percent Explained

### 4.5.4 Example: BMI, Exercise, and Glucose

### 4.5.5 Pitfalls in Evaluating Mediation

#### Temporality

#### Problems with PE

## 4.6 Interaction

### 4.6.1 Example: Hormone Therapy and Statin Use

### 4.6.2 Example: BMI and Statin Use

### 4.6.3 Interaction and Scale

### 4.6.4 Example: Hormone Therapy and Baseline LDL

### 4.6.5 Details

## 4.7 Checking Model Assumptions and Fit

### 4.7.1 Linearity

#### Component-Plus-Residual Plots

#### Smooth Transformations of the Predictors

#### Restricted Cubic Splines

#### Categorizing the Predictor

#### Nonlinearity, Interaction, and Covariate Overlap

### 4.7.2 Normality

#### Residual Plots

#### Testing for Departures from Normality

#### Normalizing Transformations of the Outcome

#### Alternatives to Transformation: Bootstrap and GLMs

### 4.7.3 Constant Variance

#### Residual Plots

#### Subsample Variances

#### Testing for Departures from Constant Variance

#### When Departures May Cause Trouble

#### Variance-Stabilizing Outcome Transformations

#### Robust Standard Errors

#### GLMs

### 4.7.4 Outlying, High Leverage, and Influential Points

#### DFBETAs

#### Addressing Influential Points

### 4.7.5 Interpretation of Results for Log Transformed Variables

#### Log Transformation of the Predictor

#### Log Transformation of the Outcome

#### Log Transformation of Both Predictor and Outcome

### 4.7.6 When to Use Transformations

## 4.8 Sample Size, Power, and Detectable Effects

### 4.8.1 Calculations Using Standard Errors Based on Published Data

## 4.9 Summary

## 4.11 Problems