## Multiple Regression Analysis: Further Issues

In this notebook, we cover some issues regarding the implementation of regression analyses. We first discuss more flexible specifications of regression equations such as variable scaling, standardization, polynomials, and interactions. They can be conveniently included in the **formula** and used in the **statsmodels** OLS estimation. Then, we discuss model predictions and their confidence intervals. After that, we introduce **qualitative regressors** (**categorical variables**) in the regression model.

#### Topics:

1. Model Formula / Specification
2. Prediction
3. Qualitative Regressors

### 1. Model Formula / Specification

If we run a regression in **statsmodels** using a syntax like

``` python
smf.ols('y ~ x1 + x2 + x3', data = sample)
```

the expression **y ~ x1 + x2 + x3** is referred to as a model **formula** or **specification**. It is a compact symbolic way to describe our regression equation. The dependent variable is separated from the regressors by a '~', and the regressors are separated by a '+', indicating that they enter the equation in a linear fashion. A constant is added by default. Such a formula can be specified in more complex ways to describe different kinds of regression equations. We will cover the most important ones in this section.

#### Data Scaling: Arithmetic Operations within a Formula

Wooldridge (2019, Section 6.1) discusses how different scaling of the variables in a model affects the parameter estimates and other statistics. As an example, consider a model relating birth weight to the mother's cigarette smoking during pregnancy and to family income. The basic model equation is

$$bwght = \beta_0 + \beta_1 cigs + \beta_2 faminc + u$$

which translates into formula syntax as **bwght ~ cigs + faminc**.

If we want to measure the weight in pounds rather than ounces, there are two ways to implement the rescaling in *Python*. We can

- Define a different variable like **bwghtlbs = bwght / 16** and use this variable in the formula: **bwghtlbs ~ cigs + faminc**
- Specify this rescaling directly in the formula: **I(bwght/16) ~ cigs + faminc**

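The latter approach can be more convenient. Note that the **I(...)** wrapper marks any part of the formula in which we specify arithmetic transformations, so that operators such as '/' and '+' inside it are evaluated as ordinary arithmetic rather than as formula syntax.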

If we want to measure the number of cigarettes smoked per day in packs, we could again define a new variable **pack = cigs / 20** and use it as a regressor, or simply specify the formula **bwght ~ I(cigs/20) + faminc**. Here, the importance of placing the **I** function correctly is easy to see. If we specified the formula **bwght ~ I(cigs/20 + faminc)** instead, we would have a nonsense model with only one regressor: the sum of the packs smoked and the income.
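
The difference between the two placements of **I(...)** can be inspected without fitting a model. The following is a minimal sketch that assumes the **patsy** package (the formula engine that **statsmodels** formulas have traditionally relied on) is installed; the toy numbers are made up and are not taken from the *bwght* data set.

```python
import numpy as np
from patsy import dmatrix

# Hypothetical toy data (NOT the bwght data set)
data = {"cigs": np.array([0.0, 10.0, 20.0, 40.0]),
        "faminc": np.array([13.5, 27.5, 47.5, 65.0])}

# Two regressors: packs per day and family income
separate = dmatrix("I(cigs/20) + faminc", data)

# One regressor: the sum of packs per day and family income
combined = dmatrix("I(cigs/20 + faminc)", data)

# Column names of the two design matrices (the intercept is included)
print(separate.design_info.column_names)
print(combined.design_info.column_names)

# Number of columns: 3 (intercept + 2 regressors) versus 2 (intercept + 1)
print(separate.shape[1], combined.shape[1])
```

Looking at the design matrix in this way is a quick check that a formula builds the regressors we intend before any estimation is run.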

The example below demonstrates these features. As discussed in Wooldridge (2019, Section 6.1), dividing the dependent variable by 16 changes all coefficients by the same factor $\frac{1}{16}$, and dividing a regressor by 20 changes its coefficient by the factor 20. Other statistics like $R^2$ are unaffected.

In [2]:
# Import the dependencies
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

In [3]:
# Load the data set 'bwght'
bwght = woo.dataWoo('bwght')

In [4]:
# Regress and report coefficients:
reg = smf.ols(formula = 'bwght ~ cigs + faminc', data = bwght)
results = reg.fit()

In [6]:
# Weight in pounds, manual way:
bwght['bwght_lbs'] = bwght['bwght'] / 16
reg_lbs = smf.ols(formula = 'bwght_lbs ~ cigs + faminc', data = bwght)
results_lbs = reg_lbs.fit()

In [7]:
# Weight in pounds, direct way:
reg_lbs2 = smf.ols(formula = 'I(bwght/16) ~ cigs + faminc', data = bwght)
results_lbs2 = reg_lbs2.fit()

In [9]:
# Packs of cigarettes:
reg_packs = smf.ols(formula = 'bwght ~ I(cigs / 20) + faminc', data = bwght)
results_packs = reg_packs.fit()

In [13]:
# Compare results:
table = pd.DataFrame({"No Scaling": round(results.params, 4),
                     "Manual (Pounds)": round(results_lbs.params, 4),
                     "Direct (Pounds)": round(results_lbs2.params, 4),
                     "Packs of Cigarettes": round(results_packs.params, 4)})

print(f'Compare Results: \n{table}\n')

Compare Results: 
              No Scaling  Manual (Pounds)  Direct (Pounds)  Packs of Cigarettes
I(cigs / 20)         NaN              NaN              NaN              -9.2682
Intercept       116.9741           7.3109           7.3109             116.9741
cigs             -0.4634          -0.0290          -0.0290                  NaN
faminc            0.0928           0.0058           0.0058               0.0928
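
The table shows the coefficients only. To also check the claim that $R^2$ is unaffected by rescaling, here is a self-contained sketch on synthetic data. It uses plain **numpy** least squares instead of **statsmodels**, and the variable names merely mimic the example; the helper `ols` and all numbers are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for cigs, faminc and bwght (NOT the real data)
cigs = rng.poisson(2, n).astype(float)
faminc = rng.uniform(5, 65, n)
bwght = 117 - 0.5 * cigs + 0.1 * faminc + rng.normal(0, 20, n)

def ols(y, *regressors):
    """Return OLS coefficients (intercept first) and R-squared."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return b, r2

b, r2 = ols(bwght, cigs, faminc)                    # no scaling
b_lbs, r2_lbs = ols(bwght / 16, cigs, faminc)       # y in pounds
b_packs, r2_packs = ols(bwght, cigs / 20, faminc)   # cigs in packs

# Rescaling y by 1/16 rescales every coefficient by 1/16
print(np.allclose(b_lbs, b / 16))
# Rescaling a regressor by 1/20 multiplies its coefficient by 20
print(np.isclose(b_packs[1], 20 * b[1]))
# R-squared is the same in all three regressions
print(np.isclose(r2, r2_lbs), np.isclose(r2, r2_packs))
```

With the fitted **statsmodels** results from the cells above, the same comparison could be made by looking at the `rsquared` attribute of each results object.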



#### Standardization: Beta Coefficients

A specific arithmetic operation is standardization. A variable is standardized by subtracting its mean and dividing by its standard deviation. For example, the standardized dependent variable $y$ and regressor $x_1$ are

$$z_y = \frac{y - \bar{y}}{sd(y)}$$

and

$$z_{x_1} = \frac{x_1 - \bar{x}_1}{sd(x_1)}$$

If the regression model only contains standardized variables, the coefficients have a special interpretation. They measure by how many *standard deviations* $y$ changes as the respective independent variable increases by *one standard deviation*. Somewhat confusingly given the notation used here, they are sometimes referred to as beta coefficients.

In *Python*, we can use the same type of arithmetic transformations to subtract the mean and divide by the standard deviation. It can be done more conveniently by defining and using a function **scale()** directly for all variables we want to standardize. 

#### Wooldridge, Example 6.1: Effects of Pollution on Housing Prices

We are interested in how air pollution (nox) and other neighborhood characteristics affect the value of a house. A model using standardization for all variables is expressed in a formula as

``` python
price_sc ~ 0 + nox_sc + crime_sc + rooms_sc + dist_sc + stratio_sc
```

where **variable_sc** denotes the scaled version of **variable**. The output shows the parameter estimates of this model. The housing price drops by 0.34 standard deviations as the air pollution increases by one standard deviation.

In [14]:
# Import the dependencies
import wooldridge as woo
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

In [15]:
# Import the data set 'hprice2'
hprice2 = woo.dataWoo('hprice2')

In [16]:
# Define a function for the standardization
def scale(x):
    x_mean = np.mean(x)
    x_var = np.var(x, ddof = 1)
    x_scaled = (x - x_mean) / np.sqrt(x_var)
    return x_scaled

In [17]:
# Standardize the variables
hprice2['price_sc'] = scale(hprice2['price'])
hprice2['nox_sc'] = scale(hprice2['nox'])
hprice2['crime_sc'] = scale(hprice2['crime'])
hprice2['rooms_sc'] = scale(hprice2['rooms'])
hprice2['dist_sc'] = scale(hprice2['dist'])
hprice2['stratio_sc'] = scale(hprice2['stratio'])

In [18]:
# Build the regression model based on the standardized variables
reg = smf.ols(formula = 
              'price_sc ~ 0 + nox_sc + crime_sc + rooms_sc + dist_sc + stratio_sc',
             data = hprice2)
results = reg.fit()

In [19]:
# Print Regression Table
table = pd.DataFrame({'Betas': round(results.params, 4),
                     'SE': round(results.bse, 4),
                     "t-Stat": round(results.tvalues, 4),
                     "pValue": round(results.pvalues, 4)})

print(f'Regression Table: \n{table}\n')

Regression Table: 
             Betas      SE   t-Stat  pValue
nox_sc     -0.3404  0.0445  -7.6511     0.0
crime_sc   -0.1433  0.0307  -4.6693     0.0
rooms_sc    0.5139  0.0300  17.1295     0.0
dist_sc    -0.2348  0.0430  -5.4641     0.0
stratio_sc -0.2703  0.0299  -9.0274     0.0
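
The formula above drops the constant with **0 +**. This costs nothing: standardized variables all have mean zero, so the estimated intercept is zero anyway. The sketch below checks this on synthetic data with plain **numpy** least squares. It also checks the standard relation between a beta coefficient and the coefficient in original units, $\hat{b}_j = \hat{\beta}_j \cdot sd(x_j) / sd(y)$. All data and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# Synthetic data (illustration only)
x1 = rng.normal(5, 2, n)
x2 = rng.normal(0, 10, n)
y = 3 + 1.5 * x1 - 0.2 * x2 + rng.normal(0, 4, n)

def scale(x):
    """Standardize: subtract the mean, divide by the sample sd."""
    return (x - np.mean(x)) / np.std(x, ddof=1)

# Regression in original units
X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]

# Regression of standardized y on standardized regressors, WITH an intercept
Z = np.column_stack([np.ones(n), scale(x1), scale(x2)])
beta = np.linalg.lstsq(Z, scale(y), rcond=None)[0]

# The intercept is zero up to floating-point error
print(abs(beta[0]) < 1e-8)

# Beta coefficients implied by the original-unit slopes
sd_x = np.array([np.std(x1, ddof=1), np.std(x2, ddof=1)])
implied = b[1:] * sd_x / np.std(y, ddof=1)
print(np.allclose(beta[1:], implied))
```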



#### Logarithms

We have already seen in a previous section that we can include the **numpy** function **log** directly in formulas to represent logarithmic and semi-logarithmic models. A simple example of a partially logarithmic model and its formula would be

$$log(y) = \beta_0 + \beta_1 log(x_1) + \beta_2 x_2 + u$$

which can be expressed as **np.log(y) ~ np.log(x1) + x2**.

The script below shows this again for the house price example. As the air pollution *nox* increases by *one percent*, the house price drops by about 0.72 *percent*. As the number of rooms increases by *one*, the value of the house increases by roughly 30.6%. Wooldridge (2019, Section 6.2) discusses how the latter value is only an approximation and the actual estimated effect is (exp(0.306) - 1) = 0.358, which is 35.8%.

In [None]:
# Import the dependencies
import wooldridge as woo
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

In [None]:
# Import the data set 'hprice2'
hprice2 = woo.dataWoo('hprice2')

In [None]:
# Build the log-log regression model
reg = smf.ols(formula = 'np.log(price) ~ np.log(nox) + rooms', data = hprice2)
results = reg.fit()

In [None]:
# Print regression table
table = pd.DataFrame({'Betas': round(results.params, 4),
                     'Standard Errors': round(results.bse, 4),
                     't Statistics': round(results.tvalues, 4),
                     'p Value': round(results.pvalues, 4)})
print(f'Regression Table: \n{table}\n')
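
The gap between the approximate and the exact percentage effect of one more room can be reproduced with a few lines of arithmetic. This is a small sketch; the helper name `exact_pct_change` is hypothetical, and the coefficient 0.306 is the value quoted in the text above.

```python
import math

def exact_pct_change(coef, delta=1.0):
    """Exact percentage change in y when x changes by delta in a log-level model."""
    return 100 * (math.exp(coef * delta) - 1)

b_rooms = 0.306  # coefficient on rooms as quoted in the text

approx = 100 * b_rooms              # approximation: 100 * coefficient
exact = exact_pct_change(b_rooms)   # exact: 100 * (exp(coefficient) - 1)

print(round(approx, 1))  # 30.6
print(round(exact, 1))   # 35.8
```

The approximation is close for small coefficients and becomes noticeably less accurate as the coefficient grows, which is why the two numbers differ by about five percentage points here.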

#### Quadratics and Polynomials

Specifying quadratic terms or higher powers of regressors can be a useful way to make a model more flexible by allowing the partial effects or (semi-)elasticities to decrease or increase with the value of the regressor.

Instead of creating additional variables containing the squared value of a regressor, in *Python* we can simply add **I(x\*\*2)** to a formula. Higher-order terms are specified accordingly. A simple cubic model is

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + u$$

and its corresponding formula is **y ~ x + I(x\*\*2) + I(x\*\*3)**.

In the example below, we estimate the housing price model with a quadratic term in *rooms* and present detailed results including *t* statistics and their *p* values. The quadratic term of *rooms* has a significantly positive coefficient $\hat{\beta}_4$, implying that the semi-elasticity increases with more rooms. The negative coefficient for *rooms* ($\hat{\beta}_3$) and the positive coefficient for *rooms*$^2$ imply that for a "small" number of rooms the price decreases with the number of rooms, and for "large" values it increases. The number of rooms implying the smallest price can be found as

$$rooms^* = -\frac{\hat{\beta}_3}{2\hat{\beta}_4} \approx 4.4$$

In [None]:
# Import the dependencies
import wooldridge as woo
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

In [None]:
# Import the data set 'hprice2'
hprice2 = woo.dataWoo('hprice2')

In [None]:
# Build the regression model with quadratic term of rooms
reg = smf.ols(formula = 'np.log(price) ~ np.log(nox) + np.log(dist) + rooms + I(rooms**2) + stratio', 
              data = hprice2)
results = reg.fit()

In [None]:
# Print regression table
table = pd.DataFrame({'Betas': round(results.params, 4),
                     'Standard Errors': round(results.bse, 4),
                     't Statistics': round(results.tvalues, 4),
                     'p Value': round(results.pvalues, 4)})
print(f'Regression Table: \n{table}\n')
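
The turning point follows from setting the derivative of the quadratic part to zero: $\beta_3 + 2\beta_4 \, rooms = 0$. The sketch below wraps this in a small helper. The function names and the two coefficient values are illustrative assumptions, chosen only to be consistent with the turning point of about 4.4 mentioned above; in the notebook, the values would be read from `results.params['rooms']` and `results.params['I(rooms ** 2)']`.

```python
def turning_point(b_linear, b_quadratic):
    """Value of x at which b_linear * x + b_quadratic * x**2 has zero slope."""
    if b_quadratic == 0:
        raise ValueError("no turning point without a quadratic term")
    return -b_linear / (2 * b_quadratic)

def slope_at(x, b_linear, b_quadratic):
    """Derivative of the quadratic part evaluated at x."""
    return b_linear + 2 * b_quadratic * x

# Illustrative coefficient values (assumed, not read from the output)
b_rooms, b_rooms_sq = -0.545, 0.062

rooms_star = turning_point(b_rooms, b_rooms_sq)
print(round(rooms_star, 1))  # 4.4

# Below the turning point the slope is negative, above it is positive
print(slope_at(4, b_rooms, b_rooms_sq) < 0)
print(slope_at(6, b_rooms, b_rooms_sq) > 0)
```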

#### Hypothesis Testing

A natural question to ask is whether a regressor has additional statistically significant explanatory power in a regression model, given all the other regressors. In simple model specifications, this question can be answered by a simple *t* test, so the results for all regressors are available with a quick look at the standard regression table. When working with polynomials or other specifications, the influence of one regressor is captured by several parameters. We can test its significance with an *F* test of the joint null hypothesis that all of these parameters are equal to zero. Let's revisit our previous example model

$$log(price) = \beta_0 + \beta_1 log(nox) + \beta_2 log(dist) + \beta_3 rooms + \beta_4 rooms^2 + \beta_5 stratio + u$$

The significance of *rooms* can be assessed with an *F* test of $H_0: \beta_3 = \beta_4 = 0$. As discussed, such a test can be performed with the **f_test()** method from **statsmodels**.

In [None]:
# Import the dependencies
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

In [None]:
# Import the data set 'hprice2'
hprice2 = woo.dataWoo('hprice2')

In [None]:
# Build the regression model with quadratic term of rooms
reg = smf.ols(formula = 'np.log(price) ~ np.log(nox) + np.log(dist) + rooms + I(rooms**2) + stratio', 
              data = hprice2)
results = reg.fit()

In [None]:
# Implement F Test for rooms
hypotheses = ['rooms = 0', 'I(rooms ** 2) = 0']
ftest = results.f_test(hypotheses)
fstat = float(np.squeeze(ftest.statistic))  # works for a scalar or a 1x1 array
fpval = ftest.pvalue

print(f'F Statistic: {fstat}\n')
print(f'F Test p-value: {fpval}\n')
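
For exclusion restrictions like this one, the *F* statistic can also be computed by hand from the sums of squared residuals of the unrestricted and the restricted model: $F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-k-1)}$. The following sketch does this with plain **numpy** on synthetic data. The data and the helper `ssr` are assumptions for illustration and do not reproduce the *hprice2* results.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

# Synthetic data: x plays the role of rooms, w is another regressor
x = rng.uniform(3, 9, n)
w = rng.normal(0, 1, n)
y = 1 + 0.5 * w - 0.6 * x + 0.07 * x**2 + rng.normal(0, 0.3, n)

def ssr(y, X):
    """Sum of squared OLS residuals from regressing y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    return resid @ resid

ones = np.ones(n)
X_ur = np.column_stack([ones, w, x, x**2])  # unrestricted model
X_r = np.column_stack([ones, w])            # restricted: x and x**2 dropped

ssr_ur = ssr(y, X_ur)
ssr_r = ssr(y, X_r)

q = 2                        # number of restrictions
df_resid = n - X_ur.shape[1]  # n - k - 1 of the unrestricted model

F = ((ssr_r - ssr_ur) / q) / (ssr_ur / df_resid)
print(f'F statistic: {F:.2f}')
```

Under the classical assumptions, this statistic is compared with the $F_{q,\,n-k-1}$ distribution, which is what **f_test()** uses to report its *p* value.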

#### Interaction Terms

Models with interaction terms allow the effect of one variable $x_1$ to depend on the value of another variable $x_2$. A simple model including an interaction term would be

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + u$$

Of course, we can implement this in *Python* by defining a new variable containing the product of the two regressors. But again, a direct specification in the model formula is more convenient. The expression **x1:x2** within a formula adds the interaction term $x_1 x_2$. Even more conveniently, **x1\*x2** adds not only the interaction but also both original variables, allowing for a very concise syntax. So the model can be specified in *Python* as either of the two formulas:

``` python
y ~ x1 + x2 + x1:x2
```

Or

``` python
y ~ x1*x2
```

If one variable $x_1$ is interacted with a set of other variables, they can be grouped by parentheses to allow for a compact syntax. For example, the shortest way to express the model equation

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + u$$

in *Python* syntax is

``` python
y ~ x1*(x2 + x3)
```
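
To see which regressors the compact syntax generates, we can build the design matrix directly. This is a minimal sketch that assumes the **patsy** package is installed; the toy data are made up for illustration.

```python
import numpy as np
from patsy import dmatrix

# Hypothetical toy data
data = {"x1": np.array([1.0, 2.0, 3.0, 4.0]),
        "x2": np.array([0.5, 1.5, 0.0, 2.0]),
        "x3": np.array([3.0, 1.0, 4.0, 1.0])}

# Compact and fully written-out versions of the same right-hand side
compact = dmatrix("x1*(x2 + x3)", data)
explicit = dmatrix("x1 + x2 + x3 + x1:x2 + x1:x3", data)

# Columns generated by the compact formula
names = compact.design_info.column_names
print(names)

# Both formulas produce the same design matrix
print(np.allclose(np.asarray(compact), np.asarray(explicit)))

# The interaction column is the elementwise product of the two variables
col = names.index("x1:x2")
print(np.allclose(np.asarray(compact)[:, col], data["x1"] * data["x2"]))
```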

**Wooldridge, Example 6.3: Effects of Attendance on Final Exam Performance**

This example analyzes a model including a standardized dependent variable, quadratic terms, and an interaction. Standardized scores in the final exam are explained by class attendance, prior performance, and an interaction term:

$$stndfnl = \beta_0 + \beta_1 atndrte + \beta_2 priGPA + \beta_3 ACT + \beta_4 priGPA^2 + \beta_5 ACT^2 + \beta_6 (priGPA \cdot atndrte) + u$$

We estimate this model. The effect of attending classes is

$$\frac{\partial stndfnl}{\partial atndrte} = \beta_1 + \beta_6 priGPA$$

For the average $\overline{priGPA} = 2.59$, we estimate this partial effect to be around 0.0078. We test the null hypothesis that this effect is zero using a simple *F* test. With a *p* value of 0.0034, this hypothesis can be rejected at all common significance levels.

In [None]:
# Import the dependencies
import wooldridge as woo
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

In [None]:
# Import the data set 'attend'
attend = woo.dataWoo('attend')

In [None]:
# Find the number of observations
n = attend.shape[0]

In [None]:
# Build the regression model
reg = smf.ols(formula = 'stndfnl ~ atndrte*priGPA + ACT + I(priGPA ** 2) + I(ACT ** 2)',
             data = attend)
results = reg.fit()

In [None]:
# Print regression table
table = pd.DataFrame({'Betas': round(results.params, 4),
                     'Standard Errors': round(results.bse, 4),
                     't Statistics': round(results.tvalues, 4),
                     'p Value': round(results.pvalues, 4)})
print(f'Regression Table: \n{table}\n')

In [None]:
# Estimate for partial effect at priGPA = 2.59
b = results.params
partial_effect = b['atndrte'] + 2.59 * b['atndrte:priGPA']
print(f'Partial Effect: {partial_effect}\n')

In [None]:
# F test for partial effect at priGPA = 2.59
hypotheses = 'atndrte + 2.59 * atndrte:priGPA = 0'
ftest = results.f_test(hypotheses)
fstat = float(np.squeeze(ftest.statistic))  # works for a scalar or a 1x1 array
fpval = ftest.pvalue

print(f'F Statistic: {fstat}\n')
print(f'F Test p-value: {fpval}\n')
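
A test of a single linear combination such as $\beta_1 + 2.59\,\beta_6 = 0$ can also be carried out by hand. The estimate is $r'\hat{\beta}$, its standard error is $\sqrt{r'\hat{V}r}$ with $\hat{V}$ the estimated coefficient covariance matrix, and the *F* statistic for one restriction is the squared *t* statistic. The sketch below illustrates this with plain **numpy** on synthetic data. The variable names only mimic the example, the model is simplified to one interaction without quadratic terms, and all numbers are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Synthetic stand-ins for atndrte and priGPA (NOT the attend data)
x1 = rng.normal(80, 15, n)
x2 = rng.normal(2.6, 0.5, n)
y = 2 - 0.007 * x1 - 1.6 * x2 + 0.006 * x1 * x2 + rng.normal(0, 1, n)

# OLS with intercept, both variables, and their interaction
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y

# Estimated covariance matrix of the coefficients
resid = y - X @ b
sigma2 = resid @ resid / (n - X.shape[1])
V = sigma2 * XtX_inv

# Partial effect of x1 at x2 = 2.59: coefficient on x1 + 2.59 * interaction
r = np.array([0.0, 1.0, 0.0, 2.59])
theta = r @ b
se = np.sqrt(r @ V @ r)

t_stat = theta / se
F_stat = t_stat ** 2  # F statistic of a single restriction

print(f'Partial effect: {theta:.4f}')
print(f'Standard error: {se:.4f}')
print(f't statistic: {t_stat:.3f}, F statistic: {F_stat:.3f}')
```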