# Multiple Regression Including Interaction Terms


When fitting a multiple regression model we are assessing whether scores on a scale dependent variable (DV) can be predicted from scores on several independent variables (IVs). We extend the linear regression equation by adding a new term for each IV. We can include each IV in the model separately but we may also believe that two (or more) of our IVs interact in some meaningful way that influences their relationship to the DV.
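As a sketch of the equations involved, with two IVs written as $x_1$ and $x_2$ (symbols introduced here purely for illustration; they are not used elsewhere in this notebook), the main-effects model and the model with an interaction term can be written as:

$$
\begin{aligned}
\text{Main effects only:}\quad & y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \\
\text{With interaction:}\quad & y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 x_2) + \varepsilon
\end{aligned}
$$

In the second equation the slope for $x_1$ is $\beta_1 + \beta_3 x_2$, so it changes with the value of $x_2$. Setting $\beta_3 = 0$ recovers the main-effects model.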

When running multiple regression analyses using the statsmodels software package there are a couple of different ways to include interaction terms in the regression model. In this notebook I will demonstrate how to conduct a multiple regression analysis that includes the interaction between two of the IVs as an additional term in the model.

In [1]:
# Importing key software libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats

In [2]:
# Importing the data. Again, I will use the insurance dataset to illustrate. 

df = pd.read_csv("insurance.csv")

df.head()

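| Unnamed: 0 | age | sex | bmi | children | smoker | region | expenses |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.92 |
| 1 | 18 | male | 33.8 | 1 | no | southeast | 1725.55 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.46 |
| 3 | 33 | male | 22.7 | 0 | no | northwest | 21984.47 |
| 4 | 32 | male | 28.9 | 0 | no | northwest | 3866.86 |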


In [3]:
# Fitting the multiple regression model using statsmodels. 
# In this first model I will only include the two IVs of interest separately

mod_1 = smf.ols(formula = "expenses ~ age + bmi", data = df).fit()

mod_1.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | expenses | R-squared: | 0.117 |
| Model: | OLS | Adj. R-squared: | 0.116 |
| Method: | Least Squares | F-statistic: | 88.66 |
| Date: | Fri, 08 Sep 2023 | Prob (F-statistic): | 6.99e-37 |
| Time: | 16:22:45 | Log-Likelihood: | -14394.0 |
| No. Observations: | 1338 | AIC: | 28790.0 |
| Df Residuals: | 1335 | BIC: | 28810.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -6437.3475 | 1744.025 | -3.691 | 0.000 | -9858.676 | -3016.019 |
| age | 241.9001 | 22.298 | 10.849 | 0.000 | 198.158 | 285.642 |
| bmi | 333.3909 | 51.371 | 6.490 | 0.000 | 232.614 | 434.168 |

| | | | |
|---|---|---|---|
| Omnibus: | 321.702 | Durbin-Watson: | 2.01 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 592.058 |
| Skew: | 1.511 | Prob(JB): | 2.73e-129 |
| Kurtosis: | 4.222 | Cond. No. | 287.0 |


Above, we can see from the output that we have a significant regression model and that both age and bmi are significant predictors of the DV (insurance expenses). To extend this, we can also include the interaction between age and bmi to see whether it is significant. There are two ways to specify an interaction term in statsmodels when using R-style formulas. The first, which I will call the matrix version, uses the colon operator (age:bmi). The second, the multiplicative version, uses the asterisk (age*bmi). I demonstrate each of these in turn below.

In [4]:
# First using the matrix approach. Here note that the interaction term is specified by separating the two IVs with a colon. 

mod_int1 = smf.ols(formula = "expenses ~ age + bmi + age:bmi", data = df).fit()

mod_int1.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | expenses | R-squared: | 0.118 |
| Model: | OLS | Adj. R-squared: | 0.116 |
| Method: | Least Squares | F-statistic: | 59.22 |
| Date: | Fri, 08 Sep 2023 | Prob (F-statistic): | 6.13e-36 |
| Time: | 16:24:54 | Log-Likelihood: | -14394.0 |
| No. Observations: | 1338 | AIC: | 28800.0 |
| Df Residuals: | 1334 | BIC: | 28820.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -9150.4861 | 4634.062 | -1.975 | 0.049 | -1.82e+04 | -59.644 |
| age | 312.9752 | 114.657 | 2.730 | 0.006 | 88.047 | 537.903 |
| bmi | 421.8625 | 149.127 | 2.829 | 0.005 | 129.314 | 714.411 |
| age:bmi | -2.2998 | 3.639 | -0.632 | 0.528 | -9.439 | 4.839 |

| | | | |
|---|---|---|---|
| Omnibus: | 321.813 | Durbin-Watson: | 2.008 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 592.5 |
| Skew: | 1.51 | Prob(JB): | 2.19e-129 |
| Kurtosis: | 4.227 | Cond. No. | 19700.0 |


Again we get a significant model (the F-statistic is significant). Note that using the formula specification above we get a coefficient for age individually (the main effect of age), a coefficient for bmi individually (the main effect of bmi) and a coefficient for the interaction between age and bmi (age:bmi). In this case we can see that although age and bmi are individually significant contributors to the model, there is no significant interaction between them when predicting scores on the DV (expenses). In relation to this dataset, this tells us that there is no evidence that the relationship between age and expenses depends on bmi (or, equivalently, that the relationship between bmi and expenses depends on age). Expenses increase with age and expenses increase with bmi, but the size of one effect does not appear to change across levels of the other. Note also that once the interaction term is in the model, the age coefficient is the slope of age when bmi is zero (and vice versa), which is why these coefficients and their standard errors differ from those in the first model. I will now run the model again, removing the individual main-effect terms and including just the interaction.
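To make the meaning of the interaction coefficient more concrete, here is a minimal sketch that uses the coefficients reported in the mod_int1 summary table to work out the slope of age at a few bmi values. The helper `age_slope` and the bmi values 20, 30 and 40 are my own illustrative choices, not something taken from the dataset.

```python
# Coefficients copied from the mod_int1 summary table above.
b_age = 312.9752      # slope of age when bmi = 0
b_age_bmi = -2.2998   # change in the age slope per one-unit increase in bmi


def age_slope(bmi):
    """Expected change in expenses per extra year of age at a given bmi."""
    return b_age + b_age_bmi * bmi


# Illustrative bmi values (chosen for the example, not taken from the data).
for bmi in (20, 30, 40):
    print(f"bmi = {bmi}: age slope = {age_slope(bmi):.2f}")
```

The estimated slope falls from roughly 267 at bmi = 20 to roughly 221 at bmi = 40. However, the confidence interval for the interaction coefficient (-9.439 to 4.839) includes zero, so this apparent change in slope cannot be distinguished from no change at all.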

In [5]:
# Again, using the matrix approach (:). This time leaving out the main effects terms. 

mod_int2 = smf.ols(formula = "expenses ~ age:bmi", data = df).fit()

mod_int2.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | expenses | R-squared: | 0.112 |
| Model: | OLS | Adj. R-squared: | 0.111 |
| Method: | Least Squares | F-statistic: | 168.7 |
| Date: | Fri, 08 Sep 2023 | Prob (F-statistic): | 2.02e-36 |
| Time: | 16:27:46 | Log-Likelihood: | -14398.0 |
| No. Observations: | 1338 | AIC: | 28800.0 |
| Df Residuals: | 1336 | BIC: | 28810.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 3869.3561 | 788.123 | 4.910 | 0.000 | 2323.262 | 5415.450 |
| age:bmi | 7.7588 | 0.597 | 12.990 | 0.000 | 6.587 | 8.931 |

| | | | |
|---|---|---|---|
| Omnibus: | 328.95 | Durbin-Watson: | 2.018 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 614.007 |
| Skew: | 1.532 | Prob(JB): | 4.68e-134 |
| Kurtosis: | 4.275 | Cond. No. | 3330.0 |


On this occasion we can see from the model output that the interaction term is the only IV included in the model. We still have a significant regression model, but now the interaction term is significant. This is because the product of age and bmi is the only predictor in the model, so it picks up the variance that the main effects of age and bmi would otherwise account for. The model explains a similar amount of variance in the DV as before (R-squared = 0.112, compared with 0.117 for the main-effects model and 0.118 for the model with main effects and the interaction). As interaction model 1 showed, once the IVs are added individually the interaction term becomes non-significant, as it does not explain anything beyond that accounted for by the main effects of the IVs. If we were comparing these different models, we would likely prefer the model that contains just the individual IVs: adding the interaction makes the model more complex without explaining meaningfully more variance, and the main-effects model also has the lowest AIC of the three (28790 versus 28800).

Next I will fit the same models using the multiplicative specification in the formula. 

In [6]:
# Fitting the multiple regression model again but this time using the multiplication operator instead of the colon. 
# Note the colon has been replaced by the asterisk symbol for multiplication. 

mod_int3 = smf.ols(formula = "expenses ~ age + bmi + age*bmi", data = df).fit()

mod_int3.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | expenses | R-squared: | 0.118 |
| Model: | OLS | Adj. R-squared: | 0.116 |
| Method: | Least Squares | F-statistic: | 59.22 |
| Date: | Fri, 08 Sep 2023 | Prob (F-statistic): | 6.13e-36 |
| Time: | 16:30:22 | Log-Likelihood: | -14394.0 |
| No. Observations: | 1338 | AIC: | 28800.0 |
| Df Residuals: | 1334 | BIC: | 28820.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -9150.4861 | 4634.062 | -1.975 | 0.049 | -1.82e+04 | -59.644 |
| age | 312.9752 | 114.657 | 2.730 | 0.006 | 88.047 | 537.903 |
| bmi | 421.8625 | 149.127 | 2.829 | 0.005 | 129.314 | 714.411 |
| age:bmi | -2.2998 | 3.639 | -0.632 | 0.528 | -9.439 | 4.839 |

| | | | |
|---|---|---|---|
| Omnibus: | 321.813 | Durbin-Watson: | 2.008 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 592.5 |
| Skew: | 1.51 | Prob(JB): | 2.19e-129 |
| Kurtosis: | 4.227 | Cond. No. | 19700.0 |


In the above case for interaction model 3 we get the same output as we got for the first interaction model, where the main effects of age and bmi are significant predictors but the interaction between them is not. I will run this analysis again but this time only including the interaction term (age * bmi) in the model formula. 
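Before moving on, the equivalence of the two specifications can be checked programmatically rather than by comparing summary tables by eye. This is a minimal sketch on simulated data (the data frame `sim` and the column `age_x_bmi` are hypothetical names introduced for the example). It also fits a third version in which the product of the two IVs is computed by hand, to show that the colon term is simply that product.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the insurance data (hypothetical values).
rng = np.random.default_rng(1)
n = 300
sim = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
})
sim["expenses"] = 240 * sim["age"] + 330 * sim["bmi"] + rng.normal(0, 5000, size=n)

# Hand-built product column, for comparison with the age:bmi formula term.
sim["age_x_bmi"] = sim["age"] * sim["bmi"]

colon = smf.ols("expenses ~ age + bmi + age:bmi", data=sim).fit()
star = smf.ols("expenses ~ age + bmi + age*bmi", data=sim).fit()
manual = smf.ols("expenses ~ age + bmi + age_x_bmi", data=sim).fit()

# Both formula versions should produce the same set of terms.
print(list(colon.params.index))
print(list(star.params.index))

# The coefficients should agree across all three fits.
print(np.allclose(colon.params.values, star.params.values))
print(np.allclose(colon.params.values, manual.params.values))
```

If the two `np.allclose` checks print `True`, the three formulas describe the same fitted model, which is what the identical summary tables for interaction models 1 and 3 suggest.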

In [7]:
# Running the analysis a fourth time. Note that on this occasion only the age*bmi interaction is included. 

mod_int4 = smf.ols(formula = "expenses ~ age*bmi", data = df).fit()

mod_int4.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | expenses | R-squared: | 0.118 |
| Model: | OLS | Adj. R-squared: | 0.116 |
| Method: | Least Squares | F-statistic: | 59.22 |
| Date: | Fri, 08 Sep 2023 | Prob (F-statistic): | 6.13e-36 |
| Time: | 16:31:54 | Log-Likelihood: | -14394.0 |
| No. Observations: | 1338 | AIC: | 28800.0 |
| Df Residuals: | 1334 | BIC: | 28820.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -9150.4861 | 4634.062 | -1.975 | 0.049 | -1.82e+04 | -59.644 |
| age | 312.9752 | 114.657 | 2.730 | 0.006 | 88.047 | 537.903 |
| bmi | 421.8625 | 149.127 | 2.829 | 0.005 | 129.314 | 714.411 |
| age:bmi | -2.2998 | 3.639 | -0.632 | 0.528 | -9.439 | 4.839 |

| | | | |
|---|---|---|---|
| Omnibus: | 321.813 | Durbin-Watson: | 2.008 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 592.5 |
| Skew: | 1.51 | Prob(JB): | 2.19e-129 |
| Kurtosis: | 4.227 | Cond. No. | 19700.0 |


Note the difference in the above output compared to the second model, where we used the matrix specification to include only the interaction term in the model. When using the multiplicative specification, the model that is fitted includes each variable individually as well as the interaction term: age*bmi is shorthand for age + bmi + age:bmi. So, although we didn't specify the individual IVs in the formula, the model returned is the full model containing the main effects and the interaction. If we were looking at a three-way interaction (e.g. age * bmi * region), then specifying the IVs as an interaction, in a way similar to model 4, would return a results summary containing the main effect of each IV, all two-way interactions and the three-way interaction. When including interaction terms, it is worth noting that these can be specified for any number of variables (e.g. age * bmi * region * sex * IV5), but interpreting such complex interaction terms can be very difficult, and I would suggest that there should be good theoretical reasons for including anything beyond the interaction between two IVs in the model.
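To see which terms a formula expands to without fitting a model, the formula can be parsed directly. This is a minimal sketch that assumes statsmodels is using patsy as its formula engine (my understanding is that this is the default, but treat it as an assumption). The helper `rhs_terms` is a name I have introduced for the example. Parsing is purely syntactic, so no data is needed.

```python
from patsy import ModelDesc


def rhs_terms(formula):
    """Return the names of the right-hand-side terms a formula expands to."""
    desc = ModelDesc.from_formula(formula)
    return [term.name() for term in desc.rhs_termlist]


# Colon: only the interaction (plus the intercept that is added by default).
print(rhs_terms("expenses ~ age:bmi"))

# Asterisk: main effects and the interaction.
print(rhs_terms("expenses ~ age*bmi"))

# Three-way asterisk: main effects, all two-way terms and the three-way term.
print(rhs_terms("expenses ~ age*bmi*region"))
```

For the three-way formula the list contains eight entries: the intercept, three main effects, three two-way interactions and one three-way interaction. Because region is categorical, a fitted model would report more coefficients than this, as each term involving region is expanded into one coefficient per non-reference category.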

## Summary:

- When fitting multiple regression models using statsmodels it is possible to include the interaction between our IVs as a term in the model. 
- There are two ways to specify the interaction terms in the model formula. The first is the matrix approach, where variable names are separated by the colon operator (IV1:IV2). When this is the only IV term included in the model, the interaction is the only predictor that will be fitted. The second is the multiplicative method, where variable names are separated by an asterisk, signifying multiplication (IV1 * IV2). When this is the only IV term included in the model, the output returned will include main effects and interaction terms for all variables named in the interaction. 