# Multiple Regression with Categorical Predictors


Multiple regression analysis allows us to investigate the relationship between multiple independent variables (IVs) and a scale dependent variable (DV). In a previous notebook I demonstrated how to run a multiple regression analysis and interpret the results of the model. The examples used included multiple IVs that were all measured on some sort of scale (either discrete or continuous). However, when conducting multiple regression we can also include categorical predictors (IVs) in the model along with scale IVs.

When including categorical variables in a multiple regression analysis, one of the categories is taken as a reference category, and differences in scores on the DV are expressed as differences between the reference category and each comparison category. To achieve this we have to create new variables containing dummy codes that indicate which comparison category, if any, each observation belongs to. This means creating n - 1 dummy variables, where n is the number of categories in our categorical IV.
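To make the role of the reference category concrete, the dummy coded model for a single categorical IV with four categories can be written as follows. The symbols $D_1$ to $D_3$ and $\varepsilon$ are notation introduced here for illustration and are not taken from the analysis below.

$$
\text{DV} = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + \beta_3 D_3 + \varepsilon
$$

Each $D_j$ equals 1 when an observation belongs to comparison category $j$ and 0 otherwise. For the reference category all three dummy variables are 0, so the predicted score is $\beta_0$; for comparison category $j$ it is $\beta_0 + \beta_j$. Each $\beta_j$ is therefore the difference in predicted DV scores between category $j$ and the reference category.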

When running these analyses using the statsmodels software library, statsmodels will sometimes recognise that a variable is categorical (for example, when it contains text labels) and automatically create the dummy codes. However, if the variable is not recognised as categorical, or if you want greater control over the analysis, it is often better to dummy code the variables manually before running the analysis. This can be done easily using the pandas get_dummies function.
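The difference between a recognised and an unrecognised categorical variable can be sketched on a small synthetic dataset. Everything in this example (the dataframe `toy`, its columns, and the generated values) is hypothetical and only for illustration. The same three groups are stored once as text labels and once as integer codes; the integer version is treated as a scale IV unless it is wrapped in `C()` in the formula.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only (not the insurance dataset).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "group_label": rng.choice(["a", "b", "c"], size=n),
    "x": rng.normal(size=n),
})
# The same grouping stored as integer codes 0, 1, 2.
toy["group_code"] = toy["group_label"].map({"a": 0, "b": 1, "c": 2})
toy["y"] = 2.0 * toy["x"] + toy["group_code"] ** 2 + rng.normal(size=n)

# Text labels: dummy coded automatically (two dummy variables for three groups).
as_string = smf.ols("y ~ x + group_label", data=toy).fit()

# Integer codes: treated as a single scale IV with one slope.
as_number = smf.ols("y ~ x + group_code", data=toy).fit()

# Integer codes wrapped in C(): forced to be treated as categorical.
as_forced = smf.ols("y ~ x + C(group_code)", data=toy).fit()

for name, fitted in [("string", as_string), ("number", as_number), ("forced", as_forced)]:
    print(name, list(fitted.params.index))
```

Comparing the printed coefficient names shows which variables were dummy coded: the text and `C()` versions each produce two group coefficients, while the plain integer version produces only one.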

In this notebook I will demonstrate how to run multiple regression, using statsmodels, including categorical variables. I will do this using a dataset relating to insurance premiums in the US. The multiple regression model will assess whether we can predict a participant's insurance expenses using scale IVs of age and body mass index (bmi), as well as categorical IVs of the region the participant lives in (four categories: Northeast, Southeast, Southwest, Northwest) and whether or not they are a smoker (two categories: Yes, No).



In [1]:
# Importing key software libraries. 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats

In [2]:
# Importing the data. 

df = pd.read_csv("insurance.csv")

df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,expenses
0,19,female,27.9,0,yes,southwest,16884.92
1,18,male,33.8,1,no,southeast,1725.55
2,28,male,33.0,3,no,southeast,4449.46
3,33,male,22.7,0,no,northwest,21984.47
4,32,male,28.9,0,no,northwest,3866.86


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   expenses  1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


Above we can see that the insurance dataset is quite large, with 1338 participants. Each variable has 1338 non-null entries, indicating that there is no missing data. We can also see the data type of each variable. The DV (expenses) is a floating point variable. This is essential because multiple regression predicts scores on a scale DV. The IVs of interest all have appropriate data types for the analysis we want to conduct. The two scale IVs, age and bmi, have integer and floating point data types, respectively. The two categorical IVs we also want to include in the model, smoker and region, have the object data type. This suggests that we don't need to do any data wrangling to tidy up the data prior to running the analysis.
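Before relying on automatic dummy coding it can also help to list the levels of each categorical IV, since stray spellings (for example 'Yes' and 'yes') would each be treated as a separate category. The following is a minimal sketch using only the five rows shown by df.head() above, re-entered by hand so that it runs without insurance.csv. The name `sample` is hypothetical; on the full data the same loop would be run over `df`.

```python
import pandas as pd

# The five rows shown by df.head(), re-entered by hand for illustration.
sample = pd.DataFrame({
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
})

# List the distinct levels of each categorical IV and the number of
# dummy variables (levels minus one) that each would need.
for col in ["smoker", "region"]:
    levels = sorted(sample[col].unique())
    print(col, "levels:", levels, "-> dummy variables needed:", len(levels) - 1)
```

Only three regions appear in these five rows; run on the full dataframe, the region variable would show all four.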

In [4]:
# Fitting the multiple regression model using statsmodels. 
# In this first model I will include the two scale IVs and only one categorical IV (region)

mod_1 = smf.ols(formula = "expenses ~ age + bmi + region", data = df).fit()

mod_1.summary()

0,1,2,3
Dep. Variable:,expenses,R-squared:,0.12
Model:,OLS,Adj. R-squared:,0.117
Method:,Least Squares,F-statistic:,36.47
Date:,"Fri, 08 Sep 2023",Prob (F-statistic):,4.3e-35
Time:,15:46:49,Log-Likelihood:,-14392.0
No. Observations:,1338,AIC:,28800.0
Df Residuals:,1332,BIC:,28830.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-5522.8429,1814.865,-3.043,0.002,-9083.148,-1962.538
region[T.northwest],-979.6585,893.324,-1.097,0.273,-2732.134,772.817
region[T.southeast],62.7481,897.817,0.070,0.944,-1698.542,1824.038
region[T.southwest],-1561.9465,896.530,-1.742,0.082,-3320.711,196.818
age,242.9380,22.304,10.892,0.000,199.184,286.692
bmi,321.8171,53.615,6.002,0.000,216.638,426.996

0,1,2,3
Omnibus:,322.532,Durbin-Watson:,2.012
Prob(Omnibus):,0.0,Jarque-Bera (JB):,595.076
Skew:,1.51,Prob(JB):,6.04e-130
Kurtosis:,4.245,Cond. No.,305.0


In the above output we can see that we have a significant multiple regression model by inspecting the F-statistic associated with the model (F(5, 1332) = 36.47, p < 0.001). The coefficients table gives us the ${\beta}$ coefficient (slope) associated with each of the IVs and tells us whether that variable is a significant predictor in the model. We can see that the scale IVs are both significant (age: ${\beta}$ = 242.94, t(1332) = 10.89, p < 0.001; bmi: ${\beta}$ = 321.82, t(1332) = 6.00, p < 0.001). Importantly, we can see that the region categorical IV has been automatically dummy coded and ${\beta}$ coefficients are given for three of the four regions (northwest, southeast, southwest). The fourth category that is missing (northeast) has been used as the reference category for the analysis. The coefficients shown indicate the difference in DV scores for that region relative to the northeast reference category, holding the other IVs constant. For example, the northwest has a coefficient of -979.66. This indicates that changing region from northeast to northwest is associated with a reduction of 979.66 units on the expenses DV. Taken at face value, insurance expenses are estimated to be lower in the northwest region. Note, however, that none of the t-values for the three region categories is significant (all p > 0.05), indicating that the different levels of the region IV are not contributing significantly to the multiple regression model. In this model, the region someone lives in does not appear to be a predictor of how much they will pay in insurance premium expenses.

Next I will run the analysis again and include a second categorical IV (smoker) to see if this predicts insurance premium expenses. On this occasion I will also manually create dummy coded variables for the two categorical IVs. Although this was not needed for the region IV, it is necessary to know how to do this in case the software library being used to run the model does not automatically recognise that a variable is categorical. 

In [5]:
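# Dummy coding the two categorical variables using pandas get_dummies. 
# The drop_first argument drops the first category of each variable, making it the reference. 
# dtype = int keeps the dummy codes as 0/1; pandas 2.0 and later would otherwise return True/False.

df = pd.get_dummies(df, columns = ["smoker", "region"], drop_first = True, dtype = int)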
df.head()

Unnamed: 0,age,sex,bmi,children,expenses,smoker_yes,region_northwest,region_southeast,region_southwest
0,19,female,27.9,0,16884.92,1,0,0,1
1,18,male,33.8,1,1725.55,0,0,1,0
2,28,male,33.0,3,4449.46,0,0,1,0
3,33,male,22.7,0,21984.47,0,1,0,0
4,32,male,28.9,0,3866.86,0,1,0,0


Above, we can see that four new variables have been appended to the dataframe. These are the dummy coded levels of the categorical variables. The dummy code for smoker has taken 'no' (that they are not a smoker) as the reference category and returned a vector of zeros and ones that indicate if someone is a smoker. A one in the column means 'yes', that person is a smoker. Similarly for the region variable, the northeast region is the category level that has been used as the reference category, and we have dummy codes showing a one if someone lives in the northwest, southeast, or southwest. Someone living in the northeast has a zero in all three region columns. Note also that creating these dummy variables has removed the original smoker and region variables from the dataframe. Having manually created our dummy coded variables for the two categorical IVs, we can now include them in the multiple regression model.
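Manual dummy coding also gives control over which category acts as the reference. A minimal sketch, assuming we wanted the southeast rather than the northeast as the reference: create a dummy variable for every category and then drop the column for the chosen reference. The dataframe `regions` and the variable `reference` are hypothetical names used only for this illustration.

```python
import pandas as pd

# One row per region, for illustration only.
regions = pd.DataFrame(
    {"region": ["northeast", "northwest", "southeast", "southwest"]}
)

# Create a dummy variable for every category (no drop_first) ...
all_dummies = pd.get_dummies(regions, columns=["region"], dtype=int)

# ... then drop the column for the category chosen as the reference.
reference = "region_southeast"
dummies = all_dummies.drop(columns=[reference])

print(dummies)
```

The southeast row now contains zeros in every remaining column, which is what marks it as the reference category.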

In [6]:
# Fitting the model again, this time explicitly including the dummy coded IVs. Also, adding the smoker IV

mod_2 = smf.ols(formula = "expenses ~ age + bmi + region_northwest + region_southeast + region_southwest + smoker_yes", 
               data = df).fit()

mod_2.summary()

0,1,2,3
Dep. Variable:,expenses,R-squared:,0.749
Model:,OLS,Adj. R-squared:,0.748
Method:,Least Squares,F-statistic:,660.8
Date:,"Fri, 08 Sep 2023",Prob (F-statistic):,0.0
Time:,15:57:14,Log-Likelihood:,-13554.0
No. Observations:,1338,AIC:,27120.0
Df Residuals:,1331,BIC:,27160.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-1.16e+04,976.185,-11.887,0.000,-1.35e+04,-9689.098
age,258.6206,11.930,21.679,0.000,235.217,282.024
bmi,340.0928,28.672,11.862,0.000,283.846,396.339
region_northwest,-303.3267,477.837,-0.635,0.526,-1240.723,634.070
region_southeast,-1039.1591,480.476,-2.163,0.031,-1981.732,-96.586
region_southwest,-915.1583,479.539,-1.908,0.057,-1855.893,25.576
smoker_yes,2.385e+04,413.496,57.682,0.000,2.3e+04,2.47e+04

0,1,2,3
Omnibus:,298.411,Durbin-Watson:,2.079
Prob(Omnibus):,0.0,Jarque-Bera (JB):,705.597
Skew:,1.209,Prob(JB):,6.05e-154
Kurtosis:,5.61,Cond. No.,307.0


We can see in the above output that adding the category of smoker to the model has substantially improved the model over the first model fitted. The model is still significant (F(6, 1331) = 660.80, p < 0.001), but the R-squared value, indicating the proportion of variance in insurance premium expenses explained, has increased from about 12% (0.12) in model 1 (mod_1) to about 75% (0.749) in model 2 (mod_2). It would appear that adding an IV indicating whether someone smokes accounts for a large amount of the variance in insurance expenses. The coefficient for smoker_yes is approximately ${\beta}$ = 23850 (shown as 2.385e+04 in the table), indicating that smokers' insurance expenses are significantly higher than those of non-smokers (the reference category). Understandably, this IV is a significant predictor in the model (t(1331) = 57.68, p < 0.001). Interestingly, the southeast category is also now a significant contributor to the model (${\beta}$ = -1039.16, t(1331) = -2.16, p = 0.031). The negative beta coefficient indicates that living in the southeast is associated with significantly lower insurance expenses compared to the northeast reference category. The other region categories are still not significant predictors of the DV (expenses), but the two scale IVs (age and bmi) are still significant and have positive beta coefficients (age: ${\beta}$ = 258.62, t(1331) = 21.68, p < 0.001; bmi: ${\beta}$ = 340.09, t(1331) = 11.86, p < 0.001). The positive values of both these coefficients indicate that as age and bmi increase so do insurance expenses.
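The comparison between the two models above rests on the change in R-squared. When one model is nested within another, the improvement can also be tested formally with an F-test, for example using the `compare_f_test` method of a fitted statsmodels OLS model. The sketch below is a minimal illustration on synthetic data; the generated values and the names `synthetic`, `restricted`, and `full` are assumptions made for demonstration and are not the insurance data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only (not the insurance dataset).
rng = np.random.default_rng(1)
n = 400
synthetic = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "smoker_yes": rng.integers(0, 2, size=n),  # 0 = non-smoker, 1 = smoker
})
synthetic["expenses"] = (
    250 * synthetic["age"]
    + 300 * synthetic["bmi"]
    + 24000 * synthetic["smoker_yes"]
    + rng.normal(0, 6000, size=n)
)

# The restricted model omits the smoker dummy; the full model includes it.
restricted = smf.ols("expenses ~ age + bmi", data=synthetic).fit()
full = smf.ols("expenses ~ age + bmi + smoker_yes", data=synthetic).fit()

# F-test of whether the extra parameter improves the fit.
f_value, p_value, df_diff = full.compare_f_test(restricted)
print(f"F = {f_value:.2f}, p = {p_value:.3g}, extra parameters = {df_diff}")
print(f"R-squared: {restricted.rsquared:.3f} -> {full.rsquared:.3f}")
```

A small p-value here indicates that the full model fits significantly better than the restricted one. This test requires both models to be fitted to the same observations.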


## Summary:

- Multiple regression models can be fitted using both scale and categorical variables to predict scores on a scale DV. 
- When including categorical IVs it is important to check that they are correctly identified as object/category data types. If they are, then statsmodels will automatically dummy code them in the analysis. 
- If a categorical variable is not the correct data type and is shown as an integer, then it is necessary to create dummy coded versions of the variable. This can be achieved using the pandas get_dummies function. 
- Dummy coding will take one level of the categorical variable as a reference level and create n - 1 new variables, each indicating membership of one of the other levels relative to the reference category. 
