## Introduction

This in-class example demonstrates how to incorporate qualitative explanatory variables into a multiple linear regression model. It covers the most common ways of including binary (dummy) variables in a regression model.

What you need to know:  
- The ```statsmodels``` and ```pandas``` modules in Python
- The theory of the multiple linear regression model
- How to create and work with binary (dummy) variables

See the [references](#References) for details on the concepts and techniques used in this exercise.
***

## Content
- [Regression Using Dummy Variable](#Regression-Using-Dummy-Variable)
- [Interactions Involving Dummy Variables](#Interactions-Involving-Dummy-Variables) 
- [References](#References)

***
## Data Description

The data set is contained in a comma-separated value (csv) file named ```WAGE.csv``` with column headers. 

The variables are described as follows:

| Name | Description |
| :--- | :--- |
| wage     | average hourly earnings |
| educ     | years of education |
| exper    | years potential experience |
| tenure   | years with current employer |
| female   | = 1 if female |
| married  | = 1 if married |
| numdep   | number of dependents |
| lwage     | log(wage) |
| expersq  | exper^2 |
| tenursq  | tenure^2 |

***
## Load the required modules

In [15]:
import math
import numpy as np
import pandas as pd
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf

***
## Load the data set
Read ```WAGE.csv```, which includes a header row, into a pandas DataFrame.

In [16]:
data = pd.read_csv("WAGE.csv")

#### Check if the data is properly imported

In [17]:
data.head()

Unnamed: 0,wage,educ,exper,tenure,female,married,numdep,lwage,expersq,tenursq
0,3.1,11,2,0,1,0,2,1.131402,4,0
1,3.24,12,22,2,1,1,3,1.175573,484,4
2,3.0,11,2,0,0,0,2,1.098612,4,0
3,6.0,8,44,28,0,1,0,1.791759,1936,784
4,5.3,12,7,2,0,1,1,1.667707,49,4


Summary statistics for women:

In [18]:
data.query("female == 1").describe()

Unnamed: 0,wage,educ,exper,tenure,female,married,numdep,lwage,expersq,tenursq
count,252.0,252.0,252.0,252.0,252.0,252.0,252.0,252.0,252.0,252.0
mean,4.587659,12.31746,16.428571,3.615079,1.0,0.52381,1.087302,1.416353,455.555556,41.662698
std,2.529363,2.472642,13.652738,5.357968,0.0,0.500427,1.214257,0.444235,616.668566,119.257369
min,0.53,0.0,1.0,0.0,1.0,0.0,0.0,-0.634878,1.0,0.0
25%,3.0,12.0,5.0,0.0,1.0,0.0,0.0,1.098612,25.0,0.0
50%,3.75,12.0,13.0,2.0,1.0,1.0,1.0,1.321756,169.0,4.0
75%,5.51,13.0,26.0,4.0,1.0,1.0,2.0,1.70652,676.0,16.0
max,21.629999,18.0,50.0,34.0,1.0,1.0,5.0,3.074081,2500.0,1156.0


Summary statistics for men:
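The cell for men is not shown in this copy of the example. A minimal sketch, assuming it follows the same pattern as the cell for women with the condition flipped to `female == 0`. Here `sample` is a hypothetical stand-in built from three columns of the five rows printed by `data.head()`, so the snippet runs without ```WAGE.csv```.

```python
import pandas as pd

# Hypothetical stand-in for `data`: three columns of the five rows from data.head().
sample = pd.DataFrame({
    "wage":   [3.10, 3.24, 3.00, 6.00, 5.30],
    "educ":   [11, 12, 11, 8, 12],
    "female": [1, 1, 0, 0, 0],
})

# Same query pattern as the cell for women, selecting men instead.
men_summary = sample.query("female == 0").describe()
print(men_summary)
```

On the full data set the equivalent call would be `data.query("female == 0").describe()`.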

***
## Regression Using Dummy Variable

Consider a simpler model that only includes a dummy variable:
$$wage = \beta_0 + \delta_0 female + u$$

The coefficients in this model have a simple interpretation. The intercept $\beta_0$ is the average wage for men in the sample (i.e. $female=0$), and $\delta_0$ is the difference in average wage between women and men, so the average wage for women is $\beta_0 + \delta_0$.

Regressing on a constant and a dummy variable is therefore a straightforward way to carry out a *comparison-of-means* test between two groups, which in this case are men and women. Under the usual homoskedasticity assumption, the $t$ statistic on $female$ tests whether the two group means are equal.
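The link between a dummy-variable regression and group means can be checked by hand. The sketch below uses only the five `(wage, female)` pairs printed by `data.head()`, and the helper `simple_ols` is a hypothetical name introduced for illustration. It applies the textbook simple-regression formulas and compares the result with the two group means.

```python
from statistics import mean

# Five (wage, female) pairs copied from the data.head() output.
wage = [3.10, 3.24, 3.00, 6.00, 5.30]
female = [1, 1, 0, 0, 0]


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


beta0, delta0 = simple_ols(female, wage)

# Group means computed directly from the data.
mean_men = mean([w for w, f in zip(wage, female) if f == 0])
mean_women = mean([w for w, f in zip(wage, female) if f == 1])

print("intercept:", beta0, "mean for men:", mean_men)
print("intercept + slope:", beta0 + delta0, "mean for women:", mean_women)
```

The intercept reproduces the mean for men, and the intercept plus the slope reproduces the mean for women. The same holds for the full sample used in the regression that follows.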

In [33]:
model_mean = smf.ols(formula = 'wage ~ female', data = data).fit()
print(model_mean.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.116
Model:                            OLS   Adj. R-squared:                  0.114
Method:                 Least Squares   F-statistic:                     68.54
Date:                Tue, 04 Mar 2025   Prob (F-statistic):           1.04e-15
Time:                        17:41:24   Log-Likelihood:                -1400.7
No. Observations:                 526   AIC:                             2805.
Df Residuals:                     524   BIC:                             2814.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      7.0995      0.210     33.806      0.0

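The average wage *difference* for women relative to men in the sample is the coefficient on $female$. It equals the mean wage for women (4.587659 in the summary statistics for women) minus the intercept, which is the mean wage for men:

In [39]:
round(4.587659 - 7.0995, 4)

-2.5118

On average, women in the sample earn about 2.51 less per hour than men.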

The estimated wage differential between men and women is larger here than in the model below, because the simple regression does not control for differences in education, experience, and tenure,
and these are lower, on average, for women than for men in this sample.

We can also add other exogenous regressors to the model: 
$$wage = \beta_0 + \delta_0 female + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u.$$

In [None]:
model_eq = smf.ols(formula = "wage ~ female + educ + exper + tenure", data = data).fit()
print(model_eq.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.364
Model:                            OLS   Adj. R-squared:                  0.359
Method:                 Least Squares   F-statistic:                     74.40
Date:                Tue, 04 Mar 2025   Prob (F-statistic):           7.30e-50
Time:                        17:42:43   Log-Likelihood:                -1314.2
No. Observations:                 526   AIC:                             2638.
Df Residuals:                     521   BIC:                             2660.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.5679      0.725     -2.164      0.0

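In this model the coefficient on $female$ is itself the average wage *difference* for women: it compares a woman and a man with the same education, experience, and tenure. The intercept is the predicted wage for a man with zero education, experience, and tenure, so it should not be combined with the coefficient on $female$ the way the group means were in the simple regression.

In [41]:
print(round(model_eq.params["female"], 4))

-1.8109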

Why do we obtain different results?
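One way to approach this question is the exact algebraic relationship between the coefficients of the simple and the multiple regression. The notation below is introduced here for illustration and is not part of the original example: $\tilde{\delta}_0$ is the coefficient on $female$ in the simple regression, hats denote the estimates from the multiple regression, and $\tilde{\gamma}_x$ is the slope from regressing a control $x$ on $female$, which is the difference between the women's and men's sample means of $x$.

$$\tilde{\delta}_0 = \hat{\delta}_0 + \hat{\beta}_1 \tilde{\gamma}_{educ} + \hat{\beta}_2 \tilde{\gamma}_{exper} + \hat{\beta}_3 \tilde{\gamma}_{tenure}, \qquad \tilde{\gamma}_{x} = \bar{x}_{women} - \bar{x}_{men}$$

If the slope estimates are positive and the group means of the controls are lower for women, each extra term is negative, so the gap from the simple regression is larger in magnitude than the gap that holds the controls fixed.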

***
## Interactions Involving Dummy Variables

Consider a model that allows for wage differences among four groups: married men, married women, single men, and single women. To do this, we select **single men** as our base group and define dummy variables for each of the remaining groups. Call these $marrmale$ (married men), $marrfem$ (married women), and $singfem$ (single women).
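As a sketch of how the three group dummies relate to $female$ and $married$, the snippet below builds them by hand with pandas. The frame `sample` is a hypothetical stand-in that reuses the `female` and `married` values from the five rows printed by `data.head()`; `singmale` is an illustrative name for the base group and would not enter the regression.

```python
import pandas as pd

# Hypothetical stand-in: female and married from the five rows of data.head().
sample = pd.DataFrame({
    "female":  [1, 1, 0, 0, 0],
    "married": [0, 1, 0, 1, 1],
})

# Each group dummy is a product of the two binary variables or their complements.
groups = sample.assign(
    marrmale=(1 - sample["female"]) * sample["married"],
    marrfem=sample["female"] * sample["married"],
    singfem=sample["female"] * (1 - sample["married"]),
)

# Base group (single men): the observations not covered by the three dummies.
singmale = 1 - groups[["marrmale", "marrfem", "singfem"]].sum(axis=1)
print(groups.assign(singmale=singmale))
```

Every observation falls into exactly one of the four groups, which is why one group has to be left out as the base when an intercept is included.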

The model is specified as:
$$\log(wage) = \beta_0 + \delta_0 female + \delta_1 married + \delta_2 female \cdot married + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + \beta_4 tenure + \beta_5 tenure^2 + u$$
where $female \cdot married$ denotes the interaction between the two dummy variables.
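Setting the two dummies to each of their four combinations shows how this specification covers the four groups. Holding education, experience, and tenure fixed, the intercepts implied by the equation above are:

$$
\begin{aligned}
\text{single men } (female = 0,\ married = 0) &: \quad \beta_0 \\
\text{single women } (female = 1,\ married = 0) &: \quad \beta_0 + \delta_0 \\
\text{married men } (female = 0,\ married = 1) &: \quad \beta_0 + \delta_1 \\
\text{married women } (female = 1,\ married = 1) &: \quad \beta_0 + \delta_0 + \delta_1 + \delta_2
\end{aligned}
$$

In terms of the separate group dummies, the coefficient on $singfem$ would correspond to $\delta_0$, the one on $marrmale$ to $\delta_1$, and the one on $marrfem$ to $\delta_0 + \delta_1 + \delta_2$. The interaction coefficient $\delta_2$ measures how the effect of being married differs between women and men.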

It is convenient to create these variables automatically *within* the model specification, using the formula interface of the ```statsmodels``` module.

We can use the ```C()``` operator to explicitly indicate that $female$ and $married$ should be treated as categorical variables. In a formula, ```a*b``` is shorthand for ```a + b + a:b```, so ```C(female)*C(married)``` includes both dummy variables and their interaction.

In [48]:
model_interact = smf.ols(formula = "lwage ~ C(female)*C(married) + educ + exper + expersq + tenure + tenursq", data = data).fit()
print(model_interact.summary())

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.461
Model:                            OLS   Adj. R-squared:                  0.453
Method:                 Least Squares   F-statistic:                     55.25
Date:                Tue, 04 Mar 2025   Prob (F-statistic):           1.28e-64
Time:                        18:01:05   Log-Likelihood:                -250.96
No. Observations:                 526   AIC:                             519.9
Df Residuals:                     517   BIC:                             558.3
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept   

#### Allowing for Different Slopes

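We can use the same approach to allow slopes, not just intercepts, to differ between groups.

Consider the following model:
$$\log(wage) = \beta_0 + \delta_0 female + \beta_1 educ + \delta_1 female \cdot educ + \beta_2 exper + \beta_3 exper^2 + \beta_4 tenure + \beta_5 tenure^2 + u.$$
Here $\delta_0$ is the difference in intercepts between women and men, and $\delta_1$ is the difference in the return to education: the slope on $educ$ is $\beta_1$ for men and $\beta_1 + \delta_1$ for women.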

In [49]:
model_alt = smf.ols(formula = "lwage ~ C(female)*educ + exper + expersq + tenure + tenursq", data = data).fit()
print(model_alt.summary())

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.441
Model:                            OLS   Adj. R-squared:                  0.433
Method:                 Least Squares   F-statistic:                     58.37
Date:                Tue, 04 Mar 2025   Prob (F-statistic):           1.67e-61
Time:                        18:01:05   Log-Likelihood:                -260.49
No. Observations:                 526   AIC:                             537.0
Df Residuals:                     518   BIC:                             571.1
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept               0.3888    

***
## References

- Jeffrey M. Wooldridge (2019) "Introductory Econometrics: A Modern Approach, 7e" Chapter 7.

- The pandas development team (2020). "[pandas-dev/pandas: Pandas](https://pandas.pydata.org/)." Zenodo.
    
- Seabold, Skipper, and Josef Perktold (2010). "[statsmodels: Econometric and statistical modeling with python](https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html)." Proceedings of the 9th Python in Science Conference.