## Introduction

This in-class example demonstrates how to incorporate qualitative explanatory variables into a multiple linear regression model. It covers most of the common ways in which binary (dummy) variables are included in a regression model.

What you need to know:  
- The statsmodels and pandas modules in Python
- Theoretical concepts of the multiple linear regression model
- How to create and work with binary (dummy) variables

See the list of [references](#References) for details on the concepts and techniques used in this exercise.
***

## Content
- [Regression Using Dummy Variable](#Regression-Using-Dummy-Variable)
- [Interactions Involving Dummy Variables](#Interactions-Involving-Dummy-Variables) 
- [References](#References)

***
## Data Description

The data set is contained in a comma-separated values (csv) file named ```WAGE.csv``` with column headers.

The variables are described below:

| Name | Description |
| :--- | :--- |
| wage     | average hourly earnings |
| educ     | years of education |
| exper    | years of potential experience |
| tenure   | years with current employer |
| female   | = 1 if female |
| married  | = 1 if married |
| numdep   | number of dependents |
| lwage     | log(wage) |
| expersq  | exper^2 |
| tenursq  | tenure^2 |

***
## Load the required modules

In [None]:
import numpy as np
import pandas as pd
import statsmodels
import statsmodels.formula.api as smf

***
## Load the data set
Read ```WAGE.csv``` into a pandas data frame. The first row of the file contains the column headers.
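A minimal sketch of the loading step, assuming `WAGE.csv` sits in the working directory. The helper name `load_wage_data` is an illustrative choice and not part of the original notebook.

```python
import pandas as pd


def load_wage_data(path="WAGE.csv"):
    """Read the wage data set; the first row of the file is used as column names."""
    # pd.read_csv treats the first line as the header by default.
    return pd.read_csv(path)
```

In the notebook this could be called as `df = load_wage_data()`, where `df` is a hypothetical name for the resulting data frame.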

#### Check if the data is properly imported
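One possible check, sketched under the assumption that the data frame should contain exactly the columns listed in the data description. The helper `check_import` and the constant `EXPECTED_COLUMNS` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Column names taken from the data description table.
EXPECTED_COLUMNS = [
    "wage", "educ", "exper", "tenure", "female",
    "married", "numdep", "lwage", "expersq", "tenursq",
]


def check_import(df):
    """Report columns that are absent, the frame's shape, and the number of missing cells."""
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    return {
        "missing_columns": missing,
        "shape": df.shape,
        "n_missing_values": int(df.isna().sum().sum()),
    }
```

Calling `df.head()` on the loaded data frame is a quick visual complement to this report.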

Summary statistics for women:

Summary statistics for men:
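The two summaries could be produced along the following lines. This is a sketch that assumes the `female` column is coded 1 for women and 0 for men, as in the data description; `summary_by_gender` is a hypothetical helper name.

```python
import pandas as pd


def summary_by_gender(df):
    """Return (women, men) summary tables from DataFrame.describe()."""
    women = df[df["female"] == 1].describe()
    men = df[df["female"] == 0].describe()
    return women, men
```

With the loaded data frame, `women, men = summary_by_gender(df)` gives one table per group, which makes it easy to compare average education, experience, and tenure.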

***
## Regression Using Dummy Variable

Consider a simple model that includes only a dummy variable:
$$wage = \beta_0 + \delta_0 female + u$$

The coefficients in this model have a simple interpretation. The intercept $\beta_0$ is the average wage for men in the sample, i.e. $female=0$, and $\delta_0$ is the difference in average wage between women and men.

The model provides a simple way to carry out a *comparison-of-means* test between the two groups, which in this case are men and women.

Generally, simple regression on a constant and a dummy variable is a straightforward way to compare the means of two groups.

The average wage *difference* between women and men in the sample is:
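A sketch of how this difference might be estimated with the statsmodels formula interface. The helper names `fit_gender_gap` and `mean_difference` are illustrative; the second function only confirms that the coefficient on `female` equals the raw difference in group means.

```python
import pandas as pd
import statsmodels.formula.api as smf


def fit_gender_gap(df):
    """Regress wage on a constant and the female dummy."""
    return smf.ols("wage ~ female", data=df).fit()


def mean_difference(df):
    """Average wage of women minus average wage of men."""
    means = df.groupby("female")["wage"].mean()
    return means.loc[1] - means.loc[0]
```

After `res = fit_gender_gap(df)`, the value `res.params["female"]` is the estimate of $\delta_0$ and should match `mean_difference(df)`. The t statistic in `res.tvalues["female"]` is the one used for the comparison-of-means test.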

This estimated wage differential between men and women is larger in magnitude than the one obtained after controlling for education, experience, and tenure, because these are lower, on average, for women than for men in this sample.

We can also add other exogenous regressors to the model: 
$$wage = \beta_0 + \delta_0 female + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u.$$

The average wage *difference* for women, holding education, experience, and tenure fixed, is:
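The extended model could be estimated as follows. This is a sketch under the assumption that the column names match the data description; `fit_gender_gap_with_controls` is a hypothetical helper name.

```python
import statsmodels.formula.api as smf


def fit_gender_gap_with_controls(df):
    """Regress wage on the female dummy plus education, experience, and tenure."""
    formula = "wage ~ female + educ + exper + tenure"
    return smf.ols(formula, data=df).fit()
```

After `res = fit_gender_gap_with_controls(df)`, the estimate of $\delta_0$ is `res.params["female"]`, and `res.summary()` prints the full regression table.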

Why do we obtain different results?

***
## Interactions Involving Dummy Variables

Consider a model that allows for wage differences among four groups: married men, married women, single men, and single women. To do this, we select **single men** as our base group and define dummy variables for each of the remaining groups. Call these $marrmale$ (married men), $marrfem$ (married women), and $singfem$ (single women).
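For reference, the three group dummies could also be built by hand with pandas. The sketch below assumes `female` and `married` are coded 0/1; the helper name `add_group_dummies` is hypothetical.

```python
import pandas as pd


def add_group_dummies(df):
    """Return a copy of df with marrmale, marrfem, and singfem columns added."""
    out = df.copy()
    female = out["female"] == 1
    married = out["married"] == 1
    out["marrmale"] = (~female & married).astype(int)
    out["marrfem"] = (female & married).astype(int)
    out["singfem"] = (female & ~married).astype(int)
    # Single men are the base group: all three dummies are zero for them.
    return out
```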

The model is specified as:
$$\log(wage) = \beta_0 + \delta_0 female + \delta_1 married + \delta_2 female \cdot married + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + \beta_4 tenure + \beta_5 tenure^2 + u$$
where we use $(female \cdot married)$ to denote the interaction between the two dummy variables.

It is helpful to create those variables automatically *within* the model specification. For this purpose, we use the formula interface of the ```statsmodels``` module, which can generate categorical variables for us.

We can use the ```C()``` operator to explicitly indicate that $female$ and $married$ should be treated as categorical variables.
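A sketch of the interaction model using `C()` in the formula. It assumes `female` and `married` are stored as integers 0/1, so that the coefficient labels end in `[T.1]`, and it uses the `expersq` and `tenursq` columns from the data set for the squared terms. The helper names are illustrative.

```python
import statsmodels.formula.api as smf


def fit_gender_marital_model(df):
    """Estimate log(wage) with female, married, and their interaction."""
    # C(female) * C(married) expands to both main effects plus the interaction.
    formula = (
        "lwage ~ C(female) * C(married) "
        "+ educ + exper + expersq + tenure + tenursq"
    )
    return smf.ols(formula, data=df).fit()


def group_differentials(res):
    """Differentials in log(wage) relative to single men (the base group)."""
    # Labels below assume integer 0/1 coding of female and married.
    d_female = res.params["C(female)[T.1]"]
    d_married = res.params["C(married)[T.1]"]
    d_inter = res.params["C(female)[T.1]:C(married)[T.1]"]
    return {
        "marrmale": d_married,
        "singfem": d_female,
        "marrfem": d_female + d_married + d_inter,
    }
```

The second helper maps the estimated coefficients back to the four groups: married men differ from single men by $\delta_1$, single women by $\delta_0$, and married women by $\delta_0 + \delta_1 + \delta_2$.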

#### Allowing for Different Slopes

We can use the same approach to allow for different slopes, by interacting a dummy variable with a continuous regressor.

Consider the following model:
$$\log(wage) = \beta_0 + \delta_0 female + \beta_1 educ + \delta_2 female \cdot educ + \beta_2 exper + \beta_3 exper^2 + \beta_4 tenure + \beta_5 tenure^2 + u.$$
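This model could be estimated along the following lines. The sketch treats `female` as a numeric 0/1 column, so the interaction coefficient is labelled `female:educ`; `fit_different_slopes` is a hypothetical helper name.

```python
import statsmodels.formula.api as smf


def fit_different_slopes(df):
    """Estimate log(wage) with a female-specific intercept and return to education."""
    # female * educ expands to female + educ + female:educ.
    formula = "lwage ~ female * educ + exper + expersq + tenure + tenursq"
    return smf.ols(formula, data=df).fit()
```

After `res = fit_different_slopes(df)`, the coefficient `res.params["female:educ"]` estimates how the return to education differs between women and men, and `res.pvalues["female:educ"]` gives the p-value of the test that this difference is zero.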

***
## References

- Jeffrey M. Wooldridge (2019) "Introductory Econometrics: A Modern Approach, 7e" Chapter 7.

- The pandas development team (2020). "[pandas-dev/pandas: Pandas](https://pandas.pydata.org/)." Zenodo.
    
- Seabold, Skipper, and Josef Perktold (2010). "[statsmodels: Econometric and statistical modeling with python](https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html)." Proceedings of the 9th Python in Science Conference.