## Introduction

This in-class example demonstrates how to incorporate qualitative explanatory variables into a multiple linear regression model. It covers most of the common ways in which binary (dummy) variables are included in a regression model.

What you need to know:  
- Pandas, Statsmodels modules
- Theoretical concepts of the multiple linear regression model
- How to create and work with binary (dummy) variables

See the list of [references](#References) for details on the concepts and techniques used in this exercise.

***

## Data Description

```
-----------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------------------------------------------------------------------
wage            float   %8.2g                 average hourly earnings
educ            byte    %8.0g                 years of education
exper           byte    %8.0g                 years potential experience
tenure          byte    %8.0g                 years with current employer
nonwhite        byte    %8.0g                 =1 if nonwhite
female          byte    %8.0g                 =1 if female
married         byte    %8.0g                 =1 if married
numdep          byte    %8.0g                 number of dependents
smsa            byte    %8.0g                 =1 if live in SMSA
northcen        byte    %8.0g                 =1 if live in north central U.S
south           byte    %8.0g                 =1 if live in southern region
west            byte    %8.0g                 =1 if live in western region
construc        byte    %8.0g                 =1 if work in construc. indus.
ndurman         byte    %8.0g                 =1 if in nondur. manuf. indus.
trcommpu        byte    %8.0g                 =1 if in trans, commun, pub ut
trade           byte    %8.0g                 =1 if in wholesale or retail
services        byte    %8.0g                 =1 if in services indus.
profserv        byte    %8.0g                 =1 if in prof. serv. indus.
profocc         byte    %8.0g                 =1 if in profess. occupation
clerocc         byte    %8.0g                 =1 if in clerical occupation
servocc         byte    %8.0g                 =1 if in service occupation
lwage           float   %9.0g                 log(wage)
expersq         int     %9.0g                 exper^2
tenursq         int     %9.0g                 tenure^2
-----------------------------------------------------------------------------------------------------------------------------
```

***
## Load the required modules

In [1]:
import math
import numpy as np
import pandas as pd
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf

***
## Load the data set
The data set is stored in a comma-separated values (CSV) file named "*WAGE.csv*" with a header row.

In [2]:
data = pd.read_csv("WAGE.csv")

#### Check if the data is properly imported

In [3]:
data.head()

Unnamed: 0,wage,educ,exper,tenure,nonwhite,female,married,numdep,smsa,northcen,...,trcommpu,trade,services,profserv,profocc,clerocc,servocc,lwage,expersq,tenursq
0,3.1,11,2,0,0,1,0,2,1,0,...,0,0,0,0,0,0,0,1.131402,4,0
1,3.24,12,22,2,0,1,1,3,1,0,...,0,0,1,0,0,0,1,1.175573,484,4
2,3.0,11,2,0,0,0,0,2,0,0,...,0,1,0,0,0,0,0,1.098612,4,0
3,6.0,8,44,28,0,0,1,0,1,0,...,0,0,0,0,0,1,0,1.79176,1936,784
4,5.3,12,7,2,0,0,1,1,0,0,...,0,0,0,0,0,0,0,1.667707,49,4


Summary statistics for women:

In [7]:
print(data.query("female == 1")[["wage", "educ", "exper","tenure"]].describe())

             wage        educ       exper      tenure
count  252.000000  252.000000  252.000000  252.000000
mean     4.587659   12.317460   16.428571    3.615079
std      2.529363    2.472642   13.652738    5.357968
min      0.530000    0.000000    1.000000    0.000000
25%      3.000000   12.000000    5.000000    0.000000
50%      3.750000   12.000000   13.000000    2.000000
75%      5.510000   13.000000   26.000000    4.000000
max     21.629999   18.000000   50.000000   34.000000


Summary statistics for men:

In [8]:
print(data.query("female == 0")[["wage", "educ", "exper","tenure"]].describe())

             wage        educ       exper      tenure
count  274.000000  274.000000  274.000000  274.000000
mean     7.099489   12.788321   17.558394    6.474453
std      4.160858    3.002882   13.499907    8.369297
min      1.500000    2.000000    1.000000    0.000000
25%      4.142500   12.000000    6.000000    0.000000
50%      6.000000   12.000000   14.000000    3.000000
75%      8.765000   15.000000   28.000000    9.000000
max     24.980000   18.000000   51.000000   44.000000


***
## Single dummy independent variable

Estimate the model 
$$wage = \beta_0 + \delta_0 female + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u.$$
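
A minimal sketch of how this model could be estimated with the statsmodels formula interface. The `demo` frame is a made-up stand-in for `WAGE.csv` so that the block runs on its own; its coefficients are arbitrary assumptions, not results from the real data. With the real data, pass `data` instead of `demo`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in for WAGE.csv; the numbers below are arbitrary assumptions.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "female": rng.integers(0, 2, n),
    "educ": rng.integers(8, 19, n),
    "exper": rng.integers(1, 40, n),
    "tenure": rng.integers(0, 20, n),
})
demo["wage"] = (1.0 - 2.0 * demo["female"] + 0.5 * demo["educ"]
                + 0.05 * demo["exper"] + 0.1 * demo["tenure"]
                + rng.normal(0, 1, n))

# OLS with the female dummy entered like any other regressor.
results = smf.ols("wage ~ female + educ + exper + tenure", data=demo).fit()

# delta_0: estimated wage gap for women, holding the other regressors fixed.
print(results.params["female"])
```

Calling `results.summary()` would print the full regression table, including standard errors and t statistics.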

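The coefficient on $female$, $\delta_0$, measures the average difference in hourly wage between women and men with the same levels of education, experience, and tenure.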

Because the multiple regression controls for educ, exper, and tenure, the wage differential cannot be explained by different average levels of education, experience, or tenure between men and women. We can conclude that the differential of \$1.81 is due to gender or to factors associated with gender that we have not controlled for in the regression.

Consider a simpler model in which all other explanatory variables are dropped from the equation:
$$wage = \beta_0 + \delta_0 female + u$$

The coefficients in this model have a simple interpretation. The intercept $\beta_0$ is the average wage for men in the sample (i.e., $female=0$), and $\delta_0$ is the difference between the average wage for women and the average wage for men.

The regression therefore provides a simple way to carry out a *comparison-of-means* test between the two groups, which in this case are men and women.

Generally, simple regression on a constant and a dummy variable is a straightforward way to compare the means of two groups.
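
The equivalence between this regression and a comparison of group means can be checked numerically. The sketch below uses a small made-up sample (arbitrary numbers, not the `WAGE.csv` values) and compares the OLS coefficients with the group means computed directly.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up wages for two groups; arbitrary numbers, not the WAGE.csv values.
rng = np.random.default_rng(1)
demo = pd.DataFrame({"female": np.repeat([0, 1], [60, 40])})
demo["wage"] = 7.0 - 2.5 * demo["female"] + rng.normal(0, 2, len(demo))

simple = smf.ols("wage ~ female", data=demo).fit()
group_means = demo.groupby("female")["wage"].mean()

# Intercept = mean for men; intercept + slope = mean for women.
mean_men = simple.params["Intercept"]
mean_women = simple.params["Intercept"] + simple.params["female"]
print(mean_men, group_means.loc[0])
print(mean_women, group_means.loc[1])

# With the default (nonrobust) standard errors, the t statistic on female
# is the equal-variance comparison-of-means test.
print(simple.tvalues["female"], simple.pvalues["female"])
```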

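The average wage for women in the sample is the intercept plus the coefficient on $female$, $\beta_0 + \delta_0$, which matches the mean of about \$4.59 in the summary statistics for women above.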

The estimated wage differential between men and women is larger than in the multiple regression because this model does not control for differences in education, experience, and tenure, and these are lower, on average, for women than for men in this sample.

***
## Using Dummy Variables for Multiple Categories

Consider a model that allows for wage differences among four groups: married men, married women, single men, and single women. To do this, we select single men as our base group and define dummy variables for each of the remaining groups. Call these $marrmale$ (married men), $marrfem$ (married women), and $singfem$ (single women).

The model is specified as:
$$\log(wage) = \beta_0 + \delta_0 marrmale + \delta_1 marrfem + \delta_2 singfem + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + \beta_4 tenure + \beta_5 tenure^2 + u.$$

There are multiple ways to generate new dummy variables. 

For the task at hand, we use the pandas ```get_dummies``` function to expand the variables into multiple dummies.
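
One possible way to build the three group dummies with `get_dummies` is sketched below. This is not necessarily how the original notebook did it. The tiny `demo` frame and the label `singmale` for the base group are assumptions made for illustration.

```python
import pandas as pd

# Tiny made-up sample; only the two columns needed to form the groups.
demo = pd.DataFrame({"female":  [0, 0, 1, 1, 0, 1],
                     "married": [1, 0, 1, 0, 1, 0]})

# Label each row with one of four groups ("singmale" is a hypothetical
# name for the base group, which the text leaves unnamed).
names = {(0, 0): "singmale", (0, 1): "marrmale",
         (1, 0): "singfem", (1, 1): "marrfem"}
demo["group"] = [names[pair] for pair in
                 zip(demo["female"].tolist(), demo["married"].tolist())]

# One 0/1 column per group, then drop the base group to avoid the
# dummy variable trap.
dummies = pd.get_dummies(demo["group"], dtype=int).drop(columns="singmale")
demo = demo.join(dummies)
print(demo)
```

Dropping the base-group column keeps the remaining dummies from summing to the constant, so the model can still include an intercept.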

Estimate the model:
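
A hedged sketch of the estimation step follows. The `demo` frame is a made-up stand-in for `WAGE.csv` with arbitrary coefficients, and the dummies are built here by simple arithmetic so that the block is self-contained. With the real data, the same formula would be applied to `data` once the three dummy columns exist.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in for WAGE.csv; all coefficients below are arbitrary.
rng = np.random.default_rng(2)
n = 800
demo = pd.DataFrame({
    "female": rng.integers(0, 2, n),
    "married": rng.integers(0, 2, n),
    "educ": rng.integers(8, 19, n),
    "exper": rng.integers(1, 40, n),
    "tenure": rng.integers(0, 20, n),
})
demo["expersq"] = demo["exper"] ** 2
demo["tenursq"] = demo["tenure"] ** 2
demo["marrmale"] = (1 - demo["female"]) * demo["married"]
demo["marrfem"] = demo["female"] * demo["married"]
demo["singfem"] = demo["female"] * (1 - demo["married"])
demo["lwage"] = (0.3 + 0.2 * demo["marrmale"] - 0.2 * demo["marrfem"]
                 - 0.1 * demo["singfem"] + 0.08 * demo["educ"]
                 + rng.normal(0, 0.3, n))

formula = ("lwage ~ marrmale + marrfem + singfem + educ + exper + expersq"
           " + tenure + tenursq")
groups = smf.ols(formula, data=demo).fit()

# Each group coefficient is measured relative to single men (the base group).
print(groups.params[["marrmale", "marrfem", "singfem"]])
```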

We can use this equation to obtain the estimated difference between any two groups. Because the overall intercept is common to all groups, we can ignore that in finding differences. 

The estimated proportionate difference between single and married women is $\delta_2 - \delta_1$: 

It means that single women earn about 8.8% more than married women.

Unfortunately, we cannot use the reported estimation results directly to test whether the estimated difference between single and married women is statistically significant. Knowing the standard errors on marrfem and singfem is not enough to carry out the test, because the standard error of the difference also depends on the covariance between the two estimates. The easiest thing to do is to choose one of these groups as the base group and reestimate the equation.
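
The sketch below illustrates the reestimation on made-up data (arbitrary coefficients, not the `WAGE.csv` results). The column `singmale` is a hypothetical name for the single-men dummy that is needed once married women become the base group. As a cross-check, the standard error of the difference is also computed from the covariance matrix of the first fit.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in for WAGE.csv; all coefficients below are arbitrary.
rng = np.random.default_rng(3)
n = 800
demo = pd.DataFrame({
    "female": rng.integers(0, 2, n),
    "married": rng.integers(0, 2, n),
    "educ": rng.integers(8, 19, n),
    "exper": rng.integers(1, 40, n),
    "tenure": rng.integers(0, 20, n),
})
demo["expersq"] = demo["exper"] ** 2
demo["tenursq"] = demo["tenure"] ** 2
demo["marrmale"] = (1 - demo["female"]) * demo["married"]
demo["marrfem"] = demo["female"] * demo["married"]
demo["singfem"] = demo["female"] * (1 - demo["married"])
demo["singmale"] = (1 - demo["female"]) * (1 - demo["married"])  # hypothetical name
demo["lwage"] = (0.3 + 0.2 * demo["marrmale"] - 0.2 * demo["marrfem"]
                 - 0.1 * demo["singfem"] + 0.08 * demo["educ"]
                 + rng.normal(0, 0.3, n))

controls = "educ + exper + expersq + tenure + tenursq"
base_singmale = smf.ols(f"lwage ~ marrmale + marrfem + singfem + {controls}",
                        data=demo).fit()
# Same model, but with married women as the base group.
base_marrfem = smf.ols(f"lwage ~ marrmale + singmale + singfem + {controls}",
                       data=demo).fit()

# delta_2 - delta_1 from the first fit equals the singfem coefficient
# in the reestimated model.
diff = base_singmale.params["singfem"] - base_singmale.params["marrfem"]
print(diff, base_marrfem.params["singfem"])

# The standard error of the difference needs the covariance term as well.
cov = base_singmale.cov_params()
se_diff = np.sqrt(cov.loc["singfem", "singfem"] + cov.loc["marrfem", "marrfem"]
                  - 2 * cov.loc["singfem", "marrfem"])
print(se_diff, base_marrfem.bse["singfem"])
```

The reestimated model reports the t statistic and p-value for the single versus married women comparison directly on the `singfem` row.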

***
## Interactions Involving Dummy Variables

The section above already demonstrates how to create interaction dummies manually. Sometimes it is helpful to create those variables automatically *within* the model specification. For this purpose, we use the function for generating categorical variables in the ```statsmodels``` module.

Consider the model from the previous section, rewritten with an interaction term:
$$\log(wage) = \beta_0 + \delta_0 female + \delta_1 married + \delta_2 female \cdot married + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + \beta_4 tenure + \beta_5 tenure^2 + u.$$

We can use the ```C()``` operator to explicitly indicate that $female$ and $married$ should be treated as categorical variables.
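
A minimal sketch of this specification, again on a made-up stand-in for `WAGE.csv` with arbitrary coefficients. In the formula, `C(female) * C(married)` expands to both main effects plus their interaction.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in for WAGE.csv; all coefficients below are arbitrary.
rng = np.random.default_rng(4)
n = 800
demo = pd.DataFrame({
    "female": rng.integers(0, 2, n),
    "married": rng.integers(0, 2, n),
    "educ": rng.integers(8, 19, n),
    "exper": rng.integers(1, 40, n),
    "tenure": rng.integers(0, 20, n),
})
demo["expersq"] = demo["exper"] ** 2
demo["tenursq"] = demo["tenure"] ** 2
demo["lwage"] = (0.3 - 0.1 * demo["female"] + 0.2 * demo["married"]
                 - 0.3 * demo["female"] * demo["married"]
                 + 0.08 * demo["educ"] + rng.normal(0, 0.3, n))

formula = ("lwage ~ C(female) * C(married) + educ + exper + expersq"
           " + tenure + tenursq")
inter = smf.ols(formula, data=demo).fit()

d_female = inter.params["C(female)[T.1]"]                  # delta_0
d_married = inter.params["C(married)[T.1]"]                # delta_1
d_both = inter.params["C(female)[T.1]:C(married)[T.1]"]    # delta_2

# Married women relative to single men: delta_0 + delta_1 + delta_2.
print(d_female + d_married + d_both)
```

In this parameterization the coefficient on the interaction term carries its own standard error, so whether the gender gap depends on marital status can be read directly from the output.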

#### Allowing for Different Slopes

We can use the same approach for estimating different slopes.

Consider the following model:
$$\log(wage) = \beta_0 + \delta_0 female + \beta_1 educ + \delta_1 female \cdot educ + \beta_2 exper + \beta_3 exper^2 + \beta_4 tenure + \beta_5 tenure^2 + u.$$
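
A hedged sketch of the different-slopes specification, using a made-up stand-in for `WAGE.csv` with arbitrary coefficients. Here `C(female) * educ` adds a separate intercept shift and a separate return to education for women.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in for WAGE.csv; all coefficients below are arbitrary.
rng = np.random.default_rng(5)
n = 800
demo = pd.DataFrame({
    "female": rng.integers(0, 2, n),
    "educ": rng.integers(8, 19, n),
    "exper": rng.integers(1, 40, n),
    "tenure": rng.integers(0, 20, n),
})
demo["expersq"] = demo["exper"] ** 2
demo["tenursq"] = demo["tenure"] ** 2
demo["lwage"] = (0.4 - 0.2 * demo["female"] + 0.08 * demo["educ"]
                 - 0.01 * demo["female"] * demo["educ"]
                 + rng.normal(0, 0.3, n))

formula = "lwage ~ C(female) * educ + exper + expersq + tenure + tenursq"
slopes = smf.ols(formula, data=demo).fit()

slope_men = slopes.params["educ"]                            # beta_1
slope_gap = slopes.params["C(female)[T.1]:educ"]             # coefficient on female * educ
slope_women = slope_men + slope_gap

print(slope_men, slope_women)
```

The coefficient on `C(female)[T.1]:educ` is the difference in the return to education between women and men, so its t statistic tests whether the slopes differ.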

***
## References
- REM 750 lecture on:\
    Multiple linear regression model

- Jeffrey M. Wooldridge (2012) "Introductory Econometrics: A Modern Approach, 5e" Chapter 7.

- LinkedIn Training on:\
    [Python Data Analysis]()\
    [Python Statistics Essential Training]()
    
- Seabold, Skipper, and Josef Perktold (2010). "[statsmodels: Econometric and statistical modeling with python](https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html)." Proceedings of the 9th Python in Science Conference.