# Multiple Regression Analysis with Qualitative Information

In [1]:
import pandas as pd
import numpy as np
import wooldridge
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

In [2]:
def report(model, option=3, return_info=False):
    summary_table = model.summary2().tables[1]

    # Select columns based on the option
    if option == 1:
        summary = summary_table[["Coef.", "Std.Err."]]
    elif option == 2:
        summary = summary_table[["Coef.", "Std.Err.", "t"]]
    elif option == 3:
        summary = summary_table[["Coef.", "Std.Err.", "t", "P>|t|"]]
    elif option == 4:
        summary = summary_table[["Coef.", "P>|t|"]]
    else:
        summary = summary_table.copy()  # All columns

    # Round the values for clean display
    summary = summary.round(3)

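    # Model info: sample size, R², and the standard error of the regression
    n = int(model.nobs)
    r2 = round(model.rsquared, 3)
    ser = round(np.sqrt(model.mse_resid), 3)  # square root of the residual mean square
    model_info = f"R² = {r2}, n = {n}, SE = {ser}"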

    if return_info:
        return summary, model_info
    else:
        print("Model info:", model_info)
        return summary


In [3]:
wooldridge.data()

  J.M. Wooldridge (2019) Introductory Econometrics: A Modern Approach,
  Cengage Learning, 6th edition.

  401k       401ksubs    admnrev       affairs     airfare
  alcohol    apple       approval      athlet1     athlet2
  attend     audit       barium        beauty      benefits
  beveridge  big9salary  bwght         bwght2      campus
  card       catholic    cement        census2000  ceosal1
  ceosal2    charity     consump       corn        countymurders
  cps78_85   cps91       crime1        crime2      crime3
  crime4     discrim     driving       earns       econmath
  elem94_95  engin       expendshares  ezanders    ezunem
  fair       fertil1     fertil2       fertil3     fish
  fringe     gpa1        gpa2          gpa3        happiness
  hprice1    hprice2     hprice3       hseinv      htv
  infmrt     injury      intdef        intqrt      inven
  jtrain     jtrain2     jtrain3       kielmc      lawsch85
  loanapp    lowbrth     mathpnl       meap00_01   meap01
  meap93    

## Examples

In [19]:
wage1 = wooldridge.data('wage1')
gpa1 = wooldridge.data('gpa1')
jtrain = wooldridge.data('jtrain')
hprice1 = wooldridge.data('hprice1')
beauty = wooldridge.data('beauty')

### 7.1 Hourly Wage Equation

In [5]:
wooldridge.data('wage1', description= True)

name of dataset: wage1
no of variables: 24
no of observations: 526

+----------+---------------------------------+
| variable | label                           |
+----------+---------------------------------+
| wage     | average hourly earnings         |
| educ     | years of education              |
| exper    | years potential experience      |
| tenure   | years with current employer     |
| nonwhite | =1 if nonwhite                  |
| female   | =1 if female                    |
| married  | =1 if married                   |
| numdep   | number of dependents            |
| smsa     | =1 if live in SMSA              |
| northcen | =1 if live in north central U.S |
| south    | =1 if live in southern region   |
| west     | =1 if live in western region    |
| construc | =1 if work in construc. indus.  |
| ndurman  | =1 if in nondur. manuf. indus.  |
| trcommpu | =1 if in trans, commun, pub ut  |
| trade    | =1 if in wholesale or retail    |
| services | =1 if in services indus.  

In [6]:
model01 = smf.ols('wage ~ female + educ + exper + tenure', data = wage1).fit()
report(model01)

Model info: R² = 0.364, n = 526, SE = 2.958


Unnamed: 0,Coef.,Std.Err.,t,P>|t|
Intercept,-1.568,0.725,-2.164,0.031
female,-1.811,0.265,-6.838,0.0
educ,0.572,0.049,11.584,0.0
exper,0.025,0.012,2.195,0.029
tenure,0.141,0.021,6.663,0.0


For men (female = 0), the intercept of -1.57 has no useful interpretation, because no one in the sample has zero education, experience, and tenure.  
On average, a woman earns \$1.81 less per hour than a man with the same education, experience, and tenure.
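
To see why the intercept alone is not informative, the fitted equation can be evaluated at realistic values instead. The sketch below hard-codes the rounded coefficients from the table above; the helper `predict_wage` and the example profile (12 years of education, 10 of experience, 5 of tenure) are illustrative assumptions, not part of the original analysis.

```python
# Rounded coefficients copied from the model01 table above.
coefs = {"Intercept": -1.568, "female": -1.811, "educ": 0.572,
         "exper": 0.025, "tenure": 0.141}


def predict_wage(female, educ, exper, tenure):
    """Fitted hourly wage for one (hypothetical) worker profile."""
    return (coefs["Intercept"] + coefs["female"] * female
            + coefs["educ"] * educ + coefs["exper"] * exper
            + coefs["tenure"] * tenure)


# Same hypothetical profile for a man and a woman.
man = predict_wage(female=0, educ=12, exper=10, tenure=5)
woman = predict_wage(female=1, educ=12, exper=10, tenure=5)
print(round(man, 2), round(woman, 2), round(woman - man, 2))
```

Whatever profile is chosen, the gap between the two predictions equals the coefficient on female, because the model has no interaction terms.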

In [7]:
model012 = smf.ols('wage ~ female', data = wage1).fit()
report(model012)

Model info: R² = 0.116, n = 526, SE = 3.476


Unnamed: 0,Coef.,Std.Err.,t,P>|t|
Intercept,7.099,0.21,33.806,0.0
female,-2.512,0.303,-8.279,0.0


In this simple model, men earn \$7.10 per hour on average.
Women earn \$2.51 less than men on average, that is, \$7.10 - \$2.51 = \$4.59 per hour.

The difference between the two groups is statistically significant ($t = -8.28$).

Note: a simple regression on a dummy variable is an easy way to carry out a comparison-of-means test, but the usual t statistic assumes homoskedasticity (equal variances in the two groups).
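
The link between a regression on a single dummy and a comparison of means can be checked by hand. The following is a minimal sketch on made-up toy numbers (they are not taken from wage1): the OLS intercept reproduces the mean of the base group and the slope reproduces the difference in group means.

```python
from statistics import mean

# Made-up toy data, for illustration only (not from wage1).
wage = [8.0, 6.5, 7.5, 4.0, 5.0, 4.5, 5.5]
female = [0, 0, 0, 1, 1, 1, 1]

# OLS slope and intercept for a single regressor, computed by hand.
x_bar, y_bar = mean(female), mean(wage)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(female, wage))
         / sum((x - x_bar) ** 2 for x in female))
intercept = y_bar - slope * x_bar

# Group means computed directly.
mean_men = mean([w for w, f in zip(wage, female) if f == 0])
mean_women = mean([w for w, f in zip(wage, female) if f == 1])

print(intercept, mean_men)           # intercept vs. mean for the base group
print(slope, mean_women - mean_men)  # slope vs. difference in means
```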

### 7.2 Effects of Computer Ownership on College GPA

In [8]:
wooldridge.data('gpa1', description= True)

name of dataset: gpa1
no of variables: 29
no of observations: 141

+----------+--------------------------------+
| variable | label                          |
+----------+--------------------------------+
| age      | in years                       |
| soph     | =1 if sophomore                |
| junior   | =1 if junior                   |
| senior   | =1 if senior                   |
| senior5  | =1 if fifth year senior        |
| male     | =1 if male                     |
| campus   | =1 if live on campus           |
| business | =1 if business major           |
| engineer | =1 if engineering major        |
| colGPA   | MSU GPA                        |
| hsGPA    | high school GPA                |
| ACT      | 'achievement' score            |
| job19    | =1 if job <= 19 hours          |
| job20    | =1 if job >= 20 hours          |
| drive    | =1 if drive to campus          |
| bike     | =1 if bicycle to campus        |
| walk     | =1 if walk to campus           |
| voluntr  | 

In [9]:
model02= smf.ols(formula='colGPA ~ PC + hsGPA + ACT', data=gpa1).fit()
report(model02)

Model info: R² = 0.219, n = 141, SE = 0.333


Unnamed: 0,Coef.,Std.Err.,t,P>|t|
Intercept,1.264,0.333,3.793,0.0
PC,0.157,0.057,2.746,0.007
hsGPA,0.447,0.094,4.776,0.0
ACT,0.009,0.011,0.822,0.413


A student who owns a PC is predicted to have a college GPA about 0.16 points higher than a student without one who has the same ACT score and high school GPA.

hsGPA is statistically significant, so dropping it would change the PC coefficient if computer ownership is correlated with high school GPA.
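
To gauge how precisely the PC effect is estimated, an approximate 95% confidence interval can be built from the rounded coefficient and standard error in the table above. This is a sketch that assumes a normal approximation to the t distribution with 141 - 4 = 137 degrees of freedom, so the bounds are slightly narrower than the exact t-based interval.

```python
from statistics import NormalDist

coef, se = 0.157, 0.057          # PC row of the model02 table above
z = NormalDist().inv_cdf(0.975)  # about 1.96; normal approximation to t(137)

lower, upper = coef - z * se, coef + z * se
print(round(lower, 3), round(upper, 3))
```

The interval excludes zero, which agrees with the p-value of 0.007 reported for PC.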

### 7.3 Effects of Training Grants on Hours of Training

In [10]:
wooldridge.data('jtrain', description= True)

name of dataset: jtrain
no of variables: 30
no of observations: 471

+----------+---------------------------------+
| variable | label                           |
+----------+---------------------------------+
| year     | 1987, 1988, or 1989             |
| fcode    | firm code number                |
| employ   | # employees at plant            |
| sales    | annual sales, $                 |
| avgsal   | average employee salary         |
| scrap    | scrap rate (per 100 items)      |
| rework   | rework rate (per 100 items)     |
| tothrs   | total hours training            |
| union    | =1 if unionized                 |
| grant    | = 1 if received grant           |
| d89      | = 1 if year = 1989              |
| d88      | = 1 if year = 1988              |
| totrain  | total employees trained         |
| hrsemp   | tothrs/totrain                  |
| lscrap   | log(scrap)                      |
| lemploy  | log(employ)                     |
| lsales   | log(sales)               

In [11]:
df = jtrain.copy()
df = df[df['year'] == 1988]
model03 = smf.ols('hrsemp ~ grant + lsales + lemploy', data = df ).fit()
report(model03)

Model info: R² = 0.237, n = 105, SE = 24.38


Unnamed: 0,Coef.,Std.Err.,t,P>|t|
Intercept,46.665,43.412,1.075,0.285
grant,26.254,5.592,4.695,0.0
lsales,-0.985,3.54,-0.278,0.781
lemploy,-6.07,3.883,-1.563,0.121


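hrsemp is used in level form because it is zero for many firms, so its logarithm is undefined for those observations.  
grant is statistically significant and its coefficient is large: firms that received a grant provided about 26.25 more hours of training per employee, on average, holding sales and employment fixed.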

### 7.4 Housing Price Regression

In [12]:
wooldridge.data('hprice1', description = True)

name of dataset: hprice1
no of variables: 10
no of observations: 88

+----------+------------------------------+
| variable | label                        |
+----------+------------------------------+
| price    | house price, $1000s          |
| assess   | assessed value, $1000s       |
| bdrms    | number of bdrms              |
| lotsize  | size of lot in square feet   |
| sqrft    | size of house in square feet |
| colonial | =1 if home is colonial style |
| lprice   | log(price)                   |
| lassess  | log(assess)                  |
| llotsize | log(lotsize)                 |
| lsqrft   | log(sqrft)                   |
+----------+------------------------------+

Collected from the real estate pages of the Boston Globe during 1990.
These are homes that sold in the Boston, MA area.


In [13]:
model04 = smf.ols('lprice ~ llotsize + lsqrft + bdrms + colonial', data = hprice1).fit()
report(model04)

Model info: R² = 0.649, n = 88, SE = 0.184


Unnamed: 0,Coef.,Std.Err.,t,P>|t|
Intercept,-1.35,0.651,-2.073,0.041
llotsize,0.168,0.038,4.395,0.0
lsqrft,0.707,0.093,7.62,0.0
bdrms,0.027,0.029,0.934,0.353
colonial,0.054,0.045,1.202,0.233


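Controlling for lot size, square footage, and number of bedrooms, a colonial-style house is predicted to sell for about 5.4% more than a non-colonial house, although the difference is not statistically significant ($t = 1.20$).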

### 7.5 Log Hourly Wage Equation

In [14]:
model05 = smf.ols('lwage ~ female + educ + exper + expersq + tenure + tenursq', data = wage1).fit()
report(model05)

Model info: R² = 0.441, n = 526, SE = 0.4


Unnamed: 0,Coef.,Std.Err.,t,P>|t|
Intercept,0.417,0.099,4.212,0.0
female,-0.297,0.036,-8.281,0.0
educ,0.08,0.007,11.868,0.0
exper,0.029,0.005,5.916,0.0
expersq,-0.001,0.0,-5.431,0.0
tenure,0.032,0.007,4.633,0.0
tenursq,-0.001,0.0,-2.493,0.013


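Controlling for education, experience, and tenure, women earn about 29.7% less than men when the coefficient on female is read directly as an approximate proportional difference.  
A more accurate estimate of the proportional difference is $$\exp(\hat \beta) - 1 $$  
$\exp(-0.297) - 1 \approx -0.257$, so women earn about 25.7% less than comparable men.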
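
The exact calculation is easy to reproduce. The helper name `exact_pct_change` below is an illustrative assumption; it applies the formula above to a dummy-variable coefficient from a model whose dependent variable is in logs.

```python
import math


def exact_pct_change(beta):
    """Exact percentage change in y implied by a dummy coefficient in a log(y) model."""
    return 100 * (math.exp(beta) - 1)


# Rounded coefficient on female from the model05 table above.
female_effect = exact_pct_change(-0.297)
print(round(female_effect, 1))
```

The gap between the approximation and the exact figure grows with the absolute size of the coefficient, so the correction matters more here than for small coefficients.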

### 7.6 Log Hourly Wage Equation

In [16]:
df = wage1.copy()
df['marrmale'] = ((df['female'] == 0) & (df['married'] == 1)).astype(int)
df['marrfem'] = ((df['female'] == 1) & (df['married'] == 1)).astype(int)
df['singfem'] = ((df['female'] == 1) & (df['married'] == 0)).astype(int)

In [17]:
model06 = smf.ols('lwage ~ marrmale + marrfem + singfem + educ  + exper + expersq + tenure + tenursq', data=df).fit()
report(model06)

Model info: R² = 0.461, n = 526, SE = 0.393


Unnamed: 0,Coef.,Std.Err.,t,P>|t|
Intercept,0.321,0.1,3.213,0.001
marrmale,0.213,0.055,3.842,0.0
marrfem,-0.198,0.058,-3.428,0.001
singfem,-0.11,0.056,-1.98,0.048
educ,0.079,0.007,11.787,0.0
exper,0.027,0.005,5.112,0.0
expersq,-0.001,0.0,-4.847,0.0
tenure,0.029,0.007,4.302,0.0
tenursq,-0.001,0.0,-2.306,0.022


All the coefficients are statistically significant at the 5% level. The base group for the dummy variables is single men.  
Holding education, experience, and tenure fixed, married men are estimated to earn about 21.3% more than single men, while married women earn about 19.8% less than single men.
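
Because every dummy coefficient is measured relative to single men, the estimated gap between any two groups is the difference of their coefficients. The sketch below uses the rounded values from the table above; the key `singmale` and the helper `log_gap` are hypothetical names introduced for illustration.

```python
# Rounded coefficients from the model06 table above.
# Single men are the base group, so their coefficient is zero by construction.
group_coef = {"singmale": 0.0, "marrmale": 0.213,
              "marrfem": -0.198, "singfem": -0.110}


def log_gap(group_a, group_b):
    """Estimated difference in log(wage) between two groups, other factors fixed."""
    return group_coef[group_a] - group_coef[group_b]


women_gap = log_gap("marrfem", "singfem")     # married vs. single women
married_gap = log_gap("marrmale", "marrfem")  # married men vs. married women
print(round(women_gap, 3), round(married_gap, 3))
```

These differences are point estimates only. Testing whether one of them is statistically significant requires its standard error, which is most easily obtained by re-estimating the model with a different base group.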

### 7.7 Effects of Physical Attractiveness on Wage

In [18]:
wooldridge.data('beauty', description= True)

name of dataset: beauty
no of variables: 17
no of observations: 1260

+----------+-------------------------------+
| variable | label                         |
+----------+-------------------------------+
| wage     | hourly wage                   |
| lwage    | log(wage)                     |
| belavg   | =1 if looks <= 2              |
| abvavg   | =1 if looks >=4               |
| exper    | years of workforce experience |
| looks    | from 1 to 5                   |
| union    | =1 if union member            |
| goodhlth | =1 if good health             |
| black    | =1 if black                   |
| female   | =1 if female                  |
| married  | =1 if married                 |
| south    | =1 if live in south           |
| bigcity  | =1 if live in big city        |
| smllcity | =1 if live in small city      |
| service  | =1 if service industry        |
| expersq  | exper^2                       |
| educ     | years of schooling            |
+----------+------------------

### 7.8 Effects of Law School Rankings on Starting Salaries

### 7.9 Effects of Computer Usage on Wages

### 7.10 Log Hourly Wage Equation

### 7.11 Effects of Race on Baseball Player Salaries

### 7.12 A Linear Probability Model of Arrests

### 7.13 Evaluating a Job Training Program using Unrestricted Regression Adjustment

## 🚧Computer Exercises