# Examples of Interpreting Coefficient Estimates in a Linear Regression
Coding Author: *Ian He*

Date: *Jul 12, 2023*

Python Version: *3.11*

Example source: *https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faqhow-do-i-interpret-a-regression-model-when-some-variables-are-log-transformed/*

In [1]:
import pandas as pd                    # for data handling
import numpy as np                     # for numerical methods and data structures
from scipy import stats                # for statistical functions, e.g., different mean functions
import patsy                           # provides a syntax for specifying models  
import statsmodels.api as sm           # provides models like OLS, GMM, ANOVA, etc.
import statsmodels.formula.api as smf  # provides a way to directly specify models from formulas

In [2]:
logtrans = pd.read_csv('../Data/lgtrans.csv')   # forward slashes work on Windows, macOS, and Linux
logtrans

Unnamed: 0,female,read,write,math
0,male,57,52,41
1,female,68,59,53
2,male,44,33,54
3,male,63,44,47
4,male,47,52,57
...,...,...,...,...
195,female,55,59,52
196,female,42,46,38
197,female,57,41,57
198,female,55,62,58


## Different types of means for `write` (our dependent variable)

In [3]:
print('The arithmetic mean of writing scores is ', np.mean(logtrans['write']), '.', sep='')
print('The geometric mean of writing scores is ', stats.gmean(logtrans['write']), '.', sep='')
print('The harmonic mean of writing scores is ', stats.hmean(logtrans['write']), '.', sep='')

The arithmetic mean of writing scores is 52.775.
The geometric mean of writing scores is 51.84959851260954.
The harmonic mean of writing scores is 50.84403450478561.
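
As a reference for the three numbers above, the means can also be written out from their definitions. The sketch below uses a small hypothetical list of scores (not values from `lgtrans.csv`) and only the Python standard library. For positive data the three means always satisfy harmonic ≤ geometric ≤ arithmetic, which matches the ordering in the output above.

```python
import math
from statistics import fmean, geometric_mean, harmonic_mean

# Hypothetical scores for illustration only (not taken from lgtrans.csv).
scores = [33.0, 44.0, 52.0, 59.0]
n = len(scores)

arithmetic = sum(scores) / n                 # (x1 + ... + xn) / n
geometric = math.prod(scores) ** (1 / n)     # (x1 * ... * xn) ** (1/n)
harmonic = n / sum(1 / s for s in scores)    # n / (1/x1 + ... + 1/xn)

print(arithmetic, geometric, harmonic)

# Cross-check the hand-written formulas against the standard library.
print(math.isclose(arithmetic, fmean(scores)),
      math.isclose(geometric, geometric_mean(scores)),
      math.isclose(harmonic, harmonic_mean(scores)))
```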


## Group means of `write` across genders

In [4]:
# groupby sorts its keys alphabetically ('female' before 'male'),
# so select each group by label rather than by position.
group_write = logtrans.groupby('female')['write']

print('For males:')
print(' The arithmetic mean of writing scores is ', group_write.mean().loc['male'], ';', sep='')
print(' The geometric mean of writing scores is ', group_write.apply(stats.gmean).loc['male'], ';', sep='')
print(' The harmonic mean of writing scores is ', group_write.apply(stats.hmean).loc['male'], '.\n', sep='')

print('For females:')
print(' The arithmetic mean of writing scores is ', group_write.mean().loc['female'], ';', sep='')
print(' The geometric mean of writing scores is ', group_write.apply(stats.gmean).loc['female'], ';', sep='')
print(' The harmonic mean of writing scores is ', group_write.apply(stats.hmean).loc['female'], '.', sep='')

For males:
 The arithmetic mean of writing scores is 50.120879120879124;
 The geometric mean of writing scores is 49.012224919302966;
 The harmonic mean of writing scores is 47.85388275552543.

For females:
 The arithmetic mean of writing scores is 54.99082568807339;
 The geometric mean of writing scores is 54.343831217618856;
 The harmonic mean of writing scores is 53.642364007332986.


## Generate a dummy indicating "female"

In [5]:
logtrans['female'] = pd.get_dummies(logtrans['female'], dtype=int)['female']   # 1 = female, 0 = male (dtype=int avoids boolean dummies in pandas 2.x)
logtrans

Unnamed: 0,female,read,write,math
0,0,57,52,41
1,1,68,59,53
2,0,44,33,54
3,0,63,44,47
4,0,47,52,57
...,...,...,...,...
195,1,55,59,52
196,1,42,46,38
197,1,57,41,57
198,1,55,62,58


## When dependent variable is log transformed

We start with an intercept-only model:
$$\ln({\rm write}) = \beta_0 + e_i$$

In [6]:
lgmodel1 = smf.ols('np.log(write) ~ 1', data=logtrans).fit()
print(lgmodel1.summary())

                            OLS Regression Results                            
Dep. Variable:          np.log(write)   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                       nan
Date:                Wed, 12 Jul 2023   Prob (F-statistic):                nan
Time:                        08:09:53   Log-Likelihood:                 45.093
No. Observations:                 200   AIC:                            -88.19
Df Residuals:                     199   BIC:                            -84.89
Df Model:                           0                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.9483      0.014    288.402      0.0

The exponentiated value of $\hat{\beta}_0$ is equal to the **geometric mean** of `write`.

In [7]:
np.exp(lgmodel1.params)

Intercept    51.849599
dtype: float64
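
The reason is that an intercept-only OLS fit returns the sample mean of the dependent variable, here the mean of ln(write), and exponentiating a mean of logs gives the geometric mean. A minimal sketch with hypothetical scores (standard library only; the helper `sse` is introduced here purely for illustration):

```python
import math
from statistics import fmean, geometric_mean

# Hypothetical scores for illustration only (not taken from lgtrans.csv).
write = [33.0, 44.0, 52.0, 59.0]
log_write = [math.log(w) for w in write]

def sse(b0: float) -> float:
    """Sum of squared residuals of the intercept-only model ln(write) = b0 + e."""
    return sum((v - b0) ** 2 for v in log_write)

# The least-squares intercept is the mean of the logged outcome.
b0_hat = fmean(log_write)

# Moving the intercept away from the mean in either direction increases the SSE.
print(sse(b0_hat) < sse(b0_hat - 0.01), sse(b0_hat) < sse(b0_hat + 0.01))

# Exponentiating the intercept recovers the geometric mean.
print(math.exp(b0_hat), geometric_mean(write))
```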

Next, let's turn to a model with a single binary variable:
$$\ln({\rm write}) = \beta_0 + \beta_1 \cdot {\rm female} + e_i$$

In [8]:
lgmodel2 = smf.ols('np.log(write) ~ female', data=logtrans).fit()
print(lgmodel2.summary())

                            OLS Regression Results                            
Dep. Variable:          np.log(write)   R-squared:                       0.071
Model:                            OLS   Adj. R-squared:                  0.066
Method:                 Least Squares   F-statistic:                     15.11
Date:                Wed, 12 Jul 2023   Prob (F-statistic):           0.000139
Time:                        08:09:53   Log-Likelihood:                 52.446
No. Observations:                 200   AIC:                            -100.9
Df Residuals:                     198   BIC:                            -94.30
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.8921      0.020    198.446      0.0

Because `female` equals 0 for male students, the exponentiated value of $\hat{\beta}_0$ is equal to the **geometric mean** of the **male** group's writing scores.

In [9]:
np.exp(lgmodel2.params['Intercept'])

49.01222491930301

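The estimate $\hat{\beta}_1$ can be interpreted as a percent change in the geometric mean of `write` using the following formula, where $\Delta$ is the change in the regressor ($\Delta = 1$ when switching from male to female):
$$\left[\exp\left(\hat{\beta}_1 \times \Delta\right) - 1\right] \times 100\%$$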

In [10]:
print('Switching from male students to female students, we expect to see about a ', round((np.exp(lgmodel2.params['female'])-1)*100, 2), '% increase in the geometric mean of writing scores.', sep='')

Switching from male students to female students, we expect to see about a 10.88% increase in the geometric mean of writing scores.
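
Equivalently, with a single 0/1 regressor the OLS intercept is the mean of ln(write) in the reference group and the slope is the difference between the two group means of ln(write), so the exponentiated slope is the ratio of the two geometric means. The sketch below checks this on hypothetical scores with a closed-form simple regression (standard library only), then repeats the percent calculation with the two group geometric means printed earlier in this notebook.

```python
import math
from statistics import fmean, geometric_mean

# Hypothetical scores for illustration only (not taken from lgtrans.csv).
group0 = [40.0, 50.0, 62.0]          # reference group (dummy = 0)
group1 = [45.0, 54.0, 66.0, 58.0]    # comparison group (dummy = 1)

x = [0.0] * len(group0) + [1.0] * len(group1)
y = [math.log(v) for v in group0 + group1]

# Closed-form simple OLS: slope = S_xy / S_xx, intercept = y_bar - slope * x_bar.
x_bar, y_bar = fmean(x), fmean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

gm0, gm1 = geometric_mean(group0), geometric_mean(group1)
print(math.exp(intercept), gm0)      # both: geometric mean of the reference group
print(math.exp(slope), gm1 / gm0)    # both: ratio of the two geometric means

# Same calculation with the group geometric means reported earlier in the notebook.
gm_reference = 49.012224919302966    # equals the exponentiated intercept from In [9]
gm_female = 54.343831217618856
pct_diff = round((gm_female / gm_reference - 1) * 100, 2)
print(pct_diff)
```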


Finally, let's see a longer model with multiple regressors:
$$\ln({\rm write}) = \beta_0 + \beta_1 \cdot {\rm female} + \beta_2 \cdot {\rm read} + \beta_3 \cdot {\rm math} + e_i$$

In [11]:
lgmodel3 = smf.ols('np.log(write) ~ female + read + math', data=logtrans).fit()
print(lgmodel3.summary())

                            OLS Regression Results                            
Dep. Variable:          np.log(write)   R-squared:                       0.504
Model:                            OLS   Adj. R-squared:                  0.497
Method:                 Least Squares   F-statistic:                     66.44
Date:                Wed, 12 Jul 2023   Prob (F-statistic):           1.11e-29
Time:                        08:09:53   Log-Likelihood:                 115.25
No. Observations:                 200   AIC:                            -222.5
Df Residuals:                     196   BIC:                            -209.3
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.1352      0.060     52.419      0.0

Holding all other variables constant,
* the geometric mean of writing scores is **12.16%** higher for female students than for male students;
* a one-unit increase in `read` is associated with a **0.67%** increase in the geometric mean of writing scores;
* a ten-unit increase in `read` is associated with a **6.86%** increase in the geometric mean of writing scores.

In [12]:
round((np.exp(lgmodel3.params['female'])-1)*100, 2)

12.16

In [13]:
round((np.exp(lgmodel3.params['read'])-1)*100, 2)

0.67

In [14]:
round((np.exp(lgmodel3.params['read']*10)-1)*100, 2)

6.86
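
Note that the ten-unit effect (6.86%) is not ten times the one-unit effect (0.67%): with a logged outcome the percent change compounds, following $[\exp(\hat{\beta} \times \Delta) - 1] \times 100\%$. The sketch below illustrates this. The helper name `pct_change_log_y` is hypothetical, and the coefficient 0.05 is an illustrative value, not an estimate from this notebook.

```python
import math

def pct_change_log_y(beta: float, delta: float = 1.0) -> float:
    """Percent change in the geometric mean of y for a `delta`-unit change in x
    when the model is ln(y) = ... + beta * x."""
    return math.expm1(beta * delta) * 100

beta = 0.05  # hypothetical coefficient, for illustration only

one_unit = pct_change_log_y(beta)                     # effect of +1 unit
ten_units = pct_change_log_y(beta, 10)                # effect of +10 units
compounded = ((1 + one_unit / 100) ** 10 - 1) * 100   # ten one-unit steps, compounded

print(round(one_unit, 2), round(ten_units, 2), round(compounded, 2))

# For a small coefficient, 100 * beta is a close approximation of the percent change.
print(pct_change_log_y(0.001), 100 * 0.001)
```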

## When some independent variables are log transformed
Our model is
$${\rm write} = \beta_0 + \beta_1 \cdot {\rm female} + \beta_2 \cdot \ln({\rm read}) + \beta_3 \cdot \ln({\rm math}) + e_i$$

In [15]:
lgmodel4 = smf.ols('write ~ female + np.log(read) + np.log(math)', data=logtrans).fit()
print(lgmodel4.summary())

                            OLS Regression Results                            
Dep. Variable:                  write   R-squared:                       0.530
Model:                            OLS   Adj. R-squared:                  0.523
Method:                 Least Squares   F-statistic:                     73.70
Date:                Wed, 12 Jul 2023   Prob (F-statistic):           5.92e-32
Time:                        08:09:53   Log-Likelihood:                -657.58
No. Observations:                 200   AIC:                             1323.
Df Residuals:                     196   BIC:                             1336.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept      -99.1640     10.804     -9.178   

Holding all other independent variables constant,
* the expected difference in mean writing scores between the female and male groups is about **5.39 points**;
* a **1%** increase in reading score is associated with an increase of about **0.17 points** in the expected writing score;
* a **10%** increase in reading score is associated with an increase of about **1.61 points** in the expected writing score.

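In interpreting $\hat{\beta}_2$, the formula used is
$$\hat{\beta}_2 \times \ln(1+\Delta\%)$$
where $\Delta\%$ is the percent change in `read` written as a proportion (e.g., 0.01 for a 1% increase).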

In [16]:
round(lgmodel4.params['female'], 2)

5.39

In [17]:
round(lgmodel4.params['np.log(read)']*np.log(1+0.01), 2)

0.17

In [18]:
round(lgmodel4.params['np.log(read)']*np.log(1+0.1), 2)

1.61
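
The same calculation can be wrapped in a small function. For small percent changes, ln(1 + p) ≈ p, which gives the familiar rule of thumb that a 1% increase in the regressor shifts the outcome by about β/100 units. In the sketch below, `level_change_log_x` is a hypothetical helper and 20.0 is an illustrative coefficient, not the fitted $\hat{\beta}_2$.

```python
import math

def level_change_log_x(beta: float, p: float) -> float:
    """Change in y (in y's own units) when x rises by proportion `p`
    (0.01 means 1%) and the model is y = ... + beta * ln(x)."""
    return beta * math.log1p(p)

beta = 20.0  # hypothetical coefficient, for illustration only

exact_1pct = level_change_log_x(beta, 0.01)    # exact effect of a 1% increase
rule_of_thumb = beta / 100                     # approximation for a 1% increase
exact_10pct = level_change_log_x(beta, 0.10)   # exact effect of a 10% increase

print(round(exact_1pct, 4), rule_of_thumb, round(exact_10pct, 4))
```

The approximation is close for a 1% change, but a 10% change yields less than ten times the 1% effect because the logarithm is concave.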

## When both dependent variable and some independent variables are log transformed
Our model is
$$\ln({\rm write}) = \beta_0 + \beta_1 \cdot {\rm female} + \beta_2 \cdot {\rm read} + \beta_3 \cdot \ln({\rm math}) + e_i$$

In [19]:
lgmodel5 = smf.ols('np.log(write) ~ female + read + np.log(math)', data=logtrans).fit()
print(lgmodel5.summary())

                            OLS Regression Results                            
Dep. Variable:          np.log(write)   R-squared:                       0.507
Model:                            OLS   Adj. R-squared:                  0.500
Method:                 Least Squares   F-statistic:                     67.30
Date:                Wed, 12 Jul 2023   Prob (F-statistic):           5.85e-30
Time:                        08:09:53   Log-Likelihood:                 115.90
No. Observations:                 200   AIC:                            -223.8
Df Residuals:                     196   BIC:                            -210.6
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        1.9281      0.247      7.808   

Holding all other variables constant,
* the expected percent increase in geometric mean from male group to female group is about **12.10%**;
* for a **one-unit** increase in reading score, we expect to see a **0.66%** increase in the geometric mean of writing score;
* for a **1%** increase in math score, we expect to see a **0.41%** increase in the geometric mean of writing score;
* for a **10%** increase in math score, we expect to see a **3.97%** increase in the geometric mean of writing score.

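In interpreting $\hat{\beta}_3$, the formula used is
$$\left[(1+\Delta\%)^{\hat{\beta}_3} - 1\right] \times 100\%$$
where $\Delta\%$ is the percent change in `math` written as a proportion (e.g., 0.10 for a 10% increase).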

In [20]:
round((np.exp(lgmodel5.params['female'])-1)*100, 2)

12.1

In [21]:
round((np.exp(lgmodel5.params['read'])-1)*100, 2)

0.66

In [22]:
round(((1+0.01)**lgmodel5.params['np.log(math)']-1)*100, 2)    # Python uses "**" (instead of "^") to raise a value to a power.

0.41

In [23]:
round(((1+0.1)**lgmodel5.params['np.log(math)']-1)*100, 2)

3.97
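
When both the outcome and a regressor are logged, the coefficient acts as an elasticity: $(1+p)^{\beta}$ is the same quantity as $\exp(\beta \ln(1+p))$, and for small $p$ the percent change in the outcome is roughly $\beta$ times the percent change in the regressor. The sketch below illustrates both points. `pct_change_log_log` is a hypothetical helper and 0.4 is an illustrative elasticity, not the fitted $\hat{\beta}_3$.

```python
import math

def pct_change_log_log(beta: float, p: float) -> float:
    """Percent change in the geometric mean of y when x rises by proportion `p`
    (0.10 means 10%) and the model is ln(y) = ... + beta * ln(x)."""
    return ((1 + p) ** beta - 1) * 100

beta = 0.4  # hypothetical elasticity, for illustration only

via_power = pct_change_log_log(beta, 0.10)               # (1 + p) ** beta form
via_exp = math.expm1(beta * math.log1p(0.10)) * 100      # exp(beta * ln(1 + p)) form
linear_approx = beta * 10                                # elasticity rule of thumb

print(round(via_power, 3), round(via_exp, 3), linear_approx)

# The rule of thumb is tighter for a small change such as 1%.
print(round(pct_change_log_log(beta, 0.01), 4), beta * 1)
```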