<a href="https://colab.research.google.com/github/Frost0088/Econometrics-with-Python/blob/main/Multiple_Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Multiple Linear Regression Analysis**

In [2]:
pip install wooldridge

Collecting wooldridge
  Downloading wooldridge-0.4.4-py3-none-any.whl (5.1 MB)
Installing collected packages: wooldridge
Successfully installed wooldridge-0.4.4


In [3]:
# importing required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import statsmodels.api as sm
import wooldridge as wo

Data: CEO salary dataset (`ceosal2` from the `wooldridge` package)

In [7]:
# loading the dataset
data = pd.DataFrame(wo.data("ceosal2"))

# first five rows
data.head()

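|   | salary | age | college | grad | comten | ceoten | sales | profits | mktval | lsalary | lsales | lmktval | comtensq | ceotensq | profmarg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1161 | 49 | 1 | 1 | 9 | 2 | 6200.0 | 966 | 23200.0 | 7.057037 | 8.732305 | 10.051908 | 81 | 4 | 15.580646 |
| 1 | 600 | 43 | 1 | 1 | 10 | 10 | 283.0 | 48 | 1100.0 | 6.39693 | 5.645447 | 7.003066 | 100 | 100 | 16.96113 |
| 2 | 379 | 51 | 1 | 1 | 9 | 3 | 169.0 | 40 | 1100.0 | 5.937536 | 5.129899 | 7.003066 | 81 | 9 | 23.668638 |
| 3 | 651 | 55 | 1 | 0 | 22 | 22 | 1100.0 | -54 | 1000.0 | 6.478509 | 7.003066 | 6.907755 | 484 | 484 | -4.909091 |
| 4 | 497 | 44 | 1 | 1 | 8 | 6 | 351.0 | 28 | 387.0 | 6.20859 | 5.860786 | 5.958425 | 64 | 36 | 7.977208 |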


Dependent Variable: salary

Independent Variables: **[age, company tenure, CEO tenure, sales, profits, market value]**

In [8]:
# defining variables
y = data["salary"]
x = data[["age", "comten", "ceoten", "sales", "profits", "mktval"]]

# adding a constant (intercept) column to the independent variables
X = sm.add_constant(x)

In [9]:
# fitting the OLS model
model = sm.OLS(y, X).fit()

# printing model summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                 salary   R-squared:                       0.208
Model:                            OLS   Adj. R-squared:                  0.181
Method:                 Least Squares   F-statistic:                     7.463
Date:                Tue, 19 Mar 2024   Prob (F-statistic):           4.15e-07
Time:                        19:21:24   Log-Likelihood:                -1358.5
No. Observations:                 177   AIC:                             2731.
Df Residuals:                     170   BIC:                             2753.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        504.6397    282.063      1.789      0.0

Surprisingly, even at the 10% level of significance, only one variable, 'ceoten', is statistically significant. We may have included one or more variables that are correlated with other regressors (multicollinearity), which inflates the standard errors of the affected coefficients.

Common sense suggests that 'sales' and 'profits' are correlated, and that both are correlated with 'mktval'. Let us inspect this with a correlation heatmap.

In [10]:
corr = data.corr()
fig = px.imshow(corr)
fig.show()

As the heatmap shows, 'sales' and 'profits' are highly correlated, and both are also highly correlated with 'mktval', which makes sense.
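A heatmap shows pairwise correlations only. One common way to put a number on how much each regressor overlaps with all the others is the variance inflation factor (VIF), which equals the corresponding diagonal entry of the inverse of the regressors' correlation matrix. The sketch below is an illustration, not part of the original analysis: the helper name `vif_table` and the synthetic columns (`sales_like`, `profits_like`, `tenure_like`) are made up for the example and do not contain `ceosal2` values.

```python
import numpy as np
import pandas as pd


def vif_table(frame: pd.DataFrame) -> pd.Series:
    """Return one variance inflation factor per column of `frame`.

    VIF_j is the j-th diagonal element of the inverse correlation matrix,
    which equals 1 / (1 - R_j^2) from regressing column j on the others.
    """
    corr = frame.corr().to_numpy()
    vifs = np.diag(np.linalg.inv(corr))
    return pd.Series(vifs, index=frame.columns, name="VIF")


# Synthetic stand-in data (hypothetical, NOT the ceosal2 values):
# two columns driven by a shared "firm size" factor, one independent column.
rng = np.random.default_rng(0)
n = 500
firm_size = rng.normal(size=n)
demo = pd.DataFrame(
    {
        "sales_like": firm_size + 0.2 * rng.normal(size=n),
        "profits_like": firm_size + 0.2 * rng.normal(size=n),
        "tenure_like": rng.normal(size=n),
    }
)

vif_demo = vif_table(demo)
print(vif_demo.round(2))
```

On the notebook's data, the same helper could be called as `vif_table(data[["age", "comten", "ceoten", "sales", "profits", "mktval"]])`. A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity, though the threshold is a convention rather than a test.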

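I am going to omit the 'sales' and 'mktval' variables from the model. Note that the specification below also drops 'age' and 'comten', keeping only 'ceoten' and 'profits'.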

In [33]:
y = data["salary"]
x = data[["ceoten", "profits"]]
X = sm.add_constant(x)

# fitting the OLS model
model = sm.OLS(y, X).fit()

# printing model summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                 salary   R-squared:                       0.178
Model:                            OLS   Adj. R-squared:                  0.169
Method:                 Least Squares   F-statistic:                     18.86
Date:                Tue, 19 Mar 2024   Prob (F-statistic):           3.87e-08
Time:                        12:46:00   Log-Likelihood:                -1361.8
No. Observations:                 177   AIC:                             2730.
Df Residuals:                     174   BIC:                             2739.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        646.8873     64.123     10.088      0.0

Now the coefficients on both 'profits' and 'ceoten' are statistically significant.

The coefficient on 'ceoten' is 12.45: holding profits fixed, a CEO's salary is expected to increase by 12.45 units, on average, with one additional year of tenure. A change in levels is not very informative on its own, though. What we would like to know is the percentage by which a CEO's salary is expected to increase, on average, with one additional year of tenure. For this we need a log-lin model.

A log-lin model takes the following form: `log(y_hat) = b0 + b1 * x1 + b2 * x2`. I will cover this in another notebook, named Functional Forms of Regression.
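Until then, here is a minimal sketch of how a log-lin slope is read as a percentage change. It uses simulated data rather than `ceosal2`, and the names `tenure`, `log_salary`, `true_b0`, and `true_b1` are assumptions chosen for the illustration. The fit uses plain NumPy least squares, which gives the same coefficient estimates as OLS.

```python
import numpy as np

# Simulated data (hypothetical, NOT the ceosal2 values):
# log(salary) = 6.0 + 0.02 * tenure + noise
rng = np.random.default_rng(42)
n = 2000
tenure = rng.uniform(0, 30, size=n)
true_b0, true_b1 = 6.0, 0.02
log_salary = true_b0 + true_b1 * tenure + rng.normal(scale=0.3, size=n)

# Least-squares fit of log(salary) on a constant and tenure
design = np.column_stack([np.ones(n), tenure])
coef, *_ = np.linalg.lstsq(design, log_salary, rcond=None)
b0_hat, b1_hat = coef

# Interpretation of the slope as a percentage change in salary
approx_pct = 100 * b1_hat                 # common approximation for small b1
exact_pct = 100 * (np.exp(b1_hat) - 1)    # exact change implied by the model

print(f"slope on tenure: {b1_hat:.4f}")
print(f"approximate % change per extra year: {approx_pct:.2f}")
print(f"exact % change per extra year: {exact_pct:.2f}")
```

With the notebook's data, the analogous fit would presumably use the `lsalary` column that already ships with the dataset, for example `sm.OLS(data["lsalary"], X).fit()`, and the slope on 'ceoten' times 100 would then approximate the percentage change in salary per additional year of tenure.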