### Official Websites (Examples)
https://bashtage.github.io/linearmodels/panel/index.html

### Official GitHub (Examples)

https://github.com/bashtage/linearmodels/blob/main/README.md

## 1. Wage (Wage ~ Experience)

In [10]:
from linearmodels.panel import PanelOLS
from linearmodels.datasets import wage_panel
import statsmodels.api as sm
from linearmodels import BetweenOLS, FirstDifferenceOLS, PooledOLS
import pandas as pd

pd.set_option('display.float_format', lambda x:'%.3f' % x)

In [2]:
data = wage_panel.load()
data

,nr,year,black,exper,hisp,hours,married,educ,union,lwage,expersq,occupation
0,13,1980,0,1,0,2672,0,14,0,1.197540,1,9
1,13,1981,0,2,0,2320,0,14,1,1.853060,4,9
2,13,1982,0,3,0,2940,0,14,0,1.344462,9,9
3,13,1983,0,4,0,2960,0,14,0,1.433213,16,9
4,13,1984,0,5,0,3071,0,14,0,1.568125,25,5
...,...,...,...,...,...,...,...,...,...,...,...,...
4355,12548,1983,0,8,0,2080,1,9,0,1.591879,64,5
4356,12548,1984,0,9,0,2080,1,9,1,1.212543,81,5
4357,12548,1985,0,10,0,2080,1,9,0,1.765962,100,5
4358,12548,1986,0,11,0,2080,1,9,1,1.745894,121,5


In [None]:
# data.to_csv("wage_panel.csv")

In [3]:
# Convert the year into categorical type
year = pd.Categorical(data.year)

data = data.set_index(['nr','year'])

In [5]:
data["year"] = year
data.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4360 entries, (13, 1980) to (12548, 1987)
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   black       4360 non-null   int64   
 1   exper       4360 non-null   int64   
 2   hisp        4360 non-null   int64   
 3   hours       4360 non-null   int64   
 4   married     4360 non-null   int64   
 5   educ        4360 non-null   int64   
 6   union       4360 non-null   int64   
 7   lwage       4360 non-null   float64 
 8   expersq     4360 non-null   int64   
 9   occupation  4360 non-null   int64   
 10  year        4360 non-null   category
dtypes: category(1), float64(1), int64(9)
memory usage: 378.9 KB


In [6]:
print(wage_panel.DESCR)
print(data.head())


F. Vella and M. Verbeek (1998), "Whose Wages Do Unions Raise? A Dynamic Model
of Unionism and Wage Rate Determination for Young Men," Journal of Applied
Econometrics 13, 163-183.

nr                       person identifier
year                     1980 to 1987
black                    =1 if black
exper                    labor market experience
hisp                     =1 if Hispanic
hours                    annual hours worked
married                  =1 if married
educ                     years of schooling
union                    =1 if in union
lwage                    log(wage)
expersq                  exper^2
occupation               Occupation code

         black  exper  hisp  hours  married  educ  union     lwage  expersq  \
nr year                                                                       
13 1980      0      1     0   2672        0    14      0  1.197540        1   
   1981      0      2     0   2320        0    14      1  1.853060        4   
   1982      0    

### 1.1 Descriptive Statistics

In [11]:
data.describe()

,black,exper,hisp,hours,married,educ,union,lwage,expersq,occupation
count,4360.0,4360.0,4360.0,4360.0,4360.0,4360.0,4360.0,4360.0,4360.0,4360.0
mean,0.116,6.515,0.156,2191.257,0.439,11.767,0.244,1.649,50.425,4.989
std,0.32,2.826,0.363,566.352,0.496,1.746,0.43,0.533,40.782,2.32
min,0.0,0.0,0.0,120.0,0.0,3.0,0.0,-3.579,0.0,1.0
25%,0.0,4.0,0.0,2040.0,0.0,11.0,0.0,1.351,16.0,4.0
50%,0.0,6.0,0.0,2080.0,0.0,12.0,0.0,1.671,36.0,5.0
75%,0.0,9.0,0.0,2414.25,1.0,12.0,0.0,1.991,81.0,6.0
max,1.0,18.0,1.0,4992.0,1.0,16.0,1.0,4.052,324.0,9.0


In [13]:
print(data.columns)

Index(['black', 'exper', 'hisp', 'hours', 'married', 'educ', 'union', 'lwage',
       'expersq', 'occupation', 'year'],
      dtype='object')


### 1.2 Basic Regression (Pooled OLS) on Panel Data

- PooledOLS is just plain OLS that understands various panel data structures. It is useful as a baseline model.

$$
y_{it} = \beta x_{it} + (\alpha + \epsilon_{it})
$$

In [15]:
# Determine the exogenous variables
exog_vars = ["black", "hisp", "exper", "expersq", "married", "educ", "union", "year"]
exog = sm.add_constant(data[exog_vars])
model = PooledOLS(data.lwage, exog)
pooled_res = model.fit()
print(pooled_res)

                          PooledOLS Estimation Summary                          
Dep. Variable:                  lwage   R-squared:                        0.1893
Estimator:                  PooledOLS   R-squared (Between):              0.2066
No. Observations:                4360   R-squared (Within):               0.1692
Date:                Mon, Mar 25 2024   R-squared (Overall):              0.1893
Time:                        14:49:54   Log-likelihood                   -2982.0
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      72.459
Entities:                         545   P-value                           0.0000
Avg Obs:                       8.0000   Distribution:                 F(14,4345)
Min Obs:                       8.0000                                           
Max Obs:                       8.0000   F-statistic (robust):             72.459
                            

### 1.3 Entity Effect

When modeling panel data it is common to consider models beyond what OLS will efficiently estimate. The most common are error component models which add an additional term to the standard OLS model,

$$
y_{it} = \beta x_{it} + \alpha_i + \epsilon_{it}
$$

where $\alpha_i$ affects all values of entity i. 

When the $\alpha_i$ are uncorrelated with the regressors in $x_{it}$, a random effects model can be used to efficiently estimate the parameters of this model.

#### 1.3.1 Random Effect

The random effects model is <font color = "red"> virtually identical to the pooled OLS model </font>  <font color = "orange">except that it accounts for the structure of the model and so is more efficient </font>. Random effects uses a quasi-demeaning strategy which subtracts a fraction of the within-entity time average from each observation to account for the common shock.
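
In textbook notation, the quasi-demeaned regression can be sketched as follows, where $\bar{y}_i$ and $\bar{x}_i$ are the time averages for entity $i$ and $\theta_i$ is a weight between 0 and 1. The bar notation and the composite error $\nu_{it}$ are introduced here for illustration and are not taken from the library documentation.

$$
y_{it} - \theta_i \bar{y}_i = \beta (x_{it} - \theta_i \bar{x}_i) + (\nu_{it} - \theta_i \bar{\nu}_i), \qquad \nu_{it} = \alpha_i + \epsilon_{it}
$$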

In [16]:
from linearmodels import RandomEffects

mod = RandomEffects(data.lwage, exog)
re_res = mod.fit()
print(re_res)

                        RandomEffects Estimation Summary                        
Dep. Variable:                  lwage   R-squared:                        0.1806
Estimator:              RandomEffects   R-squared (Between):              0.1853
No. Observations:                4360   R-squared (Within):               0.1799
Date:                Mon, Mar 25 2024   R-squared (Overall):              0.1828
Time:                        15:20:00   Log-likelihood                   -1622.5
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      68.409
Entities:                         545   P-value                           0.0000
Avg Obs:                       8.0000   Distribution:                 F(14,4345)
Min Obs:                       8.0000                                           
Max Obs:                       8.0000   F-statistic (robust):             68.409
                            

- The model fit is fairly similar, although the return to experience has changed substantially, as has its significance. 

- <font color = "orange"> This is partially explainable by the inclusion of the year dummies </font> which will fit the trend in experience and so only the cross-sectional differences matter. 

- The quasi-differencing in the random effects estimator depends on the relative variance of the idiosyncratic shock and the common shock.

- The estimated variances can be accessed using variance_decomposition.

In [21]:
print(re_res.variance_decomposition)
re_res.theta.head()

Effects                  0.107
Residual                 0.123
Percent due to Effects   0.464
Name: Variance Decomposition, dtype: float64


nr,theta
13,0.645
17,0.645
18,0.645
45,0.645
110,0.645


The coefficient $\theta_i$ determines how much demeaning takes place. When this value is 0, the RE model reduces to the pooled model, since this occurs when there is no variance in the effects; as it approaches 1, the transformation approaches the full demeaning used by a fixed effects estimator. When panels are unbalanced it will vary across entities, but in this balanced panel all values are the same.
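
As a rough check, the reported weight can be reproduced from the variance decomposition printed above. The sketch below assumes the textbook random effects weight, $\theta_i = 1 - \sqrt{\sigma_\epsilon^2 / (T_i \sigma_\alpha^2 + \sigma_\epsilon^2)}$, and uses the rounded variances from the output, so it matches only approximately. The helper name re_theta is hypothetical and not part of linearmodels.

```python
import math


def re_theta(sigma2_effects, sigma2_resid, n_periods):
    """Textbook random effects weight for an entity observed n_periods times."""
    return 1.0 - math.sqrt(sigma2_resid / (n_periods * sigma2_effects + sigma2_resid))


# Rounded values copied from the variance decomposition output above.
sigma2_effects = 0.107  # variance of the entity effects
sigma2_resid = 0.123    # variance of the idiosyncratic shock
n_periods = 8           # every entity has 8 observations in this panel

theta = re_theta(sigma2_effects, sigma2_resid, n_periods)
share = sigma2_effects / (sigma2_effects + sigma2_resid)

print(f"theta is roughly {theta:.2f}")
print(f"share of variance due to effects is roughly {share:.2f}")
```

With these inputs the weight lands close to the 0.645 shown in the theta output, and setting the effects variance to zero in the same formula gives a weight of 0.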

#### 1.3.2 Between Estimator 

The between estimator is an alternative, usually less efficient, estimator that can be used to estimate model parameters.

It is particularly simple since it first computes the time averages of y and x and then runs a simple regression using these averages.

The year dummies are dropped since the averaging removes differences due to the year. expersq was also dropped since it is fairly collinear with exper. These results are broadly similar to the previous models.
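
The two steps of the between estimator can be sketched in plain Python. The panel below is made up for illustration (three hypothetical entities, one regressor) and is not the wage data; the point is only to show the averaging step followed by an ordinary regression on the averages.

```python
# Toy balanced panel: entity -> list of (x, y) observations over time (made-up numbers).
panel = {
    "A": [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)],
    "B": [(4.0, 5.0), (5.0, 7.0), (6.0, 9.0)],
    "C": [(7.0, 10.0), (8.0, 11.0), (9.0, 12.0)],
}

# Step 1: time averages of x and y for each entity.
x_bar = [sum(x for x, _ in obs) / len(obs) for obs in panel.values()]
y_bar = [sum(y for _, y in obs) / len(obs) for obs in panel.values()]

# Step 2: simple OLS (with intercept) of the averaged y on the averaged x.
mx = sum(x_bar) / len(x_bar)
my = sum(y_bar) / len(y_bar)
slope = sum((x - mx) * (y - my) for x, y in zip(x_bar, y_bar)) / sum(
    (x - mx) ** 2 for x in x_bar
)
intercept = my - slope * mx

print(slope, intercept)
```

On the wage data, the corresponding call would presumably be BetweenOLS(data.lwage, exog), with year and expersq removed from exog_vars before the constant is added.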

#### Entity Effect + Time Effect

## 2. Grunfeld data

This section uses formulas to specify models.

    invest  - Gross investment in 1947 dollars
    value   - Market value as of Dec. 31 in 1947 dollars
    capital - Stock of plant and equipment in 1947 dollars
    firm    - General Motors, US Steel, General Electric, Chrysler,
            Atlantic Refining, IBM, Union Oil, Westinghouse, Goodyear,
            Diamond Match, American Steel
    year    - 1935 - 1954

In [None]:
# Load data
from statsmodels.datasets import grunfeld
data = grunfeld.load_pandas().data
data.to_csv('grunfeld.csv')

In [None]:
# Set a (firm, year) MultiIndex: entity first, then time
data = data.set_index(['firm','year'])

### PanelOLS with Entity Effects

Entity effects are specified using the special command EntityEffects. By default a constant is not included, and so if a constant is desired, 1+ should be included in the formula. When including effects, the model and fit are identical whether a constant is included or not.
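
One way to see why the constant makes no difference: entity effects amount to removing each entity's own time average, and a constant column is wiped out by that step. The sketch below uses made-up numbers for two hypothetical firms and a single regressor; demean_by_entity and no_intercept_slope are illustrative helpers, not linearmodels functions.

```python
def demean_by_entity(panel):
    """Subtract each entity's own time average from its observations."""
    out = []
    for obs in panel.values():
        mean = sum(obs) / len(obs)
        out.extend(v - mean for v in obs)
    return out


def no_intercept_slope(x, y):
    """OLS slope of y on x without an intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


# Made-up data: entity -> values over three periods.
x = {"f1": [1.0, 2.0, 4.0], "f2": [10.0, 12.0, 11.0]}
y = {"f1": [3.0, 5.0, 9.0], "f2": [40.0, 45.0, 41.0]}
const = {"f1": [1.0, 1.0, 1.0], "f2": [1.0, 1.0, 1.0]}

# A constant column is all zeros after demeaning, so it carries no information.
const_within = demean_by_entity(const)

# Shifting y by a constant leaves the within slope unchanged.
y_shifted = {k: [v + 100.0 for v in vals] for k, vals in y.items()}
beta = no_intercept_slope(demean_by_entity(x), demean_by_entity(y))
beta_shifted = no_intercept_slope(demean_by_entity(x), demean_by_entity(y_shifted))

print(const_within)
print(beta, beta_shifted)
```

The demeaned constant is zero everywhere and the slope is the same for the shifted and unshifted outcome, which is the sense in which the entity effects absorb the intercept.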

In [None]:
# No constant
mod = PanelOLS.from_formula("invest ~ value + capital + EntityEffects", data = data)
print(mod.fit())

In [None]:
# Add a constant
mod = PanelOLS.from_formula("invest ~ 1 + value + capital + EntityEffects", data = data)
print(mod.fit())

### PanelOLS with Entity Effects & Time Effects

Time effects can be similarly included using *TimeEffects*. In many models, time effects can be consistently estimated and so they could be equivalently included in the set of regressors using a categorical variable.

In [None]:
mod = PanelOLS.from_formula(
    "invest ~ 1 + value + capital + EntityEffects + TimeEffects", data = data
)
print(mod.fit())

### Between OLS

In [None]:
mod = BetweenOLS.from_formula("invest ~ 1 + value + capital", data=data)
print(mod.fit())

### First Difference OLS

In [None]:
mod = FirstDifferenceOLS.from_formula("invest ~ value + capital", data=data)
print(mod.fit())

### Pooled OLS

The pooled OLS estimator is a special case of PanelOLS when there are no effects. It is effectively identical to OLS in statsmodels (or WLS) but is included for completeness.

In [None]:
mod = PooledOLS.from_formula("invest ~ 1 + value + capital", data = data)
print(mod.fit())