## **Project: Study On Panel Data Methodologies With Application To Macroeconometrics (Inflation Forecasting)**

> ### **Title**: Merge of Dataset.


#### **Table of Contents:**
<ul>
<li><a href="#1">1. .</a></li>
<li><a href="#2">2. .</a></li>
<li><a href="#3">3. .</a></li>
</ul>

<a id=''></a>

#### Dataset Description and Variable Overview:

---

> #### **Inflation & Price Stability**

| **Variable Code** | **Description**                                    | **Units**                                         |
| ----------------- | -------------------------------------------------- | ------------------------------------------------- |
| **PCPIPCH**       | Inflation, average consumer prices `(Target)`      | Percent change                                    |

---

> #### **Public Finance**
| **Variable Code** | **Description**                                    | **Units**                                         |
| ----------------- | -------------------------------------------------- | ------------------------------------------------- |
| GGSB_NPGDP        | General government structural balance              | Percent of potential GDP                          |
| GGXWDG_NGDP       | General government gross debt                      | Percent of GDP                                    |

---


> #### **Economic Output & Productivity**
| **Variable Code** | **Description**                                    | **Units**                                         |
| ----------------- | -------------------------------------------------- | ------------------------------------------------- |
| PPPPC             | Gross domestic product per capita, current prices  | Purchasing power parity; international dollars    |

---

> #### **International Trade & Balance**
| **Variable Code** | **Description**                                    | **Units**                                         |
| ----------------- | -------------------------------------------------- | ------------------------------------------------- |
| TX_RPCH           | Volume of exports of goods and services            | Percent change                                    |
| TM_RPCH           | Volume of imports of goods and services            | Percent change                                    |
---

> #### **Savings & Investment**
| **Variable Code** | **Description**                                    | **Units**                                         |
| ----------------- | -------------------------------------------------- | ------------------------------------------------- |
| NID_NGDP          | Total investment                                   | Percent of GDP                                    |

---

> #### **Country Metadata**

| **Variable Code** | **Description**                                    | **Units**                                         |
| ----------------- | -------------------------------------------------- | ------------------------------------------------- |
| Country_Code      | ID number for each country                         | ID                                                |
| Country           | Country name (65 countries in the loaded data)     | String                                            |
| Advanced_Country  | Is the country developed (1) or developing (0)?    | Boolean                                           |
| Year              | Year, 1980 to 2024 in the loaded data              | Date                                              |

---


#### **1. Inflation & Price Stability**

| Variable Code    | Term             | Explanation | Effect on Inflation |
| ---------------- | ---------------- | --------------------------------- | --------------------------------------------------- |
| **PCPIPCH**   | Inflation (CPI) | Inflation rate based on average consumer prices; a key indicator of price stability. | The target variable; it directly measures how fast prices are rising. |

---

#### **2. Public Finance**

| Variable Code    | Term             | Explanation | Effect on Inflation |
| ---------------- | ---------------- | --------------------------------- | --------------------------------------------------- |
| **GGSB_NPGDP**  | Structural Budget Balance         | The structural balance after removing the effect of the business cycle. | A structural surplus signals contractionary fiscal policy, which reduces inflation. |
| **GGXWDG_NGDP** | Gross Government Debt (% of GDP)  | Public debt as a share of GDP; reflects the government's fiscal burden. | Rising debt may force the government into future monetary expansion, which raises inflation. |


---


#### **3. Economic Output & Productivity**

| Variable Code    | Term             | Explanation | Effect on Inflation |
| ---------------- | ---------------- | --------------------------------- | --------------------------------------------------- |
| **PPPPC**      | GDP per Capita (PPP)   | Output per capita measured at purchasing power parity. | A higher value indicates greater purchasing power, which may push prices up. |

---


#### **4. International Trade & Balance**

| Variable Code    | Term             | Explanation | Effect on Inflation |
| ---------------- | ---------------- | --------------------------------- | --------------------------------------------------- |
| **TX_RPCH**   | Export Volume Growth            | Growth in the volume of exports. | Higher exports may reduce domestic supply and raise prices. |
| **TM_RPCH**   | Import Volume Growth            | Growth in the volume of imports. | Higher imports provide cheaper alternatives and reduce inflation. |

---

#### **5. Savings & Investment**

| Variable Code    | Term             | Explanation | Effect on Inflation |
| ---------------- | ---------------- | --------------------------------- | --------------------------------------------------- |
| **NID_NGDP**  | Gross Capital Formation | Total investment as a share of GDP. | Higher investment may raise output in the long run, which reduces inflation. |

---

#### **6. Country Metadata**

| Variable Code    | Term             | Explanation | Effect on Inflation |
| ---------------- | ---------------- | --------------------------------- | --------------------------------------------------- |
| **Country_Code**     | Country ID         | Unique numeric identifier for each country. | -                 |
| **Country**           | Country Name       | Name of the country.              | -                 |
| **Advanced_Country** | Development Status | Advanced (1) or developing (0). | -                 |
| **Year**             | Year               | Year of the observation; the loaded data covers 1980 to 2024. | -                 |

---



In [1]:
# Install the packages imported below (uncomment to run)
#!pip install pandas numpy scipy matplotlib seaborn scikit-learn statsmodels linearmodels xgboost pydynpd


**Import Library**

In [48]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
import math

import statsmodels.api as sm
from linearmodels import  OLS
from linearmodels.panel import PanelOLS, RandomEffects,PooledOLS

from linearmodels.panel import compare
from statsmodels.tsa.stattools import adfuller
from scipy.stats import chi2
from sklearn.metrics import mean_squared_error, r2_score
from statsmodels.stats.diagnostic import het_breuschpagan, acorr_breusch_godfrey, het_white

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from statsmodels.regression.mixed_linear_model import MixedLM


sns.set(rc={'figure.figsize': [15,5]}, font_scale=1.2);
pd.set_option('future.no_silent_downcasting', True)


**Load Dataset**

In [78]:
df = pd.read_csv("../02-Dataset/01.4-Data_Clean.csv")
print(df.shape)
display(df.head())

(2925, 11)


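|   | WEO_Country_Code | Country | Advanced_Country | Year | PCPIPCH | GGSB_NPGDP | GGXWDG_NGDP | PPPPC | TX_RPCH | TM_RPCH | NID_NGDP |
| - | ---------------- | ------- | ---------------- | ---- | ------- | ---------- | ----------- | ----- | ------- | ------- | -------- |
| 0 | 193 | Australia | 1 | 1980 | 10.136 | -0.276986 | 9.396714 | 10277.787 | -0.915 | 4.997 | 27.139 |
| 1 | 193 | Australia | 1 | 1981 | 9.488 | 0.326679 | 5.183462 | 11529.563 | -3.403 | 10.042 | 28.897 |
| 2 | 193 | Australia | 1 | 1982 | 11.352 | -0.223435 | 14.483242 | 12049.62 | 8.771 | 5.46 | 26.502 |
| 3 | 193 | Australia | 1 | 1983 | 10.039 | -2.308832 | 18.715089 | 12305.539 | -4.36 | -9.819 | 23.062 |
| 4 | 193 | Australia | 1 | 1984 | 3.96 | 0.592963 | 18.584186 | 13391.167 | 16.092 | 22.058 | 26.734 |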


In [79]:
df.drop(columns=[ "WEO_Country_Code"],inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2925 entries, 0 to 2924
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Country           2925 non-null   object 
 1   Advanced_Country  2925 non-null   int64  
 2   Year              2925 non-null   int64  
 3   PCPIPCH           2925 non-null   float64
 4   GGSB_NPGDP        2925 non-null   float64
 5   GGXWDG_NGDP       2925 non-null   float64
 6   PPPPC             2925 non-null   float64
 7   TX_RPCH           2925 non-null   float64
 8   TM_RPCH           2925 non-null   float64
 9   NID_NGDP          2925 non-null   float64
dtypes: float64(7), int64(2), object(1)
memory usage: 228.6+ KB


In [80]:
# # df.PCPIPCH = ((df.PCPIPCH - df.PCPIPCH.min()) / df.PCPIPCH.max())
# # df.PCPIPCH = np.log((df.PCPIPCH+1) + df.PCPIPCH.mean())
# # df.PCPIPCH = np.log((df.PCPIPCH+1) - df.PCPIPCH.min())
# #df.PCPIPCH = np.log(np.abs(df.PCPIPCH))

# df["PCPIPCH"] = df.groupby("Country")["PCPIPCH"].transform( lambda x: np.log((x+1) - x.min() ))


# df.PCPIPCH.describe()
 

<a id='1'></a>

### **1. :**

In [81]:
panel_df = df.set_index(['Country', 'Year'])

for n in panel_df.columns:
    # Add a one-year lag of every column, shifted within each country
    panel_df[f'{n}_lag'] = panel_df.groupby(level=0)[n].shift(1)
    # The first year of each country has no lag; fill it with that country's first observed value
    panel_df[f'{n}_lag'] = panel_df[f'{n}_lag'].fillna(
        panel_df.groupby(level=0)[n].transform('first')
    )

panel_df.head()

| Country | Year | Advanced_Country | PCPIPCH | GGSB_NPGDP | GGXWDG_NGDP | PPPPC | TX_RPCH | TM_RPCH | NID_NGDP | Advanced_Country_lag | PCPIPCH_lag | GGSB_NPGDP_lag | GGXWDG_NGDP_lag | PPPPC_lag | TX_RPCH_lag | TM_RPCH_lag | NID_NGDP_lag |
| ------- | ---- | ---------------- | ------- | ---------- | ----------- | ----- | ------- | ------- | -------- | -------------------- | ----------- | -------------- | --------------- | --------- | ----------- | ----------- | ------------ |
| Australia | 1980 | 1 | 10.136 | -0.276986 | 9.396714 | 10277.787 | -0.915 | 4.997 | 27.139 | 1.0 | 10.136 | -0.276986 | 9.396714 | 10277.787 | -0.915 | 4.997 | 27.139 |
| Australia | 1981 | 1 | 9.488 | 0.326679 | 5.183462 | 11529.563 | -3.403 | 10.042 | 28.897 | 1.0 | 10.136 | -0.276986 | 9.396714 | 10277.787 | -0.915 | 4.997 | 27.139 |
| Australia | 1982 | 1 | 11.352 | -0.223435 | 14.483242 | 12049.62 | 8.771 | 5.46 | 26.502 | 1.0 | 9.488 | 0.326679 | 5.183462 | 11529.563 | -3.403 | 10.042 | 28.897 |
| Australia | 1983 | 1 | 10.039 | -2.308832 | 18.715089 | 12305.539 | -4.36 | -9.819 | 23.062 | 1.0 | 11.352 | -0.223435 | 14.483242 | 12049.62 | 8.771 | 5.46 | 26.502 |
| Australia | 1984 | 1 | 3.96 | 0.592963 | 18.584186 | 13391.167 | 16.092 | 22.058 | 26.734 | 1.0 | 10.039 | -2.308832 | 18.715089 | 12305.539 | -4.36 | -9.819 | 23.062 |
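
The lag step depends on `groupby(level=0).shift(1)`, which shifts values inside each country so that one country's first lag is never taken from another country's last year. The sketch below illustrates this on a toy panel; the countries `A` and `B` and all values are made up for illustration. It also shows one alternative to filling the first year with its own value, namely dropping it.

```python
import numpy as np
import pandas as pd

# Toy panel: two hypothetical countries, three years each (values are made up).
toy = pd.DataFrame(
    {
        "Country": ["A", "A", "A", "B", "B", "B"],
        "Year": [2000, 2001, 2002, 2000, 2001, 2002],
        "PCPIPCH": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    }
).set_index(["Country", "Year"])

# Shift within each country so that B's first lag is not A's last value.
toy["PCPIPCH_lag"] = toy.groupby(level=0)["PCPIPCH"].shift(1)

# Alternative to back-filling: drop the first year of each country.
toy_no_fill = toy.dropna(subset=["PCPIPCH_lag"])

print(toy)
print(toy_no_fill)
```

Filling the first lag with the contemporaneous value, as the cell above does, keeps all rows but makes the lag equal to the current value in each country's first year; dropping those rows avoids that at the cost of one observation per country.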


In [82]:
# ========== Split data for out-of-sample forecasting ==========
# Define dependent and independent variables
y = panel_df['PCPIPCH']  # Inflation Rate (Consumer Prices, annual %)
X_vars = [
    # 2. Public Finance
    "GGSB_NPGDP",        # General government structural balance (% of potential GDP)
    "GGXWDG_NGDP",       # General government gross debt (% of GDP)

    # 3. Economic Output & Productivity
    "PPPPC",             # GDP per capita, PPP (international dollars)

    # 4. International Trade & Balance
    "TX_RPCH",           # Export volume growth (%)
    "TM_RPCH",           # Import volume growth (%)

    # 5. Savings & Investment
    "NID_NGDP",          # Total investment (% of GDP)

    # 6. Metadata
    "Advanced_Country",  # 1 = advanced economy, 0 = developing
]

X = panel_df[X_vars]
#X = sm.add_constant(X)


# Helper function for metrics
def evaluate(y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    return rmse, r2


In [83]:
# ===============================
# Multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
X_np = X.values

vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X_np, i) for i in range(X_np.shape[1])]

print(vif_data)

            feature       VIF
0        GGSB_NPGDP  1.317427
1       GGXWDG_NGDP  2.674162
2             PPPPC  3.457454
3           TX_RPCH  1.665954
4           TM_RPCH  1.722332
5          NID_NGDP  3.872626
6  Advanced_Country  3.431196
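
The VIF of regressor j is 1 / (1 - R_j²), where R_j² comes from regressing column j on the remaining columns. statsmodels' `variance_inflation_factor` does not add an intercept on its own, and the call above passes `X` without a constant, so the figures are based on uncentered R². The sketch below computes the intercept-based version by hand with NumPy on synthetic data; `vif_with_intercept` is a hypothetical helper, not part of the notebook or of statsmodels.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF per column, regressing it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        # Centered R^2; residuals have zero mean because of the intercept.
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic check: column c is almost a copy of column a, column b is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(vifs)
```

On the notebook's data, the same idea would amount to adding a constant column to `X` before calling the statsmodels function and ignoring the VIF reported for the constant.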


In [84]:
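# Train on all years (<= 2024) and test on years after 2018.
# NOTE: the test window is a subset of the training window, so the test metrics below are not truly out-of-sample.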
train = panel_df.reset_index()
train = train[train['Year'] <= 2024].set_index(['Country', 'Year'])
test = panel_df.reset_index()
test = test[test['Year'] > 2018].set_index(['Country', 'Year'])

y_train = train['PCPIPCH']
X_train = train[X_vars]
#X_train = sm.add_constant(X_train)

y_test = test['PCPIPCH']
X_test = test[X_vars]

#X_test = sm.add_constant(X_test)
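
For a forecast evaluation that is genuinely out-of-sample, the training and test windows need to be disjoint in time. The following is a minimal sketch of such a split, assuming a cutoff year of 2018 to mirror the test filter above; `temporal_split` is a hypothetical helper and the toy frame is made up.

```python
import pandas as pd

def temporal_split(panel, cutoff_year, time_level="Year"):
    """Split a (Country, Year)-indexed frame into disjoint train and test parts."""
    years = panel.index.get_level_values(time_level)
    train_part = panel[years <= cutoff_year]
    test_part = panel[years > cutoff_year]
    return train_part, test_part

# Toy panel: two hypothetical countries observed in 2017, 2018 and 2019.
toy = pd.DataFrame(
    {"PCPIPCH": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]},
    index=pd.MultiIndex.from_product(
        [["A", "B"], [2017, 2018, 2019]], names=["Country", "Year"]
    ),
)
toy_train, toy_test = temporal_split(toy, cutoff_year=2018)
print(len(toy_train), len(toy_test))
```

Applied to `panel_df`, this would leave fewer training observations than the 2925 reported in the summaries below, so those results would change.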


In [86]:
# ===============================
# A. Pooled OLS model
pooled_ols_model = PooledOLS(y_train, X_train)
pooled_ols_res = pooled_ols_model.fit()
print("Pooled OLS Results:")
print(pooled_ols_res.summary)

# Predict and evaluate RMSE
pooled_preds = pooled_ols_res.predict(X_test)
pooled_rmse,pooled_r2 = evaluate(y_test, pooled_preds)
print(f"Pooled OLS RMSE (out-of-sample): {pooled_rmse:.4f} \nPooled OLS R_sq (out-of-sample): {pooled_r2:.4f}" )

Pooled OLS Results:
                          PooledOLS Estimation Summary                          
Dep. Variable:                PCPIPCH   R-squared:                        0.3760
Estimator:                  PooledOLS   R-squared (Between):              0.7242
No. Observations:                2925   R-squared (Within):               0.0013
Date:                Wed, Jun 11 2025   R-squared (Overall):              0.3760
Time:                        04:15:15   Log-likelihood                -1.008e+04
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      251.18
Entities:                          65   P-value                           0.0000
Avg Obs:                       45.000   Distribution:                  F(7,2918)
Min Obs:                       45.000                                           
Max Obs:                       45.000   F-statistic (robust):             251.18
        

In [87]:
X_train = X_train.drop(columns=['Advanced_Country'], errors='ignore')
X_test = X_test.drop(columns=['Advanced_Country'], errors='ignore')

In [88]:
# ===============================
# B. Fixed Effects model

fe_model = PanelOLS(y_train, X_train, entity_effects=True)
fe_res = fe_model.fit()
print("Fixed Effects Results:")
print(fe_res.summary)

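# Predict from X_test. Entity effects are only defined for countries seen in training, and,
# per the linearmodels documentation, predict() on new exog returns the X @ beta part only,
# so the estimated entity effects are not added back to these predictions.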
fe_preds = fe_res.predict(X_test)
fe_rmse, fe_r2 = evaluate(y_test, fe_preds)
print(f"Fixed Effects RMSE (out-of-sample): {fe_rmse:.4f}\nFixed Effects R_sq (out-of-sample):{fe_r2:.4f}")

Fixed Effects Results:
                          PanelOLS Estimation Summary                           
Dep. Variable:                PCPIPCH   R-squared:                        0.0550
Estimator:                   PanelOLS   R-squared (Between):             -0.7586
No. Observations:                2925   R-squared (Within):               0.0550
Date:                Wed, Jun 11 2025   R-squared (Overall):             -0.3667
Time:                        04:15:17   Log-likelihood                   -9618.4
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      27.696
Entities:                          65   P-value                           0.0000
Avg Obs:                       45.000   Distribution:                  F(6,2854)
Min Obs:                       45.000                                           
Max Obs:                       45.000   F-statistic (robust):             27.696
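
The fixed effects estimator is equivalent to OLS on data demeaned within each country, with the country intercepts recovered afterwards as alpha_i = mean(y_i) - mean(x_i) * beta. The sketch below works through this by hand on synthetic data with a single regressor; the countries, intercepts, and slope are made up, and this is an illustration of the within transformation rather than a reproduction of what linearmodels does internally. It also shows how much a prediction changes when the intercept of a known country is added back.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
countries = np.repeat(["A", "B", "C"], 50)
alpha_true = {"A": 0.0, "B": 5.0, "C": -3.0}  # made-up entity intercepts
x = rng.normal(size=countries.size)
noise = 0.1 * rng.normal(size=countries.size)
y = np.array([alpha_true[c] for c in countries]) + 2.0 * x + noise
toy = pd.DataFrame({"Country": countries, "x": x, "y": y})

# Within transformation: subtract each country's mean from y and x.
g = toy.groupby("Country")
y_dm = toy["y"] - g["y"].transform("mean")
x_dm = toy["x"] - g["x"].transform("mean")

# OLS slope on demeaned data (one regressor, so a ratio of sums).
beta = float((x_dm * y_dm).sum() / (x_dm ** 2).sum())

# Entity intercepts: alpha_i = mean(y_i) - beta * mean(x_i).
alpha = g["y"].mean() - beta * g["x"].mean()

# Prediction at x = 1 for country B, with and without its intercept.
pred_with_effect = float(alpha["B"] + beta * 1.0)
pred_without_effect = beta * 1.0
print(beta, alpha.to_dict(), pred_with_effect, pred_without_effect)
```

When the intercepts are large relative to the slope terms, predictions that omit them can be far from the observed values, which is one possible reason for a weak Fixed Effects test score.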
     

In [90]:
# ===============================
# C. Random Effects model
re_model = RandomEffects(y_train, X_train)
re_res = re_model.fit()
print("Random Effects Results:")
print(re_res.summary)

# Predict and RMSE
re_preds = re_res.predict(X_test)
re_rmse, re_r2 = evaluate(y_test, re_preds)
print(f"Random Effects RMSE (out-of-sample): {re_rmse:.4f}\nRandom Effects R_sq (out-of-sample): {re_r2:.4f}")


Random Effects Results:
                        RandomEffects Estimation Summary                        
Dep. Variable:                PCPIPCH   R-squared:                        0.0716
Estimator:              RandomEffects   R-squared (Between):              0.5854
No. Observations:                2925   R-squared (Within):               0.0308
Date:                Wed, Jun 11 2025   R-squared (Overall):              0.3182
Time:                        04:15:25   Log-likelihood                   -9704.5
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      37.547
Entities:                          65   P-value                           0.0000
Avg Obs:                       45.000   Distribution:                  F(6,2919)
Min Obs:                       45.000                                           
Max Obs:                       45.000   F-statistic (robust):             37.547
    

<a id='2'></a>

### **2. :**

In [91]:
# ===============================
# Hausman test: fixed vs random effects
def hausman(fe, re):
    b_diff = fe.params - re.params
    cov_diff = fe.cov - re.cov
    stat = np.dot(np.dot(b_diff.T, np.linalg.inv(cov_diff)), b_diff)
    df_n = b_diff.shape[0]
    pval = 1 - chi2.cdf(stat, df_n)
    return stat, pval

haus_stat, haus_pval = hausman(fe_res, re_res)
print(f"Hausman test statistic: {haus_stat:.4f}, p-value: {haus_pval:.4f}")
if haus_pval < 0.05:
    print("Hausman test suggests Fixed Effects preferred.")
else:
    print("Hausman test suggests Random Effects preferred.")

Hausman test statistic: 104.8604, p-value: 0.0000
Hausman test suggests Fixed Effects preferred.
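
For reference, the statistic computed by `hausman()` above is the standard quadratic form in the difference between the two coefficient vectors, written here with $\hat\beta$ for `params` and $\widehat{V}$ for `cov`:

$$
H = \left(\hat\beta_{FE} - \hat\beta_{RE}\right)^{\top}
\left[\widehat{V}_{FE} - \widehat{V}_{RE}\right]^{-1}
\left(\hat\beta_{FE} - \hat\beta_{RE}\right)
\;\sim\; \chi^2_{k}
$$

Here $k$ is the number of coefficients compared (`df_n` in the code, which is 6 for the regressors used in both models). The classical form of the test assumes that Random Effects is efficient under the null, so the result should be read with some caution when the residuals are heteroskedastic or serially correlated.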


In [92]:
# ===============================
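# Side-by-side comparison of Pooled OLS and Fixed Effects.
# Note: compare() only tabulates the two fits; the F-test for joint significance of the
# entity effects (poolability) is reported in the Fixed Effects summary and exposed as fe_res.f_pooled.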

comparison = compare({ 'Pooled OLS': pooled_ols_res, 'Fixed Effects': fe_res})
print("Wald Test (F-test) for joint significance of entity effects:")
print(comparison)


Wald Test (F-test) for joint significance of entity effects:
                   Model Comparison                  
                            Pooled OLS  Fixed Effects
-----------------------------------------------------
Dep. Variable                  PCPIPCH        PCPIPCH
Estimator                    PooledOLS       PanelOLS
No. Observations                  2925           2925
Cov. Est.                   Unadjusted     Unadjusted
R-squared                       0.3760         0.0550
R-Squared (Within)              0.0013         0.0550
R-Squared (Between)             0.7242        -0.7586
R-Squared (Overall)             0.3760        -0.3667
F-statistic                     251.18         27.696
P-value (F-stat)                0.0000         0.0000
GGSB_NPGDP                     -0.0014         0.0924
                             (-0.0434)       (2.7754)
GGXWDG_NGDP                     0.0225        -0.0120
                              (6.6722)      (-2.3213)
PPPPC                

In [93]:
# ===============================
# 1.Multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
X_np = X.values

vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X_np, i) for i in range(X_np.shape[1])]

print(vif_data)

            feature       VIF
0        GGSB_NPGDP  1.317427
1       GGXWDG_NGDP  2.674162
2             PPPPC  3.457454
3           TX_RPCH  1.665954
4           TM_RPCH  1.722332
5          NID_NGDP  3.872626
6  Advanced_Country  3.431196


In [94]:
# ===============================
# 2. Heteroskedasticity tests (Breusch-Pagan and White)

X_with_const = sm.add_constant(X_train, has_constant='add')

# Residuals are taken from the Random Effects fit (re_res), so the variable is named accordingly
re_residuals = re_res.resids

bp_test = het_breuschpagan(re_residuals, X_with_const)
print(f"Breusch-Pagan test: stat={bp_test[0]:.4f}, p-value={bp_test[1]:.4f}")
white_test = het_white(re_residuals, X_with_const)
print(f"White test: stat={white_test[0]:.4f}, p-value={white_test[1]:.4f}")


Breusch-Pagan test: stat=50.3791, p-value=0.0000
White test: stat=374.6214, p-value=0.0000


In [95]:
# ===============================
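# 3. Serial correlation test (Breusch-Godfrey) on a pooled statsmodels OLS fit of y_train on X_train
#    (no constant, no entity effects); it does not use the Fixed Effects residuals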
bg_test = acorr_breusch_godfrey(sm.OLS(y_train, X_train).fit(), nlags=2)
print(f"Breusch-Godfrey test for serial correlation: LM stat={bg_test[0]:.4f}, p-value={bg_test[1]:.4f}")

Breusch-Godfrey test for serial correlation: LM stat=1487.8106, p-value=0.0000
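
Because the Breusch-Godfrey call treats the stacked panel as a single time series, lags at country boundaries mix residuals from different countries. A simple complementary check is the lag-1 autocorrelation of residuals computed inside each country. The sketch below is one way to do that with pandas; `within_entity_lag1_corr` is a hypothetical helper and the AR(1) residuals are simulated, not taken from the fitted models.

```python
import numpy as np
import pandas as pd

def within_entity_lag1_corr(resid, entity_level=0):
    """Correlation between e_it and e_i,t-1, with lags taken inside each entity."""
    lagged = resid.groupby(level=entity_level).shift(1)
    pair = pd.concat([resid, lagged], axis=1, keys=["e", "e_lag"]).dropna()
    return float(pair["e"].corr(pair["e_lag"]))

# Simulated AR(1) residuals (coefficient 0.8) for two hypothetical countries.
rng = np.random.default_rng(2)
years = range(1980, 2025)
n_years = len(years)
idx = pd.MultiIndex.from_product([["A", "B"], years], names=["Country", "Year"])
shocks = rng.normal(size=len(idx))
e = np.zeros(len(idx))
for i in range(len(idx)):
    if i % n_years == 0:
        e[i] = shocks[i]          # restart the process for each country
    else:
        e[i] = 0.8 * e[i - 1] + shocks[i]
toy_resid = pd.Series(e, index=idx, name="resid")

rho = within_entity_lag1_corr(toy_resid)
print(rho)
```

The same function could be applied to a residual Series from one of the fitted models, assuming that Series is indexed by (Country, Year).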


In [96]:
# ===============================
# Summary table of RMSE and R² for model comparison
print("\n========== Model Performance Summary ==========")
print(f"{'Model':<30} {'RMSE':>10} {'R²':>10}")
print(f"{'Pooled OLS':<30} {pooled_rmse:>10.4f} {pooled_r2:>10.4f}")
print(f"{'Fixed Effects':<30} {fe_rmse:>10.4f} {fe_r2:>10.4f}")
print(f"{'Random Effects':<30} {re_rmse:>10.4f} {re_r2:>10.4f}")



Model                                RMSE         R²
Pooled OLS                         5.0055    -0.4379
Fixed Effects                     10.2349    -5.0119
Random Effects                     5.4794    -0.7231
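
All three R² values above are negative. Under the definition R² = 1 - SS_res / SS_tot, which `r2_score` uses, a negative value means the predictions have a larger squared error than simply predicting the mean of the evaluated values. The sketch below reproduces both metrics with NumPy on made-up numbers to show the two cases; `rmse_and_r2` is a hypothetical stand-in for the notebook's `evaluate` helper.

```python
import numpy as np

def rmse_and_r2(y_true, y_pred):
    """RMSE and R^2 = 1 - SS_res / SS_tot, computed by hand."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    rmse = float(np.sqrt(ss_res / y_true.size))
    r2 = float(1.0 - ss_res / ss_tot)
    return rmse, r2

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Predicting the mean everywhere gives R^2 = 0 by construction.
rmse_mean, r2_mean = rmse_and_r2(y_true, np.full(4, y_true.mean()))

# A constant offset of +5 is much worse than the mean, so R^2 is negative.
rmse_bad, r2_bad = rmse_and_r2(y_true, y_true + 5.0)

print(rmse_mean, r2_mean)
print(rmse_bad, r2_bad)
```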


# **END**