In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

In [2]:
file_path = "PAPI2018_sample_clean.csv"
df = pd.read_csv(file_path)

In [10]:
df.head()

Unnamed: 0.1,Unnamed: 0,id,urban,female,age,time_in_commune_or_ward,time_in_province,lv_educ,no_family_members,party_member,income,ln_income
0,0,7014,1,1,56.0,20.0,20,4.0,2,0,5000000.0,15.424948
1,1,7003,1,0,37.0,37.0,37,6.0,5,0,7000000.0,15.761421
2,2,3780,1,1,34.0,34.0,34,8.0,4,1,15000000.0,16.523561
3,3,13742,1,1,36.0,36.0,36,6.0,3,1,15000000.0,16.523561
4,4,11886,0,0,61.0,61.0,61,0.0,3,0,5000000.0,15.424948


In [5]:
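# Keep only rows where both regression variables are present;
# .copy() makes the later column assignment unambiguous.
df = df.dropna(subset=['income', 'lv_educ']).copy()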

In [6]:
# Natural log of income (recomputes the ln_income column already in the CSV)
df['ln_income'] = np.log(df['income'])
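
`np.log` returns `-inf` for zero and `NaN` for negative values, and the `dropna` call has already run by this point, so any such rows would reach the regression step. A minimal sketch of a guard, assuming non-positive incomes should simply be excluded; that choice and the helper name `add_log_income` are illustrative assumptions, not part of the original analysis:

```python
import numpy as np
import pandas as pd

def add_log_income(frame: pd.DataFrame, col: str = "income") -> pd.DataFrame:
    """Return a copy with rows where `col` <= 0 (or missing) removed and ln_<col> added."""
    out = frame[frame[col] > 0].copy()
    out[f"ln_{col}"] = np.log(out[col])
    return out

# Toy data for illustration only (not taken from the PAPI sample)
toy = pd.DataFrame({"income": [5_000_000.0, 0.0, 15_000_000.0]})
cleaned = add_log_income(toy)
print(len(cleaned), np.isfinite(cleaned["ln_income"]).all())
```

Whether zero incomes should be dropped or handled another way depends on how the survey codes them, which the notebook does not state.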

In [7]:
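# Regressor matrix: education level plus an intercept column named 'const'
X = df[['lv_educ']]
X = sm.add_constant(X)
# Dependent variable: log income
y = df['ln_income']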

In [9]:
model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:              ln_income   R-squared:                       0.208
Model:                            OLS   Adj. R-squared:                  0.208
Method:                 Least Squares   F-statistic:                     1197.
Date:                Sun, 23 Feb 2025   Prob (F-statistic):          3.52e-233
Time:                        16:31:25   Log-Likelihood:                -5547.6
No. Observations:                4561   AIC:                         1.110e+04
Df Residuals:                    4559   BIC:                         1.111e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         14.9217      0.027    552.791      0.0
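
With a single regressor, the `coef` values and the R-squared in a summary like this have simple closed forms: the slope is the sample covariance of x and y divided by the variance of x, the intercept is the mean of y minus the slope times the mean of x, and R-squared is the squared correlation. A minimal sketch in plain Python, where `simple_ols` is a hypothetical helper written for illustration (not a statsmodels function) and the inputs are toy numbers:

```python
from statistics import mean

def simple_ols(x, y):
    """Closed-form OLS for y = b0 + b1*x; returns (b0, b1, r_squared)."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    b1 = s_xy / s_xx                        # slope
    b0 = y_bar - b1 * x_bar                 # intercept
    r_squared = s_xy ** 2 / (s_xx * s_yy)   # squared correlation
    return b0, b1, r_squared

# Toy numbers for illustration only
b0, b1, r2 = simple_ols([0, 2, 4, 6, 8], [1.0, 2.1, 2.9, 4.2, 4.8])
print(f"const={b0:.3f}, slope={b1:.3f}, R^2={r2:.3f}")
```

In this one-regressor case the F-statistic equals the square of the slope's t-statistic, so `Prob (F-statistic)` is also the p-value of the slope.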

**Conclusion**

As expected, education level is positively associated with income, and the relationship is statistically significant (Prob (F-statistic) = 3.52e-233). This is consistent with human capital theory, although a single-regressor OLS model shows association rather than a causal effect. With an R-squared of 0.208, education level alone explains only about a fifth of the variation in log income, so other factors clearly matter and could be explored in a multiple regression model.
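
As a rough illustration of the multiple-regression idea, the sketch below fits least squares with one and then two regressors and compares R-squared. It uses NumPy on synthetic data; the data-generating numbers, the variable names, and the helper `ols_r_squared` are assumptions made for the example and say nothing about the PAPI sample itself.

```python
import numpy as np

def ols_r_squared(X, y):
    """Fit y on X plus an intercept by least squares; return (coefs, R^2)."""
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return coefs, r2

# Synthetic data for illustration only: log income depends on two regressors
rng = np.random.default_rng(0)
educ = rng.integers(0, 9, size=500)
age = rng.integers(18, 70, size=500)
ln_inc = 15.0 + 0.2 * educ + 0.01 * age + rng.normal(0, 0.5, size=500)

_, r2_one = ols_r_squared(educ.reshape(-1, 1), ln_inc)
_, r2_two = ols_r_squared(np.column_stack([educ, age]), ln_inc)
print(f"R^2 with educ only: {r2_one:.3f}; with educ and age: {r2_two:.3f}")
```

With the notebook's own objects, the analogous call would be along the lines of `sm.OLS(y, sm.add_constant(df[['lv_educ', 'age', 'female', 'urban']])).fit()`, after dropping rows with missing values in the added columns; which controls to include is a modelling choice the notebook does not make. Adding a regressor never lowers in-sample R-squared, so adjusted R-squared is the fairer comparison between the two models.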

**Interpretation & Comparison**

In [11]:
beta_1 = model.params['lv_educ']
p_value = model.pvalues['lv_educ']
r_squared = model.rsquared

print(f"Interpretation:\n"
      f"The estimated coefficient for lv_educ (education level) is {beta_1:.4f}, "
      f"which means that for each additional level of education, income increases by approximately "
      f"{beta_1 * 100:.2f}% on average.\n"
      f"The p-value is {p_value:.4f}, indicating that education level has a statistically significant impact on income.\n"
      f"The R-squared value is {r_squared:.3f}, meaning that about {r_squared * 100:.2f}% of the variation in income "
      f"is explained by education level.")

# Check significance (5% level) as well as sign, so the else-branch message is accurate
if beta_1 > 0 and p_value < 0.05:
    print("\nThe positive coefficient confirms our expectation that higher education levels are associated with higher income.")
else:
    print("\nThe negative or insignificant coefficient contradicts our expectation, suggesting education level may not have a strong effect on income.")


Interpretation:
The estimated coefficient for lv_educ (education level) is 0.1866, which means that for each additional level of education, income increases by approximately 18.66% on average.
The p-value is 0.0000, indicating that education level has a statistically significant impact on income.
The R-squared value is 0.208, meaning that about 20.80% of the variation in income is explained by education level.

The positive coefficient confirms our expectation that higher education levels are associated with higher income.
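
Reading 100 × coefficient as a percent change is a small-coefficient approximation for a log-linear model. The exact change implied by a one-unit increase is 100 × (exp(β) − 1), which for the reported 0.1866 comes to roughly 20.5% rather than 18.66%. A minimal sketch, where `pct_effect` is a hypothetical helper name and the coefficient is the rounded value printed above:

```python
import math

def pct_effect(beta: float) -> float:
    """Exact percent change in y per one-unit increase in x when ln(y) is the outcome."""
    return 100.0 * (math.exp(beta) - 1.0)

beta_1 = 0.1866  # coefficient reported above, rounded to four decimals
print(f"approximation: {100 * beta_1:.2f}%  exact: {pct_effect(beta_1):.2f}%")
```

The printed p-value of 0.0000 is also a rounding artifact: the value is far below 0.0001, not exactly zero, so formatting it in scientific notation (for example `{p_value:.2e}`) is more informative.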
