# INFO 2950 Phase V - An Analysis of the Effects of Income Class on Economic Indicators

**Team**: Carter Zhu (yz553), Jase Rivera (jcr297), Matthew Mentis-Cort (mam692)


## Table of Contents

1. [Introduction](#Introduction)
2. [Research Questions](#Research-Questions)
3. [Why is our research important?](#Why-is-our-research-important?)
4. [Data Description & Data Cleaning](#Data-Description-&-Data-Cleaning)
5. [Preregistration Statements](#Preregistration-Statements)
6. [Data Analysis](#Data-Analysis)
   1. [Imports](#Imports)
   2. [Data Exploration and Visualization](#Data-Exploration-and-Visualization)
   3. [Models and Evaluation](#Models-and-Evaluation)
   4. [Hypothesis 1 (education expenditure and GDP growth)](#Hypothesis-1-education-expenditure-and-GDP-growth)
   5. [Hypothesis 2 (education expenditure and unemployment rate)](#Hypothesis-2-(education-expenditure-and-unemployment-rate))
7. [Conclusions](#Conclusions)
8. [Limitations](#Limitations)
9. [Resources](#Resources)

## Introduction

This exploratory data analysis (EDA) uses data from the World Bank Group (1965–2023) to explore **global economic trends**, focusing on **unemployment rates**, **GDP growth**, and **government expenditure on education**. The analysis aims to uncover patterns and relationships in economic performance, with an emphasis on wealth inequality, development, and the impact of education investment on unemployment and GDP growth. Our analysis of the World Bank Group data highlights the strong correlation between education and poverty reduction, showing that **higher education levels are associated with better economic outcomes**.

Our analysis suggests that government expenditure on education has a delayed impact on GDP growth: the short-term association is negative, particularly in developed countries, while developing countries may experience more immediate benefits. Unemployment has a weakly positive relationship with government expenditure on education, which may reflect mismatches between education systems and labor market demands, with structural differences across country types influencing the outcomes.

## Research Questions

How does government expenditure on education influence **long-term changes in GDP growth rates** (as a measure of economic growth) and **unemployment rates** in **developing** (low-income and lower-middle-income) versus **developed** (upper-middle-income and high-income) countries?

## Why is our research important?

This research is important because it examines how education expenditure influences GDP growth and unemployment, highlighting the inequalities between developed and developing countries. The findings can guide policymakers and economists in designing policy and making forecasts.

## Data Description & Data Cleaning

**What are the observations (rows) and the attributes (columns)?**

Observations (Rows): Each row represents a country and an economic indicator (e.g., GDP growth, education expenditure, unemployment) across multiple years (1965–2023), capturing trends over time.

Attributes (Columns):

- **Country Name**: The name of the country.
- **Country Code**: A unique identifier for each country (ISO or similar).
- **Series Name**: The economic indicator (e.g., GDP growth, unemployment, government expenditure on education).
- **Year Columns**: Yearly data points from 1965 to 2023, where each column represents the recorded value of the economic indicator for that specific year.

**Why was this dataset created?**

The dataset was created to track **economic performance** and **trends** over time, helping **governments, policymakers, and economists** understand key metrics like **GDP growth, education expenditure, and unemployment**. These insights guide **development strategies** and **interventions** worldwide. 

**Who funded the creation of the dataset?**

The dataset was funded and compiled by the **World Bank**, a global financial institution that supports development efforts in countries worldwide. Because its mission is to **reduce poverty** and promote **sustainable development**, the World Bank regularly collects and publishes data on a wide range of **economic indicators** to aid in **policy-making** and **development planning**.

**What processes might have influenced what data was observed and recorded and what was not?**

- **Data availability and reliability**: Some countries may not have consistent or complete data records due to political instability, inadequate infrastructure, or limited data collection capabilities. This can lead to missing or incomplete data for certain years.
- **Selection of indicators**: The choice of economic indicators included in the dataset (e.g., GDP growth, unemployment, government expenditure) reflects the focus areas of development agencies and governments. We believe the chosen indicators are the ones deemed most critical for assessing economic health.
- **Reporting standards**: Differences in national reporting standards and data collection methods could lead to variations in data quality and coverage.
- **Bias in data recording**: Some data points may reflect self-reported figures from governments, which could be subject to political influence or reporting biases, particularly in countries where transparency or accurate reporting is less enforced.

**What preprocessing was done, and how did the data come to be in the form that you are using?**

The original data consisted of four datasets grouped by **income level** of the countries (low_income, lower_middle, upper_middle, high_income) to distinguish between developing and developed countries. Preprocessing was done to standardize economic indicators and ensure consistency across countries and years.

First, only the relevant economic indicators (GDP growth, government expenditure on education, and unemployment) were retained. Rows missing essential information like **Country Name**, **Country Code**, or **Series Name** were removed. Year columns from 1960–1964 were excluded due to data quality issues, and **non-numeric values like ".." were replaced with NaN**.

**Linear interpolation** estimated missing values, but rows with excessive gaps were dropped. A **country_type** column was added to classify countries by **income group**, and all datasets were merged into a single, consistent dataset for analysis.

**If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?**

Data collection for this type of dataset is typically done through government reporting mechanisms and international data collection efforts, not directly involving individuals. Governments or institutions contributing data likely understood that the information would be used for development assessments, economic planning, and research purposes. However, the individuals whose economic data are aggregated (e.g., unemployment figures) may not be aware of the specific data collection.


**How was the dataset cleaned for this project?**

The data cleaning process involved structured steps to ensure consistency, completeness, and proper formatting. The dataset was filtered to focus on three key variables: GDP growth (annual %), a measure of economic growth; government expenditure on education (% of GDP), indicating investment in education; and unemployment (% of labor force), assessing labor market outcomes.

Rows with missing key identifiers, such as Country Name or Series Name, were removed, and columns from 1960-1964 were excluded due to excessive missing values. Non-numeric placeholders (e.g., "..") were replaced with NaN, and linear interpolation estimated missing values, preserving temporal continuity. Remaining NaN rows were dropped, and a new column, country_type, classified countries by income group (e.g., low_income, high_income). Finally, the datasets were merged, ensuring a unified, consistent dataset for meaningful analysis.
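The cleaning steps described above can be sketched on a toy table. This is a minimal illustration, not the code from the cleaning notebook: the helper name `clean_income_group` and the toy values are assumptions made for the example, and the column layout (identifier columns followed by one column per year) follows the description given earlier in this section.

```python
import pandas as pd

ID_COLS = ["Country Name", "Country Code", "Series Name"]


def clean_income_group(raw, country_type):
    """Clean one income-group table (hypothetical helper for illustration)."""
    # Remove rows that lack a key identifier.
    df = raw.dropna(subset=ID_COLS).copy()

    # Exclude the 1960-1964 year columns when present.
    early = [str(year) for year in range(1960, 1965)]
    df = df.drop(columns=[c for c in early if c in df.columns])

    # Coerce year columns to numbers; placeholders such as ".." become NaN.
    year_cols = [c for c in df.columns if c not in ID_COLS]
    df[year_cols] = df[year_cols].apply(pd.to_numeric, errors="coerce")

    # Linearly interpolate gaps along each row, i.e. across years.
    df[year_cols] = df[year_cols].interpolate(method="linear", axis=1)

    # Drop rows that still contain NaN, then tag the income group.
    return df.dropna(subset=year_cols).assign(country_type=country_type)


# Toy inputs: row "B" has a leading gap that interpolation cannot fill,
# and the third row has no country name.
raw_low = pd.DataFrame({
    "Country Name": ["A", "B", None],
    "Country Code": ["AAA", "BBB", "CCC"],
    "Series Name": ["GDP growth (annual %)"] * 3,
    "1964": ["1.0", "..", "2.0"],
    "1965": ["2.0", "..", "2.0"],
    "1966": ["..", "3.0", "2.0"],
    "1967": ["4.0", "5.0", "2.0"],
})
raw_high = pd.DataFrame({
    "Country Name": ["D"],
    "Country Code": ["DDD"],
    "Series Name": ["GDP growth (annual %)"],
    "1964": ["0.5"],
    "1965": ["1.0"],
    "1966": ["1.5"],
    "1967": ["2.0"],
})

# Merge the cleaned income groups into one table.
final = pd.concat(
    [clean_income_group(raw_low, "low_income"),
     clean_income_group(raw_high, "high_income")],
    ignore_index=True,
)
print(final)
```

In this toy run, country A keeps its row with the 1966 gap filled by interpolation, while country B and the unnamed row are dropped.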

[Data Cleaning Github Link](https://github.com/Matty67889/info-2950-project/blob/main/data_cleaning_new.ipynb)


**Where can your raw source data be found?**

The four original datasets that we merged, along with the final dataset, are in the Google Drive folder below. We have also uploaded them to GitHub.

https://drive.google.com/drive/folders/1Ka6LAx-YR8TA3QkG9799JndAbxd8Jlio?usp=sharing

https://databank.worldbank.org/source/world-development-indicators#


## Preregistration Statements



**Hypothesis 1:** Higher government expenditure on education is associated with higher GDP growth in both developing and developed countries, but the impact is greater in developing countries.

- **Null Hypothesis**: The effect of government expenditure on education on GDP growth does not differ significantly between developing and developed countries.

- **Alternative Hypothesis**: The effect of government expenditure on education on GDP growth is significantly greater in developing countries compared to developed countries.

**Context & Rationale:** 

Education is a well-established driver of economic growth because it builds human capital by enhancing individuals' skills and productivity. In many developing countries, where access to quality education is limited, an increase in government spending on education can help bridge gaps in literacy and technical training, equipping more people to participate effectively in the labor force. Studies like Barro (1991) and Hanushek & Woessmann (2012) have shown that such investments are associated with significant improvements in innovation and economic output, particularly in developing countries where baseline education levels are lower. For developing countries, increased government expenditure on education has the potential to address systemic deficits in human capital and create significant improvements in economic output.

Developed countries, on the other hand, may already have robust educational systems, so additional expenditure is likely to yield smaller returns. For example, in countries like India, government programs focused on primary and secondary education, such as the Sarva Shiksha Abhiyan (Education for All Movement), have contributed to higher school enrollment rates and long-term economic benefits. By contrast, in developed countries like Germany or the United States, where educational systems are already advanced and broadly accessible, additional spending on education often results in smaller marginal improvements, as the foundational infrastructure and access are already in place.

**Analysis Plan:**

To test this hypothesis, a multiple regression analysis will be conducted where the dependent variable is GDP growth (annual %), and the independent variables are government expenditure on education (total % of GDP) and income class. Income class will be represented as a binary variable (1 = developing countries; 0 = developed countries), and we will include an interaction term between government expenditure on education and income class. This setup allows us to test whether the effect of education expenditure on GDP growth differs significantly between developing and developed countries. We will test the significance of the interaction term using a two-tailed t-test, where a p-value <= 0.05 will indicate that the effect of education expenditure on GDP growth is significantly different between developing and developed countries. Additionally, we will evaluate the magnitude and the direction of the coefficient to confirm whether the effect is stronger in developing countries.
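Written out, the planned regression and the slopes it implies for each group look as follows. The symbols are notation introduced here for illustration and do not appear in the original plan: $\text{Edu}_{it}$ is education expenditure for country $i$ in year $t$, $D_i$ is the income-class indicator (1 = developing, 0 = developed), and $\varepsilon_{it}$ is the error term.

```latex
\begin{aligned}
\text{GDP growth}_{it} &= \beta_0 + \beta_1\,\text{Edu}_{it} + \beta_2\,D_i
  + \beta_3\,(\text{Edu}_{it} \times D_i) + \varepsilon_{it} \\
\text{slope for developed countries } (D_i = 0) &= \beta_1 \\
\text{slope for developing countries } (D_i = 1) &= \beta_1 + \beta_3 \\
H_0 &: \beta_3 = 0 \qquad H_A : \beta_3 \neq 0
\end{aligned}
```

The two-tailed t-test addresses $\beta_3 = 0$; the directional part of the hypothesis (a greater effect in developing countries) corresponds to additionally checking that the estimated $\beta_3$ is positive.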



**Hypothesis 2:** An increase in government expenditure on education reduces long-term unemployment rates more significantly in developing countries than in developed countries.

- **Null Hypothesis**: The interaction between government expenditure on education and income class does not have a significant effect on unemployment rates.

- **Alternative Hypothesis**: The interaction between government expenditure on education and income class has a significant effect on unemployment rates, with a stronger reduction in unemployment rates in developing countries compared to developed countries.

**Context & Rationale:**

Education enhances employability by providing individuals with skills that meet the demands of the job market, a relationship consistently highlighted in labor economics research (OECD, 2010; Psacharopoulos & Patrinos, 2004). For example, in developing countries like Kenya, investments in technical and vocational education programs, such as the Youth Empowerment Project, have helped bridge skill gaps and prepared young people for roles in industries like construction and technology, significantly reducing unemployment among youth. While both developing and developed countries benefit from increased education spending, the impact is often greater in developing nations. This is because many developing countries face a shortage of trained workers, and education initiatives can directly address these gaps. By contrast, in developed countries like Japan, where educational attainment is already high, additional spending may be associated with smaller improvements, as the workforce is already equipped with advanced skills and labor markets tend to be more efficient.


**Analysis Plan:** 

We will perform a multiple regression analysis where the dependent variable is the unemployment rate (total % of labor force, modeled ILO estimate). The independent variables will include government expenditure on education (total % of GDP), year (to capture long-term trends), and income class. Income class will be represented as a binary variable (1 = developing countries [low and lower-middle-income], 0 = developed countries [upper-middle and high-income]). We will create interaction terms between income class and education expenditure to model differential effects between the two groups. For testing, we will focus on the coefficients of the interaction terms between education expenditure and income class. For each interaction term, a two-tailed t-test will be conducted to test whether its coefficient is significantly different from zero. A p-value <= 0.05 will mean there is a significant effect. Additionally, we will examine the magnitude and sign of these coefficients to determine whether education spending has a stronger negative effect on unemployment in developing countries (low and lower-middle-income groups) compared to developed countries (upper-middle and high-income groups).

To address the long-term effect, we will assess whether the interaction term remains significant across different years by including a three-way interaction (Education Expenditure x Income Class x Year). This will allow us to analyze whether the impact of education expenditure on unemployment changes over time and whether this change varies between developing and developed countries.
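As an illustration of what a three-way interaction specification can look like, the sketch below uses the `statsmodels` formula interface on synthetic data. It is not the specification fitted later in this notebook, which builds its interaction columns by hand; the column names (`edu`, `developing`, `year_c`) and the simulated values are assumptions made for the example. In formula syntax, `a * b * c` expands to all main effects, all two-way interactions, and the three-way interaction, and the year is centered so that lower-order coefficients refer to the middle of the sample rather than to year 0.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic panel used only to show the model specification.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "edu": rng.uniform(1, 8, n),           # education spending, % of GDP
    "developing": rng.integers(0, 2, n),   # 1 = developing, 0 = developed
    "year": rng.integers(1990, 2024, n),
})

# Center year so lower-order terms describe the mid-sample year.
toy["year_c"] = toy["year"] - toy["year"].mean()

# Simulated outcome with an arbitrary, made-up data-generating process.
toy["unemployment"] = (
    8 + 0.2 * toy["edu"] - 0.5 * toy["developing"] * toy["edu"]
    + rng.normal(0, 1, n)
)

# Full factorial: main effects, two-way terms, and the three-way term.
full_model = smf.ols(
    "unemployment ~ edu * developing * year_c", data=toy
).fit()
print(full_model.params.index.tolist())
```

The printed list shows which terms the formula expanded to; the three-way term appears as `edu:developing:year_c`.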






## Data Analysis

### Imports

In [1]:
import statsmodels.api as sm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.factorplots import interaction_plot

### Data Exploration and Visualization


The summary table provides key statistics for GDP growth, government expenditure on education, and unemployment, offering insights into global economic trends and variability. GDP growth averages 3.49% with a large standard deviation of 6.33%, showing that there are considerable fluctuations. The extreme values, such as a minimum of -64.05% and a maximum of 153.49%, suggest rare events like severe recessions, rapid industrialization, or post-conflict recovery that warrant closer examination later in our analysis.

Government expenditure on education averages 4.27% of GDP, with a smaller standard deviation of 2.23%, but ranges from 0% to 44.33%. This could indicate underreporting or rare cases of unusually high prioritization of education spending.

Unemployment averages 7.93%, with a wide range from 0.1% to 38.8%, reflecting diverse labor market conditions, from full employment in developed nations to severe economic crises in some regions. These extremes and variability may stem from global shocks, structural differences, or data quality issues, which we will explore further in later sections.

In [None]:
data = pd.read_csv('data/final_cleaned_data.csv')
summary_statistics = data.describe()
summary_statistics

Because we are interested in the trends for each income group, we made line plots showing how GDP growth, government expenditure on education, and unemployment change over time. These charts let us see commonalities between variables in various time periods: we can compare the lines across plots to see which variables follow similar trends over time.

In [None]:
#GDP Growth (Annual %) over Time (lineplot)

plt.figure(figsize=(18, 8))
sns.lineplot(
    data=data,  
    x="Year",
    y="GDP_growth",
    hue="country_type",
    markers=True,
    dashes=False,
    errorbar=None
)
plt.title("GDP Growth (Annual %) Trend Analysis Over Time (1990-2023)")
plt.ylabel("GDP Growth (%)")
plt.xlabel("Year")
plt.xticks(rotation=60)
plt.legend(title="Country Type")
plt.grid(True)
plt.show()

The chart shows GDP growth trends from 1990 to 2023 for different income groups, highlighting notable differences in stability and recovery. Low-income countries tend to experience greater fluctuations in growth, making them more vulnerable to economic shocks and slower to recover. High-income and upper-middle-income countries show more steady growth and quicker rebounds from crises. Key downturns, such as the 2008 financial crisis and the 2020 COVID-19 pandemic, affected all groups, but the declines were sharper and recovery slower for low-income countries. Recent years show a strong recovery across all groups, but growth remains less stable for developing nations compared to wealthier countries.


In [None]:
#Unemployment Over Time (line chart)

plt.figure(figsize=(18, 8))
sns.lineplot(
    data=data, 
    x="Year",
    y="Unemployment",  
    hue="country_type",
    markers=True,
    dashes=False,
    errorbar=None
)
plt.title("Unemployment Trend Analysis Over Time (1990-2023)")
plt.ylabel("Unemployment Rate (% of total labor force)")
plt.xlabel("Year")
plt.xticks(rotation=60)
plt.legend(title="Country Type")
plt.grid(True)
plt.show()


This chart shows unemployment trends from 1990 to 2023 across different income levels. Low-income countries generally have stable unemployment rates, which might be influenced by informal employment or limited data accuracy. Upper-middle-income countries show consistently higher unemployment rates but have improved recently. High-income countries maintain the lowest unemployment rates and recover quickly from global crises like the 2008 financial crisis and the 2020 pandemic. The differences highlight structural challenges in middle-income countries and the resilience of high-income nations.


In [None]:
# Government Expenditure on Education (% of GDP) Trend Analysis Over
# Time (1990-2023)
plt.figure(figsize=(18, 8))
sns.lineplot(data=data, x="Year", y="Education_expenditure",  
             hue="country_type", markers=True, dashes=False, 
             errorbar=None)
plt.title("Government Expenditure on Education (% of GDP) Trend " +
          "Analysis Over Time (1990-2023)")
plt.ylabel("Government Expenditure (% of GDP)")
plt.xlabel("Year")
plt.xticks(rotation=60)
plt.legend(title="Country Type")
plt.grid(True)
plt.show()

This chart shows that high-income countries consistently spend the most on education, maintaining levels above 4.5% of GDP. Upper-middle-income countries also display stable spending but slightly lower than high-income nations. Lower-middle-income countries show a steady decline in expenditure after the 2000s, stabilizing at around 4%. Low-income countries have the lowest levels of spending, with some improvement after the 2000s but still below 4% of GDP. The trends highlight persistent disparities in education investment across income groups, with higher-income nations maintaining steady support, while lower-income countries face more fluctuations.

Because we are interested in the relationship between government expenditure on education, unemployment, and the income class of a country, we make some visualizations analyzing trends in these economic indicators for each class. We first group our analysis dataframe by development status for use in our graphs.

In [6]:
# Add a binary column for development status.
# Note: despite its name, is_developed = 0 marks "developed" countries
# (high, upper middle income) and is_developed = 1 marks "developing"
# countries (low, lower middle income).
class_mappings = {"high_income": 0,
                  "upper_middle": 0,
                  "lower_middle": 1,
                  "low_income": 1}
data['is_developed'] = data['country_type'].map(class_mappings)

# group by development status
development_grouped = data.groupby('is_developed')

To examine the trends in government expenditure and unemployment rate for each development status, we make a scatterplot and observe the general trends of the dots. We are looking to see what kind of relationship there is between government expenditure and unemployment rate for each group (positive or negative linear fits, neutral, or different kind of fit?).

In [None]:
grid = sns.FacetGrid(data, col='is_developed', hue='is_developed')
grid.map(sns.scatterplot, 'Education_expenditure', 'Unemployment')
grid.set_xlabels("Government Expenditure (% of GDP)")
grid.set_ylabels("Unemployment Rate (%)")
# Facet title customization reference:
# https://stackoverflow.com/questions/43920341/facetgrid-change-titles
# grid.set_titles(col_template='C{col_name}')
# Figure title inspiration:
# https://stackoverflow.com/questions/29813694/how-to-add-a-title-to-seaborn-facet-plot
grid.fig.subplots_adjust(top=0.8)
grid.fig.suptitle("Government Expenditure vs. Unemployment Rate")
plt.show()

Looking at the scatterplots, there appears to be no clear relationship between government expenditure and unemployment rate for either development status. This is a bit peculiar, as one might expect that more money spent on education would be associated with a more educated population and, in turn, with more employment. If anything, unemployment seems to be slightly higher as expenditure on education increases.

We also want some idea of how each economic indicator relates to the income class. To examine this, we look at the averages of each economic indicator for countries of each development status.

In [8]:
bar_groups = ['developed', 'developing']
economic_indicator_avgs = development_grouped.mean(numeric_only=True)

In [None]:
# Bar chart formation source:
# https://matplotlib.org/stable/gallery/lines_bars_and_markers/barchart.html#sphx-glr-gallery-lines-bars-and-markers-barchart-py

plot = sns.barplot(data=economic_indicator_avgs,
            x='is_developed',
            y='Education_expenditure')
plt.title("Country Development vs. Government Expenditure")
plt.xlabel("Development Status")
plt.ylabel("Average Government Expenditure on Education (% of GDP)")
plot.set_xticks(np.arange(len(bar_groups)), bar_groups)
plt.show()

In [None]:
plot = sns.barplot(data=economic_indicator_avgs,
            x='is_developed',
            y='Unemployment')
plt.title("Country Development vs. Unemployment Rate")
plt.xlabel("Development Status")
plt.ylabel("Average Unemployment (% of labor force)")
plot.set_xticks(np.arange(len(bar_groups)), bar_groups)
plt.show()

We observe that, on average, developed countries have slightly higher unemployment rates and slightly higher government expenditure on education than developing countries.

We also want to observe the correlation between variables to examine which variables may have a relationship. 

In [None]:
# Compute Correlation Matrix
correlation_matrix = data[["GDP_growth", "Education_expenditure",
                           "Unemployment"]].corr()

plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm",
            linewidths=0.5)
plt.title("Correlation Heatmap Between Key Economic Indicators")
plt.show()

The heatmap shows weak correlations between the variables of interest. There is a slightly negative correlation between GDP growth and education spending (-0.076) and between GDP growth and unemployment (-0.086). Education spending and unemployment show a very weak positive correlation (0.083). These results suggest minimal linear relationships between these indicators. One possible reason for the weak correlations is that these economic variables are influenced by a wide range of other factors, such as political stability, infrastructure, technological advancements, and global economic trends, which may overshadow the direct relationships between education spending, GDP growth, and unemployment. Additionally, the effects of education spending on GDP growth and unemployment may manifest only over the long term, and these lagged effects might not be captured in the analysis. 
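The lagged effects mentioned above could be explored by pairing each observation with the education expenditure recorded some years earlier in the same country. The sketch below shows one way to build such a column with `groupby` and `shift` on a toy table. The values, the `LAG_YEARS` setting, and the new column name are assumptions for illustration, and the approach assumes each country has one row per consecutive year.

```python
import pandas as pd

# Toy panel: two countries, four consecutive years each (made-up values).
toy = pd.DataFrame({
    "Country Name": ["A"] * 4 + ["B"] * 4,
    "Year": [2000, 2001, 2002, 2003] * 2,
    "Education_expenditure": [3.0, 3.5, 4.0, 4.5, 5.0, 5.2, 5.4, 5.6],
})

LAG_YEARS = 2  # assumed lag, kept short so the toy table stays small

# Sort so that shifting moves values forward in time within each country.
toy = toy.sort_values(["Country Name", "Year"]).reset_index(drop=True)

# Shift within each country so values never leak across countries.
toy["Education_expenditure_lag"] = (
    toy.groupby("Country Name")["Education_expenditure"].shift(LAG_YEARS)
)
print(toy)
```

The first `LAG_YEARS` rows of each country have no lagged value and would be dropped before computing correlations or fitting a model on the lagged column.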

### Models and Evaluation


#### Hypothesis 1 education expenditure and GDP growth

Model: `gdp_growth` $\sim$ $\alpha$ + `education_expenditure` + `development_status` + (`development_status` $\times$ `education_expenditure`)

In [None]:
# Linear Regression: GDP analysis

# Keep only the columns needed for the GDP model.
gdp_model_data = data[["Country Name", "Year", "country_type",
                       "GDP_growth", "Education_expenditure"]].copy()

# Coerce to numeric first so derived columns are built from clean values.
for col in ["GDP_growth", "Education_expenditure"]:
    gdp_model_data[col] = pd.to_numeric(gdp_model_data[col],
                                        errors="coerce")

# 1 = developing (low, lower middle income), 0 = developed.
gdp_model_data["country_type_encoded"] = gdp_model_data["country_type"] \
    .map({"low_income": 1, "lower_middle": 1,
          "upper_middle": 0, "high_income": 0})

# Interaction term: education expenditure x development status.
gdp_model_data["interaction"] = gdp_model_data["country_type_encoded"] * \
    gdp_model_data["Education_expenditure"]

# Drop rows with a missing value in any model column.
gdp_model_data = gdp_model_data.dropna()

if not gdp_model_data.empty:
    X_gdp = gdp_model_data[["Education_expenditure", 
                            "country_type_encoded", "interaction"]]
    y_gdp = gdp_model_data["GDP_growth"]
    X_gdp = sm.add_constant(X_gdp)
    gdp_model = sm.OLS(y_gdp, X_gdp).fit()
    print("\nGDP Growth Regression:\n", gdp_model.summary())
else:
    print("No valid data for GDP regression.")


The regression shows a very low **R-squared value of 0.011**, indicating that only **1.1%** of the variation in **GDP growth** is explained by the model. **Education expenditure** has a negative coefficient **(-0.482)**, meaning higher spending on education is associated with lower **GDP growth**, which could reflect **delayed effects** or other **confounding factors**. The interaction term **(0.406)** suggests that the impact of education expenditure is more positive in developing countries. The negative coefficient for **country_type_encoded (-1.315)** indicates that, at zero education expenditure, **developing countries** have lower predicted GDP growth than developed countries. While the coefficients are statistically significant, the model's explanatory power is weak. Further analysis is needed to better understand these relationships, such as testing for non-linear relationships, adding interactions with other variables, or exploring time lags (e.g., education expenditure's impact on GDP growth after **5 or 10 years**).

The regression model predicts GDP growth based on **education expenditure, country type**, and their **interaction**. For developed (high- and upper-middle-income) countries, the intercept is **5.45%**, and each additional percentage point of GDP spent on education is associated with a **0.48 percentage point decrease** in growth. **Low-** and **lower-middle-income** countries have a lower intercept because of the country type coefficient (**-1.31**), but the interaction term (**+0.41**) shows that education spending has a less negative association with growth in these countries than in developed ones. Outliers, such as countries experiencing unusually high growth or sharp recessions, may influence the model. These could include **resource-rich nations** with rapid growth or countries recovering from conflicts or crises. The negative coefficient for education spending might also reflect **delayed returns on investment** or specific contexts where increased spending coincides with economic challenges.
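To make the group-specific readings concrete, the short calculation below combines the rounded coefficients quoted above into an implied intercept and slope for each group. The function name `predicted_growth` and the 4% spending level are illustrative choices; because the inputs are rounded, the results are approximate.

```python
# Coefficients as reported in the regression discussion above (rounded).
intercept = 5.45        # developed countries, zero education expenditure
b_edu = -0.482          # slope on education expenditure (developed)
b_developing = -1.315   # shift in intercept for developing countries
b_interaction = 0.406   # shift in slope for developing countries

# Implied slope and intercept for each group.
slope_developed = b_edu
slope_developing = b_edu + b_interaction
intercept_developing = intercept + b_developing


def predicted_growth(edu_pct, developing):
    """Predicted GDP growth (%) at a given education spending level."""
    return (intercept
            + b_edu * edu_pct
            + b_developing * developing
            + b_interaction * developing * edu_pct)


print("slope, developed: ", round(slope_developed, 3))
print("slope, developing:", round(slope_developing, 3))
print("growth at 4% spending, developed: ",
      round(predicted_growth(4.0, 0), 3))
print("growth at 4% spending, developing:",
      round(predicted_growth(4.0, 1), 3))
```

With these rounded inputs, the implied slope for developing countries is about -0.08: still negative, but much closer to zero than the -0.48 for developed countries.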


##### Hypothesis 1 Significance Test

The regression results for Hypothesis 1 show statistically significant p-values for all key variables, including education expenditure **(p = 0.000)**, country type **(p = 0.001)**, and their interaction term **(p = 0.000)**; a reported value of 0.000 means p < 0.0005 after rounding. The significant, positive interaction term is consistent with the **alternative hypothesis**: the association between education expenditure and GDP growth is **more positive** (less negative) in developing countries than in developed countries. However, the negative coefficient for education expenditure **(-0.482)** indicates an unexpected inverse relationship, potentially reflecting unaccounted factors such as inefficiencies in spending or lag effects. Based on these results, we reject the null hypothesis, as the interaction term indicates that the association between education spending and GDP growth differs significantly between country types, but further analysis or a more complex model is necessary to fully understand the underlying dynamics.

In [None]:
#GDP residual

# Residuals from the model for hyp 1 gdp
residuals = gdp_model.resid


fitted_values = gdp_model.fittedvalues


sns.residplot(x=fitted_values, y=residuals, lowess=True, 
              line_kws={'color': 'red'})
plt.title("Residuals vs. Fitted Values (Homoscedasticity)")
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()

sns.histplot(residuals, kde=True, bins=30, color='blue')
plt.title("Histogram of Residuals")
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()


The residual plots reveal some key insights about the model's performance. In the Residual vs. Fitted Values plot, most residuals are centered around zero, indicating that the model generally does not systematically over- or under-predict GDP growth. However, there are noticeable outliers, with residuals exceeding 50 and even reaching 150. These extreme deviations suggest that certain events, such as economic booms, financial crises, or unique country-specific conditions, are not well-captured by the model. For instance, sudden changes like post-conflict recovery or discovery of natural resources may have caused the model to mispredict for these cases.


The histogram of residuals shows a long right tail, highlighting cases where the model significantly underestimates GDP growth. This suggests potential nonlinear relationships between the variables or the presence of unmodeled factors influencing growth. 

#### Hypothesis 2 (education expenditure and unemployment rate)

Model: `log(unemployment)` $\sim$ $\alpha$ + `log(education_expenditure)` + `development_status` + (`development_status` $\times$ `log(education_expenditure)`) + `year` + (`development_status` $\times$ `log(education_expenditure)` $\times$ `year`)

In the code, the log transform is applied with `np.log1p`, i.e. $\log(1 + x)$, which stays finite when a value is 0.

In [None]:
# Linear Regression
# Education expenditure and unemployment rate analysis with log 
# transformation

unemployment_model_data = data[["Country Name", "Year", 
                                "country_type", "Unemployment", 
                                "Education_expenditure"]].dropna()

unemployment_model_data["country_type_encoded"] = \
    unemployment_model_data["country_type"].map({"low_income": 1, 
                                                 "lower_middle": 1, 
                                                 "upper_middle": 0, 
                                                 "high_income": 0})

unemployment_model_data["Unemployment"] = pd.to_numeric(
    unemployment_model_data["Unemployment"], errors="coerce")
unemployment_model_data["Education_expenditure"] = pd.to_numeric(
    unemployment_model_data["Education_expenditure"], errors="coerce")

# Apply log transformation
unemployment_model_data["log_Unemployment"] = np.log1p(
    unemployment_model_data["Unemployment"])
unemployment_model_data["log_Education_expenditure"] = np.log1p(
    unemployment_model_data["Education_expenditure"])

unemployment_model_data["interaction"] = \
    unemployment_model_data["country_type_encoded"] * \
    unemployment_model_data["log_Education_expenditure"]

unemployment_model_data["three_way_interaction"] = \
    unemployment_model_data["interaction"] * unemployment_model_data["Year"]

unemployment_model_data = unemployment_model_data.dropna(subset=[
    "log_Unemployment", "log_Education_expenditure", 
    "country_type_encoded", "interaction", "three_way_interaction"])

if not unemployment_model_data.empty:
    X_unemployment = unemployment_model_data[[
        "log_Education_expenditure", "country_type_encoded", 
        "interaction", "Year", "three_way_interaction"]]
    y_unemployment = unemployment_model_data["log_Unemployment"]
    X_unemployment = sm.add_constant(X_unemployment)
    unemployment_model = sm.OLS(y_unemployment, X_unemployment).fit()
    print("\nUnemployment Regression with Log Transformation:\n", 
          unemployment_model.summary())
else:
    print("No valid data for unemployment regression.")


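The regression shows an **R-squared of 0.035**, meaning the model explains only **3.5%** of the variation in log-unemployment; most of the variation stems from factors outside the model. **Log-education expenditure** has a small positive coefficient (0.119), suggesting a weak link between higher education spending and higher unemployment, possibly due to short-term labor market adjustments. The **country-type coefficient (-0.297)** indicates lower baseline log-unemployment in developing countries, all else equal. **The interaction term (-4.443)** is negative, while **the three-way interaction (0.0023)** is positive and is multiplied by the year, so the two must be read together: the developing-country adjustment to the education slope is -4.443 + 0.0023 × year. Using these rounded coefficients, that sum is small for years in the sample (roughly 0.08 in 1965 and 0.21 in 2023), so the large negative interaction coefficient alone does not show that education spending reduces unemployment more in developing countries.

For the reference group (**upper-middle-income and high-income countries**, encoded as 0), the intercept is **11.91**. This is the fitted log-unemployment at year 0 with zero log-education expenditure, so it is an extrapolation rather than a realistic baseline. Because both variables are log-transformed, the coefficient of 0.119 is approximately an elasticity: a 1% rise in education spending is associated with about a **0.12%** increase in unemployment. For developing countries, the lower baseline **(-0.297)**, the interaction **(-4.443)**, and the time-varying three-way term **(0.0023)** shift this relationship, and the slow growth of the three-way term over time may reflect structural or labor demand shifts.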
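
To make the combined effect of the interaction terms concrete, the sketch below plugs the rounded coefficients reported above into the slope implied by the model. The function name `education_slope` and the chosen years are illustrative assumptions. Because the three-way coefficient is rounded to two significant digits and is multiplied by a year near 2000, the results are only indicative and would shift noticeably with the unrounded estimates.

```python
# Rounded coefficients as reported in the regression summary above.
B_EDU = 0.119            # log(education expenditure)
B_INTERACTION = -4.443   # developing x log(education expenditure)
B_THREE_WAY = 0.0023     # developing x log(education expenditure) x year


def education_slope(year, developing):
    """Implied slope of log-unemployment on log-education expenditure.

    `developing` is 1 for low-income and lower-middle-income countries
    and 0 for the reference group.
    """
    return B_EDU + developing * (B_INTERACTION + B_THREE_WAY * year)


for year in (1965, 2000, 2023):
    print(year,
          "reference:", round(education_slope(year, 0), 3),
          "developing:", round(education_slope(year, 1), 3))

# Year at which the two developing-country terms cancel each other.
break_even_year = -B_INTERACTION / B_THREE_WAY
print("Break-even year:", round(break_even_year, 1))
```

With these rounded values the two developing-country terms cancel at a year before the sample begins, which is why the implied slopes for the two groups end up close to each other across the sample period.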

Residual analysis highlights some issues. In the Residual vs. Fitted Values plot, residuals mostly center around zero, but outliers exceed **+/-1.5**, likely from unmodeled factors like policy shifts or economic disruptions. A **right-skewed residual histogram** suggests the model underestimates unemployment in some cases. Adding predictors or testing for **nonlinearity** could improve the fit.

In [None]:

residuals_unemployment = unemployment_model.resid

fitted_values_unemployment = unemployment_model.fittedvalues

sns.residplot(x=fitted_values_unemployment, y=residuals_unemployment,
              lowess=True, line_kws={'color': 'red'})
plt.title("Residuals vs. Fitted Values (Homoscedasticity Check)")
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()

sns.histplot(residuals_unemployment, kde=True, bins=30, color='blue')
plt.title("Histogram of Residuals (Normality Check)")
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()

## Hypothesis 2 Significance Test

The regression results for Hypothesis 2 indicate **statistically significant** p-values for log-education expenditure **(p = 0.003)**, country type **(p < 0.001)**, and year **(p < 0.001)**. The three-way interaction term **(p = 0.068)** and the interaction term **(p = 0.073)** are not significant at the 5% level, although both fall below the 10% threshold, which hints at a possible effect. The **positive** coefficient for log-education expenditure (0.119) suggests that higher education spending is associated with higher unemployment, which may reflect complexities in the labor market or short-term disruptions. The negative interaction coefficient **(-4.443)**, taken on its own, points toward education spending reducing unemployment in developing countries, but it is largely offset by the positive three-way term once the year is included, so the direction of the net effect is uncertain. Based on these results, we reject the null hypothesis and conclude that education expenditure appears to have a differential effect on unemployment across country types, though the marginal significance of the interaction terms means further exploration is needed to clarify these relationships.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

X_unemployment_rf = unemployment_model_data[["log_Education_expenditure",
"country_type_encoded", "interaction", "Year", "three_way_interaction"]]
y_unemployment_rf = unemployment_model_data["log_Unemployment"]

# Initialize the Random Forest model
rf_model = RandomForestRegressor(random_state=42, n_estimators=100)

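# Fit the model on the full dataset
rf_model.fit(X_unemployment_rf, y_unemployment_rf)

# Predict on the same data used for fitting (in-sample predictions)
y_pred_rf = rf_model.predict(X_unemployment_rf)

# Performance metrics (in-sample)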
mse_rf = mean_squared_error(y_unemployment_rf, y_pred_rf)
r2_rf = r2_score(y_unemployment_rf, y_pred_rf)

print("Random Forest Mean Squared Error:", mse_rf)
print("Random Forest R-squared:", r2_rf)


The Random Forest model for analyzing the relationship between education expenditure and unemployment achieved a Mean Squared Error (MSE) of 0.0579 and an R-squared (R²) value of 0.8617. Both metrics are computed on the same data used to fit the model, so they describe in-sample fit and likely overstate how well the model would predict unseen observations.

The low MSE indicates that the model's fitted values for the log-transformed unemployment rate are, on average, very close to the observed values. The Random Forest therefore fits the data far more closely than linear regression, which explained little of the variance.

The R-squared value of 0.8617 means that the model accounts for 86.17% of the variability in log-unemployment within the data it was trained on. This suggests that education expenditure (log-transformed), country type, year, and the interaction terms jointly carry substantial non-linear information about unemployment, although confirming predictive capability would require evaluation on held-out data.
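
The gap between in-sample and out-of-sample performance can be illustrated with a short sketch. It uses synthetic features and a synthetic target (all values are made up) and compares the R-squared on the training rows with a 5-fold cross-validated R-squared from scikit-learn's `cross_val_score`. The number of folds is an assumed choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic features and target (made-up values for illustration).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = (np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1]
          + rng.normal(0, 0.5, 300))

rf_demo = RandomForestRegressor(n_estimators=100, random_state=42)

# In-sample R-squared: scored on the same rows used for fitting.
in_sample_r2 = rf_demo.fit(X_demo, y_demo).score(X_demo, y_demo)

# Out-of-sample R-squared: 5-fold cross-validation.
cv_r2 = cross_val_score(rf_demo, X_demo, y_demo, cv=5, scoring="r2")

print("In-sample R-squared:", round(in_sample_r2, 3))
print("Cross-validated R-squared:", round(cv_r2.mean(), 3))
```

For panel data like ours, where each country appears in many years, splitting by country (for example with scikit-learn's `GroupKFold`) may give a stricter estimate, since random folds can place the same country in both the training and test sets.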

## Conclusions

The analysis provides insights into how government expenditure on education impacts long-term GDP growth and unemployment rates in developing and developed countries. For GDP growth, the findings from the linear regression suggest that education spending has a negative immediate impact, reflected in the negative coefficient for education expenditure. This may stem from delayed benefits, as the returns on educational investments often take years to materialize, particularly in developing countries where investments focus on infrastructure or younger populations. The interaction term indicates that this negative impact is less pronounced in developing countries, likely because these nations have more to gain from educational improvements. However, the low R-squared value highlights the complexity of GDP growth, which is influenced by numerous factors such as trade policies, political stability, and natural resource management. Random Forest analysis of GDP growth resulted in an R-squared of -0.356, indicating that it failed to explain variance effectively in this context, underscoring the challenges of modeling GDP growth.

For unemployment, the linear regression results indicate a weak positive relationship between education expenditure and unemployment rates, which might reflect the mismatch between education systems and labor market demands, particularly in developing countries. The interaction terms, which are significant only at the 10% level, suggest that the impact of education spending may vary by country type, likely due to differences in labor market structures and the stages of economic development. Random Forest regression fit the data much more closely, with an in-sample R-squared of 0.86, suggesting it captured non-linear relationships and complex interactions that linear regression missed. These results demonstrate that the impact of education spending on unemployment is multifaceted, with effects shaped by broader structural factors.

Outliers identified in the residual plots for both GDP growth and unemployment highlight events such as financial crises, rapid recoveries, or other country-specific conditions that the models fail to capture. These outliers suggest that economic relationships are highly context-dependent and that linear models may struggle to fully account for such variability. Random Forest offered improvements in capturing these dynamics, but further refinement is required.

Overall, the findings suggest that while education expenditure has an impact, its effects on GDP growth and unemployment are influenced by context, time lags, and structural factors. Developing countries stand to benefit the most in the long term from improving education systems, but short-term effects remain less evident. 

Moving forward, for Phase 5 we will focus on exploring additional non-linear models, since linear regression may not be the best fit for this data. We will also explore lagged effects and feature engineering to improve model performance and provide a deeper understanding of the dynamics between education spending, economic growth, and labor market outcomes.
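
As a starting point for the lagged effects mentioned above, lagged versions of education expenditure can be built per country with a grouped shift in pandas. This is a minimal sketch on a toy panel: the column names follow the notebook, but the values, the lag lengths, and the new column names are illustrative assumptions.

```python
import pandas as pd

# Toy panel with the notebook's column names (values are made up).
panel = pd.DataFrame({
    "Country Name": ["A", "A", "A", "B", "B", "B"],
    "Year": [2000, 2001, 2002, 2000, 2001, 2002],
    "Education_expenditure": [4.0, 4.2, 4.5, 3.0, 3.1, 3.3],
})

# Sort so that shifting moves values forward in time within each country.
panel = panel.sort_values(["Country Name", "Year"])

# Education spending from one and two years earlier, per country.
# This assumes consecutive years; gaps would need reindexing first.
for lag in (1, 2):
    panel[f"Education_expenditure_lag{lag}"] = (
        panel.groupby("Country Name")["Education_expenditure"].shift(lag))

print(panel)
```

The earliest rows for each country have no lagged value and come out as missing, so they would need to be dropped before fitting a model that uses the lagged columns.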

## Limitations

Our model does not encapsulate all of the variables that may be involved in our predictions. Since our data is at the national level, we fail to account for regional and demographic disparities within countries. For example, we do not account for factors such as unemployment differences in urban versus rural areas or differences among socioeconomic groups. As a result, the predictions of our models cannot be generalized to all unemployment situations. Other limitations are listed below.

- **Missing Data**: The dataset includes gaps, particularly among developing nations with less robust data collection, which were either excluded or filled using interpolation. This can lead to trends that may not accurately reflect real economic conditions, potentially skewing the results.
- **Bias in Self-Reporting**: Data from government sources may be prone to inaccuracies, especially in politically sensitive areas such as unemployment and GDP growth. Underreporting or overreporting, particularly in developing countries, could distort the findings. For example, informal labor markets may not be fully captured, and education spending may exclude inefficiencies or misallocations.
- **Interpolation**: Missing data filled via interpolation can smooth over significant real-world fluctuations, underestimating volatility or economic shocks. Sudden changes, such as those caused by political or financial crises, may therefore be inadequately reflected.
- **Measurement and Definition Issues**: Key variables like "education expenditure" and "unemployment" are simplified measures. They do not account for qualitative factors such as spending efficiency, allocation between education levels, or underemployment, particularly in developing economies.
- **Time Lag Effects**: The models do not account for the delayed effects of education spending, which can take years to impact GDP growth or unemployment. This omission could lead to an overemphasis on short-term trends while missing long-term benefits.
- **Unaccounted Variables and External Shocks**: Other influences, such as trade policies, political stability, demographic shifts, and technological advancements, are omitted, which could result in biased estimates. Additionally, events like financial crises or natural disasters introduce noise that may reduce the reliability of the findings.
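
The interpolation limitation listed above can be illustrated with a small example. The series below is hypothetical (a made-up growth series with a one-year crisis), and linear interpolation via `pandas.Series.interpolate` is assumed here for illustration; the method used in our cleaning step may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical GDP growth series (%) with a one-year crisis in 2009.
years = [2007, 2008, 2009, 2010, 2011]
true_growth = pd.Series([3.0, 2.5, -8.0, 2.0, 3.0], index=years)

# Suppose the crisis year was never reported.
observed = true_growth.copy()
observed.loc[2009] = np.nan

# Linear interpolation fills the gap from the neighbouring years.
filled = observed.interpolate(method="linear")

print("True 2009 value:  ", true_growth.loc[2009])
print("Filled 2009 value:", filled.loc[2009])
print("Std (true):  ", round(true_growth.std(), 2))
print("Std (filled):", round(filled.std(), 2))
```

The filled value sits midway between its neighbours, so the shock disappears from the series and its measured volatility drops.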

## Resources


**Economic Papers**

Barro, Robert J. “Economic Growth in a Cross Section of Countries.” The Quarterly Journal of Economics, vol. 106, no. 2, 1991, pp. 407–43. JSTOR, https://doi.org/10.2307/2937943. Accessed 7 Dec. 2024.

Hanushek, Eric A., and Ludger Woessmann. “Do Better Schools Lead to More Growth? Cognitive Skills, Economic Outcomes, and Causation.” Journal of Economic Growth, vol. 17, no. 4, 14 July 2012, pp. 267–321, hanushek.stanford.edu/sites/default/files/publications/Hanushek%2BWoessmann%202012%20JEconGrowth%2017%284%29.pdf, https://doi.org/10.1007/s10887-012-9081-x. 


Psacharopoulos, George, and Harry Anthony Patrinos. “Returns to Investment in Education: A Decennial Review of the Global Literature.” Education Economics, vol. 26, no. 5, 7 June 2018, pp. 445–458, www.tandfonline.com/doi/full/10.1080/09645292.2018.1484426, https://doi.org/10.1080/09645292.2018.1484426.

OECD (2024), Education at a Glance 2024: OECD Indicators, OECD Publishing, Paris, https://doi.org/10.1787/c00cad36-en.

**World Bank Open Data:** 

The World Bank. “World Development Indicators | DataBank.” Worldbank.org, The World Bank, 2024, databank.worldbank.org/source/world-development-indicators#.


**Data Manipulation Technique:** 

GeeksforGeeks. “Interpolation in Python.” GeeksforGeeks, 19 Mar. 2024, www.geeksforgeeks.org/interpolation-in-python/.


**Python Libraries:** 

Pandas. “Pandas Documentation — Pandas 1.0.1 Documentation.” Pandas.pydata.org, 2024, pandas.pydata.org/docs/.

NumPy. “NumPy Documentation.” Numpy.org, numpy.org/doc/.

Matplotlib. “Matplotlib: Python Plotting — Matplotlib 3.3.4 Documentation.” Matplotlib.org, 2024, matplotlib.org/stable/index.html.


Scikit-learn. “Scikit-Learn: Machine Learning in Python.” Scikit-Learn.org, 2019, scikit-learn.org/stable/.

“Introduction — Statsmodels.” Www.statsmodels.org, www.statsmodels.org/stable/index.html.


**Python Resources**

https://stackoverflow.com/questions/43920341/facetgrid-change-titles

https://stackoverflow.com/questions/29813694/how-to-add-a-title-to-seaborn-facet-plot

https://matplotlib.org/stable/gallery/lines_bars_and_markers/barchart.html#sphx-glr-gallery-lines-bars-and-markers-barchart-py

[Back To Top](#Table-of-Contents)