<a href="https://colab.research.google.com/github/JonasGiven/GDP-PREDICTION-IN-2030-IN-SA/blob/main/sa_gdp_matricpassrate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Predicting GDP Growth Based on Educational Pass Rates: An Analytical Approach Using Historical Data**

## **Problem Statement**

Despite significant investments in education, the relationship between educational outcomes and economic performance remains complex and not fully understood. This project investigates the correlation between South Africa's historical matric pass rates and its GDP from 1995 to 2023. Using these trends, it builds a predictive model that estimates future GDP from the year and the pass rate. Specifically, it estimates GDP in 2030 if the pass rate rises by 10 percentage points, from 80.1% in 2022 to 90.1%.


## **Context**

Education is widely regarded as a key driver of economic development. A well-educated population is often linked to higher productivity, innovation, and overall economic growth. However, quantifying this relationship poses a challenge due to the myriad of influencing factors. This project leverages historical data on pass rates and GDP from 1995 to 2023 to explore this relationship in a structured and empirical manner. By employing machine learning techniques, the project will provide insights into how improvements in education can translate into economic gains. This analysis will aid policymakers and stakeholders in making informed decisions to foster both educational and economic advancements.

## **Import and install necessary libraries**

In [None]:
!pip install mlflow

In [None]:
import mlflow
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

## **Load dataset**

In [None]:
df_gdp = pd.read_csv("https://raw.githubusercontent.com/JonasGiven/Datasets/main/SA_matricratepass_gdp%202024-07-04%2012_46_25.csv")

In [None]:
df_gdp.head(10)

## **Data exploration**

In [None]:
# Drop the 2023 row, which is treated as an outlier
df_gdp = df_gdp[df_gdp['Year'] != 2023]

# Display the tail of the DataFrame to verify
df_gdp.tail()
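
The cell above drops 2023 as an outlier without showing the check behind that decision. One common check is the 1.5 × IQR rule, the same rule box plots use to place their whiskers. The sketch below is a minimal version of it; the helper name `flag_iqr_outliers` and the default factor `k=1.5` are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def flag_iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]. Hypothetical helper for illustration."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)
```

Applied to the data before the 2023 row is removed, `df_gdp[flag_iqr_outliers(df_gdp['GDP in billion $'])]` would list the rows the rule flags, which can then be compared with the choice to drop 2023.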

In [None]:
#@title Box and whisker diagram of GDP and official matric pass rate
# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Box plot for the official matric pass rate

sns.boxplot(data=df_gdp, x='Official matric pass rate', ax=axes[0], color ='cyan')
axes[0].set_title('Box and Whisker Plot for Official Matric Pass Rate')
axes[0].set_xlabel('Official matric pass rate (%)')

# Calculate quartiles and min/max for pass rate
q1_pass = df_gdp['Official matric pass rate'].quantile(0.25)
median_pass = df_gdp['Official matric pass rate'].median()
q3_pass = df_gdp['Official matric pass rate'].quantile(0.75)
min_pass = df_gdp['Official matric pass rate'].min()
max_pass = df_gdp['Official matric pass rate'].max()

# Annotate quartiles and min/max for pass rate below the plot
axes[0].text(0.5, -0.15, f'Min: {min_pass:.2f}\nQ1: {q1_pass:.2f}\nMedian: {median_pass:.2f}\nQ3: {q3_pass:.2f}\nMax: {max_pass:.2f}',
             transform=axes[0].transAxes, ha='center', va='top', fontsize=10, bbox=dict(facecolor='white', alpha=0.5))

# Box plot for GDP in billion $
sns.boxplot(data=df_gdp, x='GDP in billion $', ax=axes[1])
axes[1].set_title('Box and Whisker Plot for GDP in Billion $')
axes[1].set_xlabel('GDP (in billion $)')

# Calculate quartiles and min/max for GDP
q1_gdp = df_gdp['GDP in billion $'].quantile(0.25)
median_gdp = df_gdp['GDP in billion $'].median()
q3_gdp = df_gdp['GDP in billion $'].quantile(0.75)
min_gdp = df_gdp['GDP in billion $'].min()
max_gdp = df_gdp['GDP in billion $'].max()

# Annotate quartiles and min/max for GDP below the plot
axes[1].text(0.5, -0.15, f'Min: {min_gdp:.2f}\nQ1: {q1_gdp:.2f}\nMedian: {median_gdp:.2f}\nQ3: {q3_gdp:.2f}\nMax: {max_gdp:.2f}',
             transform=axes[1].transAxes, ha='center', va='top', fontsize=10, bbox=dict(facecolor='white', alpha=0.5))


# Display the plots
plt.tight_layout()
plt.show()

### Observational Report

The box and whisker plots summarise the distribution of the official matric pass rate and of GDP. For the pass rate, the minimum was 47.40%, with Q1 at 61.43%, a median of 69.60%, Q3 at 75.27%, and a maximum of 81.30%, so the middle half of the years had pass rates between roughly 61% and 75%. For GDP, the minimum was \$129.09 billion, Q1 was \$171.05 billion, the median was \$326.67 billion, Q3 was \$392.22 billion, and the maximum was \$458.20 billion. GDP therefore covers a wide range, and the gap between Q1 and the median (about \$156 billion) is more than twice the gap between the median and Q3 (about \$66 billion). Because each box plot describes one variable on its own, these plots cannot show whether higher pass rates go with higher GDP; the scatter plot and time series that follow examine that relationship.
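
Whether the two variables move together can be measured directly instead of being read off the summary statistics. The sketch below computes a Pearson correlation (linear association) and a rank-based correlation (monotonic association, obtained by correlating the ranks). The function name `correlation_summary` is an assumption for illustration; only `Series.corr` and `Series.rank` from pandas are used.

```python
import pandas as pd

def correlation_summary(df: pd.DataFrame, x_col: str, y_col: str) -> dict:
    """Pearson and rank-based correlation between two columns.
    Hypothetical helper, not part of the original notebook."""
    x, y = df[x_col], df[y_col]
    return {
        "pearson": x.corr(y),             # linear association
        "rank": x.rank().corr(y.rank()),  # monotonic association (Spearman-style)
    }
```

In this notebook it could be called as `correlation_summary(df_gdp, 'Official matric pass rate', 'GDP in billion $')`. Values near 1 would support the positive association, while values near 0 would not.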

In [None]:
# @title Official matric pass rate vs GDP in billion $
df_gdp.plot(kind='scatter', x='Official matric pass rate', y='GDP in billion $', s=32, alpha=0.8, color='green', marker='+')
plt.gca().spines[['top', 'right']].set_visible(False)

In [None]:
# @title Year vs GDP in billion $

def _plot_series(series, series_name, color='red'):
  # Plot GDP against year, using the colour passed by the caller
  xs = series['Year']
  ys = series['GDP in billion $']

  plt.plot(xs, ys, label=series_name, color=color)

fig, ax = plt.subplots(figsize=(10, 5.2), layout='constrained')
df_sorted = df_gdp.sort_values('Year', ascending=True)
_plot_series(df_sorted, '')
sns.despine(fig=fig, ax=ax)
plt.xlabel('Year')
_ = plt.ylabel('GDP in billion $')

In [None]:
# @title Year vs Official matric pass rate
def _plot_series(series, series_name, color='green'):
  # Plot the pass rate against year, using the colour passed by the caller
  xs = series['Year']
  ys = series['Official matric pass rate']

  plt.plot(xs, ys, label=series_name, color=color)

fig, ax = plt.subplots(figsize=(10, 5.2), layout='constrained')
df_sorted = df_gdp.sort_values('Year', ascending=True)
_plot_series(df_sorted, '')
sns.despine(fig=fig, ax=ax)
plt.xlabel('Year')
_ = plt.ylabel('Official matric pass rate')

In [None]:
# @title GDP vs official matric pass rate from 1995 - 2022
# Create figure and axis objects
fig, ax1 = plt.subplots(figsize=(10, 6))

# Plot Pass Rate on the left y-axis

ax1.plot(df_gdp['Year'], df_gdp['Official matric pass rate'], label='Official matric pass rate', marker='s', color='green')
ax1.set_xlabel('Year')
ax1.set_ylabel('Official matric pass rate (%)', color='green')
ax1.tick_params(axis='y', labelcolor='green')

# Create a second y-axis for GDP
ax2 = ax1.twinx()
ax2.plot(df_gdp['Year'], df_gdp['GDP in billion $'], label='GDP (in billion $)', marker='o', color='red')
ax2.set_ylabel('GDP (in billion $)', color='red')
ax2.tick_params(axis='y', labelcolor='red')

# Title and legend
plt.title('GDP vs official matric pass rate over Years')
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')

# Grid
ax1.grid(True)

plt.show()

### Observations

My analysis suggests that South Africa's GDP and the official matric pass rate tend to move together over the years. When the pass rate increases, GDP also tends to rise, and when the pass rate drops, GDP tends to fall or grow more slowly. For instance, from around 1998 to 2003 there was a significant increase in the pass rate, accompanied by a rise in GDP. From 2003 to 2010 the pass rate declined sharply, and GDP growth slowed. From 2009 to 2014 the pass rate increased sharply and GDP grew as well. The pattern continues in 2020 and 2021, when the pass rate dropped during COVID-19 and GDP decreased slightly. These trends are consistent with a link between educational outcomes and economic performance in South Africa, although co-movement alone does not show that one causes the other.
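
The statement that GDP rises when the pass rate rises is a statement about year-over-year changes, so it can be checked on the changes and not only on the levels. Two series that both trend upward can look correlated in levels even when their yearly movements are unrelated. The sketch below is one way to run that check; the function name `change_correlation` is an assumption for illustration, and the default column names are the ones used in this notebook.

```python
import pandas as pd

def change_correlation(df: pd.DataFrame,
                       year_col: str = "Year",
                       x_col: str = "Official matric pass rate",
                       y_col: str = "GDP in billion $") -> float:
    """Pearson correlation between the year-over-year changes of two columns.
    Hypothetical helper, not part of the original notebook."""
    ordered = df.sort_values(year_col)                 # make sure rows are in time order
    changes = ordered[[x_col, y_col]].diff().dropna()  # first differences; first row has none
    return changes[x_col].corr(changes[y_col])
```

Calling `change_correlation(df_gdp)` returns a single number. A clearly positive value would back the observation above, while a value near zero would suggest the agreement in levels comes mostly from the shared upward trend.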

## **Model training**

In [None]:
# Define the target (y) and the features (X)
y = df_gdp['GDP in billion $']
X = df_gdp[['Year', 'Official matric pass rate']]

# Data splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

# Model building
rf = RandomForestRegressor(max_depth=2, random_state=100)
rf.fit(X_train, y_train)
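
With fewer than thirty yearly rows, a single 80/20 split leaves only a handful of test points, so the test score depends heavily on which years land in the test set. The sketch below shows k-fold cross-validation as one way to get a more stable estimate. It reuses the model settings from the cell above (`max_depth=2`, seed 100); the function name `cv_r2_scores` and the choice of five shuffled folds are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

def cv_r2_scores(X, y, n_splits: int = 5, seed: int = 100) -> np.ndarray:
    """R² on each held-out fold, using the same model settings as above.
    Hypothetical helper, not part of the original notebook."""
    model = RandomForestRegressor(max_depth=2, random_state=seed)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(model, X, y, cv=folds, scoring="r2")
```

`cv_r2_scores(X, y)` returns one R² per fold; the mean and the spread across folds say more about reliability than one split does. Because the rows are yearly observations, `TimeSeriesSplit` from `sklearn.model_selection` is an alternative that only tests on years later than the ones used for training.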

In [None]:
# Confirm that X_train and X_test are DataFrames with the expected column names
print("Training features:", X_train.columns)
print("Testing features:", X_test.columns)

## **Apply model to make predictions**

In [None]:
y_rf_train_pred = rf.predict(X_train)
y_rf_test_pred = rf.predict(X_test)

In [None]:
# Print the predictions
print("Training predictions:", y_rf_train_pred)
print("Testing predictions:", y_rf_test_pred)

In [None]:
#@title Predicted GDP in 2030 if the official pass rate increases by 10 percentage points from 2022
# In 2022 the official pass rate was 80.1%; the scenario assumes 90.1% in 2030

# Define the input data for prediction
input_data = pd.DataFrame({
    'Year': [2030],
    'Official matric pass rate': [90.1]
})

# Predict
pred_GDP_2030 = rf.predict(input_data)

print("Predicted GDP for 2030:", pred_GDP_2030[0])
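
One caveat applies to the prediction above. The input (year 2030, pass rate 90.1%) lies outside the range the model was trained on, which ends at 2022 with pass rates no higher than 81.30%. A random forest predicts by averaging training targets within leaves, so its output can never exceed the largest GDP value in the training set, whatever trend the data follow. The sketch below demonstrates this on synthetic data, not the GDP dataset; the names `X_demo`, `y_demo`, `forest`, and `far_prediction` are introduced only for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (not the GDP dataset): a perfectly linear trend y = 2x
X_demo = np.arange(10).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel()

forest = RandomForestRegressor(max_depth=2, random_state=100).fit(X_demo, y_demo)

# x = 100 is far outside the training range 0..9; the true trend would give 200
far_prediction = forest.predict(np.array([[100]]))[0]

# Each tree returns a mean of training targets, so the forest cannot
# return a value above the largest target it was trained on
print(far_prediction <= y_demo.max())
```

For the same reason, the 2030 figure is better read as the model's estimate for the most recent, highest pass-rate region it has seen than as a forecast of the trend. A model that can extrapolate, such as a linear regression, could be fitted alongside it for comparison.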

## **Evaluate model performance**

In [None]:
rf_train_mse = mean_squared_error(y_train, y_rf_train_pred)
rf_train_r2 = r2_score(y_train, y_rf_train_pred)

rf_test_mse = mean_squared_error(y_test, y_rf_test_pred)
rf_test_r2 = r2_score(y_test, y_rf_test_pred)
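
The metrics above are computed but not displayed in the notebook. The sketch below collects them in one table and adds the RMSE, which is in the same unit as the target (billion $) and is therefore easier to compare with the GDP values. The function name `metrics_table` is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

def metrics_table(y_train, train_pred, y_test, test_pred) -> pd.DataFrame:
    """Collect MSE, RMSE and R² for the training and test sets in one table.
    Hypothetical helper, not part of the original notebook."""
    rows = []
    for name, actual, predicted in [("train", y_train, train_pred),
                                    ("test", y_test, test_pred)]:
        mse = mean_squared_error(actual, predicted)
        rows.append({"split": name, "mse": mse, "rmse": np.sqrt(mse),
                     "r2": r2_score(actual, predicted)})
    return pd.DataFrame(rows).set_index("split")
```

In this notebook it could be called as `metrics_table(y_train, y_rf_train_pred, y_test, y_rf_test_pred)`. A large gap between the train and test rows would point to overfitting.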

## **Configure and run mlflow**

In [None]:
# @title mlflow, dagshub

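# Read the DagsHub token from the environment (or prompt for it) instead of hard-coding it
os.environ.setdefault('MLFLOW_TRACKING_USERNAME', 'Jonas')
if 'MLFLOW_TRACKING_PASSWORD' not in os.environ:
    from getpass import getpass
    os.environ['MLFLOW_TRACKING_PASSWORD'] = getpass('DagsHub access token: ')
os.environ['MLFLOW_TRACKING_PROJECTNAME'] = 'ml002_SA_matricpassrate_gdp'
mlflow.set_tracking_uri('https://www.dagshub.com/Jonas/ml002_SA_matricpassrate_gdp.mlflow')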

# Log results with MLflow
if mlflow.active_run():
    mlflow.end_run()

with mlflow.start_run(run_name='gdp_matricpassrate_rf'):

    # Log parameters
    mlflow.log_param('train_test_split', 0.2)
    mlflow.log_param('input_data', 'SA_matricratepass_gdp')
    mlflow.log_param('random_state', 100)
    mlflow.log_param('model', 'random forest regressor')
    mlflow.log_param('max_depth', 2)

    # Log metrics
    mlflow.log_metric('train_mse', rf_train_mse)
    mlflow.log_metric('train_r2', rf_train_r2)
    mlflow.log_metric('test_mse', rf_test_mse)
    mlflow.log_metric('test_r2', rf_test_r2)
    mlflow.log_metric('pred_GDP_2030', float(pred_GDP_2030[0]))  # log the scalar, not the one-element array

    # Log model
    mlflow.sklearn.log_model(rf, 'model')

    # Plotting
    plt.figure(figsize=(5, 5))
    plt.scatter(y_train, y_rf_train_pred, c="#7CAE00", alpha=0.3)  # alpha makes regions that are highly overlapping darker

    z = np.polyfit(y_train, y_rf_train_pred, 1)
    p = np.poly1d(z)

    plt.plot(y_train, p(y_train), '#F8766D')
    plt.ylabel('Predicted GDP (billion $)')
    plt.xlabel('Actual GDP (billion $)')
    plt.title('Actual vs Predicted GDP (training set)')

    # Save plot
    plot_filename = 'rf_experimental_predicted.png'
    plt.savefig(plot_filename)
    mlflow.log_artifact(plot_filename)
    plt.show()

print('MLflow run completed successfully')