# Ethiopia Socio-Economic Machine Learning Project

End-to-end ML project using Ethiopia's World Bank World Development Indicators (WDI)
from `API_ETH_DS2_en_csv_v2_6515.csv`.

**Learning objectives**:
- Data understanding & preprocessing
- Exploratory Data Analysis (EDA)
- Supervised learning (regression & classification)
- Unsupervised learning (clustering)
- Model comparison, bias–variance, and interpretation

This notebook is structured like a full academic project report and is suitable
for a university ML course.


## 0. Imports and configuration

We reuse the implementation in `ethiopia_socioeconomic_ml_project.py` and
import all the key functions here so that we can interactively run each
pipeline step in the notebook.


In [None]:
from pathlib import Path

import matplotlib.pyplot as plt
import seaborn as sns

# set_theme is the current name for the older sns.set alias (seaborn >= 0.11).
sns.set_theme(style="whitegrid", context="talk")

from ethiopia_socioeconomic_ml_project import (
    DATA_PATH,
    load_and_prepare_dataset,
    perform_eda,
    choose_regression_target,
    prepare_supervised_datasets,
    build_regression_models,
    evaluate_regression_models,
    plot_regression_feature_importance,
    create_growth_categories,
    prepare_classification_data,
    build_classification_models,
    evaluate_classification_models,
    plot_confusion_matrices,
    perform_kmeans_clustering,
    summarize_bias_variance_and_overfitting,
)


## 1. Data understanding & preprocessing

We load the Ethiopia WDI dataset, drop metadata rows, reshape to a
Year × Indicator matrix, and select numeric indicators with sufficient
coverage across years.
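
The reshaping step can be sketched as follows. This is a minimal illustration, not the code inside `load_and_prepare_dataset`: the helper name `reshape_wdi`, the `min_coverage` threshold of 0.6, and the toy rows are all assumptions made for this example. It assumes the usual WDI wide layout, with one row per indicator and one column per year.

```python
import pandas as pd


def reshape_wdi(raw: pd.DataFrame, min_coverage: float = 0.6) -> pd.DataFrame:
    """Turn a WDI-style wide table into a Year x Indicator matrix.

    Hypothetical helper: keeps only indicators whose share of
    non-missing years is at least `min_coverage`.
    """
    # Year columns are the ones whose header is purely numeric ("1960", ...).
    year_cols = [c for c in raw.columns if str(c).isdigit()]
    wide = raw.set_index("Indicator Code")[year_cols].T
    wide.index = wide.index.astype(int)
    wide.index.name = "Year"
    # Coerce anything non-numeric to NaN so coverage is computed consistently.
    wide = wide.apply(pd.to_numeric, errors="coerce")
    coverage = wide.notna().mean()
    return wide.loc[:, coverage >= min_coverage]


# Toy rows with made-up values, only to show the shape of the result.
raw = pd.DataFrame(
    {
        "Country Name": ["Ethiopia", "Ethiopia"],
        "Country Code": ["ETH", "ETH"],
        "Indicator Name": ["Toy indicator A", "Toy indicator B"],
        "Indicator Code": ["TOY.A", "TOY.B"],
        "2000": [1.0, None],
        "2001": [2.0, None],
        "2002": [3.0, 5.0],
    }
)
matrix = reshape_wdi(raw, min_coverage=0.6)
print(matrix)
```

Here `TOY.B` is dropped because it has values for only one of three years, which falls below the assumed coverage threshold.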


In [None]:
features_df, indicator_meta = load_and_prepare_dataset(DATA_PATH)
print("Years available:", features_df.index.min(), "to", features_df.index.max())
print("Number of indicators used:", features_df.shape[1])
features_df.head()


## 2. Exploratory Data Analysis (EDA)

We examine summary statistics, a correlation heatmap for high-variance
indicators, and time trends of key socio-economic indicators.
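
As a rough sketch of the correlation step, the snippet below picks the `k` columns with the largest variance and computes their correlation matrix. The helper name `top_variance_corr` and the synthetic data are assumptions for illustration; `perform_eda` may select indicators differently. Note that raw variance depends on each indicator's units, so a real analysis might rank on standardized or relative variability instead.

```python
import numpy as np
import pandas as pd


def top_variance_corr(df: pd.DataFrame, k: int = 3) -> pd.DataFrame:
    """Correlation matrix of the k highest-variance numeric columns (hypothetical helper)."""
    variances = df.var(numeric_only=True).sort_values(ascending=False)
    top_cols = variances.index[:k]
    return df[top_cols].corr()


# Synthetic columns with very different scales, so the ranking is unambiguous.
rng = np.random.default_rng(0)
scales = [1.0, 20.0, 400.0, 0.01, 0.1]
toy = pd.DataFrame(rng.normal(size=(30, 5)) * scales, columns=list("ABCDE"))

corr = top_variance_corr(toy, k=3)
print(corr.round(2))
```

The resulting matrix can be passed to `sns.heatmap` to draw the heatmap.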


In [None]:
perform_eda(features_df, indicator_meta)


## 3A. Supervised learning – Regression

We choose a continuous economic indicator (e.g., an export or GDP-related
series) as the regression target and train multiple models:
- Linear Regression
- KNN Regressor
- SVR (RBF kernel)
- Random Forest Regressor

We compare them using RMSE, MAE, and R², and inspect Random Forest feature
importances.
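
The comparison can be sketched with scikit-learn as below. This is an illustrative stand-in for `build_regression_models` and `evaluate_regression_models`, not their actual code: the function names, the scaling step, the hyperparameters, and the synthetic data from `make_regression` are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def sketch_regression_models():
    """Hypothetical model dictionary: each model sits behind a scaler."""
    return {
        "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
        "KNN Regressor": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
        "SVR (RBF kernel)": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
        "Random Forest Regressor": make_pipeline(
            StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=0)
        ),
    }


def score_regressors(models, X_train, X_test, y_train, y_test):
    """Fit each model and report RMSE, MAE, and R2 on the held-out set."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        rows.append(
            {
                "Model": name,
                "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
                "MAE": float(mean_absolute_error(y_test, pred)),
                "R2": float(r2_score(y_test, pred)),
            }
        )
    return pd.DataFrame(rows).set_index("Model").sort_values("RMSE")


# Synthetic data only; the notebook itself uses the WDI feature matrix.
X, y = make_regression(n_samples=60, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
toy_results = score_regressors(sketch_regression_models(), X_tr, X_te, y_tr, y_te)
print(toy_results)
```

RMSE penalizes large errors more heavily than MAE, so a wide gap between the two for a given model suggests a few badly predicted years.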


In [None]:
target_code = choose_regression_target(features_df)
target_name = indicator_meta.get(target_code, target_code)
print(f"Chosen regression target: {target_code} – {target_name}")

(
    X_train_reg,
    X_test_reg,
    y_train_reg,
    y_test_reg,
    feature_names_reg,
) = prepare_supervised_datasets(features_df, target_code)

reg_models = build_regression_models()
reg_results = evaluate_regression_models(
    reg_models, X_train_reg, X_test_reg, y_train_reg, y_test_reg
)
reg_results


In [None]:
rf_reg_pipeline = reg_models["Random Forest Regressor"]
plot_regression_feature_importance(rf_reg_pipeline, feature_names_reg)


## 3B. Supervised learning – Classification

We convert the continuous target into growth **categories** (Low/Medium/High)
based on year-over-year percentage change, then train:
- Logistic Regression
- KNN Classifier
- SVM (RBF kernel)
- Random Forest Classifier

We evaluate models using Accuracy, Precision, Recall, and F1-score and
visualize confusion matrices.
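
One way to derive the growth categories is sketched below. It assumes tercile cut points via `pd.qcut`; the thresholds actually used by `create_growth_categories` are not stated here, so treat the binning rule, the helper name `growth_categories`, and the toy series as assumptions.

```python
import pandas as pd


def growth_categories(series: pd.Series) -> pd.Series:
    """Label each year Low/Medium/High by tercile of year-over-year % change."""
    growth_pct = series.pct_change().mul(100).dropna()  # first year has no predecessor
    return pd.qcut(growth_pct, q=3, labels=["Low", "Medium", "High"])


# Made-up values for illustration, indexed by year.
toy_target = pd.Series(
    [100.0, 101.0, 105.0, 115.0, 116.0, 122.0, 140.0],
    index=range(2000, 2007),
)
categories = growth_categories(toy_target)
print(categories)
```

Tercile binning gives roughly balanced classes by construction, which helps when only a few dozen yearly observations are available. Fixed thresholds (for example, growth below 2%) are easier to interpret but can leave some classes almost empty.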


In [None]:
df_cls = create_growth_categories(features_df, target_code)
(
    X_train_cls,
    X_test_cls,
    y_train_cls,
    y_test_cls,
    feature_names_cls,
    class_names,
) = prepare_classification_data(df_cls)

cls_models = build_classification_models()
cls_results = evaluate_classification_models(
    cls_models,
    X_train_cls,
    X_test_cls,
    y_train_cls,
    y_test_cls,
    class_names,
)
cls_results


In [None]:
plot_confusion_matrices(cls_models, X_test_cls, y_test_cls, class_names)


## 4. Unsupervised learning – KMeans clustering

We perform KMeans clustering on standardized indicators to identify groups
of years with similar socio-economic profiles and visualize the clusters in
a 2D PCA projection.
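
The clustering step can be sketched as below, assuming the data have no missing values (real indicator data would need imputation first). The helper name `cluster_and_project`, the choice of three clusters, and the synthetic blobs are assumptions, not the settings used by `perform_kmeans_clustering`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def cluster_and_project(X, n_clusters=3, random_state=0):
    """Standardize, cluster with KMeans, and project to 2D with PCA."""
    X_scaled = StandardScaler().fit_transform(X)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = kmeans.fit_predict(X_scaled)
    # PCA is used only for visualization; clustering runs in the full space.
    coords = PCA(n_components=2).fit_transform(X_scaled)
    return labels, coords


# Synthetic stand-in for the Year x Indicator matrix (45 "years", 6 "indicators").
X_toy, _ = make_blobs(
    n_samples=45, centers=3, n_features=6, cluster_std=0.5, random_state=0
)
labels, coords = cluster_and_project(X_toy, n_clusters=3)
print(np.bincount(labels))
```

Standardizing first matters because KMeans relies on Euclidean distance, so an indicator measured in billions would otherwise dominate one measured in percentages.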


In [None]:
cluster_labels, cluster_centers_df = perform_kmeans_clustering(features_df)
cluster_centers_df.iloc[:, :10]


## 5. Model comparison, interpretation, and conclusions

Here we connect the quantitative results to ML theory (bias–variance,
overfitting vs. underfitting) and to Ethiopia's development context.
Use this section as a template for writing the **discussion and conclusion**
sections of your project report.
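
A simple heuristic for reading bias and variance from scores is to compare training and test performance. The sketch below is illustrative only: the function name `diagnose_fit`, the thresholds `gap_tol` and `low_score`, and the example scores are assumptions, and `summarize_bias_variance_and_overfitting` may apply different logic.

```python
def diagnose_fit(train_score, test_score, gap_tol=0.15, low_score=0.5):
    """Rough fit label from train/test scores (e.g. R2 or accuracy).

    Thresholds are arbitrary illustrative defaults, not established cut-offs.
    """
    gap = train_score - test_score
    if train_score < low_score and test_score < low_score:
        # Poor on both sets: the model is too simple for the signal.
        return "likely underfitting (high bias)"
    if gap > gap_tol:
        # Much better on training data than on unseen data.
        return "likely overfitting (high variance)"
    return "reasonable fit"


# Made-up (train, test) score pairs for illustration.
toy_scores = {
    "flexible model": (0.99, 0.40),
    "rigid model": (0.30, 0.25),
    "balanced model": (0.85, 0.80),
}
for name, (train, test) in toy_scores.items():
    print(f"{name}: {diagnose_fit(train, test)}")
```

With only a few dozen yearly observations, a single train/test split is noisy, so cross-validated scores give a steadier basis for this kind of judgment.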


In [None]:
summarize_bias_variance_and_overfitting(reg_results, cls_results)
