In [10]:
import sys, os
sys.path.append(os.path.abspath(".."))

from src.data_processor import DataProcessor

## 2A. Data Cleaning

The Africa-wide dataset used in this assignment was derived from the UN DESA population and health indicators. Data cleaning focused on ensuring completeness, correctness, and internal consistency before moving to supervised modeling.

### Handling Missing Values

Only one variable contained missing data:

- `maternal_mortality_ratio_deaths_per_100000_population`, with roughly 25% missing values, all from the year 2024. A verification against the original long-format UN dataset confirmed that these entries were not reported at all for this year. This establishes a pattern of **systematic non-reporting**, not random missingness or cleaning errors.

Because these missing values reflect genuine absence of data, we:

- Did not impute maternal mortality values.
- Did not drop the corresponding rows.
- Restricted models involving maternal mortality to the years with complete data.

All other indicators (fertility, under-five mortality, life expectancy, population growth) had no missing values, so no imputation was necessary.
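
The missingness check described above can be sketched as follows. This is a minimal illustration on a toy frame, not the actual verification code; the rows and the `toy` name are assumptions made for the example, and only the column names come from the dataset.

```python
import pandas as pd

# Toy frame standing in for the cleaned dataset (values are illustrative only).
mmr = "maternal_mortality_ratio_deaths_per_100000_population"
toy = pd.DataFrame({
    "region_country_area": ["A", "A", "B", "B"],
    "year": [2020, 2024, 2020, 2024],
    mmr: [487.0, None, 850.0, None],
    "total_fertility_rate_children_per_women": [4.2, 4.0, 5.7, 5.1],
})

# Share of missing values per column, in percent.
missing_pct = toy.isna().mean() * 100

# Count missing maternal-mortality entries per year. If every gap falls in
# a single year, the missingness is systematic rather than random.
missing_by_year = toy[mmr].isna().groupby(toy["year"]).sum()

print(missing_pct)
print(missing_by_year)
```

If all gaps fall in a single year, as reported here for 2024, the per-year counts are zero everywhere else.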

### Addressing Outliers

The dataset contains real demographic extremes such as:

- Very high fertility rates (≥7)
- Elevated under-five mortality (>150 per 1,000)
- High maternal mortality (>1,000 per 100,000)

These values originate from countries with documented health challenges and are **true characteristics** of African demographic variation, not anomalies. Therefore, we adopted a retention-based strategy:

- Outliers were left unchanged.
- No capping, clipping, or transformations were applied.

This ensures fidelity to real-world population health patterns and avoids masking meaningful disparities across the continent.
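
One way to document the retained extremes without altering them is to flag and count the rows that exceed the thresholds listed above. The sketch below uses a toy frame with made-up values; the flag names are hypothetical.

```python
import pandas as pd

tfr = "total_fertility_rate_children_per_women"
u5 = "under_five_mortality_rate_for_both_sexes_per_1000_live_births"
mmr = "maternal_mortality_ratio_deaths_per_100000_population"

# Illustrative values only; not taken from the real dataset.
toy = pd.DataFrame({
    tfr: [7.2, 4.0, 2.1],
    u5: [160.0, 62.4, 15.0],
    mmr: [1100.0, 487.0, 37.0],
})

# Boolean flags for each threshold named in the text.
extreme = pd.DataFrame({
    "very_high_fertility": toy[tfr] >= 7,
    "elevated_under_five_mortality": toy[u5] > 150,
    "high_maternal_mortality": toy[mmr] > 1000,
})

# Number of flagged rows per indicator; `toy` itself is left unchanged.
extreme_counts = extreme.sum()
print(extreme_counts)
```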

### Ensuring Data Quality and Consistency

To guarantee dataset integrity:

- All numeric columns were cast to appropriate numeric types.
- Region and country names were standardized by trimming whitespace.
- Column names were normalized to lowercase with underscores for modeling compatibility.
- Dataset structure was validated to ensure **one unique record per country–year pair**.

After these checks, the dataset was confirmed clean, consistent, and ready for feature engineering.
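
The consistency checks listed above could look roughly like the sketch below. The internals of `DataProcessor` are not shown in this notebook, so the helper `normalize_columns` and the toy input are assumptions made for illustration.

```python
import pandas as pd

# Toy raw input with messy headers, stray whitespace, and numbers stored as text.
raw = pd.DataFrame({
    "Region Country Area": [" Africa", "Africa ", "Western Africa"],
    "Year": ["2010", "2015", "2010"],
    "Total Fertility Rate": ["4.9", "4.6", "5.7"],
})


def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lowercase, underscore-separated column names."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(r"[^a-z0-9]+", "_", regex=True)
        .str.strip("_")
    )
    return out


clean = normalize_columns(raw)

# Trim whitespace from region and country names.
clean["region_country_area"] = clean["region_country_area"].str.strip()

# Cast numeric columns; errors="raise" surfaces any non-numeric entry.
for col in ["year", "total_fertility_rate"]:
    clean[col] = pd.to_numeric(clean[col], errors="raise")

# Count repeated country-year pairs; zero means one record per pair.
duplicates = clean.duplicated(subset=["region_country_area", "year"]).sum()
print(clean.dtypes)
print("duplicate country-year rows:", duplicates)
```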


In [11]:

processor = DataProcessor(r"C:\Users\STUDENT\Documents\Africa-Population-Trend-and-Health-Analysis-UN-dataset-\df_africa_cleaned.xls")
df_clean = processor.process()

df_clean.head()


Unnamed: 0,code,region_country_area,year,life_expectancy_at_birth_for_both_sexes_years,life_expectancy_at_birth_for_females_years,life_expectancy_at_birth_for_males_years,maternal_mortality_ratio_deaths_per_100000_population,population_annual_rate_of_increase_percent,total_fertility_rate_children_per_women,under_five_mortality_rate_for_both_sexes_per_1000_live_births
0,2,Africa,2010,58.8,60.3,57.2,593.0,2.6,4.9,93.9
1,2,Africa,2015,60.8,62.6,59.1,526.0,2.6,4.6,81.0
2,2,Africa,2020,62.3,64.3,60.2,487.0,2.4,4.2,69.0
3,2,Africa,2024,64.0,66.1,62.0,,2.3,4.0,62.4
4,11,Western Africa,2010,54.3,55.1,53.5,850.0,2.8,5.7,123.2


## 2B. Feature Engineering

Feature engineering was performed to enhance the dataset’s predictive value and support the supervised learning tasks outlined in this assignment. The engineered features were selected based on domain knowledge, EDA insights, and their relevance to the hypotheses being tested.

### Fertility Category (High vs. Low)

To operationalize Hypothesis 3 (“Can under-five mortality classify countries into high vs. low fertility?”), we created a categorical feature:

- **High fertility**: `total_fertility_rate_children_per_women` ≥ 4  
- **Low fertility**: `total_fertility_rate_children_per_women` < 4

This threshold follows demographic transition literature, where countries with fertility ≥ 4 are typically considered to be in earlier stages of transition. This feature converts a continuous indicator into a label suitable for classification tasks.
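
A minimal sketch of this labeling rule, assuming the `high`/`low` labels seen in the output further down; the actual `FeatureEngineer` implementation is not shown here and may differ.

```python
import numpy as np
import pandas as pd

tfr = "total_fertility_rate_children_per_women"

# Three illustrative fertility values around the threshold.
toy = pd.DataFrame({tfr: [4.9, 4.0, 3.9]})

# A rate of exactly 4 counts as high, matching the ">= 4" rule above.
toy["fertility_category"] = np.where(toy[tfr] >= 4, "high", "low")
print(toy)
```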

### Trend Features (2010–2024)

To capture demographic progress over time rather than relying solely on static values, we introduced several trend-based features computed as:


$$\text{Change}_{2010\text{–}2024} = \text{Value}_{2024} - \text{Value}_{2010}$$


Trend features were created for:

- Life expectancy (both sexes) change (2010–2024)  
- Total fertility rate change (2010–2024)  
- Under-five mortality rate change (2010–2024)  
- Population growth rate change (2010–2024)  

These features represent the trajectory of each country’s demographic shift and provide valuable signals for regression models.
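
The change computation can be sketched as below. The Africa rows are copied from the `df_clean.head()` output above; the second entity, "Region X", is hypothetical, and the code is an illustration rather than the `FeatureEngineer` source.

```python
import pandas as pd

col = "life_expectancy_at_birth_for_both_sexes_years"
toy = pd.DataFrame({
    "region_country_area": ["Africa"] * 4 + ["Region X"] * 2,
    "year": [2010, 2015, 2020, 2024, 2010, 2024],
    col: [58.8, 60.8, 62.3, 64.0, 50.0, 55.0],
})

# One row per entity, one column per year.
wide = toy.pivot(index="region_country_area", columns="year", values=col)

# Value_2024 - Value_2010 for each entity.
change = wide[2024] - wide[2010]

# Broadcast the per-entity change back onto every row of that entity.
change_col = f"{col}_change_2010_2024"
toy[change_col] = toy["region_country_area"].map(change)
print(toy)
```

Because the change is a per-entity constant, every year-row of an entity carries the same value, as in the `df_features` output.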

### Country-Level Mean Indicator Features

To capture long-term demographic levels rather than single-year snapshots, we computed the mean of each key indicator for every country across all available years (2010–2024). The following averages were created:

- Mean life expectancy (both sexes)  
- Mean total fertility rate  
- Mean under-five mortality  
- Mean population growth rate  

These aggregated metrics help smooth short-term fluctuations and improve model stability.
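
A sketch of the per-entity mean, using the Africa fertility values from the output above plus a hypothetical "Region X"; the real implementation may differ.

```python
import pandas as pd

col = "total_fertility_rate_children_per_women"
toy = pd.DataFrame({
    "region_country_area": ["Africa"] * 4 + ["Region X"] * 2,
    "year": [2010, 2015, 2020, 2024, 2010, 2024],
    col: [4.9, 4.6, 4.2, 4.0, 6.0, 5.0],
})

# transform("mean") returns one value per original row, so the group mean
# is repeated on every year-row of the same entity.
mean_col = f"{col}_mean"
toy[mean_col] = toy.groupby("region_country_area")[col].transform("mean")
print(toy)
```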

### Scaling and Normalization

For algorithms sensitive to variable scales (e.g., linear models, logistic regression), we applied **standardization** to numeric features during model training. This ensures that all features contribute proportionately to the objective function and prevents scale-driven bias.

Scaling was **not applied** to the raw dataset itself, which preserves interpretability. Instead, it is applied only within the modeling pipeline, after the train–test split, to avoid data leakage.
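
The leakage-safe pattern can be illustrated with a scikit-learn pipeline. This is a sketch on synthetic data, not the `ModelTrainer` code; the split ratio, seed, and generated features are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic predictors on very different scales (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(loc=[60.0, 4.5], scale=[5.0, 1.0], size=(40, 2))
y = 3.0 - 0.02 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# The scaler sits inside the pipeline, so its mean and variance are
# estimated from the training rows only and then reused on the test rows.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

r2_test = model.score(X_test, y_test)
print("test R^2:", r2_test)
```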

### Summary

The engineered features improve the dataset’s predictive capacity by combining:

- A classification label (**fertility category**)  
- Temporal dynamics (**2010–2024 trends**)  
- Aggregated demographic summaries (**country-level means**)  
- Scale-adjusted versions of numeric predictors during model training  

These enhancements directly support the supervised machine learning tasks defined in the assignment and align with best practices in demographic modeling.


### Testing the Feature Engineering logic

In [12]:
from src.data_processor import DataProcessor
from src.feature_engineer import FeatureEngineer



# Engineer features
fe = FeatureEngineer(df_clean)
df_features = fe.process()

df_features.head()


Unnamed: 0,code,region_country_area,year,life_expectancy_at_birth_for_both_sexes_years,life_expectancy_at_birth_for_females_years,life_expectancy_at_birth_for_males_years,maternal_mortality_ratio_deaths_per_100000_population,population_annual_rate_of_increase_percent,total_fertility_rate_children_per_women,under_five_mortality_rate_for_both_sexes_per_1000_live_births,fertility_category,life_expectancy_at_birth_for_both_sexes_years_change_2010_2024,total_fertility_rate_children_per_women_change_2010_2024,under_five_mortality_rate_for_both_sexes_per_1000_live_births_change_2010_2024,population_annual_rate_of_increase_percent_change_2010_2024,life_expectancy_at_birth_for_both_sexes_years_mean,total_fertility_rate_children_per_women_mean,under_five_mortality_rate_for_both_sexes_per_1000_live_births_mean,population_annual_rate_of_increase_percent_mean
0,2,Africa,2010,58.8,60.3,57.2,593.0,2.6,4.9,93.9,high,5.2,-0.9,-31.5,-0.3,61.475,4.425,76.575,2.475
1,2,Africa,2015,60.8,62.6,59.1,526.0,2.6,4.6,81.0,high,5.2,-0.9,-31.5,-0.3,61.475,4.425,76.575,2.475
2,2,Africa,2020,62.3,64.3,60.2,487.0,2.4,4.2,69.0,high,5.2,-0.9,-31.5,-0.3,61.475,4.425,76.575,2.475
3,2,Africa,2024,64.0,66.1,62.0,,2.3,4.0,62.4,high,5.2,-0.9,-31.5,-0.3,61.475,4.425,76.575,2.475
4,11,Western Africa,2010,54.3,55.1,53.5,850.0,2.8,5.7,123.2,high,4.1,-1.3,-34.0,-0.6,56.3,5.0,105.275,2.475


In [13]:

# Check class imbalance for fertility_category (Part 2C)


print("Raw counts:")
print(df_features["fertility_category"].value_counts())

print("\nPercentages:")
print(df_features["fertility_category"].value_counts(normalize=True) * 100)


Raw counts:
fertility_category
high    152
low      92
Name: count, dtype: int64

Percentages:
fertility_category
high    62.295082
low     37.704918
Name: proportion, dtype: float64


## 2C. Handling Class Imbalance

After creating the binary fertility category (High vs. Low Fertility), we examined the distribution of the two classes:

- **High Fertility:** 152 observations (62.3%)  
- **Low Fertility:** 92 observations (37.7%)

Although the classes are not perfectly equal, this represents only a **moderate imbalance**, not a severe one. Both classes are sufficiently represented, so applying oversampling, undersampling, or synthetic data generation (e.g., SMOTE) was **not required**.

To preserve the natural demographic structure of the dataset during model training, we used a **stratified train–test split**, ensuring that both fertility categories were proportionally represented in the training and testing sets.

This approach maintains fairness in the classification task without altering the underlying demographic patterns.
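
A stratified split on labels with the same class counts (152 high, 92 low) can be sketched as follows. The 20% test size and the seed are assumptions for the example, and the placeholder feature column is hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Labels reproducing the observed class counts.
labels = pd.Series(["high"] * 152 + ["low"] * 92, name="fertility_category")

# Placeholder feature matrix; only the row alignment matters here.
features = pd.DataFrame({"row_id": range(len(labels))})

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42
)

# Class shares in each split should stay close to 62.3% / 37.7%.
train_share = y_train.value_counts(normalize=True)
test_share = y_test.value_counts(normalize=True)
print(train_share)
print(test_share)
```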


In [14]:
from src.model_trainer import ModelTrainer


# H1: Fertility predicts maternal mortality (REGRESSION)

In [15]:
# H1: Fertility predicts maternal mortality (REGRESSION)

trainer_h1 = ModelTrainer(
    df_features,
    target="maternal_mortality_ratio_deaths_per_100000_population",
    problem_type="regression"
)

trainer_h1.train_test_split()
trainer_h1.scale_numeric()

best_h1, params_h1 = trainer_h1.train_models()

results_h1 = trainer_h1.map_predictions()
results_h1.head()



Training: LinearRegression
LinearRegression Score = 0.7563105707680513

Training: RandomForestRegressor
Best Params: {'max_depth': 10, 'n_estimators': 400}
Best Score: 0.8219625248974196

Final Best Model:
RandomForestRegressor(max_depth=10, n_estimators=400)
Parameters: {'max_depth': 10, 'n_estimators': 400}


Unnamed: 0,region_country_area,year,code,actual,predicted
76,Guinea,2020,324,553.0,592.054226
138,Somalia,2020,706,621.0,555.316333
152,Togo,2015,768,441.0,475.367541
60,Eritrea,2015,232,399.0,350.668665
156,Tunisia,2020,788,37.0,43.3475


# H2: Life expectancy predicts population growth (REGRESSION)

In [16]:


trainer_h2 = ModelTrainer(
    df_features,
    target="population_annual_rate_of_increase_percent",
    problem_type="regression"
)

trainer_h2.train_test_split()
trainer_h2.scale_numeric()

best_h2, params_h2 = trainer_h2.train_models()

results_h2 = trainer_h2.map_predictions()
results_h2.head()



Training: LinearRegression
Model LinearRegression FAILED. Reason: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Training: RandomForestRegressor
Best Params: {'max_depth': 20, 'n_estimators': 200}
Best Score: 0.6947879545823505

Final Best Model:
RandomForestRegressor(max_depth=20, n_estimators=200)
Parameters: {'max_depth': 20, 'n_estimators': 200}


Unnamed: 0,region_country_area,year,code,actual,predicted
24,Southern Africa,2010,18,1.2,1.6215
6,Western Africa,2020,11,2.3,2.4395
153,Mozambique,2015,508,2.9,2.98
211,Sudan,2024,729,1.6,2.251
198,South Africa,2020,710,1.6,1.4065
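
The `LinearRegression` failure above is consistent with a feature matrix that still contains the 2024 maternal mortality gaps, although the internals of `ModelTrainer` are not shown, so this is an assumption. One option that avoids imputation is to keep only fully observed feature columns for estimators that cannot handle NaN. The sketch below uses the four Africa rows from the output above and is not part of the original pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

le = "life_expectancy_at_birth_for_both_sexes_years"
mmr = "maternal_mortality_ratio_deaths_per_100000_population"
target = "population_annual_rate_of_increase_percent"

toy = pd.DataFrame({
    le: [58.8, 60.8, 62.3, 64.0],
    mmr: [593.0, 526.0, 487.0, np.nan],
    target: [2.6, 2.6, 2.4, 2.3],
})

X = toy.drop(columns=[target])
y = toy[target]

# Keep only the feature columns with no missing entries.
complete_cols = X.columns[X.notna().all()]
X_complete = X[complete_cols]

# With the NaN column excluded, the linear model fits without error.
model = LinearRegression().fit(X_complete, y)
print("features used:", list(complete_cols))
```

Restricting rows to years with complete maternal mortality data is an alternative, at the cost of discarding the 2024 observations.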


## Predict Maternal Mortality (Regression)

In [17]:
from src.model_trainer import ModelTrainer

# ----------------------------
# MODEL 1: Maternal Mortality Regression
# (same target and setup as the H1 cell above)
# ----------------------------
trainer_mm = ModelTrainer(
    df_features,
    target="maternal_mortality_ratio_deaths_per_100000_population",
    problem_type="regression"
)

trainer_mm.train_test_split()
trainer_mm.scale_numeric()

best_model_mm, params_mm = trainer_mm.train_models()

results_mm = trainer_mm.map_predictions()
results_mm.head(10)



Training: LinearRegression
LinearRegression Score = 0.7563105707680513

Training: RandomForestRegressor
Best Params: {'max_depth': 20, 'n_estimators': 400}
Best Score: 0.813733179486259

Final Best Model:
RandomForestRegressor(max_depth=20, n_estimators=400)
Parameters: {'max_depth': 20, 'n_estimators': 400}


Unnamed: 0,region_country_area,year,code,actual,predicted
76,Guinea,2020,324,553.0,596.0575
138,Somalia,2020,706,621.0,555.0375
152,Togo,2015,768,441.0,468.065
60,Eritrea,2015,232,399.0,355.5875
156,Tunisia,2020,788,37.0,42.37
165,United Rep. of Tanzania,2020,834,238.0,296.31
85,Liberia,2010,430,634.0,607.0425
142,Zimbabwe,2010,716,618.0,636.47
114,Namibia,2020,516,215.0,315.6175
31,Cameroon,2015,120,447.0,518.065


# Under-Five Mortality → Life Expectancy (Regression)

In [19]:
# Revised Hypothesis 3 — Regression
trainer_h3 = ModelTrainer(
    df_features,
    target="life_expectancy_at_birth_for_both_sexes_years",
    problem_type="regression"
)

trainer_h3.train_test_split()
trainer_h3.scale_numeric()

best_h3, params_h3 = trainer_h3.train_models()

results_h3 = trainer_h3.map_predictions()
results_h3.head()



Training: LinearRegression
Model LinearRegression FAILED. Reason: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Training: RandomForestRegressor
Best Params: {'max_depth': 20, 'n_estimators': 200}
Best Score: 0.9857259822879308

Final Best Model:
RandomForestRegressor(max_depth=20, n_estimators=200)
Parameters: {'max_depth': 20, 'n_estimators': 200}


Unnamed: 0,region_country_area,year,code,actual,predicted
24,Southern Africa,2010,18,58.1,57.7585
6,Western Africa,2020,11,56.9,56.676
153,Mozambique,2015,508,58.7,57.9835
211,Sudan,2024,729,66.5,66.566833
198,South Africa,2020,710,65.2,64.9995
