In [1]:
import sys, os

# Add parent directory to Python path to import custom modules
sys.path.append(os.path.abspath(".."))

# Import our custom data processing class
from src.data_processor import DataProcessor

## 2A. Data Cleaning

The Africa-wide dataset used in this assignment was derived from the UN DESA population and health indicators. Data cleaning focused on ensuring completeness, correctness, and internal consistency before moving to supervised modeling.

### Handling Missing Values

Only one variable contained missing data:

- `maternal_mortality_ratio_deaths_per_100000_population`, with roughly 25% missing values, all from the year 2024. A verification against the original long-format UN dataset confirmed that these entries were not reported at all for this year. This establishes a pattern of **systematic non-reporting**, not random missingness or cleaning errors.

Because these missing values reflect genuine absence of data, we:

- Did not impute maternal mortality values.
- Did not drop the corresponding rows.
- Restricted models involving maternal mortality to the years with complete data.

All other indicators (fertility, under-five mortality, life expectancy, population growth) had no missing values, so no imputation was necessary.
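
The missingness check described above can be sketched as follows. This is a minimal illustration, not the `DataProcessor` implementation: the rows are copied from the `df_clean.head()` output in this notebook, and the variable names (`missing_by_year`, `df_mmr_complete`) are assumptions introduced for the example.

```python
import numpy as np
import pandas as pd

MMR = "maternal_mortality_ratio_deaths_per_100000_population"

# Rows copied from the df_clean.head() output shown in this notebook
df = pd.DataFrame({
    "region_country_area": ["Africa", "Africa", "Africa", "Africa", "Western Africa"],
    "year": [2010, 2015, 2020, 2024, 2010],
    MMR: [593.0, 526.0, 487.0, np.nan, 850.0],
    "total_fertility_rate_children_per_women": [4.9, 4.6, 4.2, 4.0, 5.7],
})

# Share of missing values per column
missing_share = df.isna().mean()

# Share of missing maternal mortality values within each year
missing_by_year = df[MMR].isna().groupby(df["year"]).mean()

# Years in which every value is missing point to systematic non-reporting
fully_missing_years = missing_by_year[missing_by_year == 1.0].index.tolist()

# Keep all rows in the dataset, but build a complete-case view
# for models that involve maternal mortality
df_mmr_complete = df[df[MMR].notna()]
```

Keeping `df` intact and deriving `df_mmr_complete` separately mirrors the decision to neither impute nor drop rows globally.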

### Addressing Outliers

The dataset contains real demographic extremes such as:

- Very high fertility rates (≥7)
- Elevated under-five mortality (>150 per 1,000)
- High maternal mortality (>1,000 per 100,000)

These values originate from countries with documented health challenges and are **true characteristics** of African demographic variation, not anomalies. Therefore, we adopted a retention-based strategy:

- Outliers were left unchanged.
- No capping, clipping, or transformations were applied.

This ensures fidelity to real-world population health patterns and avoids masking meaningful disparities across the continent.
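
A retention-based strategy can still make the extremes visible for review. The sketch below flags rows that exceed the thresholds listed above without changing any values; the three rows are hypothetical and serve only to exercise the filter.

```python
import pandas as pd

TFR = "total_fertility_rate_children_per_women"
U5 = "under_five_mortality_rate_for_both_sexes_per_1000_live_births"
MMR = "maternal_mortality_ratio_deaths_per_100000_population"

# Hypothetical rows for illustration only (not taken from the dataset)
df = pd.DataFrame({
    TFR: [4.9, 7.1, 2.1],
    U5: [93.9, 160.0, 15.0],
    MMR: [593.0, 1100.0, 37.0],
})

# Thresholds from the bullet list above
extreme = (df[TFR] >= 7) | (df[U5] > 150) | (df[MMR] > 1000)

# Flagged rows are reported for inspection; df itself is left untouched
flagged = df[extreme]
```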

### Ensuring Data Quality and Consistency

To guarantee dataset integrity:

- All numeric columns were cast to appropriate numeric types.
- Region and country names were standardized by trimming whitespace.
- Column names were normalized to lowercase with underscores for modeling compatibility.
- Dataset structure was validated to ensure **one unique record per country–year pair**.

After these checks, the dataset was confirmed clean, consistent, and ready for feature engineering.
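
The four checks above can be sketched with pandas as follows. This is an assumption about how such a step might look, not the actual `DataProcessor` code; the raw column labels and padded strings are hypothetical.

```python
import pandas as pd

# Hypothetical raw frame with untidy column names and padded strings
raw = pd.DataFrame({
    "Region/Country/Area": [" Africa", "Africa ", "Western Africa"],
    "Year": ["2010", "2015", "2010"],
    "Total fertility rate (children per women)": ["4.9", "4.6", "5.7"],
})

df = raw.copy()

# Normalize column names: lowercase, runs of other characters become underscores
df.columns = (
    df.columns.str.lower()
    .str.replace(r"[^a-z0-9]+", "_", regex=True)
    .str.strip("_")
)

# Trim whitespace from the name column
df["region_country_area"] = df["region_country_area"].str.strip()

# Cast numeric columns; errors="raise" surfaces unexpected non-numeric entries
for col in ["year", "total_fertility_rate_children_per_women"]:
    df[col] = pd.to_numeric(df[col], errors="raise")

# Validate one unique record per area-year pair
n_duplicates = int(df.duplicated(subset=["region_country_area", "year"]).sum())
assert n_duplicates == 0, "duplicate area-year pairs found"
```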


In [2]:
# Initialize the DataProcessor with the path to the dataset file
processor = DataProcessor(r"../df_africa_cleaned.xls")

# Run the data through the cleaning pipeline (once)
df_clean = processor.process()

# Display first 5 rows to verify successful data loading and cleaning
df_clean.head()



Unnamed: 0,code,region_country_area,year,life_expectancy_at_birth_for_both_sexes_years,life_expectancy_at_birth_for_females_years,life_expectancy_at_birth_for_males_years,maternal_mortality_ratio_deaths_per_100000_population,population_annual_rate_of_increase_percent,total_fertility_rate_children_per_women,under_five_mortality_rate_for_both_sexes_per_1000_live_births
0,2,Africa,2010,58.8,60.3,57.2,593.0,2.6,4.9,93.9
1,2,Africa,2015,60.8,62.6,59.1,526.0,2.6,4.6,81.0
2,2,Africa,2020,62.3,64.3,60.2,487.0,2.4,4.2,69.0
3,2,Africa,2024,64.0,66.1,62.0,,2.3,4.0,62.4
4,11,Western Africa,2010,54.3,55.1,53.5,850.0,2.8,5.7,123.2


#### Output explanation:
DATA COLLECTION: This dataset originates from the UN DESA population and health indicators.

COLUMN MEANINGS:
- region_country_area: Name of the country or aggregate area (54 countries, plus aggregates such as Africa and the UN sub-regions: Eastern, Western, Northern, etc.)
- year: Observation year within 2010-2024 (the rows shown cover 2010, 2015, 2020, and 2024, not every calendar year)
- total_fertility_rate_children_per_women: Average children per woman during reproductive years
- life_expectancy_at_birth_for_both_sexes_years: Expected lifespan at birth (both sexes combined)
- under_five_mortality_rate_for_both_sexes_per_1000_live_births: Child deaths before age 5 per 1,000 live births
- maternal_mortality_ratio_deaths_per_100000_population: Maternal deaths per 100,000 live births
- population_annual_rate_of_increase_percent: Annual population growth rate (%)

PREPROCESSING DECISIONS:
- Column names standardized to lowercase with underscores for Python compatibility
- Numeric columns cast to float64 for mathematical operations
- Country/region names trimmed of whitespace for consistency
- Missing values in maternal mortality preserved (systematic non-reporting for 2024)
- No outlier removal to preserve genuine demographic variation across Africa

STATISTICAL MEANING:
- Each row represents one country-year (or area-year) observation (panel data structure)
- Values reflect official government statistics and UN demographic estimates
- Fertility rates >4 indicate countries in early demographic transition
- Life expectancy ranges from ~50-80 years across African nations
- Under-5 mortality varies dramatically (from about 5 to more than 150 per 1,000), reflecting health system quality

## 2B. Feature Engineering

Feature engineering was applied to enrich the dataset with additional information that improves predictive performance and provides deeper analytical insight into demographic transitions across African countries. All engineered features were derived from domain knowledge, prior EDA findings, and the goals of the hypotheses being tested.

### Trend Features (2010–2024)

To capture the *direction and magnitude* of demographic change over time, we computed trend-based features representing the difference between each country’s earliest and latest available values:

\[
\text{Change}_{2010\text{--}2024} = \text{Value}_{2024} - \text{Value}_{2010}
\]

Trend features were generated for four core indicators:

- Life expectancy (both sexes)
- Total fertility rate
- Under-five mortality rate
- Population growth rate

These engineered features quantify how far each country has progressed over the study period and allow the models to learn *trajectories*, not just static snapshots. They are especially relevant given Africa’s ongoing demographic transition.
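
The change formula above can be sketched in pandas as shown below. The helper `add_change_feature` is hypothetical (the internals of `FeatureEngineer` are not shown in this notebook); the rows are copied from the `df_clean.head()` output.

```python
import pandas as pd

LIFE = "life_expectancy_at_birth_for_both_sexes_years"

# Rows copied from the df_clean.head() output (Africa aggregate plus one Western Africa row)
df = pd.DataFrame({
    "region_country_area": ["Africa"] * 4 + ["Western Africa"],
    "year": [2010, 2015, 2020, 2024, 2010],
    LIFE: [58.8, 60.8, 62.3, 64.0, 54.3],
})


def add_change_feature(frame, column, start=2010, end=2024, key="region_country_area"):
    """Add `<column>_change_<start>_<end>`, repeated on every row of each area."""
    start_vals = frame.loc[frame["year"] == start].set_index(key)[column]
    end_vals = frame.loc[frame["year"] == end].set_index(key)[column]
    change = end_vals - start_vals  # NaN when either endpoint year is absent
    out = frame.copy()
    out[f"{column}_change_{start}_{end}"] = out[key].map(change)
    return out


df_trend = add_change_feature(df, LIFE)
```

For the Africa aggregate this gives 64.0 - 58.8 = 5.2 years, which matches the value in the feature-engineering output further down. The Western Africa row gets NaN here only because its 2024 row is not part of this small sample.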

### Country-Level Mean Indicator Features

Short-term fluctuations in demographic indicators can obscure broader patterns. To incorporate structural characteristics of each country, we generated mean-level aggregates across all available years (2010–2024):

- Mean life expectancy (both sexes)
- Mean total fertility rate
- Mean under-five mortality rate
- Mean population growth rate

These averaged features provide stable representations of each country’s demographic profile and improve model robustness by reducing year-to-year noise.
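
A minimal sketch of the mean-level aggregation, assuming a `groupby(...).transform("mean")` approach (the actual `FeatureEngineer` code is not shown here). The four rows are the Africa aggregate from the `df_clean.head()` output.

```python
import pandas as pd

LIFE = "life_expectancy_at_birth_for_both_sexes_years"

# Africa aggregate rows copied from the df_clean.head() output
df = pd.DataFrame({
    "region_country_area": ["Africa"] * 4,
    "year": [2010, 2015, 2020, 2024],
    LIFE: [58.8, 60.8, 62.3, 64.0],
})

# transform("mean") broadcasts the per-area mean back onto every row of that area
df[f"{LIFE}_mean"] = df.groupby("region_country_area")[LIFE].transform("mean")

# Each area carries exactly one distinct mean value
distinct_means_per_area = df.groupby("region_country_area")[f"{LIFE}_mean"].nunique()
```

The result, (58.8 + 60.8 + 62.3 + 64.0) / 4 = 61.475, agrees with the `_mean` column in the feature-engineering output.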

### Scaling (Performed in the Modeling Pipeline)

Certain machine learning algorithms—particularly linear models—are sensitive to differences in feature scale. To address this, numeric features were standardized *after* the train-test split to prevent data leakage:

- Standard scaling (`z = (x - μ) / σ`) was applied only inside the modeling pipeline.
- The raw dataset remains unscaled to maintain interpretability.
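
One common way to keep scaling inside the modeling pipeline is scikit-learn's `Pipeline`, sketched below. This is an illustration under stated assumptions: the data are synthetic, the 80/20 split and `random_state` are arbitrary choices, and `ModelTrainer.scale_numeric()` may be implemented differently.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the assignment's dataset)
rng = np.random.default_rng(0)
X = rng.normal(loc=[60.0, 4.5], scale=[5.0, 1.0], size=(100, 2))
y = 3.0 - 0.02 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Split first, so the scaler never sees the test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() learns the scaler's mean and standard deviation from the training rows only
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# The fitted scaler holds training-set statistics; X itself stays unscaled
scaler = model.named_steps["standardscaler"]
r2_test = model.score(X_test, y_test)
```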

### Summary

In total, **eight engineered features** were created:

- **4 trend features** capturing demographic progress over time  
- **4 mean features** summarizing long-term national characteristics  

These additions strengthen model performance by incorporating both temporal dynamics and structural demographic patterns. The resulting feature set provides a richer foundation for regression-based predictive modeling and directly supports the assignment’s analytical objectives.


### Testing the Feature Engineering logic

In [3]:
from src.data_processor import DataProcessor
from src.feature_engineer import FeatureEngineer



# Initialize the FeatureEngineer with our cleaned dataset
fe = FeatureEngineer(df_clean)
# Process the cleaned data through feature engineering pipeline
df_features = fe.process()

# Display first 5 rows to verify feature engineering success
df_features.head()


Unnamed: 0,code,region_country_area,year,life_expectancy_at_birth_for_both_sexes_years,life_expectancy_at_birth_for_females_years,life_expectancy_at_birth_for_males_years,maternal_mortality_ratio_deaths_per_100000_population,population_annual_rate_of_increase_percent,total_fertility_rate_children_per_women,under_five_mortality_rate_for_both_sexes_per_1000_live_births,life_expectancy_at_birth_for_both_sexes_years_change_2010_2024,total_fertility_rate_children_per_women_change_2010_2024,under_five_mortality_rate_for_both_sexes_per_1000_live_births_change_2010_2024,population_annual_rate_of_increase_percent_change_2010_2024,life_expectancy_at_birth_for_both_sexes_years_mean,total_fertility_rate_children_per_women_mean,under_five_mortality_rate_for_both_sexes_per_1000_live_births_mean,population_annual_rate_of_increase_percent_mean
0,2,Africa,2010,58.8,60.3,57.2,593.0,2.6,4.9,93.9,5.2,-0.9,-31.5,-0.3,61.475,4.425,76.575,2.475
1,2,Africa,2015,60.8,62.6,59.1,526.0,2.6,4.6,81.0,5.2,-0.9,-31.5,-0.3,61.475,4.425,76.575,2.475
2,2,Africa,2020,62.3,64.3,60.2,487.0,2.4,4.2,69.0,5.2,-0.9,-31.5,-0.3,61.475,4.425,76.575,2.475
3,2,Africa,2024,64.0,66.1,62.0,,2.3,4.0,62.4,5.2,-0.9,-31.5,-0.3,61.475,4.425,76.575,2.475
4,11,Western Africa,2010,54.3,55.1,53.5,850.0,2.8,5.7,123.2,4.1,-1.3,-34.0,-0.6,56.3,5.0,105.275,2.475


#### Output explanation

TEMPORAL TREND FEATURES (Change from 2010 to 2024):
- life_expectancy_at_birth_for_both_sexes_years_change_2010_2024: Health system improvement over 14 years
- total_fertility_rate_children_per_women_change_2010_2024: Reproductive behavior shifts
- under_five_mortality_rate_for_both_sexes_per_1000_live_births_change_2010_2024: Child health progress indicator
- population_annual_rate_of_increase_percent_change_2010_2024: Demographic momentum changes

STATISTICAL MEANING OF TREND FEATURES:
- Positive values = the indicator increased over the period (an improvement in the case of life expectancy)
- Negative fertility/mortality changes = demographic progress
- Negative population growth changes = demographic transition toward stability

AGGREGATED FEATURES (Country-level means across 2010-2024):
- life_expectancy_at_birth_for_both_sexes_years_mean: Long-term health outcome baseline
- total_fertility_rate_children_per_women_mean: Average reproductive pattern per country
- under_five_mortality_rate_for_both_sexes_per_1000_live_births_mean: Persistent child health challenges
- population_annual_rate_of_increase_percent_mean: Characteristic growth pattern

FEATURE ENGINEERING RATIONALE:
- Trends capture demographic transition dynamics over time
- Means smooth fluctuations between observation years for stable model inputs

### Understanding the Engineered Feature Output

Some engineered features appear the same across all years for a given country or region. This is expected and correct.

**Why?**

1. **Trend Features (2010–2024 Change)**  
   These measure how much an indicator changed from 2010 to 2024.  
   Since the change is calculated *once per country*, the value stays the same in all rows for that country.

2. **Mean Features (2010–2024 Average)**  
   These represent the long-term average level of each indicator.  
   A mean is also *constant per country*, so it repeats for all years.

**What varies?**  
The original demographic indicators (life expectancy, fertility, mortality, growth) still vary year-by-year, which provides the temporal dynamics needed for modeling.

**Why is this useful?**  
The model now has both:  
- year-specific values (dynamic changes)  
- country-level characteristics (structural patterns)

This combination strengthens predictive performance and aligns with standard demographic modeling practice.


CLASS DISTRIBUTION ANALYSIS:

STATISTICAL INTERPRETATION:
- High Fertility: 152 observations (62.3% of total dataset)
- Low Fertility: 92 observations (37.7% of total dataset)
- Imbalance Ratio: 1.65:1 (moderate, not severe)

DATA COLLECTION CONTEXT:
- Reflects real demographic patterns across Africa
- Many countries still in early demographic transition (high fertility)
- Some countries (South Africa, Tunisia, etc.) have completed transition

MODELING IMPLICATIONS:
- Moderate imbalance doesn't require synthetic data generation (SMOTE)
- Stratified sampling will ensure proportional representation in train/test
- Accuracy metrics should be complemented with precision/recall for minority class
- Class weights could be considered if model shows bias toward majority class

DEMOGRAPHIC SIGNIFICANCE:
- High fertility countries: typically have young populations, rural economies
- Low fertility countries: often more urbanized, higher education levels
- This distribution captures continental diversity in demographic development

## 2C. Handling Class Imbalance

After creating the binary fertility category (High vs. Low Fertility), we examined the distribution of the two classes:

- **High Fertility:** 152 observations (62.3%)  
- **Low Fertility:** 92 observations (37.7%)

Although the classes are not perfectly equal, this represents only a **moderate imbalance**, not a severe one. Both classes are sufficiently represented, so applying oversampling, undersampling, or synthetic data generation (e.g., SMOTE) was **not required**.

To preserve the natural demographic structure of the dataset during model training, we used a **stratified train–test split**, ensuring that both fertility categories were proportionally represented in the training and testing sets.

This approach maintains fairness in the classification task without altering the underlying demographic patterns.
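
The stratified split can be sketched with scikit-learn's `train_test_split`. The class counts are the ones reported above; the label encoding (1 = High, 0 = Low), the placeholder feature matrix, the 80/20 split, and `random_state` are assumptions made for this example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class counts reported above: 152 High Fertility, 92 Low Fertility
y = np.array([1] * 152 + [0] * 92)    # 1 = High, 0 = Low (assumed encoding)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

imbalance_ratio = 152 / 92  # about 1.65

# stratify=y keeps the class proportions nearly identical in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

share_high_train = y_train.mean()
share_high_test = y_test.mean()
```

Both shares stay close to the overall 62.3%, which is the property the stratified split is meant to guarantee.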


# Model Testing

## H1: Fertility predicts maternal mortality (REGRESSION)

In [None]:
# H1: Fertility predicts maternal mortality (REGRESSION)
from src.model_trainer import ModelTrainer

# Initialize trainer for maternal mortality regression
trainer_h1 = ModelTrainer(
    df_features,
    target="maternal_mortality_ratio_deaths_per_100000_population",
    problem_type="regression"
)

# Split data into training and testing sets, then scale numeric features
trainer_h1.train_test_split()
trainer_h1.scale_numeric()

# Train multiple regression models and select the best performer
best_h1, params_h1 = trainer_h1.train_models()

# Generate predictions and compile results with actual vs predicted values
results_h1 = trainer_h1.map_predictions()
results_h1.head()



Training: LinearRegression
LinearRegression Score = 0.7563105707680513

Training: RandomForestRegressor
Best Params: {'max_depth': 10, 'n_estimators': 400}
Best Score: 0.8219625248974196

Final Best Model:
RandomForestRegressor(max_depth=10, n_estimators=400)
Parameters: {'max_depth': 10, 'n_estimators': 400}


Unnamed: 0,region_country_area,year,code,actual,predicted
76,Guinea,2020,324,553.0,592.054226
138,Somalia,2020,706,621.0,555.316333
152,Togo,2015,768,441.0,475.367541
60,Eritrea,2015,232,399.0,350.668665
156,Tunisia,2020,788,37.0,43.3475


MODEL PREDICTION RESULTS:

COLUMN INTERPRETATIONS:
- region_country_area / year / code: Identification variables for each prediction
- actual: Reported maternal mortality ratio from the dataset (deaths per 100,000 live births)
- predicted: Model-estimated maternal mortality ratio (deaths per 100,000 live births)
- Prediction error (residual): actual - predicted (positive = underestimation); not a column in the table, but derivable from the two columns above

STATISTICAL MEANING:
- Small residuals (within ±50) indicate accurate predictions
- Large residuals (beyond ±200) suggest model limitations or data outliers
- RMSE/MAE metrics quantify overall prediction accuracy

HYPOTHESIS TESTING:
- Strong predictions support fertility-maternal mortality relationship
- Weak predictions suggest other factors dominate maternal outcomes
- R² score indicates the share of maternal mortality variance explained

PUBLIC HEALTH INTERPRETATION:
- Accurate predictions could guide resource allocation for maternal care
- Systematic under/over-predictions reveal model biases
- Countries with large errors may have unique health system characteristics
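
The residual, MAE, and RMSE mentioned above can be computed directly from the `actual` and `predicted` columns. The sketch below uses only the five rows of `results_h1` displayed above, so the resulting figures describe this small sample and not the full test set.

```python
import numpy as np
import pandas as pd

# First five rows of results_h1 as displayed above
results = pd.DataFrame({
    "region_country_area": ["Guinea", "Somalia", "Togo", "Eritrea", "Tunisia"],
    "year": [2020, 2020, 2015, 2015, 2020],
    "actual": [553.0, 621.0, 441.0, 399.0, 37.0],
    "predicted": [592.054226, 555.316333, 475.367541, 350.668665, 43.3475],
})

# Residual = actual - predicted (positive means the model underestimated)
results["residual"] = results["actual"] - results["predicted"]

# Mean absolute error and root mean squared error over these rows
mae = results["residual"].abs().mean()
rmse = float(np.sqrt((results["residual"] ** 2).mean()))
```

In this sample Somalia is underestimated (positive residual) while Guinea is overestimated (negative residual), and every residual falls within ±70.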

## H2: Life expectancy predicts population growth (REGRESSION)

In [None]:

# Initialize trainer for population growth regression
trainer_h2 = ModelTrainer(
    df_features,
    target="population_annual_rate_of_increase_percent",
    problem_type="regression"
)

# Prepare data with train/test split and feature scaling
trainer_h2.train_test_split()
trainer_h2.scale_numeric()

# Train regression models to find best predictor of population growth
best_h2, params_h2 = trainer_h2.train_models()

# Generate predictions and evaluate model performance
results_h2 = trainer_h2.map_predictions()
results_h2.head()



Training: LinearRegression
Model LinearRegression FAILED. Reason: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Training: RandomForestRegressor
Best Params: {'max_depth': 20, 'n_estimators': 200}
Best Score: 0.6947879545823505

Final Best Model:
RandomForestRegressor(max_depth=20, n_estimators=200)
Parameters: {'max_depth': 20, 'n_estimators': 200}


Unnamed: 0,region_country_area,year,code,actual,predicted
24,Southern Africa,2010,18,1.2,1.6215
6,Western Africa,2020,11,2.3,2.4395
153,Mozambique,2015,508,2.9,2.98
211,Sudan,2024,729,1.6,2.251
198,South Africa,2020,710,1.6,1.4065


DEMOGRAPHIC TRANSITION PREDICTION RESULTS:

COLUMN MEANINGS:
- actual: Reported annual population growth rate from the dataset (%)
- predicted: Model-estimated growth rate (%)
- Prediction error: Residual (actual - predicted) showing prediction accuracy; derivable from the two columns above

DEMOGRAPHIC THEORY TESTING:
- Negative correlation expected: higher life expectancy → lower population growth
- Reflects demographic transition stages across African countries
- Accurate predictions are consistent with demographic transition theory

STATISTICAL INTERPRETATION:
- Growth rates typically range 1-4% annually across Africa
- Small residuals (<0.5 percentage points) indicate strong predictive relationship
- Large residuals may reflect migration, conflict, or economic factors

POLICY IMPLICATIONS:
- Reliable predictions help forecast future population size
- Understanding life expectancy-growth relationship guides development planning
- Countries deviating from predictions may need special demographic policies
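
The training log for this hypothesis shows `LinearRegression` failing with "Input X contains NaN", so only the random forest was evaluated. A likely cause (an assumption, since the `ModelTrainer` internals are not shown) is that the maternal mortality column, with its missing 2024 values, is part of the feature matrix. The sketch below shows two ways to obtain a NaN-free feature matrix without imputing, in line with the decisions in Section 2A. Rows are copied from the `df_clean.head()` output.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

MMR = "maternal_mortality_ratio_deaths_per_100000_population"
TARGET = "population_annual_rate_of_increase_percent"

# Rows copied from df_clean.head(); the 2024 maternal mortality value is missing
df = pd.DataFrame({
    "life_expectancy_at_birth_for_both_sexes_years": [58.8, 60.8, 62.3, 64.0, 54.3],
    "total_fertility_rate_children_per_women": [4.9, 4.6, 4.2, 4.0, 5.7],
    MMR: [593.0, 526.0, 487.0, np.nan, 850.0],
    TARGET: [2.6, 2.6, 2.4, 2.3, 2.8],
})

X = df.drop(columns=[TARGET])
y = df[TARGET]

# Option A: drop feature columns that contain NaN (keeps every row)
X_complete_cols = X.dropna(axis="columns")

# Option B: keep all columns but restrict to rows with complete data
complete_rows = X.notna().all(axis=1)
X_complete_rows, y_complete_rows = X[complete_rows], y[complete_rows]

# With a NaN-free matrix, LinearRegression fits without raising
model = LinearRegression().fit(X_complete_cols, y)
```

Option A suits hypotheses where maternal mortality is not the quantity of interest; Option B matches the stated rule of restricting maternal mortality models to years with complete data.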

## Predict Maternal Mortality (Regression)

In [None]:
from src.model_trainer import ModelTrainer

#Comprehensive Maternal Mortality Prediction

# Re-initialize maternal mortality trainer with full feature set
trainer_mm = ModelTrainer(
    df_features,
    target="maternal_mortality_ratio_deaths_per_100000_population",
    problem_type="regression"
)

# Prepare data with train/test split and feature scaling
trainer_mm.train_test_split()
trainer_mm.scale_numeric()

# Train regression models to find best predictor of maternal mortality
best_model_mm, params_mm = trainer_mm.train_models()

# Generate predictions and compile results with actual vs predicted values
results_mm = trainer_mm.map_predictions()
results_mm.head(10)



Training: LinearRegression
LinearRegression Score = 0.7563105707680513

Training: RandomForestRegressor
Best Params: {'max_depth': 20, 'n_estimators': 400}
Best Score: 0.813733179486259

Final Best Model:
RandomForestRegressor(max_depth=20, n_estimators=400)
Parameters: {'max_depth': 20, 'n_estimators': 400}


Unnamed: 0,region_country_area,year,code,actual,predicted
76,Guinea,2020,324,553.0,596.0575
138,Somalia,2020,706,621.0,555.0375
152,Togo,2015,768,441.0,468.065
60,Eritrea,2015,232,399.0,355.5875
156,Tunisia,2020,788,37.0,42.37
165,United Rep. of Tanzania,2020,834,238.0,296.31
85,Liberia,2010,430,634.0,607.0425
142,Zimbabwe,2010,716,618.0,636.47
114,Namibia,2020,516,215.0,315.6175
31,Cameroon,2015,120,447.0,518.065


COMPREHENSIVE MATERNAL MORTALITY PREDICTIONS:

MODEL ENHANCEMENT FEATURES USED:
- Original demographic indicators (fertility, life expectancy, mortality)
- Temporal trends (2010-2024 changes showing demographic progress)
- Country-level means (long-term patterns smoothing annual variation)
- Regional classifications (capturing geographic health system similarities)

PREDICTION QUALITY ASSESSMENT:
- Lower residuals than single-feature models indicate improvement
- Consistent under/over-prediction patterns reveal systematic biases
- Cross-validation scores show model generalization capability

FEATURE IMPORTANCE INSIGHTS:
- Top predictors likely include under-5 mortality, fertility trends
- Regional effects capture shared health system characteristics
- Temporal features show whether recent improvements predict current outcomes

 CLINICAL/POLICY SIGNIFICANCE:
- Accurate predictions identify countries needing maternal care investment
- Model can forecast maternal mortality under different development scenarios
- Feature importance guides which demographic changes most improve outcomes
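
The notebook does not print feature importances, so the statements above about top predictors remain expectations. A minimal sketch of how they could be checked with a fitted `RandomForestRegressor` follows. The data are synthetic, the column names are shortened placeholders, and the target is constructed to depend mostly on under-five mortality purely for demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data; column names are shortened placeholders
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "under_five_mortality": rng.uniform(5, 150, size=200),
    "total_fertility_rate": rng.uniform(1.5, 7.5, size=200),
    "noise": rng.normal(size=200),
})

# Demo target dominated by under-five mortality (assumption for this example only)
y = (
    4.0 * X["under_five_mortality"]
    + 10.0 * X["total_fertility_rate"]
    + rng.normal(scale=5.0, size=200)
)

model = RandomForestRegressor(n_estimators=200, max_depth=10, random_state=0)
model.fit(X, y)

# Impurity-based importances, one per feature, summing to 1
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
```

Applied to the real model, the same two lines at the end would rank the actual columns, assuming the fitted estimator is accessible (for example via `best_model_mm`).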

## H3: Under-Five Mortality → Life Expectancy (Regression)

In [None]:
# Hypothesis 3 — Can under-five mortality predict life expectancy? (REGRESSION)

trainer_h3 = ModelTrainer(
    df_features,
    target="life_expectancy_at_birth_for_both_sexes_years",
    problem_type="regression"
)

# Prepare data with train/test split and feature scaling
trainer_h3.train_test_split()
trainer_h3.scale_numeric()

# Train regression models to find best predictor of life expectancy
best_h3, params_h3 = trainer_h3.train_models()

# Generate predictions and compile results with actual vs predicted values
results_h3 = trainer_h3.map_predictions()

results_h3.head()



Training: LinearRegression
Model LinearRegression FAILED. Reason: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Training: RandomForestRegressor
Best Params: {'max_depth': 20, 'n_estimators': 200}
Best Score: 0.9857259822879308

Final Best Model:
RandomForestRegressor(max_depth=20, n_estimators=200)
Parameters: {'max_depth': 20, 'n_estimators': 200}


Unnamed: 0,region_country_area,year,code,actual,predicted
24,Southern Africa,2010,18,58.1,57.7585
6,Western Africa,2020,11,56.9,56.676
153,Mozambique,2015,508,58.7,57.9835
211,Sudan,2024,729,66.5,66.566833
198,South Africa,2020,710,65.2,64.9995


HEALTH SYSTEM PREDICTION ANALYSIS:

COLUMN INTERPRETATIONS:
- actual: Reported life expectancy at birth from the dataset (years)
- predicted: Model-estimated life expectancy (years)
- Prediction error: Difference (actual - predicted) showing prediction accuracy; derivable from the two columns above

PUBLIC HEALTH SIGNIFICANCE:
- Strong predictions support under-5 mortality as a health system indicator
- Under-5 mortality reflects healthcare quality, nutrition, disease prevention
- Life expectancy represents overall population health outcomes

STATISTICAL RELATIONSHIP:
- Expected strong negative correlation: higher child mortality → lower life expectancy
- Typical African life expectancy ranges 50-80 years
- Small residuals (±2 years) indicate robust health system relationship

HEALTHCARE POLICY APPLICATIONS:
- Child mortality interventions should improve overall life expectancy
- Countries with large residuals may have age-specific health challenges
- Model identifies nations where child health improvements yield maximum impact

DEVELOPMENT INDICATOR VALIDATION:
- Supports child survival as a reliable proxy for broader health system performance
- Supports focusing development aid on maternal/child health programs
- Demonstrates interconnected nature of demographic health indicators
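
The expected negative correlation can be checked directly with pandas. The sketch below uses only the five rows from the `df_clean.head()` output (four Africa aggregate rows and one Western Africa row), so it illustrates the calculation and should not be read as an estimate for the full dataset.

```python
import pandas as pd

U5 = "under_five_mortality_rate_for_both_sexes_per_1000_live_births"
LIFE = "life_expectancy_at_birth_for_both_sexes_years"

# Rows copied from the df_clean.head() output
df = pd.DataFrame({
    U5: [93.9, 81.0, 69.0, 62.4, 123.2],
    LIFE: [58.8, 60.8, 62.3, 64.0, 54.3],
})

# Pearson correlation between under-five mortality and life expectancy
r = df[U5].corr(df[LIFE])
```

On these rows the coefficient is strongly negative, in the direction the hypothesis predicts.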