# Machine Learning Modeling for Predicting Poverty and Malnutrition

This notebook develops a machine learning model for predicting poverty and malnutrition rates. The approach and methodology follow the research paper "Multivariate Random Forest Prediction of Poverty and Malnutrition Prevalence" (2021) by Browne et al.

## Inspiration from the Research Paper

The cornerstone of this notebook lies in the methodologies introduced by Browne and their team. Their work in applying multivariate random forest techniques to predict poverty and malnutrition provides a framework for this modeling.

### Key Highlight from the Paper:
- **Outcome and Impact**: The paper demonstrates significant potential in using machine learning for social good, specifically in predicting crucial human welfare indicators.

## Implementation in This Notebook

Building on Browne et al.'s paper, this notebook implements a similar approach with some modifications. Unlike the paper, which filters the data by year and country, this notebook uses the entire dataset without such filtering, pooling observations across all survey years and countries.


In [1]:
import os

# Create the directories expected by the research paper's code
# Directory for saving .npy files
directory = '/kaggle/working/data/W'
os.makedirs(directory, exist_ok=True)

# Directory for saving results
results_directory = '/kaggle/working/results'
os.makedirs(results_directory, exist_ok=True)

# Contemporaneous Forecasting

## Original code from the research paper; only the directories were adjusted for Kaggle

**[Link to the code](http://barrett.dyson.cornell.edu/research/code.html)**

In [2]:
# import numpy as np
# import pandas as pd
# import pickle
# import pdb
# from sklearn.ensemble import RandomForestRegressor as RF
# from sklearn.metrics import r2_score as r2
# from sklearn.metrics import mean_squared_error as mse
# from sklearn.model_selection import train_test_split
# from sklearn.linear_model import Ridge as ridge
# ##set seed
# np.random.seed(6023)

# pd.options.mode.chained_assignment = None  # default='warn'

# nt = 2000
# folds = 5
# rftypes = ['ind','joint'] ##ind must be first
# countries = ["Bangladesh","Ethiopia","Ghana","Guatemala","Honduras","Mali","Nepal","Kenya","Senegal","Uganda","Nigeria"]
# pyears = [["04","07","11","14"],["05","11","16"],["08","14"],["14"],["11"],["06","12"],["06","11","16"],['08','14'],["05",'10'],["06",'11','16'],["08","13"]]
# outcomes = ['stunted', 'wasted',  'healthy', 'poorest','underweight_bmi']
# syears = np.sort(list(set([y for years in pyears for y in years])))
# xa = pd.read_csv('/kaggle/input/ml-prediction-of-poverty-and-malnutrition-dataset/data.csv')
# imps = [[] for country in countries]

# ##for each survey year, for each country
# for i,year in enumerate(syears):
# 	for j,country in enumerate(countries):
# 		print(i,j)		
# 		##filter data
# 		x = xa[xa.country == country]
# 		x = x.drop('country',axis=1)
# 		x = x[x.year <= int('20' + year)]

# 		if np.shape(x)[0]>0:

# 			##drop largely missing features
# 			mask = np.array(x.isnull().sum()/len(x))			
# 			x = x.loc[:,mask<.2]

# 			##drop rows with missing data
# 			x = x.dropna(axis=0)

# 			##test train split and CV
# 			xtr = x[x.year < int('20' + year)]
# 			x = x[x.year == int('20' + year)]
# 			if np.shape(x)[0]>0:
# 				q = np.arange(np.shape(x)[0])
# 				for fold in range(folds):
# 					np.random.shuffle(q)
# 					trinds = q[:int(.8*len(q))]
# 					teinds = q[int(.8*len(q)):]
# 					xtrf = pd.concat([xtr,x.iloc[trinds,:]],ignore_index=True)
# 					xtef = x.iloc[teinds,:]

# 					##make random forests
# 					labs = list(xtrf.columns)[1:-5]
# 					for rftype in rftypes:
# 						if rftype =='joint':
# 							W = np.load('/kaggle/working/data/W/W'+year+country+'c.npy')
# 							ytrf = np.dot(W,xtrf[outcomes].T).T
# 							ytef = np.dot(W,xtef[outcomes].T).T
# 							rf = RF(nt,max_depth=4,max_features = .333)
# 							rf = rf.fit(xtrf[labs],ytrf)
# 							predtr =  rf.predict(xtrf[labs])
# 							predte = rf.predict(xtef[labs])
# 							Winv = np.linalg.inv(W)
# 							predtr = np.dot(Winv,predtr.T).T
# 							predte = np.dot(Winv,predte.T).T
# 							imps[j].append(rf.feature_importances_)
		
# 							for k,outcome in enumerate(outcomes):
# 								xtrf[outcome+'rf' + rftype] = predtr[:,k]				
# 								xtef[outcome+'rf'+ rftype] = predte[:,k] 

# 						else:
# 							ytrf = xtrf[outcomes]
# 							ytef = xtef[outcomes]
# 							for outcome in outcomes:
# 								rf = RF(nt,max_depth=4,max_features = .333)
# 								rf = rf.fit(xtrf[labs],ytrf[outcome])
# 								xtrf[outcome+'rf' + rftype] = rf.predict(xtrf[labs])
# 								xtef[outcome+'rf' + rftype] = rf.predict(xtef[labs])				
# 							W = np.linalg.svd(np.corrcoef((np.array(xtrf[outcomes])-np.array(xtrf[[outcome + 'rfind' for outcome in outcomes]])).T))[2]
# 							np.save('/kaggle/working/data/W/W'+year+country+'c',W)

# 					xtrf = xtrf[outcomes + [outcome + 'rfind' for outcome in outcomes] + [outcome + 'rfjoint' for outcome in outcomes]]	
# 					xtef = xtef[outcomes + [outcome + 'rfind' for outcome in outcomes] + [outcome + 'rfjoint' for outcome in outcomes]]

# 					with open('/kaggle/working/results/'+year+country+'resultscrf' +str(fold) +'.pkl','wb') as f:
# 						pickle.dump([xtrf,xtef],f)
                        

In [3]:
import numpy as np
import pandas as pd
from typing import Tuple, Dict, List
import yaml

from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor as RF
from sklearn.metrics import r2_score, mean_squared_error

### RF method: independent
##### In this method, a separate Random Forest model is trained for each outcome variable. This means that for each of the outcomes (like 'stunted', 'wasted', etc.), an individual model is created and trained only on that specific outcome. This approach treats each outcome as a distinct problem and doesn't account for potential correlations between different outcomes.

In [4]:
# Load data 
df = pd.read_csv("/kaggle/input/ml-prediction-of-poverty-and-malnutrition-dataset/data.csv")

# Load the YAML configuration file parameters_data_processing.yml
# It defines the variables used in data preprocessing
with open('/kaggle/input/parametrs-for-predicting-poverty/parameters_data_processing.yml', 'r') as file:
    config = yaml.safe_load(file)

# This kind of configuration is used because the pipeline was also built with the Kedro framework,
# which helps create reproducible, maintainable and modular data science code :)
# Access the variables from the YAML file
variables = config['variables']


In [5]:
variables

['latnum',
 'longnum',
 'year',
 'country',
 'URBAN_RURA',
 'alt',
 'chrps',
 'deathcount',
 'lst',
 'numevents',
 'pasture',
 'sif',
 'slope',
 'tree',
 'tt00_500k',
 'stunted',
 'wasted',
 'healthy',
 'poorest',
 'underweight_bmi']

In [6]:
def drop_rows_with_missing_values(data: pd.DataFrame, max_missing: int = None) -> pd.DataFrame:
    """
    Drop rows from a DataFrame based on the specified maximum number of allowed missing values.

    Parameters:
    - data (pd.DataFrame): The DataFrame from which rows are to be removed.
    - max_missing (int, optional): The maximum number of missing values allowed in a row. 
                                   Rows with more missing values than this number will be dropped. 
                                   If not specified or None, any row with at least one missing value will be dropped.

    Returns:
    - pd.DataFrame: A new DataFrame with rows dropped based on the specified criteria.

    Raises:
    - ValueError: If the max_missing is negative.
    - TypeError: If the provided dataframe is not a pandas DataFrame.
    """
    if max_missing is not None and (not isinstance(max_missing, int) or max_missing < 0):
        raise ValueError("max_missing must be a non-negative integer or None")
    if not isinstance(data, pd.DataFrame):
        raise TypeError("The first argument must be a pandas DataFrame")

    if max_missing is None:
        return data.dropna(axis=0)
    else:
        min_non_missing = data.shape[1] - max_missing
        return data.dropna(thresh=min_non_missing, axis=0)

# Example usage:
# To drop rows with any missing value
# df = drop_rows_with_missing_values(df)

# To drop rows with more than 2 missing values
# df = drop_rows_with_missing_values(df, 2)
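The `thresh` arithmetic used in the function above can be checked on a small made-up frame. This is only a sketch: the frame `toy` and its values are illustrative and do not come from the dataset. `dropna(thresh=k)` keeps rows with at least `k` non-missing values, so allowing at most `max_missing` gaps translates to `thresh = n_columns - max_missing`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): the rows have 0, 1, 2 and 3 missing entries.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan, np.nan],
    "c": [3.0, 6.0, 9.0, np.nan],
})

max_missing = 1
# Keep rows with at least (number of columns - max_missing) non-missing values.
kept = toy.dropna(thresh=toy.shape[1] - max_missing, axis=0)
# Dropping every row with any gap corresponds to max_missing=None in the function above.
strict = toy.dropna(axis=0)

print(kept.index.tolist())    # rows with at most one missing value
print(strict.index.tolist())  # rows with no missing values
```

With `max_missing = 1`, only the first two rows survive; the strict variant keeps only the fully populated row.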


In [7]:
def preprocess_raw_data(raw_data: pd.DataFrame, variables: List[str]) -> pd.DataFrame:
    """
    Preprocesses the raw data from the research paper.

    The function keeps only the columns listed in `variables` and then drops
    every row that has at least one missing value in those columns.

    Args:
        raw_data: DataFrame containing the raw data from the research paper.
        variables: List of column names to keep (loaded from the YAML configuration).

    Returns:
        pd.DataFrame: Preprocessed data restricted to `variables`, without missing values.
    """
    df = raw_data[variables]
    preprocessed_df = drop_rows_with_missing_values(df)

    return preprocessed_df

# Example of usage
# preprocessed_data = preprocess_raw_data(df, variables)


In [8]:
preprocessed_data = preprocess_raw_data(df, variables)

In [9]:
preprocessed_data

Unnamed: 0,latnum,longnum,year,country,URBAN_RURA,alt,chrps,deathcount,lst,numevents,pasture,sif,slope,tree,tt00_500k,stunted,wasted,healthy,poorest,underweight_bmi
0,22.981516,90.155785,2004,Bangladesh,1,5.60252,0.437479,0,-1.065822,14,0.048795,-0.745444,0.021781,16.3438,365.67800,0.294118,0.058824,0.941176,0.071429,0.272727
1,22.444431,90.329185,2004,Bangladesh,1,4.67774,0.447809,0,-0.963057,14,0.042821,-0.840625,0.007349,0.0000,397.88600,0.444444,0.000000,1.000000,0.062500,0.400000
2,22.487263,90.206123,2004,Bangladesh,1,5.14527,0.453208,0,-0.963257,14,0.030075,-0.841667,0.005347,0.0000,344.36600,0.600000,0.050000,0.950000,0.000000,0.354839
3,23.016359,90.192879,2004,Bangladesh,1,6.17460,0.433324,0,-1.074995,14,0.052236,-0.744531,0.027679,16.0333,369.38500,0.500000,0.062500,0.937500,0.062500,0.323529
4,22.952404,90.454414,2004,Bangladesh,1,5.39076,0.406196,0,-1.065344,14,0.041699,-0.748497,0.025029,18.7647,273.92400,0.555556,0.000000,1.000000,0.111111,0.277778
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14333,7.368560,3.937503,2013,Nigeria,0,212.91400,0.622106,382,-0.786959,345,0.004750,-0.413014,0.508662,0.0000,6.60494,0.250000,0.250000,0.750000,0.000000,0.212121
14334,8.570109,3.547449,2013,Nigeria,1,392.34100,1.081900,311,-0.804163,296,0.001639,-0.269682,0.396409,16.6122,271.40400,0.431373,0.176471,0.784314,0.969697,0.347222
14335,8.765458,3.603125,2013,Nigeria,1,394.41600,1.080867,311,-0.780643,296,0.059900,-0.248022,0.527184,15.6757,227.03800,0.217391,0.000000,0.956522,0.000000,0.097561
14336,8.660406,3.522780,2013,Nigeria,1,389.46400,1.084190,311,-0.781408,296,0.004605,-0.257015,0.390382,21.0244,219.00000,0.195122,0.073171,0.926829,0.379310,0.026316


In [10]:
# Load the YAML configuration file parameters_data_science.yml
# This file contains the modeling configuration, with parameters such as n_folds and random_state
# It also defines the features and the target for each model
with open('/kaggle/input/parametrs-for-predicting-poverty/parameters_data_science.yml', 'r') as file:
    config = yaml.safe_load(file)

# Choose the model options for our five target variables
model_options_stunted = config["model_options_stunted"]
model_options_wasted = config["model_options_wasted"]
model_options_healthy = config["model_options_healthy"]
model_options_poorest = config["model_options_poorest"]
model_options_underweight_bmi = config["model_options_underweight_bmi"]

In [11]:
model_options_stunted

{'n_folds': 5,
 'random_state': 42,
 'n_trees': 2000,
 'max_depth': 4,
 'max_features': 0.333,
 'features': ['latnum',
  'longnum',
  'URBAN_RURA',
  'alt',
  'chrps',
  'deathcount',
  'lst',
  'numevents',
  'pasture',
  'sif',
  'slope',
  'tree',
  'tt00_500k'],
 'target': 'stunted'}

## **Modeling for target variable 'stunted'** 
### Prevalence of stunted (HAZ < -2) children under five (0-59 months)
#### cluster prevalence weighted at hh level

In [12]:
def generate_cv_splits(data: pd.DataFrame, model_options: Dict) -> List[Tuple]:
    """
    Generates cross-validation train-test splits.

    Args:
        data: DataFrame containing the features and target.
        model_options: Dictionary containing 'features', 'target', 'n_folds', and 'random_state'.

    Returns:
        A list of tuples, each containing X_train, X_test, y_train, y_test for each fold.
    """
    features = model_options["features"]
    target = model_options["target"]
    n_folds = model_options["n_folds"]
    random_state = model_options["random_state"]

    X = data[features]
    y = data[[target]]  # Keep y as a DataFrame

    kf = KFold(n_splits=n_folds, shuffle=True, random_state=random_state)

    cv_data_splits = []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        cv_data_splits.append((X_train, X_test, y_train, y_test))

    return cv_data_splits
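A small sketch of what the shuffled `KFold` above does, using ten dummy rows in place of the survey clusters (`X_demo`, `kf_demo` and `test_rows` are illustrative names, not part of the pipeline). Note that this pooled, shuffled split differs from the paper's code shown earlier, which trains on earlier survey years plus 80% of the current year; here clusters from the same country and year can fall on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy rows stand in for the survey clusters (illustrative only).
X_demo = np.arange(10).reshape(-1, 1)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
test_rows = []
for train_idx, test_idx in kf_demo.split(X_demo):
    # Within a fold, train and test indices never overlap.
    assert set(train_idx).isdisjoint(test_idx)
    test_rows.extend(test_idx.tolist())

# Across the five folds, every row is used for testing exactly once.
print(sorted(test_rows))
```

Because every row is tested exactly once, the five fold scores together cover the whole dataset.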


In [13]:
cv_data_splits_stunted = generate_cv_splits(preprocessed_data, model_options_stunted)

In [14]:
def train_independent_rf(X_train: pd.DataFrame, y_train: pd.DataFrame, model_options: Dict) -> Dict[str, RF]:
    """
    Trains independent Random Forest models for each outcome.

    Args:
        X_train: Training data of independent features.
        y_train: Training data for the target variables.
        model_options: Dictionary containing 'n_trees', 'max_depth', and 'max_features'.

    Returns:
        Dictionary of trained Random Forest models for each outcome.
    """
    n_trees = model_options["n_trees"]
    max_depth = model_options["max_depth"]
    max_features = model_options["max_features"]

    models = {}
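    for outcome in y_train.columns:
        # Note: model_options['random_state'] is not passed to the forest,
        # so the fitted models (and their scores) vary slightly between runs.
        model = RF(n_estimators=n_trees, max_depth=max_depth, max_features=max_features)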
        model.fit(X_train, y_train[outcome])
        models[outcome] = model
    return models

def train_models_on_cv_folds(cv_data_splits: List[Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]], model_options: Dict) -> Dict[str, Dict[str, RF]]:
    """
    Trains independent Random Forest models for each outcome on each fold of cross-validation splits.

    Args:
        cv_data_splits: List of tuples containing (X_train, X_test, y_train, y_test) for each fold.
        model_options: Dictionary containing 'n_trees', 'max_depth', and 'max_features'.

    Returns:
        Dictionary of dictionaries containing trained Random Forest models for each outcome, for each fold.
    """
    num_folds = len(cv_data_splits)
    print(f"Starting training of independent Random Forest models for {num_folds} folds...")

    target_names = ", ".join(cv_data_splits[0][2].columns)
    print(f"Training models for target: {target_names}")
    
    all_fold_models = {}
    for fold_index, (X_train, _, y_train, _) in enumerate(cv_data_splits):
        fold_key = f'fold_{fold_index + 1}'
        print(f"Training model for {fold_key}")

        fold_models = train_independent_rf(X_train, y_train, model_options)
        all_fold_models[fold_key] = fold_models
        print(f"Completed training model for {fold_key}")
    
    print("ok im done :)")
    return all_fold_models


In [15]:
all_fold_models_stunted = train_models_on_cv_folds(cv_data_splits_stunted, model_options_stunted)

Starting training of independent Random Forest models for 5 folds...
Training models for target: stunted
Training model for fold_1
Completed training model for fold_1
Training model for fold_2
Completed training model for fold_2
Training model for fold_3
Completed training model for fold_3
Training model for fold_4
Completed training model for fold_4
Training model for fold_5
Completed training model for fold_5
ok im done :)


In [16]:
all_fold_models_stunted

{'fold_1': {'stunted': RandomForestRegressor(max_depth=4, max_features=0.333, n_estimators=2000)},
 'fold_2': {'stunted': RandomForestRegressor(max_depth=4, max_features=0.333, n_estimators=2000)},
 'fold_3': {'stunted': RandomForestRegressor(max_depth=4, max_features=0.333, n_estimators=2000)},
 'fold_4': {'stunted': RandomForestRegressor(max_depth=4, max_features=0.333, n_estimators=2000)},
 'fold_5': {'stunted': RandomForestRegressor(max_depth=4, max_features=0.333, n_estimators=2000)}}

In [17]:
def evaluate_model_performance(cv_data_splits: List[Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]], 
                               trained_models: Dict[str, Dict[str, RF]]) -> Dict[str, float]:
    """
    Evaluates the performance of trained models on each fold of cross-validation splits.

    Args:
        cv_data_splits: List of tuples containing (X_train, X_test, y_train, y_test) for each fold.
        trained_models: Dictionary of dictionaries containing trained models for each outcome, for each fold.

    Returns:
        A flat dictionary of evaluation metrics with keys of the form
        '<fold>_<outcome>_R2' and '<fold>_<outcome>_MSE'.
    """
    performance_metrics = {}

    for fold_index, (_, X_test, _, y_test) in enumerate(cv_data_splits):
        fold_key = f'fold_{fold_index + 1}'
        models = trained_models[fold_key]

        for outcome, model in models.items():
            # Score each outcome's model on the held-out part of this fold
            predictions = model.predict(X_test)
            r2 = r2_score(y_test[outcome], predictions)
            mse = mean_squared_error(y_test[outcome], predictions)

            # Store the metrics in a flat structure
            performance_metrics[f'{fold_key}_{outcome}_R2'] = r2
            performance_metrics[f'{fold_key}_{outcome}_MSE'] = mse

    return performance_metrics

In [18]:
performance_metrics_stunted = evaluate_model_performance(cv_data_splits_stunted, all_fold_models_stunted)

In [19]:
performance_metrics_stunted

{'fold_1_stunted_R2': 0.21187771935894162,
 'fold_1_stunted_MSE': 0.03890300274016168,
 'fold_2_stunted_R2': 0.1817678232682095,
 'fold_2_stunted_MSE': 0.04159865809701701,
 'fold_3_stunted_R2': 0.20149998035768335,
 'fold_3_stunted_MSE': 0.04161872351936654,
 'fold_4_stunted_R2': 0.190764402361008,
 'fold_4_stunted_MSE': 0.04309506162799954,
 'fold_5_stunted_R2': 0.18658033286817366,
 'fold_5_stunted_MSE': 0.04105871722231164}

For the 'stunted' target variable (prevalence of stunted children under five), the Random Forest models perform modestly across the five folds. The R-squared values range from approximately 0.18 to 0.21, so the models explain only about a fifth of the variance in cluster-level stunting prevalence. The Mean Squared Error (MSE) is stable across folds (about 0.039 to 0.043), which corresponds to a root mean squared error of roughly 0.20 on the 0-1 prevalence scale. Further refinement, through other modeling techniques or additional informative variables, is recommended.
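To read the fold-level numbers at a glance, the flat metrics dictionary can be averaged by suffix. The sketch below assumes a hypothetical helper, `summarize_cv_metrics`, which is not part of the notebook's pipeline; the input values are the 'stunted' results from the output above, rounded to four decimals.

```python
import numpy as np

# Fold-level values copied from the output above (rounded to 4 decimals).
metrics = {
    "fold_1_stunted_R2": 0.2119, "fold_1_stunted_MSE": 0.0389,
    "fold_2_stunted_R2": 0.1818, "fold_2_stunted_MSE": 0.0416,
    "fold_3_stunted_R2": 0.2015, "fold_3_stunted_MSE": 0.0416,
    "fold_4_stunted_R2": 0.1908, "fold_4_stunted_MSE": 0.0431,
    "fold_5_stunted_R2": 0.1866, "fold_5_stunted_MSE": 0.0411,
}

def summarize_cv_metrics(flat_metrics):
    """Hypothetical helper: average the flat per-fold metrics by key suffix."""
    r2 = [v for k, v in flat_metrics.items() if k.endswith("_R2")]
    mse = [v for k, v in flat_metrics.items() if k.endswith("_MSE")]
    return {
        "mean_R2": float(np.mean(r2)),
        "std_R2": float(np.std(r2)),
        "mean_MSE": float(np.mean(mse)),
        # Square root of the mean MSE, expressed on the 0-1 prevalence scale.
        "RMSE": float(np.sqrt(np.mean(mse))),
    }

summary = summarize_cv_metrics(metrics)
print(summary)
```

The same helper could be applied to the metrics dictionaries of the other target variables, since they share the `<fold>_<outcome>_<metric>` key layout.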

**The same modeling procedure is now applied to the remaining target variables.**

## **Modeling for target variable 'wasted'**
### Prevalence of wasted (WHZ < -2) children under five (0-59 months)
#### cluster prevalence weighted at hh level

In [20]:
cv_data_splits_wasted = generate_cv_splits(preprocessed_data, model_options_wasted)

all_fold_models_wasted = train_models_on_cv_folds(cv_data_splits_wasted, model_options_wasted)

performance_metrics_wasted = evaluate_model_performance(cv_data_splits_wasted, all_fold_models_wasted)

Starting training of independent Random Forest models for 5 folds...
Training models for target: wasted
Training model for fold_1
Completed training model for fold_1
Training model for fold_2
Completed training model for fold_2
Training model for fold_3
Completed training model for fold_3
Training model for fold_4
Completed training model for fold_4
Training model for fold_5
Completed training model for fold_5
ok im done :)


In [21]:
performance_metrics_wasted

{'fold_1_wasted_R2': 0.21692345613265063,
 'fold_1_wasted_MSE': 0.012312732843319555,
 'fold_2_wasted_R2': 0.20337690643120798,
 'fold_2_wasted_MSE': 0.01385699206390864,
 'fold_3_wasted_R2': 0.18923092474948133,
 'fold_3_wasted_MSE': 0.013769665040703273,
 'fold_4_wasted_R2': 0.20308781176376367,
 'fold_4_wasted_MSE': 0.013551942347336294,
 'fold_5_wasted_R2': 0.18796557157880345,
 'fold_5_wasted_MSE': 0.014671527981963046}

The 'wasted' target variable (prevalence of wasted children under five) shows similarly modest performance, with R-squared values ranging from 0.19 to 0.22. The MSE is stable across folds (about 0.012 to 0.015), which corresponds to a root mean squared error of roughly 0.11 to 0.12 on the 0-1 prevalence scale. Most of the variance remains unexplained, so there is clear room for improvement.

## **Modeling for target variable 'healthy'**
### Prevalence of healthy weight (WHZ ≤ 2 and ≥-2) among children under five (0-59 months)
#### cluster prevalence weighted at hh level

In [22]:
cv_data_splits_healthy = generate_cv_splits(preprocessed_data, model_options_healthy)

all_fold_models_healthy = train_models_on_cv_folds(cv_data_splits_healthy, model_options_healthy)

performance_metrics_healthy = evaluate_model_performance(cv_data_splits_healthy, all_fold_models_healthy)

Starting training of independent Random Forest models for 5 folds...
Training models for target: healthy
Training model for fold_1
Completed training model for fold_1
Training model for fold_2
Completed training model for fold_2
Training model for fold_3
Completed training model for fold_3
Training model for fold_4
Completed training model for fold_4
Training model for fold_5
Completed training model for fold_5
ok im done :)


In [23]:
performance_metrics_healthy

{'fold_1_healthy_R2': 0.12916464785169657,
 'fold_1_healthy_MSE': 0.018565709528774184,
 'fold_2_healthy_R2': 0.13660799737475027,
 'fold_2_healthy_MSE': 0.019885175051428318,
 'fold_3_healthy_R2': 0.1433960121234893,
 'fold_3_healthy_MSE': 0.019713786160609422,
 'fold_4_healthy_R2': 0.12800670390045488,
 'fold_4_healthy_MSE': 0.020549070909121808,
 'fold_5_healthy_R2': 0.12776145009006024,
 'fold_5_healthy_MSE': 0.02088543205375344}

For the 'healthy' target variable (prevalence of healthy weight among children under five), the models perform weakest, with R-squared values between 0.13 and 0.14. The MSE is stable across folds (about 0.019 to 0.021), a root mean squared error of roughly 0.14. Additional model refinement and feature engineering may help improve predictive accuracy.

## **Modeling for target variable 'poorest'**
### Percentage of Households below the Comparative Threshold (assetindex ≤ -0.9080) for the Poorest Quintile of the Asset-Based Comparative Wealth Index
#### cluster prevalence weighted at hh level

In [24]:
cv_data_splits_poorest = generate_cv_splits(preprocessed_data, model_options_poorest)

all_fold_models_poorest = train_models_on_cv_folds(cv_data_splits_poorest, model_options_poorest)

performance_metrics_poorest = evaluate_model_performance(cv_data_splits_poorest, all_fold_models_poorest)

Starting training of independent Random Forest models for 5 folds...
Training models for target: poorest
Training model for fold_1
Completed training model for fold_1
Training model for fold_2
Completed training model for fold_2
Training model for fold_3
Completed training model for fold_3
Training model for fold_4
Completed training model for fold_4
Training model for fold_5
Completed training model for fold_5
ok im done :)


In [25]:
performance_metrics_poorest

{'fold_1_poorest_R2': 0.36165550630222143,
 'fold_1_poorest_MSE': 0.05689562235980636,
 'fold_2_poorest_R2': 0.3509485781630918,
 'fold_2_poorest_MSE': 0.05352143977151113,
 'fold_3_poorest_R2': 0.3706900014843789,
 'fold_3_poorest_MSE': 0.0533273194382932,
 'fold_4_poorest_R2': 0.3659549150350757,
 'fold_4_poorest_MSE': 0.0577215392265688,
 'fold_5_poorest_R2': 0.3678559314402866,
 'fold_5_poorest_MSE': 0.056766932148470686}

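The 'poorest' target variable (percentage of households below the Comparative Threshold) shows better performance than the child nutrition targets, with R-squared values ranging from 0.35 to 0.37. Its MSE (about 0.053 to 0.058, a root mean squared error of roughly 0.23 to 0.24) is nevertheless the largest of the five targets, which indicates that this variable varies more across clusters. The model therefore captures more of the variation here, but the absolute prediction errors remain large.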

## **Modeling for target variable 'underweight_bmi'**
### Prevalence of underweight (BMI < 18.5) women of reproductive age
#### cluster prevalence weighted at individual level

In [26]:
cv_data_splits_underweight_bmi = generate_cv_splits(preprocessed_data, model_options_underweight_bmi)

all_fold_models_underweight_bmi = train_models_on_cv_folds(cv_data_splits_underweight_bmi, model_options_underweight_bmi)

performance_metrics_underweight_bmi = evaluate_model_performance(cv_data_splits_underweight_bmi, all_fold_models_underweight_bmi)

Starting training of independent Random Forest models for 5 folds...
Training models for target: underweight_bmi
Training model for fold_1
Completed training model for fold_1
Training model for fold_2
Completed training model for fold_2
Training model for fold_3
Completed training model for fold_3
Training model for fold_4
Completed training model for fold_4
Training model for fold_5
Completed training model for fold_5
ok im done :)


In [27]:
performance_metrics_underweight_bmi

{'fold_1_underweight_bmi_R2': 0.38778797547614385,
 'fold_1_underweight_bmi_MSE': 0.012804047386395963,
 'fold_2_underweight_bmi_R2': 0.35483380599213765,
 'fold_2_underweight_bmi_MSE': 0.013133224715601414,
 'fold_3_underweight_bmi_R2': 0.3404723659599226,
 'fold_3_underweight_bmi_MSE': 0.013970080845448525,
 'fold_4_underweight_bmi_R2': 0.370545205098881,
 'fold_4_underweight_bmi_MSE': 0.013436298287742721,
 'fold_5_underweight_bmi_R2': 0.3765941222128132,
 'fold_5_underweight_bmi_MSE': 0.01264371040822323}

Lastly, for the 'underweight_bmi' target variable (prevalence of underweight women of reproductive age), the Random Forest models achieve R-squared values ranging from 0.34 to 0.39, with MSE between about 0.013 and 0.014 (a root mean squared error of roughly 0.11 to 0.12). Together with 'poorest', this is the best-predicted of the five targets, although more than 60% of the variance is still unexplained.

### RF method: joint
##### Contrary to the independent approach, the joint method trains a single Random Forest model that predicts all outcomes simultaneously, which lets it take the correlations between outcome variables into account. In the paper's code, the outcomes are first rotated with a matrix W, obtained from the singular value decomposition of the correlation matrix of the independent models' residuals. The Random Forest is then fitted on these transformed outcomes, and its predictions are transformed back with the inverse of W. This approach can potentially provide more accurate predictions, especially when the outcomes are interrelated.
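The steps of the joint method can be sketched on synthetic data. This is a minimal illustration modeled on the commented-out paper code earlier in the notebook, not the implementation planned for this notebook: the data, the variable names and the small forest size (50 trees instead of 2000, and no `max_features` setting) are assumptions chosen so the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (not the survey data): 300 rows, 4 features, 3 correlated outcomes.
X = rng.normal(size=(300, 4))
shared = X[:, 0] + 0.5 * rng.normal(size=300)
Y = np.column_stack([
    shared + 0.3 * rng.normal(size=300),
    -shared + 0.3 * rng.normal(size=300),
    X[:, 1] + 0.3 * rng.normal(size=300),
])

# Step 1: independent forests, one per outcome, to obtain residuals.
ind_pred = np.column_stack([
    RandomForestRegressor(n_estimators=50, max_depth=4, random_state=0)
    .fit(X, Y[:, k]).predict(X)
    for k in range(Y.shape[1])
])
residuals = Y - ind_pred

# Step 2: rotation matrix from the SVD of the residual correlation matrix.
W = np.linalg.svd(np.corrcoef(residuals.T))[2]

# Step 3: rotate the outcomes and fit one multi-output forest on them.
Y_rot = (W @ Y.T).T
joint_rf = RandomForestRegressor(n_estimators=50, max_depth=4, random_state=0)
joint_rf.fit(X, Y_rot)

# Step 4: predict in the rotated space, then rotate back to the original outcomes.
pred_rot = joint_rf.predict(X)
joint_pred = (np.linalg.inv(W) @ pred_rot.T).T

print(joint_pred.shape)
```

Because W comes from the SVD of a symmetric correlation matrix, it is orthogonal, so rotating the outcomes and rotating them back recovers the original values exactly; only the forest's predictions change.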

In [28]:
# I WILL TRY TO CREATE THIS :) 

# Sequential Nowcasting next


In [29]:
# AND ALSO THIS :)