# Random Forest Regression for Multidimensional Poverty Estimation

This notebook implements a complete modeling pipeline for nowcasting multidimensional poverty indicators using Random Forest Regression. The objective is to predict CONEVAL’s official poverty statistics for each Mexican state by leveraging text-derived indicators from various digital sources, including YouTube, Telegram, Google Trends, and online news outlets.

## Data Integration

The analysis integrates five data sources across two reference years (2020 and 2022): four categories of text-based features, plus the official statistics used as prediction targets:

- **Google Trends**: Normalized search interest on poverty-related keywords.
- **YouTube Comments**: Share of comments related to each poverty dimension and their average sentiment.
- **Telegram Posts**: Share of negatively connoted posts about each dimension.
- **News Outlets**: Share of LDA-derived topics aligned with poverty-related themes.
- **Official Statistics**: CONEVAL’s ground truth multidimensional poverty indicators.

**Geographic Scope**: All 32 Mexican states.  
**Temporal Scope**: 2020 (training) and 2022 (validation).

## Methodology: Random Forest Regression

Random Forest is a particularly suitable method for this task due to its:

- Ability to capture **nonlinear relationships** and **interactions** between features.
- Robustness to **multicollinearity** and **noisy predictors**, especially when combining multiple heterogeneous sources.
- Built-in mechanism for **feature importance estimation**, allowing implicit selection of relevant variables.
- Generally reasonable behavior in **high-dimensional, low-sample-size (p > n)** contexts, such as the current setting with 32 observations per year and 37 features, although estimates from so few observations remain noisy.
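
As a rough illustration of the points above, the following sketch fits a Random Forest on synthetic data with the same shape as the real problem (32 rows, 37 features) and prints the largest impurity-based importances. The data, the target function, and the variable names are assumptions made for this example only; they are not the notebook's actual features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real data: 32 rows, 37 features (p > n).
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 37))
# Hypothetical target: nonlinear in feature 0, with a 0-1 interaction, plus noise.
y = np.sin(X[:, 0]) + X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=32)

model = RandomForestRegressor(
    n_estimators=100, max_depth=4, min_samples_leaf=2, random_state=42)
model.fit(X, y)

# Impurity-based importances: one non-negative value per feature, summing to 1.
importances = model.feature_importances_
top = np.argsort(importances)[::-1][:5]
for idx in top:
    print(f"feature_{idx}: {importances[idx]:.3f}")
```

With this few observations, the ranking can shift noticeably between random seeds, so importances are probably best read as descriptive rather than as a firm variable selection.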

## Feature Engineering and Model Training

- All numeric indicators, excluding `state`, `year`, and the target column, were included as predictors.
- No feature scaling was applied: tree splits depend only on the ordering of feature values, so Random Forest is invariant to monotonic rescaling of the predictors.
- A separate model was trained for each dimension of poverty.
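
The scale-invariance argument can be checked with a small sketch. It uses synthetic data (an assumption for illustration), rescales one column by a positive constant, and compares the predictions of two forests trained with the same seed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(32, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.05, size=32)

# Rescale one column by a positive constant. A power of two keeps the
# floating-point arithmetic exact, so the comparison below is clean.
X_scaled = X.copy()
X_scaled[:, 0] *= 1024.0

params = dict(n_estimators=50, max_depth=4, min_samples_leaf=2, random_state=42)
pred_raw = RandomForestRegressor(**params).fit(X, y).predict(X)
pred_scaled = RandomForestRegressor(**params).fit(X_scaled, y).predict(X_scaled)

# Same ordering of values -> same splits -> same predictions.
same = np.allclose(pred_raw, pred_scaled)
print(same)
```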

Two training-validation strategies were implemented:

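1. **In-Sample**: Models were trained on both 2020 and 2022 data and evaluated on 2022 only. Because the 2022 observations are also seen during training, these scores measure fit rather than predictive accuracy.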
2. **Out-of-Sample**: Models were trained solely on 2020 and tested on 2022, simulating a real-world nowcasting scenario.
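
The difference between the two strategies comes down to which rows enter the training set. A minimal sketch on a toy frame (the frame `toy` and its values are hypothetical stand-ins for `combined_data`):

```python
import pandas as pd

# Toy frame standing in for `combined_data`: 3 states x 2 years.
toy = pd.DataFrame({
    "state": ["A", "B", "C"] * 2,
    "year": [2020] * 3 + [2022] * 3,
    "feature_1": [0.1, 0.4, 0.3, 0.2, 0.5, 0.6],
    "income_target": [30.0, 45.0, 40.0, 28.0, 47.0, 50.0],
})

# Both strategies are evaluated on the 2022 rows.
test = toy[toy["year"] == 2022]

# In-sample: the 2022 rows are part of the training set.
train_in_sample = toy
# Out-of-sample: only the 2020 rows are used for training.
train_out_of_sample = toy[toy["year"] == 2020]

# Count test rows that also appear in each training set.
overlap_in = len(train_in_sample.merge(test, on=["state", "year"]))
overlap_out = len(train_out_of_sample.merge(test, on=["state", "year"]))
print(overlap_in, overlap_out)
```

Every test row overlaps with the in-sample training set and none with the out-of-sample one, which is why only the second strategy approximates a genuine nowcast.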

## Hyperparameter Selection

Initially, the following configuration was manually selected based on its strong empirical performance:

- `n_estimators = 100`: Enough trees for stable ensemble averaging; adding more trees mainly reduces variance rather than causing overfitting.
- `max_depth = 4`: Controls model complexity and prevents overly deep trees.
- `min_samples_leaf = 2`: Avoids over-specialization on very small sample splits.
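
A minimal sketch of this baseline configuration, fitted on synthetic data (the data and the `random_state` value are assumptions for the example), confirms that the depth cap is respected by every tree in the ensemble:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(32, 10))
y = X[:, 0] + rng.normal(scale=0.1, size=32)

# Manual baseline: 100 shallow trees with at least 2 samples per leaf.
baseline = RandomForestRegressor(
    n_estimators=100, max_depth=4, min_samples_leaf=2, random_state=42)
baseline.fit(X, y)

# Each fitted tree exposes its realized depth.
depths = [tree.get_depth() for tree in baseline.estimators_]
print(len(baseline.estimators_), max(depths))
```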

Subsequently, we introduced hyperparameter tuning via **GridSearchCV** using **Leave-One-Out Cross-Validation (LOOCV)**, selecting the configuration with the lowest cross-validated mean squared error. The following search space was explored:

```python
param_grid = {
    'n_estimators': [90, 100, 110],
    'max_depth': [3, 4, 5],
    'min_samples_leaf': [1, 2, 3]}
```

This grid was intentionally narrow and centered on the manual baseline configuration. Broader ranges were tested first but degraded model performance, so the grid was refined to values that had worked well empirically while still leaving some room for tuning.
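
The tuning step can be sketched as follows. To keep the example fast, it uses synthetic data and a reduced grid (`small_grid`), both of which are assumptions for illustration rather than the search space above. Scoring uses negative mean squared error, since R² is not well defined on a single held-out observation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, LeaveOneOut

rng = np.random.default_rng(3)
X = rng.normal(size=(12, 6))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=12)

# Reduced, hypothetical grid so the sketch runs quickly.
small_grid = {
    "n_estimators": [20, 30],
    "max_depth": [3, 4],
    "min_samples_leaf": [1, 2],
}

# LOOCV: each of the 12 observations is held out once per configuration.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    small_grid,
    scoring="neg_mean_squared_error",
    cv=LeaveOneOut(),
)
search.fit(X, y)

# best_score_ is the negated MSE, so flip the sign to report the error.
print(search.best_params_, -search.best_score_)
```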

## Model Evaluation

Each model was evaluated on 2022 using:

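- **R-squared (R²)**: Measures the share of variance in the official values explained by the predictions; it becomes negative when the model predicts worse than the mean of the observed values.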
- **Mean Absolute Error (MAE)**: Captures average prediction error.
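
Both metrics can be reproduced by hand. The sketch below uses made-up values (not CONEVAL data) chosen so that the predictions are worse than simply predicting the mean, which yields a negative R²:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

# Made-up values: the predictions run in the opposite direction to the truth.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([4.0, 3.0, 2.0, 1.0])

# R² = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)          # 9 + 1 + 1 + 9 = 20
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 2.25 + 0.25 + 0.25 + 2.25 = 5
r2_manual = 1.0 - ss_res / ss_tot                # 1 - 20/5 = -3

# MAE = mean of absolute errors
mae_manual = np.mean(np.abs(y_true - y_pred))    # (3 + 1 + 1 + 3) / 4 = 2

print(r2_manual, r2_score(y_true, y_pred))
print(mae_manual, mean_absolute_error(y_true, y_pred))
```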

Results are stored in:
- `results.csv`: contains actual vs predicted values for each state and dimension.
- `metrics.csv`: includes model performance and the list of important features (with importances).
- Scatter plots: for each poverty dimension, we visualize predicted vs actual values and overlay the perfect prediction line.

In [1]:
# load necessary libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.metrics import r2_score, mean_absolute_error
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# load the data
tg_2020 = pd.read_csv('clean_data/tg_2020.csv')
tg_2022 = pd.read_csv('clean_data/tg_2022.csv')
gt_2020 = pd.read_csv('clean_data/gt_2020.csv')
gt_2022 = pd.read_csv('clean_data/gt_2022.csv')
yt_2020 = pd.read_csv('clean_data/yt_2020.csv')
yt_2022 = pd.read_csv('clean_data/yt_2022.csv')
news_2020 = pd.read_csv('clean_data/news_2020.csv')
news_2022 = pd.read_csv('clean_data/news_2022.csv')
off_2020 = pd.read_csv('clean_data/off_2020.csv')
off_2022 = pd.read_csv('clean_data/off_2022.csv')

In [3]:
# there were inconsistencies in the state names, so this mapping standardizes the state names across all datasets
state_name_map = {
    "México": "Estado de México",
    "Mexico": "Estado de México",
    # NOTE: "Estados Unidos Mexicanos" usually denotes the national total;
    # verify that mapping it to the State of Mexico is intended
    "Estados Unidos Mexicanos": "Estado de México",
    "Michoacán de Ocampo": "Michoacán",
    "Veracruz de Ignacio de la Llave": "Veracruz",
    "Coahuila de Zaragoza": "Coahuila",
    "Yucatan": "Yucatán",
    "Queretaro": "Querétaro",
    "San Luis Potosi": "San Luis Potosí",
    "Nuevo Leon": "Nuevo León",
    "Michoacan": "Michoacán"}

# apply the mapping
for df in [off_2020, off_2022, gt_2020, gt_2022, yt_2020, yt_2022, tg_2020, tg_2022, news_2020, news_2022]:
    df['state'] = df['state'].astype(str).str.strip()
    df['state'] = df['state'].replace(state_name_map)
    # the column stays str-typed so the merges below compare like with like

In [4]:
# create dataset for 2020
data_2020 = off_2020.copy()
data_2020['year'] = 2020

# merge Google Trends
data_2020 = data_2020.merge(gt_2020, on='state', how='inner')

# merge YouTube
data_2020 = data_2020.merge(yt_2020, on='state', how='inner')

# merge Telegram
data_2020 = data_2020.merge(tg_2020, on='state', how='inner')

# merge News (=LDA topics)
data_2020 = data_2020.merge(news_2020, on='state', how='inner')


# create dataset for 2022
data_2022 = off_2022.copy()
data_2022['year'] = 2022

# merge Google Trends
data_2022 = data_2022.merge(gt_2022, on='state', how='inner')

# merge YouTube
data_2022 = data_2022.merge(yt_2022, on='state', how='inner')

# merge Telegram
data_2022 = data_2022.merge(tg_2022, on='state', how='inner')

# merge News (=LDA topics)
data_2022 = data_2022.merge(news_2022, on='state', how='inner')

# combine datasets for 2020 and 2022
combined_data = pd.concat([data_2020, data_2022], ignore_index=True)

# check that we have all the states in both years (32 states * 2 years = 64 observations)
print(f"Number of observations: {len(combined_data)}")

Number of observations: 64


In [5]:
# map poverty dimensions to target columns
POVERTY_DIMENSIONS = {
    'income': 'income_target',
    'health': 'health_target',
    'education': 'educ_target',
    'social_security': 'social_target',
    'housing': 'housing_target',
    'food': 'food_target'}

In [6]:
# get features to use
def get_all_feature_columns(data, target_columns, exclude_cols=('state', 'year')):
    # keep numeric (float64/int64) columns that are neither targets nor identifiers
    excluded = set(target_columns) | set(exclude_cols)
    return [col for col in data.columns if col not in excluded and data[col].dtype in [np.float64, np.int64]]

In [7]:
def train_all_data_grid(data, dimension, target_col):
    all_targets = list(POVERTY_DIMENSIONS.values())
    feature_cols = get_all_feature_columns(data, all_targets)

    X = data[feature_cols].values
    y = data[target_col].values
    mask = ~(np.isnan(X).any(axis=1) | np.isnan(y))
    X, y = X[mask], y[mask]

    param_grid = {
        'n_estimators': [90, 100, 110],
        'max_depth': [3, 4, 5],
        'min_samples_leaf': [1, 2, 3]}

    rf = RandomForestRegressor(random_state=42)
    search = GridSearchCV(
        rf, param_grid, scoring='neg_mean_squared_error', cv=LeaveOneOut(),
        n_jobs=-1, verbose=0)
    search.fit(X, y)

    return {
        'dimension': dimension,
        'model': search.best_estimator_,
        'feature_cols': feature_cols}

In [8]:
# function to validate on 2022
def validate_2022(data, rf_results):
    data_2022 = data[data['year'] == 2022].copy()
    results = {}

    for dim, res in rf_results.items():
        print(f"\n--- {dim.upper()} ---")
        model = res['model']
        features = res['feature_cols']
        target = POVERTY_DIMENSIONS.get(dim)

        if target not in data_2022.columns:
            continue

        X_test = data_2022[features].values
        y_test = data_2022[target].values

        mask = ~(np.isnan(X_test).any(axis=1) | np.isnan(y_test))
        X_test, y_test = X_test[mask], y_test[mask]
        states = data_2022['state'].values[mask]

        if len(X_test) == 0:
            continue

        y_pred = model.predict(X_test)
        r2 = r2_score(y_test, y_pred)
        mae = mean_absolute_error(y_test, y_pred)

        print(f"R² = {r2:.3f}, MAE = {mae:.3f}")

        results[dim] = {
            'states': states,
            'y_true': y_test,
            'y_pred': y_pred,
            'r2': r2,
            'mae': mae}

    return results

In [29]:
# save all results 
def save_rf_results(results_dict, models_dict, folder_name):
    os.makedirs(folder_name, exist_ok=True)

    # 1. save predictions (results.csv)
    preds = {'state': next(iter(results_dict.values()))['states']}
    for dim, res in results_dict.items():
        preds[f'{dim}_actual'] = res['y_true']
        preds[f'{dim}_predicted'] = res['y_pred']
    pd.DataFrame(preds).to_csv(f"{folder_name}/results.csv", index=False)

    # 2. save metrics (metrics.csv)
    rows = []
    for dim, res in results_dict.items():
        model = models_dict[dim]['model']
        features_used = models_dict[dim]['feature_cols']
        importances = model.feature_importances_

        important_features = [
            (f, round(imp, 4)) for f, imp in zip(features_used, importances) if imp > 0]

        num_used = len(important_features)

        rows.append({
            'dimension': dim,
            'r2': res['r2'],
            'mae': res['mae'],
            'n_features_used': num_used,
            'important_features': sorted(important_features, key=lambda x: -x[1])})

    pd.DataFrame(rows).to_csv(f"{folder_name}/metrics.csv", index=False)

    # 3. scatter plot for each dimension
    for dim, res in results_dict.items():
        y_true = res['y_true']
        y_pred = res['y_pred']

        plt.figure(figsize=(10, 10))
        plt.scatter(
            y_true, y_pred,
            color='royalblue',
            edgecolor='black',
            s=250,  
            alpha=0.9)

        min_val = min(min(y_true), min(y_pred))
        max_val = max(max(y_true), max(y_pred))
        plt.plot(
            [min_val, max_val], [min_val, max_val],
            'r--', linewidth=3, label='Perfect prediction')

        plt.xlabel("CONEVAL's statistics", fontsize=24, fontweight='bold')
        plt.ylabel("Predicted Values", fontsize=24, fontweight='bold')
        plt.title(f"{dim.replace('_', ' ').title()}\n$R^2$ = {res['r2']:.3f}", fontsize=28, fontweight='bold')

        plt.xticks(fontsize=20, fontweight='bold')
        plt.yticks(fontsize=20, fontweight='bold')
        plt.legend(fontsize=18)
        plt.grid(True)
        plt.tight_layout()
        plt.savefig(f"{folder_name}/{dim}_plot.png")
        plt.close()

In [10]:
rf_all_results = {}
for dim, target in POVERTY_DIMENSIONS.items():
    if target in combined_data.columns:
        result = train_all_data_grid(combined_data, dim, target)
        if result:
            rf_all_results[dim] = result

print("\n=== In-Sample Validation ===")
validation_rf_all = validate_2022(combined_data, rf_all_results)

save_rf_results(validation_rf_all, rf_all_results, "rf_in_sample_grid")


=== In-Sample Validation ===

--- INCOME ---
R² = 0.896, MAE = 3.580

--- HEALTH ---
R² = 0.840, MAE = 4.041

--- EDUCATION ---
R² = 0.867, MAE = 1.384

--- SOCIAL_SECURITY ---
R² = 0.900, MAE = 3.697

--- HOUSING ---
R² = 0.952, MAE = 1.698

--- FOOD ---
R² = 0.907, MAE = 1.450


In [44]:
def train_2020_grid(data, dimension, target_col):
    train_data = data[data['year'] == 2020].copy()
    all_targets = list(POVERTY_DIMENSIONS.values())
    feature_cols = get_all_feature_columns(train_data, all_targets)

    X = train_data[feature_cols].values
    y = train_data[target_col].values
    mask = ~(np.isnan(X).any(axis=1) | np.isnan(y))
    X, y = X[mask], y[mask]

    param_grid = {
        'n_estimators': [90, 100, 110],
        'max_depth': [3, 4, 5],
        'min_samples_leaf': [1, 2, 3]}

    rf = RandomForestRegressor(random_state=42)
    search = GridSearchCV(
        rf, param_grid, scoring='neg_mean_squared_error', cv=LeaveOneOut(),
        n_jobs=-1, verbose=0)
    search.fit(X, y)

    return {
        'dimension': dimension,
        'model': search.best_estimator_,
        'feature_cols': feature_cols}

In [45]:
rf_2020_results = {}
for dim, target in POVERTY_DIMENSIONS.items():
    if target in combined_data.columns:
        result = train_2020_grid(combined_data, dim, target)
        if result:
            rf_2020_results[dim] = result

print("\n=== Out-Of-Sample Validation ===")
validation_rf_2020 = validate_2022(combined_data, rf_2020_results)

save_rf_results(validation_rf_2020, rf_2020_results, "rf_out_sample_grid")


=== Out-Of-Sample Validation ===

--- INCOME ---
R² = 0.198, MAE = 9.927

--- HEALTH ---
R² = -0.618, MAE = 11.936

--- EDUCATION ---
R² = 0.076, MAE = 3.951

--- SOCIAL_SECURITY ---
R² = 0.368, MAE = 9.457

--- HOUSING ---
R² = 0.573, MAE = 5.441

--- FOOD ---
R² = 0.139, MAE = 4.598
