# Random Forest Regression for Multidimensional Poverty Estimation

This notebook implements a full modeling pipeline for nowcasting multidimensional poverty indicators using Random Forest Regression. The goal is to predict CONEVAL’s official poverty statistics for each Mexican state, using text-derived indicators from various sources (YouTube, Telegram, Google Trends, and News).

## Data Integration

Six sources of text-based indicators are integrated across two years (2020 and 2022):

- **Google Trends**: Search volume patterns for poverty-related keywords
- **YouTube Comments**: Sentiment score and share of comments about each dimension of poverty 
- **Telegram Posts**: Share of posts about each dimension of poverty 
- **News Outlets**: LDA topic modeling of media coverage
- **Official Statistics**: CONEVAL poverty measurements (ground truth)

**Geographic Scope**: All 32 Mexican states.  
**Temporal Scope**: Two reference years — 2020 (training) and 2022 (validation).


## Methodology: Random Forest Regression

Random Forest is particularly well suited for this task because:

- It captures **nonlinear relationships** and **interactions** between predictors.
- It is robust to **multicollinearity** and **feature noise**, which is crucial given the diversity of data sources.
- It performs **implicit feature selection** through node-splitting and feature sampling.
- It handles **small datasets with high-dimensional input** well (p > n scenarios), making it appropriate for this 32-observation, 37-feature setup.


## Feature Engineering and Model Training

- All numeric indicators (except the targets, `state` and `year`) are used as features.
- No feature scaling is applied, as Random Forest is scale-invariant.
- A separate model is trained for each poverty dimension.

We implemented two training-validation strategies:

1. **In-Sample Setting**: Model trained on both 2020 and 2022 data, validated on 2022.
2. **Out-of-Sample Setting**: Model trained only on 2020, validated on 2022.

## Hyperparameter Selection

### Fixed Parameters (Final Configuration)

After extensive testing, the following hyperparameters were selected and fixed across all models:

- `n_estimators = 100`: A good balance between variance reduction and computational cost.
- `max_depth = 4`: Limits overfitting while allowing enough complexity.
- `min_samples_leaf = 2`: Prevents overly specific splits, which are risky with small datasets.

This configuration was initially chosen based on its strong and consistent performance in predicting 2022 values. It showed better generalization and robustness than any parameter combination obtained through automated tuning.


### Why We Avoided Full Hyperparameter Tuning

We experimented with both **GridSearchCV** and **RandomizedSearchCV** using **Leave-One-Out Cross-Validation (LOOCV)**. However, the results obtained through automatic tuning were systematically **worse than the fixed configuration**. This outcome can be explained by several factors:

- **High variance of LOOCV**: Leave-One-Out Cross-Validation evaluates the model on a single observation at a time. Although this method maximizes the training data per fold, it also makes the validation highly sensitive to individual outliers or noisy samples. A single unusual observation can disproportionately influence the model selection process, leading to unstable or misleading validation results.

- **Instability of the optimization process**: With only 32 observations available, the metric used for tuning (such as cross-validated R² or MSE) becomes highly unstable. Small variations in the data or the model's structure can lead to significant fluctuations in performance estimates. As a result, the search process may favor hyperparameters that appear optimal due to noise rather than true predictive power, resulting in poor generalization on unseen data.

- **Underfitting caused by conservative tuning**: The tuning procedure tends to prefer simpler models to avoid overfitting in cross-validation. In practice, this led to the selection of configurations with overly shallow trees or high minimum leaf sizes. These choices limited the model’s ability to capture relevant structure in the data, resulting in underfitting and worse predictive accuracy on the 2022 validation set. Given the relative stability of poverty indicators over time, more expressive models are required to leverage the available signal effectively.


Thus, **manual tuning based on validation performance on 2022**, combined with domain knowledge and empirical results, provided a more stable and generalizable solution.


## Model Evaluation

Each model was evaluated on 2022 using:

- **R-squared (R²)**: Measures explained variance.
- **Mean Absolute Error (MAE)**: Captures average prediction error.

Results are stored in:
- `results.csv`: contains actual vs predicted values for each state and dimension.
- `metrics.csv`: includes model performance and the list of important features (with importances).
- Scatter plots: for each poverty dimension, we visualize predicted vs actual values and overlay the perfect prediction line.


## Generalizability to Future Years

The chosen parameter configuration (`n_estimators=100`, `max_depth=4`, `min_samples_leaf=2`) is simple, stable, and empirically effective. It is robust across poverty dimensions and expected to generalize well in repeated analyses across different years because:

- It is **not tailored to 2022** but performs well on it.
- It avoids overfitting through conservative tree growth.

In [1]:
# load necessary libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# load the data
tg_2020 = pd.read_csv('clean_data/tg_2020.csv')
tg_2022 = pd.read_csv('clean_data/tg_2022.csv')
gt_2020 = pd.read_csv('clean_data/gt_2020.csv')
gt_2022 = pd.read_csv('clean_data/gt_2022.csv')
yt_2020 = pd.read_csv('clean_data/yt_2020.csv')
yt_2022 = pd.read_csv('clean_data/yt_2022.csv')
news_2020 = pd.read_csv('clean_data/news_2020.csv')
news_2022 = pd.read_csv('clean_data/news_2022.csv')
off_2020 = pd.read_csv('clean_data/off_2020.csv')
off_2022 = pd.read_csv('clean_data/off_2022.csv')

In [3]:
# there were inconsistencies in the state names, so this mapping standardizes the state names across all datasets
state_name_map = {
    "México": "Estado de México",
    "Mexico": "Estado de México",
    "Estados Unidos Mexicanos": "Estado de México",
    "Michoacán de Ocampo": "Michoacán",
    "Veracruz de Ignacio de la Llave": "Veracruz",
    "Coahuila de Zaragoza": "Coahuila",
    "Yucatan": "Yucatán",
    "Queretaro": "Querétaro",
    "San Luis Potosi": "San Luis Potosí",
    "Nuevo Leon": "Nuevo León",
    "Michoacan": "Michoacán",
    "Michoacán de Ocampo": "Michoacán"}

# aplply the mapping
for df in [off_2020, off_2022, gt_2020, gt_2022, yt_2020, yt_2022, tg_2020, tg_2022, news_2020, news_2022]:
    df['state'] = df['state'].astype(str).str.strip()
    df['state'] = df['state'].replace(state_name_map)
    df['state'] = df['state'].replace("nan", None)  

# 'state' columns as strings in all dataframes
for df in [off_2020, off_2022, gt_2020, gt_2022, yt_2020, yt_2022, tg_2020, tg_2022, news_2020, news_2022]:
    df['state'] = df['state'].astype(str)

In [4]:
# create dataset for 2020
data_2020 = off_2020.copy()
data_2020['year'] = 2020

# merge Google Trends
data_2020 = data_2020.merge(gt_2020, on='state', how='inner')

# merge YouTube
data_2020 = data_2020.merge(yt_2020, on='state', how='inner')

# merge Telegram
data_2020 = data_2020.merge(tg_2020, on='state', how='inner')

# merge News (=LDA topics)
data_2020 = data_2020.merge(news_2020, on='state', how='inner')


# create dataset for 2022
data_2022 = off_2022.copy()
data_2022['year'] = 2022

# merge Google Trends
data_2022 = data_2022.merge(gt_2022, on='state', how='inner')

# merge YouTube
data_2022 = data_2022.merge(yt_2022, on='state', how='inner')

# merge Telegram
data_2022 = data_2022.merge(tg_2022, on='state', how='inner')

# merge News (=LDA topics)
data_2022 = data_2022.merge(news_2022, on='state', how='inner')

# combine datasets for 2020 and 2022
combined_data = pd.concat([data_2020, data_2022], ignore_index=True)

# check that we have all the states in both years (32 states * 2 years = 64 observations)
print(f"Number of observations: {len(combined_data)}")

Number of observations: 64


In [5]:
# map poverty dimensions to target columns
POVERTY_DIMENSIONS = {
    'income': 'income_target',
    'health': 'health_target',
    'education': 'educ_target',
    'social_security': 'social_target',
    'housing': 'housing_target',
    'food': 'food_target'}

In [6]:
# get features to use 
def get_all_feature_columns(data, target_columns, exclude_cols=['state', 'year']):
    return [col for col in data.columns if col not in target_columns + exclude_cols and data[col].dtype in [np.float64, np.int64]]

In [7]:
# function to train on all data (2020+2022) 
def train_all_data(data, dimension, target_col):
    all_targets = list(POVERTY_DIMENSIONS.values())
    feature_cols = get_all_feature_columns(data, all_targets)

    X = data[feature_cols].values
    y = data[target_col].values

    mask = ~(np.isnan(X).any(axis=1) | np.isnan(y))
    X, y = X[mask], y[mask]

    rf = RandomForestRegressor(n_estimators=100, max_depth=4, min_samples_leaf=2, random_state=42)
    rf.fit(X, y)

    return {
        'dimension': dimension,
        'model': rf,
        'feature_cols': feature_cols}

In [8]:
# function to validate on 2022
def validate_2022(data, rf_results):
    data_2022 = data[data['year'] == 2022].copy()
    results = {}

    for dim, res in rf_results.items():
        print(f"\n--- {dim.upper()} ---")
        model = res['model']
        features = res['feature_cols']
        target = POVERTY_DIMENSIONS.get(dim)

        if target not in data_2022.columns:
            continue

        X_test = data_2022[features].values
        y_test = data_2022[target].values

        mask = ~(np.isnan(X_test).any(axis=1) | np.isnan(y_test))
        X_test, y_test = X_test[mask], y_test[mask]
        states = data_2022['state'].values[mask]

        if len(X_test) == 0:
            continue

        y_pred = model.predict(X_test)
        r2 = r2_score(y_test, y_pred)
        mae = mean_absolute_error(y_test, y_pred)

        print(f"R² = {r2:.3f}, MAE = {mae:.3f}")

        results[dim] = {
            'states': states,
            'y_true': y_test,
            'y_pred': y_pred,
            'r2': r2,
            'mae': mae}

    return results

In [9]:
# save all results 
def save_rf_results(results_dict, models_dict, folder_name):
    os.makedirs(folder_name, exist_ok=True)

    # 1. save predictions (results.csv)
    preds = {'state': next(iter(results_dict.values()))['states']}
    for dim, res in results_dict.items():
        preds[f'{dim}_actual'] = res['y_true']
        preds[f'{dim}_predicted'] = res['y_pred']
    pd.DataFrame(preds).to_csv(f"{folder_name}/results.csv", index=False)

    # 2. save metrics (metrics.csv)
    rows = []
    for dim, res in results_dict.items():
        model = models_dict[dim]['model']
        features_used = models_dict[dim]['feature_cols']
        importances = model.feature_importances_

        important_features = [
            (f, round(imp, 4)) for f, imp in zip(features_used, importances) if imp > 0]

        num_used = len(important_features)

        rows.append({
            'dimension': dim,
            'r2': res['r2'],
            'mae': res['mae'],
            'n_features_used': num_used,
            'important_features': sorted(important_features, key=lambda x: -x[1])})

    pd.DataFrame(rows).to_csv(f"{folder_name}/metrics.csv", index=False)

    # 3. scatter plot for each dimension
    for dim, res in results_dict.items():
        y_true = res['y_true']
        y_pred = res['y_pred']

        plt.figure(figsize=(10, 10))
        plt.scatter(
            y_true, y_pred,
            color='royalblue',
            edgecolor='black',
            s=250,  
            alpha=0.9)

        min_val = min(min(y_true), min(y_pred))
        max_val = max(max(y_true), max(y_pred))
        plt.plot(
            [min_val, max_val], [min_val, max_val],
            'r--', linewidth=3, label='Perfect prediction')

        plt.xlabel("CONEVAL's statistics", fontsize=24, fontweight='bold')
        plt.ylabel("Predicted Values", fontsize=24, fontweight='bold')
        plt.title(f"{dim.replace('_', ' ').title()}\n$R^2$ = {res['r2']:.3f}", fontsize=28, fontweight='bold')

        plt.xticks(fontsize=20, fontweight='bold')
        plt.yticks(fontsize=20, fontweight='bold')
        plt.legend(fontsize=18)
        plt.grid(True)
        plt.tight_layout()
        plt.savefig(f"{folder_name}/{dim}_plot.png")
        plt.close()

In [10]:
rf_all_results = {}
for dim, target in POVERTY_DIMENSIONS.items():
    if target in combined_data.columns:
        result = train_all_data(combined_data, dim, target)
        if result:
            rf_all_results[dim] = result

print("\n=== In-Sample Validation) ===")
validation_rf_all = validate_2022(combined_data, rf_all_results)

save_rf_results(validation_rf_all, rf_all_results, "rf_in_sample")


=== In-Sample Validation) ===

--- INCOME ---
R² = 0.873, MAE = 3.742

--- HEALTH ---
R² = 0.841, MAE = 4.024

--- EDUCATION ---
R² = 0.888, MAE = 1.283

--- SOCIAL_SECURITY ---
R² = 0.890, MAE = 3.870

--- HOUSING ---
R² = 0.952, MAE = 1.701

--- FOOD ---
R² = 0.866, MAE = 1.666


In [20]:
# function to train on 2020 only
def train_2020(data, dimension, target_col):
    train_data = data[data['year'] == 2020].copy()
    all_targets = list(POVERTY_DIMENSIONS.values())
    feature_cols = get_all_feature_columns(train_data, all_targets)

    X_train = train_data[feature_cols].values
    y_train = train_data[target_col].values

    mask = ~(np.isnan(X_train).any(axis=1) | np.isnan(y_train))
    X_train, y_train = X_train[mask], y_train[mask]

    rf = RandomForestRegressor(n_estimators=100, max_depth=4, min_samples_leaf=2, random_state=42)
    rf.fit(X_train, y_train)

    return {
        'dimension': dimension,
        'model': rf,
        'feature_cols': feature_cols}

In [21]:
rf_2020_results = {}
for dim, target in POVERTY_DIMENSIONS.items():
    if target in combined_data.columns:
        result = train_2020(combined_data, dim, target)
        if result:
            rf_2020_results[dim] = result

print("\n=== Out-Of-Sample Validation) ===")
validation_rf_2020 = validate_2022(combined_data, rf_2020_results)

save_rf_results(validation_rf_2020, rf_2020_results, "rf_out_sample")


=== Out-Of-Sample Validation) ===

--- INCOME ---
R² = 0.225, MAE = 9.838

--- HEALTH ---
R² = -0.602, MAE = 11.767

--- EDUCATION ---
R² = 0.116, MAE = 3.850

--- SOCIAL_SECURITY ---
R² = 0.360, MAE = 9.500

--- HOUSING ---
R² = 0.585, MAE = 5.410

--- FOOD ---
R² = 0.145, MAE = 4.558


# Construction of the *Social Cohesion* Index (2022) using PCA

Since we do not have a direct target variable to measure social cohesion, we apply an **unsupervised method** based on **Principal Component Analysis (PCA)** to extract a latent index that summarizes the information from a set of proxy variables.

## Data Selection and Preprocessing

We select all variables in the dataset whose names contain the word `"cohesion"`. These are the components we previously constructed from different textual data sources (YouTube, Telegram, News, Google Trends) to capture aspects of social cohesion. Each of these variables reflects a proxy derived from textual analysis, so these are not raw inputs but already-processed indicators explicitly built to inform the social cohesion dimension. 

To ensure comparability across variables and to satisfy PCA assumptions, we standardize the selected features using `StandardScaler`, which centers them around zero and rescales them to unit variance.

## PCA: Extracting the Latent Dimension

We perform a PCA and retain only the **first principal component (PC1)**. This component captures the direction of maximum variance in the standardized feature space. Since all selected features aim to reflect some aspect of social cohesion, PC1 can be interpreted as a **latent index** summarizing the shared signal across them.

We chose this approach because:

- By reducing dimensionality, we capture the dominant underlying pattern in a single score.
- Since we don't have a target for this dimension, our possibilities were limited but we still believe this approach allows to define weights in a robust and non-arbitrary way
- Reflects the empirical correlation structure of the data, which is appropriate in the absence of a predefined ground truth.

## Handle the Sign

One limitation of PCA is that the sign of the components is not uniquely identified — multiplying all loadings and scores by -1 yields an equivalent solution.

To ensure consistency in interpretation (i.e., **higher scores mean worse social cohesion**), we compute the **sum of the loadings** for the first component:
- If the sum is **negative**, it means that high raw scores are associated with **better** social cohesion. In this case, we **invert the sign** of the PCA scores so that higher values reflect **higher deprivation**.
- If the sum is **positive**, we keep the scores as they are.

This correction guarantees that a value of 100 consistently means "lowest social cohesion" — in line with the interpretation of other poverty indicators.

## Normalization to a 0–100 Scale

After sign correction, the scores are normalized to a 0–100 scale using `MinMaxScaler`. This is aligned with the conventions adopted by CONEVAL and our other dimensional estimates, which express deprivation as a **percentage of the population** affected.

The resulting index can thus be interpreted as a relative measure of social cohesion deprivation, comparable across Mexican states.

## Observed Anomalies: 5 States with Extreme Values

In the final normalized scores, we observe the presence of unrealistic values:
- One state receives a score of exactly **0**.
- Four states have scores **above 80**, with one state reaching exactly **100**.

While these results are not computationally incorrect, they may appear questionable — especially given that we do not observe such extreme values in any of the other dimensions. However, several technical factors can explain this behavior:

- **Outliers in the original features**: some states may exhibit near-zero values across multiple cohesion-related components, particularly in cases where textual data coverage was sparse or unbalanced. This can drive their PCA score toward the lower bound of the distribution.

- **High-leverage observations**: PCA is inherently sensitive to atypical combinations of feature values. A single state with an unusual profile — even if not extreme in any single component — can strongly influence the orientation of the principal component and receive a disproportionately high or low score.

- **Scaling effects**: the use of `MinMaxScaler` maps the minimum and maximum PCA scores to 0 and 100 by design. As a result, there will always be at least one state assigned **0** and one assigned **100**, regardless of how realistic those values are in substantive terms.

Ultimately, these extreme scores should be interpreted as **relative positions**: they indicate how a state ranks within the empirical distribution of social cohesion *as captured by the PCA*, rather than representing absolute levels of deprivation. A score of 100 does not imply that 100% of the population experiences cohesion poverty — it simply reflects that the state ranks at the very bottom in relative terms.

To avoid producing exact 0 and 100 values, one option would have been to restrict the output range during normalization, for example:

```python
scaler_pct = MinMaxScaler(feature_range=(5, 80))
````

However, we opted not to implement this adjustment, for two main reasons:

- The choice of an alternative range would have been arbitrary and thus methodologically debatable;
- Compressing the score range would have reduced the spread of the values, limiting the model's ability to differentiate between intermediate cases.

We therefore retained the full [0, 100] scale, acknowledging that the extremes reflect model mechanics as much as underlying variance in the data.

In [23]:
# select social cohesion features
social_features = [col for col in data_2022.columns if 'cohesion' in col.lower()]
X_social_2022 = data_2022[social_features].dropna()

# standardize the features
scaler = StandardScaler()
X_scaled_2022 = scaler.fit_transform(X_social_2022)

# PCA to extract the first component
pca = PCA(n_components=1)
social_cohesion_score_2022 = pca.fit_transform(X_scaled_2022)

# get and print loadings 
loadings = dict(zip(social_features, pca.components_[0]))
print("PCA loadings (PC1):")
for feature, weight in loadings.items():
    print(f"{feature}: {weight:.3f}")

# invert the sign of the PCA scores if necessary
# (this is to ensure that a higher score indicates worse cohesion)
# (if the sum of loadings is negative, invert the sign to have: 100 = worst cohesion)
if np.sum(pca.components_[0]) < 0:
    print(" inverting sign of PCA scores to have: 100 = worse cohesion")
    social_cohesion_score_2022 = -social_cohesion_score_2022

# normalize the scores to a 0-100 scale
scaler_pct = MinMaxScaler(feature_range=(0, 100))
social_cohesion_normalized = scaler_pct.fit_transform(social_cohesion_score_2022)

# final df 
cohesion_df = data_2022.loc[X_social_2022.index, ['state', 'year']].copy()
cohesion_df['social_cohesion_score'] = social_cohesion_normalized

# save 
os.makedirs("PCA", exist_ok=True)
cohesion_df.to_csv("PCA/score.csv", index=False)

cohesion_df.head(10)

PCA loadings (PC1):
cohesion_gt: -0.382
social_cohesion_avg_sentiment: -0.373
social_cohesion_pct_yt: 0.619
social_cohesion_pct_tg: 0.576


Unnamed: 0,state,year,social_cohesion_score
0,Aguascalientes,2022,34.618119
1,Baja California,2022,88.612437
2,Baja California Sur,2022,43.232871
3,Campeche,2022,16.569366
4,Coahuila,2022,41.210944
5,Colima,2022,31.376835
6,Chiapas,2022,92.302337
7,Chihuahua,2022,24.930064
8,Ciudad de México,2022,24.70138
9,Durango,2022,36.17975
