<h1 style="text-align: center;" markdown="1">Machine Learning Algorithms for Poverty Prediction</h1> 
<h2 style="text-align: center;" markdown="2">A project of the World Bank's Knowledge for Change Program</h2>
<h3 style="text-align: center;" markdown="3">(KCP, Grant TF0A4534)</h3>


> *This notebook is part of a series that has been developed as an empirical comparative assessment of machine learning classification algorithms applied to poverty prediction. The objectives of this project are to explore how well machine learning algorithms perform when given the task to identify the poor in a given population, and to provide a resource of machine learning techniques for researchers, data scientists, and statisticians in developing countries.*

<h1 style="text-align: center;" markdown="3">Model Robustness</h1> 
<h2 style="text-align: center;" markdown="3">Indonesia Poverty Prediction</h2>

# Table of Contents
[Model Robustness Introduction](#introduction)    
[Select Models for Comparison](#define)   

[2012 Results](#2012)

[2011 Results](#2011) 

[2013 Results](#2013)

[2014 Results](#2014)

[Overall Results](#overall) 

[Summary](#summary)

# Model Robustness Introduction <a class="anchor" id="introduction"></a>

One of the best practical tests of a machine learning model is to measure its performance on an entirely new dataset. After all, this is the point of the models, pragmatically speaking. Throughout these notebooks, 2012 Indonesia expenditure survey data has been used to train and validate many models. In this notebook, similarly structured survey data from other years are used as alternative test sets.

Specifically, we use survey data from:

* 2011
* 2013
* 2014 

We'll use this data to test the top Indonesia models from the first notebooks (1-10) as well as the more advanced models in the recent notebooks (12-15). We'll test each year individually, and then look at a mean performance across all years considered.

First, load the libraries we'll need. The data has already been prepared in a previous notebook.

In [1]:
%matplotlib inline

import os
import sys
import json
from pathlib import Path

import numpy as np
import pandas as pd
from pandas.io.stata import StataReader

from matplotlib import pyplot as plt
from IPython.display import display
import seaborn as sns
sns.set()

from sklearn.model_selection import train_test_split

# Add our local functions to the path
sys.path.append(os.path.join(os.pardir, 'src'))
from data.load_data import get_country_filepaths, load_data
from data.sampler import Sampler
from features import process_features
from models.evaluation import load_model, evaluate_model, load_model_metrics, MODELS_DIR
from models.ensemble.simple_ensemble import simple_ensemble_preds
from visualization import visualize



**NOTE: Only Make Predictions If Necessary** If the `PREDICTIONS_ALREADY_SAVED` flag is `False`, models will be loaded and new predictions will be saved. Otherwise, if it is `True`, we will skip prediction and go straight to loading and comparing results.

In [2]:
PREDICTIONS_ALREADY_SAVED = True

# Select Models for Comparison <a class="anchor" id="define"></a>

Load the Indonesia calibrated and advanced models. The DeepFM predictions were calculated in another notebook, so we can simply load those results here regardless of whether the other predictions need to be made. We therefore exclude DeepFM from the model lists used for prediction and include it only when loading results.

**NOTE** that we only consider the top 10 from the Indonesia calibrated models.

In [3]:
# Load phase 1 models from a stored csv ranking the top models.
# NOTE that we only consider the top 10 from this list
top_N = 10
calibrated_phase_1_model_names = list(pd.read_csv('../data/processed/idn/idn-master-results.csv').name[:top_N].values)

calibrated_phase_1_models = {model_name: load_model(model_name, 'idn')['model'] 
                             for model_name in calibrated_phase_1_model_names 
                             if Path(MODELS_DIR, 'idn', model_name + '.pkl').exists() 
                             and 'feats' not in model_name
                             and 'dl' not in model_name 
                             and 'deepfm' not in model_name} 

len(calibrated_phase_1_models)

10

Add the advanced models

In [4]:
# ensemble_simple and deepfm handled separately below
advanced_model_names = ['ensemble_stacked', 'automl_tpot']
advanced_models = {model_name: load_model(model_name, 'idn')['model'] 
                   for model_name in advanced_model_names 
                   if Path(MODELS_DIR, 'idn', model_name + '.pkl').exists() 
                   and 'feats' not in model_name
                   and 'dl' not in model_name} 

Add the DeepFM models

In [5]:
deep_model_names = ['deepfm_full_undersample_cv', 'deepfm_full_cv']

Construct the complete list of models to be compared

In [6]:
models_considered = (calibrated_phase_1_model_names + 
                     advanced_model_names + 
                     deep_model_names + 
                     ['ensemble_simple'])

# 2012 Results <a class="anchor" id="2012"></a>

Since we ultimately want to compare predictions from all years considered, we'll load the results from the original 2012 data first. These results were generated using only the test data from 2012.

Recall, all models were _trained_ on this data. **All** new data from 2011, 2013, 2014 will be used _only_ as a test set. So it will be interesting to see how the mean rank of models changes.

In [7]:
country = 'idn'
metrics = [load_model_metrics(model_name, country) 
           for model_name in models_considered
           if Path(MODELS_DIR, country, model_name + '.pkl').exists() 
           and 'feats' not in model_name
           and 'dl' not in model_name]
results_2012 = visualize.display_model_comparison(metrics, 
                                             show_roc=False, 
                                             show_cm=False, 
                                             show_pov_rate_error=False, 
                                             highlight_best=True, 
                                             transpose=True, 
                                             rank_order=True)

model,accuracy,recall,precision,f1,cross_entropy,roc_auc,cohen_kappa,mean_rank
deepfm_full_cv,0.932415,0.525986,0.65625,0.583942,0.161349,0.943099,0.547647,3.0
mlp_full_undersample_cv,0.857694,0.854518,0.44028,0.581136,0.308514,0.932805,0.481756,4.57143
lr_full_oversample,0.854459,0.843033,0.433238,0.572346,0.344135,0.926334,0.473514,6.14286
ensemble_simple,0.845008,0.873189,0.418192,0.565535,0.34436,0.93394,0.464792,6.57143
rf_full_classwts,0.894672,0.550596,0.543572,0.547061,0.285334,0.904138,0.482144,6.71429
svm_full_oversample,0.867253,0.755508,0.455099,0.568031,0.305525,0.909297,0.454544,7.0
lr_full_oversample_cv,0.851611,0.838116,0.427456,0.566159,0.348544,0.925356,0.471316,7.71429
automl_tpot,0.896794,0.555481,0.592262,0.573282,0.625172,0.750474,0.472997,7.85714
xgb_full_undersample_cv,0.835837,0.87632,0.403156,0.552247,0.366854,0.928831,0.453724,8.14286
ensemble_stacked,0.827208,0.886484,0.390749,0.542411,0.352263,0.929823,0.44074,9.14286
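
The metric columns in the table above are standard binary-classification scores. The project's `evaluate_model` function is not shown in this notebook, so the following is only a minimal sketch, assuming it computes something equivalent with scikit-learn. The helper name `classification_metrics`, the 0.5 threshold, and the toy labels and probabilities are illustrative assumptions, not project code or survey data.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, log_loss, roc_auc_score,
                             cohen_kappa_score)


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Hypothetical helper: compute the seven scores reported in the tables."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    # Hard class labels come from thresholding the predicted probability
    y_pred = (y_prob >= threshold).astype(int)
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
        # Cross entropy and ROC AUC use the probabilities, not the labels
        'cross_entropy': log_loss(y_true, y_prob),
        'roc_auc': roc_auc_score(y_true, y_prob),
        'cohen_kappa': cohen_kappa_score(y_true, y_pred),
    }


# Toy example: 1 = poor, 0 = non-poor
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.7, 0.8, 0.3, 0.2, 0.9, 0.6]
toy_metrics = classification_metrics(y_true, y_prob)
print(toy_metrics)
```

Note that accuracy, recall, precision, F1, and Cohen's kappa depend on the chosen threshold, while cross entropy and ROC AUC depend only on the predicted probabilities.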


# 2011 Results <a class="anchor" id="2011"></a>

Now we'll start to look at some new years, beginning with the 2011 survey.

In [8]:
country = 'idn-2011'
if not PREDICTIONS_ALREADY_SAVED:
    TRAIN_PATH, TEST_PATH, QUESTIONS_PATH = get_country_filepaths(country)
    X_test, y_test, w_test = load_data(TEST_PATH)

    # phase 1 models 
    for model_name, model in calibrated_phase_1_models.items():
        if 'xgb' in model_name:
            # DataFrame.as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
            y_pred = model.predict(X_test.values)
            y_prob = model.predict_proba(X_test.values)[:, 1]
        else:
            y_pred = model.predict(X_test)
            y_prob = model.predict_proba(X_test)[:, 1]        

        evaluate_model(y_test=y_test,
                       y_pred=y_pred,
                       y_prob=y_prob,
                       store_model=True,
                       model_name=model_name, 
                       country=country,
                       predict_pov_rate=False,
                       show=False)

    # advanced models
    for model_name, model in advanced_models.items():
        y_pred = model.predict(X_test)
        y_prob = model.predict_proba(X_test)[:, 1]        

        evaluate_model(y_test=y_test,
                       y_pred=y_pred,
                       y_prob=y_prob,
                       store_model=True,
                       model_name=model_name, 
                       country=country,
                       predict_pov_rate=False,
                       show=False)
        
    # simple ensemble
    y_prob = simple_ensemble_preds(X_test)   
    y_pred = y_prob.round()
  

    evaluate_model(y_test=y_test,
                   y_pred=y_pred,
                   y_prob=y_prob,
                   store_model=True,
                   model_name='ensemble_simple', 
                   country=country,
                   predict_pov_rate=False,
                   show=False)


Once the predictions have been made and saved (if necessary), we load the results into a metrics DataFrame.

In [9]:
metrics = [load_model_metrics(model_name, country) 
           for model_name in models_considered
           if Path(MODELS_DIR, country, model_name + '.pkl').exists() 
           and 'feats' not in model_name
           and 'dl' not in model_name]
results_2011 = visualize.display_model_comparison(metrics, 
                                             show_roc=False, 
                                             show_cm=False, 
                                             show_pov_rate_error=False, 
                                             highlight_best=True, 
                                             transpose=True, 
                                             rank_order=True)

model,accuracy,recall,precision,f1,cross_entropy,roc_auc,cohen_kappa,mean_rank
deepfm_full_cv,0.901866,0.550262,0.560966,0.555563,0.258528,0.909481,0.500412,1.14286
deepfm_full_undersample_cv,0.894275,0.551509,0.524493,0.537662,0.280025,0.908701,0.478017,1.85714


We can already see that the top models have changed order. The ensemble techniques have fallen in mean rank, and the DeepFM models have risen. Their mean rank values (far-right column) are well ahead of the third-place model, which is also a neural network model.

# 2013 Results <a class="anchor" id="2013"></a>

Moving on to another year, let's look at the 2013 results.

In [10]:
country = 'idn-2013'
if not PREDICTIONS_ALREADY_SAVED:
    TRAIN_PATH, TEST_PATH, QUESTIONS_PATH = get_country_filepaths(country)
    X_test, y_test, w_test = load_data(TEST_PATH)

    # phase 1 models
    for model_name, model in calibrated_phase_1_models.items():  
        if 'xgb' in model_name:
            # DataFrame.as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
            y_pred = model.predict(X_test.values)
            y_prob = model.predict_proba(X_test.values)[:, 1]
        else:
            y_pred = model.predict(X_test)
            y_prob = model.predict_proba(X_test)[:, 1]        

        evaluate_model(y_test=y_test,
                       y_pred=y_pred,
                       y_prob=y_prob,
                       store_model=True,
                       model_name=model_name, 
                       country=country,
                       predict_pov_rate=False,
                       show=False)
    
    # advanced models
    for model_name, model in advanced_models.items():
        y_pred = model.predict(X_test)
        y_prob = model.predict_proba(X_test)[:, 1]        

        evaluate_model(y_test=y_test,
                       y_pred=y_pred,
                       y_prob=y_prob,
                       store_model=True,
                       model_name=model_name, 
                       country=country,
                       predict_pov_rate=False,
                       show=False)
    
    # simple ensemble    
    y_prob = simple_ensemble_preds(X_test)   
    y_pred = y_prob.round()
  

    evaluate_model(y_test=y_test,
                   y_pred=y_pred,
                   y_prob=y_prob,
                   store_model=True,
                   model_name='ensemble_simple', 
                   country=country,
                   predict_pov_rate=False,
                   show=False)


Once the predictions have been made and saved (if necessary), we load the results into a metrics DataFrame.

In [11]:
metrics = [load_model_metrics(model_name, country) 
           for model_name in models_considered
           if Path(MODELS_DIR, country, model_name + '.pkl').exists() 
           and 'feats' not in model_name
           and 'dl' not in model_name]
results_2013 = visualize.display_model_comparison(metrics, 
                                             show_roc=False, 
                                             show_cm=False, 
                                             show_pov_rate_error=False, 
                                             highlight_best=True, 
                                             transpose=True, 
                                             rank_order=True)

model,accuracy,recall,precision,f1,cross_entropy,roc_auc,cohen_kappa,mean_rank
deepfm_full_cv,0.90236,0.633535,0.413204,0.500181,0.236145,0.910612,0.44872,1.42857
deepfm_full_undersample_cv,0.922165,0.399048,0.494219,0.441564,0.200481,0.907161,0.400241,1.57143


The DeepFM and neural network models continue to outperform the others. This time the two DeepFM models are nearly tied in mean rank (1.43 versus 1.57), with the undersampled variant a close second.

# 2014 Results <a class="anchor" id="2014"></a>

Finally, we look at results from one more year, 2014.

In [12]:
country = 'idn-2014'
if not PREDICTIONS_ALREADY_SAVED:
    TRAIN_PATH, TEST_PATH, QUESTIONS_PATH = get_country_filepaths(country)
    X_test, y_test, w_test = load_data(TEST_PATH)

    # phase 1 models
    for model_name, model in calibrated_phase_1_models.items():
        if 'xgb' in model_name:
            # DataFrame.as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
            y_pred = model.predict(X_test.values)
            y_prob = model.predict_proba(X_test.values)[:, 1]
        else:
            y_pred = model.predict(X_test)
            y_prob = model.predict_proba(X_test)[:, 1]        

        evaluate_model(y_test=y_test,
                       y_pred=y_pred,
                       y_prob=y_prob,
                       store_model=True,
                       model_name=model_name, 
                       country=country,
                       predict_pov_rate=False,
                       show=False)

    # advanced models
    for model_name, model in advanced_models.items():
        y_pred = model.predict(X_test)
        y_prob = model.predict_proba(X_test)[:, 1]        

        evaluate_model(y_test=y_test,
                       y_pred=y_pred,
                       y_prob=y_prob,
                       store_model=True,
                       model_name=model_name, 
                       country=country,
                       predict_pov_rate=False,
                       show=False)

    # simple ensemble
    y_prob = simple_ensemble_preds(X_test)   
    y_pred = y_prob.round()
  

    evaluate_model(y_test=y_test,
                   y_pred=y_pred,
                   y_prob=y_prob,
                   store_model=True,
                   model_name='ensemble_simple', 
                   country=country,
                   predict_pov_rate=False,
                   show=False)


Once the predictions have been made and saved (if necessary), we load the results into a metrics DataFrame.

In [13]:
metrics = [load_model_metrics(model_name, country) 
           for model_name in models_considered
           if Path(MODELS_DIR, country, model_name + '.pkl').exists() 
           and 'feats' not in model_name
           and 'dl' not in model_name]
results_2014 = visualize.display_model_comparison(metrics, 
                                             show_roc=False, 
                                             show_cm=False, 
                                             show_pov_rate_error=False, 
                                             highlight_best=True, 
                                             transpose=True, 
                                             rank_order=True)

model,accuracy,recall,precision,f1,cross_entropy,roc_auc,cohen_kappa,mean_rank
deepfm_full_undersample_cv,0.923899,0.528387,0.435947,0.477737,0.19737,0.914623,0.437102,1.28571
deepfm_full_cv,0.913996,0.59957,0.398456,0.47875,0.214769,0.911671,0.433948,1.71429


Again, the deep models come out on top! Let's look at the overall results.

# Overall Results <a class="anchor" id="overall"></a>


Here the mean rank is calculated on the _mean_ of each metric across years.
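
As an illustration of this averaging step, the sketch below stacks per-year results for a single model and averages them by model name. It uses the `deepfm_full_cv` accuracy and ROC AUC values reported in the per-year tables of this notebook; the variable names (`per_year`, `frames`, `stacked`, `records`) are illustrative and not part of the project code.

```python
import pandas as pd

# deepfm_full_cv values copied from the per-year tables in this notebook
per_year = {
    2011: {'accuracy': 0.901866, 'roc_auc': 0.909481},
    2012: {'accuracy': 0.932415, 'roc_auc': 0.943099},
    2013: {'accuracy': 0.902360, 'roc_auc': 0.910612},
    2014: {'accuracy': 0.913996, 'roc_auc': 0.911671},
}

# One single-row DataFrame per year, indexed by model name
frames = [pd.DataFrame(values, index=['deepfm_full_cv'])
          for values in per_year.values()]

# Stack the years, then average each metric over rows sharing a model name
stacked = pd.concat(frames)
records = (stacked.groupby(stacked.index).mean()
                  .reset_index()
                  .rename(columns={'index': 'name'})
                  .to_dict(orient='records'))
print(records)
```

The result is a list with one dictionary per model, each carrying a `name` key alongside the averaged metrics. The averages for `deepfm_full_cv` agree with the corresponding row of the overall table to the precision shown.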

In [14]:
overall = pd.concat((results_2011, results_2012, results_2013, results_2014)).drop('mean_rank', axis=1)
# orient='records' gives a list with one dict per model
overall = overall.groupby(overall.index).mean().reset_index().rename(columns={'index': 'name'}).to_dict(orient='records')
overall_results = visualize.display_model_comparison(overall, 
                                             show_roc=False, 
                                             show_cm=False, 
                                             show_pov_rate_error=False, 
                                             highlight_best=True, 
                                             transpose=True, 
                                             rank_order=True)

model,accuracy,recall,precision,f1,cross_entropy,roc_auc,cohen_kappa,mean_rank
mlp_full_undersample_cv,0.857694,0.854518,0.44028,0.581136,0.308514,0.932805,0.481756,4.42857
lr_full_oversample,0.854459,0.843033,0.433238,0.572346,0.344135,0.926334,0.473514,6.0
deepfm_full_cv,0.912659,0.577338,0.507219,0.529609,0.217698,0.918715,0.482682,6.14286
ensemble_simple,0.845008,0.873189,0.418192,0.565535,0.34436,0.93394,0.464792,6.42857
rf_full_classwts,0.894672,0.550596,0.543572,0.547061,0.285334,0.904138,0.482144,6.85714
svm_full_oversample,0.867253,0.755508,0.455099,0.568031,0.305525,0.909297,0.454544,7.0
lr_full_oversample_cv,0.851611,0.838116,0.427456,0.566159,0.348544,0.925356,0.471316,7.57143
automl_tpot,0.896794,0.555481,0.592262,0.573282,0.625172,0.750474,0.472997,7.71429
xgb_full_undersample_cv,0.835837,0.87632,0.403156,0.552247,0.366854,0.928831,0.453724,7.85714
deepfm_full_undersample_cv,0.895194,0.598603,0.451674,0.491371,0.261289,0.917504,0.43757,8.71429


It appears that when the test set is expanded to include the additional survey years, the neural network models are the clear winners, beating even the simple ensemble model that performed so well on the original data.

## Advanced

Let's do the same ranking using only the advanced models. This should effectively subset the table above without changing the relative ranking, which is an easy check that our calculations are consistent.

In [15]:
advanced = ['deepfm_full_cv', 'deepfm_full_undersample_cv', 'automl_tpot', 'ensemble_simple', 'ensemble_stacked']
advanced_results = visualize.display_model_comparison([metric for metric in overall if metric['name'] in advanced], 
                                             show_roc=False, 
                                             show_cm=False, 
                                             show_pov_rate_error=False, 
                                             highlight_best=True, 
                                             transpose=True, 
                                             rank_order=True)

model,accuracy,recall,precision,f1,cross_entropy,roc_auc,cohen_kappa,mean_rank
deepfm_full_cv,0.912659,0.577338,0.507219,0.529609,0.217698,0.918715,0.482682,2.28571
ensemble_simple,0.845008,0.873189,0.418192,0.565535,0.34436,0.93394,0.464792,2.71429
automl_tpot,0.896794,0.555481,0.592262,0.573282,0.625172,0.750474,0.472997,3.0
ensemble_stacked,0.827208,0.886484,0.390749,0.542411,0.352263,0.929823,0.44074,3.42857
deepfm_full_undersample_cv,0.895194,0.598603,0.451674,0.491371,0.261289,0.917504,0.43757,3.57143
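
The internals of `visualize.display_model_comparison` are not shown in this notebook, so the ranking rule sketched here is an assumption: each model is ranked on every metric separately (higher is better, except for cross entropy, where lower is better), and `mean_rank` is the average of those seven ranks. Applied to the five advanced models, this rule reproduces the `mean_rank` column of the table above (2.28571, 2.71429, 3.0, 3.42857, 3.57143).

```python
import pandas as pd

# Metric values copied from the advanced-models table above
metrics = pd.DataFrame(
    {
        'accuracy':      [0.912659, 0.845008, 0.896794, 0.827208, 0.895194],
        'recall':        [0.577338, 0.873189, 0.555481, 0.886484, 0.598603],
        'precision':     [0.507219, 0.418192, 0.592262, 0.390749, 0.451674],
        'f1':            [0.529609, 0.565535, 0.573282, 0.542411, 0.491371],
        'cross_entropy': [0.217698, 0.344360, 0.625172, 0.352263, 0.261289],
        'roc_auc':       [0.918715, 0.933940, 0.750474, 0.929823, 0.917504],
        'cohen_kappa':   [0.482682, 0.464792, 0.472997, 0.440740, 0.437570],
    },
    index=['deepfm_full_cv', 'ensemble_simple', 'automl_tpot',
           'ensemble_stacked', 'deepfm_full_undersample_cv'],
)

# Rank 1 goes to the highest value for every metric...
ranks = metrics.rank(ascending=False)
# ...except cross entropy, where the lowest value is best
ranks['cross_entropy'] = metrics['cross_entropy'].rank(ascending=True)

# Average the seven per-metric ranks and sort from best to worst
mean_rank = ranks.mean(axis=1).sort_values()
print(mean_rank.round(5))
```

Because the ranks are relative to the models being compared, the absolute `mean_rank` values change when the set of models changes, even though the underlying metrics do not.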


## Non-Advanced

Here are the results for the simpler models.

In [16]:
nonadvanced_results = visualize.display_model_comparison([metric for metric in overall if metric['name'] not in advanced], 
                                             show_roc=False, 
                                             show_cm=False, 
                                             show_pov_rate_error=False, 
                                             highlight_best=True, 
                                             transpose=True, 
                                             rank_order=True)

model,accuracy,recall,precision,f1,cross_entropy,roc_auc,cohen_kappa,mean_rank
mlp_full_undersample_cv,0.857694,0.854518,0.44028,0.581136,0.308514,0.932805,0.481756,2.71429
lr_full_oversample,0.854459,0.843033,0.433238,0.572346,0.344135,0.926334,0.473514,4.0
rf_full_classwts,0.894672,0.550596,0.543572,0.547061,0.285334,0.904138,0.482144,4.28571
svm_full_oversample,0.867253,0.755508,0.455099,0.568031,0.305525,0.909297,0.454544,4.57143
xgb_full_undersample_cv,0.835837,0.87632,0.403156,0.552247,0.366854,0.928831,0.453724,5.0
lr_full_oversample_cv,0.851611,0.838116,0.427456,0.566159,0.348544,0.925356,0.471316,5.28571
lr_full_undersample,0.830133,0.883593,0.394889,0.545836,0.392892,0.926233,0.44493,6.42857
lr_full_classwts,0.830513,0.866754,0.39387,0.541618,0.384733,0.923196,0.428159,7.14286
rf_full_undersample_cv_ada,0.818561,0.904739,0.380136,0.535342,0.530756,0.928685,0.419727,7.14286
lda_full_oversample,0.815948,0.8906,0.375087,0.52786,0.424324,0.921219,0.408593,8.42857


# Summary <a class="anchor" id="summary"></a>

When compared over all years, it is clear that the neural network models are the most robust. The deep learning models in particular consistently rank in the top two, except in 2012, the year used to train the models. This suggests that while the deep learning models may tend to overfit their training data, they better capture the overall structure of the data distribution over many years.

Two of the ensemble-based models, the simple ensemble average and the under-sampled XGBoost model, also consistently outrank many other top models.

It is somewhat surprising that the simple ensemble average, which consists only of the average probability over all of the top 10 models reported in, outperforms the stacked ensemble. The simple ensemble average has more heterogeneity in its base learners than any of the other models, including the stacked ensembles. Although in general stacked ensemble models are thought to be much more robust than simple ensemble averages, for these particular data the simple ensemble always outperforms the stacked ensemble according to the mean metric, likely because of the variety of methods it incorporates.
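
To make the idea of a simple ensemble average concrete, here is a minimal sketch of probability averaging. The project's `simple_ensemble_preds` function is not shown in this notebook, so the helper `average_ensemble_proba` and the three toy probability vectors are illustrative assumptions rather than the actual implementation.

```python
import numpy as np


def average_ensemble_proba(prob_columns):
    """Hypothetical helper: average positive-class probabilities across models."""
    # Each element of prob_columns holds one model's probabilities, one per sample
    probs = np.column_stack(prob_columns)
    return probs.mean(axis=1)


# Toy probabilities from three base models for three households
p_a = np.array([0.9, 0.2, 0.6])
p_b = np.array([0.7, 0.4, 0.3])
p_c = np.array([0.8, 0.1, 0.5])

y_prob = average_ensemble_proba([p_a, p_b, p_c])
# Round the averaged probability to obtain a hard label, as the notebook does
y_pred = y_prob.round()
print(y_prob, y_pred)
```

Every base model contributes equally to the average, so no additional training step is needed, unlike a stacked ensemble that fits a second-level model on the base predictions.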

The models based on linear discriminant analysis, while also simple, are not robust across years. They trail far behind the other models in every one of the metrics considered in this report.